Thematic Analysis Tool

A Python-based tool for performing thematic analysis on collections of PDF documents. It supports both supervised (using key concepts) and unsupervised (clustering) analysis, leveraging NLP techniques and GPT models for insightful results.

This thematic analyzer can also be run directly in this google collab notebook. To do your thematic analysis, follow the instructions in the notebook. No additional setup is required.

Features

PDF Processing: Extracts text from PDFs, including handling scanned documents with OCR (optional).
Text Cleaning: Customizable cleaning options to remove noise, handle dates, numbers, and choose lemmatization methods.
Supervised Analysis: Identify themes based on provided key concepts using semantic similarity.
Unsupervised Analysis: Cluster sentences using KMeans, extract keywords, and generate theme names using GPT.
GPT Integration: Leverages GPT models for theme naming, summarization, and generating overall conclusions (optional, requires OpenAI API key).
Output: Generates CSV summaries of themes, including example sentences, associated documents, and GPT-generated summaries. Also produces visualizations of theme distributions.

Installation

Clone the repository:

git clone https://github.com/YOUR_USERNAME/thematic-analysis-tool.git
cd thematic-analysis-tool

Install dependencies:
```
pip install -r requirements.txt
```
Install Tesseract OCR (Optional):
- If you want to use OCR for scanned PDFs, install Tesseract OCR:
  - Windows: Download from the official website and add the installation directory to your system's PATH. You may also need to set the pytesseract.pytesseract.tesseract_cmd variable in your script to point to the tesseract.exe file.
  - macOS: brew install tesseract
  - Linux: sudo apt-get install tesseract-ocr (or equivalent for your distribution)
- Enable OCR in the script by setting the enable_ocr flag in the PDFProcessor class.
OpenAI API Key (Optional):
- If you want to use the GPT features (theme naming, summarization, conclusion generation), you'll need an OpenAI API key.
- Set the OPENAI_API_KEY environment variable:
  - Linux/macOS:
```
export OPENAI_API_KEY="your_api_key_here"
```
  - Windows:
```
setx OPENAI_API_KEY "your_api_key_here"
```

Usage

Prepare your PDFs: Place the PDF files you want to analyze in the pdfs directory (or change the pdf_directory variable in the script).
Configure the script:
- Open the Python script in a text editor.
- Modify the following variables at the beginning of the main() function according to your needs:
  - pdf_directory: Path to the directory containing your PDF files.
  - output_path: Path to the directory where you want to store the output.
  - openai_api_key: If you have an OpenAI API key and want to use GPT features, set this variable using os.environ.get("OPENAI_API_KEY") and follow the instruction above to set your API key as an evironment variable. Otherwise leave as None.
  - research_question: Enter your research question here.
  - key_concepts: If you want to perform supervised analysis, add your key concepts to this list. Otherwise, leave it empty for unsupervised analysis.
  - text_cleaning_config: Adjust the text cleaning parameters (keep_dates, keep_numbers, lemmatization_method) as needed.
Run the script:
```
python thematic_analysis.py
```

Output

The script will generate the following outputs in the specified output directory:

analysis_log.txt: A log file containing details of the analysis process.
supervised_thematic_summary_<suffix>.csv: Summary of themes identified in supervised mode (if key concepts were provided).
unsupervised_thematic_summary_<suffix>.csv: Summary of themes identified in unsupervised mode.
supervised_theme_distribution_<suffix>.png: Visualization of theme distribution in supervised mode.
unsupervised_theme_distribution_<suffix>.png: Visualization of theme distribution in unsupervised mode.

The CSV files contain the following columns:

Theme: The name of the theme.
Keywords: Top keywords associated with the theme.
GPT_Theme_Summary: A GPT-generated summary of the theme (if OpenAI API key is provided).
Example_Sentences: Example sentences belonging to the theme.
Documents: List of documents where the theme was found.
Count: The number of sentences associated with the theme.
Overall_Conclusion: A GPT-generated overall conclusion based on the analysis (if OpenAI API key is provided).

License

This project is licensed under the MIT License

Contributing

Contributions are welcome!

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
README.md		README.md
analyzer.py		analyzer.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Thematic Analysis Tool

Features

Installation

Usage

Output

License

Contributing

About

Releases

Packages

Languages

Aaron-Lutz/ThematicAnalysis

Folders and files

Latest commit

History

Repository files navigation

Thematic Analysis Tool

Features

Installation

Usage

Output

License

Contributing

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages