A Python application for processing PDF files and creating structured outputs. This tool is designed for batch processing of PDF documents with a focus on annotation extraction and content structuring, featuring detailed logging and an interactive command-line interface.
- 📄 Batch PDF processing
- 📑 Annotation extraction
- 📊 JSON output generation
- 📝 Markdown conversion for debugging
- 🔄 Progress tracking
- 📋 Detailed logging
- Clone this repository:
git clone https://github.com/tirandagan/AdvancedRAGingest.git
cd AdvancedRAGingest
- Install Poetry (if not already installed):
curl -sSL https://install.python-poetry.org | python3 -
- Install dependencies using Poetry:
poetry install
- Activate the Poetry shell:
poetry shell
The application uses the following directory structure:
project_root/
├── input/ # Place your PDF files here
├── output/
│ ├── json/ # Generated JSON files with PDF content
│ └── annotations/ # Extracted annotations
├── logs/ # Application logs
├── pyproject.toml # Poetry dependency management
└── config.yaml # Configuration file
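The layout above can be created up front with a few lines of Python. This is a minimal sketch, not the project's own code: the helper name `ensure_layout` is hypothetical, and only the directory names are taken from the tree above.

```python
from pathlib import Path

# Directory names taken from the tree above, relative to the project root.
LAYOUT = ("input", "output/json", "output/annotations", "logs")


def ensure_layout(project_root):
    """Create any missing directories and return all layout paths."""
    root = Path(project_root)
    paths = []
    for relative in LAYOUT:
        directory = root / relative
        # parents=True also creates output/; exist_ok=True makes reruns safe.
        directory.mkdir(parents=True, exist_ok=True)
        paths.append(directory)
    return paths
```

Calling `ensure_layout(".")` from the project root is safe to repeat, since existing directories are left untouched.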
- Place your PDF files in the input/ directory.
- Ensure you're in the Poetry shell:
poetry shell
- Run the application:
python 01_LoadPDFs.py
- Select from two available tasks:
  - Option 1: "Ingest PDFs and create JSON & Annotations"
    - Processes PDF files from the input directory
    - Extracts content and annotations
    - Generates JSON output files
  - Option 2: "Create Debugging Markdowns from partition JSONs"
    - Creates markdown files from previously processed JSON files
    - Useful for debugging and content verification
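For readers curious how a two-option menu like this can be wired up, here is a minimal sketch. It is not the code in 01_LoadPDFs.py: the function name `choose_task`, the prompt text, and the keys "1" and "2" are assumptions; only the task labels come from the list above.

```python
# Labels copied from the two options above; the keys "1" and "2" are assumed.
TASKS = {
    "1": "Ingest PDFs and create JSON & Annotations",
    "2": "Create Debugging Markdowns from partition JSONs",
}


def choose_task(read=input, write=print):
    """Prompt until the user enters a valid option, then return its key."""
    while True:
        for key, label in TASKS.items():
            write(f"{key}. {label}")
        choice = read("Select a task: ").strip()
        if choice in TASKS:
            return choice
        write(f"Invalid choice: {choice!r}")
```

Passing `read` and `write` as parameters keeps the menu testable without a real terminal.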
The processing generates two types of output files:
- JSON files (output/json/):
  - Structured content extracted from PDFs
  - Includes document metadata and text content
  - Organized in a format suitable for further processing
- Annotation files (output/annotations/):
  - Contains extracted PDF annotations
  - Includes highlights, comments, and other markup
  - Preserved in structured format for analysis
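A quick way to sanity-check the JSON output is to count the top-level items in each file. The sketch below assumes the files end in `.json` and live in output/json/; it makes no assumption about the schema inside them, and `summarize_json_outputs` is a hypothetical helper name.

```python
import json
from pathlib import Path


def summarize_json_outputs(json_dir="output/json"):
    """Map each JSON file name to the number of top-level items it holds."""
    summary = {}
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Lists and dicts have a length; any other JSON value counts as one item.
        summary[path.name] = len(data) if isinstance(data, (list, dict)) else 1
    return summary
```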
The application generates detailed logs in pdf_converter.log:
- Processing status and progress
- Warning and error messages
- Operation timestamps
The log levels of the following third-party loggers are set explicitly:
- http.client (ERROR level)
- httpx (ERROR level)
- unstructured (ERROR level)
- unstructured_ingest (ERROR level)
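A logging setup along these lines can be written with the standard `logging` module. This is a sketch under stated assumptions: the logger names and the log file name come from the text above, while the function name `configure_logging`, the INFO root level, and the message format are illustrative choices, not the application's confirmed settings.

```python
import logging

# Logger names listed above; each is limited to ERROR and higher.
THIRD_PARTY_LOGGERS = ("http.client", "httpx", "unstructured", "unstructured_ingest")


def configure_logging(log_file="pdf_converter.log"):
    """Send timestamped records to log_file and quiet third-party loggers."""
    # basicConfig is a no-op if the root logger already has handlers.
    logging.basicConfig(
        filename=log_file,
        level=logging.INFO,  # assumed default level
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    for name in THIRD_PARTY_LOGGERS:
        logging.getLogger(name).setLevel(logging.ERROR)
```

Setting the level on the named loggers, rather than on the root logger, keeps the application's own INFO messages in the log while dropping lower-severity output from the libraries.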
The application uses a configuration system that can be customized through config.yaml. Configuration is loaded at startup and includes:
- Directory paths
- Processing options
- Logging settings
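One common way to load such a file is to merge it over built-in defaults, so that a partial config.yaml still yields a complete configuration. The sketch below is illustrative only: the keys in `DEFAULTS` are hypothetical and are not the project's actual schema, and `load_config` assumes the third-party PyYAML package is installed.

```python
from copy import deepcopy

# Hypothetical defaults; the real keys in config.yaml may differ.
DEFAULTS = {
    "directories": {"input": "input", "output": "output", "logs": "logs"},
    "logging": {"level": "INFO"},
}


def merge_config(defaults, overrides):
    """Return defaults updated recursively with overrides."""
    merged = deepcopy(defaults)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path="config.yaml"):
    """Read a YAML file and merge it over DEFAULTS."""
    import yaml  # PyYAML (third-party); imported here so merge_config has no dependency

    with open(path, encoding="utf-8") as handle:
        # safe_load returns None for an empty file; merge_config handles that.
        return merge_config(DEFAULTS, yaml.safe_load(handle))
```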
The application includes error handling for:
- Invalid directory paths
- PDF processing errors
- Configuration issues
- File system operations
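The first two cases above can be handled with a pattern like the following: validate the input directory before starting, then isolate per-file failures so one bad PDF does not stop the batch. This is a sketch, not the application's code; `find_pdfs`, `process_batch`, and the `process_one` callback are hypothetical names.

```python
from pathlib import Path


def find_pdfs(input_dir):
    """Return the PDFs in input_dir, failing early on an invalid path."""
    directory = Path(input_dir)
    if not directory.is_dir():
        raise NotADirectoryError(f"Input directory not found: {directory}")
    return sorted(directory.glob("*.pdf"))


def process_batch(pdf_paths, process_one):
    """Run process_one on each PDF and collect failures instead of aborting."""
    succeeded, failed = [], []
    for pdf in pdf_paths:
        try:
            process_one(pdf)
            succeeded.append(pdf)
        except Exception as error:  # keep the batch going on any per-file error
            failed.append((pdf, error))
    return succeeded, failed
```

Returning the failures alongside the successes lets the caller log each error with its file name once the batch has finished.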
For development work:
# Install development dependencies
poetry install --with dev
# Run tests
poetry run pytest
# Format code
poetry run black .
(C) 2024 Prof. Tiran Dagan, FDU University. All rights reserved.
For issues, questions, or suggestions, please open an issue on GitHub.