PDF Ingestion and Processing Tool

A Python application for processing PDF files and creating structured outputs. This tool is designed for batch processing of PDF documents with a focus on annotation extraction and content structuring, featuring detailed logging and an interactive command-line interface.

Demo Video

Click the image above to watch a demonstration of how this tool works.

Features

📄 Batch PDF processing
📑 Annotation extraction
📊 JSON output generation
📝 Markdown conversion for debugging
🔄 Progress tracking
📋 Detailed logging

Installation

Clone this repository:

git clone https://github.com/tirandagan/AdvancedRAGingest.git
cd AdvancedRAGingest

Install Poetry (if not already installed):

curl -sSL https://install.python-poetry.org | python3 -

Install dependencies using Poetry:

poetry install

Activate the Poetry shell (on Poetry 2.0 and later, this command requires the poetry-plugin-shell plugin):

poetry shell

Directory Structure

The application uses the following directory structure:

project_root/
├── input/              # Place your PDF files here
├── output/
│   ├── json/          # Generated JSON files with PDF content
│   └── annotations/   # Extracted annotations
├── logs/              # Application logs
├── pyproject.toml     # Poetry dependency management
└── config.yaml        # Configuration file
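
If the directories are missing on a fresh clone, they can be created up front. The following is a minimal sketch, not part of the tool itself: the directory names come from the layout above, while `ensure_layout` and `EXPECTED_DIRS` are illustrative names.

```python
from pathlib import Path

# Directory names taken from the layout above; the constant and helper
# names are illustrative and do not come from the project.
EXPECTED_DIRS = ("input", "output/json", "output/annotations", "logs")


def ensure_layout(project_root: Path) -> list[Path]:
    """Create any missing directories and return all expected paths."""
    paths = [project_root / name for name in EXPECTED_DIRS]
    for path in paths:
        # parents=True also creates output/; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `ensure_layout(Path.cwd())` from the project root creates whatever is missing and leaves existing directories untouched.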

Usage

Place your PDF files in the input/ directory.
Ensure you're in the Poetry shell:

poetry shell

Run the application:

python 01_LoadPDFs.py

Select from two available tasks:
- Option 1: "Ingest PDFs and create JSON & Annotations"
  - Processes PDF files from the input directory
  - Extracts content and annotations
  - Generates JSON output files
- Option 2: "Create Debugging Markdowns from partition JSONs"
  - Creates markdown files from previously processed JSON files
  - Useful for debugging and content verification
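
A two-option menu like the one above can be modelled as a small dispatch table. This is only a sketch of the pattern: `ingest_pdfs`, `create_debug_markdowns`, `run_choice`, and `prompt_and_run` are hypothetical stand-ins, and the actual script may be organised differently.

```python
from typing import Callable


# Hypothetical stand-ins for the tool's two tasks.
def ingest_pdfs() -> str:
    return "ingest"


def create_debug_markdowns() -> str:
    return "markdown"


# Menu key -> (label shown to the user, function to call).
TASKS: dict[str, tuple[str, Callable[[], str]]] = {
    "1": ("Ingest PDFs and create JSON & Annotations", ingest_pdfs),
    "2": ("Create Debugging Markdowns from partition JSONs", create_debug_markdowns),
}


def run_choice(choice: str) -> str:
    """Run the task for a menu choice; raise ValueError for anything else."""
    key = choice.strip()
    if key not in TASKS:
        raise ValueError(f"Unknown option: {choice!r}")
    _, task = TASKS[key]
    return task()


def prompt_and_run() -> str:
    """Print the menu, read one choice from stdin, and run it."""
    for key, (label, _) in TASKS.items():
        print(f"{key}. {label}")
    return run_choice(input("Select a task: "))
```

Keeping the labels and functions in one table means a third task only needs one new entry.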

Output Description

Processing generates two types of output files:

JSON Output (`output/json/`)

- Structured content extracted from PDFs
- Includes document metadata and text content
- Organized in a format suitable for further processing
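
For further processing, the generated files can be read back with the standard library. The sketch below assumes only that each file in `output/json/` is valid JSON; it makes no assumption about the schema, and `load_json_outputs` is an illustrative name.

```python
import json
from pathlib import Path
from typing import Any


def load_json_outputs(json_dir: Path) -> dict[str, Any]:
    """Map each JSON file's stem to its parsed content, in sorted order."""
    results: dict[str, Any] = {}
    for path in sorted(json_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.stem] = json.load(handle)
    return results
```

For example, `load_json_outputs(Path("output/json"))` returns one entry per processed document, keyed by file name without the extension.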

Annotations (`output/annotations/`)

- Contains extracted PDF annotations
- Includes highlights, comments, and other markup
- Preserved in structured format for analysis

Logging

The application writes detailed logs to pdf_converter.log, including:

- Processing status and progress
- Warning and error messages
- Operation timestamps

The following library loggers are set to ERROR level:

- http.client
- httpx
- unstructured
- unstructured_ingest
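
A logging setup along these lines can be built with the standard `logging` module. This is a sketch under assumptions rather than the tool's actual code: the log file name and the four logger names come from this section, while the `pdf_converter` logger name, the message format, and `configure_logging` are illustrative.

```python
import logging

# Logger names listed above; each is limited to ERROR level.
NOISY_LOGGERS = ("http.client", "httpx", "unstructured", "unstructured_ingest")


def configure_logging(log_file: str = "pdf_converter.log") -> logging.Logger:
    """Send timestamped records to a file and quiet the noisy loggers."""
    logger = logging.getLogger("pdf_converter")  # assumed logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_file, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.ERROR)
    return logger
```

Raising the level on a library's logger suppresses its INFO and WARNING records without affecting the application's own messages.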

Configuration

The application reads its settings from config.yaml at startup. The file can be customized and covers:

- Directory paths
- Processing options
- Logging settings
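
One common way to handle such a file is to layer its values over built-in defaults, so a partial config.yaml still yields a complete configuration. The sketch below shows only the merge step on plain dictionaries; in practice the overrides would be parsed from config.yaml, for example with PyYAML's `yaml.safe_load`. The keys in `DEFAULTS` are hypothetical and may not match the real file.

```python
from copy import deepcopy
from typing import Any

# Hypothetical defaults; the real keys in config.yaml may differ.
DEFAULTS: dict[str, Any] = {
    "paths": {
        "input": "input",
        "json": "output/json",
        "annotations": "output/annotations",
    },
    "logging": {"file": "pdf_converter.log", "level": "INFO"},
}


def merge_config(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return defaults with overrides applied, merging nested dicts key by key."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged
```

With this approach, a file that sets only the log level keeps every default path.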

Error Handling

The application includes error handling for:

- Invalid directory paths
- PDF processing errors
- Configuration issues
- File system operations
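
In a batch run, this kind of handling usually means failing fast on a bad input directory while letting a single broken PDF be logged and skipped. The following sketch illustrates that pattern; `process_batch`, the `process_one` callback, and the `pdf_converter` logger name are assumptions, not the tool's actual API.

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger("pdf_converter")  # assumed logger name


def process_batch(
    input_dir: Path, process_one: Callable[[Path], None]
) -> tuple[list[Path], list[Path]]:
    """Process every PDF in input_dir; return (succeeded, failed) paths."""
    if not input_dir.is_dir():
        # An invalid directory stops the whole run before any work starts.
        raise NotADirectoryError(f"Input directory not found: {input_dir}")
    pdfs = sorted(input_dir.glob("*.pdf"))
    succeeded: list[Path] = []
    failed: list[Path] = []
    for index, pdf in enumerate(pdfs, start=1):
        try:
            process_one(pdf)
        except Exception:
            # Record the traceback and continue with the remaining files.
            logger.exception("Failed to process %s", pdf.name)
            failed.append(pdf)
        else:
            succeeded.append(pdf)
        logger.info("Progress: %d/%d", index, len(pdfs))
    return succeeded, failed
```

Returning the failed paths lets the caller report them at the end or retry them separately.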

Development

For development work:

# Install development dependencies
poetry install --with dev

# Run tests
poetry run pytest

# Format code
poetry run black .

License

Support

For issues, questions, or suggestions, please open an issue on GitHub.
