Streamlined ingest using unstructured.io calls to partition, enrich, and then chunk a complex PDF

neel09-cyber/AdvancedRAGingest

 
 

PDF Ingestion and Processing Tool

A Python application for processing PDF files and creating structured outputs. This tool is designed for batch processing of PDF documents with a focus on annotation extraction and content structuring, featuring detailed logging and an interactive command-line interface.

Demo Video

PDF Processing Tool Demo

Click the image above to watch a demonstration of how this tool works.

Features

  • 📄 Batch PDF processing
  • 📑 Annotation extraction
  • 📊 JSON output generation
  • 📝 Markdown conversion for debugging
  • 🔄 Progress tracking
  • 📋 Detailed logging

Installation

  1. Clone this repository:
git clone https://github.com/tirandagan/AdvancedRAGingest.git
cd AdvancedRAGingest
  2. Install Poetry (if not already installed):
curl -sSL https://install.python-poetry.org | python3 -
  3. Install dependencies using Poetry:
poetry install
  4. Activate the Poetry shell:
poetry shell

Directory Structure

The application uses the following directory structure:

project_root/
├── input/              # Place your PDF files here
├── output/
│   ├── json/          # Generated JSON files with PDF content
│   └── annotations/   # Extracted annotations
├── logs/              # Application logs
├── pyproject.toml     # Poetry dependency management
└── config.yaml        # Configuration file
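
As a minimal sketch, the layout above could be created up front with pathlib. The helper name `ensure_directories` and the idea of creating the folders at startup are assumptions for illustration; the tool itself may handle this differently.

```python
from pathlib import Path

# Sub-directories taken from the tree above, relative to the project root.
REQUIRED_DIRS = ("input", "output/json", "output/annotations", "logs")


def ensure_directories(project_root):
    """Create any missing directories and return their paths (hypothetical helper)."""
    root = Path(project_root)
    created = []
    for relative in REQUIRED_DIRS:
        path = root / relative
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created
```

Calling `ensure_directories(".")` from the project root is safe to repeat, since existing folders are left untouched.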

Usage

  1. Place your PDF files in the input/ directory.

  2. Ensure you're in the Poetry shell:

poetry shell
  3. Run the application:
python 01_LoadPDFs.py
  4. Select from two available tasks:
    • Option 1: "Ingest PDFs and create JSON & Annotations"
      • Processes PDF files from the input directory
      • Extracts content and annotations
      • Generates JSON output files
    • Option 2: "Create Debugging Markdowns from partition JSONs"
      • Creates markdown files from previously processed JSON files
      • Useful for debugging and content verification
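
To illustrate what Option 2 does conceptually, here is a rough sketch of turning partition JSON into Markdown. It assumes the JSON is a list of element dictionaries with "type" and "text" keys, as in unstructured's serialized elements. The helper name `elements_to_markdown` and the mapping from element types to Markdown are hypothetical, not the tool's actual conversion.

```python
import json


def elements_to_markdown(elements):
    """Render a list of element dicts as Markdown text (hypothetical helper)."""
    lines = []
    for element in elements:
        text = (element.get("text") or "").strip()
        if not text:
            continue  # skip elements without text, such as images
        kind = element.get("type", "")
        if kind == "Title":
            lines.append(f"## {text}")
        elif kind == "ListItem":
            lines.append(f"- {text}")
        else:
            lines.append(text)
    return "\n\n".join(lines)


sample = json.loads(
    '[{"type": "Title", "text": "Intro"}, {"type": "NarrativeText", "text": "Hello."}]'
)
print(elements_to_markdown(sample))
```

For the two sample elements, this prints a "## Intro" heading followed by the paragraph "Hello.", separated by a blank line.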

Output Description

The processing generates several types of output files:

JSON Output (output/json/)

  • Structured content extracted from PDFs
  • Includes document metadata and text content
  • Organized in a format suitable for further processing

Annotations (output/annotations/)

  • Contains extracted PDF annotations
  • Includes highlights, comments, and other markup
  • Preserved in structured format for analysis
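
For a quick check of what landed in output/json/, a sketch like the following counts element types per file. It assumes each JSON file holds a list of dictionaries with a "type" key; the real schema may differ, and `summarize_json_outputs` is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_json_outputs(json_dir):
    """Map each JSON file name to a Counter of element types (hypothetical helper)."""
    summary = {}
    for json_path in sorted(Path(json_dir).glob("*.json")):
        with json_path.open(encoding="utf-8") as handle:
            elements = json.load(handle)
        # Elements without a "type" key are grouped under "Unknown".
        summary[json_path.name] = Counter(e.get("type", "Unknown") for e in elements)
    return summary
```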

Logging

The application generates detailed logs in pdf_converter.log:

  • Processing status and progress
  • Warning and error messages
  • Operation timestamps

The following third-party loggers are set to the ERROR level:

  • http.client (ERROR level)
  • httpx (ERROR level)
  • unstructured (ERROR level)
  • unstructured_ingest (ERROR level)
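
The logging setup described above can be approximated with the standard logging module. This is a sketch under stated assumptions: the application logger name "pdf_converter", the INFO level, and the message format are assumed, while the log file name and the four quieted loggers come from this section.

```python
import logging

# Logger names and level taken from the list above.
NOISY_LOGGERS = ("http.client", "httpx", "unstructured", "unstructured_ingest")


def configure_logging(log_file="pdf_converter.log"):
    """Send application logs to a file and quiet third-party loggers (sketch)."""
    logger = logging.getLogger("pdf_converter")  # assumed application logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.FileHandler(log_file, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.ERROR)
    return logger
```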

Configuration

The application uses a configuration system that can be customized through config.yaml. Configuration is loaded at startup and includes:

  • Directory paths
  • Processing options
  • Logging settings
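
One common way to load such a file is to merge it over built-in defaults, so a partial config.yaml still works. The sketch below assumes PyYAML is installed; the key names in `DEFAULTS` are hypothetical and may not match the real config.yaml.

```python
from copy import deepcopy

# Hypothetical defaults; the real keys in config.yaml may differ.
DEFAULTS = {
    "directories": {"input": "input", "output": "output", "logs": "logs"},
    "logging": {"level": "INFO"},
}


def merge_config(defaults, overrides):
    """Return a copy of defaults, updated recursively with overrides."""
    merged = deepcopy(defaults)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path="config.yaml"):
    """Load config.yaml and fill in missing keys from DEFAULTS."""
    import yaml  # PyYAML; imported here so merge_config works without it

    with open(path, encoding="utf-8") as handle:
        loaded = yaml.safe_load(handle)  # returns None for an empty file
    return merge_config(DEFAULTS, loaded)
```

Merging recursively means a file that overrides only `directories.input` keeps the default output and log locations.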

Error Handling

The application includes error handling for:

  • Invalid directory paths
  • PDF processing errors
  • Configuration issues
  • File system operations
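
For batch work, a typical pattern is to fail fast on a bad input directory but isolate per-file failures so one broken PDF does not stop the run. The following is an illustrative sketch, not the tool's actual code; `process_batch` and the `process_one` callback are hypothetical names.

```python
from pathlib import Path


def process_batch(input_dir, process_one):
    """Run process_one on each PDF and collect failures instead of stopping."""
    directory = Path(input_dir)
    if not directory.is_dir():
        raise NotADirectoryError(f"Input directory not found: {directory}")
    succeeded, failed = [], {}
    for pdf_path in sorted(directory.glob("*.pdf")):
        try:
            process_one(pdf_path)
            succeeded.append(pdf_path.name)
        except Exception as exc:  # broad on purpose: one bad file should not end the batch
            failed[pdf_path.name] = str(exc)
    return succeeded, failed
```

The returned `failed` mapping pairs each file name with its error message, which makes it straightforward to write a summary to the log at the end of a run.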

Development

For development work:

# Install development dependencies
poetry install --with dev

# Run tests
poetry run pytest

# Format code
poetry run black .

License

(C) 2024 Prof. Tiran Dagan, FDU University. All rights reserved.

Support

For issues, questions, or suggestions, please open an issue on GitHub.
