Project Summary: This project uses a combination of Python models and libraries, including Layout Parser, Detectron2, LatexOcr, Bud-OCR, PyMuPDF, pdf2img, pytesseract, and more, to extract figures, figure captions, tables, table captions, equations, and text from PDF documents.
Requirements: Linux or macOS, and Python > 3.8
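The version requirement above can be checked from Python before installing anything. This is a minimal sketch, not part of the repository: the helper name `is_supported` is hypothetical, and it reads `> 3.8` literally (3.9 or newer). If Python 3.8 itself works for you, relax the comparison to `>=`.

```python
import sys

# Literal reading of the "Python: > 3.8" requirement (an assumption).
MINIMUM_EXCLUSIVE = (3, 8)


def is_supported(version_info=None):
    """Return True if the (major, minor) version is newer than 3.8."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) > MINIMUM_EXCLUSIVE


if __name__ == "__main__":
    # Report the running interpreter and whether it passes the check.
    print("Python", sys.version.split()[0], "supported:", is_supported())
```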
Step 1. Create the virtual environment: Use the python3 -m venv command to create a virtual environment. Replace your_env_name with the name you want to give to your virtual environment:
python3 -m venv your_env_name
Step 2. Activate the virtual environment: You need to activate the virtual environment to start using it. Use the following command:
source your_env_name/bin/activate
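To confirm that the activation took effect, one option is to ask the interpreter whether it is running inside a virtual environment. The sketch below is illustrative only: `in_virtualenv` is a hypothetical helper name, and the check relies on the standard behaviour of environments created with `python3 -m venv`.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
```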
Step 3. Clone the pdf_extraction_pipeline repository:
git clone https://github.com/BudEcosystem/pdf_extraction_pipeline.git
Step 5. Install the required packages listed in requirements.txt:
pip install -r pdf_extraction_pipeline/requirements.txt
Step 6. Create a .env file inside the pdf_extraction_pipeline folder and copy the contents of example.eve into it. Then open .env and modify the environment variables as needed.
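The copy in this step can also be scripted. The following is a minimal sketch under stated assumptions: `create_env_file` is a hypothetical helper, the template file name is used exactly as written in the step above, and an existing .env is left untouched so that edited values are not overwritten.

```python
import shutil
from pathlib import Path


def create_env_file(folder, template_name="example.eve"):
    """Copy the template to .env inside `folder` unless .env already exists.

    Returns the path to the .env file.
    """
    folder = Path(folder)
    env_path = folder / ".env"
    if not env_path.exists():
        # Template name taken from the step above as written (an assumption).
        shutil.copyfile(folder / template_name, env_path)
    return env_path
```

Calling `create_env_file("pdf_extraction_pipeline")` from the directory that contains the cloned repository would create the file; the environment variables still need to be edited by hand afterwards.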
Step 7. Run process_pdf.py to extract PDF data and process_epub.py to extract EPUB data. To run PDF extraction using RabbitMQ, read the instructions in the Readme file inside the pdf_pipeline folder.
python pdf_extraction_pipeline/process_pdf.py
python pdf_extraction_pipeline/process_epub.py
Note: If you get any installation errors, install the packages and models manually, one by one:
Installation of Layout Parser and Detectron2, to detect the layout of a document image (https://layout-parser.readthedocs.io/en/latest/notes/installation.html)
pip install layoutparser
pip install "layoutparser[effdet]"
pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"
pip install "layoutparser[paddledetection]"
pip install "layoutparser[ocr]"
pip install layoutparser torchvision && pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"
Nougat installation, to extract the LaTeX code of equations (https://github.com/facebookresearch/nougat)
pip install nougat-ocr
LatexOcr, to extract LaTeX code from an image containing an equation (https://github.com/lukas-blecher/LaTeX-OCR)
pip install "pix2tex[gui]"
Installation of other Python packages: install any remaining required package with the following command, replacing package_name with the name of the package:
pip install package_name
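When installing packages one by one, it helps to know which ones are still missing. The sketch below is a hedged illustration, not part of the repository: `missing_modules` is a hypothetical helper, and the import names in `MODULES` are assumptions about how the packages above are imported (for example, PyMuPDF is imported as `fitz` and nougat-ocr as `nougat`).

```python
import importlib.util

# Assumed import names for the packages mentioned in this README.
MODULES = [
    "layoutparser",
    "detectron2",
    "torchvision",
    "nougat",       # installed by nougat-ocr
    "pix2tex",      # LatexOcr
    "fitz",         # PyMuPDF
    "pytesseract",
]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(MODULES)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

Run it inside the activated virtual environment; each name it reports still needs to be installed with the matching command above.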