PDF-Data-Extraction-Pipeline

Project Summary: This project combines several Python models and libraries, including Layout Parser, Detectron2, LatexOcr, Bud-OCR, PyMuPDF, pdf2img, pytesseract, and more, to extract figures, figure captions, tables, table captions, equations, and text from PDF documents.

Code flow for pdf data extraction

System Requirements

Linux or macOS, and Python > 3.8
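The requirements above can be checked from Python before installing anything. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and it reads "> 3.8" as "a minor version later than 3.8", which is an assumption (loosen the comparison if 3.8 itself is meant to be supported).

```python
import platform
import sys

# platform.system() reports "Darwin" on macOS.
SUPPORTED_SYSTEMS = {"Linux", "Darwin"}
# Assumption: "> 3.8" is read as strictly newer than the 3.8 series.
MIN_EXCLUSIVE_PYTHON = (3, 8)


def check_environment(system=None, version=None):
    """Return a list of problems; an empty list means the requirements are met."""
    system = platform.system() if system is None else system
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)[:2]
    problems = []
    if system not in SUPPORTED_SYSTEMS:
        problems.append(f"unsupported operating system: {system}")
    if version <= MIN_EXCLUSIVE_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is too old")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```

Calling `check_environment()` with no arguments inspects the running interpreter; the optional arguments exist only so the check can be exercised with other values.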

Quick Start

Step 1. Create the virtual environment: Use the python3 -m venv command to create a virtual environment. Replace your_env_name with the name you want to give to your virtual environment:

python3 -m venv your_env_name

Step 2. Activate the virtual environment: You need to activate the virtual environment to start using it. Use the following command:

source your_env_name/bin/activate
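To confirm that activation worked, the interpreter itself can be asked whether it is running inside a virtual environment. This is a small sketch using only the standard library; the helper name `in_virtual_env` is hypothetical.

```python
import sys


def in_virtual_env(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv-style environment.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the Python installation it was created from.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_env())
```

If this reports that no virtual environment is active, packages installed in the later steps would go into the system-wide Python instead of your_env_name.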

Step 3. Clone pdf_extraction_pipeline repo

git clone https://github.com/BudEcosystem/pdf_extraction_pipeline.git

Step 4. Install the required packages from the requirements.txt file:

pip install -r pdf_extraction_pipeline/requirements.txt

Step 5. Create a .env file inside the pdf_extraction_pipeline folder and copy the keys from example.env into it. Then open .env and set the value of each environment variable.

Step 6. Run process_pdf.py to extract PDF data, or process_epub.py to extract EPUB data. To run PDF extraction using RabbitMQ, see the instructions in the README file inside the pdf_pipeline folder.

python pdf_extraction_pipeline/process_pdf.py

python pdf_extraction_pipeline/process_epub.py
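Before running either script, it may help to check that the .env file parses cleanly. The sketch below is a standard-library illustration of how KEY=VALUE lines can be read into the process environment; it is not the loader the repository itself uses, and the names `parse_env_text`, `load_env_file`, `EXAMPLE_HOST`, and `EXAMPLE_PORT` are hypothetical.

```python
import os


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines, # comments, and lines without '='."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of optional quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env", override=False):
    """Read a .env file and copy its entries into os.environ."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env_text(handle.read())
    for key, value in values.items():
        if override or key not in os.environ:
            os.environ[key] = value
    return values


if __name__ == "__main__":
    # Hypothetical keys, used only to show the accepted syntax.
    sample = '# broker settings\nEXAMPLE_HOST=localhost\nEXAMPLE_PORT = "5672"\n'
    print(parse_env_text(sample))
```

By default, variables already set in the shell take precedence over the file; pass `override=True` to reverse that.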

Note: If you get installation errors, install the packages and models manually, one at a time, as described in the next section.

Installation of various models used

Installation of Layout Parser and Detectron2: used to detect the layout of a document image (https://layout-parser.readthedocs.io/en/latest/notes/installation.html)

pip install layoutparser

pip install "layoutparser[effdet]"

pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/[email protected]#egg=detectron2"

pip install "layoutparser[paddledetection]"

pip install "layoutparser[ocr]"

pip install layoutparser torchvision && pip install "detectron2@git+https://github.com/facebookresearch/[email protected]#egg=detectron2"

Nougat installation: used to extract the LaTeX code of equations (https://github.com/facebookresearch/nougat)

pip install nougat-ocr

LatexOcr: used to extract LaTeX code from an image containing an equation (https://github.com/lukas-blecher/LaTeX-OCR)

pip install "pix2tex[gui]"

Installation of various Python packages: install any remaining required package with the following command:

pip install package_name
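When installing packages one at a time, it is useful to know which ones are still missing. The sketch below checks whether each package can be found by the import system without actually importing it. The mapping from pip names to import names is an assumption based on how these packages are commonly imported (for example, PyMuPDF as `fitz` and nougat-ocr as `nougat`); the helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

# Assumed pip-name -> import-name pairs; adjust to match requirements.txt.
ASSUMED_IMPORT_NAMES = {
    "layoutparser": "layoutparser",
    "detectron2": "detectron2",
    "nougat-ocr": "nougat",
    "pix2tex[gui]": "pix2tex",
    "PyMuPDF": "fitz",
    "pytesseract": "pytesseract",
}


def missing_packages(import_names=None):
    """Return the pip names whose import name cannot be located."""
    import_names = ASSUMED_IMPORT_NAMES if import_names is None else import_names
    return sorted(
        pip_name
        for pip_name, module_name in import_names.items()
        if find_spec(module_name) is None
    )


if __name__ == "__main__":
    for pip_name in missing_packages():
        print(f"missing: pip install {pip_name}")
```

Run the check from outside the repository folder: the repository contains its own nougat.py, which could otherwise be picked up in place of the installed nougat package.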
