paperpdf2xml

A set of Python 3 command-line tools that convert scientific papers from PDF to XML documents with sections and tables.

Prerequisites

Make sure the pdftotext utility (part of poppler-utils) is installed; it performs the initial PDF-to-text conversion.

For Ubuntu/Debian

sudo apt-get install poppler-utils

For RedHat/RHEL/Fedora/CentOS Linux

sudo yum install poppler-utils
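
Whichever package manager is used, a quick check from Python can confirm that pdftotext is on the PATH before running the pipeline. This is a minimal sketch; the helper name `have_pdftotext` is illustrative and not part of this repository.

```python
import shutil


def have_pdftotext() -> bool:
    """Return True if a pdftotext executable is found on PATH."""
    return shutil.which("pdftotext") is not None


if have_pdftotext():
    print("pdftotext found at", shutil.which("pdftotext"))
else:
    print("pdftotext not found; install poppler-utils first")
```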

Create a Python virtual environment

python3 -m venv ~/pdf_env

Activate the virtual environment and install dependencies

source ~/pdf_env/bin/activate
pip install --upgrade pip
pip install pdftotree==0.2.13
pip install h5py==2.10.0
pip install tensorflow
pip install Keras
pip install spacy
python -m spacy download en_core_web_sm

Install the spaCy NLP library and model if they were not installed in the previous step (a virtual environment is recommended)

pip install spacy
python -m spacy download en_core_web_sm

Usage

pdftotext paper.pdf paper.txt
python pdftext2pages.py -i paper.txt -o /tmp/paper1
python paper2xml.py -i /tmp/paper1/pdf.xml -o /tmp/paper1/paper.xml
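
The three steps above can also be chained from Python. The following is a minimal sketch, assuming both scripts sit in the current working directory and that the active interpreter is the one from the virtual environment; the function names `build_commands` and `convert` are illustrative and not part of this repository.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the three commands from the Usage section as argument lists."""
    pdf = Path(pdf_path)
    out = Path(out_dir)
    txt = pdf.with_suffix(".txt")   # paper.pdf -> paper.txt
    pages_xml = out / "pdf.xml"     # intermediate file, as named in the usage above
    paper_xml = out / "paper.xml"   # final output
    return [
        ["pdftotext", str(pdf), str(txt)],
        [sys.executable, "pdftext2pages.py", "-i", str(txt), "-o", str(out)],
        [sys.executable, "paper2xml.py", "-i", str(pages_xml), "-o", str(paper_xml)],
    ]


def convert(pdf_path, out_dir):
    """Run each step in order; check=True raises on the first failing step."""
    for cmd in build_commands(pdf_path, out_dir):
        subprocess.run(cmd, check=True)
    return Path(out_dir) / "paper.xml"


# Example call (needs pdftotext and the two scripts to be available):
# convert("paper.pdf", "/tmp/paper1")
```

Building the commands as lists rather than shell strings avoids quoting problems when file names contain spaces.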

The generated /tmp/paper1/paper.xml contains the paper's section and table information. Common page headers and footers (line numbers) are removed, and formula lines are detected heuristically and stripped. The generated XML can then be used for text-mining applications.
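
The element names used in the output XML are not documented here, so a schema-agnostic first step is to list which tags a file contains. A minimal sketch using only the standard library follows; the sample document and its tag names (`paper`, `section`, `table`) are made up for illustration and may not match the tool's actual output.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xml_text):
    """Count how often each element tag occurs in an XML string."""
    root = ET.fromstring(xml_text)
    return Counter(elem.tag for elem in root.iter())


# Hypothetical sample; the real tag names in the generated XML may differ.
sample = "<paper><section>Intro</section><section>Methods</section><table/></paper>"
print(tag_counts(sample))
```

For a real run, read the generated file into a string and pass it to `tag_counts` to see which elements are available before writing any extraction code.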

python pdftext2pages.py -h 

usage: pdftext2pages.py [-h] -i I -o O

optional arguments:
  -h, --help  show this help message and exit
  -i I        input PDF Text file
  -o O        output directory

python paper2xml.py -h

usage: paper2xml.py [-h] -i I -o O

optional arguments:
  -h, --help  show this help message and exit
  -i I        input PDF XML file
  -o O        output XML file
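
Although the help output lists -i and -o under "optional arguments", the usage line shows them without brackets, which is how argparse displays required flags. A parser along the following lines would produce similar help text; it is an illustrative reconstruction, not the scripts' actual source, and `make_parser` is a hypothetical name.

```python
import argparse


def make_parser(prog, in_help, out_help):
    """Build a parser with two required flags, -i and -o."""
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument("-i", required=True, help=in_help)
    parser.add_argument("-o", required=True, help=out_help)
    return parser


parser = make_parser("paper2xml.py", "input PDF XML file", "output XML file")
args = parser.parse_args(["-i", "/tmp/paper1/pdf.xml", "-o", "/tmp/paper1/paper.xml"])
print(args.i, args.o)
```

Omitting either flag makes argparse exit with an error, so both must be supplied on every invocation.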
