paperpdf2xml

A set of Python 3 command-line tools that convert scientific papers in PDF format to XML documents with sections and tables.

Prerequisites

  • Make sure the pdftotext utility is installed; it performs the initial PDF-to-text conversion.

For Ubuntu/Debian

sudo apt-get install poppler-utils

For Red Hat/RHEL/Fedora/CentOS Linux

sudo yum install poppler-utils
  • create a Python virtual environment
python3 -m venv ~/pdf_env
  • activate the virtual environment and install dependencies
source ~/pdf_env/bin/activate
pip install --upgrade pip
pip install pdftotree==0.2.13
pip install h5py==2.10.0
pip install tensorflow
pip install Keras
  • install the spaCy NLP library and its English model (a virtual environment is recommended)
pip install spacy
python -m spacy download en_core_web_sm
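
The prerequisites above can be checked from Python before running the tools. The following is a minimal sketch and is not part of paperpdf2xml: the helper name `missing_prerequisites` is hypothetical, and the module names are assumed to be the import names of the packages installed above. It only checks that `pdftotext` is on `PATH` and that each module can be found in the active environment; it does not check versions.

```python
import importlib.util
import shutil


def missing_prerequisites(
    commands=("pdftotext",),
    modules=("pdftotree", "h5py", "tensorflow", "keras", "spacy", "en_core_web_sm"),
):
    """Return the names of executables and modules that cannot be found.

    Hypothetical helper for illustration; paperpdf2xml does not ship it.
    """
    missing = []
    for command in commands:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(command) is None:
            missing.append(command)
    for module in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(module)
    return missing


if __name__ == "__main__":
    not_found = missing_prerequisites()
    if not_found:
        print("Missing prerequisites: " + ", ".join(not_found))
    else:
        print("All prerequisites found.")
```

Run it inside the activated virtual environment so that the module lookup sees the packages installed with pip.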

Usage

pdftotext paper.pdf paper.txt
python pdftext2pages.py -i paper.txt -o /tmp/paper1
python paper2xml.py -i /tmp/paper1/pdf.xml -o /tmp/paper1/paper.xml
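
The three commands above can also be chained from a single Python script. This is a sketch under stated assumptions, not a script provided by the project: the function names `build_commands` and `convert` are hypothetical, the two tools are assumed to be in the current working directory, and the intermediate file name `pdf.xml` is taken from the usage example above.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the three usage commands as argument lists (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)
    # pdftotext writes plain text; here it is placed next to the PDF.
    text_path = pdf_path.with_suffix(".txt")
    return [
        ["pdftotext", str(pdf_path), str(text_path)],
        [sys.executable, "pdftext2pages.py", "-i", str(text_path), "-o", str(out_dir)],
        # Assumption: the previous step leaves pdf.xml in the output directory,
        # as the usage example suggests.
        [sys.executable, "paper2xml.py",
         "-i", str(out_dir / "pdf.xml"), "-o", str(out_dir / "paper.xml")],
    ]


def convert(pdf_path, out_dir):
    """Run each step in order; check=True stops at the first failing command."""
    for command in build_commands(pdf_path, out_dir):
        subprocess.run(command, check=True)
    return Path(out_dir) / "paper.xml"
```

Calling `convert("paper.pdf", "/tmp/paper1")` would run the same sequence as the commands above and return the path of the final XML file. Keeping command construction separate from execution makes it possible to inspect the commands without running them.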

The generated /tmp/paper1/paper.xml contains the paper's section and table information. Common page headers and footers (line numbers) are removed, and formula lines are detected heuristically and stripped. The generated XML can then be used for text-mining applications.
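
As a starting point for text mining, the generated file can be loaded with Python's standard library. This README does not document the element names used in the output, so the sketch below makes no assumption about them and only counts how often each tag occurs; the function name `summarize_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(xml_source):
    """Count the occurrences of each element tag in an XML file.

    xml_source may be a file path or an open file-like object.
    """
    tree = ET.parse(xml_source)
    # tree.iter() walks the root element and all of its descendants.
    return Counter(element.tag for element in tree.iter())
```

For example, `summarize_xml("/tmp/paper1/paper.xml").most_common()` lists the tags present in the output, which shows which elements to target when extracting sections or tables.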

python pdftext2pages.py -h 

usage: pdftext2pages.py [-h] -i I -o O

optional arguments:
  -h, --help  show this help message and exit
  -i I        input PDF Text file
  -o O        output directory

python paper2xml.py -h

usage: paper2xml.py [-h] -i I -o O

optional arguments:
  -h, --help  show this help message and exit
  -i I        input PDF XML file
  -o O        output XML file