Script to extract plain text from pdfs and recover broken sentences

onadegibert/pdftoml

pdf2ml

Project description

pdftoml is a script that extracts plain text from PDF files and recovers sentences broken during extraction, so that the output is ready to use for machine learning.
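To give an idea of what recovering broken sentences can involve, here is a minimal sketch of one common heuristic. It is not taken from the pdftoml source: the function name `recover_sentences` and both regular expressions are assumptions made for illustration, and the script's actual approach may differ.

```python
import re


def recover_sentences(raw_text: str) -> str:
    """Join lines that a PDF extractor split mid-sentence (illustrative heuristic only)."""
    # Rejoin words hyphenated across a line break: "consti-\ntution" -> "constitution".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw_text)
    # Turn a single newline into a space unless the preceding character ends a
    # sentence (. ! ? :) or the newline is part of a blank-line paragraph break.
    text = re.sub(r"(?<![.!?:\n])\n(?!\n)", " ", text)
    return text


sample = "The consti-\ntution is\nthe law.\nNext sentence."
print(recover_sentences(sample))
```

A heuristic like this also merges words that are legitimately hyphenated at a line end, which is one reason a real tool may need language-specific handling.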

Install and run

Virtual environment

pdf2ml was built and tested with Python 3.7. It should work with Python >= 3.6, but it has not been tested with any version other than 3.7.

To create the virtual environment and install the dependencies (from requirements.txt), run:

bash setup.sh

With the virtual environment activated (source venv/bin/activate), run the script with the Python interpreter:

(venv) $ python src/pdftoml.py input_dir output_dir language
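The command takes three positional arguments. As a rough sketch of how a script could read such arguments, the snippet below uses the standard-library argparse module. The function `parse_args` and the help strings are hypothetical and are not taken from the pdftoml source; in particular, that `language` is a short code such as ca is inferred only from the example in this README.

```python
import argparse


def parse_args(argv=None):
    """Parse the three positional arguments shown in the usage line (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Convert PDF files to plain text.")
    parser.add_argument("input_dir", help="directory containing the PDF files")
    parser.add_argument("output_dir", help="directory where the text output is written")
    # Assumption: the language is passed as a short code such as "ca".
    parser.add_argument("language", help="language of the documents, e.g. ca")
    return parser.parse_args(argv)


args = parse_args(["test", "out", "ca"])
print(args.input_dir, args.output_dir, args.language)
```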

Examples

The test/ directory contains a PDF of the Spanish Constitution in Catalan.

To convert it, pass test as the input directory, out as the output directory, and ca as the language:

(venv) $ python src/pdftoml.py test out ca

The output is stored in the out/ directory.

Contributing

Pull requests are welcome!

Authors

License

This project is licensed under the MIT License.
