pdf2ml

Project description

pdf2ml is a script that converts PDF files to plain text ready for use in machine learning.

Install and run

Virtual environment

pdf2ml was built and tested with Python 3.7. It should work with Python >= 3.6, but it has not been tested with versions other than 3.7.

To create the virtual environment and install the dependencies (from requirements.txt), run:

bash setup.sh

With the virtual environment activated (source venv/bin/activate), run the script with the Python interpreter, passing the input directory, the output directory and the language:

(venv) $ python src/pdf2ml.py input_dir output_dir language
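
To make the three positional arguments concrete, here is a minimal sketch of how a command line of this shape could be parsed with argparse. It is an illustration only, not the code in src/: the function name parse_args and the help texts are hypothetical, and the reading of language as a language code for the documents is an assumption based on the example value ca.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse input_dir, output_dir and language (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="pdf2ml.py",
        description="Convert PDF files to plain text (illustrative sketch).",
    )
    # Directory containing the PDF files to convert.
    parser.add_argument("input_dir", type=Path)
    # Directory where the plain-text output is written.
    parser.add_argument("output_dir", type=Path)
    # Assumed to be a language code such as "ca" for Catalan.
    parser.add_argument("language")
    return parser.parse_args(argv)


# Same arguments as the command "python src/pdf2ml.py test out ca".
args = parse_args(["test", "out", "ca"])
print(args.input_dir, args.output_dir, args.language)
```

Passing an explicit list to parse_args keeps the sketch runnable without a real command line; calling it with no argument reads sys.argv as usual.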

Examples

In the test/ directory, there is a PDF of the Spanish Constitution in Catalan.

To convert it, run the following command:

(venv) $ python src/pdf2ml.py test out ca

The output will be stored in the out/ directory.
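
Once the conversion has run, the text can be loaded for further processing. The sketch below assumes that the output directory contains UTF-8 encoded .txt files; the README does not specify the output layout, so the file extension, the encoding and the helper name load_texts are all assumptions to adjust to what the script actually writes.

```python
from pathlib import Path


def load_texts(output_dir):
    """Return {relative path: text} for each .txt file under output_dir.

    Hypothetical helper: assumes UTF-8 encoded .txt output files.
    """
    root = Path(output_dir)
    texts = {}
    for path in sorted(root.rglob("*.txt")):
        # Key by the path relative to the output directory.
        texts[path.relative_to(root).as_posix()] = path.read_text(encoding="utf-8")
    return texts


if Path("out").is_dir():
    for name, text in load_texts("out").items():
        print(name, len(text.split()), "whitespace-separated tokens")
```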

Contributing

Pull requests are welcome!

Authors

License

This project is licensed under the MIT License.


onadegibert/pdftoml
