Full documentation for processing from scratch #3

Open
annahaensch opened this issue Feb 22, 2022 · 0 comments
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers

Comments

@annahaensch
Collaborator

Originally opened by @maminian on Dec. 16, 2021

Modify the main README.md to state precisely which Python packages are required for scanning and processing the PDF files. Ideally, each package should be named the way one would install it using pip/conda/whatever. If anyone wants to go as far as setting up a Python virtual environment from scratch, working towards better reproducibility, that'd be great too.
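
For the virtual environment part, here is a minimal sketch using only the standard-library `venv` module. The function name `create_env` and the `.venv` directory name are placeholders chosen for illustration, not anything the repository defines.

```python
import venv
from pathlib import Path


def create_env(env_dir=".venv", with_pip=True):
    """Create a virtual environment in env_dir and return its path.

    Both the function name and the default directory are hypothetical.
    """
    # EnvBuilder is the standard-library API behind `python -m venv`.
    builder = venv.EnvBuilder(with_pip=with_pip)
    builder.create(env_dir)
    return Path(env_dir)


if __name__ == "__main__":
    env_path = create_env()
    print(f"Created virtual environment at {env_path.resolve()}")
```

This should be equivalent to running `python -m venv .venv` from the repository root. Once the environment is activated, the required packages can be installed into it with pip.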

If you're not familiar with GitHub Markdown, that's not important. If you just get the information written down, we can iterate and polish it later.

At a glance, the Python files in the code/ directory import the modules below, which I don't think are (or don't recognize as) part of the Python standard library (but maybe they are?).

  • fuzzywuzzy
  • usaddress
  • pandas
  • PIL
  • pytesseract
  • pdf2image
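
One way to answer the "maybe they are?" question, and to see which of these still need installing, is sketched below. The pip names in the mapping are my best understanding rather than something confirmed against the repository: the one known mismatch is `PIL`, which is installed as `Pillow`. The helper names `is_stdlib` and `missing_packages` are made up for this example.

```python
import importlib.util
import sys

# Import name -> name to pass to `pip install`.
# Assumed mapping; only PIL differs from its import name (it ships as Pillow).
PIP_NAMES = {
    "fuzzywuzzy": "fuzzywuzzy",
    "usaddress": "usaddress",
    "pandas": "pandas",
    "PIL": "Pillow",
    "pytesseract": "pytesseract",
    "pdf2image": "pdf2image",
}


def is_stdlib(module_name):
    """Return True if module_name is a standard-library module.

    Relies on sys.stdlib_module_names, which exists in Python 3.10+;
    on older versions this conservatively returns False.
    """
    names = getattr(sys, "stdlib_module_names", None)
    return names is not None and module_name in names


def missing_packages(pip_names=PIP_NAMES):
    """Return the pip names of modules that cannot be found for import."""
    return [
        pip_name
        for module_name, pip_name in pip_names.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    for module_name in PIP_NAMES:
        print(f"{module_name}: stdlib={is_stdlib(module_name)}")
    missing = missing_packages()
    if missing:
        print("Try: pip install " + " ".join(missing))
    else:
        print("All listed packages are importable.")
```

None of the six should show up as standard-library modules, so all of them would need to be listed in the README.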

From @maminian:

Doing a little investigating: pytesseract is only a Python wrapper for Tesseract, which needs to be installed separately. The source code is here:

https://github.com/tesseract-ocr/tesseract

which has a section about installation.
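
Since a pip install alone won't bring in Tesseract itself, a quick check like the sketch below could go in the README so people can confirm the external tools are on their PATH. It assumes the Tesseract executable is named `tesseract`. It also checks for `pdftoppm`, on the assumption that pdf2image is in the same situation, since as far as I know it wraps the Poppler command-line tools. The function name `find_executables` is made up for this example.

```python
import shutil


def find_executables(names=("tesseract", "pdftoppm")):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_executables().items():
        status = path if path else "NOT FOUND (install separately)"
        print(f"{name}: {status}")
```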

@annahaensch annahaensch added documentation Improvements or additions to documentation good first issue Good for newcomers labels Feb 22, 2022