Problem setting the OCR Environment #46

Open
panchyni opened this issue Nov 3, 2023 · 4 comments

Comments

@panchyni

panchyni commented Nov 3, 2023

Using the current Setup Environment Instructions, when pip tries to install 'tesseract' it installs this package (https://pypi.org/project/tesseract/), which is not the intended one. A colleague of mine was able to fix this error by installing tesseract through conda instead of pip, but it required some modifications to other packages:

  1. In requirements.txt change PyPDF2 to PyPDF2<3.0
  2. In requirements_ocr.txt remove tesseract
  3. Run conda create -n stopa_env python=3.9 pip poppler tesseract
  4. Run conda activate stopa_env
  5. Run pip install -U -r requirements.txt
  6. Run pip install -U -r requirements_ocr.txt

I believe this is an issue with the pip package repository, but I can provide details about our system, conda install, and the final stopa_env environment if those would be useful.
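
After following the steps above, the resulting environment can be sanity-checked from Python. The sketch below is only an illustration: `installed_version` and `major_version` are hypothetical helper names, not part of SToPA, and the check assumes the unintended package is registered under the distribution name `tesseract` and that PyPDF2 uses a plain `major.minor.patch` version string.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


def major_version(version_string):
    """Return the leading integer of a version string such as '2.12.1'."""
    return int(version_string.split(".")[0])


if __name__ == "__main__":
    # The PyPI distribution named "tesseract" is the unintended package from this issue.
    if installed_version("tesseract") is not None:
        print("Unintended PyPI package 'tesseract' is installed; consider removing it.")

    pypdf2 = installed_version("PyPDF2")
    if pypdf2 is None:
        print("PyPDF2 is not installed.")
    elif major_version(pypdf2) >= 3:
        print(f"PyPDF2 {pypdf2} is installed, but the steps above pin PyPDF2<3.0.")
    else:
        print(f"PyPDF2 {pypdf2} satisfies the PyPDF2<3.0 pin.")
```

This only inspects pip-visible distributions. It does not tell you whether the conda-installed tesseract binary is on the PATH.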

@annahaensch
Collaborator

Good catch! I think there's a solution that might be even a bit easier than that:

It looks like, at the moment, tesseract (the wrong package) is being installed from requirements_ocr.txt, but so is pytesseract (the correct package). And tesseract is never actually called anywhere else, so I suspect simply removing tesseract from requirements_ocr.txt might solve all of the problems with minimal overhead.

@panchyni
Author

panchyni commented Nov 3, 2023

Unfortunately, after removing the tesseract package from requirements_ocr.txt, I get the following error when running python pdf_to_parquet.py 2019:

Traceback (most recent call last):
  File "/mnt/ufs18/home-205/panchyni/QSIDE/SToPA/scripts/pdf_to_parquet.py", line 11, in <module>
    import src as tools
  File "/mnt/ufs18/home-205/panchyni/QSIDE/SToPA/scripts/../src/__init__.py", line 4, in <module>
    from .ocr_tools import *
  File "/mnt/ufs18/home-205/panchyni/QSIDE/SToPA/scripts/../src/ocr_tools.py", line 17, in <module>
    import src.settings as settings
  File "/mnt/ufs18/home-205/panchyni/QSIDE/SToPA/scripts/../src/settings.py", line 33, in <module>
    _is_tesseract_executable = os.access(shutil.which(pytesseract.pytesseract.tesseract_cmd), os.X_OK)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: access: path should be string, bytes or os.PathLike, not NoneType

The underlying issue seems to be that pytesseract.pytesseract.tesseract_cmd is defined as "tesseract", but there is no "tesseract" command in the environment's bin folder, only "pytesseract". As a result, shutil.which returns None, and passing None to os.access produces the error.
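
To illustrate the failure mode described above: `shutil.which` returns `None` for a command that is not on the PATH, and `os.access(None, ...)` raises the `TypeError` seen in the traceback. A minimal sketch of a guard follows. `is_executable_on_path` is a hypothetical helper, not something that exists in SToPA's settings.py.

```python
import os
import shutil


def is_executable_on_path(command):
    """Return True if `command` resolves to an executable file on PATH."""
    resolved = shutil.which(command)  # None when the command cannot be found
    if resolved is None:
        return False
    return os.access(resolved, os.X_OK)


if __name__ == "__main__":
    # "tesseract" is the default value of pytesseract.pytesseract.tesseract_cmd
    # according to the discussion above.
    print("tesseract available:", is_executable_on_path("tesseract"))
```

With a guard like this, a missing binary would show up as `False` rather than as a `TypeError` at import time. It would not make OCR work without the tesseract binary.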

I think adding pytesseract.pytesseract.tesseract_cmd = "pytesseract" might fix the above error, but I will need to test it in a python=3.9, PyPDF2<3.0 environment, because currently the change produces the same issues we had when installing tesseract via conda the first time.

@panchyni
Author

panchyni commented Nov 3, 2023

Even with the prior change + python=3.9, PyPDF2<3.0 I am getting the following error:


  File "/mnt/home/panchyni/anaconda3/envs/stopa_env_test/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['pytesseract', '--version']' returned non-zero exit status 1.

The issue appears to be that pytesseract has no --version option while tesseract does, so I was barking up the wrong tree with swapping the command.
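
The traceback above shows the configured command being run with `--version`. One way to check whether a given command accepts that option, before pointing `tesseract_cmd` at it, is sketched below. `version_output` is a hypothetical helper written for this illustration and is not part of pytesseract or SToPA.

```python
import subprocess


def version_output(command):
    """Run `command --version` and return its output, or None on failure."""
    try:
        result = subprocess.run(
            [command, "--version"],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit status
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers a missing or non-executable command.
        return None
    # Some programs print their version to stderr instead of stdout.
    return result.stdout or result.stderr


if __name__ == "__main__":
    for candidate in ("tesseract", "pytesseract"):
        print(candidate, "->", version_output(candidate))
```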

@cgross95

cgross95 commented Nov 3, 2023

Hi there, I'm @panchyni's colleague who contributed the installation instructions in the original issue. It looks like tesseract is a prerequisite for pytesseract since I believe the latter is just Python bindings for the command line program available in the former. So I think installing tesseract via conda is still necessary for OCR work.

I also believe that pinning some of the versions (like python=3.9 and PyPDF2<3.0) is necessary, because recent updates to the packages in requirements.txt mean they no longer play nicely together. I think a good middle ground could be to have a couple of conda environment.yml files (that include the pip dependencies) rather than pip requirements.txt files. If that sounds useful, I'd be happy to submit a pull request with some environment.yml files (one that just includes the base packages and one that also includes the OCR packages).
