Problem setting the OCR Environment #46

panchyni · 2023-11-03T16:55:25Z

Using the current Setup Environment Instructions, when pip tries to install 'tesseract' it installs this package (https://pypi.org/project/tesseract/), which is not the intended one. A colleague of mine was able to fix this error by installing tesseract through conda instead of pip, but it requires some modifications to other packages:

In requirements.txt change PyPDF2 to PyPDF2<3.0
In requirements_ocr.txt remove tesseract
Run conda create -n stopa_env python=3.9 pip poppler tesseract
Run conda activate stopa_env
Run pip install -U -r requirements.txt
Run pip install -U -r requirements_ocr.txt

I believe this is an issue with the pip package repository, but I can provide details about our system, conda install, and the final stopa_env environment if those would be useful.

annahaensch · 2023-11-03T18:07:23Z

Good catch! I think there's a solution that might be even a bit easier than that:

It looks like at the moment tesseract (the wrong package) is being installed from requirements_ocr.txt, but so is pytesseract (the correct package). And actually tesseract is never called anywhere else, so I suspect simply removing tesseract from requirements_ocr.txt might solve all of the problems with minimal overhead.

panchyni · 2023-11-03T18:38:08Z

Unfortunately, after removing the tesseract package from requirements_ocr.txt, I get the following error when running python pdf_to_parquet.py 2019:

Traceback (most recent call last):
  File "/mnt/ufs18/home-205/panchyni/QSIDE/SToPA/scripts/pdf_to_parquet.py", line 11, in <module>
    import src as tools
  File "/mnt/ufs18/home-205/panchyni/QSIDE/SToPA/scripts/../src/__init__.py", line 4, in <module>
    from .ocr_tools import *
  File "/mnt/ufs18/home-205/panchyni/QSIDE/SToPA/scripts/../src/ocr_tools.py", line 17, in <module>
    import src.settings as settings
  File "/mnt/ufs18/home-205/panchyni/QSIDE/SToPA/scripts/../src/settings.py", line 33, in <module>
    _is_tesseract_executable = os.access(shutil.which(pytesseract.pytesseract.tesseract_cmd), os.X_OK)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: access: path should be string, bytes or os.PathLike, not NoneType

The underlying issue seems to be that pytesseract.pytesseract.tesseract_cmd is defined as "tesseract" but there is no "tesseract" command in the environment bin folder, only "pytesseract". As such, shutil.which returns None, and passing that to os.access produces the error.
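To illustrate the failure mode, here is a minimal sketch of a check that tolerates a missing binary. The helper name is_command_available is hypothetical and is not taken from src/settings.py. It relies on shutil.which already testing for execute permission by default, so a separate os.access call is not needed in this sketch.

```python
import shutil


def is_command_available(cmd):
    """Return True if `cmd` resolves to an executable on PATH, else False.

    Hypothetical helper for illustration; not part of the SToPA code base.
    """
    # shutil.which returns None when nothing matches. Passing that None
    # straight to os.access is what raises the TypeError in the traceback.
    resolved = shutil.which(cmd)
    return resolved is not None


# "tesseract" is the value of pytesseract.pytesseract.tesseract_cmd
# reported in this thread.
if not is_command_available("tesseract"):
    print("tesseract binary not found on PATH; it may need to be installed, e.g. via conda")
```

With a check like this, a missing tesseract binary would show up as a readable message at import time instead of a TypeError.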

I think adding pytesseract.pytesseract.tesseract_cmd = "pytesseract" might fix the above error, but I will need to test it in a python=3.9, PyPDF2<3.0 environment because currently the change produces the same issues we had when installing tesseract via conda the first time.

panchyni · 2023-11-03T19:51:57Z

Even with the prior change + python=3.9, PyPDF2<3.0 I am getting the following error:


  File "/mnt/home/panchyni/anaconda3/envs/stopa_env_test/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['pytesseract', '--version']' returned non-zero exit status 1.

The issue appears to be that pytesseract has no --version option while tesseract does, so I was barking up the wrong tree with swapping the command.
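As a rough sketch of how one might confirm that the real tesseract binary is the one being called, the snippet below runs the command with --version and returns None instead of raising when the command is missing or exits with a non-zero status. The function name get_tesseract_version is hypothetical, and reading from either stdout or stderr is an assumption made because the output stream may differ between tesseract versions.

```python
import subprocess


def get_tesseract_version(cmd="tesseract"):
    """Return the first line printed by `cmd --version`, or None on failure.

    Hypothetical helper for illustration; not part of pytesseract or SToPA.
    """
    try:
        result = subprocess.run(
            [cmd, "--version"],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit status
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers a missing or non-executable command;
        # CalledProcessError covers the non-zero exit seen with "pytesseract".
        return None
    # Assumption: the version text may arrive on either stream.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


version = get_tesseract_version()
print(version if version else "no working tesseract binary found")
```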

cgross95 · 2023-11-03T21:11:26Z

Hi there, I'm @panchyni's colleague who contributed the installation instructions in the original issue. It looks like tesseract is a prerequisite for pytesseract since I believe the latter is just Python bindings for the command line program available in the former. So I think installing tesseract via conda is still necessary for OCR work.

I also believe that pinning some of the versions (like python=3.9 and PyPDF2<3.0) is necessary due to recent updates in the packages in requirements.txt that make things not play nicely together. I think a good middle ground could be to have a couple of conda environment.yml files (that include the pip dependencies) rather than pip requirements.txt files. If that sounds useful, I'd be happy to submit a pull request with some environment.yml files (one that just includes the base packages and one that also includes the OCR packages).
