OCR #46

Open
dwmcqueen opened this issue Mar 11, 2019 · 5 comments

Comments

@dwmcqueen
Contributor

Do the documents need to be OCRed prior to uploading?

@ericlex

ericlex commented Mar 11, 2019 via email

@dwmcqueen
Contributor Author

I just did a clean reinstall and tried loading a document that had not been OCRed.

I got this error:

`Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Celery task id: fc37ca52-d218-4cdd-9a49-69bb95381e06

Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Start task "Load Documents", id=None
Kwargs: {'project': {'model': 'project.project', 'pk': 1}, 'source_data': '/', 'source_type': 'agreements', 'document_type': {'model': 'document.documenttype', 'pk': '68f992f1-dba3-4dc0-a815-4d868b23c5b4'}, 'detect_contract': True, 'delete': False, 'run_standard_locators': True, 'user_id': 1, 'metadata': {'result_links': [{'name': 'View Document List', 'link': 'document:document-list'}, {'name': 'View Text Unit List', 'link': 'document:text-unit-list'}]}, 'task_id': 'fc37ca52-d218-4cdd-9a49-69bb95381e06'}
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Parse / at NginxFileAccess: http://contrax-nginx:80/media/data/documents/
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Detected 1 files. Added 1 subtasks.
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Load Documents: starting 1 sub-tasks...
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:25 | End of main task "Load Documents", id=None. Sub-tasks may be still running.
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda
INFO 2019-03-19 23:07:25 | Trying TIKA for file: JS#52732.PDF
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda
ERROR 2019-03-19 23:07:26 | TIKA returned too small text for file: JS#52732.PDF
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda
INFO 2019-03-19 23:07:26 | Trying Textract for file: JS#52732.PDF
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda
INFO 2019-03-19 23:07:26 | Caught exception while trying to parse file with Textract: JS#52732.PDF
Traceback (most recent call last):
  File "/contraxsuite_services/apps/task/tasks.py", line 597, in try_parsing_with_textract
    return textract2text(file_path, ext=ext), 'textract'
  File "/contraxsuite_services/apps/task/utils/ocr/textract.py", line 116, in textract2text
    text = process(path, ext=ext, method='tesseract', language=language)
  File "/contraxsuite_services/apps/task/utils/ocr/textract.py", line 99, in process
    filetype_module = importlib.import_module(rel_module, 'textract.parsers')
  File "/contraxsuite_services/venv/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'textract.parsers.PDF_parser'`
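
One possible reading of the traceback above: the uploaded file is `JS#52732.PDF`, and the module that fails to import is `textract.parsers.PDF_parser`, while the textract package names its parser modules in lowercase (for example `pdf_parser`). On a case-sensitive filesystem those are different modules. The sketch below only illustrates that naming mismatch. `parser_module_name` is a hypothetical helper, not ContraxSuite's or textract's actual code, and the assumption that the module name is built directly from the file extension is inferred from the log, not confirmed in this thread.

```python
import os


def parser_module_name(filename, normalize=True):
    """Build a relative parser module name from a file extension.

    Hypothetical helper: it mimics the pattern suggested by the traceback,
    where a relative name is resolved against the 'textract.parsers' package.
    """
    ext = os.path.splitext(filename)[1]  # e.g. '.PDF'
    if normalize:
        ext = ext.lower()  # '.PDF' becomes '.pdf'
    return ext + "_parser"  # relative module name, e.g. '.pdf_parser'


# Without normalization the name matches the module reported in the error.
print(parser_module_name("JS#52732.PDF", normalize=False))  # .PDF_parser
# With normalization it matches textract's lowercase module naming.
print(parser_module_name("JS#52732.PDF"))  # .pdf_parser
```

If this reading is right, the failure would depend on the case of the file extension and not on Tesseract itself, which is worth checking before digging into the OCR setup.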

@dwmcqueen
Contributor Author

It looks like there is an issue with Tesseract in the latest version. I did a full clean reinstall of 1.1.9 and keep getting `ModuleNotFoundError: No module named 'textract.parsers.PDF_parser'`, even on previously OCRed text.

dwmcqueen reopened this issue on Mar 20, 2019
@dwmcqueen
Contributor Author

If it helps, here is the output of `docker service ls`:

ub5b48qsfg0s contraxsuite_contrax-celery global 1/1 lexpredict/lexpredict-contraxsuite:latest
ngb0mq80ze6g contraxsuite_contrax-celery-beat replicated 1/1 lexpredict/lexpredict-contraxsuite:latest
lzbuwjlkxfx4 contraxsuite_contrax-curator_filebeat replicated 1/1 stefanprodan/es-curator-cron:latest
pn8w3ejqmsuf contraxsuite_contrax-curator_metricbeat replicated 0/0 stefanprodan/es-curator-cron:latest
p928pz2n09ym contraxsuite_contrax-db replicated 1/1 postgres:9.6
tmpz5r4tkhcb contraxsuite_contrax-elasticsearch replicated 1/1 docker.elastic.co/elasticsearch/elasticsearch-oss:6.2.4
w8nwy98y4rlj contraxsuite_contrax-filebeat global 1/1 docker.elastic.co/beats/filebeat:6.2.4
ir5yt9t1kg47 contraxsuite_contrax-flower replicated 0/0 lexpredict/lexpredict-contraxsuite:latest
pock348z204w contraxsuite_contrax-jupyter replicated 1/1 lexpredict/lexpredict-contraxsuite:latest
seulb1l7wcya contraxsuite_contrax-kibana replicated 1/1 docker.elastic.co/kibana/kibana-oss:6.2.4
us12mggxpgz5 contraxsuite_contrax-logrotate global 1/1 tutum/logrotate:latest
m3cwbg5xibfj contraxsuite_contrax-metricbeat replicated 0/0 docker.elastic.co/beats/metricbeat:6.2.4
l4d2wnujj4gw contraxsuite_contrax-nginx replicated 1/1 nginx:stable *:80->8080/tcp, *:443->4443/tcp
lqo0l3ubbsz7 contraxsuite_contrax-rabbitmq replicated 1/1 rabbitmq:3-management
uul2xgwxo17u contraxsuite_contrax-tika global 1/1 lexpredict/tika-server:latest
azlhtr3dv8nn contraxsuite_contrax-uwsgi replicated 1/1 lexpredict/lexpredict-contraxsuite:latest

@dwmcqueen
Contributor Author

Sorry for the frequent updates. I did confirm that running OCR locally on the document and re-uploading it allowed the standard Load Documents task to complete correctly. So something seems amiss with the Tesseract OCR step.
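
A second workaround that might be worth trying, assuming the uppercase `.PDF` extension in the log is what leads to the `PDF_parser` module name (an unconfirmed guess): rename files to a lowercase extension before uploading. The sketch below is a minimal, hypothetical pre-upload step. `lowercase_extensions` and the folder layout are illustrative and not part of ContraxSuite.

```python
from pathlib import Path


def lowercase_extensions(folder):
    """Rename files in `folder` so their extension is lowercase.

    Hypothetical pre-upload helper. Returns the list of new paths.
    """
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix != path.suffix.lower():
            target = path.with_suffix(path.suffix.lower())
            if target.exists():
                continue  # never overwrite an existing file
            path.rename(target)
            renamed.append(target)
    return renamed
```

Called on the folder of documents to upload, this would turn `JS#52732.PDF` into `JS#52732.pdf` and leave files that already have a lowercase extension alone. It does not replace a fix in the loader; it only helps test whether the extension case is the trigger.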
