OCR #46
Comments
No, they don't. We have Apache Tika embedded, which uses Google Tesseract
under the hood for OCR.
On Mon, Mar 11, 2019 at 5:42 PM dwmcqueen wrote:
Do the documents need to be OCRed prior to uploading?
--
Eric Detterman | VP and Global Head of Products and Solution Engineering,
LexPredict, LLC
I just attempted a clean reinstall and tried loading a document that had not been OCRed. I got this error: `Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06`
It looks like there is an issue with Tesseract in the latest version. I did a full clean reinstall of 1.1.9 and keep getting `ModuleNotFoundError: No module named 'textract.parsers.PDF_parser'`, even on previously OCRed text.
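For anyone trying to reproduce this, a minimal sketch of a check that could be run inside the container's Python environment to see whether the module named in the error can be located at all. The helper name `module_available` is hypothetical, and the lower-case `pdf_parser` spelling is an assumption added only for comparison; nothing in this thread confirms it is the cause.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the import system can locate the module `name`.

    Note: for a dotted name, find_spec imports the parent packages first.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError);
        # ValueError can occur when an already-loaded module has no __spec__.
        return False


if __name__ == "__main__":
    # The first name is copied from the error message above. The lower-case
    # variant is an assumption, included only to compare the two spellings.
    for candidate in ("textract.parsers.PDF_parser", "textract.parsers.pdf_parser"):
        print(candidate, "->", module_available(candidate))
```

If both names come back as unavailable, the `textract` package itself may be missing from the environment; if only one does, the spelling of the module name may be worth a closer look.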
If it helps, here is the output of docker ls:
Sorry for the frequent updates. I confirmed that running OCR locally on the document and re-uploading it allowed the standard Load Document task to function correctly, so something seems amiss with the Tesseract OCR process.
Do the documents need to be OCRed prior to uploading?