When ingesting an item a derivative should be created which builds hOCR of the document via Tesseract and stores it #57
Comments
Doesn't CC do this with Tika already?
Not from images, I don't think. I believe it only extracts text from document formats like Word, PDF, etc. I could be wrong, though. And we'd want zones.
There's a way to chain Tesseract into Tika, but the documentation seems troublesome.
Scope for this ticket: When ingesting an item a derivative should be created which builds hOCR of the document via Tesseract and stores it. Useful: https://github.com/meh/ruby-tesseract-ocr#hocr
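For reference, a minimal sketch of the derivative step using the Tesseract command line rather than the gem, as a fallback in case the gem's Tesseract 3.04 incompatibility can't be worked around. The helper names `hocr_command` and `create_hocr` are hypothetical, not part of any library, and the output extension is assumed to be either `.hocr` or `.html` depending on the Tesseract version installed.

```ruby
require "open3"

# Hypothetical helper: builds the argv for a Tesseract CLI run that writes hOCR.
# `output_base` is the output path without an extension; Tesseract appends one.
# The trailing "hocr" selects Tesseract's hOCR output config.
def hocr_command(image_path, output_base, language: "eng")
  ["tesseract", image_path, output_base, "-l", language, "hocr"]
end

# Hypothetical helper: runs Tesseract and returns the path of the hOCR file,
# or nil if no output file is found. Assumes `tesseract` is on the PATH.
def create_hocr(image_path, output_base, language: "eng")
  command = hocr_command(image_path, output_base, language: language)
  _stdout, stderr, status = Open3.capture3(*command)
  raise "tesseract failed: #{stderr}" unless status.success?

  # Assumption: the extension differs between Tesseract versions, so check both.
  ["#{output_base}.hocr", "#{output_base}.html"].find { |path| File.exist?(path) }
end
```

Passing the command as an array avoids shell interpolation of file names, and keeping `language` as a keyword argument leaves room for the ingest UI to supply it.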
Here's some sample hOCR: https://gist.github.com/jpstroop/abc27a5a87e2268fc184
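To illustrate what the zones in that kind of output give us, here is a rough sketch that pulls word text and bounding boxes out of hOCR markup. It assumes Tesseract-style `ocrx_word` spans with `class` appearing before `title` and the `bbox` stored in the `title` attribute. `HocrWord` and `hocr_words` are hypothetical names, and a regex is used only to keep the sketch dependency-free; a real implementation would more likely use an HTML parser.

```ruby
# Hypothetical value object for one recognized word and its zone on the page.
HocrWord = Struct.new(:text, :x0, :y0, :x1, :y1)

# Matches a word span and captures its bbox coordinates and inner markup.
# Assumption: the class attribute precedes the title attribute.
OCRX_WORD = /<span[^>]*class=['"]ocrx_word['"][^>]*title=['"][^'"]*?bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>(.*?)<\/span>/m

# Returns an array of HocrWord for every ocrx_word span in the hOCR string.
def hocr_words(hocr)
  hocr.scan(OCRX_WORD).map do |x0, y0, x1, y1, inner|
    # Drop nested inline tags such as <strong> or <em> around the word text.
    text = inner.gsub(/<[^>]+>/, "").strip
    HocrWord.new(text, x0.to_i, y0.to_i, x1.to_i, y1.to_i)
  end
end
```

Coordinates are in pixels of the source image, so they can be scaled to whatever size the viewer displays.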
To close this, what needs to be supported:
UI should support changing the language and the
@jpstroop Do you have the file you ran that on? I'm looking at ruby-tesseract-ocr to see if we can avoid shelling out.
meh/ruby-tesseract-ocr#50: the Ruby library doesn't support Tesseract 3.04.
@tpendragon The files are in Slack.
With the gem it looks like you can set the page mode with |
https://github.com/meh/ruby-tesseract-ocr seems to be the most active.
Users doing ingest may need to set language for best results. Should this be in Hydra::Derivatives?