This repository has been archived by the owner on May 14, 2022. It is now read-only.

When ingesting an item a derivative should be created which builds hOCR of the document via Tesseract and stores it #57

Closed
jpstroop opened this issue Sep 9, 2015 · 12 comments · Fixed by #387

@jpstroop
Member

jpstroop commented Sep 9, 2015

https://github.com/meh/ruby-tesseract-ocr seems to be the most active Ruby binding for Tesseract.

Users doing ingest may need to set language for best results. Should this be in Hydra::Derivatives?

@tpendragon
Contributor

Doesn't CC do this with Tika already?

@jpstroop
Member Author

jpstroop commented Sep 9, 2015

Not from images, I don't think. As far as I know it extracts text from document formats like Word, PDF, etc. I could be wrong, though. And we'd want zones.

@tpendragon
Contributor

There's a way to chain Tesseract into Tika, but the documentation seems troublesome.

@tpendragon
Contributor

Scope for this ticket: When ingesting an item a derivative should be created which builds hOCR of the document via Tesseract and stores it. Useful: https://github.com/meh/ruby-tesseract-ocr#hocr

@tpendragon changed the title from "OCR on ingest" to "When ingesting an item a derivative should be created which builds hOCR of the document via Tesseract and stores it" on Oct 27, 2015
@jpstroop
Member Author

Here's some sample hOCR: https://gist.github.com/jpstroop/abc27a5a87e2268fc184
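
For anyone who wants to see how zones could be pulled out of hOCR, here is a minimal sketch. It assumes Tesseract-style `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in the `title` attribute. The sample fragment, the `Word` struct, and `hocr_words` are all hypothetical and written by hand for illustration; the fragment is not copied from the gist. A regex is used only to keep the example dependency-free; a real implementation would more likely use an HTML parser, since this pattern will miss words whose text is wrapped in nested tags.

```ruby
# Hand-written hOCR fragment for illustration only (an assumption, not the gist contents).
SAMPLE_HOCR = <<~HOCR
  <div class='ocr_page' id='page_1' title='bbox 0 0 2000 3000; ppageno 0'>
    <span class='ocr_line' id='line_1' title='bbox 100 200 900 260'>
      <span class='ocrx_word' id='word_1' title='bbox 100 200 300 260; x_wconf 91'>Hello</span>
      <span class='ocrx_word' id='word_2' title='bbox 320 200 520 260; x_wconf 88'>world</span>
    </span>
  </div>
HOCR

# Hypothetical value object: one recognized word plus its bounding box in pixels.
Word = Struct.new(:text, :x0, :y0, :x1, :y1)

# Matches a word-level span and captures the four bbox numbers and the text.
WORD_PATTERN = %r{<span class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([^<]*)</span>}

# Returns an array of Word structs, in document order.
def hocr_words(hocr)
  hocr.scan(WORD_PATTERN).map do |x0, y0, x1, y1, text|
    Word.new(text.strip, x0.to_i, y0.to_i, x1.to_i, y1.to_i)
  end
end

words = hocr_words(SAMPLE_HOCR)
words.each { |w| puts "#{w.text}: #{w.x0},#{w.y0} to #{w.x1},#{w.y1}" }
```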

@tpendragon
Contributor

To close this, what needs to be supported:

  1. Searching in the catalog returns full text results (probably with highlighting)
  2. Annotation lists are linked to each canvas in the manifest, with the full text as annotations.
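
As a rough idea of what item 2 could look like, here is a sketch that turns word boxes into an annotation list for one canvas. It assumes the IIIF Presentation 2.x JSON shape (`sc:AnnotationList` with `oa:Annotation` resources targeting `#xywh=` regions). The `annotation_list` helper, the example URIs, and the shape of the `words` hashes are all hypothetical, not anything the application defines.

```ruby
require "json"

# Hypothetical helper: builds an annotation list hash for a single canvas.
# `words` is assumed to be an array of hashes with :text, :x, :y, :w, :h keys,
# where the coordinates are pixels on the canvas.
def annotation_list(list_uri, canvas_uri, words)
  resources = words.map do |word|
    {
      "@type" => "oa:Annotation",
      "motivation" => "sc:painting",
      "resource" => {
        "@type" => "cnt:ContentAsText",
        "format" => "text/plain",
        "chars" => word[:text]
      },
      # Target the region of the canvas that the word occupies.
      "on" => "#{canvas_uri}#xywh=#{word[:x]},#{word[:y]},#{word[:w]},#{word[:h]}"
    }
  end

  {
    "@context" => "http://iiif.io/api/presentation/2/context.json",
    "@id" => list_uri,
    "@type" => "sc:AnnotationList",
    "resources" => resources
  }
end

# Example URIs are placeholders.
list = annotation_list(
  "http://example.org/lists/1",
  "http://example.org/canvases/1",
  [{ text: "Hello", x: 100, y: 200, w: 200, h: 60 }]
)
puts JSON.pretty_generate(list)
```

The canvas in the manifest would then point at the list URI, for example through its `otherContent` property in Presentation 2.x.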

@jpstroop
Member Author

UI should support changing the language and the pagesegmode

pagesegmode values are:
  0 = Orientation and script detection (OSD) only.
  1 = Automatic page segmentation with OSD.
  2 = Automatic page segmentation, but no OSD, or OCR
  3 = Fully automatic page segmentation, but no OSD. (Default)
  4 = Assume a single column of text of variable sizes.
  5 = Assume a single uniform block of vertically aligned text.
  6 = Assume a single uniform block of text.
  7 = Treat the image as a single text line.
  8 = Treat the image as a single word.
  9 = Treat the image as a single word in a circle.
  10 = Treat the image as a single character.
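
If we end up shelling out, the language and pagesegmode settings could be passed through roughly like this. This is a sketch only: `tesseract_hocr_command` is a hypothetical helper, and it assumes the Tesseract 3.x command line, where the page segmentation flag is `-psm` and the `hocr` config name requests hOCR output (Tesseract 4 and later spell the flag `--psm`, so the flag is a parameter here).

```ruby
# Hypothetical helper, not part of any gem: builds the argument list for a
# Tesseract CLI call that writes hOCR for one image.
VALID_PAGE_SEG_MODES = (0..10).freeze

def tesseract_hocr_command(image_path, output_base, language: "eng", pagesegmode: 3, psm_flag: "-psm")
  unless pagesegmode.is_a?(Integer) && VALID_PAGE_SEG_MODES.cover?(pagesegmode)
    raise ArgumentError, "pagesegmode must be an integer from 0 to 10, got #{pagesegmode.inspect}"
  end

  [
    "tesseract", image_path, output_base,
    "-l", language,              # language code chosen at ingest
    psm_flag, pagesegmode.to_s,  # one of the modes listed above
    "hocr"                       # config name that switches output to hOCR
  ]
end

# Building the command does not run Tesseract; it only returns the arguments.
puts tesseract_hocr_command("page.tif", "page", language: "lat", pagesegmode: 4).join(" ")
```

Returning an array rather than a string means the result can be handed to `system(*cmd)` or `Open3.capture3(*cmd)` without going through a shell, which avoids quoting problems with file names.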

@tpendragon
Contributor

@jpstroop Do you have the file you ran that on? I'm looking at ruby-tesseract-ocr to see if we can avoid shelling out.

@tpendragon
Contributor

meh/ruby-tesseract-ocr#50: the Ruby library doesn't support Tesseract 3.04.

@jpstroop
Member Author

@tpendragon The files are in Slack.

@tpendragon
Contributor

With the gem, it looks like you can set the page segmentation mode with page_segmentation_mode = 4

@tpendragon
Contributor

Split out #381, #382, #383, #384, #385, #386 from this issue. This issue is now just "run Tesseract derivatives and store them".
