When ingesting an item a derivative should be created which builds hOCR of the document via Tesseract and stores it #57

jpstroop · 2015-09-09T14:24:51Z

https://github.com/meh/ruby-tesseract-ocr seems to be the most active.

Users doing ingest may need to set language for best results. Should this be in Hydra::Derivatives?

tpendragon · 2015-09-09T14:33:51Z

Doesn't CC do this with Tika already?

jpstroop · 2015-09-09T15:32:01Z

Not from images, I don't think. As far as I know it extracts text from document formats like Word, PDF, etc. I could be wrong, though. And we'd want zones.

tpendragon · 2015-09-09T15:33:48Z

There's a way to chain in Tesseract to Tika, but the documentation seems troublesome.

tpendragon · 2015-10-27T16:59:36Z

Scope for this ticket: When ingesting an item a derivative should be created which builds hOCR of the document via Tesseract and stores it. Useful: https://github.com/meh/ruby-tesseract-ocr#hocr

jpstroop · 2016-01-25T18:57:09Z

Here's some sample hOCR: https://gist.github.com/jpstroop/abc27a5a87e2268fc184

tpendragon · 2016-01-25T18:58:42Z

To close this, what needs to be supported:

Searching in the catalog returns full text results (probably with highlighting)
Annotation lists are linked to each canvas in the manifest, with the full text as annotations.
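As a rough illustration of the second point, here is a minimal Ruby sketch of how a word's hOCR bounding box could be turned into an annotation target on a canvas. It assumes the usual hOCR `title` attribute form (`bbox x0 y0 x1 y1`, optionally followed by other properties) and that the OCR'd image and the canvas share the same pixel dimensions. The method names `parse_bbox` and `xywh_target` and the example canvas URI are hypothetical, not part of this project or of any library.

```ruby
# Sketch only: parse_bbox and xywh_target are hypothetical helper names.

# Extracts [x0, y0, x1, y1] from an hOCR title attribute such as
# "bbox 36 92 580 122; x_wconf 93". Returns nil if no bbox is present.
def parse_bbox(title)
  match = title.match(/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)
  return nil unless match

  match.captures.map(&:to_i)
end

# Builds a canvas target with an xywh media fragment from an hOCR bbox.
# Assumes the canvas has the same pixel dimensions as the OCR'd image.
# Falls back to the whole canvas when the title carries no bbox.
def xywh_target(canvas_uri, title)
  bbox = parse_bbox(title)
  return canvas_uri if bbox.nil?

  x0, y0, x1, y1 = bbox
  "#{canvas_uri}#xywh=#{x0},#{y0},#{x1 - x0},#{y1 - y0}"
end

puts xywh_target("https://example.org/canvas/1", "bbox 36 92 580 122; x_wconf 93")
```

hOCR stores two corners while the fragment wants an origin plus width and height, so the conversion is just a subtraction. If the canvas were scaled relative to the source image, the coordinates would need scaling as well.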

jpstroop · 2016-01-25T18:59:33Z

UI should support changing the language and the pagesegmode

pagesegmode values are:
  0 = Orientation and script detection (OSD) only.
  1 = Automatic page segmentation with OSD.
  2 = Automatic page segmentation, but no OSD, or OCR
  3 = Fully automatic page segmentation, but no OSD. (Default)
  4 = Assume a single column of text of variable sizes.
  5 = Assume a single uniform block of vertically aligned text.
  6 = Assume a single uniform block of text.
  7 = Treat the image as a single text line.
  8 = Treat the image as a single word.
  9 = Treat the image as a single word in a circle.
  10 = Treat the image as a single character.
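If we do end up shelling out, the two settings above map onto the command line roughly as in this sketch. It only builds the argument list and does not run anything. The method name `hocr_command` and its keyword arguments are made up for illustration. The `-psm` flag is what the Tesseract 3.x CLI accepts (newer releases spell it `--psm`), and the trailing `hocr` is the config name that switches output to hOCR.

```ruby
# Sketch only: hocr_command is a hypothetical helper, not an existing API.
PAGE_SEG_MODES = (0..10).freeze

# Returns the argv array for a Tesseract hOCR run.
# language: a Tesseract language code such as "eng" or "deu"
# psm:      one of the pagesegmode values listed above (3 is the default)
# psm_flag: "-psm" for Tesseract 3.x; pass "--psm" for newer versions
def hocr_command(image_path, output_base, language: "eng", psm: 3, psm_flag: "-psm")
  unless PAGE_SEG_MODES.include?(psm)
    raise ArgumentError, "pagesegmode must be 0-10, got #{psm.inspect}"
  end

  ["tesseract", image_path, output_base, "-l", language, psm_flag, psm.to_s, "hocr"]
end

cmd = hocr_command("page.tif", "page", language: "deu", psm: 4)
puts cmd.join(" ")
# To actually run it: system(*cmd), which avoids shell quoting issues.
```

Passing the arguments as an array rather than a single string keeps user-supplied language values out of the shell, which matters if the UI lets people type them in.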

tpendragon · 2016-01-25T21:55:40Z

@jpstroop Do you have the file you ran that on? Looking at ruby-tesseract-ocr and seeing if we can avoid shelling out.

tpendragon · 2016-01-25T22:20:20Z

meh/ruby-tesseract-ocr#50: the Ruby library doesn't support Tesseract 3.04.

jpstroop · 2016-01-25T22:42:12Z

@tpendragon The files are in Slack.

tpendragon · 2016-01-26T00:01:50Z

With the gem it looks like you can set the page mode with page_segmentation_mode = 4

tpendragon · 2016-01-26T23:27:18Z

Split out #381, #382, #383, #384, #385, #386 from this issue. This issue is now just "run tesseract derivatives and store them"

tpendragon added the Sprint #2 label Oct 27, 2015

tpendragon changed the title ~~OCR on ingest~~ When ingesting an item a derivative should be created which builds hOCR of the document via Tesseract and stores it Oct 27, 2015

tpendragon removed the Sprint #2 label Jan 7, 2016

tpendragon added the ready label Jan 26, 2016

tpendragon mentioned this issue Jan 26, 2016

Generate hOCR for FileSets and make them searchable #387

Merged

tpendragon self-assigned this Jan 26, 2016

tpendragon added in progress and removed ready labels Jan 26, 2016

escowles closed this as completed in #387 Jan 27, 2016

escowles removed the in progress label Jan 27, 2016
