hOCRpy

This package extracts text, bounding box, and confidence score information from the structured output of OCR systems like Tesseract. This output, which is called hOCR, is a useful data representation for identifying page formats, messy OCR, and more. In addition to providing a wrapper around hOCR data, hOCRpy enables page rendering (for corpus exploration) and several ways of analyzing the data, including:

Bounding box metrics
Format prediction
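
As a rough illustration of what a bounding box metric can look like, the sketch below computes the width, height, and area of a few boxes. It is not hOCRpy's own implementation: the `bbox_size` helper and the sample data are hypothetical, and the boxes are assumed to follow the `[x0, y0, x1, y1]` layout used by hOCR.

```python
from statistics import mean

# Hypothetical sample boxes, assumed to be [x0, y0, x1, y1] in pixels.
bboxes = [
    [193, 157, 245, 180],
    [256, 157, 304, 180],
    [315, 158, 360, 181],
]


def bbox_size(bbox):
    """Return (width, height, area) for one [x0, y0, x1, y1] box."""
    x0, y0, x1, y1 = bbox
    width, height = x1 - x0, y1 - y0
    return width, height, width * height


sizes = [bbox_size(b) for b in bboxes]

# Mean token height is a rough proxy for the font size on a page.
mean_height = mean(height for _, height, _ in sizes)
print(sizes, mean_height)
```

Summary statistics like these can help separate headings from body text, or one page layout from another.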

Basic Usage

hOCRpy parses an hOCR file directly from a file path.

from hOCRpy import hOCR

path = 'examples/hocr/one_column.hocr'
hocr = hOCR(path)

# Get tokens, their bounding boxes, and their confidence scores
for token, bbox, score in zip(hocr.tokens, hocr.bboxes, hocr.scores):
    print(token, bbox, score)
>> The [193, 157, 245, 180] 0.96
>> Life [256, 157, 304, 180] 0.96
>> and [315, 158, 360, 181] 0.96
>> Work [371, 158, 445, 181] 0.96
>> of [456, 158, 483, 181] 0.96
>> [...]

# Return a plaintext blob
hocr.text
>> 'The Life and Work of...'

# Number of tokens
hocr.num_tokens
>> 324

# Average confidence score
import numpy as np

np.mean(hocr.scores)
>> 0.9364197530864197
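
To show where tokens, bounding boxes, and scores come from, here is a minimal sketch that reads them out of a raw hOCR fragment using only the standard library, then flags low-confidence tokens. It is not how hOCRpy parses files. The fragment, the `parse_words` helper, the 0.5 threshold, and the division of `x_wconf` by 100 are all assumptions made for illustration.

```python
import re

# Hypothetical hOCR fragment in the style Tesseract emits: each word is a
# span whose title attribute holds the bounding box and a 0-100 confidence.
SAMPLE = """
<span class='ocr_line' title='bbox 193 157 483 181'>
  <span class='ocrx_word' title='bbox 193 157 245 180; x_wconf 96'>The</span>
  <span class='ocrx_word' title='bbox 256 157 304 180; x_wconf 96'>Life</span>
  <span class='ocrx_word' title='bbox 315 158 360 181; x_wconf 41'>amd</span>
</span>
"""

# A regex is enough for this fixed sample; real files call for an HTML parser.
WORD = re.compile(
    r"<span class='ocrx_word'[^>]*"
    r"title='bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)'[^>]*>"
    r"([^<]*)</span>"
)


def parse_words(hocr_text):
    """Return a list of (token, [x0, y0, x1, y1], score) tuples."""
    words = []
    for match in WORD.finditer(hocr_text):
        *coords, conf, token = match.groups()
        # Assumption: rescale the 0-100 confidence to the 0-1 range.
        words.append((token, [int(c) for c in coords], int(conf) / 100))
    return words


words = parse_words(SAMPLE)

# Tokens below an arbitrary threshold are candidates for messy OCR.
low_conf = [token for token, _, score in words if score < 0.5]
print(words)
print(low_conf)
```

The same filter works on real data by zipping `hocr.tokens` and `hocr.scores`, as in the loop above.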

During corpus exploration, it's often helpful to get a high-level overview of a page's structure.

hocr.show_structure(which='token')

Other options include area, paragraph, and line. In addition to these, it's possible to re-render the entire page, fitting each token back into its respective bounding box.

hocr.show_page(outline=None, scale=True)
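
The idea of placing tokens back at their bounding boxes can be sketched without any imaging library by drawing onto a coarse character grid. This is only an illustration of the concept, not what `show_page` does internally. The `render_ascii` function, the cell sizes, and the sample data are hypothetical.

```python
def render_ascii(tokens, bboxes, cell_w=10, cell_h=25):
    """Write each token onto a character grid at the cell where its
    [x0, y0, x1, y1] bounding box starts."""
    cols = max(b[2] for b in bboxes) // cell_w + 1
    rows = max(b[3] for b in bboxes) // cell_h + 1
    grid = [[" "] * cols for _ in range(rows)]
    for token, (x0, y0, x1, y1) in zip(tokens, bboxes):
        row, col = y0 // cell_h, x0 // cell_w
        for i, ch in enumerate(token):
            if col + i < cols:  # drop characters that run off the grid
                grid[row][col + i] = ch
    return "\n".join("".join(r).rstrip() for r in grid)


# Hypothetical sample data in the same shape as hocr.tokens and hocr.bboxes.
tokens = ["The", "Life", "and"]
bboxes = [
    [193, 157, 245, 180],
    [256, 157, 304, 180],
    [315, 158, 360, 181],
]

page = render_ascii(tokens, bboxes)
print(page)
```

Because the grid is so coarse, tokens that sit close together may overwrite each other. A real renderer would draw text at pixel coordinates instead.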

See analysis.ipynb for a demonstration of how hOCRpy may be used to analyze hOCR data.

t-shoemaker/hOCRpy
