t-shoemaker/hOCRpy

hOCRpy

This package extracts text, bounding box, and confidence score information from the structured output of OCR systems like Tesseract. This output, which is called hOCR, is a useful data representation for identifying page formats, messy OCR, and more. In addition to providing a wrapper around hOCR data, hOCRpy enables page rendering (for corpus exploration) and several ways of analyzing the data, including:

  1. Bounding box metrics
  2. Format prediction
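
For readers new to the format: hOCR is HTML in which each recognized word is a span whose title attribute carries the bounding box and confidence. The sketch below pulls those three fields out of one Tesseract-style word element using only the standard library. It is an illustration of the data format, not hOCRpy's own parser; the sample markup, the `parse_words` helper, and the rescaling of confidence from 0-100 to 0-1 are assumptions made for the example.

```python
import re

# One word element in the style Tesseract writes to hOCR output (sample data).
SAMPLE = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 193 157 245 180; x_wconf 96'>The</span>"
)

# Capture the title attribute and the text of each ocrx_word span.
WORD_RE = re.compile(
    r"<span class='ocrx_word'[^>]*title='([^']*)'[^>]*>([^<]*)</span>"
)


def parse_words(markup):
    """Return a (token, bbox, score) tuple for each ocrx_word span."""
    words = []
    for title, token in WORD_RE.findall(markup):
        # The title holds ';'-separated properties, e.g.
        # 'bbox x0 y0 x1 y1' and 'x_wconf N'.
        props = {}
        for part in title.split(';'):
            name, _, value = part.strip().partition(' ')
            props[name] = value
        bbox = [int(v) for v in props['bbox'].split()]
        # Assumption: confidence is rescaled from 0-100 to 0-1.
        score = float(props['x_wconf']) / 100
        words.append((token, bbox, score))
    return words


words = parse_words(SAMPLE)
print(words)
```

A regular expression is enough for a single well-formed span like this one; real hOCR files are better handled with a proper HTML parser.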

Basic Usage

hOCRpy will automatically parse a hOCR file from a filepath.

from hOCRpy import hOCR

path = 'examples/hocr/one_column.hocr'
hocr = hOCR(path)

# Get tokens, their bounding boxes, and their confidence scores
for token, bbox, score in zip(hocr.tokens, hocr.bboxes, hocr.scores):
    print(token, bbox, score)
>> The [193, 157, 245, 180] 0.96
>> Life [256, 157, 304, 180] 0.96
>> and [315, 158, 360, 181] 0.96
>> Work [371, 158, 445, 181] 0.96
>> of [456, 158, 483, 181] 0.96
>> [...]

# Return a plaintext blob
hocr.text
>> 'The Life and Work of...'

# Number of tokens
hocr.num_tokens
>> 324

# Average confidence score
import numpy as np

np.mean(hocr.scores)
>> 0.9364197530864197
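
Beyond the page-level average, the per-token scores can point to where the OCR is messy. A minimal sketch, assuming the scores are floats between 0 and 1 as in the output above: the two lists stand in for `hocr.tokens` and `hocr.scores`, and the `low_confidence` helper, its threshold, and the sample values are hypothetical, not part of hOCRpy.

```python
# Stand-ins for hocr.tokens and hocr.scores (hypothetical values).
tokens = ['The', 'Life', 'and', 'W0rk', 'of']
scores = [0.96, 0.96, 0.96, 0.41, 0.88]


def low_confidence(tokens, scores, threshold=0.9):
    """Return (index, token, score) for every token scoring below threshold."""
    return [
        (i, token, score)
        for i, (token, score) in enumerate(zip(tokens, scores))
        if score < threshold
    ]


flagged = low_confidence(tokens, scores)

# Share of tokens on the page that fall below the threshold.
share = len(flagged) / len(tokens)
print(flagged, share)
```

The right threshold depends on the corpus and the OCR engine, so treat 0.9 as a starting point rather than a recommendation.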

During corpus exploration, it's often helpful to get a high-level overview of a page's structure.

hocr.show_structure(which='token')

Other options for the which argument include area, paragraph, and line. In addition to these, it's possible to re-render the entire page, fitting each token back into its respective bounding box.

hocr.show_page(outline=None, scale=True)

See analysis.ipynb for a demonstration of how hOCRpy may be used to analyze hOCR data.

About

Parsing and analyzing hOCR data
