Digitize Text (pre-processing)

As society beecomes more reliant on technology there is an inccrease desire to transition materials that were once hard-copy (e.g., handwritten cursive, handwritten printed and type-writer documents) into electronically saved. The goal of this project is to explore how NLP techniques can assist in cleaning materials after extracting text from images using Optical Character Recognition (OCR) tool, Tesseract. The techniques explored include fine-tuning the pre-trained language model, GPT2.

NOTE: OCR is primarily developed for extracting printed material from images of text.

Code to Run:

Pre-Training Model:

File Name: gpt-train.ipynb
Change paths to where data is located on local machine
Run code blocks in order

Warning: May run into memory issues, so we advise to use Koa's NV-H100 GPU

Prompt Engineering

File Name: typed_text_OCR/SmartDoc_OCR_post_processing.ipynb
Change paths to where data is located on local machine
Run code blocks in order

Evaluating Data

Our primary evaluation metric is cosine similarity

File Name: gpt-evaluation-save.ipynb
Change paths to where processed data is located on local machine
Run code blocks in order

Tesseract

File Name: OCR_tesseract_digitize.ipynb
Use function run_tesseract_on_images and provide the relative input and output directories
For SmartDoc data set run typed_text_OCR/SmartDoc_OCR.ipynb blocks in order

Warning: Second run_tesseract_on_images function concatenates the OCR from multiple images

Data

Structure to access pre-processed data:

All data is split and located in dataForModel
Within each split of data there is the Betham data and the IAM data, and each folder further has the GT and OCR.
Train: Val: Test == 70:20:10

Important to Note for Raw Data Pre-Processing:

IAM Database

Forms => images of whole pieces of paper, GT data is located on the forms
used lines to extract the handwriting from files and concatenated to be one line in .txt
GT was located in xml files based on position, and extracted as one line in .txt file

Bentham Dataset

GT in XML files so extracted as one line in .txt
Pages has whole handwritten no GT on top. There is \n throughout OCR

SmartDoc dataset

Available for download here: https://zenodo.org/records/2572929
Ground Truth is just text with no positional information besides order

Uncommon Imports

Importing Google's PyTesseract

use brew install tesseract
to find executable which tesseract

Authors

Joel Nicolow
Amanda Nitta
Jan (Mark) Schittenhelm

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Digitize Text (pre-processing)

Code to Run:

Pre-Training Model:

Prompt Engineering

Evaluating Data

Tesseract

Data

Structure to access pre-processed data:

Important to Note for Raw Data Pre-Processing:

IAM Database

Bentham Dataset

SmartDoc dataset

Uncommon Imports

Importing Google's PyTesseract

Authors

Files

README.md

Latest commit

History

README.md

File metadata and controls

Digitize Text (pre-processing)

Code to Run:

Pre-Training Model:

Prompt Engineering

Evaluating Data

Tesseract

Data

Structure to access pre-processed data:

Important to Note for Raw Data Pre-Processing:

IAM Database

Bentham Dataset

SmartDoc dataset

Uncommon Imports

Importing Google's PyTesseract

Authors