Using text-based features with a clustering algorithm to semi-automatically discover and annotate layout types for different forms in document images

The Python code and notebooks in this directory can be used to speed up the labelling/annotation of layout types in form-like document images.

Work is in progress to extract this code and the overall approach into a separate UI tool that makes labelling even easier and faster. Until it is implemented and published, please add any related suggestions, issues or PRs to the current repo.

Table of contents

Getting started
Approach in a nutshell

1. Getting started

This section provides ready-to-reuse notebooks and Python scripts that help you discover the different layout types within your dataset of form images and annotate them for further use with Form Recognizer.

Files and folders structure:

../Form_Layout_Clustering
    |--- notebooks/  # reusable notebooks showing end-to-end usage example
    |--- src/  # all the heavy python code encapsulated into modules
    |--- invoice_vocabulary.txt  # example vocabulary for invoice documents

The main file of interest is the layout-clustering-and-labeling notebook, which shows an end-to-end example of how to implement this approach and reuse it with your own data.

2. Approach in a nutshell

The approach can be summarized in the following steps:

Extract text from document images with OCR software
Process and clean the extracted text using regex and fuzzy matching/filtering based on a word vocabulary
Use a TF-IDF vectorizer with n-grams to generate a feature vector representing each document
Apply density-based clustering to these features to extract groups of document images that share the same or a similar form layout type
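
The steps above can be illustrated with a minimal, standard-library-only sketch. It is not the code from the notebooks: the vocabulary, the sample OCR strings, the thresholds (`cutoff`, `eps`, `min_samples`) and all function names are hypothetical, and the OCR step is assumed to have already produced raw text. In practice, a library such as scikit-learn (`TfidfVectorizer`, `DBSCAN`) would likely replace the hand-written TF-IDF and clustering parts.

```python
import math
import re
from collections import Counter
from difflib import get_close_matches

# Hypothetical vocabulary; a real one could be loaded from a vocabulary file.
VOCABULARY = ["invoice", "number", "date", "total", "amount",
              "purchase", "order", "supplier", "delivery"]

def clean_text(raw, vocabulary, cutoff=0.75):
    """Tokenize with a regex, then keep only tokens that fuzzy-match the vocabulary."""
    kept = []
    for token in re.findall(r"[a-z0-9]+", raw.lower()):
        match = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        if match:
            kept.append(match[0])  # map OCR typos such as "inv0ice" to "invoice"
    return kept

def ngrams(tokens, max_n=2):
    """Return all word n-grams of length 1..max_n as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_vectors(docs_tokens, max_n=2):
    """Build L2-normalized sparse TF-IDF vectors (dicts) with a smoothed IDF."""
    counts = [Counter(ngrams(tokens, max_n)) for tokens in docs_tokens]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(counts)
    vectors = []
    for c in counts:
        vec = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors

def cosine_distance(a, b):
    """Cosine distance between two L2-normalized sparse vectors."""
    return 1.0 - sum(v * b.get(t, 0.0) for t, v in a.items())

def dbscan(vectors, eps=0.5, min_samples=2):
    """Minimal density-based clustering; returns one label per document, -1 = noise."""
    n = len(vectors)
    neighbours = [[j for j in range(n)
                   if cosine_distance(vectors[i], vectors[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels

# Hypothetical OCR output for five scanned documents.
ocr_texts = [
    "INVOICE Number: 1001 Date: 01/02/2020 Total amount: 250.00",
    "INV0ICE Number: 1002 Date: 03/02/2020 Total amount: 99.90",
    "PURCHASE ORDER Supplier: Acme Delivery: 5 days",
    "PURCHASE 0RDER Supplier: Contoso Delivery: 7 days",
    "Hello world, nothing form-like here",
]

cleaned = [clean_text(text, VOCABULARY) for text in ocr_texts]
labels = dbscan(tfidf_vectors(cleaned))
for text, label in zip(ocr_texts, labels):
    print(label, "|", text)
```

Documents that share a label are candidates for the same layout type, and a label of -1 marks documents that fit no dense group and probably need manual review. The vocabulary filter is what lets the two invoice scans land together despite the OCR typo, so the choice of vocabulary and `cutoff` has a large effect on the resulting clusters.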

Feel free to reach out to Karol Zak in case of additional questions.

For an alternative approach, see the code accelerator Search based classification, a simple but effective search-based approach on text features.

