Using text-based features with a clustering algorithm to semi-automatically discover and annotate layout types of forms in document images
The Python code and notebooks in this directory can be used to speed up the labelling/annotation of layout types for form-like document images.
Work is in progress to extract this code and the overall approach into a separate UI tool that makes labelling even easier and faster. Until that tool is implemented and published, please add any related suggestions, issues or PRs to the current repo.
This section contains ready-to-reuse notebooks and Python scripts that help you discover the different layout types within your dataset of form images and annotate them for further use with Form Recognizer.
Files and folders structure:
../Form_Layout_Clustering
|--- notebooks/ # reusable notebooks showing end-to-end usage example
|--- src/ # all the heavy python code encapsulated into modules
|--- invoice_vocabulary.txt # example vocabulary for invoice documents
The main file of interest is the layout-clustering-and-labeling notebook, which shows an end-to-end example of how to implement this approach and reuse it with your own data.
In a nutshell, the approach described in this section comes down to the following steps:
- Extract text from document images with OCR software
- Process and clean the extracted text using regex and fuzzy matching/filtering against a word vocabulary
- Use a TF-IDF vectorizer with N-grams to generate a feature vector representing each document
- Apply density-based clustering to these features to extract groups of document images that share the same or a similar form layout type
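The last three steps can be illustrated with a minimal sketch that uses only the Python standard library. It is not the code from src/ or the notebooks: the names `VOCAB`, `clean`, `ngrams`, `tfidf` and `dbscan`, the sample strings, and the `cutoff`, `eps` and `min_samples` values are all assumptions chosen for illustration. The OCR step is skipped, and a few short strings stand in for OCR output.

```python
import math
import re
from collections import Counter
from difflib import get_close_matches

# Hypothetical vocabulary; invoice_vocabulary.txt plays this role in the repo.
VOCAB = ["invoice", "number", "date", "total", "amount", "due",
         "receipt", "item", "price", "tax"]


def clean(text, vocab=VOCAB, cutoff=0.8):
    """Keep alphabetic tokens and snap each to its closest vocabulary word.

    Tokens with no vocabulary word above the similarity cutoff are dropped.
    """
    kept = []
    for token in re.findall(r"[a-z]+", text.lower()):
        match = get_close_matches(token, vocab, n=1, cutoff=cutoff)
        if match:
            kept.append(match[0])
    return kept


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs):
    """Turn lists of terms into L2-normalised {term: weight} dictionaries."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        # Smoothed inverse document frequency, multiplied by the raw count.
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two L2-normalised sparse vectors."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(vectors, eps=0.5, min_samples=2):
    """Minimal DBSCAN: one label per vector, -1 marks noise."""
    n = len(vectors)
    neighbours = [[j for j in range(n)
                   if cosine_distance(vectors[i], vectors[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep growing
    return labels


# Stand-ins for OCR output, including typical recognition errors.
ocr_texts = [
    "INVOICE Numbr: 1001  Date 01/02/2020  Total amount due 250.00",
    "Invoice number 1002 date 03/02/2020 total amout due 99.10",
    "RECEIPT  item price tax  item price tax",
    "Recipt: item 4.50 price tax",
    "hello world",
]
docs = [ngrams(clean(text)) for text in ocr_texts]
labels = dbscan(tfidf(docs), eps=0.5, min_samples=2)
print(labels)
```

In this sketch the two invoice-like strings end up in one cluster, the two receipt-like strings in another, and the unrelated string is marked as noise. On a real dataset the vocabulary, the N-gram range and the clustering thresholds would need tuning, and a library implementation of TF-IDF and density-based clustering would likely be a better fit than hand-written versions.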
Feel free to reach out to Karol Zak in case of additional questions.
For an alternative approach, see the code accelerator Search based classification, a simple but effective search-based approach built on text features.