This codebase provides an example implementation of a custom classification model in 4 steps:
- Document Preparation (analyze_layout.py)
- Document Upload (upload_documents.py)
- Build Classifier (build_classifier.py)
- Classify Documents (classify_document.py)
To complete this workshop, you will need the following:
- Python 3.11 or higher (an Anaconda environment is recommended)
- Visual Studio Code with the Python and Jupyter extensions
- Access to Azure Cognitive Services
- Access to an Azure Storage Container
Before running the scripts, you need to set up your environment variables. Rename the .env.txt file to .env and include the following variables:
- AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT: The endpoint of your Document Intelligence resource.
- AZURE_DOCUMENT_INTELLIGENCE_KEY: Your Document Intelligence API key.
- AZURE_STORAGE_CONNECTION_STRING: The connection string to your Azure Storage service.
- AZURE_STORAGE_CONTAINER_NAME: The name of your Azure Blob Storage container.
- TRAINING_DOCUMENTS: The path to your training documents.
- TESTING_DOCUMENTS: The path to your testing documents.
- CLASSIFIER_ID: The model ID of your Document Intelligence classifier (leave this until after running build_classifier.py).
- BASE_CLASSIFIER_ID: The model ID of your base classifier (edit only if you want to perform incremental training on an existing classifier).
Please replace the placeholders with your actual values.
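Before running any of the scripts, it can help to confirm that the .env file is complete. The snippet below is a minimal sketch using only the standard library. The helper names `read_env_file` and `missing_vars` are hypothetical and not part of this codebase, and the scripts themselves may load the file differently (for example with python-dotenv). CLASSIFIER_ID and BASE_CLASSIFIER_ID are left out of the required list because they are filled in later or are optional.

```python
from pathlib import Path

# Variables needed from the start. CLASSIFIER_ID and BASE_CLASSIFIER_ID are
# deliberately excluded: one is set after training, the other is optional.
REQUIRED_VARS = [
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_DOCUMENT_INTELLIGENCE_KEY",
    "AZURE_STORAGE_CONNECTION_STRING",
    "AZURE_STORAGE_CONTAINER_NAME",
    "TRAINING_DOCUMENTS",
    "TESTING_DOCUMENTS",
]


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, since connection strings contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values, required=REQUIRED_VARS):
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not values.get(name)]
```

Calling `missing_vars(read_env_file(".env"))` returns an empty list when every required variable has a value.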
Your TRAINING_DOCUMENTS folder should be structured as shown below:
📂TRAINING_DOCUMENTS
┣ 📂DocumentType1
┃ ┣ 📜trainingFile1.ext
┃ ┣ 📜trainingFile2.ext
┃ ┣ 📜trainingFile3.ext
┃ ┣ 📜trainingFile4.ext
┃ ┣ 📜trainingFile5.ext
┃ ┗ 📜...
┣ 📂DocumentType2
┣ 📂...
You must include AT LEAST 5 training files for each type of document you wish to train the model on.
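The folder layout and the five-file minimum can be checked before any documents are sent to Azure. This is a rough sketch, not part of the provided scripts: the function names are hypothetical, and skipping files ending in `.ocr.json` is an assumption made so that generated layout results are not counted as training documents.

```python
from pathlib import Path

MIN_FILES_PER_TYPE = 5  # minimum stated in this README


def count_training_files(training_root):
    """Map each document-type folder name to its number of training files."""
    counts = {}
    for folder in sorted(Path(training_root).iterdir()):
        if not folder.is_dir():
            continue
        counts[folder.name] = sum(
            1
            for f in folder.iterdir()
            # Assumption: generated .ocr.json files are not training documents.
            if f.is_file() and not f.name.endswith(".ocr.json")
        )
    return counts


def underfilled_types(training_root, minimum=MIN_FILES_PER_TYPE):
    """Return the document types that have fewer than `minimum` files."""
    counts = count_training_files(training_root)
    return [name for name, n in counts.items() if n < minimum]
```

An empty list from `underfilled_types(path_to_training_documents)` means every document type meets the minimum.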
Install the required modules
pip install -r requirements.txt
This script uses the Document Intelligence layout model to analyze your training files and create corresponding .ocr.json files. These files are saved locally alongside your training data files and will be uploaded when running the upload_documents.py script.
python analyze_layout.py
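After the layout step, each training file should have a matching .ocr.json file next to it. The sketch below lists training files that still lack one. It assumes the result is named by appending `.ocr.json` to the full file name (for example `invoice.pdf` becomes `invoice.pdf.ocr.json`); check the output of analyze_layout.py to confirm this naming. Both helper names are hypothetical.

```python
from pathlib import Path


def ocr_path_for(document):
    """Return the assumed .ocr.json path for a training document."""
    document = Path(document)
    # Assumed naming: "invoice.pdf" -> "invoice.pdf.ocr.json"
    return document.with_name(document.name + ".ocr.json")


def files_missing_ocr(training_root):
    """List training files that have no .ocr.json file alongside them."""
    missing = []
    for path in sorted(Path(training_root).rglob("*")):
        if not path.is_file() or path.name.endswith(".ocr.json"):
            continue
        if not ocr_path_for(path).exists():
            missing.append(path)
    return missing
```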
This script uploads labeled data to your Azure Blob Storage container.
python upload_documents.py
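To preview what an upload might look like, the sketch below maps each local file to a blob name that keeps the document-type folder as a prefix. Whether upload_documents.py preserves the folder structure in exactly this way is an assumption, and `blob_names` is a hypothetical helper. No Azure call is made here.

```python
from pathlib import Path


def blob_names(training_root):
    """Map each local file path to a blob name relative to the training root."""
    root = Path(training_root)
    return {
        # Blob names use forward slashes on every operating system.
        str(path): path.relative_to(root).as_posix()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

With the azure-storage-blob package, each pair could then be passed to `ContainerClient.upload_blob(name, data, overwrite=True)`.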
This script demonstrates how to build a classifier model.
python build_classifier.py
Remember to copy the resulting classifier ID into the CLASSIFIER_ID variable in .env
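Updating .env can also be scripted. The following is a minimal sketch with a hypothetical helper, `set_env_value`, that replaces an existing KEY=VALUE line or appends one if the key is not present. It is not part of build_classifier.py, and the classifier ID in the usage note is a placeholder.

```python
from pathlib import Path


def set_env_value(env_path, key, value):
    """Set KEY=VALUE in a .env file, replacing the line if the key exists."""
    path = Path(env_path)
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    updated, found = [], False
    for line in lines:
        # Compare only the text before the first "=" so values are ignored.
        if line.split("=", 1)[0].strip() == key:
            updated.append(f"{key}={value}")
            found = True
        else:
            updated.append(line)
    if not found:
        updated.append(f"{key}={value}")
    path.write_text("\n".join(updated) + "\n", encoding="utf-8")
```

For example, `set_env_value(".env", "CLASSIFIER_ID", "my-classifier-id")` would record a placeholder ID.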
This script demonstrates how to classify a folder of documents using a trained document classifier.
python classify_document.py