Classify Maps

At CER, we receive applications from companies containing thousands of pages of documents. We want to develop a machine learning algorithm to differentiate the pages that are maps (also known as alignment sheets) from pages that are not maps.

Sample Maps:

map_1.png map_2.png map_3.png

Sample Non-Maps:

page_1.png page_2.png page_3.png

Approach

For the problem stated above we will use classification algorithms trained on features extracted from each page: the number of images on the page, the total area covered by those images, the word count, and binary flags indicating whether the page contains keywords such as "North" or "N", "Figure", "Map", "Alignment Sheet" or "Sheet", "Legend", "scale", and "kilometers" or "km". A sketch of this feature extraction is shown below.
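The following is a minimal sketch of how such page-level features could be extracted, assuming pdfplumber as the PDF library and `sample_application.pdf` as a hypothetical input file; the actual feature_extraction.py in this repository may use a different library and feature set.

```python
import pdfplumber

# Keywords whose presence on a page hints that it is a map / alignment sheet.
MAP_KEYWORDS = ["north", "figure", "map", "alignment sheet", "sheet",
                "legend", "scale", "kilometers", "km"]

def extract_page_features(page):
    """Return a feature dict for a single pdfplumber page (illustrative sketch)."""
    text = (page.extract_text() or "").lower()   # pages with no text layer return None
    images = page.images                         # list of dicts with bounding-box info
    features = {
        "num_images": len(images),
        "image_area": sum(img["width"] * img["height"] for img in images),
        "word_count": len(text.split()),
    }
    # One binary flag per keyword, e.g. has_legend, has_scale, ...
    for kw in MAP_KEYWORDS:
        features[f"has_{kw.replace(' ', '_')}"] = int(kw in text)
    return features

# Hypothetical input file; each page becomes one feature row.
with pdfplumber.open("sample_application.pdf") as pdf:
    rows = [extract_page_features(page) for page in pdf.pages]
```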

Once the features are extracted, we will train classification models such as XGBoost Classifier, Support Vector Classifier, Decision Tree Classifier, Random Forest Classifier, Random Forest Regressor, and XGBoost Regressor. We will compare the accuracy and confusion matrices of these models on the training set and test set, and then save the best-performing model for future use.

Note: The results from the regressor models are converted into binary output; hence, we will refer to these regression models as classification models.
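A minimal sketch of training and comparing these models is shown below, assuming `X` (a table of page features such as the rows produced above) and `y` (0/1 labels for non-map/map) have already been prepared; the model names and hyperparameters here are illustrative, not the repository's exact configuration.

```python
import xgboost as xgb
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# X: page-level feature matrix, y: labels (0 = non-map, 1 = map), prepared beforehand.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "XGBClassifier": xgb.XGBClassifier(),
    "SVC": SVC(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "RandomForestRegressor": RandomForestRegressor(),
    "XGBRegressor": xgb.XGBRegressor(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # Regressors output continuous scores; threshold at 0.5 to obtain class labels.
    if "Regressor" in name:
        preds = (preds >= 0.5).astype(int)
    print(name, "accuracy:", accuracy_score(y_test, preds))
    print(confusion_matrix(y_test, preds))
```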

Description of the folder structure:

  1. Training Set: This folder contains the files used to prepare the training set and the test set.

  2. Validation Set: This folder contains the files used to validate the trained models and identify the best-performing model.

  3. feature_extraction.py: This file contains the functions used to extract features from a PDF page. These features are used for classification.

  4. Classify Maps.ipynb: In this file we first read the PDF files in the Training Set folder. We treat each page as a unique entity and use the functions from feature_extraction.py to extract features for all the pages. The pages and their features are then split into a training set and a test set, the classification models are trained, and their performance is evaluated. We then extract features for the PDFs in the Validation Set folder in the same way, evaluate the models on them, and pick the best model (see the sketch after this list for saving and reusing it).
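A short sketch of saving the best model for future use, assuming joblib for persistence and a hypothetical `validation_accuracy` dict of per-model validation scores; the repository may persist its model differently.

```python
import joblib

# Pick the model with the highest validation accuracy and persist it for reuse.
best_name = max(models, key=lambda name: validation_accuracy[name])  # validation_accuracy is hypothetical
joblib.dump(models[best_name], "best_map_classifier.joblib")

# Later: load the saved model and classify the feature rows of a new PDF page.
clf = joblib.load("best_map_classifier.joblib")
is_map = clf.predict(new_page_features)  # new_page_features: 2D array of page feature rows
```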
