This project comprises two datasets: the biankatpas dataset and the Kaggle dataset.
The two raw datasets are held here under their named directories.
The compiled dataset is saved as a .csv in the `src` directory during script execution.

- Compiled dataset: 4409 records
- GitHub dataset from the Portuguese (biankatpas) project: 1235 records
Given the size of the two datasets, I've avoided re-publishing them to this repo (the Kaggle dataset is 9GB+), so there is some setup required to get your local data ready.
First, set up the larger of the two datasets:

- Download the Kaggle dataset as a zip.
- Extract it and move the `Dataset 1 (Simplex)` folder into the `Kaggle` directory (a scripted sketch of this step follows).
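If you prefer to script that step, here is a minimal Python sketch. The archive filename and the temporary extraction directory are assumptions; adjust them to match your download.

```python
import shutil
import zipfile
from pathlib import Path

# Assumed filename for the downloaded Kaggle archive; adjust to match yours.
archive = Path("kaggle-dataset.zip")
tmp = Path("kaggle_extract_tmp")

# Extract the full archive into a temporary directory.
with zipfile.ZipFile(archive) as zf:
    zf.extractall(tmp)

# Move the one folder the pipeline needs into the Kaggle directory.
Path("Kaggle").mkdir(exist_ok=True)
shutil.move(str(tmp / "Dataset 1 (Simplex)"), str(Path("Kaggle") / "Dataset 1 (Simplex)"))
```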
Second, set up the biankatpas dataset:

- Clone the Git repo from biankatpas.
- Copy the `Dataset` folder into the `biankatpas` directory (see the sketch below).
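A scripted equivalent, assuming `git` is on your PATH; the repository URL below is a placeholder, so substitute the actual repo name:

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical URL -- substitute the actual biankatpas repository name.
repo = "https://github.com/biankatpas/<repository>.git"
clone_dir = Path("biankatpas_clone")

# Clone the repo, then copy just the Dataset folder into place.
subprocess.run(["git", "clone", "--depth", "1", repo, str(clone_dir)], check=True)
Path("biankatpas").mkdir(exist_ok=True)
shutil.copytree(clone_dir / "Dataset", Path("biankatpas") / "Dataset")
```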
In the interest of time, I didn't modularise the pipeline into a CLI script, so you will have to modify the `runner.py` script as needed.
- Uncomment the `create_dataset` method, `fn`, `logger`, and `to_csv` lines. This will enable loading and then saving the compiled dataset to disk (a sketch of the resulting script follows this list).
- Optionally, set `preview` to `True` to verify the HOG descriptor extraction process.
- Also consider modifying `n_jobs`: 16 threads will easily load a desktop CPU to 100%, so a lower value is recommended for laptops.
- Done! Validate that the CSV file is to your liking.
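For orientation, here is roughly the shape the relevant part of `runner.py` should take once those lines are uncommented. This is a sketch, not the actual file: the import path, the output filename, and the logging setup are assumptions; only `create_dataset`, `fn`, `logger`, `to_csv`, `preview`, and `n_jobs` come from the steps above.

```python
import logging

# Assumed import path for the project's dataset builder.
from dataset import create_dataset

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Assumed output filename for the compiled dataset.
fn = "compiled_dataset.csv"

# preview=True shows each HOG descriptor as it is extracted, for verification.
# n_jobs=16 will pin a desktop CPU at 100%; use a lower value on a laptop.
df = create_dataset(preview=False, n_jobs=8)

logger.info("Compiled %d records", len(df))
df.to_csv(fn, index=False)
```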