This project comprises two datasets: the biankatpas dataset and the Kaggle dataset.
The two raw datasets are held here under their named directories.
The compiled dataset is saved as a .csv in the `src` directory during script execution.

- Compiled dataset: 4409 records
- GitHub dataset from the Portuguese (biankatpas) project: 1235 records
Given the size of the two datasets, I've avoided re-publishing them to this repo (the Kaggle dataset is 9GB+), so there is some setup required to get your local data ready.
First, set up the larger of the two datasets:

- Download the Kaggle dataset as a zip.
- Extract it and move the `Dataset 1 (Simplex)` folder into the `Kaggle` directory (a scripted sketch of this step follows).
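If you prefer to script that step, here is a minimal Python sketch. The archive filename and the temporary extraction directory are assumptions; adjust them to match your download.

```python
import shutil
import zipfile
from pathlib import Path

# Assumed filename for the downloaded Kaggle archive; adjust to match yours.
archive = Path("kaggle-dataset.zip")
tmp = Path("kaggle_extract_tmp")

# Extract the full archive into a temporary directory.
with zipfile.ZipFile(archive) as zf:
    zf.extractall(tmp)

# Move the one folder the pipeline needs into the Kaggle directory.
Path("Kaggle").mkdir(exist_ok=True)
shutil.move(str(tmp / "Dataset 1 (Simplex)"), str(Path("Kaggle") / "Dataset 1 (Simplex)"))
```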
Second, set up the biankatpas dataset:

- Clone the Git repo from biankatpas.
- Copy the `Dataset` folder into the `biankatpas` directory (see the sketch below).
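A scripted equivalent, assuming `git` is on your PATH; the repository URL below is a placeholder, so substitute the actual repo name:

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical URL -- substitute the actual biankatpas repository name.
repo = "https://github.com/biankatpas/<repository>.git"
clone_dir = Path("biankatpas_clone")

# Clone the repo, then copy just the Dataset folder into place.
subprocess.run(["git", "clone", "--depth", "1", repo, str(clone_dir)], check=True)
Path("biankatpas").mkdir(exist_ok=True)
shutil.copytree(clone_dir / "Dataset", Path("biankatpas") / "Dataset")
```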
In the interest of time, I didn't modularise the pipeline into a CLI script, so you will have to modify the `runner.py` script as needed.
- Uncomment the `create_dataset` method, `fn`, `logger`, and `to_csv` lines. This will enable loading and then saving the compiled dataset to disk (a sketch of the resulting script follows this list).
- Optionally, set `preview` to `True` to verify the HOG descriptor extraction process.
- Also consider modifying `n_jobs`: 16 threads will easily load a desktop CPU to 100%, so a lower value is recommended for laptops.
- Done! Validate that the CSV file is to your liking.
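For orientation, here is roughly the shape the relevant part of `runner.py` should take once those lines are uncommented. This is a sketch, not the actual file: the import path, the output filename, and the logging setup are assumptions; only `create_dataset`, `fn`, `logger`, `to_csv`, `preview`, and `n_jobs` come from the steps above.

```python
import logging

# Assumed import path for the project's dataset builder.
from dataset import create_dataset

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Assumed output filename for the compiled dataset.
fn = "compiled_dataset.csv"

# preview=True shows each HOG descriptor as it is extracted, for verification.
# n_jobs=16 will pin a desktop CPU at 100%; use a lower value on a laptop.
df = create_dataset(preview=False, n_jobs=8)

logger.info("Compiled %d records", len(df))
df.to_csv(fn, index=False)
```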