
Mcity Data Engine


Resources: Google Colab Demo · Wiki · Docs · Logs · Models

The Mcity Data Engine is an essential tool in the Mcity makerspace for transportation innovators developing AI algorithms and seeking actionable data insights through machine learning. Details on the Data Engine can be found in the Wiki. The Data Engine supports every stage of continuously improving AI models based on raw visual data:

Mcity Data Engine Overview

On February 24, 2025, Daniel Bogdoll, a research scholar at Mcity, gave a presentation on the first release of the Mcity Data Engine in Ann Arbor, Michigan. The recording provides insight into the general architecture, its features and ecosystem integrations, and demonstrates successful data curation and model training for improved Vulnerable Road User (VRU) detection:

Online Demo: Data Selection with Embeddings

To get a first feel for the Mcity Data Engine, we provide an online demo in a Google Colab environment. We will load the Fisheye8K dataset and demonstrate the Mcity Data Engine workflow Embedding Selection. This workflow leverages a set of models to compute image embeddings which are used to determine both representative and rare samples. The dataset is then visualized in the Voxel51 UI, highlighting how often a sample was picked by the workflow.

Note that most of the Mcity Data Engine workflows require a more powerful GPU, so the possibilities within the Colab environment are limited; other workflows may not work there at all.
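The core idea behind the Embedding Selection workflow can be sketched in a few lines of NumPy: rank samples by their distance to the embedding centroid, treating nearby samples as representative and distant ones as rare. This is an illustrative simplification, not the engine's actual implementation, which combines embeddings from a whole set of models:

```python
import numpy as np

def select_samples(embeddings: np.ndarray, n_representative: int, n_rare: int):
    """Rank samples by distance to the embedding centroid:
    nearby samples count as representative, distant ones as rare."""
    centroid = embeddings.mean(axis=0)
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    order = np.argsort(distances)            # ascending distance to centroid
    representative = order[:n_representative]
    rare = order[-n_rare:]
    return representative, rare

# Stand-in for real image embeddings (e.g. 512-dim vectors from a vision model)
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 512))
rep, rare = select_samples(emb, n_representative=5, n_rare=5)
```

In the real workflow, how often a sample is picked across models is what gets visualized in the Voxel51 UI.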

Online demo on Google Colab: Mcity Data Engine Web Demo

Local Execution

At least one GPU is required for many of the Mcity Data Engine workflows. Check the hardware setups we have tested in the Wiki. To download the repository and install the requirements, run:

git clone --recurse-submodules [email protected]:mcity/mcity_data_engine.git
cd mcity_data_engine
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Log in with your Weights and Biases and Hugging Face accounts:

wandb login
huggingface-cli login

Launch a Voxel51 session in one terminal: python session_v51.py

Configure your run in config/config.py and launch the Mcity Data Engine in a second terminal: python main.py
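As a purely hypothetical illustration of what such a configuration file might contain (the actual option names are defined in the repository's config/config.py and may differ):

```python
# config/config.py — illustrative sketch only; consult the shipped file
# for the real option names. Both names below are placeholders.
SELECTED_WORKFLOWS = ["embedding_selection"]   # hypothetical option name
SELECTED_DATASET = {"name": "fisheye8k"}       # hypothetical option name
```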

Notebooks and Submodules

To exclude the output of Jupyter notebooks from Git tracking, add the following lines to your .git/config:

[filter "strip-notebook-output-engine"]
    clean = <your_path>/mcity_data_engine/.venv/bin/jupyter nbconvert --ClearOutputPreprocessor.enabled=True --ClearMetadataPreprocessor.enabled=True --to=notebook --stdin --stdout
    smudge = cat
    required = true

and these to .git/modules/mcity_data_engine_scripts/config:

[filter "strip-notebook-output-scripts"]
    clean = <your_path>/mcity_data_engine/.venv/bin/jupyter nbconvert --ClearOutputPreprocessor.enabled=True --ClearMetadataPreprocessor.enabled=True --to=notebook --stdin --stdout
    smudge = cat
    required = true
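Git only applies such filter drivers to files that a .gitattributes entry maps to them. The repository's tracked .gitattributes (see the repository structure below) is expected to contain entries along these lines; the exact pattern is an assumption, but the filter name must match the one defined above:

```
*.ipynb filter=strip-notebook-output-engine
```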

To keep the submodules updated, add the following lines at the top of your .git/hooks/pre-commit:

git submodule update --recursive --remote
git add .gitmodules $(git submodule foreach --quiet 'echo $name')

Repository Structure

.
├── main.py                     # Entry point of the framework → Terminal 1
├── session_v51.py              # Script to launch Voxel51 session → Terminal 2
├── workflows/                  # Workflows for the Mcity Data Engine
├── config/                     # Local configuration files
├── utils/                      # General-purpose utility functions
├── cloud/                      # Scripts run in the cloud to pre-process data
├── docs/                       # Documentation generated with `pdoc`
├── tests/                      # Tests using Pytest
├── custom_models/              # External models with containerized environments
├── mcity_data_engine_scripts/  # Experiment scripts and one-time operations (Mcity internal)
├── .vscode                     # Settings for VS Code IDE
├── .github/workflows/          # GitHub Action workflows
├── .gitignore                  # Files and directories to be ignored by Git
├── .gitattributes              # Rules for handling files like Notebooks during commits
├── .gitmodules                 # Configuration for managing Git submodules
├── .secret                     # Secret tokens (not tracked by Git)
└── requirements.txt            # Python dependencies (pip install -r requirements.txt)

Training

Training runs are logged with Weights and Biases (WandB).

To change the default WandB directory, run:

echo 'export WANDB_DIR="<your_path>/mcity_data_engine/logs"' >> ~/.profile
source ~/.profile
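Alternatively, for a single session you can set the variable from Python before WandB initializes; the path below is a placeholder, not a path the project prescribes:

```python
import os

# Must be set before wandb.init() is called; example placeholder path.
os.environ["WANDB_DIR"] = os.path.expanduser("~/mcity_data_engine/logs")
```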

Contribution

Contributions are very welcome! The Mcity Data Engine is a blueprint for data curation and model training and will not support every use case out of the box. Please find instructions on how to contribute here:

Special thanks to these amazing people for contributing to the Mcity Data Engine! 🙌

Citation

If you use the Mcity Data Engine in your research, feel free to cite the project:

@article{bogdoll2025mcitydataengine,
  title={Mcity Data Engine},
  author={Bogdoll, Daniel and Anata, Rajanikant Patnaik and Stevens, Gregory},
  journal={GitHub. Note: https://github.com/mcity/mcity_data_engine},
  year={2025}
}