
Mcity Data Engine


Resources: Google Colab Demo · Wiki · Docs · Logs · Models

The Mcity Data Engine is an essential tool in the Mcity makerspace for transportation innovators developing AI algorithms and seeking actionable data insights through machine learning. Details on the Data Engine can be found in the Wiki. The Data Engine supports every stage of continuously improving AI models based on raw visual data:

Mcity Data Engine Overview

On February 24, 2025, Daniel Bogdoll, a research scholar at Mcity, gave a presentation on the first release of the Mcity Data Engine in Ann Arbor, Michigan. The recording provides insight into the general architecture, its features and ecosystem integrations, and demonstrates successful data curation and model training for improved Vulnerable Road User (VRU) detection:

Online Demo: Data Selection with Embeddings

To get a first feel for the Mcity Data Engine, we provide an online demo in a Google Colab environment. We will load the Fisheye8K dataset and demonstrate the Mcity Data Engine workflow Embedding Selection. This workflow leverages a set of models to compute image embeddings which are used to determine both representative and rare samples. The dataset is then visualized in the Voxel51 UI, highlighting how often a sample was picked by the workflow.

Note that most of the Mcity Data Engine workflows require a more powerful GPU, so the possibilities within the Colab environment are limited; other workflows may not work there at all.
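The core idea behind the Embedding Selection workflow can be sketched in a few lines of NumPy: rank samples by their distance to the embedding centroid, treating nearby samples as representative and distant ones as rare. This is an illustrative simplification, not the engine's actual implementation, which combines embeddings from a whole set of models:

```python
import numpy as np

def select_samples(embeddings: np.ndarray, n_representative: int, n_rare: int):
    """Rank samples by distance to the embedding centroid:
    nearby samples count as representative, distant ones as rare."""
    centroid = embeddings.mean(axis=0)
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    order = np.argsort(distances)            # ascending distance to centroid
    representative = order[:n_representative]
    rare = order[-n_rare:]
    return representative, rare

# Stand-in for real image embeddings (e.g. 512-dim vectors from a vision model)
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 512))
rep, rare = select_samples(emb, n_representative=5, n_rare=5)
```

In the real workflow, how often a sample is picked across models is what gets visualized in the Voxel51 UI.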

Online demo on Google Colab: Mcity Data Engine Web Demo

Local Execution

At least one GPU is required for many of the Mcity Data Engine workflows. Check the hardware setups we have tested in the Wiki. To download the repository and install the requirements, run:

git clone --recurse-submodules [email protected]:mcity/mcity_data_engine.git
cd mcity_data_engine
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Log in with your Weights and Biases and Hugging Face accounts:

wandb login
huggingface-cli login

Launch a Voxel51 session in one terminal: python session_v51.py

Configure your run in config/config.py and launch the Mcity Data Engine in a second terminal: python main.py
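As a purely hypothetical illustration of what such a configuration file might contain (the actual option names are defined in the repository's config/config.py and may differ):

```python
# config/config.py — illustrative sketch only; consult the shipped file
# for the real option names. Both names below are placeholders.
SELECTED_WORKFLOWS = ["embedding_selection"]   # hypothetical option name
SELECTED_DATASET = {"name": "fisheye8k"}       # hypothetical option name
```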

Notebooks and Submodules

To exclude the output of Jupyter notebooks from Git tracking, add the following lines to your .git/config:

[filter "strip-notebook-output-engine"]
    clean = <your_path>/mcity_data_engine/.venv/bin/jupyter nbconvert --ClearOutputPreprocessor.enabled=True --ClearMetadataPreprocessor.enabled=True --to=notebook --stdin --stdout
    smudge = cat
    required = true

and these to .git/modules/mcity_data_engine_scripts/config:

[filter "strip-notebook-output-scripts"]
    clean = <your_path>/mcity_data_engine/.venv/bin/jupyter nbconvert --ClearOutputPreprocessor.enabled=True --ClearMetadataPreprocessor.enabled=True --to=notebook --stdin --stdout
    smudge = cat
    required = true
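Git only applies such filter drivers to files that a .gitattributes entry maps to them. The repository's tracked .gitattributes (see the repository structure below) is expected to contain entries along these lines; the exact pattern is an assumption, but the filter name must match the one defined above:

```
*.ipynb filter=strip-notebook-output-engine
```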

To keep the submodules updated, add the following lines at the top of your .git/hooks/pre-commit:

git submodule update --recursive --remote
git add .gitmodules $(git submodule foreach --quiet 'echo $name')

Repository Structure

.
├── main.py                     # Entry point of the framework → Terminal 1
├── session_v51.py              # Script to launch Voxel51 session → Terminal 2
├── workflows/                  # Workflows for the Mcity Data Engine
├── config/                     # Local configuration files
├── utils/                      # General-purpose utility functions
├── cloud/                      # Scripts run in the cloud to pre-process data
├── docs/                       # Documentation generated with `pdoc`
├── tests/                      # Tests using Pytest
├── custom_models/              # External models with containerized environments
├── mcity_data_engine_scripts/  # Experiment scripts and one-time operations (Mcity internal)
├── .vscode                     # Settings for VS Code IDE
├── .github/workflows/          # GitHub Action workflows
├── .gitignore                  # Files and directories to be ignored by Git
├── .gitattributes              # Rules for handling files like Notebooks during commits
├── .gitmodules                 # Configuration for managing Git submodules
├── .secret                     # Secret tokens (not tracked by Git)
└── requirements.txt            # Python dependencies (pip install -r requirements.txt)

Training

Training runs are logged with Weights and Biases (WandB).

To change the default WandB directory, run:

echo 'export WANDB_DIR="<your_path>/mcity_data_engine/logs"' >> ~/.profile
source ~/.profile
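Alternatively, for a single session you can set the variable from Python before WandB initializes; the path below is a placeholder, not a path the project prescribes:

```python
import os

# Must be set before wandb.init() is called; example placeholder path.
os.environ["WANDB_DIR"] = os.path.expanduser("~/mcity_data_engine/logs")
```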

Contribution

Contributions are very welcome! The Mcity Data Engine is a blueprint for data curation and model training and will not support every use case out of the box. Please find instructions on how to contribute here:

Special thanks to these amazing people for contributing to the Mcity Data Engine! 🙌

Citation

If you use the Mcity Data Engine in your research, feel free to cite the project:

@article{bogdoll2025mcitydataengine,
  title={Mcity Data Engine},
  author={Bogdoll, Daniel and Anata, Rajanikant Patnaik and Stevens, Gregory},
  journal={GitHub. Note: https://github.com/mcity/mcity_data_engine},
  year={2025}
}