Awesome mmpie1 created by AndriiZelenko
$ git clone [email protected]:AndriiZelenko/mmpie1.git && cd mmpie1
You may use standard python tools (pip) as desired, but it is recommended to use Rye, in which case all you need to do is:
$ rye sync
To run the end-to-end demo application, refer to the Demo section below.
The mmpie1 package is an end-to-end machine learning framework for training and deploying models. It is meant to help kickstart new projects, containing all the structural components necessary to train, manage and deploy models. Mmpie1 has been created in a modular design, so if a particular component is not needed, it can simply be deleted.
At a high level, the mmpie1 package covers the following components of the ML lifecycle:
- Training framework, built on top of PyTorch Lightning, including data, model, and configuration management;
- Lifecycle management, built on top of MLFlow, including experiment tracking, model versioning, and model serving;
- Packaging, including both python (
pip install mmpie1
) and docker (docker compose up
) functionality; - Model serving, built on top of FastAPI, including a REST API for model access through a discord client
- Quality of life tools, including unit tests, automated documentation, and linting.
The Mmpie1 package has been tested on Ubuntu 22.04 with Python 3.10.12.
The mmpie1 package is organized into the following components:
Components
- docs. Contains the build files for Sphinx documentation. Refer to the Documentation section for more details;
- docker. Contains Dockerfiles for building the mmpie1 Docker image. Refer to the Docker module for more details;
- mmpie1: The main package, containing all the core functionality of the project;
- mmpie1/backend: Contains modules for running each of the discord, gateway and training servers;;
- mmpie1/configs: Contains the (Hydra) yaml files for model training;
- mmpie1/core: Contains core mmpie1 packages, including:
- Config. This class provides access to the associated config.ini file, used for package-level configurations;
- Mmpie1Base. This class serves as a helper base class. It provides unified logging functionality, among other features.
- mmpie1/data. Contains modules for individual
pl.LightningDataModule
datasets; - mmpie1/models. Contains modules for individual
torch.nn.Module
models; also includes a torch Lightning wrapper class fortorch.nn.Modules
, which can be used to wrap models during training; - mmpie1/modules: Modules used within the package, including model Registry and GPT classes;
- mmpie1/utils: Utility functions used throughout the package. Notable utilities include:
- conversions. Provides methods for converting and serializing PIL images, used to pass images between servers;
- logging. Provides a
default_logger
method, may be used to customize the logger provided by the BaseMmpie1 class;
- mmpie1/scripts: Standalone scripts built on top of the mmpie1 package; the project starts with a training script that demonstrates how to set up a training pipeline to store models to the model registry;
- tests: Contains unit tests for the mmpie1 package. Refer to the Unit Tests section for more details;
It is highly recommended to use Rye as your package manager. In addition to handling
your virtual environment and dependencies for you, a number of useful commands have also been included, which
you can use through rye run <command>
.
To lint your code with pylint, isort and black:
$ rye run lint
Output
pylint mmpie1/
------------------------------------
Your code has been rated at 10.00/10
pylint --disable=protected-access tests/
-------------------------------------------------------------------
Your code has been rated at 10.00/10
isort -l 120 --check mmpie1/
isort -l 120 --check tests/
black -l 120 --check mmpie1/
All done! β¨ π° β¨
43 files would be left unchanged.
black -l 120 --check tests/
All done! β¨ π° β¨
14 files would be left unchanged.
To run unit tests with pytest:
$ rye run test
Output
============================= test session starts ==============================
platform linux -- Python 3.11.6, pytest-7.4.4, pluggy-1.3.0
rootdir: ~/mmpie1
plugins: cov-4.1.0, anyio-4.2.0, hydra-core-1.3.2
collected 13 items
tests/mmpie1/core/test_config.py ... [ 23%]
...
tests/mmpie1/utils/test_timer_collection.py . [100%]
---------- coverage: platform linux, python 3.11.6-final-0 -----------
Name Stmts Miss Cover Missing
--------------------------------------------------------------------------------
mmpie1/__init__.py 3 0 100%
...
mmpie1/utils/timer_collection.py 28 0 100%
--------------------------------------------------------------------------------
TOTAL 1134 854 25%
=========================== short test summary info ============================
SKIPPED [1] tests/mmpie1/core/test_registry.py:12: No runs found in MLflow.
======================== 12 passed, 1 skipped in 7.31s =========================
To auto-format your code with isort and black:
$ rye run format
Output
isort -l 120 mmpie1/
black -l 120 mmpie1/
All done! β¨ π° β¨
43 files left unchanged.
isort -l 120 tests/
black -l 120 tests/
All done! β¨ π° β¨
14 files left unchanged.
Build the docs using sphinx:
$ rye run docs
Both HTML and PDF docs will be built, located in docs/_build/html
(i.e. open index.html in a browser) and
docs/_build/simplepdf/Mmpie1.pdf
respectively.
To generate a dependency graph of the project, use pylint and graphviz. Make sure graphviz is installed:
apt-get install graphviz
And then run:
rye run graph-dependencies
Which should generate two files in the root directory:
You may use these graphs to help get a quick overview of the project, delete superfluous code and avoid circular dependencies.
To build the package:
rye build
Output
building mmpie1
* Creating virtualenv isolated environment...
* Installing packages in isolated environment... (hatchling)
* Getting build dependencies for sdist...
* Building sdist...
* Building wheel from sdist
* Creating virtualenv isolated environment...
* Installing packages in isolated environment... (hatchling)
* Getting build dependencies for wheel...
* Building wheel...
Successfully built mmpie1-0.1.0.tar.gz and mmpie1-0.1.0-py3-none-any.whl
You have a starter CI workflow in .github/workflows/ci.yml that will lint and test your project on Linux/MacOS/Windows. By default they will run with every push / pull request and can be accessed directly from GithubActions.
The end-to-end application demo allows training and deploying models through a private discord server. It is meant to showcase the full functionality of the mmpie1 package, and provide a starting point for deploying your own models.
After setting up the backend servers, an end user will be able to do the following through your private discord server:
- Train a model. The user may start a training job, passing in Hydra command line options to configure the training or run sweeps. The training job will be tracked in MLFlow, with logs and metrics available through Tensorboard. The user will be notified when the training job completes, with all resulting models stored in the model registry.
- Deploy a model. The user may deploy a model from the model registry, which will be served through the gateway server on subsequent requests.
- Run inference. The user may run the deployed model, passing an image through the discord bot to be classified.
The following helper commands are also exposed through the demo:
- Registry summary. The user may request a summary of the model registry, including all models and their associated metrics.
- Server logs. The user may request the server logs, which are kept by each backend server separately. In practice these logs would likely be used through an administrator or developer account.
Finally, the demo is set up to also make use of GPT, if an OpenAI API key is provided. In this case, the following additional commands are available:
- Chat. The user may chat with GPT, which will be used as a general chat agent.
- Debug. The user may request GPT to help debug any problem with the model or servers. In this case the server logs are automatically passed to GPT, which will provide debug advice on any errors to the user.
The overall component structure of the demo application is shown above. To run the demo, you will need to set up a private Discord server for the frontend, and then run each of the backend servers in turn. The instructions for each step are detailed below.
The end-to-end demo uses Discord as the front-end deployment environment, as it is generally very easy to set up a new discord server and associated bot for deployment. Indeed, many companies have used Discord as a deployment environment for their products at scale to great success (e.g. consider Midjourney).
First, create a new discord server, and then create an associated new discord bot. Add the bot to your server as instructed.
Finally, add your discord bot token to your config.ini file under API_KEYS/DISCORD. You may
find this key by going to the discord developer portal and navigating to
the Bot
tab. Click on Reset Token
and then copy/paste the new token to your config.ini file.
Congrats! Your application now has a front-end deployment environment!
While not absolutely necessary, the demo is set up to incorporate GPT for end-client ease-of-use. By default, it will
be used as a general chat agent for anyone DMing the bot, and can be prompted more directly by setting its system
prompt. More, as a concrete example, a sample debug
command is provided, which will pass the server logs to GPT in an
effort to get it to provide debug advice on any errors.
If you have not, sign up and log into the OpenAI Developer Platform and then navigate to your API keys. Create a new key and copy/paste it into your config.ini file under API_KEYS/OPENAI.
The demo includes a relatively sophisticated end-to-end deployment architecture. While it has many components, they are all modular and meant to be able to put on separate machines, as appropriate, to scale to an actual production environment.
The backend consists of the following servers:
- MLFlow Tracking and Registry Server. The MLFlow server is used to track experiments and store models in the registry;
- Training Server (Optional). The training server is used to train models and store them in the registry; it is
only needed if you wish to train models from discord (i.e. use the
>train
command in discord). It is not needed to train models locally. - Deployment Server. The deployment server is used to serve models from the registry.
- Discord Client. The discord client receives requests from the discord channel and forwards them to the backend to be processed;
- Gateway Server. The gateway server is used to serve models from the registry. It accepts any user requests not processed by the discord client (i.e. all requests relating to model inference/training and the model registry). Under the hood it communicates with the MLFlow, Training and Deployment servers, as appropriate;
- Tensorboard (Optional). Tensorboard is used to monitor training progress. It is not accessible from the discord client, but can be accessed directly through a local browser. It is only needed if you wish to monitor training progress.
All servers may be run locally, and may be started as described below.
Notes:
- If you choose different ports for the servers, you will need to update the hosts listed in your config.ini file to match, or pass them in as command line arguments to each other server.
- The default configs assume that output data (MLFlow registry, hydra training runs, tensorboard logs, standard
debug logs, etc.) will be stored in
${HOME}/mmpie1
and, for unit tests, that the project has been installed in${HOME}/projects/mmpie1
. If you wish to change these locations, you will need to update the same config.ini file accordingly. To create the default directories automatically, you may runrye run create_project_directories
, which will create all directories according to your config.ini file. - The number of workers on the training server (
-w 4
) determines how many simultaneous training runs may be done. If you set the number too high you may run out of memory. In practice, you will likely want to run the training server in a completely separate environment, and configure each training job to get a separate GPU. - If, at any point, you get an error saying
mmpie1
cannot be found, remember to add the mmpie1 path to your PYTHONPATH variable. E.g.
PYTHONPATH=${HOME}/projects/mmpie1 [... continue command]
- Start the MLFlow Tracking and Registry server:
rye run mlflow_server
Without Rye
mlflow server --backend-store-uri ${HOME}/mmpie1/mlflow --port 8080
Once the MLFlow server is up and running, you may get local access by opening a browser to its address:
- Start the Gateway server:
rye run gateway_server
Without Rye
python -m gunicorn -w 1 -b localhost:8081 -k uvicorn.workers.UvicornWorker "mmpie1.backend.gateway.gateway_server:app()"
- Start the Training server:
rye run training_server
Without Rye
python -m gunicorn -w 4 -b localhost:8082 -k uvicorn.workers.UvicornWorker "mmpie1.backend.training.training_server:app()"
- Start the Deployment server:
rye run deployment_server
Without Rye
python -m gunicorn -w 1 -b localhost:8083 -k uvicorn.workers.UvicornWorker "mmpie1.backend.deployment.deployment_server:app()"
- Start the Discord client:
rye run discord_client
Without Rye
python mmpie1/backend/discord/discord_client.py
- (Optional) Start Tensorboard:
tensorboard --logdir ${HOME}/mltemlpate/tensorboard
For a more streamlined deployment, follow the instructions in the docker readme, in which case you may configure the deployment through a single Docker Compose file. Then all the backend servers can be deployed with a single Docker Compose call from the docker directory:
docker compose up
Once the backend servers are up and running, you may showcase your demo application through your discord server. You may
run the following commands in any channel the bot has access to, or DM the bot directly. If DMing the bot directly, all
non-command messages will equate to running the >chat
command.
The demo comes with the following commands out-of-the-box.
Examples:
>train
>train --config-name train.yaml --model=cnn --dataset=mnist
>train --config-name train.yaml --model=cnn --dataset=mnist --trainer.max_epochs=2,5,10,15,20 --multirun
Any given arguments are directly passed to and parsed by Hydra. In particular, note that if you run a parameter sweep
(by providing multiple values for a given parameter), you must also pass the --multirun
flag. Training will run in the
background on the training server, and store the trained models in the model registry when complete. You will be
notified when the training job completes.
When deploying a model, pass in the model name and version number. For example,
>load_model ModelName 1
There are two commands for running inference:
>classify_id TEST_ID_INTEGER
>classify_image [UPLOAD IMAGE]
In the first case, you may specify an integer ID from the test dataset, which will be loaded and classified by the model. In the second case, you may upload an image (drag and drop the image file into your Discord message); example MNIST images may be found in the tests/resources directory.
You may use the >registry_summary
command at any time to get a summary of the model registry, including all models and
their associated metrics.
You may use the >logs
command at any time to get the server logs.
You may use the >chat
command at any time to converse freely with GPT. This command is implied if you DM the bot with
a general message.
You may use the debug
command at any time to request GPT to help debug any problem with the model or servers. The
server logs are automatically passed to GPT. The debug command itself may be run alone, or the user may optionally pass
a message to GPT to help it understand the context of the debug request.
Examples:
>debug
>debug Check the training logs to see if the last training request is still running or returned an error.