Make sure to read :doc:`prerequisites` before installing mlbench.
Then, the library can be installed directly using pip:

.. code-block:: bash

    $ pip install mlbench-core

This installs the ``mlbench`` CLI into the current environment, allowing creation and deletion of clusters as well as starting benchmark runs.
In addition to the plain CLI installation, ``mlbench-core`` provides optional extras for different use-cases. These install additional libraries that are not needed for CLI use. The available installation options are:
.. code-block:: bash

    $ pip install mlbench-core[test]        # Install with test requirements
    $ pip install mlbench-core[lint]        # Install with lint requirements
    $ pip install mlbench-core[torch]       # Install only with torch requirements
    $ pip install mlbench-core[tensorflow]  # Install only with tensorflow requirements
    $ pip install mlbench-core[dev]         # Install all dependencies for development (all of the above)
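Since these are standard pip extras, multiple extras can also be combined in a single install; for example (an illustration of pip's extras syntax, not an additional mlbench option):

.. code-block:: bash

    # Quote the requirement so the brackets are not interpreted by the shell
    $ pip install "mlbench-core[torch,tensorflow]"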
Once installed, the CLI and its available commands can be inspected:

.. code-block:: bash

    $ mlbench --help
    Usage: mlbench [OPTIONS] COMMAND [ARGS]...

      Console script for mlbench_cli.

    Options:
      --version  Print mlbench version
      --help     Show this message and exit.

    Commands:
      charts             Chart the results of benchmark runs Save generated...
      create-cluster     Create a new cluster.
      delete             Delete a benchmark run
      delete-cluster     Delete a cluster.
      download           Download the results of a benchmark run
      get-dashboard-url  Returns the dashboard URL of the current cluster
      list-clusters      List all currently configured clusters.
      run                Start a new run for a benchmark image
      set-cluster        Set the current cluster to use.
      status             Get the status of a benchmark run, or all runs if no...
One can easily deploy a cluster on both AWS and GCloud using the ``mlbench`` CLI.
For example, one can create a GCloud cluster by running:

.. code-block:: bash

    $ mlbench create-cluster gcloud 3 my-cluster
    [...]
    MLBench successfully deployed
This creates a cluster called ``my-cluster-3`` with 3 nodes (see ``mlbench create-cluster gcloud --help`` for more options).
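The CLI also supports AWS as a provider. The following sketch assumes the AWS sub-command takes the same positional arguments as the GCloud one; check ``mlbench create-cluster --help`` for the exact syntax:

.. code-block:: bash

    # Assumed AWS equivalent of the GCloud example above
    $ mlbench create-cluster aws 3 my-cluster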
Once created, experiments can be run using:
.. code-block:: bash

    $ mlbench run my-run 2

    Benchmark:
     [0] PyTorch Cifar-10 ResNet-20
     [1] PyTorch Cifar-10 ResNet-20 (Scaling LR)
     [2] PyTorch Linear Logistic Regression
     [3] PyTorch Machine Translation GNMT
     [4] PyTorch Machine Translation Transformer
     [5] Tensorflow Cifar-10 ResNet-20 Open-MPI
     [6] PyTorch Distributed Backend benchmarking
     [7] Custom Image

    Selection [0]: 1
    [...]
    Run started with name my-run-2
A few handy commands for the quickstart:

- To obtain the dashboard URL: ``mlbench get-dashboard-url``
- To see the state of the experiment: ``mlbench status my-run-2``
- To download the results of the experiment: ``mlbench download my-run-2``
- To delete the cluster: ``mlbench delete-cluster gcloud my-cluster-3``
MLBench also supports deployment of the dashboard and tasks to a local cluster. This uses KIND (Kubernetes in Docker) and can easily be deployed using the CLI. It is useful for debugging code or development, but is not meant for running actual performance benchmarks.

KIND needs to be downloaded and installed first; see https://kind.sigs.k8s.io/ for instructions.
.. code-block:: bash

    $ mlbench create-cluster kind 3 my-cluster
    [...]
    MLBench successfully deployed
This creates a local cluster of 3 "nodes", as well as a local Docker registry on port 5000. This allows deploying runs that use local Docker images. To do that, one needs to push the Docker image to the local registry:
.. code-block:: bash

    $ docker tag <repo>/<image-name>:<tag> localhost:5000/<image-name>:<tag>
    $ docker push localhost:5000/<image-name>:<tag>
You can now use the image ``localhost:5000/<image-name>:<tag>`` in MLBench's dashboard to run a task.
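For instance, for a locally built image (``my-benchmark`` is a placeholder name, not an image shipped with mlbench):

.. code-block:: bash

    # Build an image from a Dockerfile in the current directory
    $ docker build -t my-benchmark:latest .

    # Tag it for the local registry created by the KIND deployment, and push it
    $ docker tag my-benchmark:latest localhost:5000/my-benchmark:latest
    $ docker push localhost:5000/my-benchmark:latest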
MLBench's Helm charts can also be deployed manually on a running Kubernetes cluster. The manual deployment requires the mlbench-helm repository to be cloned and Helm to be installed (see :ref:`helm-install`). In addition, the credentials for the cluster need to be present in the ``kubectl`` config.
For example, to obtain the credentials for a GCloud Kubernetes cluster, one should run:

.. code-block:: bash

    $ gcloud container clusters get-credentials --zone ${MACHINE_ZONE} ${CLUSTER_NAME}

This will set up ``kubectl`` for the cluster.
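To verify that ``kubectl`` now points at the intended cluster, the standard introspection commands can be used:

.. code-block:: bash

    # Show which context kubectl is currently using
    $ kubectl config current-context

    # List the cluster's nodes to confirm connectivity
    $ kubectl get nodes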
Then, to deploy the dashboard on the running cluster, we need to apply our values to the existing Helm template and deploy it onto the cluster:

.. code-block:: bash

    $ cd mlbench-helm
    $ helm template ${RELEASE_NAME} . \
        --set limits.workers=$((NUM_NODES-1)) \
        --set limits.gpu=${NUM_GPUS} \
        --set limits.cpu=$((NUM_CPUS-1)) | kubectl apply -f -
Where:

- ``RELEASE_NAME`` represents the cluster name (called ``my-cluster-3`` in the example above).
- ``NUM_NODES`` is the number of nodes in the cluster; one node hosts the master, hence ``NUM_NODES-1`` is the maximum number of worker nodes available. This sets the maximum number of nodes that can be chosen for an experiment in the UI/CLI.
- ``NUM_GPUS`` is the number of GPUs requested by each worker pod.
- ``NUM_CPUS`` is the maximum number of CPUs (cores) available on each worker node, using Kubernetes notation (``8`` or ``8000m`` for 8 cpus/cores). This is also the maximum number of cores that can be selected for an experiment in the UI.
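As a concrete illustration, deploying to the 3-node ``my-cluster-3`` from the earlier example could look as follows (assuming 4-core worker machines and no GPUs; the numbers are placeholders, not recommendations):

.. code-block:: bash

    $ cd mlbench-helm
    # 3 nodes leave 2 workers (one node runs the master); keep one core free per node
    $ helm template my-cluster-3 . \
        --set limits.workers=2 \
        --set limits.gpu=0 \
        --set limits.cpu=3 | kubectl apply -f -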
This will deploy the helm charts with the corresponding images to each node, and will set the hardware limits.
.. note::

    Get the application URL by running these commands:

    .. code-block:: bash

        $ export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services ${RELEASE_NAME}-mlbench-master)
        $ export NODE_IP=$(gcloud compute instances list | grep $(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}") | awk '{print $5}')
        $ gcloud compute firewall-rules create --quiet mlbench --allow tcp:$NODE_PORT,tcp:$NODE_PORT
        $ echo http://$NODE_IP:$NODE_PORT
.. danger::

    The last command opens up a firewall rule on Google Cloud. Make sure to delete the rule once it is no longer needed:

    .. code-block:: bash

        $ gcloud compute firewall-rules delete --quiet mlbench
One can also create a new ``myvalues.yml`` file with custom limits:

.. code-block:: yaml

    limits:
      workers:
      cpu:
      gpu:

    gcePersistentDisk:
      enabled:
      pdName:
- ``limits.workers`` is the maximum number of worker nodes available to mlbench. This sets the maximum number of nodes that can be chosen for an experiment in the UI. By default, mlbench starts 2 workers on startup.
- ``limits.cpu`` is the maximum number of CPUs (cores) available on each worker node, using Kubernetes notation (``8`` or ``8000m`` for 8 cpus/cores). This is also the maximum number of cores that can be selected for an experiment in the UI.
- ``limits.gpu`` is the number of GPUs requested by each worker pod.
- ``gcePersistentDisk.enabled`` determines whether to create the resources related to the NFS persistentVolume and persistentVolumeClaim.
- ``gcePersistentDisk.pdName`` is the name of an existing persistent disk in GKE.
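For illustration, a filled-in ``myvalues.yml`` might look like this (all values are placeholders, not recommendations):

.. code-block:: yaml

    limits:
      workers: 4          # up to 4 worker nodes selectable in the UI
      cpu: 3              # 3 cores per worker, Kubernetes notation
      gpu: 1              # one GPU requested per worker pod

    gcePersistentDisk:
      enabled: true
      pdName: my-pd-name  # name of an existing GKE persistent disk

The file can then be passed to Helm via its standard ``-f`` flag, e.g. ``helm template ${RELEASE_NAME} . -f myvalues.yml | kubectl apply -f -``.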
.. caution::

    If ``workers``, ``cpu`` or ``gpu`` are set higher than what is available in the cluster, Kubernetes will not be able to allocate nodes to mlbench and the deployment will hang indefinitely, without throwing an exception. Kubernetes will simply wait until nodes that fit the requirements become available, so make sure the cluster actually satisfies the requested requirements.
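If the deployment does hang, standard ``kubectl`` commands can show whether pods are stuck in the ``Pending`` state and why (the pod name below is a placeholder):

.. code-block:: bash

    # Stuck pods will show STATUS "Pending"
    $ kubectl get pods --namespace default

    # The Events section at the end explains why scheduling failed,
    # e.g. "Insufficient cpu" or "Insufficient nvidia.com/gpu"
    $ kubectl describe pod <pod-name> --namespace default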
.. note::

    To use ``gpu`` in the cluster, the nvidia device plugin must be installed. See :ref:`plugins` for details.
.. note::

    Use commands like ``gcloud compute disks create --size=10G --zone=europe-west1-b my-pd-name`` to create a persistent disk.
.. note::

    The GCE persistent disk will be mounted to the ``/datasets/`` directory on each worker.
.. caution::

    Google installs several pods on each node by default, which limits the available CPU; these can take up to 0.5 CPU cores per node. So make sure to provision VMs that have at least 1 more core than the number of cores you want to use for your mlbench experiment. See the GKE documentation for further details on node limits.
In ``values.yaml``, one can optionally install Kubernetes plugins by turning the following flags on or off:

- ``weave.enabled``: If true, install the weave network plugin.
- ``nvidiaDevicePlugin.enabled``: If true, install the nvidia device plugin.
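These flags can also be set when rendering the chart, using Helm's standard ``--set`` option together with the deployment command shown earlier:

.. code-block:: bash

    $ helm template ${RELEASE_NAME} . \
        --set weave.enabled=true \
        --set nvidiaDevicePlugin.enabled=true | kubectl apply -f -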