Skip to content

Latest commit

 

History

History
94 lines (58 loc) · 3.94 KB

update-deepops.md

File metadata and controls

94 lines (58 loc) · 3.94 KB

Updating DeepOps

Overview

This document details how to upgrade DeepOps to the next release.

Updating the DeepOps repository

Update the repository

Each release of DeepOps is a named tag of the Git repository, with versions using the YY.MM numbering scheme. So, for example, the March 2019 release of DeepOps is a tag named 19.03.

If you don’t have a local clone of the DeepOps repository, you should clone one to your local provisioning host.

git clone https://github.com/NVIDIA/deepops

Check out the git tag for the release you’re upgrading to:

git checkout <YY.MM>

If you want to create a new branch to retain commits you create, you may do so (now or later) by using -b with the checkout command again. Example:

git checkout -b <new-branch-name>

If you’ve made local changes to the DeepOps repository in your own branch, you can rebase your branch onto the release to port your changes to the new release.

git checkout <your-branch>
git rebase <YY.MM>

Note that if there are any conflicts between your local branch and the release, you will need to resolve those conflicts as part of the rebase process.

Port your config

The DeepOps configuration files sometimes change from one release to the next. You should compare your existing configuration (usually in the config/ directory) to the example configuration provided by DeepOps (in config.example/) to identify new required parameters or changes in the config structure.

Component-based upgrade

DeepOps provides a modular toolkit for deploying many different components of a GPU cluster, such as Kubernetes, Slurm, and Ceph. Each release of DeepOps pins a specific set of versions for these components, so updating to the next release will update the versions of the provided components.

Because DeepOps is modular, any given deployment may only include a subset of the available components. You should identify which components of DeepOps you’re currently using, and what parts of your cluster you will need to upgrade.

Unless otherwise specified, components can be upgraded in any order.

Updating Slurm

DeepOps specifies the Slurm version as part of the Ansible playbook, so we can upgrade Slurm simply by re-running Ansible.

First, run Ansible in check mode to verify the playbook can run successfully.

ansible-playbook --check -i config/inventory -l slurm-cluster playbooks/slurm-cluster.yml

If there are any errors (for example, due to a changed config), identify and fix those issues. Once check mode runs clean, run Ansible to perform the upgrade.

ansible-playbook --check -i config/inventory -l slurm-cluster playbooks/slurm-cluster.yml

Updating Kubernetes (Kubespray)

DeepOps deploys Kubernetes using Kubespray, an Ansible framework for deploying production-ready Kubernetes clusters. The Kubernetes upgrade process will be specific to the release of Kubespray which is pinned in the new release of DeepOps. Please see the Kubespray upgrade docs for instructions to upgrade Kubernetes.

Updating Helm

The preferred version of Helm for DeepOps is specified as part the install script. To make sure you have the correct version of Helm, just re-run the install script from the root of the repository.

scripts/install_helm.sh

Updating Ceph (Rook)

Ceph is installed for DeepOps via Rook, an orchestration system for deploying storage systems in Kubernetes.

To upgrade to the most recent version of Ceph and Rook, use the following commands:

helm update
helm search rook		# Find the most recent version of rook
helm upgrade --namespace rook-ceph-system rook-ceph rook-master/rook-ceph --version ${version}

For more details on the Ceph upgrade process, please consult the Rook documentation.