diff --git a/README.md b/README.md
index 17f0c32..78b309b 100644
--- a/README.md
+++ b/README.md
@@ -61,6 +61,8 @@ Energy of the group of pods is exposed in 2 ways:
 * Through Prometheus at `http://prometheus-susql.openshift-kepler-operator.svc.cluster.local:9090` using the query `susql_total_energy_joules{susql_label_1=my-label-1,susql_label_2=my-label-2}`
 * From `status` of the `LabelGroup` CRD given as `labelgroup.status.totalEnergy`
+## Other Examples
+- A step-by-step explanation of how to aggregate energy data for a [GPU-based Jupyter Notebook workload on OpenShift AI](doc/openshift-ai-example-notebook.md).
 ## License
diff --git a/doc/openshift-ai-example-notebook.md b/doc/openshift-ai-example-notebook.md
new file mode 100644
index 0000000..b6e65a7
--- /dev/null
+++ b/doc/openshift-ai-example-notebook.md
@@ -0,0 +1,122 @@
+# Example: Aggregating a GPU Workload Running in an OpenShift AI Jupyter Notebook
+
+The following instructions show, step by step, how to use SusQL to aggregate the energy
+consumed by a Jupyter notebook that uses a GPU on OpenShift AI.
+
+## Prerequisites
+The following are assumed to be installed and available:
+- OpenShift Cluster (verified on version 4.14)
+- OpenShift AI (verified on version 2.11)
+- Kepler (verified with community version 0.13)
+- SusQL (verified with community version 0.0.22)
+- Command line access to the OpenShift cluster (e.g., logged in with the `oc` command)
+
+To use GPU functionality within the cluster, a GPU and the following software must also be available:
+- NVIDIA GPU Operator (verified with version 24.3)
+- Node Feature Discovery Operator (verified with version 4.14)
+- Kernel Module Management Operator (verified with version 2.1)
+
+## Create a Jupyter Notebook
+
+Any code that runs in a Jupyter notebook can have its energy use aggregated by SusQL. The following sample code demonstrates
+the use of GPU resources. It is also a good test case for verifying that the GPU is configured correctly.
+
+```
+pip install pycaret[full]
+
+import torch
+import time
+
+# Use the GPU when one is available; otherwise fall back to the CPU.
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+matrix_size = 16384
+
+x = torch.randn(matrix_size, matrix_size)
+y = torch.randn(matrix_size, matrix_size)
+
+x_gpu = x.to(device)
+y_gpu = y.to(device)
+
+# Multiply the two matrices ten times and time each run.
+for i in range(10):
+    start = time.time()
+    result_gpu = torch.matmul(x_gpu, y_gpu)
+    if device.type == 'cuda':
+        torch.cuda.synchronize()  # CUDA kernels run asynchronously; wait before timing
+    print("Run time using device", result_gpu.device, "is", "{:.7f}".format(time.time() - start))
+```
+
+A Jupyter notebook can be created and run through the following steps:
+
+- Log in to OpenShift.
+- Open the OpenShift AI Console by clicking the app tile and then clicking "Red Hat OpenShift AI".
+  - If this is your first time, or if you want a refresher, you can scroll down to the "Get oriented with learning resources" section
+    and follow the "Creating a Jupyter notebook" tutorial. Otherwise, continue with the instructions below.
+- Click "Applications" in the menu on the left-hand side of the window. Click "Enabled". Then click "Launch Application" on the newly displayed tile.
+  - The first time, you will need to create a notebook server. These instructions were verified with the PyTorch image using CUDA v11.8, Python v3.9, and PyTorch v2.2.
+- Within the Jupyter notebook server, click the "+" sign and click the "Python 3.9" tile. Copy the code above into cells in the notebook.
+  (It is probably best to put the `pip` command in its own cell and run it once before running the rest of the Python code.)
+- Verify that the code runs and uses the GPU.
+
+
+## Attach a SusQL Label to the Jupyter Notebook Server
+
+Although the OpenShift Web Console can be used to set labels on existing workloads, this is also easy to do from the command line.
+
+The following command removes the SusQL label from the Jupyter notebook server pod, in case one is already defined.
+```
+$ oc label pod $(oc get po -n rhods-notebooks | grep jupyter | head -1 | cut -f 1 -d" ") -n rhods-notebooks "susql.label/1-"
+pod/jupyter-nb-kube-3aadmin-0 unlabeled
+```
+
+Next, this command sets the label `susql.label/1` to `openshiftaij` for the Jupyter notebook server running in the `rhods-notebooks` namespace.
+```
+$ oc label pod $(oc get po -n rhods-notebooks | grep jupyter | head -1 | cut -f 1 -d" ") -n rhods-notebooks "susql.label/1=openshiftaij"
+pod/jupyter-nb-kube-3aadmin-0 labeled
+```
+
+Finally, this command verifies that the label has been set:
+```
+$ oc describe pod $(oc get po -n rhods-notebooks | grep jupyter | head -1 | cut -f 1 -d" ") -n rhods-notebooks | grep -i susql
+ susql.label/1=openshiftaij
+```
+
+## Create the SusQL LabelGroup
+
+First, create a LabelGroup definition file called `openshiftaij.yaml` as follows:
+```
+---
+apiVersion: susql.ibm.com/v1
+kind: LabelGroup
+metadata:
+  name: openshiftaij
+  namespace: rhods-notebooks
+spec:
+  labels:
+    - openshiftaij
+---
+```
+Then apply the file:
+```
+$ oc apply -f openshiftaij.yaml
+labelgroup.susql.ibm.com/openshiftaij created
+```
+
+## Visualize
+
+- Go back to the Jupyter notebook server and run the workload.
+  (Consider raising the loop count to a very large number so that the energy usage is easy to see.)
+- Then, in the OpenShift Web Console, click "Observe" in the left-hand menu, click "Metrics", enter the following
+query, and click "Run queries":
+```
+susql_total_energy_joules{susql_label_1="openshiftaij"}
+```
+
+If you have cloned the GitHub `susql-operator` repository, you can also run the `test/susqltop` command to view the energy aggregation from the command line.
+
+```
+$ test/susqltop
+NameSpace        LabelGroup    Labels            TotalEnergy (J)
+rhods-notebooks  openshiftaij  ["openshiftaij"]  17963.00
+```
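
For scripted checks, the query from the Visualize section could also be sent to the Prometheus HTTP API instead of the web console. The following is a minimal Python sketch and is not part of the SusQL repository. It assumes that the Prometheus service URL quoted in the README hunk above is reachable from where the script runs and needs no authentication; the function names are hypothetical and chosen only for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the in-cluster Prometheus URL quoted in the SusQL README.
# Adjust it (and add authentication) as your cluster requires.
PROMETHEUS_URL = "http://prometheus-susql.openshift-kepler-operator.svc.cluster.local:9090"


def build_query_url(label_value, base_url=PROMETHEUS_URL):
    """Build an instant-query URL (/api/v1/query) for the SusQL energy metric."""
    promql = 'susql_total_energy_joules{susql_label_1="%s"}' % label_value
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def total_energy_joules(payload):
    """Sum the sample values of a decoded instant-query (vector) response."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % (payload,))
    # Each sample is {"metric": {...}, "value": [timestamp, "<value as string>"]}.
    return sum(float(sample["value"][1]) for sample in payload["data"]["result"])


def joules_to_kwh(joules):
    """Convert joules to kilowatt-hours (1 kWh = 3.6e6 J)."""
    return joules / 3.6e6


def fetch_total_energy(label_value, base_url=PROMETHEUS_URL):
    """Run the query over HTTP; works only where the Prometheus service is reachable."""
    with urllib.request.urlopen(build_query_url(label_value, base_url), timeout=10) as resp:
        return total_energy_joules(json.load(resp))
```

From a pod or notebook inside the cluster, `joules_to_kwh(fetch_total_energy("openshiftaij"))` would then report the aggregated energy in kWh, provided the assumptions above hold. The URL building and response parsing are kept separate from the network call so that they can be checked without cluster access.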