feat: Enable setting a toleration for GPU tests (#152)
Enable setting a `toleration` for workload pods created by GPU tests.
This is achieved by creating a PodDefault that adds the toleration
from the user's input to pods with the label `enable-gpu: "true"`.
More details in spec KF113. To complement the feature, the following
changes are made as well:
* Refactor the README.md a bit in order to not introduce duplicate instructions
* Modify the retry since tests may wait for a GPU node to be scaled up

Fix #151
orfeas-k authored Jan 17, 2025
1 parent 29ea01f commit d6f279a
Showing 8 changed files with 191 additions and 66 deletions.
74 changes: 50 additions & 24 deletions README.md
@@ -24,7 +24,9 @@ found in the [Run the tests](#run-the-tests) section.
* [A subset of UATs](#run-a-subset-of-uats)
* [Kubeflow UATs](#run-kubeflow-uats)
* [MLflow UATs](#run-mlflow-uats)
* [Include NVIDIA GPU UATs](#include-nvidia-gpu-uats)
* [NVIDIA GPU UAT](#nvidia-gpu-uat)
* [From inside a notebook](#run-nvidia-gpu-uat-from-inside-a-notebook)
* [Using the `driver`](#run-nvidia-gpu-uat-using-the-driver)
* [Behind proxy](#run-behind-proxy)
* [Prerequisites for KServe UATs](#prerequisites-for-kserve-uats)
* [From inside a notebook](#running-using-notebook)
@@ -72,7 +74,7 @@ NOTE: Depending on the version of Charmed Kubeflow you want to test, make sure t
* Navigate to `Advanced options` > `Configurations`
* Select all available configurations in order for Kubeflow integrations to work as expected
* Launch the Notebook and wait for it to be created
* Start a new terminal session and clone this repo locally:
* From inside the Notebook, start a new terminal session and clone this repo locally:

```bash
git clone https://github.com/canonical/charmed-kubeflow-uats.git
@@ -82,8 +84,9 @@ NOTE: Depending on the version of Charmed Kubeflow you want to test, make sure t
```bash
cd charmed-kubeflow-uats/tests
```
* Follow the instructions of the provided [README.md](tests/README.md) to execute the test suite
with `pytest`
* There are two options here:
1. Follow the instructions of the provided [README.md](tests/README.md) to execute the test suite with `pytest`
2. For each `.ipynb` test file of interest, open it and run the Notebook.

### Running from a configured management environment using the `driver`

@@ -171,7 +174,30 @@ tox -e mlflow-remote
tox -e mlflow-local
```

#### Include NVIDIA GPU UATs
### NVIDIA GPU UAT

#### Run NVIDIA GPU UAT from inside a notebook

##### Prerequisites
If a [taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) is used to prevent scheduling unintended workloads on GPU nodes, a toleration is needed for GPU tests to be able to schedule their workloads. To ensure that pods created by GPU tests have the proper toleration:
1. Edit the [PodDefault](./assets/gpu-toleration-poddefault.yaml.j2) to replace the placeholder under `tolerations` with your own toleration e.g.
```yaml
tolerations:
  - key: "MyKey"
    value: "gpu"
    effect: "NoSchedule"
```
2. Apply the PodDefault to the namespace where you'll be running the tests.
```bash
kubectl apply -f ./assets/gpu-toleration-poddefault.yaml.j2 -n <your_namespace>
```

If no taint is used, there are no prerequisites.
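A toleration matches a taint when its `key`, `effect`, and (for operator `Equal`) `value` line up with the taint's fields. The following Python sketch illustrates the matching rule in simplified form (an illustration only, not the actual Kubernetes scheduler logic):

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified check of whether a toleration matches a taint."""
    # An empty effect in the toleration matches taints with any effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # With Exists, no value is required; an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Default operator Equal: both key and value must match the taint.
    return toleration.get("key") == taint["key"] and toleration.get("value") == taint["value"]


taint = {"key": "MyKey", "value": "gpu", "effect": "NoSchedule"}
print(tolerates({"key": "MyKey", "value": "gpu", "effect": "NoSchedule"}, taint))  # True
print(tolerates({"key": "Other", "operator": "Exists"}, taint))  # False
```

If the toleration in the PodDefault does not match the node's taint under these rules, the GPU workload pods will remain unschedulable.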

##### Steps
In order to run the NVIDIA GPU UAT from inside a notebook, follow the same steps described in the [From inside a notebook](#running-inside-a-notebook) section above.

#### Run NVIDIA GPU UAT using the driver

By default, [GPU UATs](./tests/notebooks/gpu/) are not included in any of the `tox` environments since they require a cluster with a GPU. In order to include those, use the `--include-gpu-tests` flag, e.g.

@@ -184,6 +210,23 @@ tox -e uats-remote -- --include-gpu-tests --filter "kfp"

As shown in the example above, tests under the `gpu` directory follow the same filters as the rest of the tests.

##### Taints and tolerations

If a [taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) is used to prevent scheduling unintended workloads on GPU nodes, a toleration is needed for GPU tests to be able to schedule their workloads. This is achieved via the `--toleration` argument, which accepts the sub-arguments `key`, `operator`, `value`, `effect`, and `seconds`. For example:

```bash
# Here's an example taint the GPU node may have
# taints:
# effect: NoSchedule
# key: MyKey
# value: MyValue

tox -e uats-remote -- --include-gpu-tests --toleration key="MyKey" value="MyValue" effect="NoSchedule"
```

The driver will populate the [PodDefault](./assets/gpu-toleration-poddefault.yaml.j2) with the passed toleration values and apply it, ensuring that the toleration is added to workload pods requiring a GPU. Since most fields are optional, make sure that the toleration passed is a valid one by consulting relevant [Kubernetes docs](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling).
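For the example invocation above (`--toleration key="MyKey" value="MyValue" effect="NoSchedule"`), the rendered PodDefault would look roughly like the following sketch (field order and spacing may differ from the actual template output):

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: gpu-toleration
spec:
  desc: Add toleration to pods with label enable-gpu = 'true' in order to enable GPU access.
  tolerations:
    - key: MyKey
      value: MyValue
      effect: NoSchedule
  selector:
    matchLabels:
      enable-gpu: "true"
```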


### Run behind proxy

#### Prerequisites for KServe UATs
@@ -257,26 +300,9 @@ To run the tests behind proxy using Notebook:
```bash
microk8s kubectl apply -f ./tests/proxy-poddefault.yaml -n <your_namespace>
```
3. Create a Notebook and from the `Advanced Options > Configurations` select `Add proxy settings`,
then click `Launch` to start the Notebook.
Wait for the Notebook to be Ready, then Connect to it.
4. From inside the Notebook, start a new terminal session and clone this repo:
```bash
git clone https://github.com/canonical/charmed-kubeflow-uats.git
```
Open the `charmed-kubeflow-uats/tests` directory and for each `.ipynb` test file there, open it
and run the Notebook.
3. Continue as described in the [Running inside a Notebook](#running-inside-a-notebook) section above.
Currently, the following tests are supported to run behind proxy:
* e2e-wine
* katib
* kfp_v2
* kserve
* mlflow
* mlflow-kserve
* mlflow-minio
* training
Currently, all tests are supported to run behind proxy except kfp-v1.
#### Running using `driver`
38 changes: 38 additions & 0 deletions assets/gpu-toleration-poddefault.yaml.j2
@@ -0,0 +1,38 @@
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: gpu-toleration
spec:
  desc: Add toleration to pods with label enable-gpu = 'true' in order to enable GPU access.
  tolerations:
    -
      {% if key %}
      key: {{ key }}
      {% endif %}
      {% if operator %}
      operator: {{ operator }}
      {% endif %}
      {% if value %}
      value: {{ value }}
      {% endif %}
      {% if effect %}
      effect: {{ effect }}
      {% endif %}
      {% if seconds %}
      tolerationSeconds: {{ seconds }}
      {% endif %}
  _example_toleration:
    ################################
    #                              #
    #    EXAMPLE CONFIGURATION     #
    #                              #
    ################################
    # This just serves as an example for how to configure the values of a toleration
    # In order to ensure the toleration is valid, please consult Kubernetes documentation
    # https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling
    - key: "MyKey"
      operator: "Exists"
      effect: "NoSchedule"
  selector:
    matchLabels:
      enable-gpu: "true"
14 changes: 14 additions & 0 deletions driver/conftest.py
@@ -13,6 +13,8 @@ def pytest_addoption(parser: Parser):
https://docs.pytest.org/en/7.4.x/reference/reference.html#command-line-flags)
* Add an `--include-gpu-tests` flag to include the tests under the `gpu` directory
in the executed tests.
* Add a `--toleration` option that enables setting a `toleration` entry for pods
with the enable-gpu = 'true' label.
"""
parser.addoption(
"--proxy",
@@ -39,3 +41,15 @@ def pytest_addoption(parser: Parser):
help="Defines whether to include the tests under the `gpu` directory in the executed tests."
"By default, it is set to False.",
)
parser.addoption(
"--toleration",
nargs="+",
help="Set a number of key-value pairs for the toleration needed to access a GPU node. With the"
" use of a PodDefault, the toleration is set to pods that have the label enable-gpu='true'."
" Example:"
" --toleration key='key1' operator='Equal' value='value1' effect='NoSchedule' seconds='3600'."
" Since most fields are optional, ensure that the toleration passed is a valid one by"
" consulting relevant Kubernetes docs:\n"
" https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling.",
action="store",
)
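The `--toleration` value arrives as a list of `key=value` tokens (because of `nargs="+"`); splitting each token on the first `=` yields the context used to render the PodDefault template. A minimal sketch of that parsing (the helper name is hypothetical):

```python
from typing import Dict, List


def parse_pairs(tokens: List[str]) -> Dict[str, str]:
    """Turn ['key=MyKey', 'effect=NoSchedule'] into {'key': 'MyKey', 'effect': 'NoSchedule'}."""
    context = {}
    for token in tokens:
        key, _, value = token.partition("=")  # split on the first '=' only
        context[key] = value.strip("'\"")  # tolerate values quoted in the shell
    return context


print(parse_pairs(["key='MyKey'", "value='MyValue'", "effect='NoSchedule'"]))
# {'key': 'MyKey', 'value': 'MyValue', 'effect': 'NoSchedule'}
```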
46 changes: 20 additions & 26 deletions driver/test_kubeflow_workloads.py
@@ -6,7 +6,6 @@
import subprocess
import time
from pathlib import Path
from typing import Dict

import pytest
from lightkube import ApiError, Client, codecs
@@ -20,6 +19,8 @@
assert_namespace_active,
assert_poddefault_created_in_namespace,
assert_profile_deleted,
context_from,
create_poddefault,
fetch_job_logs,
wait_for_job,
)
@@ -54,6 +55,7 @@
plural="poddefaults",
)
PODDEFAULT_WITH_PROXY_PATH = Path("tests") / "proxy-poddefault.yaml.j2"
PODDEFAULT_WITH_TOLERATION_PATH = Path("assets") / "gpu-toleration-poddefault.yaml.j2"

KFP_PODDEFAULT_NAME = "access-ml-pipeline"

@@ -119,30 +121,30 @@ def create_profile(lightkube_client):


@pytest.fixture(scope="function")
def create_poddefaults_on_proxy(request, lightkube_client):
def create_poddefault_on_proxy(request, lightkube_client):
"""Create PodDefault with proxy env variables for the Notebook inside the Job."""
# Simply yield if the proxy flag is not set
if not request.config.getoption("proxy"):
yield
else:
log.info("Adding PodDefault with proxy settings.")
poddefault_resource = codecs.load_all_yaml(
PODDEFAULT_WITH_PROXY_PATH.read_text(),
context=proxy_context(request),
yield from create_poddefault(
PODDEFAULT_WITH_PROXY_PATH, context_from("proxy", request), NAMESPACE, lightkube_client
)
# Using the first item of the list of poddefault_resource. It is a one item list.
lightkube_client.create(poddefault_resource[0], namespace=NAMESPACE)

yield

# delete the PodDefault at the end of the module tests
log.info("Deleting PodDefault...")
poddefault_resource = codecs.load_all_yaml(
PODDEFAULT_WITH_PROXY_PATH.read_text(),
context=proxy_context(request),
@pytest.fixture(scope="function")
def create_poddefault_on_toleration(request, lightkube_client):
"""Create PodDefault with toleration for workload pods created by GPU tests."""
# Simply yield if the proxy flag is not set
if not request.config.getoption("toleration"):
yield
else:
yield from create_poddefault(
PODDEFAULT_WITH_TOLERATION_PATH,
context_from("toleration", request),
NAMESPACE,
lightkube_client,
)
poddefault_name = poddefault_resource[0].metadata.name
lightkube_client.delete(PODDEFAULT_RESOURCE, name=poddefault_name, namespace=NAMESPACE)


@pytest.mark.dependency()
@@ -191,7 +193,8 @@ def test_kubeflow_workloads(
pytest_cmd,
tests_checked_out_commit,
request,
create_poddefaults_on_proxy,
create_poddefault_on_proxy,
create_poddefault_on_toleration,
):
"""Run a K8s Job to execute the notebook tests."""
log.info(f"Starting Kubernetes Job {NAMESPACE}/{JOB_NAME} to run notebook tests...")
@@ -223,12 +226,3 @@ def test_kubeflow_workloads(
finally:
log.info("Fetching Job logs...")
fetch_job_logs(JOB_NAME, NAMESPACE, TESTS_LOCAL_RUN)


def proxy_context(request) -> Dict[str, str]:
"""Return a dictionary with proxy environment variables from user input."""
proxy_context = {}
for proxy in request.config.getoption("proxy"):
key, value = proxy.split("=")
proxy_context[key] = value
return proxy_context
41 changes: 40 additions & 1 deletion driver/utils.py
@@ -3,9 +3,10 @@

import logging
import subprocess
from typing import Dict

import tenacity
from lightkube import ApiError, Client
from lightkube import ApiError, Client, codecs
from lightkube.generic_resource import create_global_resource, create_namespaced_resource
from lightkube.resources.batch_v1 import Job
from lightkube.resources.core_v1 import Namespace
@@ -151,3 +152,41 @@ def assert_profile_deleted(client, profile_name, logger: logging.Logger):
logger.info(f"Waiting for Profile {profile_name} to be deleted..")

assert deleted, f"Waited too long for Profile {profile_name} to be deleted!"


def context_from(argument: str, request) -> Dict[str, str]:
"""Return a dictionary with key-value entries from the CLI argument."""
context = {}
for pair in request.config.getoption(argument):
key, value = pair.split("=")
context[key] = value
return context


def create_poddefault(
poddefault_path: str, poddefault_context: Dict[str, str], namespace: str, lightkube_client
):
"""Apply the PodDefault from the path after rendering it with the passed context.

Once execution is complete, delete the created PodDefault.
"""
poddefault_resource = codecs.load_all_yaml(
poddefault_path.read_text(),
poddefault_context,
)
poddefault_name = poddefault_resource[0].metadata.name
log.info(f"Adding {poddefault_name} PodDefault...")
# Using the first item of the list of poddefault_resource. It is a one item list.
lightkube_client.create(poddefault_resource[0], namespace=namespace)

yield

# delete the PodDefault at the end of the module tests
poddefault_resource = codecs.load_all_yaml(
poddefault_path.read_text(),
poddefault_context,
)
poddefault_name = poddefault_resource[0].metadata.name
log.info(f"Deleting {poddefault_name} PodDefault...")
lightkube_client.delete(PODDEFAULT_RESOURCE, name=poddefault_name, namespace=namespace)
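`create_poddefault` is a generator consumed with `yield from` inside the pytest fixtures: everything before its `yield` runs at fixture setup, everything after runs at teardown. A stripped-down sketch of this delegation pattern (names are illustrative, not from the repo):

```python
events = []


def create_resource(name):
    events.append(f"create {name}")  # setup: apply the resource
    yield
    events.append(f"delete {name}")  # teardown: clean it up


def fixture():
    # Equivalent of a pytest fixture body delegating with `yield from`
    yield from create_resource("gpu-toleration")


gen = fixture()
next(gen)  # pytest runs setup code up to the yield
# ... the test body would run here ...
try:
    next(gen)  # pytest resumes the generator, running teardown
except StopIteration:
    pass
print(events)  # ['create gpu-toleration', 'delete gpu-toleration']
```

This is why a single helper can serve both the proxy and the toleration fixtures: each fixture just forwards its own template path and context.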
@@ -32,7 +32,7 @@
"import kfp\n",
"import os\n",
"\n",
"from kfp import dsl\n",
"from kfp import dsl, kubernetes\n",
"from tenacity import retry, stop_after_attempt, wait_exponential"
]
},
@@ -123,14 +123,22 @@
"@dsl.pipeline\n",
"def gpu_check_pipeline() -> str:\n",
" \"\"\"Create a pipeline that runs code to check access to a GPU.\"\"\"\n",
" gpu_check1 = add_gpu_request(gpu_check())\n",
" gpu_check1 = kubernetes.add_pod_label(\n",
" add_gpu_request(gpu_check()),\n",
" label_key=\"enable-gpu\",\n",
" label_value=\"true\",\n",
" )\n",
" return gpu_check1.output\n",
"\n",
"\n",
"@dsl.pipeline\n",
"def gpu_check_pipeline_proxy() -> str:\n",
" \"\"\"Create a pipeline that runs code to check access to a GPU and sets the appropriate proxy ENV variables.\"\"\"\n",
" gpu_check1 = add_proxy(add_gpu_request(gpu_check()))\n",
" gpu_check1 = kubernetes.add_pod_label(\n",
" add_proxy(add_gpu_request(gpu_check())),\n",
" label_key=\"enable-gpu\",\n",
" label_value=\"true\",\n",
" )\n",
" return gpu_check1.output"
]
},
@@ -180,8 +188,8 @@
"outputs": [],
"source": [
"@retry(\n",
" wait=wait_exponential(multiplier=2, min=1, max=10),\n",
" stop=stop_after_attempt(30),\n",
" wait=wait_exponential(multiplier=4, min=1, max=30),\n",
" stop=stop_after_attempt(24),\n",
" reraise=True,\n",
")\n",
"def assert_run_succeeded(client, run_id):\n",
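The retry change above widens the wait window so a GPU node has time to scale up. Assuming an exponential backoff of `multiplier * 2**(n-1)` clamped to `[min, max]` between attempts (a simplification of tenacity's `wait_exponential`), the worst-case total wait can be estimated:

```python
def total_wait(multiplier: int, min_wait: int, max_wait: int, attempts: int) -> int:
    """Approximate worst-case seconds spent waiting between retry attempts."""
    waits = [
        min(max_wait, max(min_wait, multiplier * 2 ** (n - 1)))
        for n in range(1, attempts)  # one wait between each pair of attempts
    ]
    return sum(waits)


print(total_wait(4, 1, 30, 24))  # new settings: 628 seconds, roughly 10.5 minutes
print(total_wait(2, 1, 10, 30))  # old settings: 274 seconds, roughly 4.5 minutes
```

Under this approximation the new settings more than double the time the test will keep polling the pipeline run.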
1 change: 1 addition & 0 deletions tests/notebooks/gpu/kfp-tensorflow/requirements.in
@@ -1,2 +1,3 @@
kfp>=2.4,<3.0
tenacity
kfp-kubernetes