feat: Enable setting a toleration for GPU tests (#152)
Enable setting a `toleration` for workload pods created by GPU tests.
This is achieved by creating a PodDefault that adds the toleration
from the user's input to pods with the label `enable-gpu: "true"`.
More details in spec KF113. To complement the feature, the following
changes are made as well:
* Refactor the README.md a bit in order to not introduce duplicate instructions
* Modify the retry since tests may wait for a GPU node to be scaled up

Fix #151
orfeas-k authored Jan 17, 2025
1 parent 29ea01f commit d6f279a
Showing 8 changed files with 191 additions and 66 deletions.
74 changes: 50 additions & 24 deletions README.md
@@ -24,7 +24,9 @@ found in the [Run the tests](#run-the-tests) section.
* [A subset of UATs](#run-a-subset-of-uats)
* [Kubeflow UATs](#run-kubeflow-uats)
* [MLflow UATs](#run-mlflow-uats)
* [Include NVIDIA GPU UATs](#include-nvidia-gpu-uats)
* [NVIDIA GPU UAT](#nvidia-gpu-uat)
* [From inside a notebook](#run-nvidia-gpu-uat-from-inside-a-notebook)
* [Using the `driver`](#run-nvidia-gpu-uat-using-the-driver)
* [Behind proxy](#run-behind-proxy)
* [Prerequisites for KServe UATs](#prerequisites-for-kserve-uats)
* [From inside a notebook](#running-using-notebook)
@@ -72,7 +74,7 @@ NOTE: Depending on the version of Charmed Kubeflow you want to test, make sure t
* Navigate to `Advanced options` > `Configurations`
* Select all available configurations in order for Kubeflow integrations to work as expected
* Launch the Notebook and wait for it to be created
* Start a new terminal session and clone this repo locally:
* From inside the Notebook, start a new terminal session and clone this repo locally:

```bash
git clone https://github.com/canonical/charmed-kubeflow-uats.git
@@ -82,8 +84,9 @@ NOTE: Depending on the version of Charmed Kubeflow you want to test, make sure t
```bash
cd charmed-kubeflow-uats/tests
```
* Follow the instructions of the provided [README.md](tests/README.md) to execute the test suite
with `pytest`
* There are two options here:
1. Follow the instructions of the provided [README.md](tests/README.md) to execute the test suite with `pytest`
2. For each `.ipynb` test file of interest, open it and run the Notebook.

### Running from a configured management environment using the `driver`

@@ -171,7 +174,30 @@ tox -e mlflow-remote
tox -e mlflow-local
```

#### Include NVIDIA GPU UATs
### NVIDIA GPU UAT

#### Run NVIDIA GPU UAT from inside a notebook

##### Prerequisites
If a [taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) is used to prevent scheduling unintended workloads on GPU nodes, a toleration is needed for GPU tests to be able to schedule their workloads. To ensure that pods created by GPU tests have the proper toleration:
1. Edit the [PodDefault](./assets/gpu-toleration-poddefault.yaml.j2) to replace the placeholder under `tolerations` with your own toleration e.g.
```yaml
tolerations:
  - key: "MyKey"
    value: "gpu"
    effect: "NoSchedule"
```
2. Apply the PodDefault to the namespace where you'll be running the tests.
```bash
kubectl apply -f ./assets/gpu-toleration-poddefault.yaml.j2 -n <your_namespace>
```

If no taint is used, there are no prerequisites.
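A toleration matches a taint when its `key`, `effect`, and (for operator `Equal`) `value` line up with the taint's fields. The following Python sketch illustrates the matching rule in simplified form (an illustration only, not the actual Kubernetes scheduler logic):

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified check of whether a toleration matches a taint."""
    # An empty effect in the toleration matches taints with any effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # With Exists, no value is required; an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Default operator Equal: both key and value must match the taint.
    return toleration.get("key") == taint["key"] and toleration.get("value") == taint["value"]


taint = {"key": "MyKey", "value": "gpu", "effect": "NoSchedule"}
print(tolerates({"key": "MyKey", "value": "gpu", "effect": "NoSchedule"}, taint))  # True
print(tolerates({"key": "Other", "operator": "Exists"}, taint))  # False
```

If the toleration in the PodDefault does not match the node's taint under these rules, the GPU workload pods will remain unschedulable.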

##### Steps
In order to run the NVIDIA GPU UAT from inside a notebook, follow the same steps described in the [From inside a notebook](#running-inside-a-notebook) section above.

#### Run NVIDIA GPU UAT using the driver

By default, [GPU UATs](./tests/notebooks/gpu/) are not included in any of the `tox` environments since they require a cluster with a GPU. In order to include those, use the `--include-gpu-tests` flag, e.g.

@@ -184,6 +210,23 @@ tox -e uats-remote -- --include-gpu-tests --filter "kfp"

As shown in the example above, tests under the `gpu` directory follow the same filters as the rest of the tests.

##### Taints and tolerations

If a [taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) is used to prevent scheduling unintended workloads on GPU nodes, a toleration is needed for GPU tests to be able to schedule their workloads. This is achieved via the `--toleration` argument, which accepts the sub-arguments `key`, `operator`, `value`, `effect`, and `seconds`. For example:

```bash
# Here's an example taint the GPU node may have
# taints:
# effect: NoSchedule
# key: MyKey
# value: MyValue

tox -e uats-remote -- --include-gpu-tests --toleration key="MyKey" value="MyValue" effect="NoSchedule"
```

The driver will populate the [PodDefault](./assets/gpu-toleration-poddefault.yaml.j2) with the passed toleration values and apply it, ensuring that the toleration is added to workload pods requiring a GPU. Since most fields are optional, make sure that the toleration passed is a valid one by consulting relevant [Kubernetes docs](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling).
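For the example invocation above (`--toleration key="MyKey" value="MyValue" effect="NoSchedule"`), the rendered PodDefault would look roughly like the following sketch (field order and spacing may differ from the actual template output):

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: gpu-toleration
spec:
  desc: Add toleration to pods with label enable-gpu = 'true' in order to enable GPU access.
  tolerations:
    - key: MyKey
      value: MyValue
      effect: NoSchedule
  selector:
    matchLabels:
      enable-gpu: "true"
```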


### Run behind proxy

#### Prerequisites for KServe UATs
@@ -257,26 +300,9 @@ To run the tests behind proxy using Notebook:
```bash
microk8s kubectl apply -f ./tests/proxy-poddefault.yaml -n <your_namespace>
```
3. Create a Notebook and from the `Advanced Options > Configurations` select `Add proxy settings`,
then click `Launch` to start the Notebook.
Wait for the Notebook to be Ready, then Connect to it.
4. From inside the Notebook, start a new terminal session and clone this repo:
```bash
git clone https://github.com/canonical/charmed-kubeflow-uats.git
```
Open the `charmed-kubeflow-uats/tests` directory and for each `.ipynb` test file there, open it
and run the Notebook.
3. Continue as described in the [Running inside a Notebook](#running-inside-a-notebook) section above.
Currently, the following tests are supported to run behind proxy:
* e2e-wine
* katib
* kfp_v2
* kserve
* mlflow
* mlflow-kserve
* mlflow-minio
* training
Currently, all tests are supported to run behind proxy except kfp-v1.
#### Running using `driver`
38 changes: 38 additions & 0 deletions assets/gpu-toleration-poddefault.yaml.j2
@@ -0,0 +1,38 @@
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: gpu-toleration
spec:
  desc: Add toleration to pods with label enable-gpu = 'true' in order to enable GPU access.
  tolerations:
    -
      {% if key %}
      key: {{ key }}
      {% endif %}
      {% if operator %}
      operator: {{ operator }}
      {% endif %}
      {% if value %}
      value: {{ value }}
      {% endif %}
      {% if effect %}
      effect: {{ effect }}
      {% endif %}
      {% if seconds %}
      tolerationSeconds: {{ seconds }}
      {% endif %}
  _example_toleration:
    ################################
    #                              #
    #    EXAMPLE CONFIGURATION     #
    #                              #
    ################################
    # This just serves as an example for how to configure the values of a toleration
    # In order to ensure the toleration is valid, please consult Kubernetes documentation
    # https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling
    - key: "MyKey"
      operator: "Exists"
      effect: "NoSchedule"
  selector:
    matchLabels:
      enable-gpu: "true"
14 changes: 14 additions & 0 deletions driver/conftest.py
@@ -13,6 +13,8 @@ def pytest_addoption(parser: Parser):
https://docs.pytest.org/en/7.4.x/reference/reference.html#command-line-flags)
* Add an `--include-gpu-tests` flag to include the tests under the `gpu` directory
in the executed tests.
* Add a `--toleration` option that enables setting a `toleration` entry for pods
with the enable-gpu = 'true' label.
"""
parser.addoption(
"--proxy",
@@ -39,3 +41,15 @@ def pytest_addoption(parser: Parser):
help="Defines whether to include the tests under the `gpu` directory in the executed tests."
"By default, it is set to False.",
)
parser.addoption(
"--toleration",
nargs="+",
help="Set a number of key-value pairs for the toleration needed to access a GPU node. With the"
" use of a PodDefault, the toleration is set to pods that have the label enable-gpu='true'."
" Example:"
" --toleration key='key1' operator='Equal' value='value1' effect='NoSchedule' seconds='3600'."
" Since most fields are optional, ensure that the toleration passed is a valid one by"
" consulting relevant Kubernetes docs:\n"
" https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling.",
action="store",
)
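The `--toleration` value arrives as a list of `key=value` tokens (because of `nargs="+"`); splitting each token on the first `=` yields the context used to render the PodDefault template. A minimal sketch of that parsing (the helper name is hypothetical):

```python
from typing import Dict, List


def parse_pairs(tokens: List[str]) -> Dict[str, str]:
    """Turn ['key=MyKey', 'effect=NoSchedule'] into {'key': 'MyKey', 'effect': 'NoSchedule'}."""
    context = {}
    for token in tokens:
        key, _, value = token.partition("=")  # split on the first '=' only
        context[key] = value.strip("'\"")  # tolerate values quoted in the shell
    return context


print(parse_pairs(["key='MyKey'", "value='MyValue'", "effect='NoSchedule'"]))
# {'key': 'MyKey', 'value': 'MyValue', 'effect': 'NoSchedule'}
```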
46 changes: 20 additions & 26 deletions driver/test_kubeflow_workloads.py
@@ -6,7 +6,6 @@
import subprocess
import time
from pathlib import Path
from typing import Dict

import pytest
from lightkube import ApiError, Client, codecs
@@ -20,6 +19,8 @@
assert_namespace_active,
assert_poddefault_created_in_namespace,
assert_profile_deleted,
context_from,
create_poddefault,
fetch_job_logs,
wait_for_job,
)
@@ -54,6 +55,7 @@
plural="poddefaults",
)
PODDEFAULT_WITH_PROXY_PATH = Path("tests") / "proxy-poddefault.yaml.j2"
PODDEFAULT_WITH_TOLERATION_PATH = Path("assets") / "gpu-toleration-poddefault.yaml.j2"

KFP_PODDEFAULT_NAME = "access-ml-pipeline"

@@ -119,30 +121,30 @@ def create_profile(lightkube_client):


@pytest.fixture(scope="function")
def create_poddefaults_on_proxy(request, lightkube_client):
def create_poddefault_on_proxy(request, lightkube_client):
"""Create PodDefault with proxy env variables for the Notebook inside the Job."""
# Simply yield if the proxy flag is not set
if not request.config.getoption("proxy"):
yield
else:
log.info("Adding PodDefault with proxy settings.")
poddefault_resource = codecs.load_all_yaml(
PODDEFAULT_WITH_PROXY_PATH.read_text(),
context=proxy_context(request),
yield from create_poddefault(
PODDEFAULT_WITH_PROXY_PATH, context_from("proxy", request), NAMESPACE, lightkube_client
)
# Using the first item of the list of poddefault_resource. It is a one item list.
lightkube_client.create(poddefault_resource[0], namespace=NAMESPACE)

yield

# delete the PodDefault at the end of the module tests
log.info("Deleting PodDefault...")
poddefault_resource = codecs.load_all_yaml(
PODDEFAULT_WITH_PROXY_PATH.read_text(),
context=proxy_context(request),
@pytest.fixture(scope="function")
def create_poddefault_on_toleration(request, lightkube_client):
"""Create PodDefault with toleration for workload pods created by GPU tests."""
# Simply yield if the proxy flag is not set
if not request.config.getoption("toleration"):
yield
else:
yield from create_poddefault(
PODDEFAULT_WITH_TOLERATION_PATH,
context_from("toleration", request),
NAMESPACE,
lightkube_client,
)
poddefault_name = poddefault_resource[0].metadata.name
lightkube_client.delete(PODDEFAULT_RESOURCE, name=poddefault_name, namespace=NAMESPACE)


@pytest.mark.dependency()
@@ -191,7 +193,8 @@ def test_kubeflow_workloads(
pytest_cmd,
tests_checked_out_commit,
request,
create_poddefaults_on_proxy,
create_poddefault_on_proxy,
create_poddefault_on_toleration,
):
"""Run a K8s Job to execute the notebook tests."""
log.info(f"Starting Kubernetes Job {NAMESPACE}/{JOB_NAME} to run notebook tests...")
@@ -223,12 +226,3 @@ def test_kubeflow_workloads(
finally:
log.info("Fetching Job logs...")
fetch_job_logs(JOB_NAME, NAMESPACE, TESTS_LOCAL_RUN)


def proxy_context(request) -> Dict[str, str]:
"""Return a dictionary with proxy environment variables from user input."""
proxy_context = {}
for proxy in request.config.getoption("proxy"):
key, value = proxy.split("=")
proxy_context[key] = value
return proxy_context
41 changes: 40 additions & 1 deletion driver/utils.py
@@ -3,9 +3,10 @@

import logging
import subprocess
from typing import Dict

import tenacity
from lightkube import ApiError, Client
from lightkube import ApiError, Client, codecs
from lightkube.generic_resource import create_global_resource, create_namespaced_resource
from lightkube.resources.batch_v1 import Job
from lightkube.resources.core_v1 import Namespace
@@ -151,3 +152,41 @@ def assert_profile_deleted(client, profile_name, logger: logging.Logger):
logger.info(f"Waiting for Profile {profile_name} to be deleted..")

assert deleted, f"Waited too long for Profile {profile_name} to be deleted!"


def context_from(argument: str, request) -> Dict[str, str]:
"""Return a dictionary with key-value entries from the CLI argument."""
context = {}
for pair in request.config.getoption(argument):
key, value = pair.split("=")
context[key] = value
return context


def create_poddefault(
poddefault_path: str, poddefault_context: Dict[str, str], namespace: str, lightkube_client
):
"""Apply the PodDefault from the path after rendering it with the passed context.

Once execution is complete, delete the created PodDefault.
"""
poddefault_resource = codecs.load_all_yaml(
poddefault_path.read_text(),
poddefault_context,
)
poddefault_name = poddefault_resource[0].metadata.name
log.info(f"Adding {poddefault_name} PodDefault...")
# Using the first item of the list of poddefault_resource. It is a one item list.
lightkube_client.create(poddefault_resource[0], namespace=namespace)

yield

# delete the PodDefault at the end of the module tests
poddefault_resource = codecs.load_all_yaml(
poddefault_path.read_text(),
poddefault_context,
)
poddefault_name = poddefault_resource[0].metadata.name
log.info(f"Deleting {poddefault_name} PodDefault...")
lightkube_client.delete(PODDEFAULT_RESOURCE, name=poddefault_name, namespace=namespace)
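`create_poddefault` is a generator consumed with `yield from` inside the pytest fixtures: everything before its `yield` runs at fixture setup, everything after runs at teardown. A stripped-down sketch of this delegation pattern (names are illustrative, not from the repo):

```python
events = []


def create_resource(name):
    events.append(f"create {name}")  # setup: apply the resource
    yield
    events.append(f"delete {name}")  # teardown: clean it up


def fixture():
    # Equivalent of a pytest fixture body delegating with `yield from`
    yield from create_resource("gpu-toleration")


gen = fixture()
next(gen)  # pytest runs setup code up to the yield
# ... the test body would run here ...
try:
    next(gen)  # pytest resumes the generator, running teardown
except StopIteration:
    pass
print(events)  # ['create gpu-toleration', 'delete gpu-toleration']
```

This is why a single helper can serve both the proxy and the toleration fixtures: each fixture just forwards its own template path and context.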
@@ -32,7 +32,7 @@
"import kfp\n",
"import os\n",
"\n",
"from kfp import dsl\n",
"from kfp import dsl, kubernetes\n",
"from tenacity import retry, stop_after_attempt, wait_exponential"
]
},
@@ -123,14 +123,22 @@
"@dsl.pipeline\n",
"def gpu_check_pipeline() -> str:\n",
" \"\"\"Create a pipeline that runs code to check access to a GPU.\"\"\"\n",
" gpu_check1 = add_gpu_request(gpu_check())\n",
" gpu_check1 = kubernetes.add_pod_label(\n",
" add_gpu_request(gpu_check()),\n",
" label_key=\"enable-gpu\",\n",
" label_value=\"true\",\n",
" )\n",
" return gpu_check1.output\n",
"\n",
"\n",
"@dsl.pipeline\n",
"def gpu_check_pipeline_proxy() -> str:\n",
" \"\"\"Create a pipeline that runs code to check access to a GPU and sets the appropriate proxy ENV variables.\"\"\"\n",
" gpu_check1 = add_proxy(add_gpu_request(gpu_check()))\n",
" gpu_check1 = kubernetes.add_pod_label(\n",
" add_proxy(add_gpu_request(gpu_check())),\n",
" label_key=\"enable-gpu\",\n",
" label_value=\"true\",\n",
" )\n",
" return gpu_check1.output"
]
},
@@ -180,8 +188,8 @@
"outputs": [],
"source": [
"@retry(\n",
" wait=wait_exponential(multiplier=2, min=1, max=10),\n",
" stop=stop_after_attempt(30),\n",
" wait=wait_exponential(multiplier=4, min=1, max=30),\n",
" stop=stop_after_attempt(24),\n",
" reraise=True,\n",
")\n",
"def assert_run_succeeded(client, run_id):\n",
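The retry change above widens the wait window so a GPU node has time to scale up. Assuming an exponential backoff of `multiplier * 2**(n-1)` clamped to `[min, max]` between attempts (a simplification of tenacity's `wait_exponential`), the worst-case total wait can be estimated:

```python
def total_wait(multiplier: int, min_wait: int, max_wait: int, attempts: int) -> int:
    """Approximate worst-case seconds spent waiting between retry attempts."""
    waits = [
        min(max_wait, max(min_wait, multiplier * 2 ** (n - 1)))
        for n in range(1, attempts)  # one wait between each pair of attempts
    ]
    return sum(waits)


print(total_wait(4, 1, 30, 24))  # new settings: 628 seconds, roughly 10.5 minutes
print(total_wait(2, 1, 10, 30))  # old settings: 274 seconds, roughly 4.5 minutes
```

Under this approximation the new settings more than double the time the test will keep polling the pipeline run.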
1 change: 1 addition & 0 deletions tests/notebooks/gpu/kfp-tensorflow/requirements.in
@@ -1,2 +1,3 @@
kfp>=2.4,<3.0
tenacity
kfp-kubernetes