bump `nvidia-device-plugin` to `v0.16.1` #242

andy108369 · 2024-07-29T22:40:13Z

k8s-device-plugin v0.16.1 was released 3 days ago.
Among other changes, it updates the CUDA base image to 12.5.1: https://github.com/NVIDIA/k8s-device-plugin/releases

Need to test the following:

- Check whether we can upgrade the current nvidia-device-plugin helm chart to 0.16.1 without impacting existing GPU deployments (pick a provider with the least-used GPUs; sandbox is probably the best choice).
- See whether it changes the CUDA version reported by `nvidia-smi | grep Version` (probably unrelated, but still worth checking).
- Bump 0.15.1 to 0.16.1 in the docs: https://akash.network/docs/providers/build-a-cloud-provider/gpu-resource-enablement/
- Upgrade nvidia-device-plugin across all the GPU providers.

andy108369 · 2024-08-01T15:44:18Z

Testing this on the Cato provider, which has had 0 leases since yesterday.
Currently hitting NVIDIA/k8s-device-plugin#856.

andy108369 · 2024-08-01T16:45:03Z

Figured out the issue: the new nvidia-device-plugin 0.16.x helm charts (0.16.0 rc1, 0.16.0, 0.16.1) drop the SYS_ADMIN capability, which leads to the error `unable to create plugin manager: nvml init failed: ERROR_LIBRARY_NOT_FOUND`.
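One way to spot this failure is to grep a plugin pod's logs for the error string. This is a minimal sketch: the function name `has_nvml_init_failure` is hypothetical, and the daemonset name in the usage comment assumes a helm release called `nvdp` in the `nvidia-device-plugin` namespace.

```shell
# Returns 0 if the log text on stdin contains the NVML init failure
# quoted in this thread, non-zero otherwise.
has_nvml_init_failure() {
  grep -q 'nvml init failed: ERROR_LIBRARY_NOT_FOUND'
}

# Usage (assumed daemonset name; adjust to your release):
#   kubectl -n nvidia-device-plugin logs ds/nvdp-nvidia-device-plugin \
#     | has_nvml_init_failure && echo "affected by the SYS_ADMIN issue"
```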

Let's keep using nvidia-device-plugin 0.15.1 until NVIDIA/k8s-device-plugin#856 is fixed or a better workaround is found, rather than modifying/customizing the helm chart manually.

andy108369 · 2024-08-01T16:57:40Z

For the record: restarting nvidia-device-plugin (nvdp), or even uninstalling it, does not impact existing, active GPU workloads. It only affects them if their pod gets restarted: the pod then goes into the Pending state until it finds a worker node with a GPU. If the nvdp plugin is not running, the pod stays Pending indefinitely.

As expected, it also does not change the CUDA version reported by `nvidia-smi | grep Version` (for that there are the cuda-compat-<ver> packages plus the LD_LIBRARY_PATH method to load them).
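To catch GPU pods stuck in Pending after a plugin restart, the output of `kubectl get pods -A` can be filtered on the STATUS column. A minimal sketch, assuming the default column layout (NAMESPACE, NAME, READY, STATUS, ...); the function name `list_pending_pods` is hypothetical.

```shell
# Reads `kubectl get pods -A` output on stdin and prints "namespace/name"
# for every pod whose STATUS column (4th field) is Pending.
list_pending_pods() {
  awk '$4 == "Pending" { print $1 "/" $2 }'
}

# Usage:
#   kubectl get pods -A | list_pending_pods
```

`kubectl get pods -A --field-selector=status.phase=Pending` gives a similar result directly on the cluster; the filter above is mainly handy for saved output.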

andy108369 · 2024-08-02T11:17:16Z

Workaround

The quick workaround is to pass securityContext.capabilities.add[0]=SYS_ADMIN to the chart, e.g.:

helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.16.1 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=volume-mounts \
  --set 'securityContext.capabilities.add[0]=SYS_ADMIN'
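When rolling this out across several providers, the extra flag can be made conditional on the chart version. A minimal sketch, assuming only the 0.16.x charts named in this thread are affected (whether later releases need it is not known here); both function names are hypothetical.

```shell
# Returns 0 if the given chart version is a 0.16.x release, i.e. one of the
# versions reported in this thread as dropping SYS_ADMIN. A leading "v" is ignored.
needs_sys_admin_workaround() {
  case "${1#v}" in
    0.16.*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Prints the extra helm flag for affected versions, nothing otherwise.
extra_helm_flags() {
  if needs_sys_admin_workaround "$1"; then
    printf '%s\n' '--set securityContext.capabilities.add[0]=SYS_ADMIN'
  fi
}

# Usage:
#   extra_helm_flags 0.16.1   # prints the --set flag
#   extra_helm_flags 0.15.1   # prints nothing
```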

andy108369 · 2024-08-02T11:30:17Z

I'll update our docs once a proper fix for NVIDIA/k8s-device-plugin#856 is released.

andy108369 added the repo/provider and repo/helm-charts labels Jul 29, 2024

andy108369 self-assigned this Jul 29, 2024
