In the example daemonset from the AKS documentation, the FAIL_ON_INIT_ERROR environment variable is set to false, whereas NVIDIA recommends leaving it at the default of true in their README:
When set to true, the FAIL_ON_INIT_ERROR option fails the plugin if an error is encountered during initialization. When set to false, it prints an error message and blocks the plugin indefinitely instead of failing. Blocking indefinitely follows legacy semantics that allow the plugin to deploy successfully on nodes that don't have GPUs on them (and aren't supposed to have GPUs on them) without throwing an error. In this way, you can blindly deploy a daemonset with the plugin on all nodes in your cluster, whether they have GPUs on them or not, without encountering an error. However, doing so means that there is no way to detect an actual error on nodes that are supposed to have GPUs on them. Failing if an initialization error is encountered is now the default and should be adopted by all new deployments.
Is it set up that way because the example daemonset does not include a nodeSelector and may deploy the plugin on non-GPU nodes? Or is there a different reason that one might need to use this setting due to the way AKS handles GPUs?
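To make the question concrete, here is a rough sketch of the alternative I have in mind: restrict the daemonset to GPU nodes with a nodeSelector and leave FAIL_ON_INIT_ERROR at true. This is not taken from the AKS docs. The label key and value are hypothetical placeholders, the object names are illustrative, the image is deliberately left unspecified, and the volumes and other settings from the original example are omitted.

```yaml
# Sketch only, under the assumptions stated above.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset   # illustrative name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      # Schedule the plugin only on nodes carrying a GPU label.
      # The key/value here is a hypothetical placeholder; use whatever
      # label your GPU node pool actually has.
      nodeSelector:
        example.com/gpu: "true"
      containers:
        - name: nvidia-device-plugin-ctr
          image: "<device-plugin-image>"   # placeholder, not a real image reference
          env:
            # With the plugin limited to GPU nodes, an init failure
            # would indicate a real problem, so let the plugin fail.
            - name: FAIL_ON_INIT_ERROR
              value: "true"
```

If something like this would work on AKS, the false setting in the example would seem unnecessary, which is why I am asking whether there is another reason for it.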
Question
The AKS documentation on setting up the NVIDIA device plugin for Kubernetes includes an example daemonset:
https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#manually-install-the-nvidia-device-plugin