[Question] Why does NVIDIA device plugin daemonset sample set FAIL_ON_INIT_ERROR to false? #4616

OvervCW · 2024-10-30T14:51:12Z

Question
The AKS documentation on setting up the NVIDIA device plugin for Kubernetes includes an example daemonset:

https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#manually-install-the-nvidia-device-plugin

In that daemonset, the FAIL_ON_INIT_ERROR environment variable is set to false, whereas NVIDIA's README recommends leaving it at its default of true:

When set to true, the FAIL_ON_INIT_ERROR option fails the plugin if an error is encountered during initialization. When set to false, it prints an error message and blocks the plugin indefinitely instead of failing. Blocking indefinitely follows legacy semantics that allow the plugin to deploy successfully on nodes that don't have GPUs on them (and aren't supposed to have GPUs on them) without throwing an error. In this way, you can blindly deploy a daemonset with the plugin on all nodes in your cluster, whether they have GPUs on them or not, without encountering an error. However, doing so means that there is no way to detect an actual error on nodes that are supposed to have GPUs on them. Failing if an initialization error is encountered is now the default and should be adopted by all new deployments.

Is it set up that way because the example daemonset does not include a nodeSelector and may therefore deploy the plugin on non-GPU nodes? Or is there another reason this setting is needed, related to how AKS handles GPUs?
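
To make the first possibility concrete, here is a rough sketch of the alternative I have in mind: restrict the daemonset to GPU nodes with a nodeSelector and leave FAIL_ON_INIT_ERROR at true. This is not the manifest from the AKS docs. The node label `example.com/gpu` and the image tag are placeholders I made up for illustration, and the volume mounts, tolerations, and security context a real deployment needs are left out.

```yaml
# Sketch only, under the assumptions stated above.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      # Hypothetical label: schedule the plugin only on nodes marked as GPU nodes.
      nodeSelector:
        example.com/gpu: "true"
      containers:
      - name: nvidia-device-plugin-ctr
        # Placeholder tag: substitute the version you actually deploy.
        image: nvcr.io/nvidia/k8s-device-plugin:<tag>
        env:
        # With the plugin limited to GPU nodes, an init failure is a real error,
        # so let the container fail instead of blocking indefinitely.
        - name: FAIL_ON_INIT_ERROR
          value: "true"
```

With a selector like this, the plugin would never land on a non-GPU node, so the legacy blocking behaviour described in the README would seem unnecessary.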

OvervCW added the question label Oct 30, 2024
