[Question] Why does NVIDIA device plugin daemonset sample set FAIL_ON_INIT_ERROR to false? #4616

Open
OvervCW opened this issue Oct 30, 2024 · 0 comments
OvervCW commented Oct 30, 2024

Question
The AKS documentation on setting up the NVIDIA device plugin for Kubernetes includes an example daemonset:

https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#manually-install-the-nvidia-device-plugin

In that daemonset, the FAIL_ON_INIT_ERROR environment variable is set to false, whereas NVIDIA's README recommends leaving it at its default of true:

When set to true, the FAIL_ON_INIT_ERROR option fails the plugin if an error is encountered during initialization. When set to false, it prints an error message and blocks the plugin indefinitely instead of failing. Blocking indefinitely follows legacy semantics that allow the plugin to deploy successfully on nodes that don't have GPUs on them (and aren't supposed to have GPUs on them) without throwing an error. In this way, you can blindly deploy a daemonset with the plugin on all nodes in your cluster, whether they have GPUs on them or not, without encountering an error. However, doing so means that there is no way to detect an actual error on nodes that are supposed to have GPUs on them. Failing if an initialization error is encountered is now the default and should be adopted by all new deployments.

Is it set up that way because the example daemonset does not include a nodeSelector and may deploy the plugin on non-GPU nodes? Or is there a different reason that one might need to use this setting due to the way AKS handles GPUs?
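
To make the first hypothesis concrete, here is a rough sketch of the variant I would have expected instead: the plugin is restricted to GPU nodes with a nodeSelector, so FAIL_ON_INIT_ERROR can stay at true. This is not taken from the AKS docs. The node label `example.com/gpu` is a made-up placeholder for whatever label actually marks the GPU node pool, the image tag is left as `<version>`, and the rest follows the general shape of NVIDIA's upstream manifest as far as I know.

```yaml
# Sketch only: label key/value and image tag are placeholders, not values from the AKS docs.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      # Hypothetical label: schedule the plugin only on nodes that should have GPUs.
      nodeSelector:
        example.com/gpu: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvcr.io/nvidia/k8s-device-plugin:<version>  # fill in a real tag
        env:
        # Default per NVIDIA's README: crash on init errors so that a broken
        # GPU node is visible instead of the plugin blocking silently.
        - name: FAIL_ON_INIT_ERROR
          value: "true"
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```

With a selector like this the pod should never land on a non-GPU node, so an initialization failure would always indicate a real problem on a GPU node.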
