inventory-operator: doesn't detect when nvdp-nvidia-device-plugin marks GPU as unhealthy #249
Logs: https://gist.github.com/andy108369/cac9f968f1c6a3eb7c6e92135b8afd42
Querying the 8443/status endpoint reported all 8 GPUs as available, even though at least one had been marked as unhealthy.
In rare cases you can recover from this error by bouncing the nvdp-nvidia-device-plugin pod on the node where the GPU was marked unhealthy. But the point is that inventory-operator should ideally detect this, as otherwise GPU deployments will be stuck in "Pending" until all 8 GPUs become available again.
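
One possible way to detect this condition, sketched below in Python, is to compare each node's `status.capacity` with its `status.allocatable` for the `nvidia.com/gpu` resource. This rests on the assumption that the kubelet keeps counting an unhealthy device in capacity but drops it from allocatable. It is not how inventory-operator currently works; the function names and the sample node data are hypothetical, and the input is assumed to have the shape of `kubectl get nodes -o json`.

```python
# Sketch only: flags nodes where fewer GPUs are allocatable than installed.
# Assumption: an unhealthy GPU stays in status.capacity but is removed
# from status.allocatable for the "nvidia.com/gpu" resource.

GPU_RESOURCE = "nvidia.com/gpu"


def unhealthy_gpu_count(node: dict) -> int:
    """Return capacity minus allocatable GPUs for one node (never negative)."""
    status = node.get("status", {})
    capacity = int(status.get("capacity", {}).get(GPU_RESOURCE, "0"))
    allocatable = int(status.get("allocatable", {}).get(GPU_RESOURCE, "0"))
    return max(capacity - allocatable, 0)


def nodes_with_unhealthy_gpus(node_list: dict) -> dict:
    """Map node name -> number of GPUs that are present but not allocatable."""
    result = {}
    for node in node_list.get("items", []):
        missing = unhealthy_gpu_count(node)
        if missing > 0:
            result[node["metadata"]["name"]] = missing
    return result


# Hypothetical data shaped like `kubectl get nodes -o json` output.
sample_nodes = {
    "items": [
        {
            "metadata": {"name": "node1"},
            "status": {
                "capacity": {GPU_RESOURCE: "8"},
                "allocatable": {GPU_RESOURCE: "7"},
            },
        },
        {
            "metadata": {"name": "node2"},
            "status": {
                "capacity": {GPU_RESOURCE: "8"},
                "allocatable": {GPU_RESOURCE: "8"},
            },
        },
    ]
}

flagged = nodes_with_unhealthy_gpus(sample_nodes)
print(flagged)
```

If the assumption holds, a check like this would let the reported GPU count follow allocatable rather than capacity, so deployments are not scheduled against GPUs that cannot be allocated.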