Bug: `operator-inventory` fails to detect GPU/CPU, causing unlabeled nodes and `null` in GRPC status #240

andy108369 · 2024-07-26T09:09:39Z

The `operator-inventory` occasionally fails to detect the GPU/CPU, so worker nodes remain unlabeled. As a result, the gRPC status endpoint returns `null` for `cpu_info` and/or `gpu_info`, which in turn skews the Cloudmos / Console API statistics.

curl https://api.cloudmos.io/internal/gpu | jq '.gpus.details.nvidia[] | select(.model == "rtx4090")'

SW versions

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                          IMAGE
akash-node-1-0                                ghcr.io/akash-network/node:0.36.0
akash-provider-0                              ghcr.io/akash-network/provider:0.6.2
operator-hostname-6dddc6db79-kj48g            ghcr.io/akash-network/provider:0.6.2
operator-inventory-55776b97f7-ksrt4           ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node1   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node2   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node3   ghcr.io/akash-network/provider:0.6.2

Logs

https://gist.github.com/andy108369/49bcc40a15b85de75cb3f1808a32c1f9


andy108369 · 2024-07-27T08:23:36Z

I have observed the same on the provider.h100.wdc.val.akash.pub provider:

$ grpcurl -insecure provider.h100.wdc.val.akash.pub:8444 akash.provider.v1.ProviderRPC.GetStatus | jq '.cluster.inventory.cluster.nodes[] | {node: .name, cpu_info: .resources.cpu.info, gpu_info: .resources.gpu.info}'
...
...
}
{
  "node": "node6",
  "cpu_info": null,
  "gpu_info": null
}
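The null check done with `jq` above can also be scripted. The following is a minimal sketch, assuming the `GetStatus` response has the shape implied by the `jq` path used in this issue (`.cluster.inventory.cluster.nodes[]`, with `.name`, `.resources.cpu.info` and `.resources.gpu.info`). The function name `nodes_missing_hw_info` and the inline sample are hypothetical and only for illustration. Note that a CPU-only node would presumably report no GPU info legitimately, so a missing `gpu_info` is only suspicious on nodes known to have GPUs.

```python
import json

def nodes_missing_hw_info(status: dict) -> list:
    """List nodes whose cpu_info or gpu_info is null/empty in a GetStatus response.

    Assumes the JSON layout implied by the jq filter in this issue:
    .cluster.inventory.cluster.nodes[] -> .name, .resources.cpu.info, .resources.gpu.info
    """
    cluster = ((status.get("cluster") or {}).get("inventory") or {}).get("cluster") or {}
    missing = []
    for node in cluster.get("nodes") or []:
        resources = node.get("resources") or {}
        # "or {}" guards against keys that are present but set to null.
        cpu_info = (resources.get("cpu") or {}).get("info")
        gpu_info = (resources.get("gpu") or {}).get("info")
        if not cpu_info or not gpu_info:
            missing.append({
                "node": node.get("name"),
                "cpu_info": bool(cpu_info),  # True if info was reported
                "gpu_info": bool(gpu_info),
            })
    return missing

# Hypothetical sample mirroring the node6 output above (not real provider data).
sample = json.loads("""
{"cluster": {"inventory": {"cluster": {"nodes": [
  {"name": "node6", "resources": {"cpu": {"info": null}, "gpu": {"info": null}}}
]}}}}
""")
print(nodes_missing_hw_info(sample))
```

To run this against a live provider, the output of the `grpcurl ... GetStatus` command shown above could be parsed with `json.load(sys.stdin)` and passed to the function instead of the inline sample.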

Fixed by bouncing (restarting) the operator-inventory: kubectl -n akash-services rollout restart deployment/operator-inventory

root@node1:~# kubectl -n akash-services logs deployment/operator-inventory --timestamps  |grep -v Ceph
2024-07-27T08:21:07.926076173Z I[2024-07-27|08:21:07.926] using in cluster kube config                 cmp=provider
2024-07-27T08:21:09.003050511Z INFO	rest listening on ":8080"
2024-07-27T08:21:09.003185275Z INFO	watcher.storageclasses	started
2024-07-27T08:21:09.003263541Z INFO	nodes.nodes	waiting for nodes to finish
2024-07-27T08:21:09.003347497Z INFO	grpc listening on ":8081"
2024-07-27T08:21:09.003855573Z INFO	watcher.config	started
2024-07-27T08:21:09.005745505Z INFO	rook-ceph	   ADDED monitoring StorageClass	{"name": "beta3"}
2024-07-27T08:21:09.009702561Z INFO	nodes.node.monitor	starting	{"node": "node6"}
2024-07-27T08:21:09.009718398Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node3"}
2024-07-27T08:21:09.009739736Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node1"}
2024-07-27T08:21:09.009754127Z INFO	nodes.node.monitor	starting	{"node": "node3"}
2024-07-27T08:21:09.009760696Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node5"}
2024-07-27T08:21:09.009777807Z INFO	nodes.node.monitor	starting	{"node": "node4"}
2024-07-27T08:21:09.009784127Z INFO	nodes.node.monitor	starting	{"node": "node2"}
2024-07-27T08:21:09.009789636Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node4"}
2024-07-27T08:21:09.009795215Z INFO	nodes.node.monitor	starting	{"node": "node1"}
2024-07-27T08:21:09.009806046Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node2"}
2024-07-27T08:21:09.009812526Z INFO	nodes.node.monitor	starting	{"node": "node5"}
2024-07-27T08:21:09.009876008Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node6"}
2024-07-27T08:21:09.015424597Z INFO	rancher	   ADDED monitoring StorageClass	{"name": "beta3"}
2024-07-27T08:21:10.842061838Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node1"}
2024-07-27T08:21:11.237795598Z INFO	nodes.node.monitor	started	{"node": "node1"}
2024-07-27T08:21:11.504769748Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node2"}
2024-07-27T08:21:11.703728596Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node4"}
2024-07-27T08:21:12.113093042Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node3"}
2024-07-27T08:21:12.198401612Z INFO	nodes.node.monitor	started	{"node": "node4"}
2024-07-27T08:21:12.311559647Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node6"}
2024-07-27T08:21:12.370969406Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node5"}
2024-07-27T08:21:12.459074609Z INFO	nodes.node.monitor	started	{"node": "node3"}
2024-07-27T08:21:12.794276565Z INFO	nodes.node.monitor	started	{"node": "node6"}
2024-07-27T08:21:12.843802757Z INFO	nodes.node.monitor	started	{"node": "node2"}
2024-07-27T08:21:13.722493046Z INFO	nodes.node.monitor	started	{"node": "node5"}
2024-07-27T08:21:15.228564039Z INFO	nodes.node.monitor	successfully applied labels and/or annotations patches for node "node6"	{"labels": {"akash.network":"true","akash.network/capabilities.gpu.vendor.nvidia.model.h100":"8","akash.network/capabilities.gpu.vendor.nvidia.model.h100.interface.sxm":"8","akash.network/capabilities.gpu.vendor.nvidia.model.h100.ram.80Gi":"8","akash.network/capabilities.storage.class.beta3":"1","nvidia.com/gpu.present":"true"}}

andy108369 · 2024-08-09T11:48:02Z

I've noticed that operator-inventory is consuming 100% CPU (out of the 2 CPUs it is allocated via the Helm chart).
Seen on the Valdi H100 provider, and on the Oblivus H100 provider as well.

That may be contributing to the issue.

andy108369 added the repo/provider Akash provider-services repo issues label Jul 26, 2024

andy108369 mentioned this issue Aug 21, 2024

inventory-operator: doesn't detect when nvdp-nvidia-device-plugin marks GPU as unhealthy #249

Open
