
Bug: operator-inventory fails to detect GPU/CPU, causing unlabeled nodes and null in GRPC status #240

Open
andy108369 opened this issue Jul 26, 2024 · 2 comments
Labels
repo/provider Akash provider-services repo issues

Comments

@andy108369 (Contributor)

The operator-inventory occasionally fails to detect GPUs/CPUs, leaving worker nodes unlabeled. As a result, the gRPC status endpoint returns null for cpu_info and/or gpu_info, which in turn skews the Cloudmos / Console API statistics.

curl https://api.cloudmos.io/internal/gpu | jq '.gpus.details.nvidia[] | select(.model == "rtx4090")'

SW versions

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                          IMAGE
akash-node-1-0                                ghcr.io/akash-network/node:0.36.0
akash-provider-0                              ghcr.io/akash-network/provider:0.6.2
operator-hostname-6dddc6db79-kj48g            ghcr.io/akash-network/provider:0.6.2
operator-inventory-55776b97f7-ksrt4           ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node1   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node2   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node3   ghcr.io/akash-network/provider:0.6.2

Logs

https://gist.github.com/andy108369/49bcc40a15b85de75cb3f1808a32c1f9

andy108369 added the repo/provider (Akash provider-services repo issues) label on Jul 26, 2024
@andy108369 (Contributor, Author)

I have observed the same on the provider.h100.wdc.val.akash.pub provider:

$ grpcurl -insecure provider.h100.wdc.val.akash.pub:8444 akash.provider.v1.ProviderRPC.GetStatus | jq '.cluster.inventory.cluster.nodes[] | {node: .name, cpu_info: .resources.cpu.info, gpu_info: .resources.gpu.info}'
...
...
}
{
  "node": "node6",
  "cpu_info": null,
  "gpu_info": null
}
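For illustration, affected nodes can be picked out of the GetStatus JSON programmatically instead of eyeballing the grpcurl output. A minimal sketch, assuming only the field names visible in the output above; the sample payload is hypothetical and truncated to two nodes:

```python
def nodes_missing_info(status: dict) -> list[str]:
    """Return names of nodes whose cpu_info or gpu_info is null in GetStatus output."""
    names = []
    for node in status["cluster"]["inventory"]["cluster"]["nodes"]:
        res = node["resources"]
        if res["cpu"].get("info") is None or res["gpu"].get("info") is None:
            names.append(node["name"])
    return names

# Hypothetical sample mirroring the shape of the grpcurl output above.
sample = {
    "cluster": {"inventory": {"cluster": {"nodes": [
        {"name": "node1",
         "resources": {"cpu": {"info": [{"vendor": "GenuineIntel"}]},
                       "gpu": {"info": [{"vendor": "nvidia"}]}}},
        {"name": "node6",
         "resources": {"cpu": {"info": None},
                       "gpu": {"info": None}}},
    ]}}}
}

print(nodes_missing_info(sample))  # -> ['node6']
```

Piping `grpcurl ... GetStatus` through such a check would make it easy to alert on unlabeled nodes before they show up as null in the Console API.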

Fixed by bouncing operator-inventory: kubectl -n akash-services rollout restart deployment/operator-inventory

root@node1:~# kubectl -n akash-services logs deployment/operator-inventory --timestamps  |grep -v Ceph
2024-07-27T08:21:07.926076173Z I[2024-07-27|08:21:07.926] using in cluster kube config                 cmp=provider
2024-07-27T08:21:09.003050511Z INFO	rest listening on ":8080"
2024-07-27T08:21:09.003185275Z INFO	watcher.storageclasses	started
2024-07-27T08:21:09.003263541Z INFO	nodes.nodes	waiting for nodes to finish
2024-07-27T08:21:09.003347497Z INFO	grpc listening on ":8081"
2024-07-27T08:21:09.003855573Z INFO	watcher.config	started
2024-07-27T08:21:09.005745505Z INFO	rook-ceph	   ADDED monitoring StorageClass	{"name": "beta3"}
2024-07-27T08:21:09.009702561Z INFO	nodes.node.monitor	starting	{"node": "node6"}
2024-07-27T08:21:09.009718398Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node3"}
2024-07-27T08:21:09.009739736Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node1"}
2024-07-27T08:21:09.009754127Z INFO	nodes.node.monitor	starting	{"node": "node3"}
2024-07-27T08:21:09.009760696Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node5"}
2024-07-27T08:21:09.009777807Z INFO	nodes.node.monitor	starting	{"node": "node4"}
2024-07-27T08:21:09.009784127Z INFO	nodes.node.monitor	starting	{"node": "node2"}
2024-07-27T08:21:09.009789636Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node4"}
2024-07-27T08:21:09.009795215Z INFO	nodes.node.monitor	starting	{"node": "node1"}
2024-07-27T08:21:09.009806046Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node2"}
2024-07-27T08:21:09.009812526Z INFO	nodes.node.monitor	starting	{"node": "node5"}
2024-07-27T08:21:09.009876008Z INFO	nodes.node.discovery	starting hardware discovery pod	{"node": "node6"}
2024-07-27T08:21:09.015424597Z INFO	rancher	   ADDED monitoring StorageClass	{"name": "beta3"}
2024-07-27T08:21:10.842061838Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node1"}
2024-07-27T08:21:11.237795598Z INFO	nodes.node.monitor	started	{"node": "node1"}
2024-07-27T08:21:11.504769748Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node2"}
2024-07-27T08:21:11.703728596Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node4"}
2024-07-27T08:21:12.113093042Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node3"}
2024-07-27T08:21:12.198401612Z INFO	nodes.node.monitor	started	{"node": "node4"}
2024-07-27T08:21:12.311559647Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node6"}
2024-07-27T08:21:12.370969406Z INFO	nodes.node.discovery	started hardware discovery pod	{"node": "node5"}
2024-07-27T08:21:12.459074609Z INFO	nodes.node.monitor	started	{"node": "node3"}
2024-07-27T08:21:12.794276565Z INFO	nodes.node.monitor	started	{"node": "node6"}
2024-07-27T08:21:12.843802757Z INFO	nodes.node.monitor	started	{"node": "node2"}
2024-07-27T08:21:13.722493046Z INFO	nodes.node.monitor	started	{"node": "node5"}
2024-07-27T08:21:15.228564039Z INFO	nodes.node.monitor	successfully applied labels and/or annotations patches for node "node6"	{"labels": {"akash.network":"true","akash.network/capabilities.gpu.vendor.nvidia.model.h100":"8","akash.network/capabilities.gpu.vendor.nvidia.model.h100.interface.sxm":"8","akash.network/capabilities.gpu.vendor.nvidia.model.h100.ram.80Gi":"8","akash.network/capabilities.storage.class.beta3":"1","nvidia.com/gpu.present":"true"}}

@andy108369 (Contributor, Author)

andy108369 commented Aug 9, 2024

I've noticed that operator-inventory is consuming 100% CPU (out of the 2 CPUs allocated to it via the Helm chart).
On Valdi H100 provider:
[screenshot: CPU usage on the Valdi H100 provider]

On Oblivus H100 as well:
[screenshot: CPU usage on the Oblivus H100 provider]

Maybe that's something that could be contributing to the issue.
