nvidia-smi fails with Failed to initialize NVML after some time in Pods using **systemd** cgroups #266

Open
andy108369 opened this issue Nov 25, 2024 · 2 comments
Assignees
Labels
repo/provider Akash provider-services repo issues

Comments

@andy108369
Contributor

andy108369 commented Nov 25, 2024

Summary

A customer reported that nvidia-smi stops working in Kubernetes pods after some time, failing with the error "Failed to initialize NVML: Unknown Error".

🟢 Note that applications already running on the GPU remain fully functional and unaffected.

⚙️ The customer uses nvidia-smi to collect metrics, so this failure breaks their metrics, which are operationally important to them.

This is a known issue detailed in NVIDIA Container Toolkit Issue #48, and the behavior was reproduced in our environment.


Reproducer

  1. Create an nvidia-smi-loop.yaml file with the following pod configuration:

    Set kubernetes.io/hostname to the name of the node in your cluster that should run the pod.

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-nvidia-smi-loop
    spec:
      restartPolicy: OnFailure
      runtimeClassName: nvidia
      containers:
      - name: cuda
        image: "nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04"
        command: ["/bin/sh", "-c"]
        args: ["while true; do nvidia-smi -L; sleep 5; done"]
        resources:
          limits:
            nvidia.com/gpu: 1
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        kubernetes.io/hostname: node10
  2. Deploy the pod:

    kubectl apply -f nvidia-smi-loop.yaml
  3. Trigger a daemon-reload on the node hosting the pod once the pod has been running for at least 10 seconds:

    sleep 15
    systemctl daemon-reload
  4. Check pod logs:

    kubectl logs cuda-nvidia-smi-loop --timestamps

Result

2024-11-25T13:33:37.068625936Z GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-9a3643e7-ac3c-850e-3436-5de6cfa48c23)
2024-11-25T13:33:42.128740632Z GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-9a3643e7-ac3c-850e-3436-5de6cfa48c23)
2024-11-25T13:33:52.245576418Z Failed to initialize NVML: Unknown Error
2024-11-25T13:33:57.297379775Z Failed to initialize NVML: Unknown Error
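
To pin down when the failure starts, the timestamped log output can be scanned for the first NVML error. The following is a minimal sketch, assuming the logs were fetched with --timestamps as in step 4; the function name first_nvml_failure is hypothetical and not part of any existing tooling.

```shell
#!/bin/sh
# first_nvml_failure: hypothetical helper. Reads timestamped pod logs on
# stdin and prints the timestamp (first field) of the first line that
# contains "Failed to initialize NVML".
# Exit status is 0 if such a line was found, 1 otherwise.
first_nvml_failure() {
  awk '/Failed to initialize NVML/ { print $1; found = 1; exit }
       END { exit !found }'
}
```

Possible usage: kubectl logs cuda-nvidia-smi-loop --timestamps | first_nvml_failure. Comparing the printed timestamp with the time of the daemon-reload shows how quickly the failure follows it.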

Configuration Differences

The issue is specific to environments using systemd cgroup management with the NVIDIA container runtime. Observations from different environments:

  1. K3s-based Provider:

    • Systemd cgroup is enabled in the containerd configuration:

      root@node1:~# crictl ps |grep nvid
      ec0f71ea4d12b       159abe21a6880       3 weeks ago         Running             nvidia-device-plugin-ctr   0                   8e1c0567b6d49       nvdp-nvidia-device-plugin-b59hh
      
      root@node1:~# crictl inspect ec0f71ea4d12b | grep -A3 runtimeOptions
          "runtimeOptions": {
            "binary_name": "/usr/bin/nvidia-container-runtime",
            "systemd_cgroup": true
          },
      
    • containerd configuration (/etc/containerd/config.toml):

      # cat /etc/containerd/config.toml
      disabled_plugins = ["cri"]
      
    • It appears that containerd enables SystemdCgroup by default when it is not explicitly set in the containerd config:

      # crictl info | grep -i -C2 nvidia-container-runtime
                "runtimeRoot": "",
                "options": {
                  "BinaryName": "/usr/bin/nvidia-container-runtime",
                  "SystemdCgroup": true
                },
      
  2. Kubespray-based Provider:

    • Systemd cgroup is not enabled in the containerd configuration:

      root@worker-01:~# crictl inspect 04ac886af1ec7 |grep -A3 runtimeOptions
          "runtimeOptions": {
            "binary_name": "/usr/bin/nvidia-container-runtime"
          },
          "config": {
      
    • As seen on the Kubespray-based provider, the systemdCgroup = true option is absent from the [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options] section:

      # cat /etc/containerd/config.toml
      version = 2
      root = "/data/containerd"
      state = "/run/containerd"
      oom_score = 0
      
      [grpc]
        max_recv_message_size = 16777216
        max_send_message_size = 16777216
      
      [debug]
        level = "info"
      
      [metrics]
        address = ""
        grpc_histogram = false
      
      [plugins]
        [plugins."io.containerd.grpc.v1.cri"]
          sandbox_image = "registry.k8s.io/pause:3.9"
          max_container_log_line_size = -1
          enable_unprivileged_ports = false
          enable_unprivileged_icmp = false
          [plugins."io.containerd.grpc.v1.cri".containerd]
            default_runtime_name = "runc"
            snapshotter = "overlayfs"
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
                runtime_type = "io.containerd.runc.v2"
                runtime_engine = ""
                runtime_root = ""
                base_runtime_spec = "/etc/containerd/cri-base.json"
      
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                  systemdCgroup = true
                  binaryName = "/usr/local/bin/runc"
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
                runtime_type = "io.containerd.runc.v2"
                runtime_engine = ""
                runtime_root = ""
      
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                  BinaryName = "/usr/bin/nvidia-container-runtime"
          [plugins."io.containerd.grpc.v1.cri".registry]
            [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
              [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
                endpoint = ["https://registry-1.docker.io"]
      
      # crictl info | grep -i -C2 nvidia-container-runtime
                "runtimeRoot": "",
                "options": {
                  "BinaryName": "/usr/bin/nvidia-container-runtime"
                },
                "privileged_without_host_devices": false,
      
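
The manual comparison above can be turned into a quick check of whether a node is in the affected configuration. This is only a sketch: the function name nvidia_uses_systemd_cgroup is hypothetical, and the match relies on the crictl info output having the layout shown above, with SystemdCgroup appearing within two lines after the BinaryName entry.

```shell
#!/bin/sh
# nvidia_uses_systemd_cgroup: hypothetical helper. Reads `crictl info`
# output on stdin and succeeds (exit 0) only if the options block that
# names /usr/bin/nvidia-container-runtime also has "SystemdCgroup": true
# within the two lines that follow it (assumption: layout as shown above).
nvidia_uses_systemd_cgroup() {
  grep -A2 '"BinaryName": "/usr/bin/nvidia-container-runtime"' \
    | grep -q '"SystemdCgroup": true'
}
```

Possible usage: crictl info | nvidia_uses_systemd_cgroup && echo "systemd cgroup enabled for the NVIDIA runtime". With the outputs shown above, the K3s node would match and the Kubespray node would not.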

Next Steps

Proposed Fix: Explicitly disable systemd cgroup for NVIDIA container runtime in k3s-based providers.

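  1. Generate config.toml.tmpl with SystemdCgroup = false for nvidia-container-runtime on all GPU-enabled nodes. K3s regenerates config.toml on every start and uses config.toml.tmpl as the source when it exists, so the change has to go into the template: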

    cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml | \
    sed '/BinaryName = "\/usr\/bin\/nvidia-container-runtime"/!b;n;s/SystemdCgroup = true/SystemdCgroup = false/' \
    > /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
  2. Restart the k3s-agent (worker nodes) or k3s (control-plane nodes) systemd service:

    ⚠️ NOTE: restarting the containerd service will likely cause other pods running on the affected node to restart or experience disruptions.

    If this is a worker node:

    systemctl restart k3s-agent.service

    And if this is a control-plane node:

    systemctl restart k3s.service
  3. Verify SystemdCgroup is disabled:

    crictl info | grep -i -C2 nvidia-container-runtime
  4. Test the reproducer. (nvidia-smi-loop.yaml steps from above)
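
Because step 2 is disruptive, it can help to preview what the sed expression from step 1 would change before writing the template. The sketch below is an assumption-laden illustration: the function name preview_nvidia_cgroup_change is hypothetical, and the expression relies on GNU sed and on SystemdCgroup being the line directly after BinaryName in the config.

```shell
#!/bin/sh
# preview_nvidia_cgroup_change: hypothetical helper. Applies the sed
# expression from step 1 to the given config file and prints a diff
# against the original. Nothing is written to disk.
# Exit status follows diff: 0 = no change would be made, 1 = a change
# would be made. Requires GNU sed.
preview_nvidia_cgroup_change() {
  cfg="$1"
  sed '/BinaryName = "\/usr\/bin\/nvidia-container-runtime"/!b;n;s/SystemdCgroup = true/SystemdCgroup = false/' \
    "$cfg" | diff "$cfg" -
}
```

Possible usage: preview_nvidia_cgroup_change /var/lib/rancher/k3s/agent/etc/containerd/config.toml. Only the SystemdCgroup line that follows the nvidia-container-runtime BinaryName should appear in the diff; SystemdCgroup settings of other runtimes should be untouched.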

One liner

This command will perform all three (1-3) of the above steps automatically:

  • default installation under /var/lib/rancher directory
test -f /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl || { cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml | \
sed '/BinaryName = "\/usr\/bin\/nvidia-container-runtime"/!b;n;s/SystemdCgroup = true/SystemdCgroup = false/' | tee /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl ; systemctl is-active --quiet k3s.service && systemctl restart k3s.service || (systemctl is-active --quiet k3s-agent.service && systemctl restart k3s-agent.service); sleep 5s; crictl info | grep -i -C2 nvidia-container-runtime; }
  • custom data-dir /data/k3s

First verify that the custom data-dir is in use:

grep -A1 data-dir /etc/systemd/system/k3s.service /etc/systemd/system/k3s-agent.service 2>/dev/null
crictl -c /data/k3s/agent/etc/crictl.yaml ps

If it is, run:

test -f /data/k3s/agent/etc/containerd/config.toml.tmpl || { cat /data/k3s/agent/etc/containerd/config.toml | \
sed '/BinaryName = "\/usr\/bin\/nvidia-container-runtime"/!b;n;s/SystemdCgroup = true/SystemdCgroup = false/' | tee /data/k3s/agent/etc/containerd/config.toml.tmpl ; systemctl is-active --quiet k3s.service && systemctl restart k3s.service || (systemctl is-active --quiet k3s-agent.service && systemctl restart k3s-agent.service); sleep 5s; crictl -c /data/k3s/agent/etc/crictl.yaml info | grep -i -C2 nvidia-container-runtime; }
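
Since the two one-liners differ only in the data-dir, they can be read as one parametrized routine. The following is a rough, more readable sketch of the same steps, not a drop-in replacement: the function name apply_nvidia_cgroup_fix and the DRY_RUN variable are hypothetical, and it assumes crictl.yaml lives under <data-dir>/agent/etc in both layouts, as the custom data-dir variant suggests.

```shell
#!/bin/sh
# apply_nvidia_cgroup_fix: hypothetical helper mirroring the one-liners.
#   $1       K3s data-dir (default: /var/lib/rancher/k3s)
#   DRY_RUN  set to 1 to write the template but skip restart and verification
apply_nvidia_cgroup_fix() {
  data_dir="${1:-/var/lib/rancher/k3s}"
  cfg="$data_dir/agent/etc/containerd/config.toml"
  tmpl="$cfg.tmpl"

  # Like the one-liners: do nothing if a template already exists.
  if [ -f "$tmpl" ]; then
    echo "template already exists: $tmpl"
    return 0
  fi
  # Unlike the one-liners: refuse to create an empty template.
  [ -f "$cfg" ] || { echo "missing config: $cfg" >&2; return 1; }

  # Step 1: flip SystemdCgroup only for the NVIDIA runtime (GNU sed).
  sed '/BinaryName = "\/usr\/bin\/nvidia-container-runtime"/!b;n;s/SystemdCgroup = true/SystemdCgroup = false/' \
    "$cfg" > "$tmpl" || return 1

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "dry run: wrote $tmpl, skipping restart"
    return 0
  fi

  # Step 2: restart whichever K3s unit is active on this node.
  if systemctl is-active --quiet k3s.service; then
    systemctl restart k3s.service
  elif systemctl is-active --quiet k3s-agent.service; then
    systemctl restart k3s-agent.service
  fi

  # Step 3: show the resulting runtime options.
  sleep 5
  crictl -c "$data_dir/agent/etc/crictl.yaml" info | grep -i -C2 nvidia-container-runtime
}
```

Possible usage: apply_nvidia_cgroup_fix for the default installation, or apply_nvidia_cgroup_fix /data/k3s for the custom data-dir. Running it first with DRY_RUN=1 allows the generated template to be inspected before any service is restarted.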

Verification/cleanup

Verify the provider status endpoint:

provider_info2.sh <provider-address>

If reported values seem off, bounce the operator-inventory:

kubectl -n akash-services rollout restart deployment operator-inventory

See if there are any failed pods to delete:

kubectl get pods -A -o wide --sort-by='{.metadata.creationTimestamp}' 
kubectl get pods -A --field-selector status.phase=Failed

To delete the failed pods:

kubectl delete pods -A --field-selector status.phase=Failed

Documentation Update

If this works out, we need to advise K3s-based providers to disable systemd cgroup management in the NVIDIA container runtime and update the server-mgmt documentation accordingly.

@andy108369
Contributor Author

andy108369 commented Nov 27, 2024

The following K3s-based providers will undergo maintenance today, November 27th, at 17:00 UTC to address this issue:

  • provider.h100.sdg.val.akash.pub
  • provider.h100.hou.val.akash.pub
  • provider.rtx4090.wyo.eg.akash.pub
  • provider.a100.iah.val.akash.pub
  • provider.cato.akash.pub

During this maintenance, deployments will restart. We've asked the clients to verify that their deployments are running correctly once the maintenance is completed.

@andy108369
Contributor Author

The maintenance is complete.

@chainzero chainzero added repo/provider Akash provider-services repo issues and removed testnet-5 awaiting-triage labels Jan 8, 2025