I shut down the guest and disable the 10C vGPU instance:
[root@localhost ~]# shutdown -h now
[csibbitt@hab-19 ~]$ cd /sys/bus/pci/devices/0000:04:00.0/virtfn1/1c854c26-6ee7-491e-b8f9-3d3a8f1016dd
[csibbitt@hab-19 1c854c26-6ee7-491e-b8f9-3d3a8f1016dd]$ sudo su -c "echo 1 > remove"
[csibbitt@hab-19 ~]$ cd /sys/bus/pci/devices/0000:04:00.0/virtfn0/9361957c-95c8-4323-bc93-e437d389802d
[csibbitt@hab-19 9361957c-95c8-4323-bc93-e437d389802d]$ sudo su -c "echo 1 > remove"
[csibbitt@hab-19 9361957c-95c8-4323-bc93-e437d389802d]$ mdevctl list
[csibbitt@hab-19 9361957c-95c8-4323-bc93-e437d389802d]$
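The manual removal above could be wrapped in a small helper. This is only a sketch: `remove_mdevs` is a hypothetical function name (not part of mdevctl), and it assumes mdev devices appear as UUID-named directories with a `remove` attribute, as in the transcript. The glob is restricted to UUID-shaped names on purpose, so that symlinks such as `physfn`, which also expose a `remove` file, are never touched.

```shell
#!/bin/sh
# Hypothetical helper: remove every mdev device under one parent directory.
# Usage (as root): remove_mdevs /sys/bus/pci/devices/0000:04:00.0/virtfn0
remove_mdevs() {
    parent=$1
    # Match only UUID-shaped directory names (8-4-4-4-12 characters).
    for dev in "$parent"/????????-????-????-????-????????????/; do
        # Skip the unexpanded pattern and anything without a "remove" attribute.
        if [ -e "${dev}remove" ]; then
            echo 1 > "${dev}remove"
            echo "removed ${dev%/}"
        fi
    done
}
```

Running `mdevctl list` afterwards, as in the transcript, should confirm that nothing is left.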
What is the version?
3.3.8-3.6.0
What happened?
The exporter works fine with a time-sliced vGPU but crashes on startup if I define a MIG-backed vGPU (though not if I merely enable MIG mode). Removing the vGPU does not fix the exporter, but disabling MIG mode does.
I've noticed two other issues that may also be related to this:
#441
#434
I've simplified my setup down to the minimum in order to isolate this problem.
What did you expect to happen?
I expected it to report metrics on the MIG device as described in the documentation here: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html#multi-instance-gpu-mig-support
What is the GPU model?
What is the environment?
The host is bare-metal RHEL 9 with the NVIDIA AIE host driver 565.63, installed from the .run file.
How did you deploy the dcgm-exporter and what is the configuration?
dcgm-exporter is working on the host, showing the physical GPU stats
A100-10C vGPU is working in the guest
How to reproduce the issue?
Change GPU into MIG mode
I shut down the guest and disable the 10C vGPU instance.
dcgm-exporter is still working, but I must shut it down to change to MIG mode: "00000000:04:00.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi)"
I restart dcgm-exporter at this point and notice it is still only showing metrics for the main card, not the MIG devices, which is unexpected based on what the docs say.
I create the vGPU mdev device in the MIG instance, thinking this might change things.
Nothing changes in dcgm-exporter, which still shows only the main card, so I restart it, and it fails to come up.
If I remove the mdev device, dcgm-exporter continues to fail.
If I disable MIG mode, dcgm-exporter works again.
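To make the "only showing the main card" observation easier to check at each step, something like the following could be run against the metrics output. It is a rough sketch under two assumptions: that MIG-aware samples carry a non-empty `GPU_I_ID` label, as in the MIG example in the linked documentation, and that the exporter listens on its default port 9400. `count_mig_samples` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus text from stdin and print the number
# of samples that carry a non-empty GPU_I_ID label (assumed to mark MIG
# instances). Prints 0 when only whole-GPU metrics are present.
count_mig_samples() {
    # Drop HELP/TYPE comment lines, then count the matching sample lines.
    grep -v '^#' | grep -c 'GPU_I_ID="[^"]'
}
```

For example, `curl -s localhost:9400/metrics | count_mig_samples` printing 0 after enabling MIG mode would match the behaviour described in the steps above.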
Anything else we need to know?
No response