INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded #398

Open
fortminors opened this issue Oct 9, 2024 · 5 comments

Comments

fortminors commented Oct 9, 2024

Hello!
I have built dcgm-exporter from source with

git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
make binary

Then, I have created a custom metrics file with

cat << EOT > dcp-metrics-custom.csv
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active.
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
EOT

And finally started dcgm-exporter with the custom metrics

sudo cmd/dcgm-exporter/dcgm-exporter -c 500 -f dcp-metrics-custom.csv
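
(For reference: as far as I can tell, -c is the collection interval in milliseconds and -f the path to a custom counters CSV; the full flag list can be double-checked with

cmd/dcgm-exporter/dcgm-exporter --help

in case those meanings have changed.)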

This gives me

2024/10/09 11:10:23 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
INFO[0000] Starting dcgm-exporter                       
INFO[0000] DCGM successfully initialized!               
INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded 
INFO[0000] Falling back to metric file 'dcp-metrics-custom.csv' 
WARN[0000] Skipping line 0 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled 
WARN[0000] Skipping line 1 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled 
WARN[0000] Skipping line 2 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled 
WARN[0000] Skipping line 3 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled 
INFO[0000] Not collecting GPU metrics; no fields to watch for device type: 1 
INFO[0000] Not collecting NvSwitch metrics; no fields to watch for device type: 3 
INFO[0000] Not collecting NvLink metrics; no fields to watch for device type: 6 
INFO[0000] Not collecting CPU metrics; no fields to watch for device type: 7 
INFO[0000] Not collecting CPU Core metrics; no fields to watch for device type: 8 
INFO[0000] Pipeline starting                            
INFO[0000] Starting webserver                           
INFO[0000] Listening on                                  address="[::]:9400"
INFO[0000] TLS is disabled.                              address="[::]:9400" http2=false

Browsing http://localhost:9400/metrics does not show any metrics, so I assume they are not being collected (and/or not enabled), which is also what the dcgm-exporter logs say.
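
A quick way to confirm what the endpoint is serving (assuming the default port 9400) is

curl -s http://localhost:9400/metrics | grep '^DCGM_'

which, consistent with the logs above, returns nothing here.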

I have also tried the dcgm-exporter Docker images (nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04, the latest, and nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04, which matches my driver that ships with CUDA 12.2) with

docker run --gpus all -v ./custom_metrics/dcp-metrics-custom.csv:/etc/dcgm-exporter/custom_metrics/dcp-metrics-custom.csv --net host --cap-add SYS_ADMIN --privileged nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04 -f /etc/dcgm-exporter/custom_metrics/dcp-metrics-custom.csv

But it gives me the same output

2024/10/09 11:51:23 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
time="2024-10-09T11:51:23Z" level=info msg="Starting dcgm-exporter"
time="2024-10-09T11:51:23Z" level=info msg="DCGM successfully initialized!"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-10-09T11:51:24Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/custom_metrics/dcp-metrics-custom.csv'"
time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 0 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled"
time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 1 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 2 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled"
time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 3 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting GPU metrics; no fields to watch for device type: 1"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-10-09T11:51:24Z" level=info msg="Pipeline starting"
time="2024-10-09T11:51:24Z" level=info msg="Starting webserver"
time="2024-10-09T11:51:24Z" level=info msg="Listening on" address="[::]:9400"
time="2024-10-09T11:51:24Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false

How should I deal with this issue? And how do I enable these metrics?

$ nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:01:00.0  On |                  N/A |
|100%   91C    P2             141W / 170W |   4675MiB / 12288MiB |     89%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2436      G   /usr/lib/xorg/Xorg                         1216MiB |
|    0   N/A  N/A      2729      G   /usr/bin/gnome-shell                        169MiB |
|    0   N/A  N/A      4459      G   ...Telegram/Telegram                          2MiB |
|    0   N/A  N/A      5432      G   ...ures=SpareRendererForSitePerProcess      131MiB |
|    0   N/A  N/A      8298      G   ...seed-version=20241008-180117.502000      523MiB |
|    0   N/A  N/A     55717      G   ...ures=SpareRendererForSitePerProcess       68MiB |
|    0   N/A  N/A     71215      G   ...erProcess --variations-seed-version       60MiB |
|    0   N/A  N/A    591055      C   /prog                                      2478MiB |
+---------------------------------------------------------------------------------------+
$ dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Not loaded                                       |
| 9         | SysMon             | Not loaded                                       |
+-----------+--------------------+--------------------------------------------------+
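
DCGM can also be asked directly whether the profiling (DCP) metrics are supported on this GPU:

$ dcgmi profile --list

On GPUs without DCP support this is expected to report an error rather than a list of metric IDs (the exact wording may vary by DCGM version).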
$ sudo nv-hostengine -f host.log --log-level debug
Err: Failed to start DCGM Server: -7
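
(One possible, unconfirmed explanation for the nv-hostengine failure above is that a host engine is already running, e.g. one started by the nvidia-dcgm systemd service, if that package installed it, or the embedded one inside dcgm-exporter. That can be checked with

$ pgrep -a nv-hostengine
$ systemctl status nvidia-dcgm

before trying to start a second instance.)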
fortminors commented Oct 9, 2024

I just found out that DCGM's profiling (DCP) metrics are not supported on GeForce GTX/RTX GPUs, unfortunately, as pointed out by this comment. It would be really useful to add this to the documentation, since I can easily build a cloud with GTX/RTX GPUs.

Is there a similar tool that does the same thing for GTX/RTX cards? Other than profiling with nsys/ncu, of course.

I just want to monitor SM occupancy over time without interfering with the running programs.
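
For consumer cards, plain NVML data exposed through nvidia-smi may be the closest substitute; it reports SM utilization rather than occupancy, so it is only a rough proxy:

nvidia-smi dmon -s u -d 1

or, in a form that is easier to scrape,

nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv -l 1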

yyang4069 commented Oct 12, 2024

I have also encountered the same problem. How can it be solved?
Driver Version: 525.85.12
exporter-image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04

nvidia-smi
Sat Oct 12 16:06:30 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+

logs:

time="2024-10-12T07:02:49Z" level=info msg="Starting dcgm-exporter"
time="2024-10-12T07:02:49Z" level=info msg="DCGM successfully initialized!"
time="2024-10-12T07:02:49Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-10-12T07:02:49Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_NVLINK_RX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_NVLINK_TX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 28 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 29 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 30 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 31 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 32 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 33 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 34 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 35 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 36 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=info msg="Initializing system entities of type: GPU"
time="2024-10-12T07:02:55Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2024-10-12T07:02:55Z" level=info msg="Not collecting switch metrics: no switches to monitor"
time="2024-10-12T07:02:55Z" level=info msg="Initializing system entities of type: NvLink"
time="2024-10-12T07:02:55Z" level=info msg="Not collecting link metrics: no switches to monitor"
time="2024-10-12T07:02:55Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-10-12T07:02:55Z" level=info msg="Pipeline starting"
time="2024-10-12T07:02:55Z" level=info msg="Starting webserver"

fzyzcjy commented Oct 23, 2024

Hi, have you fixed it? Thanks!

fortminors (Author) commented

Hello! No, there is no way to fix it for consumer-grade GPUs. This tool is built specifically for data-center/cloud GPUs, unfortunately. Hopefully, NVIDIA will add a similar tool for consumer-grade GPUs in the future.

fzyzcjy commented Oct 23, 2024

I see, thank you!
