[Unitrace] Unable to query hardware metrics for kernel instance (-q, --metric-query) on a B580 #83

Open
Alcpz opened this issue Feb 10, 2025 · 4 comments

Alcpz commented Feb 10, 2025

Description

I'm unable to query hardware metrics for kernels using Unitrace. After launching it, it fails with the following output:

[ERROR] Failed to initialize Level Zero runtime
[INFO] Please ensure that either /proc/sys/dev/i915/perf_stream_paranoid or /proc/sys/dev/xe/observation_paranoid are set to 0.

Those files are correctly set to 0.

Where it fails

Printed from the following function:

// ze_metrics.h:94
void PrintMetricList(uint32_t device_id)

The error returned by Level Zero is 0x70020000 (ZE_RESULT_ERROR_DEPENDENCY_UNAVAILABLE).
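
For reference, a hex result like this can be mapped back to its symbolic name with a small lookup. This is a minimal sketch: `ze_result_name` is a hypothetical helper (not part of unitrace or Level Zero), it covers only a handful of `ze_result_t` values, and those values are assumed to match the `ze_api.h` header shipped with the loader.

```shell
# Hypothetical helper: map a few Level Zero ze_result_t values to their names.
# Only a subset is listed; the values are assumed to match ze_api.h.
ze_result_name() {
  # Lowercase the hex digits so uppercase input (e.g. 0x7002000A) also matches.
  code=$(printf '%s' "$1" | tr 'A-F' 'a-f')
  case "$code" in
    0x0|0x00000000) echo "ZE_RESULT_SUCCESS" ;;
    0x70010000)     echo "ZE_RESULT_ERROR_INSUFFICIENT_PERMISSIONS" ;;
    0x70020000)     echo "ZE_RESULT_ERROR_DEPENDENCY_UNAVAILABLE" ;;
    0x78000001)     echo "ZE_RESULT_ERROR_UNINITIALIZED" ;;
    *)              echo "unknown ($1)" ;;
  esac
}

ze_result_name 0x70020000
```

The name hints at a missing runtime dependency rather than a permissions problem, which would more plausibly show up as ZE_RESULT_ERROR_INSUFFICIENT_PERMISSIONS.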

Steps to reproduce

# llama.cpp llama-bench built using the ggml_sycl backend
❯ ONEAPI_DEVICE_SELECTOR=level_zero:0 /home/acabrera/sources/pti-gpu/tools/unitrace/build-bmg/unitrace -q -o metrics.csv \
  ./bin/llama-bench -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p 0 -n 128 -pg 0,0 -t 8 -r 1 -sm none -ngl 99 -o md
[ERROR] Failed to initialize Level Zero runtime
[INFO] Please ensure that either /proc/sys/dev/i915/perf_stream_paranoid or /proc/sys/dev/xe/observation_paranoid are set to 0.
❯ cat /proc/sys/dev/i915/perf_stream_paranoid
0
❯ cat /proc/sys/dev/xe/observation_paranoid
0
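
The two checks above can be folded into one helper that reports every paranoid file it can read. This is a sketch only: `check_paranoid` is a hypothetical name, and the default paths are simply the two named in the unitrace message.

```shell
# Hypothetical helper: report the paranoid setting for each given path.
# With no arguments it checks the two paths named in the unitrace message.
check_paranoid() {
  if [ "$#" -eq 0 ]; then
    set -- /proc/sys/dev/i915/perf_stream_paranoid \
           /proc/sys/dev/xe/observation_paranoid
  fi
  for f in "$@"; do
    if [ -r "$f" ]; then
      value=$(cat "$f")
      if [ "$value" = "0" ]; then
        echo "$f: 0 (ok)"
      else
        echo "$f: $value (expected 0)"
      fi
    else
      echo "$f: not present or not readable"
    fi
  done
}

check_paranoid
```

If both files report 0 and the error persists, as it does here, the cause is likely something other than the paranoid setting.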

Environment

❯ cat /etc/issue
Ubuntu 24.10 \n \l

❯ dpkg -l | grep intel
ii  intel-fw-gpu                               2024.37.5-362~22.04                      all          Firmware package for Intel integrated and discrete GPUs
ii  intel-gpu-tools                            1.29-1                                   amd64        tools for debugging the Intel graphics driver
ii  intel-gsc                                  0.9.5-112~u22.04                         amd64        Intel(R) Graphics System Controller Firmware
ii  intel-igc-cm                               1.0.225.54083-1077~22.04                 amd64        Intel(R) C for Metal Compiler -- CM Frontend lib
ii  intel-media-va-driver-non-free:amd64       25.1.0-0ubuntu1~ppa1                     amd64        VAAPI driver for the Intel GEN8+ Graphics family
ii  intel-microcode                            3.20241112.0ubuntu0.24.10.1              amd64        Processor microcode firmware for Intel CPUs
ii  intel-ocloc                                24.52.32224.14-1077~22.04                amd64        Tool for managing Intel Compute GPU device binary format
ii  intel-opencl-icd                           24.52.32224.14-1077~22.04                amd64        Intel graphics compute runtime for OpenCL
ii  libdrm-intel1:amd64                        2.4.122-1                                amd64        Userspace interface to intel-specific kernel DRM services -- runtime
ii  libze-intel-gpu1                           24.52.32224.14-1077~22.04                amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.

# Open source dpcpp, nightly-2025-02-09 tag
❯ sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Graphics [0xe20b] 20.1.0 [1.6.31740+11]

mschilling0 (Contributor) commented Feb 10, 2025

Can you try installing these packages? I am not familiar with what they are called or how they get installed on Ubuntu 24.10 though.

intel-metrics-discovery
intel-metrics-library

Alcpz (Author) commented Feb 10, 2025

I have to contact the admin to do so. Will get back once I get that sorted out, thanks for the quick reply!

mschilling0 (Contributor) commented Feb 10, 2025

Thanks. Yes, I believe that's why you're seeing that error; my guess is that they are dynamic dependencies of Level Zero (L0). Let me know how it goes.

Those are the package names on Ubuntu 24.04.
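
One way to check whether the dynamic linker can find the corresponding libraries is to scan its cache. This is a minimal sketch under stated assumptions: the shared-object names `libigdmd.so` (Metrics Discovery) and `libigdml.so` (Metrics Library) are assumed rather than confirmed in this thread, and `missing_libs` is a hypothetical helper.

```shell
# Hypothetical helper: read `ldconfig -p` output on stdin and print each
# requested library name that does not appear in it.
missing_libs() {
  cache=$(cat)
  for lib in "$@"; do
    # -F: treat the name as a fixed string, so "." is not a regex wildcard.
    printf '%s\n' "$cache" | grep -qF "$lib" || echo "$lib"
  done
}

# Library names below are assumptions; adjust them to what your distro ships.
if command -v ldconfig >/dev/null 2>&1; then
  ldconfig -p | missing_libs libigdmd.so libigdml.so
fi
```

Any name printed is absent from the linker cache. No output means every requested name was found.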

Alcpz (Author) commented Feb 11, 2025

Thanks for the pointer, it helped. I am now facing a different issue:

terminate called after throwing an instance of 'std::bad_cast'
  what():  std::bad_cast

The backtrace points to __pthread_kill_implementation (threadid=<optimized out>, signo=6, no_tid=0) at ./nptl/pthread_kill.c:44 in the binary I'm trying to trace. This does not happen without unitrace, so something in between is not working:

#19 0x00000e75de0e17a0 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#20 0x00000e75de0dd7f6 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#21 0x00000e75f5b36e59 in ZeCollector::EnumerateAndSetupDevices() () from /home/acabrera/sources/pti-gpu/tools/unitrace/build-bmg/libunitrace_tool.so
#22 0x00000e75f5b34e32 in ZeCollector::ZeCollector(Logger*, CollectorOptions, void (*)(unsigned long, unsigned long, unsigned long, unsigned long, unsigned int, unsigned int, int, _ze_device_handle_t*, unsigned long, bool, _ze_group_count_t const&, unsigned long), void (*)(std::vector<unsigned long, std::allocator<unsigned long> >*, FLOW_DIR, API_TRACING_ID, unsigned long, unsigned long), void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) () from /home/acabrera/sources/pti-gpu/tools/unitrace/build-bmg/libunitrace_tool.so
#23 0x00000e75f5ad2bd0 in ZeCollector::Create(Logger*, CollectorOptions, void (*)(unsigned long, unsigned long, unsigned long, unsigned long, unsigned int, unsigned int, int, _ze_device_handle_t*, unsigned long, bool, _ze_group_count_t const&, unsigned long), void (*)(std::vector<unsigned long, std::allocator<unsigned long> >*, FLOW_DIR, API_TRACING_ID, unsigned long, unsigned long), void*) () from /home/acabrera/sources/pti-gpu/tools/unitrace/build-bmg/libunitrace_tool.so
#24 0x00000e75f5abfdda in UniTracer::Create(TraceOptions const&) () from /home/acabrera/sources/pti-gpu/tools/unitrace/build-bmg/libunitrace_tool.so
#25 0x00000e75f5aba36b in Init() () from /home/acabrera/sources/pti-gpu/tools/unitrace/build-bmg/libunitrace_tool.so
#26 0x00000e75f5d097ef in call_init (l=<optimized out>, argc=argc@entry=19, argv=argv@entry=0x7ffd4ad2eb28, env=env@entry=0x7ffd4ad2ebc8) at ./elf/dl-init.c:74

This is the only relevant part of the trace that I found. If I find time, I will try to dig a bit more, unless you have other suggestions I could try.
