[Issue]: nvidia-smi not found #515

Open
joerowell opened this issue Feb 20, 2024 · 6 comments

Comments

@joerowell

Problem Description

The estimate_matmul functionality in Triton relies rather heavily on the underlying stats of the GPU. On CUDA platforms, this functionality is realised by calling nvidia-smi and then parsing the results. I see that this code is still present in this fork of Triton:

def nvsmi(attrs):

Would it be possible to add support for rocm-smi here instead? That would make autotuning Triton kernels for GEMM and similar workloads much easier.
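
For concreteness, here is a minimal sketch of what a rocm-smi based analogue of the nvsmi helper could look like, shelling out to the CLI and parsing JSON the same way the CUDA path parses nvidia-smi output. The `--showclocks`/`--json` flags and the shape of the JSON (a "card0"-keyed object) are assumptions to check against `rocm-smi --help` on the target ROCm release, and `rocmsmi_clocks` is a hypothetical name, not existing Triton code:

```python
import json
import subprocess


def rocmsmi_clocks(device: int = 0) -> dict:
    """Hypothetical rocm-smi based stand-in for Triton's nvsmi() helper.

    Assumes `rocm-smi --showclocks --json` prints a JSON object keyed by
    card (e.g. "card0"); field names differ across ROCm releases, so the
    filtering below is illustrative rather than definitive.
    """
    out = subprocess.check_output(["rocm-smi", "--showclocks", "--json"], text=True)
    data = json.loads(out)
    card = data.get(f"card{device}", {})
    # Keep anything that looks like a clock entry (sclk/mclk/fclk ...).
    return {k: v for k, v in card.items() if "clk" in k.lower()}


if __name__ == "__main__":
    print(rocmsmi_clocks(0))
```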

Operating System

CPU

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@zhanglx13

@joerowell We can add it later, after we merge this fork with upstream.
For GEMM tuning, we have a dedicated script to tune GEMM kernels. You can refer to this README for more info, and let me know if you have more questions.

@zhanglx13

@jataylo @micmelesse This seems to be related to the nvsmi-related test failure. What is the status of that test?

@MARD1NO

MARD1NO commented Dec 25, 2024

> @jataylo @micmelesse This seems to be related to the nvsmi-related test failure. What is the status of that test?

Hi, is there any update on a rocm-smi version of the estimate_matmul function? I am also encountering this problem.

@jataylo

jataylo commented Dec 25, 2024

> @jataylo @micmelesse This seems to be related to the nvsmi-related test failure. What is the status of that test?

We got around this in Inductor by hard-coding flops for specific arches when we required this; @zhanglx13 @micmelesse may have to consider writing amdsmi equivalents in Triton.
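
For illustration only, a rough sketch of what an amdsmi-based equivalent could look like using ROCm's amdsmi Python bindings. The exact function names and return fields (`amdsmi_get_clock_info`, `AmdSmiClkType.GFX`, `"max_clk"`) vary between ROCm releases and should be verified against the installed amdsmi package; this is a sketch under those assumptions, not the workaround used in Inductor:

```python
import amdsmi  # ships with ROCm; API details below are assumptions to verify


def max_clocks_mhz(device: int = 0):
    """Query max engine/memory clocks via the amdsmi Python bindings (assumed API)."""
    amdsmi.amdsmi_init()
    try:
        handle = amdsmi.amdsmi_get_processor_handles()[device]
        gfx = amdsmi.amdsmi_get_clock_info(handle, amdsmi.AmdSmiClkType.GFX)
        mem = amdsmi.amdsmi_get_clock_info(handle, amdsmi.AmdSmiClkType.MEM)
        # "max_clk" is the assumed key for the maximum frequency in MHz.
        return gfx["max_clk"], mem["max_clk"]
    finally:
        amdsmi.amdsmi_shut_down()
```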

@MARD1NO

MARD1NO commented Dec 25, 2024

> @jataylo @micmelesse This seems to be related to the nvsmi-related test failure. What is the status of that test?

> We got around this in Inductor by hard-coding flops for specific arches when we required this; @zhanglx13 @micmelesse may have to consider writing amdsmi equivalents in Triton.

Can you give me the relevant code paths?

@jataylo

jataylo commented Dec 31, 2024

@MARD1NO Not sure if this helps at all: https://github.com/pytorch/pytorch/blob/main/torch/_utils_internal.py#L208

This is how we had to get around the nvsmi call from Triton previously, by hard-coding max clock rates for our gfx arches. Not ideal.
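
In outline, that workaround boils down to a lookup table keyed on the gfx architecture name, something like the sketch below. `gcnArchName` is a real field of `torch.cuda.get_device_properties` on ROCm builds of PyTorch, but the table values here are placeholders to be filled from the hardware spec sheets, not the numbers PyTorch actually hard-codes:

```python
import torch

# Placeholder values only: fill in real max clock rates (MHz) from the
# spec sheets of the gfx architectures you care about.
_MAX_CLOCK_MHZ = {
    "gfx90a": 0,  # MI200-series value goes here
    "gfx942": 0,  # MI300-series value goes here
}


def max_clock_rate_mhz(device: int = 0) -> int:
    """Return a hard-coded max clock rate for the current gfx arch (sketch)."""
    props = torch.cuda.get_device_properties(device)
    # gcnArchName looks like "gfx942:sramecc+:xnack-"; keep only the base name.
    arch = props.gcnArchName.split(":")[0]
    if arch not in _MAX_CLOCK_MHZ:
        raise RuntimeError(f"No hard-coded clock rate for {arch}; extend the table.")
    return _MAX_CLOCK_MHZ[arch]
```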
