Full documentation for rocm_smi_lib is available at https://rocm.docs.amd.com/.
All information listed below is for reference and subject to change.
- Added `rsmi_dev_memory_partition_capabilities_get`, which returns the driver's memory partition capabilities
  The driver can now report which memory partition modes the user may set. Users can now see the available memory partition modes when a memory partition mode set (`rsmi_dev_memory_partition_set`) returns an invalid-argument error.
- Added support for GPU metrics 1.6 to `rsmi_dev_gpu_metrics_info_get()`
  Updated `rsmi_dev_gpu_metrics_info_get()` and the `rsmi_gpu_metrics_t` structure to include new fields for PVIOL/TVIOL, XCP (Graphics Compute Partition) stats, and `pcie_lc_perf_other_end_recovery`:
  - `uint64_t accumulation_counter` - used for all throttled calculations
  - `uint64_t prochot_residency_acc` - processor hot accumulator
  - `uint64_t ppt_residency_acc` - Package Power Tracking (PPT) accumulator (used in PVIOL calculations)
  - `uint64_t socket_thm_residency_acc` - socket thermal accumulator (used in TVIOL calculations)
  - `uint64_t vr_thm_residency_acc` - Voltage Rail (VR) thermal accumulator
  - `uint64_t hbm_thm_residency_acc` - High Bandwidth Memory (HBM) thermal accumulator
  - `uint16_t num_partition` - the current total number of partitions
  - `struct amdgpu_xcp_metrics_t xcp_stats[MAX_NUM_XCP]` - for each partition associated with the current GPU, provides gfx busy values and accumulators, plus JPEG and decoder (VCN) engine utilizations:
    - `uint32_t gfx_busy_inst[MAX_NUM_XCC]` - graphics engine utilization (%)
    - `uint16_t jpeg_busy[MAX_NUM_JPEG_ENGS]` - JPEG engine utilization (%)
    - `uint16_t vcn_busy[MAX_NUM_VCNS]` - decoder (VCN) engine utilization (%)
    - `uint64_t gfx_busy_acc[MAX_NUM_XCC]` - accumulated graphics engine utilization (%)
  - `uint32_t pcie_lc_perf_other_end_recovery` - the PCIe other-end recovery counter
  A sketch of reading these fields through the API follows the list.
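This minimal example assumes the fields above are declared in rocm_smi.h as listed; error handling is reduced to the essentials:

```c
// Hedged sketch: query the extended GPU metrics and print a few of the new
// fields. Assumes rsmi_gpu_metrics_t carries the fields listed above
// (accumulation_counter, ppt_residency_acc, num_partition, xcp_stats, ...).
#include <stdio.h>
#include <inttypes.h>
#include "rocm_smi/rocm_smi.h"

int main(void) {
    if (rsmi_init(0) != RSMI_STATUS_SUCCESS) return 1;

    rsmi_gpu_metrics_t m;
    if (rsmi_dev_gpu_metrics_info_get(0, &m) == RSMI_STATUS_SUCCESS) {
        // Accumulators feeding the PVIOL/TVIOL calculations
        printf("accumulation_counter: %" PRIu64 "\n", m.accumulation_counter);
        printf("ppt_residency_acc:    %" PRIu64 "\n", m.ppt_residency_acc);
        // Per-partition (XCP) stats; num_partition bounds the valid entries
        for (uint16_t i = 0; i < m.num_partition && i < MAX_NUM_XCP; ++i)
            printf("XCP %u gfx_busy_inst[0]: %u%%\n",
                   (unsigned)i, m.xcp_stats[i].gfx_busy_inst[0]);
    }
    rsmi_shut_down();
    return 0;
}
```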
- Added the ability to view raw GPU metrics with `rocm-smi --showmetrics`
  Users can now view GPU metrics via the new `rocm-smi --showmetrics` option. Unlike AMD SMI (or other ROCm SMI interfaces), these values are not converted into applicable units as users may see in `amd-smi metric`. Units are displayed as reported by the driver, without conversion (unlike other AMD SMI/ROCm SMI interfaces, which post-process the data provided). Note that fields displaying `N/A` mean that the ASIC does not support the field, or that backward compatibility was not provided in a newer ASIC's GPU metrics structure.
- Added back the C++ tests for `memorypartition_read_write`
  The driver has added all the features needed for memory partition writes, so we have re-enabled `memorypartition_read_write`.
- Updated `rsmi_dev_memory_partition_set` to not return until the AMD GPU driver has successfully restarted
  This change polls for up to 40 seconds for a successful restart of the AMD GPU driver. Additionally, the API call continues to check that the memory partition (NPS) SYSFS files are successfully updated to reflect the user's requested memory partition (NPS) mode change; otherwise, it reports an error back to the user. Due to these changes, ROCm SMI's CLI now reflects the maximum wait of 40 seconds while a memory partition change is in progress.
- All APIs can now catch the driver reporting invalid arguments
  ROCm SMI APIs now return `RSMI_STATUS_INVALID_ARGS` when the driver returns `EINVAL`. A combined sketch of the two changes above is shown below.
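The enum value `RSMI_MEMORY_PARTITION_NPS4`, the capability-string format, and the exact signatures below are assumptions based on the API names in this changelog:

```c
// Sketch only: set a memory partition mode and, on an invalid-argument
// response, ask the driver which modes are actually available.
#include <stdio.h>
#include "rocm_smi/rocm_smi.h"

int main(void) {
    if (rsmi_init(0) != RSMI_STATUS_SUCCESS) return 1;

    // May block for up to ~40 seconds while the AMD GPU driver restarts.
    rsmi_status_t ret = rsmi_dev_memory_partition_set(0, RSMI_MEMORY_PARTITION_NPS4);
    if (ret == RSMI_STATUS_INVALID_ARGS) {
        // Driver returned EINVAL; query the supported modes (assumed to be
        // reported as a string, e.g. "NPS1, NPS4").
        char caps[256];
        if (rsmi_dev_memory_partition_capabilities_get(0, caps, sizeof(caps)) ==
            RSMI_STATUS_SUCCESS)
            printf("Available memory partition modes: %s\n", caps);
    }
    rsmi_shut_down();
    return 0;
}
```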
- Removed the `--resetcomputepartition` and `--resetmemorypartition` options and their associated APIs
  - This change is part of the partition feature redesign.
  - The related APIs are `rsmi_dev_compute_partition_reset()` and `rsmi_dev_memory_partition_reset()`.
- Fixed `rsmi_dev_target_graphics_version_get`, `rocm-smi --showhw`, and `rocm-smi --showprod` not displaying properly for MI2x or Navi 3x ASICs
- Re-enabled C++ tests for `memorypartition_read_write`
  - This change is part of the partition feature redesign.
  - SMI's workflow needs to be adjusted to accommodate incoming driver changes that enable the dynamic memory partition feature. We plan to re-enable testing for this feature during ROCm 6.4.
- Improved handling of UnicodeEncodeError with non-UTF-8 locales
  Non-UTF-8 locales were causing crashes on UTF-8 special characters.
- Fixed rsmitstReadWrite.TestComputePartitionReadWrite segfault
  The segfault was caused by unhandled start conditions:
  - When setting CPX as a partition mode, there is a DRM node limitation of 64. This is a known limitation of the driver when other drivers are using DRM nodes (e.g. drivers using PCIe space, such as ast). The number of DRM nodes can be checked via /sys/class/drm. Recommended steps for removing unnecessary drivers:
    a. Unload amdgpu - `sudo rmmod amdgpu`
    b. Remove the unnecessary driver(s) - e.g. `sudo rmmod ast`
    c. Reload amdgpu - `sudo modprobe amdgpu`
  - Since the user can start amdgpu in different partition modes (e.g. `sudo modprobe amdgpu user_partt_mode=1`), the test needs to keep track of the total number of devices to ensure it returns to the original configuration. The test segfault could be seen on all MI3x ASICs brought up in a non-SPX configuration at boot.
- Added a partition ID API (`rsmi_dev_partition_id_get(..)`)
  Previously, the partition ID could only be retrieved by querying `rsmi_dev_pci_id_get()` and parsing optional bits in our Python CLI/API. It is now available directly through the API. We also added testing in our compute partitioning tests to verify that partition IDs update accordingly. A usage sketch follows.
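This assumes the signature `rsmi_dev_partition_id_get(uint32_t dv_ind, uint32_t *partition_id)`:

```c
#include <stdio.h>
#include "rocm_smi/rocm_smi.h"

int main(void) {
    if (rsmi_init(0) != RSMI_STATUS_SUCCESS) return 1;
    uint32_t partition_id = 0;
    // Returns the partition ID directly, with no PCI ID bit parsing required.
    if (rsmi_dev_partition_id_get(0, &partition_id) == RSMI_STATUS_SUCCESS)
        printf("GPU 0 partition ID: %u\n", partition_id);
    rsmi_shut_down();
    return 0;
}
```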
- Partition ID CLI output
  Due to driver changes in KFD, some devices may report the partition ID in bits [31:28] or [2:0]. With the newly added `rsmi_dev_partition_id_get(..)`, we provide a fallback to properly retrieve the partition ID. We plan to eventually remove the partition ID from the function portion of the BDF (Bus:Device.Function). The PCI ID layout is described below:
  - bits [63:32] = domain
  - bits [31:28] or bits [2:0] = partition ID
  - bits [27:16] = reserved
  - bits [15:8] = bus
  - bits [7:3] = device
  - bits [2:0] = function (the partition ID may be in bits [2:0]) <-- fallback for non-SPX modes
  A sketch of decoding this layout is shown after the list.
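The decode helper below is hypothetical; the bit positions come straight from the list above, and the fallback mirrors the note on bits [2:0]:

```c
#include <stdio.h>
#include <inttypes.h>
#include "rocm_smi/rocm_smi.h"

// Hypothetical helper: pull the documented fields out of the 64-bit PCI ID.
static void decode_bdfid(uint64_t bdfid) {
    uint32_t domain       = (uint32_t)(bdfid >> 32);          // bits [63:32]
    uint32_t partition_hi = (uint32_t)((bdfid >> 28) & 0xF);  // bits [31:28]
    uint32_t bus          = (uint32_t)((bdfid >> 8) & 0xFF);  // bits [15:8]
    uint32_t device       = (uint32_t)((bdfid >> 3) & 0x1F);  // bits [7:3]
    uint32_t function     = (uint32_t)(bdfid & 0x7);          // bits [2:0]
    // Fallback per the note above: in non-SPX modes the partition ID may be
    // reported in the function bits instead of bits [31:28].
    uint32_t partition_id = partition_hi ? partition_hi : function;
    printf("%04x:%02x:%02x.%x (partition %u)\n",
           domain, bus, device, function, partition_id);
}

int main(void) {
    if (rsmi_init(0) != RSMI_STATUS_SUCCESS) return 1;
    uint64_t bdfid = 0;
    if (rsmi_dev_pci_id_get(0, &bdfid) == RSMI_STATUS_SUCCESS)
        decode_bdfid(bdfid);
    rsmi_shut_down();
    return 0;
}
```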
- Added Ring Hang event
  Added `RSMI_EVT_NOTIF_RING_HANG` to the possible events in the `rsmi_evt_notification_type_t` enum. A sketch of subscribing to this event is shown below.
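It uses the existing notification APIs (`rsmi_event_notification_init`, `rsmi_event_notification_mask_set`, `rsmi_event_notification_get`); the mask-macro usage follows the pattern used for other `rsmi_evt_notification_type_t` events and should be treated as a sketch:

```c
#include <stdio.h>
#include "rocm_smi/rocm_smi.h"

int main(void) {
    if (rsmi_init(0) != RSMI_STATUS_SUCCESS) return 1;
    uint32_t dv_ind = 0;

    rsmi_event_notification_init(dv_ind);
    // Subscribe only to ring-hang events on this device.
    rsmi_event_notification_mask_set(dv_ind,
        RSMI_EVENT_MASK_FROM_INDEX(RSMI_EVT_NOTIF_RING_HANG));

    rsmi_evt_notification_data_t data[8];
    uint32_t num_elem = 8;
    // Wait up to 10 s; num_elem is updated with the number of events returned.
    if (rsmi_event_notification_get(10000, &num_elem, data) == RSMI_STATUS_SUCCESS) {
        for (uint32_t i = 0; i < num_elem; ++i)
            if (data[i].event == RSMI_EVT_NOTIF_RING_HANG)
                printf("GPU %u ring hang: %s\n", data[i].dv_ind, data[i].message);
    }
    rsmi_event_notification_stop(dv_ind);
    rsmi_shut_down();
    return 0;
}
```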
- Fixed parsing of `pp_od_clk_voltage` within `get_od_clk_volt_info`
  The parsing of `pp_od_clk_voltage` was not dynamic enough to handle the removal of voltage curve support on MI series cards.
- Unlock mutex if the holding process is dead
  The mutex is now unlocked when the process holding it has died. Additional debug output has been added in case further issues are detected.
- Added partition ID to the rocm-smi CLI
  `rsmi_dev_pci_id_get()` now provides the partition ID. See the API description above for more detail. Previously these bits were reserved (right before the domain) and the partition ID was within the function:
  - bits [63:32] = domain
  - bits [31:28] = partition ID
  - bits [27:16] = reserved
  - bits [15:0] = PCI bus/device/function
  rocm-smi now provides the partition ID in `rocm-smi` and `rocm-smi --showhw`. If a device supports partitioning and is in a non-SPX mode (CPX, DPX, TPX, etc.), the partition ID will be non-zero. In SPX mode and on unsupported devices it will show as 0. See the examples provided below.
$ rocm-smi
========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Avg) (Mem, Compute, ID)
====================================================================================================================
0 1 0x73bf, 34495 43.0°C 6.0W N/A, N/A, 0 0Mhz 96Mhz 0% manual 150.0W 3% 0%
1 2 0x73a3, 22215 34.0°C 8.0W N/A, N/A, 0 0Mhz 96Mhz 20.0% manual 213.0W 0% 0%
====================================================================================================================
=============================================== End of ROCm SMI Log ================================================
Device below is in TPX
$ rocm-smi --showhw
================================= ROCm System Management Interface =================================
====================================== Concise Hardware Info =======================================
GPU NODE DID GUID GFX VER GFX RAS SDMA RAS UMC RAS VBIOS BUS PARTITION ID
0 4 0x74a0 3877 gfx942 ENABLED ENABLED DISABLED N/A 0000:01:00.0 0
1 5 0x74a0 54196 gfx942 ENABLED ENABLED DISABLED N/A 0000:01:00.0 1
2 6 0x74a0 36891 gfx942 ENABLED ENABLED DISABLED N/A 0000:01:00.0 2
3 7 0x74a0 28397 gfx942 ENABLED ENABLED DISABLED N/A 0001:01:00.0 0
4 8 0x74a0 45692 gfx942 ENABLED ENABLED DISABLED N/A 0001:01:00.0 1
5 9 0x74a0 61907 gfx942 ENABLED ENABLED DISABLED N/A 0001:01:00.0 2
6 10 0x74a0 52404 gfx942 ENABLED ENABLED DISABLED N/A 0002:01:00.0 0
7 11 0x74a0 4133 gfx942 ENABLED ENABLED DISABLED N/A 0002:01:00.0 1
8 12 0x74a0 21386 gfx942 ENABLED ENABLED DISABLED N/A 0002:01:00.0 2
9 13 0x74a0 10876 gfx942 ENABLED ENABLED DISABLED N/A 0003:01:00.0 0
10 14 0x74a0 63213 gfx942 ENABLED ENABLED DISABLED N/A 0003:01:00.0 1
11 15 0x74a0 46402 gfx942 ENABLED ENABLED DISABLED N/A 0003:01:00.0 2
====================================================================================================
======================================= End of ROCm SMI Log ========================================
- Added `NODE`, `GUID`, and `GFX Version` fields
  These changes affect the following rocm-smi CLI commands:
  - rocm-smi
  - rocm-smi -i
  - rocm-smi --showhw
  - rocm-smi --showproduct
  Field descriptions:
  - `NODE` - the KFD node; these can be both CPU and GPU devices. This field is invariant between boots.
  - `GUID` - also known as GPU ID; the KFD GPU's ID. This field may vary between boots.
  - `GFX Version` - the device's target graphics version.
  See below for a few example outputs.
$ rocm-smi --showhw
================================= ROCm System Management Interface =================================
====================================== Concise Hardware Info =======================================
GPU NODE DID GUID GFX VER GFX RAS SDMA RAS UMC RAS VBIOS BUS PARTITION ID
0 4 0x74a0 3877 gfx942 ENABLED ENABLED DISABLED N/A 0000:01:00.0 0
1 5 0x74a0 54196 gfx942 ENABLED ENABLED DISABLED N/A 0000:01:00.0 1
2 6 0x74a0 36891 gfx942 ENABLED ENABLED DISABLED N/A 0000:01:00.0 2
3 7 0x74a0 28397 gfx942 ENABLED ENABLED DISABLED N/A 0001:01:00.0 0
4 8 0x74a0 45692 gfx942 ENABLED ENABLED DISABLED N/A 0001:01:00.0 1
5 9 0x74a0 61907 gfx942 ENABLED ENABLED DISABLED N/A 0001:01:00.0 2
6 10 0x74a0 52404 gfx942 ENABLED ENABLED DISABLED N/A 0002:01:00.0 0
7 11 0x74a0 4133 gfx942 ENABLED ENABLED DISABLED N/A 0002:01:00.0 1
8 12 0x74a0 21386 gfx942 ENABLED ENABLED DISABLED N/A 0002:01:00.0 2
9 13 0x74a0 10876 gfx942 ENABLED ENABLED DISABLED N/A 0003:01:00.0 0
10 14 0x74a0 63213 gfx942 ENABLED ENABLED DISABLED N/A 0003:01:00.0 1
11 15 0x74a0 46402 gfx942 ENABLED ENABLED DISABLED N/A 0003:01:00.0 2
====================================================================================================
======================================= End of ROCm SMI Log ========================================
$ rocm-smi -i
============================ ROCm System Management Interface ============================
=========================================== ID ===========================================
GPU[0] : Device Name: Aqua Vanjaram [Instinct MI300A]
GPU[0] : Device ID: 0x74a0
GPU[0] : Device Rev: 0x00
GPU[0] : Subsystem ID: 0x74a0
GPU[0] : GUID: 60294
GPU[1] : Device Name: Aqua Vanjaram [Instinct MI300A]
GPU[1] : Device ID: 0x74a0
GPU[1] : Device Rev: 0x00
GPU[1] : Subsystem ID: 0x74a0
GPU[1] : GUID: 35406
GPU[2] : Device Name: Aqua Vanjaram [Instinct MI300A]
GPU[2] : Device ID: 0x74a0
GPU[2] : Device Rev: 0x00
GPU[2] : Subsystem ID: 0x74a0
GPU[2] : GUID: 10263
GPU[3] : Device Name: Aqua Vanjaram [Instinct MI300A]
GPU[3] : Device ID: 0x74a0
GPU[3] : Device Rev: 0x00
GPU[3] : Subsystem ID: 0x74a0
GPU[3] : GUID: 52959
==========================================================================================
================================== End of ROCm SMI Log ===================================
$ rocm-smi --showproduct
============================ ROCm System Management Interface ============================
====================================== Product Info ======================================
GPU[0] : Card Series: Aqua Vanjaram [Instinct MI300A]
GPU[0] : Card Model: 0x74a0
GPU[0] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: N/A
GPU[0] : Subsystem ID: 0x74a0
GPU[0] : Device Rev: 0x00
GPU[0] : Node ID: 4
GPU[0] : GUID: 60294
GPU[0] : GFX Version: gfx942
GPU[1] : Card Series: Aqua Vanjaram [Instinct MI300A]
GPU[1] : Card Model: 0x74a0
GPU[1] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[1] : Card SKU: N/A
GPU[1] : Subsystem ID: 0x74a0
GPU[1] : Device Rev: 0x00
GPU[1] : Node ID: 5
GPU[1] : GUID: 35406
GPU[1] : GFX Version: gfx942
GPU[2] : Card Series: Aqua Vanjaram [Instinct MI300A]
GPU[2] : Card Model: 0x74a0
GPU[2] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[2] : Card SKU: N/A
GPU[2] : Subsystem ID: 0x74a0
GPU[2] : Device Rev: 0x00
GPU[2] : Node ID: 6
GPU[2] : GUID: 10263
GPU[2] : GFX Version: gfx942
GPU[3] : Card Series: Aqua Vanjaram [Instinct MI300A]
GPU[3] : Card Model: 0x74a0
GPU[3] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[3] : Card SKU: N/A
GPU[3] : Subsystem ID: 0x74a0
GPU[3] : Device Rev: 0x00
GPU[3] : Node ID: 7
GPU[3] : GUID: 52959
GPU[3] : GFX Version: gfx942
==========================================================================================
================================== End of ROCm SMI Log ===================================
- Documentation now includes C++ and Python tutorials, API guides, and C++ reference pages. See https://rocm.docs.amd.com/ once 6.1.1 is released.
- Aligned `rocm-smi` fields to display "N/A" instead of "unknown"/"unsupported" for `Card ID`, `DID`, `Model`, `SKU`, and `VBIOS`
  This impacts the following commands:
  - rocm-smi - see the other examples above for 6.1.1
  - rocm-smi --showhw - see the other examples above for 6.1.1
  - rocm-smi --showproduct - see the other examples above for 6.1.1
  - rocm-smi -i - see the other examples above for 6.1.1
  - rocm-smi --showvbios - see the example below
$ rocm-smi --showvbios
============================ ROCm System Management Interface ============================
========================================= VBIOS ==========================================
GPU[0] : VBIOS version: N/A
GPU[1] : VBIOS version: N/A
GPU[2] : VBIOS version: N/A
GPU[3] : VBIOS version: N/A
==========================================================================================
================================== End of ROCm SMI Log ===================================
- Removed stacked ID formatting in `rocm-smi`
  This simplifies the identifiers shown to users. More identifiers can be found via:
  - rocm-smi -i
  - rocm-smi --showhw
  - rocm-smi --showproduct
  See the examples shown above for 6.1.1. A previous output example is shown below.
$ rocm-smi
========================================== ROCm System Management Interface ==========================================
==================================================== Concise Info ====================================================
Device [Model : Revision] Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
Name (20 chars) (Junction) (Socket) (Mem, Compute)
======================================================================================================================
0 [0x74a0 : 0x00] 40.0°C 102.0W NPS1, SPX 31Mhz 1300Mhz 0% manual 550.0W 0% 0%
Aqua Vanjaram [Insti
======================================================================================================================
================================================ End of ROCm SMI Log =================================================
- Fixed HIP and ROCm SMI mismatch on GPU bus assignments
  These changes prompted us to provide better visibility into our device nodes and partition IDs (see the addition described above). See the examples below for an overview of the fix.
  - MI300A GPU device `Domain:Bus:Device.function` clashes with another AMD USB device. Cause(s):
    a. ROCm SMI did not propagate the domain consistently (for partitioned devices)
    b. The AMD GPU driver previously reported partition IDs within the function node, causing a clash with the PCIe ID displayed for the other AMD USB device
  - Domain does not propagate for devices which support partitioning (MI300X/A). Cause(s):
    a. ROCm SMI did not propagate the domain consistently (for partitioned devices)
  - Displayed topology shows disordered nodes when compared to HIP. Cause(s):
    a. ROCm SMI did not propagate the domain consistently (for partitioned devices)
Device in TPX
$ rocm-smi --showhw
================================= ROCm System Management Interface =================================
====================================== Concise Hardware Info =======================================
GPU NODE DID GUID GFX VER GFX RAS SDMA RAS UMC RAS VBIOS BUS PARTITION ID
0 4 0x74a0 3877 gfx942 ENABLED ENABLED DISABLED N/A 0000:01:00.0 0
1 5 0x74a0 54196 gfx942 ENABLED ENABLED DISABLED N/A 0000:01:00.0 1
2 6 0x74a0 36891 gfx942 ENABLED ENABLED DISABLED N/A 0000:01:00.0 2
3 7 0x74a0 28397 gfx942 ENABLED ENABLED DISABLED N/A 0001:01:00.0 0
4 8 0x74a0 45692 gfx942 ENABLED ENABLED DISABLED N/A 0001:01:00.0 1
5 9 0x74a0 61907 gfx942 ENABLED ENABLED DISABLED N/A 0001:01:00.0 2
6 10 0x74a0 52404 gfx942 ENABLED ENABLED DISABLED N/A 0002:01:00.0 0
7 11 0x74a0 4133 gfx942 ENABLED ENABLED DISABLED N/A 0002:01:00.0 1
8 12 0x74a0 21386 gfx942 ENABLED ENABLED DISABLED N/A 0002:01:00.0 2
9 13 0x74a0 10876 gfx942 ENABLED ENABLED DISABLED N/A 0003:01:00.0 0
10 14 0x74a0 63213 gfx942 ENABLED ENABLED DISABLED N/A 0003:01:00.0 1
11 15 0x74a0 46402 gfx942 ENABLED ENABLED DISABLED N/A 0003:01:00.0 2
====================================================================================================
======================================= End of ROCm SMI Log ========================================
$ lspci -D|grep -i "process\|usb"
0000:01:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300A]
0000:01:00.1 USB controller: Advanced Micro Devices, Inc. [AMD] Device 14df
0001:01:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300A]
0002:01:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300A]
0003:01:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300A]
$ rocm-smi --showtoponuma
======================================= Numa Nodes =======================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 0
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: 0
GPU[2] : (Topology) Numa Node: 0
GPU[2] : (Topology) Numa Affinity: 0
GPU[3] : (Topology) Numa Node: 1
GPU[3] : (Topology) Numa Affinity: 1
GPU[4] : (Topology) Numa Node: 1
GPU[4] : (Topology) Numa Affinity: 1
GPU[5] : (Topology) Numa Node: 1
GPU[5] : (Topology) Numa Affinity: 1
GPU[6] : (Topology) Numa Node: 2
GPU[6] : (Topology) Numa Affinity: 2
GPU[7] : (Topology) Numa Node: 2
GPU[7] : (Topology) Numa Affinity: 2
GPU[8] : (Topology) Numa Node: 2
GPU[8] : (Topology) Numa Affinity: 2
GPU[9] : (Topology) Numa Node: 3
GPU[9] : (Topology) Numa Affinity: 3
GPU[10] : (Topology) Numa Node: 3
GPU[10] : (Topology) Numa Affinity: 3
GPU[11] : (Topology) Numa Node: 3
GPU[11] : (Topology) Numa Affinity: 3
================================== End of ROCm SMI Log ===================================
- Fixed memory leaks
  These were caused by not closing directories and by creating map nodes instead of checking with .at().
- Fixed Python rocm_smi API calls
  Fixed initialization calls which reuse the rocmsmi.initializeRsmi() bindings. Previously, calls failed as follows:
Traceback (most recent call last):
File "/home/charpoag/rocmsmi_pythonapi.py", line 9, in <module>
rocm_smi.initializeRsmi()
File "/opt/rocm/libexec/rocm_smi/rocm_smi.py", line 3531, in initializeRsmi
ret_init = rocmsmi.rsmi_init(0)
NameError: name 'rocmsmi' is not defined
- Fixed rsmi_dev_activity_metric_get gfx/memory activity not updating with GPU activity
  The fix checks for and forces a reread of the GPU metrics unconditionally.
- Added support for setting the max/min clock level for sclk (`RSMI_CLK_TYPE_SYS`) or mclk (`RSMI_CLK_TYPE_MEM`)
  Users can now set a maximum or minimum sclk or mclk value through the `rsmi_dev_clk_extremum_set()` API, provided the ASIC supports it. Alternatively, users can use our Python CLI tool (`rocm-smi --setextremum max sclk 1500`). See the example below, followed by an API-side sketch.
$ sudo /opt/rocm/bin/rocm-smi --setextremum max sclk 2100
============================ ROCm System Management Interface ============================
******WARNING******
Operating your AMD GPU outside of official AMD specifications or outside of
factory settings, including but not limited to the conducting of overclocking,
over-volting or under-volting (including use of this interface software,
even if such software has been directly or indirectly provided by AMD or otherwise
affiliated in any way with AMD), may cause damage to your AMD GPU, system components
and/or result in system failure, as well as cause other problems.
DAMAGES CAUSED BY USE OF YOUR AMD GPU OUTSIDE OF OFFICIAL AMD SPECIFICATIONS OR
OUTSIDE OF FACTORY SETTINGS ARE NOT COVERED UNDER ANY AMD PRODUCT WARRANTY AND
MAY NOT BE COVERED BY YOUR BOARD OR SYSTEM MANUFACTURER'S WARRANTY.
Please use this utility with caution.
Do you accept these terms? [y/N] y
================================ Set Valid sclk Extremum =================================
GPU[0] : Successfully set max sclk to 2100(MHz)
GPU[1] : Successfully set max sclk to 2100(MHz)
GPU[2] : Successfully set max sclk to 2100(MHz)
GPU[3] : Successfully set max sclk to 2100(MHz)
================================== End of ROCm SMI Log ===================================
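The API-side sketch for the same operation; the signature `rsmi_dev_clk_extremum_set(uint32_t, rsmi_freq_ind_t, uint64_t, rsmi_clk_type_t)` and the MHz unit are assumptions:

```c
#include "rocm_smi/rocm_smi.h"

int main(void) {
    if (rsmi_init(0) != RSMI_STATUS_SUCCESS) return 1;
    // Set the maximum sclk on device 0 to 2100 (MHz, matching the CLI example).
    // Requires root privileges and an ASIC that supports the operation.
    rsmi_status_t ret =
        rsmi_dev_clk_extremum_set(0, RSMI_FREQ_IND_MAX, 2100, RSMI_CLK_TYPE_SYS);
    rsmi_shut_down();
    return ret == RSMI_STATUS_SUCCESS ? 0 : 1;
}
```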
- Added `rsmi_dev_target_graphics_version_get()`
  Users can now query the ROCm SMI API (`rsmi_dev_target_graphics_version_get()`) to retrieve the target graphics version for a GPU device. Currently, this output is not supplied through our rocm-smi CLI. A usage sketch follows.
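A sketch, assuming the version is returned as an integer (e.g. 942 for gfx942) through `rsmi_dev_target_graphics_version_get(uint32_t dv_ind, uint64_t *gfx_version)`:

```c
#include <stdio.h>
#include <inttypes.h>
#include "rocm_smi/rocm_smi.h"

int main(void) {
    if (rsmi_init(0) != RSMI_STATUS_SUCCESS) return 1;
    uint64_t gfx_version = 0;
    if (rsmi_dev_target_graphics_version_get(0, &gfx_version) == RSMI_STATUS_SUCCESS)
        printf("GPU 0 target graphics version: gfx%" PRIu64 "\n", gfx_version);
    rsmi_shut_down();
    return 0;
}
```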
- Removed non-unified API headers: individual GPU metric APIs are no longer supported
  The individual metric APIs (`rsmi_dev_metrics_*`) were removed to make it easier to add support for new GPU metrics, in favor of the single API `rsmi_dev_gpu_metrics_info_get()` that reports all device metrics. It is worth noting that there is a risk of ABI breakage when using `rsmi_dev_gpu_metrics_info_get()`; such breaks are necessary (in some cases) in order to support newer ASICs and metrics for our customers. We will continue to support `rsmi_dev_gpu_metrics_info_get()` with these considerations and limitations in mind.
- Deprecated rsmi_dev_power_ave_get(); use the newer API rsmi_dev_power_get()
  As outlined in the 6.0.0 change below (Added a generic power API: rsmi_dev_power_get), rsmi_dev_power_ave_get() is now deprecated. Please update your ROCm SMI API calls accordingly.
- Fixed `--showpids` reporting `[PID] [PROCESS NAME] 1 UNKNOWN UNKNOWN UNKNOWN`
  Output was failing because the cu_occupancy debugfs method is not provided on some graphics cards by design. `get_compute_process_info_by_pid` was updated to reflect this and now returns the output needed by the CLI.
- Fixed inconsistent `rocm-smi --showpower` output on Navi 31/32 devices
  The CLI now uses `rsmi_dev_power_get()` to provide consistent device power output. The inconsistency was caused by use of the now-deprecated `rsmi_dev_power_ave_get()` API.
- Fixed `rocm-smi --setcomputepartition` and `rocm-smi --resetcomputepartition` to report when the device is EBUSY
- Fixed `rocm-smi --setmemorypartition` and `rocm-smi --resetmemorypartition` on read-only SYSFS to return RSMI_STATUS_NOT_SUPPORTED
  The `rsmi_dev_memory_partition_set` API was updated to handle the read-only SYSFS check. The corresponding tests and CLI calls (`rocm-smi --setmemorypartition` and `rocm-smi --resetmemorypartition`) were updated accordingly.
- Fixed `rocm-smi --showclkvolt` and `rocm-smi --showvc` displaying 0 when overdrive and the voltage curve are not supported
- HIP and ROCm SMI mismatch on GPU bus assignments
  Three separate issues have been identified:
  - MI300A GPU device `Domain:Bus:Device.function` clashes with another AMD USB device:
$ lspci|grep -i "process\|usb"
0000:01:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 74a0
0000:01:00.1 USB controller: Advanced Micro Devices, Inc. [AMD] Device 14df
0001:01:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 74a0
0002:01:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 74a0
0003:01:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 74a0
$ rocm-smi --showbus
============================ ROCm System Management Interface ============================
======================================= PCI Bus ID =======================================
GPU[0] : PCI Bus: 0000:01:00.0
GPU[1] : PCI Bus: 0000:01:00.1
GPU[2] : PCI Bus: 0000:01:00.2
GPU[3] : PCI Bus: 0000:01:00.3
...
==========================================================================================
================================== End of ROCm SMI Log ===================================
- Domain does not propagate for devices which support partitioning (MI300X/A). For example, with a device in a non-SPX mode, devices will overlap in the function portion of the BDF:
$ rocm-smi --showbus
============================ ROCm System Management Interface ============================
======================================= PCI Bus ID =======================================
GPU[0] : PCI Bus: 0000:01:00.0
GPU[1] : PCI Bus: 0000:01:00.1
GPU[2] : PCI Bus: 0000:01:00.1
GPU[3] : PCI Bus: 0000:01:00.1
GPU[4] : PCI Bus: 0000:01:00.1
GPU[5] : PCI Bus: 0000:01:00.2
GPU[6] : PCI Bus: 0000:01:00.2
GPU[7] : PCI Bus: 0000:01:00.2
GPU[8] : PCI Bus: 0000:01:00.2
GPU[9] : PCI Bus: 0000:01:00.3
GPU[10] : PCI Bus: 0000:01:00.3
GPU[11] : PCI Bus: 0000:01:00.3
GPU[12] : PCI Bus: 0000:01:00.3
GPU[13] : PCI Bus: 0000:01:00.4
GPU[14] : PCI Bus: 0000:01:00.4
GPU[15] : PCI Bus: 0000:01:00.4
GPU[16] : PCI Bus: 0000:01:00.4
GPU[17] : PCI Bus: 0000:01:00.5
GPU[18] : PCI Bus: 0000:01:00.5
GPU[19] : PCI Bus: 0000:01:00.5
GPU[20] : PCI Bus: 0000:01:00.5
GPU[21] : PCI Bus: 0001:01:00.0
GPU[22] : PCI Bus: 0002:01:00.0
GPU[23] : PCI Bus: 0003:01:00.0
================================== End of ROCm SMI Log ===================================
- Displayed topology shows disordered nodes when compared to HIP. Compare the rocm-smi output with the TransferBench output below.
  The rocm-smi --showtopo option does not display the correct information when the MI300 driver is loaded in TPX mode.
============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
get_link_weight_topology, Not supported on the given system
ERROR: GPU[1] : Cannot read Link Weight: Not supported on this machine
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 GPU8 GPU9 GPU10 GPU11
GPU0 0 XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI
GPU1 XGMI 0 XXXX XXXX XXXX XGMI XGMI XGMI XGMI XGMI XGMI XGMI
GPU2 XGMI XXXX 0 XXXX XXXX XGMI XGMI XGMI XGMI XGMI XGMI XGMI
GPU3 XGMI XXXX XXXX 0 XXXX XGMI XGMI XGMI XGMI XGMI XGMI XGMI
GPU4 XGMI XXXX XXXX XXXX 0 XGMI XGMI XGMI XGMI XGMI XGMI XGMI
GPU5 XGMI XGMI XGMI XGMI XGMI 0 XXXX XXXX XXXX XGMI XGMI XGMI
GPU6 XGMI XGMI XGMI XGMI XGMI XXXX 0 XXXX XXXX XGMI XGMI XGMI
GPU7 XGMI XGMI XGMI XGMI XGMI XXXX XXXX 0 XXXX XGMI XGMI XGMI
GPU8 XGMI XGMI XGMI XGMI XGMI XXXX XXXX XXXX 0 XGMI XGMI XGMI
GPU9 XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI 0 XGMI XGMI
GPU10 XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI 0 XGMI
GPU11 XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI XGMI 0
======================================= Numa Nodes =======================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 0
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: 0
GPU[2] : (Topology) Numa Node: 0
GPU[2] : (Topology) Numa Affinity: 1
GPU[3] : (Topology) Numa Node: 0
GPU[3] : (Topology) Numa Affinity: 2
GPU[4] : (Topology) Numa Node: 0
GPU[4] : (Topology) Numa Affinity: 3
GPU[5] : (Topology) Numa Node: 0
GPU[5] : (Topology) Numa Affinity: 0
GPU[6] : (Topology) Numa Node: 0
GPU[6] : (Topology) Numa Affinity: 1
GPU[7] : (Topology) Numa Node: 0
GPU[7] : (Topology) Numa Affinity: 2
GPU[8] : (Topology) Numa Node: 0
GPU[8] : (Topology) Numa Affinity: 3
GPU[9] : (Topology) Numa Node: 1
GPU[9] : (Topology) Numa Affinity: 1
GPU[10] : (Topology) Numa Node: 2
GPU[10] : (Topology) Numa Affinity: 2
GPU[11] : (Topology) Numa Node: 3
GPU[11] : (Topology) Numa Affinity: 3
================================== End of ROCm SMI Log ===================================
./Transferbench
...
| GPU 00 | GPU 01 | GPU 02 | GPU 03 | GPU 04 | GPU 05 | GPU 06 | GPU 07 | PCIe Bus ID | #CUs | Closest NUMA | DMA engines
--------+--------+--------+--------+--------+--------+--------+--------+--------+--------------+------+-------------+------------
GPU 00 | - | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | 0000:0c:00.0 | 304 | 0 |0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
GPU 01 | XGMI-1 | - | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | 0000:22:00.0 | 304 | 0 |0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
GPU 02 | XGMI-1 | XGMI-1 | - | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | 0000:38:00.0 | 304 | 0 |0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
GPU 03 | XGMI-1 | XGMI-1 | XGMI-1 | - | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | 0000:5c:00.0 | 304 | 0 |0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
GPU 04 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | - | XGMI-1 | XGMI-1 | XGMI-1 | 0000:9f:00.0 | 304 | 1 |0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
GPU 05 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | - | XGMI-1 | XGMI-1 | 0000:af:00.0 | 304 | 1 |0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
GPU 06 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | - | XGMI-1 | 0000:bf:00.0 | 304 | 1 |0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
GPU 07 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | XGMI-1 | - | 0000:df:00.0 | 304 | 1 |0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
...
- Added rocm-smi --version
  The SMI reports two "versions": the ROCM-SMI version and the ROCM-SMI-LIB version.
  - The ROCM-SMI version is the CLI/tool version number, with the commit ID appended after a `+` sign.
  - The ROCM-SMI-LIB version is the library package version number.
$ rocm-smi --version
ROCM-SMI version: 2.0.0+8e78352
ROCM-SMI-LIB version: 6.0.0
- Added support for gfx941/gfx942 metrics
  You can now query MI300 device metrics to get real-time information. Metrics include power, temperature, energy, and performance. Query them through `rsmi_dev_gpu_metrics_info_get()`.
- Compute and memory partition support
  Users can now view, set, and reset partitions. The topology display provides a more in-depth look at the device's current configuration. If your ASIC supports these features, the following commands can help you get started:
  - rocm-smi --showcomputepartition
  - rocm-smi --setcomputepartition <SPX, DPX, CPX, TPX, QPX>
  - rocm-smi --resetcomputepartition
  - rocm-smi --showmemorypartition
  - rocm-smi --setmemorypartition <NPS1, NPS2, NPS4, NPS8>
  - rocm-smi --resetmemorypartition
- GPU index sorting made consistent with other tools
  To ensure alignment with other ROCm software tools, GPU index sorting now uses Bus:Device.Function (BDF) rather than the card number.
- Increased max BDF ID length
  To allow for larger BDF data, we have increased the maximum BDF buffer size from 256 to 512.
- Documentation is transitioning to Sphinx
  Sphinx makes it easier to generate code documentation for our users and helps us provide centralized HTML documentation at a single website, where customers can see how to use our software and tools.
- Added a generic power API: `rsmi_dev_power_get()`
  Older ASICs provide average socket power; newer ASICs (MI300) provide current socket power. The generic API provides one interface to retrieve either of these power readings, allowing backwards compatibility. A usage sketch follows.
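A minimal sketch, assuming the signature `rsmi_dev_power_get(uint32_t, uint64_t *, RSMI_POWER_TYPE *)` with power reported in microwatts, as in the library's other power APIs:

```c
#include <stdio.h>
#include <inttypes.h>
#include "rocm_smi/rocm_smi.h"

int main(void) {
    if (rsmi_init(0) != RSMI_STATUS_SUCCESS) return 1;
    uint64_t power = 0;
    RSMI_POWER_TYPE type;
    // One call covers both older (average) and newer (current) power readings;
    // the type output reports which one the ASIC provided.
    if (rsmi_dev_power_get(0, &power, &type) == RSMI_STATUS_SUCCESS)
        printf("GPU 0 %s socket power: %" PRIu64 " uW\n",
               type == RSMI_AVERAGE_POWER ? "average" : "current", power);
    rsmi_shut_down();
    return 0;
}
```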
- Added flexible temperature readings (`rocm-smi` and `rocm-smi --showtempgraph`)
  Older ASICs provide edge temperature; newer ASICs (MI300) provide junction temperature (not edge). The rocm-smi CLI now provides a way to view which type of temperature is read across all sockets.
- Added deep sleep frequency readings
  Newer ASICs (MI300) provide the ability to know whether a clock is in deep sleep.
- Added a new test to measure API execution time.
- Removed the shared mutex if no process is using it.
- Updated to C++17, gtest-1.14, and CMake 3.14.
- Fixed memory usage division by 0.
- Fixed missing firmware blocks (rocm-smi --showfw).
- Fixed rocm-smi --showevents showing the wrong gpuID.
- ROCm SMI CLI: added the --showtempgraph feature.
- Rely on the vendor ID to detect AMDGPU.
- Changed pragma message to warning for backward compatibility.
- Added a new test to measure API execution time.
- Removed the shared mutex if no process is using it.
- Fixed --showproductname when the device's SKU cannot be parsed out of the VBIOS string.
- Fixed compile error: 'memcpy' was not declared.
- Fixed the order of CE and UE reporting in the ROCm SMI CLI.
- Handled error return values from the ReadSysfsStr function.