-
Notifications
You must be signed in to change notification settings - Fork 37
Slowdown observed on Linux with RDNA2 when blockDim is NOT loaded #53
Comments
blockDim
is **not** loaded
Hi @Maetveis , (Not an AMD employee here, but going to try to help.) I guess you probably moved on to something else, but if anything, I would recommend to open a bug ticket in LLVM instead, as maybe it is something the compiler can be taught to improve potentially. Your code seems wrong due to all threads writing into From basic experience, reaching max performance on RDNA can be quite challenging compared to GCN.
Anyway, that's pure guessing. |
Thanks, @Epliz for the suggestions, I replied to them below
I am still interested in this as solving it would unblock upstreaming an optimization in vkFFT for AMDGPU, as well as potentially improve many kernels on RDNA2 if the root cause can be fixed.
For this specific issue I don't think the compiler is involved because it can be reproduced by changing the generated assembly (removing the load instruction for --- test.s 2023-01-31 08:43:20.798273297 +0000
+++ test.noasm.s 2023-01-31 08:44:16.353553582 +0000
@@ -7,7 +7,7 @@
VkFFT_main: ; @VkFFT_main
; %bb.0:
s_load_dwordx4 s[0:3], s[6:7], 0x0
- s_load_dword s4, s[4:5], 0x4
+ ;s_load_dword s4, s[4:5], 0x4
v_add_nc_u32_e32 v3, v1, v0
v_mul_u32_u24_e32 v1, 7, v1
s_mulk_i32 s9, 0x18f0
Yes its the effect of the reduction, using different locations for each thread still reproduces the problem, using the same location (while technically UB due to the race condition) results in a simpler assembly. My guess was something to do with the command processor (CP)1 not caching the dispatch packet, but accessing it from the kernel (to read --- test.s 2023-01-31 09:35:20.406608882 +0000
+++ test.noasm.s 2023-01-31 09:37:13.877167577 +0000
@@ -6,8 +6,9 @@
.type VkFFT_main,@function
VkFFT_main: ; @VkFFT_main
; %bb.0:
+ s_load_dwordx2 s[6:7], s[4:5], 0x28
s_load_dwordx4 s[0:3], s[6:7], 0x0
- s_load_dword s4, s[4:5], 0x4
+ ;s_load_dword s4, s[4:5], 0x4
v_add_nc_u32_e32 v3, v1, v0
v_mul_u32_u24_e32 v1, 7, v1
s_mulk_i32 s9, 0x18f0 |
Using blockDim in hip kernels unfortunately incurs a large overhead, because this (dynamic) information is stored in the dispatch packet located in a host-coherent memory region. Since vkFFT always knows the work group size its going to use, just replace uses of blockDim with these values. This means the load from non-cached memory is avoided, the dispatch pointer doesn't have to be loaded which frees up 2 SGPRs, and some indexing calculations might constant fold better. The added option `useStaticWorkGroupSize` has three possible values: - -1: Disable embedding blockDim sizes, effectively the old behavior - 0: Automatically enable embedding when profitable (always except for RDNA2) - 1: Always enable RDNA is disabled by default because this can actually decrease performance sometimes with the reason not fully known, details at [1] [1]: ROCm/hipamd#53
Using blockDim in hip kernels unfortunately incurs a large overhead, because this (dynamic) information is stored in the dispatch packet located in a host-coherent memory region. Since vkFFT always knows the work group size its going to use, just replace uses of blockDim with these values. This means the load from non-cached memory is avoided, the dispatch pointer doesn't have to be loaded which frees up 2 SGPRs, and some indexing calculations might constant fold better. The added option `useStaticWorkGroupSize` has three possible values: - -1: Disable embedding blockDim sizes, effectively the old behavior - 0: Automatically enable embedding when profitable (always except for RDNA2) - 1: Always enable RDNA is disabled by default because this can actually decrease performance sometimes with the reason not fully known, details at [1] [1]: ROCm/hipamd#53 Co-authored-by: [email protected]
The following kernel:
becomes ~10% slower when the inline assembly for loading
blockDim.x
is removed.This seems to happen only on RDNA2 (tested with V620 and RX6650XT) on Linux.
Observations
c
the slowdownbecomes less without loading
blockDim
blockDim
is almost identical, differing only by the loadof the block size from the dispatch packet and the metadata that flags that the dispatch packet is in use.
Motivation
The kernel above was produced using c-reduce on a kernel extracted from vkFFT.
A possible optimization for VkFFT's HIP backend aimed to replace
blockDim
with it's values ahead of time as they are known when compilation happens.This results in big improvements especially for small problem sizes, except select cases in RDNA2, where it leads to a slowdown as much as ~30%. This issue is the result of investigating the cause of this.
Environment
hipconfig
clang version
rocminfo
Attachments
Archive file
blockdim-faster-linux-rdna2.tar.gz
containing the original VkFFT kernel,the host code used to do the speed tests (based on the launch params done by VkFFT), annotated assembly from the kernel and the script used for the test case reduction.
The script is useful to verify the slowdown. It compiles and runs the kernel reduced kernel source (
test.hip
) with and without loadingblockDim
up to 3 times and shows the difference in time taken.The text was updated successfully, but these errors were encountered: