This repository has been archived by the owner on Jan 26, 2024. It is now read-only.

Slowdown observed on Linux with RDNA2 when blockDim is NOT loaded #53

Open
Maetveis opened this issue Oct 26, 2022 · 2 comments

Comments

@Maetveis

The following kernel:

extern "C" __global__ void VkFFT_main(unsigned long long* g, unsigned long long* h) {
  __shared__ unsigned long long c[8192];
  asm volatile(";x: %0" : : "s"((unsigned)blockDim.x));
  unsigned b =
      threadIdx.y * (threadIdx.y + threadIdx.x) * 7 + blockIdx.z * 6384;
  c[0] = g[b];
  h[b] = c[threadIdx.x];
}

becomes ~10% slower when the inline assembly for loading blockDim.x is removed.
This seems to happen only on RDNA2 (tested with V620 and RX6650XT) on Linux.

Observations

  • Does not seem to reproduce on Windows with the RX6900XT
  • The amount of shared memory used by the kernel matters: decreasing the size of c makes the slowdown
    without loading blockDim smaller
  • The generated assembly with and without blockDim is almost identical, differing only in the load
    of the block size from the dispatch packet and the metadata flag that marks the dispatch packet as in use.
    • What's more, if this load instruction is removed from the faster kernel (without changing the metadata), it becomes as slow as the other one.

Motivation

The kernel above was produced by running c-reduce on a kernel extracted from VkFFT.
A possible optimization for VkFFT's HIP backend aims to replace blockDim with its values ahead of time, as they are already known when compilation happens.
This results in big improvements, especially for small problem sizes, except in select cases on RDNA2, where it leads to a slowdown of as much as ~30%. This issue is the result of investigating the cause of that slowdown.
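
To make the idea concrete, here is a minimal sketch of the transformation (not VkFFT's actual generated code; the workgroup size of 64 is an illustrative assumption):

#include <hip/hip_runtime.h>

// Reads blockDim.x from the dispatch packet at run time.
extern "C" __global__ void index_dynamic(const float* in, float* out) {
  unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
  out[tid] = in[tid];
}

// Workgroup size baked into the generated source (64 is an illustrative
// value): the dispatch pointer never has to be read and the multiply can
// constant fold.
extern "C" __global__ void index_static(const float* in, float* out) {
  unsigned tid = blockIdx.x * 64u + threadIdx.x;
  out[tid] = in[tid];
}

The concrete benefits of the static variant are the ones listed in the commit message referenced further below: no load from host-coherent memory, two freed SGPRs, and better constant folding of index arithmetic.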

Environment

hipconfig
hipconfig
HIP version  : 5.3.22061-e8e78f1a

== hipconfig
HIP_PATH     : /opt/rocm-5.3.0
ROCM_PATH    : /opt/rocm-5.3.0
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME  : rocclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-5.3.0/include -I/opt/rocm-5.3.0/llvm/bin/../lib/clang/15.0.0 -I/opt/rocm-5.3.0/hsa/include

== hip-clang
HSA_PATH         : /opt/rocm-5.3.0/hsa
HIP_CLANG_PATH   : /opt/rocm-5.3.0/llvm/bin
AMD clang version 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.3.0 22362 3cf23f77f8208174a2ee7c616f4be23674d7b081)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.3.0/llvm/bin
AMD LLVM version 15.0.0git
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver3

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags :  -std=c++11 -isystem "/opt/rocm-5.3.0/llvm/lib/clang/15.0.0/include/.." -isystem /opt/rocm-5.3.0/hsa/include -isystem "/opt/rocm-5.3.0/include" -O3
hip-clang-ldflags  :  -L"/opt/rocm-5.3.0/lib" -O3 -lgcc_s -lgcc -lpthread -lm -lrt

=== Environment Variables
PATH=/home/gergely/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

== Linux Kernel
Hostname     : nostromo
Linux nostromo 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.5 LTS
Release:	20.04
Codename:	focal
clang version
/opt/rocm/llvm/bin/clang++ --version
AMD clang version 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.3.0 22362 3cf23f77f8208174a2ee7c616f4be23674d7b081)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin
rocminfo
rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD EPYC 7713P 64-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD EPYC 7713P 64-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2000
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            128
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    528082872(0x1f79e7b8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    528082872(0x1f79e7b8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    528082872(0x1f79e7b8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1030
  Uuid:                    GPU-abcb45dca7663b11
  Marketing Name:          AMD Radeon PRO V620
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      4096(0x1000) KB
    L3:                      131072(0x20000) KB
  Chip ID:                 29601(0x73a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2570
  BDFID:                   33536
  Internal Node ID:        1
  Compute Unit:            72
  SIMDs per CU:            2
  Shader Engines:          8
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    31440896(0x1dfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1030
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 3
*******
  Name:                    gfx1030
  Uuid:                    GPU-2293a876b6331dff
  Marketing Name:          AMD Radeon PRO V620
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      4096(0x1000) KB
    L3:                      131072(0x20000) KB
  Chip ID:                 29601(0x73a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2570
  BDFID:                   34304
  Internal Node ID:        2
  Compute Unit:            72
  SIMDs per CU:            2
  Shader Engines:          8
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    31440896(0x1dfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1030
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Attachments

Archive file blockdim-faster-linux-rdna2.tar.gz containing the original VkFFT kernel,
the host code used for the speed tests (based on the launch parameters used by VkFFT), annotated assembly from the kernel, and the script used for the test case reduction.
The script is useful for verifying the slowdown: it compiles and runs the reduced kernel source (test.hip) with and without loading blockDim up to 3 times and shows the difference in time taken.
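
The attachment contains the actual host code; as a rough, purely illustrative sketch, timing the reduced kernel with HIP events could look like the following (the launch dimensions and buffer size are made up, not the VkFFT ones):

#include <hip/hip_runtime.h>
#include <cstdio>

#include "test.hip" // the reduced kernel (VkFFT_main) from the attachment

int main() {
  const size_t n = 1 << 24; // illustrative buffer size, large enough for index b
  unsigned long long *g = nullptr, *h = nullptr;
  hipMalloc((void**)&g, n * sizeof(*g));
  hipMalloc((void**)&h, n * sizeof(*h));

  dim3 block(16, 16, 1); // made-up launch dimensions, not the VkFFT ones
  dim3 grid(1, 1, 64);

  hipEvent_t start, stop;
  hipEventCreate(&start);
  hipEventCreate(&stop);

  hipEventRecord(start, 0);
  for (int i = 0; i < 1000; ++i)
    VkFFT_main<<<grid, block>>>(g, h);
  hipEventRecord(stop, 0);
  hipEventSynchronize(stop);

  float ms = 0.0f;
  hipEventElapsedTime(&ms, start, stop);
  std::printf("1000 launches took %.3f ms\n", ms);

  hipFree(g);
  hipFree(h);
  return 0;
}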

Maetveis changed the title from "Slowdown observed on Linux with RDNA2 when blockDim is **not** loaded" to "Slowdown observed on Linux with RDNA2 when blockDim is NOT loaded" on Oct 26, 2022
@Epliz

Epliz commented Jan 30, 2023

Hi @Maetveis ,

(Not an AMD employee here, but going to try to help.)

I guess you probably moved on to something else, but if anything, I would recommend opening a bug ticket in LLVM instead, as it may be something the compiler can be taught to improve.

Your code seems wrong because all threads write into c[0], but I assume that is either a typo in your message or that a big part of the code is missing due to using c-reduce.

From basic experience, reaching max performance on RDNA can be quite challenging compared to GCN.
While it is hard to tell, and profiling is quite hard on RDNA, I suspect that you might be hitting one of these issues:

  1. bad write coalescing when writing h, which somehow gets better when the load is present. I have observed that adding synchronizations (with __syncthreads()) can sometimes improve performance by improving write coalescing (see the sketch after this list)
  2. if you really have all your threads writing into c[0], then depending on your block size the load might help avoid all threads hitting the LDS at the same time, so the LDS serializes the writes less
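
As a minimal sketch of suggestion 1, purely illustrative: the reduced kernel with a __syncthreads() between the LDS write and the subsequent read (the kernel name is made up, and whether this actually changes the RDNA2 behaviour was not measured in this thread):

extern "C" __global__ void VkFFT_main_sync(unsigned long long* g, unsigned long long* h) {
  __shared__ unsigned long long c[8192];
  unsigned b =
      threadIdx.y * (threadIdx.y + threadIdx.x) * 7 + blockIdx.z * 6384;
  c[0] = g[b];
  __syncthreads(); // order the LDS write before any thread reads from c
  h[b] = c[threadIdx.x];
}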

Anyway, that's pure guessing.
Best,
Epliz

@Maetveis
Author

Maetveis commented Jan 31, 2023

Thanks, @Epliz, for the suggestions; I replied to them below.

I guess you probably moved on to something else

I am still interested in this, as solving it would unblock upstreaming an optimization in vkFFT for AMDGPU, and could potentially improve many kernels on RDNA2 if the root cause can be fixed.

I would recommend opening a bug ticket in LLVM instead, as it may be something the compiler can be taught to improve.

For this specific issue I don't think the compiler is involved, because the problem can be reproduced by editing the generated assembly (removing the load instruction for blockDim):

--- test.s	2023-01-31 08:43:20.798273297 +0000
+++ test.noasm.s	2023-01-31 08:44:16.353553582 +0000
@@ -7,7 +7,7 @@
 VkFFT_main:                             ; @VkFFT_main
 ; %bb.0:
 	s_load_dwordx4 s[0:3], s[6:7], 0x0
-	s_load_dword s4, s[4:5], 0x4
+	;s_load_dword s4, s[4:5], 0x4
 	v_add_nc_u32_e32 v3, v1, v0
 	v_mul_u32_u24_e32 v1, 7, v1
 	s_mulk_i32 s9, 0x18f0

Your code seems wrong because all threads write into c[0], but I assume that is either a typo in your message or that a big part of the code is missing due to using c-reduce.

Yes, it's the effect of the reduction: using different locations for each thread still reproduces the problem; using the same location (while technically UB due to the race condition) just results in simpler assembly.
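
For reference, one way the different-locations variant could look (the exact indexing used in the experiment is not shown in this thread, and the kernel name is mine, not taken from the attachment):

extern "C" __global__ void VkFFT_main_per_thread(unsigned long long* g, unsigned long long* h) {
  __shared__ unsigned long long c[8192];
  unsigned b =
      threadIdx.y * (threadIdx.y + threadIdx.x) * 7 + blockIdx.z * 6384;
  c[threadIdx.x] = g[b]; // indexed by threadIdx.x instead of the single slot c[0]
  h[b] = c[threadIdx.x];
}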

My guess was that it has something to do with the command processor (CP) not caching the dispatch packet, while accessing it from the kernel (to read blockDim from it) leaves it in cache for subsequent blocks. This is supported by the fact that the following change (loading the kernarg pointer by hand, which is also done by the CP) also results in the faster-performing kernel.

--- test.s	2023-01-31 09:35:20.406608882 +0000
+++ test.noasm.s	2023-01-31 09:37:13.877167577 +0000
@@ -6,8 +6,9 @@
 	.type	VkFFT_main,@function
 VkFFT_main:                             ; @VkFFT_main
 ; %bb.0:
+        s_load_dwordx2 s[6:7], s[4:5], 0x28
 	s_load_dwordx4 s[0:3], s[6:7], 0x0
-	s_load_dword s4, s[4:5], 0x4
+	;s_load_dword s4, s[4:5], 0x4
 	v_add_nc_u32_e32 v3, v1, v0
 	v_mul_u32_u24_e32 v1, 7, v1
 	s_mulk_i32 s9, 0x18f0
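
For reference, the two offsets in these diffs correspond to fields of the HSA kernel dispatch packet (the s[4:5] pair holds the dispatch packet address here): 0x4 is workgroup_size_x and 0x28 is kernarg_address. A small host-side check of those offsets, assuming the ROCm HSA headers are available:

#include <cstddef>
#include <cstdio>
#include <hsa/hsa.h>

int main() {
  // s_load_dword   s4,     s[4:5], 0x4  -> workgroup_size_x (and _y in the upper 16 bits)
  std::printf("workgroup_size_x offset: 0x%zx\n",
              offsetof(hsa_kernel_dispatch_packet_t, workgroup_size_x));
  // s_load_dwordx2 s[6:7], s[4:5], 0x28 -> kernarg_address
  std::printf("kernarg_address offset:  0x%zx\n",
              offsetof(hsa_kernel_dispatch_packet_t, kernarg_address));
  return 0;
}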

Maetveis pushed a commit to StreamHPC/VkFFT that referenced this issue Feb 14, 2023
Using blockDim in HIP kernels unfortunately incurs a large overhead,
because this (dynamic) information is stored in the dispatch packet
located in a host-coherent memory region. Since vkFFT always knows the
work group size it's going to use, just replace uses of blockDim with
these values.
This means the load from non-cached memory is avoided, the dispatch
pointer doesn't have to be loaded (which frees up 2 SGPRs), and some
indexing calculations might constant fold better.

The added option `useStaticWorkGroupSize` has three possible values:
- -1: Disable embedding blockDim sizes, effectively the old behavior
-  0: Automatically enable embedding when profitable (always except for RDNA2)
-  1: Always enable

Embedding is disabled by default on RDNA because it can sometimes decrease
performance, with the reason not fully known; details at [1]

[1]: ROCm/hipamd#53

Co-authored-by: [email protected]