This repository has been archived by the owner on Jan 26, 2024. It is now read-only.

Slowdown observed on Linux with RDNA2 when blockDim is NOT loaded #53

Open
Maetveis opened this issue Oct 26, 2022 · 2 comments

Comments

@Maetveis

The following kernel:

extern "C" __global__ void VkFFT_main(unsigned long long* g, unsigned long long* h) {
  __shared__ unsigned long long c[8192];
  asm volatile(";x: %0" : : "s"((unsigned)blockDim.x));
  unsigned b =
      threadIdx.y * (threadIdx.y + threadIdx.x) * 7 + blockIdx.z * 6384;
  c[0] = g[b];
  h[b] = c[threadIdx.x];
}

becomes ~10% slower when the inline assembly for loading blockDim.x is removed.
This seems to happen only on RDNA2 (tested with V620 and RX6650XT) on Linux.

Observations

  • Does not seem to reproduce on Windows with the RX6900XT
  • The amount of shared memory used by the kernel matters: decreasing the size of c makes the slowdown
    without loading blockDim smaller
  • The generated assembly with and without blockDim is almost identical, differing only in the load
    of the block size from the dispatch packet and the metadata flag that marks the dispatch packet as in use.
    • What's more, if this load instruction is removed from the faster kernel (without changing the metadata), it becomes as slow as the other one.

Motivation

The kernel above was produced by running c-reduce on a kernel extracted from VkFFT.
A possible optimization for VkFFT's HIP backend aims to replace blockDim with its values ahead of time, as they are already known when compilation happens.
This results in big improvements, especially for small problem sizes, except in select cases on RDNA2, where it leads to a slowdown of as much as ~30%. This issue is the result of investigating the cause of that slowdown.
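
To make the idea concrete, here is a minimal sketch of the transformation (not VkFFT's actual generated code; the workgroup size of 64 is an illustrative assumption):

#include <hip/hip_runtime.h>

// Reads blockDim.x from the dispatch packet at run time.
extern "C" __global__ void index_dynamic(const float* in, float* out) {
  unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
  out[tid] = in[tid];
}

// Workgroup size baked into the generated source (64 is an illustrative
// value): the dispatch pointer never has to be read and the multiply can
// constant fold.
extern "C" __global__ void index_static(const float* in, float* out) {
  unsigned tid = blockIdx.x * 64u + threadIdx.x;
  out[tid] = in[tid];
}

The concrete benefits of the static variant are the ones listed in the commit message referenced further below: no load from host-coherent memory, two freed SGPRs, and better constant folding of index arithmetic.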

Environment

hipconfig
hipconfig
HIP version  : 5.3.22061-e8e78f1a

== hipconfig
HIP_PATH     : /opt/rocm-5.3.0
ROCM_PATH    : /opt/rocm-5.3.0
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME  : rocclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-5.3.0/include -I/opt/rocm-5.3.0/llvm/bin/../lib/clang/15.0.0 -I/opt/rocm-5.3.0/hsa/include

== hip-clang
HSA_PATH         : /opt/rocm-5.3.0/hsa
HIP_CLANG_PATH   : /opt/rocm-5.3.0/llvm/bin
AMD clang version 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.3.0 22362 3cf23f77f8208174a2ee7c616f4be23674d7b081)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.3.0/llvm/bin
AMD LLVM version 15.0.0git
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver3

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags :  -std=c++11 -isystem "/opt/rocm-5.3.0/llvm/lib/clang/15.0.0/include/.." -isystem /opt/rocm-5.3.0/hsa/include -isystem "/opt/rocm-5.3.0/include" -O3
hip-clang-ldflags  :  -L"/opt/rocm-5.3.0/lib" -O3 -lgcc_s -lgcc -lpthread -lm -lrt

=== Environment Variables
PATH=/home/gergely/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

== Linux Kernel
Hostname     : nostromo
Linux nostromo 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.5 LTS
Release:	20.04
Codename:	focal
clang version
/opt/rocm/llvm/bin/clang++ --version
AMD clang version 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.3.0 22362 3cf23f77f8208174a2ee7c616f4be23674d7b081)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin
rocminfo
rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD EPYC 7713P 64-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD EPYC 7713P 64-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2000
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            128
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    528082872(0x1f79e7b8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    528082872(0x1f79e7b8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    528082872(0x1f79e7b8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1030
  Uuid:                    GPU-abcb45dca7663b11
  Marketing Name:          AMD Radeon PRO V620
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      4096(0x1000) KB
    L3:                      131072(0x20000) KB
  Chip ID:                 29601(0x73a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2570
  BDFID:                   33536
  Internal Node ID:        1
  Compute Unit:            72
  SIMDs per CU:            2
  Shader Engines:          8
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    31440896(0x1dfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1030
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 3
*******
  Name:                    gfx1030
  Uuid:                    GPU-2293a876b6331dff
  Marketing Name:          AMD Radeon PRO V620
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      4096(0x1000) KB
    L3:                      131072(0x20000) KB
  Chip ID:                 29601(0x73a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2570
  BDFID:                   34304
  Internal Node ID:        2
  Compute Unit:            72
  SIMDs per CU:            2
  Shader Engines:          8
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    31440896(0x1dfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1030
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Attachments

Archive file blockdim-faster-linux-rdna2.tar.gz containing the original VkFFT kernel,
the host code used for the speed tests (based on the launch parameters used by VkFFT), annotated assembly from the kernel, and the script used for the test case reduction.
The script is useful for verifying the slowdown: it compiles and runs the reduced kernel source (test.hip) with and without loading blockDim up to 3 times and shows the difference in time taken.
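
The attachment contains the actual host code; as a rough, purely illustrative sketch, timing the reduced kernel with HIP events could look like the following (the launch dimensions and buffer size are made up, not the VkFFT ones):

#include <hip/hip_runtime.h>
#include <cstdio>

#include "test.hip" // the reduced kernel (VkFFT_main) from the attachment

int main() {
  const size_t n = 1 << 24; // illustrative buffer size, large enough for index b
  unsigned long long *g = nullptr, *h = nullptr;
  hipMalloc((void**)&g, n * sizeof(*g));
  hipMalloc((void**)&h, n * sizeof(*h));

  dim3 block(16, 16, 1); // made-up launch dimensions, not the VkFFT ones
  dim3 grid(1, 1, 64);

  hipEvent_t start, stop;
  hipEventCreate(&start);
  hipEventCreate(&stop);

  hipEventRecord(start, 0);
  for (int i = 0; i < 1000; ++i)
    VkFFT_main<<<grid, block>>>(g, h);
  hipEventRecord(stop, 0);
  hipEventSynchronize(stop);

  float ms = 0.0f;
  hipEventElapsedTime(&ms, start, stop);
  std::printf("1000 launches took %.3f ms\n", ms);

  hipFree(g);
  hipFree(h);
  return 0;
}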

Maetveis changed the title from "Slowdown observed on Linux with RDNA2 when blockDim is **not** loaded" to "Slowdown observed on Linux with RDNA2 when blockDim is NOT loaded" on Oct 26, 2022
@Epliz

Epliz commented Jan 30, 2023

Hi @Maetveis ,

(Not an AMD employee here, but going to try to help.)

I guess you probably moved on to something else, but if anything, I would recommend opening a bug ticket in LLVM instead, as it may be something the compiler can be taught to improve.

Your code seems wrong because all threads write into c[0], but I assume that is either a typo in your message or that a big part of the code is missing due to using c-reduce.

From basic experience, reaching max performance on RDNA can be quite challenging compared to GCN.
While it is hard to tell, and profiling is quite hard on RDNA, I suspect that you might be hitting one of these issues:

  1. bad write coalescing when writing h, which somehow gets better when the load is present. I have observed that adding synchronizations (with __syncthreads()) can sometimes improve performance by improving write coalescing (see the sketch after this list)
  2. if you really have all your threads writing into c[0], then depending on your block size the load might help avoid all threads hitting the LDS at the same time, so the LDS serializes the writes less
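
As a minimal sketch of suggestion 1, purely illustrative: the reduced kernel with a __syncthreads() between the LDS write and the subsequent read (the kernel name is made up, and whether this actually changes the RDNA2 behaviour was not measured in this thread):

extern "C" __global__ void VkFFT_main_sync(unsigned long long* g, unsigned long long* h) {
  __shared__ unsigned long long c[8192];
  unsigned b =
      threadIdx.y * (threadIdx.y + threadIdx.x) * 7 + blockIdx.z * 6384;
  c[0] = g[b];
  __syncthreads(); // order the LDS write before any thread reads from c
  h[b] = c[threadIdx.x];
}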

Anyway, that's pure guessing.
Best,
Epliz

@Maetveis
Author

Maetveis commented Jan 31, 2023

Thanks, @Epliz, for the suggestions; I replied to them below.

I guess you probably moved on to something else

I am still interested in this, as solving it would unblock upstreaming an optimization in vkFFT for AMDGPU, and could potentially improve many kernels on RDNA2 if the root cause can be fixed.

I would recommend opening a bug ticket in LLVM instead, as it may be something the compiler can be taught to improve.

For this specific issue I don't think the compiler is involved, because the problem can be reproduced by editing the generated assembly (removing the load instruction for blockDim):

--- test.s	2023-01-31 08:43:20.798273297 +0000
+++ test.noasm.s	2023-01-31 08:44:16.353553582 +0000
@@ -7,7 +7,7 @@
 VkFFT_main:                             ; @VkFFT_main
 ; %bb.0:
 	s_load_dwordx4 s[0:3], s[6:7], 0x0
-	s_load_dword s4, s[4:5], 0x4
+	;s_load_dword s4, s[4:5], 0x4
 	v_add_nc_u32_e32 v3, v1, v0
 	v_mul_u32_u24_e32 v1, 7, v1
 	s_mulk_i32 s9, 0x18f0

Your code seems wrong because all threads write into c[0], but I assume that is either a typo in your message or that a big part of the code is missing due to using c-reduce.

Yes, it's the effect of the reduction: using different locations for each thread still reproduces the problem; using the same location (while technically UB due to the race condition) just results in simpler assembly.
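
For reference, one way the different-locations variant could look (the exact indexing used in the experiment is not shown in this thread, and the kernel name is mine, not taken from the attachment):

extern "C" __global__ void VkFFT_main_per_thread(unsigned long long* g, unsigned long long* h) {
  __shared__ unsigned long long c[8192];
  unsigned b =
      threadIdx.y * (threadIdx.y + threadIdx.x) * 7 + blockIdx.z * 6384;
  c[threadIdx.x] = g[b]; // indexed by threadIdx.x instead of the single slot c[0]
  h[b] = c[threadIdx.x];
}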

My guess was that it has something to do with the command processor (CP) not caching the dispatch packet, while accessing it from the kernel (to read blockDim from it) leaves it in cache for subsequent blocks. This is supported by the fact that the following change (loading the kernarg pointer by hand, which is also done by the CP) also results in the faster-performing kernel.

--- test.s	2023-01-31 09:35:20.406608882 +0000
+++ test.noasm.s	2023-01-31 09:37:13.877167577 +0000
@@ -6,8 +6,9 @@
 	.type	VkFFT_main,@function
 VkFFT_main:                             ; @VkFFT_main
 ; %bb.0:
+        s_load_dwordx2 s[6:7], s[4:5], 0x28
 	s_load_dwordx4 s[0:3], s[6:7], 0x0
-	s_load_dword s4, s[4:5], 0x4
+	;s_load_dword s4, s[4:5], 0x4
 	v_add_nc_u32_e32 v3, v1, v0
 	v_mul_u32_u24_e32 v1, 7, v1
 	s_mulk_i32 s9, 0x18f0
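
For reference, the two offsets in these diffs correspond to fields of the HSA kernel dispatch packet (the s[4:5] pair holds the dispatch packet address here): 0x4 is workgroup_size_x and 0x28 is kernarg_address. A small host-side check of those offsets, assuming the ROCm HSA headers are available:

#include <cstddef>
#include <cstdio>
#include <hsa/hsa.h>

int main() {
  // s_load_dword   s4,     s[4:5], 0x4  -> workgroup_size_x (and _y in the upper 16 bits)
  std::printf("workgroup_size_x offset: 0x%zx\n",
              offsetof(hsa_kernel_dispatch_packet_t, workgroup_size_x));
  // s_load_dwordx2 s[6:7], s[4:5], 0x28 -> kernarg_address
  std::printf("kernarg_address offset:  0x%zx\n",
              offsetof(hsa_kernel_dispatch_packet_t, kernarg_address));
  return 0;
}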

Maetveis pushed a commit to StreamHPC/VkFFT that referenced this issue Feb 14, 2023
Using blockDim in HIP kernels unfortunately incurs a large overhead,
because this (dynamic) information is stored in the dispatch packet
located in a host-coherent memory region. Since vkFFT always knows the
work group size it's going to use, just replace uses of blockDim with
these values.
This means the load from non-cached memory is avoided, the dispatch
pointer doesn't have to be loaded (which frees up 2 SGPRs), and some
indexing calculations might constant fold better.

The added option `useStaticWorkGroupSize` has three possible values:
- -1: Disable embedding blockDim sizes, effectively the old behavior
-  0: Automatically enable embedding when profitable (always except for RDNA2)
-  1: Always enable

Embedding is disabled by default on RDNA because it can sometimes decrease
performance, with the reason not fully known; details at [1]

[1]: ROCm/hipamd#53

Co-authored-by: [email protected]