_mm256_storeu_pd and _mm256_loadu_pd using 128 bit lanes #1198

AlexK-BD · 2024-07-17T17:57:44Z

Simple test case:

void test(double* dst, double* a) {
    __m256d b = _mm256_loadu_pd(a);
    __m256d c = _mm256_add_pd(b, b);
    _mm256_storeu_pd(dst, c);
}

On arm64, with gcc 11 (-O2 -fno-stack-protector -march=armv8.2-a) it generates the following:

0000000000000000 <_Z4testPdS_>:
   0:   ad400021        ldp     q1, q0, [x1]
   4:   d10143ff        sub     sp, sp, #0x50
   8:   9100ffe2        add     x2, sp, #0x3f
   c:   927be842        and     x2, x2, #0xffffffffffffffe0
  10:   4e61d421        fadd    v1.2d, v1.2d, v1.2d
  14:   4e60d400        fadd    v0.2d, v0.2d, v0.2d
  18:   ad000041        stp     q1, q0, [x2]
  1c:   ad400440        ldp     q0, q1, [x2]
  20:   ad000400        stp     q0, q1, [x0]
  24:   910143ff        add     sp, sp, #0x50
  28:   d65f03c0        ret

Note the confusing sequences of stp, ldp, stp, at the end, and all the stuff with the stack pointer at the beginning.

With this change:

0000000000000000 <_Z4testPdS_>:
   0:   ad400021        ldp     q1, q0, [x1]
   4:   4e61d421        fadd    v1.2d, v1.2d, v1.2d
   8:   4e60d400        fadd    v0.2d, v0.2d, v0.2d
   c:   ad000001        stp     q1, q0, [x0]
  10:   d65f03c0        ret

I've looked a bit at how x86_64 without avx behaves as well. It mostly doesn't suffer from the same problem (again, gcc 11 -msse2). It does seem to think something is up with the stack, though.

Without this change, -O2 -msse2:

0000000000000000 <_Z4testPdS_>:
   0:   f3 0f 1e fa             endbr64 
   4:   55                      push   %rbp
   5:   48 89 e5                mov    %rsp,%rbp
   8:   48 83 e4 e0             and    $0xffffffffffffffe0,%rsp
   c:   48 83 ec 60             sub    $0x60,%rsp
  10:   66 0f 10 46 10          movupd 0x10(%rsi),%xmm0
  15:   66 0f 10 0e             movupd (%rsi),%xmm1
  19:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
  20:   00 00 
  22:   48 89 44 24 58          mov    %rax,0x58(%rsp)
  27:   31 c0                   xor    %eax,%eax
  29:   66 0f 58 c9             addpd  %xmm1,%xmm1
  2d:   66 0f 58 c0             addpd  %xmm0,%xmm0
  31:   0f 11 0f                movups %xmm1,(%rdi)
  34:   0f 11 47 10             movups %xmm0,0x10(%rdi)
  38:   48 8b 44 24 58          mov    0x58(%rsp),%rax
  3d:   64 48 2b 04 25 28 00    sub    %fs:0x28,%rax
  44:   00 00 
  46:   75 02                   jne    4a <_Z4testPdS_+0x4a>
  48:   c9                      leave  
  49:   c3                      ret    
  4a:   e8 00 00 00 00          call   4f <_Z4testPdS_+0x4f>

Without this change, -O2 -msse2 -fno-stack-protector:

0000000000000000 <_Z4testPdS_>:
   0:   f3 0f 1e fa             endbr64 
   4:   66 0f 10 46 10          movupd 0x10(%rsi),%xmm0
   9:   66 0f 10 0e             movupd (%rsi),%xmm1
   d:   66 0f 58 c0             addpd  %xmm0,%xmm0
  11:   66 0f 58 c9             addpd  %xmm1,%xmm1
  15:   0f 11 47 10             movups %xmm0,0x10(%rdi)
  19:   0f 11 0f                movups %xmm1,(%rdi)
  1c:   c3                      ret

With this change, both produce:

0000000000000000 <_Z4testPdS_>:
   0:   f3 0f 1e fa             endbr64 
   4:   66 0f 10 0e             movupd (%rsi),%xmm1
   8:   66 0f 10 46 10          movupd 0x10(%rsi),%xmm0
   d:   66 0f 58 c9             addpd  %xmm1,%xmm1
  11:   66 0f 58 c0             addpd  %xmm0,%xmm0
  15:   0f 11 0f                movups %xmm1,(%rdi)
  18:   0f 11 47 10             movups %xmm0,0x10(%rdi)
  1c:   c3                      ret

So this seems like an improvement.

Is this a good approach? Should I use a narrower #if check to turn on this behavior (arm only)?

mr-c

It seems reasonable! However I'm just the cheerleader-in-residence; I will defer to the research of you and others and merge or not accordingly.

Though it did cause a llvm/clang crash (congratulations 🎉) Can you file a bug report?

https://github.com/simd-everywhere/simde/actions/runs/9979128222/job/27577599524#step:8:1

And then we can add a workaround for emscripten that references that bug report

mr-c

Looks like emscripten/llvm fixed the compiler error

mr-c · 2024-08-20T16:56:37Z

@AlexK-BD I'm happy to merge this as it is; please let me know if that is okay with you

mr-c · 2024-09-12T11:04:24Z

Ping @AlexK-BD ; I'm thinking about making a new SIMDe release this month. I'd like to merge this PR of yours, if that is okay with you.

AlexK-BD · 2024-09-13T13:13:05Z

Yes, please merge it if possible.

mr-c · 2024-09-13T15:12:38Z

Thank you @AlexK-BD !

mr-c requested changes Jul 20, 2024

View reviewed changes

mr-c force-pushed the load_store_256_as_128-pr branch from 3b6722c to c8e5ad7 Compare August 20, 2024 09:14

mr-c approved these changes Aug 20, 2024

View reviewed changes

mr-c force-pushed the load_store_256_as_128-pr branch from c8e5ad7 to 39f649d Compare September 12, 2024 11:02

AlexK-BD marked this pull request as ready for review September 13, 2024 13:12

_mm256_storeu_pd and _mm256_loadu_pd using 128 bit lanes

deeb482

mr-c force-pushed the load_store_256_as_128-pr branch from 39f649d to deeb482 Compare September 13, 2024 13:42

mr-c enabled auto-merge (rebase) September 13, 2024 13:42

mr-c disabled auto-merge September 13, 2024 15:12

mr-c merged commit 96054b8 into simd-everywhere:master Sep 13, 2024
95 of 98 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

_mm256_storeu_pd and _mm256_loadu_pd using 128 bit lanes #1198

_mm256_storeu_pd and _mm256_loadu_pd using 128 bit lanes #1198

AlexK-BD commented Jul 17, 2024

mr-c left a comment

mr-c left a comment

mr-c commented Aug 20, 2024

mr-c commented Sep 12, 2024

AlexK-BD commented Sep 13, 2024

mr-c commented Sep 13, 2024

_mm256_storeu_pd and _mm256_loadu_pd using 128 bit lanes #1198

_mm256_storeu_pd and _mm256_loadu_pd using 128 bit lanes #1198

Conversation

AlexK-BD commented Jul 17, 2024

mr-c left a comment

Choose a reason for hiding this comment

mr-c left a comment

Choose a reason for hiding this comment

mr-c commented Aug 20, 2024

mr-c commented Sep 12, 2024

AlexK-BD commented Sep 13, 2024

mr-c commented Sep 13, 2024