Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_mm256_storeu_pd and _mm256_loadu_pd using 128 bit lanes #1198

Merged
merged 1 commit into from
Sep 13, 2024

Conversation

AlexK-BD
Copy link
Contributor

Simple test case:

void test(double* dst, double* a) {
    __m256d b = _mm256_loadu_pd(a);
    __m256d c = _mm256_add_pd(b, b);
    _mm256_storeu_pd(dst, c);
}

On arm64, with gcc 11 (-O2 -fno-stack-protector -march=armv8.2-a) it generates the following:

0000000000000000 <_Z4testPdS_>:
   0:   ad400021        ldp     q1, q0, [x1]
   4:   d10143ff        sub     sp, sp, #0x50
   8:   9100ffe2        add     x2, sp, #0x3f
   c:   927be842        and     x2, x2, #0xffffffffffffffe0
  10:   4e61d421        fadd    v1.2d, v1.2d, v1.2d
  14:   4e60d400        fadd    v0.2d, v0.2d, v0.2d
  18:   ad000041        stp     q1, q0, [x2]
  1c:   ad400440        ldp     q0, q1, [x2]
  20:   ad000400        stp     q0, q1, [x0]
  24:   910143ff        add     sp, sp, #0x50
  28:   d65f03c0        ret

Note the confusing sequences of stp, ldp, stp, at the end, and all the stuff with the stack pointer at the beginning.

With this change:

0000000000000000 <_Z4testPdS_>:
   0:   ad400021        ldp     q1, q0, [x1]
   4:   4e61d421        fadd    v1.2d, v1.2d, v1.2d
   8:   4e60d400        fadd    v0.2d, v0.2d, v0.2d
   c:   ad000001        stp     q1, q0, [x0]
  10:   d65f03c0        ret

I've looked a bit at how x86_64 without avx behaves as well. It mostly doesn't suffer from the same problem (again, gcc 11 -msse2). It does seem to think something is up with the stack, though.

Without this change, -O2 -msse2:

0000000000000000 <_Z4testPdS_>:
   0:   f3 0f 1e fa             endbr64 
   4:   55                      push   %rbp
   5:   48 89 e5                mov    %rsp,%rbp
   8:   48 83 e4 e0             and    $0xffffffffffffffe0,%rsp
   c:   48 83 ec 60             sub    $0x60,%rsp
  10:   66 0f 10 46 10          movupd 0x10(%rsi),%xmm0
  15:   66 0f 10 0e             movupd (%rsi),%xmm1
  19:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
  20:   00 00 
  22:   48 89 44 24 58          mov    %rax,0x58(%rsp)
  27:   31 c0                   xor    %eax,%eax
  29:   66 0f 58 c9             addpd  %xmm1,%xmm1
  2d:   66 0f 58 c0             addpd  %xmm0,%xmm0
  31:   0f 11 0f                movups %xmm1,(%rdi)
  34:   0f 11 47 10             movups %xmm0,0x10(%rdi)
  38:   48 8b 44 24 58          mov    0x58(%rsp),%rax
  3d:   64 48 2b 04 25 28 00    sub    %fs:0x28,%rax
  44:   00 00 
  46:   75 02                   jne    4a <_Z4testPdS_+0x4a>
  48:   c9                      leave  
  49:   c3                      ret    
  4a:   e8 00 00 00 00          call   4f <_Z4testPdS_+0x4f>

Without this change, -O2 -msse2 -fno-stack-protector:

0000000000000000 <_Z4testPdS_>:
   0:   f3 0f 1e fa             endbr64 
   4:   66 0f 10 46 10          movupd 0x10(%rsi),%xmm0
   9:   66 0f 10 0e             movupd (%rsi),%xmm1
   d:   66 0f 58 c0             addpd  %xmm0,%xmm0
  11:   66 0f 58 c9             addpd  %xmm1,%xmm1
  15:   0f 11 47 10             movups %xmm0,0x10(%rdi)
  19:   0f 11 0f                movups %xmm1,(%rdi)
  1c:   c3                      ret    

With this change, both produce:

0000000000000000 <_Z4testPdS_>:
   0:   f3 0f 1e fa             endbr64 
   4:   66 0f 10 0e             movupd (%rsi),%xmm1
   8:   66 0f 10 46 10          movupd 0x10(%rsi),%xmm0
   d:   66 0f 58 c9             addpd  %xmm1,%xmm1
  11:   66 0f 58 c0             addpd  %xmm0,%xmm0
  15:   0f 11 0f                movups %xmm1,(%rdi)
  18:   0f 11 47 10             movups %xmm0,0x10(%rdi)
  1c:   c3                      ret  

So this seems like an improvement.

Is this a good approach? Should I use a narrower #if check to turn on this behavior (arm only)?

Copy link
Collaborator

@mr-c mr-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems reasonable! However I'm just the cheerleader-in-residence; I will defer to the research of you and others and merge or not accordingly.

Though it did cause a llvm/clang crash (congratulations 🎉) Can you file a bug report?

https://github.com/simd-everywhere/simde/actions/runs/9979128222/job/27577599524#step:8:1

And then we can add a workaround for emscripten that references that bug report

Copy link
Collaborator

@mr-c mr-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like emscripten/llvm fixed the compiler error

@mr-c
Copy link
Collaborator

mr-c commented Aug 20, 2024

@AlexK-BD I'm happy to merge this as it is; please let me know if that is okay with you

@mr-c
Copy link
Collaborator

mr-c commented Sep 12, 2024

Ping @AlexK-BD ; I'm thinking about making a new SIMDe release this month. I'd like to merge this PR of yours, if that is okay with you.

@AlexK-BD AlexK-BD marked this pull request as ready for review September 13, 2024 13:12
@AlexK-BD
Copy link
Contributor Author

Yes, please merge it if possible.

@mr-c mr-c enabled auto-merge (rebase) September 13, 2024 13:42
@mr-c mr-c disabled auto-merge September 13, 2024 15:12
@mr-c mr-c merged commit 96054b8 into simd-everywhere:master Sep 13, 2024
95 of 98 checks passed
@mr-c
Copy link
Collaborator

mr-c commented Sep 13, 2024

Thank you @AlexK-BD !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants