_mm256_storeu_pd and _mm256_loadu_pd using 128 bit lanes #1198
Simple test case:
On arm64, with gcc 11 (`-O2 -fno-stack-protector -march=armv8.2-a`), it generates the following:

Note the confusing sequences of `stp`, `ldp`, `stp` at the end, and all the stuff with the stack pointer at the beginning.

With this change:
I've also looked a bit at how x86_64 without AVX behaves (again gcc 11, with `-msse2`). It mostly doesn't suffer from the same problem, though the compiler does seem to think something is up with the stack.
Without this change, `-O2 -msse2`:

Without this change, `-O2 -msse2 -fno-stack-protector`:

With this change, both produce:
So this seems like an improvement.
Is this a good approach? Should I use a narrower `#if` check to turn on this behavior (arm only)?
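
For context, here is a rough sketch of the idea in the title, not the actual diff. It assumes the change amounts to doing the unaligned 256-bit load and store one 128-bit half at a time. The type `v256d_sketch` and the functions `v256d_loadu_sketch`, `v256d_storeu_sketch`, and `copy4_sketch` are hypothetical names invented for illustration; they are not the library's real types or intrinsics.

```c
#include <string.h>

/* Hypothetical stand-in for a 256-bit vector of doubles,
 * kept as two 128-bit lanes of two doubles each. */
typedef struct {
    double lane[2][2];
} v256d_sketch;

/* Unaligned load: copy each 128-bit half separately, so the compiler
 * sees two 16-byte copies rather than one 32-byte object. */
static v256d_sketch v256d_loadu_sketch(const double *mem) {
    v256d_sketch r;
    for (int i = 0; i < 2; i++) {
        memcpy(r.lane[i], mem + 2 * i, sizeof r.lane[i]);
    }
    return r;
}

/* Unaligned store: mirror of the load, one 128-bit half per copy. */
static void v256d_storeu_sketch(double *mem, v256d_sketch a) {
    for (int i = 0; i < 2; i++) {
        memcpy(mem + 2 * i, a.lane[i], sizeof a.lane[i]);
    }
}

/* Test-case shape (assumed): move four doubles through a load and a store. */
static void copy4_sketch(double *dst, const double *src) {
    v256d_storeu_sketch(dst, v256d_loadu_sketch(src));
}
```

Whether a compiler turns each half into a single 128-bit register move depends on the target and flags, so this only shows the shape of the approach and makes no claim about the generated code.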