Special shuffles to rearrange lanes #30
Comments
Here is a summary table of the proposed instructions and their support for different architectures.
It should be noted that some "general shuffles" can be implemented quite efficiently without needing a … The main issue we have here is that AVX and AVX512 are modeled around "nested" SIMD, where most swizzle operations are defined in terms of 128-bit swizzles. Thus, even though the SSE "unpacklo|hi" operations directly match "interleave low|high", this match breaks for larger vectors. The same problem exists for narrowing operations like … I don't have a good solution to this issue.
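To illustrate that mismatch, here is a sketch (the helper name is mine, not from the proposal) of a full-width "interleave low" of two 256-bit u8 vectors on AVX2: the unpack instructions only interleave within each 128-bit block, so a cross-block permute is needed afterwards.

```c
#include <immintrin.h>

// Full-width "interleave low halves" of two 256-bit u8 vectors on AVX2.
static inline __m256i interleave_low_u8(__m256i a, __m256i b) {
  __m256i t0 = _mm256_unpacklo_epi8(a, b); // a0,b0..a7,b7   | a16,b16..a23,b23
  __m256i t1 = _mm256_unpackhi_epi8(a, b); // a8,b8..a15,b15 | a24,b24..a31,b31
  // Keep the low 128-bit block of each intermediate: a0,b0 .. a15,b15.
  return _mm256_permute2x128_si256(t0, t1, 0x20);
}
```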
Concat low/high might also be useful (e.g. to combine two half regs of demoted/narrowed values). For PSHUFB, 128-bit blocks can actually be helpful, whereas palignr is much less useful because of them. I suppose we could argue that trying to impose a general shuffle on x86 would be very expensive, whereas the …
I thought about adding "concat low/high", but I'm not sure what its use case would be, as we would mostly not deal with half registers. Also, if you look at #27, you will see that I proposed shuffles inside 128-bit blocks (…). Concerning the "general shuffles" for x86, I would like to point out that some of them can be efficiently implemented on x86 (most of …).
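To put the cost difference mentioned above in concrete terms, here is a hedged AVX2 sketch (helper names are mine, indices assumed in 0..31): a byte shuffle restricted to 128-bit blocks is a single instruction, while a fully general 32-byte lookup needs two in-block shuffles, a cross-block permute, and a blend.

```c
#include <immintrin.h>

// Shuffle within 128-bit blocks: each output byte picks from its own block.
static inline __m256i shuffle_in_blocks_u8(__m256i v, __m256i idx) {
  return _mm256_shuffle_epi8(v, idx); // indices are taken modulo 16, per block
}

// General shuffle: each output byte may pick from anywhere in the 32 bytes.
static inline __m256i shuffle_any_u8(__m256i v, __m256i idx) {
  // Bytes of the other 128-bit block, so both halves are reachable.
  __m256i swapped    = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1, 0, 3, 2));
  __m256i from_own   = _mm256_shuffle_epi8(v, idx);
  __m256i from_other = _mm256_shuffle_epi8(swapped, idx);
  // An output byte in block B must take "other" when bit 4 of its index
  // differs from B: XOR the index with the block number (0 or 16), then
  // move bit 4 into the byte's sign bit for blendv.
  __m256i block_id = _mm256_setr_epi32(0, 0, 0, 0,
                                       0x10101010, 0x10101010,
                                       0x10101010, 0x10101010);
  __m256i use_other = _mm256_slli_epi16(_mm256_xor_si256(idx, block_id), 3);
  return _mm256_blendv_epi8(from_own, from_other, use_other);
}
```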
We see this when demoting one register from u16 to u8 etc. I suppose requiring two inputs would mostly prevent that, but what about f32 -> i8, would we always need 4 inputs?
Great!
My point of view was that in the case where you don't need 2 (or 4) inputs, you pass zeros as the relevant inputs.
Yes, that would work. I'm curious whether you are interested in reducing code changes when porting scalar code to SIMD? When allowing half vectors, the code for demoting is basically the same as scalar after replacing array accesses with Load/Store. With 2 inputs, that at least requires an extra param, or encourages unrolling the loop 2x. It is also less efficient on some architectures. I am not saying we should disallow 2:1 entirely, but it's not clear to me that it is the best option, if there is only going to be one.
For memory accesses, I think we should provide narrowing/widening ones that would always deal with full vectors. Also, if we provide 2:1 conversions and pass a zero, a smart-ish WASM engine could detect it and do a 1:½ conversion.
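As a concrete (hypothetical) illustration of the "pass zeros" idea, here is what a 2:1 i16 -> u8 saturating demote looks like on SSE2; with an all-zero second operand, the upper half of the result is zero and the operation is effectively the 1:½ conversion an engine could recognize.

```c
#include <emmintrin.h>

// 2:1 demote, i16 -> u8 with unsigned saturation (helper name is mine).
static inline __m128i demote_i16_to_u8(__m128i a, __m128i b) {
  return _mm_packus_epi16(a, b); // low 8 bytes from a, high 8 bytes from b
}

// With zeros as the second input, only the low half of the result is meaningful:
// __m128i narrow = demote_i16_to_u8(v, _mm_setzero_si128());
```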
I think the value of these instructions is a bit lower if they just map to general shuffles, though there might be situations where that is inevitable.
Yes, 32-bit and 64-bit have separate instructions, which makes it easier, as you mentioned above. This is a useful distinction, as those instructions are cheaper (even though still "general") than the byte-wise versions.
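For comparison (helper name is mine): on AVX2, a fully general 32-bit shuffle is a single cross-block instruction, which is part of why it is cheaper than the byte-wise case.

```c
#include <immintrin.h>

// General 32-bit shuffle: each index lane selects any of the 8 source lanes.
static inline __m256i shuffle_any_u32(__m256i v, __m256i idx) {
  return _mm256_permutevar8x32_epi32(v, idx);
}
```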
True - I wonder what the lowering would be if we take the nested approach for this set of operations, though it would probably not be the most efficient on Arm.
The "nested" approach is efficiently implementable with It seems that all __m512i vec.v8.interleave_even(__m512i a, __m512i b) {
b = _mm512_slli_epi16(b, 8);
__mmask64 mask = 0x5555555555555555ul; // mask for even elements
return _mm512_mask_mov_epi8(mask, a, b);
}
__m512i vec.v8.interleave_odd(__m512i a, __m512i b) {
a = _mm512_srli_epi16(a, 8);
__mmask64 mask = 0x5555555555555555ul; // mask for even elements
return _mm512_mask_mov_epi8(mask, a, b);
}
|
@lemaitre FYI here is another example where half vectors can help: https://github.com/riscv/riscv-v-spec/pull/657/files
@jan-wassenberg I understand this example, but I'm not convinced. Such an example in SSE would look like this (equivalent to LMUL=1 for both):

```c
void add_ref(long N, ...) {
  for (long I = 0; I < N; I += 16) {
    __m128i vc_a = _mm_load_si128((__m128i*)(&c_a[I]));
    __m128i vc_b = _mm_load_si128((__m128i*)(&c_b[I]));
    __m128i vc_c = _mm_add_epi8(vc_a, vc_b);
    _mm_store_si128((__m128i*)(&c_c[I]), vc_c);

    for (long i = I; i < I+16; i += 2) {
      __m128i vl_a = _mm_load_si128((__m128i*)(&l_a[i]));
      __m128i vl_b = _mm_load_si128((__m128i*)(&l_b[i]));
      __m128i vl_c = _mm_add_epi64(vl_a, vl_b);
      _mm_store_si128((__m128i*)(&l_c[i]), vl_c);
      ...
      __m128i vl_m = _mm_load_si128((__m128i*)(&l_m[i]));
      vl_m = _mm_add_epi64(_mm_add_epi64(vl_m, vl_c), ...);
      _mm_store_si128((__m128i*)(&l_m[i]), vl_m);
    }
  }
}
```
Sorry it took me this long to reply.
Would not …? To be honest, I don't think implied mask detection in …
Depends on the actual mask and arch, but if the mask corresponds to a 32-bit shuffle on AVX2, a … On 128-bit archs, … The other benefit of this …
However, LUT1 is not limited to 128-bit blocks, and elements can be fetched from anywhere within the source vector.
We are on the same page here. You might remember that I wanted more shuffle instructions to make WASM engines' job easier, but this never caught on. The key to a performant runtime is to keep as much semantics as possible from the source code through to the WASM engine. And this is done by having more specialized instructions, not fewer.
Specialized shuffles to zip or unzip lanes. #28 proposes interleave and concat, which roughly correspond to Arm's zip and unzip instructions. In short, interleave/zip takes odd or even lanes from two vectors and interleaves them in the output. Concat/unzip is the reverse operation - odd or even lanes from each of the sources are gathered together in the destination.
The closest x86 equivalent to zip/interleave is unpack, but it takes adjacent lanes from the source operands instead of odd or even ones. It is called unpack because, when it is used with a vector of zeros, it is the opposite of "pack", which reduces lane sizes with signed or unsigned saturation.
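A small sketch of that remark (helper names are mine): with a zero second operand, unpack widens u8 lanes to u16, undoing a previous saturating pack of in-range values.

```c
#include <emmintrin.h>

// Zero-extend u8 lanes to u16 using unpack with a zero vector (SSE2).
static inline __m128i widen_low_u8_to_u16(__m128i v) {
  return _mm_unpacklo_epi8(v, _mm_setzero_si128());
}
static inline __m128i widen_high_u8_to_u16(__m128i v) {
  return _mm_unpackhi_epi8(v, _mm_setzero_si128());
}
```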
The obvious takeaway is that these operations exist, but they are quite different on the two major platforms. The less obvious thing is how to marry the two approaches.