Broadcast lane #29
Comments
It should be noted that I have 2 use cases for this instruction:
This operation is not that easy to implement with Neon and SVE either because of the variable index. It is basically equivalent to |
Actually, x86 SIMD has even more limitations, but overall I don't think it's that bad. First, I assume that WASM engines will do proper constant folding and thus detect when the index is known at compile time. This is not quite pattern matching (and is, I assume, simpler and more robust). Also, for 8-bit types it is not hard to implement on all architectures: DUP the index into a vector, then use that vector as the indices of a TBL.
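To make the DUP + TBL idea for 8-bit lanes concrete, here is a minimal scalar sketch in portable C. It only models the data flow and is not engine code: the type `v128_u8` and the helpers `dup_u8`, `tbl_u8` and `broadcast_lane_u8` are hypothetical names introduced for illustration, and `tbl_u8` assumes the usual table-lookup convention of returning 0 for out-of-range indices.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16

/* Scalar model of a 128-bit vector of sixteen 8-bit lanes (hypothetical type). */
typedef struct { uint8_t lane[LANES]; } v128_u8;

/* Models DUP: replicate one scalar into every lane. */
static v128_u8 dup_u8(uint8_t x) {
    v128_u8 r;
    for (int i = 0; i < LANES; i++) r.lane[i] = x;
    return r;
}

/* Models a TBL-style byte lookup: out[i] = table[indices[i]],
 * with 0 for out-of-range indices (assumed convention). */
static v128_u8 tbl_u8(v128_u8 table, v128_u8 indices) {
    v128_u8 r;
    for (int i = 0; i < LANES; i++) {
        uint8_t k = indices.lane[i];
        r.lane[i] = (k < LANES) ? table.lane[k] : 0;
    }
    return r;
}

/* Broadcast lane with a runtime index: splat the index, then look it up. */
static v128_u8 broadcast_lane_u8(v128_u8 v, uint8_t idx) {
    assert(idx < LANES); /* this sketch requires an in-range index */
    return tbl_u8(v, dup_u8(idx));
}
```

Because every lane of the index vector holds the same value, every output lane reads the same source byte, which is exactly the broadcast. The index never has to be a compile-time constant in this formulation.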
I am not sure if in that case I am comfortable with the idea of using
The first 3 instructions would be the same for the |
Oh, I missed that However, this trick is only doable in SVE, and certainly not on x86. So all in all, the "TBL"-like implementation will still be good for those archs. Also, as |
I don't assume JIT compilation in my analysis, and I use vector-length-agnostic code generation. In fact, AOT compilation of WebAssembly doesn't sound completely out of the ordinary (though it is still not common). Yes, that's arguably quite conservative, but it's usually straightforward to simplify once the assumptions are known.
Proposed in #28 (originally #27). It differs from the existing splat in that it broadcasts a lane of an input vector rather than a scalar, and it takes an index selecting which element to broadcast:
On x86, broadcast instructions first appear in AVX (for 32-bit floating-point elements; AVX2 adds integer variants). However, the x86 variants don't take an index and only broadcast the first element of the source. Emulating this on SSE would need a general-purpose shuffle, which is not great (definitely slower than a specialized instruction). Also, taking an index would turn this into a general-purpose shuffle on AVX+ as well.
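As a rough illustration of the difference from splat, and of why an indexed broadcast amounts to a general-purpose shuffle with a repeated selector, here is a scalar sketch in portable C for four 32-bit lanes. The names `v128_u32`, `splat_u32`, `shuffle_u32` and `broadcast_lane_u32` are hypothetical and are not taken from the proposal or from any engine.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 128-bit vector of four 32-bit lanes (hypothetical type). */
typedef struct { uint32_t lane[4]; } v128_u32;

/* Splat: scalar in, vector out. */
static v128_u32 splat_u32(uint32_t x) {
    v128_u32 r;
    for (int i = 0; i < 4; i++) r.lane[i] = x;
    return r;
}

/* Model of a general-purpose lane shuffle: out[i] = v[sel[i]]. */
static v128_u32 shuffle_u32(v128_u32 v, const uint8_t sel[4]) {
    v128_u32 r;
    for (int i = 0; i < 4; i++) r.lane[i] = v.lane[sel[i] & 3];
    return r;
}

/* Broadcast lane: vector and index in, vector out.
 * Expressed here as a shuffle whose selector repeats one index. */
static v128_u32 broadcast_lane_u32(v128_u32 v, uint8_t idx) {
    assert(idx < 4); /* this sketch requires an in-range index */
    const uint8_t sel[4] = { idx, idx, idx, idx };
    return shuffle_u32(v, sel);
}
```

In this model `broadcast_lane_u32(v, i)` gives the same result as `splat_u32(v.lane[i])`, but without moving the element through a scalar first. With index 0 it corresponds to what the index-less x86 broadcast provides.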