Broadcast lane #29

penzn · 2021-02-27T02:38:15Z

Proposed in #28 (originally #27). It differs from the existing splat in that it broadcasts a lane of an input vector rather than a scalar, and it takes an index to select which element to broadcast:

Gets a single lane from a vector and broadcasts it to the entire vector.
idx is interpreted modulo the cardinal (number of lanes) of the vector.

vec.v8.splat_lane(v: vec.v8, idx: i32) -> vec.v8

vec.v16.splat_lane(v: vec.v16, idx: i32) -> vec.v16

vec.v32.splat_lane(v: vec.v32, idx: i32) -> vec.v32

vec.v64.splat_lane(v: vec.v64, idx: i32) -> vec.v64

vec.v128.splat_lane(v: vec.v128, idx: i32) -> vec.v128
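The intended semantics can be sketched as a scalar reference model. This is only an illustration: the function name `splat_lane` mirrors the proposal, a Python list stands in for a vector, and treating `idx` as an unsigned 32-bit value is an assumption, since the text above only says it is taken modulo the cardinal.

```python
def splat_lane(v, idx):
    """Reference model: broadcast lane (idx mod lane count) to every lane."""
    lanes = len(v)
    # Assumption: the i32 index is reinterpreted as unsigned before the modulo.
    unsigned_idx = idx & 0xFFFFFFFF
    return [v[unsigned_idx % lanes]] * lanes


v = [10, 20, 30, 40]
print(splat_lane(v, 2))  # lane 2 broadcast to all lanes
print(splat_lane(v, 5))  # 5 mod 4 = 1, so lane 1 is broadcast
```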

On x86, broadcast instructions first appear in AVX (32-bit floating-point elements; AVX2 for integers). However, the x86 variants don't take an index and only broadcast the first element of the source. A general-purpose shuffle would be needed to emulate this on SSE, which is not great (definitely slower than a specialized version). Also, taking an index would turn this into a general-purpose shuffle on AVX+ as well.


lemaitre · 2021-02-27T15:41:51Z

It should be noted that vec.vX.splat_lane is not a substitute for vec.S.splat (which takes a scalar). It is a complement.
Yes, in general, vec.vX.splat_lane would be implemented using 1-input shuffles.

I have 2 use cases for this instruction:

prefix sums (to propagate the last value to the next iteration)
matrix multiplication (to save some loads)
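To illustrate the prefix-sum use case, here is a rough scalar model of a blocked inclusive scan in which the last lane of each finished block is broadcast and added to the next block. The helper names (`block_prefix_sum`, `prefix_sum`) and the 4-lane width are hypothetical; only `splat_lane` corresponds to the proposed operation.

```python
def splat_lane(v, idx):
    """Model of the proposed operation: broadcast lane idx (mod lane count)."""
    return [v[idx % len(v)]] * len(v)


def block_prefix_sum(v):
    """Inclusive prefix sum inside one vector using log-step shifted adds."""
    out = list(v)
    shift = 1
    while shift < len(out):
        # Shift lanes up by `shift`, filling the low lanes with zeros.
        shifted = [0] * shift + out[:-shift]
        out = [a + b for a, b in zip(out, shifted)]
        shift *= 2
    return out


def prefix_sum(data, lanes=4):
    """Blocked scan; the carry is the previous block's last lane, broadcast."""
    assert len(data) % lanes == 0
    carry = [0] * lanes
    result = []
    for i in range(0, len(data), lanes):
        block = block_prefix_sum(data[i:i + lanes])
        block = [a + c for a, c in zip(block, carry)]
        # Propagate the last value to the next iteration.
        carry = splat_lane(block, lanes - 1)
        result.extend(block)
    return result


print(prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

The broadcast keeps the carry in a vector register, so the loop never has to move a value through a scalar register between iterations.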

akirilov-arm · 2021-06-14T15:37:44Z

This operation is not that easy to implement with Neon and SVE either, because of the variable index. It is basically equivalent to extract_lane followed by an indexed DUP (index 0), for example. If the index is known at compile time and, in the case of SVE, has a suitable value (fits within the first 16 bytes), then it can be reduced to just the DUP.

lemaitre · 2021-06-14T17:22:50Z

Actually, x86 SIMD has even more limitations, but overall, I don't think it's that bad.

First, I assume that WASM engines will actually do proper constant folding and thus detect when the index is known at compile time. This is not quite pattern matching, though (and is actually simpler and more robust, I assume).

Also, for 8-bit types, it is not that hard to implement on all archs: DUP the index into a vector, and use this vector in a TBL.
For larger types, a bit more index manipulation is required because TBL (and equivalents) have byte granularity.
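The TBL idea, including the extra index manipulation for wider lanes, can be modeled at byte level as below. This is a sketch under assumptions: `tbl` imitates a byte table lookup that returns 0 for out-of-range indices, and `splat_lane_via_tbl` and its `lane_size` parameter are hypothetical names, not part of any proposal or ISA.

```python
def tbl(table_bytes, index_bytes):
    """Byte-granular table lookup; out-of-range indices produce 0."""
    n = len(table_bytes)
    return [table_bytes[i] if i < n else 0 for i in index_bytes]


def splat_lane_via_tbl(vec_bytes, idx, lane_size):
    """Broadcast one lane of `lane_size` bytes using only a byte lookup."""
    n = len(vec_bytes)
    lanes = n // lane_size
    idx %= lanes
    # For 8-bit lanes this is just idx duplicated into every byte.
    # For wider lanes each output byte needs idx * lane_size plus its
    # offset inside the lane, because the lookup works on bytes.
    indices = [idx * lane_size + (b % lane_size) for b in range(n)]
    return tbl(vec_bytes, indices)


vec = list(range(16))  # a 128-bit vector as 16 bytes
print(splat_lane_via_tbl(vec, 3, 1))  # 8-bit lanes
print(splat_lane_via_tbl(vec, 1, 4))  # 32-bit lanes
```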

akirilov-arm · 2021-06-14T19:57:25Z

Also, for 8-bit types, it is not that hard to implement on all archs: DUP the index into a vector, and use this vector in a TBL.

I am not sure if in that case I am comfortable with the idea of using TBL for SVE (or a transfer from a general-purpose to a vector register), so what I have in mind is:

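cntb    x1                  // x1 = vector length in bytes
udiv    w2, w0, w1          // w2 = idx / VL (idx is in w0)
msub    w1, w1, w2, w0      // w1 = idx - VL * w2 = idx mod VL
whilelo p0.b, wzr, w1       // activate the first (idx mod VL) byte lanes
lasta   b1, p0, z0.b        // b1 = element after the last active lane
dup     z1.b, z1.b[0]       // broadcast lane 0 of z1 to all lanes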

The first 3 instructions would be the same for the TBL approach as well (they take care of getting the index into the proper range).
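The sequence above can be followed step by step with a small scalar model. The functions below only imitate the instructions as they are used here (byte lanes, a Python list as the vector), and their names are illustrative; they are not an emulator of the real instructions.

```python
def whilelo(start, limit, lanes):
    """Predicate whose lane i is active while start + i < limit."""
    return [start + i < limit for i in range(lanes)]


def lasta(pred, v):
    """Element after the last active lane; element 0 if no lane is active."""
    last = -1
    for i, active in enumerate(pred):
        if active:
            last = i
    return v[(last + 1) % len(v)]


def splat_lane_sve_model(v, idx):
    vl = len(v)                 # cntb: number of byte lanes
    q = idx // vl               # udiv
    r = idx - vl * q            # msub: idx mod vl
    p = whilelo(0, r, vl)       # first r lanes active
    scalar = lasta(p, v)        # picks lane r
    return [scalar] * vl        # dup


v = list(range(100, 116))
print(splat_lane_sve_model(v, 21))  # 21 mod 16 = 5, so lane 5 is broadcast
```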

lemaitre · 2021-06-14T20:42:27Z

Oh, I missed that lasta could put the result into a vector register (first lane, I assume). Then yes, it is probably a good way to do it.

However, this trick is only doable in SVE, and certainly not on x86. So all in all, the "TBL"-like implementation will still be good for those archs.

Also, as the value returned by cntb is constant during execution, it could be hardcoded during translation, and for a power of 2, those 3 instructions become one.
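The power-of-2 remark amounts to replacing the divide and multiply-subtract pair with a single bitwise AND. A minimal check of that equivalence, assuming a non-negative index and using hypothetical helper names:

```python
def reduce_index_generic(idx, vl):
    """Index reduction as in the udiv + msub pair, for any vector length."""
    q = idx // vl
    return idx - vl * q


def reduce_index_pow2(idx, vl):
    """Same reduction when vl is a known power of two: one AND with vl - 1."""
    assert vl > 0 and vl & (vl - 1) == 0, "vl must be a power of two"
    return idx & (vl - 1)


# The two reductions agree for every non-negative index tried here.
print(all(reduce_index_generic(i, 16) == reduce_index_pow2(i, 16)
          for i in range(1000)))
```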

akirilov-arm · 2021-06-14T20:55:16Z

I don't assume JIT compilation in my analysis, and I use vector-length-agnostic code generation. In fact, AOT compilation of WebAssembly doesn't sound like something completely out of the ordinary (still not common, though). Yes, that's arguably quite conservative, but it's usually straightforward to simplify once the assumptions are known.

penzn mentioned this issue Feb 27, 2021

Add inter-lane operations #28

Draft
