Per-lane loads and stores #9
Thanks for branching this discussion. Would it make sense to list a fourth option, that static types (e.g. vec64) implicitly define the size to load? Continuation of discussion with Florian. I can see you are passionate about this topic, which is great :)
This is interesting; it seems you would like autovectorization to work? From my perspective, it still isn't dependable or robust, and it's been 16 years since that went into GCC. As far as I'm concerned, it's far more reliable to use intrinsics, or preferably a wrapper on top of them.
:) It was a moderately complex predictor, so the only parallelization we could see was doing it for two independent color channels at a time.
To be clear, I am worried about set_length for a similar reason: it's not performance-portable to older platforms. The mechanism I had in mind is providing smaller types, vec64, vec32, and for completeness also 16/8. There are specialized codepaths for loading/storing them, but otherwise they just use the vec128 type for computations (to reduce code size). This would be in addition to the type that's the widest available. I believe this is more efficient and safer than using masked load/store: the type already indicates how much to read, so we can just use the 64-bit load instruction directly without decoding a mask. Also, if we change our minds (oh, we can only do a single element), we don't need to update all the call sites of the load/store to make sure the masks are updated.
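A minimal sketch of this idea on x86 with SSE2 intrinsics (function names here are hypothetical, not part of any proposal): the narrow type's load/store compiles to a plain 64-bit `movq`, no mask to decode, while arithmetic still uses the full 128-bit register.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Hypothetical "vec64": a statically-typed 8-byte load into a 128-bit
   register. The type alone determines how much memory is touched. */
static inline __m128i load_vec64(const uint8_t *p) {
    /* Single 64-bit load (movq); upper 8 lanes are zeroed. */
    return _mm_loadl_epi64((const __m128i *)p);
}

static inline void store_vec64(uint8_t *p, __m128i v) {
    /* Single 64-bit store (movq); only 8 bytes of memory are written. */
    _mm_storel_epi64((__m128i *)p, v);
}

/* Example: saturating-add 8 to each of 8 bytes, using full-width
   computation but touching only 8 bytes of memory. */
static void add_sat8_vec64(uint8_t *p) {
    __m128i v = load_vec64(p);
    v = _mm_adds_epu8(v, _mm_set1_epi8(8));
    store_vec64(p, v);
}
```

The point of the sketch is that no mask register or mask decoding appears anywhere on the memory path.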
Most differences in opinion are grounded in differing experiences. I tried really hard to make masked loads efficient for HighwayHash (where users give unpadded buffers and the hash function has to respect that), and was rather disappointed.
This can be interesting to discuss further. I imagine that most users will already have used native code, and we've already established that they care about speed (why use SIMD otherwise?). Then the question is: how much slowdown would they accept before giving up and writing blog posts about how native apps are preferable? IMO something like a 10-20% penalty is acceptable; what's your thinking on this?
Agreed. But in the use cases I've seen, the remainders are by definition negligible when we're processing entire scanlines or audio sample buffers or whatever. The only case I've seen where 'remainders' are important is strlen, and I would modestly claim that null-terminated strings are a rather poor choice anyway (Pascal strings, std::string, even BSTR have fixed this for a long time).
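To make the padding argument concrete, here is a small portable sketch (allocator name and a 16-byte vector width are my assumptions): if buffers are rounded up to a multiple of the vector width at allocation time, full-width loads and stores on the tail never touch unowned memory, and no masked or partial accesses are needed at all.

```c
#include <stdlib.h>

enum { VEC_BYTES = 16 }; /* assumed vector width for illustration */

/* Allocate n bytes rounded up to a multiple of VEC_BYTES, zero-filled,
   so that full-width vector accesses on the tail are always in-bounds. */
static void *alloc_padded(size_t n, size_t *padded_n) {
    size_t rounded = (n + VEC_BYTES - 1) / VEC_BYTES * VEC_BYTES;
    void *p = calloc(rounded, 1); /* zeroed padding is often convenient */
    if (p && padded_n) *padded_n = rounded;
    return p;
}
```

With this, the "remainder" iteration is just another full-width iteration that happens to read some zero padding.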
Oh yes, concurrency is harder. But aren't those apps already in trouble if the vectors are of unknown length? In other words: if they can't write a full vector (however long it is), then it's a "remainder" case anyway.
First, I would like to highlight the fact that mask support is pretty much required anyway, because AVX512 and SVE comparison instructions do return an actual mask, and not a "mask-like" vector as on previous architectures. It would still be possible to use masks only for the ternary operator.
Padding should be the recommended way for high-performance applications even if we have support for masked memory accesses.
Masking is not a new concept and has been partially supported for a long time (even without native mask types). Here is a summary of the problems (I refer to masked accesses, but the same issues apply to partial accesses):
Actually, masked stores are not problematic on x86, as they have existed since SSE2, and even though they don't exist for 8-bit lanes on AVX2, it is still possible to split the vector in two and use the 128-bit masked store. Unaligned masked loads are a different beast. There is no support for masked loads in SSE, and only partial support (int32 and int64) in AVX. The signal handler solution could work in theory, but would need a complete list of masked loads and a way to retrieve the actual mask. The benefit of this solution is that you pay for it only if you're on the verge of faulting; it is otherwise free.
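For reference, the SSE2-era masked store mentioned here is `MASKMOVDQU` (`_mm_maskmoveu_si128`), which stores only the bytes whose mask byte has the high bit set. A small sketch (the wrapper name is mine, not a standard API):

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Store only the first n bytes of v to dst using the SSE2 per-byte
   masked store. Bytes beyond n are left untouched in memory. */
static void store_first_n_bytes(uint8_t *dst, __m128i v, int n) {
    uint8_t m[16];
    for (int i = 0; i < 16; i++)
        m[i] = (uint8_t)((i < n) ? 0x80 : 0x00); /* high bit selects the byte */
    __m128i mask = _mm_loadu_si128((const __m128i *)m);
    _mm_maskmoveu_si128(v, mask, (char *)dst);
    _mm_sfence(); /* MASKMOVDQU is a non-temporal store */
}
```

Note that, as the comment thread implies, there is no equally old counterpart for masked *loads*; this asymmetry is exactly the problem discussed above.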
I would probably prefer to have masked gather for this.
SIMD is my work, and I had time to think about these on my own. So why not share my thoughts and try to make the world a better place? ;)
Yes, I would like to see autovectorization (which is not a WASM concern, as it happens before WASM). Making autovectorization work does not exclude the possibility of using intrinsics, though.
Well, I have the impression that this is not really about flexible vectors then. It would be a refinement of fixed-sized SIMD for smaller types.
It is more efficient only if the actual length is known at the time the code is generated. If it's not, it would not be more efficient. I would much prefer to rely on metadata and constant folding to optimize masked memory accesses.
This is a valid concern for source code, but not so much for "assembly".
That's surprising, as that's not what the Intel intrinsics guide says: it is more like 8 cycles of latency and, more importantly, you can execute 2 per cycle.
Yes, I think 10-20% is acceptable, but I also think that if the documentation of masked load/store specifically says that padding should be preferred if suitable, people will try that before complaining about speed.
Yes, but those are used to interface with the kernel so will probably never disappear.
If you have a multidimensional array where the last dimension is relatively small, but the others are much larger and are not suited to be the last dimension because of the way processing is done, you can easily rely on full-width accesses if you can stop before the last portion:

```c
int i;
for (i = 0; i < w - vlen; i += vlen) {
  process(i); // not masked
}
maskX m = /* i + laneid < w */;
process(i, m);
```

It is easy to imagine such a case where the remainder is not the majority of the processing, and yet is non-negligible because it is repeated many times.
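The loop above can be sketched in portable scalar C as follows (VLEN and the function names are illustrative stand-ins; each inner loop over `lane` is a stand-in for one vector instruction):

```c
#include <stddef.h>

enum { VLEN = 4 }; /* assumed vector length for illustration */

/* Masked pass over the tail: only lanes with i + lane < w participate,
   mirroring the `i + laneid < w` mask in the pseudocode above. */
static void add_one_masked(int *a, size_t i, size_t w) {
    for (size_t lane = 0; lane < VLEN; lane++)
        if (i + lane < w)
            a[i + lane] += 1;
}

static void add_one(int *a, size_t w) {
    size_t i = 0;
    for (; i + VLEN <= w; i += VLEN)        /* full, unmasked iterations */
        for (size_t lane = 0; lane < VLEN; lane++)
            a[i + lane] += 1;
    if (i < w)
        add_one_masked(a, i, w);            /* single masked remainder */
}
```

The masked pass runs at most once per invocation, but when the last dimension is small and the outer dimensions are large, that remainder is executed on every row, which is exactly the "repeated many times" case described above.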
I am a bit worried how to represent conversion between those and "main" vector types, though I do see why they can be useful.
That is a great subject - I would like to see autovectorization working, and it should be working in at least the naive cases. The question is what compilers are capable of producing. Keep in mind that currently our only compiler backend is LLVM.
As far as I can see, LLVM is already quite good at vectorizing code.
Yes, this would need to be defined. If using the load32_zero semantics (WebAssembly/simd#237),
Trying to establish a new home for the discussion on masks and their alternatives. Let me know if the title does not make sense - changing would be easy.
Generally there are a few ways to handle loads and stores that require less than a full register:

1. Disallow partial loads and stores entirely
2. Masks
3. Set length
Padding (#8) can be seen either as an enhancement to these or even somewhat orthogonal - when data is padded, use of some of the partial vector operations can be avoided.
The bare variant of (1) simply disables loads and stores operating on less than a hardware register. Remainders are to be handled in a scalar loop.
Masks (2) are the same approach as used in AVX and SVE. Since different instruction sets represent masks differently, they need to come with their own types and instructions. For a prior discussion on masks see: #6 (comment) and onward.
The set length (3) approach makes it possible to exclude the tail of the vector: a lane can be excluded, but only together with all the lanes that come after it. The upside is that the representation is simpler than masks, and it works with both masking and non-masking ISAs; the downside is that it introduces internal state.
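The difference between (2) and (3) can be sketched in scalar C (VLEN and the function name are illustrative assumptions): a set-length operation touches only the contiguous prefix of lanes, whereas a mask could skip arbitrary lanes in the middle.

```c
#include <stddef.h>

enum { VLEN = 4 }; /* assumed vector length for illustration */

/* Set-length store: lanes [0, len) are written, lanes [len, VLEN) are
   left untouched. Unlike a mask, it cannot exclude a middle lane
   without also excluding everything after it. */
static void store_with_length(int *dst, const int *src, size_t len) {
    if (len > VLEN)
        len = VLEN;
    for (size_t lane = 0; lane < len; lane++)
        dst[lane] = src[lane];
}
```

Because the active region is just a single integer rather than a per-lane bitmap, it maps directly onto both masking ISAs (build an all-ones prefix mask) and non-masking ones (narrower loads/stores), which is the upside noted above.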