What should be the semantics/constraints of hypothetical masked memory accesses? #13
About read-modify-write for masked store: as the wasm memory model is still a work in progress (hi @conrad-watt, @rossberg) it's probably premature to say what will be allowed. The EcmaScript memory model explicitly disallows this as an implementation technique for its current data types, though it does allow floating point loads and stores to tear at byte boundaries. EcmaScript is relevant because we want the wasm and EcmaScript memory models to be a single model, if possible; on the other hand, EcmaScript is unlikely ever to have SIMD data, so there's some freedom here. As we're unlikely ever to have atomic SIMD data of any size it seems that we have some leeway here to choose a model that works for good implementations, though it probably must be a good fit for the existing memory model's formal framework.
Trying to get my head around this has really sent me down a rabbit hole.
Is it possible we would want to vectorise scalar Wasm code as a Wasm->Wasm transformation (e.g. in wasm-opt), using one of these operations? Allowing load-blend-store might make this harder (since the scalar Wasm wouldn't have the same race condition).
I think tearing at byte (or element) boundaries would be OK. In fact, I don't think (masked) accesses on AVX512 or SVE are atomic, so tearing might be unavoidable here. However, I don't get your point about the EcmaScript memory model. Does EcmaScript have any sort of partial/masked memory accesses? If not, why do you say that it explicitly disallows this?

Anyway, the "load-blend-store" is one way to emulate the masked store, but not the only one. I would personally prefer race-condition freedom.
As far as I understand, fixed-width SIMD in WASM has pretty loose alignment requirements (alignment is just a hint). Now, we still might want to forbid unaligned masked accesses, keeping only unmasked unaligned and masked aligned memory accesses, to simplify implementations.
I'm not sure we would want a WASM->WASM vectorization, but a C->WASM vectorization would be interesting. It would probably be as hard to detect race conditions in C as in WASM, so the issue remains. This argument is the one that makes me prefer scalar emulation over "load-blend-store" when there is no hardware support.
I just found this paper: http://www.cosenza.eu/papers/PohlSCOPES18.pdf
They also talk about unaligned masked loads, making the paper even more relevant for this discussion.
Ah, you're right. I misunderstood.
I'm still trying to catch up on how vectorisation works, but could something analogous to the following be a real optimisation?
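For instance, something like this (an illustrative sketch; the names are made up):

```c
#include <stdint.h>

// Two guarded scalar stores - just the shape of the pattern.
void store_pair(int32_t* A, int32_t x0, int32_t x1, int c0, int c1) {
  if (c0) { A[0] = x0; }
  if (c1) { A[1] = x1; }
}
```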
Gets transformed/compiled into
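Something like the following, using SSE2's `maskmovdqu` purely as a stand-in for the hypothetical masked store:

```c
#include <emmintrin.h>
#include <stdint.h>

// The two conditional stores become one masked store: only the bytes
// whose mask lanes are set are written to memory.
void store_pair_masked(int32_t* A, int32_t x0, int32_t x1, int c0, int c1) {
  __m128i v = _mm_set_epi32(0, 0, x1, x0);                    // lanes 0 and 1
  __m128i m = _mm_set_epi32(0, 0, c1 ? -1 : 0, c0 ? -1 : 0);  // lane masks
  _mm_maskmoveu_si128(v, m, (char*)A);  // writes only the active bytes
}
```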
Nothing that's masked or partial per se. There are issues around architectures that don't have byte-granularity accesses; for example, the first Alpha had to do byte accesses as RMW.
I'm quoting the spec, specifically the third bullet of note 2 here: https://tc39.es/ecma262/#sec-shared-memory-guidelines. I don't remember the details - Conrad might - but this may just be a consequence of the details in the model that allow the race semantics to work portably.
@conrad-watt A better example would be:

```cpp
for (int i = 0; i < n; ++i) {
  auto a = A[i];
  if (a > 0) {
    B[i] = a;
  }
}
```

Autovectorization is primarily for loops (even more so with flexible vectors).
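For that loop, the vectorized form could look like this AVX2 sketch (shown just to illustrate where the masked store appears; `copy_positive` is a made-up name and `n` is assumed to be a multiple of 8):

```c
#include <immintrin.h>
#include <stdint.h>

// Store only the lanes where a > 0; the other lanes of B stay untouched.
void copy_positive(const int32_t* A, int32_t* B, int n) {
  for (int i = 0; i < n; i += 8) {
    __m256i a = _mm256_loadu_si256((const __m256i*)&A[i]);
    __m256i m = _mm256_cmpgt_epi32(a, _mm256_setzero_si256());  // a > 0
    _mm256_maskstore_epi32(&B[i], m, a);                        // masked store
  }
}
```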
That makes sense, thanks for the clarification. Therefore, I would recommend forbidding the "load-blend-store" sequence for emulating a masked store, in order not to introduce data races.
@lemaitre does my example below make sense? I believe it supports forbidding the load-blend-store implementation (for shared memories). Consider the following code, where the two threads write to disjoint elements of a shared array.
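A minimal sketch of the setup (the concrete indices and values here are assumptions), where each thread's scalar store is a candidate for compilation to a masked vector store:

```c
#include <stdint.h>

int32_t A[2];  // shared between the two threads

void thread1(void) { A[0] = 1; }  // writes only element 0
void thread2(void) { A[1] = 2; }  // writes only element 1
```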
This is a fine C program. It's guaranteed by the C memory model that if a subsequent thread synchronizes with both thread 1 and 2, then it will observe both writes.

This is the concretisation of @lemaitre's "data-race" concern - but even without having to classify data races, we can see that the compilation scheme isn't sound if load-blend-store is allowed. I think there's a similar logic in forbidding this implementation technique in the JS spec.

In the counter-example above, if Thread 2's store is implemented as a load-blend-store, it can read A[0] before Thread 1's write lands and then write the stale value back, erasing Thread 1's update.
Does x86 guarantee its masked load/store instructions are race-condition free?

This is a great example of why I believe masked loads to be a dangerous/misleading abstraction: if we instead have applications write out the load-blend-store manually, we see that they are writing the entire memory range, and thus thread 2 has a data race. Tools such as tsan will detect this.

If we rely on the fact that single-element stores are atomic (which is not even guaranteed on all platforms, especially if unaligned), and then provide masked_store() = a loop of conditional scalar stores, I do not see what benefit that brings vs. users just writing that loop directly. That would at least make it more visible how expensive this is going to be.
@conrad-watt

```c
// Thread 1: i0 = 0, i1 = 100
// Thread 2: i0 = 100, i1 = 200
for (int i = i0; i < i1; ++i) {
  A[i] = f(i);
}
```
That's a good question. The closest I could find is Intel pseudo-code where the store is done per element within an if, or where it explicitly says
But I'm not able to find anything more than that.
Just a reminder: the goal of this issue is not to decide whether we want them. So I would prefer that you explain why they cannot be made efficient on most architectures, and then see if that can be worked around, instead of just saying that they are not efficient and should be avoided.
We do not need element atomicity, just race-condition freedom on inactive elements.
If we provide a WASM instruction, the scalar emulation will be there only when there is no hardware support. I would also like to remind you that SSE has a masked store instruction. So the question is mostly about Neon (and Altivec?).
Yes, it is unfortunately not made explicit. I would consider it risky to assert any claim of race-freedom (even for inactive lanes) on x86.
As a core has to get exclusive access to a cache line in order to write to it, we can be sure that there is no race condition on inactive elements between cores.
I finally took the time to test the actual behavior of masked store instructions (check out the gist). I tested 4 ways to "masked store" 4x32 bits:
All the threads access the same memory location, but their masks are non-conflicting (in practice, only the first 4 threads have non-zero masks). All the variants give correct results on Skylake-X and Haswell, even with hyper-threading. I also measured the speed of all these variants:

In light of the measurements, scalar emulation is inexpensive and can actually be used. It would be interesting to see how the results change with smaller types (byte accesses) and longer vectors (256 and 512 bits).
I just measured both throughput and latency of all the possible combinations for masked stores. The columns indicate the element size and the number of elements within a vector.
Throughput⁻¹ (cycles):
Latencies (cycles):
First thing:

Then, scalar emulation is good only when the number of elements is low (either the SIMD width is small or the type is large). AVX2 and AVX512 are really good, especially their throughput.

In light of those measurements, masked stores for SSE are OK (with scalar emulation). The only "issue" would be 8-bit and 16-bit integer masked stores on AVX2, which would be slow.

Also, for loop epilogs, the same mask can be reused for multiple masked stores, so it will not be necessary to duplicate the "lane branching" logic. Therefore, I think it could be good to also standardize these operations. What do you think?
I believe masked memory accesses are crucial, but I would like to separate two aspects:
The goal of this issue is to address the question "How would they work?" assuming we want them.
I leave the question "Do we really need them?" for issue #7.
There are 4 main operations: aligned masked load, unaligned masked load, aligned masked store, and unaligned masked store.
There could also be variants for certain mask shapes, like `loadOnlyN` or `storeOnlyN`. Some of these are explored in this paper: http://www.cosenza.eu/papers/PohlSCOPES18.pdf
Aligned masked load
As long as the SIMD width is smaller than a page, it is possible to just use a regular aligned load and select only the active lanes specified by the mask.
No hardware support is required to make this operation efficient.
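For example, a minimal SSE2 sketch, assuming the semantics where inactive lanes read as zero:

```c
#include <emmintrin.h>

// Aligned masked load emulated by a full-width load plus a select.
// An aligned 16-byte load never crosses a page boundary, so it cannot
// fault on inactive lanes.
static inline __m128i masked_load_aligned(const __m128i* src, __m128i mask) {
  __m128i v = _mm_load_si128(src);  // full-width aligned load
  return _mm_and_si128(v, mask);    // keep active lanes, zero the rest
}
```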
Unaligned masked load
Unaligned masked loads are a different beast.
One may think it would be possible to do the same as for aligned masked loads, but a full-width unaligned load might partially fault.
This is the case when the load crosses a page boundary.
If the memory fault happens on active elements, there is no problem, as the fault is legitimate; but the fault might also happen only on inactive elements.
In that case, the fault should be ignored and the load should succeed.
There are 3 ways to work around this issue: handle the fault in a signal handler, emulate the load with scalar accesses, or branch at runtime when the access crosses a page boundary.
I think the signal handler would be the most efficient option, as faults would be rare even if masked loads are used intensively.
This would require an internal signal handler, but I have the impression that WASM engines already have signal handlers for internal use.
Such a signal handler would need a way to check whether the faulting load is actually masked, and in that case, where the mask is actually stored.
This could be done either with a table of all unaligned masked loads in the program, or with a specific nop sequence just before the load instruction.
The scalar emulation could be done in a library function to avoid code bloat.
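A sketch of such a library helper for a 4x32-bit masked load (the zero-fill of inactive lanes is an assumption):

```c
#include <stdint.h>

// Scalar emulation: only active lanes touch memory, so an inactive lane
// sitting on an unmapped page cannot fault.
void masked_load4_i32(int32_t out[4], const int32_t* src,
                      const uint32_t mask[4]) {
  for (int i = 0; i < 4; ++i)
    out[i] = mask[i] ? src[i] : 0;
}
```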
Aligned masked store
A masked store should leave the masked-out values in memory untouched (they must not be modified).
This could be done in the following ways: a native masked store instruction, a "load-blend-store" sequence, or a loop of conditional scalar stores.
Masked stores are, after all, pretty well supported on x86 architectures, and the emulation with tighter stores on AVX2 would be pretty close to a hypothetical hardware instruction.
So the overhead on x86 would be minimal.
The "load-blend-store" would also be quite efficient, but it risks a race condition if another thread (or an interrupt handler) stores elements at the position of inactive elements of the first thread.
So this option should probably be avoided in multithreaded contexts.
Anyway, such a situation would lead to false sharing and would be detrimental to performance.
Note that false sharing is only a performance concern and does not affect correctness.
On Neon, the "load-blend-store" can be turned into a "load exclusive-blend-store conditional" loop to solve the race-condition issue.
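A portable sketch of that idea, using a C11 compare-and-swap loop over a 64-bit chunk as a stand-in for Neon's exclusive load/store pair (the chunking and the names are assumptions):

```c
#include <stdatomic.h>
#include <stdint.h>

// Race-free blend-store for one 64-bit chunk of a vector: retry until
// the store succeeds without an intervening write from another thread,
// so concurrent writes to inactive lanes are never lost.
static void blend_store_u64(_Atomic uint64_t* dst, uint64_t value,
                            uint64_t lane_mask) {
  uint64_t old = atomic_load_explicit(dst, memory_order_relaxed);
  uint64_t desired;
  do {
    desired = (old & ~lane_mask) | (value & lane_mask);  // blend active lanes
  } while (!atomic_compare_exchange_weak_explicit(
      dst, &old, desired, memory_order_relaxed, memory_order_relaxed));
}
```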
Unaligned masked store
Unaligned masked stores are almost the same as aligned ones.
The only difference is that a "load-blend-store" emulation would require the same signal-handler trick as unaligned masked loads.
`loadOnlyN`, `storeOnlyN`
In case we have scalar emulations of masked operations, such an emulation can be more efficient if the mask has a special shape, like "the first N lanes are active, the rest are inactive".
In that case, we could avoid multiple ifs and use a jump table.
On Neon, this could be pretty efficient, as there are instructions for the sequences "load scalar-insert" and "extract-store scalar", and because all the instructions have the same size.
In that case, we could have a computed goto with minimal size (<~20 instructions for 8-bit elements, much less for 32-bit elements).
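For illustration, a `storeOnlyN` sketch for 4x32-bit lanes, with a `switch` fallthrough as the portable analogue of the computed goto:

```c
#include <stdint.h>

// Store the first n lanes of v into dst; lanes n..3 are left untouched.
void store_only_n(int32_t* dst, const int32_t v[4], int n) {
  switch (n) {          // single indirect jump, then straight-line stores
    case 4: dst[3] = v[3];  /* fallthrough */
    case 3: dst[2] = v[2];  /* fallthrough */
    case 2: dst[1] = v[1];  /* fallthrough */
    case 1: dst[0] = v[0];  /* fallthrough */
    case 0: break;
  }
}
```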
Such a `loadOnlyN` or `storeOnlyN` could either be an explicit WASM instruction, or just an optimization that the engine applies when it recognizes the shape of the mask (with some "metadata folding" on masks).

Extra considerations
Maybe we want only some of these operations, e.g. only the aligned ones (I don't recommend it, but it is possible).
We could also say that the program may fault if the load crosses a page boundary, even if the fault occurs only on inactive elements.
In that case, it would be the responsibility of the end user to ensure the 2 pages are actually allocated.
I also don't recommend it, but you might have some ideas to make it work without much complexity on the user side.