Should something for 128-bit division or wide division be included? #15
I wrote the algorithms used in compiler-builtins. The delegate algorithm is used on many targets as the default for 128-bit division. The trifecta algorithm is the fastest algorithm on architectures that have fast 64-to-128-bit widening multiplication and a hardware 64-by-64 divide (it was originally configured by just checking the pointer width, but I forgot about WebAssembly, and Alex just changed it to also be enabled on WebAssembly, which is probably the best choice for now). The asymmetric algorithm configured on x86-64 takes advantage of a special assembly instruction that has a 128-by-64 divide, and expands it to the full 128-by-128. So, the remaining performance gap is because asymmetric is faster on x86-64 but requires assembly. The reference crate for the division code used in compiler-builtins is https://crates.io/crates/specialized-div-rem .
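For readers unfamiliar with the instruction in question, here is a minimal, purely illustrative sketch of the 128-by-64 divide that the asymmetric algorithm builds on. This is not the specialized-div-rem or compiler-builtins code, and the function name is made up for the example:

```rust
/// Illustrative only: x86-64's `div` divides the 128-bit value in RDX:RAX
/// by a 64-bit operand, leaving the quotient in RAX and the remainder in
/// RDX. The caller must guarantee `hi < divisor`, otherwise the quotient
/// overflows 64 bits and the instruction faults.
#[cfg(target_arch = "x86_64")]
unsafe fn div_rem_128_by_64(hi: u64, lo: u64, divisor: u64) -> (u64, u64) {
    let quot: u64;
    let rem: u64;
    core::arch::asm!(
        "div {d}",
        d = in(reg) divisor,
        inout("rax") lo => quot,
        inout("rdx") hi => rem,
    );
    (quot, rem)
}
```

The asymmetric algorithm's job is essentially to reduce a full 128-by-128 divide to one or two of these hardware 128-by-64 steps, which is why it only pays off on targets that expose such an instruction.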
Thanks for the info and links @AaronKutch! One thing I can clarify is that my testing so far was not using the updated algorithms. My hypothesis is that if compiler-builtins is built with this proposal (which has widening multiplication), and I test with the updated version that's using the trifecta algorithm, then I should see a small performance loss on x86_64 due to not having access to 128-by-64 divides, but close to no performance loss on aarch64, which does not have a native 128-by-64 divide. I'll work on measuring this hypothesis.
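For context, here is a hedged sketch of the 64-to-128-bit widening multiplication the trifecta algorithm relies on. The helper name is illustrative, not the compiler-builtins API:

```rust
// Illustrative helper, not the compiler-builtins implementation: multiply
// two 64-bit values into a full 128-bit result and return the low and high
// halves. Native ISAs like x86-64 and aarch64 do this in one or two
// instructions; this proposal would give wasm a comparable primitive.
fn widening_mul(a: u64, b: u64) -> (u64, u64) {
    let wide = (a as u128) * (b as u128);
    (wide as u64, (wide >> 64) as u64)
}
```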
Upon re-measuring I'm seeing the following numbers. This is with an updated division algorithm in Rust's compiler-builtins:
So it looks like this benchmark in particular is not stressing these code paths.
Following up on this with arm64 numbers: with everything updated, an M2 is 15% slower than native in this division benchmark. That follows the intuition here, I think, where 15% is probably "built-in wasmtime overhead" and the 20% difference with x64 is likely because x64 has access to a native instruction wasm can't use.
I'm going to close this as "no" with some rationale recorded in the overview.
Currently this proposal's overview states that division on native platforms always goes to `__udivti3`, so it's probably not worth adding anything. Additionally the overview notes that while x64 can divide a 128-bit number by a 64-bit number, that's not what 128-bit division in wasm wants in theory.

Despite these two claims, benchmarking shows that bignum division is significantly slower in WebAssembly than it is natively. Later benchmarking has shown that with a prototype version of this proposal the gap can be significantly reduced, but it's still quite far behind native.
I've done some further investigation and found that the Rust compiler's implementation of `__udivti3` wasn't optimal for wasm, and native architectures were using a different algorithm. After updating the algorithm used and enabling a prototype for this proposal on x64, the performance gap is that wasm is now 36% behind native. This is significantly closer than previous attempts (yay!)

Given all this I'd like to open this issue to track further investigation. For example, is there remaining low-hanging fruit in this 36% that can be easily picked by slightly changing this proposal or adding a new opcode? I'm not sure at this time!