ARM Support for bergamot-translator matrix-multiplies for Mozilla #249
Comments
Some autovectorization seems to be helping on ARM, but I am still unable to enable NEON:
What autovectorisation you would see depends on what the WebAssembly VM allows for.
OK, but the point of this work is to have code that runs natively in Firefox (Gecko), not WebAssembly.
The implementation is being prepared in multiple parts, with concrete details starting to materialize now. The parts include
When the CI in the pull request above ultimately becomes green, we will have an ARM build. The target is my Android phone, so I expect to be able to test it. If somebody has an M1 device and can lend a hand, please feel free to help out with testing. There is a bergamot-translator part of this, but I hope just a submodule update here will be enough. The position this task finds itself in is weird. The C interface is written based on intgemm. intgemm assumes x86 with a lot of registers, and the intrinsics in its source are not guarded behind an ifdef.
Among the above, we're going with (2) for now. This is a mess of ifdefs which I will start on now, slowly bringing CI to green. I might have to redo this after a first round of experimentation. Please let me know if the original authors of the API have a cleaner entry point / cut point.
v0: A Ruy-based backend with a slower implementation now builds successfully on ARM. The relevant implementation of the Firefox interface via Ruy is: All except Tests involve comparing the intgemm path with the ruy path on the same Firefox API (on x86) and are currently succeeding as well. As WebAssembly is intercepting Next steps are improving performance (better transpose, vectorization, etc.).
FWIW, support for 32-bit ARM is not really needed. We are looking for Aarch64-specific support at the moment, which might be easier to test and develop.
@jerinphilip That's right. The plan is to compile bergamot-translator to wasm and use native code only for gemm calls. Therefore, the only native code that we need is the gemm implementations for the Intel and ARM architectures. => Marian doesn't have to be compiled on ARM for our use case. We are in the process of landing the Intel implementation (i.e. intgemm) in Gecko. Once the ARM implementation is ready, we plan to land that as well.
I just want to clarify the ARM term here. 32-bit ARM support will not be needed at all in the near future. Aarch64 support (aka ARM64) is what is needed to support Bergamot in Firefox. I cannot see that any differentiation between these platforms is made above.
Is your phone Aarch64? I see GitHub Actions can support "linux/arm64". Can we use that instead of the Android NDK? A Raspberry Pi 4 is a cheap alternative to an M1, if the latter is not available.
We're not working at this level; we're consuming ruy. So everything that is available at https://github.com/google/ruy/blob/8c3fd3f266b4a22d542d4aa41329b5018d6b87e1/ruy/path.h#L24-L91 will be available to the firefox-interface implementation (including alternate Intel paths and a fallback implementation). The CI configuration is armv8-a (aarch64) with NEON intrinsics (for SIMD). This is the device I intend to run things on for a sanity check.
I'll try to add a CI for:
My initial attempts ran into trouble with missing libraries (math, etc.), so I started off with Android, which looked far better tooled and more complete.
ruy abstracts over several ARM variants, so we're just ensuring ruy is called / adapted correctly. Phone support seems to come for free here, so we might as well do it if we're doing M1. We should eventually test on M1. @jerinphilip I can make you 4 cores of ARM-based Ubuntu in the cloud for free if that helps. https://docs.oracle.com/en-us/iaas/Content/FreeTier/freetier_topic-Always_Free_Resources.htm
@jerinphilip Could you please also implement
@XapaJIaMnu Taking inputs on jerinphilip/MozIntGemm#12.
https://github.com/jerinphilip/MozIntGemm has been taken to finish by @kpu. It abstracts over two libraries, intgemm and ruy, and switches library based on the target platform. There is additional automation providing a source tarball for integration into Mozilla, optimized to minimize dependencies. Save for a few final adjustments that can be carried over into issues, the Mozilla-relevant bits of the task here are completed, and I am therefore closing this issue.
It is possible to make the source transform feed directly into a gecko-dev directory structure if someone could let me know how best to do this. I prefer keeping the source transform, since the debugging mechanisms etc. (which depend on googletest) need to be stripped to minimize hurdles for Mozilla partners. The source transform, automated via GitHub Actions, should be able to continuously move the build system from CMake to Mozilla's mach build system, reducing the manual labour that would otherwise be required.
By construction, due to the reliance on functions visible only inside Firefox via an intrinsic added into WASM, this source cannot be used for anything else like #324, which will therefore be pursued independently. Because this is software, I expect there will be some issues; most of this repository was built walking in the dark, given the absence of certain development elements and tooling. Please file maintenance issues/bugs in the MozIntGemm repository. From #205 the following now remains checked:
[Will be edited as more information is available]
Fitting ruy to C intgemm interface
One clean angle appears to be to obtain int8*int8 -> int32 and convert it to float32 afterwards to match intgemm.
Apparently the marian rough edges which rely on x86 arch calls have already been worked around somehow, so integrating ruy into marian is also an option (connecting where Mozilla wants it).
Playground: https://github.com/jerinphilip/arm-playground
We have come to learn that it's not very easy to get the ARM code into intgemm, because Intel-only code is deeply baked in there and changing that needs a lot of work.
The approach instead is to look at ruy integration in marian-dev, similar to a previous oneDNN inclusion: https://github.com/XapaJIaMnu/marian-dev/blob/wngt2020OneDNN/src/tensors/cpu/intgemm_interface.h#L625
Simpler objective: get a ruy pipeline that produces identical results to intgemm:
f32 -> quantise -> multiply -> unquantise
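The four stages above can be sketched in plain C++ as a reference to compare a ruy-backed path against. This is only an illustrative sketch, not intgemm's or ruy's actual code: the function names (`QuantMult`, `Quantise`, `MultiplyInt8`, `Unquantise`, `QuantisedMatMul`) are hypothetical, both matrices are assumed row-major, and the per-tensor multiplier of 127 / max|x| is an assumption chosen for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: per-tensor multiplier mapping the largest magnitude to 127.
float QuantMult(const std::vector<float>& values) {
  float max_abs = 0.0f;
  for (float v : values) max_abs = std::max(max_abs, std::fabs(v));
  return max_abs > 0.0f ? 127.0f / max_abs : 1.0f;
}

// f32 -> int8: scale, round to nearest, clamp to [-127, 127].
std::vector<std::int8_t> Quantise(const std::vector<float>& input, float quant_mult) {
  std::vector<std::int8_t> output(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    float scaled = std::nearbyint(input[i] * quant_mult);
    scaled = std::max(-127.0f, std::min(127.0f, scaled));
    output[i] = static_cast<std::int8_t>(scaled);
  }
  return output;
}

// int8 * int8 -> int32. A is rows_a x width, B is width x cols_b (row-major).
std::vector<std::int32_t> MultiplyInt8(const std::vector<std::int8_t>& a,
                                       const std::vector<std::int8_t>& b,
                                       std::size_t rows_a, std::size_t width,
                                       std::size_t cols_b) {
  std::vector<std::int32_t> c(rows_a * cols_b, 0);
  for (std::size_t i = 0; i < rows_a; ++i)
    for (std::size_t k = 0; k < width; ++k)
      for (std::size_t j = 0; j < cols_b; ++j)
        c[i * cols_b + j] += static_cast<std::int32_t>(a[i * width + k]) *
                             static_cast<std::int32_t>(b[k * cols_b + j]);
  return c;
}

// int32 -> f32: undo both quantisation multipliers at once.
std::vector<float> Unquantise(const std::vector<std::int32_t>& acc, float unquant_mult) {
  std::vector<float> out(acc.size());
  for (std::size_t i = 0; i < acc.size(); ++i)
    out[i] = static_cast<float>(acc[i]) * unquant_mult;
  return out;
}

// Full pipeline: f32 -> quantise -> multiply -> unquantise.
std::vector<float> QuantisedMatMul(const std::vector<float>& a, const std::vector<float>& b,
                                   std::size_t rows_a, std::size_t width, std::size_t cols_b) {
  const float mult_a = QuantMult(a);
  const float mult_b = QuantMult(b);
  const std::vector<std::int32_t> acc =
      MultiplyInt8(Quantise(a, mult_a), Quantise(b, mult_b), rows_a, width, cols_b);
  return Unquantise(acc, 1.0f / (mult_a * mult_b));
}
```

Calling `QuantisedMatMul(a, b, rows_a, width, cols_b)` on small float matrices and comparing against a plain float multiply gives a quick feel for the quantisation error; comparing two backends for identical results would instead compare their int32 accumulators before unquantising.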
Noticing the following:
- There is probably some row-major/column-major ordering to handle: we only have one matrix that arrives in column-major format. Everything else is row-major.
- intgemm offers two variants, shifted and not shifted. ARM probably has something superior that avoids the hacky workaround (shifted) needed on Intel.
- There are choices based on flags (bias, relu) among:
  a. UnquantiseAndAddBiasAndRelu
  b. UnquantiseAndAddBias
  c. JustUnquantiseRelu
  d. JustUnquantise
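The four flag-dependent choices can be read as one post-processing step applied to the int32 accumulators. The sketch below is a hedged illustration of that reading, not the intgemm callbacks themselves: `Postprocess` and its parameters are hypothetical names, a null `bias` pointer stands in for "no bias", and the bias is assumed to hold one value per output column of a row-major result.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical post-processing over int32 accumulators (row-major, `cols` columns).
//   bias != nullptr, relu == true   -> behaves like UnquantiseAndAddBiasAndRelu
//   bias != nullptr, relu == false  -> behaves like UnquantiseAndAddBias
//   bias == nullptr, relu == true   -> behaves like JustUnquantiseRelu
//   bias == nullptr, relu == false  -> behaves like JustUnquantise
std::vector<float> Postprocess(const std::vector<std::int32_t>& acc, std::size_t cols,
                               float unquant_mult, const float* bias, bool relu) {
  std::vector<float> out(acc.size());
  for (std::size_t i = 0; i < acc.size(); ++i) {
    float value = static_cast<float>(acc[i]) * unquant_mult;  // unquantise
    if (bias != nullptr) value += bias[i % cols];             // assumed: one bias per column
    if (relu) value = std::max(0.0f, value);                  // clamp negatives to zero
    out[i] = value;
  }
  return out;
}
```

Keeping the flags as runtime arguments makes the sketch short; a performance-oriented backend would more likely pick one of the four paths once, outside the inner loop.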
Integration Background
The approach currently undertaken is the fastest route to the Mozilla-dictated interface, as Mozilla is the primary customer right now (wasm_intgemm_interface.h: browsermt/marian-dev, jerinphilip/arm-playground).
The way I understand it, https://github.com/mozilla-extensions/firefox-translations/issues/75#issuecomment-881543045 attempts to bring a relay through WebAssembly to call the native implementation of intgemm (AVX2+ depending on what's available on the hardware), which from meetings I understand to be already integrated into gecko-dev. The WebAssembly VM, or some equivalent layer, intercepts the calls from JS and relays them to these AVX2+ functions for better speed.
There is a fallback path, which allows an intgemm path to compile on WebAssembly, generating the SSSE3 code path, same as the slower WebAssembly implementation before.
I have found the following so far tracked across several issues: