Enable 8-bit integer computations in Attention layer of Marian framework #50
Comments
Working on it, but this is not likely to lead to any increase in speed. Long answer: For the attention layer both
Once this extra GEMM implementation (onnxjs) dependency is removed:
Thanks for explaining regarding the negligible performance gain. One question though on this front.
Could it lead to a reduction in speed?
Yes, it may lead to a reduction in speed; we need to tread carefully / implement a new type of GEMM.
It would be interesting to see how it turns out. Thanks for doing it 👍
Feature description
Currently, computations in the Attention layer are 32-bit (floating point), while the rest of the layers can do integer computations (8-bit and 16-bit). It would be great if the computations in the Attention layer could also happen in 8-bit.
We already have intgemm to do 8-bit integer GEMM operations in other layers, and the same can be used for the Attention layer as well.
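To illustrate what 8-bit integer GEMM involves, here is a minimal, hypothetical sketch (not the intgemm API, which uses vectorized kernels): symmetrically quantize float matrices to int8, multiply with int32 accumulation, then dequantize the result. The `Quantized`, `quantize`, and `gemm8` names are illustrative only.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch of 8-bit quantized GEMM (not intgemm's actual API).
// Symmetric quantization: q = round(x / scale), where scale = max_abs / 127.

struct Quantized {
    std::vector<int8_t> data;
    float scale;  // dequantize with: x ≈ q * scale
};

Quantized quantize(const std::vector<float>& m) {
    float max_abs = 0.f;
    for (float v : m) max_abs = std::max(max_abs, std::fabs(v));
    Quantized q;
    q.scale = max_abs > 0.f ? max_abs / 127.f : 1.f;
    q.data.reserve(m.size());
    for (float v : m)
        q.data.push_back(static_cast<int8_t>(std::lround(v / q.scale)));
    return q;
}

// C(rows x cols) = A(rows x k) * B(k x cols); accumulate in int32,
// then dequantize by the product of the two scales.
std::vector<float> gemm8(const Quantized& a, const Quantized& b,
                         int rows, int k, int cols) {
    std::vector<float> c(rows * cols, 0.f);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j) {
            int32_t acc = 0;
            for (int t = 0; t < k; ++t)
                acc += int32_t(a.data[i * k + t]) * int32_t(b.data[t * cols + j]);
            c[i * cols + j] = static_cast<float>(acc) * a.scale * b.scale;
        }
    return c;
}
```

The per-matrix quantize/dequantize steps are exactly the overhead alluded to above: for the small, activation-dependent matrices of the attention layer they can cost more than the 8-bit multiply saves, which is why a net slowdown is plausible.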
Some advantages of doing it:
cc @andrenatal @kpu @XapaJIaMnu