
Enable 8-bit integer computations in Attention layer of Marian framework #50

Open
Tracked by #15
abhi-agg opened this issue Aug 17, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@abhi-agg

Feature description

Currently, computations in the Attention layer are 32-bit floating point, while the rest of the layers can do integer computations (8-bit and 16-bit). It would be great if the computations in the Attention layer could also happen in 8-bit.

We already have intgemm for 8-bit integer GEMM operations in the other layers, and the same could be used for the Attention layer as well.
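For context, 8-bit integer GEMM of the kind intgemm provides boils down to: quantise the operands to int8, multiply with int32 accumulation, and rescale the result back to float. A minimal numpy sketch of that idea (illustration only; the helper name and shapes are made up here, and this is not intgemm's actual API):

```python
import numpy as np

def quantize_i8(x):
    """Symmetric per-tensor quantisation to int8 (hypothetical helper)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8)).astype(np.float32)   # activations
B = rng.standard_normal((8, 3)).astype(np.float32)   # parameters

qA, sA = quantize_i8(A)
qB, sB = quantize_i8(B)

# int8 x int8 products accumulated in int32, then dequantised to float.
C_q = qA.astype(np.int32) @ qB.astype(np.int32) * (sA * sB)
C_f = A @ B  # 32-bit float reference

# The quantised result tracks the float result up to quantisation error.
assert np.allclose(C_q, C_f, atol=0.25)
```

The key property is that the inner multiply-accumulate loop runs entirely on 8-bit inputs with 32-bit accumulators, which is where the speedup over sgemm comes from.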

Some advantages of doing it:

  1. Faster inference
  2. Removal of the sgemm library dependency for consumers who only want 8-bit integer GEMM

cc @andrenatal @kpu @XapaJIaMnu

@abhi-agg abhi-agg added the enhancement New feature or request label Aug 17, 2021
@XapaJIaMnu
Collaborator

Working on it, but this is not likely to lead to any increase in speed.

Long answer:
All other operations involve one parameter matrix (B) that never changes and one activation matrix (A) that changes. Therefore only one quantisation operation, on the activations, is required before the multiply.

For the attention layer, both A and B are generated on the fly, which means that every time before multiplication we need to quantise both of them, making everything much more expensive. For this reason we haven't prioritised making everything 8-bit, but we do recognise the need/convenience of removing an extra GEMM implementation from the dependencies.

@abhi-agg
Author

abhi-agg commented Aug 18, 2021

Once this extra GEMM implementation (onnxjs) dependency is removed:

  1. It will reduce the overall bergamot wasm binary size, which will lead to lower compile and load times for the wasm binary in the browser.
  2. We would be able to compile the entire bergamot code base with -msse4.1 for wasm, as onnxjs doesn't compile with this option right now.

Thanks for explaining why the performance gain would be negligible. One question on this front, though:

that every time before multiplication we need to quantise both of them, making everything much more expensive.

Could it lead to a reduction in speed?

@XapaJIaMnu
Collaborator

Yes, it may lead to a reduction in speed; we need to tread carefully and possibly implement a new type of GEMM.

@abhi-agg
Author

It would be interesting to see how it turns out. Thanks for doing it 👍
