
Enable 8-bit integer computations in Attention layer of Marian framework #50

Open
Tracked by #15
abhi-agg opened this issue Aug 17, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@abhi-agg

Feature description

Currently, computations in the Attention layer are 32-bit floating point, while the rest of the layers can do integer computations (8-bit and 16-bit). It would be great if the computations in the Attention layer could also happen in 8-bit.

We already have intgemm for 8-bit integer GEMM operations in the other layers, and the same could be used for the Attention layer as well.
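For context, 8-bit integer GEMM of the kind intgemm provides boils down to: quantise the operands to int8, multiply with int32 accumulation, and rescale the result back to float. A minimal numpy sketch of that idea (illustration only; the helper name and shapes are made up here, and this is not intgemm's actual API):

```python
import numpy as np

def quantize_i8(x):
    """Symmetric per-tensor quantisation to int8 (hypothetical helper)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8)).astype(np.float32)   # activations
B = rng.standard_normal((8, 3)).astype(np.float32)   # parameters

qA, sA = quantize_i8(A)
qB, sB = quantize_i8(B)

# int8 x int8 products accumulated in int32, then dequantised to float.
C_q = qA.astype(np.int32) @ qB.astype(np.int32) * (sA * sB)
C_f = A @ B  # 32-bit float reference

# The quantised result tracks the float result up to quantisation error.
assert np.allclose(C_q, C_f, atol=0.25)
```

The key property is that the inner multiply-accumulate loop runs entirely on 8-bit inputs with 32-bit accumulators, which is where the speedup over sgemm comes from.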

Some advantages of doing it:

  1. Faster inference
  2. Removal of the sgemm library dependency for consumers who only want 8-bit integer GEMM

cc @andrenatal @kpu @XapaJIaMnu

@abhi-agg abhi-agg added the enhancement New feature or request label Aug 17, 2021
@XapaJIaMnu
Collaborator

Working on it, but this is not likely to lead to any increase in speed.

Long answer:
All other operations involve one parameter matrix (B) that never changes and one activation matrix (A) that changes. Therefore only one quantisation operation, on the activations, is required before the multiply.

For the attention layer, both A and B are generated on the fly, which means that every time before multiplication we need to quantise both of them, making everything much more expensive. For this reason we haven't prioritised making everything 8-bit, but we do recognise the need/convenience of removing an extra GEMM implementation from the dependencies.

@abhi-agg
Author

abhi-agg commented Aug 18, 2021

Once this extra GEMM implementation (onnxjs) dependency is removed:

  1. It will reduce the overall bergamot wasm binary size, which will lead to lower compile and load times for the wasm binary in the browser.
  2. We would be able to compile the entire bergamot code base with -msse4.1 for wasm, as onnxjs doesn't compile with this option right now.

Thanks for explaining why the performance gain would be negligible. One question on this front, though:

that every time before multiplication we need to quantise both of them, making everything much more expensive.

Could it lead to a reduction in speed?

@XapaJIaMnu
Collaborator

Yes, it may lead to a reduction in speed; we need to tread carefully and possibly implement a new type of GEMM.

@abhi-agg
Author

It would be interesting to see how it turns out. Thanks for doing it 👍
