Hi, since transformer inference is memory bound, the forward time should increase in a step-wise manner as the number of processed tokens grows: within each step the cost is dominated by streaming the weights, so adding tokens is nearly free until the kernel becomes compute bound.
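For scale, here is a rough lower bound on a memory-bound forward; the ~0.7 GB of 4-bit weights and the base M1's ~68 GB/s memory bandwidth are approximate assumptions:

```python
# Back-of-envelope lower bound for a memory-bound forward pass: every forward
# has to stream the full weights once, regardless of how many tokens are in
# the batch, so the time floor is weight_bytes / bandwidth.
# Assumed figures: ~0.7 GB of 4-bit weights (incl. scales) and ~68 GB/s
# unified memory bandwidth on the base M1.
weight_bytes = 0.7e9
bandwidth = 68e9  # bytes/s
print(f"~{weight_bytes / bandwidth * 1e3:.1f} ms per forward while memory bound")
```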
In the following image I show the median forward times with 4-bit quantized weights for several sequence lengths. Notably, there is a 4.3x increase when going from sequence length 1 to 8, while the times between 8 and 32 tokens lie on the same step. This suggests that the QMM kernels for small sequence lengths are likely under-optimized.
As a comparison, here are the results for FP16 weights, where the forward times for lengths between 1 and 64 fall within the same step.
I'm pretty uncertain about this, but based on the last image, shouldn't we expect the QMM times for lengths between 1 and 64/4 = 16 to be in the same step, rather than several times higher? Since 4-bit weights move 4x fewer bytes per matmul than FP16, the memory-bound step should end at roughly a quarter of the FP16 sequence length, as sketched below.
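A sketch of that reasoning in roofline terms (ignoring activations, the KV cache, and attention FLOPs, which are small for a 1B model at these lengths):

```python
# A weight matmul over S tokens performs ~2*S flops per weight and reads
# ~bytes_per_weight bytes per weight, so the arithmetic intensity is
# ~2*S / bytes_per_weight flops/byte. Whatever intensity marks the end of the
# memory-bound step for FP16 at S = 64 is already reached by 4-bit weights
# at S = 16.
def intensity(seq_len, bytes_per_weight):
    return 2 * seq_len / bytes_per_weight

print(intensity(64, 2.0))   # FP16 at S=64  -> 64 flops/byte
print(intensity(16, 0.5))   # 4-bit at S=16 -> 64 flops/byte (same roofline point)
```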
Experiments were done on an M1 MacBook Air (8 GB), with mlx 0.20.0 and 0.19.3, macOS 15.2 beta, and Llama 3.2-1B-Instruct (4-bit/bf16), using the following code:
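A minimal sketch of such a benchmark (the mlx-community repo id and the dummy prompt of repeated eos tokens are assumptions; the actual script may differ):

```python
# Hypothetical benchmark sketch, not necessarily the exact script used above:
# times a single forward pass over a dummy prompt for several sequence lengths
# and reports the median.
import time
import mlx.core as mx
from mlx_lm import load

model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")  # assumed repo id

def median_forward_ms(seq_len, reps=20):
    # Dummy prompt of the requested length (repeated eos token).
    tokens = mx.array([[tokenizer.eos_token_id] * seq_len])
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        logits = model(tokens)
        mx.eval(logits)  # force the lazy graph to actually run
        times.append((time.perf_counter() - start) * 1e3)
    times.sort()
    return times[len(times) // 2]

for seq_len in (1, 2, 4, 8, 16, 32, 64):
    print(f"L={seq_len:3d}: {median_forward_ms(seq_len):.2f} ms")
```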