Hi, since transformer inference is memory bound, the forward time should increase in a step-wise manner as the number of processed tokens grows: within each step the cost is dominated by streaming the weights, so adding tokens is nearly free until the kernel becomes compute bound.
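For scale, here is a rough lower bound on a memory-bound forward; the ~0.7 GB of 4-bit weights and the base M1's ~68 GB/s memory bandwidth are approximate assumptions:

```python
# Back-of-envelope lower bound for a memory-bound forward pass: every forward
# has to stream the full weights once, regardless of how many tokens are in
# the batch, so the time floor is weight_bytes / bandwidth.
# Assumed figures: ~0.7 GB of 4-bit weights (incl. scales) and ~68 GB/s
# unified memory bandwidth on the base M1.
weight_bytes = 0.7e9
bandwidth = 68e9  # bytes/s
print(f"~{weight_bytes / bandwidth * 1e3:.1f} ms per forward while memory bound")
```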
In the following image I show the median forward times with 4-bit quantized weights for several sequence lengths. Notably, there is a 4.3x increase when going from sequence length 1 to 8, while the times between 8 and 32 tokens lie on the same step. This suggests that the QMM kernels for small sequence lengths are likely under-optimized.
As a comparison, here are the results for FP16 weights, where the forward times for lengths between 1 and 64 fall within the same step.
I'm pretty uncertain about this, but based on the last image, shouldn't we expect the QMM times for lengths between 1 and 64/4 = 16 to be in the same step, rather than several times higher? Since 4-bit weights move 4x fewer bytes per matmul than FP16, the memory-bound step should end at roughly a quarter of the FP16 sequence length, as sketched below.
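A sketch of that reasoning in roofline terms (ignoring activations, the KV cache, and attention FLOPs, which are small for a 1B model at these lengths):

```python
# A weight matmul over S tokens performs ~2*S flops per weight and reads
# ~bytes_per_weight bytes per weight, so the arithmetic intensity is
# ~2*S / bytes_per_weight flops/byte. Whatever intensity marks the end of the
# memory-bound step for FP16 at S = 64 is already reached by 4-bit weights
# at S = 16.
def intensity(seq_len, bytes_per_weight):
    return 2 * seq_len / bytes_per_weight

print(intensity(64, 2.0))   # FP16 at S=64  -> 64 flops/byte
print(intensity(16, 0.5))   # 4-bit at S=16 -> 64 flops/byte (same roofline point)
```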
Experiments were done on an M1 MacBook Air (8 GB), with mlx 0.20.0 and 0.19.3, macOS 15.2 beta, and Llama 3.2-1B-Instruct (4-bit/bf16), using the following code:
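A minimal sketch of such a benchmark (the mlx-community repo id and the dummy prompt of repeated eos tokens are assumptions; the actual script may differ):

```python
# Hypothetical benchmark sketch, not necessarily the exact script used above:
# times a single forward pass over a dummy prompt for several sequence lengths
# and reports the median.
import time
import mlx.core as mx
from mlx_lm import load

model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")  # assumed repo id

def median_forward_ms(seq_len, reps=20):
    # Dummy prompt of the requested length (repeated eos token).
    tokens = mx.array([[tokenizer.eos_token_id] * seq_len])
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        logits = model(tokens)
        mx.eval(logits)  # force the lazy graph to actually run
        times.append((time.perf_counter() - start) * 1e3)
    times.sort()
    return times[len(times) // 2]

for seq_len in (1, 2, 4, 8, 16, 32, 64):
    print(f"L={seq_len:3d}: {median_forward_ms(seq_len):.2f} ms")
```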