Refactor generation benchmark to compare with AWQ and HQQ #128

dacorvo · 2024-03-21T14:21:44Z

The same generation metrics can now be evaluated on models quantized using quanto, BitsAndBytes, hqq and AutoAWQ.
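The PR description does not define the three metrics, so the following is only a minimal sketch of one common reading of them: perplexity as the exponential of the mean per-token negative log-likelihood, prediction as the fraction of correctly predicted tokens, and latency as the average wall-clock time of one generation step. The function names and signatures are hypothetical and are not taken from the benchmark code.

```python
import math
import time
from typing import Callable, Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Exponential of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def prediction_accuracy(predicted: Sequence[int], expected: Sequence[int]) -> float:
    """Fraction of positions where the predicted token id equals the expected one."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


def mean_latency_ms(step: Callable[[], object], iterations: int = 10) -> float:
    """Average wall-clock duration of one call to `step`, in milliseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        step()
    return (time.perf_counter() - start) * 1000 / iterations
```

When timing on a GPU, the timed callable would also need to wait for the device to finish its work, otherwise only the kernel launch time is measured.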

Example on princeton-nlp/Sheared-LLaMA-1.3B with int4 weights and float16 activations on an Nvidia A10:

library	perplexity	prediction	latency
fp16	8.85	0.83 %	24 ms
quanto	9.99	0.81 %	69 ms
bnb	9.35	0.82 %	35 ms
hqq	9.05	0.83 %	96 ms
AutoAWQ	9.02	0.19 %	7 ms
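To read the table relative to the fp16 baseline, the perplexity and latency columns can be normalized. The short sketch below only re-uses the figures reported above; the helper name `relative_to_baseline` is hypothetical.

```python
# Rows copied from the table above: (perplexity, latency in ms).
RESULTS = {
    "fp16": (8.85, 24),
    "quanto": (9.99, 69),
    "bnb": (9.35, 35),
    "hqq": (9.05, 96),
    "AutoAWQ": (9.02, 7),
}


def relative_to_baseline(results, baseline="fp16"):
    """Return perplexity increase (in percent) and latency ratio versus the baseline."""
    base_ppl, base_latency = results[baseline]
    return {
        name: {
            "perplexity_increase_pct": 100 * (ppl - base_ppl) / base_ppl,
            "latency_ratio": latency / base_latency,
        }
        for name, (ppl, latency) in results.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, stats in relative_to_baseline(RESULTS).items():
        print(
            f"{name:8s} perplexity +{stats['perplexity_increase_pct']:.1f} %  "
            f"latency x{stats['latency_ratio']:.2f}"
        )
```

With these figures, hqq takes 4 times the fp16 latency, while AutoAWQ takes about 0.29 times the fp16 latency.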

This makes it easier to compare bnb, awq, gptq and hqq.

dacorvo added 3 commits March 21, 2024 14:07

refactor(bench): support different quantizers

4148260

This makes it easier to compare bnb, awq, gptq and hqq.

feat(bench): add HQQ setup

5b17986

feat(bench): add AWQ setup

78ba3bc

dacorvo merged commit 96871c1 into main Mar 21, 2024
1 check passed

dacorvo deleted the benchmark_libs branch March 21, 2024 14:37

dacorvo mentioned this pull request Mar 24, 2024

Performance of quanto quants vs bnb, AWQ, GPTQ, GGML ? #129

Closed

lifelongeeek mentioned this pull request Sep 20, 2024

Does AWQ is officially supported now? #313

Closed
