Marlin fp8 #241
Conversation
optimum/quanto/nn/qmodule.py
Outdated
if self.weight_qtype == qfloat8_e4m3fn and self.activation_qtype is None:
    # Marlin FP8 kernel only supports per-tensor fp8 quantization.
    axis = None
else:
    axis = 0
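For context, a minimal sketch (illustrative only, not quanto's actual quantization code) of what the `axis` choice amounts to when computing absmax scales for a weight: `axis=None` yields a single per-tensor scale, while `axis=0` yields one scale per output feature.

```python
import torch

def absmax_scale(weight: torch.Tensor, axis=None) -> torch.Tensor:
    # Illustrative absmax scale computation for float8_e4m3fn quantization.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    if axis is None:
        amax = weight.abs().max()                      # per-tensor: a single scalar scale
    else:
        amax = weight.abs().amax(dim=1, keepdim=True)  # one scale per output feature (row)
    return amax / fp8_max

w = torch.randn(4096, 11008, dtype=torch.float16)
print(absmax_scale(w, axis=None).shape)  # torch.Size([])      -> what the Marlin FP8 path expects
print(absmax_scale(w, axis=0).shape)     # torch.Size([4096, 1]) -> per-channel scales
```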
Having this control flow here is quite ugly.
I don't understand why we have this restriction while the kernel checks for a vector scale.
Here is the kernel code:
TORCH_CHECK(b_rank == 2, "b_scales rank = ", b_rank, " is not 2");
TORCH_CHECK(b_scales.size(1) == size_n, "b_scales dim 1 = ", b_scales.size(1),
            " is not size_n = ", size_n);
// Channelwise only for FP8
TORCH_CHECK(b_scales.size(0) == 1);
num_groups = b_scales.size(0);
My message was wrong (I meant vector -> I edited the comment).
So clearly NOT a scalar (this is why in the unit test you had to repeat your scalar scale). I suspect the reason it only works with a scalar scale (the same value for all output features) is some kind of interleaving that is missing.
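In other words, a small sketch of what the checks above imply (assuming `size_n` is the number of output features): the kernel wants a rank-2 `b_scales` of shape `(1, size_n)`, so a per-tensor scale has to be repeated across output features before the call, which is what the unit test does.

```python
import torch

# Assumed setup: size_n = number of output features of the quantized Linear.
size_n = 4096
per_tensor_scale = torch.tensor(0.01, dtype=torch.float16)

# The TORCH_CHECKs above require a rank-2 b_scales with shape (1, size_n),
# so the scalar scale is repeated along the output-feature dimension.
b_scales = per_tensor_scale.reshape(1, 1).expand(1, size_n).contiguous()
assert b_scales.shape == (1, size_n)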
Compared to calling the kernel in isolation, my end-to-end benchmark is slower when going through quanto, due to many overheads. A separate PR may be needed to remove some of them first (or after this one, if we don't care about perf for now). I'll try with a larger model / a different GPU to see whether the Python latency is hidden or not. @dacorvo Am I doing something wrong here?
Can you explain how you deduce there is an overhead here, as compared to the standard call from inside quanto?
For sure. Last week I was using the quanto benchmark script with

    python evaluate_model.py --device cuda --metric decode-latency --quantizer quanto --weights float8_e4m3fn --activations none --dtype fp16 --batch_size 1
    python evaluate_model.py --device cuda --metric decode-latency --quantizer quanto --weights none --activations none --dtype fp16 --batch_size 1

and did not get speedups, but I am unsure why. From the profile, I assume it is some overhead from quanto's dispatch, but it could be something else. The following script gives the profile below: https://gist.github.com/fxmarty/1aff830cdd57aa650412f34bd4076b3b, comparing a linear call through a quanto module and through a direct call.

Benchmarking the same (https://gist.github.com/fxmarty/e449c55e4a1dbf9b1657f395aa542eb4) for a single linear, there indeed seems to be a 10-25% overhead (in my problem setup) from `__torch_function__`, multiple torch.library dispatches, etc. (first and second columns). It is still faster than native fp16 though, so we will have to see.
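For reference, a minimal, self-contained timing harness of the kind used for such a comparison (this is not the gist linked above): the two callables being compared would be the quanto `QLinear` forward on one side and a direct/plain call on the other.

```python
import torch

def cuda_time_ms(fn, *args, warmup=10, iters=100):
    """Mean CUDA latency of fn(*args) in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Example baseline: a plain fp16 linear (stand-in for the "direct call" column).
w16 = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
print(cuda_time_ms(lambda t: torch.nn.functional.linear(t, w16), x))
# The quanto column would time the corresponding QLinear module's forward instead.
```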
@dacorvo Running it, I get:

To me this is a substantial enough difference to care.

Edit: on A100, for Llama 3 8B, end-to-end (prefill + decode), I do get with mixed fp16+fp8:

^ this is not faster than fp16-fp16 in any case (getting 32.681 ms), which is weird.
That is indeed a substantial difference. It would be great if you could:
|
optimum/quanto/tensor/qbytes.py
Outdated
activation_qtype (`qtype`, defaults to `None`):
    The qtype used for the activations. If one needs to use a different tensor subclass e.g. for weights depending on the activations qtype, this argument must be specified accordingly when calling `QBytesTensor.create`.
tensor_type (`Optional[str]`, defaults to `None`):
    Specifies whether the tensor is to be considered as a `"weight"` or `"activation"`, which may influence the tensor subclass to be used.
At this point we need to know whether we are quantizing a weight or activation, and what is the qtype of the activation so as to pick the correct tensor subclass.
The only reason I see for doing this would be to avoid creating an fp8 packed tensor when the activations might be float8, so as to be able to use scaled_mm later on. For other use cases you can always dequantize the input to be able to call the kernel.

I need to think about this, because at this stage it means the factory method is only ever used when creating quantized weights for Linear layers. This means there might actually be a subclass involved here (like a QLinearBytesTensor). To be honest, this was already the case for AWQBitsTensor and TinyGemmBitsTensor.
Exactly: when calling `QBytesTensor.create` for a float8 activation, or for a float8 weight when the activations are float8 as well.
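A runnable sketch of the dispatch this implies, with a hypothetical helper and subclass name (this is not quanto's actual `QBytesTensor.create`): the Marlin-packed subclass would only be chosen for a per-tensor-scaled fp8 weight with non-quantized activations; otherwise a plain `QBytesTensor` is kept so that e.g. `torch._scaled_mm` remains usable.

```python
from typing import Optional

def pick_weight_subclass(weight_qtype: str, activation_qtype: Optional[str], axis: Optional[int]) -> str:
    """Hypothetical helper illustrating the dispatch discussed above (not quanto's factory)."""
    if weight_qtype == "qfloat8_e4m3fn" and activation_qtype is None and axis is None:
        # fp8 weight, non-quantized activations, per-tensor scale: pack for the Marlin FP8 kernel.
        return "MarlinF8QBytesTensor"  # hypothetical subclass name
    # fp8 activations (or any other combination): keep a plain, unpacked QBytesTensor so that
    # e.g. torch._scaled_mm can still consume the fp8 weight later on.
    return "QBytesTensor"

print(pick_weight_subclass("qfloat8_e4m3fn", None, None))              # -> MarlinF8QBytesTensor
print(pick_weight_subclass("qfloat8_e4m3fn", "qfloat8_e4m3fn", None))  # -> QBytesTensor
```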
Thank you for this pull-request, which challenges the current design more than I would have expected. I need to think a bit about how to address the valid concerns you raised here.
If we put the organization of the code aside, I also think the restriction to per-tensor scale makes the kernel unusable, at least for quanto.
I think this can easily be changed in a later PR. Neither vllm nor tgi supports per-column scales, and yet they achieve nice memory reductions & speedups while claiming no quality loss.
Force-pushed from a29517d to f5222fa
In line with my tests in TGI, we get a decent speedup with this kernel only when using CUDA graphs; I can't really explain why. Using transformers + A100 + an 8B model, measuring decode latency at batch size = 1:
The perplexity benchmark is very decent as well.
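For reference, a minimal, self-contained sketch of CUDA graph capture for a fixed-shape decode step (the numbers above come from the transformers/TGI integrations, not from this snippet; `decode_step` here is just a stand-in fp16 linear). Capture removes the per-step Python and dispatch overhead discussed earlier, which is where the kernel-level gains become visible.

```python
import torch

# Stand-in decode step with static shapes: a single fp16 linear.
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
def decode_step(x):
    return torch.nn.functional.linear(x, w)

static_input = torch.randn(1, 4096, dtype=torch.float16, device="cuda")

# Warm up on a side stream before capture, as required by the CUDA graphs API.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = decode_step(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = decode_step(static_input)

# Per-step execution: copy fresh data into the static input, then replay.
static_input.copy_(torch.randn_like(static_input))
graph.replay()  # no Python dispatch inside the replay, only the captured kernels
```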
This PR is stale because it has been open 15 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Obsoleted by #296
Fixes #238