Support for gpt2 quantization #52

Open
kobzaond opened this issue Feb 15, 2022 · 6 comments

@kobzaond

I tried to quantize (add QDQ layers) the gpt2 model:

import torch

# QATCalibrate lives in transformer-deploy's QDQModels package
from transformer_deploy.QDQModels.calibration_utils import QATCalibrate

batch_size = 8
with QATCalibrate(method="histogram", percentile=99.999) as qat:
    model_q = self.model.cuda()
    qat.setup_model_qat(model_q)  # prepare quantizer for any model

    with torch.no_grad():
        # run calibration data through the model so the quantizers
        # can record activation ranges
        for start_index in range(0, 650, batch_size):
            end_index = start_index + batch_size
            data = self.data[start_index:end_index]
            data = self.tokenizer(
                data, return_tensors="pt", padding=True, truncation=True, max_length=512
            )
            # return_tensors="pt" already yields tensors, so just move them
            input_torch = {
                k: v.to(dtype=torch.long, device="cuda")
                for k, v in data.items()
                if k in ["input_ids", "attention_mask", "token_type_ids"]
            }
            model_q(**input_torch)

but no QDQ layers were inserted - I assume that you don't support GPT2 yet. Do you plan to add it?
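A quick sanity check for whether any quantizers were inserted at all (a minimal sketch, assuming QATCalibrate builds on pytorch-quantization's TensorQuantizer, as the module dump later in this thread suggests):

```python
from pytorch_quantization.nn import TensorQuantizer

# count the TensorQuantizer modules that setup_model_qat should have inserted;
# zero means the GPT2 layers were not patched at all
n_quantizers = sum(1 for m in model_q.modules() if isinstance(m, TensorQuantizer))
print(f"{n_quantizers} TensorQuantizer modules found")
```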

@pommedeterresautee
Member

Indeed we have not yet done it, but it should be fairly simple.

You can call patch_model (https://github.com/ELS-RD/transformer-deploy/blob/main/src/transformer_deploy/QDQModels/patch.py#L44); for an example of a simple patch module, see https://github.com/ELS-RD/transformer-deploy/blob/main/src/transformer_deploy/QDQModels/QDQAlbert.py

Let me know if it's clear to you.
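For readers following along, a rough sketch of the idea behind such a patch: route the inputs of the attention matmuls through fake-quantization nodes so that QDQ pairs appear in the exported graph. The class and method names below are illustrative, not transformer-deploy's actual API; only TensorQuantizer and QuantDescriptor come from pytorch-quantization.

```python
import torch
from pytorch_quantization.nn import TensorQuantizer
from pytorch_quantization.tensor_quant import QuantDescriptor


class QDQGPT2AttentionSketch(torch.nn.Module):
    """Illustrative only: fake-quantize the inputs of both attention matmuls."""

    def __init__(self):
        super().__init__()
        desc = QuantDescriptor(num_bits=8, calib_method="histogram")
        self.q_quantizer = TensorQuantizer(desc)
        self.k_quantizer = TensorQuantizer(desc)
        self.v_quantizer = TensorQuantizer(desc)
        self.a_quantizer = TensorQuantizer(desc)  # attention probabilities

    def attention(self, q, k, v):
        # QK^T with fake-quantized operands
        scores = torch.matmul(
            self.q_quantizer(q), self.k_quantizer(k.transpose(-1, -2))
        )
        probs = torch.softmax(scores / q.size(-1) ** 0.5, dim=-1)
        # softmax(QK^T) V with fake-quantized operands
        return torch.matmul(self.a_quantizer(probs), self.v_quantizer(v))
```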

@kobzaond
Author

Thank you for your response. I tried to make a QDQGPT2.py following the same pattern as QDQBert.py and QDQElectra.py, and added the new patch module to the list in https://github.com/ELS-RD/transformer-deploy/blob/main/src/transformer_deploy/QDQModels/patch.py#L44.

I was not able to fully understand how the quantization works, though - I got that you insert the QDQ layers, but I got lost in the code. Anyway, afterward I tried to quantize the GPT2 model, which worked, except that certain layers have an amax value of 'nan', e.g.:
```
(11): GPT2Block(
  (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attn): GPT2Attention(
    (c_attn): Conv1D()
    (c_proj): Conv1D()
    (attn_dropout): Dropout(p=0.1, inplace=False)
    (resid_dropout): Dropout(p=0.1, inplace=False)
    (matmul_quantizer_0): TensorQuantizer(8bit fake per-tensor amax=5.6953 calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_1): TensorQuantizer(8bit fake per-tensor amax=5.9871 calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_2): TensorQuantizer(8bit fake per-tensor amax=0.9995 calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_3): TensorQuantizer(8bit fake per-tensor amax=13.3477 calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_4): TensorQuantizer(8bit fake per-tensor amax=nan calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_5): TensorQuantizer(8bit fake per-tensor amax=nan calibrator=HistogramCalibrator scale=1.0 quant)
```
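If the nan amax values block the export or poison the outputs, one option is to disable the affected quantizers before converting (a sketch; TensorQuantizer exposes an amax property and a disable() method in pytorch-quantization):

```python
import torch
from pytorch_quantization.nn import TensorQuantizer


def disable_nan_quantizers(model: torch.nn.Module) -> None:
    # a quantizer whose calibration never saw finite values ends up with
    # amax=nan; disabling it makes it pass tensors through unquantized
    for name, module in model.named_modules():
        if isinstance(module, TensorQuantizer) and module.amax is not None:
            if torch.isnan(torch.as_tensor(module.amax)).any():
                print(f"disabling {name} (amax=nan)")
                module.disable()
```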

Then I tried to convert the model to ONNX and TensorRT; both worked. However, in TensorRT the speed is slower than with fp32 precision. Do you have any idea why it is so slow?

@pommedeterresautee
Member

pommedeterresautee commented Feb 18, 2022

Have you built the engine with int8 support?

@kobzaond
Author

kobzaond commented Feb 18, 2022

Yes, I've set both the fp16 and int8 flags:

config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)

Basically I used code analogous to your quantization demo; only the model changed. I can share some of my measurements (in seconds; each value is the average over 20 runs, always on the same sample):


| configuration | latency (s) |
| --- | --- |
| tensorrt fp16, batch 1 | 0.0052 |
| tensorrt fp16, batch 8 | 0.058 |
| tensorrt int8, batch 1 | 0.016 |
| tensorrt int8, batch 8 | 0.124 |
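For reference, those two flags sit inside a standard TensorRT builder config, along these lines (a sketch using TensorRT's Python API; the ONNX file name is illustrative and the optimization profile needed for dynamic input shapes is omitted):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("gpt2_qdq.onnx", "rb") as f:  # illustrative file name
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # honor the QDQ nodes as int8
config.set_flag(trt.BuilderFlag.FP16)  # fp16 fallback for non-quantized layers
serialized_engine = builder.build_serialized_network(network, config)
```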


@pommedeterresautee
Member

Have you checked that your local TensorRT version is the same as the one in the Docker image you use?
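A quick way to compare, assuming TensorRT's Python bindings are installed in each environment:

```python
import tensorrt

print(tensorrt.__version__)  # run in each environment and compare
```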

@kobzaond
Author

kobzaond commented Mar 9, 2022

I am not using any Docker image; everything is installed in a Python virtual environment, a conda environment, or locally, so there shouldn't be any version mismatch.
