Support for gpt2 quantization #52

Open
kobzaond opened this issue Feb 15, 2022 · 6 comments

@kobzaond

I tried to quantize (add QDQ layers) the gpt2 model:

import torch

# QATCalibrate lives in transformer-deploy's QDQModels package
from transformer_deploy.QDQModels.calibration_utils import QATCalibrate

batch_size = 8
with QATCalibrate(method="histogram", percentile=99.999) as qat:
    model_q = self.model.cuda()
    qat.setup_model_qat(model_q)  # prepare quantizer for any model

    with torch.no_grad():
        # run calibration data through the model so the quantizers
        # can record activation ranges
        for start_index in range(0, 650, batch_size):
            end_index = start_index + batch_size
            data = self.data[start_index:end_index]
            data = self.tokenizer(
                data, return_tensors="pt", padding=True, truncation=True, max_length=512
            )
            # return_tensors="pt" already yields tensors, so just move them
            input_torch = {
                k: v.to(dtype=torch.long, device="cuda")
                for k, v in data.items()
                if k in ["input_ids", "attention_mask", "token_type_ids"]
            }
            model_q(**input_torch)

but no QDQ layers were inserted - I assume that you don't support GPT2 yet. Do you plan to add it?
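A quick sanity check for whether any quantizers were inserted at all (a minimal sketch, assuming QATCalibrate builds on pytorch-quantization's TensorQuantizer, as the module dump later in this thread suggests):

```python
from pytorch_quantization.nn import TensorQuantizer

# count the TensorQuantizer modules that setup_model_qat should have inserted;
# zero means the GPT2 layers were not patched at all
n_quantizers = sum(1 for m in model_q.modules() if isinstance(m, TensorQuantizer))
print(f"{n_quantizers} TensorQuantizer modules found")
```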

@pommedeterresautee
Member

Indeed we have not yet done it, but it should be fairly simple.

You can call patch_model (https://github.com/ELS-RD/transformer-deploy/blob/main/src/transformer_deploy/QDQModels/patch.py#L44); for an example of a simple patch module, see https://github.com/ELS-RD/transformer-deploy/blob/main/src/transformer_deploy/QDQModels/QDQAlbert.py

Let me know if it's clear to you.
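For readers following along, a rough sketch of the idea behind such a patch: route the inputs of the attention matmuls through fake-quantization nodes so that QDQ pairs appear in the exported graph. The class and method names below are illustrative, not transformer-deploy's actual API; only TensorQuantizer and QuantDescriptor come from pytorch-quantization.

```python
import torch
from pytorch_quantization.nn import TensorQuantizer
from pytorch_quantization.tensor_quant import QuantDescriptor


class QDQGPT2AttentionSketch(torch.nn.Module):
    """Illustrative only: fake-quantize the inputs of both attention matmuls."""

    def __init__(self):
        super().__init__()
        desc = QuantDescriptor(num_bits=8, calib_method="histogram")
        self.q_quantizer = TensorQuantizer(desc)
        self.k_quantizer = TensorQuantizer(desc)
        self.v_quantizer = TensorQuantizer(desc)
        self.a_quantizer = TensorQuantizer(desc)  # attention probabilities

    def attention(self, q, k, v):
        # QK^T with fake-quantized operands
        scores = torch.matmul(
            self.q_quantizer(q), self.k_quantizer(k.transpose(-1, -2))
        )
        probs = torch.softmax(scores / q.size(-1) ** 0.5, dim=-1)
        # softmax(QK^T) V with fake-quantized operands
        return torch.matmul(self.a_quantizer(probs), self.v_quantizer(v))
```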

@kobzaond
Author

Thank you for your response. I tried to make a QDQGPT2.py following the same pattern as QDQBert.py and QDQElectra.py, and added the new patch module to the list in https://github.com/ELS-RD/transformer-deploy/blob/main/src/transformer_deploy/QDQModels/patch.py#L44.

I was not able to fully understand how the quantization works, though - I got that you insert the QDQ layers, but I got lost in the code. Anyway, afterward I tried to quantize the GPT2 model, which worked, except that certain layers have an amax value of 'nan', e.g.:
```
(11): GPT2Block(
  (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attn): GPT2Attention(
    (c_attn): Conv1D()
    (c_proj): Conv1D()
    (attn_dropout): Dropout(p=0.1, inplace=False)
    (resid_dropout): Dropout(p=0.1, inplace=False)
    (matmul_quantizer_0): TensorQuantizer(8bit fake per-tensor amax=5.6953 calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_1): TensorQuantizer(8bit fake per-tensor amax=5.9871 calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_2): TensorQuantizer(8bit fake per-tensor amax=0.9995 calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_3): TensorQuantizer(8bit fake per-tensor amax=13.3477 calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_4): TensorQuantizer(8bit fake per-tensor amax=nan calibrator=HistogramCalibrator scale=1.0 quant)
    (matmul_quantizer_5): TensorQuantizer(8bit fake per-tensor amax=nan calibrator=HistogramCalibrator scale=1.0 quant)
```
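If the nan amax values block the export or poison the outputs, one option is to disable the affected quantizers before converting (a sketch; TensorQuantizer exposes an amax property and a disable() method in pytorch-quantization):

```python
import torch
from pytorch_quantization.nn import TensorQuantizer


def disable_nan_quantizers(model: torch.nn.Module) -> None:
    # a quantizer whose calibration never saw finite values ends up with
    # amax=nan; disabling it makes it pass tensors through unquantized
    for name, module in model.named_modules():
        if isinstance(module, TensorQuantizer) and module.amax is not None:
            if torch.isnan(torch.as_tensor(module.amax)).any():
                print(f"disabling {name} (amax=nan)")
                module.disable()
```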

Then I tried to convert the model to ONNX and TensorRT; both worked. However, in TensorRT the speed is slower than with fp32 precision. Do you have any idea why it is so slow?

@pommedeterresautee
Member

pommedeterresautee commented Feb 18, 2022

Have you built the engine with int8 support?

@kobzaond
Author

kobzaond commented Feb 18, 2022

Yes, I've set both the fp16 and int8 flags:

config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)

Basically I used code analogous to your quantization demo; only the model changed. I can share some of my measurements (in seconds; each value is the average over 20 runs, always on the same sample):


| configuration | latency (s) |
| --- | --- |
| tensorrt fp16, batch 1 | 0.0052 |
| tensorrt fp16, batch 8 | 0.058 |
| tensorrt int8, batch 1 | 0.016 |
| tensorrt int8, batch 8 | 0.124 |
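For reference, those two flags sit inside a standard TensorRT builder config, along these lines (a sketch using TensorRT's Python API; the ONNX file name is illustrative and the optimization profile needed for dynamic input shapes is omitted):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("gpt2_qdq.onnx", "rb") as f:  # illustrative file name
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # honor the QDQ nodes as int8
config.set_flag(trt.BuilderFlag.FP16)  # fp16 fallback for non-quantized layers
serialized_engine = builder.build_serialized_network(network, config)
```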


@pommedeterresautee
Member

Have you checked that your local TensorRT version is the same as the one in the Docker image you use?
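A quick way to compare, assuming TensorRT's Python bindings are installed in each environment:

```python
import tensorrt

print(tensorrt.__version__)  # run in each environment and compare
```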

@kobzaond
Author

kobzaond commented Mar 9, 2022

I am not using any Docker image; everything is installed in a Python virtual environment, a conda environment, or locally, so there shouldn't be any version mismatch.
