
GPU quantization for sentence-transformer: ONNX quantized model #124

Open
Matthieu-Tinycoaching opened this issue Aug 9, 2022 · 3 comments

Comments

@Matthieu-Tinycoaching

Hi,

Thanks for the nice repo!

Is it possible to obtain an ONNX model from GPU quantization of a sentence-transformer?

It seems that the end-to-end notebook is based on a TensorRT quantized model.

Thanks!

@pommedeterresautee
Member

On GPU, TensorRT is the only way to run quantized models: ONNX Runtime requires you to use the TensorRT execution provider.

So you need to:
1/ extract the ONNX model from sentence-transformers (wrappers are provided in this library in case you have some difficulties)
2/ do the same kind of work as described in the repo on this ONNX file
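Step 1/ can be sketched with a plain `torch.onnx.export` of the underlying transformer (the checkpoint name, file name, and dynamic axes below are illustrative assumptions, not this library's wrappers):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical checkpoint; any sentence-transformers backbone works the same way.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# torchscript=True makes the model return plain tuples, which export prefers.
model = AutoModel.from_pretrained(model_name, torchscript=True)
model.eval()

enc = tokenizer("dummy sentence", return_tensors="pt")

# Export token-level embeddings; pooling can be reapplied after inference.
torch.onnx.export(
    model,
    (enc["input_ids"], enc["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "last_hidden_state": {0: "batch", 1: "seq"},
    },
    opset_version=13,
)
```

The exported file is then the input to step 2/, i.e. the same quantization workflow the repo describes.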

@Matthieu-Tinycoaching
Author

Matthieu-Tinycoaching commented Aug 9, 2022

Do you recommend applying QAT? Does it require redoing pre-training on the whole dataset in the case of a multilingual model?

@pommedeterresautee
Member

> Does it require redoing pre-training on the whole dataset in the case of a multilingual model?

It depends ;-)
What I can say is that the official Nvidia doc says only 10%, but that was in CV with a large dataset. In a fine-tuning regime, it definitely depends on the size of your dataset.
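As a rough back-of-the-envelope, assuming the 10% figure from the Nvidia doc carries over (which, as noted, is not guaranteed outside large CV datasets; the dataset size below is hypothetical):

```python
# Hypothetical dataset size; the 10% fraction comes from Nvidia's CV guidance.
full_dataset_size = 1_000_000
qat_fraction = 0.10

# Number of examples you would revisit during quantization-aware training.
qat_examples = int(full_dataset_size * qat_fraction)
print(qat_examples)  # 100000
```

In a fine-tuning regime with a small dataset, that fraction may well need to be much larger, or the whole set.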

And regarding whether to do it at all, it's more about how much the latency improvement matters to your use case, and the difficulty of industrializing it vs. other strategies such as distillation into MiniLM or some other smaller model. In our case, at some point we excluded it from our industrialization pipeline.
