On GPU, TensorRT is the only way to run quantized models.
ONNX Runtime requires you to use the TensorRT execution provider.
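For illustration, here is a minimal sketch of opening an ONNX Runtime session with the TensorRT execution provider; the model path and input shapes below are placeholders, not something taken from this repo:

```python
import numpy as np
import onnxruntime as ort

# Minimal sketch: TensorRT must come first in the provider list so that the
# quantized (INT8) nodes run through TensorRT; CUDA/CPU act as fallbacks.
session = ort.InferenceSession(
    "model.onnx",  # hypothetical path to the quantized ONNX file
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)

# Dummy inputs with illustrative shapes (batch=1, sequence=16).
feed = {
    "input_ids": np.ones((1, 16), dtype=np.int64),
    "attention_mask": np.ones((1, 16), dtype=np.int64),
}
outputs = session.run(None, feed)
```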
So you need to:
1/ extract the ONNX model from sentence-transformers (wrappers are provided in this lib in case you have some difficulties) -- see the export sketch after this list
2/ do the same kind of work as described in the repo on this ONNX file
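For step 1, a minimal export sketch using plain `torch.onnx.export` (the repo's own wrappers may do this differently; the model name and opset here are just examples):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example checkpoint; any sentence-transformers model with a BERT-like
# encoder can be exported the same way.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# Dummy input used only to trace the graph.
inputs = tokenizer("a dummy sentence", return_tensors="pt")

torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)
```

Keep in mind that sentence-transformers usually applies mean pooling (and sometimes normalization) on top of the transformer output; you can re-apply that step outside the ONNX graph or export a small wrapper module that includes it.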
Is this supposed to redo pre-training on the whole dataset in the case of a multilingual model?
It depends ;-)
What I can say is that the official Nvidia doc says only 10%, but that's in CV with large datasets. In a fine-tuning regime, it definitely depends on the size of your dataset.
And regarding whether it's worth doing at all, it's more a question of how much the latency improvement matters to your use case, and of the difficulty of industrializing it vs. other strategies like distillation into MiniLM or some other smaller model, for instance. In our case, at some point we excluded it from our industrialization pipeline.
Hi,
Thanks for the nice repo!
Isn't it possible to obtain an ONNX model from GPU quantization of a sentence-transformers model?
It seems that the end-to-end notebook is based on a TensorRT-quantized model.
Thanks!