Speed difference ONNX vs TensorRT with samples sorted by sequence length #55
Comments
Is there some batching applied?
Oh, I completely forgot to mention that. Yes, I use a batch size of 64. This behavior only applies if batching is used.
How is each batch built? Is it made of sequences of the exact same length?
The samples are just ordered by character length and then batched, so they still may vary within a batch (but much less than before). The speed-up just comes from the fact that fewer batches are padded to the maximum length. I replaced

with

and added
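A minimal sketch of that length-sorted dynamic batching, assuming a Hugging Face tokenizer and padding to the longest sequence of each batch; only the sort by character length and the batch size of 64 come from this thread, the dataset, the `padding="longest"` call and the `max_length=60` are assumptions:

```python
from transformers import AutoTokenizer

# placeholder data; any text dataset works (the real one is not disclosed in this issue)
texts = ["a short sample", "a considerably longer sample sentence for the classifier", "..."]

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

# sort by character length so sequences inside a batch have similar lengths
texts = sorted(texts, key=len)

batch_size = 64  # batch size used in this thread
for start in range(0, len(texts), batch_size):
    batch = texts[start : start + batch_size]
    # padding="longest" pads only up to the longest sequence of the batch,
    # so sorting drastically reduces the number of padding tokens
    encoded = tokenizer(
        batch, padding="longest", truncation=True, max_length=60, return_tensors="np"
    )
    # encoded["input_ids"] / encoded["attention_mask"] are then sent to the inference endpoint
```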
Can you provide me with some reproducible code so I can test it on my side?
Hey @pommedeterresautee, sorry for the long wait, I was on a holiday trip. I based my script on your demo scripts, but I cannot disclose the model and/or dataset. You can basically use any dataset with two inputs, e.g. the QA example. I hope you can make use of it anyway. I attached the script that calls the inference ensemble hosted in Triton (transformer_onnx_inference or transformer_trt_inference) and the slightly modified model.py for the tokenize endpoint in Triton. If you experience the same as I do, then calling the ONNX model's inference endpoint should be slower when you comment out the length sorting in triton_inference_qa_test.py.
Attachments: triton_inference_qa_test.py, model.py
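Since the attached scripts are not reproduced here, the call into the Triton ensemble presumably looks roughly like the sketch below; the ensemble name comes from this thread, but the input/output tensor names (`TEXT`, `output`) and the single-string input shape are assumptions and would have to be adapted to the actual model configuration:

```python
import numpy as np
import tritonclient.http as triton_http

MODEL_NAME = "transformer_onnx_inference"  # or "transformer_trt_inference"
INPUT_NAME = "TEXT"                        # assumed input tensor name
OUTPUT_NAME = "output"                     # assumed output tensor name

client = triton_http.InferenceServerClient(url="127.0.0.1:8000")

def infer_batch(batch_of_texts):
    # Triton expects string inputs as a BYTES tensor backed by a numpy object array
    data = np.array([[t.encode("utf-8")] for t in batch_of_texts], dtype=object)
    infer_input = triton_http.InferInput(INPUT_NAME, data.shape, "BYTES")
    infer_input.set_data_from_numpy(data)
    result = client.infer(
        model_name=MODEL_NAME,
        inputs=[infer_input],
        outputs=[triton_http.InferRequestedOutput(OUTPUT_NAME)],
    )
    return result.as_numpy(OUTPUT_NAME)

# commenting out the sort below reproduces the slowdown reported for the ONNX model
texts = ["a short sample", "a considerably longer sample sentence", "..."] * 64
texts = sorted(texts, key=len)
for start in range(0, len(texts), 64):
    infer_batch(texts[start : start + 64])
```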
I noticed something unexpected when comparing two scenarios for a model converted via ONNX and TensorRT (distilroberta with a classification head):

1. the samples are batched in their original order;
2. the samples are sorted by sequence length before batching.
Result: The TensorRT model does not seem to care about the sequence lengths and keeps the same speed for both scenarios. The ONNX model, however, gets almost twice as fast when I use the second scenario.
I was wondering whether TensorRT's optimization somehow requires padding to the maximum length internally. I searched for a parameter or a reason for this behavior but couldn't find anything useful. For conversion, I set the seq-len parameter to `1 60 60`. I was wondering if perhaps someone else has already observed this and knows the reason / a solution.
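For reference, the ONNX side of the comparison can be reproduced with a loop like the one below, timing the same batches once in their original order and once sorted by length. The model path, the input tensor names, the tokenizer and the placeholder dataset are assumptions; running the equivalent loop against the TensorRT engine should show a roughly constant latency if the observation above holds:

```python
import time

import onnxruntime as ort
from transformers import AutoTokenizer

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])  # assumed path
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

# placeholder dataset with varying lengths; replace with real data
texts = [("word " * (i % 50 + 1)).strip() for i in range(1024)]
batch_size = 64

def timed_run(samples):
    start = time.perf_counter()
    for i in range(0, len(samples), batch_size):
        enc = tokenizer(
            samples[i : i + batch_size], padding="longest",
            truncation=True, max_length=60, return_tensors="np",
        )
        session.run(None, {
            "input_ids": enc["input_ids"],          # assumed input names of the exported model
            "attention_mask": enc["attention_mask"],
        })
    return time.perf_counter() - start

print("unsorted:", timed_run(texts))
print("sorted:  ", timed_run(sorted(texts, key=len)))
```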