Out of memory error for batch size more than 1 for T5 models #60
Thank you for the progress made on T5! :) I have been following your t5 file, and I found some points that will hopefully help. (I do not want to bother with another thread as t5 is not merged yet.)
I have not tried TRT yet, so I cannot add to @Ki6an's post. I will update once I start playing with it. Thanks again for this great repo!
@Ki6an hi, I am using between 10 and 20 GB of RAM (working on a 3090 RTX). I never experienced issues with batch > 1. @victox5 I am working on it. Basically, I am trying to build a tool that automatically sets the right precision on each node without following a fixed pattern. Honestly it's not easy, it raises many other issues, and probably the new Transformer Engine from H100 (https://blogs.nvidia.com/blog/2022/03/22/h100-transformer-engine/), announced a few hours ago, is a better way :-)
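To make the per-node precision idea concrete, here is a minimal sketch (not the actual tool from this repo) of one way to pick nodes to keep in FP32: expose intermediate tensors as extra outputs, run the FP32 ONNX model once, and flag any activation that overflows the FP16 range. The model path and the dummy input feed below are placeholders.

```python
# Hypothetical sketch of per-node precision selection. The ONNX path and the
# dummy input feed are placeholders, not values from this repo.
import numpy as np
import onnx
from onnx import shape_inference
import onnxruntime as ort

FP16_MAX = float(np.finfo(np.float16).max)  # 65504.0

model = shape_inference.infer_shapes(onnx.load("t5-small-encoder.onnx"))  # placeholder path

# Expose every intermediate tensor as a graph output so it can be inspected.
existing_outputs = {o.name for o in model.graph.output}
for value_info in model.graph.value_info:
    if value_info.name not in existing_outputs:
        model.graph.output.append(value_info)

session = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
feed = {  # dummy encoder inputs, adjust to the actual export
    "input_ids": np.ones((1, 16), dtype=np.int64),
    "attention_mask": np.ones((1, 16), dtype=np.int64),
}
output_names = [o.name for o in session.get_outputs()]
outputs = session.run(output_names, feed)

# Any FP32 activation that would overflow FP16 marks its producer to keep in FP32.
keep_in_fp32 = sorted(
    name
    for name, value in zip(output_names, outputs)
    if value.dtype == np.float32 and np.abs(value).max() > FP16_MAX
)
print(keep_in_fp32)
```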
@Ki6an the T5 support is now official, you can check the notebook. It fixes the issue of double weights.
Closing because recently merged work should have fixed this issue. Don't hesitate to reopen the issue @Ki6an
@pommedeterresautee thanks. Do you plan on adding support for the Triton server?
T5 work requires good support of the …
Great!
Current Triton versions (including 22.05) use a version of ONNX Runtime which does not have good support of …

Not sure I understand what you mean by "lots of requests" on Python? It should be one per generated token, at least that's what I would expect ;-)
Great find! Thanks for fixing the bug. Sorry for replying late on this. As mentioned above, I'm trying to serve the T5 model from Triton server. I have an encoder (ORT backend), a decoder (ORT backend) and ensemble_t5 (which uses a Python backend to preprocess the text and also to handle the Hugging Face API). I have converted the model with cache (i.e. with past key values), so my decoder takes 24 past key values, input_ids and encoder hidden states as input, and outputs 24 PKV and logits. To generate a single token, we need to send these inputs (24 PKV + input_ids + encoder_hidden_states) to the decoder and request (24 PKV and logits) as output. As you can see, there is a lot of data movement between ensemble_t5 and the decoder, which is making the model slower. I was asking if there is a better way to handle this and make the model faster. Thanks.
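For context, a rough sketch of what such a per-token loop in the Python backend looks like, using the same pb_utils calls that appear later in this thread; the model name "t5-decoder", the tensor names and the helper signature are assumptions, not the actual ensemble code:

```python
# Sketch of a greedy decoding loop inside a Triton Python backend. Every step
# ships the 24 past-key-value tensors plus the input ids and encoder tensors
# to the decoder, and pulls 24 PKV + logits back, which is where the data
# movement adds up. `pb_utils` only exists inside a Triton python_backend model.
import torch
import triton_python_backend_utils as pb_utils

def greedy_decode(input_ids, encoder_hidden_states, encoder_attention_mask,
                  past_key_values, pkv_input_names, pkv_output_names,
                  max_new_tokens=32):
    for _ in range(max_new_tokens):
        inputs = [
            pb_utils.Tensor.from_dlpack("input_ids", torch.to_dlpack(input_ids)),
            pb_utils.Tensor.from_dlpack("encoder_hidden_states", torch.to_dlpack(encoder_hidden_states)),
            pb_utils.Tensor.from_dlpack("encoder_attention_mask", torch.to_dlpack(encoder_attention_mask)),
        ] + [
            pb_utils.Tensor.from_dlpack(name, torch.to_dlpack(tensor))
            for name, tensor in zip(pkv_input_names, past_key_values)  # 24 tensors in
        ]
        request = pb_utils.InferenceRequest(
            model_name="t5-decoder",  # assumed model name
            requested_output_names=["logits"] + pkv_output_names,
            inputs=inputs,
        )
        response = request.exec()
        logits = torch.from_dlpack(
            pb_utils.get_output_tensor_by_name(response, "logits").to_dlpack()
        )
        past_key_values = [
            torch.from_dlpack(pb_utils.get_output_tensor_by_name(response, name).to_dlpack())
            for name in pkv_output_names  # 24 tensors out
        ]
        # With the cache, only the freshly generated token is fed back.
        input_ids = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    return input_ids
```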
Triton uses DLPack to pass tensors from one backend to another; it's supposed to be close to cost-free (it just wraps the tensor, there is no copy). Did you measure that the slowness was caused by these transfers between backends?
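To illustrate the zero-copy claim with plain PyTorch (outside Triton): a tensor round-tripped through DLPack keeps the same underlying storage.

```python
# DLPack round-trip in PyTorch: the capsule only wraps the existing storage,
# so the data pointer is unchanged and no copy is made.
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

x = torch.randn(2, 3, device="cuda" if torch.cuda.is_available() else "cpu")
capsule = to_dlpack(x)      # wrap, no copy
y = from_dlpack(capsule)    # unwrap, still no copy

assert y.data_ptr() == x.data_ptr()  # same underlying memory
y[0, 0] = 42.0
assert x[0, 0] == 42.0               # writes are visible through both views
```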
Thanks for the response and the tip. The execution of the ONNX model part is slow:

```python
inference_request = pb_utils.InferenceRequest(
    model_name=self.model_path,
    requested_output_names=["logits"] + self.output_pkv_names,
    inputs=[input_ids, encoder_attention_mask, encoder_hidden_states]
    + input_past_key_values,
)
inference_response = inference_request.exec()
```

I also noticed that the following part is a little slow:

```python
logits = T5Helper.get_output_tensors(inference_response, "logits")
list_out_pkv = [
    T5Helper.get_output_tensors(inference_response, name)
    for name in self.output_pkv_names
]
```

where `T5Helper` is:

```python
class T5Helper:
    @staticmethod
    def get_output_tensors(inference_response, name):
        output = pb_utils.get_output_tensor_by_name(inference_response, name)
        tensor = torch.from_dlpack(output.to_dlpack())
        return tensor.cuda()
```
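One small thing worth trying (a hypothetical variant, assuming the decoder outputs may already be on the GPU): only call `.cuda()` when the tensor is actually on the CPU, since `.cuda()` on a CPU tensor forces a host-to-device copy for every generated token.

```python
# Hypothetical variant of get_output_tensors: move to GPU only when needed.
import torch
import triton_python_backend_utils as pb_utils  # available inside the backend

class T5Helper:
    @staticmethod
    def get_output_tensors(inference_response, name):
        output = pb_utils.get_output_tensor_by_name(inference_response, name)
        tensor = torch.from_dlpack(output.to_dlpack())
        return tensor if tensor.is_cuda else tensor.cuda()
```

Whether the outputs actually land on the GPU depends on the Triton/ORT versions and the model configuration, which is exactly what the rest of this thread discusses.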
Sorry to re-ask, just to be sure: you are using 2 decoders and 1 encoder, right? Moreover, why do you need the `tensor.cuda()` call? Is the tensor not already on the GPU? Also, probably not your issue, but to improve things, a correct host RAM allocation helps:
yes
For some reason, it's not placing the output tensors directly on GPU, even though I set this.

The slowdown occurs when I place the whole pipeline (i.e. tokens and model) on GPU. I tried keeping the model and tokens on the CPU, but this time the input part, i.e.

```python
input_ids = pb_utils.Tensor.from_dlpack("input_ids", torch.to_dlpack(input_ids))
encoder_attention_mask = pb_utils.Tensor.from_dlpack(
    "encoder_attention_mask", torch.to_dlpack(attention_mask)
)
encoder_hidden_states = pb_utils.Tensor.from_dlpack(
    "encoder_hidden_states", torch.to_dlpack(encoder_output)
)
flat_past_key_values = functools.reduce(operator.iconcat, past_key_values, [])
input_past_key_values = [
    pb_utils.Tensor.from_dlpack(name, torch.to_dlpack(tensor))
    for name, tensor in zip(self.input_pkv_names, flat_past_key_values)
]
```

is slow (almost half as slow, but in an overall speed comparison with torch it's no improvement). I'm always keeping …
OK, I think you have found the culprit: if the tensor is provided on CPU, there is no way to get low latency. Also, I would check if there is something in the Python code which does a `to("cpu")`; for instance, if you have built a custom model, you can double check that it's moved to the right device:
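A quick way to spot an accidental CPU round-trip (a hypothetical debugging helper, not code from the repo) is to log the device of every tensor as it flows through the pipeline:

```python
# Hypothetical debugging helper: print where each tensor lives so an
# unexpected CPU placement shows up immediately.
import torch

def log_devices(step: str, **tensors: torch.Tensor) -> None:
    for name, tensor in tensors.items():
        print(f"[{step}] {name}: device={tensor.device}, dtype={tensor.dtype}, shape={tuple(tensor.shape)}")

# Example usage inside the generation loop:
# log_devices("decoder inputs", input_ids=input_ids, encoder_hidden_states=encoder_output)
```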
FYI, Triton 22.07 has been released. It fixes a bug where ORT tensors were always put in host memory (plus it's built with ORT 1.12.0, which also has its own memory placement bug). I have updated the code of this repo accordingly (there are some subtleties to manage, not just an update of the Docker image).

Let us know if it helps regarding your issue.
Hey, first of all, thanks for creating this amazing library!

I'm following your T5 implementation with TRT:

transformer-deploy/t5.py
Line 222 in b52850d

And I'm trying to convert the ONNX version of the T5 model to a TensorRT engine using your build_engine method:

transformer-deploy/src/transformer_deploy/backends/trt_utils.py
Line 64 in 1f2d2c1

It works fine for a batch size of 1, but for batch size > 1 it takes much longer to build (almost an hour just for the t5-small encoder), and even then it does not build the model successfully and I get the following error:

Some system info, if that helps:
trt + cuda: 8.2.1-1+cuda11.4
OS: Ubuntu 20.04.3
GPU: T4 with 15 GB memory
The errors say I need more GPU memory. I was wondering how much GPU memory you used for a batch size of 5? Or maybe I'm missing something?

I would really appreciate any help, thank you!
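Since the question is about running out of GPU memory when building for batch > 1, here is a generic TensorRT sketch (not the repo's build_engine; the paths, input names and shapes are assumptions) showing the two knobs that usually matter: the builder workspace size and the optimization profile ranges. On a 15 GB T4, lowering the workspace and keeping the max shapes close to what is actually needed reduces the memory the builder tries to use during tactic selection.

```python
# Generic TensorRT (8.2-era API) engine build with a dynamic batch dimension.
# Paths, input names and shapes below are assumptions for a T5 encoder export.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("t5-small-encoder.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.max_workspace_size = 4 * 1024 ** 3  # cap the builder workspace (bytes)
config.set_flag(trt.BuilderFlag.FP16)

# One profile covering batch sizes 1..5 and sequence lengths 1..512.
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 1), (5, 128), (5, 512))
profile.set_shape("attention_mask", (1, 1), (5, 128), (5, 512))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("t5-small-encoder.plan", "wb") as f:
    f.write(serialized_engine)
```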