Big performance difference on TensorRT #85

Open
HireezShanPeng opened this issue May 31, 2022 · 4 comments

@HireezShanPeng

Hi, I just tried the demo code below. In your results, the [TensorRT (FP16)] numbers are much better than the others. However, the results I got are quite different: there is no such big difference between [TensorRT (FP16)] and the others (output attached). Do you know what happened, or how I can figure out the reason? Thank you.

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
  bash -c "cd /project && \
    convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
    --backend tensorrt onnx \
    --seq-len 16 128 128"
Inference done on Tesla M60
latencies:
[Pytorch (FP32)] mean=6.31ms, sd=1.32ms, min=4.48ms, max=10.75ms, median=6.39ms, 95p=8.63ms, 99p=9.33ms
[Pytorch (FP16)] mean=8.81ms, sd=2.02ms, min=6.59ms, max=55.42ms, median=8.70ms, 95p=11.20ms, 99p=12.16ms
[TensorRT (FP16)] mean=4.59ms, sd=1.97ms, min=2.27ms, max=10.38ms, median=4.47ms, 95p=8.02ms, 99p=8.86ms
[ONNX Runtime (FP32)] mean=5.03ms, sd=2.00ms, min=2.64ms, max=10.45ms, median=5.16ms, 95p=8.37ms, 99p=9.17ms
[ONNX Runtime (optimized)] mean=5.19ms, sd=2.04ms, min=2.80ms, max=10.59ms, median=5.25ms, 95p=8.67ms, 99p=9.40ms

@HireezShanPeng (Author)

I also got this error while running the code above:

Traceback (most recent call last):
  File "/usr/local/bin/convert_model", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 358, in entrypoint
    main(commands=args)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 329, in main
    check_accuracy(
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 82, in check_accuracy
    f"{engine_name} discrepency is too high ({discrepency:.2f} > {tolerance}):\n"
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 678, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
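
The traceback shows a CUDA tensor being handed straight to NumPy inside check_accuracy. A minimal sketch of the failure and the fix the error message itself suggests (the variable name output is hypothetical, not the one used in convert.py):

import torch

output = torch.randn(1, 2, device="cuda")  # model output still on the GPU

# output.numpy() would raise:
# TypeError: can't convert cuda:0 device type tensor to numpy.

# Copy the tensor to host memory first, as the error message suggests:
output_np = output.cpu().numpy()
print(output_np.dtype, output_np.shape)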

@kamalkraj (Contributor) commented Jun 1, 2022

Hi @HireezShanPeng,

The big performance difference comes from the RTX 3090 card used in the README.

The M60 doesn't support 16-bit precision, so there is no performance advantage to running a 16-bit model on an M60.

Logs from my 1080 Ti, which also doesn't support 16-bit precision:

[screenshot: benchmark logs on the 1080 Ti, 2022-06-01]

More information:
https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/
https://www.tensorflow.org/guide/mixed_precision#supported_hardware
https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/
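
A quick way to check whether a card can actually benefit from FP16 is to read its CUDA compute capability: Tensor Cores arrived with Volta (compute capability 7.0), while the M60 (5.2) and 1080 Ti (6.1) sit below that. A minimal sketch, assuming PyTorch and a visible CUDA device:

import torch

# (major, minor) compute capability of the current CUDA device
major, minor = torch.cuda.get_device_capability()
name = torch.cuda.get_device_name()

# Tensor Cores (the source of the FP16 speedup) require Volta or newer,
# i.e. compute capability >= 7.0; the M60 is 5.2, the 1080 Ti is 6.1.
has_tensor_cores = (major, minor) >= (7, 0)
print(f"{name}: compute capability {major}.{minor}, tensor cores: {has_tensor_cores}")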

@kamalkraj (Contributor)

NVIDIA/TensorRT#218

@pommedeterresautee (Member) commented Jun 2, 2022

Thank you @kamalkraj for your answer. Just to complete:

  • when there are no tensor cores dedicated to FP16, mixed precision is usually slower, because we add some casts (FP32 <-> FP16) here and there, involving plenty of tensor copies; this is mostly offset by the kernel fusions applied (a sketch of where those casts come from follows this list);
  • if you are doing cloud inference, the T4 is a "cheap" option (relative to other GPU prices) that supports FP16. However, keep in mind that recent GPUs have many more FP16 tensor cores than the good old T4, so their advantage over FP32 is even larger.
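
To make the first point concrete, here is a minimal sketch (assuming PyTorch and a CUDA device) of where those casts come from: under autocast, matmul-heavy ops run in FP16 while the rest stay in FP32, so the framework inserts FP32 <-> FP16 conversions at the boundaries, and on a card without FP16 tensor cores those conversions are pure overhead:

import torch

model = torch.nn.Linear(384, 384).cuda()  # toy stand-in for a transformer layer
x = torch.randn(8, 384, device="cuda")    # FP32 input

# autocast runs FP16-eligible ops (like this linear layer) in half precision
# and casts inputs/outputs at the boundaries; each cast is a tensor copy.
with torch.inference_mode(), torch.cuda.amp.autocast():
    y = model(x)

print(y.dtype)  # torch.float16: the matmul ran in half precision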

Regarding your bug, I am not sure I understand when it happens. Can you please provide more context?
