Big performance difference on TensorRT #85

Open
HireezShanPeng opened this issue May 31, 2022 · 4 comments

@HireezShanPeng

Hi, I just tried the demo code below. In your results, the [TensorRT (FP16)] numbers are much better than the others. However, the results I got are quite different: there is no such big difference between [TensorRT (FP16)] and the others (output attached). Do you know what happened, or how I can figure out the reason? Thank you.

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
  bash -c "cd /project && \
    convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
    --backend tensorrt onnx \
    --seq-len 16 128 128"
Inference done on Tesla M60
latencies:
[Pytorch (FP32)] mean=6.31ms, sd=1.32ms, min=4.48ms, max=10.75ms, median=6.39ms, 95p=8.63ms, 99p=9.33ms
[Pytorch (FP16)] mean=8.81ms, sd=2.02ms, min=6.59ms, max=55.42ms, median=8.70ms, 95p=11.20ms, 99p=12.16ms
[TensorRT (FP16)] mean=4.59ms, sd=1.97ms, min=2.27ms, max=10.38ms, median=4.47ms, 95p=8.02ms, 99p=8.86ms
[ONNX Runtime (FP32)] mean=5.03ms, sd=2.00ms, min=2.64ms, max=10.45ms, median=5.16ms, 95p=8.37ms, 99p=9.17ms
[ONNX Runtime (optimized)] mean=5.19ms, sd=2.04ms, min=2.80ms, max=10.59ms, median=5.25ms, 95p=8.67ms, 99p=9.40ms

@HireezShanPeng (Author)

I also got this error while running the code above:

Traceback (most recent call last):
  File "/usr/local/bin/convert_model", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 358, in entrypoint
    main(commands=args)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 329, in main
    check_accuracy(
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 82, in check_accuracy
    f"{engine_name} discrepency is too high ({discrepency:.2f} > {tolerance}):\n"
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 678, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
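
The traceback shows a CUDA tensor being handed straight to NumPy inside check_accuracy. A minimal sketch of the failure and the fix the error message itself suggests (the variable name output is hypothetical, not the one used in convert.py):

import torch

output = torch.randn(1, 2, device="cuda")  # model output still on the GPU

# output.numpy() would raise:
# TypeError: can't convert cuda:0 device type tensor to numpy.

# Copy the tensor to host memory first, as the error message suggests:
output_np = output.cpu().numpy()
print(output_np.dtype, output_np.shape)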

@kamalkraj (Contributor) commented Jun 1, 2022

Hi @HireezShanPeng,

The big performance difference comes from the RTX 3090 card used in the README.

The M60 doesn't support 16-bit precision, so there is no performance advantage to running a 16-bit model on an M60.

Logs from my 1080 Ti, which also doesn't support 16-bit precision:

[screenshot: benchmark logs on the 1080 Ti, 2022-06-01]

More information:
https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/
https://www.tensorflow.org/guide/mixed_precision#supported_hardware
https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/
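
A quick way to check whether a card can actually benefit from FP16 is to read its CUDA compute capability: Tensor Cores arrived with Volta (compute capability 7.0), while the M60 (5.2) and 1080 Ti (6.1) sit below that. A minimal sketch, assuming PyTorch and a visible CUDA device:

import torch

# (major, minor) compute capability of the current CUDA device
major, minor = torch.cuda.get_device_capability()
name = torch.cuda.get_device_name()

# Tensor Cores (the source of the FP16 speedup) require Volta or newer,
# i.e. compute capability >= 7.0; the M60 is 5.2, the 1080 Ti is 6.1.
has_tensor_cores = (major, minor) >= (7, 0)
print(f"{name}: compute capability {major}.{minor}, tensor cores: {has_tensor_cores}")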

@kamalkraj (Contributor)

NVIDIA/TensorRT#218

@pommedeterresautee (Member) commented Jun 2, 2022

Thank you @kamalkraj for your answer. Just to complete:

  • when there are no tensor cores dedicated to FP16, mixed precision is usually slower, because we add some casts (FP32 <-> FP16) here and there, involving plenty of tensor copies; this is mostly offset by the kernel fusions applied (a sketch of where those casts come from follows this list);
  • if you are doing cloud inference, the T4 is a "cheap" option (relative to other GPU prices) that supports FP16. However, keep in mind that recent GPUs have many more FP16 tensor cores than the good old T4, so their advantage over FP32 is even larger.
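
To make the first point concrete, here is a minimal sketch (assuming PyTorch and a CUDA device) of where those casts come from: under autocast, matmul-heavy ops run in FP16 while the rest stay in FP32, so the framework inserts FP32 <-> FP16 conversions at the boundaries, and on a card without FP16 tensor cores those conversions are pure overhead:

import torch

model = torch.nn.Linear(384, 384).cuda()  # toy stand-in for a transformer layer
x = torch.randn(8, 384, device="cuda")    # FP32 input

# autocast runs FP16-eligible ops (like this linear layer) in half precision
# and casts inputs/outputs at the boundaries; each cast is a tensor copy.
with torch.inference_mode(), torch.cuda.amp.autocast():
    y = model(x)

print(y.dtype)  # torch.float16: the matmul ran in half precision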

Regarding your bug, I am not sure I understand when it happens. Can you please provide more context?
