Description
When using system shared memory, inference is much slower than with CUDA shared memory. The trace log shows that the input & output time is greater than the infer time.
Triton Information
Docker image: nvcr.io/nvidia/tritonserver:24.11-py3
Expected behavior
Shouldn't system shared memory be able to achieve the same throughput as CUDA shared memory?
Worse, because of the apparent I/O limitation, throughput with multiple GPUs is almost the same as with a single GPU.
Currently, system shared memory seems to be limited by some I/O. What operations are included in the input & output times in the trace log?
What operations are included in the input & output times in the trace log?
The TensorRT model runs on the GPU. It expects its input data to be available in specific GPU buffers and returns its results in GPU memory. Compute input time records the latency of moving the data from the source to the input GPU buffer that the model will consume. Compute output time records the latency of moving the data from the GPU memory where the model wrote the results to the user-requested memory.
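For reference, those compute sections can be recovered directly from the trace timestamps (COMPUTE_START, COMPUTE_INPUT_END, COMPUTE_OUTPUT_START, COMPUTE_END). Below is a minimal Python sketch; it assumes trace.json uses the array-of-entries layout with "timestamps": [{"name": ..., "ns": ...}] lists described in Triton's tracing documentation, so treat it as an approximation rather than an official tool:

import json
from collections import defaultdict

def compute_sections(trace_path):
    # Group timestamps by trace id; entries without a "timestamps" list are metadata.
    with open(trace_path) as f:
        entries = json.load(f)
    ts = defaultdict(dict)
    for e in entries:
        for t in e.get("timestamps", []):
            ts[e["id"]][t["name"]] = t["ns"]
    for trace_id, names in ts.items():
        try:
            compute_input = names["COMPUTE_INPUT_END"] - names["COMPUTE_START"]
            infer = names["COMPUTE_OUTPUT_START"] - names["COMPUTE_INPUT_END"]
            compute_output = names["COMPUTE_END"] - names["COMPUTE_OUTPUT_START"]
        except KeyError:
            continue  # request without full compute timestamps
        print(f"trace {trace_id}: input={compute_input} ns, infer={infer} ns, output={compute_output} ns")

compute_sections("trace.json")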
Shouldn't system shared memory be able to achieve the same throughput as CUDA shared memory?
No, this is not expected.
System Shared Memory:
When using system shared memory, the tensor data is provided in host memory. Triton performs an H2D copy to bring the data into the GPU buffers; this is what gets recorded in compute input. When the model run is complete, Triton performs a D2H copy to move the data from the GPU to the requested buffer, which is again in host memory because system shared memory was specified.
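To make the H2D/D2H path concrete, here is a rough sketch of the system-shared-memory flow with the Python gRPC client. The model, tensor names, dtype, and byte sizes ("yolo", "images", "output0", FP16, 846720) are placeholders loosely taken from this issue, not verified values:

import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.shared_memory as shm
from tritonclient.utils import np_to_triton_dtype

client = grpcclient.InferenceServerClient("localhost:8001")

input_data = np.zeros((1, 3, 384, 640), dtype=np.float16)
input_bytes = input_data.nbytes
output_bytes = 846720  # placeholder, mirrors --output-shared-memory-size

# Both regions live in host memory (POSIX shared memory).
client.unregister_system_shared_memory()
in_handle = shm.create_shared_memory_region("in_region", "/in_region", input_bytes)
out_handle = shm.create_shared_memory_region("out_region", "/out_region", output_bytes)
shm.set_shared_memory_region(in_handle, [input_data])
client.register_system_shared_memory("in_region", "/in_region", input_bytes)
client.register_system_shared_memory("out_region", "/out_region", output_bytes)

inputs = [grpcclient.InferInput("images", list(input_data.shape), np_to_triton_dtype(input_data.dtype))]
inputs[0].set_shared_memory("in_region", input_bytes)
outputs = [grpcclient.InferRequestedOutput("output0")]
outputs[0].set_shared_memory("out_region", output_bytes)

# Because the tensors sit in host memory, Triton has to do an H2D copy before
# the TensorRT run (compute input) and a D2H copy afterwards (compute output).
result = client.infer("yolo", inputs, outputs=outputs)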
CUDA Shared Memory:
When using CUDA shared memory, the tensor data is already provided in GPU device memory. Triton performs a D2D copy to bring the data into the GPU buffers; this is what gets recorded in compute input. When the model run is complete, Triton performs a D2D copy to move the data from the GPU to the requested buffer, which this time is on the GPU because CUDA shared memory was specified.
Hence, the performance of CUDA shared memory is better than system shared memory at a fixed request concurrency of 100.
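For comparison, a sketch of the CUDA-shared-memory variant; only the allocation and registration differ, and the device id, names, and sizes here are assumptions:

import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = grpcclient.InferenceServerClient("localhost:8001")
input_bytes = 1 * 3 * 384 * 640 * 2   # FP16 input, placeholder
output_bytes = 846720                 # placeholder

# Regions are allocated in GPU device memory, so Triton only needs D2D copies.
client.unregister_cuda_shared_memory()
in_handle = cudashm.create_shared_memory_region("in_region", input_bytes, 0)
out_handle = cudashm.create_shared_memory_region("out_region", output_bytes, 0)
client.register_cuda_shared_memory("in_region", cudashm.get_raw_handle(in_handle), 0, input_bytes)
client.register_cuda_shared_memory("out_region", cudashm.get_raw_handle(out_handle), 0, output_bytes)
# The InferInput / InferRequestedOutput set_shared_memory calls are the same as
# in the system-shared-memory sketch above.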
Worse, because of the apparent I/O limitation, throughput with multiple GPUs is almost the same as with a single GPU.
This is a challenge. Can you share perf_analyzer numbers for multiple GPUs for both cases, shared-memory = [cuda, system]?
I can imagine that with CUDA shared memory and multiple GPUs, if the model instance running the inference is on a different GPU than the one where the tensor data was provided/requested, there will be a lot of cross-device D2D transfers, and this could be quite common.
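One hedged way to probe that hypothesis: a CUDA shared memory region is created on a specific device, so if it does not match the GPU of the serving model instance, the compute input/output copies become cross-device transfers. The device ids below are assumptions about the deployment, not values reported in this issue:

import tritonclient.utils.cuda_shared_memory as cudashm

bytes_fp16 = 1 * 3 * 384 * 640 * 2  # placeholder size
# Region pinned to GPU 0; a model instance scheduled on GPU 1 would force a
# cross-device copy for every request that uses it.
region_gpu0 = cudashm.create_shared_memory_region("in_gpu0", bytes_fp16, 0)
# Creating per-GPU regions and steering requests to matching instances avoids that.
region_gpu1 = cudashm.create_shared_memory_region("in_gpu1", bytes_fp16, 1)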
To Reproduce
config:
docker-compose.yml:
Use the same deployment environment for the model conversion (remove .zip):
docker compose run -it --rm tritonserver sh
Conversion command:
/usr/src/tensorrt/bin/trtexec --onnx=yolo11n.onnx --saveEngine=model.plan --minShapes=images:1x3x128x128 --optShapes=images:32x3x128x128 --maxShapes=images:32x3x640x640 --memPoolSize=workspace:1024 --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --useCudaGraph
Testing with perf_analyzer
CUDA shared memory command:
perf_analyzer -m yolo -b 1 --shared-memory cuda --output-shared-memory-size 846720 --shape images:3,384,640 --concurrency-range 100 -i Grpc
result:
trace.json summary:
System shared memory command:
perf_analyzer -m yolo -b 1 --shared-memory system --output-shared-memory-size 846720 --shape images:3,384,640 --concurrency-range 100 -i Grpc
result:
trace.json summary: