Shared memory io bottleneck? #7905

Open
wensimin opened this issue Dec 24, 2024 · 2 comments
Labels
performance A possible performance tune-up

Comments

@wensimin

Description
When using system shared memory, inference speed is much lower than with CUDA shared memory. The trace log shows that the compute input & output time is greater than the compute infer time.

Triton Information
Docker image: nvcr.io/nvidia/tritonserver:24.11-py3

To Reproduce
config:

max_batch_size: 32
platform: "tensorrt_plan"
version_policy: {specific: {
    versions: 3
}}
dynamic_batching {
    preferred_batch_size: 32
    max_queue_delay_microseconds: 1000
}
instance_group {
    count: 2  
    kind: KIND_GPU
}
input {
    name: "images"
    data_type: TYPE_FP16
    dims: 3
    dims: -1
    dims: -1
}
output {
    name: "output0"
    data_type: TYPE_FP16
    dims: 84
    dims: -1
}

docker-compose.yml:

services:
  tritonserver:
    image: nvcr.io/nvidia/tritonserver:24.11-py3
    runtime: nvidia
    network_mode: host
    ipc: host
    pid: host
    volumes:
      - $PWD:/trt/
    command: >
      tritonserver --model-repository=/trt/models
      --trace-config triton,file=/trt/logs/trace.json
      --trace-config rate=100
      --trace-config level=TIMESTAMPS
      --trace-config count=10000
    environment:
      - TRITON_SERVER_LOG_LEVEL=DEBUG
    restart: unless-stopped
    shm_size: 120G

Use the same deployment environment for model conversion (remove .zip):
docker compose run -it --rm tritonserver sh

Conversion command
/usr/src/tensorrt/bin/trtexec --onnx=yolo11n.onnx --saveEngine=model.plan --minShapes=images:1x3x128x128 --optShapes=images:32x3x128x128 --maxShapes=images:32x3x640x640 --memPoolSize=workspace:1024 --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --useCudaGraph
Testing with perf_analyzer
CUDA shared memory command:
perf_analyzer -m yolo -b 1 --shared-memory cuda --output-shared-memory-size 846720 --shape images:3,384,640 --concurrency-range 100 -i Grpc
result:

Request concurrency: 100
  Client: 
    Request count: 158002
    Throughput: 8584.92 infer/sec
    Avg latency: 11635 usec (standard deviation 2431 usec)
    p50 latency: 11581 usec
    p90 latency: 13136 usec
    p95 latency: 13633 usec
    p99 latency: 14460 usec
    Avg gRPC time: 11615 usec ((un)marshal request/response 6 usec + response wait 11609 usec)
  Server: 
    Inference count: 158002
    Execution count: 6451
    Successful request count: 158002
    Avg request latency: 11881 usec (overhead 1306 usec + queue 1397 usec + compute input 38 usec + compute infer 5762 usec + compute output 3376 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 100, throughput: 8584.92 infer/sec, latency 11635 usec

trace.json summary:

nvidia-dev-1  | Summary for yolo (3): trace count = 1597
nvidia-dev-1  | GRPC infer request (avg): 11129.050242955542us
nvidia-dev-1  | 	Send (avg): 66.5512548528491us
nvidia-dev-1  | 
nvidia-dev-1  | 	Handler (avg): 11732.878185973701us
nvidia-dev-1  | 		Overhead (avg): 1315.4620350657483us
nvidia-dev-1  | 		Queue (avg): 1335.56562366938us
nvidia-dev-1  | 		Compute (avg): 9081.850527238572us
nvidia-dev-1  | 			Input (avg): 41.90431183469004us
nvidia-dev-1  | 			Infer (avg): 5724.3190507201us
nvidia-dev-1  | 			Output (avg): 3315.627164683782us

System shared memory command:
perf_analyzer -m yolo -b 1 --shared-memory system --output-shared-memory-size 846720 --shape images:3,384,640 --concurrency-range 100 -i Grpc
result:

Request concurrency: 100
  Client: 
    Request count: 92535
    Throughput: 5088.58 infer/sec
    Avg latency: 19633 usec (standard deviation 4389 usec)
    p50 latency: 19313 usec
    p90 latency: 22239 usec
    p95 latency: 23707 usec
    p99 latency: 27567 usec
    Avg gRPC time: 19611 usec ((un)marshal request/response 7 usec + response wait 19604 usec)
  Server: 
    Inference count: 92535
    Execution count: 3792
    Successful request count: 92535
    Avg request latency: 19669 usec (overhead 1313 usec + queue 6080 usec + compute input 4739 usec + compute infer 4041 usec + compute output 3495 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 100, throughput: 5088.58 infer/sec, latency 19633 usec

trace.json summary:

nvidia-dev-1  | Summary for yolo (3): trace count = 929
nvidia-dev-1  | GRPC infer request (avg): 19105.859079655544us
nvidia-dev-1  | 	Send (avg): 126.07316361679224us
nvidia-dev-1  | 
nvidia-dev-1  | 	Handler (avg): 19645.858298170075us
nvidia-dev-1  | 		Overhead (avg): 1337.5397825618945us
nvidia-dev-1  | 		Queue (avg): 6129.83818729817us
nvidia-dev-1  | 		Compute (avg): 12178.48032831001us
nvidia-dev-1  | 			Input (avg): 4732.871638320775us
nvidia-dev-1  | 			Infer (avg): 3995.0442206673843us
nvidia-dev-1  | 			Output (avg): 3450.5644693218514us

Expected behavior
Should system shared memory be able to achieve the same throughput as CUDA shared memory?
Worse, due to the IO limitation, when using multiple GPUs the throughput is almost the same as with a single GPU.
Currently system shared memory seems to be limited by some IO. What operations are included in the compute input & output times in the trace log?

@wensimin
Author

This problem causes lower performance with multiple GPUs: 3700 fps vs. 2500 fps. Any help?

@tanmayv25 added the performance label on Jan 24, 2025
@tanmayv25
Contributor

What operations are included in the compute input & output times in the trace log?

A TensorRT model runs on the GPU. It expects its input data to be available in specific GPU buffers and returns results in GPU memory. Compute input time records the latency of moving the data from its source to the input GPU buffer that the model consumes. Compute output time records the latency of moving data from the GPU memory where the model wrote the results to the user-requested memory.

Should system shared memory be able to achieve the same throughput as CUDA shared memory?

This is not expected.

System shared memory:
When using system shared memory, the tensor data is provided in host memory. Triton performs an H2D copy to bring the data into GPU buffers; this is what gets recorded as compute input. When the model run is complete, Triton performs a D2H copy to move the data from the GPU to the requested buffer, which is again in host memory since system shared memory is specified.
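
For illustration, here is a minimal Python sketch (not from this issue) of the system shared memory path with the Triton gRPC client; the region name, shm key, and tensor shape are placeholders:

import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.shared_memory as shm

client = grpcclient.InferenceServerClient("localhost:8001")

# Host-side tensor; shape chosen to match the perf_analyzer example above.
input_data = np.zeros((1, 3, 384, 640), dtype=np.float16)
byte_size = input_data.nbytes

# Create a host (system) shared memory region, copy the tensor into it,
# and register it with the server.
shm_handle = shm.create_shared_memory_region("input_shm", "/input_shm", byte_size)
shm.set_shared_memory_region(shm_handle, [input_data])
client.register_system_shared_memory("input_shm", "/input_shm", byte_size)

# The request only references the region; Triton still has to perform an H2D
# copy before the TensorRT engine can consume the data (the compute input time).
infer_input = grpcclient.InferInput("images", list(input_data.shape), "FP16")
infer_input.set_shared_memory("input_shm", byte_size)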

CUDA shared memory:
When using CUDA shared memory, the tensor data is already provided in GPU device memory. Triton performs a D2D copy to bring the data into GPU buffers; this is what gets recorded as compute input. When the model run is complete, Triton performs a D2D copy to move the data from the GPU to the requested buffer, which this time is on the GPU since CUDA shared memory is specified.
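
Correspondingly, a minimal sketch (again illustrative, with placeholder names; the size matches the perf_analyzer command above) of registering a CUDA shared memory region for the output:

import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = grpcclient.InferenceServerClient("localhost:8001")

# Allocate the region directly on GPU 0 and hand its raw IPC handle to the
# server, so results only need a D2D copy instead of a D2H copy.
output_byte_size = 846720
cuda_handle = cudashm.create_shared_memory_region("output_shm", output_byte_size, 0)
client.register_cuda_shared_memory(
    "output_shm", cudashm.get_raw_handle(cuda_handle), 0, output_byte_size
)

requested_output = grpcclient.InferRequestedOutput("output0")
requested_output.set_shared_memory("output_shm", output_byte_size)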

Hence, CUDA shared memory performs better than system shared memory for a fixed request concurrency of 100.

Worse, due to the IO limitation, when using multiple GPUs the throughput is almost the same as with a single GPU.

This is a challenge. Can you share perf_analyzer numbers for multiple GPUs for both cases, shared-memory = [cuda, system]?
I can imagine that with CUDA shared memory and multiple GPUs, if the model instance running the inference is on a different GPU than the one where the tensor data was provided/requested, there will be a lot of cross-device D2D transactions, and this could be quite common.
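
For what it's worth, a model-config sketch (an assumption on my part, not something verified in this issue) that pins instances to explicit GPUs; benchmarking one device at a time this way can help isolate whether cross-device copies are the cause:

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
    },
    {
        count: 1
        kind: KIND_GPU
        gpus: [ 1 ]
    }
]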
