The Phenomenon of Latency Jumps in Inference #8202
daliwang777 started this conversation in Ideas
Replies: 1 comment 1 reply
When I used vLLM to accelerate inference for Llama 3 70B, I observed an interesting phenomenon: when each input request consists of 1 token, there is a latency jump every 256 requests. Does anyone know why this happens?

[Figure: llama370bcuda8 — per-request latency plot showing a latency jump every 256 requests]

Reply: Because the default scheduler batch size is 256.
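One way to check this explanation is to time batches of requests just below and just above the scheduler cap, and then vary the cap and see whether the jump period moves with it. Below is a minimal sketch, assuming vLLM's offline `LLM` entry point and the `max_num_seqs` engine argument (the reply above points to its default of 256); the model path, `tensor_parallel_size`, and prompt text are placeholders and not taken from the original post.

```python
# Minimal sketch (assumptions: vLLM's offline LLM entry point, the max_num_seqs
# engine argument, and a placeholder model path / tensor_parallel_size — none of
# these come from the original post).
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",  # placeholder checkpoint
    tensor_parallel_size=8,               # adjust to the actual GPU count
    max_num_seqs=256,                     # scheduler batch cap; try 128 or 512 and re-run
)

params = SamplingParams(max_tokens=1)

# Time generate() for request counts just below and just above the cap.
# Crossing the cap forces an extra scheduler step, so the wall-clock time
# should jump between 256 and 257 requests (and again between 512 and 513).
for n in (255, 256, 257, 512, 513):
    prompts = ["Hi"] * n  # short prompts, approximating the 1-token inputs
    start = time.perf_counter()
    llm.generate(prompts, params)
    print(f"{n:4d} requests: {time.perf_counter() - start:.3f} s")
```

If the jump period tracks `max_num_seqs` (for example, a spike every 128 requests after setting `max_num_seqs=128`), that would confirm the scheduler batch cap as the cause of the latency jumps.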