The Phenomenon of Latency Jumps in Inference #8202
daliwang777 started this conversation in Ideas
Replies: 1 comment 1 reply
When I used vLLM to accelerate inference for Llama 3 70B, I observed an interesting phenomenon: when each input request consists of 1 token, there is a latency jump every 256 requests. Does anyone know why this happens?

[Figure: llama370bcuda8 — per-request latency plot showing a latency jump every 256 requests]

Reply: Because the default scheduler batch size is 256.
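One way to check this explanation is to time batches of requests just below and just above the scheduler cap, and then vary the cap and see whether the jump period moves with it. Below is a minimal sketch, assuming vLLM's offline `LLM` entry point and the `max_num_seqs` engine argument (the reply above points to its default of 256); the model path, `tensor_parallel_size`, and prompt text are placeholders and not taken from the original post.

```python
# Minimal sketch (assumptions: vLLM's offline LLM entry point, the max_num_seqs
# engine argument, and a placeholder model path / tensor_parallel_size — none of
# these come from the original post).
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",  # placeholder checkpoint
    tensor_parallel_size=8,               # adjust to the actual GPU count
    max_num_seqs=256,                     # scheduler batch cap; try 128 or 512 and re-run
)

params = SamplingParams(max_tokens=1)

# Time generate() for request counts just below and just above the cap.
# Crossing the cap forces an extra scheduler step, so the wall-clock time
# should jump between 256 and 257 requests (and again between 512 and 513).
for n in (255, 256, 257, 512, 513):
    prompts = ["Hi"] * n  # short prompts, approximating the 1-token inputs
    start = time.perf_counter()
    llm.generate(prompts, params)
    print(f"{n:4d} requests: {time.perf_counter() - start:.3f} s")
```

If the jump period tracks `max_num_seqs` (for example, a spike every 128 requests after setting `max_num_seqs=128`), that would confirm the scheduler batch cap as the cause of the latency jumps.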