How to ensure that it consumes all "Pending" requests so that it doesn't launch a process without using all of "--max-num-seqs"? #415
Replies: 3 comments
---
Increase your …
---
How can I launch a single batch of a defined size and have that run, rather than going through the Scheduler code? Ideally I would like to do the usual batched generation call, but I don't see an easy way to run code like that. I know that my GPU can do 256 prompts at a time. I have tried using the OpenAI-style API to send 256 requests to it, but the way the Scheduler works, it starts processing them instantly, which causes the throughput to go way down. The requests are handled in smaller batches with a very long tail, rather than as a single unit. I have traced through the Scheduler and engine.step() code and it is very complex. What I'd like to do is add a throttle stage, along the lines of the sketch below:
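A minimal sketch of such a throttle, assuming a vLLM-style engine object (the `engine.add_request()` call mirrors that API; the `BatchThrottle` class and its parameters are hypothetical, not existing Aphrodite code):

```python
import time

class BatchThrottle:
    """Buffer incoming requests and only hand them to the engine once a full
    batch has accumulated (or a timeout expires), so the scheduler starts
    with all --max-num-seqs slots filled at once."""

    def __init__(self, engine, batch_size=256, max_wait_s=5.0):
        self.engine = engine
        self.batch_size = batch_size
        self.max_wait_s = max_wait_s
        self.buffer = []            # (request_id, prompt, sampling_params)
        self.first_arrival = None

    def submit(self, request_id, prompt, sampling_params):
        # Queue the request instead of forwarding it to the engine immediately.
        if not self.buffer:
            self.first_arrival = time.monotonic()
        self.buffer.append((request_id, prompt, sampling_params))
        self._maybe_flush()

    def _maybe_flush(self):
        full = len(self.buffer) >= self.batch_size
        timed_out = (self.first_arrival is not None and
                     time.monotonic() - self.first_arrival >= self.max_wait_s)
        if full or timed_out:
            # Release the whole batch at once: only now does the scheduler
            # see the requests, so it can launch them as a single unit.
            for request_id, prompt, params in self.buffer:
                self.engine.add_request(request_id, prompt, params)
            self.buffer.clear()
            self.first_arrival = None
```

The timeout keeps a partially filled buffer from waiting forever when fewer than `batch_size` requests arrive.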
This would not be hard to add, but it's not clear where it would go in the existing code.
---
I think it's possible to speed up throughput by maybe about 4x. It looks to me like the Scheduler is a little over-active at moving running reqs back to "Pending", which reduces throughput a lot. I graphed running reqs, pending reqs, and tok/s over the run:

[chart: running reqs, pending reqs, and tok/s over time]

Avg tok/s: 137
Potential tok/s gain: 353%

So if it ran at peak the entire time, as it should if it were all done in a single batch, the speedup would be roughly that 353%. You can see in the chart that it starts out getting all 256 reqs running, and then the scheduler starts swapping them back to "Pending". It shows in the logs like:

`... tokens/s, Running: 256 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: ...`
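For anyone who wants to reproduce the chart, a quick way is to scrape the engine's periodic stats lines. A minimal sketch, assuming the format matches the log fragment quoted above (the exact wording of the stats line may differ between builds, so adjust the regex as needed):

```python
import re
import sys

# Pulls (tok/s, running, swapped, pending) out of each scheduler stats line.
# Pattern is based on the log fragment quoted above.
STATS = re.compile(
    r"([\d.]+) tokens/s, Running: (\d+) reqs, "
    r"Swapped: (\d+) reqs, Pending: (\d+) reqs"
)

def parse_stats(log_path):
    samples = []
    with open(log_path) as f:
        for line in f:
            m = STATS.search(line)
            if m:
                tok_s, running, swapped, pending = m.groups()
                samples.append((float(tok_s), int(running),
                                int(swapped), int(pending)))
    return samples

if __name__ == "__main__":
    samples = parse_stats(sys.argv[1])
    if not samples:
        sys.exit("no stats lines found")
    rates = [s[0] for s in samples]
    avg, peak = sum(rates) / len(rates), max(rates)
    # Gain if the engine held its peak rate for the whole run.
    print(f"avg {avg:.0f} tok/s, peak {peak:.0f} tok/s, "
          f"potential gain {100 * (peak - avg) / avg:.0f}%")
```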
---
For example, I am invoking it with:
```
python3 -m aphrodite.endpoints.openai.api_server \
    --model /root/neuralbeagle14-7b.Q4_K_M.gguf \
    --host 0.0.0.0 --port 5000 \
    --dtype=half \
    --max-length 1000 --max-model-len 1000 \
    --gpu-memory-utilization .75 \
    --max-num-seqs 64
```
I send many requests to it, and I notice it will sometimes use all of the 64-req capacity, while other times it starts a batch without fully populating it: it leaves many requests on the "Pending" list untouched, which results in sub-optimal throughput.
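For anyone who wants to reproduce this, here is a minimal client sketch that fires all 64 requests concurrently at the OpenAI-compatible endpoint. It assumes the standard `/v1/completions` route and payload shape; the model path and prompt are placeholders:

```python
import concurrent.futures
import requests

URL = "http://0.0.0.0:5000/v1/completions"  # matches the --host/--port above

def one_request(i):
    # Standard OpenAI-style completion payload; model and prompt are placeholders.
    payload = {
        "model": "/root/neuralbeagle14-7b.Q4_K_M.gguf",
        "prompt": f"Request {i}: write a short sentence.",
        "max_tokens": 200,
    }
    return requests.post(URL, json=payload, timeout=600).json()

# Fire all 64 requests at once so the server's "Pending" queue fills up
# before the first batch is scheduled.
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(one_request, range(64)))

print(f"got {len(results)} completions")
```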