How to ensure that it consumes all "Pending" requests so that it doesn't launch a process without using all of "--max-num-seqs"? #415

Lucidology · 2024-04-17T03:35:32Z

Lucidology
Apr 17, 2024

For example, I am invoking it with:

python3 -m aphrodite.endpoints.openai.api_server --model /root/neuralbeagle14-7b.Q4_K_M.gguf --host 0.0.0.0 --port 5000 --dtype=half --max-length 1000 --max-model-len 1000 --gpu-memory-utilization .75 --max-num-seqs 64

I send many requests to it and I notice it will sometimes use all of the "64 reqs" capacity and other times it starts the batch without fully populating itself. It leaves many requests from the "Pending" list untouched and this results in sub-optimal throughput:

INFO:     Avg prompt throughput: 13.3 tokens/s, Avg generation throughput: 96.9 tokens/s, Running: 64 reqs, Swapped: 0 reqs,
Pending: 81 reqs, GPU KV cache usage: 95.5%, CPU KV cache usage: 0.0%

INFO:     Avg prompt throughput: 14.5 tokens/s, Avg generation throughput: 52.2 tokens/s, Running: 17 reqs, Swapped: 0 reqs,
Pending: 94 reqs, GPU KV cache usage: 32.8%, CPU KV cache usage: 0.0%

INFO:     Avg prompt throughput: 36.8 tokens/s, Avg generation throughput: 41.9 tokens/s, Running: 38 reqs, Swapped: 0 reqs,
Pending: 50 reqs, GPU KV cache usage: 56.7%, CPU KV cache usage: 0.0%

INFO:     Avg prompt throughput: 23.0 tokens/s, Avg generation throughput: 81.6 tokens/s, Running: 64 reqs, Swapped: 0 reqs,
Pending: 17 reqs, GPU KV cache usage: 95.5%, CPU KV cache usage: 0.0%

INFO:     Avg prompt throughput: 12.3 tokens/s, Avg generation throughput: 64.8 tokens/s, Running: 22 reqs, Swapped: 0 reqs,
Pending: 27 reqs, GPU KV cache usage: 32.8%, CPU KV cache usage: 0.0%

sgsdxzy · 2024-04-17T03:57:43Z

sgsdxzy
Apr 17, 2024
Collaborator

Increase your --max-model-len or max-num-batched-tokens, or (experimental) enable --chunked-prefill

0 replies

Lucidology · 2024-04-27T05:40:23Z

Lucidology
Apr 27, 2024
Author

How can I launch a single batch of a defined size and have that run rather than using the Scheduler code? Ideally I would like to do the usual model.generate(promptList) to do batched inference.

I don't see an easy way to run code like that. I know that my GPU can do 256 prompts at a time I have tried using the openai style API to send 256 requests to it, but the way the Scheduler works it starts processing it instantly which causes the throughput to go way down. The requests are handled in smaller batches with a very long tail rather than as a single unit.

I have traced through the Scheduler and engine.step() code and it is very complex. What I'd like to do is have a throttle stage that looks like this:

On new request:
 is the request buffer full? y: -> flush requests and continue
 has it been more than 5 seconds since the last request was added? y: -> flush requests and continue
 no to both: add request to buffer and continue

This would not be hard to add but it's not clear where it would go in the existing code?

0 replies

Lucidology · 2024-04-27T16:51:15Z

Lucidology
Apr 27, 2024
Author

I think it's possible to speed up throughput maybe by about 4x

It looks to me like the Scheduler is a little over-active at moving running reqs back to "pending", which reduces throughput a lot. Here's a graph of: running res, pending reqs and tok/s

This gives:

Avg tok/s 137
Avg Running Reqs 57
Peak tok/s 485
Peak Running Reqs 256

So if it were to run at peak the entire time, like it should if it was all done in a single batch, the speed up would be:

Potential tok/s gain 353%
Potential running reqs gain 448%

You can see in the chart that it starts out getting up to all 256 reqs and then the scheduler starts swapping it out back to pending.

It shows in the logs like:

tokens/s, Running: 256 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
96.0%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 427.4
tokens/s, Running: 237 reqs, Swapped: 0 reqs, Pending: 19 reqs, GPU KV cache usage:
100.0%, CPU KV cache usage: 0.0%

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to ensure that it consumes all "Pending" requests so that it doesn't launch a process without using all of "--max-num-seqs"? #415

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

How to ensure that it consumes all "Pending" requests so that it doesn't launch a process without using all of "--max-num-seqs"? #415

Lucidology Apr 17, 2024

Replies: 3 comments

sgsdxzy Apr 17, 2024 Collaborator

Lucidology Apr 27, 2024 Author

Lucidology Apr 27, 2024 Author

Lucidology
Apr 17, 2024

sgsdxzy
Apr 17, 2024
Collaborator

Lucidology
Apr 27, 2024
Author

Lucidology
Apr 27, 2024
Author