[Core] Optimizing cross-attention `QKVParallelLinear` computation #12325

NickLucche · 2025-01-22T18:01:06Z

TL;DR: Basically another take at #7448 based on the work on the Whisper model, with sugar on top to provide a drop-in replacement module.

Addressing TODOs https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/bart.py#L352 and https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/mllama.py#L750.

Current cross-attention QKV projection is sub-optimal as we're wasting cycles on bigger-than-necessary matrices, especially important in the compute-bound stage. That is because QKVParallellLinear layers are being used to only compute the q and kv projection, separately in two sequential calls.

I propose adopting the solution we make use of here https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/whisper.py#L173, where q\kv are being split into a ColumnParallelLinear and QKVParallelLinear layer, respectively, instantiating and sharding only the matrices we actually make use of. Support of tensor parallelism should be unscathed.

I also provide a drop-in replacement util layer QKVCrossParallellLinear to use in substitution of QKVParallellLinear layers such that loading code remains the same, especially the usual stacked_params_mapping.

==>Let me know what you think about the util Module interface/API, otherwise I can just substitute in its optimized code inline.

Early benchmarking results (single L4 24gb, running facebook/bart-large-cnn):

PRE-PR b197a5cc

============ Serving Benchmark Result ============
Successful requests:                     12        
Benchmark duration (s):                  3.64      
Total input tokens:                      790       
Total generated tokens:                  374       
Request throughput (req/s):              3.29      
Output token throughput (tok/s):         102.67    
Total Token throughput (tok/s):          319.55    
---------------Time to First Token----------------
Mean TTFT (ms):                          73.88     
Median TTFT (ms):                        69.12     
P99 TTFT (ms):                           129.42    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.50     
Median TPOT (ms):                        9.94      
P99 TPOT (ms):                           14.84     
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.93      
Median ITL (ms):                         8.70      
P99 ITL (ms):                            10.50     
==================================================

POST-PR

============ Serving Benchmark Result ============
Successful requests:                     12        
Benchmark duration (s):                  3.62      
Total input tokens:                      790       
Total generated tokens:                  374       
Request throughput (req/s):              3.31      
Output token throughput (tok/s):         103.29    
Total Token throughput (tok/s):          321.46    
---------------Time to First Token----------------
Mean TTFT (ms):                          74.35     
Median TTFT (ms):                        69.84     
P99 TTFT (ms):                           129.07    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.37     
Median TPOT (ms):                        9.81      
P99 TPOT (ms):                           14.53     
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.84      
Median ITL (ms):                         8.75      
P99 ITL (ms):                            9.85      
==================================================

TODO:

Document QKVCrossParallellLinear both in code and docs in "how to add model"
Replace in other encoder decoder models ([Model] Add T5 model (2/2) #11901?)

Signed-off-by: NickLucche <[email protected]>

github-actions · 2025-01-22T18:01:17Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

Signed-off-by: NickLucche <[email protected]>

NickLucche added 4 commits January 21, 2025 19:32

first draft

23f7e9c

Signed-off-by: NickLucche <[email protected]>

cleanup

0433fe6

Signed-off-by: NickLucche <[email protected]>

submodules in dict to avoid param registration

fbda81f

Signed-off-by: NickLucche <[email protected]>

mllama test

2dc291e

Signed-off-by: NickLucche <[email protected]>

format

a2af89e

Signed-off-by: NickLucche <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Core] Optimizing cross-attention `QKVParallelLinear` computation #12325

[Core] Optimizing cross-attention `QKVParallelLinear` computation #12325

NickLucche commented Jan 22, 2025 •

edited by github-actions bot

Loading

github-actions bot commented Jan 22, 2025

[Core] Optimizing cross-attention QKVParallelLinear computation #12325

Are you sure you want to change the base?

[Core] Optimizing cross-attention QKVParallelLinear computation #12325

Conversation

NickLucche commented Jan 22, 2025 • edited by github-actions bot Loading

github-actions bot commented Jan 22, 2025

[Core] Optimizing cross-attention `QKVParallelLinear` computation #12325

[Core] Optimizing cross-attention `QKVParallelLinear` computation #12325

NickLucche commented Jan 22, 2025 •

edited by github-actions bot

Loading