
FLOP counting for vLLM inference #12341

Draft: wants to merge 8 commits into main

Conversation

@dianastea commented Jan 23, 2025

PR Description:
Added functionality to count theoretical FLOPs for a vLLM inference run. The counter is a context manager built on TorchDispatchMode: every aten operation executed while the LLM runs is intercepted via __torch_dispatch__, and its FLOPs are computed from the shapes of the input tensors.

This FLOP counter is heavily based on the code from https://dev-discuss.pytorch.org/t/the-ideal-pytorch-flop-counter-with-torch-dispatch/505 and adds FLOP counting for operations relevant to LLM inference, namely the following aten operations (a minimal sketch of the dispatch mechanism follows the list):

  • softmax / log_softmax
  • addmm / mm / matmul / bmm
  • attention (vllm.unified_attention)
  • native_layer_norm
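
For readers unfamiliar with the mechanism, here is a minimal sketch (an illustrative simplification, not the PR's implementation) of how a TorchDispatchMode intercepts aten calls so FLOPs can be derived from input shapes; the SimpleFlopCounter name and the mm-only coverage are assumptions made for brevity:

from collections import defaultdict
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class SimpleFlopCounter(TorchDispatchMode):
    def __init__(self):
        self.counts = defaultdict(int)

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        out = func(*args, **kwargs)
        # aten.mm: (m x k) @ (k x n) costs roughly 2 * m * k * n FLOPs
        if func is torch.ops.aten.mm.default:
            m, k = args[0].shape
            _, n = args[1].shape
            self.counts["aten.mm"] += 2 * m * k * n
        return out

with SimpleFlopCounter() as counter:
    torch.randn(64, 128) @ torch.randn(128, 256)
print(dict(counter.counts))  # {'aten.mm': 4194304}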

Example Usage:
Note that the model must be constructed inside the scope of the context manager.

from vllm import LLM
from vllm.flop_utils import FlopContextManager
with FlopContextManager():
    llm = LLM("facebook/opt-125m") 
    llm.generate("Hello how are you?") 

Example output from the snippet above:
Upon exiting the context manager, the number of GFLOPs per module/sub-module is printed. (A sketch of how per-module attribution can be implemented follows the sample output.)

Total: 2.409247744 GFLOPS
Module:  Global
aten.native_layer_norm: 0.0016128 GFLOPS
aten.addmm: 1.783627776 GFLOPS
vllm.unified_attention: 0.001880064 GFLOPS
aten.relu: 0.000774144 GFLOPS
aten.mm: 0.618135552 GFLOPS
aten._softmax: 0.001608704 GFLOPS
aten._log_softmax: 0.001608704 GFLOPS

Module:  OPTForCausalLM
aten.native_layer_norm: 0.0016128 GFLOPS
aten.addmm: 1.783627776 GFLOPS
vllm.unified_attention: 0.001880064 GFLOPS
aten.relu: 0.000774144 GFLOPS

Module:  OPTDecoder
aten.native_layer_norm: 0.0016128 GFLOPS
aten.addmm: 1.783627776 GFLOPS
vllm.unified_attention: 0.001880064 GFLOPS
aten.relu: 0.000774144 GFLOPS

Module:  OPTDecoderLayer
aten.native_layer_norm: 0.001548288 GFLOPS
aten.addmm: 1.783627776 GFLOPS
vllm.unified_attention: 0.001880064 GFLOPS
aten.relu: 0.000774144 GFLOPS
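
The per-module attribution above presumably follows the referenced dev-discuss counter: forward pre-/post-hooks maintain a stack of the currently active modules, and each intercepted op is charged to every module on the stack. A minimal sketch of that idea, using hypothetical names (module_stack, push_module, pop_module), not the PR's code:

import torch

module_stack = ["Global"]  # ops are always charged at least to the global scope

def push_module(module, args):
    module_stack.append(type(module).__name__)

def pop_module(module, args, output):
    module_stack.pop()

def register_hooks(model: torch.nn.Module):
    # while a module's forward runs, its name is on the stack, so a
    # dispatch-mode counter can attribute FLOPs to every enclosing module
    for m in model.modules():
        m.register_forward_pre_hook(push_module)
        m.register_forward_hook(pop_module)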

Limitations:

  • FLOPs are computed manually for each aten operation from the input tensor shapes, so the counts are purely theoretical; validating them against hardware profiling tools needs more research. It also means that every operation must be covered by its own counting function (two illustrative shape-based formulas follow below).
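
To illustrate what "an individual function per operation" looks like, here are two hypothetical shape-based formulas (assumptions for illustration, not the PR's exact counting code):

def addmm_flops(bias_shape, a_shape, b_shape):
    # bias + (m x k) @ (k x n): 2*m*k*n for the matmul plus m*n for the bias add
    m, k = a_shape
    _, n = b_shape
    return 2 * m * k * n + m * n

def softmax_flops(input_shape):
    # roughly one exp, one add (for the running sum), and one divide per element
    numel = 1
    for d in input_shape:
        numel *= d
    return 3 * numel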

Issue:
Resolves #3490


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@dianastea changed the title from Flops to FLOP counting for vLLM inference on Jan 24, 2025