
FLOP counting for vLLM inference #12341

Draft: wants to merge 8 commits into main

Conversation

@dianastea commented Jan 23, 2025

PR Description:
Added functionality to count theoretical FLOPs for a vLLM inference run. The counter is a context manager built on TorchDispatchMode: every aten operation executed while the LLM runs is intercepted via __torch_dispatch__, and its FLOPs are computed from the shapes of the input tensors.

This FLOP counter is heavily based on the code from https://dev-discuss.pytorch.org/t/the-ideal-pytorch-flop-counter-with-torch-dispatch/505 and adds FLOP counting for operations relevant to LLM inference, namely the following aten operations (a minimal sketch of the dispatch mechanism follows the list):

  • softmax / log_softmax
  • addmm / mm / matmul / bmm
  • attention (vllm.unified_attention)
  • native_layer_norm
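
For readers unfamiliar with the mechanism, here is a minimal sketch (an illustrative simplification, not the PR's implementation) of how a TorchDispatchMode intercepts aten calls so FLOPs can be derived from input shapes; the SimpleFlopCounter name and the mm-only coverage are assumptions made for brevity:

from collections import defaultdict
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class SimpleFlopCounter(TorchDispatchMode):
    def __init__(self):
        self.counts = defaultdict(int)

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        out = func(*args, **kwargs)
        # aten.mm: (m x k) @ (k x n) costs roughly 2 * m * k * n FLOPs
        if func is torch.ops.aten.mm.default:
            m, k = args[0].shape
            _, n = args[1].shape
            self.counts["aten.mm"] += 2 * m * k * n
        return out

with SimpleFlopCounter() as counter:
    torch.randn(64, 128) @ torch.randn(128, 256)
print(dict(counter.counts))  # {'aten.mm': 4194304}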

Example Usage:
Note that the model must be constructed inside the scope of the context manager.

from vllm import LLM
from vllm.flop_utils import FlopContextManager
with FlopContextManager():
    llm = LLM("facebook/opt-125m") 
    llm.generate("Hello how are you?") 

Example output from the snippet above:
Upon exiting the context manager, the number of GFLOPs per module/sub-module is printed. (A sketch of how per-module attribution can be implemented follows the sample output.)

Total: 2.409247744 GFLOPS
Module:  Global
aten.native_layer_norm: 0.0016128 GFLOPS
aten.addmm: 1.783627776 GFLOPS
vllm.unified_attention: 0.001880064 GFLOPS
aten.relu: 0.000774144 GFLOPS
aten.mm: 0.618135552 GFLOPS
aten._softmax: 0.001608704 GFLOPS
aten._log_softmax: 0.001608704 GFLOPS

Module:  OPTForCausalLM
aten.native_layer_norm: 0.0016128 GFLOPS
aten.addmm: 1.783627776 GFLOPS
vllm.unified_attention: 0.001880064 GFLOPS
aten.relu: 0.000774144 GFLOPS

Module:  OPTDecoder
aten.native_layer_norm: 0.0016128 GFLOPS
aten.addmm: 1.783627776 GFLOPS
vllm.unified_attention: 0.001880064 GFLOPS
aten.relu: 0.000774144 GFLOPS

Module:  OPTDecoderLayer
aten.native_layer_norm: 0.001548288 GFLOPS
aten.addmm: 1.783627776 GFLOPS
vllm.unified_attention: 0.001880064 GFLOPS
aten.relu: 0.000774144 GFLOPS
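
The per-module attribution above presumably follows the referenced dev-discuss counter: forward pre-/post-hooks maintain a stack of the currently active modules, and each intercepted op is charged to every module on the stack. A minimal sketch of that idea, using hypothetical names (module_stack, push_module, pop_module), not the PR's code:

import torch

module_stack = ["Global"]  # ops are always charged at least to the global scope

def push_module(module, args):
    module_stack.append(type(module).__name__)

def pop_module(module, args, output):
    module_stack.pop()

def register_hooks(model: torch.nn.Module):
    # while a module's forward runs, its name is on the stack, so a
    # dispatch-mode counter can attribute FLOPs to every enclosing module
    for m in model.modules():
        m.register_forward_pre_hook(push_module)
        m.register_forward_hook(pop_module)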

Limitations:

  • FLOPs are computed manually for each aten operation from the input tensor shapes, so the counts are purely theoretical; validating them against hardware profiling tools needs more research. It also means that every operation must be covered by its own counting function (two illustrative shape-based formulas follow below).
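
To illustrate what "an individual function per operation" looks like, here are two hypothetical shape-based formulas (assumptions for illustration, not the PR's exact counting code):

def addmm_flops(bias_shape, a_shape, b_shape):
    # bias + (m x k) @ (k x n): 2*m*k*n for the matmul plus m*n for the bias add
    m, k = a_shape
    _, n = b_shape
    return 2 * m * k * n + m * n

def softmax_flops(input_shape):
    # roughly one exp, one add (for the running sum), and one divide per element
    numel = 1
    for d in input_shape:
        numel *= d
    return 3 * numel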

Issue:
Resolves #3490


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@dianastea changed the title from Flops to FLOP counting for vLLM inference on Jan 24, 2025