
[Core]: (Last/N) Support prefill only models by Workflow Defined Engine #8964

Draft
wants to merge 13 commits into main

Conversation

@noooop (Contributor) commented Sep 30, 2024

FIX #8453

PTAL #8452 #8779

This is the first part of [Support prefill only models by Workflow Defined Engine]

The main goal is to support the BERT model.

Includes the following parts:

  • Code organization: modules are organized function-first. If there are new modules, you can create a new folder under the corresponding module, e.g.
    • Reranker model input pairs: new modules can be organized as follows
inputs/
  data.py
  parse.py
  preprocess.py
  registry.py
  reranker/
    data.py
    preprocess.py

  • [Core]: (2/N) Support prefill only models by Workflow Defined Engine - Prefill only attention #9124
  • Prefill only attention
    • To support bidirectional attention, the entire program uses the same AttnBackend instance. This way you can use attn_type to switch the AttentionType of every attention backend (see the sketch below).
    • Attention backends now supported:
      • Flash Attention Backend
      • Torch SDPA Backend
      • XFormers Backend
      • FlashInfer Backend (because prefill only models do not involve the KV cache, using the FlashInfer backend with prefill only models actually falls back to the Flash Attention backend)
      • Torch naive backend (as a control group)
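Below is a minimal sketch of this idea, using hypothetical names (PrefillOnlyTorchNaiveBackend, AttentionType with ENCODER/DECODER values); it is not the actual vLLM attention interface, only an illustration of how one shared backend instance with an attn_type switch could serve both causal and bidirectional (BERT-style) attention:

from enum import Enum, auto

import torch


class AttentionType(Enum):
    DECODER = auto()  # causal, decoder-style attention
    ENCODER = auto()  # bidirectional, encoder-style attention (e.g. BERT)


class PrefillOnlyTorchNaiveBackend:
    """One instance is created per engine and passed to every attention layer,
    so switching attn_type switches the behavior of the whole model."""

    def __init__(self, attn_type: AttentionType):
        self.attn_type = attn_type

    def forward(self, q, k, v):
        # q, k, v: [num_tokens, num_heads, head_dim] for one sequence;
        # prefill only models never touch a KV cache, so this is all we need.
        scale = q.shape[-1] ** -0.5
        scores = torch.einsum("qhd,khd->hqk", q, k) * scale
        if self.attn_type == AttentionType.DECODER:
            # mask out positions after the query token (causal attention)
            mask = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool,
                                         device=scores.device), diagonal=1)
            scores = scores.masked_fill(mask, float("-inf"))
        probs = torch.softmax(scores, dim=-1)
        return torch.einsum("hqk,khd->qhd", probs, v)


backend = PrefillOnlyTorchNaiveBackend(AttentionType.ENCODER)  # bidirectional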

  • [Core]: (1/N) Support prefill only models by Workflow Defined Engine - Prefill only scheduler #9181

  • InputProcessor: Input(request_id, inputs, params, arrival_time) -> InputProcessor -> Request

    • Like other text input models, bert supports str, Dict, TextPrompt, TokensPrompt, TextOnlyInputs
    • InputProcessor performs simple validation of the input but does not call tokenizer.encode, to avoid getting stuck when a large number of requests are submitted at the same time.
  • RequestProcessor: Request -> RequestProcessor -> SchedulableRequest

    • RequestProcessor is a callback of the scheduler. It is only called when scheduling is needed, and that is where tokenizer.encode is executed.
  • Scheduler

    • A simple first-come-first-served scheduler that uses batched_tokens and num_requests to control concurrency (see the sketch below)
    • For prefill only models there are no dependencies between tasks, so both sync and async scheduling are supported
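Below is a rough sketch of this flow under assumed names (Request, SchedulableRequest and PrefillOnlyFCFSScheduler are hypothetical, not existing vLLM classes); it only illustrates the FCFS batching limits and the lazy tokenization described above:

from collections import deque
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Request:
    request_id: str
    prompt: str            # still raw text; not tokenized at submission time


@dataclass
class SchedulableRequest:
    request_id: str
    token_ids: List[int]   # produced lazily by the RequestProcessor


class PrefillOnlyFCFSScheduler:
    """First-come-first-served scheduling for prefill only models.

    tokenizer.encode only happens inside schedule() (via request_processor),
    so submitting many requests at once never blocks on tokenization."""

    def __init__(self, request_processor: Callable[[Request], SchedulableRequest],
                 max_num_batched_tokens: int, max_num_requests: int):
        self.request_processor = request_processor
        self.max_num_batched_tokens = max_num_batched_tokens
        self.max_num_requests = max_num_requests
        self.waiting = deque()

    def add_request(self, request: Request):
        self.waiting.append(request)          # cheap: no tokenization here

    def schedule(self) -> List[SchedulableRequest]:
        batch, num_tokens = [], 0
        while self.waiting and len(batch) < self.max_num_requests:
            request = self.waiting.popleft()
            req = self.request_processor(request)       # lazy tokenizer.encode
            # an oversized first request is still scheduled alone
            if batch and num_tokens + len(req.token_ids) > self.max_num_batched_tokens:
                self.waiting.appendleft(request)         # defer to the next step
                break
            batch.append(req)
            num_tokens += len(req.token_ids)
        return batch


# toy usage with a fake "tokenizer" that just splits on whitespace
scheduler = PrefillOnlyFCFSScheduler(
    request_processor=lambda r: SchedulableRequest(r.request_id, r.prompt.split()),
    max_num_batched_tokens=32,
    max_num_requests=4)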

[Core]: (3/N) Support prefill only models by Workflow Defined Engine - new models

  • loader
    • Because an AttnBackend instance needs to be passed into the model, the loader needs to be slightly modified.
  • modelzoo and ModelRegistry
    • Because workflow parameters need to be supported, modelzoo and ModelRegistry need to be slightly modified.
  • new models
    • bert

  • [Core]: (Last/N) Support prefill only models by Workflow Defined Engine #8964
  • ModelInputBuilder
    • Just calculates seq_lens and calls the attention_metadata_builder, converting seq_lens to seq_start_loc with torch.cumsum (see the sketch below)
  • Executor
    • Supports sync and async executors. In the async executor, data enters through the executor_in queue and leaves through the executor_out queue
  • Worker & Runner
    • Temporarily needs a FakeGroupCoordinator to patch the distributed environment (parallel_state._TP, parallel_state._PP)
  • LLMEngine
    • Workflow Defined Engine: modules are loaded on demand according to the workflow
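A minimal sketch of the seq_lens -> seq_start_loc conversion mentioned under ModelInputBuilder above (tensor names follow the description; the surrounding builder class and the attention_metadata_builder call are omitted):

import torch

# lengths of the sequences in one scheduled prefill-only batch
seq_lens = torch.tensor([5, 3, 7], dtype=torch.int32)

# seq_start_loc marks where each sequence starts in the flattened token buffer:
# prepend 0 and take the cumulative sum -> tensor([0, 5, 8, 15])
seq_start_loc = torch.cat([
    torch.zeros(1, dtype=torch.int32),
    torch.cumsum(seq_lens, dim=0, dtype=torch.int32),
])

# these values are what the backend's attention_metadata_builder consumes
print(seq_start_loc)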

Modifications required to the vLLM code (not yet submitted):

  • MQLLMEngine
    • Use the wde ModelRegistry to determine whether the current model is supported. If so, use the wde Engine.
  • entrypoints
    • There may be some compatibility issues that require slight modifications

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337 (Member) commented Sep 30, 2024

For the PR that is going to be merged, can you please modify the existing vLLM code instead of transferring them into a completely new namespace/directory? Otherwise, it's difficult to see how the existing code needs to be changed. If the implementation is completely different, you can create a file in the existing directory but with wde suffix, similar to how we have block manager v2 right now.

@noooop (Contributor, Author) commented Sep 30, 2024

Right now the vllm project is divided by function: modules with the same function are put together (all executors together, all workers together). In practice this is very painful to maintain. According to Conway's Law, systems should be built around business capabilities.

Although I really want more people to discuss the wde architecture, it seems that no one is very interested.

@noooop (Contributor, Author) commented Sep 30, 2024

For example:
you have a set of tools for painting watercolors: pens and paper;
you have a set of tools for painting oils: pens and paper;
and you also have pens and paper for daily use.

If you divide by function, putting all the pens together, all the paper together and all the paints together, the result is very inconvenient to use.

If you divide by usage scenario instead, it is obviously much easier to use and maintain.

This is how the wde modules are divided.

@noooop (Contributor, Author) commented Sep 30, 2024

Let's discuss it for a while, maybe there is a better way.

@noooop (Contributor, Author) commented Sep 30, 2024

Let's focus on more specific issues.

InputProcessor

Reranker models take input pairs. With wde, I only need to switch to RerankerInputProcessor, plug and play. So where should I put this piece of code?

https://github.com/noooop/vllm/blob/wde_encode_only/vllm/wde/reranker/

@DarkLight1337 (Member) commented Oct 1, 2024

Although I really want more people to discuss the wde architecture, it seems that no one is very interested.

Even if your architecture is perfect (this is never the case, I keep throwing away my old code 😅), there is an important practical issue to consider, which is how to adopt this without breaking the guarantees currently provided by vLLM. Keep in mind that this is not a personal project, so you don't have the freedom to change everything in one commit - rather, hundreds of devs are working on this project at the same time. If you can't solve this issue, no one will be willing to jump to completely new code. So, I suggest keeping the original directory layout where possible.

Reranker model input pairs. If using wde, I only need to switch to RerankerInputProcessor, plug and play. So where should I put this piece of code?

Currently, our code is organized function-first, capability-second, as you have said. But, there's nothing preventing you from adding capability-specific files into function-specific directories for now.

@noooop (Contributor, Author) commented Oct 1, 2024

Can we start reviewing the code as it is laid out now?

Specifically, please review the modules in the order I listed them, from the top down; my code is also submitted in this order.

Focusing on the parts where I have slightly modified vllm modules: should I keep both files, or apply the changes to the original files?

Once the review is complete, I will change it to a function-first organization.

@DarkLight1337 (Member) commented Oct 1, 2024

Focusing on the parts where I have slightly modified vllm modules: should I keep both files, or apply the changes to the original files?

Please change the original files. It's difficult to see the diffs otherwise. Also this makes sure that your implementation can still work with existing vLLM code.

@DarkLight1337 (Member)

Ideally we should be able to run our existing tests on your code to verify its correctness.

@noooop (Contributor, Author) commented Oct 1, 2024

  • InputProcessor: Input(request_id, inputs, params, arrival_time) -> InputProcessor -> Request
    • Like other text input models, bert supports str, Dict, TextPrompt, TokensPrompt, TextOnlyInputs
    • InputProcessor performs simple validation of the input but does not call tokenizer.encode, to avoid getting stuck when a large number of requests are submitted at the same time.

82197c8

Can I put this code in the inputs folder?

https://github.com/vllm-project/vllm/tree/main/vllm/inputs

@DarkLight1337 (Member)

InputProcessor: Input(request_id, inputs, params, arrival_time) -> InputProcessor -> Request
Can I put this code in the inputs folder?

Yes, please do so.

@noooop (Contributor, Author) commented Oct 1, 2024

@DarkLight1337

The role of wde/core/schema/engine_io.py in wde is the same as that of inputs.data.

At the time of my first commit, you hadn't finished refactoring inputs.data.

I can change to using inputs.data,

but where should I put the reranker inputs?

  1. Modify inputs.data directly

  2. Place it under the inputs folder, no hierarchy

  • inputs
    • data.py
    • parse.py
    • preprocess.py
    • registry.py
    • data_reranker.py
    • preprocess_reranker.py
  3. Place it under the inputs/reranker folder.
  • inputs
    • data.py
    • parse.py
    • preprocess.py
    • registry.py
    • reranker
      • data.py
      • preprocess.py
  4. Put the previous code into a core or base folder
  • inputs
    • core or base
      • data.py
      • parse.py
      • preprocess.py
      • registry.py
    • reranker
      • data.py
      • preprocess.py
  5. Organized by business capabilities
  • inputs

    • data.py
    • parse.py
    • preprocess.py
    • registry.py
  • reranker

    • inputs
      • data.py
      • preprocess.py

I will discuss the code organization structure with you only this time. Other modules will be reorganized according to the results of this discussion.

@DarkLight1337 (Member) commented Oct 1, 2024

The role of wde/core/schema/engine_io.py in wde is the same as that of inputs.data.

At the time of my first commit, you hadn't finished refactoring inputs.data.

I can change to using inputs.data,

but where should I put the reranker inputs?

I prefer option 3. We start with something like this:

inputs/
  data.py
  parse.py
  preprocess.py
  registry.py
  reranker/
    data.py
    preprocess.py

Then in future PRs, we can gradually move the existing code, e.g.:

inputs/
  parse.py
  registry.py
  base/
    data.py
    preprocess.py
  decoder_only/
    data.py
    preprocess.py
  encoder_decoder/
    data.py
    preprocess.py
  reranker/
    data.py
    preprocess.py

To increase the chance of acceptance, I further recommend only doing this reorganization within one main module (in this case, vllm/inputs) at a time. This will be a long and painful process but at least your PRs may get merged. Eventually, our code structure may look like:

core/
    base/
    decoder_only/
    encoder_decoder/
    ...
engine/
    base/
    decoder_only/
    encoder_decoder/
    ...
inputs/
    base/
    decoder_only/
    encoder_decoder/
    ...

Only afterwards can we even consider making the switch from function-first to capability-first at the top level:

base/
    engine/
    core/
    inputs/
decoder_only/
    engine/
    core/
    inputs/
encoder_decoder/
    engine/
    core/
    inputs/
...

@noooop (Contributor, Author) commented Oct 1, 2024

@DarkLight1337

next

InputProcessor performs simple validation of the input but does not call tokenizer.encode, to avoid getting stuck when a large number of requests are submitted at the same time.
RequestProcessor is a callback of the scheduler. It is only called when scheduling is needed, and that is where tokenizer.encode is executed (lazy RequestProcessor).

In vllm/inputs/preprocess.py, validation and encoding are merged.

In wde, preprocessing is separated into these two parts.

Are you OK with that?

Of course, this process will only be used by the prefill only scheduler and will not modify the existing scheduler.
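Below is a rough sketch of the proposed two-stage split (the class names are hypothetical and this is not the existing InputPreprocessor API; it only shows where tokenizer.encode happens):

import time
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Request:
    request_id: str
    inputs: dict
    params: Any
    arrival_time: float


@dataclass
class SchedulableRequest:
    request_id: str
    token_ids: List[int]


class InputProcessor:
    """Runs at submission time: cheap validation only, no tokenization."""

    def __call__(self, request_id, inputs, params, arrival_time: Optional[float] = None):
        if isinstance(inputs, str):
            inputs = {"prompt": inputs}
        if "prompt" not in inputs and "prompt_token_ids" not in inputs:
            raise ValueError("prompt or prompt_token_ids must be provided")
        return Request(request_id, inputs, params, arrival_time or time.time())


class RequestProcessor:
    """Runs inside the scheduler, only for requests that are about to be
    scheduled, so tokenizer.encode never blocks the submission path."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, request: Request) -> SchedulableRequest:
        token_ids = request.inputs.get("prompt_token_ids")
        if token_ids is None:
            token_ids = self.tokenizer.encode(request.inputs["prompt"])
        return SchedulableRequest(request.request_id, token_ids)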

@DarkLight1337 (Member) commented Oct 1, 2024

@DarkLight1337

next

InputProcessor performs simple validation of the input but does not call tokenizer.encode, to avoid getting stuck when a large number of requests are submitted at the same time.
RequestProcessor is a callback of the scheduler. It is only called when scheduling is needed, and that is where tokenizer.encode is executed (lazy RequestProcessor).

In vllm/inputs/preprocess.py, validation and encoding are merged.

In wde, preprocessing is separated into these two parts.

Are you OK with that?

Of course, this process will only be used by the prefill only scheduler and will not modify the existing scheduler.

As long as this strategy works for other model types, I'm fine with it. (i.e. don't split them up only to find that it actually doesn't work e.g. for embedding models)

@noooop (Contributor, Author) commented Oct 1, 2024

I'm busy these days, so I'll submit new code on the 8th.

@noooop (Contributor, Author) commented Oct 1, 2024

@DarkLight1337

next

  • Prefill only attention
    • To support bidirectional attention, the entire program uses the same AttnBackend instance. This way you can use attn_type to switch the AttentionType of every attention backend.
    • Attention backends now supported:
      • Flash Attention Backend
      • Torch SDPA Backend
      • XFormers Backend
      • FlashInfer Backend (because prefill only models do not involve the KV cache, using the FlashInfer backend with prefill only models actually falls back to the Flash Attention backend)
      • Torch naive backend (as a control group)

250be85

If you have time, please review the Prefill only attention related code.

I will put Prefill only attention under the attention folder

attention
    prefill_only  <- new
    backends
    ops
    __init__.py
    layer.py
    selector.py

Are you OK with that?

@DarkLight1337 (Member) commented Oct 1, 2024

I'm busy these days, so I'll submit new code on the 8th.

No worries, take your time.

@DarkLight1337

next

  • Prefill only attention

    • To support bidirectional attention, the entire program uses the same AttnBackend instance. This way you can use attn_type to switch the AttentionType of every attention backend.

    • Attention backends now supported:

      • Flash Attention Backend
      • Torch SDPA Backend
      • XFormers Backend
      • FlashInfer Backend (because prefill only models do not involve the KV cache, using the FlashInfer backend with prefill only models actually falls back to the Flash Attention backend)
      • Torch naive backend (as a control group)

250be85

If you have time, please review the Prefill only attention related code.

I'm not familiar with the attention implementation, so you will have to ask someone else for this. This is another reason why we should split up the PR so each PR focuses on one module. Then we can have different people reviewing changes to each module.

@noooop (Contributor, Author) commented Oct 1, 2024

Can you help me find someone to take a look?

Prefill only attention has no decoding stage, so it is very simple.

@DarkLight1337 (Member)

@comaniac may be able to comment on the attention backend. But I suggest you split up the PR first.

@noooop changed the title from "[Core]: (1/N) Support prefill only models by Workflow Defined Engine" to "[Core]: (Last/N) Support prefill only models by Workflow Defined Engine" on Oct 7, 2024
@DarkLight1337 (Member)

So far it seems that you're just adding brand new files. Are you going to further refine those PRs so vLLM actually uses them?

@noooop (Contributor, Author) commented Oct 9, 2024

So far it seems that you're just adding brand new files. Are you going to further refine those PRs so vLLM actually uses them?

I organized the previous code function-first

@noooop (Contributor, Author) commented Oct 9, 2024

Do you have any suggestions for the approximate location of the new code?

@DarkLight1337 (Member) commented Oct 9, 2024

So far it seems that you're just adding brand new files. Are you going to further refine those PRs so vLLM actually uses them?

I organized the previous code function-first

I prefer you delete the old files that have been split up to the new locations. This will prevent the issue of forgetting to copy over the changes from more recent PRs since your PR was first created (by forcing you to handle merge conflicts).

@noooop (Contributor, Author) commented Oct 9, 2024

I prefer you delete the old files that have been split up to the new location. This will prevent the issue of forgetting to copy over the changes from more recent PRs since your PR was first created (by forcing you to handle merge conflicts).

I know. This PR is temporarily being used to discuss how to split things up, so I'm submitting all the code for now.

@DarkLight1337 (Member) commented Oct 9, 2024

So far it seems that you're just adding brand new files. Are you going to further refine those PRs so vLLM actually uses them?

I organized the previous code function-first

I prefer you delete the old files that have been split up to the new locations. This will prevent the issue of forgetting to copy over the changes from more recent PRs since your PR was first created (by forcing you to handle merge conflicts).

For example:

  1. You create a new TextInputProcessor which is based on the existing InputPreprocessor.
  2. The logic of InputPreprocessor is then changed by someone else in another PR.
  3. Your PR still works but doesn't include the changes by (2), resulting in a regression. Even though we have tests, they don't have 100% coverage so it's possible to miss this. The more files you create, the easier it is to miss such changes.

@noooop (Contributor, Author) commented Oct 9, 2024

For example:

  1. You create a new TextInputProcessor which is based on the existing InputPreprocessor.
  2. The logic of InputPreprocessor is then changed by someone else in another PR.
  3. Your PR still works but doesn't include the changes by (2). The more files you create, the easier it is to miss such changes.

I know. This PR is temporarily being used to discuss how to split things up, so I'm submitting all the code for now.

This PR definitely needs to be resubmitted

@DarkLight1337 (Member) commented Oct 9, 2024

Your PR still works but doesn't include the changes by (2). The more files you create, the easier it is to miss such changes.
I know. This PR is temporarily being used to discuss how to split things up, so I'm submitting all the code for now.

I find myself frequently having to switch over to another copy of the repo and find where your code originally comes from. It would ease the review process if you actually "move" the code.

@DarkLight1337 (Member)

Also, to avoid triggering CI, please mark those PRs with draft status.

@noooop (Contributor, Author) commented Oct 9, 2024

The current purpose of this PR is:

Let's first roughly put all the modules in the correct locations, and then try to split the PR.

We can discuss the specific code changes in #9124 and #9181.

Divide and conquer, coarse to fine.

I will manually keep both sides in sync with the latest updates.

@noooop marked this pull request as draft on October 9, 2024, 07:23
@noooop (Contributor, Author) commented Oct 9, 2024

I find myself frequently having to switch over to another copy of the repo and find where your code originally comes from. It would ease the review process if you actually "move" the code.

Sorry, I don't want to make it so complicated

Previously

@DarkLight1337 (Member)

Do you have any suggestions for the approximate location of the new code?

Honestly, I don't care too much about the location of the new files. The more important part is how to transition from the existing code to the new layout.

@DarkLight1337 (Member) commented Oct 9, 2024

I would rather you dive straight into the refactoring PRs one at a time instead of just partially showing the new layout. For looking at the new layout, I prefer the branch that has all the new files in one place so I can get the big picture.

@noooop (Contributor, Author) commented Oct 9, 2024

I would rather you dive straight into the refactoring PRs one at a time instead of just partially showing the new layout. For looking at the new layout, I prefer the branch that has all the new files in one place so I can get the big picture.

So how should we coordinate?

This is all the code needed to support the first model: bert.

@DarkLight1337 (Member) commented Oct 9, 2024

I would rather you dive straight into the refactoring PRs one at a time instead of just partially showing the new layout. For looking at the new layout, I prefer the branch that has all the new files in one place so I can get the big picture.

So how should we coordinate?

You can have two branches:

  • One branch with all the new files together (so, basically your original PR with wde/ directory, except that the files should be in the existing directories as we have discussed before)
  • One active branch which focuses on refactoring one module at a time. To avoid accumulating technical debt, I suggest you limit this to one module at a time (e.g. inputs) so you don't have to keep resolving merge conflicts.

@noooop (Contributor, Author) commented Oct 9, 2024

You can have two branches:

  • One branch with all the new files together (so, basically your original PR with wde/ directory, except that the files should be in the existing directories as we have discussed before)
    This PR (I've added back the previous wde code)

@noooop (Contributor, Author) commented Oct 9, 2024

@DarkLight1337

I've added back the previous wde code

Do you need me to split the wde code by module and submit it in multiple PRs?

@DarkLight1337 (Member) commented Oct 9, 2024

For viewing the layout, keep everything in the same PR (don't split it).

@noooop (Contributor, Author) commented Oct 9, 2024

You can have two branches:

  • One branch with all the new files together (so, basically your original PR with wde/ directory, except that the files should be in the existing directories as we have discussed before)
    This PR (I've added back the previous wde code)

Is everything okay now?

(Just a moment ago I accidentally pushed -f to #8452, but it is finally fixed. push -f is too dangerous.)

@DarkLight1337 (Member) commented Oct 9, 2024

It is better now. I don't think attention backends should be placed under layers though. Something like vllm.<workflow>.backends.attention would be more appropriate.

@DarkLight1337 (Member) commented Oct 9, 2024

Also, you have some inconsistent naming: model tests are placed under the models directory while the model implementations are inside modelzoo. I prefer renaming modelzoo to the current models so it's the same as the Transformers library.

@noooop (Contributor, Author) commented Oct 9, 2024

It is better now. I don't think attention backends should be placed under layers though. Something like vllm.<workflow>.backends.attention would be more appropriate.

mark

Also, you have some inconsistent naming: model tests are placed under the models directory while the model implementations are inside modelzoo. I prefer renaming modelzoo to the current models so it's the same as the Transformers library.

mark

@noooop (Contributor, Author) commented Oct 14, 2024

@DarkLight1337

The biggest difference between prefill only models and decoding models lies in the attention and the scheduler,

corresponding to #9124 and #9181.

These two PRs are not directly related and can be reviewed at the same time without conflict.

Why add new files instead of modifying the existing code directly?

  1. If we add new code directly, these modules become increasingly complex.
  2. Some new features must be compromised for compatibility.
  3. Code related to decoding models is almost never used in prefill only models, and vice versa. There is no need to force the two parts of code together.
  4. Keeping the two pieces of code separate allows each to be optimized independently in the future; combining them for the sake of compatibility ultimately leads to suboptimal results.
  5. These are very important pieces of code, and modifications may introduce bugs.

Once #9124 and #9181 are merged, the rest is very simple code that is easy to merge.

As for the Workflow Defined Engine architecture, it is similar to the code below and is actually not complicated.

https://github.com/noooop/vllm/blob/49299c097f6696107c4cb8033679d9a419044a0f/vllm/core/scheduler.py#L314C1-L318C34

        version = "v1"
        if self.scheduler_config.use_v2_block_manager:
            version = "v2"
        if self.scheduler_config.embedding_mode:
            version = "embedding"

        BlockSpaceManagerImpl = BlockSpaceManager.get_block_space_manager_class(
            version)
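
For comparison, here is a rough sketch of what the analogous workflow dispatch could look like (the workflow dicts and the prefill-only module paths are hypothetical placeholders, not existing vLLM modules):

import importlib

# each entry is a "module.path:ClassName" string that is imported only when
# the engine is built, so unused modules are never loaded
DECODE_ONLY_WORKFLOW = {
    "InputProcessor": "vllm.inputs.preprocess:InputPreprocessor",
    "Scheduler": "vllm.core.scheduler:Scheduler",
}

PREFILL_ONLY_WORKFLOW = {
    # hypothetical locations for the prefill-only variants
    "InputProcessor": "vllm.inputs.prefill_only.preprocess:InputProcessor",
    "Scheduler": "vllm.core.prefill_only.scheduler:Scheduler",
}


def lazy_import(path: str):
    module_name, class_name = path.split(":")
    return getattr(importlib.import_module(module_name), class_name)


def load_workflow(prefill_only: bool):
    workflow = PREFILL_ONLY_WORKFLOW if prefill_only else DECODE_ONLY_WORKFLOW
    # the engine only imports the modules that this workflow names
    return {name: lazy_import(path) for name, path in workflow.items()}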
   

@DarkLight1337 (Member)

Since @WoosukKwon has already started the refactoring, you should coordinate this with him.

@noooop (Contributor, Author) commented Oct 14, 2024

Since @WoosukKwon has already started the refactoring, you should coordinate this with him.

Do you mean that this PR should be submitted directly on top of the refactored version?

@DarkLight1337 (Member)

Since @WoosukKwon has already started the refactoring, you should coordinate this with him.

Do you mean that this PR should be submitted directly on top of the refactored version?

You should ping him with a link to this PR.
