[WIP] Prototyping re-arch #9166
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
First part of review: scheduler and KV cache manager.
High-level notes on the new scheduler and KV cache manager:
- Chunked prefill is assumed to always be on.
- I'm still confused about how the new scheduler works for decoding requests. See comment below.
- No sequence group (removing it is the goal of the re-arch).
- No CPU swapping (swapping will live in the KV cache manager).
- All sharing is done via prefix caching.
- However, this still requires a prefix-caching-enabled KV cache manager.
# Reserve block id 0 for padding.
self.free_block_ids = list(range(num_gpu_blocks))
Is block 0 actually reserved?
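The question stands because the comment claims block 0 is reserved, yet `list(range(num_gpu_blocks))` still includes id 0 in the free pool. A minimal sketch of what actually reserving block 0 would look like (the `BlockPool` class and its methods are illustrative, not the vLLM implementation):

```python
class BlockPool:
    """Toy GPU block id pool that really does reserve block 0 for padding."""

    def __init__(self, num_gpu_blocks: int) -> None:
        # Start the free list at 1 so block id 0 is never handed out.
        self.free_block_ids = list(range(1, num_gpu_blocks))

    def allocate(self) -> int:
        # Hand out any free block id; 0 can never appear here.
        return self.free_block_ids.pop()

    def free(self, block_id: int) -> None:
        assert block_id != 0, "block 0 is reserved for padding"
        self.free_block_ids.append(block_id)
```

With the original `range(num_gpu_blocks)`, block 0 would eventually be allocated like any other block, defeating the padding reservation.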
num_tokens = request.num_tokens - request.num_computed_tokens
num_tokens = min(num_tokens, token_budget)
Here we assume we always do chunked prefill, right? How does the num_tokens computation here work for decode-phase requests? Will request.num_tokens - request.num_computed_tokens always be 1 in that case?
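To make the question concrete, here is the formula from the diff extracted into a standalone function. Under always-on chunked prefill the same expression covers both phases: a prefill request is chunked to the budget, while a decode request has all but its one newly sampled token already computed, so the difference is 1. The function name is illustrative:

```python
def tokens_to_schedule(num_tokens: int, num_computed_tokens: int,
                       token_budget: int) -> int:
    """The num_tokens computation from the diff, for illustration only."""
    return min(num_tokens - num_computed_tokens, token_budget)

# Prefill: a 1000-token prompt with nothing computed yet is chunked
# down to the token budget.
assert tokens_to_schedule(1000, 0, 256) == 256

# Decode: the prompt and all previously generated tokens are computed;
# only the one newly sampled token remains, so the result is 1.
assert tokens_to_schedule(1001, 1000, 256) == 1
```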
if preempted_reqs:
    break
Nit: this should be outside of the while loop.
if request.status == RequestStatus.WAITING:
    scheduled_new_reqs.append(request)
elif request.status == RequestStatus.PREEMPTED:
    scheduled_resumed_reqs.append(request)
Why do we need to distinguish these two? Is it for delta update optimization?
finished_req_ids=self.finished_req_ids,
aborted_req_ids=self.aborted_req_ids,
Nit: Maybe add a comment to distinguish these two fields from the other fields.
Suggested change:
# These two fields are existing states in the scheduler instead of newly scheduled in this step.
finished_req_ids=self.finished_req_ids,
aborted_req_ids=self.aborted_req_ids,
def abort_requests(self, request_ids: Union[str, Iterable[str]]) -> None:
    if isinstance(request_ids, str):
        request_ids = (request_ids, )
    request_ids = set(request_ids)

    # TODO: Optimize this.
    for queue in [self.waiting, self.running]:
        aborted_reqs: List[Request] = []
        for request in queue:
            if not request_ids:
                break
            if request.request_id in request_ids:
                request.status = RequestStatus.FINISHED_ABORTED
                aborted_reqs.append(request)
                request_ids.remove(request.request_id)

        for request in aborted_reqs:
            queue.remove(request)
            self.aborted_req_ids.add(request.request_id)
            self._free_request(request)

def stop_requests(self, request_ids: Union[str, Iterable[str]]) -> None:
What is the difference between stop and abort? Can we merge the two functions?
logger = init_logger(__name__)


class KVCacheManager:
Is this the new interface for the block manager? Should we implement prefix caching/hierarchical cache in this class?
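For reference, one common way such a KV cache manager implements prefix caching is hash-based block sharing: each full block of a request's prompt is keyed by a hash chained from its parent block's hash, so two requests reuse a cached block only when their entire prefix matches. A toy sketch under that assumption (the `PrefixCache` class, `BLOCK_SIZE`, and method names are all hypothetical, not the actual interface proposed in this PR):

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative)


def block_hash(prev_hash: Optional[str], token_ids: Tuple[int, ...]) -> str:
    # Chain the parent block's hash so a block is shared only when the
    # whole prefix matches, not just this block's own tokens.
    h = hashlib.sha256()
    h.update((prev_hash or "").encode())
    h.update(repr(token_ids).encode())
    return h.hexdigest()


class PrefixCache:
    """Toy prefix-caching lookup: full prompt blocks are matched against
    previously cached blocks by their chained hash."""

    def __init__(self) -> None:
        self.cached: Dict[str, int] = {}  # chained block hash -> block id
        self.next_block_id = 0

    def get_or_allocate(self, token_ids: List[int]) -> List[int]:
        block_ids: List[int] = []
        prev_hash: Optional[str] = None
        # Only full blocks are cacheable; the trailing partial block is skipped.
        for i in range(0, len(token_ids) // BLOCK_SIZE * BLOCK_SIZE, BLOCK_SIZE):
            chunk = tuple(token_ids[i:i + BLOCK_SIZE])
            prev_hash = block_hash(prev_hash, chunk)
            if prev_hash not in self.cached:
                self.cached[prev_hash] = self.next_block_id
                self.next_block_id += 1
            block_ids.append(self.cached[prev_hash])
        return block_ids
```

A hierarchical cache could then be layered on top by evicting cold hash entries to CPU memory, which is one way the swapping responsibility mentioned above could move into this class.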