feat: use dynamic batching when generating #9
Conversation
This will increase the number of slots (the batch size) dynamically when prefill is called, and reduce it only when prefill is called again. The intention is to avoid useless recompilation (keeping the batch size the same for as long as possible). Also note that if a slot is removed, the whole KV cache should be rebuilt; for now we do not do that.
Since the batch size is now dynamic, the expected test results are different, so they have been updated accordingly.
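To make the mechanism concrete, here is a minimal, self-contained sketch of the idea; Slot, Generator, and every name in it are toy stand-ins for illustration, not this PR's actual classes. The slot list is shrunk only at the start of a prefill call and grown on demand, so the effective batch size changes as rarely as possible:

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    """Toy stand-in for a generation slot (illustration only)."""
    id: int
    request: Optional[object] = None

    def is_empty(self) -> bool:
        return self.request is None

    def assign(self, request) -> None:
        self.request = request

    def clear(self) -> None:
        # Called when generation for this slot has finished.
        self.request = None


class Generator:
    def __init__(self):
        self.slots: List[Slot] = []

    def prefill(self, requests) -> int:
        # Shrink only here: slots that finished since the previous prefill are
        # dropped now, never in the middle of decoding.
        self.slots = [slot for slot in self.slots if not slot.is_empty()]
        # Grow on demand: one new slot per incoming request.
        for request in requests:
            slot = Slot(len(self.slots))
            slot.assign(request)
            self.slots.append(slot)
        # A change in len(self.slots) is what would force a recompilation for
        # the new batch size, so keeping it stable avoids useless recompiles.
        return len(self.slots)

For instance, prefill(["a", "b"]) yields a batch size of 2; if one slot is cleared after generation and prefill(["c"]) is called, the finished slot is dropped and a new one is added, leaving the batch size at 2 and avoiding a recompile.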
Now it is only useful for warmup, since dynamic batching is being used.
LGTM on the logic; I left a few comments about potentially optimizing a few minor things.
Super cool!
SEQUENCE_LENGTH = 1024


@pytest.fixture(scope="module")
def model_name_or_path():
    os.environ["HF_BATCH_SIZE"] = str(BATCH_SIZE)
Maybe we should keep a HF_MAX_BATCH_SIZE somewhere?
as mentioned below, maybe we'll use it later
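If such a variable were introduced, reading it could look roughly like this; HF_MAX_BATCH_SIZE and can_grow are hypothetical here, only meant to illustrate the cap discussed in this thread:

import os

# Hypothetical cap on the dynamically grown batch size; not part of this PR.
# If HF_MAX_BATCH_SIZE is unset, growth stays unbounded, as it is today.
_raw = os.environ.get("HF_MAX_BATCH_SIZE")
MAX_BATCH_SIZE = int(_raw) if _raw is not None else None


def can_grow(current_batch_size: int) -> bool:
    """Return True if one more slot may still be added."""
    return MAX_BATCH_SIZE is None or current_batch_size < MAX_BATCH_SIZE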
@@ -312,7 +312,10 @@ def __init__(
        tokenizer.padding_side = "left"
        self.tokenizer = tokenizer
        self.special_tokens = self.tokenizer.all_special_ids
        self.slots = [Slot(i, tokenizer, self.model.device) for i in range(self.model.config.batch_size)]
If we introduce the HF_MAX_BATCH_SIZE, maybe we can initialize this list with it, wdyt?
I think I first need to better understand how the batches are used: whether the batch size can be increased/reduced often, and how that affects performance. At that point it will be easier to think about a reasonable algorithm that reduces compilation and batch-change overhead as much as possible.
@@ -350,13 +353,18 @@ def warmup(self, batch: Batch) -> int:
            The maximum number of tokens the model supports.
        """
        # Just check that the warmup request parameters match the model capacity
        batch_size = self.model.config.batch_size
        # NOTE: later self.model.config.batch_size might become self.model.config.max_batch_size.
Ok maybe you want to keep it for later 🤗
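As a rough, free-standing illustration of the role batch_size would play once it acts as a ceiling rather than a fixed size (the function name and its semantics are assumptions, not this repository's code):

def check_warmup_capacity(num_requests: int, batch_size: int, sequence_length: int) -> int:
    """Validate a warmup batch against the configured capacity and return the
    maximum number of tokens supported (sketch only)."""
    if num_requests > batch_size:
        raise ValueError(
            f"Warmup batch has {num_requests} requests, but the configured "
            f"batch size (perhaps max_batch_size later) is only {batch_size}."
        )
    return batch_size * sequence_length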
        # Assign each request to an empty slot
        logger.debug(f"Prefilling {len(batch.requests)} new request(s) with {len(empty_slots)} empty slot(s)")
        logger.debug(f"Prefilling {len(batch.requests)} new request(s) adding to {len(active_slots)} active slot(s)")
        new_slots = []
What is new_slots used for?
useless 😄
            slot.assign(request, self.model.generation_config)
            new_slots.append(slot)
Same comment about new_slots
same answer, I'll remove it
Just to be sure: in this PR we are not limiting the maximum batch size the server can handle? If so, can we implement this in a following PR?