feat: use dynamic batching when generating #9
Conversation
This will increase the number of slots (the batch size) dynamically when prefill is called, and reduce it only when prefill is called again. The intention is to avoid useless recompilation (keeping the batch size the same for as long as possible). Also note that if a slot is removed, the whole KV cache should be rebuilt; for now we do not do that.
Since the batch size is now dynamic, the expected test results are different, so they have been updated accordingly.
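To make the mechanism concrete, here is a minimal, self-contained sketch of the idea; Slot, Generator, and every name in it are toy stand-ins for illustration, not this PR's actual classes. The slot list is shrunk only at the start of a prefill call and grown on demand, so the effective batch size changes as rarely as possible:

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    """Toy stand-in for a generation slot (illustration only)."""
    id: int
    request: Optional[object] = None

    def is_empty(self) -> bool:
        return self.request is None

    def assign(self, request) -> None:
        self.request = request

    def clear(self) -> None:
        # Called when generation for this slot has finished.
        self.request = None


class Generator:
    def __init__(self):
        self.slots: List[Slot] = []

    def prefill(self, requests) -> int:
        # Shrink only here: slots that finished since the previous prefill are
        # dropped now, never in the middle of decoding.
        self.slots = [slot for slot in self.slots if not slot.is_empty()]
        # Grow on demand: one new slot per incoming request.
        for request in requests:
            slot = Slot(len(self.slots))
            slot.assign(request)
            self.slots.append(slot)
        # A change in len(self.slots) is what would force a recompilation for
        # the new batch size, so keeping it stable avoids useless recompiles.
        return len(self.slots)

For instance, prefill(["a", "b"]) yields a batch size of 2; if one slot is cleared after generation and prefill(["c"]) is called, the finished slot is dropped and a new one is added, leaving the batch size at 2 and avoiding a recompile.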
Now it is only useful for warmup, since dynamic batching is being used.
LGTM on the logic; I left a few comments about potentially optimizing a few minor things.
Super cool!
SEQUENCE_LENGTH = 1024


@pytest.fixture(scope="module")
def model_name_or_path():
    os.environ["HF_BATCH_SIZE"] = str(BATCH_SIZE)
Maybe we should keep a HF_MAX_BATCH_SIZE somewhere?
as mentioned below, maybe we'll use it later
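If such a variable were introduced, reading it could look roughly like this; HF_MAX_BATCH_SIZE and can_grow are hypothetical here, only meant to illustrate the cap discussed in this thread:

import os

# Hypothetical cap on the dynamically grown batch size; not part of this PR.
# If HF_MAX_BATCH_SIZE is unset, growth stays unbounded, as it is today.
_raw = os.environ.get("HF_MAX_BATCH_SIZE")
MAX_BATCH_SIZE = int(_raw) if _raw is not None else None


def can_grow(current_batch_size: int) -> bool:
    """Return True if one more slot may still be added."""
    return MAX_BATCH_SIZE is None or current_batch_size < MAX_BATCH_SIZE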
@@ -312,7 +312,10 @@ def __init__(
        tokenizer.padding_side = "left"
        self.tokenizer = tokenizer
        self.special_tokens = self.tokenizer.all_special_ids
        self.slots = [Slot(i, tokenizer, self.model.device) for i in range(self.model.config.batch_size)]
If we introduce the HF_MAX_BATCH_SIZE, maybe we can initialize this list with it, wdyt?
I think I first need to better understand how the batches are used: whether the batch size can be increased/reduced often, and how that affects performance. At that point it will be easier to think about a reasonable algorithm that reduces compilation and batch-change overhead as much as possible.
@@ -350,13 +353,18 @@ def warmup(self, batch: Batch) -> int:
            The maximum number of tokens the model supports.
        """
        # Just check that the warmup request parameters match the model capacity
        batch_size = self.model.config.batch_size
        # NOTE: later self.model.config.batch_size might become self.model.config.max_batch_size.
Ok maybe you want to keep it for later 🤗
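As a rough, free-standing illustration of the role batch_size would play once it acts as a ceiling rather than a fixed size (the function name and its semantics are assumptions, not this repository's code):

def check_warmup_capacity(num_requests: int, batch_size: int, sequence_length: int) -> int:
    """Validate a warmup batch against the configured capacity and return the
    maximum number of tokens supported (sketch only)."""
    if num_requests > batch_size:
        raise ValueError(
            f"Warmup batch has {num_requests} requests, but the configured "
            f"batch size (perhaps max_batch_size later) is only {batch_size}."
        )
    return batch_size * sequence_length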
        # Assign each request to an empty slot
        logger.debug(f"Prefilling {len(batch.requests)} new request(s) with {len(empty_slots)} empty slot(s)")
        logger.debug(f"Prefilling {len(batch.requests)} new request(s) adding to {len(active_slots)} active slot(s)")
        new_slots = []
What is new_slots used for?
useless 😄
            slot.assign(request, self.model.generation_config)
            new_slots.append(slot)
Same comment about new_slots
same answer, I'll remove it
Just to be sure: in this PR we are not limiting the maximum batch size the server can handle? If so, can we implement this in a following PR?