Add batch inference support (WIP) #951

Draft: abetlen wants to merge 19 commits into main

Conversation

@abetlen (Owner) commented Nov 28, 2023

Closes #771

This is going to be a big PR: it requires refactoring a good deal of the Llama class to make it thread safe and to support multiple parallel sequences. The goal is to introduce no breaking changes as long as you don't use the new functionality. Some of the lower-level methods such as eval, sample, and generate may have undefined behaviour when the kv cache holds multiple sequences, so asserts will need to be raised accordingly (a rough sketch of such a guard follows the task list below).

  • Refactor the Llama._create_completion spaghetti-ball (should be able to fix #914, Fix mirostat sampling, as well)
  • Add support for multiple completions per request
  • Add support for parallel requests
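A minimal sketch of what such a guard could look like (illustrative only; `n_sequences` is an assumed counter of sequences in the kv cache, not an existing attribute of the Llama class):

```python
# Illustrative only: the single-sequence guard the description above implies
# for the low-level methods (eval/sample/generate). `n_sequences` is assumed.
class _LlamaSketch:
    def __init__(self) -> None:
        self.n_sequences = 1  # hypothetical count of sequences in the kv cache

    def eval(self, tokens: list) -> None:
        assert self.n_sequences <= 1, (
            "eval() is only defined with a single sequence in the kv cache; "
            "use the batch/parallel API for multi-sequence inference."
        )
        # ... the existing single-sequence evaluation would run here ...
```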

abetlen changed the title from "Add batch processing" to "Add batch processing (WIP)" on Nov 28, 2023
abetlen changed the title from "Add batch processing (WIP)" to "Add batch inference support (WIP)" on Nov 28, 2023
@turian (Contributor) commented Dec 28, 2023

@abetlen any progress on this? I am very interested in this feature.

@thomasgauthier

@abetlen I'm also curious to know if this is still a planned feature.
Thank you

@dimaioksha

same here

@abetlen (Owner, Author) commented Jan 19, 2024

Hey guys, yes it is; it's just taking longer than expected because I need to refactor a lot of the Llama class internals while avoiding breaking changes. At the same time, I also don't want to hold up bug fixes and llama.cpp updates.

The next steps right now are:

  • Refactor create_completion so it can be used for parallel sequence completions
  • Introduce a sampling context to manage per-sequence state such as grammars, mirostat parameters, etc. (a rough sketch of this idea follows the list)
  • Add multi-completion support (i.e. the n parameter in the OpenAI API)
  • Add parallel completions support through a slots API similar to llama.cpp
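To illustrate the sampling-context bullet above, a rough sketch (names and fields are hypothetical, not the final design) of per-sequence sampling state grouped into one object instead of living on the Llama instance:

```python
# Hypothetical sketch only: bundle per-sequence sampling state so parallel
# sequences don't share mutable state (grammar, mirostat, history) on Llama.
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class SamplingContext:
    seq_id: int
    temperature: float = 0.8
    mirostat_mode: int = 0
    mirostat_mu: float = 10.0            # running mirostat state, updated per token
    grammar: Optional[Any] = None        # e.g. a per-sequence LlamaGrammar
    prev_tokens: List[int] = field(default_factory=list)  # history for repeat penalties

# one context per parallel sequence
contexts = {seq_id: SamplingContext(seq_id=seq_id) for seq_id in range(4)}
```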

@K-Mistele (Contributor)

+1 on this, would really really love to see this feature - right now, I can't use llama-cpp-python in production because of it :(

@aalyousfi

  • Add support for multiple completions per request
  • Add support for parallel requests

Hope we can get these two great features soon!

@parallaxe

I am also highly interested in this, would be really really great! 😀

@npip99 commented Mar 11, 2024

@abetlen Hey, this would be huge if you're still working on it.

Right now I'm using 100% of the VRAM on an A40 and getting something like 3% utilization of the FLOPS and memory bandwidth 😆

I just need to be able to throw more inference at it, but running the Python file twice simultaneously would take up twice the VRAM (not viable, since I'm already at the VRAM limit).

I will happily sponsor this with the cloud-compute costs it will save me 🙏


I'm not totally sure I understand the code, but from my reading of this PR, with this feature:

  • A "context" will have its own specific kv_cache
  • A context can be save_model/load_model'd, which will save/load all state (including kv_cache?)
  • And then I can have N threads; each one can do an llm load model to initialize a context, then .eval, then save_model to continue that evaluation at a later time, and keep a dictionary of llm states for all clients who are waiting on evals to progress (roughly sketched below)
  • Target N to be whatever gives me 100% util on FLOP or Memory Bandwidth (probably bandwidth)

Or, alternatively, will the kv_cache be global? That could save RAM by letting the parallel processes share the kv_cache, but maybe that's harder to implement, and it wouldn't matter in cases where the parallel threads don't share any substrings. Not sure.
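For what it's worth, a rough sketch of that per-client loop, written against the save_state()/load_state()/eval() methods Llama already exposes; note that today this is strictly sequential over a single kv cache, and letting such sequences advance in parallel is exactly what this PR is about:

```python
# Sketch only: sequential per-client stepping with state snapshots.
from llama_cpp import Llama

llm = Llama(model_path="../models/llama-2-70b.gguf")  # path is illustrative
client_states = {}  # client_id -> LlamaState snapshot

def advance(client_id: int, text: str) -> None:
    if client_id in client_states:
        llm.load_state(client_states[client_id])  # restore this client's sequence
    else:
        llm.reset()                               # start a fresh sequence
    llm.eval(llm.tokenize(text.encode("utf-8")))  # extend the sequence
    client_states[client_id] = llm.save_state()   # snapshot for the next call
```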


Maybe a totally high-level idea would be to allow Llama() to be initialized multiple times, but implicitly share VRAM when multiple instances are created from the same underlying model file. That would be easiest for the user, but maybe it gets weird in the underlying implementation. Interesting ideas, though.

```python
model = Model("../models/llama-2-70b")
llama1 = Llama(model)
llama2 = Llama(model)
```
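Until something like that Model/Llama split exists, a low-tech approximation is to cache one Llama per model path so naive code never loads the same weights twice; this reuses a single instance rather than truly sharing weights between independent contexts, and the helper name is made up:

```python
# Workaround sketch, not the proposed API: dedupe Llama instances per model path
# so repeated construction doesn't double VRAM usage.
import functools

from llama_cpp import Llama

@functools.lru_cache(maxsize=None)
def get_llm(model_path: str) -> Llama:
    return Llama(model_path=model_path)

llm1 = get_llm("../models/llama-2-70b.gguf")
llm2 = get_llm("../models/llama-2-70b.gguf")
assert llm1 is llm2  # same object, weights loaded only once
```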

@stanier mentioned this pull request Apr 3, 2024
@cdoern commented Oct 23, 2024

Hey @abetlen, any updates on this one? Looking to add support for this in instructlab/sdg and instructlab/instructlab! Really hoping for this functionality 🙏

Development

Successfully merging this pull request may close these issues.

  • Add batched inference
  • Fix mirostat sampling