Add batch inference support (WIP) #951

Draft: abetlen wants to merge 19 commits into main

Conversation

@abetlen (Owner) commented Nov 28, 2023

Closes #771

This is going to be a big PR: it requires refactoring a good deal of the Llama class to make it thread safe and to support multiple parallel sequences. The goal is to introduce no breaking changes as long as you don't use the new functionality. Some of the lower-level methods such as eval, sample, and generate may have undefined behaviour when the kv cache holds multiple sequences, so asserts will need to be raised accordingly (a rough sketch of such a guard follows the task list below).

  • Refactor the Llama._create_completion spaghetti-ball (should be able to fix #914, Fix mirostat sampling, as well)
  • Add support for multiple completions per request
  • Add support for parallel requests
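A minimal sketch of what such a guard could look like (illustrative only; `n_sequences` is an assumed counter of sequences in the kv cache, not an existing attribute of the Llama class):

```python
# Illustrative only: the single-sequence guard the description above implies
# for the low-level methods (eval/sample/generate). `n_sequences` is assumed.
class _LlamaSketch:
    def __init__(self) -> None:
        self.n_sequences = 1  # hypothetical count of sequences in the kv cache

    def eval(self, tokens: list) -> None:
        assert self.n_sequences <= 1, (
            "eval() is only defined with a single sequence in the kv cache; "
            "use the batch/parallel API for multi-sequence inference."
        )
        # ... the existing single-sequence evaluation would run here ...
```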

abetlen changed the title from "Add batch processing" to "Add batch processing (WIP)" on Nov 28, 2023
abetlen changed the title from "Add batch processing (WIP)" to "Add batch inference support (WIP)" on Nov 28, 2023
@turian (Contributor) commented Dec 28, 2023

@abetlen any progress on this? I am very interested in this feature.

@thomasgauthier

@abetlen I'm also curious to know if this is still a planned feature.
Thank you

@dimaioksha

same here

@abetlen (Owner, Author) commented Jan 19, 2024

Hey guys, yes it is; it's just taking longer than expected because I need to refactor a lot of the Llama class internals while avoiding breaking changes. At the same time, I also don't want to hold up bug fixes and llama.cpp updates.

The next steps right now are:

  • Refactor create_completion so it can be used for parallel sequence completions
  • Introduce a sampling context to manage per-sequence state such as grammars, mirostat parameters, etc. (a rough sketch of this idea follows the list)
  • Add multi-completion support (i.e. the n parameter in the OpenAI API)
  • Add parallel completions support through a slots API similar to llama.cpp
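To illustrate the sampling-context bullet above, a rough sketch (names and fields are hypothetical, not the final design) of per-sequence sampling state grouped into one object instead of living on the Llama instance:

```python
# Hypothetical sketch only: bundle per-sequence sampling state so parallel
# sequences don't share mutable state (grammar, mirostat, history) on Llama.
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class SamplingContext:
    seq_id: int
    temperature: float = 0.8
    mirostat_mode: int = 0
    mirostat_mu: float = 10.0            # running mirostat state, updated per token
    grammar: Optional[Any] = None        # e.g. a per-sequence LlamaGrammar
    prev_tokens: List[int] = field(default_factory=list)  # history for repeat penalties

# one context per parallel sequence
contexts = {seq_id: SamplingContext(seq_id=seq_id) for seq_id in range(4)}
```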

@K-Mistele (Contributor)

+1 on this, would really really love to see this feature - right now, I can't use llama-cpp-python in production because of it :(

@aalyousfi

  • Add support for multiple completions per request
  • Add support for parallel requests

Hope we can get these two great features soon!

@parallaxe

I am also highly interested in this, would be really really great! 😀

@npip99 commented Mar 11, 2024

@abetlen Hey, this would be huge if you're still working on it.

Right now I'm using 100% of the VRAM on an A40 and getting something like 3% utilization of the FLOPS and memory bandwidth 😆

I just need to be able to throw more inference at it, but running the Python file twice simultaneously would take up twice the VRAM (not viable, since I'm already at the VRAM limit).

I will happily sponsor this with the cloud-compute costs it will save me 🙏


I'm not totally sure I understand the code, but from my reading of this PR, with this feature:

  • A "context" will have its own specific kv_cache
  • A context can be save_model/load_model'd, which will save/load all state (including kv_cache?)
  • And then I can have N threads; each one can do an llm load model to initialize a context, then .eval, then save_model to continue that evaluation at a later time, and keep a dictionary of llm states for all clients who are waiting on evals to progress (roughly sketched below)
  • Target N to be whatever gives me 100% util on FLOP or Memory Bandwidth (probably bandwidth)

Or, alternatively, will the kv_cache be global? That could save RAM by letting the parallel processes share the kv_cache, but maybe that's harder to implement, and it wouldn't matter in cases where the parallel threads don't share any substrings. Not sure.
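For what it's worth, a rough sketch of that per-client loop, written against the save_state()/load_state()/eval() methods Llama already exposes; note that today this is strictly sequential over a single kv cache, and letting such sequences advance in parallel is exactly what this PR is about:

```python
# Sketch only: sequential per-client stepping with state snapshots.
from llama_cpp import Llama

llm = Llama(model_path="../models/llama-2-70b.gguf")  # path is illustrative
client_states = {}  # client_id -> LlamaState snapshot

def advance(client_id: int, text: str) -> None:
    if client_id in client_states:
        llm.load_state(client_states[client_id])  # restore this client's sequence
    else:
        llm.reset()                               # start a fresh sequence
    llm.eval(llm.tokenize(text.encode("utf-8")))  # extend the sequence
    client_states[client_id] = llm.save_state()   # snapshot for the next call
```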


Maybe a totally high-level idea would be to allow Llama() to be initialized multiple times, but implicitly share VRAM when multiple instances are created from the same underlying model file. That would be easiest for the user, but maybe it gets weird in the underlying implementation. Interesting ideas, though.

```python
model = Model("../models/llama-2-70b")
llama1 = Llama(model)
llama2 = Llama(model)
```
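Until something like that Model/Llama split exists, a low-tech approximation is to cache one Llama per model path so naive code never loads the same weights twice; this reuses a single instance rather than truly sharing weights between independent contexts, and the helper name is made up:

```python
# Workaround sketch, not the proposed API: dedupe Llama instances per model path
# so repeated construction doesn't double VRAM usage.
import functools

from llama_cpp import Llama

@functools.lru_cache(maxsize=None)
def get_llm(model_path: str) -> Llama:
    return Llama(model_path=model_path)

llm1 = get_llm("../models/llama-2-70b.gguf")
llm2 = get_llm("../models/llama-2-70b.gguf")
assert llm1 is llm2  # same object, weights loaded only once
```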

@stanier mentioned this pull request Apr 3, 2024
@cdoern commented Oct 23, 2024

Hey @abetlen, any updates on this one? Looking to add support for this in instructlab/sdg and instructlab/instructlab! Really hoping for this functionality 🙏

Development

Successfully merging this pull request may close these issues.

  • Add batched inference
  • Fix mirostat sampling