Add batch inference support (WIP) #951
base: main
Conversation
@abetlen any progress on this? I am very interested in this feature.
@abetlen I'm also curious to know if this is still a planned feature.
Same here.
Hey guys, yes it is, it's just taking longer than expected because I need to refactor a lot of the Llama internals. Next steps right now are …
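For readers landing on this thread, the limitation behind these requests is that the current high-level API takes one prompt per call, so multiple requests against the same Llama instance run strictly one after another. A minimal sketch of that status quo (model path and prompts are placeholders, not from this PR):

```python
from llama_cpp import Llama

# One model instance: weights and KV cache live in VRAM once.
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048)

prompts = [
    "Q: Name the planets in the solar system. A:",
    "Q: What is the capital of France? A:",
    "Q: Write one sentence about GPUs. A:",
]

# Each completion is generated sequentially today; batch inference would let
# these prompts share forward passes instead of queueing behind each other.
for prompt in prompts:
    out = llm.create_completion(prompt, max_tokens=32)
    print(out["choices"][0]["text"])
```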
+1 on this, would really really love to see this feature - right now, I can't use …
Hope we can get these two great features soon!
I am also highly interested in this, would be really really great! 😀
@abetlen Hey; this would be huge if you're still working on it. Right now I'm using 100% of the VRAM on an A40 and getting something like 3% utilization of the FLOPS and memory bandwidth 😆 I just need to be able to throw more inference at it, but running the Python file twice simultaneously takes up twice the VRAM (not viable, I'm already at the VRAM limit). I will happily sponsor with the cloud compute that I will save 🙏

I'm not totally sure I understand the code, but from my understanding of this PR, with this feature: …

Or, alternatively, will the kv_cache be global? It could probably save RAM by letting the parallel processes share the kv_cache, but maybe that's harder to implement, and it wouldn't matter in cases where the parallel processing threads don't share any substrings. Not sure.

Maybe a totally high-level idea would be to allow Llama() to be initialized multiple times, but implicitly share VRAM if multiple instances are initialized from the same underlying model file. That would be easiest for the user, but maybe it gets weird with the underlying implementation. Interesting ideas though.

model = Model("../models/llama-2-70b")
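Expanding that one-liner into a rough sketch of the interface being floated above; Model, spawn_context, and the sharing semantics are all hypothetical illustrations of the idea, not anything implemented in this PR or in llama-cpp-python today:

```python
# Hypothetical sketch only: `Model` and `spawn_context` do not exist in
# llama-cpp-python; they illustrate the "load weights once, serve many
# requests" idea from the comment above.
from concurrent.futures import ThreadPoolExecutor

model = Model("../models/llama-2-70b")  # hypothetical: weights go into VRAM once

# Each handle would own its own slice of the KV cache (or share a global one),
# while all handles reuse the single copy of the weights loaded above.
handles = [model.spawn_context(n_ctx=2048) for _ in range(8)]

def answer(handle, prompt):
    # Same call shape as Llama.create_completion, but routed through the shared model.
    return handle.create_completion(prompt, max_tokens=128)["choices"][0]["text"]

prompts = [f"Summarize request {i}." for i in range(8)]
with ThreadPoolExecutor(max_workers=len(handles)) as pool:
    results = list(pool.map(answer, handles, prompts))
```

Whether the parallel sequences get private KV caches or a shared global one is exactly the open question raised above; the sketch only shows the user-facing shape.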
Hey @abetlen any updates on this one? Looking to add support for this into instructlab/sdg and instructlab/instructlab !!!! Really hoping for this functionality 🙏
Closes #771

Refactor the Llama._create_completion spaghetti-ball (should be able to fix #914 Fix mirostat sampling as well)