Add support for the vLLM inference engine - to possibly gain a 10x speedup in inference #2785
Comments
If the performance claims aren't overcooked or super situational, this could be huge.
AI is a field where some of the brightest minds in the world are working on some of the most complicated maths, and somehow someone just comes along and does something like this (assuming it's real). Are we in an "AI summer"? 😂
It's ExLlama for everything else... and a new loader can just be added.
vLLM only delivers its 24x speedup when running full-precision models with massive parallelization, so if you need to run 100 inferences at the same time, it's fast. But for most people, ExLlama is still faster/better. @turboderp has some good insights on the LocalLLaMA subreddit. Unless someone is feeling ambitious, I think this could be closed. The issue poster probably didn't understand what vLLM is really for.
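For context, the batched-throughput scenario described above looks roughly like this with vLLM's offline API (a minimal sketch; the model name, prompts, and sampling settings are placeholders, not recommendations):

```python
# Minimal sketch of vLLM's offline batched inference API.
# Model name and sampling settings are placeholders for illustration.
from vllm import LLM, SamplingParams

# 100 requests submitted at once -- the scenario where continuous batching
# gives vLLM its headline throughput numbers.
prompts = [f"Question {i}: What is the capital of France?" for i in range(100)]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # any HF-format full-precision model
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```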
Does tensor parallelism help with multi-GPU setups? And with the multi-user support, this might actually serve the intended purpose.
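For reference, vLLM exposes multi-GPU tensor parallelism through a single constructor argument (a minimal sketch; the model name and GPU count below are placeholders):

```python
# Sketch: sharding a model across GPUs with vLLM's tensor parallelism.
# Model name and tensor_parallel_size are placeholders for illustration.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # example HF model
    tensor_parallel_size=2,             # split the weights across 2 GPUs
)
```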
I'm not sure how they arrive at those results. Plain HF Transformers can be mighty slow, but you have to really try to make it that slow, I feel. As for vLLM, it's not for quantized models, and as such it's quite a bit slower than ExLlama (or llama.cpp with GPU acceleration, for that matter). If you're deploying a full-precision model to serve inference to multiple clients, it might be very useful, though.
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
Even if we do not consider adding a new model loader for single-user mode, we should consider vLLM now, as it frequently adds support for newly released models like Qwen, and offers both multi-client serving and quantization (AWQ): https://github.com/vllm-project/vllm
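If vLLM were wired in as a loader, loading an AWQ-quantized checkpoint is a one-argument change in its API (a sketch; the model ID is an illustrative AWQ repo, not a specific recommendation):

```python
# Sketch: loading an AWQ-quantized checkpoint with vLLM.
# The model ID is illustrative; any AWQ-quantized HF repo should work.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",               # tell vLLM to use its AWQ kernels
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```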
@oobabooga is this on the roadmap?
It seems it's not coming for now, at least.
This should be reconsidered. The concern about plaguing the codebase with CUDA dependencies is valid, but we should address the design constraints to make this happen rather than close the door entirely on something that could benefit ooga's tool. @oobabooga, what could be the acceptance criteria? Being able to serve, eval, and play at the same time in a friendly ecosystem like ooga's would be very handy.
vLLM has gradually introduced support for GPTQ and AWQ models, with imminent plans to accommodate the as-yet-unmerged QLoRA and QA-LoRA models. Moreover, the acceleration delivered by vLLM is now strikingly evident. Given these developments, I propose considering the incorporation of vLLM support. The project is rapidly evolving and poised for a promising future.
+1 for vLLM
For example, vLLM manages all the tokens' KV caches in blocks, so it can be faster even when the batch size is 1.
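To illustrate the general idea behind block-based KV cache management (this is a toy sketch of the technique, not vLLM's actual implementation): instead of reserving one contiguous region per sequence sized for the maximum length, the cache is carved into fixed-size blocks that are handed out on demand, so memory is only committed as tokens are actually generated.

```python
# Toy illustration of block-based KV cache allocation (NOT vLLM's code).
# Sequences request fixed-size blocks on demand instead of reserving
# max_seq_len worth of cache up front.
BLOCK_SIZE = 16  # tokens per block

class BlockKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, position: int) -> int:
        """Return the physical block that holds the KV entry for `position`."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:       # current blocks are full; grab a new one
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE]

    def free(self, seq_id: int) -> None:
        """Release all blocks of a finished sequence back to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = BlockKVCache(num_blocks=1024)
for pos in range(40):                # generate 40 tokens for sequence 0
    cache.append_token(seq_id=0, position=pos)
print(len(cache.block_tables[0]))    # 3 blocks cover 40 tokens at 16 tokens/block
```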
Yeah, vLLM support should be added.
vLLM is an open-source LLM inference and serving library that accelerates HuggingFace Transformers by 24x and powers Vicuna and Chatbot Arena.
Blog post: https://vllm.ai/
Repo: https://github.com/vllm-project/vllm
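As a rough sketch of the multi-client serving mode described above, a vLLM server can be queried over HTTP once it is running (the launch command is shown as a comment; the model name, port, and prompt are placeholders):

```python
# Sketch: querying a vLLM OpenAI-compatible server from a client.
# Assumes the server was started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
# Model name, port, and prompt are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["text"])
```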