Support Vllm backend #4860
Conversation
Some screenshots running Xwin-LM-7B-V0.2-AWQ at about 66 t/s on a 3090
What is the advantage of VLLM over the existing loaders in the project?
The main purpose is to have vllm as a backend to support a wider variety of models. I will test a bit more on AutoAWQ vs vllm, and native HF transformers vs vllm, later. Edit (WIP): Prompt in webui:
It seems there is an overhead in the generation method of vllm.py that actually bottlenecks the performance. Exllamav2.py does it a lot better. Exllamav2 v0.0.10
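For reference, a standalone throughput check against vllm's offline API can help isolate whether the slowdown comes from the webui wrapper or from vllm itself. This is only a sketch: the model id, prompt, and sampling settings are assumptions, not anything fixed by this PR.

```python
# Minimal sketch: measure raw vllm throughput outside the webui, using
# vllm's offline LLM API with an AWQ-quantized model (model id is an example).
import time

from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Xwin-LM-7B-V0.2-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=200)

start = time.time()
outputs = llm.generate(["Write a short story about a robot."], params)
elapsed = time.time() - start

# Count generated tokens across all requests and report tokens per second.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} t/s")
```

Comparing this number against what the webui reports for the same model and settings would show how much of the gap is the generation wrapper rather than vllm.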
force-pushed from 09999f1 to 92b6203
@oobabooga When vllm starts up, its logging system actually hinders the webui's logging shown in the terminal. I am not sure what the cause could be (has vllm occupied the stdout stream?); it only happens with vllm. But both logs can be shown when you
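If vllm's own logger is the culprit, one possible workaround is to lower its verbosity from the webui side. This is a sketch that assumes vllm registers its loggers through Python's standard `logging` module under the `vllm` name:

```python
# Sketch: quiet vllm's console output so the webui's own logging stays visible.
# Assumes vllm uses the standard logging module under the "vllm" logger name.
import logging

logging.getLogger("vllm").setLevel(logging.WARNING)
```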
@oobabooga FYI, vllm-project/vllm#916: vllm now supports GPTQ
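For context, loading a GPTQ checkpoint through vllm's offline API looks roughly like the sketch below. It assumes a vllm build that already includes the GPTQ support referenced above, and the model id is only an example:

```python
from vllm import LLM, SamplingParams

# Sketch: load a GPTQ-quantized model (example model id) via vllm's
# `quantization` argument, assuming GPTQ support is present in the installed build.
llm = LLM(model="TheBloke/Xwin-LM-7B-V0.2-GPTQ", quantization="gptq")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=50))
print(outputs[0].outputs[0].text)
```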
force-pushed from da70478 to 8b4835f
VLLM adds many very specific and heavy CUDA requirements to the project. I don't want to add it when it does not provide a clear benefit over the existing loaders in the project.
Has anyone tried this?
Also relevant: vllm-project/vllm#1836
I know for sure that regular AutoAWQ has horrible multi-GPU support. Vllm might actually load a split 70b properly. That's the advantage. Now that I see this exists, I might give it a go.
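As a rough illustration of the multi-GPU point, vllm splits a model across GPUs with tensor parallelism. The sketch below assumes two GPUs and an example 70B AWQ checkpoint:

```python
from vllm import LLM

# Sketch: shard a large model across 2 GPUs via vllm's tensor parallelism.
# The model id is an example; multi-GPU runs may also require ray to be installed.
llm = LLM(
    model="TheBloke/Xwin-LM-70B-V0.1-AWQ",
    quantization="awq",
    tensor_parallel_size=2,
)
```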
I think this fits better as a plugin/extension. IMHO, this can potentially be a high-throughput inference server within ooga, with the flexibility of jumping in and interacting with the model that is being served.
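For context on the inference-server idea, vllm already ships an OpenAI-compatible HTTP server that a front end could talk to. The sketch below only illustrates that flow; the URL, port, and model id are assumptions for illustration:

```python
# Sketch: query a separately started vllm OpenAI-compatible server, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model <model-id>
# The URL, port, and model id below are placeholders, not values from this PR.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "<model-id>", "prompt": "Hello, my name is", "max_tokens": 50},
)
print(resp.json()["choices"][0]["text"])
```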
Try the PR, it looks fairly easy. I aim to do it at some point.
I definitely would have loved such functionality! I used ooga initially to test the best models for my use case and to design the prompt templates, and having to move away from it was such a pain. Sadly, ooga is not very performant as it is, and I do realize that it is not actually meant to be; however, supporting a vllm backend, which is performant by nature (especially with something like GPTQ quantization), would just put batteries on the whole thing. Ooga could literally be used as the webserver while keeping a front end with functionality to quickly jump in and test things, and it would also allow easy introduction of non-devs into the process, using them to improve the product. It would be great if this could be officially supported; I believe the PR has accomplished most of the required functionality.
Summary
Checklist: