Add support for the vLLM inference engine - to possibly gain a 10x speedup in inference #2785
Comments
If the performance claims aren't overcooked or super situational, this could be huge.
AI is a field where some of the brightest minds in the world are working on some of the most complicated maths, and somehow someone just comes along and does something like this (assuming it's real). Are we in an "AI summer"? 😂
It's ExLlama for everything else... and a new loader can just be added.
vLLM only delivers its 24x speedup when running full-precision models with massive parallelization, so if you need to run 100 inferences at the same time, it's fast. But for most people, ExLlama is still faster/better. @turboderp has some good insights on the LocalLLaMA subreddit. Unless someone is feeling ambitious, I think this could be closed. The issue poster probably didn't understand what vLLM is really for.
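For context, the batched-throughput scenario described above looks roughly like this with vLLM's offline API (a minimal sketch; the model name, prompts, and sampling settings are placeholders, not recommendations):

```python
# Minimal sketch of vLLM's offline batched inference API.
# Model name and sampling settings are placeholders for illustration.
from vllm import LLM, SamplingParams

# 100 requests submitted at once -- the scenario where continuous batching
# gives vLLM its headline throughput numbers.
prompts = [f"Question {i}: What is the capital of France?" for i in range(100)]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # any HF-format full-precision model
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```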
Does tensor parallelism help with multi-GPU setups? And with the multi-user support, this might actually serve the intended purpose.
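For reference, vLLM exposes multi-GPU tensor parallelism through a single constructor argument (a minimal sketch; the model name and GPU count below are placeholders):

```python
# Sketch: sharding a model across GPUs with vLLM's tensor parallelism.
# Model name and tensor_parallel_size are placeholders for illustration.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # example HF model
    tensor_parallel_size=2,             # split the weights across 2 GPUs
)
```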
I'm not sure how they arrive at those results. Plain HF Transformers can be mighty slow, but you have to really try to make it that slow, I feel. As for vLLM, it's not for quantized models, and as such it's quite a bit slower than ExLlama (or llama.cpp with GPU acceleration, for that matter). If you're deploying a full-precision model to serve inference to multiple clients, it might be very useful, though.
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
Even if we do not consider adding a new model loader for single-user mode, we should consider vLLM now, as it frequently adds support for newly released models like Qwen, and offers both multi-client serving and quantization (AWQ): https://github.com/vllm-project/vllm
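If vLLM were wired in as a loader, loading an AWQ-quantized checkpoint is a one-argument change in its API (a sketch; the model ID is an illustrative AWQ repo, not a specific recommendation):

```python
# Sketch: loading an AWQ-quantized checkpoint with vLLM.
# The model ID is illustrative; any AWQ-quantized HF repo should work.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",               # tell vLLM to use its AWQ kernels
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```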
@oobabooga is this on the roadmap?
It seems it's not coming for now, at least.
This should be reconsidered. The concern about plaguing the codebase with CUDA dependencies is valid, but we should address the design constraints to make this happen rather than close the door entirely on something that could benefit ooga's tool. @oobabooga, what could be the acceptance criteria? Being able to serve, eval, and play at the same time in a friendly ecosystem like ooga's would be very handy.
vLLM has gradually introduced support for GPTQ and AWQ models, with imminent plans to accommodate the as-yet-unmerged QLoRA and QA-LoRA models. Moreover, the acceleration delivered by vLLM is now strikingly evident. Given these developments, I propose considering the incorporation of vLLM support. The project is rapidly evolving and poised for a promising future.
+1 for vLLM
For example, vLLM manages all the tokens' KV caches in blocks, so it can be faster even when the batch size is 1.
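To illustrate the general idea behind block-based KV cache management (this is a toy sketch of the technique, not vLLM's actual implementation): instead of reserving one contiguous region per sequence sized for the maximum length, the cache is carved into fixed-size blocks that are handed out on demand, so memory is only committed as tokens are actually generated.

```python
# Toy illustration of block-based KV cache allocation (NOT vLLM's code).
# Sequences request fixed-size blocks on demand instead of reserving
# max_seq_len worth of cache up front.
BLOCK_SIZE = 16  # tokens per block

class BlockKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, position: int) -> int:
        """Return the physical block that holds the KV entry for `position`."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:       # current blocks are full; grab a new one
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE]

    def free(self, seq_id: int) -> None:
        """Release all blocks of a finished sequence back to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = BlockKVCache(num_blocks=1024)
for pos in range(40):                # generate 40 tokens for sequence 0
    cache.append_token(seq_id=0, position=pos)
print(len(cache.block_tables[0]))    # 3 blocks cover 40 tokens at 16 tokens/block
```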
Yeah, vLLM support should be added.
vLLM is an open-source LLM inference and serving library that accelerates HuggingFace Transformers by 24x and powers Vicuna and Chatbot Arena.
Blog post: https://vllm.ai/
Repo: https://github.com/vllm-project/vllm
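As a rough sketch of the multi-client serving mode described above, a vLLM server can be queried over HTTP once it is running (the launch command is shown as a comment; the model name, port, and prompt are placeholders):

```python
# Sketch: querying a vLLM OpenAI-compatible server from a client.
# Assumes the server was started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
# Model name, port, and prompt are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["text"])
```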