
Support Vllm backend #4860

Closed · wants to merge 3 commits

Conversation

@yhyu13 (Contributor) commented Dec 9, 2023

Summary

  1. Update README.
  2. OpenAI extension: default top_k to 1 instead of 0 (0 causes an error for vLLM), i.e. greedy/best-token decoding.
  3. Add a sync vLLM model backend. The async vLLM backend would not cooperate with textgen's current generator implementation, because some extensions (e.g. openai) use uvloop/asyncio, which does not support nested asyncio runs. The sync vLLM backend responds just as fast as the async one while being easier to integrate; if you want the async backend, you may want to use the vLLM API server directly.
  4. To pass vLLM args through server.py, the webui first parses only its known args and ignores the vLLM args; then, when loading the vLLM model, only the vLLM-known args are parsed (see the sketch below). There seems to be no conflict between the two sets.
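A minimal sketch of the two-pass parsing described in item 4, assuming vLLM exposes the EngineArgs.add_cli_args / from_cli_args helpers; the webui flags shown are illustrative, not the full set:

import argparse

from vllm.engine.arg_utils import EngineArgs

# Pass 1: the webui parses only the flags it knows about and ignores the
# rest, so unrecognized vLLM flags on the command line do not raise an error.
webui_parser = argparse.ArgumentParser()
webui_parser.add_argument("--model", type=str)
webui_parser.add_argument("--loader", type=str)
webui_args, _ignored = webui_parser.parse_known_args()

# Pass 2: when the vllm loader is selected, a second parser knows only the
# vLLM engine flags (e.g. --max-model-len, --gpu-memory-utilization).
if webui_args.loader == "vllm":
    vllm_parser = argparse.ArgumentParser()
    EngineArgs.add_cli_args(vllm_parser)
    vllm_args, _ignored = vllm_parser.parse_known_args()
    engine_args = EngineArgs.from_cli_args(vllm_args)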

Checklist:

@yhyu13 (Contributor, Author) commented Dec 9, 2023

Some screenshots of running Xwin-LM-7B-V0.2-AWQ at about 66 t/s on a 3090:

#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate textgen

MODEL_NAME=$1
if [ -z "$MODEL_NAME" ]; then
    MODEL_NAME=Xwin-LM-7B-V0.2-AWQ
    #MODEL_NAME=Qwen-1_8B-Chat
fi
# -m debugpy --listen 0.0.0.0:5678 --wait-for-client 
CUDA_VISIBLE_DEVICES=1 python server.py \
    --model $MODEL_NAME \
    --loader vllm \
    --api-port 5051 \
    --api \
    --listen-port 7681 \
    --trust-remote-code \
    --auto-devices \
    --verbose \
    --seed 34 \
    --max-model-len 4096 \
    | tee ./$MODEL_NAME.log

[Screenshot: Vllm_textgen]

[Screenshot: vllm_openai]

@oobabooga (Owner) commented

What is the advantage of VLLM over the existing loaders in the batch_size=1 use case?

@yhyu13 (Contributor, Author) commented Dec 15, 2023

The main purpose is to have vLLM as a backend to support a wider variety of models. I will test a bit more on AutoAWQ vs vLLM, and native HF transformers vs vLLM, later.

Edit (WIP):
Benchmarked on a single 3090 (Ubuntu, torch 2.1.2, flash-attn 2.3.4, CUDA 12.1).

Prompt in webui:
What will AI be like in the year 1010 A.D? Think step by step

#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate textgen

MODEL_NAME=$1
if [ -z "$MODEL_NAME" ]; then
    MODEL_NAME=Qwen-1_8B-Chat
    #MODEL_NAME=Yi-6B-200K-AWQ
    #MODEL_NAME=Xwin-LM-7B-V0.2-AWQ
fi

# What will AI be like in the year 1010 A.D? Think step by step

CUDA_VISIBLE_DEVICES=0 python server.py \
    --model $MODEL_NAME \
    --loader vllm \
    --api-port 5051 \
    --api \
    --listen-port 7681 \
    --trust-remote-code \
    --auto-devices \
    --verbose \
    --seed 34 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.5 \
    |& tee ./vllm_$MODEL_NAME.log

Model | vLLM | textgen
Xwin-LM-7B-V0.2-AWQ | 101.5 tokens/s | 87.18 tokens/s
Yi-6B-200K-AWQ | 273.9 tokens/s | 91 tokens/s
Qwen-1_8B-Chat | 351.3 tokens/s | 111.62 tokens/s

It seems there is overhead in the generation method of vllm.py that actually bottlenecks the performance; exllamav2.py does it a lot better.

Exllamav2 v0.0.10

Model | tokens/s
Xwin-LM-7B-V0.2-GPTQ | 117.39 tokens/s (textgen)
Yi-6B-200K-GPTQ | 72.53 tokens/s
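For context, a minimal sketch of the kind of synchronous vLLM call the new loader wraps; the model name and sampling values are illustrative, and this is not the PR's actual vllm.py code:

from vllm import LLM, SamplingParams

# Engine arguments mirror the flags used in the launch script above.
llm = LLM(
    model="Xwin-LM-7B-V0.2-AWQ",  # local model folder or HF repo id (illustrative)
    quantization="awq",
    max_model_len=4096,
    gpu_memory_utilization=0.5,
    seed=34,
    trust_remote_code=True,
)

# top_k=1 keeps only the highest-probability token, i.e. greedy decoding,
# matching the OpenAI-extension default mentioned in the summary.
params = SamplingParams(top_k=1, max_tokens=512)
outputs = llm.generate(
    ["What will AI be like in the year 1010 A.D? Think step by step"], params
)
print(outputs[0].outputs[0].text)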

@yhyu13 (Contributor, Author) commented Dec 16, 2023

@oobabooga When vLLM starts up, its logging system actually hides the webui's logging in the terminal. I am not sure what the cause is (has vLLM taken over the stdout stream?); it only happens with vLLM. Both logs are shown when you run python server.py |& tee log.txt.

@yhyu13 (Contributor, Author) commented Dec 17, 2023

@oobabooga FYI, vLLM now supports GPTQ: vllm-project/vllm#916

yhyu13 changed the title from "Support Vllm sync model backend" to "Support Vllm backend" on Dec 18, 2023
@oobabooga (Owner) commented

VLLM adds many very specific and heavy CUDA requirements to the project. I don't want to add it when it does not provide a clear benefit over existing loaders in the batch_size=1 case.

oobabooga closed this on Dec 18, 2023
@nonetrix commented

Has anyone tried this?
https://github.com/EmbeddedLLM/vllm-rocm

@nonetrix commented

Also relevant vllm-project/vllm#1836

@Ph0rk0z (Contributor) commented Jan 25, 2024

I know for sure that regular AutoAWQ has horrible multi-GPU support. vLLM might actually load a split 70B properly; that's the advantage. Now that I see this exists, I might give it a go.
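If anyone wants to try the multi-GPU case, a minimal sketch assuming vLLM's tensor_parallel_size option; the model id is a placeholder, not a real repo:

from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model weights across GPUs.
llm = LLM(
    model="some-org/some-70b-awq-model",  # placeholder id
    quantization="awq",
    tensor_parallel_size=2,               # split across 2 GPUs
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)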

@fblgit commented Feb 7, 2024

I think this fits better as a plugin/extension.
The benefit behind it is clear if you look at evaluation capabilities, API, Multi-User, and overall: performance.

IMHO, this can potentially be a high-throughput inference-server within ooga.. with the flexibility of jumping and interacting with the model that is being served.

@Ph0rk0z (Contributor) commented Feb 7, 2024

Try the PR; it looks fairly easy. I aim to do it at some point.

@Elsayed91 commented
> I think this fits better as a plugin/extension. The benefit behind it is clear if you look at evaluation capabilities, API, Multi-User, and overall: performance.
>
> IMHO, this can potentially be a high-throughput inference-server within ooga.. with the flexibility of jumping and interacting with the model that is being served.

I definitely would have loved such functionality! I used ooga initially to test the best models for my use case and to design the prompt templates, and having to move away from it was such a pain.

Sadly, ooga is not very performant as it is, and I do realize it is not actually meant to be. However, supporting a vLLM backend, which is performant by nature (especially with something like GPTQ quantization), would put batteries on the whole thing: ooga could literally be used as the web server while keeping a front end with the functionality to quickly jump in and test things. It would also make it easy to bring non-devs into the process and use their feedback to improve the product.

Would be great if this could be officially supported; I believe the PR has accomplished most of the required functionality.
