
Support Vllm backend #4860

Closed · wants to merge 3 commits

Conversation

@yhyu13 (Contributor) commented Dec 9, 2023

Summary

  1. Update README.
  2. OpenAI extension: default top_k to 1 instead of 0 (0 causes an error for vLLM), i.e. greedy/best-token decoding.
  3. Add a sync vLLM model backend. The async vLLM backend would not cooperate with textgen's current generator implementation, because some extensions (e.g. openai) use uvloop/asyncio, which does not support nested asyncio runs. The sync vLLM backend responds just as fast as the async one while being easier to integrate; if you want the async backend, you may want to use the vLLM API server directly.
  4. To pass vLLM args through server.py, the webui first parses only its known args and ignores the vLLM args; then, when loading the vLLM model, only the vLLM-known args are parsed (see the sketch below). There seems to be no conflict between the two sets.
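A minimal sketch of the two-pass parsing described in item 4, assuming vLLM exposes the EngineArgs.add_cli_args / from_cli_args helpers; the webui flags shown are illustrative, not the full set:

import argparse

from vllm.engine.arg_utils import EngineArgs

# Pass 1: the webui parses only the flags it knows about and ignores the
# rest, so unrecognized vLLM flags on the command line do not raise an error.
webui_parser = argparse.ArgumentParser()
webui_parser.add_argument("--model", type=str)
webui_parser.add_argument("--loader", type=str)
webui_args, _ignored = webui_parser.parse_known_args()

# Pass 2: when the vllm loader is selected, a second parser knows only the
# vLLM engine flags (e.g. --max-model-len, --gpu-memory-utilization).
if webui_args.loader == "vllm":
    vllm_parser = argparse.ArgumentParser()
    EngineArgs.add_cli_args(vllm_parser)
    vllm_args, _ignored = vllm_parser.parse_known_args()
    engine_args = EngineArgs.from_cli_args(vllm_args)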

Checklist:

@yhyu13 (Contributor, Author) commented Dec 9, 2023

Some screenshots of running Xwin-LM-7B-V0.2-AWQ at about 66 t/s on a 3090:

#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate textgen

MODEL_NAME=$1
if [ -z "$MODEL_NAME" ]; then
    MODEL_NAME=Xwin-LM-7B-V0.2-AWQ
    #MODEL_NAME=Qwen-1_8B-Chat
fi
# -m debugpy --listen 0.0.0.0:5678 --wait-for-client 
CUDA_VISIBLE_DEVICES=1 python server.py \
    --model $MODEL_NAME \
    --loader vllm \
    --api-port 5051 \
    --api \
    --listen-port 7681 \
    --trust-remote-code \
    --auto-devices \
    --verbose \
    --seed 34 \
    --max-model-len 4096 \
    | tee ./$MODEL_NAME.log

[Screenshot: Vllm_textgen]

[Screenshot: vllm_openai]

@oobabooga (Owner) commented

What is the advantage of VLLM over the existing loaders in the batch_size=1 use case?

@yhyu13 (Contributor, Author) commented Dec 15, 2023

The main purpose is to have vLLM as a backend to support a wider variety of models. I will test a bit more on AutoAWQ vs vLLM, and native HF transformers vs vLLM, later.

Edit (WIP):
Benchmarked on a single 3090 (Ubuntu, torch 2.1.2, flash-attn 2.3.4, CUDA 12.1).

Prompt in webui:
What will AI be like in the year 1010 A.D? Think step by step

#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate textgen

MODEL_NAME=$1
if [ -z "$MODEL_NAME" ]; then
    MODEL_NAME=Qwen-1_8B-Chat
    #MODEL_NAME=Yi-6B-200K-AWQ
    #MODEL_NAME=Xwin-LM-7B-V0.2-AWQ
fi

# What will AI be like in the year 1010 A.D? Think step by step

CUDA_VISIBLE_DEVICES=0 python server.py \
    --model $MODEL_NAME \
    --loader vllm \
    --api-port 5051 \
    --api \
    --listen-port 7681 \
    --trust-remote-code \
    --auto-devices \
    --verbose \
    --seed 34 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.5 \
    |& tee ./vllm_$MODEL_NAME.log

Model | vLLM | textgen
Xwin-LM-7B-V0.2-AWQ | 101.5 tokens/s | 87.18 tokens/s
Yi-6B-200K-AWQ | 273.9 tokens/s | 91 tokens/s
Qwen-1_8B-Chat | 351.3 tokens/s | 111.62 tokens/s

It seems there is overhead in the generation method of vllm.py that actually bottlenecks the performance; exllamav2.py does it a lot better.

Exllamav2 v0.0.10

Model | tokens/s
Xwin-LM-7B-V0.2-GPTQ | 117.39 tokens/s (textgen)
Yi-6B-200K-GPTQ | 72.53 tokens/s
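For context, a minimal sketch of the kind of synchronous vLLM call the new loader wraps; the model name and sampling values are illustrative, and this is not the PR's actual vllm.py code:

from vllm import LLM, SamplingParams

# Engine arguments mirror the flags used in the launch script above.
llm = LLM(
    model="Xwin-LM-7B-V0.2-AWQ",  # local model folder or HF repo id (illustrative)
    quantization="awq",
    max_model_len=4096,
    gpu_memory_utilization=0.5,
    seed=34,
    trust_remote_code=True,
)

# top_k=1 keeps only the highest-probability token, i.e. greedy decoding,
# matching the OpenAI-extension default mentioned in the summary.
params = SamplingParams(top_k=1, max_tokens=512)
outputs = llm.generate(
    ["What will AI be like in the year 1010 A.D? Think step by step"], params
)
print(outputs[0].outputs[0].text)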

@yhyu13 (Contributor, Author) commented Dec 16, 2023

@oobabooga When vLLM starts up, its logging system actually hides the webui's logging in the terminal. I am not sure what the cause is (has vLLM taken over the stdout stream?); it only happens with vLLM. Both logs are shown when you run python server.py |& tee log.txt.

@yhyu13 (Contributor, Author) commented Dec 17, 2023

@oobabooga FYI, vLLM now supports GPTQ: vllm-project/vllm#916

yhyu13 changed the title from "Support Vllm sync model backend" to "Support Vllm backend" on Dec 18, 2023
@oobabooga (Owner) commented

VLLM adds many very specific and heavy CUDA requirements to the project. I don't want to add it when it does not provide a clear benefit over existing loaders in the batch_size=1 case.

oobabooga closed this on Dec 18, 2023
@nonetrix commented

Has anyone tried this?
https://github.com/EmbeddedLLM/vllm-rocm

@nonetrix commented

Also relevant vllm-project/vllm#1836

@Ph0rk0z (Contributor) commented Jan 25, 2024

I know for sure that regular AutoAWQ has horrible multi-GPU support. vLLM might actually load a split 70B properly; that's the advantage. Now that I see this exists, I might give it a go.
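If anyone wants to try the multi-GPU case, a minimal sketch assuming vLLM's tensor_parallel_size option; the model id is a placeholder, not a real repo:

from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model weights across GPUs.
llm = LLM(
    model="some-org/some-70b-awq-model",  # placeholder id
    quantization="awq",
    tensor_parallel_size=2,               # split across 2 GPUs
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)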

@fblgit commented Feb 7, 2024

I think this fits better as a plugin/extension.
The benefit behind it is clear if you look at evaluation capabilities, API, Multi-User, and overall: performance.

IMHO, this can potentially be a high-throughput inference-server within ooga.. with the flexibility of jumping and interacting with the model that is being served.

@Ph0rk0z (Contributor) commented Feb 7, 2024

Try the PR; it looks fairly easy. I aim to do it at some point.

@Elsayed91 commented
> I think this fits better as a plugin/extension. The benefit behind it is clear if you look at evaluation capabilities, API, Multi-User, and overall: performance.
>
> IMHO, this can potentially be a high-throughput inference-server within ooga.. with the flexibility of jumping and interacting with the model that is being served.

I definitely would have loved such functionality! I used ooga initially to test the best models for my use case and to design the prompt templates, and having to move away from it was such a pain.

Sadly, ooga is not very performant as it is, and I do realize it is not actually meant to be. However, supporting a vLLM backend, which is performant by nature (especially with something like GPTQ quantization), would put batteries on the whole thing: ooga could literally be used as the web server while keeping a front end with the functionality to quickly jump in and test things. It would also make it easy to bring non-devs into the process and use their feedback to improve the product.

Would be great if this could be officially supported; I believe the PR has accomplished most of the required functionality.
