vLLM generating repeated/duplicate responses #12276
Unanswered · saraswatmks asked this question in Q&A
I am trying to serve a fine-tuned LLaMA 70B model with vLLM on an A100 80GB GPU. The model is fine-tuned for a chat-assistant use case. I notice that after a 10-15 message conversation, the LLM responses start to repeat themselves. Below is an example:
User: Hello
Assistant: How are you ?
.
// assume 10 - 15 messages in between
.
.
User: How bad is world economy?
Assistant: World economy is really bad these days due to war.
User: What about US economy?
Assistant: World economy is really bad these days due to war.
User: Why are some people good and bad?
Assistant: World economy is really bad these days due to war.
User: Why do men earn higher than women?
Assistant: World economy is really bad these days due to war.
- Currently, I want the server to handle 10 concurrent requests, each having a fixed prompt of 1350 tokens. Therefore, I set the following arguments: `--max-model-len 4096`, `--max-seq-len 10`, `--max-num-batched-tokens 40960`.
- Even with `bitsandbytes` quantization, I still get the repeated responses. I am running the `vllm/vllm-openai:latest` image.
- I tried `--enable-prefix-caching` as well, but it had no effect.
- I tried `repetition_penalty`, `frequency_penalty`, and `presence_penalty`, but still no effect (a sketch of how I pass them is below).
- Note: I don't get this problem when I deploy the model using llama.cpp; everything works fine there.
How can I fix this repetition issue? Looking for some guidance here.