vLLM generating repeated/duplicate responses #12276
Unanswered · saraswatmks asked this question in Q&A
I am trying to serve a fine-tuned LLaMA 70B model with vLLM on an A100 80GB GPU. The model is fine-tuned for a chat-assistant use case. I notice that after a 10-15 message conversation, the LLM responses start to repeat themselves. Below is an example:
User: Hello
Assistant: How are you ?
.
// assume 10 - 15 messages in between
.
.
User: How bad is world economy?
Assistant: World economy is really bad these days due to war.
User: What about US economy?
Assistant: World economy is really bad these days due to war.
User: Why are some people good and bad?
Assistant: World economy is really bad these days due to war.
User: Why do men earn higher than women?
Assistant: World economy is really bad these days due to war.
- Currently, I want the server to handle 10 concurrent requests, each having a fixed prompt of 1350 tokens. Therefore, I set the following arguments: `--max-model-len 4096`, `--max-seq-len 10`, `--max-num-batched-tokens 40960`.
- Even with `bitsandbytes` quantization, I still get the repeated responses. I am running the `vllm/vllm-openai:latest` image.
- I tried `--enable-prefix-caching` as well, but it had no effect.
- I tried `repetition_penalty`, `frequency_penalty`, and `presence_penalty`, but still no effect (a sketch of how I pass them is below).
- Note: I don't get this problem when I deploy the model using llama.cpp; everything works fine there.
How can I fix this repetition issue? Looking for some guidance here.