Regarding question 1 (the four-GPU deployment below): although everything appears to be loaded correctly, it still doesn't work and throws an error; there seems to be a timeout issue. What am I missing, please?
We have conducted initial tests on the Aphrodite-engine and are impressed with the results. We are now considering replacing vLLM with Aphrodite-engine for production. However, I have a few questions:
1. We plan to run it on RunPod using this template. To utilise four GPUs, is it sufficient to set NUM_GPUS to 4? We were planning to use turboderp/Llama-3-70B-Instruct-exl2 at 6.0bpw quantisation and were hoping to deploy it on 4 x 16 GB VRAM GPUs (64 GB in total). Is this configuration possible, and is exl2 supported across 4 GPUs? (See the VRAM estimate sketched after this list.)
2. How does Aphrodite-engine manage concurrent API requests: does it batch them and process them sequentially, or handle them in parallel? I read on Reddit that it supports concurrent processing, but reportedly at a 30%-40% quality reduction when asynchronous-concurrent generation is enabled. In our case, maintaining high response quality is critical. Is there an option to queue requests so they run sequentially, or with reduced concurrency, to avoid quality degradation? (A client-side queuing sketch follows this list.)
3. Is it possible to increase Llama-3's context length from 8192 to 9728 by setting CONTEXT_LENGTH to 9728? Aphrodite apparently supports automatic RoPE scaling. Would this adjustment negatively impact response quality? (The scaling arithmetic is sketched below.)
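
Whether question 1's 4 x 16 GB configuration can fit the model is largely arithmetic. A minimal back-of-envelope sketch, assuming a ~70B parameter count and ignoring the engine's own memory reservation and per-GPU fragmentation:

```python
# Back-of-envelope VRAM check for question 1. The parameter count and
# overhead figures are rough assumptions, not measured values.
params = 70e9                  # approx. parameter count of Llama-3-70B
bpw = 6.0                      # exl2 bits per weight
weights_gb = params * bpw / 8 / 1e9       # ~52.5 GB for the weights alone
total_vram_gb = 4 * 16                    # 4 x 16 GB GPUs
headroom_gb = total_vram_gb - weights_gb  # ~11.5 GB left for KV cache,
                                          # activations and CUDA overhead
print(f"weights ~= {weights_gb:.1f} GB, headroom ~= {headroom_gb:.1f} GB")
```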
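On question 2: whatever the server does internally, concurrency can also be capped client-side so responses are generated one at a time. A minimal sketch against an OpenAI-compatible endpoint; the port, API key, and model id here are placeholders, not values taken from the template:

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint and credentials -- adjust to your deployment.
client = AsyncOpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")

# Semaphore(1) forces strictly sequential generation; raise the value
# to allow limited concurrency instead.
limiter = asyncio.Semaphore(1)

async def complete(prompt: str) -> str:
    async with limiter:  # requests queue here rather than running in parallel
        resp = await client.chat.completions.create(
            model="turboderp/Llama-3-70B-Instruct-exl2",  # placeholder model id
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

async def main() -> None:
    prompts = ["First question...", "Second question..."]
    for answer in await asyncio.gather(*(complete(p) for p in prompts)):
        print(answer)

asyncio.run(main())
```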
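On question 3: the RoPE scaling factor implied by the change is just the ratio of the two context lengths. Whether the template derives it automatically from CONTEXT_LENGTH is an assumption on my part:

```python
# Scaling factor implied by extending Llama-3's native context window.
native_ctx = 8192   # Llama-3's trained context length
target_ctx = 9728   # desired context length
factor = target_ctx / native_ctx
print(factor)  # 1.1875 -- a modest stretch; small factors usually cost
               # little quality, but that's an expectation, not a guarantee
```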
Many thanks.