I'm trying to run FP8 inference on Meta-Llama-3-70B-Instruct with this vLLM fork. I successfully launched the server with a command along the following lines:
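(The flags below are a sketch following the fork's documented INC-based FP8 flow; the QUANT_CONFIG path and tensor-parallel size are placeholders for my actual values.)
QUANT_CONFIG=/path/to/maxabs_quant.json \
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --quantization inc \
    --kv-cache-dtype fp8_inc \
    --tensor-parallel-size 8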
However, once inference started, vLLM reported the following error:
ERROR 11-03 05:27:49 async_llm_engine.py:671] Engine iteration timed out. This should never happen!
ERROR 11-03 05:27:49 async_llm_engine.py:56] Engine background task failed
ERROR 11-03 05:27:49 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 11-03 05:27:49 async_llm_engine.py:56] File "/vllm-fork/vllm/engine/async_llm_engine.py", line 644, in run_engine_loop
ERROR 11-03 05:27:49 async_llm_engine.py:56] done, _ = await asyncio.wait(
ERROR 11-03 05:27:49 async_llm_engine.py:56] File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 11-03 05:27:49 async_llm_engine.py:56] return await _wait(fs, timeout, return_when, loop)
ERROR 11-03 05:27:49 async_llm_engine.py:56] File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 11-03 05:27:49 async_llm_engine.py:56] await waiter
ERROR 11-03 05:27:49 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 11-03 05:27:49 async_llm_engine.py:56]
ERROR 11-03 05:27:49 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 11-03 05:27:49 async_llm_engine.py:56]
ERROR 11-03 05:27:49 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 11-03 05:27:49 async_llm_engine.py:56] File "/vllm-fork/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 11-03 05:27:49 async_llm_engine.py:56] return_value = task.result()
ERROR 11-03 05:27:49 async_llm_engine.py:56] File "/vllm-fork/vllm/engine/async_llm_engine.py", line 643, in run_engine_loop
ERROR 11-03 05:27:49 async_llm_engine.py:56] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 11-03 05:27:49 async_llm_engine.py:56] File "/vllm-fork/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 11-03 05:27:49 async_llm_engine.py:56] self._do_exit(exc_type)
ERROR 11-03 05:27:49 async_llm_engine.py:56] File "/vllm-fork/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 11-03 05:27:49 async_llm_engine.py:56] raise asyncio.TimeoutError
ERROR 11-03 05:27:49 async_llm_engine.py:56] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=<bound method...7476881fcd90>>)(<Task finishe...imeoutError()>) at /vllm-fork/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=<bound method...7476881fcd90>>)(<Task finishe...imeoutError()>) at /vllm-fork/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
File "/vllm-fork/vllm/engine/async_llm_engine.py", line 644, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/vllm-fork/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
File "/vllm-fork/vllm/engine/async_llm_engine.py", line 643, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/vllm-fork/vllm/engine/async_timeout.py", line 95, in __aexit__
self._do_exit(exc_type)
File "/vllm-fork/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/vllm-fork/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause
In addition, the warm-up phase with this setup took about 10 hours to complete.
What is the correct way to run FP8 inference with this vLLM fork?
Thank you for the very detailed description! One detail I missed is which branch you used; in any case, please use the habana_main branch, and then you can set the following environment variables:
export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600  # the timeout you are hitting right now; value is in seconds
export VLLM_RPC_TIMEOUT=600000               # a timeout you may hit in the future; value is in milliseconds
You can test your server while skipping the warmup stage via this environment variable:
export VLLM_SKIP_WARMUP=true
This can save you a lot of warmup time. NOTE: we do not recommend running the vLLM server without warmup in a production environment, but this option is good for development and testing.
To summarize, a command like this should help you quickly verify that the configuration is working:
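(A sketch only: the serve flags mirror the launch command above and are assumptions rather than a verified recipe; the three environment variables are the actual recommendation here.)
# skip warmup and raise both timeouts for a quick functional check
VLLM_SKIP_WARMUP=true \
VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 \
VLLM_RPC_TIMEOUT=600000 \
QUANT_CONFIG=/path/to/maxabs_quant.json \
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --quantization inc \
    --kv-cache-dtype fp8_inc \
    --tensor-parallel-size 8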