vLLM API server version 0.6.3.dev588+g1033c3eb ERROR logs:
2024-11-06T06:07:44.937593775Z INFO 11-06 06:07:44 api_server.py:529] args: Namespace(host='0.0.0.0', port=80, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Intel/neural-chat-7b-v3-3', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', weights_load_device=None, config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=128, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, use_padding_aware_scheduling=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_num_prefill_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=2048, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
2024-11-06T06:07:44.945568835Z INFO 11-06 06:07:44 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/4a9933a4-7df0-4f34-8fab-7e282d15998b for IPC Path.
2024-11-06T06:07:44.947903590Z INFO 11-06 06:07:44 api_server.py:179] Started engine process with PID 76
2024-11-06T06:07:45.754159186Z INFO 11-06 06:07:45 config.py:1684] For HPU, we cast models to bfloat16 instead of using float16 by default. Please specify `dtype` if you want to use float16.
2024-11-06T06:07:45.754172719Z WARNING 11-06 06:07:45 config.py:1710] Casting torch.float16 to torch.bfloat16.
2024-11-06T06:07:48.321337853Z INFO 11-06 06:07:48 config.py:1684] For HPU, we cast models to bfloat16 instead of using float16 by default. Please specify `dtype` if you want to use float16.
2024-11-06T06:07:48.321961332Z WARNING 11-06 06:07:48 config.py:1710] Casting torch.float16 to torch.bfloat16.
2024-11-06T06:07:51.591056452Z INFO 11-06 06:07:51 llm_engine.py:238] Initializing an LLM engine (v0.6.3.dev588+g1033c3eb) with config: model='Intel/neural-chat-7b-v3-3', speculative_config=None, tokenizer='Intel/neural-chat-7b-v3-3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, weights_load_device=hpu, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Intel/neural-chat-7b-v3-3, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
2024-11-06T06:07:51.953135882Z WARNING 11-06 06:07:51 utils.py:809] Pin memory is not supported on HPU.
2024-11-06T06:07:51.954068705Z INFO 11-06 06:07:51 selector.py:146] Using HPUAttention backend.
2024-11-06T06:07:51.955959907Z INFO 11-06 06:07:51 hpu_model_runner.py:126] VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
2024-11-06T06:07:51.955977928Z INFO 11-06 06:07:51 hpu_model_runner.py:126] VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
2024-11-06T06:07:51.956000940Z INFO 11-06 06:07:51 hpu_model_runner.py:126] VLLM_PROMPT_BS_BUCKET_MAX=256 (default:256)
2024-11-06T06:07:51.956030106Z INFO 11-06 06:07:51 hpu_model_runner.py:126] VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
2024-11-06T06:07:51.956048597Z INFO 11-06 06:07:51 hpu_model_runner.py:126] VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
2024-11-06T06:07:51.956063713Z INFO 11-06 06:07:51 hpu_model_runner.py:126] VLLM_DECODE_BS_BUCKET_MAX=256 (default:256)
2024-11-06T06:07:51.956092830Z INFO 11-06 06:07:51 hpu_model_runner.py:126] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
2024-11-06T06:07:51.956107333Z INFO 11-06 06:07:51 hpu_model_runner.py:126] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
2024-11-06T06:07:51.956126117Z INFO 11-06 06:07:51 hpu_model_runner.py:126] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
2024-11-06T06:07:51.956152364Z INFO 11-06 06:07:51 hpu_model_runner.py:126] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
2024-11-06T06:07:51.956169377Z INFO 11-06 06:07:51 hpu_model_runner.py:126] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
2024-11-06T06:07:51.956184949Z INFO 11-06 06:07:51 hpu_model_runner.py:126] VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:4096)
2024-11-06T06:07:51.956211634Z INFO 11-06 06:07:51 hpu_model_runner.py:791] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 256], seq:[128, 128, 1024]
2024-11-06T06:07:51.956232964Z INFO 11-06 06:07:51 hpu_model_runner.py:796] Decode bucket config (min, step, max_warmup) bs:[1, 32, 256], block:[128, 128, 4096]
2024-11-06T06:07:54.963275449Z ============================= HABANA PT BRIDGE CONFIGURATION ===========================
2024-11-06T06:07:54.963298190Z PT_HPU_LAZY_MODE = 1
2024-11-06T06:07:54.963301336Z PT_RECIPE_CACHE_PATH =
2024-11-06T06:07:54.963303909Z PT_CACHE_FOLDER_DELETE = 0
2024-11-06T06:07:54.963306020Z PT_HPU_RECIPE_CACHE_CONFIG =
2024-11-06T06:07:54.963308151Z PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
2024-11-06T06:07:54.963310288Z PT_HPU_LAZY_ACC_PAR_MODE = 1
2024-11-06T06:07:54.963313898Z PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
2024-11-06T06:07:54.963316883Z ---------------------------: System Configuration :---------------------------
2024-11-06T06:07:54.963334143Z Num CPU Cores : 160
2024-11-06T06:07:54.963342494Z CPU RAM : 1056375272 KB
2024-11-06T06:07:54.963345045Z ------------------------------------------------------------------------------
2024-11-06T06:07:55.325741089Z INFO 11-06 06:07:55 selector.py:146] Using HPUAttention backend.
2024-11-06T06:07:55.351805323Z INFO 11-06 06:07:55 loader.py:405] Loading weights on hpu...
2024-11-06T06:07:55.523109974Z INFO 11-06 06:07:55 weight_utils.py:243] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading pt checkpoint shards: 50% Completed | 1/2 [00:03<00:03, 3.65s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:09<00:00, 4.96s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:09<00:00, 4.76s/it]
2024-11-06T06:12:01.584748470Z
2024-11-06T06:12:01.712776811Z INFO 11-06 06:12:01 hpu_model_runner.py:677] Pre-loading model weights on hpu:0 took 13.51 GiB of device memory (13.51 GiB/94.62 GiB used) and 10.13 GiB of host memory (101.3 GiB/1007 GiB used)
2024-11-06T06:12:01.879108760Z INFO 11-06 06:12:01 hpu_model_runner.py:742] Wrapping in HPU Graph took 0 B of device memory (13.51 GiB/94.62 GiB used) and 0 B of host memory (101.3 GiB/1007 GiB used)
2024-11-06T06:12:01.955971188Z INFO 11-06 06:12:01 hpu_model_runner.py:746] Loading model weights took in total 13.51 GiB of device memory (13.51 GiB/94.62 GiB used) and 10.13 GiB of host memory (101.3 GiB/1007 GiB used)
2024-11-06T06:12:02.122943905Z Process SpawnProcess-1:
2024-11-06T06:12:02.124353987Z Traceback (most recent call last):
2024-11-06T06:12:02.124418381Z File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
2024-11-06T06:12:02.124431776Z self.run()
2024-11-06T06:12:02.124434528Z File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
2024-11-06T06:12:02.124437394Z self._target(*self._args, **self._kwargs)
2024-11-06T06:12:02.124440431Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 394, in run_mp_engine
2024-11-06T06:12:02.124443967Z engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
2024-11-06T06:12:02.124446295Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 141, in from_engine_args
2024-11-06T06:12:02.124448424Z return cls(
2024-11-06T06:12:02.124450915Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 78, in __init__
2024-11-06T06:12:02.124452957Z self.engine = LLMEngine(*args,
2024-11-06T06:12:02.124455140Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/llm_engine.py", line 351, in __init__
2024-11-06T06:12:02.124457352Z self._initialize_kv_caches()
2024-11-06T06:12:02.124459464Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/llm_engine.py", line 486, in _initialize_kv_caches
2024-11-06T06:12:02.124461680Z self.model_executor.determine_num_available_blocks())
2024-11-06T06:12:02.124463829Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/executor/hpu_executor.py", line 84, in determine_num_available_blocks
2024-11-06T06:12:02.124465962Z return self.driver_worker.determine_num_available_blocks()
2024-11-06T06:12:02.124472282Z File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-11-06T06:12:02.124483563Z return func(*args, **kwargs)
2024-11-06T06:12:02.124485860Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_worker.py", line 180, in determine_num_available_blocks
2024-11-06T06:12:02.124487890Z self.model_runner.profile_run()
2024-11-06T06:12:02.124490049Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_model_runner.py", line 1451, in profile_run
2024-11-06T06:12:02.124492106Z self.warmup_scenario(max_batch_size, max_seq_len, True, kv_caches,
2024-11-06T06:12:02.124494159Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_model_runner.py", line 1523, in warmup_scenario
2024-11-06T06:12:02.124496750Z self.execute_model(inputs, kv_caches, warmup_mode=True)
2024-11-06T06:12:02.124498854Z File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-11-06T06:12:02.124500918Z return func(*args, **kwargs)
2024-11-06T06:12:02.124503086Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_model_runner.py", line 2134, in execute_model
2024-11-06T06:12:02.124505077Z hidden_states = self.model.forward(
2024-11-06T06:12:02.124506999Z File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 716, in forward
2024-11-06T06:12:02.124508957Z return wrapped_hpugraph_forward(
2024-11-06T06:12:02.124511111Z File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 570, in wrapped_hpugraph_forward
2024-11-06T06:12:02.124513092Z return orig_fwd(*args, **kwargs)
2024-11-06T06:12:02.124518413Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_model_runner.py", line 387, in forward
2024-11-06T06:12:02.124520552Z hidden_states = self.model(*args, **kwargs)
2024-11-06T06:12:02.124522575Z File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
2024-11-06T06:12:02.124524603Z return self._call_impl(*args, **kwargs)
2024-11-06T06:12:02.124526561Z File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1523, in _call_impl
2024-11-06T06:12:02.124528467Z return forward_call(*args, **kwargs)
2024-11-06T06:12:02.124530484Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 566, in forward
2024-11-06T06:12:02.124532466Z model_output = self.model(input_ids, positions, kv_caches,
2024-11-06T06:12:02.124534463Z File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
2024-11-06T06:12:02.124536495Z return self._call_impl(*args, **kwargs)
2024-11-06T06:12:02.124538647Z File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
2024-11-06T06:12:02.124541079Z result = forward_call(*args, **kwargs)
2024-11-06T06:12:02.124543216Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 352, in forward
2024-11-06T06:12:02.124545224Z hidden_states, residual = layer(positions, hidden_states,
2024-11-06T06:12:02.124549176Z File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
2024-11-06T06:12:02.124551285Z return self._call_impl(*args, **kwargs)
2024-11-06T06:12:02.124553290Z File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
2024-11-06T06:12:02.124558764Z result = forward_call(*args, **kwargs)
2024-11-06T06:12:02.124560997Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 261, in forward
2024-11-06T06:12:02.124564877Z hidden_states = self.self_attn(positions=positions,
2024-11-06T06:12:02.124566833Z File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
2024-11-06T06:12:02.124568776Z return self._call_impl(*args, **kwargs)
2024-11-06T06:12:02.124570827Z File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
2024-11-06T06:12:02.124572827Z result = forward_call(*args, **kwargs)
2024-11-06T06:12:02.124574956Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 191, in forward
2024-11-06T06:12:02.124577015Z attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
2024-11-06T06:12:02.124579095Z File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
2024-11-06T06:12:02.124581179Z return self._call_impl(*args, **kwargs)
2024-11-06T06:12:02.124583093Z File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
2024-11-06T06:12:02.124585070Z result = forward_call(*args, **kwargs)
2024-11-06T06:12:02.124587455Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/attention/layer.py", line 100, in forward
2024-11-06T06:12:02.124589583Z return self.impl.forward(query,
2024-11-06T06:12:02.124591560Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/attention/backends/hpu_attn.py", line 208, in forward
2024-11-06T06:12:02.124593825Z out = ops.prompt_attention(
2024-11-06T06:12:02.124595895Z File "/usr/local/lib/python3.10/dist-packages/vllm_hpu_extension/ops.py", line 226, in prompt_attention
2024-11-06T06:12:02.124597898Z attn_weights = FusedSDPA.apply(query, key, value, None, 0.0, True,
2024-11-06T06:12:02.124600173Z File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
2024-11-06T06:12:02.124602156Z return super().apply(*args, **kwargs) # type: ignore[misc]
2024-11-06T06:12:02.124604321Z TypeError: FusedSDPA.forward() takes from 4 to 9 positional arguments but 12 were given
2024-11-06T06:12:10.476259445Z Traceback (most recent call last):
2024-11-06T06:12:10.476285257Z File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-11-06T06:12:10.476288074Z return _run_code(code, main_globals, None,
2024-11-06T06:12:10.476290838Z File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-11-06T06:12:10.476293813Z exec(code, run_globals)
2024-11-06T06:12:10.476297432Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 585, in <module>
2024-11-06T06:12:10.476319911Z uvloop.run(run_server(args))
2024-11-06T06:12:10.476338742Z File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
2024-11-06T06:12:10.476343922Z return loop.run_until_complete(wrapper())
2024-11-06T06:12:10.476346216Z File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
2024-11-06T06:12:10.476427701Z File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
2024-11-06T06:12:10.476454277Z return await main
2024-11-06T06:12:10.476457891Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 552, in run_server
2024-11-06T06:12:10.476523987Z async with build_async_engine_client(args) as engine_client:
2024-11-06T06:12:10.476531429Z File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
2024-11-06T06:12:10.476555838Z return await anext(self.gen)
2024-11-06T06:12:10.476559711Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
2024-11-06T06:12:10.476583794Z async with build_async_engine_client_from_engine_args(
2024-11-06T06:12:10.476589087Z File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
2024-11-06T06:12:10.476613263Z return await anext(self.gen)
2024-11-06T06:12:10.476627725Z File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
2024-11-06T06:12:10.476633460Z raise RuntimeError(
2024-11-06T06:12:10.476638090Z RuntimeError: Engine process failed to start
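For context on the `TypeError` at the end of the log: `torch.autograd.Function` raises exactly this message when `apply()` is called with more positional arguments than the installed `forward()` accepts. Here the `FusedSDPA.apply(...)` call in `vllm_hpu_extension/ops.py` ends up passing 12 arguments (including the implicit `ctx`) to a `forward()` that only takes up to 9. A minimal, self-contained sketch that reproduces the same arity mismatch (a toy class, not the real Habana FusedSDPA; every name in it is illustrative):

```python
# Toy reproduction of the arity mismatch (illustrative only; not the Habana kernel).
import torch
import torch.nn.functional as F

class ToySDPA(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v, attn_mask=None, dropout_p=0.0, is_causal=True,
                scale=None, softmax_mode=None):
        # 9 positional parameters including ctx -> "takes from 4 to 9"
        return F.scaled_dot_product_attention(
            q, k, v, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal)

q = k = v = torch.randn(1, 2, 4, 8)
try:
    # A caller built against a newer kernel signature passes extra positional
    # arguments (11 here; the implicit ctx makes 12).
    ToySDPA.apply(q, k, v, None, 0.0, True, None, None, False, None, "right")
except TypeError as e:
    print(e)  # ToySDPA.forward() takes from 4 to 9 positional arguments but 12 were given
```

In other words, the newer vLLM/extension code and the installed habana-frameworks FusedSDPA kernel appear to disagree on the attention call signature.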
Your current environment
Gaudi2H
Model Input Dumps
Startup issues.
🐛 Describe the bug
vLLM API server version 0.6.3.dev563+ga5136ec1 started up fine on the morning of 6 Nov.
By the afternoon it no longer worked, and I noticed the vLLM API server version had changed (to 0.6.3.dev588+g1033c3eb).
I guess this is related to commits merged in the last two days.
vLLM API server version 0.6.3.dev563+ga5136ec1 Correct Logs:
vLLM API server version 0.6.3.dev588+g1033c3eb ERROR logs: (the full log is pasted at the top of this issue)
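As a quick diagnostic, the snippet below prints which positional parameters the installed `FusedSDPA.forward()` actually accepts, to confirm whether the installed habana-frameworks kernel is older than what the latest vllm_hpu_extension commit expects. The import path is an assumption (the traceback does not show where `ops.py` imports FusedSDPA from), so adjust it for your build:

```python
# Diagnostic sketch; the import path below is assumed, not taken from the traceback.
import inspect
from habana_frameworks.torch.hpex.kernels import FusedSDPA

params = list(inspect.signature(FusedSDPA.forward).parameters)
print(f"FusedSDPA.forward() accepts {len(params)} positional parameters: {params}")
```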