When ipex is set to true on CPU, the model config built at llm-on-ray/llm_on_ray/inference/transformer_predictor.py, line 59 (commit f536304), will be trust_remote_code=False use_auth_token='' load_in_4bit=False torch_dtype=torch.float16 revision=None.
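For context, a minimal sketch of how such a config typically flows into transformers and ipex.llm.optimize on CPU is shown below. This is my own illustration, not the actual code in transformer_predictor.py, and the exact arguments llm-on-ray passes may differ:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

# Load the model with the same kind of config as above (float16 on CPU).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    trust_remote_code=False,
    torch_dtype=torch.float16,  # the dtype that triggers the problem reported below
)

# ipex.llm.optimize is then applied for CPU inference. With a float16 model it
# reportedly warns "fail to apply ipex.llm.optimize due to: Unsupported input type"
# and falls back to the original (unoptimized) model.
model = ipex.llm.optimize(model, dtype=torch.float16, inplace=True)
```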
But when llm_on_ray-serve is executed, the following warning appears:

lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/optimize.py:948: UserWarning: fail to apply ipex.llm.optimize due to: Unsupported input type, fallback to the origin model

And after sending a request, the server reports an error:
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) 2024-04-19 02:06:37,419 - llm_on_ray.inference.predictor_deployment - INFO - Handling dynamic batch (size=1) ...
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) ERROR 2024-04-19 02:06:37,427 llama-2-7b-chat-hf_PredictorDeployment c5fp5i4e 3e35b1cb-a52d-4ebc-aa5c-dce9703fc4b4 /llama-2-7b-chat-hf/llama-2-7b-chat-hf replica.py:352 - Request failed:
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) ray::ServeReplica:llama-2-7b-chat-hf:PredictorDeployment.handle_request_with_rejection() (pid=2799681, ip=10.0.11.2)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/utils.py", line 164, in wrap_to_ray_error
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) raise exception
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 1102, in call_user_method
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) await self._call_func_or_gen(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 828, in _call_func_or_gen
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) result = await result
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 403, in __call__
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return await self.handle_non_streaming(prompts, config)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 220, in handle_non_streaming
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return await self.handle_dynamic_batch((prompts, config))
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/batching.py", line 591, in batch_wrapper
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return await enqueue_request(args, kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/batching.py", line 243, in _assign_func_results
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) results = await func_future
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 249, in handle_dynamic_batch
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) batch_results = self.predictor.generate(prompts, **config)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/transformer_predictor.py", line 113, in generate
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) gen_tokens = self.model.generate(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return func(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/generation/utils.py", line 1719, in generate
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self.sample(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/generation/utils.py", line 2801, in sample
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) outputs = self(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/models.py", line 108, in LlamaForCausalLM_forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) outputs = self.model(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) layer_outputs = decoder_layer(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/modules/decoder.py", line 874, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return LlamaDecoderLayer_forward(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/modules/decoder.py", line 26, in LlamaDecoderLayer_forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) hidden_states = self.input_layernorm(hidden_states)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/cpu/fusions/mha_fusion.py", line 137, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return torch.ops.torch_ipex.rmsnorm(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/_ops.py", line 755, in __call__
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._op(*args, **(kwargs or {}))
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) RuntimeError: Unsupported input type
If I remove the parameter torch_dtype=torch.float16 from model_config, it works fine.
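As a sketch of the workaround (my own illustration, not code from the repo), the dtype override can simply be dropped, or switched to bfloat16, since the failure comes from torch.ops.torch_ipex.rmsnorm rejecting the float16 input:

```python
import torch

# Hypothetical model_config illustrating the workaround; the real dict is built
# inside llm_on_ray/inference/transformer_predictor.py.
model_config = {
    "trust_remote_code": False,
    "use_auth_token": "",
    "load_in_4bit": False,
    # "torch_dtype": torch.float16,  # removing this avoids the rmsnorm failure
    "revision": None,
}

# Untested alternative: keep a reduced-precision dtype but use bfloat16 instead,
# which the IPEX CPU fused kernels are generally intended to support.
model_config["torch_dtype"] = torch.bfloat16
```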
conda env:
model: Llama-2-7b-hf