When ipex is set to true on CPU, the model config built at llm-on-ray/llm_on_ray/inference/transformer_predictor.py, line 59 (commit f536304), will be trust_remote_code=False use_auth_token='' load_in_4bit=False torch_dtype=torch.float16 revision=None.
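For context, a minimal sketch of how such a config typically flows into transformers and ipex.llm.optimize on CPU is shown below. This is my own illustration, not the actual code in transformer_predictor.py, and the exact arguments llm-on-ray passes may differ:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

# Load the model with the same kind of config as above (float16 on CPU).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    trust_remote_code=False,
    torch_dtype=torch.float16,  # the dtype that triggers the problem reported below
)

# ipex.llm.optimize is then applied for CPU inference. With a float16 model it
# reportedly warns "fail to apply ipex.llm.optimize due to: Unsupported input type"
# and falls back to the original (unoptimized) model.
model = ipex.llm.optimize(model, dtype=torch.float16, inplace=True)
```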
But when llm_on_ray-serve is executed, the following warning appears:

lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/optimize.py:948: UserWarning: fail to apply ipex.llm.optimize due to: Unsupported input type, fallback to the origin model

And after sending a request, the server reports an error:
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) 2024-04-19 02:06:37,419 - llm_on_ray.inference.predictor_deployment - INFO - Handling dynamic batch (size=1) ...
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) ERROR 2024-04-19 02:06:37,427 llama-2-7b-chat-hf_PredictorDeployment c5fp5i4e 3e35b1cb-a52d-4ebc-aa5c-dce9703fc4b4 /llama-2-7b-chat-hf/llama-2-7b-chat-hf replica.py:352 - Request failed:
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) ray::ServeReplica:llama-2-7b-chat-hf:PredictorDeployment.handle_request_with_rejection() (pid=2799681, ip=10.0.11.2)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/utils.py", line 164, in wrap_to_ray_error
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) raise exception
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 1102, in call_user_method
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) await self._call_func_or_gen(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 828, in _call_func_or_gen
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) result = await result
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 403, in __call__
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return await self.handle_non_streaming(prompts, config)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 220, in handle_non_streaming
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return await self.handle_dynamic_batch((prompts, config))
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/batching.py", line 591, in batch_wrapper
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return await enqueue_request(args, kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/batching.py", line 243, in _assign_func_results
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) results = await func_future
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 249, in handle_dynamic_batch
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) batch_results = self.predictor.generate(prompts, **config)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/transformer_predictor.py", line 113, in generate
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) gen_tokens = self.model.generate(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return func(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/generation/utils.py", line 1719, in generate
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self.sample(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/generation/utils.py", line 2801, in sample
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) outputs = self(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/models.py", line 108, in LlamaForCausalLM_forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) outputs = self.model(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) layer_outputs = decoder_layer(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/modules/decoder.py", line 874, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return LlamaDecoderLayer_forward(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/modules/decoder.py", line 26, in LlamaDecoderLayer_forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) hidden_states = self.input_layernorm(hidden_states)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/cpu/fusions/mha_fusion.py", line 137, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return torch.ops.torch_ipex.rmsnorm(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/_ops.py", line 755, in __call__
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._op(*args, **(kwargs or {}))
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) RuntimeError: Unsupported input type
If I remove the parameter torch_dtype=torch.float16 from model_config, it works fine.
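As a sketch of the workaround (my own illustration, not code from the repo), the dtype override can simply be dropped, or switched to bfloat16, since the failure comes from torch.ops.torch_ipex.rmsnorm rejecting the float16 input:

```python
import torch

# Hypothetical model_config illustrating the workaround; the real dict is built
# inside llm_on_ray/inference/transformer_predictor.py.
model_config = {
    "trust_remote_code": False,
    "use_auth_token": "",
    "load_in_4bit": False,
    # "torch_dtype": torch.float16,  # removing this avoids the rmsnorm failure
    "revision": None,
}

# Untested alternative: keep a reduced-precision dtype but use bfloat16 instead,
# which the IPEX CPU fused kernels are generally intended to support.
model_config["torch_dtype"] = torch.bfloat16
```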
conda env:
model: Llama-2-7b-hf