relax UnsqueezeBroadcastReshapeSDPAFusion with no need to ask query … #27515

ceciliapeng2011 · 2024-11-12T08:08:10Z

…branch from non-reshape.

Details:

Improve the 2nd-token latency for the GLM4 model.

Tickets:

157261


TianmengChen · 2024-11-13T12:42:32Z

Hi Cecilia,
We tried to run glm4v with this PR but got the following error log:

Traceback (most recent call last):
  File "C:\chen\zhipu\glm4v\test_v_ov.py", line 20, in <module>
    model = OvGLM4v(MODEL_PATH, "GPU")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\chen\zhipu\glm4v\glm4v_helper.py", line 407, in __init__
    compiled_model = core.compile_model(self.model, device)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\chen\genai\Lib\site-packages\openvino\runtime\ie_api.py", line 543, in compile_model
    super().compile_model(model, device_name, {} if config is None else config),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Exception from src\inference\src\cpp\core.cpp:107:
Exception from src\inference\src\dev\plugin.cpp:53:
Exception from src\core\src\pass\graph_rewrite.cpp:295:
[MATCHER] UnsqueezeBroadcastReshapeSDPAFusionnode: gpu_opset::SDPA __module.model.encoder.layers.0.self_attention.core_attention/aten::scaled_dot_product_attention/ScaledDotProductAttention (opset1::Reshape aten::flatten/Reshape[0]:f16[?,16,?,128], opset1::Reshape __module.model.encoder.layers.0.self_attention/aten::view/Reshape_2[0]:f16[?,16,?,128], opset1::Reshape __module.model.encoder.layers.0.self_attention/aten::view/Reshape_4[0]:f16[?,16,?,128], opset1::Select __module.model.encoder.layers.0.self_attention.core_attention/aten::masked_fill_/Select[0]:f16[?,1,?,?]) -> (f16[?,16,?,128]) callback has thrown: Check 'value_input_correctness' failed at src\core\shape_inference\include\scaled_dot_product_attention_shape_inference.hpp:68:
While validating node 'gpu_opset::SDPA SDPA_58065 (opset1::Reshape aten::flatten/Reshape[0]:f16[?,16,?,128], gpu_opset::KVCache __module.model.encoder.layers.0.self_attention/aten::cat/Concat[0]:f16[?,?,?,?], gpu_opset::KVCache __module.model.encoder.layers.0.self_attention/aten::cat/Concat_1[0]:f16[?,4,?,128], opset1::Select __module.model.encoder.layers.0.self_attention.core_attention/aten::masked_fill_/Select[0]:f16[?,1,?,?]) -> ()' with friendly_name 'SDPA_58065':
Shape inference input shapes {[?,16,?,128],[?,?,?,?],[?,4,?,128]}
Value input shape not compatible with other inputs.

Could you have a look? Thank you.
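For readers following the log: the failing check compares the query shape `[?,16,?,128]` against a value shape of `[?,4,?,128]`, i.e. 16 query heads against 4 KV heads. The sketch below is a simplified model of that comparison, not OpenVINO's actual shape-inference code; the helper names (`dim_compatible`, `incompatible_axes`, `expand_kv_heads`) are hypothetical, and it assumes a plain per-axis equality check with `None` standing in for a dynamic `?` dimension and no broadcasting.

```python
from typing import List, Optional, Sequence

# None stands for a dynamic dimension, printed as "?" in the log above.
Dim = Optional[int]


def dim_compatible(a: Dim, b: Dim) -> bool:
    """Two dimensions are compatible if either is dynamic or both are equal."""
    return a is None or b is None or a == b


def incompatible_axes(lhs: Sequence[Dim], rhs: Sequence[Dim]) -> List[int]:
    """Return the axes on which two same-rank partial shapes cannot match."""
    if len(lhs) != len(rhs):
        raise ValueError("rank mismatch")
    return [i for i, (a, b) in enumerate(zip(lhs, rhs)) if not dim_compatible(a, b)]


def expand_kv_heads(shape: Sequence[Dim], num_query_heads: int) -> List[Dim]:
    """Shape a KV tensor would have once each KV head is repeated to match
    the query heads (the effect of an Unsqueeze -> Broadcast -> Reshape chain)."""
    batch, kv_heads, seq_len, head_size = shape
    if kv_heads is None or num_query_heads % kv_heads != 0:
        raise ValueError("KV head count must be static and divide the query head count")
    return [batch, num_query_heads, seq_len, head_size]


# Shapes copied from the "Shape inference input shapes" line of the log.
query = [None, 16, None, 128]
key = [None, None, None, None]
value = [None, 4, None, 128]

print(incompatible_axes(query, key))    # fully dynamic key matches anything
print(incompatible_axes(query, value))  # head axis (index 1) conflicts: 16 vs 4
print(incompatible_axes(query, expand_kv_heads(value, 16)))
```

Under these assumptions only the head axis of the value input conflicts with the query, which is consistent with the "Value input shape not compatible with other inputs" message.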


ceciliapeng2011 requested review from a team as code owners November 12, 2024 08:08

github-actions bot added the category: GPU OpenVINO GPU plugin label Nov 12, 2024

relax UnsqueezeBroadcastReshapeSDPAFusion with no need to ask query …

e6e560d

…branch from non-reshape.

p-durandin added do_not_merge and removed do_not_merge labels Nov 13, 2024

make shape_infer of kvcache and sdpa more robust. There seen a case w…

91205cd

…hen read_value of init shape full 0.

