Dynamic Quantization in OpenVINO #25075
Nikitha-Shreyaa asked this question in Q&A (unanswered)
I am trying to run inference for a few LLM models using OpenVINO. My machine supports bfloat16 by default, and when I checked the logs during inference, the models were running at bfloat16. To run inference at fp32 instead, I set "INFERENCE_PRECISION_HINT": "f32".
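(Roughly, this is how I set the hint when compiling; the IR path below is just a placeholder, not my actual setup.)

```python
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # placeholder path to the IR

# Default on this machine: the CPU plugin executes in bf16.
compiled_bf16 = core.compile_model(model, "CPU")

# Force f32 execution instead of the bf16 default.
compiled_f32 = core.compile_model(model, "CPU", {"INFERENCE_PRECISION_HINT": "f32"})
```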
Now I am trying to perform dynamic quantization for the "distilbert-base-uncased-finetuned-sst-2-english" model. After getting the int8 weight-compressed model, should I load it as in case 1
or
as in case 2?
When I checked the logs for both of the above cases, in case 1 I could not see anything related to dynamic quantization, but in case 2 I did. Below is the part of the log where my doubt arises.
With "INFERENCE_PRECISION_HINT": "f32":
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_f32::blocked:ab::f0 wei_u8:a:blocked:AB4b32a4b::f0 bia_f32::blocked:a::f0 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:wei:1 attr-zero-points:wei:1 src_dyn_quant_group_size:32;,,mb6ic768oc768,0.0251465
Without "INFERENCE_PRECISION_HINT": "f32":
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_bf16,forward_inference,src_bf16::blocked:ab::f0 wei_u8:a:blocked:AB16b64a::f0 bia_bf16::blocked:a::f0 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:wei:1 attr-zero-points:wei:1 ,,mb6ic768oc768,0.0180664
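To make the two cases concrete, here is a simplified sketch of the comparison I mean, using optimum-intel (these are not my exact snippets; the output directory name is a placeholder, and I am assuming load_in_8bit=True is how the int8 weight compression was produced):

```python
from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Export to OpenVINO IR with int8 weight-only compression.
int8_model = OVModelForSequenceClassification.from_pretrained(
    model_id, export=True, load_in_8bit=True
)
int8_model.save_pretrained("distilbert_int8")  # placeholder output directory

# Case 1: reload with the CPU plugin defaults (bf16 on this machine).
case1 = OVModelForSequenceClassification.from_pretrained("distilbert_int8")

# Case 2: reload while forcing f32 execution via the precision hint.
case2 = OVModelForSequenceClassification.from_pretrained(
    "distilbert_int8", ov_config={"INFERENCE_PRECISION_HINT": "f32"}
)

inputs = tokenizer("a tiny smoke-test sentence", return_tensors="pt")
print(case1(**inputs).logits)
print(case2(**inputs).logits)
```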
Does dynamic quantization have to be done from fp32 only, and not from bfloat16?
code.txt