Dynamic Quantization in OpenVINO #25075
Nikitha-Shreyaa asked this question in Q&A (unanswered)
I am trying to run inference for a few LLM models using OpenVINO. My machine supports bfloat16 by default, and when I checked the logs during inference, the models were running at bfloat16. To run inference at fp32 instead, I set "INFERENCE_PRECISION_HINT": "f32".
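(Roughly, this is how I set the hint when compiling; the IR path below is just a placeholder, not my actual setup.)

```python
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # placeholder path to the IR

# Default on this machine: the CPU plugin executes in bf16.
compiled_bf16 = core.compile_model(model, "CPU")

# Force f32 execution instead of the bf16 default.
compiled_f32 = core.compile_model(model, "CPU", {"INFERENCE_PRECISION_HINT": "f32"})
```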
Now I am trying to perform dynamic quantization for the "distilbert-base-uncased-finetuned-sst-2-english" model. After getting the int8 weight-compressed model, should I load it as in case 1
or
as in case 2?
When I checked the logs for both of the above cases, in case 1 I could not see anything related to dynamic quantization, but in case 2 I did. Below is the part of the log where my doubt arises.
With "INFERENCE_PRECISION_HINT": "f32":
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_f32::blocked:ab::f0 wei_u8:a:blocked:AB4b32a4b::f0 bia_f32::blocked:a::f0 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:wei:1 attr-zero-points:wei:1 src_dyn_quant_group_size:32;,,mb6ic768oc768,0.0251465
Without "INFERENCE_PRECISION_HINT": "f32":
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_bf16,forward_inference,src_bf16::blocked:ab::f0 wei_u8:a:blocked:AB16b64a::f0 bia_bf16::blocked:a::f0 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:wei:1 attr-zero-points:wei:1 ,,mb6ic768oc768,0.0180664
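To make the two cases concrete, here is a simplified sketch of the comparison I mean, using optimum-intel (these are not my exact snippets; the output directory name is a placeholder, and I am assuming load_in_8bit=True is how the int8 weight compression was produced):

```python
from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Export to OpenVINO IR with int8 weight-only compression.
int8_model = OVModelForSequenceClassification.from_pretrained(
    model_id, export=True, load_in_8bit=True
)
int8_model.save_pretrained("distilbert_int8")  # placeholder output directory

# Case 1: reload with the CPU plugin defaults (bf16 on this machine).
case1 = OVModelForSequenceClassification.from_pretrained("distilbert_int8")

# Case 2: reload while forcing f32 execution via the precision hint.
case2 = OVModelForSequenceClassification.from_pretrained(
    "distilbert_int8", ov_config={"INFERENCE_PRECISION_HINT": "f32"}
)

inputs = tokenizer("a tiny smoke-test sentence", return_tensors="pt")
print(case1(**inputs).logits)
print(case2(**inputs).logits)
```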
Does dynamic quantization have to be done from fp32 only, and not from bfloat16?
code.txt