System Info
I am using Next.js 15.
Environment/Platform
Description
I have been converting Llama-3.2-1B-Instruct using the official conversion scripts provided in this repo, and I noticed a significant performance drop in my conversion compared to the official one. Both use the same q4f16 quantization and otherwise identical settings; the outputs below are each model's answer to the simple prompt "tell me a joke."
official conversion - onnx-community/Llama-3.2-1B-Instruct-q4f16
another (official?) conversion with the same performance drop - https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct
my own conversion through the scripts
How can I fix this? What is wrong with my conversion?
Reproduction
Run the conversion script with `python -m scripts.convert --quantize --model_id meta-llama/Llama-3.2-1B-Instruct`. Then upload the converted model to the Hugging Face Hub and generate with it using q4f16 quantization; the results differ noticeably from onnx-community/Llama-3.2-1B-Instruct-q4f16.
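For completeness, this is roughly how generation is run in the Next.js app (a minimal sketch using the Transformers.js v3 pipeline API; `your-username/Llama-3.2-1B-Instruct` is a placeholder for the uploaded repo id):

```js
import { pipeline } from "@huggingface/transformers";

// Load the uploaded model with the same q4f16 quantization used by the
// official onnx-community conversion. The repo id below is a placeholder.
const generator = await pipeline(
  "text-generation",
  "your-username/Llama-3.2-1B-Instruct",
  { dtype: "q4f16" },
);

// Same prompt used in the comparison above.
const messages = [{ role: "user", content: "tell me a joke" }];
const output = await generator(messages, { max_new_tokens: 128 });

// The returned conversation includes the assistant reply as the last message.
console.log(output[0].generated_text.at(-1).content);
```

Swapping only the repo id between the official q4f16 model and my own conversion, with everything else unchanged, is enough to see the difference in output quality.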