
Build Qwen2-72B-Instruct model by INT4-AWQ quantization failed #2445

Open · wangpeilin opened this issue Nov 14, 2024 · 0 comments
Labels: bug (Something isn't working)

System Info

Ubuntu 20.04
NVIDIA A100
nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3 and 24.07
TensorRT-LLM v0.14.0 and v0.11.0

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Start the container:
     docker run -itd --name xxx --gpus=all -p8000:8000 -p8001:8001 -p8002:8002 -v /share/datasets:/share/datasets nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
  2. Check out the code (version 0.14.0):
     git clone https://github.com/NVIDIA/TensorRT-LLM.git
     git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
  3. Quantize the checkpoint (the relevant weight dimension is shown in the sketch after this list):
     cd TensorRT-LLM/examples
     python3 ./quantization/quantize.py \
       --model_dir /path/Qwen_Qwen2-72B-Instruct/ \
       --output_dir /path/Qwen_Qwen2-72B-Instruct_int4_awq_4gpu \
       --dtype bfloat16 \
       --qformat int4_awq \
       --awq_block_size 128 \
       --calib_size 32 \
       --tp_size 4
  4. Build the engines:
     trtllm-build \
       --checkpoint_dir /path/Qwen_Qwen2-72B-Instruct_int4_awq_4gpu/ \
       --output_dir triton_model_repo/Qwen_Qwen2-72B-Instruct_int4_awq/tensorrt_llm/1/ \
       --gemm_plugin auto
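
As a quick sanity check on the dimension involved, here is a minimal sketch (it assumes the Hugging Face config at the model path is readable and that `transformers` is installed; the path is the same placeholder used in the steps above):

```python
from transformers import AutoConfig

# Read the model config from the same placeholder path as the repro steps.
cfg = AutoConfig.from_pretrained("/path/Qwen_Qwen2-72B-Instruct/")
tp_size = 4

# Under tensor parallelism, each rank gets intermediate_size / tp_size
# columns of the FFN weights.
print(cfg.intermediate_size)            # 29568 for Qwen2-72B
print(cfg.intermediate_size / tp_size)  # 7392.0 columns per rank
```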

Expected behavior

The model converts successfully to a quantized checkpoint and TensorRT engines.

actual behavior

When I set tp_size=4 and awq_block_size=128 or 64, quantize.py fails with error1: "Weight shape is not divisible for block size for block quantization."
When I set tp_size=4 and awq_block_size=32 or 16, quantize.py (step 3) succeeds, but trtllm-build fails with error2.

error1
[screenshot: quantize.py traceback, "Weight shape is not divisible for block size for block quantization."]

error2
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 4096 and num_col_bytes = 3696. (/workspace/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:279)

additional notes

This issue seems to be due to the weight shapes of the Qwen2-72B model; I built INT4-AWQ quantized engines for Qwen1.5-72B and Llama-3-70B successfully with the same settings.
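
To make the weight-shape hypothesis concrete, here is a minimal divisibility check (it assumes Qwen2-72B's intermediate_size of 29568 from its Hugging Face config; the numbers are illustrative arithmetic, not taken from TensorRT-LLM internals):

```python
intermediate_size = 29568                # Qwen2-72B FFN width (per its HF config)
tp_size = 4
per_rank = intermediate_size // tp_size  # 7392 columns per TP rank

# error1: block quantization needs the per-rank shard width to be a
# multiple of awq_block_size.
for block in (128, 64, 32, 16):
    print(block, per_rank % block == 0)
# 128 -> False (7392 / 128 = 57.75), 64 -> False (7392 / 64 = 115.5)
# 32  -> True,  16 -> True, matching where quantize.py succeeds

# error2: INT4 packs two weights per byte, so the packed row width is
# 7392 // 2 = 3696 bytes, matching num_col_bytes in the assertion,
# and 3696 is not a multiple of 32.
print(per_rank // 2, (per_rank // 2) % 32)  # 3696 16
```

By contrast, per their Hugging Face configs, Qwen1.5-72B (intermediate_size 24576) and Llama-3-70B (28672) shard evenly at tp_size=4: 24576 / 4 = 6144 and 28672 / 4 = 7168, both multiples of 128, which would explain why those builds succeed.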
