System Info
Ubuntu 20.04
NVIDIA A100
nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3 and 24.07
TensorRT-LLM v0.14.0 and v0.11.0
Who can help?
@Tracin
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. git clone https://github.com/NVIDIA/TensorRT-LLM.git
2. git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
3. python3 ./quantization/quantize.py \
   --model_dir /path/Qwen_Qwen2-72B-Instruct/ \
   --output_dir /path/Qwen_Qwen2-72B-Instruct_int4_awq_4gpu \
   --dtype bfloat16 \
   --qformat int4_awq \
   --awq_block_size 128 \
   --calib_size 32 \
   --tp_size 4
4. trtllm-build \
   --checkpoint_dir /path/Qwen_Qwen2-72B-Instruct_int4_awq_4gpu/ \
   --output_dir triton_model_repo/Qwen_Qwen2-72B-Instruct_int4_awq/tensorrt_llm/1/ \
   --gemm_plugin auto
Expected behavior
The model converts successfully to a quantized checkpoint and TensorRT engines.
actual behavior
With tp_size=4 and awq_block_size=128 or 64, quantize.py fails with error1.
With tp_size=4 and awq_block_size=32 or 16, step 3 (quantize.py) runs successfully, but trtllm-build fails with error2.
error1:
Weight shape is not divisible for block size for block quantization.
error2:
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 4096 and num_col_bytes = 3696. (/workspace/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:279)
additional notes
This issue appears to be caused by the weight shapes of the Qwen2-72B model; quantizing and building Qwen1.5-72B and Llama-3-70B with the same settings succeeds.
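The following is a minimal sketch of the arithmetic that seems to be behind both errors. It assumes Qwen2-72B's intermediate_size of 29568 (taken from the model's Hugging Face config) and simply reproduces the divisibility and 32-byte alignment conditions implied by the two error messages; it is a standalone illustration, not TensorRT-LLM code.

```python
# Sketch of the divisibility arithmetic behind error1 and error2.
# Assumption: Qwen2-72B uses intermediate_size = 29568 (from its HF config).

intermediate_size = 29568                      # Qwen2-72B MLP width (assumed)
tp_size = 4
per_rank_cols = intermediate_size // tp_size   # 7392 columns per TP rank

for awq_block_size in (128, 64, 32, 16):
    # error1 condition: each TP shard must split evenly into AWQ blocks
    block_ok = per_rank_cols % awq_block_size == 0
    print(f"block_size={awq_block_size:>3}: shard {per_rank_cols} "
          f"{'divides' if block_ok else 'does NOT divide'} into blocks")

# error2 condition: the CUTLASS preprocessor asserts that packed row/col
# byte counts are multiples of 32; 7392 int4 values pack into 3696 bytes.
packed_col_bytes = per_rank_cols // 2          # 2 int4 weights per byte
print(f"num_col_bytes = {packed_col_bytes}, "
      f"multiple of 32: {packed_col_bytes % 32 == 0}")
```

Under this assumption, block sizes of 128 or 64 do not divide the 7392-column per-rank shard (error1), while 32 or 16 divide it but the packed 3696-byte columns still fail the 32-byte alignment assertion reported by trtllm-build (error2).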