I am using a vLLM integration similar to this to run models with the BitBLAS backend. When tested, our model generations were nonsense for most prompts on an A100, while the same model works just fine on an A6000.
To debug further, I saved the intermediate activations from the vLLM model for a failing prompt and found that the activations after the first QKV layer contain NaN values, as can be seen below:
[A100 activations: screenshot showing NaN values after the first QKV layer]
The same activations are fine on an A6000:
[A6000 activations: screenshot, no NaN values]
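For reference, the intermediate activations above were captured with a standard PyTorch forward hook along these lines. This is only a minimal sketch: the qkv_proj attribute path into the vLLM model and the output file path are assumptions and will differ per model.

import torch

# sketch: capture the output of the first QKV projection with a forward hook
# (the module path below is an assumption; adjust it to the actual vLLM model object)
captured = {}

def save_output(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        captured[name] = out.detach().cpu()
    return hook

qkv_proj = model.model.layers[0].self_attn.qkv_proj  # hypothetical attribute path
handle = qkv_proj.register_forward_hook(save_output("layer0_qkv"))
# ... run the failing prompt through the model, then inspect / save:
print(captured["layer0_qkv"].isnan().any())
torch.save(captured["layer0_qkv"], "/workspace/data/layer0_qkv_act.pt")  # illustrative path
handle.remove()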
To compare the A100 against the A6000 and provide a minimal repro, I saved the input tensor passed to the QKV layer (which fails on the A100 but succeeds on the A6000), the BitBLAS-compatible quantized weights of the QKV layer, and the expected output from the A6000.
import bitblas
from bitblas.cache import global_operator_cache, get_database_path
from bitblas.module import auto_detect_nvidia_target, BITBLAS_DATABASE_PATH
import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
BITBLAS_TARGET = auto_detect_nvidia_target()
BITBLAS_DATABASE_PATH = "/workspace/.cache/bitblas"


def _get_or_create_bitblas_operator(config):
    if global_operator_cache.size() == 0:
        global_operator_cache.load_from_database(BITBLAS_DATABASE_PATH, BITBLAS_TARGET)
    bitblas_matmul = global_operator_cache.get(config)
    if bitblas_matmul is None:
        # should disable tuning for the first time because we may require loading a bitblas operator from the database
        bitblas_matmul = bitblas.Matmul(config)  # default tuning is topk=20
        # bitblas_matmul.hardware_aware_finetune(topk=20)
        global_operator_cache.add(config, bitblas_matmul)
        global_operator_cache.save_into_database(BITBLAS_DATABASE_PATH, BITBLAS_TARGET)
        print("BitBLAS Tuning done, appended operator to global_operator_cache.")
    else:
        print("BitBLAS Operator found in global_operator_cache.")
    return bitblas_matmul


# load qweight, zeros, and scales
bitblas_weights = torch.load("/workspace/data/single_input_qkvproj_weights_a6000.pt")
qweight = bitblas_weights['qweight'].cuda()
scales = bitblas_weights['scales'].cuda()
zeros = bitblas_weights['zeros'].cuda()

# layernorm output which is passed to QKV
ln_act = torch.load("/workspace/data/single_input_ln0_act_a6000.pt").cuda()

# init matmul engine
K, N = 8192, 10240
bitblas_dtype = torch.float16
GROUPSIZE = 128
BITBLAS_OPT_M = [1, 16, 32, 64, 128, 256, 512]
NBITS = 4
matmul_config = bitblas.MatmulConfig(
    M=BITBLAS_OPT_M,
    N=N,
    K=K,
    A_dtype="bfloat16" if bitblas_dtype == torch.bfloat16 else "float16",
    W_dtype={4: "uint4", 2: "uint2"}[NBITS],
    accum_dtype="float32" if bitblas_dtype == torch.bfloat16 else "float16",
    out_dtype="float16",
    layout="nt",
    with_bias=False,
    group_size=GROUPSIZE,
    with_scaling=True,
    with_zeros=True,
    zeros_mode="original",
    # fast_decoding=True,
)
matmul_eng = _get_or_create_bitblas_operator(matmul_config)

# matmul on A100
out = matmul_eng(ln_act, qweight, scales, zeros)
# this passes now (no NaNs in this standalone repro)
assert not out.isnan().any().item()

# load expected output from the successful A6000
expected_output = torch.load("/workspace/data/single_input_bitblas_output_a6000.pt")

# compute relative % diff between outputs
eps = 1e-4
out = out.cpu()
rel_diff = (out - expected_output).abs() / (expected_output.abs() + eps)
rel_diff.min(), rel_diff.max()  # note the very high max difference: (tensor(0.), tensor(27696.))

# relative difference heatmap
import matplotlib.pyplot as plt
plt.imshow((rel_diff + eps).log(), aspect='auto')
plt.colorbar()
plt.show()
There is a large difference, especially toward the right side of the output matrix.
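To make that observation more precise, here is a small follow-up sketch reusing rel_diff from the repro above (assuming it has shape (tokens, N) as produced there), which reports which output columns exceed a 100% relative difference:

# follow-up to the repro above: locate the output columns with large errors
# (assumes rel_diff has shape (tokens, N) as computed in the snippet above)
col_max = rel_diff.max(dim=0).values              # worst relative diff per output column
bad_cols = (col_max > 1.0).nonzero().squeeze(-1)  # columns with >100% relative diff
print(f"{bad_cols.numel()} / {col_max.numel()} columns exceed 100% relative diff")
if bad_cols.numel() > 0:
    print("affected column range:", bad_cols.min().item(), "-", bad_cols.max().item())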
Do you have any explanation for this, or suggestions on how to find the root cause of the mismatch? Thanks!
Note: I also tried running the model with dtype=bfloat16, accum_dtype=fp32, and out_dtype=fp16, but the generations were still nonsensical (such as repeated parentheses), probably due to NaNs.
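In terms of the repro above, that experiment corresponds to setting bitblas_dtype = torch.bfloat16, which selects the other branch of the dtype conditionals. A minimal sketch of the resulting config, with everything else unchanged:

# the bfloat16 variant mentioned in the note, expressed as the repro's MatmulConfig
bitblas_dtype = torch.bfloat16
matmul_config_bf16 = bitblas.MatmulConfig(
    M=BITBLAS_OPT_M,
    N=N,
    K=K,
    A_dtype="bfloat16",     # bf16 activations
    W_dtype="uint4",
    accum_dtype="float32",  # fp32 accumulation
    out_dtype="float16",    # fp16 output, as in the note
    layout="nt",
    with_bias=False,
    group_size=GROUPSIZE,
    with_scaling=True,
    with_zeros=True,
    zeros_mode="original",
)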