performance of float16 with fast tuning #173
Comments
@klxy0304, BitBLAS uses a straightforward rule to determine whether a GEMM shape should utilize the tensor core, as seen here: [matmul_analysis.py#L669-L670](https://github1s.com/microsoft/BitBLAS/blob/main/bitblas/gpu/matmul_analysis.py#L669-L670). The rule requires each dimension to be larger than 16 (in your case, M is 8). However, you can still enable it by running: `tensorized_func, tags = get_tensorized_func_and_tags(func, arch.target, allow_gemv=True)`.
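A minimal, self-contained sketch of that call, assuming the function is importable from the linked bitblas/gpu/matmul_analysis.py and that a plain TE-built PrimFunc with an (N, K) weight layout is an acceptable stand-in for the operator BitBLAS generates:

```python
# Hypothetical sketch: build a float16 GEMM PrimFunc for the shape discussed in
# this issue (M=8, N=152064, K=3584) and request tensorization with
# allow_gemv=True so the "every dimension >= 16" rule is bypassed.
# The import path is assumed from the file linked above; the (N, K) weight
# layout is only one possible choice.
import tvm
from tvm import te
from bitblas.gpu.matmul_analysis import get_tensorized_func_and_tags

M, N, K = 8, 152064, 3584
A = te.placeholder((M, K), dtype="float16", name="A")
B = te.placeholder((N, K), dtype="float16", name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[j, k], axis=k), name="C")

func = te.create_prim_func([A, B, C])
target = tvm.target.Target("cuda")  # stands in for arch.target in the line above

tensorized_func, tags = get_tensorized_func_and_tags(func, target, allow_gemv=True)
```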
@LeiWang1999, could you tell me how to solve this?
Looks like it's an environment-related issue; maybe you could try disabling parallel_build.
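If the tuning entry point you are calling exposes a switch for this (the keyword name below simply mirrors the suggestion above; check your installed version's signature), the change might look like:

```python
# Hypothetical: re-run the hardware-aware finetune with parallel compilation
# disabled, so a broken multiprocessing environment cannot affect candidate
# builds. `matmul` is the bitblas.Matmul operator already constructed for the
# shape being tuned; the parallel_build keyword name follows the comment above.
matmul.hardware_aware_finetune(topk=20, parallel_build=False)
```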
@LeiWang1999, I tried setting `parallel_build=False`, and to rule out a problem with my original environment I started a new Docker container and reinstalled via `pip install bitblas`. But this error still occurs.
@klxy0304, would you mind appending
@LeiWang1999 sure, after I appended it, the log is:
@LeiWang1999 I found that the reason is that the `check_tile_shape_isvalid` check in the `emit_config` interface keeps failing, resulting in `max_workers=0`. As seen here:
@klxy0304, I tested on my A100, and the issue seems to be that the value of N is too large, which may cause N * K to overflow the maximum int32 value.
We should implement a pass to cast all indices to the int64 datatype when at least one index exceeds the default maximum int32 value:
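As a rough illustration of when such a cast would have to kick in (this is not the proposed pass itself; the helper name and the static-shape assumption are mine):

```python
# Illustrative guard only, not the proposed pass: report whether a TIR
# PrimFunc needs int64 index arithmetic because some buffer's flattened
# element count does not fit in int32. Assumes static (constant) shapes.
import math
from tvm import tir

INT32_MAX = 2**31 - 1

def needs_int64_indexing(func: tir.PrimFunc) -> bool:
    for _, buffer in func.buffer_map.items():
        extent = math.prod(int(dim) for dim in buffer.shape)
        if extent > INT32_MAX:
            # The actual pass would then rewrite loop variables and buffer
            # indices of this function to the int64 datatype.
            return True
    return False
```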
Hello,
I tried to run a fast tuning of GEMM with float16:
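The snippet itself was not captured in this thread; a minimal reconstruction of what a float16 fast-tuning run for these shapes (batch 8, in-features 3584, out-features 152064) might look like is below. The `MatmulConfig` / `Matmul` / `hardware_aware_finetune` names follow the public BitBLAS examples and are my assumption about the original script, not a copy of it.

```python
# Hypothetical reconstruction of the fast-tuning run described above:
# a float16 GEMM matching nn.Linear(3584, 152064) at batch size 8,
# tuned with the top-20 fast-tuning path.
import bitblas

config = bitblas.MatmulConfig(
    M=8,            # batch size
    N=152064,       # output features
    K=3584,         # input features
    A_dtype="float16",
    W_dtype="float16",
    out_dtype="float16",
)
matmul = bitblas.Matmul(config=config)
matmul.hardware_aware_finetune(topk=20)  # "top 20" fast tuning reported below
```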
But the results were not what I expected:
[BitBLAS] The best latency of top 1 is 11.767 ms
[BitBLAS] The best latency of top 20 is 5.987 ms
For comparison, I tuned a single-layer model (nn.Linear(3584, 152064) with a batch size of 8) using TVM's Meta Schedule. Below are the tuning log results:
| ID | Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done |
| -- | ---- | ---- | ------ | -------------- | ------------ | --------------------- | ------ | ---- |
| 0 | fused_nn_dense_add | 8721174528 | 1 | 13285.4769 | 656.4442 | 656.4442 | 1535 | |
The result is 656 us, roughly 9x faster than the 5.987 ms top-20 BitBLAS result above; I would like to know whether I am using the BitBLAS tuning method incorrectly.