performance of float16 with fast tuning #173

Open
klxy0304 opened this issue Sep 3, 2024 · 9 comments

klxy0304 commented Sep 3, 2024

Hello,
I tried to run a fast tuning of GEMM with float16:

from bitblas.base.roller.policy import TensorCorePolicy, DefaultPolicy
from bitblas.base.arch import CUDA
from bitblas.base.utils import apply_and_build
# needed for the get_tensorized_func_and_tags call below
from bitblas.gpu.matmul_analysis import get_tensorized_func_and_tags
import tvm
from tvm.script import tir as T

M = 8
N = 152064
K = 3584
@tvm.script.ir_module
class MatmulNT:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [M, K], dtype="float16")
        B = T.match_buffer(b, [N, K], dtype="float16")
        C = T.match_buffer(c, [M, N], dtype="float16")

        for i, j, k in T.grid(M, N, K):
            with T.block("B"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float16(0)
                # A and B are already float16, so no extra casts are needed
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

ir_module = MatmulNT
func = ir_module["main"]
target = tvm.target.Target("nvidia/nvidia-a100")
arch = CUDA(target)
# Tune with SIMT Cuda Core
policy = DefaultPolicy(func=func, arch=arch)
try:
    tensorized_func, tags = get_tensorized_func_and_tags(func, arch.target)
except Exception:
    tags = None
# Tune with Tensor Core if possible
if tags:
    policy = TensorCorePolicy(func=tensorized_func, arch=arch, tags=tags)

configs = policy.emit_config(topk=20)

cpresults, best = apply_and_build(func, configs, arch, parallel_build=True)

print("[BitBLAS] The best latency of top 1 is {:.3f} ms".format(cpresults[0].latency))
print("[BitBLAS] The best latency of top 20 is {:.3f} ms".format(best.latency))

But the results were not as expected:
[BitBLAS] The best latency of top 1 is 11.767 ms
[BitBLAS] The best latency of top 20 is 5.987 ms

For comparison, I tuned a single-layer model with TVM's Meta Schedule, where the model is nn.Linear(3584, 152064) with a batch size of 8. Below are the tuning log results:

ID | Name               | FLOP       | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done
---|--------------------|------------|--------|----------------|--------------|-----------------------|--------|-----
0  | fused_nn_dense_add | 8721174528 | 1      | 13285.4769     | 656.4442     | 656.4442              | 1535   |

The Meta Schedule result is 656 us, so I would like to know whether I am using the BitBLAS tuning method incorrectly.
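As a quick consistency check of that log (my own arithmetic, not part of the original output):

M, N, K = 8, 152064, 3584
flop = 2 * M * N * K + M * N   # dense + bias add = 8_721_174_528, matching the FLOP column
latency_s = 656.4442e-6
print(flop / latency_s / 1e9)  # ~13285 GFLOPS, matching the Speed column

So the comparison is roughly 5.987 ms (BitBLAS top-20) versus 0.656 ms (Meta Schedule), about a 9x gap.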

LeiWang1999 (Contributor) commented:

@klxy0304, BitBLAS uses a straightforward rule to determine whether a GEMM shape should utilize the tensor core, as seen here: [matmul_analysis.py#L669-L670](https://github1s.com/microsoft/BitBLAS/blob/main/bitblas/gpu/matmul_analysis.py#L669-L670).

The rule requires each dimension to be larger than 16 (in your case, M is 8). However, you can still enable tensorization by running:

tensorized_func, tags = get_tensorized_func_and_tags(func, arch.target, allow_gemv=True)
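As a rough paraphrase of that gate (the helper name below is mine, not a BitBLAS API; see the linked lines for the exact condition):

# Illustrative only: the tensor core path is taken when every GEMM dim
# is larger than a small threshold (16 in the linked code).
def should_tensorize(M: int, N: int, K: int, threshold: int = 16) -> bool:
    return min(M, N, K) > threshold

print(should_tensorize(8, 152064, 3584))  # False: M = 8 fails the check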

klxy0304 (Author) commented Sep 3, 2024

@LeiWang1999
Thank you for the quick reply. When I enabled tensor core utilization, I received the following error from TVM:

Traceback (most recent call last):
  File "/root/workspace/tuning_work/bitblas/bitblas_tuning.py", line 47, in <module>
    cpresults, best = apply_and_build(func, configs, arch, parallel_build=True)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/base/utils.py", line 293, in apply_and_build
    return apply_and_build_parallel(
  File "/usr/local/lib/python3.10/dist-packages/bitblas/base/utils.py", line 201, in apply_and_build_parallel
    builder = PopenPoolExecutor(max_workers=max_workers, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 360, in __init__
    self._threadpool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 144, in __init__
    raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
Exception ignored in: <function PopenPoolExecutor.__del__ at 0x7fd03871a050>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 374, in __del__
    self._lock.acquire()
AttributeError: 'PopenPoolExecutor' object has no attribute '_lock'

Could you tell me how to solve this?

LeiWang1999 (Contributor) commented:

Looks like an environment-related issue; maybe you could try disabling parallel_build.
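That is, in the script above:

cpresults, best = apply_and_build(func, configs, arch, parallel_build=False)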

klxy0304 (Author) commented Sep 3, 2024

@LeiWang1999 , I tried setting parallel_build=False, and to rule out a problem in my original environment, I started a new Docker container and reinstalled with "pip install bitblas". But the same error still occurs.

LeiWang1999 (Contributor) commented:

@klxy0304 , would you mind appending bitblas.set_log_level("Debug") before M = 8?
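That is, at the very top of the script:

import bitblas
bitblas.set_log_level("Debug")

M = 8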

klxy0304 (Author) commented Sep 3, 2024

@LeiWang1999 Sure. After I appended it, the log is:

Traceback (most recent call last):
  File "/ossfs/workspace/bitblas_tune.py", line 48, in <module>
    cpresults, best = apply_and_build(func, configs, arch, parallel_build=False)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/base/utils.py", line 293, in apply_and_build
    return apply_and_build_parallel(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/base/utils.py", line 201, in apply_and_build_parallel
    builder = PopenPoolExecutor(max_workers=max_workers, timeout=timeout)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 360, in __init__
    self._threadpool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/concurrent/futures/thread.py", line 144, in __init__
    raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
Exception ignored in: <function PopenPoolExecutor.__del__ at 0x7fd1d8a96020>
Traceback (most recent call last):
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 374, in __del__
    self._lock.acquire()
    ^^^^^^^^^^
AttributeError: 'PopenPoolExecutor' object has no attribute '_lock'

klxy0304 (Author) commented Sep 4, 2024

@LeiWang1999 I found the cause: the check_tile_shape_isvalid check in the emit_config interface keeps failing, so no valid configs are produced and max_workers ends up as 0. As seen here:
https://github1s.com/microsoft/BitBLAS/blob/main/bitblas/base/roller/policy/default.py#L46-L47
Is this caused by the definition of MatmulNT?
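For reference, a minimal guard in the driver script (my own addition, not BitBLAS code) would surface this failure directly instead of the confusing pool error:

configs = policy.emit_config(topk=20)
if not configs:
    raise RuntimeError("emit_config produced no valid configs; "
                       "every candidate tile shape was rejected")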

LeiWang1999 (Contributor) commented Sep 4, 2024

@klxy0304 , I tested on my A100, and the issue seems to be that N is too large, which may cause index expressions involving N * K to overflow the maximum int32 value.
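As a rough illustration of the scales involved (my own numbers, not from the thread):

INT32_MAX = 2**31 - 1      # 2_147_483_647
M, N, K = 8, 152064, 3584
print(N * K)               # 544_997_376 elements in B alone
print(M * N * K)           # 4_359_979_008, well past INT32_MAX

so index arithmetic at this scale (for example flattened offsets with padded strides) can exceed the int32 range.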

LeiWang1999 (Contributor) commented:

We should implement a pass that casts all indices to the int64 datatype when at least one index exceeds the default int32 maximum:

  • implement pass: LegalizeIndexDataType
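Until such a pass lands, a workaround sketch (my own, assuming int64 buffer shapes and loop extents are what the pass would effectively produce) is to write the kernel with int64 indices from the start, so all derived index arithmetic is carried out in int64:

import tvm
from tvm.script import tir as T

M, N, K = 8, 152064, 3584

@tvm.script.ir_module
class MatmulNTInt64:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        # int64 shapes make every derived index expression int64 as well
        A = T.match_buffer(a, [T.int64(M), T.int64(K)], dtype="float16")
        B = T.match_buffer(b, [T.int64(N), T.int64(K)], dtype="float16")
        C = T.match_buffer(c, [T.int64(M), T.int64(N)], dtype="float16")
        for i, j, k in T.grid(T.int64(M), T.int64(N), T.int64(K)):
            with T.block("B"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float16(0)
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]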
