performance of float16 with fast tuning #173

Open
klxy0304 opened this issue Sep 3, 2024 · 9 comments

klxy0304 commented Sep 3, 2024

Hello,
I tried to run a fast tuning of GEMM with float16:

from bitblas.base.roller.policy import TensorCorePolicy, DefaultPolicy
from bitblas.base.arch import CUDA
from bitblas.base.utils import apply_and_build
# needed for the get_tensorized_func_and_tags call below
from bitblas.gpu.matmul_analysis import get_tensorized_func_and_tags
import tvm
from tvm.script import tir as T

M = 8
N = 152064
K = 3584
@tvm.script.ir_module
class MatmulNT:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [M, K], dtype="float16")
        B = T.match_buffer(b, [N, K], dtype="float16")
        C = T.match_buffer(c, [M, N], dtype="float16")

        for i, j, k in T.grid(M, N, K):
            with T.block("B"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float16(0)
                # A and B are already float16, so no extra casts are needed
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

ir_module = MatmulNT
func = ir_module["main"]
target = tvm.target.Target("nvidia/nvidia-a100")
arch = CUDA(target)
# Tune with SIMT Cuda Core
policy = DefaultPolicy(func=func, arch=arch)
try:
    tensorized_func, tags = get_tensorized_func_and_tags(func, arch.target)
except Exception:
    tags = None
# Tune with Tensor Core if possible
if tags:
    policy = TensorCorePolicy(func=tensorized_func, arch=arch, tags=tags)

configs = policy.emit_config(topk=20)

cpresults, best = apply_and_build(func, configs, arch, parallel_build=True)

print("[BitBLAS] The best latency of top 1 is {:.3f} ms".format(cpresults[0].latency))
print("[BitBLAS] The best latency of top 20 is {:.3f} ms".format(best.latency))

But the results were not as expected:
[BitBLAS] The best latency of top 1 is 11.767 ms
[BitBLAS] The best latency of top 20 is 5.987 ms

For comparison, I tuned a single-layer model with TVM's Meta Schedule, where the model is nn.Linear(3584, 152064) with a batch size of 8. Below are the tuning log results:

ID | Name               | FLOP       | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done
---|--------------------|------------|--------|----------------|--------------|-----------------------|--------|-----
0  | fused_nn_dense_add | 8721174528 | 1      | 13285.4769     | 656.4442     | 656.4442              | 1535   |

The Meta Schedule result is 656 us, so I would like to know whether I am using the BitBLAS tuning method incorrectly.
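As a quick consistency check of that log (my own arithmetic, not part of the original output):

M, N, K = 8, 152064, 3584
flop = 2 * M * N * K + M * N   # dense + bias add = 8_721_174_528, matching the FLOP column
latency_s = 656.4442e-6
print(flop / latency_s / 1e9)  # ~13285 GFLOPS, matching the Speed column

So the comparison is roughly 5.987 ms (BitBLAS top-20) versus 0.656 ms (Meta Schedule), about a 9x gap.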

LeiWang1999 (Contributor) commented:

@klxy0304, BitBLAS uses a straightforward rule to determine whether a GEMM shape should utilize the tensor core, as seen here: [matmul_analysis.py#L669-L670](https://github1s.com/microsoft/BitBLAS/blob/main/bitblas/gpu/matmul_analysis.py#L669-L670).

The rule requires each dimension to be larger than 16 (in your case, M is 8). However, you can still enable tensorization by running:

tensorized_func, tags = get_tensorized_func_and_tags(func, arch.target, allow_gemv=True)
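As a rough paraphrase of that gate (the helper name below is mine, not a BitBLAS API; see the linked lines for the exact condition):

# Illustrative only: the tensor core path is taken when every GEMM dim
# is larger than a small threshold (16 in the linked code).
def should_tensorize(M: int, N: int, K: int, threshold: int = 16) -> bool:
    return min(M, N, K) > threshold

print(should_tensorize(8, 152064, 3584))  # False: M = 8 fails the check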

klxy0304 (Author) commented Sep 3, 2024

@LeiWang1999
Thank you for the quick reply. When I enabled tensor core utilization, I received the following error from TVM:

Traceback (most recent call last):
  File "/root/workspace/tuning_work/bitblas/bitblas_tuning.py", line 47, in <module>
    cpresults, best = apply_and_build(func, configs, arch, parallel_build=True)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/base/utils.py", line 293, in apply_and_build
    return apply_and_build_parallel(
  File "/usr/local/lib/python3.10/dist-packages/bitblas/base/utils.py", line 201, in apply_and_build_parallel
    builder = PopenPoolExecutor(max_workers=max_workers, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 360, in __init__
    self._threadpool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 144, in __init__
    raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
Exception ignored in: <function PopenPoolExecutor.__del__ at 0x7fd03871a050>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 374, in __del__
    self._lock.acquire()
AttributeError: 'PopenPoolExecutor' object has no attribute '_lock'

Could you tell me how to solve this?

LeiWang1999 (Contributor) commented:

Looks like an environment-related issue; maybe you could try disabling parallel_build.
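That is, in the script above:

cpresults, best = apply_and_build(func, configs, arch, parallel_build=False)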

klxy0304 (Author) commented Sep 3, 2024

@LeiWang1999 , I tried setting parallel_build=False, and to rule out a problem in my original environment, I started a new Docker container and reinstalled with "pip install bitblas". But the same error still occurs.

LeiWang1999 (Contributor) commented:

@klxy0304 , would you mind appending bitblas.set_log_level("Debug") before M = 8?
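That is, at the very top of the script:

import bitblas
bitblas.set_log_level("Debug")

M = 8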

klxy0304 (Author) commented Sep 3, 2024

@LeiWang1999 Sure. After I appended it, the log is:

Traceback (most recent call last):
  File "/ossfs/workspace/bitblas_tune.py", line 48, in <module>
    cpresults, best = apply_and_build(func, configs, arch, parallel_build=False)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/base/utils.py", line 293, in apply_and_build
    return apply_and_build_parallel(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/base/utils.py", line 201, in apply_and_build_parallel
    builder = PopenPoolExecutor(max_workers=max_workers, timeout=timeout)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 360, in __init__
    self._threadpool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/concurrent/futures/thread.py", line 144, in __init__
    raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
Exception ignored in: <function PopenPoolExecutor.__del__ at 0x7fd1d8a96020>
Traceback (most recent call last):
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 374, in __del__
    self._lock.acquire()
    ^^^^^^^^^^
AttributeError: 'PopenPoolExecutor' object has no attribute '_lock'

klxy0304 (Author) commented Sep 4, 2024

@LeiWang1999 I found the cause: the check_tile_shape_isvalid check in the emit_config interface keeps failing, so no valid configs are produced and max_workers ends up as 0. As seen here:
https://github1s.com/microsoft/BitBLAS/blob/main/bitblas/base/roller/policy/default.py#L46-L47
Is this caused by the definition of MatmulNT?
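For reference, a minimal guard in the driver script (my own addition, not BitBLAS code) would surface this failure directly instead of the confusing pool error:

configs = policy.emit_config(topk=20)
if not configs:
    raise RuntimeError("emit_config produced no valid configs; "
                       "every candidate tile shape was rejected")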

LeiWang1999 (Contributor) commented Sep 4, 2024

@klxy0304 , I tested on my A100, and the issue seems to be that N is too large, which may cause index expressions involving N * K to overflow the maximum int32 value.
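As a rough illustration of the scales involved (my own numbers, not from the thread):

INT32_MAX = 2**31 - 1      # 2_147_483_647
M, N, K = 8, 152064, 3584
print(N * K)               # 544_997_376 elements in B alone
print(M * N * K)           # 4_359_979_008, well past INT32_MAX

so index arithmetic at this scale (for example flattened offsets with padded strides) can exceed the int32 range.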

LeiWang1999 (Contributor) commented:

We should implement a pass that casts all indices to the int64 datatype when at least one index exceeds the default int32 maximum:

  • implement pass: LegalizeIndexDataType
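Until such a pass lands, a workaround sketch (my own, assuming int64 buffer shapes and loop extents are what the pass would effectively produce) is to write the kernel with int64 indices from the start, so all derived index arithmetic is carried out in int64:

import tvm
from tvm.script import tir as T

M, N, K = 8, 152064, 3584

@tvm.script.ir_module
class MatmulNTInt64:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        # int64 shapes make every derived index expression int64 as well
        A = T.match_buffer(a, [T.int64(M), T.int64(K)], dtype="float16")
        B = T.match_buffer(b, [T.int64(N), T.int64(K)], dtype="float16")
        C = T.match_buffer(c, [T.int64(M), T.int64(N)], dtype="float16")
        for i, j, k in T.grid(T.int64(M), T.int64(N), T.int64(K)):
            with T.block("B"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float16(0)
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]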
