gemm_strided_batched only using strided CUDA kernel when first matrix is transposed #2529

Closed
THargreaves opened this issue Oct 25, 2024 · 1 comment
Labels: bug

@THargreaves

Describe the bug

When using gemm_strided_batched I've noticed that the fast strided kernels provided by CUDA are only used when the transpose flags are ('N', 'N') or ('T', 'N'), not when they are ('N', 'T'). In the latter case, a far slower non-batched gemm is used.

This may be a limitation of CUDA itself that I'm not aware of—feel free to close the issue if so.

To reproduce

julia> A = CUDA.rand(3, 3, 10^6);

julia> B = CUDA.rand(3, 3, 10^6);

julia> CUDA.@profile CUDA.CUBLAS.gemm_strided_batched('N', 'N', A, B)
Profiler ran for 1.58 ms, capturing 89 events.

Host-side activity: calling CUDA APIs took 191.93 µs (12.16% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│    8.73% │  137.81 µs │     1 │                                      │ cuMemAllocFromPoolAsync │
│    3.26% │    51.5 µs │    16 │   3.22 µs ± 4.7    (  1.67 ‥ 20.74)  │ cudaLaunchKernel        │
│    0.09% │    1.43 µs │    32 │  44.7 ns ± 141.21  (   0.0 ‥ 715.26) │ cudaGetLastError        │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 1.37 ms (86.98% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬──────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                     ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼──────────────────────────
│   86.98% │    1.37 ms │    16 │  85.79 µs ± 16.59  ( 25.51 ‥ 94.65)  │ void gemmSN_NN_kernel<float, 128, 2, 4, 8, 3, 4, false, cublasGemvTensorStridedBatched<float const>, cublasGemvTensorStridedBatched<float const>, cublasGemvTensorStridedBatched<float>>(cublasGemmSmal ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴──────────────────────────
                                                                                  1 column omitted


julia> CUDA.@profile CUDA.CUBLAS.gemm_strided_batched('T', 'N', A, B)
Profiler ran for 1.14 ms, capturing 89 events.

Host-side activity: calling CUDA APIs took 179.05 µs (15.65% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│   10.67% │  122.07 µs │     1 │                                      │ cuMemAllocFromPoolAsync │
│    4.61% │   52.69 µs │    16 │   3.29 µs ± 4.8    (  1.67 ‥ 21.22)  │ cudaLaunchKernel        │
│    0.31% │    3.58 µs │    32 │ 111.76 ns ± 200.71 (   0.0 ‥ 953.67) │ cudaGetLastError        │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 952.72 µs (83.27% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬──────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                     ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼──────────────────────────
│   83.27% │  952.72 µs │    16 │  59.55 µs ± 11.41  ( 16.93 ‥ 63.9)   │ void gemmSN_TN_kernel<float, 128, 16, 2, 4, 4, 4, false, cublasGemvTensorStridedBatched<float const>, cublasGemvTensorStridedBatched<float const>, cublasGemvTensorStridedBatched<float>>(cublasGemmSmal ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴──────────────────────────
                                                                                  1 column omitted


julia> CUDA.@profile CUDA.CUBLAS.gemm_strided_batched('N', 'T', A, B)
Profiler ran for 17.41 ms, capturing 89 events.

Host-side activity: calling CUDA APIs took 192.64 µs (1.11% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│    0.81% │  141.62 µs │     1 │                                      │ cuMemAllocFromPoolAsync │
│    0.28% │   48.64 µs │    16 │   3.04 µs ± 4.49   (  1.43 ‥ 19.79)  │ cudaLaunchKernel        │
│    0.01% │    1.91 µs │    32 │  59.6 ns ± 121.12  (   0.0 ‥ 476.84) │ cudaGetLastError        │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 17.2 ms (98.79% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│   98.79% │    17.2 ms │    16 │   1.07 ms ± 0.21   (  0.29 ‥ 1.15)   │ ampere_sgemm_128x128_nt │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Expected behavior

A fast batched kernel should be used for the final operation.
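
Until then, one possible workaround (a sketch, not benchmarked; `Bt` is introduced here for illustration) is to materialize the batched transpose of B with permutedims and take the fast ('N', 'N') path instead. Note that permutedims allocates a temporary copy, so this only pays off when that copy is cheaper than the slow kernel:

julia> Bt = permutedims(B, (2, 1, 3));  # batched transpose of B into contiguous memory

julia> C = CUDA.CUBLAS.gemm_strided_batched('N', 'N', A, Bt);  # computes A * Bᵀ per slice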

Version info

Details on Julia:

Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × AMD Ryzen Threadripper 7960X 24-Cores
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 48 default, 0 interactive, 24 GC (on 48 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 48

Details on CUDA:

CUDA runtime 12.6, artifact installation
CUDA driver 12.6
NVIDIA driver 560.35.3

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+560.35.3

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 4090 (sm_89, 20.415 GiB / 23.988 GiB available)
@THargreaves added the bug label on Oct 25, 2024
@maleadt (Member)

maleadt commented Oct 25, 2024

Thanks for the report. gemm_strided_batched is implemented entirely based on cublasXgemmStridedBatched, i.e., with no fallbacks to slower non-batched versions on our side. So I don't think we can do much here. Maybe file an issue with NVIDIA?
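
For reference, a minimal sketch of the same call through the in-place API (the preallocated C here is illustrative), which also issues a single cublasSgemmStridedBatched call, so the kernel choice is made entirely inside cuBLAS:

julia> C = CUDA.zeros(Float32, 3, 3, 10^6);  # preallocated output, same batch shape as A and B

julia> CUDA.@profile CUDA.CUBLAS.gemm_strided_batched!('N', 'T', 1f0, A, B, 0f0, C);  # C = 1*A*Bᵀ + 0*C per slice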

@maleadt closed this as not planned on Oct 25, 2024