gemm_strided_batched only using strided CUDA kernel when first matrix is transposed #2529

Closed
THargreaves opened this issue Oct 25, 2024 · 1 comment
Labels: bug

@THargreaves

Describe the bug

When using gemm_strided_batched I've noticed that the fast strided kernels provided by CUDA are only used when the transpose flags are ('N', 'N') or ('T', 'N'), not when they are ('N', 'T'). In the latter case, a far slower non-batched gemm is used.

This may be a limitation of CUDA itself that I'm not aware of—feel free to close the issue if so.

To reproduce

julia> A = CUDA.rand(3, 3, 10^6);

julia> B = CUDA.rand(3, 3, 10^6);

julia> CUDA.@profile CUDA.CUBLAS.gemm_strided_batched('N', 'N', A, B)
Profiler ran for 1.58 ms, capturing 89 events.

Host-side activity: calling CUDA APIs took 191.93 µs (12.16% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│    8.73% │  137.81 µs │     1 │                                      │ cuMemAllocFromPoolAsync │
│    3.26% │    51.5 µs │    16 │   3.22 µs ± 4.7    (  1.67 ‥ 20.74)  │ cudaLaunchKernel        │
│    0.09% │    1.43 µs │    32 │  44.7 ns ± 141.21  (   0.0 ‥ 715.26) │ cudaGetLastError        │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 1.37 ms (86.98% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬──────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                     ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼──────────────────────────
│   86.98% │    1.37 ms │    16 │  85.79 µs ± 16.59  ( 25.51 ‥ 94.65)  │ void gemmSN_NN_kernel<float, 128, 2, 4, 8, 3, 4, false, cublasGemvTensorStridedBatched<float const>, cublasGemvTensorStridedBatched<float const>, cublasGemvTensorStridedBatched<float>>(cublasGemmSmal ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴──────────────────────────
                                                                                  1 column omitted


julia> CUDA.@profile CUDA.CUBLAS.gemm_strided_batched('T', 'N', A, B)
Profiler ran for 1.14 ms, capturing 89 events.

Host-side activity: calling CUDA APIs took 179.05 µs (15.65% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│   10.67% │  122.07 µs │     1 │                                      │ cuMemAllocFromPoolAsync │
│    4.61% │   52.69 µs │    16 │   3.29 µs ± 4.8    (  1.67 ‥ 21.22)  │ cudaLaunchKernel        │
│    0.31% │    3.58 µs │    32 │ 111.76 ns ± 200.71 (   0.0 ‥ 953.67) │ cudaGetLastError        │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 952.72 µs (83.27% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬──────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                     ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼──────────────────────────
│   83.27% │  952.72 µs │    16 │  59.55 µs ± 11.41  ( 16.93 ‥ 63.9)   │ void gemmSN_TN_kernel<float, 128, 16, 2, 4, 4, 4, false, cublasGemvTensorStridedBatched<float const>, cublasGemvTensorStridedBatched<float const>, cublasGemvTensorStridedBatched<float>>(cublasGemmSmal ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴──────────────────────────
                                                                                  1 column omitted


julia> CUDA.@profile CUDA.CUBLAS.gemm_strided_batched('N', 'T', A, B)
Profiler ran for 17.41 ms, capturing 89 events.

Host-side activity: calling CUDA APIs took 192.64 µs (1.11% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│    0.81% │  141.62 µs │     1 │                                      │ cuMemAllocFromPoolAsync │
│    0.28% │   48.64 µs │    16 │   3.04 µs ± 4.49   (  1.43 ‥ 19.79)  │ cudaLaunchKernel        │
│    0.01% │    1.91 µs │    32 │  59.6 ns ± 121.12  (   0.0 ‥ 476.84) │ cudaGetLastError        │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 17.2 ms (98.79% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│   98.79% │    17.2 ms │    16 │   1.07 ms ± 0.21   (  0.29 ‥ 1.15)   │ ampere_sgemm_128x128_nt │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Expected behavior

A fast batched kernel should be used for the final operation.
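
Until then, one possible workaround (a sketch, not benchmarked; `Bt` is introduced here for illustration) is to materialize the batched transpose of B with permutedims and take the fast ('N', 'N') path instead. Note that permutedims allocates a temporary copy, so this only pays off when that copy is cheaper than the slow kernel:

julia> Bt = permutedims(B, (2, 1, 3));  # batched transpose of B into contiguous memory

julia> C = CUDA.CUBLAS.gemm_strided_batched('N', 'N', A, Bt);  # computes A * Bᵀ per slice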

Version info

Details on Julia:

Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × AMD Ryzen Threadripper 7960X 24-Cores
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 48 default, 0 interactive, 24 GC (on 48 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 48

Details on CUDA:

CUDA runtime 12.6, artifact installation
CUDA driver 12.6
NVIDIA driver 560.35.3

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+560.35.3

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 4090 (sm_89, 20.415 GiB / 23.988 GiB available)
@THargreaves added the bug label on Oct 25, 2024
@maleadt (Member)

maleadt commented Oct 25, 2024

Thanks for the report. gemm_strided_batched is implemented entirely based on cublasXgemmStridedBatched, i.e., with no fallbacks to slower non-batched versions on our side. So I don't think we can do much here. Maybe file an issue with NVIDIA?
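
For reference, a minimal sketch of the same call through the in-place API (the preallocated C here is illustrative), which also issues a single cublasSgemmStridedBatched call, so the kernel choice is made entirely inside cuBLAS:

julia> C = CUDA.zeros(Float32, 3, 3, 10^6);  # preallocated output, same batch shape as A and B

julia> CUDA.@profile CUDA.CUBLAS.gemm_strided_batched!('N', 'T', 1f0, A, B, 0f0, C);  # C = 1*A*Bᵀ + 0*C per slice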

@maleadt closed this as not planned on Oct 25, 2024