gemm_strided_batched only using strided CUDA kernel when first matrix is transposed #2529
Labels: bug
Describe the bug
When using gemm_strided_batched, I've noticed that the fast strided kernels provided by CUDA are only used when the transpose flags are ('N', 'N') or ('T', 'N'), not when they are ('N', 'T'). In the latter case, a far slower non-batched gemm is used. This may be a limitation of CUDA itself that I'm not aware of; feel free to close the issue if so.
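For reference, the three flag combinations can be exercised with a sketch like the one below. It is only an illustration, not the original reproduction: the array sizes are hypothetical, square matrices are assumed so that every flag pair is dimensionally valid, and it needs a working GPU with CUDA.jl installed. It only checks that each call returns the expected product; it does not show which kernel is launched.

```julia
using CUDA

# Hypothetical sizes; square matrices keep all three flag pairs dimensionally valid.
n, batch = 64, 16
A = CUDA.rand(Float32, n, n, batch)
B = CUDA.rand(Float32, n, n, batch)

# Host copies used for a per-slice CPU reference.
Ah, Bh = Array(A), Array(B)
op(flag, M) = flag == 'N' ? M : transpose(M)

for (tA, tB) in (('N', 'N'), ('T', 'N'), ('N', 'T'))
    # Batched product op(A[:, :, i]) * op(B[:, :, i]) for every slice i.
    C = CUDA.CUBLAS.gemm_strided_batched(tA, tB, A, B)
    Ch = Array(C)
    ok = all(Ch[:, :, i] ≈ op(tA, Ah[:, :, i]) * op(tB, Bh[:, :, i]) for i in 1:batch)
    println((tA, tB), " matches CPU reference: ", ok)
end
```

Running each call under a profiler (for example CUDA.@profile in recent CUDA.jl versions) should show which kernel is launched for each flag pair.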
To reproduce
Expected behavior
A fast batched kernel should be used for the final operation.
Version info
Details on Julia:
Details on CUDA: