Performance

Matthew Nicely edited this page May 15, 2022 · 3 revisions

CUTLASS primitives are highly efficient. When used to construct device-wide GEMM kernels, they exhibit performance comparable to cuBLAS for scalar GEMM computations. The figure above shows CUTLASS performance relative to cuBLAS for large matrix dimensions on an NVIDIA A100, an NVIDIA A2, an NVIDIA Titan V, and an NVIDIA GeForce RTX 2080 Ti, compiled with the CUDA 11.5 Toolkit. Tensor Core operations are implemented using CUDA's `mma` instruction.
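As a rough illustration of the device-wide GEMM API that the comparison above exercises, the sketch below launches a single-precision GEMM with `cutlass::gemm::device::Gemm`. The operand layouts and default configuration here are assumptions for illustration only; the benchmarked kernels select tile shapes, data types, and epilogues tuned per architecture.

```cuda
#include <cutlass/gemm/device/gemm.h>

// Illustrative sketch: single-precision GEMM with column-major operands.
// Threadblock/warp tile sizes and epilogue are left at CUTLASS defaults.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // element and layout of A
    float, cutlass::layout::ColumnMajor,   // element and layout of B
    float, cutlass::layout::ColumnMajor>;  // element and layout of C

// d_A, d_B, d_C are assumed to be device pointers allocated by the caller;
// lda, ldb, ldc are the leading dimensions of the column-major matrices.
cutlass::Status run_gemm(int M, int N, int K,
                         float alpha, float const *d_A, int lda,
                         float const *d_B, int ldb,
                         float beta, float *d_C, int ldc) {
  Gemm gemm_op;
  Gemm::Arguments args({M, N, K},
                       {d_A, lda},
                       {d_B, ldb},
                       {d_C, ldc},   // source C
                       {d_C, ldc},   // destination D (updated in place)
                       {alpha, beta});
  return gemm_op(args);  // launches the kernel on the default stream
}
```

The returned `cutlass::Status` should be checked against `cutlass::Status::kSuccess` before reading back results.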