Releases: NVIDIA/cutlass
CUTLASS 3.1
- New CUTLASS Python interface that aims to provide an easy-to-use interface for instantiating, emitting, compiling, and running CUTLASS kernels via Python. See the documentation and new examples for more details.
- New efficient epilogues using TMA for Hopper.
- Support for fused epilogues, such as Bias, ReLU, and GELU, using the new efficient epilogues.
- New warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
- New warp-specialized persistent cooperative kernel design that allows for larger tile sizes and improves performance on Hopper.
- An example showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper.
- Epilogue builders. Similar to mainloop builders (see example 49), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization.
- Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler.
- Performance optimizations for the warp-specialized persistent ping-pong kernel.
- Changes to the GEMM API 3.x, involving the host-facing arguments and the underlying Params structs.
- FMHA Backward Pass from Meta xFormers.
- Stream-K GEMM with Broadcast enables epilogue broadcast with Stream-K GEMM.
- Batched B2B GEMM can now run multiple back-to-back GEMMs with the same problem size in parallel.
- Batched Strided GEMV supports both row-major and column-major input matrices.
- Permute + GEMM fusion can now fuse a Permute operation with the following GEMM. Previously, only fusing the GEMM with a Permute in the epilogue was supported.
- Row Broadcast can be fused in the epilogue.
- The GitHub branch is renamed from master to main in this release.
- Optimal performance using CUDA 12.1
- Updates and bugfixes from the community (thanks!)
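As a conceptual sketch of what a fused bias + activation epilogue computes (plain numpy, not the CUTLASS C++ API; function and variable names here are illustrative), the epilogue applies scaling, an optional source matrix, a per-column bias, and an activation in a single pass over the accumulators:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU, commonly used in fused epilogues
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fused_epilogue(acc, C, bias, alpha=1.0, beta=0.0, activation=relu):
    """Scale, source add, per-column bias add, and activation in one pass."""
    return activation(alpha * acc + beta * C + bias)

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 8)), rng.standard_normal((8, 6))
bias = rng.standard_normal(6)
acc = A @ B                                   # mainloop output (accumulators)
D = fused_epilogue(acc, np.zeros_like(acc), bias)
```

Fusing these steps into the epilogue avoids a separate round trip to global memory for each elementwise operation.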
CUTLASS 3.0
3.0.0 (2023-01-23)
- CuTe, a new core library and backend for CUTLASS 3.0 that defines a single Layout vocabulary type and an associated algebra of layouts for a much more expressive and composable abstraction for tensors, sets of parallel agents, and operations by said agents on tensors.
- A new conceptual operation hierarchy that replaces the architecture-centric hierarchy of CUTLASS 2.x and documentation for CUTLASS 3.0's GEMM API changes.
- Strict API backwards compatibility that exposes both 2.x and 3.x API kernels through the same device::GemmUniversalAdapter and kernel::GemmUniversal types, allowing users to include both APIs in the same translation units. More information can be found in the 3.x backwards compatibility section.
- Updates to the Functionality documentation directing users to which kernels are supported via CUTLASS 2.x and CUTLASS 3.x.
- Updates to the Compatibility section regarding supported compilers, operating systems, CUDA Toolkits, hardware architectures, and target architectures.
- New warp-specialized GEMM kernel schedules and mainloops targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters.
- Extensions to CUTLASS profiler to support threadblock cluster shapes in library and profiler tile configurations.
- CUTLASS library integration for 3.x API kernels built through the new CollectiveBuilder API, enabling the CUTLASS profiler.
- Support for Hopper GEMMs through the new 3.0 API with CuTe-based exposure of the Hopper Tensor Memory Accelerator and WGMMA Tensor Core features.
- Set of examples that demonstrate the usage of the new 3.0 API to easily build GEMM kernels targeting Hopper: examples 48, 49, and 50.
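The core CuTe idea can be sketched in a few lines: a Layout pairs a shape with strides, and maps a logical coordinate to a linear offset as the inner product of coordinate and stride. The following is a simplified Python model of that mapping, not the C++ library itself:

```python
def layout_offset(coord, stride):
    """Map a logical coordinate to a linear offset: sum(c * s)."""
    return sum(c * s for c, s in zip(coord, stride))

# A 4x8 tensor in row-major order: shape (4, 8), stride (8, 1)
row_major = (8, 1)
# The same 4x8 shape in column-major order: stride (1, 4)
col_major = (1, 4)

assert layout_offset((2, 3), row_major) == 2 * 8 + 3      # offset 19
assert layout_offset((2, 3), col_major) == 2 * 1 + 3 * 4  # offset 14
```

Because row-major, column-major, and more exotic tilings are all just (shape, stride) pairs, the same kernel code can be composed over any of them.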
CUTLASS 2.11
2.11.0 (2022-11-19)
- Stream-K, a new general way to do split-K. It can not only improve performance but also significantly reduce the number of tile sizes that need to be profiled to find the best one.
- Fused multi-head attention kernel. It has two variants: one uses batched GEMM for fixed sequence lengths, and the other uses grouped GEMM for variable sequence lengths. Both versions require just one kernel.
- Dual GEMM, which can fuse A x B and A x C into one kernel. The two GEMMs have no producer-consumer dependency.
- Hopper improves double-precision matrix multiplication by 2x compared to Ampere at iso-clocks. It is supported since CUDA 11.8.
- BLAS3 functions with Hopper's new double-precision matrix multiplication instructions.
- ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of the A matrix. The B and output matrices are still dense. The block size can be arbitrary.
- Optimized Group Conv for SingleGroup mode, which requires that the output channel count per group is a multiple of the threadblock tile N.
- Optimized Depthwise Conv. Two new modes are added:
  - kOptimized - uses direct convolution to compute instead of implicit GEMM.
    - The restrictions are: 1) the input and output channel counts and the group number should be multiples of (128 / sizeof(input element)); 2) the input filter size should match the template parameter configuration.
  - kFixedStrideDilation - puts stride and dilation into template parameters to further improve performance. In this mode, the kernel keeps some inputs persistent in registers to squeeze out more performance, so large filter/stride/dilation values are not recommended.
    - The restrictions are: 1) the input and output channel counts and the group number should be multiples of (128 / sizeof(input element)); 2) the input filter size, stride, and dilation should match the template parameter configuration.
- Scripts to fuse multiple back-to-back GEMMs. Their implementation was discussed in a GTC'22 Spring talk.
- Updates and bugfixes from the community (thanks!). Big shout out to Meta's xFormers.
- Deprecation announcement: CUTLASS plans to deprecate the following:
  - Maxwell and Pascal GPU architectures
  - Ubuntu 16.04
  - CUDA 10.2
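The idea behind Stream-K can be sketched as follows: rather than assigning whole output tiles (or a fixed split-K factor) to CTAs, the total flat iteration space across all tiles is divided evenly among the CTAs, so one CTA may finish a tile's K-range and continue into the next tile. This is a simplified Python model of the decomposition, with illustrative names rather than CUTLASS identifiers:

```python
def stream_k_assignments(num_tiles, iters_per_tile, num_ctas):
    """Divide the flat (tile, k) iteration space evenly across CTAs.
    Returns, per CTA, a list of (tile_index, k_begin, k_end) fragments."""
    total = num_tiles * iters_per_tile
    assignments = []
    for cta in range(num_ctas):
        begin = cta * total // num_ctas
        end = (cta + 1) * total // num_ctas
        frags, it = [], begin
        while it < end:
            tile = it // iters_per_tile
            k0 = it % iters_per_tile
            k1 = min(iters_per_tile, k0 + (end - it))
            frags.append((tile, k0, k1))
            it += k1 - k0
        assignments.append(frags)
    return assignments

# 5 output tiles, 8 K-iterations each, split across 3 CTAs:
work = stream_k_assignments(5, 8, 3)
```

Tiles whose K-range is split across CTAs are then reduced ("fixed up") from the partial results, which is why a single tile size can load-balance many problem shapes.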
CUTLASS 2.10.0
- CUTLASS Python now supports GEMM, Convolution, and Grouped GEMM for different data types as well as different epilogue flavors.
- Optimizations for CUTLASS's Grouped GEMM kernel. When applicable, it can move some scheduling to the host side.
- Optimizations for GEMM+Softmax.
- Grouped GEMM for Multihead Attention is a general MHA that does not require equal sequence lengths in every GEMM.
- GEMM + Layer norm fusion for Ampere can fuse the layernorm into the GEMMs before and after it.
- GEMM Epilogue Permutation Fusion can permute the GEMM output before storing.
- Grouped convolution targeting implicit GEMM introduces the first group convolution implementation to CUTLASS. It is an Analytic implementation, not an Optimized one.
- Depthwise separable convolution introduces the first depthwise convolution, which is also Analytic for now.
- Standalone Layernorm and Pooling kernels.
- Back-to-back GEMM enhancements.
- Updates and bugfixes from the community (thanks!)
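Conceptually, a grouped GEMM runs many independent GEMMs with per-group problem sizes in a single launch, which is what makes it suitable for variable-sequence-length attention. A reference Python model (not the CUTLASS API):

```python
import numpy as np

def grouped_gemm(problems):
    """problems: list of (A, B) pairs with per-group shapes (m_i, k_i) x (k_i, n_i).
    A real grouped kernel schedules all groups in one launch; here we just loop."""
    return [A @ B for A, B in problems]

# Unequal "sequence lengths" per group, as in variable-length multi-head attention
rng = np.random.default_rng(0)
problems = [(rng.random((m, 16)), rng.random((16, 16))) for m in (5, 9, 3)]
outs = grouped_gemm(problems)
```

The GPU kernel's advantage over this loop is that small and large groups share one persistent grid, so no group leaves the device idle.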
CUTLASS 2.9.1
Bug fixes, performance tuning, and enhancements to documentation.
CUTLASS 2.9.0
- First-layer Convolution kernels specialized for small channel counts and reduced alignment
  - Few Channels specialization for reduced alignment capabilities
  - Fixed Channels specialization, further optimized when the channel count perfectly matches the access vector size
  - Unit tests
- Python-based instance emitter in the CUTLASS Library and support in the Profiler
- BLAS3 operators accelerated by Tensor Cores
- CUTLASS Python demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using CUDA Python
- Python-based runtime interoperable with existing emitters
- GEMM + Softmax example
- Optimal performance using CUDA 11.6u2
- Updates and bugfixes from the community (thanks!)
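As a reference model of what the GEMM + Softmax fusion computes (plain numpy, not the CUTLASS kernel), the fused kernel applies a numerically stable row-wise softmax to the GEMM result without writing the intermediate matrix to global memory:

```python
import numpy as np

def gemm_softmax(A, B):
    """Reference: D = softmax(A @ B) along rows, using the max-subtraction trick."""
    acc = A @ B
    m = acc.max(axis=1, keepdims=True)   # row max, subtracted for numerical stability
    e = np.exp(acc - m)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
P = gemm_softmax(rng.random((4, 8)), rng.random((8, 6)))
```

The fused kernel's benefit is bandwidth: the row max and row sum are computed while the GEMM output is still on-chip.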
CUTLASS 2.8
- TF32x3: emulated single-precision using Tensor Cores
  - 45+ TFLOPs on NVIDIA A100
  - GEMM SDK example (real)
  - Complex GEMM SDK example (complex)
  - Implicit GEMM Convolution SDK example
- Mainloop fusion for Convolution: convolution with fused per-channel scale-bias-relu
- Grouped GEMM: similar to batched GEMM with distinct problem sizes per group
  - SDK example with performance comparison with Batched Strided GEMM
  - cutlass::gemm::device::GemmGrouped
- Implicit GEMM Convolution fusion supports staging the 1st convolution's output accumulator in shared memory on Turing. This allows more flexible warp tile sizes and less register pressure.
- Optimal performance using CUDA 11.5
- Updates from the community (thanks!)
- Deprecation announcement: CUTLASS plans to deprecate the following:
  - Maxwell and Pascal GPU architectures
  - Ubuntu 16.04
  - CUDA 10.2
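The TF32x3 trick can be illustrated in numpy: split each fp32 operand into a "big" TF32 part and a "small" TF32 residual, then combine three Tensor Core products (big*big + big*small + small*big), dropping the negligible small*small term. This sketch models TF32 by truncating the mantissa to 10 bits (a simplification of the hardware's rounding), and is not the CUTLASS implementation:

```python
import numpy as np

def to_tf32(x):
    """Truncate fp32 values to TF32 precision (10-bit mantissa) by masking
    the low 13 mantissa bits. Hardware rounds; truncation is a simplification."""
    u = x.astype(np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFFE000)).view(np.float32)

def tf32x3_gemm(A, B):
    """Emulate fp32 GEMM with three TF32 products: big*big + big*small + small*big."""
    A_big, B_big = to_tf32(A), to_tf32(B)
    A_small, B_small = to_tf32(A - A_big), to_tf32(B - B_big)
    return A_big @ B_big + A_big @ B_small + A_small @ B_big

rng = np.random.default_rng(2)
A = rng.random((32, 32), dtype=np.float32)
B = rng.random((32, 32), dtype=np.float32)
ref = A.astype(np.float64) @ B.astype(np.float64)
err3 = np.abs(tf32x3_gemm(A, B) - ref).max()          # three-product emulation
err1 = np.abs(to_tf32(A) @ to_tf32(B) - ref).max()    # plain single TF32 GEMM
```

The three-product result is far closer to true fp32 accuracy than a single TF32 GEMM, at roughly 3x the Tensor Core math.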
CUTLASS 2.7
2.7.0
- Mainloop fusion for GEMM: summation over A or B
- Strided DGRAD (optimized iterators)
- Half-precision GELU_taylor activation functions
  - Use these when the accumulation and epilogue compute types are all cutlass::half_t
- Tuning and bug fixes to fused GEMM + GEMM example
- Support for Convolutions with alignment smaller than 128 bits: see examples
- Caching of results to accelerate Convolution unit tests
  - Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
- Corrections and bug fixes reported by the CUTLASS community
- Thank you for filing these issues!
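The GEMM mainloop fusion computes a reduction of A (or B) alongside the matrix product, reusing the operand fragments the mainloop has already loaded. A reference model (plain Python, summing A along K here as an illustration; not the CUTLASS API):

```python
import numpy as np

def gemm_with_a_rowsum(A, B):
    """D = A @ B plus the per-row sum of A, computed in the same pass over A."""
    M, K = A.shape
    D = np.zeros((M, B.shape[1]))
    a_sum = np.zeros(M)
    for k in range(K):                      # mainloop over K: each A column is read once
        D += np.outer(A[:, k], B[k, :])     # rank-1 update, as in an outer-product GEMM
        a_sum += A[:, k]                    # fused reduction reuses the loaded A fragment
    return D, a_sum

rng = np.random.default_rng(3)
A, B = rng.random((5, 7)), rng.random((7, 4))
D, a_sum = gemm_with_a_rowsum(A, B)
```

Because A is streamed through the mainloop anyway, the fused reduction costs almost no extra memory traffic compared with a separate reduction kernel.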
CUTLASS 2.6.1
- Arbitrary padding and striding for CUTLASS Strided DGRAD Convolution operator (Analytic Iterators)
- Tuning for GEMMs fused with partial reductions
- Corrections and bug fixes reported by the CUTLASS community
- Thank you for filing these issues!
CUTLASS 2.6.0
- Optimal performance when compiled with the CUDA 11.4 Toolkit
- Adopt the new L2 prefetch feature in cp.async and global load
- Fused operators with GEMM and Convolution
- 64b tensor strides and leading dimensions support for GEMMs
- Affine rank=2 matrix layouts
- Row stride and column stride for matrices using cutlass::layout::AffineRank2
- Support for FP64 Tensor Core and SIMT GEMM.
- Batched GEMV preview implementation
- New strided Dgrad implementation
- Accelerates over previous implementation by cutting down redundant math by 4x
- Supports using the new Dy and w analytic iterators and the existing cutlass::conv::device::ImplicitGemmConvolution interface
- Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores)
- Updates to quaternion.h and functional.h
- SDK Example for GEMM and Convolution
- Unit tests for GEMM and Convolution
- Many improvements to the epilogue.
- Provide an option to not fully unroll the epilogue to reduce the code size and improve the performance when using complicated elementwise operations
- Performance improvement for FP16 tensor core kernels
- Bug fixes
- Enhanced Clang support: the combination of Clang 13 and CUDA 11.4 can build and run kernels on architectures from Pascal to Ampere.
- Updated minimum CUDA Toolkit requirement to 10.2
- CUDA 11.4 Toolkit recommended
- Corrections and bug fixes reported by the CUTLASS community
- Thank you for filing these issues!
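Quaternion-valued GEMM keeps the usual GEMM structure but replaces the scalar multiply with the Hamilton product. A minimal sketch of the element operation and the resulting GEMM loop (plain Python for illustration, not cutlass::Quaternion):

```python
def qmul(p, q):
    """Hamilton product of two quaternions represented as (w, x, y, z) tuples."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw*qw - px*qx - py*qy - pz*qz,
            pw*qx + px*qw + py*qz - pz*qy,
            pw*qy - px*qz + py*qw + pz*qx,
            pw*qz + px*qy - py*qx + pz*qw)

def quaternion_gemm(A, B):
    """C[i][j] = sum_k A[i][k] * B[k][j], with quaternion multiply and add."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[(0.0, 0.0, 0.0, 0.0)] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = (0.0, 0.0, 0.0, 0.0)
            for k in range(K):
                t = qmul(A[i][k], B[k][j])
                acc = tuple(a + b for a, b in zip(acc, t))
            C[i][j] = acc
    return C

i_, j_, k_ = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
```

Since one quaternion multiply is 16 real multiplies and 12 adds, the kernel maps naturally onto the same tiling machinery as real- and complex-valued GEMM.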