
Releases: NVIDIA/cutlass

CUTLASS 2.5.0

03 Mar 19:20
0f10563

CUTLASS 2.5 is a minor release contributing:

  • Tensor reductions (a host-side sketch of the reduction pattern follows this list)
    • m-to-n reductions of tensors with affine layout
    • Specializations for reductions including the contiguous dimension
    • Specializations for reductions excluding the contiguous dimension
    • Custom reduction functors such as cutlass::logical_and
    • Large tensor support, up to 2^63 elements (however, each dimension is limited to an extent of 2^31)
  • Optimizations for 3-D convolution
  • Fused Convolution+Convolution example
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!
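
The tensor reduction feature is easiest to see as an index-mapping pattern. Below is a minimal host-side sketch of that pattern, not the CUTLASS API: a rank-3 tensor with an affine (strided) layout is reduced over its last dimension with a generic functor, in the spirit of custom reduction functors such as cutlass::logical_and. The function name and the dense output layout are illustrative assumptions; 64-bit indices reflect the large-tensor support noted above.

```cpp
// Illustrative sketch only -- not the CUTLASS reduction API.
#include <cstdint>
#include <cstdio>
#include <vector>

// Reduce the last dimension of a rank-3 tensor with affine layout:
// the offset of element (i, j, k) is the dot product of indices and strides.
template <typename T, typename ReduceOp>
void reduce_last_dim(const T* src, int64_t const extent[3],
                     int64_t const stride[3], T* dst, T identity, ReduceOp op) {
  for (int64_t i = 0; i < extent[0]; ++i) {
    for (int64_t j = 0; j < extent[1]; ++j) {
      T acc = identity;
      for (int64_t k = 0; k < extent[2]; ++k) {
        acc = op(acc, src[i * stride[0] + j * stride[1] + k * stride[2]]);
      }
      dst[i * extent[1] + j] = acc;  // result stored densely, rank 2
    }
  }
}

int main() {
  int64_t extent[3] = {2, 3, 4};
  int64_t stride[3] = {12, 4, 1};  // packed row-major strides
  std::vector<int> src(24);
  for (int n = 0; n < 24; ++n) src[n] = n % 5;
  std::vector<int> dst(6);
  // Sum-reduce the contiguous (last) dimension with a custom functor.
  reduce_last_dim(src.data(), extent, stride, dst.data(), 0,
                  [](int a, int b) { return a + b; });
  for (int v : dst) std::printf("%d ", v);
  std::printf("\n");
  return 0;
}
```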

CUTLASS 2.4.0

03 Dec 16:03


  • Implicit GEMM convolution kernels supporting CUDA and Tensor Cores on NVIDIA GPUs
    • Operators: forward (Fprop), backward data gradient (Dgrad), and backward weight gradient (Wgrad) convolution
    • Data types: FP32, complex, Tensor Float 32 (TF32), BFloat16 (BF16), Float16, Int4, Int8, Int32
    • Spatial dimensions: 1-D, 2-D, and 3-D
    • Layouts: NHWC, NCxHWx
  • Implicit GEMM convolution components:
    • Global memory iterators supporting Fprop, Dgrad, and Wgrad
    • MmaMultistage for implicit GEMM convolution for NVIDIA Ampere architecture
    • MmaPipeline for implicit GEMM convolution for NVIDIA Volta and Turing architectures
    • Documentation describing the Implicit GEMM Convolution algorithm and implementation (a simplified index-mapping sketch follows this list)
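
The implicit GEMM algorithm treats forward convolution (Fprop) as a GEMM whose activation operand is gathered on the fly rather than materialized (the im2col view): GEMM M = N*P*Q output pixels, GEMM N = K filters, GEMM K = R*S*C. The host reference below is a simplified illustration of that index mapping for NHWC input and KRSC filters with unit stride and no padding; it is not the structure of the CUTLASS kernels, and the function name is hypothetical.

```cpp
// Simplified host reference of the implicit GEMM view of 2-D Fprop.
#include <cstdio>
#include <vector>

void conv2d_fprop_implicit_gemm(
    const float* x, const float* w, float* y,
    int N, int H, int W, int C,   // input  NHWC
    int K, int R, int S,          // filter KRSC
    int P, int Q) {               // output NPQK (unit stride, no padding)
  int GEMM_M = N * P * Q, GEMM_N = K, GEMM_K = R * S * C;
  for (int m = 0; m < GEMM_M; ++m) {
    // Decompose the GEMM row index into output tensor coordinates.
    int n = m / (P * Q), p = (m / Q) % P, q = m % Q;
    for (int gn = 0; gn < GEMM_N; ++gn) {
      float acc = 0.f;
      for (int gk = 0; gk < GEMM_K; ++gk) {
        // Decompose the GEMM K index into filter coordinates; the
        // activation "matrix" element is gathered from x on the fly.
        int r = gk / (S * C), s = (gk / C) % S, c = gk % C;
        int h = p + r, wcol = q + s;  // valid: no padding, stride 1
        acc += x[((n * H + h) * W + wcol) * C + c] *
               w[((gn * R + r) * S + s) * C + c];
      }
      y[m * GEMM_N + gn] = acc;  // NPQK output is exactly the GEMM C matrix
    }
  }
}

int main() {
  int N = 1, H = 4, W = 4, C = 2, K = 3, R = 3, S = 3, P = 2, Q = 2;
  std::vector<float> x(N * H * W * C, 1.f), w(K * R * S * C, 0.5f), y(N * P * Q * K);
  conv2d_fprop_implicit_gemm(x.data(), w.data(), y.data(), N, H, W, C, K, R, S, P, Q);
  std::printf("y[0] = %g (expect R*S*C * 0.5 = 9)\n", y[0]);
  return 0;
}
```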

CUTLASS 2.3

25 Sep 18:27
c2b80ad


CUTLASS 2.2

15 Jun 17:48
1ab1027
  • NVIDIA Ampere Architecture features
    • Fast Tensor Core operations:
      • Maximum performance via mma.sync
      • Tensor Float 32, BFloat16, and double-precision data types
      • Mixed integer data types (int8, int4, bin1)
    • Asynchronous copy for deep software pipelines via cp.async (a minimal sketch follows this list)
    • Described in the GTC 2020 webinar (SR 21745) (free registration required)
  • Features:
    • SDK examples showing GEMM fused with bias+relu and fused GEMM+GEMM
    • Complex-valued GEMMs targeting NVIDIA Ampere Tensor Cores in double-precision and Tensor Float 32
    • Gaussian complex GEMMs using 3m complex multiply algorithm
    • Universal GEMM kernel supporting two batch modes and two algorithms for parallel reductions
  • Policy updates:
    • CUDA 11 Toolkit needed to enable NVIDIA Ampere Architecture features
    • Disabled F16C by default for compatibility; enable it on the CMake command line with -DCUTLASS_ENABLE_F16C=ON
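
The cp.async instruction is the key new pipelining mechanism, so a minimal sketch of it follows. It assumes CUDA 11 and an sm_80 target; the kernel name, sizes, and the single in-flight copy group are illustrative choices, not CUTLASS code. Real deep pipelines keep several groups in flight to overlap global-to-shared copies with Tensor Core math.

```cuda
// Minimal cp.async sketch (assumes CUDA 11+, compile with -arch=sm_80).
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void async_copy_demo(const float* __restrict__ in,
                                float* __restrict__ out) {
  __shared__ float smem[256];
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  // Each thread copies 16 bytes (4 floats) global -> shared directly,
  // without staging through registers.
  unsigned dst =
      static_cast<unsigned>(__cvta_generic_to_shared(&smem[threadIdx.x * 4]));
  asm volatile("cp.async.ca.shared.global [%0], [%1], 16;\n" ::"r"(dst),
               "l"(in + threadIdx.x * 4));
  asm volatile("cp.async.commit_group;\n");
  // A real pipeline would wait on older groups while newer ones are issued;
  // here we simply wait for everything.
  asm volatile("cp.async.wait_group 0;\n" ::: "memory");
#else
  // Synchronous fallback for pre-Ampere architectures.
  for (int i = 0; i < 4; ++i) smem[threadIdx.x * 4 + i] = in[threadIdx.x * 4 + i];
#endif
  __syncthreads();
  out[threadIdx.x] = smem[threadIdx.x];
}

int main() {
  float *d_in, *d_out;
  std::vector<float> h_in(256), h_out(64);
  for (int i = 0; i < 256; ++i) h_in[i] = float(i);
  cudaMalloc(&d_in, 256 * sizeof(float));
  cudaMalloc(&d_out, 64 * sizeof(float));
  cudaMemcpy(d_in, h_in.data(), 256 * sizeof(float), cudaMemcpyHostToDevice);
  async_copy_demo<<<1, 64>>>(d_in, d_out);  // one warp-pair block of 64 threads
  cudaMemcpy(h_out.data(), d_out, 64 * sizeof(float), cudaMemcpyDeviceToHost);
  std::printf("out[5] = %g (expect 5)\n", h_out[5]);
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```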

CUTLASS 2.1

09 Apr 23:48
e33d90b

Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores

  • Computes complex matrix products on matrices stored as disjoint real and imaginary parts (a host reference sketch follows this list)
  • SDK Examples of Planar Complex GEMMs
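
Planar complex storage keeps each operand as two disjoint real-valued planes, and the complex product is assembled from four real GEMMs: C_re = A_re*B_re - A_im*B_im and C_im = A_re*B_im + A_im*B_re. The host reference below illustrates exactly that assembly; the function names and row-major layout are illustrative assumptions, not the CUTLASS API.

```cpp
// Illustrative host reference of the planar complex GEMM decomposition.
#include <cstdio>
#include <vector>

// Plain real GEMM accumulating alpha * A * B into C (row-major).
void gemm_acc(int M, int N, int K, float alpha,
              const float* A, const float* B, float* C) {
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
      C[i * N + j] += alpha * acc;
    }
}

// Four real GEMMs over the disjoint real/imaginary planes.
void planar_complex_gemm(int M, int N, int K,
                         const float* Are, const float* Aim,
                         const float* Bre, const float* Bim,
                         float* Cre, float* Cim) {
  gemm_acc(M, N, K,  1.f, Are, Bre, Cre);
  gemm_acc(M, N, K, -1.f, Aim, Bim, Cre);
  gemm_acc(M, N, K,  1.f, Are, Bim, Cim);
  gemm_acc(M, N, K,  1.f, Aim, Bre, Cim);
}

int main() {
  // 1x1x1 sanity check: (1+2i) * (3+4i) = -5 + 10i
  float are = 1, aim = 2, bre = 3, bim = 4, cre = 0, cim = 0;
  planar_complex_gemm(1, 1, 1, &are, &aim, &bre, &bim, &cre, &cim);
  std::printf("(%g, %g)\n", cre, cim);
  return 0;
}
```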

BLAS-style host-side API added to CUTLASS Library

  • API to launch compiled kernel instances for GEMM and planar complex GEMM

Minor enhancements and bug fixes

CUTLASS 2.0

22 Nov 17:40
7c0cd26

Substantially refactored for:

  • Better performance, particularly for native Turing Tensor Cores
  • Robust and durable templates spanning the design space
  • Encapsulated functionality embodying modern C++11 programming techniques
  • Optimized containers and data types for efficient, generic, portable device code

Updates to:

  • Quick start guide
  • Documentation
  • Utilities
  • CUTLASS Profiler

Native Turing Tensor Cores

  • Efficient GEMM kernels targeting Turing Tensor Cores
  • Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands

Coverage of existing CUTLASS functionality

  • GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
  • Volta Tensor Cores through native mma.sync and through the CUDA WMMA API (a minimal WMMA sketch follows this list)
  • Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
  • Batched GEMM operations
  • Complex-valued GEMMs
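
The WMMA path mentioned above is the public CUDA API, so a minimal sketch of it is possible without touching CUTLASS internals. This is not CUTLASS's implementation: it shows one warp computing a single 16x16x16 half-precision tile with float accumulation. It requires sm_70 or newer; the kernel name and all-ones test data are illustrative.

```cuda
// Minimal WMMA sketch (compile with -arch=sm_70 or newer).
#include <mma.h>
#include <cuda_fp16.h>
#include <cstdio>
using namespace nvcuda;

__global__ void wmma_tile(const half* a, const half* b, float* c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
  wmma::fill_fragment(fc, 0.0f);      // start from a zero accumulator
  wmma::load_matrix_sync(fa, a, 16);  // leading dimension 16
  wmma::load_matrix_sync(fb, b, 16);
  wmma::mma_sync(fc, fa, fb, fc);     // D = A*B + C on Tensor Cores
  wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}

int main() {
  half *da, *db;
  float *dc, hc[256];
  half ones[256];
  for (int i = 0; i < 256; ++i) ones[i] = __float2half(1.0f);
  cudaMalloc(&da, 256 * sizeof(half));
  cudaMalloc(&db, 256 * sizeof(half));
  cudaMalloc(&dc, 256 * sizeof(float));
  cudaMemcpy(da, ones, sizeof(ones), cudaMemcpyHostToDevice);
  cudaMemcpy(db, ones, sizeof(ones), cudaMemcpyHostToDevice);
  wmma_tile<<<1, 32>>>(da, db, dc);  // a single warp owns the tile
  cudaMemcpy(hc, dc, sizeof(hc), cudaMemcpyDeviceToHost);
  std::printf("c[0] = %g (expect 16)\n", hc[0]);
  cudaFree(da);
  cudaFree(db);
  cudaFree(dc);
  return 0;
}
```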

Note: a host compiler supporting C++11 or greater is required.

CUTLASS 1.3.3

18 Nov 19:34
b5cab17

Final tagged release of the CUTLASS 1.x branch.

CUTLASS 1.3.2

10 Jul 18:42
b5cab17

Performance enhancement for Volta Tensor Cores TN layout

  • Fixed a performance defect involving indirect access to a pointer array in the Volta Tensor Cores TN arrangement.

CUTLASS 1.3.0

20 Mar 17:53
877bdca

CUTLASS 1.3 adds efficient GEMM kernels targeting Volta Tensor Cores via the mma.sync instruction added in CUDA 10.1.

CUTLASS 1.2

26 Oct 22:02
ed2ed4d


  • Parallelized reductions across threadblocks ("Split-K") (an illustrative sketch of the decomposition follows this list)
  • Improved IGEMM performance
  • Batched strided WMMA GEMMs
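
As referenced above, split-K decomposes the K dimension of a GEMM across threadblocks. The sketch below is an illustrative host-side rendering of that decomposition, not CUTLASS code: the serial slice loop stands in for concurrent threadblocks, and the workspace plus final pass mirror the two-phase partial-product-then-reduce structure.

```cpp
// Illustrative host sketch of the split-K GEMM decomposition.
#include <cstdio>
#include <vector>

int main() {
  const int M = 2, N = 2, K = 8, splits = 4, Ks = K / splits;
  std::vector<float> A(M * K, 1.f), B(K * N, 2.f);
  std::vector<float> partial(splits * M * N, 0.f), C(M * N, 0.f);

  // Phase 1: slice s handles K-range [s*Ks, (s+1)*Ks). On the GPU these
  // run as concurrent threadblocks writing disjoint workspace regions.
  for (int s = 0; s < splits; ++s)
    for (int i = 0; i < M; ++i)
      for (int j = 0; j < N; ++j)
        for (int k = s * Ks; k < (s + 1) * Ks; ++k)
          partial[(s * M + i) * N + j] += A[i * K + k] * B[k * N + j];

  // Phase 2: reduce the partial products across slices.
  for (int s = 0; s < splits; ++s)
    for (int e = 0; e < M * N; ++e) C[e] += partial[s * M * N + e];

  std::printf("C[0] = %g (expect %g)\n", C[0], float(K) * 2.f);
  return 0;
}
```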