v0.4.0

@mobicham mobicham released this 05 Dec 09:51
· 56 commits to master since this release
8908e50
  • Improved performance on the A100 and H100.
  • Flexible bitpacking support (32-bit / 8-bit, over cols or rows).
  • Caching of the best autotuned config across all kernels.
  • Helper functions for easier usage.
  • GEMV_SPLITK kernel for better performance at batch-size=1 with non-packed data.
  • Improved accuracy via dumping for 8-bit weights with GEMV kernels.
  • Max-autotuning.
  • Avoid out-of-shared-memory errors by limiting num_stages based on the GPU device.
  • Various bug fixes.
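
To illustrate what "flexible bitpacking (32-bit / 8-bit, over cols or rows)" can mean in practice, here is a minimal NumPy sketch that packs low-bit unsigned weights into 8-bit or 32-bit words along either axis. The helper names `pack_bits` and `unpack_bits`, and the choice to store the first element in the lowest bits, are assumptions made for illustration; this is not the library's API and may not match its actual memory layout.

```python
import numpy as np

def pack_bits(w, nbits, pack_dtype=np.uint32, axis=0):
    """Pack unsigned `nbits`-bit integers along `axis` into `pack_dtype` words.

    Hypothetical helper: element i of each group lands at bit offset i * nbits.
    """
    scalar = np.dtype(pack_dtype).type
    per_word = np.dtype(pack_dtype).itemsize * 8 // nbits
    w = np.moveaxis(np.asarray(w), axis, 0)
    if w.shape[0] % per_word != 0:
        raise ValueError("packed axis length must be a multiple of elements per word")
    # Group `per_word` consecutive elements along the packed axis.
    groups = w.reshape(w.shape[0] // per_word, per_word, *w.shape[1:]).astype(pack_dtype)
    packed = np.zeros((groups.shape[0],) + groups.shape[2:], dtype=pack_dtype)
    for i in range(per_word):
        packed |= groups[:, i] << scalar(i * nbits)
    return np.moveaxis(packed, 0, axis)

def unpack_bits(packed, nbits, axis=0):
    """Inverse of `pack_bits`; the word size is taken from `packed.dtype`."""
    packed = np.moveaxis(np.asarray(packed), axis, 0)
    scalar = packed.dtype.type
    per_word = packed.dtype.itemsize * 8 // nbits
    mask = scalar((1 << nbits) - 1)
    # Extract each bit field, then interleave the fields back along the packed axis.
    parts = [(packed >> scalar(i * nbits)) & mask for i in range(per_word)]
    out = np.stack(parts, axis=1).reshape(packed.shape[0] * per_word, *packed.shape[1:])
    return np.moveaxis(out, 0, axis)

# 4-bit weights in an (8, 16) matrix.
rng = np.random.default_rng(0)
w = rng.integers(0, 16, size=(8, 16), dtype=np.uint8)

packed_rows = pack_bits(w, nbits=4, pack_dtype=np.uint32, axis=0)  # 8 values per word
packed_cols = pack_bits(w, nbits=4, pack_dtype=np.uint8, axis=1)   # 2 values per word
```

With 4-bit values, a 32-bit word holds eight elements and an 8-bit word holds two, so the packed axis shrinks by that factor. Which axis and word size perform best depends on the kernel's memory access pattern, which is presumably why the release makes both configurable.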