v0.4.0
- Improved performance on the A100 and H100.
- Flexible bitpacking support (32-bit / 8-bit, over cols or rows).
- Best config caching over all kernels.
- Helper functions for easier usage.
- GEMV_SPLITK kernel for better performance at batch-size=1 with non-packed data.
- Improved accuracy via dumping for 8-bit weights with GEMV kernels.
- Max-autotuning.
- Avoid out-of-shared-memory by limiting num_stages based on the GPU device.
- Various bug fixes.
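
As a rough illustration of the num_stages limit mentioned above, the sketch below clamps a requested stage count so that an estimated tile footprint fits in a given shared-memory budget. The function name `limit_num_stages`, its parameters, the footprint formula, and the byte budgets are all assumptions made for this example, not the library's actual implementation; in practice the budget would be read from the device at runtime.

```python
def limit_num_stages(num_stages, block_m, block_n, block_k,
                     elem_bytes, smem_limit_bytes):
    """Clamp num_stages so the estimated pipeline buffers fit in shared memory.

    Hypothetical helper: assumes each stage holds one A tile
    (block_m x block_k) and one B tile (block_k x block_n).
    """
    # Estimated bytes of shared memory consumed by a single pipeline stage.
    bytes_per_stage = (block_m * block_k + block_k * block_n) * elem_bytes
    # Largest stage count that fits in the budget, but never below 1.
    max_stages = max(1, smem_limit_bytes // bytes_per_stage)
    return min(num_stages, max_stages)


# Illustrative budgets only; these are not measured values for any GPU.
small_budget = 100 * 1024
large_budget = 200 * 1024

# 128x128x64 tiles with 2-byte elements need 32768 bytes per stage.
print(limit_num_stages(5, 128, 128, 64, 2, small_budget))  # 3
print(limit_num_stages(5, 128, 128, 64, 2, large_budget))  # 5
```

The same requested configuration is kept as-is when the budget allows it and reduced only when it would overflow, so a single autotuning search space can be shared across devices.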