Releases · mobiusml/gemlite
v0.4.1
v0.4.0
- Improved performance on the A100 and H100.
- Flexible bitpacking support (32-bit / 8-bit, over cols or rows).
- Best-config caching across all kernels.
- Helper functions for easier usage.
- New GEMV_SPLITK kernel for better performance at batch-size=1 with non-packed data.
- Improved accuracy via dumping for 8-bit weights with GEMV kernels.
- Max-autotuning.
- Avoid out-of-shared-memory errors by limiting num_stages based on the GPU device.
- Various bug fixes.
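The flexible bitpacking mentioned above stores several low-bit weights inside one 32-bit (or 8-bit) word. The sketch below is a minimal, pure-Python illustration of that idea, assuming a simple little-endian layout of eight 4-bit values per 32-bit word; it is not gemlite's actual packing code or memory layout.

```python
# Illustrative sketch of low-bit packing (NOT gemlite's actual layout):
# eight 4-bit values are packed into one 32-bit word, first value in the
# lowest bits. Packing can be applied along rows or columns of the weight
# matrix by flattening in the corresponding order first.

def pack_4bit_into_32bit(values):
    """Pack a flat list of 4-bit integers (0..15) into 32-bit words."""
    assert len(values) % 8 == 0
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            assert 0 <= v < 16
            word |= v << (4 * j)  # value j occupies bits [4j, 4j+4)
        words.append(word)
    return words

def unpack_32bit_to_4bit(words):
    """Inverse of pack_4bit_into_32bit."""
    values = []
    for word in words:
        for j in range(8):
            values.append((word >> (4 * j)) & 0xF)
    return values

vals = [3, 15, 0, 7, 1, 9, 12, 5]
packed = pack_4bit_into_32bit(vals)
assert unpack_32bit_to_4bit(packed) == vals  # round-trip is lossless
```

The payoff of this layout is 8x less memory traffic for the weights; the kernel unpacks the words on the fly before the multiply.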
v0.3.0
- New GEMV RevSplitK algorithm that outperforms both GEMM Split-K and GEMV at batch-size=1.
- Added support for channel-wise scaling (weights, activations, weights + activations).
- Added support for FP8 x FP8 and FP8 x Wn.
- Added support for INT8 x Wn.
- Faster autotuning.
- Improved base configs for the RTX 4090, A100, and H100.
- Better control over autotuning via set_autotune.
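The Split-K family of kernels referenced above splits the reduction (K) dimension of the matrix-vector product into chunks so that more thread blocks can work in parallel at small batch sizes. The following is a minimal pure-Python sketch of that idea under that assumption; it is not gemlite's Triton implementation, and RevSplitK differs in how the partial results are combined.

```python
# Pure-Python sketch of the Split-K idea behind these GEMV kernels
# (NOT gemlite's implementation): split the inner K dimension into
# chunks, compute partial dot products per chunk, and accumulate.

def gemv_splitk(W, x, split_k=4):
    """Compute y = W @ x with the K dimension split into `split_k` chunks."""
    n_rows, K = len(W), len(x)
    chunk = (K + split_k - 1) // split_k
    y = [0.0] * n_rows
    for s in range(split_k):  # on a GPU, each chunk maps to its own block
        lo, hi = s * chunk, min((s + 1) * chunk, K)
        for r in range(n_rows):
            # partial accumulation; the GPU version uses atomics or a
            # reduction pass to combine partials from different blocks
            y[r] += sum(W[r][k] * x[k] for k in range(lo, hi))
    return y

W = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
x = [1.0, 1.0, 1.0, 1.0]
assert gemv_splitk(W, x, split_k=2) == [10.0, 2.0]
```

At batch-size=1 the output matrix is a single vector, so without Split-K only a few blocks would have work; splitting K restores occupancy.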
v0.2.1
v0.1.0
Triton Kernels
- A16W8 (GEMV + GEMM) - with grouping
- A16W4 (GEMV + GEMM) - with grouping
- A16W2 (GEMV + GEMM) - with grouping
- A16W1 (GEMV + GEMM) - with grouping
CUDA Kernels
- A16W8 (GEMV - batch-size=1) - no grouping
- A16W4 (GEMV - batch-size=1) - no grouping
- A16W2 (GEMV - batch-size=1) - no grouping
- A8W8 (GEMV - batch-size=1) - no grouping
- A8W4 (GEMV - batch-size=1) - no grouping
- A8W2 (GEMV - batch-size=1) - no grouping
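In the kernel names above, AxWy denotes x-bit activations with y-bit weights (e.g. A16W4 is 16-bit activations times 4-bit weights), and "grouping" refers to per-group quantization parameters. The sketch below illustrates, in pure Python, what such a grouped low-bit GEMV computes: weights are dequantized with a per-group scale and zero-point, then multiplied by the activations. Function name and layout are illustrative assumptions, not gemlite's API.

```python
# Illustrative grouped A16W4-style GEMV (NOT gemlite's kernels):
# each quantized weight is dequantized as (q - zero) * scale, where the
# scale/zero are shared by a group of `group_size` consecutive weights.

def dequant_gemv(W_q, scales, zeros, x, group_size):
    """y[r] = sum_k (W_q[r][k] - zeros[r][k // group_size])
                    * scales[r][k // group_size] * x[k]"""
    y = []
    for r, row in enumerate(W_q):
        acc = 0.0
        for k, q in enumerate(row):
            g = k // group_size  # quantization group index
            acc += (q - zeros[r][g]) * scales[r][g] * x[k]
        y.append(acc)
    return y

W_q = [[0, 15, 8, 7]]   # 4-bit quantized weights (values in 0..15)
scales = [[0.1, 0.2]]   # one scale per group of 2 weights
zeros = [[8, 8]]        # zero-point per group
x = [1.0, 1.0, 1.0, 1.0]
y = dequant_gemv(W_q, scales, zeros, x, group_size=2)
```

The CUDA kernels listed above skip grouping, i.e. they assume one scale/zero per output channel rather than per group.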