v0.3.0
- New GEMV RevSplitK algorithm that outperforms GEMM Split-K and GEMV at batch size 1
- Add support for channel-wise scaling (weights, activations, weights + activations)
- Add support for FP8 x FP8 / FP8 x Wn
- Add support for INT8 x Wn
- Improved autotune speed
- Improved base configs for RTX 4090, A100 and H100
- Better control over autotuning via set_autotune
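
For readers unfamiliar with channel-wise scaling on both weights and activations, the idea can be sketched in plain Python. This is only an illustration of the general technique under simple assumptions (symmetric INT8, one scale per activation row and one per weight output channel); it is not the library's kernel or API, and every name in it (quantize_rows_int8, matmul_t) is hypothetical.

```python
import random

def quantize_rows_int8(rows):
    """Symmetric int8 quantization with one scale per row (channel-wise)."""
    q_rows, scales = [], []
    for row in rows:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def matmul_t(a, b):
    """Compute a @ b.T for row-major lists of lists."""
    return [[sum(p * q for p, q in zip(ra, rb)) for rb in b] for ra in a]

random.seed(0)
x = [[random.gauss(0, 1) for _ in range(64)] for _ in range(4)]   # activations: (batch, in_features)
w = [[random.gauss(0, 1) for _ in range(64)] for _ in range(32)]  # weights: (out_features, in_features)

x_q, x_scales = quantize_rows_int8(x)  # one scale per activation row
w_q, w_scales = quantize_rows_int8(w)  # one scale per output channel

# Integer matmul, then rescale each output by its row scale and column scale.
acc = matmul_t(x_q, w_q)
y = [[acc[i][j] * x_scales[i] * w_scales[j] for j in range(len(w))]
     for i in range(len(x))]

# Compare against the full-precision result.
y_ref = matmul_t(x, w)
err = sum((y[i][j] - y_ref[i][j]) ** 2 for i in range(4) for j in range(32)) ** 0.5
ref = sum(v ** 2 for row in y_ref for v in row) ** 0.5
rel_err = err / ref
```

Because the scales factor out of the dot product, the rescaling is a cheap per-element multiply after the integer accumulation, which is why this scheme suits fused low-precision kernels.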