v2.4.10 SGEMM TF32 Stage 2/3
What's Changed
- [HGEMM] HGEMM WMMA Stage mma4x2+warp4x4 by @DefTruth in #76
- [SGEMM] Add SGEMM WMMA TF32 Stage2/3 by @DefTruth in #77
- [SGEMM] Add cuBLAS SGEMM F32/TF32 baseline by @DefTruth in #78
- [SGEMM] Add Kernel cudaFuncSetAttribute hint by @DefTruth in #79
- [RoPE] Add minimal RoPE f32/f32x4 pack impl by @bear-zd in #80
Full Changelog: v2.4.9...v2.4.10