Groupwise scaling along M for FP8 gemm #2037
base: main
Conversation
Hi @hwu36, this PR is from the DeepSeek team. Could you help review and merge it? The SGLang team wants to implement block-wise FP8 using CUTLASS for DeepSeek V3, and this PR is essential for us. Thanks!
Hi @zhyncs, this PR looks like an example demo. Has the integration with SGLang been done? Could you post a PR with the SGLang integration code?
@ll2088
Has the CUTLASS-based version developed in SGLang been submitted as a PR? Could you post it here?
Not yet.
Force-pushed from 9d997ce to a08ef31.
And why does ScaleMsPerTile = 128 not work? @soundOfDestiny
/workspace/applied-ai/kernels/cuda/cutlass_gemm/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:336 Setting smem size to 234496
Force-pushed from a08ef31 to 0c08d7c.
The incorrect shared memory size calculation has been present since #1932.
Force-pushed from 0c08d7c to df73dd0.
Force-pushed from df73dd0 to 3197c81.
Background (copied from #1932)
As we adopt narrower datatypes, traditional scaling methods struggle to maintain accuracy, particularly with 8-bit floating-point types (e.g., e5m2_t, e4m3_t). The typical GEMM operation uses tensorwise scaling with $D = \alpha \cdot (A B) + \beta \cdot C$, but narrower datatypes necessitate finer-grained scaling techniques. A glossary of the various scaling methods can be found in #1932.
Summary
As #1932 adds a blockwise scaling strategy, this PR is a patch on top of #1932 that adds a groupwise scaling strategy along M for the A tensor. The scaling granularity along M is made independent of the CTA block configuration; however, the scaling granularities along N and K remain blockwise (i.e., one scaling value per CTA block).
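For intuition, the following is a minimal reference sketch (plain PyTorch, not code from this PR) of the scaled GEMM described above: the A scales are groupwise along M with a granularity independent of the CTA tile, the B scales remain blockwise along N and K, and each K-block partial product is scaled during accumulation. All names, shapes, and granularity values are illustrative assumptions.

```python
import torch

# Illustrative shapes and granularities (assumptions, not values fixed by this PR).
M, N, K = 256, 256, 256
GROUP_M, BLOCK_N, BLOCK_K = 64, 128, 128   # M: groupwise, N/K: blockwise (one scale per CTA block)

A = torch.randn(M, K).to(torch.float8_e4m3fn)      # quantized A operand
B = torch.randn(N, K).to(torch.float8_e4m3fn)      # quantized B operand (stored N x K here)
scale_A = torch.rand(M // GROUP_M, K // BLOCK_K)   # one scale per (M group, K block)
scale_B = torch.rand(N // BLOCK_N, K // BLOCK_K)   # one scale per (N block, K block)

# Reference: scale each K-block partial product by the matching A-group and
# B-block scales and accumulate in higher precision.
D = torch.zeros(M, N, dtype=torch.float32)
for kb in range(K // BLOCK_K):
    a = A[:, kb * BLOCK_K:(kb + 1) * BLOCK_K].to(torch.float32)
    b = B[:, kb * BLOCK_K:(kb + 1) * BLOCK_K].to(torch.float32)
    row_scale = scale_A[:, kb].repeat_interleave(GROUP_M)   # shape (M,)
    col_scale = scale_B[:, kb].repeat_interleave(BLOCK_N)   # shape (N,)
    D += row_scale[:, None] * col_scale[None, :] * (a @ b.t())
```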
This PR restricts the scaling granularity along M to a factor of TILE_SHAPE_M in the CTA block configuration. To simulate a scaling granularity that is a multiple of TILE_SHAPE_M, one can set the GEMM scaling granularity along M to exactly TILE_SHAPE_M (i.e., fall back to the blockwise scaling strategy) and call the repeat_interleave method on the input tensor ScaleA, as illustrated in the sketch below.
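A concrete PyTorch illustration of that fallback follows; the tile size, granularity, and the assumed ScaleA layout (one scale per M group per K block) are examples, not values fixed by this PR.

```python
import torch

TILE_SHAPE_M = 128           # CTA tile size along M (example value)
GROUP_M = 2 * TILE_SHAPE_M   # desired granularity: a multiple of TILE_SHAPE_M
M, num_k_blocks = 1024, 8    # illustrative problem shape

# One scale per (GROUP_M rows, K block).
scale_A_coarse = torch.rand(M // GROUP_M, num_k_blocks)

# Expand so the kernel sees one scale per TILE_SHAPE_M rows; the duplicated
# values reproduce the coarser granularity exactly.
ScaleA = scale_A_coarse.repeat_interleave(GROUP_M // TILE_SHAPE_M, dim=0)
assert ScaleA.shape == (M // TILE_SHAPE_M, num_k_blocks)
```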
Groupwise Scaling
In this implementation, we load scaling tensors with more elements than in #1932 into shared memory, since the scale may vary along M within a CTA block. However, each thread only needs to load at most two scale values for the A tensor and exactly one scale value for the B tensor from shared memory to registers per iteration, because the WGMMA accumulators of each thread cover only two rows of the result tensor (see the sketch below).
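The "at most two A scales per thread" property follows from the accumulator layout. Below is a small Python sketch; the WGMMA row mapping it assumes (each thread of a 128-thread warpgroup owning rows lane/4 and lane/4 + 8 of its warp's 16-row slice of the 64-row tile) and the granularity value are assumptions for illustration, not taken from this PR.

```python
# Sketch of why at most two A scales are needed per thread, assuming the usual
# SM90 WGMMA accumulator layout: the 64 output rows of a warpgroup tile are
# split 16 rows per warp, and each thread owns rows lane//4 and lane//4 + 8
# of its warp's slice.
SCALE_GRANULARITY_M = 8      # deliberately small, purely illustrative

def thread_rows(thread_idx: int) -> tuple[int, int]:
    """Rows of the 64-row accumulator tile held by one warpgroup thread."""
    warp, lane = divmod(thread_idx, 32)
    base = warp * 16 + lane // 4
    return base, base + 8

def a_scale_groups(thread_idx: int, tile_m_offset: int = 0) -> set[int]:
    """Distinct A-scale groups a thread touches in one tile."""
    return {(tile_m_offset + r) // SCALE_GRANULARITY_M for r in thread_rows(thread_idx)}

# Every thread of the 128-thread warpgroup needs at most two A scales.
assert all(len(a_scale_groups(t)) <= 2 for t in range(128))
```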
Performance
I haven't observed any performance degradation compared with #1932.
Benchmark screenshots (not reproduced here): blockwise scaling vs. groupwise scaling (this PR, with the scaling granularity along M set to 64).