
[QST] Is get_tiled_shape in depthwise conv correct #1213

Closed
yupatrick22 opened this issue Nov 27, 2023 · 11 comments
Labels
? - Needs Triage bug Something isn't working

Comments

@yupatrick22

yupatrick22 commented Nov 27, 2023

I am trying to write a depthwise conv kernel. Reviewing sample code 46, I find that the get_tiled_shape function in the threadblock swizzle class for depthwise conv is overloaded and always outputs 1 for the GEMM M dimension, no matter what the conv problem size is. This seems weird to me. Could you please explain it, or is it a bug?

[Screenshot: the overloaded get_tiled_shape in the depthwise threadblock swizzle]

@yupatrick22 yupatrick22 added ? - Needs Triage bug Something isn't working labels Nov 27, 2023
@mnicely mnicely changed the title Is get_tiled_shape in depthwise conv correct [QST] Is get_tiled_shape in depthwise conv correct Nov 27, 2023
@hwu36
Collaborator

hwu36 commented Nov 28, 2023

@Ethan-Yan27

@Ethan-Yan27
Collaborator

It is intentional. Please refer to the basic idea of the depthwise 2d conv implementation in #1133 (comment).

With this basic idea, normally, if we do not set split_k, one threadblock computes all the output tiles that belong to specific output channels (i.e., ones sharing the same filters).

For example, in the example kernel, if the user sets ./46_depthwise_simt_conv2dfprop --n=1 --h=112 --w=112 --c=1024 --k=1024 --r=3 --s=3 --g=1024, the grid size is <1,16,1>, and each threadblock calculates all 112x112 outputs of its corresponding channels.
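
For reference, a minimal sketch (not the actual CUTLASS code) of how that <1,16,1> grid falls out of the overloaded get_tiled_shape; the 64-channel threadblock tile is inferred from 1024 channels / 16 blocks, so treat it as an assumption:

    // Minimal sketch: how the <1,16,1> grid above falls out of the overloaded
    // get_tiled_shape. The 64-channel threadblock tile is an assumption here.
    #include <cstdio>

    int main() {
      int k = 1024;            // output channels (= groups for depthwise)
      int tile_n = 64;         // assumed threadblock tile along the channel dim
      int split_k_slices = 1;  // split-k disabled

      int grid_m = 1;                          // always 1: one CTA owns all NPQ of its channels
      int grid_n = (k + tile_n - 1) / tile_n;  // 1024 / 64 = 16
      int grid_k = split_k_slices;

      std::printf("grid = <%d,%d,%d>\n", grid_m, grid_n, grid_k);  // grid = <1,16,1>
      return 0;
    }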

@yupatrick22
Author

@Ethan-Yan27 Thanks for your reply.

Following the workload partition in #1133 (comment), each threadblock calculates all output pixels of a specific channel range, e.g. cta0: channels 0-31, cta1: channels 32-63. This means the input activation would need at least 32*132 = 4224 channels to utilize all 132 SMs of an H800, and real workloads may not meet this constraint.

Looking back at the example ./46_depthwise_simt_conv2dfprop --n=1 --h=112 --w=112 --c=1024 --k=1024 --r=3 --s=3 --g=1024, the grid size is <1,16,1>, which means only 16 of the 132 SMs are utilized, i.e. very low occupancy. However, each threadblock has to do 112*112/(8*8) = 196 iterations. Why not make each threadblock do fewer iterations and generate a larger grid to fully utilize the GPU, for example grid size <196,16,1>?

@Ethan-Yan27
Collaborator

To fully use the GPU resources, please enable the splitK feature. It launches more threadblocks and reduces the mainloop iterations per threadblock.

The example option: --splitk=XXX

    return gemm::GemmCoord(1,
                           (implicit_gemm_problem_size.n() + tile_size.n() - 1) / tile_size.n(),
                           split_k_slices);   // <-- the --splitk option affects this coordinate
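
For the problem above, for example, --splitk=8 should produce a <1,16,8> grid, so each threadblock covers roughly 196/8 ≈ 25 of the mainloop iterations (assuming the 8x8 output tile from the earlier comment).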

@yupatrick22
Author

Yes, I have tried splitK, and it works well for the example case --n=1 --h=112 --w=112 --c=1024 --k=1024 --r=3 --s=3 --g=1024; the utilization is much improved.

But I have another case where both the filter size and the channel count are small (R=3, S=3, C=32) while the feature map is very large (N=256, H=112, W=112). splitK might not be a good choice here, since R*S*C = 288 is very small. Is it possible to split NHW across several threadblocks?

@Ethan-Yan27
Collaborator

SplitK here means splitting the NHW dim (or more precisely, the output NPQ dim), not the K dim. Though the splitK name is not very intuitive, it is for your case.

Take the example picture in #1133 (comment): if we set splitK=2, then 4 threadblocks are launched (a sketch of this mapping follows the list).

  • cta0 would compute P0-7 & Q0-7 & K0-31.
  • cta1 would compute P0-7 & Q0-7 & K32-63.
  • cta2 would compute P8-15 & Q0-7 & K0-31.
  • cta3 would compute P8-15 & Q0-7 & K32-63.
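
A minimal sketch of that mapping (the P=16, Q=8, K=64 extents and the 32-channel threadblock tile are assumptions taken from the example picture); the second grid dimension picks the channel tile and the third (split-k) dimension picks the slice of output rows, though the exact swizzle mapping inside CUTLASS may differ:

    // Sketch of the splitK=2 partitioning described above; extents and tile size
    // are assumptions matching the example picture, not values read from CUTLASS.
    #include <cstdio>

    int main() {
      int P = 16, Q = 8, K = 64;   // assumed output extents for the example picture
      int tile_k = 32;             // assumed channels per threadblock
      int split_k = 2;

      int grid_n = K / tile_k;             // 2 channel tiles
      int rows_per_slice = P / split_k;    // 8 output rows per split-k slice

      for (int slice = 0; slice < split_k; ++slice) {
        for (int n = 0; n < grid_n; ++n) {
          int cta = slice * grid_n + n;
          std::printf("cta%d: P%d-%d & Q0-%d & K%d-%d\n",
                      cta,
                      slice * rows_per_slice, (slice + 1) * rows_per_slice - 1,
                      Q - 1,
                      n * tile_k, (n + 1) * tile_k - 1);
        }
      }
      return 0;
    }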

FYI, activation iterator movement for each cta: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/conv/threadblock/depthwise_fprop_activation_tile_access_iterator_direct_conv_fixed_stride_dilation.h#L214

@yupatrick22
Author

Thank you very much for clarifying this. I am a little confused: from the GTC talk, depthwise conv in CUTLASS uses an im2col + implicit GEMM method, as below.
[Screenshot: GTC slide showing the im2col + implicit GEMM approach for depthwise conv]
But is that exactly the method used in sample code 46? It looks like code 46 uses a direct conv method.

@Ethan-Yan27
Collaborator

Ethan-Yan27 commented Dec 1, 2023

At the GTC talk you quoted, we had just released kAnalytic. The other two modes were released later.

https://github.com/NVIDIA/cutlass/blob/main/examples/46_depthwise_simt_conv2dfprop/depthwise_simt_conv2dfprop.cu#L36C1-L43.

There are 3 types of implementations of depthwise 2d convolution:
  1. kAnalytic
    Implicit GEMM 2d convolution algorithm.
  2. kOptimized
    An optimized algorithm that supports arbitrary stride and dilation.
  3. kFixedStrideDilation
    An optimized algorithm with fixed stride and dilation to reduce runtime computation and enable more optimizations.

Example 46 demonstrates the kFixedStrideDilation version, which uses the direct conv approach.

If you are interested in the implicit GEMM fprop version, please refer to https://github.com/NVIDIA/cutlass/blob/main/test/unit/conv/device/depthwise_conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu. In general, the perf is kFixedStrideDilation > kOptimized > kAnalytic.
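
A minimal sketch of how these three modes show up in code: they correspond to values of the cutlass::conv::IteratorAlgorithm enum in cutlass/conv/convolution.h (names as recalled; please check them against your CUTLASS version):

    // Sketch only: the three depthwise modes as IteratorAlgorithm values. The comments
    // restate the descriptions above; verify the enum names against your CUTLASS version.
    #include "cutlass/conv/convolution.h"

    constexpr auto kModeAnalytic  = cutlass::conv::IteratorAlgorithm::kAnalytic;            // implicit GEMM
    constexpr auto kModeOptimized = cutlass::conv::IteratorAlgorithm::kOptimized;           // direct conv, arbitrary stride/dilation
    constexpr auto kModeFixedSD   = cutlass::conv::IteratorAlgorithm::kFixedStrideDilation; // direct conv, compile-time stride/dilation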

@yupatrick22
Author

Thanks @Ethan-Yan27

@yupatrick22
Author

yupatrick22 commented Dec 5, 2023

I have tried the implicit GEMM code mentioned above, but the GPU utilization (i.e., compute & memory util.) is low. There seems to be no direct parameter like splitK in https://github.com/NVIDIA/cutlass/blob/main/test/unit/conv/device/depthwise_conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu that would improve GPU utilization. How can I fix that? @Ethan-Yan27

@Ethan-Yan27
Collaborator

Hi, both group conv and depthwise conv(the kAnalytic version) don't support splitK.

Here are the can_implement details: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/conv/device/implicit_gemm_convolution.h#L116C11-L116C11
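
A minimal usage sketch, assuming Conv is some cutlass::conv::device::ImplicitGemmConvolution instantiation (e.g. the one built in that unit test); can_implement is where an unsupported split-k configuration is rejected up front:

    // Sketch only: query can_implement before launching. For a grouped / kAnalytic
    // depthwise kernel, a split_k_slices > 1 request would fail this check.
    #include "cutlass/cutlass.h"

    template <typename Conv>
    bool supports_config(Conv &op, typename Conv::Arguments const &args) {
      return op.can_implement(args) == cutlass::Status::kSuccess;
    }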

If you want good perf, please use kOptimized or kFixedStrideDilation for the depthwise conv kernel. Thanks.
