
[QST] Is get_tiled_shape in depthwise conv correct #1213

Closed
yupatrick22 opened this issue Nov 27, 2023 · 11 comments
Labels
? - Needs Triage bug Something isn't working

Comments

@yupatrick22

yupatrick22 commented Nov 27, 2023

I am trying to write a depthwise conv kernel. Reviewing sample code 46, I find that the get_tiled_shape function in the threadblock swizzle class for depthwise conv is overloaded and always outputs 1 for the GEMM M dimension, no matter what the conv problem size is. This seems weird to me. Could you please explain it, or is it a bug?

[Screenshot: the overloaded get_tiled_shape in the depthwise threadblock swizzle]

@yupatrick22 yupatrick22 added ? - Needs Triage bug Something isn't working labels Nov 27, 2023
@mnicely mnicely changed the title Is get_tiled_shape in depthwise conv correct [QST] Is get_tiled_shape in depthwise conv correct Nov 27, 2023
@hwu36
Collaborator

hwu36 commented Nov 28, 2023

@Ethan-Yan27

@Ethan-Yan27
Collaborator

It is intentional. Please refer to the basic idea of the depthwise 2d conv implementation in #1133 (comment).

With this basic idea, normally, if we do not set split_k, one threadblock computes all the output tiles that belong to specific output channels (i.e., ones sharing the same filters).

For example, in the example kernel, if the user sets ./46_depthwise_simt_conv2dfprop --n=1 --h=112 --w=112 --c=1024 --k=1024 --r=3 --s=3 --g=1024, the grid size is <1,16,1>, and each threadblock calculates all 112x112 outputs of its corresponding channels.
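
For reference, a minimal sketch (not the actual CUTLASS code) of how that <1,16,1> grid falls out of the overloaded get_tiled_shape; the 64-channel threadblock tile is inferred from 1024 channels / 16 blocks, so treat it as an assumption:

    // Minimal sketch: how the <1,16,1> grid above falls out of the overloaded
    // get_tiled_shape. The 64-channel threadblock tile is an assumption here.
    #include <cstdio>

    int main() {
      int k = 1024;            // output channels (= groups for depthwise)
      int tile_n = 64;         // assumed threadblock tile along the channel dim
      int split_k_slices = 1;  // split-k disabled

      int grid_m = 1;                          // always 1: one CTA owns all NPQ of its channels
      int grid_n = (k + tile_n - 1) / tile_n;  // 1024 / 64 = 16
      int grid_k = split_k_slices;

      std::printf("grid = <%d,%d,%d>\n", grid_m, grid_n, grid_k);  // grid = <1,16,1>
      return 0;
    }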

@yupatrick22
Author

@Ethan-Yan27 Thanks for your reply.

Following the workload partition in #1133 (comment), each threadblock calculates all output pixels of a specific channel range, e.g. cta0: channels 0-31, cta1: channels 32-63. This means the input activation would need at least 32*132 = 4224 channels to utilize all 132 SMs of an H800, and real workloads may not meet this constraint.

Looking back at the example ./46_depthwise_simt_conv2dfprop --n=1 --h=112 --w=112 --c=1024 --k=1024 --r=3 --s=3 --g=1024, the grid size is <1,16,1>, which means only 16 of the 132 SMs are utilized, i.e. very low occupancy. However, each threadblock has to do 112*112/(8*8) = 196 iterations. Why not make each threadblock do fewer iterations and generate a larger grid to fully utilize the GPU, for example grid size <196,16,1>?

@Ethan-Yan27
Collaborator

To fully use the GPU resources, please enable the splitK feature. It launches more threadblocks and reduces the mainloop iterations per threadblock.

The example option: --splitk=XXX

    return gemm::GemmCoord(1,
                           (implicit_gemm_problem_size.n() + tile_size.n() - 1) / tile_size.n(),
                           split_k_slices);   // <-- the --splitk option affects this coordinate
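
For the problem above, for example, --splitk=8 should produce a <1,16,8> grid, so each threadblock covers roughly 196/8 ≈ 25 of the mainloop iterations (assuming the 8x8 output tile from the earlier comment).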

@yupatrick22
Author

Yes, I have tried splitK, and it works well for the example case --n=1 --h=112 --w=112 --c=1024 --k=1024 --r=3 --s=3 --g=1024; the utilization is much improved.

But I have another case where both the filter size and the channel count are small (R=3, S=3, C=32) while the feature map is very large (N=256, H=112, W=112). splitK might not be a good choice here, since R*S*C = 288 is very small. Is it possible to split NHW across several threadblocks?

@Ethan-Yan27
Collaborator

SplitK here means splitting the NHW dim (or more precisely, the output NPQ dim), not the K dim. Though the splitK name is not very intuitive, it is for your case.

Take the example picture in #1133 (comment): if we set splitK=2, then 4 threadblocks are launched (a sketch of this mapping follows the list).

  • cta0 would compute P0-7 & Q0-7 & K0-31.
  • cta1 would compute P0-7 & Q0-7 & K32-63.
  • cta2 would compute P8-15 & Q0-7 & K0-31.
  • cta3 would compute P8-15 & Q0-7 & K32-63.
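
A minimal sketch of that mapping (the P=16, Q=8, K=64 extents and the 32-channel threadblock tile are assumptions taken from the example picture); the second grid dimension picks the channel tile and the third (split-k) dimension picks the slice of output rows, though the exact swizzle mapping inside CUTLASS may differ:

    // Sketch of the splitK=2 partitioning described above; extents and tile size
    // are assumptions matching the example picture, not values read from CUTLASS.
    #include <cstdio>

    int main() {
      int P = 16, Q = 8, K = 64;   // assumed output extents for the example picture
      int tile_k = 32;             // assumed channels per threadblock
      int split_k = 2;

      int grid_n = K / tile_k;             // 2 channel tiles
      int rows_per_slice = P / split_k;    // 8 output rows per split-k slice

      for (int slice = 0; slice < split_k; ++slice) {
        for (int n = 0; n < grid_n; ++n) {
          int cta = slice * grid_n + n;
          std::printf("cta%d: P%d-%d & Q0-%d & K%d-%d\n",
                      cta,
                      slice * rows_per_slice, (slice + 1) * rows_per_slice - 1,
                      Q - 1,
                      n * tile_k, (n + 1) * tile_k - 1);
        }
      }
      return 0;
    }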

FYI, activation iterator movement for each cta: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/conv/threadblock/depthwise_fprop_activation_tile_access_iterator_direct_conv_fixed_stride_dilation.h#L214

@yupatrick22
Author

Thank you very much for clarifying this. I am a little confused: from the GTC talk, depthwise conv in CUTLASS uses an im2col + implicit GEMM method, as below.
[Screenshot: GTC slide showing the im2col + implicit GEMM approach for depthwise conv]
But is that exactly the method used in sample code 46? It looks like code 46 uses a direct conv method.

@Ethan-Yan27
Collaborator

Ethan-Yan27 commented Dec 1, 2023

At the GTC talk you quoted, we had just released kAnalytic. The other two modes were released later.

https://github.com/NVIDIA/cutlass/blob/main/examples/46_depthwise_simt_conv2dfprop/depthwise_simt_conv2dfprop.cu#L36C1-L43.

There are 3 types of implementations of depthwise 2d convolution:
  1. kAnalytic
    Implicit GEMM 2d convolution algorithm.
  2. kOptimized
    An optimized algorithm that supports arbitrary stride and dilation.
  3. kFixedStrideDilation
    An optimized algorithm with fixed stride and dilation to reduce runtime computation and enable more optimizations.

Example 46 demonstrates the kFixedStrideDilation version, which uses the direct conv approach.

If you are interested in the implicit GEMM fprop version, please refer to https://github.com/NVIDIA/cutlass/blob/main/test/unit/conv/device/depthwise_conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu. In general, the perf is kFixedStrideDilation > kOptimized > kAnalytic.
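
A minimal sketch of how these three modes show up in code: they correspond to values of the cutlass::conv::IteratorAlgorithm enum in cutlass/conv/convolution.h (names as recalled; please check them against your CUTLASS version):

    // Sketch only: the three depthwise modes as IteratorAlgorithm values. The comments
    // restate the descriptions above; verify the enum names against your CUTLASS version.
    #include "cutlass/conv/convolution.h"

    constexpr auto kModeAnalytic  = cutlass::conv::IteratorAlgorithm::kAnalytic;            // implicit GEMM
    constexpr auto kModeOptimized = cutlass::conv::IteratorAlgorithm::kOptimized;           // direct conv, arbitrary stride/dilation
    constexpr auto kModeFixedSD   = cutlass::conv::IteratorAlgorithm::kFixedStrideDilation; // direct conv, compile-time stride/dilation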

@yupatrick22
Author

Thanks @Ethan-Yan27

@yupatrick22
Author

yupatrick22 commented Dec 5, 2023

I have tried the implicit GEMM code mentioned above, but the GPU utilization (i.e., compute & memory util.) is low. There seems to be no direct parameter like splitK in https://github.com/NVIDIA/cutlass/blob/main/test/unit/conv/device/depthwise_conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu that would improve GPU utilization. How can I fix that? @Ethan-Yan27

@Ethan-Yan27
Collaborator

Hi, both group conv and depthwise conv(the kAnalytic version) don't support splitK.

Here are the can_implement details: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/conv/device/implicit_gemm_convolution.h#L116C11-L116C11
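
A minimal usage sketch, assuming Conv is some cutlass::conv::device::ImplicitGemmConvolution instantiation (e.g. the one built in that unit test); can_implement is where an unsupported split-k configuration is rejected up front:

    // Sketch only: query can_implement before launching. For a grouped / kAnalytic
    // depthwise kernel, a split_k_slices > 1 request would fail this check.
    #include "cutlass/cutlass.h"

    template <typename Conv>
    bool supports_config(Conv &op, typename Conv::Arguments const &args) {
      return op.can_implement(args) == cutlass::Status::kSuccess;
    }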

If you want good perf, please use kOptimized or kFixedStrideDilation for the depthwise conv kernel. Thanks.
