[QST] Is get_tiled_shape in depthwise conv correct? #1213
It is intentional. Please refer to the basic idea of the depthwise 2d conv implementation in #1133 (comment). With this basic idea, normally, if we do not set split_k, one threadblock computes all the output tiles belonging to specific output channels (i.e., sharing the same filters). For example, in the example kernel, if the user sets …
@Ethan-Yan27 Thanks for your reply. Following the workload partition in #1133 (comment), each thread block calculates all output pixels corresponding to a specific channel range, e.g. cta0: channels 0-31, cta1: channels 32-63. This means that the channel count of the input activation should be at least 32*132 = 4224 to utilize all 132 SMs of an H800, while the user scenario may not fulfill this constraint. Looking back at the example ./46_depthwise_simt_conv2dfprop --n=1 --h=112 --w=112 --c=1024 --k=1024 --r=3 --s=3 --g=1024, the grid size is <1,16,1>, meaning only 16 out of 132 SMs are utilized, which is very low occupancy. However, each thread block has to do 112*112/(8*8) = 196 iterations. Why not make each thread block do fewer iterations and generate a larger grid size to fully utilize the GPU? For example, grid size <196,16,1>?
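To make the arithmetic concrete, here is a minimal standalone sketch. The 64-channels-per-CTA and 8×8-output-tile figures are assumptions inferred from the grid size <1,16,1> quoted above, not values read out of example 46's source:

```cpp
#include <cstdio>

int main() {
  // Problem size from the example invocation above:
  // --n=1 --h=112 --w=112 --c=1024 --k=1024 --r=3 --s=3 --g=1024
  int K = 1024;          // output channels
  int P = 112, Q = 112;  // output height/width

  // Assumed threadblock configuration: 64 channels per CTA,
  // an 8x8 block of output pixels per mainloop iteration.
  int channels_per_cta = 64;
  int tile_p = 8, tile_q = 8;

  int grid_n = K / channels_per_cta;         // 1024 / 64 = 16 CTAs -> grid <1, 16, 1>
  int iters  = (P * Q) / (tile_p * tile_q);  // 112*112 / (8*8) = 196 iterations per CTA
  std::printf("grid = <1, %d, 1>, mainloop iterations per CTA = %d\n", grid_n, iters);
  return 0;
}
```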
To fully use the GPU resources, please enable the splitK feature. It launches more threadblocks and reduces the mainloop iterations per threadblock. The example option:
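For illustration only: assuming example 46 accepts a --splitk command-line argument (the flag name is an assumption), an invocation might look like:

./46_depthwise_simt_conv2dfprop --n=1 --h=112 --w=112 --c=1024 --k=1024 --r=3 --s=3 --g=1024 --splitk=4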
Yes, I have tried splitK, and it works well for the example case --n=1 --h=112 --w=112 --c=1024 --k=1024 --r=3 --s=3 --g=1024; the utilization is much improved. But I have another case where both the filter size and the channel size are small (R=3, S=3, C=32) while the feature map is very large (N=256, H=112, W=112). splitK might not be a good choice for this case, since R*S*C = 288 is very small. Is it possible to split NHW across several thread blocks?
SplitK here means splitting the NHW dim (or more accurately, the output NPQ dim), not the K dim. Though the splitK name is not very intuitive, it is for your case. Taking the example picture in #1133 (comment), if we set splitK=2, then 4 threadblocks (2 channel tiles × 2 split-K slices) would be launched.
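A minimal sketch of the effect, reusing the assumed figures from the earlier sketch: splitK multiplies the grid's K dimension and divides the per-CTA NPQ mainloop work accordingly.

```cpp
#include <cstdio>

int main() {
  // Without splitK, example 46 above runs with grid <1, 16, 1> and ~196
  // mainloop iterations per CTA (assumed figures, see the earlier sketch).
  int grid_n = 16;
  int iters  = 196;
  int splits[] = {1, 2, 4, 14};
  for (int splitk : splits) {
    std::printf("splitK=%2d: grid <1, %d, %d> = %3d CTAs, ~%d iterations per CTA\n",
                splitk, grid_n, splitk, grid_n * splitk, iters / splitk);
  }
  return 0;
}
```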
FYI, the activation iterator movement for each cta: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/conv/threadblock/depthwise_fprop_activation_tile_access_iterator_direct_conv_fixed_stride_dilation.h#L214
In the GTC talk you quoted, we had just released kAnalytic. The other two modes were released later.
Example 46 demonstrates the kFixedStrideDilation version, which uses the direct conv approach. If you are interested in the implicit GEMM fprop version, please refer to https://github.com/NVIDIA/cutlass/blob/main/test/unit/conv/device/depthwise_conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu. In general, the perf ordering is kFixedStrideDilation > kOptimized > kAnalytic.
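For reference, the three modes are values of the cutlass::conv::IteratorAlgorithm enum and are chosen as a kernel template parameter; a minimal sketch (the alias names below are illustrative, only the enum values come from CUTLASS):

```cpp
#include "cutlass/conv/convolution.h"

// The three depthwise iterator algorithms discussed above. Typical perf
// ordering for depthwise fprop: kFixedStrideDilation > kOptimized > kAnalytic.
using cutlass::conv::IteratorAlgorithm;

constexpr IteratorAlgorithm kDirectConv  = IteratorAlgorithm::kFixedStrideDilation;  // direct conv, fastest
constexpr IteratorAlgorithm kImplicitOpt = IteratorAlgorithm::kOptimized;            // implicit GEMM, optimized
constexpr IteratorAlgorithm kImplicitRef = IteratorAlgorithm::kAnalytic;             // implicit GEMM, reference
```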
Thanks @Ethan-Yan27
I have tried the mentioned implicit GEMM code, but the GPU utilization (i.e., compute & memory utilization) is low. There seems to be no direct parameter like splitK in https://github.com/NVIDIA/cutlass/blob/main/test/unit/conv/device/depthwise_conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu that would improve GPU utilization. How can I fix that? @Ethan-Yan27
Hi, both group conv and depthwise conv (the kAnalytic version) don't support splitK. Here are the can_implement details: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/conv/device/implicit_gemm_convolution.h#L116C11-L116C11. If you want good perf, please use kOptimized or kFixedStrideDilation for the depthwise conv kernel. Thanks.
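A usage sketch of that check (run_if_supported is a hypothetical helper; can_implement(), initialize(), and operator() are the standard CUTLASS device-operator calls):

```cpp
#include "cutlass/cutlass.h"

// Generic over any CUTLASS conv device operator: call can_implement() before
// launching, since kernels that cannot honor the arguments, e.g. a kAnalytic
// depthwise kernel with split_k_slices > 1, are rejected here up front.
template <typename Conv>
cutlass::Status run_if_supported(typename Conv::Arguments const &args) {
  Conv op;
  cutlass::Status status = op.can_implement(args);
  if (status != cutlass::Status::kSuccess) {
    return status;  // caller can fall back, e.g. retry with split_k_slices = 1
  }
  status = op.initialize(args);  // workspace allocation elided for brevity
  if (status != cutlass::Status::kSuccess) {
    return status;
  }
  return op();  // launch the kernel
}
```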
I am trying to write a depthwise conv kernel. By reviewing the sample code 46, I found that the get_tiled_shape function in the threadblock swizzle class for depthwise is overloaded and always outputs 1 in the M dimension, no matter what the conv problem size is. This seems weird to me. Could you please explain it? Or is it a bug?
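For context, a paraphrased sketch of what that overload effectively does (simplified; the real signature in cutlass/gemm/threadblock/threadblock_swizzle.h may differ across CUTLASS versions). M is pinned to 1 because a single CTA walks all NPQ tiles of its channel range, N indexes the channel tiles, and K carries the split-K slices:

```cpp
#include "cutlass/gemm_coord.h"

// Paraphrased sketch of the depthwise-specific get_tiled_shape() overload.
cutlass::gemm::GemmCoord get_tiled_shape(
    cutlass::gemm::GemmCoord implicit_gemm_problem_size,
    cutlass::gemm::GemmCoord tile_size,
    int split_k_slices) {
  return cutlass::gemm::GemmCoord(
      1,  // M pinned to 1: one CTA iterates over every NPQ tile of its channels
      (implicit_gemm_problem_size.n() + tile_size.n() - 1) / tile_size.n(),  // N: channel tiles
      split_k_slices);  // K: split-K slices subdivide the NPQ mainloop
}
```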