This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Wrap launch bounds #570

Draft
wants to merge 1 commit into base: main

Conversation

gevtushenko
Collaborator

This PR addresses the following issue by replacing __launch_bounds__ usages with CUB_DETAIL_LAUNCH_BOUNDS, which expands to __launch_bounds__ only when RDC is not enabled. Builds without RDC are therefore unaffected by this PR. For builds with RDC, the max performance differences are:

| facility | type | diff |
| --- | --- | --- |
| cub::DeviceSelect::If (complex predicate) | All | 0% |
| cub::DeviceSelect::If | U32 | -9% |
| cub::DeviceSelect::If | U64 | 0% |
| cub::DeviceSegmentedReduce::Sum | U8 | 1% |
| cub::DeviceSegmentedReduce::Sum | U64 | -10% |
| cub::DeviceSegmentedRadixSort::SortPairs | U{8,16} | 4% |
| cub::DeviceSegmentedRadixSort::SortPairs | U{32,64} | -32% |
| cub::DeviceSegmentedRadixSort::SortKeys | U{8,16} | 8% |
| cub::DeviceSegmentedRadixSort::SortKeys | U{32,64} | -25% |
| cub::DeviceScan::InclusiveSum | All | 0% |
| cub::DeviceScan::InclusiveSum - complex op | F32 | -7% |
| cub::DeviceScan::ExclusiveSum | All | 0% |
| cub::DeviceReduce::Reduce - custom op | All | -8% |
| cub::DeviceReduce::Reduce | U8 | 20% |
| cub::DeviceReduce::Reduce | U32 | 8% |
| cub::DeviceReduce::Reduce | F64 | -4% |
| cub::DevicePartition::If | All | -4% |
| cub::DeviceHistogram::HistogramRange - A lot of bins | U64 | 30% |
| cub::DeviceHistogram::HistogramRange | All | 0% |
| cub::DeviceHistogram::HistogramEven - A lot of bins | U32 | -20% |
| cub::DeviceHistogram::HistogramEven | All | 0% |
| cub::DeviceAdjacentDifference | All | 0% |

A negative diff means a speedup for the version without __launch_bounds__. Since the results are rather mixed, I wouldn't like to advertise the macro as part of our API. If absolutely needed, one might define:

#define CUB_DETAIL_LAUNCH_BOUNDS(...) \
  __launch_bounds__(__VA_ARGS__)

#include <cub/cub.cuh>

But for now it's an implementation detail that fixes compilation with RDC in some corner cases. Going forward, we might consider a tuning API that controls the __launch_bounds__ specification as well as pragma unroll usage. The default tuning would be a function of the input types.

@gevtushenko gevtushenko added the type: bug: compiler Bug in a compiler, not this library. label Sep 5, 2022
@gevtushenko gevtushenko added this to the 2.1.0 milestone Sep 5, 2022
gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Sep 5, 2022
@gevtushenko gevtushenko added the testing: gpuCI in progress Started gpuCI testing. label Sep 5, 2022
@dkolsen-pgi
Collaborator

The performance results are interesting with the mix of better and worse. My interpretation of that is that some of the functions are not as well tuned as they could be.

@gevtushenko
Collaborator Author

> The performance results are interesting with the mix of better and worse. My interpretation of that is that some of the functions are not as well tuned as they could be.

My interpretation is that the presence of __launch_bounds__ probably wasn't questioned during the tuning process before.

@gevtushenko gevtushenko marked this pull request as draft September 6, 2022 09:37
@gevtushenko
Collaborator Author

Testing revealed some issues with this approach. We can't simply remove __launch_bounds__; here's a reproducer that shows why:

#include <cstdio>  // for std::printf
#include <cub/cub.cuh>

using MaxPolicyT = cub::DispatchRadixSort<false, unsigned short, cub::NullType, unsigned int>::MaxPolicy;

int main() {
  int sm_occupancy{};
  if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &sm_occupancy,
        cub::DeviceRadixSortDownsweepKernel<MaxPolicyT, false,  true, unsigned short, cub::NullType, unsigned int>,
        512,
        0)) {
    std::printf("error\n");
  }
  std::printf("%d\n", sm_occupancy);
}

When DeviceRadixSortDownsweepKernel has __launch_bounds__ specified, cudaOccupancyMaxActiveBlocksPerMultiprocessor reports an occupancy of 1. Otherwise, it reports 0. That's because without __launch_bounds__, nvcc generates code that requires 153 registers per thread, i.e. 78336 registers for a 512-thread block. V100, however, provides only 65536 registers per SM, making it impossible to launch the kernel with the required thread block size.

In order to proceed with this PR, we'd have to retune the entire CUB for every supported HW with and without __launch_bounds__. Even then, I have doubts whether we can do so in the presence of custom user types. It might happen that a custom user type pushes the register count past the limit, leaving us again unable to launch a kernel with the pre-tuned thread block size. So the tuning process has to be complemented with a new abstraction layer that iterates over the thread block size search space and converges on a tuning based on funcAttrib.numRegs. @dkolsen-pgi am I missing something?

@jrhemstad
Collaborator

jrhemstad commented Sep 6, 2022

> So the tuning process has to be complemented with a new abstraction layer that would iterate over thread block size search space and converge on a tuning based of funcAttrib.numRegs. @dkolsen-pgi am I missing something?

As a shorter-term workaround, we could clamp the threads-per-block at 128 when RDC is used. With the 255-registers-per-thread limit, a CTA size of 128 is guaranteed to fall within the register allocation requirements (128 × 255 = 32640 registers, well under the 65536-register budget).

This would obviously have performance implications, but I don't see a better option any time soon.

a new abstraction layer that would iterate over thread block size search space and converge on a tuning based of funcAttrib.numRegs

I don't think this would work. You can only use cudaFuncGetAttributes on __global__ functions. If you queried the kernel for its register usage, that doesn't tell you anything about the register requirements of the invoked __device__ functions when using RDC.

@gevtushenko
Collaborator Author

@jrhemstad I agree with your point, thanks! I'll probably try clamping the thread block size.

@jrhemstad jrhemstad removed their request for review April 26, 2023 20:13
Successfully merging this pull request may close these issues.

New indirection level for launch bounds