
Improve SegRed sequentialization in certain cases #2054

Merged: 7 commits, Dec 5, 2023
Conversation

@sortraev (Collaborator) commented Dec 1, 2023

This PR modifies and, ideally, improves GPU code generation for non-segmented and large-segments segmented reductions with non-commutative and primitive operators.

The stage-one main loop is stripmined by a factor `chunk` (dependent on available resources and the parameter element type(s)), inserting collective copies via local memory into the thread-private (register) memory of each reduction parameter going into the stage-one intra-group (partial) reductions. This reduces the number of intra-group reductions by a factor of `chunk`, at the cost of some overhead in the collective copies.
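To make the strategy above concrete, here is a simplified, sequential Python model of it. This is not the actual ImpCode the compiler generates; the thread count, chunk size, padding scheme, and the local-memory staging are all idealized, and the "threads" simply run one after another:

```python
from functools import reduce

def chunked_reduce(xs, op, neutral, num_threads=4, chunk=3):
    # Pad so the input divides evenly into per-thread blocks of whole
    # chunks (cf. nextMul-style size alignment).
    stride = num_threads * chunk
    padded = xs + [neutral] * ((-len(xs)) % stride)
    per_thread = len(padded) // num_threads

    partials = []
    for t in range(num_threads):  # "threads" run sequentially here
        acc = neutral
        block = padded[t * per_thread : (t + 1) * per_thread]
        for base in range(0, per_thread, chunk):
            # In the real code generation, each chunk is staged through
            # local memory into registers via a collective copy; here it
            # is just a list slice.
            for x in block[base : base + chunk]:
                acc = op(acc, x)
        partials.append(acc)
    # One final intra-group combine of the per-thread partials, in
    # thread order, so non-commutative operators still give the right
    # answer.
    return reduce(op, partials, neutral)
```

Note that each thread owns one contiguous block of the input, which is what keeps the sketch correct for non-commutative operators such as string concatenation.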

The general structure of the module remains; however, a number of smaller design choices have been made to accommodate the distinction between the different kinds of reductions (i.e. the cases to which we apply the extra optimization versus all other reductions) with respect to, e.g., the declaration and handling of intermediate arrays.

There are still some TODOs to attend to before merging:

  1. do formal validation testing, i.e. more thorough testing than simply benchmarking with automatic output validation.
  2. quantify the performance improvement, if any. So far I have seen decent speedups in certain cases of the futhark-benchmarks/micro/reduce*.fut tests on an RTX 4090 (save for reductions over iotas; see below), but which other benchmark suites are interesting?
  3. examine whether chunking can be applied to the small segments case somehow.
  4. the optimization adds redundant overhead to reductions whose parameters do not need manifestation, including but not necessarily limited to reductions over iotas. How can we avoid this, assuming such cases are common enough for this to be a real problem?
  5. I am unsure about some of the design choices. For example, I use SegRedKind to distinguish between the case we want to optimize and the cases we do not, and while this feels more modular and legible than simply passing around a bool is_noncomm_primparams_reduction parameter, it does feel a little janky in places. Ideally this distinction would be made implicitly throughout the module somehow, similar to how the distinction between segmented vs. non-segmented is made implicitly by interpreting non-segmented reductions as single-segment segmented reductions. @athas do you have a good idea?
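To illustrate point 5: the alternative to a bare boolean flag is to branch on a dedicated sum type. The PR only mentions the type's name, so the constructor names below are hypothetical, and this Python enum is just a sketch of the idea (the real code is Haskell):

```python
from enum import Enum, auto

class SegRedKind(Enum):
    # Hypothetical constructors; the PR names only the type itself.
    NONCOMM_PRIM = auto()  # non-commutative op, primitive-typed params
    GENERAL = auto()       # every other kind of reduction

def uses_chunking(kind: SegRedKind) -> bool:
    # Branch on the kind rather than threading a bare
    # is_noncomm_primparams_reduction bool through the whole module.
    return kind is SegRedKind.NONCOMM_PRIM
```

The payoff over a boolean is that further cases can be added later without reinterpreting an existing flag.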

Also in this PR:

  • Add nextMul to Util.IntegralExp to standardize size alignment.
  • Add isPrimParam and getChunkSize (the latter lifted from SegScan.SinglePass) to GPU.Base, as well as minor refactorings to the module.
  • Change SegScan.SinglePass to use getChunkSize, and some light refactoring.
  • Remove comment from ImpGen, since it seemed to be a duplicate of sComment.
  • CUDA compiled binaries now properly return 0 on successful --dump-cuda, similar to the other GPU backends.
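Regarding `nextMul`: assuming it has the usual "round up to the next multiple" semantics (inferred here from the name and the stated purpose of size alignment, not from its actual definition), a minimal Python sketch would be:

```python
def next_mul(x: int, m: int) -> int:
    # Round x up to the nearest multiple of m (assumes m > 0).
    return ((x + m - 1) // m) * m
```

For example, aligning a 13-element size to a 4-element boundary yields 16, while an already-aligned 16 stays 16.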

This commit modifies and, ideally, improves GPU code generation for
non-segmented and large-segments segmented reductions with
non-commutative and primitive operators (see description in module
header).

There are still some TODOs to attend to -- most importantly, we have
yet to determine for certain exactly how big an improvement, if any,
this change yields.

Also:
* Adds `nextMul` to `Util.IntegralExp` to standardize size alignment.
* Adds `isPrimParam` and `getChunkSize` (the latter lifted from
  `SegScan.SinglePass`) to `GPU.Base`, as well as minor refactorings to
  the module.
* Changes `SegScan.SinglePass` to use `getChunkSize`, and some light
  refactoring.
* Adds `tests/soacs/scan-in-loop.fut`, which tests proper resetting and
  reuse of the global-mem dynamic ID counter.
* Removes `comment` from `ImpGen`, since it seemed to be a duplicate of
  `sComment`.
* CUDA compiled binaries now properly return 0 on successful
  `--dump-cuda`, similar to the other GPU backends.
@sortraev sortraev requested a review from athas December 1, 2023 13:21
@athas athas added the run-benchmarks Makes GA run the benchmark suite. label Dec 1, 2023
@sortraev sortraev requested a review from coancea December 1, 2023 14:11
@athas (Member) commented Dec 1, 2023

There's a lot of stuff here that is general refactoring of parts of code generation, and which makes reviewing difficult. I have added some of it manually to master, and may add some more.

Review comments on src/Futhark/CodeGen/ImpGen/GPU/SegRed.hs (outdated; resolved)
@sortraev (Collaborator, Author) commented Dec 2, 2023

Yes, @athas, I am sorry that this PR is not focused entirely on the relevant changes. Some of the other changes were nice-to-haves during development (such as nextMul and my scripts not breaking on --dump-cuda returning 1), but some of the other changes were obviously unnecessary. I tend to go looking for things to fix that do not need fixing when my mind begins to wander.

I can make a new PR with only the changes to GPU.SegRed and any dependencies, if you prefer.

@athas (Member) commented Dec 2, 2023

I think I already got most of the pertinent bits.

Since it is not the same as `segFlat space` under virtualization.
@athas (Member) commented Dec 4, 2023

Remember to also merge master into your branch and resolve the conflicts.

@athas (Member) commented Dec 5, 2023

This is a 2x speedup on gccontent. Not bad.

@athas merged commit 5234eb8 into master on Dec 5, 2023
24 checks passed
@athas deleted the reduce_opts branch on December 5, 2023 at 13:54
athas added a commit that referenced this pull request Dec 5, 2023