
Improve SegRed sequentialization in certain cases #2054

Merged: 7 commits, Dec 5, 2023
Conversation

@sortraev (Collaborator) commented Dec 1, 2023

This PR modifies and, ideally, improves GPU code generation for non-segmented and large-segments segmented reductions with non-commutative and primitive operators.

The stage-one main loop is stripmined by a factor `chunk` (dependent on available resources and the parameter element type(s)), inserting collective copies via local memory into the thread-private (register) memory of each reduction parameter going into the stage-one intra-group (partial) reductions. This reduces the number of intra-group reductions by a factor of `chunk`, at the cost of some overhead in the collective copies.
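To make the strategy above concrete, here is a simplified, sequential Python model of it. This is not the actual ImpCode the compiler generates; the thread count, chunk size, padding scheme, and the local-memory staging are all idealized, and the "threads" simply run one after another:

```python
from functools import reduce

def chunked_reduce(xs, op, neutral, num_threads=4, chunk=3):
    # Pad so the input divides evenly into per-thread blocks of whole
    # chunks (cf. nextMul-style size alignment).
    stride = num_threads * chunk
    padded = xs + [neutral] * ((-len(xs)) % stride)
    per_thread = len(padded) // num_threads

    partials = []
    for t in range(num_threads):  # "threads" run sequentially here
        acc = neutral
        block = padded[t * per_thread : (t + 1) * per_thread]
        for base in range(0, per_thread, chunk):
            # In the real code generation, each chunk is staged through
            # local memory into registers via a collective copy; here it
            # is just a list slice.
            for x in block[base : base + chunk]:
                acc = op(acc, x)
        partials.append(acc)
    # One final intra-group combine of the per-thread partials, in
    # thread order, so non-commutative operators still give the right
    # answer.
    return reduce(op, partials, neutral)
```

Note that each thread owns one contiguous block of the input, which is what keeps the sketch correct for non-commutative operators such as string concatenation.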

The general structure of the module remains; however, a number of smaller design choices have been made to accommodate the distinction between the different kinds of reductions (i.e. the cases to which we apply the extra optimization versus all other reductions) with respect to, e.g., the declaration and handling of intermediate arrays.

There are still some TODOs to attend to before merging:

  1. do formal validation testing, i.e. more thorough testing than simply benchmarking with automatic output validation.
  2. quantify the performance improvement, if any. So far I have seen decent speedups in certain cases of the futhark-benchmarks/micro/reduce*.fut tests on an RTX 4090 (save for reductions over iotas; see below), but which other benchmark suites are interesting?
  3. examine whether chunking can be applied to the small segments case somehow.
  4. the optimization adds redundant overhead to reductions whose parameters do not need manifestation, including but not necessarily limited to reductions over iotas. How can we avoid this, assuming such cases are common enough for this to be a real problem?
  5. I am unsure about some of the design choices. For example, I use SegRedKind to distinguish between the case we want to optimize and the cases we do not, and while this feels more modular and legible than simply passing around a bool is_noncomm_primparams_reduction parameter, it does feel a little janky in places. Ideally this distinction would be made implicitly throughout the module somehow, similar to how the distinction between segmented vs. non-segmented is made implicitly by interpreting non-segmented reductions as single-segment segmented reductions. @athas do you have a good idea?
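To illustrate point 5: the alternative to a bare boolean flag is to branch on a dedicated sum type. The PR only mentions the type's name, so the constructor names below are hypothetical, and this Python enum is just a sketch of the idea (the real code is Haskell):

```python
from enum import Enum, auto

class SegRedKind(Enum):
    # Hypothetical constructors; the PR names only the type itself.
    NONCOMM_PRIM = auto()  # non-commutative op, primitive-typed params
    GENERAL = auto()       # every other kind of reduction

def uses_chunking(kind: SegRedKind) -> bool:
    # Branch on the kind rather than threading a bare
    # is_noncomm_primparams_reduction bool through the whole module.
    return kind is SegRedKind.NONCOMM_PRIM
```

The payoff over a boolean is that further cases can be added later without reinterpreting an existing flag.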

Also in this PR:

  • Add nextMul to Util.IntegralExp to standardize size alignment.
  • Add isPrimParam and getChunkSize (the latter lifted from SegScan.SinglePass) to GPU.Base, as well as minor refactorings to the module.
  • Change SegScan.SinglePass to use getChunkSize, and some light refactoring.
  • Remove comment from ImpGen, since it seemed to be a duplicate of sComment.
  • CUDA compiled binaries now properly return 0 on successful --dump-cuda, similar to the other GPU backends.
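Regarding `nextMul`: assuming it has the usual "round up to the next multiple" semantics (inferred here from the name and the stated purpose of size alignment, not from its actual definition), a minimal Python sketch would be:

```python
def next_mul(x: int, m: int) -> int:
    # Round x up to the nearest multiple of m (assumes m > 0).
    return ((x + m - 1) // m) * m
```

For example, aligning a 13-element size to a 4-element boundary yields 16, while an already-aligned 16 stays 16.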

This commit modifies and, ideally, improves GPU code generation for
non-segmented and large-segments segmented reductions with
non-commutative and primitive operators (see description in module
header).

There are still some TODOs to attend to -- most importantly, we have
yet to determine for certain exactly how big an improvement, if any,
this change yields.

Also:
* Adds `nextMul` to `Util.IntegralExp` to standardize size alignment.
* Adds `isPrimParam` and `getChunkSize` (the latter lifted from
  `SegScan.SinglePass`) to `GPU.Base`, as well as minor refactorings to
  the module.
* Changes `SegScan.SinglePass` to use `getChunkSize`, and some light
  refactoring.
* Adds `tests/soacs/scan-in-loop.fut`, which tests proper resetting and
  reuse of the global-mem dynamic ID counter.
* Removes `comment` from `ImpGen`, since it seemed to be a duplicate of
  `sComment`.
* CUDA compiled binaries now properly return 0 on successful
  `--dump-cuda`, similar to the other GPU backends.
@sortraev sortraev requested a review from athas December 1, 2023 13:21
@athas athas added the run-benchmarks Makes GA run the benchmark suite. label Dec 1, 2023
@sortraev sortraev requested a review from coancea December 1, 2023 14:11
@athas (Member) commented Dec 1, 2023

There's a lot of stuff here that is general refactoring of parts of code generation, and which makes reviewing difficult. I have added some of it manually to master, and may add some more.

Review comments on src/Futhark/CodeGen/ImpGen/GPU/SegRed.hs (outdated; resolved)
@sortraev (Collaborator, Author) commented Dec 2, 2023

Yes, @athas, I am sorry that this PR is not focused entirely on the relevant changes. Some of the other changes were nice-to-haves during development (such as nextMul and my scripts not breaking on --dump-cuda returning 1), but some of the other changes were obviously unnecessary. I tend to go looking for things to fix that do not need fixing when my mind begins to wander.

I can make a new PR with only the changes to GPU.SegRed and any dependencies, if you prefer.

@athas (Member) commented Dec 2, 2023

I think I already got most of the pertinent bits.

Since it is not the same as `segFlat space` under virtualization.
@athas (Member) commented Dec 4, 2023

Remember to also merge master into your branch and resolve the conflicts.

@athas (Member) commented Dec 5, 2023

This is a 2x speedup on gccontent. Not bad.

@athas merged commit 5234eb8 into master on Dec 5, 2023
24 checks passed
@athas deleted the reduce_opts branch on December 5, 2023 at 13:54
athas added a commit that referenced this pull request Dec 5, 2023