Improve SegRed sequentialization in certain cases #2054
Conversation
This commit modifies and, ideally, improves GPU code generation for non-segmented and large-segments segmented reductions with non-commutative and primitive operators (see the description in the module header). There are still some TODOs left to attend to -- most importantly, we have yet to determine for certain exactly how big of an improvement -- if any at all -- this change is. Also:

* Adds `nextMul` to `Util.IntegralExp` to standardize size alignment.
* Adds `isPrimParam` and `getChunkSize` (the latter lifted from `SegScan.SinglePass`) to `GPU.Base`, as well as minor refactorings to the module.
* Changes `SegScan.SinglePass` to use `getChunkSize`, and some light refactoring.
* Adds `tests/soacs/scan-in-loop.fut`, which tests proper resetting and reuse of the global-mem dynamic ID counter.
* Removes `comment` from `ImpGen`, since it seemed to be a duplicate of `sComment`.
* CUDA compiled binaries now properly return 0 on successful `--dump-cuda`, similar to the other GPU backends.
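For context on the `nextMul` helper mentioned above: the PR only states that it standardizes size alignment, so the following is a minimal Python sketch of the presumed round-up-to-multiple semantics -- the name comes from the PR, but the body here is an assumption for illustration:

```python
def next_mul(m: int, x: int) -> int:
    """Round x up to the next multiple of m (assumed semantics of
    `nextMul`). Useful when aligning buffer sizes, e.g. padding a
    local-memory allocation to a multiple of the element size."""
    if m <= 0:
        raise ValueError("alignment must be positive")
    return ((x + m - 1) // m) * m
```

For example, `next_mul(8, 13)` gives 16, while an already-aligned `next_mul(8, 16)` stays 16.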
There's a lot of stuff here that is general refactoring of parts of code generation, which makes reviewing difficult. I have added some of it manually to
Yes, @athas, I am sorry that this PR is not focused entirely on the relevant changes. Some of the other changes were nice-to-haves during development. I can make a new PR with only the changes to
I think I already got most of the pertinent bits.
Since it is not the same as `segFlat space` under virtualization.
Remember to also merge master into your branch and resolve the conflicts.
This is a 2x speedup on gccontent. Not bad.
This PR modifies and, ideally, improves GPU code generation for non-segmented and large-segments segmented reductions with non-commutative and primitive operators.
The stage one main loop is stripmined by a factor `chunk` (dependent on available resources and parameter element type(s)), inserting collective copies via local mem to thread-private (register) mem of each reduction parameter going into the stage one intra-group (partial) reductions. This saves a factor `chunk` number of intra-group reductions at the cost of some overhead in collective copies.

The general structure of the module remains; however, a number of smaller design choices have been made to accommodate the distinction between the different kinds of reductions (i.e. the cases to which we apply the extra optimization, and all other reductions) wrt. e.g. declaration and handling of intermediate arrays.
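The factor-`chunk` saving can be illustrated with a small host-side simulation. This Python sketch models only the arithmetic structure of stage one, not actual GPU execution; the group size, chunk factor, and operator below are illustrative assumptions:

```python
from functools import reduce

def chunked_group_reduce(xs, group_size, chunk, op):
    """Simulate stage one of the chunked scheme: each of the
    group_size threads first reduces its `chunk` private elements
    sequentially (cheap, in registers), then ONE intra-group tree
    reduction combines the per-thread partial results. Without
    chunking, `chunk` separate intra-group reductions would be
    needed to cover the same elements."""
    assert len(xs) == group_size * chunk
    # Per-thread sequential reduction over a contiguous chunk.
    partials = [reduce(op, xs[t * chunk:(t + 1) * chunk])
                for t in range(group_size)]
    # Single intra-group reduction over the partial results.
    return reduce(op, partials)
```

Because each thread's chunk is a contiguous, in-order slice (which is what the collective copies via local memory arrange), the result order is preserved, so this works for non-commutative (but associative) operators as well.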
There are still some TODOs left to attend to before merging:

* Auto output validation. `futhark-benchmarks/micro/reduce*.fut` tests on a 4090rtx (save for reductions over `iota`s; see below), but which other benchmark suites are interesting?
* Reductions over `iota`s. How can we avoid this, assuming there are enough common cases that it is a big problem?
* `SegRedKind` to distinguish between the case we want to optimize and the cases we do not, and while this feels more modular and legible than simply passing around a bool `is_noncomm_primparams_reduction` parameter, it does feel a little janky in places. Ideally this distinction would be made implicitly throughout the module somehow, similar to how the distinction between segmented vs. non-segmented is made implicitly by interpreting non-segmented reductions as single-segment segmented reductions. @athas, do you have a good idea?

Also in this PR:
* Adds `nextMul` to `Util.IntegralExp` to standardize size alignment.
* Adds `isPrimParam` and `getChunkSize` (the latter lifted from `SegScan.SinglePass`) to `GPU.Base`, as well as minor refactorings to the module.
* Changes `SegScan.SinglePass` to use `getChunkSize`, and some light refactoring.
* Removes `comment` from `ImpGen`, since it seemed to be a duplicate of `sComment`.
* CUDA compiled binaries now properly return 0 on successful `--dump-cuda`, similar to the other GPU backends.
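The PR says `chunk` depends on available resources and parameter element type(s) (via the `getChunkSize` helper now in `GPU.Base`), but not how. The following is a hedged sketch of one plausible heuristic -- the budget, cap, and formula are assumptions for illustration, not the compiler's actual computation: pick the largest chunk such that staging `group_size * chunk` elements of every primitive parameter fits a local-memory budget.

```python
def chunk_size(local_mem_bytes, group_size, elem_sizes, max_chunk=16):
    """Illustrative heuristic: the collective copies stage
    group_size * chunk elements of each reduction parameter in
    local memory, so choose the largest chunk (capped at
    max_chunk) whose staging buffers fit the budget."""
    bytes_per_elem_per_thread = sum(elem_sizes)  # all params together
    budget_per_thread = local_mem_bytes // group_size
    return max(1, min(max_chunk, budget_per_thread // bytes_per_elem_per_thread))
```

For instance, with a 48 KiB budget, 256 threads, and a single 4-byte parameter, the cap of 16 is the binding constraint; with 1024 threads and two 8-byte parameters, local memory limits the chunk to 3.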