Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect outputs and improve performance of commonMemSetLargePattern #2273

Merged
merged 1 commit into from
Nov 15, 2024

Conversation

rafbiels
Copy link
Contributor

@rafbiels rafbiels commented Oct 31, 2024

Change the implementation of commonMemSetLargePattern to use the largest pattern word size supported by the backend into which the pattern can be divided. That is, use 4-byte words if the pattern size is a multiple of 4, otherwise use 2-byte words for even sizes and 1-byte words for odd sizes.

Keep the idea of filling the entire destination region with the first word, and only start strided fill from the second, but implement it correctly. The previous implementation produced incorrect results for any pattern size which wasn't a multiple of 4. For example, a pattern of std::array<uint8_t,5>{{1,2,3,4,5}} produced 1 2 3 4 5 2 3 4 1 5 3 4 1 2 5 ... instead of 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 ....

For HIP, the strided fill remains to be always in 1-byte increments because HIP API doesn't provide strided multi-byte memset functions like CUDA does. For CUDA, both the initial memset and the strided ones use the largest possible word size.

Add a new optimisation skipping the strided fills completely if the pattern is equal to the first word repeated throughout. This is most commonly the case for a pattern of all zeros, but other cases are possible. This optimisation is implemented in both CUDA and HIP adapters.

CUDA correctness and performance before (left) and afer (right) this patch:

We can see the new solution has O(1) complexity for the optimised case of first word repeated throughout the pattern. For a varied pattern, the complexity depends on pattern size S and is O(S) when S is a multiple of 4, otherwise O(2S) when S is even and O(4S) when S is odd.

Similarly for HIP we see the following, where all results are now correct and we get O(1) performance for all-zeros. We also see we don't get the performance boost for multiples of 2B and 4B due to the lack of strided 2B and 4B memset functions in HIP API.

The validation/microbenchmark code is available here:
https://gist.github.com/rafbiels/807cfff4b023d8c53a0663be99ffa2ef
and was adapted into a new e2e test in the intel/llvm PR.

intel/llvm PR: intel/llvm#15991

@github-actions github-actions bot added the cuda CUDA adapter specific issues label Oct 31, 2024
@rafbiels rafbiels force-pushed the rafbiels/improve-memset branch from 5e99670 to cc52821 Compare November 4, 2024 19:10
@github-actions github-actions bot added the hip HIP adapter specific issues label Nov 4, 2024
@rafbiels rafbiels marked this pull request as ready for review November 5, 2024 17:31
@rafbiels rafbiels requested review from a team as code owners November 5, 2024 17:31
Copy link
Contributor

@konradkusiak97 konradkusiak97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this patch. Thanks for fixing my buggy code, I have plenty to learn from you! LGTM.

@rafbiels
Copy link
Contributor Author

rafbiels commented Nov 7, 2024

I really like this patch. Thanks for fixing my buggy code, I have plenty to learn from you! LGTM.

Thanks for the review @konradkusiak97! Looking through the history of the code, I don't think that was any of your fault. The implementation of [pi|ur]EnqueueMemBufferFill has assumed 1, 2, and multiple-of-4 byte pattern sizes since the initial commit years ago. It has never worked correctly for other sizes, though it only came up now because we started using it more.

@rafbiels rafbiels added the ready to merge Added to PR's which are ready to merge label Nov 7, 2024
…tern

Change the implementation of commonMemSetLargePattern to use the largest
pattern word size supported by the backend into which the pattern can be
divided. That is, use 4-byte words if the pattern size is a multiple of 4,
2-byte words for even sizes and 1-byte words for odd sizes.

Keep the idea of filling the entire destination region with the first word,
and only start strided fill from the second, but implement it correctly.
The previous implementation produced incorrect results for any pattern size
which wasn't a multiple of 4. For HIP, the strided fill remains to be always
in 1-byte increments because HIP API doesn't provide strided multi-byte
memset functions like CUDA does. For CUDA, both the initial memset and the
strided ones use the largest possible word size.

Add a new optimisation skipping the strided fills completely if the pattern
is equal to the first word repeated throughout. This is most commonly the
case for a pattern of all zeros, but other cases are possible. This
optimisation is implemented in both CUDA and HIP adapters.
@rafbiels rafbiels force-pushed the rafbiels/improve-memset branch from cc52821 to 6f9d5c5 Compare November 15, 2024 12:26
@rafbiels
Copy link
Contributor Author

rebased to the current tag used by intel/llvm in order to get a fresh intel/llvm CI run

@callumfare callumfare merged commit e3247c2 into oneapi-src:main Nov 15, 2024
82 of 83 checks passed
martygrant pushed a commit to intel/llvm that referenced this pull request Dec 13, 2024
~~Update the UR tag to include
oneapi-src/unified-runtime#2273 fixing
`queue::fill` for the CUDA and HIP backends. It was previously producing
incorrect outputs for any pattern size other than 1, 2, or a multiple of
4 bytes. A new optimisation is also added which speeds up the fill
greatly if the pattern equals to the first word repeated throughout
(e.g. all zeros). See the UR PR for more details.~~

_The UR tag update was collected in
#16040 so now this PR only adds an e2e
test as stated below._

Add a new e2e test to validate `queue::fill` outputs for any pattern
size between 1 and 32 bytes. This test fails for CUDA and HIP before the
UR change and passes with this PR. Other backends already worked
correctly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda CUDA adapter specific issues hip HIP adapter specific issues ready to merge Added to PR's which are ready to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants