Fix incorrect outputs and improve performance of commonMemSetLargePattern #2273
Force-pushed 5e99670 to cc52821.
I really like this patch. Thanks for fixing my buggy code; I have plenty to learn from you! LGTM.
Thanks for the review @konradkusiak97! Looking through the history of the code, I don't think that was any of your fault. The implementation of …
Force-pushed cc52821 to 6f9d5c5.
Rebased to the current tag used by intel/llvm in order to get a fresh intel/llvm CI run.
~~Update the UR tag to include oneapi-src/unified-runtime#2273 fixing `queue::fill` for the CUDA and HIP backends. It was previously producing incorrect outputs for any pattern size other than 1, 2, or a multiple of 4 bytes. A new optimisation is also added which greatly speeds up the fill if the pattern equals the first word repeated throughout (e.g. all zeros). See the UR PR for more details.~~ _The UR tag update was collected in #16040, so this PR now only adds an e2e test as stated below._ Add a new e2e test to validate `queue::fill` outputs for every pattern size between 1 and 32 bytes. This test fails for CUDA and HIP before the UR change and passes with this PR. Other backends already worked correctly.
Change the implementation of `commonMemSetLargePattern` to use the largest pattern word size supported by the backend into which the pattern can be divided: 4-byte words if the pattern size is a multiple of 4, otherwise 2-byte words for even sizes and 1-byte words for odd sizes.

Keep the idea of filling the entire destination region with the first word and only starting the strided fills from the second word, but implement it correctly. The previous implementation produced incorrect results for any pattern size which wasn't a multiple of 4. For example, a pattern of `std::array<uint8_t,5>{{1,2,3,4,5}}` produced `1 2 3 4 5 2 3 4 1 5 3 4 1 2 5 ...` instead of `1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 ...`.

For HIP, the strided fill always remains in 1-byte increments because the HIP API doesn't provide strided multi-byte memset functions like CUDA does. For CUDA, both the initial memset and the strided ones use the largest possible word size.
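The word-size selection and two-step fill described above can be sketched host-side. This is a minimal sketch, not the adapter code: the helper names are hypothetical, and plain CPU loops stand in for the driver calls (`cuMemsetD8/D16/D32` and their strided `cuMemsetD2D*` variants on CUDA).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using std::size_t;

// Pick the widest word (4, 2, or 1 bytes) that the pattern divides into.
static size_t chooseWordSize(size_t patternSize) {
  if (patternSize % 4 == 0) return 4;
  if (patternSize % 2 == 0) return 2;
  return 1;
}

// Fill `count` bytes at `dst` with `pattern` repeated:
// 1) blanket-fill the whole region with the first word,
// 2) for each remaining word of the pattern, overwrite its slots with a
//    strided fill (stride == patternSize).
void memsetLargePatternHost(uint8_t *dst, size_t count,
                            const uint8_t *pattern, size_t patternSize) {
  const size_t word = chooseWordSize(patternSize);
  // Step 1: every word-sized slot gets the first word of the pattern.
  for (size_t i = 0; i + word <= count; i += word)
    std::memcpy(dst + i, pattern, word);
  // Step 2: strided fills for words 2..N of the pattern.
  for (size_t off = word; off < patternSize; off += word)
    for (size_t i = off; i + word <= count; i += patternSize)
      std::memcpy(dst + i, pattern + off, word);
}
```

Step 2 deliberately overwrites part of what step 1 wrote; doing the blanket fill first keeps the number of strided passes at `patternSize / word - 1` instead of `patternSize / word`.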
Add a new optimisation that skips the strided fills entirely when the pattern equals the first word repeated throughout. This is most commonly the case for a pattern of all zeros, but other cases are possible. The optimisation is implemented in both the CUDA and HIP adapters.
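The check behind this optimisation can be sketched as follows (a host-side sketch with a hypothetical name, not the adapter code): if every word of the pattern equals the first word, the initial blanket fill alone already produces the final contents, so all strided fills can be skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using std::size_t;

// Return true if the pattern is its first `wordSize`-byte word repeated
// throughout, i.e. the strided fills after the blanket fill are redundant.
bool patternIsFirstWordRepeated(const uint8_t *pattern, size_t patternSize,
                                size_t wordSize) {
  for (size_t off = wordSize; off < patternSize; off += wordSize)
    if (std::memcmp(pattern, pattern + off, wordSize) != 0)
      return false; // a later word differs, so strided fills are needed
  return true;
}
```

The all-zeros pattern trivially passes this check, which is what makes the common zero-fill case O(1) in the number of pattern words.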
CUDA correctness and performance before (left) and after (right) this patch:
We can see the new solution has O(1) cost for the optimised case of the first word repeated throughout the pattern. For a varied pattern, the cost depends on the pattern size S: it is proportional to S when S is a multiple of 4, to 2S when S is merely even, and to 4S when S is odd.
Similarly for HIP we see the following, where all results are now correct and we get O(1) performance for the all-zeros case. We don't get the performance boost for multiples of 2 B and 4 B, though, due to the lack of strided 2 B and 4 B memset functions in the HIP API.
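These cost ratios follow from the number of memset operations the scheme issues; a small sketch (hypothetical helper name) of the count for the CUDA case, where strided fills use the full word size:

```cpp
#include <cassert>
#include <cstddef>

using std::size_t;

// Memset operations issued for a varied S-byte pattern: one blanket fill
// with the first word plus one strided fill per remaining word, i.e.
// S / wordSize calls in total. This gives the 1 : 2 : 4 cost ratio between
// multiple-of-4, even, and odd pattern sizes seen in the CUDA plot.
size_t numMemsetCalls(size_t patternSize) {
  const size_t wordSize =
      (patternSize % 4 == 0) ? 4 : (patternSize % 2 == 0) ? 2 : 1;
  return patternSize / wordSize;
}
```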
The validation/microbenchmark code is available here:
https://gist.github.com/rafbiels/807cfff4b023d8c53a0663be99ffa2ef
and was adapted into a new e2e test in the intel/llvm PR.
intel/llvm PR: intel/llvm#15991