Fix incorrect outputs and improve performance of commonMemSetLargePattern #2273

rafbiels · 2024-10-31T23:42:57Z

Change the implementation of commonMemSetLargePattern to use the largest pattern word size supported by the backend into which the pattern can be divided. That is, use 4-byte words if the pattern size is a multiple of 4, otherwise use 2-byte words for even sizes and 1-byte words for odd sizes.

Keep the idea of filling the entire destination region with the first word, and only start strided fill from the second, but implement it correctly. The previous implementation produced incorrect results for any pattern size which wasn't a multiple of 4. For example, a pattern of std::array<uint8_t,5>{{1,2,3,4,5}} produced 1 2 3 4 5 2 3 4 1 5 3 4 1 2 5 ... instead of 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 ....

For HIP, the strided fill remains to be always in 1-byte increments because HIP API doesn't provide strided multi-byte memset functions like CUDA does. For CUDA, both the initial memset and the strided ones use the largest possible word size.

Add a new optimisation skipping the strided fills completely if the pattern is equal to the first word repeated throughout. This is most commonly the case for a pattern of all zeros, but other cases are possible. This optimisation is implemented in both CUDA and HIP adapters.

CUDA correctness and performance before (left) and afer (right) this patch:

We can see the new solution has O(1) complexity for the optimised case of first word repeated throughout the pattern. For a varied pattern, the complexity depends on pattern size S and is O(S) when S is a multiple of 4, otherwise O(2S) when S is even and O(4S) when S is odd.

Similarly for HIP we see the following, where all results are now correct and we get O(1) performance for all-zeros. We also see we don't get the performance boost for multiples of 2B and 4B due to the lack of strided 2B and 4B memset functions in HIP API.

The validation/microbenchmark code is available here:
https://gist.github.com/rafbiels/807cfff4b023d8c53a0663be99ffa2ef
and was adapted into a new e2e test in the intel/llvm PR.

intel/llvm PR: intel/llvm#15991

konradkusiak97

I really like this patch. Thanks for fixing my buggy code, I have plenty to learn from you! LGTM.

rafbiels · 2024-11-07T16:30:06Z

I really like this patch. Thanks for fixing my buggy code, I have plenty to learn from you! LGTM.

Thanks for the review @konradkusiak97! Looking through the history of the code, I don't think that was any of your fault. The implementation of [pi|ur]EnqueueMemBufferFill has assumed 1, 2, and multiple-of-4 byte pattern sizes since the initial commit years ago. It has never worked correctly for other sizes, though it only came up now because we started using it more.

…tern Change the implementation of commonMemSetLargePattern to use the largest pattern word size supported by the backend into which the pattern can be divided. That is, use 4-byte words if the pattern size is a multiple of 4, 2-byte words for even sizes and 1-byte words for odd sizes. Keep the idea of filling the entire destination region with the first word, and only start strided fill from the second, but implement it correctly. The previous implementation produced incorrect results for any pattern size which wasn't a multiple of 4. For HIP, the strided fill remains to be always in 1-byte increments because HIP API doesn't provide strided multi-byte memset functions like CUDA does. For CUDA, both the initial memset and the strided ones use the largest possible word size. Add a new optimisation skipping the strided fills completely if the pattern is equal to the first word repeated throughout. This is most commonly the case for a pattern of all zeros, but other cases are possible. This optimisation is implemented in both CUDA and HIP adapters.

rafbiels · 2024-11-15T12:39:15Z

rebased to the current tag used by intel/llvm in order to get a fresh intel/llvm CI run

~~Update the UR tag to include oneapi-src/unified-runtime#2273 fixing `queue::fill` for the CUDA and HIP backends. It was previously producing incorrect outputs for any pattern size other than 1, 2, or a multiple of 4 bytes. A new optimisation is also added which speeds up the fill greatly if the pattern equals to the first word repeated throughout (e.g. all zeros). See the UR PR for more details.~~ _The UR tag update was collected in #16040 so now this PR only adds an e2e test as stated below._ Add a new e2e test to validate `queue::fill` outputs for any pattern size between 1 and 32 bytes. This test fails for CUDA and HIP before the UR change and passes with this PR. Other backends already worked correctly.

github-actions bot added the cuda CUDA adapter specific issues label Oct 31, 2024

rafbiels force-pushed the rafbiels/improve-memset branch from 5e99670 to cc52821 Compare November 4, 2024 19:10

github-actions bot added the hip HIP adapter specific issues label Nov 4, 2024

rafbiels mentioned this pull request Nov 5, 2024

Add e2e test for queue::fill with a range of pattern sizes intel/llvm#15991

Merged

rafbiels marked this pull request as ready for review November 5, 2024 17:31

rafbiels requested review from a team as code owners November 5, 2024 17:31

rafbiels requested a review from konradkusiak97 November 5, 2024 17:31

konradkusiak97 approved these changes Nov 7, 2024

View reviewed changes

rafbiels added the ready to merge Added to PR's which are ready to merge label Nov 7, 2024

rafbiels force-pushed the rafbiels/improve-memset branch from cc52821 to 6f9d5c5 Compare November 15, 2024 12:26

callumfare merged commit e3247c2 into oneapi-src:main Nov 15, 2024
82 of 83 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix incorrect outputs and improve performance of commonMemSetLargePattern #2273

Fix incorrect outputs and improve performance of commonMemSetLargePattern #2273

rafbiels commented Oct 31, 2024 •

edited

Loading

konradkusiak97 left a comment

rafbiels commented Nov 7, 2024

rafbiels commented Nov 15, 2024

Fix incorrect outputs and improve performance of commonMemSetLargePattern #2273

Fix incorrect outputs and improve performance of commonMemSetLargePattern #2273

Conversation

rafbiels commented Oct 31, 2024 • edited Loading

konradkusiak97 left a comment

Choose a reason for hiding this comment

rafbiels commented Nov 7, 2024

rafbiels commented Nov 15, 2024

rafbiels commented Oct 31, 2024 •

edited

Loading