
[FEA] Performance improvement for low-cardinality grouped range rolling windows with nulls #17714

Open
wence- opened this issue Jan 10, 2025 · 1 comment

wence- commented Jan 10, 2025

Is your feature request related to a problem? Please describe.

In Spark, the orderby column used in a rolling window aggregation may contain nulls. If it does, all of those nulls are treated as part of a single window.

In the ungrouped case, constructing the partition into nulls and non-nulls is easy: just use the null count of the column and whether the nulls were sorted at the beginning or the end.
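A minimal sketch of that calculation, assuming the null count and null ordering are already known (the struct and function names below are just for illustration, not libcudf's API):

```cpp
#include <cstddef>

struct null_bounds {
  std::size_t begin;  // first row of the null segment
  std::size_t end;    // one past the last row of the null segment
};

// `nulls_first` says whether the nulls were sorted before the non-null values.
null_bounds ungrouped_null_bounds(std::size_t num_rows,
                                  std::size_t null_count,
                                  bool nulls_first)
{
  return nulls_first ? null_bounds{0, null_count}
                     : null_bounds{num_rows - null_count, num_rows};
}
```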

In the grouped case, it is somewhat harder. The orderby column is required to be sorted group-wise, with the nulls sorted either before or after the non-null values within each group.

Currently, computing null bounds within each group is done by running thrust::for_each over the groups and finding the thrust::partition_point that separates nulls from non-nulls in each group. This search uses the thrust::seq execution policy and therefore runs with a single thread per group.
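Roughly, that strategy looks like the following sketch (simplified from the real libcudf code; here the nulls are assumed to sort before the non-null values within each group, and a plain bool array stands in for the column's null mask):

```cpp
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/partition.h>

using size_type = int;

// group_offsets has num_groups + 1 entries delimiting each group's rows;
// null_end receives, per group, the first row index that is not null.
void compute_null_bounds_per_group(bool const* is_null,
                                   size_type const* group_offsets,
                                   size_type num_groups,
                                   size_type* null_end)
{
  thrust::for_each(
    thrust::device,
    thrust::make_counting_iterator<size_type>(0),
    thrust::make_counting_iterator<size_type>(num_groups),
    [=] __device__(size_type g) {
      auto const begin = group_offsets[g];
      auto const end   = group_offsets[g + 1];
      // Sequential O(log n) search, but only one thread is active per group.
      null_end[g] = *thrust::partition_point(
        thrust::seq,
        thrust::make_counting_iterator(begin),
        thrust::make_counting_iterator(end),
        [=](size_type i) { return is_null[i]; });
    });
}
```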

This is a reasonable strategy for high-cardinality group keys (where the groups are typically small). However, it is terrible when the group cardinality is low: although partition_point is an O(log N) algorithm, if we have only a few large groups, the DRAM bandwidth usage of the single thread doing each search is very low.

The attached nsys profile shows the time spent computing these null partition points for a table with 2^28 rows and a group key cardinality of 10.

[Image: nsys profile, 2^28 rows, group key cardinality 10]

In contrast, when the group cardinality is 100'000'000, we barely see the partition point search:

[Image: nsys profile, 2^28 rows, group key cardinality 100'000'000]

Describe the solution you'd like

It should be possible to do this much faster. The "simplest" solution would be for Thrust to expose a segmented_partition_point that took care of all the parallelisation for us; unfortunately, that doesn't exist.

However, I think we can take advantage of the structure of the problem. What we really need to know is the number of nulls in each group. Since we require that the orderby column is sorted with respect to the group keys, we can compute the null count in each group using a CUB DeviceSegmentedReduce. Knowing whether the nulls were sorted at the beginning or the end of each group then lets us translate that count into the partition point we need.
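A hedged sketch of what that could look like (the names and buffer handling are mine, not a final implementation; libcudf would use rmm and a stream rather than raw cudaMalloc on the default stream):

```cpp
#include <cub/device/device_segmented_reduce.cuh>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>

#include <cuda_runtime.h>
#include <cstddef>

using size_type = int;

// Maps a row index to 1 if that row's orderby value is null, 0 otherwise.
struct null_flag_to_int {
  bool const* is_null;
  __host__ __device__ int operator()(size_type i) const { return is_null[i] ? 1 : 0; }
};

// Writes, for each group, the row index separating nulls from non-nulls.
void compute_null_bounds_segmented(bool const* is_null,             // one flag per row
                                   size_type const* group_offsets,  // num_groups + 1 entries
                                   size_type num_groups,
                                   bool nulls_first,
                                   size_type* partition_points)     // one entry per group
{
  auto flags = thrust::make_transform_iterator(thrust::make_counting_iterator<size_type>(0),
                                               null_flag_to_int{is_null});

  // Standard two-phase CUB call: query the temporary storage size, then reduce.
  // The per-group null counts land in partition_points and are rewritten in place below.
  void* d_temp           = nullptr;
  std::size_t temp_bytes = 0;
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, flags, partition_points, num_groups,
                                  group_offsets, group_offsets + 1);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, flags, partition_points, num_groups,
                                  group_offsets, group_offsets + 1);
  cudaFree(d_temp);

  // Translate each group's null count into the partition point between nulls and values.
  thrust::transform(thrust::device,
                    thrust::make_counting_iterator<size_type>(0),
                    thrust::make_counting_iterator<size_type>(num_groups),
                    partition_points,
                    [=] __device__(size_type g) {
                      auto const null_count = partition_points[g];
                      return nulls_first ? group_offsets[g] + null_count
                                         : group_offsets[g + 1] - null_count;
                    });
}
```

The appeal of this shape is that the segmented reduce reads every row with coalesced accesses across many blocks, so the cost is bounded by memory bandwidth rather than by a handful of threads each doing a pointer-chasing binary search.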

@wence- wence- added the feature request New feature or request label Jan 10, 2025
@wence- wence- self-assigned this Jan 10, 2025
@wence- wence- added the libcudf Affects libcudf (C++/CUDA) code. label Jan 10, 2025

wence- commented Jan 10, 2025

Sneak peek of the CUB solution: the partition point search basically disappears completely.

[Image: nsys profile with the CUB-based approach]
