
[FEA] Performance improvement for low-cardinality grouped range rolling windows with nulls #17714

Open
wence- opened this issue Jan 10, 2025 · 1 comment

wence- commented Jan 10, 2025

Is your feature request related to a problem? Please describe.

In Spark, the orderby column used in a rolling window aggregation may contain nulls. If it does, all of those nulls are treated as part of a single window.

In the ungrouped case, constructing the partition into nulls and non-nulls is easy: just use the null count of the column and whether the nulls were sorted at the beginning or the end.
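A minimal sketch of that calculation, assuming the null count and null ordering are already known (the struct and function names below are just for illustration, not libcudf's API):

```cpp
#include <cstddef>

struct null_bounds {
  std::size_t begin;  // first row of the null segment
  std::size_t end;    // one past the last row of the null segment
};

// `nulls_first` says whether the nulls were sorted before the non-null values.
null_bounds ungrouped_null_bounds(std::size_t num_rows,
                                  std::size_t null_count,
                                  bool nulls_first)
{
  return nulls_first ? null_bounds{0, null_count}
                     : null_bounds{num_rows - null_count, num_rows};
}
```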

In the grouped case, it is somewhat harder. The orderby column is required to be sorted group-wise, with the nulls sorted either before or after the non-null values within each group.

Currently, computing null bounds within each group is done by running thrust::for_each over the groups and finding the thrust::partition_point that separates nulls from non-nulls in each group. This search uses the thrust::seq execution policy and therefore runs with a single thread per group.
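Roughly, that strategy looks like the following sketch (simplified from the real libcudf code; here the nulls are assumed to sort before the non-null values within each group, and a plain bool array stands in for the column's null mask):

```cpp
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/partition.h>

using size_type = int;

// group_offsets has num_groups + 1 entries delimiting each group's rows;
// null_end receives, per group, the first row index that is not null.
void compute_null_bounds_per_group(bool const* is_null,
                                   size_type const* group_offsets,
                                   size_type num_groups,
                                   size_type* null_end)
{
  thrust::for_each(
    thrust::device,
    thrust::make_counting_iterator<size_type>(0),
    thrust::make_counting_iterator<size_type>(num_groups),
    [=] __device__(size_type g) {
      auto const begin = group_offsets[g];
      auto const end   = group_offsets[g + 1];
      // Sequential O(log n) search, but only one thread is active per group.
      null_end[g] = *thrust::partition_point(
        thrust::seq,
        thrust::make_counting_iterator(begin),
        thrust::make_counting_iterator(end),
        [=](size_type i) { return is_null[i]; });
    });
}
```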

This is a reasonable strategy for high-cardinality group keys (where the groups are typically small). However, it is terrible when the group cardinality is low: although partition_point is an O(log N) algorithm, if we have only a few large groups, the DRAM bandwidth usage of the single thread doing each search is very low.

The attached nsys profile shows the time spent computing these null partition points for a table with 2^28 rows and a group key cardinality of 10.

[Image: nsys profile, 2^28 rows, group key cardinality 10]

In contrast, when the group cardinality is 100'000'000, we barely see the partition point search:

[Image: nsys profile, 2^28 rows, group key cardinality 100'000'000]

Describe the solution you'd like

It should be possible to do this much faster. The "simplest" solution would be for Thrust to expose a segmented_partition_point that took care of all the parallelisation for us; unfortunately, that doesn't exist.

However, I think we can take advantage of the structure of the problem. What we really need to know is the number of nulls in each group. Since we require that the orderby column is sorted with respect to the group keys, we can compute the null count in each group using a CUB DeviceSegmentedReduce. Knowing whether the nulls were sorted at the beginning or the end of each group then lets us translate that count into the partition point we need.
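A hedged sketch of what that could look like (the names and buffer handling are mine, not a final implementation; libcudf would use rmm and a stream rather than raw cudaMalloc on the default stream):

```cpp
#include <cub/device/device_segmented_reduce.cuh>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>

#include <cuda_runtime.h>
#include <cstddef>

using size_type = int;

// Maps a row index to 1 if that row's orderby value is null, 0 otherwise.
struct null_flag_to_int {
  bool const* is_null;
  __host__ __device__ int operator()(size_type i) const { return is_null[i] ? 1 : 0; }
};

// Writes, for each group, the row index separating nulls from non-nulls.
void compute_null_bounds_segmented(bool const* is_null,             // one flag per row
                                   size_type const* group_offsets,  // num_groups + 1 entries
                                   size_type num_groups,
                                   bool nulls_first,
                                   size_type* partition_points)     // one entry per group
{
  auto flags = thrust::make_transform_iterator(thrust::make_counting_iterator<size_type>(0),
                                               null_flag_to_int{is_null});

  // Standard two-phase CUB call: query the temporary storage size, then reduce.
  // The per-group null counts land in partition_points and are rewritten in place below.
  void* d_temp           = nullptr;
  std::size_t temp_bytes = 0;
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, flags, partition_points, num_groups,
                                  group_offsets, group_offsets + 1);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, flags, partition_points, num_groups,
                                  group_offsets, group_offsets + 1);
  cudaFree(d_temp);

  // Translate each group's null count into the partition point between nulls and values.
  thrust::transform(thrust::device,
                    thrust::make_counting_iterator<size_type>(0),
                    thrust::make_counting_iterator<size_type>(num_groups),
                    partition_points,
                    [=] __device__(size_type g) {
                      auto const null_count = partition_points[g];
                      return nulls_first ? group_offsets[g] + null_count
                                         : group_offsets[g + 1] - null_count;
                    });
}
```

The appeal of this shape is that the segmented reduce reads every row with coalesced accesses across many blocks, so the cost is bounded by memory bandwidth rather than by a handful of threads each doing a pointer-chasing binary search.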

@wence- wence- added the feature request New feature or request label Jan 10, 2025
@wence- wence- self-assigned this Jan 10, 2025
@wence- wence- added the libcudf Affects libcudf (C++/CUDA) code. label Jan 10, 2025

wence- commented Jan 10, 2025

Sneak peek of the CUB solution: the partition point search basically disappears completely.

[Image: nsys profile with the CUB-based approach]
