Is your feature request related to a problem? Please describe.
In Spark, the orderby column used in a rolling window aggregation may contain nulls. If it does, all of those nulls are treated as part of a single window.
In the ungrouped case, constructing the partition into nulls and non-nulls is easy: just use the null count of the column and whether the nulls were sorted at the beginning or the end.
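As a minimal sketch of that translation (the function and parameter names are illustrative, not libcudf's actual API):

```cpp
// Illustrative only: given the column's null count and null ordering,
// the boundary between nulls and non-nulls is a single arithmetic step.
int null_partition_point(int num_rows, int null_count, bool nulls_first)
{
  // Nulls at the front occupy [0, null_count);
  // nulls at the back occupy [num_rows - null_count, num_rows).
  return nulls_first ? null_count : num_rows - null_count;
}
```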
In the grouped case, it is somewhat harder. The orderby column is required to be sorted group-wise, with the nulls sorted before or after within the group.
Currently, computing null bounds within each group is done by running `thrust::for_each` over the groups and finding the `thrust::partition_point` that separates nulls from non-nulls in each group. This search uses the `thrust::seq` execution policy and therefore runs one thread per group.
This is a reasonable strategy for high-cardinality group keys (where the groups are typically small). However, it is terrible when the group cardinality is low: although `partition_point` is an O(log N) algorithm, with only a few large groups the DRAM bandwidth achieved by a single searching thread is very low.
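A rough sketch of what that per-group search looks like, assuming a plain per-row null-flag array in place of the actual validity bitmask (`is_null`, `offsets`, and `find_null_bounds` are all illustrative names, and nulls are assumed to sort to the front of each group):

```cpp
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/partition_point.h>

// One thread per group, each doing a sequential binary search for its
// group's null boundary. Requires nvcc's --extended-lambda flag.
void find_null_bounds(bool const* is_null,  // per-row null flags, device memory
                      int const* offsets,   // num_groups + 1 group boundaries
                      int num_groups,
                      int* null_bounds)     // output: first non-null index per group
{
  thrust::for_each(
    thrust::device,
    thrust::make_counting_iterator(0),
    thrust::make_counting_iterator(num_groups),
    [=] __device__(int g) {
      auto begin = is_null + offsets[g];
      auto end   = is_null + offsets[g + 1];
      // thrust::seq: the whole O(log N) search runs on a single thread.
      auto it = thrust::partition_point(
        thrust::seq, begin, end, [](bool null) { return null; });
      null_bounds[g] = static_cast<int>(it - is_null);
    });
}
```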
The attached nsys profile shows the time spent computing these null partition points for a table with 2^28 rows and a group key cardinality of 10.
In contrast, when the group cardinality is 100'000'000, the partition point search barely registers in the profile.
Describe the solution you'd like
It should be possible to do this much faster. The "simplest" solution would be for thrust to expose a `segmented_partition_point` that took care of all the parallelisation for us. Unfortunately, that doesn't exist.
However, I think we can take advantage of the structure of the problem. What we really need to know is the number of nulls in each group. Since we require that the orderby column is sorted with respect to the group keys, we can compute the null count in each group using a CUB `DeviceSegmentedReduce`. Knowing whether the nulls were sorted at the beginning or the end of each group then lets us translate each count into the partition point we need.
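A hedged sketch of that idea, again substituting a per-row flag array for the real validity bitmask (all names here are illustrative):

```cpp
#include <cub/device/device_segmented_reduce.cuh>
#include <cuda_runtime.h>

// One segmented reduction counts the nulls in every group, using the
// whole machine rather than one thread per group.
void null_counts_per_group(int const* is_null,  // 1 if row is null, else 0
                           int const* offsets,  // num_groups + 1 group boundaries
                           int num_groups,
                           int* null_counts)    // output: null count per group
{
  // First call sizes CUB's scratch space; second call runs the reduction.
  void* d_temp      = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, is_null, null_counts,
                                  num_groups, offsets, offsets + 1);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, is_null, null_counts,
                                  num_groups, offsets, offsets + 1);
  cudaFree(d_temp);
  // Translating a count into a partition point is then a cheap elementwise
  // step: offsets[g] + null_counts[g] if nulls sort first in each group,
  // offsets[g + 1] - null_counts[g] otherwise.
}
```

Because the reduction is a single device-wide launch, its throughput should no longer collapse when there are only a few large groups, which is exactly the regime where the per-thread binary search struggles.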