Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement appearance order for vec_locate_sorted_groups() #1747

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Nov 16, 2022

If we go the route of using order-by-appearance in summarise(.by = ), then we have two options for group computation algorithms:

  • Hash based (equality) approach with vec_group_loc()
  • Sort based (ordering) approach with vec_locate_sorted_groups()

vec_locate_sorted_groups() has a number of advantages that make it a contender here. Past exploration has shown that it scales better with large numbers of groups, and the integer counting sort when the range of the integer values is less than 100,000 makes it extremely fast on most integer data. It also has built in optimizations for when the data is already ordered, which I am hopeful will be a common case if people arrange() before they mutate() or summarise().

vec_locate_sorted_groups() does have the one downside that it requires vec_proxy_order() rather than just vec_proxy_equal(), but I think that is okay for dplyr because group_by() was already requiring this by doing vec_order() internally.

However, vec_locate_sorted_groups() obviously isn't a drop in replacement for vec_group_loc() because it sorts the group keys by value rather than by first appearance.

This PR adds a logical appearance argument to vec_locate_sorted_groups(), which makes it return groups by first appearance. This makes it essentially equivalent to vec_group_loc(), but they use very different algorithms.

library(vctrs)

x <- c(2, 5, 1, 1, 2, 2)

# Sorted order
vec_locate_sorted_groups(x)
#>   key     loc
#> 1   1    3, 4
#> 2   2 1, 5, 6
#> 3   5       2

# Appearance order (hash-based)
vec_group_loc(x)
#>   key     loc
#> 1   2 1, 5, 6
#> 2   5       2
#> 3   1    3, 4

# Appearance order (sort-based)
vec_locate_sorted_groups(x, appearance = TRUE)
#>   key     loc
#> 1   2 1, 5, 6
#> 2   5       2
#> 3   1    3, 4

Created on 2022-11-16 with reprex v2.0.2.9000

appearance is a little weird as an argument, because it means that direction, na_value, and chr_proxy_collate are essentially meaningless when appearance = TRUE. But I still think it fits here, and is more flexible for us VS transitioning vec_group_loc() over to using the sort-based approach, because there are still pros to using the hash-based approach it currently uses in some cases.

Comment on lines -248 to +269
chr_ordered = TRUE) {
appearance = FALSE) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you'll see in the C code, appearance = TRUE has become the "official" way of saying that we don't need to actually order character vectors, we just need to group them, allowing us to use the faster chr_appearance() C function vs using chr_order(). It has replaced the chr_ordered argument that was being used for testing.

This is the first time we would be exposing chr_appearance() to user code. I added and tested it a while back with the eventual goal of exposing it as an optimization somehow, and I like how it happens here through appearance.

true
false
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these true->false flips are me going from chr_ordered to appearance in the function signatures, which are inverses of one another

r_ssize n_groups = r_length(sizes);

SEXP loc = KEEP(r_alloc_list(n_groups));

SEXP key_loc = KEEP(r_alloc_integer(n_groups));
int* p_key_loc = r_int_begin(key_loc);

int start = 0;
if (appearance) {
SEXP o_appearance = r_list_get(info, 3);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea in this branch is to build key_loc, which is used to slice x to generate the group keys, and loc, the list_of<int> containing the group locations, in first appearance ordering rather than in sorted-by-value order (which is what the !appearance branch does, and was the original implementation).

Comment on lines +342 to 352
* Returns a list of length three or four:
* - The first element of the list contains the ordering as an integer vector.
* - The second element of the list contains the group sizes as an integer
* vector.
* - The third element of the list contains the max group size as an integer.
* - The optional fourth element of the list contains an additional ordering
* integer vector that re-orders the sorted unique values of `x` to generate
* an appearance ordering. It is only present if `appearance` is `true`.
*/
// [[ include("order.h") ]]
SEXP vec_order_info(SEXP x,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vec_order_info() is an intermediate helper function that would never be exported, so I don't mind that it has a little bit of size instability here.

Comment on lines +534 to +546
// Order of the unique keys
SEXP keys = PROTECT_N(r_alloc_integer(n_groups), &n_prot);
int* p_keys = r_int_begin(keys);

r_ssize start = 0;

for (r_ssize i = 0; i < n_groups; ++i) {
p_keys[i] = p_order->p_data[start];
start += p_sizes[i];
}

// Appearance order of the unique keys
struct order* p_order_appearance = new_order(n_groups);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to find the sorted group keys of x and extract their order values into key. And then we take that integer vector of group key order values and sort it again, which gives us a way to map back to order by appearance.

One nice thing is that we are reusing all of the memory that was already allocated for the first ordering. Because this is ordering a dense integer vector and is reusing memory, it has extremely low overhead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant