Implement `appearance` order for `vec_locate_sorted_groups()` #1747

DavisVaughan · 2022-11-16T18:33:55Z

If we go the route of using order-by-appearance in summarise(.by = ), then we have two options for group computation algorithms:

Hash based (equality) approach with vec_group_loc()
Sort based (ordering) approach with vec_locate_sorted_groups()

vec_locate_sorted_groups() has a number of advantages that make it a contender here. Past exploration has shown that it scales better with large numbers of groups, and the integer counting sort when the range of the integer values is less than 100,000 makes it extremely fast on most integer data. It also has built in optimizations for when the data is already ordered, which I am hopeful will be a common case if people arrange() before they mutate() or summarise().

vec_locate_sorted_groups() does have the one downside that it requires vec_proxy_order() rather than just vec_proxy_equal(), but I think that is okay for dplyr because group_by() was already requiring this by doing vec_order() internally.

However, vec_locate_sorted_groups() obviously isn't a drop in replacement for vec_group_loc() because it sorts the group keys by value rather than by first appearance.

This PR adds a logical appearance argument to vec_locate_sorted_groups(), which makes it return groups by first appearance. This makes it essentially equivalent to vec_group_loc(), but they use very different algorithms.

library(vctrs)

x <- c(2, 5, 1, 1, 2, 2)

# Sorted order
vec_locate_sorted_groups(x)
#>   key     loc
#> 1   1    3, 4
#> 2   2 1, 5, 6
#> 3   5       2

# Appearance order (hash-based)
vec_group_loc(x)
#>   key     loc
#> 1   2 1, 5, 6
#> 2   5       2
#> 3   1    3, 4

# Appearance order (sort-based)
vec_locate_sorted_groups(x, appearance = TRUE)
#>   key     loc
#> 1   2 1, 5, 6
#> 2   5       2
#> 3   1    3, 4

^{Created on 2022-11-16 with reprex v2.0.2.9000}

appearance is a little weird as an argument, because it means that direction, na_value, and chr_proxy_collate are essentially meaningless when appearance = TRUE. But I still think it fits here, and is more flexible for us VS transitioning vec_group_loc() over to using the sort-based approach, because there are still pros to using the hash-based approach it currently uses in some cases.

DavisVaughan · 2022-11-16T18:50:21Z

R/order.R

-                           chr_ordered = TRUE) {
+                           appearance = FALSE) {


As you'll see in the C code, appearance = TRUE has become the "official" way of saying that we don't need to actually order character vectors, we just need to group them, allowing us to use the faster chr_appearance() C function vs using chr_order(). It has replaced the chr_ordered argument that was being used for testing.

This is the first time we would be exposing chr_appearance() to user code. I added and tested it a while back with the eventual goal of exposing it as an optimization somehow, and I like how it happens here through appearance.

DavisVaughan · 2022-11-16T18:52:44Z

src/match-joint.c

-    true
+    false


All these true->false flips are me going from chr_ordered to appearance in the function signatures, which are inverses of one another

DavisVaughan · 2022-11-16T18:57:03Z

src/order.c

  r_ssize n_groups = r_length(sizes);

  SEXP loc = KEEP(r_alloc_list(n_groups));

  SEXP key_loc = KEEP(r_alloc_integer(n_groups));
  int* p_key_loc = r_int_begin(key_loc);

-  int start = 0;
+  if (appearance) {
+    SEXP o_appearance = r_list_get(info, 3);


The idea in this branch is to build key_loc, which is used to slice x to generate the group keys, and loc, the list_of<int> containing the group locations, in first appearance ordering rather than in sorted-by-value order (which is what the !appearance branch does, and was the original implementation).

DavisVaughan · 2022-11-16T18:58:58Z

src/order.c

+ * Returns a list of length three or four:
 * - The first element of the list contains the ordering as an integer vector.
 * - The second element of the list contains the group sizes as an integer
 *   vector.
 * - The third element of the list contains the max group size as an integer.
+ * - The optional fourth element of the list contains an additional ordering
+ *   integer vector that re-orders the sorted unique values of `x` to generate
+ *   an appearance ordering. It is only present if `appearance` is `true`.
 */
 // [[ include("order.h") ]]
 SEXP vec_order_info(SEXP x,


vec_order_info() is an intermediate helper function that would never be exported, so I don't mind that it has a little bit of size instability here.

DavisVaughan · 2022-11-16T19:04:05Z

src/order.c

+    // Order of the unique keys
+    SEXP keys = PROTECT_N(r_alloc_integer(n_groups), &n_prot);
+    int* p_keys = r_int_begin(keys);
+
+    r_ssize start = 0;
+
+    for (r_ssize i = 0; i < n_groups; ++i) {
+      p_keys[i] = p_order->p_data[start];
+      start += p_sizes[i];
+    }
+
+    // Appearance order of the unique keys
+    struct order* p_order_appearance = new_order(n_groups);


The idea is to find the sorted group keys of x and extract their order values into key. And then we take that integer vector of group key order values and sort it again, which gives us a way to map back to order by appearance.

One nice thing is that we are reusing all of the memory that was already allocated for the first ordering. Because this is ordering a dense integer vector and is reusing memory, it has extremely low overhead.

DavisVaughan added 2 commits November 16, 2022 13:30

Implement appearance order for vec_locate_sorted_groups()

aa0a757

NEWS bullet

85de208

DavisVaughan commented Nov 16, 2022

View reviewed changes

DavisVaughan mentioned this pull request Nov 16, 2022

Implement .by tidyverse/dplyr#6528

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `appearance` order for `vec_locate_sorted_groups()` #1747

Implement `appearance` order for `vec_locate_sorted_groups()` #1747

DavisVaughan commented Nov 16, 2022 •

edited

Loading

DavisVaughan Nov 16, 2022

DavisVaughan Nov 16, 2022

DavisVaughan Nov 16, 2022

DavisVaughan Nov 16, 2022

DavisVaughan Nov 16, 2022

Implement appearance order for vec_locate_sorted_groups() #1747

Are you sure you want to change the base?

Implement appearance order for vec_locate_sorted_groups() #1747

Conversation

DavisVaughan commented Nov 16, 2022 • edited Loading

DavisVaughan Nov 16, 2022

Choose a reason for hiding this comment

DavisVaughan Nov 16, 2022

Choose a reason for hiding this comment

DavisVaughan Nov 16, 2022

Choose a reason for hiding this comment

DavisVaughan Nov 16, 2022

Choose a reason for hiding this comment

DavisVaughan Nov 16, 2022

Choose a reason for hiding this comment

Implement `appearance` order for `vec_locate_sorted_groups()` #1747

Implement `appearance` order for `vec_locate_sorted_groups()` #1747

DavisVaughan commented Nov 16, 2022 •

edited

Loading