[R] Implement anonymous functions in calls to dplyr::across #43207

TPDeramus · 2024-07-10T13:46:32Z

Describe the usage question you have. Please include as many useful details as possible.

Hi Arrow devs.

I wanted to ask about something I noticed about using the column-wise operators with dplyr in arrow tables.

If I had an arrow table, and I wanted to run a basic function such as mean, max, or min using summarize, it appears that arrow does not currently accept the na.rm = TRUE argument, or that if it does, I can't seem to find it in the documentation.

Say I took the original dataset:

Participant	Rating
Donna	17
Donna	NA
Greg	21
Greg	NA

If these were generic R dataframes, either of these two calls would work (though one is deprecated):

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()

Producing:

Participant	Rating
Donna	17
Greg	21

However, when I run the same commands as an arrow table, both throw errors:

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()

Error in `across_setup()`:
! Anonymous functions are not yet supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()

Error in `expand_across()`:
! `...` argument to `across()` is deprecated in dplyr and not supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.

And the one that does work:

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max)) |>
  as.data.frame()

Returns NA values that are not what I want:

Participant	Rating
Donna	NA
Greg	NA

Is there a way to pass the na.rm = TRUE argument to this call without having to manually drop the NA values for each column or row of interest I have in my data?

Component(s)

R

The text was updated successfully, but these errors were encountered:

TPDeramus · 2024-07-10T20:39:12Z

I'm sure there's a way it could be done with arrow_max or call_function, but that is not readily apparent to me either and also keeps throwing function does not exist errors (probably due to it being nested in group_by, summarize and across).

TPDeramus · 2024-07-11T20:22:07Z

Hi Arrow Devs.

Some individuals in the Posit forums found a solution and it prompted some discussion we thought might be worth sending your way:
https://forum.posit.co/t/arrow-with-tidyverse-calling-min-max-mean-with-summarize-on-arrow-tables/188985

"dplyr::across() also supports a purrr-style lambda definition, which strangely seems to work in arrow where the other methods failed."

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), ~max(.x, na.rm = TRUE))) |>
  as.data.frame()
##   Participant Rating
## 1        Greg     21
## 2       Donna     17

"_I'm not sure at what points the operations become outsourced to arrow methods, but I don't know whether the ~min(.x, ...) lambda notation somehow tricks dplyr into not outsourcing this operation to arrow.

With dbplyr, everything is converted to SQL queries instead and you can view the SQL query to check it. Is there an equivalent arrow command that lets you see what commands are sent to arrow?_"

Would any of you be willing to explain how this works on the backend?

Happy to pass it on.

thisisnic · 2024-07-13T12:25:35Z

Thanks for reporting this @TPDeramus!

In short, in the backend, arrow code converts the dplyr code into Arrow Expressions. In the case of the across() implementation, from what I recall, we just work out the individual calls and then our mutate() implementation later converts that into Arrow Expressions.

Great you've got a workaround here, I'll take a look at implementing anonymous functions at some point in future, as it'll be useful to have and now we have better support for that kind of thing in arrow than we used to.

TPDeramus added the Type: usage Issue is a user question label Jul 10, 2024

github-actions bot added the Component: R label Jul 10, 2024

thisisnic self-assigned this Jul 13, 2024

thisisnic changed the title ~~[R] dplyr summarize commands do not accept na.rm arguments~~ [R] Implement anonymous functions in calls to dplyr::across Jul 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R] Implement anonymous functions in calls to dplyr::across #43207

[R] Implement anonymous functions in calls to dplyr::across #43207

TPDeramus commented Jul 10, 2024 •

edited

Loading

TPDeramus commented Jul 10, 2024

TPDeramus commented Jul 11, 2024

thisisnic commented Jul 13, 2024 •

edited

Loading

[R] Implement anonymous functions in calls to dplyr::across #43207

[R] Implement anonymous functions in calls to dplyr::across #43207

Comments

TPDeramus commented Jul 10, 2024 • edited Loading

Describe the usage question you have. Please include as many useful details as possible.

Component(s)

TPDeramus commented Jul 10, 2024

TPDeramus commented Jul 11, 2024

thisisnic commented Jul 13, 2024 • edited Loading

TPDeramus commented Jul 10, 2024 •

edited

Loading

thisisnic commented Jul 13, 2024 •

edited

Loading