[R] Implement anonymous functions in calls to dplyr::across #43207

Open
TPDeramus opened this issue Jul 10, 2024 · 3 comments
Assignees
Labels
Component: R Type: usage Issue is a user question

Comments


TPDeramus commented Jul 10, 2024

Describe the usage question you have. Please include as many useful details as possible.

Hi Arrow devs.

I wanted to ask about something I noticed when using dplyr's column-wise operations on arrow tables.

If I have an arrow table and want to run a basic function such as mean, max, or min inside summarize, it appears that arrow does not currently accept the na.rm = TRUE argument, or, if it does, I can't find it in the documentation.

Say I start with this dataset:

Participant Rating
Donna 17
Donna NA
Greg 21
Greg NA

If this were a regular R data frame, either of these two calls would work (though the second, which passes na.rm through the `...` argument of across(), is deprecated in dplyr):

library(dplyr)

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()

Producing:

Participant Rating
Donna 17
Greg 21

However, when I run the same commands as an arrow table, both throw errors:

library(arrow)  # provides as_arrow_table()

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()

Error in `across_setup()`:
! Anonymous functions are not yet supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()

Error in `expand_across()`:
! `...` argument to `across()` is deprecated in dplyr and not supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.

And the one that does work:

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max)) |>
  as.data.frame()

Returns NA values that are not what I want:

Participant Rating
Donna NA
Greg NA

Is there a way to pass the na.rm = TRUE argument to this call without having to manually drop the NA values for each column or row of interest I have in my data?

Component(s)

R

@TPDeramus TPDeramus added the Type: usage Issue is a user question label Jul 10, 2024
@TPDeramus
Author

I'm sure this could be done with arrow_max or call_function, but how is not readily apparent to me either, and my attempts keep throwing "function does not exist" errors (probably because the call is nested in group_by, summarize, and across).

@TPDeramus
Author

Hi Arrow Devs.

Some individuals in the Posit forums found a solution and it prompted some discussion we thought might be worth sending your way:
https://forum.posit.co/t/arrow-with-tidyverse-calling-min-max-mean-with-summarize-on-arrow-tables/188985

"dplyr::across() also supports a purrr-style lambda definition, which strangely seems to work in arrow where the other methods failed."

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), ~max(.x, na.rm = TRUE))) |>
  as.data.frame()
##   Participant Rating
## 1        Greg     21
## 2       Donna     17

"I'm not sure at what points the operations become outsourced to arrow methods, but I don't know whether the ~min(.x, ...) lambda notation somehow tricks dplyr into not outsourcing this operation to arrow.

With dbplyr, everything is converted to SQL queries instead and you can view the SQL query to check it. Is there an equivalent arrow command that lets you see what commands are sent to arrow?"

Would any of you be willing to explain how this works on the backend?

Happy to pass it on.
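
On the question of inspecting what gets sent to arrow: the following is a minimal sketch, assuming a reasonably recent arrow release in which summarize() on an arrow table stays lazy and show_query() has a method for the resulting query object. The names ratings and query are illustrative only and do not come from the thread.

```r
library(arrow)
library(dplyr)

# Same toy data as in the examples above
ratings <- data.frame(
  Participant = c("Greg", "Greg", "Donna", "Donna"),
  Rating = c(21, NA, 17, NA)
)

# Build the pipeline but do not collect it yet
query <- ratings |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), ~ max(.x, na.rm = TRUE)))

# The uncollected pipeline is a lazy query object, not a data frame
class(query)

# Print the execution plan that arrow would run (assumed to be available
# in the installed arrow version)
show_query(query)

# Only now is the query evaluated and pulled into R
result <- collect(query)
```

If show_query() is not available in your version, printing query itself should at least show the pending expressions before anything is computed.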

Member

thisisnic commented Jul 13, 2024

Thanks for reporting this @TPDeramus!

In short, in the backend, arrow code converts the dplyr code into Arrow Expressions. In the case of the across() implementation, from what I recall, we just work out the individual calls and then our mutate() implementation later converts that into Arrow Expressions.
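
To make the expansion described above concrete, here is a rough sketch. It assumes that across() with the purrr-style lambda ends up as one plain call per matched column, so it should give the same result as writing that call by hand. The variable names (ratings, via_across, via_explicit) are illustrative, and arrange() is added only because the row order of grouped results from arrow is not guaranteed.

```r
library(arrow)
library(dplyr)

ratings <- data.frame(
  Participant = c("Greg", "Greg", "Donna", "Donna"),
  Rating = c(21, NA, 17, NA)
)

# Workaround from the thread: purrr-style lambda inside across()
via_across <- ratings |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), ~ max(.x, na.rm = TRUE))) |>
  arrange(Participant) |>
  collect()

# The explicit call that the lambda is assumed to expand to
via_explicit <- ratings |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(Rating = max(Rating, na.rm = TRUE)) |>
  arrange(Participant) |>
  collect()

# Both routes should agree if the assumption holds
all.equal(via_across$Rating, via_explicit$Rating)
```

For a single column, the explicit form is also the simplest way to pass na.rm = TRUE; across() mainly pays off when many columns match.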

Great that you've got a workaround here. I'll take a look at implementing anonymous functions at some point in the future, as it'll be useful to have, and we now have better support for that kind of thing in arrow than we used to.

thisisnic self-assigned this Jul 13, 2024
thisisnic changed the title from "[R] dplyr summarize commands do not accept na.rm arguments" to "[R] Implement anonymous functions in calls to dplyr::across" Jul 13, 2024

2 participants