Describe the usage question you have. Please include as many useful details as possible.
Hi Arrow devs.
I wanted to ask about something I noticed about using the column-wise operators with dplyr in arrow tables.
If I had an arrow table, and I wanted to run a basic function such as mean, max, or min using summarize, it appears that arrow does not currently accept the na.rm = TRUE argument, or that if it does, I can't seem to find it in the documentation.
Say I took the original dataset:
| Participant | Rating |
|-------------|--------|
| Donna       | 17     |
| Donna       | NA     |
| Greg        | 21     |
| Greg        | NA     |
If these were generic R dataframes, either of these two calls would work (though one is deprecated):
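The two calls themselves did not survive in this copy of the post. A minimal sketch of what they presumably looked like, assuming they mirror the Arrow versions further down minus the `as_arrow_table()` step (the names `ratings`, `by_lambda`, and `by_dots` are illustrative, not from the original):

```r
library(dplyr)

# Same toy data as in the table above.
ratings <- data.frame(
  Participant = c("Greg", "Greg", "Donna", "Donna"),
  Rating = c(21, NA, 17, NA)
)

# Current style: pass na.rm through an anonymous function.
by_lambda <- ratings |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE)))

# Deprecated style: pass na.rm through `...` (recent dplyr versions warn here).
by_dots <- ratings |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE))
```

On a plain data frame both forms return one row per participant with the missing ratings ignored.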
However, when I run the same commands as an arrow table, both throw errors:
library(arrow)
library(dplyr)

# Attempt 1: anonymous function inside across()
data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()

Error in `across_setup()`:
! Anonymous functions are not yet supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.

# Attempt 2: na.rm passed through the `...` argument of across()
data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()

Error in `expand_across()`:
! `...` argument to `across()` is deprecated in dplyr and not supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.
Is there a way to pass the na.rm = TRUE argument to this call without having to manually drop the NA values for each column or row of interest I have in my data?
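For reference, here is a sketch of two forms that might avoid both errors. It assumes a reasonably recent arrow release in which `across()` accepts formula-style (`~`) lambdas inside `summarize()`, and in which the aggregation bindings honor `na.rm`. The names `ratings_tbl`, `via_formula`, and `via_direct` are illustrative only:

```r
library(arrow)
library(dplyr)

ratings_tbl <- data.frame(
  Participant = c("Greg", "Greg", "Donna", "Donna"),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table()

# Option 1: formula-style lambda, which is not treated as an anonymous function.
via_formula <- ratings_tbl |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), ~ max(.x, na.rm = TRUE))) |>
  collect() |>
  arrange(Participant)

# Option 2: skip across() and name the column directly.
via_direct <- ratings_tbl |>
  group_by(Participant) |>
  summarize(Rating = max(Rating, na.rm = TRUE)) |>
  collect() |>
  arrange(Participant)
```

The `arrange()` after `collect()` is only there to make the row order predictable, since grouped aggregation in Arrow does not promise a particular order.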
Component(s)
R
I'm sure there's a way it could be done with arrow_max or call_function, but that is not readily apparent to me either, and my attempts keep throwing "function does not exist" errors (probably because the call is nested inside group_by, summarize, and across).
"_I'm not sure at what points the operations become outsourced to arrow methods, but I don't know whether the ~min(.x, ...) lambda notation somehow tricks dplyr into not outsourcing this operation to arrow.
With dbplyr, everything is converted to SQL queries instead and you can view the SQL query to check it. Is there an equivalent arrow command that lets you see what commands are sent to arrow?_"
Would any of you be willing to explain how this works on the backend?
In short, on the backend, the arrow package converts the dplyr code into Arrow Expressions. In the case of the across() implementation, from what I recall, we just expand it into the individual calls, and our mutate() implementation later converts those into Arrow Expressions.
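As a rough way to see this from the R side, the sketch below builds a lazy query and inspects it before anything is computed. It assumes an arrow version that provides a `show_query()` method for Arrow dplyr queries (to my knowledge this was added around 10.0.0). The names `ratings_tbl` and `query` are illustrative:

```r
library(arrow)
library(dplyr)

ratings_tbl <- data.frame(
  Participant = c("Greg", "Greg", "Donna", "Donna"),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table()

# Nothing is evaluated yet: across() is expanded into one call per column,
# and each call is stored as an Arrow Expression on the query object.
query <- ratings_tbl |>
  mutate(across(matches("Rating"), ~ .x * 2))

# Printing the uncollected query lists each column with its expression.
print(query)

# Roughly the counterpart of dbplyr's SQL view: prints the execution plan.
show_query(query)

# Only collect() actually runs the plan.
doubled <- collect(query)
```

This is the closest equivalent I know of to viewing the generated SQL in dbplyr. The exact text printed varies between arrow versions.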
Great that you've got a workaround here. I'll take a look at implementing anonymous functions at some point in the future, as it'll be useful to have, and we now have better support for that kind of thing in arrow than we used to.
thisisnic changed the title from "[R] dplyr summarize commands do not accept na.rm arguments" to "[R] Implement anonymous functions in calls to dplyr::across" on Jul 13, 2024