Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use mutate(.keep = "none") in transmute() #483

Merged
merged 12 commits into from
Jan 24, 2025
Merged

Conversation

markfairbanks
Copy link
Collaborator

@markfairbanks markfairbanks commented Dec 10, 2024

Closes #470

Note: I removed most of the dedicated transmute() tests as they're now covered by mutate() tests

@markfairbanks markfairbanks requested a review from eutwt December 10, 2024 14:42
@eutwt
Copy link
Collaborator

eutwt commented Dec 25, 2024

The code change looks good. There's a tradeoff to the change, because if you're doing an immediate transmute after converting to a lazy frame, the data table is copied but currently in main it would have only used .()/list() in j. This makes the keep = "none" translation take more memory (example below). I'm not sure the relative frequency of using transmute ~immediately vs transmute later in a pipeline where your data would have already been copied.

library(bench)
library(data.table)
library(magrittr)
n <- 1e6

dt <- replicate(10, sample(n), simplify = FALSE) %>% 
  setNames(., tail(letters, length(.))) %>% 
  as.data.table()

mark(
  add_then_remove = copy(dt)[, `:=`(double_x = x * 2)][, `:=`(names(dt), NULL)],
  list_in_j = dt[, .(double_x = x * 2)],
)
#> # A tibble: 2 × 6
#>   expression           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 add_then_remove    6.8ms   8.96ms      91.7    47.6MB     504.
#> 2 list_in_j         2.19ms   2.86ms     315.     15.3MB     137.

Created on 2024-12-25 with reprex v2.0.2

One thought I had is that transmute could be a reframe() which uses vec_recycle_common(..., .size = .N) instead of list(...) in j, but that is a bad option because it would prevent GForce optimizations on the ... expressions

@markfairbanks
Copy link
Collaborator Author

I think we need a re-order to match dplyr output for tibble(x = 1, y = 2) %>% transmute(y, x)

Ah yep I'll make the reorder change.

One thought I had is that transmute could be a reframe() which uses vec_recycle_common(..., .size = .N) instead of list(...) in j, but that is a bad option because it would prevent GForce optimizations on the ... expressions

We could also do something like this to preserve GForce but I think it makes the translation look odd.

library(dtplyr)
library(dplyr)

df <- tibble(x = 1:2, y = 3:4) %>%
  lazy_dt()

df %>%
  reframe(y, mean_x = mean(x), .row_preserve = row_number()) %>%
  select(-.row_preserve)
#> Source: local data table [2 x 2]
#> Call:   `_DT1`[, .(y = y, mean_x = mean(x), .row_preserve = seq_len(.N))][, 
#>     `:=`(".row_preserve", NULL)]
#> 
#>       y mean_x
#>   <int>  <dbl>
#> 1     3    1.5
#> 2     4    1.5
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

@markfairbanks markfairbanks requested a review from eutwt December 26, 2024 18:43
@markfairbanks
Copy link
Collaborator Author

Actually sorry I completely did the column order thing wrong 😅

Let me take another pass at it.

@markfairbanks
Copy link
Collaborator Author

Ok good to review now

@markfairbanks markfairbanks requested a review from eutwt January 21, 2025 01:52
Copy link
Collaborator

@eutwt eutwt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks! This will be good to have. Apologies for the delayed review

@markfairbanks markfairbanks merged commit 75310e3 into main Jan 24, 2025
13 checks passed
@markfairbanks markfairbanks deleted the simplify-transmute branch January 24, 2025 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert transmute() to use mutate(.keep = "none")
2 participants