`separate_wider_delim` renames original column when `col_remove=FALSE` and `names=` not specified #1499

tszberkowitz · 2023-05-14T19:32:33Z

When splitting a delimited character variable using the newer separate_wider_delim() function from the tidyr package (v 1.3.0), if you:

1. specify the names_sep= argument,
2. do NOT specify the names= argument, and
3. specify cols_remove=FALSE,

then the original variable is retained in the output data set (as expected) but:

1. the original variable is renamed by joining its own name to itself with the value given in names_sep=, so that, e.g., names_sep='_' with cols=varname produces a variable named varname_varname in the output data, and
2. the original variable is placed after the new separated columns, unlike the older separate() function, which places the original column before the new columns.

Note that the first point above (variable renaming) is the major issue. The second point is just something that I was not expecting.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(reprex)

# Create test data set
## 1 character variable (`v`):
##  * semicolon-delimited values,
##  * includes NA,
##  * inconsistent/unpredictable number of delimiters per value
test <- tibble(
  v = c('a;b', 'c', NA, 'd;e;f', 'g;h')
)

# specifying `names` (not `names_sep`)
# `cols_remove` is FALSE => behaves as expected (original column name unchanged)
separate_wider_delim(
  data = test,
  cols = v,
  delim = ';',
  names = c('v_1', 'v_2', 'v_3'),
  too_few = 'align_start',
  cols_remove = FALSE
)
#> # A tibble: 5 × 4
#>   v_1   v_2   v_3   v    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     <NA>  a;b  
#> 2 c     <NA>  <NA>  c    
#> 3 <NA>  <NA>  <NA>  <NA> 
#> 4 d     e     f     d;e;f
#> 5 g     h     <NA>  g;h

# specifying `names_sep` only
# `cols_remove` is TRUE (default) => behaves as expected
separate_wider_delim(
  data = test,
  cols = v,
  delim = ';',
  names_sep = '_',
  too_few = 'align_start',
  cols_remove = TRUE
)
#> # A tibble: 5 × 3
#>   v_1   v_2   v_3  
#>   <chr> <chr> <chr>
#> 1 a     b     <NA> 
#> 2 c     <NA>  <NA> 
#> 3 <NA>  <NA>  <NA> 
#> 4 d     e     f    
#> 5 g     h     <NA>

# specifying `names_sep` only
# `cols_remove` is FALSE => **unexpected renaming of original variable**
separate_wider_delim(
  data = test,
  cols = v,
  delim = ';',
  names_sep = '_',
  too_few = 'align_start',
  cols_remove = FALSE
)
#> # A tibble: 5 × 4
#>   v_1   v_2   v_3   v_v  
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     <NA>  a;b  
#> 2 c     <NA>  <NA>  c    
#> 3 <NA>  <NA>  <NA>  <NA> 
#> 4 d     e     f     d;e;f
#> 5 g     h     <NA>  g;h

## Expected output from previous code chunk:
##  * note original column name unchanged
# # A tibble: 5 × 4
#   v_1   v_2   v_3   v    
#   <chr> <chr> <chr> <chr>
# 1 a     b     <NA>  a;b  
# 2 c     <NA>  <NA>  c    
# 3 <NA>  <NA>  <NA>  <NA> 
# 4 d     e     f     d;e;f 
# 5 g     h     <NA>  g;h   


# old behavior (with `separate()`)
# * original variable located before new `separate()`d columns
separate(
  data = test,
  col = v,
  into = c('v_1', 'v_2', 'v_3'),
  sep = ';',
  remove = FALSE,
  fill = 'right'
)
#> # A tibble: 5 × 4
#>   v     v_1   v_2   v_3  
#>   <chr> <chr> <chr> <chr>
#> 1 a;b   a     b     <NA> 
#> 2 c     c     <NA>  <NA> 
#> 3 <NA>  <NA>  <NA>  <NA> 
#> 4 d;e;f d     e     f    
#> 5 g;h   g     h     <NA>

Created on 2023-05-14 with reprex v2.0.2

Session info

sessionInfo()
#> R version 4.3.0 (2023-04-21 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22621)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] reprex_2.0.2 tidyr_1.3.0  dplyr_1.1.2 
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.2       cli_3.6.1         knitr_1.42        rlang_1.1.1      
#>  [5] xfun_0.39         stringi_1.7.12    purrr_1.0.1       styler_1.9.1     
#>  [9] generics_0.1.3    glue_1.6.2        htmltools_0.5.5   fansi_1.0.4      
#> [13] rmarkdown_2.21    R.cache_0.16.0    tibble_3.2.1      evaluate_0.21    
#> [17] fastmap_1.1.1     yaml_2.3.7        lifecycle_1.0.3   stringr_1.5.0    
#> [21] compiler_4.3.0    fs_1.6.2          pkgconfig_2.0.3   rstudioapi_0.14  
#> [25] R.oo_1.25.0       R.utils_2.12.2    digest_0.6.31     R6_2.5.1         
#> [29] tidyselect_1.2.0  utf8_1.2.3        pillar_1.9.0      magrittr_2.0.3   
#> [33] R.methodsS3_1.8.2 tools_4.3.0       withr_2.5.0

hadley · 2023-11-01T19:14:27Z

Somewhat more minimal reprex:

library(tidyverse)
library(tidyr)
library(reprex)

df <- tibble(x = c('a;b', 'c', NA, 'd;e;f', 'g;h'))

names(separate_wider_delim(
  df, x, delim = ';', too_few = 'align_start',
  names_sep = '_',
  cols_remove = FALSE
))
#> [1] "x_1" "x_2" "x_3" "x_x"

Created on 2023-11-01 with reprex v2.0.2
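Until the renaming is fixed, one possible workaround is to rename the duplicated column back after separating. The sketch below is not from the thread and makes two assumptions: that the renamed column always follows the {col}{names_sep}{col} pattern seen above, and that `rename(any_of(...))` is acceptable so the call is a no-op on tidyr versions where the original name is already preserved. The `relocate()` step is optional and only restores the `separate()` column order.

```r
library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = c("a;b", "c", NA, "d;e;f", "g;h"))

out <- df |>
  separate_wider_delim(
    x,
    delim = ";",
    names_sep = "_",
    too_few = "align_start",
    cols_remove = FALSE
  ) |>
  # Rename `x_x` back to `x` only if `x_x` exists (assumed bug pattern);
  # `any_of()` silently skips the rename when the column is absent.
  rename(any_of(c(x = "x_x"))) |>
  # Optional: put the original column first, as separate() did.
  relocate(x)

names(out)
#> [1] "x"   "x_1" "x_2" "x_3"
```

Copying the column under a different name before separating and keeping `cols_remove = TRUE` would be another way to avoid depending on the renamed column at all.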

ryanzomorrodi · 2024-07-20T17:21:48Z

The difficulty seems to be that unpack() is given the original column packed inside the data frame column of the same name, which causes it to create a {col}_{col} name in the output.
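The mechanism can be reproduced with unpack() alone. In this minimal sketch the packed column is built by hand; that it mirrors the intermediate state inside separate_wider_delim() is an assumption based on the description above, not something taken from the tidyr source.

```r
library(tidyr)

# Hand-built stand-in for the intermediate state (an assumption): column `x`
# is a packed data frame holding the pieces (`1`, `2`) plus the original
# values, which are stored under the same name as the outer column.
packed <- tibble(
  x = tibble(`1` = c("a", "c"), `2` = c("b", NA), x = c("a;b", "c"))
)

# With names_sep, unpack() pastes the outer and inner names together,
# so the inner `x` becomes `x_x`.
unpacked_names <- names(unpack(packed, x, names_sep = "_"))
unpacked_names
#> [1] "x_1" "x_2" "x_x"
```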

One solution is to give unpack() a keep_rep_outer parameter that defaults to TRUE, preserving unpack()'s current behavior, and is set to FALSE when unpack() is called from separate_wider_delim(). The parameter would be passed along to rename_with_names_sep(), which would leave unchanged the name of any packed column that has the same name as the column it is packed in. One problem I can think of is if the column being separated is named `1`.

rename_with_names_sep <- function(x, outer, names_sep, keep_rep_outer) {
  inner <- names(x)
  names <- apply_names_sep(outer, inner, names_sep)
  if (!keep_rep_outer) {
    names[names == paste0(outer, names_sep, outer)] <- outer
  }
  set_names(x, names)
}
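To see what the proposed rule does to the names, here is a base R sketch of the same logic. `rename_sketch()` is a hypothetical helper, and it assumes that apply_names_sep() simply pastes the outer name, the separator, and the inner name together. The second call shows the edge case mentioned above: when the separated column is itself named `1`, the first piece and the original column both end up named `1`.

```r
# Hypothetical stand-in for the proposed renaming rule; operates on names only.
rename_sketch <- function(inner, outer, names_sep, keep_rep_outer = TRUE) {
  # Assumed behavior of apply_names_sep(): outer + separator + inner
  new_names <- paste0(outer, names_sep, inner)
  if (!keep_rep_outer) {
    # Restore the outer name wherever it was pasted onto itself
    new_names[new_names == paste0(outer, names_sep, outer)] <- outer
  }
  new_names
}

usual_case <- rename_sketch(c("1", "2", "x"), outer = "x", names_sep = "_",
                            keep_rep_outer = FALSE)
usual_case
#> [1] "x_1" "x_2" "x"

# Column named `1`: piece 1 ("1_1") is indistinguishable from the original
edge_case <- rename_sketch(c("1", "2", "1"), outer = "1", names_sep = "_",
                           keep_rep_outer = FALSE)
edge_case
#> [1] "1"   "1_2" "1"
```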

An alternative, less pretty solution:
This doesn't involve changing any function signatures. It removes the original columns before unpacking and inserts them back in afterwards.

map_unpack <- function(data, cols, fun, names_sep, names_repair, error_call = caller_env()) {
  cols <- tidyselect::eval_select(
    enquo(cols),
    data = data,
    allow_rename = FALSE,
    allow_empty = FALSE,
    error_call = error_call
  )
  col_names <- names(cols)

  # Keep a copy of the original columns so they can be re-inserted later
  ori_cols <- data[, col_names, drop = FALSE]
  for (col in col_names) {
    data[[col]] <- fun(data[[col]], col)
    # Infer `cols_remove`: the packed result still contains the original
    # column only when the caller asked to keep it
    cols_remove <- !(col %in% colnames(data[[col]]))
    # Drop the original from the packed data frame so unpack() cannot
    # rename it to {col}{names_sep}{col}
    data[[col]][[col]] <- NULL
  }

  unpacked <- unpack(
    data = data,
    cols = all_of(col_names),
    names_sep = names_sep,
    names_repair = names_repair,
    error_call = error_call
  )

  if (!cols_remove) {
    # Re-insert each original column at an offset computed from the
    # widths of the packed columns
    ori_index <- match(col_names, colnames(data))
    new_index <- ori_index + cumsum(lengths(data[, col_names, drop = FALSE]))
    for (i in seq_along(ori_cols)) {
      unpacked <- tibble::add_column(
        unpacked,
        ori_cols[, i, drop = FALSE],
        .before = new_index[i]
      )
    }
  }

  unpacked
}

DavisVaughan · 2024-10-24T15:38:19Z

From #1539

library(tidyr)

separate_wider_delim(
  tibble(a="x y"),
  cols=a,
  delim=" ",
  names_sep="",
  cols_remove=FALSE
)
#> # A tibble: 1 × 3
#>   a1    a2    aa   
#>   <chr> <chr> <chr>
#> 1 x     y     x y

Created on 2024-10-24 with reprex v2.1.1

tszberkowitz changed the title ~~separate_wider_delim renames original column when col_remove=TRUE and names= not specified~~ separate_wider_delim renames original column when col_remove=FALSE and names= not specified May 16, 2023

hadley added the labels "bug" (an unexpected problem or unintended behavior) and "strings 🎻" Nov 1, 2023

sda030 mentioned this issue Jul 15, 2024 in "separate_wider_delim changes input column names when using names_sep with cols_remove=FALSE" #1539 (Closed)
