`pivot_longer` converts variable labels to new value labels #1535

JDenn0514 · 2023-12-19T17:57:03Z

I think it would be really useful for pivot_longer to preserve the variable labels as the value labels after pivoting. Unfortunately this is not possible. To clarify, I don't think value labels should be preserved.

I work with survey data that are usually saved as .sav files. I then use the haven package to import them to R. This gives me both variable labels and value labels. In this case, only variable labels are relevant. Often, the variable labels are the questions used in the survey and are quite long. When a question is "select all that apply", each response option is split into a new variable. In order to analyze the questions, I use pivot_longer to make it into one variable.

The issue I run into is that I would like to maintain the original variable labels as the value labels. Here is an example data frame with variable labels.

library(tidyr)

# create fake data
df <- tibble(
  q1 = haven::labelled(0:1, label = "Question 1"),
  q2_1 = haven::labelled(0:1, label = "Question 2, Response Option 1"),
  q2_2 = haven::labelled(0:1, label = "Question 2, Response Option 2"),
  q2_3 = haven::labelled(0:1, label = "Question 2, Response Option 3"),
  q2_4 = haven::labelled(0:1, label = "Question 2, Response Option 4"),
  q2_5 = haven::labelled(0:1, label = "Question 2, Response Option 5"),
  q2_6 = haven::labelled(0:1, label = "Question 2, Response Option 6"),
  q2_7 = haven::labelled(0:1, label = "Question 2, Response Option 7"),
  q2_8 = haven::labelled(0:1, label = "Question 2, Response Option 8"),
  q3 = haven::labelled(0:1, label = "Question 3")
)

# pivot the data
df_long <- df %>% 
  pivot_longer(
    cols = -c(q1, q3),
    names_to = " var",
    values_to = "resp"
  )

df_long

labelled::look_for(df_long)

We can see in df_long that none of the variable labels made it in as value_labels for "var". We can manually set it using labelled::set_variable_labels like this:

df_long <- df_long %>% 
  set_value_labels(
    var = c(q2_1 = "Question 2, Response Option 1",
                 q2_2 = "Question 2, Response Option 2",
                 q2_3 = "Question 2, Response Option 3",
                 q2_4 = "Question 2, Response Option 4",
                 q2_5 = "Question 2, Response Option 5",
                 q2_6 = "Question 2, Response Option 6",
                 q2_7 = "Question 2, Response Option 7",
                 q2_8 = "Question 2, Response Option 8")
  )

labelled::look_for(df_long)

We can see now that the "var" variable has value labels. These value labels are the same as the variable labels of the variables that were pivoted. This is the ideal output.

Is it possible to change pivot_longer() so that it provides you the option of using the variable labels as the new value_labels in the "var" column?

The current workaround I have found is with the sjlabelled::label_to_colname() function as seen below:

df_long <- df %>% 
  sjlabelled::label_to_colnames(q2_1:q2_8) %>% 
  pivot_longer(
    cols = -c(q1, q3),
    names_to = "var",
    values_to = "resp"
  )

df_long

However, this is really not ideal as it just renames the variables immediately and doesn't add value_labels. Thus making subsetting the "var" variable incredibly cumbersome.

I also think that this should be possible since each variable has only one variable_label and therefore there shouldn't be any conflicts when pivoting, unlike with value_labels.

The text was updated successfully, but these errors were encountered:

JDenn0514 · 2023-12-21T15:03:08Z

I believe that I have found a simple solution. I am not sure how easy it would be to add this into pivot_longer as it exists right now but this was the function I created:

pivot_longer_values <- function(data, cols, names_to, values_to, add_value_labels = TRUE) {
  long <- data %>% 
    tidyr::pivot_longer(
      cols = {{ cols }},
      names_to = names_to,
      values_to = values_to
    )
  
  if (add_value_labels == TRUE) {
    # create a vector containing the variable labels
    var_labs <- labelled::var_label(x = data %>% select( {{ cols }})) %>% 
      unlist()
    
    # flip the names and values of the vector
    var_labs <- setNames(names(var_labs), var_labs)
    
    # add the vector of labels as value labels to the new column of names
    labelled::val_labels(long[{{names_to}}]) <- var_labs
    
    return(long)
  } else {
    return(long)
  }
  
}

This function is basically a wrapper around the pivot_longer function with the same function variable names. I also added in add_value_labels so that people have the option of including the variable labels as value labels in the new columns specified under names_to. I tested this with the original data set I created and on other data I have and it seems to work.

# use original data set and make it longer with the value labels included
tbl_long <- tbl %>% 
  pivot_longer_values(
    cols = -c(q1, q3),
    names_to = "var",
    values_to = "resp",
    add_value_labels = TRUE
  )

# use labelled::look_for() to see if it worked
labelled::look_for(tbl_long)

# make a new column with the labels as the names
tbl_long %>% mutate(var_f = as_factor(var))

# here is a more in-depth example of a problem that this solves.
tbl_long %>% 
  # get the labels in a new column
  mutate(var_f = as_factor(var)) %>% 
  # filter out some of the variables
  filter(!var %in% c("q2_6", "q2_7", "q2_8")) %>% 
  # group it by the labels
  group_by(var_f) %>% 
  # get the frequency 
  count(q1)

Hopefully this is something that can be added into tidyr.

DavisVaughan · 2024-10-24T16:08:34Z

Its not super satisfying but I feel like the best way to retain the information of the label is to promote it to the column name before pivoting. Note that the label on the final resp column isn't particularly meaningful in this case.

library(tidyr)

# create fake data
df <- tibble(
  q1 = haven::labelled(0:1, label = "Question 1"),
  q2_1 = haven::labelled(0:1, label = "Question 2, Response Option 1"),
  q2_2 = haven::labelled(0:1, label = "Question 2, Response Option 2"),
  q2_3 = haven::labelled(0:1, label = "Question 2, Response Option 3"),
  q2_4 = haven::labelled(0:1, label = "Question 2, Response Option 4"),
  q2_5 = haven::labelled(0:1, label = "Question 2, Response Option 5"),
  q2_6 = haven::labelled(0:1, label = "Question 2, Response Option 6"),
  q2_7 = haven::labelled(0:1, label = "Question 2, Response Option 7"),
  q2_8 = haven::labelled(0:1, label = "Question 2, Response Option 8"),
  q3 = haven::labelled(0:1, label = "Question 3")
)

df_long <- df %>%
  dplyr::rename_with(
    .fn = \(name) purrr::map_chr(df[name], labelled::label_attribute)
  ) %>%
  pivot_longer(
    cols = everything(),
    names_to = "var",
    values_to = "resp"
  )

df_long
#> # A tibble: 20 × 2
#>    var                           resp     
#>    <chr>                         <int+lbl>
#>  1 Question 1                    0        
#>  2 Question 2, Response Option 1 0        
#>  3 Question 2, Response Option 2 0        
#>  4 Question 2, Response Option 3 0        
#>  5 Question 2, Response Option 4 0        
#>  6 Question 2, Response Option 5 0        
#>  7 Question 2, Response Option 6 0        
#>  8 Question 2, Response Option 7 0        
#>  9 Question 2, Response Option 8 0        
#> 10 Question 3                    0        
#> 11 Question 1                    1        
#> 12 Question 2, Response Option 1 1        
#> 13 Question 2, Response Option 2 1        
#> 14 Question 2, Response Option 3 1        
#> 15 Question 2, Response Option 4 1        
#> 16 Question 2, Response Option 5 1        
#> 17 Question 2, Response Option 6 1        
#> 18 Question 2, Response Option 7 1        
#> 19 Question 2, Response Option 8 1        
#> 20 Question 3                    1

# all labeled with q1 label
df_long$resp
#> <labelled<integer>[20]>: Question 1
#>  [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

I don't believe there are any tidyr specific tools that we should add to support haven more though

JDenn0514 · 2024-10-25T00:07:13Z

No worries! I appreciate the response. The sjlabelled package already has a function that basically does this called label_to_colname(). The issue with this is that the variable names become quite cumbersome if you have a lot of variables in the data set and only want to pivot over a few because now you have to write out the entire label. For example, using the data above:

df_long <- df %>% 
  sjlabelled::label_to_colnames(q2_1:q2_8) %>% 
  pivot_longer(
    cols = c(`Question 2, Response Option 1`:`Question 2, Response Option 8`),
    names_to = "var",
    values_to = "resp"
  )

It becomes even worse if you don't want to do it over every variable in a row. But no worries if it's not something in the cards.

Can I ask why you don't think you should add any tools to support haven? Do you think it might be something that haven would implement?

DavisVaughan · 2024-10-25T13:26:17Z

You could also use a more complex intermediate data structure - a packed data frame. This lets you keep the label alongside the values during the pivot by prematurely recycling the label of size 1 to the size of the values.

I think this would work quite elegantly for selecting columns

I strip off the attributes because as shown above the label doesn't make sense after the pivot.

library(tidyr)
library(dplyr)

# create fake data
df <- tibble(
  q1 = haven::labelled(0:1, label = "Question 1"),
  q2_1 = haven::labelled(0:1, label = "Question 2, Response Option 1"),
  q2_2 = haven::labelled(0:1, label = "Question 2, Response Option 2"),
  q2_3 = haven::labelled(0:1, label = "Question 2, Response Option 3"),
  q2_4 = haven::labelled(0:1, label = "Question 2, Response Option 4"),
  q2_5 = haven::labelled(0:1, label = "Question 2, Response Option 5"),
  q2_6 = haven::labelled(0:1, label = "Question 2, Response Option 6"),
  q2_7 = haven::labelled(0:1, label = "Question 2, Response Option 7"),
  q2_8 = haven::labelled(0:1, label = "Question 2, Response Option 8"),
  q3 = haven::labelled(0:1, label = "Question 3")
)

side_by_side <- function(x) {
  label <- labelled::label_attribute(x)
  attributes(x) <- NULL
  data.frame(value = x, label = label)
}

df %>%
  mutate(across(q2_1:q2_8, side_by_side)) %>%
  pivot_longer(
    cols = q2_1:q2_8,
    names_to = "var",
    values_to = "resp"
  ) %>%
  unpack(resp)
#> # A tibble: 16 × 5
#>    q1        q3        var   value label                        
#>    <int+lbl> <int+lbl> <chr> <int> <chr>                        
#>  1 0         0         q2_1      0 Question 2, Response Option 1
#>  2 0         0         q2_2      0 Question 2, Response Option 2
#>  3 0         0         q2_3      0 Question 2, Response Option 3
#>  4 0         0         q2_4      0 Question 2, Response Option 4
#>  5 0         0         q2_5      0 Question 2, Response Option 5
#>  6 0         0         q2_6      0 Question 2, Response Option 6
#>  7 0         0         q2_7      0 Question 2, Response Option 7
#>  8 0         0         q2_8      0 Question 2, Response Option 8
#>  9 1         1         q2_1      1 Question 2, Response Option 1
#> 10 1         1         q2_2      1 Question 2, Response Option 2
#> 11 1         1         q2_3      1 Question 2, Response Option 3
#> 12 1         1         q2_4      1 Question 2, Response Option 4
#> 13 1         1         q2_5      1 Question 2, Response Option 5
#> 14 1         1         q2_6      1 Question 2, Response Option 6
#> 15 1         1         q2_7      1 Question 2, Response Option 7
#> 16 1         1         q2_8      1 Question 2, Response Option 8

# here is what it looks like, `q2_1` itself is a data frame
df %>%
  mutate(across(q2_1:q2_8, side_by_side)) %>%
  select(q1, q2_1)
#> # A tibble: 2 × 2
#>   q1        q2_1$value $label                       
#>   <int+lbl>      <int> <chr>                        
#> 1 0                  0 Question 2, Response Option 1
#> 2 1                  1 Question 2, Response Option 1

Can I ask why you don't think you should add any tools to support haven?

In general tidyr's pivoting tools are very generic. So adding tools inside tidyr specifically for a single package goes against that ethos a little, even if we technically own both packages. It suggests there is some other workflow that we are missing.

JDenn0514 · 2024-10-25T13:51:15Z

That reasoning makes a ton of sense to me.

Also, that side_by_side() function is very interesting and to me seems like a much better and more elegant solution than converting the labels to the column names.

I have a personal package where I created a function called pivot_longer_values() that preserves the variable labels as value labels. I’ll then use haven::as_factor() when I want to use the labels instead of variable names. Using side_by_side() seems similarly simple.

Thanks!

JDenn0514 changed the title ~~Pivot_longer converts variable labels to new value labels~~ pivot_longer converts variable labels to new value labels May 25, 2024

DavisVaughan closed this as completed Oct 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`pivot_longer` converts variable labels to new value labels #1535

`pivot_longer` converts variable labels to new value labels #1535

JDenn0514 commented Dec 19, 2023 •

edited

Loading

JDenn0514 commented Dec 21, 2023

DavisVaughan commented Oct 24, 2024

JDenn0514 commented Oct 25, 2024 •

edited

Loading

DavisVaughan commented Oct 25, 2024

JDenn0514 commented Oct 25, 2024

pivot_longer converts variable labels to new value labels #1535

pivot_longer converts variable labels to new value labels #1535

Comments

JDenn0514 commented Dec 19, 2023 • edited Loading

JDenn0514 commented Dec 21, 2023

DavisVaughan commented Oct 24, 2024

JDenn0514 commented Oct 25, 2024 • edited Loading

DavisVaughan commented Oct 25, 2024

JDenn0514 commented Oct 25, 2024

`pivot_longer` converts variable labels to new value labels #1535

`pivot_longer` converts variable labels to new value labels #1535

JDenn0514 commented Dec 19, 2023 •

edited

Loading

JDenn0514 commented Oct 25, 2024 •

edited

Loading