Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pivot_longer converts variable labels to new value labels #1535

Closed
JDenn0514 opened this issue Dec 19, 2023 · 5 comments
Closed

pivot_longer converts variable labels to new value labels #1535

JDenn0514 opened this issue Dec 19, 2023 · 5 comments

Comments

@JDenn0514
Copy link

JDenn0514 commented Dec 19, 2023

I think it would be really useful for pivot_longer to preserve the variable labels as the value labels after pivoting. Unfortunately this is not possible. To clarify, I don't think value labels should be preserved.


I work with survey data that are usually saved as .sav files. I then use the haven package to import them to R. This gives me both variable labels and value labels. In this case, only variable labels are relevant. Often, the variable labels are the questions used in the survey and are quite long. When a question is "select all that apply", each response option is split into a new variable. In order to analyze the questions, I use pivot_longer to make it into one variable.

The issue I run into is that I would like to maintain the original variable labels as the value labels. Here is an example data frame with variable labels.

library(tidyr)

# create fake data
df <- tibble(
  q1 = haven::labelled(0:1, label = "Question 1"),
  q2_1 = haven::labelled(0:1, label = "Question 2, Response Option 1"),
  q2_2 = haven::labelled(0:1, label = "Question 2, Response Option 2"),
  q2_3 = haven::labelled(0:1, label = "Question 2, Response Option 3"),
  q2_4 = haven::labelled(0:1, label = "Question 2, Response Option 4"),
  q2_5 = haven::labelled(0:1, label = "Question 2, Response Option 5"),
  q2_6 = haven::labelled(0:1, label = "Question 2, Response Option 6"),
  q2_7 = haven::labelled(0:1, label = "Question 2, Response Option 7"),
  q2_8 = haven::labelled(0:1, label = "Question 2, Response Option 8"),
  q3 = haven::labelled(0:1, label = "Question 3")
)

# pivot the data
df_long <- df %>% 
  pivot_longer(
    cols = -c(q1, q3),
    names_to = " var",
    values_to = "resp"
  )

df_long

labelled::look_for(df_long)

We can see in df_long that none of the variable labels made it in as value_labels for "var". We can manually set it using labelled::set_variable_labels like this:

df_long <- df_long %>% 
  set_value_labels(
    var = c(q2_1 = "Question 2, Response Option 1",
                 q2_2 = "Question 2, Response Option 2",
                 q2_3 = "Question 2, Response Option 3",
                 q2_4 = "Question 2, Response Option 4",
                 q2_5 = "Question 2, Response Option 5",
                 q2_6 = "Question 2, Response Option 6",
                 q2_7 = "Question 2, Response Option 7",
                 q2_8 = "Question 2, Response Option 8")
  )

labelled::look_for(df_long)

We can see now that the "var" variable has value labels. These value labels are the same as the variable labels of the variables that were pivoted. This is the ideal output.

Is it possible to change pivot_longer() so that it provides you the option of using the variable labels as the new value_labels in the "var" column?

The current workaround I have found is with the sjlabelled::label_to_colname() function as seen below:

df_long <- df %>% 
  sjlabelled::label_to_colnames(q2_1:q2_8) %>% 
  pivot_longer(
    cols = -c(q1, q3),
    names_to = "var",
    values_to = "resp"
  )

df_long

However, this is really not ideal as it just renames the variables immediately and doesn't add value_labels. Thus making subsetting the "var" variable incredibly cumbersome.

I also think that this should be possible since each variable has only one variable_label and therefore there shouldn't be any conflicts when pivoting, unlike with value_labels.

@JDenn0514
Copy link
Author

I believe that I have found a simple solution. I am not sure how easy it would be to add this into pivot_longer as it exists right now but this was the function I created:

pivot_longer_values <- function(data, cols, names_to, values_to, add_value_labels = TRUE) {
  long <- data %>% 
    tidyr::pivot_longer(
      cols = {{ cols }},
      names_to = names_to,
      values_to = values_to
    )
  
  if (add_value_labels == TRUE) {
    # create a vector containing the variable labels
    var_labs <- labelled::var_label(x = data %>% select( {{ cols }})) %>% 
      unlist()
    
    # flip the names and values of the vector
    var_labs <- setNames(names(var_labs), var_labs)
    
    # add the vector of labels as value labels to the new column of names
    labelled::val_labels(long[{{names_to}}]) <- var_labs
    
    return(long)
  } else {
    return(long)
  }
  
}

This function is basically a wrapper around the pivot_longer function with the same function variable names. I also added in add_value_labels so that people have the option of including the variable labels as value labels in the new columns specified under names_to. I tested this with the original data set I created and on other data I have and it seems to work.

# use original data set and make it longer with the value labels included
tbl_long <- tbl %>% 
  pivot_longer_values(
    cols = -c(q1, q3),
    names_to = "var",
    values_to = "resp",
    add_value_labels = TRUE
  )

# use labelled::look_for() to see if it worked
labelled::look_for(tbl_long)

# make a new column with the labels as the names
tbl_long %>% mutate(var_f = as_factor(var))

# here is a more in-depth example of a problem that this solves.
tbl_long %>% 
  # get the labels in a new column
  mutate(var_f = as_factor(var)) %>% 
  # filter out some of the variables
  filter(!var %in% c("q2_6", "q2_7", "q2_8")) %>% 
  # group it by the labels
  group_by(var_f) %>% 
  # get the frequency 
  count(q1)

Hopefully this is something that can be added into tidyr.

@JDenn0514 JDenn0514 changed the title Pivot_longer converts variable labels to new value labels pivot_longer converts variable labels to new value labels May 25, 2024
@DavisVaughan
Copy link
Member

Its not super satisfying but I feel like the best way to retain the information of the label is to promote it to the column name before pivoting. Note that the label on the final resp column isn't particularly meaningful in this case.

library(tidyr)

# create fake data
df <- tibble(
  q1 = haven::labelled(0:1, label = "Question 1"),
  q2_1 = haven::labelled(0:1, label = "Question 2, Response Option 1"),
  q2_2 = haven::labelled(0:1, label = "Question 2, Response Option 2"),
  q2_3 = haven::labelled(0:1, label = "Question 2, Response Option 3"),
  q2_4 = haven::labelled(0:1, label = "Question 2, Response Option 4"),
  q2_5 = haven::labelled(0:1, label = "Question 2, Response Option 5"),
  q2_6 = haven::labelled(0:1, label = "Question 2, Response Option 6"),
  q2_7 = haven::labelled(0:1, label = "Question 2, Response Option 7"),
  q2_8 = haven::labelled(0:1, label = "Question 2, Response Option 8"),
  q3 = haven::labelled(0:1, label = "Question 3")
)

df_long <- df %>%
  dplyr::rename_with(
    .fn = \(name) purrr::map_chr(df[name], labelled::label_attribute)
  ) %>%
  pivot_longer(
    cols = everything(),
    names_to = "var",
    values_to = "resp"
  )

df_long
#> # A tibble: 20 × 2
#>    var                           resp     
#>    <chr>                         <int+lbl>
#>  1 Question 1                    0        
#>  2 Question 2, Response Option 1 0        
#>  3 Question 2, Response Option 2 0        
#>  4 Question 2, Response Option 3 0        
#>  5 Question 2, Response Option 4 0        
#>  6 Question 2, Response Option 5 0        
#>  7 Question 2, Response Option 6 0        
#>  8 Question 2, Response Option 7 0        
#>  9 Question 2, Response Option 8 0        
#> 10 Question 3                    0        
#> 11 Question 1                    1        
#> 12 Question 2, Response Option 1 1        
#> 13 Question 2, Response Option 2 1        
#> 14 Question 2, Response Option 3 1        
#> 15 Question 2, Response Option 4 1        
#> 16 Question 2, Response Option 5 1        
#> 17 Question 2, Response Option 6 1        
#> 18 Question 2, Response Option 7 1        
#> 19 Question 2, Response Option 8 1        
#> 20 Question 3                    1

# all labeled with q1 label
df_long$resp
#> <labelled<integer>[20]>: Question 1
#>  [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

I don't believe there are any tidyr specific tools that we should add to support haven more though

@JDenn0514
Copy link
Author

JDenn0514 commented Oct 25, 2024

No worries! I appreciate the response. The sjlabelled package already has a function that basically does this called label_to_colname(). The issue with this is that the variable names become quite cumbersome if you have a lot of variables in the data set and only want to pivot over a few because now you have to write out the entire label. For example, using the data above:

df_long <- df %>% 
  sjlabelled::label_to_colnames(q2_1:q2_8) %>% 
  pivot_longer(
    cols = c(`Question 2, Response Option 1`:`Question 2, Response Option 8`),
    names_to = "var",
    values_to = "resp"
  )

It becomes even worse if you don't want to do it over every variable in a row. But no worries if it's not something in the cards.

Can I ask why you don't think you should add any tools to support haven? Do you think it might be something that haven would implement?

@DavisVaughan
Copy link
Member

You could also use a more complex intermediate data structure - a packed data frame. This lets you keep the label alongside the values during the pivot by prematurely recycling the label of size 1 to the size of the values.

I think this would work quite elegantly for selecting columns

I strip off the attributes because as shown above the label doesn't make sense after the pivot.

library(tidyr)
library(dplyr)

# create fake data
df <- tibble(
  q1 = haven::labelled(0:1, label = "Question 1"),
  q2_1 = haven::labelled(0:1, label = "Question 2, Response Option 1"),
  q2_2 = haven::labelled(0:1, label = "Question 2, Response Option 2"),
  q2_3 = haven::labelled(0:1, label = "Question 2, Response Option 3"),
  q2_4 = haven::labelled(0:1, label = "Question 2, Response Option 4"),
  q2_5 = haven::labelled(0:1, label = "Question 2, Response Option 5"),
  q2_6 = haven::labelled(0:1, label = "Question 2, Response Option 6"),
  q2_7 = haven::labelled(0:1, label = "Question 2, Response Option 7"),
  q2_8 = haven::labelled(0:1, label = "Question 2, Response Option 8"),
  q3 = haven::labelled(0:1, label = "Question 3")
)

side_by_side <- function(x) {
  label <- labelled::label_attribute(x)
  attributes(x) <- NULL
  data.frame(value = x, label = label)
}

df %>%
  mutate(across(q2_1:q2_8, side_by_side)) %>%
  pivot_longer(
    cols = q2_1:q2_8,
    names_to = "var",
    values_to = "resp"
  ) %>%
  unpack(resp)
#> # A tibble: 16 × 5
#>    q1        q3        var   value label                        
#>    <int+lbl> <int+lbl> <chr> <int> <chr>                        
#>  1 0         0         q2_1      0 Question 2, Response Option 1
#>  2 0         0         q2_2      0 Question 2, Response Option 2
#>  3 0         0         q2_3      0 Question 2, Response Option 3
#>  4 0         0         q2_4      0 Question 2, Response Option 4
#>  5 0         0         q2_5      0 Question 2, Response Option 5
#>  6 0         0         q2_6      0 Question 2, Response Option 6
#>  7 0         0         q2_7      0 Question 2, Response Option 7
#>  8 0         0         q2_8      0 Question 2, Response Option 8
#>  9 1         1         q2_1      1 Question 2, Response Option 1
#> 10 1         1         q2_2      1 Question 2, Response Option 2
#> 11 1         1         q2_3      1 Question 2, Response Option 3
#> 12 1         1         q2_4      1 Question 2, Response Option 4
#> 13 1         1         q2_5      1 Question 2, Response Option 5
#> 14 1         1         q2_6      1 Question 2, Response Option 6
#> 15 1         1         q2_7      1 Question 2, Response Option 7
#> 16 1         1         q2_8      1 Question 2, Response Option 8

# here is what it looks like, `q2_1` itself is a data frame
df %>%
  mutate(across(q2_1:q2_8, side_by_side)) %>%
  select(q1, q2_1)
#> # A tibble: 2 × 2
#>   q1        q2_1$value $label                       
#>   <int+lbl>      <int> <chr>                        
#> 1 0                  0 Question 2, Response Option 1
#> 2 1                  1 Question 2, Response Option 1

Can I ask why you don't think you should add any tools to support haven?

In general tidyr's pivoting tools are very generic. So adding tools inside tidyr specifically for a single package goes against that ethos a little, even if we technically own both packages. It suggests there is some other workflow that we are missing.

@JDenn0514
Copy link
Author

That reasoning makes a ton of sense to me.

Also, that side_by_side() function is very interesting and to me seems like a much better and more elegant solution than converting the labels to the column names.

I have a personal package where I created a function called pivot_longer_values() that preserves the variable labels as value labels. I’ll then use haven::as_factor() when I want to use the labels instead of variable names. Using side_by_side() seems similarly simple.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants