-
Notifications
You must be signed in to change notification settings - Fork 420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pivot_longer
converts variable labels to new value labels
#1535
Comments
I believe that I have found a simple solution. I am not sure how easy it would be to add this into pivot_longer as it exists right now but this was the function I created: pivot_longer_values <- function(data, cols, names_to, values_to, add_value_labels = TRUE) {
long <- data %>%
tidyr::pivot_longer(
cols = {{ cols }},
names_to = names_to,
values_to = values_to
)
if (add_value_labels == TRUE) {
# create a vector containing the variable labels
var_labs <- labelled::var_label(x = data %>% select( {{ cols }})) %>%
unlist()
# flip the names and values of the vector
var_labs <- setNames(names(var_labs), var_labs)
# add the vector of labels as value labels to the new column of names
labelled::val_labels(long[{{names_to}}]) <- var_labs
return(long)
} else {
return(long)
}
} This function is basically a wrapper around the # use original data set and make it longer with the value labels included
tbl_long <- tbl %>%
pivot_longer_values(
cols = -c(q1, q3),
names_to = "var",
values_to = "resp",
add_value_labels = TRUE
)
# use labelled::look_for() to see if it worked
labelled::look_for(tbl_long)
# make a new column with the labels as the names
tbl_long %>% mutate(var_f = as_factor(var))
# here is a more in-depth example of a problem that this solves.
tbl_long %>%
# get the labels in a new column
mutate(var_f = as_factor(var)) %>%
# filter out some of the variables
filter(!var %in% c("q2_6", "q2_7", "q2_8")) %>%
# group it by the labels
group_by(var_f) %>%
# get the frequency
count(q1)
Hopefully this is something that can be added into tidyr. |
pivot_longer
converts variable labels to new value labels
Its not super satisfying but I feel like the best way to retain the information of the library(tidyr)
# create fake data
df <- tibble(
q1 = haven::labelled(0:1, label = "Question 1"),
q2_1 = haven::labelled(0:1, label = "Question 2, Response Option 1"),
q2_2 = haven::labelled(0:1, label = "Question 2, Response Option 2"),
q2_3 = haven::labelled(0:1, label = "Question 2, Response Option 3"),
q2_4 = haven::labelled(0:1, label = "Question 2, Response Option 4"),
q2_5 = haven::labelled(0:1, label = "Question 2, Response Option 5"),
q2_6 = haven::labelled(0:1, label = "Question 2, Response Option 6"),
q2_7 = haven::labelled(0:1, label = "Question 2, Response Option 7"),
q2_8 = haven::labelled(0:1, label = "Question 2, Response Option 8"),
q3 = haven::labelled(0:1, label = "Question 3")
)
df_long <- df %>%
dplyr::rename_with(
.fn = \(name) purrr::map_chr(df[name], labelled::label_attribute)
) %>%
pivot_longer(
cols = everything(),
names_to = "var",
values_to = "resp"
)
df_long
#> # A tibble: 20 × 2
#> var resp
#> <chr> <int+lbl>
#> 1 Question 1 0
#> 2 Question 2, Response Option 1 0
#> 3 Question 2, Response Option 2 0
#> 4 Question 2, Response Option 3 0
#> 5 Question 2, Response Option 4 0
#> 6 Question 2, Response Option 5 0
#> 7 Question 2, Response Option 6 0
#> 8 Question 2, Response Option 7 0
#> 9 Question 2, Response Option 8 0
#> 10 Question 3 0
#> 11 Question 1 1
#> 12 Question 2, Response Option 1 1
#> 13 Question 2, Response Option 2 1
#> 14 Question 2, Response Option 3 1
#> 15 Question 2, Response Option 4 1
#> 16 Question 2, Response Option 5 1
#> 17 Question 2, Response Option 6 1
#> 18 Question 2, Response Option 7 1
#> 19 Question 2, Response Option 8 1
#> 20 Question 3 1
# all labeled with q1 label
df_long$resp
#> <labelled<integer>[20]>: Question 1
#> [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 I don't believe there are any tidyr specific tools that we should add to support haven more though |
No worries! I appreciate the response. The sjlabelled package already has a function that basically does this called label_to_colname(). The issue with this is that the variable names become quite cumbersome if you have a lot of variables in the data set and only want to pivot over a few because now you have to write out the entire label. For example, using the data above:
It becomes even worse if you don't want to do it over every variable in a row. But no worries if it's not something in the cards. Can I ask why you don't think you should add any tools to support haven? Do you think it might be something that haven would implement? |
You could also use a more complex intermediate data structure - a packed data frame. This lets you keep the label alongside the values during the pivot by prematurely recycling the label of size 1 to the size of the values. I think this would work quite elegantly for selecting columns I strip off the attributes because as shown above the library(tidyr)
library(dplyr)
# create fake data
df <- tibble(
q1 = haven::labelled(0:1, label = "Question 1"),
q2_1 = haven::labelled(0:1, label = "Question 2, Response Option 1"),
q2_2 = haven::labelled(0:1, label = "Question 2, Response Option 2"),
q2_3 = haven::labelled(0:1, label = "Question 2, Response Option 3"),
q2_4 = haven::labelled(0:1, label = "Question 2, Response Option 4"),
q2_5 = haven::labelled(0:1, label = "Question 2, Response Option 5"),
q2_6 = haven::labelled(0:1, label = "Question 2, Response Option 6"),
q2_7 = haven::labelled(0:1, label = "Question 2, Response Option 7"),
q2_8 = haven::labelled(0:1, label = "Question 2, Response Option 8"),
q3 = haven::labelled(0:1, label = "Question 3")
)
side_by_side <- function(x) {
label <- labelled::label_attribute(x)
attributes(x) <- NULL
data.frame(value = x, label = label)
}
df %>%
mutate(across(q2_1:q2_8, side_by_side)) %>%
pivot_longer(
cols = q2_1:q2_8,
names_to = "var",
values_to = "resp"
) %>%
unpack(resp)
#> # A tibble: 16 × 5
#> q1 q3 var value label
#> <int+lbl> <int+lbl> <chr> <int> <chr>
#> 1 0 0 q2_1 0 Question 2, Response Option 1
#> 2 0 0 q2_2 0 Question 2, Response Option 2
#> 3 0 0 q2_3 0 Question 2, Response Option 3
#> 4 0 0 q2_4 0 Question 2, Response Option 4
#> 5 0 0 q2_5 0 Question 2, Response Option 5
#> 6 0 0 q2_6 0 Question 2, Response Option 6
#> 7 0 0 q2_7 0 Question 2, Response Option 7
#> 8 0 0 q2_8 0 Question 2, Response Option 8
#> 9 1 1 q2_1 1 Question 2, Response Option 1
#> 10 1 1 q2_2 1 Question 2, Response Option 2
#> 11 1 1 q2_3 1 Question 2, Response Option 3
#> 12 1 1 q2_4 1 Question 2, Response Option 4
#> 13 1 1 q2_5 1 Question 2, Response Option 5
#> 14 1 1 q2_6 1 Question 2, Response Option 6
#> 15 1 1 q2_7 1 Question 2, Response Option 7
#> 16 1 1 q2_8 1 Question 2, Response Option 8
# here is what it looks like, `q2_1` itself is a data frame
df %>%
mutate(across(q2_1:q2_8, side_by_side)) %>%
select(q1, q2_1)
#> # A tibble: 2 × 2
#> q1 q2_1$value $label
#> <int+lbl> <int> <chr>
#> 1 0 0 Question 2, Response Option 1
#> 2 1 1 Question 2, Response Option 1
In general tidyr's pivoting tools are very generic. So adding tools inside tidyr specifically for a single package goes against that ethos a little, even if we technically own both packages. It suggests there is some other workflow that we are missing. |
That reasoning makes a ton of sense to me. Also, that I have a personal package where I created a function called Thanks! |
I think it would be really useful for pivot_longer to preserve the variable labels as the value labels after pivoting. Unfortunately this is not possible. To clarify, I don't think value labels should be preserved.
I work with survey data that are usually saved as .sav files. I then use the haven package to import them to R. This gives me both variable labels and value labels. In this case, only variable labels are relevant. Often, the variable labels are the questions used in the survey and are quite long. When a question is "select all that apply", each response option is split into a new variable. In order to analyze the questions, I use pivot_longer to make it into one variable.
The issue I run into is that I would like to maintain the original variable labels as the value labels. Here is an example data frame with variable labels.
We can see in df_long that none of the variable labels made it in as value_labels for "var". We can manually set it using
labelled::set_variable_labels
like this:We can see now that the "var" variable has value labels. These value labels are the same as the variable labels of the variables that were pivoted. This is the ideal output.
Is it possible to change pivot_longer() so that it provides you the option of using the variable labels as the new value_labels in the "var" column?
The current workaround I have found is with the sjlabelled::label_to_colname() function as seen below:
However, this is really not ideal as it just renames the variables immediately and doesn't add value_labels. Thus making subsetting the "var" variable incredibly cumbersome.
I also think that this should be possible since each variable has only one variable_label and therefore there shouldn't be any conflicts when pivoting, unlike with value_labels.
The text was updated successfully, but these errors were encountered: