
Vignette on untidy data #1484

Open
hadley opened this issue Feb 7, 2023 · 0 comments

Extracted from R4DS due to page limitations. If this happens before the book is published, add a link to this vignette from the tidy data chapter.


Untidy data

While pivot_wider() is occasionally useful for making tidy data, its real strength is making untidy data.
While that sounds like a bad thing, untidy isn't a pejorative term: there are many untidy data structures that are extremely useful.
Tidy data is a great starting point for most analyses but it's not the only data format you'll ever need.

The following sections will show a few examples of pivot_wider() making usefully untidy data for presenting data to other humans, for input to multivariate statistics algorithms, and for pragmatically solving data manipulation challenges.

Presenting data to humans

As you've seen, dplyr::count() produces tidy data: it makes one row for each group, with one column for each grouping variable, and one column for the number of observations.

diamonds |> 
  count(clarity, color)

This is easy to visualize or summarize further, but it's not the most compact form for display.
You can use pivot_wider() to create a form more suitable for display to other humans:

diamonds |> 
  count(clarity, color) |> 
  pivot_wider(
    names_from = color, 
    values_from = n
  )

This display also makes it easy to compare in two directions, horizontally and vertically, much like facet_grid().

pivot_wider() can be great for quickly sketching out a table.
But for real presentation tables, we highly suggest learning a package like gt.
gt is similar to ggplot2 in that it provides an extremely powerful grammar for laying out tables.
It takes some work to learn but the payoff is the ability to make just about any table you can imagine.
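To give a taste of gt (a hedged sketch, assuming the gt package is installed; `tab_header()` is one of its real layout functions), you can pipe the wide table straight into `gt()` and then layer on refinements:

```r
library(tidyr)
library(dplyr)
library(ggplot2)  # for the diamonds dataset
library(gt)

diamonds |>
  count(clarity, color) |>
  pivot_wider(names_from = color, values_from = n) |>
  gt() |>
  tab_header(title = "Diamond counts by clarity and color")
```

From here, gt functions like `fmt_number()` and `tab_spanner()` let you progressively refine formatting, much as ggplot2 layers refine a plot.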

Multivariate statistics

Most classical multivariate statistical methods (like dimension reduction and clustering) require your data in matrix form, where each column is a time point, or a location, or a gene, or a species, but definitely not a variable.
Sometimes these formats have substantial performance or space advantages, or sometimes they're just necessary to get closer to the underlying matrix mathematics.

We're not going to cover these statistical methods here, but it is useful to know how to get your data into the form that they need.
For example, let's imagine you wanted to cluster the gapminder data to find countries that had similar progression of gdpPercap over time.
To do this, we need one row for each country and one column for each year:

library(gapminder)

col_year <- gapminder |> 
  mutate(gdpPercap = log10(gdpPercap)) |> 
  pivot_wider(
    id_cols = country, 
    names_from = year,
    values_from = gdpPercap
  ) 
col_year

pivot_wider() produces a tibble where each row is labelled by the country variable.
But most classic statistical algorithms don't want the identifier as an explicit variable; they want it as a row name.
We can turn the country variable into row names with tibble::column_to_rownames():

col_year <- col_year |> 
  column_to_rownames("country") 

head(col_year)

This makes a data frame, because tibbles don't support row names[1].
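You can confirm the conversion with a quick check (a sketch, assuming the pipeline above has just been run):

```r
# col_year is now a plain data.frame, with countries stored as row names
# rather than as a column.
class(col_year)
head(rownames(col_year))
```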

We're now ready to cluster with (e.g.) kmeans():

cluster <- stats::kmeans(col_year, centers = 6)

Extracting the data out of this object into a form you can work with is a challenge you'll need to come back to later in the book, once you've learned more about lists.
But for now, you can get the clustering membership out with this code:

cluster_id <- cluster$cluster |> 
  enframe() |> 
  rename(country = name, cluster_id = value)
cluster_id

You could then combine this back with the original data using one of the joins you'll learn about in @sec-joins.

gapminder |> left_join(cluster_id)

Pragmatic computation

Sometimes it's just easier to answer a question using untidy data.
For example, if you're interested in just the total number of missing values in cms_patient_experience, it's easier to work with the untidy form:

cms_patient_experience |> 
  group_by(org_pac_id) |> 
  summarize(
    n_miss = sum(is.na(prf_rate)),
    n = n(),
  )

This is partly a reflection of our definition of tidy data, where we said tidy data has one variable in each column, but we didn't actually define what a variable is (and it's surprisingly hard to do so).
It's totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest.

So if you're stuck figuring out how to do some computation, maybe it's time to switch up the organisation of your data.
For computations involving a fixed number of values (like computing differences or ratios), it's usually easier if the data is in columns; for those with a variable number of values (like sums or means) it's usually easier in rows.
Don't be afraid to untidy, transform, and re-tidy if needed.
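For example (a sketch reusing the gapminder data loaded above; the `gdp_` prefix and column names are choices made here for illustration), comparing two specific years involves a fixed number of values, so pivoting those years into columns turns the comparison into a simple mutate():

```r
library(gapminder)
library(tidyr)
library(dplyr)

# Ratio of GDP per capita between 2007 and 1952, one row per country.
gapminder |>
  filter(year %in% c(1952, 2007)) |>
  pivot_wider(
    id_cols = country,
    names_from = year,
    values_from = gdpPercap,
    names_prefix = "gdp_"
  ) |>
  mutate(gdp_ratio = gdp_2007 / gdp_1952)
```

The equivalent computation on the long form would need a grouped summarize with careful ordering; here each country's two values sit side by side.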

Let's explore this idea by looking at cms_patient_care, which has a similar structure to cms_patient_experience:

cms_patient_care

It contains information about 9 measures (beliefs_addressed, composite_process, dyspena_treatment, ...) on 14 different facilities (identified by ccn with a name given by facility_name).
Compared to cms_patient_experience, however, each measurement is recorded in two rows with a score, the percentage of patients who answered yes to the survey question, and a denominator, the number of patients that the question applies to.
Depending on what you want to do next, you may find any of the following three structures useful:

  • If you want to compute the number of patients that answered yes to the question, you may pivot type into the columns:

    cms_patient_care |> 
      pivot_wider(
        names_from = type,
        values_from = score
      ) |> 
      mutate(
        numerator = round(observed / 100 * denominator)
      )
    
  • If you want to display the distribution of each metric, you may keep it as is so you could facet by measure_abbr.

    cms_patient_care |> 
      filter(type == "observed") |> 
      ggplot(aes(x = score)) + 
      geom_histogram(binwidth = 2) + 
      facet_wrap(vars(measure_abbr))
    
  • If you want to explore how different metrics are related, you may put the measure names in the columns so you could compare them in scatterplots.

    cms_patient_care |> 
      filter(type == "observed") |> 
      select(-type) |> 
      pivot_wider(
        names_from = measure_abbr,
        values_from = score
      ) |> 
      ggplot(aes(x = dyspnea_screening, y = dyspena_treatment)) + 
      geom_point() + 
      coord_equal()
    

Footnotes

  1. tibbles don't use row names because they only work for a subset of important cases: when observations can be identified by a single character vector.
