You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Extracted from R4DS due to page limitations. If it happens before book is published, add link to this vignette from tidy data chapter.
Untidy data
While pivot_wider() is occasionally useful for making tidy data, its real strength is making untidy data.
While that sounds like a bad thing, untidy isn't a pejorative term: there are many untidy data structures that are extremely useful.
Tidy data is a great starting point for most analyses but it's not the only data format you'll ever need.
The following sections will show a few examples of pivot_wider() making usefully untidy data for presenting data to other humans, for input to multivariate statistics algorithms, and for pragmatically solving data manipulation challenges.
Presenting data to humans
As you've seen, dplyr::count() produces tidy data: it makes one row for each group, with one column for each grouping variable, and one column for the number of observations.
diamonds |>
count(clarity, color)
This is easy to visualize or summarize further, but it's not the most compact form for display.
You can use pivot_wider() to create a form more suitable for display to other humans:
This display also makes it easy to compare in two directions, horizontally and vertically, much like facet_grid().
pivot_wider() can be great for quickly sketching out a table.
But for real presentation tables, we highly suggest learning a package like gt.
gt is similar to ggplot2 in that it provides an extremely powerful grammar for laying out tables.
It takes some work to learn but the payoff is the ability to make just about any table you can imagine.
Multivariate statistics
Most classical multivariate statistical methods (like dimension reduction and clustering) require your data in matrix form, where each column is a time point, or a location, or a gene, or a species, but definitely not a variable.
Sometimes these formats have substantial performance or space advantages, or sometimes they're just necessary to get closer to the underlying matrix mathematics.
We're not going to cover these statistical methods here, but it is useful to know how to get your data into the form that they need.
For example, let's imagine you wanted to cluster the gapminder data to find countries that had similar progression of gdpPercap over time.
To do this, we need one row for each country and one column for each year:
pivot_wider() produces a tibble where each row is labelled by the country variable.
But most classic statistical algorithms don't want the identifier as an explicit variable; they want as a row name.
We can turn the country variable into row names with column_to_rowname():
This makes a data frame, because tibbles don't support row names1.
We're now ready to cluster with (e.g.) kmeans():
cluster <- stats::kmeans(col_year, centers = 6)
Extracting the data out of this object into a form you can work with is a challenge you'll need to come back to later in the book, once you've learned more about lists.
But for now, you can get the clustering membership out with this code:
You could then combine this back with the original data using one of the joins you'll learn about in @sec-joins.
gapminder |> left_join(cluster_id)
Pragmatic computation
Sometimes it's just easier to answer a question using untidy data.
For example, if you're interested in just the total number of missing values in cms_patient_experience, it's easier to work with the untidy form:
This is partly a reflection of our definition of tidy data, where we said tidy data has one variable in each column, but we didn't actually define what a variable is (and it's surprisingly hard to do so).
It's totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest.
So if you're stuck figuring out how to do some computation, maybe it's time to switch up the organisation of your data.
For computations involving a fixed number of values (like computing differences or ratios), it's usually easier if the data is in columns; for those with a variable number of values (like sums or means) it's usually easier in rows.
Don't be afraid to untidy, transform, and re-tidy if needed.
Let's explore this idea by looking at cms_patient_care, which has a similar structure to cms_patient_experience:
cms_patient_care
It contains information about 9 measures (beliefs_addressed, composite_process, dyspena_treatment, ...) on 14 different facilities (identified by ccn with a name given by facility_name).
Compared to cms_patient_experience, however, each measurement is recorded in two rows with a score, the percentage of patients who answered yes to the survey question, and a denominator, the number of patients that the question applies to.
Depending on what you want to do next, you may find any of the following three structures useful:
If you want to compute the number of patients that answered yes to the question, you may pivot type into the columns:
tibbles don't use row names because they only work for a subset of important cases: when observations can be identified by a single character vector. ↩
The text was updated successfully, but these errors were encountered:
Extracted from R4DS due to page limitations. If it happens before book is published, add link to this vignette from tidy data chapter.
Untidy data
While
pivot_wider()
is occasionally useful for making tidy data, its real strength is making untidy data.While that sounds like a bad thing, untidy isn't a pejorative term: there are many untidy data structures that are extremely useful.
Tidy data is a great starting point for most analyses but it's not the only data format you'll ever need.
The following sections will show a few examples of
pivot_wider()
making usefully untidy data for presenting data to other humans, for input to multivariate statistics algorithms, and for pragmatically solving data manipulation challenges.Presenting data to humans
As you've seen,
dplyr::count()
produces tidy data: it makes one row for each group, with one column for each grouping variable, and one column for the number of observations.This is easy to visualize or summarize further, but it's not the most compact form for display.
You can use
pivot_wider()
to create a form more suitable for display to other humans:This display also makes it easy to compare in two directions, horizontally and vertically, much like
facet_grid()
.pivot_wider()
can be great for quickly sketching out a table.But for real presentation tables, we highly suggest learning a package like gt.
gt is similar to ggplot2 in that it provides an extremely powerful grammar for laying out tables.
It takes some work to learn but the payoff is the ability to make just about any table you can imagine.
Multivariate statistics
Most classical multivariate statistical methods (like dimension reduction and clustering) require your data in matrix form, where each column is a time point, or a location, or a gene, or a species, but definitely not a variable.
Sometimes these formats have substantial performance or space advantages, or sometimes they're just necessary to get closer to the underlying matrix mathematics.
We're not going to cover these statistical methods here, but it is useful to know how to get your data into the form that they need.
For example, let's imagine you wanted to cluster the gapminder data to find countries that had similar progression of
gdpPercap
over time.To do this, we need one row for each country and one column for each year:
pivot_wider()
produces a tibble where each row is labelled by thecountry
variable.But most classic statistical algorithms don't want the identifier as an explicit variable; they want as a row name.
We can turn the
country
variable into row names withcolumn_to_rowname()
:This makes a data frame, because tibbles don't support row names1.
We're now ready to cluster with (e.g.)
kmeans()
:Extracting the data out of this object into a form you can work with is a challenge you'll need to come back to later in the book, once you've learned more about lists.
But for now, you can get the clustering membership out with this code:
You could then combine this back with the original data using one of the joins you'll learn about in @sec-joins.
Pragmatic computation
Sometimes it's just easier to answer a question using untidy data.
For example, if you're interested in just the total number of missing values in
cms_patient_experience
, it's easier to work with the untidy form:This is partly a reflection of our definition of tidy data, where we said tidy data has one variable in each column, but we didn't actually define what a variable is (and it's surprisingly hard to do so).
It's totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest.
So if you're stuck figuring out how to do some computation, maybe it's time to switch up the organisation of your data.
For computations involving a fixed number of values (like computing differences or ratios), it's usually easier if the data is in columns; for those with a variable number of values (like sums or means) it's usually easier in rows.
Don't be afraid to untidy, transform, and re-tidy if needed.
Let's explore this idea by looking at
cms_patient_care
, which has a similar structure tocms_patient_experience
:It contains information about 9 measures (
beliefs_addressed
,composite_process
,dyspena_treatment
, ...) on 14 different facilities (identified byccn
with a name given byfacility_name
).Compared to
cms_patient_experience
, however, each measurement is recorded in two rows with ascore
, the percentage of patients who answered yes to the survey question, and a denominator, the number of patients that the question applies to.Depending on what you want to do next, you may find any of the following three structures useful:
If you want to compute the number of patients that answered yes to the question, you may pivot
type
into the columns:If you want to display the distribution of each metric, you may keep it as is so you could facet by
measure_abbr
.If you want to explore how different metrics are related, you may put the measure names in the columns so you could compare them in scatterplots.
Footnotes
tibbles don't use row names because they only work for a subset of important cases: when observations can be identified by a single character vector. ↩
The text was updated successfully, but these errors were encountered: