From 6c0ed78c81bad9b6d98ca8003eddd13a87cb5372 Mon Sep 17 00:00:00 2001
From: Julien Brun 7.2
In the long format, you usually have 1 column for the observed variable and the other columns are ID variables. The mpg dataset is an example of a long dataset with each row representing a single car and each column representing a variable of that car such as manufacturer and year.
mpg
## # A tibble: 234 x 11
-## manufacturer model displ year cyl trans drv cty hwy
-## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
-## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
-## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
-## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
-## 4 audi a4 2.0 2008 4 auto(av) f 21 30
-## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
-## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
-## 7 audi a4 3.1 2008 6 auto(av) f 18 27
-## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
-## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
-## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
-## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
+## manufacturer model displ year cyl trans drv cty hwy fl
+## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
+## 1 audi a4 1.80 1999 4 auto(l… f 18 29 p
+## 2 audi a4 1.80 1999 4 manual… f 21 29 p
+## 3 audi a4 2.00 2008 4 manual… f 20 31 p
+## 4 audi a4 2.00 2008 4 auto(a… f 21 30 p
+## 5 audi a4 2.80 1999 6 auto(l… f 16 26 p
+## 6 audi a4 2.80 1999 6 manual… f 18 26 p
+## 7 audi a4 3.10 2008 6 auto(a… f 18 27 p
+## 8 audi a4 quat… 1.80 1999 4 manual… 4 18 26 p
+## 9 audi a4 quat… 1.80 1999 4 auto(l… 4 16 25 p
+## 10 audi a4 quat… 2.00 2008 4 manual… 4 20 28 p
+## # ... with 224 more rows, and 1 more variable: class <chr>
These different data formats mainly affect readability. For humans, the wide format is often more intuitive since we can see more of the data on the screen due to its shape. However, the long format is more machine readable and is closer to the formatting of databases. The ID variables in our dataframes are similar to the fields in a database and observed variables are like the database values.
Note: Generally, mathematical operations are better in long format, although some plotting functions actually work better with wide format.
@@ -488,23 +488,23 @@gather()
data fr
head(gap_long)
## # A tibble: 6 x 2
## obstype_year obs_values
-## <chr> <chr>
-## 1 continent Africa
-## 2 continent Africa
-## 3 continent Africa
-## 4 continent Africa
-## 5 continent Africa
-## 6 continent Africa
+## <chr> <chr>
+## 1 continent Africa
+## 2 continent Africa
+## 3 continent Africa
+## 4 continent Africa
+## 5 continent Africa
+## 6 continent Africa
tail(gap_long)
## # A tibble: 6 x 2
## obstype_year obs_values
-## <chr> <chr>
-## 1 pop_2007 9031088
-## 2 pop_2007 7554661
-## 3 pop_2007 71158647
-## 4 pop_2007 60776238
-## 5 pop_2007 20434176
-## 6 pop_2007 4115771
+## <chr> <chr>
+## 1 pop_2007 9031088
+## 2 pop_2007 7554661
+## 3 pop_2007 71158647
+## 4 pop_2007 60776238
+## 5 pop_2007 20434176
+## 6 pop_2007 4115771
We have reshaped our dataframe but this new format isn’t really what we wanted.
What went wrong? Notice that gather() didn’t know that we wanted to keep continent and country untouched; we need to give it more information about which columns we want reshaped. We can do this in several ways.
One way is to identify the columns by name. Listing them explicitly can be a good approach if there are just a few. But in our case we have 30 columns. I’m not going to list them out here since there is way too much potential for error if I tried to list gdpPercap_1952, gdpPercap_1957, gdpPercap_1962 and so on. But we could use some of dplyr’s awesome helper functions, because we expect that there is a better way to do this!
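As a minimal sketch of that idea, assuming a small toy data frame (`gap_wide_toy` is a hypothetical stand-in for the real wide gapminder table, and its values are made up for illustration), `starts_with()` picks up every column sharing a prefix so none of them has to be typed by hand:

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data (made-up values): two ID columns plus
# wide observation columns named <obs_type>_<year>.
gap_wide_toy <- data.frame(
  continent    = c("Africa", "Europe"),
  country      = c("Algeria", "Sweden"),
  pop_1952     = c(100, 200),
  pop_2007     = c(150, 250),
  lifeExp_1952 = c(40, 70),
  lifeExp_2007 = c(70, 80),
  stringsAsFactors = FALSE
)

# Gather only the columns whose names start with the given prefixes;
# continent and country are left untouched as ID variables.
gap_long_toy <- gap_wide_toy %>%
  gather(obstype_year, obs_values,
         starts_with("pop"),
         starts_with("lifeExp"))

head(gap_long_toy)
```

An equivalent selection excludes the ID columns instead, with `gather(obstype_year, obs_values, -continent, -country)`, which can be shorter when there are many prefixes.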
gather()
data fr
## $ obs_values: num 2449 3521 1063 851 543 ...
head(gap_long)
## # A tibble: 6 x 5
-## continent country obs_type year obs_values
-## <chr> <chr> <chr> <int> <dbl>
-## 1 Africa Algeria gdpPercap 1952 2449
-## 2 Africa Angola gdpPercap 1952 3521
-## 3 Africa Benin gdpPercap 1952 1063
-## 4 Africa Botswana gdpPercap 1952 851
-## 5 Africa Burkina Faso gdpPercap 1952 543
-## 6 Africa Burundi gdpPercap 1952 339
+## continent country obs_type year obs_values
+## <chr> <chr> <chr> <int> <dbl>
+## 1 Africa Algeria gdpPercap 1952 2449.
+## 2 Africa Angola gdpPercap 1952 3521.
+## 3 Africa Benin gdpPercap 1952 1063.
+## 4 Africa Botswana gdpPercap 1952 851.
+## 5 Africa Burkina Faso gdpPercap 1952 543.
+## 6 Africa Burundi gdpPercap 1952 339.
tail(gap_long)
## # A tibble: 6 x 5
-## continent country obs_type year obs_values
-## <chr> <chr> <chr> <int> <dbl>
-## 1 Europe Sweden pop 2007 9031088
-## 2 Europe Switzerland pop 2007 7554661
-## 3 Europe Turkey pop 2007 71158647
-## 4 Europe United Kingdom pop 2007 60776238
-## 5 Oceania Australia pop 2007 20434176
-## 6 Oceania New Zealand pop 2007 4115771
+## continent country obs_type year obs_values
+## <chr> <chr> <chr> <int> <dbl>
+## 1 Europe Sweden pop 2007 9031088.
+## 2 Europe Switzerland pop 2007 7554661.
+## 3 Europe Turkey pop 2007 71158647.
+## 4 Europe United Kingdom pop 2007 60776238.
+## 5 Oceania Australia pop 2007 20434176.
+## 6 Oceania New Zealand pop 2007 4115771.
Excellent. This is long format: every row is a unique observation. Yay!
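One way to arrive at separate obs_type and year columns like those above is to split the combined key with `separate()`. The sketch below assumes a hypothetical toy data frame (`gap_wide_toy`, with made-up values) rather than the real gapminder table, and is not necessarily the exact call used in this lesson:

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data (made-up values) with <obs_type>_<year> columns.
gap_wide_toy <- data.frame(
  continent = c("Africa", "Europe"),
  country   = c("Algeria", "Sweden"),
  pop_1952  = c(100, 200),
  pop_2007  = c(150, 250),
  stringsAsFactors = FALSE
)

gap_long_toy <- gap_wide_toy %>%
  # Reshape everything except the ID columns into key/value pairs.
  gather(obstype_year, obs_values, -continent, -country) %>%
  # Split "pop_1952" into "pop" and 1952; convert = TRUE turns year into an integer.
  separate(obstype_year, into = c("obs_type", "year"), sep = "_", convert = TRUE)

str(gap_long_toy)
```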
complete()
One of the coolest functions in tidyr is the function complete(). Jarrett Byrnes has written up a great blog piece showcasing the utility of this function so I’m going to use that example here.
We’ll start with an example dataframe where the data recorder enters the Abundance of two species of kelp, Saccharina and Agarum in the years 1999, 2000 and 2004.
kelpdf <- data.frame(
Year = c(1999, 2000, 2004, 1999, 2004),
@@ -771,10 +771,11 @@ 7.8 Other links
"facebook": true,
"twitter": true,
"google": false,
+"linkedin": false,
"weibo": false,
"instapaper": false,
"vk": false,
-"all": ["facebook", "google", "twitter", "weibo", "instapaper"]
+"all": ["facebook", "google", "twitter", "linkedin", "weibo", "instapaper"]
},
"fontsettings": {
"theme": "white",
diff --git a/tidyr.Rmd b/tidyr.Rmd
index b90f68f..a472ede 100644
--- a/tidyr.Rmd
+++ b/tidyr.Rmd
@@ -428,7 +428,7 @@ str(gap_wide_new)
### `complete()`
-One of the coolest functions in `tidyr` is the function `complete()`. Jarrett Byrnes has written up a [great blog piece]((http://www.imachordata.com/you-complete-me/)) showcasing the utility of this function so I'm going to use that example here.
+One of the coolest functions in `tidyr` is the function `complete()`. Jarrett Byrnes has written up a [great blog piece](http://www.imachordata.com/you-complete-me/) showcasing the utility of this function so I'm going to use that example here.
We'll start with an example dataframe where the data recorder enters the Abundance of two species of kelp, *Saccharina* and *Agarum* in the years 1999, 2000 and 2004.
```{r, eval=F}