autosize: true author: Tristan Mahr, @tjmahr date: March 18, 2015 css: assets/custom.css
Madison R Users Group
Repository for this talk: https://github.com/tjmahr/MadR_Pipelines
We just covered pipelines. Now we'll use pipelines to work on tabular data.
library("magrittr")
library("dplyr")
library("nycflights13")
The examples for the main dplyr verbs were lifted from the introductory vignette.
http://www.rstudio.com/resources/cheatsheets/
Convert data to a tbl_df
so that it uses dplyr's nice print
method.
flights <- tbl_df(flights)
flights
Source: local data frame [336,776 x 16]
year month day dep_time dep_delay
1 2013 1 1 517 2
2 2013 1 1 533 4
3 2013 1 1 542 2
4 2013 1 1 544 -1
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
incremental: true
Use glimpse
to few some values in each column.
glimpse(flights)
Observations: 336776
Variables:
$ year (int) 2013, 2013, 2013, 2013...
$ month (int) 1, 1, 1, 1, 1, 1, 1, 1...
$ day (int) 1, 1, 1, 1, 1, 1, 1, 1...
$ dep_time (int) 517, 533, 542, 544, 55...
$ dep_delay (dbl) 2, 4, 2, -1, -6, -4, -...
$ arr_time (int) 830, 850, 923, 1004, 8...
$ arr_delay (dbl) 11, 20, 33, -18, -25, ...
$ carrier (chr) "UA", "UA", "AA", "B6"...
$ tailnum (chr) "N14228", "N24211", "N...
$ flight (int) 1545, 1714, 1141, 725,...
$ origin (chr) "EWR", "LGA", "JFK", "...
$ dest (chr) "IAH", "IAH", "MIA", "...
$ air_time (dbl) 227, 227, 160, 183, 11...
$ distance (dbl) 1400, 1416, 1089, 1576...
$ hour (dbl) 5, 5, 5, 5, 5, 5, 5, 5...
$ minute (dbl) 17, 33, 42, 44, 54, 54...
type: prompt incremental: true
Is dpylr's print method really that useful since it's just printing four rows of four columns on these slides?
It normally fills width of the console. I cheated to accommodate the width of these slides, by setting some hidden flags:
options(
width = 45,
dplyr.width = 44,
dplyr.print_min = 4,
dplyr.print_max = 4)
incremental: true
How many flights flew to Madison in 2013?
flights %>%
filter(dest == "MSN")
Source: local data frame [572 x 16]
year month day dep_time dep_delay
1 2013 1 1 1353 -4
2 2013 1 2 1422 25
3 2013 1 3 1415 24
4 2013 1 4 1345 -5
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
incremental: true
How many flights flew to Madison in first week of January?
# Comma separated conditions are combined with '&'
flights %>%
filter(dest == "MSN", month == 1, day <= 7)
Source: local data frame [6 x 16]
year month day dep_time dep_delay
1 2013 1 1 1353 -4
2 2013 1 2 1422 25
3 2013 1 3 1415 24
4 2013 1 4 1345 -5
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
type: prompt incremental: true
Commas in the filter statement are implicit
&
(and) operators. Is there anything similar for|
(or)?
Logical or statements are supported, but there's no shorthand.
flights %>%
filter(dest == "MSN" | dest == "ORD" | dest == "MDW")
For more complicated checks, I would try a set operation.
flights %>%
filter(is.element(dest, c("MSN", "ORD", "MDW")))
incremental: true
Sort by which airport they departed from in NYC, then year, month, day.
flights %>%
arrange(origin, year, month, day)
Source: local data frame [336,776 x 16]
year month day dep_time dep_delay
1 2013 1 1 517 2
2 2013 1 1 554 -4
3 2013 1 1 555 -5
4 2013 1 1 558 -2
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
incremental: true
Find longest delayed flights to Madison.
flights %>%
filter(dest == "MSN") %>%
arrange(desc(dep_delay))
Source: local data frame [572 x 16]
year month day dep_time dep_delay
1 2013 12 5 2000 340
2 2013 11 17 45 310
3 2013 3 8 1907 302
4 2013 9 12 1841 291
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
incremental: true
Select the columns you want.
flights %>%
select(origin, year, month, day)
Source: local data frame [336,776 x 4]
origin year month day
1 EWR 2013 1 1
2 LGA 2013 1 1
3 JFK 2013 1 1
4 JFK 2013 1 1
.. ... ... ... ...
incremental: true
select
has many helper functions. See ?select
.
flights %>%
select(origin, year:day, starts_with("dep"))
Source: local data frame [336,776 x 6]
origin year month day dep_time dep_delay
1 EWR 2013 1 1 517 2
2 LGA 2013 1 1 533 4
3 JFK 2013 1 1 542 2
4 JFK 2013 1 1 544 -1
.. ... ... ... ... ... ...
incremental: true
We can drop columns by "negating" the name. Since helpers give us column names, we can negate them too.
flights %>%
select(-dest, -starts_with("arr"),
-ends_with("time"))
Source: local data frame [336,776 x 11]
year month day dep_delay carrier tailnum
1 2013 1 1 2 UA N14228
2 2013 1 1 4 UA N24211
3 2013 1 1 2 AA N619AA
4 2013 1 1 -1 B6 N804JB
.. ... ... ... ... ... ...
Variables not shown: flight (int), origin
(chr), distance (dbl), hour (dbl), minute
(dbl)
- convert to a
tbl_df
- nice print method glimpse
- some of each columnfilter
- subsettingarrange
- sorting (desc
to reverse the sort)select
- picking (and omiting) columns
incremental: true
Rename columns with rename(NewName = OldName)
. To keep the order
correct, read/remember the renaming =
as "was".
flights %>%
rename(y = year, m = month, d = day)
Source: local data frame [336,776 x 16]
y m d dep_time dep_delay arr_time
1 2013 1 1 517 2 830
2 2013 1 1 533 4 850
3 2013 1 1 542 2 923
4 2013 1 1 544 -1 1004
.. ... . . ... ... ...
Variables not shown: arr_delay (dbl),
carrier (chr), tailnum (chr), flight
(int), origin (chr), dest (chr), air_time
(dbl), distance (dbl), hour (dbl), minute
(dbl)
incremental: true
How much departure delay did the flight make up for in the air?
flights %>%
mutate(
gain = arr_delay - dep_delay,
speed = (distance / air_time) * 60,
gain_per_hour = gain / (air_time / 60)) %>%
select(gain:gain_per_hour)
Source: local data frame [336,776 x 3]
gain speed gain_per_hour
1 9 370.0441 2.378855
2 16 374.2731 4.229075
3 31 408.3750 11.625000
4 -17 516.7213 -5.573770
.. ... ... ...
incremental: true
Let's compute the average delay per month of flights to Madison.
Normally--in aggregate
, by
or plyr's d*ply
functions--you
specify the grouping as an argument to the aggregation function.
aggregate(dep_delay ~ month, flights, mean,
subset = flights$dest == "MSN")
month dep_delay
1 1 18.07692
2 2 20.11111
3 3 41.87097
4 4 29.40741
5 5 21.54839
6 6 29.83333
7 7 11.13333
8 8 19.06667
9 9 15.97183
10 10 19.26190
11 11 15.96250
12 12 42.68831
incremental: true
In dplyr, grouping is its own action. It is done as its own step in
the pipeline. Here, we filter
to the flights to Madison and
group_by
month.
msn_by_month <- flights %>%
filter(dest == "MSN") %>%
group_by(month)
msn_by_month
Source: local data frame [572 x 16]
Groups: month
year month day dep_time dep_delay
1 2013 1 1 1353 -4
2 2013 1 2 1422 25
3 2013 1 3 1415 24
4 2013 1 4 1345 -5
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
incremental: true
Now we use summarise
to compute (several) aggregate values within
each group (month). summarise
returns one row per group.
msn_by_month %>%
summarise(
flights = n(),
avg_delay = mean(dep_delay, na.rm = TRUE),
n_planes = n_distinct(tailnum))
Source: local data frame [12 x 4]
month flights avg_delay n_planes
1 1 27 18.07692 23
2 2 47 20.11111 36
3 3 31 41.87097 29
4 4 27 29.40741 25
.. ... ... ... ...
incremental: true
tally
is a shortcut for counting number of items per group.
Number of flights from NYC by destination by month:
flights %>%
group_by(dest, month) %>%
tally
Source: local data frame [1,113 x 3]
Groups: dest
dest month n
1 ABQ 4 9
2 ABQ 5 31
3 ABQ 6 30
4 ABQ 7 31
.. ... ... ..
incremental: true
Remove the grouping structure with ungroup
.
msn_by_month %>% ungroup
Source: local data frame [572 x 16]
year month day dep_time dep_delay
1 2013 1 1 1353 -4
2 2013 1 2 1422 25
3 2013 1 3 1415 24
4 2013 1 4 1345 -5
.. ... ... ... ... ...
Variables not shown: arr_time (int),
arr_delay (dbl), carrier (chr), tailnum
(chr), flight (int), origin (chr), dest
(chr), air_time (dbl), distance (dbl),
hour (dbl), minute (dbl)
incremental: true
Each summarise
statement peels off one layer of grouping (from the
right of the list of groups).
# day gets peeled off
per_day <- flights %>%
group_by(dest, year, month, day) %>%
summarise(flights = n())
per_day
Source: local data frame [31,229 x 5]
Groups: dest, year, month
dest year month day flights
1 ABQ 2013 4 22 1
2 ABQ 2013 4 23 1
3 ABQ 2013 4 24 1
4 ABQ 2013 4 25 1
.. ... ... ... ... ...
========================================================= title: false incremental: true
Peel off month
grouping.
per_month <- per_day %>%
summarise(flights = sum(flights))
per_month
Source: local data frame [1,113 x 4]
Groups: dest, year
dest year month flights
1 ABQ 2013 4 9
2 ABQ 2013 5 31
3 ABQ 2013 6 30
4 ABQ 2013 7 31
.. ... ... ... ...
========================================================= title: false incremental: true
Peel off year
grouping.
per_year <- per_month %>%
summarise(flights = sum(flights))
per_year
Source: local data frame [105 x 3]
Groups: dest
dest year flights
1 ABQ 2013 254
2 ACK 2013 265
3 ALB 2013 439
4 ANC 2013 8
.. ... ... ...
type: prompt
Is there a way to keep a group after a
summarise
statement?
The grouping variable goes away because summarise
returns one row
for each value of the grouping variable. If you group_by
month and
summarise
, you have a row for each month. You can add the month
grouping back to the data-frame, but this is almost always not
necessary because you have only row per group at that point.
Relatedly, you can group by row with rowwise
.
incremental: true
Rank flight delays within each destination and month. (dplyr provides a bunch of ranking functions. See the cheatsheet.)
flights %>%
group_by(dest, month) %>%
mutate(timely = row_number(dep_delay),
late = row_number(desc(dep_delay))) %>%
select(dep_delay, timely:late)
Source: local data frame [336,776 x 5]
Groups: dest, month
dest month dep_delay timely late
1 IAH 1 2 347 195
2 IAH 1 4 380 162
3 MIA 1 2 662 294
4 BQN 1 -1 46 38
.. ... ... ... ... ...
incremental: true
This is also good for mean-centering variables within groups.
mean_center <- function(xs) {
xs - mean(xs, na.rm = TRUE)
}
flights %>%
group_by(dest, month) %>%
mutate(c_delay = mean_center(dep_delay)) %>%
select(dep_delay, c_delay)
Source: local data frame [336,776 x 4]
Groups: dest, month
dest month dep_delay c_delay
1 IAH 1 2 -5.294643
2 IAH 1 4 -3.294643
3 MIA 1 2 -5.360656
4 BQN 1 -1 -6.870968
.. ... ... ... ...
- select
- filter
- arrange
- glimpse
- rename
- mutate
- group_by, ungroup
- summarise
- assembly: bind_rows, bind_cols
- column-wise operations: mutate_each, summarise_each
- join tables together: left_join, right_join, inner_join, full_join
- filtering joins: semi_join, anti_join
- do: arbitrary code on each chunk
- different types of tabular data (databases, data.tables)
type: prompt
For fun, I demonstrated how I use pipelines when making tables.