# Appendix B: Data wrangling
In this lecture, we will take a look at how to wrangle data using the [dplyr](https://dplyr.tidyverse.org/) package. Again, getting our data into shape is something we'll need to do throughout the course, so it's worth spending some time getting a good sense for how this works. The nice thing about R is that (thanks to the `tidyverse`), both visualization and data wrangling are particularly powerful.
## Learning goals
- Review R basics (incl. variable modes, data types, operators, control flow, and functions).
- Learn how the pipe operator `%>%` works.
- See different ways for getting a sense of one's data.
- Master key data manipulation verbs from the `dplyr` package (incl. `filter()`, `arrange()`, `rename()`, `relocate()`, `select()`, `mutate()`) as well as the helper functions `across()` and `where()`.
## Load packages
Let's first load the packages that we need for this chapter.
```{r, message=FALSE}
library("knitr") # for rendering the RMarkdown file
library("skimr") # for visualizing data
library("visdat") # for visualizing data
library("DT") # for visualizing data
library("tidyverse") # for data wrangling
opts_chunk$set(comment = "",
fig.show = "hold")
```
## Some R basics
To test your knowledge of the R basics, I recommend taking the free interactive tutorial on datacamp: [Introduction to R](https://www.datacamp.com/courses/free-introduction-to-r). Here, I will just give a very quick overview of some of the basics.
### Modes
Variables in R can have different modes. Table \@ref(tab:variable-modes) shows the most common ones.
```{r variable-modes, echo=FALSE}
name = c("numeric", "character", "logical", "not available")
example = c("`1`, `3`, `48`",
"`'Steve'`, `'a'`, `'78'`",
"`TRUE`, `FALSE`","`NA`")
kable(x = tibble(name, example),
caption = "Most commonly used variable modes in R.",
align = c("r", "l"),
booktabs = TRUE)
```
For characters you can either use `"` or `'`. R has a number of functions to convert a variable from one mode to another. `NA` is used for missing values.
```{r}
tmp1 = "1" # we start with a character
str(tmp1)
tmp2 = as.numeric(tmp1) # turn it into a numeric
str(tmp2)
tmp3 = as.factor(tmp2) # turn that into a factor
str(tmp3)
tmp4 = as.character(tmp3) # and go full cycle by turning it back into a character
str(tmp4)
identical(tmp1, tmp4) # checks whether tmp1 and tmp4 are the same
```
The `str()` function displays the structure of an R object. Here, it shows us what mode the variable is.
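One conversion deserves extra care: calling `as.numeric()` directly on a factor returns the internal level codes rather than the labels. The sketch below (the variable `tmp.factor` is made up for illustration) shows the difference, and the common workaround of converting to a character first.

```{r}
tmp.factor = factor(c("10", "20", "10")) # a factor whose labels look like numbers
as.numeric(tmp.factor) # internal level codes: 1 2 1
as.numeric(as.character(tmp.factor)) # the labels as numbers: 10 20 10
```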
### Data types
R has a number of different data types. Table \@ref(tab:data-types) shows the ones you're most likely to come across (taken from [this source](https://www.statmethods.net/input/datatypes.html)):
```{r data-types, echo=FALSE}
name = c("vector", "factor", "matrix", "array", "data frame", "list")
description = c(
"list of values of the same variable mode",
"for categorical variables (nominal or ordinal)",
"2D data structure",
"same as matrix for higher dimensional data",
"similar to matrix but with column names",
"flexible type that can contain different other variable types"
)
kable(x = tibble(name, description),
align = c("r", "l"),
caption = "Most commonly used data types in R.",
booktabs = TRUE)
```
#### Vectors
We build vectors using the concatenate function `c()`, and we use `[]` to access one or more elements of a vector.
```{r}
numbers = c(1, 4, 5) # make a vector
numbers[2] # access the second element
numbers[1:2] # access the first two elements
numbers[c(1, 3)] # access the first and last element
```
In R (unlike in Python for example), 1 refers to the first element of a vector (or list).
#### Matrix
We build a matrix using the `matrix()` function, and we use `[]` to access its elements.
```{r}
matrix = matrix(data = c(1, 2, 3, 4, 5, 6),
nrow = 3,
ncol = 2)
matrix # the full matrix
matrix[1, 2] # element in row 1, column 2
matrix[1, ] # all elements in the first row
matrix[ , 1] # all elements in the first column
matrix[-1, ] # a matrix which excludes the first row
```
Note how we use an empty placeholder to indicate that we want to select all the values in a row or column, and `-` to indicate that we want to remove something.
#### Array
Arrays work the same way as matrices, but for data with more than two dimensions.
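Here is a minimal sketch of a three-dimensional array, built with the base R function `array()`. The name `tmp.array` and the chosen dimensions are only for illustration.

```{r}
tmp.array = array(data = 1:24, dim = c(2, 3, 4)) # 2 rows, 3 columns, 4 slices
dim(tmp.array) # the dimensions of the array
tmp.array[1, 2, 3] # element in row 1, column 2, slice 3
tmp.array[ , , 1] # the first slice, which is a 2 x 3 matrix
```

As with matrices, an empty placeholder selects everything along that dimension.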
#### Data frame
```{r}
df = tibble(participant_id = c(1, 2, 3),
participant_name = c("Leia", "Luke", "Darth")) # make the data frame
df # the complete data frame
df[1, 2] # a single element using numbers
df$participant_id # all participants
df[["participant_id"]] # same as before but using [[]] instead of $
df$participant_name[2] # name of the second participant
df[["participant_name"]][2] # same as above
```
We'll use data frames a lot. Data frames are like a matrix with column names. Data frames are also more general than matrices in that different columns can have different modes. For example, one column might be a character, another one numeric, and another one a factor.
Here we used the `tibble()` function to create the data frame. A `tibble` is almost the same as a data frame but it has better defaults for formatting output in the console (more information on tibbles is [here](http://r4ds.had.co.nz/tibbles.html)).
#### Lists
```{r}
l.mixed = list(number = 1,
character = "2",
factor = factor(3),
matrix = matrix(1:4, ncol = 2),
df = tibble(x = c(1, 2), y = c(3, 4)))
l.mixed
# three different ways of accessing a list
l.mixed$character
l.mixed[["character"]]
l.mixed[[2]]
```
Lists are a very flexible data format. You can put almost anything in a list.
### Operators
Table \@ref(tab:logical-operators) shows the comparison operators that result in logical outputs.
```{r logical-operators, echo=FALSE}
operators = c("`==`", "`!=`", "`>`, `<`", "`>=`, `<=`", "`&`, `|`, `!`", "`%in%`")
explanation = c("equal to", "not equal to", "greater/less than",
"greater/less than or equal", "logical operators: and, or, not",
"checks whether an element is in an object")
kable(tibble(symbol = operators, name = explanation),
caption = "Table of comparison operators that result in
boolean (TRUE/FALSE) outputs.",
booktabs = TRUE)
```
The `%in%` operator is very useful, and we can use it like so:
```{r data-10}
x = c(1, 2, 3)
2 %in% x
c(3, 4) %in% x
```
It's particularly useful for filtering data as we will see below.
### Control flow
#### if-then {#if-else}
```{r}
number = 3
if (number == 1) {
print("The number is 1.")
} else if (number == 2) {
print("The number is 2.")
} else {
print("The number is neither 1 nor 2.")
}
```
As a shorthand version, we can also use the `ifelse()` function like so:
```{r}
number = 3
ifelse(test = number == 1, yes = "correct", no = "false")
```
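Unlike `if`, `ifelse()` is vectorized: it checks the condition for every element of a vector and returns a vector of the same length. A small sketch (the vector `tmp.numbers` is made up for illustration):

```{r}
tmp.numbers = c(1, 2, 3)
tmp.labels = ifelse(test = tmp.numbers == 1, yes = "one", no = "not one")
tmp.labels # one label per element of tmp.numbers
```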
#### for loop
```{r}
sequence = 1:10
for(i in seq_along(sequence)){
print(i)
}
```
#### while loop
```{r}
number = 1
while(number <= 10){
print(number)
number = number + 1
}
```
### Functions
```{r}
fun.add_two_numbers = function(a, b){
x = a + b
return(str_c("The result is ", x))
}
fun.add_two_numbers(1, 2)
```
I've used the `str_c()` function here to concatenate the string with the number. (R converts the number `x` into a string for us.) Note, R functions can only return a single object. However, this object can be a list (which can contain anything).
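To return more than one value, we can wrap the values in a list. The following sketch uses a hypothetical function `fun.sum_and_product()` that is not used elsewhere in this chapter:

```{r}
fun.sum_and_product = function(a, b){
  # bundle both results into a single list object
  return(list(sum = a + b, product = a * b))
}
l.result = fun.sum_and_product(2, 3)
l.result$sum # access the first result
l.result$product # access the second result
```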
#### Some often used functions
```{r, echo=FALSE}
name = c(
"`length()`",
"`dim()`",
"`rm()`",
"`seq()`",
"`rep()`",
"`max()`",
"`min()`",
"`which.max()`",
"`which.min()`",
"`mean()`",
"`median()`",
"`sum()`",
"`var()`",
"`sd()`"
)
description = c(
"length of an object",
"dimensions of an object (e.g. number of rows and columns)",
"remove an object",
"generate a sequence of numbers",
"repeat something n times",
"maximum",
"minimum",
"index of the maximum",
"index of the minimum",
"mean",
"median",
"sum",
"variance",
"standard deviation"
)
kable(x = tibble(name, description),
caption = "Some frequently used functions.",
align = c("r", "l"),
booktabs = TRUE)
```
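Here is a quick sketch of a few of these functions in action (the vector `tmp.values` is made up for illustration):

```{r}
tmp.values = c(3, 9, 4)
seq(from = 2, to = 10, by = 2) # 2 4 6 8 10
rep(c("a", "b"), times = 2) # "a" "b" "a" "b"
which.max(tmp.values) # position of the largest value: 2
which.min(tmp.values) # position of the smallest value: 1
```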
### The pipe operator `%>%`
```{r, out.width = "80%", echo=FALSE, fig.cap="Inspiration for the `magrittr` package name."}
include_graphics("figures/pipe.jpg")
```
```{r, out.width = '40%', echo=FALSE, fig.cap="The `magrittr` package logo."}
include_graphics("figures/magrittr.png")
```
The pipe operator `%>%` is a special operator introduced in the `magrittr` package. It is used heavily in the tidyverse. The basic idea is simple: this operator allows us to "pipe" several functions into one long chain that matches the order in which we want to do stuff.
Let's consider the following example of making and eating a cake (thanks to https://twitter.com/dmi3k/status/1191824875842879489?s=09). This would be the traditional way of writing some code:
```{r, eval=F}
eat(
slice(
bake(
put(
pour(
mix(ingredients),
into = baking_form),
into = oven),
time = 30),
pieces = 6),
1)
```
To see what's going on here, we need to read the code inside out. That is, we have to start in the innermost bracket, and then work our way outward. However, there is a natural causal ordering to these steps and wouldn't it be nice if we could just write code in that order? Thanks to the pipe operator `%>%` we can! Here is the same example using the pipe:
```{r, eval=F}
ingredients %>%
mix %>%
pour(into = baking_form) %>%
put(into = oven) %>%
bake(time = 30) %>%
slice(pieces = 6) %>%
eat(1)
```
This code is much easier to read and write, since it represents the order in which we want to do things!
Abstractly, the pipe operator does the following:
> `f(x)` can be rewritten as `x %>% f()`
For example, in standard R, we would write:
```{r}
x = 1:3
# standard R
sum(x)
```
With the pipe, we can rewrite this as:
```{r}
x = 1:3
# with the pipe
x %>% sum()
```
This doesn't seem super useful yet, but just hold on a little longer.
> `f(x, y)` can be rewritten as `x %>% f(y)`
So, we could rewrite the following standard R code ...
```{r}
# rounding pi to 6 digits, standard R
round(pi, digits = 6)
```
... by using the pipe:
```{r}
# rounding pi to 6 digits, the pipe way
pi %>% round(digits = 6)
```
Here is another example:
```{r}
a = 3
b = 4
sum(a, b) # standard way
a %>% sum(b) # the pipe way
```
The pipe operator inserts the result of the previous computation as a first element into the next computation. So, `a %>% sum(b)` is equivalent to `sum(a, b)`. We can also specify to insert the result at a different position via the `.` operator. For example:
```{r}
a = 1
b = 10
b %>% seq(from = a, to = .)
```
Here, I used the `.` operator to specify that I would like to insert the value of `b` where I've put the `.` in the `seq()` function.
> `f(x, y)` can be rewritten as `y %>% f(x, .)`
Still not too thrilled about the pipe? We can keep going though (and I'm sure you'll be convinced eventually).
> `h(g(f(x)))` can be rewritten as `x %>% f() %>% g() %>% h()`
For example, consider that we want to calculate the root mean squared error (RMSE) between prediction and data.
Here is how the RMSE is defined:
$$
\text{RMSE} = \sqrt{\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{n}}
$$
where $\hat{y}_i$ denotes the prediction, and $y_i$ the actually observed value.
In base R, we would do the following.
```{r}
data = c(1, 3, 4, 2, 5)
prediction = c(1, 2, 2, 1, 4)
# calculate root mean squared error
rmse = sqrt(mean((prediction-data)^2))
print(rmse)
```
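Worked through by hand for the two vectors above, the differences between prediction and data are $0, -1, -2, -1, -1$, so the calculation is:

$$
\text{RMSE} = \sqrt{\frac{0^2 + (-1)^2 + (-2)^2 + (-1)^2 + (-1)^2}{5}} = \sqrt{\frac{7}{5}} \approx 1.18
$$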
Using the pipe operator makes the operation more intuitive:
```{r}
data = c(1, 3, 4, 2, 5)
prediction = c(1, 2, 2, 1, 4)
# calculate root mean squared error the pipe way
rmse = (prediction-data)^2 %>%
mean() %>%
sqrt() %>%
print()
```
First, we calculate the squared error, then we take the mean, then the square root, and then print the result.
The pipe operator `%>%` is similar to the `+` used in `ggplot2`. It allows us to take step-by-step actions in a way that fits the causal ordering of how we want to do things.
> __Tip__: The keyboard shortcut for the pipe operator is:
> `cmd/ctrl + shift + m`
> __Definitely learn this one__ -- we'll use the pipe a lot!!
> __Tip__: Code is generally easier to read when the pipe `%>%` is at the end of a line (just like the `+` in `ggplot2`).
A key advantage of using the pipe is that you don't have to save intermediate computations as new variables and this helps to keep your environment nice and clean!
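As a small sketch of this point, compare the two versions below. The names `tmp.values`, `tmp.roots`, and `tmp.total` are made up for illustration, and I've assumed that the tidyverse is loaded so that `%>%` is available.

```{r}
library("tidyverse") # provides the %>% operator
tmp.values = c(4, 9, 16)
# with intermediate variables
tmp.roots = sqrt(tmp.values)
tmp.total = sum(tmp.roots)
tmp.total
# with the pipe, no intermediate variables are needed
tmp.values %>%
  sqrt() %>%
  sum()
```

Both versions give the same result, but the first one leaves `tmp.roots` and `tmp.total` behind in the environment.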
#### Practice 1
Let's practice the pipe operator.
```{r}
# here are some numbers
x = seq(from = 1, to = 5, by = 1)
# taking the log the standard way
log(x)
# now take the log the pipe way (write your code underneath)
```
```{r}
# some more numbers
x = seq(from = 10, to = 5, by = -1)
# the standard way
mean(round(sqrt(x), digits = 2))
# the pipe way (write your code underneath)
```
## A quick note on naming things
Personally, I like to name things in a (pretty) consistent way so that I have no trouble finding stuff even when I open up a project that I haven't worked on for a while. I try to use the following naming conventions:
```{r, echo=FALSE}
name = c("df.thing",
"l.thing",
"fun.thing",
"tmp.thing")
use = c("for data frames",
"for lists",
"for functions",
"for temporary variables")
kable(x = tibble(name, use),
caption = "Some naming conventions I adopt to make my life easier.",
align = c("r", "l"),
booktabs = TRUE)
```
## Looking at data
The package `dplyr`, which we loaded as part of the tidyverse, includes a data set with information about Star Wars characters. Let's store this as `df.starwars`.
```{r}
df.starwars = starwars
```
> Note: Unlike in other languages (such as Python or Matlab), a `.` in a variable name has no special meaning and can just be used as part of the name. I've used `df` here to indicate for myself that this variable is a data frame.
Before visualizing the data, it's often useful to take a quick direct look at the data.
There are several ways of taking a look at data in R. Personally, I like to look at the data within RStudio's data viewer. To do so, you can:
- click on the `df.starwars` variable in the "Environment" tab
- type `View(df.starwars)` in the console
- move your mouse over (or select) the variable in the editor (or console) and hit `F2`
I like the `F2` route the best as it's fast and flexible.
Sometimes it's also helpful to look at data in the console instead of the data viewer. Particularly when the data is very large, the data viewer can be sluggish.
Here are some useful functions:
### `head()`
Without any extra arguments specified, `head()` shows the top six rows of the data.
```{r}
head(df.starwars)
```
### `glimpse()`
`glimpse()` is helpful when the data frame has many columns. The data is shown in a transposed way with columns as rows.
```{r}
glimpse(df.starwars)
```
### `distinct()`
`distinct()` shows all the distinct values for a character or factor column.
```{r}
df.starwars %>%
distinct(species)
```
### `count()`
`count()` shows a count of all the different distinct values in a column.
```{r}
df.starwars %>%
count(eye_color)
```
It's possible to do grouped counts by combining several variables.
```{r}
df.starwars %>%
count(eye_color, gender) %>%
head(n = 10)
```
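`count()` also has a `sort` argument that puts the most frequent values first. A minimal sketch, assuming the tidyverse is loaded (the name `df.eye_counts` is made up for illustration):

```{r}
library("tidyverse")
df.starwars = starwars # same data set as above
df.eye_counts = df.starwars %>%
  count(eye_color, sort = TRUE) # most frequent eye colors first
df.eye_counts
```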
### `datatable()`
For R Markdown files specifically, we can use the `datatable()` function from the `DT` package to get an interactive table widget.
```{r}
df.starwars %>%
DT::datatable()
```
### Other tools for taking a quick look at data
#### `vis_dat()`
The `vis_dat()` function from the `visdat` package gives a visual summary that makes it easy to see the variable types and whether there are missing values in the data.
```{r}
visdat::vis_dat(df.starwars)
```
```{block, type='info'}
When R loads packages, functions from earlier packages are masked by functions of the same name from packages that are loaded later. This means that the order in which packages are loaded matters. To make sure that a function from the correct package is used, you can use the `package_name::function_name()` construction. This way, the `function_name()` from the `package_name` is used, rather than the same function from a different package.
This is why, in general, I recommend loading the tidyverse package last (since it contains a large number of functions that we use a lot).
```
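A well-known example of such a name clash is `filter()`, which exists both in the `stats` package (which R loads by default) and in `dplyr`. The sketch below uses `::` to make the choice explicit (the name `df.droids` is made up for illustration):

```{r}
library("tidyverse")
# both stats and dplyr have a filter() function; :: makes the choice explicit
df.droids = dplyr::filter(starwars, species == "Droid")
df.droids$name
```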
#### `skim()`
The `skim()` function from the `skimr` package provides a nice overview of the data, separated by variable types.
```{r}
# install.packages("skimr")
skimr::skim(df.starwars)
```
#### `dfSummary()`
The `summarytools` package is another great package for taking a look at the data. It renders a nice html output for the data frame including a lot of helpful information. You can find out more about this package [here](https://cran.r-project.org/web/packages/summarytools/index.html).
```{r, eval=FALSE}
df.starwars %>%
select(where(~ !is.list(.))) %>% # this removes all list columns
summarytools::dfSummary() %>%
summarytools::view()
```
> Note: The summarytools::view() function will not show up here in the html. It generates a summary of the data that is displayed in the Viewer in RStudio.
Once we've taken a look at the data, the next step would be to visualize relationships between variables of interest.
## Wrangling data
We use the functions in the package `dplyr` to manipulate our data.
### `filter()`
`filter()` lets us apply logical (and other) operators (see Table \@ref(tab:logical-operators)) to subset the data. Here, I've kept only the masculine characters.
```{r}
df.starwars %>%
filter(gender == "masculine")
```
We can combine multiple conditions in the same call. Here, I've kept only the masculine characters whose height is greater than the median height (i.e. they are in the top half of the height distribution) and whose mass is not `NA`.
```{r}
df.starwars %>%
filter(gender == "masculine",
height > median(height, na.rm = T),
!is.na(mass))
```
Many functions like `mean()`, `median()`, `var()`, `sd()`, `sum()` have the argument `na.rm` which is set to `FALSE` by default. I set the argument to `TRUE` here (or `T` for short), which means that the `NA` values are ignored, and the `median()` is calculated based on the remaining values.
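The following sketch shows the effect of `na.rm` on a small made-up vector (`tmp.heights` is only for illustration):

```{r}
tmp.heights = c(172, NA, 96)
mean(tmp.heights) # NA, because one value is missing
mean(tmp.heights, na.rm = TRUE) # 134, the mean of the two remaining values
```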
You can use `,` and `&` interchangeably in `filter()`. Make sure to use parentheses when combining several logical operators to indicate which logical operation should be performed first:
```{r}
df.starwars %>%
filter((skin_color %in% c("dark", "pale") | sex == "hermaphroditic") & height > 170)
```
This returns the Star Wars characters that have either a `"dark"` or a `"pale"` skin tone, or whose sex is `"hermaphroditic"`, and whose height is greater than `170` cm. The `%in%` operator is useful when there are multiple options. Instead of `skin_color %in% c("dark", "pale")`, I could have also written `skin_color == "dark" | skin_color == "pale"` but this gets cumbersome as the number of options increases.
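As a quick check that the two ways of writing the condition are interchangeable, here is a sketch that compares them (the names `df.in` and `df.or` are made up for illustration). It also uses `,` in one call and `&` in the other:

```{r}
library("tidyverse")
df.in = starwars %>%
  filter(skin_color %in% c("dark", "pale"), height > 170)
df.or = starwars %>%
  filter((skin_color == "dark" | skin_color == "pale") & height > 170)
identical(df.in$name, df.or$name) # both approaches select the same characters
```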
### `arrange()`
`arrange()` allows us to sort the values in a data frame by one or more column entries.
```{r}
df.starwars %>%
arrange(hair_color, desc(height))
```
Here, I've sorted the data frame first by `hair_color`, and then by `height`. I've used the `desc()` function to sort `height` in descending order. Bail Prestor Organa is the tallest character with black hair in the data set.
### `rename()`
`rename()` renames column names.
```{r}
df.starwars %>%
rename(person = name,
mass_kg = mass)
```
The new variable name goes on the left-hand side of the `=` sign, and the old name on the right-hand side.
To rename all variables at the same time use `rename_with()`:
```{r}
df.starwars %>%
rename_with(.fn = ~ toupper(.))
```
Notice that I used the `~` here in the function call. I will explain what this does shortly.
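`rename_with()` also has a `.cols` argument that restricts the renaming to a subset of the columns. Here is a minimal sketch that only changes the columns whose names end in `_color` (the name `df.renamed` is made up for illustration):

```{r}
library("tidyverse")
df.renamed = starwars %>%
  rename_with(.fn = toupper, .cols = ends_with("_color")) # only rename the color columns
names(df.renamed)
```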
### `relocate()`
`relocate()` moves columns. For example, the following piece of code moves the `species` column to the front of the data frame:
```{r}
df.starwars %>%
relocate(species)
```
We could also move the `species` column after the name column like so:
```{r}
df.starwars %>%
relocate(species, .after = name)
```
### `select()`
`select()` allows us to select a subset of the columns in the data frame.
```{r}
df.starwars %>%
select(name, height, mass)
```
We can select multiple columns using the `(from:to)` syntax:
```{r}
df.starwars %>%
select(name:birth_year) # from name to birth_year
```
Or use a variable for column selection:
```{r}
columns = c("name", "height", "species")
df.starwars %>%
select(all_of(columns)) # useful when using a variable for column selection
```
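If some of the names in the variable might not exist in the data frame, the `any_of()` helper selects the ones that do and skips the rest. A small sketch, where `tmp.columns` deliberately contains a made-up column name:

```{r}
library("tidyverse")
tmp.columns = c("name", "height", "not_a_column") # the last name does not exist in the data
df.selected = starwars %>%
  select(any_of(tmp.columns)) # silently skips the missing column
names(df.selected)
```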
We can also _deselect_ (multiple) columns:
```{r}
df.starwars %>%
select(-name, -(birth_year:vehicles))
```
And select columns by partially matching the column name:
```{r}
df.starwars %>%
select(contains("_")) # every column that contains the character "_"
```
```{r}
df.starwars %>%
select(starts_with("h")) # every column that starts with an "h"
```
We can rename some of the columns using `select()` like so:
```{r}
df.starwars %>%
select(person = name, height, mass_kg = mass)
```
#### `where()`
`where()` is a useful helper function that comes in handy, for example, when we want to select columns based on their data type.
```{r}
df.starwars %>%
select(where(fn = is.numeric)) # just select numeric columns
```
The following selects all columns that are not numeric:
```{r}
df.starwars %>%
select(where(fn = ~ !is.numeric(.))) # selects all columns that are not numeric
```
Note that I used `~` here to indicate that I'm creating an anonymous function that checks whether a column is not numeric. A one-sided formula (expression beginning with `~`) is interpreted as `function(x)`, and wherever `x` would go in the function is represented by `.`.
So, I could write the same code like so:
```{r}
df.starwars %>%
select(where(function(x) !is.numeric(x))) # selects all columns that are not numeric
```
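To see this translation at work outside of `select()`, we can convert a one-sided formula into a function ourselves. The sketch below uses `as_function()` from the `rlang` package (which is installed alongside the tidyverse) purely for illustration. The name `fun.not_numeric` is made up.

```{r}
fun.not_numeric = rlang::as_function(~ !is.numeric(.)) # formula turned into a function
fun.not_numeric(c(1, 2, 3)) # FALSE, since the input is numeric
fun.not_numeric("a") # TRUE, since the input is a character
```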
For more details, take a look at the help file for `select()`, and at [this great tutorial](https://suzan.rbind.io/2018/01/dplyr-tutorial-1/) in which I learned about some of the more advanced ways of using `select()`.
### Practice 2
Create a data frame that:
- only has the species `Human` and `Droid`
- with the following data columns (in this order): name, species, birth_year, homeworld
- is arranged according to birth year (with the lowest entry at the top of the data frame)
- and has the `name` column renamed to `person`
```{r}
# write your code here
```
### `mutate()`
`mutate()` is used to change existing columns or make new ones.
```{r}
df.starwars %>%
mutate(height = height / 100, # to get height in meters
bmi = mass / (height^2)) %>% # bmi = kg / (m^2)
select(name, height, mass, bmi)
```
Here, I've calculated the bmi for the different starwars characters. I first mutated the height variable by going from cm to m, and then created the new column "bmi".
A useful helper function for `mutate()` is `ifelse()` which is a shorthand for the if-else control flow (Section \@ref(if-else)). Here is an example:
```{r}
df.starwars %>%
mutate(height_categorical = ifelse(height > median(height, na.rm = T),
"tall",
"short")) %>%
select(name, contains("height"))
```
`ifelse()` works in the following way: we first specify the condition, then what should be returned if the condition is true, and finally what should be returned otherwise. The more verbose version of the statement above would be: `ifelse(test = height > median(height, na.rm = T), yes = "tall", no = "short")`
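When there are more than two categories, nesting several `ifelse()` calls gets hard to read. One option is the `case_when()` function from `dplyr`, which checks the conditions in order and uses the first one that is true. The sketch below is only an illustration: the cutoffs of 170 and 200 cm are arbitrary choices, and `df.height` is a made-up name.

```{r}
library("tidyverse")
df.height = starwars %>%
  mutate(height_categorical = case_when(is.na(height) ~ NA_character_, # keep missing values missing
                                        height > 200 ~ "very tall",
                                        height > 170 ~ "tall",
                                        TRUE ~ "short")) %>% # everything else
  select(name, contains("height"))
df.height
```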
In previous versions of `dplyr` (the package we use for data wrangling), there were a variety of additional mutate functions such as `mutate_at()`, `mutate_if()`, and `mutate_all()`. Since `dplyr` 1.0.0, these additional functions have been superseded by the flexible `across()` helper function.
#### `across()`
`across()` allows us to use the syntax that we've learned for `select()` to select particular variables and apply a function to each of the selected variables.
For example, let's imagine that we want to z-score a number of variables in our data frame. We can do this like so:
```{r}
df.starwars %>%
mutate(across(.cols = c(height, mass, birth_year),
.fns = scale))
```
In the `.cols = ` argument of `across()`, I've specified what variables to mutate. In the `.fns = ` argument, I've specified that I want to use the function `scale`. Note that I wrote the function without `()`. The `.fns` argument accepts these possible values:
- the function itself, e.g. `mean`
- a call to the function with `.` as a dummy argument, `~ mean(.)` (note the `~` before the function call)
- a list of functions `list(mean = mean, median = ~ median(.))` (where I've mixed both of the other ways)
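
The three options can be sketched on a small made-up data frame (`df.example` and the other `df.` names below are only for illustration):

```{r}
library("tidyverse")
df.example = tibble(a = c(1, 4, 9),
                    b = c(16, 25, 36))
# the function itself
df.function = df.example %>%
  mutate(across(.cols = everything(), .fns = sqrt))
# a call to the function with . as a dummy argument
df.formula = df.example %>%
  mutate(across(.cols = everything(), .fns = ~ sqrt(.)))
# a named list of functions, which adds one new column per column and function
df.list = df.example %>%
  mutate(across(.cols = everything(),
                .fns = list(root = sqrt, double = ~ . * 2)))
names(df.list)
```

The first two versions give the same result. With a named list, the new columns are named after both the original column and the function, for example `a_root`.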
We can also use names to create new columns:
```{r}
df.starwars %>%
mutate(across(.cols = c(height, mass, birth_year),
.fns = scale,
.names = "{.col}_z")) %>%
select(name, contains("height"), contains("mass"), contains("birth_year"))
```
I've specified how I'd like the new variables to be called by using the `.names = ` argument of `across()`. `{.col}` stands for the name of the original column, and here I've just added `_z` to each column name for the scaled columns.
We can also apply several functions at the same time.
```{r}
df.starwars %>%
mutate(across(.cols = c(height, mass, birth_year),
.fns = list(z = scale,
centered = ~ scale(., scale = FALSE)))) %>%
select(name, contains("height"), contains("mass"), contains("birth_year"))
```
Here, I've created z-scored and centered (i.e. only subtracted the mean but didn't divide by the standard deviation) versions of the `height`, `mass`, and `birth_year` columns in one go.
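One detail to be aware of is that `scale()` returns a one-column matrix rather than a plain vector. If a plain numeric column is needed, one option is to wrap the call in `as.numeric()`. A small sketch on a made-up vector:

```{r}
tmp.z = scale(c(1, 2, 3)) # mean is 2, standard deviation is 1
is.matrix(tmp.z) # TRUE: the result is a matrix with one column
as.numeric(tmp.z) # the z-scores as a plain vector: -1 0 1
```

Inside `across()`, this would correspond to something like `.fns = ~ as.numeric(scale(.))`.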
You can use the `everything()` helper function if you want to apply a function to all of the columns in your data frame.
```{r}
df.starwars %>%
select(height, mass) %>%
mutate(across(.cols = everything(),
.fns = as.character)) # transform all columns to characters
```
Here, I've selected some columns first, and then changed the mode to character in each of them.
Sometimes, you want to apply a function only to those columns that have a particular data type. This is where `where()` comes in handy!
For example, the following code changes all the numeric columns to character columns:
```{r}
df.starwars %>%
mutate(across(.cols = where(~ is.numeric(.)),
.fns = ~ as.character(.)))
```
Or we could round all the numeric columns to one decimal place:
```{r}
df.starwars %>%
mutate(across(.cols = where(~ is.numeric(.)),
.fns = ~ round(., digits = 1)))
```
### Practice 3
Compute the body mass index for `masculine` characters who are `Human`.
- select only the columns you need
- keep only the rows you need
- make the new variable with the body mass index
- arrange the data frame starting with the highest body mass index
```{r}
# write your code here
```
### Summarizing data
OK, let's load the `starwars` data set again:
```{r}
df.starwars = starwars
```
A particularly powerful way of interacting with data is by grouping and summarizing it. `summarize()` returns a single value for each summary that we ask for:
```{r}
df.starwars %>%
summarize(height_mean = mean(height, na.rm = T),
height_max = max(height, na.rm = T),
n = n())
```
Here, I computed the mean height, the maximum height, and the total number of observations (using the function `n()`).
Let's say we wanted to get a quick sense for how tall starwars characters from different species are. To do that, we combine grouping with summarizing:
```{r}
df.starwars %>%
group_by(species) %>%
summarize(height_mean = mean(height, na.rm = T))
```
I've first used `group_by()` to group our data frame by the different species, and then used `summarize()` to calculate the mean height of each species.
It would also be useful to know how many observations there are in each group.
```{r}
df.starwars %>%
group_by(species) %>%
summarize(height_mean = mean(height, na.rm = T),
group_size = n()) %>%
arrange(desc(group_size))
```
Here, I've used the `n()` function to get the number of observations in each group, and then I've arranged the data frame according to group size in descending order.
Note that `n()` always yields the number of observations in each group. If we don't group the data, then we get the overall number of observations in our data frame (i.e. the number of rows).
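Grouping by a variable and then counting with `n()` gives the same numbers as the `count()` function that we used earlier. Here is a short sketch of that equivalence (the names `df.grouped` and `df.counted` are made up for illustration):

```{r}
library("tidyverse")
df.grouped = starwars %>%
  group_by(species) %>%
  summarize(n = n())
df.counted = starwars %>%
  count(species)
identical(df.grouped$n, df.counted$n) # both give the same counts
```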
So, Humans are the largest group in our data frame, followed by Droids (who are considerably shorter) and Gungans (who would make for good basketball players).
Sometimes `group_by()` is also useful without summarizing the data. For example, we often want to z-score (i.e. normalize) data on the level of individual participants. To do so, we first group the data on the level of participants, and then use `mutate()` to scale the data. Here is an example:
```{r}
# first let's generate some random data
set.seed(1) # to make this reproducible
df.summarize = tibble(participant = rep(1:3, each = 5),
judgment = sample(0:100, size = 15, replace = TRUE)) %>%
print()
```
```{r}
df.summarize %>%
group_by(participant) %>% # group by participants
mutate(judgment_zscored = scale(judgment)) %>% # z-score data of individual participants
ungroup() %>% # ungroup the data frame
head(n = 10) # print the top 10 rows
```
First, I've generated some random data using the repeat function `rep()` for making a `participant` column, and the `sample()` function to randomly choose values from a range between 0 and 100 with replacement. (We will learn more about these functions later when we look into how to simulate data.) I've then grouped the data by participant, and used the scale function to z-score the data.
> __TIP__: Don't forget to `ungroup()` your data frame. Otherwise, any subsequent operations are applied per group.
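
To see the difference that `ungroup()` makes, here is a sketch on a small made-up data frame. The first `mutate()` runs while the data is grouped, the second one after ungrouping:

```{r}
library("tidyverse")
df.example = tibble(group = c("a", "a", "b"),
                    value = c(1, 3, 10))
df.means = df.example %>%
  group_by(group) %>%
  mutate(group_mean = mean(value)) %>% # computed per group: 2, 2, 10
  ungroup() %>%
  mutate(overall_mean = mean(value)) # computed over all rows: 14 / 3
df.means
```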
Sometimes, I want to run operations on each row, rather than per column. For example, let's say that I wanted each character's average combined height and mass.
Let's see first what doesn't work:
```{r}
df.starwars %>%
mutate(mean_height_mass = mean(c(height, mass), na.rm = T)) %>%
select(name, height, mass, mean_height_mass)
```
Note that all the values are the same. The value shown here is just the mean of all the values in `height` and `mass`.
```{r}
df.starwars %>%
select(height, mass) %>%
unlist() %>% # turns the data frame into a vector
mean(na.rm = T)
```
To get the mean by row, we can either spell out the arithmetic
```{r}
df.starwars %>%
mutate(mean_height_mass = (height + mass) / 2) %>% # here, I've replaced the mean() function
select(name, height, mass, mean_height_mass)
```
or use the `rowwise()` helper function which is like `group_by()` but treats each row like a group:
```{r}
df.starwars %>%
rowwise() %>% # now, each row is treated like a separate group
mutate(mean_height_mass = mean(c(height, mass), na.rm = T)) %>%
ungroup() %>%
select(name, height, mass, mean_height_mass)
```
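Note that the two approaches treat missing values differently: `(height + mass) / 2` is `NA` whenever one of the two values is missing, whereas `mean(..., na.rm = T)` uses whatever is available. A vectorized alternative to `rowwise()` is the base R function `rowMeans()`. The sketch below illustrates both points on a small made-up data frame:

```{r}
library("tidyverse")
df.example = tibble(height = c(172, 96, NA),
                    mass = c(77, 32, 50))
df.rowmeans = df.example %>%
  mutate(mean_arithmetic = (height + mass) / 2, # NA whenever one of the values is missing
         mean_height_mass = rowMeans(cbind(height, mass), na.rm = TRUE)) # ignores missing values
df.rowmeans
```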
#### Practice 4
Find out what the average `height` and `mass` (as well as their standard deviations) are for different `species` on different `homeworld`s. Why is the standard deviation `NA` for many groups?
```{r}
# write your code here
```
Who is the tallest member of each species? What eye color do they have? The `top_n()` function (or `slice_max()`, which has superseded it in newer versions of `dplyr`) or the `row_number()` function (in combination with `filter()`) will be useful here.
```{r}
# write your code here
```
### Reshaping data
We want our data frames to be tidy. What's tidy?
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
For more information on tidy data frames see the [Tidy data](http://r4ds.had.co.nz/tidy-data.html) chapter in Hadley Wickham's R for Data Science book.
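As a minimal sketch of what this looks like in practice, consider a made-up data frame in which the scores from two time points are spread over two columns. Here, `pre` and `post` are really values of a single variable (time), so the data frame is not tidy. One way of tidying it is the `pivot_longer()` function from the `tidyr` package (part of the tidyverse). All names below are hypothetical.

```{r}
library("tidyverse")
df.wide = tibble(participant = c(1, 2),
                 pre = c(10, 12),
                 post = c(14, 15))
df.long = df.wide %>%
  pivot_longer(cols = c(pre, post), # the columns to gather
               names_to = "time", # new column that holds the old column names
               values_to = "score") # new column that holds the values
df.long
```

In the long version, each row is one observation (one participant at one time point), and each variable has its own column.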