---
title: "Psych 252: Statistical Methods for Behavioral and Social Sciences"
author: "Tobias Gerstenberg"
date: "`r Sys.Date()`"
book_filename: "psych710"
language:
  ui:
    chapter_name: "Chapter "
delete_merged_file: true
output_dir: "docs"
site: bookdown::bookdown_site
documentclass: book
bibliography: [book.bib, packages.bib]
biblio-style: apalike
link-citations: yes
github-repo: SocialInteractionLab/psych710-notes
description: "Course notes for Psych 710."
---
# Preface {-}
This book contains the course notes for [Psych 252](https://psych252.github.io/). The book is not intended to be self-explanatory and instead should be used in combination with the course lectures posted [here](https://github.com/psych252/psych252slides).
If you have any questions about the notes, please feel free to contact me at: [email protected] or post an issue on the book's [github repository](https://github.com/psych252/psych252book).
## Course description {-}
This course offers an introduction to advanced topics in statistics with the focus of understanding data in the behavioral and social sciences. It is a practical course in which learning statistical concepts and building models in R go hand in hand. The course is organized into three parts: In the first part, we will learn how to visualize, wrangle, and simulate data in R. In the second part, we will cover topics in frequentist statistics (such as multiple regression, logistic regression, and mixed effects models) using the general linear model as an organizing framework. We will learn how to compare models using simulation methods such as bootstrapping and cross-validation. In the third part, we will focus on Bayesian data analysis as an alternative framework for answering statistical questions.
## Course homepage {-}
https://psych252.github.io/
## License and citation {-}
This book is licensed under the [Creative Commons Zero v1.0 Universal license](https://github.com/psych252/psych252book/blob/master/LICENSE). If you find these materials helpful for your work, I'd appreciate you citing the book:
```
@book{gerstenberg2022methods,
title = {Statistical methods for the behavioral and social sciences},
author = {Tobias Gerstenberg},
year = {2022},
url = {https://psych252.github.io/psych252book/}
}
```
```{r index-01, include=FALSE}
# automatically create a bib database for R packages
knitr::write_bib(c(
  .packages(), 'bookdown', 'knitr', 'rmarkdown'
), 'packages.bib')
knitr::opts_chunk$set(
  comment = "",
  results = "hold",
  fig.show = "hold")
library("ggplot2")
# set plotting theme
theme_set(theme_classic() + #set the theme
            theme(text = element_text(size = 20))) #set the default text size
# export figures as pdf in latex
options(knitr.graphics.auto_pdf = TRUE)
```
<!--chapter:end:index.Rmd-->
# Introduction
## Thanks
Various people have helped in the process of putting together these materials (either knowingly, or unknowingly). Big thanks go to:
- [Alexandra Chouldechova](https://www.andrew.cmu.edu/user/achoulde/)
- [Allison Horst](https://www.allisonhorst.com/)
- [Andrew Heiss](https://www.andrewheiss.com/)
- [Ben Baumer](https://www.smith.edu/academics/faculty/ben-baumer)
- [Benoit Monin](https://www.gsb.stanford.edu/faculty-research/faculty/benoit-monin)
- [Bodo Winter](https://bodowinter.com/)
- [David Lagnado](https://www.ucl.ac.uk/pals/people/david-lagnado)
- [Ewart Thomas](https://profiles.stanford.edu/ewart-thomas)
- [Henrik Singmann](http://singmann.org/)
- [Julian Jara-Ettinger](https://psychology.yale.edu/people/julian-jara-ettinger)
- [Justin Gardner](https://profiles.stanford.edu/justin-gardner)
- [Kevin Smith](http://www.mit.edu/~k2smith/)
- [Lisa DeBruine](https://debruine.github.io/)
- [Maarten Speekenbrink](https://www.ucl.ac.uk/pals/people/maarten-speekenbrink)
- [Matthew Kay](https://www.mjskay.com/)
- [Matthew Salganik](http://www.princeton.edu/~mjs3/)
- [Michael Franke](https://michael-franke.github.io/heimseite/)
- [Mika Braginsky](https://mikabr.io/)
- [Mike Frank](https://web.stanford.edu/~mcfrank/)
- [Mine Çetinkaya-Rundel](https://mine-cr.com/)
- [Nick C. Huntington-Klein](https://www.nickchk.com/)
- [Nilam Ram](https://profiles.stanford.edu/nilam-ram)
- [Patrick Mair](https://psychology.fas.harvard.edu/people/patrick-mair)
- [Paul-Christian Bürkner](https://paul-buerkner.github.io/about/)
- [Peter Cushner Mohanty](https://explorecourses.stanford.edu/instructor/pmohanty)
- [Richard McElreath](https://xcelab.net/rm/)
- [Russ Poldrack](https://profiles.stanford.edu/russell-poldrack)
- [Stephen Dewitt](https://www.ucl.ac.uk/pals/research/experimental-psychology/person/stephen-dewitt/)
- [Solomon Kurz](https://solomonkurz.netlify.app/)
- [Tom Hardwicke](https://tomhardwicke.netlify.app/)
- [Tristan Mahr](https://www.tjmahr.com/)
Special thanks go to my teaching teams:
- 2024:
    - Ari Beller
    - Beth Rispoli
    - Satchel Grant
    - Shawn Schwartz
- 2023:
    - Nilam Ram (instructor)
    - Ari Beller
    - Yoonji Lee
    - Satchel Grant
    - Josh Wilson
- 2022:
    - Ari Beller
    - Sarah Wu
    - Chengxu Zhuang
- 2021:
    - Andrew Nam
    - Catherine Thomas
    - Jon Walters
    - Dan Yamins
- 2020:
    - Tyler Bonnen
    - Andrew Nam
    - Jinxiao Zhang
- 2019:
    - Andrew Lampinen
    - Mona Rosenke
    - Shao-Fang (Pam) Wang
## List of R packages used in this book
```{r, eval=FALSE, message=FALSE}
# RMarkdown
library("knitr") # markdown things
library("bookdown") # markdown things
library("kableExtra") # for nicely formatted tables
# Datasets
library("gapminder") # data available from Gapminder.org
library("NHANES") # data set
library("datarium") # data set
library("titanic") # titanic dataset
# Data manipulation
library("arrangements") # fast generators and iterators for permutations, combinations and partitions
library("magrittr") # for wrangling
library("tidyverse") # everything else
# Visualization
library("patchwork") # making figure panels
library("ggpol") # for making fancy boxplots
library("ggridges") # for making joyplots
library("gganimate") # for making animations
library("GGally") # for pairs plot
library("ggrepel") # for labels in ggplots
library("corrr") # for calculating correlations between many variables
library("corrplot") # for plotting correlations
library("DiagrammeR") # for drawing diagrams
library("DiagrammeRsvg") # for visualizing diagrams
library("ggeffects") # for visualizing effects
library("bayesplot") # for visualization of Bayesian model fits
library("skimr") # for quick data visualizations
library("visdat") # for quick data visualizations
library("rsvg") # for visualization
library("see") # for visualizing data
# Modeling
library("afex") # also for running ANOVAs
library("lme4") # mixed effects models
library("emmeans") # comparing estimated marginal means
library("broom.mixed") # getting tidy mixed model summaries
library("janitor") # cleaning variable names
library("car") # for running ANOVAs
library("rstanarm") # for Bayesian models
library("greta") # Bayesian models
library("tidybayes") # tidying up results from Bayesian models
library("boot") # bootstrapping
library("modelr") # cross-validation and bootstrapping
library("mediation") # for mediation and moderation analysis
library("multilevel") # Sobel test
library("extraDistr") # additional probability distributions
library("effects") # for showing effects in linear, generalized linear, and other models
library("brms") # Bayesian regression
library("parameters") # For extracting parameters
# Misc
library("tictoc") # timing things
library("MASS") # various useful functions (e.g. bootstrapped confidence intervals)
library("lsr") # for computing effect size measures
library("extrafont") # additional fonts
library("pwr") # for power calculations
library("stargazer") # for regression tables
library("sjPlot") # for regression tables
library("xtable") # for tables
library("DT") # for tables
library("papaja") # for reporting results
library("statsExpressions") # for extracting stats results APA style
```
## Session info
```{r, echo=F}
sessionInfo()
```
<!--chapter:end:01-introduction.Rmd-->
# Visualization 1
In this lecture, we will take a look at how to visualize data using the powerful [ggplot2](https://ggplot2.tidyverse.org/) package. We will use `ggplot2` a lot throughout the rest of the course!
## Learning goals
- Take a look at some suboptimal plots, and think about how to make them better.
- Get familiar with the RStudio interface.
- Understand the general philosophy behind `ggplot2` -- a grammar of graphics.
- Understand the mapping from data to geoms in `ggplot2`.
- Create informative figures using grouping and facets.
## Load packages
Let's first load the packages that we need for this chapter. You can click on the green arrow to execute the code chunk below.
```{r, message=FALSE}
library("knitr") # for rendering the RMarkdown file
library("tidyverse") # for plotting (and many more cool things we'll discover later)
# these options here change the formatting of how comments are rendered
# in RMarkdown
opts_chunk$set(comment = "",
               fig.show = "hold")
```
The `tidyverse` is a collection of packages that includes `ggplot2`.
## Why visualize data?
```{r hiding, echo=FALSE, fig.cap="Are you hiding anything?", out.width="95%"}
include_graphics("figures/hiding_data.png")
```
> The greatest value of a picture is when it forces us to notice what we never expected to see. — John Tukey
> There is no single statistical tool that is as powerful as a well‐chosen graph. [@chambers1983graphical]
> ...make __both__ calculations __and__ graphs. Both sorts of output should be studied; each will contribute to understanding. [@anscombe1973american]
```{r anscombe, echo=FALSE, fig.cap="Anscombe's quartet.", out.width="95%"}
include_graphics("figures/anscombe.png")
```
Anscombe's quartet in Figure \@ref(fig:anscombe) (left side) illustrates the importance of visualizing data. Even though the datasets I-IV have the same summary statistics (mean, standard deviation, correlation), they are importantly different from each other. On the right side, we have four data sets with the same summary statistics that are very similar to each other.
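To make this concrete, here is a minimal sketch that computes the summary statistics for the four data sets on the left side of the figure. It assumes the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`); the names `x_cols`, `y_cols`, and `df.summary` are just placeholders chosen for this illustration.

```r
# `anscombe` ships with base R (datasets package)
x_cols = paste0("x", 1:4)
y_cols = paste0("y", 1:4)

# summary statistics for each of the four x/y pairs
df.summary = data.frame(
  dataset = c("I", "II", "III", "IV"),
  mean_x = unname(sapply(anscombe[x_cols], mean)),
  mean_y = unname(sapply(anscombe[y_cols], mean)),
  sd_x = unname(sapply(anscombe[x_cols], sd)),
  sd_y = unname(sapply(anscombe[y_cols], sd)),
  r = unname(mapply(cor, anscombe[x_cols], anscombe[y_cols]))
)

# the rows are (nearly) identical even though the data sets look very different
print(df.summary, digits = 3, row.names = FALSE)
```

The numbers alone give no hint that the four data sets differ, which is exactly why plotting them matters.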
```{r healy, echo=FALSE, fig.cap= "The Pearson's $r$ correlation coefficient is the same for all of these datasets. Source: [Data Visualization -- A practical introduction by Kieran Healy](http://socviz.co/lookatdata.html#lookatdata)", out.width="95%"}
include_graphics("figures/correlations.png")
```
All the datasets in Figure \@ref(fig:healy) share the same correlation coefficient. However, again, they are very different from each other.
```{r datasaurus, echo=FALSE, fig.cap="__The Datasaurus Dozen__. While different in appearance, each dataset has the same summary statistics to two decimal places (mean, standard deviation, and Pearson's correlation).", out.width="95%"}
include_graphics("figures/datasaurus_dozen.png")
```
The data sets in Figure \@ref(fig:datasaurus) all share the same summary statistics. Clearly, the data sets are not the same though.
> __Tip__: Always plot the data first!
[Here](https://www.autodeskresearch.com/publications/samestats) is the paper from which I took Figure \@ref(fig:datasaurus). It explains how the figures were generated and shows more examples for how summary statistics and some kinds of plots are insufficient to get a good sense for what's going on in the data.
## Some basics
### Setting up RStudio
```{r, echo=FALSE, fig.cap="General preferences.", out.width="50%"}
include_graphics("figures/r_preferences_general.png")
```
__Make sure that__:
- Restore .RData into workspace at startup is _unselected_
- Save workspace to .RData on exit is set to _Never_
This can otherwise cause problems with reproducibility and weird behavior between R sessions because certain things may still be saved in your workspace.
```{r, echo=FALSE, fig.cap="Code window preferences.", out.width="95%"}
include_graphics("figures/r_preferences_code.png")
```
__Make sure that__:
- Soft-wrap R source files is _selected_
This way you don't have to scroll horizontally. At the same time, avoid writing long single lines of code. For example, instead of writing code like so:
```{r, eval=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  stat_summary(fun = "mean", geom = "bar", color = "black", fill = "lightblue", width = 0.85) +
  stat_summary(fun.data = "mean_cl_boot", geom = "linerange", size = 1.5) +
  labs(title = "Price as a function of quality of cut", subtitle = "Note: The price is in US dollars", tag = "A", x = "Quality of the cut", y = "Price")
```
You may want to write it this way instead:
```{r, eval=FALSE}
ggplot(data = diamonds,
       mapping = aes(x = cut,
                     y = price)) +
  # display the means
  stat_summary(fun = "mean",
               geom = "bar",
               color = "black",
               fill = "lightblue",
               width = 0.85) +
  # display the error bars
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange",
               size = 1.5) +
  # change labels
  labs(title = "Price as a function of quality of cut",
       subtitle = "Note: The price is in US dollars", # we might want to change this later
       tag = "A",
       x = "Quality of the cut",
       y = "Price")
```
This makes it much easier to see what's going on, and you can easily add comments to individual lines of code.
>__Tip__: If a function has more than two arguments put each argument on a new line.
RStudio makes it easy to write nice code. It figures out where to put the next line of code when you press `ENTER`. And if things ever get messy, just select the code of interest and hit `cmd + i` to re-indent the code.
Here are some more resources with tips for how to write nice code in R:
- [Advanced R style guide](http://adv-r.had.co.nz/Style.html)
>__Tip__: Use a consistent coding style. This makes reading code and debugging much easier!
### Getting help
There are three simple ways to get help in R: put a `?` in front of the function you'd like to learn more about, use the `help()` function, or press `F1` while the cursor is on the function name.
```{r, eval=FALSE}
?print
help("print")
```
>__Tip__: To see the help file, hover over a function (or dataset) with the mouse (or select the text) and then press `F1`.
I recommend using `F1` to get to help files -- it's the fastest way!
R help files can sometimes look a little cryptic. Most R help files have the following sections (copied from [here](https://www.dummies.com/programming/r/r-for-dummies-cheat-sheet/)):
---
__Title__: A one-sentence overview of the function.
__Description__: An introduction to the high-level objectives of the function.
__Usage__: A description of the syntax of the function (in other words, how the function is called). This is where you find all the arguments that you can supply to the function, as well as any default values of these arguments.
__Arguments__: A description of each argument. Usually this includes a specification of the class (for example, character, numeric, list, and so on). This section is an important one to understand, because arguments are frequently a cause of errors in R.
__Details__: Extended details about how the function works, provides longer descriptions of the various ways to call the function (if applicable), and a longer discussion of the arguments.
__Value__: A description of the class of the value returned by the function.
__See also__: Links to other relevant functions. In most of the R editors, you can click these links to read the Help files for these functions.
__Examples__: Worked examples of real R code that you can paste into your console and run.
---
Here is the help file for the `print()` function:
```{r, echo=FALSE, fig.cap="Help file for the print() function.", out.width="95%"}
include_graphics("figures/help_print.png")
```
### R Markdown infos
An RMarkdown file has four key components:
1. YAML header
2. Headings to structure the document
3. Text
4. Code chunks
The **YAML** (*Y*AML *A*in't *M*arkup *L*anguage, originally *Y*et *A*nother *M*arkup *L*anguage) header specifies general options such as whether you'd like to have a table of contents displayed, and in what output format you want to create your report (e.g. html or pdf). Notice that the YAML header cares about indentation, so make sure to get that right!
**Headings** are very useful for structuring your RMarkdown file. For your reports, it's often a good idea to have one header for each code chunk. The outline viewer here on the right is great for navigating large analysis files.
**Text** is self-explanatory.
**Code chunks** are where the coding happens. You can add one via the Insert button above, or via the shortcut `cmd + option + i` (the much cooler way of doing it!)
```{r another-code-chunk, eval=FALSE}
```
Code chunks can have chunk options, which we can set by clicking on the cog symbol on the right. We can also give a code chunk a name, so that we can refer to it in the text. I've named the one above "another-code-chunk". Make sure that a code chunk name contains no white space or underscores.
### Helpful keyboard shortcuts
- `cmd + enter`: run selected code
- `cmd + shift + enter`: run code chunk
- `cmd + i`: re-indent selected code
- `cmd + shift + c`: comment/uncomment several lines of code
- `cmd + shift + d`: duplicate line underneath
- set up your own shortcuts to do useful things like
    - switch tabs
    - jump up and down between code chunks
    - ...
## Data visualization
We will use the `ggplot2` package to visualize data. By the end of next class, you'll be able to make a figure like this:
```{r, echo=FALSE, fig.cap="What a nice figure!", out.width="95%"}
include_graphics("figures/combined_plot.png")
```
Now let's figure out (pun intended!) how to get there.
### Setting up a plot
Let's first get some data.
```{r}
df.diamonds = diamonds
```
The `diamonds` dataset comes with the `ggplot2` package. We can get a description of the dataset by running the following command:
```{r, eval=FALSE}
?diamonds
```
Above, we assigned the `diamonds` dataset to the variable `df.diamonds` so that we can see it in the data explorer.
Let's take a look at the full dataset by clicking on it in the explorer.
>__Tip__: You can view a data frame by highlighting the text in the editor (or simply moving the mouse above the text), and then pressing `F2`.
The `df.diamonds` data frame contains information about roughly 54,000 diamonds (53,940 rows), including their `price`, `carat` value, size, etc. Let's use visualization to get a better sense for this dataset.
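If you prefer the console over the data explorer, a few base R functions give a quick overview as well. This is just a small sketch of one way to peek at the data; it only assumes that `ggplot2` is installed (the `diamonds` dataset comes with it).

```r
library("ggplot2") # provides the diamonds dataset

df.diamonds = diamonds

# number of rows and columns
dim(df.diamonds)

# names of the variables
names(df.diamonds)

# the first three rows
head(df.diamonds, n = 3)
```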
We start by setting up the plot. To do so, we pass a data frame to the function `ggplot()` in the following way.
```{r}
ggplot(data = df.diamonds)
```
This, by itself, won't do anything yet. We also need to specify what to plot.
Let's take a look at how much diamonds of different color cost. The help file says that diamonds labeled D have the best color, and diamonds labeled J the worst color. Let's make a bar plot that shows the average price of diamonds for different colors.
We do so by specifying a mapping from the data to the plot aesthetics with the function `aes()`. We need to tell `aes()` what we would like to display on the x-axis, and the y-axis of the plot.
```{r}
ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price))
```
Here, we specified that we want to plot `color` on the x-axis, and `price` on the y-axis. As you can see, `ggplot2` has already figured out how to label the axes. However, we still need to specify _how_ to plot it.
### Bar plot
Let's make a __bar graph__:
```{r}
ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price)) +
  stat_summary(fun = "mean",
               geom = "bar")
```
Neat! Three lines of code produce an almost-publication-ready plot (to be published in the _Proceedings of Unnecessary Diamonds_)! Note how we used a `+` at the end of the first line of code to specify that there will be more. This is a very powerful idea underlying `ggplot2`. We can start simple and keep adding things to the plot step by step.
We used the `stat_summary()` function to define _what_ we want to plot (the "mean"), and _how_ (as a "bar" chart). Let's take a closer look at that function.
```{r, eval=FALSE}
help(stat_summary)
```
Not the easiest help file ... We supplied two arguments to the function, `fun = ` and `geom = `.
1. The `fun` argument specifies _what_ function we'd like to apply to the data for each value of `x`. Here, we said that we would like to take the `mean` and we specified that as a string.
2. The `geom` (= geometric object) argument specifies _how_ we would like to plot the result, namely as a "bar" plot.
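To get a feeling for what `stat_summary()` computes behind the scenes, we can calculate the same group means by hand. The sketch below uses base R's `tapply()` for this; it is only meant to illustrate the idea (apply a function to `price` separately for each value of `color`) and is not how `ggplot2` implements it internally. The name `mean_price` is made up for this example.

```r
library("ggplot2") # provides the diamonds dataset

# apply mean() to price, separately for each level of color
mean_price = tapply(X = diamonds$price,
                    INDEX = diamonds$color,
                    FUN = mean)

# one value per color: these are the heights of the bars in the plot
round(mean_price)
```

Swapping `FUN = mean` for `FUN = median` corresponds to changing `fun = "mean"` to `fun = "median"` in `stat_summary()`.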
Instead of showing the "mean", we could also show the "median" instead.
```{r}
ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price)) +
  stat_summary(fun = "median",
               geom = "bar")
```
And instead of making a bar plot, we could plot some points.
```{r}
ggplot(df.diamonds,
       aes(x = color,
           y = price)) +
  stat_summary(fun = "mean",
               geom = "point")
```
>__Tip__: Take a look [here](https://ggplot2.tidyverse.org/reference/#section-layer-geoms) to see what other geoms ggplot2 supports.
Somewhat surprisingly, diamonds with the best color (D) are not the most expensive ones. What's going on here? We'll need to do some more exploration to figure this out.
### Setting the default plot theme
Before moving on, let's set a different default theme for our plots. Personally, I'm not a big fan of the gray background and the white grid lines. Also, the default size of the text should be bigger. We can change the default theme using the `theme_set()` function like so:
```{r}
theme_set(theme_classic() + # set the theme
            theme(text = element_text(size = 20))) # set the default text size
```
From now on, all our plots will use what's specified in `theme_classic()`, and the default text size will be larger, too. For any individual plot, we can still override these settings.
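Here is a small sketch of what overriding the default for an individual plot could look like. The object names `p.default` and `p.override` are placeholders for this illustration, and `theme_minimal()` with a text size of 12 is just an arbitrary choice to make the difference visible.

```r
library("ggplot2")

# set the default theme for all plots
theme_set(theme_classic() +
            theme(text = element_text(size = 20)))

# this plot uses the default theme
p.default = ggplot(data = diamonds,
                   mapping = aes(x = color,
                                 y = price)) +
  stat_summary(fun = "mean",
               geom = "bar")

# this plot overrides the default theme and text size
p.override = p.default +
  theme_minimal() +
  theme(text = element_text(size = 12))

p.default
p.override
```

Adding a theme to a single plot only affects that plot; the default set via `theme_set()` stays in place for all other plots.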
### Scatter plot
I don't know much about diamonds, but I do know that diamonds with a higher `carat` value tend to be more expensive. `color` was a discrete variable with seven different values. `carat`, however, is a continuous variable. We want to see how the price of diamonds differs as a function of the `carat` value. Since we are interested in the relationship between two continuous variables, plotting a bar graph won't work. Instead, let's make a __scatter plot__. Let's put the `carat` value on the x-axis, and the `price` on the y-axis.
```{r scatter, fig.cap="Scatterplot."}
ggplot(data = df.diamonds,
       mapping = aes(x = carat,
                     y = price)) +
  geom_point()
```
Cool! That looks sensible. Diamonds with a higher `carat` value tend to have a higher `price`. Our dataset has `r nrow(diamonds)` rows. So the plot actually shows `r nrow(diamonds)` circles even though we can't see all of them since they overlap.
Let's make some progress on trying to figure out why the diamonds with the better color weren't the most expensive ones on average. We'll add some color to the scatter plot in Figure \@ref(fig:scatter). We color each of the points based on the diamond's color. To do so, we pass another argument to the aesthetics of the plot via `aes()`.
```{r scatter-color, fig.cap="Scatterplot with color."}
ggplot(data = df.diamonds,
       mapping = aes(x = carat,
                     y = price,
                     color = color)) +
  geom_point()
```
Aha! Now we've got some color. Notice how in Figure \@ref(fig:scatter-color) `ggplot2` added a legend for us, thanks! We'll see later how to play around with legends. From just eyeballing the plot, it looks like the diamonds with the best `color` (D) tended to have a lower `carat` value, and the ones with the worst `color` (J) tended to have the highest `carat` values.
So this is why diamonds with better colors are less expensive -- these diamonds have a lower carat value overall.
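We don't have to rely on eyeballing alone. As a quick sanity check, the following sketch computes the average `carat` value for each `color` (the name `mean_carat` is made up for this example).

```r
library("ggplot2") # provides the diamonds dataset

# average carat value, separately for each level of color
mean_carat = tapply(X = diamonds$carat,
                    INDEX = diamonds$color,
                    FUN = mean)

round(mean_carat, 2)
```

If the explanation above is right, the average `carat` value should be lower for D than for J.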
There are many other things that we can define in `aes()`. Take a quick look at the vignette:
```{r, eval=FALSE}
vignette("ggplot2-specs")
```
#### Practice plot 1
Make a scatter plot that shows the relationship between the variables `depth` (on the x-axis), and `table` (on the y-axis). Take a look at the description for the `diamonds` dataset so you know what these different variables mean. Your plot should look like the one shown in Figure \@ref(fig:practice-plot1).
```{r}
# make practice plot 1 here
```
```{r practice-plot1, fig.align="center", fig.cap="Practice plot 1.", fig.height=6, fig.width=8, out.width="95%"}
include_graphics("figures/vis1_practice_plot1.png")
```
__Advanced__: A neat trick to get a better sense for the data here is to add transparency. Your plot should look like the one shown in Figure \@ref(fig:practice-plot1a).
```{r}
# make advanced practice plot 1 here
```
```{r practice-plot1a, fig.align="center", fig.cap="Practice plot 1.", fig.height=6, fig.width=8, out.width="95%"}
include_graphics("figures/vis1_practice_plot1a.png")
```
### Line plot
What else do we know about the diamonds? We actually know the quality of how they were cut. The `cut` variable ranges from "Fair" to "Ideal". First, let's take a look at the relationship between `cut` and `price`. This time, we'll make a line plot instead of a bar plot (just because we can).
```{r}
ggplot(data = df.diamonds,
       mapping = aes(x = cut,
                     y = price)) +
  stat_summary(fun = "mean",
               geom = "line")
```
Oops! All we did was replace `x = color` with `x = cut`, and `geom = "bar"` with `geom = "line"`. However, the plot doesn't look as expected (i.e. there is no real plot). What happened here? The reason is that the line plot needs to know which points to connect. The message that `ggplot2` prints tells us that each group consists of only one observation. Let's adjust the group aesthetic to fix this.
```{r}
ggplot(data = df.diamonds,
       mapping = aes(x = cut,
                     y = price,
                     group = 1)) +
  stat_summary(fun = "mean",
               geom = "line")
```
By adding the parameter `group = 1` to `mapping = aes()`, we specify that we would like all the levels in `x = cut` to be treated as coming from the same group. The reason for this is that `cut` (our x-axis variable) is a factor (and not a numeric variable), so, by default, `ggplot2` tries to draw a separate line for each factor level. We'll learn more about grouping below (and about factors later).
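We can quickly check which type each variable has. The sketch below is just a way of confirming the claim that `cut` is a factor whereas `carat` is numeric.

```r
library("ggplot2") # provides the diamonds dataset

# cut is a factor ...
is.factor(diamonds$cut)

# ... with these levels (one potential group per level)
levels(diamonds$cut)

# carat is a numeric variable
is.numeric(diamonds$carat)
```

With a numeric variable on the x-axis, all observations end up in one group by default, so the line can be drawn without setting `group`.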
Interestingly, there is no simple relationship between the quality of the cut and the price of the diamond. In fact, "Ideal" diamonds tend to be cheapest.
### Adding error bars
We often don't just want to show the means but also give a sense for how much the data varies. `ggplot2` has some convenient ways of specifying error bars. Let's take a look at how much `price` varies as a function of `clarity` (another variable in our `diamonds` data frame).
```{r errorbars-normal, fig.cap="Relationship between diamond clarity and price. Error bars indicate 95% bootstrapped confidence intervals."}
ggplot(data = df.diamonds,
       mapping = aes(x = clarity,
                     y = price)) +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "pointrange")
```
Here we have it. The average price of our diamonds for different levels of `clarity` together with bootstrapped 95% confidence intervals. How do we know that we have 95% confidence intervals? That's what `mean_cl_boot()` computes as a default. Let's take a look at that function:
```{r, eval=FALSE}
help(mean_cl_boot)
```
Note that I had to use the `fun.data = ` argument here instead of `fun = ` because the `mean_cl_boot()` function produces three data points for each value of the x-axis (the mean, lower, and upper confidence interval).
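To illustrate what "three data points" means, here is a hand-rolled sketch of a percentile bootstrap for a single level of `clarity`. Please treat it as an illustration of the general idea only: it is not necessarily what `mean_cl_boot()` does internally, and the number of resamples (1000), the seed, and the names `prices`, `boot_means`, and `df.ci` are assumptions made for this example.

```r
library("ggplot2") # provides the diamonds dataset

set.seed(1) # make the resampling reproducible

# prices of all diamonds with clarity "IF"
prices = diamonds$price[diamonds$clarity == "IF"]

# resample the prices with replacement and compute the mean each time
boot_means = replicate(n = 1000,
                       expr = mean(sample(prices, replace = TRUE)))

# three values for this x-axis position: mean, lower bound, upper bound
df.ci = data.frame(y = mean(prices),
                   ymin = unname(quantile(boot_means, probs = 0.025)),
                   ymax = unname(quantile(boot_means, probs = 0.975)))
df.ci
```

A function passed to `fun.data = ` needs to return a data frame with the columns `y`, `ymin`, and `ymax` like this one, whereas a function passed to `fun = ` returns a single number.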
### Order matters
The order in which we add geoms to a ggplot matters! Generally, we want to plot error bars before the points that represent the means. To illustrate, let's set the color in which we show the means to "red".
```{r good-figure, fig.cap='This figure looks good. Error bars and means are drawn in the correct order.'}
ggplot(data = df.diamonds,
       mapping = aes(x = clarity,
                     y = price)) +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange") +
  stat_summary(fun = "mean",
               geom = "point",
               color = "red")
```
Figure \@ref(fig:good-figure) looks good.
```{r bad-figure, fig.cap='This figure looks bad. Error bars and means are drawn in the incorrect order.'}
# I've changed the order in which the means and error bars are drawn.
ggplot(df.diamonds,
aes(x = clarity,
y = price)) +
stat_summary(fun = "mean",
geom = "point",
color = "red") +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange")
```
Figure \@ref(fig:bad-figure) doesn't look good. The error bars are on top of the points that represent the means.
One cool feature about using `stat_summary()` is that we did not have to change anything about the data frame that we used to make the plots. We directly used our raw data instead of having to make separate data frames that contain the relevant information (such as the means and the confidence intervals).
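For comparison, here is roughly what the manual route would look like without `stat_summary()`. This is only a sketch: I'm using base R's `aggregate()` as one of several possible ways to compute the means, the built-in `diamonds` data instead of `df.diamonds`, and `df.means` is a name I made up for the summary data frame.

```r
library(ggplot2)

# the manual route: first compute the mean price for each level of clarity ...
df.means <- aggregate(price ~ clarity,
                      data = diamonds,
                      FUN = mean)

# ... and then plot the summary data frame
ggplot(data = df.means,
       mapping = aes(x = clarity,
                     y = price)) +
  geom_point(color = "red")
```

This works, but we now have an extra data frame to keep track of, and we would need yet another step to get the confidence intervals.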
You may not remember exactly what confidence intervals actually are. Don't worry! We'll have a recap later in class.
Let's take a look at two more principles for plotting data that are extremely helpful: groups and facets. But first, another practice plot.
#### Practice plot 2
Make a bar plot that shows the average `price` of diamonds (on the y-axis) as a function of their `clarity` (on the x-axis). Also add error bars. Your plot should look like the one shown in Figure \@ref(fig:practice-plot2).
```{r}
# make practice plot 2 here
```
```{r practice-plot2, fig.align="center", fig.cap="Practice plot 2.", out.width="95%"}
include_graphics("figures/vis1_practice_plot2.png")
```
__Advanced__: Try to make the plot shown in Figure \@ref(fig:practice-plot2a).
```{r}
# make advanced practice plot 2 here
```
```{r practice-plot2a, fig.align="center", fig.cap="Practice plot 2 (advanced).", out.width="95%"}
include_graphics("figures/vis1_practice_plot2a.png")
```
### Grouping data
Grouping in `ggplot2` is a very powerful idea. It allows us to plot subsets of the data -- again without the need to make separate data frames first.
Let's make a plot that shows the relationship between `price` and `color` separately for the different qualities of `cut`.
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
group = cut)) +
stat_summary(fun = "mean",
geom = "line")
```
Well, we got some separate lines here but we don't know which line corresponds to which cut. Let's add some color!
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
group = cut,
color = cut)) +
stat_summary(fun = "mean",
geom = "line",
size = 2)
```
Nice! In addition to adding color, I've made the lines a little thicker here by setting the `size` argument to 2.
Grouping is very useful for bar plots. Let's take a look at what the average price of diamonds looks like when we take into account both `cut` and `color` (I know -- exciting times!). Let's put the `color` on the x-axis and then group by the `cut`.
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
group = cut,
color = cut)) +
stat_summary(fun = "mean",
geom = "bar")
```
That's a fail! Several things went wrong here. All the bars are gray and only their outlines are colored differently. Instead, we want the bars themselves to be filled with different colors. For that we need to specify the `fill` argument rather than the `color` argument! But things are worse: the bars are currently drawn on top of each other. Instead, we'd like to put them next to each other. Here is how we can do that:
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
group = cut,
fill = cut)) +
stat_summary(fun = "mean",
geom = "bar",
position = position_dodge())
```
Neato! We've changed the `color` argument to `fill`, and have added the `position = position_dodge()` argument to the `stat_summary()` call. This argument makes it such that the bars are nicely dodged next to each other. Let's add some error bars just for kicks.
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
group = cut,
fill = cut)) +
stat_summary(fun = "mean",
geom = "bar",
position = position_dodge(width = 0.9),
color = "black") +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
position = position_dodge(width = 0.9))
```
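Voila! Now with error bars. Note that we've added the `width = 0.9` argument to `position_dodge()`. A line range has no width of its own, so R complains unless we tell `position_dodge()` explicitly how far apart to place the groups. A value of `0.9` matches the default width of the bars, so the error bars line up with them. I've also added an outline to the bars by including the argument `color = "black"`. I think it looks nicer this way.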
So, still somewhat surprisingly, diamonds with the worst color (J) are more expensive than diamonds with the best color (D), and diamonds with better cuts are not necessarily more expensive.
#### Practice plot 3
Recreate the plot shown in Figure \@ref(fig:practice-plot3).
```{r}
# make practice plot 3 here
```
```{r practice-plot3, fig.align="center", fig.cap="Practice plot 3.", out.width="95%"}
include_graphics("figures/vis1_practice_plot3.png")
```
__Advanced__: Try to recreate the plot shown in Figure \@ref(fig:practice-plot3a).
```{r}
# make advanced practice plot 3 here
```
```{r practice-plot3a, fig.align="center", fig.cap="Practice plot 3 (advanced).", out.width="95%"}
include_graphics("figures/vis1_practice_plot3a.png")
```
### Making facets
Having too much information in a single plot can be overwhelming. The previous plot is already pretty busy. Facets are a nice way of splitting up plots and showing information in separate panels.
Let's take a look at how wide these diamonds tend to be. The width in mm is given in the `y` column of the diamonds data frame. We'll make a histogram first. To make a histogram, the only aesthetic we need to specify is `x`.
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = y)) +
geom_histogram()
```
That looks bad! Let's pick a different value for the width of the bins in the histogram.
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = y)) +
geom_histogram(binwidth = 0.1)
```
Still bad. There seems to be an outlier diamond that happens to be almost 60 mm wide, while most of the others are much narrower. One option would be to remove the outlier from the data before plotting it. But generally, we don't want to make new data frames. Instead, let's just limit what data we show in the plot.
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = y)) +
geom_histogram(binwidth = 0.1) +
coord_cartesian(xlim = c(3, 10))
```
I've used the `coord_cartesian()` function to restrict the range of data to show by passing a minimum and maximum to the `xlim` argument. This looks better now.
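It's worth seeing that `coord_cartesian()` only zooms in on the plot and doesn't throw away any data. The sketch below contrasts it with setting limits on the x-axis scale, which, as far as I understand it, drops the out-of-range values before the bins are computed. I'm using the built-in `diamonds` data here, and the object names (`p_base`, `p_zoom`, `p_limits`, `n_zoom`, `n_limits`) are just for illustration.

```r
library(ggplot2)

p_base <- ggplot(data = diamonds,
                 mapping = aes(x = y)) +
  geom_histogram(binwidth = 0.1)

# coord_cartesian() zooms in: all diamonds are still binned
p_zoom <- p_base +
  coord_cartesian(xlim = c(3, 10))

# scale limits drop values outside the range before binning
# (ggplot2 warns about the removed rows)
p_limits <- p_base +
  scale_x_continuous(limits = c(3, 10))

# total number of diamonds counted in the bins of each plot
n_zoom <- sum(layer_data(p_zoom)$count)
n_limits <- sum(layer_data(p_limits)$count)
c(zoom = n_zoom, limits = n_limits, data = nrow(diamonds))
```

For a histogram the two plots look much the same, but the difference matters as soon as we compute summaries such as means, since dropped data points change the result.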
Instead of histograms, we can also plot a density fitted to the distribution.
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = y)) +
geom_density() +
coord_cartesian(xlim = c(3, 10))
```
Looks pretty similar to our histogram above! Just like we can play around with the binwidth of the histogram, we can change the smoothing bandwidth of the kernel that is used to create the density. Here is a density with a much wider bandwidth:
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = y)) +
geom_density(bw = 0.5) +
coord_cartesian(xlim = c(3, 10))
```
We'll learn more about how these densities are determined later in class.
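In the meantime, here is a small sketch of what the bandwidth does, using base R's `density()` function (which is what `geom_density()` builds on). The two bandwidth values and the object names `d_narrow` and `d_wide` are just choices for this illustration, and I'm again using the built-in `diamonds` data.

```r
library(ggplot2)

# restrict the evaluation grid to the range shown in the plots above
d_narrow <- density(diamonds$y, bw = 0.1, from = 3, to = 10)
d_wide <- density(diamonds$y, bw = 0.5, from = 3, to = 10)

# each result stores the bandwidth and the estimated density on a grid
d_wide$bw
head(data.frame(x = d_wide$x, density = d_wide$y))

# height of the tallest peak for each bandwidth
c(narrow = max(d_narrow$y), wide = max(d_wide$y))
```

A wider bandwidth smooths over more of the data, so we would expect the peaks to be flatter and the bumps to blur into each other.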
I promised that this section was about making facets, right? We're getting there! Let's first take a look at how wide diamonds of different `color` are. We can use grouping to make this happen.
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = y,
group = color,
fill = color)) +
geom_density(bw = 0.2,
alpha = 0.2) +
coord_cartesian(xlim = c(3, 10))
```
OK! Those are a little tricky to tell apart. Notice that I've specified the `alpha` argument in the `geom_density()` function so that the densities in the front don't completely hide the densities in the back. But this plot still looks too busy. Instead of grouping, let's put the densities for the different colors in separate panels. That's what faceting allows you to do.
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = y,
fill = color)) +
geom_density(bw = 0.2) +
facet_grid(cols = vars(color)) +
coord_cartesian(xlim = c(3, 10))
```
Now we have the densities next to each other in separate panels. I've removed the `alpha` argument since the densities aren't overlapping anymore. To make the different panels, I used the `facet_grid()` function and specified that I want separate columns for the different colors (`cols = vars(color)`). What's the deal with `vars()`? Why couldn't we just write `facet_grid(cols = color)` instead? The short answer is: that's what the function wants. The long answer is: long. (We'll learn more about this later in the course.)
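For the curious, here is a tiny peek at what `vars()` does, without going into the long answer. This is only a sketch: `facet_vars` is a made-up name, `rlang::as_label()` comes from the `rlang` package (which gets installed together with `ggplot2`), and I'm using the built-in `diamonds` data.

```r
library(ggplot2)

# vars() captures the expression `color` without evaluating it
facet_vars <- vars(color)
length(facet_vars)
rlang::as_label(facet_vars[[1]])

# facet_grid() also accepts a formula (rows ~ columns); this should give
# the same layout as facet_grid(cols = vars(color))
ggplot(data = diamonds,
       mapping = aes(x = y)) +
  geom_density(bw = 0.2) +
  facet_grid(. ~ color) +
  coord_cartesian(xlim = c(3, 10))
```

So `vars(color)` hands `facet_grid()` the name of the column to look up in the data frame, rather than an object called `color` in our R session.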
To show the facets in different rows instead of columns we simply replace `cols = vars(color)` with `rows = vars(color)`.
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = y,
fill = color)) +
geom_density(bw = 0.2) +
facet_grid(rows = vars(color)) +
coord_cartesian(xlim = c(3, 10))
```
Several aspects about this plot should be improved:
- the y-axis text is overlapping
- having both a legend and separate facet labels is redundant
- having separate fills is not really necessary here
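Here is one possible way of addressing these three points. It's only a sketch: the gray fill and the number of y-axis breaks are arbitrary choices of mine, `p_improved` is a made-up name, and I'm using the built-in `diamonds` data.

```r
library(ggplot2)

p_improved <- ggplot(data = diamonds,
                     mapping = aes(x = y)) +
  # a fixed fill (set outside of aes()) means no legend is drawn
  geom_density(bw = 0.2,
               fill = "gray80") +
  facet_grid(rows = vars(color)) +
  # ask for fewer y-axis breaks so that the labels don't overlap
  scale_y_continuous(n.breaks = 3) +
  coord_cartesian(xlim = c(3, 10))

p_improved
```

Whether the y-axis labels still overlap depends on the size of the figure, so the number of breaks may need some tweaking.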
So, what does this plot actually show us? Well, J-colored diamonds tend to be wider than D-colored diamonds. Fascinating!
Of course, we could go completely overboard with facets and groups. So let's do it! Let's look at how the average `price` (somewhat more interesting) varies as a function of `color`, `cut`, and `clarity`. We'll put color on the x-axis, and make separate rows for `cut` and columns for `clarity`.
```{r stretching-it, fig.cap="A figure that is stretching it in terms of information."}
ggplot(data = df.diamonds,
mapping = aes(y = price,
x = color,
fill = color)) +
stat_summary(fun = "mean",
geom = "bar",
color = "black") +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange") +
facet_grid(rows = vars(cut),
cols = vars(clarity))
```
Figure \@ref(fig:stretching-it) is stretching it in terms of how much information it presents. But it gives you a sense for how to combine the different bits and pieces we've learned so far.
#### Practice plot 4
Recreate the plot shown in Figure \@ref(fig:practice-plot4).
```{r}
# make practice plot 4 here
```
```{r practice-plot4, fig.align="center", fig.cap="Practice plot 4.", out.width="95%"}
include_graphics("figures/vis1_practice_plot4.png")
```
### Global, local, and setting `aes()`
`ggplot2` allows you to specify the plot aesthetics in different ways.
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = carat,
y = price,
color = color)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE)
```
Here, I've drawn a scatter plot of the relationship between `carat` and `price`, and I have added the best-fitting regression lines via the `geom_smooth(method = "lm")` call. (We will learn more about what these regression lines mean later in class.)
Because I have defined all the aesthetics at the top level (i.e. directly within the `ggplot()` function), they apply to all the functions that follow: in this case, `geom_point()` and `geom_smooth()`. Aesthetics defined in the `ggplot()` call are __global__. As a result, the `geom_smooth()` function produces a separate best-fit regression line for each color.
But what if we only wanted to show one regression line instead that applies to all the data? Here is one way of doing so:
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = carat,
y = price)) +
geom_point(mapping = aes(color = color)) +
geom_smooth(method = "lm")
```
Here, I've moved the color aesthetic into the `geom_point()` function call. Now, the `x` and `y` aesthetics still apply to both the `geom_point()` and the `geom_smooth()` function call (they are __global__), but the `color` aesthetic applies only to `geom_point()` (it is __local__). Alternatively, we can simply overwrite global aesthetics within local function calls.
```{r}
ggplot(data = df.diamonds,
mapping = aes(x = carat,
y = price,
color = color)) +
geom_point() +
geom_smooth(method = "lm",
color = "black")
```
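Here, I've set `color = "black"` within the `geom_smooth()` function, and now only one overall regression line is displayed since the global color aesthetic was overwritten in the local function call. Because `color = "black"` appears outside of `aes()`, it __sets__ the color to a fixed value rather than mapping it to a variable in the data.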
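The difference between setting a value and mapping one inside `aes()` is easy to trip over, so here is a small sketch that puts the two side by side. The object names (`df.small`, `p_set`, `p_mapped`) are made up, and I'm using the first 500 rows of the built-in `diamonds` data to keep things fast.

```r
library(ggplot2)

# a small subset keeps the example fast
df.small <- diamonds[1:500, ]

# setting: color is a fixed value outside of aes(), so all points are black
p_set <- ggplot(data = df.small,
                mapping = aes(x = carat,
                              y = price)) +
  geom_point(color = "black")

# mapping: inside aes(), "black" is treated like data with a single level,
# so ggplot2 picks a color from its default scale and adds a legend
p_mapped <- ggplot(data = df.small,
                   mapping = aes(x = carat,
                                 y = price)) +
  geom_point(mapping = aes(color = "black"))

# the colors that actually end up in each plot
unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```

So if a plot unexpectedly shows a legend with a single entry and the wrong color, a fixed value inside `aes()` is a likely culprit.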
## Additional resources
### Cheatsheets
- [RStudio IDE](figures/rstudio-ide.pdf) --> information about RStudio
- [RMarkdown](figures/rmarkdown.pdf) --> information about writing in RMarkdown
- [RMarkdown reference](figures/rmarkdown-reference.pdf) --> RMarkdown reference sheet
- [Data visualization](figures/visualization-principles.pdf) --> general principles of effective graphic design
- [ggplot2](figures/data-visualization.pdf) --> specific information about ggplot
### Datacamp courses
- [Introduction to R](https://www.datacamp.com/courses/free-introduction-to-r)
- [ggplot (intro)](https://learn.datacamp.com/courses/introduction-to-data-visualization-with-ggplot2)
- [Reporting](https://www.datacamp.com/courses/communicating-with-data-in-the-tidyverse)
- [visualization best practices](https://www.datacamp.com/courses/visualization-best-practices-in-r)
### Books and chapters
- [R graphics cookbook](http://www.cookbook-r.com/Graphs/) --> quick intro to the most common graphs
- [ggplot2 book](https://ggplot2-book.org/)
- [R for Data Science book](http://r4ds.had.co.nz/)
+ [Data visualization](http://r4ds.had.co.nz/data-visualisation.html)
+ [Graphics for communication](http://r4ds.had.co.nz/graphics-for-communication.html)
- [Data Visualization -- A practical introduction (by Kieran Healy)](http://socviz.co/)
+ [Look at data](http://socviz.co/lookatdata.html#lookatdata)
+ [Make a plot](http://socviz.co/makeplot.html#makeplot)
+ [Show the right numbers](http://socviz.co/groupfacettx.html#groupfacettx)
- [Fundamentals of Data Visualization](https://serialmentor.com/dataviz/) --> very nice resource that goes beyond basic functionality of `ggplot` and focuses on how to make good figures (e.g. how to choose colors, axes, ...)
### Misc
- [nice online ggplot tutorial](https://evamaerey.github.io/ggplot2_grammar_guide/about)
- [how to read R help files](https://socviz.co/appendix.html#a-little-more-about-r)
- [ggplot2 extensions](https://exts.ggplot2.tidyverse.org/gallery/) --> gallery of ggplot2 extension packages
- [ggplot2 visualizations with code](http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html) --> gallery of plots with code