# Linear regression
## Learning goals
- Multiple regression.
- Appreciate model assumptions.
- Several continuous predictors.
- Hypothesis tests.
- Interpreting parameters.
- Reporting results.
- One categorical predictor.
- Both continuous and categorical predictors.
- Interpreting interactions.
## Load packages and set plotting theme
```{r, message=FALSE}
library("knitr") # for knitting RMarkdown
library("kableExtra") # for making nice tables
library("janitor") # for cleaning column names
library("broom") # for tidying up linear models
library("corrr") # for calculating correlations between many variables
library("corrplot") # for plotting correlations
library("GGally") # for running ggpairs() function
library("tidyverse") # for wrangling, plotting, etc.
# include references for used packages
knitr::write_bib(.packages(), "packages.bib")
```
```{r}
theme_set(theme_classic() + #set the theme
theme(text = element_text(size = 20))) #set the default text size
opts_chunk$set(comment = "",
fig.show = "hold")
```
## Load data sets
Let's load the data sets that we'll explore in this class:
```{r, warning=F, message=FALSE}
# credit data set
df.credit = read_csv("data/credit.csv") %>%
rename(index = `...1`) %>%
clean_names()
# advertising data set
df.ads = read_csv("data/advertising.csv") %>%
rename(index = `...1`) %>%
clean_names()
```
```{r, echo=F, fig.cap="Description of the different variables in the df.credit data set."}
tibble(variable = c("income", "limit", "rating", "cards", "age", "education",
"gender", "student", "married", "ethnicity", "balance"),
description = c("in thousand dollars",
"credit limit",
"credit rating",
"number of credit cards",
"in years",
"years of education",
"male or female",
"student or not",
"married or not",
"African American, Asian, Caucasian",
"average credit card debt")) %>%
kable() %>%
kable_styling(bootstrap_options = "striped",
full_width = F)
```
## Multiple continuous variables
Let's take a look at a case where we have multiple continuous predictor variables. In this case, we want to make sure that our predictors are not too highly correlated with each other (as this makes the interpretation of how much each variable explains the outcome difficult). So we first need to explore the pairwise correlations between variables.
### Explore correlations
The `corrr` package is great for exploring correlations between variables. To find out more about how `corrr` works, take a look at this vignette:
```{r, eval=F}
vignette(topic = "using-corrr",
package = "corrr")
```
Here is an example that illustrates some of the key functions in the `corrr` package (using the advertisement data):
```{r}
df.ads %>%
select(where(is.numeric)) %>%
correlate(quiet = T) %>%
shave() %>%
fashion()
```
#### Visualize correlations
##### Correlations with the dependent variable
```{r, fig.cap="Bar plot illustrating how strongly different variables correlate with income."}
df.credit %>%
select(where(is.numeric)) %>%
correlate(quiet = T) %>%
select(term, income) %>%
mutate(term = reorder(term, income)) %>%
drop_na() %>%
ggplot(aes(x = term,
y = income,
fill = income)) +
geom_hline(yintercept = 0) +
geom_col(color = "black",
show.legend = F) +
scale_fill_gradient2(low = "indianred2",
mid = "white",
high = "skyblue1",
limits = c(-1, 1)) +
coord_flip() +
theme(axis.title.y = element_blank())
```
##### All pairwise correlations
```{r, fig.cap="Correlation plot showing the pairwise correlations between different variables."}
tmp = df.credit %>%
select(where(is.numeric), -index) %>%
correlate(diagonal = 0,
quiet = T) %>%
rearrange() %>%
select(-term) %>%
as.matrix() %>%
corrplot()
```
```{r, fig.cap="Pairwise correlations with scatter plots, correlation values, and densities on the diagonal."}
df.ads %>%
select(-index) %>%
ggpairs()
```
With some customization:
```{r, fig.cap="Pairwise correlations with scatter plots, correlation values, and densities on the diagonal (customized)."}
df.ads %>%
select(-index) %>%
ggpairs(lower = list(continuous = wrap("points",
alpha = 0.3)),
upper = list(continuous = wrap("cor", size = 8))) +
theme(panel.grid.major = element_blank())
```
### Multiple regression
Now that we've explored the correlations, let's have a go at the multiple regression.
#### Visualization
We'll first take another look at the pairwise relationships:
```{r}
tmp.x = "tv"
# tmp.x = "radio"
# tmp.x = "newspaper"
tmp.y = "radio"
# tmp.y = "tv"
ggplot(df.ads,
aes(x = .data[[tmp.x]], y = .data[[tmp.y]])) +
stat_smooth(method = "lm",
color = "black",
fullrange = T) +
geom_point(alpha = 0.3) +
annotate(geom = "text",
x = -Inf,
y = Inf,
hjust = -0.5,
vjust = 1.5,
label = str_c("r = ", cor(df.ads[[tmp.x]], df.ads[[tmp.y]]) %>%
round(2) %>% # round
str_remove("^0+") # remove 0
),
size = 8) +
theme(text = element_text(size = 30))
```
TV ads and radio ads aren't correlated. Yay!
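As a quick numerical check, we can also test whether the correlation differs significantly from zero (a sketch using base R's `cor.test()`):
```{r}
# test the correlation between tv and radio ads against 0
cor.test(x = df.ads$tv,
         y = df.ads$radio)
```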
#### Fitting, hypothesis testing, evaluation
Let's see whether adding radio ads is worth it (over and above having TV ads).
```{r}
# fit the models
fit_c = lm(sales ~ 1 + tv, data = df.ads)
fit_a = lm(sales ~ 1 + tv + radio, data = df.ads)
# do the F test
anova(fit_c, fit_a)
```
It's worth it!
Let's evaluate how well the model actually does. We do this by taking a look at the residual plot and checking whether the residuals are normally distributed.
```{r}
tmp.fit = lm(sales ~ 1 + tv + radio, data = df.ads)
df.plot = tmp.fit %>%
augment() %>%
clean_names()
# residual plot
ggplot(df.plot,
aes(x = fitted,
y = resid)) +
geom_point()
# density of residuals
ggplot(df.plot,
aes(x = resid)) +
stat_density(geom = "line")
# QQ plot
ggplot(df.plot,
aes(sample = resid)) +
geom_qq() +
geom_qq_line()
```
There is a slight non-linear trend in the residuals. We can also see that the residuals aren't perfectly normally distributed. We'll see later what we can do about this ...
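To complement the visual check, we could also run a formal normality test on the residuals (a sketch; note that with large samples such tests flag even minor deviations):
```{r}
# Shapiro-Wilk test on the residuals from the augmented model
shapiro.test(df.plot$resid)
```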
Let's see how well the model does overall:
```{r}
fit_a %>%
glance() %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = "striped",
full_width = F)
```
As we can see, the model explains almost 90% of the variance. That's very decent!
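If we just want the proportion of explained variance by itself, we can pull it out of the `glance()` output (a quick sketch):
```{r}
# extract R^2 directly
fit_a %>%
  glance() %>%
  pull(r.squared)
```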
#### Visualizing the model fits
Here is a way of visualizing how both tv ads and radio ads affect sales:
```{r}
df.plot = lm(sales ~ 1 + tv + radio, data = df.ads) %>%
augment() %>%
clean_names()
df.tidy = lm(sales ~ 1 + tv + radio, data = df.ads) %>%
tidy()
ggplot(df.plot, aes(x = radio, y = sales, color = tv)) +
geom_point() +
scale_color_gradient(low = "gray80", high = "black") +
theme(legend.position = c(0.1, 0.8))
```
We used color here to encode TV ads (and the x-axis for the radio ads).
In addition, we might want to illustrate the relationship between radio ads and sales that the model predicts for three distinct values of TV ads. Like so:
```{r}
df.plot = lm(sales ~ 1 + tv + radio, data = df.ads) %>%
augment() %>%
clean_names()
df.tidy = lm(sales ~ 1 + tv + radio, data = df.ads) %>%
tidy()
ggplot(df.plot, aes(x = radio, y = sales, color = tv)) +
geom_point() +
geom_abline(intercept = df.tidy$estimate[1] + df.tidy$estimate[2] * 200,
slope = df.tidy$estimate[3]) +
geom_abline(intercept = df.tidy$estimate[1] + df.tidy$estimate[2] * 100,
slope = df.tidy$estimate[3]) +
geom_abline(intercept = df.tidy$estimate[1] + df.tidy$estimate[2] * 0,
slope = df.tidy$estimate[3]) +
scale_color_gradient(low = "gray80", high = "black") +
theme(legend.position = c(0.1, 0.8))
```
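Instead of computing the intercepts by hand, we could also let `predict()` generate the lines (a sketch; the tv values 0, 100, and 200 match the ones used above):
```{r}
# predicted sales at three fixed tv budgets, across the range of radio
df.pred = expand_grid(tv = c(0, 100, 200),
                      radio = range(df.ads$radio))
df.pred$sales = predict(fit_a, newdata = df.pred)

ggplot(df.plot, aes(x = radio, y = sales, color = tv)) +
  geom_point() +
  geom_line(data = df.pred, aes(group = tv), size = 1) +
  scale_color_gradient(low = "gray80", high = "black") +
  theme(legend.position = c(0.1, 0.8))
```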
#### Interpreting the model fits
Fitting the augmented model yields the following estimates for the coefficients in the model:
```{r}
fit_a %>%
tidy(conf.int = T) %>%
head(10) %>%
kable(digits = 2) %>%
kable_styling(bootstrap_options = "striped",
full_width = F)
```
#### Standardizing the predictors
One thing we can do to make different predictors more comparable is to standardize them.
```{r}
df.ads = df.ads %>%
mutate(across(.cols = c(tv, radio),
.fns = ~ scale(.),
.names = "{.col}_scaled"))
df.ads %>%
select(-newspaper) %>%
head(10) %>%
kable(digits = 2) %>%
kable_styling(bootstrap_options = "striped",
full_width = F)
```
We can standardize (z-score) variables using the `scale()` function.
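Under the hood, `scale()` simply subtracts the mean and divides by the standard deviation; here is a quick sanity check (a sketch):
```{r}
# scale() is equivalent to (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(df.ads$tv)),
          (df.ads$tv - mean(df.ads$tv)) / sd(df.ads$tv))
```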
```{r}
# tmp.variable = "tv"
tmp.variable = "tv_scaled"
ggplot(df.ads,
aes(x = .data[[tmp.variable]])) +
stat_density(geom = "line",
size = 1) +
annotate(geom = "text",
x = median(df.ads[[tmp.variable]]),
y = -Inf,
label = str_c("sd = ", sd(df.ads[[tmp.variable]]) %>% round(2)),
size = 10,
vjust = -1,
hjust = 0.5) +
annotate(geom = "text",
x = median(df.ads[[tmp.variable]]),
y = -Inf,
label = str_c("mean = ", mean(df.ads[[tmp.variable]]) %>% round(2)),
size = 10,
vjust = -3,
hjust = 0.5)
```
Scaling a variable leaves the distribution intact, but changes the mean to 0 and the SD to 1.
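With the scaled predictors in place, we can refit the model; the coefficients are now on comparable scales and reflect the change in sales per standard deviation of each predictor (a sketch reusing the `tv_scaled` and `radio_scaled` columns created above):
```{r}
# refit the model with standardized predictors
lm(sales ~ 1 + tv_scaled + radio_scaled, data = df.ads) %>%
  tidy(conf.int = T) %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = "striped",
                full_width = F)
```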
## One categorical variable
Let's compare a compact model that only predicts the mean, with a model that uses the student variable as an additional predictor.
```{r}
# fit the models
fit_c = lm(balance ~ 1, data = df.credit)
fit_a = lm(balance ~ 1 + student, data = df.credit)
# run the F test
anova(fit_c, fit_a)
fit_a %>%
summary()
```
The `summary()` shows that it's worth it: the augmented model explains a significant amount of the variance (i.e., it yields a significant proportional reduction in error, PRE).
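To make the connection to PRE explicit, we can compute it directly from the residual sums of squares of the two models (a sketch):
```{r}
# PRE = (SSE_compact - SSE_augmented) / SSE_compact
sse_c = sum(residuals(fit_c)^2)
sse_a = sum(residuals(fit_a)^2)
(sse_c - sse_a) / sse_c
```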
### Visualization of the model predictions
Let's visualize the model predictions. Here is the compact model:
```{r}
ggplot(df.credit,
aes(x = index,
y = balance)) +
geom_hline(yintercept = mean(df.credit$balance),
size = 1) +
geom_segment(aes(xend = index,
yend = mean(df.credit$balance)),
alpha = 0.1) +
geom_point(alpha = 0.5)
```
It just predicts the mean (the horizontal black line). The vertical lines from each data point to the mean illustrate the residuals.
And here is the augmented model:
```{r}
df.fit = fit_a %>%
tidy() %>%
mutate(estimate = round(estimate,2))
ggplot(df.credit,
aes(x = index,
y = balance,
color = student)) +
geom_hline(yintercept = df.fit$estimate[1],
size = 1,
color = "#E41A1C") +
geom_hline(yintercept = df.fit$estimate[1] + df.fit$estimate[2],
size = 1,
color = "#377EB8") +
geom_segment(data = df.credit %>%
filter(student == "No"),
aes(xend = index,
yend = df.fit$estimate[1]),
alpha = 0.1,
color = "#E41A1C") +
geom_segment(data = df.credit %>%
filter(student == "Yes"),
aes(xend = index,
yend = df.fit$estimate[1] + df.fit$estimate[2]),
alpha = 0.1,
color = "#377EB8") +
geom_point(alpha = 0.5) +
scale_color_brewer(palette = "Set1") +
guides(color = guide_legend(reverse = T))
```
Note that this model predicts two horizontal lines. One for students, and one for non-students.
Let's make a simple plot that shows the means of both groups with bootstrapped confidence intervals.
```{r}
ggplot(data = df.credit,
mapping = aes(x = student, y = balance, fill = student)) +
stat_summary(fun = "mean",
geom = "bar",
color = "black",
show.legend = F) +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
size = 1) +
scale_fill_brewer(palette = "Set1")
```
And let's double-check that we also get a significant result when we run a t-test instead of our model comparison procedure:
```{r}
# note: t.test() runs Welch's t-test (unequal variances) by default;
# var.equal = TRUE would match the equal-variance assumption of the F-test
t.test(x = df.credit$balance[df.credit$student == "No"],
y = df.credit$balance[df.credit$student == "Yes"])
```
### Dummy coding
When we put a variable in a linear model that is coded as a character or as a factor, R automatically recodes this variable using dummy coding. It uses level 1 as the reference category for factors, or the value that comes first in the alphabet for characters.
```{r}
df.credit %>%
select(income, student) %>%
mutate(student_dummy = ifelse(student == "No", 0, 1)) %>%
head(10) %>%
kable(digits = 2) %>%
kable_styling(bootstrap_options = "striped",
full_width = F)
```
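To see the coding R uses internally, we can inspect the model matrix (a quick sketch):
```{r}
# the studentYes column is the dummy code (0 = No, 1 = Yes)
model.matrix(balance ~ 1 + student, data = df.credit) %>%
  head(10)
```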
### Reporting the results
To report the results, we could show a plot like this:
```{r}
df.plot = df.credit
ggplot(df.plot,
aes(x = student,
y = balance)) +
geom_point(alpha = 0.1,
position = position_jitter(height = 0, width = 0.1)) +
stat_summary(fun.data = "mean_cl_boot",
size = 1)
```
And then report the means and standard deviations together with the result of our significance test:
```{r}
df.credit %>%
group_by(student) %>%
summarize(mean = mean(balance),
sd = sd(balance)) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
```
## One continuous and one categorical variable
Now let's take a look at a case where we have one continuous and one categorical predictor variable. Let's first formulate and fit our models:
```{r}
# fit the models
fit_c = lm(balance ~ 1 + income, df.credit)
fit_a = lm(balance ~ 1 + income + student, df.credit)
# run the F test
anova(fit_c, fit_a)
```
We see again that it's worth it. The augmented model explains significantly more variance than the compact model.
### Visualization of the model predictions
Let's visualize the model predictions again. Let's start with the compact model:
```{r}
df.augment = fit_c %>%
augment() %>%
clean_names()
ggplot(df.augment,
aes(x = income,
y = balance)) +
geom_smooth(method = "lm", se = F, color = "black") +
geom_segment(aes(xend = income,
yend = fitted),
alpha = 0.3) +
geom_point(alpha = 0.3)
```
This time, the compact model still predicts just one line (like above), but note that this line is no longer horizontal.
```{r}
df.tidy = fit_a %>%
tidy() %>%
mutate(estimate = round(estimate,2))
df.augment = fit_a %>%
augment() %>%
clean_names()
ggplot(df.augment,
aes(x = income,
y = balance,
group = student,
color = student)) +
geom_segment(data = df.augment %>%
filter(student == "No"),
aes(xend = income,
yend = fitted),
color = "#E41A1C",
alpha = 0.3) +
geom_segment(data = df.augment %>%
filter(student == "Yes"),
aes(xend = income,
yend = fitted),
color = "#377EB8",
alpha = 0.3) +
geom_abline(intercept = df.tidy$estimate[1],
slope = df.tidy$estimate[2],
color = "#E41A1C",
size = 1) +
geom_abline(intercept = df.tidy$estimate[1] + df.tidy$estimate[3],
slope = df.tidy$estimate[2],
color = "#377EB8",
size = 1) +
geom_point(alpha = 0.3) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = c(0.1, 0.9)) +
guides(color = guide_legend(reverse = T))
```
The augmented model predicts two lines again, each with the same slope (but the intercept differs).
## Interactions
Let's check whether there is an interaction between how income affects balance for students vs. non-students.
### Visualization
Let's take a look at the data first.
```{r}
ggplot(data = df.credit,
mapping = aes(x = income,
y = balance,
group = student,
color = student)) +
geom_smooth(method = "lm", se = F) +
geom_point(alpha = 0.3) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = c(0.1, 0.9)) +
guides(color = guide_legend(reverse = T))
```
Note that we just specified here that we want a linear model (via `geom_smooth(method = "lm")`). Because we mapped `student` onto `group` and `color`, `ggplot2` fits a separate line for each group, which amounts to a model that includes an interaction. We can see this from the fact that the two fitted lines are not parallel.
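To make this concrete, we can fit a separate regression for each group, which is effectively what `geom_smooth()` does here; the income slopes come out different for students and non-students (a sketch using `group_modify()`):
```{r}
# one regression per group; compare the income slopes
df.credit %>%
  group_by(student) %>%
  group_modify(~ tidy(lm(balance ~ 1 + income, data = .x)))
```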
But is the interaction in the model worth it? That is, does a model that includes an interaction explain significantly more variance in the data than a model that does not?
### Hypothesis test
Let's check:
```{r}
# fit models
fit_c = lm(formula = balance ~ income + student, data = df.credit)
fit_a = lm(formula = balance ~ income * student, data = df.credit)
# F-test
anova(fit_c, fit_a)
```
Nope, not worth it! The F-test comes out non-significant.
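Since the two models differ by a single parameter, the t-test on the interaction coefficient in the augmented model gives the same p-value as the F-test (a quick check):
```{r}
# look at the interaction term directly
fit_a %>%
  tidy() %>%
  filter(str_detect(term, ":"))
```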
## Additional resources
### Datacamp
- [Statistical modeling 1](https://www.datacamp.com/courses/statistical-modeling-in-r-part-1)
- [Statistical modeling 2](https://www.datacamp.com/courses/statistical-modeling-in-r-part-2)
- [Correlation and regression](https://www.datacamp.com/courses/correlation-and-regression)
### Misc
- [Nice review of multiple regression in R](https://bookdown.org/roback/bookdown-bysh/ch-MLRreview.html)
## Session info
Information about this R session, including which version of R was used and what packages were loaded.
```{r}
sessionInfo()
```
## References