---
title: "Module 4 - Multiple Linear Regression"
subtitle: <center> <h1>In-Class Analysis</h1> </center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
# load packages here
library(tidyverse)
library(ggfortify) # plot lm objects using ggplot instead of base R
library(car) # needed for added-variable plots and dfbetas and dffits
library(corrplot) # colored correlation matrix
library(gridExtra) # NEW PACKAGE for putting multiple ggplots in one plot
```
## Data and Description
*Note: you will be duplicating a lot of code from Module 2, so it will save time to open your Module 2 code and copy/paste as needed. Alternatively, you can skip the sections that ask you to copy/paste code from Module 2. Also, we won't worry about axis limits for this coding day, to save time.*
**This is the same data set used in the Module 4 Course Notes. You can check your code output against the output from the course notes to verify you are getting the correct results.**
Companies continually evaluate supervisors, not only to determine adequate performance but also to gauge employee morale (an important indicator of employee productivity). In an effort to understand the important aspects of a good supervisor, 30 employees at a certain company were asked to provide an overall rating and scores on 6 characteristics of their immediate managers. Employees were asked to rate the following statements on a scale from 0 to 100 (0 meaning "completely disagree" and 100 meaning "completely agree"):
Variable | Description
---------- | -------------
Rating | Overall rating of supervisor performance. Higher score means better supervisor.
Complaints | Score for "Your supervisor handles employee complaints appropriately."
Privileges | Score for "Your supervisor allows special privileges."
Learn | Score for "Your supervisor provides opportunities to learn new things."
Raises | Score for "Your supervisor bases raises on performance."
Critical | Score for "Your supervisor is too critical of poor performance."
Advance | Score for "I am *not* satisfied with the rate I am advancing in the company."
Do the following:
1. Download the "Supervisor.txt" file from Canvas and put it in the same folder as this R Markdown file.
2. Read in the data set, call it "super", and take a look at the top few rows.
```{r, message=FALSE}
#super <- read.csv("Supervisor.txt", header = TRUE, sep = " ")
#head(super)
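# Note: read_table2() was superseded by read_table() in readr 2.0;
# either should read this space-delimited file.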
(super <- read_table2("Supervisor.txt"))
```
## Explore the Data
### Create a Scatterplot Matrix. Do you observe any variables that appear strongly linearly correlated?
Hint: you can use the `plot` function or the `pairs` function.
```{r, fig.align='center', fig.height=8, fig.width=8}
point_matrix <- function(data) {
par(pty = "s", las = 1)
pairs(data, pch = 19, lower.panel = NULL)
}
point_matrix(super)
```
### Create a Correlation Matrix. Do the correlations here match your insights obtained from the scatterplot matrix?
Hint: use the `cor` function. You may want to use the `round` function to display only 2 decimal places for easier viewing.
Hint: for a color- and shape-coded correlation matrix, use the `corrplot` package and function (use the `cor` function as input).
```{r, fig.height=5}
show_cor <- function(data) {
par(mfrow = c(1, 2))
corrplot(cor(data), method = "number", type = "upper", diag = F, tl.col = "#1f3366", cl.pos = "n")
title("Correlation Coefficients")
corrplot(cor(data), type = "upper", diag = F, tl.col = "#1f3366", cl.pos = "n")
title("Correlation Matrix")
}
show_cor(super)
```
## Fit a Multiple Linear Regression Model
Make sure to save the residuals to your `super` data frame.
Hint: for the X variables in the `lm` function, you can either type out all the variables separated by plus signs ("+") or simply type "~.". The "." tells R to include every column in your data set, except the column you specified as the response, as an independent variable in your model. Be careful with the "~." notation: once you save the residuals and/or fitted values (or anything else) to the data frame, you cannot rerun your linear model, otherwise those two variables will be added to the model as predictors (see the sketch after the model output below for one way to handle this). Your output should match the output in the class notes.
```{r}
super_lm <- lm(Rating ~ ., data = super)
summary(super_lm)
```
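Below is a minimal sketch of one way to save the residuals (and fitted values) to `super`; it is left unevaluated here so that `super` keeps only its original columns for the remainder of this analysis.
```{r, eval=FALSE}
# Sketch (not evaluated): save the residuals and fitted values after fitting.
# If you rerun lm(Rating ~ ., data = super) after this, remember to exclude
# these new columns, e.g. lm(Rating ~ . - residuals - fitted, data = super).
super$residuals <- resid(super_lm)
super$fitted <- fitted(super_lm)
head(super)
```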
## Check Multiple Linear Regression Model Assumptions
### 1. The X's vs Y are linear
Check your scatterplot matrix from above, in addition to the residuals vs. fitted values plot, the residuals vs predictor plots, and the partial regression plots (below).
**(a) Scatterplot Matrix (already created - see above)**
**(b) Residuals vs. Fitted Values Plot**
```{r, fig.align='center', warning=FALSE}
resid_vs_fitted <- function(model) {
autoplot(model, which = 1, ncol = 1) +
theme_minimal() +
theme(aspect.ratio = 1)
}
resid_vs_fitted(super_lm)
```
**(c) Residuals vs. Predictor Plots (6 plots in total)**
```{r, fig.align='center', fig.height=5}
rpred_col <- function(data, residuals, predictor) {
ggplot(data = data,
mapping = aes(x = pull(data, predictor),
y = residuals)) +
geom_point() +
geom_smooth(se = FALSE, span = 0.95, n = 7, size = 0.5) +
geom_abline(slope = 0, intercept = 0, linetype = "dashed") +
theme_minimal() +
theme(aspect.ratio = 1) +
xlab(predictor) +
ylab("Residuals")
}
resid_vs_pred <- function(model) {
data <- model.frame(model)
predictors <- attr(model$terms, "term.labels")
plots <- lapply(predictors, rpred_col, data = data, residuals = resid(model))
plots["ncol"] <- ceiling(sqrt(length(plots)))
plots["top"] <- "Residuals vs Predictors"
do.call(grid.arrange, plots)
}
resid_vs_pred(super_lm)
```
**(d) Partial Regression Plots (also called "Added Variable" plots)**
Hint: use the `avPlots` function with your fitted model as the argument to the function.
```{r, fig.align='center', fig.height = 7}
added_variable_plots <- function(model) {
predictors <- attr(model$terms, "term.labels")
rows <- floor(sqrt(length(predictors)))
  cols <- ceiling(length(predictors) / rows)
par(pty = "s", cex.lab = 2, cex.axis = 1.5)
avPlots(model, layout = c(rows, cols), pch = 19)
}
added_variable_plots(super_lm)
```
### 2. The residuals are independent across all values of y
No code to test this assumption - you just have to think about it.
### 3. The residuals are normally distributed and centered at zero
**(a) Boxplot**
```{r, fig.align='center'}
make_boxplot <- function(model) {
residuals <- data.frame(residuals = resid(model))
ggplot(data = residuals, mapping = aes(y = residuals)) +
geom_boxplot() +
stat_summary(mapping = aes(x = 0),
fun = mean, geom = "point",
shape = 4, size = 2, color = "darkred") +
theme_classic() +
theme(aspect.ratio = 2,
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
# scale_y_continuous(limits = c(-20000, 30000), breaks = seq(-20000, 30000, 10000)) +
ylab("Residuals") +
xlab("")
}
make_boxplot(super_lm)
```
**(b) Histogram**
```{r, fig.align='center'}
make_hist <- function(model) {
residuals <- data.frame(residuals = resid(model))
ggplot(data = residuals, mapping = aes(x = residuals)) +
    geom_histogram(binwidth = sd(residuals$residuals) / 4, mapping = aes(y = ..density..)) +
stat_function(fun = dnorm,
color = "blue",
args = list(mean = 0,
sd = sd(residuals$residuals)),
size = 1.2) +
xlab("Residuals") +
ylab("Density") +
theme_light()
}
make_hist(super_lm)
```
**(c) Normal Probability Plot**
```{r, fig.align='center'}
QQPlot <- function(model) {
autoplot(model, which = 2, ncol = 1) +
theme_bw() +
theme(aspect.ratio = 1)
}
QQPlot(super_lm)
```
**(d) Shapiro-Wilk Test**
```{r}
shapiro.test(super_lm$residuals)
```
### 4. The residuals have constant variance across all values of the X's
Check the residuals vs. fitted values plot from above.
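For convenience, that plot can be redisplayed here using the `resid_vs_fitted` function defined earlier:
```{r, fig.align='center'}
resid_vs_fitted(super_lm)
```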
### 5. The model describes all observations (i.e., there are no influential points)
Check the scatterplot matrix, boxplot, histogram, normal probability plot, and the partial regression plots from above, in addition to Cook's Distance, DFBETAS, and DFFITS (below).
**(a) Scatterplot Matrix (already created - see above)**
**(b) Boxplot (already created - see above)**
**(c) Histogram (already created - see above)**
**(d) Normal Probability Plot (already created - see above)**
**(e) Partial Regression Plots (already created - see above)**
**(f) Cook's Distance**
```{r, fig.align='center'}
cooksd_plot <- function(model, nLabels = 3) {
cooks_d <- cooks.distance(model)
top_cd <- as.numeric(names(sort(cooks_d, decreasing = TRUE)[1:nLabels]))
ggplot() +
geom_point(data = tibble(cooks_d),
mapping = aes(x = as.numeric(names(cooks_d)),
y = cooks_d)) +
geom_text(mapping = aes(x = top_cd,
y = cooks_d[top_cd] + max(cooks_d) / 40,
label = top_cd)) +
theme_bw() +
ylab("Cook's Distance") +
xlab("Observation Number") +
geom_hline(mapping = aes(yintercept = 4 / length(cooks_d)),
color = "red", linetype = "dashed") +
theme(aspect.ratio = 1)
}
cooksd_plot(super_lm)
```
**(g) DFBETAS (6 plots total)**
Note: you may wish to plot both cut-off lines since the sample size is right on the boundary of the rough cut-off value of $n = 30$.
```{r, fig.align='center'}
dfb_col <- function(df_betas, predictor, nLabels = 3) {
# Find which observations have the highest dfbetas
top_vals <- df_betas[predictor] %>%
arrange(desc(abs(eval(parse(text = predictor))))) %>%
.[1:nLabels,] %>%
pull(predictor)
top_ind <- which(pull(df_betas, predictor) %in% top_vals)
out <- ggplot() +
geom_point(data = df_betas,
mapping = aes(x = as.numeric(rownames(df_betas)),
y = abs(pull(df_betas, predictor)))) +
geom_text(mapping = aes(x = top_ind,
y = abs(pull(df_betas, predictor)[top_ind]) + 0.07,
label = top_ind)) +
theme_bw() +
theme(aspect.ratio = 1) +
ylab("Abs of DFBETAS") +
xlab("Observation Number") +
ggtitle(predictor)
  # Rough cut-offs: 1 for small to medium data sets, 2 / sqrt(n) for large ones
  if(nrow(df_betas) <= 30) {
    out <- out +
      geom_hline(mapping = aes(yintercept = 1),
                 color = "red", linetype = "dashed")
  } else {
    out <- out +
      geom_hline(mapping = aes(yintercept = 2 / sqrt(nrow(df_betas))),
                 color = "red", linetype = "dashed")
  }
return(out)
}
plot_dfbetas <- function(model, nLabels = 3) {
predictors <- attr(model$terms, "term.labels")
df_betas <- as_tibble(dfbetas(model)[, predictors])
  plots <- lapply(predictors, dfb_col, df_betas = df_betas, nLabels = nLabels)
plots["ncol"] <- ceiling(sqrt(length(plots)))
do.call(grid.arrange, plots)
}
plot_dfbetas(super_lm)
```
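As an optional sketch of the note above, one way to overlay both rough cut-offs (1 and $2/\sqrt{n}$) on a single DFBETAS plot, illustrated here for the Complaints coefficient:
```{r, fig.align='center'}
# Optional: show both rough cut-offs on one plot, since n = 30 is on the boundary
df_betas_all <- as_tibble(dfbetas(super_lm))
ggplot(data = df_betas_all,
       mapping = aes(x = seq_along(Complaints), y = abs(Complaints))) +
  geom_point() +
  geom_hline(yintercept = 1, color = "red", linetype = "dashed") +
  geom_hline(yintercept = 2 / sqrt(nrow(df_betas_all)),
             color = "blue", linetype = "dashed") +
  theme_bw() +
  theme(aspect.ratio = 1) +
  xlab("Observation Number") +
  ylab("Abs of DFBETAS") +
  ggtitle("Complaints")
```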
**(h) DFFITS**
```{r, fig.align='center'}
plot_dffits <- function(model, nLabels = 3) {
df_fits <- dffits(model)
top_dff <- as.numeric(names(sort(abs(df_fits), decreasing = TRUE)[1:nLabels]))
df_fits_plot <- ggplot() +
geom_point(data = tibble(df_fits),
mapping = aes(x = as.numeric(names(df_fits)),
y = abs(df_fits))) +
    geom_text(mapping = aes(x = top_dff,
                            y = abs(df_fits[top_dff]) + max(abs(df_fits)) / 40,
                            label = top_dff)) +
theme_bw() +
ylab("Absolute Value of DFFITS for Y") +
xlab("Observation Number") +
theme(aspect.ratio = 1)
  # Rough cut-offs: 1 for small to medium data sets, 2 * sqrt(p / n) for large ones
  if(length(df_fits) <= 30) {
    df_fits_plot +
      geom_hline(mapping = aes(yintercept = 1),
                 color = "red", linetype = "dashed")
  } else {
    df_fits_plot +
      geom_hline(mapping = aes(yintercept =
                                 2 * sqrt(length(model$coefficients) /
                                            length(df_fits))),
                 color = "red", linetype = "dashed")
  }
}
plot_dffits(super_lm)
```
### 6. All important predictors are included
No code to test this assumption - you just have to think about it.
### 7. No Multicollinearity
Check the scatterplot matrix and correlation matrix from above, in addition to the variance inflation factors (below).
**(a) Scatterplot Matrix (already created - see above)**
```{r, fig.align='center'}
# Reusing the function defined above for convenience. If you have added
# residuals, DFFITS, DFBETAS, etc. to `super`, subset back to the original
# columns first, e.g. point_matrix(select(super, Rating:Advance))
point_matrix(super)
```
**(b) Correlation Matrix (already created - see above)**
```{r, fig.align='center'}
# Reusing the function defined above for convenience. Again, subset to the
# original columns if you have added residuals, DFFITS, DFBETAS, etc. to
# `super`, e.g. show_cor(select(super, Rating:Advance))
show_cor(super)
```
**(c) Variance Inflation Factors (VIF)**
Hint: use the `vif` function with your fitted model as the argument to the function, and use the criteria we discussed in class to check for multicollinearity.
```{r}
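# Common rule of thumb (compare with the criteria from class): a VIF above 10,
# or an average VIF well above 1, suggests problematic multicollinearity.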
vif(super_lm)
```
## Use the Model for Inference
### Use the `confint` R function to create a 95% confidence interval for each coefficient. Does the interval for Complaints match the interval in the course notes?
```{r}
confint(super_lm)
```
### Using the `predict` function, calculate a 95% confidence interval for the average supervisor rating ($Y$) when Complaints = 60, Privileges = 50, Learn = 56, Raises = 63, Critical = 76, and Advance = 40. Does this match the interval in the course notes?
```{r}
# One way to do this (a sketch; values taken from the prompt above):
new_obs <- data.frame(Complaints = 60, Privileges = 50, Learn = 56,
                      Raises = 63, Critical = 76, Advance = 40)
predict(super_lm, newdata = new_obs, interval = "confidence", level = 0.95)
```
### Using the `predict` function, calculate a 95% prediction interval for the supervisor rating ($Y$) of an individual supervisor when Complaints = 60, Privileges = 50, Learn = 56, Raises = 63, Critical = 76, and Advance = 40. Does this match the interval in the course notes?
```{r}
# Same new observation as above, but with a prediction interval:
new_obs <- data.frame(Complaints = 60, Privileges = 50, Learn = 56,
                      Raises = 63, Critical = 76, Advance = 40)
predict(super_lm, newdata = new_obs, interval = "prediction", level = 0.95)
```
## Could we have simplified the model (more on this to come)?
### Use the `anova` R function to test some coefficients, Learn and Raises, simultaneously. The function will take two arguments: (1) the original model and (2) a model excluding Learn and Raises. Can we safely drop Learn and Raises from the model?
```{r}
super_lm_reduced <- lm(Rating ~ . - Learn - Raises, data = super)
anova(super_lm, super_lm_reduced)
```
Since the p-value is relatively large, there is no significant difference between the two models, so we can go with the simpler model; that is, we can safely drop Learn and Raises.
## Summary and Conclusions
Overall, the assumptions all seem to be *roughly* met. We will discuss next steps for this analysis in the upcoming module, but you could certainly try transformations to the data, as we discussed in Module 2. Again, we should *always* start a data analysis with exploratory data analysis (EDA), then we should check to make sure our model assumptions are met, and then we can proceed with statistical inference.