---
title: "Module 2 - Simple Linear Regression Model Assumptions"
subtitle: <center> <h1>In-Class Analysis</h1> </center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
# load packages here
library(tidyverse)
library(ggfortify) # plot lm objects using ggplot instead of base R
library(car) # Brown-Forsythe Test and Box-Cox transformation
```
## Data and Description
Recent increases in gas prices make buyers more prone to purchase a car with better gas mileage, as measured by the **miles per gallon (MPG)**. Because of this, car manufacturers are increasingly trying to produce the car that gives the best MPG. Complicating this process are the many factors that go into determining what gas mileage a car will achieve on the road.
One such factor is the **weight** of the car. While it is generally understood that heavier cars get lower MPG, there is little understanding of how much a given increase in weight will lead to a decrease in MPG. By understanding this relationship, manufacturers will be able to perform a cost--benefit analysis that will assist them in their vehicle production.
The MPG data set contains measurements of the **weight (column 1)** (in pounds) and **MPG (column 2)** of 289 cars. Download the MPGData.txt file from Canvas, and put it in the same folder as this R Markdown file.
Do the following (what we've done before):
1. Read in the data set, take a look at the top few rows, and look at a summary of the data.
2. Create a scatterplot of the data with Weight on the x-axis and MPG on the y-axis, and overlay the linear regression line.
3. Apply linear regression to the data, and save the residuals and fitted values to the `cars` data frame.
```{r}
# Note: this code is all from Module 1
cars <- read.csv("MPGData.txt", header = TRUE, sep = " ")
head(cars)
summary(cars)
# we are saving this plot as a variable since we will use it later when
# checking assumptions
cars_base_plot <- ggplot(data = cars, mapping = aes(x = Weight, y = MPG)) +
geom_point() +
theme_bw() +
scale_x_continuous(limits = c(1500, 3500)) +
scale_y_continuous(limits = c(10, 50)) +
theme(aspect.ratio = 1)
cars_base_plot + geom_smooth(method = "lm", se = FALSE)
cars_lm <- lm(MPG ~ Weight, data = cars)
summary(cars_lm)
cars$residuals <- cars_lm$residuals
cars$fittedMPG <- cars_lm$fitted.values
```
## Diagnostics: Check That Assumptions are Met
### 1. (L) x vs y is linear
**(a) Scatterplot**
```{r, fig.align='center'}
cars_base_plot
```
The trend appears linear (not quadratic, exponential, etc.), so this assumption appears to be met.
**(b) Residuals vs. Fitted Values Plot**
```{r, fig.align='center'}
# save the plot as a variable since we will use it later for other assumptions
cars_resid_vs_fit <- autoplot(cars_lm, which = 1, ncol = 1, nrow = 1) +
theme_bw() +
scale_y_continuous(limits = c(-20, 20)) +
scale_x_continuous(limits = c(15, 40)) +
theme(aspect.ratio = 1)
cars_resid_vs_fit
```
This plot labels the three most "extreme" data points (by their row numbers) in the data set. The colored, solid line should roughly follow the dotted, horizontal line at 0. This plot suggests we can assume a linear relationship between MPG and Weight.
**(c) Residuals vs. Predictor Plot**
```{r, fig.align='center'}
ggplot(data = cars, mapping = aes(x = Weight, y = residuals)) +
geom_point() +
theme_bw() +
scale_y_continuous(limits = c(-20, 20)) +
scale_x_continuous(limits = c(1500, 3500)) +
theme(aspect.ratio = 1)
```
We should see an even spread of points around the horizontal line to conclude the two variables have a linear relationship. This plot suggests MPG and Weight are linearly associated.
### 2. (I) The residuals are independent across all values of y
This assumption is difficult to test statistically. Generally, if you have a random sample, the residuals will be independent. Since the description of the data did not include how the cars were sampled, we do not know if this assumption is met.
*If* the observations in this data set were in a natural order, then we could use a sequence plot to assess dependence, *but a sequence plot is inappropriate here* since the data are not in a natural order. Below is the code to create a sequence plot, for your reference only.
**(a) Sequence Plot**
```{r, eval=FALSE}
# Note: this plot is not appropriate to create for this data set
ggplot(data = cars, mapping = aes(x = 1:dim(cars)[1], y = residuals)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits = c(-15, 20)) +
scale_x_continuous(limits = c(0, 295)) +
xlab("Order in Data Set") +
theme(aspect.ratio = 1)
```
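A Durbin-Watson test (available in the `car` package, which is already loaded) is another way to check for serial correlation in the residuals. Like the sequence plot, though, it is only meaningful when the observations have a natural order, so it is not appropriate here - this sketch is for reference only.
```{r, eval=FALSE}
# Note: like the sequence plot above, this test is not appropriate for this
# data set because the rows have no natural order; shown for reference only.
# The null hypothesis is that the residuals are not autocorrelated.
durbinWatsonTest(cars_lm)
```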
### 3. (N) The residuals are normally distributed and centered at zero
**(a) Boxplot**
```{r, fig.align='center'}
cars_boxplot <- ggplot(data = cars, mapping = aes(y = residuals)) +
geom_boxplot() +
theme_bw() +
scale_y_continuous(limits = c(-20, 20)) +
theme(aspect.ratio = 1)
cars_boxplot
```
The boxplot should be centered at 0, have roughly equal area in the box on both sides of the median, and have whiskers similar in length with few outliers. This plot suggests normality, except for the outliers at the top, which indicate slight right-skewness - a possible cause for concern.
**(b) Histogram**
```{r, fig.align='center'}
cars_hist <- ggplot(data = cars, mapping = aes(x = residuals)) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 2) +
stat_function(fun = dnorm,
color = "red",
size = 2,
args = list(mean = mean(cars$residuals),
sd = sd(cars$residuals))) +
theme_bw() +
scale_x_continuous(limits = c(-20, 20)) +
scale_y_continuous(limits = c(0, 0.1)) +
theme(aspect.ratio = 1)
cars_hist
```
The histogram should follow the superimposed, red normal curve. The distribution of residuals looks slightly right-skewed.
**(c) Normal Probability Plot**
```{r, fig.align='center'}
cars_qq <- autoplot(cars_lm, which = 2, ncol = 1, nrow = 1) +
theme_bw() +
scale_x_continuous(limits = c(-3, 3)) +
scale_y_continuous(limits = c(-4, 4)) +
theme(aspect.ratio = 1)
cars_qq
```
The points should follow the dashed, diagonal line. This plot shows non-negligible deviation from the line - especially at the upper theoretical quantiles. This is cause for concern.
**(d) Shapiro-Wilk Test**
```{r}
shapiro.test(cars_lm$residuals)
```
Since the p-value is small, we reject the null hypothesis. We conclude that the data do not come from a normal distribution.
### 4. (E) The residuals have constant variance across all values of x
**(a) Residuals vs. Fitted Values Plot**
```{r, fig.align='center'}
cars_resid_vs_fit
```
The residuals should be equally spread around the horizontal line. There may be slightly more spread at larger fitted values than at smaller fitted values. This is mildly concerning.
**(b) Brown-Forsythe Test**
```{r}
grp <- as.factor(c(rep("lower", floor(dim(cars)[1] / 2)),
rep("upper", ceiling(dim(cars)[1] / 2))))
leveneTest(cars[order(cars$Weight), "residuals"] ~ grp, center = median)
```
Since the p-value is very small, we reject the null hypothesis of constant variance and conclude the variance is likely not constant (heteroscedasticity).
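As an optional cross-check (not required for this analysis), the Breusch-Pagan test has the same null hypothesis of constant variance and works directly from the fitted model. This sketch assumes the `lmtest` package is installed.
```{r, eval=FALSE}
# Optional cross-check (assumes the lmtest package is installed): the
# Breusch-Pagan test also has constant variance as its null hypothesis
library(lmtest)
bptest(cars_lm)
```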
### 5. (A) The model describes all observations (i.e., there are no influential points)
**(a) Scatterplot**
```{r, fig.align='center'}
cars_base_plot
```
There may be some potential influential points, but none look striking.
**(b) Boxplot**
```{r, fig.align='center'}
cars_boxplot
```
The boxplot shows roughly 9 potential outliers.
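If you would rather count the flagged points than eyeball them, base R's `boxplot.stats()` reports the values beyond the 1.5 x IQR whiskers - essentially the same rule `geom_boxplot()` uses by default (the quantile conventions can differ slightly).
```{r}
# Count the residuals flagged as outliers by the 1.5 * IQR boxplot rule
length(boxplot.stats(cars$residuals)$out)
```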
**(c) Histogram**
```{r, fig.align='center'}
cars_hist
```
No apparent outliers.
**(d) Normal Probability Plot**
```{r, fig.align='center'}
cars_qq
```
Possible outliers at the upper right area of the plot.
**(e) Cook's Distance**
```{r, fig.align='center'}
# get Cook's distance values for all observations
cars$cooksd <- cooks.distance(cars_lm)
# plot Cook's distance against the observation number
ggplot(data = cars) +
geom_point(mapping = aes(x = as.numeric(rownames(cars)),
y = cooksd)) +
theme_bw() +
ylab("Cook's Distance") +
xlab("Observation Number") +
geom_hline(mapping = aes(yintercept = 4 / length(cooksd)),
color = "red", linetype = "dashed") +
scale_x_continuous(limits = c(0, 300)) +
scale_y_continuous(limits = c(0, 0.05)) +
theme(aspect.ratio = 1)
# print a list of potential outliers according to Cook's distance
cars %>%
mutate(rowNum = row.names(cars)) %>% # save original row numbers
filter(cooksd > 4 / length(cooksd)) %>% # select potential outliers
arrange(desc(cooksd)) # order from largest Cook's distance to smallest
```
There are 14 potential outliers according to Cook's distance. I wouldn't be terribly concerned about any of them - one stands out somewhat, but it isn't far from the rest of the data.
**(f) DFBETAS**
```{r, fig.align='center'}
# calculate the DFBETAS for Weight
cars$dfbetas_weight <- as.vector(dfbetas(cars_lm)[, 2])
# plot the DFBETAS against the observation number
ggplot(data = cars) +
geom_point(mapping = aes(x = as.numeric(rownames(cars)),
y = abs(dfbetas_weight))) +
theme_bw() +
ylab("Absolute Value of DFBETAS for Weight") +
xlab("Observation Number") +
# for n > 30
geom_hline(mapping = aes(yintercept = 2 / sqrt(length(dfbetas_weight))),
color = "red", linetype = "dashed") +
# for n <= 30 (code for future, small data sets)
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") +
scale_x_continuous(limits = c(0, 300)) +
scale_y_continuous(limits = c(0, 0.25)) +
theme(aspect.ratio = 1)
# print a list of potential influential points according to DFBETAS
# for n > 30
cars %>%
mutate(rowNum = row.names(cars)) %>% # save original row numbers
filter(abs(dfbetas_weight) > 2 /
sqrt(length(rownames(cars)))) %>% # select potential influential pts
arrange(desc(abs(dfbetas_weight))) # order from largest DFBETAS to smallest
# for n <= 30 (code for future, small data sets)
# cars %>%
# mutate(rowNum = row.names(cars)) %>% # save original row numbers
# filter(abs(dfbetas_weight) > 1) %>% # select potential influential pts
# arrange(desc(abs(dfbetas_weight))) # order from largest DFBETAS to smallest
```
No points stand far from the rest, so I would say there are no influential points here.
**(g) DFFITS**
```{r, fig.align='center'}
# calculate the DFFITS
cars$dffits <- dffits(cars_lm)
# plot the DFFITS against the observation number
ggplot(data = cars) +
geom_point(mapping = aes(x = as.numeric(rownames(cars)),
y = abs(dffits))) +
theme_bw() +
ylab("Absolute Value of DFFITS for Y") +
xlab("Observation Number") +
# for n > 30
geom_hline(mapping = aes(yintercept = 2 * sqrt(length(cars_lm$coefficients) /
length(dffits))),
color = "red", linetype = "dashed") +
# for n <= 30 (code for future, small data sets)
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") +
scale_x_continuous(limits = c(0, 300)) +
scale_y_continuous(limits = c(0, 0.3)) +
theme(aspect.ratio = 1)
# print a list of potential influential points according to DFFITS
# for n > 30
cars %>%
mutate(rowNum = row.names(cars)) %>% # save original row numbers
# select potential influential pts
filter(abs(dffits) > 2 * sqrt(length(cars_lm$coefficients) /
length(dffits))) %>%
arrange(desc(abs(dffits))) # order from largest DFFITS to smallest
# for n <= 30 (code for future, small data sets)
# cars %>%
# mutate(rowNum = row.names(cars)) %>% # save original row numbers
# filter(abs(dffits) > 1) %>% # select potential influential pts
# arrange(desc(abs(dffits))) # order from largest DFFITS to smallest
```
Again, I'm not too concerned about influential points based on this plot. There may be some outliers, as indicated by other plots, but I don't think any of them are terribly influential. I would move on, concluding this assumption is met.
### 6. Additional predictor variables are unnecessary
This assumption is very likely not met. There are many other variables that could help predict MPG, like the number of cylinders, vehicle type, etc.
## Summarize Findings:
#### 1. x vs y is linear
From the diagnostics, this assumption appears to be met.
#### 2. The residuals are independent across all values of y
We do not know if this assumption is met since we lack information on how the data were collected.
#### 3. The residuals are normally distributed and centered at zero
The residuals look slightly right-skewed - we can do better. We will transform MPG to try to improve this. This assumption is not met.
#### 4. The residuals have constant variance across all values of x
The diagnostics indicate possible heteroscedasticity. We may want to address this by transforming MPG. This assumption is not met.
#### 5. The model describes all observations (i.e., there are no influential points)
I think this assumption is met. While there may be some outliers, none seem to be terribly influential.
#### 6. Additional predictor variables are unnecessary
This assumption is likely not met, as discussed above.
## Remedial Measures: "Fix" Unmet Assumptions
Given our assessment of the assumptions, we will want to perform some kind of transformation. Since the trend looks fairly linear, but other assumptions were not met, we will focus on transforming MPG instead of Weight. We will use the Box-Cox approach to find the "best" transformation of MPG (Y).
```{r, fig.align='center'}
bc <- boxCox(cars_lm) # plot curve
bc$x[which.max(bc$y)] # pull out the "best" lambda value
```
The "best" lambda value is -0.14, but a lambda value of 0 is much more interpretable, and, since 0 is contained in the 95% confidence interval, we will use lambda = 0 (corresponds to a log transformation).
Let's transform MPG using the log transform, refit the model, and run the diagnostics again to gauge improvement.
```{r}
cars$MPG_trans <- log(cars$MPG)
cars_lm_trans <- lm(MPG_trans ~ Weight, data = cars)
summary(cars_lm_trans)
cars$residuals_trans <- cars_lm_trans$residuals
cars$fittedMPG_trans <- cars_lm_trans$fitted.values
```
It can be informative to view the fitted line (straight on the log scale) as the curve it traces on the original scale. Here is how to plot the transformed regression model on the original scale of the data.
```{r}
# Sequence of Weight values that we are interested in using to predict MPG
Weight_values <- seq(min(cars$Weight), max(cars$Weight), length = 100)
# Predictions of **log(MPG)** across those values of Weight
log_MPG_preds <- predict(cars_lm_trans,
newdata = data.frame(Weight = Weight_values))
# Predictions of **MPG** (back-transformed) across those values of Weight
MPG_preds <- exp(log_MPG_preds) # use exp to "undo" the log transform
# Store results in a data frame for plotting
preds <- data.frame("Weight_values" = Weight_values,
"MPG_preds" = MPG_preds)
# Plot the predictions on the original scale (to get a curved line)
cars_base_plot +
geom_line(data = preds,
aes(x = Weight_values, y = MPG_preds),
size = 1.5, color ="blue")
```
### Re-Check Assumptions with new Y values
#### 1. (L) x vs y is linear
**(a) Scatterplot**
```{r, fig.align='center'}
ggplot(data = cars, mapping = aes(x = Weight, y = MPG_trans)) +
geom_point() +
theme_bw() +
ylab("log(MPG)") +
scale_x_continuous(limits = c(1500, 3500)) +
scale_y_continuous(limits = c(2.5, 4)) +
theme(aspect.ratio = 1)
```
The trend still appears linear.
**(b) Residuals vs. Fitted Values Plot**
```{r, fig.align='center'}
cars_resid_vs_fit_trans <- autoplot(cars_lm_trans,
which = 1, ncol = 1, nrow = 1) +
theme_bw() +
scale_y_continuous(limits = c(-0.75, 0.75)) +
scale_x_continuous(limits = c(2.8, 3.8)) +
theme(aspect.ratio = 1)
cars_resid_vs_fit_trans
```
This plot suggests a linear relationship between MPG_trans and Weight.
**(c) Residuals vs. Predictor Plot**
```{r, fig.align='center'}
ggplot(data = cars, mapping = aes(x = Weight, y = residuals_trans)) +
geom_point() +
theme_bw() +
scale_y_continuous(limits = c(-0.75, 0.75)) +
scale_x_continuous(limits = c(1500, 3500)) +
theme(aspect.ratio = 1)
```
This plot still suggests a linear relationship.
#### 3. (N) The residuals are normally distributed and centered at zero
**(a) Boxplot**
```{r, fig.align='center'}
ggplot(data = cars, mapping = aes(y = residuals_trans)) +
geom_boxplot() +
theme_bw() +
scale_y_continuous(limits = c(-0.75, 0.75)) +
theme(aspect.ratio = 1)
```
The boxplot looks okay. A slight right-skew is evident.
**(b) Histogram**
```{r, fig.align='center'}
ggplot(data = cars, mapping = aes(x = residuals_trans)) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 0.1) +
stat_function(fun = dnorm,
color = "red",
size = 2,
args = list(mean = mean(cars$residuals_trans),
sd = sd(cars$residuals_trans))) +
theme_bw() +
scale_x_continuous(limits = c(-0.75, 0.75)) +
scale_y_continuous(limits = c(0, 3)) +
theme(aspect.ratio = 1)
```
The residuals look normally distributed.
**(c) Normal Probability Plot**
```{r, fig.align='center'}
autoplot(cars_lm_trans, which = 2, ncol = 1, nrow = 1) +
theme_bw() +
scale_x_continuous(limits = c(-3, 3)) +
scale_y_continuous(limits = c(-4, 4)) +
theme(aspect.ratio = 1)
```
The points follow the diagonal line a lot more closely than they did prior to transforming MPG, and the tails are pulled in much more than they were previously. This looks like a pretty good QQ plot!
**(d) Shapiro-Wilk Test**
```{r}
shapiro.test(cars$residuals_trans)
```
The p-value is not significant, so we fail to reject the null hypothesis - there is no evidence the residuals are non-normal.
#### 4. (E) The residuals have constant variance across all values of x
**(a) Residuals vs. Fitted Values Plot**
```{r, fig.align='center'}
cars_resid_vs_fit_trans
```
The residuals seem equally spread around the horizontal line. Looks good.
**(b) Brown-Forsythe Test**
```{r}
grp <- as.factor(c(rep("lower", floor(dim(cars)[1] / 2)),
rep("upper", ceiling(dim(cars)[1] / 2))))
leveneTest(cars[order(cars$Weight), "residuals_trans"] ~ grp, center = median)
```
The p-value is not significant, so we fail to reject the null hypothesis of constant variance - the homoscedasticity assumption appears to be met.
### Repeat
We will not do this here, but it would be good to try a few more transformations, view the diagnostics, and from there determine the "best" model fit among all the transformed models.
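For example, one alternative worth comparing (a sketch only - we do not fit it here) is the Box-Cox power transformation using the raw "best" lambda instead of rounding it to 0:
```{r, eval=FALSE}
# A sketch of one alternative: apply the Box-Cox power transformation with
# the raw "best" lambda, refit, and then re-run the diagnostics above
lambda <- bc$x[which.max(bc$y)]
cars$MPG_bc <- (cars$MPG^lambda - 1) / lambda
cars_lm_bc <- lm(MPG_bc ~ Weight, data = cars)
summary(cars_lm_bc)
```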
## Summary and Conclusions
*Always* start by plotting your data (exploratory data analysis: view data, create scatterplot, summarize data, etc.) before jumping into an analysis or fitting a model. We saw MPG and Weight looked to be linearly correlated, so we chose to fit a simple linear regression model to the data. Once the model was fit, we ran diagnostics to ensure the assumptions underlying the linear regression model were met. We found some evidence to suggest that the residuals were not homoscedastic or normally distributed. To fix this, we applied a Box-Cox approach and determined the log transform would be the best transformation for MPG. We applied this transformation, refit the model, and re-checked the assumptions. The assumptions all appear to be met. In practice, we could try additional transformations, compare the transformed models, and then pick the best model. Once we have a model that satisfies the assumptions, we can look at our model coefficients and p-values and safely draw conclusions.
Note: When interpreting the model coefficients, remember you are on a transformed scale. For example, to interpret the slope of -0.0003845 from the transformed model, we would say, "Average MPG decreases by 0.038% for every 1 pound increase in Weight."
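That percentage comes from back-transforming the slope: a 1-pound increase in Weight multiplies MPG by exp(beta1), i.e., changes it by (exp(beta1) - 1) * 100 percent. A quick check using the model fit above:
```{r}
# Back-transform the slope to a percent change in MPG per pound of Weight
beta1 <- coef(cars_lm_trans)["Weight"]
(exp(beta1) - 1) * 100 # approximately -0.038%
```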