---
title: "Homework 2"
subtitle: <center> <h1>Simple Linear Regression Model Assumptions</h1> </center>
author: <center> Joshua Carpenter </center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
library(tidyverse)
library(ggfortify)
library(car)
```
## Data and Description
One key component of determining appropriate speed limits is the amount of distance that is required to stop at a given speed. For example, in residential neighborhoods, when pedestrians are commonly in the roadways, it is important to be able to stop in a very short distance to ensure pedestrian safety. The speed of vehicles may be useful for determining the distance required to stop at that given speed, which can aid public officials in determining speed limits.
The Stopping Distance data set compares the **distance (column 2)** (in feet) required for a car to stop on a certain rural road against the **speed (column 1)** (MPH) of the car. Download the StoppingDistance.txt file from Canvas, and put it in the same folder as this R Markdown file.
#### 0. Replace the text "< PUT YOUR NAME HERE >" (above next to "author:") with your full name.
#### 1. Read in the data set, and call the data frame "stop". Print a summary of the data and make sure the data makes sense.
```{r, message=FALSE}
(stop <- read_table2("StoppingDistance.txt"))
summary(stop)
```
#### 2. Create a scatterplot of the data with variables on the appropriate axes (think about which variable makes the most sense to be the response). Make your plot look professional (make sure the axes have appropriate limits to capture the data nicely, make sure the axis labels are descriptive, etc.).
```{r, fig.align='center'}
(stop.plot <- ggplot(data = stop, mapping = aes(x = Speed, y = Distance)) +
geom_point() +
xlab("Car Speed (mph)") +
ylab("Stopping Distance (ft)") +
coord_fixed() +
theme_minimal() +
scale_x_continuous(limits = c(0, 50), minor_breaks = NULL) +
ylim(NA, 150))
```
#### 3. Briefly describe the relationship between Speed and Distance. (Hint: you should use 2 or 3 key words.)
As speed increases, stopping distance also increases, as we would expect. The relationship is strong, positive, and roughly linear, though it appears slightly curved.
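As a quick numeric check on that description (a brief sketch; the exact value depends on the data read in above), the sample correlation can be computed directly:
```{r}
# Sample correlation between speed and stopping distance; a value close to 1
# supports the "strong, positive" description above
cor(stop$Speed, stop$Distance)
```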
#### 4. Add the OLS regression line to the scatterplot you created in question 2 (note: if you receive a warning about rows with missing values, you may need to adjust an axis limit).
```{r, fig.align='center', message=FALSE}
stop.plot +
geom_smooth(method = "lm", se = F)
```
#### 5. Apply linear regression to the data (no transformations), and save the residuals and fitted values to the `stop` dataframe.
```{r}
stop.lm <- lm(Distance ~ Speed, data = stop)
summary(stop.lm)
stop <- stop %>%
mutate(resid = stop.lm$residuals, fit = stop.lm$fitted.values)
```
#### 6. Mathematically write out the fitted simple linear regression model for this data set using the coefficients you found above. Do not use "x" and "y" in your model - use variable names that are fairly descriptive.
$\widehat{\text{Distance}}_i=-20.131+3.142\cdot\text{Speed}_i$\
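The two coefficients in that equation come straight from the fitted object. As a short illustrative sketch (the 25 mph input is just an example, not part of the assignment), they can be extracted and used for a prediction like this:
```{r}
# Estimated intercept and slope used in the fitted equation above
coef(stop.lm)
# Illustrative prediction: estimated stopping distance (ft) at an example speed of 25 mph
predict(stop.lm, newdata = tibble(Speed = 25))
```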
### Questions 7-12 involve using diagnostics to determine if the linear regression assumptions are met. For each assumption, (1) perform appropriate diagnostics to determine if the assumption is violated, and (2) explain whether or not you think the assumption is violated and why you think that.
#### 7. (L) X vs Y is linear (use at least two diagnostic tools)
```{r, fig.align='center',message=FALSE}
# Scatterplot
stop.plot
# Resid vs. Fitted
(stop.resid_fitted <- autoplot(stop.lm, which = 1, ncol = 1) +
  theme_minimal() +
  coord_fixed())
```
No, the scatterplot and the residuals vs. fitted values plot both show slight curvature, so the linearity assumption does not appear to be fully met.
#### 8. (I) The residuals are independent (no diagnostic tools - just think about how the data was collected and briefly write your thoughts)
Given the information we have, we cannot know whether the observations are independent. There could be some dependence based on the cars used in the experiment or the times at which the data was collected.
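As an optional sanity check only (the prompt asks for reasoning rather than diagnostics), if the rows happen to be stored in collection order, a residuals-versus-order plot could hint at serial dependence; a flat, patternless band would be consistent with independence:
```{r, fig.align='center'}
# Residuals plotted in row order; only informative if the row order reflects
# the order in which the measurements were taken (an assumption)
ggplot(data = stop, mapping = aes(x = seq_along(resid), y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Observation Order") +
  ylab("Residual") +
  theme_minimal()
```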
#### 9. (N) The residuals are normally distributed and centered at zero (use at least three diagnostic tools)
```{r, fig.align='center'}
# Q-Q
(stop.qq <- autoplot(stop.lm, which = 2, ncol = 1, nrow = 1) +
theme_light() +
coord_fixed())
# Histogram
ggplot(data = stop, mapping = aes(x = resid(stop.lm))) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 3) +
stat_function(fun = dnorm,
color = "blue",
args = list(mean = mean(stop$resid),
sd = sd(stop$resid))) +
xlab("Residuals") +
ylab("Density") +
theme_light()
# Boxplot
ggplot(data = stop, mapping = aes(y = resid(stop.lm))) +
geom_boxplot() +
stat_summary(mapping = aes(x = 0),
fun = mean, geom = "point",
shape = 4, size = 2, color = "darkred") +
theme_classic() +
theme(aspect.ratio = 2,
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
scale_y_continuous(limits = c(-30, 40), breaks = seq(-30, 40, 10)) +
ylab("Residuals") +
xlab("")
```
The residuals are somewhat right-skewed rather than normally distributed, so this assumption is not met.
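As one more hedged check on that conclusion (not required by the prompt), a Shapiro-Wilk test gives a formal complement to the plots above:
```{r}
# Shapiro-Wilk test of normality on the residuals; a small p-value indicates
# evidence against normality, consistent with the right skew seen above
shapiro.test(resid(stop.lm))
```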
#### 10. (E) The residuals have equal/constant variance across all values of X (use two diagnostic tools)
```{r, fig.align='center'}
#Resid vs Fitted
stop.resid_fitted
# Brown-Forsythe test: split the observations (ordered by Speed) into a lower
# and an upper half and compare the spread of the residuals between the halves
grp <- as.factor(c(rep("lower", floor(dim(stop)[1] / 2)),
                   rep("upper", ceiling(dim(stop)[1] / 2))))
leveneTest(pull(stop[order(stop$Speed), "resid"]) ~ grp, center = median)
```
The Residuals vs. Fitted Values plot suggests the variance is not constant, and the Brown-Forsythe statistic is large enough to cause some doubt. This assumption is not met.
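As an additional hedged check (beyond the two tools the prompt asks for), the score test for non-constant variance from the `car` package addresses the same question as the Brown-Forsythe test:
```{r}
# Score test for non-constant error variance; a small p-value indicates
# evidence against the equal-variance assumption
ncvTest(stop.lm)
```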
#### 11. (A) The model describes all observations (i.e., there are no influential points) (use at least four diagnostic tools)
```{r, fig.align='center'}
#Q-Q
stop.qq
# Cook's Distance
stop <- stop %>% mutate(cooksd = cooks.distance(stop.lm))
top3cd <- as.numeric(names(sort(stop$cooksd, decreasing = TRUE)[1:3]))
ggplot() +
geom_point(data = stop,
mapping = aes(x = as.numeric(rownames(stop)),
y = cooksd)) +
geom_text(mapping = aes(x = top3cd,
y = stop$cooksd[top3cd],
label = top3cd)) +
theme_bw() +
ylab("Cook's Distance") +
xlab("Observation Number") +
geom_hline(mapping = aes(yintercept = 4 / length(stop$cooksd)),
color = "red", linetype = "dashed") +
theme(aspect.ratio = 1)
# DFBETAS
stop <- stop %>%
mutate(dfbetas_speed = dfbetas(stop.lm)[, "Speed"])
names(stop$dfbetas_speed) <- 1:nrow(stop)
top3dfbeta <- as.numeric(names(
  sort(abs(stop$dfbetas_speed), decreasing = TRUE)[1:3]
))
# Plot the DFBETAS against the observation number
ggplot() +
  geom_point(data = stop,
             mapping = aes(x = as.numeric(rownames(stop)),
                           y = abs(dfbetas_speed))) +
  geom_text(mapping = aes(x = top3dfbeta,
                          y = abs(stop$dfbetas_speed[top3dfbeta]),
                          label = top3dfbeta)) +
  theme_bw() +
  ylab("Absolute Value of DFBETAS for Speed") +
  xlab("Observation Number") +
  geom_hline(mapping = aes(yintercept = 2 / sqrt(length(stop$dfbetas_speed))),
             color = "red", linetype = "dashed") +
  theme(aspect.ratio = 1)
#Resid vs Fitted
stop.resid_fitted
```
All of the metrics identify observations 55, 60, and 62 as potentially influential points. However, based on the plots of Cook's distance and DFBETAS, I would say we only need to worry about observations 60 and 62.
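To see what makes those points stand out, one brief sketch is to print the flagged rows (observation numbers taken from the labels in the plots above) alongside their diagnostic values:
```{r}
# Rows flagged as potentially influential in the plots above
stop[c(55, 60, 62), c("Speed", "Distance", "resid", "cooksd", "dfbetas_speed")]
```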
#### 12. (R) Additional predictor variables are not required (no diagnostic tools - just think about the variables you have and if there are other variables you think would help predict the response)
As in most simple linear regressions, this assumption is not perfectly met. Other variables, such as tire type, car weight, or whether the car has an anti-lock braking system, could also help predict stopping distance.
#### 13. Based on your analysis of the diagnostic measures, briefly discuss why this simple linear regression model on the raw data (not transformed) is *not* appropriate.
The assumptions for linear regression are poorly met at best. In particular, the relationship does not appear to be linear, the residuals are not normally distributed, and their variance is not constant.
#### 14. Fix the model by making any necessary transformations. Justify the transformation you chose in words (why did you choose to transform just x, just y, or both?). (Note: if boxCox(mod) throws an error, replace mod with the formula for the linear model, y ~ x.) (Note: you will most likely need to repeat questions 14 and 18 until you are satisfied with the transformation you chose. Only then should you fill out this section - I only want to see the model you end up choosing, not all of your attempted models.)
```{r, fig.align='center'}
# Box-Cox profile likelihood to suggest a transformation of the response
bc <- boxCox(stop$Distance ~ stop$Speed)
# Square-root transform the response and refit
stop_sqrt <- stop %>%
  mutate(Distance = sqrt(Distance))
stop_sqrt.lm <- lm(Distance ~ Speed, data = stop_sqrt)
summary(stop_sqrt.lm)
stop_sqrt <- stop_sqrt %>%
mutate(resid = stop_sqrt.lm$residuals, fit = stop_sqrt.lm$fitted.values)
```
Since the residuals are not normally distributed, we most likely need to transform y. The Box-Cox procedure suggests that a square-root transformation would be the most advantageous. After applying that transformation the data meet the assumptions, so we need look no further.
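To make that justification concrete, the lambda value that maximizes the Box-Cox profile log-likelihood can be read off the `bc` object created in the chunk above; a value near 0.5 is what points to the square-root transformation:
```{r}
# Lambda with the highest Box-Cox profile log-likelihood; a value near 0.5
# corresponds to a square-root transformation of the response
bc$x[which.max(bc$y)]
```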
### Now, in Questions 15-18, re-check your transformed model and verify that the assumptions (the assumptions that were addressed in the questions above) are met. Provide a brief discussion about how each of the previously violated assumptions are now satisfied. Also, provide the code you used to assess adherence to the assumptions. (Note that transforming will not change your responses about (I) the residuals being independent and (R) additional predictor variables not being required, so we will skip these assumptions here.)
#### 15. (L) Linearity (use at least two diagnostic tools)
```{r, fig.align='center', message=FALSE}
# Scatterplot
(stop_sqrt.plot <- ggplot(data = stop_sqrt, mapping = aes(x = Speed, y = Distance)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
xlab("Car Speed (mph)") +
ylab("Stopping Distance (ft)") +
theme_minimal() +
coord_fixed())
# Resid vs. Fitted
(stop_sqrt.resid_fitted <- autoplot(stop_sqrt.lm, which = 1, ncol = 1) +
theme_minimal() +
coord_fixed())
```
The transformed data look very linear in the scatterplot, and the residuals vs. fitted values plot shows no obvious curvature, so this assumption is now met.
#### 16. (N) The residuals are normally distributed and centered at zero (use at least three diagnostic tools)
```{r, fig.align='center'}
# Q-Q
(stop_sqrt.qq <- autoplot(stop_sqrt.lm, which = 2, ncol = 1, nrow = 1) +
theme_light() +
coord_fixed())
# Histogram
ggplot(data = stop_sqrt, mapping = aes(x = resid(stop_sqrt.lm))) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 0.25) +
stat_function(fun = dnorm,
color = "blue",
args = list(mean = mean(stop_sqrt$resid),
sd = sd(stop_sqrt$resid))) +
xlab("Residuals") +
ylab("Density") +
theme_light()
# Boxplot
ggplot(data = stop_sqrt, mapping = aes(y = resid(stop_sqrt.lm))) +
geom_boxplot() +
stat_summary(mapping = aes(x = 0),
fun = mean, geom = "point",
shape = 4, size = 2, color = "darkred") +
theme_classic() +
theme(aspect.ratio = 2,
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
# scale_y_continuous(limits = c(-1.5, 1.5), breaks = seq(-1.5, 1.5, 0.5)) +
ylab("Residuals") +
xlab("")
```
The residuals are not perfectly normally distributed, but close enough to meet our assumption.
#### 17. (E) The residuals have equal/constant variance across all values of X (use two diagnostic tools)
```{r, fig.align='center'}
#Resid vs Fitted
stop_sqrt.resid_fitted
# Brown-Forsythe test on the transformed model's residuals
grp <- as.factor(c(rep("lower", floor(dim(stop_sqrt)[1] / 2)),
                   rep("upper", ceiling(dim(stop_sqrt)[1] / 2))))
leveneTest(pull(stop_sqrt[order(stop_sqrt$Speed), "resid"]) ~ grp, center = median)
```
The Brown-Forsythe statistic has not changed much, but the residuals vs. fitted values plot looks much better. This assumption is met.
#### 18. (A) The model describes all observations (i.e., there are no influential points) (use at least four diagnostic tools)
```{r, fig.align='center',message=FALSE}
#Q-Q
stop_sqrt.qq
# Cook's Distance
stop_sqrt <- stop_sqrt %>% mutate(cooksd = cooks.distance(stop_sqrt.lm))
top3cd <- as.numeric(names(sort(stop_sqrt$cooksd, decreasing = TRUE)[1:3]))
ggplot() +
geom_point(data = stop_sqrt,
mapping = aes(x = as.numeric(rownames(stop_sqrt)),
y = cooksd)) +
geom_text(mapping = aes(x = top3cd,
y = stop_sqrt$cooksd[top3cd],
label = top3cd)) +
theme_bw() +
ylab("Cook's Distance") +
xlab("Observation Number") +
geom_hline(mapping = aes(yintercept = 4 / length(stop_sqrt$cooksd)),
color = "red", linetype = "dashed") +
theme(aspect.ratio = 1)
# DFBETAS
stop_sqrt <- stop_sqrt %>%
mutate(dfbetas_speed = dfbetas(stop_sqrt.lm)[, "Speed"])
names(stop_sqrt$dfbetas_speed) <- 1:nrow(stop_sqrt)
top3dfbeta <- as.numeric(names(
  sort(abs(stop_sqrt$dfbetas_speed), decreasing = TRUE)[1:3]
))
# Plot the DFBETAS against the observation number
ggplot() +
  geom_point(data = stop_sqrt,
             mapping = aes(x = as.numeric(rownames(stop_sqrt)),
                           y = abs(dfbetas_speed))) +
  geom_text(mapping = aes(x = top3dfbeta,
                          y = abs(stop_sqrt$dfbetas_speed[top3dfbeta]),
                          label = top3dfbeta)) +
  theme_bw() +
  ylab("Absolute Value of DFBETAS for Speed") +
  xlab("Observation Number") +
  geom_hline(mapping = aes(yintercept = 2 / sqrt(length(stop_sqrt$dfbetas_speed))),
             color = "red", linetype = "dashed") +
  theme(aspect.ratio = 1)
# Resid vs Fitted
stop_sqrt.resid_fitted
# Perform regression without possible influential points
ggplot(data = stop_sqrt, mapping = aes(x = Speed, y = Distance)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(data = stop_sqrt[c(-60,-55),],
method = "lm",
se = FALSE,
color = "red") +
xlab("Car Speed (mph)") +
ylab("Stopping Distance (ft)") +
theme_minimal()
```
Although some possible influential points are identified, they do not make a significant difference to the regression line. This assumption is met.
#### 19. Mathematically write out the fitted simple linear regression model for this data set using the coefficients you found above from your transformed model. Do not use "x" and "y" in your model - use variable names that are fairly descriptive.
$\widehat{\sqrt{\text{Distance}_i}}=0.932396+0.252466\cdot\text{Speed}_i$\
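Because the model is on the square-root scale, a prediction on the original scale requires squaring the fitted value. As a brief worked example (30 mph is an illustrative input, not part of the assignment):
```{r}
# Illustrative prediction at 30 mph: predict on the square-root scale,
# then square to return to the original scale (feet)
(predict(stop_sqrt.lm, newdata = tibble(Speed = 30)))^2
```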
#### 20. Plot your new fitted *curve* on the scatterplot of the original data (on the original scale - not the transformed scale). Do you think this curve fits the data better than the line you previously fit?
```{r}
# Predict on the square-root scale over a grid of speeds, then square the
# predictions to return to the original scale (feet)
xs <- seq(0, 42)
preds <- predict(stop_sqrt.lm, newdata = tibble(Speed = xs))
curve <- tibble(x = xs, y = preds^2)
stop.plot +
  geom_line(data = curve, mapping = aes(x = x, y = y), color = "blue")
```
Yes, this curve fits the data much better.
#### 21. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.
I learned that linear plots are beautiful. I also learned that the Box-Cox procedure is very useful for choosing a transformation.
#### 22. Briefly summarize what you learned from this analysis *to a non-statistician*. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
The purpose of this analysis was to predict how much distance a car needs to stop based on the speed it is traveling. We found that we can predict stopping distance with reasonable accuracy from speed alone, and that the distance needed grows faster than the speed itself.