---
title: "Module 5 - Multiple Linear Regression Variable Selection Methods"
subtitle: <center> <h1>In-Class Analysis</h1> </center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
# load packages here
library(tidyverse)
library(corrplot) # colored correlation matrix
library(ggfortify) # plot glmnet objects using ggplot instead of base R
library(car) # needed for VIFs
library(bestglm) # for stepwise methods
library(glmnet) # for ridge, lasso, and elastic net
set.seed(12345) # make sure to set your seed when doing cross validation!
```
## Data and Description
**This is the same data set used in the Module 5 Course Notes. You can check all of the code you create here with the output from the course notes to verify you are getting the correct results.**
Environmental impact studies seek to identify and quantify the effect of environmental conditions on human and ecological health. For example, extreme heat poses a threat to public health by creating conditions conducive to hyperthermia. Likewise, extreme cold poses a threat to public health via conditions conducive to hypothermia. Extreme weather events (tornadoes, typhoons, etc.) also pose an obvious threat to public health. A less understood environmental variable that may, hypothetically, also pose a threat to public health is the concentration of pollution (air quality).
In an effort to understand the impact of the environment on human health, the data set "EnvironmentalImpacts.txt" (found on Canvas) contains environmental and socio-economic information for 60 different cities in the U.S. The collected variables are given in the table below.
Variable | Description
-------------- | -------------
AnnPrecip | Average annual precipitation
MeanJanTemp | Average January temperature (in degrees Fahrenheit)
MeanJulyTemp | Average July temperature (in degrees Fahrenheit)
PctGT65 | Percent of population greater than 65 years old
PopPerHouse | Population per household
School | Median school years completed
PctSound | Percent of housing units that are "sound"
PopPerSqMile | Population per square mile
PctNonWhite | Percent of population that is nonwhite
PctWhiteCollar | Percent of employment in white-collar jobs
PctU20000 | Percent of families with income under $20,000
Hydrocarbons | Relative pollution potential of hydrocarbons
Nitrogen | Relative pollution potential of oxides of nitrogen
SO2 | Relative pollution potential of sulfur dioxide
RelHumid | Annual average relative humidity
AAMort | Age-adjusted mortality
The goal of this analysis is to determine which, if any, of the above environmental and socioeconomic variables contribute to the mortality rate.
Do the following:
1. Download the "EnvironmentalImpacts.txt" file from Canvas and put it in the same folder as this R Markdown file.
2. Read in the data set, call it "env", and take a look at the top few rows.
```{r}
env <- read.csv("EnvironmentalImpacts.txt", header = TRUE, sep = " ")
head(env)
```
## Explore the data: create a scatterplot matrix, correlation matrix, etc.
Hint: use the `plot` (or `pairs`) and `cor` functions.
```{r, fig.align='center'}
plot(env) # all pairwise scatterplots, but the panels are too small to read
plot(env[, c(16, 1, 2, 3, 9, 12:15)], pch = 19, cex = .5) # focus on a subset of variables
round(cor(env), 2) # hard to scan as a raw table with this many variables
corrplot(cor(env), type = "upper") # a color-coded matrix condenses the information
```
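If the full correlation matrix is too much to take in at once, a quick way to focus the exploration (a minimal sketch) is to sort the variables by their correlation with the response:
```{r}
# correlation of each variable with AAMort, strongest positive first
# (AAMort itself appears with correlation 1)
sort(cor(env)[, "AAMort"], decreasing = TRUE)
```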
## Fit a multiple linear regression model using all variables in the data set. Look at a summary of the results.
Hint: use the `lm` function with your formula looking something like `y ~ .`, where the `.` includes all variables (excluding y) in the data set as predictors.
```{r, fig.align='center'}
env_lm <- lm(AAMort ~ ., data = env)
summary(env_lm)
```
## Normally at this step, you would check the model assumptions. For now, skip checking assumptions 1-6 (you can always go back and do this later) and use variance inflation factors to test for multicollinearity. Do you think multicollinearity is a problem for this model?
Hint: use the function `vif`.
```{r, fig.align='center'}
env_vif <- vif(env_lm)
max(env_vif) # Way above 10
mean(env_vif) # Way above 1
```
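To see exactly which predictors are driving the problem, one quick follow-up (a minimal sketch) is to list the VIFs that exceed the usual cutoff:
```{r}
# individual predictors whose VIF exceeds the rule-of-thumb cutoff of 10
env_vif[env_vif > 10]
```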
< your response here >
## We now want to apply variable selection procedures.
### Start by checking all possible subsets ("best subsets" method) of the full model you created above. Use the BIC as the selection criterion. Which model is the "best"? Which model is the second "best"? What is the difference between the top two models?
We will use the `bestglm` package/function (you could also use the `regsubsets` function from the `leaps` package). Note that the response variable must be the last column in the data set for `bestglm` to work.
You can see which variables were included in the top 5 best models (or whatever number you set for the `TopModels` argument, which defaults to 5) using `best_subsets_bic$BestModels`. You can also see the model summary output for the single best model using `summary(best_subsets_bic$BestModel)`. You could save `best_subsets_bic$BestModel` as a variable for later use (e.g., when checking assumptions).
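Since `bestglm` needs the response in the last column, it is worth a quick sanity check first (a minimal sketch; `AAMort` is already last here, and the commented line shows one base-R way to reorder if it were not):
```{r}
# verify the response is the last column before calling bestglm
names(env)
# if it were not: env <- env[, c(setdiff(names(env), "AAMort"), "AAMort")]
```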
```{r, fig.align='center'}
best_subsets_bic <- bestglm(env,
                            IC = "BIC",
                            method = "exhaustive")
# view the top models (TopModels defaults to 5)
best_subsets_bic$BestModels # shows which variables were or were not included
# summary of the best model
summary(best_subsets_bic$BestModel)
```
< your response here >
### Now apply forward selection (just for illustration - NEVER do this in real life). Use the AIC as the selection criterion. Which model is the "best"? Which model is the second "best"? What is the difference between the top two models?
Hint: Similar code as above - just make a few changes (you will change the `IC` argument and the `method` argument.).
```{r, fig.align='center'}
forward_aic <- bestglm(env,
                       IC = "AIC",
                       method = "forward")
# view the top models (TopModels defaults to 5)
forward_aic$BestModels # shows which variables were or were not included
# summary of the best model
summary(forward_aic$BestModel)
```
< your response here >
### Now apply backward selection. Use the predictive mean square error (PMSE) as the selection criterion. Which model is the "best"?
Hint: Similar code as above - just make a few changes. You will change the `IC` argument and the `method` argument. Specifically, the `IC` argument should be set to `"CV"` - this is what specifies the PMSE criterion. Note that with `"CV"`, the function runs `t` replications of the cross-validation to decide which model is best. Note also that `$BestModels` does not work when applying cross-validation.
```{r, fig.align='center'}
backward_cv <- bestglm(env,
                       IC = "CV",
                       method = "backward",
                       t = 100) # number of CV replications; larger t is slower but more stable
# summary of the best model
summary(backward_cv$BestModel)
```
< your response here >
### Now apply stepwise/sequential replacement selection. Use the predictive mean square error (PMSE) as the selection criterion. Which model is the "best"?
Hint: Similar code as above - just make a few changes (`method="seqrep"`).
```{r, fig.align='center'}
seqrep_cv <- bestglm(env,
                     IC = "CV",
                     method = "seqrep",
                     t = 100) # number of CV replications; larger t is slower but more stable
# summary of the best model
summary(seqrep_cv$BestModel)
```
< your response here >
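Before moving on to shrinkage methods, one way to compare what the two CV-based searches selected (a minimal sketch) is to pull the predictor names out of each best model:
```{r}
# predictors kept by each CV-based search (dropping the intercept)
names(coef(backward_cv$BestModel))[-1]
names(coef(seqrep_cv$BestModel))[-1]
```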
## We will now apply shrinkage methods as our variable selection procedures.
### Start by applying ridge regression (just for illustration - this isn't a variable selection procedure since no variables will drop out of the model) to a model with ALL covariates included.
We will use the `glmnet` package/function. Note that the data must be a matrix, not a dataframe.
```{r, fig.align='center'}
env_x <- as.matrix(env[, 1:15]) # predictor matrix (glmnet requires a matrix)
env_y <- env[, 16] # response vector (AAMort)
# use cross validation to pick the "best" (based on MSE) lambda
env_ridge_cv <- cv.glmnet(x = env_x,
                          y = env_y,
                          type.measure = "mse",
                          alpha = 0) # 0 is code for "ridge regression"
# plot (log) lambda vs MSE
autoplot(env_ridge_cv, label = FALSE) +
  theme_bw() +
  theme(aspect.ratio = 1)
# lambda.min: value of lambda that gives minimum mean cross-validated error
env_ridge_cv$lambda.min
# lambda.1se: value of lambda within 1 standard error of the minimum
# cross-validated error
env_ridge_cv$lambda.1se
coef(env_ridge_cv, s = "lambda.min")
coef(env_ridge_cv, s = "lambda.1se")
```
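To see the shrinkage itself, you can plot the coefficient paths stored in the cross-validation object's `glmnet.fit` component (a minimal sketch using glmnet's base-graphics path plot): every coefficient shrinks toward zero as lambda grows, but none hits exactly zero.
```{r, fig.align='center'}
# ridge coefficient paths as a function of log(lambda)
plot(env_ridge_cv$glmnet.fit, xvar = "lambda")
```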
### Now apply LASSO to a model with ALL covariates included.
Hint: same code as above, except you need to set `alpha = 1` instead of `alpha = 0`.
```{r, fig.align='center'}
# use cross validation to pick the "best" (based on MSE) lambda
env_lasso_cv <- cv.glmnet(x = env_x,
                          y = env_y,
                          type.measure = "mse",
                          alpha = 1) # 1 is code for "lasso"
# plot (log) lambda vs MSE
autoplot(env_lasso_cv, label = FALSE) +
  theme_bw() +
  theme(aspect.ratio = 1)
# lambda.min: value of lambda that gives minimum mean cross-validated error
env_lasso_cv$lambda.min
# lambda.1se: value of lambda within 1 standard error of the minimum
# cross-validated error
env_lasso_cv$lambda.1se
coef(env_lasso_cv, s = "lambda.min")
coef(env_lasso_cv, s = "lambda.1se")
```
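Unlike ridge, the lasso sets some coefficients exactly to zero, which is what makes it a variable selection procedure. A quick count of the surviving predictors at `lambda.1se` (a minimal sketch):
```{r}
lasso_coefs <- as.vector(coef(env_lasso_cv, s = "lambda.1se"))
sum(lasso_coefs[-1] != 0) # number of predictors with nonzero slopes
```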
### Now apply the elastic net method to a model with ALL covariates included.
Hint: same code as above, except you need to set `alpha = 0.5`.
```{r, fig.align='center'}
# use cross validation to pick the "best" (based on MSE) lambda
env_elastic_net_cv <- cv.glmnet(x = env_x,
                                y = env_y,
                                type.measure = "mse",
                                alpha = 0.5) # any alpha strictly between 0 and 1 gives elastic net
# plot (log) lambda vs MSE
autoplot(env_elastic_net_cv, label = FALSE) +
  theme_bw() +
  theme(aspect.ratio = 1)
# lambda.min: value of lambda that gives minimum mean cross-validated error
env_elastic_net_cv$lambda.min
# lambda.1se: value of lambda within 1 standard error of the minimum
# cross-validated error
env_elastic_net_cv$lambda.1se
coef(env_elastic_net_cv, s = "lambda.min")
coef(env_elastic_net_cv, s = "lambda.1se")
```
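With all three shrinkage fits in hand, a side-by-side table of the `lambda.1se` coefficients makes the differences between the penalties easy to see (a minimal sketch):
```{r}
# compare the shrunken coefficients across the three penalties
coef_table <- cbind(ridge = as.vector(coef(env_ridge_cv, s = "lambda.1se")),
                    lasso = as.vector(coef(env_lasso_cv, s = "lambda.1se")),
                    enet = as.vector(coef(env_elastic_net_cv, s = "lambda.1se")))
rownames(coef_table) <- rownames(coef(env_ridge_cv, s = "lambda.1se"))
round(coef_table, 4)
```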
### Now that you have seen the various results from the different methods, pick a subset of variables that you will include in the model. Create the multiple linear regression model with these variables (alternatively, you can call the best model using $BestModel, as explained above).
```{r, fig.align='center'}
# Ex. 1: use the best subsets result directly
env_lm_subset1 <- best_subsets_bic$BestModel
summary(env_lm_subset1)
# Ex. 2: pick and choose variables yourself
env_lm_subset2 <- lm(AAMort ~ AnnPrecip + School + PctNonWhite + log(SO2), data = env)
summary(env_lm_subset2)
```
### Once you have chosen a model, you should check the model assumptions. For now, skip checking assumptions 1-6 (you can always go back and do this later) and use variance inflation factors to test for multicollinearity. Do you think multicollinearity is a problem for this model?
```{r, fig.align='center'}
# use the subset model from above
env_vif_subset1 <- vif(env_lm_subset1)
env_vif_subset1
max(env_vif_subset1) # about 2.1, much better than before
mean(env_vif_subset1) # about 1.6, much closer to 1 than before
```
< your response here >
## Summary and Conclusions
Multicollinearity is a common problem for linear models, and it is important to address it in order to avoid pitfalls such as inflated standard errors. We performed several variable selection procedures to try to eliminate multicollinearity, which also helps us avoid overfitting. Ultimately, there is no "right" answer as to which variables should be included: there are many acceptable models, and certainly many bad ones, but no single correct model. In general, it is good to run several procedures and look for similarities in the variables they select.