-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathModule 7 - In-Class Coding Analysis.Rmd
252 lines (197 loc) · 9.94 KB
/
Module 7 - In-Class Coding Analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
---
title: "Module 7 - Logistic Regression"
subtitle: <center> <h1>In-Class Analysis</h1> </center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
# load packages here
library(tidyverse)
library(corrplot) # for the correlation matrix
library(bestglm) # for variable selection
library(car) # for the VIFs
library(pROC) # for the ROC curve
library(ROCR) # for the color-coded ROC curve
```
## Data and Description
Coronary heart disease (CHD) refers to a narrowing of the coronary arteries, the blood vessels that supply oxygen and blood to the heart. It is also known as coronary artery disease. It is a major cause of illness and death. CHD normally happens when cholesterol accumulates on the artery walls, creating plaques. The arteries narrow, reducing blood flow to the heart. Sometimes, a clot can obstruct the flow of blood to the heart muscle. CHD commonly causes angina pectoris (chest pain), shortness of breath, myocardial infarction, or a heart attack. It is the most common type of heart disease in the United States, where it accounts for 370,000 deaths every year.
The CHD data set contains information on 757 subjects (randomly selected) aged 39 to 59 years old and free of heart disease as determined by electrocardiogram at an initial screening. At baseline the variables in the following table were collected. Follow-up continued for 8.5 years with repeat examinations to determine if patients developed CHD. The goal is to determine risk factors (ways of healthy living) to avoid CHD.
Variable | Description
-------- | -------------
age | Age in years
height | Height in inches
weight | Weight in pounds
sbp | Systolic blood pressure in mmHg (millimeters of mercury)
dbp | Diastolic blood pressure in mmHg (millimeters of mercury)
chol | Cholesterol in mg/dL (milligrams of cholesterol per deciliter of blood)
cigs | Number of cigarettes smoked a day
Do the following:
1. Download the "CHD.csv" file from Canvas and put it in the same folder as this R Markdown file.
2. Read in the data set, call it "chd", and look at a summary of the data.
3. Convert the response (chd) to a factor.
```{r}
heart <- read.csv("CHD.csv", header = TRUE)
summary(heart)
heart$chd <- as.factor(heart$chd)
```
## Explore the data. We will make just a few plots for illustration. For the age variable, create a jittered scatterplot, a boxplot, and a contingency table.
```{r, fig.align='center'}
# Jittered Scatterplot for age
ggplot(data = heart, mapping = aes(y = chd, x = age)) +
geom_point() +
geom_jitter(height = 0.1) +
theme_bw() +
theme(aspect.ratio = 1)
# Boxplot for age
ggplot(data = heart, mapping = aes(y = age, x = chd)) +
geom_boxplot() +
theme_bw() +
theme(aspect.ratio = 1) +
coord_flip()
# Cross-Tabulation (Contingency Table) for age
table(heart$age, heart$chd)
```
## Create a correlation matrix of all predictors.
```{r, fig.align='center'}
corrplot(cor(heart %>% select(-chd)), type = "upper")
```
## Since the correlation matrix indicates potential problems of multicollinearity, perform some type of variable selection procedure to determine which variables to include in the model. Hint: You will need to specify the "family" argument.
```{r, fig.align='center'}
# For illustration, we will use best subsets (exhaustive) with the BIC metric
heart_best_subsets_bic <- bestglm(heart,
IC = "BIC",
method = "exhaustive",
TopModels = 1,
family = binomial)
summary(heart_best_subsets_bic$BestModel)
```
## Use the `glm` function to fit a logistic regression model using the variables identified above. Note that you will need to include the following argument to the `glm` function to indicate you are performing logistic regression: `family = binomial(link = "logit")`.
```{r, fig.align='center'}
heart_logistic <- glm(...)
summary(heart_logistic)
```
## Briefly, check the logistic regression model assumptions.
(1) The x's vs log odds are linear (monotone in probability)
```{r, fig.align='center'}
# your code here
```
(2) The observations are independent (just think about it)
(3) The model describes all observations (no influential points)
You can use several of the metrics we used before like DFBETAS/DFFITS they are calculated differently, but the same principle holds). You can skip this step, knowing that you would just copy/paste code you have used in previous modules.
(4) Additional predictors are unnecessary (just think about it)
(5) No multicollinearity
Check the Variance Inflation Factors (VIFs).
```{r, fig.align='center'}
# this code uses the pseudo R-Squared
heart_vifs <- vif(heart_logistic)
heart_vifs
mean(heart_vifs)
# this code uses the real R-Squared (and is valid since we are looking at how
# the predictor variables relate, and we are not using the response)
heart_lm <- lm(as.numeric(chd) ~ age + weight + sbp + chol + cigs, data = heart)
heart_lm_vifs <- vif(heart_lm)
heart_lm_vifs
mean(heart_lm_vifs)
```
## Create confidence intervals for $\beta_k$, $\exp\{\beta_k\}$, and $100 \times (\exp\{\beta_k\} - 1)%$ for all predictors using the `confint` function. Make sure the confidence intervals for age match the ones in the notes and that you can interpret the intervals correctly.
```{r, fig.align='center'}
# your code here
# hint: for the last two confidence intervals, you will just be transforming
# the values you get from confint (you will not be using different arguments
# to the function)
```
## Calculate the predicted probability that a patient has CHD using the `predict` function for a patient where age = 50, weight = 182, sbp = 136, chol = 253, and cigs = 20.
```{r, eval=FALSE}
# your code here
```
## Calculate a 95% confidence interval for the predicted probability that a patient has CHD for a patient where age = 50, weight = 182, sbp = 136, chol = 253, and cigs = 20.
```{r, fig.align='center'}
# your code here
```
## Check the logistic regression model performance
### Compute the likelihood ratio test statistic (aka the deviance, aka the model chi-squared test) and its corresponding p-value. Hint: use the deviances reported in the logistic regression model output for the test statistic, and use the `pchisq` function for the p-value.
```{r, fig.align='center'}
# your code here
```
### Compute the Pseudo R-Squared. Hint: use the deviances reported in the logistic regression model output.
```{r, fig.align='center'}
# your code here
```
### Find the best cutoff value for classification by minimizing the percent misclassified (c in the notes).
```{r, fig.align='center'}
# get the predicted probabilities for all 757 patients:
heart_preds <- predict(heart_logistic, type = "response")
# create a sequence from 0 to 1 to represent all possible cut-off values that
# we could choose:
possible_cutoffs <- seq(0, 1, length = 100)
# transform heart$chd from a factor with levels "yes" and "no" to a factor with
# levels 1 and 0:
heart_binary <- ifelse(heart$chd == "yes", 1, 0)
# create an empty vector where we will store the percent misclassified for each
# possible cut-off value we created:
percent_misclass <- rep(NA, length(possible_cutoffs))
# for each possible cut-off value, (1) grab the cut-off value, (2) for all 757
# patients, store a 1 in "classify" if their predicted probability is larger
# than the cut-off value, and (3) compute the average percent misclassified
# across the 757 patients when using that cut-off by averaging the number of
# times "classify" (0 or 1 based on how that cut-off classified a person) is
# not the same as heart_binary (the truth):
for(i in 1:length(possible_cutoffs)) {
cutoff <- possible_cutoffs[i] # (1)
classify <- ifelse(heart_preds > cutoff, 1, 0) # (2)
percent_misclass[i] <- mean(classify != heart_binary) # (3)
}
# percent_misclass holds the average misclassification rates for each cut-off
# put this information in a dataframe so we can plot it with ggplot:
misclass_data <- as.data.frame(cbind(percent_misclass, possible_cutoffs))
# plot the misclassification rate against the cut-off value:
ggplot(data = misclass_data,
mapping = aes(x = possible_cutoffs, y = percent_misclass)) +
geom_line(size = 2) +
theme_bw() +
xlab("Cutoff Value") +
ylab("Percent Misclassified") +
theme(aspect.ratio = 1)
# choose the "best" cut-off that minimizes the percent misclassified:
cutoff <- possible_cutoffs[which.min(percent_misclass)]
cutoff
```
### Create a confusion matrix based on the best cutoff (note that the order is changed in the table created here vs the table in the notes)
```{r, fig.align='center'}
# your code here
```
### Plot the ROC curve and obtain the AUC
```{r, fig.align='center'}
# METHOD 1: using the pROC package
my_roc <- roc(heart$chd, heart_preds)
ggplot() +
geom_path(mapping = aes(x = 1 - my_roc$specificities,
y = my_roc$sensitivities),
size = 2) +
geom_abline(slope = 1, intercept = 0) +
theme_bw() +
xlab("1 - Specificity (False Positive Rate)") +
ylab("Sensitivity (True Positive Rate)") +
theme(aspect.ratio = 1)
auc(my_roc) # AUC
# METHOD 2: using the ROCR package (ROC curve colored based on cutoff value)
pred <- prediction(heart_preds, heart$chd)
perf <- performance(pred, "tpr", "fpr") # tpr = true positive rate,
# fpr = false positive rate
plot(perf,
colorize = TRUE,
lwd = 6,
xlab = "1 - Specificity (False Positive Rate)",
ylab = "Sensitivity (True Positive Rate)")
abline(a = 0, b = 1)
# AUC
auc <- performance(pred, measure = "auc")
[email protected][[1]]
```
## Summary and Conclusions
Logistic regression is used when you have a binary response variable. Many of the things we learned for linear regression apply to logistic regression, but there are several differences. For instance, we have few model assumptions, we measure performance based on deviances (instead of sums of squares), the coefficients are in terms of log odds ratios, and we can perform classification.