---
title: "Homework 4"
subtitle: <center> <h1>Multiple Linear Regression</h1> </center>
author: <center> Joshua Carpenter </center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
library(tidyverse)
library(ggfortify) # plot lm objects using ggplot instead of base R
library(car) # needed for added-variable plots and dfbetas and dffits
library(corrplot) # colored correlation matrix
library(gridExtra) # NEW PACKAGE for putting multiple ggplots in one plot
library(grid) # To go along with gridExtra
```
*Note that you do not need to properly format the axis limits in your plots for this assignment (to save time). You should, however, still make the plots square in shape.*
## Data and Description
Measuring body fat is not simple. One method requires submerging the body underwater in a tank and measuring the increase in water level. A simpler method for estimating body fat would be preferred. In order to develop such a method, researchers recorded age (years), weight (pounds), height (inches), and three body circumference measurements (around the neck, chest, and abdomen, all in centimeters) for 252 men. Each man’s percentage of body fat was accurately estimated by an underwater weighing technique (the variable brozek is the percentage of body fat). The hope is to be able to use this data to create a model that will accurately predict body fat percentage using just the basic variables recorded, without having to use the tank submerging method.
The data can be found in the BodyFat data set on Canvas. Download "BodyFat.txt", and put it in the same folder as this R Markdown file.
#### 0. Replace the text "< PUT YOUR NAME HERE >" (above next to "author:") with your full name.
#### 1. Read in the data set, and call the data frame "bodyfat". Print a summary of the data and make sure the data makes sense.
```{r}
bodyfat <- read.csv("BodyFat.txt", sep = " ", row.names = 1)
bodyfat <- tibble(bodyfat)
summary(bodyfat)
```
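As a quick, optional sanity check, the dimensions of the data frame can be compared against the description above: there should be 252 rows (one per man) and seven columns (brozek plus the six predictors).
```{r}
# Sanity check: expect 252 rows and 7 columns based on the data description
dim(bodyfat)
```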
#### 2. Create and print a scatterplot matrix of the data.
```{r, fig.align='center', fig.height=8, fig.width=8}
point_matrix <- function(data) {
par(pty = "s", las = 1)
pairs(data, pch = 19, lower.panel = NULL)
}
point_matrix(bodyfat)
```
#### 3. Based on the scatterplot matrix, briefly explain which variables you think will be "significant" for predicting brozek and which variables you think will *not* be helpful at predicting brozek. Explain how the scatterplot helped determine your answers.
I think that weight, neck, chest, and abdomen will be useful in predicting brozek because each appears to have a moderate to strong linear relationship with it in the scatterplot matrix. The remaining two predictors, age and height, do not seem to have much correlation with brozek, so I do not expect them to be helpful.
#### 4. Create and print a correlation matrix (numeric or color- and shape-coded).
```{r, fig.height=4}
show_cor <- function(data) {
par(mfrow = c(1, 2))
corrplot(cor(data), method = "number", type = "upper", diag = FALSE, tl.col = "#1f3366", cl.pos = "n")
title("Correlation Coefficients")
corrplot(cor(data), type = "upper", diag = FALSE, tl.col = "#1f3366", cl.pos = "n")
title("Correlation Matrix")
}
show_cor(bodyfat)
```
#### 5. Based on the scatterplot matrix and the correlation matrix, are there any pairs of variables that you suspect will cause a problem for the multicollinearity assumption? If so, which ones?
Weight has a strong correlation with neck, chest, and abdomen, so that will definitely cause a problem. We should probably also keep an eye on neck with chest and abdomen.
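As an optional numeric follow-up to that suspicion (a sketch only; the 0.7 cutoff is an arbitrary choice), the specific high-correlation pairs can be pulled straight from the correlation matrix:
```{r}
# List predictor pairs whose absolute correlation exceeds an (arbitrary) 0.7 cutoff
cor_mat <- cor(select(bodyfat, -brozek))
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, 1]],
           var2 = colnames(cor_mat)[high[, 2]],
           correlation = round(cor_mat[high], 2))
```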
#### 6. Fit a multiple linear regression model to the data (no transformations). Print a summary of the results. Save the residuals to the `bodyfat` data frame.
```{r}
bodyfat_lm <- lm(brozek ~ ., data = bodyfat)
summary(bodyfat_lm)
round(bodyfat_lm$coefficients, 3)
```
#### 7. Briefly comment on the "significance" of the variables: were you surprised by the results? Are there any variables that are significant that you think shouldn't be? Are there any variables that are not significant that you think should be?
The results mostly look okay, but I expected chest to be significant and it is not. Our results are definitely being thrown off by a multicollinearity problem.
#### 8. Briefly comment on the sign (+/-) of the coefficients for the variables. Are there any variables where the sign is the opposite of what you expected?
Weight and neck have negative coefficients when I would expect them to be positive; i.e., I would expect people who weigh more and have larger necks to have more body fat.
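One optional way to see that this sign flip is likely a multicollinearity artifact rather than a real negative relationship (just a sketch, not required by the assignment) is to regress brozek on weight by itself; based on the scatterplot matrix in question 2, the marginal slope should come out positive even though the coefficient in the multiple regression is negative.
```{r}
# Marginal (one-predictor) slope for weight, for comparison with the multiple regression coefficient
coef(lm(brozek ~ weight, data = bodyfat))
```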
#### 9. Mathematically write out the *fitted* multiple linear regression model for this data set using the coefficients you found above (do not use betas). Do not use "X" and "Y" in your model - use variable names that are fairly descriptive.
$\widehat{\text{brozek}}_i=-20.1+0.005\cdot\text{age}_i-0.087\cdot\text{weight}_i-0.140\cdot\text{height}_i-0.442\cdot\text{neck}_i+0.00048\cdot\text{chest}_i+0.875\cdot\text{abdomen}_i$
#### 10. *Assuming* the model assumptions are all met, how would you interpret the coefficient for Weight?
For people with the same measurements on every other variable, a one-pound increase in weight is associated with a 0.087 percentage-point decrease in predicted body fat.
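If a measure of the uncertainty in that estimate is also wanted (optional, not asked for here), a 95% confidence interval for the weight coefficient can be pulled from the fitted model:
```{r}
# 95% confidence interval for the weight coefficient (optional supplement to the point estimate)
confint(bodyfat_lm, "weight", level = 0.95)
```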
#### 11. Briefly explain what it means to "hold all else constant," when you interpret the coefficient for Weight?
"Holding all else constant" means that we are isolating the effect of one variable by looking at individuals with the same scores in all other categories.
#### 12. Briefly explain what the F-test indicates, as reported in the model output from question 6.
The overall F-test has a p-value of less than $2.2\times10^{-16}$, which indicates that at least one of the predictors in our model has a significant linear association with body fat; in other words, the model with all six predictors fits better than a model with no predictors at all.
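For reference, here is a sketch of where that p-value comes from: the F statistic and its degrees of freedom are stored in the model summary, and the p-value is the upper tail of the corresponding F distribution.
```{r}
# Reproduce the overall F-test p-value from the stored F statistic
fstat <- summary(bodyfat_lm)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```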
#### 13. Briefly interpret the *adjusted* R-squared, as reported in the model output from question 6.
After adjusting for the number of predictors in the model, about 71% of the variability in body fat percentage is explained by the predictors in our model.
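That number can be extracted directly from the model object rather than read off the printed summary, e.g.:
```{r}
# Adjusted R-squared, pulled directly from the model summary
summary(bodyfat_lm)$adj.r.squared
```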
### Questions 14-20 involve using diagnostics to determine if the linear regression assumptions are met. For each assumption, (1) perform appropriate diagnostics to determine if the assumption is violated, and (2) explain whether or not you think the assumption is violated and why you think that.
#### 14. (L) The X's vs Y are linear (use the residual vs. predictor plots, partial regression plots, and one other diagnostic tool of your choice).
```{r, fig.align='center', fig.height=5, message=FALSE}
# residual vs. predictor plots
rpred_col <- function(data, residuals, predictor) {
ggplot(data = data,
mapping = aes(x = pull(data, predictor),
y = residuals)) +
geom_point() +
geom_smooth(se = FALSE, span = 0.95, n = 7, size = 0.5) +
geom_abline(slope = 0, intercept = 0, linetype = "dashed") +
theme_minimal() +
theme(aspect.ratio = 1) +
xlab(predictor) +
ylab("Residuals")
}
resid_vs_pred <- function(model) {
data <- model.frame(model)
predictors <- attr(model$terms, "term.labels")
plots <- lapply(predictors, rpred_col, data = data, residuals = resid(model))
plots["ncol"] <- ceiling(sqrt(length(plots)))
plots["top"] <- "Residuals vs Predictors"
do.call(grid.arrange, plots)
}
resid_vs_pred(bodyfat_lm)
```
```{r, fig.align='center', fig.height=5}
# partial regression plots
added_variable_plots <- function(model) {
predictors <- attr(model$terms, "term.labels")
rows <- floor(sqrt(length(predictors)))
cols <- ceiling(length(predictors) / rows)
par(pty = "s", cex.lab = 1.5, cex.axis = 1.5)
avPlots(model, layout = c(rows, cols), pch = 19)
}
added_variable_plots(bodyfat_lm)
```
```{r, fig.align='center', warning=FALSE}
resid_vs_fitted <- function(model) {
autoplot(model, which = 1, ncol = 1) +
theme_minimal() +
theme(aspect.ratio = 1)
}
resid_vs_fitted(bodyfat_lm)
```
This assumption is met. There are a few outliers that distort the smoothing lines, but the scatterplots and the residuals vs. fitted values plot look good, and so do the residuals vs. predictor plots apart from those outliers. The added-variable plots also look linear.
#### 15. (I) The residuals are independent (no diagnostic tools - just think about how the data was collected and briefly write your thoughts)
Based on the information we have, there is no reason to think the observations are dependent on one another in any way. This assumption is met.
#### 16. (N) The residuals are normally distributed and centered at zero (use all four diagnostic tools)
```{r, fig.align='center'}
# Boxplot
make_boxplot <- function(model) {
residuals <- data.frame(residuals = resid(model))
ggplot(data = residuals, mapping = aes(y = residuals)) +
geom_boxplot() +
stat_summary(mapping = aes(x = 0),
fun = mean, geom = "point",
shape = 4, size = 2, color = "darkred") +
theme_classic() +
theme(aspect.ratio = 2,
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
ylab("Residuals") +
xlab("")
}
make_boxplot(bodyfat_lm)
```
```{r, fig.align='center'}
# Histogram
make_hist <- function(model) {
residuals <- data.frame(residuals = resid(model))
ggplot(data = residuals, mapping = aes(x = residuals)) +
geom_histogram(binwidth = sd(residuals$residuals) / 4, mapping = aes(y = ..density..)) +
stat_function(fun = dnorm,
color = "blue",
args = list(mean = 0,
sd = sd(residuals$residuals)),
size = 1.2) +
xlab("Residuals") +
ylab("Density") +
theme_light()
}
make_hist(bodyfat_lm)
```
```{r, fig.align='center'}
# Q-Q
QQPlot <- function(model) {
autoplot(model, which = 2, ncol = 1) +
theme_bw() +
theme(aspect.ratio = 1)
}
QQPlot(bodyfat_lm)
```
```{r, fig.align='center'}
# Shapiro-Wilk
shapiro.test(bodyfat_lm$residuals)
```
All four diagnostics clearly indicate that the residuals are normally distributed. This assumption is met.
#### 17. (E) The residuals have equal/constant variance across all values of X (only one diagnostic tool)
```{r, fig.align='center'}
resid_vs_fitted(bodyfat_lm)
```
There are no clear trends or horn shapes in the residuals vs. fitted values plot, so the variance appears constant across the fitted values. This assumption is met.
#### 18. (A) The model describes all observations (i.e., there are no influential points) (use Cook's distance, DFBETAS, and DFFITS. Also, in your response, refer to the evidence from the plots you created in previous questions)
```{r, fig.align='center'}
# Cook's Distance
cooksd_plot <- function(model, nLabels = 3) {
cooks_d <- cooks.distance(model)
top_cd <- as.numeric(names(sort(cooks_d, decreasing = TRUE)[1:nLabels]))
ggplot() +
geom_point(data = tibble(cooks_d),
mapping = aes(x = as.numeric(names(cooks_d)),
y = cooks_d)) +
geom_text(mapping = aes(x = top_cd,
y = cooks_d[top_cd] + max(cooks_d) / 40,
label = top_cd)) +
theme_bw() +
ylab("Cook's Distance") +
xlab("Observation Number") +
geom_hline(mapping = aes(yintercept = 4 / length(cooks_d)),
color = "red", linetype = "dashed") +
theme(aspect.ratio = 1)
}
cooksd_plot(bodyfat_lm)
```
```{r, fig.align='center'}
# DFBETAS
dfb_col <- function(df_betas, predictor, nLabels = 3) {
# Find which observations have the highest dfbetas
top_vals <- df_betas[predictor] %>%
arrange(desc(abs(eval(parse(text = predictor))))) %>%
.[1:nLabels,] %>%
pull(predictor)
top_ind <- which(pull(df_betas, predictor) %in% top_vals)
out <- ggplot() +
geom_point(data = df_betas,
mapping = aes(x = as.numeric(rownames(df_betas)),
y = abs(pull(df_betas, predictor)))) +
geom_text(mapping = aes(x = top_ind,
y = abs(pull(df_betas, predictor)[top_ind]) + 0.07,
label = top_ind)) +
theme_bw() +
theme(aspect.ratio = 1) +
ylab("Abs of DFBETAS") +
xlab("Observation Number") +
ggtitle(predictor)
# Cutoff: |DFBETAS| > 1 for small samples, 2 / sqrt(n) otherwise
if(nrow(df_betas) <= 30) {
out <- out +
geom_hline(mapping = aes(yintercept = 1),
color = "red", linetype = "dashed")
}else {
out <- out +
geom_hline(mapping = aes(yintercept = 2 / sqrt(nrow(df_betas))),
color = "red", linetype = "dashed")
}
return(out)
}
plot_dfbetas <- function(model, nLabels = 3) {
predictors <- attr(model$terms, "term.labels")
df_betas <- as_tibble(dfbetas(model)[, predictors])
plots <- lapply(predictors, dfb_col, df_betas = df_betas)
plots["ncol"] <- ceiling(sqrt(length(plots)))
do.call(grid.arrange, plots)
}
plot_dfbetas(bodyfat_lm)
```
```{r, fig.align='center'}
# DFFITS
plot_dffits <- function(model, nLabels = 3) {
df_fits <- dffits(model)
top_dff <- as.numeric(names(sort(abs(df_fits), decreasing = TRUE)[1:nLabels]))
df_fits_plot <- ggplot() +
geom_point(data = tibble(df_fits),
mapping = aes(x = as.numeric(names(df_fits)),
y = abs(df_fits))) +
geom_text(mapping = aes(x = top_dff,
y = abs(df_fits[top_dff]) + 0.05,
label = top_dff)) +
theme_bw() +
ylab("Absolute Value of DFFITS for Y") +
xlab("Observation Number") +
theme(aspect.ratio = 1)
if(length(df_fits) <= 30) {
# For small samples, the usual cutoff is |DFFITS| > 1
df_fits_plot +
geom_hline(mapping = aes(yintercept = 1),
color = "red", linetype = "dashed")
}else {
# For larger samples, use 2 * sqrt(p / n)
df_fits_plot +
geom_hline(mapping = aes(yintercept =
2 * sqrt(length(model$coefficients) /
length(df_fits))),
color = "red", linetype = "dashed")
}
}
plot_dffits(bodyfat_lm)
```
As far as I can tell, we would only be worried about observation 39, which, in every plot, is extremely far from the rest of the data. Most likely it is a recording error or something of the sort. If we discover that is the case, we could remove it and continue the analysis without it. For now, this assumption is not met.
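As an optional sensitivity check (a sketch only; actually dropping the point would need to be justified first), the model could be refit without observation 39 to see how much the coefficient estimates move:
```{r}
# Refit without observation 39 and compare coefficients with the full-data fit
bodyfat_lm_no39 <- lm(brozek ~ ., data = bodyfat[-39, ])
round(cbind(full = coef(bodyfat_lm), without_39 = coef(bodyfat_lm_no39)), 3)
```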
#### 19. (R) Additional predictor variables are not required (no diagnostic tools - just think about the variables you have and if there are other variables you think would help predict the response)
This seems like a pretty good set of predictors; I can't think of any other variables that would meaningfully help predict body fat. This assumption is met.
#### 20. No multicollinearity (for this assumption, compute the variance inflation factors (VIFs) and compare them to your comments in question 5. Do the variance inflation factors match your assumptions from question 5? Is this assumption met?)
```{r, fig.align='center'}
vif(bodyfat_lm)
```
Based on the variance inflation factors, I would be worried about the same variables I flagged in question 5: weight, abdomen, and chest. This assumption is not met.
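Using the common (though somewhat arbitrary) rule of thumb that a VIF above 10 signals a serious multicollinearity problem, the flagged predictors can be listed directly:
```{r}
# Flag predictors whose VIF exceeds the common rule-of-thumb cutoff of 10
vifs <- vif(bodyfat_lm)
vifs[vifs > 10]
```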
### Note: your next homework assignment will use this same data set, and you will be asked to fix the assumptions that were broken.
#### 21. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.
I learned *a lot* about plots in R by writing functions for the various plots. The effects of multicollinearity were emphasized and clarified for me. Other than that, it was pretty similar to simple linear regression.
#### 22. Briefly summarize what you learned from this analysis *to a non-statistician*. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
The purpose of this data set was to find an easier way of measuring body fat than submerging a person in water. We tried to come up with a way of roughly estimating body fat percentage from age, weight, height, and a few body circumference measurements. We found that such an approach is viable; however, more work is still needed to settle on the exact method.