---
title: "Homework 4"
subtitle: <center> <h1>Multiple Linear Regression</h1> </center>
author: <center> Joshua Carpenter </center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
library(tidyverse)
library(ggfortify) # plot lm objects using ggplot instead of base R
library(car) # needed for added-variable plots and dfbetas and dffits
library(corrplot) # colored correlation matrix
library(gridExtra) # NEW PACKAGE for putting multiple ggplots in one plot
library(grid) # To go along with gridExtra
```
*Note that you do not need to properly format the axis limits in your plots for this assignment (to save time). You should, however, still make the plots square in shape.*
## Data and Description
Measuring body fat is not simple. One method requires submerging the body underwater in a tank and measuring the increase in water level. A simpler method for estimating body fat would be preferred. In order to develop such a method, researchers recorded age (years), weight (pounds), height (inches), and three body circumference measurements (around the neck, chest, and abdomen, all in centimeters) for 252 men. Each man’s percentage of body fat was accurately estimated by an underwater weighing technique (the variable brozek is the percentage of body fat). The hope is to be able to use this data to create a model that will accurately predict body fat percentage using just the basic variables recorded, without having to use the tank submerging method.
The data can be found in the BodyFat data set on Canvas. Download "BodyFat.txt", and put it in the same folder as this R Markdown file.
#### 0. Replace the text "< PUT YOUR NAME HERE >" (above next to "author:") with your full name.
#### 1. Read in the data set, and call the data frame "bodyfat". Print a summary of the data and make sure the data makes sense.
```{r}
bodyfat <- read.csv("BodyFat.txt", sep = " ", row.names = 1)
bodyfat <- tibble(bodyfat)
summary(bodyfat)
```
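As a quick, optional sanity check, the dimensions of the data frame can be compared against the description above: there should be 252 rows (one per man) and seven columns (brozek plus the six predictors).
```{r}
# Sanity check: expect 252 rows and 7 columns based on the data description
dim(bodyfat)
```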
#### 2. Create and print a scatterplot matrix of the data.
```{r, fig.align='center', fig.height=8, fig.width=8}
point_matrix <- function(data) {
par(pty = "s", las = 1)
pairs(data, pch = 19, lower.panel = NULL)
}
point_matrix(bodyfat)
```
#### 3. Based on the scatterplot matrix, briefly explain which variables you think will be "significant" for predicting brozek and which variables you think will *not* be helpful at predicting brozek. Explain how the scatterplot helped determine your answers.
I think that weight, neck, chest, and abdomen will be useful in predicting brozek because each appears to have a moderate to strong linear relationship with it in the scatterplot matrix. The remaining two predictors, age and height, do not seem to have much correlation with brozek, so I do not expect them to be helpful.
#### 4. Create and print a correlation matrix (numeric or color- and shape-coded).
```{r, fig.height=4}
show_cor <- function(data) {
par(mfrow = c(1, 2))
corrplot(cor(data), method = "number", type = "upper", diag = FALSE, tl.col = "#1f3366", cl.pos = "n")
title("Correlation Coefficients")
corrplot(cor(data), type = "upper", diag = FALSE, tl.col = "#1f3366", cl.pos = "n")
title("Correlation Matrix")
}
show_cor(bodyfat)
```
#### 5. Based on the scatterplot matrix and the correlation matrix, are there any pairs of variables that you suspect will cause a problem for the multicollinearity assumption? If so, which ones?
Weight has a strong correlation with neck, chest, and abdomen, so that will definitely cause a problem. We should probably also keep an eye on neck with chest and abdomen.
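As an optional numeric follow-up to that suspicion (a sketch only; the 0.7 cutoff is an arbitrary choice), the specific high-correlation pairs can be pulled straight from the correlation matrix:
```{r}
# List predictor pairs whose absolute correlation exceeds an (arbitrary) 0.7 cutoff
cor_mat <- cor(select(bodyfat, -brozek))
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, 1]],
           var2 = colnames(cor_mat)[high[, 2]],
           correlation = round(cor_mat[high], 2))
```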
#### 6. Fit a multiple linear regression model to the data (no transformations). Print a summary of the results. Save the residuals to the `bodyfat` data frame.
```{r}
bodyfat_lm <- lm(brozek ~ ., data = bodyfat)
summary(bodyfat_lm)
round(bodyfat_lm$coefficients, 3)
```
#### 7. Briefly comment on the "significance" of the variables: were you surprised by the results? Are there any variables that are significant that you think shouldn't be? Are there any variables that are not significant that you think should be?
The results mostly look okay, but I expected chest to be significant and it is not. Our results are definitely being thrown off by a multicollinearity problem.
#### 8. Briefly comment on the sign (+/-) of the coefficients for the variables. Are there any variables where the sign is the opposite of what you expected?
Weight and neck have negative coefficients when I would expect them to be positive; i.e., I would expect people who weigh more and have larger necks to have more body fat.
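One optional way to see that this sign flip is likely a multicollinearity artifact rather than a real negative relationship (just a sketch, not required by the assignment) is to regress brozek on weight by itself; based on the scatterplot matrix in question 2, the marginal slope should come out positive even though the coefficient in the multiple regression is negative.
```{r}
# Marginal (one-predictor) slope for weight, for comparison with the multiple regression coefficient
coef(lm(brozek ~ weight, data = bodyfat))
```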
#### 9. Mathematically write out the *fitted* multiple linear regression model for this data set using the coefficients you found above (do not use betas). Do not use "X" and "Y" in your model - use variable names that are fairly descriptive.
$\widehat{\text{brozek}}_i=-20.1+0.005\cdot\text{age}_i-0.087\cdot\text{weight}_i-0.140\cdot\text{height}_i-0.442\cdot\text{neck}_i+0.00048\cdot\text{chest}_i+0.875\cdot\text{abdomen}_i$
#### 10. *Assuming* the model assumptions are all met, how would you interpret the coefficient for Weight?
For people with the same measurements on every other variable, a one-pound increase in weight is associated with a 0.087 percentage-point decrease in predicted body fat.
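If a measure of the uncertainty in that estimate is also wanted (optional, not asked for here), a 95% confidence interval for the weight coefficient can be pulled from the fitted model:
```{r}
# 95% confidence interval for the weight coefficient (optional supplement to the point estimate)
confint(bodyfat_lm, "weight", level = 0.95)
```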
#### 11. Briefly explain what it means to "hold all else constant," when you interpret the coefficient for Weight?
"Holding all else constant" means that we are isolating the effect of one variable by looking at individuals with the same scores in all other categories.
#### 12. Briefly explain what the F-test indicates, as reported in the model output from question 6.
The overall F-test has a p-value of less than $2.2\times10^{-16}$, which indicates that at least one of the predictors in our model has a significant linear association with body fat; in other words, the model with all six predictors fits better than a model with no predictors at all.
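For reference, here is a sketch of where that p-value comes from: the F statistic and its degrees of freedom are stored in the model summary, and the p-value is the upper tail of the corresponding F distribution.
```{r}
# Reproduce the overall F-test p-value from the stored F statistic
fstat <- summary(bodyfat_lm)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```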
#### 13. Briefly interpret the *adjusted* R-squared, as reported in the model output from question 6.
After adjusting for the number of predictors in the model, about 71% of the variability in body fat percentage is explained by the predictors in our model.
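That number can be extracted directly from the model object rather than read off the printed summary, e.g.:
```{r}
# Adjusted R-squared, pulled directly from the model summary
summary(bodyfat_lm)$adj.r.squared
```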
### Questions 14-20 involve using diagnostics to determine if the linear regression assumptions are met. For each assumption, (1) perform appropriate diagnostics to determine if the assumption is violated, and (2) explain whether or not you think the assumption is violated and why you think that.
#### 14. (L) The X's vs Y are linear (use the residual vs. predictor plots, partial regression plots, and one other diagnostic tool of your choice).
```{r, fig.align='center', fig.height=5, message=FALSE}
# residual vs. predictor plots
rpred_col <- function(data, residuals, predictor) {
ggplot(data = data,
mapping = aes(x = pull(data, predictor),
y = residuals)) +
geom_point() +
geom_smooth(se = FALSE, span = 0.95, n = 7, size = 0.5) +
geom_abline(slope = 0, intercept = 0, linetype = "dashed") +
theme_minimal() +
theme(aspect.ratio = 1) +
xlab(predictor) +
ylab("Residuals")
}
resid_vs_pred <- function(model) {
data <- model.frame(model)
predictors <- attr(model$terms, "term.labels")
plots <- lapply(predictors, rpred_col, data = data, residuals = resid(model))
plots["ncol"] <- ceiling(sqrt(length(plots)))
plots["top"] <- "Residuals vs Predictors"
do.call(grid.arrange, plots)
}
resid_vs_pred(bodyfat_lm)
```
```{r, fig.align='center', fig.height=5}
# partial regression plots
added_variable_plots <- function(model) {
predictors <- attr(model$terms, "term.labels")
rows <- floor(sqrt(length(predictors)))
cols <- ceiling(length(predictors) / rows)
par(pty = "s", cex.lab = 1.5, cex.axis = 1.5)
avPlots(model, layout = c(rows, cols), pch = 19)
}
added_variable_plots(bodyfat_lm)
```
```{r, fig.align='center', warning=FALSE}
resid_vs_fitted <- function(model) {
autoplot(model, which = 1, ncol = 1) +
theme_minimal() +
theme(aspect.ratio = 1)
}
resid_vs_fitted(bodyfat_lm)
```
This assumption is met. There are a few outliers that distort the smoothing lines, but the scatterplots and the residuals vs. fitted values plot look good, and so do the residuals vs. predictor plots apart from those outliers. The added-variable plots also look linear.
#### 15. (I) The residuals are independent (no diagnostic tools - just think about how the data was collected and briefly write your thoughts)
Based on the information we have, there is no reason to think the observations are dependent on one another in any way. This assumption is met.
#### 16. (N) The residuals are normally distributed and centered at zero (use all four diagnostic tools)
```{r, fig.align='center'}
# Boxplot
make_boxplot <- function(model) {
residuals <- data.frame(residuals = resid(model))
ggplot(data = residuals, mapping = aes(y = residuals)) +
geom_boxplot() +
stat_summary(mapping = aes(x = 0),
fun = mean, geom = "point",
shape = 4, size = 2, color = "darkred") +
theme_classic() +
theme(aspect.ratio = 2,
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
ylab("Residuals") +
xlab("")
}
make_boxplot(bodyfat_lm)
```
```{r, fig.align='center'}
# Histogram
make_hist <- function(model) {
residuals <- data.frame(residuals = resid(model))
ggplot(data = residuals, mapping = aes(x = residuals)) +
geom_histogram(binwidth = sd(residuals$residuals) / 4, mapping = aes(y = ..density..)) +
stat_function(fun = dnorm,
color = "blue",
args = list(mean = 0,
sd = sd(residuals$residuals)),
size = 1.2) +
xlab("Residuals") +
ylab("Density") +
theme_light()
}
make_hist(bodyfat_lm)
```
```{r, fig.align='center'}
# Q-Q
QQPlot <- function(model) {
autoplot(model, which = 2, ncol = 1) +
theme_bw() +
theme(aspect.ratio = 1)
}
QQPlot(bodyfat_lm)
```
```{r, fig.align='center'}
# Shapiro-Wilk
shapiro.test(bodyfat_lm$residuals)
```
All four diagnostics clearly indicate that the residuals are normally distributed. This assumption is met.
#### 17. (E) The residuals have equal/constant variance across all values of X (only one diagnostic tool)
```{r, fig.align='center'}
resid_vs_fitted(bodyfat_lm)
```
There are no clear trends or horn shapes in the residuals vs. fitted values plot, so the variance appears constant across the fitted values. This assumption is met.
#### 18. (A) The model describes all observations (i.e., there are no influential points) (use Cook's distance, DFBETAS, and DFFITS. Also, in your response, refer to the evidence from the plots you created in previous questions)
```{r, fig.align='center'}
# Cook's Distance
cooksd_plot <- function(model, nLabels = 3) {
cooks_d <- cooks.distance(model)
top_cd <- as.numeric(names(sort(cooks_d, decreasing = TRUE)[1:nLabels]))
ggplot() +
geom_point(data = tibble(cooks_d),
mapping = aes(x = as.numeric(names(cooks_d)),
y = cooks_d)) +
geom_text(mapping = aes(x = top_cd,
y = cooks_d[top_cd] + max(cooks_d) / 40,
label = top_cd)) +
theme_bw() +
ylab("Cook's Distance") +
xlab("Observation Number") +
geom_hline(mapping = aes(yintercept = 4 / length(cooks_d)),
color = "red", linetype = "dashed") +
theme(aspect.ratio = 1)
}
cooksd_plot(bodyfat_lm)
```
```{r, fig.align='center'}
# DFBETAS
dfb_col <- function(df_betas, predictor, nLabels = 3) {
# Find which observations have the highest dfbetas
top_vals <- df_betas[predictor] %>%
arrange(desc(abs(eval(parse(text = predictor))))) %>%
.[1:nLabels,] %>%
pull(predictor)
top_ind <- which(pull(df_betas, predictor) %in% top_vals)
out <- ggplot() +
geom_point(data = df_betas,
mapping = aes(x = as.numeric(rownames(df_betas)),
y = abs(pull(df_betas, predictor)))) +
geom_text(mapping = aes(x = top_ind,
y = abs(pull(df_betas, predictor)[top_ind]) + 0.07,
label = top_ind)) +
theme_bw() +
theme(aspect.ratio = 1) +
ylab("Abs of DFBETAS") +
xlab("Observation Number") +
ggtitle(predictor)
# Cutoff: |DFBETAS| > 1 for small samples, 2 / sqrt(n) otherwise
if(nrow(df_betas) <= 30) {
out <- out +
geom_hline(mapping = aes(yintercept = 1),
color = "red", linetype = "dashed")
}else {
out <- out +
geom_hline(mapping = aes(yintercept = 2 / sqrt(nrow(df_betas))),
color = "red", linetype = "dashed")
}
return(out)
}
plot_dfbetas <- function(model, nLabels = 3) {
predictors <- attr(model$terms, "term.labels")
df_betas <- as_tibble(dfbetas(model)[, predictors])
plots <- lapply(predictors, dfb_col, df_betas = df_betas)
plots["ncol"] <- ceiling(sqrt(length(plots)))
do.call(grid.arrange, plots)
}
plot_dfbetas(bodyfat_lm)
```
```{r, fig.align='center'}
# DFFITS
plot_dffits <- function(model, nLabels = 3) {
df_fits <- dffits(model)
top_dff <- as.numeric(names(sort(abs(df_fits), decreasing = TRUE)[1:nLabels]))
df_fits_plot <- ggplot() +
geom_point(data = tibble(df_fits),
mapping = aes(x = as.numeric(names(df_fits)),
y = abs(df_fits))) +
geom_text(mapping = aes(x = top_dff,
y = abs(df_fits[top_dff]) + 0.05,
label = top_dff)) +
theme_bw() +
ylab("Absolute Value of DFFITS for Y") +
xlab("Observation Number") +
theme(aspect.ratio = 1)
if(length(df_fits) <= 30) {
# For small samples, the usual cutoff is |DFFITS| > 1
df_fits_plot +
geom_hline(mapping = aes(yintercept = 1),
color = "red", linetype = "dashed")
}else {
# For larger samples, use 2 * sqrt(p / n)
df_fits_plot +
geom_hline(mapping = aes(yintercept =
2 * sqrt(length(model$coefficients) /
length(df_fits))),
color = "red", linetype = "dashed")
}
}
plot_dffits(bodyfat_lm)
```
As far as I can tell, we would only be worried about observation 39, which, in every plot, is extremely far from the rest of the data. Most likely it is a recording error or something of the sort. If we discover that is the case, we could remove it and continue the analysis without it. For now, this assumption is not met.
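As an optional sensitivity check (a sketch only; actually dropping the point would need to be justified first), the model could be refit without observation 39 to see how much the coefficient estimates move:
```{r}
# Refit without observation 39 and compare coefficients with the full-data fit
bodyfat_lm_no39 <- lm(brozek ~ ., data = bodyfat[-39, ])
round(cbind(full = coef(bodyfat_lm), without_39 = coef(bodyfat_lm_no39)), 3)
```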
#### 19. (R) Additional predictor variables are not required (no diagnostic tools - just think about the variables you have and if there are other variables you think would help predict the response)
This seems like a pretty good set of predictors; I can't think of any other variables that would meaningfully help predict body fat. This assumption is met.
#### 20. No multicollinearity (for this assumption, compute the variance inflation factors (VIFs) and compare them to your comments in question 5. Do the variance inflation factors match your assumptions from question 5? Is this assumption met?)
```{r, fig.align='center'}
vif(bodyfat_lm)
```
Based on the variance inflation factors, I would be worried about the same variables I flagged in question 5: weight, abdomen, and chest. This assumption is not met.
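Using the common (though somewhat arbitrary) rule of thumb that a VIF above 10 signals a serious multicollinearity problem, the flagged predictors can be listed directly:
```{r}
# Flag predictors whose VIF exceeds the common rule-of-thumb cutoff of 10
vifs <- vif(bodyfat_lm)
vifs[vifs > 10]
```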
### Note: your next homework assignment will use this same data set, and you will be asked to fix the assumptions that were broken.
#### 21. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.
I learned *a lot* about plots in R by writing functions for the various plots. The effects of multicollinearity were emphasized and clarified for me. Other than that, it was pretty similar to simple linear regression.
#### 22. Briefly summarize what you learned from this analysis *to a non-statistician*. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
The purpose of this data set was to find an easier way of measuring body fat than submerging a person in water. We tried to come up with a way of roughly estimating body fat percentage from age, weight, height, and a few body circumference measurements. We found that such an approach is viable; however, more work is still needed to settle on the exact method.