---
title: "Homework 2"
subtitle: <center> <h1>Simple Linear Regression Model Assumptions</h1> </center>
author: <center> Joshua Carpenter </center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
library(tidyverse)
library(ggfortify)
library(car)
```
## Data and Description
One key component of determining appropriate speed limits is the amount of distance that is required to stop at a given speed. For example, in residential neighborhoods, when pedestrians are commonly in the roadways, it is important to be able to stop in a very short distance to ensure pedestrian safety. The speed of vehicles may be useful for determining the distance required to stop at that given speed, which can aid public officials in determining speed limits.
The Stopping Distance data set compares the **distance (column 2)** (in feet) required for a car to stop on a certain rural road against the **speed (column 1)** (MPH) of the car. Download the StoppingDistance.txt file from Canvas, and put it in the same folder as this R Markdown file.
#### 0. Replace the text "< PUT YOUR NAME HERE >" (above next to "author:") with your full name.
#### 1. Read in the data set, and call the data frame "stop". Print a summary of the data and make sure the data makes sense.
```{r, message=FALSE}
(stop <- read_table2("StoppingDistance.txt"))
summary(stop)
```
#### 2. Create a scatterplot of the data with variables on the appropriate axes (think about which variable makes the most sense to be the response). Make your plot look professional (make sure the axes have appropriate limits to capture the data nicely, make sure the axis labels are descriptive, etc.).
```{r, fig.align='center'}
(stop.plot <- ggplot(data = stop, mapping = aes(x = Speed, y = Distance)) +
geom_point() +
xlab("Car Speed (mph)") +
ylab("Stopping Distance (ft)") +
coord_fixed() +
theme_minimal() +
scale_x_continuous(limits = c(0, 50), minor_breaks = NULL) +
ylim(NA, 150))
```
#### 3. Briefly describe the relationship between Speed and Distance. (Hint: you should use 2 or 3 key words.)
As speed increases, stopping distance also increases, as we would expect. The relationship is strong, positive, and roughly linear, though it appears slightly curved.
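As a quick numeric check on that description (a brief sketch; the exact value depends on the data read in above), the sample correlation can be computed directly:
```{r}
# Sample correlation between speed and stopping distance; a value close to 1
# supports the "strong, positive" description above
cor(stop$Speed, stop$Distance)
```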
#### 4. Add the OLS regression line to the scatterplot you created in question 2 (note: if you receive a warning about rows with missing values, you may need to adjust an axis limit).
```{r, fig.align='center', message=FALSE}
stop.plot +
geom_smooth(method = "lm", se = F)
```
#### 5. Apply linear regression to the data (no transformations), and save the residuals and fitted values to the `stop` dataframe.
```{r}
stop.lm <- lm(Distance ~ Speed, data = stop)
summary(stop.lm)
stop <- stop %>%
mutate(resid = stop.lm$residuals, fit = stop.lm$fitted.values)
```
#### 6. Mathematically write out the fitted simple linear regression model for this data set using the coefficients you found above. Do not use "x" and "y" in your model - use variable names that are fairly descriptive.
$\widehat{\text{Distance}}_i=-20.131+3.142\cdot\text{Speed}_i$\
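The two coefficients in that equation come straight from the fitted object. As a short illustrative sketch (the 25 mph input is just an example, not part of the assignment), they can be extracted and used for a prediction like this:
```{r}
# Estimated intercept and slope used in the fitted equation above
coef(stop.lm)
# Illustrative prediction: estimated stopping distance (ft) at an example speed of 25 mph
predict(stop.lm, newdata = tibble(Speed = 25))
```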
### Questions 7-12 involve using diagnostics to determine if the linear regression assumptions are met. For each assumption, (1) perform appropriate diagnostics to determine if the assumption is violated, and (2) explain whether or not you think the assumption is violated and why you think that.
#### 7. (L) X vs Y is linear (use at least two diagnostic tools)
```{r, fig.align='center',message=FALSE}
# Scatterplot
stop.plot
# Resid vs. Fitted
(stop.resid_fitted <- autoplot(stop.lm, which = 1, ncol = 1) +
  theme_minimal() +
  coord_fixed())
```
No, the scatterplot and the residuals vs. fitted values plot both show slight curvature, so the linearity assumption does not appear to be fully met.
#### 8. (I) The residuals are independent (no diagnostic tools - just think about how the data was collected and briefly write your thoughts)
Given the information we have, we cannot know whether the observations are independent. There could be some dependence based on the cars used in the experiment or the times at which the data was collected.
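As an optional sanity check only (the prompt asks for reasoning rather than diagnostics), if the rows happen to be stored in collection order, a residuals-versus-order plot could hint at serial dependence; a flat, patternless band would be consistent with independence:
```{r, fig.align='center'}
# Residuals plotted in row order; only informative if the row order reflects
# the order in which the measurements were taken (an assumption)
ggplot(data = stop, mapping = aes(x = seq_along(resid), y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Observation Order") +
  ylab("Residual") +
  theme_minimal()
```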
#### 9. (N) The residuals are normally distributed and centered at zero (use at least three diagnostic tools)
```{r, fig.align='center'}
# Q-Q
(stop.qq <- autoplot(stop.lm, which = 2, ncol = 1, nrow = 1) +
theme_light() +
coord_fixed())
# Histogram
ggplot(data = stop, mapping = aes(x = resid(stop.lm))) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 3) +
stat_function(fun = dnorm,
color = "blue",
args = list(mean = mean(stop$resid),
sd = sd(stop$resid))) +
xlab("Residuals") +
ylab("Density") +
theme_light()
# Boxplot
ggplot(data = stop, mapping = aes(y = resid(stop.lm))) +
geom_boxplot() +
stat_summary(mapping = aes(x = 0),
fun = mean, geom = "point",
shape = 4, size = 2, color = "darkred") +
theme_classic() +
theme(aspect.ratio = 2,
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
scale_y_continuous(limits = c(-30, 40), breaks = seq(-30, 40, 10)) +
ylab("Residuals") +
xlab("")
```
The residuals are somewhat right-skewed rather than normally distributed, so this assumption is not met.
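As one more hedged check on that conclusion (not required by the prompt), a Shapiro-Wilk test gives a formal complement to the plots above:
```{r}
# Shapiro-Wilk test of normality on the residuals; a small p-value indicates
# evidence against normality, consistent with the right skew seen above
shapiro.test(resid(stop.lm))
```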
#### 10. (E) The residuals have equal/constant variance across all values of X (use two diagnostic tools)
```{r, fig.align='center'}
#Resid vs Fitted
stop.resid_fitted
# Brown-Forsythe test: split the observations (ordered by Speed) into a lower
# and an upper half and compare the spread of the residuals between the halves
grp <- as.factor(c(rep("lower", floor(dim(stop)[1] / 2)),
                   rep("upper", ceiling(dim(stop)[1] / 2))))
leveneTest(pull(stop[order(stop$Speed), "resid"]) ~ grp, center = median)
```
The Residuals vs. Fitted Values plot suggests the variance is not constant, and the Brown-Forsythe statistic is large enough to cause some doubt. This assumption is not met.
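As an additional hedged check (beyond the two tools the prompt asks for), the score test for non-constant variance from the `car` package addresses the same question as the Brown-Forsythe test:
```{r}
# Score test for non-constant error variance; a small p-value indicates
# evidence against the equal-variance assumption
ncvTest(stop.lm)
```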
#### 11. (A) The model describes all observations (i.e., there are no influential points) (use at least four diagnostic tools)
```{r, fig.align='center'}
#Q-Q
stop.qq
# Cook's Distance
stop <- stop %>% mutate(cooksd = cooks.distance(stop.lm))
top3cd <- as.numeric(names(sort(stop$cooksd, decreasing = TRUE)[1:3]))
ggplot() +
geom_point(data = stop,
mapping = aes(x = as.numeric(rownames(stop)),
y = cooksd)) +
geom_text(mapping = aes(x = top3cd,
y = stop$cooksd[top3cd],
label = top3cd)) +
theme_bw() +
ylab("Cook's Distance") +
xlab("Observation Number") +
geom_hline(mapping = aes(yintercept = 4 / length(stop$cooksd)),
color = "red", linetype = "dashed") +
theme(aspect.ratio = 1)
# DFBETAS
stop <- stop %>%
mutate(dfbetas_speed = dfbetas(stop.lm)[, "Speed"])
names(stop$dfbetas_speed) <- 1:nrow(stop)
top3dfbeta <- as.numeric(names(
  sort(abs(stop$dfbetas_speed), decreasing = TRUE)[1:3]
))
# Plot the DFBETAS against the observation number
ggplot() +
  geom_point(data = stop,
             mapping = aes(x = as.numeric(rownames(stop)),
                           y = abs(dfbetas_speed))) +
  geom_text(mapping = aes(x = top3dfbeta,
                          y = abs(stop$dfbetas_speed[top3dfbeta]),
                          label = top3dfbeta)) +
  theme_bw() +
  ylab("Absolute Value of DFBETAS for Speed") +
  xlab("Observation Number") +
  geom_hline(mapping = aes(yintercept = 2 / sqrt(length(stop$dfbetas_speed))),
             color = "red", linetype = "dashed") +
  theme(aspect.ratio = 1)
#Resid vs Fitted
stop.resid_fitted
```
All of the metrics identify observations 55, 60, and 62 as potentially influential points. However, based on the plots of Cook's distance and DFBETAS, I would say we only need to worry about observations 60 and 62.
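To see what makes those points stand out, one brief sketch is to print the flagged rows (observation numbers taken from the labels in the plots above) alongside their diagnostic values:
```{r}
# Rows flagged as potentially influential in the plots above
stop[c(55, 60, 62), c("Speed", "Distance", "resid", "cooksd", "dfbetas_speed")]
```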
#### 12. (R) Additional predictor variables are not required (no diagnostic tools - just think about the variables you have and if there are other variables you think would help predict the response)
As in most simple linear regressions, this assumption is not perfectly met. Other variables, such as tire type, car weight, or whether the car has an anti-lock braking system, could also help predict stopping distance.
#### 13. Based on your analysis of the diagnostic measures, briefly discuss why this simple linear regression model on the raw data (not transformed) is *not* appropriate.
The assumptions for linear regression are poorly met at best. In particular, the relationship does not appear to be linear, the residuals are not normally distributed, and their variance is not constant.
#### 14. Fix the model by making any necessary transformations. Justify the transformation you chose in words (why did you choose to transform just x, just y, or both?). (Note: if boxCox(mod) throws an error, replace mod with the formula for the linear model, y ~ x.) (Note: you will most likely need to repeat questions 14 and 18 until you are satisfied with the transformation you chose. Only then should you fill out this section - I only want to see the model you end up choosing, not all of your attempted models.)
```{r, fig.align='center'}
# Box-Cox profile likelihood to suggest a transformation of the response
bc <- boxCox(stop$Distance ~ stop$Speed)
# Square-root transform the response and refit
stop_sqrt <- stop %>%
  mutate(Distance = sqrt(Distance))
stop_sqrt.lm <- lm(Distance ~ Speed, data = stop_sqrt)
summary(stop_sqrt.lm)
stop_sqrt <- stop_sqrt %>%
mutate(resid = stop_sqrt.lm$residuals, fit = stop_sqrt.lm$fitted.values)
```
Since the residuals are not normally distributed, we most likely need to transform y. The Box-Cox procedure suggests that a square-root transformation would be the most advantageous. After applying that transformation the data meet the assumptions, so we need look no further.
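To make that justification concrete, the lambda value that maximizes the Box-Cox profile log-likelihood can be read off the `bc` object created in the chunk above; a value near 0.5 is what points to the square-root transformation:
```{r}
# Lambda with the highest Box-Cox profile log-likelihood; a value near 0.5
# corresponds to a square-root transformation of the response
bc$x[which.max(bc$y)]
```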
### Now, in Questions 15-18, re-check your transformed model and verify that the assumptions (the assumptions that were addressed in the questions above) are met. Provide a brief discussion about how each of the previously violated assumptions are now satisfied. Also, provide the code you used to assess adherence to the assumptions. (Note that transforming will not change your responses about (I) the residuals being independent and (R) additional predictor variables not being required, so we will skip these assumptions here.)
#### 15. (L) Linearity (use at least two diagnostic tools)
```{r, fig.align='center', message=FALSE}
# Scatterplot
(stop_sqrt.plot <- ggplot(data = stop_sqrt, mapping = aes(x = Speed, y = Distance)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
xlab("Car Speed (mph)") +
ylab("Stopping Distance (ft)") +
theme_minimal() +
coord_fixed())
# Resid vs. Fitted
(stop_sqrt.resid_fitted <- autoplot(stop_sqrt.lm, which = 1, ncol = 1) +
theme_minimal() +
coord_fixed())
```
The transformed data look very linear in the scatterplot, and the residuals vs. fitted values plot shows no obvious curvature, so this assumption is now met.
#### 16. (N) The residuals are normally distributed and centered at zero (use at least three diagnostic tools)
```{r, fig.align='center'}
# Q-Q
(stop_sqrt.qq <- autoplot(stop_sqrt.lm, which = 2, ncol = 1, nrow = 1) +
theme_light() +
coord_fixed())
# Histogram
ggplot(data = stop_sqrt, mapping = aes(x = resid(stop_sqrt.lm))) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 0.25) +
stat_function(fun = dnorm,
color = "blue",
args = list(mean = mean(stop_sqrt$resid),
sd = sd(stop_sqrt$resid))) +
xlab("Residuals") +
ylab("Density") +
theme_light()
# Boxplot
ggplot(data = stop_sqrt, mapping = aes(y = resid(stop_sqrt.lm))) +
geom_boxplot() +
stat_summary(mapping = aes(x = 0),
fun = mean, geom = "point",
shape = 4, size = 2, color = "darkred") +
theme_classic() +
theme(aspect.ratio = 2,
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
# scale_y_continuous(limits = c(-1.5, 1.5), breaks = seq(-1.5, 1.5, 0.5)) +
ylab("Residuals") +
xlab("")
```
The residuals are not perfectly normally distributed, but close enough to meet our assumption.
#### 17. (E) The residuals have equal/constant variance across all values of X (use two diagnostic tools)
```{r, fig.align='center'}
#Resid vs Fitted
stop_sqrt.resid_fitted
# Brown-Forsythe test on the transformed model's residuals
grp <- as.factor(c(rep("lower", floor(dim(stop_sqrt)[1] / 2)),
                   rep("upper", ceiling(dim(stop_sqrt)[1] / 2))))
leveneTest(pull(stop_sqrt[order(stop_sqrt$Speed), "resid"]) ~ grp, center = median)
```
The Brown-Forsythe statistic has not changed much, but the residuals vs. fitted values plot looks much better. This assumption is met.
#### 18. (A) The model describes all observations (i.e., there are no influential points) (use at least four diagnostic tools)
```{r, fig.align='center',message=FALSE}
#Q-Q
stop_sqrt.qq
# Cook's Distance
stop_sqrt <- stop_sqrt %>% mutate(cooksd = cooks.distance(stop_sqrt.lm))
top3cd <- as.numeric(names(sort(stop_sqrt$cooksd, decreasing = TRUE)[1:3]))
ggplot() +
geom_point(data = stop_sqrt,
mapping = aes(x = as.numeric(rownames(stop_sqrt)),
y = cooksd)) +
geom_text(mapping = aes(x = top3cd,
y = stop_sqrt$cooksd[top3cd],
label = top3cd)) +
theme_bw() +
ylab("Cook's Distance") +
xlab("Observation Number") +
geom_hline(mapping = aes(yintercept = 4 / length(stop_sqrt$cooksd)),
color = "red", linetype = "dashed") +
theme(aspect.ratio = 1)
# DFBETAS
stop_sqrt <- stop_sqrt %>%
mutate(dfbetas_speed = dfbetas(stop_sqrt.lm)[, "Speed"])
names(stop_sqrt$dfbetas_speed) <- 1:nrow(stop_sqrt)
top3dfbeta <- as.numeric(names(
  sort(abs(stop_sqrt$dfbetas_speed), decreasing = TRUE)[1:3]
))
# Plot the DFBETAS against the observation number
ggplot() +
  geom_point(data = stop_sqrt,
             mapping = aes(x = as.numeric(rownames(stop_sqrt)),
                           y = abs(dfbetas_speed))) +
  geom_text(mapping = aes(x = top3dfbeta,
                          y = abs(stop_sqrt$dfbetas_speed[top3dfbeta]),
                          label = top3dfbeta)) +
  theme_bw() +
  ylab("Absolute Value of DFBETAS for Speed") +
  xlab("Observation Number") +
  geom_hline(mapping = aes(yintercept = 2 / sqrt(length(stop_sqrt$dfbetas_speed))),
             color = "red", linetype = "dashed") +
  theme(aspect.ratio = 1)
# Resid vs Fitted
stop_sqrt.resid_fitted
# Perform regression without possible influential points
ggplot(data = stop_sqrt, mapping = aes(x = Speed, y = Distance)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(data = stop_sqrt[c(-60,-55),],
method = "lm",
se = FALSE,
color = "red") +
xlab("Car Speed (mph)") +
ylab("Stopping Distance (ft)") +
theme_minimal()
```
Although some possible influential points are identified, they do not make a significant difference to the regression line. This assumption is met.
#### 19. Mathematically write out the fitted simple linear regression model for this data set using the coefficients you found above from your transformed model. Do not use "x" and "y" in your model - use variable names that are fairly descriptive.
$\widehat{\sqrt{\text{Distance}_i}}=0.932396+0.252466\cdot\text{Speed}_i$\
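Because the model is on the square-root scale, a prediction on the original scale requires squaring the fitted value. As a brief worked example (30 mph is an illustrative input, not part of the assignment):
```{r}
# Illustrative prediction at 30 mph: predict on the square-root scale,
# then square to return to the original scale (feet)
(predict(stop_sqrt.lm, newdata = tibble(Speed = 30)))^2
```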
#### 20. Plot your new fitted *curve* on the scatterplot of the original data (on the original scale - not the transformed scale). Do you think this curve fits the data better than the line you previously fit?
```{r}
# Predict on the square-root scale over a grid of speeds, then square the
# predictions to return to the original scale (feet)
xs <- seq(0, 42)
preds <- predict(stop_sqrt.lm, newdata = tibble(Speed = xs))
curve <- tibble(x = xs, y = preds^2)
stop.plot +
  geom_line(data = curve, mapping = aes(x = x, y = y), color = "blue")
```
Yes, this curve fits the data much better.
#### 21. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.
I learned that linear plots are beautiful. I also learned that the Box-Cox procedure is very useful for choosing a transformation.
#### 22. Briefly summarize what you learned from this analysis *to a non-statistician*. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
The purpose of this analysis was to predict how much distance a car needs to stop based on the speed it is traveling. We found that we can predict stopping distance with reasonable accuracy from speed alone, and that the distance needed grows faster than the speed itself.