---
title: "Homework 7"
subtitle: <center> <h1>Logistic Regression</h1> </center>
author: <center> Joshua Carpenter </center>
output: html_document
---
<style type="text/css">
h1.title {
font-size: 40px;
text-align: center;
}
</style>
```{r setup, include=FALSE}
library(devtools)
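# Install the development version of jcreg from GitHub; its helper
# functions are used throughout this analysis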
install_github("mtnbikerjoshua/jcreg@development")
library(jcreg)
library(gridExtra)
library(tidyverse)
library(car)
library(knitr)
library(pROC)
set.seed(1321)
```
## Data and Description
**NOTE: For this assignment, you do not need to make your graphs/plots look "pretty", meaning you do not need to worry about changing the axis limits, relabeling axes, etc.**
Type 2 diabetes is a problem with the body that causes blood sugar levels to rise higher than normal (hyperglycemia) because the body does not use insulin properly. Specifically, the body cannot make enough insulin to keep blood sugar levels normal. Type 2 diabetes is associated with various health complications such as neuropathy (nerve damage), glaucoma, cataracts and various skin disorders. Early detection of diabetes is crucial to proper treatment so as to alleviate complications.
The data set contains information on 392 randomly selected women who are at risk for diabetes and includes the following variables:
Variable | Description
--------- | -------------
pregnant | Number of times pregnant
glucose | Plasma glucose concentration at 2 hours in an oral glucose tolerance test
diastolic | Diastolic blood pressure (mm Hg)
triceps | Triceps skin fold thickness (mm)
insulin | 2 hour serum insulin (mu U/ml)
bmi | Body mass index ($kg/m^2$, mass in kilograms divided by height in meters-squared)
pedigree | Numeric strength of diabetes in family line (higher numbers mean stronger history)
age | Age
diabetes | Does the patient have diabetes (0 if "No", 1 if "Yes")
The data can be found in the Diabetes data set on Canvas. Download Diabetes.txt, and put it in the same folder as this R Markdown file.
#### 0. Replace the text "< PUT YOUR NAME HERE >" (above next to "author:") with your full name.
#### 1. Read in the data set, and call the data frame "dia". Print a summary of the data and make sure the data makes sense.
```{r}
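# Read the space-delimited file, treating the first column as row names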
(dia <- read.csv("Diabetes.txt", sep = " ", row.names = 1) %>%
as_tibble())
```
#### 2. Convert the response variable to a factor. Hint: use `as.factor` and override the current response column in the data set.
```{r}
dia <- dia %>%
mutate(diabetes = as.factor(diabetes))
```
#### 3. Explore the data: create a correlation matrix for the covariates. *Comment on why or why not you think multicollinearity may be a problem for this data set.*
Disclaimer: For many of the routine tasks in this homework, like EDA and variable selection, I use functions from jcreg, a package I wrote myself. If you want to see the code for those functions, see https://github.com/mtnbikerjoshua/jcreg. I included the code to install the package in the setup chunk of this .Rmd file so that the analysis is reproducible. Once you have the package installed, you can read the documentation via `?function_name`.
```{r, fig.align='center'}
dia %>%
select(-diabetes) %>%
cor_graphic() #From jcreg
```
Based on the correlation matrix, I do not think multicollinearity will be a problem for this data set. Most of the variables have little to no correlation with one another, and none of the more correlated pairs has a correlation coefficient greater than 0.7.
#### 4. Explore the data: create boxplots for these predictors against the response: glucose, bmi, pedigree, and age (4 plots in total. You may want to use the grid.arrange function from the gridExtra package to display them in a 2x2 grid). *Briefly comment on one interesting trend you observe.*
```{r, fig.align='center', fig.width=10}
dia_boxplot <- function(data, name = "Variable") {
ggplot(mapping = aes(y = data)) +
geom_boxplot() +
stat_summary(mapping = ggplot2::aes(x = 0),
fun = mean, geom = "point",
shape = 4, size = 2, color = "darkred") +
theme_classic() +
theme(aspect.ratio = 2,
axis.text.x = ggplot2::element_blank(),
axis.ticks.x = ggplot2::element_blank()) +
ylab(name) +
xlab("")
}
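# Build one boxplot per selected column, passing each column and its name in parallel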
dia_boxplots <- dia %>%
select(glucose, bmi, pedigree, age) %>%
mapply(dia_boxplot, data = ., name = names(.), SIMPLIFY = FALSE)
dia_boxplots["ncol"] <- 4
do.call(grid.arrange, dia_boxplots)
```
All of these variables are right-skewed, `pedigree` and `age` quite heavily.
#### 5. Explore the data: create jittered scatterplots for these predictors against the response: pregnant, diastolic, triceps, insulin (4 plots in total. You may want to use the grid.arrange function from the gridExtra package to display them in a 2x2 grid). *Briefly comment on one interesting trend you observe.*
```{r, fig.align='center'}
dia_jitterplot <- function(data, name = "Variable") {
ggplot(data = dia, mapping = aes(x = data, y = diabetes)) +
geom_jitter(height = 0.2) +
theme_classic() +
xlab(name)
}
dia_jitterplots <- dia %>%
select(pregnant, diastolic, triceps, insulin) %>%
mapply(dia_jitterplot, data = ., name = names(.), SIMPLIFY = FALSE)
dia_boxplots["ncol"] <- 2
do.call(grid.arrange, dia_jitterplots)
```
I notice that no single variable can clearly predict the response. Although the distributions of points for the positive and negative cases differ, there is no clean dividing line.
#### 6. Briefly explain why traditional multiple linear regression methods are not suitable for *this* data set. You should mention at least two of the reasons we discussed in class (*your reasons should refer to this data set (i.e. be specific, not general)*).
Since diabetes is a binary response, traditional linear regression will perform poorly here: predictions are not constrained to the range 0 to 1, so they cannot be interpreted as probabilities of having diabetes. Additionally, because the response takes only two values, the residuals cannot be normally distributed and will not have constant variance, so the linear regression assumptions fail from the start.
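As a quick illustration of the first point, here is a minimal sketch (not part of the required answer) that fits an ordinary linear model to the 0/1 response; the choice of predictors is arbitrary. Any fitted values outside [0, 1] cannot be interpreted as probabilities.
```{r}
# Sketch: a linear probability model can produce fitted values outside [0, 1]
# (illustration only; predictor choice is arbitrary)
lpm <- lm(as.numeric(as.character(diabetes)) ~ glucose + bmi + pedigree + age,
          data = dia)
range(fitted(lpm))
```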
#### 7. Use a variable selection procedure to help you decide which, if any, variables to omit from the logistic regression model you will soon fit. You may choose which selection method to use (best subsets, backward, sequential replacement, LASSO, or elastic net) and which metric/criteria to use (AIC, BIC, or CV/PMSE). *Briefly justify (in a few sentences) why you chose the method and metric that you did.*
```{r, fig.align='center', message=FALSE}
(vars <- var_selection(dia, method = c("best_subsets", "seqrep", "lasso"), family = binomial))
```
Out of all the variable selection methods, I chose best subsets for this data because we have a small number of variables, so we can compare every possible model and find the best one. I chose BIC as the metric because it penalizes extra variables most harshly and yields a simpler model, which I prefer as long as it still fits the data reasonably well. I also included results from sequential replacement and LASSO for comparison, but I will use the model selected by best subsets.
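For reference, a similar best-subsets search with BIC could be done without jcreg. The sketch below (not evaluated) uses the `bestglm` package, assuming it is installed; `bestglm` expects a data frame with the 0/1 response as the last column.
```{r, eval=FALSE}
library(bestglm)
# bestglm() expects the response, coded 0/1, as the last column of the data frame
Xy <- dia %>%
  mutate(diabetes = as.numeric(as.character(diabetes))) %>%
  relocate(diabetes, .after = last_col()) %>%
  as.data.frame()
bestglm(Xy, family = binomial, IC = "BIC")$BestModel
```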
#### 8. Write out the logistic regression model for this data set using the covariates that you see fit. You should use parameters/Greek letters (NOT the "fitted" model using numbers...since you have not fit a model yet;) ).
$$\log\left(\frac{p_i}{1-p_i}\right)=\beta_0+\beta_1\cdot\text{glucose}_i+\beta_2\cdot\text{bmi}_i+\beta_3\cdot\text{pedigree}_i+\beta_4\cdot\text{age}_i\ \ \ \text{where}\ \ \ p_i=P(\text{diabetes}_i=1)\ \ \ \text{and}\ \ \ \text{diabetes}_i\stackrel{ind}{\sim}\text{Bernoulli}(p_i)$$
#### 9. Fit a logistic regression model using the covariates you chose. Print a summary of the results.
```{r, fig.align='center'}
dia_model <- vars$best_models$best_subsets
summary(dia_model)
```
### Questions 10-14 involve using diagnostics to check the logistic regression model assumptions. For each assumption, (1) code the diagnostic(s) that I indicate (next to the assumption in parentheses) to determine if the assumption is violated, and (2) explain whether or not you think the assumption is violated and why you think that.
#### 10. The X's vs log odds are linear (monotone in probability) (Use scatterplots with smoothers)
```{r, fig.align='center'}
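# For each predictor, plot a robust loess smoother over the data to check
# that its relationship with the response is monotone in probability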
dia_smooth <- function(var) {
  dia_loess <- loess(as.formula(paste("as.numeric(diabetes) ~", var)),
                     data = dia, family = "symmetric")
  ggplot(data = dia, mapping = aes(x = .data[[var]], y = diabetes, group = FALSE)) +
    geom_point() +
    geom_line(mapping = aes(y = predict(dia_loess, dia))) +
    theme_bw() +
    theme(panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank()) +
    xlab(var)
}
dia_smoothed <- dia %>%
select(glucose, bmi, pedigree, age) %>%
names() %>%
lapply(dia_smooth)
do.call(grid.arrange, dia_smoothed)
```
All the plots are monotone except the one for `age`, so this assumption is not met.
#### 11. The observations are independent (no diagnostic tools - just think about how the data was collected and briefly write your thoughts)
The subjects were randomly selected, so there is no reason to expect dependence between observations. This assumption is met.
#### 12. The model describes all observations (i.e., there are no influential points) (Use DFFITS)
```{r, fig.align='center'}
jcreg_dffits(dia_model)
```
According to the DFFITS plot, there are no clearly influential points. This assumption is met.
#### 13. Additional predictor variables are not required (no diagnostic tools - just think about the variables you have and if there are other variables you think would help predict the response)
The most important predictors of diabetes are probably glucose and insulin, both of which are already in the data set. I can't think of any other variables that would meaningfully improve prediction of the response.
#### 14. No multicollinearity (Use variance inflation factors)
```{r, fig.align='center'}
vif(dia_model)
```
The VIFs are all low (close to 1), and the correlation coefficients were also low. This assumption is met.
#### 15. Briefly comment on if all assumptions are met. If there is anything you would like to do before proceeding to statistical inference, do that here.
```{r, fig.align='center'}
dia <- dia %>%
mutate(age = (age)^-1)
dia_smooth("age")
```
I transformed `age` to its reciprocal in order to satisfy the linear-in-the-log-odds assumption.
#### 16. For the coefficient for bmi, compute (and output) the log odds ratio ($\beta_{bmi}$, pull this value from the model output), odds ratio ($\exp\{\beta_{bmi}\}$), and the odds ratio converted to a percentage ($100 \times (\exp\{\beta_{bmi}\} - 1)%$). (If you cannot view the math used in this question (and subsequent), you can see it by knitting the document.)
```{r, fig.align='center'}
beta_bmi <- dia_model$coefficients["bmi"]
cat("Log odds ratio for BMI:", round(beta_bmi, 4))
oddsratio_bmi <- exp(beta_bmi)
cat("Odss ratio for bmi:", round(oddsratio_bmi,3))
cat("Odds ratio as percent increase for BMI:", round(100 * (oddsratio_bmi - 1), 2), "%")
```
#### 17. Interpret the coefficient for bmi based on the FOUR different ways we discussed in class.
*Interpretation 1:* Holding all else constant, for every one unit increase in BMI, we expect the log odds of having diabetes to increase by 0.0744.
*Interpretation 2:* Since the coefficient for BMI is greater than zero, we expect the probability of having diabetes to be higher for people with a higher BMI.
*Interpretation 3:* Holding all else constant, we expect a one unit increase in BMI to multiply the odds of having diabetes by 1.077.
*Interpretation 4:* Holding all else constant, we expect a one unit increase in BMI to increase the odds of having diabetes by 7.73%.
#### 18. Create (and output) 95% confidence intervals for $\beta_k$, $\exp\{\beta_k\}$, and $100 \times (\exp\{\beta_k\} - 1)%$ for all predictors using the `confint` function.
```{r, fig.align='center', message=FALSE}
cat("Confidence intervals for the log odds ratios:",
capture.output(print(confint(dia_model))),
sep = "\n")
cat("Confidence intervals for the odds ratios:",
capture.output(exp(confint(dia_model))),
sep = "\n")
cat("Confidence intervals for the odds ratios as percent increases:",
capture.output(100 * (exp(confint(dia_model)) - 1)),
sep = "\n")
```
#### 19. Interpret the 95% confidence intervals for bmi for $\beta_{bmi}$, $\exp\{\beta_{bmi}\}$, and $100 \times (\exp\{\beta_{bmi}\} - 1)%$ (three interpretations total).
*Interpretation using $\beta_{bmi}$:* We are 95% confident that, holding all else constant, a one unit increase in BMI will lead to an increase in the log odds of having diabetes of between 0.036 and 0.1154.
*Interpretation using $\exp\{\beta_{bmi}\}$:* We are 95% confident that, holding all else constant, for every additional unit of BMI, the odds of having diabetes are between 1.036 and 1.122 times greater.
*Interpretation using $100 \times (\exp\{\beta_{bmi}\} - 1)%$:* We are 95% confident that, holding all else constant, for every additional unit of BMI, the odds of having diabetes increase by 3.63 to 12.23 percent.
#### 20. Calculate a 95% confidence interval for the predicted probability that a patient has diabetes where pregnant = 1, glucose = 90, diastolic = 62, triceps = 18, insulin = 59, bmi = 25.1, pedigree = 1.268 and age = 25. Note that you may not need to use all of these values depending on the variables you chose to include in your model. *Do you think this patient will develop diabetes? Why or why not?*
```{r}
newdata <- data.frame(pregnant = 1, glucose = 90, diastolic = 62,
triceps = 18, insulin = 59, bmi = 25.1,
pedigree = 1.268, age = 25)
pred_prob <- predict(dia_model, newdata = newdata,
se.fit = TRUE)
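# predict() returns the fit and SE on the link (log odds) scale; build the CI
# there, then back-transform with the inverse logit. qnorm(0.5) = 0 keeps the
# point estimate as the first element.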
CI <- pred_prob$fit + qnorm(c(0.5, 0.025, 0.975)) * pred_prob$se.fit
names(CI) <- c("fit", "lower", "upper")
round(exp(CI) / (1 + exp(CI)), 2)
```
I do not think this person will develop diabetes because the fitted value and confidence interval bounds are very low, indicating a small probability of developing diabetes.
#### 21. Compute the likelihood ratio test statistic (aka deviance, aka model chi-squared test) for the model, and compute the associated $p$-value. Print out the test statistic and the $p$-value. *Based on the results, what do you conclude?*
```{r, fig.align='center'}
(LRT_stat <- dia_model$null.deviance - dia_model$deviance)
(LRT_pval <- pchisq(LRT_stat, df = dia_model$df.null - dia_model$df.residual,
                    lower.tail = FALSE))
```
With a p-value of approximately zero, we conclude that at least one of the predictors has a significant effect on the probability of developing diabetes.
#### 22. Compute (and output) the pseudo $R^2$ value for the model.
```{r, fig.align='center'}
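# Pseudo R-squared as the proportional reduction in deviance relative to the null model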
cat("Pseudo R-squared:", 1 - dia_model$deviance / dia_model$null.deviance)
```
#### 23. What is the best cutoff value for the model that minimizes the percent misclassified? Show your code and output the best cutoff value.
```{r, fig.align='center'}
dia_preds <- predict(dia_model, type = "response")
possible_cutoffs <- seq(0, 1, length = 100)
diabetes_binary <- ifelse(dia$diabetes == "1", 1, 0)
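# For each cutoff, classify as positive when the predicted probability is at
# least the cutoff, then compute the proportion misclassified against the truth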
percent_misclassified <- lapply(possible_cutoffs, "<=", dia_preds) %>%
lapply("!=", diabetes_binary) %>%
sapply(mean, simplify = TRUE)
ggplot(mapping = aes(x = possible_cutoffs, y = percent_misclassified)) +
geom_line() +
theme_classic() +
theme(aspect.ratio = 1) +
xlab("Possible Cutoffs") +
ylab("Percent Misclassified")
(cutoff <- possible_cutoffs[which.min(percent_misclassified)])
```
#### 24. Create (and output) a confusion matrix using the best cutoff value you found above.
```{r, fig.align='center'}
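# Cross-tabulate the true response against predictions classified at the given cutoff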
conf_mat <- function(model, cutoff) {
predicted <- as.numeric(fitted(model) > cutoff)
table("Truth" = model$y, "Predicted" = predicted)
}
(dia_confusion <- conf_mat(dia_model, cutoff))
```
#### 25. Based on the confusion matrix, what is the value for the specificity, and what does the specificity measure? Print the specificity.
```{r, fig.align='center'}
cat("Specificity:", dia_confusion["0", "0"] / sum(dia_confusion["0",]))
```
Specificity measures the proportion of people who do not have diabetes that are correctly classified as such.
#### 26. Based on the confusion matrix, what is the value for the sensitivity, and what does the sensitivity measure? Print the sensitivity.
```{r, fig.align='center'}
cat("Sensitivity:", dia_confusion["1", "1"] / sum(dia_confusion["1",]))
```
Sensitivity measures the proportion of people who have diabetes that are correctly classified as such.
#### 27. Based on the confusion matrix, what is the percent correctly classified (accuracy), and what does the percent correctly classified measure? Print the percent correctly classified.
```{r, fig.align='center'}
cat("Accuracy:", (dia_confusion["1", "1"] + dia_confusion["0", "0"]) / sum(dia_confusion))
```
The percent correctly classified measures exactly what it sounds like: the proportion of all people, with or without diabetes, who are correctly classified.
#### 28. Plot (and output) the ROC curve for the model (either using the `pROC` package or the `ROCR` package).
```{r, fig.align='center'}
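# Compute the ROC curve with pROC, then plot sensitivity against 1 - specificity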
dia_roc <- roc(dia$diabetes, dia_preds)
ggplot() +
geom_path(mapping = aes(x = 1 - dia_roc$specificities,
y = dia_roc$sensitivities)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
theme_bw() +
xlab("1 - Specificity (False Positive Rate)") +
ylab("Sensitivity (True Positive Rate)") +
theme(aspect.ratio = 1)
```
#### 29. What is the AUC for the ROC curve plotted above? Print the value of the AUC.
```{r, fig.align='center'}
cat("AUC:", auc(dia_roc))
```
#### 30. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.
My main takeaways from this homework relate to model evaluation for logistic regression: I came to understand the confusion matrix, sensitivity, specificity, accuracy, and the ROC curve.
#### 31. Briefly summarize what you learned from this analysis *to a non-statistician*. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
The purpose of this analysis was to develop a method for early detection of diabetes. We concluded that we can predict diabetes in at-risk women with 80% accuracy based on the results of an oral glucose tolerance test, family history of diabetes, age, and body mass index.