GroupProject1.Rmd

---
title: "Group Project 1"
author: "Joshua Carpenter, Mark Guymon, Carter Erickson, Kevin Sanchez"
date: "2/12/2021"
output: html_document
---

```{r setup, include=FALSE}
library(tidyverse)
library(ggfortify)
library(car)
```

# Background and Introduction
Sports have always been of interest to people around the world, and one of the things that is spoken of most often in sports is who is the best. The sport of soccer in particular attributes greatness to those who do one thing very well: score goals. There are many factors that make it easier and harder for a player to score goals, and thus determine how great of a player they are.

One such factor in determining goals scored in one's career is how many matches the same person has played in. While some players may play hundreds, if not thousands of games, players who play in fewer matches can still score more goals. By understanding this relationship, both teams and fans alike will be able to have an understanding of who the better player is in a given scenario, and what they can expect out of this player if he is on their team. 

We are interested in better understanding this relationship between the number of goals scored in a forward's career and how many matches he has appeared in. Given recent trends in soccer history, we hypothesize that matches played will have a significant affect on goals scored, specifically that the more matches played in, the more goals a player will have scored. 

To test our assumption, we obtained data on 163 players across various teams in the Premier League. We will begin our analysis by applying basic summary statistics and exploratory data techniques to better understand the data. Then, we apply simple linear regression with goals scored as the response, regressed on the number of matches played. 

We conclude our analysis by using what we learned to infer to the broader soccer player population. We recommend picking players that produce more goals per match than others, given what we found in our analysis. We also suggest waiting until a player has played at least 3 or 4 years before making any definitive decisions on his skill level. 

# Methods and Results

In an effort to understand the impact of number of games played on number of goals scored, we obtained a data set that contains information for 572 soccer players across various teams and leagues. The data comes from an online database at https://footystats.org/download-stats-csv, and we downloaded the data set (a .csv file) on February 12, 2021. 

The following table displays the variable names in this data set, along with their descriptions.

Variable    | Description
----------  | -------------
appearances | Number of matches a player has appeared in
goals       | Number of goals scored by the player

We start by applying basic summary and exploratory statistics to this data to better understand the data and identify trends.

```{r, message=FALSE}
# Read in the data set, include only data for forwards
(premier <- read_csv("premier1819.csv") %>%
  filter(position %in% c("Forward")) %>%
  select(appearances = appearances_overall, goals = goals_overall))
summary(premier)
```

```{r}
(premier_plot <- ggplot(data = premier, mapping = aes(x = appearances, y = goals)) +
  geom_point() +
  theme_bw() +
  xlim(0, 40) +
  xlab("Appearances") +
  ylab("Goals"))
paste(
  "Correlation coefficient:",
  round(cor(premier$appearances, premier$goals), 5)
)
```
From our exploratory data analyses, we notice several interesting features. First, matches appeared in is slightly positively correlated with number of goals scored. This is what we expected. After analyzing the histograms of the variables, we notice goals scored is slightly right skewed. Knowing this can lead to problems for some of the assumptions, we may need to transform goals scored in the future.

From our EDA, we noticed and were concerned about some of the very low values for goals scored for some players. Because there were several players who hadn't scored a goal in a match they appeared in, we decided to transform the dataset to make it workable. We excluded all players except forwards and we used the box-cox approach to transform the data. We looked into the other values and determined the were correct. Since they are correct, we may want to fit a model with and without these observations to determine the amount of influence these observations have on the model. The box-cox method told us that a reciprocal square root would be the best transform and that is what we decided to go with even though the assumptions are not perfectly met.

```{r}
# Shift the data up one to remove zeros
premier <- premier %>%
  mutate(goals = goals + 1)

# Use the box cox approach to find the best transformation
bc <- boxCox(premier$goals ~ premier$appearances)  # plot curve
bc$x[which.max(bc$y)]  # pull out the "best" lambda value

# Transform the data
premier_trans <- premier %>%
  mutate(goals = 1/sqrt(goals))

# Plot the transformed data and check to correlation coefficient
(premier_trans_plot <-
  ggplot(data = premier_trans, mapping = aes(x = appearances, y = goals)) +
  geom_point() +
  theme_bw() +
  xlim(0, 40) +
  xlab("Appearances") +
  ylab("Transformed Goals"))
paste(
  "Correlation coefficient:",
  round(cor(premier_trans$appearances, premier_trans$goals), 5)
)
```

We now want to fit a multiple linear regression model to the data set with MPG as the response and Weight as the predictor. Here is the general linear model we want to fit:

$\text{Goals Scored}_i = \beta_0 + \beta_1\text{Matches Played}_i + \epsilon_i \space \text{where} \space \epsilon_i \sim N(0, \sigma^2)$

We now fit an initial model.

```{r, message=FALSE}
# Fit a linear model to the transformed data and save the residuals
premier_lm <- lm(goals ~ appearances, data = premier_trans)
premier_trans$residuals <- premier_lm$residuals

# Plot the regression line on the transformed data
premier_trans_plot +
  geom_smooth(method = "lm", se = FALSE)
```

## Check Assuptions

### L) Linear
```{r}
# Scatterplot
premier_trans_plot

# Resids vs fitted
(premier_residfit <- autoplot(premier_lm, which = 1, ncol = 1) +
    theme_classic())
```
The data is not perfectly linear, but better than before the transformation. We could wish for better, but the data is linear enough. This assumption is met.

### I) Independent
The residuals are independent: Considering there can only be so many forwards on a given team, it would be very improbable to obtain the data for two forwards at the same time. The only case where two forwards would be in the same place at the same time is when they were on opposing teams, so they would have little to no effect on each other.

### N) Normal
```{r}
# Q-Q Plot
(premier_QQ <- autoplot(premier_lm, which = 2, ncol = 1, nrow = 1) +
   theme_bw() +
   coord_fixed())

# Histogram
premier_hist <- ggplot(data = premier_trans, mapping = aes(x = residuals)) + 
  geom_histogram(mapping = aes(y = ..density..), binwidth = 0.04) +
  stat_function(fun = dnorm, 
                color = "red",
                args = list(mean = mean(premier_trans$residuals), 
                            sd = sd(premier_trans$residuals))) +
  theme_minimal() +
  xlab("Residuals") +
  ylab("Density")
premier_hist

# Box-plot
ggplot(data = premier_trans, mapping = aes(y = residuals)) +
  geom_boxplot() +
  stat_summary(mapping = aes(x = 0),
               fun = mean, geom = "point",
               shape = 4, size = 2, color = "darkred") +
  theme_classic() +
  theme(aspect.ratio = 2,
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  scale_y_continuous(limits = c(-0.5, 0.5), breaks = seq(-0.4, 0.4, 0.2)) +
  ylab("Residuals") +
  xlab("")
```
The residuals are roughly normally distributed. There is some slight right-skewness, but not too worrisome. This assumption is met.

### E) Equal
```{r}
# Residuals vs Fitted Values
premier_residfit

# Brown-Forsythe Test
grp <- as.factor(c(rep("lower", floor(dim(premier)[1] / 2)), 
                   rep("upper", ceiling(dim(premier)[1] / 2))))
leveneTest(arrange(premier_trans, appearances)$residuals ~ grp, center = median)
```
This assumption is not met, as clearly indicated by both the residuals vs. fitted values plot and the Brown-Forsyth test, however we are going to continue with the analysis for practice sake. This transformation was the best we could get.

### A) All
```{r}
#Q-Q
premier_QQ

# Cook's Distance
premier_trans <- premier_trans %>%
  mutate(cooksd = cooks.distance(premier_lm))
top4cd <- as.numeric(names(sort(premier_trans$cooksd, decreasing = TRUE)[1:4]))

ggplot() +
  geom_point(data = premier_trans,
             mapping = aes(x = as.numeric(rownames(premier_trans)), 
                           y = cooksd)) +
  geom_text(mapping = aes(x = top4cd,
                          y = premier_trans$cooksd[top4cd],
                          label = top4cd)) +
  theme_bw() +
  ylab("Cook's Distance") +
  xlab("Observation Number") +
  geom_hline(mapping = aes(yintercept = 4 / length(premier_trans$cooksd)),
             color = "red", linetype = "dashed") +
  theme(aspect.ratio = 1)

# DFBETAS
premier_trans <- premier_trans %>%
  mutate(dfbetas_appearances = dfbetas(premier_lm)[, "appearances"])
names(premier_trans$dfbetas_appearances) <- 1:nrow(premier_trans)
top3dfbeta <- as.numeric(names(
  sort(abs(premier_trans$dfbetas_appearances), decreasing = TRUE)[1:3]
))

# Plot the DFBETAS against the observation number
ggplot() + 
  geom_point(data = premier_trans,
             mapping = aes(x = as.numeric(rownames(premier_trans)), 
                           y = abs(dfbetas_appearances))) +
  geom_text(mapping = aes(x = top3dfbeta,
                          y = abs(premier_trans$dfbetas_appearances[top3dfbeta]),
                          label = top3dfbeta)) +
  theme_bw() +
  ylab("Absolute Value of DFBETAS for Runoff") +
  xlab("Observation Number") +
  geom_hline(mapping = aes(yintercept = 2 / sqrt(length(premier_trans$dfbetas_appearances))),
             color = "red", linetype = "dashed") +
  theme(aspect.ratio = 1)

#Resid vs Fitted
premier_residfit
```
This assumption is met. Though the plots indicate a few possible influential points, they are grouped closely with the rest of the data and they are not repeated accross plots. There are none to worry about.

### R) Required
There aren’t many other additional predictor variables that are likely to help in predicting the response variable of goals scored. The only thing that may affect the number of goals scored per forward would be the weather in a given match, but it varies pretty randomly and would have an equal effect on all forwards, so there isn’t much extra variability that isn’t accounted for.

## Analyze the results
Now that we have a model that describes the data well with all assumptions met (except constant variance), we would like to use the model to make inferences and predictions. Here is our fitted linear model:
  
  $\frac{1}{\sqrt{\widehat{\text{Goals}}_i}} = 1.016242 - 0.019850\cdot\text{Appearences}_i$
  
To start, we will assess the model slopes, confidence intervals of the slopes, and hypothesis tests. Note that our hypothesis tests are not valid since the constant variance assumption was not met, but we will interpret them anyway for practice.

```{r, fig.align='center'}
summary(premier_lm)
confint(premier_lm, level = 0.95, parm = "appearances")
```

As number of appearances increases by one, we are 95% confident that the inverse square root of the average number of goals scored per player decreases between 0.0223 and 0.0174. This range suggests that there is a significant relationship between the number of matches played and goals scored. A p-value of < 2.2e-16 indicates the same thing because it is such a small value we feel comfortable rejecting the null hypothesis of a relationship not existing. 

We now want to get predictions for new players. We are particularly interested in the predicted average number of goals scored for a player that plays in 30 games (an estimate of the average amount of games a soccer player plays in a year). We will use this information to create confidence and prediction intervals for average goals scored.  
```{r, fig.align='center'}
1/predict(premier_lm, 
        newdata = data.frame(appearances = 30),
        interval = "confidence",
        level = 0.95)^2
```

We are 95% confident that the average number of goals scored by a player who appears 30 times is between 4.72 and 6.89. 

```{r, fig.align='center'}
1/predict(premier_lm, 
        newdata = data.frame(appearances = 30),
        interval = "prediction",
        level = 0.95)^2
```

We are 95% confident that if we were to observe one more player who appeared 30 times, he will score between 1.81 and 5.65 times.

We also plotted the confidence and prediction intervals across all values of games played. You can see that the prediction interval is inaccuate thanks to the badly met assumptions. 
```{r, fig.align='center',message=FALSE,warning=FALSE}
appearances_values <- seq(min(premier$appearances), max(premier$appearances), 
                          length = 100)

conf_int_mean <- predict(premier_lm, 
                         newdata = data.frame(appearances = appearances_values),
                         interval = "confidence",
                         level = 0.95)

preds <- data.frame("appearances_values" = appearances_values, 1/conf_int_mean^2)

premier_plot +
  geom_line(data = preds, mapping = aes(x = appearances_values, y = fit),
            color = "blue", size = 1)+
  
  geom_line(data = preds, mapping = aes(x = appearances_values, y = lwr),
            color = "red", size = 1)+
  
  geom_line(data = preds, mapping = aes(x = appearances_values, y = upr),
            color = "red", size = 1)

appearances_values <- seq(min(premier$appearances), max(premier$appearances), 
                          length = 100)

pred_int <- predict(premier_lm, 
                         newdata = data.frame(appearances = appearances_values),
                         interval = "prediction",
                         level = 0.95)

preds <- data.frame("appearances_values" = appearances_values, ifelse(1/pred_int^2 < 30, 1/pred_int^2, NA))

premier_plot +
  geom_line(data = preds, mapping = aes(x = appearances_values, y = fit),
            color = "blue", size = 1)+
  
  geom_line(data = preds, mapping = aes(x = appearances_values, y = lwr),
            color = "red", size = 1)+
  
  geom_line(data = preds, mapping = aes(x = appearances_values, y = upr),
            color = "red", size = 1)
```
```{r}
#MSE
anova <- aov(premier_lm)  # get ANOVA components
premier_anova <- summary(anova)[[1]]  # save data in a usable form
premier_anova

premier_mse <- premier_anova["Residuals", "Sum Sq"] / premier_anova["Residuals", "Df"]
premier_mse

```
The MSE in this case is very hard to interpret.
```{r}
#RMSE
premier_rmse <- sqrt(premier_mse)
premier_rmse

```
An RMSE of 0.16 on the scale of the transformed data is fairly large, indicating that the model does not fit the data very well.
```{r}
#MAE
sum(abs(premier_trans$residuals)) / premier_lm$df.residual

```
See interpretation of RMSE.

```{r}
summary(premier_lm)
```
The R-Squared tells us that our model explains about 70% of the variance in the data. This is not too bad, so the model fits the data alright.

# Summary and Conclusions

Understanding how many times a player appears in a match affects the number of goals he scores is very important in determining the talent level of a given player, which assists team managers in deciding whether or not to use a team's allotted cap space to sign a player to a lucrative deal. We conducted an analysis to determine how games played impacts goals scored, assuming we would find a significant positive relationship between the two. After fitting a simple linear regression model, we found that matches played, does, indeed, have a significant positive impact on goals scored. We can likely not trust these results, however, since the model didn't fit the data very well and not all the assumptions were met. That is not to say that there is no relationship whatsoever, it just might not be as strong or weak as the model suggests.