---
format:
html:
number-depth: 3
css: summary-format.css
---
# Resampling methods
## Introduction
*Resampling methods* are **indispensable** for obtaining additional **information** about a `fitted model`.
In this chapter, we will explore the following methods:
- **Cross-validation**
- Used to estimate the `test error` in order to evaluate a ***model's performance*** (*model assessment*).
  - Used to find the `location` of the *minimum point in the test error curve*, either to choose between ***several models*** or to find the best ***level of flexibility*** for a single model (*model selection*).
- **Bootstrap**
  - Estimates the ***uncertainty*** associated with a given value or statistical learning method. For example, it can estimate the `standard errors` of the coefficients from a linear regression fit.
## Cross-Validation
### Training error limitations
The **training error** tends to underestimate the model's **real (test) error**.
![](img/01-Training-vs-Test-Error.png){fig-align="center"}
To address that problem, we can ***hold out a subset of the training observations*** from the fitting process and use it to estimate the test error.
### Validation Set Approach
***Randomly splits*** the available set of observations into two parts: a ***training set*** and a ***validation set***.
![](img/16-validation-approach.png){fig-align="center"}
Because the data are split randomly, we will get a ***different estimate of the error rate*** depending on the ***seed*** set in `R` (see the sketch after the coding example below).
![](img/17-validation-approach-results-variation.png){fig-align="center"}
|Main Characteristics|Level|
|:-------------|:---:|
|Accuracy in estimating the testing error|***<span style='color:red;'>Low</span>***|
|Time efficiency|***<span style='color:green;'>High</span>***|
|Proportion of data used to train the models (bias mitigation)|***<span style='color:red;'>Low</span>***|
|Estimation variance|***-***|
#### Coding example
You can learn more about tidymodels and this book in [ISLR tidymodels labs](https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/05-resampling-methods.html) by Emil Hvitfeldt.
```{r}
# Loading functions and data
library(tidymodels)
library(ISLR)
library(data.table)

# Defining the model type to train
LinealRegression <- linear_reg() |>
  set_mode("regression") |>
  set_engine("lm")

# Creating the rsplit object (50/50 split, stratified by mpg)
set.seed(1)
AutoValidationSplit <- initial_split(Auto, strata = mpg, prop = 0.5)
AutoValidationTraining <- training(AutoValidationSplit)
AutoValidationTesting <- testing(AutoValidationSplit)

# Fitting polynomials of degree 1 to 10 on the training set and
# collecting the validation RMSE for each degree
lapply(1:10, function(degree){

  recipe_to_apply <-
    recipe(mpg ~ horsepower, data = AutoValidationTraining) |>
    step_poly(horsepower, degree = degree)

  workflow() |>
    add_model(LinealRegression) |>
    add_recipe(recipe_to_apply) |>
    fit(data = AutoValidationTraining) |>
    augment(new_data = AutoValidationTesting) |>
    rmse(truth = mpg, estimate = .pred) |>
    transmute(degree = degree,
              .metric,
              .estimator,
              .estimate)

}) |>
  rbindlist()
```
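To see the seed dependence mentioned above, the same workflow can be repeated over a few different splits. This is a minimal sketch; the helper `validation_rmse()` is hypothetical and simply wraps the steps from the previous chunk for a fixed degree.
```{r}
# Hypothetical helper: one random 50/50 split for a given seed, then the
# validation RMSE of a degree-2 polynomial fit on that split.
validation_rmse <- function(seed, degree = 2) {
  set.seed(seed)
  split <- initial_split(Auto, strata = mpg, prop = 0.5)

  recipe_to_apply <-
    recipe(mpg ~ horsepower, data = training(split)) |>
    step_poly(horsepower, degree = degree)

  workflow() |>
    add_model(LinealRegression) |>
    add_recipe(recipe_to_apply) |>
    fit(data = training(split)) |>
    augment(new_data = testing(split)) |>
    rmse(truth = mpg, estimate = .pred) |>
    transmute(seed = seed, .estimate)
}

# Different seeds lead to noticeably different error estimates
lapply(1:5, validation_rmse) |>
  rbindlist()
```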
### Leave-One-Out Cross-Validation (LOOCV)
The statistical learning method is fit on the $n-1$ ***training observations***, and ***a prediction*** $\hat{y}_1$ is made for the excluded observation to calculate $\text{MSE}_1 = (y_1-\hat{y}_1)^2$. The process is then repeated $n$ times, leaving out a different observation each time, to estimate the ***test error rate***.
![](img/18-Leave-One-Out-CV.png){fig-align="center"}
The ***test error rate*** is reported as the average of the $n$ individual estimates.
$$
\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n\text{MSE}_i
$$
![](img/19-Leave-One-Out-CV-results.png){fig-align="center"}
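To make the formula concrete, here is a direct (and slow) base-R sketch that fits the $n$ models explicitly, one per left-out observation:
```{r}
# Fit n models, each time leaving out observation i, predict the
# left-out point, and average the squared errors to obtain CV_(n).
mse_i <- sapply(seq_len(nrow(Auto)), function(i) {
  fit_i <- lm(mpg ~ horsepower, data = Auto[-i, ])
  (Auto$mpg[i] - predict(fit_i, newdata = Auto[i, ]))^2
})

mean(mse_i)        # CV_(n), the LOOCV estimate of the test MSE
sqrt(mean(mse_i))  # on the RMSE scale
```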
|Main Characteristics|Level|
|:-------------|:---:|
|Accuracy in estimating the testing error|***<span style='color:green;'>High</span>***|
|Time efficiency|***<span style='color:red;'>Low</span>***|
|Proportion of data used to train the models (bias mitigation)|***<span style='color:green;'>High</span>***|
|Estimation variance|***<span style='color:red;'>High</span>***|
#### Amazing shortcuts
There are some cases where we can perform this technique by fitting just one model with all the observations. They are listed in the next table.
|Model|Formula|Description|
|:---:|:-----:|:----------|
|Linear or polynomial regression|$\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{y}_i}{1-h_i} \right)^2$|- $\hat{y}_i$: Refers to the *i*th fitted value from the original least squares fit. <br> - $h_i$: Refers to the *leverage* of $x_i$, a measure of how unusual each observation is.|
|Smoothing splines|$\text{RSS}_{cv} (\lambda) = \sum_{i = 1}^n \left[ \frac{y_i - \hat{g}_{\lambda} (x_i)} {1 - \{ \mathbf{S_{\lambda}} \}_{ii}} \right]^2$|- $\hat{g}_\lambda$: Refers to the smoothing spline function fitted to all of the training observations.|
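As a sanity check, the linear-regression shortcut can be computed from a single least squares fit, using `hatvalues()` for the leverages (a minimal base-R sketch):
```{r}
# CV_(n) for a linear model without refitting n models:
# average of ((y_i - yhat_i) / (1 - h_i))^2 over all observations
loocv_shortcut <- function(fit) {
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

fit_lm <- lm(mpg ~ horsepower, data = Auto)
loocv_shortcut(fit_lm)        # LOOCV estimate of the test MSE
sqrt(loocv_shortcut(fit_lm))  # on the RMSE scale
```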
#### Coding example
```{r}
collect_loo_testing_error <- function(formula,
                                      loo_split,
                                      metric_function = rmse,
                                      ...){

  # Validations
  stopifnot("There is no space between y and ~" = formula %like% "[A-Za-z]+ ")
  stopifnot("loo_split must be a data.table object" = is.data.table(loo_split))

  # Extracting the response (left-hand side) variable name
  predictor <- sub(pattern = " .+", replacement = "", formula)
  formula_to_fit <- as.formula(formula)

  # Fitting one model per split, predicting the held-out observation
  # and computing the requested metric over all n predictions
  Results <-
    loo_split[, training(splits[[1L]]), by = "id"
    ][, .(model = .(lm(formula_to_fit, data = .SD))),
      by = "id"
    ][loo_split[, testing(splits[[1L]]), by = "id"],
      on = "id"
    ][, .pred := predict(model[[1L]], newdata = .SD),
      by = "id"
    ][, metric_function(.SD, truth = !!predictor, estimate = .pred, ...) ]

  setDT(Results)

  # Recovering the polynomial degree from the formula text, if present
  if(formula %like% "degree"){
    degree <- gsub(pattern = "[ A-Za-z,=\\~()]", replacement = "", formula)
    Results <-
      Results[, .(degree = degree,
                  .metric,
                  .estimator,
                  .estimate)]
  }

  return(Results)
}

# Creating the rset of leave-one-out resamples
AutoLooSplit <- loo_cv(Auto)

# Transforming to data.table
setDT(AutoLooSplit)

# Estimating the LOOCV RMSE for polynomial degrees 1 to 10
paste0("mpg ~ poly(horsepower, degree=", 1:10, ")") |>
  lapply(collect_loo_testing_error,
         loo_split = AutoLooSplit) |>
  rbindlist()
```
### k-Fold Cross-Validation
Involves randomly dividing the set of observations into $k$ groups, or folds, of approximately equal size. The first fold is treated as a validation set, the method is fit on the remaining $k-1$ folds, and the procedure is repeated $k$ times, each time treating a different fold as the validation set.
![](img/20-k-Fold-Cross-Validation.png){fig-align="center"}
The ***test error rate*** is reported as the average of the $k$ resulting estimates.
$$
\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^k\text{MSE}_i
$$
![](img/21-k-Fold-Cross-Validation-Results.png){fig-align="center"}
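A minimal sketch of this computation with tidymodels: `fit_resamples()` fits the model on each of the $k$ analysis sets and `collect_metrics()` reports the average of the per-fold metrics.
```{r}
# 10-fold CV estimate for a single specification (degree-2 polynomial):
# the reported RMSE is the average of the 10 per-fold values.
set.seed(2)
AutoFolds <- vfold_cv(Auto, v = 10)

workflow() |>
  add_model(LinealRegression) |>
  add_recipe(recipe(mpg ~ horsepower, data = Auto) |>
               step_poly(horsepower, degree = 2)) |>
  fit_resamples(resamples = AutoFolds) |>
  collect_metrics()
```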
|Main Characteristics|Level|
|:-------------------|:---:|
|Accuracy in estimating the testing error|***<span style='color:green;'>High</span>***|
|Time efficiency|***Regular***|
|Proportion of data used to train the models (bias mitigation)|***Regular***|
|Estimation variance|***Regular***|
_According to **Hands-on Machine Learning with R**, as $k$ gets larger, the difference between the estimated performance and the true performance seen on the test set will decrease._
**Note**: The book recommends using $k = 5$ or $k = 10$.
#### Coding example
```{r}
# Recipe with a tunable polynomial degree
AutoKFoldRecipe <-
  recipe(mpg ~ horsepower, data = Auto) |>
  step_poly(horsepower, degree = tune())

# Tuning the degree with 10-fold cross-validation
AutoTuneReponse <-
  workflow() |>
  add_recipe(AutoKFoldRecipe) |>
  add_model(LinealRegression) |>
  tune_grid(resamples = vfold_cv(Auto, v = 10),
            grid = tibble(degree = seq(1, 10)))

show_best(AutoTuneReponse, metric = "rmse")
autoplot(AutoTuneReponse)
```
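Once the degree with the lowest cross-validated RMSE is chosen, the tuned value can be plugged back into the workflow and the final model refit on the full data set (a minimal sketch):
```{r}
# Selecting the best degree found by 10-fold CV and refitting on all the data
BestDegree <- select_best(AutoTuneReponse, metric = "rmse")

workflow() |>
  add_recipe(AutoKFoldRecipe) |>
  add_model(LinealRegression) |>
  finalize_workflow(BestDegree) |>
  fit(data = Auto)
```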
### LOOCV vs 10-Fold CV accuracy
- **True test MSE**: Blue solid line
- **LOOCV**: Black dashed line
- **10-fold CV**: Orange solid line
![](img/22-k-fold-cv-vs-LOOCV-simulation.png){fig-align="center"}
## Bootstrap
By taking many samples from a population we can obtain the `Sampling Distribution` of a value, but in many cases we can only obtain a single sample from the population. In these cases, we can resample the data `with replacement` to generate many samples from that one sample, creating a `Bootstrap Distribution` of the value.
![](img/23-bootstrap-example.png){fig-align="center"}
As you can see below, the center of the *Bootstrap Distribution* most of the time **differs from the center** of the *Sampling Distribution*, but it is very accurate at estimating the **dispersion of the value**.
![](img/24-Sampling-Bootstrap-distributions.png){fig-align="center"}
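Before the tidymodels version below, the core idea fits in a few lines of base R. This is a minimal sketch, assuming we want the standard error of the sample mean of `mpg`:
```{r}
# Resample the observed values with replacement many times and use the
# spread of the resampled statistic as its estimated standard error.
set.seed(3)
boot_means <- replicate(1000, mean(sample(Auto$mpg, replace = TRUE)))

sd(boot_means)                   # bootstrap estimate of SE(mean)
sd(Auto$mpg) / sqrt(nrow(Auto))  # formula-based SE, for comparison
```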
### Coding example
```{r}
# Creating 500 bootstrap resamples of the Auto data
AutoBootstraps <- bootstraps(Auto, times = 500)

# Fitting the linear regression on each resample and returning the coefficients
boot.fn <- function(split) {
  LinealRegression |>
    fit(mpg ~ horsepower, data = analysis(split)) |>
    tidy()
}

# Summarizing the bootstrap distribution of each coefficient
AutoBootstraps |>
  mutate(models = map(splits, boot.fn)) |>
  unnest(cols = c(models)) |>
  group_by(term) |>
  summarise(low = quantile(estimate, 0.025),
            mean = mean(estimate),
            high = quantile(estimate, 0.975),
            sd = sd(estimate))
```
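For comparison, the standard errors implied by the usual least squares formulas come from a single fit (a minimal sketch; the bootstrap standard deviations above should be close to, but not identical to, these values):
```{r}
# Coefficient table with formula-based standard errors from one fit
LinealRegression |>
  fit(mpg ~ horsepower, data = Auto) |>
  tidy()
```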