---
format:
html:
number-depth: 3
css: summary-format.css
---
# Resampling methods
## Introduction
*Resampling methods* are **indispensable** for obtaining additional **information** about a `fitted model`.
In this chapter, we will explore the following methods:
- **Cross-validation**
- Used to estimate the `test error` in order to evaluate a ***model's performance*** (*model assessment*).
  - Used to find the `location` of the *minimum point in the test error curve*, either to choose between ***several models*** or to find the best ***level of flexibility*** for a single model (*model selection*).
- **Bootstrap**
  - Estimates the ***uncertainty*** associated with a given value or statistical learning method. For example, it can estimate the `standard errors` of the coefficients from a linear regression fit.
## Cross-Validation
### Training error limitations
The **training error** tends to underestimate the model's **real (test) error**.
![](img/01-Training-vs-Test-Error.png){fig-align="center"}
To address that problem, we can ***hold out a subset of the training observations*** from the fitting process and use it to estimate the test error.
### Validation Set Approach
***Randomly splits*** the available set of observations into two parts: a ***training set*** and a ***validation set***.
![](img/16-validation-approach.png){fig-align="center"}
Because the data are split randomly, we will get a ***different estimate of the error rate*** depending on the ***seed*** set in `R` (see the sketch after the coding example below).
![](img/17-validation-approach-results-variation.png){fig-align="center"}
|Main Characteristics|Level|
|:-------------|:---:|
|Accuracy in estimating the testing error|***<span style='color:red;'>Low</span>***|
|Time efficiency|***<span style='color:green;'>High</span>***|
|Proportion of data used to train the models (bias mitigation)|***<span style='color:red;'>Low</span>***|
|Estimation variance|***-***|
#### Coding example
You can learn more about tidymodels and this book in [ISLR tidymodels labs](https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/05-resampling-methods.html) by Emil Hvitfeldt.
```{r}
# Loading functions and data
library(tidymodels)
library(ISLR)
library(data.table)

# Defining the model type to train
LinealRegression <- linear_reg() |>
  set_mode("regression") |>
  set_engine("lm")

# Creating the rsplit object (50/50 split, stratified by mpg)
set.seed(1)
AutoValidationSplit <- initial_split(Auto, strata = mpg, prop = 0.5)
AutoValidationTraining <- training(AutoValidationSplit)
AutoValidationTesting <- testing(AutoValidationSplit)

# Fitting polynomials of degree 1 to 10 on the training set and
# collecting the validation RMSE for each degree
lapply(1:10, function(degree){

  recipe_to_apply <-
    recipe(mpg ~ horsepower, data = AutoValidationTraining) |>
    step_poly(horsepower, degree = degree)

  workflow() |>
    add_model(LinealRegression) |>
    add_recipe(recipe_to_apply) |>
    fit(data = AutoValidationTraining) |>
    augment(new_data = AutoValidationTesting) |>
    rmse(truth = mpg, estimate = .pred) |>
    transmute(degree = degree,
              .metric,
              .estimator,
              .estimate)

}) |>
  rbindlist()
```
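To see the seed dependence mentioned above, the same workflow can be repeated over a few different splits. This is a minimal sketch; the helper `validation_rmse()` is hypothetical and simply wraps the steps from the previous chunk for a fixed degree.
```{r}
# Hypothetical helper: one random 50/50 split for a given seed, then the
# validation RMSE of a degree-2 polynomial fit on that split.
validation_rmse <- function(seed, degree = 2) {
  set.seed(seed)
  split <- initial_split(Auto, strata = mpg, prop = 0.5)

  recipe_to_apply <-
    recipe(mpg ~ horsepower, data = training(split)) |>
    step_poly(horsepower, degree = degree)

  workflow() |>
    add_model(LinealRegression) |>
    add_recipe(recipe_to_apply) |>
    fit(data = training(split)) |>
    augment(new_data = testing(split)) |>
    rmse(truth = mpg, estimate = .pred) |>
    transmute(seed = seed, .estimate)
}

# Different seeds lead to noticeably different error estimates
lapply(1:5, validation_rmse) |>
  rbindlist()
```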
### Leave-One-Out Cross-Validation (LOOCV)
The statistical learning method is fit on the $n-1$ ***training observations***, and ***a prediction*** $\hat{y}_1$ is made for the excluded observation to calculate $\text{MSE}_1 = (y_1-\hat{y}_1)^2$. The process is then repeated $n$ times, leaving out a different observation each time, to estimate the ***test error rate***.
![](img/18-Leave-One-Out-CV.png){fig-align="center"}
The ***test error rate*** is reported as the average of the $n$ individual estimates.
$$
\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n\text{MSE}_i
$$
![](img/19-Leave-One-Out-CV-results.png){fig-align="center"}
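To make the formula concrete, here is a direct (and slow) base-R sketch that fits the $n$ models explicitly, one per left-out observation:
```{r}
# Fit n models, each time leaving out observation i, predict the
# left-out point, and average the squared errors to obtain CV_(n).
mse_i <- sapply(seq_len(nrow(Auto)), function(i) {
  fit_i <- lm(mpg ~ horsepower, data = Auto[-i, ])
  (Auto$mpg[i] - predict(fit_i, newdata = Auto[i, ]))^2
})

mean(mse_i)        # CV_(n), the LOOCV estimate of the test MSE
sqrt(mean(mse_i))  # on the RMSE scale
```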
|Main Characteristics|Level|
|:-------------|:---:|
|Accuracy in estimating the testing error|***<span style='color:green;'>High</span>***|
|Time efficiency|***<span style='color:red;'>Low</span>***|
|Proportion of data used to train the models (bias mitigation)|***<span style='color:green;'>High</span>***|
|Estimation variance|***<span style='color:red;'>High</span>***|
#### Amazing shortcuts
There are some cases where we can perform this technique by fitting just one model with all the observations. They are listed in the next table.
|Model|Formula|Description|
|:---:|:-----:|:----------|
|Linear or polynomial regression|$\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{y}_i}{1-h_i} \right)^2$|- $\hat{y}_i$: Refers to the *i*th fitted value from the original least squares fit. <br> - $h_i$: Refers to the *leverage* of $x_i$, a measure of how unusual each observation is.|
|Smoothing splines|$\text{RSS}_{cv} (\lambda) = \sum_{i = 1}^n \left[ \frac{y_i - \hat{g}_{\lambda} (x_i)} {1 - \{ \mathbf{S_{\lambda}} \}_{ii}} \right]^2$|- $\hat{g}_\lambda$: Refers to the smoothing spline function fitted to all of the training observations.|
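As a sanity check, the linear-regression shortcut can be computed from a single least squares fit, using `hatvalues()` for the leverages (a minimal base-R sketch):
```{r}
# CV_(n) for a linear model without refitting n models:
# average of ((y_i - yhat_i) / (1 - h_i))^2 over all observations
loocv_shortcut <- function(fit) {
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

fit_lm <- lm(mpg ~ horsepower, data = Auto)
loocv_shortcut(fit_lm)        # LOOCV estimate of the test MSE
sqrt(loocv_shortcut(fit_lm))  # on the RMSE scale
```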
#### Coding example
```{r}
collect_loo_testing_error <- function(formula,
                                      loo_split,
                                      metric_function = rmse,
                                      ...){

  # Validations
  stopifnot("There is no space between y and ~" = formula %like% "[A-Za-z]+ ")
  stopifnot("loo_split must be a data.table object" = is.data.table(loo_split))

  # Extracting the response (left-hand side) variable name
  predictor <- sub(pattern = " .+", replacement = "", formula)
  formula_to_fit <- as.formula(formula)

  # Fitting one model per split, predicting the held-out observation
  # and computing the requested metric over all n predictions
  Results <-
    loo_split[, training(splits[[1L]]), by = "id"
    ][, .(model = .(lm(formula_to_fit, data = .SD))),
      by = "id"
    ][loo_split[, testing(splits[[1L]]), by = "id"],
      on = "id"
    ][, .pred := predict(model[[1L]], newdata = .SD),
      by = "id"
    ][, metric_function(.SD, truth = !!predictor, estimate = .pred, ...) ]

  setDT(Results)

  # Recovering the polynomial degree from the formula text, if present
  if(formula %like% "degree"){
    degree <- gsub(pattern = "[ A-Za-z,=\\~()]", replacement = "", formula)
    Results <-
      Results[, .(degree = degree,
                  .metric,
                  .estimator,
                  .estimate)]
  }

  return(Results)
}

# Creating the rset of leave-one-out resamples
AutoLooSplit <- loo_cv(Auto)

# Transforming to data.table
setDT(AutoLooSplit)

# Estimating the LOOCV RMSE for polynomial degrees 1 to 10
paste0("mpg ~ poly(horsepower, degree=", 1:10, ")") |>
  lapply(collect_loo_testing_error,
         loo_split = AutoLooSplit) |>
  rbindlist()
```
### k-Fold Cross-Validation
Involves randomly dividing the set of observations into $k$ groups, or folds, of approximately equal size. The first fold is treated as a validation set, the method is fit on the remaining $k-1$ folds, and the procedure is repeated $k$ times, each time treating a different fold as the validation set.
![](img/20-k-Fold-Cross-Validation.png){fig-align="center"}
The ***test error rate*** is reported as the average of the $k$ resulting estimates.
$$
\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^k\text{MSE}_i
$$
![](img/21-k-Fold-Cross-Validation-Results.png){fig-align="center"}
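A minimal sketch of this computation with tidymodels: `fit_resamples()` fits the model on each of the $k$ analysis sets and `collect_metrics()` reports the average of the per-fold metrics.
```{r}
# 10-fold CV estimate for a single specification (degree-2 polynomial):
# the reported RMSE is the average of the 10 per-fold values.
set.seed(2)
AutoFolds <- vfold_cv(Auto, v = 10)

workflow() |>
  add_model(LinealRegression) |>
  add_recipe(recipe(mpg ~ horsepower, data = Auto) |>
               step_poly(horsepower, degree = 2)) |>
  fit_resamples(resamples = AutoFolds) |>
  collect_metrics()
```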
|Main Characteristics|Level|
|:-------------------|:---:|
|Accuracy in estimating the testing error|***<span style='color:green;'>High</span>***|
|Time efficiency|***Regular***|
|Proportion of data used to train the models (bias mitigation)|***Regular***|
|Estimation variance|***Regular***|
_According to **Hands-on Machine Learning with R**, as $k$ gets larger, the difference between the estimated performance and the true performance seen on the test set will decrease._
**Note**: The book recommends using $k = 5$ or $k = 10$.
#### Coding example
```{r}
# Recipe with a tunable polynomial degree
AutoKFoldRecipe <-
  recipe(mpg ~ horsepower, data = Auto) |>
  step_poly(horsepower, degree = tune())

# Tuning the degree with 10-fold cross-validation
AutoTuneReponse <-
  workflow() |>
  add_recipe(AutoKFoldRecipe) |>
  add_model(LinealRegression) |>
  tune_grid(resamples = vfold_cv(Auto, v = 10),
            grid = tibble(degree = seq(1, 10)))

show_best(AutoTuneReponse, metric = "rmse")
autoplot(AutoTuneReponse)
```
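Once the degree with the lowest cross-validated RMSE is chosen, the tuned value can be plugged back into the workflow and the final model refit on the full data set (a minimal sketch):
```{r}
# Selecting the best degree found by 10-fold CV and refitting on all the data
BestDegree <- select_best(AutoTuneReponse, metric = "rmse")

workflow() |>
  add_recipe(AutoKFoldRecipe) |>
  add_model(LinealRegression) |>
  finalize_workflow(BestDegree) |>
  fit(data = Auto)
```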
### LOOCV vs 10-Fold CV accuracy
- **True test MSE**: Blue solid line
- **LOOCV**: Black dashed line
- **10-fold CV**: Orange solid line
![](img/22-k-fold-cv-vs-LOOCV-simulation.png){fig-align="center"}
## Bootstrap
By taking many samples from a population we can obtain the `Sampling Distribution` of a value, but in many cases we can only obtain a single sample from the population. In these cases, we can resample the data `with replacement` to generate many samples from that one sample, creating a `Bootstrap Distribution` of the value.
![](img/23-bootstrap-example.png){fig-align="center"}
As you can see below, the center of the *Bootstrap Distribution* most of the time **differs from the center** of the *Sampling Distribution*, but it is very accurate at estimating the **dispersion of the value**.
![](img/24-Sampling-Bootstrap-distributions.png){fig-align="center"}
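Before the tidymodels version below, the core idea fits in a few lines of base R. This is a minimal sketch, assuming we want the standard error of the sample mean of `mpg`:
```{r}
# Resample the observed values with replacement many times and use the
# spread of the resampled statistic as its estimated standard error.
set.seed(3)
boot_means <- replicate(1000, mean(sample(Auto$mpg, replace = TRUE)))

sd(boot_means)                   # bootstrap estimate of SE(mean)
sd(Auto$mpg) / sqrt(nrow(Auto))  # formula-based SE, for comparison
```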
### Coding example
```{r}
# Creating 500 bootstrap resamples of the Auto data
AutoBootstraps <- bootstraps(Auto, times = 500)

# Fitting the linear regression on each resample and returning the coefficients
boot.fn <- function(split) {
  LinealRegression |>
    fit(mpg ~ horsepower, data = analysis(split)) |>
    tidy()
}

# Summarizing the bootstrap distribution of each coefficient
AutoBootstraps |>
  mutate(models = map(splits, boot.fn)) |>
  unnest(cols = c(models)) |>
  group_by(term) |>
  summarise(low = quantile(estimate, 0.025),
            mean = mean(estimate),
            high = quantile(estimate, 0.975),
            sd = sd(estimate))
```
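For comparison, the standard errors implied by the usual least squares formulas come from a single fit (a minimal sketch; the bootstrap standard deviations above should be close to, but not identical to, these values):
```{r}
# Coefficient table with formula-based standard errors from one fit
LinealRegression |>
  fit(mpg ~ horsepower, data = Auto) |>
  tidy()
```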