---
format:
html:
number-depth: 3
css: summary-format.css
---
# Non-parametric Methods
## Smoothing splines
Smoothing splines arise from minimizing a residual sum of squares criterion subject to a smoothness penalty.
In this method, rather than simply minimizing $\text{RSS} = \sum_{i=1}^n (y_i - g(x_i))^2$, we look for a function $g(x)$, known as a *smoothing spline*, that minimizes the following expression, which depends on a *nonnegative tuning parameter* $\lambda$.
$$
\underbrace{\sum_{i=1}^n (y_i - g(x_i))^2 }_{\text{loss function (data fitting)}} +
\underbrace{\lambda \int g''(t)^2dt}_{\text{penalty term (g variability)}}
$$
The second derivative of $g(t)$ measures how **wiggly** the function is near $t$; its value is $0$ when the function is a *straight line*, since a line is *perfectly smooth*. We can then use the *integral* to measure the total change in $g'(t)$ over its *entire range*. As a consequence, **the larger the value of** $\mathbf{\lambda}$**, the smoother** $\mathbf{g}$ **will be**.
::: {.callout-note}
The function $g(x)$ that minimizes this criterion is a **natural cubic spline** with knots at $x_1, \dots ,x_n$. As a result, the *effective degrees of freedom* ($df_{\lambda} = \sum_{i=1}^n \{ \mathbf{S}_{\lambda} \}_{ii}$) decrease from $n$ to $2$ as $\lambda$ increases.
:::
We can use *cross-validation* to find the value of $\lambda$ that minimizes the cross-validated **RSS**. It turns out that the **leave-one-out cross-validation** (LOOCV) error can be computed very efficiently
for smoothing splines, at essentially the same cost as **computing a single fit**, using the following formula:
$$
\text{RSS}_{cv} (\lambda) = \sum_{i = 1}^n (y_i - \hat{g}_\lambda^{(-i)} (x_i))^2 =
\sum_{i = 1}^n \left[ \frac{y_i - \hat{g}_{\lambda} (x_i)}
{1 - \{ \mathbf{S_{\lambda}} \}_{ii}}
\right]^2
$$
Where:
- $\hat{g}_\lambda^{(-i)}$: The function fitted without the *i*th observation $(x_i, y_i)$.
- $\hat{g}_\lambda$: The smoothing spline fitted to all of the training observations.
![](img/41-smoothing-splines-example.png){fig-align="center"}
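As a minimal sketch (the simulated data below are an assumption for illustration), base R's `smooth.spline()` fits this model and selects $\lambda$ by LOOCV when `cv = TRUE`:
```{r}
#| eval: false
# Simulated data for illustration only
set.seed(123)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

# cv = TRUE selects lambda by ordinary leave-one-out cross-validation
fit_ss <- smooth.spline(x, y, cv = TRUE)

# Effective degrees of freedom and the selected smoothing parameter
fit_ss$df
fit_ss$lambda

# Plot the data and the fitted smoothing spline
plot(x, y, col = "grey")
lines(predict(fit_ss, seq(0, 10, length.out = 200)), col = "blue", lwd = 2)
```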
## Local regression
Local regression computes the fit at a target point $x_0$ using only the nearby training observations:
1. Gather the fraction $s=k/n$ of training points closest to $x_0$ (known as the *span*). This step is very important as it controls the flexibility of the fit, and the span can be selected using *cross-validation*.
2. Assign a weight $K_{i0} = K(x_i, x_0)$ to each selected point based on its distance to $x_0$: the smaller the distance, the larger the weight.
3. Find the coefficients that minimize the *weighted least squares* criterion for the current $x_0$ value.
$$
\sum_{i=1}^n K_{i0}(y_i - \beta_0 - \beta_1x_i)^2
$$
4. Calculate the fitted value at $x_0$ using $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1x_0$.
In the next illustration we can see how the model works with some simulated data.
![](img/42-local_regression.png){fig-align="center"}
::: {.callout-note}
It performs poorly when we have more than 3 or 4 predictors in our model.
:::
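A minimal sketch with base R's `loess()` (the simulated data and span value are assumptions, not from the text); `span` plays the role of $s$ and `degree = 1` gives local linear fits:
```{r}
#| eval: false
# Simulated data for illustration only
set.seed(123)
df <- data.frame(x = runif(300, 0, 10))
df$y <- sin(df$x) + rnorm(300, sd = 0.3)

# span = fraction of points used around each target value x0
fit_lo <- loess(y ~ x, data = df, span = 0.3, degree = 1)

# Fitted values over a grid of target points
grid <- data.frame(x = seq(0, 10, length.out = 100))
grid$pred <- predict(fit_lo, newdata = grid)
head(grid)
```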
## Multivariate Adaptive Regression Splines (MARS)
<br>
<br>
|Linear models|Nonlinear models|
|:------------|:---------------|
|If you know **a priori** the nature of a nonlinear relationship you can **manually adapt the model** to account for the nonlinear patterns by including *polynomial terms* or *step functions*|The model finds the important nonlinear interactions present in the data on its own|
### Extending linear models
- **Polynomial regression**: It extends the *linear model* by adding extra predictors, obtained by *raising each of the original predictors to a power*. Generally speaking, it is unusual to use a $d$ greater than 3 or 4 because, as $d$ grows, the fitted function easily becomes **overly flexible** and oddly shaped, and it tends to increase the presence of **multicollinearity**.
$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d + \epsilon_i
$$
- **Piecewise constant regression**: It's a **step function** that breaks the range of $X$ into bins and fits a simple constant (e.g., the mean response) in each bin.
If we define the cutpoints $c_1, c_2, \dots, c_K$ in the range of $X$, we can create *dummy variables* to represent each range. For example, if $c_1 \leq x_i < c_2$ is `TRUE` then $C_1(x_i) = 1$, and we repeat that process for each observation and range. As a result, we can fit a *linear regression* on the new variables (a short sketch of both approaches follows the figure below).
$$
y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i) + \epsilon_i
$$
![](img/109-nonlinear-comparisons.png){fig-align="center"}
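As a minimal sketch of both extensions (the simulated data, degree, and number of bins are illustrative assumptions), base R's `poly()` and `cut()` build the polynomial basis and the bin dummies $C_1(x), \dots, C_K(x)$:
```{r}
#| eval: false
# Simulated data for illustration only
set.seed(123)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.3)

# Polynomial regression with a degree-3 basis
fit_poly <- lm(y ~ poly(x, degree = 3))

# Piecewise constant regression: cut() creates the bin dummy variables
fit_step <- lm(y ~ cut(x, breaks = 4))

summary(fit_poly)$coefficients
summary(fit_step)$coefficients
```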
### Explaining the model
MARS is an algorithm that **automatically creates a piecewise linear model**, offering a natural way to capture nonlinear relationships once you have grasped the concept of multiple linear regression.
It first searches for the **single point across the range of $X$ values** where two different linear relationships between $Y$ and $X$ achieve the smallest error (e.g., the sum of squared errors). The result is known as a hinge function $h(x-a)$, where $a$ is the **cutpoint value** (*knot*).
For example, let's use $1.183606$ as our first knot.
$$
y =
\begin{cases}
\beta_0 + \beta_1(1.183606 - x) & x < 1.183606, \\
\beta_0 + \beta_1(x - 1.183606) & x > 1.183606 \\
\end{cases}
$$
![](img/110-mars-one-knot.png){fig-align="center"}
Once the first knot has been found, the search continues for a second knot.
$$
y =
\begin{cases}
\beta_0 + \beta_1(1.183606 - x) & x < 1.183606, \\
\beta_0 + \beta_1(x - 1.183606) & x > 1.183606 \quad \& \quad x < 4.898114 \\
\beta_0 + \beta_1(4.898114 - x) & x > 4.898114
\end{cases}
$$
![](img/111-mars-two-knots.png){fig-align="center"}
This procedure continues until the change in $R^2$ is less than 0.001.
![](img/112-mars-3-to-4-knots.png){fig-align="center"}
Then the model starts the **pruning** process, which consists of using **cross-validation** to remove knots that do not contribute significantly to predictive accuracy. To be more specific, the package used in `R` performs **generalized cross-validation**, a **shortcut** for linear models that produces an **approximate leave-one-out cross-validation** error metric.
### Loading prerequisites
We can fit a direct engine MARS model with the `earth` (Enhanced Adaptive Regression Through Hinges) package, as "MARS" is trademarked and licensed exclusively to *Salford Systems* and cannot be used for *competing software solutions*.
#### Libraries to use
```{r, warning=FALSE, message=FALSE}
# Helper packages for data wrangling and awesome plotting
library(dplyr)
library(recipes)
library(ggplot2)
# Modeling packages for fitting MARS models
# and automating the tuning process
library(earth)
library(caret)
library(rsample)
library(parsnip)
# Model interpretability packages
# for variable importance and relationships
library(vip)
library(pdp)
```
#### Data to use
```{r}
set.seed(123)
ames_split <-
initial_split(
AmesHousing::make_ames(),
prop = 0.7,
strata = "Sale_Price"
)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
```
### Explaining the model's summary
Let's explain the summary of a trained model.
```{r}
mars1 <-
earth(Sale_Price ~ ., data = ames_train)
mars1
```
1. The number of terms corresponds to the number of coefficients used by the model.
```{r}
coef(mars1) |> length()
```
```{r}
summary(mars1)$coefficients |> head(7)
```
2. The number of predictors is counted after transforming all factors into dummy variables.
```{r}
recipe(Sale_Price ~ ., data = ames_train) |>
step_dummy(all_nominal_predictors()) |>
prep(training = ames_train) |>
bake(new_data = NULL) |>
select(-Sale_Price) |>
ncol()
```
3. Shows the path used by `earth` to select the model with different metrics
```{r}
plot(mars1, which = 1)
```
It's also important to point out that this package has the capacity to assess potential interactions between different hinge functions.
```{r}
earth(Sale_Price ~ .,
data = ames_train,
degree = 2) |>
summary() |>
(\(x) x$coefficients)() |>
head(10)
```
### Tuning Process
There are two important tuning parameters associated with our MARS model:
- `degree`: the maximum degree of interactions; there is rarely any benefit in assessing interactions beyond the *3rd degree*.
- `nprune`: the maximum number of terms (including the intercept) in the pruned model; you can start out with *10 evenly spaced values*.
```{r}
# create a tuning grid
hyper_grid <- expand.grid(
degree = 1:3,
nprune = seq(2, 100, length.out = 10) |> floor()
)
head(hyper_grid)
```
#### Caret
```{r, eval=FALSE}
# Cross-validated model
set.seed(123) # for reproducibility
cv_mars <- train(
x = subset(ames_train, select = -Sale_Price),
y = ames_train$Sale_Price,
method = "earth",
metric = "RMSE",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = hyper_grid
)
# View results
cv_mars$bestTune
## nprune degree
## 16 45 2
cv_mars$results |>
filter(nprune == cv_mars$bestTune$nprune,
degree == cv_mars$bestTune$degree)
# degree nprune RMSE Rsquared MAE RMSESD RsquaredSD MAESD
# 1 2 45 23427.47 0.9156561 15767.9 1883.2 0.01365285 794.2688
```
```{r, eval=FALSE}
ggplot(cv_mars)+
theme_light()
```
![](img/113-cv-mars-plot.png){fig-align="center"}
```{r, eval=FALSE}
cv_mars$resample$RMSE |> summary()
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 20735 22053 23189 23427 24994 26067
cv_mars$resample
# RMSE Rsquared MAE Resample
# 1 23243.22 0.9387415 15777.98 Fold04
# 2 23044.17 0.9189506 15277.56 Fold03
# 3 23499.99 0.9205506 16190.29 Fold07
# 4 23135.62 0.9226565 16106.93 Fold01
# 5 25491.41 0.8988816 16255.55 Fold05
# 6 21414.96 0.9202359 15987.68 Fold08
# 7 21722.58 0.9050642 14694.66 Fold02
# 8 26066.88 0.8938272 16635.14 Fold06
# 9 20735.28 0.9274782 14226.06 Fold10
# 10 25920.58 0.9101751 16527.20 Fold09
```
#### Tidymodels
The main benefit of using `tidymodels` to perform 10-fold CV is that you can stratify the folds, which can be very useful taking into consideration that the target variable is *right-skewed*.
```{r, message=FALSE}
ggplot(ames_train, aes(Sale_Price))+
geom_histogram(fill = "blue", alpha = 0.6)+
theme_light()
```
1. Define the model to train
```{r, eval=FALSE}
mars_model <-
mars() |>
set_mode("regression") |>
set_engine("earth") |>
set_args(nprune = tune(),
degree = tune())
```
2. Define the recipe to use
```{r, eval=FALSE}
mars_recipe <-
recipe(Sale_Price ~ ., data = ames_train) |>
step_dummy(all_nominal_predictors()) |>
prep(training = ames_train)
```
3. Create a `workflow` object to join the model and recipe.
```{r, eval=FALSE}
mars_wf <-
workflows::workflow() |>
workflows::add_recipe(mars_recipe) |>
workflows::add_model(mars_model)
```
4. Create the folds with `rsample`
```{r, eval=FALSE}
set.seed(123)
ames_folds <-
vfold_cv(ames_train,
v = 10,
strata = Sale_Price)
```
5. Getting metrics for each resample
```{r, eval=FALSE}
mars_rs_fit <-
mars_wf |>
tune::tune_grid(resamples = ames_folds,
grid = hyper_grid,
metrics = yardstick::metric_set(yardstick::rmse))
```
6. Check the winner's parameters
```{r, eval=FALSE}
mars_rs_fit |>
tune::show_best(metric = 'rmse', n = 3)
# nprune degree .metric .estimator mean n std_err .config
# <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
# 1 2 1 rmse standard 25808. 10 1378. Preprocessor1_Model01
# 2 2 2 rmse standard 25808. 10 1378. Preprocessor1_Model02
# 3 2 3 rmse standard 25808. 10 1378. Preprocessor1_Model03
final_mars_wf <-
mars_rs_fit |>
tune::select_best(metric = 'rmse') |>
tune::finalize_workflow(x = mars_wf)
# ══ Workflow ════════════════════════════════
# Preprocessor: Recipe
# Model: mars()
#
# ── Preprocessor ────────────────────────────
# 1 Recipe Step
#
# • step_dummy()
#
# ── Model ───────────────────────────────────
# MARS Model Specification (regression)
#
# Engine-Specific Arguments:
# nprune = 2
# degree = 1
#
# Computational engine: earth
```
### Feature interpretation
The `vip` package measures importance as the reduction in prediction error, expressed as a proportion of the total reduction, observed as each feature is included in the model (based on *generalized cross-validation*).
```{r, eval=FALSE}
# variable importance plots
p1 <-
vip(cv_mars, num_features = 40, geom = "point", value = "gcv") +
ggtitle("GCV")
p2 <-
vip(cv_mars, num_features = 40, geom = "point", value = "rss") +
ggtitle("RSS")
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
![](img/114-vip-plot.png){fig-align="center"}
However, it does not measure the impact of the particular hinge functions created for a given feature. To see that effect we need to create a partial dependence plot for each predictor.
```{r, eval=FALSE}
# Construct partial dependence plots
p1 <-
partial(cv_mars, pred.var = "Gr_Liv_Area", grid.resolution = 10) |>
autoplot()
p2 <-
partial(cv_mars, pred.var = "Year_Built", grid.resolution = 10) |>
autoplot()
p3 <-
partial(cv_mars, pred.var = c("Gr_Liv_Area", "Year_Built"),
grid.resolution = 10) %>%
plotPartial(levelplot = FALSE, zlab = "yhat", drape = TRUE, colorkey = TRUE,
screen = list(z = -20, x = -60))
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 3)
```
![](img/115-pdp-plot.png){fig-align="center"}
### Final thoughts
Pros:
- Naturally handles *mixed types of predictors* (quantitative and qualitative)
- Requires *minimal feature engineering*
- Performs automated *feature selection*
- Highly *correlated predictors* do not impede predictive accuracy
- Finds the important nonlinear interactions present in the data
Cons:
- Slower to train
- Correlated predictors can make model interpretation difficult
## K-nearest neighbors (KNN)
It uses the principle of nearest neighbors to classify unlabeled examples, using the **Euclidean distance** to measure the distance between the point we want to predict and its $k$ closest neighbors in the training data.
$$
d\left( a,b\right) = \sqrt {\sum _{i=1}^{p} \left( a_{i}-b_{i}\right)^2 }
$$
Unlike parametric models, KNN does not tell us which predictors are important, making it hard to draw inferences from this model.
This method performs worse than a parametric approach as we start adding *noise* predictors. In fact, we can end up in a situation where a given observation has no *nearby neighbors*, known as the **curse of dimensionality**, leading to a very poor prediction of $f(x_{0})$.
### Classifier
The next function estimates the conditional probability for class $j$ as the fraction of points in $N_{0}$ whose response values equal $j$.
$$
\text{Pr}(Y = j|X = x_{0}) = \frac{1}{K}
\displaystyle\sum_{i \in N_{0}} I(y_{i} = j)
$$
- Where
  - $j$: the response value to test
  - $x_{0}$: the test observation
  - $K$: the number of training points considered as neighbors of $x_{0}$; increasing $K$ reduces the model's flexibility
  - $N_{0}$: the $K$ points in the training data that are closest to $x_{0}$
Then KNN classifies the test observation $x_{0}$ to the class with the largest probability.
![](img/08-knn-classifier.png){fig-align="center"}
### Regression
KNN regression estimates $f(x_{0})$ using the average of all the training responses in $N_{0}$.
$$
\hat{f}(x_{0}) = \frac{1}{K}
\displaystyle\sum_{i \in N_{0}} y_{i}
$$
- Where
  - $x_{0}$: the test observation
  - $K$: the number of training points considered as neighbors of $x_{0}$; increasing $K$ reduces the model's flexibility
  - $N_{0}$: the $K$ points in the training data that are closest to $x_{0}$
### Pre-processing
To use this method we need to make sure that all our variables are numeric. If one of our variables is a factor we need to perform a dummy transformation of that variable with the `recipes::step_dummy` function.
On the other hand, as this model uses distances to make predictions, it's important to check that each feature of the input data is measured on **the same range of values**. The `recipes::step_range` function rescales each feature to the $[0, 1]$ range, matching the 0/1 scale of the dummy variables.
$$
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
$$
Another normalization alternative is centering the predictors at $\overline{x} = 0$ with $s = 1$ using the `recipes::step_normalize` function or the `scale()` function, which apply the [z-score normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization).
$$
x' = \frac{x - \mu}{\sigma}
$$
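A minimal sketch of such a preprocessing recipe, reusing the `ames_train` data from the MARS section (the choice of data set is only for illustration):
```{r}
#| eval: false
# Dummy-encode the factor predictors, then rescale everything to [0, 1]
knn_recipe <-
  recipe(Sale_Price ~ ., data = ames_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_range(all_numeric_predictors(), min = 0, max = 1)

knn_recipe |>
  prep() |>
  bake(new_data = NULL) |>
  select(1:5) |>
  head()
```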
### Coding example
To perform **K-Nearest Neighbors** we just need to create the model specification using the **kknn** engine.
```{r}
library(tidymodels)
library(ISLR2)
Smarket_train <-
Smarket %>%
filter(Year != 2005)
Smarket_test <-
Smarket %>%
filter(Year == 2005)
knn_spec <- nearest_neighbor(neighbors = 3) %>%
set_mode("classification") %>%
set_engine("kknn")
SmarketKnnPredictions <-
knn_spec %>%
fit(Direction ~ Lag1 + Lag2, data = Smarket_train) |>
augment(new_data = Smarket_test)
conf_mat(SmarketKnnPredictions, truth = Direction, estimate = .pred_class)
accuracy(SmarketKnnPredictions, truth = Direction, estimate = .pred_class)
```
## Tree-Based Methods
These methods involve **stratifying the predictor space** into a number of *simple regions* and then using the mean or the mode of the response values of the training observations in the region to which a new observation belongs.
![](img/48-Hitters-salary-regression-tree-regions.png){fig-align="center"}
As these results can be summarized in a tree, these types of approaches are known as **decision tree** methods. A tree has some important parts:
- *Terminal nodes* (*leaves*) are represented by $R_1$, $R_2$ and $R_3$.
- *Internal nodes* refers to the points along the tree where the predictor space is split.
- *Branches* refer to the segments of the trees that connect the nodes.
![](img/47-Hitters-salary-regression-tree.png){fig-align="center"}
It's important to take into consideration that the order in which each predictor appears also reflects its level of importance. For example, the number of `Years` has a higher effect on a player's salary than the number of `Hits`.
### Simple trees
#### Advantages and disadvantages
|Advantages|Disadvantages|
|:---------|:------------|
|Simpler to explain than linear regression thanks to its graphical representation|A small change in the data can cause a large change in the final estimated tree|
|It doesn't need much preprocessing as: <br> - It handles qualitative predictors. <br> - It doesn't require feature scaling or normalization. <br> - It can handle missing values and outliers|They aren't very good at predicting results as: <br> - They are prone to overfitting. <br> - They present low accuracy.|
|It can be used for **feature selection** by defining the importance of a feature based on **how early it appears** in the tree and **how often** it is used for splitting|They can be biased towards the majority class in imbalanced datasets|
#### Regression
To create a decision tree we need to find the regions $R_1, \dots, R_j$ that minimize the RSS where $\hat{y}_{R_j}$ represent the mean response for the training observations within the *j*th box:
$$
RSS = \sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2
$$
To define the regions we use **recursive binary splitting**, which consists of selecting the predictor $X_j$ and the cutpoint $s$ that lead to the greatest possible reduction in RSS. Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions. The process continues until a *stopping criterion is reached* (e.g., no region contains more than five observations).
This method is:
- *Top-down*: It begins at the *top of the tree* (where all observations belong to a single region) and then successively splits the predictor space.
- *Greedy*: At each step of the tree-building process, **the best split is made at that particular step**, rather than looking ahead and picking a split that will lead to a better tree in some future step.
As a result, we could end up with a very complex tree that **overfits** the data. To solve this, we need to **prune** the original tree until we get a **subtree** that leads to the lowest test error rate, using the **cost complexity pruning** approach, which creates different trees based on $\alpha$.
$$
\sum_{m=1}^{|T|} \sum_{i: x_i \in R_m} (y_i - \hat{y}_{R_m}) ^2 + \alpha|T|
$$
Where:
- $\alpha$: Tuning parameter in $[0,\infty)$, selected using *k*-fold *cross-validation*
- $|T|$: Number of terminal nodes of the tree $T$.
- $R_m$: The subset of predictor space corresponding to the $m$th terminal node
- $\hat{y}_{R_m}$: Predicted response associated with $R_m$
![](img/49-tree-best-number-leaves.png){fig-align="center"}
#### Classification
For a classification tree, we predict that each observation belongs to the most *commonly occurring class* of training observations in the region to which it belongs.
As we cannot use RSS as a criterion for making the binary splits, the **classification error rate** could be the fraction of the training observations in that region that do not belong to the most common class ($1 - \max_k(\hat{p}_{mk})$), but it turns out that classification error is not sufficiently sensitive for tree-growing, so we use the next metrics instead, as they are more sensitive to **node purity** (*the proportion of the main class in each terminal node*):
|Name|Formula|
|:---|:-----:|
|**Gini index**| $G = \sum_{k = 1}^K \hat{p}_{mk} (1 -\hat{p}_{mk})$|
|**Entropy**| $D = -\sum_{k = 1}^K \hat{p}_{mk} \log \hat{p}_{mk}$|
![](img/50-tree-classification-example.png){fig-align="center"}
#### Coding example
https://app.datacamp.com/learn/tutorials/decision-trees-R
```{r}
# For data manipulation
library(data.table)
# For modeling and visualization
library(tidymodels)
# Load the Boston housing data
Boston <- as.data.table(MASS::Boston)
pillar::glimpse(Boston)
```
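The chunk above only loads the data; a hedged sketch of fitting a pruned regression tree with `tidymodels` follows (the `rpart` engine and the `cost_complexity` value are assumptions, not from the text):
```{r}
#| eval: false
set.seed(123)
boston_split <- initial_split(Boston, prop = 0.7)
boston_train <- training(boston_split)
boston_test <- testing(boston_split)

# cost_complexity plays the role of alpha in cost complexity pruning
tree_spec <-
  decision_tree(cost_complexity = 0.01) |>
  set_mode("regression") |>
  set_engine("rpart")

tree_fit <-
  tree_spec |>
  fit(medv ~ ., data = boston_train)

# Test set RMSE
tree_fit |>
  augment(new_data = boston_test) |>
  rmse(truth = medv, estimate = .pred)
```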
### Bagging (bootstrap aggregation)
As we said before, simple trees have a *high variance* problem, $Var(\hat{f}(x_{0}))$, and **bagging** can help to mitigate it.
Averaging reduces variance: a natural way to *reduce the variance* and *increase the test set accuracy* is to take many **training sets from the population**, build a separate prediction model using each training set, and average the resulting predictions, since for a given set of $n$ **independent observations** $Z_1, \dots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\overline{Z}$ of the observations is $\sigma^2/n$.
![](img/66-bagging-concept.png){fig-align="center"}
As we generally do not have access to multiple training sets, we use the **bootstrap** to take repeated samples from the one training data set, train $B$ **unpruned regression trees**, and **average** the resulting predictions (or select the most commonly occurring class among the $B$ predictions in classification settings).
$$
\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^{*b}(x)
$$
The **number of trees is not critical**, as a large $B$ will not lead to overfitting. In practice, $B = 100$ is often sufficient to achieve good performance.
#### Out-of-bag error estimation
To estimate the test error, as an approximation of *leave-one-out cross-validation* when $B$ is sufficiently large, we can take advantage of the roughly $1/3$ of observations that were **out-of-bag** (OOB) in each resample and predict the response for the $i$th observation using each of the trees in which that observation was OOB.
This yields around $B/3$ predictions for each of the $n$ observations, which we can *average* (or take a *majority vote* over) to calculate the *test error*.
#### Variable importance measures
After using this method, we can no longer represent the statistical learning procedure using a single tree. Instead, we can use the *RSS* (or the *Gini index*) to record the total amount that the RSS is decreased due to splits over a given predictor, averaged (or added) over all $B$ trees, where a large value indicates an important predictor.
![](img/51-bagging-variable-importance.png){fig-align="center"}
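A hedged sketch (an assumption, not from the original text): bagging can be fit as a random forest whose `mtry` equals the number of predictors, and the `ranger` engine reports the OOB error directly. It reuses the `ames_train` data created in the MARS section.
```{r}
#| eval: false
# Bagging = a random forest that considers every predictor at each split
n_pred <- ncol(ames_train) - 1

bag_spec <-
  rand_forest(mtry = n_pred, trees = 100) |>
  set_mode("regression") |>
  set_engine("ranger", importance = "impurity")

bag_fit <-
  bag_spec |>
  fit(Sale_Price ~ ., data = ames_train)

# Out-of-bag mean squared error reported by ranger
extract_fit_engine(bag_fit)$prediction.error
```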
### Random Forests
Predictions from bagged trees have a big problem: they are **highly correlated**, and averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many **uncorrelated** (*independent*) quantities.
To solve this problem, **random forests** provide an improvement over bagged trees by **decorrelating the trees**. As in bagging, we build many trees based on bootstrapped training samples. But when building these decision trees, a *random forest* considers only a random sample of $m \approx \sqrt{p}$ predictors at each split, making the $B$ trees closer to *independent* and hence the average of the resulting trees less variable and more reliable.
![](img/52-random-forest-effect-over-bagging.png){fig-align="center"}
#### Parameters to tune
Random forests have the least variability in their prediction accuracy when tuning, as the **default values tend to produce good results**.
1. **Number of trees** ($B$): A good rule of thumb is to **start with 10 times the number of features** as the error estimate converges after some trees and computation time increases linearly with the number of trees.
![](img/71-ramdom-forest-tuning-number-of-trees.png){fig-align="center"}
2. **Number of predictors** ($m_{try}$): In `ranger`, $m_{try} = \text{floor} \left( \frac{p}{3} \right)$ in regression problems and $m_{try} = \text{floor} \left( \sqrt{p} \right)$ in classification problems, but we can explore the range $[2,p]$.
- With **few** relevant predictors (e.g., noisy data) a **higher number tends to perform better** because it makes it more likely to *select those features with the strongest signal*.
- With **many** relevant predictors a **lower number might perform better**.
![](img/72-ramdom-forest-tuning-number-of-predictors.png){fig-align="center"}
3. **Tree complexity** (*node size*): The default values are $1$ for classification and $5$ for regression, and these tend to produce good results, but:
- If your data has **many noisy predictors** and a **high number of trees**, then performance may improve by **increasing node size** (i.e., decreasing tree depth and complexity).
- If **computation time is a concern** then you can often decrease run time substantially by **increasing the node size**.
Start with three values between 1 and 10 and adjust depending on the impact on accuracy and run time.
![](img/73-ramdom-forest-tuning-node-size.png){fig-align="center"}
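A hedged sketch of tuning `mtry` and `min_n` (node size) with `tidymodels` and the `ranger` engine; the grid values and the 5-fold resampling scheme below are assumptions for illustration.
```{r}
#| eval: false
rf_spec <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 500) |>
  set_mode("regression") |>
  set_engine("ranger")

rf_wf <-
  workflow() |>
  add_formula(Sale_Price ~ .) |>
  add_model(rf_spec)

set.seed(123)
rf_folds <- vfold_cv(ames_train, v = 5, strata = Sale_Price)

# Candidate values for the number of predictors (mtry) and node size (min_n)
rf_grid <- expand.grid(
  mtry  = c(10, 20, 40),
  min_n = c(1, 5, 10)
)

rf_res <-
  rf_wf |>
  tune_grid(resamples = rf_folds,
            grid = rf_grid,
            metrics = metric_set(rmse))

show_best(rf_res, metric = "rmse", n = 3)
```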
### Boosting
Like **bagging**, **boosting** is a general approach that can be applied to many statistical learning methods for regression or classification. Both methods create many models, $\hat{f}^1, \dots, \hat{f}^B$, in order to produce a single prediction.
Boosted trees are built ***sequentially***, using information from prior models and fitting each new model to a modified version of the original data set. To be more specific, we follow these steps:
1. Set $\hat{f}(x) = 0$ and $r_i=y_i$ for all $i$ in the training set.
2. For $b = 1, 2, \dots, B$, repeat the next process:
- Fit a tree $\hat{f}^b$ with $d$ splits ($d + 1$ terminal nodes) to the training data $(X,r)$. In this step it is very important to note that we are fitting the new model to the residuals.
- Update $\hat{f}$ by adding in a shrunken version of the new tree:
$$
\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)
$$
- Update the residuals
$$
r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)
$$
3. Calculate the output of the boosting model by:
$$
\hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^b(x)
$$
As a result, our model *learns slowly* by adding new decision trees to the fitted function and updating the residuals at each step. Since we are going to use many models, each individual model can be small, by using a low $d$ parameter. In general, statistical learning approaches that learn slowly tend to perform well.
#### Parameters to tune
- **Number of trees** ($B$): Unlike bagging and random forests, boosting can overfit if $B$ is too large.
- **Shrinkage** ($\lambda$): Controls the rate at which boosting learns; it should be a small positive value, typically $0.01$ or $0.001$. A very small $\lambda$ can require a very large value of $B$ to achieve good performance.
- **Number of splits or interaction depth** ($d$): It controls the complexity of the boosted ensemble. When $d=1$ each tree is a **stump**, as each term involves only *a single variable*. Sometimes stumps work well and are easier to interpret, but as $d$ increases the *number of variables* used by each model increases, with $d$ as the limit.
![](img/53-boosting-vs-random-forest.png){fig-align="center"}
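A hedged sketch of a boosted tree specification with `tidymodels` and the `xgboost` engine (the parameter values are illustrative assumptions mapping onto $B$, $\lambda$, and $d$); it reuses the `ames_train`/`ames_test` split from the MARS section.
```{r}
#| eval: false
boost_spec <-
  boost_tree(
    trees      = 1000,  # B: number of trees
    learn_rate = 0.01,  # lambda: shrinkage
    tree_depth = 2      # d: interaction depth
  ) |>
  set_mode("regression") |>
  set_engine("xgboost")

boost_fit <-
  boost_spec |>
  fit(Sale_Price ~ ., data = ames_train)

# Test set RMSE
boost_fit |>
  augment(new_data = ames_test) |>
  rmse(truth = Sale_Price, estimate = .pred)
```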
### Bayesian additive regression trees (BART)
This method constructs trees in a random manner as in bagging and random forests, and each tree tries to capture signal not yet accounted for by the current model, as in boosting.
To understand how the method works we need to define some important notation:
- $K$: Number of regression trees. For example $K$ could be $100$
- $B$: Number of iterations. For example $B$ could be $1000$
- $\hat{f}_k^b(x)$: The prediction at $x$ for the $k$th regression tree used in the $b$th iteration
- $\hat{f}^b(x) = \sum_{k=1}^K \hat{f}_k^b(x)$: The sum of the $K$ trees' predictions at the end of the $b$th iteration.
To apply this method we need to follow the below steps:
1. In the first iteration, all trees are initialized to a single root node whose prediction is *the mean of the response values divided by the total number of trees*, so that their sum, $\hat{f}^1(x)$, predicts the mean of $y$.
$$
\hat{f}_k^1(x) = \frac{1}{nK} \sum_{i=1}^n y_i
$$
2. Compute the predictions of the first iteration.
$$
\hat{f}^1(x) = \sum_{k=1}^K \hat{f}_k^1(x) = \frac{1}{n} \sum_{i=1}^n y_i
$$
3. For each of the following iterations $b = 2, \dots, B$.
- Update each tree $k = 1, 2, \dots, K$ by:
- Computing a **partial residual** based on the fits of all trees but the $k$th tree:
$$
r_i = y_i - \sum_{k'<k} \hat{f}_{k'}^b(x_i) - \sum_{k'>k} \hat{f}_{k'}^{b-1}(x_i)
$$
- Based on the *partial residual*, BART **randomly chooses a perturbation of the tree** from the previous iteration $\hat{f}_k^{b-1}$ from a set of possible perturbations (adding branches, pruning branches, or changing the prediction of terminal nodes), favoring ones that improve the fit to the partial residual. This guards against overfitting since it limits how “hard” we fit the data in each iteration.
- Compute $\hat{f}^b(x) = \sum_{k=1}^K \hat{f}_k^b(x)$
4. Compute the *mean* or a *percentile* of the predictions after discarding the first $L$ **burn-in** iterations, which typically do not provide good results. For example, $L$ could be $200$.
$$
\hat{f}(x) = \frac{1}{B-L} \sum_{b=L+1}^B \hat{f}^b(x)
$$
This model works really well even without a tuning process, and the random perturbations protect the method from overfitting as we increase the number of iterations, as we can see in the next chart.
![](img/54-bart-vs-boosting-trees.png){fig-align="center"}
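A hedged sketch (assuming the `dbarts` engine available through `parsnip::bart()`; `trees` maps onto $K$ above, with the remaining priors left at their defaults):
```{r}
#| eval: false
bart_spec <-
  parsnip::bart(trees = 100) |>  # K: number of regression trees
  set_mode("regression") |>
  set_engine("dbarts")

bart_fit <-
  bart_spec |>
  fit(Sale_Price ~ ., data = ames_train)

# Test set RMSE
bart_fit |>
  augment(new_data = ames_test) |>
  rmse(truth = Sale_Price, estimate = .pred)
```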
## Support Vector Machine
### Maximal Margin Classifier *(Hard Margin Classifier)*
In a p-dimensional space, a **hyperplane** is a flat affine subspace of dimension $p − 1$.
$$
f(X)= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = 0
$$
But if a point $X = (X_1, X_2, \dots, X_p)^T$ doesn't satisfy that equation, the point lies on one side or the other of the hyperplane:
- Over the *hyperplane* if $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p > 0$
- Under the *hyperplane* if $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p < 0$
But as we can see below, there are several possible *hyperplanes* that could do the job.
![](img/97-hyperplane.png){fig-align="center"}
To solve this problem, we need to find the **maximal margin hyperplane** by computing the perpendicular distance from each training observation to a given separating hyperplane to select the **farthest _hyperplane_ from the training observations**.
![](img/98-maximal-margin-hyperplane.png){fig-align="center"}
In the last example the two blue points and the purple point that lie on the dashed lines are the **support vectors** as they "support" the maximal margin hyperplane.
The maximal margin hyperplane is the solution to the next optimization problem where the **associated class labels** $y_1, \dots, y_n \in \{-1, 1\}$:
$$
\begin{split}
\underset{\beta_0, \beta_1, \dots, \beta_p}{\text{maximize}} & \; M \\
\text{subject to } &
\begin{cases}
\sum_{j=1}^p \beta_{j}^2 = 1 , \\
y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}) \geq M , \; i = 1, \dots, n
\end{cases}
\end{split}
$$
As a consequence, observations located on the expected side of the hyperplane will satisfy the next condition.
$$
Y_i \times f(X) > 0
$$
This method has two disadvantages:
1. Sometimes there isn't any possible *separating hyperplane*.
![](img/99-no-posible-hyperplane.png){fig-align="center"}
2. Classifying all of the training observations correctly can lead to sensitivity to individual observations, resulting in an overfitted model.
![](img/100-hyperplane-sensitivity.png){fig-align="center"}
### Support Vector Classifiers *(soft margin classifier)*
To solve this problem we need to allow that:
- Some observations will be on the incorrect side of the *margin* like observations 1 and 8.
- Some observations will be on the incorrect side of the *hyperplane* like observations 11 and 12.
![](img/101-soft-margin-classifier.png){fig-align="center"}
To get that result, we need to solve the next problem:
$$
\begin{split}
\underset{\beta_0, \beta_1, \dots, \beta_p, M}{\text{maximize}} & \; M \\
\text{subject to } &
\begin{cases}
\sum_{j=1}^p \beta_{j}^2 = 1 , \\
y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}) \geq M (1 - \epsilon_i), \; i = 1, \dots, n \\
\epsilon_i \geq 0, \\
\sum_{i=1}^n \epsilon_i \leq C
\end{cases}
\end{split}
$$
- Where:
- $M$ is the width of the margin.
- $\epsilon_1, \dots, \epsilon_n$ are slack variables that allow individual observations to be on the wrong side of the margin or the hyperplane.
- If $\epsilon_i = 0$ then the *i*th observation is on the **correct side of the margin**.
- If $\epsilon_i > 0$ then the *i*th observation is on the **wrong side of the margin**.
- If $\epsilon_i > 1$ then the *i*th observation is on the **wrong side of the hyperplane**.
- $C$ is a nonnegative tuning parameter that represents the **budget for the total amount that the margin can be violated** by the n observations. For $C > 0$ no more than $C$ observations can be on the wrong side of the hyperplane.
It's important to point that only observations that either **lie on the margin** or **violate the margin** will **affect the hyperplane**.
![](img/102-effects-of-changing-C.png){fig-align="center"}
### Non-linear boundaries
To extend the *Support Vector Classifier* to non-linear settings we use functions that quantify the similarity of two observations, known as **kernels** $K(x_i, x_{i'})$, and plug them into the **hyperplane** function evaluated over the *support vectors* $\mathcal{S}$.
$$
f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i K(x, x_i)
$$
Depending on the shape we want, there are several types of kernels, and we need to **tune the related hyperparameters**. Each of the next kernels is linked to a function in `tidymodels`, and the documentation of each function lists the parameters to tune for each case.
#### Polynomial (`svm_poly`)
As the degree $d$ increases the fit becomes more non-linear.
$$
K(x_i, x_{i'}) = (1 + \sum_{j=1}^p x_{ij} x_{i'j})^d
$$
#### Radial (`svm_rbf`)
As $\gamma$ increases the fit becomes more non-linear.
$$
K(x_i, x_{i'}) = \exp(-\gamma \sum_{j=1}^p(x_{ij}-x_{i'j})^2)
$$
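A hedged sketch of a radial-kernel specification (the `cost` and `rbf_sigma` tuning placeholders would still need a grid and resamples, as in earlier sections):
```{r}
#| eval: false
svm_rbf_spec <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) |>
  set_mode("classification") |>
  set_engine("kernlab")

svm_rbf_spec
```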
#### Hyperbolic tangent
$$
K(x_i, x_{i'}) = \tanh(k_1||x_i-x_{i'}|| + k_2)
$$
There isn't a dedicated function for this kernel, but we can change the default kernel through the `set_engine` function.
```{r}
#| eval: false
svm_tanh <-
svm_linear() |>
set_mode("classification") |>
set_engine("kernlab",
kernel = "tanhdot")
```
### Extending SVMs to the K-class case
To extend this method we have 2 alternatives:
- **One-Versus-One Classification (OVO)**: It constructs $\left( \begin{array}{c} K \\ 2 \end{array} \right)$ SVMs, each comparing a pair of classes, and tallies the number of times that the test observation is assigned to each of the $K$ classes. The final classification is performed by assigning the test observation to the class to which it was most frequently assigned.
- **One-Versus-All Classification (OVA)**: We fit $K$ SVMs, each time comparing one of the $K$ classes to the remaining $K − 1$ classes. Let $\beta = \beta_{0k}, \beta_{1k}, \dots, \beta_{pk}$ denote the parameters that result from fitting an SVM comparing the $k$th class (coded as $+1$) to the others (coded as $−1$). We assign the observation to the class for which the linear combination of the coefficients and the test observation, $\beta x^*$, is largest.
::: {.callout-tip}
#### Kernlab support many classes
The `kernlab` package supports multiple classes by default, so we don't need to worry about this problem.
:::
### Coding Example
1. Load libraries
```{r}
library(tidymodels)
library(ISLR)
library(kernlab)
theme_set(theme_light())
set.seed(1)
sim_data <- tibble(
x1 = rnorm(40),
x2 = rnorm(40),
y = factor(rep(c(-1, 1), 20))
) %>%
mutate(x1 = ifelse(y == 1, x1 + 1.5, x1),
x2 = ifelse(y == 1, x2 + 1.5, x2))
ggplot(sim_data, aes(x1, x2, color = y)) +
geom_point()
```
2. Define model specification
```{r}
svm_linear_spec <- svm_poly(degree = 1) %>%
set_mode("classification") %>%
# We don't need scaling for now
set_engine("kernlab", scaled = FALSE)
```
3. Fitting the model and checking results
```{r}
svm_linear_fit <- svm_linear_spec %>%
set_args(cost = 10) %>%
fit(y ~ ., data = sim_data)
svm_linear_fit %>%
extract_fit_engine() %>%
plot()
```