# Random Forests {#RF}
```{r rf-ch11-setup, include=FALSE}
# Set the graphical theme
ggplot2::theme_set(ggplot2::theme_light())
# Set global knitr chunk options
knitr::opts_chunk$set(
  cache = TRUE,
  warning = FALSE,
  message = FALSE,
  collapse = TRUE,
  fig.align = "center",
  fig.height = 3.5
)
```
The previous chapters covered models where the algorithm is based on a linear expansion in simple basis functions of the form
$$
\sum_{i=1}^p\beta_ih_i\left(\boldsymbol{x}\right),
(\#eq:linear-combo)
$$
where the $\beta_i$ are unknown coefficients to be estimated and the $h_i\left(\cdot\right)$ are transformations applied to the features $\boldsymbol{x}$. For ordinary linear regression (with or without regularization), these transformations are supplied by the user; hence, these are parametric models. For example, the prediction equation $f\left(\boldsymbol{x}\right) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 + \beta_4x_1^2$ has
$$h_1\left(\boldsymbol{x}\right) = x_1 \quad \text{(main effect)}$$
$$h_2\left(\boldsymbol{x}\right) = x_2 \quad \text{(main effect)}$$
$$h_3\left(\boldsymbol{x}\right) = x_1x_2 \quad \text{(two-way interaction effect)}$$
$$h_4\left(\boldsymbol{x}\right) = x_1^2 \quad \text{(quadratic effect)}$$
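For instance, here is a minimal sketch of supplying these transformations yourself with `lm()`; the data frame `dat` and its columns `y`, `x1`, and `x2` are hypothetical and not part of this chapter's examples:
```r
# A hypothetical parametric fit where the user supplies the basis functions;
# a data frame `dat` with columns y, x1, and x2 is assumed to exist.
fit <- lm(y ~ x1 + x2 + x1:x2 + I(x1^2), data = dat)
```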
MARS, on the other hand, uses a specific algorithm to find suitable transformations automatically; hence, MARS is a nonparametric model.
*Tree-based models* are also nonparametric, but they work very differently. Tree-based models use algorithms to partition the feature space into a number of smaller (non-overlapping) regions based on a set of splitting rules and then fit a simpler model (e.g., a constant) in each region. Such *divide-and-conquer* methods (e.g., a single decision tree) can produce simple rules that are easy to interpret and visualize with *tree diagrams*. Although fitted tree-based models can still be written as linear expansions in simple basis functions (here the basis functions define the feature space partitioning), there is no benefit to doing so since other techniques, like tree diagrams, are better at conveying the information. Simple decision trees typically lack the predictive performance of more complex algorithms like neural networks and MARS. More sophisticated tree-based models, such as random forests and gradient boosting machines, are less interpretable but tend to have very good predictive accuracy. This chapter will get you familiar with the basics behind decision trees and two ways in which to combine them into a more accurate ensemble: bagging and random forests. In the next chapter, we'll cover (stochastic) gradient boosting machines, another powerful way of combining decision trees into a more accurate ensemble.
## Prerequisites
For this chapter we'll use the following packages:
```{r 11-load-pkgs}
library(caret) # for classification and regression training
library(randomForest) # for Breiman and Cutler's random forest
library(ranger) # for fast implementation of random forests
library(rpart) # for fitting CART-like decision trees
library(rpart.plot) # for flexible decision tree diagrams
library(rsample) # for data splitting
library(pdp) # for partial dependence plots
library(vip) # for variable importance plots
```
To illustrate the various concepts we'll use the Ames Housing data (regression); however, at the end of the chapter we'll also fit a random forest model to the employee attrition data (classification).
```{r rf-ch11-data}
# Create training (70%) and test (30%) sets for the AmesHousing::make_ames() data.
# Use set.seed for reproducibility
set.seed(123)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
```
## Decision trees
There are many methodologies for constructing decision trees but the most well-known is the **c**lassification **a**nd **r**egression **t**ree (CART©) algorithm proposed in @breiman2017classification.[^other] Basic decision trees partition the training data into homogeneous subgroups and then fit a simple *constant* for each subgroup (e.g., the mean of the within-group response values for regression). The partitioning is achieved by recursive binary partitions formed by asking yes-or-no questions about each feature (e.g., is `age < 18`?). This is done a number of times until a suitable stopping criterion is satisfied (e.g., a maximum tree depth is reached). After all the partitioning has been done, the model predicts the output based on (1) the average response value for all observations that fall in that subgroup (regression problem), or (2) the class that has majority representation (classification problem). For classification, predicted probabilities can be obtained using the proportion of each class within the subgroups.
What results is an inverted tree-like structure such as that in Figure \@ref(fig:exemplar-decision-tree). In essence, our tree is a set of rules that allows us to traverse the tree based on simple questions about each feature. For example, if the customer is loyal, has household income greater than \$150,000, and is shopping in a store, the exemplar decision tree would predict that the customer will redeem a coupon.
```{r exemplar-decision-tree, echo=FALSE, fig.cap="Exemplar decision tree predicting if a customer will redeem a coupon (yes or no) based on the customer's loyalty, household income, last month's spend, coupon placement, and shopping mode.", out.height="100%", out.width="100%"}
knitr::include_graphics("illustrations/exemplar-decision-tree.png")
```
We refer to the first feature used to split (the top of the tree) as the *root node*, the subsequent features used for splitting as *internal nodes*, the outcome regions (predictions) as *leaves* or *terminal nodes*, and the connections between nodes as *branches*.
```{r decision-tree-terminology, echo=FALSE, fig.cap="Terminology of a decision tree.", out.height="80%", out.width="80%"}
knitr::include_graphics("illustrations/decision-tree-terminology.png")
```
### Partitioning
As illustrated above, CART uses *recursive binary partitioning*. The objective at each step is to find the single feature ($x_i$) and split point that partition the data into two regions ($R_1$ and $R_2$) such that the overall error is minimized (typically, SSE for regression problems, and cross-entropy or the Gini index for classification problems--see Section \@ref(reg-perf-eval)). Having found the best feature/split combination, the data are partitioned into two regions and the splitting process is repeated on each of the two regions (hence the name binary recursive partitioning). This process is continued until some stopping criterion is reached (e.g., a maximum depth is reached or the tree becomes "too complex").
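To make the search concrete, below is a minimal sketch (not **rpart**'s actual implementation) of the exhaustive search for the best split on a single numeric feature using SSE; the function name `best_split` is ours and used only for illustration.
```r
# Exhaustive search for the SSE-minimizing split point on one numeric feature.
best_split <- function(x, y) {
  sse <- function(v) sum((v - mean(v))^2)
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # candidate cut points (midpoints)
  errors <- vapply(cuts, function(cut) {
    sse(y[x < cut]) + sse(y[x >= cut])        # total error of the two regions
  }, numeric(1))
  list(cut = cuts[which.min(errors)], sse = min(errors))
}
# e.g., best_split(mtcars$hp, mtcars$mpg) returns the single best cut on hp
```
A CART-like algorithm simply repeats this search over every feature, keeps the winning feature/split combination, and then recurses on the two resulting regions.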
It's important to note that a single feature can be used multiple times in a tree. For example, say we have data generated from an underlying sine function with errors distributed as $\epsilon_i \stackrel{iid}{\sim} N\left(0, \sigma^2\right)$. A regression tree built with only a single split (often referred to as a *decision stump*) leads to a split occurring at $x = 3.1$.[^stump]
```{r decision-stump, echo=FALSE, fig.width=4, fig.height=3, fig.show='hold', fig.cap="Decision tree illustrating the single split on feature x (left). The resulting decision boundary illustrates the predicted value when x < 3.1 (0.64), and when x > 3.1 (-0.67) (right)."}
# create data
set.seed(1112) # for reproducibility
df <- tibble::tibble(
  x = seq(from = 0, to = 2 * pi, length = 500),
  y = sin(x) + rnorm(length(x), sd = 0.5),
  truth = sin(x)
)
# run decision stump model
library(rpart)
ctrl <- list(cp = 0, minbucket = 5, maxdepth = 1)
fit <- rpart(y ~ x, data = df, control = ctrl)
# plot tree
library(rpart.plot)
par(mar = c(1, 1, 1, 1))
rpart.plot(fit)
# plot decision boundary
library(ggplot2)
library(dplyr)
df %>%
  mutate(pred = predict(fit, df)) %>%
  ggplot(aes(x, y)) +
  geom_point(alpha = .2, size = 1) +
  geom_line(aes(x, y = truth), color = "blue", size = .75) +
  geom_line(aes(y = pred), color = "red", size = .75) +
  geom_segment(x = 3.1, xend = 3.1, y = -Inf, yend = -.95,
               arrow = arrow(length = unit(0.25, "cm")), size = .25) +
  annotate("text", x = 3.1, y = -Inf, label = "split", hjust = 1.2, vjust = -1, size = 3) +
  geom_segment(x = 5.5, xend = 6, y = 2, yend = 2, size = .75, color = "blue") +
  geom_segment(x = 5.5, xend = 6, y = 1.7, yend = 1.7, size = .75, color = "red") +
  annotate("text", x = 5.3, y = 2, label = "truth", hjust = 1, size = 3, color = "blue") +
  annotate("text", x = 5.3, y = 1.7, label = "decision boundary", hjust = 1, size = 3, color = "red")
```
If we build a deeper tree, we will continue to split on the same feature ($x$) as illustrated in Figure \@ref(fig:depth-3-decision-tree). In this example, $x$ is the only feature available to split on, so the algorithm will continue finding the optimal splits along this feature's values until a pre-determined stopping criterion is reached.
```{r depth-3-decision-tree, echo=FALSE, fig.width=4, fig.height=3, fig.show='hold', fig.cap="Decision tree illustrating with depth = 3, resulting in 7 decision splits along values of feature x and 8 prediction regions (left). The resulting decision boundary (right)."}
# fit depth 3 decision tree
ctrl <- list(cp = 0, minbucket = 5, maxdepth = 3)
fit <- rpart(y ~ x, data = df, control = ctrl)
rpart.plot(fit)
# plot decision boundary
df %>%
  mutate(pred = predict(fit, df)) %>%
  ggplot(aes(x, y)) +
  geom_point(alpha = .2, size = 1) +
  geom_line(aes(x, y = truth), color = "blue", size = .75) +
  geom_line(aes(y = pred), color = "red", size = .75)
```
However, even when many features are available, the same behavior occurs if a single feature repeatedly provides the best splitting criterion. For example, a decision tree applied to the iris data set [@fisher1936use], where the species of the flower (setosa, versicolor, and virginica) is predicted based on two features (sepal width and sepal length), results in an optimal decision tree with two splits on each feature. Also, note how the decision boundary in a classification problem results in rectangular regions enclosing the observations.
```{r iris-decision-tree, echo=FALSE, fig.width=4, fig.height=3, fig.show='hold', fig.cap="Decision tree for the iris classification problem (left). The decision boundary results in rectangular regions that enclose the observations. The class with the highest proportion in each region is the predicted value (right)."}
# decision tree
iris_fit <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris)
rpart.plot(iris_fit)
# decision boundary
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species, shape = Species)) +
  geom_point(show.legend = FALSE) +
  annotate("rect", xmin = -Inf, xmax = 5.44, ymin = 2.8, ymax = Inf, alpha = .75, fill = "orange") +
  annotate("text", x = 4.0, y = 4.4, label = "setosa", hjust = 0, size = 3) +
  annotate("rect", xmin = -Inf, xmax = 5.44, ymin = 2.79, ymax = -Inf, alpha = .75, fill = "grey") +
  annotate("text", x = 4.0, y = 2, label = "versicolor", hjust = 0, size = 3) +
  annotate("rect", xmin = 5.45, xmax = 6.15, ymin = 3.1, ymax = Inf, alpha = .75, fill = "orange") +
  annotate("text", x = 6, y = 4.4, label = "setosa", hjust = 1, vjust = 0, size = 3) +
  annotate("rect", xmin = 5.45, xmax = 6.15, ymin = 3.09, ymax = -Inf, alpha = .75, fill = "grey") +
  annotate("text", x = 6.15, y = 2, label = "versicolor", hjust = 1, vjust = 0, size = 3) +
  annotate("rect", xmin = 6.16, xmax = Inf, ymin = -Inf, ymax = Inf, alpha = .75, fill = "green") +
  annotate("text", x = 8, y = 2, label = "virginica", hjust = 1, vjust = 0, size = 3)
```
### Minimizing overfitting
This leads to an important question: how deep (i.e., how complex) should we grow a tree? If we keep splitting until every terminal node contains only a handful of observations, the tree will fit the training data almost perfectly but will chase the noise, as illustrated below.
```{r deep-overfit-tree, echo=FALSE, fig.width=4, fig.height=3, fig.show='hold', fig.cap="An overgrown decision tree with no minimum node size and a maximum depth of 50 (left). The resulting decision boundary chases the noise in the training data, a clear sign of overfitting (right)."}
ctrl <- list(cp = 0, minbucket = 1, maxdepth = 50)
fit <- rpart(y ~ x, data = df, control = ctrl)
rpart.plot(fit)
df %>%
  mutate(pred = predict(fit, df)) %>%
  ggplot(aes(x, y)) +
  geom_point(alpha = .2, size = 1) +
  geom_line(aes(x, y = truth), color = "blue", size = 0.75) +
  geom_line(aes(y = pred), color = "red", size = 0.75)
```
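One common remedy, touched on again later in this chapter, is *cost-complexity pruning*: grow a deep tree and then use cross-validation to snip it back. Below is a minimal sketch using **rpart**'s built-in cross-validation on the simulated `df` data from above; treat it as an illustration rather than a definitive tuning recipe.
```r
# Grow an intentionally deep tree, then prune back to the cp value with the
# lowest cross-validated error (rpart stores the CV results in cptable).
deep_fit <- rpart(y ~ x, data = df, control = rpart.control(cp = 0))
plotcp(deep_fit)  # CV error as a function of the complexity parameter cp
best_cp <- deep_fit$cptable[which.min(deep_fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(deep_fit, cp = best_cp)
```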
### A simple regression tree example
For example, suppose we want to use a decision tree to predict the miles per gallon a car will average (`mpg`) based on two features: the number of cylinders (`cyl`) and the horsepower (`hp`); such data are available in the `mtcars` data frame which is part of the standard **datasets** package. A (regression) tree is built using the **rpart** package [@R-rpart] (see the code chunk below and Figure \@ref(fig:rf-decision-tree-example), which was produced using the **rpart.plot** package [@R-rpart.plot]), and all the training data are passed down this tree. Whenever an observation reaches a particular node in the tree (i.e., a yes-or-no question about one of the features; in this case, `cyl` or `hp`), it proceeds either to the left (if the answer is "yes") or to the right (if the answer is "no"). This tree has only three terminal nodes. To start traversing the tree, all observations that have 6 or 8 cylinders go to the left branch; all other observations proceed to the right branch. Next, the left branch is further partitioned by `hp`: observations with 6 or 8 cylinders and `hp` $\ge 192.5$ proceed to the left, while those with `hp` $< 192.5$ proceed to the right. These branches lead to *terminal nodes* or *leaves* which contain the predicted response value; in this case, the average `mpg` of the training cars that fall within that terminal node. In short, cars with fewer than 5 cylinders (region $R_1$) average 27 mpg, cars with `cyl` $\ge 5$ and `hp` $< 192.5$ (region $R_2$) average 18 mpg, and all other cars (region $R_3$) in the training data average 13 mpg. This is summarized in the tree diagram in Figure \@ref(fig:rf-decision-tree-example).
```{r rf-decision-tree-example, echo=TRUE, fig.cap="Using a CART-like decision tree to predict `mpg` based on `cyl` and `hp`.", fig.width=6, fig.asp=0.618, out.width="70%", eval=FALSE}
tree <- rpart(mpg ~ cyl + hp, data = mtcars) # CART-like regression tree
rpart.plot(tree) # tree diagram
```
Using a linear combination of simple basis functions, we can write the predicted `mpg` as
$$
\widehat{mpg} = 27 \cdot I\left(cyl < 5\right) + 18 \cdot I\left(cyl \ge 5 \text{ and } hp < 192.5 \right) + 13 \cdot I\left(cyl \ge 5 \text{ and } hp \ge 192.5\right),
$$
where $I\left(\cdot\right)$ is an *indicator function* that evaluates to one if its argument is true and zero otherwise. The coefficients (i.e., 27, 18, and 13) correspond to the average response value within the respective region. While this simple example illustrates that decision trees estimate a model of the same form as Equation \@ref(eq:linear-combo), the results are more easily interpreted in the form of a tree diagram like the one in Figure \@ref(fig:rf-decision-tree-example).
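As a quick sanity check, the sketch below evaluates this indicator-function form by hand and compares it to the fitted tree's predictions; note that 27, 18, and 13 are the rounded node means displayed in the diagram, so we round the `rpart` predictions for comparison.
```r
# Hand-coded basis-function form of the tree (using the rounded node means)
# versus the predictions from the fitted `tree` object above.
manual_pred <- with(mtcars,
  27 * (cyl < 5) +
  18 * (cyl >= 5 & hp < 192.5) +
  13 * (cyl >= 5 & hp >= 192.5)
)
head(cbind(manual = manual_pred, rpart = round(predict(tree, mtcars))))
```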
### Deciding on splits
**FIXME:** Continue with the previous example?
How do decision trees partition the data into non-overlapping regions? In particular, how did the tree in Figure \@ref(fig:rf-decision-tree-example) decide which variables to split on and which split points to use? The answer depends on the tree algorithm used and on whether the context is classification or regression. For CART-like decision trees (like those discussed in this book and implemented in **rpart**), the partitioning of the feature space is done in a top-down, *greedy* fashion. This means that any partition in the tree depends on the previous partitions. But how are these partitions made? The algorithm begins with the entire training data set and searches every distinct value of every input variable to find the "best" feature/split combination that partitions the data into two regions ($R_1$ and $R_2$) such that the overall error is minimized (typically, SSE for regression problems, and cross-entropy or the Gini index for classification problems--see Section \@ref(reg-perf-eval)). Having found the best feature/split combination, the data are partitioned into two regions and the splitting process is repeated on each of the two regions (hence the name *binary recursive partitioning*). This process is continued until some stopping criterion is reached (e.g., a maximum depth is reached or the tree becomes "too complex"). What results is, typically, a very deep, complex tree that may produce good predictions on the training set, but is likely to overfit the training data, leading to poor generalization performance on new, unseen data.
To illustrate, we print the structure of the previous regression tree below.
```{r rf-mtcars-partitioning, eval=FALSE}
print(tree)
```
The root node refers to all of the training data. In this node, the mean response is `r mean(mtcars$mpg)` and the SSE is
```{r rf-mtcars-sse-01, eval=FALSE}
sse <- function(x) sum((x - mean(x))^2)
sse(mtcars$mpg) # see the row labeled 1)
```
The feature/split combination that gives the largest reduction in this SSE is defined by the regions `cyl>=5` and `cyl< 5`. The resulting SSEs within these regions are
```{r rf-mtcars-sse-02, eval=FALSE}
sse(mtcars[mtcars$cyl >= 5, ]$mpg) # see the row labeled 2)
sse(mtcars[mtcars$cyl < 5, ]$mpg) # see the row labeled 3)
```
The algorithm then further split the region defined by `cyl>=5` into two sub-regions. The feature/split combination that gave the biggest reduction in that node's SSE (i.e., 198.47240) is defined by the regions `hp>=192.5` and `hp< 192.5`. The resulting SSEs within these regions are
```{r rf-mtcars-sse-03, eval=FALSE}
sse(mtcars[mtcars$cyl >= 5 & mtcars$hp >= 192.5, ]$mpg) # see the row labeled 4)
sse(mtcars[mtcars$cyl >= 5 & mtcars$hp < 192.5, ]$mpg) # see the row labeled 5)
```
At this point, a stopping criterion was reached (in this case, the minimum number of observations that must exist in a node in order for a split to be attempted) and the tree algorithm stopped partitioning the data. For a list of all the stopping criteria, see `?rpart::rpart.control`.
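For reference, here is a sketch of the most commonly adjusted stopping criteria spelled out explicitly; the values shown are **rpart**'s defaults (so this call should be equivalent to the earlier `rpart(mpg ~ cyl + hp, data = mtcars)` fit), not recommendations.
```r
# The same tree as before, with rpart's default stopping criteria made explicit.
tree_default <- rpart(
  mpg ~ cyl + hp, data = mtcars,
  control = rpart.control(
    minsplit  = 20,    # smallest node size eligible for a split
    minbucket = 7,     # smallest allowable terminal node (default: minsplit / 3)
    maxdepth  = 30,    # maximum depth of any node of the final tree
    cp        = 0.01   # minimum relative improvement required to keep a split
  )
)
```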
<!-- For example, using the well-known Boston housing data [@harrison1978hedonic], three decision trees are created based on three different samples of the data. You can see that the first few partitions are fairly similar at the top of each tree; however, they tend to differ substantially closer to the terminal nodes. These deeper nodes tend to overfit to specific attributes of the training data; consequently, slightly different samples will result in highly variable predicted values in the terminal nodes. For this reason, CART-like decision tree algorithms are often considered high variance (i.e., noisy) models fitting procedures. -->
<!-- **FIXME:** Brad, can we replace this with the code that produced the results; either hidden or shown? -->
<!-- ```{r rf-tree-variance-image, echo=FALSE, fig.cap="Three decision trees based on slightly different samples.", out.height="70%", out.width="70%"} -->
<!-- knitr::include_graphics("illustrations/tree-variance-1.svg") -->
<!-- Below is code to reproduce the images. May need to do some trimming of the image as I think I cropped it for aesthetic purposes. -->
```{r tree-correlation, message=FALSE, warning=FALSE, fig.align='center', fig.cap="Six decision trees based on different bootstrap samples.", echo=FALSE, dev='png', eval=FALSE}
iter <- 6
par(mfrow = c(3, 3))
for (i in 1:iter) {
  set.seed(i + 30)
  # create train/test sets
  train_index <- caret::createDataPartition(pdp::boston$cmedv, p = .6333,
                                            list = FALSE, times = 1)
  train_DF <- pdp::boston[train_index, ]
  validate_DF <- pdp::boston[-train_index, ]
  train_y <- train_DF$cmedv
  train_x <- train_DF[, setdiff(names(train_DF), "cmedv")]
  validate_y <- validate_DF$cmedv
  validate_x <- validate_DF[, setdiff(names(validate_DF), "cmedv")]
  d_tree <- rpart::rpart(cmedv ~ ., train_DF)
  # graphs
  rpart.plot::rpart.plot(d_tree, main = paste0("Decision Tree ", i), type = 0, extra = 0)
}
```
<!-- ``` -->
## Bagging
The major drawback of decision trees is that they are not typically as accurate as current state-of-the-art ML algorithms like neural networks. However, decision trees do have a number of desirable properties which are listed in Table X below. The idea behind the ensembles of decision trees discussed in this chapter and the next is to improve the predictive performance of trees while retaining most of the other properties.
<table>
<caption>Table X. A comparison of binary recursive partitioning and neural networks.</caption>
<thead>
<tr>
<th style="text-align:right;"> Property </th>
<th style="text-align:right;"> Recursive partitioning </th>
<th style="text-align:right;"> Neural networks </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> Naturally handles numeric and categorical variables </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Naturally handles missing values </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Robust to features with outliers </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Insensitive to monotone
transformations of features </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Computational scalability </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Ability to deal with irrelevant
inputs </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Interpretability </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Predictive accuracy </td>
<td style="text-align:right;"> `r emo::ji("warning")` </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
</tr>
</tbody>
</table>
As discussed in @friedman2001elements, the key to accuracy is low bias and low variance. Trees are naturally high-variance models; a small change in the training data can lead to substantially different trees. Although *pruning* the tree (i.e., determining a nested sequence of subtrees by recursively snipping off the least important splits) helps reduce the variance[^prune] of a single tree (i.e., by helping to avoid overfitting), there are methods that exploit this variance in a way that can significantly improve performance over and above that of single trees. ***B**ootstrap **agg**regating* (bagging) [@breiman1996bagging] is one such approach.
Bagging creates an ensemble^[Combining multiple models is referred to as *ensembling*.] of decision trees with low bias and high variance. The bias of each tree is kept at a minimum by constructing overgrown decision trees (i.e., no pruning). In other words, the tree models constructed in bagging intentionally overfit the training data. The variance of the final predictions is reduced by averaging (regression) or popular vote (classification). In regression, for example, the prediction for a new observation is obtained by averaging the predictions from each individual tree. To construct a bagged model of size $B$, we follow three simple steps (a hand-rolled sketch follows the list):
1. Create $B$ bootstrap replicates of the original training data (Section \@ref(bootstrap)) by selecting rows with replacement
2. For each bootstrap sample, train a single, unpruned decision tree
3. Average individual predictions from each tree to create an overall average predicted value (regression) or use a popular vote (classification)
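To make these steps concrete, here is a hand-rolled sketch of bagging regression trees on the Boston housing data used later in this section; in practice you would rely on a dedicated implementation (e.g., **ipred** or a random forest with `mtry` set to the number of features) rather than this loop.
```r
# Hand-rolled bagging of B = 100 deep regression trees (illustration only).
B <- 100
boston <- pdp::boston
set.seed(123)
preds <- sapply(seq_len(B), function(b) {
  boot_ids <- sample(nrow(boston), replace = TRUE)                 # step 1
  tree_b <- rpart(cmedv ~ ., data = boston[boot_ids, ],
                  control = rpart.control(cp = 0, minbucket = 1))  # step 2: unpruned tree
  predict(tree_b, newdata = boston)                                # per-tree predictions
})
bagged_pred <- rowMeans(preds)                                     # step 3: average over trees
```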
```{r rf-bagging-image, echo=FALSE, fig.cap="The bagging process.", out.height="70%", out.width="70%", eval=FALSE}
knitr::include_graphics("illustrations/bagging.png")
```
Technically, bagging is a general algorithm that can be applied to any regression or classification algorithm; however, it provides the greatest improvement for models that are adaptive and have high variance. More stable models, such as linear regression and MARS, tend to experience less improvement in predictive performance from bagging.
Although bagging trees can help reduce the variance of a single tree's prediction and improve predictive performance, the trees in bagging are not completely independent of each other since all the original predictors are considered at every split of every tree. Rather, trees from different bootstrap samples typically have similar structure to each other (especially at the top of the tree) because the strongest relationships tend to dominate the early splits.
For example, if we create six decision trees with different bootstrapped samples of the Boston housing data, we see that the tops of the trees all have a very similar structure. Although there are 15 predictor variables to split on, all six trees have both `lstat` and `rm` driving the first few splits. This *between-tree correlation* limits the variance-reducing effect of averaging across the ensemble. In order to reduce the variance further, we need to take an additional step to help de-correlate the trees in the ensemble.
```{r, echo=FALSE, fig.width=6, fig.asp=0.618, out.width="100%", fig.cap="Six bagged decision trees fit to the Boston housing data set.", eval=FALSE}
# knitr::include_graphics("illustrations/tree-correlation-1.png")
set.seed(1859)
par(mfrow = c(2, 3))
for (i in 1:6) {
  train <- pdp::boston[sample(nrow(pdp::boston), replace = TRUE), ]
  rpart.plot(rpart(cmedv ~ ., data = train))
}
```
## Random forests
Random forests are an extension of bagging that injects more randomness into the tree-growing process. Random forests achieve this in two ways:
1. __Bootstrap__: similar to bagging, each tree is grown on a bootstrap resampled data set, which makes the trees different and _somewhat_ decorrelates them.
2. __Split-variable randomization__: each time a split is to be performed, the search for the split variable is limited to a random subset of $m$ of the $p$ variables. Typical default values are $m = \frac{p}{3}$ (regression) and $m = \sqrt{p}$ (classification), but $m$ should be treated as a tuning parameter (see the quick check below). When $m = p$, the randomization amounts to using only step 1 and is the same as *bagging*.
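For the Ames training data, these common starting points work out as follows (a quick sketch; treat them as values to tune around rather than fixed rules):
```r
# Typical default values of m (mtry) for the Ames data with p features.
p <- length(setdiff(names(ames_train), "Sale_Price"))  # number of features
floor(p / 3)    # common regression default
floor(sqrt(p))  # common classification default
```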
The basic algorithm for a random forest model can be generalized to the following:
```r
1. Given training data set
2. Select number of trees to build (ntrees)
3. for i = 1 to ntrees do
4. | Generate a bootstrap sample of the original data
5. | Grow a regression or classification tree to the bootstrapped data
6. | for each split do
7. | | Select m variables at random from all p variables
8. | | Pick the best variable/split-point among the m
9. | | Split the node into two child nodes
10. | end
11. | Use typical tree model stopping criteria to determine when a tree is complete (but do not prune)
12. end
```
Since the algorithm randomly selects a bootstrap sample to train on ___and___ predictors to use at each split, tree correlation will be lessened beyond bagged trees.
### OOB error vs. test set error
One benefit of bagging (and thus also of random forests) is that, on average, a bootstrap sample will contain about 63% of the training data. This leaves roughly 37% of the data out of the bootstrap sample; we call this the out-of-bag (OOB) sample. We can use the OOB observations to estimate the model's accuracy, creating a natural validation process without sacrificing any of the training data for a holdout set. This makes identifying the number of trees required to stabilize the error rate during tuning more efficient; however, as illustrated below, some differences between the OOB error and the test error are expected.
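The ~63% figure follows from the bootstrap itself: each observation has probability $1 - \left(1 - 1/n\right)^n \approx 1 - e^{-1} \approx 0.632$ of appearing in a given bootstrap sample. A quick check on the Ames training data:
```r
# Probability that a given observation is included in a bootstrap sample.
n <- nrow(ames_train)
1 - (1 - 1 / n)^n  # approximately 0.632 for reasonably large n
```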
```{r rf-oob-example-image, echo=FALSE, fig.cap="Random forest out-of-bag error versus validation error.", out.height="70%", out.width="70%", eval=FALSE}
knitr::include_graphics("illustrations/oob-error-compare-1.svg")
```
Furthermore, many packages do not keep track of which observations were part of the OOB sample for a given tree and which were not. If you are comparing multiple models to one another, you'd want to score each on the same validation set to compare performance. Also, although it is technically possible to compute certain metrics, such as root mean squared logarithmic error (RMSLE), on the OOB sample, such metrics are not built into all packages. So if you are looking to compare multiple models or use a slightly less traditional loss function, you will likely still want to perform cross-validation.
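That said, a metric a package does not report can often be computed by hand from the stored OOB predictions. For example, here is a sketch using the `rf1` __randomForest__ model fit in the next section, whose `predicted` component holds the OOB predictions:
```r
# RMSLE computed from randomForest's OOB predictions (stored in rf1$predicted);
# assumes the rf1 model fit in the next section.
rmsle <- function(truth, pred) sqrt(mean((log1p(pred) - log1p(truth))^2))
rmsle(ames_train$Sale_Price, rf1$predicted)
```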
## Fitting a basic random forest model
There are over 20 random forest packages in R.[^task] To demonstrate the basic implementation we illustrate the use of the __randomForest__ package [@R-randomForest], the oldest and most well known implementation of the Random Forest algorithm in R.
```{block, type="tip"}
However, as your data set grows in size, __randomForest__ does not scale well (although you can parallelize with __foreach__).
```
`randomForest()` can use the formula or separate x, y matrix notation for specifying our model. Below we apply the default __randomForest__ model using the formulaic specification. The default random forest grows 500 trees and randomly selects $\lfloor p / 3 \rfloor = 26$ predictor variables as split candidates at each split. Averaging across all 500 trees provides an OOB $MSE = 661089658$ ($RMSE = \$25,711$).
```{r rf-basic-model, eval=FALSE}
# for reproduciblity
set.seed(123)
# default RF model
rf1 <- randomForest(
  formula = Sale_Price ~ .,
  data = ames_train
)
rf1
```
Plotting the model shows the OOB error rate as we average across more trees: the error rate stabilizes at around 100 trees but continues to decrease slowly until around 300 or so trees.
```{r rf-basic-model-plot, fig.width=5, fig.height=3.5, fig.cap="OOB error (MSE) as a function of the number of trees. We see the MSE reduces quickly for the first 100 trees and then slowly thereafter. We want to make sure that we are providing enough trees so that our OOB error has stabilized or flatlined.", eval=FALSE}
plot(rf1)
```
The plotted error rate above is based on the OOB sample error and can be accessed directly via `rf1$mse`. Thus, we can find which number of trees provides the lowest error rate: 447 trees, corresponding to an average home sales price error of \$25,649.
```{r rf-basic-model-best, eval=FALSE}
# number of trees with lowest MSE
which.min(rf1$mse)
# RMSE of this optimal random forest
sqrt(rf1$mse[which.min(rf1$mse)])
```
Random forests are one of the best "out-of-the-box" machine learning algorithms. They typically perform remarkably well with very little tuning required. As illustrated above, we were able to get an RMSE of \$25,649 without any tuning, which is nearly as good as the best, fully tuned model we've explored thus far. However, we can still seek improvement by tuning hyperparameters in our random forest model.
## Tuning
Compared to the algorithms explored in the previous chapters, random forests have more hyperparameters to tune. However, compared to gradient boosting machines and neural networks, which we explore in future chapters, random forests are much easier to tune. Typically, the primary concern when starting out is tuning the number of candidate variables to select from at each split.
```{block, type="tip"}
The two primary tuning parameters you should always tune in a random forest model are:
1. The number of trees, to ensure you use enough trees to minimize and stabilize the error rate.
2. The number of candidate variables to select from at each split.
```
However, there are a few additional hyperparameters that we should be aware of. Although the argument names may differ across packages, these hyperparameters should be present:
- __Number of trees__: We want enough trees to stabilize the error, but using too many trees is unnecessarily inefficient, especially when using large data sets.
- __Number of variables to randomly sample as candidates at each split__: Commonly referred to as "mtry". When `mtry` $=p$ the model equates to bagging. When `mtry` $=1$ the split variable is completely random, so all variables get a chance but can lead to overly biased results. A common suggestion is to start with 5 values evenly spaced across the range from 2 to *p*.
- __Sample size to train on__: The default value is 63.25% of the training set since this is the expected value of unique observations in the bootstrap sample. Lower sample sizes can reduce the training time but may introduce more bias than necessary. Increasing the sample size can increase performance but at the risk of overfitting because it introduces more variance. Typically, when tuning this parameter we stay near the 60-80% range.
- __Minimum number of samples within the terminal nodes__: Controls the complexity of the trees. A smaller node size allows for deeper, more complex trees and a larger node size results in shallower trees. This is another bias-variance tradeoff where deeper trees introduce more variance (risk of overfitting) and shallower trees introduce more bias (risk of not fully capturing unique patterns and relationships in the data).
- __Maximum number of terminal nodes__: Another way to control the complexity of the trees. More nodes equates to deeper, more complex trees and fewer nodes result in shallower trees.
- __Split rule__: As stated in the introduction, the most traditional splitting rules are based on minimizing the variance or MSE across the terminal nodes for regression problems and cross-entropy or the Gini index for classification problems. However, additional split rules have been developed that can offer improved predictive accuracy. For example, the extra trees split rule chooses cut-points fully at random and uses the whole learning sample (rather than a bootstrap replica) to grow the trees [@geurts2006extremely].
Tuning a larger set of hyperparameters requires a larger grid search than we've performed thus far. Unfortunately, this is where __randomForest__ becomes quite inefficient since it does not scale well. Instead, we can use __ranger__ [@R-ranger], which is a C++ implementation of Breiman's random forest algorithm and, as the following illustrates, is over 27 times faster than __randomForest__.
```{r rf-randomForest-vs-ranger, eval=FALSE}
# names of features
features <- setdiff(names(ames_train), "Sale_Price")
# randomForest speed
system.time(
  ames_randomForest <- randomForest(
    formula = Sale_Price ~ .,
    data = ames_train,
    ntree = 500,
    mtry = floor(length(features) / 3)
  )
)
# ranger speed
system.time(
  ames_ranger <- ranger(
    formula = Sale_Price ~ .,
    data = ames_train,
    num.trees = 500,
    mtry = floor(length(features) / 3)
  )
)
```
### Tuning via ranger
There are two approaches to tuning a __ranger__ model. The first is to tune __ranger__ manually using a `for` loop. To perform a manual grid search, first we want to construct our grid of hyperparameters. We’re going to search across 48 different models with varying mtry, minimum node size, sample size, and trying different split rules.
```{r rf-tuning-grid1, eval=FALSE}
# create a tuning grid
hyper_grid <- expand.grid(
  mtry = seq(20, 35, by = 5),
  min.node.size = seq(3, 9, by = 3),
  sample.fraction = c(.632, .80),
  splitrule = c("variance", "extratrees"),
  OOB_RMSE = 0
)
dim(hyper_grid)
```
We loop through each hyperparameter combination and apply 500 trees since our previous examples illustrated that 500 was plenty to achieve a stable error rate. Also note that we set the random number generator seed; this allows us to sample the same observations for each sample size, making it easier to see the impact of each change. Our OOB RMSE ranges between roughly 25,900 and 28,500. Our top 10 performing models all have RMSE values right around 26,000, and the results show that models with larger sample sizes (80%) and the variance split rule perform best. However, there is no definitive evidence that certain values of `mtry` or `min.node.size` are better than others.
```{block, type="warning"}
This grid search took 69 seconds to complete.
```
```{r rf-grid-search1, eval=FALSE}
for (i in seq_len(nrow(hyper_grid))) {
  # train model
  model <- ranger(
    formula = Sale_Price ~ .,
    data = ames_train,
    num.trees = 500,
    mtry = hyper_grid$mtry[i],
    min.node.size = hyper_grid$min.node.size[i],
    sample.fraction = hyper_grid$sample.fraction[i],
    splitrule = hyper_grid$splitrule[i],
    seed = 123
  )
  # add OOB error to grid
  hyper_grid$OOB_RMSE[i] <- sqrt(model$prediction.error)
}
hyper_grid %>%
  dplyr::arrange(OOB_RMSE) %>%
  head(10)
```
However, using this approach does not provide us with a cross validated measure of error. To get a k-fold CV error, we would have to expand our `for` loop approach or use an alternative approach. One such approach follows.
### Tuning via caret
The second tuning approach is to use the __caret__ package. __caret__ only allows you to tune some, not all, of the available __ranger__ hyperparameters (`mtry`, `splitrule`, `min.node.size`). However, __caret__ will allow us to get a CV measure of error to compare to our previous models (i.e. regularized regression, MARS). The following creates a similar tuning grid as before but with only those hyperparameters that __caret__ will accept.
```{block, type="tip"}
If you do not know what hyperparameters __caret__ allows you to tune for a specific model you can find that info at [https://topepo.github.io/caret/train-models-by-tag.html](https://topepo.github.io/caret/train-models-by-tag.html) or with `caret::getModelInfo`. For example, we can find the parameters available for a `ranger` with `caret::getModelInfo("ranger")$ranger$parameter`.
```
```{r rf-tuning-grid2, eval=FALSE}
# create a tuning grid
hyper_grid <- expand.grid(
  mtry = seq(20, 35, by = 5),
  min.node.size = seq(3, 9, by = 3),
  splitrule = c("variance", "extratrees")
)
```
Tuning with __caret__ provides similar results as our __ranger__ grid search. Both results suggest `mtry = 20` and `min.node.size = 3`. With __ranger__, our OOB RMSE was 25963.96 and with __caret__ our 10-fold CV RMSE was 25531.47.
```{block, type="warning"}
This grid search took a little over 6 minutes to complete.
```
```{r rf-grid-search2, fig.cap="Cross validated RMSE for the __caret__ grid search.", eval=FALSE}
# cross validated model
tuned_rf <- train(
  x = subset(ames_train, select = -Sale_Price),
  y = ames_train$Sale_Price,
  method = "ranger",
  metric = "RMSE",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = hyper_grid,
  num.trees = 500,
  seed = 123
)
# best model
tuned_rf$bestTune
# plot results
ggplot(tuned_rf)
```
## Feature interpretation
Whereas many of the linear models discussed previously use the standardized coefficients to signal importance, random forests have, historically, applied two different approaches to measure variable importance.
1. __Impurity__: At each split in each tree, compute the improvement in the split criterion (MSE for regression, Gini index for classification). Then average the improvement made by each variable across all the trees in which that variable is used. The variables with the largest average decrease in the error metric are considered most important.
2. __Permutation__: For each tree, the OOB sample is passed down the tree and the prediction accuracy is recorded. Then the values of each variable (one at a time) are randomly permuted and the accuracy is computed again. The decrease in accuracy from randomly "shaking up" the variable values is averaged over all the trees for each variable. The variables with the largest average decrease in accuracy are considered most important. (A hand-rolled sketch of this idea appears after the model fits below.)
To compute these variable importance measures with __ranger__, you must set the `importance` argument.
```{block, type="note"}
Once you’ve identified the optimal parameter values from the grid search, you will want to re-run your model with these hyperparameter values.
```
```{r rf-model-with-vi, eval=FALSE}
# re-run model with impurity-based variable importance
rf_impurity <- ranger(
  formula = Sale_Price ~ .,
  data = ames_train,
  num.trees = 500,
  mtry = 20,
  min.node.size = 3,
  sample.fraction = .80,
  splitrule = "variance",
  importance = "impurity",
  verbose = FALSE,
  seed = 123
)
# re-run model with permutation-based variable importance
rf_permutation <- ranger(
  formula = Sale_Price ~ .,
  data = ames_train,
  num.trees = 500,
  mtry = 20,
  min.node.size = 3,
  sample.fraction = .80,
  splitrule = "variance",
  importance = "permutation",
  verbose = FALSE,
  seed = 123
)
```
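To make the permutation approach concrete, here is a hand-rolled sketch for a single feature, scored on the test set rather than the OOB samples that __ranger__ uses internally:
```r
# Permutation importance of Gr_Liv_Area, measured as the increase in test RMSE
# after shuffling that one column (a simplified, single-feature illustration).
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))
baseline <- rmse(ames_test$Sale_Price, predict(rf_permutation, data = ames_test)$predictions)
set.seed(123)
shuffled <- ames_test
shuffled$Gr_Liv_Area <- sample(shuffled$Gr_Liv_Area)
permuted <- rmse(ames_test$Sale_Price, predict(rf_permutation, data = shuffled)$predictions)
permuted - baseline  # a larger increase in error means a more important feature
```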
For both options, you can directly access the variable importance values with `model_name$variable.importance`. However, here we'll plot the variable importance using the __vip__ package. Typically, you will not see the exact same variable importance order between the two options; however, you will often see similar variables at the top of the plots. Consequently, in this example, there is enough agreement between the two approaches to state that two variables stand out as most influential:
* `Overall_Qual`
* `Gr_Liv_Area`
Looking at the next ~10 variables in both plots, you will also see some commonality in influential variables (i.e. `Garage_Cars`, `Bsmt_Qual`, `Year_Built`, `Exter_Qual`).
```{r rf-vip-plots, fig.width=9, fig.cap="Top 25 most important variables based on impurity (left) and permutation (right).", eval=FALSE}
p1 <- vip(rf_impurity, num_features = 25, bar = FALSE) + ggtitle("Impurity-based variable importance")
p2 <- vip(rf_permutation, num_features = 25, bar = FALSE) + ggtitle("Permutation-based variable importance")
gridExtra::grid.arrange(p1, p2, nrow = 1)
```
To better understand the relationship between these important features and `Sale_Price`, we can create partial dependence plots (PDPs). Recall that in the linear and regularized regression sections, the linear model assumed a continuously increasing relationship between `Gr_Liv_Area` and `Sale_Price`. In the MARS chapter, we saw that as homes exceed 2,945 square feet, each additional square foot demands a higher marginal increase in sale price than homes with less than 2,945 square feet; however, between the knots of a MARS model the relationship remains linear. In contrast, the PDP below displays how random forest models can capture non-linear and non-monotonic relationships between predictors and the target. In this case, `Sale_Price` appears to not be influenced by `Gr_Liv_Area` values below 750 square feet or above 3,500 square feet. This change in relationship was not well captured by the prior parametric models.
```{r rf-pdp-GrLiv-Area, fig.height=3, fig.width=4, fig.cap="The mean predicted sale price as the above ground living area increases.", eval=FALSE}
# partial dependence of Sale_Price on Gr_Liv_Area
rf_impurity %>%
  partial(pred.var = "Gr_Liv_Area", grid.resolution = 50) %>%
  autoplot(rug = TRUE, train = ames_train)
```
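Under the hood, a partial dependence value is just an averaged prediction: fix the feature of interest at a given value for every training observation, predict, and average. A minimal sketch for a single value of `Gr_Liv_Area` follows (the __pdp__ package simply repeats this over a grid of values); the function name `pd_value` is ours and only for illustration.
```r
# Partial dependence of Sale_Price on Gr_Liv_Area at a single grid value.
pd_value <- function(object, train, value) {
  tmp <- train
  tmp$Gr_Liv_Area <- value                       # force the feature to `value`
  mean(predict(object, data = tmp)$predictions)  # average prediction over the training data
}
pd_value(rf_impurity, ames_train, value = 2000)
```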
Additionally, if we assess the relationship between the `Overall_Qual` predictor and `Sale_Price`, we see a continual increase in predicted price as the overall quality increases. This provides a more comprehensive understanding of the relationship than the results we saw in the regularized regression section (\@ref(lm-features)). We see that the largest impact on `Sale_Price` occurs when houses go from "Good" overall quality to "Very Good".
```{r rf-pdp-Overall-Qual, fig.height=3, fig.width=8, fig.cap="The mean predicted sale price for each level of the overall quality variable.", eval=FALSE}
# partial dependence of Sale_Price on Overall_Qual
rf_impurity %>%
  partial(pred.var = "Overall_Qual", train = as.data.frame(ames_train)) %>%
  autoplot()
```
Individual conditional expectation (ICE) curves [@goldstein2015peeking] are an extension of PDPs but, rather than plotting the _average_ marginal effect on the response variable, we plot the change in the predicted response ___for each observation___ as we vary the predictor of interest. The figure below shows the regular ICE curves (left) and the centered ICE curves (right). When the curves have a wide range of intercepts and are consequently "stacked" on each other, heterogeneity in the response due to marginal changes in the predictor of interest can be difficult to discern. Centering the ICE curves can help draw these inferences out and can highlight any strong heterogeneity in our results.
The plots below show that marginal changes in `Gr_Liv_Area` have a fairly homogeneous effect on the response: as `Gr_Liv_Area` increases, the vast majority of observations show a similar increasing effect on the predicted `Sale_Price`, with the primary difference being the magnitude of that increase. However, in the centered ICE plot you see evidence of a few observations that display a different pattern: some have a higher $\hat y$ value when `Gr_Liv_Area` is between 2,500 and 4,000, and some have a decreasing $\hat y$ as `Gr_Liv_Area` increases. These results may be a sign of interaction effects and would be worth exploring more closely.
```{r rf-ice-Gr-Liv-Area, fig.height=3, fig.width=9, fig.cap="Non-centered (left) and centered (right) individual conditional expectation curve plots illustrate how changes in above ground square footage influences predicted sale price for all observations.", eval=FALSE}
# ice curves of Sale_Price on Gr_Liv_Area
ice1 <- rf_impurity %>%
  partial(pred.var = "Gr_Liv_Area", grid.resolution = 50, ice = TRUE) %>%
  autoplot(rug = TRUE, train = ames_train, alpha = 0.1) +
  ggtitle("Non-centered ICE plot")
ice2 <- rf_impurity %>%
  partial(pred.var = "Gr_Liv_Area", grid.resolution = 50, ice = TRUE) %>%
  autoplot(rug = TRUE, train = ames_train, alpha = 0.1, center = TRUE) +
  ggtitle("Centered ICE plot")
gridExtra::grid.arrange(ice1, ice2, nrow = 1)
```
Both PDPs and ICE curves should be assessed for the most influential variables as they help to explain the underlying patterns in the data that the random forest model is picking up.
## Attrition data
With the Ames data, the random forest models obtained predictive accuracy that was close to our best MARS model, but how about the attrition data? The following performs a grid search across 48 hyperparameter combinations.
```{r rf-tuned-rf-attrition, fig.cap="Cross-validated accuracy rate for the 48 different hyperparameter combinations in our grid search. The optimal model uses `mtry` = 9, `splitrule` = gini, and `min.node.size` = 4, which obtained a 10-fold CV accuracy rate of 85.8%.", eval=FALSE}
# get attrition data
df <- rsample::attrition %>% dplyr::mutate_if(is.ordered, factor, ordered = FALSE)
# Create training (70%) and test (30%) sets for the rsample::attrition data.
# Use set.seed for reproducibility
set.seed(123)
churn_split <- initial_split(df, prop = .7, strata = "Attrition")
churn_train <- training(churn_split)
churn_test <- testing(churn_split)
# create a tuning grid
hyper_grid <- expand.grid(
  mtry = seq(3, 18, by = 3),
  min.node.size = seq(1, 10, by = 3),
  splitrule = c("gini", "extratrees")
)
# cross validated model
tuned_rf <- train(
  x = subset(churn_train, select = -Attrition),
  y = churn_train$Attrition,
  method = "ranger",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = hyper_grid,
  num.trees = 500,
  seed = 123
)
# best model
tuned_rf$bestTune
# plot results
ggplot(tuned_rf)
```
Similar to the MARS model, the random forest model does not improve predictive accuracy over and above the regularized regression model.
```{r rf-attrition-modeling-rf, echo=FALSE, eval=FALSE}
# train logistic regression model
glm_mod <- train(
  Attrition ~ .,
  data = churn_train,
  method = "glm",
  family = "binomial",
  preProc = c("zv", "center", "scale"),
  trControl = trainControl(method = "cv", number = 10)
)
# train regularized logistic regression model
penalized_mod <- train(
  Attrition ~ .,
  data = churn_train,
  method = "glmnet",
  family = "binomial",
  preProc = c("zv", "center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)
# train mars model
hyper_grid <- expand.grid(
  degree = 1:3,
  nprune = seq(2, 100, length.out = 10) %>% floor()
)
tuned_mars <- train(
  x = subset(churn_train, select = -Attrition),
  y = churn_train$Attrition,
  method = "earth",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = hyper_grid
)
# extract out of sample performance measures
summary(resamples(list(
  Logistic_model = glm_mod,
  Elastic_net = penalized_mod,
  MARS_model = tuned_mars,
  RF_model = tuned_rf
)))$statistics$Accuracy %>%
  kableExtra::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
```
## Final thoughts
Random forests provide a powerful out-of-the-box algorithm that often has great predictive accuracy. Because they are relatively simple to tune and require very little, if any, feature pre-processing, they are often one of the first go-to algorithms when facing a predictive modeling problem. However, as we illustrated in this chapter, random forests are not guaranteed to improve predictive accuracy over and above linear models and their cousins. The following summarizes some of the advantages and disadvantages of random forest modeling:
__TODO__: may need to better tie in some of these advantages and disadvantages throughout the chapter.
__Advantages:__
- Typically have very good performance.
- Remarkably good "out-of-the box" - very little tuning required.
- Built-in validation set - don't need to sacrifice data for extra validation.
- Adding more trees does not lead to overfitting.
- No data pre-processing required - often works great with categorical and numerical values as is.
- Robust to outliers.
- Handles missing data - imputation not required.
- Provide automatic feature selection.
__Disadvantages:__
- Can become slow on large data sets.
- Although accurate, often cannot compete with the accuracy of advanced boosting algorithms.
- Less interpretable although this is easily addressed with various tools (variable importance, partial dependence plots, LIME, etc.).
## Learning more
The literature on random forests is rich and we have only touched on the fundamentals. To learn more, I would start with the following resources, listed in order of complexity:
* [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)
* [Applied Predictive Modeling](http://appliedpredictivemodeling.com/)
* [Computer Age Statistical Inference](https://www.amazon.com/Computer-Age-Statistical-Inference-Mathematical/dp/1107149894)
* [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/)
[^other]: Other decision tree algorithms include the Iterative Dichotomiser 3
[@quinlan1986induction], C4.5 [@quinlan1996bagging], Chi-square automatic interaction detection [@kass1980exploratory], Conditional inference trees [@hothorn2006unbiased], and more.
[^stump]: This decision tree was built using the **rpart** package [@R-rpart], the tree graphic was produced using the **rpart.plot** package [@R-rpart.plot], and the decision boundary was illustrated with **ggplot2**.
[^task]: See the Random Forest section in the [Machine Learning Task View](https://CRAN.R-project.org/view=MachineLearning) on CRAN and Erin LeDell's [useR! Machine Learning Tutorial](https://koalaverse.github.io/machine-learning-in-R/random-forest.html#random-forest-software-in-r) for a non-comprehensive list.
[^prune]: See @esposito1997comparative for various methods of pruning. In **rpart**, the amount of pruning is controlled by the *complexity parameter* `cp`.