# Random Forests {#RF}
```{r rf-ch11-setup, include=FALSE}
# Set the graphical theme
ggplot2::theme_set(ggplot2::theme_light())
# Set global knitr chunk options
knitr::opts_chunk$set(
  cache = TRUE,
  warning = FALSE,
  message = FALSE,
  collapse = TRUE,
  fig.align = "center",
  fig.height = 3.5
)
```
The previous chapters covered models where the algorithm is based on a linear expansion in simple basis functions of the form
$$
\sum_{i=1}^p\beta_ih_i\left(\boldsymbol{x}\right),
(\#eq:linear-combo)
$$
where the $\beta_i$ are unknown coefficients to be estimated and the $h_i\left(\cdot\right)$ are transformations applied to the features $\boldsymbol{x}$. For ordinary linear regression (with or without regularization), these transformations are supplied by the user; hence, these are parametric models. For example, the prediction equation $f\left(\boldsymbol{x}\right) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 + \beta_4x_1^2$ has
$$h_1\left(\boldsymbol{x}\right) = x_1 \quad \text{(main effect)}$$
$$h_2\left(\boldsymbol{x}\right) = x_2 \quad \text{(main effect)}$$
$$h_3\left(\boldsymbol{x}\right) = x_1x_2 \quad \text{(two-way interaction effect)}$$
$$h_4\left(\boldsymbol{x}\right) = x_1^2 \quad \text{(quadratic effect)}$$
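For instance, here is a minimal sketch of supplying these transformations yourself with `lm()`; the data frame `dat` and its columns `y`, `x1`, and `x2` are hypothetical and not part of this chapter's examples:
```r
# A hypothetical parametric fit where the user supplies the basis functions;
# a data frame `dat` with columns y, x1, and x2 is assumed to exist.
fit <- lm(y ~ x1 + x2 + x1:x2 + I(x1^2), data = dat)
```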
MARS, on the other hand, uses a specific algorithm to find suitable transformations automatically; hence, MARS is a nonparametric model.
*Tree-based models* are also nonparametric, but they work very differently. Tree-based models use algorithms to partition the feature space into a number of smaller (non-overlapping) regions based on a set of splitting rules and then fit a simpler model (e.g., a constant) in each region. Such *divide-and-conquer* methods (e.g., a single decision tree) can produce simple rules that are easy to interpret and visualize with *tree diagrams*. Although fitted tree-based models can still be written as linear expansions in simple basis functions (here the basis functions define the feature space partitioning), there is no benefit to doing so since other techniques, like tree diagrams, are better at conveying the information. Simple decision trees typically lack the predictive performance of more complex algorithms like neural networks and MARS. More sophisticated tree-based models, such as random forests and gradient boosting machines, are less interpretable but tend to have very good predictive accuracy. This chapter will get you familiar with the basics behind decision trees and two ways in which to combine them into a more accurate ensemble: bagging and random forests. In the next chapter, we'll cover (stochastic) gradient boosting machines, another powerful way of combining decision trees into a more accurate ensemble.
## Prerequisites
For this chapter we'll use the following packages:
```{r 11-load-pkgs}
library(caret) # for classification and regression training
library(randomForest) # for Breiman and Cutler's random forest
library(ranger) # for fast implementation of random forests
library(rpart) # for fitting CART-like decision trees
library(rpart.plot) # for flexible decision tree diagrams
library(rsample) # for data splitting
library(pdp) # for partial dependence plots
library(vip) # for variable importance plots
```
To illustrate the various concepts we'll use the Ames Housing data (regression); however, at the end of the chapter we'll also fit a random forest model to the employee attrition data (classification).
```{r rf-ch11-data}
# Create training (70%) and test (30%) sets for the AmesHousing::make_ames() data.
# Use set.seed for reproducibility
set.seed(123)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
```
## Decision trees
There are many methodologies for constructing decision trees but the most well-known is the **c**lassification **a**nd **r**egression **t**ree (CART©) algorithm proposed in @breiman2017classification.[^other] Basic decision trees partition the training data into homogeneous subgroups and then fit a simple *constant* for each subgroup (e.g., the mean of the within-group response values for regression). The partitioning is achieved by recursive binary partitions formed by asking yes-or-no questions about each feature (e.g., is `age < 18`?). This is done a number of times until a suitable stopping criterion is satisfied (e.g., a maximum tree depth is reached). After all the partitioning has been done, the model predicts the output based on (1) the average response value for all observations that fall in that subgroup (regression problem), or (2) the class that has majority representation (classification problem). For classification, predicted probabilities can be obtained using the proportion of each class within the subgroups.
What results is an inverted tree-like structure such as that in Figure \@ref(fig:exemplar-decision-tree). In essence, our tree is a set of rules that allows us to traverse the tree based on simple questions about each feature. For example, if the customer is loyal, has household income greater than \$150,000, and is shopping in a store, the exemplar decision tree would predict that the customer will redeem a coupon.
```{r exemplar-decision-tree, echo=FALSE, fig.cap="Exemplar decision tree predicting if a customer will redeem a coupon (yes or no) based on the customer's loyalty, household income, last month's spend, coupon placement, and shopping mode.", out.height="100%", out.width="100%"}
knitr::include_graphics("illustrations/exemplar-decision-tree.png")
```
We refer to the first feature used to split (the top of the tree) as the *root node*, the subsequent features used for splitting as *internal nodes*, the outcome regions (predictions) as *leaves* or *terminal nodes*, and the connections between nodes as *branches*.
```{r decision-tree-terminology, echo=FALSE, fig.cap="Terminology of a decision tree.", out.height="80%", out.width="80%"}
knitr::include_graphics("illustrations/decision-tree-terminology.png")
```
### Partitioning
As illustrated above, CART uses *recursive binary partitioning*. The objective at each step is to find the single feature ($x_i$) and split point that partition the data into two regions ($R_1$ and $R_2$) such that the overall error is minimized (typically, SSE for regression problems, and cross-entropy or the Gini index for classification problems--see Section \@ref(reg-perf-eval)). Having found the best feature/split combination, the data are partitioned into two regions and the splitting process is repeated on each of the two regions (hence the name binary recursive partitioning). This process is continued until some stopping criterion is reached (e.g., a maximum depth is reached or the tree becomes "too complex").
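To make the search concrete, below is a minimal sketch (not **rpart**'s actual implementation) of the exhaustive search for the best split on a single numeric feature using SSE; the function name `best_split` is ours and used only for illustration.
```r
# Exhaustive search for the SSE-minimizing split point on one numeric feature.
best_split <- function(x, y) {
  sse <- function(v) sum((v - mean(v))^2)
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # candidate cut points (midpoints)
  errors <- vapply(cuts, function(cut) {
    sse(y[x < cut]) + sse(y[x >= cut])        # total error of the two regions
  }, numeric(1))
  list(cut = cuts[which.min(errors)], sse = min(errors))
}
# e.g., best_split(mtcars$hp, mtcars$mpg) returns the single best cut on hp
```
A CART-like algorithm simply repeats this search over every feature, keeps the winning feature/split combination, and then recurses on the two resulting regions.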
It's important to note that a single feature can be used multiple times in a tree. For example, say we have data generated from an underlying sine function with errors distributed as $\epsilon_i \stackrel{iid}{\sim} N\left(0, \sigma^2\right)$. A regression tree built with only a single split (often referred to as a *decision stump*) leads to a split occurring at $x = 3.1$.[^stump]
```{r decision-stump, echo=FALSE, fig.width=4, fig.height=3, fig.show='hold', fig.cap="Decision tree illustrating the single split on feature x (left). The resulting decision boundary illustrates the predicted value when x < 3.1 (0.64), and when x > 3.1 (-0.67) (right)."}
# create data
set.seed(1112) # for reproducibility
df <- tibble::tibble(
  x = seq(from = 0, to = 2 * pi, length = 500),
  y = sin(x) + rnorm(length(x), sd = 0.5),
  truth = sin(x)
)
# run decision stump model
library(rpart)
ctrl <- list(cp = 0, minbucket = 5, maxdepth = 1)
fit <- rpart(y ~ x, data = df, control = ctrl)
# plot tree
library(rpart.plot)
par(mar = c(1, 1, 1, 1))
rpart.plot(fit)
# plot decision boundary
library(ggplot2)
library(dplyr)
df %>%
  mutate(pred = predict(fit, df)) %>%
  ggplot(aes(x, y)) +
  geom_point(alpha = .2, size = 1) +
  geom_line(aes(x, y = truth), color = "blue", size = .75) +
  geom_line(aes(y = pred), color = "red", size = .75) +
  geom_segment(x = 3.1, xend = 3.1, y = -Inf, yend = -.95,
               arrow = arrow(length = unit(0.25, "cm")), size = .25) +
  annotate("text", x = 3.1, y = -Inf, label = "split", hjust = 1.2, vjust = -1, size = 3) +
  geom_segment(x = 5.5, xend = 6, y = 2, yend = 2, size = .75, color = "blue") +
  geom_segment(x = 5.5, xend = 6, y = 1.7, yend = 1.7, size = .75, color = "red") +
  annotate("text", x = 5.3, y = 2, label = "truth", hjust = 1, size = 3, color = "blue") +
  annotate("text", x = 5.3, y = 1.7, label = "decision boundary", hjust = 1, size = 3, color = "red")
```
If we build a deeper tree, we will continue to split on the same feature ($x$) as illustrated in Figure \@ref(fig:depth-3-decision-tree). In this example, $x$ is the only feature available to split on, so the algorithm will continue finding the optimal splits along this feature's values until a pre-determined stopping criterion is reached.
```{r depth-3-decision-tree, echo=FALSE, fig.width=4, fig.height=3, fig.show='hold', fig.cap="Decision tree illustrating with depth = 3, resulting in 7 decision splits along values of feature x and 8 prediction regions (left). The resulting decision boundary (right)."}
# fit depth 3 decision tree
ctrl <- list(cp = 0, minbucket = 5, maxdepth = 3)
fit <- rpart(y ~ x, data = df, control = ctrl)
rpart.plot(fit)
# plot decision boundary
df %>%
  mutate(pred = predict(fit, df)) %>%
  ggplot(aes(x, y)) +
  geom_point(alpha = .2, size = 1) +
  geom_line(aes(x, y = truth), color = "blue", size = .75) +
  geom_line(aes(y = pred), color = "red", size = .75)
```
However, even when many features are available, the same behavior occurs if a single feature repeatedly provides the best splitting criterion. For example, a decision tree applied to the iris data set [@fisher1936use], where the species of the flower (setosa, versicolor, and virginica) is predicted based on two features (sepal width and sepal length), results in an optimal decision tree with two splits on each feature. Also, note how the decision boundary in a classification problem results in rectangular regions enclosing the observations.
```{r iris-decision-tree, echo=FALSE, fig.width=4, fig.height=3, fig.show='hold', fig.cap="Decision tree for the iris classification problem (left). The decision boundary results in rectangular regions that enclose the observations. The class with the highest proportion in each region is the predicted value (right)."}
# decision tree
iris_fit <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris)
rpart.plot(iris_fit)
# decision boundary
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species, shape = Species)) +
  geom_point(show.legend = FALSE) +
  annotate("rect", xmin = -Inf, xmax = 5.44, ymin = 2.8, ymax = Inf, alpha = .75, fill = "orange") +
  annotate("text", x = 4.0, y = 4.4, label = "setosa", hjust = 0, size = 3) +
  annotate("rect", xmin = -Inf, xmax = 5.44, ymin = 2.79, ymax = -Inf, alpha = .75, fill = "grey") +
  annotate("text", x = 4.0, y = 2, label = "versicolor", hjust = 0, size = 3) +
  annotate("rect", xmin = 5.45, xmax = 6.15, ymin = 3.1, ymax = Inf, alpha = .75, fill = "orange") +
  annotate("text", x = 6, y = 4.4, label = "setosa", hjust = 1, vjust = 0, size = 3) +
  annotate("rect", xmin = 5.45, xmax = 6.15, ymin = 3.09, ymax = -Inf, alpha = .75, fill = "grey") +
  annotate("text", x = 6.15, y = 2, label = "versicolor", hjust = 1, vjust = 0, size = 3) +
  annotate("rect", xmin = 6.16, xmax = Inf, ymin = -Inf, ymax = Inf, alpha = .75, fill = "green") +
  annotate("text", x = 8, y = 2, label = "virginica", hjust = 1, vjust = 0, size = 3)
```
### Minimizing overfitting
This leads to an important question: how deep (i.e., how complex) should we grow a tree? If we keep splitting until every terminal node contains only a handful of observations, the tree will fit the training data almost perfectly but will chase the noise, as illustrated below.
```{r deep-overfit-tree, echo=FALSE, fig.width=4, fig.height=3, fig.show='hold', fig.cap="An overgrown decision tree with no minimum node size and a maximum depth of 50 (left). The resulting decision boundary chases the noise in the training data, a clear sign of overfitting (right)."}
ctrl <- list(cp = 0, minbucket = 1, maxdepth = 50)
fit <- rpart(y ~ x, data = df, control = ctrl)
rpart.plot(fit)
df %>%
  mutate(pred = predict(fit, df)) %>%
  ggplot(aes(x, y)) +
  geom_point(alpha = .2, size = 1) +
  geom_line(aes(x, y = truth), color = "blue", size = 0.75) +
  geom_line(aes(y = pred), color = "red", size = 0.75)
```
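One common remedy, touched on again later in this chapter, is *cost-complexity pruning*: grow a deep tree and then use cross-validation to snip it back. Below is a minimal sketch using **rpart**'s built-in cross-validation on the simulated `df` data from above; treat it as an illustration rather than a definitive tuning recipe.
```r
# Grow an intentionally deep tree, then prune back to the cp value with the
# lowest cross-validated error (rpart stores the CV results in cptable).
deep_fit <- rpart(y ~ x, data = df, control = rpart.control(cp = 0))
plotcp(deep_fit)  # CV error as a function of the complexity parameter cp
best_cp <- deep_fit$cptable[which.min(deep_fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(deep_fit, cp = best_cp)
```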
### A simple regression tree example
For example, suppose we want to use a decision tree to predict the miles per gallon a car will average (`mpg`) based on two features: the number of cylinders (`cyl`) and the horsepower (`hp`); such data are available in the `mtcars` data frame which is part of the standard **datasets** package. A (regression) tree is built using the **rpart** package [@R-rpart] (see the code chunk below and Figure \@ref(fig:rf-decision-tree-example), which was produced using the **rpart.plot** package [@R-rpart.plot]), and all the training data are passed down this tree. Whenever an observation reaches a particular node in the tree (i.e., a yes-or-no question about one of the features; in this case, `cyl` or `hp`), it proceeds either to the left (if the answer is "yes") or to the right (if the answer is "no"). This tree has only three terminal nodes. To start traversing the tree, all observations that have 6 or 8 cylinders go to the left branch; all other observations proceed to the right branch. Next, the left branch is further partitioned by `hp`: observations with 6 or 8 cylinders and `hp` $\ge 192.5$ proceed to the left, while those with `hp` $< 192.5$ proceed to the right. These branches lead to *terminal nodes* or *leaves* which contain the predicted response value; in this case, the average `mpg` of the training cars that fall within that terminal node. In short, cars with fewer than 5 cylinders (region $R_1$) average 27 mpg, cars with `cyl` $\ge 5$ and `hp` $< 192.5$ (region $R_2$) average 18 mpg, and all other cars (region $R_3$) in the training data average 13 mpg. This is summarized in the tree diagram in Figure \@ref(fig:rf-decision-tree-example).
```{r rf-decision-tree-example, echo=TRUE, fig.cap="Using a CART-like decision tree to predict `mpg` based on `cyl` and `hp`.", fig.width=6, fig.asp=0.618, out.width="70%", eval=FALSE}
tree <- rpart(mpg ~ cyl + hp, data = mtcars) # CART-like regression tree
rpart.plot(tree) # tree diagram
```
Using a linear combination of simple basis functions, we can write the predicted `mpg` as
$$
\widehat{mpg} = 27 \cdot I\left(cyl < 5\right) + 18 \cdot I\left(cyl \ge 5 \text{ and } hp < 192.5 \right) + 13 \cdot I\left(cyl \ge 5 \text{ and } hp \ge 192.5\right),
$$
where $I\left(\cdot\right)$ is an *indicator function* that evaluates to one if its argument is true and zero otherwise. The coefficients (i.e., 27, 18, and 13) correspond to the average response value within the respective region. While this simple example illustrates that decision trees estimate a model of the same form as Equation \@ref(eq:linear-combo), the results are more easily interpreted in the form of a tree diagram like the one in Figure \@ref(fig:rf-decision-tree-example).
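As a quick sanity check, the sketch below evaluates this indicator-function form by hand and compares it to the fitted tree's predictions; note that 27, 18, and 13 are the rounded node means displayed in the diagram, so we round the `rpart` predictions for comparison.
```r
# Hand-coded basis-function form of the tree (using the rounded node means)
# versus the predictions from the fitted `tree` object above.
manual_pred <- with(mtcars,
  27 * (cyl < 5) +
  18 * (cyl >= 5 & hp < 192.5) +
  13 * (cyl >= 5 & hp >= 192.5)
)
head(cbind(manual = manual_pred, rpart = round(predict(tree, mtcars))))
```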
### Deciding on splits
**FIXME:** Continue with the previous example?
How do decision trees partition the data into non-overlapping regions? In particular, how did the tree in Figure \@ref(fig:rf-decision-tree-example) decide which variables to split on and which split points to use? The answer depends on the tree algorithm used and on whether the context is classification or regression. For CART-like decision trees (like those discussed in this book and implemented in **rpart**), the partitioning of the feature space is done in a top-down, *greedy* fashion. This means that any partition in the tree depends on the previous partitions. But how are these partitions made? The algorithm begins with the entire training data set and searches every distinct value of every input variable to find the "best" feature/split combination that partitions the data into two regions ($R_1$ and $R_2$) such that the overall error is minimized (typically, SSE for regression problems, and cross-entropy or the Gini index for classification problems--see Section \@ref(reg-perf-eval)). Having found the best feature/split combination, the data are partitioned into two regions and the splitting process is repeated on each of the two regions (hence the name *binary recursive partitioning*). This process is continued until some stopping criterion is reached (e.g., a maximum depth is reached or the tree becomes "too complex"). What results is, typically, a very deep, complex tree that may produce good predictions on the training set, but is likely to overfit the training data, leading to poor generalization performance on new, unseen data.
To illustrate, we print the structure of the previous regression tree below.
```{r rf-mtcars-partitioning, eval=FALSE}
print(tree)
```
The root node refers to all of the training data. In this node, the mean response is `r mean(mtcars$mpg)` and the SSE is
```{r rf-mtcars-sse-01, eval=FALSE}
sse <- function(x) sum((x - mean(x))^2)
sse(mtcars$mpg) # see the row labeled 1)
```
The feature/split combination that gives the largest reduction in this SSE is defined by the regions `cyl>=5` and `cyl< 5`. The resulting SSEs within these regions are
```{r rf-mtcars-sse-02, eval=FALSE}
sse(mtcars[mtcars$cyl >= 5, ]$mpg) # see the row labeled 2)
sse(mtcars[mtcars$cyl < 5, ]$mpg) # see the row labeled 3)
```
The algorithm then further split the region defined by `cyl>=5` into two sub-regions. The feature/split combination that gave the biggest reduction in that node's SSE (i.e., 198.47240) is defined by the regions `hp>=192.5` and `hp< 192.5`. The resulting SSEs within these regions are
```{r rf-mtcars-sse-03, eval=FALSE}
sse(mtcars[mtcars$cyl >= 5 & mtcars$hp >= 192.5, ]$mpg) # see the row labeled 4)
sse(mtcars[mtcars$cyl >= 5 & mtcars$hp < 192.5, ]$mpg) # see the row labeled 5)
```
At this point, a stopping criterion was reached (in this case, the minimum number of observations that must exist in a node in order for a split to be attempted) and the tree algorithm stopped partitioning the data. For a list of all the stopping criteria, see `?rpart::rpart.control`.
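For reference, here is a sketch of the most commonly adjusted stopping criteria spelled out explicitly; the values shown are **rpart**'s defaults (so this call should be equivalent to the earlier `rpart(mpg ~ cyl + hp, data = mtcars)` fit), not recommendations.
```r
# The same tree as before, with rpart's default stopping criteria made explicit.
tree_default <- rpart(
  mpg ~ cyl + hp, data = mtcars,
  control = rpart.control(
    minsplit  = 20,    # smallest node size eligible for a split
    minbucket = 7,     # smallest allowable terminal node (default: minsplit / 3)
    maxdepth  = 30,    # maximum depth of any node of the final tree
    cp        = 0.01   # minimum relative improvement required to keep a split
  )
)
```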
<!-- For example, using the well-known Boston housing data [@harrison1978hedonic], three decision trees are created based on three different samples of the data. You can see that the first few partitions are fairly similar at the top of each tree; however, they tend to differ substantially closer to the terminal nodes. These deeper nodes tend to overfit to specific attributes of the training data; consequently, slightly different samples will result in highly variable predicted values in the terminal nodes. For this reason, CART-like decision tree algorithms are often considered high variance (i.e., noisy) models fitting procedures. -->
<!-- **FIXME:** Brad, can we replace this with the code that produced the results; either hidden or shown? -->
<!-- ```{r rf-tree-variance-image, echo=FALSE, fig.cap="Three decision trees based on slightly different samples.", out.height="70%", out.width="70%"} -->
<!-- knitr::include_graphics("illustrations/tree-variance-1.svg") -->
<!-- Below is code to reproduce the images. May need to do some trimming of the image as I think I cropped it for aesthetic purposes. -->
```{r tree-correlation, message=FALSE, warning=FALSE, fig.align='center', fig.cap="Six decision trees based on different bootstrap samples.", echo=FALSE, dev='png', eval=FALSE}
iter <- 6
par(mfrow = c(3, 3))
for (i in 1:iter) {
  set.seed(i + 30)
  # create train/test sets
  train_index <- caret::createDataPartition(pdp::boston$cmedv, p = .6333,
                                            list = FALSE, times = 1)
  train_DF <- pdp::boston[train_index, ]
  validate_DF <- pdp::boston[-train_index, ]
  train_y <- train_DF$cmedv
  train_x <- train_DF[, setdiff(names(train_DF), "cmedv")]
  validate_y <- validate_DF$cmedv
  validate_x <- validate_DF[, setdiff(names(validate_DF), "cmedv")]
  d_tree <- rpart::rpart(cmedv ~ ., train_DF)
  # graphs
  rpart.plot::rpart.plot(d_tree, main = paste0("Decision Tree ", i), type = 0, extra = 0)
}
```
<!-- ``` -->
## Bagging
The major drawback of decision trees is that they are not typically as accurate as current state-of-the-art ML algorithms like neural networks. However, decision trees do have a number of desirable properties which are listed in Table X below. The idea behind the ensembles of decision trees discussed in this chapter and the next is to improve the predictive performance of trees while retaining most of the other properties.
<table>
<caption>Table X. A comparison of binary recursive partitioning and neural networks.</caption>
<thead>
<tr>
<th style="text-align:right;"> Property </th>
<th style="text-align:right;"> Recursive partitioning </th>
<th style="text-align:right;"> Neural networks </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> Naturally handles numeric and categorical variables </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Naturally handles missing values </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Robust to features with outliers </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Insensitive to monotone
transformations of features </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Computational scalability </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Ability to deal with irrelevant
inputs </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Interpretability </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
<td style="text-align:right;"> `r emo::ji("x")` </td>
</tr>
<tr>
<td style="text-align:right;"> Predictive accuracy </td>
<td style="text-align:right;"> `r emo::ji("warning")` </td>
<td style="text-align:right;"> `r emo::ji("check")` </td>
</tr>
</tbody>
</table>
As discussed in @friedman2001elements, the key to accuracy is low bias and low variance. Trees are naturally high-variance models; a small change in the training data can lead to substantially different trees. Although *pruning* the tree (i.e., determining a nested sequence of subtrees by recursively snipping off the least important splits) helps reduce the variance[^prune] of a single tree (i.e., by helping to avoid overfitting), there are methods that exploit this variance in a way that can significantly improve performance over and above that of single trees. ***B**ootstrap **agg**regating* (bagging) [@breiman1996bagging] is one such approach.
Bagging creates an ensemble^[Combining multiple models is referred to as *ensembling*.] of decision trees with low bias and high variance. The bias of each tree is kept at a minimum by constructing overgrown decision trees (i.e., no pruning). In other words, the tree models constructed in bagging intentionally overfit the training data. The variance of the final predictions is reduced by averaging (regression) or popular vote (classification). In regression, for example, the prediction for a new observation is obtained by averaging the predictions from each individual tree. To construct a bagged model of size $B$, we follow three simple steps (a hand-rolled sketch follows the list):
1. Create $B$ bootstrap replicates of the original training data (Section \@ref(bootstrap)) by selecting rows with replacement
2. For each bootstrap sample, train a single, unpruned decision tree
3. Average individual predictions from each tree to create an overall average predicted value (regression) or use a popular vote (classification)
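To make these steps concrete, here is a hand-rolled sketch of bagging regression trees on the Boston housing data used later in this section; in practice you would rely on a dedicated implementation (e.g., **ipred** or a random forest with `mtry` set to the number of features) rather than this loop.
```r
# Hand-rolled bagging of B = 100 deep regression trees (illustration only).
B <- 100
boston <- pdp::boston
set.seed(123)
preds <- sapply(seq_len(B), function(b) {
  boot_ids <- sample(nrow(boston), replace = TRUE)                 # step 1
  tree_b <- rpart(cmedv ~ ., data = boston[boot_ids, ],
                  control = rpart.control(cp = 0, minbucket = 1))  # step 2: unpruned tree
  predict(tree_b, newdata = boston)                                # per-tree predictions
})
bagged_pred <- rowMeans(preds)                                     # step 3: average over trees
```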
```{r rf-bagging-image, echo=FALSE, fig.cap="The bagging process.", out.height="70%", out.width="70%", eval=FALSE}
knitr::include_graphics("illustrations/bagging.png")
```
Technically, bagging is a general algorithm that can be applied to any regression or classification algorithm; however, it provides the greatest improvement for models that are adaptive and have high variance. More stable models, such as linear regression and MARS, tend to experience less improvement in predictive performance from bagging.
Although bagging trees can help reduce the variance of a single tree's prediction and improve predictive performance, the trees in bagging are not completely independent of each other since all the original predictors are considered at every split of every tree. Rather, trees from different bootstrap samples typically have similar structure to each other (especially at the top of the tree) because the strongest relationships tend to dominate the early splits.
For example, if we create six decision trees with different bootstrapped samples of the Boston housing data, we see that the tops of the trees all have a very similar structure. Although there are 15 predictor variables to split on, all six trees have both `lstat` and `rm` driving the first few splits. This *between-tree correlation* limits the variance-reducing effect of averaging across the ensemble. In order to reduce the variance further, we need to take an additional step to help de-correlate the trees in the ensemble.
```{r, echo=FALSE, fig.width=6, fig.asp=0.618, out.width="100%", fig.cap="Six bagged decision trees fit to the Boston housing data set.", eval=FALSE}
# knitr::include_graphics("illustrations/tree-correlation-1.png")
set.seed(1859)
par(mfrow = c(2, 3))
for (i in 1:6) {
  train <- pdp::boston[sample(nrow(pdp::boston), replace = TRUE), ]
  rpart.plot(rpart(cmedv ~ ., data = train))
}
```
## Random forests
Random forests are an extension of bagging that injects more randomness into the tree-growing process. Random forests achieve this in two ways:
1. __Bootstrap__: similar to bagging, each tree is grown on a bootstrap resampled data set, which makes the trees different and _somewhat_ decorrelates them.
2. __Split-variable randomization__: each time a split is to be performed, the search for the split variable is limited to a random subset of $m$ of the $p$ variables. Typical default values are $m = \frac{p}{3}$ (regression) and $m = \sqrt{p}$ (classification), but $m$ should be treated as a tuning parameter (see the quick check below). When $m = p$, the randomization amounts to using only step 1 and is the same as *bagging*.
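For the Ames training data, these common starting points work out as follows (a quick sketch; treat them as values to tune around rather than fixed rules):
```r
# Typical default values of m (mtry) for the Ames data with p features.
p <- length(setdiff(names(ames_train), "Sale_Price"))  # number of features
floor(p / 3)    # common regression default
floor(sqrt(p))  # common classification default
```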
The basic algorithm for a random forest model can be generalized to the following:
```r
1. Given training data set
2. Select number of trees to build (ntrees)
3. for i = 1 to ntrees do
4. | Generate a bootstrap sample of the original data
5. | Grow a regression or classification tree to the bootstrapped data
6. | for each split do
7. | | Select m variables at random from all p variables
8. | | Pick the best variable/split-point among the m
9. | | Split the node into two child nodes
10. | end
11. | Use typical tree model stopping criteria to determine when a tree is complete (but do not prune)
12. end
```
Since the algorithm randomly selects a bootstrap sample to train on ___and___ predictors to use at each split, tree correlation will be lessened beyond bagged trees.
### OOB error vs. test set error
One benefit of bagging (and thus also of random forests) is that, on average, a bootstrap sample will contain about 63% of the training data. This leaves roughly 37% of the data out of the bootstrap sample; we call this the out-of-bag (OOB) sample. We can use the OOB observations to estimate the model's accuracy, creating a natural validation process without sacrificing any of the training data for a holdout set. This makes identifying the number of trees required to stabilize the error rate during tuning more efficient; however, as illustrated below, some differences between the OOB error and the test error are expected.
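The ~63% figure follows from the bootstrap itself: each observation has probability $1 - \left(1 - 1/n\right)^n \approx 1 - e^{-1} \approx 0.632$ of appearing in a given bootstrap sample. A quick check on the Ames training data:
```r
# Probability that a given observation is included in a bootstrap sample.
n <- nrow(ames_train)
1 - (1 - 1 / n)^n  # approximately 0.632 for reasonably large n
```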
```{r rf-oob-example-image, echo=FALSE, fig.cap="Random forest out-of-bag error versus validation error.", out.height="70%", out.width="70%", eval=FALSE}
knitr::include_graphics("illustrations/oob-error-compare-1.svg")
```
Furthermore, many packages do not keep track of which observations were part of the OOB sample for a given tree and which were not. If you are comparing multiple models to one another, you'd want to score each on the same validation set to compare performance. Also, although it is technically possible to compute certain metrics, such as root mean squared logarithmic error (RMSLE), on the OOB sample, such metrics are not built into all packages. So if you are looking to compare multiple models or use a slightly less traditional loss function, you will likely still want to perform cross-validation.
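That said, a metric a package does not report can often be computed by hand from the stored OOB predictions. For example, here is a sketch using the `rf1` __randomForest__ model fit in the next section, whose `predicted` component holds the OOB predictions:
```r
# RMSLE computed from randomForest's OOB predictions (stored in rf1$predicted);
# assumes the rf1 model fit in the next section.
rmsle <- function(truth, pred) sqrt(mean((log1p(pred) - log1p(truth))^2))
rmsle(ames_train$Sale_Price, rf1$predicted)
```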
## Fitting a basic random forest model
There are over 20 random forest packages in R.[^task] To demonstrate the basic implementation we illustrate the use of the __randomForest__ package [@R-randomForest], the oldest and most well known implementation of the Random Forest algorithm in R.
```{block, type="tip"}
However, as your data set grows in size, __randomForest__ does not scale well (although you can parallelize with __foreach__).
```
`randomForest()` can use the formula or separate x, y matrix notation for specifying our model. Below we apply the default __randomForest__ model using the formulaic specification. The default random forest grows 500 trees and randomly selects $\lfloor p / 3 \rfloor = 26$ predictor variables as split candidates at each split. Averaging across all 500 trees provides an OOB $MSE = 661089658$ ($RMSE = \$25,711$).
```{r rf-basic-model, eval=FALSE}
# for reproduciblity
set.seed(123)
# default RF model
rf1 <- randomForest(
  formula = Sale_Price ~ .,
  data = ames_train
)
rf1
```
Plotting the model shows the OOB error rate as we average across more trees: the error rate stabilizes at around 100 trees but continues to decrease slowly until around 300 or so trees.
```{r rf-basic-model-plot, fig.width=5, fig.height=3.5, fig.cap="OOB error (MSE) as a function of the number of trees. We see the MSE reduces quickly for the first 100 trees and then slowly thereafter. We want to make sure that we are providing enough trees so that our OOB error has stabilized or flatlined.", eval=FALSE}
plot(rf1)
```
The plotted error rate above is based on the OOB sample error and can be accessed directly via `rf1$mse`. Thus, we can find which number of trees provides the lowest error rate: 447 trees, corresponding to an average home sales price error of \$25,649.
```{r rf-basic-model-best, eval=FALSE}
# number of trees with lowest MSE
which.min(rf1$mse)
# RMSE of this optimal random forest
sqrt(rf1$mse[which.min(rf1$mse)])
```
Random forests are one of the best "out-of-the-box" machine learning algorithms. They typically perform remarkably well with very little tuning required. As illustrated above, we were able to get an RMSE of \$25,649 without any tuning, which is nearly as good as the best, fully tuned model we've explored thus far. However, we can still seek improvement by tuning hyperparameters in our random forest model.
## Tuning
Compared to the algorithms explored in the previous chapters, random forests have more hyperparameters to tune. However, compared to gradient boosting machines and neural networks, which we explore in future chapters, random forests are much easier to tune. Typically, the primary concern when starting out is tuning the number of candidate variables to select from at each split.
```{block, type="tip"}
The two primary tuning parameters you should always tune in a random forest model are:
1. The number of trees, to ensure you use enough trees to minimize and stabilize the error rate.
2. The number of candidate variables to select from at each split.
```
However, there are a few additional hyperparameters that we should be aware of. Although the argument names may differ across packages, these hyperparameters should be present:
- __Number of trees__: We want enough trees to stabilize the error, but using too many trees is unnecessarily inefficient, especially when using large data sets.
- __Number of variables to randomly sample as candidates at each split__: Commonly referred to as "mtry". When `mtry` $=p$ the model equates to bagging. When `mtry` $=1$ the split variable is completely random, so all variables get a chance but can lead to overly biased results. A common suggestion is to start with 5 values evenly spaced across the range from 2 to *p*.
- __Sample size to train on__: The default value is 63.25% of the training set since this is the expected value of unique observations in the bootstrap sample. Lower sample sizes can reduce the training time but may introduce more bias than necessary. Increasing the sample size can increase performance but at the risk of overfitting because it introduces more variance. Typically, when tuning this parameter we stay near the 60-80% range.
- __Minimum number of samples within the terminal nodes__: Controls the complexity of the trees. A smaller node size allows for deeper, more complex trees and a larger node size results in shallower trees. This is another bias-variance tradeoff where deeper trees introduce more variance (risk of overfitting) and shallower trees introduce more bias (risk of not fully capturing unique patterns and relationships in the data).
- __Maximum number of terminal nodes__: Another way to control the complexity of the trees. More nodes equates to deeper, more complex trees and fewer nodes result in shallower trees.
- __Split rule__: As stated in the introduction, the most traditional splitting rules are based on minimizing the variance or MSE across the terminal nodes for regression problems and cross-entropy or the Gini index for classification problems. However, additional split rules have been developed that can offer improved predictive accuracy. For example, the extra trees split rule chooses cut-points fully at random and uses the whole learning sample (rather than a bootstrap replica) to grow the trees [@geurts2006extremely].
Tuning a larger set of hyperparameters requires a larger grid search than we've performed thus far. Unfortunately, this is where __randomForest__ becomes quite inefficient since it does not scale well. Instead, we can use __ranger__ [@R-ranger], which is a C++ implementation of Breiman's random forest algorithm and, as the following illustrates, is over 27 times faster than __randomForest__.
```{r rf-randomForest-vs-ranger, eval=FALSE}
# names of features
features <- setdiff(names(ames_train), "Sale_Price")
# randomForest speed
system.time(
  ames_randomForest <- randomForest(
    formula = Sale_Price ~ .,
    data = ames_train,
    ntree = 500,
    mtry = floor(length(features) / 3)
  )
)
# ranger speed
system.time(
  ames_ranger <- ranger(
    formula = Sale_Price ~ .,
    data = ames_train,
    num.trees = 500,
    mtry = floor(length(features) / 3)
  )
)
```
### Tuning via ranger
There are two approaches to tuning a __ranger__ model. The first is to tune __ranger__ manually using a `for` loop. To perform a manual grid search, first we want to construct our grid of hyperparameters. We’re going to search across 48 different models with varying mtry, minimum node size, sample size, and trying different split rules.
```{r rf-tuning-grid1, eval=FALSE}
# create a tuning grid
hyper_grid <- expand.grid(
  mtry = seq(20, 35, by = 5),
  min.node.size = seq(3, 9, by = 3),
  sample.fraction = c(.632, .80),
  splitrule = c("variance", "extratrees"),
  OOB_RMSE = 0
)
dim(hyper_grid)
```
We loop through each hyperparameter combination and apply 500 trees since our previous examples illustrated that 500 was plenty to achieve a stable error rate. Also note that we set the random number generator seed; this allows us to sample the same observations for each sample size, making it easier to see the impact of each change. Our OOB RMSE ranges between roughly 25,900 and 28,500. Our top 10 performing models all have RMSE values right around 26,000, and the results show that models with larger sample sizes (80%) and the variance split rule perform best. However, there is no definitive evidence that certain values of `mtry` or `min.node.size` are better than others.
```{block, type="warning"}
This grid search took 69 seconds to complete.
```
```{r rf-grid-search1, eval=FALSE}
for (i in seq_len(nrow(hyper_grid))) {
  # train model
  model <- ranger(
    formula = Sale_Price ~ .,
    data = ames_train,
    num.trees = 500,
    mtry = hyper_grid$mtry[i],
    min.node.size = hyper_grid$min.node.size[i],
    sample.fraction = hyper_grid$sample.fraction[i],
    splitrule = hyper_grid$splitrule[i],
    seed = 123
  )
  # add OOB error to grid
  hyper_grid$OOB_RMSE[i] <- sqrt(model$prediction.error)
}
hyper_grid %>%
  dplyr::arrange(OOB_RMSE) %>%
  head(10)
```
However, using this approach does not provide us with a cross validated measure of error. To get a k-fold CV error, we would have to expand our `for` loop approach or use an alternative approach. One such approach follows.
### Tuning via caret
The second tuning approach is to use the __caret__ package. __caret__ only allows you to tune some, not all, of the available __ranger__ hyperparameters (`mtry`, `splitrule`, `min.node.size`). However, __caret__ will allow us to get a CV measure of error to compare to our previous models (i.e. regularized regression, MARS). The following creates a similar tuning grid as before but with only those hyperparameters that __caret__ will accept.
```{block, type="tip"}
If you do not know what hyperparameters __caret__ allows you to tune for a specific model you can find that info at [https://topepo.github.io/caret/train-models-by-tag.html](https://topepo.github.io/caret/train-models-by-tag.html) or with `caret::getModelInfo`. For example, we can find the parameters available for a `ranger` with `caret::getModelInfo("ranger")$ranger$parameter`.
```
```{r rf-tuning-grid2, eval=FALSE}
# create a tuning grid
hyper_grid <- expand.grid(
  mtry = seq(20, 35, by = 5),
  min.node.size = seq(3, 9, by = 3),
  splitrule = c("variance", "extratrees")
)
```
Tuning with __caret__ provides similar results as our __ranger__ grid search. Both results suggest `mtry = 20` and `min.node.size = 3`. With __ranger__, our OOB RMSE was 25963.96 and with __caret__ our 10-fold CV RMSE was 25531.47.
```{block, type="warning"}
This grid search took a little over 6 minutes to complete.
```
```{r rf-grid-search2, fig.cap="Cross validated RMSE for the __caret__ grid search.", eval=FALSE}
# cross validated model
tuned_rf <- train(
  x = subset(ames_train, select = -Sale_Price),
  y = ames_train$Sale_Price,
  method = "ranger",
  metric = "RMSE",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = hyper_grid,
  num.trees = 500,
  seed = 123
)
# best model
tuned_rf$bestTune
# plot results
ggplot(tuned_rf)
```
## Feature interpretation
Whereas many of the linear models discussed previously use the standardized coefficients to signal importance, random forests have, historically, applied two different approaches to measure variable importance.
1. __Impurity__: At each split in each tree, compute the improvement in the split criterion (MSE for regression, Gini index for classification). Then average the improvement made by each variable across all the trees in which that variable is used. The variables with the largest average decrease in the error metric are considered most important.
2. __Permutation__: For each tree, the OOB sample is passed down the tree and the prediction accuracy is recorded. Then the values of each variable (one at a time) are randomly permuted and the accuracy is computed again. The decrease in accuracy from randomly "shaking up" the variable values is averaged over all the trees for each variable. The variables with the largest average decrease in accuracy are considered most important. (A hand-rolled sketch of this idea appears after the model fits below.)
To compute these variable importance measures with __ranger__, you must set the `importance` argument.
```{block, type="note"}
Once you’ve identified the optimal parameter values from the grid search, you will want to re-run your model with these hyperparameter values.
```
```{r rf-model-with-vi, eval=FALSE}
# re-run model with impurity-based variable importance
rf_impurity <- ranger(
  formula = Sale_Price ~ .,
  data = ames_train,
  num.trees = 500,
  mtry = 20,
  min.node.size = 3,
  sample.fraction = .80,
  splitrule = "variance",
  importance = "impurity",
  verbose = FALSE,
  seed = 123
)
# re-run model with permutation-based variable importance
rf_permutation <- ranger(
  formula = Sale_Price ~ .,
  data = ames_train,
  num.trees = 500,
  mtry = 20,
  min.node.size = 3,
  sample.fraction = .80,
  splitrule = "variance",
  importance = "permutation",
  verbose = FALSE,
  seed = 123
)
```
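To make the permutation approach concrete, here is a hand-rolled sketch for a single feature, scored on the test set rather than the OOB samples that __ranger__ uses internally:
```r
# Permutation importance of Gr_Liv_Area, measured as the increase in test RMSE
# after shuffling that one column (a simplified, single-feature illustration).
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))
baseline <- rmse(ames_test$Sale_Price, predict(rf_permutation, data = ames_test)$predictions)
set.seed(123)
shuffled <- ames_test
shuffled$Gr_Liv_Area <- sample(shuffled$Gr_Liv_Area)
permuted <- rmse(ames_test$Sale_Price, predict(rf_permutation, data = shuffled)$predictions)
permuted - baseline  # a larger increase in error means a more important feature
```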
For both options, you can directly access the variable importance values with `model_name$variable.importance`. However, here we'll plot the variable importance using the __vip__ package. Typically, you will not see the exact same variable importance order between the two options; however, you will often see similar variables at the top of the plots. Consequently, in this example, there is enough agreement between the two approaches to state that two variables stand out as most influential:
* `Overall_Qual`
* `Gr_Liv_Area`
Looking at the next ~10 variables in both plots, you will also see some commonality in influential variables (i.e. `Garage_Cars`, `Bsmt_Qual`, `Year_Built`, `Exter_Qual`).
```{r rf-vip-plots, fig.width=9, fig.cap="Top 25 most important variables based on impurity (left) and permutation (right).", eval=FALSE}
p1 <- vip(rf_impurity, num_features = 25, bar = FALSE) + ggtitle("Impurity-based variable importance")
p2 <- vip(rf_permutation, num_features = 25, bar = FALSE) + ggtitle("Permutation-based variable importance")
gridExtra::grid.arrange(p1, p2, nrow = 1)
```
To better understand the relationship between these important features and `Sale_Price`, we can create partial dependence plots (PDPs). Recall that in the linear and regularized regression sections, the linear model assumed a continuously increasing relationship between `Gr_Liv_Area` and `Sale_Price`. In the MARS chapter, we saw that as homes exceed 2,945 square feet, each additional square foot demands a higher marginal increase in sale price than homes with less than 2,945 square feet; however, between the knots of a MARS model the relationship remains linear. In contrast, the PDP below displays how random forest models can capture non-linear and non-monotonic relationships between predictors and the target. In this case, `Sale_Price` appears to not be influenced by `Gr_Liv_Area` values below 750 square feet or above 3,500 square feet. This change in relationship was not well captured by the prior parametric models.
```{r rf-pdp-GrLiv-Area, fig.height=3, fig.width=4, fig.cap="The mean predicted sale price as the above ground living area increases.", eval=FALSE}
# partial dependence of Sale_Price on Gr_Liv_Area
rf_impurity %>%
  partial(pred.var = "Gr_Liv_Area", grid.resolution = 50) %>%
  autoplot(rug = TRUE, train = ames_train)
```
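Under the hood, a partial dependence value is just an averaged prediction: fix the feature of interest at a given value for every training observation, predict, and average. A minimal sketch for a single value of `Gr_Liv_Area` follows (the __pdp__ package simply repeats this over a grid of values); the function name `pd_value` is ours and only for illustration.
```r
# Partial dependence of Sale_Price on Gr_Liv_Area at a single grid value.
pd_value <- function(object, train, value) {
  tmp <- train
  tmp$Gr_Liv_Area <- value                       # force the feature to `value`
  mean(predict(object, data = tmp)$predictions)  # average prediction over the training data
}
pd_value(rf_impurity, ames_train, value = 2000)
```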
Additionally, if we assess the relationship between the `Overall_Qual` predictor and `Sale_Price`, we see a continual increase in predicted price as the overall quality increases. This provides a more comprehensive understanding of the relationship than the results we saw in the regularized regression section (\@ref(lm-features)). We see that the largest impact on `Sale_Price` occurs when houses go from "Good" overall quality to "Very Good".
```{r rf-pdp-Overall-Qual, fig.height=3, fig.width=8, fig.cap="The mean predicted sale price for each level of the overall quality variable.", eval=FALSE}
# partial dependence of Sale_Price on Overall_Qual
rf_impurity %>%
  partial(pred.var = "Overall_Qual", train = as.data.frame(ames_train)) %>%
  autoplot()
```
Individual conditional expectation (ICE) curves [@goldstein2015peeking] are an extension of PDPs but, rather than plotting the _average_ marginal effect on the response variable, we plot the change in the predicted response ___for each observation___ as we vary the predictor of interest. The figure below shows the regular ICE curves (left) and the centered ICE curves (right). When the curves have a wide range of intercepts and are consequently "stacked" on each other, heterogeneity in the response due to marginal changes in the predictor of interest can be difficult to discern. Centering the ICE curves can help draw these inferences out and can highlight any strong heterogeneity in our results.
The plots below show that marginal changes in `Gr_Liv_Area` have a fairly homogeneous effect on the response: as `Gr_Liv_Area` increases, the vast majority of observations show a similar increasing effect on the predicted `Sale_Price`, with the primary difference being the magnitude of that increase. However, in the centered ICE plot you see evidence of a few observations that display a different pattern: some have a higher $\hat y$ value when `Gr_Liv_Area` is between 2,500 and 4,000, and some have a decreasing $\hat y$ as `Gr_Liv_Area` increases. These results may be a sign of interaction effects and would be worth exploring more closely.
```{r rf-ice-Gr-Liv-Area, fig.height=3, fig.width=9, fig.cap="Non-centered (left) and centered (right) individual conditional expectation curve plots illustrate how changes in above ground square footage influences predicted sale price for all observations.", eval=FALSE}
# ice curves of Sale_Price on Gr_Liv_Area
ice1 <- rf_impurity %>%
  partial(pred.var = "Gr_Liv_Area", grid.resolution = 50, ice = TRUE) %>%
  autoplot(rug = TRUE, train = ames_train, alpha = 0.1) +
  ggtitle("Non-centered ICE plot")
ice2 <- rf_impurity %>%
  partial(pred.var = "Gr_Liv_Area", grid.resolution = 50, ice = TRUE) %>%
  autoplot(rug = TRUE, train = ames_train, alpha = 0.1, center = TRUE) +
  ggtitle("Centered ICE plot")
gridExtra::grid.arrange(ice1, ice2, nrow = 1)
```
Both PDPs and ICE curves should be assessed for the most influential variables as they help to explain the underlying patterns in the data that the random forest model is picking up.
## Attrition data
With the Ames data, the random forest models obtained predictive accuracy that was close to our best MARS model, but how about the attrition data? The following performs a grid search across 48 hyperparameter combinations.
```{r rf-tuned-rf-attrition, fig.cap="Cross-validated accuracy rate for the 48 different hyperparameter combinations in our grid search. The optimal model uses `mtry` = 9, `splitrule` = gini, and `min.node.size` = 4, which obtained a 10-fold CV accuracy rate of 85.8%.", eval=FALSE}
# get attrition data
df <- rsample::attrition %>% dplyr::mutate_if(is.ordered, factor, ordered = FALSE)
# Create training (70%) and test (30%) sets for the rsample::attrition data.
# Use set.seed for reproducibility
set.seed(123)
churn_split <- initial_split(df, prop = .7, strata = "Attrition")
churn_train <- training(churn_split)
churn_test <- testing(churn_split)
# create a tuning grid
hyper_grid <- expand.grid(
  mtry = seq(3, 18, by = 3),
  min.node.size = seq(1, 10, by = 3),
  splitrule = c("gini", "extratrees")
)
# cross validated model
tuned_rf <- train(
  x = subset(churn_train, select = -Attrition),
  y = churn_train$Attrition,
  method = "ranger",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = hyper_grid,
  num.trees = 500,
  seed = 123
)
# best model
tuned_rf$bestTune
# plot results
ggplot(tuned_rf)
```
Similar to the MARS model, the random forest model does not improve predictive accuracy over and above the regularized regression model.
```{r rf-attrition-modeling-rf, echo=FALSE, eval=FALSE}
# train logistic regression model
glm_mod <- train(
  Attrition ~ .,
  data = churn_train,
  method = "glm",
  family = "binomial",
  preProc = c("zv", "center", "scale"),
  trControl = trainControl(method = "cv", number = 10)
)
# train regularized logistic regression model
penalized_mod <- train(
  Attrition ~ .,
  data = churn_train,
  method = "glmnet",
  family = "binomial",
  preProc = c("zv", "center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)
# train mars model
hyper_grid <- expand.grid(
  degree = 1:3,
  nprune = seq(2, 100, length.out = 10) %>% floor()
)
tuned_mars <- train(
  x = subset(churn_train, select = -Attrition),
  y = churn_train$Attrition,
  method = "earth",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = hyper_grid
)
# extract out of sample performance measures
summary(resamples(list(
  Logistic_model = glm_mod,
  Elastic_net = penalized_mod,
  MARS_model = tuned_mars,
  RF_model = tuned_rf
)))$statistics$Accuracy %>%
  kableExtra::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
```
## Final thoughts
Random forests provide a powerful out-of-the-box algorithm that often has great predictive accuracy. Because they are relatively simple to tune and require very little, if any, feature pre-processing, they are often one of the first go-to algorithms when facing a predictive modeling problem. However, as we illustrated in this chapter, random forests are not guaranteed to improve predictive accuracy over and above linear models and their cousins. The following summarizes some of the advantages and disadvantages of random forest modeling:
__TODO__: may need to better tie in some of these advantages and disadvantages throughout the chapter.
__Advantages:__
- Typically have very good performance.
- Remarkably good "out-of-the box" - very little tuning required.
- Built-in validation set - don't need to sacrifice data for extra validation.
- Adding more trees does not lead to overfitting.
- No data pre-processing required - often works great with categorical and numerical values as is.
- Robust to outliers.
- Handles missing data - imputation not required.
- Provide automatic feature selection.
__Disadvantages:__
- Can become slow on large data sets.
- Although accurate, often cannot compete with the accuracy of advanced boosting algorithms.
- Less interpretable although this is easily addressed with various tools (variable importance, partial dependence plots, LIME, etc.).
## Learning more
The literature on random forests is rich and we have only touched on the fundamentals. To learn more, I would start with the following resources, listed in order of complexity:
* [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)
* [Applied Predictive Modeling](http://appliedpredictivemodeling.com/)
* [Computer Age Statistical Inference](https://www.amazon.com/Computer-Age-Statistical-Inference-Mathematical/dp/1107149894)
* [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/)
[^other]: Other decision tree algorithms include the Iterative Dichotomiser 3
[@quinlan1986induction], C4.5 [@quinlan1996bagging], Chi-square automatic interaction detection [@kass1980exploratory], Conditional inference trees [@hothorn2006unbiased], and more.
[^stump]: This decision tree was built using the **rpart** package [@R-rpart], the tree graphic was produced using the **rpart.plot** package [@R-rpart.plot], and the decision boundary was illustrated with **ggplot2**.
[^task]: See the Random Forest section in the [Machine Learning Task View](https://CRAN.R-project.org/view=MachineLearning) on CRAN and Erin LeDell's [useR! Machine Learning Tutorial](https://koalaverse.github.io/machine-learning-in-R/random-forest.html#random-forest-software-in-r) for a non-comprehensive list.
[^prune]: See @esposito1997comparative for various methods of pruning. In **rpart**, the amount of pruning is controlled by the *complexity parameter* `cp`.