# Deep Learning {#DL}
```{r dnn-ch13-setup, include=FALSE}
# Set the graphical theme
ggplot2::theme_set(ggplot2::theme_light())
# Set global knitr chunk options
knitr::opts_chunk$set(
cache = TRUE,
warning = FALSE,
message = FALSE,
collapse = TRUE,
fig.align = "center",
fig.height = 3.5
)
```
Machine learning algorithms typically search for the optimal representation of data using some feedback signal (aka objective/loss function). However, most machine learning algorithms only have the ability to use one or two layers of data transformation to learn the output representation. As data sets continue to grow in the dimensions of the feature space, finding the optimal output representation with a *shallow* model is not always possible. Deep learning provides a multi-layer approach to learning data representations, typically performed with a *multi-layer neural network*. Like other machine learning algorithms, deep neural networks (DNNs) perform learning by mapping features to targets through a process of simple data transformations and feedback signals; however, DNNs place an emphasis on learning successive layers of meaningful representations. Although an intimidating subject, the overarching concept is rather simple and has proven highly successful across a wide range of problems (e.g., image classification, speech recognition, autonomous driving). This chapter will teach you the fundamentals of building a *feedforward* deep learning model.
This chapter will use a few supporting packages but the main emphasis will be on the `h2o` package.
```{r dnn-prereq-pkgs, eval=FALSE}
library(rsample) # for data splitting
library(dplyr) # for data wrangling
library(h2o) # for data modeling
library(vip) # for variable importance plots
library(pdp) # for partial dependence plots
library(ggplot2) # used in conjunction with pdp
# launch h2o
h2o.init()
```
```{r dnn-prereq-pkgs-run, include=FALSE}
library(rsample) # for data splitting
library(dplyr) # for data wrangling
library(h2o) # for data modeling
library(vip) # for variable importance plots
library(pdp) # for partial dependence plots
library(ggplot2)
# launch h2o
h2o.no_progress()
h2o.init()
```
To illustrate the various concepts we’ll continue focusing on the Ames Housing data (regression); however, at the end of the chapter we’ll also fit a DNN model to the employee attrition data (classification).
However, three important items need to be pointed out.
1. Feedforward DNNs require all feature inputs to be numeric. Consequently, all categorical variables need to be one-hot encoded. Fortunately, `h2o` will automatically one-hot encode categorical variables for you; however, other neural network packages do not always provide this service (a manual preprocessing sketch follows this list).
2. Due to the data transformation process that DNNs perform, they are highly sensitive to the individual scale of the feature values. Consequently, all features should be standardized prior to modeling. Once again, `h2o` will do this for us.
3. Neural networks are sensitive to skewed response variables. Consequently, we perform a log transformation on our response variable (`Sale_Price`) to normalize its distribution.
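Fortunately, `h2o` handles the first two items internally, but if you are working with a neural network package that does not, a minimal sketch of manual preprocessing with base R might look like the following (illustrative only, not required for the `h2o` workflow used in this chapter):
```{r dnn-manual-preprocess, eval=FALSE}
# Illustrative only: h2o performs these steps for us.
ames <- AmesHousing::make_ames()

x <- model.matrix(Sale_Price ~ . - 1, data = ames) # one-hot encode categoricals
x <- scale(x)                                      # standardize each column
y <- log(ames$Sale_Price)                          # log-transform the response
```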
```{r dnn-prereq-data}
# Create training (70%) and test (30%) sets for the AmesHousing::make_ames() data.
# Use set.seed for reproducibility
set.seed(123)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
# convert data to h2o objects
train_h2o <- ames_train %>% mutate(Sale_Price = log(Sale_Price)) %>% as.h2o()
test_h2o <- ames_test %>% mutate(Sale_Price = log(Sale_Price)) %>% as.h2o()
# get response and predictor names
response <- "Sale_Price"
features <- setdiff(names(ames_train), response)
```
## Why deep learning {#dnn_why}
Neural networks originated in the computer science field to answer questions that normal statistical approaches were not designed to answer. A common example is analyzing hand-written digits and predicting the numbers written. This problem was presented to AT&T Bell Labs to help build automatic mail-sorting machines for the USPS [@lecun1990handwritten].
```{r digits-fig, echo=FALSE, fig.align='center', fig.cap="Sample images from MNIST test dataset."}
knitr::include_graphics("illustrations/digits.png")
```
This problem is quite unique because many different features of the data can be represented. As humans, we look at these numbers and consider features such as angles, edges, thickness, completeness of circles, etc. We interpret these different representations of the features and combine them to recognize the digit. In essence, neural networks perform the same task, albeit in a far simpler manner than our brains. At their most basic level, neural networks have an *input layer*, *hidden layer*, and *output layer*. The input layer reads in data values from a user-provided input, the hidden layer is where the majority of the *learning* takes place, and the output layer projects the results.
```{r dnn-ffwd-fig, echo=FALSE, fig.align='center', fig.cap="Simple feedforward neural network."}
knitr::include_graphics("illustrations/fig18_1.png")
```
Although simple on the surface, historically the magic being performed inside the neural net required lots of data for the network to learn from and was computationally intense, ultimately making neural nets impractical. However, in the last decade advancements in computer hardware (off-the-shelf CPUs became faster and GPUs were created) made computation more practical, the growth in data collection made them more relevant, and advancements in the underlying algorithms made the *depth* (number of hidden layers) of neural nets less of a constraint. These advancements have resulted in the ability to run very deep and highly parameterized neural networks, which have become known as deep neural networks (DNNs).
```{r dnn-deep-fig, echo=FALSE, fig.align='center', fig.cap="Deep feedforward neural network."}
knitr::include_graphics("illustrations/deep_nn.png")
```
These DNNs allow for very complex representations of data to be modeled, which has opened the door to analyzing high-dimensional data (e.g., images, videos). In traditional machine learning approaches, features of the data need to be defined prior to modeling. One can only imagine trying to create the features for the digit recognition problem above. However, with DNNs, the hidden layers provide the means to auto-identify features. A simple way to think of this is to go back to our digit recognition problem. The first hidden layer may learn about the angles of the lines, the next hidden layer may learn about the thickness of the lines, the next learns the location and completeness of the circles, etc. Aggregating these different attributes together by linking the layers allows the model to predict which digit each image represents based on its features.
This is the reason that DNNs are so popular for very complex problems where hand-crafting features is not feasible (e.g., image classification, facial recognition). However, at their core, DNNs perform successive non-linear transformations across each layer, allowing DNNs to model very complex and non-linear relationships. This can make DNNs suitable machine learning approaches for traditional regression and classification problems as well. But it is important to keep in mind that deep learning thrives when the dimensions of your data are sufficiently large. As the number of observations (*n*) and feature inputs (*p*) decrease, traditional shallow machine learning approaches tend to perform just as well, if not better, and are more efficient.
## Feedforward DNNs {#dnn_ff}
Multiple DNN models exist and, as interest and investment in this area have increased, expansions of DNN models have flourished. For example, convolutional neural networks (CNN or ConvNet) have wide applications in image and video recognition, recurrent neural networks (RNN) are used with speech recognition, and long short-term memory neural networks (LSTM) are advancing automated robotics and machine translation. @goodfellow2016deep and @chollet2018deep provide nice comprehensive details of the many DNN algorithms available. However, fundamental to all these methods is the ___feedforward neural net___ (aka multilayer perceptron). Feedforward DNNs are densely connected layers where inputs influence each successive layer which then influences the final output layer.
```{r dnn-mlp-fig, echo=FALSE, fig.align='center', fig.cap="Feedforward neural network."}
knitr::include_graphics("illustrations/mlp_network.png")
```
To build a feedforward DNN we need 4 key components:
1. input data ✔,
2. a defined network architecture,
3. our feedback mechanism to help our model learn,
4. a model training approach.
The next few sections will walk you through each of these components to build a feedforward DNN for our Ames housing data.
## Network architecture {#dnn_arch}
When developing the network architecture for a feedforward DNN, you really only need to worry about two components: (1) layers and nodes and (2) activation.
### Layers and nodes
The layers and nodes are the building blocks of our model and they decide how complex your network will be. Layers are considered *dense* (fully connected) when all the nodes in each successive layer are connected. Consequently, the more layers and nodes you add, the more opportunities for new features to be learned (commonly referred to as the model's *capacity*). Beyond the *input layer*, which is just our predictor variables, there are two main types of layers to consider: *hidden layers* and an *output layer*.
#### Hidden layers
There is no well-defined approach for selecting the number of hidden layers and nodes; rather, these are the first of many tuning parameters. Typically, with regular rectangular data (think normal data frames in R), 2-5 hidden layers are sufficient. And the number of nodes you incorporate in these hidden layers is largely determined by the number of features in your data. Often, the number of nodes in each layer is equal to or less than the number of features, but this is not a hard requirement. At the end of the day, the number of hidden layers and nodes in your network will drive the computational burden of your model. Consequently, the goal is to find the simplest model with optimal performance.
#### Output layers
The output layer is driven by the type of modeling you are performing. For regression problems, your output layer will contain one node because that one node will predict a continuous numeric output. Classification problems are different. If you are predicting a binary output (True/False, Win/Loss), your output layer will still contain only one node and that node will predict the probability of success (however you define success). However, if you are predicting a multinomial output, the output layer will contain the same number of nodes as the number of classes being predicted. For example, in our digit recognition problem we would be predicting 10 classes (0-9); therefore, the output layer would have 10 nodes and the output would provide the probability of each class.
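For example, the sketch below shows how the `distribution` argument to `h2o.deeplearning()` maps onto these output layers; the `attrition_*` and `mnist_*` objects and their response names are hypothetical placeholders, not objects created in this chapter.
```{r dnn-output-layer-sketch, eval=FALSE}
# Regression: a single output node predicting a continuous value
h2o.deeplearning(x = features, y = response, training_frame = train_h2o,
                 distribution = "gaussian")

# Binary classification: a single output node giving the probability of success
# (attrition_features / attrition_h2o are hypothetical placeholders)
h2o.deeplearning(x = attrition_features, y = "Attrition",
                 training_frame = attrition_h2o, distribution = "bernoulli")

# Multinomial classification: one output node per class, e.g., 10 digits
# (mnist_features / mnist_h2o are hypothetical placeholders)
h2o.deeplearning(x = mnist_features, y = "digit",
                 training_frame = mnist_h2o, distribution = "multinomial")
```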
#### Implementation
To implement a feedforward DNN, we use `h2o.deeplearning()`. First, we specify the distribution of the response variable (gaussian is most common for continuous regression problems but poisson, gamma, and a few others are also available; see `?h2o.deeplearning` for more). This example creates two hidden layers with 200 nodes each (these are actually the default values). By specifying a continuous distribution, `h2o.deeplearning()` will automatically create a single-node output layer for your predictions.
```{r initial-model, eval=FALSE}
fit1 <- h2o.deeplearning(
x = features,
y = response,
training_frame = train_h2o,
distribution = "gaussian", # output is continuous
hidden = c(200, 200) # two hidden layers
)
```
### Activation
A key component of neural networks is what's called *activation*. In the human body, a biological neuron receives inputs from many adjacent neurons. When these inputs accumulate beyond a certain threshold the neuron is *activated*, suggesting there is a signal. DNNs work in a similar fashion.
#### Activation functions
As stated previously, each node is connected to all the nodes in the previous layer. Each connection gets a weight, and the node adds up all the incoming inputs multiplied by their corresponding connection weights (plus an extra *bias* term, $w_0$, but don't worry about that right now). The summed total of these inputs becomes the input to an *activation function*.
```{r activation-fig, echo=FALSE, fig.cap="Flow of information in an artificial neuron"}
knitr::include_graphics("illustrations/perceptron_node.png")
```
The activation function is simply a mathematical function that determines if there is enough informative input at a node to fire a signal to the next layer. There are multiple [activation functions](https://en.wikipedia.org/wiki/Activation_function) to choose from but the most common ones include:
$$
\texttt{Linear (identity):} \;\; f(x)=x
$$
<br>
$$
\texttt{Rectified linear unit (ReLU):} \;\; f(x)= \begin{cases}
0, & \text{for $x<0$}.\\
x, & \text{for $x\geq0$}.
\end{cases}
$$
<br>
$$
\texttt{Sigmoid:} \;\; f(x)= \frac{1}{1 + e^{-x}}
$$
When using rectangular data such as our Ames housing data, the most common approach is to use ReLU activation functions in the hidden layers. The ReLU activation function takes the summed weighted input and transforms it to 0 if it is negative (do not fire) or passes it through unchanged if it is positive (fire). For the output layer we use the linear activation function for regression problems and the sigmoid activation function for classification problems, as this provides the probability of the class (multinomial classification problems commonly use the softmax activation function).
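To make these concrete, here is a small sketch in base R that defines each activation function and applies ReLU to the summed weighted input of a single node; the input values, weights, and bias are made up purely for illustration.
```{r dnn-activation-sketch, eval=FALSE}
# toy activation functions
linear  <- function(x) x
relu    <- function(x) pmax(0, x)
sigmoid <- function(x) 1 / (1 + exp(-x))

# a single node: sum the weighted inputs plus a bias, then apply the activation
x  <- c(0.2, -1.5, 0.8)    # incoming signals (made-up values)
w  <- c(0.4,  0.1, -0.6)   # connection weights (made-up values)
w0 <- 0.05                 # bias

z <- sum(w * x) + w0       # summed weighted input
relu(z)                    # value passed to the next layer (0 here because z < 0)
```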
#### Implementation
To implement activation functions into our model we simply incorporate the `activation` argument. For the two hidden layers we add the ReLU activation function (aka "Rectifier") and for the output layer we do not specify an activation function because the default for regression models is a linear activation.
```{r activation, eval=FALSE}
fit1 <- h2o.deeplearning(
x = features,
y = response,
training_frame = train_h2o,
distribution = "gaussian", # output is continuous
hidden = c(200, 200), # two hidden layers
activation = "Rectifier" # hidden layer activation functions
)
```
We have created our basic network architecture: two hidden layers with 200 nodes each and both hidden layers using ReLU activation functions. Next, we need to incorporate a feedback mechanism to help our model learn.
## Backpropagation {#dnn_back}
On the first model run (or *forward pass*), the DNN will select a batch of observations, randomly assign weights across all the node connections, and predict the output. The engine of a neural network is how it assesses its own accuracy and automatically adjusts the weights across all the node connections to try to improve that accuracy. This process is called *backpropagation*. To perform backpropagation we need two things:
1. objective function
2. optimizer
First, you need to establish an objective (loss) function to measure performance. Then, on each forward pass the DNN will measure its performance based on the loss function chosen. The DNN will then work backwards through the layers, compute the gradient ($\S$ \@ref(gbm-gradient)) of the loss with respect to the network weights, adjust the weights a little in the opposite direction of the gradient, grab another batch of observations to run through the model, ...rinse and repeat until the loss function is minimized. This process is known as *mini-batch stochastic gradient descent*[^stochastic] (mini-batch SGD). There are several variants of mini-batch SGD algorithms; they primarily differ in how quickly they descend the gradient (see @ruder2016overview for an overview of gradient descent algorithms). `h2o.deeplearning()` uses Adadelta [@zeiler2012adadelta], which is sufficient for most regression and classification problems you'll encounter. However, in the tuning section we will show you how to adjust a few different learning parameters, which can help keep your loss function from getting stuck in a local optimum.
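The weight-update idea is easier to see on a toy problem. The sketch below is not h2o's implementation; it runs a few epochs of mini-batch gradient descent on a simple linear model with an MSE loss, with the data, learning rate, and batch size chosen arbitrarily for illustration.
```{r dnn-sgd-sketch, eval=FALSE}
# Toy mini-batch gradient descent for a linear model with an MSE loss
set.seed(123)
n <- 200
x <- rnorm(n)
y <- 2 * x + 1 + rnorm(n, sd = 0.3)    # data with known slope (2) and intercept (1)

w <- 0; b <- 0                         # initialize the weights
lr <- 0.1                              # learning rate (step size)
batch_size <- 32

for (epoch in 1:20) {
  idx <- sample(n)                                  # shuffle observations each epoch
  batches <- split(idx, ceiling(seq_along(idx) / batch_size))
  for (batch in batches) {
    pred <- w * x[batch] + b
    err  <- pred - y[batch]
    grad_w <- mean(2 * err * x[batch])              # gradient of MSE w.r.t. w
    grad_b <- mean(2 * err)                         # gradient of MSE w.r.t. b
    w <- w - lr * grad_w                            # step opposite the gradient
    b <- b - lr * grad_b
  }
}
c(w = w, b = b)                                     # approaches c(2, 1)
```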
To incorporate the backpropagation piece of our DNN we identify the loss metric. For regression problems, the loss function argument is "Automatic", which defaults to MSE.
```{r backpropagation, eval=FALSE}
fit1 <- h2o.deeplearning(
x = features,
y = response,
training_frame = train_h2o,
distribution = "gaussian", # output is continuous
hidden = c(200, 200), # two hidden layers
activation = "Rectifier", # hidden layer activation functions
loss = "Automatic" # loss function is MSE
)
```
## Model training {#dnn_train}
We've created a base model, now we just need to train it with our data. To do so we provide a few other arguments that are worth mentioning:
- `mini_batch_size`: As mentioned in the last section, the DNN will take a batch of data to run through the mini-batch SGD process. Batch sizes can be between 1 and several hundred (h2o's default is 1). Small values are more computationally burdensome while large values provide less feedback signal. Typically, 32 is a good size to start with, and values are generally chosen as powers of two that fit nicely into the memory of GPU or CPU hardware (e.g., 32, 64, 128, 256).
- `epochs`: An *epoch* describes the number of times the algorithm sees the ___entire___ data set. So, each time the algorithm has seen all samples in the dataset, an epoch has completed. In our training set, we have `r nrow(ames_train)` observations so running batches of 32 will require `r round(nrow(ames_train) / 32, 0)` passes for one epoch. The more complex the features and relationships in your data, the more epochs you will require for your model to learn, adjust the weights, and minimize the loss function.
- `nfolds`: Allows us to perform cross-validation (CV). With `nfolds = 5`, the model iteratively holds out 20% of the data so that we can compute a more accurate estimate of the out-of-sample error rate. When performing CV we need to retain our predictions for scoring purposes (`keep_cross_validation_predictions = TRUE`).
- `seed`: Provides reproducible results.
Our initial model provides an average cross-validated RMSE of 0.1506781. Later, we'll re-transform our errors so we can interpret them as whole dollar values, but for now our objective is to find the model that minimizes RMSE.
```{r train-mod1, eval=FALSE}
fit1 <- h2o.deeplearning(
x = features,
y = response,
training_frame = train_h2o,
distribution = "gaussian", # output is continuous
hidden = c(200, 200), # two hidden layers
activation = "Rectifier", # hidden layer activation f(x)
loss = "Automatic", # loss function is MSE
mini_batch_size = 32, # batch sizes
epochs = 20, # of epochs
nfolds = 5, # 5-fold CV
keep_cross_validation_predictions = TRUE, # retain CV prediction values
seed = 123 # for reproducibility
)
# check out cross validation results
h2o.performance(fit1, xval = TRUE)
## H2ORegressionMetrics: deeplearning
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.02270388
## RMSE: 0.1506781
## MAE: 0.09666981
## RMSLE: 0.01185965
## Mean Residual Deviance : 0.02270388
```
## Model tuning {#dnn_tuning}
Now that we have an understanding of producing and running a basic DNN model, the next task is to find an optimal model by tuning different parameters. There are many ways to tune a DNN. Typically, the tuning process follows these general steps; however, there is often a lot of iteration among these:
1. Adjust model capacity (layers & nodes)
2. Add dropout
3. Add weight regularization
4. Adjust learning rate
### Adjust model capacity
Typically, we start with a high-capacity model (several layers and nodes) and slowly reduce the layers and nodes. The goal is to find the simplest model that still performs well. Here, I fit a three-layer model with 500, 250, and 125 nodes, respectively. I also add a few `stopping_` arguments, which stop the modeling automatically once the RMSE metric on the validation data stops improving by at least 1% (`stopping_tolerance = 0.01`) for 2 consecutive scoring rounds (`stopping_rounds = 2`). This allows us to increase the epochs to ensure convergence, but the modeling will automatically stop when necessary to help minimize overfitting and computational burden.
```{r train2, eval=FALSE}
fit2 <- h2o.deeplearning(
x = features,
y = response,
training_frame = train_h2o,
distribution = "gaussian",
hidden = c(500, 250, 125), # deeper network
activation = "Rectifier",
loss = "Automatic",
mini_batch_size = 32,
epochs = 100, # increased epochs
nfolds = 5,
keep_cross_validation_predictions = TRUE,
seed = 123,
stopping_metric = "RMSE", # stopping mechanism
stopping_rounds = 2, # number of rounds
stopping_tolerance = 0.01 # looking for 1% improvement
)
# assess RMSE
h2o.rmse(fit2, train = TRUE, xval = TRUE)
## train xval
## 0.03018675 0.14277913
```
### Add dropout
*Dropout* is one of the most effective and commonly used approaches to prevent overfitting in neural networks. Dropout randomly drops (sets to zero) a number of node outputs in a layer during training. By randomly removing different inputs and nodes, we help prevent the model from fitting to happenstance patterns (noise) that are not significant. We can apply dropout with the `input_dropout_ratio` and `hidden_dropout_ratios` arguments.
A typical dropout rate for the input layer is 10-20% and for the hidden layers is 20-50%. In this example I drop out 10% of the input features and 20% of the nodes in each hidden layer, and we see a slight improvement in the model's performance and a reduction in its overfitting. Note that when using dropout, you must specify one of the dropout activation functions (e.g., `RectifierWithDropout`, `TanhWithDropout`).
```{r train3, eval=FALSE}
fit3 <- h2o.deeplearning(
x = features,
y = response,
training_frame = train_h2o,
distribution = "gaussian",
hidden = c(500, 250, 125),
activation = "RectifierWithDropout", # need to specify new activation f(x)
loss = "Automatic",
mini_batch_size = 32,
epochs = 100,
nfolds = 5,
keep_cross_validation_predictions = TRUE,
seed = 123,
stopping_metric = "RMSE",
stopping_rounds = 2,
stopping_tolerance = 0.01,
input_dropout_ratio = 0.1, # 10% dropout of input variables
hidden_dropout_ratios = c(.2, .2, .2) # 20% dropout of hidden nodes
)
# compare results
h2o.rmse(fit3, train = TRUE, xval = TRUE)
## train xval
## 0.06509407 0.14237935
```
### Add weight regularization
In the regularized regression chapter ($\S$ \@ref(regularized-regression)), we discussed the idea of $L_2$ (ridge) and $L_1$ (lasso) regularization. The same idea can be applied in DNNs, where we put constraints on the size that the weights can take. In DNNs, the most common regularization is the $L_2$ *norm*, which is called *weight decay* in the context of neural networks. Regularizing the weights forces small signals (noise) to have weights nearly equal to zero and only allows features with consistently strong signals to have relatively larger weights.
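In equation form, the $L_2$ penalty simply adds a term proportional to the sum of squared weights to the loss being minimized,
$$
\text{loss}_{reg} = \text{loss} + \lambda \sum_{j} w_j^2,
$$
where $\lambda$ is the regularization multiplier we supply.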
We apply regularization with the `l1` or `l2` argument, or both for a combination (elastic net). In this example I add the $L_2$ *norm* regularizer with a multiplier of 0.001, which adds a penalty of 0.001 times the sum of the squared weights to the loss function. We do not see improvement with weight decay.
```{r train4, eval=FALSE}
fit4 <- h2o.deeplearning(
x = features,
y = response,
training_frame = train_h2o,
distribution = "gaussian",
hidden = c(500, 250, 125),
activation = "RectifierWithDropout",
loss = "Automatic",
mini_batch_size = 32,
epochs = 100,
nfolds = 5,
keep_cross_validation_predictions = TRUE,
seed = 123,
stopping_metric = "RMSE",
stopping_rounds = 2,
stopping_tolerance = 0.01,
input_dropout_ratio = 0.1,
hidden_dropout_ratios = c(.2, .2, .2),
l2 = 0.001 # add weight decay
)
# compare results
h2o.rmse(fit4, train = TRUE, xval = TRUE)
## train xval
## 0.04941897 0.14334937
```
### Adjust learning rate
Another issue to be concerned with is whether or not we are finding a global minimum versus a local minimum with our loss value. The mini-batch SGD optimizer we use will take incremental steps down our loss gradient until it no longer experiences improvement. The size of the incremental steps (aka *learning rate*) will determine if we get stuck in a local minimum instead of making our way to the global minimum.
```{r local-vs-global-fig, echo=FALSE, fig.align='center', fig.cap="A local minimum and a global minimum."}
knitr::include_graphics("illustrations/minimums.jpg")
```
Adadelta (h2o's stochastic gradient descent optimizer) uses an adaptive learning rate to try to prevent the optimization from getting stuck in a local optimum. This means the size of the steps it takes down the loss gradient automatically adjusts based on descent speed (momentum) and gradient flatness (plateaus). There are only two tuning parameters for Adadelta: `rho` and `epsilon`, which balance the global and local search efficiencies. `rho` is the similarity to prior weight updates (similar to momentum), and `epsilon` is a parameter that helps prevent the optimization from getting stuck in local optima. Defaults are `rho = 0.99` and `epsilon = 1e-8`.
When discussing DNNs you will hear people discuss learning rates, so it's important to understand the concept; however, the vast majority of the time you will not need to adjust these parameters. You do have the option to adjust them, but we suggest sticking with the defaults or consulting with internal expertise prior to tuning them.
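If you did decide to override the defaults, the arguments are passed directly to `h2o.deeplearning()`. The sketch below is illustrative only (`fit_adadelta` is a hypothetical object name, and the values shown are just the documented defaults, not a recommendation):
```{r dnn-adadelta-sketch, eval=FALSE}
fit_adadelta <- h2o.deeplearning(
  x = features,
  y = response,
  training_frame = train_h2o,
  distribution = "gaussian",
  hidden = c(500, 250, 125),
  activation = "Rectifier",
  adaptive_rate = TRUE, # use the adaptive (Adadelta) learning rate -- the default
  rho = 0.99,           # similarity to prior weight updates (momentum-like)
  epsilon = 1e-8        # helps avoid getting stuck in local optima
)
```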
### Automate the tuning process
Since there are many parameters that can impact model accuracy, hyper-parameter tuning is especially important for DNNs. The simplest hyper-parameter search method is a brute-force scan of the full Cartesian product of all the combinations specified in a grid. This means every combination of hyper-parameters in the `hyper_params` list will be modeled and compared. When tuning only a handful of parameters this approach can be appropriate; however, it quickly becomes computationally burdensome as the grid grows. We use `h2o.grid()` to perform the grid search.
```{r full_cartesian_search, eval=FALSE}
# create hyper-parameter tuning grid
hyper_params <- list(
hidden = list(c(500, 250, 150), c(250, 125)),
activation = c("Rectifier", "RectifierWithDropout"),
input_dropout_ratio = c(0.05, 0.1),
hidden_dropout_ratios = list(c(.2, .2, .2), c(.3, .3, .3))
)
# Rather than comparing models by using cross-validation (which is “better” but
# takes longer), we will simply partition our training set into two pieces – one
# for training and one for validation.
splits <- h2o.splitFrame(train_h2o, ratios = 0.8, seed = 1)
# train the grid
dl_grid <- h2o.grid(
algorithm = "deeplearning",
x = features,
y = response,
distribution = "gaussian",
grid_id = "dl_grid_full",
training_frame = splits[[1]],
validation_frame = splits[[2]],
keep_cross_validation_predictions = TRUE,
loss = "Automatic",
mini_batch_size = 32,
epochs = 100,
seed = 123,
stopping_metric = "RMSE",
stopping_rounds = 2,
stopping_tolerance = 0.01,
hyper_params = hyper_params # search space
)
# collect the results and sort by our model performance metric of choice
dl_gridperf <- h2o.getGrid(
grid_id = "dl_grid_full",
sort_by = "mse",
decreasing = FALSE # lowest MSE first
)
# you can look at the rank-order of the models
# print(dl_gridperf)
# Grab the model_id for the top model, chosen by validation error
best_dl_model_id <- dl_gridperf@model_ids[[1]]
best_dl <- h2o.getModel(best_dl_model_id)
# assess the performance
h2o.rmse(best_dl, train = TRUE, valid = TRUE)
## train valid
## 0.08778202 0.15877771
```
Often, hyper-parameter search for more than 4 parameters can be done more efficiently with *random* parameter search than with a full grid search. Basically, chances are good you'll find one of many good models in less time than performing an exhaustive grid search. We simply build up to `max_models` models with parameters drawn randomly from user-specified distributions (here, uniform). For this example, we focus on tuning the network architecture along with dropout and regularization parameters. We also let the grid search stop automatically once the performance at the top of the leaderboard doesn't change much anymore, i.e., once the search has converged.
```{r random_search, cache=TRUE, eval=FALSE}
# hyper-parameter tuning grid
hyper_params <- list(
hidden = list(c(500, 250, 150), c(250, 150, 50), c(250, 125)),
activation = c("Rectifier", "RectifierWithDropout", "Maxout", "MaxoutWithDropout"),
input_dropout_ratio = c(0, 0.05),
l1 = c(0, 0.00001, 0.0001, 0.001, 0.01, 0.1),
l2 = c(0, 0.00001, 0.0001, 0.001, 0.01, 0.1)
)
# Rather than comparing models by using cross-validation (which is “better” but
# takes longer), we will simply partition our training set into two pieces – one
# for training and one for validation.
splits <- h2o.splitFrame(train_h2o, ratios = 0.8, seed = 1)
# For a larger search space we can use random grid search
search_criteria <- list(
strategy = "RandomDiscrete",
max_runtime_secs = 360,
max_models = 100,
seed = 123,
stopping_metric = "RMSE",
stopping_rounds = 5,
stopping_tolerance = 0.01
)
# train the grid
dl_grid <- h2o.grid(
algorithm = "deeplearning",
x = features,
y = response,
distribution = "gaussian",
grid_id = "dl_grid_random",
training_frame = splits[[1]],
validation_frame = splits[[2]],
keep_cross_validation_predictions = TRUE,
loss = "Automatic",
mini_batch_size = 32,
epochs = 100,
hyper_params = hyper_params,
search_criteria = search_criteria
)
# collect the results and sort by our model performance metric of choice
dl_gridperf <- h2o.getGrid(
grid_id = "dl_grid_random",
sort_by = "mse",
decreasing = FALSE # lowest MSE first
)
# you can look at the rank-order of the models
# print(dl_gridperf)
# Grab the model_id for the top model, chosen by validation error
best_dl_model_id <- dl_gridperf@model_ids[[1]]
best_dl <- h2o.getModel(best_dl_model_id)
# assess the performance
h2o.performance(best_dl, valid = TRUE)
## H2ORegressionMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
##
## MSE: 0.04227962
## RMSE: 0.2056201
## MAE: 0.1242647
## RMSLE: 0.01643259
## Mean Residual Deviance : 0.04227962
```
Thus far, the best model (`fit3`) came from our manual tuning rather than from the automated grid searches. We'll use this model as our best model going forward. Since this model used a log-transformed response variable, we can get a better understanding of the error in whole dollar values by re-transforming our cross-validation predictions and comparing them to the actual sale prices. We see that our best performing neural net model's RMSE of \$27,161 outperforms our earlier linear regression models but does not perform nearly as well as our regularized regression, MARS, random forest, and GBM models.
```{r, echo=FALSE, eval=FALSE}
# save best h2o model to date
saveRDS(fit3, file = "data/h2o_dl_ames_model.rds")
```
```{r, echo=FALSE}
fit3 <- readRDS("data/h2o_dl_ames_model.rds")
```
```{r}
# use model 3 as our best model
best_model <- fit3
# compute cross validation error on re-transformed response variable
h2o.cross_validation_holdout_predictions(fit3) %>%
as.data.frame() %>%
mutate(
predict_tran = exp(predict),
truth = ames_train$Sale_Price
) %>%
yardstick::rmse(truth, predict_tran)
```
## Feature interpretation
As in previous chapters, we can use `vip` to identify the most influential features and `pdp` to see how those features relate to the predicted sale price.
```{r}
vip(best_model, num_features = 20)
```
```{r}
# prediction function
pfun <- function(object, newdata) {
as.data.frame(predict(object, newdata = as.h2o(newdata)))[[1L]]
}
# compute ICE curves
p1 <- best_model %>%
partial(
pred.var = "Gr_Liv_Area",
train = ames_train,
pred.fun = pfun,
grid.resolution = 50
) %>%
autoplot(rug = TRUE, train = ames_train, alpha = .1, center = TRUE) +
ggtitle("Centered ICE plot")
# prediction function
pfun <- function(object, newdata) {
mean(as.data.frame(predict(object, newdata = as.h2o(newdata)))[[1L]])
}
p2 <- best_model %>%
partial(
pred.var = "Overall_Qual",
train = as.data.frame(ames_train),
pred.fun = pfun
) %>%
autoplot() +
ggtitle("Partial dependence plot")
gridExtra::grid.arrange(p1, p2)
```
## Final thoughts
Feedforward DNNs can model very complex, non-linear relationships and are often first-in-class for high-dimensional problems such as image and speech recognition. Although they are less intuitive and more computationally demanding than many other machine learning algorithms, they are essential to have in your toolbox.
__TODO__: may need to better tie in some of these advantages and disadvantages throughout the chapter.
__Advantages:__
- The hidden layers auto-identify useful representations of the features, greatly reducing the need for manual feature engineering on complex data (e.g., images, video).
- Lots of flexibility - the network architecture (layers, nodes, activations), loss function, dropout, and weight regularization can all be tailored to the problem.
- Capable of modeling very complex, non-linear relationships and scaling to very large, high-dimensional data sets where shallow methods struggle.
__Disadvantages:__
- Require all feature inputs to be numeric (one-hot encoded) and standardized, and are sensitive to skewed response variables.
- Computationally expensive - deep, highly parameterized networks require large amounts of data and compute to train well.
- The many interacting hyperparameters (layers, nodes, dropout rates, regularization, learning rate parameters, etc.) require a substantial tuning effort.
- Less interpretable, although this is easily addressed with various tools (variable importance, partial dependence plots, local variable importance, etc.).
- On smaller, rectangular data sets, traditional shallow approaches often perform just as well, if not better, and are more efficient.
[^stochastic]: It's considered stochastic because a random subset (*batch*) of observations is drawn for each forward pass.