add chapter on validation and internal tuning (#829)
sebffischer authored Nov 7, 2024
1 parent 74e8fce commit 36df925
Showing 14 changed files with 764 additions and 142 deletions.
1 change: 1 addition & 0 deletions book/_quarto.yml
@@ -44,6 +44,7 @@ book:
- chapters/chapter12/model_interpretation.qmd
- chapters/chapter13/beyond_regression_and_classification.qmd
- chapters/chapter14/algorithmic_fairness.qmd
- chapters/chapter15/predsets_valid_inttune.qmd
- chapters/references.qmd
appendices:
- chapters/appendices/solutions.qmd # online only
35 changes: 31 additions & 4 deletions book/chapters/appendices/errata.qmd
@@ -11,19 +11,27 @@

This appendix lists changes made to the online version of this book, relative to the chapters included in the first edition.

## 1. Introduction and Overview

## Data and Basic Modeling
* Add


## 2. Data and Basic Modeling

* Replaced reference to `Param` with `Domain`.

## Hyperparameter Optimization
## 3. Evaluation and Benchmarking

* Use the `$encapsulate()` method instead of the `$encapsulate` and `$fallback` fields.

## 4. Hyperparameter Optimization

* Renamed `TuningInstanceSingleCrit` to `TuningInstanceBatchSingleCrit`.
* Renamed `TuningInstanceMultiCrit` to `TuningInstanceBatchMultiCrit`.
* Renamed `Tuner` to `TunerBatch`.
* Replaced reference to `Param` with `Domain`.

## Advanced Tuning Methods and Black Box Optimization
## 5. Advanced Tuning Methods and Black Box Optimization

* Renamed `TuningInstanceSingleCrit` to `TuningInstanceBatchSingleCrit`.
* Renamed `TuningInstanceMultiCrit` to `TuningInstanceBatchMultiCrit`.
@@ -33,10 +41,29 @@ This appendix lists changes to the online version of this book to chapters inclu
* Renamed `Optimizer` to `OptimizerBatch`.
* Replaced `OptimInstanceSingleCrit$new()` with `oi()`.
* Add `oi()` to the table about important functions.
* Use the `$encapsulate()` method instead of the `$encapsulate` and `$fallback` fields.

## Feature Selection
## 6. Feature Selection

* Renamed `FSelectInstanceSingleCrit` to `FSelectInstanceBatchSingleCrit`.
* Renamed `FSelectInstanceMultiCrit` to `FSelectInstanceBatchMultiCrit`.
* Renamed `FeatureSelector` to `FeatureSelectorBatch`.
* Add `fsi()` to the table about important functions.

## 8. Non-sequential Pipelines and Tuning

* Use the `$encapsulate()` method instead of the `$encapsulate` and `$fallback` fields.

## 10. Advanced Technical Aspects of mlr3

* Use the `$encapsulate()` method instead of the `$encapsulate` and `$fallback` fields.

## 11. Large-Scale Benchmarking

* Use the `$encapsulate()` method instead of the `$encapsulate` and `$fallback` fields.

## 12. Model Interpretation

* Subset task to row 127 instead of 35 for the local surrogate model.
* Add `as.data.frame()` to "Correctly Interpreting Shapley Values" section.

199 changes: 197 additions & 2 deletions book/chapters/appendices/solutions.qmd
@@ -1711,9 +1711,9 @@ First, we create the learner that we want to tune, mark the relevant parameter f

```{r}
lrn_debug = lrn("classif.debug",
  error_train = to_tune(0, 1),
  fallback = lrn("classif.rpart")
  error_train = to_tune(0, 1)
)
lrn_debug$encapsulate("evaluate", fallback = lrn("classif.rpart"))
lrn_debug
```

@@ -2171,4 +2171,199 @@ prediction$score(msr_3, adult_subset)
We can see that among women there is an even bigger discrepancy than among men.

* The bias mitigation strategies we employed do not optimize for the *false omission rate* metric, but for other metrics instead. It might therefore be better to try to achieve fairness via other strategies, using different or more powerful models, or by tuning hyperparameters.

## Solutions to @sec-predsets-valid-inttune

1. Manually `$train()` a LightGBM classifier from `r ref_pkg("mlr3extralearners")` on the pima task using $1/3$ of the training data for validation.
As the pima task has missing values, select a method from `r ref_pkg("mlr3pipelines")` to impute them.
Explicitly set the evaluation metric to logloss (`"binary_logloss"`), the maximum number of boosting iterations to 1000, the patience parameter to 10, and the step size to 0.01.
After training the learner, inspect the final validation scores as well as the early stopped number of iterations.

We start by loading the packages and creating the task.

```{r}
library(mlr3)
library(mlr3extralearners)
library(mlr3pipelines)
tsk_pima = tsk("pima")
tsk_pima
```

Below, we see that the task has five features with missing values.

```{r}
tsk_pima$missings()
```

Next, we create the LightGBM classifier, but don't specify the validation data yet.
We handle the missing values using a simple median imputation.

```{r}
lrn_lgbm = lrn("classif.lightgbm",
  num_iterations = 1000,
  early_stopping_rounds = 10,
  learning_rate = 0.01,
  eval = "binary_logloss"
)
glrn = as_learner(po("imputemedian") %>>% lrn_lgbm)
glrn$id = "lgbm"
```

After constructing the graphlearner, we now configure the validation data using `r ref("set_validate()")`.
The call below sets the `$validate` field of the LightGBM pipeop to `"predefined"` and of the graphlearner to `0.3`.
Recall that only the graphlearner itself can specify *how* the validation data is generated.
The individual pipeops can either use it (`"predefined"`) or not (`NULL`).

```{r}
set_validate(glrn, validate = 0.3, ids = "classif.lightgbm")
glrn$validate
glrn$graph$pipeops$classif.lightgbm$validate
```

Finally, we train the learner and inspect the validation scores and internally tuned parameters.

```{r}
glrn$train(tsk_pima)
glrn$internal_tuned_values
glrn$internal_valid_scores
```

2. Wrap the learner from exercise 1) in an `AutoTuner` using a three-fold CV for the tuning.
Also change the rule for aggregating the different boosting iterations from averaging to taking the maximum across the folds.
Don't tune any parameters other than `nrounds`, which can be done using `tnr("internal")`.
Use the internal validation metric as the tuning measure.
Compare this learner with a `lrn("classif.rpart")` using a 10-fold outer cross-validation with respect to classification accuracy.

We start by setting the number of boosting iterations to an internal tune token, with 1000 as the upper bound for the boosting iterations and the maximum as the aggregation function.
Note that the input to the aggregation function is a list of integer values (the early stopped values for the different resampling iterations), so we need to `unlist()` it first before taking the maximum.

```{r}
library(mlr3tuning)
glrn$param_set$set_values(
  classif.lightgbm.num_iterations = to_tune(
    upper = 1000, internal = TRUE, aggr = function(x) max(unlist(x))
  )
)
```

Now, we change the validation data from `0.3` to `"test"`, where we can omit the `ids` specification as LightGBM is the base learner.

```{r}
set_validate(glrn, validate = "test")
```

Next, we create the autotuner using the configuration given in the instructions.
As the internal validation measures are calculated by `lightgbm` and not `mlr3`, we need to specify whether the metric should be minimized.

```{r}
at_lgbm = auto_tuner(
  learner = glrn,
  tuner = tnr("internal"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("internal_valid_score",
    select = "classif.lightgbm.binary_logloss", minimize = TRUE)
)
at_lgbm$id = "at_lgbm"
```

Finally, we set up the benchmark design, run it, and evaluate the learners in terms of their classification accuracy.

```{r}
design = benchmark_grid(
  task = tsk_pima,
  learners = list(at_lgbm, lrn("classif.rpart")),
  resamplings = rsmp("cv", folds = 10)
)
bmr = benchmark(design)
bmr$aggregate(msr("classif.acc"))
```

3. Consider the code below:

```{r}
branch_lrn = as_learner(
ppl("branch", list(
lrn("classif.ranger"),
lrn("classif.xgboost",
early_stopping_rounds = 10,
eval_metric = "error",
eta = to_tune(0.001, 0.1, logscale = TRUE),
nrounds = to_tune(upper = 1000, internal = TRUE)))))
set_validate(branch_lrn, validate = "test", ids = "classif.xgboost")
branch_lrn$param_set$set_values(branch.selection = to_tune())
at = auto_tuner(
tuner = tnr("grid_search"),
learner = branch_lrn,
resampling = rsmp("holdout", ratio = 0.8),
# cannot use internal validation score because ranger does not have one
measure = msr("classif.ce"),
term_evals = 10L,
store_models = TRUE
)
tsk_sonar = tsk("sonar")$filter(1:100)
rr = resample(
  tsk_sonar, at, rsmp("holdout", ratio = 0.8), store_models = TRUE
)
```

Answer the following questions (ideally without running the code):

3.1 During the hyperparameter optimization, how many observations are used to train the XGBoost algorithm (excluding validation data) and how many for the random forest?
Hint: learners that cannot make use of validation data ignore it.

The outer resampling already removes 20 observations from the data (the outer test set), leaving only 80 data points (the outer train set) for the inner resampling.
Then 16 (0.2 * 80; the test set of the inner holdout resampling) observations are used to evaluate the hyperparameter configurations.
This leaves 64 (80 - 16) observations for training.
For XGBoost, the 16 observations that make up the inner test set are also used for validation, so no more observations from the 64 training points are removed.
Because the random forest does not support validation, the 16 observations from the inner test set are only used for evaluating the hyperparameter configuration, but not simultaneously for internal validation.
Therefore, both the random forest and XGBoost models use 64 observations for training.
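
As a quick sanity check, this arithmetic can be spelled out in a few lines (an illustrative sketch, not part of the original solution):

```{r}
# Sanity check of the observation counts (illustrative only)
n_task        = 100                  # tsk("sonar")$filter(1:100)
n_outer_train = 0.8 * n_task         # 80: outer holdout keeps 80% for training
n_inner_test  = 0.2 * n_outer_train  # 16: inner holdout test set (also the
                                     #     validation data for XGBoost)
n_inner_train = n_outer_train - n_inner_test  # 64: used to train both learners
c(n_outer_train, n_inner_test, n_inner_train)
```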

3.2 How many observations would be used to train the final model if XGBoost was selected? What if the random forest was chosen?

In both cases, all 80 observations (the train set from the outer resampling) would be used.
This is because during the final model fit no validation data is generated.

3.3 How would the answers to the last two questions change if we had set the `$validate` field of the graphlearner to `0.25` instead of `"test"`?

In this case, the validation data is no longer identical to the inner resampling test set.
Instead, it is split from the 64 observations that make up the inner training set.
Because this happens before the task enters the graphlearner, both the XGBoost model *and* the random forest only have access to 48 ((1 - 0.25) * 64) observations, and the remaining 16 are used to create the validation data.
Note that the random forest will again ignore the validation data as it does not have the 'validation' property and therefore cannot use it.
Also, the autotuner would now use a different data set for tuning the step size (the inner test set) than for early stopping the boosting iterations (the validation split), although both coincidentally have size 16.
Therefore, the answer to question 3.1 would be 48 instead of 64.
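
The corresponding sketch for this scenario (again, illustrative only):

```{r}
# Updated counts if the graphlearner's $validate field were 0.25
n_inner_train = 64                       # inner training set from above
n_valid       = 0.25 * n_inner_train     # 16: split off before the task enters the graph
n_train       = n_inner_train - n_valid  # 48: available to XGBoost and the random forest
c(n_valid, n_train)
```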

However, this does not change the answer to 3.2, as, again, no validation is performed during the final model fit.

Note that we would normally recommend setting the validation data to `"test"` when tuning, so this should be thought of as an illustrative example.


4. Look at the (failing) code below:

```{r, error = TRUE}
tsk_sonar = tsk("sonar")
glrn = as_learner(
po("pca") %>>% lrn("classif.xgboost", validate = 0.3)
)
```

Can you explain *why* the code fails?
Hint: Should the data that xgboost uses for validation be preprocessed according to the *train* or *predict* logic?

If we set the `$validate` field of the XGBoost classifier to `0.3`, the validation data would be generated from the output task of `PipeOpPCA`.
However, this task has been exclusively preprocessed using the train logic, because `PipeOpPCA` does not 'know' that the XGBoost classifier wants to do validation.
Because validation performance is intended to measure how well a model would perform during prediction, the validation data should be preprocessed according to the predict logic.
For this reason, splitting off 30% of the output of `PipeOpPCA` to use as validation data in the XGBoost classifier would be invalid.
Therefore, it is not possible to set the `$validate` field of `PipeOp`s to values other than `"predefined"` or `NULL`.
Only the `GraphLearner` itself can dictate *how* the validation data is created *before* it enters the `Graph`, so the validation data is then preprocessed according to the predict logic.
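
As an aside (a sketch of ours, not part of the original solution), a working alternative is to configure the split on the graphlearner with `set_validate()` and let the XGBoost pipeop use the predefined validation data, analogous to the LightGBM example above; `nrounds = 100` is only set here to make the example self-contained:

```{r}
# Sketch: the GraphLearner creates the 30% split *before* the task enters the
# Graph, so the validation data is preprocessed with the predict logic
glrn = as_learner(po("pca") %>>% lrn("classif.xgboost", nrounds = 100))
set_validate(glrn, validate = 0.3, ids = "classif.xgboost")
glrn$validate                                # 0.3
glrn$graph$pipeops$classif.xgboost$validate  # "predefined"
```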

:::
@@ -104,14 +104,14 @@ lrn_ranger = as_learner(
po("learner", lrn("regr.ranger"))
)
lrn_ranger$id = "ranger"
lrn_ranger$fallback = lrn("regr.featureless")
lrn_ranger$encapsulate("evaluate", fallback = lrn("regr.featureless"))
lrn_rpart = as_learner(
ppl("robustify", learner = lrn("regr.rpart")) %>>%
po("learner", lrn("regr.rpart"))
)
lrn_rpart$id = "rpart"
lrn_rpart$fallback = lrn("regr.featureless")
lrn_rpart$encapsulate("evaluate", fallback = lrn("regr.featureless"))
learners = list(lrn_ranger, lrn_rpart)
```
2 changes: 2 additions & 0 deletions book/chapters/chapter1/introduction_and_overview.qmd
@@ -27,6 +27,8 @@ Before we can show you the full power of `mlr3`, we recommend installing the `r
install.packages("mlr3verse")
```

Chapters that were added after the release of the printed version of this book are marked with a '+'.

## Installation Guidelines {#installguide}

There are many packages in the `mlr3` ecosystem that you may want to use as you work through this book.
18 changes: 7 additions & 11 deletions book/chapters/chapter10/advanced_technical_aspects_of_mlr3.qmd
@@ -530,21 +530,18 @@ This means that models can be used for fitting and predicting and any conditions
However, the result of the experiment will be a missing model and/or predictions, depending on where the error occurs.
In @sec-fallback, we will discuss fallback learners to replace missing models and/or predictions.

Each `r ref("Learner")` contains the field `r index("$encapsulate", parent = "Learner", aside = TRUE, code = TRUE)` to control how the train or predict steps are wrapped.
Each `r ref("Learner")` has the method `r index("$encapsulate()", parent = "Learner", aside = TRUE, code = TRUE)` to control how the train or predict steps are wrapped.
The first way to encapsulate the execution is provided by the package `r ref_pkg("evaluate")`, which evaluates R expressions and captures and tracks conditions (outputs, messages, warnings or errors) without letting them stop the process (see documentation of `r ref("mlr3misc::encapsulate()")` for full details):

```{r technical-017}
# trigger warning and error in training
lrn_debug = lrn("classif.debug", warning_train = 1, error_train = 1)
# enable encapsulation for train() and predict()
lrn_debug$encapsulate = c(train = "evaluate", predict = "evaluate")
lrn_debug$encapsulate("evaluate", fallback = lrn("classif.featureless"))
lrn_debug$train(tsk_penguins)
```

Note how we passed `"evaluate"` to `train` and `predict` to enable encapsulation in both training and predicting.
However, we could have only set encapsulation for one of these stages by instead passing `c(train = "evaluate", predict = "none")` or `c(train = "none", predict = "evaluate")`.

Note that encapsulation captures all output written to the standard output (stdout) and standard error (stderr) streams and stores them in the learner's log.
However, in some computational setups, the calling process needs to operate on the log output, such as the `r ref_pkg("batchtools")` package in @sec-large-benchmarking.
In this case, use the encapsulation method `"try"` instead, which catches signaled conditions but does not suppress the output.
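
For example, a minimal sketch (ours, not part of this diff) that switches the debug learner above to `"try"` encapsulation:

```{r}
# Sketch: "try" still catches errors (and uses the fallback), but leaves
# stdout/stderr visible so that e.g. batchtools can capture the log itself
lrn_debug$encapsulate("try", fallback = lrn("classif.featureless"))
lrn_debug$train(tsk_penguins)$errors
```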
@@ -563,7 +560,7 @@ This guards the calling session against segmentation faults which otherwise woul
On the downside, starting new processes comes with comparatively more computational overhead.

```{r technical-019}
lrn_debug$encapsulate = c(train = "callr", predict = "callr")
lrn_debug$encapsulate("callr", fallback = lrn("classif.featureless"))
# set segfault_train and remove warning_train and error_train
lrn_debug$param_set$values = list(segfault_train = 1)
lrn_debug$train(task = tsk_penguins)$errors
@@ -613,13 +610,12 @@ Say an error has occurred when training a model in one or more iterations during
We strongly recommend the final option, which is statistically sound and can be easily used in any practical experiment.
`mlr3` includes two baseline learners: `lrn("classif.featureless")`, which, in its default configuration, always predicts the majority class, and `lrn("regr.featureless")`, which predicts the average response by default.

To make this procedure convenient during resampling and benchmarking, we support fitting a baseline (though in theory you could use any `Learner`) as a `r index('fallback learner')` by passing a `r ref("Learner")` to `r index('$fallback', parent = "Learner", aside = TRUE, code = TRUE)`.
To make this procedure convenient during resampling and benchmarking, we support fitting a baseline (though in theory you could use any `Learner`) as a `r index('fallback learner')` by passing a `r ref("Learner")` to `r index('$encapsulate()', parent = "Learner", aside = TRUE, code = TRUE)`.
In the next example, we add a classification baseline to our debug learner, so that when the debug learner errors, `mlr3` falls back to the predictions of the featureless learner internally.
Note that while encapsulation is not enabled explicitly, it is automatically enabled and set to `"evaluate"` if a fallback learner is added.

```{r technical-022}
lrn_debug = lrn("classif.debug", error_train = 1)
lrn_debug$fallback = lrn("classif.featureless")
lrn_debug$encapsulate("evaluate", fallback = lrn("classif.featureless"))
lrn_debug$train(tsk_penguins)
lrn_debug
@@ -639,7 +635,7 @@ We re-parametrize the debug learner to fail in roughly 50% of the resampling ite

```{r technical-024}
lrn_debug = lrn("classif.debug", error_train = 0.5)
lrn_debug$fallback = lrn("classif.featureless")
lrn_debug$encapsulate("evaluate", fallback = lrn("classif.featureless"))
aggr = benchmark(benchmark_grid(
  tsk_penguins,
@@ -970,7 +966,7 @@ For an overview of available DBMS in R, see the CRAN task view on databases at `
| - | `r ref("future::plan()")` | - |
| - | `r ref("set_threads()")` | - |
| - | `r ref("future::tweak()")` | - |
| `Learner` | `lrn()` | `$encapsulate`; `$fallback`; `$timeout`; `$parallel_predict`; `$log` |
| `Learner` | `lrn()` | `$encapsulate()`; `$timeout`; `$parallel_predict`; `$log` |
| `r ref("lgr::Logger")` | `r ref("lgr::get_logger")` | `$set_threshold()` |
| `r ref("mlr3db::DataBackendDplyr")` | `r ref("mlr3::as_data_backend")` | - |
| `r ref("mlr3db::DataBackendDuckDB")` | `r ref("as_duckdb_backend")` | - |
6 changes: 2 additions & 4 deletions book/chapters/chapter11/large-scale_benchmarking.qmd
@@ -49,15 +49,13 @@ lrn_baseline = lrn("classif.featureless", id = "featureless")
lrn_lr = lrn("classif.log_reg")
lrn_lr = as_learner(ppl("robustify", learner = lrn_lr) %>>% lrn_lr)
lrn_lr$id = "logreg"
lrn_lr$fallback = lrn_baseline
lrn_lr$encapsulate = c(train = "try", predict = "try")
lrn_lr$encapsulate("try", fallback = lrn_baseline)
# random forest pipeline
lrn_rf = lrn("classif.ranger")
lrn_rf = as_learner(ppl("robustify", learner = lrn_rf) %>>% lrn_rf)
lrn_rf$id = "ranger"
lrn_rf$fallback = lrn_baseline
lrn_rf$encapsulate = c(train = "try", predict = "try")
lrn_rf$encapsulate("try", fallback = lrn_baseline)
learners = list(lrn_lr, lrn_rf, lrn_baseline)
```
