```{r workflow-sets-startup, include = FALSE}
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
library(workflowsets)
library(baguette)
library(rules)
library(finetune)
tidymodels_prefer()
caching <- FALSE
cores <- parallel::detectCores()
if (!grepl("mingw32", R.Version()$platform)) {
library(doMC)
registerDoMC(cores = cores)
} else {
library(doParallel)
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)
}
```
# Screening Many Models {#workflow-sets}
We introduced workflow sets in Chapter \@ref(workflows) and demonstrated how to use them with resampled data sets in Chapter \@ref(compare). In this chapter, we discuss these sets of multiple modeling workflows in more detail and describe a use case where they can be helpful.
For projects with new data sets that have not yet been well understood, a data practitioner may need to screen many combinations of models and preprocessors. It is common to have little or no _a priori_ knowledge about which method will work best with a novel data set.
:::rmdnote
A good strategy is to spend some initial effort trying a variety of modeling approaches, determine what works best, then invest additional time tweaking/optimizing a small set of models.
:::
Workflow sets provide a user interface to create and manage this process. We'll also demonstrate how to evaluate these models efficiently using the racing methods discussed in Section \@ref(racing-example).
## Modeling Concrete Mixture Strength
To demonstrate how to screen multiple model workflows, we will use the concrete mixture data from _Applied Predictive Modeling_ [@apm] as an example. Chapter 10 of that book demonstrated models to predict the compressive strength of concrete mixtures using the ingredients as predictors. A wide variety of models were evaluated, with different predictor sets and preprocessing needs. How can workflow sets make such a large-scale model screening process easier?
First, let's define the data splitting and resampling schemes.
```{r workflow-sets-data}
library(tidymodels)
tidymodels_prefer()
data(concrete, package = "modeldata")
glimpse(concrete)
```
The `compressive_strength` column is the outcome. The `age` predictor tells us the age of the concrete sample at testing in days (concrete strengthens over time), and the rest of the predictors, such as `cement` and `water`, are concrete components in units of kilograms per cubic meter.
:::rmdwarning
For some cases in this data set, the same concrete formula was tested multiple times. We'd rather not include these replicate mixtures as individual data points since they might be distributed across both the training and test set. Doing so might artificially inflate our performance estimates.
:::
To address this, we will use the mean compressive strength per concrete mixture for modeling:
```{r workflow-sets-means}
concrete <-
concrete %>%
group_by(across(-compressive_strength)) %>%
summarize(compressive_strength = mean(compressive_strength),
.groups = "drop")
nrow(concrete)
```
Let's split the data using the default 3:1 ratio of training-to-test and resample the training set using five repeats of 10-fold cross-validation:
```{r workflow-sets-splitting}
set.seed(1501)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test <- testing(concrete_split)
set.seed(1502)
concrete_folds <-
vfold_cv(concrete_train, strata = compressive_strength, repeats = 5)
```
Some models (notably neural networks, KNN, and support vector machines) require predictors that have been centered and scaled, so some model workflows will require recipes with these preprocessing steps. For other models, a traditional response surface design model expansion (i.e., quadratic and two-way interactions) is a good idea. For these purposes, we create two recipes:
```{r workflow-sets-recipes}
normalized_rec <-
recipe(compressive_strength ~ ., data = concrete_train) %>%
step_normalize(all_predictors())
poly_recipe <-
normalized_rec %>%
step_poly(all_predictors()) %>%
step_interact(~ all_predictors():all_predictors())
```
For the models, we use the `r pkg(parsnip)` addin to create a set of model specifications:
```{r workflow-sets-models}
library(rules)
library(baguette)
linear_reg_spec <-
linear_reg(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet")
nnet_spec <-
mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) %>%
set_engine("nnet", MaxNWts = 2600) %>%
set_mode("regression")
mars_spec <-
mars(prod_degree = tune()) %>% #<- use GCV to choose terms
set_engine("earth") %>%
set_mode("regression")
svm_r_spec <-
svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
set_engine("kernlab") %>%
set_mode("regression")
svm_p_spec <-
svm_poly(cost = tune(), degree = tune()) %>%
set_engine("kernlab") %>%
set_mode("regression")
knn_spec <-
nearest_neighbor(neighbors = tune(), dist_power = tune(), weight_func = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
cart_spec <-
decision_tree(cost_complexity = tune(), min_n = tune()) %>%
set_engine("rpart") %>%
set_mode("regression")
bag_cart_spec <-
bag_tree() %>%
set_engine("rpart", times = 50L) %>%
set_mode("regression")
rf_spec <-
rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
set_engine("ranger") %>%
set_mode("regression")
xgb_spec <-
boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(),
min_n = tune(), sample_size = tune(), trees = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
cubist_spec <-
cubist_rules(committees = tune(), neighbors = tune()) %>%
set_engine("Cubist")
```
The analysis in @apm specifies that the neural network should have up to 27 units in the hidden layer. The `extract_parameter_set_dials()` function extracts the parameter set, which we modify to have the correct parameter range:
```{r workflow-sets-param}
nnet_param <-
nnet_spec %>%
extract_parameter_set_dials() %>%
update(hidden_units = hidden_units(c(1, 27)))
```
How can we match these models to their recipes, tune them, then evaluate their performance efficiently? A workflow set offers a solution.
## Creating the Workflow Set
Workflow sets take named lists of preprocessors and model specifications and combine them into an object containing multiple workflows. There are three possible kinds of preprocessors:
* A standard R formula (a brief, hypothetical sketch using a formula follows this list)
* A recipe object (prior to estimation/prepping)
* A `r pkg(dplyr)`-style selector to choose the outcome and predictors
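For instance, when no preprocessing is needed, a plain formula can stand in for a recipe. The following is a minimal, hypothetical sketch (the pairing with the CART specification is only for illustration and is not part of the analysis below):
```{r workflow-sets-formula-sketch, eval = FALSE}
# A formula can serve as the preprocessor; no recipe is required
workflow_set(
  preproc = list(formula = compressive_strength ~ .),
  models = list(CART = cart_spec)
)
```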
As a first workflow set example, let's combine the recipe that only standardizes the predictors with the nonlinear models that require the predictors to be in the same units:
```{r workflow-sets-normalized}
normalized <-
workflow_set(
preproc = list(normalized = normalized_rec),
models = list(SVM_radial = svm_r_spec, SVM_poly = svm_p_spec,
KNN = knn_spec, neural_network = nnet_spec)
)
normalized
```
Since there is only a single preprocessor, this function creates a set of workflows with this value. If the preprocessor list contained more than one entry, the function would create all combinations of preprocessors and models.
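For example, a hypothetical set built from two preprocessors and two models would contain all four combinations (the default behavior, `cross = TRUE`). A sketch, not used in the analysis that follows:
```{r workflow-sets-cross-sketch, eval = FALSE}
# Two preprocessors crossed with two models yield four workflows
workflow_set(
  preproc = list(normalized = normalized_rec, full_quad = poly_recipe),
  models = list(KNN = knn_spec, SVM_radial = svm_r_spec)
)
```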
The `wflow_id` column is automatically created but can be modified using a call to `mutate()`. The `info` column contains a tibble with some identifiers and the workflow object. The workflow can be extracted:
```{r workflow-sets-get-workflow}
normalized %>% extract_workflow(id = "normalized_KNN")
```
The `option` column is a placeholder for any arguments to use when we evaluate the workflow. For example, to add the neural network parameter object:
```{r workflow-sets-info-update}
normalized <-
normalized %>%
option_add(param_info = nnet_param, id = "normalized_neural_network")
normalized
```
When a function from the `r pkg(tune)` or `r pkg(finetune)` package is used to tune (or resample) the workflow, this argument will be used.
The `result` column is a placeholder for the output of the tuning or resampling functions.
For the other nonlinear models, let's create another workflow set that uses `r pkg(dplyr)` selectors for the outcome and predictors:
```{r workflow-sets-selectors}
model_vars <-
workflow_variables(outcomes = compressive_strength,
predictors = everything())
no_pre_proc <-
workflow_set(
preproc = list(simple = model_vars),
models = list(MARS = mars_spec, CART = cart_spec, CART_bagged = bag_cart_spec,
RF = rf_spec, boosting = xgb_spec, Cubist = cubist_spec)
)
no_pre_proc
```
Finally, we assemble the set that uses nonlinear terms and interactions with the appropriate models:
```{r workflow-sets-quad}
with_features <-
workflow_set(
preproc = list(full_quad = poly_recipe),
models = list(linear_reg = linear_reg_spec, KNN = knn_spec)
)
```
These objects are tibbles with the extra class of `workflow_set`. Row binding does not affect the state of the sets and the result is itself a workflow set:
```{r workflow-sets-complete}
all_workflows <-
bind_rows(no_pre_proc, normalized, with_features) %>%
# Make the workflow ID's a little more simple:
mutate(wflow_id = gsub("(simple_)|(normalized_)", "", wflow_id))
all_workflows
```
## Tuning and Evaluating the Models
Almost all of the members of `all_workflows` contain tuning parameters. To evaluate their performance, we can use the standard tuning or resampling functions (e.g., `tune_grid()` and so on). The `workflow_map()` function will apply the same function to all of the workflows in the set; the default is `tune_grid()`.
For this example, grid search is applied to each workflow using up to 25 different parameter candidates. There is a set of common options to use with each execution of `tune_grid()`. For example, in the following code we will use the same resampling and control objects for each workflow, along with a grid size of 25. The `workflow_map()` function has an additional argument called `seed`, which is used to ensure that each execution of `tune_grid()` consumes the same random numbers.
```{r workflow-sets-grid, eval = FALSE}
grid_ctrl <-
control_grid(
save_pred = TRUE,
parallel_over = "everything",
save_workflow = TRUE
)
grid_results <-
all_workflows %>%
workflow_map(
seed = 1503,
resamples = concrete_folds,
grid = 25,
control = grid_ctrl
)
```
```{r workflow-sets-grid-comparison, include = FALSE, cache = caching}
grid_ctrl <-
control_grid(
save_pred = TRUE,
parallel_over = "everything",
save_workflow = TRUE
)
full_results_time <-
system.time(
grid_results <-
all_workflows %>%
workflow_map(seed = 1503, resamples = concrete_folds, grid = 25,
control = grid_ctrl, verbose = TRUE)
)
num_grid_models <- nrow(collect_metrics(grid_results, summarize = FALSE)) / 2
```
The results show that the `option` and `result` columns have been updated:
```{r workflow-sets-grid-print}
grid_results
```
The `option` column now contains all of the options that we used in the `workflow_map()` call. This makes our results reproducible. In the `result` columns, the "`tune[+]`" and "`rsmp[+]`" notations mean that the object had no issues. A value such as "`tune[x]`" occurs if all of the models failed for some reason.
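To inspect the underlying tuning object for any single workflow, it can be extracted from the set by its ID. As a small sketch:
```{r workflow-sets-extract-sketch, eval = FALSE}
# Pull the tune_grid() results for one workflow and summarize its metrics
grid_results %>% 
  extract_workflow_set_result("boosting") %>% 
  collect_metrics()
```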
There are a few convenience functions for examining results such as `grid_results`. The `rank_results()` function will order the models by some performance metric. By default, it uses the first metric in the metric set (RMSE in this instance). Let's `filter()` to look only at RMSE:
```{r workflow-sets-rank}
grid_results %>%
rank_results() %>%
filter(.metric == "rmse") %>%
select(model, .config, rmse = mean, rank)
```
Also by default, the function ranks all of the candidate sets; that's why the same model can show up multiple times in the output. An option, called `select_best`, can be used to rank the models using their best tuning parameter combination.
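As a brief sketch of that option:
```{r workflow-sets-rank-best-sketch, eval = FALSE}
# One row per workflow, using each workflow's best submodel
grid_results %>% 
  rank_results(select_best = TRUE) %>% 
  filter(.metric == "rmse") %>% 
  select(model, .config, rmse = mean, rank)
```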
The `autoplot()` method plots the rankings; it also has a `select_best` argument. The plot in Figure \@ref(fig:workflow-set-ranks) visualizes the best results for each model and is generated with:
```{r workflow-sets-plot-rank, eval = FALSE}
autoplot(
grid_results,
rank_metric = "rmse", # <- how to order models
metric = "rmse", # <- which metric to visualize
select_best = TRUE # <- one point per workflow
) +
geom_text(aes(y = mean - 1/2, label = wflow_id), angle = 90, hjust = 1) +
lims(y = c(3.5, 9.5)) +
theme(legend.position = "none")
```
```{r workflow-set-ranks, ref.label = "workflow-sets-plot-rank"}
#| echo = FALSE,
#| out.width = '100%',
#| fig.width=8,
#| fig.height=5.75,
#| fig.cap = "Estimated RMSE (and approximate confidence intervals) for the best model configuration in each workflow.",
#| fig.alt = "Estimated RMSE (and approximate confidence intervals) for the best model configuration in each workflow. The y axis is the estimated RMSE and the x axis is the model rank based on RMSE. Cubist rules and boosted trees show the smallest RMSE values. "
```
To see the tuning parameter results for a specific model, as in Figure \@ref(fig:workflow-sets-autoplot), the `id` argument can take a single value from the `wflow_id` column to specify which workflow to plot:
```{r workflow-sets-plot-model, eval = FALSE}
autoplot(grid_results, id = "Cubist", metric = "rmse")
```
```{r workflow-sets-autoplot, ref.label = "workflow-sets-plot-model"}
#| echo = FALSE,
#| out.width = '100%',
#| fig.width = 8,
#| fig.height = 4.5,
#| fig.cap = "The `autoplot()` results for the Cubist model contained in the workflow set.",
#| fig.alt = "The `autoplot()` results for the Cubist model contained in the workflow set. The visalization has a panel for each tuning pameter and shows performance versus the parameter values."
```
There are also methods for `collect_predictions()` and `collect_metrics()`.
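For example, a quick way to gather the summarized RMSE estimates for every workflow in the set (a sketch):
```{r workflow-sets-collect-sketch, eval = FALSE}
# Resampled performance estimates across all workflows
grid_results %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse")
```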
The example model screening with our concrete mixture data fits a total of `r format(num_grid_models, big.mark = ",")` models. Using `r cores` workers in parallel, the estimation process took `r round(full_results_time[3]/60/60, 1)` hours to complete.
## Efficiently Screening Models {#racing-example}
One effective method for screening a large set of models efficiently is to use the racing approach described in Section \@ref(racing). With a workflow set, we can use the `workflow_map()` function for this racing approach. Recall that after we pipe in our workflow set, the argument we use is the function to apply to the workflows; in this case, we can use a value of `"tune_race_anova"`. We also pass an appropriate control object; otherwise the options would be the same as the code in the previous section.
```{r workflow-sets-race, eval = FALSE}
library(finetune)
race_ctrl <-
control_race(
save_pred = TRUE,
parallel_over = "everything",
save_workflow = TRUE
)
race_results <-
all_workflows %>%
workflow_map(
"tune_race_anova",
seed = 1503,
resamples = concrete_folds,
grid = 25,
control = race_ctrl
)
```
```{r workflow-sets-race-comp, include = FALSE, cache = caching}
race_ctrl <-
control_race(
save_pred = TRUE,
parallel_over = "everything",
save_workflow = TRUE
)
race_results_time <-
system.time(
race_results <-
all_workflows %>%
workflow_map("tune_race_anova",
seed = 1503, resamples = concrete_folds, grid = 25,
control = race_ctrl)
)
num_race_models <- sum(collect_metrics(race_results)$n) / 2
```
The new object looks very similar, although the elements of the `result` column show a value of `"race[+]"`, indicating a different type of object:
```{r workflow-sets-race-print}
race_results
```
The same helpful functions are available for this object to interrogate the results and, in fact, the basic `autoplot()` method shown in Figure \@ref(fig:workflow-set-racing-ranks)[^nnetnote] produces trends similar to Figure \@ref(fig:workflow-set-ranks). This is produced by:
[^nnetnote]: As of February 2022, we see slightly different performance metrics for the neural network when trained using macOS on ARM architecture (Apple M1 chip) compared to Intel architecture.
```{r workflow-sets-plot-race-rank, eval = FALSE}
autoplot(
race_results,
rank_metric = "rmse",
metric = "rmse",
select_best = TRUE
) +
geom_text(aes(y = mean - 1/2, label = wflow_id), angle = 90, hjust = 1) +
lims(y = c(3.0, 9.5)) +
theme(legend.position = "none")
```
```{r workflow-set-racing-ranks, ref.label = "workflow-sets-plot-race-rank"}
#| echo = FALSE,
#| out.width = '100%',
#| fig.width = 8,
#| fig.height = 5.75,
#| fig.cap = "Estimated RMSE (and approximate confidence intervals) for the best model configuration in each workflow in the racing results.",
#| fig.alt = "Estimated RMSE (and approximate confidence intervals) for the best model configuration in each workflow in the racing results. The y axis is the estimated RMSE and the x axis is the model rank based on RMSE. Cubist rules and boosted trees show the smallest RMSE values. "
```
Overall, the racing approach estimated a total of `r format(num_race_models, big.mark = ",")` models, only `r round(num_race_models/num_grid_models*100, 2)`% of the `r format(num_grid_models, big.mark = ",")` models evaluated in the full grid. As a result, the racing approach was about `r round(full_results_time[3]/race_results_time[3], 1)`-fold faster.
Did we get similar results? For both objects, we rank the results, merge them, and plot them against one another in Figure \@ref(fig:racing-concordance).
```{r workflow-sets-racing-concordance, eval = FALSE}
matched_results <-
rank_results(race_results, select_best = TRUE) %>%
select(wflow_id, .metric, race = mean, config_race = .config) %>%
inner_join(
rank_results(grid_results, select_best = TRUE) %>%
select(wflow_id, .metric, complete = mean,
config_complete = .config, model),
by = c("wflow_id", ".metric"),
) %>%
filter(.metric == "rmse")
library(ggrepel)
matched_results %>%
ggplot(aes(x = complete, y = race)) +
geom_abline(lty = 3) +
geom_point() +
geom_text_repel(aes(label = model)) +
coord_obs_pred() +
labs(x = "Complete Grid RMSE", y = "Racing RMSE")
```
```{r racing-concordance, ref.label = "workflow-sets-racing-concordance"}
#| echo = FALSE,
#| out.width = '100%',
#| fig.cap = "Estimated RMSE for the full grid and racing results.",
#| fig.alt = "Estimated RMSE for the full grid and racing results. The results show that many models have the same RMSE result and the others are very similar."
```
While the racing approach selected the same candidate parameters as the complete grid for only `r round(mean(matched_results$config_race == matched_results$config_complete) * 100, 2)`% of the models, the performance metrics of the models selected by racing were nearly equal. The correlation of RMSE values was `r signif(cor(matched_results$race, matched_results$complete), 3)` and the rank correlation was `r signif(cor(matched_results$race, matched_results$complete, method = "spearman"), 3)`. This indicates that, within a model, there were multiple tuning parameter combinations that had nearly identical results.
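The correlations quoted above can be computed directly from `matched_results`; as a sketch:
```{r workflow-sets-corr-sketch, eval = FALSE}
matched_results %>% 
  summarize(
    rmse_correlation = cor(race, complete),
    rank_correlation = cor(race, complete, method = "spearman")
  )
```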
## Finalizing a Model
Similar to what we have shown in previous chapters, the process of choosing the final model and fitting it on the training set is straightforward. The first step is to pick a workflow to finalize. Since the boosted tree model worked well, we'll extract that from the set, update the parameters with the numerically best settings, and fit to the training set:
```{r workflow-sets-finalize}
best_results <-
race_results %>%
extract_workflow_set_result("boosting") %>%
select_best(metric = "rmse")
best_results
boosting_test_results <-
race_results %>%
extract_workflow("boosting") %>%
finalize_workflow(best_results) %>%
last_fit(split = concrete_split)
```
We can see the test set metric results and visualize the predictions in Figure \@ref(fig:concrete-test-results).
```{r workflow-sets-collect-test-metrics}
collect_metrics(boosting_test_results)
```
```{r workflow-sets-test-results, eval=FALSE}
boosting_test_results %>%
collect_predictions() %>%
ggplot(aes(x = compressive_strength, y = .pred)) +
geom_abline(color = "gray50", lty = 2) +
geom_point(alpha = 0.5) +
coord_obs_pred() +
labs(x = "observed", y = "predicted")
```
```{r concrete-test-results, ref.label = "workflow-sets-test-results"}
#| echo = FALSE,
#| out.width = '100%',
#| fig.cap = "Observed versus predicted values for the test set.",
#| fig.alt = "Observed versus predicted values for the test set. The values fall closely along the 45 degree line of identity."
```
We see here how well the observed and predicted compressive strength values for these concrete mixtures align.
## Chapter Summary {#workflow-sets-summary}
Often a data practitioner needs to consider a large number of possible modeling approaches for a task at hand, especially for new data sets and/or when there is little knowledge about what modeling strategy will work best. This chapter illustrated how to use workflow sets to investigate multiple models or feature engineering strategies in such a situation. Racing methods can rank models more efficiently than fitting and evaluating every candidate model being considered.
```{r workflow-sets-save, include = FALSE}
save(concrete_test, concrete_split, grid_results, race_results,
boosting_test_results,
file = "RData/concrete_results.RData", version = 2, compress = "xz")
```