When building a model, its often a good idea to compare and contrast different options for preprocessing and model creation. Workflowsets contains tools to build sets of workflows for easy comparison.


Building a Workflow Set

In order to build a workflow set, begin by creating a list of preprocessors and a list of models. The preprocessors can be either formulas or recipes.

This example uses the cfin dataset introduced in the Tune Tutorial.

# creating three different recipes -- one with no transformations, one where bathymetry is log scaled, and one where bathymetry is log scaled and all variables are normalized. 
basic_rec <- recipe(abundance ~ ., data = cfin_train) |>
  update_role(lat, lon, year, new_role = "ID") |>
  step_corr(threshold = .9) |>
  # boosted regression trees require dummy variables
  step_dummy(all_nominal_predictors())

log_bathy <- basic_rec |>
  step_log(Bathy_depth, base = 10)

normalize <- log_bathy |>
  step_normalize(all_numeric_predictors())

# the names of each item in the list will be later used to id workflows 
preproc_list <- list(basic = basic_rec, 
                     log = log_bathy, 
                     norm = normalize)

# creating two different models -- random forest and boosted regression tree
rf <- rand_forest(mode = "regression", 
                  engine = "ranger", 
                  trees = 100)

# note required package install: xgboost
brt <- boost_tree(mode = "regression", 
                  engine = "xgboost",
                  trees = 15)

model_list <- list(rf = rf, brt = brt)

To build the actual set of workflows, use the workflow_set() method. Each workflow will combine a preprocessor and recipe from the input lists.

cfin_wkfs <- workflow_set(preproc_list, model_list, 
                          cross = TRUE) # create all combinations of models?

cfin_wkfs
## # A workflow set/tibble: 6 × 4
##   wflow_id  info             option    result    
##   <chr>     <list>           <list>    <list>    
## 1 basic_rf  <tibble [1 × 4]> <opts[0]> <list [0]>
## 2 basic_brt <tibble [1 × 4]> <opts[0]> <list [0]>
## 3 log_rf    <tibble [1 × 4]> <opts[0]> <list [0]>
## 4 log_brt   <tibble [1 × 4]> <opts[0]> <list [0]>
## 5 norm_rf   <tibble [1 × 4]> <opts[0]> <list [0]>
## 6 norm_brt  <tibble [1 × 4]> <opts[0]> <list [0]>


Operating Across Workflows

After creating the workflow set, apply operations across the set using the workflow_map() method. Results are collected into the results column of the workflow set.

# applying fit_resamples across the workflow set to assess performance
# using cfin_control and cfin_folds defined in Tune Tutorial
fitted_wkfs <- cfin_wkfs |>
  workflow_map(fn = "fit_resamples", # method to apply in quotes
               verbose = FALSE, 
               seed = 400, 
               resamples = cfin_folds, # arguments for fn 
               metrics =  metric_set(rmse, rsq, mae),
               control = cfin_control)

fitted_wkfs
## # A workflow set/tibble: 6 × 4
##   wflow_id  info             option    result   
##   <chr>     <list>           <list>    <list>   
## 1 basic_rf  <tibble [1 × 4]> <opts[3]> <rsmp[+]>
## 2 basic_brt <tibble [1 × 4]> <opts[3]> <rsmp[+]>
## 3 log_rf    <tibble [1 × 4]> <opts[3]> <rsmp[+]>
## 4 log_brt   <tibble [1 × 4]> <opts[3]> <rsmp[+]>
## 5 norm_rf   <tibble [1 × 4]> <opts[3]> <rsmp[+]>
## 6 norm_brt  <tibble [1 × 4]> <opts[3]> <rsmp[+]>


Analysing Results

In order to cleanly assemble results into a tibble, use collect_metrics(). To determine the best workflow, use rank_results() to sort workflow results by a desired metric.

# collecting metrics
fitted_wkfs |>
  collect_metrics() |>
  head()
## # A tibble: 6 × 9
##   wflow_id  .config         preproc model .metric .estimator  mean     n std_err
##   <chr>     <chr>           <chr>   <chr> <chr>   <chr>      <dbl> <int>   <dbl>
## 1 basic_rf  Preprocessor1_… recipe  rand… mae     standard   0.867     5 0.00520
## 2 basic_rf  Preprocessor1_… recipe  rand… rmse    standard   1.20      5 0.0104 
## 3 basic_rf  Preprocessor1_… recipe  rand… rsq     standard   0.584     5 0.00435
## 4 basic_brt Preprocessor1_… recipe  boos… mae     standard   0.889     5 0.00633
## 5 basic_brt Preprocessor1_… recipe  boos… rmse    standard   1.22      5 0.0105 
## 6 basic_brt Preprocessor1_… recipe  boos… rsq     standard   0.562     5 0.00490
# determining the workflow with the best rmse
fitted_wkfs |>
  rank_results(rank_metric = "rmse", select_best = FALSE)
## # A tibble: 18 × 9
##    wflow_id  .config        .metric  mean std_err     n preprocessor model  rank
##    <chr>     <chr>          <chr>   <dbl>   <dbl> <int> <chr>        <chr> <int>
##  1 basic_rf  Preprocessor1… mae     0.867 0.00520     5 recipe       rand…     1
##  2 basic_rf  Preprocessor1… rmse    1.20  0.0104      5 recipe       rand…     1
##  3 basic_rf  Preprocessor1… rsq     0.584 0.00435     5 recipe       rand…     1
##  4 log_rf    Preprocessor1… mae     0.867 0.00520     5 recipe       rand…     2
##  5 log_rf    Preprocessor1… rmse    1.20  0.0104      5 recipe       rand…     2
##  6 log_rf    Preprocessor1… rsq     0.584 0.00437     5 recipe       rand…     2
##  7 norm_rf   Preprocessor1… mae     0.867 0.00519     5 recipe       rand…     3
##  8 norm_rf   Preprocessor1… rmse    1.20  0.0104      5 recipe       rand…     3
##  9 norm_rf   Preprocessor1… rsq     0.584 0.00435     5 recipe       rand…     3
## 10 basic_brt Preprocessor1… mae     0.889 0.00633     5 recipe       boos…     4
## 11 basic_brt Preprocessor1… rmse    1.22  0.0105      5 recipe       boos…     4
## 12 basic_brt Preprocessor1… rsq     0.562 0.00490     5 recipe       boos…     4
## 13 log_brt   Preprocessor1… mae     0.889 0.00633     5 recipe       boos…     5
## 14 log_brt   Preprocessor1… rmse    1.22  0.0105      5 recipe       boos…     5
## 15 log_brt   Preprocessor1… rsq     0.562 0.00490     5 recipe       boos…     5
## 16 norm_brt  Preprocessor1… mae     0.889 0.00633     5 recipe       boos…     6
## 17 norm_brt  Preprocessor1… rmse    1.22  0.0105      5 recipe       boos…     6
## 18 norm_brt  Preprocessor1… rsq     0.562 0.00489     5 recipe       boos…     6

Workflowsets also contains an autoplot() method to easily visualize results. autoplot() is capable of plotting both rank_results() as well as the results for individual workflows.

fitted_wkfs |>
  autoplot(rank_metric = "rmse", # metric to determine workflow rank, 
           id = "workflow_set") # workflow to plot -- use workflow_set to plot all

Note: Workflowsets has additional capabilities for editing, extracting, and examining individual workflows that are not covered here. See the “Further Resources” section for more information.


Further Resources