Workflows

Creating a Workflow
- Adding a Preprocessor
- Adding a Model
Fitting and Predicting
Examining Workflows
Further Resources

When building a model, preprocessing steps are often specific to that model. The workflows package bundles preprocessing and parsnip model objects together, so that we can prep the data and fit the model with a single call to fit().

Creating a Workflow

Create a workflow object with the workflow() function. You can supply a preprocessor and a model in the initial function call, or add them afterward with the add_*() functions (below).

wkf <- workflow(preprocessor = NULL, # pass in preprocessor here 
                spec = NULL) # pass in model here
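
As a minimal sketch of supplying both pieces in the initial call: the formula, the mtcars variables, and the linear_reg() specification below are illustrative assumptions, not objects defined elsewhere in this tutorial.

```r
library(workflows)
library(parsnip)

# Illustrative model specification (an assumption for this sketch,
# not the rf object used later in the tutorial)
lm_spec <- linear_reg() |>
  set_engine("lm")

# A formula passed as `preprocessor` is handled as if by add_formula();
# the model spec passed as `spec` is handled as if by add_model()
wkf_direct <- workflow(preprocessor = mpg ~ wt + hp, spec = lm_spec)

wkf_direct
```

Passing NULL for either argument, as in the call above, leaves that slot empty so it can be filled later.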

Adding a Preprocessor

There are two options for a workflow preprocessor:

A Formula or Role Specifications: for simple models that need no data transformations, the only preprocessing needed is to specify the outcome and predictor variables. add_formula() allows you to pass in a formula, or you can specify outcomes and predictors directly with [add_variables()](https://workflows.tidymodels.org/reference/add_variables.html).

wkf1 <- wkf |>
  add_formula(Species ~ .)

# specifies the same outcome and predictors as wkf1
# note the use of everything(): variables already captured by outcomes are excluded from predictors
wkf |>
  add_variables(outcomes = Species, predictors = everything())

## ══ Workflow ═════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Variables
## Model: None
## 
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Outcomes: Species
## Predictors: everything()

A Recipe: for more complex preprocessing, pass in a recipe object using add_recipe(). This is the more common option. For this example, we will use the iris_recipe defined in the Recipes Tutorial.
- The recipe should be unprepped: the workflow preps it for you when fit() is called, and add_recipe() does not accept a recipe that has already been trained with prep().

iris_wkf <- wkf |>
  add_recipe(iris_recipe)

iris_wkf

## ══ Workflow ═════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: None
## 
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_corr()
## • step_normalize()
## • step_rename()

Note that each add_x() function has accompanying remove_x() and update_x() functions to allow for workflow modification.
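
A short sketch of these companion functions, using the formula preprocessor as the example. The mtcars formulas are assumptions chosen only for illustration.

```r
library(workflows)

# Start with a simple formula preprocessor
wkf_demo <- workflow() |>
  add_formula(mpg ~ wt)

# update_formula() replaces the existing formula with a new one
wkf_demo <- wkf_demo |>
  update_formula(mpg ~ wt + hp)

# remove_formula() drops the formula, leaving the preprocessor slot empty
wkf_empty <- wkf_demo |>
  remove_formula()
```

The recipe, variables, and model actions have the same pattern: update_recipe() and remove_recipe(), update_variables() and remove_variables(), update_model() and remove_model().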

Adding a Model

All workflows must include a parsnip model object. Add an unfitted model object using add_model(). We will use the rf random forest specification defined in the Parsnip Tutorial.

iris_wkf <- iris_wkf |>
  add_model(rf)

Note: If you are using a model with a specialized formula, add it using the formula argument of add_model(). You must use add_model() for this – special formulas cannot be specified with the add_formula() function.

# GAMs have special syntax for formulas because of their smoothing functions
gam <- gen_additive_mod() |>
  set_mode("regression") |>
  set_engine("mgcv")
  
gam_wkf <- workflow() |>
  # notice the lack of special syntax for the preprocessor formula
  add_formula(Sepal.Length ~ Species + Sepal.Width) |>
  add_model(gam,
            # now using GAM-specific syntax: s() requests a smooth term
            formula = Sepal.Length ~ Species + s(Sepal.Width))

Fitting and Predicting

After building a workflow, use fit(), predict(), and augment() just as you would with a parsnip model in order to train the workflow and generate predictions. Input datasets should be raw, not baked by a recipe.

- augment(<workflow>) will bind predictions to the unbaked input data. If a recipe contains steps that alter the number of rows, augment() will error because the input and output datasets won’t have the same length.
- fit(<workflow>) does not take a formula, as the workflow object already contains the preprocessor (a formula, variables, or a recipe).

We will fit and predict the workflow using the data_split object defined in the RSample Tutorial.

# fitting the model to the training data
iris_wkf <- iris_wkf |>
  # Since the preprocessor is within the workflow, fit() only needs raw data
  fit(training(data_split)) 

# using augment to generate predictions
# notice that augment now binds predictions to the unbaked input data!
iris_preds <- augment(iris_wkf, testing(data_split))
head(iris_preds)

## # A tibble: 6 × 10
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   row .pred_class .pred_setosa .pred_versicolor .pred_virginica
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int> <fct>              <dbl>            <dbl>           <dbl>
## 1          5.1         3.5          1.4         0.2 setosa      1 setosa             1                0                   0
## 2          4.6         3.1          1.5         0.2 setosa      4 setosa             1                0                   0
## 3          4.6         3.4          1.4         0.3 setosa      7 setosa             1                0                   0
## 4          4.8         3            1.4         0.1 setosa     13 setosa             1                0                   0
## 5          5.7         4.4          1.5         0.4 setosa     16 setosa             0.975            0.025               0
## 6          5.4         3.9          1.3         0.4 setosa     17 setosa             1                0                   0
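
The example above uses augment(); predict() follows the same pattern but returns only the prediction columns. Here is a self-contained sketch, assuming an illustrative linear regression workflow on mtcars rather than the iris objects from the other tutorials.

```r
library(workflows)
library(parsnip)

# Illustrative workflow (an assumption for this sketch): formula + linear model,
# fitted on the first 24 rows of mtcars
mt_wkf <- workflow() |>
  add_formula(mpg ~ wt + hp) |>
  add_model(linear_reg() |> set_engine("lm")) |>
  fit(data = mtcars[1:24, ])

# predict() returns a tibble containing only the prediction column(s)
mt_preds <- predict(mt_wkf, new_data = mtcars[25:32, ])

# augment() returns the new data with the predictions bound to it
mt_aug <- augment(mt_wkf, new_data = mtcars[25:32, ])
```

In both calls the new data is passed raw; the workflow applies its own preprocessor before predicting.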

Examining Workflows

View information about a workflow object by printing it.

iris_wkf

## ══ Workflow [trained] ═══════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_corr()
## • step_normalize()
## • step_rename()
## 
## ── Model ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, ntree = ~200) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.89%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         39          0         0  0.00000000
## versicolor      0         38         0  0.00000000
## virginica       0          1        34  0.02857143

To retrieve specific information from a workflow, use an extractor function. Extractors are not just for workflow objects: there are also extractor methods for parsnip models, tune objects, and workflow sets.

A complete list of extractors for workflows can be found in the workflows package reference documentation.

# extracting the (unprepped) preprocessor
extract_preprocessor(iris_wkf)

## Recipe
## 
## Inputs:
## 
##       role #variables
##         ID          1
##    outcome          1
##  predictor          4
## 
## Operations:
## 
## Correlation filter on all_numeric_predictors()
## Centering and scaling for all_numeric_predictors()
## Variable renaming for Row

# extracting the fitted parsnip model
extract_fit_parsnip(iris_wkf)

## parsnip model object
## 
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, ntree = ~200) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.89%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         39          0         0  0.00000000
## versicolor      0         38         0  0.00000000
## virginica       0          1        34  0.02857143
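
A few other extractors work the same way. The sketch below is self-contained and assumes an illustrative linear regression workflow on mtcars instead of iris_wkf.

```r
library(workflows)
library(parsnip)

# Illustrative fitted workflow (an assumption for this sketch)
mt_fit <- workflow() |>
  add_formula(mpg ~ wt + hp) |>
  add_model(linear_reg() |> set_engine("lm")) |>
  fit(data = mtcars)

# extracting the unfitted parsnip model specification
extract_spec_parsnip(mt_fit)

# extracting the underlying engine fit (here, an lm object)
lm_obj <- extract_fit_engine(mt_fit)
coef(lm_obj)

# extracting the mold: the preprocessed predictors and outcomes used in fit()
mold <- extract_mold(mt_fit)
names(mold$predictors)
```

extract_fit_engine() is useful when you need methods that exist only for the engine's own object, such as coef() or summary() for lm.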

Further Resources

https://www.tmwr.org/workflows.html
- Another example of creating and using a workflow, which also touches on the rationale behind workflows.