When building a model for your dataset, there is a wide variety of models and model packages to choose from, each with its own unique parameters and specifications. parsnip provides a unified, simple interface for model creation so that you can easily swap between models without worrying about minor syntax changes.
A parsnip model is built by defining an initial model object, setting its parameters/engine, and then fitting the model to data. Each type of supported model has its own initializing function. For example, suppose we wanted to create a linear regression model. To initialize this model, call the linear_reg() function.
lr <- linear_reg()
Alternatively, to initialize a random forest model use the rand_forest()
function.
rf <- rand_forest()
Other supported models include logistic regression, boosted trees, k-nearest neighbors, and more. For a full list of supported models, see the official Function Reference.
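As a quick illustration, a few of these other models are initialized in exactly the same way (the object names below are arbitrary):
# other model types have analogous initializer functions
logit <- logistic_reg()     # logistic regression
bt    <- boost_tree()       # boosted trees
knn   <- nearest_neighbor() # k-nearest neighbors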
After initializing a model, set its specifications/parameters.
set_mode() determines whether the model is a classification or regression model. Not every model has both options – for example, linear_reg() only supports regression.
set_args() sets parameters specific to the model type. The names of these arguments are standardized so that similar parameters across models have the same name – for example, penalty is the name of the regularization argument, if a model has one.
# setting mode and parameters for both of our example models
# note that you don't need to set every model parameter if you're happy with the defaults
lr <- lr |>
  set_mode("regression") |>
  set_args(penalty = NULL)

rf <- rf |>
  set_mode("classification") |>
  set_args(trees = 200)
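To check which standardized arguments a given model type accepts, one option is to inspect its constructor with base R's args(); a minimal sketch:
# the main standardized arguments for random forests are mtry, trees, and min_n
args(rand_forest)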
set_engine() determines the model engine, or the package that will handle model creation. For many models there are multiple packages that can run that model, each with its own unique syntax and interface. parsnip unifies the interface but doesn’t replace the modelling package, meaning that the underlying modelling engine must be specified. To switch engines later on, simply change the argument to set_engine().
show_engines() retrieves a list of all supported engines for a model. Some engines require a parsnip extension package and may not be shown. To see how parsnip arguments transfer to a specific engine, use the translate() method.
# what engines are available for random forest?
show_engines("rand_forest")
## # A tibble: 6 × 2
## engine mode
## <chr> <chr>
## 1 ranger classification
## 2 ranger regression
## 3 randomForest classification
## 4 randomForest regression
## 5 spark classification
## 6 spark regression
# setting the engine to ranger
# note the engine-specific argument used in this call
rf <- rf |>
  set_engine("ranger", verbose = TRUE)
# to change the engine, call set_engine() again
rf <- rf |>
  set_engine("randomForest")
# using translate() to view the model and how parsnip arguments transfer to the randomForest package
translate(rf)
## Random Forest Model Specification (classification)
##
## Main Arguments:
## trees = 200
##
## Computational engine: randomForest
##
## Model fit template:
## randomForest::randomForest(x = missing_arg(), y = missing_arg(),
## ntree = 200)
Note: Mode, engine, and model arguments can be specified within the initial model call. However, setting them with separate functions enhances readability and creates a more flexible interface.
# defining parameters within initial call
rand_forest(mode = "classification",
            engine = "randomForest",
            trees = 500)
## Random Forest Model Specification (classification)
##
## Main Arguments:
## trees = 500
##
## Computational engine: randomForest
Once a model has been initialized and defined, fit it to our prepped training data using the fit() function. fit() takes a model object, a formula, and a dataset, and returns a fitted model.
We will use the iris_recipe_prepped recipe along with the prepped_training and prepped_testing datasets defined in the Recipes Tutorial.
# extracting model formula from prepped recipe object
rf_formula <- formula(iris_recipe_prepped)
# fitting our random forest model to our training data
rf_fit <- fit(rf, rf_formula, data = prepped_training)
# taking a look at our fitted model
rf_fit
## parsnip model object
##
##
## Call:
## randomForest(x = maybe_data_frame(x), y = y, ntree = ~200)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 39 0 0 0
## versicolor 0 36 0 0
## virginica 0 0 37 0
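If you need the underlying engine object itself – for example, to pass it to functions from the randomForest package – parsnip provides extractor helpers. A minimal sketch, assuming the rf_fit object from above:
# extract_fit_engine() returns the raw engine fit wrapped inside the parsnip object
rf_engine_fit <- extract_fit_engine(rf_fit)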
Note: If a model uses a specialized formula that specifies model structure as well as terms (e.g., GAMs), pass that formula directly to fit() – see the sketch after these notes.
Note: If the recipe used to preprocess data modified any roles from your original formula, be sure to extract your model formula from the recipe object to account for those changes.
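As an illustration of the first note above, here is a minimal sketch of fitting a GAM whose smooth terms are part of the model structure; it assumes the mgcv engine is installed and uses the built-in mtcars data rather than our iris objects:
# initializing a GAM specification with the mgcv engine
gam_spec <- gen_additive_mod() |>
  set_mode("regression") |>
  set_engine("mgcv")

# the smooth term s(disp) defines model structure, so the specialized
# formula is passed directly to fit()
gam_fit <- fit(gam_spec, mpg ~ s(disp), data = mtcars)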
In order to generate predictions from new data, pass the fitted model and a dataset to either the predict() or augment() methods. predict() will return a tibble of predictions, and augment() will bind prediction columns to the input data.
predict() has a type argument used to specify the prediction type. Options for type are “numeric”, “class”, “prob”, “conf_int”, “pred_int”, “quantile”, “time”, “hazard”, “survival”, or “raw”. type defaults to either “numeric” or “class” depending on the model mode. Not all type options will work for every model.
# generating predictions
predict(rf_fit, prepped_testing) |>
  head()
## # A tibble: 6 × 1
## .pred_class
## <fct>
## 1 setosa
## 2 setosa
## 3 setosa
## 4 setosa
## 5 setosa
## 6 setosa
# note the use of type = "prob" to retrieve probabilities for each class
predict(rf_fit, prepped_testing, type = "prob") |>
  head()
## # A tibble: 6 × 3
## .pred_setosa .pred_versicolor .pred_virginica
## <dbl> <dbl> <dbl>
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 0.945 0.055 0
## 5 0.99 0.01 0
## 6 1 0 0
# augment binds predictions to original data
# augment does not have a type argument
iris_preds <- augment(rf_fit, prepped_testing)
head(iris_preds)
## # A tibble: 6 × 9
## Sepal.Length Sepal.Width Petal.Width Row Species .pred_class .pred_setosa .pred_versicolor .pred_virginica
## <dbl> <dbl> <dbl> <int> <fct> <fct> <dbl> <dbl> <dbl>
## 1 -0.945 0.789 -1.27 8 setosa setosa 1 0 0
## 2 -1.06 0.0986 -1.40 10 setosa setosa 1 0 0
## 3 -1.17 0.789 -1.27 12 setosa setosa 1 0 0
## 4 -0.147 1.71 -1.14 19 setosa setosa 0.945 0.055 0
## 5 -0.489 0.789 -1.27 21 setosa setosa 0.99 0.01 0
## 6 -0.831 1.48 -1.02 22 setosa setosa 1 0 0
Methods to fit resampling sets are covered in the Tune Tutorial.
For more information about parsnip, especially regarding how it compares to other modelling packages, see the official parsnip documentation.