Datasets usually must be preprocessed before modeling, either to meet a model's requirements or to improve its performance. The Recipes package bundles preprocessing steps into objects so that transformations defined on one dataset can easily be applied to others.


Recipe Initiation

Recipe objects are built by creating a “blank” recipe object and then adding on data transformations via functions.

To initiate a recipe object, pass a formula and template dataset to the recipe() function. For this example, we start with the data_split testing/training split defined in the RSample Tutorial.

# Note that I could also use data_iris, head(data_iris), training(data_split), etc. as the template;
#   what matters is that the names and types of the variables stay the same. For larger datasets,
#   using head() is preferable. Keep in mind that prep() estimates each step's parameters from the
#   template by default, so if the template is only a subset, pass the full training set to
#   prep() through its training argument. This matters for any step that needs a comprehensive
#   look at the dataset, such as step_corr().
template_data <- training(data_split)

iris_recipe <- recipe(Species ~ ., 
                      data = template_data)

Note: If you are building a recipe for a model with a specialized formula (e.g., GAMs, which include smoothing functions), do not use that formula here. The recipe formula should only identify outcomes and predictors, not any special operations/transformations (these will be incorporated within Parsnip).
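To make the point about templates concrete, here is a minimal sketch. It assumes the built-in iris data as a stand-in for data_iris, and the names small_template, rec, and rec_prepped are illustrative only. The recipe is defined on a six-row template, and the full data is supplied later through the training argument of prep().

```r
library(recipes)

# Assumption: the built-in iris data stands in for the tutorial's data_iris.
# Six rows are enough to record the variable names and types.
small_template <- head(iris)

rec <- recipe(Species ~ ., data = small_template) |>
  step_normalize(all_numeric_predictors())

# Supply the full data at prep() time so the means and standard deviations
# are estimated from every row rather than from the six template rows.
rec_prepped <- prep(rec, training = iris)

# With new_data = NULL, bake() returns the processed training data.
baked <- bake(rec_prepped, new_data = NULL)
nrow(baked)
mean(baked$Sepal.Length)
```

If the statistics had been estimated from the six template rows alone, the normalized column would generally not be centered at zero over the full dataset.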

Recipe objects can also be defined by supplying lists of the contributing variables and their roles (more on this below!). This is preferable for datasets with a large number of variables.

iris_recipe2 <- recipe(template_data, 
                       vars = c("Sepal.Length", "Sepal.Width", 
                                "Petal.Length", "Petal.Width", "Species", "row"),
                       roles = c("predictor", "predictor", 
                                 "predictor", "predictor", "outcome", "ID"))
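As a quick check that the two ways of defining a recipe are equivalent, the sketch below builds the same recipe both ways and compares the assigned roles. It assumes the built-in iris data (which has no row column) in place of template_data, and the object names are illustrative.

```r
library(recipes)

# Assumption: built-in iris stands in for the tutorial's template_data.
by_formula <- recipe(Species ~ ., data = iris)

by_roles <- recipe(iris,
                   vars  = names(iris),
                   roles = c(rep("predictor", 4), "outcome"))

# Both summaries should list the same variable/role pairs.
roles_formula <- summary(by_formula)[, c("variable", "role")]
roles_manual  <- summary(by_roles)[, c("variable", "role")]

roles_formula
roles_manual
```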


Roles

When creating a recipe, each variable is assigned a role, either explicitly by the user or inferred from a formula. A variable can have any role, including outcome, predictor, ID, case weight, stratification variable, etc. The role of a variable determines its treatment by a recipe.

Most models only need outcome and predictor, which can be assigned automatically via a formula. View the roles assigned to variables by a recipe with summary().

iris_recipe |> summary()
## # A tibble: 6 × 4
##   variable     type    role      source  
##   <chr>        <chr>   <chr>     <chr>   
## 1 Sepal.Length numeric predictor original
## 2 Sepal.Width  numeric predictor original
## 3 Petal.Length numeric predictor original
## 4 Petal.Width  numeric predictor original
## 5 row          numeric predictor original
## 6 Species      nominal outcome   original
# building a recipe that only processes one variable - Sepal.Length - and gives it a "predictor" role
recipe(~ Sepal.Length, data = template_data) |> summary()
## # A tibble: 1 × 4
##   variable     type    role      source  
##   <chr>        <chr>   <chr>     <chr>   
## 1 Sepal.Length numeric predictor original
# building a recipe with two "outcome" variables
recipe(Species + Sepal.Length ~ ., data = template_data) |> summary()
## # A tibble: 6 × 4
##   variable     type    role      source  
##   <chr>        <chr>   <chr>     <chr>   
## 1 Sepal.Width  numeric predictor original
## 2 Petal.Length numeric predictor original
## 3 Petal.Width  numeric predictor original
## 4 row          numeric predictor original
## 5 Species      nominal outcome   original
## 6 Sepal.Length numeric outcome   original

After initializing a recipe, update the roles of variables with the update_role() function. In the case of data_iris, we want to make row an ID variable so that it won’t be transformed as a predictor but will still stay attached to the data for later processing.

# updating 'row' to an ID variable
iris_recipe <- iris_recipe |>
  # update_role can update multiple variables at a time
  update_role(row, new_role = "ID") 

# viewing the updated role
summary(iris_recipe)
## # A tibble: 6 × 4
##   variable     type    role      source  
##   <chr>        <chr>   <chr>     <chr>   
## 1 Sepal.Length numeric predictor original
## 2 Sepal.Width  numeric predictor original
## 3 Petal.Length numeric predictor original
## 4 Petal.Width  numeric predictor original
## 5 row          numeric ID        original
## 6 Species      nominal outcome   original
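The practical effect of the ID role can be sketched as follows. This example assumes the built-in iris data with an added row column as a stand-in for data_iris, and it uses step_normalize() only to show which columns a predictor selector touches.

```r
library(recipes)

# Assumption: iris plus a row-number column mimics the tutorial's data_iris.
iris_with_id <- iris
iris_with_id$row <- seq_len(nrow(iris_with_id))

rec <- recipe(Species ~ ., data = iris_with_id) |>
  update_role(row, new_role = "ID") |>
  step_normalize(all_numeric_predictors()) |>
  prep()

baked <- bake(rec, new_data = NULL)

# The predictors are rescaled, but row keeps its original values because
# it is no longer matched by all_numeric_predictors().
range(baked$row)
mean(baked$Petal.Width)
```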

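Once roles are assigned, the selector functions in the recipes package make it easy to retrieve variables with specific types and/or roles. These functions include has_role(), has_type(), all_predictors(), all_outcomes(), all_numeric(), all_nominal(), all_numeric_predictors(), and all_nominal_predictors().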

Note that factor types are counted as nominal. Also note that some of these functions can only be used within step functions.
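A short sketch of selectors in use, under the assumption that the built-in iris data with an added row column stands in for data_iris. Sepal.Length is used as the outcome here (a departure from the tutorial's recipe) so that Species becomes a nominal predictor.

```r
library(recipes)

# Assumption: iris plus a row-number column mimics the tutorial's data_iris.
iris_with_id <- iris
iris_with_id$row <- seq_len(nrow(iris_with_id))

rec <- recipe(Sepal.Length ~ ., data = iris_with_id) |>
  update_role(row, new_role = "ID") |>
  # role-based selector: drop whatever carries the ID role
  step_rm(has_role("ID")) |>
  # type-and-role selector: center the numeric predictors only
  step_center(all_numeric_predictors()) |>
  # factors count as nominal, so Species is selected here
  step_dummy(all_nominal_predictors()) |>
  prep()

baked <- bake(rec, new_data = NULL)
names(baked)
```

Because the outcome is not a predictor, Sepal.Length passes through unchanged, while Species is replaced by dummy columns.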


Step Functions

Step functions are used to specify the data transformations to include in a recipe. Each step function represents a unique data transformation and returns a recipe object with the transformation added.

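Most step functions share a common set of arguments: the recipe object to add the step to (recipe), one or more selectors or variable names identifying the columns to transform (...), the role to assign to any new variables the step creates (role), whether the step should be skipped when baking new data (skip), and a unique identifier for the step (id).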

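Some examples of step functions include step_center() and step_scale() (centering and scaling numeric variables), step_normalize() (both at once), step_corr() (removing highly correlated variables), and step_dummy() (converting nominal variables to dummy variables).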

There are also several step functions that act as wrappers for common dplyr operations, including step_filter(), step_rename(), and step_mutate().

# adding some steps to our iris_recipe object
iris_recipe <- iris_recipe |>
  # removing highly correlated variables, which can often mess up models
  step_corr(all_numeric_predictors()) |>
  # step_normalize() applies both step_center() and step_scale()
  step_normalize(all_numeric_predictors()) |>
  # renaming the row variable - note use of role argument
  step_rename(Row = row, role = "ID")

# calling the recipe prints its information - notice the steps listed
iris_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##         ID          1
##    outcome          1
##  predictor          4
## 
## Operations:
## 
## Correlation filter on all_numeric_predictors()
## Centering and scaling for all_numeric_predictors()
## Variable renaming for Row

Step functions can also cover much more complex operations. For a full list of step functions, see the official Function Reference.

Note: Since step functions are applied successively to a recipe object, order matters. The Recipes Website has some tips and tricks on handling step function order.
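One way to see why order matters is to swap two steps and compare the results. The sketch below assumes the built-in iris data with Sepal.Length as the outcome, so that Species is a nominal predictor; the object names are illustrative.

```r
library(recipes)

# Species is a nominal predictor here because Sepal.Length is the outcome.
base_rec <- recipe(Sepal.Length ~ ., data = iris)

# Normalize first: the dummy columns are created afterwards and stay 0/1.
normalize_first <- base_rec |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# Dummy first: the new dummy columns are numeric predictors by the time
# step_normalize() runs, so they get centered and scaled as well.
dummy_first <- base_rec |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

unique(normalize_first$Species_versicolor)
unique(dummy_first$Species_versicolor)
```

Neither ordering is wrong in general; the point is that selectors are resolved against the data as it exists when each step runs.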


Prepping the Recipe and Baking New Data

Once we’ve added all the desired roles and steps to a recipe, use the prep() function to “train” the recipe on the template data – that is, use the template data to estimate the parameters/quantities required by steps.

# preparing the recipe using prep() 
iris_recipe_prepped <- iris_recipe |>
  prep()

# the steps are now trained for the template dataset
iris_recipe_prepped
## Recipe
## 
## Inputs:
## 
##       role #variables
##         ID          1
##    outcome          1
##  predictor          4
## 
## Training data contained 112 data points and no missing data.
## 
## Operations:
## 
## Correlation filter on Petal.Length [trained]
## Centering and scaling for Sepal.Length, Sepal.Width, Petal.Width [trained]
## Variable renaming for Row [trained]
# The updated formula reflects the trained recipe: Row (an ID) and Petal.Length (removed by step_corr()) are not predictors
formula(iris_recipe_prepped)
## Species ~ Sepal.Length + Sepal.Width + Petal.Width
## <environment: 0x11b163a50>
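To inspect what prep() actually estimated, the tidy() method can be called on a prepped recipe. The sketch below assumes the built-in iris data and a single step_normalize() step rather than the tutorial's full recipe; learned_mean is an illustrative name.

```r
library(recipes)

rec_prepped <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_numeric_predictors()) |>
  prep()

# number = 1 asks for the quantities estimated by the first step.
estimates <- tidy(rec_prepped, number = 1)
estimates

# The stored mean for Sepal.Length should match the training-data mean.
learned_mean <- estimates$value[estimates$terms == "Sepal.Length" &
                                  estimates$statistic == "mean"]
learned_mean
mean(iris$Sepal.Length)
```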

Finally, use the prepped recipe to process data. bake() applies the trained transformations in a recipe object to a dataset, preparing it for modeling.

Note: prep() must be called before using a recipe to bake data - otherwise, the steps are not trained and the code will error.

# To extract the baked template data, pass in NULL as the new_data argument
prepped_training <- bake(iris_recipe_prepped, 
                         new_data = NULL)
head(prepped_training)
## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Width   Row Species   
##          <dbl>       <dbl>       <dbl> <int> <fct>     
## 1        0.276      -0.996       0.236   135 virginica 
## 2        1.24        0.193       1.46    142 virginica 
## 3        1.60        0.431       0.779   126 virginica 
## 4       -0.207       1.86       -1.26     19 setosa    
## 5        1.24        0.431       1.46    121 virginica 
## 6       -0.207      -0.996      -0.307    80 versicolor
# taking a glimpse at the original unprocessed testing set
head(testing(data_split))
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species row
## 1           5.1         3.5          1.4         0.2  setosa   1
## 5           5.0         3.6          1.4         0.2  setosa   5
## 8           5.0         3.4          1.5         0.2  setosa   8
## 11          5.4         3.7          1.5         0.2  setosa  11
## 20          5.1         3.8          1.5         0.3  setosa  20
## 23          4.6         3.6          1.0         0.2  setosa  23
# processing a new dataset by passing it as the new_data argument
prepped_testing <- bake(iris_recipe_prepped, 
                        new_data = testing(data_split))

# the transformed final dataset
head(prepped_testing)
## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Width   Row Species
##          <dbl>       <dbl>       <dbl> <int> <fct>  
## 1       -0.930       1.14        -1.39     1 setosa 
## 2       -1.05        1.38        -1.39     5 setosa 
## 3       -1.05        0.907       -1.39     8 setosa 
## 4       -0.568       1.62        -1.39    11 setosa 
## 5       -0.930       1.86        -1.26    20 setosa 
## 6       -1.53        1.38        -1.39    23 setosa
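The key property of bake() is that new data is transformed with the quantities learned from the training data, not re-estimated from the new data. A minimal sketch, assuming the built-in iris data and a simple base-R split in place of the rsample split (the seed and object names are illustrative):

```r
library(recipes)

# Assumption: a simple base-R split stands in for the tutorial's data_split.
set.seed(123)
train_rows <- sample(nrow(iris), size = 112)
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]

rec <- recipe(Species ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

rec_prepped <- prep(rec)
baked_test  <- bake(rec_prepped, new_data = test)

# Reproduce the transformation by hand with the TRAINING mean and sd.
manual <- (test$Sepal.Length - mean(train$Sepal.Length)) /
  sd(train$Sepal.Length)
all.equal(as.numeric(baked_test$Sepal.Length), manual)

# Baking with the unprepped recipe is expected to fail, as the note above
# says; the condition is captured instead of stopping the script.
unprepped_result <- tryCatch(
  bake(rec, new_data = test),
  error = function(e) e
)
inherits(unprepped_result, "error")
```

Reusing the training statistics keeps information from the testing set out of the preprocessing, which is the main reason to separate prep() from bake().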


Further Resources