Datasets usually must be preprocessed before use in modeling, whether to meet model specifications or to improve performance. The recipes
package assembles preprocessing steps into objects so that the transformations applied to one dataset can easily be applied to others.
Recipe objects are built by creating a “blank” recipe object and then adding on data transformations via functions.
To initialize a recipe object, pass a formula and a template dataset to the recipe()
function. For this example, we start with the data_split
testing/training split defined in the rsample Tutorial.
# Note that I could also use data_iris, head(data_iris), data_split, etc. as the template; what matters
# is that the names and types of the variables stay the same. For larger datasets, using head()
# is preferable. The exception to this rule is if you include transformations that require
# a comprehensive look at the dataset, such as step_corr().
template_data <- training(data_split)
iris_recipe <- recipe(Species ~ .,
data = template_data)
Note: If you are building a recipe for a model with a specialized formula (e.g., GAMs, which include smoothing functions), do not use that formula here. The recipe formula should only specify outcomes and predictors, not any special operations/transformations (these will be incorporated within parsnip
).
Recipe objects can also be defined using a list of contributing variables and their roles (more on this below!). This is preferable for datasets with a large number of variables.
iris_recipe2 <- recipe(template_data,
vars = c("Sepal.Length", "Sepal.Width",
"Petal.Length", "Petal.Width", "Species", "row"),
roles = c("predictor", "predictor",
"predictor", "predictor", "outcome", "ID"))
When creating a recipe, each variable is assigned a role, either explicitly by the user or inferred from a formula. A variable can have any role, including outcome, predictor, ID, case weight, stratification variable, etc. The role of a variable determines its treatment by a recipe.
Most models only need outcome and predictor, which can be assigned automatically via a formula. View the roles assigned to variables by a recipe with summary()
.
iris_recipe |> summary()
## # A tibble: 6 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 Sepal.Length numeric predictor original
## 2 Sepal.Width numeric predictor original
## 3 Petal.Length numeric predictor original
## 4 Petal.Width numeric predictor original
## 5 row numeric predictor original
## 6 Species nominal outcome original
# building a recipe that only processes one variable - Sepal.Length - and gives it a "predictor" role
recipe(~ Sepal.Length, data = template_data) |> summary()
## # A tibble: 1 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 Sepal.Length numeric predictor original
# building a recipe with two "outcome" variables
recipe(Species + Sepal.Length ~ ., data = template_data) |> summary()
## # A tibble: 6 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 Sepal.Width numeric predictor original
## 2 Petal.Length numeric predictor original
## 3 Petal.Width numeric predictor original
## 4 row numeric predictor original
## 5 Species nominal outcome original
## 6 Sepal.Length numeric outcome original
After initializing a recipe, update variable roles using the update_role()
function. In the case of data_iris
, we want to make row
an ID variable so that it won’t be transformed as a predictor but will still stay attached to the data for later processing.
Note that update_role()
overrides the original formula/variable specifications passed to the recipe.
# updating 'row' to an ID variable
iris_recipe <- iris_recipe |>
# update_role can update multiple variables at a time
update_role(row, new_role = "ID")
# viewing the updated role
summary(iris_recipe)
## # A tibble: 6 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 Sepal.Length numeric predictor original
## 2 Sepal.Width numeric predictor original
## 3 Petal.Length numeric predictor original
## 4 Petal.Width numeric predictor original
## 5 row numeric ID original
## 6 Species nominal outcome original
Once roles are assigned to variables, the recipes
package includes several selector functions to easily retrieve variables with specific types and/or roles. These functions are:
has_role()
has_type()
all_predictors()
all_outcomes()
all_numeric_predictors()
all_nominal_predictors()
all_numeric()
all_nominal()
Note that factor types are counted as nominal. Also note that some of these functions can only be used within step functions.
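For instance, selectors can be passed to step functions in place of individual variable names. A minimal sketch (using the built-in iris data directly, rather than the split defined earlier):

```r
library(recipes)

# normalize every numeric predictor at once via a selector;
# has_role() and has_type() take the role or type name as a string
recipe(Species ~ ., data = iris) |>
  step_normalize(all_numeric_predictors()) |>
  step_zv(has_role("predictor")) |>
  prep() |>
  bake(new_data = NULL) |>
  head()
```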
Step functions are used to specify the data transformations to include in a recipe. Each step function represents a unique data transformation and returns a recipe object with the transformation added.
Step functions take the following arguments:
skip
: If skip = TRUE
, the step will only be applied to the template data, and not to any further datasets. This is most useful for transformations of an outcome variable that is not guaranteed to exist in future datasets. Defaults to skip = FALSE
.
role
: If a step creates new variables, this argument specifies their role. Optional.
Some examples of step functions include:
step_corr()
- remove highly correlated variables
step_log()
- log transform data
step_interact()
- create interaction variables
step_center()
- center variables to have a mean of 0
step_scale()
- scale variables to have a standard deviation of 1
step_zv()
/ step_nzv()
- remove zero / near-zero variance predictors
step_naomit()
- remove missing values
step_dummy()
- convert nominal variables to dummy/indicator variables, which is useful when a model can only process numeric data. Note that some parsnip
models, such as GAMs, will do this for you automatically.
There are also several step functions that act as wrappers for common dplyr
operations, including step_filter()
, step_rename()
, and step_mutate()
.
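As a quick sketch of the skip argument described above, consider a filter that should run during training only (using the built-in iris data; the row counts below depend on that assumption):

```r
library(recipes)

# skip = TRUE: the filter runs when the recipe is prepped on the
# template data, but is skipped when baking any new dataset
rec <- recipe(Species ~ ., data = iris) |>
  step_filter(Species != "setosa", skip = TRUE) |>
  prep()

nrow(bake(rec, new_data = NULL))  # 100 - setosa rows removed in training
nrow(bake(rec, new_data = iris))  # 150 - the filter step was skipped
```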
# adding some steps to our iris_recipe object
iris_recipe <- iris_recipe |>
# removing highly correlated variables, which can often mess up models
step_corr(all_numeric_predictors()) |>
# step_normalize() applies both step_center() and step_scale()
step_normalize(all_numeric_predictors()) |>
# renaming the row variable - note use of role argument
step_rename(Row = row, role = "ID")
# calling the recipe prints its information - notice the steps listed
iris_recipe
## Recipe
##
## Inputs:
##
## role #variables
## ID 1
## outcome 1
## predictor 4
##
## Operations:
##
## Correlation filter on all_numeric_predictors()
## Centering and scaling for all_numeric_predictors()
## Variable renaming for Row
Step functions can also cover much more complex operations. For a full list of step functions, see the official Function Reference.
Note: Since step functions are applied successively to a recipe object, order matters. The Recipes Website has some tips and tricks on handling step function order.
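As an illustration of why order matters, compare dummying before versus after normalizing (a sketch on the built-in iris data):

```r
library(recipes)

# dummy first: the new 0/1 indicator columns are numeric predictors by the
# time step_normalize() runs, so they get centered and scaled too
dummied_first <- recipe(Sepal.Length ~ Species, data = iris) |>
  step_dummy(Species) |>
  step_normalize(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# normalize first: Species is still nominal at that point, so
# step_normalize() touches nothing and the dummies stay as 0/1 indicators
normalized_first <- recipe(Sepal.Length ~ Species, data = iris) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(Species) |>
  prep() |>
  bake(new_data = NULL)

range(dummied_first$Species_versicolor)    # centered/scaled values
range(normalized_first$Species_versicolor) # 0 1
```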
Once we’ve added all the desired roles and steps to a recipe, use the prep()
function to “train” the recipe on the template data – that is, use the template data to estimate the parameters/quantities required by steps.
Use formula()
to pull an updated formula from a prepped recipe object.
# preparing the recipe using prep()
iris_recipe_prepped <- iris_recipe |>
prep()
# the steps are now trained for the template dataset
iris_recipe_prepped
## Recipe
##
## Inputs:
##
## role #variables
## ID 1
## outcome 1
## predictor 4
##
## Training data contained 112 data points and no missing data.
##
## Operations:
##
## Correlation filter on Petal.Length [trained]
## Centering and scaling for Sepal.Length, Sepal.Width, Petal.Width [trained]
## Variable renaming for Row [trained]
# We changed the role of Row in our recipe. See that in the updated formula
formula(iris_recipe_prepped)
## Species ~ Sepal.Length + Sepal.Width + Petal.Width
## <environment: 0x11b163a50>
Finally, use the prepped recipe to process data. bake()
will apply the transformations in a recipe object to a dataset, prepping it for modeling.
Note: prep()
must be called before using a recipe to bake data - otherwise, the steps are not trained and the code will error.
# To extract the baked template data, pass in NULL as the new_data argument
prepped_training <- bake(iris_recipe_prepped,
new_data = NULL)
head(prepped_training)
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Width Row Species
## <dbl> <dbl> <dbl> <int> <fct>
## 1 0.276 -0.996 0.236 135 virginica
## 2 1.24 0.193 1.46 142 virginica
## 3 1.60 0.431 0.779 126 virginica
## 4 -0.207 1.86 -1.26 19 setosa
## 5 1.24 0.431 1.46 121 virginica
## 6 -0.207 -0.996 -0.307 80 versicolor
# taking a glimpse at the original unprocessed testing set
head(testing(data_split))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species row
## 1 5.1 3.5 1.4 0.2 setosa 1
## 5 5.0 3.6 1.4 0.2 setosa 5
## 8 5.0 3.4 1.5 0.2 setosa 8
## 11 5.4 3.7 1.5 0.2 setosa 11
## 20 5.1 3.8 1.5 0.3 setosa 20
## 23 4.6 3.6 1.0 0.2 setosa 23
# processing a new dataset by passing it as the new_data argument
prepped_testing <- bake(iris_recipe_prepped,
new_data = testing(data_split))
# the transformed final dataset
head(prepped_testing)
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Width Row Species
## <dbl> <dbl> <dbl> <int> <fct>
## 1 -0.930 1.14 -1.39 1 setosa
## 2 -1.05 1.38 -1.39 5 setosa
## 3 -1.05 0.907 -1.39 8 setosa
## 4 -0.568 1.62 -1.39 11 setosa
## 5 -0.930 1.86 -1.26 20 setosa
## 6 -1.53 1.38 -1.39 23 setosa
step_interact()
and step_dummy()
.