When building a model, the source dataset must be split so that some data is reserved for building the model and some for assessing its performance. The rsample package contains tools for dataset splitting and resampling.


The Iris Dataset

data_iris is a modified version of the built-in iris dataset that includes the row number as a variable.

library(tidymodels) # loads dplyr (for bind_cols) and rsample

data_iris <- bind_cols(iris, 
                       row = 1:nrow(iris))

head(data_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species row
## 1          5.1         3.5          1.4         0.2  setosa   1
## 2          4.9         3.0          1.4         0.2  setosa   2
## 3          4.7         3.2          1.3         0.2  setosa   3
## 4          4.6         3.1          1.5         0.2  setosa   4
## 5          5.0         3.6          1.4         0.2  setosa   5
## 6          5.4         3.9          1.7         0.4  setosa   6

Using this example dataset, we will build a model that predicts Species from the provided measurements (row is an identification variable and is not used as a predictor).


Training / Testing Splits

To create a randomly divided training/testing split with rsample, use the initial_split() function. It generates an rsplit object that contains the partitioning information for the data.

# data_iris is defined in 'The Iris Dataset' above
data_split <- initial_split(data_iris, 
                            prop = 3/4) # proportion of data to use for training

data_split
## <Analysis/Assess/Total>
## <112/38/150>
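
Because the split is random, the selected rows change from run to run. A minimal sketch, assuming you want a reproducible split that also preserves class proportions (the seed value and the name data_split_strat are arbitrary; strata is an optional argument of initial_split()):

# fix the random number generator so the same split is produced each run
set.seed(123)

# stratify on Species so class proportions are similar in both sets
data_split_strat <- initial_split(data_iris,
                                  prop = 3/4,
                                  strata = Species)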

To retrieve the training and testing datasets from an rsplit object, use the training() and testing() functions.

# using training() and testing() to retrieve datasets
data_test <- testing(data_split)
data_train <- training(data_split)

head(data_test)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species row
## 5           5.0         3.6          1.4         0.2  setosa   5
## 8           5.0         3.4          1.5         0.2  setosa   8
## 17          5.4         3.9          1.3         0.4  setosa  17
## 20          5.1         3.8          1.5         0.3  setosa  20
## 21          5.4         3.4          1.7         0.2  setosa  21
## 23          4.6         3.6          1.0         0.2  setosa  23
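
Since row uniquely identifies each observation, we can sanity-check that the two sets are disjoint. A quick sketch using base R:

# should return integer(0), because training and testing partition data_iris
intersect(data_train$row, data_test$row)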


Resampling

If you are building your first model, skip this section!

When building a model, we usually want to estimate its performance before evaluating it with the testing set. Resampling encompasses various techniques for splitting the training set so that model performance can be estimated prior to polishing and final assessment with the testing set. Learn to work with resamples in the Tune and Workflowsets tutorials.


Validation Set

A validation set is a single analysis/assessment split performed on the training data. Validation sets work best for large datasets, where multiple resamples are not needed to obtain a reliable performance estimate.

Use validation_split() and validation_time_split() to create a validation split or time-based validation split, respectively.

validate <- validation_split(data_train, 
                             prop = 3/4, # proportion for analysis set
                             strata = Species)

validate
## # Validation Set Split (0.75/0.25)  using stratification 
## # A tibble: 1 × 2
##   splits          id        
##   <list>          <chr>     
## 1 <split [83/29]> validation

Retrieve data from a specific split object using analysis() and assessment().

  • Note: although analysis() and assessment() support additional arguments, when simply passed an rsplit object they behave identically to training() and testing(). The different names distinguish resampling from the original train/test split.

# extracting the rsplit object from the splits column
v_split <- validate$splits[[1]]

# pulling analysis set 
analysis(v_split) |> head()
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species row
## 12          4.8         3.4          1.6         0.2  setosa  12
## 14          4.3         3.0          1.1         0.1  setosa  14
## 9           4.4         2.9          1.4         0.2  setosa   9
## 27          5.0         3.4          1.6         0.4  setosa  27
## 33          5.2         4.1          1.5         0.1  setosa  33
## 22          5.1         3.7          1.5         0.4  setosa  22
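
The held-out rows are retrieved the same way. A short sketch using assessment():

# pulling the assessment (held-out) set
assessment(v_split) |> head()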


V-Fold Cross Validation

V-fold cross validation splits a dataset into V parts (folds), creating resamples in which each fold takes a turn serving as the assessment set. V-fold cross validation is useful for producing averaged accuracy estimates.

Use vfold_cv() to create folds in your data. This function returns a tibble with a split object and an identification variable for each fold.

data_folds <- vfold_cv(data_train, 
                       v = 5, # number of partitions in dataset 
                       repeats = 1, # times to repeat partitioning
                       strata = Species) # (optional) variable used for stratified sampling

data_folds
## #  5-fold cross-validation using stratification 
## # A tibble: 5 × 2
##   splits          id   
##   <list>          <chr>
## 1 <split [89/23]> Fold1
## 2 <split [89/23]> Fold2
## 3 <split [89/23]> Fold3
## 4 <split [90/22]> Fold4
## 5 <split [91/21]> Fold5
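
Each element of the splits column is an ordinary rsplit, so analysis() and assessment() work on folds too. A quick sketch, using base R to confirm the assessment-set sizes shown above:

# number of assessment rows per fold (should match the second count in each split)
sapply(data_folds$splits, function(s) nrow(assessment(s)))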


Monte-Carlo Cross Validation

Monte-Carlo cross validation is similar to V-fold cross validation, except that each resample is created by randomly sampling the data, meaning that the analysis sets of different resamples may overlap. Use mc_cv() to create these kinds of resamples.

mc_split <- mc_cv(data_train, 
                  prop = 9/10, # proportion to use for analysis per resample
                  times = 20, # number of resamples
                  strata = Species)

head(mc_split)
## # A tibble: 6 × 2
##   splits           id        
##   <list>           <chr>     
## 1 <split [100/12]> Resample01
## 2 <split [100/12]> Resample02
## 3 <split [100/12]> Resample03
## 4 <split [100/12]> Resample04
## 5 <split [100/12]> Resample05
## 6 <split [100/12]> Resample06
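
Because each resample is drawn independently, analysis sets can share rows. A short sketch that uses the row identifier to count the overlap between the first two resamples:

# rows appearing in both of the first two analysis sets
shared <- intersect(analysis(mc_split$splits[[1]])$row,
                    analysis(mc_split$splits[[2]])$row)
length(shared)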


Other Resampling Methods

rsample supports a variety of other resampling methods. A few important ones are listed below, with a brief usage sketch after the list:

  • Bootstrapping: bootstraps() creates random resamples with replacement, so an analysis set may contain duplicate rows.
  • Rolling Forecast Origin: rolling_origin() creates resamples for datasets with a strong time-based component. Analysis sets are selected sequentially, with the origin shifting by a set amount.
  • Leave-One-Out: loo_cv() creates resamples where one point is held out for assessment and all others are used for analysis, producing one resample per data point.
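
A minimal usage sketch of each call; the argument values are illustrative only, and data_train has no genuine time ordering, so rolling_origin() is shown purely for its signature:

# bootstrap resamples: sampling with replacement, times = number of resamples
boots <- bootstraps(data_train, times = 25)

# rolling forecast origin: initial = size of the first analysis set,
# assess = assessment-set size; skip thins the sequence of origins
rolls <- rolling_origin(data_train, initial = 60, assess = 10, skip = 5)

# leave-one-out: one resample per row
loo <- loo_cv(data_train)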

For a full list of resampling techniques, visit the official rsample reference.


Further Resources