When building a model, the source dataset needs to be split so that we can reserve data both to build the model and to assess its performance. RSample contains tools for dataset splitting and resampling.
data_iris is a modified version of the built-in iris dataset that includes the row number as a variable.
library(rsample) # splitting and resampling tools
library(dplyr)   # bind_cols()

data_iris <- bind_cols(iris,
                       row = 1:nrow(iris))
head(data_iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species row
## 1 5.1 3.5 1.4 0.2 setosa 1
## 2 4.9 3.0 1.4 0.2 setosa 2
## 3 4.7 3.2 1.3 0.2 setosa 3
## 4 4.6 3.1 1.5 0.2 setosa 4
## 5 5.0 3.6 1.4 0.2 setosa 5
## 6 5.4 3.9 1.7 0.4 setosa 6
Using this example dataset, we will build a model that can predict Species based on the provided measurements (row is an identification variable and not used as a predictor).
To create a randomly divided training/testing split using RSample, we can use the initial_split() function. This function generates an rsplit object that contains partitioning information for the data.
For time-ordered data there is also the initial_time_split() function. It works identically to initial_split() except that it uses the first prop rows for training instead of a random selection (a short sketch follows the output below).
# data_iris is defined in 'Overview'
data_split <- initial_split(data_iris,
                            prop = 3/4) # proportion of data to use for training
data_split
## <Analysis/Assess/Total>
## <112/38/150>
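The sketch below applies initial_time_split() to the same data. Since data_iris has no real time ordering, this is for illustration only; it simply reserves the first 75% of rows for training (time_split is just an illustrative name).
# a minimal sketch: initial_time_split() keeps the first `prop` rows for training
time_split <- initial_time_split(data_iris,
                                 prop = 3/4)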
To retrieve the training and testing datasets from an rsplit object, use the testing() and training() functions.
# using training() and testing() to retrieve datasets
data_test <- testing(data_split)
data_train <- training(data_split)
head(data_test)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species row
## 5 5.0 3.6 1.4 0.2 setosa 5
## 8 5.0 3.4 1.5 0.2 setosa 8
## 17 5.4 3.9 1.3 0.4 setosa 17
## 20 5.1 3.8 1.5 0.3 setosa 20
## 21 5.4 3.4 1.7 0.2 setosa 21
## 23 4.6 3.6 1.0 0.2 setosa 23
If you are building your first model, skip this section!
When building a model, we usually want to estimate its performance before evaluating it with the testing set. Resampling encompasses various techniques for splitting the training set so that we can estimate model performance before the final polish and assessment with the testing set. Learn to work with resamples in the Tune and Workflowsets tutorials.
A validation set is a single analysis/assessment split performed on the training data. Validation sets work best for large datasets, where a single split already gives a stable performance estimate and multiple resamples are not needed.
Use validation_split() and validation_time_split() to create a validation split or a time-based validation split, respectively.
validate <- validation_split(data_train,
                             prop = 3/4, # proportion for analysis set
                             strata = Species)
validate
## # Validation Set Split (0.75/0.25) using stratification
## # A tibble: 1 × 2
## splits id
## <list> <chr>
## 1 <split [83/29]> validation
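The time-based variant, validation_time_split(), builds the same kind of one-row split table but uses the first prop rows of the data as the analysis set rather than sampling at random. A minimal sketch (output omitted; validate_time is just an illustrative name):
# time-based validation split: the first 75% of rows become the analysis set
validate_time <- validation_time_split(data_train,
                                       prop = 3/4)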
Retrieve data from a specific split object using analysis() and assessment(). These functions support additional arguments, but when simply passed an rsplit object they act identically to training() and testing(); the names are different to distinguish resampling from the original train/test split.
# extracting rsplit object
v_split <- validate$splits[[1]]
# pulling analysis set
analysis(v_split) |> head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species row
## 12 4.8 3.4 1.6 0.2 setosa 12
## 14 4.3 3.0 1.1 0.1 setosa 14
## 9 4.4 2.9 1.4 0.2 setosa 9
## 27 5.0 3.4 1.6 0.4 setosa 27
## 33 5.2 4.1 1.5 0.1 setosa 33
## 22 5.1 3.7 1.5 0.4 setosa 22
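The held-out rows of the same split can be retrieved with assessment(); a minimal sketch (output omitted), reusing v_split from above:
# pulling the assessment set (the 29 held-out rows of this split)
assessment(v_split) |> head()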
V-fold cross-validation splits a dataset into V parts, creating resamples in which each part takes a turn as the assessment set. V-fold cross-validation is useful for producing performance estimates that are averaged across the folds.
Use vfold_cv() to create folds in your data. This function returns a table with a data split object and an identification variable for each fold.
data_folds <- vfold_cv(data_train,
                       v = 5,       # number of partitions in dataset
                       repeats = 1, # times to repeat partitioning
                       strata = Species) # (optional) variable used for stratified sampling
data_folds
## # 5-fold cross-validation using stratification
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [89/23]> Fold1
## 2 <split [89/23]> Fold2
## 3 <split [89/23]> Fold3
## 4 <split [90/22]> Fold4
## 5 <split [91/21]> Fold5
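Setting repeats above 1 runs the whole partitioning process several times and stacks the resulting folds into one table, which smooths out the luck of any single partitioning. A minimal sketch (output omitted; data_folds_rep is just an illustrative name):
# 3 repeats of 5-fold CV yields 15 resamples, identified by repeat and fold
data_folds_rep <- vfold_cv(data_train,
                           v = 5,
                           repeats = 3,
                           strata = Species)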
Monte-Carlo cross-validation is similar to V-fold cross-validation, except that each resample is created by randomly sampling the original dataset, meaning that the analysis sets of different resamples may overlap. Use mc_cv() to create these kinds of resamples.
mc_split <- mc_cv(data_train,
                  prop = 9/10, # proportion to use for analysis per resample
                  times = 20,  # number of resamples
                  strata = Species)
head(mc_split)
## # A tibble: 6 × 2
## splits id
## <list> <chr>
## 1 <split [100/12]> Resample01
## 2 <split [100/12]> Resample02
## 3 <split [100/12]> Resample03
## 4 <split [100/12]> Resample04
## 5 <split [100/12]> Resample05
## 6 <split [100/12]> Resample06
There are a variety of other resampling methods supported by RSample. Some of the most important are listed below; a short sketch of their usage follows the list.
- bootstraps() creates random resamples with replacement, allowing duplicate rows within each analysis set.
- rolling_origin() creates resamples for datasets with a strong time-based component. Analysis sets are selected sequentially, with the origin point shifting forward by a set amount.
- loo_cv() creates leave-one-out resamples, where one point is held out for assessment and all others are used for analysis; there are as many resamples as data points.
For a full list of resampling techniques, visit the official RSample Reference.
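The calls below are a minimal sketch of these three functions, reusing the objects defined earlier; the argument values are illustrative only.
# bootstrap resamples: each analysis set is drawn with replacement
boots <- bootstraps(data_train,
                    times = 25,
                    strata = Species)

# rolling-origin resamples: analysis sets of 100 rows, assessed on the next
# 10 rows, sliding forward through the (row-ordered) data
roll <- rolling_origin(data_iris,
                       initial = 100,
                       assess = 10,
                       cumulative = FALSE)

# leave-one-out resamples: one resample per row of the training data
loo <- loo_cv(data_train)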