Resampling Methods

Overview

Resampling estimates model performance using out-of-sample predictions without touching the test set. The choice depends on dataset size and computational budget.

Method Best For Pros Cons
V-fold CV Small to medium data Low variance, uses all data Computationally expensive
Validation set Large data (≥10K rows) Fast, simple Higher variance
Repeated CV Final model assessment Lower variance than single CV Very expensive

V-Fold Cross-Validation

Splits training data into V folds; iteratively holds out each fold for validation while training on the rest.

When to use: Default choice for small to medium datasets.

Considerations: - V=10 is standard; V=5 for very small data or expensive models - Stratified CV maintains outcome distribution in each fold - Each observation appears in exactly one held-out fold

tidymodels

library(tidymodels)

# Basic 10-fold CV

# set seed first
resamples <- vfold_cv(train_data, v = 10)

# Stratified by outcome (classification)
resamples <- vfold_cv(train_data, v = 10, strata = outcome)

# Stratified by outcome (regression - uses quartiles internally)
resamples <- vfold_cv(train_data, v = 10, strata = outcome)

Repeated Cross-Validation

Runs V-fold CV multiple times with different random splits, then averages results.

When to use: Final model assessment when variance reduction matters; comparing close models.

Considerations: - 5-fold repeated 5 times or 10-fold repeated 3 times are common - Multiplies computation time by the number of repeats

tidymodels

# 10-fold CV repeated 5 times

# Set seed first
resamples <- vfold_cv(train_data, v = 10, repeats = 5, strata = outcome)

Validation Set

tidymodels treats validation sets like a single resample of the data.

See the instructions in data-spending.md for making the initial three-way split.

tidymodels

library(tidymodels)

# set seed first
init_split <- initial_validation_split(all_data, strata = outcome)
train_data <- training(init_split)
test_data <- testing(init_split)
validation_data <- validation(init_split)

# Create resampling object for tuning
resamples <- validation_set(init_split)

Time Series: Rolling Origin

For time-dependent data, uses expanding or sliding windows that respect temporal order.

When to use: Time series or temporal data where random splits would cause data leakage.

Considerations: - Never use future data to predict the past - initial sets minimum training size; assess sets forecast horizon - requires the data to have a column that has a date or date-time class.

tidymodels

# Rolling origin with expanding window
resamples <-  
  sliding_period(
    train_data,
    index = date_column,
    period = "week",  # use an appropriate unit of time
    lookback = Inf,
    complete = TRUE,
    assess_stop = 2,  # use an appropriate size for data
    step = 2          # use an appropriate size for data
  )

Grouped Data

When observations are not independent (e.g., multiple measurements per subject), keep groups together.

When to use: Repeated measures, clustered data, longitudinal studies.

tidymodels

# Keep all observations from the same group together

# set seed first
resamples <- group_vfold_cv(train_data, group = id_column, v = 10)