Data Spending Methods
Overview
Data spending refers to the initial splitting of a data set into either:
- training, validation, and test sets, or
- training and test sets
Training, Validation, and Test Sets
Splits the entire data set into three partitions.
When to use: At the very beginning of the modeling process.
Considerations:
- When the number of data points is large (say >= 10,000), ask the user whether they want this type of split.
  - If not, use a basic training/testing split.
- Stratified splitting (the strata argument) keeps the outcome distribution similar across the training, validation, and test sets.
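The considerations above can be sketched as follows. This is a minimal illustration rather than a prescribed workflow: the toy `all_data`, the `want_validation` flag (standing in for the user's answer), and the explicit `prop` values are assumptions made for the example.

```r
library(tidymodels)

# Toy data standing in for `all_data` (assumption: a two-class factor outcome)
set.seed(1)
all_data <- tibble(
  x = rnorm(12000),
  outcome = factor(sample(c("yes", "no"), 12000, replace = TRUE, prob = c(0.2, 0.8)))
)

# Hypothetical flag recording the user's answer when the data are large
want_validation <- TRUE
use_three_way <- nrow(all_data) >= 10000 && want_validation

set.seed(3847)
init_split <- if (use_three_way) {
  # 60% training, 20% validation, remaining 20% testing
  initial_validation_split(all_data, prop = c(0.6, 0.2), strata = outcome)
} else {
  # 75% training, 25% testing
  initial_split(all_data, prop = 0.75, strata = outcome)
}

# Share of "yes" in the training and test sets; stratification should keep them close
train_yes <- mean(training(init_split)$outcome == "yes")
test_yes <- mean(testing(init_split)$outcome == "yes")
c(train = train_yes, test = test_yes)
```

Setting `want_validation <- FALSE` falls back to the two-way split, in which case `validation()` should not be called on the result.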
tidymodels
library(tidymodels)
# Set seed immediately before splitting for reproducibility
set.seed(3847)
init_split <- initial_validation_split(all_data, strata = outcome)
train_data <- training(init_split)
test_data <- testing(init_split)
validation_data <- validation(init_split)
resamples <- validation_set(init_split)
Training and Test Sets
Splits the entire data set into two partitions.
When to use: At the very beginning of the modeling process.
Considerations:
- Stratified splitting (the strata argument) keeps the outcome distribution similar in the training and test sets.
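One way to confirm that stratification did what is described above is to compare class proportions in the two partitions. The sketch below assumes an imbalanced two-class outcome; the toy `all_data` and the explicit `prop = 0.75` are illustrative choices, not requirements.

```r
library(tidymodels)

# Toy data standing in for `all_data` (assumption: 20% "event", 80% "no_event")
set.seed(2)
all_data <- tibble(
  x = rnorm(1000),
  outcome = factor(rep(c("event", "no_event"), times = c(200, 800)))
)

set.seed(7291)
init_split <- initial_split(all_data, prop = 0.75, strata = outcome)

# Class proportions in each partition; both should sit near the overall 20/80 mix
train_props <- prop.table(table(training(init_split)$outcome))
test_props <- prop.table(table(testing(init_split)$outcome))
train_props
test_props
```

Without `strata`, a rare class can end up noticeably over- or under-represented in the test set by chance, which is the situation this check is meant to catch.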
tidymodels
library(tidymodels)
# Set seed immediately before splitting for reproducibility
set.seed(7291)
init_split <- initial_split(all_data, strata = outcome)
train_data <- training(init_split)
test_data <- testing(init_split)
Special Cases
Two cases call for a different splitting method:
- When the data are a time series.
- When there are correlated rows in the data.
Time Series
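In this case, ensure that the rows are ordered from oldest to most recent before splitting. initial_time_split() does not shuffle: it uses the first rows for training and the remaining rows for testing.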
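The ordering step can be sketched as below. The `date` column is hypothetical (any column that records time order would serve), and the toy `all_data` is deliberately shuffled so that the sort matters.

```r
library(tidymodels)

# Toy data standing in for `all_data`; rows arrive in random date order
set.seed(3)
all_data <- tibble(
  date = as.Date("2024-01-01") + sample(0:99),
  y = rnorm(100)
)

# Sort oldest to newest so the earliest rows become the training set
all_data <- arrange(all_data, date)

init_split <- initial_time_split(all_data, prop = 0.75)
train_data <- training(init_split)
test_data <- testing(init_split)

# TRUE when every training date precedes every test date
max(train_data$date) < min(test_data$date)
```

If the final check is not TRUE, the data were not sorted before splitting and the test set would leak earlier observations.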
tidymodels
library(tidymodels)
# No seed is needed: initial_time_split() is not random.
# It takes the first rows (75% by default) for training, in order.
init_split <- initial_time_split(all_data)
train_data <- training(init_split)
test_data <- testing(init_split)
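Correlated Rows
When rows are correlated (for example, repeated measurements on the same subject), one option is to keep each group entirely within a single partition. The following is a minimal sketch, assuming rsample's group_initial_split() suits the situation and that one column identifies the groups; the toy `all_data` and the `subject_id` column are hypothetical.

```r
library(tidymodels)

# Toy data standing in for `all_data`: 50 subjects with 4 rows each
set.seed(4)
all_data <- tibble(
  subject_id = rep(1:50, each = 4),
  x = rnorm(200)
)

# Set seed immediately before splitting for reproducibility
set.seed(9135)
init_split <- group_initial_split(all_data, group = subject_id, prop = 0.75)
train_data <- training(init_split)
test_data <- testing(init_split)

# Subjects appearing in both partitions; there should be none
shared_subjects <- intersect(train_data$subject_id, test_data$subject_id)
length(shared_subjects)
```

Because whole groups are assigned together, the share of rows in the training set may differ somewhat from `prop` when group sizes are uneven.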