Tabular Data Machine Learning

This skill guides the process of developing predictive models for tabular data with proper validation practices.

Data Spending Strategy

Always partition data into:

  • Training set: Used for all feature engineering, feature selection, and model development

  • Test set: Reserved for final model evaluation only—requires explicit user permission before use

A common split is 75% training / 25% testing. Use stratified sampling:

  • Classification: stratify by outcome class

  • Regression: create temporary quartile groups and stratify by those

See references/data-spending.md for specific instructions for data splitting.
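
For example, a minimal rsample sketch (the data frame housing, the numeric outcome price, and the seed are placeholders):

library(rsample)

set.seed(4917)
split <- initial_split(housing, prop = 0.75, strata = price)  # numeric outcomes are binned into strata (quartiles by default)
train_data <- training(split)
test_data <- testing(split)

nrow(test_data)  # basic verification only; no modeling on the test set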

Test Set Rules

  • NEVER predict on test data during model development

  • NEVER calculate test set metrics without explicit user permission

  • NEVER use test data to compare models or tune hyperparameters

  • DO ask: “If you have completed model development, may I evaluate the final model on the test set?”

  • DO wait for explicit confirmation before proceeding

Self-check: If you’re writing predict(..., test_data) without prior user permission, STOP—you’re making an error.

Exception: Basic verification after splitting (e.g., nrow(test_data), glimpse(test_data)) is allowed to confirm the split worked.

Reproducibility

Always set a random seed before any operation that involves randomness:

  • Before data splitting - Set seed immediately before initial_split(), initial_time_split(), etc.

  • Before resampling - Set seed immediately before vfold_cv(), sliding_period(), etc.

  • Before tuning - Set seed before tune_grid() or other tuning functions (if not already set recently)

This ensures that others can reproduce your exact results. Use a single seed at the start of your script, or re-set it before each random operation for maximum clarity.
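
For example (object names and the seed value are placeholders):

library(rsample)

set.seed(4917)  # immediately before the split
split <- initial_split(model_data, prop = 0.75, strata = outcome)

set.seed(4917)  # re-set immediately before resampling for clarity
folds <- vfold_cv(training(split), v = 10, strata = outcome)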

Choosing a Seed Value

Do not use common values like 123, 111, 999, 42, or 1: these are overused, and sharing a seed with other analyses or published work can lead to unintentional correlations and accidentally similar results.

Good practices:

  • Use a random integer between 1000 and 10000 (e.g., 3847, 7291, 5628)

  • Use different seeds for different projects/analyses

  • Document your seed choice in comments for reference

Empirical Validation

Always use out-of-sample predictions to measure performance:

  • Large datasets (≥10,000 rows): Use a single validation set

  • Small to medium datasets: Use 10-fold cross-validation or appropriate resampling

See references/resampling.md for resampling methods and implementation.
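
A sketch of both options with rsample (train_data and outcome are placeholders):

library(rsample)

set.seed(6203)
folds <- vfold_cv(train_data, v = 10, strata = outcome)  # small/medium data: 10-fold CV

set.seed(6203)
val_set <- validation_split(train_data, prop = 0.8, strata = outcome)  # large data; newer rsample versions prefer initial_validation_split()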

Parallel Processing

Before starting computationally intensive work (cross-validation, tuning, model fitting):

  1. Detect available cores: Use parallel::detectCores() to check system resources

  2. Ask the user in the conversation: don’t just put a comment in the code; have an actual exchange

Example interaction:

You: I’m about to run 10-fold cross-validation with hyperparameter tuning. I can use parallel processing to speed this up significantly.

I see you have 8 cores available. Would you like me to use parallel processing? If so, how many cores should I use? (I’d recommend using 6-7 to leave 1-2 cores free for other processes)

If user says yes:

library(future)
plan("multisession", workers = 6)  # or whatever they specified

If user says no or doesn’t respond:

# Proceed with sequential processing (no future setup)

If user is unsure:

You: Here’s the trade-off:

  • With parallel processing (6 cores): ~5-10 minutes

  • Without (sequential): ~30-45 minutes

Your choice won’t affect the results, just the speed.

  3. Continue using the same parallel configuration throughout unless the user asks to stop

Do not automatically enable parallel processing without asking the user first.

Validation Rules

  • NEVER directly predict on training data to measure performance

  • DO develop and compare models using only CV or validation set results

  • DO select final model(s) based on out-of-sample performance
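
For example, assuming a workflow wf and the resamples folds created earlier:

library(tidymodels)

res <- fit_resamples(wf, resamples = folds)  # predictions come from held-out folds, never the training set
collect_metrics(res)                         # out-of-sample estimates used to compare and select models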

Performance Metrics

See references/evaluation.md for specific instructions for computing performance metrics.

Classification

Ask the user whether they prioritize:

  • Class separation: Use ROC-AUC or PR-AUC

  • Calibrated probabilities: Use Brier score

Default set: ROC-AUC, Brier score, and accuracy.
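
A possible default metric set in yardstick (brier_class requires a recent yardstick release):

library(yardstick)

cls_metrics <- metric_set(roc_auc, brier_class, accuracy)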

Regression

  • RMSE: Primary accuracy metric (sensitive to outliers)

  • MAE: Accuracy metric less sensitive to outliers

  • R²: Measures variance explained (supplement to RMSE/MAE, not a replacement)

Default set: RMSE and R².
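
The corresponding yardstick metric set might look like:

library(yardstick)

reg_metrics <- metric_set(rmse, rsq)  # add mae if outliers are a concern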

Model Optimization

The modeling process is iterative. Three main levers for improvement:

  1. Feature engineering: Modify predictors so the model does less work
  2. Model selection: Choose appropriate algorithm for data characteristics
  3. Hyperparameter tuning: Optimize parameters that can’t be estimated from data

All steps must be validated using out-of-sample data.

Optimization Rules

  • NEVER use data outside the training set to determine feature engineering steps

  • NEVER engineer features, then evaluate directly on training data

  • DO treat feature engineering and model training as a single process

  • DO use CV or validation set to measure combined feature engineering + model performance

  • DO use CV or validation set to select best tuning parameters
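
One way to keep feature engineering and model training as a single, resampled process is a workflow; the recipe steps and column names below are illustrative:

library(tidymodels)

rec <- recipe(outcome ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg())

res <- fit_resamples(wf, resamples = folds)  # preprocessing is re-estimated within each fold
collect_metrics(res)                         # combined feature engineering + model performance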

Feature Engineering

See references/feature-engineering.md for:

  • Common feature engineering techniques

  • Model-specific requirements (mandatory vs. helpful transformations)

Model Tuning

  • Use parameter ranges provided by the modeling framework

  • Use space-filling designs for grid search when available

  • Use racing methods for efficiency (except with validation sets, where there are too few resamples for racing to drop poor candidates early)

  • Visualize tuning results to show performance vs. parameter relationships

It is a good idea to propose two models to the user:

  • a regularized linear (or logistic) model such as glmnet

  • a boosted tree with an early stopping argument to halt after 5 poor iterations (sketched below)
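
A sketch of these two candidates in parsnip for a classification task (tuning choices and engine options are illustrative, not prescriptive):

library(tidymodels)

# Candidate 1: regularized logistic regression via glmnet
glmnet_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# Candidate 2: boosted tree that stops after 5 iterations without improvement
boost_spec <- boost_tree(trees = 500, learn_rate = tune(), stop_iter = 5) %>%
  set_engine("xgboost", validation = 0.1) %>%  # internal holdout used for early stopping
  set_mode("classification")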

See references/tuning.md for details on tuning methods and implementation.

Model Evaluation

Without tuning: Resample the model or use a validation set. Report out-of-sample metrics.

With tuning: Select a metric to optimize and identify the optimal tuning parameters.
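
For example, assuming tuning results tune_res from tune_grid() and the workflow wf:

library(tidymodels)

autoplot(tune_res)                                    # performance vs. parameter values
show_best(tune_res, metric = "roc_auc")
best_params <- select_best(tune_res, metric = "roc_auc")
final_wf <- finalize_workflow(wf, best_params)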

For the best model, present:

  • Numeric metric results

  • Appropriate visualizations (see below)

Evaluation Visualizations

Classification:

  • ROC or PR curves

  • Calibration curves

Regression:

  • Observed vs. predicted plots

  • Residual plots

See references/evaluation.md for metrics, visualizations, and implementation.
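
Possible implementations, assuming a data frame cv_preds of out-of-sample predictions with placeholder column names:

library(tidymodels)

# Classification: ROC curve (outcome is the truth factor, .pred_yes the positive-class probability)
roc_curve(cv_preds, truth = outcome, .pred_yes) %>% autoplot()

# Regression: observed vs. predicted
ggplot(cv_preds, aes(x = .pred, y = outcome)) +
  geom_abline(linetype = "dashed") +
  geom_point(alpha = 0.3) +
  coord_obs_pred()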

Final Model

Once the user selects a final model, fit it on the entire training set.
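
For example, assuming final_wf from the tuning step:

final_fit <- fit(final_wf, data = train_data)  # finalized workflow fit on all training rows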

Test Set Evaluation

After receiving user permission, evaluate on the test set with:

  • Numeric metrics

  • Same visualizations as model evaluation (ROC/PR curves, calibration, observed vs. predicted, residuals)
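
A sketch, to be run only after explicit permission (the outcome and probability column names are placeholders):

library(tidymodels)

test_preds <- augment(final_fit, new_data = test_data)

cls_metrics <- metric_set(roc_auc, brier_class, accuracy)
cls_metrics(test_preds, truth = outcome, estimate = .pred_class, .pred_yes)

roc_curve(test_preds, truth = outcome, .pred_yes) %>% autoplot()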