Numeric Predictor Transformations

Centering and Scaling (Standardization)

Centers predictors to mean zero and scales to unit variance. Essential for models that use distance metrics or dot products.

When to use: KNN, SVM, neural networks, regularized regression (lasso, ridge, elastic net), PCA.

Considerations:

  • Compute mean and SD from training data only; apply same values to test data

  • Not needed for tree-based models

tidymodels

library(recipes)

recipe(outcome ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

For min-max scaling to [0, 1] range:

recipe(outcome ~ ., data = train) |>
  step_range(all_numeric_predictors())

Symmetric Transformations

Transforms skewed predictors to approximate symmetry. Can improve performance for models sensitive to outliers or that assume normality.

When to use: Numeric predictors with heavy skew; particularly helpful for linear models and neural networks.

Options:

  • Yeo-Johnson: Handles zero and negative values

  • Box-Cox: Requires strictly positive values

  • orderNorm: Transforms to approximate normality via rank ordering

tidymodels

Yeo-Johnson (recommended default):

recipe(outcome ~ ., data = train) |>
  step_YeoJohnson(all_numeric_predictors())

Box-Cox (positive values only):

recipe(outcome ~ ., data = train) |>
  step_BoxCox(all_numeric_predictors())

orderNorm (from bestNormalize):

recipe(outcome ~ ., data = train) |>
  step_orderNorm(all_numeric_predictors())

Natural Splines

Creates basis functions to model nonlinear relationships between a predictor and outcome. Allows linear models to capture curvature.

When to use: Suspected nonlinear relationship between a numeric predictor and outcome; using a linear model.

Considerations:

  • Degrees of freedom (df) controls flexibility; higher df = more wiggly

  • Start with df = 3-5 and tune if needed

  • Not needed for tree-based models or GAMs (they handle nonlinearity internally)

tidymodels

recipe(outcome ~ ., data = train) |>
  step_ns(predictor_name, deg_free = 4)

Tunable degrees of freedom:

recipe(outcome ~ ., data = train) |>
  step_ns(predictor_name, deg_free = tune())

Interaction Terms

Creates products of predictors to model joint effects. Use when the effect of one predictor depends on another.

When to use: Domain knowledge suggests interaction; exploratory analysis shows effect modification.

Considerations:

  • Can dramatically increase feature count

  • Tree-based models capture interactions automatically

  • For linear models, specify interactions based on domain knowledge

tidymodels

Specific interactions:

recipe(outcome ~ ., data = train) |>
  step_interact(terms = ~ var1:var2)

All pairwise interactions among selected predictors:

recipe(outcome ~ ., data = train) |>
  step_interact(terms = ~ all_numeric_predictors():all_numeric_predictors())

Zero-Variance and Near-Zero-Variance Removal

Removes predictors with single unique value (zero variance) or very few unique values relative to sample size.

When to use: Always include zero-variance removal; near-zero-variance is optional but often helpful.

tidymodels

Zero-variance only:

recipe(outcome ~ ., data = train) |>
  step_zv(all_predictors())

Near-zero-variance (also catches problematic predictors):

recipe(outcome ~ ., data = train) |>
  step_nzv(all_predictors())