Categorical Predictor Encoding

Dummy Variables (Indicator Encoding)

Converts a categorical variable with K levels into K-1 binary columns (one level becomes the reference). Required for models that cannot handle factors directly (linear models, neural networks, SVMs, KNN).

When to use: Most models. The usual exceptions are tree-based methods, Cubist, and Naive Bayes, which can generally work with factor predictors directly.

Considerations:

  • Reference level choice affects coefficient interpretation but not predictions in unpenalized linear models; with regularized models the choice can change the fit

  • High-cardinality variables create many columns; consider target encoding instead
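
As a rough illustration of the K-1 versus K column counts described above, the following base R sketch uses model.matrix() rather than recipes. The toy data frame and the column name color are made up for this example.

```r
# Toy data: one factor with K = 3 levels (names are illustrative only).
train <- data.frame(
  color = factor(c("red", "green", "blue", "green"),
                 levels = c("red", "green", "blue"))
)

# Reference (treatment) coding: an intercept plus K - 1 indicator columns.
# The first level, "red", is the reference and gets no column of its own.
ref_coded <- model.matrix(~ color, data = train)
colnames(ref_coded)

# One-hot coding: dropping the intercept yields all K indicator columns,
# so every row has exactly one 1 across the indicator columns.
one_hot <- model.matrix(~ color - 1, data = train)
colnames(one_hot)
```

Changing the order of the factor levels changes which level acts as the reference in the first matrix; the one-hot matrix has no reference level.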

tidymodels

library(recipes)

recipe(outcome ~ ., data = train) |>
  step_dummy(all_factor_predictors())

To use one-hot encoding (K columns instead of K-1):

recipe(outcome ~ ., data = train) |>
  step_dummy(all_factor_predictors(), one_hot = TRUE)

Target Encoding (Effect Encoding)

Replaces categorical levels with a numeric value based on the outcome. Useful for high-cardinality categoricals where dummy encoding creates too many columns.

When to use: Categorical predictors with many levels (e.g., ZIP codes, product IDs).

Considerations:

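  • Risk of data leakage: the encoding is computed from the outcome, so it must be estimated inside each resample (on the analysis set only), not once on the full data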

  • Smoothing/regularization helps with rare levels

  • For regression, encodes the mean outcome per level; for two-class classification, the embed steps return the encoding on the log-odds scale rather than as a probability
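
To make the smoothing idea in the considerations above concrete, here is a minimal base R sketch of additive smoothing for a numeric outcome. It is a simplified stand-in, not the mixed-model approach used by embed; the data, the column names zip and y, the helper encode(), and the smoothing strength m are all assumptions made for illustration.

```r
# Toy training data (illustrative values only).
train <- data.frame(
  zip = c("A", "A", "A", "A", "B", "B", "C"),
  y   = c(10, 12, 11, 11, 20, 22, 40)
)

global_mean <- mean(train$y)                      # fallback for unseen levels
level_mean  <- tapply(train$y, train$zip, mean)   # raw per-level means
level_n     <- tapply(train$y, train$zip, length) # rows per level

# Additive smoothing: m behaves like m pseudo-observations at the global mean,
# so levels with few rows are pulled more strongly toward it.
m <- 2  # hypothetical smoothing strength; in practice this would be tuned
encoding <- (level_n * level_mean + m * global_mean) / (level_n + m)

# Hypothetical helper: look up the encoding for new values. Levels that were
# not seen in training fall back to the global mean.
encode <- function(x) {
  idx <- match(x, names(encoding))  # NA for unseen levels
  out <- as.vector(encoding)[idx]
  out[is.na(out)] <- global_mean
  out
}

encode(c("A", "C", "Z"))
```

The single-row level C is shrunk much further from its raw mean than the four-row level A. To avoid leakage, the encoding table would be built from the training (or analysis) rows only and then applied to held-out rows.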

tidymodels

Mixed-effects target encoding (recommended; partial pooling shrinks estimates for rare levels toward the overall mean):

library(recipes)
library(embed)

recipe(outcome ~ ., data = train) |>
  step_lencode_mixed(high_cardinality_var, outcome = vars(outcome))

Simple target encoding with a generalized linear model (no pooling across levels, so estimates for rare levels are noisier):

recipe(outcome ~ ., data = train) |>
  step_lencode_glm(high_cardinality_var, outcome = vars(outcome))

Novel Levels

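New categorical levels in test/production data that weren’t in training can cause errors or silently become missing values. Handle them by assigning novel levels to a dedicated catch-all level (step_novel uses “new” by default), to “other”, or to the most common level. Apply this step before dummy encoding so the catch-all level gets its own column.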
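
The base R sketch below shows the underlying problem and one way to handle it by hand. The vectors and the label "new" are assumptions chosen for this example; it is not how recipes implements the step internally.

```r
train_x <- factor(c("a", "b", "a", "c"))
test_x  <- c("a", "d", "b")  # "d" never appeared in training

# Coercing with only the training levels silently turns "d" into NA.
naive <- factor(test_x, levels = levels(train_x))
is.na(naive)

# Reserve an extra level up front and route unseen values to it.
new_label  <- "new"  # placeholder label chosen for this sketch
all_levels <- c(levels(train_x), new_label)
test_chr   <- ifelse(test_x %in% levels(train_x), test_x, new_label)
safe       <- factor(test_chr, levels = all_levels)

as.character(safe)
```

Because the extra level is part of the level set from the start, any downstream encoding sees the same columns at training and prediction time.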

tidymodels

recipe(outcome ~ ., data = train) |>
  step_novel(all_factor_predictors()) |>
  step_dummy(all_factor_predictors())

Infrequent Levels

Levels that appear rarely may not provide reliable signal and can cause issues in resampling: a rare level may be absent from some resamples, which leaves its dummy column with zero variance. Pool them into an “other” category.
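
A minimal base R sketch of the pooling idea follows, assuming a proportion threshold of 5%. The toy vector and the variable names are made up for illustration.

```r
# 100 rows: "a" and "b" are common; "c", "d", and "e" appear once each.
x <- c(rep("a", 60), rep("b", 37), "c", "d", "e")
threshold <- 0.05

# Proportion of rows per level, computed on the training data only.
props <- prop.table(table(x))
keep  <- names(props)[props >= threshold]

# Everything below the threshold is collapsed into a single "other" level.
pooled <- factor(ifelse(x %in% keep, x, "other"))
table(pooled)
```

The set of levels to keep is decided on the training data and then reused unchanged when new data are processed.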

tidymodels

Pool levels appearing in fewer than 5% of rows:

recipe(outcome ~ ., data = train) |>
  step_other(all_factor_predictors(), threshold = 0.05) |>
  step_dummy(all_factor_predictors())

Allow tuning of the pooling threshold (tune() comes from the tune package, so load tune or tidymodels first):

recipe(outcome ~ ., data = train) |>
  step_other(all_factor_predictors(), threshold = tune()) |>
  step_dummy(all_factor_predictors())