Missing Data Imputation
Overview
Imputation estimates missing values from observed data. It is required for any model that cannot handle NA values, which is most models; some tree-based implementations are the exception.
Key principle: Imputation must be performed within the resampling process. Compute imputation parameters (means, medians, model coefficients) from training data only.
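As a minimal sketch of this principle, the following uses prep() and bake() from recipes on hypothetical toy data (the `train` and `test` data frames and their values are invented for illustration). The median is estimated from the training rows only and then reused for the test rows, so the extreme test value has no influence on the imputed value.

```r
library(recipes)

# Hypothetical toy data: `x` has missing values in both splits
train <- data.frame(outcome = c(1, 2, 3, 4, 5), x = c(10, NA, 30, 40, 50))
test  <- data.frame(outcome = c(6, 7), x = c(NA, 1000))

rec <- recipe(outcome ~ ., data = train) |>
  step_impute_median(all_numeric_predictors())

# prep() estimates the median from the training rows only
fitted <- prep(rec, training = train)

# bake() applies the stored training median to new data
baked_test <- bake(fitted, new_data = test)
baked_test$x[1]  # 35, the median of c(10, 30, 40, 50)
```

Inside resampling, the same estimate-then-apply pattern is repeated for each analysis/assessment split. Bundling the recipe with a model in a workflow is one common way to have that done automatically.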
Simple Imputation
Replaces missing values with a single summary statistic.
Mean/Median imputation: Fast and simple. Median is more robust to outliers.
Mode imputation: For categorical variables, replace with most frequent level.
Considerations:
Distorts distributions and underestimates variance
Appropriate when missingness is low (<5%) and the mechanism is missing completely at random (MCAR)
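The variance point can be checked with a small base R simulation. This is only a sketch on simulated data (the sample size, seed, and 30% missingness rate are arbitrary choices for illustration): filling every gap with the observed mean adds values with zero deviation, so the variance of the imputed vector shrinks.

```r
set.seed(123)

# Simulated data with 30% of values removed completely at random
x <- rnorm(200, mean = 50, sd = 10)
x_missing <- x
x_missing[sample(200, 60)] <- NA

# Mean imputation: every missing value gets the observed mean
x_imputed <- x_missing
x_imputed[is.na(x_imputed)] <- mean(x_missing, na.rm = TRUE)

var_observed <- var(x_missing, na.rm = TRUE)  # variance of the 140 observed values
var_imputed  <- var(x_imputed)                # smaller: var_observed * 139 / 199
```

The shrink factor is (n_observed - 1) / (n_total - 1), so the distortion grows with the share of missing values.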
tidymodels
Numeric predictors with median:
library(recipes)
recipe(outcome ~ ., data = train) |>
  step_impute_median(all_numeric_predictors())
Numeric predictors with mean:
recipe(outcome ~ ., data = train) |>
  step_impute_mean(all_numeric_predictors())
Categorical predictors with mode:
recipe(outcome ~ ., data = train) |>
  step_impute_mode(all_factor_predictors())
K-Nearest Neighbors Imputation
Imputes missing values using the mean (numeric) or mode (categorical) of the K nearest neighbors based on non-missing predictors.
When to use: When relationships between predictors matter; suitable when data may be missing at random (MAR).
Considerations:
Computationally more expensive than simple imputation
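Distance calculations are sensitive to predictor scale; step_impute_knn() uses Gower's distance, which range-scales numeric predictors and handles categorical ones, so no separate normalization step is needed beforehand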
Typical K values: 5-10
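The mechanism can be sketched in base R for a single numeric predictor. This is an illustration only: the toy data frame `dat` is hypothetical, and the distance is a plain absolute difference rather than the Gower distance that step_impute_knn() uses.

```r
# Hypothetical toy data: impute the missing `y` in the last row
dat <- data.frame(
  x = c(1, 2, 3, 10, 11, 12, 2.5),
  y = c(10, 12, 14, 50, 52, 54, NA)
)
k <- 3

target <- which(is.na(dat$y))   # row to impute
donors <- which(!is.na(dat$y))  # rows with an observed `y`

# Distance on the non-missing predictor (absolute difference, one predictor)
d <- abs(dat$x[donors] - dat$x[target])
nearest <- donors[order(d)[seq_len(k)]]

# Numeric target: take the mean of the K nearest donors
dat$y[target] <- mean(dat$y[nearest])
dat$y[target]  # 12, the mean of rows 2, 3, and 1
```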
tidymodels
recipe(outcome ~ ., data = train) |>
  step_impute_knn(all_predictors(), neighbors = 5)
Bagged Tree Imputation
Uses bagged decision trees to predict missing values from other predictors.
When to use: When relationships between predictors are complex or nonlinear; can handle mixed predictor types.
Considerations:
More computationally expensive than KNN
Often produces better imputations for complex data
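The idea behind bagging can be sketched with rpart: fit one tree per bootstrap resample of the complete rows, then average the predictions for the rows to impute. The simulated data, the seed, and the choice of 25 trees are assumptions for illustration, and the internals of step_impute_bag() may differ from this sketch.

```r
library(rpart)
set.seed(1)

# Simulated nonlinear relationship with 20 values of `y` removed
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.1)
dat$y[sample(n, 20)] <- NA

missing  <- is.na(dat$y)
complete <- dat[!missing, ]

# One tree per bootstrap resample; each column holds one tree's predictions
preds <- sapply(1:25, function(i) {
  boot <- complete[sample(nrow(complete), replace = TRUE), ]
  fit  <- rpart(y ~ x, data = boot)
  predict(fit, newdata = dat[missing, , drop = FALSE])
})

# Bagged imputation: average across trees for each missing row
imputed <- rowMeans(preds)
dat$y[missing] <- imputed
```

Fitting many trees is what makes this approach costlier than a single neighbor search.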
tidymodels
recipe(outcome ~ ., data = train) |>
  step_impute_bag(all_predictors())
Linear Model Imputation
Predicts missing values in a numeric predictor using a linear regression model fit on other predictors. step_impute_linear() imputes numeric variables only.
When to use: When linear relationships exist between predictors.
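In base R the mechanism amounts to fitting lm() on the complete rows and predicting the missing ones. The toy data below is hypothetical and constructed so that y = 2 * x1 + 1 exactly, which makes the imputed value easy to check by hand.

```r
# Hypothetical toy data where y = 2 * x1 + 1
dat <- data.frame(
  x1 = c(1, 2, 3, 4, 5),
  y  = c(3, 5, NA, 9, 11)
)

# lm() drops rows with a missing response by default
fit <- lm(y ~ x1, data = dat)

# Predict only the rows where `y` is missing
missing <- is.na(dat$y)
dat$y[missing] <- predict(fit, newdata = dat[missing, , drop = FALSE])
dat$y[3]  # approximately 7
```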
tidymodels
recipe(outcome ~ ., data = train) |>
  step_impute_linear(
    predictor_with_missing,
    impute_with = imp_vars(predictor1, predictor2)
  )
Missing Value Indicators
Creates binary indicator columns flagging which values were originally missing. Can help if missingness itself is informative.
When to use: When the pattern of missingness may predict the outcome.
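A base R sketch of the idea, using a hypothetical `income` column (the indicator name `na_ind_income` is chosen here for illustration). The indicator has to be recorded before imputing, because afterwards the information about which values were missing is gone.

```r
# Hypothetical toy data with two missing incomes
dat <- data.frame(income = c(50, NA, 70, NA))

# 1. Record missingness first
dat$na_ind_income <- as.integer(is.na(dat$income))

# 2. Then impute; the indicator still marks the originally missing rows
dat$income[is.na(dat$income)] <- median(dat$income, na.rm = TRUE)

dat$na_ind_income  # 0 1 0 1
dat$income         # 50 60 70 60
```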
tidymodels
recipe(outcome ~ ., data = train) |>
  step_indicate_na(all_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_factor_predictors())
Recommended Order
When combining imputation with other steps:
recipe(outcome ~ ., data = train) |>
  # 1. Handle novel levels first (before any encoding)
  step_novel(all_factor_predictors()) |>
  # 2. Impute missing values
  step_impute_knn(all_predictors()) |>
  # 3. Then proceed with other transformations
  step_dummy(all_factor_predictors()) |>
  step_normalize(all_numeric_predictors())