Correlation Reduction
Overview
Highly correlated predictors can cause problems:
Unstable coefficient estimates in linear models
Inflated variance in predictions
Redundant information consuming model capacity
Two main strategies: remove correlated predictors or extract new features that combine them.
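The instability from collinearity is easy to see directly. A minimal sketch (all object names illustrative): two nearly identical predictors fit together produce far larger coefficient standard errors than either alone.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1
y  <- x1 + rnorm(n)

cor(x1, x2)                      # close to 1

# Compare standard errors: inflated when both predictors are included
summary(lm(y ~ x1 + x2))$coefficients
summary(lm(y ~ x1))$coefficients
```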
Correlation Filters
Removes predictors that are highly correlated with others. When a pair of predictors exceeds the correlation threshold, the one with the larger mean absolute correlation to the remaining predictors is removed.
When to use: Want interpretable features; correlation is the main concern.
Considerations:
Threshold typically 0.7-0.9
Only addresses pairwise linear correlation
Which predictor is kept can be somewhat arbitrary
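The same filter can be run outside a recipe with caret::findCorrelation(), the function step_corr() is modeled on. A sketch with illustrative names:

```r
library(caret)

set.seed(1)
x <- matrix(rnorm(100 * 4), ncol = 4)
x[, 2] <- x[, 1] + rnorm(100, sd = 0.1)   # make columns 1 and 2 highly correlated

cor_mat <- cor(x)
drop    <- findCorrelation(cor_mat, cutoff = 0.9)  # column indices to remove
x_kept  <- x[, -drop, drop = FALSE]                # retained predictors
```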
tidymodels
library(recipes)
recipe(outcome ~ ., data = train) |>
step_corr(all_numeric_predictors(), threshold = 0.9)

Tunable threshold:
recipe(outcome ~ ., data = train) |>
step_corr(all_numeric_predictors(), threshold = tune())

Principal Component Analysis (PCA)
Transforms correlated predictors into uncorrelated principal components. Components are ordered by variance explained.
When to use: Many correlated predictors; willing to sacrifice interpretability for dimensionality reduction.
Considerations:
Requires centering and scaling first
Must decide how many components to retain
Components are linear combinations of the original predictors and are not directly interpretable
Works well with neural networks and regularized regression
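Outside of a recipe, the variance-explained trade-off can be inspected with base R's prcomp(). A sketch using a built-in data set for illustration:

```r
# Center and scale, then compute principal components
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the first k components
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching 95% of total variance
num_comp <- which(var_explained >= 0.95)[1]
```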
tidymodels
Keep components explaining 95% of variance:
recipe(outcome ~ ., data = train) |>
step_normalize(all_numeric_predictors()) |>
step_pca(all_numeric_predictors(), threshold = 0.95)

Keep a fixed number of components:
recipe(outcome ~ ., data = train) |>
step_normalize(all_numeric_predictors()) |>
step_pca(all_numeric_predictors(), num_comp = 5)

Tunable number of components:
recipe(outcome ~ ., data = train) |>
step_normalize(all_numeric_predictors()) |>
step_pca(all_numeric_predictors(), num_comp = tune())

Linear Combinations
Detects and removes predictors that are exact linear combinations of others. This is a stronger condition than correlation.
When to use: Dummy variables from hierarchical categories; derived features that sum to a constant.
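caret::findLinearCombos() implements the same idea and makes the detection easy to see. A sketch with illustrative data, where one column is the exact sum of the others:

```r
library(caret)

x <- cbind(a = c(1, 2, 3, 4),
           b = c(2, 1, 4, 3),
           c = c(5, 5, 6, 6))
x <- cbind(x, total = rowSums(x))   # total = a + b + c, an exact dependency

combos <- findLinearCombos(x)
combos$remove   # column index suggested for removal
```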
tidymodels
recipe(outcome ~ ., data = train) |>
step_lincomb(all_numeric_predictors())

Typical Recipe Order
When combining correlation reduction with other steps:
recipe(outcome ~ ., data = train) |>
# 1. Handle categoricals
step_novel(all_factor_predictors()) |>
step_dummy(all_factor_predictors()) |>
# 2. Remove zero-variance
step_zv(all_predictors()) |>
# 3. Handle missing data
step_impute_knn(all_numeric_predictors()) |>
# 4. Normalize before PCA
step_normalize(all_numeric_predictors()) |>
# 5. Then reduce correlation
step_pca(all_numeric_predictors(), threshold = 0.95)

Or with a correlation filter instead of PCA:
recipe(outcome ~ ., data = train) |>
step_novel(all_factor_predictors()) |>
step_dummy(all_factor_predictors()) |>
step_zv(all_predictors()) |>
step_impute_knn(all_numeric_predictors()) |>
step_corr(all_numeric_predictors(), threshold = 0.9) |>
step_normalize(all_numeric_predictors())
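To apply a recipe outside of a workflow, estimate it on the training data with prep() and transform data with bake(). A sketch, assuming the `train` and `test` data frames from the examples above:

```r
library(recipes)

rec <- recipe(outcome ~ ., data = train) |>
  step_corr(all_numeric_predictors(), threshold = 0.9)

rec_prepped <- prep(rec, training = train)   # learns which columns to drop
bake(rec_prepped, new_data = test)           # applies the filter to new data
tidy(rec_prepped, number = 1)                # lists the removed predictors
```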