Feature Engineering Reference
Common Techniques
| Technique | When to Use |
|---|---|
| Zero-variance removal | Always check for predictors with a single unique value |
| Dummy variables | Convert categorical predictors to binary 0/1 indicators |
| Target encoding | Categorical predictors with many levels → single numeric column |
| Centering & scaling | Models using distance metrics or dot products |
| Symmetric transformations | Skewed numeric predictors (Yeo-Johnson or orderNorm) |
| Imputation | Missing predictor values; estimate from other columns |
| Correlation reduction | Feature extraction (e.g., PCA) or unsupervised correlation filter |
| Spline terms | Nonlinear relationships between single predictors and outcome |
| Interaction terms | Joint effects of two or more predictors |
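Two of the techniques in the table, zero-variance removal and dummy variables, can be sketched in a few lines of pure Python. The column names and data below are hypothetical:

```python
# Zero-variance removal and dummy variables on hypothetical data.
# The "constant" column has a single unique value and is dropped; the
# categorical "color" column becomes binary 0/1 indicator columns.

rows = [
    {"color": "red",  "constant": 1, "x": 3.0},
    {"color": "blue", "constant": 1, "x": 1.5},
    {"color": "red",  "constant": 1, "x": 2.2},
]

# Zero-variance removal: keep only columns with more than one unique value.
keep = [c for c in rows[0] if len({r[c] for r in rows}) > 1]
rows = [{c: r[c] for c in keep} for r in rows]

# Dummy variables: one 0/1 indicator column per observed level of "color".
levels = sorted({r["color"] for r in rows})
encoded = [
    {**{f"color_{lvl}": int(r["color"] == lvl) for lvl in levels}, "x": r["x"]}
    for r in rows
]
```

In practice the set of retained columns and the category levels are learned from the training data only, then applied unchanged to new data.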
Detailed Implementation
For methodology and code examples:
- Categorical predictors: dummy variables, target encoding
- Numeric predictors: scaling, transformations, splines
- Missing data: imputation strategies
- Correlation reduction: PCA, filters
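As a minimal illustration of the imputation bullet above, the simplest strategy replaces each missing value with the column mean. The data are hypothetical, and real pipelines estimate the imputation value on training data only:

```python
# Mean imputation: fill missing (None) entries with the mean of the
# observed entries in the same column. Hypothetical data.

def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [34.0, None, 28.0, None, 40.0]
filled = mean_impute(ages)  # missing entries become the mean, 34.0
```

More sophisticated strategies (k-nearest neighbors, bagged trees) estimate each missing value from the other columns rather than from a single column statistic.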
Model-Specific Requirements
Linear Models
Ordinary linear/logistic/multinomial regression:
- Mandatory: indicator variables, zero-variance removal, complete data
- Helpful: interaction terms, spline terms, reducing correlation, symmetric distributions
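Interaction terms, listed as helpful for linear models, are just product columns added before fitting. A minimal sketch with hypothetical predictors:

```python
# An interaction term for a linear model is the elementwise product of
# two predictor columns. Hypothetical data.

x1 = [1.0, 2.0, 3.0]
x2 = [4.0, 5.0, 6.0]
interaction = [a * b for a, b in zip(x1, x2)]  # the x1:x2 column
```

The model then estimates a separate coefficient for the product column, allowing the effect of one predictor to depend on the other.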
Regularized linear/logistic/multinomial regression:
- Mandatory: indicator variables, zero-variance removal, standardized scale, complete data
- Helpful: interaction terms, spline terms, reducing correlation, symmetric distributions
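Standardized scale is mandatory here because the penalty treats all coefficients alike, so predictors must be comparable in magnitude. A minimal centering-and-scaling sketch (hypothetical data; the statistics come from the training set only):

```python
# Center to mean 0 and scale to unit standard deviation (z-scoring).
# The mean and sd are estimated on training data and reused for new data.

def fit_standardizer(column):
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    return mean, var ** 0.5

def standardize(column, mean, sd):
    return [(v - mean) / sd for v in column]

train = [10.0, 20.0, 30.0]
mean, sd = fit_standardizer(train)     # mean = 20.0
scaled = standardize(train, mean, sd)  # centered at 0, unit variance
```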
Distance-Based Models
K-nearest neighbors:
- Mandatory: indicator variables, zero-variance removal, standardized scale, complete data
- Helpful: symmetric distributions
Support Vector Machines:
- Mandatory: indicator variables, zero-variance removal, standardized scale, complete data
- Helpful: symmetric distributions
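Standardized scale is mandatory for both of these models because they work with distances (or dot products): without scaling, the predictor measured in the largest units dominates. A hypothetical two-predictor example, income in dollars and age in years:

```python
# Euclidean distance on unscaled predictors. The income gap of $100
# contributes 100**2 = 10000 to the squared distance, while the much
# larger relative age gap of 35 years contributes only 35**2 = 1225.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

a = (50_000.0, 25.0)  # (income, age)
b = (50_100.0, 60.0)
d = euclidean(a, b)
```

After z-scoring both columns, each predictor contributes on a comparable scale.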
Additive and Spline Models
Generalized Additive Models:
- Mandatory: indicator variables, zero-variance removal, complete data
- Helpful: symmetric distributions
Multivariate Adaptive Regression Splines (MARS):
- Mandatory: indicator variables, complete data
- Helpful: symmetric distributions
Probabilistic Models
Naive Bayes:
- Mandatory: zero-variance removal
- Helpful: none
Neural Networks
Neural networks:
- Mandatory: indicator variables, zero-variance removal, reducing correlation, standardized scale, complete data
- Helpful: symmetric distributions
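Reducing correlation, mandatory here, is commonly done with PCA. A minimal NumPy sketch on two highly correlated hypothetical predictors (a real pipeline would standardize first and fit the rotation on training data only):

```python
# PCA via the eigendecomposition of the sample covariance matrix.
# The rotated score columns are uncorrelated with each other.
import numpy as np

X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1]])
Xc = X - X.mean(axis=0)                 # center each column
cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
scores = Xc @ eigvecs                   # rotated, decorrelated columns
```

Keeping only the components with the largest eigenvalues reduces the predictor set while discarding little variance.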
Tree-Based Models
Single tree models:
- Mandatory: none
- Helpful: none
Tree ensemble models (random forest, boosting):
- Mandatory: complete data (for most implementations)
Rule-Based Models
RuleFit:
- Mandatory: indicator variables, zero-variance removal, standardized scale, complete data
- Helpful: none
Cubist:
- Mandatory: none
- Helpful: none