Feature Engineering Reference
Common Techniques
| Technique | When to Use |
|---|---|
| Zero-variance removal | Always check for predictors with a single unique value |
| Dummy variables | Convert categorical predictors to binary 0/1 indicators |
| Target encoding | Categorical predictors with many levels → single numeric column |
| Centering & scaling | Models using distance metrics or dot products |
| Symmetric transformations | Skewed numeric predictors (Yeo-Johnson or orderNorm) |
| Imputation | Missing predictor values; estimate them from other columns |
| Correlation reduction | Highly correlated predictors; use feature extraction (e.g., PCA) or an unsupervised correlation filter |
| Spline terms | Nonlinear relationships between single predictors and outcome |
| Interaction terms | Joint effects of two or more predictors |
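
The first few rows of the table can be sketched in plain Python. This is a minimal illustration using only the standard library, not a production implementation; the function names (`drop_zero_variance`, `dummy_encode`, `standardize`) and the toy data are assumptions made up for this example.

```python
from statistics import mean, pstdev


def drop_zero_variance(columns):
    """Keep only columns that contain more than one distinct value."""
    return {name: values for name, values in columns.items() if len(set(values)) > 1}


def dummy_encode(values, drop_first=True):
    """Turn one categorical column into 0/1 indicator columns."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # the dropped level becomes the reference level
    return {f"is_{level}": [int(v == level) for v in values] for level in levels}


def standardize(values):
    """Center to mean 0 and scale to standard deviation 1.

    Assumes the column is not constant; run the zero-variance filter first,
    otherwise the standard deviation is 0 and the division fails.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Toy data (hypothetical): "country" has a single unique value.
data = {
    "area": [50.0, 80.0, 120.0, 150.0],
    "country": ["US", "US", "US", "US"],
}
kept = drop_zero_variance(data)
indicators = dummy_encode(["brick", "wood", "brick", "stone"])
scaled_area = standardize(kept["area"])
```

Here `kept` retains only `area`, and `indicators` holds one column each for `stone` and `wood`, with `brick` as the reference level. Dropping one level avoids a redundant column in models with an intercept; whether to drop it depends on the model.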
Detailed Implementation
For methodology and code examples:
Categorical predictors: dummy variables, target encoding
Numeric predictors: scaling, transformations, splines
Missing data: imputation strategies
Correlation reduction: PCA, filters
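
As a rough illustration of target encoding, the sketch below replaces each category level with a shrunken mean of the outcome. The shrinkage scheme and the `prior_weight` parameter are assumptions chosen for this example (one common variant among several), not a method prescribed by this reference.

```python
from collections import defaultdict


def target_encode(categories, outcomes, prior_weight=1.0):
    """Map each level to a shrunken mean of the outcome.

    prior_weight is a hypothetical smoothing parameter: it acts like that many
    pseudo-observations at the global mean, so rare levels are pulled toward it.
    """
    global_mean = sum(outcomes) / len(outcomes)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(categories, outcomes):
        sums[level] += y
        counts[level] += 1
    encoding = {
        level: (sums[level] + prior_weight * global_mean) / (counts[level] + prior_weight)
        for level in counts
    }
    return encoding, global_mean


# Toy training data (made up for illustration).
cities = ["A", "A", "A", "B", "C", "C"]
prices = [10.0, 12.0, 14.0, 30.0, 20.0, 22.0]
encoding, fallback = target_encode(cities, prices)

# Levels never seen in training (here "D") fall back to the global mean.
encoded_column = [encoding.get(city, fallback) for city in cities + ["D"]]
```

The encoding should be estimated on the training data only and then applied to new data; computing it on the full data set leaks outcome information into the predictors.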
Model-Specific Requirements
Linear Models
Ordinary linear/logistic/multinomial regression:
Mandatory: indicator variables, zero-variance removal, complete data
Helpful: interaction terms, spline terms, reducing correlation, symmetric distributions
Regularized linear/logistic/multinomial regression:
Mandatory: indicator variables, zero-variance removal, standardized scale, complete data
Helpful: interaction terms, spline terms, reducing correlation, symmetric distributions
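
One way to see why a standardized scale is mandatory for regularized models but not for ordinary regression: the penalty acts on the coefficient, so changing a predictor's units changes the fit. The sketch below assumes the simplest case (one predictor, no intercept, ridge penalty) and uses made-up data; the function names are hypothetical.

```python
def ridge_slope(x, y, penalty):
    """Closed-form slope for a no-intercept, one-predictor ridge fit."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + penalty)


def fitted(x, y, penalty):
    """Fitted values from the one-predictor ridge fit."""
    slope = ridge_slope(x, y, penalty)
    return [slope * xi for xi in x]


# The same predictor expressed in metres and in centimetres.
x_m = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
x_cm = [100.0 * xi for xi in x_m]

# penalty = 0 is ordinary least squares: the fit does not depend on the units.
ols_m, ols_cm = fitted(x_m, y, 0.0), fitted(x_cm, y, 0.0)

# With a nonzero penalty the fitted values change with the units.
ridge_m, ridge_cm = fitted(x_m, y, 10.0), fitted(x_cm, y, 10.0)
```

With the predictor in centimetres the same penalty shrinks the fit far less, so the amount of regularization would depend on an arbitrary choice of units. Standardizing all predictors first puts them on a common footing.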
Distance-Based Models
K-nearest neighbors:
Mandatory: indicator variables, zero-variance removal, standardized scale, complete data
Helpful: symmetric distributions
Support Vector Machines:
Mandatory: indicator variables, zero-variance removal, standardized scale, complete data
Helpful: symmetric distributions
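
The scaling requirement for distance-based models can be illustrated with a small nearest-neighbor search. This is a sketch with made-up data and hypothetical helper names: on the raw scale the income column dominates the Euclidean distance, and after standardization the nearest neighbor changes.

```python
from math import dist
from statistics import mean, pstdev


def nearest(query, rows):
    """Index of the row closest to query in Euclidean distance."""
    return min(range(len(rows)), key=lambda i: dist(query, rows[i]))


def standardize_columns(rows):
    """Center and scale every column to mean 0, standard deviation 1."""
    columns = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]


# Columns: income in dollars, age in years (made-up values).
rows = [(50_000.0, 25.0), (51_000.0, 60.0), (58_000.0, 26.0), (90_000.0, 40.0)]

# Raw scale: the neighbor of the first row is chosen almost entirely by income.
raw_neighbor = nearest(rows[0], rows[1:])

# Standardized scale: income and age contribute comparably.
scaled = standardize_columns(rows)
scaled_neighbor = nearest(scaled[0], scaled[1:])
```

On the raw data the closest candidate is the 60-year-old with a similar income (index 0); after standardization it is the 26-year-old (index 1), whose age and income are both reasonably close.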
Additive and Spline Models
Generalized Additive Models:
Mandatory: indicator variables, zero-variance removal, complete data
Helpful: symmetric distributions
Multivariate Adaptive Regression Splines (MARS):
Mandatory: indicator variables, complete data
Helpful: symmetric distributions
Probabilistic Models
Naive Bayes:
Mandatory: zero-variance removal
Helpful: none
Neural Networks
Neural networks:
Mandatory: indicator variables, zero-variance removal, reducing correlation, standardized scale, complete data
Helpful: symmetric distributions
Tree-Based Models
Single tree models:
Mandatory: none
Helpful: none
Tree ensemble models (random forest, boosting):
Mandatory: complete data (for most implementations)
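
Tree-based models need no scaling or symmetry transformations because a split depends only on the ordering of a predictor's values. The sketch below is an illustration under simplifying assumptions (a single regression split chosen by squared error, made-up data, a hypothetical `best_split` helper): it checks that a log transform of the predictor leaves the chosen partition unchanged.

```python
from math import log


def best_split(x, y):
    """Return the set of row indices on the left side of the best single split.

    The split minimizes the total within-group sum of squared errors of y.
    """
    def sse(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    order = sorted(range(len(x)), key=lambda i: x[i])
    best_cost, best_left = float("inf"), None
    for k in range(1, len(order)):
        if x[order[k - 1]] == x[order[k]]:
            continue  # cannot split between tied values
        left, right = order[:k], order[k:]
        cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
        if cost < best_cost:
            best_cost, best_left = cost, set(left)
    return best_left


# A heavily skewed predictor (made-up values).
x = [1.0, 10.0, 100.0, 1000.0, 10000.0]
y = [1.0, 1.2, 0.9, 5.1, 4.8]

# A monotone transform keeps the ordering, so the same rows end up together.
same_partition = best_split(x, y) == best_split([log(v) for v in x], y)
```

The split threshold itself changes with the transform, but the rows on each side do not, so the resulting predictions are the same.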
Rule-Based Models
RuleFit:
Mandatory: indicator variables, zero-variance removal, standardized scale, complete data
Helpful: none
Cubist:
Mandatory: none
Helpful: none