Feature Engineering Reference
Common Techniques
| Technique | When to Use |
|---|---|
| Zero-variance removal | Always check for predictors with a single unique value |
| Dummy variables | Convert categorical predictors to binary 0/1 indicators |
| Target encoding | Categorical predictors with many levels → single numeric column |
| Centering & scaling | Models using distance metrics or dot products |
| Symmetric transformations | Skewed numeric predictors (Yeo-Johnson or orderNorm) |
| Imputation | Missing predictor values; estimate from other columns |
| Correlation reduction | Feature extraction (e.g., PCA) or unsupervised correlation filter |
| Spline terms | Nonlinear relationships between single predictors and outcome |
| Interaction terms | Joint effects of two or more predictors |
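Two of the techniques in the table, zero-variance removal and dummy variables, can be sketched in a few lines of pure Python. The column names and data below are hypothetical:

```python
# Zero-variance removal and dummy variables on hypothetical data.
# The "constant" column has a single unique value and is dropped; the
# categorical "color" column becomes binary 0/1 indicator columns.

rows = [
    {"color": "red",  "constant": 1, "x": 3.0},
    {"color": "blue", "constant": 1, "x": 1.5},
    {"color": "red",  "constant": 1, "x": 2.2},
]

# Zero-variance removal: keep only columns with more than one unique value.
keep = [c for c in rows[0] if len({r[c] for r in rows}) > 1]
rows = [{c: r[c] for c in keep} for r in rows]

# Dummy variables: one 0/1 indicator column per observed level of "color".
levels = sorted({r["color"] for r in rows})
encoded = [
    {**{f"color_{lvl}": int(r["color"] == lvl) for lvl in levels}, "x": r["x"]}
    for r in rows
]
```

In practice the set of retained columns and the category levels are learned from the training data only, then applied unchanged to new data.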
Detailed Implementation
For methodology and code examples:
- Categorical predictors: dummy variables, target encoding
- Numeric predictors: scaling, transformations, splines
- Missing data: imputation strategies
- Correlation reduction: PCA, filters
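As a minimal illustration of the imputation bullet above, the simplest strategy replaces each missing value with the column mean. The data are hypothetical, and real pipelines estimate the imputation value on training data only:

```python
# Mean imputation: fill missing (None) entries with the mean of the
# observed entries in the same column. Hypothetical data.

def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [34.0, None, 28.0, None, 40.0]
filled = mean_impute(ages)  # missing entries become the mean, 34.0
```

More sophisticated strategies (k-nearest neighbors, bagged trees) estimate each missing value from the other columns rather than from a single column statistic.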
Model-Specific Requirements
Linear Models
Ordinary linear/logistic/multinomial regression:
- Mandatory: indicator variables, zero-variance removal, complete data
- Helpful: interaction terms, spline terms, reducing correlation, symmetric distributions
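Interaction terms, listed as helpful for linear models, are just product columns added before fitting. A minimal sketch with hypothetical predictors:

```python
# An interaction term for a linear model is the elementwise product of
# two predictor columns. Hypothetical data.

x1 = [1.0, 2.0, 3.0]
x2 = [4.0, 5.0, 6.0]
interaction = [a * b for a, b in zip(x1, x2)]  # the x1:x2 column
```

The model then estimates a separate coefficient for the product column, allowing the effect of one predictor to depend on the other.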
Regularized linear/logistic/multinomial regression:
- Mandatory: indicator variables, zero-variance removal, standardized scale, complete data
- Helpful: interaction terms, spline terms, reducing correlation, symmetric distributions
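Standardized scale is mandatory here because the penalty treats all coefficients alike, so predictors must be comparable in magnitude. A minimal centering-and-scaling sketch (hypothetical data; the statistics come from the training set only):

```python
# Center to mean 0 and scale to unit standard deviation (z-scoring).
# The mean and sd are estimated on training data and reused for new data.

def fit_standardizer(column):
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    return mean, var ** 0.5

def standardize(column, mean, sd):
    return [(v - mean) / sd for v in column]

train = [10.0, 20.0, 30.0]
mean, sd = fit_standardizer(train)     # mean = 20.0
scaled = standardize(train, mean, sd)  # centered at 0, unit variance
```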
Distance-Based Models
K-nearest neighbors:
- Mandatory: indicator variables, zero-variance removal, standardized scale, complete data
- Helpful: symmetric distributions
Support Vector Machines:
- Mandatory: indicator variables, zero-variance removal, standardized scale, complete data
- Helpful: symmetric distributions
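Standardized scale is mandatory for both of these models because they work with distances (or dot products): without scaling, the predictor measured in the largest units dominates. A hypothetical two-predictor example, income in dollars and age in years:

```python
# Euclidean distance on unscaled predictors. The income gap of $100
# contributes 100**2 = 10000 to the squared distance, while the much
# larger relative age gap of 35 years contributes only 35**2 = 1225.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

a = (50_000.0, 25.0)  # (income, age)
b = (50_100.0, 60.0)
d = euclidean(a, b)
```

After z-scoring both columns, each predictor contributes on a comparable scale.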
Additive and Spline Models
Generalized Additive Models:
- Mandatory: indicator variables, zero-variance removal, complete data
- Helpful: symmetric distributions
Multivariate Adaptive Regression Splines (MARS):
- Mandatory: indicator variables, complete data
- Helpful: symmetric distributions
Probabilistic Models
Naive Bayes:
- Mandatory: zero-variance removal
- Helpful: none
Neural Networks
Neural networks:
- Mandatory: indicator variables, zero-variance removal, reducing correlation, standardized scale, complete data
- Helpful: symmetric distributions
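Reducing correlation, mandatory here, is commonly done with PCA. A minimal NumPy sketch on two highly correlated hypothetical predictors (a real pipeline would standardize first and fit the rotation on training data only):

```python
# PCA via the eigendecomposition of the sample covariance matrix.
# The rotated score columns are uncorrelated with each other.
import numpy as np

X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1]])
Xc = X - X.mean(axis=0)                 # center each column
cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
scores = Xc @ eigvecs                   # rotated, decorrelated columns
```

Keeping only the components with the largest eigenvalues reduces the predictor set while discarding little variance.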
Tree-Based Models
Single tree models:
- Mandatory: none
- Helpful: none
Tree ensemble models (random forest, boosting):
- Mandatory: complete data (for most implementations)
Rule-Based Models
RuleFit:
- Mandatory: indicator variables, zero-variance removal, standardized scale, complete data
- Helpful: none
Cubist:
- Mandatory: none
- Helpful: none