Understanding Recipe Step Architecture
Before implementing a recipe step, understand the recipe step architecture and workflow.
Note for Source Development: If you’re contributing directly to the recipes package, you can use internal helper functions like recipes_eval_select(), check_type(), and get_case_weights() without the recipes:: prefix. See the Source Development Guide for details.
Reference implementations showing complete architecture:
- Simple steps: R/center.R, R/scale.R (modify-in-place pattern)
- Complex steps: R/dummy.R, R/pca.R (create-new-columns pattern)
- Row operations: R/filter.R, R/sample.R (skip behavior)
The Three-Function Pattern
Every recipe step consists of three functions:
1. Step constructor (e.g., step_center())
User-facing function that:
- Captures user arguments
- Uses enquos(...) to capture variable selections
- Returns recipe with step added via add_step()
step_center <- function(recipe, ..., role = NA, trained = FALSE,
                        means = NULL, na_rm = TRUE, skip = FALSE,
                        id = rand_id("center")) {
  add_step(
    recipe,
    step_center_new(
      terms = enquos(...),
      role = role,
      trained = trained,
      means = means,
      na_rm = na_rm,
      skip = skip,
      id = id
    )
  )
}
2. Step initialization (e.g., step_center_new())
Internal constructor that:
- Is a minimal function with no defaults
- Calls step(subclass = "name", ...) to create S3 object
step_center_new <- function(terms, role, trained, means, na_rm, skip, id) {
  step(
    subclass = "center",
    terms = terms,
    role = role,
    trained = trained,
    means = means,
    na_rm = na_rm,
    skip = skip,
    id = id
  )
}
3. S3 methods
Required methods for every step:
- prep.step_*() - Estimates parameters from training data
- bake.step_*() - Applies transformation to new data
- print.step_*() - Displays step in recipe summary
- tidy.step_*() - Returns step information as tibble
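The print and tidy methods are listed above but not shown alongside the constructor code. As a rough sketch of what they might look like, the block below defines both for a hypothetical step class called step_demo_center; the class name, the title string, and the example means are illustrative assumptions, not taken from the recipes source. It relies on the exported recipes helpers print_step(), is_trained(), and sel2char().

```r
library(recipes)

# Hypothetical step class "step_demo_center" (illustrative name only).
print.step_demo_center <- function(x, width = max(20, options()$width - 30), ...) {
  title <- "Demo centering for "
  # print_step() shows resolved column names once trained, selectors otherwise
  print_step(names(x$means), x$terms, x$trained, title, width)
  invisible(x)
}

tidy.step_demo_center <- function(x, ...) {
  if (is_trained(x)) {
    # After prep(): one row per column with its learned mean
    res <- tibble::tibble(terms = names(x$means), value = unname(x$means))
  } else {
    # Before prep(): only the selector text is known
    res <- tibble::tibble(terms = sel2char(x$terms), value = NA_real_)
  }
  res$id <- x$id
  res
}

# Build a trained step object by hand to exercise the methods (made-up means)
trained_step <- recipes::step(
  subclass = "demo_center",
  terms = rlang::quos(disp, hp),
  role = NA,
  trained = TRUE,
  means = c(disp = 230.7, hp = 146.7),
  na_rm = TRUE,
  skip = FALSE,
  id = "demo_center_1"
)

tidy_out <- tidy(trained_step)
tidy_out
```

Constructing the step object directly keeps the sketch short; in a real step the object would come from the constructor and prep() as described in this section.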
The prep/bake Workflow
prep() - Training phase
Prep resolves variable selections and learns parameters from training data:
prep.step_center <- function(x, training, info = NULL, ...) {
  # 1. Resolve variable selections to actual column names
  col_names <- recipes_eval_select(x$terms, training, info)
  # 2. Validate column types
  check_type(training[, col_names], types = c("double", "integer"))
  # 3. Compute statistics/parameters from training data
  means <- colMeans(training[, col_names], na.rm = x$na_rm)
  # 4. Store learned parameters in step object
  # 5. Return updated step with trained = TRUE
  step_center_new(
    terms = x$terms,
    role = x$role,
    trained = TRUE,
    means = means,
    na_rm = x$na_rm,
    skip = x$skip,
    id = x$id
  )
}
prep() responsibilities:
- Resolve variable selections (e.g., all_numeric() → actual column names)
- Validate column types
- Compute statistics/parameters from training data
- Store learned parameters in step object
- Return updated step with trained = TRUE
bake() - Application phase
Bake applies the transformation using stored parameters:
bake.step_center <- function(object, new_data, ...) {
  # 1. Get column names from trained step
  col_names <- names(object$means)
  # 2. Validate required columns exist in new data
  check_new_data(col_names, object, new_data)
  # 3. Apply transformation using stored parameters
  for (col in col_names) {
    new_data[[col]] <- new_data[[col]] - object$means[[col]]
  }
  # 4. Return modified data
  new_data
}
bake() responsibilities:
- Takes trained step and new data
- Validates required columns exist
- Applies transformation using stored parameters
- Returns transformed data
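To illustrate the column validation responsibility, the following sketch bakes a trained recipe on data that lacks one of the centered columns. It assumes a reasonably recent recipes release and uses the built-in step_center(); the exact wording of the error message varies by version, so the example only checks that an error is raised.

```r
library(recipes)

trained <- recipe(mpg ~ ., data = mtcars) |>
  step_center(disp, hp) |>
  prep()

# New data that lacks `disp`, one of the columns the step was trained on
incomplete <- mtcars[, setdiff(names(mtcars), "disp")]

# bake() is expected to stop with an error instead of returning partial output;
# capture the message so the script keeps running
result <- tryCatch(
  bake(trained, new_data = incomplete),
  error = function(e) conditionMessage(e)
)

# TRUE when an error message was captured
is.character(result)
```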
Example workflow
# 1. Define recipe with step
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors())
# At this point, the step knows to center all numeric predictors
# but hasn't calculated what those means are yet
# 2. prep() trains the step (calculates means from training data)
trained_rec <- prep(rec, training = mtcars)
# Now the step knows the mean of each column
# 3. bake() applies the step (subtracts those means from new data)
# test_data is not defined here: it stands for any data frame
# with the same predictor columns as mtcars
new_data <- bake(trained_rec, new_data = test_data)
# New data has been centered using the training means
Step Type Decision Tree
Choose the appropriate template based on what your step does:
Type 1: Modify-in-Place Steps
Use when: Your step transforms existing columns without creating new ones
Characteristics:
- role = NA (preserves existing roles)
- No keep_original_cols parameter
- Returns tibble with same columns (but modified values)
- Examples: step_center, step_scale, step_normalize, step_log
Template: See modify-in-place-steps.md
Type 2: Create-New-Columns Steps
Use when: Your step creates new columns from existing ones
Characteristics:
- role = "predictor" (default, assigns role to new columns)
- Includes keep_original_cols parameter (default FALSE)
- Uses remove_original_cols() in bake()
- May need .recipes_estimate_sparsity() if creating sparse columns
- Examples: step_dummy, step_pca, step_interact, step_poly
Template: See create-new-columns-steps.md
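As a small illustration of the create-new-columns behavior, the sketch below uses the built-in step_pca() on mtcars, once with the default and once with keep_original_cols = TRUE. The choice of columns and num_comp = 1 is arbitrary and only serves the example.

```r
library(recipes)

# Default: the source columns are replaced by the new component column
pca_drop <- recipe(mpg ~ ., data = mtcars) |>
  step_pca(disp, hp, num_comp = 1)
dropped_names <- names(bake(prep(pca_drop), new_data = NULL))

# keep_original_cols = TRUE: source columns stay next to the new column
pca_keep <- recipe(mpg ~ ., data = mtcars) |>
  step_pca(disp, hp, num_comp = 1, keep_original_cols = TRUE)
kept_names <- names(bake(prep(pca_keep), new_data = NULL))

dropped_names
kept_names
```

Comparing the two name vectors shows whether disp and hp survive alongside the new PC1 column.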
Type 3: Row-Operation Steps
Use when: Your step filters or removes rows
Characteristics:
- Default skip = TRUE (usually not applied during bake on new data)
- Affects number of rows returned
- Often used for training data only
- Examples: step_filter, step_sample, step_naomit, step_slice
Template: See row-operation-steps.md
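The skip behavior of row-operation steps can be seen with the built-in step_filter(). This is a minimal sketch on mtcars; the filter condition is an arbitrary choice for illustration.

```r
library(recipes)

filter_rec <- recipe(mpg ~ ., data = mtcars) |>
  step_filter(cyl == 4)  # skip = TRUE is the default for step_filter()

trained <- prep(filter_rec)

# new_data = NULL returns the processed training set, where the filter ran
n_train <- nrow(bake(trained, new_data = NULL))

# On new data the step is skipped, so every row comes back
n_new <- nrow(bake(trained, new_data = mtcars))

c(training = n_train, new_data = n_new)
```

The training count matches the number of rows with cyl == 4, while the new-data count matches the full input.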
Key Concepts
Variable Selection
Steps use tidyselect to let users specify columns:
# rec is an existing recipe, e.g. recipe(mpg ~ ., data = mtcars)
# By name
rec |> step_center(disp, hp)
# By type
rec |> step_center(all_numeric())
# By role
rec |> step_center(all_predictors())
# Combinations
rec |> step_center(all_numeric_predictors())
The prep() method resolves these selections to actual column names.
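To make the resolution step concrete, the sketch below calls recipes_eval_select() directly, outside of any prep() method, using summary() of an untrained recipe as the info argument. Inside a real prep() method the training data and info are supplied by recipes itself.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars)
info <- summary(rec)  # one row per variable with its type and role

# Resolve a selector to concrete column names, as a prep() method would
selected <- recipes_eval_select(
  quos = rlang::quos(all_numeric_predictors()),
  data = mtcars,
  info = info
)

selected
```

The result is a named character vector of column names; the outcome mpg is excluded because its role is not "predictor".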
Roles
Columns in recipes have roles:
- "predictor" - Used as features
- "outcome" - Used as target variable
- NA - No specific role
Steps can:
- Preserve roles (role = NA)
- Assign roles to new columns (role = "predictor")
- Filter by role (all_predictors())
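Roles can be inspected with summary() on a recipe. The sketch below also reassigns one column with update_role(); the role name "id" is an arbitrary label chosen for illustration.

```r
library(recipes)

rec <- recipe(mpg ~ disp + hp + wt, data = mtcars) |>
  update_role(wt, new_role = "id")  # wt is no longer treated as a predictor

roles <- summary(rec)
roles[, c("variable", "role")]
```

After this change, selectors such as all_predictors() would pick up disp and hp but not wt.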
Training vs Application
Training (prep):
- Learn parameters from training data
- Store parameters in step object
- Happens once
Application (bake):
- Apply stored parameters to new data
- Can be called multiple times
- Uses parameters from prep, doesn’t relearn
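A quick way to check that bake() reuses the parameters learned in prep() is to split a data set and compare the result against both candidate means. This sketch uses an arbitrary row split of mtcars and the built-in step_center().

```r
library(recipes)

train <- mtcars[1:20, ]
test <- mtcars[21:32, ]

trained <- recipe(mpg ~ ., data = train) |>
  step_center(disp) |>
  prep()

baked_test <- bake(trained, new_data = test)

# Centered with the mean learned from the training rows...
uses_train_mean <- isTRUE(all.equal(baked_test$disp, test$disp - mean(train$disp)))
# ...not with a mean recomputed from the new rows
uses_test_mean <- isTRUE(all.equal(baked_test$disp, test$disp - mean(test$disp)))

c(train_mean = uses_train_mean, test_mean = uses_test_mean)
```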
Common Patterns
Storing parameters
Store only what’s needed:
# Good: store only the means
means <- colMeans(training[, col_names])
# Bad: store entire training data
training_data <- training  # Don't do this!
For-loops over purrr
Use for-loops for better error messages:
# Preferred
for (col in col_names) {
  new_data[[col]] <- transform(new_data[[col]])
}
# Avoid
new_data <- map(col_names, \(col) transform(new_data[[col]]))
Validation early
Validate in prep(), trust in bake():
# prep() validates
check_type(training[, col_names], types = c("double", "integer"))
# bake() trusts and applies
new_data[[col]] <- new_data[[col]] - means[[col]]
Next Steps
- Implement modify-in-place steps: modify-in-place-steps.md
- Implement create-new-columns steps: create-new-columns-steps.md
- Implement row-operation steps: row-operation-steps.md
- Learn helper functions: helper-functions.md
- Add optional methods: optional-methods.md