Understanding Recipe Step Architecture

Before implementing a recipe step, understand the recipe step architecture and workflow.

Note for Source Development: If you’re contributing directly to the recipes package, you can use internal helper functions like recipes_eval_select(), check_type(), and get_case_weights() without the recipes:: prefix. See the Source Development Guide for details.

Reference implementations showing complete architecture:

  • Simple steps: R/center.R, R/scale.R (modify-in-place pattern)
  • Complex steps: R/dummy.R, R/pca.R (create-new-columns pattern)
  • Row operations: R/filter.R, R/sample.R (skip behavior)

The Three-Function Pattern

Every recipe step consists of three functions:

1. Step constructor (e.g., step_center())

User-facing function that:

  • Captures user arguments
  • Uses enquos(...) to capture variable selections
  • Returns the recipe with the step added via add_step()

step_center <- function(recipe, ..., role = NA, trained = FALSE,
                        means = NULL, na_rm = TRUE, skip = FALSE,
                        id = rand_id("center")) {
  add_step(
    recipe,
    step_center_new(
      terms = enquos(...),
      role = role,
      trained = trained,
      means = means,
      na_rm = na_rm,
      skip = skip,
      id = id
    )
  )
}

2. Step initialization (e.g., step_center_new())

Internal constructor that:

  • Is a minimal function with no defaults
  • Calls step(subclass = "name", ...) to create the S3 object

step_center_new <- function(terms, role, trained, means, na_rm, skip, id) {
  step(
    subclass = "center",
    terms = terms,
    role = role,
    trained = trained,
    means = means,
    na_rm = na_rm,
    skip = skip,
    id = id
  )
}

3. S3 methods

Required methods for every step:

  • prep.step_*() - Estimates parameters from training data
  • bake.step_*() - Applies transformation to new data
  • print.step_*() - Displays step in recipe summary
  • tidy.step_*() - Returns step information as tibble
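
The two reporting methods tend to be short. The following is a minimal sketch, not the package's own source: it uses a hypothetical class name center_demo so the real step_center methods are left alone, and it assumes the recipes developer helpers print_step() and sel2char() are available after library(recipes). The step object is built by hand with step() purely to exercise the methods.

```r
library(recipes)

# tidy(): one row per selected column, plus the step id.
# "center_demo" is a hypothetical class used only for illustration.
tidy.step_center_demo <- function(x, ...) {
  if (isTRUE(x$trained)) {
    res <- tibble::tibble(terms = names(x$means), value = unname(x$means))
  } else {
    # Before prep(), only the unevaluated selectors are known.
    res <- tibble::tibble(terms = sel2char(x$terms), value = NA_real_)
  }
  res$id <- x$id
  res
}

# print(): a one-line summary shown when the recipe is printed.
print.step_center_demo <- function(x, width = max(20, options()$width - 30), ...) {
  print_step(
    tr_obj = names(x$means),
    untr_obj = x$terms,
    trained = x$trained,
    title = "Centering for ",
    width = width
  )
  invisible(x)
}

# Hand-built trained step object (illustrative values, not computed here).
demo_step <- step(
  subclass = "center_demo",
  terms = rlang::quos(disp, hp),
  role = NA,
  trained = TRUE,
  means = c(disp = 230.7, hp = 146.7),
  na_rm = TRUE,
  skip = FALSE,
  id = "center_demo_abc"
)

demo_tidy <- tidy(demo_step)
```

Keeping tidy() output to a fixed set of columns in both the trained and untrained branches makes it safe to row-bind results across steps.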

The prep/bake Workflow

prep() - Training phase

Prep resolves variable selections and learns parameters from training data:

prep.step_center <- function(x, training, info = NULL, ...) {
  # 1. Resolve variable selections to actual column names
  col_names <- recipes_eval_select(x$terms, training, info)

  # 2. Validate column types
  check_type(training[, col_names], types = c("double", "integer"))

  # 3. Compute statistics/parameters from training data
  means <- colMeans(training[, col_names], na.rm = x$na_rm)

  # 4. Store learned parameters in step object
  # 5. Return updated step with trained = TRUE
  step_center_new(
    terms = x$terms,
    role = x$role,
    trained = TRUE,
    means = means,
    na_rm = x$na_rm,
    skip = x$skip,
    id = x$id
  )
}

prep() responsibilities:

  • Resolve variable selections (e.g., all_numeric() → actual column names)
  • Validate column types
  • Compute statistics/parameters from training data
  • Store learned parameters in step object
  • Return updated step with trained = TRUE
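
To see these responsibilities from the outside, one option is to compare a step object before and after prep(). This is a small sketch using the built-in step_center() on mtcars; the variable names (rec, trained_rec, center_step) are just illustrative.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_center(disp, hp)

# Before prep(): the step holds the selectors but no parameters yet.
untrained_means <- rec$steps[[1]]$means

trained_rec <- prep(rec, training = mtcars)

# After prep(): the step carries the learned means and is flagged as trained.
center_step <- trained_rec$steps[[1]]
center_step$trained
center_step$means
```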

bake() - Application phase

Bake applies the transformation using stored parameters:

bake.step_center <- function(object, new_data, ...) {
  # 1. Get column names from trained step
  col_names <- names(object$means)

  # 2. Validate required columns exist in new data
  check_new_data(col_names, object, new_data)

  # 3. Apply transformation using stored parameters
  for (col in col_names) {
    new_data[[col]] <- new_data[[col]] - object$means[[col]]
  }

  # 4. Return modified data
  new_data
}

bake() responsibilities:

  • Takes trained step and new data
  • Validates required columns exist
  • Applies transformation using stored parameters
  • Returns transformed data

Example workflow

library(recipes)

# 1. Define recipe with step
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors())

# At this point, step knows to center all numeric predictors
# but hasn't calculated what those means are yet

# 2. prep() trains the step (calculates means from training data)
trained_rec <- prep(rec, training = mtcars)

# Now the step knows the mean of each column

# 3. bake() applies the step (subtracts those means from new data)
test_data <- mtcars[1:5, ]  # stand-in for held-out data
baked <- bake(trained_rec, new_data = test_data)

# New data has been centered using the training means
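
A quick check that the training means, rather than the means of the new rows, are what bake() subtracts. This is a sketch under the assumption that the first five rows of mtcars stand in for held-out data.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors())
trained_rec <- prep(rec, training = mtcars)

# Stand-in for held-out data: the first five rows of mtcars.
test_data <- mtcars[1:5, ]
baked <- bake(trained_rec, new_data = test_data)

# Each baked value equals the raw value minus the mean of the full training set.
expected_disp <- test_data$disp - mean(mtcars$disp)
uses_training_mean <- isTRUE(all.equal(baked$disp, expected_disp))

# The five new rows are therefore not centered around zero themselves.
new_rows_mean <- mean(baked$disp)
```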

Step Type Decision Tree

Choose the appropriate template based on what your step does:

Type 1: Modify-in-Place Steps

Use when: Your step transforms existing columns without creating new ones

Characteristics:

  • role = NA (preserves existing roles)
  • No keep_original_cols parameter
  • Returns tibble with same columns (but modified values)
  • Examples: step_center, step_scale, step_normalize, step_log

Template: See modify-in-place-steps.md

Type 2: Create-New-Columns Steps

Use when: Your step creates new columns from existing ones

Characteristics:

  • role = "predictor" (default, assigns role to new columns)
  • Includes keep_original_cols parameter (default FALSE)
  • Uses remove_original_cols() in bake()
  • May need .recipes_estimate_sparsity() if creating sparse columns
  • Examples: step_dummy, step_pca, step_interact, step_poly

Template: See create-new-columns-steps.md
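
The effect of keep_original_cols is easiest to see on an existing step of this type. The sketch below uses step_pca() on mtcars with two components; the object names are illustrative.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars)

# Default: the source columns are dropped once the components are created.
pca_drop <- rec |>
  step_pca(all_numeric_predictors(), num_comp = 2) |>
  prep() |>
  bake(new_data = NULL)

# keep_original_cols = TRUE: the components are added alongside the sources.
pca_keep <- rec |>
  step_pca(all_numeric_predictors(), num_comp = 2, keep_original_cols = TRUE) |>
  prep() |>
  bake(new_data = NULL)

names(pca_drop)
names(pca_keep)
```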

Type 3: Row-Operation Steps

Use when: Your step filters or removes rows

Characteristics:

  • Default skip = TRUE (the step runs on the training set during prep() but is skipped when bake() is applied to new data)
  • Affects number of rows returned
  • Often used for training data only
  • Examples: step_filter, step_sample, step_naomit, step_slice

Template: See row-operation-steps.md
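
The skip behavior can be observed with an existing row-operation step. In this sketch, step_filter() keeps only four-cylinder cars from mtcars; the filter is applied to the retained training set but not when the same data is passed back in as new data.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_filter(cyl == 4)
trained_rec <- prep(rec, training = mtcars)

# new_data = NULL returns the processed training set: the filter was applied.
n_train <- nrow(bake(trained_rec, new_data = NULL))

# Explicit new data: the step is skipped, so every row comes back.
n_new <- nrow(bake(trained_rec, new_data = mtcars))
```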

Key Concepts

Variable Selection

Steps use tidyselect to let users specify columns:

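# `rec` is an existing recipe, e.g. recipe(mpg ~ ., data = mtcars)

# By name
rec |> step_center(disp, hp)

# By type
rec |> step_center(all_numeric())

# By role
rec |> step_center(all_predictors())

# By type and role combined
rec |> step_center(all_numeric_predictors())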

The prep() method resolves these selections to actual column names.

Roles

Columns in recipes have roles:

  • "predictor" - Used as features
  • "outcome" - Used as target variable
  • NA - No specific role

Steps can:

  • Preserve roles (role = NA)
  • Assign roles to new columns (role = "predictor")
  • Filter by role (all_predictors())
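
Role-based filtering can be checked directly. In the sketch below, gear is given a custom "id" role (an arbitrary label chosen for illustration) with update_role(), so all_numeric_predictors() no longer selects it.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) |>
  update_role(gear, new_role = "id") |>
  step_center(all_numeric_predictors())

# summary() lists each variable with its type, role, and source.
roles <- summary(rec)

trained_rec <- prep(rec, training = mtcars)

# Columns the step actually resolved to: neither the outcome nor the id column.
centered_cols <- names(trained_rec$steps[[1]]$means)
```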

Training vs Application

Training (prep):

  • Learn parameters from training data
  • Store parameters in step object
  • Happens once

Application (bake):

  • Apply stored parameters to new data
  • Can be called multiple times
  • Uses parameters from prep, doesn’t relearn

Common Patterns

Storing parameters

Store only what’s needed:

# Good: store only the means
means <- colMeans(training[, col_names])

# Bad: store entire training data
training_data <- training  # Don't do this!

For-loops over purrr

Use for-loops for better error messages:

# Preferred (transform_fn is a placeholder for your column-wise transformation)
for (col in col_names) {
  new_data[[col]] <- transform_fn(new_data[[col]])
}

# Avoid
new_data[col_names] <- purrr::map(new_data[col_names], transform_fn)

Validation early

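Validate column types in prep(); bake() only confirms that the required columns are present, then applies the stored parameters:

# prep() validates types
check_type(training[, col_names], types = c("double", "integer"))

# bake() checks that the columns exist, then applies
check_new_data(col_names, object, new_data)
new_data[[col]] <- new_data[[col]] - object$means[[col]]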
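
The bake()-time check for required columns can be triggered on purpose. This sketch drops hp from the data passed to bake() and captures the resulting error; the exact wording of the message depends on the installed recipes version, so it is not reproduced here.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_center(disp, hp)
trained_rec <- prep(rec, training = mtcars)

# New data that lacks one of the columns the step was trained on.
incomplete <- mtcars[, setdiff(names(mtcars), "hp")]

# Capture the error message instead of stopping the script.
result <- tryCatch(
  bake(trained_rec, new_data = incomplete),
  error = function(e) conditionMessage(e)
)
```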

Next Steps