Best Practices for Recipes Source Development

Context: This guide is for source development - contributing to the recipes package directly.

Key principle:You CAN use internal functions - you’re developing the package, so internals are available.

For extension development (creating new packages), see Best Practices (Extension).


Using Internal Functions in Recipes

When to Use Internal Functions

Use internal functions when:

  • Standard operations like variable selection or validation

  • Complex calculations already implemented

  • Consistency with existing steps is needed

  • Avoiding code duplication

Don’t use internal functions when:

  • Simple logic can be written inline

  • The internal function doesn’t quite fit

  • It would make code less readable

Finding Existing Internal Functions

# List all objects (including internals)
ls("package:recipes", all.names = TRUE)

# Search for specific pattern
ls("package:recipes", all.names = TRUE, pattern = "^check_")

# View internal function
recipes:::check_type

# Search in package directory
# grep -r "check_type" R/

Common Internal Helpers

recipes_eval_select() - Variable Selection

The most important internal function for resolving variable selections:

prep.step_center <- function(x, training, info = NULL, ...) {
  # Resolve variable selections to actual column names
  col_names <- recipes_eval_select(x$terms, training, info)

  # Now col_names contains actual column names
  # ...
}

Use for:

  • Converting all_numeric() to actual numeric column names

  • Converting all_predictors() to predictor names

  • Resolving manual selections like c(disp, hp)

get_case_weights() - Extract Case Weights

prep.step_center <- function(x, training, info = NULL, ...) {
  col_names <- recipes_eval_select(x$terms, training, info)

  # Extract case weights if they exist
  wts <- get_case_weights(info, training)
  were_weights_used <- are_weights_used(wts, unsupervised = TRUE)

  if (isFALSE(were_weights_used)) {
    wts <- NULL
  }

  # Use wts in calculations
  # ...
}

Handles:

  • Finding case_weights role in recipe

  • Extracting weight column

  • Converting to appropriate format

check_type() - Validate Column Types

prep.step_center <- function(x, training, info = NULL, ...) {
  col_names <- recipes_eval_select(x$terms, training, info)

  # Validate that columns are numeric
  check_type(training[, col_names], types = c("double", "integer"))

  # Proceed with confidence that types are correct
  # ...
}

Common type checks:

  • c("double", "integer") - Numeric data

  • c("factor", "ordered") - Categorical data

  • "character" - Text data

  • "logical" - Boolean data

check_new_data() - Validate Columns Exist in New Data

bake.step_center <- function(object, new_data, ...) {
  col_names <- names(object$means)

  # Verify all required columns exist in new_data
  check_new_data(col_names, object, new_data)

  # Safe to proceed
  # ...
}

Provides:

  • Clear error message if columns missing

  • Consistent error format across steps

remove_original_cols() - Handle keep_original_cols

For steps that create new columns:

bake.step_dummy <- function(object, new_data, ...) {
  # Create new columns
  # ...

  # Handle removal of original columns
  new_data <- remove_original_cols(
    new_data,
    object,
    names(object$levels)  # Original column names
  )

  new_data
}

sel2char() - Convert Selections to Strings

For tidy() on untrained steps:

tidy.step_center <- function(x, ...) {
  if (is_trained(x)) {
    # After training: use actual column names
    res <- tibble(
      terms = names(x$means),
      value = unname(x$means)
    )
  } else {
    # Before training: convert selectors to character
    term_names <- sel2char(x$terms)
    res <- tibble(
      terms = term_names,
      value = na_dbl
    )
  }
  res$id <- x$id
  res
}

File Naming Conventions

Recipes organizes steps by name (less strict than yardstick):

Source File Names

  • Step files: R/[step_name].R

    • Examples: center.R, normalize.R, pca.R, dummy.R

Test File Names Match Source

  • R/center.Rtests/testthat/test-center.R

  • R/normalize.Rtests/testthat/test-normalize.R

Documentation Patterns

Recipes uses extensive @inheritParams to avoid duplication.

Inheriting Standard Parameters

#' Center numeric variables
#'
#' @inheritParams step_normalize
#' @param na_rm A logical value indicating whether NA values should be removed
#'   when computing means.
#' @param means A named numeric vector of means. This is `NULL` until computed
#'   by [prep()].
#'
#' @export
step_center <- function(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  means = NULL,
  na_rm = TRUE,
  skip = FALSE,
  id = rand_id("center")
) {
  # ...
}

Common inheritParams sources:

  • step_normalize - For normalization-type steps

  • step_center - For centering-type steps

  • step_pca - For dimension reduction steps

  • step_dummy - For encoding steps

Standard Parameter Documentation

Parameters shared across most steps:

#' @param recipe A recipe object
#' @param ... One or more selector functions to choose variables
#' @param role Role for new variables (use NA for in-place modifications)
#' @param trained Logical indicating if step has been trained
#' @param skip Logical indicating if step should be skipped during bake()
#' @param id Character string identifier for the step

Using @template

Recipes has some templates, but uses them less than yardstick:

#' @template step-return

Cross-Referencing Steps

Link to related steps:

#' @seealso [step_normalize()], [step_scale()]
#' @family normalization steps

Code Style Specific to Recipes

The Three-Function Pattern

Every step needs three functions (plus S3 methods):

# 1. Step constructor (user-facing, exported)
#' @export
step_center <- function(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  means = NULL,
  na_rm = TRUE,
  skip = FALSE,
  id = rand_id("center")
) {
  add_step(
    recipe,
    step_center_new(
      terms = enquos(...),
      trained = trained,
      role = role,
      means = means,
      na_rm = na_rm,
      skip = skip,
      id = id,
      case_weights = NULL
    )
  )
}

# 2. Step initialization (internal constructor)
step_center_new <- function(terms, role, trained, means, na_rm, skip, id,
                            case_weights) {
  step(
    subclass = "center",
    terms = terms,
    role = role,
    trained = trained,
    means = means,
    na_rm = na_rm,
    skip = skip,
    id = id,
    case_weights = case_weights
  )
}

# 3. S3 methods (all exported)
#' @export
prep.step_center <- function(x, training, info = NULL, ...) {
  # Training logic
}

#' @export
bake.step_center <- function(object, new_data, ...) {
  # Application logic
}

#' @export
print.step_center <- function(x, width = max(20, options()$width - 30), ...) {
  # Printing logic
}

#' @rdname tidy.recipe
#' @export
tidy.step_center <- function(x, ...) {
  # Tidying logic
}

Use add_step() to Add to Recipe

step_center <- function(recipe, ...) {
  add_step(
    recipe,
    step_center_new(...)
  )
}

Don’t manually modify the recipe object.

Use step() to Create Step Object

step_center_new <- function(terms, role, ...) {
  step(
    subclass = "center",  # Must match function name
    terms = terms,
    role = role,
    ...
  )
}

The subclass must match the step name (minus “step_” prefix).

prep() Method Structure

prep.step_center <- function(x, training, info = NULL, ...) {
  # 1. Resolve variable selections
  col_names <- recipes_eval_select(x$terms, training, info)

  # 2. Validate column types
  check_type(training[, col_names], types = c("double", "integer"))

  # 3. Get case weights (if applicable)
  wts <- get_case_weights(info, training)
  were_weights_used <- are_weights_used(wts, unsupervised = TRUE)
  if (isFALSE(were_weights_used)) {
    wts <- NULL
  }

  # 4. Calculate step parameters (e.g., means, scales, etc.)
  means <- calculate_means(training[, col_names], wts, x$na_rm)

  # 5. Check for issues (e.g., Inf, NaN)
  check_for_problems(means, col_names)

  # 6. Return updated step with trained = TRUE
  step_center_new(
    terms = x$terms,
    role = x$role,
    trained = TRUE,
    means = means,
    na_rm = x$na_rm,
    skip = x$skip,
    id = x$id,
    case_weights = were_weights_used
  )
}

bake() Method Structure

bake.step_center <- function(object, new_data, ...) {
  # 1. Get column names from trained step
  col_names <- names(object$means)

  # 2. Validate columns exist in new data
  check_new_data(col_names, object, new_data)

  # 3. Apply transformation
  for (col_name in col_names) {
    new_data[[col_name]] <- new_data[[col_name]] - object$means[[col_name]]
  }

  # 4. Return modified data
  new_data
}

Key points:

  • Use column names from trained object (not from x$terms)

  • Validate before transforming

  • Return the modified data frame

Step Type Best Practices

Modify-in-Place Steps

# Use role = NA (preserve existing role)
step_center <- function(recipe, ..., role = NA, ...) {
  # ...
}

# Don't add keep_original_cols parameter
# These steps modify in place, not create new columns

Create-New-Columns Steps

# Use role = "predictor" (or other appropriate role)
step_dummy <- function(recipe, ..., role = "predictor",
                      keep_original_cols = FALSE, ...) {
  # ...
}

# In bake(), handle keep_original_cols
bake.step_dummy <- function(object, new_data, ...) {
  # Create new columns
  # ...

  # Remove originals unless keep_original_cols = TRUE
  new_data <- remove_original_cols(new_data, object, original_cols)

  new_data
}

Row-Operation Steps

# Default skip = TRUE (usually don't apply to new data)
step_filter <- function(recipe, ..., skip = TRUE, ...) {
  # ...
}

# In bake(), respect skip parameter
bake.step_filter <- function(object, new_data, ...) {
  if (object$skip) {
    return(new_data)
  }

  # Apply filter
  # ...
}

Creating New Internal Helpers

When to Create Internal Helpers

Create internal helpers when:

  • Logic is shared by 2+ steps

  • Complex operation used repeatedly

  • Abstraction improves clarity

Naming Conventions

Internal helpers typically:

  • Have descriptive names

  • Are NOT exported

  • Use @keywords internal and @noRd

Example Internal Helper

#' Calculate weighted mean with NA handling
#'
#' @param x Numeric vector
#' @param wts Numeric weights (or NULL)
#' @param na_rm Remove NA values?
#'
#' @return Numeric scalar
#' @keywords internal
#' @noRd
weighted_mean_na <- function(x, wts = NULL, na_rm = TRUE) {
  if (is.null(wts)) {
    mean(x, na.rm = na_rm)
  } else {
    # Handle hardhat weights
    if (inherits(wts, c("hardhat_importance_weights",
                        "hardhat_frequency_weights"))) {
      wts <- as.double(wts)
    }
    weighted.mean(x, w = wts, na.rm = na_rm)
  }
}

Error Messages

Use cli Functions

if (bad_input) {
  cli::cli_abort(
    "{.arg na_rm} must be a single logical value, not {.obj_type_friendly {na_rm}}.",
    call = call
  )
}

if (missing_columns) {
  cli::cli_warn(
    "Column{?s} {.var {missing}} not found and will be ignored."
  )
}

Pass call for Context

prep.step_center <- function(x, training, info = NULL, ...,
                            call = rlang::caller_env()) {
  # Use call in error messages
  if (problem) {
    cli::cli_abort("Error message", call = call)
  }
}

Variable Selection Best Practices

Always Use recipes_eval_select()

# Good
col_names <- recipes_eval_select(x$terms, training, info)

# Bad - don't try to resolve manually
col_names <- dplyr::select(training, !!!x$terms) |> names()

Support All Standard Selectors

Your step should work with:

  • all_numeric()

  • all_numeric_predictors()

  • all_nominal()

  • all_predictors()

  • all_outcomes()

  • has_role("predictor")

  • Manual selection: c(disp, hp)

Test all of these!

Case Weight Best Practices

Extract Weights in prep()

wts <- get_case_weights(info, training)
were_weights_used <- are_weights_used(wts, unsupervised = TRUE)

if (isFALSE(were_weights_used)) {
  wts <- NULL
}

Store Whether Weights Were Used

step_center_new(
  # ...
  case_weights = were_weights_used  # Store TRUE/FALSE
)

Use Weights in Calculations

if (is.null(wts)) {
  mean(x)
} else {
  # Convert hardhat weights
  if (inherits(wts, c("hardhat_importance_weights",
                      "hardhat_frequency_weights"))) {
    wts <- as.double(wts)
  }
  weighted.mean(x, w = wts)
}

Performance Considerations

Vectorize Operations

# Good - vectorized
for (col_name in col_names) {
  new_data[[col_name]] <- new_data[[col_name]] - object$means[[col_name]]
}

# Avoid - row-by-row
for (i in seq_len(nrow(new_data))) {
  new_data[i, col_names] <- new_data[i, col_names] - object$means
}

Avoid Unnecessary Copies

# Good - modify in place
for (col in cols) {
  new_data[[col]] <- transform(new_data[[col]])
}

# Avoid - creates copies
new_data <- new_data |>
  mutate(across(all_of(cols), transform))

Consistency with Existing Steps

Study Similar Steps

Before implementing:

  • For normalization: R/center.R, R/scale.R, R/normalize.R

  • For encoding: R/dummy.R, R/novel.R

  • For dimension reduction: R/pca.R, R/ica.R

  • For filtering: R/filter.R, R/sample.R

Match Parameter Patterns

Keep standard parameters in standard order:

step_name <- function(
  recipe,            # Always first
  ...,              # Variable selection
  role = NA,        # Role for new variables
  trained = FALSE,  # Training status
  # Step-specific parameters here
  skip = FALSE,     # Skip in bake()?
  id = rand_id("name")  # Unique ID
) {
  # ...
}

Next Steps