Source Development Guide: Contributing to Recipes

Complete guide for contributing new preprocessing steps to the recipes package itself.


When to Use This Guide

Use this guide if you are:

  • Contributing a PR directly to the recipes package

  • Working inside the recipes repository

  • Adding steps that should be part of recipes core

  • Modifying existing recipes steps

Don’t use this guide if you are:


Prerequisites

Clone the Recipes Repository

# Clone from GitHub
git clone https://github.com/tidymodels/recipes.git
cd recipes

# Create a feature branch
git checkout -b feature/add-step-name

See Repository Access for more details.

Install Development Dependencies

# Install recipes with all dependencies
devtools::install_dev_deps()

# Load the package for development
devtools::load_all()

Understanding Recipes Architecture

Package Organization

recipes/
├── R/
│   ├── center.R         # Normalization steps
│   ├── dummy.R          # Encoding steps
│   ├── pca.R            # Dimension reduction
│   ├── filter.R         # Row operations
│   ├── aaa-*.R          # Core infrastructure
│   └── utils-*.R        # Internal utilities
├── tests/testthat/
│   ├── test-center.R
│   ├── test-dummy.R
│   └── helper-*.R       # Test helpers
└── man/                 # Documentation

File Naming Conventions

Source files: R/[step_name].R

  • Examples: center.R, normalize.R, pca.R, dummy.R

Test files: tests/testthat/test-[step_name].R

  • Examples: test-center.R, test-normalize.R

Working with Internal Functions

✅ You CAN Use Internal Functions

When developing recipes itself, internal functions are available:

# ✅ GOOD - You're developing the package
prep.step_center <- function(x, training, info = NULL, ...) {
  # Use internal function (no prefix needed)
  col_names <- recipes_eval_select(x$terms, training, info)

  # Core calculation
  # ...
}

Common Internal Helpers

recipes_eval_select() - Variable Selection

The most important function for resolving variable selections:

# Converts all_numeric(), manual selections, etc. to column names
col_names <- recipes_eval_select(x$terms, training, info)

get_case_weights() - Extract Case Weights

wts <- get_case_weights(info, training)
were_weights_used <- are_weights_used(wts, unsupervised = TRUE)

if (isFALSE(were_weights_used)) {
  wts <- NULL
}

check_type() - Validate Column Types

# Validates that columns are the correct type
check_type(training[, col_names], types = c("double", "integer"))

check_new_data() - Validate Columns Exist

# In bake(), check required columns exist
check_new_data(col_names, object, new_data)

remove_original_cols() - Handle keep_original_cols

# For create-new-columns steps
new_data <- remove_original_cols(new_data, object, original_cols)

sel2char() - Convert Selections to Strings

# For tidy() on untrained steps
term_names <- sel2char(x$terms)

See Best Practices (Source) for complete guide to internal functions.


Step-by-Step Implementation

Step 1: Choose Your Step Type

Determine which category your step falls into:

  • Modify-in-place: Transforms existing columns

  • Create-new-columns: Generates new columns

  • Row-operation: Filters or removes rows

See Step Architecture for decision tree.

Step 2: Create Source File

Create R/[step_name].R:

# R/center.R

#' Center numeric variables
#'
#' `step_center()` creates a *specification* of a recipe step that will
#' normalize numeric data to have a mean of zero.
#'
#' @inheritParams step_normalize
#' @param ... One or more selector functions to choose variables for this step.
#'   See [selections()] for more details.
#' @param role Not used by this step since no new variables are created.
#' @param na_rm A logical value indicating whether NA values should be removed
#'   when computing means.
#' @param means A named numeric vector of means. This is `NULL` until computed
#'   by [prep()].
#'
#' @template step-return
#'
#' @family normalization steps
#' @export
#'
#' @details
#' Centering data means that the average of the variable is subtracted from the
#' data. `step_center` estimates the variable means from the data used in the
#' `training` argument of [prep()]. [bake()] then applies the centering to new
#' data sets using these means.
#'
#' @template case-weights-unsupervised
#'
#' @examplesIf rlang::is_installed("modeldata")
#' data(biomass, package = "modeldata")
#'
#' biomass_tr <- biomass[biomass$dataset == "Training", ]
#' biomass_te <- biomass[biomass$dataset == "Testing", ]
#'
#' rec <- recipe(
#'   HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
#'   data = biomass_tr
#' )
#'
#' center_trans <- rec |>
#'   step_center(carbon, hydrogen)
#'
#' center_obj <- prep(center_trans, training = biomass_tr)
#'
#' transformed_te <- bake(center_obj, biomass_te)
#'
step_center <- function(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  means = NULL,
  na_rm = TRUE,
  skip = FALSE,
  id = rand_id("center")
) {
  add_step(
    recipe,
    step_center_new(
      terms = enquos(...),
      trained = trained,
      role = role,
      means = means,
      na_rm = na_rm,
      skip = skip,
      id = id,
      case_weights = NULL
    )
  )
}

step_center_new <- function(terms, role, trained, means, na_rm, skip, id,
                            case_weights) {
  step(
    subclass = "center",
    terms = terms,
    role = role,
    trained = trained,
    means = means,
    na_rm = na_rm,
    skip = skip,
    id = id,
    case_weights = case_weights
  )
}

#' @export
prep.step_center <- function(x, training, info = NULL, ...) {
  # 1. Resolve variable selections (internal function)
  col_names <- recipes_eval_select(x$terms, training, info)

  # 2. Validate column types (internal function)
  check_type(training[, col_names], types = c("double", "integer"))

  # 3. Get case weights (internal function)
  wts <- get_case_weights(info, training)
  were_weights_used <- are_weights_used(wts, unsupervised = TRUE)
  if (isFALSE(were_weights_used)) {
    wts <- NULL
  }

  # 4. Calculate means
  means <- vapply(
    training[, col_names],
    weighted_mean,  # Could use internal helper if it exists
    numeric(1),
    wts = wts,
    na_rm = x$na_rm
  )

  # 5. Check for issues
  inf_cols <- col_names[is.infinite(means)]
  if (length(inf_cols) > 0) {
    cli::cli_warn(
      "Column{?s} {.var {inf_cols}} returned Inf or NaN."
    )
  }

  # 6. Return updated step
  step_center_new(
    terms = x$terms,
    role = x$role,
    trained = TRUE,
    means = means,
    na_rm = x$na_rm,
    skip = x$skip,
    id = x$id,
    case_weights = were_weights_used
  )
}

#' @export
bake.step_center <- function(object, new_data, ...) {
  col_names <- names(object$means)

  # Validate columns exist (internal function)
  check_new_data(col_names, object, new_data)

  # Apply transformation
  for (col_name in col_names) {
    new_data[[col_name]] <- new_data[[col_name]] - object$means[[col_name]]
  }

  new_data
}

#' @export
print.step_center <- function(x, width = max(20, options()$width - 30), ...) {
  title <- "Centering for "

  # Use internal print helper
  print_step(x$columns, x$terms, x$trained, title, width,
             case_weights = x$case_weights)

  invisible(x)
}

#' @rdname tidy.recipe
#' @export
tidy.step_center <- function(x, ...) {
  if (is_trained(x)) {
    res <- tibble(
      terms = names(x$means),
      value = unname(x$means)
    )
  } else {
    # Use internal helper
    term_names <- sel2char(x$terms)
    res <- tibble(
      terms = term_names,
      value = na_dbl
    )
  }
  res$id <- x$id
  res
}

Step 3: Create Test File

Create tests/testthat/test-center.R:

test_that("centering works correctly", {
  rec <- recipe(mpg ~ ., data = mtcars) |>
    step_center(disp, hp)

  prepped <- prep(rec, training = mtcars)
  results <- bake(prepped, mtcars)

  expect_equal(mean(results$disp), 0, tolerance = 1e-7)
  expect_equal(mean(results$hp), 0, tolerance = 1e-7)

  # Check tidy output
  trained_tidy <- tidy(prepped, 1)
  expect_equal(trained_tidy$value, c(mean(mtcars$disp), mean(mtcars$hp)))
})

test_that("centering handles selectors", {
  rec <- recipe(mpg ~ ., data = mtcars) |>
    step_center(all_numeric_predictors())

  prepped <- prep(rec, training = mtcars)

  # Check correct columns selected
  selected <- prepped$steps[[1]]$columns
  expect_true("disp" %in% selected)
  expect_false("mpg" %in% selected)  # outcome, not predictor
})

test_that("centering works with case weights", {
  mtcars_weighted <- mtcars
  mtcars_weighted$wt_col <- hardhat::importance_weights(seq_len(nrow(mtcars)))

  rec <- recipe(mpg ~ ., data = mtcars_weighted) |>
    update_role(wt_col, new_role = "case_weights") |>
    step_center(disp)

  prepped <- prep(rec, training = mtcars_weighted)

  # Weighted mean should differ from unweighted
  expect_false(
    prepped$steps[[1]]$means[["disp"]] == mean(mtcars$disp)
  )
})

test_that("centering validates input", {
  df <- data.frame(x = letters[1:5], y = 1:5)

  rec <- recipe(~ ., data = df) |>
    step_center(x)

  expect_error(prep(rec, training = df))
})

test_that("centering handles NA", {
  mtcars_na <- mtcars
  mtcars_na$disp[1:5] <- NA

  rec <- recipe(mpg ~ ., data = mtcars_na) |>
    step_center(disp, na_rm = TRUE)

  prepped <- prep(rec, training = mtcars_na)
  baked <- bake(prepped, mtcars_na)

  # NA values remain NA
  expect_true(all(is.na(baked$disp[1:5])))
  expect_false(any(is.na(baked$disp[6:nrow(mtcars_na)])))
})

Testing

INSTRUCTIONS FOR CLAUDE: Create tests based on features present.

Essential Tests (ALL steps) - 8-10 tests minimum

Core functionality (3-4 tests):

  • Basic correctness (transformation works)

  • Multiple columns (if applicable)

  • Single column (if applicable)

Variable selection (1-2 tests):

  • Works with recipes selectors (all_numeric(), all_predictors())

  • Manual column selection

NA handling (1 test):

  • Verify NA behavior (preserve, remove, or error)

Infrastructure (2-3 tests):

  • print() method works

  • tidy() method works (before and after prep)

  • Integration in recipe pipeline

Feature-Specific Tests (Add ONLY if applicable)

If step computes statistics (+2 tests):

  • Case weights: frequency weights

  • Case weights: importance weights

If skip parameter present (+1 test):

  • skip = TRUE and FALSE behavior

If keep_original_cols parameter (+1 test):

  • keep_original_cols = TRUE and FALSE

If multiple custom parameters (+2 tests):

  • Parameter combinations

  • Parameter validation

If complex statistical operations (+2-3 tests):

  • Edge cases (zero variance, all same values)

  • Boundary conditions

Target Test Counts

  • Per-row operations: 8-12 tests

  • Statistical operations: 12-18 tests

  • Complex calculations: 18-25 tests

See Testing Patterns (Source) for comprehensive guide and internal test helpers.

Step 4: Run Tests and Check

# Document
devtools::document()

# Load
devtools::load_all()

# Test
devtools::test()

# Full check
devtools::check()

Documentation Patterns

Using @inheritParams

Recipes uses extensive parameter inheritance:

#' @inheritParams step_normalize

This inherits all standard parameters from step_normalize.

Using @template

#' @template step-return
#' @template case-weights-unsupervised

Available templates are in templates or inline documentation.

Cross-Referencing Steps

#' @seealso [step_normalize()], [step_scale()]
#' @family normalization steps

Documentation Files to Create

INSTRUCTIONS FOR CLAUDE:

Create ONLY these files by default: 1. R/[step_name].R - Complete implementation 2. tests/testthat/test-[step_name].R - Test suite 3. README.md - Overview with basic usage example (200-300 lines)

Do NOT create unless user explicitly requests:

  • ❌ IMPLEMENTATION_SUMMARY.md

  • ❌ QUICKSTART.md

  • ❌ example_usage.R

  • ❌ Additional documentation files

If user wants more documentation, they will ask (e.g., “add comprehensive documentation”).


The Three-Function Pattern

Every step needs these three functions:

1. Step Constructor (User-Facing)

#' @export
step_center <- function(recipe, ..., role = NA, ...) {
  add_step(recipe, step_center_new(...))
}

2. Step Initialization (Internal)

step_center_new <- function(terms, role, trained, ...) {
  step(subclass = "center", ...)
}

3. S3 Methods (All Exported)

#' @export
prep.step_center <- function(x, training, info = NULL, ...) { }

#' @export
bake.step_center <- function(object, new_data, ...) { }

#' @export
print.step_center <- function(x, ...) { }

#' @export
tidy.step_center <- function(x, ...) { }

Step Type Best Practices

Modify-in-Place Steps

# Use role = NA
step_center <- function(recipe, ..., role = NA, ...) { }

# No keep_original_cols parameter
# Columns modified in place

Create-New-Columns Steps

# Use role = "predictor"
step_dummy <- function(recipe, ..., role = "predictor",
                      keep_original_cols = FALSE, ...) { }

# In bake(), handle keep_original_cols
new_data <- remove_original_cols(new_data, object, original_cols)

Row-Operation Steps

# Default skip = TRUE
step_filter <- function(recipe, ..., skip = TRUE, ...) { }

# In bake(), respect skip
if (object$skip) {
  return(new_data)
}

Creating New Internal Helpers

When to Create

Create internal helpers when:

  • Logic is shared by 2+ steps

  • Complex operation used repeatedly

  • Abstraction improves clarity

Example Internal Helper

#' Calculate weighted mean with case weight handling
#'
#' @param x Numeric vector
#' @param wts Numeric weights (or NULL)
#' @param na_rm Remove NA values?
#'
#' @return Numeric scalar
#' @keywords internal
#' @noRd
weighted_mean_na <- function(x, wts = NULL, na_rm = TRUE) {
  if (is.null(wts)) {
    mean(x, na.rm = na_rm)
  } else {
    if (inherits(wts, c("hardhat_importance_weights",
                        "hardhat_frequency_weights"))) {
      wts <- as.double(wts)
    }
    weighted.mean(x, w = wts, na.rm = na_rm)
  }
}

Use:

  • @keywords internal

  • @noRd

  • Don’t export


Error Messages

Use cli for consistent errors:

if (bad_input) {
  cli::cli_abort(
    "{.arg na_rm} must be a single logical value, not {.obj_type_friendly {na_rm}}.",
    call = call
  )
}

Pass call parameter for better error context:

prep.step_center <- function(x, training, info = NULL, ...,
                            call = rlang::caller_env()) {
  if (problem) {
    cli::cli_abort("Error message", call = call)
  }
}

PR Submission

Before Submitting

  1. Run full check:

    devtools::check()
  2. Update NEWS.md:

    ## recipes (development version)
    
    * Added `step_center()` for centering numeric predictors (#123).
  3. Commit changes:

    git add .
    git commit -m "Add step_center()"
    git push origin feature/add-step-center

Creating the PR

  1. Go to https://github.com/tidymodels/recipes

  2. Click “New pull request”

  3. Select your branch

  4. Fill in description:

    • What does this step do?

    • Why is it useful?

    • Reference any related issues

Review Process

Common feedback:

  • Add tests for all selectors

  • Match existing documentation style

  • Use internal helpers

  • Add more examples

  • Fix code style issues

See Troubleshooting (Source) for common review feedback.


Reference Documentation

Source Development

Step Types

Core Concepts

Shared References


Next Steps

  1. Clone recipes repository
  2. Create feature branch
  3. Implement your step following this guide
  4. Test thoroughly with recipes test patterns
  5. Run devtools::check()
  6. Submit PR to tidymodels/recipes

Getting Help