Extension Development: Best Practices

Context: This guide is for extension development: creating new packages that extend the tidymodels framework.

Key principle: Never use internal functions (those accessed with :::).

Guide to writing high-quality R code for tidymodels extension packages.


Best Practices

Code Style

Use base pipe

# Good
recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors())

# Avoid
recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_numeric_predictors())

The base pipe |> is faster, built-in, and the tidymodels standard.

Anonymous functions

# Single line: use backslash notation
map(x, \(i) i + 1)

# Multi-line: use function()
map(x, function(i) {
  result <- complex_computation(i)
  result + 1
})

For-loops over map()

# Preferred (better error messages)
for (col in columns) {
  new_data[[col]] <- transform(new_data[[col]])
}

# Avoid (harder to debug)
new_data[columns] <- map(columns, \(col) transform(new_data[[col]]))

Why prefer for-loops:

  • Better error messages (shows which iteration failed)

  • More familiar to most R users

  • Easier to debug with browser()

  • Consistent with tidymodels style

Minimal comments

# Good: code is self-documenting
means <- colMeans(data)
centered <- sweep(data, 2, means, "-")

# Avoid: over-commenting obvious code
# Calculate column means
means <- colMeans(data)
# Subtract means from each column
centered <- sweep(data, 2, means, "-")

Write clear code that doesn’t need comments. Add comments only for:

  • Complex algorithms

  • Non-obvious optimization tricks

  • Warnings about edge cases

Error Messages

Use cli functions

# Good: cli provides better formatting
if (invalid) {
  cli::cli_abort("{.arg param} must be positive, not {.val {param}}.")
}

if (risky) {
  cli::cli_warn("Column{?s} {.var {col_names}} returned Inf or NaN.")
}

# Avoid: base R error functions
stop("param must be positive")
warning("columns returned Inf or NaN")

cli formatting syntax

# Argument names
cli::cli_abort("{.arg your_param} must be numeric.")

# Code/function names
cli::cli_abort("Use {.code binary} estimator for two classes.")

# Values
cli::cli_abort("Expected 3 columns, got {.val {ncol(data)}}.")

# Variable names
cli::cli_warn("Column{?s} {.var {col_names}} {?has/have} missing values.")

# Pluralization
cli::cli_abort("Found {length(x)} error{?s}.")  # Handles 1 vs many

Error message guidelines

  • Be specific about what’s wrong

  • Tell users what they can do to fix it

  • Include actual values when helpful

  • Use proper English grammar

# Good
cli::cli_abort(
  "{.arg threshold} must be between 0 and 1, not {.val {threshold}}."
)

# Avoid
stop("Invalid threshold")

Documentation Standards

Be explicit

#' @param threshold Threshold value for classification. Must be a numeric
#'   value between 0 and 1. Default is 0.5.

Include:

  • Type (numeric, logical, character, factor)

  • Valid range or options

  • Default value

  • Effect on function behavior
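Putting those four points together, a complete @param entry might read as follows (the parameter name and behavior here are illustrative, not from a real package):

```r
#' @param trim A single logical: should extreme values be trimmed before
#'   the statistic is computed? Must be `TRUE` or `FALSE`; defaults to
#'   `FALSE`. When `TRUE`, the most extreme 5% of values are removed,
#'   making the estimate more robust to outliers.
```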

US English

  • Use American spelling: “normalize” not “normalise”

  • Use sentence case: “Calculate the mean” not “calculate the mean”

  • Be consistent throughout

Wrap roxygen at 80 characters

#' This is a long line that should be wrapped to ensure it doesn't exceed the
#' 80-character limit for better readability in various text editors.

Include practical examples

#' @examples
#' # Basic usage
#' metric_name(data, truth, estimate)
#'
#' # With grouped data
#' data |>
#'   dplyr::group_by(fold) |>
#'   metric_name(truth, estimate)

Show realistic use cases, not just minimal examples.

Don’t use dynamic roxygen code

# Bad: calling non-exported functions
#' @return Range: `r metric_range()`  # metric_range() not exported

# Good: static documentation
#' @return Range: 0 to 1

Performance

Vectorization over loops

Always prefer vectorized operations:

# Good: vectorized
errors <- truth - estimate
squared_errors <- errors^2
mean(squared_errors)

# Bad: loop
total <- 0
for (i in seq_along(truth)) {
  total <- total + (truth[i] - estimate[i])^2
}
total / length(truth)

Vectorized functions:

  • Arithmetic: +, -, *, /, ^

  • Comparisons: ==, !=, >, <, >=, <=

  • Logical: &, |, !

  • Math: abs(), sqrt(), log(), exp(), sin(), cos()

  • Aggregations: sum(), mean(), max(), min(), median()

Use matrix operations

Efficient per-class calculations:

# Good: matrix operations
confusion_matrix <- yardstick_table(truth, estimate)
tp <- diag(confusion_matrix)
fp <- colSums(confusion_matrix) - tp
fn <- rowSums(confusion_matrix) - tp

# Bad: looping over classes
tp <- numeric(n_classes)
for (i in seq_len(n_classes)) {
  tp[i] <- confusion_matrix[i, i]
}

Use colSums() and rowSums():

# Good
class_totals <- colSums(confusion_matrix)

# Avoid
class_totals <- apply(confusion_matrix, 2, sum)  # Slower

Avoid repeated computations

General principle: Calculate once, use many times.

# Good: compute once in prep() for recipe steps
prep.step_yourname <- function(x, training, ...) {
  means <- colMeans(training[col_names])  # Computed once, stored
}

# Good: validate once at entry point
metric_vec <- function(truth, estimate, ...) {
  check_numeric_metric(truth, estimate, case_weights)  # Validate once
  metric_impl(truth, estimate, ...)  # Trust the data
}

# Good: pre-compute before loops
levels_list <- levels(truth)
n_levels <- length(levels_list)
for (i in seq_len(n_levels)) {
  # Use pre-computed values
}

# Bad: recomputing unnecessarily
for (i in seq_len(length(levels(truth)))) {
  levels_list <- levels(truth)  # Redundant!
}

Handle case weights efficiently

Convert hardhat weights once:

# Good: convert once at the start
if (!is.null(case_weights)) {
  if (inherits(case_weights, c("hardhat_importance_weights",
                               "hardhat_frequency_weights"))) {
    case_weights <- as.double(case_weights)
  }
  # Now use case_weights multiple times
}

# Bad: converting repeatedly
if (!is.null(case_weights)) {
  result1 <- weighted.mean(x, as.double(case_weights))
  result2 <- weighted.mean(y, as.double(case_weights))  # Converting again!
}

Profile before optimizing

Focus optimization where it matters:

  1. Start with clear, correct code
  2. Profile with profvis::profvis() if performance is an issue
  3. Optimize the actual bottlenecks
  4. Don’t prematurely optimize

# Profile your code
profvis::profvis({
  for (i in 1:100) {
    your_function(data)
  }
})

When performance doesn’t matter

Don’t optimize unnecessarily:

  • Functions typically called only once or a few times per evaluation

  • Calculation is usually fast compared to model fitting

  • Readability and correctness are more important

Do optimize when:

  • Function called thousands of times (tuning, cross-validation)

  • Working with very large datasets (millions of observations)

  • Profiling shows the function is the bottleneck
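When in doubt, measure before rewriting. As one way to do that, a quick microbenchmark with the bench package (a suggested dependency for this sketch, not something tidymodels requires) compares two implementations of the same computation:

```r
x <- runif(1e5)

# Compare the loop and vectorized versions of the same calculation;
# bench::mark() also checks that both return the same result
bench::mark(
  loop = {
    total <- 0
    for (i in seq_along(x)) total <- total + x[i]^2
    total
  },
  vectorized = sum(x^2)
)
```

Only rewrite the loop if the difference actually matters at the scale your function is called.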

Code Validation

Validate early

step_yourname <- function(recipe, ..., your_param = 1) {
  # Validate parameters early
  if (!is.numeric(your_param) || your_param <= 0) {
    cli::cli_abort("{.arg your_param} must be a positive number.")
  }

  # ... rest of function
}

prep.step_yourname <- function(x, training, info = NULL, ...) {
  # Validate data early
  col_names <- recipes_eval_select(x$terms, training, info)
  check_type(training[, col_names], types = c("double", "integer"))

  # ... rest of function
}

Give actionable error messages

# Good: tells user what to do
cli::cli_abort(
  c(
    "Columns {.var {bad_cols}} must be numeric.",
    "i" = "Convert them with {.code as.numeric()}."
  )
)

# Avoid: vague errors
stop("Invalid columns")

Memory Management

Don’t store entire datasets

# Good: store only necessary parameters
prep.step_center <- function(x, training, ...) {
  means <- colMeans(training[col_names])  # Just means, not data
  # Return step with means stored
}

# Bad: storing entire training set
prep.step_center <- function(x, training, ...) {
  # Return step with training data stored (memory leak!)
}

Consider memory usage for large data

  • Store statistics/parameters, not raw data

  • Use sparse matrices when appropriate

  • Consider memory-mapped files for very large data
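As a sketch of the sparse-matrix point: a dummy-coded factor is mostly zeros, so the Matrix package (assumed available here) can store it far more compactly than a dense matrix:

```r
library(Matrix)

f <- factor(sample(letters[1:5], 1e4, replace = TRUE))
dense  <- model.matrix(~ f - 1)          # 10,000 x 5, one 1 per row
sparse <- Matrix(dense, sparse = TRUE)   # stores only the non-zero entries

object.size(dense) > object.size(sparse) # sparse uses much less memory
```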

Code Formatting

After writing code, format it:

# Format the current package from a terminal using the Air CLI
air format .

Or use RStudio: Code → Reformat Code (Cmd/Ctrl + Shift + A)

Version Control

Commit messages

# Good: descriptive commits
"Add support for multiclass metrics"
"Fix NA handling in case weights"
"Update documentation examples"

# Avoid: vague commits
"Fix bug"
"Update code"
"Changes"

Commit frequency

  • Commit after each logical unit of work

  • Commit working, tested code

  • Don’t commit broken code (except on branches)