# Extension Development: Best Practices

Context: This guide is for extension development - creating new packages that extend tidymodels packages.

Key principle: ❌ Never use internal functions (accessed with `:::`).

Guide to writing high-quality R code for tidymodels extension packages.

## Code Style
### Use the base pipe

```r
# Good
recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors())

# Avoid
recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_numeric_predictors())
```

The base pipe `|>` is faster, built-in, and the tidymodels standard.
### Anonymous functions

```r
# Single line: use backslash notation
map(x, \(i) i + 1)

# Multi-line: use function()
map(x, function(i) {
  result <- complex_computation(i)
  result + 1
})
```

### For-loops over map()
```r
# Preferred (better error messages)
for (col in columns) {
  new_data[[col]] <- transform(new_data[[col]])
}

# Avoid (harder to debug)
new_data[columns] <- map(columns, \(col) transform(new_data[[col]]))
```

Why prefer for-loops:

- Better error messages (shows which iteration failed)
- More familiar to most R users
- Easier to debug with `browser()`
- Consistent with tidymodels style
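To make the error-message advantage concrete, here is a minimal, self-contained sketch (the data and the `sqrt()` transform are illustrative, not from a real package): when a column fails, the loop variable still names the offending column.

```r
# Minimal sketch: the loop variable records which column failed.
new_data <- data.frame(a = c(1, 4, 9), b = c("x", "y", "z"))
columns <- c("a", "b")

failed <- tryCatch(
  {
    for (col in columns) {
      new_data[[col]] <- sqrt(new_data[[col]])  # errors on character column "b"
    }
    NULL
  },
  error = function(e) col  # `col` still holds the failing column's name
)
failed  # "b"
```

With `map()`, the equivalent error reports only that some element failed, and you must reconstruct which one.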
### Minimal comments

```r
# Good: code is self-documenting
means <- colMeans(data)
centered <- sweep(data, 2, means, "-")

# Avoid: over-commenting obvious code
# Calculate column means
means <- colMeans(data)
# Subtract means from each column
centered <- sweep(data, 2, means, "-")
```

Write clear code that doesn't need comments. Add comments only for:

- Complex algorithms
- Non-obvious optimization tricks
- Warnings about edge cases
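As a sketch of a comment that earns its keep, the hypothetical helper below documents a non-obvious numerical trick rather than restating the code:

```r
# Hypothetical helper: the comment explains *why*, not *what*.
log_mean_exp <- function(x) {
  # Shift by max(x) before exponentiating to avoid overflow for large inputs
  # (the log-sum-exp trick); the shift cancels out after taking the log.
  m <- max(x)
  m + log(mean(exp(x - m)))
}

log_mean_exp(c(1000, 1000.5))  # finite, whereas log(mean(exp(x))) overflows
```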
## Error Messages

### Use cli functions

```r
# Good: cli provides better formatting
if (invalid) {
  cli::cli_abort("{.arg param} must be positive, not {.val {param}}.")
}
if (risky) {
  cli::cli_warn("Column{?s} {.var {col_names}} returned Inf or NaN.")
}

# Avoid: base R error functions
stop("param must be positive")
warning("columns returned Inf or NaN")
```

### cli formatting syntax
```r
# Argument names
cli::cli_abort("{.arg your_param} must be numeric.")

# Code/function names
cli::cli_abort("Use the {.code binary} estimator for two classes.")

# Values
cli::cli_abort("Expected 3 columns, got {.val {ncol(data)}}.")

# Variable names
cli::cli_warn("Column{?s} {.var {col_names}} {?has/have} missing values.")

# Pluralization
cli::cli_abort("Found {length(x)} error{?s}.")  # Handles 1 vs. many
```

### Error message guidelines
- Be specific about what's wrong
- Tell users what they can do to fix it
- Include actual values when helpful
- Use proper English grammar

```r
# Good
cli::cli_abort(
  "{.arg threshold} must be between 0 and 1, not {.val {threshold}}."
)

# Avoid
stop("Invalid threshold")
```

## Documentation Standards
### Be explicit

```r
#' @param threshold Threshold value for classification. Must be a numeric
#'   value between 0 and 1. Default is 0.5.
```

Include:

- Type (numeric, logical, character, factor)
- Valid range or options
- Default value
- Effect on function behavior
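Applying that checklist, a `@param` entry for a hypothetical logical argument (the name `na_rm` is illustrative) might read:

```r
#' @param na_rm A single logical: should missing values be removed before
#'   computing the metric? Must be `TRUE` or `FALSE`. Defaults to `TRUE`.
```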
### US English

- Use American spelling: “normalize”, not “normalise”
- Use sentence case: “Calculate the mean”, not “calculate the mean”
- Be consistent throughout

### Wrap roxygen at 80 characters

```r
#' This is a long line that should be wrapped to ensure it doesn't exceed the
#' 80-character limit for better readability in various text editors.
```

### Include practical examples
```r
#' @examples
#' # Basic usage
#' metric_name(data, truth, estimate)
#'
#' # With grouped data
#' data |>
#'   dplyr::group_by(fold) |>
#'   metric_name(truth, estimate)
```

Show realistic use cases, not just minimal examples.
### Don’t use dynamic roxygen code

```r
# Bad: calling non-exported functions
#' @return Range: `r metric_range()`  # metric_range() is not exported

# Good: static documentation
#' @return Range: 0 to 1
```

## Performance
### Vectorization over loops

Always prefer vectorized operations:

```r
# Good: vectorized
errors <- truth - estimate
squared_errors <- errors^2
mean(squared_errors)

# Bad: loop
total <- 0
for (i in seq_along(truth)) {
  total <- total + (truth[i] - estimate[i])^2
}
total / length(truth)
```

Vectorized functions:
- Arithmetic: `+`, `-`, `*`, `/`, `^`
- Comparisons: `==`, `!=`, `>`, `<`, `>=`, `<=`
- Logical: `&`, `|`, `!`
- Math: `abs()`, `sqrt()`, `log()`, `exp()`, `sin()`, `cos()`
- Aggregations: `sum()`, `mean()`, `max()`, `min()`, `median()`
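A quick sketch combining several of these (the vectors here are illustrative, not package data): root mean squared error in one vectorized expression, with no explicit loop.

```r
# Arithmetic, ^, mean(), and sqrt() all operate on whole vectors.
truth <- c(3.2, 4.1, 5.0)
estimate <- c(3.0, 4.5, 4.8)

rmse <- sqrt(mean((truth - estimate)^2))
rmse  # about 0.283
```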
### Use matrix operations

Efficient per-class calculations:

```r
# Good: matrix operations
confusion_matrix <- yardstick_table(truth, estimate)
tp <- diag(confusion_matrix)
fp <- colSums(confusion_matrix) - tp
fn <- rowSums(confusion_matrix) - tp

# Bad: looping over classes
tp <- numeric(n_classes)
for (i in seq_len(n_classes)) {
  tp[i] <- confusion_matrix[i, i]
}
```

Use `colSums()` and `rowSums()`:
```r
# Good
class_totals <- colSums(confusion_matrix)

# Avoid
class_totals <- apply(confusion_matrix, 2, sum)  # Slower
```

### Avoid repeated computations
General principle: Calculate once, use many times.
```r
# Good: compute once in prep() for recipe steps
prep.step_yourname <- function(x, training, ...) {
  means <- colMeans(training[col_names])  # Computed once, stored
}

# Good: validate once at entry point
metric_vec <- function(truth, estimate, ...) {
  check_numeric_metric(truth, estimate, case_weights)  # Validate once
  metric_impl(truth, estimate, ...)  # Trust the data
}

# Good: pre-compute before loops
levels_list <- levels(truth)
n_levels <- length(levels_list)
for (i in seq_len(n_levels)) {
  # Use pre-computed values
}

# Bad: recomputing unnecessarily
for (i in seq_len(length(levels(truth)))) {
  levels_list <- levels(truth)  # Redundant!
}
```

### Handle case weights efficiently
Convert hardhat weights once:
```r
# Good: convert once at the start
if (!is.null(case_weights)) {
  if (inherits(case_weights, c("hardhat_importance_weights",
                               "hardhat_frequency_weights"))) {
    case_weights <- as.double(case_weights)
  }
  # Now use case_weights multiple times
}

# Bad: converting repeatedly
if (!is.null(case_weights)) {
  result1 <- weighted.mean(x, as.double(case_weights))
  result2 <- weighted.mean(y, as.double(case_weights))  # Converting again!
}
```

### Profile before optimizing
Focus optimization where it matters:

- Start with clear, correct code
- Profile with `profvis::profvis()` if performance is an issue
- Optimize the actual bottlenecks
- Don’t prematurely optimize

```r
# Profile your code
profvis::profvis({
  for (i in 1:100) {
    your_function(data)
  }
})
```

### When performance doesn’t matter
Don’t optimize unnecessarily when:

- The function is typically called once or a few times per evaluation
- The calculation is usually fast compared to model fitting
- Readability and correctness are more important

Do optimize when:

- The function is called thousands of times (tuning, cross-validation)
- You are working with very large datasets (millions of observations)
- Profiling shows the function is the bottleneck
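As a minimal base-R sketch of "measure before rewriting" (the function and data are illustrative; `profvis::profvis()` gives a much richer view), compare a candidate against its vectorized alternative before optimizing anything:

```r
# Illustrative bottleneck check with base system.time().
slow_sum <- function(x) {
  total <- 0
  for (xi in x) total <- total + xi
  total
}

x <- runif(1e6)
system.time(for (i in 1:50) slow_sum(x))  # loop version
system.time(for (i in 1:50) sum(x))       # vectorized version
```

If the timings are both negligible relative to model fitting, leave the clearer version in place.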
## Code Validation

### Validate early

```r
step_yourname <- function(recipe, ..., your_param = 1) {
  # Validate parameters early
  if (!is.numeric(your_param) || your_param <= 0) {
    cli::cli_abort("{.arg your_param} must be a positive number.")
  }
  # ... rest of function
}

prep.step_yourname <- function(x, training, ...) {
  # Validate data early
  col_names <- recipes_eval_select(x$terms, training, info)
  check_type(training[, col_names], types = c("double", "integer"))
  # ... rest of function
}
```

### Give actionable error messages
```r
# Good: tells the user what to do
cli::cli_abort(c(
  "Columns {.var {bad_cols}} must be numeric.",
  "i" = "Convert them with {.code as.numeric()}."
))

# Avoid: vague errors
stop("Invalid columns")
```

## Memory Management
### Don’t store entire datasets

```r
# Good: store only the necessary parameters
prep.step_center <- function(x, training, ...) {
  means <- colMeans(training[col_names])  # Just the means, not the data
  # Return the step with the means stored
}

# Bad: storing the entire training set
prep.step_center <- function(x, training, ...) {
  # Return the step with the training data stored (memory leak!)
}
```

### Consider memory usage for large data
- Store statistics/parameters, not raw data
- Use sparse matrices when appropriate
- Consider memory-mapped files for very large data
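As a sketch of the sparse-matrix point, using the Matrix package that ships with R (the indices below are made up): a mostly-zero matrix stores only its non-zero entries.

```r
library(Matrix)

# A 1000 x 10 matrix with only three non-zero entries (illustrative indices).
m <- Matrix::sparseMatrix(
  i    = c(1, 500, 1000),  # row indices of the non-zero entries
  j    = c(2, 3, 1),       # column indices
  x    = c(1, 1, 1),       # values
  dims = c(1000, 10)
)

object.size(m)             # small: only non-zero entries are stored
object.size(as.matrix(m))  # dense equivalent: 1000 * 10 doubles
```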
## Code Formatting

After writing code, format it:

```r
# Format the current package
air::air_format(".")
```

Or use RStudio: Code → Reformat Code (Cmd/Ctrl + Shift + A).
## Version Control

### Commit messages

```
# Good: descriptive commits
"Add support for multiclass metrics"
"Fix NA handling in case weights"
"Update documentation examples"

# Avoid: vague commits
"Fix bug"
"Update code"
"Changes"
```

### Commit frequency

- Commit after each logical unit of work
- Commit working, tested code
- Don’t commit broken code (except on branches)