Understanding the Yardstick Metric System
Before creating metrics, understanding how yardstick’s metric system works helps you build metrics that integrate properly with the ecosystem.
Note for Source Development: If you’re contributing directly to the yardstick package, you can use internal helper functions like
yardstick_mean(), finalize_estimator_internal(), and validation helpers. See the Source Development Guide for details.
What new_*_metric() does
When you wrap your metric function with new_numeric_metric(), new_class_metric(), or new_prob_metric(), it:
Sets attributes that describe your metric:
- direction: “minimize”, “maximize”, or “zero” (which value is optimal)
- range: c(min, max) (the possible values the metric can take)
Creates a class hierarchy:
# Example for accuracy
class(accuracy)
# [1] "accuracy" "class_metric" "metric" "function"
Enables ecosystem integration:
- metric_set() knows how to combine your metric with others
- The metric can be identified and validated
- Automatic method dispatch works correctly
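The effect of wrapping can be sketched as follows, assuming yardstick is installed. The name sq_err is a hypothetical placeholder rather than a yardstick function, and only direction is set here because the range argument may not be available in every yardstick release.

```r
library(yardstick)

# Hypothetical placeholder: a generic that will become a metric function
sq_err <- function(data, ...) {
  UseMethod("sq_err")
}

# Wrapping attaches the metric classes and the direction attribute
sq_err <- new_numeric_metric(sq_err, direction = "minimize")

inherits(sq_err, "numeric_metric")  # TRUE
inherits(sq_err, "metric")          # TRUE
attr(sq_err, "direction")           # "minimize"
```

The wrapped object is still an ordinary function; the classes and attributes are metadata that other tidymodels tools read.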
Why this matters
For metric_set() composition
metrics <- metric_set(accuracy, precision, recall)
The metric class hierarchy allows metric_set() to:
- Verify all metrics are compatible
- Group results by .estimator appropriately
- Apply metrics to data correctly
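A small sketch of this behavior, assuming yardstick is installed and using its bundled two_class_example data (factor columns truth and predicted). The variable names class_metrics, res, and mixed are illustrative only.

```r
library(yardstick)

# Three class metrics are compatible, so they combine cleanly
class_metrics <- metric_set(accuracy, precision, recall)
res <- class_metrics(two_class_example, truth = truth, estimate = predicted)

res$.metric     # "accuracy" "precision" "recall"
res$.estimator  # "binary" for all three, since truth has two levels

# Mixing a class metric with a numeric metric is rejected when the set is built
mixed <- tryCatch(metric_set(accuracy, rmse), error = function(e) e)
inherits(mixed, "error")  # TRUE
```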
For direction and range
# These attributes help users understand the metric
attr(accuracy, "direction") # "maximize"
attr(accuracy, "range") # c(0, 1)
Tools can use this to:
- Know if higher is better or worse
- Validate metric values are in expected range
- Create appropriate visualizations
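As a rough illustration of how a tool might consume these attributes, the base R sketch below picks the best of several estimates from a direction value and checks an estimate against a range. Both helpers, best_value() and in_range(), are hypothetical and not part of yardstick.

```r
# Hypothetical helper: pick the best value given a metric's direction
best_value <- function(values, direction = c("maximize", "minimize", "zero")) {
  direction <- match.arg(direction)
  switch(direction,
    maximize = max(values),
    minimize = min(values),
    zero     = values[which.min(abs(values))]
  )
}

# Hypothetical helper: check an estimate against a c(min, max) range
in_range <- function(value, range) {
  value >= range[1] && value <= range[2]
}

best_value(c(0.71, 0.68, 0.75), "maximize")  # 0.75
best_value(c(1.2, 0.9, 1.5), "minimize")     # 0.9
best_value(c(-0.3, 0.1, 0.4), "zero")        # 0.1
in_range(0.75, c(0, 1))                      # TRUE
in_range(1.2, c(0, 1))                       # FALSE
```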
The .estimator column
Every metric returns a tibble with a .estimator column:
For numeric metrics
# Always "standard"
mae(df, truth, estimate)
# .metric .estimator .estimate
# mae standard 0.5
For class metrics
# Depends on number of classes
accuracy(df_binary, truth, estimate)
# .metric .estimator .estimate
# accuracy binary 0.75
accuracy(df_multiclass, truth, estimate)
# .metric .estimator .estimate
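# accuracy multiclass 0.68
The estimator value comes from finalize_estimator():
- Binary classification → “binary”
- Multiclass with 3+ levels → “macro” by default, or “micro” or “macro_weighted” when requested; metrics with no averaging step, such as accuracy, report “multiclass”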
- Numeric/regression → “standard”
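The default selection can be checked directly with finalize_estimator(), assuming yardstick is installed. The vectors below are made-up inputs, and metric-specific overrides are not shown.

```r
library(yardstick)

two_level   <- factor(c("a", "b", "a"), levels = c("a", "b"))
three_level <- factor(c("a", "b", "c"), levels = c("a", "b", "c"))

finalize_estimator(two_level)         # "binary"
finalize_estimator(three_level)       # "macro"
finalize_estimator(c(1.5, 2.0, 3.1))  # "standard"

# An explicit estimator overrides the automatic choice
finalize_estimator(three_level, estimator = "micro")  # "micro"
```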
Why it matters
When you use metric_set(), results are grouped by .estimator:
metrics <- metric_set(accuracy, precision, recall)
metrics(df, truth, estimate)
# With binary data, all three metrics report .estimator = "binary"
Class naming conventions
Your metric’s primary class should match the function name:
mse <- new_numeric_metric(mse, direction = "minimize", range = c(0, Inf))
class(mse)
# [1] "mse" "numeric_metric" "metric" "function"
This enables S3 dispatch for methods like autoplot.mse().
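The dispatch mechanism itself is plain S3 and can be sketched in base R without yardstick. The generic describe_metric() and the stand-in objects are hypothetical; they only mimic the class layout shown above.

```r
# Hypothetical generic, used only to illustrate dispatch on the class vector
describe_metric <- function(x, ...) {
  UseMethod("describe_metric")
}

# Fallback for any object carrying the "metric" class
describe_metric.metric <- function(x, ...) {
  "a generic metric"
}

# More specific method, found first because "mse" leads the class vector
describe_metric.mse <- function(x, ...) {
  "mean squared error"
}

# Stand-in objects with the same class layout as wrapped metric functions
fake_mse <- structure(
  function(...) NULL,
  class = c("mse", "numeric_metric", "metric", "function")
)
fake_other <- structure(
  function(...) NULL,
  class = c("numeric_metric", "metric", "function")
)

describe_metric(fake_mse)    # "mean squared error"
describe_metric(fake_other)  # "a generic metric"
```

Without a primary class that matches the function name, only the fallback method would ever be reached.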
Design Considerations
Before implementing a new metric, consider whether you actually need to create one.
When to create a new metric
Create a new metric when:
- It measures a genuinely different aspect of model performance
- It’s commonly used in your domain and not available in yardstick
- It has a well-defined formula or calculation method
- You’ll use it repeatedly across multiple projects
Don’t create a new metric if:
- It’s just a transformation of an existing metric (use metric_tweak() instead)
- It can be composed from existing metrics
- It’s a one-off calculation for a specific analysis
- It’s too domain-specific for general use
Using metric_tweak() for variations
For simple variations of existing metrics, use metric_tweak():
# Create a variant of F-measure with beta = 2
f2_meas <- metric_tweak("f2_meas", f_meas, beta = 2)
# Use it like any other metric
f2_meas(df, truth, estimate)
metric_set(accuracy, f2_meas)
This is much simpler than creating a full new metric.
Naming conventions
Follow yardstick patterns:
- Use lowercase with underscores: mean_squared_error → mse
- Avoid camelCase or PascalCase
- Be consistent with existing naming
Abbreviations vs full names:
- Well-known abbreviations: rmse, mae, auc (widely recognized)
- Full names for clarity: accuracy, precision, recall (already short)
- When in doubt, use the full name
Avoid conflicts:
# Bad: too generic
error() # Reads like error handling (stop(), tryCatch()), not a metric
metric() # Too vague
# Good: specific and descriptive
prediction_error()
classification_metric()
Examples of good names:
- miss_rate (clear, descriptive)
- huber_loss (named after the technique)
- roc_auc (standard abbreviation)
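These conventions can be checked mechanically before a name is committed. The base R sketch below is one possible approach; is_snake_case() and name_in_use() are hypothetical helpers, and the regular expression is an assumption about what counts as an acceptable name.

```r
# Hypothetical helper: TRUE for lowercase snake_case names
is_snake_case <- function(name) {
  grepl("^[a-z][a-z0-9]*(_[a-z0-9]+)*$", name)
}

# Hypothetical helper: TRUE when a function of that name is already visible
# on the search path, so a new metric with this name would mask it
name_in_use <- function(name) {
  exists(name, mode = "function")
}

is_snake_case(c("miss_rate", "huber_loss", "rocAuc", "MissRate"))
# TRUE TRUE FALSE FALSE

name_in_use("mean")  # TRUE: a metric called mean() would mask base::mean()
```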
Parameter design
What should be arguments:
# Hyperparameters that affect calculation
huber_loss(data, truth, estimate, delta = 1.0)
# Configuration that changes behavior
f_meas(data, truth, estimate, beta = 1)
# Costs, thresholds, or cutoffs
# (yardstick's classification_cost() takes costs as a data frame with truth, estimate, and cost columns)
classification_cost(data, truth, estimate, costs = cost_table)
What should NOT be arguments:
- Constants that are part of the metric definition
- Values that would break the metric’s meaning
- Options that should be separate metrics
Keep parameters minimal:
# Good: focused parameters
mse(data, truth, estimate, na_rm = TRUE, case_weights = NULL)
# Bad: too many options
mse(data, truth, estimate, na_rm = TRUE, case_weights = NULL,
    sqrt = FALSE, relative = FALSE, log_scale = FALSE)
# These should be separate metrics: rmse(), relative_mse(), log_mse()
Users can always wrap your metric if they need variations:
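my_custom_mse <- function(data, truth, estimate) {
  # Embrace the column arguments so they are forwarded to mse() unevaluated
  result <- mse(data, {{ truth }}, {{ estimate }})
  # Take the square root and relabel the result to match the new quantity
  result$.estimate <- sqrt(result$.estimate)
  result$.metric <- "my_custom_mse"
  result
}
Single responsibility principle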
Each metric should do one thing well:
# Good: accuracy measures one thing
accuracy(data, truth, estimate)
# Bad: don't combine multiple metrics
accuracy_and_precision() # Should be two separate metrics
combined_scores() # Use metric_set() instead
Compose with metric_set() instead:
# Let users compose metrics
metrics <- metric_set(accuracy, precision, recall, f_meas)
metrics(data, truth, estimate)
Scope and reusability
Design for general use:
- Avoid hard-coded domain-specific values
- Make assumptions explicit in documentation
- Allow customization through parameters when appropriate
Example:
# Bad: too specific
credit_risk_score(data, truth, estimate) # Hard-codes credit risk logic
# Good: general with parameters
# cost_table: a data frame of costs, e.g. 2 for a false positive and 5 for a false negative
classification_cost(data, truth, estimate, costs = cost_table)
# Users can set costs for their domain
Exported vs Internal Functions
Many yardstick helper functions are INTERNAL and not exported. Using them will cause runtime errors.
❌ Don’t Use (Internal/Not Exported)
- yardstick_mean() - NOT EXPORTED
- get_weights() - NOT EXPORTED
- metric_range() - NOT EXPORTED
- metric_optimal() - NOT EXPORTED
- metric_direction() - NOT EXPORTED
- data_altman() - NOT EXPORTED (test helper)
- data_three_class() - NOT EXPORTED (test helper)
✅ Use Instead
For weighted calculations:
# Instead of yardstick_mean(), use base R weighted.mean()
if (is.null(case_weights)) {
  mean(values)
} else {
  # Handle hardhat weights (convert to numeric)
  wts <- if (inherits(case_weights, "hardhat_importance_weights") ||
    inherits(case_weights, "hardhat_frequency_weights")) {
    as.double(case_weights)
  } else {
    case_weights
  }
  weighted.mean(values, w = wts)
}
EXPORTED yardstick functions you CAN safely use
- check_numeric_metric() ✓
- check_class_metric() ✓
- check_prob_metric() ✓
- yardstick_remove_missing() ✓
- yardstick_any_missing() ✓
- yardstick_table() ✓
- finalize_estimator() ✓
- validate_estimator() ✓
- abort_if_class_pred() ✓
- as_factor_from_class_pred() ✓
- numeric_metric_summarizer() ✓
- class_metric_summarizer() ✓
- prob_metric_summarizer() ✓
- new_numeric_metric() ✓
- new_class_metric() ✓
- new_prob_metric() ✓
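To show how several of these exported functions might fit together, here is a minimal sketch of a numeric metric built only from exported helpers and base R. It assumes yardstick and rlang are installed; mse, mse_impl, and mse_vec are illustrative names, and the layout follows the pattern in yardstick's custom metric documentation rather than any internal implementation.

```r
library(yardstick)

# Core calculation: base R only, no internal yardstick helpers
mse_impl <- function(truth, estimate, case_weights = NULL) {
  sq_err <- (truth - estimate)^2
  if (is.null(case_weights)) {
    mean(sq_err)
  } else {
    weighted.mean(sq_err, w = as.double(case_weights))
  }
}

# Vector interface: validate inputs, handle missing values, then compute
mse_vec <- function(truth, estimate, na_rm = TRUE, case_weights = NULL, ...) {
  check_numeric_metric(truth, estimate, case_weights)
  if (na_rm) {
    cleaned <- yardstick_remove_missing(truth, estimate, case_weights)
    truth <- cleaned$truth
    estimate <- cleaned$estimate
    case_weights <- cleaned$case_weights
  } else if (yardstick_any_missing(truth, estimate, case_weights)) {
    return(NA_real_)
  }
  mse_impl(truth, estimate, case_weights)
}

# Generic, wrapped so that metric_set() recognizes it as a numeric metric
mse <- function(data, ...) {
  UseMethod("mse")
}
mse <- new_numeric_metric(mse, direction = "minimize")

# Data frame method: the summarizer handles grouping and the output tibble
mse.data.frame <- function(data, truth, estimate, na_rm = TRUE,
                           case_weights = NULL, ...) {
  numeric_metric_summarizer(
    name = "mse",
    fn = mse_vec,
    data = data,
    truth = !!rlang::enquo(truth),
    estimate = !!rlang::enquo(estimate),
    na_rm = na_rm,
    case_weights = !!rlang::enquo(case_weights)
  )
}

df <- data.frame(truth = c(1, 2, 3), estimate = c(1, 2, 5))
mse(df, truth, estimate)
```

For this toy data the squared errors are 0, 0, and 4, so the estimate is 4/3 with .estimator equal to "standard".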
Next Steps
- Create numeric metrics: numeric-metrics.md
- Create class metrics: class-metrics.md
- Create probability metrics: probability-metrics.md
- Understand confusion matrices: confusion-matrix.md