Model Evaluation
Overview
Evaluation measures how well the model predicts unseen data. Always use out-of-sample predictions—from cross-validation during development, from the test set for final assessment.
Classification Metrics
Threshold-Independent Metrics
These evaluate the full range of predicted probabilities without choosing a classification threshold.
ROC-AUC: Area under the ROC curve. Measures ability to rank positive cases higher than negative cases. Range: 0.5 (random) to 1.0 (perfect).
PR-AUC: Area under the precision-recall curve. Better than ROC-AUC for imbalanced data where the positive class is rare.
Brier Score: Mean squared error of probability predictions. Measures calibration. Range: 0 (perfect) to 1 (worst). Lower is better.
tidymodels
library(tidymodels)
# From predictions dataframe with columns: .pred_class, .pred_positive, truth
metrics <- metric_set(roc_auc, pr_auc, brier_class)
predictions |> metrics(truth = outcome, .pred_positive, estimate = .pred_class)
# Individual metrics
predictions |> roc_auc(truth = outcome, .pred_positive)
predictions |> pr_auc(truth = outcome, .pred_positive)
predictions |> brier_class(truth = outcome, .pred_positive)Threshold-Dependent Metrics
Require choosing a probability threshold (default: 0.5) to convert probabilities to class predictions.
Accuracy: Proportion of correct predictions. Can be misleading with imbalanced classes.
Sensitivity (Recall): True positive rate. Of actual positives, how many did we catch?
Specificity: True negative rate. Of actual negatives, how many did we correctly identify?
Precision (PPV): Of predicted positives, how many are actually positive?
F1 Score: Harmonic mean of precision and recall. Balances both concerns.
tidymodels
# Confusion matrix
predictions |> conf_mat(truth = outcome, estimate = .pred_class)
# Multiple metrics
class_metrics <- metric_set(accuracy, sensitivity, specificity, precision, f_meas)
predictions |> class_metrics(truth = outcome, estimate = .pred_class)Multiclass Classification
For more than two classes, metrics are computed per-class and aggregated.
# Macro-averaging (unweighted mean across classes)
predictions |> accuracy(truth = outcome, estimate = .pred_class)
predictions |> roc_auc(truth = outcome, .pred_class_A, .pred_class_B, .pred_class_C)
# Specify estimator explicitly
predictions |> sensitivity(truth = outcome, estimate = .pred_class, estimator = "macro")
predictions |> sensitivity(truth = outcome, estimate = .pred_class, estimator = "micro")Regression Metrics
RMSE: Root mean squared error. In outcome units. Penalizes large errors heavily.
MAE: Mean absolute error. In outcome units. Less sensitive to outliers than RMSE.
R²: Coefficient of determination. Proportion of variance explained. Does not measure prediction accuracy—use with RMSE/MAE.
MAPE: Mean absolute percentage error. Scale-independent but undefined when true values are zero.
tidymodels
# From predictions dataframe with columns: .pred, truth
reg_metrics <- metric_set(rmse, mae, rsq)
predictions |> reg_metrics(truth = outcome, estimate = .pred)
# Individual metrics
predictions |> rmse(truth = outcome, estimate = .pred)
predictions |> mae(truth = outcome, estimate = .pred)
predictions |> rsq(truth = outcome, estimate = .pred)Evaluation Visualizations
Classification: ROC Curve
Shows tradeoff between sensitivity and specificity across all thresholds.
# Generate ROC curve data
roc_data <- predictions |> roc_curve(truth = outcome, .pred_positive)
# Plot
autoplot(roc_data)
# Custom plot
roc_data |>
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_abline(linetype = "dashed") +
coord_equal()Classification: Precision-Recall Curve
Shows tradeoff between precision and recall. Preferred for imbalanced data.
pr_data <- predictions |> pr_curve(truth = outcome, .pred_positive)
autoplot(pr_data)Classification: Calibration Curve
Shows whether predicted probabilities match observed frequencies. Well-calibrated models follow the diagonal.
library(probably)
# Calibration plot
predictions |>
cal_plot_breaks(truth = outcome, .pred_positive, num_breaks = 10)
# Calibration with more refinement
predictions |>
cal_plot_windowed(truth = outcome, .pred_positive, step_size = 0.03)Classification: Confusion Matrix
predictions |>
conf_mat(truth = outcome, estimate = .pred_class) |>
autoplot(type = "heatmap")
# Mosaic plot
predictions |>
conf_mat(truth = outcome, estimate = .pred_class) |>
autoplot(type = "mosaic")Regression: Observed vs Predicted
Points should fall along the diagonal for good predictions.
predictions |>
ggplot(aes(x = outcome, y = .pred)) +
geom_point(alpha = 0.5) +
geom_abline(color = "red", linetype = "dashed") +
coord_obs_pred() +
labs(x = "Observed", y = "Predicted")Regression: Residual Plots
Residuals should be randomly scattered around zero with constant variance.
predictions |>
mutate(residual = outcome - .pred) |>
ggplot(aes(x = .pred, y = residual)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(x = "Predicted", y = "Residual")Residual distribution:
predictions |>
mutate(residual = outcome - .pred) |>
ggplot(aes(x = residual)) +
geom_histogram(bins = 30)Comparing Models
Resampling Comparison
Compare models using the same resamples for valid statistical comparison.
# Fit multiple models to same resamples
results_rf <- fit_resamples(wf_rf, resamples = folds)
results_xgb <- fit_resamples(wf_xgb, resamples = folds)
results_glm <- fit_resamples(wf_glm, resamples = folds)
# Combine for comparison
all_results <- bind_rows(
collect_metrics(results_rf) |> mutate(model = "random_forest"),
collect_metrics(results_xgb) |> mutate(model = "xgboost"),
collect_metrics(results_glm) |> mutate(model = "logistic")
)
# Visualize
all_results |>
filter(.metric == "roc_auc") |>
ggplot(aes(x = model, y = mean)) +
geom_point() +
geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err), width = 0.2)Using workflow_set
library(workflowsets)
# Create workflow set
wf_set <- workflow_set(
preproc = list(basic = recipe_basic, normalized = recipe_normalized),
models = list(rf = model_rf, xgb = model_xgb)
)
# Fit all workflows to same resamples
wf_results <- wf_set |>
workflow_map("fit_resamples", resamples = folds)
# Rank by metric
rank_results(wf_results, rank_metric = "roc_auc")
# Plot comparison
autoplot(wf_results, metric = "roc_auc")Collecting Predictions from Resamples
# Save predictions during resampling
results <- fit_resamples(
wf,
resamples = folds,
control = control_resamples(save_pred = TRUE)
)
# Get all out-of-sample predictions
predictions <- collect_predictions(results, summarize = TRUE)
# Now use for evaluation visualizations
predictions |> roc_curve(truth = outcome, .pred_positive) |> autoplot()