Parsnip Model Specification System
This document explains the architecture and design of parsnip's model specification system, which applies both to creating new models and to adding engines to existing ones.
Overview
parsnip provides a unified interface to diverse modeling functions across R packages. It separates:
- Model specification - What type of model (linear_reg, boost_tree, etc.)
- Engine - How to compute it (lm, glmnet, xgboost, etc.)
- Mode - What type of prediction (regression, classification, etc.)
This separation allows the same model specification to work with multiple computational engines while maintaining a consistent interface.
Model Specification Objects
Structure
A model specification is an S3 object created by functions like linear_reg(), boost_tree(), etc. It contains:
```r
linear_reg()
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
```

Key properties stored in the object:

- `args` - Main arguments (`penalty`, `mixture`, `trees`, etc.)
- `eng_args` - Engine-specific arguments (passed via `set_engine()`)
- `mode` - The prediction mode ("regression", "classification", etc.)
- `engine` - The computational backend (e.g., "lm", "glmnet", "xgboost")
- `method` - Internal: the fitting method
- `user_specified_mode` - Whether the user explicitly set the mode
- `user_specified_engine` - Whether the user explicitly set the engine
Class Hierarchy
Model specifications have a class hierarchy:
```r
class(linear_reg())
#> [1] "linear_reg" "model_spec"
```

This allows:

- Method dispatch (e.g., `fit.linear_reg()`)
- Type checking (is it a `model_spec`?)
- Model-specific behaviors
The class vector is created by `make_classes()`, which prepends the model type to `"model_spec"`.
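The prepending behavior can be sketched in base R; `make_classes_sketch()` below is a hypothetical stand-in for parsnip's internal `make_classes()`, and the field names follow the list earlier in this document:

```r
# Illustrative stand-in for make_classes(): prepend the model type
make_classes_sketch <- function(model_type) c(model_type, "model_spec")

# A toy spec-like object carrying configuration only
spec <- structure(
  list(args = list(penalty = 0.1), mode = "regression"),
  class = make_classes_sketch("linear_reg")
)

class(spec)                   # "linear_reg" "model_spec"
inherits(spec, "model_spec")  # TRUE
```

Because `"model_spec"` sits last in the vector, generic functions can dispatch on the specific model type first and fall back to shared `model_spec` behavior.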
Difference from Fitted Models
Model specification (`model_spec`):

- A blueprint for fitting
- No data involved
- Lightweight (just configuration)
- Created by `linear_reg()`, `boost_tree()`, etc.

Fitted model (`model_fit`):

- The result of `fit()`
- Contains trained parameters
- Holds the actual model object (e.g., an `lm` object)
- Used for prediction
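The contrast can be illustrated with base R, using `lm` as a stand-in engine; the `spec_like` list is a hypothetical simplification, not parsnip's actual structure:

```r
# A "specification": pure configuration, no data touched yet
spec_like <- list(formula = mpg ~ wt, engine = "lm")

# A "fitted model": training has happened, parameters are stored
fit_like <- lm(spec_like$formula, data = mtcars)

names(coef(fit_like))  # "(Intercept)" "wt" -- data-derived content
```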
Engine Registration System
How Engines Work
Engines connect model specifications to actual computational implementations:
```
linear_reg(penalty = 0.1) |> set_engine("glmnet")
                 ↓
glmnet::glmnet(lambda = 0.1, ...)
```
Each model-engine-mode combination must be registered with:
- `set_model_engine()` - Register that the engine exists
- `set_dependency()` - Specify required packages
- `set_model_arg()` - Translate main arguments to engine arguments
- `set_fit()` - Specify how to fit the model
- `set_pred()` - Specify how to make predictions (one call per prediction type)
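Taken together, a registration for a hypothetical model/engine pair might look like the sketch below. The argument names follow parsnip's extension documentation, but `my_model`, `myengine`, and `mypkg` are invented placeholders; treat the details as illustrative rather than exact:

```r
library(parsnip)

# Hypothetical registration sketch; "my_model"/"myengine"/"mypkg" do not exist
set_new_model("my_model")
set_model_mode("my_model", mode = "regression")
set_model_engine("my_model", mode = "regression", eng = "myengine")
set_dependency("my_model", eng = "myengine", pkg = "mypkg")

# Map the standardized `penalty` argument to the engine's native `lambda`
set_model_arg(
  model = "my_model", eng = "myengine",
  parsnip = "penalty", original = "lambda",
  func = list(pkg = "dials", fun = "penalty"),
  has_submodel = FALSE
)

set_fit(
  model = "my_model", eng = "myengine", mode = "regression",
  value = list(
    interface = "formula",
    protect = c("formula", "data"),
    func = c(pkg = "mypkg", fun = "my_fit"),
    defaults = list()
  )
)

set_pred(
  model = "my_model", eng = "myengine", mode = "regression", type = "numeric",
  value = list(
    pre = NULL, post = NULL,
    func = c(fun = "predict"),
    args = list(object = quote(object$fit), newdata = quote(new_data))
  )
)
```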
The Registration Database
Registered models are stored in an environment accessible via get_model_env():
```r
env <- get_model_env()
ls(env)  # Lists all registered models
```

For each model, there is a table of engine/mode combinations with their fit and prediction specifications.
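The idea of an environment-backed registry can be sketched in base R. This is a toy stand-in: parsnip's real registry stores richer per-model tables, and the entries below are illustrative:

```r
# Toy registry: one environment, one entry per model type
model_env <- new.env(parent = emptyenv())
assign(
  "linear_reg",
  data.frame(engine = c("lm", "glmnet"), mode = "regression"),
  envir = model_env
)

ls(model_env)                                # registered model types
get("linear_reg", envir = model_env)$engine  # engines for one model
```

Using an environment (rather than a list) gives reference semantics, so extension packages can add registrations without copying the whole database.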
Looking Up Available Engines
```r
show_engines("linear_reg")
#> Shows all registered engines and modes
```

This queries the registration database to show what's available.
Model Modes
Available Modes
From `R/aaa_models.R`, parsnip supports:

- `"regression"` - Numeric outcomes
- `"classification"` - Categorical outcomes
- `"censored regression"` - Survival/time-to-event outcomes
- `"quantile regression"` - Quantile predictions
- `"unknown"` - Placeholder before the user sets a mode
Setting Modes
In the constructor (default mode):

```r
linear_reg(mode = "regression")  # Default
```

Change it with `set_mode()`:

```r
nearest_neighbor(mode = "unknown") |>
  set_mode("classification")
```

Mode-Specific Behaviors
Different modes have different:
Prediction types:

- Regression: `numeric`, `conf_int`, `pred_int`, `raw`
- Classification: `class`, `prob`, `raw`
- Censored regression: `time`, `survival`, `hazard`, `linear_pred`, `raw`
- Quantile regression: `quantile`, `raw`
Engine requirements:

- Some engines only support certain modes
- Each model-engine pair must be registered separately for each mode

Validation:

- parsnip checks mode compatibility before fitting
- It errors if the engine doesn't support the requested mode
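A toy version of that compatibility check, in base R. This is purely illustrative; parsnip's real checks live in internals like `spec_is_possible()`, and the mode table below is a made-up example:

```r
# Hypothetical engine -> supported modes table
supported_modes <- list(
  lm     = "regression",
  glmnet = c("regression", "classification")
)

check_mode <- function(engine, mode) {
  if (!mode %in% supported_modes[[engine]]) {
    stop(sprintf("Engine '%s' does not support mode '%s'.", engine, mode))
  }
  invisible(TRUE)
}

check_mode("glmnet", "classification")  # silently succeeds
# check_mode("lm", "classification")    # would error
```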
Main Arguments vs Engine Arguments
Philosophy
parsnip distinguishes between:
Main arguments - Standardized across engines:

- Common tuning parameters
- Defined in the model constructor
- Named consistently (e.g., `penalty`, `trees`, `mtry`)
- May not apply to all engines (ignored if not applicable)

Engine arguments - Engine-specific:

- Passed via `set_engine()`
- Not standardized
- Go directly to the underlying function
- Use the engine's native names
When to Add Main Arguments
Main arguments should be:
Common across multiple engines
Important tuning parameters
Worth standardizing for tune integration
Example: `penalty` in `linear_reg()`:

- glmnet: `lambda`
- keras: `penalty`
- spark: `reg_param`

All are standardized as `penalty` in parsnip.
Using Engine Arguments for Flexibility
Engine arguments allow access to engine-specific features:
```r
boost_tree() |>
  set_engine("xgboost",
             tree_method = "hist",  # xgboost-specific
             gpu_id = 0)            # xgboost-specific
```

These bypass translation and go straight to the engine.
Integration with Tidymodels Ecosystem
Fitting Workflows
Direct fit:

```r
spec <- linear_reg() |> set_engine("lm")
fit <- fit(spec, mpg ~ ., data = mtcars)
```

With workflows:

```r
library(workflows)
wf <- workflow() |>
  add_model(spec) |>
  add_formula(mpg ~ .)
fit <- fit(wf, data = mtcars)
```

Prediction

Multiple types:

```r
predict(fit, new_data = mtcars, type = "numeric")
predict(fit, new_data = mtcars, type = "conf_int")
```

The available types depend on the mode and engine capabilities.
Tuning
With the tune package:

```r
spec <- boost_tree(trees = tune(), tree_depth = tune()) |>
  set_engine("xgboost")
# Tune with tune::tune_grid()
```

Main arguments can be marked for tuning using `tune()`.
Recipes and Workflows
parsnip integrates seamlessly with recipes:

```r
library(recipes)
recipe <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

workflow() |>
  add_recipe(recipe) |>
  add_model(spec) |>
  fit(data = mtcars)
```

The Fit → Model_fit → Predict Pipeline
1. Specification
User creates a model specification:

```r
spec <- boost_tree(trees = 100) |> set_engine("xgboost")
```

2. Translation

When `fit()` is called, arguments are translated:

- Main arguments are mapped to engine arguments via `set_model_arg()`
- Engine arguments are passed through unchanged
- The formula is converted to the engine's expected interface
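A simplified picture of the main-to-engine renaming step, in base R. The mapping table here is hypothetical; the real mappings are registered via `set_model_arg()`:

```r
# Illustrative linear_reg -> glmnet name mapping
arg_map <- c(penalty = "lambda", mixture = "alpha")

main_args <- list(penalty = 0.1, mixture = 1)

# Rename standardized arguments to the engine's native names
engine_args <- setNames(main_args, arg_map[names(main_args)])

names(engine_args)   # "lambda" "alpha"
engine_args$lambda   # 0.1
```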
3. Fitting
The engine’s fit function is called:
```r
# Behind the scenes:
xgboost::xgb.train(
  nrounds = 100,  # translated from trees
  ...
)
```

4. Wrapping

The result is wrapped in a `model_fit` object:

```r
class(fit)
#> [1] "model_fit"
```

This contains:

- The original spec
- The fitted model object
- Preprocessing information
5. Prediction
When `predict()` is called:

1. Extract the fitted model object
2. Call the engine-specific prediction function
3. Post-process to a standard format (a tibble with `.pred` columns)
4. Return consistently named output
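The post-processing step can be sketched with base R's `lm` standing in for an engine. Note that parsnip returns a tibble; a plain `data.frame` is used here only to keep the sketch dependency-free:

```r
engine_fit <- lm(mpg ~ wt, data = mtcars)

raw <- predict(engine_fit, newdata = mtcars)     # engine-native numeric vector
standardized <- data.frame(.pred = unname(raw))  # standardized column name

names(standardized)  # ".pred"
nrow(standardized)   # 32 (one row per observation in mtcars)
```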
This pipeline ensures a consistent interface while allowing engine flexibility.
Design Considerations
When Creating New Models
Consider:

1. Is this model type distinct from existing ones?
2. What main arguments are common across implementations?
3. What modes should it support?
4. What prediction types make sense?
Example: `survival_reg()` is distinct from `linear_reg()` because it has:

- A different outcome type (censored data)
- Different prediction types (time, survival, hazard)
- Different evaluation metrics
When Adding Engines
Consider:

1. Does the model type already exist?
2. What is the natural interface (formula, matrix, x/y)?
3. Which prediction types can this engine support?
4. What main arguments does it support?
5. What engine-specific arguments are valuable?
Model Naming
Follow parsnip conventions:
- Descriptive of the algorithm: `linear_reg()`, `rand_forest()`, `boost_tree()`
- Not package-specific: not `glmnet_model()` or `xgboost_model()`
- Singular function form: `nearest_neighbor()`, not `nearest_neighbors` or `knn`
Internal Architecture
Key Internal Functions
Constructor:

- `new_model_spec()` - Core constructor helper
- `make_classes()` - Creates the class hierarchy

Validation:

- `spec_is_possible()` - Checks whether a model/engine/mode combination could exist
- `spec_is_loaded()` - Checks whether it is actually registered
- `check_empty_ellipse()` - Validates that no extra arguments were passed

Environment Management:

- `get_model_env()` - Access the model registry
- `get_from_env()` - Retrieve registration data
- `set_in_env()` - Store registration data

Error Handling:

- `stop_incompatible_mode()` - Mode not supported
- `stop_incompatible_engine()` - Engine not available
- `stop_missing_engine()` - No engine specified
File Organization in parsnip Source
Model constructors: `R/[model_type].R`

- Examples: `R/linear_reg.R`, `R/boost_tree.R`
- Contain the user-facing functions

Engine registrations: `R/[model]_data.R`

- Examples: `R/linear_reg_data.R`, `R/boost_tree_data.R`
- Contain all `set_*()` calls for that model

Infrastructure: `R/aaa_models.R`, `R/misc.R`

- Model environment setup
- Core helper functions
Summary
parsnip's architecture separates:

- What (the model type) from how (the engine) from what kind of prediction (the mode)
- Interface (user-facing) from implementation (engine-specific)
- Specification (pre-fit) from fitted model (post-fit)
This design enables:

- A consistent interface across diverse engines
- Easy engine switching
- Integration with the tidymodels ecosystem (tune, workflows, recipes)
- Extension by third-party packages
Understanding this architecture is essential for both creating new models and adding engines to existing ones.