Add Parsnip Model

Create entirely new model specifications for the parsnip package. This skill covers creating new model types (like linear_reg(), boost_tree()) with their constructors, registration, and engine implementations.

Use this skill when: Creating a fundamentally new model type for parsnip.

For adding engines to existing models: See add-parsnip-engine skill instead.

Prerequisites

Before creating a new parsnip model, ensure you have:

R Package Development:

Extension Prerequisites - Required for extension packages
Development Workflow - Fast iteration practices

Parsnip Architecture:

Model Specification System - Core parsnip concepts
Fit and Predict Methods - Implementation fundamentals

Creating a New Model

Assess Model Complexity First

Before diving into implementation, determine the complexity of your model:

Simple Model:

Single mode (regression OR classification, not both)
1-2 engines
Few main arguments (1-3)
Standard prediction types only

→ Follow streamlined approach: Focus on getting the basics right, avoid over-engineering

Complex Model:

Multiple modes (regression AND classification)
3+ engines
Many main arguments
Custom prediction types or special encoding

→ Reference detailed guides for multi-mode handling, encoding options, and advanced patterns

Target files regardless of complexity:

Extension development: 2-3 files (constructor, tests, optional README); acceptable to reach 4-6 if needed
Source development: 2-4 files (constructor, data file, tests, optional engine docs); acceptable to reach 5-8 if needed

1. Design the Model Specification

Start here: Model Constructor Design

Decide on:

Model name and function (e.g., sparse_reg())
Which modes to support (regression, classification, both?)
Main arguments (standardized across engines)
Default engine

This step defines the user-facing API.

2. Understand the Registration System

Review: Model Specification System

Learn how parsnip’s registration system works:

Model environment and storage
Engine registration database
Mode handling
Argument translation
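One way to get a feel for these pieces is to inspect what parsnip has already registered for an existing model. The sketch below assumes parsnip is installed and uses its exported helpers get_from_env() and show_model_info(); the "linear_reg" entries are only a convenient example, not part of your new model.

```r
library(parsnip)

# Engine/mode combinations registered for an existing model
engines <- get_from_env("linear_reg")
print(engines)

# Modes declared for the model
modes <- get_from_env("linear_reg_modes")
print(modes)

# Argument translation table: parsnip name -> engine-specific name
arg_map <- get_from_env("linear_reg_args")
print(arg_map[arg_map$engine == "glmnet", c("parsnip", "original")])

# Human-readable summary of everything registered for the model
show_model_info("linear_reg")
```

A new model ends up with the same set of entries once its registration calls have run, so this is also a quick way to check your own registration.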

3. Implement the Registration Sequence

Follow: Registration Sequence

Complete registration in the correct order:

1. set_new_model() - Declare model exists
2. set_model_mode() - Declare supported modes
3. set_model_engine() - Register each engine
4. set_dependency() - Package requirements
5. set_model_arg() - Argument translation
6. set_fit() - Fitting method
7. set_encoding() - Data conversion (if needed)
8. set_pred() - Each prediction type

4. Design Main Arguments

Plan carefully: Argument Design

Create standardized arguments that:

Work across multiple engines
Map to engine-specific parameters
Integrate with tune package
Follow tidymodels conventions
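To see what "map to engine-specific parameters" looks like in practice, it can help to translate an existing specification. This is a small sketch using parsnip's rand_forest() with the ranger engine as a stand-in; the argument values are arbitrary, and your own model would show the names you registered with set_model_arg().

```r
library(parsnip)

# Standardized main arguments on the user-facing side
spec <- rand_forest(trees = 500, min_n = 5) |>
  set_engine("ranger") |>
  set_mode("regression")

# translate() shows the engine call that parsnip would build
translated <- translate(spec)
print(translated)

# Names of the arguments in the translated fit call
fit_arg_names <- names(translated$method$fit$args)
print(fit_arg_names)
```

The printed call should use the engine's own argument names (for ranger, num.trees and min.node.size) rather than trees and min_n.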

5. Implement Fit and Predict

Core implementation: Fit and Predict Methods

For each engine:

Choose interface type (formula, data.frame, or matrix)
Implement data conversion
Register fit method
Register each prediction type with proper column naming

6. Handle Prediction Types

Standardize output: Prediction Types

Implement appropriate prediction types for each mode:

Regression: numeric, conf_int, pred_int
Classification: class, prob
Censored regression (survival): time, survival, hazard, linear_pred
Quantile regression: quantile
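The column naming that a new model should reproduce can be checked against an existing one. The sketch below uses linear_reg() with the lm engine on mtcars purely as a reference point; the formula and the rows chosen are arbitrary.

```r
library(parsnip)

lm_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars)

new_rows <- mtcars[1:3, ]

# type = "numeric" is the default for regression models
pred_numeric <- predict(lm_fit, new_rows)
pred_conf <- predict(lm_fit, new_rows, type = "conf_int")
pred_int <- predict(lm_fit, new_rows, type = "pred_int")

print(names(pred_numeric))
print(names(pred_conf))
print(names(pred_int))
```

Each result is a tibble with one row per row of new_data, and the interval types use .pred_lower and .pred_upper.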

7. Configure Mode Handling

If multi-mode: Mode Handling

For models supporting multiple modes:

Register each mode separately
Set mode-specific defaults
Implement mode-specific prediction types
Handle mode validation
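Registering each mode separately can be sketched as below. The model name "demo_multi_mode" is hypothetical and exists only for this illustration; a real multi-mode constructor would typically also default to mode = "unknown", as boost_tree() does.

```r
library(parsnip)

# "demo_multi_mode" is a made-up model name for illustration only
set_new_model("demo_multi_mode")

# One set_model_mode() call per supported mode
set_model_mode("demo_multi_mode", "regression")
set_model_mode("demo_multi_mode", "classification")

# Inspect which modes the model now declares
registered_modes <- get_from_env("demo_multi_mode_modes")
print(registered_modes)
```

Engines, fit methods, and prediction types are then registered per mode, so a two-mode engine needs two set_model_engine() and two set_fit() calls.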

8. Handle Encoding Options

For matrix/xy interfaces: Encoding Options

Configure how formulas are converted:

Choose interface type
Set indicator coding (none, traditional, or one-hot)
Handle intercept
Configure factor encoding
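These choices are recorded with set_encoding(). The following is a minimal sketch for a hypothetical model named "demo_encoded" with a glmnet engine; the option values shown are one plausible configuration for a matrix-based engine that adds its own intercept, not a recommendation for every engine.

```r
library(parsnip)

# "demo_encoded" is a made-up model name for illustration only
set_new_model("demo_encoded")
set_model_mode("demo_encoded", "regression")
set_model_engine("demo_encoded", mode = "regression", eng = "glmnet")

set_encoding(
  model = "demo_encoded",
  eng = "glmnet",
  mode = "regression",
  options = list(
    predictor_indicators = "traditional", # dummy variables for factors
    compute_intercept = TRUE,             # include intercept when building the model matrix
    remove_intercept = TRUE,              # drop it again before passing x to the engine
    allow_sparse_x = FALSE                # engine receives a dense matrix
  )
)

# Inspect the stored encoding options
encoding <- get_encoding("demo_encoded")
print(encoding)
```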

Testing Your Model

Essential tests:

Model constructor creates correct object
Setting engine works
Setting mode works (if multi-mode)
Formula (fit()) and xy (fit_xy()) interfaces give equivalent results
Each prediction type returns correct format
Factor handling works correctly
Error messages are clear

Test each engine separately:

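test_that("lm engine works", {
  # "lm_package" is a placeholder for the package that provides the engine
  skip_if_not_installed("lm_package")

  # Small illustrative data set with one predictor and one outcome
  train_data <- data.frame(
    x = c(1, 2, 3, 4, 5),
    y = c(2.1, 3.9, 6.2, 7.8, 10.1)
  )

  spec <- my_model() |> set_engine("lm")
  model_fit <- fit(spec, y ~ x, data = train_data)
  expect_s3_class(model_fit, "model_fit")

  preds <- predict(model_fit, train_data)
  expect_named(preds, ".pred")
})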
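The formula/xy equivalence check from the list above could look like the following. It is a sketch that uses linear_reg() with the lm engine as a stand-in for your model, so that it runs with only parsnip and testthat installed; swap in your own constructor and engine.

```r
library(testthat)
library(parsnip)

test_that("formula and xy interfaces give the same predictions", {
  spec <- linear_reg() |> set_engine("lm")
  predictors <- mtcars[, c("wt", "hp")]

  # Same model fitted through both interfaces
  fit_formula <- fit(spec, mpg ~ wt + hp, data = mtcars)
  fit_xy_res <- fit_xy(spec, x = predictors, y = mtcars$mpg)

  expect_equal(
    predict(fit_formula, mtcars[1:5, ]),
    predict(fit_xy_res, predictors[1:5, ])
  )
})
```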

Contributing to Parsnip Source

For PRs to tidymodels/parsnip:

Additional resources for source development:

Best Practices (Source) - Parsnip-specific patterns
Troubleshooting (Source) - Common issues
Testing Patterns (Source) - Comprehensive tests

Key differences from extensions:

Can use internal functions (:::)
Follow parsnip file organization (R/[model].R, R/[model]_data.R)
Add to parsnip documentation
More comprehensive testing required
Consider existing parsnip conventions

Example: Creating `sparse_reg()`

Hypothetical new model for sparse regression:

Note: This example shows extension development patterns. For source development, omit parsnip:: prefixes and use internal functions as shown in source-guide.md.

Constructor (R/sparse_reg.R):

sparse_reg <- function(mode = "regression",
                       penalty = NULL,
                       sparsity_threshold = NULL,
                       engine = "glmnet") {

  args <- list(
    penalty = rlang::enquo(penalty),
    sparsity_threshold = rlang::enquo(sparsity_threshold)
  )

  parsnip::new_model_spec(
    "sparse_reg",
    args = args,
    eng_args = NULL,
    mode = mode,
    user_specified_mode = !missing(mode),
    method = NULL,
    engine = engine,
    user_specified_engine = !missing(engine)
  )
}

Registration (in .onLoad() or package setup):

# Declare model
parsnip::set_new_model("sparse_reg")
parsnip::set_model_mode("sparse_reg", "regression")

# Register glmnet engine
parsnip::set_model_engine("sparse_reg", "regression", "glmnet")
parsnip::set_dependency("sparse_reg", "glmnet", "glmnet", "regression")

# Translate arguments
parsnip::set_model_arg(
  model = "sparse_reg",
  eng = "glmnet",
  parsnip = "penalty",
  original = "lambda",
  func = list(pkg = "dials", fun = "penalty"),
  has_submodel = TRUE
)

# Fit method
parsnip::set_fit(
  model = "sparse_reg",
  eng = "glmnet",
  mode = "regression",
  value = list(
    interface = "matrix",
    protect = c("x", "y"),
    func = c(pkg = "glmnet", fun = "glmnet"),
    defaults = list(family = "gaussian")
  )
)

# Predictions
parsnip::set_pred(
  model = "sparse_reg",
  eng = "glmnet",
  mode = "regression",
  type = "numeric",
  value = list(
    pre = NULL,
    post = NULL,
    func = c(fun = "predict"),
    args = list(
      object = rlang::expr(object$fit),
      newx = rlang::expr(as.matrix(new_data)),
      type = "response"
    )
  )
)
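For an extension package, one common arrangement is to wrap the registration calls in a helper and call it from .onLoad(). The sketch below assumes a hypothetical helper make_demo_sparse_reg() and a hypothetical model name "demo_sparse_reg"; the guard is there because, as far as I know, set_new_model() errors when the model is already registered (for example on a repeated devtools::load_all()).

```r
# R/zzz.R (sketch)

# Hypothetical helper that wraps the registration calls
make_demo_sparse_reg <- function() {
  # Skip registration if the model is already in parsnip's model environment
  if ("demo_sparse_reg" %in% parsnip::get_from_env("models")) {
    return(invisible(FALSE))
  }

  parsnip::set_new_model("demo_sparse_reg")
  parsnip::set_model_mode("demo_sparse_reg", "regression")
  # The set_model_engine(), set_dependency(), set_model_arg(),
  # set_fit(), and set_pred() calls would follow here.

  invisible(TRUE)
}

# Run the registration when the package namespace is loaded
.onLoad <- function(libname, pkgname) {
  make_demo_sparse_reg()
}
```

Keeping the registration in a named function also makes it easy to call directly in tests.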

Testing:

test_that("sparse_reg constructor works", {
  spec <- sparse_reg()
  expect_s3_class(spec, "sparse_reg")
  expect_equal(spec$mode, "regression")
  expect_equal(spec$engine, "glmnet")
})

test_that("sparse_reg fits and predicts", {
  skip_if_not_installed("glmnet")

  spec <- sparse_reg(penalty = 0.1) |> set_engine("glmnet")
  fit <- fit(spec, mpg ~ ., data = mtcars)

  expect_s3_class(fit, "model_fit")

  preds <- predict(fit, mtcars[1:5, ])
  expect_s3_class(preds, "tbl_df")
  expect_named(preds, ".pred")
  expect_equal(nrow(preds), 5)
})

Common Pitfalls

Prioritize correctness over structure. Focus on getting the implementation right before worrying about file organization.

Code correctness issues (fix first):

1. Incorrect column names - Follow .pred naming conventions strictly (.pred, .pred_class, .pred_lower)
2. Wrong interface type - Match engine’s expected input format (formula vs matrix vs xy)
3. Inconsistent argument naming - Use tidymodels standards (penalty, mixture), not engine names
4. Missing prediction post-processing - Ensure output format matches parsnip conventions

Implementation completeness (fix second):

5. Incomplete registration - Must complete full sequence for each engine (set_new_model → set_model_mode → set_model_engine → set_dependency → set_model_arg → set_fit → set_pred)
6. Missing mode registration - Must register modes explicitly with set_model_mode()
7. No argument translation - Main arguments must map to engine arguments via set_model_arg()
8. Insufficient testing - Test all modes, engines, prediction types, and both fit() and fit_xy()

Structural concerns (address last):

9. Too many files - Keep to 2-3 files for extensions, 2-4 for source (see File Discipline section)

When to Create a New Model

INSTRUCTIONS FOR CLAUDE:

Before implementing, verify this is truly a NEW model type. If the user requests:

A different computational engine for an existing model (e.g., “add xgboost to boost_tree”) → Stop. Politely explain this should use the add-parsnip-engine skill instead
A minor variation that could be an engine-specific argument → Stop. Suggest using engine-specific arguments rather than creating a new model
Something that duplicates an existing parsnip model → Stop. Point them to the existing model

Only proceed with implementation if it’s genuinely a new model type that doesn’t exist in parsnip.

Create a new model when:

The algorithm is fundamentally different from existing models
It serves a distinct use case
It has unique prediction types
It fills a gap in parsnip’s model coverage

Don’t create a new model when:

It’s just a different engine for an existing model type → Use add-parsnip-engine instead
It’s a minor variation of an existing model → Consider engine-specific arguments instead

Examples:

✓ survival_reg() - New outcome type (censored data)
✓ naive_bayes() - Distinct algorithm family
✗ Random forest with different package → Add engine to rand_forest()
✗ Linear regression with different penalty → Add engine to linear_reg()

File Discipline

Keep implementations focused and avoid creating unnecessary files.

Target file counts:

Extension development:

R/[model_name].R - Model constructor
tests/testthat/test-[model_name].R - Tests
README.md - Only if needed for package users
Total: 2-3 files (acceptable to reach 4-6 if implementation requires it)

Source development:

R/[model_name].R - Model constructor
R/[model_name]_data.R - Engine registrations
tests/testthat/test-[model_name].R - Tests
man/rmd/[model_name]_[engine].Rmd - Engine docs (optional)
Total: 2-4 files (acceptable to reach 5-8 if implementation requires it)

Do not create:

Implementation notes or summaries (IMPLEMENTATION_NOTES.md, IMPLEMENTATION_SUMMARY.md)
Usage example files (example_usage.R, examples.R)
Separate documentation files (DOCUMENTATION.md, USAGE.md)
Development guide files (DEVELOPMENT.md, GUIDE.md)
Testing guide files (TESTING.md, TEST_GUIDE.md)
Changelog files (CHANGELOG.md, NEWS.md unless source development)
Configuration files (CONFIG.md, SETUP.md)
Workflow files (WORKFLOW.md, PROCESS.md)
Debug or log files (DEBUG.md, LOG.md)
Status or progress files (STATUS.md, PROGRESS.md)
TODO or task files (TODO.md, TASKS.md)
Multiple README variants (README_DEV.md, README_TECHNICAL.md)

Instead:

Put usage examples in roxygen @examples tags
Put implementation notes in roxygen @details tags
Put development notes in comments within code
Document design decisions in commit messages
Use vignettes for comprehensive usage guides (if creating a package)

Next Steps

After creating your model:

Test thoroughly - Both extension and source tests
Document - Add examples and usage guidance
Share - Consider contributing to parsnip
Maintain - Keep up with engine package updates

For questions or contributions, see: