Data-Dependent Parameters

Creating parameters with unknown bounds resolved by training data

This guide covers how to create parameters whose ranges depend on dataset characteristics using unknown() and the finalization system.

Understanding unknown()

The Placeholder

unknown() is a special placeholder for parameter bounds that cannot be determined until you see the data:

# Extension pattern
mtry <- function(range = c(1L, dials::unknown()), trans = NULL) {
  dials::new_quant_param(
    type = "integer",
    range = range,  # Upper bound is unknown
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(mtry = "# Randomly Selected Predictors"),
    finalize = dials::get_p
  )
}

mtry()
#> # Randomly Selected Predictors (quantitative)
#> Range: [1, ?]
# The question mark marks the unknown upper bound

Why unknown() Exists

Some parameters depend on dataset properties:

Number of predictors: Can’t sample more features than exist
Number of observations: Sample size must be ≤ dataset size
Number of columns: PCA components ≤ number of variables

Without seeing the data, we can’t set sensible upper bounds.
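
These dataset properties are easy to read off once the data is in hand. A minimal base R sketch, using the built-in mtcars data with mpg treated as the outcome (an assumption made only for illustration):

```r
# Treat mpg (column 1) as the outcome; every other column is a predictor.
predictors <- mtcars[, -1]

n_predictors <- ncol(predictors)    # natural ceiling for feature-count parameters
n_observations <- nrow(predictors)  # natural ceiling for sample-size parameters

n_predictors
#> [1] 10
n_observations
#> [1] 32
```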

Where unknown() Appears

Typically in the upper bound of the range:

range = c(1L, unknown())        # Most common: lower fixed, upper unknown
range = c(unknown(), 100L)      # Rare: upper fixed, lower unknown
range = c(unknown(), unknown()) # Very rare: both bounds unknown

When Parameters Need Data-Dependent Ranges

Use unknown() and finalization when:

Number of Predictors Determines Upper Bound

Examples:

mtry(): Randomly selected predictors in random forests
num_comp(): Number of PCA components
max_features(): Maximum features to select
num_terms(): Number of model terms based on predictors

Reason: Can’t select more features than exist in the dataset

# Can't set upper bound without knowing # predictors
mtry <- function(range = c(1L, dials::unknown()), trans = NULL) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(mtry = "# Randomly Selected Predictors"),
    finalize = dials::get_p  # Will set upper = ncol(x)
  )
}

Number of Observations Affects Range

Examples:

sample_size(): Rows to sample
min_n(): Minimum observations in node
bootstrap_sample(): Bootstrap sample size

Reason: Sample size must be ≤ number of rows

sample_size <- function(range = c(dials::unknown(), dials::unknown())) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = NULL,
    label = c(sample_size = "# Observations Sampled"),
    finalize = dials::get_n_frac_range  # Sets both bounds from nrow(x)
  )
}

Complex Data-Dependent Logic

Examples:

num_initial_terms(): MARS terms based on earth package formula
Custom bounds based on multiple data properties
Heuristic-based range adjustment

Reason: Upper bound follows package-specific or domain-specific rules

num_initial_terms <- function(range = c(1L, dials::unknown())) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = NULL,
    label = c(num_initial_terms = "# Initial MARS Terms"),
    finalize = get_initial_mars_terms  # Custom logic
  )
}

The Finalization System

Overview

Finalization resolves unknown() bounds using training data:

Parameter with unknown() → finalize(param, data) → Parameter with known bounds

The finalize Argument

When creating a parameter, provide a finalize function:

my_param <- function(range = c(1L, dials::unknown()), trans = NULL) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(my_param = "My Parameter"),
    finalize = get_my_bound  # Finalize function
  )
}

Finalize Function Signature

finalize_function <- function(object, x) {
  # object: Parameter object with unknown() bounds
  # x: Predictor data (matrix, data frame, or tibble)
  #
  # Returns: Parameter object with resolved bounds
}

Key points:

Takes parameter object and predictor data
Examines data properties (ncol, nrow, etc.)
Updates the parameter’s range
Returns modified parameter object
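
The contract described above can be sketched without dials at all. In the example below, toy_param is a plain list standing in for a parameter object and NA stands in for unknown(); both are illustrative assumptions, not real dials structures. Only the function signature (object, x) mirrors a real finalize function.

```r
# Toy stand-in for a parameter object: a plain list, NOT a real dials object.
toy_param <- list(
  label = "# Randomly Selected Predictors",
  range = list(lower = 1L, upper = NA_integer_)  # NA plays the role of unknown()
)

# Same shape as a finalize function: parameter object and data in,
# updated parameter object out.
toy_finalize <- function(object, x) {
  if (is.null(dim(x))) {
    stop("`x` must be a 2D object such as a matrix or data frame.")
  }
  # Leave an already-known upper bound untouched.
  if (!is.na(object$range$upper)) {
    return(object)
  }
  object$range$upper <- ncol(x)
  object
}

finalized <- toy_finalize(toy_param, mtcars[, -1])
finalized$range$upper
#> [1] 10
```

Calling toy_finalize() a second time on the result returns it unchanged, because the upper bound is no longer missing.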

Built-in Finalize Functions

dials provides several built-in finalize functions for common cases.
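
As a rough guide to what these helpers compute, the arithmetic can be reproduced in base R. This is a sketch of the calculations only, not the dials source. It assumes the defaults given in the dials documentation for finalize functions (frac = 1/3 for get_n_frac(), frac = c(1/10, 5/10) for get_n_frac_range(), and a log10 scale for get_log_p()); check them against your installed version.

```r
x <- mtcars[, -1]  # 32 rows, 10 predictor columns

upper_p <- ncol(x)                                  # get_p(): number of columns
upper_n <- nrow(x)                                  # get_n(): number of rows
upper_n_frac <- floor(nrow(x) * 1 / 3)              # get_n_frac(): assumed frac = 1/3
range_n_frac <- floor(nrow(x) * c(1 / 10, 5 / 10))  # get_n_frac_range(): assumed fractions
upper_log_p <- log10(ncol(x))                       # get_log_p(): assumed log10 scale

upper_n_frac
#> [1] 10
range_n_frac
#> [1]  3 16
```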

get_p()

Sets upper bound to number of predictors (columns)

# Extension pattern
mtry <- function(range = c(1L, dials::unknown()), trans = NULL) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(mtry = "# Randomly Selected Predictors"),
    finalize = dials::get_p
  )
}

# Usage
param <- mtry()
param
#> Range: [1, ?]

finalized <- dials::finalize(param, mtcars[, -1])  # 10 predictors
finalized
#> Range: [1, 10]

Use for: Parameters bounded by number of features/predictors

get_n()

Sets upper bound to number of observations (rows)

# Extension pattern
max_samples <- function(range = c(1L, dials::unknown()), trans = NULL) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(max_samples = "# Maximum Samples"),
    finalize = dials::get_n
  )
}

# Usage
param <- max_samples()
finalized <- dials::finalize(param, mtcars[, -1])  # 32 rows
finalized
#> Range: [1, 32]

Use for: Parameters bounded by number of observations

get_n_frac() and get_n_frac_range()

get_n_frac() sets the upper bound to a fraction of the number of observations (frac = 1/3 by default). get_n_frac_range() sets both bounds to fractions of the number of observations (frac = c(1/10, 5/10) by default), so use it when both bounds are unknown().

# Extension pattern
sample_prop <- function(range = c(dials::unknown(), dials::unknown())) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = NULL,
    label = c(sample_prop = "# Sampled Observations"),
    finalize = dials::get_n_frac_range
  )
}

# Usage
param <- sample_prop()
finalized <- dials::finalize(param, mtcars[, -1])  # 32 rows
finalized
#> Range: [3, 16]  # floor(c(0.1, 0.5) * 32)

Use for: Sample sizes as proportion of dataset

get_log_p()

Sets upper bound to the log10 of the number of predictors

# Extension pattern
sparse_features <- function(range = c(1L, dials::unknown()), trans = NULL) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(sparse_features = "# Sparse Features"),
    finalize = dials::get_log_p
  )
}

# Usage
param <- sparse_features()
large_data <- matrix(rnorm(100 * 100), ncol = 100)
finalized <- dials::finalize(param, large_data)
finalized
#> Range: [1, 2]  # log10(100) = 2

Use for: Parameters that scale logarithmically with predictors

Creating Custom Finalize Functions

For complex logic, create custom finalize functions using range_get() and range_set().

Pattern

# Extension pattern
custom_finalize <- function(object, x) {
  # 1. Calculate new bound(s) based on data
  new_upper <- calculate_upper_bound(x)

  # 2. Get current range
  bounds <- dials::range_get(object)

  # 3. Update bound(s)
  bounds$upper <- new_upper

  # 4. Set new range and return
  dials::range_set(object, bounds)
}

range_get()

Extract current range from parameter:

param <- mtry()
bounds <- dials::range_get(param)
bounds
#> $lower
#> [1] 1
#>
#> $upper
#> unknown()

Returns a list with $lower and $upper.

range_set()

Set new range on parameter:

new_bounds <- list(lower = 1L, upper = 10L)  # integer bounds for an integer parameter
updated_param <- dials::range_set(param, new_bounds)
updated_param
#> Range: [1, 10]

Takes parameter and list with $lower and $upper, returns updated parameter.

Complete Examples

Example 1: Using Built-in get_p()

Number of PCA components:

# Extension pattern
num_comp <- function(range = c(1L, dials::unknown()), trans = NULL) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(num_comp = "# Principal Components"),
    finalize = dials::get_p
  )
}

# Usage
num_comp()
#> # Principal Components (quantitative)
#> Range: [1, ?]

# Finalize with data
param <- num_comp()
finalized <- dials::finalize(param, mtcars[, -1])
finalized
#> # Principal Components (quantitative)
#> Range: [1, 10]

# Now can generate grid
grid <- dials::grid_regular(finalized, levels = 5)
grid
#> # A tibble: 5 × 1
#>   num_comp
#>      <int>
#> 1        1
#> 2        3
#> 3        5
#> 4        7
#> 5       10

Example 2: Using Built-in get_n()

Maximum observations to use:

# Extension pattern
max_obs <- function(range = c(10L, dials::unknown()), trans = NULL) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(max_obs = "# Maximum Observations"),
    finalize = dials::get_n
  )
}

# Usage
param <- max_obs()
finalized <- dials::finalize(param, mtcars[, -1])  # 32 rows
finalized
#> Range: [10, 32]

Example 3: Custom Finalize with Simple Logic

Maximum features to select (80% of predictors):

# Extension pattern
max_features <- function(range = c(1L, dials::unknown()), trans = NULL) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(max_features = "# Maximum Features"),
    finalize = get_max_features
  )
}

get_max_features <- function(object, x) {
  # Set upper bound to 80% of predictors
  upper_bound <- floor(0.8 * ncol(x))
  upper_bound <- max(1L, upper_bound)  # At least 1
  upper_bound <- as.integer(upper_bound)

  # Update range
  bounds <- dials::range_get(object)
  bounds$upper <- upper_bound
  dials::range_set(object, bounds)
}

# Usage
param <- max_features()
finalized <- dials::finalize(param, mtcars[, -1])  # 10 predictors
finalized
#> Range: [1, 8]  # floor(0.8 * 10) = 8
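
To see how the 80% rule behaves on narrow and wide datasets, the bound can be tabulated in base R. The helper name upper_for() is hypothetical and exists only for this sketch; it repeats the arithmetic from get_max_features() without touching a parameter object.

```r
# Upper bound under the 80% rule, never below 1
# (same arithmetic as get_max_features()).
upper_for <- function(p) {
  as.integer(max(1L, floor(0.8 * p)))
}

p_values <- c(1L, 2L, 10L, 100L)
upper_bounds <- vapply(p_values, upper_for, integer(1))
upper_bounds
#> [1]  1  1  8 80
```

With a single predictor, floor(0.8 * 1) is 0, so the max(1L, ...) guard is what keeps the range valid.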

Example 4: Custom Finalize with Complex Logic

MARS initial terms (earth package formula):

# Extension pattern
num_initial_terms <- function(range = c(1L, dials::unknown()), trans = NULL) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(num_initial_terms = "# Initial MARS Terms"),
    finalize = get_initial_mars_terms
  )
}

get_initial_mars_terms <- function(object, x) {
  # Earth package formula: min(200, max(20, 2 * ncol(x))) + 1
  p <- ncol(x)
  upper_bound <- min(200, max(20, 2 * p)) + 1
  upper_bound <- as.integer(upper_bound)

  # Update range
  bounds <- dials::range_get(object)
  bounds$upper <- upper_bound
  dials::range_set(object, bounds)
}

# Usage
param <- num_initial_terms()

# Small dataset (10 predictors)
finalized_small <- dials::finalize(param, mtcars[, -1])
finalized_small
#> Range: [1, 21]  # min(200, max(20, 2*10)) + 1 = 21

# Large dataset (100 predictors)
large_data <- matrix(rnorm(100 * 100), ncol = 100)
finalized_large <- dials::finalize(param, large_data)
finalized_large
#> Range: [1, 201]  # min(200, max(20, 2*100)) + 1 = 201
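
The formula has a floor and a ceiling, which is easiest to see by evaluating it over a range of predictor counts. The sketch below is base R only; mars_upper() is a hypothetical name that repeats the arithmetic from get_initial_mars_terms().

```r
# min(200, max(20, 2 * p)) + 1, as used in get_initial_mars_terms()
mars_upper <- function(p) {
  as.integer(min(200, max(20, 2 * p)) + 1)
}

p_values <- c(5L, 10L, 50L, 100L, 500L)
mars_bounds <- vapply(p_values, mars_upper, integer(1))
mars_bounds
#> [1]  21  21 101 201 201
```

Any dataset with 10 or fewer predictors gets an upper bound of 21, and any dataset with 100 or more gets 201.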

Example 5: Updating Both Bounds

Neighbor range based on data size:

# Extension pattern
neighbors_adaptive <- function(range = c(dials::unknown(), dials::unknown())) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = NULL,
    label = c(neighbors_adaptive = "# Adaptive Neighbors"),
    finalize = get_adaptive_neighbors
  )
}

get_adaptive_neighbors <- function(object, x) {
  n <- nrow(x)

  # Set lower bound: at least 3, or 1% of data
  lower_bound <- max(3L, floor(0.01 * n))

  # Set upper bound: at most 50, or 10% of data
  upper_bound <- min(50L, floor(0.10 * n))

  # Ensure lower < upper
  if (lower_bound >= upper_bound) {
    lower_bound <- max(1L, upper_bound - 1L)
  }

  # Update range
  bounds <- list(
    lower = as.integer(lower_bound),
    upper = as.integer(upper_bound)
  )
  dials::range_set(object, bounds)
}

# Usage
param <- neighbors_adaptive()

# Small dataset (32 rows)
finalized_small <- dials::finalize(param, mtcars[, -1])
finalized_small
#> Range: [2, 3]  # lower 3 >= upper 3, so the guard resets lower to upper - 1 = 2

# Large dataset (1000 rows)
large_data <- matrix(rnorm(1000 * 10), ncol = 10)
finalized_large <- dials::finalize(param, large_data)
finalized_large
#> Range: [10, 50]  # floor(0.01*1000)=10 to min(50, floor(0.10*1000))=50
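
The guard in get_adaptive_neighbors() is worth checking at several data sizes. The base R sketch below repeats its arithmetic in a hypothetical helper, adaptive_bounds(), that takes the row count directly.

```r
# Same bound logic as get_adaptive_neighbors(), taking n = nrow(x) directly.
adaptive_bounds <- function(n) {
  lower <- max(3L, floor(0.01 * n))
  upper <- min(50L, floor(0.10 * n))
  if (lower >= upper) {
    lower <- max(1L, upper - 1L)
  }
  c(lower = as.integer(lower), upper = as.integer(upper))
}

adaptive_bounds(32)     # lower = 2,  upper = 3
adaptive_bounds(1000)   # lower = 10, upper = 50
adaptive_bounds(10000)  # lower = 49, upper = 50 (1% of n exceeds the cap of 50)
adaptive_bounds(5)      # lower = 1,  upper = 0 (invalid: upper falls below lower)
```

The last call shows that the guard alone does not guarantee a valid range: with fewer than 10 rows, floor(0.10 * n) is 0. A production version would likely need a minimum on the upper bound as well.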

How Finalization Works in tune Workflows

Manual Finalization

Explicitly finalize before tuning:

# Define parameter with unknown bound
mtry_param <- mtry()

# Finalize with training data
mtry_finalized <- dials::finalize(mtry_param, train_x)

# Use in grid
grid <- dials::grid_regular(mtry_finalized, levels = 5)

# Tune (wf is a workflow that bundles the model spec and preprocessor)
tune::tune_grid(wf, resamples = resamples, grid = grid)

Automatic Finalization in tune

The tune package automatically finalizes parameters during tuning:

# Model with tunable parameter
rf_spec <- parsnip::rand_forest(mtry = tune::tune()) |>
  parsnip::set_engine("ranger") |>
  parsnip::set_mode("regression")

# Create workflow
wf <- workflows::workflow() |>
  workflows::add_model(rf_spec) |>
  workflows::add_formula(mpg ~ .)

# tune_grid automatically finalizes mtry using training data
results <- tune::tune_grid(
  wf,
  resamples = rsample::vfold_cv(mtcars),
  grid = 10  # Grid will use finalized mtry
)

Workflow with Finalization

# Extract parameter set
params <- workflows::extract_parameter_set_dials(wf)
params
#> Collection of 1 parameters for tuning
#>    id    parameter type object class
#>  mtry         mtry nparam[?]
#> Model parameters needing finalization: mtry

# Finalize parameters
params_finalized <- dials::finalize(params, mtcars[, -1])
params_finalized
#> Collection of 1 parameters for tuning
#>    id    parameter type object class
#>  mtry         mtry nparam[+]

# Generate grid with finalized parameters
grid <- dials::grid_regular(params_finalized, levels = 5)

Extension vs Source Patterns

Extension Development

Use dials:: prefix throughout:

# Parameter definition
mtry <- function(range = c(1L, dials::unknown()), trans = NULL) {
  dials::new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(mtry = "# Randomly Selected Predictors"),
    finalize = dials::get_p  # Built-in finalize
  )
}

# Custom finalize function
custom_finalize <- function(object, x) {
  upper_bound <- calculate_bound(x)
  bounds <- dials::range_get(object)
  bounds$upper <- upper_bound
  dials::range_set(object, bounds)
}

# Usage
dials::finalize(param, data)

Source Development

No dials:: prefix needed:

# Parameter definition
mtry <- function(range = c(1L, unknown()), trans = NULL) {
  new_quant_param(
    type = "integer",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(mtry = "# Randomly Selected Predictors"),
    finalize = get_p  # Built-in finalize
  )
}

# Custom finalize function
custom_finalize <- function(object, x) {
  upper_bound <- calculate_bound(x)
  bounds <- range_get(object)
  bounds$upper <- upper_bound
  range_set(object, bounds)
}

# Usage
finalize(param, data)

Testing Data-Dependent Parameters

Essential Tests

Parameter with unknown bound: Object created correctly
Finalization works: finalize() resolves unknown bounds
Bounds are sensible: Finalized range makes sense for data
Grid generation: Finalized parameter works with grid functions
Edge cases: Small datasets, single column, etc.

Example Test Suite

# tests/testthat/test-my-data-dependent-param.R

test_that("my_param creates parameter with unknown bound", {
  param <- my_param()

  expect_s3_class(param, "quant_param")
  expect_equal(param$range$lower, 1L)
  expect_true(dials::is_unknown(param$range$upper))
})

test_that("my_param finalizes with data", {
  param <- my_param()
  finalized <- dials::finalize(param, mtcars[, -1])

  expect_s3_class(finalized, "quant_param")
  expect_type(finalized$range$upper, "integer")
  expect_false(dials::is_unknown(finalized$range$upper))
})

test_that("my_param finalized range is sensible", {
  param <- my_param()
  finalized <- dials::finalize(param, mtcars[, -1])

  # Upper bound should be number of predictors
  expect_equal(finalized$range$upper, ncol(mtcars) - 1)
  expect_true(finalized$range$upper >= finalized$range$lower)
})

test_that("finalized my_param works with grid functions", {
  param <- my_param()
  finalized <- dials::finalize(param, mtcars[, -1])

  grid <- dials::grid_regular(finalized, levels = 5)
  expect_equal(nrow(grid), 5)
  expect_true(all(grid$my_param >= finalized$range$lower))
  expect_true(all(grid$my_param <= finalized$range$upper))
})

test_that("my_param handles small datasets", {
  small_data <- mtcars[1:5, 1:3]  # 5 rows, 3 columns (1 outcome + 2 predictors)

  param <- my_param()
  finalized <- dials::finalize(param, small_data[, -1])

  expect_equal(finalized$range$upper, 2L)
  expect_true(finalized$range$lower <= finalized$range$upper)
})

test_that("my_param handles single column", {
  single_col <- data.frame(x = rnorm(10))

  param <- my_param()
  finalized <- dials::finalize(param, single_col)

  expect_equal(finalized$range$upper, 1L)
})

Best Practices

Use built-in finalize functions when possible: get_p(), get_n() cover most cases
Document finalization logic: Explain in @details how bounds are determined
Handle edge cases: Single column, single row, empty data
Ensure lower < upper: Always check bounds are valid after finalization
Use integer types appropriately: Cast to integer with as.integer() for integer parameters
Test with various data sizes: Small, medium, and large datasets
Consider reasonable bounds: Avoid extreme values that don’t make sense
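
Several of these practices can be combined in one small check at the end of a custom finalize function. The helper below is hypothetical (it is not part of dials) and is only a sketch of the idea: reject missing or inverted bounds, then coerce to integer.

```r
# Hypothetical helper (not part of dials): validate and coerce finalized bounds.
check_integer_bounds <- function(lower, upper) {
  if (!is.numeric(lower) || !is.numeric(upper) || anyNA(c(lower, upper))) {
    stop("Bounds must be non-missing numbers.")
  }
  if (lower > upper) {
    stop("Lower bound (", lower, ") exceeds upper bound (", upper, ").")
  }
  list(lower = as.integer(lower), upper = as.integer(upper))
}

bounds <- check_integer_bounds(1, floor(0.8 * 10))
str(bounds)
#> List of 2
#>  $ lower: int 1
#>  $ upper: int 8
```

This version allows lower == upper, which occurs with a single predictor. Change the comparison to >= if a strictly wider range is required.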

Next Steps

Learn More

Quantitative parameters: Quantitative Parameters Guide
Grid integration: Grid Integration Guide
Parameter system: Parameter System Overview

Implementation Guides

Extension development: Extension Development Guide
Source development: Source Development Guide

Last Updated: 2026-03-31