Return a reasonable value for the k argument of RtGam (the total smooth basis dimension of the model's one or more smooth predictors) based on the number of data points. The smooth basis dimension controls the maximum degrees of freedom (and by proxy the "wiggliness") of the smooth predictors. The estimation procedure leans toward giving the model more degrees of freedom than it needs. The consequence is slower model fits, but a better chance of avoiding non-convergence caused by too little flexibility. If manually supplying a value to k rather than relying on the default estimate, see When to use a different value for RtGam-specific implementation guidance and mgcv::choose.k for more general debugging guidance from the underlying model fitting package. Note that k must be at least 2 and at most the number of data points.

Usage

smooth_dim_heuristic(n)

Arguments

n

An integer, the dimension of the data (the number of data points).

Value

An integer, the proposed total smooth basis dimensionality available to the RtGam model.

How k is used

The model is composed of one or more smooth predictors, depending on the specifics of the model specification. In a simple model with only one smooth predictor, all the degrees of freedom from k would be applied to that single smooth. In a more complex model composed of multiple smooth predictors, the total degrees of freedom made available by k would be partitioned between the different smooths.

When to use a different value

Model non-convergence

When an RtGam model does not converge, a reasonable first debugging step is to increase the value of k and refit the model. Commonly, GAMs exhibit diagnostic issues when the model does not have enough flexibility to represent the underlying data generating process. Increasing k above the default estimate provides more flexibility.

However, insufficient flexibility is not the only source of non-convergence. When increasing k does not improve the default model diagnostics, manual model checking via mgcv::gam.check() may be needed. Also see mgcv::choose.k for guidance.
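
As a rough illustration of the manual check, the sketch below fits a plain mgcv GAM to simulated counts and inspects the basis dimension with mgcv::k.check(), the tabular check that mgcv::gam.check() also prints. The simulated data, the Poisson family, and the single smooth are assumptions made for illustration only; how to reach the underlying mgcv fit from an RtGam object is not covered here.

```r
library(mgcv)

# Simulated daily counts with a slowly varying trend (illustrative data only)
set.seed(1)
n <- 100
day <- seq_len(n)
cases <- rpois(n, lambda = exp(2 + sin(day / 8)))

# Basis dimension following the documented rule for n > 10
k <- ceiling(sqrt(10 * n))

# A single-smooth GAM; this model form is an assumption, not RtGam's own
fit <- gam(cases ~ s(day, k = k), family = poisson(), method = "REML")

# One row per smooth: k', edf, k-index, and p-value
check <- k.check(fit)
print(check)
```

In this table, an effective degrees of freedom (edf) close to k' together with a low p-value is the pattern that mgcv::choose.k describes as a sign that k may be too small.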

Slow model fits

RtGam models usually fit faster when the model has less flexibility (lower values of k). The guess from smooth_dim_heuristic() leans toward providing excess degrees of freedom, so model fits may take a little longer than needed. If models are taking a long time to converge, it would be reasonable to set k to a small value, check for convergence, and increase k as needed until the model converges. This approach may or may not be faster than simply waiting for a model with a higher k to fit.
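
The start-small-and-increase strategy could be sketched as a short loop. Everything named here is hypothetical: fit_with_growing_k() is not part of RtGam, fit_fn stands in for whatever call fits the model at a given k, and the doubling schedule and the converged element are assumptions chosen for illustration.

```r
# Hypothetical helper: refit with a growing k until the fit reports convergence
# or k reaches its upper bound (the number of data points, n).
fit_with_growing_k <- function(fit_fn, n, k_start = 2L) {
  k <- max(2L, as.integer(k_start))  # k may not be smaller than 2
  repeat {
    fit <- fit_fn(k)
    if (isTRUE(fit$converged) || k >= n) {
      return(list(k = k, fit = fit))
    }
    k <- min(n, 2L * k)  # double k, capped at n
  }
}

# Stand-in for a real model fit: pretends to converge once k reaches 12
toy_fit <- function(k) list(converged = k >= 12)

result <- fit_with_growing_k(toy_fit, n = 50L)
result$k
```

With the toy fit above, the loop tries k = 2, 4, 8, and 16 and stops at 16. In practice each refit has its own cost, which is why this approach is not guaranteed to beat a single fit at a higher k.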

Very wiggly data

If running models in a setting where the data seem quite wiggly, exhibiting sharp jumps or drops, a model with more flexibility than normal may be needed. k should be increased to the maximum possible value. When running pre-set models in production, it would also be reasonable to fix the value of k above the default. Because GAMs penalize model wiggliness, the fit to both wiggly and non-wiggly data is likely to be satisfactory, at the cost of increased runtime.

Implementation details

The algorithm to pick k is a piecewise function. When \(n \le 10\), the returned value is \(n\). When \(n > 10\), the returned value is \(\lceil \sqrt{10n} \rceil\). This approach is loosely inspired by Ward et al. (2021). As in Ward et al., the degrees of freedom of the spline (1) are set to a reasonably high value to avoid oversmoothing and (2) scale with the dimension of the data to accommodate changing trends over time.
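
The piecewise rule can be written out in a few lines of base R. This is an illustrative re-implementation of the formula as documented, not the package source; the name sketch_smooth_dim() is hypothetical, and the sketch does not enforce the lower bound of 2 on k.

```r
# Illustrative re-implementation of the documented rule; not the package source
sketch_smooth_dim <- function(n) {
  stopifnot(is.numeric(n), length(n) == 1, n >= 1)
  if (n <= 10) {
    as.integer(n)                       # linear regime: k = n
  } else {
    as.integer(ceiling(sqrt(10 * n)))   # square-root regime: k = ceiling(sqrt(10n))
  }
}

sapply(c(5, 10, 11, 40, 1000), sketch_smooth_dim)
# 5 10 11 20 100
```

The two branches meet at n = 10, where both n and the square root of 10n equal 10, so the proposed k does not jump at the boundary.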

smooth_dim_heuristic() uses a piecewise function because each smooth parameter needs its own degrees of freedom, which adds a fixed initial setup cost. When the dimension of the data is small, the default value of k increases linearly with the data to accommodate this fixed setup cost. When the dimension of the data is larger, the default value of k increases with the square root of the data to balance having sufficient basis dimension to fit to changing trends over time without having so many dimensions that model fits are very slow.

References

Ward, Thomas, et al. "Growth, reproduction numbers and factors affecting the spread of SARS-CoV-2 novel variants of concern in the UK from October 2020 to July 2021: a modelling analysis." BMJ Open 11.11 (2021): e056636.

See also

RtGam() for the use-case and additional documentation as well as mgcv::choose.k and mgcv::gam.check for more general guidance from mgcv.

Examples

# Propose a default k for a series of 10 observations
cases <- 1:10
k <- smooth_dim_heuristic(length(cases))