Model overview: wastewater-informed, site-level infection dynamics
This document details the generative model used by wwinference
to infer global and local infection dynamics from count data (e.g. cases or hospital admissions) and wastewater concentration data, and use that to produce nowcasts and forecasts. The wwinference
model assumes that wastewater concentration data is available for one or more “local” sites that represent a subset of the total population that produce the “global” epidemiological indicators (cases or admissions). In the future, we plan to provide the user with functionality for other types of data structures, e.g. multiple data streams of hospital admissions in addition to multiple wastewater concentration data streams, but for now this structure is the only option provided to the user. The model is structured into four generative components: infections, hierarchical subpopulation-level infection dynamics, hospital admissions, and viral genome concentrations in wastewater. The model imposes a hierarchical structure where the infection dynamics in the subpopulations represented by the wastewater concentration data are assumed to be localized outbreaks similar to one another and centered around the “global” infection dynamics that give rise to the hospital admissions. Note, we will describe the model in terms of the generation of hospital admissions, but the user can choose to replace this with any “count” dataset with a delay distribution from infection to the generation of that count data, e.g. cases would also work well here.
Model components
Our models are constructed from a set of generative components. These are:
- Infection component: A renewal model for the infection dynamics, which generates estimates of incident latent infections per capita.
- Hierarchical subpopulation-level infection dynamics: A model for the relation between the subpopulation infection dynamics and the “global” infection dynamics
- Hospital admissions component: A model for the expected number of hospital admissions given incident latent infections per capita.
- Viral genome concentration in wastewater: A model for the expected genome concentration given incident infections per capita.
See the notation section for an overview of the mathematical notation we use to describe the model components, including how probability distributions are parameterized.
Infection component
Renewal process for incident infections
This component assumes that latent (unobserved) expected incident infections per capita
Where
This process is initialized by estimating an initial exponential growth5 of infections for 50 days prior to the calibration start time
where
Instantaneous reproduction number
We decompose the instantaneous reproduction number
We assume that the unadjusted reproduction number
where
The damping term we use is based on Asher et al. 20187 but extended to be applicable to a renewal process. It assumes that the instantaneous reproduction number is damped by recent infections weighted by the generation interval. This is a simple way to account for the fact that the instantaneous reproduction number is likely to decrease when there are many infections in the population, due to factors such as immunity, behavioral changes, and public health interventions. The damping term is defined as:
where
Hierarchical subpopulation infection dynamics component
The structure of this model assumes that we have hospital admissions data coming from a larger “global” population (e.g. an entire state or county) and localized wastewater concentration measurements coming from subsets of the global population. We therefore divide the “global” population into subpopulations representing sampled wastewater sites’ catchment populations, with an additional subpopulation to represent individuals who do not contribute to the sampled wastewater.
We model infection dynamics in these subpopulations hierarchically: subpopulation infection dynamics are distributed about a central jurisdiction-level infection dynamic, and the total infections that generate the hospital admissions observations are simply the sum of the subpopulation-level infections.
Subpopulation definition
The total population consists of
Whenever the sum of the wastewater catchment population sizes
The total number of subpopulations is then
This amounts to modeling the wastewater catchments populations as approximately non-overlapping; every infected individual either does not contribute to measured wastewater or contributes principally to one wastewater catchment. This approximation is reasonable if we restrict our analyses to primary wastewaster treatment plants, which avoids the possibility that an individual might be sampled once in a sample taken upstream and then sampled again in a more aggregated sample taken further downstream.
When converting from predicted per capita incident hospital admissions
This amounts to making two key additional modeling assumptions: - Any individuals who contribute to wastewaster measurements but are not part of the total population are distributed among the catchment populations approximately proportional to catchment population size. - Whenever
The hierarchical subpopulation structure linking infection dynamics in each subpopulation to a central or “global” dynamic is implemented using a reference subpopulation. The reference subpopulation is by default the subpopulation not covered by wastewater, or in the case where the sum of the wastewater site catchment populations meet or exceed the total population (
Subpopulation-level infections
We couple the subpopulation and total population infection dynamics at the level of the un-damped instantaneous reproduction number in the reference subpopulation,
We model the subpopulations as having infection dynamics that are similar to one another but can differ from the reference subpopulation dynamic.
We represent this with a hierarchical model where we estimate the reference subpopulation’s un-damped effective reproductive number
The refrence value for the undamped instantaneous reproductive number
where
The time-varying subpopulation effect on
where
We chose a prior of
The subpopulation
\mathcal{R}_k(t) = \mathcal{R}^\mathrm{u}_k(t) \exp \left(-\gamma \sum_{\tau = 1}^{T_f} I_k(t-\tau) g(\tau) \right)
From
To obtain the number of infections per capita
I(t) = \frac{1}{\sum\nolimits_{k=1}^{K_\mathrm{total}} n_k} \sum_{k=1}^{K_\mathrm{total}} n_k I_k(t)
We infer the site level initial per capita incidence
Hospital admissions component
Following other semi-mechanistic renewal frameworks, we model the expected hospital admissions per capita
To account for day-of-week effects in hospital reporting, we use an estimated weekday effect
Where
We define the discrete hospital admissions delay distribution
Therefore, we model the proportion of infections that give rise to hospital admissions
The values
where
where
We chose a relatively strong prior of
In the version of the model where we do not fit the wastewater data (which we refer to as the “hospital admissiosn only” model), we model the IHR as constant. We assign this constant IHR the same prior distribution that we assign
Hospital admissions observation model
We model the observed hospital admission counts
where the “global” population size
Currently, we do not explicitly model the delay from hospital admission to reporting of hospital admissions. In reality, corrections (upwards or downwards) in the admissions data after the report date are possible and do happen. This is an active area of further development, but for now, we advise the user to manually exclude hospital admissions data points that appear implausible. Future work will include incorporation of a simple model for right-truncation when data is rolling in in real-time with incomplete reporting in recent days. However, the current workflow assumes mandatory and for the most part complete reporting of hospital admissions.
Viral genome concentration in wastewater component
We model site-specific viral genome concentrations in wastewater
where
This approach assumes that
We model the shedding kinetics
\log_{10}[s^\mathrm{cont}(\tau)] = \begin{cases}
V_\mathrm{peak} \frac{\tau}{\tau_\mathrm{peak}} & \tau \leq \tau_\mathrm{peak} \\
V_\mathrm{peak} \left( 1 - \frac{\tau - \tau_\mathrm{peak}}{\tau_\mathrm{shed} - \tau_\mathrm{peak}} \right) & \tau_\mathrm{peak} < \tau \leq \tau_\mathrm{shed} \\
0 & \tau > \tau_\mathrm{shed}
\end{cases}
where
Future iterations of this model will evaluate the utility of mechanistic modeling of wastewater collection and processing.
Viral genome concentration observation model
Genome concentration measurements can vary between sites, and even within a site through time, because of differences in sample collection and lab processing methods. To account for this variability, we add a scaling term
Both
such that
See Prior Distributions for priors on
In the rare cases when a site submits multiple concentrations for a single date and lab method, we treat each record as an independent observation.
Censoring of wastewater observations below the limit of detection
Lab processing methods have a finite limit of detection (LOD), such that not all wastewater measurements can be modeled using the log-normal approach above. This limit of detection varies across sites, between methods, and potentially also over time.
If an observed value
where
\int_{-\infty}^{\log [\mathrm{LOD}_{ijt}]} f_\mathrm{Normal}(x; \log[M_{ij} C_i(t)], \sigma_{cij}) \mathrm{d}x
(This is mathematically equivalent to integrating the probability density function of the log-normal distribution from zero to the LOD.)
Example model parameters for COVID-19
The default parameters provided by the wwinference
package are used to fit a model of COVID-19 hospital admissions and wastewater concentrations in terms of reported SARS-CoV-2 genome copies per mL. Below we will describe the priors and parameters provided. If fitting the model to a different epidemiological indicator (e.g. cases) or a different pathogen (e.g. flu) a number of these will have to be modified accordingly.
Prior distributions
We use informative priors for parameters that have been well characterized in the literature and weakly informative priors for parameters that have been less well characterized.
Parameter | Prior distribution | Source |
---|---|---|
Initial hospitalization probability | Perez-Guzman et al. 2023 10 | |
Time to peak fecal shedding | Russell et al. 2023 11, Huisman et al. 2022 12, Cavany et al. 2022 13 | |
Peak viral shedding |
Miura et al. 2021 14 | |
Duration of shedding | Cevik et al. 2021 15, Russell et al. 2023 16 | |
Total genomes shed per infected individual | Watson et al 202317 | |
Initial infections per capita |
where |
|
Initial exponential growth rate | Chosen to assume flat dynamics prior to observations | |
Infection feedback term | Weakly informative prior chosen to have a mode of 500 in natural scale, based on posterior estimates of peaks from prior seasons in a few jurisdictions | |
Day of the week effects | Weakly informative prior with a mode at even daily reporting (no effects) | |
Standard deviation of the log of the site-lab level multiplier |
$_m (0, 0.25) $ | Weakly informative prior chosen to allow average magnitude of concentrations to be either similar or different among individual sites, depending on data |
Modal site-level observation standard deviation | Weakly informative prior chosen to allow the mode to be either small or large | |
Standard deviation of the Normal distribution of individual log observation standard deviations |
Weakly informative prior which allows for individual s.d.s to be either clustered around the mode or more dispersed |
Scalar parameters
Parameter | Value | Source |
---|---|---|
Maximum generation interval |
|
|
Maximum infection to hospital admissions delay |
|
|
Wastewater produced per person-day |
|
Ortiz 202418 |
Distributional parameters
The discrete generation interval probability mass function
We derive the distribution
We model the incubation period with a discretized, modified Weibull distribution23 with probability mass function
\delta(\tau) = \begin{cases}
\delta^\mathrm{cont}(\tau) / \left( {\sum}_{\tau'=0}^{23} g^\mathrm{cont}(\delta') \right) & 0 \leq \tau \leq 23 \\
0 & \text{otherwise}
\end{cases}
We model the symptom onset to hospital admission delay distribution with a Negative Binomial distribution with probability mass function
The infection-to-hospitalization delay distribution
This resulting infection to hospital admission delay distribution has a mean of 12.2 days and a standard deviation of 5.67 days.
Implementation
Our framework is an extension of the widely used 25 26, semi-mechanistic renewal framework EpiNow2 2728, using a Bayesian latent variable approach implemented in the probabilistic programming language Stan 29 using 30 to interface with R.
We fit the model using the “No-U-Turn Sampler Markov chain Monte Carlo” method. This is a type of Hamiltonian Monte Carlo (HMC) algorithm and is the core fitting method used by cmdstan
. The dfault parameter settings are set to run 4 chains for 750 warm-up iterations and 500 sampling iterations, with a target average acceptance probability of 0.95 and a maximum tree depth of 12. The user can adjust these settings using the get_mcmc_options()
function.
Appendix: Wastewater data pre-processing
Viral genome concentration in wastewater outlier detection and removal
We identify potential outlier genome concentrations for each unique site and lab pair with an approach based on
Briefly, we compute
- For purposes of outlier detection, exclude wastewater observations below the LOD.
- For purposes of outlier detection, exclude observations more than 90 days before the forecast date.
- For each site
, compute the change per unit time between successive observations and : . - Compute
-scores for across all sites and timepoints . Flag values with -scores over 3 as outliers and remove them from model calibration. - Compute
-scores for the change per unit time values across all sites and pairs of timepoints. For values with -scores over 2, flag the corresponding wastewater concentrations as outliers and remove them from model calibration.
The
Appendix: Notation
The notation
We parameterize Normal distributions in terms of their mean and standard deviation:
We parameterize Beta distributions in terms of their two standard shape parameters
We parameterize Negative Binomial distributions in terms of their mean and their positive-constrained dispersion parameter (often denoted
We write
Observed data are labeled by data source: