## Options
options(tidyverse.quiet = TRUE)
options(dplyr.summarise.inform = FALSE)
options(knitr.kable.NA = '')
## Packages
# Basic
library(knitr)
library(quarto)
library(data.table)
# Nonlinear models
library(gslnls)
# Tidyverse
library(broom)
library(patchwork)
library(tidyverse) # should go last to avoid conflicts (particularly with data.table)
Tutorial: Modeling Microbial Inactivation Kinetics with Disinfectant Demand in R
Version 1.9
Description
This tutorial provides instructions for conducting microbial disinfection kinetics analyses in R, including detailed derivations of the kinetic models used. The methods described in this tutorial were implemented to analyze chlorine disinfection kinetics of Elizabethkingia spp., as described in the manuscript:
Holcomb DA, Riner D, Cowan B, Salah Z, Jennings WC, Mattioli MC, Murphy JL. (2024). Chlorine inactivation of Elizabethkingia spp. in water. Emerging Infectious Diseases, 30 (10). https://doi.org/10.3201/eid3010.240917.
A companion R script, kinetics_functions.R, collects all the functions developed in this tutorial. It can be sourced into an R session to make the functions available in the global environment for conducting kinetic analyses. Two accompanying datasets, ek_survival.csv and chlorine_dose.csv, are also provided to replicate the example analysis from this tutorial. All files are available in the manuscript repository.
Background
Our goal is to estimate CT values: the product of disinfectant concentration and the minimum contact time, in \(mg \cdot min \cdot L^{-1}\), needed to achieve a desired log-reduction in the quantity of a particular microbe at a given initial disinfectant concentration. CT values are not measured directly but rather are inferred from a modeled decay rate estimated from experimental data. Different assumptions about the functional form of the disinfection kinetic relationship result in competing decay rate models that may yield differing CT value estimates. CT values are also temperature- and pH-dependent, but we will not account for those factors in the models; instead we assume that everything is conducted at constant temperature and pH (which is typically the case for the sorts of experimental data these models are fit to).
Because CT values assume a constant disinfectant dose, we do want to account for disinfectant demand in the experimental systems, which reduces the disinfectant concentration over time, when estimating the inactivation rate constants. The standard practice in the field is simply to assume constant log-linear decay of the disinfectant over time, estimated from the disinfectant concentrations measured at several time points during the disinfection experiments. The uncertainty in the estimated disinfectant decay is ignored in the inactivation kinetic models.
Define Models
We will use three kinetic models: first order (log-linear) decay for disinfectant demand, the Chick-Watson pseudo-first order rate law (with disinfectant demand), and the Hom model (with disinfectant demand) for non-linear log-survival [1]. In the following, we derive the model equations and implement the models as R functions.
Setup
Load packages and set options (the setup code chunk appears at the top of this document).
1. Disinfectant demand (first order decay)
Assuming first order decay provides for a log-linear reduction in disinfectant concentration determined by a single rate constant \(k'\) [2]. The disinfectant concentration \(C\) at contact time \(t\) is given by
\[ \begin{align} C &= C_0 \ \mathrm{e}^{-k't} \\ \ln\left(\frac{C}{C_0}\right) &= -k't \end{align} \tag{1}\]
for initial disinfectant concentration \(C_0\). \(k'\) is obtained from the coefficient estimate of a simple linear regression with the log disinfectant ratio \(\ln\left(\frac{C}{C_0}\right)\) as the response (dependent) variable and contact time \(t\) as the predictor (independent) variable:
\[ \begin{align} \ln\left(\frac{C}{C_0}\right)_i &= \beta t_i + \epsilon_i \\ \beta &= -k' \\ \epsilon_i &\sim \mathcal{N}\left(0, \sigma_C \right) \end{align} \tag{2}\]
where \(\epsilon_i\) is the independent, normally distributed residual for observation \(i\) and \(\sigma_C\) is the residual standard deviation of the model fit.
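As a concrete illustration, Equation 2 can be fit directly with lm(); the minimal sketch below uses a small made-up set of chlorine residual measurements (the data frame, column names, and values are hypothetical, not the tutorial data).
## illustrative sketch only: estimate k' by simple linear regression on made-up data
df_demand <- data.frame(time = c(5, 10, 20, 30),           # contact time (min)
                        conc = c(0.17, 0.15, 0.11, 0.08))  # residual chlorine (mg/L)
c0 <- 0.20                                          # hypothetical initial dose (mg/L)
df_demand$lnS <- log(df_demand$conc / c0)           # ln(C / C0)
fit_demand <- lm(lnS ~ -1 + time, data = df_demand) # intercept fixed at zero (see below)
kprime_hat <- -coef(fit_demand)[["time"]]           # k' = -beta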
2. Chick-Watson rate law (pseudo-first order)
Ignoring disinfectant demand
In a disinfectant demand-free system, the Chick-Watson rate law describes linear log-survival governed by two parameters: a decay constant \(k\) (analogous to \(k'\) in the first-order disinfectant decay) and the coefficient of dilution \(n\), an empirical constant representing the average number of disinfectant molecules needed to inactivate a single organism [2,3]. For \(N\) viable organisms at time \(t\) and disinfectant concentration \(C\), the rate of change in organism concentration is given by
\[ \begin{align} \frac{\mathrm{d}N}{\mathrm{d}t} &= -k N C^n \\ \int_{N_0}^{N} \frac{\mathrm{d}N}{N} &= -k C^n \int_{0}^{t} \mathrm{d}t \\ \ln\left(\frac{N}{N_0}\right) &= -k C^n t \end{align} \tag{3}\]
where \(N_0\) is the starting concentration of the organism. The Chick-Watson model can be fit using non-linear least squares (NLS) regression with the log-survival ratio \(\ln\left(\frac{N}{N_0}\right)\) as the outcome and a function \(g \left( \cdot \right)\) representing the Chick-Watson rate law as the mean function:
\[ \begin{align} \ln\left(\frac{N}{N_0}\right)_i &= g \left(x_i, \theta \right) + \epsilon_i \\ g \left( x_i, \theta \right) &= -k C^n t \\ \epsilon_i &\sim \mathcal{N}\left( 0, \sigma_N \right) \end{align} \tag{4}\]
where \(x_i\) represents the data for observation \(i\) passed to the function (in this case, disinfectant concentration \(C\) and contact time \(t\)) and \(\theta\) are the unknown parameters we wish to solve for (here, rate constant \(k\) and coefficient of dilution \(n\)). Because we are assuming no disinfectant demand, \(C\) is the initial chlorine dose, \(C_0\).
First we’ll define an R function for the Chick-Watson rate law, which evaluates to the log-survival ratio for given input values of the parameters \(k\) and \(n\) and the data \(t\) and \(C_0\):
## Chick-Watson rate law (no demand)
# parameters of interest (k, n) are passed log-transformed for computational reasons
law_cw <- function(c0, time, lnk, lnn){
  
  k <- exp(lnk)
  n <- exp(lnn)
  
  lnS <- -k * (c0^n) * time
  
  return(lnS)
}
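As a quick sanity check, we can evaluate the function at arbitrary illustrative values (these numbers are made up, not estimates):
# illustrative values only: k = 50, n = 1, 0.2 mg/L dose, 0.5 min contact time
law_cw(c0 = 0.2, time = 0.5, lnk = log(50), lnn = log(1))
#> [1] -5
# a natural-log survival ratio of -5, i.e. roughly a 2.2 log10 reduction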
With disinfectant demand
Assuming first-order disinfectant decay with rate constant \(k'\), we can substitute Equation 1 into the Chick-Watson rate law (Equation 3) to obtain:
\[ \begin{align} \frac{\mathrm{d}N}{\mathrm{d}t} &= -k N C_0^n \ \mathrm{e}^{-n k' t} \\ \int_{N_0}^{N} \frac{\mathrm{d}N}{N} &= -k C_0^n \int_{0}^{t} \mathrm{e}^{-n k' t} \ \mathrm{d}t \\ \ln\left(\frac{N}{N_0}\right) &= -\frac{k C_0^n}{n k'} \left(1 - \mathrm{e}^{-n k' t} \right) \end{align} \tag{5}\]
which predicts the log-survival ratio at contact time \(t\) accounting for log-linear loss of disinfectant over time [4,5].¹
As before, we’ll express this equation as an R function that returns log-survival ratios for given values of the kinetic parameters (\(k\), \(k'\), and \(n\)), the initial chlorine dose \(C_0\), and contact time \(t\).
## Chick-Watson rate law with demand
# parameters of interest (k, n) are passed log-transformed for computational reasons
law_cw_demand <- function(c0, time, kprime, lnk, lnn){
  
  k <- exp(lnk)
  n <- exp(lnn)
  
  A <- (-k * c0^n) / (n * kprime)
  B <- 1 - exp(-n * kprime * time)
  
  lnS <- A * B
  
  return(lnS)
}
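Note that as \(k' \to 0\), the factor \(\left(1 - \mathrm{e}^{-nk't}\right)/(nk')\) approaches \(t\), so Equation 5 collapses back to the demand-free rate law (Equation 3). A quick numerical check with the same made-up values as before:
# with negligible demand the two rate laws should agree (illustrative values only)
law_cw(c0 = 0.2, time = 0.5, lnk = log(50), lnn = log(1))
#> [1] -5
law_cw_demand(c0 = 0.2, time = 0.5, kprime = 1e-8, lnk = log(50), lnn = log(1))
#> [1] -5   # equal to within numerical precision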
3. Hom model (non-linear log-survival)
The Hom model generalizes the Chick-Watson model to allow for non-linear log-survival ratios by introducing an additional parameter \(m\), the Hom dilution coefficient:
\[ \begin{align} \frac{\mathrm{d}N}{\mathrm{d}t} &= -k m N C^n t^{m-1} \\ \int_{N_0}^{N} \frac{\mathrm{d}N}{N} &= -k m C^n \int_{0}^{t} t^{m-1} \mathrm{d}t \\ \ln\left(\frac{N}{N_0}\right) &= -k C^n t^{m} \end{align} \tag{6}\]
Values of \(m > 1\) allow for a “shoulder” in the survival curve, where disinfection initially proceeds more slowly before accelerating; conversely, values of \(m < 1\) induce “tailing off” of the survival curve, with initial rapid disinfection that slows over time. The Chick-Watson rate law is recovered as a special case for \(m = 1\) [2,3].
Incorporating disinfectant demand into the Hom model yields:
\[ \begin{align} \frac{\mathrm{d}N}{\mathrm{d}t} &= -k m N C_0^n \ \mathrm{e}^{-n k' t} t^{m-1} \\ \int_{N_0}^{N} \frac{\mathrm{d}N}{N} &= -k m C_0^n \int_{0}^{t} \mathrm{e}^{-n k' t} t^{m-1} \mathrm{d}t \\ \ln\left(\frac{N}{N_0}\right) &= -k m C_0^n \int_{0}^{t} \mathrm{e}^{-n k' t} t^{m-1} \mathrm{d}t \end{align} \tag{7}\]
Unfortunately, the integral on the right hand side does not have a simple closed form solution [2]. However, there is a clever trick to bypass the integral using the lower incomplete gamma function [6]
\[ \gamma \left(\alpha, x \right) = \int_{0}^{x} \mathrm{e}^{-z} z^{\alpha - 1} \mathrm{d}z \quad \quad \alpha >0, \ x \geq 0 \tag{8}\]
(The gamma function proper, \(\Gamma(\alpha)\), is defined as the integral from zero to infinity; by integrating from zero to \(x\), we obtain only the lower tail of the gamma function, thus “lower incomplete”.)
Letting \(x = z = n k' t\) and integrating, we obtain:
\[ \begin{align} \frac{\mathrm{d}N}{\mathrm{d} \left(nk't \right)} &= \frac{-k m N C_0^n \ \mathrm{e}^{-n k' t} t^{m-1}}{n k'} \\ \int_{N_0}^{N} \frac{\mathrm{d}N}{N} &= -k m C_0^n \int_{0}^{x} \frac{\mathrm{e}^{-z} t^{m-1}}{n k'} \ \mathrm{d}z \\ \ln\left(\frac{N}{N_0}\right) &= \frac{-k m C_0^n}{n k'} \int_{0}^{x} \mathrm{e}^{-z} t^{m-1} \mathrm{d}z \end{align} \]
Because \(t = \frac{z}{n k'}\), we can substitute for \(t\)
\[ \begin{align} \ln\left(\frac{N}{N_0}\right) &= \frac{-k m C_0^n}{n k'} \cdot \frac{1}{ \left(n k' \right)^{m-1}} \int_{0}^{x} \mathrm{e}^{-z} z^{m-1} \mathrm{d}z \\ &= \frac{-k m C_0^n}{ \left(n k' \right)^{m}} \int_{0}^{x} \mathrm{e}^{-z} z^{m-1} \mathrm{d}z \end{align} \]
such that the integral on the right hand side is now the lower incomplete gamma function with \(\alpha = m\) [2]. Because the incomplete gamma function is readily evaluated in R (function pgamma()),² substitution gives us a tractable Hom model with first order disinfectant demand:
\[ \begin{align} \ln\left(\frac{N}{N_0}\right) &= \frac{-k m C_0^n}{ \left(n k' \right)^{m}} \cdot \gamma \left(m, \ n k' t \right) \quad \quad m > 0, \ n k' t \geq 0 \end{align} \tag{9}\]
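As the footnote explains, pgamma() returns the regularized form \(\gamma(\alpha, x)/\Gamma(\alpha)\). A quick numerical check of the relationship with Equation 8, using arbitrary values:
# gamma(a) * pgamma(x, a) matches the integral definition in Equation 8 (illustrative values)
a <- 1.5
x <- 2
gamma(a) * pgamma(x, a)
integrate(function(z) exp(-z) * z^(a - 1), lower = 0, upper = x)$value
# both evaluate to the same quantity (about 0.65)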
As before, we’ll write an R function that implements the incomplete gamma Hom model and returns the log-survival ratio:
## Hom model with demand
# parameters of interest (k, n, m) are passed log-transformed for computational reasons
law_hom_demand <- function(c0, time, kprime, lnk, lnn, lnm){
  
  k <- exp(lnk)
  n <- exp(lnn)
  m <- exp(lnm)
  
  A <- (-k * m * (c0^n)) / ((n * kprime)^m)
  B <- pgamma(n * kprime * time, m)
  
  lnS <- A * B
  
  return(lnS)
}
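Because \(\gamma(1, x)/\Gamma(1) = 1 - \mathrm{e}^{-x}\), setting \(m = 1\) in this function reproduces the Chick-Watson model with demand, which we can confirm numerically (illustrative values only):
# for m = 1 the Hom model collapses to the Chick-Watson model with demand
law_cw_demand(c0 = 0.2, time = 0.5, kprime = 0.9, lnk = log(50), lnn = log(1))
law_hom_demand(c0 = 0.2, time = 0.5, kprime = 0.9, lnk = log(50), lnn = log(1), lnm = log(1))
# both return the same log-survival ratio (about -4.03)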
Fit Models
Non-linear least squares (NLS)
The kinetic models are fit by NLS, which requires specifying starting values for the unknown parameters. The base R nls()
function employs an algorithm that is not very robust to non-optimal starting values, small sample sizes, and poor model geometries, all of which we commonly encounter in inactivation experiment data. It also is limited in the methods that can be applied to the fitted model for further analysis and interpretation. Let’s write an NLS fitting function to use a more robust optimization method that also provides additional post-processing support.
We can also calculate additional model performance indicators, specifically \(R^2\) and an analogous metric more appropriate for non-linear models, the deviance explained. Deviance measures how well the proposed model fits the data compared with the saturated model that fits the data exactly, in terms of the difference in their log-likelihoods:
\[ D := -2 \left[\ell \left( \hat{\beta} \right) - \ell_s \right] \]
for model log-likelihood \(\ell \left( \hat{\beta} \right)\) and saturated log-likelihood \(\ell_s\). We can compare the deviance of a proposed model to the null deviance, the deviance of the simplest possible model with a single parameter fit to the same data: \(D_0 := -2 \left[\ell \left( \hat{\beta}_0 \right) - \ell_s \right]\). The null model is traditionally the intercept-only model, in which only a constant value equal to the mean of the outcome data \(\bar{y}\) is estimated. In ordinary least squares (OLS) regression, the deviance is equivalent to the residual sum of squares (RSS) \(D = RSS = \sum \left(y_i - \hat{y_i} \right)^2\) and the null deviance corresponds to the total sum of squares (TSS) \(D_0 = TSS = \sum \left(y_i - \bar{y} \right)^2\). The classical coefficient of determination, \(R^2\), is calculated as:
\[ R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{D}{D_0} \]
In the OLS case, \(R^2\) can be interpreted as the proportion of variance explained by the model. This interpretation no longer holds for nonlinear models, but the same ratio of deviance to the null deviance—the deviance explained—can still be used to assess how close the proposed model is to perfectly reproducing the training data relative to the simplest null model. In NLS, however, the traditional intercept-only null model may not be nested within the proposed model and may therefore be an inappropriate point of comparison for calculating deviance explained. Both the Chick-Watson and Hom models fix the intercept at zero, for example, so an intercept-only model that estimates a non-zero intercept would not be nested within either kinetic model. We might instead fit a single-parameter model that is a linear function of time with the intercept fixed at zero—the same log-linear decay model we used to estimate disinfectant demand. This represents the simplest model fully nested within the kinetic models: the Chick-Watson model is obtained from the Hom model by fixing \(m = 1\), and the linear-in-time model is simply the Chick-Watson model with \(n = 0\) and \(k' = 0\):
\[ \begin{align} \frac{\mathrm{d}N}{\mathrm{d}t} &= -k N C_0^0 \mathrm{e}^{-(0) (0) t} = -k N C_0^0 \mathrm{e}^{0} = -k N (1) (1) \\ \frac{\mathrm{d}N}{\mathrm{d}t} &= -k N\\ \int_{N_0}^{N} \frac{\mathrm{d}N}{N} &= -k \int_{0}^{t} \mathrm{d}t \\ \ln\left(\frac{N}{N_0}\right) &= -kt \end{align} \]
This model implies that log-survival begins at zero (by definition in all our models) and decays at a constant rate, regardless of disinfectant concentration. The ratio of the deviance of our proposed models to the deviance of this constant slope model tells us how much better the proposed models explain our data than if we just assumed a constant decrease in log-survival with time. As with the classical \(R^2\) metric, the deviance explained should not be used for model selection, as it will always increase as parameters are added to the model. We can implement both the classical \(R^2\) metric (which we’ll refer to as “naive” as it uses an inappropriate intercept-only null model) and the deviance explained using the linear-with-time null model in our revised NLS fitting function, along with the root mean square error (RMSE), an absolute measure of model fit closely related to the residual standard deviation: \(RMSE = \sqrt{ \frac{\sum \left(y_i - \hat{y_i} \right)^2} {n}}\).
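As a quick numerical illustration of the OLS relationships noted above (residual deviance = RSS, null deviance = TSS, so \(1 - D/D_0 = R^2\)), using simulated data:
# in a Gaussian GLM the deviance is the RSS and the null deviance is the TSS (illustrative)
set.seed(1)
d <- data.frame(x = 1:10)
d$y <- 2 * d$x + rnorm(10)
fit_ols <- glm(y ~ x, data = d)                      # Gaussian family by default
all.equal(deviance(fit_ols), sum(resid(fit_ols)^2))  # deviance equals RSS
1 - deviance(fit_ols) / fit_ols$null.deviance        # deviance explained...
summary(lm(y ~ x, data = d))$r.squared               # ...equals the classical R-squared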
Because we’ve added so many model summary metrics, let’s separate these out as their own function, which can either be called directly for an existing NLS model object, or called by the fitting function we’ll wrap around it. That means any updates will only have to be done in one place to propagate through everywhere else.
### function to extract estimates and metrics from kinetic model objects
## extract parameter estimates from existing gsl_nls model object
extract_kinetic <- function(fit,
                            form_null = NULL,
                            time_var = "time"){
  
  ## R-Squared
  y <- fit$m$lhs()              # original outcome data
  e_hat <- resid(fit)           # model residuals
  tss <- sum((y - mean(y))^2)   # total sum of squares
  rss <- sum((e_hat)^2)         # residual sum of squares
  r2 <- 1 - (rss / tss)         # R-squared (unreliable for NLS)
  
  ## root mean squared error
  rmse <- sqrt(mean(e_hat^2))
  
  ## Deviance explained
  # fit null model
  if(is.null(form_null)){
    out_var <- as.character(fit$call$formula[[2]])     # outcome variable name
    form_null <- str_c(out_var, " ~ -1 + ", time_var)  # nested null model formula
    message("null model not specified \nfitting linear in time with intercept fixed at 0 \n")
  }
  
  form_null <- as.formula(form_null)
  
  fit_null <- glm(form_null,
                  data = augment(fit))
  
  d_null <- deviance(fit_null)         # null deviance (residual deviance of null model)
  d_resid <- deviance(fit)             # residual deviance of proposed model
  d_explain <- 1 - (d_resid / d_null)  # deviance explained
  
  ## standard model stats
  stat <- glance(fit) %>%
    mutate(rmse = rmse,
           tss = tss,
           r2 = r2,
           dev_explain = d_explain,
           dev_null = d_null) %>%
    select(sigma, rmse,
           df_resid = df.residual,
           nobs, loglik = logLik, tss,
           deviance, dev_null, dev_explain,
           aic = AIC, bic = BIC,
           r2, converge = isConv)
  
  est_ci <- confint(fit)
  
  ### format estimates with stats
  est <- tidy(fit) %>%
    rename(est_ln = estimate,
           est_se_ln = std.error) %>%
    mutate(est_lo_ln = est_ci[,1],
           est_hi_ln = est_ci[,2],
           term = str_remove(term, "ln"),
           est = exp(est_ln),
           est_se = exp(est_se_ln),
           est_lo = exp(est_lo_ln),
           est_hi = exp(est_hi_ln)) %>%
    bind_cols(stat) %>%
    select(term,
           est, est_se, est_lo, est_hi,
           est_stat = statistic, est_pval = p.value,
           ends_with("_ln"),
           everything())
  
  return(est)
}
Now we can define a function to actually fit the models by NLS and either return the model object or a flat dataframe with the parameter estimates and fit metrics extracted from a call to the extract_kinetic() function defined above. We will use the gsl_nls() function from the gslnls package to fit the models with the recommended Levenberg-Marquardt (scale = "levenberg") method [7].
We pass the non-linear kinetic model functions defined earlier as part of the model formula call, e.g. formula = log_survival ~ law_cw_demand(c0, time, kprime, lnk, lnn). Values for time, disinfectant dose, and the disinfectant demand rate constant \(k'\) are passed as data (but still included in the formula call, as they are named arguments to the kinetic model functions). Starting values must be supplied as a named list for each unknown parameter on the log scale.
### function to fit nonlinear kinetic models with nonlinear least squares
# argument `flat = TRUE` returns formatted estimates in dataframe, not model object
fit_kinetic <- function(data,
                        formula,
                        start,
                        control = list(scale = "levenberg"),
                        flat = TRUE,
                        form_null = NULL,
                        time_var = "time"){
  
  ## ensure model formula is correctly formatted
  form <- as.formula(formula)
  
  ## fit NLS model
  fit <- gsl_nls(form,
                 data = data,
                 start = start,
                 trace = FALSE,
                 control = control)
  
  ## return model object?
  if(!flat){
    return(fit)
  } else {
    est <- extract_kinetic(fit,
                           form_null = form_null,
                           time_var = time_var)
    return(est)
  }
}
Log-linear disinfectant decay
Let’s also define a function to fit the log-linear (first order) disinfectant demand model and estimate \(k'\). For compatibility with the NLS outputs, we’ll calculate many of the same model fit metrics.
## custom function to estimate log-linear decay rate constant k'
# argument `flat = TRUE` returns formatted estimates in dataframe, not model object
fit_loglin <- function(data,
                       formula,
                       flat = TRUE){
  
  form <- as.formula(formula)
  
  fit <- lm(form, data = data)
  
  if(!flat){
    return(fit)
  } else {
    
    ## fit metrics calcs
    y <- fit$model[,1]            # original outcome data
    e_hat <- resid(fit)           # model residuals
    tss <- sum((y - mean(y))^2)   # total sum of squares
    rss <- sum((e_hat)^2)         # residual sum of squares
    r2 <- 1 - (rss / tss)         # R-squared
    rmse <- sqrt(mean(e_hat^2))
    dev_null <- glm(form, data = data)$null.deviance
    
    stat <- glance(fit) %>%
      mutate(tss = tss,
             dev_null = dev_null,
             r2 = r2,
             rmse = rmse) %>%
      select(sigma, rmse,
             df_resid = df.residual,
             nobs, loglik = logLik, tss,
             dev_explain = r.squared, r2,
             deviance, dev_null,
             aic = AIC, bic = BIC)
    
    est <- tidy(fit, conf.int = TRUE) %>%
      select(term,
             est = estimate,
             est_se = std.error,
             est_lo = conf.low,
             est_hi = conf.high,
             est_stat = statistic,
             est_pval = p.value) %>%
      bind_cols(stat)
    
    return(est)
  }
}
This function will return estimates for the residual standard deviation (\(\sigma_C\)) and the regression coefficient on contact time, which corresponds to the decay rate constant \(-k'\) (note the sign on \(\beta\) includes the minus sign on \(k'\) in Equation 2, so \(k' = - \beta\)). An intercept term will also be included in the model by default. However, the y-intercept should be exactly zero according to our model (at time \(t = 0\), \(\ln\left(\frac{C}{C_0} \right) = \ln\left(\frac{C_0}{C_0} \right) = \ln\left(1 \right) = 0\)), so it may be appropriate to exclude the intercept term by specifying -1 on the right hand side of the model formula (e.g., log_survival ~ -1 + time). Note that the observations at time zero should probably be excluded from the model for related reasons (i.e., we assume the log ratio at zero contact time is known to be exactly 0).
Interpret Models
Predictions
The fitted models allow us to generate model-based predictions of the log-survival ratio for a given contact time and disinfectant dose. We have already used model predictions at the values of our observed data to calculate the classical \(R^2\) and RMSE using the model residuals. We can also make predictions across a range of contact times to plot the general shape of our fitted inactivation curves, along with prediction intervals to characterize the uncertainty of our model-derived inactivation curves. We will generate the prediction intervals with an asymptotic approach, which assumes the models are well behaved and applies standard large-sample theory to obtain the intervals. While the assumptions may not exactly hold for typical inactivation experiment data, the asymptotic approach is fast, stable, and predictable, unlike other approaches (e.g., profiling) that frequently fail for these weakly-identified kinetic models. Though the asymptotic prediction intervals should be interpreted cautiously, in practice they provide a reasonable and consistent picture of the uncertainty in our kinetic models.
## predict log-survival from existing GSL_NLS model object
pred_kinetic <- function(fit,
                         newdata = NULL,
                         interval = "prediction",
                         level = 0.95){
  
  if(is.null(newdata)) newdata = augment(fit)
  
  # asymptotic predictions
  pred_raw <- predict(fit,
                      newdata = newdata,
                      interval = interval,
                      level = level) %>%
    as_tibble()
  
  pred <- bind_cols(newdata,
                    pred_raw) %>%
    rename(pred_avg = fit,
           pred_lo = lwr,
           pred_hi = upr)
  
  return(pred)
}
Contact time
So far we have been estimating rate constants, which are difficult to interpret directly. Instead, we usually evaluate disinfection kinetics in terms of the contact time required to achieve a desired reduction in microbial concentration for a given disinfectant dose. A common target is a 3-log10 reduction, which corresponds to a 99.9% decrease in microbial concentration. We can also calculate contact times for other reasonable reduction targets and disinfectant doses to be presented as a CT table. We can calculate the contact time for a given log-survival ratio and disinfectant dose from a fitted model by rearranging the Chick-Watson rate law (Equation 3):
\[ \begin{align} \ln\left(\frac{N}{N_0}\right) &= -k C^n t \\ t &= \frac{\ln\left(\frac{N}{N_0}\right)}{-k \ C^n} \end{align} \tag{10}\]
Note that contact times are calculated assuming a constant disinfectant dose and do not account for disinfectant demand. However, the values of the parameters used to calculate CT values were estimated accounting for disinfectant demand and could perhaps be thought of as demand-adjusted disinfection rates. As such, Hom model-derived CT values are similarly simple to obtain:
\[ \begin{align} \ln\left(\frac{N}{N_0}\right) &= -k C^n t^{m} \\ t &= \left( \frac{\ln\left(\frac{N}{N_0}\right)}{-k \ C^n} \right)^{1/m} \end{align} \tag{11}\]
Because the Hom model simplifies to the Chick-Watson model for \(m = 1\), we can define a single function to calculate the contact time needed for a given log-reduction using parameters fit using either model:
calc_time <- function(lnS, c0, k, n, m = 1){
  
  tm <- lnS / (-k * c0^n)
  
  t <- tm^(1/m)
  
  return(t)
}
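For example, with purely illustrative parameter values (\(k = 50\), \(n = 1\), \(m = 1\), not fitted estimates) and a 0.2 mg/L dose, the contact time and CT value for a 3-log10 (99.9%) reduction would be:
# illustrative values only, not fitted estimates
t_3log <- calc_time(lnS = log(1e-3), c0 = 0.2, k = 50, n = 1, m = 1)  # contact time (min)
ct_3log <- 0.2 * t_3log                                               # CT (mg*min/L)
# t_3log is about 0.69 min, giving a CT of about 0.14 mg*min/L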
CT tables
The contact times can be calculated for different log-reductions and disinfectant doses to obtain a table of CT values, the product of disinfectant dose and contact time (at constant temperature and pH), which is the form usually used to communicate disinfection requirements for a particular microbe to treatment operators.
## prepare table of CT values from model fit for given disinfectant dose
tab_ct <- function(est, .dose = 0.2){
  
  cols = c(m = 1)
  
  est %>%
    select(model, nobs,
           aic, rmse, dev_explain,
           term, est) %>%
    pivot_wider(names_from = term,
                values_from = est) %>%
    tibble::add_column(., !!!cols[!names(cols) %in% names(.)]) %>%
    mutate(m = replace_na(m, 1),
           dose = .dose,
           ct_log2 = .dose * calc_time(lnS = log(0.01), c0 = .dose, k = k, n = n, m = m),
           ct_log3 = .dose * calc_time(lnS = log(0.001), c0 = .dose, k = k, n = n, m = m),
           ct_log4 = .dose * calc_time(lnS = log(0.0001), c0 = .dose, k = k, n = n, m = m),
           m = ifelse(model == "Chick-Watson", NA, m))
}
Comparing rate constant estimates
If we have estimated separate rate constants (the \(k\) parameter estimates) for different subsets of the data, then we can test whether the two estimates are significantly different with a simple z-test. While the rate constants could come from different model types (e.g., one from Chick-Watson and the other from Hom), in practice the rate constant estimates from the two models are wildly different (likely because Hom is poorly identified with its extra parameter) despite producing very similar log-survival predictions. Instead, it is most useful to focus on comparing parameter estimates from the same model fit to two non-overlapping subsets of the data (e.g., comparing rate constants between clinical and environmental strains of the same organism).
As regression coefficients from non-linear least squares fits, we assume the rate constants are normally distributed and can test the null hypothesis that two rate constants estimated from separate data subsets are equal using a z-test with the test statistic given by
\[ z = \frac{k_1 - k_2}{\sqrt{se_{k_1}^2 + se_{k_2}^2}} \tag{12}\]
and the two-tailed probability obtained from the standard normal distribution. This is appropriate only if we can assume no covariance between the rate constants: \(\mathrm{cov}(k_1, k_2) = 0\); this assumption does not hold if some of the same data are used in both model fits. Therefore, while we can compare the rate constants fit separately to each species, we cannot compare the rate constant for one species to the rate constant estimated for all the data across species. (In practice it’s probably fine to do that anyway, but technically the denominator would be missing a \(-2 \ \mathrm{cov}(k_1, k_2)\) term that would be difficult to estimate for our NLS fits.)
## Function to z-test if rate constants estimated for different data subsets are different
# rate constants and SEs should generally be on the log-scale
test_rates <- function(k1, k1_se,
                       k2, k2_se){
  
  mu <- k1 - k2
  
  se_pool <- sqrt(k1_se^2 + k2_se^2)
  
  stat <- mu / se_pool
  
  prob <- 2 * pnorm(-abs(stat))
  
  out <- tibble(k1 = k1,
                k2 = k2,
                k_diff = mu,
                se_pool = se_pool,
                stat = stat,
                pval = prob)
  
  return(out)
}
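For example, comparing two hypothetical log-scale rate constant estimates and their standard errors:
# purely illustrative log-scale estimates and standard errors
test_rates(k1 = 10.2, k1_se = 1.8,
           k2 = 9.9, k2_se = 1.7)
# returns a one-row tibble with the difference, pooled SE, z-score, and two-sided p-value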
Example
We will demonstrate fitting and interpreting disinfection kinetics models with an example dataset from chlorine inactivation experiments for two species of the opportunistic waterborne bacterial pathogen Elizabethkingia spp.
Setup
All the functions have been saved in the companion script kinetics_functions.R, which can be sourced to make the functions available in the global environment. There are two datasets: the main Elizabethkingia inactivation experiments and the free-chlorine residual concentrations measured over time in the experimental flasks, from which we will estimate disinfectant demand.
## source analysis functions
source("../script/kinetics_functions.R")
## Data
# disinfectant decay
df_dis <- fread("../out/chlorine_dose.csv")

# Ek survival
df_micro <- fread("../out/ek_survival.csv")
Compare Chick-Watson and Hom models
First we’ll fit both models to all the data and plot the predicted inactivation curves.
Fit
## Estimate disinfectant demand
est_dis <- df_dis %>%
  fit_loglin(formula = lnS ~ -1 + time) %>%
  mutate(term = ifelse(term == "time", "kprime", term),
         across(c(est, est_lo, est_hi), ~-1 * .x),
         model = "disinfectant")

# extract decay constant
kprime <- est_dis %>%
  filter(term == "kprime") %>%
  select(est) %>%
  pull()

### Fit kinetic models
## augment data
df_run <- df_micro %>%
  mutate(kprime = kprime)

# prediction grid
df_pred <- tibble(time = seq(0, 1, 0.01),
                  c0_dose = 0.2,
                  kprime = kprime)

## Fit models
# Chick-Watson
m_cw <- fit_kinetic(df_run,
                    lnS ~ law_cw_demand(c0_dose, time, kprime, lnk, lnn),
                    start = list(lnk = 8,
                                 lnn = log(2)),
                    flat = FALSE)

# Hom
m_hom <- fit_kinetic(df_run,
                     lnS ~ law_hom_demand(c0_dose, time, kprime, lnk, lnn, lnm),
                     start = list(lnk = 20,
                                  lnn = log(2),
                                  lnm = 0.01),
                     flat = FALSE)

## Estimates
est_cw <- extract_kinetic(m_cw) %>%
  mutate(model = "Chick-Watson")

null model not specified
fitting linear in time with intercept fixed at 0

est_hom <- extract_kinetic(m_hom) %>%
  mutate(model = "Hom")

null model not specified
fitting linear in time with intercept fixed at 0

est_all <- bind_rows(est_cw,
                     est_hom,
                     est_dis)

## Predictions
pred_cw <- pred_kinetic(m_cw,
                        newdata = df_pred) %>%
  mutate(model = "Chick-Watson")

pred_hom <- pred_kinetic(m_hom,
                         newdata = df_pred) %>%
  mutate(model = "Hom")

pred_all <- bind_rows(pred_cw,
                      pred_hom)
Tabulate
Construct a table of fit metrics, parameter estimates, and CT values for the two models.
## CT table
ct_all <- est_all %>%
  filter(model != "disinfectant") %>%
  tab_ct() %>%
  mutate(across(c(k, n, m), ~log(.x)))

ct_all %>%
  kable(digits = 2,
        # labels follow the ct_all column order: aic, rmse, dev_explain, k, n, m, dose, CT values
        col.names = c("Model", "N",
                      "AIC", "RMSE", "Deviance Explained",
                      "ln(k)", "ln(n)", "ln(m)",
                      "Cl Dose (mg/L)",
                      "99%",
                      "99.9%",
                      "99.99%")) %>%
  kableExtra::add_header_above(header = c(" " = 2,
                                          "Model Fit" = 3,
                                          "Parameter Estimates" = 3,
                                          " " = 1,
                                          "CT Value (min*mg/L)" = 3))
Model | N | AIC | RMSE | Deviance Explained | ln(k) | ln(n) | ln(m) | Cl Dose (mg/L) | 99% | 99.9% | 99.99% |
---|---|---|---|---|---|---|---|---|---|---|---|
Chick-Watson | 83 | 430.92 | 3.13 | 0.52 | 10.36 | 1.41 | | 0.2 | 0.02 | 0.03 | 0.04 |
Hom | 83 | 431.56 | 3.10 | 0.53 | 25.06 | 2.41 | 0.88 | 0.2 | 0.02 | 0.02 | 0.03 |
Plot
## Data
# predictions
pred_plot <- pred_all %>%
  mutate(logS = log10(exp(pred_avg)),
         lo = log10(exp(pred_lo)),
         hi = log10(exp(pred_hi)))

# data
data_plot <- df_run %>%
  mutate(logS = log10(exp(lnS)))

## Color Scales
col_model <- c("Chick-Watson" = "#bd0026",
               "Hom" = "#fd8d3c")
## Compare models for all data
pred_plot %>%
  mutate(model = fct_relevel(model, c("Chick-Watson", "Hom"))) %>%
  ggplot(data = .,
         mapping = aes(x = time, y = logS)) +
geom_hline(aes(yintercept = -3),
linetype = "dashed") +
geom_point(data = data_plot,
alpha = 0.8,
position = position_jitter(width = 0.01),
show.legend = TRUE) +
geom_line(aes(color = model, linetype = model),
linewidth = 1.25) +
geom_ribbon(aes(ymin = lo, ymax = hi, fill = model,
color = model, linetype = model),
alpha = 0.3) +
annotate("text",
x = 0.75, y = -3.1, size = 3,
vjust = "top",
label = "99.9% inactivation") +
annotate("text",
x = 0.95, y = -0.95, size = 3,
hjust = "right", vjust = "bottom",
label = "all data") +
annotate("text",
x = 0.05, y = -7.95, size = 3,
hjust = "left", vjust = "bottom",
label = "0.2 mg/L FCR") +
scale_color_manual(values = col_model,
name = "Model") +
scale_fill_manual(values = col_model,
name = "Model") +
guides(color = guide_legend(override.aes = list(size = 2, linewidth = 0.5))) +
scale_linetype_manual(values = c("solid", "longdash"),
name = "Model") +
scale_x_continuous(limits = c(0, 1),
expand = expansion(add = c(0, 0.01)),
oob = scales::squish) +
scale_y_continuous(limits = c(-8.5, 0),
breaks = c(0, -1, -2, -3, -4, -5, -6, -7, -8, -9),
labels = c(0, 1, 2, 3, 4, 5, 6, 7, 8 , 9),
expand = expansion(add = c(0.01, 0.01)),
oob = scales::squish) +
labs(x = "contact time (min)",
y = expression(log[10]~inactivation)) +
theme_bw()
Compare species
We have two similar species of Elizabethkingia, E. anophelis and E. meningoseptica. We will fit the Chick-Watson model to each and compare the inactivation rate constant estimates between them.
The companion script already contains a convenience wrapper function analyze_kinetic()
to perform the entire analysis procedure from above that can be applied to each data subset. This wrapper function was written using the variable names in our dataset and contains other analysis-specific assumptions, so it would need to be modified for more general use.
Fit
## EkA
# disinfectant decay
df_eka_dis <- df_dis %>%
  filter(species == "EkA")

df_eka <- df_micro %>%
  filter(species == "EkA")

## run analysis
run_eka <- analyze_kinetic(data = df_eka,
                           data_dis = df_eka_dis)

## extract estimates
est_eka <- run_eka$est %>%
  mutate(species = "E. anophelis")

## extract predictions
pred_eka <- run_eka$pred %>%
  mutate(species = "E. anophelis")

## extract data
data_eka <- run_eka$data %>%
  mutate(species = "E. anophelis")

## EkM
# disinfectant decay
df_ekm_dis <- df_dis %>%
  filter(species == "EkM")

df_ekm <- df_micro %>%
  filter(species == "EkM")

## run analysis
run_ekm <- analyze_kinetic(data = df_ekm,
                           data_dis = df_ekm_dis)

## extract estimates
est_ekm <- run_ekm$est %>%
  mutate(species = "E. meningoseptica")

## extract predictions
pred_ekm <- run_ekm$pred %>%
  mutate(species = "E. meningoseptica")

## extract data
data_ekm <- run_ekm$data %>%
  mutate(species = "E. meningoseptica")
Plot
## Data
# predictions
pred_combo <- bind_rows(pred_eka,
                        pred_ekm) %>%
  filter(model == "Chick-Watson") %>%
  mutate(logS = log10(exp(pred_avg)),
         lo = log10(exp(pred_lo)),
         hi = log10(exp(pred_hi)),
         species = fct_relevel(species, c("E. anophelis", "E. meningoseptica")))

# data
data_combo <- bind_rows(data_eka, data_ekm) %>%
  mutate(group = "species",
         species = fct_relevel(species, c("E. anophelis", "E. meningoseptica")),
         logS = log10(exp(lnS)))

## Color Scales
col_species <- c("E. anophelis" = "#253494",
                 "E. meningoseptica" = "#f768a1")

## Compare species with Chick-Watson Model
pred_combo %>%
  ggplot(data = .,
mapping = aes(x = time, y = logS)) +
geom_hline(aes(yintercept = -3),
linetype = "dashed") +
geom_point(aes(color = species, shape = species),
data = data_combo,
position = position_jitter(width = 0.01),
alpha = 0.8) +
geom_line(aes(color = species, linetype = species),
linewidth = 1.25) +
geom_ribbon(aes(ymin = lo, ymax = hi, fill = species,
color = species, linetype = species),
alpha = 0.3) +
annotate("text",
x = 0.75, y = -3.1, size = 3,
vjust = "top",
label = "99.9% inactivation") +
annotate("text",
x = 0.95, y = -0.95, size = 3,
hjust = "right", vjust = "bottom",
label = "Chick-Watson model") +
annotate("text",
x = 0.05, y = -7.95, size = 3,
hjust = "left", vjust = "bottom",
label = "0.2 mg/L FCR") +
scale_color_manual(values = col_species,
name = "Species") +
scale_fill_manual(values = col_species,
name = "Species") +
scale_shape_discrete(name = "Species") +
guides(color = guide_legend(override.aes = list(size = 2, linewidth = 0.5))) +
scale_linetype_manual(values = c("solid", "longdash"),
name = "Species") +
scale_x_continuous(limits = c(0, 1),
expand = expansion(add = c(0, 0.01)),
oob = scales::squish) +
scale_y_continuous(limits = c(-8.5, 0),
breaks = c(0, -1, -2, -3, -4, -5, -6, -7, -8, -9),
labels = c(0, 1, 2, 3, 4, 5, 6, 7, 8 , 9),
expand = expansion(add = c(0.01, 0.01)),
oob = scales::squish) +
labs(x = "contact time (min)",
y = expression(log[10]~inactivation)) +
theme_bw() +
theme(legend.text = element_text(face="italic"))
It looks like the predicted inactivation may be slightly different between the two species, based on the separation between the modeled inactivation curves in the plot. We should test whether the rate constant estimates are statistically different.
Compare rate constants
## By species
k_test_species <- bind_rows(est_eka,
                            est_ekm) %>%
  mutate(group = "species") %>%
  filter(term == "k",
         group == "species",
         model == "Chick-Watson") %>%
  group_by(group, model, term) %>%
  summarise(test = test_rates(est_ln[1], est_se_ln[1],
                              est_ln[2], est_se_ln[2]),
            .groups = "drop") %>%
  unnest(test)

# disinfectant demand - not logged (model used natural-scale rate parameter)
kprime_test_species <- bind_rows(est_eka,
                                 est_ekm) %>%
  mutate(group = "species") %>%
  filter(term == "kprime",
         group == "species") %>%
  group_by(group, model, term) %>%
  summarise(test = test_rates(est[1], est_se[1],
                              est[2], est_se[2]),
            .groups = "drop") %>%
  unnest(test)

## combine
k_test <- bind_rows(k_test_species,
                    kprime_test_species) %>%
  mutate(signif = ifelse(pval < 0.05, "yes", "no"))
## Table
k_test %>%
  kable(digits = 2,
        col.names = c("Group", "Model", "Parameter",
                      "Model 1", "Model 2", "Difference",
                      "Pooled SE", "z-score",
                      "p-value", "Significant (p<0.05)")) %>%
  kableExtra::add_header_above(header = c(" " = 3,
                                          "ln(rate constant)" = 3,
                                          " " = 4))
Group | Model | Parameter | Model 1 | Model 2 | Difference | Pooled SE | z-score | p-value | Significant (p<0.05) |
---|---|---|---|---|---|---|---|---|---|
species | Chick-Watson | k | 10.25 | 9.99 | 0.26 | 2.52 | 0.10 | 0.92 | no |
species | disinfectant | kprime | 0.90 | 0.88 | 0.02 | 0.17 | 0.09 | 0.93 | no |
Using a z-test with 95% confidence level, the disinfection and disinfectant demand rate constants are not significantly different by species. The apparent differences observed in the inactivation curve plot above are not reflected in the rate constant estimates, which are adjusted for the disinfectant decay rates (which were also not significantly different) and potentially different coefficients of dilution, the empirical constant \(n\) that is introduced in the Chick-Watson model.
References
Disclaimer:
The findings and conclusions of this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Footnotes
The equation provided for the Chick-Watson model with first-order disinfectant decay in Gyürék and Finch (1998) looks somewhat different and introduces new notation:
\[ \ln \left( \frac{N}{N_0} \right) = - \frac{k}{k'n} \left(C_0^n - C_f^n \right) \]
where \(C_f\) is described in an appendix table as the disinfectant residual after contact time \(t\). Perhaps this is algebraically equivalent, but as \(C_f\) is not clearly defined and \(t\) no longer explicitly appears in the model, it is not obvious how to fit the model with this parameterization.↩︎
The pgamma() documentation states that it is closely related to the incomplete gamma function but differs by a normalizing factor, namely \(\Gamma \left(\alpha \right)^{-1}\). To recover exactly the incomplete gamma function as defined in Equation 8, one would specify pgamma(x, a) * gamma(a) in R, but it is not necessary for our purposes.↩︎