advanced_data.Rmd
In the first vignette, we considered only examples where we could observe every chain size exactly, both in theory and in practice. This is not always true. Sometimes we only consider chains that have at least some number of cases, and sometimes we only know that a chain was at least a given size. Our investigation of these topics will use data on outbreaks of pneumonic plague reported by Nishiura et al. (2012).
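A quick look at the distribution of chain sizes can be had with a call along these lines (assuming, as in the rest of this vignette, that the sizes are stored in a numeric vector named `pneumonic_plague`):

summary(pneumonic_plague)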
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 4.5 12.0 277.6 24.0 5009.0
Among the data, we see a number of relatively small chains, but no singletons. This is because when Nishiura et al. investigated outbreaks of pneumonic plague, they compiled data only on chains of at least 2 cases.
In general, any particular dataset may include chains only if they reach some minimum size threshold. Analyzing such data requires appropriately conditioning the likelihood using the `condition_geq` argument. For the pneumonic plague data, we need to set `condition_geq = rep(2, 19)`, indicating that for each of the 19 chains in the dataset the minimum observation size was 2. While generally `condition_geq = rep(minimum_observation_size, number_of_chains)`, specifying the minimum observation size for each chain separately makes it possible to combine datasets collected according to different standards in a unified interface.
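For example, if these 19 chains were pooled with a second, hypothetical dataset collected with no minimum size requirement (below called `other_outbreaks`, a made-up object; a threshold of 1 is assumed here to mean no effective conditioning), the per-chain thresholds could simply differ:

# Hypothetical: combine datasets collected under different minimum-size criteria
combined_outbreaks <- c(pneumonic_plague, other_outbreaks)
combined_condition_geq <- c(
  rep(2, length(pneumonic_plague)), # chains recorded only if they reached 2 cases
  rep(1, length(other_outbreaks)) # chains recorded regardless of size
)
fit_combined <- fit_nbbp_homogenous_bayes(
  all_outbreaks = combined_outbreaks,
  condition_geq = combined_condition_geq,
  seed = 42
)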
One more wrinkle remains before analysis.
We might also notice in these data that there is one very large chain, of size 5009. This is not a stuttering chain; Nishiura et al. call it a major epidemic. We have two choices for how to handle this appropriately. Both amount to acknowledging that this chain did not go extinct due to any feature of the branching process model that this package uses, but they differ in how they treat it.
We can simply declare that this chain did not go extinct. Mathematically, this requires partitioning outcomes into extinct and non-extinct chains and considering them separately. But in practice with `nbbp` this is done by declaring the size of the chain to be `Inf` (which is the package's shorthand for non-extinct). Note that this will force $R_{\text{eff}} > 1$, because when $R_{\text{eff}} \le 1$ every chain must go extinct.
# Copy the data and declare the major epidemic (the 4th chain) non-extinct
pneumonic_plague_inf <- pneumonic_plague
pneumonic_plague_inf[4] <- Inf
fit_inf <- fit_nbbp_homogenous_bayes(
  all_outbreaks = pneumonic_plague_inf,
  condition_geq = rep(2, length(pneumonic_plague)),
  seed = 42
)
As always, first we inspect the MCMC.
rstan::check_hmc_diagnostics(fit_inf)
##
## Divergences:
## 2 of 10000 iterations ended with a divergence (0.02%).
## Try increasing 'adapt_delta' to remove the divergences.
##
## Tree depth:
## 0 of 10000 iterations saturated the maximum tree depth of 10.
##
## Energy:
## E-BFMI indicated no pathological behavior.
rstan::summary(fit_inf)$summary[, c("n_eff", "Rhat")]
## n_eff Rhat
## r_eff 1905.162 1.001219
## inv_sqrt_dispersion 2369.287 1.000739
## dispersion 6299.645 1.000225
## exn_prob 5064.136 1.000848
## p_0 2406.052 1.000997
## lp__ 1762.918 1.000372
We see a small number of divergent transitions, but things look good on the whole.
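Were the divergences more numerous, the usual remedy would be to re-fit with a larger `adapt_delta`. The sketch below assumes (and this is an assumption, not something documented here) that `fit_nbbp_homogenous_bayes` forwards extra arguments such as `control` on to `rstan::sampling`:

# Hypothetical re-fit with tighter HMC adaptation; assumes `control` is
# passed through to rstan::sampling
fit_inf_strict <- fit_nbbp_homogenous_bayes(
  all_outbreaks = pneumonic_plague_inf,
  condition_geq = rep(2, length(pneumonic_plague)),
  seed = 42,
  control = list(adapt_delta = 0.99)
)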
Now we can look at what we estimated.
fit_inf
## Inference for Stan model: nbbp_homogenous.
## 4 chains, each with iter=5000; warmup=2500; thin=1;
## post-warmup draws per chain=2500, total post-warmup draws=10000.
##
## mean se_mean sd 2.5% 25% 50% 75%
## r_eff 1.09 0.00 0.08 1.01 1.04 1.07 1.12
## inv_sqrt_dispersion 1.39 0.02 1.07 0.05 0.53 1.14 2.06
## dispersion 1730.05 1560.91 123889.73 0.07 0.24 0.77 3.54
## exn_prob 0.94 0.00 0.04 0.84 0.92 0.95 0.97
## p_0 0.54 0.00 0.15 0.35 0.40 0.51 0.66
## lp__ -79.09 0.03 1.12 -82.08 -79.54 -78.75 -78.28
## 97.5% n_eff Rhat
## r_eff 1.32 1905 1
## inv_sqrt_dispersion 3.89 2369 1
## dispersion 374.67 6300 1
## exn_prob 0.99 5064 1
## p_0 0.82 2406 1
## lp__ -77.97 1763 1
##
## Samples were drawn using NUTS(diag_e) at Tue May 20 19:58:29 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
We can see that $R_{\text{eff}}$ is inferred to be above one with probability 1. It is likely near 1, though. We see considerable uncertainty about the dispersion $k$, as we did in the borealpox example.
We can alternately declare that all we know is that the chain was at least as big as we saw. That is, it didn't go extinct under its own power at size 5009, but it might have at 5010, or 7500, or something else. This is accomplished using the `censor_geq` argument, which specifies the smallest possible size of each chain (allowing for as many censored observations as needed), with `NA` meaning no censoring is applied. (This is significantly slower, so for demonstration purposes we use shorter MCMC runs.)
# Censor only the 4th chain (the major epidemic): its true size is at least 5009
censoring_sizes <- c(rep(NA, 3), 5009, rep(NA, 15))
fit_cen <- fit_nbbp_homogenous_bayes(
  all_outbreaks = pneumonic_plague,
  condition_geq = rep(2, length(pneumonic_plague)),
  censor_geq = censoring_sizes,
  seed = 42,
  iter = 600
)
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
As always, first we inspect the MCMC. As Stan has warned us, performance isn't amazing, and we should take both the convergence of our chains and the quality of the resulting estimates (especially for tail quantities) with a few grains of salt. However, for demonstration purposes, these results are good enough.
rstan::check_hmc_diagnostics(fit_cen)
##
## Divergences:
## 0 of 1200 iterations ended with a divergence.
##
## Tree depth:
## 0 of 1200 iterations saturated the maximum tree depth of 10.
##
## Energy:
## E-BFMI indicated no pathological behavior.
rstan::summary(fit_cen)$summary[, c("n_eff", "Rhat")]
## n_eff Rhat
## r_eff 162.4110 1.014274
## inv_sqrt_dispersion 211.7760 1.014272
## dispersion 142.9110 1.025720
## exn_prob 519.5789 1.003398
## p_0 218.9932 1.016453
## lp__ 216.2874 1.006255
Now we can look at what we estimated.
fit_cen
## Inference for Stan model: nbbp_homogenous.
## 4 chains, each with iter=600; warmup=300; thin=1;
## post-warmup draws per chain=300, total post-warmup draws=1200.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5%
## r_eff 1.08 0.01 0.08 0.99 1.03 1.06 1.11 1.32
## inv_sqrt_dispersion 1.37 0.07 1.04 0.06 0.53 1.14 2.02 3.80
## dispersion 6050.69 5973.77 71413.67 0.07 0.24 0.77 3.56 266.24
## exn_prob 0.95 0.00 0.04 0.84 0.93 0.96 0.98 1.00
## p_0 0.54 0.01 0.15 0.35 0.40 0.51 0.66 0.82
## lp__ -76.28 0.07 1.07 -79.28 -76.63 -75.94 -75.55 -75.28
## n_eff Rhat
## r_eff 162 1.01
## inv_sqrt_dispersion 212 1.01
## dispersion 143 1.03
## exn_prob 520 1.00
## p_0 219 1.02
## lp__ 216 1.01
##
## Samples were drawn using NUTS(diag_e) at Tue May 20 19:59:22 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
We can see that now there is some probability that $R_{\text{eff}} < 1$, corresponding to the possibility that this chain was simply large, but finite. We again see considerable uncertainty about $k$.
Obviously, the two analyses disagree on whether $R_{\text{eff}}$ can be below 1. When we take the chain of 5009 to be non-extinct, this probability is 0, while when we take that chain to be censored, it is about 7%. Otherwise, though, estimates of $R_{\text{eff}}$ are qualitatively similar.
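Both probabilities can be read directly off the posterior draws; a minimal sketch:

# Posterior probability that R_eff is below 1 under each treatment of the large chain
mean(unlist(rstan::extract(fit_inf, "r_eff")) < 1)
mean(unlist(rstan::extract(fit_cen, "r_eff")) < 1)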
r_df <- data.frame(
  r = c(
    unlist(rstan::extract(fit_inf, "r_eff")),
    unlist(rstan::extract(fit_cen, "r_eff"))
  ),
  large_chain = c(
    rep("non-extinct", 10000),
    rep("censored", 1200)
  )
)

r_df |>
  ggplot(aes(x = r, fill = large_chain, y = after_stat(density))) +
  geom_histogram(position = "identity", alpha = 0.7) +
  theme_minimal() +
  xlab("R") +
  geom_vline(xintercept = 1.0, lty = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The overall similarity of inference suggests that, in some sense, the model "prefers" to consider the chain non-extinct even if we did not say it was such. Recall that when $R_{\text{eff}} > 1$, a chain may or may not go extinct, while when $R_{\text{eff}} \le 1$ extinction is certain. So, for the part of the posterior where $R_{\text{eff}} \le 1$, the chain must be finite-but-large, while for the rest it could be either finite-but-large or non-extinct. We can compute the probability of the chain being finite-but-large and the probability of it being non-extinct given $R_{\text{eff}}$ and $k$, and ask which is larger. We can marginalize this over the posterior distribution on $R_{\text{eff}}$ and $k$, and use the percent of the posterior where the probability that the chain is non-extinct exceeds the probability that it is finite-but-large as a measure of this preference.
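Concretely, writing $X$ for the eventual size of the chain and $F(\cdot \mid R_{\text{eff}}, k)$ for its CDF over finite sizes, the code below compares

$$\Pr(X = \infty \mid R_{\text{eff}}, k) \quad \text{with} \quad \Pr(5009 \le X < \infty \mid R_{\text{eff}}, k) = \left[1 - F(5008 \mid R_{\text{eff}}, k)\right] - \Pr(X = \infty \mid R_{\text{eff}}, k).$$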
We will need to use the PMF (`dnbbp`) and CDF (`pnbbp`) functions from `nbbp`. (Here we use 500 samples for illustrative purposes, but in general one should use the entire posterior.)
probs <- rstan::extract(fit_cen, pars = c("r_eff", "dispersion")) |>
  as.data.frame() |>
  head(500) |>
  mutate(
    ccdf = purrr::map2_dbl(
      r_eff, dispersion, function(r_eff, dispersion) {
        1 - pnbbp(5008, r_eff, dispersion)
      }
    )
  ) |>
  mutate(
    pr_non_extinct = purrr::map2_dbl(
      r_eff, dispersion, function(r_eff, dispersion) {
        dnbbp(Inf, r_eff, dispersion)
      }
    )
  ) |>
  mutate(
    pr_large = purrr::map2_dbl(
      ccdf, pr_non_extinct, function(ccdf, pr_non_extinct) {
        ccdf - pr_non_extinct
      }
    )
  )
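The share of posterior draws in which non-extinction is the more probable explanation can then be computed directly (a minimal sketch using the `probs` data frame built above):

# Fraction of draws where non-extinction beats finite-but-large
mean(probs$pr_non_extinct > probs$pr_large)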
The probability that the chain is non-extinct is larger in 90% of the posterior, confirming our intuition.
Estimates of $k$ are even more similar between the two analysis choices than estimates of $R_{\text{eff}}$.
k_df <- data.frame(
  k = c(
    unlist(rstan::extract(fit_inf, "dispersion")),
    unlist(rstan::extract(fit_cen, "dispersion"))
  ),
  large_chain = c(
    rep("non-extinct", 10000),
    rep("censored", 1200)
  )
)

k_df |>
  ggplot(aes(
    x = log(k),
    fill = large_chain,
    y = after_stat(density)
  )) +
  geom_histogram(position = "identity", alpha = 0.7) +
  theme_minimal() +
  xlab("log(k)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Note that while the two analyses produced similar results here, that is not guaranteed in general. Differences are especially likely if the observed size of a non-extinct chain is not as large as it is here. Careful thought should go into choosing an analysis approach.