advanced_data.Rmd
In the first vignette, we considered only examples where we could observe every chain size exactly, both in theory and in practice. This is not always true. Sometimes we only consider chains that have at least some number of cases, and sometimes we only know that a chain was at least a given size. Our investigation of these topics will use data on outbreaks of pneumonic plague reported by Nishiura et al. (2012).
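A quick look at the distribution of chain sizes can be had with a call along these lines (assuming, as in the rest of this vignette, that the sizes are stored in a numeric vector named `pneumonic_plague`):

summary(pneumonic_plague)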
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 4.5 12.0 277.6 24.0 5009.0
Among the data, we see a number of relatively small chains, but no singletons. This is because when Nishiura et al. investigated outbreaks of pneumonic plague, they compiled data only on chains of at least 2 cases.
In general, any particular dataset may include chains only if they reach some minimum size threshold. Analyzing such data requires appropriately conditioning the likelihood using the `condition_geq` argument. For the pneumonic plague data, we need to set `condition_geq = rep(2, 19)`, indicating that for each of the 19 chains in the dataset the minimum observation size was 2. While generally `condition_geq = rep(minimum_observation_size, number_of_chains)`, specifying the minimum observation size for each chain separately makes it possible to combine datasets collected according to different standards in a unified interface.
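For example, if these 19 chains were pooled with a second, hypothetical dataset collected with no minimum size requirement (below called `other_outbreaks`, a made-up object; a threshold of 1 is assumed here to mean no effective conditioning), the per-chain thresholds could simply differ:

# Hypothetical: combine datasets collected under different minimum-size criteria
combined_outbreaks <- c(pneumonic_plague, other_outbreaks)
combined_condition_geq <- c(
  rep(2, length(pneumonic_plague)), # chains recorded only if they reached 2 cases
  rep(1, length(other_outbreaks)) # chains recorded regardless of size
)
fit_combined <- fit_nbbp_homogenous_bayes(
  all_outbreaks = combined_outbreaks,
  condition_geq = combined_condition_geq,
  seed = 42
)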
One more wrinkle remains before analysis.
We might also notice in these data that there is one very large chain, of size 5009. This is not a stuttering chain; Nishiura et al. call it a major epidemic. We have two choices for how to handle this appropriately. Both amount to acknowledging that this chain did not go extinct due to any feature of the branching process model that this package uses, but they differ in how they treat it.
We can simply declare that this chain did not go extinct. Mathematically, this requires partitioning outcomes into extinct and non-extinct chains and considering them separately. But in practice with `nbbp` this is done by declaring the size of the chain to be `Inf` (which is the package's shorthand for non-extinct). Note that this will force $R_{\text{eff}} > 1$, because when $R_{\text{eff}} \le 1$ every chain must go extinct.
# Copy the data and declare the major epidemic (the 4th chain) non-extinct
pneumonic_plague_inf <- pneumonic_plague
pneumonic_plague_inf[4] <- Inf
fit_inf <- fit_nbbp_homogenous_bayes(
  all_outbreaks = pneumonic_plague_inf,
  condition_geq = rep(2, length(pneumonic_plague)),
  seed = 42
)
As always, first we inspect the MCMC.
rstan::check_hmc_diagnostics(fit_inf)
##
## Divergences:
## 2 of 10000 iterations ended with a divergence (0.02%).
## Try increasing 'adapt_delta' to remove the divergences.
##
## Tree depth:
## 0 of 10000 iterations saturated the maximum tree depth of 10.
##
## Energy:
## E-BFMI indicated no pathological behavior.
rstan::summary(fit_inf)$summary[, c("n_eff", "Rhat")]
## n_eff Rhat
## r_eff 1905.162 1.001219
## inv_sqrt_dispersion 2369.287 1.000739
## dispersion 6299.645 1.000225
## exn_prob 5064.136 1.000848
## p_0 2406.052 1.000997
## lp__ 1762.918 1.000372
We see a small number of divergent transitions, but things look good on the whole.
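Were the divergences more numerous, the usual remedy would be to re-fit with a larger `adapt_delta`. The sketch below assumes (and this is an assumption, not something documented here) that `fit_nbbp_homogenous_bayes` forwards extra arguments such as `control` on to `rstan::sampling`:

# Hypothetical re-fit with tighter HMC adaptation; assumes `control` is
# passed through to rstan::sampling
fit_inf_strict <- fit_nbbp_homogenous_bayes(
  all_outbreaks = pneumonic_plague_inf,
  condition_geq = rep(2, length(pneumonic_plague)),
  seed = 42,
  control = list(adapt_delta = 0.99)
)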
Now we can look at what we estimated.
fit_inf
## Inference for Stan model: nbbp_homogenous.
## 4 chains, each with iter=5000; warmup=2500; thin=1;
## post-warmup draws per chain=2500, total post-warmup draws=10000.
##
## mean se_mean sd 2.5% 25% 50% 75%
## r_eff 1.09 0.00 0.08 1.01 1.04 1.07 1.12
## inv_sqrt_dispersion 1.39 0.02 1.07 0.05 0.53 1.14 2.06
## dispersion 1730.05 1560.91 123889.73 0.07 0.24 0.77 3.54
## exn_prob 0.94 0.00 0.04 0.84 0.92 0.95 0.97
## p_0 0.54 0.00 0.15 0.35 0.40 0.51 0.66
## lp__ -79.09 0.03 1.12 -82.08 -79.54 -78.75 -78.28
## 97.5% n_eff Rhat
## r_eff 1.32 1905 1
## inv_sqrt_dispersion 3.89 2369 1
## dispersion 374.67 6300 1
## exn_prob 0.99 5064 1
## p_0 0.82 2406 1
## lp__ -77.97 1763 1
##
## Samples were drawn using NUTS(diag_e) at Tue May 20 19:58:29 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
We can see that $R_{\text{eff}}$ is inferred to be above one with probability 1. It is likely near 1, though. We see considerable uncertainty about the dispersion $k$, as we did in the borealpox example.
We can alternately declare that all we know is that the chain was at least as big as we saw. That is, it didn't go extinct under its own power at size 5009, but it might have at 5010, or 7500, or something else. This is accomplished using the `censor_geq` argument, which specifies the smallest possible size of each chain (allowing for as many censored observations as needed), with `NA` meaning no censoring is applied. (This is significantly slower, so for demonstration purposes we use shorter MCMC runs.)
# Censor only the 4th chain (the major epidemic): its true size is at least 5009
censoring_sizes <- c(rep(NA, 3), 5009, rep(NA, 15))
fit_cen <- fit_nbbp_homogenous_bayes(
  all_outbreaks = pneumonic_plague,
  condition_geq = rep(2, length(pneumonic_plague)),
  censor_geq = censoring_sizes,
  seed = 42,
  iter = 600
)
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
As always, first we inspect the MCMC. As Stan has warned us, performance isn't amazing, and we should take both the convergence of our chains and the quality of the resulting estimates (especially for tail quantities) with a few grains of salt. However, for demonstration purposes, these results are good enough.
rstan::check_hmc_diagnostics(fit_cen)
##
## Divergences:
## 0 of 1200 iterations ended with a divergence.
##
## Tree depth:
## 0 of 1200 iterations saturated the maximum tree depth of 10.
##
## Energy:
## E-BFMI indicated no pathological behavior.
rstan::summary(fit_cen)$summary[, c("n_eff", "Rhat")]
## n_eff Rhat
## r_eff 162.4110 1.014274
## inv_sqrt_dispersion 211.7760 1.014272
## dispersion 142.9110 1.025720
## exn_prob 519.5789 1.003398
## p_0 218.9932 1.016453
## lp__ 216.2874 1.006255
Now we can look at what we estimated.
fit_cen
## Inference for Stan model: nbbp_homogenous.
## 4 chains, each with iter=600; warmup=300; thin=1;
## post-warmup draws per chain=300, total post-warmup draws=1200.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5%
## r_eff 1.08 0.01 0.08 0.99 1.03 1.06 1.11 1.32
## inv_sqrt_dispersion 1.37 0.07 1.04 0.06 0.53 1.14 2.02 3.80
## dispersion 6050.69 5973.77 71413.67 0.07 0.24 0.77 3.56 266.24
## exn_prob 0.95 0.00 0.04 0.84 0.93 0.96 0.98 1.00
## p_0 0.54 0.01 0.15 0.35 0.40 0.51 0.66 0.82
## lp__ -76.28 0.07 1.07 -79.28 -76.63 -75.94 -75.55 -75.28
## n_eff Rhat
## r_eff 162 1.01
## inv_sqrt_dispersion 212 1.01
## dispersion 143 1.03
## exn_prob 520 1.00
## p_0 219 1.02
## lp__ 216 1.01
##
## Samples were drawn using NUTS(diag_e) at Tue May 20 19:59:22 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
We can see that now there is some probability that $R_{\text{eff}} < 1$, corresponding to the possibility that this chain was simply large, but finite. We again see considerable uncertainty about $k$.
Obviously, the two analyses disagree on whether $R_{\text{eff}}$ can be below 1. When we take the chain of 5009 to be non-extinct, this probability is 0, while when we take that chain to be censored, it is about 7%. Otherwise, though, estimates of $R_{\text{eff}}$ are qualitatively similar.
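Both probabilities can be read directly off the posterior draws; a minimal sketch:

# Posterior probability that R_eff is below 1 under each treatment of the large chain
mean(unlist(rstan::extract(fit_inf, "r_eff")) < 1)
mean(unlist(rstan::extract(fit_cen, "r_eff")) < 1)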
r_df <- data.frame(
  r = c(
    unlist(rstan::extract(fit_inf, "r_eff")),
    unlist(rstan::extract(fit_cen, "r_eff"))
  ),
  large_chain = c(
    rep("non-extinct", 10000),
    rep("censored", 1200)
  )
)

r_df |>
  ggplot(aes(x = r, fill = large_chain, y = after_stat(density))) +
  geom_histogram(position = "identity", alpha = 0.7) +
  theme_minimal() +
  xlab("R") +
  geom_vline(xintercept = 1.0, lty = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The overall similarity of inference suggests that, in some sense, the model "prefers" to consider the chain non-extinct even if we did not say it was such. Recall that when $R_{\text{eff}} > 1$, a chain may or may not go extinct, while when $R_{\text{eff}} \le 1$ extinction is certain. So, for the part of the posterior where $R_{\text{eff}} \le 1$, the chain must be finite-but-large, while for the rest it could be either finite-but-large or non-extinct. We can compute the probability of the chain being finite-but-large and the probability of it being non-extinct given $R_{\text{eff}}$ and $k$, and ask which is larger. We can marginalize this over the posterior distribution on $R_{\text{eff}}$ and $k$, and use the percent of the posterior where the probability that the chain is non-extinct exceeds the probability that it is finite-but-large as a measure of this preference.
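Concretely, writing $X$ for the eventual size of the chain and $F(\cdot \mid R_{\text{eff}}, k)$ for its CDF over finite sizes, the code below compares

$$\Pr(X = \infty \mid R_{\text{eff}}, k) \quad \text{with} \quad \Pr(5009 \le X < \infty \mid R_{\text{eff}}, k) = \left[1 - F(5008 \mid R_{\text{eff}}, k)\right] - \Pr(X = \infty \mid R_{\text{eff}}, k).$$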
We will need to use the PMF (`dnbbp`) and CDF (`pnbbp`) functions from `nbbp`. (Here we use 500 samples for illustrative purposes, but in general one should use the entire posterior.)
probs <- rstan::extract(fit_cen, pars = c("r_eff", "dispersion")) |>
  as.data.frame() |>
  head(500) |>
  mutate(
    ccdf = purrr::map2_dbl(
      r_eff, dispersion, function(r_eff, dispersion) {
        1 - pnbbp(5008, r_eff, dispersion)
      }
    )
  ) |>
  mutate(
    pr_non_extinct = purrr::map2_dbl(
      r_eff, dispersion, function(r_eff, dispersion) {
        dnbbp(Inf, r_eff, dispersion)
      }
    )
  ) |>
  mutate(
    pr_large = purrr::map2_dbl(
      ccdf, pr_non_extinct, function(ccdf, pr_non_extinct) {
        ccdf - pr_non_extinct
      }
    )
  )
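The share of posterior draws in which non-extinction is the more probable explanation can then be computed directly (a minimal sketch using the `probs` data frame built above):

# Fraction of draws where non-extinction beats finite-but-large
mean(probs$pr_non_extinct > probs$pr_large)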
The probability that the chain is non-extinct is larger in 90% of the posterior, confirming our intuition.
Estimates of $k$ are even more similar between the two analysis choices than estimates of $R_{\text{eff}}$.
k_df <- data.frame(
  k = c(
    unlist(rstan::extract(fit_inf, "dispersion")),
    unlist(rstan::extract(fit_cen, "dispersion"))
  ),
  large_chain = c(
    rep("non-extinct", 10000),
    rep("censored", 1200)
  )
)

k_df |>
  ggplot(aes(
    x = log(k),
    fill = large_chain,
    y = after_stat(density)
  )) +
  geom_histogram(position = "identity", alpha = 0.7) +
  theme_minimal() +
  xlab("log(k)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Note that while the two analyses produced similar results here, that is not guaranteed in general. Differences are especially likely if the observed size of a non-extinct chain is not as large as it is here. Careful thought should go into choosing an analysis approach.