Bottom-up aggregation from base forecasts to US forecasts
Source:vignettes/bottom-up-aggregation.Rmd
bottom-up-aggregation.Rmd
Copula structure for forecasting US hospitalizations from independent state forecasts
Its common to make inference on multiple states independently (up to possibly shared priors). This approach is convenient because it reduces the high dimensional problem of making joint inferences on all states into an embarrasingly parallel problem of making a marginal inference on each state. However, we lose dependency structure between states that can affect the quality of the forecasting for the aggregation level; that is the whole US.
In this note, we’ll use Copula theory to try to repair this problem in a post-processing step for the base forecasts for each state. In this terminology a base forecast is a forecast for a state from a marginal inference on the state under the independence (conditional on choice of priors) assumption.
Lets denote the national numbers of people hospitalised in any future week w = 1, 2,...,T is the sum of people hospitalised in each state/territory s = 1,\dots, S:
H_{US, w} = \sum_{s = 1}^S H_{s,w}.
Skalar’s theorem gives that we can represent the multi-variate weekly forecast distribution for the states across the weeks ahead w = 1, 2,..., T:
H(h_{1,1},h_{1,2},...,h_{S,T}) = P(H_{1, 1} \leq h_{1,1}, H_{1, 2} \leq h_{1,2},...,H_{S, T} \leq h_{s,T})
= C(F_{1,1}(h_{1,1}), F_{1,2}(h_{1,2}), \dots , F_{S,T}(h_{S,T})).
With the marginal cdfs F_{s,w}(h_{s,w}) = P(H_{s, w} \leq h_{s,w}) and a copula (i.e. a cumulative distribution on a multivariate [0,1]^n random vector where n is the number of states) C. We aim to find a reasonable approximation to C so that sampling from the trajectories of independent state forecasts can be done in a way which optimizes the forecast skill of H_{US, w} over the weeks forecast.
The propostions/assumptions in this note: 1. That the independent state forecasts form a set of reasonable approximations to the marginal forecast distribution of each state. 2. Rather than sampling at the weekly marginals for every state, I assume that we can sample according the overall number of another generated quantity over all weeks forecast and return a weekly trajectory. This effectively assumes that the correlation structure between states forecast underlying the national forecast is only between overall numbers of this generated quantity rather than more complicated possibilities (e.g time series oscillations). 2. That a reasonable approximation for the copula C is a Gaussian copula with the covariance matrix,
\Sigma_{i,j} = \delta_{ij} + (1 - \delta_{ij}) \rho.
Aggregation for an example forecast
First, I get the forecasttools
example.
library(forecasttools)
library(copula)
library(ggplot2)
#> Error in get(paste0(generic, ".", class), envir = get_method_env()) :
#> object 'type_sum.accel' not found
example_base_forecasts <- forecasttools::example_daily_forecast_flu
head(example_base_forecasts)
#> # A tibble: 6 × 4
#> .draw date hosp location
#> <dbl> <date> <dbl> <chr>
#> 1 18 2023-10-21 0 NM
#> 2 18 2023-10-22 0 NM
#> 3 18 2023-10-23 0 NM
#> 4 18 2023-10-24 2 NM
#> 5 18 2023-10-25 0 NM
#> 6 18 2023-10-26 0 NM
Then, define the Gaussian copula as described above, making sure that the dimension of the copula matches that of the number of base forecast locations. I choose \rho = 0.5.
ndims <- length(unique(example_base_forecasts$location))
rho <- 0.5
cp <- copula::normalCopula(rho, dim = ndims)
The copula sampling steps are done by the
bottom_up_aggregation
function. The generated quantity that
we use for ranked sampling doesn’t have to be the same as the
quantity being aggregated but in this case we’ll specify
hospitalisations as both the quantity to be aggregates and the ranking
quantity. The name of the location that is being aggregated for
is the "US"
.
ndraws <- 1000 # Number of bottom-up samples
agg_forecast_tbl <-
forecasttools::bottom_up_aggregation(
example_base_forecasts,
cp,
ndraws,
draw_col = ".draw",
date_col = "date",
value_to_aggregate_col = "hosp",
rank_quantity_col = "hosp",
location_col = "location",
aggregated_location_name = "US"
)
head(agg_forecast_tbl)
#> # A tibble: 6 × 4
#> date forecast .draw location
#> <date> <dbl> <int> <chr>
#> 1 2023-10-21 211 1 US
#> 2 2023-10-22 181 1 US
#> 3 2023-10-23 198 1 US
#> 4 2023-10-24 156 1 US
#> 5 2023-10-25 213 1 US
#> 6 2023-10-26 183 1 US
Comparison of simple aggregation and copula ranked sampling
A simple aggregation technique could be to sum over forecast states by forecast draw and date, then treat the spread of forecast draws for each date as the predictive distribution for the whole US. The copula sampling on ranked generated quantities also treats the spread of forecast draws as the predictive distribution for the whole US.
We compare the mean and median forecasts below.
summed_base_forecasts <- example_base_forecasts |>
dplyr::group_by(.draw, date) |>
dplyr::summarise(
summmed_forecast = sum(hosp)
) |>
dplyr::ungroup() |>
dplyr::group_by(date) |>
dplyr::summarise(
agg_median_pred = median(summmed_forecast),
agg_mean_pred = mean(summmed_forecast)
) |>
dplyr::mutate(agg_location = "US") |>
dplyr::mutate(group = "simple")
#> `summarise()` has grouped output by '.draw'. You can override using the
#> `.groups` argument.
rank_sampled_agg_forecasts <- agg_forecast_tbl |>
dplyr::group_by(date) |>
dplyr::summarise(
agg_median_pred = median(forecast),
agg_mean_pred = mean(forecast)
) |>
dplyr::mutate(group = "copula")
combined_forecasting <- dplyr::bind_rows(
summed_base_forecasts,
rank_sampled_agg_forecasts
)
comparison_plot <- ggplot(data = combined_forecasting) +
geom_line(aes(x = date, y = agg_mean_pred, color = group)) +
geom_point(aes(x = date, y = agg_median_pred, color = group)) +
ylab("Daily hosp") +
ggtitle("Comparison between simple and copula sample ranked agg.
for US forecasts")
comparison_plot
Note that the mean forecasts will be similar, although differences between the means will tend to have the same sign at each time point because copula samples by rank.
This is because the marginal copula sample for each state i is U_i \sim \text{Uniform}[0,1]. Therefore, the marginal expected forecast at week w and state i is unchanged compared to simple sampling, and the linearity of the expectation operator means that the average forecast is the same using either method:
E[H_{US, w}] = \sum_{s = 1}^S E[H_{s,w}].
However, the median is not a linear operator so the correlation structure between states induced by using the copula rank sampling affects the statistic. The difference between mean and median is higher for copula rank sampling compared to simple sampling. The reason is that the US aggregated forecast distribution will be more highly dispersed for copula rank sampling due to the positive correlation structure induced by sampling from positively correlated ranks. This can be understood mathematically,
\text{var}[H_{US, w}] = \sum_{s = 1}^S \text{var}[H_{s,w}] + 2 \sum_{s< s'} \text{cov}[H_{s,w}H_{s',w}] \geq \sum_{s = 1}^S \text{var}[H_{s,w}].