Formatting output for Forecast Hub submission
Source:vignettes/format-hubverse.Rmd
format-hubverse.Rmd
The FluSight Forecast
Hub and the COVID-19 Forecast
Hub both use the hubverse
framework to
accept forecast submissions from contributing teams. Many other
infectious disease forecasting challenges use or are adopting the
hubverse framework as well.
FluSight and COVIDHub current accept submissions via GitHub
pull requests. The submitted model-output
csv
files need to follow a specified formatting schema; the
FluSight
and COVIDHub
Github repos provide detailed descriptions.
forecasttools
provides functions to help you produce
valid hubverse submissions from your forecast output. This vignette will
show you how to use forecasttools
to format forecast output
for submission to a Hub.
An example forecast
Let us suppose that we have some forecast output as tidy data. We’ll
use the forecasttools::example_daily_forecast_flu
dataset
here. It provides a set of 100 random draws from a 4000-draw posterior
distribution of hospitalization trajectories forecast as of
2023-10-21
.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
dat <- forecasttools::example_daily_forecast_flu
dat
#> # A tibble: 159,000 × 4
#> .draw date hosp location
#> <dbl> <date> <dbl> <chr>
#> 1 18 2023-10-21 0 NM
#> 2 18 2023-10-22 0 NM
#> 3 18 2023-10-23 0 NM
#> 4 18 2023-10-24 2 NM
#> 5 18 2023-10-25 0 NM
#> 6 18 2023-10-26 0 NM
#> 7 18 2023-10-27 0 NM
#> 8 18 2023-10-28 0 NM
#> 9 18 2023-10-29 0 NM
#> 10 18 2023-10-30 0 NM
#> # ℹ 158,990 more rows
53 states and territories are represented, coded by a two-letter location code:
dat |>
distinct(location) |>
pull()
#> [1] "NM" "SD" "MS" "ID" "NH" "VI" "ND" "MN" "KS" "DC" "UT" "NV" "DE" "MA" "VT"
#> [16] "RI" "CT" "WV" "AK" "WY" "TX" "PR" "NC" "IN" "AR" "TN" "HI" "NJ" "OR" "MI"
#> [31] "MD" "NE" "OH" "GA" "CA" "WA" "VA" "CO" "AL" "PA" "AZ" "MT" "WI" "FL" "NY"
#> [46] "ME" "LA" "IL" "SC" "OK" "KY" "MO" "IA"
This dataset is pre-formatted as tidy
data: each row represents exactly one observation (here, a given
.draw
from the posterior distribution of incident
hospitalizations hosp
for a given date
and
location
.
If you are not yet familiar with the concept of tidy data, we highly
recommend reading this
introductory vignette. For an introduction to applying tidy data
principles to managing Bayesian posterior distributions / MCMC output,
we recommend this
vignette from the excellent tidybayes
R package.
Hubverse submission format
Forecast hubs that use hubverse
expect submissions in a
different format. Here’s an example of an inflenza forecast from
2024-04-06 submitted to the 2023-24 FluSight Challenge by the
cfarenewal-cfaepimlight
team:
library(readr)
submit_path <- "https://raw.githubusercontent.com/cdcepi/FluSight-forecast-hub/refs/heads/main/model-output/cfarenewal-cfaepimlight/2024-04-06-cfarenewal-cfaepimlight.csv"
submission <- read_csv(submit_path)
submission |>
print(width = 100)
#> # A tibble: 5,865 × 8
#> reference_date target horizon target_end_date location output_type
#> <date> <chr> <dbl> <date> <chr> <chr>
#> 1 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 2 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 3 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 4 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 5 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 6 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 7 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 8 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 9 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 10 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> output_type_id value
#> <dbl> <dbl>
#> 1 0.01 34
#> 2 0.025 38
#> 3 0.05 41.0
#> 4 0.1 44
#> 5 0.15 47
#> 6 0.2 49
#> 7 0.25 50
#> 8 0.3 52
#> 9 0.35 53
#> 10 0.4 54
#> # ℹ 5,855 more rows
Overview of hubverse model output schema
Let’s talk through each of these columns. For additional details and guidance, consult FluSight’s model output schema and the hubverse documentation page on model output formats.
reference_date
The official “as of” date for the forecast, in ISO8601
YYYY-MM-DD
format. The reference date is not
necessarily the due date for forecast submission. For FluSight and
COVIDHub, it is instead the Saturday that ends the USA “epidemiological
week” or “epiweek” that the forecast is due. You’ll learn a bit more
about the why and how of “epiweeks” later in this vignette.
target
The quantity being forecast in the given row. Different Hubs accept forecasts of different quantities of interest. For example, in 2023-2024, FluSight had two targets:
wk inc flu hosp
: epiweekly incident hospitalizations (i.e. the total from Sunday through Saturday for each epiweek).wk flu hosp rate change
: a discrete prediction of whether the hospitalization rate is projected to be stable, increase a little, increase a lot, decrease a little, or decrease a lot.
The example forecast above contains forecasts for
wk inc flu hosp
but not for
wk flu hosp rate change
.
horizon
The forecast “horizon” relative to the reference_date
,
here in units of weeks ahead/behind. Flusight allows the following
horizons: -1, 0, 1, 2, 3
. A horizon of -1
represents a “nowcast” of what the ultimately reported totals for the
epiweek before the reference week will be after all backfill is
accounted for.
target_end_date
The Saturday ending the epiweek corresponding to the given
horizon
. So for example a horizon of -1
from
2024-04-06
is 2024-03-30
and a horizon of
3
from 2024-04-06
is
2024-04-27
.
location
The geographic region for which the forecast is being made. For
FluSight, this is either a US state or territory or the United States as
a whole. Locations are coded not as two-letter abbreviations but rather
as two-digit legacy state FIPS codes, except for the national forecast,
which is coded as US
:
submission |>
distinct(location) |>
pull()
#> [1] "01" "02" "04" "05" "06" "08" "09" "10" "11" "12" "13" "16" "17" "18" "19"
#> [16] "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34"
#> [31] "35" "36" "37" "38" "39" "40" "41" "42" "44" "45" "46" "47" "48" "50" "51"
#> [46] "53" "54" "55" "56" "72" "US"
To help you code things properly, forecasttools
provides
a lookup table called forecasttools::us_location_table
:
forecasttools::us_location_table
#> # A tibble: 58 × 3
#> location_code short_name long_name
#> <chr> <chr> <chr>
#> 1 US US United States
#> 2 01 AL Alabama
#> 3 02 AK Alaska
#> 4 04 AZ Arizona
#> 5 05 AR Arkansas
#> 6 06 CA California
#> 7 08 CO Colorado
#> 8 09 CT Connecticut
#> 9 10 DE Delaware
#> 10 11 DC District of Columbia
#> # ℹ 48 more rows
output_type
The type of forecast value the row contains.
For quantile forecasts, this will be 'quantile'
: each
row represents the value of of a particular quantile level for a
particular reference_date
, target
,
location
, and horizon
.
For trend classification forecasts, it will typically be
'pmf'
, for “probability mass function”, to indicate that
the row represents the probability of a given trend (e.g. the
probability of a 'large increase'
.)
output_type_id
This column holds any needed information about what subclass of the
output type
the row contains. For quantile
, it
contains the level of the quantile
(e.g. 0.01
for the 1st percentile, 0.5
for the
median / 50th percentile, etc). For pmf
, contains the
name of the category/event whose probability is being
given (e.g. 'stable'
or
'large decrease'
).
value
This column contains the actual forecast value. For a
quantile
forecast of a hospitalizations
target
, this is the number of forecast hospitalizations for
the given location
and horizon
at the given
quantile level (specified in output_type_id
)
An example
Let’s look at an abtirary row from the example submission above:
submission[123, ] |>
print(width = 100)
#> # A tibble: 1 × 8
#> reference_date target horizon target_end_date location output_type
#> <date> <chr> <dbl> <date> <chr> <chr>
#> 1 2024-04-06 wk inc flu hosp -1 2024-03-30 02 quantile
#> output_type_id value
#> <dbl> <dbl>
#> 1 0.3 8
This row says that for location
02
(Alaska,
as you can confirm by consulting
forecasttools::us_location_table
), the 0.3
quantile (30th percentile) of the forecast predictive distribution for
incident hospitalizations during the epiweek ending Saturday
2024-03-30
(a -1
week forecast
horizon
) is 8
.
Getting your forecast into this format
So you now have your nice tidy daily forecast, and you want to aggregate it into weekly quantiles and format it for hubverse submission. Let’s look at our tidy forecast data again:
dat
#> # A tibble: 159,000 × 4
#> .draw date hosp location
#> <dbl> <date> <dbl> <chr>
#> 1 18 2023-10-21 0 NM
#> 2 18 2023-10-22 0 NM
#> 3 18 2023-10-23 0 NM
#> 4 18 2023-10-24 2 NM
#> 5 18 2023-10-25 0 NM
#> 6 18 2023-10-26 0 NM
#> 7 18 2023-10-27 0 NM
#> 8 18 2023-10-28 0 NM
#> 9 18 2023-10-29 0 NM
#> 10 18 2023-10-30 0 NM
#> # ℹ 158,990 more rows
Aggregate to epiweekly
The first thing we need to do is aggregate it from a daily forecast to an (epi)weekly forecast. Because our data is tidy, we could use a split/apply/combine workflow.
epiweekly <- dat |>
mutate(
epiweek = lubridate::epiweek(date),
epiyear = lubridate::epiyear(date)
) |>
group_by(
epiweek,
epiyear,
.draw,
location
) |>
filter(n() == 7) |> ## only use epiweeks for which we have all 7 days
summarise(weekly_hosp = sum(hosp))
#> `summarise()` has grouped output by 'epiweek', 'epiyear', '.draw'. You can
#> override using the `.groups` argument.
epiweekly
#> # A tibble: 21,200 × 5
#> # Groups: epiweek, epiyear, .draw [400]
#> epiweek epiyear .draw location weekly_hosp
#> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 43 2023 18 AK 17
#> 2 43 2023 18 AL 46
#> 3 43 2023 18 AR 44
#> 4 43 2023 18 AZ 20
#> 5 43 2023 18 CA 152
#> 6 43 2023 18 CO 17
#> 7 43 2023 18 CT 1
#> 8 43 2023 18 DC 1
#> 9 43 2023 18 DE 1
#> 10 43 2023 18 FL 74
#> # ℹ 21,190 more rows
To save you typing and avoid mistakes (like forgetting that two
different years can have the same epiweek) however,
forecasttools
provides a daily_to_epiweekly()
function to automate and autocheck this process:
library(forecasttools)
epiweekly_forecasts <- dat |>
daily_to_epiweekly(
value_col = "hosp", ## column of values to aggregate to weekly
id_cols = c(".draw", "location"), ## column(s) that identify a trajectory
weekly_value_name = "weekly_hosp" ## what to name the output column
)
epiweekly_forecasts
#> # A tibble: 21,200 × 5
#> epiweek epiyear .draw location weekly_hosp
#> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 43 2023 18 AK 17
#> 2 43 2023 18 AL 46
#> 3 43 2023 18 AR 44
#> 4 43 2023 18 AZ 20
#> 5 43 2023 18 CA 152
#> 6 43 2023 18 CO 17
#> 7 43 2023 18 CT 1
#> 8 43 2023 18 DC 1
#> 9 43 2023 18 DE 1
#> 10 43 2023 18 FL 74
#> # ℹ 21,190 more rows
Notice we didn’t have to specify that dates were found in the
date
column. This is because
daily_to_epiweekly
has default values for all of the
required column names. They are:
value_col = "hosp"
date_col = "date"
-
id_cols = ".draw"
, weekly_value_name = "weekly_value"
daily_to_epiweekly
takes a flag called
strict
to determine whether to only use weeks for which we
have all seven days. It defaults to TRUE
if not set, which
means that a similar step to the filter(n() == 7)
in our
manual code above gets applied. It also checks to make sure that no week
contains more than 7 days. So failing to group by
location
as well as .draw
throws an error:
epiweekly_forecasts <- dat |>
daily_to_epiweekly(
value_col = "hosp", ## column of values to aggregate to weekly
id_cols = ".draw", ## column(s) that identify a trajectory
weekly_value_name = "weekly_hosp" ## what to name the output column
)
#> Error in `daily_to_epiweekly()`:
#> ! At least one trajectory had more
#> than 7 values for a given epiweek
#> of a given year. Check your date
#> column for repeated values and
#> check that the trajectory
#> id columns are the ones you intended to
#> use. This run used columns named
#> '.draw' to identify unique trajectories
Compute the needed posterior quantiles
Now we compute our forecast quantiles from our table of
epiweekly_forecasts
. Again, we could to this manually:
flusight_quantile_levels <- c(0.01, 0.025, seq(0.05, 0.95, 0.05), 0.975, 0.99)
quant_summary <- epiweekly_forecasts |>
dplyr::group_by(epiweek, epiyear, location) |>
dplyr::reframe(
quant_values = quantile(
weekly_hosp,
probs = flusight_quantile_levels
),
quant_levels = flusight_quantile_levels
)
quant_summary
#> # A tibble: 4,876 × 5
#> epiweek epiyear location quant_values quant_levels
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 43 2023 AK 3.97 0.01
#> 2 43 2023 AK 4.47 0.025
#> 3 43 2023 AK 7 0.05
#> 4 43 2023 AK 8.9 0.1
#> 5 43 2023 AK 9 0.15
#> 6 43 2023 AK 10 0.2
#> 7 43 2023 AK 11 0.25
#> 8 43 2023 AK 12 0.3
#> 9 43 2023 AK 12 0.35
#> 10 43 2023 AK 13.6 0.4
#> # ℹ 4,866 more rows
forecasttools
provides a
trajectories_to_quantiles()
function to automate this
process.
quant_summary <- epiweekly_forecasts |>
trajectories_to_quantiles(
timepoint_cols = c("epiweek", "epiyear"),
id_cols = "location",
value_col = "weekly_hosp"
)
quant_summary
#> # A tibble: 4,876 × 5
#> epiweek epiyear location quantile_value quantile_level
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 43 2023 AK 3.97 0.01
#> 2 43 2023 AK 4.47 0.025
#> 3 43 2023 AK 7 0.05
#> 4 43 2023 AK 8.9 0.1
#> 5 43 2023 AK 9 0.15
#> 6 43 2023 AK 10 0.2
#> 7 43 2023 AK 11 0.25
#> 8 43 2023 AK 12 0.3
#> 9 43 2023 AK 12 0.35
#> 10 43 2023 AK 13.6 0.4
#> # ℹ 4,866 more rows
Warning: order matters for proagating uncertainty; first compute, then summarize!
People are sometimes tempted to compute weekly quantiles from daily forecasts by first computing daily quantiles and then summing those daily quantiles to get weekly ones. This is typically not what you want, and it is not equivalent to first aggregating to the week and then computing quantiles.
Why not? It does not capture all the ways in which Monday’s predicted value and Tuesday’s predicted value might co-vary. In general, we wish to propagate uncertainty in a principled manner.
For handling samples from a posterior distribution, this typically means computing any transformations / functions of our sampled parameters–such as a weekly hospitalization total computed from daily values—and only then summarizing. Specifically, we compute the transformations “draw-wise” (that is, one computation for each MCMC draw), so that the Monday value from draw number 1 gets added to the Tuesday value for draw number 1, the Monday value for draw number 2 gets added to the Tuesday value for draw number 2, etc.
Format output for Hub submission
First, we need to convert state abbreviations to the the
two-character location codes that most US-based hubs expect. We could
use a SQL-style workflow with
forecasttools::us_location_table
:
manual_rename <- quant_summary |>
rename(state = location) |>
inner_join(
forecasttools::us_location_table |> select(
state = short_name,
location = location_code
),
by = "state"
)
manual_rename
#> # A tibble: 4,876 × 6
#> epiweek epiyear state quantile_value quantile_level location
#> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
#> 1 43 2023 AK 3.97 0.01 02
#> 2 43 2023 AK 4.47 0.025 02
#> 3 43 2023 AK 7 0.05 02
#> 4 43 2023 AK 8.9 0.1 02
#> 5 43 2023 AK 9 0.15 02
#> 6 43 2023 AK 10 0.2 02
#> 7 43 2023 AK 11 0.25 02
#> 8 43 2023 AK 12 0.3 02
#> 9 43 2023 AK 12 0.35 02
#> 10 43 2023 AK 13.6 0.4 02
#> # ℹ 4,866 more rows
To save you typing, forecasttools
provides a
us_loc_abbr_to_code()
function (as well as its inverse,
us_loc_code_to_abbr()
):
quant_summary <- quant_summary |>
mutate(location = us_loc_abbr_to_code(location))
quant_summary
#> # A tibble: 4,876 × 5
#> epiweek epiyear location quantile_value quantile_level
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 43 2023 02 3.97 0.01
#> 2 43 2023 02 4.47 0.025
#> 3 43 2023 02 7 0.05
#> 4 43 2023 02 8.9 0.1
#> 5 43 2023 02 9 0.15
#> 6 43 2023 02 10 0.2
#> 7 43 2023 02 11 0.25
#> 8 43 2023 02 12 0.3
#> 9 43 2023 02 12 0.35
#> 10 43 2023 02 13.6 0.4
#> # ℹ 4,866 more rows
Now, we need to add the other required hubverse columns, select only
the time horizons of interest, and so on. forecasttools
provides a get_hubverse_table()
function to do this for a
given forecast reference_date
and target
name.
Here, we’ll use 2023-10-21
and
wk inc flu hosp
. get_hubverse_table()
is
configurable, but for convenience it is designed to accept
trajectories_to_quantiles()
output like our
quant_summary
tibble by default:
flusight_table <- quant_summary |>
get_hubverse_table(
"2023-10-21",
"wk inc flu hosp"
)
flusight_table |> print(width = 100)
#> # A tibble: 3,588 × 8
#> reference_date target horizon target_end_date location output_type
#> <chr> <chr> <int> <date> <chr> <chr>
#> 1 2023-10-21 wk inc flu hosp 1 2023-10-28 01 quantile
#> 2 2023-10-21 wk inc flu hosp 1 2023-10-28 01 quantile
#> 3 2023-10-21 wk inc flu hosp 1 2023-10-28 01 quantile
#> 4 2023-10-21 wk inc flu hosp 1 2023-10-28 01 quantile
#> 5 2023-10-21 wk inc flu hosp 1 2023-10-28 01 quantile
#> 6 2023-10-21 wk inc flu hosp 1 2023-10-28 01 quantile
#> 7 2023-10-21 wk inc flu hosp 1 2023-10-28 01 quantile
#> 8 2023-10-21 wk inc flu hosp 1 2023-10-28 01 quantile
#> 9 2023-10-21 wk inc flu hosp 1 2023-10-28 01 quantile
#> 10 2023-10-21 wk inc flu hosp 1 2023-10-28 01 quantile
#> output_type_id value
#> <dbl> <dbl>
#> 1 0.01 4.96
#> 2 0.025 6.9
#> 3 0.05 9.95
#> 4 0.1 13
#> 5 0.15 18
#> 6 0.2 20
#> 7 0.25 22.8
#> 8 0.3 25
#> 9 0.35 26.7
#> 10 0.4 29
#> # ℹ 3,578 more rows
Write the formatted output to .csv
FluSight and COVIDHub currently take submissions as comma-separated
values (.csv
) files. We can now write our table to
.csv
with base write.csv()
or
readr::write_csv()
: