Formatting output for Forecast Hub submission
Source:vignettes/format-hubverse.Rmd
format-hubverse.RmdThe FluSight Forecast
Hub and the COVID-19 Forecast
Hub both use the hubverse framework to
accept forecast submissions from contributing teams. Many other
infectious disease forecasting challenges use or are adopting the
hubverse framework as well.
FluSight and COVIDHub current accept submissions via GitHub
pull requests. The submitted model-output
csv files need to follow a specified formatting schema; the
FluSight
and COVIDHub
Github repos provide detailed descriptions.
forecasttools provides functions to help you produce
valid hubverse submissions from your forecast output. This vignette will
show you how to use forecasttools to format forecast output
for submission to a Hub.
An example forecast
Let us suppose that we have some forecast output as tidy data. We’ll
use the forecasttools::example_daily_forecast_flu dataset
here. It provides a set of 100 random draws from a 4000-draw posterior
distribution of hospitalization trajectories forecast as of
2023-10-21.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
dat <- forecasttools::example_daily_forecast_flu
dat
#> # A tibble: 159,000 × 4
#> .draw date hosp location
#> <dbl> <date> <dbl> <chr>
#> 1 18 2023-10-21 0 NM
#> 2 18 2023-10-22 0 NM
#> 3 18 2023-10-23 0 NM
#> 4 18 2023-10-24 2 NM
#> 5 18 2023-10-25 0 NM
#> 6 18 2023-10-26 0 NM
#> 7 18 2023-10-27 0 NM
#> 8 18 2023-10-28 0 NM
#> 9 18 2023-10-29 0 NM
#> 10 18 2023-10-30 0 NM
#> # ℹ 158,990 more rows53 states and territories are represented, coded by a two-letter location code:
dat |>
distinct(location) |>
pull()
#> [1] "NM" "SD" "MS" "ID" "NH" "VI" "ND" "MN" "KS" "DC" "UT" "NV" "DE" "MA" "VT"
#> [16] "RI" "CT" "WV" "AK" "WY" "TX" "PR" "NC" "IN" "AR" "TN" "HI" "NJ" "OR" "MI"
#> [31] "MD" "NE" "OH" "GA" "CA" "WA" "VA" "CO" "AL" "PA" "AZ" "MT" "WI" "FL" "NY"
#> [46] "ME" "LA" "IL" "SC" "OK" "KY" "MO" "IA"This dataset is pre-formatted as tidy
data: each row represents exactly one observation (here, a given
.draw from the posterior distribution of incident
hospitalizations hosp for a given date and
location.
If you are not yet familiar with the concept of tidy data, we highly
recommend reading this
introductory vignette. For an introduction to applying tidy data
principles to managing Bayesian posterior distributions / MCMC output,
we recommend this
vignette from the excellent tidybayes
R package.
Hubverse submission format
Forecast hubs that use hubverse expect submissions in a
different format. Here’s an example of an inflenza forecast from
2024-04-06 submitted to the 2023-24 FluSight Challenge by the
cfarenewal-cfaepimlight team:
library(readr)
submit_path <- "https://raw.githubusercontent.com/cdcepi/FluSight-forecast-hub/refs/heads/main/model-output/cfarenewal-cfaepimlight/2024-04-06-cfarenewal-cfaepimlight.csv"
submission <- read_csv(submit_path)
submission |>
print(width = 100)
#> # A tibble: 5,865 × 8
#> reference_date target horizon target_end_date location output_type
#> <date> <chr> <dbl> <date> <chr> <chr>
#> 1 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 2 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 3 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 4 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 5 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 6 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 7 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 8 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 9 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> 10 2024-04-06 wk inc flu hosp -1 2024-03-30 01 quantile
#> output_type_id value
#> <dbl> <dbl>
#> 1 0.01 34
#> 2 0.025 38
#> 3 0.05 41.0
#> 4 0.1 44
#> 5 0.15 47
#> 6 0.2 49
#> 7 0.25 50
#> 8 0.3 52
#> 9 0.35 53
#> 10 0.4 54
#> # ℹ 5,855 more rowsOverview of hubverse model output schema
Let’s talk through each of these columns. For additional details and guidance, consult FluSight’s model output schema and the hubverse documentation page on model output formats.
reference_date
The official “as of” date for the forecast, in ISO8601
YYYY-MM-DD format. The reference date is not
necessarily the due date for forecast submission. For FluSight and
COVIDHub, it is instead the Saturday that ends the USA “epidemiological
week” or “epiweek” that the forecast is due. You’ll learn a bit more
about the why and how of “epiweeks” later in this vignette.
target
The quantity being forecast in the given row. Different Hubs accept forecasts of different quantities of interest. For example, in 2023-2024, FluSight had two targets:
wk inc flu hosp: epiweekly incident hospitalizations (i.e. the total from Sunday through Saturday for each epiweek).wk flu hosp rate change: a discrete prediction of whether the hospitalization rate is projected to be stable, increase a little, increase a lot, decrease a little, or decrease a lot.
The example forecast above contains forecasts for
wk inc flu hosp but not for
wk flu hosp rate change.
horizon
The forecast “horizon” relative to the reference_date,
here in units of weeks ahead/behind. Flusight allows the following
horizons: -1, 0, 1, 2, 3. A horizon of -1
represents a “nowcast” of what the ultimately reported totals for the
epiweek before the reference week will be after all backfill is
accounted for.
target_end_date
The Saturday ending the epiweek corresponding to the given
horizon. So for example a horizon of -1 from
2024-04-06 is 2024-03-30 and a horizon of
3 from 2024-04-06 is
2024-04-27.
location
The geographic region for which the forecast is being made. For
FluSight, this is either a US state or territory or the United States as
a whole. Locations are coded not as two-letter abbreviations but rather
as two-digit legacy state FIPS codes, except for the national forecast,
which is coded as US:
submission |>
distinct(location) |>
pull()
#> [1] "01" "02" "04" "05" "06" "08" "09" "10" "11" "12" "13" "16" "17" "18" "19"
#> [16] "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34"
#> [31] "35" "36" "37" "38" "39" "40" "41" "42" "44" "45" "46" "47" "48" "50" "51"
#> [46] "53" "54" "55" "56" "72" "US"To help you code things properly, forecasttools provides
a lookup table called forecasttools::us_location_table:
forecasttools::us_location_table
#> # A tibble: 58 × 3
#> code abbr name
#> <chr> <chr> <chr>
#> 1 US US United States
#> 2 01 AL Alabama
#> 3 02 AK Alaska
#> 4 04 AZ Arizona
#> 5 05 AR Arkansas
#> 6 06 CA California
#> 7 08 CO Colorado
#> 8 09 CT Connecticut
#> 9 10 DE Delaware
#> 10 11 DC District of Columbia
#> # ℹ 48 more rows
output_type
The type of forecast value the row contains.
For quantile forecasts, this will be 'quantile': each
row represents the value of of a particular quantile level for a
particular reference_date, target,
location, and horizon.
For trend classification forecasts, it will typically be
'pmf', for “probability mass function”, to indicate that
the row represents the probability of a given trend (e.g. the
probability of a 'large increase'.)
output_type_id
This column holds any needed information about what subclass of the
output type the row contains. For quantile, it
contains the level of the quantile
(e.g. 0.01 for the 1st percentile, 0.5 for the
median / 50th percentile, etc). For pmf, contains the
name of the category/event whose probability is being
given (e.g. 'stable' or
'large decrease').
value
This column contains the actual forecast value. For a
quantile forecast of a hospitalizations
target, this is the number of forecast hospitalizations for
the given location and horizon at the given
quantile level (specified in output_type_id)
An example
Let’s look at an abtirary row from the example submission above:
submission[123, ] |>
print(width = 100)
#> # A tibble: 1 × 8
#> reference_date target horizon target_end_date location output_type
#> <date> <chr> <dbl> <date> <chr> <chr>
#> 1 2024-04-06 wk inc flu hosp -1 2024-03-30 02 quantile
#> output_type_id value
#> <dbl> <dbl>
#> 1 0.3 8This row says that for location 02 (Alaska,
as you can confirm by consulting
forecasttools::us_location_table), the 0.3
quantile (30th percentile) of the forecast predictive distribution for
incident hospitalizations during the epiweek ending Saturday
2024-03-30 (a -1 week forecast
horizon) is 8.
Getting your forecast into this format
So you now have your nice tidy daily forecast, and you want to aggregate it into weekly quantiles and format it for hubverse submission. Let’s look at our tidy forecast data again:
dat
#> # A tibble: 159,000 × 4
#> .draw date hosp location
#> <dbl> <date> <dbl> <chr>
#> 1 18 2023-10-21 0 NM
#> 2 18 2023-10-22 0 NM
#> 3 18 2023-10-23 0 NM
#> 4 18 2023-10-24 2 NM
#> 5 18 2023-10-25 0 NM
#> 6 18 2023-10-26 0 NM
#> 7 18 2023-10-27 0 NM
#> 8 18 2023-10-28 0 NM
#> 9 18 2023-10-29 0 NM
#> 10 18 2023-10-30 0 NM
#> # ℹ 158,990 more rowsAggregate to epiweekly
The first thing we need to do is aggregate it from a daily forecast to an (epi)weekly forecast. Because our data is tidy, we could use a split/apply/combine workflow.
epiweekly <- dat |>
mutate(
epiweek = lubridate::epiweek(date),
epiyear = lubridate::epiyear(date)
) |>
group_by(
epiweek,
epiyear,
.draw,
location
) |>
filter(n() == 7) |> ## only use epiweeks for which we have all 7 days
summarise(weekly_hosp = sum(hosp))
#> `summarise()` has grouped output by 'epiweek', 'epiyear', '.draw'. You can
#> override using the `.groups` argument.
epiweekly
#> # A tibble: 21,200 × 5
#> # Groups: epiweek, epiyear, .draw [400]
#> epiweek epiyear .draw location weekly_hosp
#> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 43 2023 18 AK 17
#> 2 43 2023 18 AL 46
#> 3 43 2023 18 AR 44
#> 4 43 2023 18 AZ 20
#> 5 43 2023 18 CA 152
#> 6 43 2023 18 CO 17
#> 7 43 2023 18 CT 1
#> 8 43 2023 18 DC 1
#> 9 43 2023 18 DE 1
#> 10 43 2023 18 FL 74
#> # ℹ 21,190 more rowsTo save you typing and avoid mistakes (like forgetting that two
different years can have the same epiweek) however,
forecasttools provides a daily_to_epiweekly()
function to automate and autocheck this process:
library(forecasttools)
epiweekly_forecasts <- dat |>
daily_to_epiweekly(
value_col = "hosp", ## column of values to aggregate to weekly
id_cols = c(".draw", "location"), ## column(s) that identify a trajectory
weekly_value_name = "weekly_hosp" ## what to name the output column
)
epiweekly_forecasts
#> # A tibble: 21,200 × 5
#> epiweek epiyear .draw location weekly_hosp
#> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 43 2023 18 AK 17
#> 2 43 2023 18 AL 46
#> 3 43 2023 18 AR 44
#> 4 43 2023 18 AZ 20
#> 5 43 2023 18 CA 152
#> 6 43 2023 18 CO 17
#> 7 43 2023 18 CT 1
#> 8 43 2023 18 DC 1
#> 9 43 2023 18 DE 1
#> 10 43 2023 18 FL 74
#> # ℹ 21,190 more rowsNotice we didn’t have to specify that dates were found in the
date column. This is because
daily_to_epiweekly has default values for all of the
required column names. They are:
value_col = "hosp"date_col = "date"-
id_cols = ".draw", weekly_value_name = "weekly_value"
daily_to_epiweekly takes a flag called
strict to determine whether to only use weeks for which we
have all seven days. It defaults to TRUE if not set, which
means that a similar step to the filter(n() == 7) in our
manual code above gets applied. It also checks to make sure that no week
contains more than 7 days. So failing to group by
location as well as .draw throws an error:
epiweekly_forecasts <- dat |>
daily_to_epiweekly(
value_col = "hosp", ## column of values to aggregate to weekly
id_cols = ".draw", ## column(s) that identify a trajectory
weekly_value_name = "weekly_hosp" ## what to name the output column
)
#> Error in `daily_to_epiweekly()`:
#> ! At least one trajectory had more
#> than 7 values for a given epiweek
#> of a given year. Check your date
#> column for repeated values and
#> check that the trajectory
#> id columns are the ones you intended to
#> use. This run used columns named
#> '.draw' to identify unique trajectoriesCompute the needed posterior quantiles
Now we compute our forecast quantiles from our table of
epiweekly_forecasts. Again, we could to this manually:
flusight_quantile_levels <- c(0.01, 0.025, 1:19 / 20, 0.975, 0.99)
quant_summary <- epiweekly_forecasts |>
dplyr::group_by(epiweek, epiyear, location) |>
dplyr::reframe(
quant_values = quantile(
weekly_hosp,
probs = flusight_quantile_levels
),
quant_levels = flusight_quantile_levels
)
quant_summary
#> # A tibble: 4,876 × 5
#> epiweek epiyear location quant_values quant_levels
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 43 2023 AK 3.97 0.01
#> 2 43 2023 AK 4.47 0.025
#> 3 43 2023 AK 7 0.05
#> 4 43 2023 AK 8.9 0.1
#> 5 43 2023 AK 9 0.15
#> 6 43 2023 AK 10 0.2
#> 7 43 2023 AK 11 0.25
#> 8 43 2023 AK 12 0.3
#> 9 43 2023 AK 12 0.35
#> 10 43 2023 AK 13.6 0.4
#> # ℹ 4,866 more rowsforecasttools provides a
trajectories_to_quantiles() function to automate this
process.
quant_summary <- epiweekly_forecasts |>
trajectories_to_quantiles(
timepoint_cols = c("epiweek", "epiyear"),
id_cols = "location",
value_col = "weekly_hosp"
)
quant_summary
#> # A tibble: 4,876 × 5
#> epiweek epiyear location quantile_value quantile_level
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 43 2023 AK 3.97 0.01
#> 2 43 2023 AK 4.47 0.025
#> 3 43 2023 AK 7 0.05
#> 4 43 2023 AK 8.9 0.1
#> 5 43 2023 AK 9 0.15
#> 6 43 2023 AK 10 0.2
#> 7 43 2023 AK 11 0.25
#> 8 43 2023 AK 12 0.3
#> 9 43 2023 AK 12 0.35
#> 10 43 2023 AK 13.6 0.4
#> # ℹ 4,866 more rowsWarning: order matters for proagating uncertainty; first compute, then summarize!
People are sometimes tempted to compute weekly quantiles from daily forecasts by first computing daily quantiles and then summing those daily quantiles to get weekly ones. This is typically not what you want, and it is not equivalent to first aggregating to the week and then computing quantiles.
Why not? It does not capture all the ways in which Monday’s predicted value and Tuesday’s predicted value might co-vary. In general, we wish to propagate uncertainty in a principled manner.
For handling samples from a posterior distribution, this typically means computing any transformations / functions of our sampled parameters–such as a weekly hospitalization total computed from daily values—and only then summarizing. Specifically, we compute the transformations “draw-wise” (that is, one computation for each MCMC draw), so that the Monday value from draw number 1 gets added to the Tuesday value for draw number 1, the Monday value for draw number 2 gets added to the Tuesday value for draw number 2, etc.
Format output for Hub submission
First, we need to convert state abbreviations to the the
two-character location codes that most US-based hubs expect. We could
use a SQL-style workflow with
forecasttools::us_location_table:
manual_rename <- quant_summary |>
rename(state = location) |>
inner_join(
forecasttools::us_location_table |> select(
state = abbr,
location = code
),
by = "state"
)
manual_rename
#> # A tibble: 4,876 × 6
#> epiweek epiyear state quantile_value quantile_level location
#> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
#> 1 43 2023 AK 3.97 0.01 02
#> 2 43 2023 AK 4.47 0.025 02
#> 3 43 2023 AK 7 0.05 02
#> 4 43 2023 AK 8.9 0.1 02
#> 5 43 2023 AK 9 0.15 02
#> 6 43 2023 AK 10 0.2 02
#> 7 43 2023 AK 11 0.25 02
#> 8 43 2023 AK 12 0.3 02
#> 9 43 2023 AK 12 0.35 02
#> 10 43 2023 AK 13.6 0.4 02
#> # ℹ 4,866 more rowsTo save you typing, forecasttools provides a
us_loc_abbr_to_code() function (as well as its inverse,
us_loc_code_to_abbr()):
quant_summary <- quant_summary |>
mutate(location = us_loc_abbr_to_code(location))
quant_summary
#> # A tibble: 4,876 × 5
#> epiweek epiyear location quantile_value quantile_level
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 43 2023 02 3.97 0.01
#> 2 43 2023 02 4.47 0.025
#> 3 43 2023 02 7 0.05
#> 4 43 2023 02 8.9 0.1
#> 5 43 2023 02 9 0.15
#> 6 43 2023 02 10 0.2
#> 7 43 2023 02 11 0.25
#> 8 43 2023 02 12 0.3
#> 9 43 2023 02 12 0.35
#> 10 43 2023 02 13.6 0.4
#> # ℹ 4,866 more rowsNow, we need to add the other required hubverse columns, select only
the time horizons of interest, and so on. forecasttools
provides a get_hubverse_quantile_table() function to do
this for a given forecast reference_date and
target name. Here, we’ll use 2023-10-21 and
wk inc flu hosp. get_hubverse_table() is
configurable, but for convenience it is designed to accept
trajectories_to_quantiles() output like our
quant_summary tibble by default:
flusight_table <- quant_summary |>
mutate(target_end_date = epiweek_to_date(
epiweek = epiweek, epiyear = epiyear,
day_of_week = 7
)) |>
get_hubverse_quantile_table(
reference_date = "2023-10-21",
target_name = "wk inc flu hosp",
horizons = -1:3,
horizon_timescale = "weeks",
timepoint_col = "target_end_date"
)
flusight_table |> print(width = 100)
#> # A tibble: 3,657 × 9
#> reference_date target horizon horizon_timescale target_end_date
#> <chr> <chr> <int> <chr> <date>
#> 1 2023-10-21 wk inc flu hosp 1 weeks 2023-10-28
#> 2 2023-10-21 wk inc flu hosp 1 weeks 2023-10-28
#> 3 2023-10-21 wk inc flu hosp 1 weeks 2023-10-28
#> 4 2023-10-21 wk inc flu hosp 1 weeks 2023-10-28
#> 5 2023-10-21 wk inc flu hosp 1 weeks 2023-10-28
#> 6 2023-10-21 wk inc flu hosp 1 weeks 2023-10-28
#> 7 2023-10-21 wk inc flu hosp 1 weeks 2023-10-28
#> 8 2023-10-21 wk inc flu hosp 1 weeks 2023-10-28
#> 9 2023-10-21 wk inc flu hosp 1 weeks 2023-10-28
#> 10 2023-10-21 wk inc flu hosp 1 weeks 2023-10-28
#> location output_type output_type_id value
#> <chr> <chr> <dbl> <dbl>
#> 1 01 quantile 0.01 4.96
#> 2 01 quantile 0.025 6.9
#> 3 01 quantile 0.05 9.95
#> 4 01 quantile 0.1 13
#> 5 01 quantile 0.15 18
#> 6 01 quantile 0.2 20
#> 7 01 quantile 0.25 22.8
#> 8 01 quantile 0.3 25
#> 9 01 quantile 0.35 26.6
#> 10 01 quantile 0.4 29
#> # ℹ 3,647 more rowsWrite the formatted output to .csv
FluSight and COVIDHub currently take submissions as comma-separated
values (.csv) files. We can now write our table to
.csv with base write.csv() or
readr::write_csv():