Formatting output for Forecast Hub submission

The FluSight Forecast Hub and the COVID-19 Forecast Hub both use the hubverse framework to accept forecast submissions from contributing teams. Many other infectious disease forecasting challenges use or are adopting the hubverse framework as well.

FluSight and COVIDHub current accept submissions via GitHub pull requests. The submitted model-output csv files need to follow a specified formatting schema; the FluSight and COVIDHub Github repos provide detailed descriptions.

forecasttools provides functions to help you produce valid hubverse submissions from your forecast output. This vignette will show you how to use forecasttools to format forecast output for submission to a Hub.

An example forecast

Let us suppose that we have some forecast output as tidy data. We’ll use the forecasttools::example_daily_forecast_flu dataset here. It provides a set of 100 random draws from a 4000-draw posterior distribution of hospitalization trajectories forecast as of 2023-10-21.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
dat <- forecasttools::example_daily_forecast_flu
dat
#> # A tibble: 159,000 × 4
#>    .draw date        hosp location
#>    <dbl> <date>     <dbl> <chr>   
#>  1    18 2023-10-21     0 NM      
#>  2    18 2023-10-22     0 NM      
#>  3    18 2023-10-23     0 NM      
#>  4    18 2023-10-24     2 NM      
#>  5    18 2023-10-25     0 NM      
#>  6    18 2023-10-26     0 NM      
#>  7    18 2023-10-27     0 NM      
#>  8    18 2023-10-28     0 NM      
#>  9    18 2023-10-29     0 NM      
#> 10    18 2023-10-30     0 NM      
#> # ℹ 158,990 more rows

53 states and territories are represented, coded by a two-letter location code:

dat |>
  distinct(location) |>
  pull()
#>  [1] "NM" "SD" "MS" "ID" "NH" "VI" "ND" "MN" "KS" "DC" "UT" "NV" "DE" "MA" "VT"
#> [16] "RI" "CT" "WV" "AK" "WY" "TX" "PR" "NC" "IN" "AR" "TN" "HI" "NJ" "OR" "MI"
#> [31] "MD" "NE" "OH" "GA" "CA" "WA" "VA" "CO" "AL" "PA" "AZ" "MT" "WI" "FL" "NY"
#> [46] "ME" "LA" "IL" "SC" "OK" "KY" "MO" "IA"

This dataset is pre-formatted as tidy data: each row represents exactly one observation (here, a given .draw from the posterior distribution of incident hospitalizations hosp for a given date and location.

If you are not yet familiar with the concept of tidy data, we highly recommend reading this introductory vignette. For an introduction to applying tidy data principles to managing Bayesian posterior distributions / MCMC output, we recommend this vignette from the excellent tidybayes R package.

Hubverse submission format

Forecast hubs that use hubverse expect submissions in a different format. Here’s an example of an inflenza forecast from 2024-04-06 submitted to the 2023-24 FluSight Challenge by the cfarenewal-cfaepimlight team:

library(readr)
submit_path <- "https://raw.githubusercontent.com/cdcepi/FluSight-forecast-hub/refs/heads/main/model-output/cfarenewal-cfaepimlight/2024-04-06-cfarenewal-cfaepimlight.csv"
submission <- read_csv(submit_path)
submission |>
  print(width = 100)
#> # A tibble: 5,865 × 8
#>    reference_date target          horizon target_end_date location output_type
#>    <date>         <chr>             <dbl> <date>          <chr>    <chr>      
#>  1 2024-04-06     wk inc flu hosp      -1 2024-03-30      01       quantile   
#>  2 2024-04-06     wk inc flu hosp      -1 2024-03-30      01       quantile   
#>  3 2024-04-06     wk inc flu hosp      -1 2024-03-30      01       quantile   
#>  4 2024-04-06     wk inc flu hosp      -1 2024-03-30      01       quantile   
#>  5 2024-04-06     wk inc flu hosp      -1 2024-03-30      01       quantile   
#>  6 2024-04-06     wk inc flu hosp      -1 2024-03-30      01       quantile   
#>  7 2024-04-06     wk inc flu hosp      -1 2024-03-30      01       quantile   
#>  8 2024-04-06     wk inc flu hosp      -1 2024-03-30      01       quantile   
#>  9 2024-04-06     wk inc flu hosp      -1 2024-03-30      01       quantile   
#> 10 2024-04-06     wk inc flu hosp      -1 2024-03-30      01       quantile   
#>    output_type_id value
#>             <dbl> <dbl>
#>  1          0.01   34  
#>  2          0.025  38  
#>  3          0.05   41.0
#>  4          0.1    44  
#>  5          0.15   47  
#>  6          0.2    49  
#>  7          0.25   50  
#>  8          0.3    52  
#>  9          0.35   53  
#> 10          0.4    54  
#> # ℹ 5,855 more rows

Overview of hubverse model output schema

Let’s talk through each of these columns. For additional details and guidance, consult FluSight’s model output schema and the hubverse documentation page on model output formats.

`reference_date`

The official “as of” date for the forecast, in ISO8601 YYYY-MM-DD format. The reference date is not necessarily the due date for forecast submission. For FluSight and COVIDHub, it is instead the Saturday that ends the USA “epidemiological week” or “epiweek” that the forecast is due. You’ll learn a bit more about the why and how of “epiweeks” later in this vignette.

`target`

The quantity being forecast in the given row. Different Hubs accept forecasts of different quantities of interest. For example, in 2023-2024, FluSight had two targets:

wk inc flu hosp: epiweekly incident hospitalizations (i.e. the total from Sunday through Saturday for each epiweek).
wk flu hosp rate change: a discrete prediction of whether the hospitalization rate is projected to be stable, increase a little, increase a lot, decrease a little, or decrease a lot.

The example forecast above contains forecasts for wk inc flu hosp but not for wk flu hosp rate change.

`horizon`

The forecast “horizon” relative to the reference_date, here in units of weeks ahead/behind. Flusight allows the following horizons: -1, 0, 1, 2, 3. A horizon of -1 represents a “nowcast” of what the ultimately reported totals for the epiweek before the reference week will be after all backfill is accounted for.

`target_end_date`

The Saturday ending the epiweek corresponding to the given horizon. So for example a horizon of -1 from 2024-04-06 is 2024-03-30 and a horizon of 3 from 2024-04-06 is 2024-04-27.

`location`

The geographic region for which the forecast is being made. For FluSight, this is either a US state or territory or the United States as a whole. Locations are coded not as two-letter abbreviations but rather as two-digit legacy state FIPS codes, except for the national forecast, which is coded as US:

submission |>
  distinct(location) |>
  pull()
#>  [1] "01" "02" "04" "05" "06" "08" "09" "10" "11" "12" "13" "16" "17" "18" "19"
#> [16] "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34"
#> [31] "35" "36" "37" "38" "39" "40" "41" "42" "44" "45" "46" "47" "48" "50" "51"
#> [46] "53" "54" "55" "56" "72" "US"

To help you code things properly, forecasttools provides a lookup table called forecasttools::us_location_table:

forecasttools::us_location_table
#> # A tibble: 58 × 3
#>    code  abbr  name                
#>    <chr> <chr> <chr>               
#>  1 US    US    United States       
#>  2 01    AL    Alabama             
#>  3 02    AK    Alaska              
#>  4 04    AZ    Arizona             
#>  5 05    AR    Arkansas            
#>  6 06    CA    California          
#>  7 08    CO    Colorado            
#>  8 09    CT    Connecticut         
#>  9 10    DE    Delaware            
#> 10 11    DC    District of Columbia
#> # ℹ 48 more rows

`output_type`

The type of forecast value the row contains.

For quantile forecasts, this will be 'quantile': each row represents the value of of a particular quantile level for a particular reference_date, target, location, and horizon.

For trend classification forecasts, it will typically be 'pmf', for “probability mass function”, to indicate that the row represents the probability of a given trend (e.g. the probability of a 'large increase'.)

`output_type_id`

This column holds any needed information about what subclass of the output type the row contains. For quantile, it contains the level of the quantile (e.g. 0.01 for the 1st percentile, 0.5 for the median / 50th percentile, etc). For pmf, contains the name of the category/event whose probability is being given (e.g. 'stable' or 'large decrease').

`value`

This column contains the actual forecast value. For a quantile forecast of a hospitalizations target, this is the number of forecast hospitalizations for the given location and horizon at the given quantile level (specified in output_type_id)

An example

Let’s look at an abtirary row from the example submission above:

submission[123, ] |>
  print(width = 100)
#> # A tibble: 1 × 8
#>   reference_date target          horizon target_end_date location output_type
#>   <date>         <chr>             <dbl> <date>          <chr>    <chr>      
#> 1 2024-04-06     wk inc flu hosp      -1 2024-03-30      02       quantile   
#>   output_type_id value
#>            <dbl> <dbl>
#> 1            0.3     8

This row says that for location 02 (Alaska, as you can confirm by consulting forecasttools::us_location_table), the 0.3 quantile (30th percentile) of the forecast predictive distribution for incident hospitalizations during the epiweek ending Saturday 2024-03-30 (a -1 week forecast horizon) is 8.

Getting your forecast into this format

So you now have your nice tidy daily forecast, and you want to aggregate it into weekly quantiles and format it for hubverse submission. Let’s look at our tidy forecast data again:

dat
#> # A tibble: 159,000 × 4
#>    .draw date        hosp location
#>    <dbl> <date>     <dbl> <chr>   
#>  1    18 2023-10-21     0 NM      
#>  2    18 2023-10-22     0 NM      
#>  3    18 2023-10-23     0 NM      
#>  4    18 2023-10-24     2 NM      
#>  5    18 2023-10-25     0 NM      
#>  6    18 2023-10-26     0 NM      
#>  7    18 2023-10-27     0 NM      
#>  8    18 2023-10-28     0 NM      
#>  9    18 2023-10-29     0 NM      
#> 10    18 2023-10-30     0 NM      
#> # ℹ 158,990 more rows

Aggregate to epiweekly

The first thing we need to do is aggregate it from a daily forecast to an (epi)weekly forecast. Because our data is tidy, we could use a split/apply/combine workflow.

epiweekly <- dat |>
  mutate(
    epiweek = lubridate::epiweek(date),
    epiyear = lubridate::epiyear(date)
  ) |>
  group_by(
    epiweek,
    epiyear,
    .draw,
    location
  ) |>
  filter(n() == 7) |> ## only use epiweeks for which we have all 7 days
  summarise(weekly_hosp = sum(hosp))
#> `summarise()` has grouped output by 'epiweek', 'epiyear', '.draw'. You can
#> override using the `.groups` argument.

epiweekly
#> # A tibble: 21,200 × 5
#> # Groups:   epiweek, epiyear, .draw [400]
#>    epiweek epiyear .draw location weekly_hosp
#>      <dbl>   <dbl> <dbl> <chr>          <dbl>
#>  1      43    2023    18 AK                17
#>  2      43    2023    18 AL                46
#>  3      43    2023    18 AR                44
#>  4      43    2023    18 AZ                20
#>  5      43    2023    18 CA               152
#>  6      43    2023    18 CO                17
#>  7      43    2023    18 CT                 1
#>  8      43    2023    18 DC                 1
#>  9      43    2023    18 DE                 1
#> 10      43    2023    18 FL                74
#> # ℹ 21,190 more rows

To save you typing and avoid mistakes (like forgetting that two different years can have the same epiweek) however, forecasttools provides a daily_to_epiweekly() function to automate and autocheck this process:

library(forecasttools)
epiweekly_forecasts <- dat |>
  daily_to_epiweekly(
    value_col = "hosp", ## column of values to aggregate to weekly
    id_cols = c(".draw", "location"), ## column(s) that identify a trajectory
    weekly_value_name = "weekly_hosp" ## what to name the output column
  )
epiweekly_forecasts
#> # A tibble: 21,200 × 5
#>    epiweek epiyear .draw location weekly_hosp
#>      <dbl>   <dbl> <dbl> <chr>          <dbl>
#>  1      43    2023    18 AK                17
#>  2      43    2023    18 AL                46
#>  3      43    2023    18 AR                44
#>  4      43    2023    18 AZ                20
#>  5      43    2023    18 CA               152
#>  6      43    2023    18 CO                17
#>  7      43    2023    18 CT                 1
#>  8      43    2023    18 DC                 1
#>  9      43    2023    18 DE                 1
#> 10      43    2023    18 FL                74
#> # ℹ 21,190 more rows

Notice we didn’t have to specify that dates were found in the date column. This is because daily_to_epiweekly has default values for all of the required column names. They are:

value_col = "hosp"
date_col = "date"
id_cols = ".draw",
weekly_value_name = "weekly_value"

daily_to_epiweekly takes a flag called strict to determine whether to only use weeks for which we have all seven days. It defaults to TRUE if not set, which means that a similar step to the filter(n() == 7) in our manual code above gets applied. It also checks to make sure that no week contains more than 7 days. So failing to group by location as well as .draw throws an error:

epiweekly_forecasts <- dat |>
  daily_to_epiweekly(
    value_col = "hosp", ## column of values to aggregate to weekly
    id_cols = ".draw", ## column(s) that identify a trajectory
    weekly_value_name = "weekly_hosp" ## what to name the output column
  )
#> Error in `daily_to_epiweekly()`:
#> ! At least one trajectory had more
#> than 7 values for a given epiweek
#> of a given year. Check your date
#> column for repeated values and
#> check that the trajectory
#> id columns are the ones you intended to
#> use. This run used columns named
#> '.draw' to identify unique trajectories

Compute the needed posterior quantiles

Now we compute our forecast quantiles from our table of epiweekly_forecasts. Again, we could to this manually:

flusight_quantile_levels <- c(0.01, 0.025, 1:19 / 20, 0.975, 0.99)

quant_summary <- epiweekly_forecasts |>
  dplyr::group_by(epiweek, epiyear, location) |>
  dplyr::reframe(
    quant_values = quantile(
      weekly_hosp,
      probs = flusight_quantile_levels
    ),
    quant_levels = flusight_quantile_levels
  )

quant_summary
#> # A tibble: 4,876 × 5
#>    epiweek epiyear location quant_values quant_levels
#>      <dbl>   <dbl> <chr>           <dbl>        <dbl>
#>  1      43    2023 AK               3.97        0.01 
#>  2      43    2023 AK               4.47        0.025
#>  3      43    2023 AK               7           0.05 
#>  4      43    2023 AK               8.9         0.1  
#>  5      43    2023 AK               9           0.15 
#>  6      43    2023 AK              10           0.2  
#>  7      43    2023 AK              11           0.25 
#>  8      43    2023 AK              12           0.3  
#>  9      43    2023 AK              12           0.35 
#> 10      43    2023 AK              13.6         0.4  
#> # ℹ 4,866 more rows

forecasttools provides a trajectories_to_quantiles() function to automate this process.

quant_summary <- epiweekly_forecasts |>
  trajectories_to_quantiles(
    timepoint_cols = c("epiweek", "epiyear"),
    id_cols = "location",
    value_col = "weekly_hosp"
  )

quant_summary
#> # A tibble: 4,876 × 5
#>    epiweek epiyear location quantile_value quantile_level
#>      <dbl>   <dbl> <chr>             <dbl>          <dbl>
#>  1      43    2023 AK                 3.97          0.01 
#>  2      43    2023 AK                 4.47          0.025
#>  3      43    2023 AK                 7             0.05 
#>  4      43    2023 AK                 8.9           0.1  
#>  5      43    2023 AK                 9             0.15 
#>  6      43    2023 AK                10             0.2  
#>  7      43    2023 AK                11             0.25 
#>  8      43    2023 AK                12             0.3  
#>  9      43    2023 AK                12             0.35 
#> 10      43    2023 AK                13.6           0.4  
#> # ℹ 4,866 more rows

Warning: order matters for proagating uncertainty; first compute, then summarize!

People are sometimes tempted to compute weekly quantiles from daily forecasts by first computing daily quantiles and then summing those daily quantiles to get weekly ones. This is typically not what you want, and it is not equivalent to first aggregating to the week and then computing quantiles.

Why not? It does not capture all the ways in which Monday’s predicted value and Tuesday’s predicted value might co-vary. In general, we wish to propagate uncertainty in a principled manner.

For handling samples from a posterior distribution, this typically means computing any transformations / functions of our sampled parameters–such as a weekly hospitalization total computed from daily values—and only then summarizing. Specifically, we compute the transformations “draw-wise” (that is, one computation for each MCMC draw), so that the Monday value from draw number 1 gets added to the Tuesday value for draw number 1, the Monday value for draw number 2 gets added to the Tuesday value for draw number 2, etc.

Format output for Hub submission

First, we need to convert state abbreviations to the the two-character location codes that most US-based hubs expect. We could use a SQL-style workflow with forecasttools::us_location_table:

manual_rename <- quant_summary |>
  rename(state = location) |>
  inner_join(
    forecasttools::us_location_table |> select(
      state = abbr,
      location = code
    ),
    by = "state"
  )
manual_rename
#> # A tibble: 4,876 × 6
#>    epiweek epiyear state quantile_value quantile_level location
#>      <dbl>   <dbl> <chr>          <dbl>          <dbl> <chr>   
#>  1      43    2023 AK              3.97          0.01  02      
#>  2      43    2023 AK              4.47          0.025 02      
#>  3      43    2023 AK              7             0.05  02      
#>  4      43    2023 AK              8.9           0.1   02      
#>  5      43    2023 AK              9             0.15  02      
#>  6      43    2023 AK             10             0.2   02      
#>  7      43    2023 AK             11             0.25  02      
#>  8      43    2023 AK             12             0.3   02      
#>  9      43    2023 AK             12             0.35  02      
#> 10      43    2023 AK             13.6           0.4   02      
#> # ℹ 4,866 more rows

To save you typing, forecasttools provides a us_loc_abbr_to_code() function (as well as its inverse, us_loc_code_to_abbr()):

quant_summary <- quant_summary |>
  mutate(location = us_loc_abbr_to_code(location))

quant_summary
#> # A tibble: 4,876 × 5
#>    epiweek epiyear location quantile_value quantile_level
#>      <dbl>   <dbl> <chr>             <dbl>          <dbl>
#>  1      43    2023 02                 3.97          0.01 
#>  2      43    2023 02                 4.47          0.025
#>  3      43    2023 02                 7             0.05 
#>  4      43    2023 02                 8.9           0.1  
#>  5      43    2023 02                 9             0.15 
#>  6      43    2023 02                10             0.2  
#>  7      43    2023 02                11             0.25 
#>  8      43    2023 02                12             0.3  
#>  9      43    2023 02                12             0.35 
#> 10      43    2023 02                13.6           0.4  
#> # ℹ 4,866 more rows

Now, we need to add the other required hubverse columns, select only the time horizons of interest, and so on. forecasttools provides a get_hubverse_quantile_table() function to do this for a given forecast reference_date and target name. Here, we’ll use 2023-10-21 and wk inc flu hosp. get_hubverse_table() is configurable, but for convenience it is designed to accept trajectories_to_quantiles() output like our quant_summary tibble by default:

flusight_table <- quant_summary |>
  mutate(target_end_date = epiweek_to_date(
    epiweek = epiweek, epiyear = epiyear,
    day_of_week = 7
  )) |>
  get_hubverse_quantile_table(
    reference_date = "2023-10-21",
    target_name = "wk inc flu hosp",
    horizons = -1:3,
    horizon_timescale = "weeks",
    timepoint_col = "target_end_date"
  )

flusight_table |> print(width = 100)
#> # A tibble: 3,657 × 9
#>    reference_date target          horizon horizon_timescale target_end_date
#>    <chr>          <chr>             <int> <chr>             <date>         
#>  1 2023-10-21     wk inc flu hosp       1 weeks             2023-10-28     
#>  2 2023-10-21     wk inc flu hosp       1 weeks             2023-10-28     
#>  3 2023-10-21     wk inc flu hosp       1 weeks             2023-10-28     
#>  4 2023-10-21     wk inc flu hosp       1 weeks             2023-10-28     
#>  5 2023-10-21     wk inc flu hosp       1 weeks             2023-10-28     
#>  6 2023-10-21     wk inc flu hosp       1 weeks             2023-10-28     
#>  7 2023-10-21     wk inc flu hosp       1 weeks             2023-10-28     
#>  8 2023-10-21     wk inc flu hosp       1 weeks             2023-10-28     
#>  9 2023-10-21     wk inc flu hosp       1 weeks             2023-10-28     
#> 10 2023-10-21     wk inc flu hosp       1 weeks             2023-10-28     
#>    location output_type output_type_id value
#>    <chr>    <chr>                <dbl> <dbl>
#>  1 01       quantile             0.01   4.96
#>  2 01       quantile             0.025  6.9 
#>  3 01       quantile             0.05   9.95
#>  4 01       quantile             0.1   13   
#>  5 01       quantile             0.15  18   
#>  6 01       quantile             0.2   20   
#>  7 01       quantile             0.25  22.8 
#>  8 01       quantile             0.3   25   
#>  9 01       quantile             0.35  26.6 
#> 10 01       quantile             0.4   29   
#> # ℹ 3,647 more rows

Write the formatted output to `.csv`

FluSight and COVIDHub currently take submissions as comma-separated values (.csv) files. We can now write our table to .csv with base write.csv() or readr::write_csv():

example_output_path <- file.path(
  tempdir(),
  "2023-10-23-example-team-example-model.csv"
)
flusight_table |>
  readr::write_csv(example_output_path)

An example forecast

Hubverse submission format

Overview of hubverse model output schema

reference_date

target

horizon

target_end_date

location

output_type

output_type_id

value