
In this vignette, you will learn how to use forecasttools to pull data from the National Healthcare Safety Network (NHSN) Weekly Hospital Respiratory Data (HRD) dataset. This dataset contains counts of epiweekly influenza and COVID-19 hospital admissions by U.S. state, among other quantities.

We will use the Socrata Open Data API (SODA) endpoint on data.cdc.gov. data.cdc.gov also provides a browser view of the data with links to download it (e.g. as .csv or .tsv files), but when building pipelines, we want to operate programmatically.

API Key

First, you’ll want to go to data.cdc.gov and request an API token. This is not strictly required, but it is generally considered polite, and it will speed up your data requests (the polite get served first!).

To request a token, navigate to data.cdc.gov’s developer settings page. You will be prompted to log in. If you have CDC credentials, you can use those. Otherwise, you can sign up for an account with Tyler Data and Insights (the contractor that manages data.cdc.gov). Once logged in, navigate to the developer settings page, click the “Create new API key” button and follow the prompts.

Make sure to record your secret key somewhere safe but not tracked by Git (e.g. a .gitignore-ed secrets.toml file).

One convenient place to store secrets is in environment variables. If you don't provide one explicitly, pull_nhsn() looks for a valid NHSN API key ID in an environment variable named NHSN_API_KEY_ID and a corresponding secret in an environment variable named NHSN_API_KEY_SECRET.
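For example, you could set those variables for the current R session with Sys.setenv(); for anything persistent, prefer a location that is not tracked by Git, such as a .gitignore-ed .Renviron file. The values below are placeholders:

# placeholders only: substitute your own credentials and keep them
# out of version control (e.g. in a .gitignore-ed .Renviron)
Sys.setenv(
  NHSN_API_KEY_ID = "<YOUR API KEY ID HERE>",
  NHSN_API_KEY_SECRET = "<YOUR API KEY SECRET HERE>"
)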

Getting all the data

Our workhorse function for getting NHSN data is called simply pull_nhsn(). If you provide it with no arguments, it fetches the entire dataset as a tibble::tibble(), up to a specified maximum number of rows (100,000 by default).

You can increase this by setting the limit argument of pull_nhsn() to a larger value. To protect you against accidentally pulling incomplete datasets, pull_nhsn() errors by default if the number of rows retrieved hits the limit. You can suppress that error by setting error_on_limit = FALSE.
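For instance, to attempt a pull of the complete dataset in one call, you might set limit comfortably above the dataset's current row count and keep the default error_on_limit = TRUE, so the call fails loudly if the data have grown past your ceiling. The value below is purely illustrative:

all_rows <- pull_nhsn(
  limit = 2e6 # illustrative ceiling, not the dataset's actual size
)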

We’ll provide the api_key_id and api_key_secret arguments explicitly. If you omit those (and valid values can’t be found in the default environment variables), forecasttools will warn you that the data request will be lower priority and less polite.

my_api_key_id <- "<YOUR API KEY ID HERE>"
my_api_key_secret <- "<YOUR API KEY SECRET HERE>"
library(forecasttools)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
all_data <- pull_nhsn(
  limit = 1000,
  api_key_id = my_api_key_id,
  api_key_secret = my_api_key_secret,
  error_on_limit = FALSE
)
#> No encoding supplied: defaulting to UTF-8.

all_data
#> # A tibble: 1,000 × 157
#>    weekendingdate       jurisdiction numinptbeds numinptbedsadult numinptbedsped
#>    <chr>                <chr>        <chr>       <chr>            <chr>         
#>  1 2020-08-08T00:00:00… AK           2387.0      432.14           40.0          
#>  2 2020-08-15T00:00:00… AK           2364.86     432.14           40.0          
#>  3 2020-08-22T00:00:00… AK           2327.43     431.86           40.14         
#>  4 2020-08-29T00:00:00… AK           2337.57     431.69           40.71         
#>  5 2020-09-05T00:00:00… AK           2327.43     553.43           62.0          
#>  6 2020-09-12T00:00:00… AK           2357.71     525.83           62.0          
#>  7 2020-09-19T00:00:00… AK           2396.38     478.71           63.43         
#>  8 2020-09-26T00:00:00… AK           1998.11     1016.14          209.0         
#>  9 2020-10-03T00:00:00… AK           1506.57     1061.71          190.86        
#> 10 2020-10-10T00:00:00… AK           1519.71     1339.29          180.43        
#> # ℹ 990 more rows
#> # ℹ 152 more variables: numinptbedsocc <chr>, numinptbedsoccadult <chr>,
#> #   numinptbedsoccped <chr>, numicubeds <chr>, numicubedsadult <chr>,
#> #   numicubedsped <chr>, numicubedsocc <chr>, numicubedsoccadult <chr>,
#> #   numicubedsoccped <chr>, numconfc19hosppatsadult <chr>,
#> #   numconfc19hosppatsped <chr>, totalconfc19hosppats <chr>,
#> #   numconfc19icupatsadult <chr>, numconfc19icupatsped <chr>, …

Getting only relevant rows and columns

This is a huge table, and it can take a fair amount of time to download. We can speed things up by requesting only a subset of the data via a “query”: a way to ask data.cdc.gov for just the rows and columns we care about.

To automate repetitive tasks, pull_nhsn() provides pre-defined ways to apply some common queries.

Date limits

pull_nhsn() takes start_date and end_date arguments. If these are provided, it will query the dataset for only the rows that fall between those dates, inclusive.

Column limits

pull_nhsn() takes a columns argument, which defaults to NULL. If columns is NULL, pull_nhsn() pulls all columns. Otherwise, it pulls the columns jurisdiction, weekendingdate, and any columns enumerated in columns.

An example

As an example, let’s pull only epiweekly incident influenza hospitalizations between 1 January 2023 and 31 December 2023, inclusive. Here, we’ll let pull_nhsn() look for an API key and secret in our environment variables.

influenza_2023 <- pull_nhsn(
  columns = c("totalconfflunewadm"),
  start_date = "2023-01-01",
  end_date = "2023-12-31"
)
#> No encoding supplied: defaulting to UTF-8.

influenza_2023
#> # A tibble: 2,964 × 3
#>    jurisdiction weekendingdate          totalconfflunewadm
#>    <chr>        <chr>                   <chr>             
#>  1 AK           2023-01-07T00:00:00.000 32.0              
#>  2 AK           2023-01-14T00:00:00.000 7.0               
#>  3 AK           2023-01-21T00:00:00.000 7.0               
#>  4 AK           2023-01-28T00:00:00.000 5.0               
#>  5 AK           2023-02-04T00:00:00.000 3.0               
#>  6 AK           2023-02-11T00:00:00.000 1.0               
#>  7 AK           2023-02-18T00:00:00.000 2.0               
#>  8 AK           2023-02-25T00:00:00.000 5.0               
#>  9 AK           2023-03-04T00:00:00.000 2.0               
#> 10 AK           2023-03-11T00:00:00.000 4.0               
#> # ℹ 2,954 more rows

By default, pull_nhsn() returns its output sorted first by jurisdiction, then by weekendingdate. You can change this ordering via the order_by argument.

influenza_by_date <- pull_nhsn(
  columns = c("totalconfflunewadm"),
  start_date = "2023-01-01",
  end_date = "2023-12-31",
  order_by = c("weekendingdate", "jurisdiction")
)
#> No encoding supplied: defaulting to UTF-8.

influenza_by_date
#> # A tibble: 2,964 × 3
#>    jurisdiction weekendingdate          totalconfflunewadm
#>    <chr>        <chr>                   <chr>             
#>  1 AK           2023-01-07T00:00:00.000 32.0              
#>  2 AL           2023-01-07T00:00:00.000 123.0             
#>  3 AR           2023-01-07T00:00:00.000 158.0             
#>  4 AS           2023-01-07T00:00:00.000 0.0               
#>  5 AZ           2023-01-07T00:00:00.000 386.0             
#>  6 CA           2023-01-07T00:00:00.000 913.0             
#>  7 CO           2023-01-07T00:00:00.000 158.0             
#>  8 CT           2023-01-07T00:00:00.000 170.0             
#>  9 DC           2023-01-07T00:00:00.000 17.0              
#> 10 DE           2023-01-07T00:00:00.000 45.0              
#> # ℹ 2,954 more rows
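
Note that the API returns every value as a character string, as the <chr> column types above show. Before analysis, you will typically want to parse the dates and counts. Here is a minimal sketch using the dplyr verbs loaded above; adapt the parsing to whichever columns you pulled:

influenza_by_date_clean <- influenza_by_date |>
  mutate(
    # week-ending dates arrive as ISO 8601 timestamps like "2023-01-07T00:00:00.000"
    weekendingdate = as.Date(weekendingdate),
    # admission counts arrive as strings like "32.0"
    totalconfflunewadm = as.numeric(totalconfflunewadm)
  )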