In this vignette you will learn how to use forecasttools to pull data from the National Healthcare Safety Network (NHSN) Weekly Hospital Respiratory Data (HRD) dataset. This dataset contains counts of epiweekly influenza and COVID-19 hospital admissions by U.S. state, among other quantities. We will use the Socrata Open Data API (SODA) endpoint on data.cdc.gov. data.cdc.gov also provides a browser view of the data with links to download it (e.g. as .csv or .tsv files), but when building pipelines, we want to operate programmatically.
API Key
First, you’ll want to go to data.cdc.gov and request an API token. This is not strictly required, but it is generally considered polite, and it will speed up your data requests (the polite get served first!).
To request a token, navigate to data.cdc.gov’s developer settings page. You will be prompted to log in. If you have CDC credentials, you can use those. Otherwise, you can sign up for an account with Tyler Data and Insights (the contractor that manages data.cdc.gov). Once logged in, navigate to the developer settings page, click the “Create new API key” button, and follow the prompts.
Make sure to record your secret key somewhere safe but not tracked by Git (e.g. a .gitignore-ed secrets.toml file).
One place to store secrets is as environment variables. If you don’t provide one explicitly, pull_nhsn() looks for a valid NHSN API key ID in an environment variable named NHSN_API_KEY_ID and a corresponding secret in an environment variable named NHSN_API_KEY_SECRET.
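For example, you could set those two environment variables for the current R session with Sys.setenv(). This is just a sketch: the placeholder values are not real credentials, and for anything persistent you would instead put the variables in a git-ignored .Renviron file rather than in tracked code.

```r
## Sketch: set the environment variables pull_nhsn() looks for.
## The values below are placeholders, not real credentials.
Sys.setenv(
  NHSN_API_KEY_ID = "<YOUR API KEY ID HERE>",
  NHSN_API_KEY_SECRET = "<YOUR API KEY SECRET HERE>"
)

## Confirm they are visible to the R session
Sys.getenv("NHSN_API_KEY_ID")
```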
Getting all the data
Our workhorse function for getting NHSN data is called simply pull_nhsn(). If you provide it with no arguments, it will simply fetch the entire dataset as a tibble::tibble(), up to a specified maximum number of rows (default 10,000). You can increase this by setting the limit argument of pull_nhsn() to a larger value. To protect you against accidentally pulling incomplete datasets, pull_nhsn() errors by default if the number of rows retrieved hits the limit. You can suppress that error by setting error_on_limit = FALSE.
We’ll provide the api_key_id and api_key_secret arguments explicitly. If you omit those (and valid values can’t be found in the default environment variables), forecasttools will warn you that the data request will be lower priority and less polite.
my_api_key_id <- "<YOUR API KEY ID HERE>"
my_api_key_secret <- "<YOUR API KEY SECRET HERE>"
library(forecasttools)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
all_data <- pull_nhsn(
  limit = 1000,
  api_key_id = my_api_key_id,
  api_key_secret = my_api_key_secret,
  error_on_limit = FALSE
)
#> No encoding supplied: defaulting to UTF-8.
all_data
#> # A tibble: 1,000 × 157
#> weekendingdate jurisdiction numinptbeds numinptbedsadult numinptbedsped
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2020-08-08T00:00:00… AK 2387.0 432.14 40.0
#> 2 2020-08-15T00:00:00… AK 2364.86 432.14 40.0
#> 3 2020-08-22T00:00:00… AK 2327.43 431.86 40.14
#> 4 2020-08-29T00:00:00… AK 2337.57 431.69 40.71
#> 5 2020-09-05T00:00:00… AK 2327.43 553.43 62.0
#> 6 2020-09-12T00:00:00… AK 2357.71 525.83 62.0
#> 7 2020-09-19T00:00:00… AK 2396.38 478.71 63.43
#> 8 2020-09-26T00:00:00… AK 1998.11 1016.14 209.0
#> 9 2020-10-03T00:00:00… AK 1506.57 1061.71 190.86
#> 10 2020-10-10T00:00:00… AK 1519.71 1339.29 180.43
#> # ℹ 990 more rows
#> # ℹ 152 more variables: numinptbedsocc <chr>, numinptbedsoccadult <chr>,
#> # numinptbedsoccped <chr>, numicubeds <chr>, numicubedsadult <chr>,
#> # numicubedsped <chr>, numicubedsocc <chr>, numicubedsoccadult <chr>,
#> # numicubedsoccped <chr>, numconfc19hosppatsadult <chr>,
#> # numconfc19hosppatsped <chr>, totalconfc19hosppats <chr>,
#> # numconfc19icupatsadult <chr>, numconfc19icupatsped <chr>, …
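Note that every column comes back as character (the <chr> types above), including dates and counts. Below is a minimal, hedged sketch of converting them with base R; it uses a small hand-made data frame as a stand-in for a real pull (the column names match the output above, but the conversion choices are our own, not something forecasttools does for you).

```r
## Small stand-in for a pull_nhsn() result: all columns arrive as character
raw <- data.frame(
  jurisdiction = c("AK", "AK"),
  weekendingdate = c("2023-01-07T00:00:00.000", "2023-01-14T00:00:00.000"),
  totalconfflunewadm = c("32.0", "7.0")
)

## Parse the ISO 8601 timestamps to Date (trailing time-of-day is ignored)
## and the counts to numeric
raw$weekendingdate <- as.Date(raw$weekendingdate)
raw$totalconfflunewadm <- as.numeric(raw$totalconfflunewadm)

str(raw)
```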
Getting only relevant rows and columns
This is a huge table, and it can take a fair amount of time to download. We can speed things up by requesting only a subset of the data via a “query”. Queries are a way to ask data.cdc.gov for only the rows and columns we care about, which speeds up the download. pull_nhsn() provides pre-defined ways to apply some common queries.
Date limits
pull_nhsn() takes start_date and end_date arguments. If these are defined, it will query the dataset for only rows that fall between those dates, inclusive.
Column limits
pull_nhsn() takes a columns argument, which defaults to NULL. If columns is NULL, pull_nhsn() pulls all columns. Otherwise, it pulls the columns jurisdiction, weekendingdate, and any columns enumerated in columns.
An example
As an example, let’s pull only epiweekly incident influenza hospitalizations between 1 January 2023 and 31 December 2023, inclusive. Here, we’ll let pull_nhsn() look for an API key and secret in our environment variables.
influenza_2023 <- pull_nhsn(
  columns = c("totalconfflunewadm"),
  start_date = "2023-01-01",
  end_date = "2023-12-31"
)
#> No encoding supplied: defaulting to UTF-8.
influenza_2023
#> # A tibble: 2,964 × 3
#> jurisdiction weekendingdate totalconfflunewadm
#> <chr> <chr> <chr>
#> 1 AK 2023-01-07T00:00:00.000 32.0
#> 2 AK 2023-01-14T00:00:00.000 7.0
#> 3 AK 2023-01-21T00:00:00.000 7.0
#> 4 AK 2023-01-28T00:00:00.000 5.0
#> 5 AK 2023-02-04T00:00:00.000 3.0
#> 6 AK 2023-02-11T00:00:00.000 1.0
#> 7 AK 2023-02-18T00:00:00.000 2.0
#> 8 AK 2023-02-25T00:00:00.000 5.0
#> 9 AK 2023-03-04T00:00:00.000 2.0
#> 10 AK 2023-03-11T00:00:00.000 4.0
#> # ℹ 2,954 more rows
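As a quick illustration of downstream use, here is a hedged sketch of summing admissions across jurisdictions to get a national weekly total. It uses a small hand-made stand-in for influenza_2023 (a real pull needs network access); the as.numeric() step is needed because, as shown above, the API returns counts as character.

```r
## Stand-in for a pull_nhsn() result (a real pull requires network access)
flu <- data.frame(
  jurisdiction = c("AK", "AL", "AK", "AL"),
  weekendingdate = c(
    "2023-01-07T00:00:00.000", "2023-01-07T00:00:00.000",
    "2023-01-14T00:00:00.000", "2023-01-14T00:00:00.000"
  ),
  totalconfflunewadm = c("32.0", "123.0", "7.0", "110.0")
)

## Convert counts from character to numeric, then sum across
## jurisdictions within each week
flu$totalconfflunewadm <- as.numeric(flu$totalconfflunewadm)
national <- aggregate(
  totalconfflunewadm ~ weekendingdate,
  data = flu,
  FUN = sum
)
national
```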
By default, pull_nhsn() returns its output sorted first by jurisdiction, then by weekendingdate. You can configure this differently via the order_by keyword argument.
influenza_by_date <- pull_nhsn(
  columns = c("totalconfflunewadm"),
  start_date = "2023-01-01",
  end_date = "2023-12-31",
  order_by = c("weekendingdate", "jurisdiction")
)
#> No encoding supplied: defaulting to UTF-8.
influenza_by_date
#> # A tibble: 2,964 × 3
#> jurisdiction weekendingdate totalconfflunewadm
#> <chr> <chr> <chr>
#> 1 AK 2023-01-07T00:00:00.000 32.0
#> 2 AL 2023-01-07T00:00:00.000 123.0
#> 3 AR 2023-01-07T00:00:00.000 158.0
#> 4 AS 2023-01-07T00:00:00.000 0.0
#> 5 AZ 2023-01-07T00:00:00.000 386.0
#> 6 CA 2023-01-07T00:00:00.000 913.0
#> 7 CO 2023-01-07T00:00:00.000 158.0
#> 8 CT 2023-01-07T00:00:00.000 170.0
#> 9 DC 2023-01-07T00:00:00.000 17.0
#> 10 DE 2023-01-07T00:00:00.000 45.0
#> # ℹ 2,954 more rows