How to import survey data
Source:vignettes/How-to-import-survey-data.Rmd
How-to-import-survey-data.Rmd
This article describes how to import survey data from SAS into R. Non-survey data can be imported as well.
Generally speaking, you only need to import a data system from SAS into R once. After it has been imported, use the R version of the data, and share it with others.
SAS data file
The first step is to obtain the SAS data file that you would like to
import. In this example, we will use a file called
namcs2019_sas.sas7bdat
. This is the SAS data file for the
National Ambulatory Medical Care Survey (NAMCS) 2019 Public Use File
(PUF). For more information, and the location from which this SAS data
file can be downloaded, see nchsdata::namcs2019
.
SAS formats data file
Next, you need to create a SAS formats data file. This is a SAS data
file with information about formats; it should not be confused with a
SAS catalog file. This formats data file is generated in SAS using the
PROC FORMAT
command with the CNTLOUT
option,
as follows:
LIBNAME project 'c:\output';
PROC FORMAT CNTLOUT=project.formats_dataset;
VALUE AGERF
1 ='Under 15 years'
2 ='15-24 years'
3 ='25-44 years'
4 ='45-64 years'
5 ='65-74 years'
6 ='75 years and over'
;
/* ... */
RUN;
SAS saves the formats data file in the folder specified by the
LIBNAME
command.
Look for a specialized function
In importsurvey
, there are several functions named
import_sas_*
. See if any of these functions is specifically
designed for importing your survey or another survey that has
the same survey design variables.
To see the available function:
library(importsurvey)
help(package = "importsurvey")
Since we are trying to import NAMCS 2019 PUF, we notice a function
called import_sas_namcs2019puf
. When we check help for this
function, we see that this is exactly the correct function to use.
import_sas_namcs2019puf
can be used to import NAMCS 2019
PUF or any other survey that has the same survey design
variables. For example, the same function can be used to import another
survey, called NHAMCS ED 2020 PUF. (For more info on this survey, see
nchsdata::ed2020
.)
If a specialized function does not already exist, we will cover that scenario below.
Use the specialized function
Specify the following information:
- the SAS survey data file
- the SAS formats data file
- the name of an
.rds
file to be created. This file can be read later withreadRDS()
. - a short name for the survey
import_sas_namcs2019puf(sas_data = "namcs2019_sas.sas7bdat"
, sas_formats_data = "namcs_formats_dataset.sas7bdat"
, r_out = "namcs_2019_puf.rds"
, label = "NAMCS 2019 PUF")
The survey name, specified by the label
argument, is
important. It is used by functions in the surveytable
package to allow the user to verify that they are analyzing the correct
survey.
The above command will likely print some messages that list certain kinds of variables. Check all messages to see whether anything is unexpected, which could indicate a mistake.
Variables that have no format. Did you intend for
these variables to not have a format? In this list, you expect to see
survey design variables (such as PATWT
) and numeric
variables (such as NUMMED
).
Format applies only to some values, created multiple version of variable.
Some numeric variables contain actual values as well as certain
“special values”, which often indicate some kind of missingness. For
example, consider the BMI
(body mass index) variable. Some
observations have the actual value of BMI, such as 25, while other
observations have a “special value”, which is one of the following: -9,
meaning “Missing data”; -7, meaning “Not calculated”; or -5, meaning
“Height or weight outside acceptable ranges”.
When importsurvey
encounters such a variable, in
addition to the original variable, it creates two more variables, as
follows. The analyst should use the version of the variable that is
appropriate to their specific analysis.
-
X.nospecial
: a numeric variable in which all values specified in the formats are converted to missing values (NA
) – only the “non-special” values are retained; and -
X.special
: a factor variable in which all values not specified in the formats are given a value of “Other”. (For BMI, a categorical variable, with categories being “Missing data”, “Not calculated”, “Height or weight outside acceptable ranges”, and “Other”.)
Thus, for BMI, one would generally not use the BMI
variable directly, because it contains both values of BMI as well as
special values. BMI.nospecial
could be used in numeric
calculations, such as to calculate the mean of known values.
BMI.special
could be used to calculate the percentages of
each of the special values.
Yes/no variable - converted to logical. These
variables have been converted to logical variables, with values being
TRUE
, FALSE
, or NA
.
They can be used in logical expressions, which makes using the data
easier. For example, to select visits in which the expected source of
payment is both Medicare and Medicaid, instead of typing
PAYMCARE == "Yes" & PAYMCAID == "Yes"
one only needs to
type PAYMCARE & PAYMCAID
.
Many category values not used - possible error. Many category values were not used. Was this done on purpose?
In addition to checking the above messages, please verify that the survey design variables are correct.
Verify some estimates
Load the survey that you just imported, and check some basic estimates, to see that they make sense.
mysurvey = readRDS("namcs_2019_puf.rds")
class(mysurvey)
#> [1] "survey.design2" "survey.design"
This is a survey design object. It can be analyzed using the
survey
or the surveytable
packages.
set_survey("mysurvey")
#> _
#> Survey name NAMCS 2019 PUF
#> Number of variables 1100
#> Number of observations 8250
#> Stratified 1 - level Cluster Sampling design (with replacement)
#> With (398) clusters.
#> svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT,
#> data = d1)
#> * To adjust how counts are rounded, see ?set_count_int
Verify survey name, survey design variables, and the number of observations.
Does the estimate of the total seem correct?
total()
Number (000) | SE (000) | LL (000) | UL (000) |
---|---|---|---|
1,036,484 | 48,836 | 945,014 | 1,136,809 |
(Checked presentation standards. Nothing to report.) |
Do the estimates of a couple of basic variables seem correct?
tab("AGER", "SEX")
Level | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|
Under 15 years | 117,917 | 14,097 | 93,229 | 149,142 | 11.4 | 1.3 | 8.9 | 14.2 |
15-24 years | 64,856 | 7,018 | 52,387 | 80,292 | 6.3 | 0.6 | 5.1 | 7.5 |
25-44 years | 170,271 | 13,966 | 144,925 | 200,049 | 16.4 | 1.1 | 14.3 | 18.8 |
45-64 years | 309,506 | 23,290 | 266,994 | 358,787 | 29.9 | 1.4 | 27.2 | 32.6 |
65-74 years | 206,866 | 14,366 | 180,481 | 237,109 | 20 | 1.2 | 17.6 | 22.5 |
75 years and over | 167,069 | 15,179 | 139,746 | 199,735 | 16.1 | 1.3 | 13.7 | 18.8 |
(Checked presentation standards. Nothing to report.) |
Level | Number (000) | SE (000) | LL (000) | UL (000) | Percent | SE | LL | UL |
---|---|---|---|---|---|---|---|---|
Female | 605,045 | 34,866 | 540,377 | 677,452 | 58.4 | 1.9 | 54.6 | 62.1 |
Male | 431,439 | 27,664 | 380,436 | 489,280 | 41.6 | 1.9 | 37.9 | 45.4 |
(Checked presentation standards. Nothing to report.) |
What if a specialized import function does not exist?
What if a specialized import function for your survey or data does not exist?
In the importsurvey
package, all of the
import_sas_*
functions are just short functions that call a
function called import_sas()
. See how the
import_sas_*
functions are written, and try to issue
similar commands for the survey that you are trying to import.
The import_sas_*
functions have 3 main steps, as
follows:
Call
import_sas()
to convert the SAS survey data file to a data frame with survey data. Schematically,SAS -> data.frame
.Call
survey::svydesign()
to convert the data frame to a survey design object. Schematically,data.frame -> survey.design
.Label the survey design object, and save it to an RDS file.
For example, let’s examine import_sas_namcs2019puf
and
try to identify these 3 steps.
importsurvey::import_sas_namcs2019puf
#> function (sas_data, sas_formats_data, r_out, label = "")
#> {
#> assert_that(!file.exists(r_out), msg = paste0("Output file ",
#> r_out, " already exists."))
#> assert_that(is.string(label), nzchar(label))
#> d1 = import_sas(sas_data, sas_formats_data, formats = "attr",
#> bool_levels = c("yes", "no"), bool_true = "yes", bool_false = "no",
#> keep_unformatted = c("DIAG1", "DIAG2", "DIAG3", "DIAG4",
#> "DIAG5"))
#> sdo = svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT,
#> data = d1)
#> attr(sdo, "label") = label
#> attr(sdo$call, "srcref") = NULL
#> message("\n*** Please verify that the correct survey design variables are used (ids, strata, weights, fpc, etc.): ")
#> print(sdo)
#> saveRDS(sdo, r_out)
#> }
#> <bytecode: 0x0000000024d5afb0>
#> <environment: namespace:importsurvey>
Please be sure to read help on import_sas()
.
There are a couple of important issues to keep in mind when calling
import_sas()
.
Converting Yes/No variables to logical variables. This is an optional step. While you do not need to do this, it makes working with the survey easier.
A variable that has levels listed in the bool_levels
argument will be converted to a logical variable.
- By default,
bool_levels = c("yes", "no")
- For the RCC SU survey, a more involved value was appropriate,
namely,
c("yes", "selected", "no", "not selected", "missing")
.
bool_true
: of the above values, which should be set to
TRUE
?
- By default,
bool_true = "yes"
. - For the RCC SU survey,
bool_true = c("yes", "selected")
.
Likewise, bool_false
tells us which of the above values
should be set to FALSE
.
- By default,
bool_false = "no"
. - For the RCC SU survey,
bool_false = c("no", "not selected")
.
Any values in bool_levels
that are not
in bool_true
or bool_false
set to
NA
(missing). Thus, for RCC SU, “missing” is set to
NA
.
Pairing variables and formats. You have to tell
import_sas()
which formats go with which variables. You do
this with the formats
argument.
formats = "attr"
: Each variable in the data file has an attribute calledformat.sas
with the name of the SAS format. This was used for importing the NAMCS survey.formats = "name"
: The variable name and the format name are the same. This was used for importing the RCC survey.formats = "funcname"
: You have to specify a function in theformats_func
argument. This function takes the variable name and returns the format name. This was used for importing the RCC SU survey.
For RCC SU,
formats_func = function(name) toupper(paste0(name, "f"))
.
That is, when the variable name is sex
, the format name is
SEXF
.