How to import survey data • importsurvey

This article describes how to import survey data from SAS into R. Non-survey data can be imported as well.

Generally speaking, you only need to import a data system from SAS into R once. After it has been imported, use the R version of the data, and share it with others.

SAS data file

The first step is to obtain the SAS data file that you would like to import. In this example, we will use a file called namcs2019_sas.sas7bdat. This is the SAS data file for the National Ambulatory Medical Care Survey (NAMCS) 2019 Public Use File (PUF). For more information, and the location from which this SAS data file can be downloaded, see nchsdata::namcs2019.

SAS formats data file

Next, you need to create a SAS formats data file. This is a SAS data file with information about formats; it should not be confused with a SAS catalog file. This formats data file is generated in SAS using the PROC FORMAT command with the CNTLOUT option, as follows:

LIBNAME project 'c:\output';
PROC FORMAT CNTLOUT=project.formats_dataset;
   VALUE AGERF   
       1 ='Under 15 years'
       2 ='15-24 years'
       3 ='25-44 years'
       4 ='45-64 years'
       5 ='65-74 years'
       6 ='75 years and over'
;
/* ... */
RUN;

SAS saves the formats data file in the folder specified by the LIBNAME command.

Look for a specialized function

In importsurvey, there are several functions named import_sas_*. See if any of these functions is specifically designed for importing your survey or another survey that has the same survey design variables.

To see the available function:

library(importsurvey)
help(package = "importsurvey")

Since we are trying to import NAMCS 2019 PUF, we notice a function called import_sas_namcs2019puf. When we check help for this function, we see that this is exactly the correct function to use.

import_sas_namcs2019puf can be used to import NAMCS 2019 PUF or any other survey that has the same survey design variables. For example, the same function can be used to import another survey, called NHAMCS ED 2020 PUF. (For more info on this survey, see nchsdata::ed2020.)

If a specialized function does not already exist, we will cover that scenario below.

Use the specialized function

Specify the following information:

the SAS survey data file
the SAS formats data file
the name of an .rds file to be created. This file can be read later with readRDS().
a short name for the survey

import_sas_namcs2019puf(sas_data = "namcs2019_sas.sas7bdat"
, sas_formats_data = "namcs_formats_dataset.sas7bdat"
, r_out = "namcs_2019_puf.rds"
, label = "NAMCS 2019 PUF")

The survey name, specified by the label argument, is important. It is used by functions in the surveytable package to allow the user to verify that they are analyzing the correct survey.

The above command will likely print some messages that list certain kinds of variables. Check all messages to see whether anything is unexpected, which could indicate a mistake.

Variables that have no format. Did you intend for these variables to not have a format? In this list, you expect to see survey design variables (such as PATWT) and numeric variables (such as NUMMED).

Format applies only to some values, created multiple version of variable.

Some numeric variables contain actual values as well as certain “special values”, which often indicate some kind of missingness. For example, consider the BMI (body mass index) variable. Some observations have the actual value of BMI, such as 25, while other observations have a “special value”, which is one of the following: -9, meaning “Missing data”; -7, meaning “Not calculated”; or -5, meaning “Height or weight outside acceptable ranges”.

When importsurvey encounters such a variable, in addition to the original variable, it creates two more variables, as follows. The analyst should use the version of the variable that is appropriate to their specific analysis.

X.nospecial: a numeric variable in which all values specified in the formats are converted to missing values (NA) – only the “non-special” values are retained; and
X.special: a factor variable in which all values not specified in the formats are given a value of “Other”. (For BMI, a categorical variable, with categories being “Missing data”, “Not calculated”, “Height or weight outside acceptable ranges”, and “Other”.)

Thus, for BMI, one would generally not use the BMI variable directly, because it contains both values of BMI as well as special values. BMI.nospecial could be used in numeric calculations, such as to calculate the mean of known values. BMI.special could be used to calculate the percentages of each of the special values.

Yes/no variable - converted to logical. These variables have been converted to logical variables, with values being TRUE, FALSE, or NA.

They can be used in logical expressions, which makes using the data easier. For example, to select visits in which the expected source of payment is both Medicare and Medicaid, instead of typing PAYMCARE == "Yes" & PAYMCAID == "Yes" one only needs to type PAYMCARE & PAYMCAID.

Many category values not used - possible error. Many category values were not used. Was this done on purpose?

In addition to checking the above messages, please verify that the survey design variables are correct.

Verify some estimates

Load the survey that you just imported, and check some basic estimates, to see that they make sense.

mysurvey = readRDS("namcs_2019_puf.rds")

class(mysurvey)
#> [1] "survey.design2" "survey.design"

This is a survey design object. It can be analyzed using the survey or the surveytable packages.

library(surveytable)

set_survey("mysurvey")
#>                        _             
#> Survey name            NAMCS 2019 PUF
#> Number of variables    1100          
#> Number of observations 8250          
#> Stratified 1 - level Cluster Sampling design (with replacement)
#> With (398) clusters.
#> svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT, 
#>     data = d1)
#> * To adjust how counts are rounded, see ?set_count_int

Verify survey name, survey design variables, and the number of observations.

Does the estimate of the total seem correct?

total()

Total {NAMCS 2019 PUF}
Number (000)	SE (000)	LL (000)	UL (000)
1,036,484	48,836	945,014	1,136,809
(Checked presentation standards. Nothing to report.)

Do the estimates of a couple of basic variables seem correct?

tab("AGER", "SEX")

Patient age recode {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Under 15 years	117,917	14,097	93,229	149,142	11.4	1.3	8.9	14.2
15-24 years	64,856	7,018	52,387	80,292	6.3	0.6	5.1	7.5
25-44 years	170,271	13,966	144,925	200,049	16.4	1.1	14.3	18.8
45-64 years	309,506	23,290	266,994	358,787	29.9	1.4	27.2	32.6
65-74 years	206,866	14,366	180,481	237,109	20	1.2	17.6	22.5
75 years and over	167,069	15,179	139,746	199,735	16.1	1.3	13.7	18.8
(Checked presentation standards. Nothing to report.)

Patient sex {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Female	605,045	34,866	540,377	677,452	58.4	1.9	54.6	62.1
Male	431,439	27,664	380,436	489,280	41.6	1.9	37.9	45.4
(Checked presentation standards. Nothing to report.)

What if a specialized import function does not exist?

What if a specialized import function for your survey or data does not exist?

In the importsurvey package, all of the import_sas_* functions are just short functions that call a function called import_sas(). See how the import_sas_* functions are written, and try to issue similar commands for the survey that you are trying to import.

The import_sas_* functions have 3 main steps, as follows:

Call import_sas() to convert the SAS survey data file to a data frame with survey data. Schematically, SAS -> data.frame.
Call survey::svydesign() to convert the data frame to a survey design object. Schematically, data.frame -> survey.design.
Label the survey design object, and save it to an RDS file.

For example, let’s examine import_sas_namcs2019puf and try to identify these 3 steps.

importsurvey::import_sas_namcs2019puf
#> function (sas_data, sas_formats_data, r_out, label = "") 
#> {
#>     assert_that(!file.exists(r_out), msg = paste0("Output file ", 
#>         r_out, " already exists."))
#>     assert_that(is.string(label), nzchar(label))
#>     d1 = import_sas(sas_data, sas_formats_data, formats = "attr", 
#>         bool_levels = c("yes", "no"), bool_true = "yes", bool_false = "no", 
#>         keep_unformatted = c("DIAG1", "DIAG2", "DIAG3", "DIAG4", 
#>             "DIAG5"))
#>     sdo = svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT, 
#>         data = d1)
#>     attr(sdo, "label") = label
#>     attr(sdo$call, "srcref") = NULL
#>     message("\n*** Please verify that the correct survey design variables are used (ids, strata, weights, fpc, etc.): ")
#>     print(sdo)
#>     saveRDS(sdo, r_out)
#> }
#> <bytecode: 0x0000000024d5afb0>
#> <environment: namespace:importsurvey>

Please be sure to read help on import_sas().

There are a couple of important issues to keep in mind when calling import_sas().

Converting Yes/No variables to logical variables. This is an optional step. While you do not need to do this, it makes working with the survey easier.

A variable that has levels listed in the bool_levels argument will be converted to a logical variable.

By default, bool_levels = c("yes", "no")
For the RCC SU survey, a more involved value was appropriate, namely, c("yes", "selected", "no", "not selected", "missing").

bool_true: of the above values, which should be set to TRUE?

By default, bool_true = "yes".
For the RCC SU survey, bool_true = c("yes", "selected").

Likewise, bool_false tells us which of the above values should be set to FALSE.

By default, bool_false = "no".
For the RCC SU survey, bool_false = c("no", "not selected").

Any values in bool_levels that are not in bool_true or bool_false set to NA (missing). Thus, for RCC SU, “missing” is set to NA.

Pairing variables and formats. You have to tell import_sas() which formats go with which variables. You do this with the formats argument.

formats = "attr": Each variable in the data file has an attribute called format.sas with the name of the SAS format. This was used for importing the NAMCS survey.
formats = "name": The variable name and the format name are the same. This was used for importing the RCC survey.
formats = "funcname": You have to specify a function in the formats_func argument. This function takes the variable name and returns the format name. This was used for importing the RCC SU survey.

For RCC SU, formats_func = function(name) toupper(paste0(name, "f")). That is, when the variable name is sex, the format name is SEXF.