Tips and tricks

Conditional independence test

Suppose you want to test whether the proportion of observations at a specific level of one variable is equal across all levels of another variable. For example, suppose you are interested in visits by patients aged 65-74 years (AGER value of "65-74 years"). You’d like to test whether the proportion of these visits differs across values of physician specialty (SPECCAT).

To perform this test, use the tab_subset() function with the test argument set to the level of interest, in this case "65-74 years":

library(surveytable)
set_survey(namcs2019sv)

Survey info {NAMCS 2019 PUF}
Variables	Observations	Design
33	8,250	Stratified 1 - level Cluster Sampling design (with replacement) With (398) clusters. namcs2019sv = survey::svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT , data = namcs2019sv_df)

tab_subset("AGER", "SPECCAT", test = "65-74 years")

Patient age recode = ‘65-74 years’ (for different levels of Type of specialty (Primary, Medical, Surgical))
Level	n	Number	SE	LL	UL	Percent	SE	LL	UL
Primary care specialty	411	85,504,462	10,209,964	67,580,662	108,182,027	16.4	1.6	13.4	19.8
Surgical care specialty	787	53,481,659	6,404,521	42,226,824	67,736,277	24.9	2.8	19.6	30.9
Medical care specialty	463	67,879,861	10,944,662	49,319,409	93,425,199	22.6	2.9	17.1	29.0

Conditional independence test of Patient age recode = ‘65-74 years’ across all pairs of Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
Level 1	Level 2	Test statistic	Degrees of freedom	p-value	Flag
Primary care specialty	Surgical care specialty	2.70	238	0.008	*
Primary care specialty	Medical care specialty	1.88	206	0.061
Surgical care specialty	Medical care specialty	-0.57	221	0.572
Design-based t-test. *: p <= 0.05

Reset the options

Depending on your coding style, you might be changing options in multiple locations in your code. In these situations, set_opts(reset = TRUE) might be useful. This command resets all surveytable options to their default values.

Verify the results

Some analysts might wish to compare the output from surveytable to the output from other statistical software, such as SAS / SUDAAN. In this situation, set_opts(output = "raw") might be useful. This command tells surveytable to print unformatted and unrounded tables. In addition, when performing hypothesis testing, this option prints the test statistic and the degrees of freedom, not just the p-value.

library(surveytable)
set_survey(namcs2019sv)

Survey info {NAMCS 2019 PUF}
Variables	Observations	Design
33	8,250	Stratified 1 - level Cluster Sampling design (with replacement) With (398) clusters. namcs2019sv = survey::svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT , data = namcs2019sv_df)

set_opts(output = "raw")
tab("SPECCAT", test = TRUE)
set_opts(reset = TRUE)

## * Generating unformatted / raw output.
## Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
##                     Level    n    Number       SE        LL        UL  Percent
## 1  Primary care specialty 2993 521466378 31136212 463840192 586251877 50.31107
## 2 Surgical care specialty 3050 214831829 31110335 161661415 285489984 20.72697
## 3  Medical care specialty 2207 300186150 43496739 225806019 399066973 28.96196
##         SE       LL       UL
## 1 2.576021 45.12608 55.49110
## 2 2.989343 15.09426 27.33542
## 3 3.557853 22.10191 36.61234
## N = 8250.
## 
## Comparison of all pairs of Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
##                   Level 1                 Level 2 Test statistic
## 1  Primary care specialty Surgical care specialty      -6.223460
## 2  Primary care specialty  Medical care specialty      -3.686136
## 3 Surgical care specialty  Medical care specialty       1.382098
##   Degrees of freedom      p-value Flag
## 1                239 2.162375e-09    *
## 2                207 2.908114e-04    *
## 3                222 1.683308e-01     
## Design-based t-test. *: p <= 0.05
## * Resetting all options to their default values.

Subset a survey

Consider this example, in which we estimate the number of medications by age group:

library(surveytable)
set_survey(namcs2019sv)

Survey info {NAMCS 2019 PUF}
Variables	Observations	Design
33	8,250	Stratified 1 - level Cluster Sampling design (with replacement) With (398) clusters. namcs2019sv = survey::svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT , data = namcs2019sv_df)

tab_subset("NUMMED", "AGER")

Number of medications coded (for different levels of Patient age recode) {NAMCS 2019 PUF}
Level	% known	Mean	SEM	LL	UL	SD
Under 15 years	100	1.58	0.168	1.25	1.91	1.75
15-24 years	100	1.64	0.112	1.42	1.86	1.70
25-44 years	100	2.15	0.225	1.71	2.59	2.74
45-64 years	100	3.49	0.303	2.90	4.09	4.49
65-74 years	100	4.44	0.431	3.60	5.29	5.03
75 years and over	100	5.53	0.494	4.56	6.50	5.59

What if we’d like to estimate the same thing, but only for the visits for which NUMMED > 0?

One way to do this is to create another survey object for which NUMMED > 0, and then analyze this new survey object.

newsurvey = survey_subset(namcs2019sv, NUMMED > 0
  , label = "NAMCS 2019 PUF: NUMMED 1+")
set_survey(newsurvey)

Survey info {NAMCS 2019 PUF: NUMMED 1+}
Variables	Observations	Design
33	5,738	Stratified 1 - level Cluster Sampling design (with replacement) With (374) clusters. survey_subset(namcs2019sv, NUMMED > 0, label = "NAMCS 2019 PUF: NUMMED 1+")

Note that we called set_survey() to tell surveytable that we now want to analyze the new object newsurvey, not namcs2019sv.

Now, let’s create the table:

tab_subset("NUMMED", "AGER")

Number of medications coded (for different levels of Patient age recode) {NAMCS 2019 PUF: NUMMED 1+}
Level	% known	Mean	SEM	LL	UL	SD
Under 15 years	100	2.34	0.157	2.03	2.64	1.66
15-24 years	100	2.34	0.116	2.11	2.57	1.58
25-44 years	100	3.04	0.257	2.53	3.54	2.81
45-64 years	100	4.92	0.358	4.22	5.62	4.62
65-74 years	100	6.02	0.445	5.15	6.89	4.98
75 years and over	100	7.29	0.457	6.39	8.18	5.32

Be sure to check the table title to verify that you are tabulating the new survey object.

Advanced variable editing and data flow

Advanced variable editing

First, let’s review what I call “advanced variable editing”.

surveytable provides a number of functions to create or modify survey variables, such as var_collapse() and var_cut(). Occasionally, you might need to do more advanced variable editing.

Keep in mind that every survey object has an element called variables. This is the data frame that holds the survey’s variables. To edit variables directly:

1. Create a new variable in the variables data frame (which is part of the survey object).
2. Call set_survey() again. Any time you modify the variables data frame, call set_survey().
3. Tabulate the new variable.

For an example of this, see vignette("Example-Residential-Care-Community-Services-User-NSLTCP-RCC-SU-report").
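As a minimal sketch of the steps above, assuming the namcs2019sv survey that ships with surveytable, direct editing might look like this. The names mysurvey and many_meds are hypothetical, chosen only for illustration.

```r
library(surveytable)

# Work on a copy of the example survey in the global environment.
# `mysurvey` is a hypothetical name used for this illustration.
mysurvey = namcs2019sv

# Step 1: create a new variable directly in the `variables` data frame.
# `many_meds` is a hypothetical variable: TRUE for visits with 5 or more
# medications coded.
mysurvey$variables$many_meds = mysurvey$variables$NUMMED >= 5

# Step 2: call set_survey() so that surveytable uses the modified object.
set_survey(mysurvey)

# Step 3: tabulate the new variable.
tab("many_meds")
```

If the set_survey() call were skipped, tab() would not see many_meds, because surveytable would not yet have received the modified object.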

Data flow

The above explanation raises the question of why set_survey() must be called again after variables is modified.

The survey that you’re analyzing actually exists in three separate places:

1. A file in your computer’s storage that contains the survey object. For example, it could be an RDS file on your hard disk drive, named something like mysurvey.rds.
2. The survey object in R’s global environment, named something like mysurvey.
3. A hidden copy of the survey object that’s used by surveytable. This is what surveytable analyzes.

You might ask why (3) exists separately from (2). That’s due to an arcane issue with how R packages work: both (2) and (3) are necessary.

Normally, information only flows forwards, from (1) to (2) and from (2) to (3).

Forwards flow:

Going from (1) to (2): call readRDS().
Going from (2) to (3): call set_survey().

Backwards flow:

Going from (3) to (2): you probably don’t need this, but see below. If you really need this, use surveytable:::.load_survey().
Going from (2) to (1): call saveRDS(). You probably don’t want to do this: normally, the survey file (mysurvey.rds) should probably not be changed.
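The forwards flow can be sketched as follows. To keep the sketch self-contained, it first writes the example survey namcs2019sv to a temporary file that stands in for mysurvey.rds. The temporary path and the names survey_file and mysurvey are assumptions made for illustration only.

```r
library(surveytable)

# Setup for this sketch only: create a stand-in for (1), the survey file.
# In practice, the file (something like mysurvey.rds) would already exist.
survey_file = tempfile(fileext = ".rds")
saveRDS(namcs2019sv, survey_file)

# (1) -> (2): read the survey object from the file into the global environment.
mysurvey = readRDS(survey_file)

# (2) -> (3): hand the survey object over to surveytable.
set_survey(mysurvey)

# tab() analyzes (3), the copy held by surveytable.
tab("AGER")
```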

The functions for modifying or creating variables that are part of the surveytable package (like var_cut() or var_collapse()) modify (3). Since (3) is what surveytable works with and tabulates, you can use var_collapse(), and then immediately use tab(). You don’t need to do anything extra in between.

If you are modifying the variables data frame directly, you are modifying (2). After you modify (2), you need to copy it over to (3), so that surveytable can use it. You do that by calling set_survey().

Thus, any time you modify variables yourself, call set_survey(). You modify (2), then copy (2) -> (3) by calling set_survey().

On the flip side, the changes that you make in (3) (using surveytable functions like var_cut() or var_collapse()) are not reflected in (2). If you make changes in (3), then call set_survey(), those changes are lost, because set_survey() copies (2) -> (3). If those changes were important, you can just rerun the code that created them. If you really need to go from (3) to (2), use mysurvey = surveytable:::.load_survey().
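The following sketch illustrates this behavior under the description above, using the namcs2019sv example survey. The names mysurvey, age_group, and in_global are hypothetical, and the age cut-off is arbitrary.

```r
library(surveytable)

mysurvey = namcs2019sv   # (2): the object in the global environment
set_survey(mysurvey)     # (2) -> (3)

# var_cut() creates the variable in (3), the copy held by surveytable.
# `age_group` is a hypothetical variable name.
var_cut("age_group", "AGE", c(-Inf, 64, Inf), c("Under 65", "65 and over"))
tab("age_group")         # works: tab() analyzes (3)

# The change is not reflected in (2).
in_global = "age_group" %in% names(mysurvey$variables)
print(in_global)

# Calling set_survey() again copies (2) -> (3), so the new variable is lost.
set_survey(mysurvey)

# To get the variable back, rerun the code that created it.
var_cut("age_group", "AGE", c(-Inf, 64, Inf), c("Under 65", "65 and over"))
tab("age_group")
```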