About me

University of Nebraska, Omaha
- B.S. Computer Science and Mathematics (2009)
- M.S. Mathematics, Data Science (May, 2018)
Software Engineer (2004-present)
Flight Operations, U.S. Army National Guard (2000-2009)

A fun experiment

Step 1: Pick a random percentage. e.g. 54%, 28%, 77%, etc.
Step 2: Type that number into Google followed by “of Americans”
Step 3: Follow rabbit hole for hours

Simple random sample

Pólya urn model
With replacement (SRSWR) or without replacement (SRSWOR)
- With replacement - draws are i.i.d.
- Without replacement - draws are not i.i.d. but are still exchangeable
Requires access to the entire population
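
As a minimal sketch of the two variants in base R, assuming a hypothetical population of 100 labelled units (the population, sample size, and seed are illustrative choices, not part of the original slides):

```r
set.seed(42)  # fixed seed so the draw is reproducible

# Hypothetical population of 100 labelled units (an assumption for illustration)
population <- seq_len(100)
n <- 10

# SRSWR: each draw is independent, so a unit may be selected more than once
srswr <- sample(population, size = n, replace = TRUE)

# SRSWOR: draws are dependent, so every selected unit is distinct
srswor <- sample(population, size = n, replace = FALSE)

srswr
srswor
```

Both calls need the full list of units up front, which is the practical meaning of requiring access to the entire population.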

Sampling design

Sampling Plan	Design-based inference	Model-based inference
Probability sample	A	C
Model-dependent sample	B	D
Quota sampling	E	F
Convenience sampling	G	H
Snowball sampling	I	J
Peer nomination	K	L

When we talk about survey sample design or sampling strategy, we’re talking about two components: a sampling plan and a method for drawing inferences from that sample.

There are essentially two main sampling plan approaches: Probability sampling plans and non-probability sampling plans.

Cells A and D are the natural pairings, but the hybrid approaches of B and C are not uncommon in survey research. Model-based inference, which you can think of as primarily Bayesian modeling, is increasingly visible, for example in the work of Nate Silver at FiveThirtyEight or Nate Cohn at the NYT. It is also becoming more common to run model-based inference on probability samples.

However, most research is still done with design-based inference.

A probability sampling plan assigns a known, non-zero probability of selection to every member of the sampling frame. Units are then selected at random according to those probabilities.

Model-dependent sampling plans, on the other hand, assume the statistics of interest follow a known probability distribution. They then seek to draw samples that maximize the precision (or minimize the variance) of estimation for the statistics of interest.

Model-dependent plans are not common in survey practice. Surveys are generally intended to be multi-purpose, and a sample drawn under the wrong model can lead to biased estimates.

Design-based inference is non-parametric, in that it relies only on the probability of each observation’s inclusion.
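
One way to see this is the Horvitz-Thompson estimator of a population total, which uses nothing but the sampled values and their inclusion probabilities. The sketch below is only an illustration: the frame, the variable `y`, the inclusion probabilities `pi`, and the use of Poisson sampling (each unit included independently) are all assumptions made for the example.

```r
set.seed(1)

# Hypothetical sampling frame: 1,000 units with a value y and a known,
# non-zero inclusion probability pi that differs between units.
N <- 1000
frame <- data.frame(
  y  = rgamma(N, shape = 2, scale = 5),
  pi = runif(N, min = 0.05, max = 0.50)
)

# Poisson sampling: include each unit independently with probability pi.
selected <- runif(N) < frame$pi
smp <- frame[selected, ]

# Horvitz-Thompson estimate of the population total:
# each sampled value is weighted by the inverse of its inclusion probability.
ht_total <- sum(smp$y / smp$pi)

true_total <- sum(frame$y)
c(estimate = ht_total, truth = true_total)
```

No distribution is assumed for `y` when forming the estimate; the gamma draw only generates fake data to sample from.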

Model-based inference is based on a probability distribution for the random variable of interest.

Design effects

“deft”
Similar to variance inflation factor (VIF)
Effective sample size

\[ D^2(\hat{\theta}) = \frac{SE(\hat{\theta})^2_{complex}}{SE(\hat{\theta})^2_{srs}} = \frac{var(\hat{\theta})_{complex}}{var(\hat{\theta})_{srs}} \]

\[ n_{eff} = \frac{n_{complex}}{d^2(\hat{\theta})} \]
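
The two formulas can be sketched as a pair of small helpers. The function names and the input values are hypothetical, chosen only so the arithmetic is easy to follow.

```r
# Design effect: ratio of the variance under the complex design to the
# variance under a simple random sample of the same size.
design_effect <- function(se_complex, se_srs) {
  (se_complex / se_srs)^2
}

# Effective sample size: the SRS size that would give the same precision.
effective_n <- function(n_complex, deff) {
  n_complex / deff
}

# Hypothetical inputs (assumptions for illustration only).
deff  <- design_effect(se_complex = 0.03, se_srs = 0.02)
n_eff <- effective_n(n_complex = 900, deff = deff)

deff   # about 2.25
n_eff  # about 400
```

Under these made-up numbers, 900 observations from the complex design carry roughly the information of 400 observations from a simple random sample.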

Clustering

Grouping people by geographic regions
SRS to choose a geographic region
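
A minimal base R sketch of one-stage cluster sampling, assuming a hypothetical population of 2,000 people spread over 20 regions (all names and sizes here are illustrative):

```r
set.seed(7)

# Hypothetical population: 2,000 people spread over 20 geographic regions.
people <- data.frame(
  person_id = seq_len(2000),
  region    = sample(paste0("region_", 1:20), size = 2000, replace = TRUE)
)

# Simple random sample (without replacement) of 4 regions.
regions <- sort(unique(people$region))
chosen  <- sample(regions, size = 4)

# One-stage cluster design: keep every person in a chosen region.
cluster_sample <- people[people$region %in% chosen, ]

table(cluster_sample$region)
```

The randomization happens at the region level, so people in the same region are either all in the sample or all out of it.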

Stratification

Weighting

\(N = 51\)

Weighting

\(N_{men} = 30\)
\(p_{men} = \frac{30}{51} = 0.588\)

Weighting

\(N_{women} = 21\)
\(p_{women} = \frac{21}{51} = 0.412\)

Weighting

\(N_{women} = 21\)
Women odds: \(\frac{p_{women}}{p_{men}} = \frac{0.412}{0.588} = 0.701\)
Men odds: \(\frac{p_{men}}{p_{women}} = \frac{0.588}{0.412} = 1.427\)
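
The arithmetic can be reproduced from the counts. The sketch below computes the ratios from the unrounded proportions, so they can differ in the third decimal from values computed with rounded proportions. The last step is an addition for illustration only: it assumes a hypothetical target population that is 50% men and 50% women, which the slides do not state.

```r
# Sample counts from the slides
n_men   <- 30
n_women <- 21
n_total <- n_men + n_women   # 51

# Sample proportions
p_men   <- n_men / n_total
p_women <- n_women / n_total

# Odds of each group relative to the other
odds_men   <- p_men / p_women
odds_women <- p_women / p_men

round(c(p_men = p_men, p_women = p_women), 3)
round(c(odds_men = odds_men, odds_women = odds_women), 3)

# ASSUMPTION (not from the slides): the target population is 50/50.
# Each weight is the population share divided by the sample share.
w_men   <- 0.5 / p_men
w_women <- 0.5 / p_women

# After weighting, both groups contribute equally to the total.
c(men = n_men * w_men, women = n_women * w_women)
```

Under that assumed 50/50 target, the under-represented group (women) gets a weight above 1 and the over-represented group (men) a weight below 1.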

HCUP Nationwide Inpatient Sample

Healthcare Cost and Utilization Project
Must be purchased

NIS Sampling Design

1988-2011: 100% of discharges from a 20% sample of HCUP hospitals
2012-present: a 20% sample of discharges from 100% of HCUP hospitals
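
The difference between the two designs can be sketched with simulated records. Everything below is hypothetical (50 hospitals with 200 discharges each), and the second design is simplified to a plain random sample of discharges; the actual NIS selection procedure is more involved.

```r
set.seed(2012)

# Hypothetical discharge records: 50 hospitals with 200 discharges each.
discharges <- data.frame(
  hospital  = rep(seq_len(50), each = 200),
  discharge = seq_len(50 * 200)
)

# 1988-2011 style: sample 20% of hospitals, keep all of their discharges.
hospitals_old <- sample(unique(discharges$hospital), size = 10)
sample_old <- discharges[discharges$hospital %in% hospitals_old, ]

# 2012-present style: keep every hospital, sample 20% of discharges.
# (Simplified to an SRS of discharges for illustration.)
rows_new   <- sample(nrow(discharges), size = nrow(discharges) %/% 5)
sample_new <- discharges[rows_new, ]

# Same number of records, very different hospital coverage.
c(old = nrow(sample_old), new = nrow(sample_new))
c(old = length(unique(sample_old$hospital)),
  new = length(unique(sample_new$hospital)))
```

Both samples contain 2,000 records, but the first covers only 10 hospitals while the second spreads the same number of records across the hospitals in the frame.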

NIS Complex Survey Design

Clustered on hospital ID
Weights included in discwt field for national estimates
1988-2011: Stratified by census region and bed size
2012-present: Stratified by census division and bed size
- Region 1 (Northeast)
  - Division 1 (New England)
  - Division 2 (Mid Atlantic)
- Region 2 (Midwest)
  - Division 3 (East North Central)
  - Division 4 (West North Central) (incl. Nebraska)
- Region 3 (South)
  - Division 5 (South Atlantic)
  - Division 6 (East South Central)
  - Division 7 (West South Central)
- Region 4 (West)
  - Division 8 (Mountain)
  - Division 9 (Pacific)
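
When combining years from before and after the 2012 redesign, it can help to map divisions back to regions. The lookup below simply encodes the list above; the object names and the example division codes are hypothetical and do not refer to actual NIS field names.

```r
# Census division (1-9) to census region (1-4), following the list above.
division_to_region <- c(
  `1` = 1, `2` = 1,            # Northeast
  `3` = 2, `4` = 2,            # Midwest
  `5` = 3, `6` = 3, `7` = 3,   # South
  `8` = 4, `9` = 4             # West
)

region_labels <- c("Northeast", "Midwest", "South", "West")

# Hypothetical vector of division codes, e.g. one per hospital.
divisions <- c(4, 9, 1, 6)

regions <- unname(division_to_region[as.character(divisions)])
region_labels[regions]
```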

NIS dimensions

Big data?
Definitely large data
~3GB per year (raw CSV)

Importance of survey design

Treating as SRS

summary(lm(los~age, data=cdiff))

## 
## Call:
## lm(formula = los ~ age, data = cdiff)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14.19  -7.12  -3.98   2.27 349.01 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.190744   0.187955   75.50   <2e-16 ***
## age         -0.043275   0.002687  -16.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.94 on 73264 degrees of freedom
## Multiple R-squared:  0.003528,   Adjusted R-squared:  0.003514 
## F-statistic: 259.4 on 1 and 73264 DF,  p-value: < 2.2e-16

Importance of survey design

Accounting for survey design with R survey package

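library('survey')

# Hospitals are the clusters; discwt and nis_stratum carry the weights and strata.
cdiff.design <- svydesign(ids = ~hospid, data = cdiff, weights = ~discwt,
                          strata = ~nis_stratum, nest = TRUE)
# Same regression as before, now with design-based standard errors.
summary(svyglm(los ~ age, design = cdiff.design))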

## 
## Call:
## svyglm(formula = los ~ age, design = cdiff.design)
## 
## Survey design:
## svydesign(ids = ~hospid, data = cdiff, weights = ~discwt, strata = ~nis_stratum, 
##     nest = TRUE)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.95231    0.55033  25.353  < 2e-16 ***
## age         -0.04657    0.00637  -7.311 6.27e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 180.579)
## 
## Number of Fisher Scoring iterations: 2
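
To put the two outputs side by side, the standard errors can be compared in the spirit of the design effect. This is only a rough sketch: the `lm()` fit is also unweighted, so the squared ratio mixes the effects of weighting, clustering, and stratification and is not a textbook design effect. The numbers are copied from the two summaries above.

```r
# Standard errors copied from the two model summaries above.
se_srs     <- c(intercept = 0.187955, age = 0.002687)  # lm()
se_complex <- c(intercept = 0.55033,  age = 0.00637)   # svyglm()

# Squared ratio of standard errors, analogous to a design effect.
se_ratio_sq <- (se_complex / se_srs)^2

# Observations in the lm() fit: residual degrees of freedom + 2 coefficients.
n <- 73264 + 2

# Rough effective sample size for each coefficient.
n_eff <- n / se_ratio_sq

round(se_ratio_sq, 2)
round(n_eff)
```

The point estimates barely move between the two fits, but the naive standard errors are several times too small once the hospital clustering is taken into account.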

SRS vs. complex design

SRS line in red, complex design in blue

Research design checklist

Khera and Krumholz, 2017

Data interpretation checklist

Khera and Krumholz, 2017

Data analysis checklist

Khera and Krumholz, 2017

Sources and Further Reading

Heeringa, S., West, B. T., Berglund, P. A., Applied Survey Data Analysis, 2nd Ed., CRC Press (2017)
Kalton, G., Introduction to Survey Sampling, SAGE Publications (1983)
Khera, R., Krumholz, H., With Great Power Comes Great Responsibility: Big Data Research From the National Inpatient Sample, Circulation: Cardiovascular Quality and Outcomes (2017)
Lumley, T., R package ‘survey’ (2018)

Complex Survey Design and the NIS

Brian Detweiler

April 20, 2018
