NESUG 2010
Statistics and Analysis
Survey Analysis: Options for Missing Data
Paul Gorrell, IMPAQ International, LLC, Columbia, MD
Abstract
Researchers working with survey data commonly face the analysis of missing data, often due to nonresponse. In addition to observations with missing values for analysis variables, SAS® excludes observations if the weight or any of the design variables (strata, cluster, domain) has a missing value. This paper discusses two options available with the SAS survey procedures (e.g. SURVEYFREQ, SURVEYMEANS): the MISSING option and the NOMCAR option. The MISSING option is used with categorical variables to instruct SAS to treat missing values as a valid category. The NOMCAR option (new with version 9.2) is used when the default assumption that missing values for analysis variables are missing completely at random (i.e. the group of non-respondents does not differ in any relevant respect from the group of respondents) is not appropriate. Use of the NOMCAR option instructs SAS to perform a domain analysis of missing and nonmissing values. Specific examples are used to illustrate the effect of these two options on variance estimation and the computation of confidence limits.
Introduction
A useful starting point for the discussion of missing data in this paper is the following text from the SAS documentation
section on Missing Values for PROC SURVEYMEANS:
(1)
By default, when computing statistics for an analysis variable, PROC SURVEYMEANS omits
observations with missing values for that variable. The procedure computes statistics for each
variable based only on observations that have nonmissing values for that variable. This treatment
is based on the assumption that the missing values are missing completely at random (MCAR).
However, this assumption is sometimes not true. For example, evidence from other surveys might
suggest that observations with missing values are systematically different from observations without
missing values. If you believe that missing values are not missing completely at random, then you
can specify the NOMCAR option to let variance estimation include these observations with missing
values in the analysis variables.
For the analysis of complex surveys another factor comes into play, i.e. the omission of observations with missing values
potentially removes important information with respect to the design properties of the survey, e.g. strata and cluster information. We will see, in the discussion of an example from the Medical Expenditure Panel Survey (MEPS) below, that
the use of the NOMCAR option can be an alternative to using a DOMAIN analysis which is often recommended instead
of prior restricting of analyses to target subpopulations.
The effect of using the NOMCAR option is given in (2).
(2)
When the NOMCAR option is used, the procedure treats observations with and without missing
values for analysis variables as two different domains, and it performs a domain analysis in the
domain of nonmissing observations.
Although SAS 9.2 includes options for replication methods of variance estimation (BRR, Jackknife), the NOMCAR option only applies to the default Taylor series method. Note also the reference in (2) to analysis variables. In contrast,
the MISSING option affects categorical variables. The text in (3) is from the SAS 9.2 documentation for PROC
SURVEYMEANS.
(3)
[The MISSING option] treats missing values as a valid (nonmissing) category for all categorical
variables, which includes CLASS, STRATA, CLUSTER, and DOMAIN variables.
By default, if you do not specify the MISSING option, an observation is excluded from the
analysis if it has a missing value.
Note that SAS' characterization of a variable as categorical is based on its use on one of the listed statements (e.g.
DOMAIN), and not on the variable's values or range of values.
The rest of this paper consists of three examples: Example 1 shows the effect of the NOMCAR option with a simple stratified sample with missing data for the analysis variable; Example 2 shows the effect of the MISSING option for a similar stratified sample with missing values for a categorical variable used in the DOMAIN statement; Example 3 uses a more complex example to compare the effect of using the NOMCAR option with a DOMAIN analysis based on missing values for an analysis variable.
The goal of this paper is to illustrate some of the effects you will observe when using the NOMCAR and MISSING options. This paper is by no means an exhaustive discussion of the topic. Nor does it advise you when it is appropriate to
use or not use these options. Often this is determined solely by the design properties of the survey data you are analyzing
and/or your research goals. A discussion of all the different design and analytic factors to consider is beyond the scope of
this paper. But the examples discussed below should give you a concrete sense of the use of these options, as well as
specific questions to consider when weighing their use.
Example 1 (Spending on Ice Cream by Grade Level)
This example is taken directly from the SAS 9.2 documentation for PROC SURVEYMEANS (Example 85.4, Analyzing Survey Data with Missing Values). In this example students from three grades (7, 8, and 9) are sampled and asked about their spending on ice cream (a more readable listing of the ICECREAM data set, sorted by GRADE and SPENDING, is given in Appendix A). The value of WEIGHT is assigned as the inverse of the probability of selection (1/PROB). For each grade, PROB is defined as the ratio of the number sampled to the total number of students. Not shown here is a separate data set (STUDENTTOTALS) which has the total number of students for each grade, i.e. the population totals for each stratum.
(4)
DATA ICECREAM;
   INPUT GRADE SPENDING @@;
   /* Selection probability = number sampled / number of students in the grade */
   IF GRADE = 7 THEN PROB = 20/1824;
   IF GRADE = 8 THEN PROB = 9/1025;
   IF GRADE = 9 THEN PROB = 11/1151;
   WEIGHT = 1/PROB;
   DATALINES;
7 7  7 7   8 .   9 10  7 .   7 10  7 3   8 20  8 19  7 2
7 .  9 15  8 16  7 6   7 6   7 6   9 15  8 17  8 14  9 .
9 8  9 7   7 3   7 12  7 4   9 14  8 18  9 9   7 2   7 1
7 4  7 11  9 8   8 .   8 13  7 .   9 .   9 11  7 2   7 9
;
RUN;
For comparison purposes we will first show output for the SURVEYMEANS code shown in (5). Here the mean and sum
are requested. Although not germane to the missing data issues discussed here, the STUDENTTOTALS data set is used
to compute a finite population correction for variance estimation (it is included here to maintain consistency with the SAS
documentation example). In the code below GRADE is the stratification variable, SPENDING the analysis variable, and
WEIGHT the weight variable. The LIST option on the STRATA statement requests a Stratum Information table as part
of the procedure output.
(5)
PROC SURVEYMEANS DATA= ICECREAM TOTAL=STUDENTTOTALS MEAN SUM;
STRATA GRADE / LIST;
VAR SPENDING;
WEIGHT WEIGHT;
RUN;
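The STUDENTTOTALS data set passed to TOTAL= is not shown in this paper. A minimal sketch of what it might look like follows, assuming the stratum totals are the denominators used for PROB in (4) and that SURVEYMEANS reads them from a variable named _TOTAL_, matched on the STRATA variable. The layout is a reconstruction for illustration, not the author's file.

```sas
/* Assumed reconstruction: one row per stratum (GRADE), */
/* with the population total held in _TOTAL_.           */
DATA STUDENTTOTALS;
   INPUT GRADE _TOTAL_;
   DATALINES;
7 1824
8 1025
9 1151
;
RUN;
```

With a data set of this shape in place, the procedure can apply a finite population correction within each grade.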
The Data Summary table in (6) below lists the number of strata (i.e. grades), the number of observations (cf. the PROC
PRINT output in Appendix A), and the weighted sum (i.e. the sum of the population total for all grades).
The Stratum Information table lists descriptive information for each stratum (grade). The “N Obs” column shows the number sampled and the “N” column shows the number of observations with nonmissing values for the analysis variable SPENDING. Subtracting N from N Obs shows that Grade 7 has 3 missing values and Grades 8 and 9 have 2 missing values each (see Appendix A).
The Statistics table shows the requested MEAN and SUM, along with the variance estimate for each. For these estimates the observations with missing values were excluded. As stated, this is the default SAS behavior.
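The missing-value counts per grade can also be checked directly, before any survey procedure is run. The following is a small sketch using PROC MEANS on the ICECREAM data set from (4); it is an unweighted count only and plays no part in the variance estimation.

```sas
/* N = count of nonmissing SPENDING values,   */
/* NMISS = count of missing values, by grade. */
PROC MEANS DATA=ICECREAM N NMISS;
   CLASS GRADE;
   VAR SPENDING;
RUN;
```

For the data in (4), NMISS should come out as 3 for Grade 7 and 2 each for Grades 8 and 9, matching the differences between N Obs and N described above.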
(6) Output for the SURVEYMEANS code in (5).

Data Summary
Number of Strata              3
Number of Observations       40
Sum of Weights             4000

Stratum Information
Stratum            Population   Sampling
  Index    GRADE        Total       Rate   N Obs   Variable     N
      1        7         1824      1.10%      20   SPENDING    17
      2        8         1025      0.88%       9   SPENDING     7
      3        9         1151      0.96%      11   SPENDING     9

Statistics
                         Std Error
Variable        Mean       of Mean      Sum        Std Dev
SPENDING    9.770542      0.541381    32139    1780.792065
Keeping especially the estimates of variance in mind, we now modify the example by including the NOMCAR option to
see its effect.
(7)
PROC SURVEYMEANS DATA= ICECREAM TOTAL=STUDENTTOTALS NOMCAR MEAN SUM;
STRATA GRADE / LIST;
VAR SPENDING;
WEIGHT WEIGHT;
RUN;
The Data Summary and Stratum Information tables are unchanged from the prior example, so they are not reproduced below. The output in (8) does show a new Variance Estimation table that reflects the inclusion of the NOMCAR option. As stated, this option is specific to the Taylor series method for variance estimation, and this method is listed in the table, as is the fact that observations with missing values for the analysis variable are included.
(8) Output for the SURVEYMEANS code in (7)

Variance Estimation
Method            Taylor Series
Missing Values    Included (NOMCAR)

Statistics
                         Std Error
Variable        Mean       of Mean      Sum        Std Dev
SPENDING    9.770542      0.652347    32139    3515.126876
Of particular interest here is the difference in the standard error for the mean and the standard deviation for the sum. But
first note that the point estimates (MEAN, SUM) are unaffected. It is only the variance estimation which is affected.
This is particularly important when variance estimates are used to determine if two point estimates (e.g. the MEAN or
SUM in different years) are significantly different. Standard errors and standard deviations tend to be larger when the
NOMCAR option is used than when the assumption is made that missing values are missing completely at random. This
is certainly the case with the example shown. Therefore the assumption that missing values are not missing completely at
random is the more-conservative assumption.
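The description in (2) says that NOMCAR performs a domain analysis in the domain of nonmissing observations. A sketch of the equivalent explicit coding for the ice cream data follows. The names ICECREAM_DOM, RESPONDED, and SPENDING0 are hypothetical and introduced only for illustration. Missing values are replaced by a placeholder of 0 so that the procedure keeps those observations; the placeholder is not an imputed value.

```sas
/* Sketch only: ICECREAM_DOM, RESPONDED and SPENDING0 are hypothetical names. */
DATA ICECREAM_DOM;
   SET ICECREAM;
   RESPONDED = NOT MISSING(SPENDING);   /* 1 = nonmissing, 0 = missing      */
   SPENDING0 = COALESCE(SPENDING, 0);   /* placeholder so the row is kept   */
RUN;

PROC SURVEYMEANS DATA=ICECREAM_DOM TOTAL=STUDENTTOTALS MEAN SUM;
   STRATA GRADE;
   VAR SPENDING0;
   DOMAIN RESPONDED;                    /* read the RESPONDED = 1 row only  */
   WEIGHT WEIGHT;
RUN;
```

If the documentation's description holds, the RESPONDED = 1 row of the Domain Analysis table would be expected to show the same standard error and standard deviation as (8). The overall Statistics table in this run describes SPENDING0 rather than SPENDING and should be disregarded.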
Example 2 (Spending on Ice Cream: Domain Analysis, Parent's Education)
This example modifies the input data used in Example 1 by adding a new, binary, variable (PARENT_ED) which indicates if the student's parent completed high school or college. In addition to values of COLLEGE or HIGHSCHOOL, in
this data set, the variable also has missing values.
The code below is similar to that in (5), except for the inclusion of the DOMAIN statement. In addition to the Statistics
table we saw in Example 1, the use of the DOMAIN statement will generate an output Domain Analysis table.
(9)
PROC SURVEYMEANS DATA= ICECREAM TOTAL=STUDENTTOTALS MEAN SUM;
STRATA GRADE / LIST;
VAR SPENDING;
DOMAIN PARENT_ED;
WEIGHT WEIGHT;
RUN;
In the Statistics table below the overall estimates (Mean, Sum and their variance estimates) are identical to those we saw
for Example 1 when the NOMCAR option was not used. The Domain Analysis table shows these estimates for the subpopulations of students with parents with either a college or high school education. Note that observations are not included if the value of PARENT_ED is missing.
(10) Output for the SURVEYMEANS code in (9)

Statistics
                         Std Error
Variable        Mean       of Mean      Sum        Std Dev
SPENDING    9.770542      0.541381    32139    1780.792065

Domain Analysis: PARENT_ED
                                      Std Error
PARENT_ED    Variable        Mean       of Mean            Sum        Std Dev
COLLEGE      SPENDING    9.412570      1.038571          15182    3000.080373
HIGHSCHOOL   SPENDING    8.662245      1.650628    7655.674747    2787.963856
Below we add the MISSING option in order to include all observations in the data set, including those for students where
we don’t have information about their parents' education.
(11)
PROC SURVEYMEANS DATA= ICECREAM TOTAL=STUDENTTOTALS MISSING MEAN SUM;
STRATA GRADE / LIST;
VAR SPENDING;
DOMAIN PARENT_ED;
WEIGHT WEIGHT;
RUN;
(12) Output for the SURVEYMEANS code in (11)

Statistics
                         Std Error
Variable        Mean       of Mean      Sum        Std Dev
SPENDING    9.770542      0.541381    32139    1780.792065

Domain Analysis: PARENT_ED
                                       Std Error
PARENT_ED    Variable         Mean       of Mean            Sum        Std Dev
(missing)    SPENDING    11.734845      1.519611    9301.014141    3361.373465
COLLEGE      SPENDING     9.412570      1.139641          15182    3363.156310
HIGHSCHOOL   SPENDING     8.662245      1.661525    7655.674747    2885.062890
In the Domain Analysis table above we now see three rows for the PARENT_ED domain variable, one of them for students whose PARENT_ED value is missing. In addition to providing estimates for this subpopulation, the inclusion of these observations, by changing the total number of observations within each stratum, has also changed the variance estimates. For example, the Std Dev for students whose parents attended college is 3,000 when the observations with missing PARENT_ED values are excluded. But the Std Dev for this same group is 3,363 when those observations are included, i.e. including these observations with missing values yields more-conservative estimates of reliability. This difference points to the importance of determining, for the survey analysis you are conducting, whether or not it is appropriate to exclude missing values for categorical variables in generating variances for your estimates.
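Before choosing between the two treatments, it can help to see how many observations have a missing domain value in each stratum. A minimal sketch follows, assuming the ICECREAM data set already carries the PARENT_ED variable described for this example. It is an unweighted tabulation, separate from the survey analysis itself.

```sas
/* Unweighted counts of PARENT_ED within GRADE;              */
/* MISSING makes the missing category appear as a table row. */
PROC FREQ DATA=ICECREAM;
   TABLES GRADE*PARENT_ED / MISSING NOROW NOCOL NOPERCENT;
RUN;
```

A stratum in which most of the PARENT_ED values are missing is one where the choice between default behavior and the MISSING option is likely to matter most.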
Next we turn to a real-world example using data from the Medical Expenditure Panel Survey.
Example 3 (Hospital Stay Expenses)
The Medical Expenditure Panel Survey (MEPS) is a complex national probability survey of the civilian noninstitutionalized population. Each year MEPS collects healthcare utilization, expenditure and other information for approximately 32,000 individuals. Public use files (PUFs) are released each year. The data in the example discussed below is from the 2006 MEPS Full-Year Consolidated Data file (HC-105), available for download from the Agency For
Healthcare Research and Quality’s Web site (http://www.meps.ahrq.gov/mepsweb).
In order to use MEPS data for national estimates, person- and family-level weights are developed and released on the
annual public-use files. In the example used here the 2006 person-level weight variable PERWT06F is used. In addition,
the MEPS sample design includes stratification, clustering, multiple stages of selection, and disproportionate sampling.
Because of these complex design properties, it is not appropriate to assume simple random sampling for variance estimation. To obtain accurate variance estimates an appropriate technique to derive standard errors associated with the
weighted estimates must be used. Several methods for estimating standard errors for estimates from complex surveys
have been developed, including the Taylor-series linearization method, balanced repeated replication, and the jack-knife
method. The MEPS public use files include variables to obtain weighted estimates and to implement a Taylor-series approach to estimate standard errors for weighted survey estimates. These variables, which jointly reflect the MEPS survey
design, include the estimation weight, sampling strata, and the cluster or primary sampling unit (PSU).
Standard errors for MEPS estimates normally require the analytic file to contain all of the MEPS sample persons (e.g.,
those with positive values for the person weight variable) in order for the analysis to correctly account for the MEPS strata and PSUs. Subsetting to a population of interest (e.g. persons with a particular condition, procedure, or utilization),
although normally an efficient programming move, potentially removes important stratification and clustering information from the analysis procedure. Indeed this is often the reason to use a survey procedure such as SURVEYMEANS or
SURVEYFREQ rather than their counterparts MEANS and FREQ.
In the examples discussed below the following design variables will be used: PERWT06F (person-level weight variable);
VARSTR (stratum variable); VARPSU (PSU, i.e. cluster, variable). The analysis variable is IPFEXP06 (2006 inpatient
hospital stay facility expenses).
Consider a situation where you are asked to generate the mean and total person-level expenditures for hospital stays in
2006, but only for persons with hospital stay expenses. This is a typical way to look at average expenditures because the
majority of persons will have zero hospital-stay expenses in a given year. You could remove persons with zero expenses
from the analysis by deleting them from the input data set. But this conflicts with the recommendation not to subset in
this way because it removes important strata and cluster (PSU) information from the variance estimation calculations. As
Machlin et al (2005) point out,
(12)
“Analyses are often limited to a subgroup of the population. However, creating a special analysis file
that contains only observations for the subgroup of interest may yield incorrect standard errors…
because all of the observations corresponding to a stage of the MEPS sample design may be deleted.
Therefore, it is advisable to preserve the entire survey design structure for the program by reading in
the entire person-level file.”
One apparent alternative is to recode the zero values to missing, as in (13) below, in order to exclude these observations
from the analysis. This would indeed exclude those observations since the analysis procedure will omit observations with
missing values for the analysis variable. But this would be equivalent to the prior subsetting already discussed.
(13)
DATA IP2006M;
   SET CDATA.H105 (KEEP= IPFEXP06 VARSTR VARPSU PERWT06F);
   /* Recode zero expenses to missing so the procedure omits them */
   IF IPFEXP06 = 0
      THEN IPFEXP06 = . ;
RUN;
(14)
PROC SURVEYMEANS DATA= IP2006M MEAN SUM;
STRATA VARSTR ;
CLUSTER VARPSU;
VAR IPFEXP06;
WEIGHT PERWT06F;
RUN;
(15) Output for the SURVEYMEANS code in (14)

Data Summary
Number of Strata                                203
Number of Clusters                              451
Number of Observations                        34145
Number of Observations Used                   32577
Number of Obs with Nonpositive Weights         1568
Sum of Weights                            299267035

Statistics
                                               Std Error
Variable    Label                      Mean      of Mean             Sum        Std Dev
IPFEXP06    HOSP FACILITY EXPENSES    12584   467.184002    264914200265    13268732359
As we saw with the previous examples, the Data Summary table contains the basic information for the number of strata,
clusters (PSUs), etc. Note that the number of observations used in this table is the number of observations with a positive
weight. The sum of observations used and the number of observations with nonpositive weights is the number of observations (32,577 + 1,568 = 34,145).
The Statistics table above shows that, in 2006, the mean, per-person, hospital stay expense, for those with a stay, is
$12,584. The standard error for this estimate is 467.18. The total expense is $264.9 billion, with a standard deviation of
13.3 billion.
Having seen in (15) the variance estimates when persons with zero expenses (recoded to missing) are excluded from the analysis, we modify the example to include the NOMCAR option. Note that the input data set here still has zero values recoded to missing.
(16)
PROC SURVEYMEANS DATA= IP2006M NOMCAR MEAN SUM;
STRATA VARSTR / LIST;
CLUSTER VARPSU;
VAR IPFEXP06;
WEIGHT PERWT06F;
RUN;
(17) Output for the SURVEYMEANS code in (16)

Variance Estimation
Method            Taylor Series
Missing Values    Included (NOMCAR)

Statistics
                                               Std Error
Variable    Label                      Mean      of Mean             Sum        Std Dev
IPFEXP06    HOSP FACILITY EXPENSES    12584   477.532869    264914200265    14171118082
Again, as we saw in the ice cream example, neither the mean nor the total estimate is affected by the use of the NOMCAR option. But the standard error for the mean and the standard deviation for the sum are both larger: when observations with zero hospital expenses are excluded from the analysis, the standard error is 467.18, but it is 477.53 when these observations are included. Similarly, when the zero-expense records are excluded, the standard deviation is 13.3 billion, but it is 14.2 billion when those observations are included.
As the SAS documentation says, when the NOMCAR option is used, the analysis procedure “treats observations with and without missing values for analysis variables as two different domains, and it performs a domain analysis in the domain of nonmissing observations.” We can see this explicitly if we consider that, prior to the introduction of the NOMCAR option with version 9.2, the only alternative was to create a domain variable and use the DOMAIN statement to instruct SAS to perform a domain analysis. Consider the domain variable SUBPOP created in (18) and used in (19). Here the zero values for IPFEXP06 have not been recoded to missing, but rather keep their original value.
(18)
DATA IP2006;
SET CDATA.H105 (KEEP= IPFEXP06 VARSTR VARPSU PERWT06F);
IF IPFEXP06 > 0
THEN SUBPOP = 'WITH EXP';
ELSE SUBPOP = 'WITHOUT EXP';
RUN;
(19)
PROC SURVEYMEANS DATA= IP2006 MEAN SUM;
STRATA VARSTR ;
CLUSTER VARPSU;
VAR IPFEXP06;
WEIGHT PERWT06F;
DOMAIN SUBPOP;
RUN;
(20) Output for the SURVEYMEANS code in (19)

Statistics
                                                    Std Error
Variable    Label                           Mean      of Mean             Sum        Std Dev
IPFEXP06    HOSP FACILITY EXPENSES    885.210094    40.482798    264914200265    14171118082

Domain Analysis: SUBPOP
                                                          Std Error
SUBPOP      Variable    Label                      Mean      of Mean             Sum        Std Dev
WITH EXP    IPFEXP06    HOSP FACILITY EXPENSES    12584   477.532869    264914200265    14171118082
WITHOUT     IPFEXP06    HOSP FACILITY EXPENSES        0            0               0              0
Here the Statistics table gives the estimates for the full population, i.e. persons with and without a hospital stay expense.
The Domain Analysis table shows the estimates of interest, i.e. those for persons with an expense (WITH EXP). What is
important to note here is that both the standard error and the standard deviation are identical to those produced by use of
the NOMCAR option. This follows from the fact that the NOMCAR option is, behind the scenes, performing the domain
analysis explicitly coded in (18) and (19).
One potential advantage of using the explicit DOMAIN analysis here is that the output more accurately reflects the input data and the analysis performed. The NOMCAR option, although potentially a useful shortcut, masks both the properties of the input data and the fact that a domain analysis is being performed.
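A side note on the labels in (20): the second domain prints as WITHOUT rather than WITHOUT EXP. This is consistent with how a DATA step sizes a character variable. Its length is fixed by the first assignment the compiler encounters, here the 8-character value 'WITH EXP', so the longer value is truncated. The estimates are not affected. If the full label is wanted, a minimal variant of (18) declares the length before the first assignment:

```sas
DATA IP2006;
   SET CDATA.H105 (KEEP= IPFEXP06 VARSTR VARPSU PERWT06F);
   LENGTH SUBPOP $ 11;               /* long enough for 'WITHOUT EXP' */
   IF IPFEXP06 > 0
      THEN SUBPOP = 'WITH EXP';
      ELSE SUBPOP = 'WITHOUT EXP';
RUN;
```

With this change the Domain Analysis table would be expected to show the same numbers, with only the domain label differing.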
Summary
This paper has illustrated two options of potential use when working with survey data with missing values. The MISSING option overrides SAS' default behavior of excluding observations where the value of a categorical variable is missing. Instead it treats missing values as a valid analysis category. The NOMCAR option is intended for use when the default assumption, that missing values of the analysis variable are missing completely at random, is not justified. This option instructs SAS to perform a domain analysis for observations with and without missing values for the analysis variable.
References
Machlin, S., Yu, W., and Zodet, M. Computing Standard Errors for MEPS Estimates. January 2005. Agency for Healthcare Research and Quality, Rockville, MD. Available at:
HTTP://WWW.MEPS.AHRQ.GOV/SURVEY_COMP/STANDARD_ERRORS.JSP.
Disclaimer
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute
Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
Contact Information
Paul Gorrell
IMPAQ International, LLC
10420 Little Patuxent Parkway
Columbia, MD 21044
APPENDIX A
ICECREAM DATA SET

Obs   GRADE   SPENDING        Obs   GRADE   SPENDING
  1       7          .         21       8          .
  2       7          .         22       8          .
  3       7          .         23       8         13
  4       7          1         24       8         14
  5       7          2         25       8         16
  6       7          2         26       8         17
  7       7          2         27       8         18
  8       7          3         28       8         19
  9       7          3         29       8         20
 10       7          4         30       9          .
 11       7          4         31       9          .
 12       7          6         32       9          7
 13       7          6         33       9          8
 14       7          6         34       9          8
 15       7          7         35       9          9
 16       7          7         36       9         10
 17       7          9         37       9         11
 18       7         10         38       9         14
 19       7         11         39       9         15
 20       7         12         40       9         15