Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Introduction to Survey Data Analysis Linda K. Owens, PhD Assistant Director for Sampling & Analysis Survey Research Laboratory University of Illinois at Chicago 1 Focus of the seminar Data cleaning/missing data Sampling bias reduction Survey Research Laboratory 2 When analyzing survey data... 1. Understand & evaluate survey design 2. Screen the data 3. Adjust for sampling design Survey Research Laboratory 3 1. Understand & evaluate survey Conductor of survey Sponsor of survey Measured variables Unit of analysis Mode of data collection Dates of data collection Survey Research Laboratory 4 1. Understand & evaluate survey Geographic coverage Respondent eligibility criteria Sample design Sample size & response rate Survey Research Laboratory 5 Levels of measurement Nominal Ordinal Interval Ratio Survey Research Laboratory 6 2. Data screening ALWAYS examine raw frequency distributions for… (a) out-of-range values (outliers) (b) missing values Survey Research Laboratory 7 2. Data screening Out-of-range values: Delete data Recode values Survey Research Laboratory 8 Missing data: can reduce effective sample size may introduce bias Survey Research Laboratory 9 Reasons for missing data Refusals (question sensitivity) Don’t know responses (cognitive problems, memory problems) Not applicable Data processing errors Questionnaire programming errors Design factors Attrition in panel studies Survey Research Laboratory 10 Effects of ignoring missing data Reduced sample size – loss of statistical power Data may no longer be representative– introduces bias Difficult to identify effects Survey Research Laboratory 11 Assumptions on missing data Missing completely at random (MCAR) Missing at random (MAR) Ignorable Nonignorable Survey Research Laboratory 12 Missing completely at random (MCAR) Being missing is independent from any variables. Cases with complete data are indistinguishable from cases with missing data. Missing cases are a random subsample of original sample. Survey Research Laboratory 13 Missing at random (MAR) The probability of a variable being observed is independent of the true value of that variable controlling for one or more variables. Example: Probability of missing income is unrelated to income within levels of education. Survey Research Laboratory 14 Ignorable missing data The data are MAR. The missing data mechanism is unrelated to the parameters we want to estimate. Survey Research Laboratory 15 Nonignorable missing data The pattern of data missingness is non-MAR. Survey Research Laboratory 16 Methods of handling missing data Listwise (casewise) deletion: uses only complete cases Pairwise deletion: uses all available cases Dummy variable adjustment: Missing value indicator method Mean substitution: substitute mean value computed from available cases (cf. unconditional or conditional) Survey Research Laboratory 17 Methods of handling missing data Regression methods: predict value based on regression equation with other variables as predictors Hot deck: identify the most similar case to the case with a missing and impute the value Survey Research Laboratory 18 Methods of handling missing data Maximum likelihood methods: use all available data to generate maximum likelihood-based statistics. Survey Research Laboratory 19 Methods of handling missing data Multiple imputation: combines the methods of ML to produce multiple data sets with imputed values for missing cases Survey Research Laboratory 20 Multiple Imputation Software SAS—user written IVEware, experimental MI & MIANALYZE STATA—user written ICE R—user written libraries and functions SOLAS S-PLUS Survey Research Laboratory 21 Methods of handling missing data: summary Listwise deletion assumes MCAR or MAR Dummy variable adjustment is biased, don’t use Conditional mean substitution assumes MCAR within cells Hot Deck & Regression improvement on mean substitution but... Multiple Imputation is least biased but computationally intensive Survey Research Laboratory 22 Missing Data Final Point All methods of imputation underestimate true variance artificial increase in sample size values treated as though obtained by data collection Rao (1996) On variance estimation with imputed survey data. Jrn. Am. Stat. Assoc. 91:499-506 Fay (1996) in bibliography Survey Research Laboratory 23 Types of survey sample designs Simple Random Sampling Systematic Sampling Complex sample designs stratified designs cluster designs mixed mode designs Survey Research Laboratory 24 Why complex sample designs? Increased efficiency Decreased costs Survey Research Laboratory 25 Why complex sample designs? Statistical software packages with an assumption of SRS underestimate the sampling variance. Not accounting for the impact of complex sample design can lead to a biased estimate of the sampling variance (Type I error). Survey Research Laboratory 26 Sample weights Used to adjust for differing probabilities of selection. In theory, simple random samples are self-weighted. In practice, simple random samples are likely to also require adjustments for nonresponse. Survey Research Laboratory 27 Types of sample weights Poststratification weights: designed to bring the sample proportions in demographic subgroups into agreement with the population proportion in the subgroups. Nonresponse weights: designed to inflate the weights of survey respondents to compensate for nonrespondents with similar characteristics. “Blow-up” (expansion) weights: provide estimates for the total population of interest. Survey Research Laboratory 28 Syntax examples of design-based analysis in STATA, SUDAAN, & SAS STATA svyset svyset svyset svyreg strata strata psu psu pweight finalwt fatitk age male black hispanic SUDAAN proc regress data=”c:\nhanes.sav” filetype=spss desgn=wr; nest strata psu; weight finalwt subpgroup sex race; levels 2 3; model fatintk = age sex race; Survey Research Laboratory 29 Syntax examples of design-based analysis in STATA, SUDAAN, & SAS SAS proc surveyreg data=nhanes; strata strata; cluster psu; class sex race; model fatintk = age sex race; weight finalwt Survey Research Laboratory 30 In summary, when analyzing survey data... Understand & evaluate survey design Screen the data – deal with missing data & outliers. If necessary, adjust for study design using weights and appropriate computer software. Survey Research Laboratory 31 Bibliography On website with slides Big names in area of missing data are: • Roderick Little • Donald B. Rubin • Paul Allison • See also McKnight et.al. Missing Data: A Gentle Introduction Large government datasets (NSFG, NHANES, NHIS) often include detailed methodological documentation on imputation and weight construction. Survey Research Laboratory 32 Thank You! www.srl.uic.edu Survey Research Laboratory 33