UNDERSTANDING AND ADDRESSING MISSING DATA
DANIELLE ROBBINS, PHD
CINDY SANGALANG, PHD
AUGUST 28, 2013

OVERVIEW
Goals for the workshop:
• Understand what missing data is and why it matters
• Describe ways to address missing data
• Gain exposure to advanced methods for addressing missing data
Topics covered:
• Patterns of missing data
• Conventional approaches for handling missing data
• Advanced topics: maximum likelihood (ML) and multiple imputation (MI)

WHAT IS MISSING DATA?
Observations/cases with missing values on any of the variables in an analysis.
[Figure: variables-by-cases grid, with "?" marking the missing values]

WHY CAN MISSING DATA BE A PROBLEM?
• Reduced sample size
• Potential bias

HOW MIGHT DATA BE MISSING?
• Study participants: item nonresponse
• Study design: attrition (panel studies)
• Error in data recording

WHEN IS MISSING DATA A PROBLEM?
It depends on:
• The amount of missing data
• Whether adequate statistical power remains to detect effects
• The pattern of missingness

MISSING DATA PATTERNS: UNDERLYING ASSUMPTIONS
A classification system developed by Rubin and colleagues (Rubin, 1976; Little & Rubin, 1987, 2002). In other words, is the reason for missing data random or not random?
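The three mechanisms in Rubin's classification (MCAR, MAR, and NMAR, defined in the next slides) can be made concrete with a small simulation. The following Python sketch is illustrative only, not code from the workshop; all variable names are invented. It generates two correlated variables and deletes values of y under each mechanism, then compares the mean of the remaining observed values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two correlated variables: x is always observed, y may be missing.
# The true mean of y is 0.
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# MCAR: missingness is a pure coin flip, unrelated to any variable.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness on y depends only on the *observed* variable x.
mar_mask = rng.random(n) < np.where(x > 0, 0.5, 0.1)

# NMAR: missingness on y depends on the (unobserved) value of y itself.
nmar_mask = rng.random(n) < np.where(y > 0, 0.5, 0.1)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("NMAR", nmar_mask)]:
    observed = y[~mask]
    print(f"{name}: mean of observed y = {observed.mean():+.3f}")
```

Under MCAR the complete cases remain a random subsample, so the observed mean stays near the true value of 0. Under MAR and NMAR, high-y cases are deleted more often (directly, or indirectly through x), so the complete cases are no longer representative and the observed mean is biased downward.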
Three categories:
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Not Missing at Random (NMAR or MNAR)

MISSING COMPLETELY AT RANDOM (MCAR)
• Data are MCAR when there are no patterns in the missingness
• Missing values are not related (or correlated) to any variables in the study
• If data are MCAR, the complete cases can be treated as a random subsample of the original target sample
• MCAR is the ideal situation

MISSING AT RANDOM (MAR)
• Data are MAR if missingness on a variable is related to other variables in your analysis
• Missingness on a variable can be predicted by other variables
• A weaker assumption than MCAR

NOT MISSING AT RANDOM (NMAR)
• Missingness is related to the (unobserved) value of the variable itself
• Whether data are NMAR is a theoretical and conceptual consideration
• Requires a separate model that accounts for the missing-data mechanism

CONVENTIONAL APPROACHES FOR HANDLING MISSING DATA
• Listwise deletion
• Pairwise deletion
• Mean imputation/substitution

LISTWISE DELETION
Also known as complete case analysis: cases with any missing values are dropped.
Strengths
• Unbiased when data are MCAR
• Works for any kind of statistical analysis
Weaknesses
• If a large proportion of the data is missing, can result in a serious loss of statistical power
• May introduce bias if data are MAR

PAIRWISE DELETION
Only cases with complete data on a given pair of variables are used in the analysis of that pair (e.g.
correlations). Also known, jokingly, as "unwise deletion."
Strengths
• Uses all available information
• Approximately unbiased when data are MCAR
Weaknesses
• Estimates are based on different subsets of the data, with different sample sizes and therefore different standard errors

MEAN IMPUTATION (SUBSTITUTION)
Missing values are replaced with the mean of the observed values.
Strengths
• Easy to implement
Weaknesses
• Introduces bias: the sample size is artificially preserved, so standard errors are underestimated

ADVANCED APPROACHES
• Maximum likelihood
• Multiple imputation

MAXIMUM LIKELIHOOD
A method that uses a likelihood function to determine parameter estimates.
Likelihood functions:
• Relate the parameters of a statistical model, such as a regression: Y = β1·X1 + β2·X2 + …
• Characterize the probability of the observed data as a function of the unknown parameters: L(θ|Y) = P(Y|θ). The question becomes: which set of parameter estimates, θ, maximizes the likelihood function?

[Figure: examples of joint probability density functions]

COMPONENTS OF MAXIMUM LIKELIHOOD*
• Estimates are obtained when θ maximizes L(θ|Y)
• Estimates are approximately unbiased in large samples
• Estimates have low standard errors, with appropriately quantified uncertainty
• Estimates are approximately normally distributed in large samples, by the central limit theorem
* These large-sample properties assume roughly N > 200; ML is hard to justify for N < 100

MAXIMUM LIKELIHOOD PROCEDURE
One method for obtaining the estimates is the EM algorithm, an iterative two-step process:
1. E-step: find the expected value of the log-likelihood function given the current estimates, θ0
2. M-step: maximize that function to obtain new estimates, θ1
Repeat steps 1 and 2 until the estimates converge.

IDEAL CONDITIONS FOR MAXIMUM LIKELIHOOD
• Data should be continuous
• Data should be normally distributed
• Constant variance, i.e.
the variation in a variable should not depend on the data
• Multivariate normal data are ideal overall, but maximum likelihood is still applied to categorical data in practice
• Missing data are assumed to be MAR

IMPLEMENTATION OF MAXIMUM LIKELIHOOD
• EM algorithm*
• Direct maximization
• Factoring
* Used in both ML and MI

DATA EXAMPLES
Can you pick which data sets might not be MAR? (x = missing)

Data set 1:
ID              1  2  3  4  5  6  7
Alcohol Amount  1  2  3  4  4  2  2
Alcohol Freq.   1  1  1  1  x  x  x

Data set 2:
ID              1  2  3  4  5  6  7
Alcohol Amount  1  2  3  4  4  2  2
Alcohol Freq.   1  x  1  2  3  2  x

Data set 3:
ID              1  2  3  4  5  6  7
Alcohol Amount  1  2  3  4  4  2  2
Alcohol Freq.   x  x  x  x  x  x  x

DATA EXAMPLE FOR ANALYSIS
National Longitudinal Survey of Youth (1990), N = 581, with missing data on the predictor variables.
Variables:
• ANTI: antisocial behavior (0-6)
• SELF: self-esteem (6-24)
• POV: poverty status, dichotomous, 1 = in poverty
• BLACK: 1 if child is Black, dichotomous
• HISPANIC: 1 if child is Hispanic, dichotomous
• CHILDAGE: child's age in 1990
• DIVORCE: 1 if mother divorced in 1990, dichotomous
• GENDER: 1 if female, dichotomous
• MOMAGE: mother's age at birth of the child
• MOMWORK: 1 if mother is employed, dichotomous

ANALYSIS ON MISSING DATA SET (SAS)

proc reg data=nlsyem;
  model anti = self pov black hispanic childage divorce gender momage momwork;
run;

MAXIMUM LIKELIHOOD: SAS

proc calis data=nlsyem method=fiml;
  path anti <- self pov black hispanic childage divorce gender momage momwork;
run;

MAXIMUM LIKELIHOOD: STATA

sem (anti <- self pov black hispanic childage divorce gender momage momwork), method(mlmv)

MAXIMUM LIKELIHOOD: REPEATED MEASURES
proc mixed and proc glimmix (SAS), and xtreg, xtmixed, etc. (Stata), automatically handle missing data by ML, but only if there are no missing values on the predictors.

LIMITATIONS OF ML
• Must use a joint distribution of all the variables to form the estimates (i.e.
L(θ|Y))
• May not be robust
• Auxiliary variables may be difficult to use (an auxiliary variable is one that is freely correlated with all other variables in the model)

MULTIPLE IMPUTATION
• Similar properties to ML; in both, estimation is typically started with the EM algorithm
• Can be used with any type of data
• Multiple software options
• Stochastic results
• More complex to implement
• The imputation model must contain at least the variables in the analysis model

MULTIPLE IMPUTATION EQUATIONS
Given dependent variable Y and independent variable X, a regression is first estimated from the non-missing (complete) cases:
  y_i = b0 + b1·x_i
For cases with missing data, imputations are then generated by adding random noise to the predicted values:
  y_i = b0 + b1·x_i + s_xy·r_i
where s_xy is the estimated residual standard deviation and r_i is a random draw from a standard normal distribution.

MULTIPLE IMPUTATION ASSUMPTIONS
• Data are missing at random (MAR)
• Data are multivariate normal

IMPLEMENTATION OF MI: SAS

proc mi data=nlsyem out=impnlsyem noprint nimpute=100;
  var anti self pov black hispanic childage divorce gender momage momwork;
run;

proc reg data=impnlsyem outest=a covout;
  model anti = self pov black hispanic childage divorce gender momage momwork;
  by _imputation_;
run;

proc mianalyze data=a;
  modeleffects intercept self pov black hispanic childage divorce gender momage momwork;
run;

[Output slides: SAS MI results with 10 and 100 imputations]

IMPLEMENTATION OF MI: STATA

mi set flong
mi register imputed anti self pov black hispanic childage divorce gender momage momwork
mi impute mvn anti self pov black hispanic childage divorce gender momage momwork, add(10)
mi estimate: regress anti self pov black hispanic childage divorce gender momage momwork

[Output slide: Stata MI results]

MI VS ML
• ML is simpler to implement
• MI can handle various types of data, but the imputation model must be consistent with the analysis model
• ML gives a single result for a given data set and model; MI gives stochastic results
• With ML there is no possibility of conflict between an imputation model and the analysis model

CHOOSING ML VS MI
• Choose ML when the data are mostly continuous
• Choose MI when the data are mostly
categorical

PRACTICAL STEPS
1. Assess the amount of missing data and which variables it falls on
2. Select an appropriate technique for handling the missing data
3. Conduct a sensitivity analysis: try different approaches and check whether the results are consistent, so that you can empirically compare estimates across methods

REPORTING MISSING DATA FOR PUBLICATION
• Report the amount of missing data as a percentage of the complete data
• Consider the assumptions and patterns of missing data
• Report the method used for handling missing data

REFERENCES
Allison, P.D. (2012). Handling missing data by maximum likelihood. SAS Global Forum: Statistics and Data Analysis.
Allison, P.D. (2002). Missing Data. Thousand Oaks, CA: Sage.
Baraldi, A.N. & Enders, C.K. (2010). An introduction to modern missing data analyses. Journal of School Psychology, 48, 5-37.
Schlomer, G.L., Bauman, S. & Card, N.A. (2010). Best practices for missing data management in counseling psychology. Journal of Counseling Psychology, 57(1), 1-10.
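As a closing illustration (not part of the original workshop materials), the multiple imputation equations above can be sketched in a few lines of Python with NumPy. The data are simulated and all names are invented; the point is only to show the two equations in action: a regression fit on the complete cases, stochastic imputations of the form b0 + b1·x_i + s·r_i for the missing cases, and a pooled point estimate averaged across imputations (the point-estimate part of Rubin's rules).

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_imputations = 5_000, 20

# Simulated data: x fully observed, y missing for some cases.
# Missingness depends on x only, so the data are MAR. True E[y] = 2.0.
x = rng.normal(size=n)
y = 2.0 + 0.8 * x + rng.normal(scale=0.5, size=n)
missing = rng.random(n) < np.where(x > 0, 0.4, 0.1)

# Step 1: estimate y = b0 + b1*x from the complete (non-missing) cases.
xc, yc = x[~missing], y[~missing]
b1, b0 = np.polyfit(xc, yc, 1)                # slope, intercept
s = (yc - (b0 + b1 * xc)).std(ddof=2)         # residual standard deviation

# Step 2: for each imputation, fill missing y with prediction plus noise.
# The "+ s*r" term is what distinguishes this from deterministic imputation.
estimates = []
for _ in range(n_imputations):
    y_imp = y.copy()
    r = rng.normal(size=missing.sum())
    y_imp[missing] = b0 + b1 * x[missing] + s * r
    estimates.append(y_imp.mean())

# Step 3: pool the point estimates by averaging across imputations.
pooled_mean = float(np.mean(estimates))
print(f"complete-case mean of y: {y[~missing].mean():.3f}")
print(f"pooled MI estimate of E[y]: {pooled_mean:.3f}")
```

Because high-x (and hence high-y) cases go missing more often, the complete-case mean is biased downward, while the pooled MI estimate recovers a value near the true mean of 2.0 under the MAR assumption.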