UNDERSTANDING AND ADDRESSING
MISSING DATA
DANIELLE ROBBINS, PHD
CINDY SANGALANG, PHD
AUGUST 28, 2013
OVERVIEW
Goals for workshop
• Understand what missing data is and why it matters
• Describe ways to address missing data
• Gain exposure to advanced methods for addressing missing data
Topics covered
• Patterns of missing data
• Conventional approaches for handling missing data
• Advanced topics:
• Maximum likelihood (ML)
• Multiple imputation (MI)
WHAT IS MISSING DATA?
Observations/cases with any missing values
[Figure: a cases-by-variables data matrix in which "?" marks a missing value]
WHY CAN MISSING DATA
BE A PROBLEM?
Reduced sample size
Potential bias
HOW MIGHT DATA BE MISSING?
• Study participants
• Item nonresponse
• Study design
• Attrition (panel studies)
• Error in data recording
WHEN IS MISSING DATA
A PROBLEM?
• Amount of missing data
• Adequate statistical power to detect effects
• Patterns of missingness
MISSING DATA PATTERNS:
UNDERLYING ASSUMPTIONS
Classification system developed by Rubin and colleagues
(Rubin, 1976; Little & Rubin, 1987, 2002)
In other words, is the reason for the missing data random or not?
Three categories:
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Not Missing at Random (NMAR or MNAR)
MISSING COMPLETELY AT
RANDOM (MCAR)
• Data are considered MCAR when there are no patterns in
missingness
• Missing values are not related (or correlated) to any variables
in the study
• If data are MCAR, the complete cases can be treated as a random
subsample of the original target sample
• MCAR is the ideal situation
MISSING AT RANDOM (MAR)
• Data are MAR if missingness on a variable is related to other
variables in your analysis
• Missingness on a variable can be predicted by other variables
• Considered a weaker assumption than MCAR
NOT MISSING AT RANDOM
(NMAR)
• Missingness is related to the (unobserved) value of the variable itself
(e.g., heavy drinkers may be less likely to report how much they drink)
• Whether data are NMAR is a theoretical and conceptual
consideration
• Requires a separate model that accounts for the missing-data mechanism
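Why the distinction between the three mechanisms matters can be shown with a small simulation. The Python sketch below is not from the slides: the variables (income, age), rates, and coefficients are all invented. It generates each mechanism and compares the complete-case mean with the true mean of 50; only MCAR leaves the complete cases unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.normal(40, 12, n)                    # fully observed covariate
income = 30 + 0.5 * age + rng.normal(0, 5, n)  # variable that may go missing

# MCAR: missingness is a pure coin flip, unrelated to anything.
mcar = rng.random(n) < 0.3
# MAR: missingness depends only on the observed covariate (age).
mar = rng.random(n) < np.where(age > 40, 0.5, 0.1)
# NMAR: missingness depends on the missing value itself (income).
nmar = rng.random(n) < np.where(income > 50, 0.5, 0.1)

means = {}
for label, mask in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    means[label] = income[~mask].mean()        # complete-case mean
    print(f"{label}: complete-case mean = {means[label]:.2f} (true mean = 50)")
```

Under MCAR the complete-case mean stays close to 50; under MAR and NMAR the complete cases over-represent younger, lower-income people, so the mean drifts downward.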
CONVENTIONAL APPROACHES FOR
HANDLING MISSING DATA
Listwise deletion
Pairwise deletion
Mean imputation/substitution
LISTWISE DELETION
Also known as complete case analysis
Strengths
• Unbiased when data are MCAR
• Works for any kind of statistical analysis
Weaknesses
• If a large proportion of data is missing, can result in substantial
loss of statistical power
• May introduce bias if data are MAR
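Outside the slides (which use SAS and Stata), listwise deletion is a one-liner in Python with pandas. The toy data frame below is invented, borrowing variable names from the NLSY example later in the deck:

```python
import numpy as np
import pandas as pd

# Toy data frame; values are invented for illustration.
df = pd.DataFrame({
    "anti": [2, 1, 3, 0, 4],
    "self": [18, np.nan, 21, 15, np.nan],
    "pov":  [1, 0, np.nan, 0, 1],
})

# Listwise deletion: keep only the cases with no missing values at all.
complete = df.dropna()
print(len(df), "cases ->", len(complete), "complete cases")
```

Note how quickly the sample shrinks: three different cases each miss a single value, yet all three are discarded.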
PAIRWISE DELETION
Only cases relating to a pair of variables are used in analysis
(e.g. correlations)
AKA “unwise deletion”
Strengths
• Uses all available information
• Approximately unbiased in MCAR
Weaknesses
• Different estimates are based on different subsets of cases, so
sample sizes and standard errors are inconsistent across estimates
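The "different sets of data" problem is easy to see in Python with pandas, whose `DataFrame.corr` excludes missing values pairwise by default. The data frame below is invented:

```python
import numpy as np
import pandas as pd

# Toy data; values invented. y equals 2*x wherever both are present.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan, 6.0],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0, 12.0],
    "z": [1.0, 1.0, np.nan, 2.0, 2.0, np.nan],
})

# Pairwise deletion: each correlation uses every case where BOTH
# variables are present, so each cell can rest on a different n.
corr = df.corr()
n_xy = df[["x", "y"]].dropna().shape[0]   # cases available for corr(x, y)
n_xz = df[["x", "z"]].dropna().shape[0]   # a different, smaller set
print(corr.loc["x", "y"], n_xy, n_xz)
```

Each correlation in the matrix is legitimate on its own, but they describe different subsamples, which is what makes combined inference (and standard errors) awkward.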
MEAN IMPUTATION
(SUBSTITUTION)
Missing values replaced with mean values
Strengths
• Easy to implement
Weaknesses
• Introduces bias: the sample size is artificially preserved while the
variance is shrunk, so standard errors are underestimated
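The variance shrinkage is easy to demonstrate; this Python sketch (invented data, not from the slides) knocks out 40% of a variable completely at random and fills the holes with the observed mean:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(100, 15, 1000)                # invented "complete" variable
x_missing = x.copy()
x_missing[rng.random(1000) < 0.4] = np.nan   # knock out ~40%, MCAR

# Mean imputation: fill every hole with the observed mean.
filled = np.where(np.isnan(x_missing), np.nanmean(x_missing), x_missing)

# The sample size looks intact, but the spread has been artificially
# shrunk, so standard errors computed from `filled` are too small.
print(f"true SD: {x.std():.1f}  after mean imputation: {filled.std():.1f}")
```

With a fraction p of values replaced by a constant, the standard deviation shrinks toward sqrt(1 - p) of its true value, so every test run on the filled data looks more precise than it really is.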
ADVANCED APPROACHES:
Maximum Likelihood
Multiple Imputation
MAXIMUM LIKELIHOOD
A method that uses a likelihood function to determine
estimates
Likelihood functions:
• Relate the parameters of a statistical model, such as a regression:
• Y = β₁X₁ + β₂X₂ + ⋯
• Characterize the probability of the observed data as a function of
the data and the unknown parameters
• L(θ | Y) = P(Y | θ); we ask which set of parameter estimates θ
maximizes the likelihood function
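To make "which θ maximizes L(θ | Y)" concrete, here is a minimal Python sketch (not from the slides): the log-likelihood of a normal model with an assumed known variance is evaluated over a grid of candidate θ values, and the maximizer coincides with the sample mean, the textbook MLE for this model.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(5.0, 2.0, 500)     # data; true θ = 5, σ = 2 assumed known

def log_likelihood(theta, y, sigma=2.0):
    # log L(θ | Y) = Σ_i log P(y_i | θ) under a Normal(θ, σ²) model
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (y - theta) ** 2 / (2 * sigma**2))

# Pick the θ on a grid that maximizes L(θ | Y).
grid = np.linspace(0.0, 10.0, 10_001)
theta_hat = grid[np.argmax([log_likelihood(t, y) for t in grid])]
print(theta_hat, y.mean())   # the MLE of a normal mean is the sample mean
```

A grid search is the crudest possible maximizer; real software uses analytic or numerical optimization, but the logic (score every candidate θ by the probability it assigns to the observed data) is the same.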
EXAMPLES OF JOINT
PROBABILITY DENSITY
FUNCTIONS
[Figure: example joint probability density functions]
COMPONENTS OF
MAXIMUM LIKELIHOOD*
Estimates are obtained when θ maximizes L(θ | Y)
Estimates are approximately unbiased in large samples
Estimates have low standard errors, with appropriate uncertainty
quantification
Estimates are approximately normally distributed in large samples,
by the central limit theorem
* These large-sample properties are reliable for roughly N > 200 and hard to rely on for N < 100
MAXIMUM LIKELIHOOD
PROCEDURE
To obtain estimates, one common method is the EM algorithm,
applied iteratively in a two-step process:
1. E-step: find the expected value of the log-likelihood function
given the current estimate, θ₀
2. M-step: maximize that function to obtain new estimates, θ₁
Repeat steps 1 and 2 until the estimates converge
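The two steps can be sketched in Python (the workshop itself uses SAS and Stata). Below is a toy EM for a bivariate normal model in which some values of y are missing and x is fully observed; the data, starting values, and iteration count are invented for illustration. The E-step fills in the expected first and second moments of the missing y values given x; the M-step re-estimates the parameters from the completed moments.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)    # true μ_y = 1, cov(x,y) = 2
miss = rng.random(n) < 0.3                 # ~30% of y missing (MCAR here)
y_obs = np.where(miss, np.nan, y)

mu_x, var_x = x.mean(), x.var()
mu_y, var_y, cov_xy = 0.0, 1.0, 0.0        # crude invented starting values
for _ in range(100):
    # E-step: expected y and y² for missing cases, given x and the
    # current parameter estimates (conditional normal formulas).
    beta = cov_xy / var_x
    ey = np.where(miss, mu_y + beta * (x - mu_x), y_obs)
    resid_var = var_y - beta * cov_xy
    ey2 = np.where(miss, ey**2 + resid_var, y_obs**2)
    # M-step: re-estimate the parameters from the completed moments.
    mu_y = ey.mean()
    var_y = ey2.mean() - mu_y**2
    cov_xy = np.mean(x * ey) - mu_x * mu_y

print(mu_y, var_y, cov_xy)
```

After convergence the estimates sit near the true values (μ_y = 1, var_y = 5, cov = 2) up to sampling error, even though 30% of y was never observed, because x carries information about the missing values.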
IDEAL CONDITIONS FOR
MAXIMUM LIKELIHOOD
Data should be continuous
Data should be normally distributed
Constant variance, i.e., the variation in a variable should not
depend on the values of the data
Multivariate normal data is ideal overall, but maximum
likelihood is still applied to categorical data in practice
Missing data are assumed to be MAR
IMPLEMENTATION OF
MAXIMUM LIKELIHOOD
Can use EM algorithm *
Direct Maximization
Factoring
* Used in both ML and MI
DATA EXAMPLES
Can you pick which data sets might not be MAR? (x = missing)

Data set 1:
ID  Alcohol Amount  Alcohol Frequency
1   1               1
2   2               1
3   3               1
4   4               1
5   4               x
6   2               x
7   2               x

Data set 2:
ID  Alcohol Amount  Alcohol Frequency
1   1               1
2   2               x
3   3               1
4   4               2
5   4               3
6   2               2
7   2               x

Data set 3:
ID  Alcohol Amount  Alcohol Frequency
1   1               1
2   2               2
3   3               3
4   4               4
5   4               x
6   2               x
7   2               x
DATA EXAMPLE FOR
ANALYSIS
National Longitudinal Survey of Youth (1990)
Variables:
• ANTI: antisocial behavior (0-6)
• SELF: self-esteem (6-24)
• POV: poverty status, dichotomous, 1 = in poverty
• BLACK: 1 if child is Black, dichotomous
• HISPANIC: 1 if child is Hispanic, dichotomous
• CHILDAGE: child's age in 1990
• DIVORCE: 1 if mother was divorced in 1990, dichotomous
• GENDER: 1 if female, dichotomous
• MOMAGE: mother's age at birth of child
• MOMWORK: 1 if mother is employed, dichotomous
Missing data on predictor variables
N=581
ANALYSIS ON
MISSING DATA SET
proc reg data = nlsyem;
model anti=self pov black hispanic childage divorce gender
momage momwork;
run;
MAXIMUM
LIKELIHOOD: SAS
proc calis data=nlsyem method=fiml;
path anti <- self pov black hispanic childage divorce gender
momage momwork;
run;
MAXIMUM LIKELIHOOD: STATA
sem (anti <- self pov black hispanic childage divorce gender ///
    momage momwork), method(mlmv)
MAXIMUM LIKELIHOOD:
REPEATED MEASURES
Proc mixed and Proc glimmix (SAS), and xtreg, xtmixed, etc.
(Stata), automatically handle missing data by ML, but only when
there are no missing values on the predictors
LIMITATIONS OF ML
Must use a joint distribution of all variables to form the estimate
(i.e., L(θ | Y))
May not be robust to departures from distributional assumptions
Auxiliary variables may be difficult to use; an auxiliary variable
is one that is allowed to correlate freely with all other variables
and is included only to improve the handling of missing data
MULTIPLE IMPUTATION
Similar properties to ML; both typically begin estimation with the
EM algorithm
Can be used with any type of data
Multiple software options
Results are stochastic (they vary across runs)
Implementation is more complex
The imputation model must include at least the variables in the
analysis model
MULTIPLE IMPUTATION
EQUATIONS
Given dependent variable Y and independent variable X, predicted
values are first generated from a regression fit to the complete cases:
ŷᵢ = b₀ + b₁xᵢ
For cases with missing data, imputations add a random residual
draw to the prediction:
yᵢ = b₀ + b₁xᵢ + s·rᵢ
where s is the residual standard deviation and rᵢ is a random
standard-normal draw
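The stochastic imputation above, repeated several times and pooled, can be sketched in Python (the deck's own examples use SAS and Stata). This is a deliberately simplified "improper" MI: it redraws residuals but not the regression parameters themselves, and the data and names are invented. Pooling here is only the point-estimate part of Rubin's rules.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.normal(0, 1, n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, n)            # true slope = 3
miss = rng.random(n) < np.where(x > 0, 0.4, 0.1)   # MAR: depends on x
y_mis = np.where(miss, np.nan, y)

def impute_once(x, y_mis, rng):
    """One stochastic imputation: b0 + b1*x plus a random residual draw."""
    obs = ~np.isnan(y_mis)
    b1, b0 = np.polyfit(x[obs], y_mis[obs], 1)      # fit on complete cases
    resid_sd = np.std(y_mis[obs] - (b0 + b1 * x[obs]))
    y_imp = y_mis.copy()
    holes = np.isnan(y_mis)
    y_imp[holes] = b0 + b1 * x[holes] + resid_sd * rng.standard_normal(holes.sum())
    return y_imp

# Multiple imputation: impute M times, analyze each completed data
# set, then pool the estimates (here, the slope of y on x).
M = 20
slopes = [np.polyfit(x, impute_once(x, y_mis, rng), 1)[0] for _ in range(M)]
pooled_slope = np.mean(slopes)
print(pooled_slope)
```

The residual draw is what distinguishes this from plain regression imputation: without it, the imputed values would lie exactly on the fitted line and the variance would again be understated.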
MULTIPLE IMPUTATION
ASSUMPTIONS
Data are assumed to be missing at random (MAR)
Data are assumed to be multivariate normal
IMPLEMENTATION OF MI: SAS
(N=10)
proc mi data=nlsyem out=impnlsyem noprint nimpute=10;
var anti self pov black hispanic childage divorce gender momage momwork;
run;
proc reg data=impnlsyem outest=a covout;
model anti=self pov black hispanic childage
divorce gender momage momwork;
by _imputation_;
run;
proc mianalyze data=a;
modeleffects intercept self pov black hispanic childage divorce gender momage
momwork;
run;
IMPLEMENTATION OF
MI: SAS (N=10)
[Figure: SAS MI output, 10 imputations]
IMPLEMENTATION OF MI: SAS
(N=100)
[Figure: SAS MI output, 100 imputations]
IMPLEMENTATION OF MI: STATA
mi set flong
mi register imputed anti self pov black hispanic childage divorce ///
    gender momage momwork
mi impute mvn anti self pov black hispanic childage divorce ///
    gender momage momwork, add(10)
mi estimate: regress anti self pov black hispanic childage divorce ///
    gender momage momwork
IMPLEMENTATION OF MI: STATA
[Figure: Stata MI output]
MI VS ML
Maximum likelihood (ML) is simpler to implement
Multiple imputation (MI) can handle various types of data, but the
imputation model must be consistent with the analysis model
ML gives a single result for a given data set and model
MI gives stochastic results (they vary across runs)
ML allows no conflict between an imputation model and the
analysis model
CHOOSING ML VS MI
Choose ML when data is mostly continuous
Choose MI when data is mostly categorical
PRACTICAL STEPS
Assess amount of missing and assess on which variables
Select appropriate technique for handling missing data
Conduct a sensitivity analysis: try different approaches, compare
the resulting estimates, and check whether the results are consistent;
this lets you empirically compare the approaches
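The first practical step (assessing how much is missing, and on which variables) is brief in Python with pandas; the variable names below echo the NLSY example, and the values are invented:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "anti": [2, 1, 3, 0, 4, 2],
    "self": [18, np.nan, 21, 15, np.nan, 20],
    "pov":  [1, 0, np.nan, 0, 1, 1],
})

# Percent missing per variable, plus the complete-case count.
pct_missing = df.isna().mean() * 100
n_complete = len(df.dropna())
print(pct_missing.round(1))
print(f"{n_complete} of {len(df)} cases are complete")
```

Running this before choosing a technique shows both the scale of the problem and which variables drive it, which in turn informs the choice between deletion, ML, and MI.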
REPORTING MISSING DATA
FOR PUBLICATION
• Report the amount of missing data as a percentage of the full
data set
• Consider assumptions and patterns of missing data
• Report appropriate method of handling missing data
REFERENCES
Allison, P.D. (2012). Handling missing data by maximum
likelihood. Statistics and Data Analysis, SAS Global Forum 2012.
Allison, P.D. (2002). Missing data. Thousand Oaks, CA: Sage.
Baraldi, A.N., & Enders, C.K. (2010). An introduction to
modern missing data analyses. Journal of School
Psychology, 48, 5-37.
Schlomer, G.L., Bauman, S., & Card, N.A. (2010). Best
practices for missing data management in counseling
psychology. Journal of Counseling Psychology, 57(1), 1-10.