Missing Data and Multiple Imputation
By Jon Atwood, Collaborator, LISA
In this course, we will…
• Examine missing data in a general sense; what it is, where it comes from,
what types exist, etc.
• Explain the problems of certain common methods for dealing with missing
data, such as complete case analysis and single imputation methods
• Study multiple imputation (MI), learning generally how it works
• Apply MI to real data sets using SAS and R
So what is missing data?
• Missing data is information that we want to know, but don’t have
• It can come in many forms, from people not answering questions on surveys,
to inaccurate recordings of the height of plants that need to be discarded, to
canceled runs in a driving experiment due to rain
• We could also consider something we never even thought of to be missing
data
The key question is, why is the data missing?
• What mechanism is it that contributes to, or is associated with, the
probability of a data point being absent?
• Can it be explained by our observed data or not?
• The answers drastically affect what we can ultimately do to compensate for
the missingness
Perhaps the most common method of handling
missing data is “Complete Case Analysis”
• Simply delete all cases that have any missing values at all, so you are left only
with observations with all variables observed
• Computer software often does this by default when performing analysis
(regression, for example)
• This is the simplest way to handle missing data, and in some cases it will work fine
• However, the lost cases mean your estimates will have larger variance than the nominal
size of your data would suggest
• It may also bias your sample
And now a closer look…
• We use as an example a data set of body fat percentage in men, and the
circumference of various body parts (Penrose et al., 1985)
• Does the circumference of certain body parts predict body fat percentage?
• Here are some significant predictors from a regression model with body fat
percentage as the response
Predictor   Estimate   S.E.     P-Value
Age          0.0626    0.0313   0.0463
Neck        -0.4728    0.2294   0.0403
Forearm      0.45315   0.1979   0.0229
Wrist       -1.6181    0.5323   0.0026
In this case, the data is complete, with sample
size 252
• But suppose about 5 percent of the participants had missing values? 10
percent? 20 percent?
• What if we performed complete case analysis and removed those who had
missing values?
• First, let’s examine the effect of doing this when the data is MCAR (Missing
Completely at Random; the types of missingness are defined shortly)
• I randomly removed cases from the data set, reran the analysis and stored the
p-values. I did this 1,000 times, and plotted the 1,000 p-values in boxplots (a
sketch of this simulation follows)
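Here is a minimal R sketch of that simulation, assuming the body fat data live in a
data frame fat with columns bodyfat, age, neck, forearm and wrist (hypothetical names):

set.seed(1)
n_del  <- 13     # about 5 percent of the 252 cases
n_sims <- 1000
pvals  <- matrix(NA, n_sims, 4,
                 dimnames = list(NULL, c("age", "neck", "forearm", "wrist")))
for (i in seq_len(n_sims)) {
  keep <- sample(nrow(fat), nrow(fat) - n_del)   # MCAR deletion of n_del cases
  fit  <- lm(bodyfat ~ age + neck + forearm + wrist, data = fat[keep, ])
  pvals[i, ] <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
}
boxplot(as.data.frame(pvals), ylab = "P-Value")   # one boxplot per predictor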
For about 5 percent (n=13) deleted
[Boxplots of the 1,000 p-values for Age, Neck, Forearm and Wrist]
For about 20 percent (n=50) deleted
[Boxplots of the 1,000 p-values for Age, Neck, Forearm and Wrist]
We seem to change our conclusions somewhat
• With age and neck, it seems we fail to reject more often than not
• For the other two, we still reject most of the time
• This assumes the missing subjects do not differ from the non-missing ones. If they
did, the deletion would also cause bias
Types Of Missingness
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR), also called Not Missing at Random (NMAR)
What Distinguishes Each Type?
• Suppose you’re loitering outside an elementary school one day…
• You then find out that students just received their report cards for the first quarter
• For some reason, you start asking passing students their English grades. Of course, you
don’t force them to tell you or anything. You also write down their gender and hair color
A data set from this activity might look like
this…
Hair Color   Gender   Grade
Red          M        A
Brown        F        A
Black        F        B
Black        M        A
Brown        M        ?
Brown        M        ?
Brown        F        ?
Black        M        B
Black        M        B
Brown        F        A
Black        F        ?
Brown        F        C
Red          M        ?
Red          F        A
Brown        M        A
Black        M        A

(? = grade not revealed)
• 7 students received As, 3 received Bs, and 1 a C
• No failing!!
• But 5 students did not reveal their grade
To determine the type of missingness, look at
what influences the probability of a missing point
• Here is the same data set, but the values are replaced with a “0” if the data point
is observed and a “1” if it is not
• We’ll call this the “Missing Matrix.” Obviously there are many more possible missing
matrices
• The relevant question is: for any one of these data points, what is the probability
that the point is equal to “1”?
Hair Color   Gender   Grade
0            0        0
0            0        0
0            0        0
0            0        0
0            0        1
0            0        1
0            0        1
0            0        0
0            0        0
0            0        0
0            0        1
0            0        0
0            0        1
0            0        0
0            0        0
0            0        0
Upcoming Quiz!
• What type of missingness do the grades
exhibit?
Missing Completely at Random (MCAR)
• If this probability is not dependent on any of the data, observed or
unobserved, then the data is Missing Completely at Random (MCAR)
• To be more precise, suppose that X is the observed data and Y is the
unobserved data. Suppose we label our “Missing Matrix” as R.
• Then, if the data are MCAR, P(R|X,Y)=P(R)
Example…
• Suppose you are running an experiment on plants
grown in pots, when suddenly you have a nervous
breakdown and smash some of the pots
• In your insanity, you will probably not choose the plants to smash according to a
well-defined pattern, such as height, age, etc.
• Hence, the missing values generated from your
act of madness will likely fall into the MCAR
category
Another way to think of MCAR
• Suppose we had to quickly go to the
bathroom and do number 2
• In our desperation, we use the data as our
toilet paper
• Presumably, some of our data would be
smeared with…you know what
• The data smeared can be said to be a
random subset of our data
In practice, MCAR is usually not realistic
• A completely random mechanism for generating missingness in your data set
just isn’t very plausible
• Usually, missing data is missing for a reason: older people may be less likely to
answer web-delivered survey questions, participants in longitudinal studies may die
before completing the study, companies may be reluctant to reveal financial
information, and so on
Missing at Random (MAR)
• If the probability of your missing data is dependent on the observed data
but not the unobserved data, your missing observations are said to be Missing
at Random (MAR)
• Symbolically, P(R|X,Y)=P(R|X), so that the unobserved data does not
contribute to the probability of observing our “Missing Matrix.”
• Random is somewhat of a misnomer. MAR means that there is a mechanism
that is associated with whether the data is missing, and it has to do with our
observed data
Example…
• Suppose older people are less likely to answer web-delivered survey questions, but
age is recorded for everyone. The probability of an answer being missing then depends
only on the observed data (age), so the missing answers are MAR
The key point to MAR is…
• We can still model the missing mechanism and compensate for it
• The multiple imputation methods we will be talking about today assume MAR
• For example, if age is known, you can model missingness as a function of age
• Whether missing data is MAR or the next type, Missing Not at Random (MNAR), is not
testable from the data alone; it requires you to understand your data
Missing Not at Random (MNAR)
• The missingness has something to do with the missing value itself
• It has been said that smokers are not as likely to answer the question, “Do
you smoke?”
• MNAR missingness is said to be nonignorable
• Although there are some proposed ways to handle MNAR data, these are
more complicated and are beyond the scope of this class
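To make the three mechanisms concrete, here is a small R sketch that generates all
three for a toy GPA-and-grade setting (all names and numbers are invented for
illustration):

set.seed(42)
n <- 1000
x <- rnorm(n, mean = 3.2, sd = 0.4)    # observed GPA
y <- x + rnorm(n, sd = 0.3)            # "true" grade on some continuous scale

mcar <- ifelse(runif(n) < 0.2, NA, y)             # missingness ignores all data
mar  <- ifelse(runif(n) < plogis(3 - x), NA, y)   # lower observed GPA -> more missing
mnar <- ifelse(runif(n) < plogis(3 - y), NA, y)   # lower (unseen) grade -> more missing

# Under MCAR the observed mean stays unbiased; under MAR and MNAR it drifts
c(truth = mean(y), mcar = mean(mcar, na.rm = TRUE),
  mar = mean(mar, na.rm = TRUE), mnar = mean(mnar, na.rm = TRUE))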
So, returning to our school example…
• Do you think this missing data is likely MCAR, MAR or MNAR?

Hair Color   Gender   Grade
Red          M        A
Brown        F        A
Black        F        B
Black        M        A
Brown        M        ?
Brown        M        ?
Brown        F        ?
Black        M        B
Black        M        B
Brown        F        A
Black        F        ?
Brown        F        C
Red          M        ?
Red          F        A
Brown        M        A
Black        M        A
Add overall GPA
• Now the data looks like this
• Does this change anything?
Hair Color   GPA    Gender   Grade
Red          3.4    M        A
Brown        3.6    F        A
Black        3.7    F        B
Black        3.9    M        A
Brown        2.5    M        ?
Brown        3.2    M        ?
Brown        3.0    F        ?
Black        2.9    M        B
Black        3.3    M        B
Brown        4.0    F        A
Black        3.65   F        ?
Brown        3.4    F        C
Red          2.2    M        ?
Red          3.8    F        A
Brown        3.8    M        A
Black        3.67   M        A
So what do we do about missing data?
Single Imputation Methods
Impute Once
• Mean Imputation: imputing the average from observed cases for all missing
values of a variable
• Hot Deck Imputation: imputing a value from another subject, or “donor,”
that is most like the subject in terms of observed variables
• Some others
• All of these fundamentally impose too much precision: we have uncertainty about
what the unobserved values actually are, but a single imputed value is treated as
if it were known
Multiple Imputation
• Using a single imputation approach does not account for an obvious source
of uncertainty
• By imputing only once, we are treating the imputed value as if we observed it
when we did not
• Therefore, we have uncertainty in what the value would have been, had we observed it
• Multiple Imputation (MI) takes this into account by generating several
random values for each missing data point
The General Process
1. A value is randomly drawn for the unobserved data points based on a
predetermined model from the observed data
2. Repeat step 1 some number of times, say m, resulting in m imputed data
sets
3. Each imputed data set is analyzed separately
4. The separate analyses are pooled together for a unifying analysis that takes
into account all the imputed data sets
To illustrate…
Here’s some data

X    Y
32   2
43   ?
56   6
25   ?
84   5

Oh no, we have two missing values! Whatever shall we do?!
Let’s Impute Some Data!
First, we’ll use a predictive distribution of the missing values, given the
observed values, to make random draws of the missing values and fill them in.

X    Y
32   2
43   5.5
56   6
25   8
84   5
Now we have one imputed data set!
Let’s Set That Aside… And Do It Again!!!!
Imputed data set 1:
X    Y
32   2
43   5.5
56   6
25   8
84   5

Imputed data set 2:
X    Y
32   2
43   7.2
56   6
25   1.1
84   5
Set that aside…
Now we have 2 imputed data sets!!!
• Do this m number of times for m imputed data sets
Inference with Multiple Imputation
• Now that we have our imputed data sets, how do we make use of them?
(suppose in this case m = 2)
We analyze each separately
Imputed data set 1:  slope estimate = -0.8245, variance of the slope = 6.1845
Imputed data set 2:  slope estimate = 4.932, variance of the slope = 4.287
Finally we pool the analyses together
• The pooled slope estimate is the average of the m imputed estimates
• In our example, β1p = (β11 + β12)/2 = (4.932 + (−0.8245))/2 = 2.0538
• The pooled slope variance is given by

  T = (1/m) Σ Zi + (1 + 1/m) · (1/(m − 1)) · Σ (β1i − β1p)²

  where Zi is the within-imputation variance of the slope from imputation i; the first
  term is the average within-imputation variance, and the second is the
  between-imputation variance, inflated by the factor (1 + 1/m)
• The pooled variance in this case is (6.1845 + 4.287)/2 + (3/2)·(16.569) = 30.089
• To find the pooled standard error, take the square root: √30.089 ≈ 5.485
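Here is the same arithmetic as a quick R check, using the numbers above:

slopes <- c(-0.8245, 4.932)   # per-imputation slope estimates
within <- c(6.1845, 4.287)    # per-imputation (within) variances of the slope

m    <- length(slopes)
qbar <- mean(slopes)                       # pooled estimate: 2.0538
B    <- sum((slopes - qbar)^2) / (m - 1)   # between-imputation variance: 16.569
Tvar <- mean(within) + (1 + 1/m) * B       # total variance: 30.089
sqrt(Tvar)                                 # pooled standard error: 5.485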
Predicting the missing data given the observed data
• Bayes’ Theorem: P(B|A) = P(A|B) · P(B) / P(A)

Imagine, then, that we establish some distribution for the parameters of interest before
considering the data, P(θ), where θ is the set of parameters we are trying to
estimate. This is called the prior distribution of θ.
Then, we establish a distribution P(Xobs|θ), the likelihood of the observed data.
We can finally use Bayes’ Theorem to establish the posterior P(θ|Xobs), make random
draws for θ, and use these draws to make predictions of Ymiss
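As a concrete sketch (not the exact algorithm of any particular package), here is one
such draw for a normal linear imputation model with a noninformative prior; yobs, Xobs
and Xmis are assumed to hold the observed responses, their covariates, and the
covariates of the rows with missing responses:

draw_ymis <- function(yobs, Xobs, Xmis) {
  fit   <- lm(yobs ~ Xobs)
  bhat  <- coef(fit)
  V     <- summary(fit)$cov.unscaled          # (X'X)^(-1), intercept included
  df    <- fit$df.residual
  s2    <- sum(resid(fit)^2) / df
  sig2  <- df * s2 / rchisq(1, df)            # draw sigma^2 from its posterior
  bstar <- MASS::mvrnorm(1, bhat, sig2 * V)   # draw beta* given sigma^2
  Xm    <- cbind(1, Xmis)                     # add the intercept column
  rnorm(nrow(Xm), Xm %*% bstar, sqrt(sig2))   # predictive draw of the missing y's
}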
How many imputations do we need?
• Depends on the size of the data set and the amount of missingness
• Some previous research indicated that about 5 imputations are sufficient for
efficiency of the estimates, based on the relative efficiency formula (1 + λ/m)⁻¹,
where m is the number of imputations and λ is the fraction of missing
information for the term being estimated (Schafer, 1999)
• However, more recent research claims that a good imputation number is actually
higher (maybe 40 or more) in order to achieve higher power (Graham et al., 2007)
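For instance, a quick check of the efficiency formula in R (λ = 0.3 is just an
assumed fraction of missing information):

lambda <- 0.3
m      <- c(3, 5, 10, 20, 40)
round(1 / (1 + lambda / m), 3)   # 0.909, 0.943, 0.971, 0.985, 0.993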
General Methods for Multiple Imputation
• Regression based
• Chained Equations (MICE) or Fully Conditional Specification (FCS)
• Markov Chain Monte Carlo (MCMC)
Regression Approach in SAS
• Uses predictive mean matching, which means that the actual imputed value is one
chosen randomly from a set of observed values whose predicted value is close
to the predicted value of the missing observation
• This is meant to keep the imputed values plausible (they are always actually observed values)
• Based on the imputation model we build, posterior random draws are made
for the regression parameters
• These draws are used to construct the predicted values for the missing
observation
What parameters?
• Suppose our imputation model is y = β0 + β1x1 + ⋯ + βkxk
• A random draw is made from the posterior distribution of the parameters, and we
get the randomly drawn parameters β* = (β*0, β*1, …, β*k)
• The missing value yi is predicted as β*0 + β*1xi1 + … + β*kxik
• Predictive mean matching is then made based on this prediction
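Here is a simplified R sketch of the matching step (pmm_impute and k are made-up
names; unlike SAS’s regpmm, this version uses the least-squares predictions for the
missing cases instead of posterior draws):

pmm_impute <- function(y, X, k = 5) {
  obs      <- !is.na(y)
  fit      <- lm(y[obs] ~ X[obs, , drop = FALSE])
  pred_obs <- fitted(fit)                       # predicted means, observed cases
  pred_mis <- cbind(1, X[!obs, , drop = FALSE]) %*% coef(fit)
  y[!obs] <- sapply(pred_mis, function(p) {
    donors <- order(abs(pred_obs - p))[1:k]   # the k closest observed cases
    y[obs][sample(donors, 1)]                 # impute one donor's actual value
  })
  y
}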
SAS example
• We will look at part of a data set of CEO bonuses, with other predictor
variables (sales, advanced degrees, age, etc.)
• Since we plan to do regression on bonuses, and bonuses may have large
variability as they get higher, we will take the log of bonuses before we do the
imputation
Here’s the code for the entire process:

Imputation Code:
proc mi data=bob1 out=mob seed=123 nimpute=10;
  monotone regpmm(logb=stock sales years mba mastphd age);
  var stock sales years mba mastphd age logb;
run;

Regression Code:
proc reg data=mob outest=mp covout noprint;
  model logb=stock sales years mba mastphd age;
  by _imputation_;
run;

Pooled Analysis Code:
proc mianalyze data=mp;
  modeleffects stock sales years mba mastphd age;
run;
Here is the output

Variance Information:

Parameter   Between       Within     Total
stock       1.2212E-05    3.5E-05    4.9E-05
sales       1.15E-12      1.02E-11   1.15E-11
years       6.388E-06     2.5E-05    3.2E-05
mba         0.001423      0.00794    0.0095
mastphd     0.001299      0.00633    0.00776
age         1.0735E-05    2.5E-05    3.7E-05

Parameter Estimates:

Parameter   Estimate   Std Error   95% Confidence Limits   Theta0   t for H0:          Pr > |t|
                                                                    Parameter=Theta0
stock       -0.00556   0.00697     (-0.0194, 0.00824)      0        -0.80              0.4267
sales       2.42E-05   3.4E-06     (0.00002, 3.1E-05)      0        7.16               <.0001
years       0.017694   0.00562     (0.0066, 0.02879)       0        3.15               0.0019
mba         0.014343   0.09749     (-0.1774, 0.20612)      0        0.15               0.8831
mastphd     -0.00182   0.08809     (-0.1753, 0.17162)      0        -0.02              0.9835
age         0.014896   0.00608     (0.00281, 0.02698)      0        2.45               0.0163
Classification Variables
• Suppose that we want to impute a variable that takes one of two values:
“male” or “female”, “smoker” or “non-smoker”, “dead” or “alive”
• Or what if there are even more categories, such as dislike, like, and love?
• What if they are nominal, like chocolate, vanilla, and strawberry?
• We can hardly use continuous methods in these cases
We can use the “Logistic Regression Method”
• Remember that if p = the probability that y = 1, the logistic regression model can
be expressed as

  log(p / (1 − p)) = β0 + β1x1 + ⋯ + βkxk

• We can make random draws β* for the estimators of β from their posterior distribution
• We then use those to calculate the estimate p = e^(xβ*) / (1 + e^(xβ*)), and use
this to predict y for the missing case
• This method also works for ordinal data
• It can be performed sequentially in SAS on multiple variables, one at a time, if the
data is monotone missing, which means that if an observation is missing for one
variable, it is missing for all the remaining variables (in a fixed ordering) for
that subject
• The Discriminant Function Method can be used for nominal variables
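A rough R sketch of this for one binary variable (logistic_impute is a made-up name;
the posterior draw uses a normal approximation around the MLE rather than an exact
posterior):

logistic_impute <- function(y, X) {
  obs   <- !is.na(y)
  fit   <- glm(y[obs] ~ X[obs, , drop = FALSE], family = binomial)
  bstar <- MASS::mvrnorm(1, coef(fit), vcov(fit))   # approximate posterior draw
  eta   <- cbind(1, X[!obs, , drop = FALSE]) %*% bstar
  p     <- exp(eta) / (1 + exp(eta))                # p = e^(x beta*) / (1 + e^(x beta*))
  y[!obs] <- rbinom(sum(!obs), 1, p)                # predict y for the missing cases
  y
}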
SAS Example
• I took the CEO data set and removed 57 values (no particular reason I chose
57)
• The following code runs the imputation
proc mi data=bob nimpute=5 seed=231 out=lid;
  class mastphd;
  var age stock sales years mba mastphd;
  monotone logistic (mastphd=years sales stock age mba);
run;
And we get this…
WARNING: The maximum likelihood estimates for the logistic regression with observed observations may
not exist for variable MastPHD. The posterior predictive distribution of the parameters used
in the imputation process is based on the maximum likelihood estimates in the last maximum
likelihood iteration.
The answer lies in the follies of logistic regression,
as well as the redundancy of our model
Table of MastPHD by MBA (cell entries: frequency, with percent / row pct / col pct)

              MBA = 0                      MBA = 1                     Total
MastPHD = 0   372 (50.07 / 100 / 67.64)    0 (0 / 0 / 0)               372 (50.07)
MastPHD = 1   178 (23.96 / 47.98 / 32.36)  193 (25.98 / 52.02 / 100)   371 (49.93)
Total         550 (74.02)                  193 (25.98)                 743 (100)
• We have “perfect classification” in that no one without a
masters/PhD has an MBA (the MastPHD = 0, MBA = 1 cell is empty)
• With perfect classification (separation) like this, the maximum likelihood
estimates do not exist, so the algorithm that fits the logistic regression
will not converge
• This is something you need to be careful about in general
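You can reproduce the problem in miniature; in this toy data x = 0 perfectly
predicts y = 0, so the estimates run off toward infinity:

x <- c(0, 0, 0, 0, 1, 1, 1, 1)
y <- c(0, 0, 0, 0, 0, 1, 1, 1)
glm(y ~ x, family = binomial)
# typically warns: "fitted probabilities numerically 0 or 1 occurred"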
Now that we’ve removed MBA, here’s the code

Imputation Code:
proc mi data=bob nimpute=5 seed=231 out=lid;
  class mastphd;
  var age stock sales years mastphd;
  monotone logistic (mastphd=years sales stock age);
run;

Logistic Regression Code:
proc logistic data=lid outest=rain covout noprint descending;
  class mastphd;
  model mastphd=age stock sales years;
  by _imputation_;
run;

Pooled Analysis Code:
proc mianalyze data=rain;
  modeleffects age stock sales years;
run;
So here are the results

Variance Information:

Parameter   Between       Within     Total      DF       Relative Increase   Fraction Missing   Relative
                                                         in Variance         Information        Efficiency
age         7.2861E-05    0.00015    0.00023    28.185   0.604415            0.416692           0.923073
stock       1.3164E-05    0.00042    0.00044    3084.8   0.037355            0.036634           0.992727
sales       8.39E-13      5.41E-11   5.51E-11   11990    0.018605            0.018429           0.996328
years       1.8885E-05    0.00014    0.00016    204.21   0.162733            0.148259           0.971202

Parameter Estimates:

Parameter   Estimate    Std Error   95% Confidence Limits   DF       Minimum       Maximum       Theta0   t for H0:          Pr > |t|
                                                                                                          Parameter=Theta0
age         -0.030592   0.01524     (-0.0618, 0.00061)      28.185   -0.040456     -0.02281      0        -2.01              0.0543
stock       -0.087136   0.02095     (-0.1282, -0.0461)      3084.8   -0.090364     -0.082114     0        -4.16              <.0001
sales       3.554E-06   7.4E-06     (-1E-05, 0.00002)       11990    0.000002689   0.000004581   0        0.48               0.6321
years       0.005274    0.01273     (-0.0198, 0.03036)      204.21   0.001705      0.010075      0        0.41               0.679
What about when more than one
variable has missing values?
Multiple Imputation by Chained Equations (MICE)
1. Provide initial imputations of all missing values
2. For one particular variable, remove its imputed values again
3. Build a model for that variable based on the other variables, and use the posterior
predictive distribution to impute random values for it
4. Do the same thing for the next variable; the newly imputed values
for the first variable remain in the data
5. Continue for all variables, then repeat the whole cycle many times
6. This makes one imputed data set; do all of the above m times
• Works well in simulations, handles many types of variables at once
• Can take a lot of time, and theoretical justification is not particularly strong
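As a rough sketch of one cycle (not the mice package’s actual algorithm), assume a
data frame d with incomplete variables x1 and x2 and a complete covariate x3, all
hypothetical; a proper implementation would also draw the regression parameters from
their posterior, as mice does:

mis1 <- is.na(d$x1); mis2 <- is.na(d$x2)
d$x1[mis1] <- mean(d$x1, na.rm = TRUE)       # step 1: crude initial fills
d$x2[mis2] <- mean(d$x2, na.rm = TRUE)
for (cycle in 1:10) {
  # steps 2-3: re-impute x1 from a model on the other variables
  f1 <- lm(x1 ~ x2 + x3, data = d[!mis1, ])
  d$x1[mis1] <- predict(f1, d[mis1, ]) + rnorm(sum(mis1), 0, summary(f1)$sigma)
  # step 4: same for x2, keeping the fresh x1 imputations in place
  f2 <- lm(x2 ~ x1 + x3, data = d[!mis2, ])
  d$x2[mis2] <- predict(f2, d[mis2, ]) + rnorm(sum(mis2), 0, summary(f2)$sigma)
}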
R Example
• This data set, “nhanes”, has age group, body mass index, hypertensive status
and serum cholesterol
• Body mass index and serum cholesterol are continuous, while hypertensive
status (yes or no) is binary and age group is ordinal
• We will use the package “mice” and the function “mice” to complete the
imputation and analysis
Code
• You need to install the ‘mice’ package first

library(mice)                                 # also provides the nhanes data
nhanes$hyp <- as.factor(nhanes$hyp)           # "logreg" expects a factor outcome
bord <- mice(nhanes, m = 40, seed = 132,
             method = c("polr", "pmm", "logreg", "norm"))   # one method per column
complete(bord, 12)                            # inspect, e.g., the 12th completed data set
bit <- with(bord, lm(chl ~ age + bmi + hyp))  # fit the analysis model to each data set
summary(pool(bit))                            # pool the fits with Rubin's rules
Output

              est          se          t           df         Pr(>|t|)
(Intercept)   -39.104424   88.462185   -0.4420468  9.341691   0.66851235
age           40.287101    18.378020   2.1921350   6.268912   0.06894168
bmi           6.091045     2.610044    2.3336941   11.449700  0.03876241
hyp           5.410891     29.405394   0.1840102   8.038752   0.85856252

Body mass index is a significant predictor of cholesterol, and age nearly is, but
hypertensive status is not
Markov Chain Monte Carlo Approach
• Here, the imputed values are generated via Markov chains
• A Markov chain has the property that the probability of the next link in the chain
depends only on the current link
• Basically, we perform a bunch of steps, and the probability of each step depends
only on the previous step
• Under certain conditions, theory holds that the steps will eventually converge to
the distribution we are trying to sample from, called the stationary distribution
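To give the flavor with a univariate sketch (the real method is multivariate; this
simplified loop assumes a single normal variable y with some missing entries):

set.seed(7)
y <- c(rnorm(80, mean = 10, sd = 2), rep(NA, 20))
mis <- is.na(y)
y[mis] <- mean(y, na.rm = TRUE)   # crude starting values
n <- length(y)

for (step in 1:500) {
  # P-step: draw the parameters given the current completed data
  s2 <- sum((y - mean(y))^2) / rchisq(1, n - 1)   # sigma^2 draw
  mu <- rnorm(1, mean(y), sqrt(s2 / n))           # mu draw given sigma^2
  # I-step: redraw the missing values given the current parameters
  y[mis] <- rnorm(sum(mis), mu, sqrt(s2))
}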
But, there’s a catch…
• This approach assumes multivariate normality
Summary
• Though handling missing data is ultimately just a nuisance necessity and not
the point of the analysis, it pays to give it the consideration it is due
• Whether you use multiple imputation, single imputation, or complete
case analysis depends on how much missing data you have and how big the
sample is
• Having the actual data is still always better
Thank you!