Introduction to Multiple Imputation
Greg Stoddard
September 29, 2011
University of Utah School of Medicine
Outline
The missing value problem
Types of missing
Simple approaches
Multiple Imputation – State of the Art
When to use each approach
Consider the following dataset.
ID    Y      X1     X2
1     11     1      2
2     10     miss   5
3     miss   3      2
4     9      miss   miss
5     12     5      7
6     7      6      3

N = 6
When a linear regression, Y = a + b1(X1)+b2(X2)
is fitted, the sample size drops to N = 3, due to
listwise deletion of missing values.
To get our sample size back to 6, we plug up
the missing holes with “imputed” values.
ID    Y     X1    X2
1     11    1     2
2     10    3     5
3     11    3     2
4     9     6     3
5     12    5     7
6     7     6     3
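The software mentioned later in this lecture is Stata, but purely as an illustration, here is a minimal sketch in Python (assuming pandas and statsmodels as a convenience; the numbers are the toy values from the slide) of how listwise deletion shrinks the sample when the regression is fitted on the data with missing values.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    "y":  [11, 10, np.nan, 9, 12, 7],
    "x1": [1, np.nan, 3, np.nan, 5, 6],
    "x2": [2, 5, 2, np.nan, 7, 3],
})

# The formula interface drops any row with a missing value (listwise deletion),
# so only the 3 complete rows (IDs 1, 5, and 6) contribute to the fit.
fit = smf.ols("y ~ x1 + x2", data=data, missing="drop").fit()
print(fit.nobs)   # 3.0: half the sample is lost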
To offer a precise definition…
missing data imputation is
substituting values for missing
values—but it is not fabricating data
because we do it statistically.
Every study is going to have some
missing values. You rarely see any
mention of it in a published article,
however. Should you mention it?
In the guidance article,
Vandenbroucke JP, von Elm E, Altman DG, et al. Strengthening
the Reporting of Observational Studies in Epidemiology
(STROBE): explanation and elaboration. Ann Intern Med
2007;147(8):W-163 to W-194.
on page W-176, it advises,
“12(c) Explain how missing data were
addressed”
This guidance article then gives the following example,
used in a paper by Chandola et al. (2006) of how you
might say this,
“Our missing data analysis procedures used missing at random
(MAR) assumptions. We used the MICE (multivariate imputation
by chained equations) method of multiple multivariate imputation in
STATA. We independently analyzed 10 copies of the data, each
with missing values suitably imputed, in the multivariate logistic
regression analyses. We averaged estimates of the variables to
give a single mean estimate and adjusted standard errors
according to Rubin’s rules.”
------Chandola T, Brunner E, Marmot M. Chronic stress at work and the
metabolic syndrome: prospective study. BMJ 2006;332:521-5.
PMID: 16428252
Although missing value imputation is too complex a subject to
include in introductory statistics textbooks, entire chapters are
devoted to it in specialized applied texts.
Some examples are:
Harrell Jr FE. Regression Modeling Strategies With Applications to
Linear Models, Logistic Regression, and Survival Analysis. New
York, Springer-Verlag, 2001, pp.41-52.
Twisk JWR. Applied Longitudinal Data Analysis for Epidemiology:
A Practical Guide. Cambridge, Cambridge University Press,
2003, pp.202-224.
Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and
Proportions, 3rd ed. Hoboken NJ, John Wiley & Sons, 2003,
pp.491-560.
A popular classification scheme for missing data is
(Harrell, 2001, pp.41-52):
1. Missing completely at random (MCAR)
2. Missing at random (MAR)
3. Informative missing (IM)
-----Harrell Jr FE. Regression Modeling Strategies With Applications to
Linear Models, Logistic Regression, and Survival Analysis. New
York, Springer-Verlag, 2001, pp.41-52.
Missing completely at random (MCAR)
Data are missing for reasons unrelated to any
characteristics or responses of the subject, including the
value of the missing value, were it to be known. An
example is the accidental dropping of a test tube, resulting
in missing laboratory measurements. (Here, the best
guess of the missing variable is simply the sample
median).
-----Harrell Jr FE, 2001, pp.41-52.
Missing at random (MAR)
Data elements are not missing completely at random; rather, the
probability that a value is missing depends only on values of
variables that were actually measured. For example,
suppose males are less likely to respond to the income
question in general, but the likelihood of responding is
independent of their actual income. In this case,
unbiased sex-specific income estimates can be made if
we have data on the sex variable (by replacing the
missing value with the sex-specific median income, for
example).
-----Harrell Jr FE, 2001, pp.41-52.
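As a rough sketch of the sex-specific median replacement just described (Python with pandas; the variable names and income values are illustrative, not from the lecture):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sex":    ["M", "M", "M", "F", "F", "F"],
    "income": [40, np.nan, 60, 55, np.nan, 65],
})

# transform("median") gives every row its own sex group's median income;
# fillna() then replaces only the missing entries with that value.
df["income_imp"] = df["income"].fillna(
    df.groupby("sex")["income"].transform("median"))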
Informative missing (IM)
Data elements are more likely to be missing if the true
values of the variable in question are systematically
higher or lower. For example, this occurs if low
income subjects, or high income subjects, or both, are
less likely to answer the income question in a survey.
This is the most difficult type of missing data to handle,
and in many cases there is no good value to substitute
for the missing value. Furthermore, if you analyze your
data by just dropping these subjects, your results will be
biased, so that does not work either.
-----Harrell Jr FE, 2001, pp.41-52.
Missing Comorbidities In Patient Medical Record
A special case of missing is a comorbidity not listed in
the patient’s medical record. For example, if no
mention of diabetes was ever made and a diagnostic
code for diabetes was never entered for any clinic visit,
the fact that it is missing suggests that the patient does
not have diabetes.
Defining a coding rule to replace this missing value with
0, or absent, will most likely produce the least amount of
misclassification, or mis-imputation, error.
Steyerberg (2009) mentions this coding rule approach,
“An alternative in such a situation might be to change
the definition of the predictor, i.e., by assuming that if no
value is available from a patient chart, the characteristic
is absent rather than missing.”
------Steyerberg EW. (2009). Clinical Prediction Models: A Practical
Approach to Development, Validation, and Updating. New York,
Springer, 2009, pp.130-131.
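A minimal sketch of this coding rule (Python with pandas; the field name diabetes_code is hypothetical, standing in for whatever chart field records the diagnosis):

import numpy as np
import pandas as pd

# 1 = diabetes code recorded at some visit, NaN = never recorded
chart = pd.DataFrame({"diabetes_code": [1, np.nan, np.nan, 1, np.nan]})

# Coding rule: treat "never recorded" as absent (0) rather than missing.
chart["diabetes"] = chart["diabetes_code"].fillna(0).astype(int)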
Replacing Missing Values with Mean, Median, or
Mode
Before the more sophisticated imputation schemes
were developed, it was common practice to replace the
missing value with a likely value, such as the mean,
median, or mode.
One criticism of this approach is that you artificially
shrink the variance, since so many observations will
then have the average value.
Royston (2004) makes this criticism,
“Old-fashioned imputation typically replaced missing
values with the mean or mode of the nonmissing values
for that variable. That approach is now regarded as
inadequate. For subsequent statistical inference to be
valid, it is essential to inject the correct degree of
randomness into the imputations and to incorporate that
uncertainty when computing standard errors and
confidence intervals for parameters of interest.”
-----Royston P. Multiple imputation of missing values. The Stata
Journal 2004;4(3):227-241.
One possible approach is imputing the missing value
with a likely value, such as the median. Then, add a
random residual back to the median imputed value to
maintain the correct standard error (Harrell, 2001).
--Harrell Jr FE. Regression Modeling Strategies With
Applications to Linear Models, Logistic Regression,
and Survival Analysis. New York, Springer-Verlag,
2001, pp.45-46.
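One possible way to implement "median plus a random residual" is sketched below (Python; drawing the residual from the observed deviations around the median is an assumption about the details, not Harrell's exact procedure):

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
x = pd.Series([1, np.nan, 3, np.nan, 5, 6], dtype=float)

median = x.median()
resid = (x.dropna() - median).to_numpy()      # observed deviations around the median
miss = x.isna()
x_imp = x.copy()
# impute with the median, then add a randomly drawn observed deviation
x_imp[miss] = median + rng.choice(resid, size=miss.sum())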
Such a direct approach is not usually done, however,
since the more widely accepted approaches
accomplish the same thing.
Furthermore, if there are a lot of missing data, imputing
with a likely value might adversely affect the regression
coefficient.
The imputation methods of multiple imputation and
maximum likelihood not only provide the best standard
errors, but also the best regression coefficients.
What About Imputing the Outcome Variable?
ID    Y      X1     X2
1     11     1      2
2     10     miss   5
3     miss   3      2
4     9      miss   miss
5     12     5      7
6     7      6      3
At first, it seems like you must at least have the
outcome variable measured or the subject
should be excluded.
It is common to discard subjects with a missing
outcome variable, but imputing missing values of the
outcome variable frequently leads to more efficient
estimates of the regression coefficients when the
imputation is based on the nonmissing predictor
variables (Harrell, 2001).
----Harrell Jr FE. Regression Modeling Strategies With
Applications to Linear Models, Logistic Regression,
and Survival Analysis. New York, Springer-Verlag,
2001, pp.43.
Missing Value Indicator Approach
A historically popular approach in epidemiologic
research was to use a missing value indicator, which
has a value of 1 if the variable is missing and 0
otherwise.
For example, given the following variable for gender,
1. male n = 50
2. female n = 40
Missing n = 10
we would recode this to two indicators, male and
malemissing:
Original gender variable     Male indicator    Malemissing indicator
1. male     (n=50)           1                 0
2. female   (n=40)           0                 0
. = missing (n=10)           0                 1
and then include both indicator variables in the
regression model. With this approach, the missing
value indicator is not interpreted, or reported in an
article, but simply acts as a placeholder so that subjects
with missing values are not dropped from the analysis.
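A sketch of this recoding (Python with pandas; shown only to make the recoding concrete, since the approach itself is criticized below):

import numpy as np
import pandas as pd

gender = pd.Series([1, 2, np.nan, 1, 2])      # 1 = male, 2 = female, NaN = missing

male        = (gender == 1).astype(int)       # 1 for males; 0 for females and for missing
malemissing = gender.isna().astype(int)       # 1 only when gender is missing
# Both indicators would then be entered into the regression model together.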
Greenland and Finkle (1995) suggest not using the
missing value indicator approach,
“…The method based on missing-data indicators can
exhibit severe bias even when the data are missing
completely at random, … In general, the authors
recommend that epidemiologists avoid using the
missing-indicator method and use more sophisticated
methods whenever a large proportion of data are
missing.”
-----Greenland S, Finkle WD. (1995). A critical look at
methods for handling missing covariates in
epidemiologic regression analysis. Am J Epidemiol
142(12):1255-64.
Steyerberg (2009, pp.130-131) likewise advises not
using the missing value indicator approach,
“…such a procedure ignores correlation of the values of
predictors among each other. Simulations have shown
that the procedure may lead to severe bias in estimated
regression coefficients. The missing indicator
should hence generally not be used.”
------Steyerberg EW. (2009). Clinical Prediction Models: A
Practical Approach to Development, Validation, and
Updating. New York, Springer, 2009, pp.130-131.
Hotdeck Imputation
In this method, the missing values are
replaced by randomly selected values from
the nonmissing values of the same variable.
ID    Y     X1    X2
1     11    1     2
2     10    3     5
3     11    3     2
4     9     6     3
5     12    5     7
6     7     6     3
Hotdeck imputation has the advantage of being simple
to use, it preserves the distributional characteristics of
the variable, and it performs nearly as well as the more
sophisticated imputation approaches (Roth, 1994).
------Roth, P. Missing data: A conceptual review for applied
psychologists. Personnel Psychology 1994;47:537-560.
In hotdeck imputation, you set a random number seed
before the imputation, so you can replicate the
imputation and subsequent analysis. You then create
imputed variables, using different variable names, such
as
male → male_imp.
Then you use ordinary statistical methods, such as
linear regression on the imputed variables.
The original variables, with missing values, are
preserved for final analyses using multiple imputation, if
you choose to do so.
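A minimal hotdeck sketch along these lines (Python with numpy and pandas; the function name hotdeck and the toy data are illustrative, and the new variable name male_imp follows the slide):

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2011)        # pre-specified seed so the run can be replicated

def hotdeck(series, rng):
    # replace each missing value with a randomly drawn nonmissing value of the same variable
    donors = series.dropna().to_numpy()
    out = series.copy()
    miss = series.isna()
    out[miss] = rng.choice(donors, size=miss.sum(), replace=True)
    return out

df = pd.DataFrame({"male": [1, 0, np.nan, 1, np.nan, 0]})
df["male_imp"] = hotdeck(df["male"], rng)     # the original variable 'male' is left untouched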
What you are going to discover, however, is that you do
not get the answer you want in your regression model.
So, you become curious and choose a different random
number seed, and this time you get the desired answer
using your second round of imputed variables. If you
pre-specified your random number seed, this will be
very unsatisfying because you are actually stuck with
your first model…but which model is the “right” one?
It seems the only correct thing to do, then, is to do
hotdeck several times and then average the results to
arrive at a stable “right” answer. This is the basic idea
of multiple imputation, which we will get to later.
ID    Y     X1    X2
1     11    1     2
2     10    3     5
3     11    3     2
4     9     6     3
5     12    5     7
6     7     6     3
Recall, with hotdeck imputation, we simply use a random
value from the nonmissing values. This has the advantage
that an actual possible value is used, but it does not take
into account that a “more likely” value could be obtained by
using information from the other variables the imputed
variable is correlated with. For example, body weight will
be different between teenagers and adults.
Regression Approach
In this method, you impute the missing value with the
predicted value from an appropriate regression model.
For example, you could use a linear regression, with
age and gender as the predictors, to impute the missing
body weight values.
This is similar to imputing with means, but more like
using subgroup specific means.
A disadvantage is that it does not preserve the
variability of the data.
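A sketch of this regression approach (Python with pandas and statsmodels; the body weight example follows the slide, but the data and variable names are illustrative):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "weight": [60, 72, np.nan, 80, np.nan, 55],
    "age":    [16, 35, 14, 50, 40, 15],
    "male":   [0, 1, 1, 1, 0, 0],
})

# fit the imputation model on the complete cases only
model = smf.ols("weight ~ age + male", data=df, missing="drop").fit()

# substitute the fitted (predicted) value wherever weight is missing;
# note that this adds no residual variability to the imputed values
df["weight_imp"] = df["weight"].fillna(model.predict(df))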
Multiple Imputation
“The idea of multiple imputation is that instead of filling
in missing values to create a single imputed dataset,
several (or more) imputed data sets are created each of
which contains different imputed values. The analysis of
a statistical model is then done on each of the imputed
data sets. The multiple analyses are then combined to
yield a single set of results. The major advantage of
multiple imputation over single imputation is that it
produces standard errors that reflect the degree of
uncertainty due to the imputation of missing values. In
general, multiple imputation techniques require that
missing observations are missing at random (MAR).”
-----http://www.ats.ucla.edu/stat/stata/library/ice.htm
There are two major approaches to creating multiply
imputed datasets:
1) multivariate normal
2) imputation by chained equations
-----http://www.ats.ucla.edu/stat/stata/library/ice.htm
1) multivariate normal
This approach is “based on the joint distribution of all
the variables in the imputation model, including
variables to be imputed and variables to be used only
for the purpose of imputing other variables. In this
approach, the joint distribution of all variables in the
imputation model is assumed to be multivariate normal.”
-----http://www.ats.ucla.edu/stat/stata/library/ice.htm
2) imputation by chained equations
This method is “based on each conditional density of a
variable given other variables.”
-----http://www.ats.ucla.edu/stat/stata/library/ice.htm
imputation by chained equations
The user selects how many imputed datasets to
combine. Usually 3 is enough, with little if any
advantage to using more than 10 sets.
To create one imputed dataset (a code sketch follows this list),
1) Drop any subject that is missing every variable that
will be considered in the regression model, as these
subjects are impossible to impute.
2) All missing values for each specific variable are filled in
with randomly selected nonmissing values from the
same variable (hotdeck approach).
3) For each variable, in turn, ignore the filled-in value
and instead impute the missing value by predicting it
from the remaining variables, where the remaining
variables were filled in if needed in step 2 (regression
approach).
-----http://www.ats.ucla.edu/stat/stata/library/ice.htm
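A highly simplified sketch of steps 2 and 3 for a single pass over the variables (Python with pandas and statsmodels; real chained-equations software repeats this cycle several times and adds appropriate random draws to the predictions, which this sketch omits):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(seed=7)

def chained_pass(df):
    # step 2: fill every missing value with a randomly drawn nonmissing value
    #         of the same variable (the hotdeck start)
    filled = df.apply(lambda s: s.fillna(pd.Series(
        rng.choice(s.dropna().to_numpy(), size=len(s)), index=s.index)))
    # step 3: for each variable with missing values, ignore its filled-in values
    #         and re-impute them from a regression on the other (filled) variables
    for col in df.columns:
        miss = df[col].isna()
        if not miss.any():
            continue
        X = sm.add_constant(filled.drop(columns=col))
        fit = sm.OLS(df.loc[~miss, col], X.loc[~miss]).fit()
        filled.loc[miss, col] = fit.predict(X.loc[miss])
    return filled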
Next, each imputed dataset is analyzed independently
with the desired regression model.
Then, “Estimates of parameters of interest are averaged
across the copies to give a single estimate. Standard
errors are computed according to the ‘Rubin rules’,
devised to allow for the between- and within-imputation
components of variation in the parameter estimates.”
-----Royston P. Multiple imputation of missing values. The Stata
Journal 2004;4(3):227-241.
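A sketch of the pooling step for a single regression coefficient under Rubin's rules (Python; the function name rubin_pool and the example numbers are illustrative):

import numpy as np

def rubin_pool(estimates, variances):
    # estimates: the same coefficient from each of the m imputed-data analyses
    # variances: its squared standard error from each analysis
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar  = estimates.mean()                  # pooled point estimate (simple average)
    w     = variances.mean()                  # within-imputation variance
    b     = estimates.var(ddof=1)             # between-imputation variance
    total = w + (1 + 1 / m) * b               # Rubin's total variance
    return qbar, np.sqrt(total)               # pooled estimate and its standard error

# example: pool one coefficient from three imputed-data analyses
est, se = rubin_pool([0.52, 0.47, 0.55], [0.04, 0.05, 0.04])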
Do we really need to go to all this trouble? It depends
on how much missing data you have.
Harrell (2001) provides some crude guidelines for three situations:
Proportion of missings ≤ 0.05
Proportion of missings 0.05 to 0.15
Proportion of missings > 0.15
-----Harrell Jr FE. Regression Modeling Strategies With Applications to
Linear Models, Logistic Regression, and Survival Analysis. New
York, Springer-Verlag, 2001, p.49.
“Proportion of missings  0.05:
It doesn’t matter very much how you impute missings or
whether you adjust variance of regression coefficient
estimates for having imputed data in this case. For
continuous variables imputing missings with the median
nonmissing value is adequate; for categorical predictors
the most frequent category can be used. Complete
case analysis is an option here.”
“Proportion of missings 0.05 to 0.15:
If a predictor is unrelated to all of the other predictors, imputations
can be done the same as the above (i.e., impute a reasonable
constant value). If the predictor is correlated with other predictors,
develop a customized model (or have the transcan function
[available for S-Plus from Harrell’s website] do it for you) to predict
the predictor from all of the other predictors. Then impute missings
with predicted values. For categorical variables, classification trees
are good methods for developing customized imputation models.
For continuous variables, ordinary regression can be used if the
variable in question does not require a nonmonotonic
transformation to be predicted from the other variables. For either
the related or unrelated predictor case, variances may need to be
adjusted for imputation. Single imputation is probably OK here, but
multiple imputation doesn’t hurt.”
“Proportion of missings > 0.15:
This situation requires the same considerations as in
the previous case, and adjusting variances for
imputation is even more important. To estimate the
strength of the effect of a predictor that is frequently
missing, it may be necessary to refit the model on the
subset of observations for which that predictor is not
missing, if Y is not used for imputation. Multiple
imputation is preferred for most models.”
That ends the lecture.
As a discussion topic, however, let’s return to
the informative missing situation.
Informative missing (IM)
Data elements are more likely to be missing if the true
values of the variable in question are systematically
higher or lower. For example, this occurs if low
income subjects, or high income subjects, or both, are
less likely to answer the income question in a survey.
This is the most difficult type of missing data to handle,
and in many cases there is no good value to substitute
for the missing value. Furthermore, if you analyze your
data by just dropping these subjects, your results will be
biased, so that does not work either.
-----Harrell Jr FE, 2001, pp.41-52.
Discussion Question:
Suppose you want to predict opioid abuse in patients
receiving prescription opioid pain medications.
Your primary predictor variable is an opioid abuse potential scale
that is scored as: 1) at risk for abuse, 0) not at risk.
Your primary outcome variable is a question on your survey, “Do
you currently or have you in the past year used opioid prescription
drugs simply to get high? (Yes or No)”
You told your subjects that they had the option to skip any question
they wanted to. About 20% of your subjects chose to not answer
that question (many of them, perhaps, because they would be
admitting to doing something illegal).
How are you going to analyze these data?