What do we mean by missing data?
Missing data are simply observations that we intended to be made but did not. For example, an individual may
only respond to certain questions in a survey, or may not respond at all to a particular wave of a longitudinal
survey.
In the presence of missing data, our goal remains making inferences that apply to the population targeted by
the complete sample - i.e. the goal remains what it would have been had we seen the complete data.
However, both making inferences and performing the analysis are now more complex. We will see we need to
make assumptions in order to draw inferences, and then use an appropriate computational approach for the
analysis.
We will avoid adopting computationally simple solutions (such as just analysing complete data or carrying
forward the last observation in a longitudinal study) which generally lead to misleading inferences.
In practice the data consist of (a) the observations actually made (where '?' denotes a missing observation):
Figure 1: Typical partially observed data set
and (b) the pattern of missing values:
Figure 2: Pattern of missing values for the data in Figure 1. A '1' indicates that an observation is seen, a '0'
that it is missing
Inferential framework
When it comes to analysis, whether we adopt a frequentist approach (Figure 3) or a Bayesian approach (Figure
4), the likelihood is central. In these notes, for convenience, we discuss issues from a frequentist perspective,
although often we use appropriate Bayesian computational strategies to approximate frequentist analyses.
Figure 3: Schematic for frequentist (sometimes termed traditional) paradigm of inference
The actual sampling process involves the 'selection' of the missing values, as well as the units. So to complete
the process of inference in a justifiable way we need to take this into account.
Figure 4: Schematic for Bayesian paradigm of inference
The likelihood is a measure of comparative support for different models given the data. It requires a model for
the observed data, and as with classical inference this must involve aspects of the way in which the missing
data have been selected (i.e. the missingness mechanism).
Assumptions
We distinguish between item and unit nonresponse (missingness). For item missingness, values can be missing
on response (i.e. outcome) variables and/or on explanatory (i.e. design/covariate/exposure/confounder)
variables.
Missing data can affect the properties of estimators (for example, means, percentages, percentiles, variances,
ratios, regression parameters and so on). Missing data can also affect inferences, i.e. the properties of tests and
confidence intervals, and Bayesian posterior distributions.
A critical determinant of these effects is the way in which the probability of an observation being missing (the
missingness mechanism) depends on other variables (measured or not) and on its own value.
In contrast with the sampling process, which is usually known, the missingness mechanism is usually
unknown.
The data alone cannot usually definitively tell us the sampling process.
Likewise, the missingness pattern, and its relationship to the observations, cannot definitively identify the
missingness mechanism.
The additional assumptions needed to allow the observed data to be the basis of inferences that would have
been available from the complete data can usually be expressed in terms of either
1. the relationship between selection of missing observations and the values they would have taken, or
2. the statistical behaviour of the unseen data.
These additional assumptions are not subject to assessment from the data under analysis; their plausibility
cannot be definitively determined from the data at hand.
The issues surrounding the analysis of data sets with missing values therefore centre on assumptions. We
have to
1. decide which assumptions are reasonable and sensible in any given setting;
- contextual/subject matter information will be central to this
2. ensure that the assumptions are transparent;
3. explore the sensitivity of inferences/conclusions to the assumptions, and
4. understand which assumptions are associated with particular analyses.
Getting computation out of the way
The above implies it is sensible to use approaches that make weak assumptions, and to seek computational
strategies to implement them.
However, often computationally simple strategies are adopted, which make strong assumptions, which are
subsequently hard to justify.
Classic examples are completers analysis (i.e. only including units with fully observed data in the analysis)
and last observation carried forward. The latter is sometimes advocated in longitudinal studies, and replaces a
unit's unseen observations at a particular wave with their last observed values, irrespective of the time that has
elapsed between the two waves.
Simple, ad-hoc methods and their shortcomings
In contrast to principled methods, these usually create a single 'complete' dataset, which is analysed as if it
were the fully observed data.
Unless certain, fairly strong, assumptions are true, the answers are invalid.
We briefly review the following methods:
- Analysis of completers only
- Imputation of simple mean
- Imputation of regression mean
- Last observation carried forward
Completers analysis
The data set below has one missing observation on variable 2, unit 10.
- Completers analysis deletes all units with incomplete data from the analysis (here unit 10).
- It is inefficient.
- It is problematic in regression when covariate values are missing and models with several sets of explanatory variables need to be compared. Either we keep changing the size of the data set, as we add/remove explanatory variables with missing observations, or we use the (potentially very small, and unrepresentative) subset of the data with no missing values.
- When the missing observations are not a completely random selection of the data, a completers analysis will give biased estimates and invalid inferences.
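To see the last point concretely, here is a minimal simulation sketch in Python; the data-generating model is hypothetical (not part of these notes), but it shows that when the chance of an observation being missing depends on its own value, the completers-only mean is biased.

```python
import numpy as np

# Minimal sketch with a hypothetical data-generating model: the chance of being
# missing depends on the value itself, so completers analysis is biased.
rng = np.random.default_rng(42)
n = 100_000
y = rng.normal(loc=10, scale=2, size=n)              # the complete data

# Larger values are more likely to be missing (not missing completely at random)
p_observed = 1 / (1 + np.exp(y - 10))                # decreases as y increases
observed = rng.uniform(size=n) < p_observed

print("mean of complete data:  ", round(y.mean(), 2))            # close to 10
print("mean of completers only:", round(y[observed].mean(), 2))  # clearly below 10
```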
Simple mean imputation
The data set below has one missing observation on variable 2, unit 10.
We replace this with the arithmetic average of the observed data for that variable. This value is shown in red in the table below.
- This approach is clearly inappropriate for categorical variables.
- It does not lead to proper estimates of measures of association or regression coefficients. Rather, associations tend to be diluted.
- In addition, variances will be wrongly estimated (typically underestimated) if the imputed values are treated as real. Thus inferences will be wrong too.
Regression mean imputation
Here, we use the completers to calculate the regression of the incomplete variable on the other complete
variables. Then, we substitute the predicted mean for each unit with a missing value. In this way we use
information from the joint distribution of the variables to make the imputation.
Example
Consider again our dataset with two variables, which is missing variable 2 on unit 10:
To perform regression imputation, we first regress variable 2 on variable 1 (note, it doesn't matter which of
these is the 'response' in the model of interest). In our example, we use simple linear regression:
V2 = β0 + β1 V1 + e.
Using units 1-9, we obtain estimates 6.56 for β0 and -0.366 for β1, so the fitted regression relationship is
Expected value of V2 = 6.56 - 0.366 V1.
For unit 10, this gives
6.56 - 0.366 × 3.6 = 5.24.
This value is shown in red below:
Results of regression mean imputation. Note:
- Regression mean imputation can generate unbiased estimates of means, associations and regression coefficients in a much wider range of settings than simple mean imputation.
- However, one important problem remains. The variability of the imputations is too small, so the estimated precision of regression coefficients will be wrong and inferences will be misleading.
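The mechanics of regression mean imputation can be sketched as follows. The data are made up for illustration (the data set shown in the figures is not reproduced here); the closing comment restates the point about understated variability.

```python
import numpy as np

# Sketch of regression mean imputation with made-up data.
rng = np.random.default_rng(1)
n = 10
v1 = rng.normal(5, 1, size=n)
v2 = 6.5 - 0.4 * v1 + rng.normal(0, 0.5, size=n)
v2[-1] = np.nan                                      # one missing value, as in the example

obs = ~np.isnan(v2)
slope, intercept = np.polyfit(v1[obs], v2[obs], 1)   # regress V2 on V1 using the completers

v2_imputed = v2.copy()
v2_imputed[~obs] = intercept + slope * v1[~obs]      # substitute the predicted mean

print("fitted line: V2 =", round(intercept, 2), "+", round(slope, 2), "* V1")
print("imputed value for the missing unit:", round(v2_imputed[-1], 2))
# The imputed value sits exactly on the regression line, so the variability of
# the completed data is understated and estimated precision will be too high.
```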
Creating an extra category
When a categorical variable has missing values it is common practice to add an extra 'missing value' category.
In the example below, the missing values, denoted '?' have been given the category 3.
This is bad practice because:
- the impact of this strategy depends on how missing values are divided among the real categories, and how the probability of a value being missing depends on other variables;
- very dissimilar classes can be lumped into one group;
- severe bias can arise, in any direction, and
- when used to stratify for adjustment (or to correct for confounding) the completed categorical variable will not do its job properly.
Last observation carried forward (LOCF)
This method is specific to longitudinal data problems.
For each individual, missing values are replaced by the last observed value of that variable. For example:
Here the three missing values for unit 1, at times 4, 5 and 6 are replaced by the value at time 3, namely 2.0.
Likewise the two missing values for unit 3, at times 5 and 6, are replaced by the value at time 4, which is 3.5.
Using LOCF, once the data set has been completed in this way it is analysed as if it were fully observed.
For full longitudinal data analyses this is clearly disastrous: means and covariance structure are seriously
distorted. For single time point analyses the means are still likely to be distorted, measures of precision are
wrong and hence inferences are wrong. Note this is true even if the mechanism that causes the data to be
missing is completely random. For a full discussion download the talk 'LOCF - time to stop carrying it
forward' from the preprints page of this site.
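For completeness, the mechanics of LOCF (which these notes advise against) can be sketched as follows. Only the values quoted above are taken from the example (unit 1: 2.0 at time 3; unit 3: 3.5 at time 4); the remaining entries are placeholders.

```python
import numpy as np
import pandas as pd

# Sketch of last observation carried forward; placeholder values are used for
# the observations not quoted in the text.
long = pd.DataFrame({
    "unit": [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3],
    "time": [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
    "y":    [1.5, 1.8, 2.0, np.nan, np.nan, np.nan,   # unit 1: last seen 2.0 at time 3
             3.0, 3.2, 3.4, 3.5, np.nan, np.nan],     # unit 3: last seen 3.5 at time 4
})

# Carry each unit's last observed value forward, irrespective of the time elapsed
long["y_locf"] = long.groupby("unit")["y"].ffill()
print(long)
```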
Conclusions
Unless the proportion missing is so small as to be unlikely to affect inferences, these simple ad-hoc methods
should be avoided. However, note that 'small' is hard to define: estimates of the chances of rare events can be
very sensitive to just a few missing observations; likewise, a sample mean can be sensitive to missing
observations which are in the tails of the distribution.
They usually conflict with the statistical model that underpins the analysis (however simple and implicit this might be), so they introduce bias.
Because the assumptions they implicitly make about the reason for the data being missing are often difficult to describe (e.g. with LOCF), it can be very hard to know what assumptions are being made in the analysis.
They do not properly reflect statistical uncertainty: data are effectively 'made up' and no subsequent account is
taken of this.
Some notation
The data
We denote the data we intended to collect, by Y, and we partition this into
Y = {Yo, Ym},
where Yo is observed and Ym is missing.
Note that some variables in Y may be outcomes/responses, some may be explanatory variables/covariates.
Depending on the context these may all refer to one unit, or to an entire dataset.
Missing value indicator
Corresponding to every observation Y, there is a missing value indicator R, defined as:
R = 1 if Y is observed, and R = 0 if Y is missing,
with R corresponding to Y.
Missing value mechanism
The key question for analyses with missing data is, under what circumstances, if any, do the analyses we
would perform if the data set were fully observed lead to valid answers?
As before, 'valid' means that effects and their SE's are consistently estimated, tests have the correct size, and
so on, so inferences are correct.
The answer depends on the missing value mechanism.
This is the probability that a set of values are missing given the values taken by the observed and missing
observations, which we denote by
Pr(R | yo, ym)
Examples of missing value mechanisms
1. The chance of nonresponse to questions about income usually depends on the person's income.
2. Someone may not be at home for an interview because they are at work.
3. The chance of a subject leaving a clinical trial may depend on their response to treatment.
4. A subject may be removed from a trial if their condition is insufficiently controlled.
Missing Completely at Random (MCAR)
Suppose the probability of an observation being missing does not depend on observed or unobserved
measurements. In mathematical terms, we write this as
Pr(r | yo, ym) = Pr(r)
Then we say that the observation is Missing Completely At Random, which is often abbreviated to MCAR.
Note that in a sample survey setting MCAR is sometimes called uniform non-response.
If data are MCAR, then consistent results with missing data can be obtained by performing the analyses we would have used had there been no missing data, although there will generally be some loss of information. In
practice this means that, under MCAR, the analysis of only those units with complete data gives valid
inferences.
An example of a MCAR mechanism would be that a laboratory sample is dropped, so the resulting
observation is missing.
However, many mechanisms that initially seem to be MCAR may turn out not to be. For example, a patient in
a clinical trial may be lost to follow up after 'falling' under a bus; however if it is a psychiatric trial, this may
be an indication of poor response to treatment. Likewise, if a response to a postal questionnaire is missing
because the questionnaire was lost or stolen in the post, this may not be random but rather reflect the area in
which the sorting office is located.
As we have already said, under MCAR analyses of completers only (a short hand for including in the analysis
only units with fully observed data) give valid inferences.
So do analyses based on moment based estimators (for example, generalised estimating equations), and other
estimators derived from consistent estimating equations.
By consistent estimating equations we mean functions of the data and unknown parameters whose
expectation, taken over the complete data at the population parameter values, is zero. Under MCAR, they still
have expectation zero, and so still lead to valid inferences.
Saying the same thing mathematically, an estimating equation can be written as U(y, θ); the estimate is the value of θ at which U(y, θ) = 0. The estimating equation is consistent because E{U(Y, θ)} = 0 (where θ is the population parameter value). It remains consistent if the data are missing completely at random (MCAR) because, even then, E{U(Yo, θ)} = 0 still holds.
A simple example of a consistent estimating equation is that for the sample mean, U(y, θ) = Σi (yi - θ), which is solved by taking θ equal to the sample mean.
Missing At Random (MAR)
After considering MCAR, a second question naturally arises. That is, what are the most general conditions
under which a valid analysis can be done using only the observed data, and no information about the missing
value mechanism, Pr(r | yo, ym)?
The answer to this is when, given the observed data, the missingness mechanism does not depend on the
unobserved data. Mathematically,
Pr(r | yo, ym) = Pr(r | yo).
This is termed Missing At Random, abbreviated MAR.
This is equivalent to saying that two units which share the same observed values have the same statistical behaviour on the other variables, whether these are observed or not.
For example:
As units 1 and 2 have the same values where both are observed, given these observed values, under MAR,
variables 3, 5 and 6 from unit 2 have the same distribution (NB not the same value!) as variables 3, 5 and 6
from unit 1.
Note that under MAR the probability of a value being missing will generally depend on observed values, so it does not correspond to the intuitive notion of 'random'. The important idea is that the missing value mechanism can be expressed solely in terms of data that are observed.
Unfortunately, this can rarely be definitively determined from the data at hand!
Examples of MAR mechanisms
- A subject may be removed from a trial if his/her condition is not controlled sufficiently well (according to pre-defined criteria on the response).
- Two measurements of the same variable are made at the same time. If they differ by more than a given amount a third is taken. This third measurement is missing for those that do not differ by the given amount.
A special case of MAR is uniform non-response within classes. For example, suppose we seek to collect data
on income and property tax band. Typically, those with higher incomes may be less willing to reveal them.
Thus, a simple average of incomes from respondents will be downwardly biased.
However, now suppose we have everyone's property tax band, and given property tax band non-response to
the income question is random. Then, the income data is missing at random; the reason, or mechanism, for it
being missing depends on property band. Given property band, missingness does not depend on income itself.
Therefore, to get an unbiased estimate of income, we first average the observed income within each property
band. As data are missing at random given property band, these estimates will be valid. To get an estimate of
the overall income, we simply combine these estimates, weighting by the proportion in each property band.
In this example, a simple summary statistic (the average of observed incomes) was biased. In contrast, a simple model (estimating income conditional on property band), where we condition on the variable that makes the data MAR, led to a valid result.
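A sketch of this calculation, with made-up incomes, band proportions and response probabilities, is given below; the point is only that the within-band averages, combined using the known band proportions, remove the bias of the naive average.

```python
import numpy as np
import pandas as pd

# Sketch of the property-band example with made-up data: income is missing at
# random given property band (here, higher bands respond less often).
rng = np.random.default_rng(7)
n = 50_000
band = rng.choice(["A", "B", "C"], size=n, p=[0.5, 0.3, 0.2])
band_mean = {"A": 20_000, "B": 35_000, "C": 60_000}
income = np.array([rng.normal(band_mean[b], 5_000) for b in band])

p_respond = {"A": 0.9, "B": 0.7, "C": 0.4}           # response depends on band only (MAR)
observed = rng.uniform(size=n) < np.array([p_respond[b] for b in band])

df = pd.DataFrame({"band": band, "income": np.where(observed, income, np.nan)})

naive = df["income"].mean()                           # simple average of respondents: biased
band_means = df.groupby("band")["income"].mean()      # valid within each band under MAR
band_props = df["band"].value_counts(normalize=True)  # property band is known for everyone
weighted = (band_means * band_props).sum()            # combine, weighting by band proportions

print("true mean:    ", round(income.mean()))
print("naive mean:   ", round(naive))                 # downwardly biased
print("weighted mean:", round(weighted))              # approximately unbiased
```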
This is an example of a more general result. Methods based on the likelihood are valid under MAR. However,
in general non-likelihood methods (e.g. based on completers, moments, estimating equations & including
generalised estimating equations) are not valid under MAR, although some can be 'fixed up'. In particular,
ordinary means, and other simple summary statistics from observed data, will be biased.
Finally, note that in a likelihood setting the term ignorable is often used to refer to an MAR mechanism. It is the mechanism (i.e. the model for Pr(R | yo)) which is ignorable - not the missing data!
Missing Not At Random (MNAR)
When neither MCAR nor MAR hold, we say the data are Missing Not At Random, abbreviated MNAR.
In the likelihood setting (see end of previous section) the missingness mechanism is termed non-ignorable.
What this means is
1. Even accounting for all the available observed information, the reason for observations being missing
still depends on the unseen observations themselves.
2. To obtain valid inference, a joint model of both Y and R is required (that is a joint model of the data
and the missingness mechanism).
Unfortunately
1. We cannot tell from the data at hand whether the missing observations are MCAR, MAR or MNAR
(although we can distinguish between MCAR and MAR).
2. In the MNAR setting it is very rare to know the appropriate model for the missingness mechanism.
Hence the central role of sensitivity analysis; we must explore how our inferences vary under assumptions of
MAR, MNAR, and under various models. Unfortunately, this is often easier said than done, especially under
the time and budgetary constraints of many applied projects.
Summary
We have defined, in non-technical language, the commonly used terms MCAR, MAR and MNAR, together
with ignorable and non-ignorable.
We have seen that
1. The implications of missingness for the analysis depend on the missing value mechanism , which is
rarely known.
2. The intuitive notion of randomness for the missing value mechanism is called Missing Completely at
Random (MCAR).
A wide range of analyses are valid under the assumption of MCAR.
3. A special intermediate case between 'missing completely at random' and 'missing not at random' is
Missing at Random (MAR).
Under MAR, particular analyses that ignore the missing value mechanism are valid (e.g. likelihood-based analyses), and others can be fixed up (e.g. estimating equations can be fixed up by weighting).
4. In most situations, the true mechanism is probably MNAR.
Important
1. We cannot tell from the data at hand whether the missing observations are MCAR, MAR or MNAR
(although we can distinguish between MCAR and MAR).
2. In the MNAR setting it is very rare to know the appropriate model for the missingness mechanism.
Hence the central role of sensitivity analysis; we must explore how our inferences vary under assumptions of
MAR, MNAR, and under various models. Unfortunately, this is often easier said than done, especially under
the time and budgetary constraints of many applied projects.
Principled methods
These all have the following in common:
- No attempt is made to replace a missing value directly; i.e. we do not pretend to 'know' the missing values.
- Rather, the available information (from the observed data and other contextual considerations) is combined with assumptions not dependent on the observed data.
This is used to
1. either generate statistical information about each missing value, e.g. distributional information: given what we have observed, the missing observation has a normal distribution with mean μ and variance σ², where the parameters can be estimated from the data;
2. and/or generate information about the missing value mechanism.
The great range of ways in which these can be done leads to the plethora of approaches to missing values.
Here are some broad classes of approach:
- Wholly model based methods.
- Simple stochastic imputation.
- Multiple stochastic imputation.
- Weighting methods.
Wholly model based methods
A full statistical model is written down for the complete data.
Analysis (whether frequentist or Bayesian) is based on the likelihood.
Assumptions must be made about the missing data mechanism:
- If it is assumed MCAR or MAR, no explicit model is needed for it.
- Otherwise this model must be included in the overall formulation.
Such likelihood analyses require some form of integration (averaging) over the missing data. Depending on
the setting this can be done implicitly or explicitly, directly or indirectly, analytically or numerically. The
statistical information on the missing data is contained in the model.
Examples of this would be the use of linear mixed models under MAR in SAS PROC MIXED or MLwiN.
Simple stochastic imputation
- Instead of replacing a value with a mean, a random draw is made from some suitable distribution.
- Provided the distribution is chosen appropriately, consistent estimators can be obtained from methods that would work with the whole data set.
- This is very important in the large survey setting, where draws are made from units with complete data that are 'similar' to the one with missing values (donors).
- There are many variations on this hot-deck approach.
- Implicitly, these methods use non-parametric estimates of the distribution of the missing data, and typically need very large samples.
Although the resulting estimators can behave well, for precision (and inference) account must be taken of the
source of the imputations (i.e. there is no 'extra' data). This implies that the usual complete data estimators of
precision can't be used. Thus, for each particular class of estimator (e.g. mean, ratio, percentile) each type of
imputation has an associated variance estimator that may be design based (i.e. using the sampling structure of
the survey) or model based, or model assisted (i.e. using some additional modelling assumptions). These
variance estimators can be very complicated and are not convenient for generalization.
Multiple (stochastic) imputation
This is very similar to the single stochastic imputation method, except there are many ways in which draws
can be made (e.g. hot-deck non-parametric, model based).
The crucial difference is that, instead of completing the data once, the imputation process is repeated a small
number of times (typically 5-10). Provided the draws are done properly, variance estimation (and hence
constructing valid inferences) is much more straightforward.
As is discussed more in the 'introduction to multiple imputation' document, the observed variability among the
estimates from each imputed data set is used in modifying the complete data estimates of precision. In this
way, valid inferences are obtained under missing at random.
Weighting methods
We give a simple illustration of weighting methods and contrast them with likelihood-based methods.
Example: simple continuous problem
Consider a simple linear regression setting:
E(Yi) = β0 + β1 xi = xiᵀβ,   i = 1, ..., n,
where Yi = xiᵀβ + ei and the ei are independent and identically distributed as N(0, σ²). A typical data set might look like this:
The ordinary least squares regression line (in this case maximum likelihood) is obtained by solving the normal equations for β:
Σi xi(yi - xiᵀβ) = 0.
More generally, we can get parameter estimates by solving estimating equations:
U(Y; β) = Σi Ui(yi; β) = 0.
In this example, the estimates of the slope and intercept give the following line:
Suppose now that some response (i.e.Y) observations are missing. The implications are (i) possible bias in the
estimate of the intercept and slope and (ii) loss of precision in the estimate of the intercept and slope. Suppose
in particular that the responses are MNAR; specifically that all observations greater than y=13 are unobserved.
In other words we lose all observations above the horizontal line in the left-hand picture, leaving the observed
data in the right hand picture:
The 'completers' regression line is now biased (and inconsistent). However, because in this case we know the
missing value mechanism and the distribution involved (which is unlikely in real applications) we can do a
valid analysis using likelihood methods. In this special case the likelihood method is known as Tobit
regression. Both the 'completers' and Tobit regression lines are shown in the figure below, where the completers line is the bottom line at the right hand end:
To make it a little more realistic, suppose now that an observation greater than 13 has a probability of 0.25 of
being observed; in other words instead of seeing the left hand plot below, we see the right hand plot.
The completers line is still inconsistent (lower
line at right hand end):
We could use Tobit regression to correct for
this (top line at right hand end: original
regression line; middle line at right hand end:
Tobit regression line; bottom line at right hand
end: 'completers' regression line)
But there now exists an alternative correction, which requires only that we know the probability of Yi being
missing given its value. In other words, we don't need to know the distribution of the observations as we do for
the Tobit regression.
Let Ri be a random variable indicating whether Yi is missing or not, so Ri = 0 implies Yi missing, and Ri = 1
implies Yi is observed.
The following weighted estimating equation is unbiased for the regression parameters:
Σi (Ri / πi) xi(yi - xiᵀβ) = 0,   where πi = Pr(Ri = 1 | yi).
In this (artificial) example
Pr(Ri = 1 | yi) = 0.25 if yi > 13, and 1 otherwise,
so we can use simple weighted least squares to make the correction.
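A minimal simulation sketch of this correction follows. Only the true slope of 2 and the rule 'seen with probability 0.25 when y > 13' come from the example; the intercept, the distribution of x and the error standard deviation are assumptions made for illustration.

```python
import numpy as np

# Sketch: compare completers-only least squares with inverse-probability-weighted
# least squares when observations with y > 13 are seen with probability 0.25.
rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(0, 6, size=n)
y = 3 + 2 * x + rng.normal(0, 2, size=n)              # true slope is 2 (intercept assumed)

prob_seen = np.where(y > 13, 0.25, 1.0)               # known missingness probabilities
seen = rng.uniform(size=n) < prob_seen

X = np.column_stack([np.ones(n), x])

def wls(X, y, w):
    """Weighted least squares: solve (X'WX) beta = X'W y."""
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

beta_completers = wls(X[seen], y[seen], np.ones(seen.sum()))
beta_ipw = wls(X[seen], y[seen], 1 / prob_seen[seen])  # weight each unit by 1 / Pr(seen)

print("completers slope:", round(beta_completers[1], 2))  # biased towards 0
print("IPW slope:       ", round(beta_ipw[1], 2))          # approximately 2
```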
Comparison of weighting with other methods. At the right hand end, the top line is from the weighted regression; the second line is the original regression line; the third line is the Tobit regression and the fourth line is the completers analysis.
We now look at the performance of these two methods in this simple regression setting where the probability
of observations greater than 13 being seen is 0.25. For sample sizes of 20, 100 and 1000, the table below
shows the mean and standard deviation of the slope estimators (true value 2) over 10,000 simulations.
Estimator           Expected value   SE
n = 30
Completers only     1.73             0.39
Tobit               1.99             0.33
Weighted            1.95             0.45
n = 100
Completers only     1.75             0.20
Tobit               1.98             0.18
Weighted            1.99             0.23
n = 1000
Completers only     1.74             0.063
Tobit               1.98             0.055
Weighted            2.00             0.070
We see that both Tobit and weighted regression are unbiased, but that estimates from a weighted analysis are
far more variable.
Conclusion
Our simple examples have illustrated that there are broadly two forms of principled analysis:
1. likelihood methods, which make distributional assumptions about the unseen data, and assumptions about the form of the dropout mechanism;
2. weighting methods, which use the inverse of Pr(Ri = 1 | Yi) as weights.
In its simple form, weighting is much less precise. However, in the session on weighting, we will see that this can be addressed, albeit with difficulty.
In summary, in contrast to ad-hoc methods, principled methods are based on a well-defined statistical model for the complete data, together with explicit assumptions about the missing value mechanism. The subsequent analysis, inferences and conclusions are valid under these assumptions. This does not mean the assumptions are necessarily true, but it does allow the dependence of the conclusions on these assumptions to be investigated.
Modelling R
If we have one partially observed variable, define the 'missingness indicator', Ri as before, and construct a
logistic model:
logit Pr(Ri = 1) = β0 + β1 xi1 + β2 xi2 + ...
We can compare models using standard methods, and so select a final model for dropout. We should consider
interactions if we suspect different mechanisms are causing missing observations in different data subgroups.
Such models are not only useful guides to interpreting analyses, they also indicate which variables we should
include for our models to be valid under missing at random (MAR) and provide estimates of the weights for
methods that use inverse probability weights.
We can generalise this approach to cope with the situation where we have two partially observed variables, and the second is always unobserved when the first is (i.e. loss to follow-up):
1. Construct a logistic model for the probability of the first variable being observed.
2. For those units for which the first variable is observed, construct a logistic model for the probability of
the second variable being observed.
Then
Pr(second variable observed) = Pr(second variable observed | first variable observed) × Pr(first variable observed).
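The following sketch, with hypothetical data and variable names, illustrates this two-stage logistic modelling for a monotone (loss-to-follow-up) pattern, and shows how the two fitted probabilities multiply to give the observation probabilities whose inverses can be used as weights. It uses statsmodels for the logistic fits.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical monotone missingness: x is fully observed, y1 is partially
# observed, and y2 is only ever seen when y1 is seen.
rng = np.random.default_rng(3)
n = 2000
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y1"] = df["x"] + rng.normal(size=n)
df["y2"] = df["y1"] + rng.normal(size=n)

# Simulated missingness indicators (r2 can be 1 only if r1 is 1)
df["r1"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-(1.0 + 0.8 * df["x"])))).astype(int)
df["r2"] = df["r1"] * (rng.uniform(size=n) < 1 / (1 + np.exp(-(0.5 + 0.8 * df["y1"])))).astype(int)

# Stage 1: logistic model for Pr(y1 observed), using everyone
fit1 = sm.Logit(df["r1"], sm.add_constant(df[["x"]])).fit(disp=0)
p1 = fit1.predict(sm.add_constant(df[["x"]]))

# Stage 2: logistic model for Pr(y2 observed | y1 observed), using units with y1 seen
obs1 = df["r1"] == 1
fit2 = sm.Logit(df.loc[obs1, "r2"], sm.add_constant(df.loc[obs1, ["x", "y1"]])).fit(disp=0)
p2 = fit2.predict(sm.add_constant(df.loc[obs1, ["x", "y1"]]))

# Pr(y2 observed) = Pr(y2 obs | y1 obs) x Pr(y1 obs); the inverse gives IPW weights
p_obs2 = p1[obs1] * p2
weights = 1 / p_obs2[df.loc[obs1, "r2"] == 1]
print(weights.describe())
```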
Introduction
The aim of this document is to:
1. introduce the ideas of multiple imputation;
2. outline how to carry out multiple imputation, and
3. provide an intuitive justification for multiple imputation.
Why do multiple imputation?
One of the main problems with the single stochastic imputation methods is the need for developing
appropriate variance formulae for each different setting.
Multiple imputation attempts to provide a procedure that can get the appropriate measures of precision
relatively simply in (almost) any setting.
It was developed by Rubin in a survey setting (where it feels very natural) but has more recently been used more widely.
Below, we assume we have an established method for fitting our model, had the data been completely
observed.
- e.g. regression, glm, ...
Some notation
For simplicity, suppose we have only two variables in our data set. Suppose one of them is observed on every
unit. Call this Y1. Suppose one is only observed on some units. Call this Y2.
The key idea
The key idea is to use the data from units where both (Y1, Y2) are observed to learn about the relationship
between Y1 and Y2. Then, we use this relationship to complete the data set by drawing the missing observations
from the distribution of Y2| Y1. We do this K (typically 5) times, giving rise to K complete data sets.
We analyse each of these data sets in the usual way.
We combine the results using particular rules.
Intuition behind multiple imputation
First, we model observed (Y1, Y2) pairs. These are shown below, with a regression line through them. It's
crucial that the variable with the missing values is the response, whether or not it is going to be the response in
the final model of interest. The '?' indicates we have the value of Y1, but that for Y2 is missing.
Next, we draw the missing Y2 by (i) drawing from the distribution of the regression line and (ii) drawing from the variability about that line. In the picture below, the dotted line is the regression line from the observed data (as on the previous
picture) and the red line is drawn from the estimated distribution of the regression line (i.e. the red line's
intercept and slope are drawn from the estimated bivariate normal distribution of the intercept and slope).
Then, a draw is made from the estimated normal distribution of the residuals, and added to the line, to give the
imputed points, shown by red triangles.
From this graph we can see straight away why replacing the missing observations with the mean of Y2 is a bad
idea. For instance, the leftmost '?' in the first picture above would be given a value far above the regression line
(which represents its expected value given Y1).
We can also see why a single imputation on the regression line - i.e. where the imputed data (triangles in the
graph above) lies on the regression line - is inadequate. This would be an over-confident prediction of the
missing value. Systematically doing this would lead to estimates of standard errors that were too small, and
inferences that were therefore over-confident.
However, a single imputation of each missing value is not adequate, because we only know the distribution of
the missing values. Thus, we need to repeat the imputation process a number of times, each time drawing a
new regression line, and new residuals about that regression line. We thus end up with a number of completed
data sets as follows:
Notation for analyses of imputed data sets
As described above, we have imputed K complete data sets. Analysing each of them in the usual way (i.e.
using the model intended for the complete data) gives us K estimates of the original quantity of interest, Q.
Denote these estimates Q1,..., QK. So, each Q could represent a regression coefficient from a regression model
of interest which we fit to each imputed data set in turn.
The analysis of each imputed data set will also give an estimate of the variance of Qk, say Vk. Again, this is the usual variance estimate from the model.
We combine these quantities to get our overall estimate and its variance using certain rules.
Intuition for combining the estimates
Consider the imputation of just one missing observation.
Imagine a 3-d representation, with the
Ymiss axis going back into the screen.
Given a particular value of Yobs the
imputations (numbered 1,2,3,4)
combine with the observed data to
give the estimates of Q shown by the
black dots. Each of these estimates
also has a variance, which is
represented by the line through the
black dot.
Now we project this into two dimensions, over Ymiss.
The multiple imputation estimate is going to be the average of the black dots. In other words, it is the average, over the distribution of YM given YO, of Q, which is itself calculated from the observed and 'missing' data:
QMI = E_{YM | YO} E[Q(YO, YM)].
The variance has to reflect two components; the variance of the Q's from the imputed datasets about their
average and also the variance of each Q estimate. In fact, it is the sum of these two; i.e. in this case (with
Q1,..., Q4) the sample variability of Q1,..., Q4 about their mean, plus the average of the variance of Q1,..., Q4.
These are known respectively as the between imputation variance and the within imputation variance.
Mathematically,
V[QMI] = E_{YM | YO} V[Q(YO, YM)] + V_{YM | YO} E[Q(YO, YM)].
This motivates the formulae for combining the estimates and calculating the variance, which are given in the
next section.
Testing hypotheses
We assume that, if the data were all observed, then our estimator Q would have a normal distribution.
If this is so, we can compare
(QMI - Q) / √VMI
to t_ν, a t-distribution with ν degrees of freedom, where
ν = (K - 1) [ 1 + W / ((1 + 1/K) B) ]²
and W, B and VMI are the within-imputation variance, the between-imputation variance and the total variance defined under 'Combining the estimates' below.
The rate of missing information
If there were no missing data, and we used multiple imputation, we should find that the between-imputation component (1 + 1/K)B is essentially zero. Thus the relative increase in variance due to the missing data is
r = (1 + 1/K) B / W.
Alternatively, the 'rate of missing information' is
γ = r / (1 + r).
It turns out a better estimate of this quantity is
(r + 2/(ν + 3)) / (1 + r).
Combining the estimates
Let the multiple imputation estimate of Q be QMI. Then, following from the above,
QMI = (1/K) Σk Qk.
Further define the within-imputation and between-imputation components of variance by
W = (1/K) Σk Vk   and   B = (1/(K - 1)) Σk (Qk - QMI)²,
where we recall that Vk is the usual complete-data estimate of V[Qk]. Then the variance of QMI is
VMI = W + (1 + 1/K) B.
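These combination rules are straightforward to code; a minimal sketch is given below (the estimates and variances in the example call are hypothetical).

```python
import numpy as np

def combine_rubin(estimates, variances):
    """Combine K complete-data estimates Qk and variances Vk using the rules above:
    returns QMI, the total variance VMI = W + (1 + 1/K)B, and the degrees of
    freedom nu = (K - 1)(1 + W / ((1 + 1/K)B))**2."""
    q = np.asarray(estimates, dtype=float)
    v = np.asarray(variances, dtype=float)
    K = len(q)
    q_mi = q.mean()                                  # QMI = (1/K) sum Qk
    W = v.mean()                                     # within-imputation variance
    B = q.var(ddof=1)                                # between-imputation variance
    var_mi = W + (1 + 1 / K) * B                     # total variance of QMI
    nu = (K - 1) * (1 + W / ((1 + 1 / K) * B)) ** 2  # degrees of freedom
    return q_mi, var_mi, nu

# Hypothetical estimates of one coefficient from K = 5 imputed data sets
q_mi, var_mi, nu = combine_rubin([1.02, 0.95, 1.10, 0.99, 1.05],
                                 [0.04, 0.05, 0.04, 0.05, 0.04])
print(q_mi, var_mi, nu)
```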
How do we draw YM| YO?
In the pictures above, we described a regression method for drawing YM| YO. This should work reasonably if
the data set is large, as it is then an approximation to a Bayesian rule.
This rule says that, if θ is the parameter describing the joint distribution of (YO, YM):
Posterior distn of (YM, θ) given YO  ∝  Joint distn of (YM, YO) given θ  ×  distn of θ.
We put an uninformative distribution on θ, and discard the values of θ drawn from the posterior, leaving a sample from YM | YO.
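A sketch of the regression-based draw described in the intuition section, using a large-sample approximation with an uninformative prior and made-up data, might look like this.

```python
import numpy as np

# Draw the missing values of Y2 given Y1: draw sigma^2, then the regression
# coefficients, then residual noise. Data are made up for illustration.
rng = np.random.default_rng(11)
n = 200
y1 = rng.normal(0, 1, size=n)
y2 = 1 + 0.5 * y1 + rng.normal(0, 1, size=n)
y2[rng.uniform(size=n) < 0.3] = np.nan               # some Y2 values are missing

obs = ~np.isnan(y2)
X = np.column_stack([np.ones(obs.sum()), y1[obs]])
beta_hat, rss, *_ = np.linalg.lstsq(X, y2[obs], rcond=None)
df_resid = obs.sum() - 2
sigma2_hat = rss[0] / df_resid

def draw_missing(rng):
    """One 'proper' imputation: draw sigma^2, then (intercept, slope), then noise."""
    sigma2 = sigma2_hat * df_resid / rng.chisquare(df_resid)
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(X.T @ X))
    x_mis = np.column_stack([np.ones((~obs).sum()), y1[~obs]])
    return x_mis @ beta + rng.normal(0, np.sqrt(sigma2), size=(~obs).sum())

imputations = [draw_missing(rng) for _ in range(5)]   # K = 5 sets of imputed Y2 values
```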
Frequently asked questions
- How many imputations?
  o With 50% missing information, an estimate based on 5 imputations has a standard deviation about 5% wider than one based on an infinite number of imputations.
- What if the data are not MAR?
  o Most software implementations assume MAR, but this is not necessary.
- Why not compute just one imputation?
  o A single imputation underestimates the variance, because the between-imputation variance B cannot then be estimated.
- What if I am interested in more than one parameter?
  o Imputation proceeds in the same way, as does finding the overall estimate of Q. However, estimating the covariance matrix can be tricky. Typically more imputations will be needed. See Schafer (2000) for a discussion.
Bibliography
Allison, P. D. (2000) Multiple imputation for missing data: a cautionary tale.
Sociological Methods and Research, 28, 301-309.
van Buuren, S., Boshuizen, H. C. and Knook, D. L. (1999) Multiple imputation of missing blood pressure covariates in survival analysis.
Statistics in Medicine, 18, 681-694.
Gelman, A. and Raghunathan, T. E. (2001) Using conditional distributions for missing-data imputation, in
discussion of `using conditional distributions for missing-data imputation' by Arnold et al.
Statistical Science, 3, 268-269.
Horton, N. J. and Lipsitz, S. R. (2001) Multiple imputation in practice: comparison of software packages for
regression models with missing variables.
The American Statistician, pp. 244-254.
Little, R. J. A. and Rubin, D. B. (2002) Statistical analysis with missing data (second edition).
Chichester: Wiley.
Raghunathan, T. E., Lepkowski, J. M., Van Hoewyk, J. and Solenberger, P. (2001) A multivariate technique
for multiply imputing missing values using a sequence of regression models.
Survey Methodology, 27, 85-95.
Royston, P. (2004) Multiple imputation of missing values.
The Stata Journal, 3, 227-241.
Rubin, D. (1996) Multiple imputation after 18 years.
Journal of the American Statistical Association, 91, 473-490.
Rubin, D. B. (1976) Inference and missing data.
Biometrika, 63, 581-592.
Schafer, J. L. (1997) Analysis of incomplete multivariate data.
London: Chapman and Hall.
Schafer, J. L. (1999) Multiple imputation: a primer.
Statistical Methods in Medical Research, 8, 3-15.
Taylor, J. M. G., Cooper, K. L., Wei, J. T., Sarma, R. V., Raghunathan, T. E. and Heeringa, S. G. (2002) Use of multiple imputation to correct for nonresponse bias in a survey of urologic symptoms among African-American men.
American Journal of Epidemiology, 156, 774-782.
Software
Software for drawing YM| YO.
We can use Markov Chain Monte Carlo (MCMC) methods to draw from this posterior distribution, and then
we discard the 's and use the YM's as described above.
This approach is implemented in MLwiN - see the software page on this website.
Other options include WinBUGS (see the example analyses page on this website) or PROC MI in SAS.
Note that drawing from YM| YO and then doing the analysis in WinBUGS can be unfeasibly slow even for
moderate data sets.
One alternative is to use 'chained equations' also known as 'regression switching' or 'sequential regression
imputation' (all variants of the same approach) (see the links page of this website).
Chained equations: some comments
Roughly, multiple imputation using chained equations proceeds as follows. (We say 'roughly', as
implementations vary):
1. To get started, for each variable in turn, fill in missing values with randomly chosen observed values.
2. The 'filled-in' values in the first variable are then discarded, leaving the original missing values. These missing values are imputed using regression imputation on all other variables.
3. The 'filled-in' values in the second variable are discarded. These missing values are then imputed using
'proper' regression imputation on all other variables.
4. This process is repeated for each variable in turn. Once each variable has been imputed using the
regression method we have completed one 'cycle'.
5. The process is continued for several cycles, typically 10.
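A rough sketch of this cycle, for a data set with two incomplete continuous variables, is given below. It is only a sketch: real implementations such as MICE also make 'proper' draws of the regression parameters, whereas here only residual noise is added to the predictions.

```python
import numpy as np
import pandas as pd

def chained_equations(df, cycles=10, rng=None):
    """Rough sketch of imputation by chained equations for continuous variables."""
    rng = rng if rng is not None else np.random.default_rng()
    data = df.copy()
    missing = {c: df[c].isna() for c in df.columns}

    # Step 1: fill each variable's missing values with randomly chosen observed values
    for c in df.columns:
        observed_values = df.loc[~missing[c], c].to_numpy()
        data.loc[missing[c], c] = rng.choice(observed_values, size=missing[c].sum())

    # Steps 2-5: cycle through the variables, re-imputing each from all the others
    for _ in range(cycles):
        for c in df.columns:
            others = [o for o in df.columns if o != c]
            mis = missing[c].to_numpy()
            X = np.column_stack([np.ones(len(data))] + [data[o].to_numpy() for o in others])
            y = data[c].to_numpy()
            beta, *_ = np.linalg.lstsq(X[~mis], y[~mis], rcond=None)
            resid_sd = (y[~mis] - X[~mis] @ beta).std()
            data.loc[missing[c], c] = X[mis] @ beta + rng.normal(0, resid_sd, size=mis.sum())
    return data

# Example with made-up data
rng = np.random.default_rng(5)
df = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
df["b"] = df["b"] + df["a"]
df.loc[rng.uniform(size=100) < 0.2, "a"] = np.nan
df.loc[rng.uniform(size=100) < 0.2, "b"] = np.nan
completed = chained_equations(df, rng=rng)
```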
Comments on chained equation method
This was first published by Raghunathan et al. (2001); see also the SAS implementation at
www.isr.urmich.edu/src/smp/ive
For a medical example see Taylor et al. (2002).
A Dutch group has developed related software; see van Buuren et al. (1999), and associated S+ software at
www.multiple-imputation.com.
This has been implemented in Stata; see Royston (2004), and www.stata.com/support.
All the implementations are slightly different!
Although MICE is an attractive approach, overcoming some of the issues with binary and ordinal data that are difficult for proper multiple imputation, the lack of a well-established theoretical basis means that even those who propose it suggest it be used cautiously.
To quote van Buuren and Oudshoorn (MICE):
'It is hard to establish convergence in the general case, but simulation studies suggest that the coverage
properties in some important practical cases are quite good.'
The problem is that you are in effect defining many conditional distributions, and this does not guarantee the
existence of a joint distribution. Further discussion is given by Raghunathan et al. (2001) (the original paper),
Gelman and Raghunathan (2001) and, briefly, in Little and Rubin (2002).
Note further that, as implemented in Stata, it is inappropriate for hierarchical data; generally if data are
hierarchical, so should the imputation be. See the article by Carpenter and Goldstein for the multilevel
modelling newsletter, downloadable from the preprints page on this site. More generally, we think the general
application of this approach to hierarchical data is problematic.
Summary and conclusions
- Untestable assumptions are unavoidable with missing data.
- Shun unprincipled methods.
- MI is most convenient under MAR.
  o To increase the chance that this is approximately true, we may wish to include several predictors of missingness that we do not want to adjust for in the final analysis.
- Multiple imputation is particularly useful for missing covariates, especially in:
  o survey settings where there is a separate imputer and analyst;
  o large and messy problems, where a full likelihood or Bayesian analysis is impractical.
- For models with missing responses, provided the covariates predictive of dropout are included, similar results are obtained to regression models (or mixed models, for longitudinal data).
  o In most missing outcome situations, it is preferable not to use multiple imputation, as it wastes information.
- Ideally, we should consider a form of sensitivity analysis, though this is often not straightforward.
  o Proper MI analyses are awkward under MNAR; it is necessary to make proper imputations from the posterior conditional on the missing value indicator.
  o Instead we can modify the imputation model to assess sensitivity, for example by using a postulated accept-reject mechanism on imputations.
- Often, serious thought is unavoidable!
Introduction
The aim of this document is to
- give an intuitive justification for Inverse Probability Weighting (IPW);
- look at a simple example;
- discuss methods to improve efficiency, and
- contrast with multiple imputation.
Idea behind inverse probability weighting
Suppose the full data are
Group:       A        B        C
Response:  1 1 1    2 2 2    3 3 3
The average response is 2. However, we observe:
Group:       A        B        C
Response:  1 ? ?    2 2 2    ? 3 3
From the observed data, the average response is 13/6, biased.
Notice the probability of response is 1/3 in group A, 1 in group B and 2/3 in group C.
Calculate the weighted average, where each observation is weighted by 1/{Probability of response}:
(3×1 + 1×2 + 1×2 + 1×2 + (3/2)×3 + (3/2)×3) / (3 + 1 + 1 + 1 + 3/2 + 3/2) = 18/9 = 2.
IPW has eliminated the bias in this case; more generally it will give estimators the property they 'home in' on
the truth as the sample size increases (i.e. they are consistent).
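The toy example can be checked directly in code; the responses, the missingness pattern and the response probabilities are those given above.

```python
import numpy as np

group    = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])
response = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=float)
observed = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1], dtype=bool)

print("full-data mean:    ", response.mean())                 # 2.0
print("observed-data mean:", response[observed].mean())       # 13/6, biased

# Probability of response within each group, and the corresponding IPW weights
p_response = {"A": 1 / 3, "B": 1.0, "C": 2 / 3}
w = np.array([1 / p_response[g] for g in group])

ipw_mean = np.sum(w[observed] * response[observed]) / np.sum(w[observed])
print("IPW mean:          ", ipw_mean)                        # 2.0
```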
A more mathematical view
Most estimators are the solution of an equation like
Σi U(xi, θ) = 0.
For example, if Ui(xi, θ) = (xi - θ), solving
Σi (xi - θ) = 0
gives the sample mean, Σi xi / n, as the estimate.
Theory says that if the average of Ui(Xi, θ) over samples from the population is zero, our estimate will 'home in' on the truth as the sample gets large (this is called consistency).
If some of the observations xi are unobserved, then the corresponding U's are missing from the above sum. Thus the average of the remaining Ui(xi, θ) is no longer zero, so estimates won't 'home in' on the truth.
However, now let
Ri = 1 if xi is observed and Ri = 0 otherwise, with πi = Pr(Ri = 1 | xi).
Then, the average (over repeated samples from the population) of the weighted sum in
Σi (Ri / πi) Ui(xi, θ) = 0
is zero, so the parameter estimates solving this equation will 'home in' on the truth as the sample size gets large.
In general, inverse probability weighting recovers consistent estimates when data are missing at random.