Definition and Diagnosis of Problematic
Attrition in Randomized Controlled
Experiments
Fernando Martel García∗
fmg229[at]nyu.edu
First draft October 1, 2012
Current draft April 25, 2013
Abstract
Attrition is the Achilles’ Heel of the randomized experiment: It is
fairly common, and it can completely unravel the benefits of randomization. Using the structural language of causal diagrams I demonstrate
that attrition is problematic for identification of the average treatment
effect (ATE) if – and only if – it is a common effect of the treatment and
the outcome (or a cause of the outcome other than the treatment). I
also demonstrate that whether the ATE is identified and estimable for
the full population of units in the experiment, or only for those units
with observed outcomes, depends on two d-separation conditions. One
of these is testable ex-post under standard experimental assumptions.
The other is testable ex-ante so long as adequate measurement protocols are adopted. Missing at Random (MAR) assumptions are neither
necessary nor sufficient for identification of the ATE.
∗ I would like to thank Matthew Blackwell, Felix Elwert, and Judea Pearl for careful comments and suggestions; participants in the Applied Statistics Workshop hosted by Harvard University’s Department of Government 2013; and in panel 38-13 of the Midwest Political Science Association Meeting 2013. All errors are my own. This manuscript relies on causal diagrams, truth tables, and sentential logic for proofs. These techniques are being implemented in an open source R package that will soon be made available in the Comprehensive R Archive Network (CRAN).
1 Introduction
Attrition, or missing data on the outcome of interest, is the “Achilles’ Heel
of the randomized experiment” (Shadish et al. 1998, 3). It can completely
unravel the benefits of randomization, and it is very common in social experiments, impact evaluations, clinical trials, and studies where outcomes are
measured using survey instruments, or with a long lag. But when is attrition a problem for causal inference in randomized controlled experiments
(RCEs)? And how can problematic forms of attrition be reliably diagnosed?
Here I show how existing answers to these questions are often ambiguous,
and at times misleading. I also define and demonstrate the necessary and
sufficient conditions for problematic attrition. And I conclude by proposing
diagnostic tests and measurement protocols to reliably diagnose problematic
attrition.
Attrition is said to be problematic whenever the average treatment effect
(ATE) is not identified non-parametrically under standard experimental assumptions. Problematic attrition arises if – and only if – attrition is a common effect of the treatment and the outcome (or a cause of the outcome
other than the treatment). Moreover, whether attrition is problematic, whether
the ATE is identified and estimable for all experimental units, or only for
those units with observed outcomes, depends on just two d-separation (i.e.
orthogonality) conditions. The first condition can be checked ex-post by testing whether the treatment causes attrition. By itself this test only allows a
partial diagnosis of problematic attrition. The second d-separation condition
can be checked ex-ante if researchers put in place a measurement protocol
for non-response follow up at endline (and baseline) data collection. These
results are summarized in a handy checklist researchers can use to improve
the process of causal inference in experiments subject to attrition.
Attrition can severely compromise the comparability of units across experimental arms; limit the representativeness of the observed sample with regards to the experimental population; and increase the variance of any estimated effects (Cook and Campbell 1979). Attrition is also very common. For
example, attrition rates of 20 to 40 percent appear to be the norm in social experiments (Ashenfelter and Plant 1990; Hausman and Wise 1979; Heckman
and Smith 1995; Krueger 1999; Newhouse et al. 2008). A field experiment in
political science studying the effect of media on political behaviour reported
69 percent attrition rates for survey data, and 23 percent for administrative
records (Gerber, Karlan and Bergan 2009, 41–43). And a survey of clinical
trials found that 63 of 71 trials (89 percent) reported missing outcome data,
with 13 of these trials reporting missing outcomes for more than 20 percent
of patients (Wood, White and Thompson 2004). Attrition tends to be higher
in longitudinal studies, and in field studies that rely on survey instruments.
This study uses the graphical language of causal diagrams (Pearl 1995; Greenland, Pearl and Robins 1999) to define and diagnose problematic attrition.
The choice of language is justified by a simple logic: (1) Whether attrition
is problematic depends on the underlying causal structure; (2) probability
statements cannot pin down this causal structure because causation implies
correlation but not vice versa; therefore (3) discussion of attrition using probability statements is necessarily ambiguous. By contrast, causal diagrams
encode and communicate causal knowledge directly, highlighting implicit assumptions, and providing justifications for seemingly tautological statements
like “Missing at Random” (Rubin 1976; Little and Rubin 1987). This powerful new language provides greater leverage for causal identification.1
2 Literature review
Because attrition is so common, and because it can undo the benefits of randomization, it has become a major source of concern. One strand of the
literature focuses on ex-ante remedies to minimize attrition, whereas a second much larger strand focuses on ex-post analytical techniques for dealing
with attrition (see Shadish (2002, 5-6) for an overview). Surprisingly, not
much has been written on the definition of problematic attrition, on the necessary and sufficient conditions that make attrition a problem, and
on ways to reliably diagnose problematic attrition.
Two problems bedevil the attrition literature. First, problematic attrition is
often discussed in terms of probability and correlations – the language of
symptoms –, as opposed to causal relations and diagrams – the language
of causes. This leads to ambiguity, as it is the causal model generating the
observed correlations, rather than the correlations themselves, that determines whether attrition is problematic. Because several causal models can be
compatible with any given covariance structure, the language of probability
cannot pin down the problem. Second, I could not find proof of the necessary and sufficient conditions for problematic attrition. Instead discussion of
attrition tends to fall back on assumptions about the nature of the missingness, and whether it is completely at random, at random, or not at random.

Footnote 1: Notation is more than semantics. For example, Arabic and Roman numerals both have direct translation yet the positional system of Arabic numerals makes them much better suited to arithmetic. Today high school kids are able to carry out arithmetic operations that ancient Romans could only perform with the aid of pebbles, finger counting, and abacists, or professionals skilled in using the abacus (Reynolds 2004). Not surprisingly the abacists resisted the introduction of Arabic numerals for centuries (Stone 1972).
Yet some of these assumptions are neither necessary nor sufficient for problematic attrition, and often they are proffered without adequate reasons, justifications, and diagnostic tests. This is not surprising. Judging attrition to
be problematic presupposes knowledge of the necessary and sufficient conditions for problematic attrition, and of the causal structure generating both
the outcome and the attrition, which only causal diagrams make explicit.
Using the language of probability Jurs and Glass (1971) purported to show
how problematic attrition depends on whether it is random within, and between, comparison groups. Though sensible, their language leaves room for
ambiguity. Consider their universal statement: “if the [attrition] is systematically related to the treatments, the inferences possible from the reduced
sample are not the same as those that could be drawn from the original sample.” (Jurs and Glass 1971, 63). Now let’s show an exception. Suppose the
only causes of attrition are the randomized treatment, and its interaction with
a covariate. Suppose also no causal relation connects that covariate to the
outcome of interest. And suppose treatment has an effect on the outcome.
This implies: (i) differential attrition; (ii) missingness that is systematically
related to a covariate; and (iii) unobserved outcomes that are correlated with
the attrition. No matter, as is shown below, the observed outcomes remain
comparable across groups, and representative of the full experimental population, so the ATE is identified and estimable.2
Footnote 2: Despite these problems Jurs and Glass’s (1971) method remains relatively popular. Kalichman et al. (2001) used it to analyse attrition in a randomized field trial studying the effect of a behavioural intervention on HIV transmission.
A review of a convenience sample of papers from leading experimental social scientists highlights the degree of confusion that surrounds problematic
attrition (see Appendix A). Often the diagnosis of problematic attrition, and
the procedures used to deal with it, are asserted ex nihilo, without reference to a specific body of literature. Problematic attrition appears to have
acquired the status of a folk theorem. For example, Cook and Campbell
(1979) advanced the now common belief that problematic attrition can be
diagnosed only when pre-test measures related to the outcome are available
(West, Biesanz and Pitts 2000, 52). Yet, as will be demonstrated below, a
key diagnostic for problematic attrition involves checking whether treatment
has an effect on attrition, which is testable in the absence of any covariates.
Though the way field researchers define, diagnose, and deal with attrition is often sensible, their approaches are stated with such ambiguity that it
is hard to judge the merits of what they are doing.
Researchers armed with causal diagrams have begun to clear the thicket of
ambiguity.3 In a seminal contribution Hernán, Hernández-Díaz and Robins
(2004) used causal diagrams to distinguish between selection bias and confounding. The former arises from conditioning on the common effects of the
exposure (or the causes of the exposure) and the outcome (or the causes
of the outcome), while confounding arises from common causes of both
treatment and the outcome. However, Hernán, Hernández-Díaz and Robins
(2004) do not provide a formal proof and, maybe because of this, theirs is
not a sufficient condition for selection bias.4 Geneletti, Richardson and Best
Footnote 3: An earlier structural approach had flourished in economics (Heckman 1979). However, this approach is highly parametric, exhibiting a mathematical complexity that is mostly unnecessary and prone to errors (Greene 1981).
Footnote 4: The proof that their formulation is insufficient for selection bias is simple. Consider the directed acyclic graph Y ← Z → R, where Y is the outcome, Z the treatment, and R is the attrition dummy, and suppose we condition on observed outcomes (r = 0). By selecting units with observed outcomes we are conditioning on a common effect of treatment (Z), and a cause of the outcome (Z), yet the ATE is identified, as shown below.
(2009) and Daniel et al. (2012) also discuss attrition using causal diagrams
but neither proves the necessary and sufficient conditions for identification
of the ATE. Bareinboim and Pearl (2012) show how the odds ratio remains
unbiased even when selection is caused by the outcome. More generally, Elwert and Winship (2012) provide a comprehensive discussion of selection bias
problems (see also Pearl (2012, 2013)). I build on these insights by proving
the necessary and sufficient conditions for identification of the ATE in RCEs
subject to attrition. And by showing how simple ex post tests, combined with
ex-ante measurement protocols, can be used to reliably diagnose problematic
attrition.
3 Introduction to causal diagrams

3.1 Definitions
Definition 1 (Graph). A graph is a collection G = 〈V, E〉 of nodes V =
{V1 , . . . , VN } and edges E = {E1 , . . . , E M } where the nodes correspond to
variables, and the edges denote the relation between pairs of variables.
Definition 2 (Directed Acyclic Graph, adapted from Pearl (2009, §1.2)). A directed acyclic graph (DAG) is a graph that only admits: (i) directed edges with one arrowhead (e.g. →); (ii) bi-directed edges with two arrowheads (e.g. ↔); and (iii) no directed cycles (e.g. X → Y → X), thereby ruling out mutual or self causation.
Intuitively X → Y reads “X causes Y”, and X ↔ Y reads “X and Y have unknown causes in common or are confounded”. A path in a DAG is any unbroken route traced along the edges of a graph – irrespective of how the arrows are pointing (e.g. X ↔ M → Y is a path between the ordered
pair of variables (X , Y )). Any two nodes are connected if there exists a path
between them, else they are disconnected. A directed path is a path composed
of directed edges where all edges point in the direction of the path (e.g.
X → M → Y ). We also say that X is a parent of Y , and Y is a child of X , if
X directly causes Y , as in X → Y . For example, the parents of Y in causal
diagram X → Y ← UY are denoted PA(Y ) = {X }, where, by convention, we
confine the set of parents of Y to the set of endogenous variables in V (i.e.
we do not include UY in the set PA(Y ) even though UY is a direct cause of Y ).
Similarly, we say that X is an ancestor of Y , and Y a descendant of X , if X is
a direct or indirect cause of Y . For example, X is both a direct and indirect
cause of Y if a DAG includes direct and mediated directed paths (e.g. X → Y ,
and X → Z → Y , respectively). We refer to non-terminal nodes in directed
paths as mediators (e.g. Z is a mediator in the path X → Z → Y ).
Definition 3 (Causal Diagram, adapted from Pearl (2009, 44, 203)). A causal
diagram or structure of a set of variables W is a DAG G = 〈{U, V}, E〉 with
the following properties:
1. Each node in {U, V} corresponds to one and only one distinct element
in W , and vice versa;
2. Each edge E ∈ E represents a direct functional relationship among the
corresponding pair of variables;
3. The set of nodes {U, V} is partitioned into two sets:
(a) U = {U1 , U2 , . . . , UN } is a set of background variables determined
only by factors outside the causal structure;
(b) V = {V1 , V2 , . . . , VN } is a set of endogenous variables determined by
variables in the causal structure – that is, variables U ∪ V; and
4. None of the variables in U have causes in common.
A DAG of a set of variables W only qualifies as a causal diagram if it includes
all common causes of the variables in W .5 Consequently, a causal diagram
replaces all bi-directed edges with latent variables. As a result it only admits
three types of associations between any pair of variables (X , Y ) in W : (1) X
causes Y or vice versa, as depicted by a directed edge (e.g. X → Y ); (2) X and
Y share (latent) causes in common (e.g. X ← Z → Y ); or (3) they have an
effect in common or collider that for some reason is being held constant. For
example, in causal diagram X → C ← Y , selecting units into a study on the
basis of C = c induces a non-causal association between X and Y . Hernán,
Hernández-Díaz and Robins (2004) show how selection on a collider can
result in selection bias.
Finally, every causal diagram has a corresponding causal model, defined as
follows:
Definition 4 (Causal Model, adapted from Pearl (2009, 203)). A causal model
M replaces the set of edges E in a causal diagram G by a set of functions
F = {f1, f2, . . . , fN}, one for each element of V, such that M = 〈U, V, F〉. In
turn, each function f i is a mapping from (the respective domains of) Ui ∪ PAi
to Vi, where Ui ⊆ U and PAi ⊆ V\Vi and the entire set F forms a mapping from U to V. In other words, each fi in

vi = fi(pai, ui),    i = 1, 2, . . . , N,

assigns a value to Vi that depends on (the values of) a select set of variables in V ∪ U, and the entire set F has a unique solution V(u).

Footnote 5: If two variables in W have a cause UZ in common (e.g. UY ← UZ → UX) but UZ ∉ W, then DAG G is not a causal diagram. To make it one, UZ should be included in U and UY and UX included in the set of endogenous variables V. Formally, property 4 ensures a DAG meets the Causal Markov Condition (see Pearl (2009b, 19, 30) for details).
Causal models are completely non-parametric. For example, function y =
f y (x, z, u y ) is compatible with any well defined mathematical expression in
its arguments like y = α + β1 x + β2 z + u y , or y = α + β1 x + β2 z + β3 xz + u y .
Causal models are also completely deterministic. Probability comes into the
picture through our ignorance of background conditions U, which we summarize using a probability distribution P(u). Because endogenous variables
are a function of these background conditions, P(u) induces a probability
distribution P(v) over all endogenous variables in V.6
Definition 5 (Probabilistic Causal Model, Pearl (2009, 205)). A probabilistic
causal model Γ is a pair 〈M, P(u)〉, where M is a causal model and P(u) is a
probability function defined over the domain of U.
By the definition of a causal diagram (Definition 3), probabilistic causal models have the property that ui ⊥ uj ∀i ≠ j, ui, uj ∈ u. Figure 1 provides an
illustration of a probabilistic causal model and its corresponding causal diagram.
Definition 6 (Causal Effect, adapted from Pearl (2009, 70)). Given a probabilistic causal model Γ and two disjoint sets of variables, X and Y, the causal effect of X on Y, denoted P(y|do(x)), is a function from X to the space of probability distributions on Y. For each realization x of X, P(y|do(x)) gives the probability of Y = y induced by deleting from Γ the equations fxi, ∀xi ∈ X, and replacing them with xi = xi′.

Footnote 6: This is exactly the same as characterizing disturbance terms in a regression context using some distribution, like ε ∼ N(0, σ²).

[Figure 1 appears here. Left panel: the probabilistic causal model Γ, with structural equations
X = fX(W, UX);  Z = fZ(X, UZ);  W = fW(UW);  Y = fY(Z, W, UY);
UX ∼ PX(UX);  UZ ∼ PZ(UZ);  UY ∼ PY(UY);  UW ∼ PW(UW).
Right panel: the corresponding causal diagram G(Γ).]
Figure 1: On the left panel the probabilistic causal model Γ (Definition 5), and on the right the causal diagram G(Γ) representing the causal structure of Γ (Definition 3). X is the cause under investigation, Z is a mediator, Y is the outcome, and W a covariate. Variables Ui, i ∈ {X, Z, Y, W} are unobserved background causes, which we summarize using probability distributions Pi(Ui). These are independently distributed by construction, as all causes of two or more variables are included in the model, maybe as latent variables. White nodes denote observed variables, and grey nodes unobserved ones.
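As a concrete illustration of Definitions 4 and 5, the short R sketch below simulates a probabilistic causal model with the structure of Figure 1. The linear functional forms and normal distributions are illustrative assumptions only; the causal diagram itself is compatible with any functional form.

    set.seed(1)
    n <- 10000
    # Background variables U, drawn independently (Definition 3, property 4)
    u_w <- rnorm(n); u_x <- rnorm(n); u_z <- rnorm(n); u_y <- rnorm(n)
    # Structural functions F (Definition 4); linear forms chosen only for illustration
    w <- u_w                       # W = f_W(U_W)
    x <- 0.5 * w + u_x             # X = f_X(W, U_X)
    z <- -0.3 * x + u_z            # Z = f_Z(X, U_Z)
    y <- 1.0 * z + 0.7 * w + u_y   # Y = f_Y(Z, W, U_Y)
    # P(u) induces a joint distribution over the endogenous variables V = {X, Z, W, Y}
    round(cor(cbind(w, x, z, y)), 2)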
Causal diagrams have a number of advantages for causal identification. First,
they provide an accessible and intuitive language for encoding, in a public
manner, a researchers’ private knowledge of the causal structure governing a
set of variables. Second, causal diagrams capture all the causal associations
and conditional independencies necessary to identify causal effects (Definition 6).7 This information is enough to infer, justify, and test identification
strategies, including which covariates to control for, what assumptions are
Footnote 7: The focus is on identification of P(y|do(x)), from which other quantities of interest like the ATE can be calculated. This is somewhat restrictive: at times P(y|do(x)) is not identified yet some quantity, like the odds ratio, is (Bareinboim and Pearl 2012).
plausible, which are testable, and what variables are good instruments (Pearl
2009a). Third, causal diagrams capture all of the information needed for
identification purposes, making precise mathematical specification of a probabilistic causal model redundant.8 Fourth, the simplicity of causal diagrams
means identification conditions can be derived algorithmically. A researcher
can feed a causal diagram into a computer, ask whether the effect of X on
Y is identified, and get an automated reply, including the set of conditioning
variables that need to be controlled for.9
The goal of causal identification is to assess the necessary and sufficient conditions that would allow us to express P( y|do(x)) in terms of the observable
distribution P(O), where O ⊆ V is a subset of observed endogenous variables V in probabilistic model Γ (Definition 5).10 Two sufficient conditions
for causal identification are the front-door and back-door criteria (see Annex
B). Both make use of the graphical concept of d-separation.
Definition 7 (d-Separation, Pearl (2009, pp 16-17)). A path p is said to be
d-separated (or blocked) by a set of nodes S iff:
1. p contains a chain i → m → j or a fork i ← m → j such that the middle
node m is in S, or
2. p contains an inverted fork (or collider) i → m ← j such that the middle
node m is not in S and such that no descendant of m is in S.
A set S is said to d-separate X from Y iff S blocks every path from a node in
X to a node in Y .
Footnote 8: When non-parametric identification is impossible, parametric assumptions can be made. However, parametric identification is not accorded as much credibility.
Footnote 9: See the package graph (Gentleman et al. 2012) for R (R Core Team 2013).
Footnote 10: The topic of identification is a nuanced one (see Pearl (2009b, §3) for details).
Following Pearl (2009b, p 18) let (X ⊥ Y |S)P capture the probabilistic notion of conditional independence and (X ⊥ Y |S)G the graphical notion of
d-separation in Definition 7. Their connection is established by the following
theorem:
Theorem 1 (Probabilistic implications of d-Separation, Pearl (2009, p 18)).
For any three disjoint subsets of nodes (X , Y, S) in DAG G and for all probability
functions P, we have:
1. (X ⊥ Y |S)G ⇒ (X ⊥ Y |S)P whenever G and P are compatible;11 and
2. if (X ⊥ Y |S)P holds in all distributions compatible with G, it follows that
(X ⊥ Y |S)G .
Theorem 1 allows us to determine whether two variables, or sets of variables,
are independent on the basis of d-separation alone. These independencies
are useful for identification purposes.
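For readers who want to check d-separation claims mechanically, the following sketch uses the dagitty R package (an assumption of this example; it is not the in-progress package mentioned in the acknowledgements) to evaluate Definition 7 on the collider structure X → C ← Y discussed above.

    library(dagitty)                          # assumed available from CRAN
    g <- dagitty("dag { X -> C ; Y -> C }")   # C is a collider on the path X -> C <- Y
    dseparated(g, "X", "Y")                   # TRUE: the empty set d-separates X and Y
    dseparated(g, "X", "Y", "C")              # FALSE: conditioning on the collider C opens the path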
4 Examples and intuition
In randomized controlled experiments attrition is a problem of selection bias
and estimation. This is not obvious from the econometrics literature, where
identification of the ATE hinges on unconfoundedness and overlap assumptions (Imbens and Wooldridge 2009, 26). Formally:
Footnote 11: By compatibility Pearl means that the probability distribution admits the factorization P(x1, . . . , xn) = ∏i P(xi|pai), where pai are the parents of xi, that is the nodes with directed edges pointing into xi. For example, X ← U → Y factorizes as P(X, U, Y) = P(U) P(X|U) P(Y|U).
Unconfoundedness: (Yz′(u), Yz″(u) ⊥ Z(u)|X(u)) or, in graphical terms, (Y ⊥ Z|X)G under the null of no effect; and
Overlap: 0 < pr(Z(u) = 1|X(u) = x) < 1, ∀x,
where Yz′(u) is the hypothetical outcome for unit u under intervention do(Z = z′) (Pearl 2009b, 204), Z is a treatment, and X a set of covariates including the empty set. Randomized controlled experiments ensure that unconfoundedness and overlap hold in large samples, and that (Yz′(u), Yz″(u) ⊥ Z(u)). Yet the relevant question in experiments with attrition is whether (Yz′(u), Yz″(u) ⊥ Z(u)|R(u) = 0).
To understand when and why conditioning on units with observed outcomes
may be problematic we need to unpack the unconfoundedness assumption.
Consider Hernán, Hernández-Díaz and Robins’s (2004) structural decomposition of the unconfoundedness assumption:
No confounding The treatment and the outcome have no (uncontrolled)
causes in common; and
No selection bias Common effects of treatment (or the causes of treatment)
and the outcome (or the causes of the outcome other than the treatment) are not controlled for.12
From this decomposition it is evident that the assumption of no selection bias
will fail whenever attrition is a descendant of both treatment and the outcome
(or a cause of the outcome other than the treatment), and we condition on
attrition R. More generally, this is true even if the outcome is fully observed
Footnote 12: This is not their exact definition, which is insufficient without the additional statement “other than the treatment”.
and R is any effect of the outcome. However, even if the ATE for the full
experimental group is identified conditional on attrition, we may not be able
to estimate it once we condition on observed outcomes only. This is a subtle
distinction. In experiments with attrition outcomes are only observed when
r = 0, which means that for estimation purposes we are conditioning on
r = 0, not R. As a result we may only be able to estimate the ATE for the
subset of units with observed outcomes.
For example, Figure 2a illustrates a situation where the ATE is identified but
all we can estimate is the ATE for those units with observed outcomes, which
may differ from the ATE for the full experimental population. To simplify
the discussion, and focus on attrition, I assume a specific parametric form for
Y → O ← R:
O = Y if R = 0;  O = Missing if R = 1.    (1)
Equation 1 ensures O ≡ Y after selection on observables (e.g. after dropping
all units with outcomes labelled missing). It also rules out measurement
error, or the possibility of Z having an effect on measurement O but not the
latent outcome Y (e.g. it rules out the edge Z → O).
Second, by virtue of randomization there is no backdoor path in Figure 2a connecting the ordered pair (Z, Y), so the set S = {∅} d-separates them. By Theorem 5, Annex B, the causal effect of Z on Y is therefore identifiable, and given by the formula:

P(y|do(z)) = P(y|z).    (2)
Third, in practice the ATE cannot be estimated using Equation 2 because y
is not observed whenever r = 1 (i.e. o is labelled missing). For estimation
purposes we need to condition on R such that:
P( y|do(z)) = P( y|z, r = 0) P(r = 0) + P( y|z, r = 1) P(r = 1)
(3)
Conditioning on R does not result in selection bias because R is not a descendant of Z. By Equation 1 we can compute the first product term in Equation
3, as P( y|z, r = 0) ≡ P(o|z, r = 0). However, the second product term cannot
be calculated because P(o|z, r = 1) is labelled as missing. Consequently all
we can estimate is E[y|z′, r = 0] − E[y|z″, r = 0] for any two distinct realizations z′ and z″ of Z, or the effect of treatment on the subset of units with
observed outcomes.13
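A small simulation may help fix ideas. The data generating process below follows the structure of Figure 2a: Z is randomized, an unobserved covariate X causes both Y and R, and Z does not cause R. The functional forms and parameter values are illustrative assumptions only.

    set.seed(42)
    n <- 100000
    z <- rbinom(n, 1, 0.5)             # randomized treatment
    x <- rbinom(n, 1, 0.5)             # unobserved covariate: X -> Y and X -> R
    y <- z * (1 + 2 * x) + rnorm(n)    # latent outcome; true ATE = 1 + 2*0.5 = 2
    r <- rbinom(n, 1, 0.1 + 0.6 * x)   # attrition depends on X only, not on Z
    mean(y[z == 1]) - mean(y[z == 0])                # ~2.0: the ATE, were Y fully observed
    obs <- r == 0
    mean(y[z == 1 & obs]) - mean(y[z == 0 & obs])    # ~1.5: ATE for units with observed outcomes

The two quantities differ because units with observed outcomes over-represent low-X units, which is exactly the distinction between the ATE and the ATE for units with observed outcomes drawn above.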
Now consider Figure 2b. Here R is a descendant of Z. No matter, to estimate
the ATE we also need to condition on Z, and conditioning on Z d-separates
attrition R from the latent outcome Y . This implies (R ⊥ Y |Z) P and also
P(Y |Z, R) = P(Y |Z). The latter allows unbiased estimation of the ATE for
the full population of units in the experiment, even after conditioning on
r = 0. By substitution:
P( y|do(z)) = P( y|z, r = 0) P(r = 0) + P( y|z, r = 1) P(r = 1),
= P( y|z) P(r = 0) + P( y|z) P(r = 1),
= P( y|z).
Footnote 13: If X were observed in Figure 2a, P(x, r = 0) > 0 ∀x, and we were willing to assume X is the only ancestor in common between Y and R, then we could recover the ATE using inverse probability weights. The reason is that X d-separates Y from R such that (Y ⊥ R|X)P (Theorem 1). But note this requires making assumptions not warranted by the experimental design.
[Figure 2 appears here, showing six causal diagrams of attrition mechanisms, panels (a)–(f): (a) ATE identified, not estimable; (b) ATE identified and estimable; (c) ATE not identified, R a collider; (d) ATE not identified, R a collider; (e) ATE identified and estimable; (f) ATE not identified, R a virtual collider.]
Figure 2: Attrition mechanisms. Z is the treatment, X a covariate, Y is the latent outcome observed by experimental units but not the researcher, and O is the publicly observed outcome. Attrition R is a binary variable indicating whether Y is visible to the researcher (R = 0) or not (R = 1), such that O = Y if R = 0 and O = missing otherwise. The Ui are independently distributed background causes. White nodes denote observed variables, and light gray nodes unobserved ones. Square nodes are variables held constant (e.g. r = 0). Directed edges (→) denote causation, and undirected edges (—) correlations induced by colliders.
Because P(o|z, r = 0) ≡ P(y|z, r = 0) ≡ P(y|z) the ATE for the full experimental population can be estimated as E[Y|do(z′), r = 0] − E[Y|do(z″), r = 0] for any two distinct realizations z′ and z″ of Z (so long as P(z, r = 0) > 0 ∀z). This example also shows how unbiased estimation of the ATE is possible despite differential attrition directly affected by treatment.
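The same point can be checked by simulation under the structure of Figure 2b, where attrition is caused by the treatment alone and a covariate X affects only the outcome; all functional forms and parameters below are illustrative assumptions.

    set.seed(7)
    n <- 100000
    z <- rbinom(n, 1, 0.5)                    # randomized treatment
    x <- rbinom(n, 1, 0.5)                    # covariate: causes the outcome but not attrition
    y <- 2 * z + x + rnorm(n)                 # latent outcome; true ATE = 2
    r <- rbinom(n, 1, plogis(-2 + 1.5 * z))   # differential attrition caused by Z only
    tapply(r, z, mean)                               # attrition rates differ across arms
    obs <- r == 0
    mean(y[z == 1 & obs]) - mean(y[z == 0 & obs])    # ~2: still recovers the full-population ATE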
Now consider Figures 2c and 2d. In both examples the ATE is unconditionally
identified yet conditioning on R for estimation purposes activates a collider
in a path between the treatment and the outcome. This generates a spurious
dependence between the treatment and the latent outcome under the null of
no effect, so the ATE is not identified. In Figure 2d conditioning on R induces
the spurious association Z — Y , while in Figure 2c it activates spurious associations along paths Z — X → Y and Z — UR — C → Y . This is no surprise:
in both causal mechanisms conditioning on R invalidates the no selection bias
assumption.
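By contrast, a companion simulation assuming the structure of Figure 2d (attrition caused by both the treatment and the latent outcome) shows how conditioning on observed outcomes biases the naive comparison even though Z is randomized. The parameters are again arbitrary.

    set.seed(11)
    n <- 100000
    z <- rbinom(n, 1, 0.5)                        # randomized treatment
    y <- z + rnorm(n)                             # latent outcome; true ATE = 1
    r <- rbinom(n, 1, plogis(-1 + z + y))         # attrition caused by Z and Y (Figure 2d)
    obs <- r == 0
    mean(y[z == 1 & obs]) - mean(y[z == 0 & obs]) # biased: no longer recovers the true ATE of 1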
Finally consider Figures 2e and 2f. In Figure 2e the ATE is identified and
estimable conditional on observed outcomes (r = 0). This is true even if
there is differential attrition (treatment causes attrition), and imbalance in
the observed covariate (X interacts with Z). Figure 2f involves a virtual collider (Pearl 2009b, 339–400), or controlling for a descendant of the outcome
of interest. In this case conditioning on observed outcomes r = 0 induces a
non-causal association between UR and Y . In addition, because R is a child
of Y and hence a proxy for Y , we are also limiting the variation of Y itself.
In turn induces a non-causal association between UY and Z. In fact, conditioning on a virtual collider creates an association between all parents of
each of the ancestors of the virtual collider. Moreover, these non-causal associations induce other associations, like the association between Z and UR in
Z → Y — UR . Finally, Figure 2f has the property that the effect is identified
whenever Z is without effect but not otherwise.14
5 Necessary and sufficient conditions for problematic attrition
To formalize the necessary and sufficient conditions for problematic attrition
I limit the discussion to the class of Simple Attrition DAGs defined below. This
allows me to focus the discussion on attrition under the standard assumptions
of randomization, excludability, and non-interference.
Definition 8 (Simple Attrition DAGs (SADAGs)). A simple attrition DAG is
one where: (i) observed outcomes O are determined by Equation 1 (an exclusion restriction); (ii) latent outcomes Y for any unit i are independent of
treatment assigned to any other unit j, j 6= i (non-interference); (iii) a single treatment Z taking two or more values is applied in a randomized fashion
(no arrows can point into Z); and (iv) attrition R does not cause Y .
The definition of SADAGs restricts attention to attrition in experimental settings separate from errors in variables, interference, or effects mediated by
R. Clearly if the only effect of Z on Y is via R, as in Z → R → Y , the ATE
cannot be estimated: the effect R → Y is not identified because when r = 1
all outcomes o are labelled missing. These restrictions notwithstanding, the
intellectual apparatus presented here can be easily generalized to situations
where only some or none of these assumptions hold.

Footnote 14: The front-door criterion (Annex B) is invalid in the face of virtual colliders. Intuitively, the last step between the presumed mediator and the outcome is structurally the same as the direct step Z → Y in Figure 2f, which is confounded by a spurious correlation with UY.
Next I introduce virtual colliders, which play an important role in what follows:
Theorem 2 (On the implications of conditioning on the effect of a cause). For any directed path U1 → V1 → . . . → Vi → Vi+1 → . . . → VN−1 → VN in a causal diagram, and for any variable Vi, N > i ≥ 1, in that path, there exists at least one value of Vi+n, N − i ≥ n ≥ 1, such that conditioning on Vi+n = v*i+n generates a non-causal association between Ui+1 and Vi, which we denote by connecting these two nodes with an undirected edge (e.g. Ui+1 — Vi). (Where Ui+1 (not shown) captures other causes of Vi+1 beyond Vi (e.g. as in Vi → Vi+1 ← Ui+1).)
Proof. First consider any i, N > i ≥ 1, and n = 1. This is the case of a simple collider, where we control for Vi+1 in causal diagram Vi → Vi+1 ← Ui+1. Now consider the equivalent causal model vi+1 = fVi+1(Vi, Ui+1). By contradiction, suppose (Vi ⊥ Ui+1 | vi+1)G ∀vi ∈ Vi. Because fVi+1 is a single-valued function, this implies that Vi+1 does not change with Vi conditional on any value ui+1 in the domain of Ui+1, which contradicts the assumption that Vi is a cause of Vi+1 →←. There exists v*i+1 such that ¬(Vi ⊥ Ui+1 | v*i+1)G.
Second, consider whether the claim is true for any n, N − i ≥ n ≥ 1. By repeated substitution we can write the reduced-form causal model vi+n = fVi+n(Vi+1, Ui+n). By contradiction suppose that for each possible value ui+n in the domain of Ui+n, vi+n = fVi+n(Vi+1, ui+n) for all values of Vi+1. This implies that Vi+n does not change with Vi+1 conditional on any value ui+n in the domain of Ui+n; which contradicts the assumption that Vi+1 is a cause of Vi+n →←. Therefore there exists at least one pair (u*i+n, v*i+n) such that a change in Vi+1 breaks the equality v*i+n = fVi+n(Vi+1, u*i+n); that is, generates a value v**i+n ≠ v*i+n. Put differently, the pair (u*i+n, v*i+n) partitions the domain of Vi+1 into at least two sets, one whose elements satisfy the equality, and one whose elements do not.
Third, because these induced subsets of Vi+1 are a partition of its domain, it must be the case that v*i+1, from the first step above, is in one of the sets, and each set maps into a unique value of Vi+n for given ui+n. Therefore ∃vi+n ∈ Vi+n such that conditioning on Vi+n = v*i+n selects on v*i+1 with some positive probability. By the first step above this implies ¬(Vi ⊥ Ui+1 | v*i+n).
Corollary 1. In a causal diagram G , if Y is a mediator between Z and R then
there does not exist a set of variables S 0 that meets the front-door criterion for
the effect of Z on Y conditional on R.
Proof. We know from Theorem 2 that controlling for a descendant of Y will
induce a non-causal association between all the parents of Y . As a result the
last step in the front-door criterion is bound to fail, since the mechanism is
no longer isolated.
Definition 9 (Problematic attrition). For any SADAG defined by the collection G = 〈V, E〉, attrition is said to be problematic whenever the ATE is not
identifiable non-parametrically conditional on r = 0 and Z.
The emphasis on non-parametric identification, and on conditioning limited
to r = 0 and Z, stems from the nature of the experimental context, where
researchers typically want to avoid making any assumptions beyond those
warranted by the experimental design.
Theorem 3 (Necessary and sufficient causal conditions for problematic attrition). In any SADAG defined by the collection G = 〈V, E〉, attrition will be
problematic (e.g. the ATE will not be identifiable non-parametrically conditional
on r = 0 and Z) if and only if:
1. Z is an ancestor of R; and
2. Y is an ancestor of R; or
3. There exists X in V\Z such that X is an ancestor of both Y and R.
Proof. (⇒): For any SADAG G , suppose conditions 1–3 are all true, and, by
contradiction, the ATE is identified non-parametrically conditional on r = 0
and Z only. Then either Y is an ancestor of R, or there exists X in V\Z such that
X is an ancestor of both Y and R.
Case 1: Y is an ancestor of R. Then if Z is an ancestor of R there are two
further possibilities: (a) Y is a mediator of the effect of Z on R (e.g. Z →
. . . → Y → . . . → R); or (b) Y is not a mediator of the effect of Z on R (e.g. Y
is not in the previous chain). Case 1a: Y is a mediator of the effect of Z on R.
Then by Theorem 2 and Corollary 1 the ATE is not identified →←. Case 1b:
Y is not a mediator of the effect of Z on R. Then the ordered pair (Y, R) is
connected by a directed path, and similarly for the ordered pair (Z, R). These
two directed paths meet at R, so R is the only collider in a path between
Z and Y along these two directed paths. By the definition of d-separation
(Definition 7) conditioning on r = 0 induces an association between Z and
Y which, by Theorem 1, implies that Z and Y are dependent under the null
of no effect. This implies that the ATE is not identified →←.
Case 2: There exists X in V\Z such that X is an ancestor of both Y and R. Then by
the definition of a SADAG (and assumption R is not an ancestor of Y ) either
one of two causal structures must be true: (a) Y is a mediator of the effect
of X on R (e.g. X → . . . → Y → . . . → R); or (b) Y is not a mediator of the
effect of X on R (e.g. Y ← . . . ← X → . . . → R). Case 2a: Y is a mediator
of the effect of X on R. Then Y is an ancestor of R, which by the previous
results (Case 1) implies that the ATE is not identified →←. Case 2b: Y is
not a mediator of the effect of X on R. Then by the definition of an ancestor,
X is connected to Y and R through directed paths which, together with Z
being an ancestor of R, implies R is the only collider in the combination of
the two paths connecting Z to Y via R and X . By the definition of d-separation
(Definition 7) conditioning on r = 0 induces an association between Z and
Y , which by Theorem 1 implies that Z and Y are dependent under the null
of no effect. This implies that the ATE is not identified →←.
Under the null of no effect, and by the definition of a SADAG, Z and Y can
only be associated by controlling for a (virtual) collider. Because R can only
be a (virtual) collider whenever Z is an ancestor of R, and Y is an ancestor of
R; or Z is an ancestor of R and there exists X in V\Z such that X is an ancestor of
both Y and R, these two cases cover all the possibilities. Consequently, when
condition 1 is true, and either of conditions 2 or 3 is also true, then the ATE
is not identified. And because G is any arbitrary SADAG, this is true of all
SADAGs.
(⇐): For any SADAG G , suppose condition 1 is false or both conditions 2
and 3 are false, and, by contradiction, the ATE is not identified. Then either
Z is not an ancestor of R; or Y is not an ancestor of R and there does not exist
X in V\Z such that X is an ancestor of both Y and R.
Case 1: Z is not an ancestor of R. Under the null of no effect, and by the definition of a SADAG, Z and Y can only be associated by controlling for a (virtual)
collider, which implies a directed path from Z to R →←.
Case 2: Y is not an ancestor of R and there does not exist X in V\Z such
that X is an ancestor of both Y and R. Then there can be no path from Y
to R other than through Z (e.g. Y ← . . . ← Z → . . . → R), in which case
(R ⊥ Y |Z)G . By the same logic as in Case 1, the ATE is identified →←.
Under the null of no effect, and by the definition of a SADAG, Z and Y can
only be associated by controlling for a (virtual) collider. Because R can only
be a (virtual) collider whenever Z is an ancestor of R, and Y is an ancestor of
R; or Z is an ancestor of R and there exists X in V\Z such that X is an ancestor of
both Y and R, these two cases cover all the possibilities. Consequently, when
condition 1 is not true, or both conditions 2 and 3 are false, the ATE is identified. And because G is any arbitrary SADAG, this is true of all SADAGs.
To say that the ATE is identified does not imply it is estimable, as shown by
the next theorem:
Theorem 4 (Estimation of ATE when ATE is identified conditional on r=0
and Z). In any SADAG defined by the collection G = 〈V, E〉, and assuming the
ATE is identified conditional on r = 0 and Z, the ATE is estimable iff there does
not exist X in V\Z such that Y ← . . . ← X → . . . → R. Otherwise, given
that the ATE is identified, all we can estimate is the ATE for units with observed
outcomes not labelled as missing (ATE R=0).
Proof. (⇒): For any SADAG G, suppose there does not exist X in V\Z
such that Y ← . . . ← X → . . . → R, the ATE is identified, and, by contradiction,
the ATE is not estimable conditional on only r = 0 and z. Then either Z is
not an ancestor of R, or Y is not an ancestor of R.
Case 1: Z is not an ancestor of R. Then the only way Y and R can be related
is if Y is an ancestor of R. Now, if Z is not an ancestor of Y (e.g. the ATE = 0)
then the ATE is identified, and it is 0. If Z is an ancestor of Y , then by Y an
ancestor of R, it implies Z is an ancestor of R, but then the ATE would not be
identified →←.
Case 2: Y is not an ancestor of R. Then the only way Y and R can be related,
given that there does not exist X in V\Z such that Y ← . . . ← X → . . . → R, is if
Z is an ancestor in common. This can happen if Z is an ancestor of Y , and Y
is an ancestor of R →←; or if Z is an ancestor of both Y and R, and Y is not
an ancestor of R, in which case Z is the only ancestor they have in common
and so (Y ⊥ R|Z)G . By Theorem 1 this implies (Y ⊥ R|Z) P which, given that
the ATE is identified, implies P(y|do(z)) = P(y|z) = P(y|z, r = 0).
Assuming the ATE is identified, and that there does not exist X in V\Z such that Y ←
. . . ← X → . . . → R, implies that Z is not an ancestor of R, or Y is not an ancestor
of R. These two cases cover all the possibilities. Consequently, when the ATE
is identified, and there does not exist X in V\Z such that Y ← . . . ← X →
. . . → R, the ATE is estimable. And because G is any arbitrary SADAG, this is
true of all SADAGs.
(⇐): For any SADAG G, suppose the ATE is not identified, or there exists X in V\Z
such that X is an ancestor of both Y and R, and, by contradiction, the ATE is
estimable.
Case 1: ATE is not identified. If the ATE is not identified then it is not estimable, since the former is a necessary condition for estimation →←.
Case 2: There exists X in V\Z such that X is an ancestor of both Y and R. Then this
implies ¬(Y ⊥ R) s.t. P(Y|Z, R) ≠ P(Y|Z). Because outcome data are only
observed when r = 0, all we can estimate is P(y|z, r = 0) or ATE r=0 →←.
Assuming the ATE is not identified, or that there exists X in V\Z such that Y ← . . . ←
X → . . . → R implies that the ATE is not identified, or, if it is, Y and R have
a common ancestor in X . These two cases cover all the possibilities. Consequently, when the ATE is not identified, or there exists X in V\Z such that
Y ← . . . ← X → . . . → R, the ATE is not estimable. And because G is any arbitrary SADAG, this is true of all SADAGs.

                          (Y ⊥ R|Z)G true                (Y ⊥ R|Z)G false
(Z ⊥ R)G true             ATE identified, estimable      ATE identified, ATE r=0 estimable
(Z ⊥ R)G false            ATE identified, estimable      ATE not identified, not estimable

Table 1: This table summarizes the main implications of Definition 9, and Theorems 3 and 4. For any SADAG G the table shows how identification and estimation of the ATE depend on just two d-separation conditions, (Z ⊥ R)G and (Y ⊥ R|Z)G. These can be checked algorithmically for any G.
6 Diagnosis
The implications of Definition 9 and Theorems 3 and 4 are summarized in
Table 1. In particular, problematic attrition (Definition 9) refers to situations
where both d-separation conditions, (Y ⊥ R|Z)G and (Z ⊥ R)G, are false,
as in the bottom right cell of Table 1.
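Table 1 can be read off mechanically. The helper below is a sketch with hypothetical argument names: it takes the two d-separation conditions as logical inputs and returns the corresponding cell of Table 1.

    # z_dsep_r:         is (Z ⊥ R)G true in the hypothesized SADAG?
    # y_dsep_r_given_z: is (Y ⊥ R|Z)G true?
    classify_attrition <- function(z_dsep_r, y_dsep_r_given_z) {
      if (y_dsep_r_given_z) {
        "ATE identified and estimable"
      } else if (z_dsep_r) {
        "ATE identified; only the ATE for units with observed outcomes (r = 0) is estimable"
      } else {
        "ATE not identified: problematic attrition (Definition 9)"
      }
    }
    classify_attrition(z_dsep_r = FALSE, y_dsep_r_given_z = FALSE)  # bottom-right cell of Table 1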
Faced with attrition, a key task for field researchers is inferring where in Table 1 their application belongs. In principle this can be done by performing
two tests. First, we can test the null that treatment has no effect on attrition,
against the alternative that it causes attrition. If the null is rejected we conclude that our application likely falls in the second row of Table 1, otherwise
we conclude that the evidence is not strong enough to reject the null of no
effect. This test is unproblematic. Attrition can be regarded as any other
outcome of interest, as is the case whenever attrition is caused by death.15
Second, we can test the null that the outcome and attrition are uncorrelated
conditional on the treatment, against the alternative that they are correlated.
If the null is rejected we conclude that our application likely falls in the
second column of Table 1, otherwise we conclude that there is not enough
evidence to reject the null. This test is also unproblematic in the sense that
all we are testing for is the presence of correlations, not causation. Finally, if
both nulls are rejected we conclude that our application is a case of problematic attrition (bottom right cell). In this case the ATE is not identified without
additional assumptions.
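A minimal sketch of the first test in R, using simulated stand-in data (the column names z and r are hypothetical):

    set.seed(3)
    d <- data.frame(z = rbinom(500, 1, 0.5))      # assigned arm
    d$r <- rbinom(500, 1, 0.15 + 0.10 * d$z)       # 1 if the outcome is missing
    # Test 1 (ex post): does treatment cause attrition? H0: equal attrition rates across arms.
    with(d, table(z, r))
    chisq.test(with(d, table(z, r)))               # or fisher.test() for small cells
    # Equivalent regression formulation, which extends to multi-valued treatments
    summary(glm(r ~ factor(z), family = binomial, data = d))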
In practice, testing whether attrition and the outcome are correlated conditional on the treatment is not possible, as the outcome is labelled missing
whenever r = 1. Ex post, after the data collection is complete, researchers
can only test whether treatment causes attrition. Even so, the resulting diagnosis is still useful even if incomplete. Specifically, if attrition is not an effect
of treatment then there is less reason to believe that the observed attrition
is problematic. Yet uncertainty remains as to whether the estimated ATE is
relevant for all units in the experiment, or only for the subset of units with observed outcomes. However ex ante, before data collection begins, researchers
can prepare for attrition by having a sampling plan for non-response follow
up at endline (and baseline) surveys (Hansen and Hurwitz 1946; Deming
1953). These data can be used to test whether outcomes and attrition are
correlated, and make inferences about the attrition process. However, at this
stage a more promising approach is simply to use inverse probability weights
or multiple imputation to avoid having to condition on observed outcomes
in the first place. This is justified whenever the follow up survey involves a random sample of the units with missing outcomes in the first survey wave.

Footnote 15: There is one implicit assumption, that of stability; see Annex C. This is a general assumption about inferred causation, and is not specific to causal diagrams.
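As a sketch of how such a protocol could be analyzed, suppose a random half of first-wave non-respondents are followed up and their outcomes recovered; all variable names and parameters below are hypothetical. The recovered outcomes permit the second test (are Y and R correlated given Z?) and support inverse probability weighting without conditioning only on first-wave respondents.

    set.seed(5)
    n <- 2000
    z <- rbinom(n, 1, 0.5)
    y <- z + rnorm(n)                                   # latent outcome; true ATE = 1
    r <- rbinom(n, 1, plogis(-1.5 + 0.8 * z + 0.6 * y)) # first-wave attrition
    followed <- r == 1 & runif(n) < 0.5                 # random 50% follow-up of non-respondents
    observed <- r == 0 | followed                       # outcomes in hand after the follow-up wave
    # Test 2: are outcome and attrition correlated conditional on treatment?
    summary(lm(y[observed] ~ r[observed] + z[observed]))
    # Inverse probability weights: respondents count once, followed-up non-respondents count twice
    w <- ifelse(r == 0, 1, 2)
    sel1 <- observed & z == 1; sel0 <- observed & z == 0
    weighted.mean(y[sel1], w[sel1]) - weighted.mean(y[sel0], w[sel0])  # ~1: full-population ATE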
7 Discussion

7.1 Relation to Missing at Random assumptions
Experiments subject to attrition are often analyzed and discussed under the
missingness framework of Rubin (1976) and Little and Rubin (1987). However, this approach provides neither necessary nor sufficient conditions for identification of the ATE in experiments subject to attrition.16
First, consider the assumption that data are Missing at Random (MAR), or
P(R|Vmis, Vobs; β) = P(R|Vobs; β), where R is the attrition indicator, V is the
set of observed and missing endogenous variables in a causal model, and β
a vector of parameters of the missingness model (Schafer 1997, 11). MAR
requires that attrition be independent of all missing variables in the model
conditional on the observed ones. This is unnecessary. As can be seen from
Table 1, all we need to identify the ATE is for attrition (R) to be independent of the latent outcome (Y ) conditional on the treatment (Z) (Gerber and
Green (2012, 219, fn 11)). For example, suppose X were not observed in
Figure 2e, then P(R|X, Z) ≠ P(R|Z), yet the ATE is identified.
Second, consider a slight modification of Figure 2f, where we insert a mediator M between Y and R, such that Y → M → R. In this case we have
Footnote 16: To be clear, what follows is not a general criticism of Little and Rubin’s (1987) approach to missing data but of its “routine” application to experiments subject to attrition. Their approach remains incredibly useful and powerful in many other settings.
P(R|Z, M , Y ; β) = P(R|M ; β), yet the ATE is not non-parametrically identified
because R is a virtual collider. At this point one could use inverse probability
weights and multiple imputation, yet investigators ought to be aware that
by invoking MAR they are also invoking a set of parametric assumptions not
justified by the experimental design.
Third, MAR is a description of a symptom (e.g. R ⊥ Vmis|Vobs), but not of
the underlying causal structure generating those symptoms. This can lead to
ambiguity since several causal structures can be compatible with the exact
same symptoms. For example, (Y ⊥ R|X ) may be true if (i) X is a mediator
between Y and R; or (ii) X is a common cause of both Y and R; or (iii) X
is a mediator between Y and S, S an effect of both Y and R that is being
held constant. The point is that problematic attrition is a function of the
underlying causal structure, yet this causal structure remains hidden by the
probability language (see Pearl 2009b, §1.5, 7.4.4).
7.2 On the role of additional identification assumptions
Thus far I have defined and diagnosed problematic attrition strictly within
the confines of experimental assumptions (see Definition 8). Even so, it is
easy to re-state the results in Table 1 to include additional conditions for
identification. For example, we could write one of the requirements as a
conjunction – (Y ⊥ R|Z, S)G and (Y ⊥ Z|S)G – where S is some observed
covariate, and the second requirement ensures that the conditioning strategy
is harmless (e.g. we are not conditioning on a collider between the outcome
and the treatment). For example, in Figure 2d attrition is a descendant of
both the treatment and the outcome, and in Figure 2c it is a descendant of the
treatment and an ancestor of the outcome other than the treatment. Hence,
the ATE will remain unidentified unless we can find an identification strategy
that blocks, or circumvents, the spurious association between Z and Y .17
Such identification strategies may include measuring and conditioning on X ;
unpacking the link Z → Y in search of a front-door identification strategy (see
Annex B); imputing the missing outcomes to avoid conditioning on R (e.g.
Rubin 1976, 2004); doing so implicitly using inverse probability weights (e.g.
Robins, Rotnitzky and Zhao 1995); or imputing bounds (e.g. Horowitz and
Manski 2000).
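As one concrete example of the last strategy, the sketch below computes worst-case bounds on the ATE for a binary outcome by filling in the unobserved arm-specific means with the logical extremes 0 and 1. The simulated data are stand-ins, and the bounding logic is only the simple worst-case version of the approach cited above.

    set.seed(9)
    n <- 5000
    z <- rbinom(n, 1, 0.5)
    y <- rbinom(n, 1, 0.3 + 0.2 * z)                   # binary latent outcome; true ATE = 0.2
    r <- rbinom(n, 1, plogis(-1 + 0.5 * z + y))        # problematic attrition (Z and Y cause R)
    bound_mean <- function(y, r, lo = 0, hi = 1) {
      p_obs <- mean(r == 0)
      m_obs <- mean(y[r == 0])
      c(lower = m_obs * p_obs + lo * (1 - p_obs),
        upper = m_obs * p_obs + hi * (1 - p_obs))
    }
    b1 <- bound_mean(y[z == 1], r[z == 1])             # bounds on E[Y|do(z = 1)]
    b0 <- bound_mean(y[z == 0], r[z == 0])             # bounds on E[Y|do(z = 0)]
    c(ATE_lower = unname(b1["lower"] - b0["upper"]),
      ATE_upper = unname(b1["upper"] - b0["lower"]))   # wide but valid bounds covering the true ATE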
All the aforementioned identification strategies have one thing in common:
They invoke additional assumptions beyond the standard experimental assumptions of random assignment, excludability, and non-interference (Gerber and Green 2012, 45). Once again, causal diagrams remain ideally suited
for communicating these additional assumptions. For example, Figure 3a
might be a possible representation of a researcher’s prior causal knowledge
about the process generating the outcome and the attrition in a specific application. In particular, that knowledge assumes that M is a lonesome mediator
between the treatment and the outcome which would justify identification
on the basis of the front-door criterion (see Annex B).18 However, because X
remains unobserved, we can only estimate the ATE for units with observed
outcomes, or r = 0. In Figure 3b the ATE is not identified on the basis of prior knowledge, as both W and R are virtual colliders.

Footnote 17: A complete proof of problematic attrition which allows for non-parametric identification strategies (including conditioning on covariates and front-door conditioning) is easy to derive, and is available from the author. It simply complicates the presentation without adding new insights.
Footnote 18: This knowledge also has testable implications. For example, it implies an effect of Z on M, and (M ⊥ R|Z)P. Indeed, causal diagrams can be used to infer, justify, and test identification strategies including which covariates to control for, what assumptions are plausible, which are testable, and what variables are good instruments (Pearl 2009b).

[Figure 3 appears here, showing two causal diagrams of attrition mechanisms: (a) ATE identified (front-door criterion), not estimable; (b) ATE not identified, R and W virtual colliders.]
Figure 3: Attrition mechanisms. Z is the treatment, X a covariate, Y is the latent outcome observed by experimental units but not the researcher, and O is the publicly observed outcome. Attrition R is a binary variable indicating whether Y is visible to the researcher (R = 0) or not (R = 1), such that O = Y if R = 0 and O = missing otherwise. The Ui are independently distributed background causes. White nodes denote observed variables, and light gray nodes unobserved ones. Square nodes are variables held constant (e.g. r = 0). Directed edges (→) denote causation, and undirected edges (—) correlations induced by colliders.
In general researchers ought to consider potential attrition mechanisms before
randomization, and write these down using causal diagrams. This knowledge
can then be used to inform what measures should be taken, the design of the
measurement protocol, diagnostic tests, and conditioning strategies for dealing with attrition. Such conditioning strategies will be all the more credible
if the hypothesized causal diagram is included in a study protocol posted to
a public registry prior to randomization.
7.3 On the distinction between identification and estimation
The notion of causal identification remains a subject of contention in the
social sciences (see Pearl 2009b, §2, 3.2.4). Here I have distinguished between the conditions under which the ATE is identified, and the conditions
under which it is estimable. This distinction arises naturally from notions of
d-separation and identification criteria like the back-door criterion. For example in Figure 2a the ATE is identified conditional on R because (Z ⊥ Y |R)G ,
and so P( y|z, r = 0) only picks up the effect of Z on Y unpolluted by other
spurious associations. However, we cannot pick up all of the effect because
P( y|z, r = 1) is not observed. And because ¬(Y ⊥ R|Z) we cannot be sure
that the non-parametric estimate for the units with observed outcomes is exactly the same as the estimate from all units had all outcomes been observed.
All we know is that the estimator is consistent for the effect on units with observed outcomes.
Identification and estimation are fundamentally distinct. Identification only
requires qualitative knowledge of the process generating the data, as captured in a causal diagram. By contrast, estimation may condition on additional variables to reduce variance, or make additional parametric assumptions to model the data, and so on.
7.4 On causal diagrams, proofs, and truth tables
A causal model captures qualitative knowledge about causal relations between variables. In particular, a variable is an ancestor or descendant of another variable, or it isn’t. This binary distinction in causal relations amongst
variables simplifies non-parametric identification strategies.19 In particular,
researchers can rely on truth tables to make causal queries (see Annex C
for an example). Indeed, Theorems 3 and 4 were derived with the aid of
a truth table. To help researchers make logical causal queries of their own,
this study is accompanied by an R (R Core Team 2013) package for eliciting
causal diagrams and performing sentential logic on causal queries.20
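As a flavour of the truth-table approach, the base-R sketch below (not the in-progress package) enumerates the antecedent conditions of Theorem 3 and evaluates, for each combination, whether attrition is problematic.

    # Antecedents: is Z an ancestor of R? is Y an ancestor of R?
    # is there an X (other than Z) that is an ancestor of both Y and R?
    tt <- expand.grid(z_anc_r = c(TRUE, FALSE),
                      y_anc_r = c(TRUE, FALSE),
                      x_common_anc = c(TRUE, FALSE))
    # Theorem 3: attrition is problematic iff Z is an ancestor of R and
    # (Y is an ancestor of R or some X is a common ancestor of Y and R).
    tt$problematic <- tt$z_anc_r & (tt$y_anc_r | tt$x_common_anc)
    tt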
8 Conclusion
Attrition has been labelled the Achilles’ Heel of the randomized experiment:
it is very common, and it can completely unravel the benefits of randomization. This study demonstrates how problematic attrition depends on the
underlying causal structure, and for this reason it recommends that attrition
always be discussed in terms of causes and effects. The language of probability is not up to the task, for the simple reason that causation implies correlation
but not vice versa. As a result researchers’ discussion of attrition is often inherently ambiguous (see Appendix A). By contrast the structural language
of causal diagrams is ideally suited to discussing causal relations. To wit,
it simplifies the discussion considerably by doing away with parametric and
functional form assumptions.
Footnote 19: Obviously, if non-parametric identification is not possible, other parametric assumptions may be invoked (e.g. Heckman 1979). However, parametric identification strategies are typically not accorded as much credibility.
Footnote 20: In progress.

We say attrition is problematic whenever the ATE cannot be identified non-parametrically conditional on observed outcomes and the assigned treatment. Attrition is problematic if – and only if – it is a descendant of both
the treatment and the outcome (or some ancestor of the outcome other than
the treatment). Ex-post, after the intervention is closed to further data collection, problematic attrition can be partially diagnosed by formally testing
whether treatment has a causal effect on attrition. Ex-ante, two-stage sampling can be used at endline (and baseline) to further diagnose the nature of the attrition. More generally, it is recommended that researchers encode all their causal knowledge about attrition in a causal diagram prior to randomization, that they use this knowledge to inform the study design, and
that they register it in a study protocol posted in a public registry. These results are summarized in a handy checklist researchers can use to improve the
process of causal inference in experiments subject to attrition (see Annex D).
This study has focused on defining and diagnosing problematic attrition. Much
less attention has been paid to dealing with problematic attrition once it is
diagnosed. Even so, the framework presented here provides the basic intellectual infrastructure upon which more complex identification strategies can
be built.
References
Ashenfelter, Orley and Mark W. Plant. 1990. “Nonparametric Estimates of the
Labor-Supply Effects of Negative Income Tax Programs.” Journal of Labor
Economics 8(1):S396–S415.
Bareinboim, E. and Judea Pearl. 2012. Controlling selection bias in causal
inference. In JMLR Proceedings of the Fifteenth International Conference on
Artificial Intelligence and Statistics (AISTATS).
Cook, T.D. and D.T. Campbell. 1979. Quasi-experimentation: design & analysis
issues for field settings. Rand McNally College.
Daniel, Rhian M, Michael G Kenward, Simon N Cousens and Bianca L
De Stavola. 2012. “Using causal diagrams to guide analysis in missing
data problems.” Statistical Methods in Medical Research 21(3):243–256.
Deming, W. Edwards. 1953. “On a Probability Mechanism to Attain an Economic Balance Between the Resultant Error of Response and the Bias of
Nonresponse.” Journal of the American Statistical Association 48(264):743–
772.
Duflo, Esther, Pascaline Dupas and Michael Kremer. 2011. “Peer Effects,
Teacher Incentives, and the Impact of Tracking: Evidence from a Randomized Evaluation in Kenya.” The American Economic Review 101(5):1739–
1774.
Duflo, Esther, Rachel Glennerster and Michael Kremer. 2007. Chapter 61 Using Randomization in Development Economics Research: A Toolkit. Vol. 4
of Handbook of Development Economics Elsevier pp. 3895 – 3962.
Elwert, Felix and Christopher Winship. 2012. Endogenous Selection Bias.
Technical report University of Wisconsin–Madison.
Geneletti, Sara, Sylvia Richardson and Nicky Best. 2009. “Adjusting for selection bias in retrospective, case–control studies.” Biostatistics 10(1):17–31.
Gentleman, R., Elizabeth Whalen, W. Huber and S. Falcon. 2012. graph: A
package to handle graph data structures. R package version 1.34.0.
Gerber, A. and D.P. Green. 2012. Field Experiments: Design, Analysis, and
Interpretation. W. W. Norton.
Gerber, Alan S., Dean Karlan and Daniel Bergan. 2009. “Does the Media
Matter? A Field Experiment Measuring the Effect of Newspapers on Voting Behavior and Political Opinions.” American Economic Journal: Applied
Economics 1(2):35–52.
Gerber, Alan S., Gregory A. Huber and Ebonya Washington. 2010. “Party Affiliation, Partisanship, and Political Beliefs: A Field Experiment.” American
Political Science Review 104:720–744.
Greene, William H. 1981. “Sample Selection Bias as a Specification Error: A
Comment.” Econometrica 49(3):795–798.
Greenland, Sander, Judea Pearl and James M. Robins. 1999. “Causal Diagrams for Epidemiologic Research.” Epidemiology 10(1):37–48.
Hansen, Morris H. and William N. Hurwitz. 1946. “The Problem of Non-Response in Sample Surveys.” Journal of the American Statistical Association 41(236):517–529.
Hausman, Jerry A. and David A. Wise. 1979. “Attrition Bias in Experimental
and Panel Data: The Gary Income Maintenance Experiment.” Econometrica
47(2):455–473.
Heckman, James J. 1979. “Sample Selection Bias as a Specification Error.”
Econometrica 47(1):153–161.
Heckman, James J. and Jeffrey A. Smith. 1995. “Assessing the Case for Social
Experiments.” The Journal of Economic Perspectives 9(2):85–110.
Heckman, James J., Robert J. Lalonde and Jeffrey A. Smith. 1999. The economics and econometrics of active labor market programs. In Handbook of
Labor Economics, ed. Orley C. Ashenfelter and David Card. Vol. 3,
Part 1 Elsevier pp. 1865–2097.
Hernán, Miguel A., Sonia Hernández-Díaz and James M. Robins. 2004. “A
Structural Approach to Selection Bias.” Epidemiology 15(5):615–625.
Horowitz, Joel L. and Charles F. Manski. 2000. “Nonparametric Analysis
of Randomized Experiments with Missing Covariate and Outcome Data.”
Journal of the American Statistical Association 95(449):77–84.
Imbens, G.W. and J.M. Wooldridge. 2009. “Recent Developments in the
Econometrics of Program Evaluation.” Journal of Economic Literature
47(1):5–86.
Jurs, Stephen G. and Gene V. Glass. 1971. “The Effect of Experimental Mortality on the Internal and External Validity of the Randomized Comparative
Experiment.” The Journal of Experimental Education 40(1):62–66.
Kalichman, Seth C, David Rompa, Marjorie Cage, Kari DiFonzo, Dolores
Simpson, James Austin, Webster Luke, Jeff Buckles, Florence Kyomugisha,
Eric Benotsch, Steven Pinkerton and Jeff Graham. 2001. “Effectiveness of
an intervention to reduce HIV transmission risks in HIV-positive people.”
American Journal of Preventive Medicine 21(2):84 – 92.
Krueger, Alan B. 1999. “Experimental Estimates of Education Production
Functions.” The Quarterly Journal of Economics 114(2):497–532.
Little, Roderick JA and Donald B Rubin. 1987. Statistical analysis with missing
data. Vol. 539 Wiley New York.
Martel García, Fernando. 2012. Scientific Progress in the Absence of New
Data: A Procedural Replication of Ross (2006). Technical report. Available
at SSRN: http://ssrn.com/abstract=2256286.
Morgan, Stephen L. and Christopher Winship. 2007. Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge University Press.
Newhouse, Joseph P., Robert H. Brook, Naihua Duan, Emmett B. Keeler, Arleen Leibowitz, Willard G. Manning, M. Susan Marquis, Carl N. Morris,
Charles E. Phelps and John E. Rolph. 2008. “Attrition in the RAND Health
Insurance Experiment: A Response to Nyman.” Journal of Health Politics,
Policy and Law 33(2):295–308.
Pearl, J. 1995.
“Causal diagrams for empirical research.” Biometrika
82(4):669–688.
Pearl, J. 2009a. “Causal inference in statistics: An overview.” Statistics Surveys 3:96–146.
Pearl, Judea. 2009b. Causality: Models, Reasoning, and Inference. Cambridge
University Press.
Pearl, Judea. 2012. A solution to a class of selection-bias problems. Technical report.
Pearl, Judea. 2013. Linear Models: A Useful “Microscope” for Causal Analysis. Technical report.
R Core Team. 2013. R: A Language and Environment for Statistical Computing.
Vienna, Austria: R Foundation for Statistical Computing.
URL: http://www.R-project.org/
Reynolds, Barbara E. 2004. The Algorists vs. the Abacists: An Ancient Controversy on the Use of Calculators. In Sherlock Holmes in Babylon: And
Other Tales of Mathematical History, ed. Marlow Anderson, Victor Katz and
Robin Wilson. MAA pp. 148–152.
Robins, James M., Andrea Rotnitzky and Lue Ping Zhao. 1995. “Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data.” Journal of the American Statistical Association 90(429):106–121.
Rubin, D.B. 2004. Multiple Imputation for Nonresponse in Surveys. Wiley
Classics Library John Wiley & Sons.
Rubin, Donald B. 1976. “Inference and missing data.” Biometrika 63(3):581–
592.
Schafer, JL. 1997. Analysis of incomplete multivariate data. Chapman &
Hall/CRC.
Shadish, William R. 2002. “Revisiting Field Experimentation: Field Notes for
the Future.” Psychological Methods 7(1):3–18.
Shadish, W.R., X. Hu, R.R. Glaser, R. Kownacki and S. Wong. 1998. “A method
for exploring the effects of attrition in randomized experiments with dichotomous outcomes.” Psychological Methods 3(1):3.
Stone, Williard E. 1972. “Abacists versus Algorists.” Journal of Accounting
Research 10(2):345–350.
West, S.G., J.C. Biesanz and S.C. Pitts. 2000. Causal inference and generalization in field settings: Experimental and quasi-experimental designs. In
Handbook of research methods in social and personality psychology. Cambridge University Press, Cambridge, UK chapter 3, pp. 40–84.
Wood, Angela M, Ian R White and Simon G Thompson. 2004. “Are missing outcome data adequately handled? A review of published randomized
controlled trials in major medical journals.” Clinical Trials 1(4):368–376.
A Field researchers’ explanations of when attrition is a problem and diagnostic tests
Without pretending to be systematic, the following are common rationales for why attrition is a problem. They are quoted not to suggest they are inappropriate in the context in which they are stated, but to illustrate the variety of ways in which attrition is deemed to be a problem. This study tries to unify and simplify these criteria, disambiguate the definition of problematic attrition, and provide a diagnostic test consistent with the definition.
A.1 When or why attrition is a problem:
• “If the probability of attrition is correlated with experimental response,
then traditional statistical techniques will lead to biased and inconsistent estimates of the experimental effect.” (Hausman and Wise 1979,
456)
• “While the attrition in these surveys has typically not been as severe
as in social experiments, the same problems of potential bias arises, if
attrition is not random”(Hausman and Wise 1979, 456)
• “If there is attrition based on unobserved variables that are correlated
with the outcome measures but not predicted by the observables, our
results may be biased.” (Gerber, Karlan and Bergan 2009, 43)
• “Sample attrition poses a problem for experimental evaluations when it
is correlated with individual characteristics or with the impact of treatment conditional on characteristics. In practice, persons with poorer labor market characteristics tend to have higher attrition rates (see, e.g.,
Brown, 1979). Even if attrition affects both experimental and control
groups in the same way, the experiment estimates the mean impact of
the program only for those who remain in the sample. Usually, attrition
rates are both non-random and larger for controls than for treatments.
In this case, the experimental estimate of training is biased because individuals’ experimental status, R, is correlated with their likelihood of
being in the sample. In this setting, experimental evaluations become
non-experimental evaluations because evaluators must make some assumption to deal with selection bias.” (Heckman, Lalonde and Smith
1999, x).
• “attrition may be a serious problem if it is not random” (Ashenfelter and
Plant 1990, S402); “If it is not random, then the reliance on the assumption of random assignment is unjustified, and, in fact, nonparametric estimation is impossible. When there is attrition, some assumption must
be made regarding the behavior of the families who left the experiment
and how it differs from those remaining.”(Ashenfelter and Plant 1990,
S408); “The fact of attrition necessitates some sort of “parametric” assumption about the sample.”(Ashenfelter and Plant 1990, S410)
• “In the JTPA evaluation, the parameter of interest was the mean impact of JTPA training on those receiving it. Given this parameter of
interest, random assignment should be placed so as to minimize attrition from the program within the treatment group, because the experimental mean-difference estimate corresponds exactly to the impact
of training on the trained only in the case where there is no attrition.”(Heckman and Smith 1995, 102); “In the presence of this level of
attrition, the only way to obtain a credible estimate of the mean impact
of the program is to model the attrition process using the very nonexperimental methods eschewed by proponents of randomized social
experiments. ”(Heckman and Smith 1995, 103)
• “At heart, adjusting for possible nonrandom attrition is a matter of imputing test scores for students who exited the sample. With longitudinal
data, this can be done crudely by assigning the student’s most recent
test percentile to that student in years when the student was absent
from the sample.” (Krueger 1999, 515).
• “Random attrition will only reduce a study’s statistical power; however,
attrition that is correlated with the treatment being evaluated may bias
estimates. [. . . ] Even if attrition rates are similar in treatment and
comparison groups, it remains possible that the attritors were selected
differently in the treatment and comparison groups.” (Duflo, Glennerster and Kremer 2007, 3943).
• “Whenever substantial attrition occurs this is cause for major concern
because there is a possibility that the attrition is nonrandom and may
lead to bias.” (Gerber, Huber and Washington 2010, 727).
A.2 Diagnostics and tests for analyzing attrition
In terms of diagnostics for determining whether attrition is a problem:
• “The key question is whether the experimental treatments are changing
the rate of attrition of [units in the experimental group]”(Ashenfelter
and Plant 1990, S408)
• “There was no difference between tracking and nontracking schools in
overall attrition rates. The characteristics of those who attrited are similar across groups, except that girls in tracking schools were less likely
to attrit in the endline test [. . . ] The proportion of attritors and their
characteristics do not differ between the two treatment arms”(Duflo,
Dupas and Kremer 2011, 1752)
• “A first step in the analysis of an evaluation must always be to report
attrition levels in the treatment and comparison groups and to compare attritors with non-attritors using baseline data (when available) to
see if they differ systematically, at least along observable dimensions.
If attrition remains a problem, statistical techniques are available to
identify and adjust for the bias. These techniques can be parametric
(see Hausman and Wise, 1979; Wooldridge, 2002 or Grasdal, 2001)
or nonparametric. We will focus on non-parametric techniques here
because parametric methods are more well known. Moreover, nonparametric sample correction methods are interesting for randomized
evaluation, because they do not require the functional form and distribution assumptions characteristic of parametric approaches. Important
studies discussing non-parametric bounds include Manski (1989) and
Lee (2002)” (Duflo, Glennerster and Kremer 2007, 3944).
• “To test for differential attrition across conditions, a 2 attrition (lost
vs retained) × 2 condition (risk-reduction skills intervention vs comparison) contingency table, chi-square test was performed; 44 of 187
(24%) participants in the experimental condition and 33 of the 145
(23%) comparison-intervention participants were lost at follow-up, a
nonsignificant difference. We also conducted attrition analyses for differences on baseline measures using 2 (attrition) × 2 (condition) analyses of variance, where: (1) an attrition effect signals differences
between participants lost and retained, (2) an intervention effect indicates a breakdown in randomization, and (3) an attrition x condition
interaction indicates differential loss between conditions. Results indicated a main effect for attrition on participant age: participants lost to
follow-up were younger (M = 37.9) than those retained (M = 40.7).
No other differences between lost and retained participants were significant, there were no significant differences between intervention conditions on baseline variables, and the attrition by intervention-group
interactions were not significant.” (Kalichman et al. 2001, 87).
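For reference, the differential-attrition test described in the last quotation reduces to a two-by-two chi-square test. A minimal R sketch using the counts reported in that quotation (44 of 187 lost in the experimental condition, 33 of 145 in the comparison condition) would be:

## Hypothetical reconstruction of the 2 x 2 attrition-by-condition table
tab <- matrix(c(44, 187 - 44, 33, 145 - 33), nrow = 2, byrow = TRUE,
              dimnames = list(condition = c("experimental", "comparison"),
                              status = c("lost", "retained")))
chisq.test(tab)  ## tests whether attrition rates differ across conditions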
B Front- and Back-door criteria for identification
B.1 Back-door criterion
The back-door criterion tells us what variables need to be conditioned on to
ensure a set of causes and its effects of interest are independently distributed
under the null of no effect. This corresponds to the econometric notion of
unconfoundedness (Imbens and Wooldridge 2009, 26).
Definition 10 (Back-door criterion, Pearl (2009b, p 79)). A set of variables
S satisfies the back-door criterion relative to an ordered pair of variables
(X i , X j ) in a DAG if:
1. no node in S is a descendant of X i ; and
2. S d-separates (or blocks) every path between X i and X j that contains an arrow into X i .
Similarly, if X and Y are two disjoint subsets of nodes in G, then S is said
to satisfy the back-door criterion relative to (X , Y ) if it satisfies the criterion
relative to any pair (X i , X j ) s.t. X i ∈ X and X j ∈ Y .
The back-door criterion essentially restricts d-separation to common causes
of X and Y . For example, suppose X → M → Y and X ← U → Y . The set
S = {M , U} d-separates X and Y but only the set S′ = {U} meets the back-door
criterion. Intuitively conditioning on M is not useful if our goal is to identify
the total (mediated) effect of X on Y , whereas conditioning on U blocks a
confounder. Formally:
Theorem 5 (Back-door adjustment, Pearl (2009b, p 79)). If a set of variables
S satisfies the back-door criterion relative to (X , Y ), then the causal effect of X
on Y is identifiable and is given by the formula:
P(y | do(x)) = ∑_s P(y | x, s) P(s).
As noted in the previous section, once we are given P(y | do(x)), we can compute the ATE for any two distinct realizations x′ and x″ of X as E[Y | do(x′)] − E[Y | do(x″)] (Pearl 2009b, p 70).
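As a rough numerical sketch of Theorem 5 (the data-generating process and variable names below are illustrative assumptions, not taken from the paper), the adjustment formula can be evaluated directly on simulated data with a single binary back-door variable s:

set.seed(1)
n <- 1e5
s <- rbinom(n, 1, 0.4)                 ## back-door variable (confounder)
x <- rbinom(n, 1, plogis(-0.5 + s))    ## treatment depends on s
y <- rbinom(n, 1, plogis(-1 + x + s))  ## outcome depends on x and s

## Back-door adjustment: P(y = 1 | do(x)) = sum_s P(y = 1 | x, s) P(s)
p_do <- function(xval) {
  sum(sapply(c(0, 1), function(sv) {
    mean(y[x == xval & s == sv]) * mean(s == sv)
  }))
}
p_do(1) - p_do(0)                  ## adjusted estimate of the ATE
mean(y[x == 1]) - mean(y[x == 0])  ## naive contrast, confounded by s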
B.2 Front-door criterion
The front-door criterion is another sufficient strategy for causal identification.
Unlike the back-door criterion, that focuses on common causes of the cause
and the effect, the front-door criterion focuses on the mechanism linking the
cause to the effect, applying the back-door criterion at each step along the
causal path (see Pearl (2009b); Morgan and Winship (2007)).
Definition 11 (Front-door criterion, Pearl (2009b, p 82)). A set of variables Z is
said to satisfy the front-door criterion relative to an ordered pair of variables
(X , Y ) if:
1. Z intercepts all directed paths from X to Y ; and
2. there is no unblocked back-door path from X to Z; and
3. all back-door paths from Z to Y are blocked by X .
Mechanisms that meet the front-door criterion are isolated (by properties 2
and 3) and exhaustive (by property 1, see Morgan and Winship (2007, §8)
for details).
Theorem 6 (Front-door adjustment, Pearl (2009b, p 83)). If Z satisfies the front-door criterion relative to (X , Y ) and if P(x, z) > 0, then the causal effect of X on Y is identifiable and is given by the formula:
P(y | do(x)) = ∑_z P(z | x) ∑_{x′} P(y | x′, z) P(x′).
Both front- and back-door identification criteria are illustrated in Figure 1. In
that figure variable W confounds the effect of X on Y : it induces an association between these two variables even under the null of no effect (e.g. X and
Y remain associated (or d-connected) via backdoor path X ← W → Y even
if we delete edge X → Z or Z → Y ). Conditioning on W removes this association as set S = {W } meets the back-door criterion (Definition 10). A second
identification approach is to compute the effect of X on Z, and of Z on Y for
each level of X , as set S = {Z} meets the front-door criterion (Definition 11).
This is especially useful whenever W is not observed.
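A similar sketch (again with simulated data and names chosen purely for illustration) evaluates the front-door formula of Theorem 6 for the structure in Figure 1; note that the confounder w is generated but never used by the estimator:

set.seed(2)
n <- 1e5
w <- rbinom(n, 1, 0.5)                      ## unobserved confounder, W -> X and W -> Y
x <- rbinom(n, 1, plogis(-0.5 + w))         ## treatment
z <- rbinom(n, 1, plogis(-1 + 2 * x))       ## mediator on the front-door path X -> Z -> Y
y <- rbinom(n, 1, plogis(-1 + 2 * z + w))   ## outcome

## Front-door adjustment:
## P(y = 1 | do(x)) = sum_z P(z | x) sum_{x'} P(y = 1 | x', z) P(x')
p_do_fd <- function(xval) {
  sum(sapply(c(0, 1), function(zv) {
    pz_x  <- mean(z[x == xval] == zv)
    inner <- sum(sapply(c(0, 1), function(xp) {
      mean(y[x == xp & z == zv]) * mean(x == xp)
    }))
    pz_x * inner
  }))
}
p_do_fd(1) - p_do_fd(0)  ## front-door estimate of the ATE, computed without w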
C Stability
Definition 12 (Stability, Pearl (2009b, p 48)). Let I(P) denote the set of all conditional independence relationships embodied in P. A causal model M = 〈D, Θ_D〉 generates a stable distribution if and only if P(〈D, Θ_D〉) contains no extraneous independencies – that is, if and only if I(P(〈D, Θ_D〉)) ⊆ I(P(〈D, Θ′_D〉)) for any set of parameters Θ′_D.
When we parametrize a DAG structure D (denoted G in the notation we’ve
been using) with parameters Θ we have a causal model M (e.g. trivially
X → Y becomes a model as in Y = α + βX + ε_Y , ε_Y ∼ N(0, σ), with parameters Θ = (α, β, σ)^T ). That model generates distributions P(〈D, Θ_D〉), and these
are likely to vary as we change the parameters. A model is stable by Definition
12 if perturbing the parameters does not destroy any independencies. For
example, consider any two pathways by which the treatment Z can affect
attrition R: (i) Z → R; and (ii) Z → Y ∗ → R. When two or more paths
connect Z and R a precise tuning of parameters could negate the effect via
one path with a countervailing effect via another path. Observationally Z
and R appear independent but this apparent independence is not robust to
small perturbations of the parameters. These implications are best illustrated
by the truth table in Table 2.
Stability is important because, by ruling out such precise tuning of parameters, and by Theorem 1, it follows that (Z ⊥ R)P ⇔ (Z ⊥ R)G . If this is true
P   Q   W   (¬P ∧ ¬Q) ∨ W   (Z ⊥ R|S)G   (Z ⊥ R|S)P
F   F   F   T               T            T
F   T   F   F               F            F
T   F   F   F               F            F
T   T   T   T               F            T

Table 2: Truth table depicting implications of failure to reject the null hypothesis that treatment Z does not cause attrition R using information from the observed distribution, where P = “Z → R”; Q = “Z → Y ∗ → R”; and W = “P ∧ Q ∧ they exactly offset”. Assuming stability (Definition 12) rules out the last row of the table, where independence of Z and R in the statistical sense does not imply their independence in the underlying causal structure.
then failing to reject the null of (Z ⊥ R) P implies failure to reject (Z ⊥ R)G ,
which should focus the researcher’s attention on the first row of Table 1. If so
ATE_{R=0} is identified and estimable.21 Moreover, this works for any SADAG,
so we do not have to make any more assumptions than those in Definitions
8 and 12. In the context of experiments these are relatively uncontroversial.
21 This is one instance where testing the null of exactly zero effect really makes sense: a small effect is not always evidence of a small problem if big effects are almost offsetting one another.
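The logic of Table 2 can also be reproduced mechanically. The short R sketch below (an illustration, not the sentential-logic package mentioned in Section 7.4) enumerates the cases and flags the exact-cancellation row that stability excludes:

## P: edge Z -> R is present; Q: path Z -> Y* -> R is present;
## W: both are present and their effects exactly cancel.
cases <- expand.grid(P = c(FALSE, TRUE), Q = c(FALSE, TRUE))
cases$W <- cases$P & cases$Q                      ## worst case: cancellation whenever both paths exist
cases$indep_G <- !cases$P & !cases$Q              ## (Z ⊥ R)G : no path from Z to R in the graph
cases$indep_P <- (!cases$P & !cases$Q) | cases$W  ## (Z ⊥ R)P : independence in the observed distribution
cases  ## the last row has indep_P TRUE but indep_G FALSE; stability rules it out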
D Checklist
ATTCHECK: A procedural checklist for diagnosing problematic attrition in randomized controlled experiments

Procedures
[ ] Perform a review of the relevant literature, and consult with experts, to identify possible causes of both the outcome and attrition in your particular application
[ ] Encode this expert knowledge using a causal diagram
[ ] Use the causal diagram to determine what variables need to be measured
[ ] Prepare for attrition by having a sampling plan for non-response follow-up at endline (and baseline) surveys, as needed
[ ] Include a discussion of attrition and the related causal diagram in the study protocol
[ ] Register the study protocol before recruiting units into the study, or before randomization at the latest
[ ] Check for problematic attrition by testing the d-separation conditions (R ⊥ Z)G and (Y ⊥ R|Z)G
[ ] Interpret your findings and assess whether attrition is problematic. If not, assess whether the ATE is estimable for all units, or only for those units with observed outcomes

Software technologies
[ ] Report any software used and its release version

(For details on procedural checklists see Martel García (2012))