Definition and Diagnosis of Problematic Attrition in Randomized Controlled Experiments

Fernando Martel García* (fmg229[at]nyu.edu)

First draft: October 1, 2012. Current draft: April 25, 2013.

Abstract

Attrition is the Achilles' Heel of the randomized experiment: It is fairly common, and it can completely unravel the benefits of randomization. Using the structural language of causal diagrams I demonstrate that attrition is problematic for identification of the average treatment effect (ATE) if – and only if – it is a common effect of the treatment and the outcome (or a cause of the outcome other than the treatment). I also demonstrate that whether the ATE is identified and estimable for the full population of units in the experiment, or only for those units with observed outcomes, depends on two d-separation conditions. One of these is testable ex post under standard experimental assumptions. The other is testable ex ante so long as adequate measurement protocols are adopted. Missing at Random (MAR) assumptions are neither necessary nor sufficient for identification of the ATE.

* I would like to thank Matthew Blackwell, Felix Elwert, and Judea Pearl for careful comments and suggestions; participants in the Applied Statistics Workshop hosted by Harvard University's Department of Government 2013; and in panel 38-13 of the Midwest Political Science Association Meeting 2013. All errors are my own. This manuscript relies on causal diagrams, truth tables, and sentential logic for proofs. These techniques are being implemented in an open source R package that will soon be made available in the Comprehensive R Archive Network (CRAN).

1 Introduction

Attrition, or missing data on the outcome of interest, is the "Achilles' Heel of the randomized experiment" (Shadish et al. 1998, 3).
It can completely unravel the benefits of randomization, and it is very common in social experiments, impact evaluations, clinical trials, and studies where outcomes are measured using survey instruments, or with a long lag. But when is attrition a problem for causal inference in randomized controlled experiments (RCEs)? And how can problematic forms of attrition be reliably diagnosed? Here I show how existing answers to these questions are often ambiguous, and at times misleading. I also define and demonstrate the necessary and sufficient conditions for problematic attrition. And I conclude by proposing diagnostic tests and measurement protocols to reliably diagnose problematic attrition. Attrition is said to be problematic whenever the average treatment effect (ATE) is not identified non-parametrically under standard experimental assumptions. Problematic attrition arises if – and only if – attrition is a common effect of the treatment and the outcome (or a cause of the outcome other than the treatment). Moreover, whether attrition is problematic, or the ATE is identified and estimable for all experimental units, or only for those units with observed outcomes, depends on just two d-separation (i.e. orthogonality) conditions. The first condition can be checked ex post by testing whether the treatment causes attrition. By itself this test only allows a partial diagnosis of problematic attrition. The second d-separation condition can be checked ex ante if researchers put in place a measurement protocol for non-response follow-up at endline (and baseline) data collection. These results are summarized in a handy checklist researchers can use to improve the process of causal inference in experiments subject to attrition.
Attrition can severely compromise the comparability of units across experimental arms; limit the representativeness of the observed sample with regard to the experimental population; and increase the variance of any estimated effects (Cook and Campbell 1979). Attrition is also very common. For example, attrition rates of 20 to 40 percent appear to be the norm in social experiments (Ashenfelter and Plant 1990; Hausman and Wise 1979; Heckman and Smith 1995; Krueger 1999; Newhouse et al. 2008). A field experiment in political science studying the effect of media on political behaviour reported attrition rates of 69 percent for survey data, and 23 percent for administrative records (Gerber, Karlan and Bergan 2009, 41–43). And a survey of clinical trials found that 63 of 71 trials (89 percent) reported missing outcome data, with 13 of these trials reporting missing outcomes for more than 20 percent of patients (Wood, White and Thompson 2004). Attrition tends to be higher in longitudinal studies, and in field studies that rely on survey instruments. This study uses the graphical language of causal diagrams (Pearl 1995; Greenland, Pearl and Robins 1999) to define and diagnose problematic attrition. The choice of language is justified by a simple logic: (1) Whether attrition is problematic depends on the underlying causal structure; (2) probability statements cannot pin down this causal structure because causation implies correlation but not vice versa; therefore (3) discussion of attrition using probability statements is necessarily ambiguous. By contrast, causal diagrams encode and communicate causal knowledge directly, highlighting implicit assumptions, and providing justifications for seemingly tautological statements like "Missing at Random" (Rubin 1976; Little and Rubin 1987).
This powerful new language provides greater leverage for causal identification.1

2 Literature review

Because attrition is so common, and because it can undo the benefits of randomization, it has become a major source of concern. One strand of the literature focuses on ex ante remedies to minimize attrition, whereas a second, much larger strand focuses on ex post analytical techniques for dealing with attrition (see Shadish (2002, 5-6) for an overview). Surprisingly, not much has been written on the definition of problematic attrition, on the necessary and sufficient conditions that make attrition a problem, and on ways to reliably diagnose problematic attrition. Two problems bedevil the attrition literature. First, problematic attrition is often discussed in terms of probability and correlations – the language of symptoms – as opposed to causal relations and diagrams – the language of causes. This leads to ambiguity, as it is the causal model generating the observed correlations, rather than the correlations themselves, that determines whether attrition is problematic. Because several causal models can be compatible with any given covariance structure, the language of probability cannot pin down the problem. Second, I could not find proof of the necessary and sufficient conditions for problematic attrition. Instead discussion of

1 Notation is more than semantics. For example, Arabic and Roman numerals both have a direct translation, yet the positional system of Arabic numerals makes them much better suited to arithmetic. Today high school kids are able to carry out arithmetic operations that ancient Romans could only perform with the aid of pebbles, finger counting, and abacists, or professionals skilled in using the abacus (Reynolds 2004). Not surprisingly the abacists resisted the introduction of Arabic numerals for centuries (Stone 1972).
attrition tends to fall back on assumptions about the nature of the missingness, and whether it is completely at random, at random, or not at random. Yet some of these assumptions are neither necessary nor sufficient for problematic attrition, and often they are proffered without adequate reasons, justifications, and diagnostic tests. This is not surprising. Judging attrition to be problematic presupposes knowledge of the necessary and sufficient conditions for problematic attrition, and of the causal structure generating both the outcome and the attrition, which only causal diagrams make explicit. Using the language of probability Jurs and Glass (1971) purported to show how problematic attrition depends on whether it is random within, and between, comparison groups. Though sensible, their language leaves room for ambiguity. Consider their universal statement: "if the [attrition] is systematically related to the treatments, the inferences possible from the reduced sample are not the same as those that could be drawn from the original sample" (Jurs and Glass 1971, 63). Now let's show an exception. Suppose the only causes of attrition are the randomized treatment, and its interaction with a covariate. Suppose also that no causal relation connects that covariate to the outcome of interest. And suppose treatment has an effect on the outcome. This implies: (i) differential attrition; (ii) missingness that is systematically related to a covariate; and (iii) unobserved outcomes that are correlated with the attrition. No matter: as is shown below, the observed outcomes remain comparable across groups, and representative of the full experimental population, so the ATE is identified and estimable.2

2 Despite these problems Jurs and Glass's (1971) method remains relatively popular. Kalichman et al. (2001) used it to analyse attrition in a randomized field trial studying the effect of a behavioural intervention on HIV transmission.
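The exception just described can be checked by simulation. The sketch below is illustrative only (the numbers and functional forms are assumptions, not from the paper): attrition depends on the randomized treatment Z and its interaction with a covariate X, X has no causal link to the outcome Y, and yet the simple contrast among units with observed outcomes recovers the true ATE.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

z = rng.integers(0, 2, n)            # randomized treatment
x = rng.integers(0, 2, n)            # covariate, causally unrelated to Y
y = 1.0 * z + rng.normal(0, 1, n)    # true ATE = 1

# Attrition caused only by Z and the Z*X interaction (differential attrition,
# systematically related to a covariate).
p_attrit = 0.1 + 0.2 * z + 0.3 * z * x
r = rng.random(n) < p_attrit         # R = 1 means the outcome is missing

# Difference in observed means among non-attriters (r = 0).
obs = ~r
ate_hat = y[obs & (z == 1)].mean() - y[obs & (z == 0)].mean()
print(round(ate_hat, 2))             # close to the true ATE of 1
```

Because the covariate is independent of the outcome's background causes, selection on (Z, X) leaves observed outcomes comparable across arms within treatment, exactly as the text claims.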
A review of a convenience sample of papers from leading experimental social scientists highlights the degree of confusion that surrounds problematic attrition (see Appendix A). Often the diagnosis of problematic attrition, and the procedures used to deal with it, are asserted ex nihilo, without reference to a specific body of literature. Problematic attrition appears to have acquired the status of a folk theorem. For example, Cook and Campbell (1979) advanced the now common belief that problematic attrition can be diagnosed only when pre-test measures related to the outcome are available (West, Biesanz and Pitts 2000, 52). Yet, as will be demonstrated below, a key diagnostic for problematic attrition involves checking whether treatment has an effect on attrition, which is testable in the absence of any covariates. Though the way field researchers define, diagnose, and deal with attrition is often sensible, their approaches are stated with such ambiguity that it is hard to judge the merits of what they are doing. Researchers armed with causal diagrams have begun to clear the thicket of ambiguity.3 In a seminal contribution Hernán, Hernández-Díaz and Robins (2004) used causal diagrams to distinguish between selection bias and confounding. The former arises from conditioning on the common effects of the exposure (or the causes of the exposure) and the outcome (or the causes of the outcome), while confounding arises from common causes of both treatment and the outcome. However, Hernán, Hernández-Díaz and Robins (2004) do not provide a formal proof and, maybe because of this, theirs is not a sufficient condition for selection bias.4 Geneletti, Richardson and Best (2009) and Daniel et al. (2012) also discuss attrition using causal diagrams but neither proves the necessary and sufficient conditions for identification of the ATE. Bareinboim and Pearl (2012) show how the odds ratio remains unbiased even when selection is caused by the outcome. More generally, Elwert and Winship (2012) provide a comprehensive discussion of selection bias problems (see also Pearl (2012, 2013)). I build on these insights by proving the necessary and sufficient conditions for identification of the ATE in RCEs subject to attrition. And by showing how simple ex post tests, combined with ex ante measurement protocols, can be used to reliably diagnose problematic attrition.

3 An earlier structural approach had flourished in economics (Heckman 1979). However, this approach is highly parametric, exhibiting a mathematical complexity that is mostly unnecessary and prone to errors (Greene 1981).

4 The proof that their formulation is insufficient for selection bias is simple. Consider the directed acyclic graph Y ← Z → R, where Y is the outcome, Z the treatment, and R is the attrition dummy, and suppose we condition on observed outcomes (r = 0). By selecting units with observed outcomes we are conditioning on a common effect of treatment (Z) and a cause of the outcome (Z), yet the ATE is identified, as shown below.

3 Introduction to causal diagrams

3.1 Definitions

Definition 1 (Graph). A graph is a collection G = 〈V, E〉 of nodes V = {V1, ..., VN} and edges E = {E1, ..., EM}, where the nodes correspond to variables, and the edges denote the relation between pairs of variables.

Definition 2 (Directed Acyclic Graph, adapted from Pearl (2009, §1.2)). A directed acyclic graph (DAG) is a graph that only admits: (i) directed edges with one arrowhead (e.g. →); (ii) bi-directed edges with two arrowheads (e.g. ↔); and (iii) no directed cycles (e.g. X → Y → X), thereby ruling out mutual or self causation.

Intuitively X → Y reads "X causes Y", and X ↔ Y reads "X and Y have unknown causes in common or are confounded". A path in a DAG is any unbroken route traced along the edges of a graph – irrespective of how the arrows are pointing (e.g. X ↔ M → Y is a path between the ordered pair of variables (X, Y)).
Any two nodes are connected if there exists a path between them; else they are disconnected. A directed path is a path composed of directed edges where all edges point in the direction of the path (e.g. X → M → Y). We also say that X is a parent of Y, and Y is a child of X, if X directly causes Y, as in X → Y. For example, the parents of Y in causal diagram X → Y ← UY are denoted PA(Y) = {X}, where, by convention, we confine the set of parents of Y to the set of endogenous variables in V (i.e. we do not include UY in the set PA(Y) even though UY is a direct cause of Y). Similarly, we say that X is an ancestor of Y, and Y a descendant of X, if X is a direct or indirect cause of Y. For example, X is both a direct and indirect cause of Y if a DAG includes direct and mediated directed paths (e.g. X → Y, and X → Z → Y, respectively). We refer to non-terminal nodes in directed paths as mediators (e.g. Z is a mediator in the path X → Z → Y).

Definition 3 (Causal Diagram, adapted from Pearl (2009, 44, 203)). A causal diagram or structure of a set of variables W is a DAG G = 〈{U, V}, E〉 with the following properties:

1. Each node in {U, V} corresponds to one and only one distinct element in W, and vice versa;
2. Each edge E ∈ E represents a direct functional relationship among the corresponding pair of variables;
3. The set of nodes {U, V} is partitioned into two sets:
   (a) U = {U1, U2, ..., UN} is a set of background variables determined only by factors outside the causal structure;
   (b) V = {V1, V2, ..., VN} is a set of endogenous variables determined by variables in the causal structure – that is, variables U ∪ V; and
4. None of the variables in U have causes in common.

A DAG of a set of variables W only qualifies as a causal diagram if it includes all common causes of the variables in W.5 Consequently, a causal diagram replaces all bi-directed edges with latent variables.
As a result it only admits three types of associations between any pair of variables (X, Y) in W: (1) X causes Y or vice versa, as depicted by a directed edge (e.g. X → Y); (2) X and Y share (latent) causes in common (e.g. X ← Z → Y); or (3) they have an effect in common, or collider, that for some reason is being held constant. For example, in causal diagram X → C ← Y, selecting units into a study on the basis of C = c induces a non-causal association between X and Y. Hernán, Hernández-Díaz and Robins (2004) show how selection on a collider can result in selection bias. Finally, every causal diagram has a corresponding causal model, defined as follows:

Definition 4 (Causal Model, adapted from Pearl (2009, 203)). A causal model M replaces the set of edges E in a causal diagram G by a set of functions F = {f1, f2, ..., fN}, one for each element of V, such that M = 〈U, V, F〉. In turn, each function fi is a mapping from (the respective domains of) Ui ∪ PAi to Vi, where Ui ⊆ U and PAi ⊆ V\Vi, and the entire set F forms a mapping from U to V. In other words, each fi in vi = fi(pai, ui), i = 1, 2, ..., N, assigns a value to Vi that depends on (the values of) a select set of variables in V ∪ U, and the entire set F has a unique solution V(u).

5 If two variables in W have a cause UZ in common (e.g. UY ← UZ → UX) but UZ ∉ W, then DAG G is not a causal diagram. To make it one, UZ should be included in U, and UY and UX included in the set of endogenous variables V. Formally, property 4 ensures a DAG meets the Causal Markov Condition (see Pearl (2009b, 19, 30) for details).

Causal models are completely non-parametric. For example, function y = fy(x, z, uy) is compatible with any well defined mathematical expression in its arguments, like y = α + β1x + β2z + uy, or y = α + β1x + β2z + β3xz + uy. Causal models are also completely deterministic.
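Definition 4 can be made concrete in code. The sketch below is an assumed illustration (the functional forms, coefficients, and background values are arbitrary, echoing the interaction example above): one deterministic function per endogenous variable, mapping that variable's parents and background cause to a value, so that any draw of the background variables U yields a unique solution V(u).

```python
# Causal model M = <U, V, F> for a three-variable diagram X -> Z -> Y, X -> Y.
def f_x(u_x):            # X has no endogenous parents: PA(X) = {}
    return u_x

def f_z(x, u_z):         # PA(Z) = {X}
    return 2.0 * x + u_z

def f_y(x, z, u_y):      # PA(Y) = {X, Z}, with an interaction term
    return 1.0 + 0.5 * x + 0.25 * z + 0.1 * x * z + u_y

# Given a realization of the background variables u, F has a unique solution.
u = {"u_x": 1.0, "u_z": -0.5, "u_y": 0.2}
x = f_x(u["u_x"])
z = f_z(x, u["u_z"])
y = f_y(x, z, u["u_y"])
print(x, z, y)           # 1.0 1.5 2.225
```

Replacing the fixed dictionary u with draws from a distribution P(u) turns this deterministic model into a probabilistic causal model, which is exactly the move the text makes next.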
Probability comes into the picture through our ignorance of background conditions U, which we summarize using a probability distribution P(u). Because endogenous variables are a function of these background conditions, P(u) induces a probability distribution P(v) over all endogenous variables in V.6

Definition 5 (Probabilistic Causal Model, Pearl (2009, 205)). A probabilistic causal model Γ is a pair 〈M, P(u)〉, where M is a causal model and P(u) is a probability function defined over the domain of U. By the definition of a causal diagram (Definition 3), probabilistic causal models have the property that ui ⊥ uj ∀ i ≠ j, ui, uj ∈ u.

Figure 1 provides an illustration of a probabilistic causal model and its corresponding causal diagram.

Definition 6 (Causal Effect, adapted from Pearl (2009, 70)). Given a probabilistic causal model Γ and two disjoint sets of variables, X and Y, the causal effect of X on Y, denoted P(y|do(x)), is a function from X to the space of probability distributions on Y.

6 This is exactly the same as characterizing disturbance terms in a regression context using some distribution, like ε ∼ N(0, σ²).

[Figure 1 here. Left panel, probabilistic causal model Γ (Definition 5): X = fX(W, UX); Z = fZ(X, UZ); W = fW(UW); Y = fY(Z, W, UY); UX ∼ PX(UX); UZ ∼ PZ(UZ); UY ∼ PY(UY); UW ∼ PW(UW). Right panel, the causal diagram G(Γ) representing the causal structure of Γ (Definition 3).] Figure 1 caption: X is the cause under investigation, Z is a mediator, Y is the outcome, and W a covariate. Variables Ui, i ∈ {X, Z, Y, W}, are unobserved background causes, which we summarize using probability distributions Pi(Ui). These are independently distributed by construction, as all causes of two or more variables are included in the model, maybe as latent variables. White nodes denote observed variables, and grey nodes unobserved ones.
For each realization x of X, P(y|do(x)) gives the probability of Y = y induced by deleting from Γ the equations fXi, ∀ Xi ∈ X, and replacing them with Xi = xi.

Causal diagrams have a number of advantages for causal identification. First, they provide an accessible and intuitive language for encoding, in a public manner, a researcher's private knowledge of the causal structure governing a set of variables. Second, causal diagrams capture all the causal associations and conditional independencies necessary to identify causal effects (Definition 6).7 This information is enough to infer, justify, and test identification strategies, including which covariates to control for, what assumptions are plausible, which are testable, and what variables are good instruments (Pearl 2009a). Third, causal diagrams capture all of the information needed for identification purposes, making precise mathematical specification of a probabilistic causal model redundant.8 Fourth, the simplicity of causal diagrams means identification conditions can be derived algorithmically. A researcher can feed a causal diagram into a computer, ask whether the effect of X on Y is identified, and get an automated reply, including the set of conditioning variables that need to be controlled for.9 The goal of causal identification is to assess the necessary and sufficient conditions that would allow us to express P(y|do(x)) in terms of the observable distribution P(O), where O ⊆ V is a subset of observed endogenous variables V in probabilistic model Γ (Definition 5).10 Two sufficient conditions for causal identification are the front-door and back-door criteria (see Annex B). Both make use of the graphical concept of d-separation.

7 The focus is on identification of P(y|do(x)), from which other quantities of interest like the ATE can be calculated. This is somewhat restrictive: at times P(y|do(x)) is not identified yet some quantity, like the odds ratio, is (Bareinboim and Pearl 2012).
Definition 7 (d-Separation, Pearl (2009, pp 16-17)). A path p is said to be d-separated (or blocked) by a set of nodes S iff:

1. p contains a chain i → m → j or a fork i ← m → j such that the middle node m is in S, or
2. p contains an inverted fork (or collider) i → m ← j such that the middle node m is not in S and such that no descendant of m is in S.

A set S is said to d-separate X from Y iff S blocks every path from a node in X to a node in Y.

8 When non-parametric identification is impossible, parametric assumptions can be made. However, parametric identification is not accorded as much credibility.
9 See the package graph (Gentleman et al. 2012) for R (R Core Team 2013).
10 The topic of identification is a nuanced one (see Pearl (2009b, §3) for details).

Following Pearl (2009b, p 18) let (X ⊥ Y |S)P capture the probabilistic notion of conditional independence and (X ⊥ Y |S)G the graphical notion of d-separation in Definition 7. Their connection is established by the following theorem:

Theorem 1 (Probabilistic implications of d-Separation, Pearl (2009, p 18)). For any three disjoint subsets of nodes (X, Y, S) in DAG G and for all probability functions P, we have:

1. (X ⊥ Y |S)G ⇒ (X ⊥ Y |S)P whenever G and P are compatible;11 and
2. if (X ⊥ Y |S)P holds in all distributions compatible with G, it follows that (X ⊥ Y |S)G.

Theorem 1 allows us to determine whether two variables, or sets of variables, are independent on the basis of d-separation alone. These independencies are useful for identification purposes.

4 Examples and intuition

In randomized controlled experiments attrition is a problem of selection bias and estimation. This is not obvious from the econometrics literature, where identification of the ATE hinges on unconfoundedness and overlap assumptions (Imbens and Wooldridge 2009, 26). Formally:

11 By compatibility Pearl means that the probability distribution admits the factorization P(x1, ..., xn) = ∏i P(xi | pai), where pai are the parents of xi, that is the nodes with directed edges pointing into xi. For example, X ← U → Y factorizes as P(X, U, Y) = P(U) P(X|U) P(Y|U).

Unconfoundedness: (Yz′(u), Yz″(u)) ⊥ Z(u) | X(u) or, in graphical terms, (Y ⊥ Z|X)G under the null of no effect; and

Overlap: 0 < pr(Z(u) = 1 | X(u) = x) < 1, ∀x,

where Yz′(u) is the hypothetical outcome for unit u under intervention do(Z = z′) (Pearl 2009b, 204), Z is a treatment, and X a set of covariates including the empty set. Randomized controlled experiments ensure that unconfoundedness and overlap hold in large samples, and that (Yz′(u), Yz″(u)) ⊥ Z(u). Yet the relevant question in experiments with attrition is whether (Yz′(u), Yz″(u)) ⊥ Z(u) | R(u) = 0. To understand when and why conditioning on units with observed outcomes may be problematic we need to unpack the unconfoundedness assumption. Consider Hernán, Hernández-Díaz and Robins's (2004) structural decomposition of the unconfoundedness assumption:

No confounding: The treatment and the outcome have no (uncontrolled) causes in common; and

No selection bias: Common effects of treatment (or the causes of treatment) and the outcome (or the causes of the outcome other than the treatment) are not controlled for.12

From this decomposition it is evident that the assumption of no selection bias will fail whenever attrition is a descendant of both treatment and the outcome (or a cause of the outcome other than the treatment), and we condition on attrition R. More generally, this is true even if the outcome is fully observed and R is any effect of the outcome. However, even if the ATE for the full experimental group is identified conditional on attrition, we may not be able to estimate it once we condition on observed outcomes only. This is a subtle distinction.

12 This is not their exact definition, which is insufficient without the additional statement "other than the treatment".
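Definition 7's criterion can also be checked mechanically. The sketch below is an assumed from-scratch implementation (not the paper's forthcoming R package), using the standard equivalent test: X and Y are d-separated by S iff they are disconnected once the DAG is restricted to the ancestors of X ∪ Y ∪ S, moralized, and stripped of S.

```python
def d_separated(parents, xs, ys, s):
    """parents: dict node -> list of parent nodes (a DAG); xs, ys, s: sets."""
    # 1. Keep only the ancestors of X, Y, and S (including those sets).
    keep, stack = set(), list(xs | ys | s)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, []))
    # 2. Moralize: undirect every edge and marry co-parents of a common child.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents.get(v, []) if p in keep]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # 3. Delete S and test whether any node in Y is reachable from X.
    seen, stack = set(), list(xs - s)
    while stack:
        v = stack.pop()
        if v in seen or v in s:
            continue
        seen.add(v)
        if v in ys:
            return False          # a path survives: not d-separated
        stack.extend(adj[v])
    return True

# Collider Z -> R <- Y: Z and Y are d-separated marginally, but conditioning
# on the collider R connects them.
g = {"R": ["Z", "Y"]}
print(d_separated(g, {"Z"}, {"Y"}, set()))   # True
print(d_separated(g, {"Z"}, {"Y"}, {"R"}))   # False
```

The collider example is precisely the structure that makes conditioning on attrition dangerous in what follows.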
In experiments with attrition outcomes are only observed when r = 0, which means that for estimation purposes we are conditioning on r = 0, not R. As a result we may only be able to estimate the ATE for the subset of units with observed outcomes. For example, Figure 2a illustrates a situation where the ATE is identified but all we can estimate is the ATE for those units with observed outcomes, which may differ from the ATE for the full experimental population. To simplify the discussion, and focus on attrition, I assume a specific parametric form for Y → O ← R:

O = { Y if R = 0; Missing if R = 1.   (1)

Equation 1 ensures O ≡ Y after selection on observables (e.g. after dropping all units with outcomes labelled missing). It also rules out measurement error, or the possibility of Z having an effect on measurement O but not the latent outcome Y (e.g. it rules out the edge Z → O). Second, by virtue of randomization there is no backdoor path in Figure 2a connecting the ordered pair (Z, Y), so the empty set S = ∅ d-separates them. By Theorem 5, Annex B, the causal effect of Z on Y is therefore identifiable, and given by the formula:

P(y|do(z)) = P(y|z).   (2)

Third, in practice the ATE cannot be estimated using Equation 2 because y is not observed whenever r = 1 (i.e. o is labelled missing). For estimation purposes we need to condition on R such that:

P(y|do(z)) = P(y|z, r = 0) P(r = 0) + P(y|z, r = 1) P(r = 1).   (3)

Conditioning on R does not result in selection bias because R is not a descendant of Z. By Equation 1 we can compute the first product term in Equation 3 as P(y|z, r = 0) ≡ P(o|z, r = 0). However, the second product term cannot be calculated because P(o|z, r = 1) is labelled as missing. Consequently all we can estimate is E[y|z′, r = 0] − E[y|z″, r = 0] for any two distinct realizations z′ and z″ of Z, or the effect of treatment on the subset of units with observed outcomes.13 Now consider Figure 2b. Here R is a descendant of Z.
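A simulation sketch (assumed numbers and functional forms, not from the paper) previews why this case is benign: when attrition is caused by the treatment alone, the contrast among observed outcomes recovers the full-population ATE even under strongly differential attrition, which is what the derivation that follows establishes formally.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400_000

z = rng.integers(0, 2, n)
y = 2.0 * z + rng.normal(0, 1, n)     # true ATE = 2
r = rng.random(n) < 0.05 + 0.40 * z   # 5% vs 45% attrition: differential

obs = ~r
ate_hat = y[obs & (z == 1)].mean() - y[obs & (z == 0)].mean()
print(round(ate_hat, 2))              # recovers the ATE of 2
```

Within each arm attrition is independent of the outcome's background causes, so conditioning on Z d-separates R from Y and the observed-outcome contrast is unbiased.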
No matter, to estimate the ATE we also need to condition on Z, and conditioning on Z d-separates attrition R from the latent outcome Y. This implies (R ⊥ Y |Z)P and also P(Y |Z, R) = P(Y |Z). The latter allows unbiased estimation of the ATE for the full population of units in the experiment, even after conditioning on r = 0. By substitution:

P(y|do(z)) = P(y|z, r = 0) P(r = 0) + P(y|z, r = 1) P(r = 1)
           = P(y|z) P(r = 0) + P(y|z) P(r = 1)
           = P(y|z).

13 If X were observed in Figure 2a, P(x, r = 0) > 0 ∀x, and we were willing to assume X is the only ancestor in common between Y and R, then we could recover the ATE using inverse probability weights. The reason is that X d-separates Y from R, such that (Y ⊥ R|X)P (Theorem 1). But note this requires making assumptions not warranted by the experimental design.

[Figure 2 here, panels (a)–(f): (a) ATE identified, not estimable; (b) ATE identified and estimable; (c) ATE not identified, R a collider; (d) ATE not identified, R a collider; (e) ATE identified and estimable; (f) ATE not identified, R a virtual collider.]

Figure 2 caption: Attrition mechanisms. Z is the treatment, X a covariate, Y is the latent outcome observed by experimental units but not the researcher, and O is the publicly observed outcome. Attrition R is a binary variable indicating whether Y is visible to the researcher (R = 0) or not (R = 1), such that O = Y if R = 0 and O = missing otherwise. The Ui are independently distributed background causes. White nodes denote observed variables, and light gray nodes unobserved ones. Square nodes are variables held constant (e.g. r = 0). Directed edges (→) denote causation, and undirected edges (—) correlations induced by colliders.
Because P(o|z, r = 0) ≡ P(y|z, r = 0) ≡ P(y|z), the ATE for the full experimental population can be estimated as E[Y |do(z′), r = 0] − E[Y |do(z″), r = 0] for any two distinct realizations z′ and z″ of Z (so long as P(z, r = 0) > 0 ∀z). This example also shows how unbiased estimation of the ATE is possible despite differential attrition directly affected by treatment. Now consider Figures 2c and 2d. In both examples the ATE is unconditionally identified, yet conditioning on R for estimation purposes activates a collider in a path between the treatment and the outcome. This generates a spurious dependence between the treatment and the latent outcome under the null of no effect, so the ATE is not identified. In Figure 2d conditioning on R induces the spurious association Z — Y, while in Figure 2c it activates spurious associations along paths Z — X → Y and Z — UR — C → Y. This is no surprise: in both causal mechanisms conditioning on R invalidates the no selection bias assumption. Finally consider Figures 2e and 2f. In Figure 2e the ATE is identified and estimable conditional on observed outcomes (r = 0). This is true even if there is differential attrition (treatment causes attrition), and imbalance in the observed covariate (X interacts with Z). Figure 2f involves a virtual collider (Pearl 2009b, 339–400), or controlling for a descendant of the outcome of interest. In this case conditioning on observed outcomes r = 0 induces a non-causal association between UR and Y. In addition, because R is a child of Y and hence a proxy for Y, we are also limiting the variation of Y itself. This in turn induces a non-causal association between UY and Z. In fact, conditioning on a virtual collider creates an association between all parents of each of the ancestors of the virtual collider. Moreover, these non-causal associations induce other associations, like the association between Z and UR in
Z → Y — UR. Finally, Figure 2f has the property that the effect is identified whenever Z is without effect but not otherwise.14

5 Necessary and sufficient conditions for problematic attrition

To formalize the necessary and sufficient conditions for problematic attrition I limit the discussion to the class of Simple Attrition DAGs defined below. This allows me to focus the discussion on attrition under the standard assumptions of randomization, excludability, and non-interference.

Definition 8 (Simple Attrition DAGs (SADAGs)). A simple attrition DAG is one where: (i) observed outcomes O are determined by Equation 1 (an exclusion restriction); (ii) latent outcomes Y for any unit i are independent of treatment assigned to any other unit j, j ≠ i (non-interference); (iii) a single treatment Z taking two or more values is applied in a randomized fashion (no arrows can point into Z); and (iv) attrition R does not cause Y.

The definition of SADAGs restricts attention to attrition in experimental settings separate from errors in variables, interference, or effects mediated by R. Clearly if the only effect of Z on Y is via R, as in Z → R → Y, the ATE cannot be estimated: the effect R → Y is not identified because when r = 1 all outcomes o are labelled missing. These restrictions notwithstanding, the intellectual apparatus presented here can be easily generalized to situations where only some or none of these assumptions hold.

14 The front door criterion (Annex B) is invalid in the face of virtual colliders. Intuitively, the last step between the presumed mediator and the outcome is structurally the same as the direct step Z → Y in Figure 2f, which is confounded by a spurious correlation with UY.

Next I introduce virtual colliders, which play an important role in what follows:

Theorem 2 (On the implications of conditioning on the effect of a cause). Consider any directed path U1 → V1 → ... → Vi → Vi+1 → ... → VN−1 → VN in a causal diagram.
For any variable Vi, N > i ≥ 1, in that path, there exists at least one value v*i+n of Vi+n, N − i ≥ n ≥ 1, such that conditioning on Vi+n = v*i+n generates a non-causal association between Ui+1 and Vi, which we denote by connecting these two nodes with an undirected edge (e.g. Ui+1 — Vi). (Here Ui+1 (not shown) captures other causes of Vi+1 beyond Vi, as in Vi → Vi+1 ← Ui+1.)

Proof. First consider any i, N > i ≥ 1, and n = 1. This is the case of a simple collider, where we control for Vi+1 in the causal diagram Vi → Vi+1 ← Ui+1. Now consider the equivalent causal model vi+1 = fVi+1(Vi, Ui+1). By contradiction, suppose (Vi ⊥ Ui+1 | vi+1)G for every vi+1 in the domain of Vi+1. Because fVi+1 is a single-valued function, this implies that Vi+1 does not change with Vi conditional on any value ui+1 in the domain of Ui+1, which contradicts the assumption that Vi is a cause of Vi+1 (→←). Therefore there exists v*i+1 such that ¬(Vi ⊥ Ui+1 | v*i+1)G.

Second, consider whether the claim is true for any n, N − i ≥ n ≥ 1. By repeated substitution we can write the reduced-form causal model vi+n = fVi+n(Vi+1, Ui+n). By contradiction, suppose that for each possible value ui+n in the domain of Ui+n, vi+n = fVi+n(Vi+1, ui+n) holds for all values of Vi+1. This implies that Vi+n does not change with Vi+1 conditional on any value ui+n in the domain of Ui+n, which contradicts the assumption that Vi+1 is a cause of Vi+n (→←). Therefore there exists at least one pair (u*i+n, v*i+n) such that a change in Vi+1 breaks the equality v*i+n = fVi+n(Vi+1, u*i+n); that is, it generates a value v**i+n ≠ v*i+n. Put differently, the pair (u*i+n, v*i+n) partitions the domain of Vi+1 into at least two sets: one whose elements satisfy the equality and one whose elements do not.

Third, because these induced subsets of Vi+1 are a partition of its domain, it must be the case that v*i+1, from the first step above, is in one of the sets, and each set maps into a unique value of Vi+n for given ui+n.
Therefore there exists v*i+n in the domain of Vi+n such that conditioning on Vi+n = v*i+n selects v*i+1 with some positive probability. By the first step above this implies ¬(Vi ⊥ Ui+1 | v*i+n).

Corollary 1. In a causal diagram G, if Y is a mediator between Z and R then there does not exist a set of variables S′ that meets the front-door criterion for the effect of Z on Y conditional on R.

Proof. We know from Theorem 2 that controlling for a descendant of Y will induce a non-causal association between all the parents of Y. As a result the last step in the front-door criterion is bound to fail, since the mechanism is no longer isolated.

Definition 9 (Problematic attrition). For any SADAG defined by the collection G = 〈V, E〉, attrition is said to be problematic whenever the ATE is not identifiable non-parametrically conditional on r = 0 and Z.

The emphasis on non-parametric identification, and on conditioning limited to r = 0 and Z, stems from the nature of the experimental context, where researchers typically want to avoid making any assumptions beyond those warranted by the experimental design.

Theorem 3 (Necessary and sufficient causal conditions for problematic attrition). In any SADAG defined by the collection G = 〈V, E〉, attrition will be problematic (i.e. the ATE will not be identifiable non-parametrically conditional on r = 0 and Z) if and only if:

1. Z is an ancestor of R; and
2. Y is an ancestor of R; or
3. there exists X in V\Z such that X is an ancestor of both Y and R.

Proof. (⇒): For any SADAG G, suppose condition 1 is true together with condition 2 or condition 3, and, by contradiction, that the ATE is identified non-parametrically conditional on r = 0 and Z only. Then either Y is an ancestor of R, or there exists X in V\Z such that X is an ancestor of both Y and R.

Case 1: Y is an ancestor of R. Then, given that Z is an ancestor of R, there are two further possibilities: (a) Y is a mediator of the effect of Z on R (e.g. Z → . . . → Y → . . . → R); or (b) Y is not a mediator of the effect of Z on R (e.g. Y is not in the previous chain).

Case 1a: Y is a mediator of the effect of Z on R. Then by Theorem 2 and Corollary 1 the ATE is not identified (→←).

Case 1b: Y is not a mediator of the effect of Z on R. Then the ordered pair (Y, R) is connected by a directed path, and similarly for the ordered pair (Z, R). These two directed paths meet at R, so R is the only collider in a path between Z and Y along these two directed paths. By the definition of d-separation (Definition 7), conditioning on r = 0 induces an association between Z and Y which, by Theorem 1, implies that Z and Y are dependent under the null of no effect. This implies that the ATE is not identified (→←).

Case 2: There exists X in V\Z such that X is an ancestor of both Y and R. Then by the definition of a SADAG (and the assumption that R is not an ancestor of Y) one of two causal structures must be true: (a) Y is a mediator of the effect of X on R (e.g. X → . . . → Y → . . . → R); or (b) Y is not a mediator of the effect of X on R (e.g. Y ← . . . ← X → . . . → R).

Case 2a: Y is a mediator of the effect of X on R. Then Y is an ancestor of R, which by the previous results (Case 1) implies that the ATE is not identified (→←).

Case 2b: Y is not a mediator of the effect of X on R. Then by the definition of an ancestor, X is connected to Y and R through directed paths which, together with Z being an ancestor of R, implies that R is the only collider in the combination of the two paths connecting Z to Y via R and X. By the definition of d-separation (Definition 7), conditioning on r = 0 induces an association between Z and Y which, by Theorem 1, implies that Z and Y are dependent under the null of no effect. This implies that the ATE is not identified (→←).

Under the null of no effect, and by the definition of a SADAG, Z and Y can only be associated by controlling for a (virtual) collider.
Because R can only be a (virtual) collider whenever Z is an ancestor of R and Y is an ancestor of R, or Z is an ancestor of R and there exists X in V\Z such that X is an ancestor of both Y and R, these two cases cover all the possibilities. Consequently, when condition 1 is true, and either of conditions 2 or 3 is also true, the ATE is not identified. And because G is an arbitrary SADAG, this is true of all SADAGs.

(⇐): For any SADAG G, suppose condition 1 is false, or both conditions 2 and 3 are false, and, by contradiction, that the ATE is not identified. Then either Z is not an ancestor of R; or Y is not an ancestor of R and there does not exist X in V\Z such that X is an ancestor of both Y and R.

Case 1: Z is not an ancestor of R. Under the null of no effect, and by the definition of a SADAG, Z and Y can only be associated by controlling for a (virtual) collider, which implies a directed path from Z to R (→←).

Case 2: Y is not an ancestor of R, and there does not exist X in V\Z such that X is an ancestor of both Y and R. Then there can be no path from Y to R other than through Z (e.g. Y ← . . . ← Z → . . . → R), in which case (R ⊥ Y | Z)G. By the same logic as in Case 1, the ATE is identified (→←).

Under the null of no effect, and by the definition of a SADAG, Z and Y can only be associated by controlling for a (virtual) collider. Because R can only be a (virtual) collider whenever Z is an ancestor of R and Y is an ancestor of R, or Z is an ancestor of R and there exists X in V\Z such that X is an ancestor of both Y and R, these two cases cover all the possibilities. Consequently, when condition 1 is not true, or both conditions 2 and 3 are false, the ATE is identified. And because G is an arbitrary SADAG, this is true of all SADAGs.

To say that the ATE is identified does not imply it is estimable, as shown by the next theorem:

Theorem 4 (Estimation of the ATE when the ATE is identified conditional on r = 0 and Z).
In any SADAG defined by the collection G = 〈V, E〉, and assuming the ATE is identified conditional on r = 0 and Z, the ATE is estimable if and only if there does not exist X in V\Z such that Y ← . . . ← X → . . . → R. Otherwise, given that the ATE is identified, all we can estimate is the ATE for units with observed outcomes not labelled as missing (ATE_r=0).

Proof. (⇒): For any SADAG G, suppose there does not exist X in V\Z such that Y ← . . . ← X → . . . → R, the ATE is identified, and, by contradiction, that the ATE is not estimable conditional on only r = 0 and z. Then either Z is not an ancestor of R, or Y is not an ancestor of R.

Case 1: Z is not an ancestor of R. Then the only way Y and R can be related is if Y is an ancestor of R. Now, if Z is not an ancestor of Y (e.g. the ATE = 0) then the ATE is identified, and it is 0. If Z is an ancestor of Y then, because Y is an ancestor of R, Z is an ancestor of R, but then the ATE would not be identified (→←).

Case 2: Y is not an ancestor of R. Then the only way Y and R can be related, given that there does not exist X in V\Z such that Y ← . . . ← X → . . . → R, is if they have Z as an ancestor in common. This can happen if Z is an ancestor of Y, and Y is an ancestor of R (→←); or if Z is an ancestor of both Y and R, and Y is not an ancestor of R, in which case Z is the only ancestor they have in common and so (Y ⊥ R | Z)G. By Theorem 1 this implies (Y ⊥ R | Z)P which, given that the ATE is identified, implies P(y | do(z)) = P(y | z) = P(y | z, r = 0).

Assuming the ATE is identified, and that there does not exist X in V\Z such that Y ← . . . ← X → . . . → R, implies that Z is not an ancestor of R, or Y is not an ancestor of R. These two cases cover all the possibilities. Consequently, when the ATE is identified, and there does not exist X in V\Z such that Y ← . . . ← X → . . . → R, the ATE is estimable. And because G is an arbitrary SADAG, this is true of all SADAGs.
(⇐): For any SADAG G, suppose the ATE is not identified, or there exists X in V\Z such that X is an ancestor of both Y and R, and, by contradiction, that the ATE is estimable.

Case 1: The ATE is not identified. If the ATE is not identified then it is not estimable, since identification is a necessary condition for estimation (→←).

Case 2: There exists X in V\Z such that X is an ancestor of both Y and R. Then this implies ¬(Y ⊥ R | Z), so that P(Y | Z, R) ≠ P(Y | Z). Because outcome data are only observed when r = 0, all we can estimate is P(y | z, r = 0), or ATE_r=0 (→←).

Assuming the ATE is not identified, or that there exists X in V\Z such that Y ← . . . ← X → . . . → R, implies that the ATE is not identified, or, if it is, that Y and R have a common ancestor in X. These two cases cover all the possibilities. Consequently, when the ATE is not identified, or there exists X in V\Z such that Y ← . . . ← X → . . . → R, the ATE is not estimable. And because G is an arbitrary SADAG, this is true of all SADAGs.

                         (Y ⊥ R | Z)G
                         True                 False
 (Z ⊥ R)G   True     ATE identified;      ATE identified;
                     estimable            ATE_r=0 estimable
            False    ATE identified;      ATE not identified;
                     estimable            not estimable

Table 1: This table summarizes the main implications of Definition 9 and Theorems 3 and 4. For any SADAG G the table shows how identification and estimation of the ATE depend on just two d-separation conditions, (Z ⊥ R)G and (Y ⊥ R | Z)G. These can be checked algorithmically for any G.

6 Diagnosis

The implications of Definition 9 and Theorems 3 and 4 are summarized in Table 1. In particular, problematic attrition (Definition 9) refers to situations where both d-separation conditions, (Y ⊥ R | Z)G and (Z ⊥ R)G, are false, as in the bottom right cell of Table 1. Faced with attrition, a key task for field researchers is inferring where in Table 1 their application belongs. In principle this can be done by performing two tests. First, we can test the null that treatment has no effect on attrition, against the alternative that it causes attrition.
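This first test can be implemented as a simple comparison of attrition rates across treatment arms. The sketch below uses simulated data and a permutation test, so no distributional assumptions are needed; the data-generating process and effect sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

z = rng.integers(0, 2, n)  # randomized binary treatment
# Hypothetical attrition process: treatment raises the attrition rate
r = (rng.random(n) < 0.10 + 0.05 * z).astype(int)

# Observed difference in attrition rates between arms
diff_obs = r[z == 1].mean() - r[z == 0].mean()

# Permutation test of the sharp null that Z has no effect on R:
# re-randomize Z many times and compare the resulting differences.
perm_diffs = np.empty(2_000)
for b in range(2_000):
    zb = rng.permutation(z)
    perm_diffs[b] = r[zb == 1].mean() - r[zb == 0].mean()

p_value = (np.abs(perm_diffs) >= abs(diff_obs)).mean()
```

A small p-value is evidence of differential attrition, placing the application in the second row of Table 1.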
If the null is rejected we conclude that our application likely falls in the second row of Table 1; otherwise we conclude that the evidence is not strong enough to reject the null of no effect. This test is unproblematic: attrition can be regarded as any other outcome of interest, as is the case whenever attrition is caused by death.15

Second, we can test the null that the outcome and attrition are uncorrelated conditional on the treatment, against the alternative that they are correlated. If the null is rejected we conclude that our application likely falls in the second column of Table 1; otherwise we conclude that there is not enough evidence to reject the null. This test is also unproblematic in the sense that all we are testing for is the presence of correlations, not causation. Finally, if both nulls are rejected we conclude that our application is a case of problematic attrition (bottom right cell). In this case the ATE is not identified without additional assumptions.

In practice, testing whether attrition and the outcome are correlated conditional on the treatment is not possible, as the outcome is labelled missing whenever r = 1. Ex post, after data collection is complete, researchers can only test whether treatment causes attrition. Even so, the resulting diagnosis is still useful even if incomplete. Specifically, if attrition is not an effect of treatment then there is less reason to believe that the observed attrition is problematic. Yet uncertainty remains as to whether the estimated ATE is relevant for all units in the experiment, or only for the subset of units with observed outcomes. However, ex ante, before data collection begins, researchers can prepare for attrition by having a sampling plan for non-response follow-up at endline (and baseline) surveys (Hansen and Hurwitz 1946; Deming 1953). These data can be used to test whether outcomes and attrition are correlated, and to make inferences about the attrition process.
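To illustrate, suppose an intensive follow-up recovers outcomes for a random subsample of the initial non-respondents. One can then probe whether the outcome predicts attrition conditional on treatment by comparing outcome means of respondents and followed-up non-respondents within a treatment arm. A sketch on simulated data; the data-generating process and effect sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

z = rng.integers(0, 2, n)                               # randomized treatment
y = 0.3 * z + rng.normal(size=n)                        # outcome depends on treatment
r = (rng.random(n) < 0.1 + 0.2 * (y > 0)).astype(int)   # attrition depends on outcome

# Intensive follow-up recovers outcomes for a random 50% of attriters
followed = (r == 1) & (rng.random(n) < 0.5)

# Within one arm (z = 0), compare mean outcomes of respondents vs
# followed-up attriters; under (Y ⊥ R | Z) the means should be equal.
y_resp = y[(z == 0) & (r == 0)]
y_attr = y[(z == 0) & followed]
diff = y_attr.mean() - y_resp.mean()
se = np.sqrt(y_attr.var(ddof=1) / len(y_attr) + y_resp.var(ddof=1) / len(y_resp))
t_stat = diff / se
```

Here attrition depends on the outcome, so the followed-up attriters have visibly higher outcomes and the null of no conditional correlation is rejected; repeating the comparison within z = 1 (and pooling) completes the test.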
However, at this stage a more promising approach is simply to use inverse probability weights or multiple imputation to avoid having to condition on observed outcomes in the first place. This is justified whenever the follow-up survey involves a random sample of the units with missing outcomes in the first survey wave.

15 There is one implicit assumption, that of stability; see Annex C. This is a general assumption about inferred causation, and is not specific to causal diagrams.

7 Discussion

7.1 Relation to Missing at Random assumptions

Experiments subject to attrition are often analyzed and discussed under the missingness framework of Rubin (1976) and Little and Rubin (1987). However, this approach provides neither necessary nor sufficient conditions for identification of the ATE in experiments subject to attrition.16

First, consider the assumption that data are Missing at Random (MAR), or P(R | Vmis, Vobs; β) = P(R | Vobs; β), where R is the attrition indicator, V is the set of observed and missing endogenous variables in a causal model, and β a vector of parameters of the missingness model (Schafer 1997, 11). MAR requires that attrition be independent of all missing variables in the model conditional on the observed ones. This is unnecessary. As can be seen from Table 1, all we need to identify the ATE is for attrition (R) to be independent of the latent outcome (Y) conditional on the treatment (Z) (Gerber and Green 2012, 219, fn 11). For example, suppose X were not observed in Figure 2e; then P(R | X, Z) ≠ P(R | Z), yet the ATE is identified.

Second, consider a slight modification of Figure 2f, where we insert a mediator M between Y and R, such that Y → M → R. In this case we have P(R | Z, M, Y; β) = P(R | M; β), yet the ATE is not non-parametrically identified because R is a virtual collider. At this point one could use inverse probability weights and multiple imputation, yet investigators ought to be aware that by invoking MAR they are also invoking a set of parametric assumptions not justified by the experimental design.

16 To be clear, what follows is not a general criticism of Little and Rubin's (1987) approach to missing data but of its “routine” application to experiments subject to attrition. Their approach remains incredibly useful and powerful in many other settings.

Third, MAR is a description of a symptom (e.g. R ⊥ Vmis | Vobs), but not of the underlying causal structure generating those symptoms. This can lead to ambiguity, since several causal structures can be compatible with the exact same symptoms. For example, (Y ⊥ R | X) may be true if (i) X is a mediator between Y and R; or (ii) X is a common cause of both Y and R; or (iii) X is a mediator between Y and S, S an effect of both Y and R that is being held constant. The point is that problematic attrition is a function of the underlying causal structure, yet this causal structure remains hidden by the probability language (see Pearl 2009b, §1.5, 7.4.4).

7.2 On the role of additional identification assumptions

Thus far I have defined and diagnosed problematic attrition strictly within the confines of experimental assumptions (see Definition 8). Even so, it is easy to re-state the results in Table 1 to include additional conditions for identification. For example, we could write one of the requirements as a conjunction, (Y ⊥ R | Z, S)G and (Y ⊥ Z | S)G, where S is some observed covariate, and the second requirement ensures that the conditioning strategy is harmless (e.g. that we are not conditioning on a collider between the outcome and the treatment). For example, in Figure 2d attrition is a descendant of both the treatment and the outcome, and in Figure 2c it is a descendant of the treatment and of an ancestor of the outcome other than the treatment.
Hence, the ATE will remain unidentified unless we can find an identification strategy that blocks, or circumvents, the spurious association between Z and Y.17 Such identification strategies may include measuring and conditioning on X; unpacking the link Z → Y in search of a front-door identification strategy (see Annex B); imputing the missing outcomes to avoid conditioning on R (e.g. Rubin 1976, 2004); doing so implicitly using inverse probability weights (e.g. Robins, Rotnitzky and Zhao 1995); or imputing bounds (e.g. Horowitz and Manski 2000).

All the aforementioned identification strategies have one thing in common: they invoke additional assumptions beyond the standard experimental assumptions of random assignment, excludability, and non-interference (Gerber and Green 2012, 45). Once again, causal diagrams remain ideally suited for communicating these additional assumptions. For example, Figure 3a might be a possible representation of a researcher's prior causal knowledge about the process generating the outcome and the attrition in a specific application. In particular, that knowledge assumes that M is the sole mediator between the treatment and the outcome, which would justify identification on the basis of the front-door criterion (see Annex B).18 However, because X remains unobserved, we can only estimate the ATE for units with observed outcomes, or r = 0. In Figure 3b the ATE is not identified on the basis of prior knowledge, as both W and R are virtual colliders.

17 A complete proof of problematic attrition which allows for non-parametric identification strategies (including conditioning on covariates and front-door conditioning) is easy to derive, and is available from the author. It simply complicates the presentation without adding new insights.

18 This knowledge also has testable implications. For example, it implies an effect of Z on M, and (M ⊥ R | Z)P. Indeed, causal diagrams can be used to infer, justify, and test identification strategies, including which covariates to control for, what assumptions are plausible, which are testable, and what variables are good instruments (Pearl 2009b).

Figure 3: Attrition mechanisms. (a) ATE identified (front-door criterion), not estimable; (b) ATE not identified, R and W virtual colliders. Z is the treatment, X a covariate, Y is the latent outcome observed by experimental units but not the researcher, and O is the publicly observed outcome. Attrition R is a binary variable indicating whether Y is visible to the researcher (R = 0) or not (R = 1), such that O = Y if R = 0 and O = missing otherwise. The Ui are independently distributed background causes. White nodes denote observed variables, and light gray nodes unobserved ones. Square nodes are variables held constant (e.g. r = 0). Directed edges (→) denote causation, and undirected edges (—) correlations induced by colliders.

In general researchers ought to consider potential attrition mechanisms before randomization, and write these down using causal diagrams. This knowledge can then be used to inform what measures should be taken, the design of the measurement protocol, diagnostic tests, and conditioning strategies for dealing with attrition. Such conditioning strategies will be all the more credible if the hypothesized causal diagram is included in a study protocol posted to a public registry prior to randomization.

7.3 On the distinction between identification and estimation

The notion of causal identification remains a subject of contention in the social sciences (see Pearl 2009b, §2, 3.2.4). Here I have distinguished between the conditions under which the ATE is identified, and the conditions under which it is estimable.
This distinction arises naturally from notions of d-separation and identification criteria like the back-door criterion. For example, in Figure 2a the ATE is identified conditional on R because (Z ⊥ Y | R)G, and so P(y | z, r = 0) only picks up the effect of Z on Y, unpolluted by other spurious associations. However, we cannot pick up all of the effect because P(y | z, r = 1) is not observed. And because ¬(Y ⊥ R | Z) we cannot be sure that the non-parametric estimate for the units with observed outcomes is exactly the same as the estimate from all units had all outcomes been observed. All we know is that the estimator is consistent for the effect on units with observed outcomes.

Identification and estimation are fundamentally distinct. Identification only requires qualitative knowledge of the process generating the data, as captured in a causal diagram. By contrast, estimation may condition on additional variables to reduce variance, or make additional parametric assumptions to model the data, and so on.

7.4 On causal diagrams, proofs, and truth tables

A causal model captures qualitative knowledge about causal relations between variables. In particular, a variable is an ancestor or descendant of another variable, or it isn't. This binary distinction in causal relations amongst variables simplifies non-parametric identification strategies.19 In particular, researchers can rely on truth tables to make causal queries (see Annex C for an example). Indeed, Theorems 3 and 4 were derived with the aid of a truth table. To help researchers make logical causal queries of their own, this study is accompanied by an R (R Core Team 2013) package for eliciting causal diagrams and performing sentential logic on causal queries.20

8 Conclusion

Attrition has been labelled the Achilles' Heel of the randomized experiment: it is very common, and it can completely unravel the benefits of randomization.
This study demonstrates how problematic attrition depends on the underlying causal structure, and for this reason it recommends that attrition always be discussed in terms of causes and effects. The language of probability is not up to the task, for the simple reason that causation implies correlation but not vice versa. As a result, researchers' discussions of attrition are often inherently ambiguous (see Appendix A). By contrast, the structural language of causal diagrams is ideally suited to discussing causal relations. To wit, it simplifies the discussion considerably by doing away with parametric and functional-form assumptions.

We say attrition is problematic whenever the ATE cannot be identified non-parametrically conditional on observed outcomes and the assigned treatment. Attrition is problematic if, and only if, it is a descendant of both the treatment and the outcome (or of some ancestor of the outcome other than the treatment). Ex post, after the intervention is closed to further data collection, problematic attrition can be partially diagnosed by formally testing whether treatment has a causal effect on attrition. Ex ante, two-stage sampling can be used at endline (and baseline) to further diagnose the nature of the attrition. More generally, it is recommended that researchers encode all their causal knowledge about attrition in a causal diagram prior to randomization, that they use this knowledge to inform the study design, and that they register it in a study protocol posted in a public registry. These results are summarized in a handy checklist researchers can use to improve the process of causal inference in experiments subject to attrition (see Annex D).

19 Obviously, if non-parametric identification is not possible, other parametric assumptions may be invoked (e.g. Heckman 1979). However, parametric identification strategies are typically not accorded as much credibility.

20 In progress.
This study has focused on defining and diagnosing problematic attrition. Much less attention has been paid to dealing with problematic attrition once it is diagnosed. Even so, the framework presented here provides the basic intellectual infrastructure upon which more complex identification strategies can be built.

References

Ashenfelter, Orley and Mark W. Plant. 1990. “Nonparametric Estimates of the Labor-Supply Effects of Negative Income Tax Programs.” Journal of Labor Economics 8(1):S396–S415.

Bareinboim, E. and Judea Pearl. 2012. Controlling selection bias in causal inference. In JMLR Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS).

Cook, T.D. and D.T. Campbell. 1979. Quasi-experimentation: Design & Analysis Issues for Field Settings. Rand McNally College.

Daniel, Rhian M, Michael G Kenward, Simon N Cousens and Bianca L De Stavola. 2012. “Using causal diagrams to guide analysis in missing data problems.” Statistical Methods in Medical Research 21(3):243–256.

Deming, W. Edwards. 1953. “On a Probability Mechanism to Attain an Economic Balance Between the Resultant Error of Response and the Bias of Nonresponse.” Journal of the American Statistical Association 48(264):743–772.

Duflo, Esther, Pascaline Dupas and Michael Kremer. 2011. “Peer Effects, Teacher Incentives, and the Impact of Tracking: Evidence from a Randomized Evaluation in Kenya.” The American Economic Review 101(5):1739–1774.

Duflo, Esther, Rachel Glennerster and Michael Kremer. 2007. Chapter 61: Using Randomization in Development Economics Research: A Toolkit. Vol. 4 of Handbook of Development Economics. Elsevier pp. 3895–3962.

Elwert, Felix and Christopher Winship. 2012. Endogenous Selection Bias. Technical report, University of Wisconsin–Madison.

Geneletti, Sara, Sylvia Richardson and Nicky Best. 2009. “Adjusting for selection bias in retrospective, case–control studies.” Biostatistics 10(1):17–31.

Gentleman, R., Elizabeth Whalen, W.
Huber and S. Falcon. 2012. graph: A package to handle graph data structures. R package version 1.34.0.

Gerber, A. and D.P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. W. W. Norton.

Gerber, Alan S., Dean Karlan and Daniel Bergan. 2009. “Does the Media Matter? A Field Experiment Measuring the Effect of Newspapers on Voting Behavior and Political Opinions.” American Economic Journal: Applied Economics 1(2):35–52.

Gerber, Alan S., Gregory A. Huber and Ebonya Washington. 2010. “Party Affiliation, Partisanship, and Political Beliefs: A Field Experiment.” American Political Science Review 104:720–744.

Greene, William H. 1981. “Sample Selection Bias as a Specification Error: A Comment.” Econometrica 49(3):795–798.

Greenland, Sander, Judea Pearl and James M. Robins. 1999. “Causal Diagrams for Epidemiologic Research.” Epidemiology 10(1):37–48.

Hansen, Morris H. and William N. Hurwitz. 1946. “The Problem of Non-Response in Sample Surveys.” Journal of the American Statistical Association 41(236):517–529.

Hausman, Jerry A. and David A. Wise. 1979. “Attrition Bias in Experimental and Panel Data: The Gary Income Maintenance Experiment.” Econometrica 47(2):455–473.

Heckman, James J. 1979. “Sample Selection Bias as a Specification Error.” Econometrica 47(1):153–161.

Heckman, James J. and Jeffrey A. Smith. 1995. “Assessing the Case for Social Experiments.” The Journal of Economic Perspectives 9(2):85–110.

Heckman, James J., Robert J. Lalonde and Jeffrey A. Smith. 1999. The economics and econometrics of active labor market programs. In Handbook of Labor Economics, ed. Orley C. Ashenfelter and David Card. Vol. 3, Part 1. Elsevier pp. 1865–2097.

Hernán, Miguel A., Sonia Hernández-Díaz and James M. Robins. 2004. “A Structural Approach to Selection Bias.” Epidemiology 15(5):615–625.

Horowitz, Joel L. and Charles F. Manski. 2000.
“Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data.” Journal of the American Statistical Association 95(449):77–84.

Imbens, G.W. and J.M. Wooldridge. 2009. “Recent Developments in the Econometrics of Program Evaluation.” Journal of Economic Literature 47(1):5–86.

Jurs, Stephen G. and Gene V. Glass. 1971. “The Effect of Experimental Mortality on the Internal and External Validity of the Randomized Comparative Experiment.” The Journal of Experimental Education 40(1):62–66.

Kalichman, Seth C, David Rompa, Marjorie Cage, Kari DiFonzo, Dolores Simpson, James Austin, Webster Luke, Jeff Buckles, Florence Kyomugisha, Eric Benotsch, Steven Pinkerton and Jeff Graham. 2001. “Effectiveness of an intervention to reduce HIV transmission risks in HIV-positive people.” American Journal of Preventive Medicine 21(2):84–92.

Krueger, Alan B. 1999. “Experimental Estimates of Education Production Functions.” The Quarterly Journal of Economics 114(2):497–532.

Little, Roderick JA and Donald B Rubin. 1987. Statistical Analysis with Missing Data. Vol. 539. Wiley, New York.

Martel García, Fernando. 2012. Scientific Progress in the Absence of New Data: A Procedural Replication of Ross (2006). Technical report. Available at SSRN: http://ssrn.com/abstract=2256286.

Morgan, Stephen L. and Christopher Winship. 2007. Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge Univ. Press.

Newhouse, Joseph P., Robert H. Brook, Naihua Duan, Emmett B. Keeler, Arleen Leibowitz, Willard G. Manning, M. Susan Marquis, Carl N. Morris, Charles E. Phelps and John E. Rolph. 2008. “Attrition in the RAND Health Insurance Experiment: A Response to Nyman.” Journal of Health Politics, Policy and Law 33(2):295–308.

Pearl, J. 1995. “Causal diagrams for empirical research.” Biometrika 82(4):669–688.

Pearl, J. 2009a. “Causal inference in statistics: An overview.” Statistics Surveys 3:96–146.

Pearl, Judea. 2009b.
Causality: Models, Reasoning, and Inference. Cambridge University Press.

Pearl, Judea. 2012. A solution to a class of selection-bias problems. Technical report.

Pearl, Judea. 2013. Linear Models: A Useful “Microscope” for Causal Analysis. Technical report.

R Core Team. 2013. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. URL: http://www.R-project.org/

Reynolds, Barbara E. 2004. The Algorists vs. the Abacists: An Ancient Controversy on the Use of Calculators. In Sherlock Holmes in Babylon: And Other Tales of Mathematical History, ed. Marlow Anderson, Victor Katz and Robin Wilson. MAA pp. 148–152.

Robins, James M., Andrea Rotnitzky and Lue Ping Zhao. 1995. “Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data.” Journal of the American Statistical Association 90(429):106–121.

Rubin, D.B. 2004. Multiple Imputation for Nonresponse in Surveys. Wiley Classics Library. John Wiley & Sons.

Rubin, Donald B. 1976. “Inference and missing data.” Biometrika 63(3):581–592.

Schafer, JL. 1997. Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC.

Shadish, William R. 2002. “Revisiting Field Experimentation: Field Notes for the Future.” Psychological Methods 7(1):3–18.

Shadish, W.R., X. Hu, R.R. Glaser, R. Kownacki and S. Wong. 1998. “A method for exploring the effects of attrition in randomized experiments with dichotomous outcomes.” Psychological Methods 3(1):3.

Stone, Williard E. 1972. “Abacists versus Algorists.” Journal of Accounting Research 10(2):345–350.

West, S.G., J.C. Biesanz and S.C. Pitts. 2000. Causal inference and generalization in field settings: Experimental and quasi-experimental designs. In Handbook of Research Methods in Social and Personality Psychology. Cambridge University Press, Cambridge, UK, chapter 3, pp. 40–84.

Wood, Angela M, Ian R White and Simon G Thompson. 2004.
“Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals.” Clinical Trials 1(4):368–376.

A Field researchers’ explanations of when attrition is a problem and diagnostic tests

Without pretending to be systematic, the following are common rationales for why attrition is a problem. They are quoted not to suggest they are inappropriate in the contexts in which they are stated, but to illustrate the variety of ways in which attrition is deemed to be a problem. This study tries to unify and simplify these criteria, disambiguate the definition of problematic attrition, and provide a diagnostic test consistent with the definition.

A.1 When or why attrition is a problem:

• “If the probability of attrition is correlated with experimental response, then traditional statistical techniques will lead to biased and inconsistent estimates of the experimental effect.” (Hausman and Wise 1979, 456)

• “While the attrition in these surveys has typically not been as severe as in social experiments, the same problems of potential bias arises, if attrition is not random” (Hausman and Wise 1979, 456)

• “If there is attrition based on unobserved variables that are correlated with the outcome measures but not predicted by the observables, our results may be biased.” (Gerber, Karlan and Bergan 2009, 43)

• “Sample attrition poses a problem for experimental evaluations when it is correlated with individual characteristics or with the impact of treatment conditional on characteristics. In practice, persons with poorer labor market characteristics tend to have higher attrition rates (see, e.g., Brown, 1979). Even if attrition affects both experimental and control groups in the same way, the experiment estimates the mean impact of the program only for those who remain in the sample. Usually, attrition rates are both non-random and larger for controls than for treatments.
In this case, the experimental estimate of training is biased because individuals’ experimental status, R, is correlated with their likelihood of being in the sample. In this setting, experimental evaluations become non-experimental evaluations because evaluators must make some assumption to deal with selection bias.” (Heckman, Lalonde and Smith 1999, x)

• “attrition may be a serious problem if it is not random” (Ashenfelter and Plant 1990, S402); “If it is not random, then the reliance on the assumption of random assignment is unjustified, and, in fact, nonparametric estimation is impossible. When there is attrition, some assumption must be made regarding the behavior of the families who left the experiment and how it differs from those remaining.” (Ashenfelter and Plant 1990, S408); “The fact of attrition necessitates some sort of “parametric” assumption about the sample.” (Ashenfelter and Plant 1990, S410)

• “In the JTPA evaluation, the parameter of interest was the mean impact of JTPA training on those receiving it. Given this parameter of interest, random assignment should be placed so as to minimize attrition from the program within the treatment group, because the experimental mean-difference estimate corresponds exactly to the impact of training on the trained only in the case where there is no attrition.” (Heckman and Smith 1995, 102); “In the presence of this level of attrition, the only way to obtain a credible estimate of the mean impact of the program is to model the attrition process using the very nonexperimental methods eschewed by proponents of randomized social experiments.” (Heckman and Smith 1995, 103)

• “At heart, adjusting for possible nonrandom attrition is a matter of imputing test scores for students who exited the sample. With longitudinal data, this can be done crudely by assigning the student’s most recent test percentile to that student in years when the student was absent from the sample.” (Krueger 1999, 515)
• “Random attrition will only reduce a study’s statistical power; however, attrition that is correlated with the treatment being evaluated may bias estimates. [. . . ] Even if attrition rates are similar in treatment and comparison groups, it remains possible that the attritors were selected differently in the treatment and comparison groups.” (Duflo, Glennerster and Kremer 2007, 3943)

• “Whenever substantial attrition occurs this is cause for major concern because there is a possibility that the attrition is nonrandom and may lead to bias.” (Gerber, Huber and Washington 2010, 727)

A.2 Diagnostics and tests for analyzing attrition

In terms of diagnostics for determining whether attrition is a problem:

• “The key question is whether the experimental treatments are changing the rate of attrition of [units in the experimental group]” (Ashenfelter and Plant 1990, S408)

• “There was no difference between tracking and nontracking schools in overall attrition rates. The characteristics of those who attrited are similar across groups, except that girls in tracking schools were less likely to attrit in the endline test [. . . ] The proportion of attritors and their characteristics do not differ between the two treatment arms” (Duflo, Dupas and Kremer 2011, 1752)

• “A first step in the analysis of an evaluation must always be to report attrition levels in the treatment and comparison groups and to compare attritors with non-attritors using baseline data (when available) to see if they differ systematically, at least along observable dimensions. If attrition remains a problem, statistical techniques are available to identify and adjust for the bias. These techniques can be parametric (see Hausman and Wise, 1979; Wooldridge, 2002 or Grasdal, 2001) or nonparametric. We will focus on non-parametric techniques here because parametric methods are more well known.
Moreover, nonparametric sample correction methods are interesting for randomized evaluation, because they do not require the functional form and distribution assumptions characteristic of parametric approaches. Important studies discussing non-parametric bounds include Manski (1989) and Lee (2002)” (Duflo, Glennerster and Kremer 2007, 3944)

• “To test for differential attrition across conditions, a 2 attrition (lost vs retained) × 2 condition (risk-reduction skills intervention vs comparison) contingency table, chi-square test was performed; 44 of 187 (24%) participants in the experimental condition and 33 of the 145 (23%) comparison-intervention participants were lost at follow-up, a nonsignificant difference. We also conducted attrition analyses for differences on baseline measures using 2 (attrition) × 2 (condition) analyses of variance, where: (1) an attrition effect signals differences between participants lost and retained, (2) an intervention effect indicates a breakdown in randomization, and (3) an attrition x condition interaction indicates differential loss between conditions. Results indicated a main effect for attrition on participant age: participants lost to follow-up were younger (M = 37.9) than those retained (M = 40.7). No other differences between lost and retained participants were significant, there were no significant differences between intervention conditions on baseline variables, and the attrition by intervention-group interactions were not significant.” (Kalichman et al. 2001, 87)

B Front- and Back-door criteria for identification

B.1 Back-door criterion

The back-door criterion tells us what variables need to be conditioned on to ensure a set of causes and its effects of interest are independently distributed under the null of no effect. This corresponds to the econometric notion of unconfoundedness (Imbens and Wooldridge 2009, 26).

Definition 10 (Back-door criterion, Pearl (2009b, 79)).
A set of variables S satisfies the back-door criterion relative to an ordered pair of variables (X_i, X_j) in a DAG if:

1. no node in S is a descendant of X_i; and

2. S d-separates (or blocks) every path between X_i and X_j that contains an arrow into X_i.

Similarly, if X and Y are two disjoint subsets of nodes in G, then S is said to satisfy the back-door criterion relative to (X, Y) if it satisfies the criterion relative to any pair (X_i, X_j) s.t. X_i ∈ X and X_j ∈ Y.

The back-door criterion essentially restricts d-separation to common causes of X and Y. For example, suppose X → M → Y and X ← U → Y. The set S = {M, U} d-separates X and Y, but only the set S′ = {U} meets the back-door criterion. Intuitively, conditioning on M is not useful if our goal is to identify the total (mediated) effect of X on Y, whereas conditioning on U blocks a confounder. Formally:

Theorem 5 (Back-door adjustment, Pearl (2009b, 79)). If a set of variables S satisfies the back-door criterion relative to (X, Y), then the causal effect of X on Y is identifiable and is given by the formula:

P(y | do(x)) = Σ_s P(y | x, s) P(s).

As noted in the previous section, once we are given P(y | do(x)), we can compute the ATE for any two distinct realizations x′ and x″ of X as E[Y | do(x′)] − E[Y | do(x″)] (Pearl 2009b, p 70).

B.2 Front-door criterion

The front-door criterion is another sufficient strategy for causal identification. Unlike the back-door criterion, which focuses on common causes of the cause and the effect, the front-door criterion focuses on the mechanism linking the cause to the effect, applying the back-door criterion at each step along the causal path (see Pearl (2009b); Morgan and Winship (2007)).

Definition 11 (Front-door criterion, Pearl (2009b, 82)). A set of variables Z is said to satisfy the front-door criterion relative to an ordered pair of variables (X, Y) if:

1. Z intercepts all directed paths from X to Y; and

2.
there is no unblocked back-door path from X to Z; and

3. all back-door paths from Z to Y are blocked by X.

Mechanisms that meet the front-door criterion are isolated (by properties 2 and 3) and exhaustive (by property 1; see Morgan and Winship (2007, §8) for details).

Theorem 6 (Front-door adjustment, Pearl (2009b, 83)). If Z satisfies the front-door criterion relative to (X, Y) and if P(x, z) > 0, then the causal effect of X on Y is identifiable and is given by the formula:

P(y | do(x)) = Σ_z P(z | x) Σ_{x′} P(y | x′, z) P(x′).

Both front- and back-door identification criteria are illustrated in Figure 1. In that figure variable W confounds the effect of X on Y: it induces an association between these two variables even under the null of no effect (e.g. X and Y remain associated (or d-connected) via the back-door path X ← W → Y even if we delete edge X → Z or Z → Y). Conditioning on W removes this association, as the set S = {W} meets the back-door criterion (Definition 10). A second identification approach is to compute the effect of X on Z, and of Z on Y for each level of X, as the set S = {Z} meets the front-door criterion (Definition 11). This is especially useful whenever W is not observed.

C Stability

Definition 12 (Stability, Pearl (2009b, 48)). Let I(P) denote the set of all conditional independence relationships embodied in P. A causal model M = 〈D, Θ_D〉 generates a stable distribution if and only if P(〈D, Θ_D〉) contains no extraneous independencies – that is, if and only if I(P(〈D, Θ_D〉)) ⊆ I(P(〈D, Θ′_D〉)) for any set of parameters Θ′_D.

When we parametrize a DAG structure D (denoted G in the notation we have been using) with parameters Θ we obtain a causal model M (e.g. trivially X → Y becomes a model, as in Y = α + βX + ε_y, ε_y ∼ N(0, σ), with parameters Θ = (α, β, σ)ᵀ). That model generates distributions P(〈D, Θ_D〉), and these are likely to vary as we change the parameters.
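To see how an unstable parametrization can arise, consider a minimal simulation in which Z affects R directly and also through Y∗, with the direct coefficient tuned to exactly offset the mediated one. All coefficient values here are hypothetical illustrations, not part of the paper or its planned R package:

```python
import random

random.seed(1)

# Hypothetical linear parametrization of the two paths discussed below:
#   Y* = b*Z + e_y            (first leg of the path Z -> Y* -> R)
#   R  = a*Z + c*Y* + e_r     (direct path Z -> R via a; mediated via b*c)
# Setting a = -b*c tunes the parameters so the two paths exactly cancel.
b, c = 2.0, 1.5
a = -b * c  # precise tuning of the kind stability (Definition 12) rules out

n = 100_000
z = [random.choice([0, 1]) for _ in range(n)]
y_star = [b * zi + random.gauss(0, 1) for zi in z]
r = [a * zi + c * yi + random.gauss(0, 1) for zi, yi in zip(z, y_star)]

def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

# Z and R look independent even though both causal paths are present,
# while Z and Y* remain clearly dependent.
print(f"cov(Z, R)  ≈ {cov(z, r):.3f}")
print(f"cov(Z, Y*) ≈ {cov(z, y_star):.3f}")
```

Perturbing a, b, or c even slightly destroys the apparent independence of Z and R, which is exactly why such cancellations are excluded by the stability assumption.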
A model is stable by Definition 12 if perturbing the parameters does not destroy any independencies. For example, consider any two pathways by which the treatment Z can affect attrition R: (i) Z → R; and (ii) Z → Y∗ → R. When two or more paths connect Z and R, a precise tuning of parameters could negate the effect via one path with a countervailing effect via another path. Observationally Z and R appear independent, but this apparent independence is not robust to small perturbations of the parameters. These implications are best illustrated by the truth table in Table 2.

P   Q   W   (¬P ∧ ¬Q) ∨ W   (Z ⊥ R|S)_G   (Z ⊥ R|S)_P
F   F   F   T               T             T
F   T   F   F               F             F
T   F   F   F               F             F
T   T   T   T               F             T

Table 2: Truth table depicting implications of failure to reject the null hypothesis that treatment Z does not cause attrition R using information from the observed distribution, where P = “Z → R”; Q = “Z → Y∗ → R”; and W = “P ∧ Q ∧ they exactly offset”. Assuming stability (Definition 12) rules out the last row of the table, where independence of Z and R in the statistical sense does not imply their independence in the underlying causal structure.

Stability is important because, by ruling out such precise tuning of parameters, and by Theorem 1, it follows that (Z ⊥ R)_P ⇔ (Z ⊥ R)_G. If this is true then failing to reject the null of (Z ⊥ R)_P implies failure to reject (Z ⊥ R)_G, which should focus the researcher’s attention on the first row of Table 1. If so, ATE_{R=0} is identified and estimable.21 Moreover, this works for any SADAG, so we do not have to make any more assumptions than those in Definitions 8 and 12. In the context of experiments these are relatively uncontroversial.

21 This is one instance where testing the null of exactly zero effect really makes sense: A small effect is not always evidence of a small problem if big effects are almost offsetting one another.
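The logic of Table 2 can be checked mechanically. The sketch below (a restatement of the table, not code from the paper) encodes each row as booleans and confirms that statistical independence without graphical independence occurs only in the offsetting row:

```python
# Rows of Table 2: P = "Z -> R"; Q = "Z -> Y* -> R";
# W = "P and Q and the two path effects exactly offset".
table = [
    (False, False, False),
    (False, True,  False),
    (True,  False, False),
    (True,  True,  True),
]

for p, q, w in table:
    indep_G = not (p or q)        # (Z ⊥ R)_G: no path from Z to R in the graph
    indep_P = not (p or q) or w   # (Z ⊥ R)_P: no path, or paths exactly offset
    # Independence in P without independence in G holds exactly when W
    # holds -- the row that the stability assumption rules out.
    assert (indep_P and not indep_G) == w

print("(Z ⊥ R)_P differs from (Z ⊥ R)_G only in the offsetting row")
```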
D Checklist

ATTCHECK: A procedural checklist for diagnosing problematic attrition in randomized controlled experiments

Procedures:

• Perform a review of the relevant literature, and consult with experts, to identify possible causes of both the outcome and attrition in your particular application
• Encode this expert knowledge using a causal diagram
• Use the causal diagram to determine what variables need to be measured
• Prepare for attrition by having a sampling plan for non-response follow-up at endline (and baseline) surveys, as needed
• Include a discussion of attrition and the related causal diagram in the study protocol
• Register the study protocol before recruiting units into the study, or before randomization at the latest
• Check for problematic attrition by testing the d-separation conditions (R ⊥ Z)_G and (Y ⊥ R|Z)_G
• Interpret your findings and assess whether attrition is problematic. If not, assess whether the ATE is estimable for all units, or only for those units with observed outcomes

Software technologies:

• Report any software used and its release version

(For details on procedural checklists see Martel García (2012))
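The ex-post check in the list above, testing (R ⊥ Z), can be sketched as a chi-square test of independence on the 2 × 2 attrition-by-treatment table. A minimal pure-Python illustration, using the counts reported in the Kalichman et al. (2001) quote in Appendix A (44 of 187 intervention and 33 of 145 comparison participants lost):

```python
# Hypothetical 2 x 2 table built from the counts quoted in Appendix A:
# rows = treatment Z (comparison, intervention),
# columns = attrition R (retained, lost).
counts = [[112, 33],   # Z = 0: 33 of 145 lost at follow-up
          [143, 44]]   # Z = 1: 44 of 187 lost at follow-up

def chi2_2x2(t):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = t
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

stat = chi2_2x2(counts)
# Compare against the 5% critical value of chi-square with 1 df (~3.841).
print(f"chi2 = {stat:.3f}; reject (R ⊥ Z) at 5%? {stat > 3.841}")
```

With these counts the statistic is about 0.03, far below 3.841, consistent with the “nonsignificant difference” the quoted study reports; in practice one would use a statistics package (e.g. R’s chisq.test) rather than this hand-rolled formula.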