What Can Be Learned from a Simple Table? Bayesian Inference and Sensitivity Analysis for Causal Effects from 2 × 2 and 2 × 2 × K Tables in the Presence of Unmeasured Confounding∗

Kevin M. Quinn†

This Version: September 6, 2008. Original Version: January 11, 2008.

Abstract

What, if anything, should one infer about the causal effect of a binary treatment on a binary outcome from a 2 × 2 cross-tabulation of non-experimental data? Many researchers would answer "nothing" because of the likelihood of severe bias due to the lack of adjustment for key confounding variables. This paper shows that such a conclusion is unduly pessimistic. Because the complete data likelihood under arbitrary patterns of confounding factorizes in a particularly convenient way, it is possible to parameterize this general situation with four easily interpretable parameters. Subjective beliefs regarding these parameters are easily elicited and subjective statements of uncertainty become possible. This paper also develops a novel graphical display called the confounding plot that quickly and efficiently communicates all patterns of confounding that would leave a particular causal inference relatively unchanged.

∗An earlier version of this paper was awarded the 2008 Gosnell Prize for Excellence in Political Methodology. I thank Adam Glynn, Dan Ho, Gary King, and seminar participants at Penn State University for helpful comments and conversations. I gratefully acknowledge support from the U.S. National Science Foundation (grant BCS 05-27513) and the Institute for Quantitative Social Science at Harvard University. Software to perform the analyses described in this paper is available at http://cran.r-project.org/web/packages/SimpleTable/index.html.

†Associate Professor, Department of Government and The Institute for Quantitative Social Science, Harvard University, 1737 Cambridge Street, Cambridge, MA 02138. kevin [email protected]

1 Introduction

What, if anything, should one infer about the causal effect of a binary treatment X on a binary outcome Y from a 2 × 2 cross-tabulation of non-experimental data? As an example, consider Table 1, which presents data that might be relevant to test the so-called "jury aversion" hypothesis (Oliver and Wolfinger, 1999). This table presents data on citizens' perceived source of jury lists (X = 0: sources of jury lists do not include voter lists; X = 1: sources of jury lists do include voter lists) and voter registration (Y = 0: citizen did not register to vote; Y = 1: citizen did register to vote) from a study by Oliver and Wolfinger (1999).

                                              Y = 0: Did Not      Y = 1: Did
                                              Register to Vote    Register to Vote
  X = 0: Perceived Source of Jury Lists
         Does Not Include Voter Lists                19                 143
  X = 1: Perceived Source of Jury Lists
         Does Include Voter Lists                   114                 473

Table 1: Perceived Source of Jury Lists and Voter Registration Among Citizens With Some Self Reported Knowledge of How Jury Lists are Constructed. Each entry is the number of citizens in that category. Data from Table 2 of Oliver and Wolfinger (1999). An additional 639 citizens (46% of total) responded that they did not know the source of jury lists.
One might be tempted to infer that, since the estimated probability of registering to vote given X = 0 equals 143/(19 + 143) ≈ 0.88 and the estimated probability of registering to vote given X = 1 equals 473/(114 + 473) ≈ 0.81, perceiving voter lists to be a source of jury lists decreases the probability of registering to vote (among those with some knowledge of how jury lists are formed) by about 7 percentage points.

Is this a reasonable inference? Most scholars would argue that it is not. Since it seems likely that many variables that are common causes of both X and Y—such as occupation, education, civic involvement, etc.—are omitted,1 the naive estimate of a 7 percentage point effect is likely biased. Many researchers would go on to say that not much of anything could be said about causality from this table because of this bias.

While there are certainly aspects of truth to the position that unmeasured confounding invalidates many attempts at causal inference, it turns out that it is possible to learn some things about the size of the causal effects of interest regardless of the nature and degree of the unmeasured confounding. For instance, Manski (1990) derived bounds for the average treatment effect under very general assumptions and showed that, in the case of binary treatment and outcome, the width of these bounds is 1.2 Since the average treatment effect can take values between -1 and 1, Manski's bounding interval always includes the null value of no effect. Manski (2003) has also shown how auxiliary assumptions can narrow this bounding interval. While this is clearly important work in that it provides a very precise statement of what the observed data alone can tell one about the causal effects of interest, it does not go quite as far as many social scientists would like on two counts. First, it provides absolutely no guidance as to where within the bounding interval the causal effect of interest is likely to be. Some may believe that all values within the interval are equally likely but, as we will see later, such a belief will often be inconsistent with reasonable beliefs about the nature of the unmeasured confounding. Second, the standard approach pursued in Manski (1990) is a large sample approach that does not account for sampling variability.3 This makes the interpretation of bounds derived for sparse tables somewhat unclear.

This paper addresses these issues by doing the following. It reduces the case of arbitrary unmeasured confounding with binary treatment and binary outcome to its most basic, yet still general, form. It does this in a way that provides a readily interpretable parameterization of the key quantities. It then shows how a Bayesian prior distribution can be placed over the four free parameters that govern the type and extent of the unmeasured confounding. If one is willing and able to use background knowledge to make some (possibly weak) assumptions about the nature of the unmeasured confounding, sharp posterior estimates of causal effects are easy to calculate.

1 Such omitted variables are often referred to as confounders. When it is not possible to measure all such confounders some say that unmeasured confounding is present. Rigorous definitions of these concepts are somewhat involved. Interested readers should consult Pearl (2000).
2 For related work see Robins (1989); Manski (1993, 2003); Imai and Yamamoto (2008) and Balke and Pearl (1997).
3 But see Imai and Soneji (2007) for an approach based on the bootstrap that does account for sampling variability.
Since these assumptions are formalized within the Bayesian framework, subjective uncertainty about causal effects is calculated in a logically coherent manner. The end result is a procedure that allows researchers to make probability statements about the likely size of causal effects based on the evidence in a 2 × 2 (or 2 × 2 × K) table regardless of the sample size and the amount of unmeasured confounding. It thus directly addresses the shortcomings of the traditional bounding analysis.

The reasons for developing this approach to analyzing tabular data are three-fold. First, 2 × 2 tables are widely used (particularly in older work) and it would be useful to know how a rational person should interpret the information in these tables. Second, once the simple 2 × 2 case is understood it is easy to extend the main ideas in this paper to more complicated settings with binary treatment and outcome and measured categorical confounders. Interestingly, while conditioning on a measured confounder may make prior elicitation easier, it does not change the large-sample nonparametric bounds on the average treatment effect. Thus the benefits of adjusting for multiple measured confounders when substantial unmeasured confounding is believed to be present may be fewer than many researchers seem to believe. Finally, to the extent that many political scientists are primarily interested in the sign of hypothesized causal effects, a well-designed (Bayesian) sensitivity analysis may provide them with stronger evidence for or against their hypothesis than the results from a small number of regression or matching analyses. Indeed, this was the motivation for what is commonly believed to be the first sensitivity analysis (Cornfield et al., 1959).4

The approach presented in this paper is related to, but distinct from, many sensitivity analysis methods that can be used to see how particular departures from the experimental ideal impact point estimates of causal effects. Typically, such approaches attempt to directly model the confounder-outcome association and the confounder-treatment association (Schlesselman, 1978; Rosenbaum and Rubin, 1983a; Lin et al., 1998). Such approaches require the practitioner to make assumptions about (a) the nature of the confounding variable (whether it is univariate/multivariate, discrete/continuous, etc.) as well as (b) the association between this assumed confounder and outcomes and treatment status. Needless to say, it is very difficult for most practitioners to think coherently about the nature of the unmeasured confounder(s) and the resulting (conditional) associations. Other approaches reduce the sensitivity analysis to an examination of changes in a scalar parameter that can be varied by the researcher (Rosenbaum, 1987, 2002a,b). While this makes it much easier to move forward with a sensitivity analysis, the interpretation of this scalar parameter is not always as transparent as it might seem (Robins, 2002). The approach developed in this paper is at least as general as (and sometimes more general than) the approaches described above, and it only requires researchers to specify their beliefs over easily interpreted quantities.

The work that is most similar to this paper is Chickering and Pearl (1997). That work relies on similar ideas to develop a Bayesian approach to the related problem of analyzing experimental data with non-compliance.

4 See Lin et al. (1998), p. 948.
A major difference between the two approaches is that the data of interest for this paper admit a much simpler factorization that makes prior elicitation and inference much more straightforward. A much smaller set of free parameters need be considered here and Markov chain Monte Carlo methods are not necessary for inference. This makes it possible to write easy-to-use open source software for prior elicitation and inference.5 Thus, the methods in the current paper are much more likely to be successfully used by empirical social scientists. Relatedly, the structure of the current problem allows for a novel graphical display called the confounding plot that (a) is quick and easy to compute, and (b) shows the entire range of possible types of confounding that would not appreciably change a given causal inference. This simple graph is so easy to generate and so informative that there is no reason it should not be a part of every analysis that attempts to make causal inferences from a 2 × 2 table.

The paper proceeds as follows. In Section 2 we introduce the necessary terminology and notation and then show how posterior inferences can be constructed from a 2 × 2 table with general unmeasured confounding. Section 3 shows how causal quantities of interest such as the average treatment effect and the relative risk can be written in terms of the model parameters from Section 2. Large sample nonparametric bounds on these causal quantities are also derived in this section. These bounds coincide with those of Manski (1990) although the derivation is slightly different. Section 4 discusses the choice of prior distribution for the model parameters and then describes a simple posterior sampling algorithm that does not require Markov chain Monte Carlo. Section 5 describes the construction and interpretation of the novel confounding plot discussed above. In Section 6 we revisit the example data in Table 1. Here we see how defensible prior beliefs can be operationalized in a prior distribution over the model parameters and what this implies for inferences about a possible jury aversion effect. The final section concludes.

2 Terminology, Notation, and the Probability Model

Consider the situation in which one is interested in assessing the causal effect of a binary treatment variable on a binary outcome. We let Xi ∈ {0 (control), 1 (treatment)} denote the treatment status of unit i, Yi ∈ {0 (failure), 1 (success)} denote the observed outcome from unit i, and i = 1, . . . , n index individual units. Throughout the rest of the paper we will assume that (Xi, Yi) for i = 1, . . . , n are independent replicates drawn from some joint distribution PXY. We use the notation xi and yi to denote individual realizations of Xi and Yi and x and y to denote the n-vectors of these realized values.

We adopt a counterfactual causal model that is consistent with the work of Rubin (1974, 1978); Robins (1986) and Pearl (1995, 2000) (see also Holland (1986) for a general review and Morgan and Winship (2007) for an introductory treatment from a social science perspective). We adopt the notation Yi(Xi = 0) and Yi(Xi = 1) to denote the value of unit i's outcome variable if Xi is set equal to 0 and 1 respectively by an outside intervention that leaves all pre-intervention variables unchanged.

5 This software takes the form of an R (R Development Core Team, 2007) package named SimpleTable. This package is available at http://cran.r-project.org/web/packages/SimpleTable/index.html.
Unit-level causal effects of changing X from 0 to 1 on Y in unit i are defined in terms of comparisons of Yi(Xi = 0) and Yi(Xi = 1). Since unit i cannot simultaneously receive both treatment and control, one of Yi(Xi = 0) and Yi(Xi = 1) is a counterfactual quantity. Throughout the rest of this paper we will refer to Yi(Xi = 0) and Yi(Xi = 1) as potential outcomes or counterfactual outcomes.

Rather than focusing on unit-level causal effects, we will concern ourselves with aggregate effects within some collection of units. Here, interest centers on the post-intervention distribution Pr(Y(X = x) = y) for x = 0, 1 and y = 0, 1. The post-intervention probability Pr(Y(X = x) = y) gives the probability that the outcome variable from a randomly chosen unit will take value y when the unit in question is assigned X = x. Causal quantities of interest such as the average treatment effect

$$ATE = \Pr(Y(X=1)=1) - \Pr(Y(X=0)=1)$$

and the relative risk

$$RR = \frac{\Pr(Y(X=1)=1)}{\Pr(Y(X=0)=1)}$$

can be defined in terms of the post-intervention distribution. Note that the post-intervention distribution depends on counterfactual outcomes and is generally not equivalent to Pr(Y = y|X = x). This conditional distribution can be calculated directly from PXY. This latter distribution is sometimes called the pre-intervention distribution.

While the pre-intervention distribution PXY can be consistently estimated without maintaining untestable assumptions, the post-intervention distribution can only be estimated consistently if untestable causal assumptions are maintained (Rubin, 1978; Holland, 1986; Robins, 1986; Pearl, 1995, 2000). Specifically, if one is willing to assume that treatment assignment is strongly ignorable6 and what Rubin (1980, 1986) has called the stable unit treatment value assumption (SUTVA)7 holds, then the post-intervention probabilities are given by the pre-intervention conditional probabilities of Y given X:

$$\Pr(Y(X=x)=y) = \Pr(Y=y \mid X=x). \qquad (1)$$

Estimates of Pr(Y = y|X = x) and hence Pr(Y(X = x) = y) can be constructed in the usual ways from the observed values of X and Y. In a well-run randomized controlled experiment, strong ignorability of treatment assignment and SUTVA are likely to hold because of the design of the experiment. However, in situations other than the ideal experiment the assumptions necessary for the equality in Equation 1 to hold are unlikely to be satisfied. Thus, causal inferences made directly from the observed cell frequencies in 2 × 2 tables of non-experimental data are unlikely to be accurate.

6 Formally, strong ignorability of treatment assignment means that [Y(X = 0), Y(X = 1)] ⊥⊥ X. In words, the (counterfactual) joint distribution of potential outcomes is independent of treatment. Somewhat more intuitively, strong ignorability of treatment assignment means that a unit's potential outcomes do not depend on whether the unit is assigned to treatment or control. Note that this does not imply that the observed outcome Y is independent of treatment X.
7 SUTVA has two parts. To quote Rubin (1986): "SUTVA is simply the a priori assumption that the value of Y for unit u when exposed to treatment t will be the same no matter what mechanism is used to assign treatment t to unit u and no matter what treatments the other units receive, and this holds for [all units and all treatments]" (p. 961).
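To make Equation 1 concrete, here is a minimal R sketch (R is the language of the paper's companion SimpleTable package) of the naive estimate obtained from the Table 1 counts when strong ignorability and SUTVA are simply assumed to hold:

```r
## Counts from Table 1 (Oliver and Wolfinger, 1999)
C00 <- 19    # X = 0, Y = 0
C01 <- 143   # X = 0, Y = 1
C10 <- 114   # X = 1, Y = 0
C11 <- 473   # X = 1, Y = 1

## Under Equation 1, Pr(Y(X = x) = 1) is estimated by Pr(Y = 1 | X = x)
p1_x0 <- C01 / (C00 + C01)   # about 0.88
p1_x1 <- C11 / (C10 + C11)   # about 0.81

p1_x1 - p1_x0                # the naive effect estimate, about -0.077
```

As the introduction argued, this is exactly the calculation whose causal interpretation is suspect when the ignorability assumption fails.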
However, in most applications it is generally believed that some (potentially unobservable) collection of variables U exists such that treatment assignment is conditionally ignorable given U.8 If one is also willing to assume that SUTVA holds, one can write the post-intervention distribution as:

$$\Pr(Y(X=x)=y) = \int \Pr(Y=y \mid X=x, U=u)\, dP_U \qquad (2)$$

where PU is the distribution of U (Rosenbaum and Rubin, 1983b; Pearl, 1995). Note that U may well be multivariate, continuous, and may well be unobservable. It is common to say that U contains a set of confounding variables. Similarly, one often says that confounding is present when one needs to adjust for U as in Equation 2. Heuristically, one can think of U as containing variables that are common causes of X and Y.

If one is willing to assume that U is rich enough so that the potential outcomes are a deterministic function of xi and ui, a tremendous simplification becomes possible.9 While U may be extremely complicated, the binary nature of treatment and outcome implies that the domain of U can be partitioned into four equivalence classes depending on the pattern of potential outcomes associated with each point in the domain of U (Angrist et al., 1996; Balke and Pearl, 1997; Chickering and Pearl, 1997). We introduce a new categorical variable Zi that labels these equivalence classes. The values of Zi along with the associated patterns of potential outcomes are presented in Table 2.

  Yi(Xi = 0)   Yi(Xi = 1)   Zi   Label
      0            0         0   Never Succeed
      0            1         1   Helped
      1            0         2   Hurt
      1            1         3   Always Succeed

Table 2: Possible Patterns of Potential Outcomes and Coarsest General Confounding Variable. A unit i for which Zi = 0 has a value of Ui that causes it to always have Yi = 0 regardless of the (counterfactual) value of Xi. We say these units are "never succeeders". If Zi = 1 we say that unit i is "helped" by treatment because its potential outcome under X = 1 is equal to 1 (success) while its potential outcome under X = 0 is 0 (failure). If unit i has Zi = 2 we say that i is "hurt" by treatment because its potential outcome under X = 1 is equal to 0 (failure) while its potential outcome under X = 0 is 1 (success). Finally, if Zi = 3 we say that i is an "always succeeder" because its value of Ui is such that Yi will always equal 1 regardless of the (counterfactual) value of Xi.

Since Z is defined in terms of the value of the potential outcome pairs it is clear that conditional ignorability holds given Z. We assume that (Xi, Yi, Zi) for all i are independent replicates from some joint distribution PXYZ. If PXYZ is known, one could write the post-intervention distribution as

$$\Pr(Y(X=x)=y) = \sum_{z=0}^{3} \Pr(Y=y \mid X=x, Z=z)\, \Pr(Z=z)$$

where the probabilities on the right-hand side of the equation above can be calculated directly from PXYZ. Similarly, if a sample was available from PXYZ, one could use the sample values of (x, y, z) to consistently estimate PXYZ and to then use this estimate to construct an estimate of the post-intervention distribution.

8 Formally, [Y(X = 0), Y(X = 1)] ⊥⊥ X | U. Here strong ignorability of treatment assignment holds within each level of U.
9 Such an assumption is actually not a strong assumption since, as we will see below, u need not be observed for the main results that follow. Further, most counterfactual causal models (see Rubin (1974, 1978); Pearl (2000)) explicitly assume deterministic potential outcomes at a conceptual level.
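Because Pr(Y = y | X = x, Z = z) is 0 or 1 under the definitions in Table 2, the sum above reduces to sums of marginal Z probabilities. A small R sketch with purely hypothetical values for Pr(Z = z):

```r
## Hypothetical marginal distribution of Z; these numbers are illustrative, not estimates.
## Order: Z = 0 (never succeed), 1 (helped), 2 (hurt), 3 (always succeed)
pZ <- c(0.20, 0.10, 0.05, 0.65)

## From Table 2: Y(X = 0) = 1 iff Z is 2 or 3; Y(X = 1) = 1 iff Z is 1 or 3
p_y1_x0 <- pZ[3] + pZ[4]   # Pr(Y(X = 0) = 1)
p_y1_x1 <- pZ[2] + pZ[4]   # Pr(Y(X = 1) = 1)

p_y1_x1 - p_y1_x0          # ATE = Pr(helped) - Pr(hurt) = 0.05 here
```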
Needless to say, it is extremely unlikely that applied researchers in the social sciences would find themselves in the situation where Zi is observable for any i. Without data on Z it is impossible to consistently estimate PXYZ. Nevertheless, there is some information about Z in observed (X, Y) data sampled from PXYZ. The goal of this paper is to show how this information can be combined with subjective background knowledge to yield causal inferences from 2 × 2 and 2 × 2 × K tables even when the confounding variables in U are not measured.

In what follows, we begin by specifying a probability model for the very unlikely situation where (xi, yi, zi) are observed for all i. While this situation will not occur in practice, it does serve as a starting point to deal with the realistic situation of z being completely unobserved. We deal with this more realistic situation in Section 2.2. Once again, the goal of this paper is to demonstrate how the situation in which z is completely unobserved can still be analyzed so that accurate statements of subjective uncertainty regarding causal effects can still be made.

2.1 z Completely Observed

The goal here is to make inferences about the form of PXYZ and to then treat causal quantities as functionals of this distribution. We adopt a Bayesian approach. While Bayesian procedures have a number of advantages (Bernardo and Smith, 1994; Gelman et al., 2003; Gill, 2007), the main reason for taking a Bayesian approach in this paper is that it allows us to incorporate background knowledge about the (potentially unobserved) confounder Z in a principled fashion (Kadane and Wolfson, 1998; Western and Jackman, 1994; Gill and Walker, 2005). We begin by discussing the likelihood function and then discuss our choice of prior distribution along with the resulting posterior distribution.

2.1.1 Likelihood

Assuming that (Xi, Yi, Zi) for all i are independent replicates from some joint distribution PXYZ we can write the likelihood as:

$$
\begin{aligned}
p(x, y, z \mid \theta, \psi) &= \prod_{i=1}^{n} p(x_i, y_i, z_i \mid \theta, \psi) \\
&= \prod_{i=1}^{n} p(x_i, y_i \mid \theta)\, p(z_i \mid x_i, y_i, \psi) \\
&= \prod_{i=1}^{n} \theta_{00}^{I(x_i=0,\,y_i=0)} \theta_{01}^{I(x_i=0,\,y_i=1)} \theta_{10}^{I(x_i=1,\,y_i=0)} \theta_{11}^{I(x_i=1,\,y_i=1)} \times \\
&\qquad \psi_{00}^{I(x_i=0,\,y_i=0,\,z_i=1)} (1-\psi_{00})^{I(x_i=0,\,y_i=0,\,z_i=0)} \times \\
&\qquad \psi_{01}^{I(x_i=0,\,y_i=1,\,z_i=3)} (1-\psi_{01})^{I(x_i=0,\,y_i=1,\,z_i=2)} \times \\
&\qquad \psi_{10}^{I(x_i=1,\,y_i=0,\,z_i=2)} (1-\psi_{10})^{I(x_i=1,\,y_i=0,\,z_i=0)} \times \\
&\qquad \psi_{11}^{I(x_i=1,\,y_i=1,\,z_i=3)} (1-\psi_{11})^{I(x_i=1,\,y_i=1,\,z_i=1)} \\
&= \theta_{00}^{C_{00+}} \theta_{01}^{C_{01+}} \theta_{10}^{C_{10+}} \theta_{11}^{C_{11+}}\, \psi_{00}^{C_{001}} (1-\psi_{00})^{C_{000}}\, \psi_{01}^{C_{013}} (1-\psi_{01})^{C_{012}} \times \\
&\qquad \psi_{10}^{C_{102}} (1-\psi_{10})^{C_{100}}\, \psi_{11}^{C_{113}} (1-\psi_{11})^{C_{111}} \qquad (3)
\end{aligned}
$$

where I(·) is the indicator function, $C_{xyz} = \sum_{i=1}^{n} I(x_i = x, y_i = y, z_i = z)$, $C_{xy+} = \sum_{i=1}^{n} I(x_i = x, y_i = y)$, θ00, θ01, θ10, θ11 ≥ 0, θ00 + θ01 + θ10 + θ11 = 1, and ψ00, ψ01, ψ10, ψ11 ∈ [0, 1]. Note that Cxyz is the number of cases in cell (x, y, z) of the 2 × 2 × 4 contingency table for (X, Y, Z) and that Cxy+ is the number of cases in cell (x, y) of the 2 × 2 marginal table for (X, Y).

We emphasize that, after assuming that X and Y are binary and (Xi, Yi, Zi) are independent replicates from PXYZ, we have made no assumptions about the form of PXYZ. Any logically consistent distribution PXYZ and any form of confounding among independent replicates can be captured by particular values of θ and ψ. This probability model for (X, Y, Z) is thus completely general and nonparametric.
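A sketch of the complete-data log-likelihood in Equation 3, written in terms of the counts Cxyz; the function and argument names are ours, not part of the SimpleTable package:

```r
## Complete-data log-likelihood from Equation 3.
## Cxyz is a named count vector whose names paste together (x, y, z), e.g.
## Cxyz["001"] is the count for x = 0, y = 0, z = 1; only the 8 logically
## possible cells appear (see Table 2). theta and psi are length-4 vectors
## ordered (00, 01, 10, 11).
loglik_complete <- function(Cxyz, theta, psi) {
  ## Marginal (X, Y) counts: C_{xy+} in the paper's notation
  Cxy <- c(sum(Cxyz[c("000", "001")]), sum(Cxyz[c("012", "013")]),
           sum(Cxyz[c("100", "102")]), sum(Cxyz[c("111", "113")]))
  sum(Cxy * log(theta)) +                                        # theta term
    Cxyz["001"] * log(psi[1]) + Cxyz["000"] * log(1 - psi[1]) +  # psi00
    Cxyz["013"] * log(psi[2]) + Cxyz["012"] * log(1 - psi[2]) +  # psi01
    Cxyz["102"] * log(psi[3]) + Cxyz["100"] * log(1 - psi[3]) +  # psi10
    Cxyz["113"] * log(psi[4]) + Cxyz["111"] * log(1 - psi[4])    # psi11
}
```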
While this model for (X, Y, Z) might seem to contain a large number of parameters that are difficult to interpret, it is actually the case that there are a relatively small number of free parameters that are easily interpretable in terms of conditional and marginal probabilities. Table 3 provides a summary of the key parameters and their intuitive meanings.

  Parameter   Probability                     Interpretation
  θxy         Pr(Xi = x, Yi = y)              Probability Xi is equal to x and Yi is equal to y
  ψ00         Pr(Zi = 1 | Xi = 0, Yi = 0)     Probability i would be helped by treatment given i not treated and i failed
  1 − ψ00     Pr(Zi = 0 | Xi = 0, Yi = 0)     Probability i would never succeed given i not treated and i failed
  ψ01         Pr(Zi = 3 | Xi = 0, Yi = 1)     Probability i would always succeed given i not treated and i succeeded
  1 − ψ01     Pr(Zi = 2 | Xi = 0, Yi = 1)     Probability i would be hurt by treatment given i not treated and i succeeded
  ψ10         Pr(Zi = 2 | Xi = 1, Yi = 0)     Probability i was hurt by treatment given i treated and i failed
  1 − ψ10     Pr(Zi = 0 | Xi = 1, Yi = 0)     Probability i would never succeed given i treated and i failed
  ψ11         Pr(Zi = 3 | Xi = 1, Yi = 1)     Probability i would always succeed given i treated and i succeeded
  1 − ψ11     Pr(Zi = 1 | Xi = 1, Yi = 1)     Probability i was helped by treatment given i treated and i succeeded

Table 3: Interpretation of Parameters in the Model for (X, Y, Z). The i indices denote a randomly selected unit.

Here we see that there are two key sets of parameters, θxy and ψxy for x = 0, 1 and y = 0, 1. The θ parameters govern the multinomial distribution of (X, Y) after Z has been marginalized out of PXYZ. For instance, θ01 is the probability that X = 0 and Y = 1 averaged over all 4 levels of Z. The ψ parameters govern the conditional distribution of Z given X and Y. Note that because of the definition of Z (see Table 2) only 2 values of Z are logically possible given any admissible (X, Y) pair. For instance, if X = 0 and Y = 1 then Z must be equal to either 2 or 3. The distribution of Zi given Xi = x and Yi = y is thus Bernoulli with parameter ψxy. Keeping with our example, ψ01 gives the probability that Zi = 3 given Xi = 0 and Yi = 1 while (1 − ψ01) gives the probability that Zi = 2 given Xi = 0 and Yi = 1. The other conditional distributions for Z given X = x and Y = y are similarly parameterized.

2.1.2 Prior and Posterior Distributions

Bayesian inference centers on the posterior distribution of θ and ψ given the observed data. The posterior distribution is given (up to proportionality) by:

$$p(\theta, \psi \mid y, x, z) \propto p(y, x, z \mid \theta, \psi)\, p(\theta, \psi)$$

The likelihood p(y, x, z|θ, ψ) was given in the previous section. What is still needed is a specification of the prior distribution p(θ, ψ). A natural choice for the joint prior distribution of θ and ψ is to assume that θ, ψ00, ψ01, ψ10, and ψ11 are mutually independent a priori and that

$$\theta \sim \mathrm{Dirichlet}(a_{00}, a_{01}, a_{10}, a_{11}), \qquad \psi_{xy} \sim \mathrm{Beta}(b_{xy}, c_{xy}),$$

for x = 0, 1 and y = 0, 1. This is the conjugate prior distribution for this model. As we will see below, this prior specification will allow us to think of the hyper-parameters axy, bxy, and cxy for x = 0, 1 and y = 0, 1 as additional "pseudo-observations". This makes the prior distributions more easily interpretable, which is quite important for the current application where inferences will typically be quite dependent on the prior.
Using the prior specification discussed above, we can write the posterior density (up to proportionality) as:

$$
\begin{aligned}
p(\theta, \psi \mid x, y, z) \propto\ & \theta_{00}^{C_{00+}+a_{00}-1}\, \theta_{01}^{C_{01+}+a_{01}-1}\, \theta_{10}^{C_{10+}+a_{10}-1}\, \theta_{11}^{C_{11+}+a_{11}-1} \times \\
& \psi_{00}^{C_{001}+b_{00}-1} (1-\psi_{00})^{C_{000}+c_{00}-1}\, \psi_{01}^{C_{013}+b_{01}-1} (1-\psi_{01})^{C_{012}+c_{01}-1} \times \\
& \psi_{10}^{C_{102}+b_{10}-1} (1-\psi_{10})^{C_{100}+c_{10}-1}\, \psi_{11}^{C_{113}+b_{11}-1} (1-\psi_{11})^{C_{111}+c_{11}-1} \qquad (4)
\end{aligned}
$$

This is just the product of a Dirichlet distribution and four beta distributions. Note the similarity of the posterior in Equation 4 to the likelihood in Equation 3. Specifically, note that our prior for θ is serving to add an additional (a00 − 1) (X = 0, Y = 0) pairs, (a01 − 1) (X = 0, Y = 1) pairs, etc. to the dataset. Similarly, our prior for ψ is adding an additional (b00 − 1) (X = 0, Y = 0, Z = 1) triples, (c00 − 1) (X = 0, Y = 0, Z = 0) triples, etc. to the dataset.

2.2 z Completely Unobserved

We now analyze the case in which the confounder Z is never observed in the sample data. Given the definition of Z, we expect that all applications will fall into this category. While inference in this situation relies critically on untestable assumptions about the relationship between Z and (X, Y), there is some information in the observed Cxy+ counts as to the distribution of Z in the population. Further, in many cases, there will be enough scientific background knowledge such that ψ can be given a defensible subjective prior distribution.

2.2.1 Likelihood

Let 𝒵i denote the set of possible values Zi could take given the observed data on unit i. More formally,

$$
\mathcal{Z}_i =
\begin{cases}
\{0, 1\} & \text{if } x_i = 0,\ y_i = 0 \\
\{2, 3\} & \text{if } x_i = 0,\ y_i = 1 \\
\{0, 2\} & \text{if } x_i = 1,\ y_i = 0 \\
\{1, 3\} & \text{if } x_i = 1,\ y_i = 1
\end{cases}
$$

We can then write the likelihood as:

$$
\begin{aligned}
p(x, y \mid \theta, \psi) &= \prod_{i=1}^{n} \sum_{z_i \in \mathcal{Z}_i} p(x_i, y_i, z_i \mid \theta, \psi) \\
&= \prod_{i=1}^{n} p(x_i, y_i \mid \theta) \sum_{z_i \in \mathcal{Z}_i} p(z_i \mid x_i, y_i, \psi) \\
&= \prod_{i=1}^{n} \theta_{00}^{I(x_i=0,\,y_i=0)} \theta_{01}^{I(x_i=0,\,y_i=1)} \theta_{10}^{I(x_i=1,\,y_i=0)} \theta_{11}^{I(x_i=1,\,y_i=1)} \times \\
&\qquad \sum_{z_i \in \mathcal{Z}_i} \Big\{ \psi_{00}^{I(x_i=0,\,y_i=0,\,z_i=1)} (1-\psi_{00})^{I(x_i=0,\,y_i=0,\,z_i=0)} \times \\
&\qquad\quad \psi_{01}^{I(x_i=0,\,y_i=1,\,z_i=3)} (1-\psi_{01})^{I(x_i=0,\,y_i=1,\,z_i=2)} \times \\
&\qquad\quad \psi_{10}^{I(x_i=1,\,y_i=0,\,z_i=2)} (1-\psi_{10})^{I(x_i=1,\,y_i=0,\,z_i=0)} \times \\
&\qquad\quad \psi_{11}^{I(x_i=1,\,y_i=1,\,z_i=3)} (1-\psi_{11})^{I(x_i=1,\,y_i=1,\,z_i=1)} \Big\} \\
&= \theta_{00}^{C_{00+}} \theta_{01}^{C_{01+}} \theta_{10}^{C_{10+}} \theta_{11}^{C_{11+}} \qquad (5)
\end{aligned}
$$

where θxy and Cxy+ are as before for x = 0, 1 and y = 0, 1. Note that the complete missingness of Z has caused the term consisting of conditional probabilities of Z given X and Y to drop completely out of the likelihood. As we might expect, the observed data provide no information about ψ. Note, however, that this does not mean that there is no information in the observed data about Z. Because the domain of Zi is cut in half from 4 possible values to 2 possible values conditional on xi and yi, there is some information in the data about the distribution of Z. This is important because, as we will see later, many causal quantities of interest can be written as functions of certain marginal probabilities of Z.

2.2.2 Prior and Posterior Distributions

Again, the prior discussed in Section 2.1.2 remains a natural choice for this situation. Combining this prior with the likelihood in Equation 5 gives us the following posterior density (up to proportionality):

$$
\begin{aligned}
p(\theta, \psi \mid x, y) \propto\ & \theta_{00}^{C_{00+}+a_{00}-1}\, \theta_{01}^{C_{01+}+a_{01}-1}\, \theta_{10}^{C_{10+}+a_{10}-1}\, \theta_{11}^{C_{11+}+a_{11}-1} \times \\
& \psi_{00}^{b_{00}-1} (1-\psi_{00})^{c_{00}-1}\, \psi_{01}^{b_{01}-1} (1-\psi_{01})^{c_{01}-1} \times \\
& \psi_{10}^{b_{10}-1} (1-\psi_{10})^{c_{10}-1}\, \psi_{11}^{b_{11}-1} (1-\psi_{11})^{c_{11}-1} \qquad (6)
\end{aligned}
$$

Note that the only information about ψ is coming from the prior distribution. This implies that inferences that depend on ψ will typically be sensitive to one's choice of prior for ψ.
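In code, the posterior in Equation 6 is just a count update for θ with the ψ priors left untouched; a minimal sketch (the flat hyperparameter values are illustrative):

```r
## Observed 2 x 2 counts in (C00+, C01+, C10+, C11+) order, plus a Dirichlet prior
C <- c(19, 143, 114, 473)   # the Table 1 data
a <- c(1, 1, 1, 1)

## theta | x, y ~ Dirichlet(C + a): the data update the theta block ...
post_alpha <- C + a
post_alpha                  # 20 144 115 474

## ... while psi_xy | x, y ~ Beta(b_xy, c_xy), identical to the prior,
## because the observed table carries no information about psi.
```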
3 Causal Quantities of Interest

The purpose of this section is to show how the causal quantities discussed at the beginning of Section 2 can be calculated from θ and, in some cases, ψ. Because the form of these causal quantities will depend on what (conditional) ignorability assumptions one is willing to make, we need a terminology to distinguish these different forms of the quantities of interest. Following Holland (1986) we term quantities that would be true causal quantities under strong ignorability prima facie quantities. These are the quantities that would be calculated directly from the observed data under an assumption of unconfoundedness. In most non-experimental situations the prima facie quantities will not be true causal quantities. The other types of quantities we are interested in here are those that are generally valid causal quantities but depend on the typically unidentified parameter ψ. We term these quantities sensitivity analysis quantities. While these quantities will be genuine causal quantities as long as SUTVA holds, the fact that the confounding variable Z is not typically observed means that inference will be sensitive to prior beliefs about the conditional distribution of Z in the population.

3.1 Post-Intervention Distribution

As noted in Section 2, the starting point for the derivation of all causal quantities of interest is the post-intervention distribution. This is the distribution over the potential outcomes given a hypothetical manipulation of X applied to the entire population of units. We can write the prima facie post-intervention distribution as:

$$
\begin{aligned}
\Pr\nolimits_p(Y(X=0)=0) &= \Pr(Y=0 \mid X=0) = \frac{\theta_{00}}{\theta_{00}+\theta_{01}} \\
\Pr\nolimits_p(Y(X=0)=1) &= \Pr(Y=1 \mid X=0) = \frac{\theta_{01}}{\theta_{00}+\theta_{01}} \\
\Pr\nolimits_p(Y(X=1)=0) &= \Pr(Y=0 \mid X=1) = \frac{\theta_{10}}{\theta_{10}+\theta_{11}} \\
\Pr\nolimits_p(Y(X=1)=1) &= \Pr(Y=1 \mid X=1) = \frac{\theta_{11}}{\theta_{10}+\theta_{11}}
\end{aligned}
$$

As we might expect, the prima facie post-intervention distribution does not depend on ψ and thus can be consistently estimated regardless of whether or not Z is observed. The prima facie post-intervention distribution will be the true post-intervention distribution if there is no confounding, i.e., if treatment assignment is strongly ignorable. All other prima facie causal quantities can be derived from the prima facie post-intervention distribution.

The sensitivity analysis post-intervention distribution is given by:

$$
\begin{aligned}
\Pr\nolimits_s(Y(X=0)=0) &= \sum_{z=0}^{3} \Pr(Y=0 \mid X=0, Z=z)\Pr(Z=z) = \Pr(Z=0) + \Pr(Z=1) \\
&= \theta_{10}(1-\psi_{10}) + \theta_{11}(1-\psi_{11}) + \theta_{00} \\
\Pr\nolimits_s(Y(X=0)=1) &= \sum_{z=0}^{3} \Pr(Y=1 \mid X=0, Z=z)\Pr(Z=z) = \Pr(Z=2) + \Pr(Z=3) \\
&= \theta_{10}\psi_{10} + \theta_{11}\psi_{11} + \theta_{01} \\
\Pr\nolimits_s(Y(X=1)=0) &= \sum_{z=0}^{3} \Pr(Y=0 \mid X=1, Z=z)\Pr(Z=z) = \Pr(Z=0) + \Pr(Z=2) \\
&= \theta_{00}(1-\psi_{00}) + \theta_{01}(1-\psi_{01}) + \theta_{10} \\
\Pr\nolimits_s(Y(X=1)=1) &= \sum_{z=0}^{3} \Pr(Y=1 \mid X=1, Z=z)\Pr(Z=z) = \Pr(Z=1) + \Pr(Z=3) \\
&= \theta_{00}\psi_{00} + \theta_{01}\psi_{01} + \theta_{11}
\end{aligned}
$$

Because conditional ignorability holds according to the definition of Z, if ψ were known the sensitivity analysis post-intervention distribution would yield the true post-intervention distribution. Of course, ψ is never known (and typically not identified) so the sensitivity analysis post-intervention distribution will depend on one's prior beliefs about ψ. Note, however, that this sensitivity to ψ does not imply that there is no information in the observed data about the sensitivity analysis post-intervention distribution and functionals thereof. We explore the extent to which the observed data alone are informative about causal quantities below.
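These formulas translate directly into R; a sketch (the function names are ours):

```r
## theta and psi are length-4 vectors ordered (00, 01, 10, 11)
postint_pf <- function(theta) {
  c(y1_x0 = theta[2] / (theta[1] + theta[2]),                 # Pr_p(Y(X=0)=1)
    y1_x1 = theta[4] / (theta[3] + theta[4]))                 # Pr_p(Y(X=1)=1)
}

postint_sa <- function(theta, psi) {
  c(y1_x0 = theta[3] * psi[3] + theta[4] * psi[4] + theta[2], # Pr_s(Y(X=0)=1)
    y1_x1 = theta[1] * psi[1] + theta[2] * psi[2] + theta[4]) # Pr_s(Y(X=1)=1)
}
```

The causal quantities in the next two subsections are then differences and ratios of these two probabilities.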
3.2 Average Treatment Effects

The average treatment effect is what most social scientists think of as "the" causal effect of X on Y. As noted above, it describes the expected difference of Y(X = 1) and Y(X = 0) for a unit randomly chosen from the study population. While the focus in this subsection is simply on the average treatment effect or ATE, it is also possible to define average treatment effects within just the treated group (ATT) or just the control group (ATC) and calculate bounds and sensitivity analysis distributions for these estimands as well. We do not do that here for reasons of space. The software that accompanies this article does allow the user to calculate ATT and ATC along with ATE.

3.2.1 Prima Facie Average Treatment Effect

The prima facie average treatment effect is defined as:

$$ATE_p = \Pr\nolimits_p(Y(X=1)=1) - \Pr\nolimits_p(Y(X=0)=1) = \frac{\theta_{11}}{\theta_{10}+\theta_{11}} - \frac{\theta_{01}}{\theta_{00}+\theta_{01}}$$

If treatment assignment is strongly ignorable and SUTVA holds, it is easy to show that ATEp equals the true average treatment effect in the population.

3.2.2 Sensitivity Analysis Average Treatment Effect

The sensitivity analysis average treatment effect is defined as:

$$ATE_s = \Pr\nolimits_s(Y(X=1)=1) - \Pr\nolimits_s(Y(X=0)=1) = (\theta_{00}\psi_{00} + \theta_{01}\psi_{01} + \theta_{11}) - (\theta_{10}\psi_{10} + \theta_{11}\psi_{11} + \theta_{01}) \qquad (7)$$

Because of the definition of Z, ATEs will always equal the true average treatment effect as long as SUTVA holds.

3.2.3 Large Sample Nonparametric Bounds for the Average Treatment Effect

Following Manski (1990) we can derive nonparametric bounds for the average treatment effect that will contain the true average treatment effect with probability 1 as sample size goes to infinity. Inspection of Equation 7 reveals that the minimum value of ATEs will occur when ψ00 = 0, ψ01 = 0, ψ10 = 1, and ψ11 = 1. Similarly, the maximum value of ATEs will occur when ψ00 = 1, ψ01 = 1, ψ10 = 0, and ψ11 = 0. Substituting these values into the expression for ATEs and recognizing that ATEs = ATE we see that:

$$ATE \in [-(\theta_{10} + \theta_{01}),\ (\theta_{00} + \theta_{11})]$$

Substituting the MLEs for θ00, θ01, θ10, and θ11 we see that (in a slight abuse of notation)

$$\lim_{n \to \infty} \Pr\!\left( \frac{-C_{01+} - C_{10+}}{n} \le ATE \le \frac{C_{00+} + C_{11+}}{n} \right) = 1$$

where it is understood that the Cxy+ counts also depend on n. Note that this interval will always include 0. Further, as Manski (1990) has shown, and as is easy to see here since $\sum_{x}\sum_{y} \theta_{xy} = 1$, the width of this interval will always be 1.

3.3 (Log) Relative Risk

The relative risk is the ratio of two post-intervention probabilities. It tells how many times more likely a positive outcome is under treatment than under control. Relative risks are often of interest in situations where the outcome variable describes a rare event. Here the ATE may be quite small in absolute value while the relative risk is large. See King and Zeng (2002) for a discussion of the concept of relative risk and related concepts.

3.3.1 Prima Facie (Log) Relative Risk

The prima facie relative risk is defined as:

$$RR_p = \frac{\Pr\nolimits_p(Y(X=1)=1)}{\Pr\nolimits_p(Y(X=0)=1)} = \frac{\theta_{11}}{\theta_{10}+\theta_{11}} \Big/ \frac{\theta_{01}}{\theta_{00}+\theta_{01}}$$

It will often be convenient to report the log of RRp, which is referred to as the prima facie log relative risk.

3.3.2 Sensitivity Analysis (Log) Relative Risk

The sensitivity analysis relative risk is defined as:

$$RR_s = \frac{\Pr\nolimits_s(Y(X=1)=1)}{\Pr\nolimits_s(Y(X=0)=1)} = \frac{\theta_{00}\psi_{00} + \theta_{01}\psi_{01} + \theta_{11}}{\theta_{10}\psi_{10} + \theta_{11}\psi_{11} + \theta_{01}} \qquad (8)$$

3.3.3 Large Sample Nonparametric Bounds for (Log) Relative Risks

Looking at Equation 8 we see that the minimum value of RRs will occur when ψ00 = 0, ψ01 = 0, ψ10 = 1, and ψ11 = 1.
Similarly, the maximum value of RRs will occur when ψ00 = 1, ψ01 = 1, ψ10 = 0, and ψ11 = 0. Substituting these values into the expression for RRs and recognizing that RRs = RR we see that:

$$RR \in \left[ \frac{\theta_{11}}{\theta_{01}+\theta_{10}+\theta_{11}},\ \frac{\theta_{00}}{\theta_{01}} + \frac{\theta_{11}}{\theta_{01}} + 1 \right]$$

Note that this interval will always include 1. Substituting the MLEs for θ00, θ01, θ10, and θ11 we see that (in an abuse of notation)

$$\lim_{n \to \infty} \Pr\!\left( \frac{C_{11+}}{C_{01+}+C_{10+}+C_{11+}} \le RR \le \frac{C_{00+}}{C_{01+}} + \frac{C_{11+}}{C_{01+}} + 1 \right) = 1$$

where it is understood that the Cxy+ counts depend on n. The large sample bounds for the log relative risk can be obtained by logging the endpoints of the interval above. Unlike the bounds for the ATE, which must always have width 1, the width of the bounding interval for the RR is only constrained to be greater than or equal to 1.

4 Bayesian Inference For Causal Effects

As we have seen in Sections 3.2.3 and 3.3.3, there is some information in just the marginal counts Cxy+ about causal quantities such as the average treatment effect and the relative risk. However, we have also seen that these bounds, taken on their own, cannot rule out the null values of 0 and 1 for the ATE and RR respectively. Further, the large sample bounds do not allow us to make statements about the probability that the quantity of interest is within some subset of the bounding interval. A third problem is that while the bounds are generally valid as sample size gets large, it is less clear what one should believe about the likely value of the quantity of interest in small samples given just the information in the bounds.

A principled method to address all of these issues is to adopt a Bayesian approach and base inference on the posterior density in Equation 6. Taking such an approach requires one to specify a prior distribution for (θ, ψ). In Section 2.1.2 we argued that independent Dirichlet and Beta distributions made sense in terms of interpretability. In the remainder of this section we discuss how the parameters governing these prior distributions can be chosen and how one can easily summarize the resulting posterior distribution to make inferences about causal quantities of interest such as ATE and RR.

4.1 Choosing a Prior Distribution

It is worth emphasizing that, unlike Bayesian inference for models which are point identified, the impact of the choice of prior for ψ on the posterior distribution for ψ and functionals of that posterior distribution will not diminish as n gets large if Z is completely unobserved. In fact, since no new information about ψ is arriving as n gets large, the marginal posterior for ψ will always be equal to the prior for ψ. Clearly one needs to be quite careful in determining the prior for ψ. Specifically, one should either be able to justify a particular choice of prior by an appeal to substantive background knowledge and/or perform a prior sensitivity analysis in which multiple reasonable priors—including some distributions that are based on the most extreme but still defensible assumptions that critical subject matter experts would entertain—are tried and the associated results are reported.

As is summarized in Tables 2 and 3, each ψxy represents the conditional probability of one of the two possible configurations of potential outcomes among units in which we observe X = x and Y = y. Thus the Beta(bxy, cxy) prior for ψxy can be thought of as a statement of belief that bxy − 1 of the Cxy+ units have one potential outcome profile while cxy − 1 of the Cxy+ units have the other possible potential outcome profile.
If bxy + cxy = Cxy+ + 2 then the information in the prior is equivalent to the information that would be in the sample data in the ideal case in which the potential outcome patterns are observed for units with X = x and Y = y. If bxy + cxy < Cxy+ + 2 then there is less information in the prior than in this ideal situation, and if bxy + cxy > Cxy+ + 2 then the prior is adding more information than one could ever get directly from the sample data.

To clarify the relationship between the bxy and cxy parameters and potential outcomes we will walk through the interpretation of ψxy, bxy, and cxy for all x and y. An additional example of how this works appears in the empirical example in Section 6.

Begin with the situation in which Xi = 0 and Yi = 0. Here two potential outcome profiles are possible: Zi = 0 (never succeed) and Zi = 1 (helped). ψ00 is the conditional probability that Zi = 1 (i would be helped by treatment) given Xi = 0 and Yi = 0, while 1 − ψ00 is obviously the conditional probability that Zi = 0 (i would never succeed) given Xi = 0 and Yi = 0. b00 is the number of pseudo Z = 1 (helped) observations + 1 and c00 is the number of pseudo Z = 0 (never succeed) observations + 1. If our background knowledge suggests that units for which we observe X = 0 and Y = 0 are unlikely to respond to treatment we would set c00 > b00. This implies that Pr(Zi = never succeed | Xi = 0, Yi = 0) > Pr(Zi = helped | Xi = 0, Yi = 0). On the other hand, if we believe that these units are more likely than not to respond to treatment we would set b00 > c00. This would imply Pr(Zi = helped | Xi = 0, Yi = 0) > Pr(Zi = never succeed | Xi = 0, Yi = 0). The absolute magnitude of b00 and c00 determines how sure we are of the potential outcome distribution within the X = 0, Y = 0 group.

Next consider the situation in which Xi = 0 and Yi = 1. Again, two potential outcome profiles are possible: Zi = 2 (hurt) and Zi = 3 (always succeed). ψ01 is the conditional probability that Zi = 3 (i would always succeed) given Xi = 0 and Yi = 1, while 1 − ψ01 is the conditional probability that Zi = 2 (i would be hurt by treatment) given Xi = 0 and Yi = 1. b01 is the number of pseudo Z = 3 (always succeed) observations + 1 and c01 is the number of pseudo Z = 2 (hurt) observations + 1. If our background knowledge suggests that units for which we observe X = 0 and Y = 1 are unlikely to respond to treatment we would set b01 > c01. On the other hand, if we believe that these units are more likely than not to respond (negatively) to treatment we would set c01 > b01. Again, the absolute magnitude of b01 and c01 determines how sure we are of the potential outcome distribution within the X = 0, Y = 1 group.

Within the Xi = 1 and Yi = 0 group the two potential outcome profiles are Zi = 0 (never succeed) and Zi = 2 (hurt). ψ10 is the conditional probability of randomly selecting a unit for which Zi = 2 (i is hurt by treatment) from the Xi = 1 and Yi = 0 group. b10 is the number of pseudo Z = 2 (hurt) observations + 1 and c10 is the number of pseudo Z = 0 (never succeed) observations + 1. Setting b10 > c10 would be consistent with a prior belief that more subjects tend to respond negatively to treatment within this group, while setting c10 > b10 would be consistent with a belief that treatment is more likely than not to have no effect on units within this group.

Finally, within the Xi = 1 and Yi = 1 group we see that the two potential outcome profiles are Zi = 1 (helped) and Zi = 3 (always succeed).
ψ11 is the conditional probability of seeing a Zi = 3 (always succeed) observation within this group. b11 is the number of pseudo Z = 3 (always succeed) observations + 1 and c11 is the number of pseudo Z = 1 (helped) observations + 1. If one thinks that units within this group are, on average, likely to respond positively to treatment one would set c11 > b11. If non-responsiveness is hypothesized one would set b11 > c11.

Now that we have discussed the basic interpretive issues of setting a subjective prior for ψ we can move on to discuss some more general strategies that can be used to operationalize such a prior. In some settings it is reasonable to assume that the treatment has a monotonic effect on outcomes for most units under study. Put more simply, under an assumption of a generally positive (negative) monotonic effect, the treatment may have either a positive (negative) effect or no effect on most units but it is unlikely to have a negative (positive) effect on any but a small fraction of units. If one believes that a generally positive monotonic treatment effect is reasonable then one could set b01 ≫ c01 and b10 ≪ c10. Conversely, one could set b00 ≪ c00 and b11 ≫ c11 to operationalize a generally negative monotonic treatment effect. In some sense, Manski's (2003) monotone treatment response assumption is a limiting case of this approach in which, under a positive (negative) monotone treatment response, there are absolutely no units that are hurt (helped) by treatment. Just as in Manski's bounds analysis, the assumption of a generally monotone treatment effect can greatly narrow down the likely range of causal quantities of interest.

One can also operationalize selection effects within the framework of this paper. If one believes that units that are most likely to benefit from treatment (X = 1) select into treatment one could set c11/(b11 + c11) > b00/(b00 + c00). In words, this implies that the probability of seeing a helped unit among the treated units is greater than the probability of seeing a helped unit among the control units. If one believes that units avoid selecting a treatment that harms them one could set c01/(b01 + c01) > b10/(b10 + c10). In words, the probability of seeing a hurt unit in the control group is greater than the probability of seeing a hurt unit in the treatment group.

One of the major advantages of thinking of unmeasured confounding in terms of the conditional distribution of the potential outcomes given xi and yi is that it formulates the problem in a way that is relatively easy for practitioners to reason about while remaining completely general. In order to apply the methods presented here, a researcher only needs to be able to reason about eight easily understood quantities, only four of which are completely free parameters. This is true regardless of the form of unmeasured confounding in his or her study. Contrast this with approaches to sensitivity analysis that attempt to directly model the relationships between the confounding variable(s) and outcomes and the confounding variable(s) and treatment status (Schlesselman, 1978; Rosenbaum and Rubin, 1983a; Lin et al., 1998). Such approaches require the practitioner to make assumptions about (a) the nature of the confounding variable (whether it is univariate/multivariate, discrete/continuous, etc.) as well as (b) the association between this assumed confounder and outcomes and treatment status.
As Robins (1999, p. 172) and Brumback et al. (2004, p. 763) have argued, it is "essentially impossible" for researchers to reason about (a), and even if the nature of the confounding variables can be assumed it is difficult for most practitioners to reason about the necessary associations.

4.2 Posterior Inference

It is relatively easy to show that the posterior distributions for θ and ψ discussed in Sections 2.1 and 2.2 can all be sampled from using simple independent Monte Carlo sampling. Markov chain Monte Carlo is not necessary. Once a sample of (θ, ψ) is available from the posterior distribution of interest it is a simple matter to plug the draws of these parameters into the formulas for the post-intervention distribution, average treatment effects, relative risk, etc. in order to obtain the posterior distribution over these causal quantities of interest. To produce a Monte Carlo sample of size m from the distribution with density given (up to proportionality) by Equation 6 we can use Algorithm 4.1.

Algorithm 4.1: PosteriorSamplingUnobservedZ(C, a, b, c, m)

    for j <- 1 to m do
        theta(j) <- rdirichlet(C00+ + a00, C01+ + a01, C10+ + a10, C11+ + a11)
        psi00(j) <- rbeta(b00, c00)
        psi01(j) <- rbeta(b01, c01)
        psi10(j) <- rbeta(b10, c10)
        psi11(j) <- rbeta(b11, c11)
    return ({theta(j)}, {psi00(j)}, {psi01(j)}, {psi10(j)}, {psi11(j)}) for j = 1, ..., m

Here rdirichlet(d, e, f, g) is a function that returns a pseudo-random draw from a Dirichlet(d, e, f, g) distribution and rbeta(d, e) is a function that returns a pseudo-random draw from a Beta(d, e) distribution.

4.2.1 Inference for Causal Quantities of Interest

Once a sample {θ(j), ψ(j)}, j = 1, . . . , m, from the posterior distribution of (θ, ψ) has been drawn it is quite easy to plug these draws into the formulas for causal quantities of interest given in Section 3 to obtain a sample from the posterior distribution of the causal quantity of interest. For instance, a sample from the posterior distribution of the prima facie average treatment effect can be constructed by taking the jth sample to be

$$ATE_p^{(j)} = \frac{\theta_{11}^{(j)}}{\theta_{10}^{(j)} + \theta_{11}^{(j)}} - \frac{\theta_{01}^{(j)}}{\theta_{00}^{(j)} + \theta_{01}^{(j)}}$$

for j = 1, . . . , m. Similarly, a sample from the posterior distribution of the sensitivity analysis average treatment effect can be constructed by taking the jth sample to be

$$ATE_s^{(j)} = \left(\theta_{00}^{(j)}\psi_{00}^{(j)} + \theta_{01}^{(j)}\psi_{01}^{(j)} + \theta_{11}^{(j)}\right) - \left(\theta_{10}^{(j)}\psi_{10}^{(j)} + \theta_{11}^{(j)}\psi_{11}^{(j)} + \theta_{01}^{(j)}\right)$$

for j = 1, . . . , m. Samples from the posterior distributions of other causal quantities of interest follow analogously.

Once one has a sample from the posterior distribution of interest it is a simple matter to summarize the distribution by calculating density estimates, highest posterior density regions (the smallest region that contains a pre-specified amount of the posterior mass), the probability that a quantity of interest is greater than 0, etc. using the sampled parameter values. See Jackman (2000); King et al. (2000); Gelman et al. (2003) and Gill (2007) for discussions of how posterior samples can be summarized.
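A runnable R version of Algorithm 4.1 and the plug-in step, drawing the Dirichlet variates by normalizing independent gamma variates so that no extra package is required (the SimpleTable package provides its own implementation; this sketch is ours):

```r
## Monte Carlo sampler for the posterior in Equation 6.
## C = (C00+, C01+, C10+, C11+); a, b, cc are the prior hyperparameters
## (cc plays the role of the paper's c, renamed to avoid masking R's c()).
sample_posterior <- function(C, a, b, cc, m) {
  g <- matrix(rgamma(4 * m, shape = rep(C + a, each = m)), nrow = m)
  theta <- g / rowSums(g)                        # Dirichlet(C + a) draws
  psi <- sapply(1:4, function(k) rbeta(m, b[k], cc[k]))
  list(theta = theta, psi = psi)                 # columns ordered (00, 01, 10, 11)
}

## Example: Table 1 counts with flat priors (the hyperparameter choices are ours)
set.seed(1)
s  <- sample_posterior(C = c(19, 143, 114, 473), a = rep(1, 4),
                       b = rep(1, 4), cc = rep(1, 4), m = 100000)
th <- s$theta; ps <- s$psi

ate_pf <- th[, 4] / (th[, 3] + th[, 4]) - th[, 2] / (th[, 1] + th[, 2])
ate_sa <- (th[, 1] * ps[, 1] + th[, 2] * ps[, 2] + th[, 4]) -
          (th[, 3] * ps[, 3] + th[, 4] * ps[, 4] + th[, 2])

## Posterior summaries; quantile() gives an equal-tailed interval, whereas the
## paper reports HPD regions, so the endpoints will differ somewhat.
mean(ate_pf); mean(ate_sa); quantile(ate_sa, c(0.025, 0.975)); mean(ate_sa < 0)
```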
5 Ranges of ψ Consistent With a Given Prima Facie Post-Intervention Distribution: The Confounding Plot

While the subjective Bayesian approach outlined above takes beliefs about ψ as input and returns a subjective probability distribution over causal effects, we can also reverse this process and start with a particular prima facie post-intervention distribution and ask what values of ψ will result in a sensitivity analysis post-intervention distribution that is within some tolerance of the given prima facie post-intervention distribution. Formally, for given values of θ00, θ01, θ10, and θ11 we seek to find all values of ψ00, ψ01, ψ10, and ψ11 for which

$$\left|\Pr\nolimits_p(Y(X=0)=0) - \Pr\nolimits_s(Y(X=0)=0)\right| = \left| \frac{\theta_{00}}{\theta_{00}+\theta_{01}} - \big(\theta_{10}(1-\psi_{10}) + \theta_{11}(1-\psi_{11}) + \theta_{00}\big) \right| \le \varepsilon \qquad (9)$$

and

$$\left|\Pr\nolimits_p(Y(X=0)=1) - \Pr\nolimits_s(Y(X=0)=1)\right| = \left| \frac{\theta_{01}}{\theta_{00}+\theta_{01}} - \big(\theta_{10}\psi_{10} + \theta_{11}\psi_{11} + \theta_{01}\big) \right| \le \varepsilon \qquad (10)$$

and

$$\left|\Pr\nolimits_p(Y(X=1)=0) - \Pr\nolimits_s(Y(X=1)=0)\right| = \left| \frac{\theta_{10}}{\theta_{10}+\theta_{11}} - \big(\theta_{00}(1-\psi_{00}) + \theta_{01}(1-\psi_{01}) + \theta_{10}\big) \right| \le \varepsilon \qquad (11)$$

and

$$\left|\Pr\nolimits_p(Y(X=1)=1) - \Pr\nolimits_s(Y(X=1)=1)\right| = \left| \frac{\theta_{11}}{\theta_{10}+\theta_{11}} - \big(\theta_{00}\psi_{00} + \theta_{01}\psi_{01} + \theta_{11}\big) \right| \le \varepsilon \qquad (12)$$

for some small positive ε. Note that Inequalities 9 and 10 only depend on ψ10 and ψ11 while Inequalities 11 and 12 only depend on ψ00 and ψ01. It is thus possible to depict all values of ψ00, ψ01, ψ10, and ψ11 that satisfy Inequalities 9–12 with a pair of 2-dimensional plots—one of ψ10 and ψ11 and another of ψ00 and ψ01. Figure 4 depicts how these plots look for particular values of θ00, θ01, θ10, θ11, and ε. Because this method does not account for sampling variability it is most appropriate for situations in which all of the cells in the 2 × 2 table have a reasonably large number of observations.

We note in passing the similarity of the plots above to the tomography plots of King (1997) that are useful for ecological inference. Indeed, the situation under consideration in this paper in which the Cxy+ are all fully observed but the joint Cxyz counts are not observed can be thought of as a particular type of ecological inference problem (Richardson, 2004).

6 Example: Jury Aversion and Voter Registration Revisited

In this section, the methods discussed previously will be applied to data from a study by Oliver and Wolfinger (1999) which looks at the extent to which perceiving jury lists to be constructed from voter registration lists causes a decrease in voter registration. As Oliver and Wolfinger note, there is a widespread belief among election officials and some scholars that the practice of drawing jury lists from voter registration records depresses voter registration and hence turnout. To estimate the extent to which aversion to jury duty depresses voter registration, Oliver and Wolfinger use data from the 1991 ANES Pilot Study (Miller et al., 1991). The simplest tabulation of the data used by Oliver and Wolfinger appears in Table 1.

If one were to assume that treatment assignment is strongly ignorable one could analyze this 2 × 2 table as if it were a randomized experiment. Doing so, we see that the posterior mean of the prima facie average treatment effect is equal to -0.076 with associated 95% highest posterior density (HPD) region [−0.134, −0.016] and the posterior mean of the prima facie log relative risk is -0.090 with 95% HPD region [−0.159, −0.019]. The posterior probability that the prima facie average treatment effect is less than 0 is 0.991. The same is true for the prima facie log relative risk.
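As a brief computational aside before continuing with the example: the confounding plot region of Section 5 can be found by brute force over a grid of ψ values. In the sketch below the tolerance, the grid resolution, and the plug-in MLE of θ from Table 1 are all illustrative choices; note also that Inequality 9 holds whenever Inequality 10 does (and 11 whenever 12 does), since each pair of post-intervention probabilities sums to one.

```r
theta <- c(19, 143, 114, 473) / 749   # MLE of (t00, t01, t10, t11) from Table 1
eps   <- 0.05                         # illustrative tolerance
g     <- seq(0, 1, by = 0.01)

## Inequality 10 constrains (psi10, psi11); Inequality 12 constrains (psi00, psi01)
ok_10 <- outer(g, g, function(p10, p11)
  abs(theta[2] / (theta[1] + theta[2]) -
      (theta[3] * p10 + theta[4] * p11 + theta[2])) <= eps)
ok_12 <- outer(g, g, function(p00, p01)
  abs(theta[4] / (theta[3] + theta[4]) -
      (theta[1] * p00 + theta[2] * p01 + theta[4])) <= eps)

## A pair of 2-d plots shading the admissible regions
image(g, g, 1 * ok_10, xlab = expression(psi[10]), ylab = expression(psi[11]))
image(g, g, 1 * ok_12, xlab = expression(psi[00]), ylab = expression(psi[01]))
```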
These prima facie results would seem to suggest that within the portion of the population that has some knowledge of how jury lists are constructed (a little over half the ANES sample) a belief that voter registration records are used to build jury lists decreases voter registration by about 7 and a half percentage points. In terms of relative risk, those who do not think that jury lists are constructed from voter lists are about 9.4% more likely to register to vote than those who do think jury lists are built from voter lists. These results suggest a moderate negative effect that would be consistent with the jury aversion hypothesis. Nonetheless, since these results rely on the implausible assumption of ignorable treatment assignment one would be correct to regard these prima facie results with more than a little suspicion.

6.1 Bayesian Analysis of the 2 × 2 Table Under Non-Informative Priors

As an initial step toward determining what one should infer about the effect of jury aversion on voter registration we calculate the ATEs and RRs under three different non-informative priors for ψ. The first corresponds to a uniform prior over each element of ψ. Formally, ψxy ∼ Beta(1, 1) for all x and y. The second prior corresponds to the Jeffreys prior ψxy ∼ Beta(0.5, 0.5) for all x and y. We also examine results under the assumption that ψxy ∼ Beta(0.001, 0.001).

Figure 1: Posterior Distributions of Prima Facie and Sensitivity Analysis Average Treatment Effects and Relative Risks for Data in Table 1 Under Non-Informative Priors. The green line in each panel is the posterior density of the prima facie ATE or RR (solid line depicts 95% highest posterior density region). The purple polygons depict the posterior density of the sensitivity analysis ATE or RR under alternative priors for ψ (dark purple denotes the 95% highest posterior density region). The horizontal light blue line in each figure denotes the region inside the large sample nonparametric bounds for the estimand. The leftmost panels are from the model that assumes uniform prior distributions for all ψ parameters. The center panels are from the model that assumes Beta(0.5, 0.5) prior distributions for all ψ parameters. Finally, the rightmost panels are from the model that assumes Beta(0.001, 0.001) prior distributions for all ψ parameters.

Graphical depictions of these results along with the prima facie results and the large sample nonparametric bounds are presented in Figure 1. Here we see that the large sample nonparametric bounds for the average treatment effect are ATE ∈ [−0.343, 0.657] and the bounds for the log relative risk are log(RR) ∈ [−0.434, 1.491]. Assuming that there may be unmeasured confounding, but nothing at all about the form of that confounding, we should believe that the ATE and log(RR) are within the above intervals. However, we cannot say how likely these quantities are to be within any subregion of these intervals. To do that we need to specify a prior distribution for ψ.

Under the Beta(1, 1) prior we see that the posterior mean of the ATEs is 0.157 with associated 95% HPD region [−0.189, 0.506] and that the posterior mean of log(RRs) is 0.294 with 95% HPD region [−0.284, 0.990]. Under the Beta(0.5, 0.5) prior the posterior mean of ATEs is 0.157 with 95% HPD region [−0.246, 0.562] while the posterior mean of log(RRs) is 0.324 with 95% HPD region [−0.356, 1.149]. In both cases, there is little posterior mass near the endpoints of the large sample bounds.
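The bound endpoints just quoted can be checked directly from the Table 1 counts (a sketch):

```r
n         <- 19 + 143 + 114 + 473          # 749 respondents
theta_hat <- c(19, 143, 114, 473) / n      # MLE of (t00, t01, t10, t11)

## Large sample ATE bounds: [-(t01 + t10), t00 + t11]
c(-(theta_hat[2] + theta_hat[3]), theta_hat[1] + theta_hat[4])         # -0.343  0.657

## Large sample log(RR) bounds
log(c(theta_hat[4] / (theta_hat[2] + theta_hat[3] + theta_hat[4]),
      theta_hat[1] / theta_hat[2] + theta_hat[4] / theta_hat[2] + 1))  # -0.434  1.491
```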
Something different happens when we adopt the Beta(0.001, 0.001) prior. This prior puts most of the prior mass very near ψxy = 0 and ψxy = 1. As a result, the posterior distributions for ATEs and log(RRs) become multimodal. Further, because of the finite sample size, some non-negligible posterior mass is placed outside the large-sample bounds of both ATE and log(RR). Specifically, the 95% HPD region for ATEs is [−0.377, −0.258] ∪ [−0.23, −0.097] ∪ [−0.053, 0.076] ∪ [0.235, 0.692], and the 95% HPD region for log(RRs) is [−0.481, −0.096] ∪ [−0.094, 0.076] ∪ [0.538, 1.563]. Indeed, one of the advantages of the approach taken in this paper is that it remains valid in finite samples of any size. On the whole, these results under various non-informative prior distributions suggest that there is not enough evidence to reject the null of ATE = 0 and log(RR) = 0.

6.2 Bayesian Analysis of the 2 × 2 Table Under Informative Priors

While such an analysis under a series of non-informative prior distributions may seem compelling, one needs to remember that, since no sample information about ψ is available, even these "non-informative" prior distributions build in assumptions that may be at odds with scientific knowledge or common sense. For instance, all three of the "non-informative" priors used above are symmetric around 0.5 and thus have the property that the prior density for ψxy is the same as the prior density for 1 − ψxy. Further, since bxy and cxy were assumed to equal the same constant for all x and y, it follows that the marginal prior distributions for all of the ψxy parameters are equivalent. This means, for instance, that these priors embody the assumption that the fraction of units hurt by treatment in the (X = 1, Y = 0) group is equal to the fraction of units helped by treatment in the (X = 0, Y = 0) group. This is an implausible assumption in most applications where there is even a hint of background knowledge, including the current application.

Figure 2: Subjective Prior Densities Used in Analysis of Table 1. Each panel plots the density of one element of ψ (or its complement): ψ00, the fraction of helped in (X = 0, Y = 0), with b00 = 0.02, c00 = 10; ψ01, the fraction of always succeed in (X = 0, Y = 1), with (b01, c01) = (17, 5), (25, 3), (45, 3.5); ψ10, the fraction of hurt in (X = 1, Y = 0), with (b10, c10) = (5, 17), (3, 25), (3.5, 45); and ψ11, the fraction of always succeed in (X = 1, Y = 1), with b11 = 10, c11 = 0.02. The prior referred to as prior (a) is given by the solid black lines, prior (b) by the dashed blue lines, and prior (c) by the dotted red lines.
Figure 3: Posterior Distributions of Prima Facie and Sensitivity Analysis Average Treatment Effects and Relative Risks for Data in Table 1 Under Informative Subjective Priors. The green line in each panel is the posterior density of the prima facie ATE or RR (solid line depicts 95% highest posterior density region). The purple polygons depict the posterior density of the sensitivity analysis ATE or RR under alternative priors for ψ (dark purple denotes the 95% highest posterior density region). The horizontal light blue line in each figure denotes the region inside the large sample nonparametric bounds on the estimand. The leftmost panels (column (a)) are from the model in which b00 = 0.02, c00 = 10, b01 = 17, c01 = 5, b10 = 5, c10 = 17, b11 = 10, and c11 = 0.02. The center panels (column (b)) are from the model in which b00 = 0.02, c00 = 10, b01 = 25, c01 = 3, b10 = 3, c10 = 25, b11 = 10, and c11 = 0.02. Finally, the rightmost panels (column (c)) are from the model that assumes b00 = 0.02, c00 = 10, b01 = 45, c01 = 3.5, b10 = 3.5, c10 = 45, b11 = 10, and c11 = 0.02.

In this application it seems relatively uncontroversial to assume that perceiving jury lists to be drawn from voter lists should not cause anyone to vote who would not have voted absent such a belief. In other words, it seems safe to assume a negative monotonic treatment effect. Put still another way, we do not anticipate a jury attraction effect. At most, one would expect only a minuscule fraction of the population to be helped by the treatment variable. We operationalize this belief by assuming ψ00 ∼ Beta(0.02, 10) and ψ11 ∼ Beta(10, 0.02). Under such a prior, the probability that the percentage of helped individuals in the (X = 0, Y = 0) group is less than 2% is 0.975. The same is true for the percentage of helped individuals in the (X = 1, Y = 1) group. As we will see below, this relatively uncontroversial assumption allows us to greatly narrow the range of plausible ATE and RR values.

A wider range of defensible subjective beliefs is possible about ψ01 and ψ10. Recall that ψ01 is the fraction of always succeed individuals and (1 − ψ01) is the fraction of hurt individuals in the (X = 0, Y = 1) group, while ψ10 is the fraction of hurt individuals and (1 − ψ10) is the fraction of never succeed individuals in the (X = 1, Y = 0) group. Researchers who hypothesize that there is a large jury aversion effect will believe that ψ01 is relatively small and ψ10 is relatively large. On the other hand, researchers who think that such an effect is small or essentially nonexistent will believe that ψ01 is close to 1 and ψ10 is close to 0. We operationalize these beliefs, along with beliefs in between these extremes, with three prior distributions, labeled (a), (b), and (c). Prior (a) represents a belief that there is a fairly substantial jury aversion effect, (b) a belief in a moderate effect, and (c) a belief in a fairly small effect. Figure 2 plots the prior densities for each element of ψ under each set of prior beliefs.
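As a rough stand-in for the ψ01 panel of Figure 2, the three prior densities can be drawn directly from the listed Beta hyperparameters in base R; the axis limits and line types here are my own choices, and ψ10 uses the mirror-image densities Beta(5, 17), Beta(3, 25), and Beta(3.5, 45).

```r
## Sketch of the psi01 panel of Figure 2: priors (a), (b), (c)
curve(dbeta(x, 17, 5), from = 0, to = 1, n = 501, ylim = c(0, 15),
      lty = 1, xlab = expression(psi["01"]), ylab = "Density")   # prior (a)
curve(dbeta(x, 25, 3),   n = 501, lty = 2, add = TRUE)           # prior (b)
curve(dbeta(x, 45, 3.5), n = 501, lty = 3, add = TRUE)           # prior (c)
legend("topleft", lty = 1:3,
       legend = c("prior (a)", "prior (b)", "prior (c)"))
```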
Note that in all three priors we assume an essentially negative monotonic treatment effect by assuming ψ00 ∼ Beta(0.02, 10) and ψ11 ∼ Beta(10, 0.02).

Under prior (a) we assume ψ01 ∼ Beta(17, 5) and ψ10 ∼ Beta(5, 17). This is consistent with a belief that there is a 95% chance that the fraction of hurt individuals in the (X = 0, Y = 1) group is between 0.082 and 0.419. The same is true for the fraction of hurt individuals in the (X = 1, Y = 0) group. In each case, the median fraction of hurt individuals is believed to be 0.219. We think very few researchers believe that more than about 40% of the individuals in these groups would be (or were) swayed to abstain from registering to vote solely by a belief that registering may make them slightly more likely to serve jury duty. Similarly, the lower bound of about 8% is consistent with a small but noticeable jury aversion effect. We think prior (a) is a reasonable representation of the beliefs of someone who believes in a fairly sizable, but still plausible, jury aversion effect.

Prior (c) is designed to represent the beliefs of someone much more skeptical of any such effect. Here we assume that ψ01 ∼ Beta(45, 3.5) and ψ10 ∼ Beta(3.5, 45). This is consistent with a belief that there is a 95% chance that the fraction of hurt individuals in the (X = 0, Y = 1) group is between 0.018 and 0.159. The same is true for the fraction of hurt individuals in the (X = 1, Y = 0) group. The prior median of each of these fractions is 0.066. While this prior assumes that the fraction of individuals who are hurt by treatment is probably less than about 15% of each of the two relevant treatment-outcome groups, it is still consistent with a negative jury aversion effect. Indeed, the expected fraction of hurt individuals in each of the two relevant groups is about 7%, which is not completely negligible.

Prior (b) falls somewhere between priors (a) and (c). Here we assume that ψ01 ∼ Beta(25, 3) and ψ10 ∼ Beta(3, 25). This is consistent with a belief that there is a 95% chance that the fraction of hurt individuals in the (X = 0, Y = 1) group is between 0.024 and 0.243. The same is true for the fraction of hurt individuals in the (X = 1, Y = 0) group. The prior median of each of these fractions is 0.098.

Results from analyses under priors (a), (b), and (c) are plotted in Figure 3. The first thing that is apparent from this figure is that the subjective priors have greatly decreased the plausible range of the causal effects of interest. By assuming an essentially negative monotonic treatment effect we forced most of the posterior mass to be to the left of 0. Given that the large sample nonparametric bounds were asymmetric around 0, with more room to the right of 0, this prior assumption adds a lot of information in a way that we think is relatively uncontroversial. Where exactly to the left of 0 most of the posterior mass lies depends on one's prior beliefs about ψ01 and ψ10. Under prior (a), the posterior mean of ATEs is −0.077 with 95% HPD region [−0.124, −0.035]. This is quite similar to the prima facie ATE. Under prior (b), the posterior mean of ATEs is −0.035 with 95% HPD region [−0.067, −0.010]. Finally, under prior (c), the posterior mean of ATEs is −0.023 with 95% HPD region [−0.046, −0.006]. From this analysis we would conclude that the prima facie results derived from the simple 2 × 2 table can be supported with not implausible beliefs about unmeasured confounding.
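The prior summaries quoted above follow directly from the Beta quantile and distribution functions; a quick base R check (recall that under prior (a) the fraction hurt in the (X = 0, Y = 1) group is 1 − ψ01 ∼ Beta(5, 17), and similarly for the other priors):

```r
## Implied 95% intervals and medians for the fraction hurt, 1 - psi01
qbeta(c(0.025, 0.5, 0.975), 5, 17)     # prior (a): approx 0.082, 0.219, 0.419
qbeta(c(0.025, 0.5, 0.975), 3, 25)     # prior (b): approx 0.024, 0.098, 0.243
qbeta(c(0.025, 0.5, 0.975), 3.5, 45)   # prior (c): approx 0.018, 0.066, 0.159
## Near-monotonicity prior: Pr(psi00 < 0.02) under Beta(0.02, 10)
pbeta(0.02, 0.02, 10)                  # approx 0.975
```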
However, such beliefs are at the extreme of what we think are reasonable. Other reasonable beliefs about potential unmeasured confounding yield much smaller (and substantively unimportant) jury aversion effects.

Figure 4: Values of ψ00, ψ01, ψ10 and ψ11 for which |Prp(Y(X = x) = y) − Prs(Y(X = x) = y)| ≤ 0.025 for x = 0, 1 and y = 0, 1 when θ00 = 0.026, θ01 = 0.191, θ10 = 0.152, θ11 = 0.631. These θ values correspond to the posterior mean values of these parameters given the data in Table 1 and a non-informative prior for θ. One panel plots ψ10 (fraction hurt in (X = 1, Y = 0)) against ψ11 (fraction always succeed in (X = 1, Y = 1)); the other plots ψ00 (fraction helped in (X = 0, Y = 0)) against ψ01 (fraction always succeed in (X = 0, Y = 1)). The shaded regions depict values that satisfy the inequality above.

6.3 Ranges of ψ Consistent with the Estimated Prima Facie Post-Intervention Distribution

A more comprehensive picture of what beliefs about ψ can support inferences close to the prima facie inferences can be obtained by looking at the confounding plot discussed in Section 5. Figure 4 plots the values of ψ00, ψ01, ψ10 and ψ11 for which |Prp(Y(X = x) = y) − Prs(Y(X = x) = y)| ≤ 0.025 for x = 0, 1 and y = 0, 1 when θ is set to its posterior mean value. Here we see that inferences are relatively insensitive to beliefs about ψ00: any value of ψ00 between 0 and 1 can produce a sensitivity analysis post-intervention distribution that is close to the prima facie post-intervention distribution. This should not be unexpected given the relatively small fraction of observations that fall in the (X = 0, Y = 0) cell. However, the prima facie inferences are quite sensitive to beliefs about ψ11, the fraction of always succeed individuals in the (X = 1, Y = 1) cell. This parameter must be between about 0.8 and 1 in order for the sensitivity analysis post-intervention distribution to be close to its prima facie counterpart. As we argued above, it seems quite reasonable to assume that ψ11 is very close to 1 in this application (there is essentially no jury attraction effect), so this sensitivity to ψ11 may not be so important. More important is the sensitivity of the prima facie results to the values of ψ01 and ψ10. Here we see that the fraction of hurt individuals in the (X = 0, Y = 1) cell cannot be too great, while the fraction of hurt individuals in the (X = 1, Y = 0) cell has to be moderately large.

6.4 Dealing with a Measured Confounder: Bayesian Analysis of a 2 × 2 × 5 Table

We can also use Bayesian methods to model general unmeasured confounding while simultaneously adjusting for measured confounders. Here we deal with the situation in which there is a single measured confounder. However, the same techniques could, in principle, be used to deal with multiple measured confounders that are all discrete. Table 4 presents the data in Table 1 broken down by occupational category (W = 0: self-employed, W = 1: hourly workers, W = 2: others working, W = 3: not in workforce, W = 4: retired).
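For concreteness, the counts of Table 4 (reproduced at the end of the paper) can be held as a 2 × 2 × 5 array in R; the dimension names below are my own labeling convention. Collapsing over W recovers Table 1, which provides a quick consistency check.

```r
## Table 4 counts as a 2 x 2 x 5 array with dimensions (X, Y, W)
tab4 <- array(
  c( 1, 10, 21,  93,   # W = 0: self-employed
     5, 27, 32,  92,   # W = 1: hourly workers
     4, 52, 44, 186,   # W = 2: others working
     7, 19, 20,  47,   # W = 3: not in workforce
     2,  6, 26,  55),  # W = 4: retired
  dim = c(2, 2, 5),
  dimnames = list(X = c("0", "1"), Y = c("0", "1"), W = as.character(0:4)))
margin.table(tab4, margin = c(1, 2))  # reproduces the 2 x 2 counts of Table 1
```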
Since the costs associated with serving jury duty are, to a large extent, tied to one's occupation, and since occupational status is believed to have some effect on political participation, we expect occupation to act as a confounder, and it would seem sensible to adjust for this measured confounder. Such an adjustment would be especially useful if we thought it removed most of the confounding bias (unlikely in this application) and/or if it made it easier to reason about the unobserved potential outcomes (somewhat true in this case). In situations where adjustment for the measured confounding variables is thought to remove only a small portion of the confounding bias, it is not always clear whether the additional information provided by conditioning on the measured confounding variables outweighs the additional complication of formulating a prior conditional on X, Y, and the measured confounders.

If we are willing to assume the measured confounder W is discrete with K categories, it is appropriate to perform the simple 2 × 2 analyses described above separately within each of the k = 0, . . . , K − 1 levels of W and then weight by φk ≡ Pr(W = k). Somewhat more formally, we can write the likelihood for the observed data, marginalized over the unobserved Z variables, as:

p(w, x, y | φ, θ, ψ) = ∏i=1..n Σzi∈Zi p(wi, xi, yi, zi | φ, θ, ψ)
                     = ∏i=1..n p(wi, xi, yi | φ, θ) Σzi∈Zi p(zi | wi, xi, yi, ψ)
                     = ∏i=1..n p(wi | φ) p(xi, yi | wi, θ) Σzi∈Zi p(zi | wi, xi, yi, ψ)

where the p(wi | φ) term is a multinomial mass function with probability vector φ and multinomial sample size 1, p(xi, yi | wi, θ) is a multinomial mass function that depends on the value of wi with probabilities θ00|wi, θ01|wi, θ10|wi, θ11|wi and multinomial sample size 1, and the term involving the sum over Z consists of 4K Bernoulli mass functions with parameters ψ00|wi, ψ01|wi, ψ10|wi and ψ11|wi. Note that everything is analogous to the 2 × 2 analysis except for the conditioning on wi throughout.

Now that we are explicitly adjusting for W, the prima facie post-intervention distribution must be redefined to account for this adjustment. It becomes:

Prpw(Y(X = 0) = 0) = Σw=0..K−1 Pr(Y = 0 | X = 0, W = w) Pr(W = w) = Σw=0..K−1 [θ00|w / (θ00|w + θ01|w)] φw
Prpw(Y(X = 0) = 1) = Σw=0..K−1 Pr(Y = 1 | X = 0, W = w) Pr(W = w) = Σw=0..K−1 [θ01|w / (θ00|w + θ01|w)] φw
Prpw(Y(X = 1) = 0) = Σw=0..K−1 Pr(Y = 0 | X = 1, W = w) Pr(W = w) = Σw=0..K−1 [θ10|w / (θ10|w + θ11|w)] φw
Prpw(Y(X = 1) = 1) = Σw=0..K−1 Pr(Y = 1 | X = 1, W = w) Pr(W = w) = Σw=0..K−1 [θ11|w / (θ10|w + θ11|w)] φw

Similarly, the sensitivity analysis post-intervention distribution becomes:

Prsw(Y(X = 0) = 0) = Σw=0..K−1 Σz=0..3 Pr(Y = 0 | X = 0, W = w, Z = z) Pr(Z = z | W = w) Pr(W = w) = Σw=0..K−1 [θ10|w(1 − ψ10|w) + θ11|w(1 − ψ11|w) + θ00|w] φw
Prsw(Y(X = 0) = 1) = Σw=0..K−1 Σz=0..3 Pr(Y = 1 | X = 0, W = w, Z = z) Pr(Z = z | W = w) Pr(W = w) = Σw=0..K−1 [θ10|w ψ10|w + θ11|w ψ11|w + θ01|w] φw
Prsw(Y(X = 1) = 0) = Σw=0..K−1 Σz=0..3 Pr(Y = 0 | X = 1, W = w, Z = z) Pr(Z = z | W = w) Pr(W = w) = Σw=0..K−1 [θ00|w(1 − ψ00|w) + θ01|w(1 − ψ01|w) + θ10|w] φw
Prsw(Y(X = 1) = 1) = Σw=0..K−1 Σz=0..3 Pr(Y = 1 | X = 1, W = w, Z = z) Pr(Z = z | W = w) Pr(W = w) = Σw=0..K−1 [θ00|w ψ00|w + θ01|w ψ01|w + θ11|w] φw

Prima facie and sensitivity analysis average treatment effects and relative risks are defined in the usual way from these post-intervention distributions.
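To make the mapping from these formulas to computation explicit, the following hypothetical R helper (my own naming, not part of the SimpleTable package) evaluates the W-adjusted sensitivity-analysis probabilities of Y = 1 for a single set of parameter values; looping it over posterior draws of (θ·|w, ψ·|w, φ) would yield the posterior of ATEsw and log(RR)sw.

```r
## W-adjusted sensitivity-analysis probabilities Pr_sw(Y(X = x) = 1).
## theta_w, psi_w: K x 4 matrices with columns ordered (00, 01, 10, 11)
## within stratum; phi: length-K vector of Pr(W = w).
post_int_w <- function(theta_w, psi_w, phi) {
  ## within-stratum post-intervention probabilities of Y = 1
  p0s <- theta_w[, 3] * psi_w[, 3] + theta_w[, 4] * psi_w[, 4] + theta_w[, 2]
  p1s <- theta_w[, 1] * psi_w[, 1] + theta_w[, 2] * psi_w[, 2] + theta_w[, 4]
  ## weight each stratum by Pr(W = w)
  c(PrY1_X0 = sum(p0s * phi),   # Pr_sw(Y(X = 0) = 1)
    PrY1_X1 = sum(p1s * phi))   # Pr_sw(Y(X = 1) = 1)
}
## ATE_sw for one parameter draw:
## p <- post_int_w(theta_w, psi_w, phi); unname(p["PrY1_X1"] - p["PrY1_X0"])
```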
After a bit of algebra it can be shown that the large sample nonparametric bounds for the ATE and RR do not change when moving from the simple 2 × 2 analysis to the analysis in which a nonparametric adjustment is made for W and no additional conditional independence assumptions are made.

We perform two analyses that explicitly adjust for reported occupation. In the first analysis we assume a uniform prior for each of the 20 ψxy|w parameters. The second analysis assumes b00|w = 0.02, c00|w = 10, b01|w = 25, c01|w = 3, b10|w = 3, c10|w = 25, b11|w = 10, c11|w = 0.02 for w = 0, 1, 2 and b00|w = 0.02, c00|w = 10, b01|w = 15, c01|w = 1, b10|w = 1, c10|w = 15, b11|w = 10, c11|w = 0.02 for w = 3, 4. This informative subjective prior assumes that there is an essentially negative monotonic treatment effect and that the largest effects are found among those who are working (w = 0, 1, 2). In both analyses we assume non-informative priors for θ and φ. Figure 5 displays the results of these analyses graphically.

Looking at these results we see that the posterior mean of ATEpw is −0.082 with 95% HPD interval [−0.137, −0.024] and the posterior mean of log(RR)pw is −0.100 with 95% HPD interval [−0.162, −0.028]. These results are quite close to those for ATEp and RRp, suggesting that, by itself, reported occupational category is not a serious confounder, although it is worth noting that the effects adjusted for occupation are slightly larger in magnitude and the intervals have narrowed somewhat.

Under the uniform prior for each ψxy|w we see that the posterior mean of ATEsw is 0.156 with 95% HPD region [−0.039, 0.351] and the posterior mean of log(RR)sw is 0.250 with 95% HPD region [−0.064, 0.586]. Note the dramatic narrowing of the HPD intervals here relative to the intervals from the uniform prior for ψxy unconditional on occupation. As we might expect, conditioning on a measured confounder tends to decrease the variability of one's estimates. Looking at the results under the subjective prior on ψ, we see that the posterior mean of ATEsw is −0.031 with 95% HPD region [−0.048, −0.016] and the posterior mean of log(RR)sw is −0.038 with 95% HPD interval [−0.058, −0.021]. Again, these results are roughly consistent with those from the corresponding 2 × 2 analysis, except that the sampling variability has decreased after adjusting for occupation.

It is reassuring to note that the results presented here, based on justifiable subjective prior distributions, correspond closely at a qualitative level to the results of Oliver and Wolfinger (1999), who adjusted for a number of measured covariates that were available in the original dataset. While Oliver and Wolfinger do not explicitly calculate the same causal effects as we have in this paper, they reach a similar qualitative conclusion. To quote from their conclusion:

There is a grain—but no more than a grain—of truth to the belief that jury aversion reduces voter registration... People who believe that registering makes them susceptible to jury service are slightly less likely to be registered than those who identify other sources. This aversion to jury service is responsible for a drop in voter turnout of less than one percentage point. (p. 151)
Figure 5: Posterior Distributions of Prima Facie and Sensitivity Analysis Average Treatment Effects and Relative Risks for Data in Table 4. The green line in each panel is the posterior density of the prima facie ATE or RR (solid line depicts 95% highest posterior density region). The purple polygons depict the posterior density of the sensitivity analysis ATE or RR under alternative priors for ψ (dark purple denotes the 95% highest posterior density region). The horizontal light blue line in each figure denotes the region inside the large sample nonparametric bounds on the estimand. The leftmost panels (column (a)) are from a model in which a uniform prior is assumed for each element of ψ within each occupational category. The rightmost panels (column (b)) are from the model that assumes b00|w = 0.02, c00|w = 10, b01|w = 25, c01|w = 3, b10|w = 3, c10|w = 25, b11|w = 10, c11|w = 0.02 for w = 0, 1, 2 and b00|w = 0.02, c00|w = 10, b01|w = 15, c01|w = 1, b10|w = 1, c10|w = 15, b11|w = 10, c11|w = 0.02 for w = 3, 4. This informative subjective prior assumes that there is an essentially negative monotonic treatment effect and that the largest effects are found among those who are working (w = 0, 1, 2).

7 Discussion

All counterfactual causal inferences require fundamentally untestable causal assumptions. In a well-run randomized controlled experiment the necessary assumptions for point identification of causal effects are so plausible that they hardly appear to be assumptions. However, as one moves away from the experimental ideal, the necessary ignorability assumptions typically become less plausible. Better research designs and improved estimation strategies can do much to help decrease the impact of confounding bias. Nonetheless, researchers still need to make causal assumptions in order to make causal inferences. The development of methods of sensitivity analysis for situations in which unmeasured confounding is present, as is done in this paper, serves to shift empirical social science research away from the all too typical enterprise of defending indefensible causal assumptions and toward the practice of honestly stating the range of assumptions that are consistent with a particular type of causal effect. This has the potential to accelerate the accumulation of knowledge in the social sciences.

While the running example in this paper dealt with a large sample survey of U.S. citizens, the methods discussed here are just as, if not more, relevant for qualitative researchers in comparative politics and international relations who typically deal with much smaller numbers of cases. Indeed, as Sekhon (2004) has pointed out, causal claims in such settings typically involve the same sorts of counterfactuals considered here (see also Fearon (1991) and Hawthorn (1993)). The methods presented here merely require researchers to specify their beliefs about four relevant counterfactual quantities: the fractions of "helped", "hurt", "always succeed", and "never succeed" units within various observed combinations of X and Y. To the extent that the consideration of various counterfactuals is what researchers in these fields already do, the methods presented in this paper can be seen as a way to ensure that the conclusions reached by these scholars are logically consistent with the counterfactual claims they are willing to maintain. Further, because researchers in these fields are often unable to collect additional data on many important confounding variables, they often find themselves dealing with simple 2 × 2 and 2 × 2 × K contingency tables where substantial unmeasured confounding is thought to be present.

References
Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin. 1996. "Identification of Causal Effects Using Instrumental Variables." Journal of the American Statistical Association 91:444–455.

Balke, Alexander, and Judea Pearl. 1997. "Bounds on Treatment Effects From Studies With Imperfect Compliance." Journal of the American Statistical Association 92(439):1171–1176.

Bernardo, José, and Adrian F. M. Smith. 1994. Bayesian Theory. New York: Wiley.

Brumback, Babette A., Miguel A. Hernán, Sebastien J. P. A. Haneuse, and James M. Robins. 2004. "Sensitivity Analyses for Unmeasured Confounding Assuming a Marginal Structural Model for Repeated Measures." Statistics in Medicine 23:749–767.

Chickering, David Maxwell, and Judea Pearl. 1997. "A Clinician's Tool for Analyzing Non-Compliance." Computing Science and Statistics 29(2):424–431.

Cornfield, J., W. Haenszel, E. Hammond, A. Lilienfeld, M. Shimkin, and E. Wynder. 1959. "Smoking and Lung Cancer: Recent Evidence and a Discussion of Some Questions." Journal of the National Cancer Institute 22:173–203.

Fearon, James. 1991. "Counterfactuals and Hypothesis Testing in Political Science." World Politics 43:474–484.

Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2003. Bayesian Data Analysis. London: Chapman & Hall, second edition.

Gill, Jeff. 2007. Bayesian Methods: A Social and Behavioral Science Approach. Boca Raton, FL: Chapman & Hall/CRC, second edition.

Gill, Jeff, and Lee D. Walker. 2005. "Elicited Priors for Bayesian Model Specifications in Political Science Research." The Journal of Politics 67(3):841–872.

Hawthorn, Geoffrey. 1993. Plausible Worlds: Possibility and Understanding in History and the Social Sciences. New York: Cambridge University Press.

Holland, Paul W. 1986. "Statistics and Causal Inference." Journal of the American Statistical Association 81:945–960.

Imai, Kosuke, and Samir Soneji. 2007. "On the Estimation of Disability-Free Life Expectancy: Sullivan's Method and Its Extension." Journal of the American Statistical Association 102:1199–1211.

Imai, Kosuke, and Teppei Yamamoto. 2008. "Causal Inference with Measurement Error: Nonparametric Identification and Sensitivity Analyses of a Field Experiment on Democratic Deliberations." Paper presented at the 2008 Summer Political Methodology Meeting.

Jackman, Simon. 2000. "Estimation and Inference via Bayesian Simulation: An Introduction to Markov Chain Monte Carlo." American Journal of Political Science 44(2):375–404.

Kadane, Joseph B., and Lara J. Wolfson. 1998. "Experiences in Elicitation." The Statistician 47:3–19.

King, Gary. 1997. A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton: Princeton University Press.

King, Gary, Michael Tomz, and Jason Wittenberg. 2000. "Making the Most of Statistical Analyses: Improving Interpretation and Presentation." American Journal of Political Science 44(2):347–361.

King, Gary, and Langche Zeng. 2002. "Estimating Risk and Rate Levels, Ratios, and Differences in Case-Control Studies." Statistics in Medicine 21:1409–1427.

Lin, D. Y., B. M. Psaty, and R. A. Kronmal. 1998. "Assessing the Sensitivity of Regression Results to Unmeasured Confounders in Observational Studies." Biometrics 54:948–963.

Manski, Charles F. 1990. "Nonparametric Bounds on Treatment Effects." The American Economic Review Papers and Proceedings 80(2):319–323.

Manski, Charles F. 1993. "Identification Problems in the Social Sciences." Sociological Methodology 23:1–56.

Manski, Charles F. 2003. Partial Identification of Probability Distributions. New York: Springer.
Miller, Warren E., Donald R. Kinder, Steven J. Rosenstone, and the American National Election Studies. 1991. "American National Election Study: 1990–1991 Panel Study of the Political Consequences of War / 1991 Pilot Study." [computer file] (Study #9673). Conducted by the Center for Political Studies of the Institute for Social Research, the University of Michigan.

Morgan, Stephen L., and Christopher Winship. 2007. Counterfactuals and Causal Inference: Methods and Principles for Social Research. New York: Cambridge University Press.

Oliver, J. Eric, and Raymond E. Wolfinger. 1999. "Jury Aversion and Voter Registration." American Political Science Review 93(1):147–152.

Pearl, Judea. 1995. "Causal Diagrams for Empirical Research." Biometrika 82:669–710.

Pearl, Judea. 2000. Causality: Models, Reasoning, and Inference. New York: Cambridge University Press.

R Development Core Team. 2007. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Richardson, Thomas. 2004. "Discussion on the Paper by Wakefield." Journal of the Royal Statistical Society: Series A 167:438.

Robins, James M. 1986. "A New Approach to Causal Inference in Mortality Studies with a Sustained Exposure Period: Application to Control of the Healthy Worker Survivor Effect." Mathematical Modelling 7:1393–1512.

Robins, James M. 1989. "The Analysis of Randomized and Nonrandomized AIDS Treatment Trials Using a New Approach to Causal Inference in Longitudinal Studies." In Health Service Research Methodology: A Focus on AIDS (L. Sechrest, H. Freeman, and A. Mulley, editors). Washington, DC: U.S. Public Health Service.

Robins, James M. 1999. "Association, Causation, and Marginal Structural Models." Synthese 121:151–179.

Robins, James M. 2002. "Comment." Statistical Science 17:309–321.

Rosenbaum, Paul R. 1987. "Sensitivity Analysis for Certain Permutation Inferences in Matched Observational Studies." Biometrika 74:13–26.

Rosenbaum, Paul R. 2002a. "Covariance Adjustment in Randomized Experiments and Observational Studies." Statistical Science 17:286–304.

Rosenbaum, Paul R. 2002b. Observational Studies. New York: Springer, second edition.

Rosenbaum, Paul R., and Donald B. Rubin. 1983a. "Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome." Journal of the Royal Statistical Society: Series B 45:212–218.

Rosenbaum, Paul R., and Donald B. Rubin. 1983b. "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika 70:41–55.

Rubin, Donald B. 1974. "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies." Journal of Educational Psychology 66(5):688–701.

Rubin, Donald B. 1978. "Bayesian Inference for Causal Effects: The Role of Randomization." The Annals of Statistics 6(1):34–58.

Rubin, Donald B. 1980. "Randomization Analysis of Experimental Data: The Fisher Randomization Test: Comment." Journal of the American Statistical Association 75(371):591–593.

Rubin, Donald B. 1986. "Statistics and Causal Inference: Comment: Which Ifs Have Causal Answers." Journal of the American Statistical Association 81(396):961–962.

Schlesselman, James J. 1978. "Assessing Effects of Confounding Variables." American Journal of Epidemiology 108:3–8.

Sekhon, Jasjeet S. 2004. "Quality Meets Quantity: Case Studies, Conditional Probability, and Counterfactuals." Perspectives on Politics 2(2):281–293.

Western, Bruce, and Simon Jackman. 1994. "Bayesian Inference for Comparative Research." American Political Science Review 88(June):412–423.
Occupational Category (W)      X = 0, Y = 0   X = 0, Y = 1   X = 1, Y = 0   X = 1, Y = 1
W = 0 Self-Employed                  1              21             10             93
W = 1 Hourly Workers                 5              32             27             92
W = 2 Others Working                 4              44             52            186
W = 3 Not in Workforce               7              20             19             47
W = 4 Retired                        2              26              6             55

Table 4: Perceived Source of Jury Lists and Voter Registration Among Citizens With Some Self-Reported Knowledge of How Jury Lists Are Constructed, by Occupational Category. Each entry is the number of citizens in that category. X = 0: perceived source of jury lists does not include voter lists; X = 1: perceived source of jury lists does include voter lists; Y = 0: citizen did not register to vote; Y = 1: citizen did register to vote. Data from Table 2 of Oliver and Wolfinger (1999).