Weighted Estimation for Analyses with Missing Data

Cyrus Samii, Columbia University Political Science

Motivation

Missing data plague data analyses in political science. The recent applied statistics literature reflects renewed interest in weighting methods for missing data problems. Three properties are stressed in this literature: (i) robustness, (ii) the ability to use post-treatment information in causal analysis, and (iii) methods to gain efficiency. I present these results, hoping to show the potential in using refashioned weighting methods for political science research.

Preliminaries

Consider a generalized linear regression of Y on X for an iid sample of size n indexed by i. Use the estimating equation $\sum_{i=1}^{n} S(Y_i, X_i; \beta)$ to characterize the regression estimator (Liang and Zeger, 1986; Stefanski and Boos, 2002). Full-sample estimates, $\hat{\beta}_f$, come from solving

$$\sum_{i=1}^{n} S(Y_i, X_i; \hat{\beta}_f) = 0.$$

Let $\beta_0 \equiv E(\hat{\beta}_f)$ define our target estimate. Define an indicator, $R_i$, for whether the data for unit i are fully observed, where the probability that $R_i = 1$ is $\pi_i$, a function of the observed data. The estimate on the observed data, $\hat{\beta}_c$, is obtained by solving

$$\sum_{i=1}^{n} R_i S(Y_i, X_i; \hat{\beta}_c) = 0.$$

In general, the left-hand side will not have expectation zero when $\hat{\beta}_c = \beta_0$, indicating bias.

Robustness

The estimating equation, S, does not necessarily define the true data-generating process (DGP); more likely, S defines a useful (e.g., linear) approximation. Inverse probability weighting (IPW) allows for unbiased "approximate" inference. To see this, suppose data on $Y_i$ are missing for some units and that conditional independence holds, $Y_i \perp\!\!\!\perp R_i \mid X_i$, so that $E(R_i \mid Y_i, X_i) = \pi(X_i)$. Then,

$$
E\left[\sum_{i=1}^{n} \frac{R_i}{\pi(X_i)} S(Y_i, X_i; \beta_0)\right]
= E\left[\sum_{i=1}^{n} E\!\left(\frac{R_i}{\pi(X_i)} S(Y_i, X_i; \beta_0) \,\Big|\, Y_i, X_i\right)\right]
= E\left[\sum_{i=1}^{n} S(Y_i, X_i; \beta_0)\right].
$$

Figure 1 illustrates this property, showing how IPW recovers the linear approximation of a complex relationship between X and Y.

[Figure 1: two scatterplot panels, "No weighting" and "IPW."]

FIGURE 1: Gray points are the complete sample, and hollow points are observed data.
Hollow points in the right panel are scaled proportionally to their weights. The gray line is the full-sample target, and the dashed lines are the attempts to estimate it.

Estimating equations allow one to study bias reduction in terms of influence (Tsiatis, 2006). This is intuitive in Figure 1: we tilt the regression line by increasing the influence of points at the left. This opens the door to a variety of semi- and non-parametric methods for constructing weights. The example above used logistic regression; alternatives include boosted regression (McCaffrey et al., 2004) and robit regression, which is more robust than logistic regression (Kang and Schafer, 2007).

Using post-treatment information

The robustness result depends only on our ability to estimate $\pi_i$ using whatever observed data are available. We thus have flexibility in modeling $\pi_i$, for example by using post-treatment information. Consider an example due to Hernan et al. (2004). Link X, Y, and R causally with a directed acyclic graph (Pearl, 2000), where X is a "treatment." Add a post-treatment variable, Z, that lies on the path to R:

X → Z ← Y,  Z → R.

We want the average effect of X on Y, which is modeled as 0 here. We only observe Y when R = 1. This missingness mechanism induces bias in an unadjusted regression of Y on X, and including Z as a regressor only adds another biasing component (King and Zeng, 2006). But the information in X and Z can be used to compute $\pi_i$, and then IPW is unbiased. For example, suppose the DGP

Y ~ Bernoulli(0.5),  X ~ Bernoulli(0.5),
Z ~ Bernoulli(logit^{-1}(-2 + 2X + 2Y)),
R ~ Bernoulli(logit^{-1}(-4 + 3Z)).

The following table shows results from attempts to estimate the coefficient on X with a logistic regression, over 1,000 simulated samples:

Specification                                               Mean coef.   SD of coef.
(a) β0 + β1 X                                                  -0.58        0.11
(b) β0 + β1 X + β2 Z                                           -0.87        0.13
(c) β0 + β1 X, weighted by 1/logit^{-1}(α0 + α1 X + α2 Z)       0.00        0.21

Gaining efficiency

In the previous example, the standard deviation of the IPW estimate is large.
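For concreteness, the selection-bias example can be reproduced with a short simulation. This is a sketch under the assumption that the DGP sets X and Y as independent Bernoulli(0.5) draws, Z ~ Bernoulli(logit^{-1}(-2 + 2X + 2Y)), and R ~ Bernoulli(logit^{-1}(-4 + 3Z)); the fit_logit helper and all variable names are my own, and the weighted fit implements IPW by hand rather than calling any particular package.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def fit_logit(X, y, w=None, iters=25):
    """Weighted logistic regression via Newton-Raphson; returns coefficients."""
    if w is None:
        w = np.ones(len(y))
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

# Assumed DGP: X and Y independent, Z a common effect of both ("collider"),
# and the observation indicator R driven by Z.
Y = rng.binomial(1, 0.5, n)
X = rng.binomial(1, 0.5, n)
Z = rng.binomial(1, sigmoid(-2 + 2 * X + 2 * Y))
R = rng.binomial(1, sigmoid(-4 + 3 * Z))
obs = R == 1

ones = np.ones(n)
DX = np.column_stack([ones, X])        # design for Y ~ X
DXZ = np.column_stack([ones, X, Z])    # design for Y ~ X + Z and for R ~ X + Z

# (a) unadjusted logistic regression on the observed data: biased
b_unadj = fit_logit(DX[obs], Y[obs])[1]

# (b) adding Z as a regressor only worsens the bias
b_withz = fit_logit(DXZ[obs], Y[obs])[1]

# (c) IPW: estimate pi_i from R ~ X + Z on the full sample, weight by 1/pi_i
alpha = fit_logit(DXZ, R)
pi_hat = sigmoid(DXZ @ alpha)
b_ipw = fit_logit(DX[obs], Y[obs], w=1.0 / pi_hat[obs])[1]

print(round(b_unadj, 2), round(b_withz, 2), round(b_ipw, 2))
```

With a sample this large, the unadjusted coefficient settles near -0.58 and the IPW coefficient near the true value of 0, matching rows (a) and (c) of the table; across repeated draws the IPW estimate is also the more variable of the two, consistent with its larger SD.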
This shows the volatility of IPW estimates. IPW also discards incomplete data, and so it is inefficient relative to methods that use the incomplete data. Robins and colleagues (e.g., Bang and Robins, 2005) have proposed "augmented" IPW estimators that incorporate incomplete data to recover efficiency. The estimating equation can be augmented with any function of the data that has mean zero at the target estimate, without introducing bias. Thus,

$$
\sum_{i=1}^{n} \left[ \frac{R_i}{\pi_i} S(Y_i, X_i; \hat{\beta}) + \left(1 - \frac{R_i}{\pi_i}\right) \phi_i \right] = 0
$$

estimates the target parameter if $\phi_i$ is a function of the fully observed data with $E(\phi_i \mid \beta_0) = 0$. Note that the $1 - R_i/\pi_i$ term has expectation zero if the $\pi_i$ are accurate. The estimator is "doubly robust" in the sense that it estimates the target parameter if either the $\pi_i$ are estimated well or the assumptions on $\phi_i$ are accurate. Optimal $\phi_i$ functions are available to maximize efficiency (Robins and Rotnitzky, 1995; Tsiatis, 2006).

Carpenter et al. (2006) present an example of linear regression with three covariates, $(X_1, X_2, X_3)$, and missingness on $X_1$. They derive an augmented estimating equation with $S_i = X_i(Y_i - \beta' X_i)$ and $\phi_i = E[X_i(Y_i - \beta' X_i) \mid Y_i, X_{i2}, X_{i3}]$, where conditional expected values for $X_1$ are based on an assumption of multivariate normality. Semi-parametric estimation is carried out by maximizing a quasi-log-likelihood (McCullagh and Nelder, 1989: 323-328). The $\pi_i$ are estimated with logistic regression, and asymptotic parameter variances are derived via the M-estimator framework.
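To make double robustness concrete, the sketch below applies the augmented estimating equation to the simplest possible target, a population mean, taking $S_i = Y_i - \mu$ and $\phi_i = m(X_i) - \mu$; solving the equation for $\mu$ gives the closed form in aipw_mean. The DGP, the deliberately misspecified models, and all names are my own illustration, not Carpenter et al.'s example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative DGP (an assumption for this sketch): target E[Y] = 1,
# with Y missing at random given a fully observed covariate X.
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)
pi_true = sigmoid(0.5 + X)     # response probability rises with X
R = rng.binomial(1, pi_true)
obs = R == 1

def aipw_mean(pi, m):
    """Solves sum[(R_i/pi_i)(Y_i - mu) + (1 - R_i/pi_i)(m_i - mu)] = 0 for mu."""
    return np.mean(R * Y / pi + (1 - R / pi) * m)

# Outcome models m(X): a correct linear fit on complete cases,
# and a deliberately wrong constant.
A = np.column_stack([np.ones(n), X])
coef = np.linalg.lstsq(A[obs], Y[obs], rcond=None)[0]
m_right = A @ coef
m_wrong = np.full(n, Y[obs].mean())

# pi models: the truth, and a deliberately wrong constant.
pi_right = pi_true
pi_wrong = np.full(n, R.mean())

cc_mean = Y[obs].mean()                 # complete-case mean: biased upward
dr_both = aipw_mean(pi_right, m_right)  # both models correct
dr_pi = aipw_mean(pi_right, m_wrong)    # only the pi model correct
dr_m = aipw_mean(pi_wrong, m_right)     # only the outcome model correct
dr_none = aipw_mean(pi_wrong, m_wrong)  # both wrong: reduces to cc_mean

print(round(cc_mean, 2), round(dr_both, 2), round(dr_pi, 2),
      round(dr_m, 2), round(dr_none, 2))
```

Any specification with at least one correct model recovers the target of 1, while the complete-case mean is biased upward; with both models wrong, this particular estimator collapses exactly to the complete-case mean.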
The table below shows results from a simulation with this example, focusing on the coefficient estimate that suffered the most bias in complete-case OLS:

Method               π_i model   X_1 model   Avg se   Avg bias/Avg se
OLS                      -           -        0.03        -2.06
IPW                   Correct        -        0.03        -0.09
IPW                    Wrong         -        0.03        -2.06
Augmented IPW         Correct     Correct     0.02        -0.11
Augmented IPW          Wrong      Correct     0.02        -0.10
Augmented IPW         Correct      Wrong      0.03        -0.09
Multiple imputation      -        Correct     0.02        -0.08
Multiple imputation      -         Wrong      0.03         2.60

Conclusion

Weighting methods are robust, flexible, and efficient for dealing with missing data. Space limits prevent discussion of other benefits, such as the ready adaptability of augmented IPW for analyzing sensitivity to violations of conditional independence (Scharfstein et al., 1999). For nonmonotone missingness over many variables in a dataset, augmented IPW is intractable, and so multiple imputation must be preferred, though we must then accept substantial model dependence. Weighting is best for primary analyses when missingness on one or two variables poses substantial threats to validity.

References

Bang H, Robins JM. 2005. "Doubly robust estimation in missing data and causal inference models." Biometrics 61:962-972.
Carpenter JR, Kenward MG, Vansteelandt S. 2006. "A comparison of multiple imputation and doubly robust estimation for analyses with missing data." Journal of the Royal Statistical Society, Series A 169:571-584.
Hernan MA, Hernandez-Diaz S, Robins JM. 2004. "A structural approach to selection bias." Epidemiology 15:615-625.
Kang JDY, Schafer JL. 2007. "Demystifying double robustness." Statistical Science 22:523-539.
King G, Zeng L. 2006. "The dangers of extreme counterfactuals." Political Analysis 14:131-159.
Liang KY, Zeger SL. 1986. "Longitudinal data analysis using generalized linear models." Biometrika 73:13-22.
Little RA, Rubin DB. 2002. Statistical Analysis with Missing Data. New York: Wiley.
McCaffrey DF, Ridgeway G, Morral AR. 2004. "Propensity score estimation with boosted regression for evaluating causal effects in observational studies." Psychological Methods 9:405-425.
McCullagh P, Nelder JA. 1989. Generalized Linear Models, 2nd Ed. New York: Chapman and Hall.
Pearl J. 2000. Causality: Models, Reasoning, and Inference. New York: Cambridge University Press.
Robins JM, Rotnitzky A. 1995. "Semiparametric efficiency in multivariate regression models with missing data." Journal of the American Statistical Association 90:122-129.
Scharfstein DO, Rotnitzky A, Robins JM. 1999. "Adjusting for non-ignorable drop-out using semiparametric nonresponse models (with discussion)." Journal of the American Statistical Association 94:1096-1120.
Stefanski LA, Boos DD. 2002. "The calculus of M-estimation." The American Statistician 56:29-38.
Tsiatis AA. 2006. Semiparametric Theory and Missing Data. New York: Springer.