Download A Specification Test for Linear Regressions that use

A Specification Test for Linear Regressions that use King–Based Ecological Inference Point Estimates as Dependent Variables Michael C. Herron1 Kenneth W. Shotts2 EXTREMELY ROUGH DRAFT – COMMENTS WELCOME July 13, 2000 1 Department of Political Science, Northwestern University. Address: Scott Hall, 601 University Place, Evanston, IL 60208–1006. Email: [email protected]. Fax 847–491–8985. 2 Department of Political Science, Northwestern University. Address: Scott Hall, 601 University Place, Evanston, IL 60208–1006. Email: k–[email protected]. Fax: 847–491–8985 Abstract Many researchers use point estimates produced by the King (1997) ecological inference technique as dependent variables in second stage linear regressions. We show, however, that this two stage procedure is at risk of logical inconsistency. Namely, the assumptions necessary to support the procedure’s first stage (ecological inference via King’s method) can be incompatible with the assumptions supporting the second (linear regression). We derive a specification test for logical consistency of the two stage procedure and describe options available to a researcher whose ecological dataset fails the test. 1 Introduction Many researchers use point estimates produced by the King (1997) ecological inference technique as dependent variables in second stage linear regressions. King advocates this two stage statistical procedure, which we call EI–R, saying that “[it] has enormous potential for uncovering new information” (p. 279). Burden and Kimball (1998), for example, use King–based estimates of district–level ticket splitting rates as dependent variables in second stage regressions which examine why individuals cast split ticket ballots. Similarly, Gay (1997) employs estimates of Black turnout rates, produced by King’s ecological inference technique, in regressions that seek to determine whether black citizens vote at higher rates in elections that include black Congressional candidates. Other examples of EI–R can be found in Kimball and Burden (1998), Voss and Lublin (1998), Cohen, Kousser and Sides (1999), Wolbrecht and Corder (1999), Lublin (2000), and Voss and Miller (2000). As evidenced by the multiple citations above, EI–R has achieved widespread usage despite the fact that it was only recently proposed. We show, however, that applications of this two stage procedure can be logically inconsistent in the sense that the assumptions necessary to support the procedure’s first stage (ecological inference via King’s method) can be incompatible with the assumptions supporting the second (linear regression). In contrast, we say that an application of EI–R is logically consistent when its first stage and second stage assumptions are not mutually contradictory. The possibility that the assumptions which support an application of EI–R contradict one another merits consideration because productive use of a statistical technique requires adopting the technique’s underlying assumptions. That is, to use any statistical technique a researcher must ultimately rely on its assumptions if he wishes to argue that the estimators produced by the technique have desirable properties such as consistency, asymptotic normality, and so forth. Moreover, and this is the key point which drives the results presented in this paper, when two statistical techniques are combined as in the case of EI–R a researcher must simultaneously adopt the assumptions that support the technique’s first and second stages. Previous work on EI–R has not sought to analyze whether its two sets of underlying assumptions – those associated with first stage ecological inference and those with second stage linear regression 1 – can be simultaneously adopted. In fact, to the best of our knowledge all published and working– paper implementations of EI–R simply assume without consideration that the procedure’s first and second stages are compatible, i.e. that EI–R is logically consistent. Nonetheless, we provide an analysis of the compatibility of EI–R’s two sets of assumptions and in so doing address the issue of aggregation bias in ecological data. In particular, our analysis is based on the fact that standard implementations of King’s technique (the first stage of EI–R) assume that there is no aggregation bias. Nonetheless, we show that it is often the case that linear regressions which use King–based point estimates as dependent variables (the second stage of EI–R) imply the existence of aggregation bias. As an illustration of this clash of assumptions, consider a classic case of ecological inference: a researcher wants to estimate and explain disaggregated turnout rates for African–American and white voters. Yet, suppose that she has only precinct–level racial composition and turnout data. To apply King’s technique as a component of EI–R to this turnout problem, the researcher must assume that there is no aggregation bias in the turnout dataset, i.e., that black turnout is uncorrelated with the fraction of a precinct’s population that is African–American. Now, suppose that such a researcher estimates a second stage regression in which King–based estimates of black turnout rates are assumed to be a linear function of precinct–level income and education levels and other politically–relevant independent variables. If any of these right hand side variables are correlated with the fraction in a precinct’s population that is African-American, then the researcher’s second stage regression model virtually guarantees (in a sense that we make precise later) that black turnout is correlated with the fraction of a precinct’s population that is AfricanAmerican, i.e., that there is aggregation bias. Thus, in this example the assumptions behind the second stage of EI–R contradict those in its first stage. This is problematic at two levels. At a fundamental level, any standard philosophy of science requires researchers to adopt internally consistent sets of assumptions. To do otherwise, as is often implicitly done in EI–R, is to simultaneously state “A is true” and “A is false.” Once two mutually contradictory premises are adopted, any implication can be logically derived, i.e., all subsequent analysis is meaningless. 2 At a more pragmatic level, logical inconsistency of EI–R is problematic because, in the presence of aggregation bias, King’s ecological inference technique produces statistically inconsistent estimates of all relevant population parameters.1 And, when all first stage point estimates are inconsistent, second stage regression results – that is, EI–R’s final results – will also be statistically inconsistent in ways that cannot be readily rectified. An estimator that is statistically inconsistent is generally useless because it does not converge in probability to its true value; rather, it tends to an incorrect value as the sample on which it is based gets large. Given the problems which result if EI–R is logically inconsistent, it is important for researchers to determine whether a particular application of the technique is problematic. Fortunately, it turns out that not all implementations of EI–R are logically inconsistent. Rather, we show that logical consistency of EI–R is application–dependent, i.e., EI–R is logically consistent in some empirical applications yet logically inconsistent in others. Thus, we propose a simple specification test for logical consistency of EI–R. The test is implemented via a single linear regression using data from both the first and second stages of a proposed application of EI–R. Interpreting the results of the specification test then requires consideration of only a single F statistic, the well–known and popular statistic which tests the null hypothesis that all slope coefficients in a linear regression are zero. If the F statistic produced by an application of the test to a given dataset is statistically significant, then EI–R is logically inconsistent when applied to this dataset and should not be used. If, on the other hand, the F value is not significant, then logical consistency of the application of EI–R cannot be rejected. It is worth noting that, even in an application where EI–R is logically consistent, second stage regression results will almost certainly be statistically inconsistent. This is because errors in King– based estimates of disaggregated quantities are inversely correlated with the true parameter values that the estimates measure. Fortunately, it is possible to correct for this statistical inconsistency so long as EI–R is logically consistent (Herron and Shotts 2000). However, when the test for logical consistency of EI–R rejects in a given ecological dataset, thus implying the presence of aggregation bias, a researcher who wants to study the dataset has only two 1 See Tam (1998) for a thorough discussion of the effects of aggregation bias on King’s technique. 3 options. One, she can abandon EI–R entirely and choose a different method of analysis. Or, two, she can explicitly model the aggregation bias in the first stage of the EI–R procedure. We discuss this latter option in some detail. We emphasize that our critique of EI–R and King’s ecological inference technique is fundamentally different from the argument in Tam (1998) and Cho and Gaines (2000), both of which claim that aggregation bias is a common phenomenon and that King’s diagnostics provide little help in identifying and rectifying it. Our analysis, in contrast, is not based on a claim that the assumptions supporting King’s ecological inference technique are unwarranted, either in general or in a particular empirical application. Rather, our critique applies only to the use of King’s technique within the context of the combined EI–R procedure, and what we argue is that the assumptions which support EI–R must be compatible across EI–R’s two stages. In Section 2 we provide notation and details on EI–R. Then, in Section 3 we present our specification test for logical consistency of EI–R and explain how to interpret it. Section 4 considers the options faced by a researcher whose ecological dataset fails the test, and in Section 5 we apply the test to the ecological dataset used in Burden and Kimball (1998). Section 6 offers general comments on EI–R and concludes. 2 Overview of the EI–R Procedure The first stage of EI–R consists of an application of King’s ecological inference technique, and the second stage requires estimating a linear regression based on first stage output. Thus, we now provide some details on ecological inference and King’s technique in particular and then explain how the output of the technique is used in second stage regressions. Suppose we have precinct–level data for an election of interest: let denote precinct ’s turnout rate in the election and The goal of ecological inference is to use and , , the fraction of Blacks in precinct . to estimate , the fraction of Blacks in precinct who voted in the election, and , the corresponding fraction of Whites who voted. Achen and Shively (1995) discuss the history of the so–called ecological inference problem and review various solution attempts. 4 King distinguishes between two different implementations, standard and extended, of his ecological inference technique. The precise details of these implementations can be found in Chapter 6–8 of King’s book. Critiques and discussions of King’s approach to ecological inference can be found in Freedman, Klein, Ostland and Roberts (1998), Tam (1998), and Cho and Gaines (2000). Until further notice our comments on King’s technique apply to its standard implementation or what we also call the standard model. have a truncated bivariate normal distribu , standard deviations and correlation . In tion on the unit square with means King’s notation, these parameters are collected in a 5–vector . Notably, The standard implementation assumes that and are fixed across , and this is the key feature of the standard model. where denotes covariance. Furthermore, the standard model assumes the two means This is equivalent to assuming there exists no aggregation bias, i.e. there is no systematic relationship between , the fraction of Blacks in precinct who voted, and , the fraction of Blacks in precinct .2 Let , , , , and denote the point estimates of , , , , and , respectively, based on applying King’s standard model to an ecological dataset.3 Corresponding to the vector , the five point estimates are collected in a 5–vector one of which is that there is no aggregation bias – . Under standard model assumptions – recall, as tends to infinity where ! is a consistent estimate of , and a similar statement applies to the other four point estimates in . In general, then, " . Let be the point estimate of produced by applying King’s technique to an ecological dataset and , ( is a function of , , and ). In EI–R, the point estidenotes convergence in probability. In other words, mates are used as dependent variables in a second stage regression. Namely, King suggests that a &()+ ' * In. addition, King’s technique assumes that there is no spatial autocorrelation, that #$ 2 3 and #% are independent for King’s ecological inference technique can be estimated using standard maximum likelihood theory. Or, optional priors can be used in which case King’s model yields a Bayesian framework. Both of these implementations of King’s technique are compatible with the results described here. 5 researcher who in theory might want to estimate (1) (2) instead estimate where the difference between equation (1) and (2) is that served) in the latter. Here, is a constant and (observed) substitutes for (unob a –vector of parameters to be estimated, is a –vector of exogenous variables, and is a disturbance term assumed to be uncorrelated with and . Let and denote the least squares estimates of and , respectively.4 Thus, the EI–R procedure consists of, first, estimating technique and, second, estimating and using King’s ecological inference with least squares applied to equation (2). 5 The final product of EI–R, therefore, consists of the estimates and , and, ultimately, whether EI–R is statistically consistent in an overall sense depends on whether ! and " . On the other hand, what can be thought of as first stage statistical consistency depends on whether where the –vector ! , as noted previously, is the first stage estimate produced by applying King’s standard model to an ecological dataset. When EI–R is logically consistent then its first and second stage assumptions do not contradict one another. Hence, when EI–R is logically consistent then there is no aggregation bias and EI–R’s first stage is statistically consistent ( stances " and " where ). In this case, except in very unusual circum- denotes a lack of convergence in probability (Herron and Shotts 2000). In other words, EI–R is in general statistically inconsistent even when it is logically even when the first stage of EI–R is statistically consistent. consistent, and this is because 4 $ $ $ $ Equations (1) and (2) could just as easily have had and on their left hand sides. All of our results apply in a symmetric fashion to such a case. Similarly, the results apply to regressions with and as left hand side variables as well. 5 King suggests a weighted least squares approach when estimating equation (2). Weighting, however, is not germane to our discussion of logical inconsistency. Furthermore, Herron and Shotts point out that there is no theory which says that weighted least squares estimation of equation (2) is superior to ordinary least squares estimation. 6 However, Herron and Shotts describe manipulations of of the population parameters and and that produce consistent estimates in equation (1) based on inconsistent estimates produced by equation (2). Overall, then, when EI–R is logically consistent its first stage is statistically consistent and, despite the fact that ! and consistent second stage estimates of " , it is possible to manipulate and . ! These manipulations are possible only when and so as to generate , i.e. only when first stage ecological inference estimates are statistically consistent. This is a key point for the following reason: when EI–R is logically inconsistent, then its first stage is statistically inconsistent ( " ! ) and EI– R completely falls apart.6 In particular, as far as are aware, no useful manipulations of and are possible when EI–R is logically inconsistent. In this latter situation, these point estimates are hopelessly statistically inconsistent and this cannot be rectified. 3 The Logical Consistency Test for EI–R Recall that EI–R’s first stage assumption of no aggregation bias is mathematically equivalent to . However, second stage regression implies from equation (1). Hence, for EI–R to be logically consistent across its first and second stages, it must be true that where , is the th element of the parameter vector and where is the th element of the vector . This derivation forms the basis of the following result: 6 King (1997) argues in Ch. 11 that his technique sometimes produces accurate estimates of even in the presence of aggregation bias. However, Tam (1998) argues that, in such situations, the technique produces inaccurate estimates. In our opinion, the fact that an inconsistent estimator happens to be on target in some applications does not constitute adequate justification for adopting it. Namely, inconsistent estimators are guilty until proven innocent, and, at the time of this writing, the value of King’s technique in the presence of aggregation bias remains uncertain. 7 , , and , , is Proposition 1 Suppose that an ecological dataset, consisting of given. Let be the true second stage parameter vector from equation (1). When it is not the case that (3) then EI–R as applied to the given dataset is logically inconsistent. Proposition 1 places a restriction on the parameter vector that must be satisfied for EI–R to be logically consistent when applied to a given ecological dataset. When the restriction is violated, then EI–R is logically inconsistent. Moreover, as described earlier, logical inconsistency of EI–R implies first stage statistical inconsistency ( in and ) and correcting resulting second stage inconsistencies is not possible. Suppose that, for a given ecological dataset, the covariances which appear in equation (3) all happen to be exactly zero. Then, any –vector , regardless of its elements, is consistent with the restriction in Proposition 1. This is because the elements , when substituted into equation (3), will be multiplied by zero and hence have no impact on whether the left hand side of the equation is truly zero. In other words, when all covariances in equation (3) are zero it is not possible for the elements of to violate the restriction in Proposition 1 and in so doing cause rejection of logical consistency of EI–R. On the other hand, if some of the terms are not zero, then the set of which satisfy the restriction in Proposition 1 is not equivalent to the set of all possible vectors vectors. , For example, suppose that so that is a 2–vector, , and . Then, for EI–R to be logically consistent it must be the case that or, equivalently, the . This example illustrates in a simple manner the fact that, when some of terms are non–zero, elements of the vector will have to satisfy a non–trivial restriction for EI–R to be logically consistent. Thus, we can think of Proposition 1 as having two cases. In one case – when all covariances in equation (3) are zero – there is no restriction on . In the other case – when at least one of 8 the terms is non–zero – the restriction in equation (3) is meaningful and logical consistency of EI–R depends on whether satisfies it. When Proposition 1 imposes a meaningful restriction on , we can enquire as to the nature of the restriction. Is it meaningful yet weak – for example, requiring that tight – for example, requiring, say, that for all – or is it for all ? If the former, then we would say that the restriction is relatively easy to satisfy and hence there are no compelling reasons to reject logical consistency of EI–R based solely on the fact that On the other hand, if the restriction on must satisfy the restriction. is very tight, then it is certainly possible that most vectors will fail to satisfy it. From this it would follow that EI–R is unlikely to be logically consistent. Similarly, if given an ecological dataset the associated restriction on is so tight that is seems unimaginable that it can be satisfied, then we would commensurately argue that, in all likelihood, EI–R is logically inconsistent when applied to the said dataset. We now develop this idea of a very tight restriction on , a restriction that seems virtually impossible to satisfy. In particular, our conceptualization of such a restriction is one that will be satisfied with probability zero if is drawn randomly from a continuous distribution. To motivate this conceptualization, consider the following example. Let and suppose as before that logical . If is drawn from a continuous distribution on , then the probability that is zero since a line has zero probability measure on a two–dimensional plane. Hence, we say that the restriction is very tight and, when consistency of EI–R requires logical consistency of EI–R depends on it, we conclude that EI–R will in all likelihood be logically inconsistent. In general, when the –vector on , then the probability that of dimension is zero. Thus, Definition 1 A –vector elements of must lie in an is randomly drawn from a continuous probability distribution elements of the vector, with , lie a given hyperplane from equation (1) is said to be non–generic if, for , dimensional hyperplane. Definition 1 codifies the idea of a very tight restriction in the sense that when it must satisfy such a restriction. 9 is said to be non–generic For example, suppose that implies that two ( is a –vector ( ) and that a restriction from equation (3) ) of its elements must lie in a one ( ) dimensional hyperplane. (This is a complicated way of saying that these two elements must lie on a line.) In general, the probability that two continuously distributed random variables fall exactly on a given line is zero. Hence, if all elements of are drawn from a continuous distribution on , the probability that the two key elements satisfy the restriction of being on a given line is zero. Of course, in actual applications of EI–R it is not the case that is drawn from a continuous distribution prior to EI–R’s being employed. Rather, according to standard frequentist logic, estimating a model like equation (2) is equivalent to supposing that there exists a true and that this value is fixed albeit unknown. Nonetheless, the idea behind a restriction which leads to a non–generic is that the restriction is so tight that we should expect it never to hold. Hence, if we conclude that a given application of EI–R is logically inconsistent unless is non–generic, then we have effectively concluded that the application is in all likelihood logically inconsistent. In particular, we would say in this instance that EI–R is generically logically inconsistent. Definition 2 An application of EI–R to an ecological dataset is said to be generically logically inconsistent if the restriction in Proposition 1 requires It is straightforward to see that, if any of the then the to be non–generic. terms in equation (3) are non–zero, vector which satisfies the resulting restriction is non–generic. If, say, only one covariance is non–zero, then the associated element of , that which multiplies the non–zero covariance, will itself have to be exactly zero. If not, the restriction in equation (3) will not be satisfied unless exactly offsets the product of the non–zero covariance and the non–zero element of . This sort of offsetting will occur only by complete coincidence. Furthermore, the probability that a continuously distributed random variable on Hence, when only a single is exactly zero is itself a probability zero event. term is zero, is non–generic. We have now argued the following. Proposition 2 Let where be the true parameter vector from equation (1). If there exists a such that , then is non–generic. 10 Hence, whenever an ecological dataset features at least a single, non–zero EI–R as applied to the dataset is generically logically inconsistent. The result in Proposition 2 is, importantly, independent of whether term, is zero. In fact, this latter covariance only makes matters worse from the perspective of satisfying the restriction in equation (3). Namely, all terms can be zero but the restriction can still be violated if . Hence, by ignoring this latter covariance our test for logical consistency of EI–R is conservative.7 The matter of whether a proposed application of EI–R is generically logically inconsistent, then, turns on whether there exists a estimated by regressing on the term that is non–zero. Such covariance terms can be vector. Then, if the standard F test for all slope coefficients being zero rejects at a conventional confidence level, it follows that EI–R is generically logically inconsistent. Thus, the hypothesis test for logical consistency of EI–R requires nothing more than estimating a single regression with data that EI–R itself requires in the first place. This regression can, and should, be run before any application of EI–R. The intuition behind our regression–based test for logical consistency in EI–R is as follows. Recall that King’s ecological inference technique requires that there is no aggregation bias, i.e. that is a function of covariates collected in a vector and, if some of the elements in are correlated with , then in general to be zero. Thus, when the test for logical consistency of EI–R it is not possible for . But, if a second stage regression assumes that rejects, it follows that there is aggregation bias in the first stage and EI–R in this situation completely breaks down. The matter of non–rejection of logical consistency of EI–R is an important one. The test we propose for logical consistency is one of necessity as opposed to one of sufficiency. That is, an application of EI–R must pass our test; if not, it follows that the application is logically inconsistent. However, if an application of EI–R does pass the test, it does not therefore follow that the application 7 Our discussion of restrictions and logical consistency of EI–R began with the premise that an ecological dataset is given, and then we considered restrictions on such that equation (3) holds. We could, alternatively, have started from the premise that the elements of were given and asked about the resulting ecological dataset which would satisfy the equation. From this latter perspective, when any elements of are non–zero, the set of covariances which satisfy equation (3) will be non–generic. Hence, it follows that EI–R as applied to the dataset will be generically logically inconsistent. 11 is logically consistent. Rather, it is still possible that the application is logically inconsistent in some manner not covered by our test.8 Or, the failure to reject could simply reflect that fact that our test does not have a power of one. For the remainder of this paper, we use the term “logically inconsistent” in place of “generically logically consistent” as the latter is tedious. Similarly, we refer to our test as a test for logical consistency even though, technically, it is a test for generic logical consistency. 4 What to do when an Ecological Dataset Fails the Specification Test When a researcher’s ecological dataset fails the test for logical consistency of EI–R, she has two options. First, she can abandon EI–R entirely and choose another method of data analysis. Second, she can model the aggregation bias that caused test failure in the first place and in so doing avoid a second stage regression. Modeling aggregation bias in the first stage requires what King calls his extended model. Recall that in the standard model the mean of is , and this mean vector is assumed to be fixed across observations. However, in the extended model the two elements of this vector are not restricted in this way. Rather, in the extended model the means are denoted where the means are dependent on . Readers interested in the details of the extended model should consult King (1997, pp. 168–74). In brief, this model posits that is a linear function of exogenous covariates. That is, the extended model assumes something akin to estimated and where is a vector of parameters that must be a vector of covariates that includes a constant. A product of the extended model, then, is an estimate of the parameter vector . For numerical reasons (see p. 170) the extended model does not assume the precise form noted above for . However, the above expression for captures the essence of the extended model. Thus, suppose that a researcher has an ecological dataset and on it carries out the test for logical consistency of EI–R. Suppose as well that the test rejects. Since it consequently follows that there $ 8 For example, logical consistency of EI–R will impose certain restrictions on the distribution of . Although we do not explore the nature of these restrictions, it seems quite plausible that these restrictions are extremely tight. 12 is aggregation bias in the ecological dataset, the hypothetical researcher would then include as extended model covariates all elements of and subsequently would not estimate a second stage regression at all. In other words, when the specification test rejects logical consistency of EI–R, then the two stage statistical procedure should not be used. Rather, an abbreviated version – consisting solely of an application of King’s extended model – should replace it. To summarize, the procedure we propose for EI–R is as follows. 1. Collect an ecological dataset consisting of 2. Regress on the elements of the vector , , and , . and consider the F test that all slope coefficients are zero. Then, (a) If the F statistic does not reject, implement EI–R using the method described in Herron and Shotts (2000). (b) If the F statistic rejects, apply King’s extended model using all elements of ates for parameterizing the means and of and as covari- , respectively. The ultimate objective of EI–R is determining the extent to which elements of the exogenous covariate vector are related to . When standard EI–R is employed because the test for logically consistency did not reject, inference on the impact of sion results, specifically, on confidence intervals for can be based on second stage regres- –related slope coefficients produced by the method described by Herron and Shotts. That is, the Herron and Shotts method produces consistent estimates and valid confidence intervals for and , and these intervals can be used to conduct inference in the standard way. On the other hand, when King’s extended model is employed as the first (and only) component of an abbreviated EI–R, assessing the impact of on requires considering first stage standard errors produced by King’s technique. Namely, for each covariate used to parameterize the mean of in the extended model, King’s method produces a coefficient estimate and an associated t–statistic. Standard significance tests using these t–statistics (one for each element of the basis of determining whether the elements of 13 ) can form are related to through its mean . Suppose, for example, that contains only a single non–constant covariate and that, based on failure of the logical consistency test, this covariate is included in an application of King’s extended model. Associated with the covariate will be two coefficient estimates, one which relates the covariate to the mean and one to the mean . Depending on what a researcher is interested in and in her problem – studying – in other words, depending on the substantive meanings of she would consult one or perhaps both of these estimates and, using estimated standard errors, form t–statistics and carry out hypothesis tests in the usual fashion. A researcher may be tempted to include all or some elements of in both an application of King’s extended model as well as in a second stage regression. This practice – an example follows shortly – should be avoided as it can lead to internal incoherence. In other words, either King’s standard model should be used followed by a second stage regression or King’s extended model should be employed and there should be no second stage regression. Mixing the extended model with a second stage regression can lead to serious problems. Suppose, to illustrate this point, that a researcher gathers an ecological dataset and on it carries out the test for logical consistency of EI–R. Suppose as well that the test rejects and that only one coefficient estimate from the test’s regression is individually significant at, say, the level. The researcher might want to include the variable associated with the coefficient (that is, one element of the vector) in an extended King model and then estimate a second stage regression using the full vector. In such a case, the one variable which is significantly related to appear twice in the EI–R procedure insofar as it influences the mean EI–R and then influences in the specification test would of in the first stage of again in the second stage. Yet, both estimates of the impact of the variable are based on a single ecological dataset. Thus, the variable’s appearing twice means that there will be two separate and dependent estimates of the impact of the variable on . Simply put, it is impossible to say how the two associated t–statistics should be combined and interpreted. It is conceivable, for instance, that the first stage estimate of the impact of the covariate on the mean of can be statistically insignificant while the second stage estimate of the impact of the covariate is significant. When all the covariates are used only once, either in a second stage regression or in 14 an application of King’s extended model, this type of incoherence can be avoid. Finally, a practice to which special attention is warranted is the inclusion of the vector of exogenous covariates as an element of (e.g. Gay 1997, Lublin 2000). This practice – which assumes is a function of – guarantees that EI–R is logically inconsistent. Thus, if a researcher believes that is dependent on , then the latter should be included as a parameter in explicitly that the first stage ecological inference step along with all other exogenous variables of interest. In this situation, no second stage regression will be estimated. 5 An Application of the Test for Logical Consistency in EI–R We now apply the test for logically consistency in EI–R to the ecological dataset used by Burden and Kimball (1998) in their published study of ticket–splitting in the United States Congressional and Presidential elections of 1992.9 We obtained the Burden and Kimball data and divided it into House and Senate sections.10 For Burden and Kimball’s House data, the variable represents the proportion of individuals in Congressional district who voted for Michael Dukakis in the 1992 presidential race. And, is the fraction who voted for a Democratic Congressional candidate as well as for Dukakis, and is fraction who voted for is the fraction who voted for the Democratic House candidate, a Democratic Congressional candidate yet also voted for Bush. Thus, in district ticket–splitters is . the fraction of Furthermore, for the House analysis there are a total of four elements of the vector . These elements consist of an indicator as to whether district had a Democratic incumbent (“Democratic Incumbent”), the Democratic candidate’s proportion of House campaign spending (“Spending Ratio”), whether district allowed straight–party votes (“Ballot Format”), and whether district was located in the South (“South”). In accordance with the test for logical consistency of EI–R, we regressed on the elements of and a constant. Resulting coefficient estimates and the overall F 9 Cho and Gaines (2000) identify several errors in Burden and Kimball’s dataset, but we use the original dataset to maintain compatibility with the published study. Our critique of Burden and Kimball is quite different from that offered by Cho and Gaines, who argue, among other things, that King’s diagnostics indicate the presence of aggregation bias in the data set. 10 The data is available from the ICPSR and in particular from ftp://ftp.icpsr.umich.edu/pub/PRA/outgoing/s1140. 15 Table 1: The Test for Logical Consistency in EI–R Applied to Burden and Kimball (1998) Variable Constant House Estimate Ideological Distance – Senate Estimate Democratic Incumbent Spending Ratio Ballot Format South 361 F 54.9 Note: t–statistics in parentheses 32 2.42 statistic are in Table 1.11 In Burden and Kimball’s Senate data the variables noted above refer to states rather than Congressional districts, but the Senate vector contains one additional covariate (“Ideological Dis- tance”) compared to the House version of . This latter covariate measures the ideological distance between George Bush and the Democratic Senate candidate in state . Senate results are in Table 1, and all t–statistics in the table were calculated using heteroskedastic–consistent standard errors. Before examining the results in Table 1 we consider at an intuitive level whether we expect be correlated with the elements of the voters in district (or state) and vector. Recall that to represents the fraction of Dukakis– includes, among other things, a South indicator variable. Given all that is known about Democratic politics in the South, it would be hard to imagine that is not correlated with whether district is in the south. With respect to the House regression model, the F statistic for the hypothesis test that all slope coefficients are zero is . The associated p–value is approximately zero. This means that EI–R, 11 $ $ $ Since Burden and Kimball try to explain ticket–splitting rates, their second stage dependent variables are as opposed to and . See fn. 4. 16 $ and consistency of EI–R in the Senate case is certainly not a foregone conclusion. Our results on Burden and Kimball’s dataset raise an important question: are our results on Burden and Kimball’s work anomalous? On one hand, without an exhaustive study of numerous research programs that use second stage regressions we will never know precisely how many published applications of EI–R are problematic. On the other hand, there are many instances in which it is hard to imagine that elements of the exogenous covariate vector yet unrelated to . For example, in Cohen, are related to Kousser and Sides’s (1999) study of crossover voting in state legislative elections in Washington and California, is the fraction of individuals in district who are Democrats.13 Moreover, one of the elements in the associated vector is an indicator that measures whether there is a Democratic incumbent in district . It is possible, of course, that Democratic incumbency in district is unrelated to whether district has a large number of Democrats in it. However, this seems unlikely to be true. This reasoning, then, suggests that logical consistency in Cohen, Kousser and Sides’s application of EI–R should be questioned. 6 Discussion As we noted in the introduction, the number of research articles which use EI–R is not matched by a parallel number of analyses of EI–R as a statistical technique. In fact, as far as we know, this paper and Herron and Shotts (2000) are the only formal treatments of EI–R and, more broadly, the only treatments of regression models that use ecological inference point estimates as dependent or independent variables. It seems fair to say that the conclusions of Herron and Shotts and the present paper are not optimistic. Namely, Herron and Shotts show that EI–R’s final point estimates are statistically inconsistent even when EI–R is itself logically consistent. And, in the present paper we have proposed a test for logical consistency of EI–R and applied it to the House and Senate datasets used in Burden and Kimball (1998). The test rejects logical consistency in the House case and comes extremely close to rejecting for the Senate. We suspect that these conclusions are not anomalous and that 13 Cohen, Kousser and Sides’s notation uses $ instead of 18 $ and $ instead of # $. many applications of EI–R are logically inconsistent. Moreover, we pointed out that, when the test for logical consistency of EI–R rejects, a second stage regression should not be estimated. Rather, covariates that would have been included in such a regression should instead find themselves in an application of King’s extended model, that is, in an abbreviated version of EI–R. Thus, if our critique of EI–R is taken seriously, we should expect to see numerous researchers who wish to estimate second stage regressions instead using King’s extended model and eschewing second stage regressions entirely. This, however, raises an important issue about EI–R and the enterprise of using exogenous covariates in ecological inference. Rivers (1998) notes that King’s extended model is fragile in the sense of being barely identified. The importance of this point is as follows: if it is difficult to identify using King’s extended model the impact of exogenous covariates on unknown disaggregated quantities, then it should be commensurately difficult to use a second stage regression to do this. However, second stage regressions are easily identified and they make it appear as if identifying the impact of exogenous covariates on disaggregated quantities is a simple matter. However, it should not be forgotten that second stage regressions are identified precisely because they treat ecological inference point estimates ( , in our notation) as fixed. But, such estimates are not fixed. Instead, they are functions of estimated bivariate normal parameters ( ). What this means is that researchers must tread extremely carefully when working with exogenous covariates and ecological inference. When a researcher estimates an extended King model with numerous covariates, she must ensure that her resulting point estimates as to the impact of her covariates are stable. If they are not, then this indicates potential identification problems. Models that are poorly identified must be flagged as such, as their results are potentially meaningless. References Achen, Christopher H. and W. Phillips Shively. 1995. Cross–Level Inference. Chicago: University of Chicago Press. Burden, Barry C. and David C. Kimball. 1998. “A New Approach to the Study of Ticket Splitting.” 19 American Political Science Review 92(3):533–44. Cho, Wendy K. Tam and Brain J. Gaines. 2000. “Reassessing the Study of Split–Ticket Voting.” Working Paper, University of Illinois–Champaign. Cohen, Jonathan, Thad Kousser and John Sides. 1999. “Sincere Voting, Hedging, and Raiding: Testing a Formal Model of Crossover Voting in Blanket Primaries.” Paper presented at the annual meeting of the American Political Ccience Association, Atlanta, GA. Freedman, D.A., S.P. Klein, M. Ostland and M.R. Roberts. 1998. “On ‘Solutions’ to the Ecological Inference Problem.” Journal of the American Statistical Association 93(444):1518–22. Gay, Claudine. 1997. Taking Charge: Black Electoral Success and The Redefinition of American Politics PhD thesis Harvard University. Herron, Michael C. and Kenneth W. Shotts. 2000. “Using Ecological Inference Point Estimates in Second Stage Linear Regressions.” Working Ppaer, Northwestern University. Kimball, David C. and Barry C. Burden. 1998. “Seeing the Forest and the Trees: Explaining Split–Ticket Voting within Districts and States.” Paper presented at the annual meeting of the American Political Ccience Association, Boston, MA. King, Gary. 1997. A Solution to the Ecological Inference Problem. Princeton, NJ: Princeton University Press. Lublin, David I. 2000. “Context and Francophone Support for Quebec Sovereignty.” Working Paper, American University. Rivers, Douglas. 1998. “Review of ’A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data’.” American Political Science Review 92(2):442. Tam, Wendy K. 1998. “Iff the Assumption Fits...: A Comment on the King Ecological Inference Solution.” Political Analysis 7:143–63. 20 Voss, D. Stephen and David Lublin. 1998. “Ecological Inference and the Comparative Method.” APSA-CP: Newsletter of the APSA Organized Section in Comparative Politics 9(1):25–31. Voss, D. Stephen and Penny Miller. 2000. “The Phantom Segregationist: Kentucky’s 1996 Desegregation Amendment and the Limits of Direct Democracy.” Working Paper, University of Kentucky. Wolbrecht, Christina and J. Kevin Corder. 1999. “Gender and the Vote, 1920–1932.” Paper presented at the annual meeting of the American Political Ccience Association, Atlanta, GA. 21

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download A Specification Test for Linear Regressions that use