Download A Specification Test for Linear Regressions that use

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Choice modelling wikipedia , lookup

Instrumental variables estimation wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Linear regression wikipedia , lookup

Regression analysis wikipedia , lookup

Coefficient of determination wikipedia , lookup

Transcript
A Specification Test for Linear Regressions that use
King–Based Ecological Inference Point Estimates as
Dependent Variables
Michael C. Herron1
Kenneth W. Shotts2
EXTREMELY ROUGH DRAFT – COMMENTS WELCOME
July 13, 2000
1
Department of Political Science, Northwestern University. Address: Scott Hall, 601 University Place,
Evanston, IL 60208–1006. Email: [email protected]. Fax 847–491–8985.
2
Department of Political Science, Northwestern University. Address: Scott Hall, 601 University Place,
Evanston, IL 60208–1006. Email: k–[email protected]. Fax: 847–491–8985
Abstract
Many researchers use point estimates produced by the King (1997) ecological inference technique
as dependent variables in second stage linear regressions. We show, however, that this two stage
procedure is at risk of logical inconsistency. Namely, the assumptions necessary to support the procedure’s first stage (ecological inference via King’s method) can be incompatible with the assumptions supporting the second (linear regression). We derive a specification test for logical consistency
of the two stage procedure and describe options available to a researcher whose ecological dataset
fails the test.
1 Introduction
Many researchers use point estimates produced by the King (1997) ecological inference technique as dependent variables in second stage linear regressions. King advocates this two stage
statistical procedure, which we call EI–R, saying that “[it] has enormous potential for uncovering
new information” (p. 279). Burden and Kimball (1998), for example, use King–based estimates
of district–level ticket splitting rates as dependent variables in second stage regressions which examine why individuals cast split ticket ballots. Similarly, Gay (1997) employs estimates of Black
turnout rates, produced by King’s ecological inference technique, in regressions that seek to determine whether black citizens vote at higher rates in elections that include black Congressional
candidates. Other examples of EI–R can be found in Kimball and Burden (1998), Voss and Lublin
(1998), Cohen, Kousser and Sides (1999), Wolbrecht and Corder (1999), Lublin (2000), and Voss
and Miller (2000).
As evidenced by the multiple citations above, EI–R has achieved widespread usage despite the
fact that it was only recently proposed. We show, however, that applications of this two stage procedure can be logically inconsistent in the sense that the assumptions necessary to support the procedure’s first stage (ecological inference via King’s method) can be incompatible with the assumptions
supporting the second (linear regression). In contrast, we say that an application of EI–R is logically
consistent when its first stage and second stage assumptions are not mutually contradictory.
The possibility that the assumptions which support an application of EI–R contradict one another merits consideration because productive use of a statistical technique requires adopting the
technique’s underlying assumptions. That is, to use any statistical technique a researcher must ultimately rely on its assumptions if he wishes to argue that the estimators produced by the technique
have desirable properties such as consistency, asymptotic normality, and so forth. Moreover, and
this is the key point which drives the results presented in this paper, when two statistical techniques
are combined as in the case of EI–R a researcher must simultaneously adopt the assumptions that
support the technique’s first and second stages.
Previous work on EI–R has not sought to analyze whether its two sets of underlying assumptions
– those associated with first stage ecological inference and those with second stage linear regression
1
– can be simultaneously adopted. In fact, to the best of our knowledge all published and working–
paper implementations of EI–R simply assume without consideration that the procedure’s first and
second stages are compatible, i.e. that EI–R is logically consistent. Nonetheless, we provide an
analysis of the compatibility of EI–R’s two sets of assumptions and in so doing address the issue
of aggregation bias in ecological data. In particular, our analysis is based on the fact that standard
implementations of King’s technique (the first stage of EI–R) assume that there is no aggregation
bias. Nonetheless, we show that it is often the case that linear regressions which use King–based
point estimates as dependent variables (the second stage of EI–R) imply the existence of aggregation
bias.
As an illustration of this clash of assumptions, consider a classic case of ecological inference:
a researcher wants to estimate and explain disaggregated turnout rates for African–American and
white voters. Yet, suppose that she has only precinct–level racial composition and turnout data. To
apply King’s technique as a component of EI–R to this turnout problem, the researcher must assume
that there is no aggregation bias in the turnout dataset, i.e., that black turnout is uncorrelated with
the fraction of a precinct’s population that is African–American.
Now, suppose that such a researcher estimates a second stage regression in which King–based
estimates of black turnout rates are assumed to be a linear function of precinct–level income and
education levels and other politically–relevant independent variables. If any of these right hand side
variables are correlated with the fraction in a precinct’s population that is African-American, then
the researcher’s second stage regression model virtually guarantees (in a sense that we make precise
later) that black turnout is correlated with the fraction of a precinct’s population that is AfricanAmerican, i.e., that there is aggregation bias. Thus, in this example the assumptions behind the
second stage of EI–R contradict those in its first stage.
This is problematic at two levels. At a fundamental level, any standard philosophy of science
requires researchers to adopt internally consistent sets of assumptions. To do otherwise, as is often
implicitly done in EI–R, is to simultaneously state “A is true” and “A is false.” Once two mutually
contradictory premises are adopted, any implication can be logically derived, i.e., all subsequent
analysis is meaningless.
2
At a more pragmatic level, logical inconsistency of EI–R is problematic because, in the presence of aggregation bias, King’s ecological inference technique produces statistically inconsistent
estimates of all relevant population parameters.1 And, when all first stage point estimates are inconsistent, second stage regression results – that is, EI–R’s final results – will also be statistically
inconsistent in ways that cannot be readily rectified. An estimator that is statistically inconsistent is
generally useless because it does not converge in probability to its true value; rather, it tends to an
incorrect value as the sample on which it is based gets large.
Given the problems which result if EI–R is logically inconsistent, it is important for researchers
to determine whether a particular application of the technique is problematic. Fortunately, it turns
out that not all implementations of EI–R are logically inconsistent. Rather, we show that logical
consistency of EI–R is application–dependent, i.e., EI–R is logically consistent in some empirical
applications yet logically inconsistent in others.
Thus, we propose a simple specification test for logical consistency of EI–R. The test is implemented via a single linear regression using data from both the first and second stages of a proposed
application of EI–R. Interpreting the results of the specification test then requires consideration of
only a single F statistic, the well–known and popular statistic which tests the null hypothesis that all
slope coefficients in a linear regression are zero. If the F statistic produced by an application of the
test to a given dataset is statistically significant, then EI–R is logically inconsistent when applied to
this dataset and should not be used. If, on the other hand, the F value is not significant, then logical
consistency of the application of EI–R cannot be rejected.
It is worth noting that, even in an application where EI–R is logically consistent, second stage
regression results will almost certainly be statistically inconsistent. This is because errors in King–
based estimates of disaggregated quantities are inversely correlated with the true parameter values
that the estimates measure. Fortunately, it is possible to correct for this statistical inconsistency so
long as EI–R is logically consistent (Herron and Shotts 2000).
However, when the test for logical consistency of EI–R rejects in a given ecological dataset, thus
implying the presence of aggregation bias, a researcher who wants to study the dataset has only two
1
See Tam (1998) for a thorough discussion of the effects of aggregation bias on King’s technique.
3
options. One, she can abandon EI–R entirely and choose a different method of analysis. Or, two,
she can explicitly model the aggregation bias in the first stage of the EI–R procedure. We discuss
this latter option in some detail.
We emphasize that our critique of EI–R and King’s ecological inference technique is fundamentally different from the argument in Tam (1998) and Cho and Gaines (2000), both of which claim
that aggregation bias is a common phenomenon and that King’s diagnostics provide little help in
identifying and rectifying it. Our analysis, in contrast, is not based on a claim that the assumptions
supporting King’s ecological inference technique are unwarranted, either in general or in a particular empirical application. Rather, our critique applies only to the use of King’s technique within the
context of the combined EI–R procedure, and what we argue is that the assumptions which support
EI–R must be compatible across EI–R’s two stages.
In Section 2 we provide notation and details on EI–R. Then, in Section 3 we present our specification test for logical consistency of EI–R and explain how to interpret it. Section 4 considers the
options faced by a researcher whose ecological dataset fails the test, and in Section 5 we apply the
test to the ecological dataset used in Burden and Kimball (1998). Section 6 offers general comments
on EI–R and concludes.
2 Overview of the EI–R Procedure
The first stage of EI–R consists of an application of King’s ecological inference technique, and
the second stage requires estimating a linear regression based on first stage output. Thus, we now
provide some details on ecological inference and King’s technique in particular and then explain
how the output of the technique is used in second stage regressions.
Suppose we have precinct–level data for an election of interest: let
denote precinct ’s turnout rate in the election and
The goal of ecological inference is to use
and ,
,
the fraction of Blacks in precinct .
to estimate , the fraction of Blacks in precinct
who voted in the election, and , the corresponding fraction of Whites who voted. Achen and
Shively (1995) discuss the history of the so–called ecological inference problem and review various
solution attempts.
4
King distinguishes between two different implementations, standard and extended, of his ecological inference technique. The precise details of these implementations can be found in Chapter
6–8 of King’s book. Critiques and discussions of King’s approach to ecological inference can be
found in Freedman, Klein, Ostland and Roberts (1998), Tam (1998), and Cho and Gaines (2000).
Until further notice our comments on King’s technique apply to its standard implementation or what
we also call the standard model.
have a truncated bivariate normal distribu , standard deviations and correlation . In
tion on the unit square with means
King’s notation, these parameters are collected in a 5–vector . Notably,
The standard implementation assumes that
and
are fixed across , and this is the key feature of the standard model.
where denotes covariance.
Furthermore, the standard model assumes the two means
This is equivalent to assuming there exists no aggregation bias, i.e. there is no systematic relationship between
, the fraction of Blacks in precinct who voted, and , the fraction of Blacks in
precinct .2
Let
, , , , and
denote the point estimates of
, , , , and , respectively,
based on applying King’s standard model to an ecological dataset.3 Corresponding to the vector ,
the five point estimates are collected in a 5–vector
one of which is that there is no aggregation bias –
. Under standard model assumptions – recall,
as tends to infinity where ! is a consistent estimate of , and a similar
statement applies to the other four point estimates in . In general, then, " .
Let be the point estimate of produced by applying King’s technique to an ecological
dataset and , ( is a function of , , and ). In EI–R, the point estidenotes convergence in probability. In other words,
mates are used as dependent variables in a second stage regression. Namely, King suggests that a
&()+
' * In. addition, King’s technique assumes that there is no spatial autocorrelation, that #$
2
3
and
#%
are independent for
King’s ecological inference technique can be estimated using standard maximum likelihood theory. Or, optional
priors can be used in which case King’s model yields a Bayesian framework. Both of these implementations of King’s
technique are compatible with the results described here.
5
researcher who in theory might want to estimate
(1)
(2)
instead estimate
where the difference between equation (1) and (2) is that
served) in the latter. Here,
is a constant and
(observed) substitutes for (unob
a –vector of parameters to be estimated,
is a
–vector of exogenous variables, and is a disturbance term assumed to be uncorrelated with and . Let and denote the least squares estimates of and , respectively.4
Thus, the EI–R procedure consists of, first, estimating
technique and, second, estimating
and
using King’s ecological inference
with least squares applied to equation (2). 5 The final
product of EI–R, therefore, consists of the estimates
and , and, ultimately, whether EI–R is
statistically consistent in an overall sense depends on whether
! and
" . On the other
hand, what can be thought of as first stage statistical consistency depends on whether
where the –vector
! , as noted previously, is the first stage estimate produced by applying King’s
standard model to an ecological dataset.
When EI–R is logically consistent then its first and second stage assumptions do not contradict one another. Hence, when EI–R is logically consistent then there is no aggregation bias and
EI–R’s first stage is statistically consistent (
stances
" and
" where
). In this case, except in very unusual circum-
denotes a lack of convergence in probability (Herron and
Shotts 2000). In other words, EI–R is in general statistically inconsistent even when it is logically
even when the first stage of EI–R is statistically consistent.
consistent, and this is because 4
$
$
$
$
Equations (1) and (2) could just as easily have had
and
on their left hand sides. All of our results apply in
a symmetric fashion to such a case. Similarly, the results apply to regressions with
and
as left hand side
variables as well.
5
King suggests a weighted least squares approach when estimating equation (2). Weighting, however, is not germane
to our discussion of logical inconsistency. Furthermore, Herron and Shotts point out that there is no theory which says
that weighted least squares estimation of equation (2) is superior to ordinary least squares estimation.
6
However, Herron and Shotts describe manipulations of
of the population parameters
and
and
that produce consistent estimates
in equation (1) based on inconsistent estimates produced by
equation (2). Overall, then, when EI–R is logically consistent its first stage is statistically consistent
and, despite the fact that
! and
consistent second stage estimates of
" , it is possible to manipulate and .
! These manipulations are possible only when
and
so as to generate
, i.e. only when first stage ecological
inference estimates are statistically consistent. This is a key point for the following reason: when
EI–R is logically inconsistent, then its first stage is statistically inconsistent (
" ! ) and EI–
R completely falls apart.6 In particular, as far as are aware, no useful manipulations of
and
are possible when EI–R is logically inconsistent. In this latter situation, these point estimates are
hopelessly statistically inconsistent and this cannot be rectified.
3 The Logical Consistency Test for EI–R
Recall that EI–R’s first stage assumption of no aggregation bias is mathematically equivalent to
. However, second stage regression implies
from equation
(1). Hence, for EI–R to be logically consistent across its first and second stages, it must be true that
where
,
is the th element of the parameter vector
and where
is the th element of the vector
. This derivation forms the basis of the following result:
6
King (1997) argues in Ch. 11 that his technique sometimes produces accurate estimates of even in the presence
of aggregation bias. However, Tam (1998) argues that, in such situations, the technique produces inaccurate estimates.
In our opinion, the fact that an inconsistent estimator happens to be on target in some applications does not constitute
adequate justification for adopting it. Namely, inconsistent estimators are guilty until proven innocent, and, at the time of
this writing, the value of King’s technique in the presence of aggregation bias remains uncertain.
7
, , and , , is
Proposition 1 Suppose that an ecological dataset, consisting of
given. Let
be the true second stage parameter vector from equation (1). When it is not the case
that
(3)
then EI–R as applied to the given dataset is logically inconsistent.
Proposition 1 places a restriction on the parameter vector
that must be satisfied for EI–R to be
logically consistent when applied to a given ecological dataset. When the restriction is violated, then
EI–R is logically inconsistent. Moreover, as described earlier, logical inconsistency of EI–R implies
first stage statistical inconsistency (
in
and
) and correcting resulting second stage inconsistencies
is not possible.
Suppose that, for a given ecological dataset, the
covariances which appear in equation (3)
all happen to be exactly zero. Then, any –vector , regardless of its elements, is consistent with
the restriction in Proposition 1. This is because the elements
, when substituted into equation (3),
will be multiplied by zero and hence have no impact on whether the left hand side of the equation
is truly zero. In other words, when all covariances in equation (3) are zero it is not possible for the
elements of
to violate the restriction in Proposition 1 and in so doing cause rejection of logical
consistency of EI–R.
On the other hand, if some of the
terms are not zero, then the set of
which satisfy the restriction in Proposition 1 is not equivalent to the set of all possible
vectors
vectors.
,
For example, suppose that so that is a 2–vector, , and . Then, for EI–R to be logically consistent it must be the case that or,
equivalently,
the
. This example illustrates in a simple manner the fact that, when some of
terms are non–zero, elements of the vector will have to satisfy a non–trivial
restriction for EI–R to be logically consistent.
Thus, we can think of Proposition 1 as having two cases. In one case – when all covariances
in equation (3) are zero – there is no restriction on . In the other case – when at least one of
8
the
terms is non–zero – the restriction in equation (3) is meaningful and logical
consistency of EI–R depends on whether
satisfies it.
When Proposition 1 imposes a meaningful restriction on , we can enquire as to the nature of
the restriction. Is it meaningful yet weak – for example, requiring that
tight – for example, requiring, say, that
for all – or is it
for all ? If the former, then we would say that the
restriction is relatively easy to satisfy and hence there are no compelling reasons to reject logical
consistency of EI–R based solely on the fact that
On the other hand, if the restriction on
must satisfy the restriction.
is very tight, then it is certainly possible that most
vectors will fail to satisfy it. From this it would follow that EI–R is unlikely to be logically
consistent. Similarly, if given an ecological dataset the associated restriction on
is so tight that
is seems unimaginable that it can be satisfied, then we would commensurately argue that, in all
likelihood, EI–R is logically inconsistent when applied to the said dataset.
We now develop this idea of a very tight restriction on , a restriction that seems virtually
impossible to satisfy. In particular, our conceptualization of such a restriction is one that will be
satisfied with probability zero if
is drawn randomly from a continuous distribution. To motivate
this conceptualization, consider the following example. Let
and suppose as before that logical
. If is drawn from a continuous distribution
on , then the probability that
is zero since a line has zero probability measure on
a two–dimensional plane. Hence, we say that the restriction is very tight and, when
consistency of EI–R requires
logical consistency of EI–R depends on it, we conclude that EI–R will in all likelihood be logically
inconsistent.
In general, when the –vector
on
, then the probability that
of dimension
is zero. Thus,
Definition 1 A –vector
elements of
must lie in an is randomly drawn from a continuous probability distribution
elements of the vector, with
, lie a given hyperplane
from equation (1) is said to be non–generic if, for
,
dimensional hyperplane.
Definition 1 codifies the idea of a very tight restriction in the sense that
when it must satisfy such a restriction.
9
is said to be non–generic
For example, suppose that
implies that two ( is a –vector (
) and that a restriction from equation (3)
) of its elements must lie in a one (
) dimensional
hyperplane. (This is a complicated way of saying that these two elements must lie on a line.) In
general, the probability that two continuously distributed random variables fall exactly on a given
line is zero. Hence, if all
elements of
are drawn from a continuous distribution on
, the
probability that the two key elements satisfy the restriction of being on a given line is zero.
Of course, in actual applications of EI–R it is not the case that
is drawn from a continuous dis-
tribution prior to EI–R’s being employed. Rather, according to standard frequentist logic, estimating
a model like equation (2) is equivalent to supposing that there exists a true
and that this value is
fixed albeit unknown. Nonetheless, the idea behind a restriction which leads to a non–generic
is
that the restriction is so tight that we should expect it never to hold. Hence, if we conclude that a
given application of EI–R is logically inconsistent unless
is non–generic, then we have effectively
concluded that the application is in all likelihood logically inconsistent. In particular, we would say
in this instance that EI–R is generically logically inconsistent.
Definition 2 An application of EI–R to an ecological dataset is said to be generically logically
inconsistent if the restriction in Proposition 1 requires
It is straightforward to see that, if any of the
then the
to be non–generic.
terms in equation (3) are non–zero,
vector which satisfies the resulting restriction is non–generic. If, say, only one covariance
is non–zero, then the associated element of , that which multiplies the non–zero covariance, will
itself have to be exactly zero. If not, the restriction in equation (3) will not be satisfied unless
exactly offsets the product of the non–zero covariance and the non–zero element of .
This sort of offsetting will occur only by complete coincidence. Furthermore, the probability that
a continuously distributed random variable on
Hence, when only a single
is exactly zero is itself a probability zero event.
term is zero,
is non–generic.
We have now argued the following.
Proposition 2 Let
where
be the true parameter vector from equation (1). If there exists a such that
, then
is non–generic.
10
Hence, whenever an ecological dataset features at least a single, non–zero
EI–R as applied to the dataset is generically logically inconsistent.
The result in Proposition 2 is, importantly, independent of whether
term,
is zero. In fact,
this latter covariance only makes matters worse from the perspective of satisfying the restriction in
equation (3). Namely, all
terms can be zero but the restriction can still be violated if
. Hence, by ignoring this latter covariance our test for logical consistency of EI–R
is conservative.7
The matter of whether a proposed application of EI–R is generically logically inconsistent, then,
turns on whether there exists a
estimated by regressing
on the
term that is non–zero. Such covariance terms can be
vector. Then, if the standard F test for all slope coefficients
being zero rejects at a conventional confidence level, it follows that EI–R is generically logically
inconsistent. Thus, the hypothesis test for logical consistency of EI–R requires nothing more than
estimating a single regression with data that EI–R itself requires in the first place. This regression
can, and should, be run before any application of EI–R.
The intuition behind our regression–based test for logical consistency in EI–R is as follows.
Recall that King’s ecological inference technique requires that there is no aggregation bias, i.e. that
is a function of covariates
collected in a vector and, if some of the elements in are correlated with , then in general
to be zero. Thus, when the test for logical consistency of EI–R
it is not possible for . But, if a second stage regression assumes that
rejects, it follows that there is aggregation bias in the first stage and EI–R in this situation completely
breaks down.
The matter of non–rejection of logical consistency of EI–R is an important one. The test we
propose for logical consistency is one of necessity as opposed to one of sufficiency. That is, an
application of EI–R must pass our test; if not, it follows that the application is logically inconsistent.
However, if an application of EI–R does pass the test, it does not therefore follow that the application
7
Our discussion of restrictions and logical consistency of EI–R began with the premise that an ecological dataset is
given, and then we considered restrictions on such that equation (3) holds. We could, alternatively, have started from
the premise that the elements of were given and asked about the resulting ecological dataset which would satisfy the
equation. From this latter perspective, when any elements of are non–zero, the set of covariances which satisfy equation
(3) will be non–generic. Hence, it follows that EI–R as applied to the dataset will be generically logically inconsistent.
11
is logically consistent. Rather, it is still possible that the application is logically inconsistent in some
manner not covered by our test.8 Or, the failure to reject could simply reflect that fact that our test
does not have a power of one.
For the remainder of this paper, we use the term “logically inconsistent” in place of “generically
logically consistent” as the latter is tedious. Similarly, we refer to our test as a test for logical
consistency even though, technically, it is a test for generic logical consistency.
4 What to do when an Ecological Dataset Fails the Specification Test
When a researcher’s ecological dataset fails the test for logical consistency of EI–R, she has two
options. First, she can abandon EI–R entirely and choose another method of data analysis. Second,
she can model the aggregation bias that caused test failure in the first place and in so doing avoid a
second stage regression.
Modeling aggregation bias in the first stage requires what King calls his extended model. Recall
that in the standard model the mean of
is , and this mean vector is assumed to
be fixed across observations. However, in the extended model the two elements of this vector are
not restricted in this way. Rather, in the extended model the means are denoted
where
the means are dependent on .
Readers interested in the details of the extended model should consult King (1997, pp. 168–74).
In brief, this model posits that
is a linear function of exogenous covariates. That is, the extended
model assumes something akin to
estimated and
where
is a vector of parameters that must be
a vector of covariates that includes a constant. A product of the extended model,
then, is an estimate
of the parameter vector . For numerical reasons (see p. 170) the extended
model does not assume the precise form noted above for
. However, the above expression for captures the essence of the extended model.
Thus, suppose that a researcher has an ecological dataset and on it carries out the test for logical
consistency of EI–R. Suppose as well that the test rejects. Since it consequently follows that there
$
8
For example, logical consistency of EI–R will impose certain restrictions on the distribution of . Although we do
not explore the nature of these restrictions, it seems quite plausible that these restrictions are extremely tight.
12
is aggregation bias in the ecological dataset, the hypothetical researcher would then include as
extended model covariates all elements of
and subsequently would not estimate a second stage
regression at all. In other words, when the specification test rejects logical consistency of EI–R, then
the two stage statistical procedure should not be used. Rather, an abbreviated version – consisting
solely of an application of King’s extended model – should replace it.
To summarize, the procedure we propose for EI–R is as follows.
1. Collect an ecological dataset consisting of 2. Regress on the elements of the vector
,
, and
,
.
and consider the F test that all slope coefficients
are zero. Then,
(a) If the F statistic does not reject, implement EI–R using the method described in Herron
and Shotts (2000).
(b) If the F statistic rejects, apply King’s extended model using all elements of
ates for parameterizing the means and of and as covari-
, respectively.
The ultimate objective of EI–R is determining the extent to which elements of the exogenous
covariate vector
are related to . When standard EI–R is employed because the test for logically
consistency did not reject, inference on the impact of
sion results, specifically, on confidence intervals for
can be based on second stage regres-
–related slope coefficients produced by the
method described by Herron and Shotts. That is, the Herron and Shotts method produces consistent
estimates and valid confidence intervals for
and , and these intervals can be used to conduct
inference in the standard way.
On the other hand, when King’s extended model is employed as the first (and only) component
of an abbreviated EI–R, assessing the impact of
on
requires considering first stage standard
errors produced by King’s technique. Namely, for each covariate used to parameterize the mean
of
in the extended model, King’s method produces a coefficient estimate and an associated
t–statistic. Standard significance tests using these t–statistics (one for each element of
the basis of determining whether the elements of
13
) can form
are related to through its mean .
Suppose, for example, that
contains only a single non–constant covariate and that, based on
failure of the logical consistency test, this covariate is included in an application of King’s extended
model. Associated with the covariate will be two coefficient estimates, one which relates the covariate to the mean
and one to the mean
. Depending on what a researcher is interested in
and in her problem –
studying – in other words, depending on the substantive meanings of
she would consult one or perhaps both of these estimates and, using estimated standard errors, form
t–statistics and carry out hypothesis tests in the usual fashion.
A researcher may be tempted to include all or some elements of
in both an application of
King’s extended model as well as in a second stage regression. This practice – an example follows
shortly – should be avoided as it can lead to internal incoherence. In other words, either King’s
standard model should be used followed by a second stage regression or King’s extended model
should be employed and there should be no second stage regression. Mixing the extended model
with a second stage regression can lead to serious problems.
Suppose, to illustrate this point, that a researcher gathers an ecological dataset and on it carries
out the test for logical consistency of EI–R. Suppose as well that the test rejects and that only one
coefficient estimate from the test’s regression is individually significant at, say, the
level. The
researcher might want to include the variable associated with the coefficient (that is, one element of
the
vector) in an extended King model and then estimate a second stage regression using the full
vector.
In such a case, the one variable which is significantly related to appear twice in the EI–R procedure insofar as it influences the mean
EI–R and then influences
in the specification test would
of
in the first stage of
again in the second stage. Yet, both estimates of the impact of the
variable are based on a single ecological dataset. Thus, the variable’s appearing twice means that
there will be two separate and dependent estimates of the impact of the variable on . Simply put,
it is impossible to say how the two associated t–statistics should be combined and interpreted. It is
conceivable, for instance, that the first stage estimate of the impact of the covariate on the mean of
can be statistically insignificant while the second stage estimate of the impact of the covariate is
significant. When all the
covariates are used only once, either in a second stage regression or in
14
an application of King’s extended model, this type of incoherence can be avoid.
Finally, a practice to which special attention is warranted is the inclusion of the vector of exogenous covariates
as an element of
(e.g. Gay 1997, Lublin 2000). This practice – which assumes
is a function of – guarantees that EI–R is logically inconsistent. Thus, if a
researcher believes that is dependent on , then the latter should be included as a parameter in
explicitly that
the first stage ecological inference step along with all other exogenous variables of interest. In this
situation, no second stage regression will be estimated.
5 An Application of the Test for Logical Consistency in EI–R
We now apply the test for logically consistency in EI–R to the ecological dataset used by Burden
and Kimball (1998) in their published study of ticket–splitting in the United States Congressional
and Presidential elections of 1992.9 We obtained the Burden and Kimball data and divided it into
House and Senate sections.10
For Burden and Kimball’s House data, the variable
represents the proportion of individuals
in Congressional district who voted for Michael Dukakis in the 1992 presidential race. And,
is the fraction who voted for
a Democratic Congressional candidate as well as for Dukakis, and is fraction who voted for
is the fraction who voted for the Democratic House candidate,
a Democratic Congressional candidate yet also voted for Bush. Thus, in district
ticket–splitters is .
the fraction of
Furthermore, for the House analysis there are a total of four elements of the vector
. These
elements consist of an indicator as to whether district had a Democratic incumbent (“Democratic
Incumbent”), the Democratic candidate’s proportion of House campaign spending (“Spending Ratio”), whether district allowed straight–party votes (“Ballot Format”), and whether district was
located in the South (“South”). In accordance with the test for logical consistency of EI–R, we
regressed on the elements of
and a constant. Resulting coefficient estimates and the overall F
9
Cho and Gaines (2000) identify several errors in Burden and Kimball’s dataset, but we use the original dataset to
maintain compatibility with the published study. Our critique of Burden and Kimball is quite different from that offered
by Cho and Gaines, who argue, among other things, that King’s diagnostics indicate the presence of aggregation bias in
the data set.
10
The data is available from the ICPSR and in particular from ftp://ftp.icpsr.umich.edu/pub/PRA/outgoing/s1140.
15
Table 1: The Test for Logical Consistency in EI–R Applied to Burden and Kimball (1998)
Variable
Constant
House Estimate
Ideological Distance
–
Senate Estimate
Democratic Incumbent
Spending Ratio
Ballot Format
South
361
F
54.9
Note: t–statistics in parentheses
32
2.42
statistic are in Table 1.11
In Burden and Kimball’s Senate data the variables noted above refer to states rather than Congressional districts, but the Senate
vector contains one additional covariate (“Ideological Dis-
tance”) compared to the House version of
. This latter covariate measures the ideological distance
between George Bush and the Democratic Senate candidate in state . Senate results are in Table 1,
and all t–statistics in the table were calculated using heteroskedastic–consistent standard errors.
Before examining the results in Table 1 we consider at an intuitive level whether we expect be correlated with the elements of the
voters in district (or state) and
vector. Recall that to
represents the fraction of Dukakis–
includes, among other things, a South indicator variable. Given
all that is known about Democratic politics in the South, it would be hard to imagine that is not
correlated with whether district is in the south.
With respect to the House regression model, the F statistic for the hypothesis test that all slope
coefficients are zero is . The associated p–value is approximately zero. This means that EI–R,
11
$
$
$
Since Burden and Kimball try to explain ticket–splitting rates, their second stage dependent variables are
as opposed to
and . See fn. 4.
16
$
and
consistency of EI–R in the Senate case is certainly not a foregone conclusion.
Our results on Burden and Kimball’s dataset raise an important question: are our results on
Burden and Kimball’s work anomalous? On one hand, without an exhaustive study of numerous
research programs that use second stage regressions we will never know precisely how many published applications of EI–R are problematic.
On the other hand, there are many instances in which it is hard to imagine that elements of
the exogenous covariate vector
yet unrelated to . For example, in Cohen,
are related to
Kousser and Sides’s (1999) study of crossover voting in state legislative elections in Washington
and California, is the fraction of individuals in district who are Democrats.13 Moreover, one of
the elements in the associated
vector is an indicator that measures whether there is a Democratic
incumbent in district . It is possible, of course, that Democratic incumbency in district is unrelated
to whether district has a large number of Democrats in it. However, this seems unlikely to be true.
This reasoning, then, suggests that logical consistency in Cohen, Kousser and Sides’s application of
EI–R should be questioned.
6 Discussion
As we noted in the introduction, the number of research articles which use EI–R is not matched
by a parallel number of analyses of EI–R as a statistical technique. In fact, as far as we know, this
paper and Herron and Shotts (2000) are the only formal treatments of EI–R and, more broadly, the
only treatments of regression models that use ecological inference point estimates as dependent or
independent variables.
It seems fair to say that the conclusions of Herron and Shotts and the present paper are not
optimistic. Namely, Herron and Shotts show that EI–R’s final point estimates are statistically inconsistent even when EI–R is itself logically consistent. And, in the present paper we have proposed a
test for logical consistency of EI–R and applied it to the House and Senate datasets used in Burden
and Kimball (1998). The test rejects logical consistency in the House case and comes extremely
close to rejecting for the Senate. We suspect that these conclusions are not anomalous and that
13
Cohen, Kousser and Sides’s notation uses
$
instead of
18
$
and
$
instead of
# $.
many applications of EI–R are logically inconsistent. Moreover, we pointed out that, when the test
for logical consistency of EI–R rejects, a second stage regression should not be estimated. Rather,
covariates that would have been included in such a regression should instead find themselves in an
application of King’s extended model, that is, in an abbreviated version of EI–R.
Thus, if our critique of EI–R is taken seriously, we should expect to see numerous researchers
who wish to estimate second stage regressions instead using King’s extended model and eschewing
second stage regressions entirely. This, however, raises an important issue about EI–R and the
enterprise of using exogenous covariates in ecological inference.
Rivers (1998) notes that King’s extended model is fragile in the sense of being barely identified.
The importance of this point is as follows: if it is difficult to identify using King’s extended model
the impact of exogenous covariates on unknown disaggregated quantities, then it should be commensurately difficult to use a second stage regression to do this. However, second stage regressions
are easily identified and they make it appear as if identifying the impact of exogenous covariates on
disaggregated quantities is a simple matter. However, it should not be forgotten that second stage
regressions are identified precisely because they treat ecological inference point estimates ( , in
our notation) as fixed. But, such estimates are not fixed. Instead, they are functions of estimated
bivariate normal parameters ( ).
What this means is that researchers must tread extremely carefully when working with exogenous covariates and ecological inference. When a researcher estimates an extended King model
with numerous covariates, she must ensure that her resulting point estimates as to the impact of her
covariates are stable. If they are not, then this indicates potential identification problems. Models
that are poorly identified must be flagged as such, as their results are potentially meaningless.
References
Achen, Christopher H. and W. Phillips Shively. 1995. Cross–Level Inference. Chicago: University
of Chicago Press.
Burden, Barry C. and David C. Kimball. 1998. “A New Approach to the Study of Ticket Splitting.”
19
American Political Science Review 92(3):533–44.
Cho, Wendy K. Tam and Brain J. Gaines. 2000. “Reassessing the Study of Split–Ticket Voting.”
Working Paper, University of Illinois–Champaign.
Cohen, Jonathan, Thad Kousser and John Sides. 1999. “Sincere Voting, Hedging, and Raiding:
Testing a Formal Model of Crossover Voting in Blanket Primaries.” Paper presented at the
annual meeting of the American Political Ccience Association, Atlanta, GA.
Freedman, D.A., S.P. Klein, M. Ostland and M.R. Roberts. 1998. “On ‘Solutions’ to the Ecological
Inference Problem.” Journal of the American Statistical Association 93(444):1518–22.
Gay, Claudine. 1997. Taking Charge: Black Electoral Success and The Redefinition of American
Politics PhD thesis Harvard University.
Herron, Michael C. and Kenneth W. Shotts. 2000. “Using Ecological Inference Point Estimates in
Second Stage Linear Regressions.” Working Ppaer, Northwestern University.
Kimball, David C. and Barry C. Burden. 1998. “Seeing the Forest and the Trees: Explaining
Split–Ticket Voting within Districts and States.” Paper presented at the annual meeting of the
American Political Ccience Association, Boston, MA.
King, Gary. 1997. A Solution to the Ecological Inference Problem. Princeton, NJ: Princeton University Press.
Lublin, David I. 2000. “Context and Francophone Support for Quebec Sovereignty.” Working Paper,
American University.
Rivers, Douglas. 1998. “Review of ’A Solution to the Ecological Inference Problem: Reconstructing
Individual Behavior from Aggregate Data’.” American Political Science Review 92(2):442.
Tam, Wendy K. 1998. “Iff the Assumption Fits...: A Comment on the King Ecological Inference
Solution.” Political Analysis 7:143–63.
20
Voss, D. Stephen and David Lublin. 1998. “Ecological Inference and the Comparative Method.”
APSA-CP: Newsletter of the APSA Organized Section in Comparative Politics 9(1):25–31.
Voss, D. Stephen and Penny Miller. 2000. “The Phantom Segregationist: Kentucky’s 1996 Desegregation Amendment and the Limits of Direct Democracy.” Working Paper, University of
Kentucky.
Wolbrecht, Christina and J. Kevin Corder. 1999. “Gender and the Vote, 1920–1932.” Paper presented at the annual meeting of the American Political Ccience Association, Atlanta, GA.
21