Download Document 8919879

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
The Problem with Quantitative Studies of
International Conict
Nathaniel Beck
Gary King
UCSD
Harvard University
Langche Zeng
1
Harvard University
July 15, 1998
Nathaniel Beck: Department of Political Science, University of California, San Diego, La Jolla,
California 92093, [email protected], http://weber.ucsd.edu/~nbeck/. Gary King: Department of
Government, Littauer Center North Yard, Harvard University, Cambridge Massachusetts 02138.
[email protected], http://GKing.Harvard.Edu, (617) 495-2027. Langche Zeng (on leave from
George Washington University): Department of Government, Littauer Center North Yard, Harvard University, Cambridge Massachusetts 02138, [email protected]. We thank Kristian
Gleditsch, Simon Jackman, and Richard Tucker for helpful discussions, and the National Science
Foundation for grants SBR-9753126 (to Langche Zeng) and SBR-9729884 (to Gary King) and the
Centers for Disease Control and Prevention (Division of Diabetes Translation) and the Global
Forum for Health Research for research support for Gary King.
1
Abstract
Despite immense data collections, prestigious journals, and sophisticated analyses, empirical ndings in the literature on international conict are frequently unsatisfying. Statistical
results appear to change from article to article and specication to specication. Very few
relationships hold up to replication with even minor respecication. Accurate forecasts
are nonexistent. We provide a simple conjecture about what accounts for this problem,
and oer a statistical framework that better matches the substantive issues and types of
data in this eld. Our model, a version of a \neural network" model, forecasts substantially better than any previous eort, and appears to uncover some structural features of
international conict.
1 Introduction
Despite immense data collections, prestigious journals, and sophisticated analyses, empirical ndings in the literature on international conict are frequently unsatisfying. Statistical results appear to change from article to article and specication to specication.
Very few relationships hold up to replication with even minor respecication. Instead of
new, durable, systematic patterns being steadily uncovered, as is the case in most other
quantitative subelds of the discipline, students of international conict are left wrestling
with their data to eke out something they can label a nding. As a consequence, those
with deep qualitative knowledge of the eld are rarely persuaded by conclusions from
quantitative works. (See Rosenau, 1976; Levy, 1989; and Geller and Singer, 1998; Bueno
de Mesquita, 1981; Vasquez, 1993).
Perhaps the most important evidence of the existence of this problem is the miserable
forecasting performance of statistical models in this eld. Quantitative conict models
do not appear to forecast at all. We believe it to be true that no legitimate statistical
model has ever forecast an international conict with greater than 0.5 probability, and
certainly none has done so while also being correct. Political scientists have long eschewed
forecasting to emphasize causal explanation, and we do not disagree with this emphasis.
But a claim to have found a causal explanation that is a structural feature of the world,
but which changes unpredictably over time, is of dubious validity and marginal value.
Forecasting is thus of indirect interest as the ultimate test of whether we have really found
structure, but we should not let the eld's forecasting failures cause us to lose sight of
the tremendous practical value of accurate international conict predictions would be. A
large portion of the foreign policy bureaucracy in many countries is devoted to this task,
and a quantitative \expert system" to help guide the local country experts could be of
considerably value. Forecasts of political conict would also be of interest to politicalbusiness risk analysts, and many others.
Scholars blame everything from the data to the theory to the international world for
the lack of results, and no doubt all are correct at least sometimes. Similarly, the attacks
on this problem have come from every angle. The most venerable tradition is to try
to improve the data and measures of international conict and its correlates (Jones et
al., 1996). Others have modied existing statistical models to accommodate some of the
special features of conict data (King, 1989; Beck and Tucker, 1998). And still others have
attempted to derive statistical models from formally stated rational choice theories based
on the motivations of states, political leaders, or other domestic actors (Signorino, 1998;
Smith, 1998). An eventual resolution of this problem will no doubt require advances on
all three fronts, and a convergence in theoretical and statistical models and data.
Our current approach is based on the belief that much of the problem lies with a
somewhat overlooked but key substantive issue that is reected in the mismatch between
available data and the entire set of statistical methods applied to this problem. For one
example, international conict is a rare event, and the processes that drive conict where
it is more common are likely to be very dierent from those elsewhere. Even the meanings
of variables change from place to place and time to time (e.g., limiting freedom of assembly
would be a monumental change in the U.S., not an issue in the U.K., and a common but
meaningful act in Iran). As a result, many qualitative researchers expect the relationships
in this eld to be highly nonlinear, massively interactive, and heavily context-dependent
or contingent. Because these characteristics cannot be modeled within standard statistical
approaches, we adopt a form of the highly exible \neural network model" as a rst cut at
addressing this problem. These models are well suited to data with these types of complex,
1
nonlinear, and contingent relationships.
We nd rst and most importantly that our models are able to predict international
conict reasonably well. Whereas all previous models predict that international conict
will essentially never occur, our out of sample forecasts pick up about 17% of these disputes.
Thus, although there is still a long way to go to produce highly accurate forecasts of all
these rare and unusual events, our analyses conrm that there does exist structure in
these data. In fact, our predictive model may be of some practical use in addition to its
academic value.
With this forecasting evidence that the relationships we estimate are real, we turn to
causal interpretation of the underlying structure. With our more appropriate techniques,
we nd hints of more robust and replicable patterns coming into focus.
In Section 2, we propose a simple conjecture that seeks to explain some of the problem
with results in this eld. We believe the idea may explain some of the anomalies and
nonndings in the literature and why our model is able to forecast reasonably well. It
also highlights the features that an appropriate method would need to have to uncover
stable patterns in this eld. Section 2 also discusses alternative possibilities that we do
not believe are sucient, but have many adherents. We then discuss appropriate methods
in Section 3 and apply them in real data in Section 4. Section 5 concludes. The appendix
contains some technical details on Bayesian neural networks.
2 Explaining the Problem with Conict Studies
We now oer our tentative answer to the problem with international conict studies (in
Section 2.1) and discuss several alternative possible explanations (Section 2.2).
2.1 A Conjecture
Our conjecture about the problem with conict studies is quite simple, and aspects of it are
implied in much prior literature. The idea is that the eects of most explanatory variables
are undetectably small for the vast majority of country dyads, but they are large, stable,
and replicable when the ex ante probability of conict is large. For example, Canada and
St. Lucia have essentially no chance of going to war today. If St. Lucia becomes slightly
more democratic or Canada slightly less, perhaps the probability of war will increase some,
but the increase will be so small that it would be undetectable. In contrast, if Iran and
Iraq became slightly more democratic, it might greatly reduce the probability that the two
would go to war. If the conjecture is right, then the eects of the causes of conict dier
by dyad, with trivially small eects for the vast majority and larger eects for a few.
We believe this simple idea may help explain diverse features of the quantitative literature on the causes of international conict. At the least, it does appear to be consistent
with several observable implications:
1. Most scholars use statistical procedures that assume the eects of the causes of war
are nearly the same for all dyads. The estimates these analyses produce are roughly
the average of essentially zero eects for most observations and larger eects for a
tiny fraction of the cases. Thus, unless the eect is enormous in the small set of dyads
with a high ex ante probability of war, estimates from most analyses will appear very
small or resemble random noise. Indeed, small to nonexistent, and highly variable,
eects are dominant in the literature.
2
2. When eects in the high ex ante probability of war dyads are huge, the average over
all the dyads would be large enough to be reliably detected with most methods.
(Nonetheless, the estimated eect would be somewhat too large for some dyads
and far too small for a few.) Of the few variables with a very large eect in this
subset may include variables like contiguity and time since the last war, and these
are indeed among the few variables that often give reasonably robust results across
specications.
3. If only a few observations have large eects, small changes in the set of dyads included
in a statistical analysis will sometimes have disproportionate eects on the results.
This also appears true and may account for some of the apparent instability of results
in the literature from article to article.
4. A similar observable implication results from the strong priors most scholars derive
from their considerable qualitative knowledge about the eld. What would we expect
to nd when strong priors such as these are combined with constant-eect statistical
methods that, because of the nature of the data, produce very noisy results? We
would expect to see researchers pushing their data analyses extremely hard looking
for the eects they believe are there but are dicult to nd. Unfortunately, this
would make the results dier from investigator to investigator just as they seem to
since answers will depend very sensitively on otherwise minor coding decisions.
5. Some scholars make coding decisions that seem consistent with our conjecture when
they discard all dyads but those deemed \politically relevant" or \at risk," or in
other words have a high ex ante probability of war. Unfortunately, if our idea is
correct substantively, these coding decisions are problematic methodologically. Such
problems are often recognized by the authors, who have little choice but to put some
restrictions on an otherwise endless data collection. The problem is that coding rules
amount to dropping many cases without war and few with war, which in some cases
may generate two types of selection bias. One eect is intentional selection on the
eect size, which biases the eect upward if the relevant population to which one is
inferring is all dyads, and otherwise correctly increases the eect. In addition, except
in cases when the denition of \politically relevant" is clearly based on one of the
explanatory variables (as in Maoz, 1996), these rules also select on the dependent
variable, which biases the estimated average eect towards zero. Whatever the
goal of the inference, addressing the problem by selecting cases in this way may give
answers that are too small or too noisy, which does appear to be the case throughout
the literature. The results using these selection rules would be somewhat stronger
than using the entire data set, but not as large as one might expect | exactly as
the literature indicates.
6. Finally, if our conjecture is right, applying an appropriate statistical technique will
conrm the existence of sizeable and robust eects in the high ex ante probability
of war dyads and tiny eects elsewhere. For many variables at least, the direction,
magnitude, and nature of the large eects should not be wildly inconsistent with
our qualitative knowledge of international relations, unless there is a clear reason.
If this method indeed nds real features of the international system, rather than
some idiosyncratic sample characteristics that result from our specication choices
or coding rules, out of sample forecasts ought to predict similar patterns in the next
data set.
3
The rst ve of these observable implications of our conjecture are consistent with observations from the literature. Testing the sixth implication will occupy most of the rest of
this paper.
According to our idea, what makes international conict data dierent from some other
rare events data sets is the markedly varying eect sizes in combination with suciently
informative explanatory variables that, with the appropriate methods, can at least partly
predict where and explain why conict occurs. Some other rare events data, such as
in epidemiological studies of disease may also t this description, but only if all these
characteristics apply. For example, rare events data with large dierences in probabilities
would not generate all these observable implications without predictor variables that had
sucient explanatory power.
2.2 Alternative Explanations
The two leading alternative explanations of the problem with quantitative international
relations are based on denial and blame, respectively. Both are at least partly correct, but
we do not think they account for most of the problem.
The rst alternative explanation denies the problem altogether by arguing that the
nonresults of quantitative international relations are simply due to the lack of systematic
patterns in the international world. This could be the right explanation. After all, sophisticated statistical methods have been applied to the best data in this eld many times
and most analyses turn up the same nonresults. When we follow the same procedures in
other elds, we normally believe what we nd. Otherwise, there would be little reason
to have run the analyses in the rst place. The idea is also supported by a few (nihilist)
qualitative researchers. We do not believe this idea explains most of the patterns, but
we need not argue against it: if we nd veriable systematic patterns, the denial theory
would be rejected.
The second alternative theory, and easily the one most widely held, blames the data.
Conict data have various types of systematic and random measurement error, to be sure,
but in our view this error is not so much more exceptional relative to the supposedly
higher quality data in other political science elds that it could account for the dierences
in the strength and stability of empirical results between these elds. We are aware of
all the nightmare stories, such as graduate students coding international events data from
newspapers in languages they do not know. But we are also aware of similar stories in
other elds such as published election statistics in the U.S. that report more voters than
people in some districts, and survey responses being fabricated by interviewers without
even visiting interviewees. We know of the systematic and surprisingly large discrepancies
in international events data between dierent data sources on seemingly obvious measures,
like U.S-Soviet conict (Vasquez, 1981). But we also know that a similar systematic
comparison of the two most popular data sets on postwar U.S. congressional elections, one
of the strongest subelds in American politics, revealed errors that totaled the equivalent
of nearly two houses of congress (Gelman and King, 1990). Presumably because the
ndings are more consistent with qualitative knowledge and much more robust, these
examples do not lead anyone to dismiss quantitative American politics. The problems
caused by measurement error in international relations should not be minimized and may
well be worse than in other elds. Certainly the (impressive) continuing eorts to reduce
it should be encouraged, but we do not believe the data is solely at fault.
Another reason we question the \blame the data" theory is that we have found the subeld to be highly collegial and its norms of data sharing deeply held and widely supported.
4
The replication standards of journals that publish quantitative work in international relations are among the highest in any subeld in the discipline. Considerable resources are
devoted to original data collection. Data from published articles and even working papers
are widely available on the web, and scholars respond quickly, eciently, and generously
to requests for further information.
3 Statistical Models of International Conict
Relative to other types of data in the discipline, international conict data have some
unusual characteristics. They are based on thousands to hundreds of thousands of dyads
(combinations of countries taken pairwise) or directed dyads (with country A's actions
toward country B, and country B's action toward country A appearing as separate observations in the same data set). Whether the universe of dyads includes only originators
of conict, all nations, or something in between is by no means clear, even aside from
selection bias issues. Most outcome variables are dichotomous. The data often concern
rare events, with thousands of times more 0's (peace) than 1's (conict). The explanatory
variables are often neither dichotomous nor quite continuous, with distributions that are
asymmetric or with multiple mass points (such as at the end or mid-points). The indices
used are often of necessity complicated combinations of underlying measures. In addition,
if our conjecture is correct, only very small parts of very large data sets contain most of
the interesting information.
The statistical method we introduce here is a version of a neural network model (rst
introduced to political scientists by Schrodt, 1995, and Zeng, 1996a,b, in press).1 This
is a technique with an immense literature supporting it in engineering, computer science,
statistics, psychology, linguistics, neuroscience, medicine, nance, and other elds. Neural
network analysts have adopted an extensive and essentially unique language to describe
their work. The language is useful because it helps emphasize the rough analogies between
these statistical models and some theoretical models for the way human brains may work,
and because specialized jargon is frequently useful in new areas. Yet, using this language to
describe concepts known to political scientists by other names can be counter productive.
We therefore introduce these models as straightforward generalizations of logistic models,
easily the most commonly used statistical models of international conict.
Section 3.1 makes the transition from logit to our neural network models, and Section
3.2 discusses various issues relating to estimation, interpretation, and inference with our
model, and introduces the main ideas of Bayesian methods for neural networks adopted
in this paper.
3.1 From Logit to Neural Networks
Our dependent variable Yi takes on a value of 1 if dyad i (i = 1; : : : ; n) is engaged in an
international conict and 0 if it is at peace. If conict and peace are mutually exclusive and
exhaustive (which we make true by denition), then a Bernoulli distribution fully describes
this variable. The only parameter of a Bernoulli distribution is i , the probability of an
international conict. Then let a vector of a constant term and k explanatory variables
be denoted Xi = f1; X1i ; X2i ; : : : ; Xki g, where 1 is the constant term. The next step is
General references on neural networks include Rumelhart et al. (1986), Muller and Reinhardt (1990),
Hertz, Krogh and Palmer (1991). Detailed discussion of neural networks as statistical models can be found
in, for example, White (1992), Ripley (1996), Cheng and Titterington (1994), Kuan and White (1994),
and Bishop (1995).
1
5
to specify the relationship between i and Xi . Perhaps the simplest possibility is a linear
function, resulting in what is known as the linear probability model:
Yi Bernoulli(i )
i = Xi = 0 + 1 X1i + + k Xki
(1)
where Xi is merely a matrix expression for a linear relationship between i and Xi ,
i = linear(Xi )
and the (k + 1) 1 vector includes a constant term and k weights (or coecients) on
each of the k explanatory variables. The problem with the linear probability model is that
it can generate impossible values of i (greater than 1 or less than 0), and so even values
within the right range near the boundaries are questionable. This problem was known
long ago, and since logit models became computationally feasible in the last two decades
it has become almost entirely supplanted.
The logit model requires a small change from Equations 1, and only in the functional
form:
Yi Bernoulli(i )
1
(2)
1 + e,X (where the (k + 1) 1 eect parameter vector is interpreted dierently than in Equation
1). Thus, the logit model maps the linear functional form Xi , which can take on any
value, into the [0,1] interval required for i by applying the logit() function. The vast
majority of analyses in conict studies use some form of this method. For our purposes,
it is important to recognize that the second line of Equation 2 as hierarchical with i
specied as a logit function of a linear function of Xi ,
i = logit(Xi ) =
i
i = logit(linear(Xi ))
Just as the logit is a strict generalization of the linear probability model, created by
adding an extra level of hierarchy, our neural network model will generalize the logit with
additional levels of hierarchy.
Equations 2 avoid the problems with impossible probability values, but from our perspective they retain a critical weakness: the eect of any of the explanatory variables is
only marginally dierent across observations in the data set and only slightly dependent
on the values at which we might hold constant the other explanatory variables, and only
somewhat related to the ex ante probability of conict (see Nagler, 1991, 1994). One way
to look at changes in the eects of explanatory variables is with the derivative of the probability i with respect to one of the explanatory variables, say X1i . For linear models this
derivative (or \instantaneous eect") is 1 , which is obviously constant. For logit models,
the derivative is i (1 , i )1 , which is better. But since i is within a small range above
zero for all but a few observations (and, given the logit model's inexibility, virtually all
observations in practice), this is a highly restrictive and nearly constant specication. To
avoid this weakness, two dierent types of generalizations might be considered, either of
which would be an improvement, but neither of which is sucient.
First, we might specify a random eects model. Instead of leaving xed at one set
of values, as it is in Equation 2, we could let it vary randomly over the observations in
some form such as i = + i . This is generally workable, but it requires a normally
6
unveriable assumption, that the random variation i is independent of the explanatory
variables. Although some assumption about randomness is often better than logit's more
restrictive assumption of constant eects, even some dependence will bias all estimates
from this model.
A second possibility is standard interaction eects. For example, we might let the
eect d of democracy Xdi be a function of whether the states in the dyad are contiguous
Xci , by specifying say di = 0 + c Xci , so that presumably c is positive. Substituting
this expression back into the second line of Equation 2 produces an interaction eect that
is easy to estimate: merely include Xdi Xci as an additional variable in Xi and use any
standard logit package. This strategy is appropriate, and we believe it will work in many
instances. However, it requires a good deal of prior knowledge of the types of interactions
to specify. If one is in the situation of not having sucient detailed prior knowledge, then
the number of interactions one would include would not generally be precisely estimable
with available data. In fact, including many interaction terms can quickly result in severe
problems of numerical instability due to the presence of collinearity. In our case, where we
suspect massive interaction eects, with most of the eects in tiny areas of the parameter
space, standard interaction-based logit models will be too restrictive or require too many
interactive and higher order terms.
Our choice for an approach to this problem is the massively interactive, highly nonlinear
neural network model, in particular, the single hidden layer feed-forward perceptron. All
this biological language sounds complicated, but we can express it all as a statistical model
that is a straightforward generalization of the logit model. Neural network models are a
very complicated version of a statistical model, but they are in the end just a statistical
model and can be understood in that framework. The biological analogies are unnecessary
and may or may not be meaningful even in biological terms.
Logit models use an \S-shaped" (really escalator-shaped) curve to approximate the
true relationship between the probability i and the explanatory variables Xi . Now imagine how much better the approximation could be if more than one such curve could be
used simultaneously, each with a dierent curvature and orientation | which is roughly
speaking what neural network models can do. In order to approximate the relationship
between i and Xi with a set of M logit curves, we use this neural network model, which
has the same distribution but a dierent functional form as the logit and linear probability
models:
Yi Bernoulli(i )
i = logit 0 + 1 logit(Xi (1) ) + 2 logit(Xi (2) ) + + M logit(Xi (M ) )
(3)
This neural network model is a type of discrete choice model that diers from the logit
only in the shape of the curve. It is easiest to compare this relationship with the standard
logit, which has only two additional levels of hierarchy, and this neural network model in
the special case when M = 1, since the second line of Equation 3 can be expressed as a
logit function of a linear function of a logit function of a linear function:
i = logit(linear(logit(linear(Xi ))))
The larger value the researcher chooses for M , the more logit curves are used at the third
level of generalization and as a result the larger variety of shapes the entire expression can
approximate.
To be more specic, in the logit model in Equation 2 is a (k + 1) 1 vector of
eect parameters (corresponding to a constant term and the weights on the k explanatory
7
variables). After the 's are multiplied by their respective Xi 's and summed up, logit(Xi )
is then one number which we also label i . In contrast, the (1) ; : : : ; (M ) parameters in
the second line of Equation 3 are each (k + 1) 1 vectors, so that each expresses a
dierent weighting of the k explanatory variables. Then the logit function is applied to
each of the dierent weighting schemes Xi (1) ; : : : ; Xi (M ) to yield a set of M numbers:
logit(Xi (1) ); : : : ; logit(Xi (M ) ). A weighted average of these M numbers are then taken
(with the 's as adjustable weights in this linear expression), and the logit is taken one
nal time to make sure that the entire expression yields a number for i that is between
0 and 1. The result is a remarkably exible functional form.
Neural networks meet the needs of conict research because they allow the eect of
each explanatory variable to dier markedly over the dyads, as required by our conjecture
about international conict. As in the case of linear probability and logit models, this can
be made clear by examining the partial derivatives of i with respect to the independent
variables, which we have derived for our neural network model:
M
X
@i
=
i (1 , i ) j (j )h logit(Xi (j ) ) 1 , logit(Xi (j ) )
@Xih
j =1
where h indexes explanatory variable h in Xi and element h in j . Comparing this expression with its counterpart in the linear probability or logit model, it is clear that our
neural network model allows much more variability of the marginal eects of independent
variables across dyads: now not only varying with i across dyads, but with Xi and all
the logit curves too. Even though each element in the summation is limited in size, the
combination of all the terms can produce wide dierences.
More generally, what makes neural network models attractive as statistical models is
that they provide a class of functional forms that are able to approximate any hypothetical relationship between i and Xi , given a large enough choice for the number of logit
functions M (Hornik 1991, White 1992).2 Although the functional form in Equation 3
has a fair number of adjustable parameters (M (k + 2) + 1), compared with other exible
functions that also have general approximation capabilities, like those based on polynomial
spline or trigonometric functions (e.g., Gallant, 1981), neural networks seem to require far
fewer parameters to model the same level of complexity (Barron, 1993). They are often
described as occupying a middle range between standard parametric models, with a small
number of parameters, and nonparametric models, with almost innite exibility (Ripley,
1996). Neural networks avoid the independence assumptions of random eects models
while still allowing wide variability of marginal eects, and their exibility and general
approximation capabilities far outperform standard logit-based interaction models.
3.2 Issues in Neural Network Modeling
As nonlinear statistical models, neural networks have the same issues of model selection,
estimation, interpretation and inference as any other statistical model. Some of these
raise concerns that are more common in models with exible functional forms, like neural
networks, and so demand special attention. In the context of our model in Equation 3, how
There are a large variety of neural network models, many are even more general than Equation 3, as
they allow additional levels of hierarchy, stacking functions within functions, and feed backward eects
(as in the so called \recurrent neural networks" which are appropriate for modeling time series data), etc.
In addition one can choose functions other than logit or even mix several dierent functions in the same
analysis. They can also be used with stochastic components other than Bernoulli to model dierent types
of dependent variables.
2
8
do we choose the size of M , i.e., how many logit units do we use? How do we compare the
dierent choices? For a given model, how are the parameter values determined? And what
do they mean? How does statistical inference proceed, and how do we handle uncertainty?
None of these questions is trivial. However, recent developments in Bayesian methods
for neural networks have allowed disciplined treatments of many of these issues. In this
paper, we adopt the Bayesian framework with Gaussian approximations (MacKay 1992abc,
1994), which is discussed in some detail in the appendix. Here we highlight some of its
basic ideas in connection with the questions raised above.
The central feature of the Bayesian approach is to treat everything as probabilistic.
Hence, instead of estimating the \true" values of some xed parameters, we look for
the posterior distributions of these parameter given the information of data; instead of
deciding on \the" correct model we consider the probability of a model being true in light
of empirical evidence. Model building and inference is therefore a process of updating our
prior beliefs about the world using the information we receive from empirical data. In the
Gaussian approximation framework for neural networks, we assume Gaussian (i.e., normal)
prior distributions for the parameters and that are centered on zero, that is, we believe
it more likely that the parameters take smaller values than larger ones. This prior merely
reects the common belief that a certain degree of \smoothness" in the underlying data
generating function is likely (as smaller neural network parameter values result in smoother
functions).3 From this prior distribution, we can then apply Bayes theorem to evaluate the
posterior distribution of the parameters which we use to evaluate the predictive distribution
of dependent variable (by integrating out the parameters). For complex functions like
neural networks, there are no analytical solutions for the integration, and Monte Carlo
methods of sampling the distribution is computationally very burdensome. Hence here we
approximate various intermediate distributions with normals since they are analytically
easy to treat and simple to sample from. This enables us to arrive at the posterior
probability i given our data.
Selecting M and Avoiding Overtting In conventional (nonbayesian) neural net-
works, selection of M , the number of logit curves to use, is a rather critical issue, as too
large M can cause \overtting". Overtting is a danger with any statistical model, but
especially for very exible forms like ours. A model that is too exible picks up idiosyncrasies unique to a particular data set rather than the structural features of the world that
persist to data out of sample. The danger of overtting is real, and many attempts have
been devised to develop procedures that protect against it.
In the bayesian framework, the presence of smooth priors for the parameters signicantly alleviates this problem. Instead of searching for the set of parameter values that
maximize the model performnace on the training data, we instead look for solutions that
at the same time punishes model complexity. And more importantly (in principal at least),
in the Bayesian paradigm, there is no single model that is the correct one. Rather, each
alternative is correct with certain probability. If a model overts, it would just have a
smaller probability of being true compared with one that does not. So dierent models
can be compared by examining the measures of evidence for them.4 In practice, however,
3
The explanatory variables are normalized before passing into the neural network model, and hence the
size of the parameters have uniform scale if not meaning.
4
There do exist theoretical results that could help with overtting. For example, Radford Neal (1996)
proves that a Bayesian neural network can use an innite number of logit curves without causing improper
behavior in the output function, provided the prior variances for the parameters (in the context of our
model) are properly scaled. He therefore makes the important suggestion that one should choose the most
complicated model that is computationally feasible (the largest possible M ), and scale the variance of the
9
there exists a fairly straightforward test of whether one is nding structure or overtting
ephemera. The procedure is to set aside a portion of the data as a \test set," t the model
to the remaining bulk of the data (the \training set") and see whether the forecasts hold
up in the test set. Since there is always the tendency to iterate back and forth between
tting models to the training set and verifying the model in the test set, in our explorations with split samples, we tried to set aside two test sets, the second of which we do
not consult until our early iterations are complete.5
In the Bayesian framework, overtting can be eectively controlled by giving appropriate priors, which is computationally equivalent to augmenting the error function to be
minimized with terms that punish model complexity. For example, Radford Neal (1996)
proves that a Bayesian neural network can use an innite number of logit curves without
causing improper behavior in the output function, provided the prior variances for the
parameters (in the context of our model) are properly scaled. He therefore makes the
important suggestion that one should choose the most complicated model that is computationally feasible (the largest possible M ), and scale the variance of the priors properly.
This way, one has a model capable of extracting as much information as possible, but the
data are being taxed to the same degree regardless of the amount of information in the
data. We follow Neal's suggestion.
Interpretation Since the neural network literature is concerned almost exclusively with
pattern recognition and forecasting, the issue of interpreting the eect of the explanatory
variables on i is not an issue that has not received much attention. For the purposes of
political science research, however, interpretation of causal eects is always critical. The
problem is that the functional form is so exible, and the estimated relationships between
i and Xi can be so complicated, that the parameters ((1) ; : : : ; (M ) and 0 ; : : : ; M )
are almost impossible to understand directly, without further computation.6 As such, we
have created a graphical device that enables us to produce highly interpretable results. For
example, we can plot the expected value of (and condence intervals for) the probability
of conict by one or two explanatory variables, while holding constant the remaining
explanatory variables at chosen values. Of course, since neural networks allow estimation
of dierent eects for dierent observations, the values at which we choose to hold constant
the other explanatory variables is a critical feature of interpretation. The identical methods
can be used with regression models, but are not needed since the eect parameters are
constant and so the whole functional relationship can be easily summarized with a single
number. We elaborate on these methods of interpretation in Section 4.
In practice, interpretation is a problem for any statistical model beyond the very simple linear additive setup. For any model that is even slightly more complicated than the
linear additive model we must provide graphic analyses of the eects of some variables on
priors with the level of complexity. This way, one has a model capable of extracting as much information
as possible, but the data are being taxed to the same degree regardless of the amount of information in the
data. Unfortunately, this suggestion requires use of extremely time consuming computational algorithms,
some features of which discuss in the appendix.
5
Validation with training and test sets can be improved in theory with techniques like cross-validation,
where one breaks the data set into all possible splits. But in practice, most users keep a single test set
aside just to be sure.
6
This diculty of interpretation is compounded by the fact that the error function has multiple global
(as well as local) maxima. For example, since switching 1 with 2 , and (1) with (2) , yield identical values
for i , we cannot uniquely attribute political meaning to the individual logit(Xi (j) ) terms or any of the
specic or parameters. The multiple maxima also cause problems in applying standard optimization
algorithms, but computationally ecient \back-propagation" techniques to evaluate the gradient and thus
maximize the likelihood seem to work well (Bishop, 1995: 141).
10
the dependent variables, holding others constant. While our procedures aer compuationally more intense, they are no dierent in principal from the simple graphical techniques
discussed in King (1989b: section 5.2) or the simulation-based methods in King, Tomz,
and Wittenberg (1998).
4 Empirical Tests
In this section we discuss the data and model (Section 4.1), our model's forecasting performance (Section 4.2), and our search for causal structure in the model results (Section
4.3).
4.1 Data and Model
We use the militarized international dispute (MID) data set (version 2.10 updated from
the Correlates of War project; Jones et al., 1996) as corrected and augmented by Richard
Tucker (Beck and Tucker, 1998). The data contain 23,529 dyads between 1947 and 1989.
Of these, only 976 (4.1%) contain MIDs, coded 1, with the rest coded 0.7 The explanatory
variables include whether the dyad contains geographically contiguous countries (contig),
whether they are allies (ally), the degree to which the two countries have similar foreign
policy portfolios (sq), the degree of asymmetry of power within the dyad (asym), the
degree of democratization (dem), and the number of years since the last conict (py).
These are standard variables used to explain international conict, but they are not the
ones we would choose to forecast war best. Thus, we believe that the predictive success
demonstrated below could be substantially improved upon with more appropriate data.
We divided these data into into an \in sample" period training set, 1947{1985, which
we use to t the model, and a test set which we use to evaluate the forecasts, 1986{1989.
We t the neural network model in Equation 3 to the training set. Without looking at
the test set, we experimented with setting M to various values. We found that an M
set around 25 provided about the right level of exibility for our data. A Gaussian prior
distribution is given to the parameters, and the most probable values of the parameters
and their covariance matrix are found, in the manner described in the Appendix. We also
t a standard logit model to the training set for comparison.
4.2 Forecasts
Table 1 gives one view of the comparative forecasting performance of the logit and neural
network (NN) models. In the table, we divide the forecasts into 1s (for conict) and 0s (for
peace). So the left portion of the table reports success at forecasting conict when conict
exists, and the right portion gives success at forecasting no conict when no conict exists.
As is clear, since the logit model never forecasts that a conict will occur in any one dyad
(i.e., the probability never reaching 0.5), it fails for all the cases of conict and succeeds for
all the cases of peace. This success is not much of a success, of course, since the optimistic
claim that war will never occur will be successful 96% of the time!
The table indicates that the neural network model's performance is substantially better
than for the logit. It is nearly as good as the logit when there is no conict, with all the
probabilities exceeding 99% correct. But more important is that when military conict
occurs, the neural network model makes a successful forecast 16.7% of the time on average.
7
As is commonplace in this literature, conict is coded as any militarized conict, that is all positive
MID conict codes were coded 1.
11
% Correct 1’s
% Correct 0’s
Year
Logit
NN
# of 1’s
Logit
NN
# of 0’s
1947-1985
0%
25.3%
892
100%
99.58%
20,155
1986
0%
18.5%
27
100%
99.83%
584
1987
0%
14.3%
28
100%
98.98%
587
1988
0%
23.1%
13
100%
99.34%
609
1989
0%
12.5%
16
100%
99.51%
618
1986-1989
0%
16.7%
84
100%
99.42%
2,398
Total
0%
24.6%
976
100%
99.57%
22,553
Table 1: Logit and Neural Network Forecasting Performance
This is not high in an absolute sense, but it is a lot better than the logit success rate of
zero. Given the high costs of military conict and the tremendous benets of knowing
ahead of time when a war will occur, this improved forecasting performance could be of
signicant policy value. This relatively high percentage of successful predictions stays
within a reasonably narrow range from 12.5% to 23.1% for each of the four out-of-sample
years, hence further conrming the overall result with separate observable implications.
Furthermore, most of these gures are lower than their t to the training set, which is as
it should be if we expect structure to exist but some real change in the world to occur.
Table 1 demonstrates that the neural network model discriminates far better than the
logit model by assigning very dierent probabilities of international conict to the available
dyads. It does not indicate whether either model's probability values are correct except for
above and below the 0.5 mark. For example, if we observe 1000 dyads with 0.1 probability
of war, none of these individual dyads would be predicted to go to war, but with 1000
dyads each at 10% risk, we would expect to see 100 wars from somewhere in the set. We
now evaluate whether either model's predictions has these desirable characteristics.
We begin by computing predicted probabilities for each dyad from the logit and neural
network models. We then sort these into bins of 0.1 width: [0; 0:1); [0:1; 0:2); : : : ; [0:9; 1].
Within each bin, we compute the mean predicted probability (which will presumably be
near its respective midpoint) as well as the observed fraction of 1s in each bin. We compare
the two to check the t of the model in the training set and to evaluate the forecasts in
the test set. Figure 1 plots these numbers for both statistical models.
The in-sample plot at the left of Figure 1 shows that the predicted probabilities and observed fraction of 1s fairly closely match each other for the neural network model. The logit
model is reasonably close as well when the mean probability is 0.25 or less, but becomes
much worse for higher (i.e., more interesting) predicted probabilities. This is especially
important from our perspective: not only does the logit model predict peace breaking out
in every dyad but the model becomes more inaccurate for its largest probability of conict
12
Forecast (1986−1989)
1
0.75
0.75
NN
Fraction y=1
Fraction y=1
In−Sample (1947−1985)
1
Logit
0.5
Logit
0.25
0.25
0
NN
0.5
0
0.25
0.5
0.75
0
1
0
0.25
0.5
0.75
1
Mean Prob(y=1)
Mean Prob(y=1)
Figure 1: Logit and NN Probabilities vs. Actual Outcomes
predictions, even though these are still very low.8 In contrast, when the neural network
model gives a probability, that probability is a reasonably accurate assessment of the odds
of a conict occurring within the sample. Of course, as important a dierence between
the two models is the much more acute discriminatory power of the neural network model,
which we can see because the logit gives no probability predictions above the 0.4 bin.
Whereas the left graph of Table 1 evaluates the t of the two models to the same
data, the right graph uses the same technique to evaluate the success of the out-of-sample
forecasts. This graph shows a reasonably close correspondence again between the estimated
probabilities and observed fractions for both models. The two models track each other
very well too for the lower probability bins, indicating the same high level of success in
that region of probability; this is good, although we are of course not as interested in
this region of probability. (The fact that the logit model did not t well to the last two
points within sample but did t out of sample would seem to be a lucky coincidence.)
More interesting is that whereas the logit model has no high probability forecasts, the
neural network model tracks the observed fractions reasonably well even for the sparsely
populated high probability bins. One possible problem is the noticeable overestimation of
conict for the neural network model in this high probability region (which indicates that
our forecasts could be improved), but the observed numbers are both fairly small and not
that far o.
We believe that these forecasts are quite solid, and although many uncertainties remain,
they seem to be far better than any previously produced. They are also more accurate
forecasts than many scholars thought would be possible from conict data. We now turn
to an examination of the systematic patterns that underlie this structure.
8
The one possible exception is the last mean probability bin for the logit, which has very few observations
and so the fraction of 1s has much higher sampling variability. (We chose not to add error bars for each
point so the graphs would be easier to read.)
13
4.3 Causal Structure
Our model appears to predict international conict far better than any previous attempt.
We turn now to what we believe is some clear evidence of the kinds of structure in international conict data that our model reveals. This evidence appears to be consistent with
that predicted by our conjecture about the problem with quantitative conict studies.
To interpret our results, we create what we call marginal eect graphs that plot the
probability of conict by one explanatory variable, holding constant all the others at a
designated value. Since our conjecture about conict studies holds that the eect of most
variables will be larger (i.e., more discriminating) when the ex ante probability of war is
higher, we hold constant the other variables at two values, one for \high" and one for
\low" probabilities of conict. (For simplicity, we compute these by the median of each
explanatory variable among observations where Y = 1 and where Y = 0, respectively.)
Figure 2 presents these graphs for both logit and our neural network model for each of the
explanatory variables except democracy (which we discuss separately afterwards). In the
gure, the low and high controls for the logit model appears in the left two columns and
for the neural network in the right two columns. Each explanatory variable appears in a
separate row; the vertical lines on the graph are one standard error bars surrounding the
predicted probability.
Substantively, the neural network analysis in Figure 2 shows a few similarities to conventional logit analysis but also demonstrates some features consistent with our conjecture. As can be seen by comparing the rst and third columns, the logit analysis allows for
greater eects of the explanatory variables when the ex ante probability of conict is high,
but the dierences are small | and of course virtually all the logistic eects are at lines
indicating no eect detected. In contrast, the neural network model produces much larger
changes in the eect of any variable as we move from low to high ex ante probabilities of
conict (compare the second to the fourth columns). This is particularly obvious for some
of what might normally be considered \control" variables. For example, while contiguity
has a strong eect on the probability of a dispute in the logit analysis, this eect is more
than doubled for the high probably cases in the neural network analysis. While the eect
of contiguity is strong enough so that it is not hidden by the logistic analysis, allowing for
complex interactions shows that the variable has the extremely strong eect on disputes
that we would have expected but previous researchers were unable to demonstrate.
The neural network analysis also shows how important duration dependence (py, the
number of years since the last conict) is when the ex ante probability of a dispute is high.
While the logit analysis nds some evidence duration dependence, it is quite modest.
For high ex ante probability of war situations, with a conict occurring last year, the
probability under a logit analysis is under 0.25. In contrast, the neural network analysis
yields a probability of almost 80%. It takes about a decade for this probability to recede
to nearly zero. Clearly both analyses give a similar pattern of decay, but the more exible
neural network gives a much higher maximum probability of a dispute. The eect of the
duration of peace on the probability of a dispute is clearly underestimated in the logit
analysis. While duration dependence is strong enough to show up in the logit analysis to
some degree, the lack of interactions in that model do not allow us to discern how critical
time is in forecasting future disputes.
The neural network analysis also shows that two important determinants of disputes,
Sq (the similarity of foreign policy positions as measured by the similarity of alliance
portfolios) and Asym (asymmetry, measured by whether one partner in the dyad has a
preponderance of military capability) have a very strong, but non-monotonic eect on the
probability of a dispute. The logit analysis assumes that all eects are monotonic and
14
Probability
Logit−−−Low
Neural Net−−−Low
1
1
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0
0
0.5
1
0
0
0.5
Probability
Contig
0
0
0.5
1
0
1
1
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0
0.5
1
0
0
0.5
1
0
0
0.5
Ally
1
0
1
1
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0
−0.5
0
0.5
1
0
−1
−0.5
Sq
0
0.5
1
0
−1
−0.5
Sq
0
0.5
1
−1
1
1
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0
0.5
1
0
0
0.5
Asym
1
0.5
1
0
1
1
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0
20
Py
30
40
0
0
10
20
30
40
0
0
Py
10
20
30
40
0
10
Py
Figure 2: Marginal Eect of Explanatory Variables: 1947-1985
15
1
Asym
1
10
1
0.5
Asym
1
0
0.5
0
0
Asym
0
0
Sq
1
0
−0.5
Sq
1
0
1
Ally
1
−1
0.5
Ally
1
0
1
Contig
1
Ally
0.5
Contig
1
0
Probability
1
Contig
0
Probability
Neural Net−−−High
1
0
Probability
Logit−−−High
1
20
Py
30
40
so is unable to detect these kinds of relationships. The neural network analysis reveals a
stronger, and more complex, relationship between these variables and conict.9
The logit analysis indicates that similarity of foreign policy positions (Sq) have some
eect on conict within the dyad. But the eect, averaged over all dyads and assuming no
interactions, is very small. The eect is much larger for the neural network analysis when
there is a high ex ante probability of dispute | leading to a huge change in the probability
of a dispute of about 60%, as compared to the logit's predicted maximal impact of under
10%. Interestingly, dyads with extremely dissimilar alliance portfolios have a very low
probability of engaging in disputes. Only as similarity moves from towards the highest
range does it actually make disputes less likely.10 A logit analyst interested in peace would
also prescribe an increase in dyadic similarity of interests. The neural network analysis
demonstrates that this prescription is only correct for dyads that are already somewhat
similar, in which case the gains from increasing similarity are much greater than the logit
analysis indicates.
We see a similar eect of asymmetry of power. While an increasing preponderance of
power in the dyad does lessen the probability of a dispute (in the high ex ante cases), this
eect only occurs when one partner already has much of the power. For dyads that do
not show a high preponderance of force, increases in the preponderance of force actually
raise the probability of a dispute. The neural network analysis shows that our policy
prescriptions must be more subtle, and change with both the levels of each variable and
the levels of all other variables.
Finally, we turn to the eect of democracy on the probability of a dispute. For each
dyad we measure the democracy scores (using Polity3 data; Gurr and Jaggers, 1996) of
each partner, and create two democracy variables, which we call \dema" and \demb".
The countries in a dyad are randomly assigned to the two variables. Theory does not
tell us how these two democracy scores combine to aect disputes. In response, previous
researchers have used a variety of ad hoc measures in logistic models to capture a single
dyadic measure of democracy. Neural networks allow us to enter both partners' democracy
scores in the specication, with the network estimating, rather than the analyst assuming,
how democracy aects the probability of a dispute. The contour plot of this eect, for
both the logit and the neural network analysis and for low and high ex ante probability
of conict, appears in Figure 3. A contour plot, like a hiker's topography map, is a three
dimensional plot of a surface looking downward. The (contour) lines of equal height are
plotted, with numbers indicating the height and representing the probability of a conict
given the values on the horizontal and vertical axes.11
The logit model reveals slight evidence in favor of democratic peace, with smaller
probabilities of conict as we move towards the top right \democratic" corner (of either
logit graph on the left of Figure 3). The very slight eects found by the logit model
⁹ As a diagnostic, we also applied generalized additive models (GAMs) to these data. These models are better than logit because they allow non-monotonicities, but they are not as good as neural networks because they do not allow interactions. GAMs do reveal non-monotonic effects for these two variables, but much smaller ones than under our neural network model.
¹⁰ There are very few dyads at the low range of the Sq variable. The logit analysis simply averages these dyads with all others. The neural network analysis can better separate out effects in this range of the data, revealing the more subtle structure.
¹¹ Each plot is "symmetrically averaged" to ensure symmetry and reduce estimation error. That is, before the averaging, each point on one of these contour plots and its symmetrically reflected value on the other side of the 45 degree line are different estimates of the identical probability. If the countries were completely "mixed" rather than merely randomly assigned to the two variables, dema and demb, this averaging would be unnecessary. The effect is not dramatic (if it were, there would be something wrong with the model), but it does allow us to average over the random noise somewhat.
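In code, the symmetric averaging of footnote 11 is a one-line operation; this sketch assumes the estimated probabilities have been evaluated on a square (dema, demb) grid:

```python
import numpy as np

def symmetric_average(P):
    """Average each grid point with its reflection across the 45-degree line.
    P is a square array of estimated dispute probabilities, with rows indexed
    by dema and columns by demb on the same grid of values."""
    return 0.5 * (P + P.T)
```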
[Figure 3 about here: four contour plots of the probability of a dispute against Dem a (horizontal axis) and Dem b (vertical axis), in panels labeled Logit−−−Low, Neural Net−−−Low, Logit−−−High, and Neural Net−−−High.]

Figure 3: Marginal Effects of Democracy: 1947-1985
The very slight effects found by the logit model are probably due to its incorrect assumption that the effect of democracy is linear (on the logit scale) and additive, an assumption that is inconsistent with most beliefs and evidence about international relations. In contrast, the flexible neural network contours clearly reveal the democratic peace hypothesis. These contours are qualitatively similar to those in a simpler generalized additive model analysis (Beck and Jackman, 1998; Beck and Tucker, 1998). The neural network analysis, however, shows that democracy has a much stronger effect in the high ex ante probability of conflict cases. In the earlier work, movements in democracy changed the probability of a dispute by only a few percent; here the change is fully 30% in the high probability case. In the top right "democratic" corner, the probability of a dispute is clearly the lowest in the neural network graphs. (Consistent with our previous findings, the bottom right "autocratic" corner does not show the highest probability of a dispute: while democracies do not fight other democracies, they are no less likely to fight autocracies than are other autocracies.) Moreover, as soon as we leave the extreme top right corner (that is, move away from the most democratic dyads), the probability of dispute increases. (In the postwar data there are relatively few dyads in the interior of our plot, and so the confidence intervals, which would be difficult to show in this type of graph, are much wider in that middle region.)
What is most stunning about these results is how differently democracy affects the probability of a dispute in the low and high ex ante probability of conflict cases. In the low probability situation, changes in democracy can shift the probability of a dispute from a maximum of two percent to a minimum of just about zero. But, consistent with our conjecture, in the high probability subset we see democracy changing the probability of a dispute from over 50% down to merely 20%. This is a much bigger effect than anyone has previously found, and the societal import could be substantial: if we can democratize dyads that, based on other explanatory variables, have a high probability of engaging in a dispute, we could dramatically increase peace; democratizing dyads that are already likely to be at peace will, not surprisingly, have almost no effect.
Note, however, that the biggest effect in the reduction of international conflict occurs all the way in the top right corner (of the bottom right plot), meaning that small increases in the degree of democratization matter only in countries that are fairly democratic to begin with. So as a matter of public policy, if we can make a country more democratic, it will reduce the likelihood of war, but only after the country has been mostly democratized does the effect become really substantial. (Your grandmother was right all along: all the vitamins are packed into the last few string beans.) Previous studies have found a statistically significant, but much weaker and less nuanced, effect of democracy on peace because they have averaged high and low probability of conflict situations and disallowed other types of nonlinearities and interactions.
The simple plots we provide here cannot, of course, demonstrate the full structure of the neural network, nor can they fully demonstrate its power. That is more clearly shown in our ability to forecast disputes much better than any linear or additive analysis. But these plots do show that our results are consistent with, but much stronger and more complicated than, what has been found in more traditional analyses.
5 Concluding Remarks
By adopting, adapting, and importing into this field a statistical model that better matches the substantive concerns of quantitative and qualitative researchers than existing methods, we have provided much more accurate forecasts of international conflict than have researchers using similar data but with simpler models. Indeed, this would appear to be the first paper to forecast international conflicts with probability greater than 0.5 (ours go above 0.9 for some cases). Of the international conflicts, we are able to forecast about 17% from data in the years prior to the conflict.
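The 17% figure is the share of actual conflicts that receive an ex ante forecast probability above one half. A minimal sketch of that arithmetic, on hypothetical forecast and outcome arrays (the function name is ours):

```python
import numpy as np

def conflict_recall(p_hat, y, threshold=0.5):
    """Share of observed conflicts (y == 1) whose forecast probability
    exceeds the threshold; inputs here are purely illustrative."""
    p_hat, y = np.asarray(p_hat), np.asarray(y)
    return float((p_hat[y == 1] > threshold).mean())
```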
This forecasting result can only be driven by an underlying structure of international politics that stays relatively stable over time. Understanding, and even confirming the existence of, this structure has been something of the holy grail of quantitative conflict studies, and we believe our neural network approach has the potential to achieve this goal. We believe that, with the appropriate graphical tools we introduce, the statistical models we use have the potential to uncover structure in many areas. In particular, we had little difficulty interpreting our results in the context of the standard literature on the democratic peace.
Neural networks are computationally and intellectually complex. But they are no more than extensions of standard interactive models. While early neural network research often seemed to overfit the data, new Bayesian analysis can apparently surmount much of that problem. It seems unlikely that the effect of any variable does not interact with at least some of the other variables. Neural networks are a tool that can unlock such complicated structures. There is no question that they do a wonderful job of recognizing patterns. But they can also find complicated structural regularities in standard international relations conflict data.
A Bayesian Methods for Neural Network Models
A.1 The Model
The basic neural network model we estimate is given in Equation 3 and repeated here:
\[
Y_i \sim \mathrm{Bernoulli}(\pi_i)
\]
\[
\pi_i = \mathrm{logit}^{-1}\!\left( \gamma_0 + \gamma_1\,\mathrm{logit}^{-1}\!\big(X_i \beta^{(1)}\big) + \gamma_2\,\mathrm{logit}^{-1}\!\big(X_i \beta^{(2)}\big) + \cdots + \gamma_M\,\mathrm{logit}^{-1}\!\big(X_i \beta^{(M)}\big) \right) \tag{4}
\]
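For concreteness, here is a minimal sketch of this forward pass in Python (our own illustrative code with hypothetical names, not the implementation behind the results above):

```python
import numpy as np

def inv_logit(z):
    """The inverse logit (logistic) function."""
    return 1.0 / (1.0 + np.exp(-z))

def dispute_probability(X, gamma, beta):
    """Compute pi_i from Equation 4.
    X:     (n, k) matrix of explanatory variables for n dyad-years;
    beta:  (k, M) hidden-layer weights, column m holding beta^(m);
    gamma: length M+1 vector of output weights, gamma[0] the constant."""
    hidden = inv_logit(X @ beta)               # (n, M) hidden-unit values
    return inv_logit(gamma[0] + hidden @ gamma[1:])
```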
To this model, we add a standard Bayesian hierarchical structure to shrink the space in which the numerous parameters probabilistically lie. We do this by adding independent normal prior distributions N(0, 1/τ_h) for each group of the parameters, with one τ_h (and hence the index h) for the constant term γ_0, one for each element of γ other than γ_0, and one for the set of elements of β. In addition, each of the τ_h elements is assumed to have an (uninformative) improper uniform prior distribution.
A.2 The Posterior Distribution of the Parameters
The posterior distribution is formed, as usual, as the product of the prior distributions
and the likelihood function:
\[
P(\beta, \gamma, \tau \mid Y) \propto \left( \prod_{h=1}^{k+2} N(0,\, 1/\tau_h) \right) \left( \prod_{i=1}^{n} \pi_i^{Y_i} \, (1 - \pi_i)^{1 - Y_i} \right) \tag{5}
\]
where π_i is defined in the second line of Equation 4.
Ideally, we would be able to draw random samples of β, γ, and τ directly from this posterior distribution to compute quantities of interest. Indeed, this has been accomplished with a hybrid version of Markov chain Monte Carlo (MCMC) methods (Neal, 1996). However, runs that provide exact draws from the posterior with MCMC methods take an inordinately long time to complete. The method also has all the usual problems caused by a lack of agreement on how to assess stochastic convergence in MCMC algorithms.
After trying Neal's approach, we felt that having more time to experiment with different specifications to understand the data was necessary, and so we adopted the Gaussian approximation approach due to MacKay (1992a,b,c, 1994). MacKay's approach is to maximize Equation 5 with respect to the parameters and then to compute the variance-covariance matrix for β and γ based on the full posterior. He also assumes that τ has a posterior density that is sufficiently spiked that we can replace it with its maximum posterior estimate.
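A minimal sketch of the objective that MacKay's approach maximizes, holding the precision parameters fixed (the parameter grouping and names are our assumptions, and dispute_probability is the sketch from Section A.1):

```python
import numpy as np

def neg_log_posterior(gamma, beta, X, y, tau_g0, tau_g, tau_b):
    """Negative log of Equation 5 (up to an additive constant), holding the
    precision parameters tau fixed at their current estimates."""
    pi = np.clip(dispute_probability(X, gamma, beta), 1e-10, 1 - 1e-10)
    log_lik = np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi))
    # Gaussian shrinkage priors: tau_g0 for the constant gamma_0, tau_g (a
    # scalar or length-M array) for the remaining output weights, and tau_b
    # for all hidden-layer weights.
    log_prior = -0.5 * (tau_g0 * gamma[0] ** 2
                        + np.sum(tau_g * gamma[1:] ** 2)
                        + tau_b * np.sum(beta ** 2))
    return -(log_lik + log_prior)
```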
A.3 Posterior Probabilities of Conflict
One of our goals is to generate forecasts of international conflict. The other is to see how these forecasts would change given various configurations of the explanatory variables, X_i. For both goals, we need to specify the posterior probability of the forecasts.
For this section, we denote the observed data as Y and the future value we wish to forecast as Z. Conceptually, this task is simple, and could in principle be accomplished by the usual simulation methods that apply to virtually any other statistical model (see King, Tomz, and Wittenberg, 1998). That is, draw random samples of β, γ, and τ from their posterior distribution in Equation 5 (or their asymptotic normal approximation), insert them into the functional form in the second line of Equation 4 to compute π_i, and take a random draw from a Bernoulli distribution with this parameter (the first line of Equation 4). In practice, we used MacKay's (1994) analytical approximations to accomplish the same task.
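A sketch of that simulation recipe, again with illustrative assumptions (the flat parameter layout is ours, and dispute_probability is the earlier sketch):

```python
import numpy as np

def simulate_forecast(x_new, theta_hat, vcov, k, M, n_sims=1000, seed=0):
    """Draw parameter vectors from the asymptotic normal approximation to the
    posterior in Equation 5, push each through Equation 4, and return both the
    simulated dispute probabilities and Bernoulli draws of the forecast Z."""
    rng = np.random.default_rng(seed)
    thetas = rng.multivariate_normal(theta_hat, vcov, size=n_sims)
    pis = np.empty(n_sims)
    for s, theta in enumerate(thetas):
        gamma = theta[: M + 1]                 # output weights, incl. constant
        beta = theta[M + 1:].reshape(k, M)     # hidden-layer weights
        pis[s] = dispute_probability(x_new[None, :], gamma, beta)[0]
    outcomes = rng.random(n_sims) < pis        # first line of Equation 4
    return pis, outcomes
```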
References
Beck, Nathaniel; Jonathan Katz; and Richard Tucker. 1998. "Taking Time Seriously: Time-Series-Cross-Section Analysis with a Binary Dependent Variable," American Journal of Political Science, forthcoming.
Beck, Nathaniel and Richard Tucker. 1998. "Democracy and Peace: General Law or Limited Phenomenon?" Paper presented at the Annual Meeting of the Midwest Political Science Association, Chicago, IL.
Bishop, C.M. 1995. Neural Networks for Pattern Recognition. Oxford University Press.
Bueno de Mesquita, Bruce. 1981. The War Trap. New Haven, CT: Yale University Press.
Cheng, B. and Titterington, D.M. 1994. "Neural Networks: A Review from a Statistical Perspective," Statistical Science, 9, 1: 2-54.
Geller, Daniel S., and J. David Singer. 1998. Nations at War: A Scientific Study of International Conflict. New York: Cambridge University Press.
Gelman, Andrew and Gary King. 1990. "Estimating Incumbency Advantage Without Bias," American Journal of Political Science, 34, 4 (November): 1142-1164.
Gurr, Ted Robert and Jaggers, Keith. 1996. "Polity III: Regime Change and Political Authority, 1800-1994," Computer file, Inter-university Consortium for Political and Social Research, Ann Arbor, MI.
Hertz, J.; Krogh, A.; and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Reading, MA: Addison-Wesley.
Jones, Daniel M., Stuart A. Bremer, and J. David Singer. 1996. "Correlates of War Project: Militarized Interstate Dispute Data, 1816-1992," Conflict Management and Peace Science, 15: 163-213. http://pss.la.psu.edu/cow%20data/mids_210.zip
King, Gary. 1989a. "Event Count Models for International Relations: Generalizations and Applications," International Studies Quarterly, 33, 2 (June): 123-147.
King, Gary. 1989b. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Ann Arbor: University of Michigan Press.
King, Gary; Michael Tomz; and Jason Wittenberg. 1998. "How to Interpret and Present Statistical Results or Enough with the Logit Coefficients Already!" Paper prepared for the annual meetings of the American Political Science Association.
Kuan, C.M. and White, H. 1994. "Artificial Neural Networks: An Econometric Perspective," Econometric Reviews, 13(1): 1-91.
Levy, Jack S. 1989. "The Causes of War: A Review of Theories and Evidence." In Behavior, Society, and Nuclear War, Vol. 1, edited by Philip E. Tetlock, Jo L. Husbands, Robert Jervis, Paul C. Stern, and Charles Tilly, 209-333. New York: Oxford University Press.
MacKay, D.J.C. 1992a. "Bayesian interpolation," Neural Computation, 4(3): 415-447.
MacKay, D.J.C. 1992b. "A practical Bayesian framework for backprop networks," Neural Computation, 4(3): 448-472.
MacKay, D.J.C. 1992c. "The evidence framework applied to classification networks," Neural Computation, 4(5): 698-714.
MacKay, D.J.C. 1994. "Bayesian Methods for Backpropagation Networks." In E. Domany, J.L. van Hemmen, and K. Schulten (Eds.), Models of Neural Networks, III, Chapter 6. New York: Springer-Verlag.
MacKay, D.J.C. 1995. "Bayesian Non-linear Modeling for the 1993 Energy Prediction Competition." In G. Heidbreder (Ed.), Maximum Entropy and Bayesian Methods. Dordrecht: Kluwer.
Maoz, Zeev. 1996. Domestic Sources of Global Change. Ann Arbor, MI: University of Michigan Press.
Muller, B. and Reinhardt, J. 1990. Neural Networks: An Introduction. Berlin: Springer.
Nagler, Jonathan. 1991. "The Effect of Registration Laws and Education on U.S. Voter Turnout," American Political Science Review, 85, 4 (December).
Nagler, Jonathan. 1994. "Scobit: An Alternative Estimator to Logit and Probit," American Journal of Political Science, 38, 1 (February).
Neal, R.M. 1996. Bayesian Learning for Neural Networks. Springer-Verlag.
Ripley, Brian D. 1996. Pattern Recognition and Neural Networks, New York: Cambridge
University Press.
Rosenau, James N., Ed. 1976. In Search of Global Patterns. New York: Free Press.
Rumelhart, D.E. et al. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: MIT Press.
Schrodt, P. 1995. Patterns, Rules and Learning: Computational Models of International
Behavior. Manuscript.
Signorino, Curtis. "Estimation and Strategic Interaction in Discrete Choice Models of International Conflict," Harvard University Center for International Affairs working paper.
Smith, Alastair. 1998. \A Summary of Political Selection: The Eect of Strategic Choice
on the Escalation of International Crises," American Journal of Political Science, 42,
2: 698{701.
Vasquez, John A. 1993. The War Puzzle. Cambridge, New York: Cambridge University
Press.
White, H. 1992. Artificial Neural Networks: Approximation and Learning Theory. Oxford and Cambridge, MA: Blackwell.
Wright, Quincy. 1942/1965. A Study of War. Chicago: University of Chicago Press.
Zeng, Langche. 1996a. "Neural Network Models and Political Data Analysis," Paper presented at the APSA Annual Meeting, San Francisco, September; forthcoming in Diana Richards (ed.), Political Complexity: Nonlinear Models of Politics. Ann Arbor: University of Michigan Press.
Zeng, Langche. 1996b. "Prediction and Classification with Neural Network Models," Paper presented at the MPSA Annual Meeting, Chicago, April; forthcoming in Sociological Methods and Research.