Data Mining for Theorists
Brenton Kenkel∗
Curtis S. Signorino†
July 22, 2011
Work in Progress: Comments Welcome.
Abstract
Among those interested in statistically testing formal models, two approaches dominate. The structural estimation approach derives a structural probability model based
on the formal model and then estimates parameters associated with that model. The
reduced-form approach generally applies off-the-shelf techniques—such as OLS, logit,
or probit—to test whether the independent variables are related to a decision variable
according to the comparative statics predictions. We provide a new statistical method
for the comparative statics approach. The decision variable of interest is modeled as
a polynomial function of the available covariates, which allows for the nonmonotonic
and interactive relationships commonly found in strategic choice data. We use the
adaptive lasso to reduce the number of parameters and prevent overfitting, and we
obtain measures of uncertainty via the nonparametric bootstrap. The method is “data
mining” because the aim is to discover complex relationships in data without imposing
a particular structure, but “for theorists” in that it was developed specifically to deal
with the peculiar features of data on strategic choice. Using a Monte Carlo simulation, we show that the method handily outperforms other non-structural techniques in
estimating a nonmonotonic relationship from strategic choice data.
∗ Ph.D. candidate, Department of Political Science, University of Rochester (email: [email protected])
† Associate Professor, Department of Political Science, University of Rochester (email: [email protected])
Contents
1 Introduction
2 Approaches to Testing Formal Models
3 Strategic Estimation without Game Forms
4 Simulation
5 Empirical Applications
  5.1 Schultz (2001)
6 Conclusion
A Appendix
  A.1 K-fold cross-validation
1 Introduction
The integration of formal modeling and empirical research has become a major area of inquiry
in political science. The statistical literature in this area has focused on the development of
estimators derived directly from formal models (e.g., Signorino 1999; Smith 1999; Lewis and
Schultz 2003). These structural estimators make the tightest possible links between theory
and data, but they can be difficult to derive and implement. When structural estimation of a
model is infeasible, applied researchers have two bad options: use canned statistical routines
known to be inappropriate for strategic choice data, or refrain from any systematic test
of the model’s implications. This paper provides a better solution—an easy-to-implement,
non-structural method for estimation of comparative statics relationships in strategic data.
The method is based on statistical models familiar to political scientists, and it does not
require any application-specific derivations or programming.
Strategic interaction can produce data with nonlinear and nonmonotonic relationships,
even if the underlying model is simple. Naïve application of familiar statistical techniques
like OLS and logistic regression cannot capture such relationships—this has been the crux
of the argument for structural strategic estimation. But if one’s goal is simply to verify
that a particular comparative statics prediction holds up in the data, rather than the more
ambitious task of estimating the actual parameters of the game, structural models are not
necessary. The only requirement is a statistical technique that is flexible enough to allow for
interactive and nonmonotonic relationships. We achieve this by using standard regression
techniques, but with a polynomial basis expansion of the independent variables. To guard
against overfitting, we bootstrap the adaptive lasso (Zou 2006), a machine learning technique
for variable selection in high-dimensional problems. The method is “data mining” because
the aim is to discover complex relationships in data without imposing a particular structure,
but “for theorists” in that it was developed specifically to deal with the peculiar features of
data on strategic choice.
The paper proceeds as follows. Section 2 discusses the differences and respective advantages of structural and reduced-form approaches to testing predictions from formal models.
Section 3 introduces our method, the bootstrapped adaptive lasso for polynomial regression. Section 4 presents the results of a Monte Carlo simulation showing that the method
captures a nonmonotonic relationship in a strategic choice setting reasonably well, and that
it outperforms some alternative techniques. Section 6 concludes, and Appendix A provides
additional details not given in the text.
2 Approaches to Testing Formal Models
Over the last decade, political scientists have paid increasing attention to the question of how
to properly test hypotheses derived from formal models. Most of the methodological literature on this topic has taken a structural approach, as in developing statistical models whose
parameters directly correspond to elements of game-theoretic models (e.g., Signorino 1999;
Smith 1999; Lewis and Schultz 2003). From a statistical perspective, structural modeling is
the most direct way to connect formal theories to observed data. However, the structural
approach has some well-known drawbacks, such as the difficulty of deriving and implementing full-information estimators in the absence of “canned” techniques. An alternative way
to proceed, which we call the reduced form approach, is to derive comparative statics from
theoretical models and use standard estimators to test whether these predictions are borne
out in data. Reduced form approaches, by definition, cannot yield estimates for particular
elements of a formal model (e.g., a player’s utility for some outcome), but they may still
be useful for assessing whether observed data conform to the main predictions of a formal
model.
Reduced form estimation places fewer demands on researchers and data than structural
approaches do. Structural modeling requires that one derive a likelihood function that
corresponds to the solution of the model being examined. This may be a difficult analytical
task itself, especially if the model involves continuous choice sets or complex information
transmission. Moreover, computation of such a model via maximum likelihood or Bayesian
MCMC may not be straightforward, since there is no guarantee that the likelihood function
will be well-behaved or easy to evaluate. For example, the signaling model of Lewis and
Schultz (2003) requires that a nonlinear system be solved numerically in each evaluation of
the likelihood. A reduced form approach allows researchers to bypass these hurdles: the
only analytical work needed is to derive the comparative statics, and there is typically no
application-specific programming necessary.
Identification issues, which can be troublesome for structural estimation, are obviated
in reduced form approaches. It usually is not possible to estimate all of the parameters of
a given model; some exclusion restrictions are necessary to ensure that the estimates are
well-defined. It can be challenging to establish the identification conditions even for simple
models, as illustrated by Lewis and Schultz’s (2003) derivation of the requirements for a class
of discrete choice models. Even worse problems arise if the model has multiple equilibria in
some regions of the parameter space, in which case a selection rule is necessary to ensure that
the probability model is valid. If our goal is simply to assess a comparative statics prediction
by estimating the conditional relationship between a covariate and a decision variable, we
need not deal with exclusion restrictions and selection rules. The method developed in this
paper is built on generalized linear models, so the only requirement is that the covariates be
linearly independent.
Even if all of the aforementioned obstacles are surmountable, there are some more subtle problems that arise in structural modeling. In general, the dataset must specify the
“player” in the model that each real-world actor corresponds to, and it must have recorded
the entire sequence of decisions—not just the outcome of interest. For example, to estimate the workhorse model of international conflict—a sequential two-player model of crisis
bargaining—the researcher must specify which state got to make the first offer. Moreover,
the size of the offer itself must be known, even if the researcher is only interested in what
determines whether war occurs. Few datasets in political science are collected with this kind
of format in mind, so the imposition of the necessary structure will almost surely entail some
ad hoc guesswork. Sometimes it is possible to modify the estimator to account for missingness of this type of information, but at the cost of yet more analytical and computational
issues.
We do not intend to disparage structural modeling — indeed, we are proponents of
structural modeling under the right circumstances. When the structure of the interaction
is well known and it is feasible to derive a statistical model based on the relevant theory,
structural modeling is efficient and yields more informative parameter estimates than a
reduced-form estimator could. But these requirements are stringent—it is easy to think of
research situations in which they are not met. Our goal is to make it possible to test the
implications of formal theories even in these more difficult settings.
3 Strategic Estimation without Game Forms
In this section, we develop a new method to estimate comparative-statics relationships from
strategically generated data, for which no specification of the game form is necessary. The
estimator is built on the familiar generalized linear model, so the procedure is relatively quick
and the results easily interpretable. The method is built on three components. First, we use
polynomial regression to approximate interactive nonlinear relationships. Second, we use the
adaptive lasso to eliminate irrelevant polynomial terms and guard against overfitting. Third,
we use the nonparametric bootstrap to obtain standard errors and confidence intervals. Our
working shorthand name for this process is polywog.
We consider a general model of a strategic data-generating process. There are n independent observations of a game whose outcome is Yi ∈ R, where Y = (Y1, . . . , Yn). Players' utilities are a function of k variables, denoted Z1 through Zk. In matrix notation, the full set of covariates is the n × k matrix Z, with ith row zi and jth column Zj. There is some probabilistic component of the game, such as “trembling hand” agent error or private information (see Signorino 2003), so that the model can be written as

Yi = f(zi) + εi,     (1)
where f is an unknown real-valued function and εi is a random variable with full support
on the real line.1 The structural approach to strategic estimation is to posit f (zi ) = g(zi ; θ),
where g is a known function derived from the equilibrium conditions of a particular game
form, and estimate the finite-dimensional parameter vector θ. We instead take a reduced-form approach, in which the goal is to estimate comparative statics without specifying the
exact structure of the strategic interaction.
In non-strategic contexts, the typical approach to estimation of models like equation (1)
is to assume that f (·) is an additive function of the covariates,
Yi = Ȳ + ∑_{j=1}^{k} fj(zij) + εi.     (2)
1 For ease of presentation, we focus on the case of unbounded, continuous Yi. However, with minor modifications, it is straightforward to apply all of the techniques developed in this section to outcome variables that are binary, categorical, counts, or durations.
With this setup, known as a generalized additive model, or GAM, it is straightforward to
estimate the functions f1 , . . . , fk nonparametrically via splines or other standard smoothing
techniques (Hastie and Tibshirani 1990). Unfortunately, the assumption of additivity is not
appropriate for data on strategic interactions. Under any standard equilibrium concept,
each player’s choice depends on her expected utility, which is a multiplicative function of her
expectations about others’ actions and her own payoff for each possible outcome. Therefore,
even in simple sequential games where all players’ payoffs are linear in the covariates, the
outcome may depend significantly on nonlinear functions of the variables and interactions
between them (Signorino and Yilmaz 2003). While GAMs allow for modeling nonlinear
relationships, most implementations do not model interactions, nor provide guidance on
variable selection.
In lieu of the additivity assumption, we need another way to make estimation of f (·)
tractable. Our approach is to use a d-degree polynomial approximation,2 which can be
obtained via a Taylor series expansion:
Yi = f(zi) + εi
   ≈ ∑ β_{j1 j2 ··· jk} zi1^{j1} zi2^{j2} · · · zik^{jk} + εi,     (3)

where the sum runs over all j1, . . . , jk ∈ N such that ∑ jm ≤ d. We can then estimate β by least squares, as long as d is small enough that the number of terms does not exceed n. To save on notation, we denote the total number of terms (other than the constant) in the polynomial expansion as p, and we write the coefficients as β = (β0, β1, . . . , βp).
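To make the expansion in equation (3) concrete, here is a minimal Python sketch (ours, not the authors' code, which is in R): it enumerates every exponent vector (j1, . . . , jk) with total degree at most d and builds the corresponding design matrix.

```python
# Sketch of the d-degree polynomial basis expansion in equation (3):
# enumerate all exponent vectors (j1, ..., jk) with j1 + ... + jk <= d
# and form the matrix of corresponding monomials.
from itertools import product

import numpy as np

def poly_expand(Z, d):
    """Expand an n x k matrix Z into all monomials of total degree <= d."""
    n, k = Z.shape
    exps = sorted((e for e in product(range(d + 1), repeat=k) if sum(e) <= d),
                  key=sum)  # constant term first, then degree 1, 2, ...
    X = np.column_stack([np.prod(Z ** np.array(e), axis=1) for e in exps])
    return X, exps
```

The column count reproduces the formula in the text: for k = 10 covariates, a 2nd-degree expansion yields 66 columns (including the constant) and a 3rd-degree expansion yields 286.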
Our work does not stop at the polynomial approximation itself. Unless the number of
covariates is unusually small, even a low-degree polynomial expansion will contain enough
terms to pose efficiency problems for standard regression techniques. The number of terms
in a d-degree expansion of k covariates is ∑_{m=0}^{d} C(m+k−1, k−1), which grows quickly with d and k.

2 A more general approach, which we will pursue in future work, would be to consider any d-dimensional basis function approximation, such as B-splines.

For example, with 10 covariates, a 2nd-degree expansion requires 66 terms, while a 3rd-degree expansion requires 286. In this situation, some form of model selection is needed to
avoid overfitting and inefficiency in the regression estimates. The model selection procedure
best known to political scientists, if only as an object of derision (King 1986, 669), is stepwise regression. In some contexts, stepwise procedures are useful for eliminating irrelevant
variables from a regression model (Hastie, Tibshirani and Friedman 2009, 58–60). However,
stepwise regression often fails to eliminate many “noise” variables, especially when there is
correlation among the covariates—a problem that does not necessarily go away as the sample size increases (Derksen and Keselman 1992). Estimates of uncertainty from best-subset
procedures like stepwise regression are too small, and confidence intervals contain the true
parameters less often than the nominal rate would indicate (Hurvich and Tsai 1990). The
stepwise procedure is computationally burdensome when the number of terms is large, as it
requires refitting the model hundreds or even thousands of times. Finally, stepwise regression estimates are not continuous, meaning small changes to the data can cause the set of
variables selected and their estimated coefficients to vary wildly.
To avoid these problems, we perform model selection via the adaptive lasso (Zou 2006),
which has recently become popular in the statistical machine learning literature. The procedure is based on the “least absolute shrinkage and selection operator,” or lasso, proposed
by Tibshirani (1996). The lasso is a form of penalized regression in which the typical least-squares objective function is minimized subject to a constraint on the magnitude of the coefficients, ∑_{j=1}^{p} |βj| ≤ t. When the tuning parameter t ≥ 0 is small enough, the lasso
performs model selection, yielding estimates of exactly 0 for some of the coefficients. The
adaptive lasso is an extension of the lasso with better statistical properties, including the
oracle property (defined by Fan and Li 2001):
1. As n increases, the probability of selecting the correct model (β̂j ≠ 0 if and only if βj ≠ 0) approaches 1,

2. As n increases, the distribution of β̂ is the same as that of the coefficients from regression on the true model (Xj such that βj ≠ 0).
The oracle property means that, as n → ∞, the adaptive lasso identifies the true model,
and its estimates are no less efficient than if the true model—the set of covariates with a
non-zero effect in the population—were known in advance. In addition, the adaptive lasso is
continuous, so there is no danger of instability in the model selection or coefficient estimates.
The adaptive lasso proceeds in two steps. First, one obtains a consistent initial estimate
of the model parameters β̂ o , typically via OLS. Second, the adaptive lasso estimate is the
solution of

β̂ = argmin_{β ∈ Rp} ∑_{i=1}^{n} (Yi − xi'β)² + λ ∑_{j=1}^{p} |βj| / |β̂j^o|,     (4)
where λ ≥ 0 is a tuning parameter (discussed below). There are a few important things to
note about equation (4). The intercept, β0 , is not penalized. The penalty term preserves the
convexity of the objective function, so a solution exists and is unique. The presence of β̂ o is
critical for consistent model selection. For any j such that βj = 0, consistency of β̂^o implies that β̂j^o → 0 as n → ∞, meaning the adaptive lasso penalty on |βj| becomes arbitrarily large
in the limit. Therefore, if λ > 0, the adaptive lasso will, with high probability, estimate
β̂j = 0 when n is sufficiently large.
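As an illustration (ours, not the authors'), the two-step estimator in equation (4) can be computed with a plain lasso solver by rescaling each column by the initial estimate: dividing the penalty by |β̂j^o| is equivalent to multiplying column j by |β̂j^o| before an ordinary lasso fit. The sketch below implements this in Python with a small coordinate-descent lasso; the paper's own computations use glmnet in R, and the 1e-12 floor on the weights is our guard against an exactly-zero initial estimate, not part of the estimator as stated.

```python
# Sketch of the two-step adaptive lasso in equation (4).
import numpy as np

def soft(z, g):
    """Soft-thresholding operator used in lasso coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, iters=500):
    """Plain lasso, objective (1/2n)||y - Xb||^2 + lam*||b||_1,
    with an unpenalized intercept handled by centering."""
    n, p = X.shape
    xm, ym = X.mean(axis=0), y.mean()
    Xc, yc = X - xm, y - ym
    b, r = np.zeros(p), yc.copy()
    col_ss = (Xc ** 2).sum(axis=0) / n
    for _ in range(iters):
        for j in range(p):
            r += Xc[:, j] * b[j]                      # partial residual
            b[j] = soft(Xc[:, j] @ r / n, lam) / col_ss[j]
            r -= Xc[:, j] * b[j]
    return ym - xm @ b, b

def adaptive_lasso(X, y, lam):
    # Step 1: consistent initial estimate, here OLS.
    ones = np.column_stack([np.ones(len(y)), X])
    beta_ols = np.linalg.lstsq(ones, y, rcond=None)[0][1:]
    w = np.maximum(np.abs(beta_ols), 1e-12)
    # Step 2: weighted lasso -- scaling column j by |beta_ols_j| turns
    # lam * sum_j |b_j| / |beta_ols_j| into a plain L1 penalty.
    b0, b_scaled = lasso_cd(X * w, y, lam)
    return b0, b_scaled * w                           # undo the rescaling
```

Because the weights blow up the penalty on covariates whose initial estimate is near zero, irrelevant coefficients are set to exactly zero even at modest values of λ.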
The penalization factor λ controls the sparsity of the estimated model, also known as
“shrinkage,” and must be selected carefully. If λ = 0, the adaptive lasso is equivalent to
ordinary least squares. Higher values of λ entail more aggressive model selection, with more
coefficients estimated as exactly 0; in the limit, the estimates are β̂0 = Ȳ and β̂j = 0 for all
j = 1, . . . , p. The criteria for selecting the penalization factor are similar to those for choosing
the bandwidth in kernel density estimation and other nonparametric problems, because there
is a bias-variance tradeoff. If there is too much shrinkage, some relevant covariates will be
left out of the estimated model, producing bias. With too little shrinkage, the model will be
overfitted, meaning the variance of the estimates is unacceptably high. The ideal value of λ
would minimize the mean squared error,
MSE(λ) = Bias(β̂; λ)² + Var(β̂; λ).
We can find an estimate of this ideal λ using K-fold cross-validation (Hastie, Tibshirani and
Friedman 2009, 241–244), which entails splitting the data into K subsamples and using them
to approximate the out-of-sample prediction error of the adaptive lasso for many candidate
values of λ. We choose the value that produces the lowest prediction error across subsamples.
See Appendix A.1 for a full description of the K-fold cross-validation procedure.
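The selection step can be sketched as follows (a generic K-fold routine of our own, not the appendix's exact procedure; `fit` and `predict` stand in for the adaptive lasso at a candidate λ, and K = 5 is an illustrative choice):

```python
# Sketch of K-fold cross-validation for the penalty lambda: assign each
# observation to one of K folds, fit on the other folds, accumulate squared
# out-of-fold prediction error, and keep the lambda with the lowest total.
import numpy as np

def cv_lambda(X, y, fit, predict, lambdas, K=5, seed=0):
    n = len(y)
    fold = np.random.default_rng(seed).permutation(n) % K
    err = np.zeros(len(lambdas))
    for k in range(K):
        tr, te = fold != k, fold == k
        for i, lam in enumerate(lambdas):
            model = fit(X[tr], y[tr], lam)
            err[i] += np.sum((y[te] - predict(model, X[te])) ** 2)
    return lambdas[int(np.argmin(err))]   # lowest cross-validated error
```

Heavy shrinkage that biases the fit shows up directly as higher out-of-fold error, so grossly oversized values of λ are rejected automatically.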
The last step is to provide measures of uncertainty in the estimates, such as standard
errors and confidence intervals. Zou (2006) discusses a method to calculate standard errors
for adaptive lasso estimates, but it applies only to the nonzero coefficients. We instead
use the bootstrap, so that we can estimate uncertainty in all elements of β̂.3 Chatterjee
and Lahiri (2011) prove that the residual bootstrap consistently estimates standard errors
and confidence intervals for adaptive lasso coefficients. We use the nonparametric bootstrap
rather than the residual bootstrap, since it is more robust to basic linear model violations
such as heteroscedasticity and can be applied to generalized linear models as well. This
method consists in constructing B new datasets by resampling (Yi , xi ) with replacement
from the original data, and running the adaptive lasso on each.4 Denote the adaptive lasso
estimate in the bth iteration as β̂ (b) . The bootstrap standard error estimate for β̂j is its
standard deviation across iterations,

s.e.(β̂j) = √[ (1/B) ∑_{b=1}^{B} (β̂j^(b) − β̂j^(·))² ].
The bootstrap 95% confidence interval for β̂j is the interval bounded by the 2.5th and 97.5th percentiles of β̂j^(b).

3 Use of the bootstrap does not guarantee nondegenerate standard error estimates, since some terms may be excluded from the estimated model in every bootstrap iteration.

4 For computational tractability, we use the same value of λ in the bootstrap iterations as in the original fit.
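A compact sketch of this nonparametric bootstrap (our illustration; `fit` stands in for the adaptive lasso at the cross-validated λ, held fixed across iterations):

```python
# Sketch of the nonparametric bootstrap: resample rows (Y_i, x_i) with
# replacement, refit on each resample, then summarize the B coefficient
# draws with standard deviations and 2.5th/97.5th percentiles.
import numpy as np

def np_bootstrap(X, y, fit, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.array([fit(X[i], y[i])
                      for i in (rng.integers(n, size=n) for _ in range(B))])
    se = draws.std(axis=0)                              # bootstrap SEs
    lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)  # 95% percentile CI
    return se, lo, hi
```

Resampling whole rows, rather than residuals, is what makes the procedure robust to heteroscedasticity and applicable to generalized linear models.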
[Figure: Player 1 chooses L or R; after R, Player 2 chooses L or R. The terminal payoffs are U11 + ε11 after L; U13 + ε13 and U23 + ε23 after (R, L); and U14 + ε14 and U24 + ε24 after (R, R).]
Figure 1. Game tree for the data-generating process used in the Monte Carlo simulation.
4 Simulation
We verify the performance of the techniques described above by applying them to simulated
strategic choice data. We show that the method is able to detect a nonmonotonic relationship
between a covariate and an outcome, with a tolerable rate of prediction error compared to
full structural estimation. The data-generating process is illustrated in Figure 1, where each
εjk term is private information drawn from the standard normal distribution. We provide
only a brief exposition of the model; interested readers may consult Signorino (2003) or
Signorino and Tarar (2006) for complete derivations. There are N = 2,500 observations of
the outcome Yi ∈ {L, RL, RR} and of the covariates xi = (X1i, X2i, . . . , X6i)', where

x*i ~ N(μ, Σ),
μ = (0, 0, 0, 0, −1, 1)',
Σjk = 1 if j = k,  Σjk = ρ if j ≠ k,
Xji = X*ji for j = 1, 2, 5, 6,
Xji = I(X*ji > 0) for j = 3, 4.
The variables X3 and X4 are binary, while all others are continuous and unbounded. Note
that X4 through X6 are noise variables with no true effect on the outcome, though they are
correlated with the relevant variables if ρ 6= 0. The true data-generating process is given by
the equations
U11i = −1 + 2X1i + X2i
U13i = 1.25
U14i = 0.5X3i − X2i          (5)
U23i = 0
U24i = 0.75 + X1i + X2i + 0.5X3i
Our aim is to characterize the comparative-static relationship between each covariate and
the probability of observing the outcome Yi = RR. Let Ỹi be a binary indicator for the
event Yi = RR. Using (5) and the perfect Bayesian equilibrium solution concept, the true
relationship is given by the nonlinear equation
Pr(Ỹi = 1 | Xi) = Φ( [(1 − p4i)U13i + p4i U14i − U11i] / √(p4i² + (1 − p4i)² + 1) ) · Φ(U24i/√2),     (6)

where Φ(·) denotes the normal CDF and p4i = Φ(U24i/√2). Under the given parameterization of the utility equations, equation (6) is nonmonotonic in both X1i and X2i.
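Equation (6) is easy to evaluate directly; the sketch below (ours, using only the standard library, with Φ written via the error function) shows how the linear utilities in (5) nonetheless produce a nonmonotonic relationship between X1 and Pr(Yi = RR).

```python
# Sketch of the equilibrium choice probability in equation (6) under the
# utility parameterization in (5).  Phi is the standard normal CDF.
from math import erf, sqrt

def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def pr_RR(x1, x2, x3):
    u11 = -1.0 + 2.0 * x1 + x2
    u13 = 1.25
    u14 = 0.5 * x3 - x2
    u24 = 0.75 + x1 + x2 + 0.5 * x3
    p4 = Phi(u24 / sqrt(2.0))                    # Pr(Player 2 plays R)
    num = (1.0 - p4) * u13 + p4 * u14 - u11
    den = sqrt(p4 ** 2 + (1.0 - p4) ** 2 + 1.0)
    return Phi(num / den) * Phi(u24 / sqrt(2.0))
```

Evaluating along X1 with the other covariates at zero traces out an inverse-U shape: the probability rises and then falls as X1 increases, even though every utility equation is linear in X1.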
The dataset was simulated B = 1000 times each for ρ ∈ {0, 0.5, 0.8}. In each iteration,
we estimated the relationship between the covariates and Ỹ via four methods:
• Naïve logit: Logistic regression of the form
Pr(Ỹi = 1 | Xi ) = Λ(β0 + β1 X1i + β2 X2i + β3 X3i + β4 X4i + β5 X5i + β6 X6i ),
where Λ(·) denotes the logistic CDF.
• GAM: A generalized additive model of the form

Pr(Ỹi = 1 | Xi) = Λ( β0 + ∑_{j=1}^{6} fj(Xji) ),

where all smoothing parameters are chosen automatically via cross-validation.5
• Structural: Full-information maximum likelihood estimation of the given LQRE
model. We use two variants: the “correct” model, in which all irrelevant terms are
dropped, and the “saturated” model, in which each utility equation includes all covariates.
• Polywog: Adaptive lasso for logistic polynomial regression, as described in Section 3,
with expansions of degree d = 2 and d = 3. To avoid infinite-valued parameter
estimates in the first stage, we use the bias-corrected logistic regression recommended
by Zorn (2005).6
We compare these estimators in terms of their ability to recover the true predicted probability for out-of-sample data. For any x = (x1 , x2 , x3 , x4 ), write the true probability of
Ỹi = 1 as π(x), and let π̂ψ (x) denote the predicted probability as estimated from the results
of method ψ. Note that π̂ψ (x) is a random variable, since the results of method ψ depend
on the sample used for estimation. The mean squared error of the estimator ψ at x is
MSE(ψ; x) = E[(π(x) − π̂ψ(x))²].
The bias and variability of an estimator contribute to its MSE, so techniques with low MSE
are preferable. The average MSE of the estimator ψ across all possible values of x is the
5 We use the R package mgcv (Wood 2011) to estimate GAMs.

6 We use the R package brglm (Kosmidis 2010) for bias-corrected logistic regression and glmnet (Friedman, Hastie and Tibshirani 2010) for the adaptive lasso.
mean integrated squared error,

MISE(ψ) = E ∫_X (π(x) − π̂ψ(x))² dF(x),
where F is the CDF of x. We can approximate the MISE of each estimator using the
Monte Carlo simulations. In each iteration b, we construct a set of M = 1000 out-of-sample
observations, xb(1) , . . . , xb(M ) , drawn from the same probability distribution used to generate
the sample observations. We then use the results of each method ψ in this iteration to
obtain π̂ψb (xb(m) ), the predicted probability for the mth element of the holdout sample. The
approximate MISE is

(1/M) ∑_{m=1}^{M} (1/B) ∑_{b=1}^{B} [π(x^b_(m)) − π̂ψ^b(x^b_(m))]².
The MISE itself is a squared percentage, so we present the results in terms of the square
root of the approximate MISE.
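The approximation itself is a one-liner; a small sketch (ours) with both the true and estimated probabilities arranged as B × M arrays:

```python
# Sketch of the root of the approximate MISE: average the squared gap
# between true and estimated probabilities over all B iterations and
# M holdout points, then take the square root.
import numpy as np

def rmise(pi_true, pi_hat):
    """pi_true, pi_hat: B x M arrays of true and estimated probabilities."""
    return float(np.sqrt(np.mean((pi_hat - pi_true) ** 2)))
```

Because both arguments are probabilities, the result can be read as a typical prediction error in probability units, which is how Table 1 reports it.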
To get an idea of the comparative performance of each estimator, see the plots in Figure 2.
We examine two “profiles” in which X1 varies while the five other variables are held fixed.
In the first, the continuous variables are held at their means and the binary variables are
held at 0; in the second, X2 is held at 1.5 while the others are the same as in the first.
The solid lines represent the true predicted probability of Ỹ = 1 as a function of X1 , as
given by (6). The dotted lines show the average predicted probability obtained from each
estimator, and the bounds of the dark gray regions are the 2.5th and 97.5th percentiles. As
expected, the structural models recover the true probabilities with high precision. At the
other extreme, the naïve logistic regression, unable to capture the strong nonmonotonicity,
clearly yields the least accurate fitted values. The results from the other non-structural
estimators, the GAM and the two variants of the adaptive lasso, are more interesting. For
the first profile, when all of the variables other than X1 are held at their means, the average
predicted probability is similar, and reasonably accurate, across all three methods. However,
[Figure 2 plots the predicted probability of Ỹ = 1 as a function of X1 for six estimators: structural (correct), structural (saturated), naive logit, GAM, polywog (d=2), and polywog (d=3).]
(a) ρ = 0.5, continuous covariates held at mean, binary covariates held at 0
(b) ρ = 0.5, X2 held at 1.5, other continuous covariates held at mean, binary covariates held at 0
Figure 2. Average predicted probability and 95% interval for each estimator.
                        ρ = 0    ρ = 0.5    ρ = 0.8
structural (correct)    0.022    0.022      0.022
structural (saturated)  0.032    0.032      0.032
naive logit             0.217    0.234      0.238
GAM                     0.172    0.125      0.075
polywog (d=2)           0.055    0.055      0.056
polywog (d=3)           0.057    0.058      0.056

Table 1. Root mean integrated squared error for each estimator across levels of correlation among the covariates.
in the second profile, when X2 is held 1.5 standard deviations above its mean, the GAM
performs noticeably worse than the polywog estimators. The GAM at least gets the inverse
U-shape of the relationship correct, but its estimates of the location and magnitude of the
peak are far off. For X1 ≤ −0.5, the 95% regions of the GAM predicted probabilities do not
even contain the true values. The polywog estimators appear less efficient than the GAM,
particularly at the extremes of the parameter space, but they are also much less biased.
The root MISE calculations in Table 1 tell a similar story. The structural models always
have the lowest MISE, and naïve logit always has the highest. In between, the degree-2
polywog performs best, followed by the degree-3 polywog and then the GAM. The magnitude of the difference in RMISE between the degree-2 polywog and the saturated structural
model is about twice the difference between the saturated and correct structural models.
The typical level of prediction error for the degree-2 and degree-3 polywog models is about
5.5%, compared to 2.2% for the correct structural model and 3.2% for the saturated structural model. Neither of the polywog methods is adversely affected by the introduction of
correlation among the covariates, even at a level as high as 0.8.
Overall, the simulation results support our contention that adaptive lasso polynomial
regression is a good way to estimate comparative statics relationships from strategic data.
Although the method is not as efficient or accurate as a full structural estimator, it recovers
the main relationships in the data with tolerably low bias. More generally, the results
demonstrate the importance of accounting for interactions between variables when estimating
relationships from strategic data. The additional flexibility of the generalized additive model
is overshadowed by its severe bias for predictions when multiple variables are not held at
their means.
5 Empirical Applications

5.1 Schultz (2001)
A prominent example of testing comparative statics predictions with empirical data in the
international conflict literature is Schultz’s (2001) work on democracy and coercive bargaining. Schultz uses a game-theoretic model to derive hypotheses about the relationship between
states’ domestic political institutions and their behavior in international crises. The main
theoretical result is that democracies’ coercive threats are more credible than those made by
autocracies. If the leader of a democratic state makes demands that she is not truly willing
to back with military force, the domestic opposition can make political gains by refusing to
back the government’s stance. When the leader ultimately is forced to back down from a
bluff, voters reward the opposition party for its better foreign policy judgment. Anticipating
this potential political humiliation, democratic leaders make threats more selectively than
dictators, who face no similar domestic political constraints. This relative unwillingness to
bluff ends up working in favor of democracies, because their international opponents are more
likely to believe the threats they do choose to make—and thus to concede the issue rather
than provoke costly fighting. We will examine two of this theory’s observable implications
about factors associated with the outbreak of international disputes (Hypotheses 1 and 2,
p. 121). The first is that, all else equal, democracies are less likely to start disputes, due
to their leaders’ fear of the political consequences of being caught in a bluff. The second is
that, all else equal, democracies are weakly more likely to be targeted in a dispute, because
they are less likely to bluff in response to a threat and thus more likely to concede the issue
immediately.
Structural estimation of Schultz’s game-theoretic model would be challenging. First, each
player possesses private information that their moves signal to the others. Although there
are some statistical models of strategic information transmission (e.g., Lewis and Schultz
2003; Whang 2010), these are built on less complex games and have only rarely been applied
to observational data. Second, it would be costly to collect all of the data necessary for
structural estimation. For example, one of the possible outcomes in the model is for the
domestic opposition to denounce the government for failing to issue a coercive threat. To include this outcome in the statistical model, one would need an indicator, for every case in which no dispute occurred (the vast majority of observations in any international conflict dataset), of whether each country's domestic opposition urged a more aggressive stance. Clearly, this would require either a massive data-collection effort or a restrictive sample of cases. Such
measures should not be necessary to test hypotheses on the relationship between domestic
regime type and foreign policy outcomes, on which ample data already exist.
In light of these challenges to structural modeling, the empirical approach in Democracy
and Coercive Diplomacy is decidedly reduced-form. Schultz uses logistic regression to model
the probability of militarized international disputes (MIDs) as a function of regime type and
various controls. Unfortunately, the models do not contain multiplicative or higher-order
terms for most of the variables. If the true data-generating process is strategic, as claimed
by the theory, these models may be subject to misspecification bias (Signorino and Yilmaz
2003). To examine the sensitivity of the original results to this specification choice, we
replicate the analysis using the polywog method.7 We use a sample of N = 17,296 directed
dyad-years from 1946 to 1984, corresponding to Schultz's model 3 (p. 135).8

7. Because a complete replication archive for Democracy and Coercive Diplomacy is not available, we attempted to replicate the dataset as closely as possible by following the coding and sampling rules given in Chapter 5 and Appendix C. We obtained statistical results that were substantively the same as, though not identical to, those reported in the book.

8. Schultz does not apply any political relevance criterion, but his statistical technique uses only those directed dyads that experience at least one initiation in the period under consideration. For consistency, we use the same restriction.

[Figure 3. Predicted probability of dispute for an "average" profile by dyadic regime type in the models without fixed effects. x-axis: which dyad member is democratic (neither, initiator, target, both); y-axis: predicted probability of MID; lines: polywog vs. original.]

The dependent variable is MID initiation, and the following regressors are included (interested readers may refer to Schultz 2001 for coding details):
• Democracy of both states (dichotomous indicator based on Polity III components)
• Major power status of both states
• Balance of capabilities and state 1’s share
• Contiguity
• S-scores of states with each other and with system leader
• Years since last dispute (cubic spline)
We include each of these in a d = 2 polywog specification. The original models use a linear specification, except for an interaction of the democracy indicators, an interaction of the major power indicators, and a cubic spline for peace years.9

9. In our replication, we replace the spline terms with a third-degree polynomial (Carter and Signorino 2010). This has no substantive effect on the results.
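To make the basis concrete, the d = 2 expansion can be sketched in a few lines of Python with numpy (an illustration of the general idea only, not the polywog implementation): every monomial of total degree at most d in the regressors is generated, and the adaptive lasso then selects among the resulting columns.

```python
import itertools

import numpy as np


def poly_expand(X, d=2):
    """All monomials of total degree 1..d in the columns of X, plus an
    intercept column. A sketch of the d-degree basis expansion."""
    n, k = X.shape
    cols = [np.ones(n)]  # intercept
    for deg in range(1, d + 1):
        # combinations_with_replacement yields e.g. x0*x0, x0*x1, x1*x1 for deg=2
        for idx in itertools.combinations_with_replacement(range(k), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)


X = np.random.default_rng(0).standard_normal((5, 3))
B = poly_expand(X, d=2)
print(B.shape)  # (5, 10): intercept + 3 linear + 6 quadratic terms
```

The column count grows quickly with d and the number of regressors, which is precisely why a penalized estimator is needed on top of the expansion.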
[Figure 4. Fitted values for an "average" profile in the models without fixed effects. Panels: capbalance, capshare1, s_wt_glo, s_ld_1, s_ld_2, peaceyrs; y-axis: predicted probability of MID; lines: polywog vs. original.]
To control for unobserved heterogeneity between directed dyads, Schultz uses Chamberlain's (1980) fixed-effects logistic regression estimator, sometimes referred to as conditional logit. Unfortunately, to our knowledge, there is currently no lasso implementation of conditional logit, so a direct comparison of methods is difficult. Instead, we compare two sets of models: one with no fixed effects in either model, and one with fixed-effect dummy variables in both.
In the models without fixed effects, we find stronger support for Schultz’s main hypothesis—
that democracies are less likely to initiate disputes—in the polywog model than in the ordinary logistic regression. Figure 3 shows the predicted probability of a dispute for each potential dyadic combination of regime type, with 95% bootstrap confidence intervals (B = 1000).
All of the other variables are held at their means (continuous) or modes (binary).

[Figure 5. Fitted values for an "average" profile in the models with fixed effects. Panels: capbalance, capshare1, s_wt_glo, s_ld_1, s_ld_2, peaceyrs; y-axis: predicted probability of MID; lines: polywog vs. original.]

Although
the difference in probability between an all-autocratic dyad and one with a democratic initiator is not statistically significant at conventional levels in either case, the substantive effect
is greater under the polywog model. Going from an autocratic initiator to a democratic one
reduces the probability of conflict by about 45% according to the polywog model, compared
to 22% in the ordinary logit. On the other hand, the hypothesis about democratic targets
appears to have no support under either model. We also find substantial differences in the
estimated relationships between the control variables and the probability of dispute occurrence. Figure 4 displays fitted values for profiles in which each of the continuous regressors
is varied across its range in the data, with other variables held at their means or modes. The
gray bars again represent a 95% bootstrap confidence interval. It is clear that a linear specification fails to pick up on some significant non-monotonicities, such as the inverse U-shaped relationship between the initiator's share of capabilities and the probability of a dispute.

[Figure 6. ROC curves for the models with fixed effects. x-axis: proportion of 1s correct; y-axis: proportion of 0s correct; panels: in sample, out of sample; curves: polywog vs. original model.]
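The bootstrap confidence intervals used throughout follow the standard nonparametric recipe: resample observations with replacement, refit, and take percentiles of the resulting fitted values. A minimal sketch (a percentile interval for an OLS fitted value on toy data; the model, profile, and B here are illustrative, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: y depends nonlinearly on x
n = 200
x = rng.uniform(-1, 1, n)
y = 1 - x**2 + rng.normal(0, 0.2, n)
X = np.column_stack([np.ones(n), x, x**2])
x0 = np.array([1.0, 0.5, 0.25])  # covariate profile at x = 0.5


def fitted_at(X, y, x0):
    """OLS fit, then the fitted value at profile x0."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return x0 @ beta


B = 1000
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)  # resample rows with replacement
    boot[b] = fitted_at(X[idx], y[idx], x0)

lo, hi = np.percentile(boot, [2.5, 97.5])  # 95% percentile interval
```

Replacing the OLS step with the full cross-validated adaptive lasso fit inside the loop gives the intervals shown in the figures, at correspondingly higher computational cost.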
In the models with fixed effects, the qualitative differences between polywog and ordinary
logit are relatively slight. Moreover, with about 500 directed-dyad dummy variables, the
relationships between the control variables and the chance of a dispute cannot be estimated
with high precision; see the plots in Figure 5. However, the presence of fixed effects raises
a different issue: the potential of overfitting. Controlling for unobserved heterogeneity with
fixed effects improves in-sample fit, but at the risk of worse out-of-sample predictions.10
When fixed effects are included as dummy variables, a penalized estimator such as the lasso
helps guard against overfitting and thus should be better for classifying observations that
were not used to fit the model. Indeed, variable selection in high-dimensional classification
problems is one of the main uses of the lasso in the machine learning literature (e.g., Koh,
Kim and Boyd 2007). We examined the predictive performance of each of the fixed-effects
models on the 1946–1984 data used to fit the model and on a holdout sample of N = 1,332
observations from 1985–1988. The ROC curves for these are plotted in Figure 6. The original model is strictly better for in-sample fit, but tends to be worse at predictions outside the sample. These results suggest that a penalized regression approach like polywog may be useful in settings where researchers want to control for unobserved heterogeneity without overfitting.

10. With the conditional logit estimator, it is not even possible to generate out-of-sample predictions, as the fixed effects themselves are not estimated.
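The quantities plotted in Figure 6 can be computed directly: for each candidate threshold on the predicted probabilities, record the proportion of actual 1s and the proportion of actual 0s classified correctly. A sketch on synthetic scores (numpy; the data here are simulated, not the replication sample):

```python
import numpy as np


def roc_points(y, p):
    """For each candidate threshold, return (prop. of 1s correct,
    prop. of 0s correct) -- the two axes of the ROC plot."""
    thresholds = np.unique(p)
    tpr, tnr = [], []
    for t in thresholds:
        pred = p >= t
        tpr.append(np.mean(pred[y == 1]))   # sensitivity
        tnr.append(np.mean(~pred[y == 0]))  # specificity
    return np.array(tpr), np.array(tnr)


rng = np.random.default_rng(1)
y = rng.integers(0, 2, 500)
p = 0.5 * y + rng.uniform(0, 0.5, 500)  # informative predicted probabilities
tpr, tnr = roc_points(y, p)
```

Computing these curves on held-out years, as done above for 1985–1988, is what distinguishes out-of-sample from in-sample classification performance.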
6 Conclusion
We have introduced a new method for testing whether comparative statics predictions are
borne out in strategic choice data. Our simulations show that the method is effective and
efficient compared to alternative non-structural techniques. There are many directions to
pursue in future work. We hope to explore alternative basis functions for the d-degree
approximation, including orthogonal polynomials and B-splines. We plan to further demonstrate the real-world usefulness of the technique with additional empirical applications. In
addition, we intend to develop an R package polywog to facilitate easy implementation of
the methods described.
A Appendix

A.1 K-fold cross-validation
We borrow much of Hastie, Tibshirani and Friedman’s (2009, 241–242) notation to describe
the K-fold cross-validation procedure. The first step is to randomly partition the data into
K subsamples of approximately equal size (exactly equal if n is evenly divisible by K), via
the mapping κ : {1, . . . , n} → {1, . . . , K}. We also form a grid of T candidate values for
the penalization parameter, (λ_1, . . . , λ_t, . . . , λ_T), where λ_1 = 0 and λ_T is large enough that
all coefficients other than the intercept are shrunk to 0. Typical choices are K = 10 and
T = 100. It is possible to set K = n, known as leave-one-out cross-validation, but this is too
computationally expensive to be feasible for the adaptive lasso with moderate or large n.
Let β̂^{−k}(λ_t) denote the adaptive lasso estimate obtained when λ = λ_t and the observations
in the kth subsample are excluded. We can estimate the out-of-sample prediction error of
the adaptive lasso under each candidate value of λ as
CV(λ_t) = (1/n) ∑_{i=1}^{n} (Y_i − x_i′ β̂^{−κ(i)}(λ_t))².
That is, the prediction error for each observation is the squared difference between its actual
value and the fitted value from the adaptive lasso when the subsample containing the given
observation is excluded. This entails running KT adaptive lasso fits, one for each subset
to exclude and each grid point λt . We choose the λt that minimizes CV(·) to serve as the
penalization parameter in the adaptive lasso fit on the full sample. Hastie, Tibshirani and
Friedman (2009, 244) advocate an alternative criterion for even greater shrinkage, which is
to choose the largest λt for which CV(·) is within one standard error of the minimum.
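The procedure above can be sketched in Python. As a simplification, the inner fit uses a plain (non-adaptive) lasso via cyclic coordinate descent rather than the adaptive lasso; the fold assignment κ, the λ grid, and the CV(λ_t) criterion mirror the description in the text. The toy data and all names are illustrative.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=200):
    """Plain lasso via cyclic coordinate descent on the objective
    (1/2n)||y - Xb||^2 + lam * ||b||_1 (no intercept)."""
    n, k = X.shape
    b = np.zeros(k)
    col_ss = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()  # residual, starting from b = 0
    for _ in range(n_iter):
        for j in range(k):
            # correlation of column j with its partial residual
            rho = X[:, j] @ (r + X[:, j] * b[j]) / n
            bj = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]  # soft-threshold
            r += X[:, j] * (b[j] - bj)
            b[j] = bj
    return b


def cv_lambda(X, y, lambdas, K=10, seed=0):
    """CV(lambda_t) as in the appendix: kappa assigns each observation to
    one of K folds; each fit excludes one fold, and squared errors are
    averaged over all n observations."""
    n = X.shape[0]
    kappa = np.random.default_rng(seed).permutation(n) % K
    cv = np.zeros(len(lambdas))
    for t, lam in enumerate(lambdas):
        for k in range(K):
            b = lasso_cd(X[kappa != k], y[kappa != k], lam)
            cv[t] += np.sum((y[kappa == k] - X[kappa == k] @ b) ** 2)
    return cv / n


rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, 100)  # sparse truth: only x0 matters
lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
cv = cv_lambda(X, y, lambdas, K=5)
best = lambdas[int(np.argmin(cv))]  # minimum-CV rule for choosing lambda
```

In practice one would use an optimized implementation such as glmnet, with adaptive weights folded into the penalty; the sketch only reproduces the structure of the KT cross-validation fits.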
References
Carter, David B. and Curtis S. Signorino. 2010. “Back to the Future: Modeling Time
Dependence in Binary Data.” Political Analysis 18(3):271–292.
URL: http://pan.oxfordjournals.org/cgi/content/abstract/18/3/271
Chamberlain, Gary. 1980. “Analysis of Covariance with Qualitative Data.” The Review of
Economic Studies 47(1):225–238.
URL: http://www.jstor.org/stable/2297110?origin=crossref
Chatterjee, A. and S. N. Lahiri. 2011. "Bootstrapping Lasso Estimators." Journal of the American Statistical Association.
Derksen, Shelley and H.J. Keselman. 1992. “Backward, Forward, and Stepwise Automated
Subset Selection Algorithms: Frequency of Obtaining Authentic and Noise Variables.”
British Journal of Mathematical and Statistical Psychology 45:265–282.
Fan, Jianqing and Runze Li. 2001. “Variable Selection via Nonconcave Penalized Likelihood
and its Oracle Properties.” Journal of the American Statistical Association 96(456):1348–
1360.
Friedman, Jerome, Trevor Hastie and Robert Tibshirani. 2010. “Regularization Paths
for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software
33(1):1–22.
Hastie, Trevor and Robert Tibshirani. 1990. Generalized Additive Models. Chapman &
Hall/CRC.
Hastie, Trevor, Robert Tibshirani and Jerome Friedman. 2009. The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. 2 ed. Springer.
Hurvich, Clifford M. and Chih-Ling Tsai. 1990. “The Impact of Model Selection on Inference
in Linear Regression.” The American Statistician 44(3):214–217.
King, Gary. 1986. “How Not to Lie with Statistics: Avoiding Common Mistakes in Quantitative Political Science.” American Journal of Political Science 30(3):666–687.
Koh, K., S. Kim and S. Boyd. 2007. "An Interior-Point Method for Large-Scale ℓ1-Regularized Logistic Regression." Journal of Machine Learning Research 8(8):1519–1555.
Kosmidis, Ioannis. 2010. brglm: Bias reduction in binary-response GLMs. R package version
0.5-5.
URL: http://go.warwick.ac.uk/kosmidis/software
Lewis, Jeffrey B. and Kenneth A. Schultz. 2003. “Revealing Preferences: Empirical Estimation of a Crisis Bargaining Game with Incomplete Information.” Political Analysis
11(4):345–367.
Schultz, Kenneth A. 2001. Democracy and Coercive Diplomacy. Cambridge University Press.
Signorino, Curtis S. 1999. “Strategic Interaction and the Statistical Analysis of International
Conflict.” The American Political Science Review 93(2):279–297.
Signorino, Curtis S. 2003. “Structure and Uncertainty in Discrete Choice Models.” Political
Analysis 11(4):316–344.
Signorino, Curtis S. and Ahmer Tarar. 2006. “A Unified Theory and Test of Extended
Immediate Deterrence.” American Journal of Political Science 50(3):586–605.
Signorino, Curtis S. and Kuzey Yilmaz. 2003. “Strategic Misspecification in Regression
Models.” American Journal of Political Science 47(3):551–566.
Smith, Alastair. 1999. “Testing Theories of Strategic Choice: The Example of Crisis Escalation.” American Journal of Political Science 43(4):1254–1283.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of
the Royal Statistical Society, Series B 58(1):267–288.
Whang, Taehee. 2010. "Structural Estimation of Economic Sanctions: From Initiation to Outcomes." Journal of Peace Research 47(5):561–573.
Wood, S. N. 2011. "Fast Stable Restricted Maximum Likelihood and Marginal Likelihood Estimation of Semiparametric Generalized Linear Models." Journal of the Royal Statistical Society, Series B 73(1):3–36.
Zorn, Christopher. 2005. “A Solution to Separation in Binary Response Models.” Political
Analysis 13:157–170.
Zou, Hui. 2006. “The Adaptive Lasso and Its Oracle Properties.” Journal of the American
Statistical Association 101(476):1418–1429.