Data Mining for Theorists

Brenton Kenkel* and Curtis S. Signorino†

July 22, 2011

Work in Progress: Comments Welcome.

Abstract

Among those interested in statistically testing formal models, two approaches dominate. The structural estimation approach derives a structural probability model based on the formal model and then estimates parameters associated with that model. The reduced-form approach generally applies off-the-shelf techniques—such as OLS, logit, or probit—to test whether the independent variables are related to a decision variable according to the comparative statics predictions. We provide a new statistical method for the comparative statics approach. The decision variable of interest is modeled as a polynomial function of the available covariates, which allows for the nonmonotonic and interactive relationships commonly found in strategic choice data. We use the adaptive lasso to reduce the number of parameters and prevent overfitting, and we obtain measures of uncertainty via the nonparametric bootstrap. The method is "data mining" because the aim is to discover complex relationships in data without imposing a particular structure, but "for theorists" in that it was developed specifically to deal with the peculiar features of data on strategic choice. Using a Monte Carlo simulation, we show that the method handily outperforms other non-structural techniques in estimating a nonmonotonic relationship from strategic choice data.

* Ph.D. candidate, Department of Political Science, University of Rochester (email: [email protected])
† Associate Professor, Department of Political Science, University of Rochester (email: [email protected])

Contents

1 Introduction
2 Approaches to Testing Formal Models
3 Strategic Estimation without Game Forms
4 Simulation
5 Empirical Applications
  5.1 Schultz (2001)
6 Conclusion
A Appendix
  A.1 K-fold cross-validation

1 Introduction

The integration of formal modeling and empirical research has become a major area of inquiry in political science. The statistical literature in this area has focused on the development of estimators derived directly from formal models (e.g., Signorino 1999; Smith 1999; Lewis and Schultz 2003). These structural estimators make the tightest possible links between theory and data, but they can be difficult to derive and implement. When structural estimation of a model is infeasible, applied researchers have two bad options: use canned statistical routines known to be inappropriate for strategic choice data, or refrain from any systematic test of the model's implications. This paper provides a better solution—an easy-to-implement, non-structural method for estimation of comparative statics relationships in strategic data. The method is based on statistical models familiar to political scientists, and it does not require any application-specific derivations or programming.

Strategic interaction can produce data with nonlinear and nonmonotonic relationships, even if the underlying model is simple. Naïve application of familiar statistical techniques like OLS and logistic regression cannot capture such relationships—this has been the crux of the argument for structural strategic estimation. But if one's goal is simply to verify that a particular comparative statics prediction holds up in the data, rather than the more ambitious task of estimating the actual parameters of the game, structural models are not necessary. The only requirement is a statistical technique that is flexible enough to allow for interactive and nonmonotonic relationships. We achieve this by using standard regression techniques, but with a polynomial basis expansion of the independent variables.
To guard against overfitting, we bootstrap the adaptive lasso (Zou 2006), a machine learning technique for variable selection in high-dimensional problems. The method is "data mining" because the aim is to discover complex relationships in data without imposing a particular structure, but "for theorists" in that it was developed specifically to deal with the peculiar features of data on strategic choice.

The paper proceeds as follows. Section 2 discusses the differences and respective advantages of structural and reduced-form approaches to testing predictions from formal models. Section 3 introduces our method, the bootstrapped adaptive lasso for polynomial regression. Section 4 presents the results of a Monte Carlo simulation showing that the method captures a nonmonotonic relationship in a strategic choice setting reasonably well, and that it outperforms some alternative techniques. Section 5 applies the method to Schultz's (2001) analysis of democracy and international disputes. Section 6 concludes, and Appendix A provides additional details not given in the text.

2 Approaches to Testing Formal Models

Over the last decade, political scientists have paid increasing attention to the question of how to properly test hypotheses derived from formal models. Most of the methodological literature on this topic has taken a structural approach, as in developing statistical models whose parameters directly correspond to elements of game-theoretic models (e.g., Signorino 1999; Smith 1999; Lewis and Schultz 2003). From a statistical perspective, structural modeling is the most direct way to connect formal theories to observed data. However, the structural approach has some well-known drawbacks, such as the difficulty of deriving and implementing full-information estimators in the absence of "canned" techniques. An alternative way to proceed, which we call the reduced-form approach, is to derive comparative statics from theoretical models and use standard estimators to test whether these predictions are borne out in data.
Reduced-form approaches, by definition, cannot yield estimates for particular elements of a formal model (e.g., a player's utility for some outcome), but they may still be useful for assessing whether observed data conform to the main predictions of a formal model. Reduced-form estimation places fewer demands on researchers and data than structural approaches do. Structural modeling requires that one derive a likelihood function that corresponds to the solution of the model being examined. This may be a difficult analytical task in itself, especially if the model involves continuous choice sets or complex information transmission. Moreover, estimation of such a model via maximum likelihood or Bayesian MCMC may not be straightforward, since there is no guarantee that the likelihood function will be well-behaved or easy to evaluate. For example, the signaling model of Lewis and Schultz (2003) requires that a nonlinear system be solved numerically in each evaluation of the likelihood. A reduced-form approach allows researchers to bypass these hurdles: the only analytical work needed is to derive the comparative statics, and there is typically no application-specific programming necessary.

Identification issues, which can be troublesome for structural estimation, are also obviated in reduced-form approaches. It usually is not possible to estimate all of the parameters of a given model; some exclusion restrictions are necessary to ensure that the estimates are well-defined. It can be challenging to establish the identification conditions even for simple models, as illustrated by Lewis and Schultz's (2003) derivation of the requirements for a class of discrete choice models. Even worse problems arise if the model has multiple equilibria in some regions of the parameter space, in which case a selection rule is necessary to ensure that the probability model is valid.
If our goal is simply to assess a comparative statics prediction by estimating the conditional relationship between a covariate and a decision variable, we need not deal with exclusion restrictions and selection rules. The method developed in this paper is built on generalized linear models, so the only requirement is that the covariates be linearly independent.

Even if all of the aforementioned obstacles are surmountable, there are some more subtle problems that arise in structural modeling. In general, the dataset must specify the "player" in the model that each real-world actor corresponds to, and it must have recorded the entire sequence of decisions—not just the outcome of interest. For example, to estimate the workhorse model of international conflict—a sequential two-player model of crisis bargaining—the researcher must specify which state got to make the first offer. Moreover, the size of the offer itself must be known, even if the researcher is only interested in what determines whether war occurs. Few datasets in political science are collected with this kind of format in mind, so the imposition of the necessary structure will almost surely entail some ad hoc guesswork. Sometimes it is possible to modify the estimator to account for missingness of this type of information, but at the cost of yet more analytical and computational issues.

We do not intend to disparage structural modeling—indeed, we are proponents of it under the right circumstances. When the structure of the interaction is well known and it is feasible to derive a statistical model based on the relevant theory, structural modeling is efficient and yields more informative parameter estimates than a reduced-form estimator could. But these requirements are stringent—it is easy to think of research situations in which they are not met. Our goal is to make it possible to test the implications of formal theories even in these more difficult settings.
3 Strategic Estimation without Game Forms

In this section, we develop a new method to estimate comparative-statics relationships from strategically generated data, for which no specification of the game form is necessary. The estimator is built on the familiar generalized linear model, so the procedure is relatively quick and the results easily interpretable. The method is built on three components. First, we use polynomial regression to approximate interactive nonlinear relationships. Second, we use the adaptive lasso to eliminate irrelevant polynomial terms and guard against overfitting. Third, we use the nonparametric bootstrap to obtain standard errors and confidence intervals. Our working shorthand name for this process is polywog.

We consider a general model of a strategic data-generating process. There are n independent observations of a game whose outcome is Y_i ∈ ℝ, where Y = (Y_1, ..., Y_n). Players' utilities are a function of k variables, denoted Z_1 through Z_k. In matrix notation, the full set of covariates is the n × k matrix Z, with ith row z_i and jth column Z_j. There is some probabilistic component of the game, such as "trembling hand" agent error or private information (see Signorino 2003), so that the model can be written as

    Y_i = f(z_i) + ε_i,    (1)

where f is an unknown real-valued function and ε_i is a random variable with full support on the real line.¹ The structural approach to strategic estimation is to posit f(z_i) = g(z_i; θ), where g is a known function derived from the equilibrium conditions of a particular game form, and estimate the finite-dimensional parameter vector θ. We instead take a reduced-form approach, in which the goal is to estimate comparative statics without specifying the exact structure of the strategic interaction.

In non-strategic contexts, the typical approach to estimation of models like equation (1) is to assume that f(·) is an additive function of the covariates,

    Y_i = Ȳ + Σ_{j=1}^{k} f_j(z_{ij}) + ε_i.    (2)

With this setup, known as a generalized additive model, or GAM, it is straightforward to estimate the functions f_1, ..., f_k nonparametrically via splines or other standard smoothing techniques (Hastie and Tibshirani 1990). Unfortunately, the assumption of additivity is not appropriate for data on strategic interactions. Under any standard equilibrium concept, each player's choice depends on her expected utility, which is a multiplicative function of her expectations about others' actions and her own payoff for each possible outcome. Therefore, even in simple sequential games where all players' payoffs are linear in the covariates, the outcome may depend significantly on nonlinear functions of the variables and interactions between them (Signorino and Yilmaz 2003). While GAMs allow for modeling nonlinear relationships, most implementations do not model interactions, nor provide guidance on variable selection.

In lieu of the additivity assumption, we need another way to make estimation of f(·) tractable. Our approach is to use a d-degree polynomial approximation,² which can be obtained via a Taylor series expansion:

    Y_i = f(z_i) + ε_i ≈ Σ β_{j_1 j_2 ··· j_k} z_{i1}^{j_1} z_{i2}^{j_2} ··· z_{ik}^{j_k} + ε_i,    (3)

where the sum runs over all (j_1, ..., j_k) ∈ ℕ^k with Σ_m j_m ≤ d. We can then estimate β by least squares, as long as d is small enough that the number of terms does not exceed n. To save on notation, we denote the total number of terms (other than the constant) in the polynomial expansion as p, and we write the coefficients as β = (β_0, β_1, ..., β_p). Our work does not stop at the polynomial approximation itself.

¹ For ease of presentation, we focus on the case of unbounded, continuous Y_i. However, with minor modifications, it is straightforward to apply all of the techniques developed in this section to outcome variables that are binary, categorical, counts, or durations.
Unless the number of covariates is unusually small, even a low-degree polynomial expansion will contain enough terms to pose efficiency problems for standard regression techniques. The number of terms in a d-degree expansion of k covariates is Σ_{m=0}^{d} C(m + k − 1, k − 1), which grows quickly with d and k. For example, with 10 covariates, a 2nd-degree expansion requires 66 terms, while a 3rd-degree expansion requires 286. In this situation, some form of model selection is needed to avoid overfitting and inefficiency in the regression estimates.

The model selection procedure best known to political scientists, if only as an object of derision (King 1986, 669), is stepwise regression. In some contexts, stepwise procedures are useful for eliminating irrelevant variables from a regression model (Hastie, Tibshirani and Friedman 2009, 58–60). However, stepwise regression often fails to eliminate many "noise" variables, especially when there is correlation among the covariates—a problem that does not necessarily go away as the sample size increases (Derksen and Keselman 1992). Estimates of uncertainty from best-subset procedures like stepwise regression are too small, and confidence intervals contain the true parameters less often than the nominal rate would indicate (Hurvich and Tsai 1990). The stepwise procedure is computationally burdensome when the number of terms is large, as it requires refitting the model hundreds or even thousands of times. Finally, stepwise regression estimates are not continuous, meaning small changes to the data can cause the set of variables selected and their estimated coefficients to vary wildly. To avoid these problems, we perform model selection via the adaptive lasso (Zou 2006), which has recently become popular in the statistical machine learning literature.

² A more general approach, which we will pursue in future work, would be to consider any d-dimensional basis function approximation, such as B-splines.
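The combinatorial growth described above is easy to verify directly. The following Python sketch (the function name is ours) evaluates the term-count formula for a few values of d and k:

```python
# Number of terms in a full d-degree polynomial expansion of k covariates:
# sum over m = 0, ..., d of C(m + k - 1, k - 1), where m = 0 gives the constant.
from math import comb

def n_poly_terms(k: int, d: int) -> int:
    """Count the terms (including the constant) in a d-degree expansion of k covariates."""
    return sum(comb(m + k - 1, k - 1) for m in range(d + 1))

print(n_poly_terms(10, 2))  # 66, matching the text's 2nd-degree example
print(n_poly_terms(10, 3))  # 286, matching the 3rd-degree example
```

With n in the low thousands, as in the simulations below, a 3rd-degree expansion of 10 covariates already consumes a substantial fraction of the available degrees of freedom, which is why some form of model selection is needed.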
The procedure is based on the "least absolute shrinkage and selection operator," or lasso, proposed by Tibshirani (1996). The lasso is a form of penalized regression in which the typical least-squares objective function is minimized subject to a constraint on the magnitude of the coefficients, Σ_{j=1}^{p} |β_j| ≤ t. When the tuning parameter t ≥ 0 is small enough, the lasso performs model selection, yielding estimates of exactly 0 for some of the coefficients. The adaptive lasso is an extension of the lasso with better statistical properties, including the oracle property (defined by Fan and Li 2001):

1. As n increases, the probability of selecting the correct model (β̂_j ≠ 0 if and only if β_j ≠ 0) approaches 1.
2. As n increases, the distribution of β̂ is the same as that of the coefficients from regression on the true model (the X_j such that β_j ≠ 0).

The oracle property means that, as n → ∞, the adaptive lasso identifies the true model, and its estimates are no less efficient than if the true model—the set of covariates with a non-zero effect in the population—were known in advance. In addition, the adaptive lasso is continuous, so there is no danger of instability in the model selection or coefficient estimates.

The adaptive lasso proceeds in two steps. First, one obtains a consistent initial estimate of the model parameters β̂^o, typically via OLS. Second, the adaptive lasso estimate is the solution of

    argmin_{β ∈ ℝ^{p+1}} Σ_{i=1}^{n} (Y_i − x_i′β)² + λ Σ_{j=1}^{p} |β_j| / |β̂_j^o|,    (4)

where λ ≥ 0 is a tuning parameter (discussed below). There are a few important things to note about equation (4). The intercept, β_0, is not penalized. The penalty term preserves the convexity of the objective function, so a solution exists and is unique. The presence of β̂^o is critical for consistent model selection. For any j such that β_j = 0, consistency of β̂^o implies that β̂_j^o → 0 as n → ∞, meaning the adaptive lasso penalty on |β_j| becomes arbitrarily large in the limit.
Therefore, if λ > 0, the adaptive lasso will, with high probability, estimate β̂_j = 0 when n is sufficiently large.

The penalization factor λ controls the sparsity of the estimated model, also known as "shrinkage," and must be selected carefully. If λ = 0, the adaptive lasso is equivalent to ordinary least squares. Higher values of λ entail more aggressive model selection, with more coefficients estimated as exactly 0; in the limit, the estimates are β̂_0 = Ȳ and β̂_j = 0 for all j = 1, ..., p. The criteria for selecting the penalization factor are similar to those for choosing the bandwidth in kernel density estimation and other nonparametric problems, because there is a bias-variance tradeoff. If there is too much shrinkage, some relevant covariates will be left out of the estimated model, producing bias. With too little shrinkage, the model will be overfitted, meaning the variance of the estimates is unacceptably high. The ideal value of λ would minimize the mean squared error, MSE(λ) = Bias(β̂; λ)² + Var(β̂; λ). We can find an estimate of this ideal λ using K-fold cross-validation (Hastie, Tibshirani and Friedman 2009, 241–244), which entails splitting the data into K subsamples and using them to approximate the out-of-sample prediction error of the adaptive lasso for many candidate values of λ. We choose the value that produces the lowest prediction error across subsamples. See Appendix A.1 for a full description of the K-fold cross-validation procedure.

The last step is to provide measures of uncertainty in the estimates, such as standard errors and confidence intervals. Zou (2006) discusses a method to calculate standard errors for adaptive lasso estimates, but it applies only to the nonzero coefficients. We instead use the bootstrap, so that we can estimate uncertainty in all elements of β̂.³ Chatterjee and Lahiri (2011) prove that the residual bootstrap consistently estimates standard errors and confidence intervals for adaptive lasso coefficients.
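The two-step estimator in equation (4), with λ chosen by K-fold cross-validation and uncertainty from resampling, can be sketched as follows. This is an illustrative Python translation, not the implementation used in this paper (which relies on R's glmnet); the toy data, function names, and the scikit-learn machinery are ours. The rescaling trick absorbs the adaptive weights into the design matrix so that an ordinary lasso solver can be used:

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV, LinearRegression

def adaptive_lasso(X, y, alpha=None, cv=5, seed=0):
    """Two-step adaptive lasso sketch: an OLS initial fit supplies the weights,
    then a lasso on rescaled columns penalizes |beta_j| by 1 / |beta_hat_ols_j|.
    If alpha is None, the penalty is chosen by K-fold cross-validation."""
    w = np.abs(LinearRegression().fit(X, y).coef_)  # adaptive weights |beta^o_j|
    Xw = X * w                                      # rescaling absorbs the weights
    if alpha is None:
        fit = LassoCV(cv=cv, random_state=seed).fit(Xw, y)
        alpha = fit.alpha_
    else:
        fit = Lasso(alpha=alpha).fit(Xw, y)
    return fit.coef_ * w, fit.intercept_, alpha     # undo the rescaling

# Toy data: only the first two of six covariates have true effects.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=500)
beta, b0, alpha = adaptive_lasso(X, y)  # irrelevant coefficients shrunk toward 0

# Nonparametric bootstrap for standard errors, reusing the original penalty
# in each iteration for computational tractability (as the text suggests).
boot = np.empty((100, X.shape[1]))
for b in range(100):
    i = rng.integers(0, len(y), size=len(y))
    boot[b] = adaptive_lasso(X[i], y[i], alpha=alpha)[0]
se = boot.std(axis=0)
```

Note that scikit-learn's lasso scales the squared-error term by 1/(2n), so its alpha is proportional to, not identical to, the λ of equation (4); the selected model is unaffected by this reparameterization.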
We use the nonparametric bootstrap rather than the residual bootstrap, since it is more robust to basic linear model violations such as heteroskedasticity and can be applied to generalized linear models as well. This method consists in constructing B new datasets by resampling (Y_i, x_i) with replacement from the original data and running the adaptive lasso on each.⁴ Denote the adaptive lasso estimate in the bth iteration as β̂^(b). The bootstrap standard error estimate for β̂_j is its standard deviation across iterations,

    s.e.(β̂_j) = √[ (1/B) Σ_{b=1}^{B} ( β̂_j^(b) − β̂_j^(·) )² ],

where β̂_j^(·) denotes the mean of β̂_j^(b) across iterations. The bootstrap 95% confidence interval for β̂_j is the interval bounded by the 2.5th and 97.5th percentiles of β̂_j^(b).

³ Use of the bootstrap does not guarantee nondegenerate standard error estimates, since some terms may be excluded from the estimated model in every bootstrap iteration.
⁴ For computational tractability, we use the same value of λ in the bootstrap iterations as in the original fit.

4 Simulation

We verify the performance of the techniques described above by applying them to simulated strategic choice data. We show that the method is able to detect a nonmonotonic relationship between a covariate and an outcome, with a tolerable rate of prediction error compared to full structural estimation.

[Figure 1. Game tree for the data-generating process used in the Monte Carlo simulation: Player 1 chooses L (payoff U_11 + ε_11) or R; after R, Player 2 chooses between L (payoffs U_13 + ε_13 and U_23 + ε_23) and R (payoffs U_14 + ε_14 and U_24 + ε_24).]

The data-generating process is illustrated in Figure 1, where each ε_jk term is private information drawn from the standard normal distribution. We provide only a brief exposition of the model; interested readers may consult Signorino (2003) or Signorino and Tarar (2006) for complete derivations. There are N = 2,500 observations of the outcome Y_i ∈ {L, RL, RR} and of the covariates x_i = (X_1i, X_2i, ..., X_6i)′, where

    x_i* ∼ N(μ, Σ),   μ = (0, 0, 0, 0, −1, 1)′,
    Σ_jk = 1 if j = k,   Σ_jk = ρ if j ≠ k,
    X_ji = X_ji*          for j = 1, 2, 5, 6,
    X_ji = I(X_ji* > 0)   for j = 3, 4.

The variables X_3 and X_4 are binary, while all others are continuous and unbounded. Note that X_4 through X_6 are noise variables with no true effect on the outcome, though they are correlated with the relevant variables if ρ ≠ 0. The true data-generating process is given by the equations

    U_11i = −1 + 2X_1i + X_2i
    U_13i = 1.25    (5)
    U_14i = 0.5X_3i − X_2i
    U_23i = 0
    U_24i = 0.75 + X_1i + X_2i + 0.5X_3i

Our aim is to characterize the comparative-static relationship between each covariate and the probability of observing the outcome Y_i = RR. Let Ỹ_i be a binary indicator for the event Y_i = RR. Using (5) and the perfect Bayesian equilibrium solution concept, the true relationship is given by the nonlinear equation

    Pr(Ỹ_i = 1 | X_i) = Φ( [(1 − p_4i)U_13i + p_4i U_14i − U_11i] / √[p_4i² + (1 − p_4i)² + 1] ) × Φ(U_24i / √2),    (6)

where Φ(·) denotes the normal CDF and p_4i = Φ(U_24i / √2). Under the given parameterization of the utility equations, equation (6) is nonmonotonic in both X_1i and X_2i.

The dataset was simulated B = 1000 times each for ρ ∈ {0, 0.5, 0.8}. In each iteration, we estimated the relationship between the covariates and Ỹ via four methods:

• Naïve logit: Logistic regression of the form Pr(Ỹ_i = 1 | X_i) = Λ(β_0 + β_1 X_1i + β_2 X_2i + β_3 X_3i + β_4 X_4i + β_5 X_5i + β_6 X_6i), where Λ(·) denotes the logistic CDF.

• GAM: A generalized additive model of the form Pr(Ỹ_i = 1 | X_i) = Λ( β_0 + Σ_{j=1}^{6} f_j(X_ji) ), where all smoothing parameters are chosen automatically via cross-validation.⁵

• Structural: Full-information maximum likelihood estimation of the given LQRE model. We use two variants: the "correct" model, in which all irrelevant terms are dropped, and the "saturated" model, in which each utility equation includes all covariates.
• Polywog: Adaptive lasso for logistic polynomial regression, as described in Section 3, with expansions of degree d = 2 and d = 3. To avoid infinite-valued parameter estimates in the first stage, we use the bias-corrected logistic regression recommended by Zorn (2005).⁶

We compare these estimators in terms of their ability to recover the true predicted probability for out-of-sample data. For any covariate profile x = (x_1, ..., x_6), write the true probability of Ỹ_i = 1 as π(x), and let π̂_ψ(x) denote the predicted probability as estimated from the results of method ψ. Note that π̂_ψ(x) is a random variable, since the results of method ψ depend on the sample used for estimation. The mean squared error of the estimator ψ at x is

    MSE(ψ; x) = E[ (π(x) − π̂_ψ(x))² ].

The bias and variability of an estimator both contribute to its MSE, so techniques with low MSE are preferable. The average MSE of the estimator ψ across all possible values of x is the mean integrated squared error,

    MISE(ψ) = ∫_X E[ (π(x) − π̂_ψ(x))² ] dF(x),

where F is the CDF of x. We can approximate the MISE of each estimator using the Monte Carlo simulations. In each iteration b, we construct a set of M = 1000 out-of-sample observations, x_(1)^b, ..., x_(M)^b, drawn from the same probability distribution used to generate the sample observations. We then use the results of each method ψ in this iteration to obtain π̂_ψ^b(x_(m)^b), the predicted probability for the mth element of the holdout sample. The Monte Carlo approximation is

    MISE(ψ) ≈ (1/M) Σ_{m=1}^{M} [ (1/B) Σ_{b=1}^{B} ( π(x_(m)^b) − π̂_ψ^b(x_(m)^b) )² ].

The MISE itself is a squared probability, so we present the results in terms of the square root of the approximate MISE. To get an idea of the comparative performance of each estimator, see the plots in Figure 2.

⁵ We use the R package mgcv (Wood 2011) to estimate GAMs.
⁶ We use the R package brglm (Kosmidis 2010) for bias-corrected logistic regression and glmnet (Friedman, Hastie and Tibshirani 2010) for the adaptive lasso.
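The approximate MISE reduces to a double average over holdout points and simulation iterations. A minimal sketch (array layout and function name are ours):

```python
import numpy as np

def rmise(pi_true, pi_hat):
    """Root of the approximate mean integrated squared error.

    pi_true, pi_hat: (B, M) arrays of true and estimated probabilities for
    M holdout points in each of B simulation iterations."""
    # Average squared error over iterations (axis 0), then over holdout points.
    mise = np.mean(np.mean((pi_true - pi_hat) ** 2, axis=0))
    return np.sqrt(mise)

# A perfect estimator has RMISE 0; a constant guess of 0.5 against a constant
# truth of 0.4 has RMISE 0.1.
truth = np.full((3, 4), 0.4)
print(rmise(truth, truth))
print(rmise(truth, np.full((3, 4), 0.5)))
```

Because the RMISE is on the probability scale, the values reported below can be read directly as typical prediction errors (e.g., 0.055 means predicted probabilities are off by about 5.5 percentage points on average).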
We examine two "profiles" in which X_1 varies while the five other variables are held fixed. In the first, the continuous variables are held at their means and the binary variables are held at 0; in the second, X_2 is held at 1.5 while the others are the same as in the first. The solid lines represent the true predicted probability of Ỹ = 1 as a function of X_1, as given by (6). The dotted lines show the average predicted probability obtained from each estimator, and the bounds of the dark gray regions are the 2.5th and 97.5th percentiles.

As expected, the structural models recover the true probabilities with high precision. At the other extreme, the naïve logistic regression, unable to capture the strong nonmonotonicity, clearly yields the least accurate fitted values. The results from the other non-structural estimators, the GAM and the two variants of the adaptive lasso, are more interesting. For the first profile, when all of the variables other than X_1 are held at their means, the average predicted probability is similar, and reasonably accurate, across all three methods.

[Figure 2. Average predicted probability and 95% interval for each estimator as X_1 varies. Panels: structural (correct), structural (saturated), naïve logit, GAM, polywog (d = 2), polywog (d = 3). (a) ρ = 0.5, continuous covariates held at mean, binary covariates held at 0. (b) ρ = 0.5, X_2 held at 1.5, other continuous covariates held at mean, binary covariates held at 0.]
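The inverse-U shape in these profiles can be checked directly from equation (6) under the utilities in (5). The following sketch (function name and profile values are ours) evaluates the true probability along X_1 with X_2 and X_3 held at zero, as in the first profile:

```python
import numpy as np
from scipy.stats import norm

def pr_RR(x1, x2=0.0, x3=0):
    """True Pr(Y = RR) from equation (6) under the utilities in equation (5),
    holding X2 and X3 fixed (illustrative profile; defaults are ours)."""
    u11 = -1 + 2 * x1 + x2
    u13 = 1.25
    u14 = 0.5 * x3 - x2
    u24 = 0.75 + x1 + x2 + 0.5 * x3
    p4 = norm.cdf(u24 / np.sqrt(2))
    num = (1 - p4) * u13 + p4 * u14 - u11
    den = np.sqrt(p4 ** 2 + (1 - p4) ** 2 + 1)
    return norm.cdf(num / den) * p4

x = np.linspace(-3, 3, 61)
p = pr_RR(x)
peak = int(p.argmax())
# The probability peaks in the interior of [-3, 3], not at an endpoint,
# confirming the nonmonotonicity in X1 that naive logit cannot capture.
print(0 < peak < len(x) - 1)  # True
```

No estimation is involved here; the point is simply that the equilibrium probability is itself a nonmonotonic function of X_1, so any estimator restricted to monotonic fits is misspecified from the start.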
                          ρ = 0    ρ = 0.5    ρ = 0.8
  structural (correct)    0.022    0.022      0.022
  structural (saturated)  0.032    0.032      0.032
  naïve logit             0.217    0.234      0.238
  GAM                     0.172    0.125      0.075
  polywog (d = 2)         0.055    0.055      0.056
  polywog (d = 3)         0.057    0.058      0.056

Table 1. Root mean integrated squared error for each estimator across levels of correlation among the covariates.

However, in the second profile, when X_2 is held 1.5 standard deviations above its mean, the GAM performs noticeably worse than the polywog estimators. The GAM at least gets the inverse U-shape of the relationship correct, but its estimates of the location and magnitude of the peak are far off. For X_1 ≤ −0.5, the 95% regions of the GAM predicted probabilities do not even contain the true values. The polywog estimators appear less efficient than the GAM, particularly at the extremes of the parameter space, but they are also much less biased.

The root MISE calculations in Table 1 tell a similar story. The structural models always have the lowest MISE, and naïve logit always has the highest. In between, the degree-2 polywog performs best, followed by the degree-3 polywog and then the GAM. The magnitude of the difference in RMISE between the degree-2 polywog and the saturated structural model is about twice the difference between the saturated and correct structural models. The typical level of prediction error for the degree-2 and degree-3 polywog models is about 5.5%, compared to 2.2% for the correct structural model and 3.2% for the saturated structural model. Neither of the polywog methods is adversely affected by the introduction of correlation among the covariates, even at a level as high as 0.8.

Overall, the simulation results support our contention that adaptive lasso polynomial regression is a good way to estimate comparative statics relationships from strategic data. Although the method is not as efficient or accurate as a full structural estimator, it recovers the main relationships in the data with tolerably low bias.
More generally, the results demonstrate the importance of accounting for interactions between variables when estimating relationships from strategic data. The additional flexibility of the generalized additive model is overshadowed by its severe bias for predictions when multiple variables are not held at their means.

5 Empirical Applications

5.1 Schultz (2001)

A prominent example of testing comparative statics predictions with empirical data in the international conflict literature is Schultz's (2001) work on democracy and coercive bargaining. Schultz uses a game-theoretic model to derive hypotheses about the relationship between states' domestic political institutions and their behavior in international crises. The main theoretical result is that democracies' coercive threats are more credible than those made by autocracies. If the leader of a democratic state makes demands that she is not truly willing to back with military force, the domestic opposition can make political gains by refusing to back the government's stance. When the leader ultimately is forced to back down from a bluff, voters reward the opposition party for its better foreign policy judgment. Anticipating this potential political humiliation, democratic leaders make threats more selectively than dictators, who face no similar domestic political constraints. This relative unwillingness to bluff ends up working in favor of democracies, because their international opponents are more likely to believe the threats they do choose to make—and thus to concede the issue rather than provoke costly fighting.

We will examine two of this theory's observable implications about factors associated with the outbreak of international disputes (Hypotheses 1 and 2, p. 121). The first is that, all else equal, democracies are less likely to start disputes, due to their leaders' fear of the political consequences of being caught in a bluff.
The second is that, all else equal, democracies are weakly more likely to be targeted in a dispute, because they are less likely to bluff in response to a threat and thus more likely to concede the issue immediately.

Structural estimation of Schultz's game-theoretic model would be challenging. First, each player possesses private information that their moves signal to the others. Although there are some statistical models of strategic information transmission (e.g., Lewis and Schultz 2003; Whang 2010), these are built on less complex games and have only rarely been applied to observational data. Second, it would be costly to collect all of the data necessary for structural estimation. For example, one of the possible outcomes in the model is for the domestic opposition to denounce the government for failing to issue a coercive threat. To include this in the statistical model, one would need an indicator, for each case where no dispute occurred—the vast majority of those in any international conflict dataset—of whether each country's domestic opposition urged a more aggressive stance. Clearly, this would require either a massive data collection effort or a restrictive sample of cases. Such measures should not be necessary to test hypotheses on the relationship between domestic regime type and foreign policy outcomes, on which ample data already exist.

In light of these challenges to structural modeling, the empirical approach in Democracy and Coercive Diplomacy is decidedly reduced-form. Schultz uses logistic regression to model the probability of militarized international disputes (MIDs) as a function of regime type and various controls. Unfortunately, the models do not contain multiplicative or higher-order terms for most of the variables. If the true data-generating process is strategic, as claimed by the theory, these models may be subject to misspecification bias (Signorino and Yilmaz 2003).
To examine the sensitivity of the original results to this specification choice, we replicate the analysis using the polywog method.[7] We use a sample of N = 17,296 directed dyad-years from 1946 to 1984, corresponding to Schultz’s model 3 (p. 135).[8] The dependent variable is MID initiation, and the following regressors are included (interested readers may refer to Schultz 2001 for coding details):

• Democracy of both states (dichotomous indicator based on Polity III components)
• Major power status of both states
• Balance of capabilities and state 1’s share
• Contiguity
• S-scores of states with each other and with the system leader
• Years since last dispute (cubic spline)

[Figure 3. Predicted probability of dispute for an “average” profile by dyadic regime type (neither, initiator, target, or both democratic) in the models without fixed effects; polywog vs. original model.]

We include each of these in a d = 2 polywog specification. The original models use a linear specification, except for an interaction of the democracy indicators, an interaction of the major power indicators, and a cubic spline for peace years.[9]

[7] Because a complete replication archive for Democracy and Coercive Diplomacy is not available, we attempted to replicate the dataset as closely as possible by following the coding and sampling rules given in Chapter 5 and Appendix C. We were able to obtain statistical results that were substantively the same as, though not identical to, those reported in the book.
[8] Schultz does not apply any political relevance criterion, but his statistical technique uses only those directed dyads that experience at least one initiation in the period under consideration. For consistency, we use the same restriction.
[9] In our replication, we replace the spline terms with a third-degree polynomial (Carter and Signorino 2010). This has no substantive effect on the results.
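A full degree-2 basis expansion of the kind just described can be sketched as follows. This is a minimal illustration in Python with numpy, not the polywog implementation itself, and the function name is ours: every regressor enters on its own, squared, and interacted with every other regressor.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_expand(X, degree=2):
    """Expand covariate matrix X into all polynomial terms up to `degree`:
    each column, its powers, and all cross-products (interactions)."""
    n, p = X.shape
    cols, names = [], []
    for d in range(1, degree + 1):
        # combinations_with_replacement gives (0,0) for x0^2, (0,1) for x0*x1, etc.
        for idx in combinations_with_replacement(range(p), d):
            cols.append(np.prod(X[:, list(idx)], axis=1))
            names.append("*".join(f"x{j}" for j in idx))
    return np.column_stack(cols), names

# Two covariates, degree 2: x0, x1, x0^2, x0*x1, x1^2
X = np.array([[1.0, 2.0], [3.0, 4.0]])
Z, names = poly_expand(X, degree=2)
print(names)  # ['x0', 'x1', 'x0*x0', 'x0*x1', 'x1*x1']
```

With many regressors the expanded matrix grows quickly, which is exactly why the adaptive lasso step is needed to shrink most of these coefficients to zero.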
[Figure 4. Fitted values for an “average” profile in the models without fixed effects. Panels: capbalance, capshare1, s_wt_glo, s_ld_1, s_ld_2, peaceyrs; polywog vs. original model.]

To control for unobserved heterogeneity between directed dyads, Schultz uses Chamberlain’s (1980) fixed-effects logistic regression estimator, sometimes referred to as conditional logit. Unfortunately, to our knowledge, there is currently no lasso implementation of conditional logit, so a straightforward comparison of methods is difficult. Instead, we compare two sets of models: one with no fixed effects in either model, and one with fixed-effects dummy variables in both.

In the models without fixed effects, we find stronger support for Schultz’s main hypothesis—that democracies are less likely to initiate disputes—in the polywog model than in the ordinary logistic regression. Figure 3 shows the predicted probability of a dispute for each potential dyadic combination of regime type, with 95% bootstrap confidence intervals (B = 1000). All of the other variables are held at their means (continuous) or modes (binary).

[Figure 5. Fitted values for an “average” profile in the models with fixed effects. Panels: capbalance, capshare1, s_wt_glo, s_ld_1, s_ld_2, peaceyrs; polywog vs. original model.]

Although
the difference in probability between an all-autocratic dyad and one with a democratic initiator is not statistically significant at conventional levels in either case, the substantive effect is greater under the polywog model. Going from an autocratic initiator to a democratic one reduces the probability of conflict by about 45% according to the polywog model, compared to 22% in the ordinary logit. On the other hand, the hypothesis about democratic targets appears to have no support under either model.

We also find substantial differences in the estimated relationships between the control variables and the probability of dispute occurrence. Figure 4 displays fitted values for profiles in which each of the continuous regressors is varied across its range in the data, with the other variables held at their means or modes. The gray bars again represent a 95% bootstrap confidence interval. It is clear that a linear specification fails to pick up on some significant nonmonotonicities, such as the inverse U-shaped relationship between the initiator’s share of capabilities and the probability of a dispute.

[Figure 6. ROC curves (proportion of 1s correct vs. proportion of 0s correct) for the models with fixed effects, in sample and out of sample; polywog vs. original model.]

In the models with fixed effects, the qualitative differences between polywog and ordinary logit are relatively slight. Moreover, with about 500 directed-dyad dummy variables, the relationships between the control variables and the chance of a dispute cannot be estimated with high precision; see the plots in Figure 5. However, the presence of fixed effects raises a different issue: the potential for overfitting.
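The bootstrap intervals reported above come from the standard nonparametric procedure: resample rows of the data with replacement, recompute the fitted quantity on each resample, and take empirical quantiles of the B replicates. A minimal sketch in Python (the statistic here is a simple sample proportion standing in for a model’s fitted probability; the function and variable names are ours):

```python
import numpy as np

def bootstrap_ci(data, statistic, B=1000, level=0.95, seed=0):
    """Nonparametric bootstrap percentile interval: resample rows of `data`
    with replacement, recompute `statistic` on each resample, and take the
    empirical quantiles of the B replicates."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # draw n row indices with replacement
        reps[b] = statistic(data[idx])
    alpha = 1.0 - level
    return np.quantile(reps, [alpha / 2, 1.0 - alpha / 2])

# Example: 95% CI for a sample proportion of 0.30 (n = 100)
y = np.array([1.0] * 30 + [0.0] * 70)
lo, hi = bootstrap_ci(y.reshape(-1, 1), lambda d: d.mean(), B=1000)
```

In the application, the resampled statistic would be the predicted probability for a given covariate profile, with the full model refit on each bootstrap sample.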
Controlling for unobserved heterogeneity with fixed effects improves in-sample fit, but at the risk of worse out-of-sample predictions.[10] When fixed effects are included as dummy variables, a penalized estimator such as the lasso helps guard against overfitting and thus should be better at classifying observations that were not used to fit the model. Indeed, variable selection in high-dimensional classification problems is one of the main uses of the lasso in the machine learning literature (e.g., Koh, Kim and Boyd 2007). We examined the predictive performance of each of the fixed-effects models on the 1946–1984 data used to fit the model and on a holdout sample of N = 1,332 observations from 1985–1988. The ROC curves for these are plotted in Figure 6. The original model is strictly better for in-sample fit, but tends to be worse at predictions outside the sample. These results suggest that a penalized regression approach like polywog may be useful in settings where researchers want to control for unobserved heterogeneity without overfitting.

[10] With the conditional logit estimator, it is not even possible to generate out-of-sample predictions, as the fixed effects themselves are not estimated.

6 Conclusion

We have introduced a new method for testing whether comparative statics predictions are borne out in strategic choice data. Our simulations show that the method is effective and efficient compared to alternative non-structural techniques. There are many directions to pursue in future work. We hope to explore alternative basis functions for the d-degree approximation, including orthogonal polynomials and B-splines. We plan to further demonstrate the real-world usefulness of the technique with additional empirical applications. In addition, we intend to develop an R package, polywog, to facilitate easy implementation of the methods described.
A Appendix

A.1 K-fold cross-validation

We borrow much of Hastie, Tibshirani and Friedman’s (2009, 241–242) notation to describe the K-fold cross-validation procedure. The first step is to randomly partition the data into K subsamples of approximately equal size (exactly equal if n is evenly divisible by K), via the mapping κ : {1, . . . , n} → {1, . . . , K}. We also form a grid of T candidate values for the penalization parameter, (λ1, . . . , λt, . . . , λT), where λ1 = 0 and λT is large enough that all coefficients other than the intercept are shrunk to 0. Typical choices are K = 10 and T = 100. It is possible to set K = n, known as leave-one-out cross-validation, but this is too computationally expensive to be feasible for the adaptive lasso with moderate or large n.

Let β̂^{−k}(λt) denote the adaptive lasso estimate obtained when λ = λt and the observations in the kth subsample are excluded. We can estimate the out-of-sample prediction error of the adaptive lasso under each candidate value of λ as

$$\mathrm{CV}(\lambda_t) = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - x_i' \hat{\beta}^{-\kappa(i)}(\lambda_t) \right)^2.$$

That is, the prediction error for each observation is the squared difference between its actual value and the fitted value from the adaptive lasso when the subsample containing the given observation is excluded. This entails running KT adaptive lasso fits, one for each excluded subsample and each grid point λt. We choose the λt that minimizes CV(·) to serve as the penalization parameter in the adaptive lasso fit on the full sample. Hastie, Tibshirani and Friedman (2009, 244) advocate an alternative criterion for even greater shrinkage: choose the largest λt for which CV(·) is within one standard error of the minimum.

References

Carter, David B. and Curtis S. Signorino. 2010. “Back to the Future: Modeling Time Dependence in Binary Data.” Political Analysis 18(3):271–292.

Chamberlain, Gary. 1980.
“Analysis of Covariance with Qualitative Data.” The Review of Economic Studies 47(1):225–238.

Chatterjee, A. and S. N. Lahiri. 2011. “Bootstrapping Lasso Estimators.” Journal of the American Statistical Association.

Derksen, Shelley and H. J. Keselman. 1992. “Backward, Forward, and Stepwise Automated Subset Selection Algorithms: Frequency of Obtaining Authentic and Noise Variables.” British Journal of Mathematical and Statistical Psychology 45:265–282.

Fan, Jianqing and Runze Li. 2001. “Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties.” Journal of the American Statistical Association 96(456):1348–1360.

Friedman, Jerome, Trevor Hastie and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33(1):1–22.

Hastie, Trevor and Robert Tibshirani. 1990. Generalized Additive Models. Chapman & Hall/CRC.

Hastie, Trevor, Robert Tibshirani and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.

Hurvich, Clifford M. and Chih-Ling Tsai. 1990. “The Impact of Model Selection on Inference in Linear Regression.” The American Statistician 44(3):214–217.

King, Gary. 1986. “How Not to Lie with Statistics: Avoiding Common Mistakes in Quantitative Political Science.” American Journal of Political Science 30(3):666–687.

Koh, K., S. Kim and S. Boyd. 2007. “An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression.” Journal of Machine Learning Research 8(8):1519–1555.

Kosmidis, Ioannis. 2010. brglm: Bias Reduction in Binary-Response GLMs. R package version 0.5-5. URL: http://go.warwick.ac.uk/kosmidis/software

Lewis, Jeffrey B. and Kenneth A. Schultz. 2003.
“Revealing Preferences: Empirical Estimation of a Crisis Bargaining Game with Incomplete Information.” Political Analysis 11(4):345–367.

Schultz, Kenneth A. 2001. Democracy and Coercive Diplomacy. Cambridge University Press.

Signorino, Curtis S. 1999. “Strategic Interaction and the Statistical Analysis of International Conflict.” The American Political Science Review 93(2):279–297.

Signorino, Curtis S. 2003. “Structure and Uncertainty in Discrete Choice Models.” Political Analysis 11(4):316–344.

Signorino, Curtis S. and Ahmer Tarar. 2006. “A Unified Theory and Test of Extended Immediate Deterrence.” American Journal of Political Science 50(3):586–605.

Signorino, Curtis S. and Kuzey Yilmaz. 2003. “Strategic Misspecification in Regression Models.” American Journal of Political Science 47(3):551–566.

Smith, Alastair. 1999. “Testing Theories of Strategic Choice: The Example of Crisis Escalation.” American Journal of Political Science 43(4):1254–1283.

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society, Series B 58(1):267–288.

Whang, Taehee. 2010. “Structural Estimation of Economic Sanctions: From Initiation to Outcomes.” Journal of Peace Research 47(5):561–573.

Wood, S. N. 2011. “Fast Stable Restricted Maximum Likelihood and Marginal Likelihood Estimation of Semiparametric Generalized Linear Models.” Journal of the Royal Statistical Society, Series B 73(1):3–36.

Zorn, Christopher. 2005. “A Solution to Separation in Binary Response Models.” Political Analysis 13:157–170.

Zou, Hui. 2006. “The Adaptive Lasso and Its Oracle Properties.” Journal of the American Statistical Association 101(476):1418–1429.