Spike and Slab Prior Distributions for Simultaneous Bayesian Hypothesis Testing, Model Selection, and Prediction of Nonlinear Outcomes

Xun Pang
Department of Political Science, One Brookings Dr., Seigle Hall, St. Louis, MO
[email protected]

Jeff Gill
Center for Applied Statistics, Washington University, Department of Political Science, One Brookings Dr., Seigle Hall, St. Louis, MO
[email protected]

Abstract

A small body of literature has used the spike and slab prior specification for model selection with strictly linear outcomes. In this setup a two-component mixture distribution is stipulated for coefficients of interest, with one part centered at zero with very high precision (the spike) and the other a distribution diffusely centered at the research hypothesis (the slab). Through this selective shrinkage, the setup incorporates the zero-coefficient contingency directly into the modeling process to produce posterior probabilities for hypothesized outcomes. We extend the model to qualitative responses by designing a hierarchy of forms over both the parameter and model spaces to achieve variable selection, model averaging, and individual coefficient hypothesis testing. To overcome the technical challenges in estimating the marginal posterior distributions, possibly with a dramatic ratio of density heights of the spike to the slab, we develop a hybrid Gibbs sampling algorithm using an adaptive rejection approach for various discrete outcome models, including dichotomous, polychotomous, and count responses. The performance of the models and methods is assessed with both Monte Carlo experiments and empirical applications in political science.

Keywords: Spike and Slab Prior, Hypothesis Testing, Bayesian Model Selection, Bayesian Model Averaging, Adaptive Rejection Sampling, Generalized Linear Model

1 Introduction

Variable selection, hypothesis testing, and forecasting are intertwined but distinct activities, such that researchers typically consider them sequentially. Relatedly, model averaging can also be considered a means of addressing the variable selection and hypothesis testing tasks, but only in a Bayesian context since model probabilities are required. In fact, coherent treatment of all of these requires the Bayesian perspective on model uncertainty (Hoeting et al 1999, Alston 2004), yet this approach is under-utilized in the social sciences. More specifically in political science, the absence of this perspective can be attributed to a fixation with individual coefficient (Wald) testing, often using strictly p-value evidence. This leads to the misconception that isolated theories are sufficient for variable selection. However, there are often multiple measures of conceptually identical or similar variables, co-linearity is common among social science variables, and all right-hand-side variables interact with non-linear link functions in the generalized linear model. Therefore, two models with alternative variable selections can produce dramatically different inferences and yet both fit well (Regal & Hook 1991; Draper 1995; Madigan & York 1995; Kass & Raftery 1995; Raftery 1996). This problem cannot easily be solved with the current state of model averaging since it involves complicated or highly subjective issues of identifying the model space, dealing with formidable integral calculations, justifying the inclusion of prior information, and overcoming the estimation challenges of jumping between different model spaces (Hoeting et al 1999; Clyde 1999; O'Hara & Sillanpää 2009).
Unfortunately all of these challenges remain even with Markov chain Monte Carlo methods. This paper extends the Bayesian linear spike and slab model to nonlinear (qualitative) outcomes, which has not been accomplished to date. The spike and slab model specifies mixture prior distributions for each of the coefficient parameters, where the spike part helps detect coefficients of zero magnitude by shrinking those coefficients to zero in posterior terms, and the slab part provides the non-zero coefficients with a diffuse normal prior, thus not requiring extensively informed prior knowledge. In this way, both the model space and parameter space are determined at the same time, even though the full dimension of the model space is not changed mathematically. Therefore we simultaneously perform variable inclusion decisions, hypothesis testing, and model specification, as well as set up a principled method for posterior prediction using Bayesian model averaging.

The applied practice of model determination has an inglorious history in the social sciences (Leamer 1978, Gill 1999), and remains an ongoing source of controversy. We provide a means whereby empirical researchers can avoid: arbitrary decisions, difficult covariate tradeoffs, and the fallacy that only one specification was tried. Since an R package has been developed by the authors, the primary cost of this method is a willingness to fully embrace Bayesian probabilistic inference.

There are several technical advantages over rival procedures outlined in this paper. We avoid the complication of transforming between model spaces of different dimensions as done with the reversible jump MCMC method (Green 1995). The difficulty of specifying a dimension matching condition through a "bijection" function is the primary reason that RJMCMC is rarely used, and when used it is often applied to pared-down or simple datasets. To avoid this problem the generalized linear spike and slab (GLSS) model uses modified continuous bimodal priors and applies a local and dynamic adaptation method by implementing a hierarchical specification of the variance parameter for each coefficient. This hierarchical design also avoids choosing a point value for the prior setup of the variance, which typically results in complex tuning of the MCMC algorithm (Clyde & George 2004, West 2003, Geweke 1996, George & McCulloch 1993, Mitchell & Beauchamp 1988). For practitioners, we develop nearly-automatic MCMC samplers requiring little tuning to implement the GLSS model for commonly observed qualitative responses in political science, including dichotomous, polychotomous, and count responses. In addition, we show that the spike and slab model works as a general two-sided Bayesian hypothesis test for individual coefficients, which has been an elusive goal in Bayesian statistics (Berger & Sellke 1987; Berger, Brown, & Wolpert 1994; Berger, Boukai, & Wang 1997; Lehmann 1993; Meng 1994; Rubin 1984). This theoretically important aspect has not been recognized in the literature, although the linear spike and slab prior is routinely applied for variable selection in statistical genetics (Meuwissen & Goddard 2004; Sha et al 2004; Kilpikari & Sillanpää 2003; Ishwaran & Rao 2008). Furthermore, nearly all hypothesis tests from regression analysis in political science are either explicitly or implicitly two-sided, thus necessitating a two-sided approach for any general method as proposed here. This paper is organized as follows.
In the next section, we present the general setup for the GLSS model and discuss how this model serves the three purposes of coefficient effect hypothesis testing, variable/model selection, and model averaging. In Section 3, we address the choice of prior distributions for hyperparameters important for the performance of the MCMC algorithms. In Section 4, we discuss the MCMC algorithms for estimating GLSS models with different link functions, and provide brief examples to demonstrate these hybrid MCMC algorithms. Then, we apply the methods to two empirical studies using data from political science and illustrate how this method can help hypothesis testing where conventional methods fail, improve the efficiency of variable selection, and take account of model uncertainty in forecasting. The MCMC algorithm for sampling the hyperparameters, R code for adaptive rejection sampling in the GLSS model with the logit link, and JAGS code for estimating the Poisson spike and slab model are given in appendices.

2 Generalized Linear Spike and Slab Model

2.1 Model Specification

The spike and slab linear model was first introduced by Mitchell & Beauchamp (1988), giving a point mass at zero and a bounded uniform distribution elsewhere in the parameter space. Formally, the prior setup is as follows:

p(βk = 0) = h0k and p(βk < b, βk ≠ 0) = (b + fk)h1k, where −fk < b < fk, (1)

where the probability at the spike (h0k) and that at the slab (2h1k fk) are greater than 0 and h0k + 2h1k fk = 1. This is therefore a mixture distribution with parameters fk and γk = h0k/h1k (the ratio of the heights of the two parts). Given a mixture prior distribution with a highly concentrated component at zero, each βk associated with a "vulnerable" predictor is very likely to be absorbed by this point mass and be deleted from the stochastically selected submodels.

One important modification proposed by George & McCulloch (1993) changes the discrete prior specification into a continuous mixture distribution. This modification facilitates applying the Gibbs sampler instead of using Bayesian cross-validation to estimate the model. They introduce a new parameter, γk, which is a dichotomous variable (γk ∈ {0, 1}) in the following prior specification:

βk | γk ∼ (1 − γk) N(0, τk²) + γk N(0, ck²τk²), (2)
p(γk = 1) = 1 − p(γk = 0) = pk, (3)

where τk² is such a small value that once βk falls into the first normal distribution, its support is sufficiently concentrated around zero that βk can be "safely" replaced by 0. The parameter γk indicates whether the covariate associated with βk is included in (γk = 1) or excluded from (γk = 0) the model in the MCMC process. A further modification of the spike and slab setup is made by Ishwaran & Rao (2005), who specify a continuous distribution for the variance τk² in the above specification to reduce the posterior sensitivity to the prior choices of the two-point mixture distribution of the variance hyperparameter in George & McCulloch (1993). These works all concentrate on linear models. The spike and slab model and its modified versions have been applied for variable selection among a large number of potential specifications, especially when the number of covariates is greater than the number of observations (Ishwaran & Rao 2008, Barbieri & Berger 2004, Chipman, George, & McCulloch 2001, George & McCulloch 1997).
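As a minimal illustration of this continuous mixture, the following R sketch draws βk from the prior in equations (2)-(3); the values of τk, ck, and pk here are illustrative assumptions, not recommendations.

## Sketch: draws from the George & McCulloch (1993) continuous spike-slab prior.
set.seed(1)
n.draws <- 10000
tau <- 0.01   # spike standard deviation: very tight around zero
c.k <- 100    # the slab is c.k times wider than the spike
p.k <- 0.5    # prior probability that the covariate enters the model

gamma.k <- rbinom(n.draws, 1, p.k)   # inclusion indicator, equation (3)
beta.k  <- rnorm(n.draws, mean = 0,
                 sd = ifelse(gamma.k == 1, c.k * tau, tau))  # equation (2)

## The histogram shows the characteristic tall spike at zero plus a flat slab.
hist(beta.k, breaks = 200, main = "Spike and slab prior draws")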
The generalized linear spike and slab model proposed in this paper follows the linear version of the hierarchical structure introduced by Ishwaran & Rao, and is specified as follows:

E(yi) = g⁻¹(x′i β), (4)
βk | γk² ∼ N(0, γk²), for k = 1, 2, ..., K, (5)
γk² = Fk ηk², (6)
Fk | ν0, w ∼ (1 − w) δν0(·) + w δ1(·), (7)
ηk⁻² | a1, a2 ∼ Gamma(a1, a2), (8)
w | c1, c2 ∼ Beta(c1, c2). (9)

In this specification, the response variable yi is discrete and could be binary, ordinal, multinomial, or count data. The design matrix x contains covariates, and the covariates can be standardized for the purpose of possibly improving mixing in Markov chain Monte Carlo simulations. In equation (5), each βk has its own variance parameter γk², which is updated in the MCMC simulation. In fact, the prior distribution of γk² is determined by two random variables, Fk and ηk². The hyperparameter Fk has a discrete distribution with support {ν0, 1} and a weight parameter w (it is actually a rescaled Bernoulli distribution). In equation (7), δA(·) is a mass point at A, and ν0 is such a small value that 3√(ν0)ηk is very close to zero. This means that more than 99.99% of the mass in the spike part is concentrated in a region not distinguishable from 0, and whenever βk falls into the spike, it can be safely replaced by zero. If Fk takes the value of ν0 (determined by both the data and the prior probability 1 − w), the observed data suggest that βk's effect is negligible and support the null hypothesis βk = 0. Similarly, if Fk takes the value of 1 with its posterior probability, βk ∼ N(0, ηk²), and with ηk² large enough, this normal prior is diffuse and has limited influence on the posterior, admitting the non-zero coefficients. Since zero or close-to-zero coefficients tend to be shrunk into the spike part, only those coefficients whose effects are differentiable from zero will have Fk = 1 with a high frequency.

2.2 Spike and Slab Model as a Hypothesis Testing Approach

Both the original and the modified versions of the spike and slab setup can be naturally applied to coefficient hypothesis testing. Hoeting et al (1999) used the posterior probability p(βk ≠ 0|D) (D denotes data) and compared it with p-values, but the theoretical justification has not been laid out. The primary application of spike and slab priors is complete model specification, which is essentially Bayesian model testing. Less attention has been paid to using them for hypothesis testing on individual coefficients, i.e., H0: βk = 0 versus H1: βk ≠ 0. In the spike and slab model of Mitchell & Beauchamp (1988), the mixture prior in equation (1) is a straightforward setup of the prior beliefs about statements on covariates' effects, and the observed data update those beliefs into the posterior distributions of π(h0k|D) versus π(2h1k fk|D). These quantities can be computed from the MCMC output of γk and fk, and provide the posterior probability that the null or alternative hypothesis is true. In the modified model specification in equation (3), with the dichotomous random variable γk indicating inclusion or exclusion of the covariate associated with βk, interpreting it as a hypothesis test on βk is even more straightforward. After the Markov chain converges, each iteration is one test based on a sample of parameters drawn from the relevant posterior distribution: what we actually obtain is p(H0|D) = p(γk = 0|D) = p versus p(H1|D) = p(γk = 1|D) = 1 − p.
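Concretely, assuming the sampler stores the indicator draws, these two posterior probabilities are just column proportions of the MCMC output; the matrix below is a placeholder standing in for real output.

## Sketch: posterior test probabilities from stored MCMC indicator draws.
## gamma.draws is a hypothetical G x K matrix (G iterations, K coefficients)
## with entries 1 (include) or 0 (exclude); here filled with placeholder values.
gamma.draws <- matrix(rbinom(1000 * 5, 1, 0.7), 1000, 5)

p.H1 <- colMeans(gamma.draws)   # estimates p(H1 | D) = p(gamma_k = 1 | D)
p.H0 <- 1 - p.H1                # estimates p(H0 | D) = p(gamma_k = 0 | D)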
Those conditional probabilities are estimated using the observed data instead of relying on unreplicatable resampling or hypothesized data. Therefore, this Bayesian hypothesis testing is based on a sufficiently large number of repeated experiments (the number of MCMC iterations) and data from the same distribution (the posterior distribution). Bayesian decision making then uses those probabilities, which are also the risks of rejecting or accepting a hypothesis. For instance, if p(H0|D) = p, then rejecting H0 assumes a risk of p, and accepting it a risk of 1 − p. Note that this hypothesis testing and decision making make probabilistic claims, and do not involve p-values or an arbitrary significance threshold level as in the null hypothesis significance testing (NHST) paradigm.

The fact that the spike and slab model can serve as a coefficient hypothesis test has important meaning for social scientists, since the NHST that currently dominates tests of statistical reliability in the social sciences is logically flawed and inconsistent, as noted by a huge number of authors (Barnett 1973, Berger, Boukai, & Wang 1997, Berger & Sellke 1987, Berkhardt & Schoenfeld 2003, Bernardo 1984, Brandstätter 1999, Carver 1978, 1993, Dar, Serlin & Omar 1994, Cohen 1962, 1977, 1988, 1992, 1994, Denis 2005, Falk & Greenbaum 1995, Gelman, Carlin, Stern, & Rubin 1995, Gigerenzer 1987, 1993, 1998, Gigerenzer & Murray 1987, Gill 1999, 2005, Gliner, Leech & Morgan 2002, Grayson 1998, Greenwald 1975, Greenwald, Gonzalez, Harris & Guthrie 1996, Hager 2000, Howson & Urbach 1993, Hunter 1997, Hunter & Schmidt 1990, Jeffreys 1961, Kirk 1996, Krueger 1999, 2001, Lindsay 1995, Loftus 1991, 1993a, 1993b, 1994, 1996, Loftus & Bamber 1990, Macdonald 1997, Meehl 1967, 1978, 1990, Nickerson 2000, Oakes 1986, Pollard 1993, Pollard & Richardson 1987, Robinson & Levin 1997, Rosnow & Rosenthal 1989, Rozeboom 1960, 1997, Schmidt 1996, Schmidt & Hunter 1977, Sedlmeier & Gigerenzer 1989, Thompson 2002, Wilkinson 1999). With a small number of observations especially, the asymptotic assumptions on which maximum likelihood test statistics rely are simply inapplicable, rendering these tests unhelpful, if not outright misleading. Without relying on an artificial interpretation of conditional probability, Bayesian inference and decision making based on posterior description and the Bayes Factor have many advantages over the NHST.

However, for the purpose of traditional hypothesis testing of coefficient reliability, posterior description alone does not present evidence in the traditional decision format that social science readers are used to seeing. The empirical posterior description is essentially a one-sided test based on p(βk > 0|D) versus p(βk ≤ 0|D), while two-sided hypothesis testing is quite difficult since the priors and posteriors are often continuous p.d.f.'s such that a point-null hypothesis has zero probability by design. Despite the evidence that point null hypothesis testing is inconsistent with Bayesian practice, there has been considerable effort expended trying to find a reconcilable Bayesian approach (Berger & Sellke 1987; Berger, Brown, & Wolpert 1994; Berger, Boukai, & Wang 1997; Lehmann 1993; Meng 1994; Rubin 1984). For instance, Lindley (1961) posits the prior information as sufficiently vague so that one has no particular belief that θ = θ0 versus θ = θ0 ± ε, where ε is some small value.
Then a reference (ignorance-expressing) prior can be used to obtain a posterior, and H0 is rejected for values that fall outside the (1 − α)100% highest posterior density region. The Bayes Factor is a more formal comparison of two alternative hypotheses of any type, but the pairwise comparison of submodels with imputation of covariates is inefficient and computationally demanding. With sophisticated models, the computation of the Bayes Factor itself is not always stable and reliable (Kass & Raftery 1995, Han & Carlin 2001). In the spike and slab setting, we can conduct the hypothesis test on a coefficient effect by simply computing the posterior probability from the MCMC output, p̂(Fk = 1|D) = #{g : Fk(g) = 1}/G, where G is the total number of iterations. Accordingly, p(H0 is true|D) = p(Fk = ν0|D) and is estimated as 1 − p̂(Fk = 1|D). For the purpose of testing the hypothesis H0: βk = 0 versus H1: βk ≠ 0, the posterior distribution of major interest is p(Fk|D).

2.3 Bayesian Model Comparison and Variable Selection

We always have uncertainty about the true data generation process (DGP), and/or the DGP itself is dynamic instead of fixed; therefore, model comparison is necessary and important. From the Bayesian point of view, model comparison is not about identifying the "true" model, but rather about assessing the relative relevance of models based on the observed data as well as prior information (Aitkin 1991; Kass & Raftery 1995; Han & Carlin 2001). In the literature, three major Bayesian approaches have been developed for model comparison: the Bayes Factor and its variants (Carlin & Chib 1995; Kass & Raftery 1995; Gelman et al 1995; Aitkin 1997; Berger & Pericchi 1998; Congdon 2001; Robert & Casella 2004; Bayarri & Berger 2000; Perez & Berger 2002; Spiegelhalter et al 1999), distance measures such as entropy distance (Sahu & Cheng 2003) or Kullback-Leibler divergence (Mengersen & Robert 1996), and simultaneous estimation of models in an enlarged parameter and model space, including the spike and slab approach, reversible jump MCMC (Carlin & Chib 1995; Green 1995), and birth and death MCMC (Stephens 2000). The Bayes Factor approach often involves complicated integration problems. To overcome this very challenging integration problem, Chib (1995) and Chib & Jeliazkov (2001) developed a very general approach based on both Gibbs and MH outputs. This marginal likelihood method has been implemented broadly, but is still computationally intense and requires very careful partitioning of the parameters in high-dimensional models. In general, all of the approaches for computing the Bayes Factor can invite numerical instability issues (Kass & Raftery 1995; Berger & Pericchi 1998; Han & Carlin 2001). Reversible jump MCMC involves changes of model dimension, which makes the algorithm complicated, and this is also true for the birth and death approach (O'Hara & Sillanpää 2009). In contrast, the spike and slab model does not involve changes in the model dimension, because when a subset of coefficients is shrunk to zero those coefficients still stay in the model mathematically. The fact that the spike and slab model deals with the parameter and model space at the same time simplifies the MCMC algorithm and smooths the path of the Markov chain in the whole model space. Theoretically, model comparison is a judgement about the quality of fit derived from making such decisions as: variable inclusion/exclusion, incorporated prior beliefs, the functional form, hierarchies, and distributional assumptions.
More often the key tradeoffs involve just variable selection. It is easy to say that this should always be based on substantive theory in political science, but the reality is much more complicated since there are often alternative measures for phenomena of interest (e.g., "ideology" or "partisanship" in survey research), high correlations among the potential explanatory variables, surprisingly strong relationships from unsuspected sources, and complications from interactions inherent in all generalized linear models with link functions that are not the identity. On the opposite end of the spectrum many political scientists have a distaste for "data mining" through sample data since it simply chases unexplained covariance in the sample and therefore rarely produces reliable inferences. We are offering, instead, a principled criterion based on the posterior probability that a regression coefficient is distinct from zero, conditional on: other covariates, the form of the model, and of course the data. Furthermore, by controlling the level of shrinkage, users can stipulate high prior probability for coefficients with strong theoretical support and let these factors dominate the posterior probability for others.

In the Bayesian framework, variable selection is essentially about how to estimate the marginal posterior probability that a covariate is included in the model based on the observed data and prior information or preferences. In this sense, variables should always have a prior with a spike and slab (Miller 2002), reflecting the uncertainty of including or excluding them. Following this line, there are various methods to implement variable selection, all explicitly or implicitly using spike and slab priors. One approach uses two auxiliary variables: γk, which indicates whether a variable should or should not be included in the model, and ψk, which is used for computing the posterior effect of a coefficient βk = ψk γk (Kuo & Mallick 1998). This approach differs from stochastic search variable selection (SSVS) (George & McCulloch 1993; Brown et al 1998) in that the auxiliary variable ψk has its own prior distribution and affects the posterior probability if a "pseudo-prior" is not used as in Gibbs variable selection (GVS) (Dellaportas et al 1997). The SSVS approach focuses on p(βk|Ik = 0), and Ik is often assigned a Bernoulli prior distribution with parameter 0.5 (George & McCulloch 1993). Criticisms of this approach concern not only the complexity of tuning, but also the undesirable influence of the priors: the prior is too informative and biases toward a model with half of the candidate variables, which is not desirable in that the resulting model can be too complex given that the set of candidate variables is usually large (O'Hara & Sillanpää 2009). The spike and slab setup in this paper is essentially an SSVS approach, but our indicator Fk does not have a prespecified Bernoulli parameter—the hierarchical structure allows the parameter w to be updated based on data information. Our approach is also close to the adaptive shrinkage method in the sense that this method generally specifies βk|τk² ∼ N(0, τk²) and focuses on coming up with a suitable prior for the variance τk² that shrinks values of βk towards zero but does not shrink the non-zero coefficients. The prior distribution of τk² also controls the sparseness of the model and the mixing of the MCMC algorithm, which is exactly the case for our model in equation (5).
In the literature, the often-used priors in this dynamic shrinkage approach include Jeffreys's prior (Hoti & Sillanpää 2006, Xu 2003) and the Laplacian prior (Figueiredo 2003), better known as the Bayesian Lasso (Park & Casella 2008; Kyung, Gill, & Casella 2008; Yi & Xu 2008). In the present paper, we use a mixture prior for the variance parameter γk² which can achieve the same "selective shrinkage" effect but is easier to estimate. Another approach to variable selection is the model space approach. Instead of specifying priors for the individual coefficients, this approach places priors on the number of covariates in a selected model, and after the dimension of the model is decided, the secondary task is to choose variables. This approach includes reversible jump MCMC (Green 1995) and its variants (Godsill 2001; Kilpikari & Sillanpää 2003). Not much different from RJ-MCMC, those variants also involve a change in dimension and still require a dimension matching condition, which is often quite complicated in realistic models.

2.4 Bayesian Model Averaging for Prediction

Bayesian Model Averaging (BMA) is a way to solve the inference and prediction problems caused by model uncertainty (Leamer 1978; Draper 1995; Chatfield 1995; Kass & Raftery 1995; Dijkstra 1988; George 1999). Ideally, the inferential and predictive distributions, or any quantity under investigation, should be obtained by using all the relevant models, averaged with their weights (model posterior probabilities); that is, those distributions should all be mixtures over the whole model space (Clyde 1999). However, in empirical analyses, model uncertainty is more often ignored, and inferences and predictions are obtained based on a single model as if it were the true data generation process with absolute certainty. Ignoring this uncertainty leads to overconfident inferences and involves higher decision risk (Alston et al 2005). Madigan & Raftery (1994) showed that BMA performs better, in terms of the logarithmic scoring rule, than predictions based on any single model, including the "best" one. The idea of BMA is simple: for any quantity of interest, ∆, we obtain its posterior distribution in the following way:

p(∆|y) = Σ_{n=1}^{N} p(∆|Mn, y) p(Mn|y), (10)

where

p(Mn|y) = p(y|Mn) p(Mn) / Σ_{j=1}^{N} p(y|Mj) p(Mj), (11)

p(y|Mn) = ∫ p(y|θn, Mn) p(θn|Mn) dθn. (12)

However, implementing this procedure has several formidable challenges: how to determine the number of models in equation (10); how to specify the model prior probability in equation (11); and how to handle the high-dimensional integral in equation (12). For the first problem, one approach is the Occam's window method (Madigan & Raftery 1994; Volinsky et al 1997), which sets rules based on information and decision theories. A second approach, which we adopt in the present paper, is to use Markov chains to determine the model space by simply counting the proportions of the observations of submodels. This approach has the advantage of avoiding computing the marginal likelihood and the model weights by analytical or numerical integration; instead, the weights are given along with the model set in the MCMC simulation—the frequency of appearance of a model is its empirical posterior weight.
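As an illustration of this counting strategy, the sketch below tabulates visited submodels from a matrix of inclusion-indicator draws; the placeholder draws stand in for real MCMC output.

## Sketch: empirical model weights from MCMC output.
## F.draws is a hypothetical G x K matrix of 0/1 inclusion indicators; each
## row identifies the submodel visited at that iteration.
set.seed(2)
F.draws <- matrix(rbinom(1000 * 4, 1, c(0.9, 0.1, 0.8, 0.5)),
                  nrow = 1000, ncol = 4, byrow = TRUE)

model.id <- apply(F.draws, 1, paste, collapse = "")   # e.g. "1010"
weights  <- sort(table(model.id), decreasing = TRUE) / nrow(F.draws)
head(weights)   # visited submodels and their empirical posterior weights

The weights produced this way can then be plugged directly into the mixture in equation (10) without ever computing the integral in equation (12).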
BMA is important for probabilistic predictions, because there are two types of discrepancies between observed and predicted values (Draper et al. 1993). The first type of discrepancy is predictive bias—a systematic tendency to predict on the low side or the high side. The second is lack of calibration—a systematic tendency to over- or understate predictive accuracy (Hoeting et al 1999). Madigan & Raftery (1994) use the predictive log score to measure the decision risk, which is a combined measure of bias and calibration. The score can be computed simply by randomly splitting the data into two parts, DB, the build data, and DT, the test data, and computing the following quantities:

Score_single = − Σ_{d∈DT} log p(d|M, DB), (13)

Score_BMA = − Σ_{d∈DT} log ( Σ_{M∈M} p(d|M, DB) p(M|DB) ). (14)

The larger the score, the more discrepant the prediction. Madigan & Raftery found Score_BMA to be smaller than Score_single. We will use this criterion to compare the prediction performance of the GLSS model with single models in one of the empirical studies below.

3 Selective Shrinkage and Prior Choices

3.1 Selective Shrinkage

Essentially, the three tasks (hypothesis testing, variable selection, and model averaging) are achieved by the GLSS model mainly due to the selective shrinkage effect of the spike and slab prior. As mentioned above, this approach can be regarded as a modified version of the adaptive shrinkage approach. In this section, we discuss this mechanism focusing on its meaning for prior choices. To illustrate the shrinkage mechanism simply, note that in each MCMC iteration the Bayesian linear estimate of β is β̂ = (X′X + B0⁻¹)⁻¹X′Z, where B0 is the variance of a multivariate normal prior NK(0, B0). The prior variance can be regarded as the ridge matrix in the conventional sense, and the elements in B0 are referred to as the ridge parameters. Selective shrinkage uses the ridge parameters to achieve a certain goal by adjusting the bias-variance tradeoff.¹ Also, in the Bayesian setup priors can be considered as a penalty function on the likelihood function. Different penalties are often achieved by specifying a different measure G(·) in the general expression of a mixture prior, π(βk) = ∫ N(βk|0, ψk) G(dψk). For example, a quadratic penalty is obtained with normal priors, and the LASSO penalty is achieved by using a double exponential (Laplace) prior (Kyung et al. 2009). In our case, the measure G(·) is defined by equations (7) and (8), which influences the likelihood by shrinking zero or close-to-zero coefficients to zero.

This penalty or shrinkage is an important mechanism of the hypothesis testing process. As discussed above, non-Bayesian hypothesis testing approaches rely on estimating the standard errors. With discrete outcome models, it is well known that the maximum likelihood estimator is inefficient for estimating covariance matrices with a small sample size: the so-called "Stein paradox" (Stein 1956, Efron 1982). Shrinkage estimation was introduced for the purpose of achieving more efficient estimators. The original Laplace method (the first shrinkage estimator) adds some prior data (information) into the maximum likelihood estimate to shrink the estimate toward zero. The degree of this shrinkage depends upon the size and structure of the observed data (Greenland 2008).
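To see the ridge interpretation above mechanically, the following R sketch computes β̂ = (X′X + B0⁻¹)⁻¹X′Z for one hypothetical iteration; the data and prior variances are illustrative assumptions.

## Sketch of the ridge form of the conditional posterior mean in Section 3.1,
## with B0 = diag(gamma_k^2). X, Z, and gamma2 stand in for one MCMC iteration.
set.seed(3)
N <- 50; K <- 3
X <- matrix(rnorm(N * K), N, K)
Z <- X %*% c(1, 0, -1) + rnorm(N)     # latent responses at this iteration
gamma2 <- c(1e2, 1e-8, 1e2)           # slab, spike, slab prior variances

B0.inv   <- diag(1 / gamma2)
beta.hat <- solve(crossprod(X) + B0.inv, crossprod(X, Z))
beta.hat   # the spike variance shrinks the second coefficient to ~0

Because the ridge parameters differ by coefficient, the shrinkage is selective: coefficients assigned the spike variance are pulled essentially to zero while the others are left nearly unpenalized.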
Suppose the unrestricted model has a K-dimensional coefficient parameter vector Ψ = (ψ1, ..., ψK), and a restricted P-dimensional submodel has a parameter vector Θ = (θ1, ..., θP). The linear shrinkage approach constructs a new mixed estimator by weighted averaging: Γ̂ = λΨ̂ + (1 − λ)Θ̂, where λ ∈ [0, 1] denotes the shrinkage intensity. This shrinkage estimation can obtain a regularized estimate outperforming each individual estimator in terms of statistical efficiency if the shrinkage parameter λ is appropriately chosen. In practice, this parameter is often chosen by fixing it at a certain value, specifying a function depending on the sample size, minimizing a pre-specified risk function (Schaefer & Strimmer 2005), using cross-validation (Friedman 1989), or working in an empirical Bayesian context (Greenland 2000, Morris 1983). Shrinkage estimation has been widely used for model (variable) selection (Zhang, Ahn, & Lin 2006, Wang, Zhang, & Li 2005, Ishwaran & Rao 2008), multiple confounder detection (Greenland 2008), and large-scale covariance estimation (Tong & Wang 2007, Willan, Pinto, & O'Brien 2005). The spike and slab model is a shrinkage estimator because when Fk = ν0, then βk(g) ∼ N(0, (ν0 ηk²)(g)). In this situation, βk can be safely replaced by 0 and has almost zero effect in the next iteration, which is equivalent to shrinking the model to be K′-dimensional, where K′ = #(Fk = 1); here the shrinkage parameter λ is a vector instead of a scalar, and can be computed from the frequency of occurrence of each submodel in the MCMC simulation.

¹ The well-known decomposition of the mean square error (MSE) of an estimator q̂ of a parameter q is MSE = Var(q̂) + Bias(q̂)², and it measures the performance of this estimator. To improve the performance, we choose an optimal combination of the two parts, which can be achieved by non-parametric approaches such as bootstrapping the variance (Schaefer & Strimmer 2005a) or by parametric approaches such as shrinkage estimation.

3.2 Prior Choices

The prior choices are important for achieving both desirable shrinkage and confounding. The shrinkage determines the sparseness of the model—how parsimonious the selected model is desired to be; the confounding affects how easily the chain can move from one state space to another, but too much confounding means that the spike and slab prior will lose its ability to discriminate between the zero and non-zero coefficients. The working of the prior hinges on the following three parameters:

• ηk²: the hyperparameter of the variance for the slab part of the mixture prior for βk. We avoid the complexity of choosing an appropriate point value of ηk² by setting another hierarchical level, ηk² ∼ IG(a1, a2) (George & McCulloch 1993). This inverse gamma distribution allows the Markov chain to find the appropriate sample space for each coefficient, and adjusts the space dynamically. By choosing the hyperparameters a1 and a2, we set the possible support of ηk² and its uncertainty; therefore, this distribution should have support large enough to dilute the shrinkage introduced by the prior and let the data identify the non-zero coefficients. However, the desirable shape of this inverse gamma distribution is also determined by the data. If the data are not informative, the posterior distribution will be diffuse and the confounding of this inverse gamma distribution with the ν0ηk² spike around zero will not be enough. In other words, the mass that connects the two parts will be too thin, the chain will have difficulty moving between the two parts, and we will observe most of the coefficient chains staying in the spike part for a long time.
Here, the rule of thumb is that when too much shrinkage is observed, one should change the prior of this inverse gamma distribution to be more right-skewed or less diffuse.

• ν0: ν0ηk² is the variance of the spike part (a normal distribution with very high precision). The choices of ν0 and ηk² depend on each other, and the principle is to ensure that ν0ηk² is small enough to shrink the coefficients of zero magnitude, but not so small that there is not enough confounding with the slab part.

• w: this parameter controls the prior belief about the probability of a coefficient being non-zero, since it is the proportion of the mass allocated to the slab of the prior. Unlike George & McCulloch (1993), w is the same across all K coefficients, which means that the user can only assign systematic beliefs about the presence of all the covariates' effects. Small values of w reflect high prior skepticism about the coefficients and the theoretical importance of their associated covariates, while large values of w mean that the researcher stresses the theoretical importance of the variables she chooses and is more skeptical about the sampling or measurement of those data. If c1 = 1, c2 = 1, the prior for w becomes uniform and reflects indifferent prior beliefs about the two hypotheses.

These general rules imply that in practice, a number of trials with different combinations of the three parameter prior choices are often required until desirable mixing and shrinkage are obtained. Graphical tools can help give rough ideas of how the prior choices contain prior beliefs and meet the rules above.

[Figure 1: Different Prior Setups. The upper panels show the prior distributions of the variance hyperparameter γk² = Fk ηk² under four setups; the lower panels show the resulting spike and slab prior distributions of βk.]

Figure 1 illustrates how different choices of those tuning parameters incorporate prior uncertainty and affect the confounding; some combinations of those prior choices should be better than others. In the figure, the first choice is (w, ν0, a1, a2) = (0.5, 5E-09, 5, 50), which encodes the following prior information: we specify indifferent prior probabilities for βk = 0 and βk ≠ 0, since w = 0.5 puts half of the mass in the spike and the other half in the slab. The gamma distribution for the variance of the slab part produces a flat normal prior which allows non-zero coefficients to be identified. Although the prior for γk² is continuous, the density between the spike and slab is thin, which facilitates separating the non-zero coefficients from the zero ones. But the two parts are still connected by some amount of mass, and it is exactly this mass in between that serves as a bridge for the chain to move between the two parts.
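For instance, the following R sketch simulates draws from the prior in equations (5)-(9) under this first setup; w is fixed at 0.5 rather than drawn from its Beta(c1, c2) prior, an illustrative simplification.

## Sketch: Figure 1-style check of the prior under (w, nu0, a1, a2) = (0.5, 5e-09, 5, 50).
set.seed(4)
n.draws <- 10000
w.val <- 0.5; nu0 <- 5e-09; a1 <- 5; a2 <- 50

eta2   <- 1 / rgamma(n.draws, shape = a1, rate = a2)  # eta_k^{-2} ~ Gamma(a1, a2)
F.k    <- ifelse(rbinom(n.draws, 1, w.val) == 1, 1, nu0)  # equation (7)
gamma2 <- F.k * eta2                                      # equation (6)
beta.k <- rnorm(n.draws, 0, sqrt(gamma2))                 # equation (5)

par(mfrow = c(1, 2))
hist(gamma2, breaks = 100, main = "gamma_k^2")
hist(beta.k, breaks = 200, main = "beta_k")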
The second choice is (w, ν0, a1, a2) = (0.9, 5E-09, 5, 50), the same as the first except that w = 0.9 puts 90% of the mass in the slab part, reflecting a very strong prior belief that H1 is true. This will result in a bias towards more complicated models (more variables are selected). Normally, we should not hold such a strong prior; especially given that w controls the systematic belief about all the β coefficients, this prior setup is not recommended. The third specification, (w, ν0, a1, a2) = (0.9, 5E-09, 30, 50), is much worse than the first two. Besides the very strong prior bias towards H1, the variance of the slab part is too small. With modest sample sizes, this slab prior will strongly push the posterior distribution of non-zero coefficients towards zero. Also, the mass between the two parts is too thin to provide enough confounding. The last example has the prior choice (w, ν0, a1, a2) = (0.5, 0.1, 5, 10). With ν0 = 0.1, the spike and slab parts of the prior distribution of γk² are connected with too much mass in between, and the confounding is bad in the sense that the two states cannot be distinguished: the spike and slab prior functions similarly to a bimodal normal mixture prior.

4 MCMC Algorithms for Generalized Linear Spike and Slab Models

The spike and slab priors for multiple coefficients complicate the parameter space—the two different state spaces have to be separated for the purpose of identifying included and excluded variables, but at the same time they should be connected to make it possible, or easy, to move from one state space to the other. This trade-off requires tuning the degree of confounding; i.e., we need a "gray zone" where βk has nontrivial probabilities of being in both the spike and slab states; otherwise, the Markov chain will stay in one state for a very long time only because there is no bridge for it to get into the other space (Smith & Kohn 2002). This is the reason that George & McCulloch (1993) replaced the discrete spike and slab setup (a point mass as the spike) in Mitchell & Beauchamp (1988) with two normal distributions. However, in their case the confounding issue still requires complicated tuning.

[Figure 2: Spike and Slab Model Sample Spaces. Left: a bivariate spike and slab prior; right: a hypothetical resulting posterior. In both graphs the spikes are much higher than shown in the limited space and extend far beyond the plotting region.]

Another closely related issue is how to sample the coefficient posteriors. In Figure 2, we illustrate a simple bivariate case. The ratio of the heights of the spike and slab parts is usually huge for both the prior and posterior distributions, since the spike part is highly concentrated around zero. For the Metropolis-Hastings algorithm, it is difficult to specify a good proposal density to approximate the target, and the acceptance rate will be unacceptably low when the chain is stuck in the spike part. The Gibbs sampler is more mechanical and requires less tuning, but for non-linear models the coefficient parameters often do not have conditionally conjugate priors. With the auxiliary latent response variable approach (Albert & Chib 1993), generalized linear models with the probit link can be easily handled with the Gibbs sampler (Sha et al 2004), but for other links, such as the logit and Poisson GLMs, the Gibbs sampler cannot be applied to sampling the coefficient parameters. The adaptive MCMC approach has been suggested to obtain a good proposal by learning from the history of the chain (Ji & Schmidler 2009), but the success of this approach heavily depends on the starting location because it dramatically affects the "learning" of the sample space (Craiu et al, forthcoming).
In this paper, we propose sampling strategies with the goal of achieving minimal tuning and making the algorithm as automated as possible. We discuss the sampling schemes for the generalized linear models with probit, logit, and Poisson links, and use a simple example for each to illustrate how the algorithm works. The algorithm for updating the hyperparameters, F, w, and η², is the same across different link functions, and is given in Appendix A.

4.1 Generalized Linear Models with Probit Link

The simplest GLSS models are those that use the probit link function, because the data augmentation method introduced by Albert & Chib (1993) easily transforms such models into a linear framework, and the Gibbs sampler can be used to update the coefficient parameters along with the auxiliary latent response variables, just as in the linear case. This convenience is achieved because the GLSS model treats the spike and slab prior just as an ordinary prior.

4.1.1 Binary and Ordinal Probit Spike-Slab Model

Generally, the probit model for binary and ordinal outcomes can be specified as follows:

yi = 1 if zi > 0 (binary), or yi = j if τj−1 < zi < τj for j = 1, 2, ..., J (ordinal), (15)
zi = x′i β + εi. (16)

By assuming εi ∼ N(0, 1) for identification, this model has exactly the likelihood of the GLM probit model once the auxiliary latent variable zi is integrated out. Given the updated hyperparameters F, w, and η², the coefficient parameter is sampled from a multivariate normal distribution β | z*, γ² ∼ N(β̄, B̄), with

B̄ = (X′X + nΓ⁻¹)⁻¹ and β̄ = B̄X′Z*,

where γk² = Fkηk², and Γ is the diagonal matrix with γk² as its k-th diagonal element. The latent response variable is sampled from a truncated normal distribution conditioned only on β. In the ordinal probit model, the cutpoints are assigned a uniform distribution τ ∼ U(τ : −∞ < τ1 < ... < τJ−1 < ∞) and updated as τj ∼ U(max[max(zi : yi = j), τj−1], min[min(zi : yi = j + 1), τj+1]).

4.1.2 Multinomial Probit Spike-Slab Model

With the same data augmentation method, estimating the multinomial probit model requires only a very minor modification of the above. The setup is as follows:

yi = j if zij > zik for all k ≠ j, (17)
zij = x′ij β + εij. (18)

It is more convenient to show the sampling strategy by expressing this model in matrix form: stacking the Zi and xi gives Z = Xβ + ε, where the error ε ∼ N(0, Ω = IN ⊗ Σ) and Σ is a J × J matrix parameterized by a vector θ. Conditional on F, w, and η², the parameters β, Z, and θ can be updated as follows:

1. β | Z1, ..., ZN, Y, θ ∼ NK(β̄Z, BZ), where BZ = (X′Σ⁻¹X + Γ⁻¹)⁻¹ and β̄Z = BZX′Σ⁻¹Z.

2. Zi | Y, β, θ ∼ N(Xiβ, Σ), where the draw must satisfy the requirement that the yi-th element of Zi is the maximum.

3. π(θ | Z, Y, β) ∝ π(θ)|Ω(θ)|^(−1/2) exp{−(1/2)(Z − Xβ)′Σ⁻¹(θ)(Z − Xβ)}, which can be sampled by drawing from an approximating normal distribution with matched mode and curvature (see Albert & Chib 1993).

4.1.3 Example

Because of the conditional conjugacy, estimating GLSS models with the probit link is relatively easy, and the Gibbs sampler is used for all the parameters. This allows us to focus on the tuning issue of the prior specifications stated above in order to achieve a desirable level of shrinkage and good mixing.
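To fix ideas, a minimal R sketch of the conditional draw of β is below; it uses the simpler form B̄ = (X′X + Γ⁻¹)⁻¹ (dropping the rescaling factor n shown in Section 4.1.1 for readability), and the inputs are placeholders for the current state of a sampler.

## Sketch: one Gibbs draw of beta | Z*, gamma^2 in the probit GLSS model.
draw.beta <- function(X, Z.star, gamma2) {
  Bbar     <- solve(crossprod(X) + diag(1 / gamma2))   # (X'X + Gamma^{-1})^{-1}
  beta.bar <- Bbar %*% crossprod(X, Z.star)            # Bbar X'Z*
  ## one multivariate normal draw via the Cholesky factor of Bbar
  as.vector(beta.bar + t(chol(Bbar)) %*% rnorm(length(gamma2)))
}

## toy usage with placeholder values for a single iteration:
set.seed(5)
X <- matrix(rnorm(100 * 2), 100, 2)
Z.star <- X %*% c(1, -1) + rnorm(100)
draw.beta(X, Z.star, gamma2 = c(100, 1e-8))  # second prior variance is a spike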
Here we use a simple but observational dataset—Rule Change in the House of Representatives from 1867 to 1998 in the United States—which was analyzed by Binder (1996, 2006) and Schickler (2000), coding the response variable, Rule Change, both as binary and as ordinal data. There are only 66 observations in the dataset, and 5 covariates are included. Among the 66 observations, there are 8 events of pro-minority change and 21 episodes of pro-majority change. The Gibbs sampler works very well in estimating both the binary and ordinal probit S-S models. We use four different sets of prior specifications to check prior sensitivity and test the reasonable range of confounding. Although the sample size is small, the data seem informative, and the probit S-S models are not very sensitive to mild changes of priors in terms of hypothesis testing results. Since ν0 and ηk² jointly determine the degree of confounding, we fix the prior value of ν0 across these varying prior setups and only change the value of ηk² by tuning its inverse gamma prior distribution. The first choice uses an inverse gamma distribution with mean 58.33 and standard deviation 17.59; the second prior setup changes the mean to 20 and the standard deviation to 7.54, which is a dramatic change. But because this is a change in hyperparameters at a very low level of the hierarchy, the results are not very sensitive to it (compare the first and second columns in Table 1). The second important tuning concerns the hyperparameters of w, which control the overall shrinkage level. For the first two sets of priors, we use uniform priors on w (Beta(1, 1)), while in the third and fourth setups we make the beta prior distribution skewed (with skewness 1.42 and −1.42, respectively). This change makes two of the variables in the ordinal model have very different testing results: republican.majority, with p(β = 0|D, Prior3) = 0.993 and p(β = 0|D, Prior4) = 0.397, and homogeneity.change, with p(β = 0|D, Prior3) = 1.000 and p(β = 0|D, Prior4) = 0.237. This suggests that although w is at a low level of the hierarchy, it is still influential in the model; the information injected by its prior should be justified, and extreme prior beliefs will yield very different posterior hypothesis testing results. However, in terms of testing the importance of the covariates by setting some arbitrary threshold, the different testing methods generally agree with each other in this particular case.

4.2 Generalized Linear Model with Logistic Link

Modeling dichotomous and polychotomous outcomes with GLSS models under the logistic link is more complicated than under the probit link because there is no conjugacy for the coefficients. However, the logistic model is often preferred for several reasons. The coefficients have nice interpretations as measuring changes in the log-odds (logits); the logistic distribution has fatter tails; and in some analyses, such as case-control studies, the relative rates measured by the logistic model are consistent, which is not the case for the probit model (Breslow 1996; Lacy 1997; King & Zeng 2001). Normally, in Bayesian analysis, the coefficients in the logistic model are sampled using a random walk or independence chain with the Metropolis-Hastings algorithm, and the proposal density is often constructed using the information provided by the maximum likelihood estimates (for example, in MCMCpack; Martin & Quinn).
The MH algorithm is a very general MCMC sampler (Hastings 1970; Chib & Greenberg 1995), but it performs poorly for the logistic S-S model for two reasons. First, the coefficient posteriors in the GLSS are mixture distributions with the spike part highly concentrated (its density height is very large compared to that of the slab), so the MH chain has a hard time moving from the spike to the slab. Second, with small sample sizes, the proposals poorly approximate the posteriors, resulting in an unacceptably low acceptance rate. To overcome this sampling difficulty and to make the algorithm more automatic, we use the auxiliary variable approach proposed by Holmes & Held (2006) and apply the Gibbs sampler to simulate from the full conditionals.

Table 1: U.S. House Rule Change: Probit Binary and Ordinal Models (N=66, K=5)

                          Spike-Slab: Pr(β = 0)             MLE-Probit               Bayesian Diffuse Normal
Coef.                   P1      P2      P3      P4     Estimate   SE    p-value    Mean     SE     Lower    Upper
binary response: pro-majority
republican.majority   0.049†  0.028†  0.017†  0.000†   −0.311†  0.167   0.067    −0.321   0.170   −0.657    0.007
homogeneity.change    0.980   0.994   0.967   0.999     0.176   0.215   0.417     0.179   0.217   −0.245    0.607
capacity.change       0.000†  0.000†  0.000†  0.000†    0.315   0.284   0.273     0.337   0.287   −0.223    0.908
polarization.change   0.932   0.983   0.934   0.988     0.127   0.181   0.486     0.134   0.183   −0.219    0.499
floor.median          0.996   0.998   0.957   0.999     0.551†  0.260   0.039     0.566†  0.263    0.058    1.089
ordinal response
republican.majority   0.641   0.479   0.993   0.397    −0.422†  0.220   0.054    −0.473†  0.229   −0.936   −0.041
homogeneity.change    0.649   0.314   1.000   0.237     0.194   0.295   0.511     0.218   0.319   −0.398    0.854
capacity.change       0.931   0.981   0.993   0.860     0.694†  0.395   0.079     0.802†  0.423    0.019    1.677
polarization.change   0.909   0.977   0.941   0.857     0.243   0.222   0.274     0.281   0.232   −0.153    0.752
floor.median          0.000†  0.000†  0.000†  0.000†    0.209   0.319   0.513     0.207   0.337   −0.450    0.876

"†" highlights the covariates that are identified as statistically reliable (in the two-sided tests) or as having a certain direction of effect (in the one-sided test) on the response variable with the effect size of 0.10.
Prior setups in the GLSS models: P1: ν0 = 5E-07, a1 = 13, a2 = 700, c1 = 1, c2 = 1; P2: ν0 = 5E-07, a1 = 13, a2 = 300, c1 = 1, c2 = 1; P3: ν0 = 5E-07, a1 = 13, a2 = 700, c1 = 1, c2 = 8; P4: ν0 = 5E-07, a1 = 13, a2 = 700, c1 = 8, c2 = 1.

4.2.1 Binary and Ordinal Logit Spike-Slab Model

Using the auxiliary variable approach, the logistic S-S model with the same hyperparameter specification as above can be written as follows:

yi = 1 if zi > 0 (binary), or yi = j if τj−1 < zi ≤ τj for j = 1, 2, ..., J (ordinal), (19)
zi = x′i β + εi, εi ∼ N(0, λi), (20)
λi = (2ψi)², (21)
ψi ∼ KS, (22)

where ψi is a random variable distributed as a Kolmogorov-Smirnov distribution (Devroye 1981, 1986), so that εi has a logistic marginal distribution (Andrews & Mallows 1974). Therefore, by integrating zi out of the likelihood,

L(β|y) = ∫ f(y|z, β) π(z|β) dz, (23)

we obtain the exact likelihood function of a GLM with the logistic link, assuming the zi's are independent. By this means, all of the parameters in the model except the auxiliary parameter λi can be sampled with the Gibbs sampler. As for λ, the sampling scheme and parameterization are mainly based on Devroye (1986) and follow the application of Holmes & Held (2006). We sample λi with a rejection sampling scheme by dynamically approximating the target density with squeezing functions. The details are as follows:
1. Choose the proposal density g(λ) ∼ GIG(0.5, 1, ς²), where GIG denotes a generalized inverse Gaussian distribution and ς² = (z − x′β)².

2. Construct the acceptance probability

α(λ) = l(λ)π(λ) / (M g(λ)), (24)
l(λ) ∝ λ^(−1/2) exp(−0.5ς²/λ) (the likelihood), (25)
π(λ) = (1/4) λ^(−1/2) KS(0.5λ^(1/2)) (the prior). (26)

Setting M = 1, the acceptance rate α(λ) simplifies to α(λ) = exp(0.5λ)π(λ), and the simulation problem is now straightforwardly centered on evaluating a K-S distribution.

3. Partition the parameter space of λ into two regions so that a monotone alternating series representation can be constructed. Devroye (1986, pp. 161-168) proved that the turning point of this infinite mixture distribution is in the interval [4/3, π), but its exact location is unknown. Following Holmes & Held (2006), we take the value 4/3 as the breakpoint and use squeezing functions adapted to the two regions to sample the acceptance rate. The R code is given in Appendix B.

4. For the ordinal logistic model, the updating of the cutpoints is the same as in the binary probit model.

4.2.2 Multinomial Logit Spike-Slab Model

The multinomial logistic spike-slab model can be estimated with an algorithm very similar to the one for the logistic binary and ordinal models above, which is also discussed in Holmes & Held (2006), because it can be reformatted as several binary logit models, each updated in turn. This is based on the fact that the parameter vector βj, conditional on all other parameters including β−j, has a logistic setup, and we can sample βj directly using the same idea as before. In more detail, the basic setup of the multinomial logit model is:

yi ∼ M(1; θi1, ..., θiJ), (27)
θij = exp(xi βj) / Σ_{r=1}^{J} exp(xi βr), (28)

where M denotes a multinomial distribution, and one of the βj is fixed to identify the model. By working with the conditional likelihood of βj, it is easy to see that it has a logistic form:

L(βj | y, β−j) ∝ Π_{i=1}^{N} (ϑij)^{I(yi = j)} (1 − ϑij)^{I(yi ≠ j)}, (29)
ϑij = exp(xi βj − Cij) / (1 + exp(xi βj − Cij)), (30)
Cij = log Σ_{r≠j} exp(xi βr). (31)

With a spike and slab prior for βj as NK(0, Γj), this multinomial model can be updated exactly as above. Given the other coefficients and the latent variable zi sampled from TL(xi βj − Cij, 1), we update βj using the exact algorithm stated above. Here, TL denotes a truncated logistic distribution (if yi = j, zi is defined on the interval [0, ∞); otherwise it is on (−∞, 0)).

4.2.3 Example

Since for the logit models we have proposed a new algorithm, we use simulated data to illustrate how the algorithm works and assess the performance of the models with knowledge of the true data generation process. As stated above, the spike and slab models generally have advantages for hypothesis testing with small sample sizes; therefore, we choose the sample size to be 50 and the dimension of the design matrix to be 50 × 10, leaving the ratio of observations to coefficients as small as 5. The response variable is dichotomous. We also choose a correlated design to assess the effect of collinearity on different hypothesis testing methods. Specifically, the covariates X1, ..., X5 are independent of each other, X6 and X7 are highly correlated with ρ = 0.8, and (X8, X9, X10) ∼ N3(µ, Σ), where µ = (0, −1.6, 0.8) and Σ is a Toeplitz covariance matrix with a first-order autoregressive structure and ρ = 0.6. In Table 2, we give the estimates and hypothesis testing results from the different methods.
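Before turning to the results, the following R sketch generates a design of this kind; unit variances for X1, ..., X7 and the logistic draw of the response are assumptions, and the true coefficient vector is the DGP column of Table 2.

## Sketch: simulated design for the logit example (N=50, K=10).
library(MASS)  # for mvrnorm
set.seed(6)
N <- 50
X15   <- matrix(rnorm(N * 5), N, 5)                         # X1..X5 independent
X67   <- mvrnorm(N, c(0, 0), matrix(c(1, 0.8, 0.8, 1), 2))  # rho = 0.8
Sigma <- 0.6^abs(outer(1:3, 1:3, "-"))                      # Toeplitz AR(1), rho = 0.6
X810  <- mvrnorm(N, c(0, -1.6, 0.8), Sigma)
X     <- cbind(X15, X67, X810)

beta.true <- c(1.4, -1.7, 0, 0, -0.1, 0, -1, 0, -1.6, 0.8)  # DGP column of Table 2
y <- rbinom(N, 1, plogis(X %*% beta.true))                  # dichotomous response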
For the logit S-S model, we use four different prior setups, chosen after several trials of tuning to find the reasonable range of prior values. The priors are reported in Table 2.

Table 2: Logit Binary Model (N=50, K=10)

                   Spike-Slab I         Spike-Slab II        Spike-Slab III
Coef.     DGP     Mean    Pr(β=0)      Mean    Pr(β=0)      Mean    Pr(β=0)
β1       1.400    3.071    0.000†      2.918    0.000†      2.906    0.000†
β2      −1.700   −3.269    0.000†     −2.262    0.000†     −3.049    0.000†
β3       0.000    0.000    1.000       0.000    1.000       0.000    1.000
β4       0.000    0.000    1.000       0.000    1.000       0.000    1.000
β5      −0.100   −0.163    0.793      −0.002    0.985      −0.093    0.565
β6       0.000    1.286    0.235       0.002    0.975       1.107    0.306
β7      −1.000   −1.809    0.000†      0.000    1.000      −1.617    0.000†
β8       0.000    0.000    1.000       0.000    1.000      −0.102    0.825
β9      −1.600   −1.339    0.000†     −1.017    0.088†     −1.854    0.000†
β10      0.800    1.270    0.000†      1.611    0.000†      0.916    0.000†

                   Spike-Slab IV       Diffuse Bayesian          MLE
Coef.     DGP     Mean    Pr(β=0)      Mean    Pr(β>0)     Estimate   p-value
β1       1.400    3.302    0.000†      2.916    0.008†      4.860     1.000
β2      −1.700   −3.376    0.000†     −3.339    0.043†     −5.874     0.000†
β3       0.000    0.000    1.000       0.279    0.682       0.607     0.759
β4       0.000    0.111    0.736       0.223    0.792       0.257     0.593
β5      −0.100   −0.062    0.824      −1.300    0.235      −2.262     0.038†
β6       0.000    2.049    0.000†      2.418    0.055†      3.991     0.000†
β7      −1.000   −2.631    0.000†     −2.658    0.048†     −4.407     0.999
β8       0.000    0.000    1.000      −1.184    0.289      −1.831     0.078†
β9      −1.600   −2.412    0.000†     −2.303    0.190      −3.895     0.026†
β10      0.800    0.262    0.816       1.340    0.227       2.141     0.942

Prior setups in the GLSS models: P1: ν0 = 5E-09, a1 = 10, a2 = 500, c1 = 1, c2 = 1; P2: ν0 = 5E-09, a1 = 10, a2 = 300, c1 = 1, c2 = 1; P3: ν0 = 5E-09, a1 = 10, a2 = 500, c1 = 8, c2 = 1; P4: ν0 = 5E-09, a1 = 13, a2 = 500, c1 = 1, c2 = 8.
"†" highlights the covariates that are identified as statistically reliable (in the two-sided tests) or as having a certain direction of effect (in the one-sided test) on the response variable with the effect size of 0.10.

In Table 2, the point estimates based on the different methods do not recover the true values, given that the sample size is so small and dichotomous responses are likely to be uninformative. However, in terms of hypothesis testing on individual coefficients, the logit S-S model generally performs very well, and better than the other two competing methods. For β1 to β5, whose associated covariates are independent, the S-S model with varying prior setups correctly identifies the zero and non-zero coefficients of the first four. As for X5, the S-S model is uncertain about whether it has an effect, given that its true effect is small (−0.1); the S-S Models I and III, the two models with smaller shrinkage effects, give higher probability that the covariate has some effect. The one-sided Bayesian testing based on the logit Bayesian model with diffuse priors on β also works reasonably well, giving more than 2/3 odds for β5 ≤ 0 and a probability of 0.24 for β5 > 0. The NHST based on the ML estimates identifies H0: β5 = 0 as rejected with effect size 0.05, but incorrectly fails to reject H0: β1 = 0.

For the correlated covariates, we find that collinearity does have an effect on hypothesis testing for all of these methods, with relatively smaller impact on the logit S-S model. First, the NHST completely fails to tell which variables actually have effects and which appear to only because of the collinearity. It incorrectly rejects H0: β6 = 0 but fails to reject H0: β7 = 0. It also mistakenly suggests that X10 is not a statistically reliable covariate while X8 is. The Bayesian one-sided tests do not work well either, giving ambiguous testing results for the last three coefficients and incorrectly identifying β6 as a negative parameter with high certainty. The performance of the logit S-S model varies with different prior setups, but is generally better than the former two. The S-S Model I correctly identifies the non-zero coefficients. Although it assigns a large probability to β6 ≠ 0, it still admits a nontrivial probability that it is a zero coefficient. Model II has a bigger shrinkage effect since the inverse gamma distribution for the variance parameter is more concentrated, with location closer to 0. It shrinks both β6 and
β1 β2 β3 β4 β5 β6 β7 β8 β9 β10 DGP 1.400 −1.700 0.000 0.000 −0.100 0.000 −1.000 0.000 −1.600 0.800 DGP 1.400 −1.700 0.000 0.000 −0.100 0.000 −1.000 0.000 −1.600 0.800 Spike-Slab I Spike-Slab II Mean P r(β = 0) Mean P r(β = 0) 3.071 −3.269 0.000 0.000 −0.163 1.286 −1.809 0.000 −1.339 1.270 0.000† 0.000† 1.000 1.000 0.793 0.235 0.000† 1.000 0.000† 0.000† 2.918 −2.262 0.000 0.000 −0.002 0.002 0.000 0.000 −1.017 1.611 0.000† 0.000† 1.000 1.000 0.985 0.975 1.000 1.000 0.088† 0.000† Spike-Slab IV Diffuse Bayesian Mean P r(β = 0) Mean P r(β > 0) 3.302 −3.376 0.000 0.111 −0.062 2.049 −2.631 0.000 −2.412 0.262 0.000† 0.000† 1.000 0.736 0.824 0.000† 0.000† 1.000 0.000† 0.816 2.916 −3.339 0.279 0.223 −1.300 2.418 −2.658 −1.184 −2.303 1.340 0.008† 0.043† 0.682 0.792 0.235 0.055† 0.048† 0.289 0.190 0.227 Spike-Slab III Mean P r(β = 0) 0.000† 0.000† 1.000 1.000 0.565 0.306 0.000† 0.825 0.000† 0.000† 2.906 −3.049 0.000 0.000 −0.093 1.107 −1.617 −0.102 −1.854 0.916 MLE Estimate 4.860 −5.874 0.607 0.257 −2.262 3.991 −4.407 −1.831 −3.895 2.141 p value 1.000 0.000† 0.759 0.593 0.038† 0.000† 0.999 0.078† 0.026† 0.942 Prior setups in the GLSS models: P1 : ν0 = 5E-09, a1 = 10, a2 = 500, c1 = 1, c2 = 1, P2 : ν0 = 5E-09, a1 = 10, a2 = 300, c1 = 1, c2 = 1, P3 : ν0 = 5E-09, a1 = 10, a2 = 500, c1 = 8, c2 = 1, and P4 : ν0 = 5E-09, a1 = 13, a2 = 500, c1 = 1, c2 = 8 “†” highlights the covariates that are identified as statistically reliable (in the two-sided tests) or have a certain direction of effect (in the one-sided test) on the response variable with the effect size of 0.10. 22 β7 into zero almost all the time. However, the two non-zero coefficients, β9 and β10 do survive. Model III has the lest shrinkage effect with the same prior specification as in Model I except w has a prior distribution biased towards the slab part. All non-zero coefficients among the five survive in this model, and β8 is identifies as more likely to be zero and β5 more likely to be non-zero. Model IV has bigger shrinkage than Model I with w biased towards the spike part, but this additional shrinkage seems to concentrate on β10 , but β6 which is the true zero coefficient survives. This example highlights the complication in hypothesis testing with collinearity, and the spike and slab model tends to behavior better. In term of the simulation algorithm, the adaptive rejection algorithm works smoothly and numerically stable. No tuning required in this sampler, and the whole process is mechanical. The Markov chain mixes well as long as the priors of the three hyperparameters are within the reasonable region which needs to be found by preliminary trials and varies from application to application. 4.3 Log Linear Model: Poission Link For the poisson spike and slab regression, we can use the adaptive rejection approach to sample the coefficients β which requires very little tuning. This is because with a normal prior, the conditional posterior of each βk has a log-concave shape and this means that the adaptive rejection algorithm is easy to implement and works efficiently (Gilks & Wild 1992). To demonstrate it, we show that with the independent spike and slab prior βk ∼ N (0, γk2 ), the conditional posterior of β in the poisson model can be written as follows !! N N K K K X X X 1 2 XX xik yi βk − exp xik βk , β + π(β|y, γ) ∝ exp − 2 k γ k i=1 k=1 i=1 k=1 k=1 and the conditional distribution of βk then is π(βk |y, β −k , γ) ∝ exp − 1 2 β + xik yi β k − γk2 k N X !! 
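As an illustration, a minimal sketch of the two ingredients such a sampler needs, the log conditional posterior above and its first derivative; the object names (X, y, beta, gam2) are hypothetical:

```r
# Sketch: the log conditional posterior of beta_k in the Poisson model and
# its derivative, per the two expressions above. X is the N x K design, y the
# counts, beta the current coefficient values, and gam2 the current
# spike-or-slab variance gamma_k^2 (all names hypothetical).
log_post_k <- function(bk, k, beta, X, y, gam2) {
  base <- drop(X[, -k, drop = FALSE] %*% beta[-k])
  sapply(bk, function(b)
    -b^2 / (2 * gam2) + sum(X[, k] * y) * b - sum(exp(base + X[, k] * b)))
}
dlog_post_k <- function(bk, k, beta, X, y, gam2) {
  base <- drop(X[, -k, drop = FALSE] %*% beta[-k])
  sapply(bk, function(b)
    -b / gam2 + sum(X[, k] * y) - sum(X[, k] * exp(base + X[, k] * b)))
}
# A single update could then be drawn with an ARS routine, for example the
# CRAN package 'ars' (if available):
#   beta[k] <- ars::ars(1, log_post_k, dlog_post_k, x = c(-2, 0, 2),
#                       k = k, beta = beta, X = X, y = y, gam2 = gam2)
```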
Table 3: Poisson Model (N=500, K=10)

Coef.   DGP      Spike-Slab         Diffuse Bayesian    MLE
                 Mean    Pr(β=0)    Mean     Pr(β>0)    Estimate   p value
β1      0.600     0.580  0.000†      0.577   1.000†      0.640     0.000†
β2     −0.200    −0.216  0.000†     −0.217   0.000†     −0.240     0.000†
β3      0.450     0.485  0.000†      0.473   1.000†      0.505     0.000†
β4     −0.350    −0.358  0.000†     −0.352   0.000†     −0.371     0.000†
β5      0.250     0.240  0.000†      0.247   1.000†      0.261     0.000†
β6      0.000     0.001  0.983       0.061   0.928†      0.053     0.151
β7      0.000     0.003  0.960       0.066   0.939†      0.065     0.088†
β8      0.000     0.000  0.995      −0.034   0.208      −0.072     0.049†
β9      0.000     0.001  0.982       0.066   0.951†      0.077     0.027†
β10     0.000     0.000  0.991       0.062   0.922†      0.071     0.074†

"†" highlights the covariates that are identified as statistically reliable (in the two-sided tests) or as having a certain direction of effect (in the one-sided test) on the response variable, at the 0.10 level.

4.3.1 Example

The adaptive rejection algorithm stated above is very straightforward to implement, and the process is automatic. Unlike the Metropolis-Hastings algorithm, no tuning is needed for sampling the coefficients β, which leaves the tuning effort focused on the prior setup of the hyperparameters ν0, ηk, and w. Even more conveniently, the BUGS software uses exactly this algorithm, the adaptive rejection sampler, for Poisson models. This means that the Poisson S-S model can be estimated with BUGS, since the other parameters in the model have standard conditional distributions. The JAGS code for implementing the Poisson model is given in Appendix C.
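The Appendix C code is authoritative; purely as an illustration, a JAGS model along these lines might look as follows under one common parameterization of the prior hierarchy (spike variance ν0, an inverse-gamma slab with parameters a1 and a2, and a Beta(c1, c2) prior on the slab weight w), wrapped as an R string for rjags:

```r
# Illustration only; Appendix C is authoritative. Note that JAGS's dnorm
# takes a precision, not a variance. The data names N, K, X, y and the
# hyperparameters nu0, a1, a2, c1, c2 are assumed supplied in the data list.
library(rjags)
ss_pois <- "
model {
  for (i in 1:N) {
    y[i] ~ dpois(exp(inprod(X[i,], beta)))
  }
  for (k in 1:K) {
    prec.beta[k] <- 1 / gamma2[k]
    beta[k]   ~ dnorm(0, prec.beta[k])
    gamma2[k] <- delta[k] / tau[k] + (1 - delta[k]) * nu0  # slab or spike
    delta[k]  ~ dbern(w)          # 1 = slab, 0 = spike
    tau[k]    ~ dgamma(a1, a2)    # so 1/tau[k] ~ inverse-gamma(a1, a2)
  }
  w ~ dbeta(c1, c2)
}"
# jags <- jags.model(textConnection(ss_pois),
#                    data = list(y = y, X = X, N = N, K = K, nu0 = 5e-9,
#                                a1 = 10, a2 = 500, c1 = 1, c2 = 1))
```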
Here we use a Monte Carlo study to illustrate the performance of the algorithm. In this study, we also want to assess the performance of the S-S model with a large sample size. As stated above, the NHST has been criticized for being biased against the null when the sample size is large; in other words, with a large sample size, every coefficient tends to be "significant." We simulate 500 counts with a data generation process containing 10 covariates, five of which are non-zero while the other half are zero. The covariates are not correlated with each other. A sample size of 500 is not very large, but count data are usually more informative than other types of discrete data. We estimate the Poisson S-S model, the Bayesian Poisson model with diffuse priors, and the Poisson GLM with the maximum likelihood estimator. With the sample size of 500, mild variation of the prior setups for the Poisson S-S model does not make much difference to the hypothesis testing results; therefore, we report only one S-S model in Table 3.

All three models in Table 3 recover the true parameter values very well, which means that the information in the data is sufficient for identifying the parameters. What is interesting in this example is the hypothesis tests on the zero coefficients. In terms of point estimates, all three models find that the scale of the effects of the covariates associated with zero coefficients is very small, with the S-S model having the smallest scales for all of the zero coefficients. But for hypothesis testing, the NHST finds that four out of the five zero coefficients are statistically different from zero. Surprisingly, the Bayesian one-sided tests also suggest that most of the zero coefficients are positive with high certainty, and that one of them is likely to be negative with probability 0.7. The Poisson S-S model correctly identifies the zero coefficients: at a test size of 0.05, the null hypotheses that they are zero coefficients are all accepted.

5 Empirical Examples

In this section, we use two empirical political studies to illustrate how the GLSS model can be applied as a useful tool. The first example shows that with a very small sample size and a small ratio of observations to covariates, the GLSS model can help to draw statistical inferences (hypothesis testing) when both the maximum likelihood and ordinary Bayesian approaches fail. It also serves as a model selection tool by choosing a very small number of the most promising variables. We show that for these promising parameters, the single models and the BMA estimates yield similar results when the number of submodels is small and the submodels are similar in terms of variable selection. The second example shows how the GLSS model can be applied to variable selection and forecasting problems when there are a large number of potential covariates, many of which are supported by theories or measure similar factors. The GLSS model is much more efficient and theory-driven than traditional methods of variable selection, such as forward and backward stepwise selection. Forecasting based on the GLSS model is BMA over many submodels, weighted by their posterior probabilities, which takes into account model uncertainty (uncertainty about the true DGP(s)) and has lower risk in out-of-sample prediction.

5.1 Global Energy Security and IEA Cooperation

In this part, we use data gathered from the historical database of the International Energy Agency (IEA) to assess the factors often suggested to affect global energy security. The outcome variable is dichotomous, with an oil-supply disruption occurring in a given year coded as 1 and 0 otherwise. The sample years are from 1984 to 2004, and 21 observations are included in the dataset. There are 16 covariates, including variables on IEA energy cooperation, international politics, the demand-supply relationship in the global oil market, and natural disasters. In terms of the sample size and the number of observations per coefficient, this example is very extreme, and the maximum likelihood logit model should not be applied. Even with the Bayesian approach, this dataset should be expected to yield very diffuse posteriors (if diffuse priors are used). The GLSS model can work better in this case because it dynamically excludes variables from the final model (if a single model is to be chosen) based on the information in the data.

5.1.1 Step 1: Determine Variable Inclusion Probabilities

In Table 4, we report the estimation results from three models. The ML logit model collapsed in the process of mode-finding, and the numbers reported in the table are not meaningful. We do not give the standard errors, since they are huge and useless for inferential purposes. The Bayesian logistic model with diffuse normal priors was estimated using the auxiliary variable approach stated above, because the MH algorithms with independent or random walk proposals all failed.
As expected, we observe little Bayesian shrinkage in the posterior distributions: the priors for the coefficients are set as N(0, 20²), and the posterior standard deviations are in the range of 7 to 17. The one-sided tests suggest that 5 of the 16 covariates have a positive or negative effect with probability greater than 0.90. The logistic S-S model shrank 9 of the 16 covariates to zero, and they were excluded from the model almost all the time. The coefficient posteriors reported in Table 4 are the weighted averages of 6 submodels with different sets of covariates included, as presented in Table 5. The scales of the posterior means and standard deviations are very different from those of the logit model with diffuse normal priors, because the logit S-S model is based on a group of submodels very different from the "whole" model.

5.1.2 Step 2: Hypothesis Testing

For standard hypothesis testing, the GLSS model works even with small sample sizes and radically large or small ratios of observations to covariates. This is because the test is based on the information given in the data (data are treated as fixed in all Bayesian analysis), no hypothetical or augmented data are involved, and no asymptotic principles are required. Bayesian one-sided tests work well here, but they obviously require a priori directional knowledge (a mistakenly specified direction in a one-sided test is called a Type III error). The one-sided variant does not help with variable selection, however, since evidence about the sign, positive or negative, is different from evidence that a coefficient is statistically distinct from zero. Note that in general the coefficient estimates differ considerably between the one-sided and two-sided results. This demonstrates the effect of spike and slab priors versus highly diffuse forms, and the feature that the GLSS model does not exclude the probability that a coefficient can be zero, which can happen even when the data and model give the coefficient a large magnitude (e.g., world.nuclear.energy). Finally, the maximum likelihood based NHST approach, which relies on asymptotic normality, completely fails here, as shown by the p-values and the nonsensical coefficient estimates.

Table 4: Energy Security: Coefficient Effect Testing

No.  Covariate                 Logit S-S Model              Logit Bayesian Model                       Logit ML Model
                               Mean     SD     p(β=0|D)     Mean      SD      p(β≤0|D)   p(β>0|D)     Estimate    p(>|z|)
1    R & D                     −1.403   1.808  0.525        −23.169    7.432  1.000†     0.000†        −48.830    1.000
2    # IEA.members              0.991   1.708  0.516          1.620   12.354  0.444      0.556         −34.300    1.000
3    nuclear.energy             0.000   0.000  1.000        −12.246   10.924  0.861      0.139         −32.086    1.000
4    natural.gas               −3.416   2.086  0.000†       −25.080    9.256  1.000†     0.000†        −22.066    1.000
5    oil.stock                  0.520   1.116  0.507         10.614   12.385  0.205      0.795          42.174    1.000
6    oil.stock(t-1)             0.000   0.000  1.000          6.605    9.245  0.260      0.741          38.439    1.000
7    natural.disaster           0.000   0.000  1.000         32.992   11.906  0.001†     0.999†        −77.699    1.000
8    intl' crisis              −0.719   1.422  0.610        −12.002   14.563  0.752      0.248         −34.781    1.000
9    gulf.region.crisis         0.000   0.000  1.000         −4.307    7.396  0.694      0.306           5.121    1.000
10   OPEC.market.share          0.000   0.000  1.000         −1.020   16.267  0.533      0.467        −426.434    1.000
11   OPEC.market.share(t-1)     0.000   0.000  1.000         −2.111   16.411  0.546      0.454         415.408    1.000
12   world.nuclear.energy      −7.629   2.993  0.000†       −27.872    8.306  0.998†     0.002†        −98.294    1.000
13   world.natural.gas          3.547   2.203  0.000†        28.566   11.061  0.000†     1.000†         28.976    1.000
14   oil.price                  0.000   0.000  1.000          2.726   15.447  0.454      0.546          37.519    1.000
15   oil.price(t-1)             0.004   0.354  0.973         −1.732   15.934  0.545      0.455         −53.869    1.000
16   demand.china.india         0.000   0.000  1.000          4.335   14.958  0.394      0.606          85.595    1.000

"†" highlights the covariates that are identified as statistically reliable (in the two-sided tests) or as having a certain direction of effect (in the one-sided test) on the response variable, at the 0.10 level.

Table 5: Model and Variable Selection (Energy Security)

Submodel   Prob.     #Covariates
M1         0.5065    3
M2         0.3629    7
M3         0.0857    6
M4         0.0269    8
M5         0.0100    4
M6         0.0082    5

The sets of covariates included in the submodels are drawn from {X1, X2, X4, X5, X8, X12, X13, X15}.

5.1.3 Step 3: Model Selection

To illustrate the GLSS method as a model/variable selection tool, we estimate the "best" model again, using the Bayesian logit model with diffuse priors and the standard maximum likelihood approach, and compare the estimates with the posteriors of those covariates under the GLSS model in Table 4. The "best" model is defined as the model with the largest posterior probability. As shown in Table 5, the best model has a posterior weight of 0.51 and includes three variables. The GLSS model consists of this "best" model, with weight 0.51, and five other submodels, and the three variables in the best model also appear in all of the other five submodels. Therefore, the posteriors of these three variables are the weighted averages over all six models:

$$p(\beta_k \mid Y) = 0.5\,p(\beta_k \mid M_1, Y) + 0.36\,p(\beta_k \mid M_2, Y) + 0.09\,p(\beta_k \mid M_3, Y) + 0.01\,p(\beta_k \mid M_4, Y) + 0.01\,p(\beta_k \mid M_5, Y) + 0.01\,p(\beta_k \mid M_6, Y), \qquad (32)$$

and will have greater uncertainty than the estimates from the single model. However, since the posterior probabilities concentrate on two submodels, the BMA estimates should not be very different from those in the single model.
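Computationally, equation (32) is just a finite mixture of the submodel posteriors. A minimal sketch, assuming a hypothetical list `draws` of posterior samples of βk from each submodel:

```r
# Minimal sketch of the BMA mixture in equation (32): draws of beta_k from
# each submodel are pooled in proportion to the submodel weights of Table 5.
# 'draws' (a list of numeric vectors, one per submodel) is hypothetical.
w <- c(0.5065, 0.3629, 0.0857, 0.0269, 0.0100, 0.0082)
bma_draws <- function(draws, w, n = 10000) {
  m <- sample(seq_along(w), n, replace = TRUE, prob = w)
  vapply(m, function(j) sample(draws[[j]], 1), numeric(1))
}
# mean() and sd() of bma_draws(draws, w) would then approximate the BMA
# posterior mean and standard deviation of the kind reported in Table 4.
```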
The estimates and standard errors based on the single model are reported in Table 6 and compared with the BMA posteriors. By reducing the number of covariates and increasing the ratio of observations to covariates, the ML model works this time, producing sensible results. The same is true of the Bayesian logit model: its scales are very different from those in the previous estimation. The estimates and standard errors are similar across the three models.

Table 6: Energy Security: Selected Model

                         Logit ML Model      Logit Bayesian Model    Logit S-S Model
Covariate                Estimate   SD       Estimate   SD           Estimate   SD
natural.gas              −2.896     3.267    −3.427     1.867        −3.423     2.091
world.nuclear.energy     −8.092     4.418    −8.236     2.146        −7.638     2.994
world.natural.gas         3.378     2.234     4.264     2.197         3.551     2.207

Note that the results of the GLSS model are the BMA posteriors taken directly from Table 4, and they generally have larger standard errors than the other two. This is because the BMA (GLSS) posteriors account not only for the uncertainty of the parameters but also for that of the models: six data generation processes, instead of only one, are considered in the GLSS model.

Based on the hypothesis tests on the individual coefficients and submodels, this example suggests that, except for the policy of energy substitution, the international energy cooperation of the International Energy Agency has not contributed to global energy security. The cooperation may have provided joint goods to its member states, but there is not much spillover to the whole world. Outside the IEA, the popularity of nuclear energy is associated with a lower risk of oil supply disruption in the global market, but the consumption of natural gas is negatively associated with global energy security.

As a simple example, the models here do not attempt to identify causality. The negative association between increases in natural gas demand and oil supply disruption is easily explained by the fact that natural gas is the second most important source of energy in today's world and an important substitute for oil: when the oil supply decreases, consumers have to resort to more natural gas for energy. This endogeneity problem should be handled in further causal studies.

The comparison between the GLSS posteriors and the single model estimates underlines an important fact: although the GLSS model looks like a "kitchen-sink" model, with all the available covariates thrown in, the posteriors it produces can be used directly for statistical inference, and in terms of reducing inferential risk it is actually superior to single models containing only selected variables. This is because the GLSS model automatically selects the most relevant models without discarding the information in the other variables, and it provides BMA posteriors. Importantly, it shrinks the statistically unreliable coefficients based on the information in the data and averages across all relevant models to draw inferences. Therefore, the GLSS model uses the information in the data more efficiently than the Bayesian logit model and produces better results for hypothesis testing and statistical inference.

5.2 State Failures: HT, SRVS, and BMA

Since 1994, the Political Instability Task Force (PITF, formerly the State Failure Task Force), a group of researchers from multiple fields sponsored by the U.S. government, has been working on a large project with the goal of understanding and forecasting failed states. Up to now, they have produced four comprehensive reports, Phase I Findings to Phase IV Findings. They made a great effort to build a warning system for state failure based on statistical analysis of a huge collection of data. This database includes all independent states with populations of at least 500,000 from 1955 to 1998. Based on the released complete dataset (2001), there are 8,580 observations (country-years) and about 1,200 candidate explanatory variables. A careful look at these covariates shows that many variables have multiple measures, forming multiple covariates. With such a huge set of covariates, many with similar meanings, the task of model/variable selection requires a great deal of work. The PITF team first used prior information, based on theories of state failure and knowledge of the measurements of the variables, to narrow the set of more than 1,200 variables down to about 50 covariates; they then used single-variable tests to find promising variables, and then applied stepwise logistic regression with both forward and backward selection approaches. The final global model they selected is very simple, retaining only three variables: democracy, trade openness, and infant mortality. This very parsimonious specification is used for forecasting, and they claim a rate of correct prediction of about 70% to 80% using naïve criteria. Their treatment of missing data, strategy for statistical forecasting, and assessment of forecasting success have been extensively criticized by King & Zeng (2001), who also suggested alternatives. The PITF team's Phase III Findings include both global models and models for specific geographic regions and types of failure (the Sub-Saharan Africa model, the Muslim Countries model, and the Ethnic War model).
For this example we select the Sub-Saharan region to demonstrate how to efficiently conduct model/variable selection and statistical forecasting based on Bayesian model averaging (BMA) with the GLSS model. Following both the PITF team's and King & Zeng's approach, we adopt the case-control method for this rare-events study: 44 cases of state failure are observed in the sub-dataset, and 120 control cases are randomly drawn from the pool of Sub-Saharan countries between 1955 and 1998. We do not actually perform this random sampling; instead, we reuse the control cases from the Task Force's original study for a clean comparison. To narrow down the set of 1,200 variables, the Task Force used a considerable amount of prior substantive information about nations and politics and selected 43 variables at the first stage of variable selection. They also reported the explanatory variables tested by their univariate t-tests and stepwise regression. They categorized the variables considered as candidates for the final model into three groups: (1) political and leadership (17 variables), (2) demographic and societal (13 variables), and (3) economic and environmental (13 variables). We basically use those 43 variables in this example, but two of the remaining variables were not fully identified in the documentation, and dynamic changes of several variables are also of interest, so the resulting number of variables available for use here is 48.

Unfortunately, the amount of missing data is quite large, and the PITF analysis uses casewise deletion to handle this problem. This literally prevents them from doing model selection, since by including or excluding certain variables the observations are not the same and the models are not comparable, regardless of the criterion or method used. Accordingly, we used multiple imputation to fill in the missing data and apply the logistic S-S model to each of the completed data sets that result. Note that in case-control studies the logit model is consistent in terms of estimating relative rates, since its coefficients have the nice interpretation of changes in the log odds ratio. The probit model should not be used in this case, although in other cases the choice between the probit and logit models can be arbitrary.

5.2.1 Step 1: Determine Variable Inclusion Probabilities

The specific variable selection method used by the Task Force was the stepwise forward and backward selection approach, applied to the three clusters of covariates mentioned above, which ended up with a model containing 7 explanatory variables. The criteria they adopted were that, in the forward selection approach, the process ended when "further additions no longer improved the model's overall fit with the data beyond a given threshold," and in the backward selection approach, the selected variables remained in the model when "further deletions significantly impaired the model's overall fit." Beyond the skepticism about their model comparison due to the casewise deletion of missing values, the number of models they would have to estimate and compare is as huge as

$$\sum_{i=1}^{17}\sum_{j=1}^{13}\sum_{k=1}^{13}\binom{17}{i}\binom{13}{j}\binom{13}{k} \approx 8.79 \times 10^{12}.$$

It is impossible for the Task Force to have checked all of these combinations of variables, but they reported that they spent years on variable and model selection. Since their process of variable selection is accompanied by hypothesis testing on the individual coefficients, it is better than univariate t-tests, since collinearity is taken into account.
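The count above follows because every nonempty subset of variables can be chosen from each of the three clusters; a quick check in R:

```r
# Every nonempty subset from each of the three clusters of 17, 13, and 13
# candidate variables:
(2^17 - 1) * (2^13 - 1)^2   # approximately 8.79e12
```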
We use the logit S-S model to simultaneously conduct hypothesis tests on the 48 individual coefficients for the variable selection purpose. We also compare the logit S-S model with the two alternative hypothesis testing methods. Note that, for variable selection, the standard maximum likelihood based NHST approach is not appropriate for this "kitchen-sink" model, since p-values are not the probability that a coefficient is non-zero, meaning that they provide no information about the inclusion of "non-significant" variables. In addition, the sample size relative to the number of explanatory variables (3.375) precludes using the NHST. The Bayesian one-sided tests are relevant for testing the directions of effects, but do not help variable selection. In Tables 7 and 8, we report the results from all three types of testing. To ease communication, we also report both the probabilities p̂(β > 0|D) and p̂(β ≤ 0|D) based on the Bayesian logit model using diffuse priors for β.

Table 7: Hypothesis Testing of Individual Covariates

No.  Variable                                        Spike-Slab   Diffuse Bayesian        MLE
                                                     p̂(β=0|D)    p̂(β>0|D)  p̂(β≤0|D)    p value

Variables in PITF's final model
1    Trade Openness                                  0.000†       0.257      0.743       0.580
2    Total Population                                1.000        0.226      0.774       0.697
3    Regime Type (score)                             0.000†       1.000†     0.000†      0.009†
4    Regime Type (dichotomous)                       0.000†       1.000†     0.000†      0.000†
5    Colonial Heritage (British)                     1.000        0.843      0.157       0.374
6    Colonial Heritage (French)                      0.865        0.222      0.778       0.510
7    Discrimination                                  0.560        0.997†     0.003†      0.032†
8    Leader's Tenure                                 1.000        0.275      0.725       0.518

Cluster I: Political and Leadership
9    Change in democracy                             0.000†       0.996†     0.004†      0.032†
10   Economic discrimination                         1.000        0.649      0.351       0.684
11   Political discrimination                        0.570        0.192      0.808       0.374
12   Separatist activity                             0.853        0.566      0.434       0.731
13   Party fractionalization                         0.192        0.374      0.626       0.998
14   Parliamentary responsibility                    1.000        0.272      0.728       0.568
15   Party legitimacy                                0.000†       0.941†     0.059†      0.166
16   Ruling elite's class character                  0.433        0.845      0.155       0.443
17   Ruling elite's ideology                         1.000        0.312      0.688       0.850
18   Regime duration                                 0.000†       0.999†     0.001†      0.014†
19   Freedom House political rights index            0.632        0.850      0.150       0.404
20   Freedom House civil liberties index             0.399        0.825      0.175       0.455
21   Amnesty International political terror scale    0.420        0.986†     0.014†      0.054†
22   US Department of State political terror index   0.361        0.067†     0.933†      0.131
23   Neighbors in major armed conflict               0.921        0.311      0.689       0.853
24   Neighbors in major civil/ethnic conflict        0.942        0.723      0.277       0.771
25   Membership in regional organizations            0.777        0.110      0.890       0.280

"†" highlights the covariates that are identified as statistically reliable (in the two-sided tests) or as having a certain direction of effect (in the one-sided test) on the response variable, at the 0.10 level. We report the variables not in the PITF final model within their clusters as defined by the PITF.
Table 8: Hypothesis Testing of Individual Covariates (Cont.)

No.  Variable                                        Spike-Slab   Diffuse Bayesian        MLE
                                                     p̂(β=0|D)    p̂(β>0|D)  p̂(β≤0|D)    p value

Cluster II: Demographic and Societal
26   Youth bulge                                     1.000        0.467      0.533       0.747
27   Labor force as a percent of population          1.000        0.030†     0.970†      0.125
28   Infant mortality                                1.000        0.266      0.734       0.738
29   Annual change in infant mortality               0.715        0.096†     0.904†      0.286
30   Life expectancy                                 1.000        0.256      0.744       0.678
31   Secondary school enrollment ratio               0.899        0.968†     0.032†      0.176
32   Change in secondary school enrollment ratio     1.000        0.130      0.870       0.451
33   Calories per capita                             0.000†       0.007†     0.993†      0.041†
34   Urban population                                1.000        0.871      0.130       0.570
35   Urban population growth rate                    0.854        0.305      0.695       0.742
36   Ethno-linguistic fractionalization              0.747        0.052†     0.948†      0.307
37   Trading partner concentration                   1.000        0.239      0.762       0.725

Cluster III: Economic and Environmental
38   GDP per capita                                  0.000†       0.101      0.899       0.496
39   Change in GDP per capita                        0.000†       0.038†     0.962†      0.257
40   Change in reserves                              1.000        0.300      0.700       0.590
41   Government debt                                 0.425        0.782      0.218       0.520
42   Trade with OECD countries                       0.000†       0.996†     0.004†      0.091†
43   Annual change in inflation rate                 0.438        0.974†     0.026†      0.085†
44   Cropland area                                   0.047†       0.190      0.810       0.467
45   Irrigated land                                  0.168        0.988†     0.012†      0.146
46   Access to safe water                            0.850        0.196      0.804       0.436
47   Damage due to drought                           0.659        0.088†     0.912†      0.329
48   Famine                                          0.667        0.179      0.821       0.435

Number of statistically reliable variables: Spike-Slab 11; Diffuse Bayesian 18 (p̂(β>0)) and 18 (p̂(β≤0)); MLE 9.

"†" highlights the covariates that are identified as statistically reliable (in the two-sided tests) or as having a certain direction of effect (in the one-sided test) on the response variable, at the 0.10 level. We report the variables not in the PITF final model within their clusters as defined by the PITF.

Note that, for the most part, low p-values are associated with low values of p̂(β = 0|D), even though this probability is not at all related to the convention in political science of doing sequential Wald tests down regression tables with an implied or stated 0.10 threshold. In addition, the one-sided tests find more variables with positive or negative effects than the number of non-zero coefficients found by the two-sided tests. The hypothesis tests on the individual coefficients based on all three models are dramatically different from the results based on the univariate t-tests used by the Task Force. This is likely due to the different ways of handling missing data (multiple imputation versus casewise deletion) and the consideration of collinearity. There are 11 variables found to be statistically reliable by the logit S-S model according to their marginal probabilities p̂(βk = 0|D). Note that most of the variables in the Task Force's final model were not insignificant in their own t-tests; yet only two of the variables found significant in their final model, Trade Openness and Regime Type, are also identified by the logit S-S model as statistically reliable.

5.2.2 Step 2: Determine Model Probabilities

The spike and slab model conducts model and variable selection simultaneously. Since in each MCMC iteration the covariates whose associated effect parameters are allocated to the spike part are erased from the whole model, each iteration effectively estimates a submodel, and the relative frequencies with which submodels are visited estimate the model posterior probabilities (a sketch of this computation follows). For this analysis of the Sub-Saharan state failure data, the logit S-S model identifies 25 submodels, which are presented in Table 9. In the table, we exclude only the 14 variables that never appeared in any submodel (refer to Tables 7 and 8 for the variable numbers).
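Concretely, the model posterior probabilities are just the relative frequencies of the inclusion configurations visited by the sampler. A minimal sketch, assuming a hypothetical matrix of saved indicators:

```r
# Sketch: model posterior probabilities from saved inclusion indicators.
# 'delta' is a hypothetical (iterations x 48) matrix of 0/1 indicators, with
# delta[t, k] = 1 when beta_k is allocated to the slab at iteration t.
config <- apply(delta, 1, paste, collapse = "")   # one string per submodel
probs  <- sort(table(config) / nrow(delta), decreasing = TRUE)
round(head(probs, 3), 4)   # the leading submodels, as in Table 9
```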
Table 9: Models with Posterior Model Probabilities Based on the S-S Logit Model

Model   Prob.     Model   Prob.     Model   Prob.     Model   Prob.     Model   Prob.
M1      0.1699    M6      0.0582    M11     0.0320    M16     0.0213    M21     0.0079
M2      0.1212    M7      0.0475    M12     0.0297    M17     0.0208    M22     0.0052
M3      0.1208    M8      0.0472    M13     0.0268    M18     0.0109    M23     0.0051
M4      0.0731    M9      0.0375    M14     0.0235    M19     0.0103    M24     0.0031
M5      0.0622    M10     0.0341    M15     0.0224    M20     0.0085    M25     0.0008

The 34 covariates appearing in at least one submodel are variables 1, 3, 4, 6, 7, 9, 11-13, 15, 16, 18-25, 29, 31, 33, 35, 36, 38, 39, and 41-48 of Tables 7 and 8.

There is no submodel with a posterior probability greater than 0.20, and among the 25 models, 22 have very small posterior probabilities. The first three submodels have posterior weights much larger than the others. The best model, M1, includes 13 covariates; the second best, 19 covariates; and the third best, 11 covariates; so they are not as parsimonious as the final model used by the Task Force. However, this sparseness is the most we can achieve by adjusting the shrinkage effect of the priors: further increases in shrinkage cause all of the covariates to be shrunk to zero probability in this particular case. This could be attributed to the trade-off between mixing and shrinkage, but it may also reflect that the underlying mechanism of state failure is not sparse and must be explained by many factors.

5.2.3 Step 3: Forecasting

The Task Force reported a very successful forecasting performance for their final model: "[o]ur Sub-Saharan Africa model correctly classified 80 percent of the historical cases from which it was estimated." However, the criterion they used to evaluate forecasting performance is not clear. In addition, no ex ante cutoff values for the classification of failures and non-failures were reported, and the threshold was very likely set post hoc, i.e., based on the prediction results. In fact, evaluating forecasting performance involves sophisticated decision theory and should depend on an assessment of the relative costliness of mispredicting failure and non-failure. The Task Force's misevaluation of forecasting performance has already been criticized by King & Zeng (2001). Here we do not want to readdress the methodological problems in the Task Force's forecast evaluation; instead, our interest centers on the comparison between BMA based on the GLSS model and the single best models in terms of forecasting. We first investigate the in-sample forecasts of BMA and the single models, and then evaluate the out-of-sample forecasting performance of those models by adopting the predictive log score and the risk classification proposed and applied by Hoeting et al. (1999). These evaluations suggest that BMA based on the GLSS model smooths forecasts and reduces the risk of misprediction by taking model uncertainty (DGP uncertainty) into account.

In Figure 3, the left and middle graphs present the in-sample predictive probabilities of BMA based on the logit S-S model and all 25 submodels identified in Table 9. These two graphs show clearly that BMA smooths the predictions by averaging the predicted probabilities of the submodels with their weights. To evaluate the performance of BMA and the three single models with the highest posterior probabilities, we use a Receiver Operating Characteristic (ROC) curve, which was also used by King & Zeng (2001) to compare their model with the global model of the Task Force in prediction. As they pointed out, the ROC curve is free of any specific threshold based on the costs of the two types of misclassification (failures and non-failures).
The model whose ROC curve lies above those of the other models is superior in forecasting, simply because the highest curve means that, given any level of correct classification of one type (say, failures), the model of that curve has a higher rate of correct classification of the other type (say, non-failures) than the models of the curves below. The right graph in Figure 3 demonstrates that the ROC curve of BMA dominates those of the three best single models, except in the short range of 0.45 to 0.55 of correct classification of failures, where it is slightly worse than single model I. BMA performs much better than the single models when the cost of misclassifying failures as non-failures is assessed to be high (accordingly, the threshold is set low, and the rate of correct classification of failures is high), which is reflected in the right part of the curves. This superiority of BMA is especially valuable to policy makers, since fewer resources, such as foreign aid sent to high-risk countries to prevent state failure, will be wasted by allocating them mistakenly to "safe" countries.

[Figure 3: Within-Sample Predictive Probabilities: BMA versus Single Models. Left panel: predictive probabilities of 1's (failures); middle panel: predictive probabilities of 0's (non-failures); right panel: ROC curves (proportion of 1's correctly classified against proportion of 0's correctly classified) for BMA and Submodels 1-3. In the left and middle panels, the predictive probabilities are plotted from "worst prediction" to "best prediction" for failures, and from "best prediction" to "worst prediction" for non-failures, both sorted according to the BMA forecasts.]

Then we compare the out-of-sample forecasting performance of BMA and the "best" single models. First, we compute the predictive log scores of BMA and the single models by using the formulas in equations (13) and (14). We first split the dataset into build and treat groups. Although it is not necessary to split the dataset equally, we randomly sample 81 observations out of the 162, with 22 cases of state failure in the build group and the rest in the treat group.²

² We do not split the sample according to years, such as constructing the build group with countries from 1955 to 1978 and the rest as the treat group. This is because the case-control study destroys the time dimension of the data (data are treated as cross-sectional). Also, no dynamics are modeled, and long-run forecasting does not make any sense. Nor do we use the country-years excluded by the resampling of the case-control study, because there is no state failure in those country-years.

We estimate the logit S-S model with the build group data. The logit S-S model identifies 27 submodels, with the two top submodels having much higher posterior probabilities than the others (0.2134 and 0.1983, respectively). We estimate these two single models with diffuse priors. We then approximate the predictive log scores by numerically integrating the parameters out and obtain the predictive probabilities of the observations in the treat group. The scores are reported in Table 10. A smaller log score indicates a better model, and the score of BMA is about 8 points better than those of the single models, or about 9.88% per observation. As stated by Hoeting et al. (1999), in practice it is more meaningful to conduct forecasts by classifying the subjects into discrete risk categories than by continuous numerical prediction. This is especially true for this state failure study, since preemptive policies can be calibrated to different categories but are difficult to adjust to continuous numerical predictive values. We follow the procedure of Hoeting et al. to conduct risk classification (a code sketch follows Table 10):
1. Estimate models with the build group data and obtain the estimates β̂ (interval estimates in this case);
2. Compute the risk score xi β̂ for each subject i;
3. Define low, medium, and high risk groups for the model by using empirical quantiles of the calculated risk scores;
4. Calculate the risk scores of the subjects in the treat group and classify them into the three groups;
5. Check the observed survival status (no failure) of the subjects assigned to the different groups.

A model is better if it assigns higher (lower) risks to the country-years that actually experienced a (non-)failure. We use three different choices of cutoff values for classification. The first choice uses the quantiles of 1/3 and 2/3, which is a naive choice; the second set of threshold risk values uses the 40% and 60% percentiles, on the consideration that the medium risk group may be less interesting to policy makers than the two extreme groups; and the third choice, the 20% and 40% percentiles, is based on the assumption that mispredicting state failure is costly, so it may be better to adopt policies that help more countries by setting the threshold of the high risk class low. We compare the BMA model with the best two single models and report the results in Table 10.

Table 10: Classification for Predictive Discrimination

Cutoff values, percentiles 33.33% and 66.67%:
Risk Group     BMA               M1                M2
               S    F    %F      S    F    %F      S    F    %F
Low           23    5   17.85%  21   11   34.38%  19    8   29.63%
Med           15    9   37.50%  15    3   16.67%  19    8   29.23%
High          21    8   27.58%  23    8   25.81%  21    6   22.22%

Cutoff values, percentiles 40.00% and 60.00%:
Risk Group     BMA               M1                M2
               S    F    %F      S    F    %F      S    F    %F
Low           25    7   21.87%  24   11   31.43%  21   11   34.38%
Med           11    6   35.29%   7    1   12.50%  13    4   42.86%
High          23    9   28.13%  28   10   26.32%  25    7   21.88%

Cutoff values, percentiles 20.00% and 40.00%:
Risk Group     BMA               M1                M2
               S    F    %F      S    F    %F      S    F    %F
Low           12    4   25.00%  16    9   36.00%  16    6   27.27%
Med           14    3   17.65%   8    2   20.00%   4    3   48.00%
High          33   15   31.25%  35   11   23.91%  39   13   25.00%

Predictive Log Score:   BMA 97.112    M1 104.270    M2 105.370

In the table, S denotes "Survival" and F "Failure". The two single models are the models with the highest posterior probabilities: p(M1|D) = 0.2134 and p(M2|D) = 0.1983.

Despite the different cutoff values for risk classification, BMA performs better than the single models: the country-years assigned to the high risk group by BMA have a higher state failure rate than those assigned high risk by the two single models, and the country-years classified as low risk by BMA have state failure rates lower than those assigned to the low risk group by the single models. All of the models assign to the medium risk group subjects that actually experienced a very high rate of state failure, sometimes a rate higher than that of the high risk group. This can be attributed to the small sample size of 81 in the build group; the models cannot be very precise in terms of both inference and forecasting. However, we only evaluate the relative performance here.
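A minimal sketch of the classification procedure above, under hypothetical object names (beta_hat from the build-group fit, X_build and X_treat the design matrices, y_treat the observed failures in the treat group):

```r
# Sketch of the risk-classification procedure (steps 2-5 above); all object
# names are hypothetical.
risk_class <- function(beta_hat, X_build, X_treat, y_treat,
                       cut_pts = c(1/3, 2/3)) {            # naive percentiles
  s_build <- drop(X_build %*% beta_hat)                    # step 2: risk scores
  cuts    <- quantile(s_build, cut_pts)                    # step 3: cutoffs
  s_treat <- drop(X_treat %*% beta_hat)                    # step 4
  grp <- cut(s_treat, c(-Inf, cuts, Inf),
             labels = c("Low", "Med", "High"))
  table(grp, y_treat)                                      # step 5
}
```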
It is not surprising that in this particular case BMA performs much better than the single models in out-of-sample forecasts but is very similar to, and only slightly better than, the best single models in within-sample prediction. The logit S-S model with all 48 covariates identifies 25 submodels, and three of them have much higher probabilities than the remaining 22. The top three models perform similarly in forecasting, and since BMA is mainly based on these three models, it provides similar within-sample predictive probabilities. In out-of-sample prediction, however, the DGP of the treat group is inferred only from the build group, and the most likely DGP for the latter is not necessarily the same as for the former, even though the two groups are selected randomly. With a sample size of 81 in the build group, the uncertainty about the true DGP suggested by the data is important and should not be ignored when drawing inferences and conducting forecasts. BMA takes model uncertainty into account and uses the information in the build group on all possible DGPs, and therefore performs better in forecasting than models that treat one of the DGPs for the build group as the only "true" DGP for the treat group. The single models are riskier because they ignore the uncertainty of the DGP(s), and this is especially true in out-of-sample forecasting, where the uncertainty about the DGP, inferred from different observations, is high.

6 Conclusion

By extending the Bayesian linear spike and slab model to nonlinear (qualitative) outcomes, which had not been accomplished to date, we specify mixture prior distributions for each of the coefficient parameters. The spike part of this prior helps detect coefficients that are actually of zero magnitude by shrinking those coefficients to zero in posterior terms, whereas the slab part provides the coefficients that are truly non-zero with a diffuse normal prior, thus not requiring extensively informed prior knowledge. This means that the model space and the parameter space are determined simultaneously through the estimation process. The primary contribution of this work is to exploit Bayesian prior specification and the engine of Bayesian stochastic simulation to incorporate variable inclusion decisions, model comparison, and hypothesis testing directly and simultaneously into the estimation process. This process is built on linear spike and slab strategies, whose authors were unaware of this additional inferential information, but is extended here to the sort of nonlinear outcomes regularly used in empirical political science. This setup leads directly to more reliable posterior predictions, since it is straightforward to Bayesianly average across models with high probability. This avoids the primary criticism of standard Bayesian model averaging, whereby the researcher subjectively picks the variables in the first set of models (Raftery 1995; Hoeting et al. 1999). In the Bayesian paradigm all unknown quantities are treated probabilistically, including alternative model specifications. Therefore the spike and slab prior approach to model choice requires a full commitment to the tenets of Bayesian inference, including informed priors, conditioning on the data, and full posterior description. We make the mechanics of this process straightforward with an R package (spike.slab). As the examples here show, additional information in the data can lead to more informed model choice and better posterior predictions.

7 References
Albert, J. H., and S. Chib. (1993). "Bayesian Analysis of Binary and Polychotomous Response Data." Journal of the American Statistical Association 88: 669-679.
Alston, C. L., K. Mengersen, J. M. Thompson, P. J. Littlefield, D. Perry, and A. J. Ball. (2004). "Statistical Analysis of Sheep CAT Scan Images Using a Bayesian Mixture Model." Australian Journal of Agricultural Research 55: 57-68.
Aitkin, M. (1991). "Posterior Bayes Factors." Journal of the Royal Statistical Society, Series B 53: 111-142.
Aitkin, M. (1997). "The Calibration of P-values, Posterior Bayes Factors and the AIC from the Posterior Distribution of the Likelihood (with discussion)." Statistics and Computing 7: 253-272.
Anderson, David R., Kenneth P. Burnham, and William L. Thompson. (2000). "Null Hypothesis Testing: Problems, Prevalence, and an Alternative." Journal of Wildlife Management 64: 912-923.
Barbieri, M., and J. Berger. (2004). "Optimal Predictive Model Selection." Annals of Statistics 32: 870-897.
Barnett, Vic. (1973). Comparative Statistical Inference. New York: Wiley.
Bayarri, M. J., and J. Berger. (2000). "P-Values for Composite Null Models (with discussion)." Journal of the American Statistical Association 95: 1127-1142.
Berger, J. O., and L. Pericchi. (1998). "Accurate and Stable Bayesian Model Selection: The Median Intrinsic Bayes Factor." Sankhya, Series B 60: 1-18.
Berger, J. O., L. D. Brown, and R. L. Wolpert. (1994). "A Unified Conditional Frequentist and Bayesian Test for Fixed and Sequential Simple Hypothesis Testing." Annals of Statistics 22: 1787-1807.
Berger, J. O., and R. L. Wolpert. (1984). The Likelihood Principle. Hayward, CA: Institute of Mathematical Statistics Monograph Series.
Berger, James O., B. Boukai, and Y. Wang. (1997). "Unified Frequentist and Bayesian Testing of a Precise Hypothesis." Statistical Science 12: 133-160.
Berger, James O., and Thomas Sellke. (1987). "Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence." Journal of the American Statistical Association 82: 112-122.
Bernardo, José M. (1984). "Monitoring the 1982 Spanish Socialist Victory: A Bayesian Analysis." Journal of the American Statistical Association 79: 510-515.
Binder, S. A. (1996). "The Partisan Basis of Procedural Choice: Allocating Parliamentary Rights in the House, 1789-1990." American Political Science Review 90: 8-22.
Binder, S. A. (2006). "Parties and Institutional Choice Revisited." Legislative Studies Quarterly 31: 513-532.
Birnbaum, A. (1962). "On the Foundations of Statistical Inference." Journal of the American Statistical Association 57: 269-306.
Brandstätter, Eduard. (1999). "Confidence Intervals as an Alternative to Significance Testing." Methods of Psychological Research Online 4: 32-46.
Breslow, N. E. (1996). "Statistics in Epidemiology: The Case-Control Study." Journal of the American Statistical Association 91: 14-28.
Brown, P. J., M. Vannucci, and T. Fearn. (1998). "Multivariate Bayesian Variable Selection and Prediction." Journal of the Royal Statistical Society, Series B 60: 627-641.
Burkhardt, Hans, and Alan H. Schoenfeld. (2003). "Improving Educational Research: Toward a More Useful, More Influential, and Better-funded Enterprise." Educational Researcher 32: 3-14.
Carlin, B., and S. Chib. (1995). "Bayesian Model Choice via Markov Chain Monte Carlo Methods." Journal of the Royal Statistical Society, Series B 57: 473-484.
Carver, Ronald P. (1978). "The Case Against Statistical Significance Testing." Harvard Educational Review 48: 378-399.
Carver, Ronald P. (1993). "Merging the Simple View of Reading with Rauding Theory." Journal of Reading Behavior 25: 439-455.
Casella, G., and R. L. Berger. (1987). "Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem." Journal of the American Statistical Association 82: 106-111.
Casella, George, and R. L. Berger. (2001). Statistical Inference. Belmont, CA: Duxbury Advanced Series.
Chatfield, C. (1995). "Model Uncertainty, Data Mining, and Statistical Inference (with discussion)." Journal of the Royal Statistical Society, Series A 158: 419-466.
Chib, S. (1995). "Marginal Likelihood from the Gibbs Output." Journal of the American Statistical Association 90: 1313-1321.
Chib, S., and E. Greenberg. (1995). "Understanding the Metropolis-Hastings Algorithm." The American Statistician 49: 327-335.
Chib, S., and I. Jeliazkov. (2001). "Marginal Likelihood from the Metropolis-Hastings Output." Journal of the American Statistical Association 96: 270-281.
Chipman, H., E. I. George, and R. E. McCulloch. (2001). "The Practical Implementation of Bayesian Model Selection (with discussion)." In Model Selection, ed. P. Lahiri. Beachwood, OH: IMS, pp. 65-134.
Clyde, M., and E. I. George. (2004). "Model Uncertainty." Statistical Science 19: 81-94.
Cohen, Jacob. (1962). "The Statistical Power of Abnormal-Social Psychological Research: A Review." Journal of Abnormal and Social Psychology 65: 145-153.
Cohen, Jacob. (1977). Statistical Power Analysis for the Behavioral Sciences. Second Edition. New York: Academic Press.
Cohen, Jacob. (1988). Statistical Power Analysis for the Behavioral Sciences. Second Edition. Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen, Jacob. (1992). "A Power Primer." Psychological Bulletin 112: 115-159.
Cohen, Jacob. (1994). "The Earth is Round (p < .05)." American Psychologist 49: 997-1003.
Congdon, P. (2001). Bayesian Statistical Modelling. Chichester: Wiley.
Craiu, R. V., J. S. Rosenthal, and C. Yang. (forthcoming). "Learn From Thy Neighbor: Parallel-Chain Adaptive MCMC." Journal of the American Statistical Association.
Denis, Daniel J. (2005). "The Modern Hypothesis Testing Hybrid: R. A. Fisher's Fading Influence." Journal de la Société Française de Statistique 145: 5-26.
Devroye, Luc. (1981). "The Series Method in Random Variate Generation and Its Application to the Kolmogorov-Smirnov Distribution." American Journal of Mathematical and Management Sciences 1: 359-379.
Devroye, Luc. (1986). Non-Uniform Random Variate Generation. New York: Springer-Verlag.
Dijkstra, T. K. (1988). On Model Uncertainty and Its Statistical Implications. Berlin: Springer.
Draper, D. (1995). "Assessment and Propagation of Model Uncertainty." Journal of the Royal Statistical Society, Series B 57: 45-97.
Falk, Ruma, and Charles W. Greenbaum. (1995). "Significance Tests Die Hard: The Amazing Persistence of a Probabilistic Misconception." Theory & Psychology 5: 75-98.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. (1995). Bayesian Data Analysis. New York: Chapman & Hall.
George, E. I., and R. E. McCulloch. (1997). "Approaches for Bayesian Variable Selection." Statistica Sinica 7: 339-373.
George, Edward I., and Robert E. McCulloch. (1993). "Variable Selection Via Gibbs Sampling." Journal of the American Statistical Association 88: 881-889.
George, E. I. (1999). "Bayesian Model Selection." In Encyclopedia of Statistical Sciences, Update Volume 3. New York: Wiley.
Geweke, J. (1996). "Bayesian Inference for Linear Models Subject to Linear Inequality Constraints." In Modelling and Prediction: Honoring Seymour Geisser, ed. W. O. Johnson, J. C. Lee, and A. Zellner. New York: Springer.
Gigerenzer, Gerd. (1987). "Probabilistic Thinking and the Fight Against Subjectivity." In The Probabilistic Revolution, Vol. 2, ed. Lorenz Krüger, Gerd Gigerenzer, and Mary Morgan. Cambridge, MA: MIT Press.
Gigerenzer, Gerd. (1993). "The Superego, the Ego, and the Id in Statistical Reasoning." In A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues, ed. G. Keren and C. Lewis. Hillsdale, NJ: Lawrence Erlbaum Associates.
Gigerenzer, Gerd. (1998). "We Need Statistical Thinking, not Statistical Rituals." Behavioral and Brain Sciences 21: 199-200.
Gigerenzer, Gerd, and D. J. Murray. (1987). Cognition as Intuitive Statistics. Hillsdale, NJ: Lawrence Erlbaum Associates.
Gilks, W. R., and P. Wild. (1992). "Adaptive Rejection Sampling for Gibbs Sampling." Applied Statistics 41: 337-348.
Gill, Jeff. (1999). "The Insignificance of Null Hypothesis Significance Testing." Political Research Quarterly 52: 647-674.
Gill, Jeff. (2005). "An Entropy Measure of Uncertainty in Vote Choice." Electoral Studies 24: 371-392.
Gill, Jeff. (2007). Bayesian Methods for the Social and Behavioral Sciences. Second Edition. New York: Chapman & Hall.
Gilner, Jeffrey A., Nancy L. Leech, and George A. Morgan. (2002). "Problems with Null Hypothesis Significance Testing (NHST): What do the Textbooks Say?" Journal of Experimental Education 71: 83-92.
Godsill, S. J. (2001). "On the Relationship between MCMC Model Uncertainty Methods." Journal of Computational & Graphical Statistics 10: 230-248.
Grayson, D. A. (1998). "The Frequentist Facade and the Flight from Evidential Inference." British Journal of Psychology 89: 325-345.
Green, P. J. (1995). "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination." Biometrika 82: 711-732.
Greenland, S. (2000). "Principles of Multilevel Modelling." International Journal of Epidemiology 29: 158-167.
Greenland, Sander. (2008). "Invited Commentary: Variable Selection versus Shrinkage in the Control of Multiple Confounders." American Journal of Epidemiology Advance Access: 1-7.
Greenwald, Anthony G. (1975). "Consequences of Prejudice Against the Null Hypothesis." Psychological Bulletin 82: 1-20.
Greenwald, Anthony G., R. Gonzalez, R. J. Harris, and D. Guthrie. (1996). "Effect Sizes and p Values: What Should Be Reported and What Should Be Replicated?" Psychophysiology 33: 175-183.
Han, Cong, and Bradley Carlin. (2001). "Markov Chain Monte Carlo Methods for Computing Bayes Factors: A Comprehensive Review." Journal of the American Statistical Association 96: 1122-1132.
Harlow, Lisa L., Stanley A. Mulaik, and James H. Steiger. (1997). What If There Were No Significance Tests? (Multivariate Applications). Mahwah, NJ: Lawrence Erlbaum Associates.
Hastings, W. K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications." Biometrika 57: 97-109.
Hoeting, J. A., D. Madigan, A. E. Raftery, and C. T. Volinsky. (1999). "Bayesian Model Averaging: A Tutorial (with discussion)." Statistical Science 14: 382-401.
Holmes, C. C., and L. Held. (2006). "Bayesian Auxiliary Variable Models for Binary and Multinomial Regression." Bayesian Analysis 1: 145-168.
Howson, Colin, and Peter Urbach. (1993). Scientific Reasoning: The Bayesian Approach. Second Edition. Chicago: Open Court.
Hunter, John E. (1997). "Needed: A Ban on the Significance Test." Psychological Science Special Section 8: 3-7.
Hunter, John E., and Frank L. Schmidt. (1990). Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Beverly Hills: Sage.
Hwang, J. T., G. Casella, C. P. Robert, M. T. Wells, and R. H. Farrell. (1992). "Estimating Accuracy in Testing." Annals of Statistics 20: 490-509.
Hoti, F., and M. J. Sillanpää. (2006). "Bayesian Mapping of Genotype × Expression Interactions in Quantitative and Qualitative Traits." Heredity 97: 4-18.
Ishwaran, Hemant, and J. Sunil Rao. (2003). "Detecting Differentially Expressed Genes in Microarrays Using Bayesian Model Selection." Journal of the American Statistical Association 98: 438-455.
Ishwaran, Hemant, and J. Sunil Rao. (2005). "Spike and Slab Gene Selection for Multigroup Microarray Data." Journal of the American Statistical Association 100: 764-780.
Ishwaran, Hemant, and J. Sunil Rao. (2008). "Clustering Gene Expression Profile Data by Selective Shrinkage." Statistics and Probability Letters 78: 1490-1497.
Jeffreys, Harold. (1961). The Theory of Probability. Oxford: Clarendon Press.
Ji, Chunlin, and Scott C. Schmidler. (2007). "Adaptive Markov Chain Monte Carlo for Bayesian Variable Selection." Unpublished manuscript.
Kass, Robert E., and Adrian E. Raftery. (1995). "Bayes Factors." Journal of the American Statistical Association 90: 773-795.
Kilpikari, R., and M. J. Sillanpää. (2003). "Bayesian Analysis of Multilocus Association in Quantitative and Qualitative Traits." Genetic Epidemiology 25: 122-135.
King, G., and L. Zeng. (2001). "Explaining Rare Events in International Relations." International Organization 55: 693-715.
King, G., and L. Zeng. (2001). "Improving Forecasts of State Failure." World Politics 53: 623-658.
King, G., and L. Zeng. (2001). "Logistic Regression in Rare Events Data." Political Analysis 9: 137-163.
Kirk, Roger E. (1996). "Practical Significance: A Concept Whose Time Has Come." Educational and Psychological Measurement 56: 746-759.
Krueger, Joachim I. (1999). "Do We Need Inferential Statistics? Reply to Hallahan on Social-Bias." PSYCOLOQUY 10(004).
Kuo, L., and B. Mallick. (1998). "Variable Selection for Regression Models." Sankhya, Series B 60: 65-81.
Kyung, MinJung, Jeff Gill, and George Casella. (2008). "Estimation in Dirichlet Random Effects Models." Technical Report, Center for Applied Statistics, Washington University. http://polmeth.wustl.edu/workingpapers.php?text=probit+mixed+Dirichlet+random+effects+model&searchkeywords=T&order=dateposted
Kyung, Minjung, Jeff Gill, and George Casella. (2009). "Penalized Regression, Standard Errors, and Bayesian Lassos." Technical Report, Center for Applied Statistics, Washington University. http://polmeth.wustl.edu/workingpapers.php?text=Bayesian+hierarchical+model&searchkeywords=T&order=dateposted
Lacy, M. G. (1997). "Efficiently Studying Rare Events: Case-Control Methods for Sociologists." Sociological Perspectives 40: 129-154.
Leamer, E. E. (1978). Specification Searches: Ad Hoc Inference with Nonexperimental Data. New York: John Wiley & Sons.
Lehmann, E. L. (1993). "The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?" Journal of the American Statistical Association 88: 1242-1249.
Lindley, D. V. (1961). "The Use of Prior Probability Distributions in Statistical Inference and Decision." Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 453-468.
Lindsay, R. M. (1995). "Reconsidering the Status of Tests of Significance: An Alternative Criterion of Adequacy." Accounting, Organizations and Society 20: 35-53.
Loftus, Geoffrey R. (1991). "On the Tyranny of Hypothesis Testing in the Social Sciences." Contemporary Psychology 36: 102-105.
“Editorial Comment.” Memory & Cognition 21:1-3.
Loftus, Geoffrey R. (1993b). “Visual Data Representation and Hypothesis Testing in the Microcomputer Age.” Behavior Research Methods, Instrumentation, and Computers 25:250-256.
Loftus, Geoffrey R. (1996). “Psychology Will Be a Much Better Science When We Change the Way We Analyze Data.” Current Directions in Psychological Science 5:161-171.
Loftus, Geoffrey R., and D. Bamber. (1990). “Weak Models, Strong Models, Unidimensional Models, and Psychological Time.” Journal of Experimental Psychology: Learning, Memory, and Cognition 16:916-926.
Long, J. Scott. (1997). Regression Models for Categorical and Limited Dependent Variables. London: Sage Publications.
Macdonald, Ranald R. (1997). “On Statistical Testing in Psychology.” British Journal of Psychology 88:333-349.
Madigan, D., and J. York. (1995). “Bayesian Graphical Models for Discrete Data.” Internat. Statist. Rev. 63:215-232.
Meehl, Paul E. (1967). “Theory-Testing in Psychology and Physics: A Methodological Paradox.” Philosophy of Science 34:103-115.
Meehl, Paul E. (1978). “Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology.” Journal of Consulting and Clinical Psychology 46:806-834.
Meehl, Paul E. (1990). “Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles that Warrant It.” Psychological Inquiry 1:108-141.
Meng, X. L. (1994). “Posterior Predictive P-Values.” Annals of Statistics 22:1142-1160.
Mengersen, K., and C. P. Robert. (1996). “Testing for Mixtures: A Bayesian Entropic Approach.” In Bayesian Statistics 5 (Alicante, 1994). New York: Oxford University Press.
Meuwissen, T. H. E., and M. E. Goddard. (2004). “Mapping Multiple QTL Using Linkage Disequilibrium and Linkage Analysis Information and Multitrait Data.” Genet. Sel. Evol. 36:261-279.
Miller, A. (2002). Subset Selection in Regression. Boca Raton, FL: Chapman & Hall/CRC.
Mitchell, T. J., and J. J. Beauchamp. (1988). “Bayesian Variable Selection in Linear Regression.” Journal of the American Statistical Association 83:1023-1032.
Morris, C. N. (1983). “Parametric Empirical Bayes Inference: Theory and Applications.” Journal of the American Statistical Association 78:47-55.
Nickerson, Raymond S. (2000). “Null Hypothesis Statistical Testing: A Review of an Old and Continuing Controversy.” Psychological Methods 5:241-301.
Oakes, M. (1986). Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: John Wiley & Sons.
O’Hara, R. B., and M. J. Sillanpää. (2009). “A Review of Bayesian Variable Selection Methods: What, How, and Which?” Bayesian Analysis 4:85-118.
Park, T., and G. Casella. (2008). “The Bayesian Lasso.” Journal of the American Statistical Association 103:681-686.
Perez, J. M., and J. Berger. (2002). “Expected Posterior Prior Distributions for Model Selection.” Biometrika 89:491-512.
Pollard, P. (1993). “How Significant is ‘Significance’?” In A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues, ed. G. Keren and C. Lewis. Hillsdale, NJ: Lawrence Erlbaum Associates.
Pollard, P., and J. T. E. Richardson. (1987). “On the Probability of Making Type One Errors.” Psychological Bulletin 102:159-163.
Raftery, A. E. (1995). “Bayesian Model Selection in Social Research.” Sociological Methodology 25:111-163.
Raftery, A. E. (1996). “Approximate Bayes Factors and Accounting for Model Uncertainty in Generalised Linear Models.”
Biometrika 83:251-266.
Regal, R., and E. B. Hook. (1991). “The Effects of Model Selection on Confidence Intervals for the Size of a Closed Population.” Statistics in Medicine 10:717-721.
Robert, C., and G. Casella. (2004). Monte Carlo Statistical Methods. Second Edition. New York: Springer-Verlag.
Robinson, D. H., and J. R. Levin. (1997). “Reflections on Statistical and Substantive Significance, with a Slice of Replication.” Educational Researcher 26:21-26.
Rosnow, Ralph L., and R. Rosenthal. (1989). “Statistical Procedures and the Justification of Knowledge in Psychological Science.” American Psychologist 44:1276-1284.
Rozeboom, William W. (1960). “The Fallacy of the Null Hypothesis Significance Test.” Psychological Bulletin 57:416-428.
Rozeboom, William W. (1997). “Good Science is Abductive, not Hypothetico-Deductive.” In What If There Were No Significance Tests?, ed. L. L. Harlow, S. A. Mulaik, and J. H. Steiger. Mahwah, NJ: Erlbaum, 335-392.
Rubin, Donald B. (1984). “Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician.” Annals of Statistics 12:1151-1172.
Sahu, S., and R. Cheng. (2003). “A Fast Distance Based Approach for Determining the Number of Components in Mixtures.” Canadian Journal of Statistics 31:2-33.
Schaefer, Juliane, and Korbinian Strimmer. (2005a). “An Empirical Bayes Approach to Inferring Large-Scale Gene Association Networks.” Bioinformatics 21:754-764.
Schaefer, Juliane, and Korbinian Strimmer. (2005b). “A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics.” Statistical Applications in Genetics and Molecular Biology 4:1-30.
Schervish, Mark J. (1996). “P Values: What They Are and What They Are Not.” The American Statistician 50:203-206.
Schickler, E. (2000). “Institutional Changes in the House of Representatives, 1867-1998: A Test of Partisan and Ideological Power Balance Models.” American Political Science Review 94:269-288.
Schmidt, Frank L. (1997). “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for the Training of Researchers.” Psychological Methods 1:115-129.
Schmidt, Frank L., and John E. Hunter. (1977). “Development of a General Solution to the Problem of Validity Generalization.” Journal of Applied Psychology 62:529-540.
Sedlmeier, Peter, and Gerd Gigerenzer. (1989). “Do Studies of Statistical Power Have an Effect on the Power of Studies?” Psychological Bulletin 105:309-316.
Sha, N., M. Vannucci, M. G. Tadesse, P. J. Brown, I. Dragoni, N. Davies, T. C. Roberts, A. Contestabile, M. Salmon, C. Buckley, and F. Falciani. (2004). “Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage.” Biometrics 60:812-819.
Smith, M., and R. Kohn. (2002). “Parsimonious Covariance Matrix Estimation for Longitudinal Data.” Journal of the American Statistical Association 97:1141-1153.
Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde. (2002). “Bayesian Measures of Model Complexity and Fit.” J. Royal Statist. Society Series B 64:583-639.
Stephens, M. (2000). “Bayesian Analysis of Mixture Models with an Unknown Number of Components: An Alternative to Reversible Jump Methods.” Annals of Statistics 28:40-74.
Thompson, Bruce. (2002). “What Future Quantitative Social Science Research Could Look Like: Confidence Intervals for Effect Sizes.” Educational Researcher 31:24-31.
Tong, TieJun, and Yuedong Wang. (2007).
“Optimal Shrinkage Estimation of Variance With Applications to Microarray Data Analysis.” Journal of the American Statistical Association 102:113-122.
Volinsky, C. T., D. Madigan, A. E. Raftery, and R. A. Kronmal. (1997). “Bayesian Model Averaging in Proportional Hazard Models: Assessing the Risk of a Stroke.” J. Roy. Statist. Soc. Ser. C 46:433-448.
Waagepetersen, Rasmus, and Daniel Sorensen. (2001). “A Tutorial on Reversible Jump MCMC with a View Towards Applications in QTL-Mapping.” International Statistical Review 69:49-62.
Wang, Hui, Yuan-Ming Zhang, Xinmin Li, Godfred L. Masinde, Subburaman Mohan, David J. Baylink, and Shizhong Xu. (2005). “Bayesian Shrinkage Estimation of Quantitative Trait Loci Parameters.” Genetics 170:465-480.
West, M. (2003). “Bayesian Factor Regression Models in the ’Large p, Small n’ Paradigm.” Bayesian Statistics 7:723-732.
Wilkinson, L. (1999). “Statistical Methods in Psychology Journals: Guidelines and Explanations.” The American Psychologist 54:594-604.
Willan, Andrew R., Eleanor M. Pinto, Bernie J. O’Brien, Padma Kaul, and Ron Goeree. (2005). “Country Specific Cost Comparison from Multinational Clinical Trial Using Empirical Bayesian Shrinkage Estimation: The Canadian ASSENT-3 Economic Analysis.” Health Economics 14:327-338.
Wooldridge, Jeffrey M. (2001). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.
Xu, S. (2003). “Estimating Polygenic Effects Using Markers of the Entire Genome.” Genetics 163:789-801.
Yi, N., and S. Xu. (2008). “Bayesian LASSO for Quantitative Trait Loci Mapping.” Genetics 179:1045-1055.
Zhang, Hao Helen, Jeongyoun Ahn, Xiaodong Lin, and Cheolwoo Park. (2006). “Gene Selection Using Support Vector Machines with Non-Convex Penalty.” Bioinformatics 22:88-95.

Appendix A MCMC Algorithm for Updating Hyperparameters

The joint posterior distribution of a generalized linear Spike and Slab model can be written as follows:

$$\pi(\beta, \eta^2, \mathcal{F}, w \mid \mathbf{Y}) \propto \prod_{i=1}^{N} L(Y_i \mid \beta, \eta^2, \mathcal{F}, w)\,\pi(w)\prod_{k=1}^{K}\pi(\beta_k \mid \mathcal{F}_k, \eta_k^2)\,\pi(\mathcal{F}_k \mid w)\,\pi(\eta_k^2). \tag{33}$$

With the conditional independence in the hierarchical setup, the coefficient vector $\beta$ can be sampled as in ordinary Bayesian generalized linear models given the current draws of the hyperparameters, so here we give the sampling scheme only for the hyperparameters. Given the current draw of $\beta$, all the hyperparameters can be updated with the Gibbs sampler as follows:

1. Draw $\mathcal{F}_k \mid \beta, \eta^2 \propto l(\mathcal{F}_k)\,\pi(\mathcal{F}_k)$, which is a two-point discrete distribution. The unnormalized posterior heights at the two support points are computed by multiplying the likelihood and prior heights at each point:
$$w_{1,k} = (1-w)\,\nu_0^{-1/2}\exp\left(-\frac{\beta_k^2}{2\nu_0\eta_k^2}\right), \qquad w_{2,k} = w\,\exp\left(-\frac{\beta_k^2}{2\eta_k^2}\right).$$
Normalize the distribution by defining $\kappa_k \equiv w_{2,k}/(w_{1,k}+w_{2,k})$, so that $\mathcal{F}_k$ can be drawn from its posterior:
$$\mathcal{F}_k \mid \nu_0, w \sim (1-\kappa_k)\,\delta_{\nu_0}(\cdot) + \kappa_k\,\delta_1(\cdot).$$

2. Draw $h_k \equiv \eta_k^{-2}$:
$$h_k \mid \beta, \mathcal{F} \propto h_k^{1/2}\exp\left(-\frac{h_k\beta_k^2}{2\mathcal{F}_k}\right)h_k^{a_1-1}\exp(-a_2 h_k). \tag{34}$$
This is a gamma kernel, so we can sample $\eta_k^{-2}$ from its posterior distribution $\mathrm{Gamma}\left(a_1 + 1/2,\; a_2 + \beta_k^2/(2\mathcal{F}_k)\right)$, and then compute $\gamma_k = \mathcal{F}_k\,\eta_k^2$.

3. Draw $w$: the mixing weight $w$ follows a standard Beta-Bernoulli update and can be drawn from its posterior distribution:
$$w \mid \mathcal{F} \sim \mathrm{Beta}\left(c_1 + \#\{k : \mathcal{F}_k = 1\},\; c_2 + \#\{k : \mathcal{F}_k = \nu_0\}\right).$$
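To make the three conditional draws concrete, the following is a minimal R sketch of one Gibbs pass over the hyperparameters. The argument names are illustrative and are not defined in the paper's own code: beta and eta2 hold the current values of $\beta$ and $\eta_k^2$, nu0 is the spike scale $\nu_0$, and a1, a2, c1, c2 are the hyperprior parameters.

# One Gibbs pass over the hyperparameters, given the current draw of beta.
# All argument names are illustrative placeholders for model quantities.
update.hyperparameters <- function(beta, eta2, w, nu0, a1, a2, c1, c2){
  K <- length(beta)
  # Step 1: two-point conditional posterior for each F_k over {nu0, 1}
  w1 <- (1 - w) * nu0^(-0.5) * exp(-beta^2/(2*nu0*eta2))   # spike height
  w2 <- w * exp(-beta^2/(2*eta2))                          # slab height
  kappa <- w2/(w1 + w2)                                    # P(F_k = 1 | rest)
  F.k <- ifelse(runif(K) < kappa, 1, nu0)
  # Step 2: eta_k^{-2} | beta, F ~ Gamma(a1 + 1/2, a2 + beta_k^2/(2 F_k))
  h <- rgamma(K, shape = a1 + 0.5, rate = a2 + beta^2/(2*F.k))
  eta2.new <- 1/h
  gamma.k <- F.k * eta2.new    # prior variance of beta_k
  # Step 3: Beta-Bernoulli update of the mixing weight w
  w.new <- rbeta(1, c1 + sum(F.k == 1), c2 + sum(F.k == nu0))
  list(F = F.k, eta2 = eta2.new, gamma = gamma.k, w = w.new)
}

In the full sampler this pass alternates with the draw of $\beta$, with the slab indicators, variances, and mixing weight refreshed at every iteration.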
Appendix B R Code for Sampling λ

# Draw the mixing quantity lambda_i for each observation by rejection
# sampling: propose from a generalized inverse-Gaussian distribution
# (via the inverse-Gaussian transform of a chi-squared draw), then accept
# or reject with an alternating-series squeeze. The objects ystar,
# X.rescaled, beta[g,], and the series-length cap mm are defined
# elsewhere in the sampler.
rsquare <- (ystar - X.rescaled %*% as.matrix(beta[g, ]))^2
lambda.new <- numeric(n)
for (i in 1:n){
  OK <- FALSE
  while (!OK){
    # draw a proposal via the inverse-Gaussian transform
    jj <- sqrt(rsquare[i])
    R <- (rnorm(1))^2
    RR <- 1 + (R - sqrt(R*(4*jj + R)))/(2*jj)
    RRR <- 1/(1 + RR)
    u <- runif(1)
    if (u < RRR){
      prop.lambda <- jj/RR
    } else {
      prop.lambda <- jj*RR
    }
    # accept/reject with the series appropriate to the proposal's size
    U <- runif(1)
    if (prop.lambda > 4/3){
      OK <- rightmost.interval(U, prop.lambda, mm)
    } else {
      OK <- leftmost.interval(U, prop.lambda, mm)
    }
  }
  lambda.new[i] <- prop.lambda
}

# the two functions called in the code above are defined as:
rightmost.interval <- function(U, lambda, mm){
  # alternating-series squeeze for proposals with lambda > 4/3;
  # mm caps the number of series terms evaluated
  ZD <- 1
  XD <- exp(-0.5*lambda)
  j <- 0
  for (d in 1:mm){
    j <- j + 1
    ZD <- ZD - (j+1)^2*XD^((j+1)^2 - 1)
    if (ZD > U) return(TRUE)
    j <- j + 1
    ZD <- ZD + (j+1)^2*XD^((j+1)^2 - 1)
    if (ZD < U) return(FALSE)
  }
  return(FALSE)
}

leftmost.interval <- function(U, lambda, mm){
  # alternating-series squeeze for proposals with lambda <= 4/3,
  # evaluated on the log scale for numerical stability
  HD <- 0.5*log(2) + 2.5*log(pi) - 2.5*log(lambda) - pi^2/(2*lambda) + 0.5*lambda
  LU <- log(U)
  ZD <- 1
  XD <- exp(-pi^2/(2*lambda))
  KD <- lambda/pi^2
  j <- 0
  for (d in 1:mm){
    j <- j + 1
    ZD <- ZD - KD*XD^(j^2 - 1)
    WD <- HD + log(ZD)
    if (WD > LU) return(TRUE)
    j <- j + 1
    ZD <- ZD + (j+1)^2*XD^((j+1)^2 - 1)
    WD <- HD + log(ZD)
    if (WD < LU) return(FALSE)
  }
  return(FALSE)
}

Appendix C JAGS Code for Poisson Spike and Slab Model

model{
  for (i in 1:n){
    Y[i] ~ dpois(lambda[i]);
    # log link with an observation-level random effect
    log(lambda[i]) <- inprod(B, X[i,]) + epsilon[i];
    epsilon[i] ~ dnorm(0, tau.epsilon);
  }
  tau.epsilon <- pow(sigma.epsilon, -2);
  sigma.epsilon ~ dunif(0, 100);
  # mixing weight for the slab component
  w ~ dbeta(c.1, c.2);
  for (k in 1:K){
    # Temp[k] = 1 selects the slab, Temp[k] = 0 the spike
    Temp[k] ~ dbern(w);
    zz[k] <- equals(Temp[k], 0);
    # F[k] equals deltanull (the spike scale nu_0) or 1 (the slab)
    F[k] <- zz[k]*deltanull + 1 - zz[k];
    eta[k] ~ dgamma(a.1, a.2);
    # prior variance of B[k]; note that dnorm takes a precision in JAGS
    gamma[k] <- F[k]*(1/eta[k]);
    B[k] ~ dnorm(0, 1/gamma[k]);
  }
}
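For completeness, here is a sketch of how the model block above might be run from R with the rjags package, assuming it is saved to a file such as pois.ss.bug. The data objects Y and X and all hyperparameter values shown are illustrative placeholders, not settings used in the paper.

library(rjags)

# Illustrative inputs; Y (counts), X (n x K design matrix), and the
# hyperparameter values below are placeholders, not the paper's settings.
jags.data <- list(Y = Y, X = X, n = nrow(X), K = ncol(X),
                  deltanull = 0.005,   # spike scale nu_0
                  a.1 = 5, a.2 = 50,   # gamma hyperprior on eta
                  c.1 = 1, c.2 = 1)    # beta prior on w

m <- jags.model("pois.ss.bug", data = jags.data, n.chains = 2, n.adapt = 1000)
post <- coda.samples(m, variable.names = c("B", "F", "w"), n.iter = 10000)

# Posterior inclusion probabilities: the share of draws in which F[k]
# takes the slab value 1 rather than deltanull.
draws <- as.matrix(post)
incl <- colMeans(draws[, grep("^F\\[", colnames(draws)), drop = FALSE] == 1)

The vector incl then gives, for each coefficient, the posterior probability that it falls in the slab, which is the quantity used for variable selection and hypothesis testing throughout the paper.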