Spike and Slab Prior Distributions for Simultaneous Bayesian Hypothesis Testing, Model Selection, and Prediction of Nonlinear Outcomes
Xun Pang
Department of Political Science
One Brookings Dr., Seigle Hall, St. Louis, MO
[email protected]
Jeff Gill
Center for Applied Statistics, Washington University
Department of Political Science
One Brookings Dr., Seigle Hall, St. Louis, MO
[email protected]
Abstract
A small body of literature has used the spike and slab prior specification for model selection with strictly linear outcomes. In this setup a two-component mixture distribution is stipulated for each coefficient of interest, with one part centered at zero with very high precision (the spike) and the other a distribution diffusely centered at the research hypothesis (the slab). Through this selective shrinkage, the setup incorporates the zero-coefficient contingency directly into the modeling process to produce posterior probabilities for hypothesized outcomes. We extend the model to qualitative responses by designing a hierarchy of forms over both the parameter and model spaces to achieve variable selection, model averaging, and individual coefficient hypothesis testing. To overcome the technical challenges in estimating the marginal posterior distributions, possibly with a dramatic ratio of density heights of the spike to the slab, we develop a hybrid Gibbs sampling algorithm using an adaptive rejection approach for various discrete outcome models, including dichotomous, polychotomous, and count responses. The performance of the models and methods is assessed with both Monte Carlo experiments and empirical applications in political science.
Keywords: Spike and Slab Prior, Hypothesis Testing, Bayesian Model Selection,
Bayesian Model Averaging, Adaptive Rejection Sampling, Generalized Linear Model
1 Introduction
Variable selection, hypothesis testing, and forecasting are intertwined but distinct activities that researchers typically consider sequentially. Relatedly, model averaging can also be considered as a means of addressing the variable selection and hypothesis testing tasks, but only in a Bayesian context since model probabilities are required. In fact, coherent treatment of all of these requires the Bayesian perspective on model uncertainty (Hoeting et al 1999, Alston 2004), yet this approach is under-utilized in the social sciences. More specifically in political science, the absence of this perspective can be attributed to a fixation with individual coefficient (Wald) testing, often using strictly p-value evidence. This leads to the misconception that isolated theories are sufficient for variable selection. However, there are often multiple measures of conceptually identical or similar variables, collinearity is common among social science variables, and all right-hand-side variables interact with nonlinear link functions in the generalized linear model. Therefore, two models with alternative variable selections can produce dramatically different inferences and yet both have good fit (Regal & Hook 1991; Draper 1995; Madigan & York 1995; Kass & Raftery 1995; Raftery 1996). This problem cannot easily be solved with the current state of model averaging since it involves complicated or highly subjective issues of identifying the model space, dealing with formidable integral calculations, justifying the inclusion of prior information, and overcoming the estimation challenges of jumping between different model spaces (Hoeting et al 1999; Clyde 1999; O'Hara & Sillanpää 2009). Unfortunately all of these challenges remain even with Markov chain Monte Carlo methods.
This paper extends the Bayesian linear spike and slab model to nonlinear (qualitative) outcomes, which has not been accomplished to date. The spike and slab model specifies a mixture prior distribution for each of the coefficient parameters, where the spike part helps detect coefficients of zero magnitude by shrinking those coefficients to zero in posterior terms, and the slab part provides the non-zero coefficients with a diffuse normal prior, thus not requiring extensively informed prior knowledge. In this way, both the model space and the parameter space are determined at the same time, even though the full dimension of the model space is not changed mathematically. Therefore we simultaneously perform variable inclusion decisions, hypothesis testing, and model specification, as well as set up a principled method for posterior prediction using Bayesian model averaging. The applied practice of model determination has an inglorious history in the social sciences (Leamer 1978, Gill 1999), and remains an ongoing source of controversy. We provide a means whereby empirical researchers can avoid arbitrary decisions, difficult covariate tradeoffs, and the fallacy that only one specification was tried. Since an R package has been developed by the authors, the primary cost of this method is a willingness to fully embrace Bayesian probabilistic inference.
There are several technical advantages over rival procedures outlined in this paper. We avoid the complication of transforming between model spaces of different dimensions as done with the reversible jump MCMC method (Green 1995). The difficulty of specifying a dimension matching condition through a "bijection" function is the primary reason that RJMCMC is rarely used, and then often only with pared-down or simple datasets. To avoid this problem the generalized linear spike and slab (GLSS) model uses modified continuous bimodal priors and applies a local and dynamic adaptation method by implementing a hierarchical specification of the variance parameter for each coefficient. This hierarchical design also avoids choosing a point value for the prior setup of the variance, which typically results in complex tuning of the MCMC algorithm (Clyde & George 2004, West 2003, Geweke 1996, George & McCulloch 1993, Mitchell & Beauchamp 1988). For practitioners, we develop nearly-automatic MCMC samplers with little tuning to implement the GLSS model for commonly observed qualitative responses in political science, including dichotomous, polychotomous, and count responses. In addition, we show that the spike and slab model works as a general two-sided Bayesian hypothesis test for individual coefficients, which has been an elusive goal in Bayesian statistics (Berger & Sellke 1987; Berger, Brown, & Wolpert 1994; Berger, Boukai, & Wang 1997; Lehmann 1993; Meng 1994; Rubin 1984). This theoretically important aspect has not been recognized in the literature, although the linear spike and slab prior is routinely applied for variable selection in statistical genetics (Meuwissen & Goddard 2004; Sha et al 2004; Kilpikari & Sillanpää 2003; Ishwaran & Rao 2008). Furthermore, nearly all hypothesis tests from regression analysis in political science are either explicitly or implicitly two-sided, thus necessitating a two-sided approach for any general method as proposed here.
This paper is organized as follows. In the next section, we present the general setup for the GLSS model and discuss how this model serves the three purposes of coefficient hypothesis testing, variable/model selection, and model averaging. In Section 3, we address the choice of prior distributions for the hyperparameters important to the performance of the MCMC algorithms. In Section 4, we discuss the MCMC algorithms for estimating GLSS models with different link functions, and provide brief examples to demonstrate these hybrid MCMC algorithms. Then, we apply the methods to two empirical studies using data from political science and illustrate how this method can help hypothesis testing where conventional methods fail, improve the efficiency of variable selection, and account for model uncertainty in forecasting. The MCMC algorithm for sampling the hyperparameters, R code for adaptive rejection sampling in the GLSS model with the logit link, and JAGS code for estimating the Poisson spike and slab model are given in appendices.
2 Generalized Linear Spike and Slab Model

2.1 Model Specification
The spike and slab linear model was first introduced by Mitchell & Beauchamp (1988), giving
a point mass at zero and a bounded uniform distribution elsewhere in the parameter space.
Formally, the prior setup is as follows:

p(βk = 0) = h0k  and  p(βk < b, βk ≠ 0) = (b + fk) h1k,  where −fk < b < fk,    (1)

where the probability at the spike (h0k) and that at the slab (2 h1k fk) are both greater than 0 and h0k + 2 h1k fk = 1. This is therefore a mixture distribution with parameters fk and γk = h0k/h1k
(the ratio of the heights of the two parts). Given a mixture prior distribution with a highly
concentrated component at zero, each βk associated with a “vulnerable” predictor is very likely
to be absorbed by this point mass and be deleted from the stochastically selected submodels.
One important modification proposed by George & McCulloch (1993) changes the discrete prior
specification into a continuous mixture distribution. This modification facilitates applying the
Gibbs sampler instead of using Bayesian cross-validation to estimate the model. They introduce a new parameter, γk , which is a dichotomous variable (γk ∈ {0, 1}) in the following prior
specification:
βk | γk ∼ (1 − γk) N(0, τk²) + γk N(0, ck² τk²),    (2)

p(γk = 1) = 1 − p(γk = 0) = pk,    (3)
where τk² is such a small value that once βk falls into the first normal distribution, its support is sufficiently concentrated around zero that βk can be "safely" replaced by 0. The parameter γk indicates whether the covariate associated with βk is included in (γk = 1) or excluded from (γk = 0) the model in the MCMC process. A further modification of the spike and slab setup comes from Ishwaran & Rao (2005), who specify a continuous distribution for the variance τk² in the above specification to reduce the posterior sensitivity to the prior choices of the two-point mixture distribution of the variance hyperparameter in George & McCulloch (1993). These works all concentrate on linear models. The spike and slab model and its modified versions have been applied for variable selection among a large number of potential specifications, especially when the number of covariates is greater than the number of observations (Ishwaran & Rao 2008, Barbieri & Berger 2004, Chipman, George, & McCulloch 2001, George & McCulloch 1997).
The generalized linear spike and slab model proposed in this paper follows the linear version of the hierarchical structure introduced by Ishwaran & Rao, and is specified as follows:

E(yi) = g⁻¹(xi′β),    (4)

βk | γk² ∼ N(0, γk²),  for k = 1, 2, ..., K,    (5)

γk² = Fk ηk²,    (6)

Fk | ν0, w ∼ (1 − w) δν0(·) + w δ1(·),    (7)

ηk⁻² | a1, a2 ∼ Gamma(a1, a2),    (8)

w | c1, c2 ∼ Beta(c1, c2).    (9)
In this specification, the response variable yi is discrete and can be binary, ordinal, multinomial, or count data. The design matrix x contains the covariates, which can be standardized to possibly improve mixing in the Markov chain Monte Carlo simulations. In equation (5), each βk has its own variance parameter γk², which is updated in the MCMC simulation. In fact, the prior distribution of γk² is determined by two random variables, Fk and ηk². The hyperparameter Fk has a discrete distribution with support {ν0, 1} and a weight parameter w (it is actually a rescaled Bernoulli distribution). In equation (7), δA(·) is a mass point at A, and ν0 is such a small value that 3√ν0 ηk is very close to zero. This means that more than 99.99% of the mass in the spike part is concentrated in a region not distinguishable from 0, and whenever βk falls into the spike, it can be safely replaced by zero. If Fk takes the value ν0 (determined by both the data and the prior probability 1 − w), the observed data suggest that βk's effect is negligible and support the null hypothesis βk = 0. Similarly, if Fk takes the value 1 with its posterior probability, then βk ∼ N(0, ηk²), and with ηk² large enough this normal prior is diffuse and has limited influence on the posterior, admitting the non-zero coefficients. Since zero or close-to-zero coefficients tend to be shrunk into the spike part, only those coefficients whose effects are differentiable from zero will have Fk = 1 with high frequency.
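To make the hierarchy concrete, the following minimal sketch simulates prior draws for a single coefficient under equations (5)-(9), using the hyperparameter values of the first prior setup shown in Figure 1 below; the Gamma/inverse-gamma rate parameterization is an assumption of the sketch.

## Prior simulation for one coefficient under equations (5)-(9), using the
## first setup of Figure 1: (w, nu0, a1, a2) = (0.5, 5E-09, 5, 50).
set.seed(1)
G      <- 5000                                # number of prior draws
w      <- 0.5                                 # slab weight (fixed here for illustration)
eta2   <- 1 / rgamma(G, shape = 5, rate = 50) # eta_k^{-2} ~ Gamma(a1, a2)
nu0    <- 5e-09
Fk     <- ifelse(runif(G) < w, 1, nu0)        # F_k: slab (1) w.p. w, spike (nu0) otherwise
gamma2 <- Fk * eta2                           # gamma_k^2 = F_k * eta_k^2, equation (6)
beta   <- rnorm(G, 0, sqrt(gamma2))           # beta_k | gamma_k^2 ~ N(0, gamma_k^2)
mean(Fk == 1)                                 # share of prior mass in the slab part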
2.2 Spike and Slab Model as a Hypothesis Testing Approach
Both the original and the modified versions of the spike and slab setup can naturally be applied to coefficient hypothesis testing. Hoeting et al (1999) used the posterior probability p(βk ≠ 0|D) (where D denotes the data) and compared it with p-values, but the theoretical justification has not been laid out. The primary application of spike and slab priors is complete model specification, which is essentially Bayesian model testing. Less attention has been paid to using them for hypothesis testing on individual coefficients, i.e., H0 : βk = 0 versus H1 : βk ≠ 0. In the spike and slab model of Mitchell & Beauchamp (1988), the mixture prior in equation (1) is a straightforward setup of
the prior beliefs about the statements of covariates' effects, and the observed data update those beliefs into the posterior quantities π(h0k|D) versus π(2h1k fk|D). These quantities can be computed using the MCMC output of γk and fk, and they provide the posterior probability that the null or alternative hypothesis is true.

In the modified model specification in equation (3), with the dichotomous random variable γk indicating inclusion or exclusion of the covariate associated with βk, interpreting it as a hypothesis test on βk is even more straightforward. After the Markov chain converges, each iteration is one test based on a sample of parameters drawn from the relevant posterior distribution: what we actually obtain is p(H0|D) = p(γk = 0|D) = p versus p(H1|D) = p(γk = 1|D) = 1 − p. Those conditional probabilities are estimated using the observed data instead of relying on unreplicable resampling or hypothesized data. Therefore, this Bayesian hypothesis testing is based on a sufficiently large number of repeated experiments (the number of MCMC iterations) and data from the same distribution (the posterior distribution). Bayesian decision making uses those probabilities, which also quantify the risk of rejecting or accepting a hypothesis. For instance, if p(H0|D) = p, then rejecting H0 carries risk p and accepting it carries risk 1 − p. Note that this hypothesis testing and decision making make probabilistic claims, and do not involve p-values or an arbitrary significance threshold level as in the null hypothesis significance testing (NHST) paradigm.
The fact that the spike and slab model can serve as a coefficient hypothesis test has important meaning for social scientists, since the NHST that currently dominates tests of statistical reliability in the social sciences is logically flawed and inconsistent, as noted by a huge number of authors (Barnett 1973, Berger, Boukai, & Wang 1997, Berger & Sellke 1987, Berkhardt & Schoenfeld 2003, Bernardo 1984, Brandstätter 1999, Carver 1978, 1993, Dar, Serlin & Omar 1994, Cohen 1962, 1977, 1988, 1992, 1994, Denis 2005, Falk & Greenbaum 1995, Gelman, Carlin, Stern, & Rubin 1995, Gigerenzer 1987, 1993, 1998, Gigerenzer & Murray 1987, Gill 1999, 2005, Gliner, Leech & Morgan 2002, Grayson 1998, Greenwald 1975, Greenwald, Gonzalez, Harris & Guthrie 1996, Hager 2000, Howson & Urbach 1993, Hunter 1997, Hunter & Schmidt 1990, Jeffreys 1961, Kirk 1996, Krueger 1999, 2001, Lindsay 1995, Loftus 1991, 1993a, 1993b, 1994, 1996, Loftus & Bamber 1990, Macdonald 1997, Meehl 1967, 1978, 1990, Nickerson 2000, Oakes 1986, Pollard 1993, Pollard & Richardson 1987, Robinson & Levin 1997, Rosnow & Rosenthal 1989, Rozeboom 1960, 1997, Schmidt 1996, Schmidt & Hunter 1977, Sedlmeier and Gigerenzer 1989, Thompson 2002, Wilkinson 1999). Especially with a small number of observations, the asymptotic assumptions on which maximum likelihood test statistics rely are simply inapplicable, rendering these tests unhelpful, if not outright misleading. Without relying on an artificial interpretation of conditional probability, Bayesian inference and decision making based on posterior description and the Bayes Factor have many advantages over NHST. However, for the purpose of traditional hypothesis testing of coefficient reliability, posterior description alone does not present evidence in the traditional decision format that social science
readers are used to seeing. The empirical posterior description is essentially a one-sided test based on p(βk > 0|D) versus p(βk ≤ 0|D), while two-sided hypothesis testing is quite difficult since the priors and posteriors are often continuous p.d.f.'s, such that a point-null hypothesis has zero probability by design. Despite the evidence that point null hypothesis testing is inconsistent with Bayesian practice, there has been considerable effort expended trying to find a reconcilable Bayesian approach (Berger & Sellke 1987; Berger, Brown, & Wolpert 1994; Berger, Boukai, & Wang 1997; Lehmann 1993; Meng 1994; Rubin 1984). For instance, Lindley (1961) posits the prior information as sufficiently vague so that one has no particular belief in θ = θ0 versus θ = θ0 ± ε, where ε is some small value. Then a reference (ignorance-expressing) prior can be used to obtain a posterior, and H0 is rejected for values that fall outside the (1 − α)100% highest posterior density region. The Bayes Factor is a more formal comparison of two alternative hypotheses of any type, but the pairwise comparison of submodels with imputation of covariates is inefficient and computationally demanding. With sophisticated models, the computation of the Bayes Factor itself is not always stable and reliable (Kass & Raftery 1995, Han & Carlin 2001). In the spike
itself is not always stable and reliable (Kass & Raftery 1995, Han & Carlin 2001). In the spike
and slab setting, we can conduct the hypothesis testing on coefficient effect by simply computing
(g)
the posterior probability from the MCMC output p̂(Fk = 1|D) = (# : Fk = 1)/G, where G is
the total number of iterations. Accordingly, p(H0 is True|D) = p(Fk = ν0 |D) and is estimated
as 1 − p̂(Fk = 1|D). For the purpose of testing the hypothesis of H0 : βk = 0 versus H1 : βk 6= 0,
the posterior distribution of major interest is p(Fk |D).
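As a minimal sketch, these estimates are one line of post-processing on the sampler output; F.draws here is a hypothetical G × K matrix holding the sampled Fk values (entries ν0 or 1):

## Estimated two-sided test probabilities from the MCMC output.
p.H1 <- colMeans(F.draws == 1)  # p-hat(F_k = 1 | D) for each coefficient
p.H0 <- 1 - p.H1                # estimated p(H0: beta_k = 0 | D)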
2.3 Bayesian Model Comparison and Variable Selection
We always have uncertainty about the true data generation process (DGP), and/or the DGP itself is dynamic instead of fixed; therefore, model comparison is necessary and important. From the Bayesian point of view, model comparison is not about identifying the "true" model, but rather about assessing the relative relevance of models based on the observed data as well as prior information (Aitkin 1991; Kass & Raftery 1995; Han & Carlin 2001). In the literature, three major Bayesian approaches have been developed for model comparison: the Bayes Factor and its variants (Carlin & Chib 1995; Kass & Raftery 1995; Gelman et al 1995; Aitkin 1997; Berger & Pericchi 1998; Congdon 2001; Robert & Casella 2004; Bayarri & Berger 2000; Perez & Berger 2002; Spiegelhalter et al 1999), distance measures such as entropy distance (Sahu & Cheng 2003) or Kullback-Leibler divergence (Mengersen & Robert 1996), and simultaneous estimation of models in an enlarged parameter and model space, including the spike and slab approach, reversible jump MCMC (Carlin & Chib 1995; Green 1995), and birth and death MCMC (Stephens 2000). The Bayes Factor approach often involves complicated integration problems. To overcome this very challenging integration problem, Chib (1995) and Chib & Jeliazkov (2001) developed a very general approach based on both Gibbs and MH outputs. This
marginal likelihood method has been implemented broadly, but is still computationally intense and requires very careful partitioning of the parameters in high-dimensional models. In general, all of the approaches for computing the Bayes Factor can invite numerical instability issues (Kass & Raftery 1995; Berger & Pericchi 1998; Han & Carlin 2001). Reversible jump MCMC involves changes of model dimension, which makes the algorithm complicated, and this is also true for the birth and death approach (O'Hara & Sillanpää 2009). In contrast, the spike and slab model does not involve changes in the model dimension, because when a subset of coefficients is shrunk to zero they still stay in the model mathematically. The fact that the spike and slab model deals with the parameter and sample space at the same time simplifies the MCMC algorithm and smooths the path of the Markov chain in the whole model space.
Theoretically, model comparison is a judgement about the quality of fit derived from making such decisions as: variable inclusion/exclusion, incorporated prior beliefs, the functional form, hierarchies, and distributional assumptions. More often the key tradeoffs involve just variable selection. It is easy to say that this should always be based on substantive theory in political science, but the reality is much more complicated since there are often alternative measures for phenomena of interest (e.g., "ideology" or "partisanship" in survey research), high correlations among the potential explanatory variables, surprisingly strong relationships from unsuspected sources, and complications from interactions inherent in all generalized linear models with link functions that are not the identity. At the opposite end of the spectrum, many political scientists have a distaste for "data mining" through sample data since it simply chases unexplained covariance in the sample and therefore rarely produces reliable inferences. We are offering, instead, a principled criterion based on the posterior probability that a regression coefficient is distinct from zero, conditional on: other covariates, the form of the model, and of course the data. Furthermore, by controlling the level of shrinkage, users can stipulate high prior probability for coefficients with strong theoretical support and let these factors dominate the posterior probability for the others.
In the Bayesian framework, variable selection is essentially about how to estimate the marginal posterior probability that a covariate is included in the model, based on the observed data and prior information or preferences. In this sense, variables should always have a prior with a spike and slab (Miller 2002), reflecting the uncertainty of including or excluding them. Following this line, there are various methods to implement variable selection, all explicitly or implicitly using spike and slab priors. One approach uses two auxiliary variables: γk, which indicates whether a variable should or should not be included in the model, and ψk, which is used for computing the posterior effect of a coefficient βk = ψk γk (Kuo & Mallick 1998). This approach is different from stochastic search variable selection (SSVS) (George & McCulloch 1993; Brown et al 1998) in that the auxiliary variable ψk has its own prior distribution and affects the posterior probability if a "pseudo-prior" is not used, as in Gibbs variable selection (GVS) (Dellaportas et al 1997). The SSVS approach focuses on p(βk|Ik = 0), where Ik is often assumed to have a Bernoulli prior distribution
with parameter 0.5 (George & McCulloch 1993). Criticisms of this approach concern not only the complexity of tuning, but also the undesirable influence of the priors: the prior is too informative and biases toward a model with half of the candidate variables, which is not desirable in that the resulting model can be too complex given that the set of candidate variables is usually large (O'Hara & Sillanpää 2009). The spike and slab setup in this paper is essentially an SSVS approach, but our indicator Fk does not have a prespecified Bernoulli parameter—the hierarchical structure allows the parameter w to be updated based on data information. Our approach is also close to the adaptive shrinkage method in the sense that this method generally specifies βk|τk² ∼ N(0, τk²) and focuses on coming up with a suitable prior for the variance τk² that shrinks values of βk towards zero but does not shrink the non-zero coefficients. The prior distribution of τk² also controls the sparseness of the model and the mixing of the MCMC algorithm, which is exactly the case for our model in equation (5). In the literature, often-used priors in this dynamic shrinkage approach include Jeffreys's prior (Hoti & Sillanpää 2006, Xu 2003) and the Laplacian prior (Figueiredo 2003), better known as the Bayesian Lasso (Park & Casella 2008; Kyung, Gill, & Casella 2008; Yi & Xu 2008). In the present paper, we use a mixture prior for the variance parameter γk², which achieves the same "selective shrinkage" effect but is easier to estimate.
Another approach to variable selection is the model space approach. Instead of specifying priors for the individual coefficients, this approach places priors on the number of covariates in a selected model, and after the dimension of the model is decided, the secondary task is to choose variables. This approach includes reversible jump MCMC (Green 1995) and its variants (Godsill 2001; Kilpikari and Sillanpää 2003). Not much different from RJ-MCMC, those variants also involve a change in dimension and still require a dimension matching condition, which is often quite complicated in realistic models.
2.4 Bayesian Model Averaging for Prediction
Bayesian Model Averaging (BMA) is a way to solve the inference and prediction problems caused by model uncertainty (Leamer 1978; Draper 1995; Chatfield 1995; Kass & Raftery 1995; Dijkstra 1988; George 1999). Ideally, the inferential and predictive distributions, or any quantity under investigation, should be obtained by using all the relevant models, averaged with their weights (model posterior probabilities); that is, those distributions should all be mixtures over the whole model space (Clyde 1999). However, in empirical analyses, model uncertainty is more often ignored, and inferences and predictions are obtained based on a single model as if it were the true data generation process with absolute certainty. Ignoring this uncertainty leads to overconfident inferences and involves higher decision risk (Alston et al 2005). Madigan & Raftery (1994) showed that BMA performs better, in terms of the logarithmic scoring rule, than predictions based on any single model, including the "best" one. The idea of BMA is simple: for
any quantity of interest, ∆, we obtain its posterior distribution in the following way:
p(∆|y) = Σ_{n=1}^{N} p(∆|Mn, y) p(Mn|y),    (10)

where

p(Mn|y) = p(y|Mn) p(Mn) / Σ_{j=1}^{N} p(y|Mj) p(Mj),    (11)

p(y|Mn) = ∫ p(y|θn, Mn) p(θn|Mn) dθn.    (12)
However, implementing this procedure faces several formidable challenges: how to determine the number of models in equation (10); how to specify the model prior probabilities in equation (11); and how to handle the high-dimensional integral in equation (12). For the first problem, one approach is the Occam's window method (Madigan and Raftery 1994; Volinsky et al 1997), which sets rules based on information and decision theories. The second approach, which we adopt in the present paper, is to use Markov chains to determine the model space by simply counting the proportions of visits to submodels. This approach has the advantage of avoiding the analytical or numerical integration needed to compute the marginal likelihoods and model weights; instead, the weights are produced along with the model set in the MCMC simulation—the frequency of appearance of a model is its empirical posterior weight.
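In post-processing terms this counting is straightforward; a minimal sketch, again assuming a hypothetical G × K matrix F.draws of sampled Fk values:

## Empirical posterior model weights: each row of F.draws identifies the
## submodel visited at that iteration (1 = slab/included, nu0 = spike/excluded).
inc     <- 1 * (F.draws == 1)                  # G x K inclusion-indicator matrix
visited <- apply(inc, 1, paste, collapse = "") # submodel label, e.g. "1011000010"
weights <- sort(table(visited), decreasing = TRUE) / nrow(inc)
head(weights)                                  # top submodels and their weights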
BMA is important for probabilistic prediction because there are two types of discrepancies between observed and predicted values (Draper et al. 1993). The first type of discrepancy is predictive bias—a systematic tendency to predict on the low side or the high side. The second is lack of calibration—a systematic tendency to over- or understate predictive accuracy (Hoeting et al 1999). Madigan & Raftery (1994) use the predictive log score, a combined measure of bias and calibration, to measure the decision risk. The score can be simply computed by randomly splitting the data into two parts, DB (the build data) and DT (the test data), and computing the following quantities:
Score_single = − Σ_{d∈DT} log p(d|M, DB),    (13)

Score_BMA = − Σ_{d∈DT} log ( Σ_{M∈M} p(d|M, DB) p(M|DB) ).    (14)
The larger the score, the more discrepant the prediction. Madigan & Raftery found that Score_BMA is smaller than Score_single. We will use this criterion to compare the prediction performance of the GLSS model with single models in the empirical study section.
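Given posterior predictive probabilities for the held-out cases, both scores are one-liners; a minimal sketch in which pred.single (a vector of p(d|M, DB) under a single model), pred.bma (a test-case × model matrix of the same quantities), and w.model (the empirical model weights p(M|DB)) are all hypothetical objects:

## Predictive log scores of equations (13) and (14).
score.single <- -sum(log(pred.single))
score.bma    <- -sum(log(pred.bma %*% w.model))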
3 Selective Shrinkage and Prior Choices

3.1 Selective Shrinkage
Essentially, the three tasks (hypothesis testing, variable selection, and model averaging) are achieved by the GLSS model mainly through the selective shrinkage effect of the spike and slab prior. As mentioned above, this approach can be regarded as a modified version of the adaptive shrinkage approach. In this section, we discuss this mechanism, focusing on its implications for prior choices.
To illustrate the shrinkage mechanism simply, note that in each MCMC iteration the Bayesian linear estimate of β is β̂ = (X′X + B0⁻¹)⁻¹X′Z, where B0 is the variance of a multivariate normal prior NK(0, B0). The prior variance can be regarded as the ridge matrix in the conventional sense, and the elements in B0 are referred to as the ridge parameters. Selective shrinkage uses the ridge parameters to achieve a certain goal by adjusting the bias-variance trade-off.¹ Also, in the Bayesian setup priors can be considered as a penalty function on the likelihood. Different penalties are often achieved by specifying a different measure G(·) in the general expression of a mixture prior, π(βk) = ∫ N(βk|0, ψk) G(dψk). For example, a quadratic penalty is obtained with normal priors, and the LASSO penalty is achieved by using a double exponential (Laplace) prior (Kyung et al. 2009). In our case, the measure G(·) is defined by equations (7) and (8), which influences the likelihood by shrinking zero or close-to-zero coefficients to zero.

¹ The well-known decomposition of the mean square error (MSE) of an estimator q̂ of a parameter q is MSE = Var(q̂) + Bias(q̂)², which measures the performance of the estimator. To improve the performance, we choose an optimal combination of the two parts, which can be achieved by non-parametric approaches such as bootstrapping the variance (Schaefer & Strimmer 2005a) or by parametric approaches such as shrinkage estimation.
This penalty or shrinkage is an important mechanism in the hypothesis testing process. As discussed above, non-Bayesian hypothesis testing approaches rely on estimating standard errors. With discrete outcome models, it is well known that the maximum likelihood estimator is inefficient for estimating covariance matrices with a small sample size: the so-called "Stein paradox" (Stein 1956, Efron 1982). Shrinkage estimation was introduced for the purpose of achieving more efficient estimators. The original Laplace method (the first shrinkage estimator) adds some prior data (information) into the maximum likelihood estimate to shrink the estimate toward zero. The degree of this shrinkage depends upon the size and structure of the observed data (Greenland 2008).
Suppose the unrestricted model has a K-dimensional coefficient parameter vector Ψ = (ψ1, ..., ψK), and a restricted P-dimensional submodel has a parameter vector Θ = (θ1, ..., θP). The linear shrinkage approach constructs a new mixed estimator by weighted averaging: Γ̂ = λΨ̂ + (1 − λ)Θ̂, where λ ∈ [0, 1] denotes the shrinkage intensity. This shrinkage estimation can obtain a regularized estimate that outperforms each individual estimator in terms of statistical efficiency if the shrinkage parameter λ is appropriately chosen. In practice, this parameter is often
chosen by fixing it at a certain value, by specifying a function depending on the sample size, by minimizing a pre-specified risk function (Schaefer & Strimmer 2005), by using cross-validation (Friedman 1989), or in an empirical Bayesian context (Greenland 2000, Morris 1983). Shrinkage estimation has been widely used for model (variable) selection (Zhang, Ahn, & Lin 2006, Wang, Zhang, & Li 2005, Ishwaran & Rao 2008), multiple confounder detection (Greenland 2008), and large-scale covariance estimation (Tong & Wang 2007, Willan, Pinto, & O'Brien 2005).
The spike and slab model is a shrinkage estimator because when Fk = ν0, βk^(g) ∼ N(0, (ν0 ηk²)^(g)). In this situation, βk can be safely replaced by 0 and has almost zero effect in the next iteration, which is equivalent to shrinking the model to K′ dimensions, where K′ = #{k : Fk = 1}. The shrinkage parameter λ is a vector instead of a scalar, and can be computed from the frequency of occurrence of each submodel in the MCMC simulation.
3.2 Prior Choices
The prior choices are important for achieving both desirable shrinkage and desirable confounding. The shrinkage determines the sparseness of the model—how parsimonious the selected model is desired to be; the confounding affects how easily the chain can move from one state space to another, but too much confounding means that the spike and slab prior loses its ability to discriminate between the zero and non-zero coefficients. The working of the prior hinges on the following three parameters:
• ηk²: the hyperparameter of the variance for the slab part of the mixture prior for βk. We avoid the complexity of choosing an appropriate point value of ηk² by setting another hierarchical level, ηk² ∼ IG(a1, a2) (George & McCulloch 1993). This inverse gamma distribution allows the Markov chain to find the appropriate sample space for each coefficient and to adjust that space dynamically. By choosing the hyperparameters a1 and a2, we set the possible support of ηk² and its uncertainty; therefore, this distribution should have support large enough to dilute the shrinkage introduced by the prior and let the data identify the non-zero coefficients. However, the desirable shape of this inverse gamma distribution is also determined by the data. If the data are not informative, the posterior distribution will be diffuse and the confounding of this inverse gamma distribution with the ν0 ηk² part around zero will not be enough. In other words, the mass that connects the two parts will be too thin, the chain will have difficulty moving between the two parts, and we will observe most of the coefficient chains staying in the spike part for a long time. Here, the rule of thumb is that when too much shrinkage is observed, one should change this inverse gamma prior to be more right-skewed or less diffuse.
• ν0: ν0 ηk² is the variance of the spike part (a normal distribution with very high precision). The choices of ν0 and ηk² depend on each other, and the principle is to ensure that ν0 ηk² is small enough to shrink the coefficients of zero magnitude, but not so small that there is not enough confounding with the slab part.
• w: this parameter controls the prior belief about the probability that a coefficient is non-zero, since it is the proportion of the mass allocated to the slab part of the prior. Unlike George & McCulloch (1993), w is the same across all K coefficients, which means that the user can only assign systematic beliefs about the presence of all the covariates' effects. Small values of w reflect high prior skepticism about the coefficients and the theoretical importance of their associated covariates, while large values of w mean that the researcher stresses the theoretical importance of the chosen variables and is more skeptical about the sampling or measurement of those data. If c1 = 1 and c2 = 1, the prior for w becomes uniform and reflects indifferent prior beliefs about the two hypotheses.
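A quick way to inspect a candidate combination before running the full sampler is to simulate the implied prior and plot it; a minimal sketch of Figure-1-style panels, reusing the gamma2 and beta draws from the prior-simulation sketch in Section 2.1:

## Diagnostic panels for a candidate (w, nu0, a1, a2) choice: variance
## hyperparameter draws (left) and the implied spike and slab prior (right).
par(mfrow = c(1, 2))
hist(gamma2, breaks = 100, main = "gamma_k^2 = F_k * eta_k^2", xlab = "gamma_k^2")
plot(density(beta), main = "implied prior for beta_k", xlab = "beta_k")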
These general rules imply that in practice a number of trials with different combinations of the three prior choices are often required until desirable mixing and shrinkage are obtained. Graphical tools can help give a rough idea of how the prior choices encode prior beliefs and meet the rules above.

[Figure 1: Different Prior Setups. The upper panels show the prior distributions of the variance hyperparameter γk² = Fk ηk² under four setups; the lower panels show the resulting spike and slab prior distributions of βk.]

Figure 1 illustrates how different choices of those tuning parameters incorporate prior uncertainty and affect the confounding; some combinations of those prior choices
should be better than the others. In the figure, the first choice is (w, ν0, a1, a2) : (0.5, 5E-09, 5, 50), which encodes the following prior information: we specify indifferent prior probabilities for βk = 0 and βk ≠ 0, since w = 0.5 puts half of the mass in the spike and the other half in the slab. The inverse gamma distribution for the variance of the slab part produces a flat normal prior which allows non-zero coefficients to be identified. Although the prior for γk² = Fk ηk² is continuous, the density between the spike and slab is thin, which facilitates separating the non-zero coefficients from the zero ones. But the two parts are still connected with some amount of mass, and it is just this mass in between that serves as a bridge for the chain to move between the two parts. The second choice is (w, ν0, a1, a2) : (0.9, 5E-09, 5, 50); it is the same as the first one except that w = 0.9 puts 90% of the mass in the slab part, reflecting a very strong prior belief that H1 is true. This will result in a bias towards more complicated models (more variables are selected). Normally we should not have such a strong prior; especially given that w controls the systematic belief about all of the β coefficients, this prior setup is not recommended. The third specification, (w, ν0, a1, a2) : (0.9, 5E-09, 30, 50), is much worse than the first two. Besides the very strong prior bias towards H1, the variance of the slab part is too small. With modest sample sizes, this slab prior will strongly push the posterior distribution of the non-zero coefficients towards zero. Also, the mass between the two parts is too thin to provide enough confounding. The last example has the prior choice (w, ν0, a1, a2) : (0.5, 0.1, 5, 10). With ν0 = 0.1, the spike and slab parts of the prior distribution of γk² are connected with a large amount of mass between them, and the confounding is bad in the sense that the two states cannot be distinguished; the spike and slab prior then functions much like an ordinary bimodal normal mixture prior.
4 MCMC Algorithms for Generalized Linear Spike and Slab Models
The spike and slab priors for multiple coefficients complicate the parameter space—the two different state spaces have to be separated for the purpose of identifying variables as included or excluded, but at the same time they should be connected to make it possible, or easy, to move from one state space to the other. This trade-off requires tuning the degree of confounding; i.e., we need a "gray zone" in which βk has nontrivial probabilities of being in both the spike and slab states; otherwise, the Markov chain will stay in one state for a very long time only because there is no bridge for it to get into the other space (Smith & Kohn 2002). This is the reason that George & McCulloch (1993) replaced the discrete spike and slab setup (a point mass as the spike) of Mitchell & Beauchamp (1988) with two normal distributions. However, in their case
[Figure 2: Spike and Slab Model Sample Spaces. Left panel: a bivariate spike and slab prior; right panel: a hypothetical resulting posterior. Note that in both graphs the spikes are much higher than shown in the limited space and go far beyond the plotting region.]
the confounding issue still requires complicated tuning.

Another closely related issue is how to sample the coefficient posteriors. In Figure 2, we illustrate a simple bivariate case. The ratio of the heights of the spike and slab parts is usually huge for both the prior and posterior distributions since the spike part is highly concentrated around zero. For the Metropolis-Hastings algorithm, it is difficult to specify a good proposal density to approximate the target, and the acceptance rate will be unacceptably low when the chain is stuck in the spike part. The Gibbs sampler is more mechanical and requires less tuning, but for non-linear models the coefficient parameters often do not have conditionally conjugate priors. With the auxiliary latent response variable approach (Albert & Chib 1993), generalized linear models with the probit link can be easily handled with the Gibbs sampler (Sha et al 2004), but for other links, such as the logit and Poisson GLMs, the Gibbs sampler cannot be applied directly to sample the coefficient parameters. The adaptive MCMC approach has been suggested to obtain a good proposal by learning from the chain's history (Ji & Schmidler 2009), but the success of this approach depends heavily on the starting location because it dramatically affects the "learning" of the sample space (Craiu et al, forthcoming). In this paper, we propose sampling strategies with the goal of achieving minimum tuning and making the algorithm as automated as possible. We discuss the sampling schemes for the generalized linear models with probit, logit, and Poisson links, and use a simple example for each to illustrate how the algorithm works. The algorithm for updating the hyperparameters, F, w, and η², is the same across the different link functions, and is given in Appendix A.
4.1 Generalized Linear Models with Probit Link
The simplest GLSS models are those that use the probit link function, because the data augmentation method introduced by Albert and Chib (1993) easily transforms such models into a linear framework, and the Gibbs sampler can be used to update the coefficient parameters along with the auxiliary latent response variables, just as in the linear case. This convenience is achieved because the GLSS model treats the spike and slab prior just as an ordinary prior.
4.1.1 Binary and Ordinal Probit Spike-Slab Model
Generally, the probit model for binary and ordinal outcomes can be specified as follows:

yi = 1 if zi > 0 (and 0 otherwise) in the binary case, or yi = j if τj−1 < zi < τj, j = 1, 2, ..., J, in the ordinal case,    (15)

zi = xi′β + εi.    (16)

By assuming εi ∼ N(0, 1) for identification, this model has exactly the likelihood of the GLM probit model once the auxiliary latent variable zi is integrated out. Given the updated hyperparameters F, w, and η², the coefficient parameter is sampled from a multivariate normal distribution, β|z∗, γ² ∼ N(β̄, B̄), with

B̄ = (X′X + nΓ⁻¹)⁻¹  and  β̄ = B̄X′z∗,

where γk² = Fk ηk², and Γ is the diagonal matrix with γk² as its k-th diagonal element. The latent response variable is sampled from a truncated normal distribution conditioned only on β. In the ordinal probit model, the cutpoints are assigned a uniform distribution, τ ∼ U(τ : −∞ < τ1 < ... < τJ−1 < ∞), and updated as τj ∼ U(max[max(zi : yi = j), τj−1], min[min(zi : yi = j + 1), τj+1]).
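A minimal sketch of one Gibbs scan for the binary case under these updates follows; the truncnorm package is assumed for the truncated normal draws, and the function and object names are illustrative:

## One Gibbs scan for the binary probit spike-slab model, conditional on the
## current variance vector gamma2 (entries gamma_k^2 = F_k * eta_k^2).
library(truncnorm)
gibbs.scan.probit <- function(y, X, beta, gamma2) {
  eta <- drop(X %*% beta)
  ## z_i | beta, y_i: truncated normal, positive if y_i = 1, negative if y_i = 0
  z <- ifelse(y == 1,
              rtruncnorm(length(y), a = 0,    b = Inf, mean = eta, sd = 1),
              rtruncnorm(length(y), a = -Inf, b = 0,   mean = eta, sd = 1))
  ## beta | z, gamma^2 ~ N(beta-bar, B-bar), with B-bar = (X'X + n Gamma^{-1})^{-1}
  B.bar    <- solve(crossprod(X) + length(y) * diag(1 / gamma2))
  beta.bar <- B.bar %*% crossprod(X, z)
  drop(beta.bar + t(chol(B.bar)) %*% rnorm(ncol(X)))   # one multivariate normal draw
}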
4.1.2 Multinomial Probit Spike-Slab Model
With the same data augmentation method, estimating the multinomial probit model requires only a very minor modification of the above. The setup is as follows:

yi = j if zij > zik for all k ≠ j,    (17)

zij = xij′β + εij.    (18)

It is more convenient to show the sampling strategy by expressing this model in matrix form, stacking the Zi, xi, and εi over observations i = 1, ..., N:

Z = Xβ + ε,

where the error ε ∼ N_{N×J}(0, Ω = IN ⊗ Σ) and Σ is a J × J matrix parameterized by a vector θ. Conditional on F, w, and η², the parameters β, Z, and θ can be updated as follows:

1. β|Z1, ..., ZN, Y, θ ∼ NK(β̄Z, BZ), where BZ = (X′Σ⁻¹X + Γ⁻¹)⁻¹ and β̄Z = BZ X′Σ⁻¹Z.

2. Zi|Y, β, θ ∼ N(Xiβ, Σ), where the sample must meet the requirement that the yi-th element of Zi is the maximum.

3. θ|Z, Y, β ∝ π(θ)|Ω(θ)|^{−1/2} exp{−(1/2)(Z − Xβ)′Σ⁻¹(θ)(Z − Xβ)}, which can be sampled simply by drawing from an approximating normal distribution with matched mode and curvature (see Albert & Chib 1993).
4.1.3 Example
Because of the conditional conjugacy, estimating GLSS models with the probit link is relatively easy, and the Gibbs sampler is used for all the parameters. This allows us to focus on the prior-specification tuning issues stated above in order to achieve a desirable level of shrinkage and good mixing. Here we use a simple but observational dataset—Rule Change in the United States House of Representatives from 1867 to 1998—which was analyzed by Binder (1996, 2006) and Schickler (2000), coding the response variable, Rule Change, both as binary and as ordinal data. There are only 66 observations in the dataset, and 5 covariates are included. Among the 66 observations, there are 8 events of pro-minority change and 21 episodes of pro-majority change.
The Gibbs sampler works very well in estimating both the binary and the ordinal probit S-S models. We use four different sets of prior specifications to check the prior sensitivity and test the reasonable range of confounding. Although the sample size is small, the data seem informative, and the probit S-S models are not very sensitive to mild changes of priors in terms of hypothesis testing results. Since ν0 and ηk² jointly determine the degree of confounding, we fix the prior value of ν0 across the varying prior setups and only change the value of ηk² by tuning its inverse gamma prior distribution. The first choice uses an inverse gamma distribution with mean 58.33 and standard deviation 17.59; the second prior setup changes the mean to 20 and the standard deviation to 7.54, which is a dramatic change. But because this is a change in hyperparameters at a very low level of the hierarchy, the results are not very sensitive to it (compare the first and second columns in Table 1). The second important tuning concerns the hyperparameters of w, which control the overall shrinkage level. For the first two sets of priors, we use uniform priors on w (Beta(1, 1)), while in the third and fourth setups we make the beta prior distribution skewed (with skewness 1.42 and -1.42, respectively). This change makes two of the variables in the ordinal model have very different testing results: republican.majority with p(β = 0|D, Prior3) = 0.993 and p(β = 0|D, Prior4) = 0.397, and homogeneity.change with p(β = 0|D, Prior3) = 1.000 and p(β = 0|D, Prior4) = 0.237. This suggests that although w is at a low level of the hierarchy, it is still influential in the model; the information injected by its prior should be justified, and extreme prior beliefs will yield very different posterior hypothesis testing results. However, in terms of testing the importance of the included covariates by setting some arbitrary threshold, the different testing methods generally agree with each other in this particular case.
4.2 Generalized Linear Model with Logistic Link
Modeling dichotomous and polychotomous outcomes with GLSS models under the logistic link is more complicated than under the probit link because there is no conjugacy for the coefficients. However, the logistic model is often preferred for several reasons: the coefficients have nice interpretations as measuring the change in log-odds (logits); the logistic distribution has fatter tails; and in some analyses, such as case-control studies, the relative rates measured by the logistic model are consistent, which is not the case for the probit model (Breslow 1996; Lacy 1997; King & Zeng 2001). Normally, in Bayesian analysis, the coefficients in the logistic model are sampled using a random walk or independent chain with the Metropolis-Hastings algorithm, and the proposal density is often constructed using information provided by the maximum likelihood estimates (for example, in MCMCpack, Martin & Quinn). The MH algorithm is a very general MCMC sampler (Hastings 1970, Chib & Greenberg 1995), but it performs poorly for the logistic S-S model for two reasons. First, the coefficient posteriors in the GLSS are mixture distributions with a highly concentrated spike part (the height of the spike is very large compared to that of the slab), so the MH chain has a hard time moving from the spike to the slab; second, with small sample sizes, the proposals poorly approximate the posteriors, resulting in an unacceptably low acceptance rate. To overcome this sampling difficulty and to make the algorithm more automatic, we use the auxiliary variable approach proposed by Holmes & Held (2006) and apply the Gibbs sampler to simulate from the full conditionals.
Table 1: U.S. House Rule Change: Probit Binary and Ordinal Models (N=66, K=5)

                       Spike-Slab: Pr(β = 0)           MLE-Probit                  Bayesian: Diffuse Normal
Coef.                  P1     P2     P3     P4      Estimate  SE     p-value    Mean     SE     Lower    Upper
binary response: pro-majority
republican.majority    0.049† 0.028† 0.017† 0.000†  −0.311†   0.167  0.067     −0.321   0.170  −0.657    0.007
homogeneity.change     0.980  0.994  0.967  0.999    0.176    0.215  0.417      0.179   0.217  −0.245    0.607
capacity.change        0.000† 0.000† 0.000† 0.000†   0.315    0.284  0.273      0.337   0.287  −0.223    0.908
polarization.change    0.932  0.983  0.934  0.988    0.127    0.181  0.486      0.134   0.183  −0.219    0.499
floor.median           0.996  0.998  0.957  0.999    0.551†   0.260  0.039      0.566†  0.263   0.058    1.089
ordinal response
republican.majority    0.641  0.479  0.993  0.397   −0.422†   0.220  0.054     −0.473†  0.229  −0.936   −0.041
homogeneity.change     0.649  0.314  1.000  0.237    0.194    0.295  0.511      0.218   0.319  −0.398    0.854
capacity.change        0.931  0.981  0.993  0.860    0.694†   0.395  0.079      0.802†  0.423   0.019    1.677
polarization.change    0.909  0.977  0.941  0.857    0.243    0.222  0.274      0.281   0.232  −0.153    0.752
floor.median           0.000† 0.000† 0.000† 0.000†   0.209    0.319  0.513      0.207   0.337  −0.450    0.876

Prior setups in the GLSS models: P1: ν0 = 5E-07, a1 = 13, a2 = 700, c1 = 1, c2 = 1; P2: ν0 = 5E-07, a1 = 13, a2 = 300, c1 = 1, c2 = 1; P3: ν0 = 5E-07, a1 = 13, a2 = 700, c1 = 1, c2 = 8; P4: ν0 = 5E-07, a1 = 13, a2 = 700, c1 = 8, c2 = 1.
"†" highlights the covariates that are identified as statistically reliable (in the two-sided tests) or as having a certain direction of effect (in the one-sided test) on the response variable at the 0.10 level.
4.2.1 Binary and Ordinal Logit Spike-Slab Model
Using the auxiliary variable approach, the logistic S-S model with the same hyperparameter specification as above can be written as follows:

yi = 1 if zi > 0 (binary case), or yi = j if τj−1 < zi ≤ τj, j = 1, 2, ..., J (ordinal case), with zi = xi′β + εi,    (19)

εi ∼ N(0, λi),    (20)

λi = (2ψi)²,    (21)

ψi ∼ KS,    (22)

where ψi is a random variable distributed according to the Kolmogorov-Smirnov distribution (Devroye, 1981 and 1986); accordingly, εi has a logistic marginal distribution (Andrews & Mallows 1974). Therefore, by integrating zi out of the likelihood,

L(β|y) = ∫ f(y|z, β) π(z|β) dz,    (23)

we get the exact likelihood function of a GLM with the logistic link, assuming the zi's are independent. By this means, all of the parameters in the model, except the auxiliary parameter λi, can be sampled using the Gibbs sampler. As for λ, the sampling scheme and parameterization are mainly based on Devroye (1986) and follow the application in Holmes & Held (2006). We sample λi with a rejection sampling scheme that dynamically approximates the target density with squeezing functions. The details are as follows:
1. Choose the proposal density g(λ) ∼ GIG(0.5, 1, ς²), where GIG denotes a generalized inverse Gaussian distribution and ς² = (z − x′β)² (see the sketch after this list for one way to draw from this proposal).

2. Construct the acceptance probability

α(λ) = l(λ)π(λ) / (M g(λ)),    (24)

l(λ) ∝ λ^{−1/2} exp(−0.5 ς²/λ)  (the likelihood),    (25)

π(λ) = (1/4) λ^{−1/2} KS(0.5 λ^{1/2})  (the prior).    (26)

Setting M = 1, the acceptance probability simplifies to α(λ) = exp(0.5λ)π(λ), and the simulation problem is now straightforwardly centered on evaluating a K-S distribution.
3. Partition the parameter space of λ into two regions so that we can construct a monotone alternating series representation. Devroye (1986, pp. 161-168) proved that the turning point of this infinite mixture distribution is in the interval [4/3, π), but exactly where it lies is unknown. Following Holmes & Held (2006), we take the value 4/3 as the breakpoint and use squeezing functions adapted to the two regions to evaluate the acceptance decision. The R code is given in Appendix B.

4. For the ordinal logistic model, the updating of the cutpoints is the same as in the binary probit model.
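As referenced in step 1, the GIG(0.5, 1, ς²) proposal can be drawn without a special library by inverting an inverse Gaussian draw: under the GIG(p, a, b) parameterization with density proportional to λ^(p−1) exp{−(aλ + b/λ)/2}, if 1/λ is inverse Gaussian with mean 1/|ς| and shape 1, then λ ∼ GIG(0.5, 1, ς²). A minimal sketch, generating the inverse Gaussian draw with the Michael, Schucany & Haas (1976) method:

## Draw from the GIG(0.5, 1, zeta^2) proposal of step 1 (assumes zeta != 0).
rinvgauss1 <- function(mu, lambda = 1) {   # one inverse Gaussian draw
  y <- rnorm(1)^2
  x <- mu + mu^2 * y / (2 * lambda) -
       mu / (2 * lambda) * sqrt(4 * mu * lambda * y + (mu * y)^2)
  if (runif(1) <= mu / (mu + x)) x else mu^2 / x
}
rgig.proposal <- function(zeta) 1 / rinvgauss1(1 / abs(zeta), 1)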
4.2.2 Multinomial Logit Spike-Slab Model
The multinomial logistic spike-slab model can be estimated with an algorithm very similar to the one for the logistic binary and ordinal models above, which is also discussed in Holmes & Held (2006), because it can be reformulated as several binary logit models, updating each in turn. This is based on the fact that the parameter vector βj, conditional on all other parameters including β−j, has a logistic setup, and we can sample βj directly using the same idea as before. In more detail, the basic setup of the multinomial logit model is:

yi ∼ M(1; θi1, ..., θiJ),    (27)

θij = exp(xi βj) / Σ_{r=1}^{J} exp(xi βr),    (28)

where M denotes a multinomial distribution, and one of the βj is fixed to identify the model. By working with the conditional likelihood of βj, it is easy to see that it has a logistic form:

L(βj | y, β−j) ∝ Π_{i=1}^{N} (ϑij)^{I(yi = j)} (1 − ϑij)^{I(yi ≠ j)},    (29)

ϑij = exp(xi βj − Cij) / (1 + exp(xi βj − Cij)),    (30)

Cij = log Σ_{r≠j} exp(xi βr).    (31)
With a spike and slab prior for βj as NK(0, Γj), this multinomial model can be updated exactly as above. Given the other coefficients and the latent variable zi sampled from TL(xi βj − Cij, 1), we update βj using the exact algorithm stated above. Here, TL denotes a truncated logistic distribution (if yi = j, zi is defined on the interval [0, ∞); otherwise it is on (−∞, 0)).
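The conditional quantities in equations (30) and (31) are cheap to form; a minimal sketch, where B is a hypothetical K × J coefficient matrix with one column fixed at zero for identification:

## Conditional logit probabilities for class j, given the other classes.
cond.logit <- function(X, B, j) {
  XB  <- X %*% B                                    # N x J linear predictors
  Cij <- log(rowSums(exp(XB[, -j, drop = FALSE])))  # C_ij = log sum_{r != j} exp(x_i beta_r)
  plogis(XB[, j] - Cij)                             # theta_ij in equation (30)
}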
4.2.3 Example
Since we have proposed a new algorithm for the logit models, we use simulated data to illustrate how the algorithm works and to assess the performance of the models while knowing the true data generation process. As stated above, the spike and slab models generally have advantages for hypothesis testing with small sample sizes; therefore, we choose the sample size to be 50 and the dimension of the design matrix to be 50 × 10, leaving the ratio of observations to coefficients as small as 5. The response variable is dichotomous. We also choose a correlated design to assess the effect of collinearity on the different hypothesis testing methods. Specifically, the covariates X1, ..., X5 are independent of each other, X6 and X7 are highly correlated with ρ = 0.8, and (X8, X9, X10) ∼ N3(µ, Σ), where µ = (0, −1.6, 0.8) and Σ is a Toeplitz covariance matrix with a first-order autoregressive structure and ρ = 0.6. In Table 2, we give the estimates and hypothesis testing results under the different methods. For the logit S-S model, we use four different prior setups, after several rounds of tuning to find the reasonable range of prior values. The priors are reported in Table 2. The point estimates based on the different methods do not recover the true values, given that the sample size is so small and dichotomous responses are likely to be uninformative. However, in terms of hypothesis testing on individual coefficients, the logit S-S model generally performs very well, and better than the other two competing methods. For β1 to β5, whose associated covariates are independent, the S-S model with varying prior setups correctly identifies the zero and non-zero coefficients among the first four. As for X5, the S-S model is uncertain about whether it has an effect, given that its true effect is small (−0.1); S-S Models I and III, the two models with smaller shrinkage effects, give higher probability that the covariate has some effect. The one-sided Bayesian testing based on the logit Bayesian model with diffuse priors on β also performs reasonably well, giving more than 2/3 odds for β5 ≤ 0 and a probability of 0.24 for β5 > 0. The NHST based on the ML estimates identifies H0 : β5 = 0 to be rejected with effect size 0.05, but incorrectly fails to reject H0 : β1 = 0.
For those correlated covariates, we find that collinearity does have effect on hypothesis testing
on all those methods, with relatively smaller impact on the logit S-S model. First, the NSHT
completely fails to tell which variables actually have effect and which have effect only because
of the collinearity. It incorrectly rejects H0 : β6 = 0 but fails to reject H0 : β7 = 0. It also
makes a mistake to suggest that X10 is not an statistically reliable covariate but X8 is. The
Bayesian one-sided tests do not work well, either, by giving ambiguous testing results to the
last three coefficients and incorrectly identifying β6 as a negative parameter with high certainty.
The performance of the logit S-S model varies with different prior setups, but is generally better
than the former two. The S-S model I correctly identifies the non-zero coefficients. Although
it assigns a big probability for β6 6= 0, it still admits the nontrivial probability that it is a zero
coefficient. Model II has bigger shrinkage effect since the inverse gamma distribution for the
variance parameter is more concentrated and with location closer to 0. It shrinks both β6 and
21
Table 2: Logit Binary Model (N=50, K=10)

               Spike-Slab I        Spike-Slab II       Spike-Slab III
Coef.   DGP    Mean    Pr(β=0)     Mean    Pr(β=0)     Mean    Pr(β=0)
β1     1.400    3.071  0.000†       2.918  0.000†       2.906  0.000†
β2    −1.700   −3.269  0.000†      −2.262  0.000†      −3.049  0.000†
β3     0.000    0.000  1.000        0.000  1.000        0.000  1.000
β4     0.000    0.000  1.000        0.000  1.000        0.000  1.000
β5    −0.100   −0.163  0.793       −0.002  0.985       −0.093  0.565
β6     0.000    1.286  0.235        0.002  0.975        1.107  0.306
β7    −1.000   −1.809  0.000†       0.000  1.000       −1.617  0.000†
β8     0.000    0.000  1.000        0.000  1.000       −0.102  0.825
β9    −1.600   −1.339  0.000†      −1.017  0.088†      −1.854  0.000†
β10    0.800    1.270  0.000†       1.611  0.000†       0.916  0.000†

               Spike-Slab IV       Diffuse Bayesian    MLE
Coef.   DGP    Mean    Pr(β=0)     Mean    Pr(β>0)     Estimate   p value
β1     1.400    3.302  0.000†       2.916  0.008†       4.860     1.000
β2    −1.700   −3.376  0.000†      −3.339  0.043†      −5.874     0.000†
β3     0.000    0.000  1.000        0.279  0.682        0.607     0.759
β4     0.000    0.111  0.736        0.223  0.792        0.257     0.593
β5    −0.100   −0.062  0.824       −1.300  0.235       −2.262     0.038†
β6     0.000    2.049  0.000†       2.418  0.055†       3.991     0.000†
β7    −1.000   −2.631  0.000†      −2.658  0.048†      −4.407     0.999
β8     0.000    0.000  1.000       −1.184  0.289       −1.831     0.078†
β9    −1.600   −2.412  0.000†      −2.303  0.190       −3.895     0.026†
β10    0.800    0.262  0.816        1.340  0.227        2.141     0.942

Prior setups in the GLSS models: P1 : ν0 = 5E-09, a1 = 10, a2 = 500, c1 = 1, c2 = 1; P2 : ν0 = 5E-09, a1 = 10, a2 = 300, c1 = 1, c2 = 1; P3 : ν0 = 5E-09, a1 = 10, a2 = 500, c1 = 8, c2 = 1; and P4 : ν0 = 5E-09, a1 = 13, a2 = 500, c1 = 1, c2 = 8.
“†” highlights the covariates that are identified as statistically reliable (in the two-sided tests) or have a certain direction of effect (in the one-sided test) on the response variable with the effect size of 0.10.
Model II shrinks both β6 and β7 to zero almost all the time; however, the two non-zero coefficients, β9 and β10, do survive. Model III has the least shrinkage effect, with the same prior specification as Model I except that w has a prior distribution biased towards the slab part. All of the non-zero coefficients among the five survive in this model, and β8 is identified as more likely to be zero and β5 as more likely to be non-zero. Model IV has more shrinkage than Model I, with w biased towards the spike part, but this additional shrinkage seems to concentrate on β10, while β6, the true zero coefficient, survives.
This example highlights the complications of hypothesis testing under collinearity, and the spike and slab model tends to behave better. In terms of the simulation algorithm, the adaptive rejection sampler works smoothly and is numerically stable. No tuning is required in this sampler, and the whole process is mechanical. The Markov chain mixes well as long as the priors of the three hyperparameters are within a reasonable region, which must be found by preliminary trials and varies from application to application.
4.3 Log-Linear Model: Poisson Link
For the Poisson spike and slab regression, we can use the adaptive rejection approach to sample the coefficients β, which requires very little tuning. This is because, with a normal prior, the conditional posterior of each βk is log-concave, which means that the adaptive rejection algorithm is easy to implement and works efficiently (Gilks & Wild 1992). To demonstrate this, we show that with the independent spike and slab prior βk ∼ N(0, γk²), the conditional posterior of β in the Poisson model can be written as follows:
\[
\pi(\beta \mid y, \gamma) \propto \exp\!\left( -\sum_{k=1}^{K} \frac{1}{2\gamma_k^2}\,\beta_k^2 \;+\; \sum_{k=1}^{K}\sum_{i=1}^{N} x_{ik} y_i \beta_k \;-\; \sum_{i=1}^{N} \exp\!\left(\sum_{k=1}^{K} x_{ik}\beta_k\right) \right),
\]
and the conditional distribution of β_k is then
\[
\pi(\beta_k \mid y, \beta_{-k}, \gamma) \propto \exp\!\left( -\frac{1}{2\gamma_k^2}\,\beta_k^2 \;+\; \sum_{i=1}^{N} x_{ik} y_i \beta_k \;-\; \sum_{i=1}^{N} \exp\!\left( \sum_{j\neq k} x_{ij}\beta_j + x_{ik}\beta_k \right) \right).
\]
This conditional posterior distribution is log-concave because
\[
\frac{d^2}{d\beta_k^2} \ln \pi(\beta_k \mid y, \beta_{-k}, \gamma) \;=\; -\frac{1}{\gamma_k^2} \;-\; \sum_{i=1}^{N} x_{ik}^2 \exp\!\left( \sum_{j\neq k} x_{ij}\beta_j + x_{ik}\beta_k \right) \;<\; 0.
\]
We can therefore sample from this density with the adaptive rejection algorithm, constructing upper and lower hulls that squeeze the density and updating the hulls adaptively.
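As an illustration, the following R sketch writes out this log-concave conditional; in place of the Gilks & Wild hull construction it substitutes a standard slice sampler, another tuning-free univariate method that exploits the same log-concavity. All function and variable names are hypothetical, not the authors' code.

## Conditional log-posterior of beta_k in the Poisson spike-slab model.
## X: N x K design matrix; y: counts; beta: current coefficient vector;
## gamma2: current prior variances gamma_k^2.
log_cond_beta_k <- function(bk, k, X, y, beta, gamma2) {
  eta <- drop(X[, -k, drop = FALSE] %*% beta[-k]) + X[, k] * bk
  -bk^2 / (2 * gamma2[k]) + sum(X[, k] * y) * bk - sum(exp(eta))
}

## One univariate slice-sampling update (step-out, then shrinkage).
slice_draw <- function(bk, logf, w = 1, max_steps = 50, ...) {
  logy <- logf(bk, ...) - rexp(1)          # slice level under the log-density
  lo <- bk - runif(1) * w; hi <- lo + w    # randomly positioned initial bracket
  while (max_steps > 0 && logf(lo, ...) >= logy) { lo <- lo - w; max_steps <- max_steps - 1 }
  while (max_steps > 0 && logf(hi, ...) >= logy) { hi <- hi + w; max_steps <- max_steps - 1 }
  repeat {                                 # shrink the bracket until accepted
    prop <- runif(1, lo, hi)
    if (logf(prop, ...) >= logy) return(prop)
    if (prop < bk) lo <- prop else hi <- prop
  }
}

## Usage inside the Gibbs scan, e.g.:
## beta[k] <- slice_draw(beta[k], log_cond_beta_k, k = k, X = X, y = y,
##                       beta = beta, gamma2 = gamma2)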
Table 3: Poisson Model (N=500, K=10)

               Spike-Slab           Diffuse Bayesian      MLE
Coef.   DGP    Mean    Pr(β=0)      Mean    Pr(β>0)       Estimate   p value
β1     0.600    0.580  0.000†        0.577  1.000†         0.640     0.000†
β2    −0.200   −0.216  0.000†       −0.217  0.000†        −0.240     0.000†
β3     0.450    0.485  0.000†        0.473  1.000†         0.505     0.000†
β4    −0.350   −0.358  0.000†       −0.352  0.000†        −0.371     0.000†
β5     0.250    0.240  0.000†        0.247  1.000†         0.261     0.000†
β6     0.000    0.001  0.983         0.061  0.928†         0.053     0.151
β7     0.000    0.003  0.960         0.066  0.939†         0.065     0.088†
β8     0.000    0.000  0.995        −0.034  0.208         −0.072     0.049†
β9     0.000    0.001  0.982         0.066  0.951†         0.077     0.027†
β10    0.000    0.000  0.991         0.062  0.922†         0.071     0.074†

“†” highlights the covariates that are identified as statistically reliable (in the two-sided tests) or have a certain direction of effect (in the one-sided test) on the response variable with the effect size of 0.10.
4.3.1 Example
The adaptive rejection algorithm stated above is very straightforward to implement, and the process is automatic. Unlike the Metropolis-Hastings algorithm, no tuning is needed for sampling the coefficients β, which leaves the tuning effort focused on the prior setup of the hyperparameters ν0, ηk, and w. Even more conveniently, the BUGS software uses exactly this algorithm, the adaptive rejection sampler, for Poisson models. This means that the Poisson S-S model can be estimated in BUGS, since the other parameters in the model have standard conditional distributions. The JAGS code for implementing the Poisson model is given in Appendix C.
Here we use a Monte Carlo study to illustrate the performance of the algorithm. In this study, we also want to assess the performance of the S-S model with larger sample sizes. As stated above, the NHST has been criticized for being biased against the null when the sample size is large; in other words, with a large sample size, every coefficient tends to be “significant.” We simulate 500 counts from a data generation process containing 10 covariates, five with non-zero coefficients and five with zero coefficients. The covariates are not correlated with each other. The sample size of 500 is not very large, but count data are usually more informative than other types of discrete data. We estimate the Poisson S-S model, a Bayesian Poisson model with diffuse priors, and a Poisson GLM with the maximum likelihood estimator. With the sample size of 500, mild variation of the prior setups for the Poisson S-S model does not make much difference to the hypothesis testing results; therefore, we report only one S-S model in Table 3.
All three models in Table 3 recover the true parameter values very well, which means that the information in the data is sufficient to identify the parameters. What is interesting in this example is the hypothesis tests on the zero coefficients. In terms of point estimates, all three models find that the scale of effect of the covariates associated with zero coefficients is very small, with the S-S model having the smallest scales for all of the zero coefficients. But for hypothesis testing, the NHST finds four out of the five zero coefficients to be statistically different from zero. Surprisingly, the Bayesian one-sided tests also suggest most of the zero coefficients to be positive with high certainty, and one of them is likely to be negative with probability 0.7. The Poisson S-S model correctly identifies the zero coefficients, and with an effect size of 0.05, the null hypotheses that they are zero coefficients are all accepted.
5 Empirical Examples
In this section, we use two empirical political studies to illustrate how the GLSS model can be applied as a useful tool. The first example shows that, with a very small sample size and a small ratio of observations to covariates, the GLSS model can help draw statistical inferences (hypothesis testing) when both the maximum likelihood and ordinary Bayesian approaches fail. It also serves as a model selection tool by choosing a very small number of the most promising variables. We show that, for those promising parameters, the single models and the BMA estimates yield similar results when the number of submodels is small and the submodels are similar in terms of variable selection. The second example shows how the GLSS model can be applied to variable selection and forecasting problems when there are a large number of potential covariates, many of which are supported by theories or measure similar factors. The GLSS model is much more efficient and theory-driven than traditional methods of variable selection, such as forward and backward stepwise selection. Forecasting based on the GLSS model is BMA over many submodels according to their weights (posterior probabilities), which takes model uncertainty (uncertainty about the true DGP(s)) into account and has lower risk in out-of-sample prediction.
5.1 Global Energy Security and IEA Cooperation
In this part, we use data gathered from the historical database of the International Energy Agency (IEA) to assess factors often suggested to affect global energy security. The outcome variable is dichotomous, coded 1 if an oil-supply disruption occurred in a given year and 0 otherwise. The sample years run from 1984 to 2004, and 21 observations are included in the dataset. There are 16 covariates, measuring IEA energy cooperation, international politics, the demand-supply relationship in the global oil market, and natural disasters. In terms of the sample size and the number of observations per coefficient, this example is very extreme, and a maximum likelihood model should not be applied. Even with the Bayesian approach, this dataset should be expected to yield very diffuse posteriors (if diffuse priors are used). The GLSS model can work better in this case because it dynamically excludes variables from the final model (if a single model is to be chosen) based on the information in the data.
5.1.1 Step 1: Determine Variable Inclusion Probabilities
In Table 4, we report the estimation results from three models. The ML logit model collapsed in the process of mode-finding, and the numbers reported in the table do not make any sense; we do not give the standard errors, since they are huge and meaningless for inferential purposes. The Bayesian logistic model with diffuse normal priors was estimated using the auxiliary variable approach stated above, because MH algorithms with independent or random-walk proposals all failed. As expected, we observe little Bayesian shrinkage in the posterior distributions: the priors for the coefficients are set as N(0, 20²), and the posterior standard deviations are in the range of 7 to 17. The one-sided tests suggest that 5 of the 16 covariates have a positive or negative effect with a probability greater than 0.90. The logistic S-S model shrank 9 of the 16 covariates to zero, and those covariates were excluded from the model almost all the time. The coefficient posteriors reported in Table 4 are the weighted average over 6 submodels with different sets of covariates included, as presented in Table 5. The scales of the posterior means and standard deviations are very different from those of the logit model with diffuse normal priors, because the logit S-S model is based on groups of submodels very different from the “whole” model.
5.1.2 Step 2: Hypothesis Testing
For standard hypothesis testing, the GLSS model works even with small sample sizes and radically large or small ratios of observations to covariates. This is because the test is based on the information given in the data (data are treated as fixed in all Bayesian analysis); no hypothetical or augmented data are involved, and no asymptotic principles are required. Bayesian one-sided tests work well here, but they obviously require a priori directional knowledge (a mistakenly specified direction in a one-sided test is called a Type III error). The one-sided variant also does not help with variable selection, since evidence about the sign of a coefficient, positive or negative, is different from evidence that the coefficient is statistically distinct from zero. Note that in general the coefficient estimates differ considerably between the one-sided and two-sided results.
Table 4: Energy Security: Coefficient Effect Testing

                                    Logit S-S Model                Logit Bayesian Model                      Logit ML Model
No.  Covariates                Mean      SD     p(β=0|D)    Mean       SD      p(β≤0|D)   p(β>0|D)    Estimate    p(>|z|)
1    R & D                    −1.403   1.808    0.525      −23.169    7.432    1.000†     0.000†       −48.830    1.000
2    # IEA.members             0.991   1.708    0.516        1.620   12.354    0.444      0.556        −34.300    1.000
3    nuclear.energy            0.000   0.000    1.000      −12.246   10.924    0.861      0.139        −32.086    1.000
4    natural.gas              −3.416   2.086    0.000†     −25.080    9.256    1.000†     0.000†       −22.066    1.000
5    oil.stock                 0.520   1.116    0.507       10.614   12.385    0.205      0.795         42.174    1.000
6    oil.stock(t-1)            0.000   0.000    1.000        6.605    9.245    0.260      0.741         38.439    1.000
7    natural.disaster          0.000   0.000    1.000       32.992   11.906    0.001†     0.999†       −77.699    1.000
8    intl’ crisis             −0.719   1.422    0.610      −12.002   14.563    0.752      0.248        −34.781    1.000
9    gulf.region.crisis        0.000   0.000    1.000       −4.307    7.396    0.694      0.306          5.121    1.000
10   OPEC.market.share         0.000   0.000    1.000       −1.020   16.267    0.533      0.467       −426.434    1.000
11   OPEC.market.share(t-1)    0.000   0.000    1.000       −2.111   16.411    0.546      0.454        415.408    1.000
12   world.nuclear.energy     −7.629   2.993    0.000†     −27.872    8.306    0.998†     0.002†       −98.294    1.000
13   world.natural.gas         3.547   2.203    0.000†      28.566   11.061    0.000†     1.000†        28.976    1.000
14   oil.price                 0.000   0.000    1.000        2.726   15.447    0.454      0.546         37.519    1.000
15   oil.price(t-1)            0.004   0.354    0.973       −1.732   15.934    0.545      0.455        −53.869    1.000
16   demand.china.india        0.000   0.000    1.000        4.335   14.958    0.394      0.606         85.595    1.000

“†” highlights the covariates that are identified as statistically reliable (in the two-sided tests) or have a certain direction of effect (in the one-sided test) on the response variable with the effect size of 0.10.
Table 5: Model and Variable Selection (Energy Security)

Submodel    Prob.     #Covariates
M1          0.5065         3
M2          0.3629         7
M3          0.0857         6
M4          0.0269         8
M5          0.0100         4
M6          0.0082         5

The covariates entering the submodels are drawn from X1, X2, X4, X5, X8, X12, X13, and X15.
This difference demonstrates the effect of spike and slab priors versus highly diffuse forms, and the feature that the GLSS model does not exclude the probability that a coefficient can be zero, which can happen even when the data and model give the coefficient a large magnitude (e.g., world.nuclear.energy). Finally, the maximum likelihood based NHST approach, which relies on asymptotic normality, completely fails here, as shown by the p-values and the nonsensical coefficient estimates.
5.1.3 Step 3: Model Selection
To illustrate the GLSS method as a model/variable selection tool, we estimate the “best” model again using the Bayesian logit model with diffuse priors and the standard maximum likelihood approach, and we compare the estimates with the posteriors of those covariates in the GLSS model in Table 4. The “best” model is defined as the model with the largest posterior probability. As shown in Table 5, the best model has a posterior weight of 0.51 and includes three variables. The GLSS model consists of this “best” model, with its weight of 0.51, and five other submodels, and the three variables in the “best” model are also included in the other submodels. Therefore, the posteriors of those three variables are the weighted average over all six models:
\[
p(\beta_k \mid \mathbf{Y}) = 0.5\, p(\beta_k \mid M_1, \mathbf{Y}) + 0.36\, p(\beta_k \mid M_2, \mathbf{Y}) + 0.09\, p(\beta_k \mid M_3, \mathbf{Y}) + 0.01\, p(\beta_k \mid M_4, \mathbf{Y}) + 0.01\, p(\beta_k \mid M_5, \mathbf{Y}) + 0.01\, p(\beta_k \mid M_6, \mathbf{Y}), \tag{32}
\]
and will therefore carry more uncertainty than the estimates from a single model. However, since the posterior probability concentrates on two submodels, the BMA estimates should not be very different from those of the single model. The estimates and standard errors based on the single model are reported in Table 6 and compared with the BMA posteriors. By reducing the number of covariates and increasing the ratio of observations to covariates, the ML model now works, producing sensible results. This is also true for the Bayesian logit model: the scales are very different from those of the previous fit.
Table 6: Energy Security: Selected Model

                           Logit ML Model        Logit Bayesian Model     Logit S-S Model
Covariates                 Estimate     SD       Estimate     SD          Estimate     SD
natural.gas                 −2.896    3.267       −3.427    1.867          −3.423    2.091
world.nuclear.energy        −8.092    4.418       −8.236    2.146          −7.638    2.994
world.natural.gas            3.378    2.234        4.264    2.197           3.551    2.207
The estimates and standard errors are similar across the three models. Note that the results of the GLSS model are the BMA posteriors taken directly from Table 4, and they generally have larger standard errors than the other two. This is because the BMA (GLSS) posteriors account not only for the uncertainty of the parameters but also for that of the models: six data generation processes, instead of only one, are considered in the GLSS model.
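To illustrate the averaging in equation (32), a hypothetical R sketch: given posterior draws of a coefficient from each submodel and the posterior model weights, BMA draws are generated by first sampling a submodel and then sampling a draw from it.

## draws: list of numeric vectors, MCMC draws of beta_k under each submodel;
## weights: the posterior model probabilities (e.g., from Table 5).
bma_mix <- function(draws, weights, n = 10000) {
  w <- weights / sum(weights)                                  # normalize
  m <- sample(seq_along(draws), n, replace = TRUE, prob = w)   # pick a model
  vapply(m, function(j) sample(draws[[j]], 1), numeric(1))     # then a draw
}

## mixed <- bma_mix(draws, c(0.5065, 0.3629, 0.0857, 0.0269, 0.0100, 0.0082))
## mean(mixed); sd(mixed)   # BMA posterior mean and standard deviation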
Based on the hypothesis tests on the individual coefficients and the submodels, this example tends to suggest that, except for the policy of energy substitution, the international energy cooperation of the International Energy Agency has not contributed to global energy security. The cooperation may have provided joint goods to its member states, but there is not much spillover to the rest of the world. Outside the IEA, the popularity of nuclear energy is associated with a lower risk of oil-supply disruption in the global market, but the consumption of natural gas is negatively associated with global energy security. As a simple example, the models here do not attempt to identify causality. The negative association between increased natural gas demand and oil-supply disruption is easily explained by the fact that natural gas is the second most important source of energy in today’s world and an important substitute for oil: when the oil supply decreases, consumers must resort to more natural gas for energy. This endogeneity problem should be handled in further causal studies.
The comparison between the GLSS posteriors and the single-model estimates underlines an important fact: although the GLSS model looks like a “kitchen-sink” model with all the available covariates thrown in, the posteriors it produces can be used directly for statistical inference, and in terms of reducing inferential risk it is actually superior to single models with only selected variables. This is because the GLSS model automatically selects the most relevant models without throwing away the information available in the other variables, and it provides BMA posteriors. Importantly, it shrinks the statistically unreliable coefficients based on the information in the data and averages across all relevant models to draw inferences. Therefore, the GLSS model uses the information in the data more efficiently than the Bayesian logit model and produces better results for hypothesis testing and statistical inference.
5.2 State Failures: HT, SRVS, and BMA
Since 1994, the Political Instability Task Force (PITF, formerly the State Failure Task Force), a group of researchers from multiple fields sponsored by the U.S. government, has been working on a large project with the goal of understanding and forecasting failed states. To date, they have produced four comprehensive reports, Phase I Findings through Phase IV Findings, and have made a great effort to build a warning system for state failure based on statistical analysis of a huge collection of data. The database includes all independent states with populations of at least 500,000 from 1955 to 1998. Based on the released complete dataset (2001), there are 8580 observations (country-years) and about 1200 candidate explanatory variables. Examining those covariates carefully, many variables have multiple measures, forming multiple covariates. With such a huge set of covariates, many with similar meanings, the task of model/variable selection requires a great deal of work. The PITF team first used prior information based on theories of state failure and knowledge of the measurements of the variables to narrow the set of more than 1200 variables down to about 50 covariates, then used single-variable tests to find promising variables, and then applied stepwise logistic regression with both forward and backward selection approaches. The final global model they selected is very simple, retaining only three variables: democracy, trade openness, and infant mortality. This very parsimonious specification is used for forecasting, and they claim a rate of correct prediction of about 70% to 80% using naïve criteria. Their treatment of missing data, strategy for statistical forecasting, and assessment of forecasting success have been extensively criticized by King & Zeng (2001), who also suggested alternatives.
The PITF team’s Phase III Findings include both global models and models for specific geographic regions and types of failure (the Sub-Saharan Africa model, the Muslim Countries model, and the Ethnic War model). For this example we select the Sub-Saharan region to demonstrate efficient model/variable selection and statistical forecasting based on Bayesian model averaging (BMA) with the GLSS. Following both the PITF team’s and King & Zeng’s approach, we adopt the case-control method for this rare-events study: 44 cases of state failure are observed in the subsample, and 120 control cases are randomly drawn from the pool of Sub-Saharan countries between 1955 and 1998. We do not actually perform this random sampling; instead, we reuse the control cases from the Task Force’s original study for a clean comparison. To narrow down the set of 1200 variables, the Task Force used a considerable amount of prior substantive information about nations and politics and selected 43 variables at the first stage of variable selection. They also reported the explanatory variables tested by their univariate t-tests and stepwise regression, categorizing the candidates for the final model into three groups: (1) political and leadership (17 variables), (2) demographic and societal (13 variables), and (3) economic and environmental (13 variables). We basically use those 43 variables in this example. But two of the remaining variables were not fully identified in the documentation, and dynamic changes of several variables are also of interest, so the resulting number of variables available for use here is 48.
Unfortunately, the amount of missing data is quite large here, and the PITF analysis uses casewise deletion to handle the problem. This effectively prevents them from doing model selection, since including or excluding certain variables changes the set of observations, and the models are then not comparable regardless of what criterion or method is used. Accordingly, we use multiple imputation to fill in the missing data and apply the logistic S-S model to each of the resulting complete data sets. Note that in case-control studies the logit model is consistent in terms of estimating relative rates, since its coefficients have the nice interpretation of changes in the log odds-ratio. The probit model should not be used in this case, although in other settings the choice between the probit and logit models can be arbitrary.
5.2.1 Step 1: Determine Variable Inclusion Probabilities
The specific variable selection method used by the Task Force is stepwise forward and backward selection from the three clusters of covariates mentioned above, which ended with a model containing 7 explanatory variables. The criteria they adopted were that, with the forward selection approach, the process ended when “further additions no longer improved the model’s overall fit with the data beyond a given threshold,” and with the backward selection approach, the selected variables remained in the model when “further deletions significantly impaired the model’s overall fit.” Setting aside the skepticism about their model comparison due to the casewise deletion of missing values, the number of models they would have to estimate and compare is as large as
\[
\sum_{i=1}^{17} \sum_{j=1}^{13} \sum_{k=1}^{13} \binom{17}{i}\binom{13}{j}\binom{13}{k} \approx 8.79 \times 10^{12}.
\]
It is impossible for the Task Force to have checked all of these combinations of variables, but they report having spent years on variable and model selection. Since the process of variable selection is accompanied by hypothesis testing on the individual coefficients, this is better than univariate t-tests, since collinearity is taken into account. We use the logit S-S model to simultaneously conduct hypothesis tests on the 48 individual coefficients for the variable selection purpose, and we compare the logit S-S model with the two alternative hypothesis testing methods. Note that, for variable selection, the standard maximum likelihood NHST approach is not appropriate for this “kitchen-sink” model, since p-values are not the probability that a coefficient is non-zero, meaning that they provide no information about the inclusion of “non-significant” variables. In addition, the ratio of the sample size to the number of explanatory variables (3.375) precludes using the NHST. The Bayesian one-sided tests are relevant for testing the directions of effects but do not help with variable selection. In Tables 7 and 8, we report the results from all three types of testing. To ease communication, we report both probabilities p̂(β > 0|D) and p̂(β ≤ 0|D) based on the Bayesian logit model with diffuse priors for β.
Table 7: Hypothesis Testing of Individual Covariates

                                                       Spike-Slab    Diffuse Bayesian           MLE
No.  Variable                                          p̂(β=0|D)    p̂(β>0|D)   p̂(β≤0|D)    p value

Variables in PITF’s final model
1    Trade Openness                                     0.000†       0.257       0.743        0.580
2    Total Population                                   1.000        0.226       0.774        0.697
3    Regime Type (score)                                0.000†       1.000†      0.000†       0.009†
4    Regime Type (dichotomous)                          0.000†       1.000†      0.000†       0.000†
5    Colonial Heritage (British)                        1.000        0.843       0.157        0.374
6    Colonial Heritage (French)                         0.865        0.222       0.778        0.510
7    Discrimination                                     0.560        0.997†      0.003†       0.032†
8    Leader’s Tenure                                    1.000        0.275       0.725        0.518

Cluster I: Political and Leadership
9    Change in democracy                                0.000†       0.996†      0.004†       0.032†
10   Economic discrimination                            1.000        0.649       0.351        0.684
11   Political discrimination                           0.570        0.192       0.808        0.374
12   Separatist activity                                0.853        0.566       0.434        0.731
13   Party fractionalization                            0.192        0.374       0.626        0.998
14   Parliamentary responsibility                       1.000        0.272       0.728        0.568
15   Party legitimacy                                   0.000†       0.941†      0.059†       0.166
16   Ruling elite’s class character                     0.433        0.845       0.155        0.443
17   Ruling elite’s ideology                            1.000        0.312       0.688        0.850
18   Regime duration                                    0.000†       0.999†      0.001†       0.014†
19   Freedom House political rights index               0.632        0.850       0.150        0.404
20   Freedom House civil liberties index                0.399        0.825       0.175        0.455
21   Amnesty International political terror scale       0.420        0.986†      0.014†       0.054†
22   US Department of State political terror index      0.361        0.067†      0.933†       0.131
23   Neighbors in major armed conflict                  0.921        0.311       0.689        0.853
24   Neighbors in major civil/ethnic conflict           0.942        0.723       0.277        0.771
25   Membership in regional organizations               0.777        0.110       0.890        0.280

“†” highlights the covariates that are identified as statistically reliable (in the two-sided tests) or have a certain direction of effect (in the one-sided test) on the response variable with the effect size of 0.10. We report the variables not in the PITF final model by placing them in their clusters as defined by the PITF.
Table 8: Hypothesis Testing of Individual Covariates (Cont.)

                                                       Spike-Slab    Diffuse Bayesian           MLE
No.  Variable                                          p̂(β=0|D)    p̂(β>0|D)   p̂(β≤0|D)    p value

Cluster II: Demographic and Societal
26   Youth bulge                                        1.000        0.467       0.533        0.747
27   Labor force as a percent of population             1.000        0.030†      0.970†       0.125
28   Infant mortality                                   1.000        0.266       0.734        0.738
29   Annual change in infant mortality                  0.715        0.096†      0.904†       0.286
30   Life expectancy                                    1.000        0.256       0.744        0.678
31   Secondary school enrollment ratio                  0.899        0.968†      0.032†       0.176
32   Change in secondary school enrollment ratio        1.000        0.130       0.870        0.451
33   Calories per capita                                0.000†       0.007†      0.993†       0.041†
34   Urban population                                   1.000        0.871       0.130        0.570
35   Urban population growth rate                       0.854        0.305       0.695        0.742
36   Ethno-linguistic fractionalization                 0.747        0.052†      0.948†       0.307
37   Trading partner concentration                      1.000        0.239       0.762        0.725

Cluster III: Economic and Environmental
38   GDP per capita                                     0.000†       0.101       0.899        0.496
39   Change in GDP per capita                           0.000†       0.038†      0.962†       0.257
40   Change in reserves                                 1.000        0.300       0.700        0.590
41   Government debt                                    0.425        0.782       0.218        0.520
42   Trade with OECD countries                          0.000†       0.996†      0.004†       0.091†
43   Annual change in inflation rate                    0.438        0.974†      0.026†       0.085†
44   Cropland area                                      0.047†       0.190       0.810        0.467
45   Irrigated land                                     0.168        0.988†      0.012†       0.146
46   Access to safe water                               0.850        0.196       0.804        0.436
47   Damage due to drought                              0.659        0.088†      0.912†       0.329
48   Famine                                             0.667        0.179       0.821        0.435

Number of statistically reliable variables                11           18          18            9

“†” highlights the covariates that are identified as statistically reliable (in the two-sided tests) or have a certain direction of effect (in the one-sided test) on the response variable with the effect size of 0.10. We report the variables not in the PITF final model by placing them in their clusters as defined by the PITF.
Note that, for the most part, low p-values are associated with low values of p̂(β = 0|D), even though this probability is not at all related to the convention in political science of running sequential Wald tests down regression tables with an implied or stated 0.10 threshold. In addition, the one-sided tests identify more variables as having positive or negative effects than the two-sided tests identify as non-zero. The hypothesis tests on individual coefficients based on all three models are dramatically different from the results of the univariate t-tests used by the Task Force. This is likely due to the different handling of missing data (multiple imputation versus casewise deletion) and the consideration of collinearity. There are 11 variables found to be statistically reliable by the logit S-S model according to their marginal probabilities p̂(βk = 0|D). Note that most of the variables in the Task Force’s final model are significant based on their t-tests. Two of the variables found significant in their final model, Trade Openness and Regime Type, are also identified by the logit S-S model as statistically reliable.
5.2.2 Step 2: Determine Model Probabilities
The spike and slab model conducts model and variable selection simultaneously. Since, in each MCMC iteration, the covariates whose associated effect parameters are allocated to the spike part are erased from the full model, each iteration effectively estimates a submodel, and the submodels are weighted by their posterior probabilities. For this analysis of the Sub-Saharan state failure data, the logit S-S model identifies 25 submodels, which are presented in Table 9. In the table, we exclude only the 14 variables that never appeared in any submodel (refer to Tables 7 and 8 for the variable numbers). No submodel has a posterior probability greater than 0.20; among the 25 models, 22 have very small posterior probabilities, and the first three submodels have posterior weights much larger than the others. The best model, M1, includes 13 covariates; the second best, 19 covariates; and the third best, 11 covariates, so they are not as parsimonious as the final model used by the Task Force. However, this sparseness is the most we can achieve by adjusting the shrinkage effect of the priors: further increases in shrinkage cause all the covariates to be shrunk to zero in this particular case. This could be attributed to the trade-off between mixing and shrinkage, but it may also reflect that the underlying mechanism of state failure is not sparse and needs to be explained with many factors.
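A hypothetical R sketch of this bookkeeping: each MCMC iteration’s vector of spike/slab allocations defines the submodel visited at that iteration, and the posterior model probabilities are simply the visit frequencies.

## delta_draws: iterations x K matrix of 0/1 inclusion indicators from the sampler.
submodel_probs <- function(delta_draws) {
  keys <- apply(delta_draws, 1, paste, collapse = "")  # one key per visited model
  sort(table(keys) / nrow(delta_draws), decreasing = TRUE)
}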
5.2.3 Step 3: Forecasting
The Task Force reported a very successful forecasting performance for their final model: “[o]ur Sub-Saharan Africa model correctly classified 80 percent of the historical cases from which it was estimated.” However, the criterion they used to evaluate forecasting performance is not clear.
Table 9: Models with Posterior Model Probabilities Based on the S-S Logit Model

Model   Prob.      Model   Prob.      Model   Prob.      Model   Prob.      Model   Prob.
M1      0.1699     M6      0.0582     M11     0.0320     M16     0.0213     M21     0.0079
M2      0.1212     M7      0.0475     M12     0.0297     M17     0.0208     M22     0.0052
M3      0.1208     M8      0.0472     M13     0.0268     M18     0.0109     M23     0.0051
M4      0.0731     M9      0.0375     M14     0.0235     M19     0.0103     M24     0.0031
M5      0.0622     M10     0.0341     M15     0.0224     M20     0.0085     M25     0.0008

The covariates appearing in at least one of the 25 submodels are numbers 1, 3, 4, 6, 7, 9, 11, 12, 13, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 29, 31, 33, 35, 36, 38, 39, 41, 42, 43, 44, 45, 46, 47, and 48 (see Tables 7 and 8 for the variable names).
In addition, no ex ante cutoff values for classifying failures and non-failures were reported, and the threshold was very likely set post hoc, i.e., based on the prediction results. In fact, evaluating forecasting performance involves sophisticated decision theory and should depend on an assessment of the relative costliness of mispredicting failures and non-failures. The Task Force’s misevaluation of forecasting performance has already been criticized by King & Zeng (2001). Here we do not readdress the methodological problems in the Task Force’s forecast evaluation; instead, our interest centers on the comparison between BMA based on the GLSS model and single best models in terms of forecasting. We first investigate the in-sample forecasts of BMA and the single models, and then evaluate their out-of-sample forecasting performance by adopting the predictive log score and the risk classification proposed and applied by Hoeting et al. (1999). These evaluations suggest that BMA based on the GLSS model smooths forecasting and reduces the risk of misprediction by taking model uncertainty (DGP uncertainty) into account.
In Figure 3, the left and middle panels present the in-sample predictive probabilities of BMA based on the logit S-S model and of all 25 submodels identified in Table 9. These two panels show clearly that BMA smooths the predictions by averaging the predicted probabilities of the submodels with their weights. To evaluate the performance of BMA and the three single models with the highest posterior probabilities, we use a Receiver Operating Characteristic (ROC) curve, which was also used by King & Zeng (2001) to compare their model with the Task Force’s global model in prediction. As they pointed out, the ROC curve is free of any specific threshold based on the costs of the two types of misclassification (failures and non-failures). A model whose ROC curve lies above those of the other models is superior in forecasting, simply because the highest curve means that, given any level of correct classification of one type (say, failures), that model achieves a higher rate of correct classification of the other type (say, non-failures) than the models with lower curves. The right panel in Figure 3 demonstrates that the ROC curve of BMA dominates those of the three best single models, except in the short range of 0.45 to 0.55 of correct classification of failures, where it is slightly worse than single model I. BMA performs much better than the single models when the cost of misclassifying failures as non-failures is assessed to be high (accordingly, the threshold is set low and the rate of correct classification of failures is high), which is reflected in the right part of the curves. This superiority of BMA is especially valuable to policy makers, since fewer resources, such as foreign aid sent to high-risk countries to prevent state failure, will be wasted by allocating them mistakenly to “safe” countries.
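The ROC comparison can be sketched in R as follows (hypothetical names; p_hat is a vector of predicted failure probabilities): for each threshold, compute the proportions of 1’s and 0’s correctly classified and trace the curve.

roc_points <- function(y, p_hat, thresholds = sort(unique(p_hat))) {
  t(vapply(thresholds, function(th) {
    pred <- as.numeric(p_hat >= th)
    c(ones_correct  = mean(pred[y == 1] == 1),   # failures correctly classified
      zeros_correct = mean(pred[y == 0] == 0))   # non-failures correctly classified
  }, numeric(2)))
}

## e.g., plot(roc_points(y, p_bma), type = "l"), then overlay the submodels' curves.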
We then compare the out-of-sample forecasting performance of BMA and the “best” single models. First, we compute the predictive log scores of BMA and the single models using the formulas in equations (13) and (14). We split the dataset into build and treat groups: although it is not necessary to split the dataset equally, we randomly sample 81 of the 162 observations, containing 22 cases of state failure, into the build group and place the rest in the treat group.²
Figure 3: Within-Sample Predictive Probabilities: BMA versus Single Models. The left panel plots the predictive probabilities of 1’s (failures) and the middle panel the predictive probabilities of 0’s (non-failures) for BMA and SubModels 1, 2, and 3; in both panels, the predictive probabilities are plotted from “worst prediction” to “best prediction” for failures and from “best prediction” to “worst prediction” for non-failures, sorted according to the BMA forecasts. The right panel plots the ROC curves (the proportion of 1’s correctly classified against the proportion of 0’s correctly classified) for BMA and the three submodels.
We estimate the logit S-S model with the build group data. It identifies 27 submodels, with the two top submodels having much higher posterior probabilities than the others (0.2134 and 0.1983, respectively). We estimate these two single models with diffuse priors. We then approximate the predictive log scores by numerically integrating the parameters out to obtain the predictive probabilities of the observations in the treat group. The scores are reported in Table 10. A smaller log score indicates a better model; the score of BMA is about 8 points better than those of the single models, or roughly 9.88% per observation.
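Since equations (13) and (14) appear earlier in the paper and are not reproduced here, the following R sketch assumes the usual form of the predictive log score for binary outcomes used by Hoeting et al. (1999): the negative sum of the log predictive probabilities of the observed treat-group outcomes.

## y_treat: observed 0/1 outcomes in the treat group;
## p_hat: predictive Pr(y = 1), with parameters (and, for BMA, models)
## integrated out using the build group.
log_score <- function(y_treat, p_hat) {
  -sum(y_treat * log(p_hat) + (1 - y_treat) * log(1 - p_hat))
}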
As stated by Hoeting et al. (1999), in practice it is often more meaningful to forecast by classifying subjects into discrete risk categories than by continuous numerical prediction. This is especially true for this state failure study, since preemptive policies can be calibrated to different categories but are difficult to adjust to continuous numerical predictive values. We follow the procedure of Hoeting et al. to conduct risk classification (a code sketch follows the list):
1. Estimate the models with the build group data and obtain the estimates β̂ (interval estimates in this case);
2. Compute the risk score x_i β̂ for each subject i;
3. Define low, medium, and high risk groups for the model using empirical quantiles of the calculated risk scores;
4. Calculate the risk scores of the subjects in the treat group and classify the subjects into the three groups;
5. Check the observed survival status (no failure) of the subjects assigned to each group.
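A minimal R sketch of the procedure above (variable names such as X_build and y_treat are hypothetical):

risk_classify <- function(X_build, beta_hat, X_treat, probs = c(1/3, 2/3)) {
  build_scores <- drop(X_build %*% beta_hat)        # step 2: risk scores
  cuts <- quantile(build_scores, probs = probs)     # step 3: empirical cutoffs
  treat_scores <- drop(X_treat %*% beta_hat)        # step 4: score the treat group
  cut(treat_scores, breaks = c(-Inf, cuts, Inf),
      labels = c("Low", "Medium", "High"))
}

## Step 5: cross-tabulate assigned risk group against observed failure status.
## table(risk_classify(X_build, beta_hat, X_treat), y_treat)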
A model is better if it assigns higher (lower) risks to the country-years that actually experienced a failure (no failure). We use three different choices of cutoff values for classification. The first uses the 1/3 and 2/3 quantiles, a naive choice; the second uses the 40% and 60% percentiles, on the consideration that the medium risk group may be less interesting to policy makers than the two extreme groups; and the third, the 20% and 40% percentiles, is based on the assumption that mispredicting state failure is costly, so it may be better to adopt policies that help more countries by setting the threshold for the high risk class low. We compare the BMA model with the two best single models and report the results in Table 10.
² We do not split the sample according to years, such as constructing the build group from countries in 1955 to 1978 and using the rest as the treat group. This is because the case-control design destroys the time dimension of the data; the data are treated as cross-sectional. Also, no dynamics are modeled, so long-run forecasting would not make sense. Neither do we use the country-years excluded by the resampling of the case-control study, because there is no state failure in those country-years.
Table 10: Classification for Predictive Discrimination

Cutoff Values (Percentiles): 33.33%, 66.67%
                    BMA                   M1                    M2
Risk Group     S    F    %F          S    F    %F         S    F    %F
Low           23    5   17.85%      21   11   34.38%     19    8   29.63%
Med           15    9   37.50%      15    3   16.67%     19    8   29.23%
High          21    8   27.58%      23    8   25.81%     21    6   22.22%

Cutoff Values (Percentiles): 40.00%, 60.00%
Low           25    7   21.87%      24   11   31.43%     21   11   34.38%
Med           11    6   35.29%       7    1   12.50%     13    4   42.86%
High          23    9   28.13%      28   10   26.32%     25    7   21.88%

Cutoff Values (Percentiles): 20.00%, 40.00%
Low           12    4   25.00%      16    9   36.00%     16    6   27.27%
Med           14    3   17.65%       8    2   20.00%      4    3   48.0%
High          33   15   31.25%      35   11   23.91%     39   13   25.00%

Predictive Log Score   97.112            104.270            105.370

In the table, S denotes “Survival” and F “Failure.” The two single models are the models with the highest posterior probabilities: p(M1 |D) = 0.2134 and p(M2 |D) = 0.1983.
Despite the different cutoff values for risk classification, BMA performs better than the single models: the country-years assigned to the high risk group by BMA have a higher state failure rate than those assigned high risk by the two single models, and the country-years classified as low risk by BMA have state failure rates lower than those assigned to the low risk group by the single models. All of the models assign to the medium risk group subjects that actually experienced a very high rate of state failure, sometimes higher than that of the high risk group. This can be attributed to the small sample size of 81 in the build group: the models cannot be very precise in terms of either inference or forecasting. However, we evaluate only relative performance here.
It is not surprising that in this particular case BMA performs much better than the single models in out-of-sample forecasting but is very similar to, and only slightly better than, the best single models in within-sample prediction. The logit S-S model with all 48 covariates identifies 25 submodels, and three of them have much higher probabilities than the remaining 22. The top three models forecast similarly, and since BMA is based mostly on these three models, it provides similar within-sample predictive probabilities. In out-of-sample prediction, however, the DGP of the treat group is inferred only from the build group, and the most likely DGP for the latter is not necessarily exactly the same for the former, even though the two groups are selected randomly. With a sample size of 81 in the build group, the uncertainty about the true DGP suggested by the data is important and should not be ignored when drawing inferences and conducting forecasts. BMA takes model uncertainty into account and uses the information in the build group on all plausible DGPs, and therefore forecasts better than models that treat one of the DGPs for the build group as the only “true” DGP for the treat group. The single models are riskier because they ignore the uncertainty of the DGP(s); this is especially true in out-of-sample forecasting, where the uncertainty about a DGP inferred from different observations is high.
6 Conclusion
By extending the Bayesian linear spike and slab model to nonlinear (qualitative) outcomes, which has not been accomplished to date, we specify mixture prior distributions for each of the coefficient parameters. The spike part of this prior helps detect coefficients that are actually of zero magnitude by shrinking them to zero in posterior terms, whereas the slab part provides the truly non-zero coefficients with a diffuse normal prior, thus not requiring extensively informed prior knowledge. This means that the model space and the parameter space are determined simultaneously through the estimation process.
The primary contribution of this work is to exploit Bayesian prior specification and the engine of Bayesian stochastic simulation to incorporate variable inclusion decisions, model comparison, and hypothesis testing directly and simultaneously into the estimation process. This process is built on linear spike and slab strategies, whose authors were unaware of the additional inferential information, but extends them to the sort of nonlinear outcomes regularly used in empirical political science. The setup leads directly to more reliable posterior predictions, since it is straightforward to average, in the Bayesian sense, across models with high probability. This avoids the primary criticism of standard Bayesian model averaging, whereby the researcher subjectively picks the variables in the first set of models (Raftery 1995, Hoeting et al. 1999).
In the Bayesian paradigm, all unknown quantities are treated probabilistically, including alternative model specifications. Therefore the spike and slab prior approach to model choice requires a full commitment to the tenets of Bayesian inference, including informed priors, conditioning on the data, and full posterior description. We make the mechanics of this process straightforward with an R package (spike.slab). As the examples here show, additional information in the data can lead to more informed model choice and better posterior predictions.
7 References
Aitkin, M. (1991). “Posterior Bayes Factors.” Journal of the Royal Statistical Society: Series B 53:111-142.
Aitkin, M. (1997). “The Calibration of P-values, Posterior Bayes Factors and the AIC from the Posterior Distribution of the Likelihood (with discussion).” Statistics and Computing 7:253-272.
Albert, J. H., and S. Chib. (1993). “Bayesian Analysis of Binary and Polychotomous Response Data.” Journal of the American Statistical Association 88:669-679.
Alston, C. L., K. Mengersen, J. M. Thompson, P. J. Littlefeld, D. Perry, and A. J. Ball. (2004). “Statistical Analysis of Sheep CAT Scan Images Using a Bayesian Mixture Model.” Australian Journal of Agricultural Research 55:57-68.
Anderson, David R., Kenneth P. Burnham, and William L. Thompson. (2000). “Null Hypothesis Testing: Problems, Prevalence, and an Alternative.” Journal of Wildlife Management 64:912-923.
Barbieri, M., and J. Berger. (2004). “Optimal Predictive Model Selection.” Annals of Statistics 32:870-897.
Barnett, Vic. (1973). Comparative Statistical Inference. New York: Wiley
Bayarri, M. J., and J. Berger. (2000). “P-Values for Composite Null Models (with discussion).” Journal of the American Statistical Association 95:1127-1142.
Berger, J. O. and Pericchi, L. (1998). “Accurate and Stable Bayesian Model Selection: the Median Intrinsic
Bayes Factor.” Sankhya B 60: 1-18.
Berger, J. O., L. D. Brown, and R. L. Wolpert. (1994). “A Unified Conditional Frequentist and Bayesian Test
for Fixed and Sequential Simple Hypothesis Testing.” Annals of Statistics 22:1787-1807.
Berger, J. O., and R. L. Wolpert. (1984). The Likelihood Principle. Hayward, CA: Institute of Mathematical Statistics Monograph Series.
Berger, James O., B. Boukai, and Y. Wang. (1997). “Unified Frequentist and Bayesian Testing of a Precise
Hypothesis.” Statistical Science 12:133-160.
Berger, James O., and Thomas Sellke. (1987). “Testing a Point Null Hypothesis: The Irreconciliability of P
Values and Evidence.” Journal of the American Statistical Association 82:112-122.
Bernardo, José M. (1984). “Monitoring the 1982 Spanish Socialist Victory: A Bayesian Analysis.” Journal of the American Statistical Association 79:510-515.
Binder, S. A. (1996). “The Partisan Basis of Procedural Choice: Allocating Parliamentary Rights in the House,
1789-1990.” American Political Science Review 90:8-22.
Binder, S. A. (2006). “Parties and Institutional Choice Revisited.” Legislative Studies Quarterly 31:513-532.
Birnbaum, A. (1962). “On the Foundations of Statistical Inference.” Journal of the American Statistical Association 57:269-306.
Brandstätter, Eduard. (1999). “Confidence Intervals as an Alternative to Significance Testing.” Methods of Psychological Research Online 4:32-46.
Breslow, N. E. (1996). “Statistics in Epidemiology: The Case-Control Study.” Journal of the American Statistical Association 91:14-28.
Brown, P. J., M. Vannucci and T. Fearn. (1998). “Multivariate Bayesian Variable Selection and Prediction.” J.
Royal Statist. Society Series B 60: 627-641.
Burkhardt, Hans, and Alan H. Schoenfeld. (2003). “Improving Educational Research: Toward a More Useful, More Influential, and Better-funded Enterprise.” Educational Researcher 32:3-14.
Carlin, B., and S. Chib. (1995). “Bayesian Model Choice via Markov Chain Monte Carlo Methods.” Journal of the Royal Statistical Society: Series B 57:473-484.
Carver, Ronald P. (1978). “The Case Against Statistical Significance Testing.” Harvard Educational Review 48:378-99.
Carver, Ronald P. (1993). “Merging the Simple View of Reading with Rauding Theory.” Journal of Reading
Behavior 25:439-455.
Casella, G., and R. L. Berger. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing
Problem.” Journal of the American Statistical Association 82:106-111.
Casella, George, and R. L. Berger. (2001). Statistical Inference. Belmont, CA: Duxbury Advanced Series.
Chatfield, C. (1995). “Model Uncertainty, Data Mining, and Statistical Inference (with discussion).” Journal of the Royal Statistical Society: Series A 158:419-466.
Chib, S., (1995). “Marginal Likelihood from the Gibbs Output.” Journal of the American Statistical Association
90: 1313-21.
Chib, S., and E. Greenberg. (1995). “Understanding the Metropolis-Hastings Algorithm.” The American Statistician 49:327-35.
Chib, S., and I. Jeliazkov, (2001). “Marginal Likelihood from the Metropolis-Hastings Output.” Journal of the
American Statistical Association 96: 270-81.
Chipman, H., E. I. George, and R. E. McCulloch. (2001). “The Practical Implementation of Bayesian Model
Selection (with discussion)”. In Model Selection, ed. P. Lahiri. OH: IMS, Beachwood, pp. 65-134.
Clyde, M., and E. I. George. (2004). “Model Uncertainty.” Statistical Science 19:81-94.
Cohen, Jacob. (1962). “The Statistical Power of Abnormal-Social Psychological Research: A Review.” Journal
of Abnormal and Social Psychology 65:145-153.
Cohen, Jacob. (1977). Statistical Power Analysis for the Behavioral Sciences. Second Edition. New York:
Academic Press.
Cohen, Jacob. (1988). Statistical Power Analysis for the Behavioral Sciences. Second Edition. Hillsdale, NJ:
Lawrence Erlbaum Associates.
Cohen, Jacob. (1992). “A Power Primer.” Psychological Bulletin 112:115-159.
Cohen, Jacob. (1994). “The Earth is Round (p < .05).” American Psychologist 49:997-1003.
Congdon, P. (2001). Bayesian Statistical Modeling. Wiley, England.
Craiu, R.V., J.S. Rosenthal and C. Yang. (forthcoming). “Learn From Thy Neighbor: Parallel-Chain Adaptive
MCMC.” Journal of the American Statistical Association.
Denis, Daniel J. (2005). “The Modern Hypothesis Testing Hybrid: R. A. Fisher’s Fading Influence.” Journal de la Société Française de Statistique 145:5-26.
Devroye, Luc. (1981). “The Series Method in Random Variate Generation and Its Application to the Kolmogorov-Smirnov Distribution.” American Journal of Mathematical and Management Sciences 1:359-379.
Devroye, Luc, (1986). Non-Uniform Random Variate Generation. Springer-Verlag, New York.
Dijkstra, T. K. (1988). On Model Uncertainty and Its Statistical Implications. Springer, Berlin.
Draper, D. , (1995). “Assessment and Propagation of Model Uncertainty.” J. Royal Statist. Society Series B 57:
45-97.
Falk, Ruma, and Charles W. Greenbaum. (1995). “Significance Tests Die Hard: The Amazing Persistence of a
Probabilistic Misconception.” Theory & Psychology 5:75-98.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. (1995). Bayesian Data Analysis. New
York: Chapman & Hall.
George, E. I., and R. E. McCulloch. (1997). “Approaches for Bayesian Variable Selection.” Statistica Sinica
7:339-373.
George, Edward I., and Robert E. McCulloch. (1993). “Variable Selection Via Gibbs Sampling.” Journal of the
American Statistical Association 88: 881-889.
George, E. I. (1999). “Bayesian model selection,” in Encyclopedia of Statistical Sciences Update 3. Wiley, New
York.
Geweke, J. (1996). “Bayesian Inference for Linear Models Subject to Linear Inequality Constraints” In Modeling
and Prediction: Honouring Seymour Geisser, ed. W. O. Johnson, J. C. Lee, and A. Zellner. New York:
Springer.
Gigerenzer, Gerd. (1987). “Probabilistic Thinking and the Fight Against Subjectivity.” In The Probabilistic
Revolution, ed. Lorenz Krüger, Gerd Gigerenzer, and Mary Morgan. Vol. 2 Cambridge, MA: MIT.
Gigerenzer, Gerd. (1993). “The Superego, the Ego, and the ID in Statistical Reasoning.” In A Handbook for
Data Anlysis in the Behavioral Science: Methodological Issues, ed. G. Keren, and C. Lewis. Hillsdale, NJ:
Lawrence Erlbaum Associates.
Gigerenzer, Gerd. (1998). “We Need Statistical Thinking, not Statistical Rituals.” Behavioral and Brain Sciences
21:199-200.
Gigerenzer, Gerd, and D. J. Murray. (1987). Cognition as Intuitive Statistics. Hillsdale, NJ: Lawrence Erlbaum Associates.
Gilks, W. R., and P. Wild. (1992). “Adaptive Rejection Sampling for Gibbs Sampling.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 41:337-348.
Gill, Jeff. (1999). “The Insignificance of Null Hypothesis Significance Testing.” Political Research Quarterly
52:647-674.
Gill, Jeff. (2005). “An Entropy Measure of Uncertainty in Vote Choice.” Electoral Studies 24:371-392.
Gill, Jeff. (2007). Bayesian Methods for the Social and Behavioral Sciences. Second Edition. New York: Chapman
& Hall.
Gilner, Jeffrey A., Nancy L. Leech, and George A. Morgan. (2002). “Problems with Null Hypothesis Significance
Testing (NHST): What do the Textbooks Say.” Journal of Experimental Education 71:83-92.
Godsill, S. J. (2001). “On the Relationship between Markov Chain Monte Carlo Methods for Model Uncertainty.” Journal of Computational and Graphical Statistics 10:230-248.
Grayson, D. A. (1998). “The Frequentist Facade and the Flight from Evidential Inference.” British Journal of
Psychology 89:325-345.
Green, P. J. (1995). “Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination.” Biometrika 82:711-732.
Greenland, S. (2000). “Principles of Multilevel Modelling.” International Journal of Epidemiology 29:158-167.
Greenland, Sander. (2008). “Invited Commentary: Variable Selection versus Shrinkage in the Control of Multiple Confounders.” American Journal of Epidemiology Advance Access:1-7.
Greenwald, Anthony G. (1975). “Consequences of Prejudice Against the Null Hypothesis.” Psychological Bulletin
82:1-20.
Greenwald, Anthony G., R. Gonzalez, R. J. Harris, and D. Guthrie. (1996). “Effect Sizes and p Values: What Should Be Reported and What Should Be Replicated?” Psychophysiology 33:175-183.
Han, Cong, and Bradley Carlin. (2001). “Markov Chain Monte Carlo Methods for Computing Bayes Factors: A
Comprehensive Review.” Journal of the American Statistical Association 96:1122-1132.
Harlow, Lisa L., Stanley A. Mulaik, and James H. Steiger. (1997). What If There Were No Significance Tests? (Multivariate Applications). Lawrence Erlbaum.
Hastings, W. K., (1970). “Monte Carlo Sampling Methods Using Markov Chains and Their Applications.”
Biometrika 57:97-109.
Hoeting, J. A., D. Madigan, A. E. Raftery, and C. T. Volinsky. (1999). “Bayesian Model Averaging: A Tutorial (with discussion).” Statistical Science 14:382-401.
Holmes, C. C., and L. Held. (2006). “Bayesian Auxiliary Variable Models for Binary and Multinomial Regression.” Bayesian Analysis 1:145-168.
Howson, Colin, and Peter Urbach. (1993). Scientific Reasoning: The Bayesian Approach. Second Edition.
Chicago: Open Court.
Hunter, John E. (1997). “Needed: A Ban on the Significance Test.” Psychological Science Special Section 8:3-7.
Hunter, John E., and Frank L. Schmidt. (1990). Methods of Meta-Analysis: Correcting Error and Bias in
Research Findings. Beverly Hills: Sage.
Hwang, J. T., G. Casella, C. P. Robert, M. T. Wells, and R. H. Farrell. (1992). “Estimating Accuracy in Testing.”
Annals of Statistics 20:490-509.
Ishwaran, Hemant, and J. Sunil Rao. (2003). “Detecting Differentially Expressed Genes in Microarrays Using
Bayesian Model Selection.” Journal of the American Statistical Association 98:438-455.
Ishwaran, Hemant, and J. Sunil Rao. (2005). “Spike and Slab Gene Selection for Multigroup Microarray Data.”
Journal of the American Statistical Association 100:764-780.
Ishwaran, Hemant, and J. Sunil Rao. (2008). “Clustering Gene Expression Profile Data by Selective Shrinkage.”
Statistics & Probability Letters 78:1490-1497.
Hoti, F., and M. J. Sillanpää. (2006). “Bayesian Mapping of Genotype × Expression Interactions in Quantitative
and Qualitative Traits.” Heredity 97: 4-18.
Jeffreys, Harold. (1961). The Theory of Probability. Oxford: Clarendon Press.
Ji, Chunlin, and Scott C. Schmidler. (2007). “Adaptive Markov Chain Monte Carlo for Bayesian Variable
Selection.” Unpublished manuscript.
Kass, Robert E., and Adrian E. Raftery. (1995). “Bayes Factors.” Journal of the American Statistical Association
90:773-795.
Kilpikari, R., and M. J. Sillanpää. (2003). “Bayesian Analysis of Multilocus Association in Quantitative and
Qualitative Traits.” Genet. Epidemiol. 25: 122-135.
King, G., and L. Zeng. (2001). “Improving Forecasts of State Failure.” World Politics 53: 623-658.
King, G., and L. Zeng. (2001). “Logistic Regression in Rare Events Data.” Political Analysis 9: 137-163.
Kirk, Roger E. (1996). “Practical Significance: A Concept Whose Time Has Come.” Educational and Psychological Measurement 56:746-759.
Krueger, Joachim I. (1999). “Do We Need Inferential Statistics? Reply to Hallahan on Social-Bias.” PSYCOLOQUY 10(004).
Kuo, L., and B. Mallick. (1998). “Variable Selection for Regression Models.” Sankhya Ser. B 60: 65-81.
Kyung, MinJung, Jeff Gill, and George Casella. (2008). “Estimation in Dirichlet Random Effects Models.”
Technical Report, Center for Applied Statistics, Washington University.
http://polmeth.wustl.edu/workingpapers.php?text=probit+mixed+Dirichlet+random+effects+model&searchkeywords=T&order=dateposted
Kyung, Minjung, Jeff Gill, and George Casella. (2009). “Penalized Regression, Standard Errors, and Bayesian
Lassos.” Technical Report, Center for Applied Statistics, Washington University.
http://polmeth.wustl.edu/workingpapers.php?text=Bayesian+hierarchical+model&searchkeywords=T&order=dateposted
Lacy, M. G. (1997). “Efficiently Studying Rare Events: Case-Control Methods for Sociologists.” Sociological
Perspectives 40: 129-154.
Leamer, E. E. (1978). Specification Searches: Ad Hoc Inference with Nonexperimental Data. New York: John
Wiley & Sons.
Lehmann, E. L. (1993). “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?”
Journal of the American Statistical Association 88:1242-1249.
Lindley, D. V. (1961). “The Use of Prior Probability Distributions in Statistical Inference and Decision.”
Proc. Fourth Berkeley Symp. Math. Statist. Prob., Univ. of Calif. Press, 453-468.
Lindsay, R. M. (1995). “Reconsidering the Status of Tests of Significance: An Alternative Criterion of Adequacy.”
Accounting, Organizations and Society 20:35-53.
Loftus, Geoffrey R. (1991). “On the Tyranny of Hypothesis Testing in the Social Sciences.” Contemporary
Psychology 36:102-105.
Loftus, Geoffrey R. (1993a). “Editorial Comment.” Memory & Cognition 21:1-3.
Loftus, Geoffrey R. (1993b). “Visual Data Representation and Hypothesis Testing in the Microcomputer Age.”
Behavior Research Methods, Instrumentation, and Computers 25:250-256.
Loftus, Geoffrey R. (1996). “Psychology will be a Much Better Science when we Change the Way we Analyze
Data.” Current Directions in Psychological Science 161-171.
Loftus, Geoffrey R., and D. Bamber. (1990). “Weak Models, Strong Models, Unidimensional Models, and
Psychological Time.” Journal of Experimental Psychology: Learning, Memory, and Cognition 16:916-926.
Long, J. Scott. (1997). Regression Models for Categorical and Limited Dependent Variables. London: Sage
Publications.
Macdonald, Ranald R. (1997). “On Statistical Testing in Psychology.” British Journal of Psychology 88:333-349.
Madigan, D. and York, J. (1995). “Bayesian Graphical Models for Discrete Data.” Internat. Statist. Rev. 63:
215-232.
Meehl, Paul E. (1967). “Theory-Testing in Psychology and Physics: A Methodological Paradox.” Philosophy
of Science 34:103-115.
Meehl, Paul E. (1978). “Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of
Soft Psychology.” Journal of Consulting and Clinical Psychology 46:806-834.
Meehl, Paul E. (1990). “Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two
Principles that Warrant it.” Psychological Inquiry 1:108-141.
Meng, X. L. (1994). “Posterior Predictive P-Values.” Annals of Statistics 22:1142-1160.
Mengersen, K. and C. P. Robert. (1996). “Testing for Mixtures: A Bayesian Entropic Approach.” In Bayesian
Statistics 5 (Alicante, 1994). New York: Oxford University Press.
Meuwissen, T. H. E. and M. E. Goddard. (2004). “Mapping Multiple QTL Using Linkage Disequilibrium and
Linkage Analysis Information and Multitrait Data.” Genet. Sel. Evol. 36: 261-279.
Miller, A. (2002). Subset Selection in Regression. Boca Raton, FL: Chapman & Hall/CRC.
Mitchell, T. J., and J. J. Beauchamp. (1988). “Bayesian Variable Selection in Linear Regression.” Journal of the
American Statistical Association 83:1023-1032.
Morris, C. N. (1983). “Parametric Empirical Bayes Inference: Theory and Applications.” Journal of the
American Statistical Association 78:47-55.
Nickerson, Raymond S. (2000). “Null Hypothesis Statistical Testing: A Review of an Old and Continuing
Controversy.” Psychological Methods 5:241-301.
Oakes, M. (1986). Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: John
Wiley & Sons.
O’Hara, R. B., and M. J. Sillanpää. (2009). “A Review of Bayesian Variable Selection Methods: What, How, and
Which?” Bayesian Analysis 4: 85-118.
Park, T., and G. Casella. (2008). “The Bayesian Lasso.” Journal of the American Statistical Association 103:
681-686.
Perez, J. M. and J. Berger. (2002). “Expected Posterior Prior Distributions for Model Selection.” Biometrika
89: 491-512.
Pollard, P. (1993). “How Significant is ‘Significance’?” In A Handbook for Data Analysis in the Behavioral Sciences:
Methodological Issues, ed. Gerd Gigerenzer, and C. Lewis. Hillsdale, NJ: Lawrence Erlbaum Associates.
Pollard, P., and J. T. E. Richardson. (1987). “On the Probability of Making Type One Errors.” Psychological
Bulletin 102:159-163.
Raftery, A. E. (1995). “Bayesian Model Selection in Social Research.” Sociological Methodology 25:111-163.
Raftery, A. E. (1996). “Approximate Bayes Factors and Accounting for Model Uncertainty in Generalised Linear
Models.” Biometrika 83: 251-266.
Regal, R. and Hook, E. B. (1991). “The Effects of Model Selection on Confidence Intervals for the Size of a
Closed Population.” Statistics in Medicine 10: 717-721.
Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer-Verlag, New York, second edition.
Robinson, D. H., and J. R. Levin. (1997). “Reflections on Statistical and Substantive Significance, with a Slice of
Replication.” Educational Researcher 26:21-26.
Rosnow, Ralph L., and R. Rosenthal. (1989). “Statistical Procedures and the Justification of Knowledge in
Psychological Science.” American Psychologist 44:1276-1284.
Rozeboom, William W. (1960). “The Fallacy of the Null Hypothesis Significance Test.” Psychological Bulletin
57:416-428.
Rozeboom, William W. (1997). “Good Science is Abductive, not Hypothetico-Deductive.” In What If There
Were No Significance Tests?, ed. L. L. Harlow, S. A. Mulaik, and J. H. Steiger. Mahwah, NJ: Erlbaum
335-392.
Rubin, Donald B. (1984). “Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician.” Annals of Statistics 12:1151-1172.
Sahu, S. and R. Cheng. (2003). “A Fast Distance Based Approach for Determining the Number of Components
in Mixtures.” Canadian Journal of Statistics 31: 2-33.
Schaefer, Juliane, and Korbinian Strimmer. (2005a). “An Empirical Bayes Approach to Inferring Large-Scale
Gene Association Networks.” Bioinformatics 21:754-764.
Schaefer, Juliane, and Korbinian Strimmer. (2005b). “A Shrinkage Approach to Large-Scale Covariance Matrix
Estimation and Implications for Functional Genomics.” Statistical Applications in Genetics and Molecular
Biology 4:1-30.
Schervish, Mark J. (1996). “P Values: What They Are and What They Are Not.” The American Statistician
50:203-206.
Schickler, E. (2000). “Institutional Change in the House of Representatives, 1867-1998: A Test of Partisan and
Ideological Power Balance Models.” American Political Science Review 94:269-288.
Schmidt, Frank L. (1997). “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for the Training of Researchers.” Psychological Methods 1:115-129.
Schmidt, Frank L., and John E. Hunter. (1977). “Development of a General Solution to the Problem of Validity
Generalization.” Journal of Applied Psychology 62:529-540.
Sedlmeier, Peter, and Gerd Gigerenzer. (1989). “Do Studies of Statistical Power Have an Effect on the Power of
Studies?” Psychological Bulletin 105:309-316.
Sha, N., M. Vannucci, M. G. Tadesse, P. J. Brown, I. Dragoni, N. Davies, T. C. Roberts, A. Contestabile, M.
Salmon, C. Buckley, and F. Falciani. (2004). “Bayesian Variable Selection in Multinomial Probit Models to
Identify Molecular Signatures of Disease Stage.” Biometrics 60: 812-819.
Smith, M. and Kohn, R. (2002). “Parsimonious Covariance Matrix Estimation for Longitudinal Data.” Journal of
the American Statistical Association 97: 1141-1153.
Spiegelhalter, D. J., N. G. Best, B. P. Carlin and A. van der Linde. (2002). “Bayesian Measures of Model Complexity
and Fit.” J. Royal Statist. Society Series B 64: 583-639.
Stephens, M. (2000). “Bayesian Analysis of Mixture Models with an Unknown Number of Components: An Alternative to Reversible Jump Methods.” Annals of Statistics 28: 40-74.
Thompson, Bruce. (2002). “What Future Quantitative Social Science Research Could Look Like: Confidence
Intervals for Effect Sizes.” Educational Researcher 31:24-31.
Tong, TieJun, and Yuedong Wang. (2007). “Optimal Shrinkage Estimation of Variance With Applications to
Microarray Data Analysis.” Journal of the American Statistical Association 102:113-122.
Volinsky, C. T., D. Madigan, A. E. Raftery, and R. A. Kronmal. (1997). “Bayesian Model Averaging in Proportional Hazard Models: Assessing the Risk of a Stroke.” J. Roy. Statist. Soc. Ser. C 46: 433-448.
Waagepetersen, Rasmus, and Daniel Sorensen. (2001). “A Tutorial on Reversible Jump MCMC with a View
Towards Applications in QTL-Mapping.” International Statistical Review 69:49-62.
Wang, Hui, Yuan-Ming Zhang, Xinmin Li, Godfred L. Masinde, Subburaman Mohan, David J. Baylink, and
Shizhong Xu. (2005). “Bayesian Shrinkage Estimation of Quantitative Trait Loci Parameters.” Genetics
170:465-480.
West, M. (2003). “Bayesian Factor Regression Models in the ’Large p, Small n’ Paradigm.” Bayesian Statistics
7:723-732.
Wilkinson, L. (1999). “Statistical Methods in Psychology Journals: Guidelines and Explanations.” The American
Psychologist 54:594-604.
Willan, Andrew R., Eleanor M. Pinto, Bernie J. O’Brien, Padma Kaul, and Ron Goeree. (2005). “Country
Specific Cost Comparison from Multinational Clinical Trial Using Empirical Bayesian Shrinkage Estimation:
The Canadian ASSENT-3 Economic Analysis.” Health Economics 14:327-338.
Wooldridge, Jeffrey M. (2001). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT
Press.
Xu, S. (2003). “Estimating Polygenic Effects Using Markers of the Entire Genome.” Genetics 163: 789-801.
Yi, N., and S. Xu. (2008). “Bayesian LASSO for Quantitative Trait Loci Mapping.” Genetics 179: 1045-1055.
Zhang, Hao Helen, Jeongyoun Ahn, Xiaodong Lin, and Cheolwoo Park. (2006). “Gene Selection Using Support
Vector Machines with Non-Convex Penalty.” Bioinformatics 22:88-95.
Appendix A
MCMC Algorithm for Updating Hyperparameters
The joint posterior distribution of a generalized linear Spike and Slab model can be written as
follows:
\pi(\beta, \eta^2, \mathcal{F}, w \mid \mathbf{Y}) \propto \prod_{i=1}^{N} L(Y_i \mid \beta, \eta^2, \mathcal{F}, w) \, \pi(w) \prod_{k=1}^{K} \pi(\beta_k \mid \mathcal{F}_k, \eta_k^2) \, \pi(\mathcal{F}_k \mid w) \, \pi(\eta_k^2) \qquad (33)
With the conditional independence in the hierarchical setup, the coefficient parameter vector $\beta$ can
be sampled as in ordinary Bayesian generalized linear models given the current draws of the hyperparameters,
so here we give only the sampling scheme for the hyperparameters. Given the current draw of $\beta$,
all of the hyperparameters can be updated with the Gibbs sampler as follows:
1. Draw $\mathcal{F}_k \mid \beta, \eta^2 \propto l(\mathcal{F}_k)\,\pi(\mathcal{F}_k)$, which is a two-point discrete distribution. The unnormalized
posterior heights at the two points are computed by multiplying the heights of the likelihood and the
prior at each point:

\[ w_{1,k} = (1-w)\,\nu_0^{-1/2} \exp\left(-\frac{\beta_k^2}{2\nu_0\eta_k^2}\right), \quad \text{and} \quad w_{2,k} = w \exp\left(-\frac{\beta_k^2}{2\eta_k^2}\right). \]

Normalize the distribution by defining $\kappa \equiv w_{2,k}/(w_{1,k} + w_{2,k})$; then $\mathcal{F}_k$ can be drawn
from its posterior:

\[ \mathcal{F}_k \mid \nu_0, w \sim (1-\kappa)\,\delta_{\nu_0}(\cdot) + \kappa\,\delta_1(\cdot). \]
2. Draw $h \equiv \eta_k^{-2}$:

\[ h \mid \beta, \mathcal{F} \propto h^{1/2} \exp\left(-\frac{h\,\beta_k^2}{2\mathcal{F}_k}\right) h^{a_1 - 1} \exp(-a_2 h), \qquad (34) \]

which is a gamma kernel, so $\eta_k^{-2}$ can be sampled from its posterior distribution
$\mathrm{Gamma}\left(a_1 + 1/2, \; a_2 + \beta_k^2/(2\mathcal{F}_k)\right)$; then compute $\gamma_k = \mathcal{F}_k\,\eta_k^2$.
3. Draw $w$: this is a simple Beta-Bernoulli update, so $w$ can be drawn directly from its posterior
distribution $w \mid \mathcal{F} \sim \mathrm{Beta}\left(c_1 + \#\{k : \mathcal{F}_k = 1\}, \; c_2 + \#\{k : \mathcal{F}_k = \nu_0\}\right)$.
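To make the scan concrete, the three updates above can be collected into a short R function. This is a
minimal sketch under our own (hypothetical) function and argument names, not the original estimation
code: beta and eta2 are the current K-vectors of coefficient draws and slab variances, w is the current
mixing weight, and nu0, a1, a2, c1, c2 are the fixed constants defined above.

# One Gibbs scan over the hyperparameters, following steps 1-3 above.
# Sketch only: names are ours, not from the original replication code.
update.hyper <- function(beta, eta2, w, nu0, a1, a2, c1, c2){
  K <- length(beta)
  # Step 1: each F_k has a two-point posterior on {nu0, 1}; when the
  # spike/slab height ratio is extreme these are best compared on the
  # log scale, but the direct form is kept here for readability
  w1 <- (1 - w) * nu0^(-0.5) * exp(-beta^2/(2*nu0*eta2))
  w2 <- w * exp(-beta^2/(2*eta2))
  kappa <- w2/(w1 + w2)
  F.new <- ifelse(runif(K) < kappa, 1, nu0)
  # Step 2: eta_k^{-2} has a conjugate Gamma posterior;
  # gamma_k = F_k * eta_k^2 is the prior variance of beta_k
  h <- rgamma(K, shape = a1 + 0.5, rate = a2 + beta^2/(2*F.new))
  eta2.new <- 1/h
  gamma.new <- F.new * eta2.new
  # Step 3: Beta-Bernoulli update for the mixing weight
  w.new <- rbeta(1, c1 + sum(F.new == 1), c2 + sum(F.new == nu0))
  list(F = F.new, eta2 = eta2.new, gamma = gamma.new, w = w.new)
}

Because every step is conjugate, the scan requires no tuning and costs O(K) per iteration; given the
returned gamma values, the coefficient block $\beta$ is then updated exactly as in an ordinary Bayesian
generalized linear model step.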
Appendix B
R Code for Sampling λ
# Draw the mixing parameters lambda_i, one per observation, by
# rejection sampling with a generalized inverse-Gaussian proposal
# and a series-based accept/reject step; ystar, X.rescaled, beta
# (row g = current draw), n, and the series cap mm are defined by
# the surrounding sampler.
rsquare <- (ystar - X.rescaled %*% as.matrix(beta[g,]))^2
lambda.new <- numeric(n)
for (i in 1:n){
  OK <- FALSE
  while (!OK){
    # draw a proposal via the inverse-Gaussian transformation method
    jj <- sqrt(rsquare[i])
    R <- rnorm(1)^2
    RR <- 1 + (R - sqrt(R*(4*jj + R)))/(2*jj)
    RRR <- 1/(1 + RR)
    if (runif(1) < RRR){
      prop.lambda <- jj/RR
    } else {
      prop.lambda <- jj*RR
    }
    # accept/reject using the alternating-series bounds; which
    # series converges quickly depends on the side of 4/3
    U <- runif(1)
    if (prop.lambda > 4/3){
      OK <- rightmost.interval(U, prop.lambda, mm)
    } else {
      OK <- leftmost.interval(U, prop.lambda, mm)
    }
  }
  lambda.new[i] <- prop.lambda
}
# the two functions called in the code above are defined as:
rightmost.interval <- function(U, lambda, mm){
  # squeezing for lambda > 4/3: partial sums of the alternating
  # series bracket the acceptance probability from both sides;
  # mm caps the number of series terms examined
  ZD <- 1
  XD <- exp(-0.5*lambda)
  j <- 0
  for (d in 1:mm){
    j <- j + 1
    ZD <- ZD - (j+1)^2*XD^((j+1)^2-1)
    if (ZD > U) return(TRUE)    # lower bound already above U: accept
    j <- j + 1
    ZD <- ZD + (j+1)^2*XD^((j+1)^2-1)
    if (ZD < U) return(FALSE)   # upper bound already below U: reject
  }
  return(FALSE)                 # series truncated after mm terms
}
leftmost.interval <- function(U, lambda, mm){
  # squeezing for lambda <= 4/3: same idea, but evaluated on the
  # log scale for numerical stability at small lambda
  HD <- 0.5*log(2) + 2.5*log(pi) - 2.5*log(lambda) - pi^2/(2*lambda) + 0.5*lambda
  LU <- log(U)
  ZD <- 1
  XD <- exp(-pi^2/(2*lambda))
  KD <- lambda/pi^2
  j <- 0
  for (d in 1:mm){
    j <- j + 1
    ZD <- ZD - KD*XD^(j^2-1)
    WD <- HD + log(ZD)
    if (WD > LU) return(TRUE)   # lower bound above log(U): accept
    j <- j + 1
    ZD <- ZD + (j+1)^2*XD^((j+1)^2-1)
    WD <- HD + log(ZD)
    if (WD < LU) return(FALSE)  # upper bound below log(U): reject
  }
  return(FALSE)                 # series truncated after mm terms
}
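As a usage illustration only: the surrounding sampler normally supplies ystar, X.rescaled, the stored
coefficient draws beta, the sample size n, and the series truncation level mm, so the following simulated
setup (with made-up dimensions and mm = 100) is just one way to exercise the loop above.

# Hypothetical setup for the lambda step; all values illustrative
set.seed(42)
n <- 10; K <- 3; mm <- 100          # mm truncates the infinite series
X.rescaled <- matrix(rnorm(n*K), n, K)
beta <- matrix(rnorm(2*K), 2, K)    # pretend two stored MCMC draws
g <- 2                              # index of the current draw
ystar <- X.rescaled %*% beta[g, ] + rnorm(n)
# running the loop above now yields lambda.new, one draw of the
# mixing scale per observation

The threshold 4/3 determines which alternating series is evaluated, since each series yields tight upper
and lower bounds on the acceptance probability only on its own side of that point.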
Appendix C
JAGS Code for Poisson Spike and Slab Model
model{
  for (i in 1:n){
    # Poisson outcome with log link; epsilon[i] is an
    # observation-level error term that absorbs overdispersion
    Y[i] ~ dpois(lambda[i]);
    log(lambda[i]) <- inprod(B, X[i,]) + epsilon[i];
    epsilon[i] ~ dnorm(0, tau.epsilon);
  }
  tau.epsilon <- pow(sigma.epsilon, -2);
  sigma.epsilon ~ dunif(0, 100);
  # prior mixing weight for the slab component
  w ~ dbeta(c.1, c.2);
  for (k in 1:K){
    # F[k] equals deltanull (spike) or 1 (slab)
    Temp[k] ~ dbern(w);
    zz[k] <- equals(Temp[k], 0);
    F[k] <- zz[k]*deltanull + 1 - zz[k];
    # gamma[k] is the prior variance of B[k]; dnorm takes a precision
    eta[k] ~ dgamma(a.1, a.2);
    gamma[k] <- F[k]*(1/eta[k]);
    B[k] ~ dnorm(0, 1/gamma[k]);
  }
}
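For completeness, a minimal sketch of running this model from R with the rjags package follows. The file
name pois-ss.jags, the simulated data, and the hyperparameter values (a.1, a.2, c.1, c.2, and the spike
constant deltanull) are illustrative placeholders rather than settings from our applications.

library(rjags)

# Simulated stand-in data: only coefficients 1 and 2 are nonzero
set.seed(1)
n <- 200; K <- 5
X <- matrix(rnorm(n*K), n, K)
Y <- rpois(n, exp(X %*% c(1, -0.5, 0, 0, 0)))

jags.data <- list(Y = Y, X = X, n = n, K = K,
                  a.1 = 5, a.2 = 25, c.1 = 1, c.2 = 1,
                  deltanull = 0.005)

# assumes the model block above is saved as "pois-ss.jags"
m <- jags.model("pois-ss.jags", data = jags.data, n.chains = 2)
update(m, 5000)                                  # burn-in
out <- coda.samples(m, c("B", "F", "w"), n.iter = 10000)
summary(out)

The monitored F[k] draws give the inclusion evidence directly: the proportion of draws with F[k] = 1
estimates the posterior probability that the k-th coefficient comes from the slab rather than the spike.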