7 Repeated-sampling inference
Much of the discussion regarding econometric analysis of endogenous relations
centers around identification issues. In this chapter we review the complementary
matter of inference. Exchangeability or symmetric dependence and de Finetti’s
theorem lie at the heart of most (perhaps all) statistical inference. A simple binomial example illustrates. Exchangeability says that a sequence of coin flips has the
property
$$\Pr(X_1 = 1, X_2 = 0, X_3 = 1, X_4 = 1) = \Pr(X_3 = 1, X_4 = 0, X_2 = 1, X_1 = 1)$$
and so on for all permutations of the random variable index. de Finetti’s theorem [1937, reprinted in 1964] provides justification for typical statistical sampling
from a population with unknown distribution based on a large number of iid draws
from the unknown distribution. That is, if ex ante the analyst assesses that samples
are exchangeable (and from a large population), then the samples can be viewed
as independent and identically distributed from an unknown distribution function.
Perhaps it is instructive to consider whether (most) specification issues can be
thought of as questions of the validity of some exchangeability conditions. While
we ponder this, we review repeated-sampling based inference with particular attention to bootstrapping and Bayesian simulation.¹

¹ MacKinnon [2002] suggests three fruitful avenues for exploiting abundant computing capacity: (1) structural models at the individual level that frequently draw on simulation, (2) Markov chain Monte Carlo (McMC) analysis, and (3) bootstrap inference.

7.1 Monte Carlo simulation
Monte Carlo simulation can be applied when the statistic of interest is pivotal.
Definition 7.1 A pivotal statistic is one that depends only on the data and no
unknown parameters.
Monte Carlo simulation of pivotal statistics produces exact tests.
Definition 7.2 Exact tests are tests for which a true null hypothesis is rejected
with probability precisely equal to $\alpha$, the nominal size of the test.
However, if the test statistic is not pivotal (for instance, the distribution is unknown), a Monte Carlo test doesn't apply.
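
A minimal sketch of a Monte Carlo test may help fix ideas. It assumes, purely for illustration, a null hypothesis that the data are Gaussian with unknown mean and variance, and uses the sample kurtosis as the test statistic; because kurtosis is invariant to location and scale, its null distribution involves no unknown parameters. The function names, the choice of statistic, and B = 999 (so that $\alpha(B + 1)$ is an integer for $\alpha = 0.05$) are assumptions made here, not part of the text.

```python
import random

def kurtosis(x):
    """Sample kurtosis m4 / m2^2; unchanged by shifting or rescaling x."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m4 / m2 ** 2

def monte_carlo_pvalue(data, B=999, seed=0):
    """Upper-tail Monte Carlo p-value for H0: the data are Gaussian."""
    rng = random.Random(seed)
    n = len(data)
    t_obs = kurtosis(data)
    # Under H0 the statistic's distribution is free of (mean, variance),
    # so simulating from N(0, 1) reproduces its null distribution.
    t_sim = [kurtosis([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(B)]
    exceed = sum(t >= t_obs for t in t_sim)
    return (1 + exceed) / (B + 1)

# Illustrative data: Gaussian with mean 5 and standard deviation 3.
rng = random.Random(1)
sample = [5.0 + 3.0 * rng.gauss(0.0, 1.0) for _ in range(40)]
p_value = monte_carlo_pvalue(sample)
```

Rejecting when `p_value` is at most $\alpha$ gives a test whose size equals the nominal size under these assumptions.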
7.2 Bootstrap
Inference based on bootstrapping is simply an application of the fundamental theorem of statistics. That is, when randomly sampled with replacement the empirical
distribution function is consistent for the population distribution function (see appendix).
To bootstrap a single parameter such as the correlation between two random
variables, say x and y, we simply sample randomly with replacement from the
pairs (x, y). Then, we utilize the empirical distribution of the statistic (say, the sample
correlation) to draw inferences, for instance, about its mean (see Efron [1979,
2000]).
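
The pairs bootstrap just described can be sketched as follows. The data-generating values, B = 999, and the percentile interval are illustrative assumptions rather than anything prescribed by the text.

```python
import math
import random

def corr(pairs):
    """Sample correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_corr(pairs, B=999, seed=0):
    """Sorted bootstrap correlations from resampling (x, y) pairs jointly."""
    rng = random.Random(seed)
    n = len(pairs)
    return sorted(
        corr([pairs[rng.randrange(n)] for _ in range(n)]) for _ in range(B)
    )

# Illustrative data with population correlation 0.8.
rng = random.Random(42)
data = []
for _ in range(50):
    x = rng.gauss(0, 1)
    data.append((x, 0.8 * x + 0.6 * rng.gauss(0, 1)))

draws = bootstrap_corr(data)
# Approximate 95% percentile interval: 25th and 975th order statistics.
lower, upper = draws[24], draws[974]
```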
7.2.1 Bootstrap regression
For a regression that satisfies standard OLS (spherical) conditions, bootstrapping
involves first estimating the regression via OLS, $X_i \hat{\beta}$, and calculating the residuals.² The second step involves randomly sampling with replacement a residual for
each estimated regression observation $X_i \hat{\beta}$. Pseudo responses $\tilde{Y}$ are constructed
by adding the sampled residual to the estimated regression $X_i \hat{\beta}$ for each draw
desired (often this is simply n, the original sample size). Next, $b_k$ is estimated
via OLS regression of $\tilde{Y}$ on the matrix of regressors. Steps two and three are repeated B times to produce an empirical sample of $b_k$, k = 1, ..., B. Davidson
and MacKinnon [2003] recommend choosing B such that $\alpha(B + 1)$ is an integer
where $\alpha$ is the proposed size of the test. Inferences (such as interval estimates) are
then based on this empirical sample.

² The current and next section draw heavily from Freedman [1981] and Freedman and Peters [1984].
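
The three steps of the residual bootstrap can be sketched for a simple regression with an intercept and one regressor. This is a sketch under assumed data and B = 999; the helper names are hypothetical.

```python
import random

def ols(x, y):
    """Intercept and slope from a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def residual_bootstrap(x, y, B=999, seed=0):
    rng = random.Random(seed)
    a, b = ols(x, y)                        # step 1: fit, then residuals
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(B):
        # step 2: draw n residuals with replacement, build pseudo responses
        y_star = [fi + rng.choice(resid) for fi in fitted]
        # step 3: re-estimate on the pseudo responses
        slopes.append(ols(x, y_star)[1])
    return b, sorted(slopes)

# Illustrative data: y = 1 + 2x + noise.
rng = random.Random(7)
x = [rng.uniform(0, 10) for _ in range(60)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1.5) for xi in x]
b_hat, slopes = residual_bootstrap(x, y)
lower, upper = slopes[24], slopes[974]      # approximate 95% interval
```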
7.2.2 Bootstrap panel data regression
If the errors are heteroskedastic and/or correlated, then the bootstrapping procedure above is modified to accommodate these features. The key is that we bootstrap
exchangeable partitions of the data. Suppose we have panel data stacked by time
series of length T for each of J cross-sectional individuals in the sample (the sample size
is $n = T \times J$).
Heteroskedasticity
If we suppose the errors are independent but the variance depends on the cross-sectional unit,
$$\Sigma = \begin{bmatrix} \sigma_1^2 I_T & 0 & \cdots & 0 \\ 0 & \sigma_2^2 I_T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_J^2 I_T \end{bmatrix}$$
then random draws with replacement of the first step residuals (whether estimated
by OLS or WLS, weighted least squares) are taken from the size T sample of residuals for each cross-sectional unit or group of cross-sectional individuals with the
same variance. As these partitions are exchangeable, this preserves the differences
in variances across cross-sectional units. The remainder of the process remains as
described above for bootstrapping regression.
When the nature of the heteroskedasticity is unknown, Freedman [1981] suggests a paired bootstrap where [Yi , Xi ] are sampled simultaneously. MacKinnon
[2002, p. 629-631] also discusses a wild bootstrap to deal with unknown heteroskedasticity.
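
A minimal sketch of the within-unit resampling step, assuming the first-step residuals are stored as one list per cross-sectional unit (a layout chosen here for illustration). Pseudo responses would then be built by adding these draws to the fitted values, as in the regression bootstrap.

```python
import random

def resample_within_units(resid_by_unit, rng):
    """For each unit, draw T residuals with replacement from that unit only,
    so each unit keeps its own residual variance."""
    return [[rng.choice(unit) for _ in unit] for unit in resid_by_unit]

# Two hypothetical units with very different residual scales (T = 4).
rng = random.Random(0)
resid = [[0.1, -0.2, 0.1, 0.0], [5.0, -7.0, 3.0, -1.0]]
star = resample_within_units(resid, rng)
```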
Correlated errors
If the errors are serially correlated but the variance is constant across cross-sectional
units,
$$\Sigma = \begin{bmatrix} V & 0 & \cdots & 0 \\ 0 & V & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & V \end{bmatrix}$$
where
$$V = \sigma^2 \begin{bmatrix} 1 & \rho_1 & \cdots & \rho_T \\ \rho_1 & 1 & \cdots & \rho_{T-1} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_T & \rho_{T-1} & \cdots & 1 \end{bmatrix}$$
then random vector (of length T) draws with replacement of the first step residuals
(whether estimated by OLS or GLS, generalized least squares) are taken from
the cross-sectional units.³ As these partitions are exchangeable, this preserves the
serial correlation inherent in the data. The remainder of the process is as described
above for bootstrapping regression.⁴
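
The vector draws can be sketched as below: whole length-T residual vectors are resampled across units, so the time ordering inside each vector (and hence its serial correlation) is left intact. The list-of-lists layout is an assumption made for illustration.

```python
import random

def resample_unit_vectors(resid_by_unit, rng):
    """Draw J whole residual vectors (each of length T) with replacement
    across cross-sectional units, preserving the order within each vector."""
    J = len(resid_by_unit)
    return [list(resid_by_unit[rng.randrange(J)]) for _ in range(J)]

# Three hypothetical units, T = 3.
rng = random.Random(0)
resid = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
star = resample_unit_vectors(resid, rng)
```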
Heteroskedasticity and serial correlation
If the errors are serially correlated and the variance is nonconstant across cross-sectional units,
$$\Sigma = \begin{bmatrix} V_1 & 0 & \cdots & 0 \\ 0 & V_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & V_J \end{bmatrix}$$
where
$$V_j = \sigma_j^2 \begin{bmatrix} 1 & \rho_1 & \cdots & \rho_T \\ \rho_1 & 1 & \cdots & \rho_{T-1} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_T & \rho_{T-1} & \cdots & 1 \end{bmatrix}$$
then a combination of the above two sampling procedures is employed.⁵ That
is, groups of cross-section units with the same variance-covariance structure are
identified and random vector (of length T) draws with replacement of the first
step residuals (whether estimated by OLS or GLS) are taken from the groups of
cross-sectional units. As these partitions are exchangeable, this preserves the heteroskedasticity and serial correlation inherent in the data. The remainder of the
process is as described above for bootstrapping regression.
³ For cross-sectional correlation (but independent errors through time)
$$\Sigma = \sigma^2 \begin{bmatrix} I_T & \rho_{12} I_T & \cdots & \rho_{1J} I_T \\ \rho_{12} I_T & I_T & \cdots & \rho_{2J} I_T \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{1J} I_T & \rho_{2J} I_T & \cdots & I_T \end{bmatrix}$$
simply apply the same ideas to the length J vector of residuals over cross-sectional units in place of
the length T vector of residuals through time.
⁴ When the nature of the serial correlation is unknown, as expected the challenge is greater. MacKinnon [2002] discusses two approaches: sieve bootstrap and block bootstrap. Not surprisingly, when
the nature of the correlation or heteroskedasticity is unknown the bootstrap performs more poorly than
otherwise.
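⁵ Cross-sectional correlation and heteroskedasticity
$$\Sigma = \begin{bmatrix} \sigma_1^2 I_T & \rho_{12}\sigma_1\sigma_2 I_T & \cdots & \rho_{1J}\sigma_1\sigma_J I_T \\ \rho_{12}\sigma_1\sigma_2 I_T & \sigma_2^2 I_T & \cdots & \rho_{2J}\sigma_2\sigma_J I_T \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{1J}\sigma_1\sigma_J I_T & \rho_{2J}\sigma_2\sigma_J I_T & \cdots & \sigma_J^2 I_T \end{bmatrix}$$
again calls for sampling from like variance-covariance groups.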
7.2.3 Bootstrap summary
Horowitz [2001] relates the bootstrap to asymptotically pivotal statistics in discussing effective usage of the bootstrap.
Definition 7.3 An asymptotically pivotal statistic is a statistic whose asymptotic
distribution does not depend on unknown population parameters.
Horowitz concludes
• If an asymptotically pivotal statistic is available, use the bootstrap to estimate the probability distribution of the asymptotically pivotal statistic or a
critical test value based on the asymptotically pivotal statistic.
• Use an asymptotically pivotal statistic if available rather than bootstrapping
a non-asymptotically pivotal statistic such as a regression slope coefficient
or standard error to estimate the probability distribution of the statistic.
• Recenter the residuals of an overidentified model before applying the bootstrap.
• Extra care is called for when bootstrapping models for dependent data,
semi- or non-parametric estimators, or non-smooth estimators.
7.3 Bayesian simulation
Like bootstrapping, Bayesian simulation employs repeated sampling with replacement to draw inferences. Bayesian sampling in its simplest form utilizes Bayes'
theorem to identify the posterior distribution of interest $p(\theta \mid Y)$ from the likelihood function $p(Y \mid \theta)$ and prior distribution for the parameters of interest $p(\theta)$.
$$p(\theta \mid Y) = \frac{p(Y \mid \theta)\, p(\theta)}{p(Y)}$$
The marginal distribution of the data $p(Y)$ is a normalizing adjustment. Since
it does not affect the kernel of the distribution it is typically suppressed and the
posterior is written
$$p(\theta \mid Y) \propto p(Y \mid \theta)\, p(\theta)$$
7.3.1 Conjugate families
It is straightforward to sample from the posterior distribution when its kernel (the
portion of the density function or probability mass function that depends on the parameters of interest) is readily recognized. For a number of prior distributions (and
likelihood functions), the posterior distribution is readily recognized as a standard
distribution. This is referred to as conjugacy and the matching prior distribution is
called the conjugate prior. A formal definition follows.
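Definition 7.4 If $\mathcal{F}$ is a class of sampling distributions $p(Y \mid \theta)$ and $\mathcal{P}$ is a class
of prior distributions for $\theta$, then class $\mathcal{P}$ is conjugate to class $\mathcal{F}$ if $p(\theta \mid Y) \in \mathcal{P}$
for all $p(\cdot \mid \theta) \in \mathcal{F}$ and $p(\cdot) \in \mathcal{P}$.
For example, a binomial likelihood
$$\ell(\theta \mid s; n) = \binom{n}{s} \theta^{s} (1-\theta)^{n-s}, \qquad s = \sum_{i=1}^{n} y_i, \quad y_i \in \{0, 1\}$$
combines with a beta($\theta$; $\alpha$, $\beta$) prior
$$p(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha-1} (1-\theta)^{\beta-1}$$
to yield
$$p(\theta \mid y) \propto \theta^{s}(1-\theta)^{n-s}\, \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{s+\alpha-1}(1-\theta)^{n-s+\beta-1}$$
which is the kernel of a beta($\theta \mid y$; $\alpha + s$, $\beta + n - s$) distribution.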
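
As a small illustration of the beta-binomial update, the sketch below adds the number of successes to the first beta parameter and the number of failures to the second, then samples from the resulting posterior. The prior values (2, 2) and the data are made up for the example.

```python
import random

def beta_binomial_update(a, b, y):
    """Posterior beta parameters after observing 0/1 outcomes y
    under a beta(a, b) prior."""
    s, n = sum(y), len(y)
    return a + s, b + n - s

y = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]           # s = 7 successes in n = 10
a_post, b_post = beta_binomial_update(2.0, 2.0, y)
post_mean = a_post / (a_post + b_post)        # analytic posterior mean

# Posterior simulation: draws from the beta posterior.
rng = random.Random(0)
draws = [rng.betavariate(a_post, b_post) for _ in range(20000)]
sim_mean = sum(draws) / len(draws)
```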
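Also, a single draw from a Gaussian likelihood with known standard deviation, $\sigma$,
$$\ell(\theta \mid y, \sigma) \propto \exp\left[-\frac{1}{2}\frac{(y-\theta)^2}{\sigma^2}\right]$$
combines with a Gaussian or normal prior
$$p(\theta \mid \mu_0, \tau_0) \propto \exp\left[-\frac{1}{2}\frac{(\theta-\mu_0)^2}{\tau_0^2}\right]$$
to yield⁶
$$p(\theta \mid y, \sigma, \mu_0, \tau_0) \propto \exp\left[-\frac{1}{2}\frac{(\theta-\mu_1)^2}{\tau_1^2}\right]$$
where
$$\mu_1 = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{1}{\sigma^2}y}{\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}} \quad \text{and} \quad \tau_1^2 = \frac{1}{\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}}.$$
The posterior distribution of the mean
given the data and priors is Gaussian. And, for a sample of n exchangeable draws,
the likelihood
$$\ell(\theta \mid y, \sigma) \propto \prod_{i=1}^{n} \exp\left[-\frac{1}{2}\frac{(y_i-\theta)^2}{\sigma^2}\right]$$
combined with the above prior yields
$$p(\theta \mid y, \sigma, \mu_0, \tau_0) \propto \exp\left[-\frac{1}{2}\frac{(\theta-\mu_n)^2}{\tau_n^2}\right]$$
where
$$\mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}},$$
$\bar{y}$ is the sample mean, and
$$\tau_n^2 = \frac{1}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}.$$
The posterior
distribution of the mean given the data and priors is again Gaussian.

⁶ The product gives
$$\exp\left[-\frac{1}{2}\left(\frac{(y-\theta)^2}{\sigma^2} + \frac{(\theta-\mu_0)^2}{\tau_0^2}\right)\right]$$
Then, expand the exponent and complete the square. Any constants are ignored in the identification of
the kernel as they're absorbed through normalization of the posterior kernel.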
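
The Gaussian update is a precision-weighted average of the prior mean and the sample mean. A short sketch follows, with variable names (`mu0`, `tau0`, `sigma`) chosen here for the prior mean, prior standard deviation, and known data standard deviation.

```python
def normal_update(mu0, tau0, sigma, y):
    """Posterior mean and variance of a Gaussian mean with known sigma,
    given a N(mu0, tau0^2) prior and observations y."""
    n = len(y)
    ybar = sum(y) / n
    precision = 1.0 / tau0 ** 2 + n / sigma ** 2      # posterior precision
    mean = (mu0 / tau0 ** 2 + n * ybar / sigma ** 2) / precision
    return mean, 1.0 / precision

# One draw: prior N(0, 1), sigma = 1, y = 2 gives mean 1 and variance 1/2.
mu_1, var_1 = normal_update(0.0, 1.0, 1.0, [2.0])

# Four draws with sample mean 2 give mean 8/5 and variance 1/5.
mu_n, var_n = normal_update(0.0, 1.0, 1.0, [1.0, 2.0, 3.0, 2.0])
```

As n grows the data precision dominates and the posterior mean moves toward the sample mean.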
These and some other well-known and widely used conjugate family distributions are summarized in tables 7.1, 7.2, 7.3, and 7.4 (see Bernardo and Smith
[1994] and Gelman et al [2003]).
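Table 7.1: Conjugate families for univariate discrete distributions
| likelihood $p(Y \mid \theta)$ | conjugate prior $p(\theta)$ | posterior $p(\theta \mid Y)$ |
|---|---|---|
| Binomial$(s \mid n, \theta)$ where $s = \sum_{i=1}^{n} y_i$, $y_i \in \{0, 1\}$ | Beta$(\theta; \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$ | Beta$(\theta \mid \alpha + s, \beta + n - s)$ |
| Poisson$(s \mid n\theta)$ where $s = \sum_{i=1}^{n} y_i$, $y_i = 0, 1, 2, \ldots$ | Gamma$(\theta; \alpha, \beta) \propto \theta^{\alpha-1} e^{-\beta\theta}$ | Gamma$(\theta \mid \alpha + s, \beta + n)$ |
| Exponential$(t \mid n, \theta)$ where $t = \sum_{i=1}^{n} y_i$, $y_i \geq 0$ | Gamma$(\theta; \alpha, \beta) \propto \theta^{\alpha-1} e^{-\beta\theta}$ | Gamma$(\theta \mid \alpha + n, \beta + t)$ |
| Negative-binomial$(s \mid \theta, nr)$ where $s = \sum_{i=1}^{n} y_i$, $y_i = 0, 1, 2, \ldots$ | Beta$(\theta; \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$ | Beta$(\theta \mid \alpha + nr, \beta + s)$ |

Beta and gamma are continuous distributions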
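A few words regarding the multi-parameter Gaussian case with unknown mean
and variance seem appropriate. The joint prior combines a Gaussian prior for the
mean conditional on the variance and an inverse-gamma or inverse-chi square
prior for the variance.⁷ The joint posterior distribution is the same form as the prior:
Gaussian$(\theta \mid \sigma^2; \theta_n, \sigma_n^2)$ inverse-gamma$(\sigma^2 \mid \alpha + \frac{1}{2}n, \beta_n)$. Hence, the conditional distribution for the mean given the variance is Gaussian$(\theta \mid \sigma^2; \theta_n, \sigma_n^2)$
where $\sigma_n^2 = \frac{\sigma^2}{n_0 + n}$. On integrating out the variance from the joint posterior the
marginal posterior for the mean is noncentral, scaled Student $t(\theta \mid \theta_n, \lambda, \nu)$ distributed.

⁷ The inverse-gamma($\alpha$, $\beta$) distribution
$$p(\sigma^2; \alpha, \beta) \propto (\sigma^2)^{-(\alpha+1)} \exp\left[-\frac{\beta}{\sigma^2}\right]$$
can be reparameterized as an inverse-$\chi^2$ distribution ($\nu$, $\sigma_0^2$)
$$p(\sigma^2; \nu, \sigma_0^2) \propto (\sigma^2)^{-(\nu/2+1)} \exp\left[-\frac{\nu\sigma_0^2}{2\sigma^2}\right]$$
(see Gelman et al [2003], p. 50). Hence, $\alpha = \frac{\nu}{2}$ or $\nu = 2\alpha$ and $\beta = \frac{\nu\sigma_0^2}{2}$ or $\sigma_0^2 = \frac{2\beta}{\nu}$.

Table 7.2: Conjugate families for univariate continuous distributions
| likelihood $p(Y \mid \theta)$ | conjugate prior $p(\theta)$ | marginal posterior $p(\theta \mid Y)$ |
|---|---|---|
| Uniform$(Y_i \mid 0, \theta)$ where $0 < Y_i < \theta$, $t = \max\{Y_1, \ldots, Y_n\}$ | Pareto$(\theta; \alpha, \beta) \propto \theta^{-(\alpha+1)}$ | Pareto$(\theta; \alpha + n, \max\{\beta, t\})$ |
| Normal$(Y \mid \theta, \sigma^2)$, variance known | Normal$(\theta \mid \sigma^2; \theta_0, \tau_0^2) \propto \tau_0^{-1} \exp\left[-\frac{(\theta-\theta_0)^2}{2\tau_0^2}\right]$ | Normal$\left(\theta \mid \sigma^2; \frac{\theta_0/\tau_0^2 + n\bar{Y}/\sigma^2}{1/\tau_0^2 + n/\sigma^2}, \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}\right)$ |
| Normal$(Y \mid \mu, \theta)$, mean known, $\sigma^2 = \theta$ | Inverse-gamma$(\theta; \alpha, \beta) \propto \theta^{-(\alpha+1)} e^{-\beta/\theta}$ | Inverse-gamma$\left(\theta; \frac{n+2\alpha}{2}, \beta + \frac{1}{2}t\right)$ where $t = \sum_{i=1}^{n}(Y_i - \mu)^2$ |
| Normal$(Y \mid \theta, \sigma^2)$, both unknown | Normal$(\theta \mid \sigma^2; \theta_0, \sigma^2/n_0)$ × Inverse-gamma$(\sigma^2; \alpha, \beta)$ | Student $t(\theta; \theta_n, \lambda, 2\alpha + n)$; Inverse-gamma$(\sigma^2; \alpha + \frac{1}{2}n, \beta_n)$ |

For the normal-inverse gamma posterior the parameters are
$$\theta_n = (n_0 + n)^{-1}\left(n_0\theta_0 + n\bar{Y}\right)$$
$$\lambda = (n + n_0)\left(\alpha + \tfrac{1}{2}n\right)\beta_n^{-1}$$
$$\beta_n = \beta + \tfrac{1}{2}(n-1)s^2 + \tfrac{1}{2}(n_0 + n)^{-1} n_0 n \left(\theta_0 - \bar{Y}\right)^2$$
$$s^2 = (n-1)^{-1}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2$$

A scaled Student $t(X \mid \mu, \lambda = \frac{1}{\sigma^2}, \nu)$ is symmetric with mean $\mu$, variance $\frac{1}{\lambda}\frac{\nu}{\nu-2} = \frac{\nu}{\nu-2}\sigma^2$, $\nu$ degrees of freedom, and the density function kernel is
$$\left[1 + \frac{1}{\nu}\lambda(X-\mu)^2\right]^{-(\nu+1)/2} = \left[1 + \frac{1}{\nu}\left(\frac{X-\mu}{\sigma}\right)^2\right]^{-(\nu+1)/2}$$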
Hence, the standard t distribution is Student $t(Z \mid 0, 1, \nu)$ where $Z = \frac{X-\mu}{\sigma}$. Marginalization of the mean follows Gelman et al [2003] p. 76. For uninformative
priors, $p(\theta, \sigma^2) \propto \sigma^{-2}$,
$$p(\theta \mid y) = \int_0^\infty p(\theta, \sigma^2 \mid y)\, d\sigma^2 \propto \int_0^\infty \sigma^{-n-2} \exp\left[-\frac{A}{2\sigma^2}\right] d\sigma^2$$
where $A = (n-1)s^2 + n(\theta - \bar{y})^2$. Let $z = \frac{A}{2\sigma^2}$, then transformation of variables
yields
$$p(\theta \mid y) \propto A^{-n/2} \int_0^\infty z^{(n-2)/2} \exp[-z]\, dz$$
Since the integral involves the kernel for a gamma, it integrates to a constant and
can be ignored for identifying the marginal posterior kernel. Hence, we recognize
$$p(\theta \mid y) \propto A^{-n/2} = \left[(n-1)s^2 + n(\theta - \bar{y})^2\right]^{-n/2} \propto \left[1 + \frac{n(\theta - \bar{y})^2}{(n-1)s^2}\right]^{-n/2}$$
as the kernel for a noncentral, scaled Student $t\left(\theta; \bar{y}, \frac{s^2}{n}, n-1\right)$. Marginalization
with informed conjugate priors works in analogous fashion.
Table 7.3: Conjugate families for multivariate discrete distributions
| likelihood $p(Y \mid \theta)$ | conjugate prior $p(\theta)$ | posterior $p(\theta \mid Y)$ |
|---|---|---|
| Multinomial$_k(r; \theta, n)$ where $r_i = 0, 1, 2, \ldots$ | Dirichlet$_k(\theta; \alpha)$ where $\alpha = \{\alpha_1, \ldots, \alpha_{k+1}\}$ | Dirichlet$_k(\theta; \alpha_1 + r_1, \ldots, \alpha_{k+1} + r_{k+1})$ |

The Dirichlet distribution is a multivariate analog to the beta distribution and
has continuous support where $r_{k+1} = n - \sum_{\ell=1}^{k} r_\ell$. Ferguson [1973] proposed the
Dirichlet process as a Bayesian nonparametric approach. Some properties of the
Dirichlet distribution include
$$E[\theta_i \mid \alpha] = \frac{\alpha_i}{\alpha_0}$$
$$Var[\theta_i \mid \alpha] = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)}$$
$$Cov[\theta_i, \theta_j \mid \alpha] = \frac{-\alpha_i\alpha_j}{\alpha_0^2(\alpha_0 + 1)}$$
where $\alpha_0 = \sum_{i=1}^{k+1} \alpha_i$.
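Table 7.4: Conjugate families for multivariate continuous distributions
| likelihood $p(Y \mid \theta)$ | conjugate prior $p(\theta)$ | marginal posterior $p(\theta \mid Y)$ |
|---|---|---|
| Normal$(Y \mid \theta, \Sigma)$, parameters unknown | Normal$(\theta \mid \Sigma; \theta_0, n_0)$ × Inverse-Wishart$(\Sigma; \alpha, \beta)$ | Student $t_k(\theta; \theta_n, \Lambda, 2\alpha_n)$; Inverse-Wishart$(\Sigma; \alpha + \frac{1}{2}n, \beta_n)$ |
| Linear regression Normal$(Y \mid X\theta, \sigma^2)$, parameters unknown | Normal$(\theta \mid \sigma^2; \theta_0, \sigma^2 n_0^{-1})$ × Inverse-gamma$(\sigma^2; \alpha, \beta)$ | Student $t_k(\theta; \theta_n, \Lambda, 2\alpha + n)$; Inverse-gamma$(\sigma^2; \alpha + \frac{1}{2}n, \beta_n)$ |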
The multivariate Student $t_k(X \mid \mu, \Lambda, \nu)$ is analogous to the univariate Student
$t(X \mid \mu, \lambda, \nu)$ as it is symmetric with mean vector (length k) $\mu$, $k \times k$ symmetric,
positive definite variance matrix $\Lambda^{-1}\frac{\nu}{\nu-2}$, and $\nu$ degrees of freedom. For the Student t and inverse-Wishart marginal posteriors associated with multivariate normal
likelihood function, the parameters are
$$\theta_n = (n_0 + n)^{-1}\left(n_0\theta_0 + n\bar{Y}\right)$$
$$\Lambda = (n + n_0)\,\alpha_n\,\beta_n^{-1}$$
$$\beta_n = \beta + \tfrac{1}{2}S + \tfrac{1}{2}(n_0 + n)^{-1} n_0 n \left(\theta_0 - \bar{Y}\right)\left(\theta_0 - \bar{Y}\right)^T$$
$$S = \sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)\left(Y_i - \bar{Y}\right)^T$$
$$\alpha_n = \alpha + \tfrac{1}{2}n - \tfrac{1}{2}(k-1)$$
For the Student t and inverse-gamma marginal posteriors associated with linear
regression, the parameters are⁸
$$\theta_n = \left(n_0 + X^T X\right)^{-1}\left(n_0\theta_0 + X^T Y\right)$$
$$n_0 = X_0^T X_0$$
$$\Lambda = \left(n_0 + X^T X\right)\left(\alpha + \tfrac{1}{2}n\right)\beta_n^{-1}$$
$$\beta_n = \beta + \tfrac{1}{2}\left(Y - X\theta_n\right)^T Y + \tfrac{1}{2}\left(\theta_0 - \theta_n\right)^T n_0\theta_0$$

⁸ Notice, linear regression subsumes the univariate, multi-parameter Gaussian case. If we let $X = \iota$
(a vector of ones), then linear regression becomes the univariate Gaussian case.

Bayesian regression with conjugate priors works as if we have data from a prior
period $\{Y_0, X_0\}$ and the current period $\{Y, X\}$ from which to estimate $\theta_n$. Applying OLS to the stack of equations
$$\begin{bmatrix} Y_0 \\ Y \end{bmatrix} = \begin{bmatrix} X_0 \\ X \end{bmatrix}\theta_n + \begin{bmatrix} \varepsilon_0 \\ \varepsilon \end{bmatrix}$$
yields⁹
$$\theta_n = \left(X_0^T X_0 + X^T X\right)^{-1}\left(X_0^T Y_0 + X^T Y\right) = \left(n_0 + X^T X\right)^{-1}\left(n_0\theta_0 + X^T Y\right)$$
The inverse-Wishart and multivariate Student t distributions are multivariate analogs
to the inverse-gamma and (noncentral, scaled) univariate Student t distributions,
respectively.
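
The equivalence between stacked OLS, the conjugate-prior form, and the recursive least squares update can be checked numerically. The sketch below assumes a single regressor and no intercept so that every "matrix" is a scalar; the data and variable names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# "Prior period" data {Y0, X0} and current data {Y, X}.
x0, y0 = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
x, y = [1.5, 2.5, 4.0, 0.5], [1.4, 2.7, 3.9, 0.6]

n0 = dot(x0, x0)                  # X0'X0
theta0 = dot(x0, y0) / n0         # OLS estimate from the prior period

# (1) OLS on the stacked data.
stacked = dot(x0 + x, y0 + y) / dot(x0 + x, x0 + x)

# (2) Conjugate-prior form: (n0 + X'X)^{-1} (n0 theta0 + X'Y).
conjugate = (n0 * theta0 + dot(x, y)) / (n0 + dot(x, x))

# (3) Recursive least squares: update theta0 by the weighted prediction error.
info = n0 + dot(x, x)             # updated information
pred_err = [yi - xi * theta0 for xi, yi in zip(x, y)]
recursive = theta0 + dot(x, pred_err) / info
```

All three expressions give the same estimate up to floating-point rounding.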
7.3.2 McMC simulations
Markov chain Monte Carlo (McMC) simulations are employed when the marginal
posterior distributions cannot be derived or are extremely cumbersome to derive.
McMC approaches draw from the set of conditional posterior distributions instead
of the marginal posterior distributions. The Hammersley-Clifford theorem (Hammersley and Clifford [1971] and Besag [1974]) provides regularity conditions
for when a set of conditional distributions characterizes a unique joint distribution. The regularity conditions are essentially that every point in the marginal and
conditional distributions has positive mass. Common McMC approaches (Gibbs
sampler and Metropolis-Hastings algorithm) are supported by the Hammersley-Clifford theorem. The utility of McMC simulation has evolved along with the R
Foundation for Statistical Computing.

⁹ This perspective of Bayesian regression is consistent with recursive least squares where the previous estimate $\theta_{t-1}$ based on data $\{Y_{t-1}, X_{t-1}\}$ is updated for data $\{Y_t, X_t\}$ as
$$\theta_t = \theta_{t-1} + \Gamma_t^{-1} X_t^T\left(Y_t - X_t\theta_{t-1}\right)$$
where $\theta_{t-1} = \left(X_{t-1}^T X_{t-1}\right)^{-1} X_{t-1}^T Y_{t-1}$ and the information matrix
is updated as $\Gamma_t = \Gamma_{t-1} + X_t^T X_t$. To see this, note
$$\theta_t = \Gamma_t^{-1} X_t^T Y_t + \left(I - \Gamma_t^{-1} X_t^T X_t\right)\theta_{t-1}$$
but
$$\begin{aligned}
\left(I - \Gamma_t^{-1} X_t^T X_t\right)\theta_{t-1} &= \Gamma_t^{-1}\left(\Gamma_t - X_t^T X_t\right)\theta_{t-1} \\
&= \Gamma_t^{-1}\left(X_{t-1}^T X_{t-1} + X_t^T X_t - X_t^T X_t\right)\theta_{t-1} \\
&= \Gamma_t^{-1} X_{t-1}^T X_{t-1}\theta_{t-1} \\
&= \Gamma_t^{-1} X_{t-1}^T Y_{t-1}
\end{aligned}$$
since $\theta_{t-1} = \left(X_{t-1}^T X_{t-1}\right)^{-1} X_{t-1}^T Y_{t-1}$. Hence,
$$\begin{aligned}
\theta_t &= \Gamma_t^{-1} X_t^T Y_t + \Gamma_t^{-1} X_{t-1}^T Y_{t-1} \\
&= \Gamma_t^{-1}\left(X_t^T Y_t + X_{t-1}^T Y_{t-1}\right) \\
&= \left(X_{t-1}^T X_{t-1} + X_t^T X_t\right)^{-1}\left(X_{t-1}^T Y_{t-1} + X_t^T Y_t\right)
\end{aligned}$$
or, in the notation above
$$\theta_n = \left(X_0^T X_0 + X^T X\right)^{-1}\left(X_0^T Y_0 + X^T Y\right)$$
as indicated above.
Gibbs sampler
Suppose we cannot derive $p(\theta \mid Y)$ in closed form (it does not have a standard
probability distribution) but we can identify the conditional posterior distributions.
We can utilize the full conditional posterior distributions to draw dependent samples for parameters of interest via McMC simulation.
For full conditional posterior distributions
$$\begin{array}{c} p(\theta_1 \mid \theta_{-1}, Y) \\ \vdots \\ p(\theta_k \mid \theta_{-k}, Y) \end{array}$$
draws are made for $\theta_1$ conditional on starting values for parameters other than $\theta_1$,
that is $\theta_{-1}$. Then, $\theta_2$ is drawn conditional on the $\theta_1$ draw and the starting value
for the remaining $\theta$. Next, $\theta_3$ is drawn conditional on the draws for $\theta_1$ and $\theta_2$ and
the remaining $\theta$. This continues until all $\theta$ have been sampled. Then the sampling
is repeated for a large number of draws with parameters updated each iteration by
the most recent draw.
The samples are dependent. Not all samples will be from the posterior; only after a finite (but unknown) number of iterations are draws from the marginal posterior distribution (see Gelfand and Smith [1990]). (Note, in general, $p(\theta_1, \theta_2 \mid Y) \neq p(\theta_1 \mid \theta_2, Y)\, p(\theta_2 \mid \theta_1, Y)$.) Convergence is usually checked using trace plots,
burn-in iterations, and other convergence diagnostics. Model specification includes
convergence checks, sensitivity to starting values and possibly prior distribution
and likelihood assignments, comparison of draws from the posterior predictive
distribution with the observed sample, and various goodness of fit statistics.
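
A minimal sketch of a two-block Gibbs sampler, using an example chosen here for illustration: Gaussian data with unknown mean and variance under the uninformative prior proportional to the inverse of the variance. Under that assumption the mean given the variance is Gaussian around the sample mean, and the variance given the mean is inverse-gamma with shape n/2 and scale equal to half the sum of squared deviations from the current mean draw. The iteration counts, starting values, and data are assumptions.

```python
import random

def gibbs_normal(y, n_iter=6000, burn_in=1000, seed=0):
    rng = random.Random(seed)
    n = len(y)
    ybar = sum(y) / n
    theta, sigma2 = 0.0, 1.0                  # arbitrary starting values
    draws = []
    for it in range(n_iter):
        # theta | sigma2, y ~ N(ybar, sigma2 / n)
        theta = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # sigma2 | theta, y ~ inverse-gamma(n/2, ss/2):
        # if G ~ gamma(shape, 1) then scale / G is inverse-gamma(shape, scale).
        ss = sum((yi - theta) ** 2 for yi in y)
        sigma2 = (ss / 2.0) / rng.gammavariate(n / 2.0, 1.0)
        if it >= burn_in:                     # discard burn-in draws
            draws.append((theta, sigma2))
    return draws

# Illustrative data: 50 draws from N(3, 2^2).
rng = random.Random(11)
y = [3.0 + 2.0 * rng.gauss(0, 1) for _ in range(50)]
draws = gibbs_normal(y)
theta_mean = sum(t for t, _ in draws) / len(draws)
```

Each iteration conditions on the most recent draw of the other parameter, which is why successive draws are dependent.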
Albert and Chib’s Gibbs sampler Bayes’ probit
The challenge with discrete choice models (like probit) is that latent utility is
unobservable, rather the analyst observes only discrete (usually binary) choices
(see chapter 5). Albert & Chib [1993] employ Bayesian data augmentation to
“supply” the latent variable. Hence, parameters of a probit model are estimated
via normal Bayesian regression (see earlier discussion in this chapter). Consider
the latent utility model
$$U_D = W\theta - V$$
The conditional posterior distribution for $\theta$ is
$$p(\theta \mid D, W, U_D) \sim N\left(b_1, \left(Q^{-1} + W^T W\right)^{-1}\right)$$
where
$$b_1 = \left(Q^{-1} + W^T W\right)^{-1}\left(Q^{-1} b_0 + W^T W b\right)$$
$$b = \left(W^T W\right)^{-1} W^T U_D$$
$b_0$ = prior means for $\theta$ and $Q = \left(W_0^T W_0\right)^{-1}$ is the prior for the covariance. The
conditional posterior distributions for the latent variables are
$$p(U_D \mid D = 1, W, \theta) \sim N(W\theta, I \mid U_D > 0) \text{ or } TN_{(0,\infty)}(W\theta, I)$$
$$p(U_D \mid D = 0, W, \theta) \sim N(W\theta, I \mid U_D \leq 0) \text{ or } TN_{(-\infty,0)}(W\theta, I)$$
where $TN(\cdot)$ refers to random draws from a truncated normal (truncated below
for the first and truncated above for the second). Iterative draws for $(U_D \mid D, W, \theta)$
and $(\theta \mid D, W, U_D)$ form the Gibbs sampler. Interval estimates of $\theta$ are supplied
by post-convergence draws of $(\theta \mid D, W, U_D)$. For simulated normal draws of the
unobservable portion of utility, V, this Bayes' augmented data probit produces
remarkably similar inferences to MLE.¹⁰
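
A sketch of the data-augmentation sampler for the simplest case, one regressor and no intercept, so the coefficient and its prior variance (called `q` here) are scalars. Truncated normal draws use the inverse distribution function. The simulated data, the diffuse prior, and the iteration counts are assumptions for illustration; this is not the bayesm implementation.

```python
import random
from statistics import NormalDist

STD = NormalDist()

def draw_truncated(mean, positive, rng):
    """Draw from N(mean, 1) truncated to (0, inf) if positive, else (-inf, 0].
    Adequate for moderate |mean|; extreme tails are clipped."""
    p0 = STD.cdf(-mean)                       # P(U <= 0)
    u = p0 + (1 - p0) * rng.random() if positive else p0 * rng.random()
    u = min(max(u, 1e-12), 1 - 1e-12)         # keep inv_cdf's argument in (0, 1)
    return mean + STD.inv_cdf(u)

def probit_gibbs(w, d, b0=0.0, q=100.0, n_iter=1000, burn_in=300, seed=0):
    rng = random.Random(seed)
    post_prec = 1.0 / q + sum(wi * wi for wi in w)    # Q^{-1} + W'W
    theta, draws = 0.0, []
    for it in range(n_iter):
        # Latent utilities given choices and the current coefficient.
        u = [draw_truncated(wi * theta, di == 1, rng) for wi, di in zip(w, d)]
        # Coefficient given latent utilities: normal Bayesian regression.
        b1 = (b0 / q + sum(wi * ui for wi, ui in zip(w, u))) / post_prec
        theta = rng.gauss(b1, post_prec ** -0.5)
        if it >= burn_in:
            draws.append(theta)
    return draws

# Simulated choices with true coefficient 1.
rng = random.Random(3)
w = [rng.gauss(0, 1) for _ in range(200)]
d = [1 if wi * 1.0 - rng.gauss(0, 1) > 0 else 0 for wi in w]
draws = probit_gibbs(w, d)
theta_mean = sum(draws) / len(draws)
```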
Metropolis-Hastings algorithm
If neither some conditional posterior, $p(\theta_j \mid Y, \theta_{-j})$, nor the marginal posterior,
$p(\theta \mid Y)$, is recognizable, then we can employ the Metropolis-Hastings (MH) algorithm. The Gibbs sampler is a special case of the MH algorithm. The random
walk Metropolis algorithm is most common and outlined next.
The random walk Metropolis algorithm is as follows. We wish to draw from
$p(\theta \mid \cdot)$ but we only know $p(\theta \mid \cdot)$ up to constant of proportionality, $p(\theta \mid \cdot) = c f(\theta \mid \cdot)$ where c is unknown.
• Let $\theta^{(k-1)}$ be a draw from $p(\theta \mid \cdot)$.¹¹
• Draw $\theta^*$ from $N(\theta^{(k-1)}, s^2)$ where $s^2$ is fixed.
• Let $\alpha = \min\left\{1, \frac{p(\theta^* \mid \cdot)}{p(\theta^{(k-1)} \mid \cdot)} = \frac{c f(\theta^* \mid \cdot)}{c f(\theta^{(k-1)} \mid \cdot)}\right\}$.
• Draw $z^*$ from U(0, 1).
• If $z^* < \alpha$ then $\theta^{(k)} = \theta^*$, otherwise $\theta^{(k)} = \theta^{(k-1)}$. In other words, with
probability $\alpha$ set $\theta^{(k)} = \theta^*$, and otherwise set $\theta^{(k)} = \theta^{(k-1)}$.¹²
These draws converge to random draws from the marginal posterior distribution
after a burn-in interval if properly tuned.

¹⁰ An efficient algorithm for this Gibbs sampler probit, rbprobitGibbs, is available in the bayesm
package of R (http://www.r-project.org/), the open source statistical computing project. Bayesm is a
package written to complement Rossi, Allenby, and McCulloch [2005].
¹¹ The procedure describes the algorithm for a single parameter. A general K parameter algorithm
works similarly (see Train [2002], p. 305):
(a) Start with a value $\beta_n^0$.
(b) Draw K independent values from a standard normal density, and stack the draws into a vector
labeled $\eta^1$.
(c) Create a trial value of $\tilde{\beta}_n^1 = \beta_n^0 + \rho\Lambda\eta^1$ where $\rho$ is the researcher-chosen jump size parameter,
$\Lambda$ is the Cholesky factor of W such that $\Lambda\Lambda^T = W$. Note the proposal distribution is specified to be
normal with zero mean and variance $\rho^2 W$.
(d) Draw a standard uniform variable $\mu^1$.
(e) Calculate the ratio $F = \frac{L(y_n \mid \tilde{\beta}_n^1)\,\phi(\tilde{\beta}_n^1 \mid b, W)}{L(y_n \mid \beta_n^0)\,\phi(\beta_n^0 \mid b, W)}$ where $L(y_n \mid \tilde{\beta}_n^1)$ is a product of logits, and
$\phi(\tilde{\beta}_n^1 \mid b, W)$ is the normal density.
(f) If $\mu^1 \leq F$, accept $\tilde{\beta}_n^1$; if $\mu^1 > F$, reject $\tilde{\beta}_n^1$ and let $\beta_n^1 = \beta_n^0$.
(g) Repeat the process many times. For sufficiently large t, $\beta_n^t$ is a draw from the marginal posterior.
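
The random walk Metropolis steps can be sketched for a single parameter. The target kernel here, an unnormalized Gaussian centered at 2, is an assumption chosen so the answer is known; the jump size, chain length, and burn-in are likewise illustrative.

```python
import math
import random

def rw_metropolis(f, start, s, n_iter, seed=0):
    """Random walk Metropolis for a density known only up to a constant:
    f is the kernel and s is the proposal standard deviation (jump size)."""
    rng = random.Random(seed)
    theta, draws, accepted = start, [], 0
    for _ in range(n_iter):
        proposal = rng.gauss(theta, s)
        # The unknown constant c cancels in the ratio.
        alpha = min(1.0, f(proposal) / f(theta))
        if rng.random() < alpha:
            theta = proposal
            accepted += 1
        draws.append(theta)
    return draws, accepted / n_iter

def kernel(theta):
    """Unnormalized N(2, 1) density."""
    return math.exp(-0.5 * (theta - 2.0) ** 2)

draws, rate = rw_metropolis(kernel, start=0.0, s=2.4, n_iter=20000)
kept = draws[2000:]                          # drop burn-in
post_mean = sum(kept) / len(kept)
```

Rerunning with a much smaller or larger `s` shows the acceptance rate rising or falling accordingly.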
Tuning the Metropolis algorithm involves selecting $s^2$ (jump size) so that the
parameter space is explored appropriately (see Halton sequences discussion below). Usually, smaller jump size results in more accepts and larger jump size
results in fewer accepts. If $s^2$ is too small, the Markov chain will not converge
quickly, has more serial correlation in the draws, and may get stuck at a local
mode (multi-modality can be a problem). If $s^2$ is too large, the Markov chain will
move around too much and not be able to thoroughly explore areas of high posterior probability. Of course, we desire concentrated samples from the posterior
distribution. A commonly-employed rule of thumb is to target an acceptance rate
for $\theta^*$ around 30% (20-80% is usually considered "reasonable").¹³
Some other McMC methods
Other acceptance sampling procedures such as WinBUGS (see Spiegelhalter, et
al. [2003]) are self-tuned. That is, the algorithm adaptively tunes the jump size
in generating random post convergence joint posterior draws. A difficulty with
WinBUGS is that it can mysteriously crash with little diagnostic aid.
Halton sequences
Random sampling can be slow to provide good coverage and hence prove to be
a costly way to simulate data. An alternative that provides better coverage with
fewer draws involves Halton sequences (see Train [2002], ch. 9, p. 224-238). Unlike other methods discussed above, Halton draws tend to be negatively correlated.
Importantly, Bhat [2001] finds that 100 Halton draws provided lower simulation
error for his mixed logit than 1,000 random draws, for discrete choice models.
Further, the error rate with 125 Halton draws was half as large as with 1,000
random draws and somewhat smaller than with 2,000 random draws.
A Halton sequence builds around a pre-determined number k (usually a prime
number). The Halton sequence is
$$s_{t+1} = \left\{ s_t,\ s_t + \frac{1}{k^t},\ s_t + \frac{2}{k^t},\ \ldots,\ s_t + \frac{k-1}{k^t} \right\}$$
starting with $s_0 = 0$ (even though zero is ignored). An example helps to fix ideas.
Example 7.1 Consider the prime k = 3. The sequence through two iterations is
$$\left\{ \begin{array}{l} 0 + 1/3 = 1/3,\ 0 + 2/3 = 2/3, \\ 0 + 1/9 = 1/9,\ 1/3 + 1/9 = 4/9,\ 2/3 + 1/9 = 7/9, \\ 0 + 2/9 = 2/9,\ 1/3 + 2/9 = 5/9,\ 2/3 + 2/9 = 8/9,\ \ldots \end{array} \right\}$$
This procedure describes uniform Halton draws. Other distributions are accommodated in the usual way, by inverse distribution functions.

¹² A modification of the RW Metropolis algorithm works on the log scale: set $\theta^{(k)} = \theta^*$ when $\log(z^*) < \alpha$ where
$\alpha = \min\{0, \log[f(\theta^* \mid \cdot)] - \log[f(\theta^{(k-1)} \mid \cdot)]\}$.
¹³ Gelman, et al [2004] report the optimal acceptance rate is 0.44 when the number of parameters
K = 1 and drops toward 0.23 as K increases.
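
A compact way to generate Halton draws is to reverse the base-k digits of the index about the radix point, which reproduces the sequence in the same order as the recursive construction. The sketch below (function names are illustrative) also applies the inverse normal distribution function to obtain standard normal draws.

```python
from statistics import NormalDist

def halton(index, base):
    """index-th Halton draw (index = 1, 2, ...) for the given base:
    reverse the base-k digits of index about the radix point."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * f
        f /= base
    return result

# First eight draws for k = 3: 1/3, 2/3, 1/9, 4/9, 7/9, 2/9, 5/9, 8/9.
uniform_draws = [halton(i, 3) for i in range(1, 9)]

# Standard normal draws via the inverse distribution function.
normal_draws = [NormalDist().inv_cdf(u) for u in uniform_draws]
```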
Example 7.2 For example, normal draws are found by $\Phi^{-1}(s_t)$. Continuing with
the above Halton sequence, standard normal draws are
$$\left\{ \begin{array}{l} \Phi^{-1}(1/3) \approx -0.43,\ \Phi^{-1}(2/3) \approx 0.43, \\ \Phi^{-1}(1/9) \approx -1.22,\ \Phi^{-1}(4/9) \approx -0.14,\ \Phi^{-1}(7/9) \approx 0.76, \\ \Phi^{-1}(2/9) \approx -0.76,\ \Phi^{-1}(5/9) \approx 0.14,\ \Phi^{-1}(8/9) \approx 1.22,\ \ldots \end{array} \right\}$$
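Example 7.3 For two independent standard normal unobservables we create Halton sequences for each from different primes and transform. Suppose we use k = 2
and k = 3. The first few draws are
$$\begin{aligned}
\varepsilon_1 &= \left(\Phi^{-1}\left(\tfrac{1}{2}\right) = 0,\ \Phi^{-1}\left(\tfrac{1}{3}\right) = -0.43\right), \\
\varepsilon_2 &= \left(\Phi^{-1}\left(\tfrac{1}{4}\right) = -0.67,\ \Phi^{-1}\left(\tfrac{2}{3}\right) = 0.43\right), \\
\varepsilon_3 &= \left(\Phi^{-1}\left(\tfrac{3}{4}\right) = 0.67,\ \Phi^{-1}\left(\tfrac{1}{9}\right) = -1.22\right), \\
\varepsilon_4 &= \left(\Phi^{-1}\left(\tfrac{1}{8}\right) = -1.15,\ \Phi^{-1}\left(\tfrac{4}{9}\right) = -0.14\right), \\
\varepsilon_5 &= \left(\Phi^{-1}\left(\tfrac{5}{8}\right) = 0.32,\ \Phi^{-1}\left(\tfrac{7}{9}\right) = 0.76\right), \\
\varepsilon_6 &= \left(\Phi^{-1}\left(\tfrac{3}{8}\right) = -0.32,\ \Phi^{-1}\left(\tfrac{2}{9}\right) = -0.76\right), \\
\varepsilon_7 &= \left(\Phi^{-1}\left(\tfrac{7}{8}\right) = 1.15,\ \Phi^{-1}\left(\tfrac{5}{9}\right) = 0.14\right),\ \ldots
\end{aligned}$$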
As the initial cycle of elements (from near zero to near one) for multiple dimension
sequences are highly correlated, the initial elements are usually discarded (treated
as burn-in). The number of elements discarded is at least as large as the largest
prime used in creating the sequences. Since primes cycle at different rates after
the first cycle, primes are more effective bases (they have smaller correlation) for
Halton sequences.
Randomized Halton draws
Halton sequences are systematic, not random, while asymptotic properties of estimators assume random (or at least pseudo-random) draws of unobservables. Halton sequences can be transformed in a way that makes draws pseudo-random (as
is the case for all computer-based randomizations). Bhat [2003] suggests the following procedure:
1. Take a draw µ from a standard uniform distribution.
2. Add µ to each element of the Halton sequence. If the resulting element exceeds
one, subtract 1 from it. That is, $s_n = \mathrm{mod}(s_0 + \mu)$ where $s_0$ ($s_n$) is the original
(transformed) element of the Halton sequence and mod(·) returns the fractional
part of the argument.
Suppose µ = 0.4 for the above Halton sequence (again through two iterations),
the pseudo-random sequence is
{0.4, 0.733, 0.067, 0.511, 0.844, 0.178, 0.622, 0.956, 0.289, . . .}
The spacing remains the same so we achieve the same coverage but draws are random. In a sense, this "blocking" approach is similar to bootstrapping regressions
with heteroskedastic and/or correlated errors. A different draw for µ is taken for
each unobservable.
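
The shift can be sketched in a couple of lines. The sequence below is the k = 3 Halton sequence through two iterations with the initial zero included, and the shift of 0.4 matches the illustration above; in practice the shift would be a standard uniform draw.

```python
def shift_halton(sequence, mu):
    """Add mu to every element and keep the fractional part."""
    return [(s + mu) % 1.0 for s in sequence]

# k = 3 Halton sequence through two iterations, including s0 = 0.
halton_3 = [0.0, 1/3, 2/3, 1/9, 4/9, 7/9, 2/9, 5/9, 8/9]

# Fixed here for illustration; normally a draw from U(0, 1).
shifted = shift_halton(halton_3, 0.4)
rounded = [round(s, 3) for s in shifted]
```

Because every element moves by the same amount, the spacing between elements is unchanged.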
Bhat [2003] also proposes scrambled Halton draws to deal with high dimension issues. Halton sequences for high dimension problems utilize larger prime
numbers. For large prime numbers, correlation in the sequences may persist for
much longer than the first cycle as discussed above. Bhat proposes scrambling the
sequence so that if we think of the above sequence as BC then the sequence is reversed to be CB where $B = \frac{1}{3}$ and $C = \frac{2}{3}$. Different permutations are employed
for different primes. Continuing with the above Halton sequence for k = 3, the
original and scrambled sequences are tabulated below.
| Original | Scrambled |
|---|---|
| 1/3 | 2/3 |
| 2/3 | 1/3 |
| 1/9 | 2/9 |
| 4/9 | 8/9 |
| 7/9 | 5/9 |
| 2/9 | 1/9 |
| 5/9 | 7/9 |
| 8/9 | 4/9 |

7.4 Additional reading
Kreps [1988, ch. 11] and McCall [1991] discuss exchangeability and de Finetti’s
theorem as well as implications for economics. Davidson and MacKinnon [2003],
MacKinnon [2002], and Cameron and Trivedi [2005] discuss bootstrapping, pivotal statistics, etc., and Horowitz [2001] provides an extensive discussion of bootstrapping. Casella and George [1992] and Chib and Hamilton [1995] offer basic introductions to the Gibbs sampler and Metropolis-Hastings algorithm, respectively. Tanner and Wong [1987] discuss calculating posterior distributions
by data augmentation. Train [2002, ch. 9] discusses various Halton sequence approaches and other remaining open questions associated with this relatively new,
but promising technique.