Bayesian analysis
Bayesian analysis illustrates scientific reasoning where consistent reasoning (in the
sense that two individuals with the same background knowledge, evidence, and perceived veracity of the evidence reach the same conclusion) is fundamental. In the next
few pages we survey foundational ideas: Bayes’ theorem (and its product and sum
rules), maximum entropy probability assignment, the importance of loss functions,
and Bayesian posterior simulation (direct and Markov chain Monte Carlo or McMC,
for short). Then, we return to some of the examples utilized to illustrate classical inference strategies in the projections and conditional expectation functions notes.
1 Bayes’ theorem and consistent reasoning
Consistency is the hallmark of scientific reasoning. When we consider probability
assignment to events, whether they are marginal, conditional, or joint events their assignments should be mutually consistent (match up with common sense). This is what
Bayes’ product and sum rules express formally.
The product rule says the product of a conditional likelihood (or distribution) and
the marginal likelihood (or distribution) of the conditioning variable equals the joint
likelihood (or distribution).
\[
p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)
\]
The sum rule says that if we sum over all events related to one variable or set of variables (that is, integrate it out), we are left with the likelihood (or distribution) of the remaining variable(s).
\[
p(x) = \sum_{i=1}^{n} p(x, y_i)
\]
\[
p(y) = \sum_{j=1}^{n} p(x_j, y)
\]
Bayes’ theorem combines these ideas to describe consistent evaluation of evidence.
That is, the posterior likelihood associated with a proposition, $\theta$, given the evidence, $y$, is equal to the product of the conditional likelihood of the evidence given the proposition and the marginal or prior likelihood of the conditioning variable (the proposition), scaled by the likelihood of the evidence. Notice, we’ve simply rewritten the product rule where both sides are divided by $p(y)$, and $p(y)$ is simply the sum rule where $\theta$ is integrated out of the joint distribution, $p(\theta, y)$.
\[
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}
\]
For Bayesian analyses, we often find it convenient to suppress the normalizing factor, $p(y)$, and write the posterior distribution as proportional to the product of the sampling distribution or likelihood function and prior distribution.
\[
p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)
\]
or
\[
p(\theta \mid y) \propto \ell(\theta \mid y)\, p(\theta)
\]
where $p(y \mid \theta)$ is the sampling distribution, $\ell(\theta \mid y)$ is the likelihood function, and $p(\theta)$ is the prior distribution for $\theta$. Bayes’ theorem is the glue that holds consistent probability assignment together.
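Example 1 Consider the following joint distribution:
\[
\begin{array}{c|cc}
p(y, \theta) & y = y_1 & y = y_2 \\ \hline
\theta = \theta_1 & 0.1 & 0.4 \\
\theta = \theta_2 & 0.2 & 0.3
\end{array}
\]
The sum rule yields the following marginal distributions:
\[
p(y = y_1) = 0.3, \qquad p(y = y_2) = 0.7
\]
and
\[
p(\theta = \theta_1) = 0.5, \qquad p(\theta = \theta_2) = 0.5
\]
The product rule gives the conditional distributions:
\[
\begin{array}{c|cc}
 & y_1 & y_2 \\ \hline
p(y \mid \theta = \theta_1) & 0.2 & 0.8 \\
p(y \mid \theta = \theta_2) & 0.4 & 0.6
\end{array}
\]
and
\[
\begin{array}{c|cc}
 & \theta_1 & \theta_2 \\ \hline
p(\theta \mid y = y_1) & \frac{1}{3} & \frac{2}{3} \\
p(\theta \mid y = y_2) & \frac{4}{7} & \frac{3}{7}
\end{array}
\]
as common sense dictates.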
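The arithmetic in Example 1 can be checked mechanically. The following is a minimal sketch, assuming the joint distribution is stored in a dictionary keyed by hypothetical labels ("y1", "t1", and so on, which are not part of the notes); it applies the sum rule to get the marginals and Bayes’ theorem to get the posteriors.

```python
# Joint distribution p(y, theta) from Example 1; the string labels are
# illustrative names chosen for this sketch.
joint = {
    ("y1", "t1"): 0.1,
    ("y2", "t1"): 0.4,
    ("y1", "t2"): 0.2,
    ("y2", "t2"): 0.3,
}

def marginal(joint, axis):
    """Sum rule: add the joint probabilities over the other variable."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def posterior(joint, p_y, y):
    """Bayes' theorem: p(theta | y) = p(y, theta) / p(y)."""
    return {t: p / p_y[y] for (yy, t), p in joint.items() if yy == y}

p_y = marginal(joint, 0)        # marginal distribution of y
p_theta = marginal(joint, 1)    # marginal distribution of theta
post_y1 = posterior(joint, p_y, "y1")
post_y2 = posterior(joint, p_y, "y2")
print(p_y, p_theta, post_y1, post_y2)
```

Up to floating point rounding, the posteriors agree with the fractions 1/3, 2/3 and 4/7, 3/7 in the example.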
2 Maximum entropy distributions
From the above, we see that evaluation of propositions given evidence is entirely determined by the sampling distribution, $p(y \mid \theta)$, or likelihood function, $\ell(\theta \mid y)$, and the prior distribution for the proposition, $p(\theta)$. Consequently, assignment of these probabilities is a matter of some considerable import. How do we proceed? Jaynes suggests we take account of our background knowledge, $\Im$, and evaluate the evidence in a manner consistent with both background knowledge and evidence. That is, the posterior likelihood (or distribution) is more aptly represented by
\[
p(\theta \mid y, \Im) \propto p(y \mid \theta, \Im)\, p(\theta \mid \Im)
\]
Now, we’re looking for a mathematical statement of what we know and only what we know. For this idea to be properly grounded requires a sense of complete ignorance (even though this may never represent our state of background knowledge). For instance, if we think that $\mu_1$ is more likely the mean or expected value than $\mu_2$, then we must not be completely ignorant about the location of the random variable, and consistency demands that our probability assignment reflect this knowledge. Further, if the order of events or outcomes is not exchangeable (if one permutation is more plausible than another), then the events are not seen as stochastically independent or identically distributed.$^{1}$ The mathematical statement of our background knowledge is defined in terms of Shannon’s entropy (or sense of diffusion or uncertainty).
2.1 Entropy
Shannon defines entropy as$^{2}$
\[
h = -\sum_{i=1}^{n} p_i \log(p_i)
\]
for discrete events where
\[
\sum_{i=1}^{n} p_i = 1
\]
or differential entropy as
\[
h = -\int p(x) \log p(x)\, dx
\]
for events with continuous support where
\[
\int p(x)\, dx = 1
\]
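To make the discrete definition concrete, here is a minimal sketch of the entropy calculation using natural logarithms. The function name and the second, skewed distribution are illustrative assumptions, not taken from the notes.

```python
import math

def entropy(p):
    """Shannon entropy h = -sum_i p_i log p_i (natural log).

    Terms with p_i = 0 contribute zero, by the usual convention.
    """
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to one")
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

h_uniform = entropy([1 / 3, 1 / 3, 1 / 3])   # equals log 3
h_skewed = entropy([0.7, 0.2, 0.1])          # hypothetical, less diffuse assignment
print(h_uniform, h_skewed)
```

The uniform assignment has the larger entropy, which is the sense in which it encodes the least information.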
2.2 Discrete examples
2.2.1 Discrete uniform
Example 2 Suppose we know only that there are three possible (exchangeable) events, $x_1 = 1$, $x_2 = 2$, and $x_3 = 3$. The maximum entropy probability assignment is found by solving the Lagrangian
\[
\mathcal{L} \equiv \max_{p_i} \; -\sum_{i=1}^{3} p_i \log p_i - (\lambda_0 - 1)\left(\sum_{i=1}^{3} p_i - 1\right)
\]
First order conditions yield
\[
p_i = e^{-\lambda_0} \quad \text{for } i = 1, 2, 3
\]
and
\[
\lambda_0 = \log 3
\]
Hence, as expected, the maximum entropy probability assignment is a discrete uniform distribution with $p_i = \frac{1}{3}$ for $i = 1, 2, 3$.

$^{1}$ Exchangeability is foundational for independent and identically distributed events (iid), a cornerstone of inference. However, exchangeability is often invoked in a conditional sense. That is, conditional on a set of variables exchangeability applies, a foundational idea of conditional expectations or regression.
$^{2}$ The axiomatic development for this measure of entropy can be found in Jaynes [2003] or Accounting and Causal Effects: Econometric Challenges, ch. 13.
2.2.2 Discrete nonuniform
Example 3 Now suppose we know a little more. We know the mean is 2.5.$^{3}$ The Lagrangian is now
\[
\mathcal{L} \equiv \max_{p_i} \; -\sum_{i=1}^{3} p_i \log p_i - (\lambda_0 - 1)\left(\sum_{i=1}^{3} p_i - 1\right) - \lambda_1\left(\sum_{i=1}^{3} p_i x_i - 2.5\right)
\]
First order conditions yield
\[
p_i = e^{-\lambda_0 - x_i \lambda_1} \quad \text{for } i = 1, 2, 3
\]
and
\[
\lambda_0 = 2.987, \qquad \lambda_1 = -0.834
\]
The maximum entropy probability assignment is
\[
p_1 = 0.116, \qquad p_2 = 0.268, \qquad p_3 = 0.616
\]
2.3 Normalization and partition functions
The above analysis suggests a general approach for assigning probabilities.
\[
p(x_i) = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)} \exp\left[-\sum_{j=1}^{m} \lambda_j f_j(x_i)\right]
\]
where $f_j(x_i)$ is a function of the random variable, $x_i$, reflecting what we know$^{4}$ and
\[
Z(\lambda_1, \ldots, \lambda_m) = \sum_{k=1}^{n} \exp\left[-\sum_{j=1}^{m} \lambda_j f_j(x_k)\right]
\]
is a normalizing factor, called a partition function. Probability assignment is completed by determining the Lagrange multipliers, $\lambda_j$, $j = 1, \ldots, m$, from the $m$ constraints.

$^{3}$ Clearly, if we knew the mean is 2 then we would assign the uniform discrete distribution above.
$^{4}$ Since $\lambda_0$ simply ensures the probabilities sum to unity and the partition function assures this, we can define the partition function without $\lambda_0$.

Return to the example above. Since we know support and the mean, $n = 3$ and the function $f(x_i) = x_i$. This implies
\[
Z(\lambda_1) = \sum_{i=1}^{3} \exp[-\lambda_1 x_i]
\]
and
\[
p_i = \frac{\exp[-\lambda_1 x_i]}{\sum_{k=1}^{3} \exp[-\lambda_1 x_k]}
\]
where $x_1 = 1$, $x_2 = 2$, and $x_3 = 3$. Now, solving the constraint
\[
\sum_{i=1}^{3} p_i x_i - 2.5 = 0
\]
\[
\sum_{i=1}^{3} \frac{\exp[-\lambda_1 x_i]}{\sum_{k=1}^{3} \exp[-\lambda_1 x_k]}\, x_i - 2.5 = 0
\]
produces the multiplier, $\lambda_1 = -0.834$, and identifies the probability assignments
\[
p_1 = 0.116, \qquad p_2 = 0.268, \qquad p_3 = 0.616
\]
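The multiplier in this example can be recovered numerically. The sketch below is one possible way to do it, assuming a simple bisection search on $\lambda_1$ (the notes do not say how the constraint was solved); the function names and search bounds are illustrative.

```python
import math

def maxent_probs(lam, xs):
    """p_i = exp(-lam * x_i) / Z(lam), where Z is the partition function."""
    weights = [math.exp(-lam * x) for x in xs]
    z = sum(weights)
    return [w / z for w in weights]

def mean_gap(lam, xs, target):
    """Left-hand side of the mean constraint: sum_i p_i x_i - target."""
    return sum(p * x for p, x in zip(maxent_probs(lam, xs), xs)) - target

def solve_multiplier(xs, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on lam; the implied mean is decreasing in lam."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_gap(mid, xs, target) > 0:
            lo = mid    # mean still too high, so lam must increase
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

xs = [1, 2, 3]
lam1 = solve_multiplier(xs, 2.5)
probs = maxent_probs(lam1, xs)
print(round(lam1, 3), [round(p, 3) for p in probs])
```

The search returns a multiplier close to −0.834 and probabilities that round to the assignment reported above.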
We utilize the analog to the above partition function approach next to address continuous density assignment.
2.4 Continuous examples
The partition function approach for continuous support involves density assignment
\[
p(x) = \frac{\exp\left[-\sum_{j=1}^{m} \lambda_j f_j(x)\right]}{\int_a^b \exp\left[-\sum_{j=1}^{m} \lambda_j f_j(x)\right] dx}
\]
where support is between $a$ and $b$.
2.4.1 Continuous uniform
Example 4 Suppose we only know support is between zero and three. The above partition function density assignment is simply (there are no constraints so there are no multipliers to identify)
\[
p(x) = \frac{\exp[0]}{\int_0^3 \exp[0]\, dx} = \frac{1}{3}
\]
Of course, this is the density function for a uniform with support from 0 to 3.
2.4.2 Known mean
Example 5 Continue the example above but with known mean equal to 2. The partition function density assignment is
\[
p(x) = \frac{\exp[-\lambda_1 x]}{\int_0^3 \exp[-\lambda_1 x]\, dx}
\]
and the mean constraint is
\[
\int_0^3 x\, p(x)\, dx - 2 = 0
\]
\[
\int_0^3 x\, \frac{\exp[-\lambda_1 x]}{\int_0^3 \exp[-\lambda_1 x]\, dx}\, dx - 2 = 0
\]
so that $\lambda_1 = -0.716$ and the density function is a truncated exponential distribution with support from 0 to 3.
\[
p(x) = 0.0945 \exp[0.716 x]
\]
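The continuous mean constraint can be solved the same way. This is a sketch under stated assumptions: it uses the closed-form mean of a density proportional to $\exp[-\lambda_1 x]$ on $[0, b]$, namely $1/\lambda_1 - b e^{-\lambda_1 b}/(1 - e^{-\lambda_1 b})$, and a bisection search; neither the closed form nor the search method appears in the notes.

```python
import math

def truncated_exp_mean(lam, b):
    """Mean of the density proportional to exp(-lam * x) on [0, b]."""
    if abs(lam) < 1e-12:
        return b / 2.0                    # lam = 0 is the uniform case
    e = math.exp(-lam * b)
    return 1.0 / lam - b * e / (1.0 - e)

def solve_lambda(target, b, lo=-20.0, hi=20.0, tol=1e-12):
    """Bisection on lam; the mean falls as lam rises."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if truncated_exp_mean(mid, b) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

b = 3.0
lam1 = solve_lambda(2.0, b)
norm = lam1 / (1.0 - math.exp(-lam1 * b))   # 1 / Z(lam1), the constant in p(x)
print(lam1, norm)
```

The multiplier and normalizing constant agree with the reported −0.716 and 0.0945 to the precision shown in the example.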
2.4.3 Gaussian (normal) distribution and known variance
Example 6 Suppose we know the average dispersion or variance is $\sigma^2 = 100$. Then, a finite mean must exist, but even if we don’t know it, we can find the maximum entropy density function for arbitrary mean, $\mu$. Using the partition function approach above we have
\[
p(x) = \frac{\exp\left[-\lambda_2 (x - \mu)^2\right]}{\int_{-\infty}^{\infty} \exp\left[-\lambda_2 (x - \mu)^2\right] dx}
\]
and the average dispersion constraint is
\[
\int_{-\infty}^{\infty} (x - \mu)^2\, p(x)\, dx - 100 = 0
\]
\[
\int_{-\infty}^{\infty} (x - \mu)^2\, \frac{\exp\left[-\lambda_2 (x - \mu)^2\right]}{\int_{-\infty}^{\infty} \exp\left[-\lambda_2 (x - \mu)^2\right] dx}\, dx - 100 = 0
\]
so that $\lambda_2 = \frac{1}{2\sigma^2}$ and the density function is
\[
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]
= \frac{1}{\sqrt{2\pi}\,10} \exp\left[-\frac{(x - \mu)^2}{200}\right]
\]
Of course, this is a Gaussian or normal density function. Strikingly, the Gaussian distribution has the greatest entropy of any probability assignment with the same variance.
3 Loss functions
Bayesian analysis is combined with decision analysis via explicit recognition of a loss function. Relative to classical statistics this is a strength, as a loss function always exists but is sometimes not acknowledged. For simplicity and brevity we’ll explore symmetric versions of a few loss functions.$^{5}$
Let $\hat{\theta}$ denote an estimator for $\theta$, $c(\hat{\theta}, \theta)$ denote a loss function, and $p(\theta \mid y)$ denote the posterior distribution for $\theta$ given evidence $y$. A sensible strategy for consistent decision making involves selecting the estimator, $\hat{\theta}$, to minimize the average or expected loss.
\[
\min_{\hat{\theta}} E[\text{loss}] = \min_{\hat{\theta}} E\left[c(\hat{\theta}, \theta)\right]
\]
Briefly, for symmetric loss functions, we find the expected loss minimizing estimator is the posterior mean for a quadratic loss function, the median of the posterior distribution for a linear loss function, and the posterior mode for an all or nothing loss function.
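3.1 Quadratic loss
Let $c(\hat{\theta}, \theta) = \delta (\hat{\theta} - \theta)^2$ for $\delta > 0$, then (with support from $a$ to $b$)
\[
\min_{\hat{\theta}} E\left[c(\hat{\theta}, \theta)\right] = \min_{\hat{\theta}} \int_a^b \delta (\hat{\theta} - \theta)^2\, p(\theta \mid y)\, d\theta
\]
The first order conditions are
\[
\frac{d}{d\hat{\theta}} \left\{ \int_a^b \delta (\hat{\theta} - \theta)^2\, p(\theta \mid y)\, d\theta \right\} = 0
\]
Expansion of the integrand gives
\[
\frac{d}{d\hat{\theta}} \left\{ \int_a^b \delta \left(\hat{\theta}^2 - 2\hat{\theta}\theta + \theta^2\right) p(\theta \mid y)\, d\theta \right\} = 0
\]
or
\[
\frac{d}{d\hat{\theta}} \left\{ \delta \left[ \hat{\theta}^2 \int_a^b p(\theta \mid y)\, d\theta - 2\hat{\theta} \int_a^b \theta\, p(\theta \mid y)\, d\theta + \int_a^b \theta^2\, p(\theta \mid y)\, d\theta \right] \right\} = 0
\]
Differentiation gives (using $\int_a^b p(\theta \mid y)\, d\theta = 1$)
\[
2\delta \left[ \hat{\theta} - \int_a^b \theta\, p(\theta \mid y)\, d\theta \right] = 0
\]
and the solution is
\[
\hat{\theta} = \int_a^b \theta\, p(\theta \mid y)\, d\theta
\]
where $\int_a^b \theta\, p(\theta \mid y)\, d\theta$ is the posterior mean.

$^{5}$ The more general case, including asymmetric loss, is addressed in chapter 4 of Accounting and Causal Effects: Econometric Challenges.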
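3.2 Linear loss
Let $c(\hat{\theta}, \theta) = \delta\, |\hat{\theta} - \theta|$ for $\delta > 0$, then
\[
\min_{\hat{\theta}} E\left[c(\hat{\theta}, \theta)\right] = \min_{\hat{\theta}} \int_a^b \delta\, |\hat{\theta} - \theta|\, p(\theta \mid y)\, d\theta
\]
The first order conditions are
\[
\frac{d}{d\hat{\theta}} \left\{ \int_a^b \delta\, |\hat{\theta} - \theta|\, p(\theta \mid y)\, d\theta \right\} = 0
\]
Rearrangement of the integrand gives
\[
\frac{d}{d\hat{\theta}} \left\{ \delta \int_a^{\hat{\theta}} (\hat{\theta} - \theta)\, p(\theta \mid y)\, d\theta - \delta \int_{\hat{\theta}}^b (\hat{\theta} - \theta)\, p(\theta \mid y)\, d\theta \right\} = 0
\]
or
\[
\frac{d}{d\hat{\theta}} \left\{ \delta \left[ \hat{\theta} F(\hat{\theta} \mid y) - \int_a^{\hat{\theta}} \theta\, p(\theta \mid y)\, d\theta - \hat{\theta} \left[1 - F(\hat{\theta} \mid y)\right] + \int_{\hat{\theta}}^b \theta\, p(\theta \mid y)\, d\theta \right] \right\} = 0
\]
where $F(\hat{\theta} \mid y)$ is cumulative posterior probability evaluated at $\hat{\theta}$. Differentiation gives
\[
\delta \left[ F(\hat{\theta} \mid y) + \hat{\theta}\, p(\hat{\theta} \mid y) - \hat{\theta}\, p(\hat{\theta} \mid y) - \left[1 - F(\hat{\theta} \mid y)\right] + \hat{\theta}\, p(\hat{\theta} \mid y) - \hat{\theta}\, p(\hat{\theta} \mid y) \right] = 0
\]
or
\[
\delta \left[ F(\hat{\theta} \mid y) - \left[1 - F(\hat{\theta} \mid y)\right] \right] = 0
\]
\[
\delta \left[ 2F(\hat{\theta} \mid y) - 1 \right] = 0
\]
and the solution is
\[
F(\hat{\theta} \mid y) = \frac{1}{2}
\]
or the median of the posterior distribution.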
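3.3 All or nothing loss
Let
\[
c(\hat{\theta}, \theta) =
\begin{cases}
\delta > 0 & \text{if } \hat{\theta} \neq \theta \\
0 & \text{if } \hat{\theta} = \theta
\end{cases}
\]
Then, we want to assign $\hat{\theta}$ the value that maximizes $p(\theta \mid y)$, that is, the posterior mode.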
4 Analysis
The science of experimentation and evaluation of evidence is a deep and subtle art.
Essential ingredients include careful framing of the problem (theory development),
matching of data to be collected with the frame, and iterative model specification
that complements the data and problem frame so that causal effects can be inferred.
Consistent evaluation of evidence draws from the posterior distribution which is proportional to the product of the likelihood and prior. When they combine to generate a
recognizable posterior distribution, our work is made simpler. Next, we briefly discuss
conjugate families which produce recognizable posterior distributions.
4.1 Conjugate families
Conjugate families arise when the likelihood times the prior produces a recognizable posterior kernel
\[
p(\theta \mid y) \propto \ell(\theta \mid y)\, p(\theta)
\]
where the kernel is the characteristic part of the distribution function that depends on the random variable(s) (the part excluding any normalizing constants). For example, the density function for a univariate Gaussian or normal is
\[
\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right]
\]
and its kernel (for $\sigma$ known) is
\[
\exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right]
\]
as $\frac{1}{\sqrt{2\pi}\,\sigma}$ is a normalizing constant. Now, we discuss a few common conjugate family results$^{6}$ and uninformative prior results to connect with classical results.
4.1.1 Binomial - beta prior
A binomial likelihood with unknown success probability, $\theta$,
\[
\ell(\theta \mid s; n) = \binom{n}{s} \theta^{s} (1 - \theta)^{n - s}
\]
\[
s = \sum_{i=1}^{n} y_i, \qquad y_i \in \{0, 1\}
\]
combines with a beta($\theta$; $a$, $b$) prior (i.e., with parameters $a$ and $b$)
\[
p(\theta) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, \theta^{a - 1} (1 - \theta)^{b - 1}
\]
to yield
\[
p(\theta \mid y) \propto \theta^{s} (1 - \theta)^{n - s}\, \theta^{a - 1} (1 - \theta)^{b - 1}
\propto \theta^{s + a - 1} (1 - \theta)^{n - s + b - 1}
\]
which is the kernel of a beta distribution with parameters $(a + s)$ and $(b + n - s)$, beta($\theta \mid y$; $a + s$, $b + n - s$).
Uninformative priors Suppose priors for $\theta$ are uniform over the interval zero to one or, equivalently, beta(1, 1).$^{7}$ Then, the likelihood determines the posterior distribution for $\theta$.
\[
p(\theta \mid y) \propto \theta^{s} (1 - \theta)^{n - s}
\]
which is beta($\theta \mid y$; $1 + s$, $1 + n - s$).

$^{6}$ A more complete set of conjugate families is summarized in chapter 7 of Accounting and Causal Effects: Econometric Challenges.
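The beta update reduces to adding counts to the prior parameters. A minimal sketch follows; the function names and the 0/1 sample are hypothetical and chosen only for illustration.

```python
def beta_binomial_update(a, b, y):
    """Posterior beta parameters (a + s, b + n - s) for 0/1 observations y."""
    n = len(y)
    s = sum(y)
    return a + s, b + n - s

def beta_mean(a, b):
    """Mean of a beta(a, b) distribution."""
    return a / (a + b)

y = [1, 0, 1, 1, 0, 1, 1, 1]                      # hypothetical sample: s = 6, n = 8
a_post, b_post = beta_binomial_update(1, 1, y)    # uniform beta(1, 1) prior
print(a_post, b_post, beta_mean(a_post, b_post))
```

With the uniform prior the posterior is beta(1 + s, 1 + n − s), here beta(7, 3).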
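4.1.2 Gaussian (unknown mean, known variance)
A single draw from a Gaussian likelihood with unknown mean, $\theta$, known standard deviation, $\sigma$,
\[
\ell(\theta \mid y, \sigma) \propto \exp\left[-\frac{1}{2} \frac{(y - \theta)^2}{\sigma^2}\right]
\]
combines with a Gaussian or normal prior for $\theta$ given $\sigma^2$ with prior mean $\theta_0$ and prior variance $\sigma_0^2$
\[
p(\theta \mid \sigma^2; \theta_0, \sigma_0^2) \propto \exp\left[-\frac{1}{2} \frac{(\theta - \theta_0)^2}{\sigma_0^2}\right]
\]
or writing $\sigma_0^2 \equiv \sigma^2 / \kappa_0$, we have
\[
p(\theta \mid \sigma^2; \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2} \frac{\kappa_0 (\theta - \theta_0)^2}{\sigma^2}\right]
\]
to yield
\[
p(\theta \mid y, \sigma, \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2} \left\{ \frac{(y - \theta)^2}{\sigma^2} + \frac{\kappa_0 (\theta - \theta_0)^2}{\sigma^2} \right\}\right]
\]
Expansion and rearrangement gives
\[
p(\theta \mid y, \sigma, \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2\sigma^2} \left\{ y^2 + \kappa_0 \theta_0^2 - 2\theta y + \theta^2 + \kappa_0 \left(\theta^2 - 2\theta\theta_0\right) \right\}\right]
\]
Any terms not involving $\theta$ are constants and can be discarded as they are absorbed on normalization of the posterior
\[
p(\theta \mid y, \sigma, \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2\sigma^2} \left\{ \theta^2 (\kappa_0 + 1) - 2\theta (\kappa_0 \theta_0 + y) \right\}\right]
\]
Completing the square (add and subtract $\frac{(\kappa_0 \theta_0 + y)^2}{\kappa_0 + 1}$), dropping the term subtracted (as it’s all constants), and factoring out $(\kappa_0 + 1)$ gives
\[
p(\theta \mid y, \sigma, \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{\kappa_0 + 1}{2\sigma^2} \left( \theta - \frac{\kappa_0 \theta_0 + y}{\kappa_0 + 1} \right)^2\right]
\]
Finally, we have
\[
p(\theta \mid y, \sigma, \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2} \frac{(\theta - \theta_1)^2}{\sigma_1^2}\right]
\]
where
\[
\theta_1 = \frac{\kappa_0 \theta_0 + y}{\kappa_0 + 1} = \frac{\frac{1}{\sigma_0^2}\theta_0 + \frac{1}{\sigma^2} y}{\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}}
\qquad \text{and} \qquad
\sigma_1^2 = \frac{\sigma^2}{\kappa_0 + 1} = \frac{1}{\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}},
\]
or the posterior distribution of the mean given the data and priors is Gaussian or normal. Notice, the posterior mean, $\theta_1$, weights the data and prior beliefs by their relative precisions.

$^{7}$ Some would utilize Jeffreys’ prior, $p(\theta) \propto$ beta($\theta$; $\frac{1}{2}$, $\frac{1}{2}$), which is invariant to transformation, as the uninformative prior.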
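For a sample of $n$ exchangeable draws, the likelihood is
\[
\ell(\theta \mid y, \sigma) \propto \prod_{i=1}^{n} \exp\left[-\frac{1}{2} \frac{(y_i - \theta)^2}{\sigma^2}\right]
\]
combined with the above prior yields
\[
p(\theta \mid y, \sigma, \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2} \frac{(\theta - \theta_n)^2}{\sigma_n^2}\right]
\]
where
\[
\theta_n = \frac{\kappa_0 \theta_0 + n\bar{y}}{\kappa_0 + n} = \frac{\frac{1}{\sigma_0^2}\theta_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}},
\]
$\bar{y}$ is the sample mean, and
\[
\sigma_n^2 = \frac{\sigma^2}{\kappa_0 + n} = \frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}},
\]
or the posterior distribution of the mean, $\theta$, given the data and priors is again Gaussian or normal and the posterior mean, $\theta_n$, weights the data and priors by their relative precisions.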
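The two equivalent forms of the posterior mean and variance can be compared directly. The following is a minimal sketch with hypothetical prior settings and data; the function names are illustrative.

```python
def update_kappa_form(theta0, kappa0, sigma2, y):
    """Posterior mean and variance using sigma_0^2 = sigma^2 / kappa_0."""
    n = len(y)
    ybar = sum(y) / n
    theta_n = (kappa0 * theta0 + n * ybar) / (kappa0 + n)
    sigma2_n = sigma2 / (kappa0 + n)
    return theta_n, sigma2_n

def update_precision_form(theta0, sigma2_0, sigma2, y):
    """Same posterior, written as a precision-weighted average."""
    n = len(y)
    ybar = sum(y) / n
    precision = 1.0 / sigma2_0 + n / sigma2
    theta_n = (theta0 / sigma2_0 + n * ybar / sigma2) / precision
    return theta_n, 1.0 / precision

# Hypothetical inputs: known variance 4, prior variance 2, so kappa_0 = 2.
y = [1.0, 2.0, 3.0, 2.0]
kappa_result = update_kappa_form(0.0, 2.0, 4.0, y)
precision_result = update_precision_form(0.0, 2.0, 4.0, y)
print(kappa_result, precision_result)
```

Both forms return the same posterior, with the posterior mean pulled from the sample mean (2) toward the prior mean (0).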
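Uninformative prior An uninformative prior for the mean, $\theta$, is the (improper) uniform, $p(\theta \mid \sigma^2) = 1$. Hence, the likelihood
\[
\begin{aligned}
\ell(\theta \mid y, \sigma) &\propto \prod_{i=1}^{n} \exp\left[-\frac{1}{2} \frac{(y_i - \theta)^2}{\sigma^2}\right] \\
&\propto \exp\left[-\frac{1}{2\sigma^2} \left\{ \sum_{i=1}^{n} y_i^2 - 2n\bar{y}\theta + n\theta^2 \right\}\right] \\
&\propto \exp\left[-\frac{1}{2\sigma^2} \left\{ \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 + n(\theta - \bar{y})^2 \right\}\right] \\
&\propto \exp\left[-\frac{1}{2\sigma^2}\, n(\theta - \bar{y})^2\right]
\end{aligned}
\]
determines the posterior
\[
p(\theta \mid \sigma^2, y) \propto \exp\left[-\frac{n(\theta - \bar{y})^2}{2\sigma^2}\right]
\]
which is the kernel for a Gaussian or $N\left(\theta \mid \sigma^2, y; \bar{y}, \frac{\sigma^2}{n}\right)$, the classical result.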

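4.1.3 Gaussian (known mean, unknown variance)
For a sample of $n$ exchangeable draws with known mean, $\mu$, and unknown variance, $\theta$, a Gaussian or normal likelihood
\[
\ell(\theta \mid y, \mu) \propto \prod_{i=1}^{n} \theta^{-\frac{1}{2}} \exp\left[-\frac{1}{2} \frac{(y_i - \mu)^2}{\theta}\right]
\]
combines with an inverted-gamma($a$, $b$) prior
\[
p(\theta; a, b) \propto \theta^{-(a + 1)} \exp\left[-\frac{b}{\theta}\right]
\]
to yield an inverted-gamma$\left(\frac{n + 2a}{2},\, b + \frac{1}{2}t\right)$ posterior distribution where
\[
t = \sum_{i=1}^{n} (y_i - \mu)^2
\]
Alternatively and conveniently (but equivalently), we could parameterize the prior as an inverted-chi square($\nu_0$, $\sigma_0^2$)$^{8}$
\[
p(\theta; \nu_0, \sigma_0^2) \propto \theta^{-\left(\frac{\nu_0}{2} + 1\right)} \exp\left[-\frac{\nu_0 \sigma_0^2}{2\theta}\right]
\]
and combine with the above likelihood to yield
\[
p(\theta \mid y) \propto \theta^{-\left(\frac{n + \nu_0}{2} + 1\right)} \exp\left[-\frac{1}{2\theta} \left(\nu_0 \sigma_0^2 + t\right)\right]
\]
an inverted chi-square$\left(\nu_0 + n,\, \frac{\nu_0 \sigma_0^2 + t}{\nu_0 + n}\right)$.
Uninformative prior An uninformative prior for scale is
\[
p(\theta) \propto \theta^{-1}
\]
Hence, the posterior distribution for scale is
\[
p(\theta \mid y) \propto \theta^{-\left(\frac{n}{2} + 1\right)} \exp\left[-\frac{t}{2\theta}\right]
\]
which is the kernel of an inverted-chi square$\left(\theta; n, \frac{t}{n}\right)$.

$^{8}$ $\frac{\nu_0 \sigma_0^2}{X}$ is a scaled, inverted-chi square($\nu_0$, $\sigma_0^2$) with scale $\sigma_0^2$ where $X$ is a chi square($\nu_0$) random variable.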
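The equivalence of the two parameterizations can be checked on a small example. The sketch below uses hypothetical prior settings and data; it relies on the correspondence $a = \nu/2$, $b = \nu\sigma^2/2$, which follows from comparing the two prior kernels above.

```python
def inv_chi2_update(nu0, sigma2_0, mu, y):
    """Posterior (nu_n, scale_n) for the scaled inverted-chi square prior."""
    n = len(y)
    t = sum((yi - mu) ** 2 for yi in y)
    nu_n = nu0 + n
    scale_n = (nu0 * sigma2_0 + t) / nu_n
    return nu_n, scale_n

def inv_gamma_update(a, b, mu, y):
    """Posterior (a_n, b_n) for the inverted-gamma(a, b) prior."""
    n = len(y)
    t = sum((yi - mu) ** 2 for yi in y)
    return a + n / 2.0, b + t / 2.0

def as_inverted_gamma(nu, scale):
    """Map inverted-chi square(nu, scale) to inverted-gamma(a, b)."""
    return nu / 2.0, nu * scale / 2.0

y = [1.0, -1.0, 2.0, -2.0]          # hypothetical draws with known mean 0
nu_n, scale_n = inv_chi2_update(4, 2.0, 0.0, y)
a0, b0 = as_inverted_gamma(4, 2.0)
print((nu_n, scale_n), as_inverted_gamma(nu_n, scale_n), inv_gamma_update(a0, b0, 0.0, y))
```

Updating in either parameterization and converting afterwards gives the same posterior.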
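4.1.4 Gaussian (unknown mean, unknown variance)
For a sample of $n$ exchangeable draws, a normal likelihood with unknown mean, $\theta$, and unknown (but constant) variance, $\sigma^2$, is
\[
\ell(\theta, \sigma^2 \mid y) \propto \prod_{i=1}^{n} \sigma^{-1} \exp\left[-\frac{1}{2} \frac{(y_i - \theta)^2}{\sigma^2}\right]
\]
Expanding and rewriting the likelihood gives
\[
\ell(\theta, \sigma^2 \mid y) \propto \sigma^{-n} \exp\left[-\sum_{i=1}^{n} \frac{y_i^2 - 2y_i\theta + \theta^2}{2\sigma^2}\right]
\]
Adding and subtracting $\sum_{i=1}^{n} 2y_i\bar{y} = 2n\bar{y}^2$, we write
\[
\ell(\theta, \sigma^2 \mid y) \propto \left(\sigma^2\right)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left\{ \left(y_i^2 - 2y_i\bar{y} + \bar{y}^2\right) + \left(\bar{y}^2 - 2\bar{y}\theta + \theta^2\right) \right\}\right]
\]
or
\[
\ell(\theta, \sigma^2 \mid y) \propto \left(\sigma^2\right)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left\{ (y_i - \bar{y})^2 + (\bar{y} - \theta)^2 \right\}\right]
\]
which can be rewritten as
\[
\ell(\theta, \sigma^2 \mid y) \propto \left(\sigma^2\right)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2} \left\{ (n - 1)s^2 + n(\bar{y} - \theta)^2 \right\}\right]
\]
where $s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2$. The above likelihood combines with a Gaussian or normal($\theta \mid \sigma^2$; $\theta_0$, $\sigma^2/\kappa_0$) $\times$ inverted-chi square($\sigma^2$; $\nu_0$, $\sigma_0^2$) prior$^{9}$
\[
\begin{aligned}
p(\theta \mid \sigma^2; \theta_0, \sigma^2/\kappa_0) \times p(\sigma^2; \nu_0, \sigma_0^2)
&\propto \sigma^{-1} \exp\left[-\frac{\kappa_0 (\theta - \theta_0)^2}{2\sigma^2}\right] \times \left(\sigma^2\right)^{-(\nu_0/2 + 1)} \exp\left[-\frac{\nu_0 \sigma_0^2}{2\sigma^2}\right] \\
&\propto \left(\sigma^2\right)^{-\frac{\nu_0 + 3}{2}} \exp\left[-\frac{\nu_0 \sigma_0^2 + \kappa_0 (\theta - \theta_0)^2}{2\sigma^2}\right]
\end{aligned}
\]
to yield a normal($\theta \mid \sigma^2$; $\theta_n$, $\sigma^2/\kappa_n$) $\times$ inverted-chi square($\sigma^2$; $\nu_n$, $\sigma_n^2$) joint posterior distribution$^{10}$ where
\[
\kappa_n = \kappa_0 + n
\]
\[
\nu_n = \nu_0 + n
\]
\[
\nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + (n - 1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^2
\]

$^{9}$ The prior for the mean, $\theta$, is conditional on the scale of the data, $\sigma^2$.
$^{10}$ The product of normal or Gaussian kernels produces a Gaussian kernel.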
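The posterior hyperparameters are simple functions of the sample summaries. Here is a minimal sketch with hypothetical prior settings and data; it uses $\theta_n = (\kappa_0\theta_0 + n\bar{y})/(\kappa_0 + n)$ for the posterior location, and also checks the completing-the-square identity $\kappa_0(\theta - \theta_0)^2 + n(\theta - \bar{y})^2 = \kappa_n(\theta - \theta_n)^2 + \frac{\kappa_0 n}{\kappa_0 + n}(\theta_0 - \bar{y})^2$ at one arbitrary value of $\theta$.

```python
def normal_inv_chi2_update(theta0, kappa0, nu0, sigma2_0, y):
    """Posterior (theta_n, kappa_n, nu_n, sigma2_n) for the conjugate prior."""
    n = len(y)
    ybar = sum(y) / n
    ss = sum((yi - ybar) ** 2 for yi in y)            # equals (n - 1) s^2
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    theta_n = (kappa0 * theta0 + n * ybar) / kappa_n
    nu_sigma2_n = (nu0 * sigma2_0 + ss
                   + kappa0 * n / kappa_n * (theta0 - ybar) ** 2)
    return theta_n, kappa_n, nu_n, nu_sigma2_n / nu_n

# Hypothetical prior settings and data.
theta0, kappa0, nu0, sigma2_0 = 0.0, 1.0, 2.0, 1.0
y = [1.0, 2.0, 3.0]
theta_n, kappa_n, nu_n, sigma2_n = normal_inv_chi2_update(theta0, kappa0, nu0, sigma2_0, y)

# Check the completing-the-square identity at an arbitrary theta.
theta, n, ybar = 0.3, len(y), sum(y) / len(y)
lhs = kappa0 * (theta - theta0) ** 2 + n * (theta - ybar) ** 2
rhs = kappa_n * (theta - theta_n) ** 2 + kappa0 * n / kappa_n * (theta0 - ybar) ** 2
print(theta_n, kappa_n, nu_n, sigma2_n, lhs, rhs)
```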
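That is, the joint posterior is
\[
p(\theta, \sigma^2 \mid y; \theta_0, \sigma^2/\kappa_0, \nu_0, \sigma_0^2) \propto \left(\sigma^2\right)^{-\frac{n + \nu_0 + 3}{2}} \exp\left[-\frac{1}{2\sigma^2} \left\{ \nu_0 \sigma_0^2 + (n - 1)s^2 + \kappa_0 (\theta - \theta_0)^2 + n(\theta - \bar{y})^2 \right\}\right]
\]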
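Completing the square The expression for the joint posterior is written by completing the square. Completing the weighted square for $\theta$ centered around
\[
\theta_n = \frac{1}{\kappa_0 + n} (\kappa_0 \theta_0 + n\bar{y})
\]
where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ gives
\[
\begin{aligned}
(\kappa_0 + n)(\theta - \theta_n)^2 &= (\kappa_0 + n)\theta^2 - 2(\kappa_0 + n)\theta\theta_n + (\kappa_0 + n)\theta_n^2 \\
&= (\kappa_0 + n)\theta^2 - 2\theta(\kappa_0 \theta_0 + n\bar{y}) + (\kappa_0 + n)\theta_n^2
\end{aligned}
\]
While expanding the exponent includes the square plus additional terms as follows
\[
\begin{aligned}
\kappa_0 (\theta - \theta_0)^2 + n(\theta - \bar{y})^2 &= \kappa_0 \left(\theta^2 - 2\theta\theta_0 + \theta_0^2\right) + n\left(\theta^2 - 2\theta\bar{y} + \bar{y}^2\right) \\
&= (\kappa_0 + n)\theta^2 - 2\theta(\kappa_0 \theta_0 + n\bar{y}) + \kappa_0 \theta_0^2 + n\bar{y}^2
\end{aligned}
\]
Add and subtract $(\kappa_0 + n)\theta_n^2$ and simplify.
\[
\begin{aligned}
\kappa_0 (\theta - \theta_0)^2 + n(\theta - \bar{y})^2 &= (\kappa_0 + n)\theta^2 - 2(\kappa_0 + n)\theta\theta_n + (\kappa_0 + n)\theta_n^2 \\
&\quad - (\kappa_0 + n)\theta_n^2 + \kappa_0 \theta_0^2 + n\bar{y}^2 \\
&= (\kappa_0 + n)(\theta - \theta_n)^2 \\
&\quad + \frac{1}{\kappa_0 + n} \left\{ (\kappa_0 + n)\left(\kappa_0 \theta_0^2 + n\bar{y}^2\right) - (\kappa_0 \theta_0 + n\bar{y})^2 \right\}
\end{aligned}
\]
Expand and simplify the last term.
\[
\kappa_0 (\theta - \theta_0)^2 + n(\theta - \bar{y})^2 = (\kappa_0 + n)(\theta - \theta_n)^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^2
\]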
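Now, the joint posterior can be rewritten as
\[
p(\theta, \sigma^2 \mid y; \theta_0, \sigma^2/\kappa_0, \nu_0, \sigma_0^2) \propto \left(\sigma^2\right)^{-\frac{n + \nu_0 + 3}{2}} \exp\left[-\frac{1}{2\sigma^2} \left\{ \nu_0 \sigma_0^2 + (n - 1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^2 + (\kappa_0 + n)(\theta - \theta_n)^2 \right\}\right]
\]
or
\[
p(\theta, \sigma^2 \mid y; \theta_0, \sigma^2/\kappa_0, \nu_0, \sigma_0^2) \propto \left(\sigma^2\right)^{-\left(\frac{n + \nu_0}{2} + 1\right)} \exp\left[-\frac{1}{2\sigma^2} \nu_n \sigma_n^2\right] \times \sigma^{-1} \exp\left[-\frac{1}{2\sigma^2} (\kappa_0 + n)(\theta - \theta_n)^2\right]
\]
Hence, the conditional posterior distribution for the mean, $\theta$, given $\sigma^2$ is Gaussian or normal$\left(\theta \mid \sigma^2; \theta_n, \frac{\sigma^2}{\kappa_0 + n}\right)$.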
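Marginal posterior distributions We’re often interested in the marginal posterior distributions which are derived by integrating out the other parameter from the joint posterior. The marginal posterior for the mean, $\theta$, on integrating out $\sigma^2$ is a noncentral, scaled-Student t$\left(\theta; \theta_n, \frac{\sigma_n^2}{\kappa_n}, \nu_n\right)$$^{11}$ for the mean
\[
p(\theta; \theta_n, \sigma_n^2, \kappa_n, \nu_n) \propto \left[\nu_n + \frac{\kappa_n (\theta - \theta_n)^2}{\sigma_n^2}\right]^{-\frac{\nu_n + 1}{2}}
\]
or
\[
p\left(\theta; \theta_n, \frac{\sigma_n^2}{\kappa_n}, \nu_n\right) \propto \left[1 + \frac{\kappa_n (\theta - \theta_n)^2}{\nu_n \sigma_n^2}\right]^{-\frac{\nu_n + 1}{2}}
\]
and the marginal posterior for the variance, $\sigma^2$, is an inverted-chi square($\sigma^2$; $\nu_n$, $\sigma_n^2$) on integrating out $\theta$.
\[
p(\sigma^2; \nu_n, \sigma_n^2) \propto \left(\sigma^2\right)^{-(\nu_n/2 + 1)} \exp\left[-\frac{\nu_n \sigma_n^2}{2\sigma^2}\right]
\]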
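Derivation of the marginal posterior for the mean, $\theta$, is as follows. Let $z = \frac{A}{2\sigma^2}$ where
\[
\begin{aligned}
A &= \nu_0 \sigma_0^2 + (n - 1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^2 + (\kappa_0 + n)(\theta - \theta_n)^2 \\
&= \nu_n \sigma_n^2 + (\kappa_0 + n)(\theta - \theta_n)^2
\end{aligned}
\]
The marginal posterior for the mean, $\theta$, integrates out $\sigma^2$ from the joint posterior
\[
\begin{aligned}
p(\theta \mid y) &= \int_0^{\infty} p(\theta, \sigma^2 \mid y)\, d\sigma^2 \\
&\propto \int_0^{\infty} \left(\sigma^2\right)^{-\frac{n + \nu_0 + 3}{2}} \exp\left[-\frac{A}{2\sigma^2}\right] d\sigma^2
\end{aligned}
\]
Utilizing $\sigma^2 = \frac{A}{2z}$ and $dz = -\frac{2z^2}{A}\, d\sigma^2$ or $d\sigma^2 = -\frac{A}{2z^2}\, dz$,
\[
\begin{aligned}
p(\theta \mid y) &\propto \int_0^{\infty} \left(\frac{A}{2z}\right)^{-\frac{n + \nu_0 + 3}{2}} \frac{A}{2z^2} \exp[-z]\, dz \\
&\propto \int_0^{\infty} \left(\frac{A}{2z}\right)^{-\frac{n + \nu_0 + 1}{2}} z^{-1} \exp[-z]\, dz \\
&\propto A^{-\frac{n + \nu_0 + 1}{2}} \int_0^{\infty} z^{\frac{n + \nu_0 + 1}{2} - 1} \exp[-z]\, dz
\end{aligned}
\]
The integral $\int_0^{\infty} z^{\frac{n + \nu_0 + 1}{2} - 1} \exp[-z]\, dz$ is a constant since it is the kernel of a gamma density and therefore can be ignored when deriving the kernel of the marginal posterior for the mean
\[
\begin{aligned}
p(\theta \mid y) &\propto A^{-\frac{n + \nu_0 + 1}{2}} \\
&\propto \left[\nu_n \sigma_n^2 + (\kappa_0 + n)(\theta - \theta_n)^2\right]^{-\frac{n + \nu_0 + 1}{2}} \\
&\propto \left[1 + \frac{(\kappa_0 + n)(\theta - \theta_n)^2}{\nu_n \sigma_n^2}\right]^{-\frac{n + \nu_0 + 1}{2}}
\end{aligned}
\]
which is the kernel for a noncentral, scaled Student t$\left(\theta; \theta_n, \frac{\sigma_n^2}{\kappa_0 + n}, n + \nu_0\right)$.

$^{11}$ The noncentral, scaled-Student t$\left(\theta; \theta_n, \sigma_n^2/\kappa_n, \nu_n\right)$ implies $\frac{\theta - \theta_n}{\sigma_n/\sqrt{\kappa_n}}$ has a standard Student-t($\nu_n$) distribution $p(\theta \mid y) \propto \left[1 + \frac{1}{\nu_n}\left(\frac{\theta - \theta_n}{\sigma_n/\sqrt{\kappa_n}}\right)^2\right]^{-\frac{\nu_n + 1}{2}}$.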
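Derivation of the marginal posterior for $\sigma^2$ is somewhat simpler. Write the joint posterior in terms of the conditional posterior for the mean multiplied by the marginal posterior for $\sigma^2$.
\[
p(\theta, \sigma^2 \mid y) = p(\theta \mid \sigma^2, y)\, p(\sigma^2 \mid y)
\]
Marginalization of $\sigma^2$ is achieved by integrating out $\theta$.
\[
p(\sigma^2 \mid y) = \int_{-\infty}^{\infty} p(\sigma^2 \mid y)\, p(\theta \mid \sigma^2, y)\, d\theta
\]
Since only the conditional posterior involves $\theta$ the marginal posterior for $\sigma^2$ is immediate.
\[
\begin{aligned}
p(\theta, \sigma^2 \mid y) &\propto \left(\sigma^2\right)^{-\frac{n + \nu_0 + 3}{2}} \exp\left[-\frac{A}{2\sigma^2}\right] \\
&\propto \left(\sigma^2\right)^{-\frac{n + \nu_0 + 2}{2}} \exp\left[-\frac{\nu_n \sigma_n^2}{2\sigma^2}\right] \times \sigma^{-1} \exp\left[-\frac{(\kappa_0 + n)(\theta - \theta_n)^2}{2\sigma^2}\right]
\end{aligned}
\]
Integrating out $\theta$ yields
\[
\begin{aligned}
p(\sigma^2 \mid y) &\propto \left(\sigma^2\right)^{-\frac{n + \nu_0 + 2}{2}} \exp\left[-\frac{\nu_n \sigma_n^2}{2\sigma^2}\right] \times \int_{-\infty}^{\infty} \sigma^{-1} \exp\left[-\frac{(\kappa_0 + n)(\theta - \theta_n)^2}{2\sigma^2}\right] d\theta \\
&\propto \left(\sigma^2\right)^{-\left(\frac{\nu_n}{2} + 1\right)} \exp\left[-\frac{\nu_n \sigma_n^2}{2\sigma^2}\right]
\end{aligned}
\]
which we recognize as the kernel of an inverted-chi square($\sigma^2$; $\nu_n$, $\sigma_n^2$).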
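Uninformative priors The case of uninformative priors is relatively straightforward. Since priors convey no information, the prior for the mean is uniform (proportional to a constant, $\kappa_0 \rightarrow 0$) and an uninformative prior for $\sigma^2$ has $\nu_0 \rightarrow 0$ degrees of freedom so that the joint prior is
\[
p(\theta, \sigma^2) \propto \left(\sigma^2\right)^{-1}
\]
The joint posterior is
\[
\begin{aligned}
p(\theta, \sigma^2 \mid y) &\propto \left(\sigma^2\right)^{-(n/2 + 1)} \exp\left[-\frac{1}{2\sigma^2} \left\{ (n - 1)s^2 + n(\theta - \bar{y})^2 \right\}\right] \\
&\propto \left(\sigma^2\right)^{-[(n - 1)/2 + 1]} \exp\left[-\frac{\sigma_n^2}{2\sigma^2}\right] \times \sigma^{-1} \exp\left[-\frac{n}{2\sigma^2} (\theta - \bar{y})^2\right]
\end{aligned}
\]
where
\[
\sigma_n^2 = (n - 1)s^2
\]
The conditional posterior for $\theta$ given $\sigma^2$ is Gaussian$\left(\bar{y}, \frac{\sigma^2}{n}\right)$. And, the marginal posterior for $\theta$ is noncentral, scaled Student t$\left(\bar{y}, \frac{s^2}{n}, n - 1\right)$, the classical estimator.
Derivation of the marginal posterior proceeds as above. The joint posterior is
\[
p(\theta, \sigma^2 \mid y) \propto \left(\sigma^2\right)^{-(n/2 + 1)} \exp\left[-\frac{1}{2\sigma^2} \left\{ (n - 1)s^2 + n(\theta - \bar{y})^2 \right\}\right]
\]
Let $z = \frac{A}{2\sigma^2}$ where $A = (n - 1)s^2 + n(\theta - \bar{y})^2$. Now integrate $\sigma^2$ out of the joint posterior following the transformation of variables.
\[
\begin{aligned}
p(\theta \mid y) &\propto \int_0^{\infty} \left(\sigma^2\right)^{-(n/2 + 1)} \exp\left[-\frac{A}{2\sigma^2}\right] d\sigma^2 \\
&\propto A^{-n/2} \int_0^{\infty} z^{n/2 - 1} e^{-z}\, dz
\end{aligned}
\]
As before, the integral involves the kernel of a gamma density and therefore is a constant which can be ignored. Hence,
\[
\begin{aligned}
p(\theta \mid y) &\propto A^{-n/2} \\
&\propto \left[(n - 1)s^2 + n(\theta - \bar{y})^2\right]^{-\frac{n}{2}} \\
&\propto \left[1 + \frac{n(\theta - \bar{y})^2}{(n - 1)s^2}\right]^{-\frac{n - 1 + 1}{2}}
\end{aligned}
\]
which we recognize as the kernel of a noncentral, scaled Student t$\left(\theta; \bar{y}, \frac{s^2}{n}, n - 1\right)$.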
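The marginalization step can be checked numerically: integrating the joint posterior kernel over $\sigma^2$ should give something proportional to the Student t kernel, with the same constant at every $\theta$. The sketch below assumes hypothetical sample summaries and uses a trapezoid rule after the substitution $\sigma^2 = e^u$ (an implementation choice of this sketch, not the substitution used in the derivation above).

```python
import math

def marginal_kernel_numeric(theta, ybar, s2, n, lo=-30.0, hi=30.0, steps=6000):
    """Integrate (sigma^2)^-(n/2+1) exp[-A / (2 sigma^2)] over sigma^2 > 0.

    With sigma^2 = exp(u) the integrand becomes exp(-n u / 2 - A exp(-u) / 2).
    """
    a = (n - 1) * s2 + n * (theta - ybar) ** 2
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        u = lo + i * h
        f = math.exp(-0.5 * n * u - 0.5 * a * math.exp(-u))
        total += f if 0 < i < steps else 0.5 * f   # trapezoid end weights
    return total * h

def student_t_kernel(theta, ybar, s2, n):
    """[1 + n (theta - ybar)^2 / ((n - 1) s^2)]^(-n/2)."""
    return (1.0 + n * (theta - ybar) ** 2 / ((n - 1) * s2)) ** (-n / 2.0)

ybar, s2, n = 1.0, 2.0, 5            # hypothetical sample summaries
thetas = (-1.0, 0.0, 1.0, 2.5)
ratios = [marginal_kernel_numeric(t, ybar, s2, n) / student_t_kernel(t, ybar, s2, n)
          for t in thetas]
# Analytic constant: Gamma(n/2) * 2^(n/2) * ((n - 1) s^2)^(-n/2).
constant = math.gamma(n / 2.0) * (2.0 / ((n - 1) * s2)) ** (n / 2.0)
print(ratios, constant)
```

The ratio is the same at every $\theta$ and matches the gamma-integral constant, which is why that constant can be dropped from the kernel.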
4.1.5 Multivariate Gaussian (unknown mean, known variance)
More than one random variable (the multivariate case) with joint Gaussian or normal likelihood is analogous to the univariate case with Gaussian conjugate prior. Consider a vector of $k$ random variables (the sample is comprised of $n$ draws for each random variable) with unknown mean, $\theta$, and known variance, $\Sigma$. For $n$ exchangeable draws of the random vector (containing each of the $k$ random variables), the multivariate Gaussian likelihood is
\[
\ell(\theta \mid y, \Sigma) \propto \prod_{i=1}^{n} \exp\left[-\frac{1}{2} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta)\right]
\]
where superscript $T$ refers to transpose, $y_i$ and $\theta$ are $k$ length vectors and $\Sigma$ is a $k \times k$ variance-covariance matrix. A Gaussian prior for the mean vector, $\theta$, with prior mean, $\theta_0$, and prior variance, $\Sigma_0$, is
\[
p(\theta \mid \Sigma; \theta_0, \Sigma_0) \propto \exp\left[-\frac{1}{2} (\theta - \theta_0)^T \Sigma_0^{-1} (\theta - \theta_0)\right]
\]
The product of the likelihood and prior yields the kernel of a multivariate posterior Gaussian distribution for the mean
\[
p(\theta \mid \Sigma, y; \theta_0, \Sigma_0) \propto \exp\left[-\frac{1}{2} (\theta - \theta_0)^T \Sigma_0^{-1} (\theta - \theta_0)\right] \times \exp\left[\sum_{i=1}^{n} -\frac{1}{2} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta)\right]
\]
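Completing the square Expanding terms in the exponent leads to
\[
\begin{aligned}
&(\theta - \theta_0)^T \Sigma_0^{-1} (\theta - \theta_0) + \sum_{i=1}^{n} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta) \\
&\quad = \theta^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \theta - 2\theta^T \left(\Sigma_0^{-1}\theta_0 + n\Sigma^{-1}\bar{y}\right) + \theta_0^T \Sigma_0^{-1} \theta_0 + \sum_{i=1}^{n} y_i^T \Sigma^{-1} y_i
\end{aligned}
\]
where $\bar{y}$ is the sample average. While completing the (weighted) square centered around
\[
\bar{\theta} = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1} \left(\Sigma_0^{-1}\theta_0 + n\Sigma^{-1}\bar{y}\right)
\]
leads to
\[
\begin{aligned}
(\theta - \bar{\theta})^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) (\theta - \bar{\theta}) &= \theta^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \theta - 2\theta^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \bar{\theta} \\
&\quad + \bar{\theta}^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \bar{\theta}
\end{aligned}
\]
Thus, adding and subtracting $\bar{\theta}^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \bar{\theta}$ in the exponent completes the square (with three extra terms).
\[
\begin{aligned}
&(\theta - \theta_0)^T \Sigma_0^{-1} (\theta - \theta_0) + \sum_{i=1}^{n} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta) \\
&\quad = \theta^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \theta - 2\theta^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \bar{\theta} + \bar{\theta}^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \bar{\theta} \\
&\qquad - \bar{\theta}^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \bar{\theta} + \theta_0^T \Sigma_0^{-1} \theta_0 + \sum_{i=1}^{n} y_i^T \Sigma^{-1} y_i \\
&\quad = (\theta - \bar{\theta})^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) (\theta - \bar{\theta}) \\
&\qquad - \bar{\theta}^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \bar{\theta} + \theta_0^T \Sigma_0^{-1} \theta_0 + \sum_{i=1}^{n} y_i^T \Sigma^{-1} y_i
\end{aligned}
\]
Dropping constants (the last three extra terms unrelated to $\theta$) gives
\[
p(\theta \mid \Sigma, y; \theta_0, \Sigma_0) \propto \exp\left[-\frac{1}{2} (\theta - \bar{\theta})^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) (\theta - \bar{\theta})\right]
\]
Hence, the posterior for the mean $\theta$ has expected value $\bar{\theta}$ and variance
\[
Var[\theta \mid y, \Sigma, \theta_0, \Sigma_0] = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1}
\]
As in the univariate case, the data and prior beliefs are weighted by their relative precisions.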
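A two-dimensional case makes the precision weighting concrete. This is a minimal sketch, assuming $k = 2$ so that a hand-written 2 by 2 inverse suffices; the helper functions and all numbers are hypothetical. Choosing $\Sigma_0 = \Sigma$ gives a case that can be checked by hand, since the posterior mean is then $(\theta_0 + n\bar{y})/(1 + n)$ and the posterior variance is $\Sigma/(1 + n)$.

```python
def inv2(m):
    """Inverse of a 2x2 matrix stored as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def add(m1, m2):
    return [[m1[i][j] + m2[i][j] for j in range(2)] for i in range(2)]

def scale(m, c):
    return [[c * x for x in row] for row in m]

def mvn_posterior(theta0, sigma0, sigma, ybar, n):
    """Posterior mean and variance for the mean vector, known Sigma."""
    prior_precision = inv2(sigma0)
    data_precision = scale(inv2(sigma), n)
    post_var = inv2(add(prior_precision, data_precision))
    weighted = [u + v for u, v in zip(matvec(prior_precision, theta0),
                                      matvec(data_precision, ybar))]
    return matvec(post_var, weighted), post_var

sigma = [[2.0, 0.5], [0.5, 1.0]]      # hypothetical known variance
post_mean, post_var = mvn_posterior([0.0, 0.0], sigma, sigma, [1.0, 2.0], 4)
print(post_mean, post_var)
```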
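Uninformative priors Uninformative priors for $\theta$ are proportional to a constant. Hence, the likelihood determines the posterior
\[
\ell(\theta \mid \Sigma, y) \propto \exp\left[-\frac{1}{2} \sum_{i=1}^{n} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta)\right]
\]
Expanding the exponent and adding and subtracting $n\bar{y}^T \Sigma^{-1} \bar{y}$ (to complete the square) gives
\[
\begin{aligned}
\sum_{i=1}^{n} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta) &= \sum_{i=1}^{n} y_i^T \Sigma^{-1} y_i - 2n\theta^T \Sigma^{-1} \bar{y} + n\theta^T \Sigma^{-1} \theta \\
&\quad + n\bar{y}^T \Sigma^{-1} \bar{y} - n\bar{y}^T \Sigma^{-1} \bar{y} \\
&= n(\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta) \\
&\quad + \sum_{i=1}^{n} y_i^T \Sigma^{-1} y_i - n\bar{y}^T \Sigma^{-1} \bar{y}
\end{aligned}
\]
The latter two terms are constants, hence, the posterior kernel is
\[
p(\theta \mid \Sigma, y) \propto \exp\left[-\frac{n}{2} (\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta)\right]
\]
which is Gaussian or $N\left(\theta; \bar{y}, n^{-1}\Sigma\right)$, the classical result.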
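4.1.6 Multivariate Gaussian (unknown mean, unknown variance)
When both the mean, $\theta$, and variance, $\Sigma$, are unknown, the multivariate Gaussian case remains analogous to the univariate case. Specifically, a Gaussian likelihood
\[
\begin{aligned}
\ell(\theta, \Sigma \mid y) &\propto \prod_{i=1}^{n} |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta)\right] \\
&\propto |\Sigma|^{-\frac{n}{2}} \exp\left[-\frac{1}{2} \left\{ \sum_{i=1}^{n} (y_i - \bar{y})^T \Sigma^{-1} (y_i - \bar{y}) + n(\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta) \right\}\right] \\
&\propto |\Sigma|^{-\frac{n}{2}} \exp\left[-\frac{1}{2} \left\{ (n - 1)s^2 + n(\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta) \right\}\right]
\end{aligned}
\]
where $s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^T \Sigma^{-1} (y_i - \bar{y})$ combines with a Gaussian-inverted Wishart prior
\[
\begin{aligned}
p\left(\theta \mid \Sigma; \theta_0, \frac{\Sigma}{\kappa_0}\right) \times p\left(\Sigma^{-1}; \nu, \Lambda\right) &\propto |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{\kappa_0}{2} (\theta - \theta_0)^T \Sigma^{-1} (\theta - \theta_0)\right] \\
&\quad \times |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu + k + 1}{2}} \exp\left[-\frac{tr\left(\Lambda \Sigma^{-1}\right)}{2}\right]
\end{aligned}
\]
where $tr(\cdot)$ is the trace of the matrix and $\nu$ is degrees of freedom, to produce
\[
\begin{aligned}
p(\theta, \Sigma \mid y) &\propto |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu + n + k + 1}{2}} \exp\left[-\frac{tr\left(\Lambda \Sigma^{-1}\right)}{2}\right] \\
&\quad \times |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} \left\{ (n - 1)s^2 + n(\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta) + \kappa_0 (\theta - \theta_0)^T \Sigma^{-1} (\theta - \theta_0) \right\}\right]
\end{aligned}
\]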
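Completing the square Completing the square involves the matrix analog to the univariate unknown mean and variance case. Consider the exponent (in braces)
\[
\begin{aligned}
&(n - 1)s^2 + n(\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta) + \kappa_0 (\theta - \theta_0)^T \Sigma^{-1} (\theta - \theta_0) \\
&\quad = (n - 1)s^2 + n\bar{y}^T \Sigma^{-1} \bar{y} - 2n\theta^T \Sigma^{-1} \bar{y} + n\theta^T \Sigma^{-1} \theta \\
&\qquad + \kappa_0 \theta^T \Sigma^{-1} \theta - 2\kappa_0 \theta^T \Sigma^{-1} \theta_0 + \kappa_0 \theta_0^T \Sigma^{-1} \theta_0 \\
&\quad = (n - 1)s^2 + (\kappa_0 + n)\theta^T \Sigma^{-1} \theta - 2\theta^T \Sigma^{-1} (\kappa_0 \theta_0 + n\bar{y}) + (\kappa_0 + n)\theta_n^T \Sigma^{-1} \theta_n \\
&\qquad - (\kappa_0 + n)\theta_n^T \Sigma^{-1} \theta_n + \kappa_0 \theta_0^T \Sigma^{-1} \theta_0 + n\bar{y}^T \Sigma^{-1} \bar{y} \\
&\quad = (n - 1)s^2 + (\kappa_0 + n)(\theta - \theta_n)^T \Sigma^{-1} (\theta - \theta_n) \\
&\qquad + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^T \Sigma^{-1} (\theta_0 - \bar{y})
\end{aligned}
\]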
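Hence, the joint posterior can be rewritten as
\[
\begin{aligned}
p(\theta, \Sigma \mid y) &\propto |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu + n + k + 1}{2}} \exp\left[-\frac{tr\left(\Lambda \Sigma^{-1}\right)}{2}\right] \\
&\quad \times |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} \left\{ (\kappa_0 + n)(\theta - \theta_n)^T \Sigma^{-1} (\theta - \theta_n) + (n - 1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^T \Sigma^{-1} (\theta_0 - \bar{y}) \right\}\right] \\
&\propto |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu + n + k + 1}{2}} \exp\left[-\frac{1}{2} \left\{ tr\left(\Lambda \Sigma^{-1}\right) + (n - 1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^T \Sigma^{-1} (\theta_0 - \bar{y}) \right\}\right] \\
&\quad \times |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} (\kappa_0 + n)(\theta - \theta_n)^T \Sigma^{-1} (\theta - \theta_n)\right]
\end{aligned}
\]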
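Inverted-Wishart kernel We wish to identify the exponent with Gaussian by inverted-Wishart kernels where the inverted-Wishart involves the trace of a square, symmetric matrix, call it $\Lambda_n$, multiplied by $\Sigma^{-1}$.
To make this connection we utilize the following general results. Since a quadratic form, say $x^T \Sigma^{-1} x$, is a scalar, it’s equal to its trace,
\[
x^T \Sigma^{-1} x = tr\left(x^T \Sigma^{-1} x\right)
\]
Further, for conformable matrices $A$, $B$ and $C$, $D$,
\[
tr(A) + tr(B) = tr(A + B)
\]
and
\[
tr(CD) = tr(DC)
\]
We immediately have the results
\[
tr\left(x^T x\right) = tr\left(x x^T\right)
\]
and
\[
tr\left(x^T \Sigma^{-1} x\right) = tr\left(\Sigma^{-1} x x^T\right) = tr\left(x x^T \Sigma^{-1}\right)
\]
Therefore, the above joint posterior can be rewritten as a $N\left(\theta; \theta_n, (\kappa_0 + n)^{-1}\Sigma\right)$ $\times$ inverted-Wishart$\left(\Sigma^{-1}; \nu + n, \Lambda_n\right)$
\[
\begin{aligned}
p(\theta, \Sigma \mid y) &\propto |\Lambda_n|^{\frac{\nu + n}{2}} |\Sigma|^{-\frac{\nu + n + k + 1}{2}} \exp\left[-\frac{1}{2} tr\left(\Lambda_n \Sigma^{-1}\right)\right] \\
&\quad \times |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{\kappa_0 + n}{2} (\theta - \theta_n)^T \Sigma^{-1} (\theta - \theta_n)\right]
\end{aligned}
\]
where
\[
\theta_n = \frac{1}{\kappa_0 + n} (\kappa_0 \theta_0 + n\bar{y})
\]
and
\[
\Lambda_n = \Lambda + \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^T + \frac{\kappa_0 n}{\kappa_0 + n} (\bar{y} - \theta_0)(\bar{y} - \theta_0)^T
\]
Now, it’s apparent the conditional posterior for $\theta$ given $\Sigma$ is $N\left(\theta_n, (\kappa_0 + n)^{-1}\Sigma\right)$
\[
p(\theta \mid \Sigma, y) \propto \exp\left[-\frac{\kappa_0 + n}{2} (\theta - \theta_n)^T \Sigma^{-1} (\theta - \theta_n)\right]
\]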
Marginal posterior distributions Integrating out the other parameter gives the marginal posteriors, a multivariate Student t for the mean,
\[
\text{Student } t_k(\theta; \theta_n, \Omega, \nu + n - k + 1)
\]
and an inverted-Wishart for the variance,
\[
\text{I-W}\left(\Sigma^{-1}; \nu + n, \Lambda_n\right)
\]
where
\[
\Omega = (\kappa_0 + n)^{-1} (\nu + n - k + 1)^{-1} \Lambda_n
\]
Marginalization of the mean derives from the following identities (see Box and Tiao [1973], p. 427, 441). Let Z be an $m \times m$ positive definite symmetric matrix consisting of $\frac{1}{2} m (m + 1)$ distinct random variables $z_{ij}$ ($i, j = 1, \ldots, m$; $i \geq j$). And let $q > 0$ and B be an $m \times m$ positive definite symmetric matrix. Then, the distribution of $z_{ij}$,
\[ p(Z) \propto |Z|^{\frac{1}{2} q - 1} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}(ZB) \right\}, \quad Z > 0 \]
is a multivariate generalization of the $\chi^2$ distribution obtained by Wishart [1928]. Integrating out the distinct $z_{ij}$ produces the first identity.
\[ \int_{Z>0} |Z|^{\frac{1}{2} q - 1} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}(ZB) \right\} dZ = |B|^{-\frac{1}{2}(q+m-1)} \, 2^{\frac{1}{2} m (q+m-1)} \, \Gamma_m\left( \frac{q+m-1}{2} \right) \quad \text{(I.1)} \]
where $\Gamma_p(b)$ is the generalized gamma function (Siegel [1935])
\[ \Gamma_p(b) = \left[ \Gamma\left( \tfrac{1}{2} \right) \right]^{\frac{1}{2} p (p-1)} \prod_{\alpha=1}^{p} \Gamma\left( b + \frac{\alpha - p}{2} \right), \quad b > \frac{p-1}{2} \]
and
\[ \Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t} \, dt \]
or for integer n,
\[ \Gamma(n) = (n - 1)! \]
The second identity involves the relationship between determinants that allows us to express the above as a quadratic form. The identity is
\[ |I_k - PQ| = |I_l - QP| \quad \text{(I.2)} \]
for P a $k \times l$ matrix and Q an $l \times k$ matrix.
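Identity (I.2) can be checked numerically, along with the rank-one special case that turns a determinant into a scalar quadratic form. This is a minimal sketch assuming NumPy; the matrices `P`, `Q`, `Lam` and the scalar `c` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)
k, l = 4, 2
P = rng.normal(size=(k, l))
Q = rng.normal(size=(l, k))
lhs = np.linalg.det(np.eye(k) - P @ Q)   # |I_k - PQ|
rhs = np.linalg.det(np.eye(l) - Q @ P)   # |I_l - QP|

# Rank-one case: P = -c * Lam^{-1} d (k x 1), Q = d' (1 x k),
# so |I_k + c Lam^{-1} d d'| = 1 + c d' Lam^{-1} d.
A = rng.normal(size=(k, k))
Lam = A @ A.T + k * np.eye(k)            # symmetric positive definite
d = rng.normal(size=(k, 1))
c = 2.5
det_form = np.linalg.det(np.eye(k) + c * np.linalg.inv(Lam) @ d @ d.T)
scalar_form = 1.0 + c * (d.T @ np.linalg.inv(Lam) @ d).item()

print(lhs, rhs, det_form, scalar_form)
```

The rank-one case, with `d` playing the role of the deviation of the mean from its posterior location, is how the determinant kernel collapses to the familiar Student t form.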

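If we transform the joint posterior to $p\left(\mu, \Sigma^{-1} \mid y\right)$, the above identities can be applied to marginalize the joint posterior. The key to transformation is
\[ p\left(\mu, \Sigma^{-1} \mid y\right) = p(\mu, \Sigma \mid y) \left| \frac{\partial \Sigma}{\partial \Sigma^{-1}} \right| \]
where $\left| \frac{\partial \Sigma}{\partial \Sigma^{-1}} \right|$ is the (absolute value of the) determinant of the Jacobian or
\[ \left| \frac{\partial \Sigma}{\partial \Sigma^{-1}} \right| = \left| \frac{\partial (\sigma_{11}, \sigma_{12}, \ldots, \sigma_{kk})}{\partial (\sigma^{11}, \sigma^{12}, \ldots, \sigma^{kk})} \right| = |\Sigma|^{k+1} \]
with $\sigma_{ij}$ the elements of $\Sigma$ and $\sigma^{ij}$ the elements of $\Sigma^{-1}$. Hence,
\[
\begin{aligned}
p(\mu, \Sigma \mid y) \propto{} & |\Sigma|^{-\frac{\nu+n+k+1}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left( \Lambda_n \Sigma^{-1} \right) \right\} \\
& \times |\Sigma|^{-\frac{1}{2}} \exp\left\{ -\frac{\kappa_0 + n}{2} (\mu - \mu_n)^T \Sigma^{-1} (\mu - \mu_n) \right\} \\
\propto{} & |\Sigma|^{-\frac{\nu+n+k+2}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left( S(\mu) \Sigma^{-1} \right) \right\}
\end{aligned}
\]
where $S(\mu) = \Lambda_n + (\kappa_0 + n) (\mu - \mu_n) (\mu - \mu_n)^T$, can be rewritten
\[
\begin{aligned}
p\left(\mu, \Sigma^{-1} \mid y\right) \propto{} & |\Sigma|^{-\frac{\nu+n+k+2}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left( S(\mu) \Sigma^{-1} \right) \right\} |\Sigma|^{\frac{2k+2}{2}} \\
\propto{} & \left| \Sigma^{-1} \right|^{\frac{\nu+n-k}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left( S(\mu) \Sigma^{-1} \right) \right\}
\end{aligned}
\]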
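Now, applying the first identity yields
\[
\begin{aligned}
\int_{\Sigma^{-1} > 0} p\left(\mu, \Sigma^{-1} \mid y\right) d\Sigma^{-1} \propto{} & |S(\mu)|^{-\frac{1}{2}(\nu+n+1)} \\
\propto{} & \left| \Lambda_n + (\kappa_0 + n) (\mu - \mu_n) (\mu - \mu_n)^T \right|^{-\frac{1}{2}(\nu+n+1)} \\
\propto{} & \left| I + (\kappa_0 + n) \Lambda_n^{-1} (\mu - \mu_n) (\mu - \mu_n)^T \right|^{-\frac{1}{2}(\nu+n+1)}
\end{aligned}
\]
And the second identity gives
\[ p(\mu \mid y) \propto \left[ 1 + (\kappa_0 + n) (\mu - \mu_n)^T \Lambda_n^{-1} (\mu - \mu_n) \right]^{-\frac{1}{2}(\nu+n+1)} \]
We recognize this is the kernel of a multivariate Student t distribution for $\mu$ with $k$ dimensions, location $\mu_n$, scale matrix $(\kappa_0 + n)^{-1} (\nu + n - k + 1)^{-1} \Lambda_n$, and $\nu + n - k + 1$ degrees of freedom.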
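Uninformative priors The joint uninformative prior (with a locally uniform prior for $\mu$) is
\[ p(\mu, \Sigma) \propto |\Sigma|^{-\frac{k+1}{2}} \]
and the joint posterior is
\[
\begin{aligned}
p(\mu, \Sigma \mid y) \propto{} & |\Sigma|^{-\frac{k+1}{2}} |\Sigma|^{-\frac{n}{2}} \exp\left\{ -\tfrac{1}{2} \left[ (n-1) s^2 + n (\bar{y} - \mu)^T \Sigma^{-1} (\bar{y} - \mu) \right] \right\} \\
\propto{} & |\Sigma|^{-\frac{n+k+1}{2}} \exp\left\{ -\tfrac{1}{2} \left[ (n-1) s^2 + n (\bar{y} - \mu)^T \Sigma^{-1} (\bar{y} - \mu) \right] \right\} \\
\propto{} & |\Sigma|^{-\frac{n+k+1}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left( S(\mu) \Sigma^{-1} \right) \right\}
\end{aligned}
\]
where now $S(\mu) = \sum_{i=1}^{n} (\bar{y} - y_i) (\bar{y} - y_i)^T + n (\bar{y} - \mu) (\bar{y} - \mu)^T$. Then, the conditional posterior for $\mu$ given $\Sigma$ is $N\left(\bar{y}, n^{-1} \Sigma\right)$
\[ p(\mu \mid \Sigma, y) \propto \exp\left\{ -\frac{n}{2} (\mu - \bar{y})^T \Sigma^{-1} (\mu - \bar{y}) \right\} \]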
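The marginal posterior for $\mu$ is derived analogously to the above informed conjugate prior case. Rewriting the posterior in terms of $\Sigma^{-1}$ yields
\[
\begin{aligned}
p\left(\mu, \Sigma^{-1} \mid y\right) \propto{} & |\Sigma|^{-\frac{n+k+1}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left( S(\mu) \Sigma^{-1} \right) \right\} |\Sigma|^{\frac{2k+2}{2}} \\
\propto{} & \left| \Sigma^{-1} \right|^{\frac{n-k-1}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left( S(\mu) \Sigma^{-1} \right) \right\}
\end{aligned}
\]
\[
\begin{aligned}
p(\mu \mid y) \propto{} & \int_{\Sigma^{-1} > 0} p\left(\mu, \Sigma^{-1} \mid y\right) d\Sigma^{-1} \\
\propto{} & \int_{\Sigma^{-1} > 0} \left| \Sigma^{-1} \right|^{\frac{n-k-1}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left( S(\mu) \Sigma^{-1} \right) \right\} d\Sigma^{-1}
\end{aligned}
\]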
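The first identity (I.1) produces
\[
\begin{aligned}
p(\mu \mid y) \propto{} & |S(\mu)|^{-\frac{n}{2}} \\
\propto{} & \left| \sum_{i=1}^{n} (\bar{y} - y_i) (\bar{y} - y_i)^T + n (\bar{y} - \mu) (\bar{y} - \mu)^T \right|^{-\frac{n}{2}} \\
\propto{} & \left| I + n \left[ \sum_{i=1}^{n} (\bar{y} - y_i) (\bar{y} - y_i)^T \right]^{-1} (\bar{y} - \mu) (\bar{y} - \mu)^T \right|^{-\frac{n}{2}}
\end{aligned}
\]
The second identity (I.2) identifies the marginal posterior for $\mu$ as (multivariate) Student $t_k\left(\mu; \bar{y}, n^{-1} s^2, n - k\right)$
\[ p(\mu \mid y) \propto \left[ 1 + \frac{n}{(n-k) s^2} (\bar{y} - \mu)^T (\bar{y} - \mu) \right]^{-\frac{n}{2}} \]
where $(n - k) s^2 = \sum_{i=1}^{n} (\bar{y} - y_i)^T (\bar{y} - y_i)$. The marginal posterior for the variance is I-W$\left(\Sigma^{-1}; n, \Lambda_n\right)$ where now $\Lambda_n = \sum_{i=1}^{n} (\bar{y} - y_i) (\bar{y} - y_i)^T$.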
4.1.7
Bayesian linear regression
Linear regression is the starting point for more general data modeling strategies, including nonlinear models. Hence, Bayesian linear regression is foundational. Suppose
the data are generated by
\[ y = X\beta + \varepsilon \]
where X is an $n \times p$ full column rank matrix of (weakly exogenous) regressors and $\varepsilon \sim N\left(0, \sigma^2 I_n\right)$ and $E[\varepsilon \mid X] = 0$. Then, the sample conditional density is $y \mid X, \beta, \sigma^2 \sim N\left(X\beta, \sigma^2 I_n\right)$.
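Known variance If the error variance, $\sigma^2 I_n$, is known and we have informed Gaussian priors for $\beta$ conditional on $\sigma^2$,
\[ p\left(\beta \mid \sigma^2\right) \sim N\left(\beta_0, \sigma^2 V_0\right) \]
where we can think of $V_0 = \left(X_0^T X_0\right)^{-1}$ as if we had a prior sample $(y_0, X_0)$ such that
\[ \beta_0 = \left(X_0^T X_0\right)^{-1} X_0^T y_0 \]
then the conditional posterior for $\beta$ is
\[ p\left(\beta \mid \sigma^2, y, X; \beta_0, V_0\right) \sim N\left(\bar{\beta}, V_\beta\right) \]
where
\[ \bar{\beta} = \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T X_0 \beta_0 + X^T X \hat{\beta}\right) \]
\[ \hat{\beta} = \left(X^T X\right)^{-1} X^T y \]
and
\[ V_\beta = \sigma^2 \left(X_0^T X_0 + X^T X\right)^{-1} \]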
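The variance expression follows from rewriting the estimator
\[
\begin{aligned}
\bar{\beta} ={} & \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T X_0 \beta_0 + X^T X \hat{\beta}\right) \\
={} & \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T X_0 \left(X_0^T X_0\right)^{-1} X_0^T y_0 + X^T X \left(X^T X\right)^{-1} X^T y\right) \\
={} & \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T y_0 + X^T y\right)
\end{aligned}
\]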
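Since the DGP is
\[ y_0 = X_0 \beta + \varepsilon_0, \quad \varepsilon_0 \sim N\left(0, \sigma^2 I_{n_0}\right) \]
\[ y = X\beta + \varepsilon, \quad \varepsilon \sim N\left(0, \sigma^2 I_n\right) \]
then
\[ \bar{\beta} = \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T X_0 \beta + X_0^T \varepsilon_0 + X^T X \beta + X^T \varepsilon\right) \]
The conditional (and by iterated expectations, unconditional) expected value of the estimator is
\[ E\left[\bar{\beta} \mid X, X_0\right] = \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T X_0 + X^T X\right) \beta = \beta \]
Hence,
\[ \bar{\beta} - E\left[\bar{\beta} \mid X, X_0\right] = \bar{\beta} - \beta = \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T \varepsilon_0 + X^T \varepsilon\right) \]
so that, with $\varepsilon_0$ and $\varepsilon$ independent (so the cross terms have zero expectation),
\[
\begin{aligned}
V_\beta \equiv{} & Var\left[\bar{\beta} \mid X, X_0\right] \\
={} & E\left[\left(\bar{\beta} - \beta\right) \left(\bar{\beta} - \beta\right)^T \mid X, X_0\right] \\
={} & E\left[\left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T \varepsilon_0 + X^T \varepsilon\right) \left(X_0^T \varepsilon_0 + X^T \varepsilon\right)^T \left(X_0^T X_0 + X^T X\right)^{-1} \mid X, X_0\right] \\
={} & E\left[\left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T \varepsilon_0 \varepsilon_0^T X_0 + X^T \varepsilon \varepsilon_0^T X_0 + X_0^T \varepsilon_0 \varepsilon^T X + X^T \varepsilon \varepsilon^T X\right) \left(X_0^T X_0 + X^T X\right)^{-1} \mid X, X_0\right] \\
={} & \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T \sigma^2 I X_0 + X^T \sigma^2 I X\right) \left(X_0^T X_0 + X^T X\right)^{-1} \\
={} & \sigma^2 \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T X_0 + X^T X\right) \left(X_0^T X_0 + X^T X\right)^{-1} \\
={} & \sigma^2 \left(X_0^T X_0 + X^T X\right)^{-1}
\end{aligned}
\]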
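Now, let's backtrack and derive the conditional posterior as the product of conditional priors and the likelihood function. The likelihood function for known variance is
\[ \ell\left(\beta \mid \sigma^2, y, X\right) \propto \exp\left\{ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \right\} \]
Conditional Gaussian priors are
\[ p\left(\beta \mid \sigma^2\right) \propto \exp\left\{ -\frac{1}{2\sigma^2} (\beta - \beta_0)^T V_0^{-1} (\beta - \beta_0) \right\} \]
The conditional posterior is the product of the prior and likelihood
\[
\begin{aligned}
p\left(\beta \mid \sigma^2, y, X\right) \propto{} & \exp\left\{ -\frac{1}{2\sigma^2} \left[ (y - X\beta)^T (y - X\beta) + (\beta - \beta_0)^T V_0^{-1} (\beta - \beta_0) \right] \right\} \\
={} & \exp\left\{ -\frac{1}{2\sigma^2} \left[ y^T y - 2 y^T X \beta + \beta^T X^T X \beta + \beta^T X_0^T X_0 \beta - 2 \beta_0^T X_0^T X_0 \beta + \beta_0^T X_0^T X_0 \beta_0 \right] \right\}
\end{aligned}
\]
The first and last terms in the exponent do not involve $\beta$ (are constants) and can be ignored as they are absorbed through normalization. This leaves
\[
\begin{aligned}
p\left(\beta \mid \sigma^2, y, X\right) \propto{} & \exp\left\{ -\frac{1}{2\sigma^2} \left[ -2 y^T X \beta + \beta^T X^T X \beta + \beta^T X_0^T X_0 \beta - 2 \beta_0^T X_0^T X_0 \beta \right] \right\} \\
={} & \exp\left\{ -\frac{1}{2\sigma^2} \left[ \beta^T \left(X_0^T X_0 + X^T X\right) \beta - 2 \left(y^T X + \beta_0^T X_0^T X_0\right) \beta \right] \right\}
\end{aligned}
\]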
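which can be recognized as the expansion of the conditional posterior claimed above.
\[
\begin{aligned}
p\left(\beta \mid \sigma^2, y, X\right) \sim{} & N\left(\bar{\beta}, V_\beta\right) \\
\propto{} & \exp\left\{ -\tfrac{1}{2} \left(\beta - \bar{\beta}\right)^T V_\beta^{-1} \left(\beta - \bar{\beta}\right) \right\} \\
={} & \exp\left\{ -\frac{1}{2\sigma^2} \left(\beta - \bar{\beta}\right)^T \left(X_0^T X_0 + X^T X\right) \left(\beta - \bar{\beta}\right) \right\} \\
={} & \exp\left\{ -\frac{1}{2\sigma^2} \left[ \beta^T \left(X_0^T X_0 + X^T X\right) \beta - 2 \bar{\beta}^T \left(X_0^T X_0 + X^T X\right) \beta + \bar{\beta}^T \left(X_0^T X_0 + X^T X\right) \bar{\beta} \right] \right\} \\
={} & \exp\left\{ -\frac{1}{2\sigma^2} \left[ \beta^T \left(X_0^T X_0 + X^T X\right) \beta - 2 \left(X_0^T X_0 \beta_0 + X^T y\right)^T \beta + \bar{\beta}^T \left(X_0^T X_0 + X^T X\right) \bar{\beta} \right] \right\}
\end{aligned}
\]
The last term in the exponent is all constants (does not involve $\beta$) so it's absorbed through normalization and disregarded for comparison of kernels. Hence,
\[
\begin{aligned}
p\left(\beta \mid \sigma^2, y, X\right) \propto{} & \exp\left\{ -\tfrac{1}{2} \left(\beta - \bar{\beta}\right)^T V_\beta^{-1} \left(\beta - \bar{\beta}\right) \right\} \\
\propto{} & \exp\left\{ -\frac{1}{2\sigma^2} \left[ \beta^T \left(X_0^T X_0 + X^T X\right) \beta - 2 \left(y^T X + \beta_0^T X_0^T X_0\right) \beta \right] \right\}
\end{aligned}
\]
as claimed.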
Uninformative priors If the prior for $\beta$ is uniformly distributed conditional on known variance, $\sigma^2$, $p\left(\beta \mid \sigma^2\right) \propto 1$, then it's as if $X_0^T X_0 \rightarrow 0$ and the posterior for $\beta$ is
\[ p\left(\beta \mid \sigma^2, y, X\right) \sim N\left(\hat{\beta}, \sigma^2 \left(X^T X\right)^{-1}\right) \]
equivalent to the classical parameter estimators.
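Unknown variance In the usual case where the variance as well as the regression coefficients, $\beta$, are unknown, the likelihood function can be expressed as
\[ \ell\left(\beta, \sigma^2 \mid y, X\right) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \right\} \]
Rewriting gives
\[ \ell\left(\beta, \sigma^2 \mid y, X\right) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \varepsilon^T \varepsilon \right\} \]
since $\varepsilon = y - X\beta$. The estimated model is $y = Xb + e$, therefore $X\beta + \varepsilon = Xb + e$ where $b = \left(X^T X\right)^{-1} X^T y$ and $e = y - Xb$ are estimates of $\beta$ and $\varepsilon$, respectively. This implies $\varepsilon = e - X(\beta - b)$ and
\[ \ell\left(\beta, \sigma^2 \mid y, X\right) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \left[ e^T e - 2 (\beta - b)^T X^T e + (\beta - b)^T X^T X (\beta - b) \right] \right\} \]
Since $X^T e = 0$ by construction, this simplifies as
\[ \ell\left(\beta, \sigma^2 \mid y, X\right) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \left[ e^T e + (\beta - b)^T X^T X (\beta - b) \right] \right\} \]
or
\[ \ell\left(\beta, \sigma^2 \mid y, X\right) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (n - p) s^2 + (\beta - b)^T X^T X (\beta - b) \right] \right\} \]
where $s^2 = \frac{1}{n-p} e^T e$ (see footnote 12).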


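The conjugate prior for linear regression is the Gaussian $\left(\beta \mid \sigma^2; \beta_0, \sigma^2 \Lambda_0^{-1}\right)$-inverse chi square $\left(\sigma^2; \nu_0, \sigma_0^2\right)$
\[
\begin{aligned}
p\left(\beta \mid \sigma^2; \beta_0, \sigma^2 \Lambda_0^{-1}\right) \times p\left(\sigma^2; \nu_0, \sigma_0^2\right) \propto{} & \sigma^{-p} \exp\left\{ -\frac{(\beta - \beta_0)^T \Lambda_0 (\beta - \beta_0)}{2\sigma^2} \right\} \\
& \times \left(\sigma^2\right)^{-(\nu_0/2 + 1)} \exp\left\{ -\frac{\nu_0 \sigma_0^2}{2\sigma^2} \right\}
\end{aligned}
\]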
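Combining the prior with the likelihood gives a joint Gaussian $\left(\bar{\beta}, \sigma^2 \Lambda_n^{-1}\right)$-inverse chi square $\left(\nu_0 + n, \sigma_n^2\right)$ posterior
\[
\begin{aligned}
p\left(\beta, \sigma^2 \mid y, X; \beta_0, \sigma^2 \Lambda_0^{-1}, \nu_0, \sigma_0^2\right) \propto{} & \sigma^{-n} \exp\left\{ -\frac{(n - p) s^2}{2\sigma^2} \right\} \\
& \times \exp\left\{ -\frac{(\beta - b)^T X^T X (\beta - b)}{2\sigma^2} \right\} \\
& \times \sigma^{-p} \exp\left\{ -\frac{(\beta - \beta_0)^T \Lambda_0 (\beta - \beta_0)}{2\sigma^2} \right\} \\
& \times \left(\sigma^2\right)^{-(\nu_0/2 + 1)} \exp\left\{ -\frac{\nu_0 \sigma_0^2}{2\sigma^2} \right\}
\end{aligned}
\]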
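Collecting terms and rewriting, we have
\[
\begin{aligned}
p\left(\beta, \sigma^2 \mid y, X; \beta_0, \sigma^2 \Lambda_0^{-1}, \nu_0, \sigma_0^2\right) \propto{} & \left(\sigma^2\right)^{-[(\nu_0 + n)/2 + 1]} \exp\left\{ -\frac{\nu_n \sigma_n^2}{2\sigma^2} \right\} \\
& \times \sigma^{-p} \exp\left\{ -\frac{1}{2\sigma^2} \left(\beta - \bar{\beta}\right)^T \Lambda_n \left(\beta - \bar{\beta}\right) \right\}
\end{aligned}
\]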
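12 Notice, the univariate Gaussian case is subsumed by linear regression where $X = \iota$ (a vector of ones). Then, the likelihood as described earlier,
\[ \ell\left(\beta, \sigma^2 \mid y, X\right) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (n - p) s^2 + (\beta - b)^T X^T X (\beta - b) \right] \right\} \]
becomes
\[ \ell\left(\beta = \mu, \sigma^2 \mid y, X = \iota\right) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (n - 1) s^2 + n (\mu - \bar{y})^2 \right] \right\} \]
where $\beta = \mu$, $b = \left(X^T X\right)^{-1} X^T y = \bar{y}$, $p = 1$, and $X^T X = n$.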
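where
\[ \bar{\beta} = \left(\Lambda_0 + X^T X\right)^{-1} \left(\Lambda_0 \beta_0 + X^T X b\right) \]
\[ \Lambda_n = \Lambda_0 + X^T X \]
and
\[ \nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + (n - p) s^2 + \left(\beta_0 - \bar{\beta}\right)^T \Lambda_0 \left(\beta_0 - \bar{\beta}\right) + \left(\hat{\beta} - \bar{\beta}\right)^T X^T X \left(\hat{\beta} - \bar{\beta}\right) \]
where $\nu_n = \nu_0 + n$ and $\hat{\beta} = b$. The conditional posterior of $\beta$ given $\sigma^2$ is Gaussian $\left(\bar{\beta}, \sigma^2 \Lambda_n^{-1}\right)$.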
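Completing the square The derivation of the above joint posterior follows from the matrix version of completing the square where $\Lambda_0$ and $X^T X$ are square, symmetric, full rank $p \times p$ matrices. The exponents from the prior for the mean and likelihood are
\[ (\beta - \beta_0)^T \Lambda_0 (\beta - \beta_0) + \left(\beta - \hat{\beta}\right)^T X^T X \left(\beta - \hat{\beta}\right) \]
Expanding and rearranging gives
\[ \beta^T \left(\Lambda_0 + X^T X\right) \beta - 2 \left(\Lambda_0 \beta_0 + X^T X \hat{\beta}\right)^T \beta + \beta_0^T \Lambda_0 \beta_0 + \hat{\beta}^T X^T X \hat{\beta} \quad (1) \]
The latter two terms are constants not involving $\beta$ (and can be ignored when writing the kernel for the conditional posterior) which we'll add to when we complete the square. Now, write out the square centered around $\bar{\beta}$
\[ \left(\beta - \bar{\beta}\right)^T \left(\Lambda_0 + X^T X\right) \left(\beta - \bar{\beta}\right) = \beta^T \left(\Lambda_0 + X^T X\right) \beta - 2 \bar{\beta}^T \left(\Lambda_0 + X^T X\right) \beta + \bar{\beta}^T \left(\Lambda_0 + X^T X\right) \bar{\beta} \]
Substitute for $\bar{\beta}$ in the second term on the right hand side and the first two terms are identical to the first two terms in equation (1). Hence, the exponents from the prior for the mean and likelihood in (1) are equal to
\[ \left(\beta - \bar{\beta}\right)^T \left(\Lambda_0 + X^T X\right) \left(\beta - \bar{\beta}\right) - \bar{\beta}^T \left(\Lambda_0 + X^T X\right) \bar{\beta} + \beta_0^T \Lambda_0 \beta_0 + \hat{\beta}^T X^T X \hat{\beta} \]
which can be rewritten as
\[ \left(\beta - \bar{\beta}\right)^T \left(\Lambda_0 + X^T X\right) \left(\beta - \bar{\beta}\right) + \left(\beta_0 - \bar{\beta}\right)^T \Lambda_0 \left(\beta_0 - \bar{\beta}\right) + \left(\hat{\beta} - \bar{\beta}\right)^T X^T X \left(\hat{\beta} - \bar{\beta}\right) \]
or (in the form analogous to the univariate Gaussian case)
\[ \left(\beta - \bar{\beta}\right)^T \left(\Lambda_0 + X^T X\right) \left(\beta - \bar{\beta}\right) + \left(\beta_0 - \hat{\beta}\right)^T \left( \Lambda_1 \Lambda_n^{-1} \Lambda_0 \Lambda_n^{-1} \Lambda_1 + \Lambda_0 \Lambda_n^{-1} \Lambda_1 \Lambda_n^{-1} \Lambda_0 \right) \left(\beta_0 - \hat{\beta}\right) \]
where $\Lambda_1 = X^T X$ and $\Lambda_n = \Lambda_0 + X^T X$.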
Stacked regression Bayesian linear regression with conjugate priors works as if we have a prior sample $\{X_0, y_0\}$, $\Lambda_0 = X_0^T X_0$, and initial estimates
\[ \beta_0 = \left(X_0^T X_0\right)^{-1} X_0^T y_0 \]
Then, we combine this initial "evidence" with new evidence to update our beliefs in
the form of the posterior. Not surprisingly, the posterior mean is a weighted average
of the two "samples" where the weights are based on the relative precision of the two
"samples".
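The stacked interpretation can be illustrated numerically: the precision-weighted posterior mean coincides with ordinary least squares on the prior sample stacked on top of the new sample. The sketch below assumes NumPy; the simulated data, the sample sizes, and the names `X0`, `y0`, `beta_bar` and `beta_stacked` are illustrative choices, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(3)
beta_true = np.array([1.0, -2.0])

# Hypothetical "prior sample" and new sample from the same DGP.
X0 = rng.normal(size=(15, 2))
y0 = X0 @ beta_true + rng.normal(size=15)
X = rng.normal(size=(40, 2))
y = X @ beta_true + rng.normal(size=40)

# Prior precision and prior estimates implied by the prior sample.
Lam0 = X0.T @ X0
beta0 = np.linalg.solve(Lam0, X0.T @ y0)

# Classical estimate from the new evidence.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Posterior mean: precision-weighted average of beta0 and b.
beta_bar = np.linalg.solve(Lam0 + X.T @ X, Lam0 @ beta0 + X.T @ X @ b)

# Same quantity from least squares on the stacked data.
X_stacked = np.vstack([X0, X])
y_stacked = np.concatenate([y0, y])
beta_stacked, *_ = np.linalg.lstsq(X_stacked, y_stacked, rcond=None)

print(beta_bar, beta_stacked)
```

The two estimates agree to rounding error, which is the sense in which the prior behaves like additional data.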
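Marginal posterior distributions The marginal posterior for $\beta$ on integrating out $\sigma^2$ is noncentral, scaled multivariate Student $t_p\left(\bar{\beta}, \sigma_n^2 \Lambda_n^{-1}, \nu_0 + n\right)$
\[
\begin{aligned}
p(\beta \mid y, X) \propto{} & \left[ \nu_n \sigma_n^2 + \left(\beta - \bar{\beta}\right)^T \Lambda_n \left(\beta - \bar{\beta}\right) \right]^{-\frac{\nu_0 + n + p}{2}} \\
\propto{} & \left[ 1 + \frac{1}{\nu_n \sigma_n^2} \left(\beta - \bar{\beta}\right)^T \Lambda_n \left(\beta - \bar{\beta}\right) \right]^{-\frac{\nu_0 + n + p}{2}}
\end{aligned}
\]
where $\Lambda_n = \Lambda_0 + X^T X$. This result corresponds with the univariate Gaussian case and is derived analogously by transformation of variables where $z = \frac{A}{2\sigma^2}$ where $A = \nu_n \sigma_n^2 + \left(\beta - \bar{\beta}\right)^T \Lambda_n \left(\beta - \bar{\beta}\right)$. The marginal posterior for $\sigma^2$ is inverted-chi square $\left(\sigma^2; \nu_n, \sigma_n^2\right)$.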
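Derivation of the marginal posterior for $\beta$ is as follows.
\[
\begin{aligned}
p(\beta \mid y) ={} & \int_0^{\infty} p\left(\beta, \sigma^2 \mid y\right) d\sigma^2 \\
={} & \int_0^{\infty} \left(\sigma^2\right)^{-\frac{n + \nu_0 + p + 2}{2}} \exp\left\{ -\frac{A}{2\sigma^2} \right\} d\sigma^2
\end{aligned}
\]
Utilizing $\sigma^2 = \frac{A}{2z}$ and $dz = -\frac{2z^2}{A} d\sigma^2$ or $d\sigma^2 = -\frac{A}{2z^2} dz$, ($-1$ and 2 are constants and can be ignored when deriving the kernel)
\[
\begin{aligned}
p(\beta \mid y) \propto{} & \int_0^{\infty} \left( \frac{A}{2z} \right)^{-\frac{n + \nu_0 + p + 2}{2}} \frac{A}{2z^2} \exp[-z] \, dz \\
\propto{} & A^{-\frac{n + \nu_0 + p}{2}} \int_0^{\infty} z^{\frac{n + \nu_0 + p}{2} - 1} \exp[-z] \, dz
\end{aligned}
\]
The integral $\int_0^{\infty} z^{\frac{n + \nu_0 + p}{2} - 1} \exp[-z] \, dz$ is a constant since it is the kernel of a gamma density and therefore can be ignored when deriving the kernel of the marginal posterior for the mean
\[
\begin{aligned}
p(\beta \mid y) \propto{} & A^{-\frac{n + \nu_0 + p}{2}} \\
\propto{} & \left[ \nu_n \sigma_n^2 + \left(\beta - \bar{\beta}\right)^T \Lambda_n \left(\beta - \bar{\beta}\right) \right]^{-\frac{n + \nu_0 + p}{2}} \\
\propto{} & \left[ 1 + \frac{\left(\beta - \bar{\beta}\right)^T \Lambda_n \left(\beta - \bar{\beta}\right)}{\nu_n \sigma_n^2} \right]^{-\frac{n + \nu_0 + p}{2}}
\end{aligned}
\]
the kernel for a noncentral, scaled (multivariate) Student $t_p\left(\beta; \bar{\beta}, \sigma_n^2 \Lambda_n^{-1}, n + \nu_0\right)$.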
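Uninformative priors Again, the case of uninformative priors is relatively straightforward. Since priors convey no information, the prior for the mean is uniform (proportional to a constant, $\Lambda_0 \rightarrow 0$) and the prior for $\sigma^2$ has $\nu_0 \rightarrow 0$ degrees of freedom so that the joint prior is $p\left(\beta, \sigma^2\right) \propto \left(\sigma^2\right)^{-1}$.
The joint posterior is
\[ p\left(\beta, \sigma^2 \mid y\right) \propto \left(\sigma^2\right)^{-[n/2 + 1]} \exp\left\{ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \right\} \]
Since $y = Xb + e$ where $b = \left(X^T X\right)^{-1} X^T y$, the joint posterior can be written
\[ p\left(\beta, \sigma^2 \mid y\right) \propto \left(\sigma^2\right)^{-[n/2 + 1]} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (n - p) s^2 + (\beta - b)^T X^T X (\beta - b) \right] \right\} \]
Or, factoring into the conditional posterior for $\beta$ and marginal for $\sigma^2$, we have
\[
\begin{aligned}
p\left(\beta, \sigma^2 \mid y\right) \propto{} & p\left(\sigma^2 \mid y\right) p\left(\beta \mid \sigma^2, y\right) \\
\propto{} & \left(\sigma^2\right)^{-[(n-p)/2 + 1]} \exp\left\{ -\frac{\sigma_n^2}{2\sigma^2} \right\} \\
& \times \sigma^{-p} \exp\left\{ -\frac{1}{2\sigma^2} (\beta - b)^T X^T X (\beta - b) \right\}
\end{aligned}
\]
where
\[ \sigma_n^2 = (n - p) s^2 \]
Hence, the conditional posterior for $\beta$ given $\sigma^2$ is Gaussian $\left(b, \sigma^2 \left(X^T X\right)^{-1}\right)$.
The marginal posterior for $\beta$ is multivariate Student $t_p\left(\beta; b, s^2 \left(X^T X\right)^{-1}, n - p\right)$, the classical estimator. Derivation of the marginal posterior for $\beta$ is analogous to that above. Let $z = \frac{A}{2\sigma^2}$ where $A = (n - p) s^2 + (\beta - b)^T X^T X (\beta - b)$. Integrating $\sigma^2$ out of the joint posterior produces the marginal posterior for $\beta$.
\[
\begin{aligned}
p(\beta \mid y) \propto{} & \int p\left(\beta, \sigma^2 \mid y\right) d\sigma^2 \\
\propto{} & \int \left(\sigma^2\right)^{-\frac{n+2}{2}} \exp\left\{ -\frac{A}{2\sigma^2} \right\} d\sigma^2
\end{aligned}
\]
Substitution yields
\[
\begin{aligned}
p(\beta \mid y) \propto{} & \int \left( \frac{A}{2z} \right)^{-\frac{n+2}{2}} \frac{A}{2z^2} \exp[-z] \, dz \\
\propto{} & A^{-\frac{n}{2}} \int z^{\frac{n}{2} - 1} \exp[-z] \, dz
\end{aligned}
\]
As before, the integral involves the kernel of a gamma distribution, a constant which can be ignored. Therefore, we have
\[
\begin{aligned}
p(\beta \mid y) \propto{} & A^{-\frac{n}{2}} \\
\propto{} & \left[ (n - p) s^2 + (\beta - b)^T X^T X (\beta - b) \right]^{-\frac{n}{2}} \\
\propto{} & \left[ 1 + \frac{(\beta - b)^T X^T X (\beta - b)}{(n - p) s^2} \right]^{-\frac{n}{2}}
\end{aligned}
\]
which is multivariate Student $t_p\left(\beta; b, s^2 \left(X^T X\right)^{-1}, n - p\right)$.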
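The factorization into a marginal for the variance and a conditional for the coefficients gives a simple two-step simulation. The sketch below assumes NumPy and simulated data; the function name `simulate_regression_posterior` and all numeric settings are hypothetical. The variance is drawn as $(n - p) s^2$ divided by a chi-square draw with $n - p$ degrees of freedom, and the coefficients are then drawn from the Gaussian conditional.

```python
import numpy as np

def simulate_regression_posterior(X, y, n_draws, rng):
    """Conditional simulation under the uninformative prior p(beta, sigma^2) ~ 1/sigma^2."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                      # classical estimate
    e = y - X @ b
    s2 = e @ e / (n - p)
    # sigma^2 | y: scaled inverted-chi square with n - p degrees of freedom
    sigma2 = (n - p) * s2 / rng.chisquare(n - p, size=n_draws)
    # beta | sigma^2, y ~ N(b, sigma^2 (X'X)^{-1})
    L = np.linalg.cholesky(XtX_inv)
    z = rng.normal(size=(n_draws, p))
    beta = b + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return b, s2, beta, sigma2

# Illustrative (hypothetical) data: intercept plus one regressor.
rng = np.random.default_rng(4)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 0.5]) + rng.normal(scale=1.5, size=n)

b, s2, beta, sigma2 = simulate_regression_posterior(X, y, 2000, rng)
print(b, beta.mean(axis=0), s2, np.median(sigma2))
```

Pooling the `beta` draws across the `sigma2` draws approximates the multivariate Student t marginal centered at `b`.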
4.1.8
Bayesian linear regression with general error structure
Now, we consider Bayesian regression with a more general error structure. That is, the DGP is
\[ y = X\beta + \varepsilon, \quad (\varepsilon \mid X) \sim N(0, \Sigma) \]
First, we consider the known variance case, then take up the unknown variance case.
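Known variance If the error variance, $\Sigma$, is known, we simply repeat the Bayesian linear regression approach discussed above for the known variance case after transforming all variables via the Cholesky decomposition of $\Sigma$. Let
\[ \Sigma = \Gamma \Gamma^T \]
and
\[ \Sigma^{-1} = \left(\Gamma^T\right)^{-1} \Gamma^{-1} \]
Then, the DGP is
\[ \Gamma^{-1} y = \Gamma^{-1} X \beta + \Gamma^{-1} \varepsilon \]
where
\[ \Gamma^{-1} \varepsilon \sim N(0, I_n) \]
With informed priors for $\beta$, $p(\beta \mid \Sigma) \sim N(\beta_0, \Sigma_\beta)$ where it is as if $\Sigma_\beta = \left(X_0^T \Sigma_0^{-1} X_0\right)^{-1}$, the posterior distribution for $\beta$ conditional on $\Sigma$ is
\[ p(\beta \mid \Sigma, y, X; \beta_0, \Sigma_\beta) \sim N\left(\bar{\beta}, V_\beta\right) \]
where
\[
\begin{aligned}
\bar{\beta} ={} & \left(X_0^T \Sigma_0^{-1} X_0 + X^T \Sigma^{-1} X\right)^{-1} \left(X_0^T \Sigma_0^{-1} X_0 \beta_0 + X^T \Sigma^{-1} X \hat{\beta}\right) \\
={} & \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right)^{-1} \left(\Sigma_\beta^{-1} \beta_0 + X^T \Sigma^{-1} X \hat{\beta}\right)
\end{aligned}
\]
\[ \hat{\beta} = \left(X^T \Sigma^{-1} X\right)^{-1} X^T \Sigma^{-1} y \]
and
\[
\begin{aligned}
V_\beta ={} & \left(X_0^T \Sigma_0^{-1} X_0 + X^T \Sigma^{-1} X\right)^{-1} \\
={} & \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right)^{-1}
\end{aligned}
\]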
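It is instructive to once again backtrack to develop the conditional posterior distribution. The likelihood function for known variance is
\[ \ell(\beta \mid \Sigma, y, X) \propto \exp\left\{ -\tfrac{1}{2} (y - X\beta)^T \Sigma^{-1} (y - X\beta) \right\} \]
Conditional Gaussian priors are
\[ p(\beta \mid \Sigma) \propto \exp\left\{ -\tfrac{1}{2} (\beta - \beta_0)^T \Sigma_\beta^{-1} (\beta - \beta_0) \right\} \]
The conditional posterior is the product of the prior and likelihood
\[
\begin{aligned}
p(\beta \mid \Sigma, y, X) \propto{} & \exp\left\{ -\tfrac{1}{2} \left[ (y - X\beta)^T \Sigma^{-1} (y - X\beta) + (\beta - \beta_0)^T \Sigma_\beta^{-1} (\beta - \beta_0) \right] \right\} \\
={} & \exp\left\{ -\tfrac{1}{2} \left[ y^T \Sigma^{-1} y - 2 y^T \Sigma^{-1} X \beta + \beta^T X^T \Sigma^{-1} X \beta + \beta^T \Sigma_\beta^{-1} \beta - 2 \beta_0^T \Sigma_\beta^{-1} \beta + \beta_0^T \Sigma_\beta^{-1} \beta_0 \right] \right\}
\end{aligned}
\]
The first and last terms in the exponent do not involve $\beta$ (are constants) and can be ignored as they are absorbed through normalization. This leaves
\[
\begin{aligned}
p(\beta \mid \Sigma, y, X) \propto{} & \exp\left\{ -\tfrac{1}{2} \left[ -2 y^T \Sigma^{-1} X \beta + \beta^T X^T \Sigma^{-1} X \beta + \beta^T \Sigma_\beta^{-1} \beta - 2 \beta_0^T \Sigma_\beta^{-1} \beta \right] \right\} \\
={} & \exp\left\{ -\tfrac{1}{2} \left[ \beta^T \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right) \beta - 2 \left(y^T \Sigma^{-1} X + \beta_0^T \Sigma_\beta^{-1}\right) \beta \right] \right\}
\end{aligned}
\]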
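which can be recognized as the expansion of the conditional posterior claimed above.
\[
\begin{aligned}
p(\beta \mid \Sigma, y, X) \sim{} & N\left(\bar{\beta}, V_\beta\right) \\
\propto{} & \exp\left\{ -\tfrac{1}{2} \left(\beta - \bar{\beta}\right)^T V_\beta^{-1} \left(\beta - \bar{\beta}\right) \right\} \\
={} & \exp\left\{ -\tfrac{1}{2} \left(\beta - \bar{\beta}\right)^T \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right) \left(\beta - \bar{\beta}\right) \right\} \\
={} & \exp\left\{ -\tfrac{1}{2} \left[ \beta^T \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right) \beta - 2 \bar{\beta}^T \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right) \beta + \bar{\beta}^T \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right) \bar{\beta} \right] \right\} \\
={} & \exp\left\{ -\tfrac{1}{2} \left[ \beta^T \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right) \beta - 2 \left(y^T \Sigma^{-1} X + \beta_0^T \Sigma_\beta^{-1}\right) \beta + \bar{\beta}^T \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right) \bar{\beta} \right] \right\}
\end{aligned}
\]
The last term in the exponent is all constants (does not involve $\beta$) so it's absorbed through normalization and disregarded for comparison of kernels. Hence,
\[
\begin{aligned}
p(\beta \mid \Sigma, y, X) \propto{} & \exp\left\{ -\tfrac{1}{2} \left(\beta - \bar{\beta}\right)^T V_\beta^{-1} \left(\beta - \bar{\beta}\right) \right\} \\
\propto{} & \exp\left\{ -\tfrac{1}{2} \left[ \beta^T \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right) \beta - 2 \left(y^T \Sigma^{-1} X + \beta_0^T \Sigma_\beta^{-1}\right) \beta \right] \right\}
\end{aligned}
\]
as claimed.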
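Unknown variance Bayesian linear regression with unknown general error structure, $\Sigma$, is something of a composite of ideas developed for exchangeable ($\sigma^2 I_n$ error structure) Bayesian regression and the multivariate Gaussian case with mean and variance unknown where each draw is an element of the y vector and X is an $n \times p$ matrix of regressors. A Gaussian likelihood is
\[
\begin{aligned}
\ell(\beta, \Sigma \mid y, X) \propto{} & |\Sigma|^{-\frac{n}{2}} \exp\left\{ -\tfrac{1}{2} (y - X\beta)^T \Sigma^{-1} (y - X\beta) \right\} \\
\propto{} & |\Sigma|^{-\frac{n}{2}} \exp\left\{ -\tfrac{1}{2} \left[ (y - Xb)^T \Sigma^{-1} (y - Xb) + (b - \beta)^T X^T \Sigma^{-1} X (b - \beta) \right] \right\} \\
\propto{} & |\Sigma|^{-\frac{n}{2}} \exp\left\{ -\tfrac{1}{2} \left[ (n - p) s^2 + (b - \beta)^T X^T \Sigma^{-1} X (b - \beta) \right] \right\}
\end{aligned}
\]
where $b = \left(X^T \Sigma^{-1} X\right)^{-1} X^T \Sigma^{-1} y$ and $s^2 = \frac{1}{n-p} (y - Xb)^T \Sigma^{-1} (y - Xb)$. Combine the likelihood with a Gaussian-inverted Wishart prior
\[
\begin{aligned}
p(\beta \mid \Sigma; \beta_0, \Sigma_\beta) \times p\left(\Sigma^{-1}; \nu, \Lambda\right) \propto{} & \exp\left\{ -\tfrac{1}{2} (\beta - \beta_0)^T \Sigma_\beta^{-1} (\beta - \beta_0) \right\} \\
& \times |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu + p + 1}{2}} \exp\left\{ -\frac{\mathrm{tr}\left(\Lambda \Sigma^{-1}\right)}{2} \right\}
\end{aligned}
\]
where tr (·) is the trace of the matrix, it is as if $\Sigma_\beta = \left(X_0^T \Sigma_0^{-1} X_0\right)^{-1}$, and $\nu$ is degrees of freedom to produce the joint posterior
\[
\begin{aligned}
p(\beta, \Sigma \mid y, X) \propto{} & |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu + n + p + 1}{2}} \exp\left\{ -\frac{\mathrm{tr}\left(\Lambda \Sigma^{-1}\right)}{2} \right\} \\
& \times \exp\left\{ -\tfrac{1}{2} \left[ (n - p) s^2 + (b - \beta)^T X^T \Sigma^{-1} X (b - \beta) + (\beta - \beta_0)^T \Sigma_\beta^{-1} (\beta - \beta_0) \right] \right\}
\end{aligned}
\]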
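Completing the square Completing the square involves the matrix analog to the univariate unknown mean and variance case. Consider the exponent (in braces)
\[
\begin{aligned}
& (n - p) s^2 + (b - \beta)^T X^T \Sigma^{-1} X (b - \beta) + (\beta - \beta_0)^T \Sigma_\beta^{-1} (\beta - \beta_0) \\
={} & (n - p) s^2 + b^T X^T \Sigma^{-1} X b - 2 \beta^T X^T \Sigma^{-1} X b + \beta^T X^T \Sigma^{-1} X \beta + \beta^T \Sigma_\beta^{-1} \beta - 2 \beta^T \Sigma_\beta^{-1} \beta_0 + \beta_0^T \Sigma_\beta^{-1} \beta_0 \\
={} & (n - p) s^2 + \beta^T \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right) \beta - 2 \beta^T V_\beta^{-1} \bar{\beta} + b^T X^T \Sigma^{-1} X b + \beta_0^T \Sigma_\beta^{-1} \beta_0 \\
={} & (n - p) s^2 + \beta^T V_\beta^{-1} \beta - 2 \beta^T V_\beta^{-1} \bar{\beta} + b^T X^T \Sigma^{-1} X b + \beta_0^T \Sigma_\beta^{-1} \beta_0
\end{aligned}
\]
where
\[
\begin{aligned}
\bar{\beta} ={} & \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right)^{-1} \left(\Sigma_\beta^{-1} \beta_0 + X^T \Sigma^{-1} X b\right) \\
={} & V_\beta \left(\Sigma_\beta^{-1} \beta_0 + X^T \Sigma^{-1} X b\right)
\end{aligned}
\]
and $V_\beta = \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right)^{-1}$.
Variation in $\beta$ around $\bar{\beta}$ is
\[ \left(\beta - \bar{\beta}\right)^T V_\beta^{-1} \left(\beta - \bar{\beta}\right) = \beta^T V_\beta^{-1} \beta - 2 \beta^T V_\beta^{-1} \bar{\beta} + \bar{\beta}^T V_\beta^{-1} \bar{\beta} \]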
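The first two terms are identical to two terms in the posterior involving $\beta$ and there is apparently no recognizable kernel from these expressions. The joint posterior is
\[
\begin{aligned}
p(\beta, \Sigma \mid y, X) \propto{} & |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu + n + p + 1}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left(\Lambda \Sigma^{-1}\right) \right\} \\
& \times \exp\left\{ -\tfrac{1}{2} \left[ \left(\beta - \bar{\beta}\right)^T V_\beta^{-1} \left(\beta - \bar{\beta}\right) + (n - p) s^2 - \bar{\beta}^T V_\beta^{-1} \bar{\beta} + b^T X^T \Sigma^{-1} X b + \beta_0^T \Sigma_\beta^{-1} \beta_0 \right] \right\} \\
\propto{} & |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu + n + p + 1}{2}} \exp\left\{ -\tfrac{1}{2} \left[ \mathrm{tr}\left(\Lambda \Sigma^{-1}\right) + (n - p) s^2 - \bar{\beta}^T V_\beta^{-1} \bar{\beta} + b^T X^T \Sigma^{-1} X b + \beta_0^T \Sigma_\beta^{-1} \beta_0 \right] \right\} \\
& \times \exp\left\{ -\tfrac{1}{2} \left(\beta - \bar{\beta}\right)^T V_\beta^{-1} \left(\beta - \bar{\beta}\right) \right\}
\end{aligned}
\]
Therefore, we write the conditional posteriors for the parameters of interest. First, we focus on $\beta$ then we take up $\Sigma$.
The conditional posterior for $\beta$ conditional on $\Sigma$ involves collecting all terms involving $\beta$. Hence, the conditional posterior for $\beta$ is $(\beta \mid \Sigma) \sim N\left(\bar{\beta}, V_\beta\right)$ or
\[ p(\beta \mid \Sigma, y, X) \propto \exp\left\{ -\tfrac{1}{2} \left(\beta - \bar{\beta}\right)^T V_\beta^{-1} \left(\beta - \bar{\beta}\right) \right\} \]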
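Inverted-Wishart kernel Now, we gather all terms involving $\Sigma$ and write the conditional posterior for $\Sigma$.
\[
\begin{aligned}
p(\Sigma \mid \beta, y, X) \propto{} & |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu + n + p + 1}{2}} \exp\left\{ -\tfrac{1}{2} \left[ \mathrm{tr}\left(\Lambda \Sigma^{-1}\right) + (n - p) s^2 + (b - \beta)^T X^T \Sigma^{-1} X (b - \beta) \right] \right\} \\
\propto{} & |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu + n + p + 1}{2}} \exp\left\{ -\tfrac{1}{2} \left[ \mathrm{tr}\left(\Lambda \Sigma^{-1}\right) + (y - Xb)^T \Sigma^{-1} (y - Xb) + (b - \beta)^T X^T \Sigma^{-1} X (b - \beta) \right] \right\} \\
\propto{} & |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu + n + p + 1}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left( \left[ \Lambda + (y - Xb) (y - Xb)^T + X (b - \beta) (b - \beta)^T X^T \right] \Sigma^{-1} \right) \right\}
\end{aligned}
\]
The last line uses the fact that a scalar quadratic form equals its trace, so each quadratic form in $\Sigma^{-1}$ becomes the trace of an outer product times $\Sigma^{-1}$. We can identify the kernel as an inverted-Wishart involving the trace of a square, symmetric matrix, call it $\Lambda_n$, multiplied by $\Sigma^{-1}$.
The above conditional posterior can be rewritten as an inverted-Wishart $\left(\Sigma^{-1}; \nu + n, \Lambda_n\right)$
\[ p(\Sigma \mid \beta, y, X) \propto |\Lambda_n|^{\frac{\nu + n}{2}} |\Sigma|^{-\frac{\nu + n + p + 1}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left(\Lambda_n \Sigma^{-1}\right) \right\} \]
where
\[ \Lambda_n = \Lambda + (y - Xb) (y - Xb)^T + X (b - \beta) (b - \beta)^T X^T \]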
With conditional posteriors in hand, we can employ McMC strategies (namely, a Gibbs sampler) to draw inferences around the parameters of interest, $\beta$ and $\Sigma$. That is, we sequentially draw $\beta$ conditional on $\Sigma$ and $\Sigma$, in turn, conditional on $\beta$. We discuss McMC strategies (both the Gibbs sampler and its generalization, the Metropolis-Hastings algorithm) later.
(Nearly) uninformative priors As discussed by Gelman, et al [2004], uninformative priors for this case are awkward, at best. What does it mean to posit uninformative priors for a regression with general error structure? Consistent probability assignment suggests that either we have some priors about the correlation structure or heteroskedastic nature of the errors (informative priors) or we know nothing about the error structure (uninformative priors). If priors are uninformative, then maximum entropy probability assignment suggests we assign independent and unknown homoskedastic errors. Hence, we discuss nearly uninformative priors for this general error structure regression.
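The joint uninformative prior (with a locally uniform prior for $\beta$) is
\[ p(\beta, \Sigma) \propto |\Sigma|^{-\frac{1}{2}} \]
and the joint posterior is
\[
\begin{aligned}
p(\beta, \Sigma \mid y, X) \propto{} & |\Sigma|^{-\frac{1}{2}} |\Sigma|^{-\frac{n}{2}} \exp\left\{ -\tfrac{1}{2} \left[ (n - p) s^2 + (b - \beta)^T X^T \Sigma^{-1} X (b - \beta) \right] \right\} \\
\propto{} & |\Sigma|^{-\frac{n+1}{2}} \exp\left\{ -\tfrac{1}{2} \left[ (n - p) s^2 + (b - \beta)^T X^T \Sigma^{-1} X (b - \beta) \right] \right\} \\
\propto{} & |\Sigma|^{-\frac{n+1}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left( S(\beta) \Sigma^{-1} \right) \right\}
\end{aligned}
\]
where now $S(\beta) = (y - Xb) (y - Xb)^T + X (b - \beta) (b - \beta)^T X^T$. Then, the conditional posterior for $\beta$ given $\Sigma$ is $N\left(b, \left(X^T \Sigma^{-1} X\right)^{-1}\right)$
\[ p(\beta \mid \Sigma, y, X) \propto \exp\left\{ -\tfrac{1}{2} (\beta - b)^T X^T \Sigma^{-1} X (\beta - b) \right\} \]
The conditional posterior for $\Sigma$ given $\beta$ is inverted-Wishart $\left(\Sigma^{-1}; n, \Lambda_n\right)$
\[ p(\Sigma \mid \beta, y, X) \propto |\Lambda_n|^{\frac{n}{2}} |\Sigma|^{-\frac{n+1}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left(\Lambda_n \Sigma^{-1}\right) \right\} \]
where
\[ \Lambda_n = (y - Xb) (y - Xb)^T + X (b - \beta) (b - \beta)^T X^T \]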
As with informed priors, a Gibbs sampler (sequential draws from the conditional posteriors) can be employed to draw inferences for the uninformative prior case.
Next, we discuss posterior simulation, a convenient and flexible strategy for drawing inference from the evidence and (conjugate) priors.
4.2
Direct posterior simulation
Posterior simulation allows us to learn about features of the posterior (including linear combinations, products, or ratios of parameters) by drawing samples when it is impossible (or difficult) to write the exact form of the density.
Example For example, suppose x1 and x2 are (jointly) Gaussian or normally distributed with unknown means, $(\mu_1, \mu_2)^T$, and known variance-covariance, $\sigma^2 I = 9I$,
but we’re interested in x3 = 1 x1 + 2 x2 . Based on a sample of data (n = 30),
y = {x1 , x2 }, we can infer the posterior means and variance for x1 and x2 and simulate posterior draws for x3 from which properties of the posterior distribution for x3 can
be inferred. Suppose µ1 = 50 and µ2 = 75 and we have no prior knowledge regarding the location of x1 and x2 so we employ uniform (uninformative) priors. Sample
statistics for x1 and x2 are reported below.
statistic             x1      x2
mean                  51.0    75.5
median                50.8    76.1
standard deviation    3.00    2.59
maximum               55.3    80.6
minimum               43.6    69.5
quantiles:
0.01                  43.8    69.8
0.025                 44.1    70.3
0.05                  45.6    71.1
0.10                  47.8    72.9
0.25                  49.5    73.4
0.75                  53.0    77.4
0.9                   54.4    78.1
0.95                  55.1    79.4
0.975                 55.3    80.2
0.99                  55.3    80.5
Sample statistics
Since we know x1 and x2 are independent each with variance 9, the marginal posteriors for µ1 and µ2 are centered at the sample means,
\[ p(\mu_1 \mid y) \sim N\left(51.0, \frac{9}{30}\right) \]
and
\[ p(\mu_2 \mid y) \sim N\left(75.5, \frac{9}{30}\right) \]
and the predictive posteriors for x1 and x2 are based on posterior draws for µ1 and µ2
\[ p(x_1 \mid \mu_1, y) \sim N(\mu_1, 9) \]
and
\[ p(x_2 \mid \mu_2, y) \sim N(\mu_2, 9) \]
For 1 = 2 and 2 = 3, we generate 1, 000 posterior predictive draws of x1 and
x2 , and utilize them to create posterior predictive draws for x3 . Sample statistics for
these posterior draws are reported below.
statistic             µ1      µ2      x1      x2      x3
mean                  51.0    75.5    50.9    75.4    149.2
median                51.0    75.5    50.8    75.3    149.2
standard deviation    0.55    0.56    3.15    3.04    5.07
maximum               52.5    77.5    59.7    85.4    163.1
minimum               48.5    73.5    39.4    65.4    131.4
quantiles:
0.01                  49.6    74.4    44.1    68.5    137.8
0.025                 49.8    74.5    44.7    69.7    139.8
0.05                  50.1    74.6    45.7    70.6    141.2
0.10                  50.3    74.8    46.8    71.6    142.8
0.25                  50.6    75.2    48.8    73.3    145.7
0.75                  51.3    75.9    52.9    77.6    152.8
0.9                   51.6    76.3    55.0    79.4    155.6
0.95                  51.8    76.5    56.2    80.5    157.6
0.975                 52.0    76.7    57.5    81.6    159.4
0.99                  52.3    76.9    58.5    82.4    160.9
Sample statistics for posterior draws
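The simulation steps in this example can be sketched as follows, assuming NumPy. The sample means 51.0 and 75.5, n = 30, and the known variance 9 are taken from the example; the weight names `w1` and `w2` are hypothetical labels for the two weights (2 and 3) stated in the text. The underlying data and random seed are not available, so this sketch illustrates the procedure and is not expected to reproduce the tabulated values.

```python
import numpy as np

rng = np.random.default_rng(5)
n, var = 30, 9.0
xbar1, xbar2 = 51.0, 75.5      # sample means reported in the example
w1, w2 = 2.0, 3.0              # weights stated in the text (names are hypothetical)
draws = 1000

# Step 1: posterior draws for the means (uniform priors, known variance).
mu1 = rng.normal(xbar1, np.sqrt(var / n), size=draws)
mu2 = rng.normal(xbar2, np.sqrt(var / n), size=draws)

# Step 2: posterior predictive draws for x1 and x2 given each mean draw.
x1 = rng.normal(mu1, np.sqrt(var))
x2 = rng.normal(mu2, np.sqrt(var))

# Step 3: posterior predictive draws for the linear combination.
x3 = w1 * x1 + w2 * x2

print(x3.mean(), x3.std(ddof=1), np.quantile(x3, [0.025, 0.5, 0.975]))
```

Any other function of x1 and x2 (a product or a ratio, say) could replace the last step without changing the first two.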
A normal probability plot (see footnote 13) and histogram based on 1,000 draws of x3 along with the descriptive statistics above based on posterior draws suggest that x3 is well approximated by a Gaussian distribution.
13 We employ Filliben's [1975] approach by plotting normal quantiles of $u_j$, $N(u_j)$, (horizontal axis) against z scores (vertical axis) for the data, y, of sample size n where
\[
u_j = \begin{cases} 1 - 0.5^{1/n} & j = 1 \\ \dfrac{j - 0.3175}{n + 0.365} & j = 2, \ldots, n - 1 \\ 0.5^{1/n} & j = n \end{cases}
\]
(a general expression is $\frac{j - a}{n + 1 - 2a}$, in the above $a = 0.3175$), and
\[ z_i = \frac{y_i - \bar{y}}{s} \]
with sample average, $\bar{y}$, and sample standard deviation, s.
Normal probability plot for x3
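The plotting positions described in footnote 13 can be computed with the Python standard library alone. This is a minimal sketch; the function name `filliben_plot_points` is hypothetical, the five data values are illustrative, and pairing sorted z scores with the normal quantiles is the usual convention for a probability plot rather than a detail stated in the notes.

```python
from statistics import NormalDist, mean, stdev

def filliben_plot_points(y):
    """Return (normal quantile, z score) pairs using Filliben's order statistic medians."""
    n = len(y)
    # Interior points use (j - 0.3175) / (n + 0.365); the endpoints are overridden.
    u = [(j - 0.3175) / (n + 0.365) for j in range(1, n + 1)]
    u[0] = 1 - 0.5 ** (1 / n)
    u[-1] = 0.5 ** (1 / n)
    quantiles = [NormalDist().inv_cdf(p) for p in u]   # horizontal axis
    ybar, s = mean(y), stdev(y)
    z = [(v - ybar) / s for v in sorted(y)]            # vertical axis
    return list(zip(quantiles, z))

# Illustrative (hypothetical) data.
points = filliben_plot_points([149.1, 152.3, 144.8, 158.0, 147.5])
for q, z in points:
    print(round(q, 3), round(z, 3))
```

Points falling close to a straight line indicate that a Gaussian approximation is reasonable.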
4.2.1
Independent simulation
The above example illustrates independent simulation. Since x1 and x2 are independent, their joint distribution, p (x1 , x2 ), is the product of their marginals, p (x1 ) and p (x2 ). As these marginals depend on their unknown means, we can independently draw from the marginal posteriors for the means, $p(\mu_1 \mid y)$ and $p(\mu_2 \mid y)$, to generate predictive posterior draws for x1 and x2 (see footnote 14).
The general independent posterior simulation procedure is
1. draw $\theta_2$ from the marginal posterior $p(\theta_2 \mid y)$,
2. draw $\theta_1$ from the marginal posterior $p(\theta_1 \mid y)$.
4.2.2
Nuisance parameters & conditional simulation
When there are nuisance parameters or, in other words, the model is hierarchical in
nature, it is simpler to employ conditional posterior simulation. That is, draw the nuisance parameter from its marginal posterior then draw the other parameters of interest
conditional on the draw of the nuisance or hierarchical parameter.
The general conditional simulation procedure is
1. draw $\theta_2$ (say, scale) from the marginal posterior $p(\theta_2 \mid y)$,
2. draw $\theta_1$ (say, mean) from the conditional posterior $p(\theta_1 \mid \theta_2, y)$.
Example We compare independent simulation based on marginal posteriors for the mean and variance with conditional simulation based on the marginal posterior of the variance and the conditional posterior of the mean for the Gaussian (normal) unknown mean and variance case. First, we explore informed priors, then we compare with uninformative priors. An exchangeable sample of n = 50 observations is drawn from a Gaussian (normal) distribution with mean equal to 46, a draw from the prior distribution for the mean (described below), and variance equal to 9, a draw from the prior distribution for the variance (also described below).
Informed priors The prior distribution for the mean is Gaussian with mean equal to $\mu_0 = 50$ and variance equal to $\sigma_0^2 / \kappa_0 = 18$ ($\kappa_0 = \frac{1}{2}$). The prior distribution for the variance is inverted-chi square with $\nu_0 = 5$ degrees of freedom and scale equal to $\sigma_0^2 = 9$. Then, the marginal posterior for the variance is inverted-chi square with $\nu_n = \nu_0 + n = 55$ degrees of freedom and scale equal to $\nu_n \sigma_n^2 = 45 + 49 s^2 + \frac{25}{50.5} (50 - \bar{y})^2$ where $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ depend on the sample. The conditional posterior for the mean is Gaussian with mean equal to $\mu_n = \frac{1}{50.5} (25 + 50 \bar{y})$ and variance equal to the draw from the marginal posterior for the variance scaled by $\kappa_0 + n$, $\frac{\sigma^2}{50.5}$. The marginal posterior for the mean is noncentral, scaled Student t with noncentrality parameter equal to $\mu_n = \frac{1}{50.5} (25 + 50 \bar{y})$ and scale equal to $\frac{\sigma_n^2}{50.5}$. In other words, posterior draws for the mean are $\mu = t \sqrt{\frac{\sigma_n^2}{50.5}} + \mu_n$ where t is a draw from a standard Student t(55) distribution.
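The two ways of drawing the mean can be compared in a few lines. This sketch assumes NumPy and a freshly simulated sample (the notes' actual sample is not available), so the numbers will differ from those tabulated; the prior settings follow the paragraph above, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
y = rng.normal(46.0, 3.0, size=n)          # simulated stand-in for the sample

mu0, kappa0, nu0, sigma0_sq = 50.0, 0.5, 5, 9.0
ybar, s2 = y.mean(), y.var(ddof=1)

nu_n = nu0 + n                              # 55 degrees of freedom
kappa_n = kappa0 + n                        # 50.5
mu_n = (kappa0 * mu0 + n * ybar) / kappa_n  # (25 + 50 ybar) / 50.5
nu_sigma_n = nu0 * sigma0_sq + (n - 1) * s2 + (kappa0 * n / kappa_n) * (mu0 - ybar) ** 2
sigma_n_sq = nu_sigma_n / nu_n

draws = 1000
# Marginal posterior for the variance: scaled inverted-chi square.
sigma2 = nu_sigma_n / rng.chisquare(nu_n, size=draws)
# Conditional simulation: mean given each variance draw.
mu_cond = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))
# Marginal simulation: scaled, shifted Student t draws.
mu_marg = mu_n + rng.standard_t(nu_n, size=draws) * np.sqrt(sigma_n_sq / kappa_n)

print(mu_n, mu_cond.mean(), mu_marg.mean(), sigma2.mean())
```

Both sets of draws for the mean target the same marginal posterior, so their summary statistics should be close.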
14 Predictive posterior simulation is discussed below.
Statistics for 1,000 marginal and conditional posterior draws of the mean and marginal posterior draws of the variance are tabulated below.
statistic             (µ | y)    (µ | σ², y)    (σ² | y)
mean                  45.4       45.5           9.6
median                45.4       45.5           9.4
standard deviation    0.45       0.44           1.85
maximum               46.8       46.9           21.1
minimum               44.1       43.9           5.5
quantiles:
0.01                  44.4       44.4           6.1
0.025                 44.5       44.6           6.6
0.05                  44.7       44.8           7.0
0.10                  44.9       44.8           7.0
0.25                  45.1       45.2           8.3
0.75                  45.7       45.8           10.7
0.9                   46.0       46.0           12.0
0.95                  46.2       46.2           12.8
0.975                 46.3       46.3           13.4
0.99                  46.5       46.5           14.9
Sample statistics for posterior draws based on informed priors
Clearly, marginal and conditional posterior draws for the mean are very similar, as
expected. Marginal posterior draws for the variance have more spread than those for
the mean, as expected, and all posterior draws comport well with the underlying distribution. Sorted posterior draws based on informed priors are plotted below with the
underlying parameter depicted by a horizontal line.
Posterior draws for the mean and variance based on informed priors
As the evidence and priors are largely in accord, we might expect the informed priors to reduce the spread in the posterior distributions somewhat. Below we explore
uninformed priors.
Uninformed priors The marginal posterior for the variance is inverted-chi square with n − 1 = 49 degrees of freedom and scale equal to $(n - 1) s^2 = 49 s^2$ where $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ depend on the sample. The conditional posterior for the mean is Gaussian with mean equal to $\bar{y}$ and variance equal to the draw from the marginal posterior for the variance scaled by n, $\frac{\sigma^2}{50}$. The marginal posterior for the mean is noncentral, scaled Student t with noncentrality parameter equal to $\bar{y}$ and scale equal to $\frac{s^2}{50}$. In other words, posterior draws for the mean are $\mu = t \sqrt{\frac{s^2}{50}} + \bar{y}$ where t is a draw from a standard Student t(49) distribution.
Statistics for 1,000 marginal and conditional posterior draws of the mean and marginal posterior draws of the variance are tabulated below.
statistic             (µ | y)    (µ | σ², y)    (σ² | y)
mean                  45.4       45.4           9.7
median                45.4       45.5           9.4
standard deviation    0.43       0.45           2.05
maximum               46.7       47.0           18.9
minimum               44.0       43.9           4.8
quantiles:
0.01                  44.4       44.3           6.2
0.025                 44.6       44.5           6.6
0.05                  44.7       44.7           6.9
0.10                  44.9       44.8           7.3
0.25                  45.1       45.1           8.3
0.75                  45.7       45.7           10.9
0.9                   46.0       46.0           12.4
0.95                  46.1       46.2           13.5
0.975                 46.3       46.3           14.3
0.99                  46.4       46.4           15.6
Sample statistics for posterior draws based on uninformed priors
There is remarkably little difference between the informed and uninformed posterior
draws. With a smaller sample we would expect the priors to have a more substantial
impact. Sorted posterior draws based on uninformed priors are plotted below with the
underlying parameter depicted by a horizontal line.
Posterior draws for the mean and variance based on uninformed priors
Discrepant evidence Before we leave this subsection, perhaps it is instructive to
explore the implications of discrepant evidence. That is, we investigate the case where
the evidence differs substantially from the priors. We again draw a value for $\mu$ from a Gaussian distribution with mean 50 and variance $\frac{9}{1/2}$, now the draw is $\mu = 53.1$. Then, we set the prior mean for $\mu$, $\mu_0$, equal to $50 + 6 \sqrt{\sigma_0^2 / \kappa_0} = 75.5$. Everything else remains as above. As expected, posterior draws based on uninformed priors are very similar to those reported above except with the shift in the mean for $\mu$ (see footnote 15).
Based on informed priors, the marginal posterior for the variance is inverted-chi
square with  n =  0 + n = 55 degrees of freedom and scale equal to  n  2n = 45 +
n
n
2
2
25
1
1
49s2 + 50.5
(75.5  y) where s2 = n1
i=1 (yi  y) and y = n
i=1 yi depend
on the sample. The conditional posterior for the mean is Gaussian with mean equal to
1
n = 50.5
(37.75 + 50y) and variance equal to the draw from marginal posterior for
15 To
conserve space, posterior draws based on the uninformed prior results are not reported.
45
2

the variance scaled by 0 + n, 50.5
. The marginal posterior for the mean is noncentral,
1
scaled Student t with noncentrality parameter equal to n = 50.5
(37.75 + 50y) and

 2n
 2n
scale equal to 50.5 . In other words, posterior draws for the mean are  = t 50.5
+ n
where t is a draw from a standard Student t(55) distribution.
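The three kinds of draws described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `posterior_draws`, the simulated sample, and its sampling standard deviation of 3 are hypothetical (the notes do not list the data), while the prior settings ($\mu_0 = 75.5$, $\kappa_0 = 0.5$, $\nu_0 = 5$, $\sigma_0^2 = 9$) are the ones implied by the formulas in the text.

```python
import math
import random


def posterior_draws(y, mu0, kappa0, nu0, sigma0_sq, n_draws=1000, seed=0):
    """Draw (sigma^2 | y), (mu | sigma^2, y) and (mu | y) for a Gaussian sample
    under a conjugate (informed) prior."""
    rng = random.Random(seed)
    n = len(y)
    ybar = sum(y) / n
    s_sq = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    # nu_n sigma_n^2 = nu_0 sigma_0^2 + (n-1) s^2 + kappa_0 n/(kappa_0+n) (mu_0 - ybar)^2
    nu_sigma_n_sq = (nu0 * sigma0_sq + (n - 1) * s_sq
                     + kappa0 * n / kappa_n * (mu0 - ybar) ** 2)
    sigma_n_sq = nu_sigma_n_sq / nu_n

    def chi2(df):
        # A chi-square(df) variate is Gamma(shape=df/2, scale=2).
        return rng.gammavariate(df / 2.0, 2.0)

    var_draws, cond_mean_draws, marg_mean_draws = [], [], []
    for _ in range(n_draws):
        sigma_sq = nu_sigma_n_sq / chi2(nu_n)        # inverted chi-square draw
        var_draws.append(sigma_sq)
        # conditional posterior: Gaussian with variance sigma^2 / (kappa_0 + n)
        cond_mean_draws.append(rng.gauss(mu_n, math.sqrt(sigma_sq / kappa_n)))
        # marginal posterior: mu = t * sqrt(sigma_n^2 / (kappa_0 + n)) + mu_n
        t = rng.gauss(0.0, 1.0) / math.sqrt(chi2(nu_n) / nu_n)
        marg_mean_draws.append(t * math.sqrt(sigma_n_sq / kappa_n) + mu_n)
    return mu_n, sigma_n_sq, var_draws, cond_mean_draws, marg_mean_draws


# Hypothetical sample of 50 observations centred on the drawn mean 53.1.
data_rng = random.Random(1)
sample = [data_rng.gauss(53.1, 3.0) for _ in range(50)]
mu_n, sigma_n_sq, var_d, cond_d, marg_d = posterior_draws(sample, 75.5, 0.5, 5, 9.0)
print(round(mu_n, 2), round(sum(cond_d) / len(cond_d), 2), round(sum(marg_d) / len(marg_d), 2))
```

Because the sample here is simulated, the printed values will not reproduce the tabulated statistics; the sketch only mirrors the sampling steps.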
Statistics for 1,000 marginal and conditional posterior draws of the mean and marginal posterior draws of the variance are tabulated below.

statistic             $(\mu \mid y)$   $(\mu \mid \sigma^2, y)$   $(\sigma^2 \mid y)$
mean                  53.4             53.4                       13.9
median                53.4             53.4                       13.5
standard deviation    0.52             0.54                       2.9
maximum               55.3             55.1                       26.0
minimum               51.4             51.2                       7.5
quantiles:
  0.01                52.2             52.2                       8.8
  0.025               52.5             52.4                       9.4
  0.05                52.6             52.6                       10.0
  0.10                52.8             52.8                       10.6
  0.25                53.1             53.1                       11.9
  0.75                53.8             53.8                       15.5
  0.9                 54.1             54.1                       17.6
  0.95                54.3             54.4                       19.2
  0.975               54.5             54.5                       21.1
  0.99                54.7             54.7                       23.3
Sample statistics for posterior draws based on informed priors: discrepant case
Posterior draws for $\mu$ are largely unaffected by the discrepancy between the evidence
and the prior, presumably, because the evidence dominates with a sample size of 50.
However, consistent with intuition posterior draws for the variance are skewed upward
more than previously. Sorted posterior draws based on informed priors are plotted
below with the underlying parameter depicted by a horizontal line.
Posterior draws for the mean and variance based on informed priors:
discrepant case
4.2.3
Posterior predictive distribution
As we’ve seen for independent simulation (the first example in this section), posterior predictive draws allow us to generate distributions for complex combinations of
parameters or random variables.
For independent simulation, the general procedure for generating posterior predictive draws is
1. draw $\theta_1$ from $p(\theta_1 \mid y)$,
2. draw $\theta_2$ from $p(\theta_2 \mid y)$,
3. draw $\tilde{y}$ from $p(\tilde{y} \mid \theta_1, \theta_2, y)$ where $\tilde{y}$ is the predictive random variable.
Also, posterior predictive distributions provide a diagnostic check on model specification adequacy. If sample data and posterior predictive draws are substantially different
we have evidence of model misspecification.
For conditional simulation, the general procedure for generating posterior predictive draws is
1. draw $\theta_2$ from $p(\theta_2 \mid y)$,
2. draw $\theta_1$ from $p(\theta_1 \mid \theta_2, y)$,
3. draw $\tilde{y}$ from $p(\tilde{y} \mid \theta_1, \theta_2, y)$.
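As a rough illustration of the conditional procedure, the sketch below treats $\theta_2$ as the variance and $\theta_1$ as the mean of a Gaussian sample with uninformed priors. The function name `posterior_predictive` and the simulated data are assumptions made for the example, not part of the notes.

```python
import math
import random


def posterior_predictive(y, n_draws=1000, seed=0):
    """Conditional simulation of posterior predictive draws for a Gaussian
    sample with uninformed priors (theta_2 = variance, theta_1 = mean)."""
    rng = random.Random(seed)
    n = len(y)
    ybar = sum(y) / n
    sum_sq = sum((yi - ybar) ** 2 for yi in y)          # (n - 1) s^2
    draws = []
    for _ in range(n_draws):
        # 1. draw the variance from its marginal posterior, (n-1) s^2 / chi2(n-1)
        sigma_sq = sum_sq / rng.gammavariate((n - 1) / 2.0, 2.0)
        # 2. draw the mean from its conditional posterior, N(ybar, sigma^2 / n)
        mu = rng.gauss(ybar, math.sqrt(sigma_sq / n))
        # 3. draw the predictive value from N(mu, sigma^2)
        draws.append(rng.gauss(mu, math.sqrt(sigma_sq)))
    return draws


# Hypothetical data; comparing its histogram with the predictive draws is the
# kind of specification check mentioned in the text.
data_rng = random.Random(2)
observed = [data_rng.gauss(45.0, 3.0) for _ in range(50)]
y_tilde = posterior_predictive(observed)
print(round(sum(observed) / len(observed), 2), round(sum(y_tilde) / len(y_tilde), 2))
```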
4.2.4
Bayesian linear regression
As our (ANOVA and ANCOVA) examples draw on small samples, in this subsection we
focus on a linear DGP
$$Y = X\beta + \varepsilon$$
uninformative priors for $\beta$, and known variance, $\sigma^2 = 1$. The posterior for $\beta$ is
Gaussian or normal with mean $b = (X^T X)^{-1} X^T y$ and variance $\sigma^2 (X^T X)^{-1}$
$$p(\beta \mid \sigma^2, Y, X) \propto \exp\left[-\frac{1}{2\sigma^2}(\beta - b)^T X^T X (\beta - b)\right]$$
Next, we return to the (ANOVA and ANCOVA) examples from the projections chapter
but apply Bayesian simulation.
Mean estimation As discussed in the projections chapter, we can estimate an unknown mean from a Gaussian distribution via regression where $X = \iota$ (a vector of
ones). The posterior distribution is Gaussian with mean $\bar{Y}$ and variance $\sigma^2/n$. For an
exchangeable sample, Y = {4, 6, 5}, we have $\bar{Y} = 5$ and variance $\sigma^2/n = 1/3$ (standard deviation 0.577). The
table below reports statistics from 1,000 posterior draws.


statistic             $(\beta \mid \sigma^2, Y, X)$
mean                  4.99
median                4.99
standard deviation    0.579
maximum               6.83
minimum               3.25
quantiles:
  0.01                3.62
  0.025               3.91
  0.05                4.04
  0.10                4.26
  0.25                4.57
  0.75                5.39
  0.9                 5.78
  0.95                5.92
  0.975               6.06
  0.99                6.23
Sample statistics for posterior draws of the mean
DGP: $Y = X\beta + \varepsilon$, $X = \iota$
These results correspond well with, say, a 95% classical confidence interval {3.87, 6.13}
while the 95% Bayesian posterior interval is {3.91, 6.06}.
Different means (ANOVA)
ANOVA example 1 Suppose we have exchangeable outcome or response data
(conditional on D) with two factor levels identified by D.
Y    4   6   5   11   9   10
D    0   0   0    1   1    1
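We’re interested in estimating the unknown means, or equivalently, the mean difference
conditional on D (and one of the means). We view the former DGP as
$$Y = X_1 \beta^{(1)} + \varepsilon = \beta_0 (1 - D) + \beta_1 D + \varepsilon$$
a no intercept regression where $X_1 = \begin{bmatrix} (1 - D) & D \end{bmatrix}$ and $\beta^{(1)} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$, and the
latter DGP as
$$Y = X_2 \beta^{(2)} + \varepsilon = \beta_0 + \beta_2 D + \varepsilon$$
where $X_2 = \begin{bmatrix} \iota & D \end{bmatrix}$, $\beta^{(2)} = \begin{bmatrix} \beta_0 \\ \beta_2 \end{bmatrix}$, and $\beta_2 = \beta_1 - \beta_0$.
The posterior for $\beta^{(1)}$ is Gaussian with means
$$b^{(1)} = \begin{bmatrix} \bar{Y} \mid D=0 \\ \bar{Y} \mid D=1 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \end{bmatrix}$$
and variance $\sigma^2 (X_1^T X_1)^{-1} = \begin{bmatrix} \frac{1}{3} & 0 \\ 0 & \frac{1}{3} \end{bmatrix}$.
$$p(\beta^{(1)} \mid \sigma^2, Y, X_1) \propto \exp\left[-\frac{1}{2\sigma^2}\left(\beta^{(1)} - b^{(1)}\right)^T X_1^T X_1 \left(\beta^{(1)} - b^{(1)}\right)\right]$$
While the posterior for $\beta^{(2)}$ is Gaussian with means
$$b^{(2)} = \begin{bmatrix} \bar{Y} \mid D=0 \\ \bar{Y} \mid D=1 - \bar{Y} \mid D=0 \end{bmatrix} = \begin{bmatrix} 5 \\ 5 \end{bmatrix}$$
and variance $\sigma^2 (X_2^T X_2)^{-1} = \begin{bmatrix} \frac{1}{3} & -\frac{1}{3} \\ -\frac{1}{3} & \frac{2}{3} \end{bmatrix}$.
$$p(\beta^{(2)} \mid \sigma^2, Y, X_2) \propto \exp\left[-\frac{1}{2\sigma^2}\left(\beta^{(2)} - b^{(2)}\right)^T X_2^T X_2 \left(\beta^{(2)} - b^{(2)}\right)\right]$$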
The tables below report statistics from 1,000 posterior draws for each model.

statistic             $\beta_0$   $\beta_1$   $\beta_1 - \beta_0$
mean                  4.99        10.00       5.00
median                4.99        10.00       4.99
standard deviation    0.597       0.575       0.851
maximum               6.87        11.63       7.75
minimum               2.92        8.35        2.45
quantiles:
  0.01                3.64        8.64        3.14
  0.025               3.82        8.85        3.37
  0.05                4.01        9.06        3.58
  0.10                4.23        9.26        3.90
  0.25                4.60        9.58        4.44
  0.75                5.40        10.39       5.57
  0.9                 5.76        10.73       6.09
  0.95                5.97        10.91       6.36
  0.975               6.15        11.08       6.69
  0.99                6.40        11.30       7.17
classical 95% interval:
  lower               3.87        8.87        3.40
  upper               6.13        11.13       6.60
Sample statistics for posterior draws from ANOVA example 1
DGP: $Y = X_1 \beta^{(1)} + \varepsilon$, $X_1 = \begin{bmatrix} (1 - D) & D \end{bmatrix}$

statistic             $\beta_0$   $\beta_2$   $\beta_0 + \beta_2$
mean                  5.02        4.99        10.01
median                5.02        4.97        10.02
standard deviation    0.575       0.800       0.580
maximum               6.71        7.62        12.09
minimum               3.05        2.88        7.90
quantiles:
  0.01                3.65        3.10        8.76
  0.025               3.89        3.43        8.88
  0.05                4.05        3.66        9.02
  0.10                4.29        3.91        9.22
  0.25                4.62        4.49        9.62
  0.75                5.41        5.51        10.38
  0.9                 5.73        6.02        10.75
  0.95                5.93        6.30        10.96
  0.975               6.14        6.54        11.18
  0.99                6.36        6.79        11.33
classical 95% interval:
  lower               3.87        3.40        8.87
  upper               6.13        6.60        11.13
Sample statistics for posterior draws from ANOVA example 1
DGP: $Y = X_2 \beta^{(2)} + \varepsilon$, $X_2 = \begin{bmatrix} \iota & D \end{bmatrix}$
As expected, for both models there is strong correspondence between classical confidence intervals and the posterior intervals (reported at the 95% level).
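A minimal sketch of how posterior draws like those in the second table could be generated is given below, using only the Python standard library and the data of ANOVA example 1. The variable names and the hand-coded 2 x 2 inverse and Cholesky factor are choices made for this illustration; they are not taken from the notes.

```python
import math
import random

Y = [4, 6, 5, 11, 9, 10]
D = [0, 0, 0, 1, 1, 1]
X = [[1.0, float(d)] for d in D]      # X2 = [iota  D]
sigma_sq = 1.0                        # known variance, as in the text

# Elements of X'X and X'Y for the two-column design.
xtx_00 = sum(r[0] * r[0] for r in X)
xtx_01 = sum(r[0] * r[1] for r in X)
xtx_11 = sum(r[1] * r[1] for r in X)
xty = [sum(r[0] * y for r, y in zip(X, Y)), sum(r[1] * y for r, y in zip(X, Y))]

# Closed-form inverse of the 2 x 2 matrix X'X.
det = xtx_00 * xtx_11 - xtx_01 ** 2
xtx_inv = [[xtx_11 / det, -xtx_01 / det], [-xtx_01 / det, xtx_00 / det]]

# Posterior mean b = (X'X)^{-1} X'Y and variance sigma^2 (X'X)^{-1}.
b = [xtx_inv[0][0] * xty[0] + xtx_inv[0][1] * xty[1],
     xtx_inv[1][0] * xty[0] + xtx_inv[1][1] * xty[1]]
cov = [[sigma_sq * v for v in row] for row in xtx_inv]

# Lower-triangular Cholesky factor of the posterior variance.
l11 = math.sqrt(cov[0][0])
l21 = cov[1][0] / l11
l22 = math.sqrt(cov[1][1] - l21 ** 2)

rng = random.Random(0)
draws = []
for _ in range(1000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    draws.append((b[0] + l11 * z1, b[1] + l21 * z1 + l22 * z2))

beta2_sorted = sorted(d[1] for d in draws)
print(b, cov)
print(beta2_sorted[24], beta2_sorted[974])   # rough 95% posterior interval for beta_2
```

Since the seed and generator differ from whatever produced the tables, the quantiles will agree only approximately with the tabulated ones.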
ANOVA example 2 Suppose we now have a second binary factor, W .
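Y                 4   6   5   11   9   10
D                 0   0   0    1   1    1
W                 0   1   0    1   0    0
$(D \times W)$    0   0   0    1   0    0
We’re still interested in estimating the mean differences conditional on D and W . The
DGP is
$$Y = X\beta + \varepsilon = \beta_0 + \beta_1 D + \beta_2 W + \beta_3 (D \times W) + \varepsilon$$
where $X = \begin{bmatrix} \iota & D & W & (D \times W) \end{bmatrix}$ and $\beta = \begin{bmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3 \end{bmatrix}^T$. The posterior is Gaussian
$$p(\beta \mid \sigma^2, Y, X) \propto \exp\left[-\frac{1}{2\sigma^2}(\beta - b)^T X^T X (\beta - b)\right]$$
with mean
$$b = (X^T X)^{-1} X^T Y = \begin{bmatrix} 4.5 \\ 5 \\ 1.5 \\ 0 \end{bmatrix}$$
and variance
$$\sigma^2 (X^T X)^{-1} = \begin{bmatrix} \frac{1}{2} & -\frac{1}{2} & -\frac{1}{2} & \frac{1}{2} \\ -\frac{1}{2} & 1 & \frac{1}{2} & -1 \\ -\frac{1}{2} & \frac{1}{2} & \frac{3}{2} & -\frac{3}{2} \\ \frac{1}{2} & -1 & -\frac{3}{2} & 3 \end{bmatrix}.$$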
The table below reports statistics from 1,000 posterior draws.

statistic             $\beta_0$   $\beta_1$   $\beta_2$   $\beta_3$
mean                  4.45        5.05        1.54        0.03
median                4.47        5.01        1.50        0.06
standard deviation    0.706       0.989       1.224       1.740
maximum               6.89        7.98        5.80        6.28
minimum               2.16        2.21        -2.27       -5.94
quantiles:
  0.01                2.74        2.76        -1.36       -4.05
  0.025               3.06        3.07        -0.87       -3.36
  0.05                3.31        3.47        -0.41       -2.83
  0.10                3.53        3.82        -0.00       -2.31
  0.25                3.94        4.39        0.69        -1.15
  0.75                4.93        5.72        2.35        1.13
  0.9                 5.33        6.34        3.20        2.14
  0.95                5.57        6.69        3.58        2.75
  0.975               5.78        6.99        3.91        3.41
  0.99                6.00        7.49        4.23        4.47
classical 95% interval:
  lower               3.11        3.04        -0.90       -3.39
  upper               5.89        6.96        3.90        3.39
Sample statistics for posterior draws from ANOVA example 2
DGP: $Y = X\beta + \varepsilon$, $X = \begin{bmatrix} \iota & D & W & (D \times W) \end{bmatrix}$
As expected, there is strong correspondence between the classical confidence intervals
and the posterior intervals (reported at the 95% level).
ANOVA example 3 Now, suppose the second factor, W , is perturbed slightly.
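Y                 4   6   5   11   9   10
D                 0   0   0    1   1    1
W                 0   0   1    1   0    0
$(D \times W)$    0   0   0    1   0    0
We’re still interested in estimating the mean differences conditional on D and W . The
DGP is
$$Y = X\beta + \varepsilon = \beta_0 + \beta_1 D + \beta_2 W + \beta_3 (D \times W) + \varepsilon$$
where $X = \begin{bmatrix} \iota & D & W & (D \times W) \end{bmatrix}$ and $\beta = \begin{bmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3 \end{bmatrix}^T$. The posterior is Gaussian
$$p(\beta \mid \sigma^2, Y, X) \propto \exp\left[-\frac{1}{2\sigma^2}(\beta - b)^T X^T X (\beta - b)\right]$$
with mean
$$b = (X^T X)^{-1} X^T Y = \begin{bmatrix} 5 \\ 4.5 \\ 0 \\ 1.5 \end{bmatrix}$$
and variance
$$\sigma^2 (X^T X)^{-1} = \begin{bmatrix} \frac{1}{2} & -\frac{1}{2} & -\frac{1}{2} & \frac{1}{2} \\ -\frac{1}{2} & 1 & \frac{1}{2} & -1 \\ -\frac{1}{2} & \frac{1}{2} & \frac{3}{2} & -\frac{3}{2} \\ \frac{1}{2} & -1 & -\frac{3}{2} & 3 \end{bmatrix}.$$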
The table below reports statistics from 1,000 posterior draws.

statistic             $\beta_0$   $\beta_1$   $\beta_2$   $\beta_3$
mean                  5.01        4.49        0.01        1.50
median                5.00        4.49        0.03        1.52
standard deviation    0.700       0.969       1.224       1.699
maximum               7.51        7.45        3.91        8.35
minimum               2.97        1.61        -5.23       -3.34
quantiles:
  0.01                3.46        2.28        -2.84       -2.36
  0.025               3.73        2.67        -2.30       -1.82
  0.05                3.87        2.93        -1.98       -1.18
  0.10                4.08        3.25        -1.56       -0.62
  0.25                4.50        3.79        -0.80       0.37
  0.75                5.50        5.14        0.82        2.60
  0.9                 5.93        5.76        1.51        3.61
  0.95                6.11        6.09        2.00        4.23
  0.975               6.29        6.27        2.36        4.74
  0.99                6.51        6.58        2.93        5.72
classical 95% interval:
  lower               3.61        2.54        -2.40       -1.89
  upper               6.39        6.46        2.40        4.89
Sample statistics for posterior draws from ANOVA example 3
DGP: $Y = X\beta + \varepsilon$, $X = \begin{bmatrix} \iota & D & W & (D \times W) \end{bmatrix}$
As expected, there is strong correspondence between the classical confidence intervals
and the posterior intervals (reported at the 95% level).
ANOVA example 4 Now, suppose the second factor, W , is again perturbed.
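Y                 4   6   5   11   9   10
D                 0   0   0    1   1    1
W                 0   1   1    0   0    1
$(D \times W)$    0   0   0    0   0    1
We’re again interested in estimating the mean differences conditional on D and W .
The DGP is
$$Y = X\beta + \varepsilon = \beta_0 + \beta_1 D + \beta_2 W + \beta_3 (D \times W) + \varepsilon$$
where $X = \begin{bmatrix} \iota & D & W & (D \times W) \end{bmatrix}$ and $\beta = \begin{bmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3 \end{bmatrix}^T$. The posterior is Gaussian
$$p(\beta \mid \sigma^2, Y, X) \propto \exp\left[-\frac{1}{2\sigma^2}(\beta - b)^T X^T X (\beta - b)\right]$$
with mean
$$b = (X^T X)^{-1} X^T Y = \begin{bmatrix} 4 \\ 6 \\ 1.5 \\ -1.5 \end{bmatrix}$$
and variance
$$\sigma^2 (X^T X)^{-1} = \begin{bmatrix} 1 & -1 & -1 & 1 \\ -1 & 1.5 & 1 & -1.5 \\ -1 & 1 & 1.5 & -1.5 \\ 1 & -1.5 & -1.5 & 3 \end{bmatrix}.$$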
The table below reports statistics from 1,000 posterior draws.

statistic             $\beta_0$   $\beta_1$   $\beta_2$   $\beta_3$
mean                  3.98        6.01        1.53        -1.46
median                3.95        6.01        1.56        -1.51
standard deviation    1.014       1.225       1.208       1.678
maximum               6.72        9.67        6.40        4.16
minimum               0.71        2.61        -1.78       -6.50
quantiles:
  0.01                1.69        3.28        -1.06       -5.24
  0.025               2.02        3.69        -0.84       -4.73
  0.05                2.37        3.94        -0.53       -4.26
  0.10                2.70        4.34        -0.05       -3.60
  0.25                3.27        5.19        0.69        -2.59
  0.75                4.72        6.81        2.36        -0.27
  0.9                 5.28        7.58        3.09        0.73
  0.95                5.64        8.08        3.45        1.31
  0.975               5.94        8.35        3.81        1.75
  0.99                6.36        8.79        4.21        2.20
classical 95% interval:
  lower               2.04        3.60        -0.90       -4.89
  upper               5.96        8.40        3.90        1.89
Sample statistics for posterior draws from ANOVA example 4
DGP: $Y = X\beta + \varepsilon$, $X = \begin{bmatrix} \iota & D & W & (D \times W) \end{bmatrix}$
As expected, there is strong correspondence between the classical confidence intervals
and the posterior intervals (reported at the 95% level).
Simple regression Suppose we don’t observe treatment, D, but rather observe only
regressor, X1 , in combination with outcome, Y .
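Y       4   6   5   11    9   10
$X_1$  -1   1   0    1   -1    0
For the perceived DGP
$$Y = \beta_0 + \beta_1 X_1 + \varepsilon$$
with $\varepsilon \sim N(0, \sigma^2 I)$ ($\sigma^2$ known) the posterior distribution for $\beta$ is
$$p(\beta \mid Y, X) \propto \exp\left[-\frac{1}{2\sigma^2}(\beta - b)^T X^T X (\beta - b)\right]$$
where $X = \begin{bmatrix} \iota & X_1 \end{bmatrix}$, $b = (X^T X)^{-1} X^T Y = \begin{bmatrix} 7.5 \\ 1 \end{bmatrix}$, and $\sigma^2 (X^T X)^{-1} = \begin{bmatrix} \frac{1}{6} & 0 \\ 0 & \frac{1}{4} \end{bmatrix}$.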
Sample statistics for 1,000 posterior draws are tabulated below.

statistic             $\beta_0$   $\beta_1$
mean                  7.48        0.98
median                7.49        1.01
standard deviation    0.386       0.482
maximum               8.67        2.39
minimum               6.18        -0.78
quantiles:
  0.01                6.55        -0.14
  0.025               6.71        0.08
  0.05                6.85        0.16
  0.10                6.99        0.34
  0.25                7.22        0.65
  0.75                7.74        1.29
  0.9                 7.98        1.60
  0.95                8.09        1.76
  0.975               8.22        1.88
  0.99                8.33        2.10
classical 95% interval:
  lower               6.20        0.02
  upper               7.80        1.98
Sample statistics for posterior draws from simple regression example
DGP: $Y = X\beta + \varepsilon$, $X = \begin{bmatrix} \iota & X_1 \end{bmatrix}$
Classical confidence intervals and Bayesian posterior intervals are similar, as expected.
ANCOVA example Suppose in addition to the regressor, X1 , we observe treatment,
D, in combination with outcome, Y .
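Y       4   6   5   11    9   10
$X_1$  -1   1   0    1   -1    0
D       0   0   0    1    1    1
For the perceived DGP
$$Y = \beta_0 + \beta_1 D + \beta_2 X_1 + \beta_3 (D \times X_1) + \varepsilon$$
with $\varepsilon \sim N(0, \sigma^2 I)$ ($\sigma^2$ known) the posterior distribution for $\beta$ is
$$p(\beta \mid Y, X) \propto \exp\left[-\frac{1}{2\sigma^2}(\beta - b)^T X^T X (\beta - b)\right]$$
where $X = \begin{bmatrix} \iota & D & X_1 & (D \times X_1) \end{bmatrix}$, $b = (X^T X)^{-1} X^T Y = \begin{bmatrix} 5 \\ 5 \\ 1 \\ 0 \end{bmatrix}$, and
$$\sigma^2 (X^T X)^{-1} = \begin{bmatrix} \frac{1}{3} & -\frac{1}{3} & 0 & 0 \\ -\frac{1}{3} & \frac{2}{3} & 0 & 0 \\ 0 & 0 & \frac{1}{2} & -\frac{1}{2} \\ 0 & 0 & -\frac{1}{2} & 1 \end{bmatrix}$$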
Sample statistics for 1,000 posterior draws are tabulated below.

statistic             $\beta_0$   $\beta_1$   $\beta_2$   $\beta_3$
mean                  5.00        4.99        1.02        0.01
median                5.02        5.02        1.02        0.01
standard deviation    0.588       0.802       0.697       1.00
maximum               7.23        7.96        3.27        3.21
minimum               3.06        2.69        -1.33       -3.31
quantiles:
  0.01                3.62        3.15        -0.57       -2.34
  0.025               3.74        3.38        -0.41       -2.03
  0.05                4.02        3.65        -0.10       -1.66
  0.10                4.28        3.98        0.14        -1.25
  0.25                4.60        4.47        0.54        -0.66
  0.75                5.34        5.53        1.48        0.65
  0.9                 5.73        5.96        1.91        1.25
  0.95                5.97        6.30        2.16        1.68
  0.975               6.20        6.50        2.37        2.07
  0.99                6.49        6.93        2.72        2.36
classical 95% interval:
  lower               3.87        3.40        -0.39       -1.96
  upper               6.13        6.60        2.39        1.96
Sample statistics for posterior draws from the ANCOVA example
DGP: $Y = X\beta + \varepsilon$, $X = \begin{bmatrix} \iota & D & X_1 & (D \times X_1) \end{bmatrix}$
Classical confidence intervals and Bayesian posterior intervals are similar, as expected.
4.3
McMC (Markov chain Monte Carlo) simulation
Markov chain Monte Carlo (McMC) simulations can be employed when the marginal posterior distributions cannot be derived or are extremely cumbersome to derive.
McMC approaches draw from the set of conditional posterior distributions instead of
the marginal posterior distributions. The utility of McMC simulation has evolved along
with the R Foundation for Statistical Computing.
Before discussing standard algorithms (the Gibbs sampler and Metropolis-Hastings)
we briefly review some important concepts associated with Markov chains and attempt
to develop some intuition regarding their effective usage. The objective is to be able
to generate draws from a stationary posterior distribution, which we denote $\pi$, but which we are
unable to access directly. To explore how Markov chains help us access $\pi$, we begin
with discrete Markov chains then connect to continuous chains.
Discrete state spaces Let $S = \{\theta^1, \theta^2, \ldots, \theta^d\}$ be a discrete state space. A
Markov chain is the sequence of random variables, $\{\theta_1, \theta_2, \ldots, \theta_r, \ldots\}$ given $\theta_0$, generated by the following transition
$$p_{ij} \equiv \Pr\left(\theta_{r+1} = \theta^j \mid \theta_r = \theta^i\right)$$
The Markov property says that transition to $\theta_{r+1}$ only depends on the immediate past
history, $\theta_r$, and not all history. Define a Markov transition matrix, $P = [p_{ij}]$, where
the rows denote initial states and the columns denote transition states such that, for
example, $p_{ii}$ is the likelihood of beginning in state i and remaining in state i.
Now, relate this Markov chain idea to distributions from which random variables
are drawn. Say, the initial value, $\theta_0$, is drawn from $\pi_0$. Then, the distribution for $\theta_1$
given $\theta_0 \sim \pi_0$ is
$$\pi_{1j} \equiv \Pr\left(\theta_1 = \theta^j\right) = \sum_{i=1}^{d} \Pr\left(\theta_0 = \theta^i\right) p_{ij} = \sum_{i=1}^{d} \pi_{0i}\, p_{ij}, \quad j = 1, 2, \ldots, d$$
In matrix notation, the above is
$$\pi_1 = \pi_0 P$$
and after r iterations we have
$$\pi_r = \pi_0 P^r$$
As the number of iterations increases, we expect the effect of the initial distribution,
$\pi_0$, to die out so long as the chain does not get trapped.
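The recursion $\pi_r = \pi_0 P^r$ is easy to check numerically. The sketch below uses a hypothetical three-state transition matrix (not from the notes) with strictly positive entries and shows that two very different initial distributions lead to essentially the same distribution after many iterations.

```python
def step(pi, P):
    """One iteration of the chain in distribution form: returns pi P."""
    d = len(P)
    return [sum(pi[i] * P[i][j] for i in range(d)) for j in range(d)]


def distribution_after(pi0, P, r):
    """Distribution after r transitions, pi_r = pi_0 P^r."""
    pi = list(pi0)
    for _ in range(r):
        pi = step(pi, P)
    return pi


# Hypothetical transition matrix: rows are initial states and sum to one.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

from_state_1 = distribution_after([1.0, 0.0, 0.0], P, 100)
from_state_3 = distribution_after([0.0, 0.0, 1.0], P, 100)
print([round(p, 4) for p in from_state_1])
print([round(p, 4) for p in from_state_3])
```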
Irreducibility and stationarity The idea of no absorbing states or states in which
the chain gets trapped is called irreducibility. This is key to our construction of Markov
chains. If $p_{ij} > 0$ (strictly positive) for all i, j, then the chain is irreducible and there
exists a stationary distribution, $\pi$, such that
$$\lim_{r \to \infty} \pi_0 P^r = \pi$$
and
$$\pi P = \pi$$
Since the elements are all positive and each row sums to one, the maximum eigenvalue
of $P^T$ is one and its corresponding eigenvector and vector from the inverse of the matrix of eigenvectors determine $\pi$. By eigenvalue (spectral) decomposition, $P^T = S \Lambda S^{-1}$ where
S is a matrix of eigenvectors and $\Lambda$ is a diagonal matrix of corresponding eigenvalues,
$(P^T)^r = S \Lambda^r S^{-1}$ since
$$(P^T)^r = S \Lambda S^{-1} S \Lambda S^{-1} \cdots S \Lambda S^{-1} = S \Lambda^r S^{-1}$$
This implies the long-run steady-state is determined by the largest eigenvalue (if $\max \lambda = 1$)
and in the direction of its eigenvector and inverse eigenvector (if the remaining $|\lambda_i| < 1$
then $\lambda_i^r$ goes to zero and their corresponding eigenvectors’ influence on direction
dies out). That is,
$$(P^T)^r \to S_1 1^r S_1^{-1}$$
where $S_1$ denotes the eigenvector (column vector) corresponding to the unit eigenvalue
and $S_1^{-1}$ denotes the corresponding inverse eigenvector (row vector). Since one is
the largest eigenvalue of $P^T$, after a large number of iterations $\pi_0 P^r$ converges to
$1 \times \pi = \pi$. Hence, after many iterations the Markov chain produces draws from a
stationary distribution if the chain is irreducible.
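A two-state chain makes the eigenvalue argument concrete. The example below is an illustration added here, with hypothetical transition probabilities $0 < a, b < 1$; it is not part of the original notes.

```latex
P = \begin{bmatrix} 1-a & a \\ b & 1-b \end{bmatrix},
\qquad \lambda_1 = 1, \quad \lambda_2 = 1 - a - b, \quad |\lambda_2| < 1

P^r = \frac{1}{a+b}\begin{bmatrix} b & a \\ b & a \end{bmatrix}
    + \frac{(1-a-b)^r}{a+b}\begin{bmatrix} a & -a \\ -b & b \end{bmatrix}

\lim_{r \to \infty} \pi_0 P^r
  = \begin{bmatrix} \dfrac{b}{a+b} & \dfrac{a}{a+b} \end{bmatrix} = \pi
  \quad \text{for any initial distribution } \pi_0
```

Setting r = 1 recovers P, and the second term vanishes at the geometric rate $(1-a-b)^r$, so the second eigenvalue governs how quickly the influence of $\pi_0$ dies out.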
Time reversibility and stationarity An equivalent property, time reversibility, is
more useful when working with more general state space chains. Time reversibility
says that if we reverse the order of a Markov chain, the resulting chain has the same
transition behavior. First, we show the reverse chain is Markovian if the forward chain
is Markovian, then we relate the forward and reverse chain transition probabilities, and
finally, we show that time reversibility implies $\pi_i p_{ij} = \pi_j p_{ji}$ and this implies $\pi P = \pi$
where $\pi$ is the stationary distribution for the chain. The reverse transition probability
(by Bayesian "updating") is
$$p^*_{ij} \equiv \Pr\left(\theta_r = \theta^j \mid \theta_{r+1} = \theta^{i_1}, \theta_{r+2} = \theta^{i_2}, \ldots, \theta_{r+T} = \theta^{i_T}\right)$$
$$= \frac{\Pr\left(\theta_r = \theta^j, \theta_{r+1} = \theta^{i_1}, \theta_{r+2} = \theta^{i_2}, \ldots, \theta_{r+T} = \theta^{i_T}\right)}{\Pr\left(\theta_{r+1} = \theta^{i_1}, \theta_{r+2} = \theta^{i_2}, \ldots, \theta_{r+T} = \theta^{i_T}\right)}$$
$$= \frac{\Pr\left(\theta_r = \theta^j\right) \Pr\left(\theta_{r+1} = \theta^{i_1} \mid \theta_r = \theta^j\right) \Pr\left(\theta_{r+2} = \theta^{i_2}, \ldots, \theta_{r+T} = \theta^{i_T} \mid \theta_r = \theta^j, \theta_{r+1} = \theta^{i_1}\right)}{\Pr\left(\theta_{r+1} = \theta^{i_1}\right) \Pr\left(\theta_{r+2} = \theta^{i_2}, \ldots, \theta_{r+T} = \theta^{i_T} \mid \theta_{r+1} = \theta^{i_1}\right)}$$
Since the forward chain is Markovian, this simplifies as
$$p^*_{ij} = \frac{\Pr\left(\theta_r = \theta^j\right) \Pr\left(\theta_{r+1} = \theta^{i_1} \mid \theta_r = \theta^j\right) \Pr\left(\theta_{r+2} = \theta^{i_2}, \ldots, \theta_{r+T} = \theta^{i_T} \mid \theta_{r+1} = \theta^{i_1}\right)}{\Pr\left(\theta_{r+1} = \theta^{i_1}\right) \Pr\left(\theta_{r+2} = \theta^{i_2}, \ldots, \theta_{r+T} = \theta^{i_T} \mid \theta_{r+1} = \theta^{i_1}\right)}$$
$$= \frac{\Pr\left(\theta_r = \theta^j\right) \Pr\left(\theta_{r+1} = \theta^{i_1} \mid \theta_r = \theta^j\right)}{\Pr\left(\theta_{r+1} = \theta^{i_1}\right)}$$
The reverse chain is Markovian.
Let $P^*$ represent the transition matrix for the reverse chain then the above says
$$p^*_{ij} = \frac{\pi_j p_{ji}}{\pi_i}$$
By definition, time reversibility implies $p^*_{ij} = p_{ij}$. Hence, time reversibility implies
$$\pi_i p_{ij} = \pi_j p_{ji}$$
Time reversibility says the likelihood of being in state i and transitioning to j is equal to the
likelihood of being in state j and transitioning to i.
The above implies if a chain is reversible with respect to a distribution $\pi$ then $\pi$ is
the stationary distribution of the chain. To see this sum both sides of the above relation
over i
$$\sum_i \pi_i p_{ij} = \sum_i \pi_j p_{ji} = \pi_j \sum_i p_{ji} = \pi_j \times 1, \quad j = 1, 2, \ldots, d$$
In matrix notation, we have
$$\pi P = \pi$$
$\pi$ is the stationary distribution of the chain.
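To see that reversibility delivers stationarity, one can build a reversible chain and check both properties numerically. The construction below is a discrete Metropolis chain with a uniform proposal over the other states; the target distribution and the function name `metropolis_matrix` are hypothetical choices for this sketch.

```python
def metropolis_matrix(pi):
    """Transition matrix of a Metropolis chain on d states with a uniform
    (symmetric) proposal over the other states and target distribution pi."""
    d = len(pi)
    P = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            if i != j:
                # propose j with probability 1/(d-1), accept with min(1, pi_j/pi_i)
                P[i][j] = (1.0 / (d - 1)) * min(1.0, pi[j] / pi[i])
        P[i][i] = 1.0 - sum(P[i])      # rejected proposals leave the chain at i
    return P


target = [0.2, 0.3, 0.5]               # hypothetical stationary distribution
P = metropolis_matrix(target)
d = len(target)

# Time reversibility: pi_i p_ij = pi_j p_ji for every pair of states.
reversible = all(abs(target[i] * P[i][j] - target[j] * P[j][i]) < 1e-12
                 for i in range(d) for j in range(d))

# Stationarity: the j-th element of pi P equals pi_j.
pi_P = [sum(target[i] * P[i][j] for i in range(d)) for j in range(d)]
print(reversible, [round(v, 6) for v in pi_P])
```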
Continuous state spaces Continuous state spaces are analogous to discrete state
spaces but with a few additional technical details. Transition probabilities are defined
in reference to sets rather than the singletons $\theta^i$. For example, for a set $A \subseteq \Theta$ the
chain is defined in terms of the probabilities of the set given the value of the chain on
the previous iteration, $\theta$. That is, the kernel of the chain, $K(\theta, A)$, is the probability of
set A given the chain is at $\theta$ where
$$K(\theta, A) = \int_A p(\theta, \phi)\, d\phi$$
$p(\theta, \phi)$ is a density function with given $\theta$ and $p(\cdot, \cdot)$ is the transition or generator
function of the kernel.
An invariant or stationary distribution with density $\pi(\cdot)$ implies
$$\int_A \pi(\phi)\, d\phi = \int_\Theta K(\theta, A)\, \pi(\theta)\, d\theta = \int_\Theta \left[\int_A p(\theta, \phi)\, d\phi\right] \pi(\theta)\, d\theta$$
Time reversibility in the continuous space case implies
$$\pi(\theta)\, p(\theta, \phi) = \pi(\phi)\, p(\phi, \theta)$$
And, irreducibility in the continuous state case is satisfied for a chain with kernel K
with respect to $\pi(\cdot)$ if every set A with positive probability under $\pi$ can be reached with
positive probability after a finite number of iterations. In other words, if $\int_A \pi(\theta)\, d\theta > 0$
then there exists $n \geq 1$ such that $K^n(\theta, A) > 0$. With continuous state spaces,
irreducibility and time reversibility produce a stationary distribution of the chain as
with discrete state spaces.
Next, we briefly discuss application of these Markov chain concepts to two popular McMC strategies: the Gibbs sampler, and Metropolis-Hastings (MH) algorithm.
The Gibbs sampler is a special case of MH and somewhat simpler so we review it first.
4.3.1
Gibbs sampler
Suppose we cannot derive $p(\theta \mid Y)$ in closed form (it does not have a standard probability distribution) but we are able to identify the set of conditional posterior distributions. We can utilize the set of full conditional posterior distributions to draw dependent
samples for parameters of interest via McMC simulation.
For the full set of conditional posterior distributions
$$p(\theta_1 \mid \theta_{-1}, Y)$$
$$\vdots$$
$$p(\theta_k \mid \theta_{-k}, Y)$$
draws are made for $\theta_1$ conditional on starting values for parameters other than $\theta_1$,
that is $\theta_{-1}$. Then, $\theta_2$ is drawn conditional on the $\theta_1$ draw and the starting values for
the remaining $\theta$. Next, $\theta_3$ is drawn conditional on the draws for $\theta_1$ and $\theta_2$ and the
starting values for the remaining $\theta$. This continues until all $\theta$ have been sampled. Then
the sampling is repeated for a large number of draws with parameters updated each
iteration by the most recent draw.
For example, the procedure for a Gibbs sampler involving two parameters is
1. select a starting value for $\theta_2$,
2. draw $\theta_1$ from $p(\theta_1 \mid \theta_2, y)$ utilizing the starting value for $\theta_2$,
3. draw $\theta_2$ from $p(\theta_2 \mid \theta_1, y)$ utilizing the previous draw for $\theta_1$,
4. repeat until a converged sample based on the marginal posteriors is obtained.
The samples are dependent. Not all samples will be from the posterior; only after a finite (but unknown) number of iterations are draws from the marginal posterior distribution (see Gelfand and Smith [1990]). (Note, in general, $p(\theta_1, \theta_2 \mid Y) \neq p(\theta_1 \mid \theta_2, Y)\, p(\theta_2 \mid \theta_1, Y)$.) Convergence is usually checked using trace plots, burn-in iterations, and other convergence diagnostics. Model specification includes convergence checks, sensitivity to starting values and possibly prior distribution and likelihood assignments, comparison of draws from the posterior predictive distribution with
the observed sample, and various goodness of fit statistics.
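The two-parameter procedure can be sketched with a target whose full conditionals are known exactly. The example below assumes a standard bivariate normal "posterior" with correlation $\rho$ (a hypothetical stand-in chosen because $\theta_1 \mid \theta_2 \sim N(\rho\theta_2, 1 - \rho^2)$ and symmetrically for $\theta_2$); the function names are likewise illustrative.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_draws=20000, burn_in=1000, theta2_start=5.0, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)
    theta2 = theta2_start                        # 1. starting value for theta_2
    draws = []
    for k in range(burn_in + n_draws):
        theta1 = rng.gauss(rho * theta2, cond_sd)   # 2. draw theta_1 | theta_2
        theta2 = rng.gauss(rho * theta1, cond_sd)   # 3. draw theta_2 | theta_1
        if k >= burn_in:                            # discard burn-in iterations
            draws.append((theta1, theta2))
    return draws


def sample_correlation(draws):
    n = len(draws)
    m1 = sum(d[0] for d in draws) / n
    m2 = sum(d[1] for d in draws) / n
    s11 = sum((d[0] - m1) ** 2 for d in draws)
    s22 = sum((d[1] - m2) ** 2 for d in draws)
    s12 = sum((d[0] - m1) * (d[1] - m2) for d in draws)
    return s12 / math.sqrt(s11 * s22)


gibbs_draws = gibbs_bivariate_normal(rho=0.6)
print(round(sample_correlation(gibbs_draws), 3))
```

The deliberately poor starting value (5.0) is absorbed by the burn-in; plotting the retained draws against the iteration index gives the trace plot mentioned above.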
4.3.2
Metropolis-Hastings algorithm
If neither some conditional posterior, $p(\theta_j \mid Y, \theta_{-j})$, nor its marginal posterior, $p(\theta \mid Y)$,
is recognizable, then we may be able to employ the Metropolis-Hastings (MH) algorithm. The Gibbs sampler is a special case of the MH algorithm. The random walk
Metropolis algorithm is most common and outlined next.
We wish to draw from $p(\theta \mid \cdot)$ but we only know $p(\theta \mid \cdot)$ up to a constant of proportionality, $p(\theta \mid \cdot) = c f(\theta \mid \cdot)$ where c is unknown. The random walk Metropolis
algorithm for one parameter is as follows.
1. Let $\theta^{(k-1)}$ be a draw from $p(\theta \mid \cdot)$.
2. Draw $\theta^*$ from $N(\theta^{(k-1)}, s^2)$ where $s^2$ is fixed.
3. Let $\alpha = \min\left\{1, \frac{p(\theta^* \mid \cdot)}{p(\theta^{(k-1)} \mid \cdot)} = \frac{c f(\theta^* \mid \cdot)}{c f(\theta^{(k-1)} \mid \cdot)}\right\}$.
4. Draw $z^*$ from U (0, 1).
5. If $z^* < \alpha$ then $\theta^{(k)} = \theta^*$, otherwise $\theta^{(k)} = \theta^{(k-1)}$. In other words, with
probability $\alpha$ set $\theta^{(k)} = \theta^*$, and otherwise set $\theta^{(k)} = \theta^{(k-1)}$.16
These draws converge to random draws from the marginal posterior distribution after a
burn-in interval if properly tuned.
Tuning the Metropolis algorithm involves selecting $s^2$ (jump size) so that the parameter space is explored appropriately. Usually, smaller jump size results in more
accepts and larger jump size results in fewer accepts. If $s^2$ is too small, the Markov
chain will not converge quickly, has more serial correlation in the draws, and may get
stuck at a local mode (multi-modality can be a problem). If $s^2$ is too large, proposals
overshoot, most are rejected, and the chain is not able to thoroughly explore areas of high
posterior probability. Of course, we desire concentrated samples from the posterior
distribution. A commonly-employed rule of thumb is to target an acceptance rate for
$\theta^*$ around 30% (20-80% is usually considered “reasonable”).17
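The one-parameter algorithm and its tuning can be sketched as follows. The code works with log f, so the unknown constant c never appears and the accept step compares log z with min{0, log f(proposal) - log f(current)}. The target (an unnormalized Gaussian centred at 2), the jump size s = 2.4, and the function name `rw_metropolis` are assumptions for illustration only.

```python
import math
import random


def rw_metropolis(log_f, theta0, s, n_draws, seed=0):
    """Random walk Metropolis for one parameter; log_f is the log of the
    target density known only up to a constant of proportionality."""
    rng = random.Random(seed)
    theta, log_f_theta = theta0, log_f(theta0)
    draws, accepts = [], 0
    for _ in range(n_draws):
        proposal = rng.gauss(theta, s)                  # theta* ~ N(theta^(k-1), s^2)
        log_f_prop = log_f(proposal)
        log_alpha = min(0.0, log_f_prop - log_f_theta)  # the constant c cancels
        # accept with probability alpha: compare log z with log alpha, z ~ U(0, 1]
        if math.log(1.0 - rng.random()) <= log_alpha:
            theta, log_f_theta = proposal, log_f_prop
            accepts += 1
        draws.append(theta)
    return draws, accepts / n_draws


def log_target(theta):
    # hypothetical unnormalized log density: Gaussian with mean 2 and variance 1
    return -0.5 * (theta - 2.0) ** 2


mh_draws, accept_rate = rw_metropolis(log_target, theta0=0.0, s=2.4, n_draws=20000)
kept = mh_draws[1000:]                                  # drop a burn-in interval
print(round(sum(kept) / len(kept), 2), round(accept_rate, 2))
```

Re-running with a much smaller or much larger s and watching the acceptance rate is a quick way to get a feel for the tuning trade-off described above.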
The above procedure describes the algorithm for a single parameter or vector of
parameters. A general K parameter Metropolis-Hastings algorithm works similarly
(see Train [2002], p. 305).
1. Start with a value $\beta_n^0$.
2. Draw K independent values from a standard normal density, and stack the draws
into a vector labeled $\eta^1$.
3. Create a trial value of $\tilde{\beta}_n^1 = \beta_n^0 + \rho L \eta^1$ where $\rho$ is the researcher-chosen jump
size parameter, L is the Cholesky factor of W such that $L L^T = W$. Note the proposal
distribution is specified to be normal with zero mean and variance $\rho^2 W$.
4. Draw a standard uniform variable $\mu^1$.
5. Calculate the ratio $F = \frac{L(y_n \mid \tilde{\beta}_n^1)\, \phi(\tilde{\beta}_n^1 \mid b, W)}{L(y_n \mid \beta_n^0)\, \phi(\beta_n^0 \mid b, W)}$ where $L(y_n \mid \tilde{\beta}_n^1)$ is a product of
logits, and $\phi(\tilde{\beta}_n^1 \mid b, W)$ is the normal density.
6. If $\mu^1 \leq F$, accept $\tilde{\beta}_n^1$ and let $\beta_n^1 = \tilde{\beta}_n^1$; if $\mu^1 > F$, reject $\tilde{\beta}_n^1$ and let $\beta_n^1 = \beta_n^0$.
7. Repeat the process many times, adjusting the tuning parameters if necessary. For
sufficiently large t, $\beta_n^t$ is a draw from the marginal posterior.
4.3.3
Missing data augmentation
One of the many strengths of McMC approaches is their flexibility for dealing with
missing data. Missing data is a common characteristic plaguing limited dependent
variable models like discrete choice and selection. As a prime example, we next discuss
Albert and Chib’s McMC data augmentation approach to discrete choice modeling.
Later we’ll explore McMC data augmentation of selection models.
16 A modification of the RW Metropolis algorithm sets $\theta^{(k)} = \theta^*$ with log($\alpha$) probability where $\alpha = \min\{0, \log[f(\theta^* \mid \cdot)] - \log[f(\theta^{(k-1)} \mid \cdot)]\}$.
17 Gelman, et al [2004] report the optimal acceptance rate is 0.44 when the number of parameters K = 1
and drops toward 0.23 as K increases.
Albert and Chib’s Gibbs sampler Bayes’ probit The challenge with discrete choice
models (like probit) is that latent utility is unobservable, rather the analyst observes
only discrete (usually binary) choices.18 Albert & Chib [1993] employ Bayesian data
augmentation to “supply” the latent variable. Hence, parameters of a probit model
are estimated via normal Bayesian regression (see earlier discussion in this chapter).
Consider the latent utility model
$$U_D = W\theta - V$$
where binary choice, D, is observed.
$$D = \begin{cases} 1 & U_D > 0 \\ 0 & U_D < 0 \end{cases}$$
The conditional posterior distribution for $\theta$ is
$$p(\theta \mid D, W, U_D) \sim N\left(b_1, \left(Q^{-1} + W^T W\right)^{-1}\right)$$
where
$$b_1 = \left(Q^{-1} + W^T W\right)^{-1}\left(Q^{-1} b_0 + W^T W b\right)$$
$$b = \left(W^T W\right)^{-1} W^T U_D$$
$b_0$ = prior means for $\theta$ and $Q = \left(W_0^T W_0\right)^{-1}$ is the prior for the covariance. The
conditional posterior distributions for the latent variables are
$$p(U_D \mid D = 1, W, \theta) \sim N(W\theta, I \mid U_D > 0) \text{ or } TN_{(0,\infty)}(W\theta, I)$$
$$p(U_D \mid D = 0, W, \theta) \sim N(W\theta, I \mid U_D \leq 0) \text{ or } TN_{(-\infty,0)}(W\theta, I)$$
where $TN(\cdot)$ refers to random draws from a truncated normal (truncated below for
the first and truncated above for the second). Iterative draws for $(U_D \mid D, W, \theta)$ and
$(\theta \mid D, W, U_D)$ form the Gibbs sampler. Interval estimates of $\theta$ are supplied by postconvergence draws of $(\theta \mid D, W, U_D)$. For simulated normal draws of the unobservable
portion of utility, V , this Bayes’ augmented data probit produces remarkably similar
inferences to MLE.19
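The alternation between latent utilities and the parameter can be sketched in a deliberately simplified setting: a single regressor, no intercept, and a flat prior (so $Q^{-1} = 0$ and $b_1 = b$). These simplifications, the simulated data, and the function names are assumptions for this illustration; it is not the bayesm routine mentioned in the notes.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def truncated_normal(mean, positive, rng):
    """Inverse-CDF draw from N(mean, 1) truncated to (0, inf) when positive
    is True and to (-inf, 0] otherwise."""
    p_below_zero = STD_NORMAL.cdf(-mean)           # Pr(U <= 0) for U ~ N(mean, 1)
    u = rng.random()
    p = p_below_zero + u * (1.0 - p_below_zero) if positive else u * p_below_zero
    p = min(max(p, 1e-12), 1.0 - 1e-12)            # keep inv_cdf away from 0 and 1
    return mean + STD_NORMAL.inv_cdf(p)


def probit_gibbs(D, W, n_draws=1000, burn_in=300, seed=0):
    """Data augmented Gibbs sampler for a one-regressor probit, flat prior."""
    rng = random.Random(seed)
    wtw = sum(w * w for w in W)
    theta, draws = 0.0, []
    for k in range(burn_in + n_draws):
        # draw latent utilities U_D | D, W, theta from truncated normals
        U = [truncated_normal(w * theta, d == 1, rng) for d, w in zip(D, W)]
        # draw theta | D, W, U_D from N(b, (W'W)^{-1}) with b = (W'W)^{-1} W'U_D
        b = sum(w * u for w, u in zip(W, U)) / wtw
        theta = rng.gauss(b, math.sqrt(1.0 / wtw))
        if k >= burn_in:
            draws.append(theta)
    return draws


# Hypothetical data simulated from U_D = W theta - V with theta = 1.
data_rng = random.Random(1)
W_sim = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
D_sim = [1 if w * 1.0 - data_rng.gauss(0.0, 1.0) > 0 else 0 for w in W_sim]
theta_draws = probit_gibbs(D_sim, W_sim)
print(round(sum(theta_draws) / len(theta_draws), 2))
```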
Probit example We compare ML (maximum likelihood) estimates20 with Gibbs
sampler McMC data augmentation probit estimates for a simple discrete choice problem. In particular, we return to the choice (or selection) equation referred to in the
illustration of the control function strategy for identifying treatment effects of the projection chapter. The variables (choice and instruments) are
D       1   1   1   0   0   0
$Z_1$   1   0   1   1   0   1
$Z_2$   0   3   0   0   0   2
$Z_3$   0   1   0   0   1   0
$Z_4$   0   1   0   2   0   0
$Z_5$   1   2   1   0   0   0
The above data are a representative sample. To mitigate any small sample bias, we
repeat this sample 20 times (n = 120).21
18 See Accounting and causal effects: econometric challenges, chapter 5 for a discussion of discrete choice
models.
19 An efficient algorithm for this Gibbs sampler probit, rbprobitGibbs, is available in the bayesm package
of R (http://www.r-project.org/), the open source statistical computing project. Bayesm is a package written
to complement Rossi, Allenby, and McCulloch [2005].
20 See the second appendix to these notes for a brief discussion of ML estimation of discrete choice models.
ML estimates (with standard errors in parentheses below the estimates) are
$$E[U_D \mid Z] = \underset{(0.2095)}{-0.6091}\, Z_1 + \underset{(0.1454)}{0.4950}\, Z_2 - \underset{(0.2618)}{0.1525}\, Z_3 - \underset{(0.1922)}{0.7233}\, Z_4 + \underset{(0.1817)}{0.2283}\, Z_5$$
The model has only modest explanatory power (pseudo-$R^2 = 1 - \frac{\ell(Z\hat{\theta})}{\ell(\hat{\theta}_0)} = 11.1\%$,
where $\ell(Z\hat{\theta})$ is the log-likelihood for the model and $\ell(\hat{\theta}_0)$ is the log-likelihood with
a constant only). However, recall this selection equation works perfectly as a control function in the treatment effect example where high explanatory power does not
indicate an adequate model specification (see projections chapter).
Now, we compare the ML results with McMC data augmentation and the Gibbs
sampler probit discussed previously. Statistics from 10, 000 posterior draws following
1, 000 burn-in draws are tabulated below based on the n = 120 sample.
statistic
1
2
3
4
5
mean
0.6225 0.5030 0.1516 0.7375
0.2310
median
0.6154 0.5003 0.1493 0.7336
0.2243
standard deviation
0.2189
0.1488
0.2669
0.2057
0.1879
quantiles:
0.025
1.0638 0.2236 0.6865 1.1557 0.1252
0.25
0.7661 0.4009 0.3286 0.8720
0.1056
0.75
0.4713 0.6007 0.02757 0.5975 0.3470
0.975
0.1252 0.1056
0.2243
0.3549
0.6110
Sample statistics for data augmented Gibbs
McMC
probit
posterior

 draws
DGP : UD = Z + , Z = Z1 Z2 Z3 Z4 Z5
As expected, the means, medians, and standard errors of the McMC probit estimates
correspond quite well with ML probit estimates.
21 Comparison of estimates based on n = 6 versus n = 120 samples produces no difference in ML parameter estimates but substantial difference in the McMC estimates. The n = 6 McMC estimates are typically
larger in absolute value compared to their n = 120 counterparts. This tends to exaggerate heterogeneity in
outcomes if we reconnect with the treatment effect examples. The remainder of this discussion focuses on
the n = 120 sample.
Logit example Next, we apply logistic regression (logit for short) to the same
(n = 120) data set. We compare MLE results with two McMC strategies: (1) logit
estimated via a random walk Metropolis-Hastings (MH) algorithm without data augmentation and (2) a uniform data augmented Gibbs sampler logit.
Random walk MH for logit The random walk MH algorithm employs a standard
binary discrete choice model
$$(D_i \mid Z_i) \sim \text{Bernoulli}\left(\frac{\exp\left[Z_i^T \theta\right]}{1 + \exp\left[Z_i^T \theta\right]}\right)$$
The default tuning parameter, $s^2 = 0.25$, produces an apparently satisfactory MH
acceptance rate of 28.6%. Details are below.
We wish to draw from the posterior
$$\Pr(\theta \mid D, Z) \propto p(\theta)\, L(\theta \mid D, Z)$$
where the log likelihood, $\ell = \log L$, is
$$\ell(\theta \mid D, Z) = \sum_{i=1}^{n} \left\{ D_i \log\left(\frac{\exp\left[Z_i^T \theta\right]}{1 + \exp\left[Z_i^T \theta\right]}\right) + (1 - D_i) \log\left(1 - \frac{\exp\left[Z_i^T \theta\right]}{1 + \exp\left[Z_i^T \theta\right]}\right) \right\}$$
For Z other than a constant, there is no prior, $p(\theta)$, which produces a well known
posterior, $\Pr(\theta \mid D, Z)$, for the logit model. This makes the MH algorithm attractive.
The MH algorithm builds a Markov chain (the current draw depends on only the
previous draw) such that eventually the influence of initial values dies out and draws
are from a stable, approximately independent distribution. The MH algorithm applied
to the logit model is as follows.
1. Initialize the vector 0 at some value.


2. Define a proposal generating density, q  , k1 for draw k  {1, 2, . . . , K}.
The random walk MH chooses a convenient generating density.


 = k1 + ,   N 0,  2 I
In other words, for each parameter, j ,


q j , k1
=
j
 
1

exp 
2
j

k1
j
2 2
2 




3. Draw a vector,  from N k1 ,  2 I . Notice, for the random walk, the tuning
parameter,  2 , is the key. If  2 is chosen too large, then the algorithm will reject
the proposal draw frequently and will converge slowly, If  2 is chosen too small,
then the algorithm will accept the proposal draw frequently but may fail to fully
explore the parameter space and may fail to discover the convergent distribution.
65
4. Calculate $\alpha$ where
$$\alpha = \begin{cases} \min\left\{1, \dfrac{\Pr(\theta^* \mid D, Z)\, q(\theta^*, \theta^{k-1})}{\Pr(\theta^{k-1} \mid D, Z)\, q(\theta^{k-1}, \theta^*)}\right\} & \Pr(\theta^{k-1} \mid D, Z)\, q(\theta^{k-1}, \theta^*) > 0 \\ 1 & \Pr(\theta^{k-1} \mid D, Z)\, q(\theta^{k-1}, \theta^*) = 0 \end{cases}$$
The core of the MH algorithm is that the ratio eliminates the problematic normalizing constant for the posterior (normalization is problematic since we don't
recognize the posterior). The convenience of the random walk MH enters here
as, by symmetry of the normal, $q(\theta^*, \theta^{k-1}) = q(\theta^{k-1}, \theta^*)$ and the calculation
of $\alpha$ simplifies as $\frac{q(\theta^*, \theta^{k-1})}{q(\theta^{k-1}, \theta^*)}$ drops out. Hence, we calculate
$$\alpha = \begin{cases} \min\left\{1, \dfrac{\Pr(\theta^* \mid D, Z)}{\Pr(\theta^{k-1} \mid D, Z)}\right\} & \Pr(\theta^{k-1} \mid D, Z)\, q(\theta^{k-1}, \theta^*) > 0 \\ 1 & \Pr(\theta^{k-1} \mid D, Z)\, q(\theta^{k-1}, \theta^*) = 0 \end{cases}$$
5. Draw U from a Uniform(0, 1). If $U < \alpha$, set $\theta^k = \theta^*$, otherwise set $\theta^k = \theta^{k-1}$.
In other words, with probability $\alpha$ accept the proposal draw, $\theta^*$.
6. Repeat K times until the distribution converges.
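The six steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the routine used to produce the results reported in these notes: a flat prior is assumed so the log posterior equals the log likelihood up to a constant, and the function names, seed, and toy data are hypothetical. The default `sigma=0.5` corresponds to a tuning variance of 0.25.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def log_posterior(theta, D, Z):
    """Logit log likelihood; a flat prior p(theta) proportional to 1 is assumed."""
    total = 0.0
    for d_i, z_i in zip(D, Z):
        eta = sum(z * t for z, t in zip(z_i, theta))
        total -= d_i * softplus(-eta) + (1 - d_i) * softplus(eta)
    return total

def random_walk_mh(D, Z, theta0, sigma=0.5, K=5000, seed=1):
    """Random walk MH; sigma**2 plays the role of the tuning parameter."""
    rng = random.Random(seed)
    theta = list(theta0)                      # step 1: initialize
    lp = log_posterior(theta, D, Z)
    draws, n_accept = [], 0
    for _ in range(K):
        # steps 2-3: propose theta* = theta^{k-1} + eps, eps ~ N(0, sigma^2 I)
        proposal = [t + rng.gauss(0.0, sigma) for t in theta]
        lp_star = log_posterior(proposal, D, Z)
        # step 4: alpha = min(1, posterior ratio); q cancels by symmetry
        alpha = math.exp(min(0.0, lp_star - lp))
        # step 5: accept the proposal with probability alpha
        if rng.random() < alpha:
            theta, lp = proposal, lp_star
            n_accept += 1
        draws.append(list(theta))             # step 6: repeat K times
    return draws, n_accept / K

# Hypothetical toy data: an intercept and one regressor, not perfectly separated.
x = [0.5, -1.2, 2.0, -0.3, 0.1, -0.8, -0.5, 1.1]
D = [1, 0, 1, 1, 0, 0, 1, 0]
Z = [[1.0, x_i] for x_i in x]
draws, acceptance_rate = random_walk_mh(D, Z, [0.0, 0.0])
print(acceptance_rate)
```

Computing the ratio on the log scale keeps the normalizing constant out of the calculation and avoids underflow when the posterior density is tiny.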
Uniform Gibbs sampler for logit On the other hand, the uniform data augmented Gibbs sampler logit specifies a complete set of conditional posteriors developed
as follows. Let
$$D_i = \begin{cases} 1 & \text{with probability } \pi_i \\ 0 & \text{with probability } 1 - \pi_i \end{cases}, \qquad i = 1, 2, \ldots, n$$
where $\pi_i = \frac{\exp[Z_i^T\theta]}{1+\exp[Z_i^T\theta]} = F_V(Z_i^T\theta)$, or $\log\frac{\pi_i}{1-\pi_i} = Z_i^T\theta$, and $F_V(Z_i^T\theta)$ is the
cumulative distribution function of the logistic random variable V. Hence, $\pi_i = \Pr\left(U < \frac{\exp[Z_i^T\theta]}{1+\exp[Z_i^T\theta]}\right)$ where U has a uniform(0, 1) distribution. Then, given the priors for $\theta$, $p(\theta)$, the joint posterior for the latent variable $u = (u_1, u_2, \ldots, u_n)$ and $\theta$
given the data D and Z is
$$\Pr(\theta, u \mid D, Z) \propto p(\theta)\prod_{i=1}^{n}\left\{ I\left(u_i \le \frac{\exp[Z_i^T\theta]}{1+\exp[Z_i^T\theta]}\right) I(D_i = 1) + I\left(u_i > \frac{\exp[Z_i^T\theta]}{1+\exp[Z_i^T\theta]}\right) I(D_i = 0)\right\} I(0 \le u_i \le 1)$$
where $I(X \in A)$ is an indicator function that equals one if $X \in A$, and zero otherwise.
Thus, the conditional posterior for the latent (uniform) variable u is
$$\Pr(u_i \mid \theta, D, Z) \sim \begin{cases} \text{Uniform}\left(0, \dfrac{\exp[Z_i^T\theta]}{1+\exp[Z_i^T\theta]}\right) & \text{if } D_i = 1 \\ \text{Uniform}\left(\dfrac{\exp[Z_i^T\theta]}{1+\exp[Z_i^T\theta]}, 1\right) & \text{if } D_i = 0 \end{cases}$$
Since the joint posterior can be written
$$\Pr(\theta, u \mid D, Z) \propto p(\theta)\prod_{i=1}^{n}\left\{ I\left(Z_i^T\theta \ge \log\frac{u_i}{1-u_i}\right) I(D_i = 1) + I\left(Z_i^T\theta < \log\frac{u_i}{1-u_i}\right) I(D_i = 0)\right\} I(0 \le u_i \le 1)$$
we have
$$\sum_{j=1}^{5} Z_{ij}\theta_j \ge \log\frac{u_i}{1-u_i} \quad \text{if } D_i = 1$$
so
$$\theta_k \ge \frac{1}{Z_{ik}}\left[\log\frac{u_i}{1-u_i} - \sum_{j \ne k} Z_{ij}\theta_j\right]$$
for all samples for which $D_i = 1$ and $Z_{ik} > 0$, as well as for all samples for which
$D_i = 0$ and $Z_{ik} < 0$. Similarly,
$$\theta_k < \frac{1}{Z_{ik}}\left[\log\frac{u_i}{1-u_i} - \sum_{j \ne k} Z_{ij}\theta_j\right]$$
for all samples for which $D_i = 0$ and $Z_{ik} > 0$, as well as for all samples for which
$D_i = 1$ and $Z_{ik} < 0$.22 Let $A_k$ and $B_k$ be the sets defined by the above, that is,
$$A_k = \{i : ((D_i = 1) \cap (Z_{ik} > 0)) \cup ((D_i = 0) \cap (Z_{ik} < 0))\}$$
and
$$B_k = \{i : ((D_i = 0) \cap (Z_{ik} > 0)) \cup ((D_i = 1) \cap (Z_{ik} < 0))\}$$
A diffuse prior $p(\theta) \propto 1$ combined with the above gives the conditional posterior for
$\theta_k$, $k = 1, 2, \ldots, 5$, given the other $\theta$'s and latent variable, u.
$$p(\theta_k \mid \theta_{-k}, u, D, Z) \sim \text{Uniform}(a_k, b_k)$$
where $\theta_{-k}$ is a vector of parameters except $\theta_k$,
$$a_k = \max_{i \in A_k} \frac{1}{Z_{ik}}\left[\log\frac{u_i}{1-u_i} - \sum_{j \ne k} Z_{ij}\theta_j\right]$$
and
$$b_k = \min_{i \in B_k} \frac{1}{Z_{ik}}\left[\log\frac{u_i}{1-u_i} - \sum_{j \ne k} Z_{ij}\theta_j\right]$$
The Gibbs sampler is implemented by drawing n values of u in one block conditional on $\theta$ and the data, D, Z. The elements of $\theta$ are drawn successively, each
conditional on u, the remaining parameters, $\theta_{-k}$, and the data, D, Z.
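The two-block scheme just described can be sketched as follows. This is a hedged illustration rather than the code behind the reported results: the function names, seed, toy data, and the small clamp that keeps u strictly inside (0, 1) are assumptions introduced here. The lower bound is the maximum over observations with (D = 1, Z > 0) or (D = 0, Z < 0); the upper bound is the minimum over the complementary cases.

```python
import math
import random

def logistic(eta):
    """Stable logistic CDF exp(eta) / (1 + exp(eta))."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def uniform_gibbs_logit(D, Z, theta0, n_draws=1000, seed=0):
    rng = random.Random(seed)
    theta = list(theta0)
    p = len(theta)
    draws = []
    for _ in range(n_draws):
        # Block 1: u_i | theta, D is uniform on (0, pi_i) or (pi_i, 1).
        logit_u = []
        for d_i, z_i in zip(D, Z):
            pi_i = logistic(sum(z * t for z, t in zip(z_i, theta)))
            u = rng.uniform(0.0, pi_i) if d_i == 1 else rng.uniform(pi_i, 1.0)
            u = min(max(u, 1e-12), 1.0 - 1e-12)   # assumption: avoid log(0)
            logit_u.append(math.log(u / (1.0 - u)))
        # Block 2: theta_k | theta_{-k}, u is uniform on (a_k, b_k).
        for k in range(p):
            a_k, b_k = -math.inf, math.inf
            for d_i, z_i, lu in zip(D, Z, logit_u):
                if z_i[k] == 0:
                    continue                       # observation carries no bound
                rest = sum(z_i[j] * theta[j] for j in range(p) if j != k)
                bound = (lu - rest) / z_i[k]
                if (d_i == 1) == (z_i[k] > 0):     # member of the lower-bound set
                    a_k = max(a_k, bound)
                else:                              # member of the upper-bound set
                    b_k = min(b_k, bound)
            if not (math.isfinite(a_k) and math.isfinite(b_k)):
                raise ValueError("an empty bound set leaves theta_k unbounded")
            theta[k] = rng.uniform(a_k, b_k)
        draws.append(list(theta))
    return draws

# Hypothetical toy data: an intercept and one regressor.
x = [0.5, -1.2, 2.0, -0.3, 0.1, -0.8, -0.5, 1.1]
D = [1, 0, 1, 1, 0, 0, 1, 0]
Z = [[1.0, x_i] for x_i in x]
draws = uniform_gibbs_logit(D, Z, [0.0, 0.0], n_draws=500)
print(draws[-1])
```

Because each parameter moves only within an interval pinned down by the current latent draws, successive draws are highly dependent, which is in keeping with the long burn-in reported for this sampler.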
22 If $Z_{ik} = 0$, the observation is ignored as $\theta_k$ is determined by the other regressor values.
ML logit estimates (with standard errors in parentheses below the estimates) are
$$E[U_D \mid Z] = \underset{(0.3514)}{-0.9500}\,Z_1 + \underset{(0.2419)}{0.7808}\,Z_2 - \underset{(0.4209)}{0.2729}\,Z_3 - \underset{(0.3250)}{1.1193}\,Z_4 + \underset{(0.3032)}{0.3385}\,Z_5$$
Logit results are proportional to the probit results (approximately 1.5 times the probit
estimates), as is typical. As with the probit model, the logit model has modest explanatory power (pseudo-$R^2 = 1 - \frac{\ell(Z\hat\theta)}{\ell(\hat\theta_0)} = 10.8\%$, where $\ell(Z\hat\theta)$ is the log-likelihood for
the model and $\ell(\hat\theta_0)$ is the log-likelihood with a constant only).
Now, we compare the ML results with McMC posterior draws. Statistics from
10,000 posterior MH draws following 1,000 burn-in draws are tabulated below based
on the n = 120 sample.

statistic              θ1        θ2        θ3        θ4        θ5
mean                 -0.9850    0.8176   -0.2730   -1.1633    0.3631
median               -0.9745    0.8066   -0.2883   -1.1549    0.3440
standard deviation    0.3547    0.2426    0.4089    0.3224    0.3069
quantiles:
0.025                -1.7074    0.3652   -1.0921   -1.7890   -0.1787
0.25                 -1.2172    0.6546   -0.5526   -1.3793    0.1425
0.75                 -0.7406    0.9787    0.0082   -0.9482    0.5644
0.975                -0.3134    1.3203    0.5339   -0.5465    0.9924

Sample statistics for MH McMC logit posterior draws
DGP: $U_D = Z\theta + \varepsilon$, $Z = [Z_1\ Z_2\ Z_3\ Z_4\ Z_5]$
Statistics from 10,000 posterior data augmented uniform Gibbs draws following 40,000
burn-in draws23 are tabulated below based on the n = 120 sample.

statistic              θ1        θ2        θ3        θ4        θ5
mean                 -1.015     0.8259   -0.3375   -1.199     0.3529
median               -1.011     0.8126   -0.3416   -1.2053    0.3445
standard deviation    0.3014    0.2039    0.3748    0.2882    0.2884
quantiles:
0.025                -1.6399    0.3835   -1.1800   -1.9028   -0.2889
0.25                 -1.2024    0.6916   -0.5902   -1.3867    0.1579
0.75                 -0.8165    0.9514   -0.0891   -1.0099    0.5451
0.975                -0.4423    1.2494    0.3849   -0.6253    0.9451

Sample statistics for uniform Gibbs McMC logit posterior draws
DGP: $U_D = Z\theta + \varepsilon$, $Z = [Z_1\ Z_2\ Z_3\ Z_4\ Z_5]$
As expected, the means, medians, and standard errors of the McMC logit estimates
correspond well with each other and the ML logit estimates. Now that we’ve developed
McMC data augmentation for the choice or selection equation, we return to the discussion of causal effects (initiated in the projections notes) and discuss data augmentation
for the counterfactuals as well as latent utility.
23 Convergence to marginal posterior draws is much slower with this algorithm.
5
Treatment effects and counterfactuals
Suppose we observe treatment or no treatment and the associated outcome, $Y = DY_1 + (1 - D)Y_0$, where
$$Y_1 = \beta_1 + V_1, \qquad Y_0 = \beta_0 + V_0$$
and a representative sample is

 Y    D    Y1   Y0   V1   V0
15    1    15    9    3   -3
14    1    14   10    2   -2
13    1    13   11    1   -1
13    0    11   13   -1    1
14    0    10   14   -2    2
15    0     9   15   -3    3

Further, we have the following instruments at our disposal, $Z = [Z_1\ Z_2\ Z_3\ Z_4]$,
where their representative values are

Z1   Z2   Z3   Z4
 5    4    3    1
-6   -5   -4   -2
 0    0    0    1
 0    0    1    0
 0    1    0    0
 1    0    0    0

and we perceive latent utility, EU, to be related to choice via the instruments.
$$EU = Z\theta + V_D$$
and observed choice is
$$D = \begin{cases} 1 & EU > 0 \\ 0 & \text{otherwise} \end{cases}$$
This is the exact setup we discussed earlier in the projections analysis.
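A quick numerical check of the representative sample may help fix ideas. The sketch below assumes the second-row instrument values are negative (-6, -5, -4, -2), a reading chosen because it makes each instrument orthogonal to both potential outcomes; the variable names are hypothetical. It also computes the treatment effects implied by the data generating process.

```python
# Representative sample (one block of six observations).
Y1 = [15, 14, 13, 11, 10, 9]
Y0 = [9, 10, 11, 13, 14, 15]
D = [1, 1, 1, 0, 0, 0]
# Assumption: the second entry of each instrument is negative.
Z = {
    "Z1": [5, -6, 0, 0, 0, 1],
    "Z2": [4, -5, 0, 0, 1, 0],
    "Z3": [3, -4, 0, 1, 0, 0],
    "Z4": [1, -2, 1, 0, 0, 0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Inner products of each instrument with the potential outcomes.
orthogonality = {name: (dot(z, Y1), dot(z, Y0)) for name, z in Z.items()}

# Individual treatment effects and their averages.
effects = [y1 - y0 for y1, y0 in zip(Y1, Y0)]
n_treated = sum(D)
ate = sum(effects) / len(effects)
att = sum(e for e, d in zip(effects, D) if d == 1) / n_treated
atut = sum(e for e, d in zip(effects, D) if d == 0) / (len(D) - n_treated)

print(orthogonality)   # every entry is (0, 0)
print(ate, att, atut)  # 0.0 4.0 -4.0
```

Under this reading, the average effect is zero while the treated and untreated subpopulations have effects of equal size and opposite sign, so outcome heterogeneity is entirely masked in the population average.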
5.1
Gibbs sampler for treatment effects
There are three sources of missing data: latent utility, EU , counterfactuals for individuals who choose treatment, (Y0i | Di = 1), and counterfactuals for individuals who
choose no treatment, (Y1i | Di = 0). Bayesian data augmentation effectively models
these missing data processes (as in, for example, Albert and Chib’s McMC probit) by
drawing in sequence from the conditional posterior distributions — a Gibbs sampler.
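Define the complete or augmented data as
$$r_i = \begin{bmatrix} D_i^* & D_iY_i + (1 - D_i)Y_i^{miss} & D_iY_i^{miss} + (1 - D_i)Y_i \end{bmatrix}^T$$
Also, let
$$H_i = \begin{bmatrix} Z_i & 0 & 0 \\ 0 & X_i & 0 \\ 0 & 0 & X_i \end{bmatrix} \qquad \text{and} \qquad \beta = \begin{bmatrix} \theta \\ \beta_1 \\ \beta_0 \end{bmatrix}$$
where $X = \iota$, a vector of ones.
5.1.1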
Full conditional posterior distributions
Let $\Gamma_{-x}$ denote all parameters other than x. The full conditional posteriors for the
augmented outcome data are
$$Y_i^{miss} \mid \Gamma_{-Y_i^{miss}}, Data \sim N\left((1 - D_i)\mu_{1i} + D_i\mu_{0i},\ (1 - D_i)\omega_{1i} + D_i\omega_{0i}\right)$$
where standard multivariate normal theory (see the appendix) is applied to derive
means and variances conditional on the draw for latent utility and the other outcome24
$$\mu_{1i} = X_i\beta_1 + \frac{\sigma_0^2\sigma_{D1} - \sigma_{10}\sigma_{D0}}{\sigma_0^2 - \sigma_{D0}^2}(D_i^* - Z_i\theta) + \frac{\sigma_{10} - \sigma_{D1}\sigma_{D0}}{\sigma_0^2 - \sigma_{D0}^2}(Y_i - X_i\beta_0)$$
$$\mu_{0i} = X_i\beta_0 + \frac{\sigma_1^2\sigma_{D0} - \sigma_{10}\sigma_{D1}}{\sigma_1^2 - \sigma_{D1}^2}(D_i^* - Z_i\theta) + \frac{\sigma_{10} - \sigma_{D1}\sigma_{D0}}{\sigma_1^2 - \sigma_{D1}^2}(Y_i - X_i\beta_1)$$
$$\omega_{1i} = \sigma_1^2 - \frac{\sigma_{D1}^2\sigma_0^2 - 2\sigma_{10}\sigma_{D1}\sigma_{D0} + \sigma_{10}^2}{\sigma_0^2 - \sigma_{D0}^2}$$
$$\omega_{0i} = \sigma_0^2 - \frac{\sigma_{D0}^2\sigma_1^2 - 2\sigma_{10}\sigma_{D1}\sigma_{D0} + \sigma_{10}^2}{\sigma_1^2 - \sigma_{D1}^2}$$
Similarly, the conditional posterior for the latent utility is
$$D_i^* \mid \Gamma_{-D_i^*}, Data \sim \begin{cases} TN_{(0,\infty)}(\mu_{D_i}, \omega_D) & \text{if } D_i = 1 \\ TN_{(-\infty,0)}(\mu_{D_i}, \omega_D) & \text{if } D_i = 0 \end{cases}$$
where $TN(\cdot)$ refers to the truncated normal distribution with support indicated via the
subscript and the arguments are parameters of the untruncated distribution. Applying
multivariate normal theory for $(D_i^* \mid Y_i)$ we have
$$\mu_{D_i} = Z_i\theta + \left(D_iY_i + (1 - D_i)Y_i^{miss} - X_i\beta_1\right)\frac{\sigma_0^2\sigma_{D1} - \sigma_{10}\sigma_{D0}}{\sigma_1^2\sigma_0^2 - \sigma_{10}^2} + \left(D_iY_i^{miss} + (1 - D_i)Y_i - X_i\beta_0\right)\frac{\sigma_1^2\sigma_{D0} - \sigma_{10}\sigma_{D1}}{\sigma_1^2\sigma_0^2 - \sigma_{10}^2}$$
$$\omega_D = 1 - \frac{\sigma_{D1}^2\sigma_0^2 - 2\sigma_{10}\sigma_{D1}\sigma_{D0} + \sigma_{D0}^2\sigma_1^2}{\sigma_1^2\sigma_0^2 - \sigma_{10}^2}$$
24 Technically, $\sigma_{10}$ is unidentified (i.e., even with unlimited data we cannot "observe" the parameter).
However, we can employ restrictions derived through the positive-definiteness (see the appendix) of the
variance-covariance matrix, $\Sigma$, to impose bounds on the parameter, $\sigma_{10}$. If treatment effects are overly
sensitive this strategy will prove ineffective; otherwise, it allows us to proceed from observables to treatment
effects via augmentation of unobservables (the counterfactuals as well as latent utility).
The conditional posterior distribution for the parameters is
$$\beta \mid \Gamma_{-\beta}, Data \sim N(\mu_\beta, \Omega_\beta)$$
where by the SUR (seemingly-unrelated regression) generalization of Bayesian regression (see the appendix at the end of these notes)
$$\mu_\beta = \left[H^T(\Sigma^{-1} \otimes I_n)H + V_\beta^{-1}\right]^{-1}\left[H^T(\Sigma^{-1} \otimes I_n)r + V_\beta^{-1}\beta_0\right]$$
$$\Omega_\beta = \left[H^T(\Sigma^{-1} \otimes I_n)H + V_\beta^{-1}\right]^{-1}$$
and the prior distribution is $p(\beta) \sim N(\beta_0, V_\beta)$. The conditional distribution for the
trivariate variance-covariance matrix is
$$\Sigma \mid \Gamma_{-\Sigma}, Data \sim G^{-1}$$
where
$$G \sim \text{Wishart}(n + \rho, S + \rho R)$$
with prior $p(G) \sim \text{Wishart}(\rho, \rho R)$, and $S = \sum_{i=1}^{n}(r_i - H_i\beta)(r_i - H_i\beta)^T$. As
usual, starting values for the Gibbs sampler are varied to test convergence of parameter
posterior distributions.
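Of the full conditionals in this subsection, the truncated normal draw for latent utility is the least standard to code. A minimal sketch by CDF inversion follows, using only the Python standard library. The function and argument names are hypothetical, and the inputs stand for the conditional mean and variance of latent utility described above; inversion loses accuracy when the truncation point sits far in the tail, where a rejection sampler would be a better choice.

```python
import math
import random
from statistics import NormalDist

def draw_latent_utility(mu_Di, omega_D, D_i, rng):
    """Draw latent utility from N(mu_Di, omega_D) truncated to (0, inf) when
    D_i = 1, or to (-inf, 0) when D_i = 0, by inverting the normal CDF."""
    dist = NormalDist(mu_Di, math.sqrt(omega_D))
    p0 = dist.cdf(0.0)                       # Pr(utility <= 0) before truncation
    lo, hi = (p0, 1.0) if D_i == 1 else (0.0, p0)
    u = rng.uniform(lo, hi)
    u = min(max(u, 1e-12), 1.0 - 1e-12)      # keep inv_cdf's argument inside (0, 1)
    draw = dist.inv_cdf(u)
    # Guard against rounding (or the clamp above) pushing a draw across zero.
    return max(draw, 0.0) if D_i == 1 else min(draw, 0.0)

rng = random.Random(0)
treated = [draw_latent_utility(0.3, 1.0, 1, rng) for _ in range(5)]
untreated = [draw_latent_utility(0.3, 1.0, 0, rng) for _ in range(5)]
print(treated)    # all non-negative
print(untreated)  # all non-positive
```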
5.1.2
Nobile’s algorithm
Recall $\sigma_D^2$ is normalized to one. This creates a slight complication as the conditional
posterior is no longer inverse-Wishart. Nobile [2000] provides a convenient algorithm
for random Wishart (multivariate $\chi^2$) draws with a restricted element. The algorithm
applied to the current setting results in the following steps:
1. Exchange rows and columns one and three in $S + \rho R$, call this matrix V.
2. Find L such that $V = (L^{-1})^T L^{-1}$.
3. Construct a lower triangular matrix A with
a. $a_{ii}$ equal to the square root of $\chi^2$ random variates, i = 1, 2.
b. $a_{33} = \frac{1}{l_{33}}$ where $l_{33}$ is the third row-column element of L.
c. $a_{ij}$ equal to N(0, 1) random variates, i > j.
4. Set $\tilde{V} = (L^{-1})^T (A^{-1})^T A^{-1} L^{-1}$.
5. Exchange rows and columns one and three in $\tilde{V}$ and denote this draw $\Sigma$.
5.1.3
Prior distributions
Li, Poirier, and Tobias choose relatively diffuse priors such that the data dominates
the posterior distribution. Their prior distribution for $\beta$ is $p(\beta) \sim N(\beta_0, V_\beta)$ where
$\beta_0 = 0$, $V_\beta = 4I$ and their prior for $\Sigma^{-1}$ is $p(G) \sim \text{Wishart}(\rho, \rho R)$ where $\rho = 12$
and R is a diagonal matrix with elements $\left\{\frac{1}{12}, \frac{1}{4}, \frac{1}{4}\right\}$.
5.2
Marginal and average treatment effects
The marginal treatment effect is the impact of treatment for individuals who are indifferent between treatment and no treatment. We can employ Bayesian data augmentation-based estimation of marginal treatment effects (MTE) as data augmentation generates
repeated draws for unobservables, $V_{Dj}$, $(Y_{1j} \mid D_j = 0)$, and $(Y_{0j} \mid D_j = 1)$. Now, exploit these repeated samples to describe the distribution for $MTE(u_D)$ where $V_D$ is
transformed to uniform (0, 1), $u_D = p_v$. For each draw, $V_D = v$, we determine the
cumulative probability, $u_D = \Phi(v)$,25 and calculate $MTE(u_D) = E[Y_1 - Y_0 \mid u_D]$.
If $MTE(u_D)$ is constant for all $u_D$, then all treatment effects are alike.
MTE can be connected to standard population-level treatment effects, ATE, ATT,
and ATUT, via non-negative weights whose sum is one (assuming full support)
$$w_{ATE}(u_D) = \frac{\sum_{j=1}^{n} I(u_D)}{n}$$
$$w_{ATT}(u_D) = \frac{\sum_{j=1}^{n} I(u_D)\,D_j}{\sum_{j=1}^{n} D_j}$$
$$w_{ATUT}(u_D) = \frac{\sum_{j=1}^{n} I(u_D)\,(1 - D_j)}{\sum_{j=1}^{n}(1 - D_j)}$$
where probabilities $p_k$ refer to bins from 0 to 1 by increments of 0.01 for indicator
variable
$$I(u_D) = 1 \quad u_D = p_k$$
$$I(u_D) = 0 \quad u_D \ne p_k$$
Hence, MTE-estimated average treatment effects are
$$estATE(MTE) = \sum_{i=1}^{n} w_{ATE}(u_D)\,MTE(u_D)$$
$$estATT(MTE) = \sum_{i=1}^{n} w_{ATT}(u_D)\,MTE(u_D)$$
$$estATUT(MTE) = \sum_{i=1}^{n} w_{ATUT}(u_D)\,MTE(u_D)$$
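The weighting scheme can be sketched as below. This is an illustration under assumptions, not the notes' code: it reads the sums as running over the 0.01-wide bins of the uniform-transformed unobservable, estimates MTE in each bin by the average effect of the draws falling in it, and uses hypothetical names and toy inputs.

```python
def mte_weighted_effects(u_D, D, effect, n_bins=100):
    """Weighted-MTE estimates of (ATE, ATT, ATUT).

    u_D    : draws of the uniform-transformed unobservable, in (0, 1)
    D      : treatment indicators (0 or 1)
    effect : Y1 - Y0 for each augmented observation
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    treated = [0] * n_bins
    for u, d, e in zip(u_D, D, effect):
        k = min(int(u * n_bins), n_bins - 1)   # bin index p_k for this draw
        sums[k] += e
        counts[k] += 1
        treated[k] += d
    n = len(D)
    n1 = sum(D)
    n0 = n - n1
    ate = att = atut = 0.0
    for k in range(n_bins):
        if counts[k] == 0:
            continue                            # empty bin: zero weight
        mte_k = sums[k] / counts[k]             # bin average of Y1 - Y0
        ate += counts[k] / n * mte_k
        att += treated[k] / n1 * mte_k
        atut += (counts[k] - treated[k]) / n0 * mte_k
    return ate, att, atut

# Hypothetical toy inputs: two treated and two untreated observations.
print(mte_weighted_effects([0.05, 0.15, 0.85, 0.95], [1, 1, 0, 0], [4, 4, -4, -4]))
# (0.0, 4.0, -4.0)
```

Each set of weights sums to one over the occupied bins, so the three estimates are averages of the same bin-level MTE values taken over different subpopulations.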
Next, we apply these data augmentation ideas to the causal effects example and
estimate the average treatment effect on the treated (ATT), the average treatment effect
on the untreated (ATUT), and the average treatment effect (ATE).
25 $\Phi(\cdot)$ is a cumulative probability distribution function.
5.3
Return to the treatment effect example
Initially, we employ Bayesian data augmentation via a Gibbs sampler on the treatment
effect problem outlined above. Recall this example was employed in the projections
notes to illustrate where the inverse-Mills ratios control functions strategy based on the
full complement of instruments26 was exceptionally effective.
The representative sample is

 Y    D    Y1   Y0   Z1   Z2   Z3   Z4
15    1    15    9    5    4    3    1
14    1    14   10   -6   -5   -4   -2
13    1    13   11    0    0    0    1
13    0    11   13    0    0    1    0
14    0    10   14    0    1    0    0
15    0     9   15    1    0    0    0

which is repeated 200 times to create a sample of n = 1,200 observations. The Gibbs
sampler employs 15,000 draws from the conditional posteriors. The first 5,000 draws
are discarded as burn-in, then sample statistics are based on the remaining 10,000
draws.
statistic        β1       β0       θ1       θ2       θ3       θ4
mean            13.76    13.76    0.810   -0.391   -1.647    1.649
median          13.76    13.76    0.809   -0.391   -1.645    1.650
standard dev     0.026    0.028   0.051    0.054    0.080    0.080
quantiles:
minimum         13.67    13.64    0.617   -0.585   -1.943    1.362
0.01            13.70    13.69    0.695   -0.521   -1.837    1.461
0.025           13.71    13.70    0.713   -0.500   -1.807    1.493
0.05            13.72    13.71    0.727   -0.481   -1.781    1.518
0.10            13.73    13.71    0.746   -0.461   -1.751    1.547
0.25            13.74    13.74    0.776   -0.428   -1.699    1.595
0.75            13.78    13.78    0.844   -0.356   -1.593    1.704
0.90            13.79    13.80    0.873   -0.325   -1.547    1.751
0.95            13.80    13.80    0.893   -0.306   -1.519    1.778
0.975           13.81    13.81    0.910   -0.289   -1.497    1.806
0.99            13.82    13.82    0.931   -0.269   -1.467    1.836
maximum         13.84    13.86    1.006   -0.185   -1.335    1.971

Sample statistics for the parameters of the data augmented Gibbs
sampler applied to the treatment effect example
The results demonstrate selection bias as the means are biased upward from 12.
This does not bode well for effective estimation of marginal or average treatment effects. Sample statistics for average treatment effects as well as correlations, $\rho_{D,1}$, $\rho_{D,0}$,
and $\rho_{1,0}$ are tabulated below.

statistic        ATE      ATT      ATUT     ρD,1     ρD,0     ρ1,0
mean            0.000    0.481   -0.482    0.904   -0.904   -0.852
median          0.000    0.480   -0.481    0.904   -0.904   -0.852
standard dev    0.017    0.041    0.041    0.009    0.009    0.015
quantiles:
minimum        -0.068    0.331   -0.649    0.865   -0.933   -0.899
0.01           -0.039    0.388   -0.580    0.880   -0.923   -0.884
0.025          -0.033    0.403   -0.564    0.884   -0.920   -0.879
0.05           -0.028    0.415   -0.549    0.888   -0.918   -0.875
0.10           -0.022    0.428   -0.534    0.892   -0.915   -0.871
0.25           -0.012    0.452   -0.509    0.898   -0.910   -0.862
0.75            0.011    0.510   -0.453    0.910   -0.898   -0.842
0.90            0.022    0.535   -0.429    0.915   -0.892   -0.832
0.95            0.028    0.551   -0.416    0.917   -0.888   -0.826
0.975           0.034    0.562   -0.405    0.920   -0.884   -0.821
0.99            0.040    0.576   -0.393    0.923   -0.880   -0.814
maximum         0.068    0.649   -0.350    0.932   -0.861   -0.787

Sample statistics for average treatment effects and error correlations of the
data augmented Gibbs sampler applied to the treatment effect example
26 Typically, we're fortunate to identify any instruments. In the example, the instruments form a basis for
the nullspace to the outcomes, $Y_1$ and $Y_0$. In this (linear or Gaussian) sense, we've exhausted the potential
set of instruments.
Average treatment effects estimated from weighted averages of MTE are similar:
estATE(MTE) = 0.000
estATT(MTE) = 0.464
estATUT(MTE) = -0.464
The average treatment effects on the treated and untreated suggest heterogeneity but
are grossly understated compared to the DGP averages of 4 and -4. Next, we revisit
the problem and attempt to consider what is left out of our model specification.
5.4
Instrumental variable restrictions
Consistency demands that we fully consider what we know. In the foregoing analysis,
we have not effectively employed this principle. Data augmentation of the counterfactuals involves another condition. That is, outcomes are independent of the instruments
(otherwise, they are not instruments), so $DY + (1 - D)Y^{draw}$ and $DY^{draw} + (1 - D)Y$
are independent of Z. We can impose orthogonality on the draws of the counterfactuals
such that the "sample" satisfies this population condition. We'll refer to this as the IV
data augmented Gibbs sampler treatment effect analysis.
To implement this we add the following steps to the above Gibbs sampler. Minimize the distance of $Y^{draw}$ from $Y^{miss}$ such that $Y_1 = DY + (1 - D)Y^{draw}$ and
$Y_0 = DY^{draw} + (1 - D)Y$ are orthogonal to the instruments, Z.
$$\min_{Y^{draw}}\ (Y^{draw} - Y^{miss})^T(Y^{draw} - Y^{miss})$$
$$\text{s.t.}\quad Z^T\begin{bmatrix} DY + (1 - D)Y^{draw} & DY^{draw} + (1 - D)Y \end{bmatrix} = 0$$
where the constraint is $p \times 2$ zeroes and p is the number of columns in Z (the number
of instruments). Hence, in each McMC round, IV outcome draws are
$$Y_1 = DY + (1 - D)Y^{draw}$$
and
$$Y_0 = DY^{draw} + (1 - D)Y$$
5.5
Return to the example once more
With the IV data augmented Gibbs sampler in hand we return to the representative
sample

 Y    D    Y1   Y0   Z1   Z2   Z3   Z4
15    1    15    9    5    4    3    1
14    1    14   10   -6   -5   -4   -2
13    1    13   11    0    0    0    1
13    0    11   13    0    0    1    0
14    0    10   14    0    1    0    0
15    0     9   15    1    0    0    0

and repeat 20 times to create a sample of n = 120 observations. The IV Gibbs sampler employs 15,000 draws from the conditional posteriors. The first 5,000 draws are
discarded as burn-in, then sample statistics are based on the remaining 10,000 draws.
statistic        β1       β0       θ1       θ2       θ3       θ4
mean            12.01    11.99    0.413   -0.167   -0.896    0.878
median          12.01    11.99    0.420   -0.148   -0.866    0.852
standard dev     0.160    0.160   0.227    0.274    0.370    0.359
quantiles:
minimum         11.35    11.37   -0.558   -1.325   -2.665   -0.202
0.01            11.64    11.62   -0.149   -0.889   -1.888    0.170
0.025           11.69    11.68   -0.058   -0.764   -1.696    0.254
0.05            11.74    11.73    0.028   -0.648   -1.550    0.336
0.10            11.80    11.80    0.117   -0.530   -1.381    0.435
0.25            11.90    11.89    0.267   -0.334   -1.124    0.617
0.75            12.11    12.10    0.566    0.023   -0.637    1.113
0.90            12.21    12.20    0.695    0.168   -0.451    1.367
0.95            12.27    12.25    0.774    0.249   -0.348    1.509
0.975           12.32    12.30    0.840    0.312   -0.256    1.630
0.99            12.38    12.36    0.923    0.389   -0.170    1.771
maximum         12.63    12.64    1.192    0.685    0.257    2.401

Sample statistics for the parameters of the IV data augmented Gibbs
sampler applied to the treatment effect example
Not surprisingly, the results demonstrate no selection bias and effectively estimate
marginal and average treatment effects. Sample statistics for average treatment effects
as well as correlations, $\rho_{D,1}$, $\rho_{D,0}$, and $\rho_{1,0}$ are tabulated below.

statistic        ATE      ATT      ATUT     ρD,1     ρD,0     ρ1,0
mean            0.000    4.000   -4.000    0.813   -0.812   -0.976
median          0.000    4.000   -4.000    0.815   -0.815   -0.976
standard dev    0.000    0.000    0.000    0.031    0.032    0.004
quantiles:
minimum         0.000    4.000   -4.000    0.650   -0.910   -0.987
0.01            0.000    4.000   -4.000    0.728   -0.874   -0.984
0.025           0.000    4.000   -4.000    0.743   -0.866   -0.983
0.05            0.000    4.000   -4.000    0.756   -0.859   -0.982
0.10            0.000    4.000   -4.000    0.772   -0.851   -0.981
0.25            0.000    4.000   -4.000    0.794   -0.835   -0.979
0.75            0.000    4.000   -4.000    0.835   -0.794   -0.973
0.90            0.000    4.000   -4.000    0.850   -0.771   -0.970
0.95            0.000    4.000   -4.000    0.859   -0.755   -0.968
0.975           0.000    4.000   -4.000    0.866   -0.742   -0.967
0.99            0.000    4.000   -4.000    0.874   -0.726   -0.965
maximum         0.000    4.000   -4.000    0.904   -0.640   -0.952

Sample statistics for average treatment effects and error correlations of the
IV data augmented Gibbs sampler applied to the treatment effect example
Weighted MTE estimates of average treatment effects are similar.
estATE(MTE) = 0.000
estATT(MTE) = 3.792
estATUT(MTE) = -3.792
Next, we report some more interesting experiments. Instead of having the full set
of instruments available, suppose we have only three, $Z_1$, $Z_2$, and $Z_3 + Z_4$, or two,
$Z_1 + Z_2$ and $Z_3 + Z_4$, or one, $Z_1 + Z_2 + Z_3 + Z_4$. We repeat the above for each set of
instruments and compare the results with classical control function analysis based on
Heckman's inverse Mills strategy introduced in the projections notes.
5.5.1
Three instruments
Suppose we have only three instruments, $Z_1$, $Z_2$, and $Z_3 + Z_4$. IV data augmented
Gibbs sampler results are tabulated below.27

statistic        β1       β0       θ1       θ2       θ3
mean            12.00    12.00    0.242   -0.358    0.001
median          12.00    12.00    0.243   -0.342    0.001
standard dev     0.164    0.165   0.222    0.278    0.132
quantiles:
minimum         11.32    11.36   -0.658   -1.451   -0.495
0.01            11.62    11.61   -0.263   -1.080   -0.306
0.025           11.68    11.68   -0.189   -0.950   -0.258
0.05            11.73    11.73   -0.120   -0.844   -0.216
0.10            11.79    11.79   -0.041   -0.723   -0.170
0.25            11.89    11.89    0.094   -0.532   -0.091
0.75            12.11    12.11    0.394   -0.168    0.090
0.90            12.21    12.21    0.526   -0.021    0.171
0.95            12.27    12.27    0.604    0.071    0.217
0.975           12.32    12.32    0.670    0.155    0.254
0.99            12.38    12.39    0.753    0.245    0.302
maximum         12.58    12.57    1.067    0.564    0.568

Sample statistics for the parameters of the IV data
augmented Gibbs sampler with three instruments
applied to the treatment effect example
These results differ very little from those based on the full set of four instruments.
There is no selection bias and marginal and average treatment effects are effectively
estimated. Sample statistics for average treatment effects as well as correlations, $\rho_{D,1}$,
$\rho_{D,0}$, and $\rho_{1,0}$ are tabulated below.

statistic        ATE      ATT      ATUT     ρD,1     ρD,0     ρ1,0
mean            0.000    4.000   -4.000    0.799   -0.800   -0.884
median          0.000    4.000   -4.000    0.802   -0.814   -0.888
standard dev    0.000    0.000    0.000    0.036    0.037    0.029
quantiles:
minimum         0.000    4.000   -4.000    0.605   -0.899   -0.956
0.01            0.000    4.000   -4.000    0.702   -0.870   -0.936
0.025           0.000    4.000   -4.000    0.719   -0.861   -0.930
0.05            0.000    4.000   -4.000    0.734   -0.853   -0.924
0.10            0.000    4.000   -4.000    0.751   -0.844   -0.918
0.25            0.000    4.000   -4.000    0.777   -0.826   -0.905
0.75            0.000    4.000   -4.000    0.825   -0.778   -0.867
0.90            0.000    4.000   -4.000    0.842   -0.751   -0.846
0.95            0.000    4.000   -4.000    0.852   -0.734   -0.833
0.975           0.000    4.000   -4.000    0.860   -0.720   -0.821
0.99            0.000    4.000   -4.000    0.869   -0.699   -0.803
maximum         0.000    4.000   -4.000    0.894   -0.554   -0.703

Sample statistics for average treatment effects and error correlations
of the IV data augmented Gibbs sampler with three instruments
applied to the treatment effect example
27 Inclusion of an intercept in the selection equation with three, two, and one instruments makes no qualitative difference in the average treatment effect analysis. These results are not reported.
Weighted MTE estimates of average treatment effects are similar.
estATE(MTE) = 0.000
estATT(MTE) = 3.940
estATUT(MTE) = -3.940
Classical results based on Heckman's inverse Mills control function strategy with
three instruments are reported below for comparison. The selection equation estimated
via probit is
$$\Pr(D \mid Z) = \Phi(0.198Z_1 - 0.297Z_2 + 0.000(Z_3 + Z_4)) \qquad \text{Pseudo-}R^2 = 0.019$$
where $\Phi(\cdot)$ denotes the cumulative normal distribution function. The estimated outcome equations are
$$E[Y \mid X] = 11.890(1 - D) + 11.890D - 2.700(1 - D)\lambda_0 + 2.700D\lambda_1$$
and estimated average treatment effects are
estATE = 0.000
estATT = 4.220
estATUT = -4.220
In spite of the weak explanatory power of the selection model, control functions produce
reasonable estimates of average treatment effects. Next, we consider two instruments.
5.5.2
Two instruments
Suppose we have only two instruments, $Z_1 + Z_2$, and $Z_3 + Z_4$. IV data augmented
Gibbs sampler results are tabulated below.

statistic        β1       β0       θ1       θ2
mean            12.08    13.27   -0.034    0.008
median          12.07    13.27   -0.034    0.009
standard dev     0.168    0.243   0.065    0.128
quantiles:
minimum         11.47    12.41   -0.328   -0.579
0.01            11.69    12.70   -0.185   -0.287
0.025           11.75    12.79   -0.162   -0.244
0.05            11.80    12.87   -0.141   -0.207
0.10            11.86    12.96   -0.118   -0.159
0.25            11.96    13.11   -0.077   -0.077
0.75            12.18    13.42    0.009    0.095
0.90            12.29    13.58    0.048    0.171
0.95            12.35    13.67    0.073    0.219
0.975           12.41    13.75    0.092    0.260
0.99            12.46    13.84    0.115    0.308
maximum         12.64    14.26    0.260    0.635

Sample statistics for the parameters of the IV data
augmented Gibbs sampler with two instruments
applied to the treatment effect example
Selection bias emerges as $\beta_0$ diverges from 12. This suggests marginal and average
treatment effects are likely to be confounded. Sample statistics for average treatment
effects as well as correlations, $\rho_{D,1}$, $\rho_{D,0}$, and $\rho_{1,0}$ are tabulated below.

statistic        ATE      ATT      ATUT     ρD,1     ρD,0     ρ1,0
mean           -1.293    1.413   -4.000    0.802   -0.516   -0.634
median         -1.297    1.406   -4.000    0.806   -0.532   -0.648
standard dev    0.219    0.438    0.000    0.037    0.136    0.115
quantiles:
minimum        -2.105    0.211   -4.000    0.601   -0.813   -0.890
0.01           -1.806    0.389   -4.000    0.695   -0.757   -0.834
0.025          -1.738    0.525   -4.000    0.719   -0.732   -0.813
0.05           -1.665    0.670   -4.000    0.735   -0.706   -0.795
0.10           -1.572    0.855   -4.000    0.754   -0.675   -0.768
0.25           -1.435    1.130   -4.000    0.779   -0.613   -0.716
0.75           -1.147    1.705   -4.000    0.828   -0.438   -0.569
0.90           -1.005    1.989   -4.000    0.846   -0.340   -0.479
0.95           -0.930    2.141   -4.000    0.856   -0.262   -0.417
0.975          -0.861    2.277   -4.000    0.864   -0.195   -0.365
0.99           -0.795    2.409   -4.000    0.874   -0.124   -0.301
maximum        -0.625    2.750   -4.000    0.902    0.150   -0.055

Sample statistics for average treatment effects and error correlations
of the IV data augmented Gibbs sampler with two instruments
applied to the treatment effect example
Weighted MTE estimates of average treatment effects are similar.
estATE(MTE) = -1.293
estATT(MTE) = 1.372
estATUT(MTE) = -3.959
ATUT is effectively estimated but the other average treatment effects are biased.
Classical results based on Heckman's inverse Mills control function strategy with
two instruments are reported below for comparison. The selection equation estimated
via probit is
$$\Pr(D \mid Z) = \Phi(-0.023(Z_1 + Z_2) + 0.004(Z_3 + Z_4)) \qquad \text{Pseudo-}R^2 = 0.010$$
The estimated outcome equations are
$$E[Y \mid X] = 109.38(1 - D) + 11.683D + 121.14(1 - D)\lambda_0 + 2.926D\lambda_1$$
and estimated average treatment effects are
estATE = -97.69
estATT = -191.31
estATUT = -4.621
While the Bayesian estimates of ATE and ATT are moderately biased, classical estimates produce severe bias. Both strategies produce reasonable ATUT estimates with
the Bayesian estimation right on target. Finally, we consider one instrument.
5.5.3
One instrument
Suppose we have only one instrument, $Z_1 + Z_2 + Z_3 + Z_4$. IV data augmented Gibbs
sampler results are tabulated below.

statistic        β1       β0       θ1
mean            12.08    13.95   -0.019
median          12.09    13.95   -0.019
standard dev     0.166    0.323   0.013
quantiles:
minimum         11.42    12.95   -0.074
0.01            11.69    13.27   -0.051
0.025           11.75    13.35   -0.046
0.05            11.81    13.43   -0.041
0.10            11.87    13.53   -0.036
0.25            11.97    13.73   -0.027
0.75            12.19    14.18   -0.010
0.90            12.29    14.38   -0.002
0.95            12.35    14.50    0.003
0.975           12.40    14.59    0.006
0.99            12.47    14.69    0.011
maximum         12.67    15.12    0.033

Sample statistics for the parameters
of the IV data augmented Gibbs
sampler with one instrument applied
to the treatment effect example
Selection bias emerges as $\beta_0$ again diverges from 12. This suggests marginal and
average treatment effects are likely to be confounded. Sample statistics for average
treatment effects as well as correlations, $\rho_{D,1}$, $\rho_{D,0}$, and $\rho_{1,0}$ are tabulated below.

statistic        ATE      ATT      ATUT     ρD,1     ρD,0     ρ1,0
mean           -1.293    1.413   -4.000    0.797   -0.039   -0.048
median         -1.297    1.406   -4.000    0.801   -0.051   -0.061
standard dev    0.219    0.438    0.000    0.039    0.298    0.336
quantiles:
minimum        -2.105    0.211   -4.000    0.576   -0.757   -0.817
0.01           -1.806    0.389   -4.000    0.691   -0.615   -0.682
0.025          -1.738    0.525   -4.000    0.710   -0.554   -0.624
0.05           -1.665    0.670   -4.000    0.727   -0.503   -0.571
0.10           -1.572    0.855   -4.000    0.746   -0.429   -0.490
0.25           -1.435    1.130   -4.000    0.774   -0.272   -0.310
0.75           -1.147    1.705   -4.000    0.824    0.187    0.213
0.90           -1.005    1.989   -4.000    0.843    0.370    0.415
0.95           -0.930    2.141   -4.000    0.853    0.461    0.518
0.975          -0.861    2.277   -4.000    0.861    0.526    0.581
0.99           -0.795    2.409   -4.000    0.870    0.591    0.651
maximum        -0.625    2.750   -4.000    0.894    0.747    0.800

Sample statistics for average treatment effects and error correlations
of the IV data augmented Gibbs sampler with one instrument
applied to the treatment effect example
Weighted MTE estimates of average treatment effects are similar.
estATE(MTE) = -1.957
estATT(MTE) = 0.060
estATUT(MTE) = -3.975
ATUT is effectively estimated but the other average treatment effects are biased.
Classical results based on Heckman's inverse Mills control function strategy with
one instrument are reported below for comparison. The selection equation estimated
via probit is
$$\Pr(D \mid Z) = \Phi(-0.017(Z_1 + Z_2 + Z_3 + Z_4)) \qquad \text{Pseudo-}R^2 = 0.009$$
The estimated outcome equations are
$$E[Y \mid X] = 14.000(1 - D) + 11.885D + NA\,(1 - D)\lambda_0 + 2.671D\lambda_1$$
and estimated average treatment effects are
estATE = -2.115
estATT = NA
estATUT = NA
While the Bayesian estimates of ATE and ATT are biased, the classical strategy fails to
generate estimates for ATT and ATUT because it involves a singular X matrix as there is no
variation in $\lambda_0$.
6
Appendix — common distributions and their kernels
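To try to avoid confusion, we list our descriptions of common distributions and their
kernels. Others may employ variations on these definitions.

Gaussian (normal)
support: $x \in R^k$, $\mu \in R^k$, $\Sigma \in R^{k \times k}$ positive definite
density: $f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$
kernel: $\propto e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$
conjugacy: conjugate prior for mean of multinormal distribution

Student t
support: $x \in R^k$, $\mu \in R^k$, $\Sigma \in R^{k \times k}$ positive definite
density: $f(x; \nu, \mu, \Sigma) = \frac{\Gamma[(\nu+k)/2]}{\Gamma\left(\frac{\nu}{2}\right)(\nu\pi)^{k/2}|\Sigma|^{1/2}}\left[1 + \frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{\nu}\right]^{-\frac{\nu+k}{2}}$
kernel: $\propto \left[1 + \frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{\nu}\right]^{-\frac{\nu+k}{2}}$
conjugacy: marginal posterior for multi-normal with unknown mean and variance

Wishart
support: $W \in R^{k \times k}$ positive definite, $S \in R^{k \times k}$ positive definite, $\nu > k - 1$
density: $f(W; \nu, S) = \frac{1}{2^{\nu k/2}|S|^{\nu/2}\Gamma_k\left(\frac{\nu}{2}\right)}\,|W|^{\frac{\nu-k-1}{2}}\, e^{-\frac{1}{2}Tr(S^{-1}W)}$
kernel: $\propto |W|^{\frac{\nu-k-1}{2}}\, e^{-\frac{1}{2}Tr(S^{-1}W)}$
conjugacy: see inverse Wishart

Inverse Wishart
support: $W \in R^{k \times k}$ positive definite, $S \in R^{k \times k}$ positive definite, $\nu > k - 1$
density: $f(W; \nu, S^{-1}) = \frac{|S|^{\nu/2}}{2^{\nu k/2}\Gamma_k\left(\frac{\nu}{2}\right)}\,|W|^{-\frac{\nu+k+1}{2}}\, e^{-\frac{1}{2}Tr(SW^{-1})}$
kernel: $\propto |W|^{-\frac{\nu+k+1}{2}}\, e^{-\frac{1}{2}Tr(SW^{-1})}$
conjugacy: conjugate prior for variance of multinormal distribution

$\Gamma(n) = (n - 1)!$ for n a positive integer
$\Gamma(z) = \int_0^\infty t^{z-1}e^{-t}\,dt$
$\Gamma_k\left(\frac{\nu}{2}\right) = \pi^{k(k-1)/4}\prod_{j=1}^{k}\Gamma\left(\frac{\nu+1-j}{2}\right)$

Multivariate distributions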
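Beta
support: $x \in (0, 1)$; $\alpha, \beta > 0$
density: $f(x; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$
kernel: $\propto x^{\alpha-1}(1-x)^{\beta-1}$
conjugacy: beta is conjugate prior to binomial

Binomial
support: $s = 1, 2, \ldots$; $\theta \in (0, 1)$
density: $f(s; \theta) = \binom{n}{s}\theta^s(1-\theta)^{n-s}$
kernel: $\propto \theta^s(1-\theta)^{n-s}$
conjugacy: beta is conjugate prior to binomial

Chi-square
support: $x \in [0, \infty)$; $\nu > 0$
density: $f(x; \nu) = \frac{1}{2^{\nu/2}\Gamma(\nu/2)}\,x^{\nu/2-1}\exp[-x/2]$
kernel: $\propto x^{\nu/2-1}e^{-x/2}$
conjugacy: see scaled, inverse chi-square

Inverse chi-square
support: $x \in (0, \infty)$; $\nu > 0$
density: $f(x; \nu) = \frac{1}{2^{\nu/2}\Gamma(\nu/2)}\,\frac{\exp[-1/(2x)]}{x^{\nu/2+1}}$
kernel: $\propto x^{-\nu/2-1}e^{-1/(2x)}$
conjugacy: see scaled, inverse chi-square

Scaled, inverse chi-square
support: $x \in (0, \infty)$; $\nu, \sigma^2 > 0$
density: $f(x; \nu, \sigma^2) = \frac{(\sigma^2\nu)^{\nu/2}}{2^{\nu/2}\Gamma(\nu/2)}\,\frac{\exp[-\nu\sigma^2/(2x)]}{x^{\nu/2+1}}$
kernel: $\propto x^{-\nu/2-1}e^{-\nu\sigma^2/(2x)}$
conjugacy: conjugate prior for variance of a normal distribution

Exponential
support: $x \in [0, \infty)$; $\theta > 0$
density: $f(x; \theta) = \theta\exp[-\theta x]$
kernel: $\propto \exp[-\theta x]$
conjugacy: gamma is conjugate prior to exponential

Extreme value (logistic)
support: $x \in (-\infty, \infty)$; $-\infty < \mu < \infty$, $s > 0$
density: $f(x; \mu, s) = \frac{\exp[-(x-\mu)/s]}{s\left(1+\exp[-(x-\mu)/s]\right)^2}$
kernel: $\propto \frac{\exp[-(x-\mu)/s]}{\left(1+\exp[-(x-\mu)/s]\right)^2}$
conjugacy: posterior for Bernoulli prior and normal likelihood

Gamma
support: $x \in [0, \infty)$; $\alpha, \gamma > 0$
density: $f(x; \alpha, \gamma) = \frac{x^{\alpha-1}\exp[-x/\gamma]}{\Gamma(\alpha)\gamma^{\alpha}}$
kernel: $\propto x^{\alpha-1}\exp[-x/\gamma]$
conjugacy: gamma is conjugate prior to exponential and others

Inverse gamma
support: $x \in [0, \infty)$; $\alpha, \gamma > 0$
density: $f(x; \alpha, \gamma) = \frac{x^{-\alpha-1}\exp[-\gamma/x]}{\Gamma(\alpha)\gamma^{-\alpha}}$
kernel: $\propto x^{-\alpha-1}\exp[-\gamma/x]$
conjugacy: conjugate prior for variance of a normal distribution

Gaussian (normal)
support: $x \in (-\infty, \infty)$; $-\infty < \mu < \infty$, $\sigma > 0$
density: $f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
kernel: $\propto \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
conjugacy: conjugate prior for mean of a normal distribution

Student t
support: $x \in (-\infty, \infty)$; $\mu \in (-\infty, \infty)$; $\nu, \sigma > 0$
density: $f(x; \nu, \mu, \sigma) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)\sigma}\left[1 + \frac{1}{\nu}\frac{(x-\mu)^2}{\sigma^2}\right]^{-\frac{\nu+1}{2}}$
kernel: $\propto \left[1 + \frac{1}{\nu}\frac{(x-\mu)^2}{\sigma^2}\right]^{-\frac{\nu+1}{2}}$
conjugacy: marginal posterior for a normal with unknown mean and variance

Univariate distributions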
7
Appendix — maximum likelihood estimation of discrete choice models
The most common method for estimating the parameters of discrete choice models is
maximum likelihood. The likelihood is defined as the joint density for the parameters
of interest $\theta$ conditional on the data $X_t$. For binary choice models and $D_t = 1$ the
contribution to the likelihood is $F(X_t\theta)$, and for $D_t = 0$ the contribution to the
likelihood is $1 - F(X_t\theta)$ where these are combined as binomial draws and $F(X_t\theta)$ is
the cumulative distribution function evaluated at $X_t\theta$. Hence, the likelihood is
\[ L(\theta|X) = \prod_{t=1}^{n} F(X_t\theta)^{D_t} \left[1 - F(X_t\theta)\right]^{1-D_t} \]
The log-likelihood is
\[ \ell(\theta|X) \equiv \log L(\theta|X) = \sum_{t=1}^{n} D_t \log\left(F(X_t\theta)\right) + (1 - D_t)\log\left(1 - F(X_t\theta)\right) \]
Since this function for binary response models like probit and logit is globally concave,
numerical maximization is straightforward. The first order conditions for a maximum,
$\max_{\theta} \ell(\theta|X)$, are
\[ \sum_{t=1}^{n} \left[ \frac{D_t f(X_t\theta) X_{ti}}{F(X_t\theta)} - \frac{(1 - D_t) f(X_t\theta) X_{ti}}{1 - F(X_t\theta)} \right] = 0, \quad i = 1, \ldots, k \]
where $f(\cdot)$ is the density function. Simplifying yields
\[ \sum_{t=1}^{n} \frac{\left[D_t - F(X_t\theta)\right] f(X_t\theta) X_{ti}}{F(X_t\theta)\left[1 - F(X_t\theta)\right]} = 0, \quad i = 1, \ldots, k \]
Estimates of $\theta$ are found by solving these first order conditions iteratively or, in other
words, numerically.
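To make the iterative solution concrete, here is a minimal sketch of Newton-Raphson applied to the first order conditions, assuming a logit model with an intercept and one regressor. For the logit, the density equals F(1 - F), so the simplified first order conditions reduce to the sum of [D_t - F] X_ti. The function names and the small data set are hypothetical.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_and_hessian(theta, X, D):
    """Score (first order conditions) and Hessian of the logit log-likelihood."""
    k = len(theta)
    score = [0.0] * k
    hess = [[0.0] * k for _ in range(k)]
    for x_t, d_t in zip(X, D):
        F = logistic_cdf(sum(x * th for x, th in zip(x_t, theta)))
        for i in range(k):
            score[i] += (d_t - F) * x_t[i]                  # [D_t - F] X_ti
            for j in range(k):
                hess[i][j] -= F * (1.0 - F) * x_t[i] * x_t[j]
    return score, hess

def fit_logit(X, D, tol=1e-10, max_iter=50):
    """Newton-Raphson for a two-parameter logit; returns the estimate."""
    theta = [0.0, 0.0]
    for _ in range(max_iter):
        g, H = score_and_hessian(theta, X, D)
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        # Newton step: step = H^{-1} g (2x2 inverse written out), theta <- theta - step
        step = [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
                (H[0][0] * g[1] - H[1][0] * g[0]) / det]
        theta = [theta[0] - step[0], theta[1] - step[1]]
        if max(abs(s) for s in step) < tol:
            break
    return theta

# illustrative data (assumed): a constant plus one regressor, outcomes overlap
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
X = [[1.0, x] for x in xs]
D = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

theta_hat = fit_logit(X, D)
final_score, final_hessian = score_and_hessian(theta_hat, X, D)
print(theta_hat, final_score)
```

At the returned estimate the score is numerically zero, which is the first order condition. Global concavity is what lets a plain Newton iteration from a zero starting value work here without safeguards.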
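A common estimator for the variance of $\hat{\theta}_{MLE}$ is the negative inverse of the Hessian
matrix evaluated at $\hat{\theta}_{MLE}$, $\left[-H(D, \hat{\theta})\right]^{-1}$. Let $H(D, \theta)$ be the Hessian matrix for
the log-likelihood with typical element $H_{ij}(D, \theta) \equiv \frac{\partial^2 \ell_t(D, \theta)}{\partial\theta_i\,\partial\theta_j}$ (see footnote 28).
8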
Appendix — seemingly unrelated regression (SUR)
First, we describe the seemingly unrelated regression (SUR) model. Second, we remind
ourselves Bayesian regression works as if we have two samples: one representative of
our priors and another from the new evidence. Then, we connect to seemingly unrelated
regression (SUR) — both classical and Bayesian strategies are summarized.
28 Details can be found in numerous econometrics references and chapter 4 of Accounting and Causal
Effects: Econometric Challenges.
We describe the SUR model in terms of a stacked regression as if the latent variables
in a binary selection setting are observable.
\[ r = X\beta + \varepsilon \]
where
\[ r = \begin{bmatrix} U \\ Y_1 \\ Y_0 \end{bmatrix}, \quad X = \begin{bmatrix} W & 0 & 0 \\ 0 & X_1 & 0 \\ 0 & 0 & X_2 \end{bmatrix}, \quad \beta = \begin{bmatrix} \theta \\ \beta_1 \\ \beta_2 \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} V_D \\ V_1 \\ V_0 \end{bmatrix}, \]
and
\[ \varepsilon \sim N(0, V = \Sigma \otimes I_n) \]
Let
\[ \Sigma = \begin{bmatrix} 1 & \sigma_{D1} & \sigma_{D0} \\ \sigma_{D1} & \sigma_{11} & \sigma_{10} \\ \sigma_{D0} & \sigma_{10} & \sigma_{00} \end{bmatrix}, \]
a $3 \times 3$ matrix, then the Kronecker product, denoted by $\otimes$, is
\[ V = \Sigma \otimes I_n = \begin{bmatrix} I_n & \sigma_{D1} I_n & \sigma_{D0} I_n \\ \sigma_{D1} I_n & \sigma_{11} I_n & \sigma_{10} I_n \\ \sigma_{D0} I_n & \sigma_{10} I_n & \sigma_{00} I_n \end{bmatrix}, \]
where each block $\sigma_{jk} I_n$ is the $n \times n$ diagonal matrix with $\sigma_{jk}$ repeated along the diagonal and zeros elsewhere. $V$ is
a $3n \times 3n$ matrix and $V^{-1} = \Sigma^{-1} \otimes I_n$.
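The block structure of the Kronecker product and the inverse identity can be verified directly. The sketch below uses small hand-written matrix helpers rather than a linear algebra library; the helper names and the numerical values in `Sigma` are assumptions chosen only so that the matrix is positive definite with a unit leading element.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def kron(A, B):
    """Kronecker product: block (i, j) of the result is A[i][j] * B."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(A)
    M = [list(row) + e for row, e in zip(A, identity(n))]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [v - factor * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

n = 2
Sigma = [[1.0, 0.3, 0.2],      # assumed illustrative covariance values
         [0.3, 2.0, 0.5],
         [0.2, 0.5, 1.5]]

V = kron(Sigma, identity(n))                  # 3n x 3n
V_inv = kron(inverse(Sigma), identity(n))     # candidate inverse of V
product = matmul(V, V_inv)
max_dev = max(abs(product[i][j] - (1.0 if i == j else 0.0))
              for i in range(3 * n) for j in range(3 * n))
print(len(V), len(V[0]), max_dev)
```

The practical point is that only the 3 x 3 matrix is ever inverted; the 3n x 3n inverse follows from the Kronecker structure.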
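Classical estimation of SUR follows generalized least squares (GLS).
\[ \hat{\beta} = \left[X^T \left(\Sigma^{-1} \otimes I_n\right) X\right]^{-1} X^T \left(\Sigma^{-1} \otimes I_n\right) r \]
and
\[ Var\left[\hat{\beta}\right] = \left[X^T \left(\Sigma^{-1} \otimes I_n\right) X\right]^{-1} \]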
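The GLS formulas can be sketched for a toy stacked system. This is an illustration under stated assumptions, not a full SUR routine: each of the three blocks has a single regressor, the covariance matrix is treated as known, and the outcome vector is generated without noise so that GLS should reproduce the coefficients exactly. All names and numbers are hypothetical.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def kron(A, B):
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(A)
    M = [list(row) + e for row, e in zip(A, identity(n))]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [v - factor * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

def block_diag(*cols):
    """Stack single-regressor blocks (lists of n values) block-diagonally."""
    k = len(cols)
    return [[value if i == j else 0.0 for i in range(k)]
            for j, col in enumerate(cols) for value in col]

def sur_gls(X, r, Sigma, n):
    omega = kron(inverse(Sigma), identity(n))     # V^{-1} = Sigma^{-1} kron I_n
    Xt_omega = matmul(transpose(X), omega)
    var_beta = inverse(matmul(Xt_omega, X))       # variance of the GLS estimator
    r_col = [[v] for v in r]
    beta_hat = [row[0] for row in matmul(var_beta, matmul(Xt_omega, r_col))]
    return beta_hat, var_beta

n = 3
Sigma = [[1.0, 0.3, 0.2], [0.3, 2.0, 0.5], [0.2, 0.5, 1.5]]   # assumed known
W, X1, X2 = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [1.5, 0.5, -0.5]
X = block_diag(W, X1, X2)                         # 3n x 3 design matrix
beta_true = [0.7, -1.2, 2.0]
r = [sum(x * b for x, b in zip(row, beta_true)) for row in X]  # noise-free outcome

beta_hat, var_beta = sur_gls(X, r, Sigma, n)
print(beta_hat)
```

With noise-free outcomes the weighting is irrelevant and the estimate equals `beta_true` up to rounding, which makes the example easy to check. With noisy outcomes the off-diagonal elements of the covariance matrix are what distinguish GLS from equation-by-equation least squares.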
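On the other hand, Bayesian analysis employs a Gibbs sampler based on the conditional posteriors as, apparently, the SUR error structure prevents identification of
conjugate priors. Recall, from the discussion of Bayesian linear regression with general error structure the conditional posterior for $\beta$ is $p(\beta \mid \Sigma, y; \beta_0, \Sigma_\beta) \sim N(\bar{\beta}, V_\beta)$
where
\[ \bar{\beta} = \left(X_0^T \Sigma_0^{-1} X_0 + X^T \Sigma^{-1} X\right)^{-1} \left(X_0^T \Sigma_0^{-1} X_0 \beta_0 + X^T \Sigma^{-1} X \hat{\beta}\right) \]
\[ \phantom{\bar{\beta}} = \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right)^{-1} \left(\Sigma_\beta^{-1} \beta_0 + X^T \Sigma^{-1} X \hat{\beta}\right), \]
\[ \hat{\beta} = \left(X^T \Sigma^{-1} X\right)^{-1} X^T \Sigma^{-1} y \]
and
\[ V_\beta = \left(X_0^T \Sigma_0^{-1} X_0 + X^T \Sigma^{-1} X\right)^{-1} = \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right)^{-1} \]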
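For the present SUR application, we replace $y$ with $r$ and $\Sigma^{-1}$ with $\Sigma^{-1} \otimes I_n$ yielding the conditional posterior for $\beta$, $p(\beta \mid \Sigma, y; \beta_0, \Sigma_\beta) \sim N\left(\bar{\beta}^{SUR}, V_\beta^{SUR}\right)$ where
\[ \bar{\beta}^{SUR} = \left(X_0^T \Sigma_0^{-1} X_0 + X^T \left(\Sigma^{-1} \otimes I_n\right) X\right)^{-1} \left(X_0^T \Sigma_0^{-1} X_0 \beta_0 + X^T \left(\Sigma^{-1} \otimes I_n\right) X \hat{\beta}^{SUR}\right) \]
\[ \phantom{\bar{\beta}^{SUR}} = \left(\Sigma_\beta^{-1} + X^T \left(\Sigma^{-1} \otimes I_n\right) X\right)^{-1} \left(\Sigma_\beta^{-1} \beta_0 + X^T \left(\Sigma^{-1} \otimes I_n\right) X \hat{\beta}^{SUR}\right), \]
\[ \hat{\beta}^{SUR} = \left[X^T \left(\Sigma^{-1} \otimes I_n\right) X\right]^{-1} X^T \left(\Sigma^{-1} \otimes I_n\right) r \]
and
\[ V_\beta^{SUR} = \left(X_0^T \Sigma_0^{-1} X_0 + X^T \left(\Sigma^{-1} \otimes I_n\right) X\right)^{-1} = \left(\Sigma_\beta^{-1} + X^T \left(\Sigma^{-1} \otimes I_n\right) X\right)^{-1} \]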
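The conditional posterior mean is a precision-weighted combination of the prior mean and the sample estimate, which is the sense in which Bayesian regression works as if there were two samples. The sketch below shows this in a deliberately simplified scalar setting, assuming one regressor, a known error variance, and a normal prior. It is not the SUR Gibbs sampler itself, and the function name and numbers are hypothetical.

```python
def conditional_posterior(beta0, prior_var, x, y, sigma2):
    """Scalar analogue of the conditional posterior for the coefficient.

    prior precision plays the role of Sigma_beta^{-1};
    data precision plays the role of X' Sigma^{-1} X with Sigma = sigma2 * I.
    """
    sxx = sum(v * v for v in x)
    beta_hat = sum(a * b for a, b in zip(x, y)) / sxx   # GLS (here OLS) estimate
    prior_prec = 1.0 / prior_var
    data_prec = sxx / sigma2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * beta0 + data_prec * beta_hat)
    return post_mean, post_var, beta_hat

# illustrative inputs (assumed)
x = [1.0, 2.0, 3.0]
y = [2.1, 3.9, 6.2]
post_mean, post_var, beta_hat = conditional_posterior(
    beta0=0.0, prior_var=1.0, x=x, y=y, sigma2=1.0)
print(post_mean, post_var, beta_hat)
```

The posterior mean lies between the prior mean and the sample estimate, and moves toward the sample estimate as the prior variance grows. In a Gibbs sampler this draw would alternate with a draw of the covariance matrix given the coefficients.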
9
Examples [to be added]
means (ANOVA)
- Gaussian likelihood
- variance unknown
- Jeffreys' prior
9.1
regression
- Gaussian likelihood
- follows from Bayesian conditional distribution
data model: $p(y \mid \theta_1, \theta_2, X) \sim N(X\theta_1, \theta_2 I)$,
prior model: $p(\theta_1, \theta_2 \mid X) \propto 1/\theta_2$
- conditional simulation
effectively treats scale as a nuisance parameter
marginal posterior of $\theta_2$ given $(y, X)$,
conditional posterior of $\theta_1$ given $(\theta_2, y, X)$.
- variance unknown
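The conditional simulation outlined for regression can be sketched as a two-step draw. Writing the coefficient as theta1 and the error variance as theta2, and assuming a Gaussian likelihood with a prior proportional to 1/theta2, the standard result is that theta2 given the data is scaled inverse chi-square with n - k degrees of freedom and scale s^2, while theta1 given theta2 and the data is normal around the least squares estimate. The sketch assumes a single regressor with no intercept (k = 1); the function name and data are hypothetical.

```python
import math
import random

def simulate_posterior(x, y, draws=5000, seed=1):
    """Draw (theta1, theta2) by composition: theta2 | data, then theta1 | theta2, data."""
    rng = random.Random(seed)
    n, k = len(x), 1
    sxx = sum(v * v for v in x)
    b_hat = sum(a * b for a, b in zip(x, y)) / sxx          # least squares estimate
    ssr = sum((b - b_hat * a) ** 2 for a, b in zip(x, y))   # equals (n - k) * s^2
    out = []
    for _ in range(draws):
        chi2 = rng.gammavariate((n - k) / 2.0, 2.0)   # chi-square with n - k df
        theta2 = ssr / chi2                           # scaled inverse chi-square draw
        theta1 = rng.gauss(b_hat, math.sqrt(theta2 / sxx))
        out.append((theta1, theta2))
    return b_hat, out

# illustrative data (assumed)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.3, 3.8, 6.1, 8.4, 9.7, 12.2, 13.8, 16.1]
b_hat, sample = simulate_posterior(x, y)
theta1_mean = sum(t1 for t1, _ in sample) / len(sample)
print(b_hat, theta1_mean)
```

Drawing the variance first and then the coefficient conditional on it is what treats scale as a nuisance parameter: the retained theta1 draws come from its marginal posterior, a Student t centered at the least squares estimate.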
9.2
selection
missing data - latent expected utility and counterfactuals
- data augmentation,
- Gaussian likelihood,
- SUR,
- bounding unidentified parameters,
- Gibbs sampler