Markov Chain Monte Carlo Simulation Made Simple
Alastair Smith
Department of Politics
New York University
April 2, 2003
Markov Chain Monte Carlo (MCMC) simulation is a powerful technique for performing numerical integration. It can be used to numerically estimate complex econometric models. In this paper I describe the intuition behind the process and show its flexibility and applicability. I conclude by demonstrating that these methods are often simpler to implement than many common techniques such as maximum likelihood estimation (MLE).
This paper serves as a brief introduction. I do not intend to derive any results or prove any theorems. I believe that MCMC offers a powerful estimation tool. This paper is designed to remove the mystery surrounding the process. Not only is it extremely powerful and flexible, it is also easy to implement. Given the recent growth in the power of computers, I believe that numerical procedures will be the estimation tools of the future. I outline the underlying logic and show why these techniques work.
MCMC techniques are most often used in the Bayesian context. I start by outlining the simple linear model in the Bayesian framework. Although analytical techniques exist for this model, they are complex. In general, more complex models are analytically intractable. Having set up the estimation problem, I examine the properties of Markov chains. These properties provide the basis for the estimation procedure.
1 The Bayesian Model
While I believe that the Bayesian approach is a superior and more consistent approach to statistics than the standard frequentist approach, this debate is voluminous and not the topic of this paper. For practical purposes it is usually possible to use diffuse priors that do not influence the posterior results.
prior: f(θ)
likelihood: L(Y|θ)
posterior: f(θ|Y) ∝ f(θ) · L(Y|θ)

For example, in the simple linear model θ = {β, σ²}.
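To make these pieces concrete for the simple linear model, the likelihood and posterior can be written out explicitly (an illustrative restatement in LaTeX, using the model y_i = x_iβ + e_i with e_i ~ N(0, σ²) from Section 5.1; it does not appear in the original text):

\[
f(\beta, \sigma^2 \mid Y) \;\propto\; f(\beta, \sigma^2)\, L(Y \mid \beta, \sigma^2),
\qquad
L(Y \mid \beta, \sigma^2) \;=\; \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\!\left( -\frac{(y_i - x_i \beta)^2}{2\sigma^2} \right).
\]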
2 Markov Chains
A Markov chain is a stochastic process. It generates a series of observations, X. To illustrate the concept I focus on a discrete time, discrete state space model. At each time period the process generates a sample, X^t, from the state space. For a simple example, suppose that the state space is the numbers 1, 2, and 3. A Markov chain is simply a string of these numbers. The Markov property is that the probability distribution over the next observation depends only upon the current observation. Let p_ij represent the probability that the next observation is j (X^(t+1) = j), given that the current observation is i (X^t = i).
A convenient way to present these transition probabilities is through a transition matrix P,

P = [ p_11 p_12 p_13
      p_21 p_22 p_23
      p_31 p_32 p_33 ].

The elements of the first row represent the probabilities of moving to the different states if the current state is 1. Therefore, p_11 represents the probability that X^(t+1) = 1 if the current state is also 1; p_12 represents the probability that X^(t+1) = 2 if the current state is 1, and so on.
Suppose that our initial observation is indeed 1 (X^0 = 1). The probability distribution for the next state is given by the first row of P. The next question is what is the probability distribution over the following observation. To illustrate, I consider the more specific question: with what probability does X^2 = 3? There are three possible paths by which the second observation could equal 3; they are illustrated in the table below. Thus the probability that X^2 = 3 is p_11 p_13 + p_12 p_23 + p_13 p_33.
Pathway   X^0   X^1   X^2   Probability
#1         1     1     3    p_11 p_13
#2         1     2     3    p_12 p_23
#3         1     3     3    p_13 p_33
Thus, for any initial state, we can calculate the probability density over the states after a given number of moves. Obviously, as the number of moves increases these calculations become increasingly difficult. Yet matrix notation simplifies the calculation.
Suppose that, rather than starting with a specific state, we consider a probability distribution over these states, ν^(0) = [v_1^(0), v_2^(0), v_3^(0)]. If we randomly select the initial state from this distribution, then the probability distribution of the next state in the chain is given by

ν^(1) = [v_1^(1), v_2^(1), v_3^(1)] = ν^(0) P
      = [ p_11 v_1^(0) + p_21 v_2^(0) + p_31 v_3^(0),
          p_12 v_1^(0) + p_22 v_2^(0) + p_32 v_3^(0),
          p_13 v_1^(0) + p_23 v_2^(0) + p_33 v_3^(0) ].
This idea can be extended: the probability distribution over the states after the second move is simply ν^(2) = ν^(1) P = ν^(0) P². More generally, ν^(t) = ν^(0) P^t. Of particular interest is the distribution as the chain becomes long. As the chain's length increases, the distribution over the states becomes less and less determined by the starting distribution and more and more determined by the transition probabilities. Indeed, provided the chain satisfies certain regularity conditions (i.e., it does not get stuck in one state), there exists a unique invariant distribution associated with every transition matrix. Let π represent this invariant distribution. For any starting distribution ν^(0), as the chain becomes long, ν^(t) tends to π (lim_{t→∞} ν^(t) = π).
There are two ways to calculate this invariant distribution. The first is analytical: exploit the fact that π = πP and solve this system of equations. The second, and of more relevance for this paper, is to simulate π by actually running the Markov chain. This involves choosing a starting value and simply running the chain. The initial values in the chain depend strongly upon the starting value. However, as the chain becomes longer, the elements of the chain come to represent random draws from the probability distribution π.


Suppose, for example, that the transition matrix is

P = [ 0.01 0.70 0.29
      0.10 0.50 0.40
      0.20 0.60 0.20 ].

We could start by setting X^0 = 1 and then running the Markov chain. We could estimate the density of each state by examining the frequency with which each state occurs. Figure 1 demonstrates that, as the number of iterations becomes large, the relative frequency of each state converges to its invariant density. We can arbitrarily increase the accuracy of these estimates simply by taking more iterations.
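As an illustration of both approaches, the following Python sketch (not part of the original paper) simulates this three-state chain and compares the empirical state frequencies with the invariant distribution obtained by solving π = πP analytically; the variable names are mine.

import numpy as np

# Transition matrix from the example above (rows sum to one).
P = np.array([[0.01, 0.70, 0.29],
              [0.10, 0.50, 0.40],
              [0.20, 0.60, 0.20]])

rng = np.random.default_rng(0)

# Simulate the chain: draw the next state from the row of P
# corresponding to the current state.
n_iter = 100_000
state = 0                      # start the chain in state 1 (index 0)
counts = np.zeros(3)
for _ in range(n_iter):
    state = rng.choice(3, p=P[state])
    counts[state] += 1
print("simulated frequencies:", counts / n_iter)

# Analytical invariant distribution: solve pi = pi P together with
# the constraint that the probabilities sum to one.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print("invariant distribution:", pi)

As the number of iterations grows, the simulated frequencies approach the analytically computed invariant distribution, which is exactly the behaviour shown in Figure 1.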
In this example, I use a discrete state space model; however, these ideas
are readily extendable to continuous state space models, where the transition
matrix is replaced by a transition kernel (a probability density over the next
state that depends only upon the current state).
[Figure 1: relative frequency of each of the three states as a function of the number of iterations N (up to 150); the frequencies converge to the invariant density.]

2.1 Exploiting Markov chains for estimation

Most of Markov theory revolves around finding the invariant distribution of Markov chains. MCMC turns the problem around. Rather than finding
the invariant distribution of a specific Markov chain, it starts with a specific invariant distribution and asks: can I find a Markov chain that has this invariant distribution?[1] Typically, we already know the distribution of interest: the posterior distribution of the parameters. The key is to find a transition kernel that has this invariant distribution, f(θ|Y). In Bayesian estimation we want to find the posterior distribution of the parameters, f(θ|Y). As discussed above, this is often analytically intractable. However, suppose we have a Markov chain, P, whose invariant distribution is f(θ|Y). If we run this Markov process then, as the chain becomes long, its elements represent random draws from the posterior distribution f(θ|Y).
To illustrate how the process works, consider the following algorithm.

1. Choose starting values, θ^(0), and the length of the chain, n0 + m.

2. Given the current element in the chain, θ^(t), use the Markov process P to draw the next element, θ^(t+1).

3. If t > m, then store θ^(t+1).

4. If t < m + n0, then return to step 2; otherwise calculate and report the descriptive statistics for the elements stored in step 3.
This algorithm generates and stores n0 elements from the chain. These elements represent random samples from the posterior distribution f(θ|Y). Thus the sample average represents an estimate of the expected value of θ. Other properties of f(θ|Y) can also be estimated by examining the properties of the sample. The accuracy of these estimates depends upon the number of draws, n0; accuracy is improved by running the chain longer. Note that the first m iterations of the chain were discarded. The initial elements in the chain are strongly influenced by the starting value (as the figure above demonstrates). If these starting values are drawn from a low-density region of the posterior density then the chain contains too many draws from this region.[2]
[1] Each Markov process has a unique invariant distribution. Yet many Markov chains could have the same invariant distribution. Thus, we are free to use any of these processes to simulate the invariant distribution.

[2] Another practical problem with running this algorithm is the high autocorrelation between elements in the chain. This reduces the rate at which convergence is achieved. A practical solution is to subsample from the elements stored at step 3.
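As a sketch of how the four steps above translate into code (illustrative Python; the function names run_chain and draw_next are mine, not the paper's, and draw_next stands in for whatever transition kernel P is being used):

import numpy as np

def run_chain(draw_next, theta0, m, n0):
    """Run a Markov chain for m burn-in iterations plus n0 kept draws.

    draw_next(theta) must return one draw from the transition kernel P
    given the current element theta; theta0 is the starting value.
    """
    theta = theta0
    kept = []
    for t in range(m + n0):
        theta = draw_next(theta)      # step 2: move the chain forward
        if t >= m:                    # step 3: store only post-burn-in draws
            kept.append(theta)
    kept = np.asarray(kept)
    # step 4: report descriptive statistics of the stored draws
    return kept.mean(axis=0), kept.std(axis=0), kept

The sample mean of the kept draws then estimates E(θ|Y), as described above.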
In summary, if we can find a Markov process with transition kernel P such that its invariant distribution is f(θ|Y), then we can numerically estimate this posterior distribution by running the Markov chain. Obviously, there are many important convergence considerations that I have not discussed. However, the basic point is that if an appropriate transition kernel can be found, then estimation involves nothing more than running the Markov process. So far I have said nothing about how to find an appropriate transition kernel. It is to this point that I turn next.
3 Transition Kernels
Table 1 compares the analytically calculated probability distribution with the numerically simulated values. The accuracy of the simulation can be increased simply by increasing the number of iterations of the chain.[3]

Most of Markov theory revolves around finding the invariant distribution of Markov chains. MCMC turns the problem around. Typically, we already know the distribution of interest: the posterior distribution of the parameters. The key is to find a transition kernel that has this invariant distribution. To estimate this distribution we then simply need to run the Markov chain for a suitably long period.
4 Joint, marginal and conditional distributions
In the linear model we want to estimate f(β, σ²|Y). Somewhat informally, this is the probability density of seeing a particular value of β and σ². Bayesians have calculated this density. It turns out that, with suitable conjugate priors,[4] f(β, σ²|Y) has a normal-inverse-gamma distribution. Unfortunately, this is about the most complicated model for which we can work with the joint posterior density analytically. For more complex models the joint density is simply intractable. Yet generally our interest is in the marginal density of a particular parameter. In the particular case of the simple linear model we typically want to know about β and σ² separately. For example, the distribution of β is all we report from a regression model. This marginal density is simply the joint density of β and σ² integrated across all possible values of σ².

[3] This is a convenient time to discuss several practical aspects of implementing MCMC methods. First, the starting value of the Markov chain affects the initial values of the chain. Over time its effect diminishes. However, if the starting values represent very low-density portions of the state space then the choice of starting values affects the results. The usual solution is to discard the early part of the chain. This disregards those draws from the chain that are highly dependent upon the starting values. [Convergence criteria??? literature ?????]

[4] What is a conjugate prior? (A prior under which the posterior density belongs to the same family of distributions as the prior; see Section 5.)
The key to using MCMC is to stop thinking in terms of calculating things analytically and to imagine how you could simulate a single parameter in a model if you knew all the other parameters. Suppose, for example, that you knew the marginal distribution of σ² and wanted to calculate the marginal distribution of β. In order to estimate the marginal density of β I could simply integrate out σ² from the joint density. While this is merely tricky in this problem, it is impossible in more complex econometric models. However, knowing the marginal density of σ², I can take a large number of random draws from this density. For each of these draws, the conditional density of β is simple to calculate (with normal priors, f(β|Y, σ²) is also distributed normally). To numerically estimate β I can then draw a random sample from this conditional distribution.
Algorithm to calculate the marginal density of β given that the density of σ² is known.

1. Set t = 1.

2. Randomly draw (σ²)^(t) from its known posterior marginal distribution.

3. Calculate the posterior density of β given (σ²)^(t), i.e. f(β|Y, (σ²)^(t)).

4. Randomly draw β^(t) from f(β|Y, (σ²)^(t)).

5. Let t = t + 1 and go to 2.
Suppose this algorithm is repeated T times. Then the T samples of β represent random draws from its marginal density. The algorithm effectively integrates out σ².
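The reason this composition works is the standard marginalization identity (written out here for completeness; it is implicit rather than stated in the original text):

\[
f(\beta \mid Y) \;=\; \int f(\beta \mid Y, \sigma^2)\, f(\sigma^2 \mid Y)\, d\sigma^2 ,
\]

so drawing (σ²)^(t) from f(σ²|Y) and then β^(t) from f(β|Y, (σ²)^(t)) produces draws whose distribution is exactly the marginal f(β|Y).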
As an analogy, in our 101 econometrics classes we learn how to estimate the mean of a variable if we know its variance. We then learn to calculate the variance if we know the mean. Being an order of magnitude harder, the calculation of the joint distribution of the mean and variance is typically omitted. Calculating the posterior density of the mean and the variance together is much harder than calculating either conditional density. However, provided we can break a model down into a series of simple conditional densities, we can estimate the marginal density of a parameter.
The algorithm above assumed that the distribution of σ² was known and produced a random sample from the posterior density of β. However, if the draws from the algorithm represent random draws from the marginal density of β, then we can simply reverse the logic of the argument and draw random samples from the conditional density of σ² given the current value of β. Given that the β's are random draws from the marginal density of β, the resulting draws of σ² represent random draws from the marginal density of σ². Hence the following algorithm simulates the posterior distributions of β and σ².
Algorithm to calculate the marginal densities of β and σ².

1. Set t = 0 and choose starting values, β^(0) and (σ²)^(0).

2. Calculate the posterior density of β given (σ²)^(t), i.e. f(β|Y, (σ²)^(t)).

3. Randomly draw β^(t+1) from this distribution.

4. Calculate the posterior density of σ² given β^(t+1), i.e. f(σ²|Y, β^(t+1)).

5. Randomly draw (σ²)^(t+1) from f(σ²|Y, β^(t+1)).

6. Let t = t + 1 and go to 2.

Provided the priors are appropriately chosen, the calculations of f(β|Y, (σ²)^(t)) and f(σ²|Y, β^(t+1)) are straightforward.
The following code shows how simply this algorithm can be implemented in STATA.

See program OLS_MCMC.do
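The OLS_MCMC.do program is not included here. As an illustrative substitute, the following Python sketch implements the same Gibbs sampler using the conditional densities given in Section 5.1 (the data X and y, the prior parameters beta0, B0, nu0, delta0, the burn-in m, and the number of retained draws n0 are user-supplied inputs; the function name is mine, not the paper's):

import numpy as np

def gibbs_linear_model(y, X, beta0, B0, nu0, delta0, m, n0, seed=0):
    """Gibbs sampler for y = X beta + e, e ~ N(0, sigma^2 I), using the
    conditional posteriors from Section 5.1:
      beta | y, sigma^2      ~ N(beta_hat, B)
      sigma^{-2} | y, beta   ~ Gamma((nu0 + n)/2, rate = (delta0 + SSE)/2)
    Returns the post-burn-in draws of beta and sigma^2.
    """
    rng = np.random.default_rng(seed)
    n, k = X.shape
    B0_inv = np.linalg.inv(B0)
    sigma2 = 1.0                           # starting value for sigma^2
    betas, sigma2s = [], []
    for t in range(m + n0):
        # Draw beta | y, sigma^2 from N(beta_hat, B).
        B = np.linalg.inv(B0_inv + X.T @ X / sigma2)
        beta_hat = B @ (B0_inv @ beta0 + X.T @ y / sigma2)
        beta = rng.multivariate_normal(beta_hat, B)
        # Draw sigma^{-2} | y, beta and invert it to get sigma^2.
        sse = np.sum((y - X @ beta) ** 2)
        sigma2 = 1.0 / rng.gamma(shape=(nu0 + n) / 2, scale=2.0 / (delta0 + sse))
        if t >= m:                         # discard the first m draws (burn-in)
            betas.append(beta)
            sigma2s.append(sigma2)
    return np.asarray(betas), np.asarray(sigma2s)

With diffuse priors (a large prior variance B0 and small nu0 and delta0), the posterior mean of the β draws should be close to the OLS estimates, which provides a useful check of the implementation.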
5 Bayesian Updates for simple models
Suppose we assume that the likelihood function is normal and so is our prior:

Likelihood: p(y|θ) = (1/√(2π)) exp(−(1/2)(y − θ)²)

Normal prior: f(θ) = (1/√(2π)) exp(−(1/2)(θ − µ0)²)

To make life as simple as possible, suppose initially that the variance of both the likelihood and the prior density is one. By Bayes' rule, the posterior density is proportional to the product of the prior and the likelihood: p(θ|y) ∝ p(y|θ)f(θ).
We can show that the posterior density is also normal. Specifically,

p(θ|y) ∝ exp(−(1/2)(y − θ)²) · exp(−(1/2)(θ − µ0)²) = exp(−(1/2)[(y − θ)² + (θ − µ0)²]).

We can expand the terms in the exponential and then collect them (completing the square):

(1/2)[(y − θ)² + (θ − µ0)²] = θ² + (−y − µ0)θ + (1/2)y² + (1/2)µ0².

We only care about the terms in θ, since everything else is absorbed into the normalizing constant. Matching

θ² + (−y − µ0)θ + (1/2)y² + (1/2)µ0²

against (θ − b)² = θ² − 2bθ + b² gives b = (1/2)(y + µ0). Hence

p(θ|y) ∝ exp(−(θ − (1/2)(y + µ0))²),

so p(θ|y) is distributed normal with mean (1/2)(y + µ0) and variance 1/2. We can now move to a more realistic example. The normal prior is referred to as a conjugate prior since it results in a posterior density from the same class of distributions.
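As a quick numerical check of this update (an illustrative sketch, not from the paper; the values mu0 = 0 and y = 2 are arbitrary), the normalized product of the prior and likelihood on a grid matches the closed-form N((y + µ0)/2, 1/2) density:

import numpy as np
from scipy.stats import norm

mu0, y = 0.0, 2.0                      # arbitrary prior mean and observation
theta = np.linspace(-6.0, 8.0, 2001)
d = theta[1] - theta[0]

# Unnormalized posterior: likelihood times prior, both with variance one.
post = norm.pdf(y, loc=theta, scale=1.0) * norm.pdf(theta, loc=mu0, scale=1.0)
post /= post.sum() * d                 # normalize on the grid

closed_form = norm.pdf(theta, loc=0.5 * (y + mu0), scale=np.sqrt(0.5))
print(np.max(np.abs(post - closed_form)))   # numerically close to zero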
5.1 Simple Linear Model
Consider the OLS simple linear model, yi = xiβ + ei where ei ~ N(0, σ²). Using conjugate priors, β ~ N(β0, B0) and σ⁻² ~ Gamma(υ0, δ0), we can derive the posterior conditional densities.

The posterior for β is normally distributed: β|y, σ² ~ N(β̂, B), where

β̂ = B(B0⁻¹β0 + σ⁻² Σ_{i=1}^N xi'yi)   and   B = (B0⁻¹ + σ⁻² Σ_{i=1}^N xi'xi)⁻¹.

The posterior for σ² is inverse gamma distributed: σ⁻²|y, β is distributed

Gamma((υ0 + N)/2, (δ0 + SSE)/2),

where SSE = Σ_{i=1}^N (yi − xiβ)², i.e. the sum of squared errors.
5.2 More Complex Models
A key advantage of MCMC is that models can be built up in simple stepwise fashion. Suppose, for example, that instead of a continuous dependent variable we have binary outcomes. Such data are typically analysed with a probit model. Specifically, zi = xiβ + ei where ei ~ N(0, 1), and yi = 1 if zi > 0 while yi = 0 if zi < 0. The variable zi is referred to as a latent variable since we never actually observe it. The standard approach to estimating such a model is to integrate out the latent variable and then apply maximum likelihood. A simple MCMC approach utilizes a data augmentation technique (Tanner and Wong, 198?). If we knew the value of these latent data then we could simulate the β's just as we did in the OLS model above. Although we do not care directly about the latent data, we can simulate these data. The probit model tells us the distribution of the latent data. Specifically, if yi = 1 then zi has a left-truncated normal distribution with mean xiβ and variance 1: zi ~ TN[0,+∞)(xiβ, 1). Similarly, if yi = 0 then the corresponding latent variable lies between −∞ and 0: zi ~ TN(−∞,0](xiβ, 1).
We now have the tools to implement this model. Let Z refer to the set of latent data (i.e., all the zi's).
Algorithm to calculate the marginal density of β in a probit model.

1. Set t = 0 and choose starting values, β^(0) and Z^(0).

2. Calculate the posterior density of β given Z^(t), i.e. f(β|Z^(t)).

3. Randomly draw β^(t+1) from this distribution.

4. Calculate the posterior density of Z given β^(t+1), i.e. f(Z|Y, β^(t+1)).

5. Randomly draw Z^(t+1) from f(Z|Y, β^(t+1)).

6. Let t = t + 1 and go to 2.
The key simplification here is that, given Z, the posterior distribution of β is independent of the binary observed dependent variable. While in this context the MLE approach provides highly reliable estimates, in more complex models, such as multivariate, multinomial, or censored discrete choice models, MLE is less reliable. MCMC provides a powerful tool in these cases, being easy to program and less prone to the convergence failure problems of MLE.
The construction of MCMC samplers can be done piecewise. For example, the OLS code above will estimate the probit model with two additions. First, set the variance, σ², equal to one. Second, add the step to draw the latent data, Z. This is easily achieved using the following simulation. If z is a truncated normal variable with mean xβ and variance 1 on the range p to q, then the following provides a method to generate a random sample from the distribution of z. If x ~ TN[p,q](µ, σ²) and u is a uniform random number, then

x̃ = µ + σ Φ⁻¹( Φ((p − µ)/σ) + u (Φ((q − µ)/σ) − Φ((p − µ)/σ)) )

represents a random draw of x.
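A direct transcription of this inverse-CDF formula into Python (an illustrative sketch; Φ and Φ⁻¹ are taken from scipy as norm.cdf and norm.ppf, the function name is mine, and mu = 1.2 in the usage line merely stands in for some value of xiβ):

import numpy as np
from scipy.stats import norm

def draw_truncated_normal(mu, sigma, p, q, rng):
    """One draw from N(mu, sigma^2) truncated to the interval [p, q]."""
    u = rng.uniform()
    lo = norm.cdf((p - mu) / sigma)
    hi = norm.cdf((q - mu) / sigma)
    return mu + sigma * norm.ppf(lo + u * (hi - lo))

rng = np.random.default_rng(0)
# Latent draw for an observation with y_i = 1 in the probit model:
# z_i is truncated to [0, +infinity).
z = draw_truncated_normal(mu=1.2, sigma=1.0, p=0.0, q=np.inf, rng=rng)
print(z)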