Geometry and ergodicity of Hamiltonian Monte Carlo
Simon Byrne
with Michael Betancourt and Sam Livingston
Department of Statistical Science
University College London
24th September 2015
1 / 20
Bayesian computational problem
We want to compute the posterior distribution of some quantity x
conditional on having observed y:
Π(dx) = p(y | x) ω(dx) / ∫_X p(y | x′) ω(dx′)
High-dimensional, complicated distributions.
Cannot be integrated or sampled from directly.
Normalisation constant is unknown.
Markov chain Monte Carlo (MCMC)
We construct a Markov chain T(dx_{i+1} | x_i) whose invariant distribution is the posterior.
2 / 20
MCMC: Exploration vs. Acceptance
There is an inherent tradeoff:
Small proposals ⇒ highly correlated samples ⇒ poor estimators.
Large proposals ⇒ low acceptance rates ⇒ highly correlated samples ⇒ poor estimators.
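To make the tradeoff concrete, here is a minimal random-walk Metropolis sketch in Python (an illustration, not from the slides); rw_metropolis and step_size are hypothetical names, and step_size is the knob being traded off.

    import numpy as np

    def rw_metropolis(log_pi, x0, step_size, n_samples, rng):
        # Random-walk Metropolis: Gaussian proposals of scale step_size.
        x = np.asarray(x0, dtype=float)
        samples, n_accept = [], 0
        for _ in range(n_samples):
            proposal = x + step_size * rng.standard_normal(x.shape)
            # Accept with probability min(1, pi(proposal) / pi(x));
            # the unknown normalisation constant cancels in the ratio.
            if np.log(rng.uniform()) < log_pi(proposal) - log_pi(x):
                x, n_accept = proposal, n_accept + 1
            samples.append(x.copy())
        return np.array(samples), n_accept / n_samples

    rng = np.random.default_rng(0)
    log_pi = lambda x: -0.5 * np.sum(x**2)   # standard Gaussian target
    for step in (0.1, 2.4, 20.0):
        _, rate = rw_metropolis(log_pi, np.zeros(5), step, 5000, rng)
        print(f"step={step:5.1f}  acceptance rate={rate:.2f}")

Small steps are accepted almost always but barely move; large steps are almost always rejected; both give highly correlated chains.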
3 / 20
Hamiltonian/Hybrid Monte Carlo
Let X be a Riemannian manifold with metric M.
Auxiliary momentum variable p ∈ T*_x X, with p | x ∼ N(0, M(x))
(Girolami and Calderhead 2011).
The Hamiltonian is the negative log-density of the joint distribution:
H(x, p) = −log π(x) + ½ log |M(x)| + ½ pᵀ M(x)⁻¹ p
Defines a dynamical system (Hamilton’s equations):
dx/dt = ∂H/∂p = M(x)⁻¹ p   and   dp/dt = −∂H/∂x
Unaffected by normalisation constants.
Properties:
H constant over t.
Reversible by negation of p.
(x(0), p(0)) ↦ (x(t), p(t)) has unit Jacobian determinant.
4 / 20
Leapfrog integrator
We can’t solve the system exactly, but we can approximate it using a
leapfrog integrator, which is:
Reversible.
Unit Jacobian.
H approximately preserved.
A code sketch follows below.
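As a concrete illustration (not from the slides), a minimal leapfrog sketch in Python, assuming a constant Euclidean metric M = I so that dx/dt = p; grad_log_pi, eps, and L are illustrative names for ∇_x log π, the step size, and the number of steps.

    import numpy as np

    def leapfrog(grad_log_pi, x, p, eps, L):
        # L leapfrog steps of size eps: reversible, unit Jacobian,
        # and H is approximately preserved.
        x, p = x.copy(), p.copy()
        p = p + 0.5 * eps * grad_log_pi(x)    # initial half step for momentum
        for i in range(L):
            x = x + eps * p                   # full step for position (M = I)
            if i < L - 1:
                p = p + eps * grad_log_pi(x)  # full momentum steps in between
        p = p + 0.5 * eps * grad_log_pi(x)    # final half step for momentum
        return x, p

Note dp/dt = ∇_x log π(x) here, matching Hamilton's equations on the previous slide.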
5 / 20
Hamiltonian/Hybrid Monte Carlo (cont.)
Hamiltonian Monte Carlo (Neal 2011)
1. Sample p^(n) ∼ N(0, M(x^(n))).
2. From (x^(n), p^(n)), simulate L leapfrog steps to obtain (x*, p*).
3. Compute α = exp{−H(x*, p*) + H(x^(n), p^(n))}.
4. Set x^(n+1) = x* with probability min(α, 1); otherwise x^(n+1) = x^(n).
Can make large proposal moves with high acceptance, yet low autocorrelation.
General purpose: only requires derivatives.
Can be largely automated (e.g. the Stan library).
A code sketch of one transition follows below.
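A minimal sketch of one HMC transition under the same assumptions (constant metric M = I), reusing the leapfrog() function from the previous slide; hmc_step is an illustrative name.

    import numpy as np

    def hmc_step(log_pi, grad_log_pi, x, eps, L, rng):
        p = rng.standard_normal(x.shape)                      # 1. sample momentum
        H0 = -log_pi(x) + 0.5 * p @ p                         #    H(x, p)
        x_star, p_star = leapfrog(grad_log_pi, x, p, eps, L)  # 2. L leapfrog steps
        H1 = -log_pi(x_star) + 0.5 * p_star @ p_star
        alpha = np.exp(H0 - H1)                               # 3. acceptance ratio
        if rng.uniform() < alpha:                             # 4. accept or reject
            return x_star
        return x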
6 / 20
Rich geometric structure
The cotangent bundle is a symplectic manifold.
H is the negative log-density w.r.t. the canonical symplectic measure.
(x(0), p(0)) ↦ (x(t), p(t)) is a symplectic map.
Choice of metric M:
Constant/Euclidean is easiest to work with.
Fisher–Rao typically intractable, and lacks invariance justification
(known data, influence of prior).
Observed information often requires modification to be positive
definite (Betancourt 2013).
Geometric numerical integration (Hairer, Lubich, and Wanner 2006)
The approximate integrator is actually the exact solution to a different
Hamiltonian H̃:
H̃(x, p) = H(x, p) + ε² G(x, p) + …
Backward error analysis.
Optimal tuning of the step size ε (Betancourt, B., and Girolami 2014).
A numerical check of the ε² scaling is sketched below.
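A quick numerical check (an illustration, not from the slides) that the leapfrog energy error scales like ε², consistent with H̃ = H + ε²G + …; it reuses the leapfrog() sketch above on a standard Gaussian target.

    import numpy as np

    log_pi = lambda x: -0.5 * np.sum(x**2)
    grad_log_pi = lambda x: -x
    H = lambda x, p: -log_pi(x) + 0.5 * p @ p
    x0, p0 = np.array([1.0]), np.array([0.5])
    for eps in (0.2, 0.1, 0.05):
        # Integrate to total time ~1 with step eps; halving eps should
        # roughly quarter the energy error.
        x, p = leapfrog(grad_log_pi, x0, p0, eps, int(round(1.0 / eps)))
        print(f"eps={eps:5.2f}  |H error| = {abs(H(x, p) - H(x0, p0)):.2e}")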
7 / 20
MCMC estimators
Samples are typically used to obtain estimates of moments of interest:
f̂_n = (1/n) Σ_{i=1}^n f(x_i) ≈ E_Π[f]
Samples are not independent: the usual central limit theorem does not apply.
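Since the iid CLT fails, the naive sample standard error understates the error of f̂_n. One standard remedy is the batch-means estimator, sketched here as an illustration (the batch size √n is a conventional default, not from the slides).

    import numpy as np

    def batch_means_se(fs):
        # Standard error of mean(fs) for a correlated sequence fs.
        n = len(fs)
        b = int(np.sqrt(n))                    # batch size
        m = n // b                             # number of batches
        means = fs[: m * b].reshape(m, b).mean(axis=1)
        # Under the MCMC CLT, var(batch mean) ~ sigma^2 / b, so
        # sigma^2 ~ b * var(means) and SE(f_hat) ~ sqrt(sigma^2 / n).
        return np.sqrt(b * means.var(ddof=1) / n)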
8 / 20
Geometric ergodicity
Total variation distance between two probability measures:
‖µ − ν‖_TV = sup_{A ∈ Ω} |µ(A) − ν(A)|
A Markov chain is geometrically ergodic if the n-step transition distance decays geometrically:
‖T^n(· | x) − Π‖_TV ≤ V(x) ρ^n,   ρ < 1.
9 / 20
Markov chain central limit theorem
If a Markov chain is geometrically ergodic, then subject to some moment
conditions on f , there is a central limit theorem
√n (f̂_n − E_Π[f]) →^d N(0, σ²)
for some σ² < ∞.
Justifies use of an MCMC approximation to the posterior.
Known conditions for simple algorithms (Gibbs, RW, MALA).
Difficult to establish for more complicated schemes, such as HMC.
10 / 20
Establishing geometric ergodicity
Minorisation condition: there exist a small set C, an integer n, an ε > 0, and
a probability measure ν such that
T^n(· | x) ≥ ε ν(·)   for all x ∈ C.
Drift condition: there exist a drift function V : X → [1, ∞] and constants
0 < λ < 1, b < ∞ such that
E_T[V | x] ≤ λ V(x) + b 1_C(x),
where C is the small set.
[Figures: T^n(· | x) dominating ε ν(·) on the small set C; E_T[V | ·] lying below V(·) outside C.]
If both conditions are satisfied then T is geometrically ergodic.
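The drift condition can be probed numerically. A sketch under assumed choices (random-walk Metropolis on a standard Gaussian target, drift function V(x) = exp(|x|/2), all my illustration): estimates of E_T[V | x]/V(x) below 1 far from the origin are consistent with geometric drift, while values above 1 near the origin are absorbed by the b·1_C(x) term.

    import numpy as np

    rng = np.random.default_rng(1)
    log_pi = lambda x: -0.5 * x**2
    V = lambda x: np.exp(np.abs(x) / 2)

    def expected_V(x0, step=1.0, n=20000):
        # Monte Carlo estimate of E_T[V | x0] for one Metropolis transition.
        props = x0 + step * rng.standard_normal(n)
        accept = np.log(rng.uniform(size=n)) < log_pi(props) - log_pi(x0)
        nxt = np.where(accept, props, x0)
        return V(nxt).mean()

    for x0 in (0.0, 2.0, 5.0):
        print(f"x0={x0:4.1f}  E[V | x0] / V(x0) = {expected_V(x0) / V(x0):.3f}")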
11 / 20
Example family
π(x) ∝ exp{−|x|^β}
Then with standard HMC with Euclidean metric:

Tails          β        Geom. Ergod.?
heavy          (0, 1)   ✗
intermediate   [1, 2]   ✓*
light          (2, ∞)   ✗
Same as the MALA algorithm (Roberts and Tweedie 1996), which arises as a special case when L = 1.
So why does it fail?
12 / 20
Heavy-tailed case (β < 1)
The problem is that it takes too long to come in from the tails:
[Figure: phase-plane trajectory in (x, p) drifting slowly in from the tail.]
Loss of gradient information: ∇_x log π(x) → 0 as |x| → ∞ when β < 1.
Method behaves like a random walk, which performs poorly for
heavy-tailed distributions (Jarner and Tweedie 2003).
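The gradient decay is easy to see directly: for π(x) ∝ exp(−|x|^β) we have ∇_x log π(x) = −β sign(x) |x|^(β−1), which vanishes in the tails when β < 1. A quick illustration:

    import numpy as np

    beta = 0.5
    grad_log_pi = lambda x: -beta * np.sign(x) * np.abs(x) ** (beta - 1)
    for x in (1.0, 10.0, 100.0, 1000.0):
        # The restoring force available to the leapfrog steps shrinks to zero,
        # so in the tails the dynamics degenerate to a random walk.
        print(f"x={x:7.1f}  |grad log pi(x)| = {abs(grad_log_pi(x)):.4f}")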
13 / 20
Option 1: integrate for longer
Suppose that:
1. We can solve Hamilton's equations exactly. The paths are contours of H, and
therefore there exists t_rec such that (x(t_rec), p(t_rec)) = (x(0), p(0)).
[Figure: closed orbit in the (x, p) phase plane.]
2. We can sample an integration time t uniformly on [0, t_rec); call this
distribution I.
14 / 20
Virial theorem
Theorem
In a Hamiltonian system with constant metric M,
E_I[pᵀ M p] = −E_I[xᵀ ∇_x log π(x)].

The virial is the quantity
G = xᵀ p.
Then
dG/dt = (dx/dt)ᵀ p + xᵀ (dp/dt) = pᵀ M p + xᵀ ∇_x log π(x),
and hence
E_I[pᵀ M p] + E_I[xᵀ ∇_x log π(x)] = E_I[dG/dt] = (1/t_rec) [G(t_rec) − G(0)] = 0.
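A numerical check of the identity (an illustration, not from the slides): for a standard Gaussian target with M = I the dynamics are exactly solvable, a rotation in the (x, p) plane with recurrence time t_rec = 2π, so we can average over t ∼ Uniform[0, t_rec) directly.

    import numpy as np

    rng = np.random.default_rng(2)
    x0, p0 = 1.3, rng.standard_normal()
    t = rng.uniform(0.0, 2 * np.pi, size=100_000)
    x = x0 * np.cos(t) + p0 * np.sin(t)   # exact flow for H = x^2/2 + p^2/2
    p = p0 * np.cos(t) - x0 * np.sin(t)
    # E_I[p M p] should equal -E_I[x d/dx log pi(x)] = E_I[x^2] here;
    # both averages come out near (x0^2 + p0^2) / 2.
    print(np.mean(p**2), np.mean(x**2))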
15 / 20
Geometric ergodicity of HMC
Theorem
Suppose that for some A and B > 0,
log π(x) ≤ A + B xᵀ ∇_x log π(x).
Then the constant-metric Hamiltonian scheme, with H = −log π(x) + ½ pᵀ M p:
1. Sample p(0) ∼ N(0, M),
2. Sample t ∼ I, then (x(0), p(0)) ↦ (x(t), p(t)),
is geometrically ergodic.
Rough idea: drift function V = C − log π(x).
Step 1 "resets" the second (kinetic) term of H; step 2 "equilibrates" the two terms.
16 / 20
Generalisations
Of course we can't do this exactly in typical practice, but:
The Poincaré recurrence theorem says that we will approximately recur in finite time.
The No-U-turn sampler (NUTS, Hoffman and Gelman 2014) provides a framework for adaptively choosing the integration time.
This can be adapted to utilise an approximate virial criterion.
17 / 20
Option 2: change the geometry
Let M be the Hessian of the log-density:
M(x) = |∇²_x log π(x)| ∝ β² |x|^(β−2)
This manifold has an isometric embedding into Euclidean space:
x′ = sign(x) |x|^(β/2),   p′ = β |x|^(β/2−1) p
In other words, this is equivalent to Euclidean HMC with the target density
π′(x′) ∝ |x′|^(2/β−1) exp(−x′² / β²)
The tails are dominated by the Gaussian term: geometrically ergodic?
(A quick numerical look follows below.)
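A quick numerical look (an illustration under an assumed sampling trick): since |x|^β ∼ Gamma(1/β) when π(x) ∝ exp(−|x|^β), we can draw exact samples, apply the embedding x′ = sign(x)|x|^(β/2), and observe Gaussian-like tails in x′.

    import numpy as np

    rng = np.random.default_rng(3)
    beta = 0.5
    u = rng.gamma(shape=1.0 / beta, size=200_000)     # |x|^beta ~ Gamma(1/beta)
    x = u ** (1.0 / beta) * rng.choice([-1, 1], size=u.size)
    x_prime = np.sign(x) * np.abs(x) ** (beta / 2)    # the isometric embedding
    # Tail probabilities of x' decay at a Gaussian-like rate exp(-s^2):
    for s in (2.0, 3.0, 4.0):
        print(f"P(|x'| > {s:.0f}) = {np.mean(np.abs(x_prime) > s):.5f}")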
18 / 20
Summary
HMC is a deeply geometric algorithm.
Geometry provides valuable insight where other tools fail.
It has provided genuine qualitative improvements.
Lots of open questions:
Incorporate effects of discretisation.
Extend results to more general distributions.
Quantitative results (e.g. geometric rate ρ).
Diagnostics of Markov chain convergence.
19 / 20
Bibliography
Betancourt, M. (2013). “A General Metric for Riemannian Manifold Hamiltonian Monte
Carlo”. Geometric Science of Information. Springer Berlin Heidelberg, pp. 327–334.
Betancourt, M., S. Byrne, and M. Girolami (2014). “Optimizing The Integrator Step Size
for Hamiltonian Monte Carlo”. arXiv: 1411.6669.
Girolami, M. and B. Calderhead (2011). “Riemann manifold Langevin and Hamiltonian
Monte Carlo methods”. JRSS-B 73.2, pp. 123–214.
Hairer, E., C. Lubich, and G. Wanner (2006). Geometric numerical integration. 2nd ed.
Vol. 31. Berlin: Springer-Verlag.
Hoffman, M. D. and A. Gelman (2014). “The No-U-Turn Sampler: Adaptively Setting
Path Lengths in Hamiltonian Monte Carlo”. Journal of Machine Learning Research
15, pp. 1593–1623.
Jarner, S. F. and R. L. Tweedie (2003). “Necessary conditions for geometric and
polynomial ergodicity of random-walk-type Markov chains”. Bernoulli 9.4,
pp. 559–578.
Neal, R. M. (2011). “MCMC using Hamiltonian dynamics”. Handbook of Markov chain
Monte Carlo. Boca Raton, FL: CRC Press, pp. 113–162.
Roberts, G. O. and R. L. Tweedie (1996). “Exponential convergence of Langevin
distributions and their discrete approximations”. Bernoulli 2.4, pp. 341–363.
20 / 20