Approximate Bayesian Computation: a simulation-based approach to inference

Richard Wilkinson (Department of Probability and Statistics, University of Sheffield)
Simon Tavaré (Department of Applied Mathematics and Theoretical Physics, University of Cambridge)

Workshop on Approximate Inference in Stochastic Processes and Dynamical Systems, PASCAL 2008


Implicit Statistical Models

Two types of statistical model:
- Prescribed models: the likelihood function is specified.
- Implicit models: a mechanism to simulate observations is specified.

Implicit models give scientists more freedom to accurately model the phenomenon under consideration, and the increase in computer power has made their use more practicable. They are popular in many disciplines.

[Figure: a simulated genealogical tree, plotted against time.]


Fitting to data

Most models are forward models: specify the parameters θ and initial conditions, and the model generates output D. Usually we are interested in the inverse problem: we observe data and want to estimate the parameter values. The same task goes by different names:
- Calibration
- Data assimilation
- Parameter estimation
- Inverse problem
- Bayesian inference


Monte Carlo Inference

Aim: sample from the posterior distribution

    π(θ | D) ∝ prior × likelihood = π(θ) P(D | θ).

Monte Carlo methods enable Bayesian inference in more complex models, but MCMC can be difficult or impossible in many stochastic models, e.g. when
- P(D | θ) is unknown, which is true for many stochastic models, or
- there are convergence or mixing problems, often caused by highly dependent data arising from an underlying tree or graphical structure, as in population genetics, epidemiology and evolutionary biology.


Likelihood-Free Inference

Rejection Algorithm
1. Draw θ from the prior π(·).
2. Accept θ with probability P(D | θ).

Accepted θ are independent draws from the posterior distribution π(θ | D).

If the likelihood P(D | θ) is unknown:

'Mechanical' Rejection Algorithm
1. Draw θ from π(·).
2. Simulate D′ ∼ P(· | θ).
3. Accept θ if D = D′.

The acceptance rate is P(D), so the number of runs needed to obtain n accepted observations is negative binomial with mean n/P(D).
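As a concrete illustration, here is a minimal Python sketch of the mechanical rejection step. The toy model (an Exponential(1) prior on θ and a Poisson(θ) simulator with observed count 3) is an assumption for illustration, not from the talk; exact matching is only feasible here because the data are discrete and one-dimensional.

```python
import numpy as np

rng = np.random.default_rng(0)
D_obs = 3  # hypothetical observed data: a single count

def mechanical_rejection(n):
    """Exact likelihood-free rejection: keep theta only when the
    simulated data reproduce the observation exactly."""
    accepted = []
    while len(accepted) < n:
        theta = rng.exponential(1.0)   # draw theta from the prior pi(.)
        D_sim = rng.poisson(theta)     # simulate D' ~ P(. | theta)
        if D_sim == D_obs:             # accept iff D = D'
            accepted.append(theta)
        # Relaxing this equality to rho(D, D') <= ε gives the
        # approximate version introduced on the next slide.
    return np.array(accepted)

draws = mechanical_rejection(1000)
# For this conjugate toy model the posterior is Gamma(4, 2) with mean 2,
# so draws.mean() should be close to 2.
print(draws.mean())
```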
Approximate Bayesian Computation I

If P(D) is small, we will rarely accept any θ. Instead, there is an approximate version:

Approximate Rejection Algorithm
1. Draw θ from π(θ).
2. Simulate D′ ∼ P(· | θ).
3. Accept θ if ρ(D, D′) ≤ ε.

This generates observations from π(θ | ρ(D, D′) ≤ ε):
- As ε → ∞, we get observations from the prior π(θ).
- If ε = 0, we generate observations from π(θ | D).
- ε reflects the tension between computability and accuracy.


Approximate Bayesian Computation II

If the data are too high dimensional, we never observe simulations that are 'close' to the field data. Reduce the dimension using summary statistics S(D).

Approximate Rejection Algorithm with Summaries
1. Draw θ from π(θ).
2. Simulate D′ ∼ P(· | θ).
3. Accept θ if ρ(S(D), S(D′)) < ε.

If S is sufficient, this is equivalent to the previous algorithm.


Error Structure

Example (Gaussian distribution). Suppose Xi ∼ N(µ, σ²) with σ² known, and give µ an improper flat prior, π(µ) = 1 for µ ∈ R. Suppose we observe data with x̄ = 0.

1. Pick µ ∼ U(−∞, ∞).
2. Simulate Xi ∼ N(µ, σ²).
3. Accept µ if |x̄| ≤ ε.

Then

    π(µ | |x̄| ≤ ε) = [ Φ((ε − µ)/√(σ²/n)) − Φ((−ε − µ)/√(σ²/n)) ] / 2ε

and

    Var(µ | |x̄| ≤ ε) = Var(µ | x̄ = 0) + ε²/3.

[Figure: histograms of 1000 accepted µ values for ε = 0.1, 0.5, 1 and 5; the approximate posterior widens as ε grows.]


Approximate MCMC

Rejection sampling is inefficient, as θ is repeatedly sampled from its prior distribution. The idea behind MCMC is that by correlating observations, more time is spent in regions of high likelihood.

Approximate Metropolis-Hastings Algorithm
Suppose we are currently at θ.
1. Propose θ′ from density q(θ, θ′).
2. Simulate D′ from P(· | θ′).
3. If ρ(D, D′) ≤ ε, calculate

       h(θ, θ′) = min( 1, π(θ′) q(θ′, θ) / [ π(θ) q(θ, θ′) ] ).

4. Accept the move to θ′ with probability h(θ, θ′), else stay at θ (a sketch in code follows below).

Adaptive tolerance choices: Sisson et al. and Robert et al. proposed approximate sequential importance sampling algorithms.
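A minimal sketch of the approximate Metropolis-Hastings step, under stated assumptions: the names abc_mcmc, simulate and rho are illustrative, and a symmetric Gaussian random-walk proposal is used so that the q(θ′, θ)/q(θ, θ′) factor cancels.

```python
import numpy as np

rng = np.random.default_rng(1)

def abc_mcmc(D_obs, simulate, rho, log_prior, eps, n_iter, step=0.5, theta0=0.0):
    """Approximate Metropolis-Hastings: the likelihood is replaced by
    an indicator that the simulated data land within eps of the data."""
    theta = theta0
    chain = np.empty(n_iter)
    for i in range(n_iter):
        theta_prop = theta + step * rng.standard_normal()  # symmetric proposal
        D_sim = simulate(theta_prop)                       # D' ~ P(. | theta')
        if rho(D_obs, D_sim) <= eps:                       # only then compute h
            log_h = log_prior(theta_prop) - log_prior(theta)
            if np.log(rng.uniform()) < log_h:
                theta = theta_prop                         # accept the move
        chain[i] = theta                                   # else stay at theta
    return chain

# Hypothetical use: five N(mu, 1) observations with sample mean 0.
chain = abc_mcmc(
    D_obs=np.zeros(5),
    simulate=lambda t: rng.normal(t, 1.0, size=5),
    rho=lambda a, b: abs(a.mean() - b.mean()),
    log_prior=lambda t: -0.5 * t**2,   # N(0, 1) prior, up to a constant
    eps=0.3, n_iter=5000)
```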
ABC-within-MCMC

Problem: a low acceptance rate leads to slow convergence.

Suppose θ = (θ1, θ2) with π(θ1 | D, θ2) known but π(θ2 | D, θ1) unknown. This is often the case for models with a hidden tree structure generating highly dependent data. We can combine Gibbs update steps (or any M-H update) with ABC.

ABC-within-Gibbs Algorithm
Suppose we are at θ^t = (θ1^t, θ2^t).
1. Draw θ1^(t+1) ∼ π(θ1 | D, θ2^t).
2. Draw θ2* ∼ π_θ2(·); simulate D′ ∼ P(· | θ1^(t+1), θ2*); if ρ(D, D′) < ε, set θ2^(t+1) = θ2*, else return to step 2.


Example From Population Biology

Inferring ancestral divergence times.

[Figure: a genealogical tree of species relating the fossil record to the present, plotted against time.]


Choosing summary statistics and metrics

We need:
- summaries S(D) which are sensitive to changes in θ but robust to random variation in D;
- a definition of approximate sufficiency (Le Cam, 1963): how far is π(θ | S(D)) from π(θ | D)?
- a systematic, implementable approach to finding good summary statistics.

Complex dependence structures can be accounted for.

[Figure: scatter plot of two simulated summaries, D1 against D2.]


ABC Approach

The data can be thought of in two parts:
- the observed number of fossils Di found in the i-th interval;
- the total number of fossils found, D+.

D′ denotes simulated data. A suitable metric might be

    ρ(D, D′) = | D′+/D+ − 1 | + Σ_{i=1}^k | Di/D+ − D′i/D′+ |.

Note: no data summaries here.


Not going so well

[Figure: trace of the simulated extant population size over 1000 iterations; the accepted values fall well short of the observed number of modern species.]


Tweak the metric

The simulated N0 values are too small (there are 376 modern species). It is easy to combine different types of information with ABC: change the metric to

    ρ(D, D′) = | D′+/D+ − 1 | + Σ_{i=1}^k | Di/D+ − D′i/D′+ | + | N0′/N0 − 1 |.

This gives approximate samples from

    π(θ | D, N0 = 376) ∝ P(D, N0 = 376 | θ) π(θ).
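Both metrics can be written as one small function; a sketch follows, assuming NumPy, with the name rho, the optional N0 arguments and the guard against an empty simulated record all being choices of mine rather than part of the talk.

```python
import numpy as np

def rho(D, D_sim, N0=None, N0_sim=None):
    """Distance between observed and simulated fossil counts: relative
    error in the total count plus the L1 distance between the normalised
    interval counts. Pass both N0 (observed modern species) and N0_sim
    to add the relative error in modern diversity (the tweaked metric)."""
    D = np.asarray(D, dtype=float)
    D_sim = np.asarray(D_sim, dtype=float)
    total, total_sim = D.sum(), D_sim.sum()
    if total_sim == 0:                 # no simulated fossils: reject outright
        return np.inf
    dist = abs(total_sim / total - 1.0)
    dist += np.abs(D / total - D_sim / total_sim).sum()
    if N0 is not None:                 # tweaked metric including N0
        dist += abs(N0_sim / N0 - 1.0)
    return dist
```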
Results

[Figure: approximate posterior density of the divergence time, in millions of years (roughly 60 to 140 My).]


Extensions

Model selection: the ratio of acceptance rates π_M1(S′ ≈ S) / π_M2(S′ ≈ S) ≈ the Bayes factor, so relative acceptance rates give posterior model probabilities. This is hopeless in practice, as it is too sensitive to the tolerance ε. Raftery and Lewis (1992) and Chib (1995) give computational schemes to calculate Bayes factors; neither works here.

Expensive simulators: emulate the stochastic model with a Gaussian process emulator (Richard Boys, Darren Wilkinson et al.).


Pros and cons of ABC

Pros:
- The likelihood is not needed.
- Easy to code.
- Easy to adapt.
- Generates independent observations (parallel computation).

Cons:
- Hard to anticipate the effect of summary statistics (needs intuition).
- Over-dispersion of the posterior due to accepting ρ(D, D′) < ε.
- For complex problems, sampling from the prior does not make good use of the observations.

Issues:
- One run or many?
- How to choose good summary statistics?
- How good an approximation do we get?


References

M. A. Beaumont, W. Zhang and D. J. Balding, Approximate Bayesian computation in population genetics, Genetics, 2002.
P. Marjoram, J. Molitor, V. Plagnol and S. Tavaré, Markov chain Monte Carlo without likelihoods, PNAS, 2003.
S. A. Sisson, Y. Fan and M. M. Tanaka, Sequential Monte Carlo without likelihoods, PNAS, 2007.
C. P. Robert, M. A. Beaumont, J. Marin and J. Cornuet, Adaptivity for ABC algorithms: the ABC-PMC scheme, arXiv, 2008.