Principles of Bayesian Inference
Sudipto Banerjee (Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.) and Andrew O. Finley (Department of Forestry & Department of Geography, Michigan State University, Lansing, Michigan, U.S.A.)

LWF, ZWFH, DVFFA, & DR IBS: Biometry workshop, October 30, 2013

Basic probability

We like to think of "probability" formally as a function that assigns a real number to an event. Let E and F be any events that might occur under an experimental setup H. Then a probability function P(E) is defined by:

P1. 0 ≤ P(E) ≤ 1 for all E.
P2. P(H) = 1.
P3. P(E ∪ F) = P(E) + P(F) whenever E and F cannot both occur. We usually write E ∩ F = ∅ and say the events are mutually exclusive.

If E is an event, we denote its complement ("NOT" E) by Ē or E^c. Since E ∩ Ē = ∅, we have from P3:

P(Ē) = 1 − P(E).

Conditional probability of E given F:

P(E | F) = P(E ∩ F) / P(F).

Sometimes we write EF for E ∩ F. The compound probability rule rewrites the above as:

P(E | F) P(F) = P(EF).

Independent events: E and F are said to be independent if the occurrence of one does not change the probability of the other. Then P(E | F) = P(E) and we have the following multiplication rule:

P(EF) = P(E) P(F).

Note that if P(E | F) = P(E), then also P(F | E) = P(F).

Marginalization: we can express P(E) by "marginalizing" over the event F:

P(E) = P(EF) + P(EF̄) = P(F) P(E | F) + P(F̄) P(E | F̄).

Bayes Theorem

Observe that:

P(EF) = P(E | F) P(F) = P(F | E) P(E)  ⇒  P(F | E) = P(F) P(E | F) / P(E).

This is Bayes' Theorem, named after the Reverend Thomas Bayes, an English clergyman with a passion for gambling! Often it is written as:

P(F | E) = P(F) P(E | F) / [P(F) P(E | F) + P(F̄) P(E | F̄)].

Bayesian principles

Two hypotheses: H0: the excess relative risk for thrombosis among women taking a pill exceeds 2; H1: it is under 2. Data at hand from a controlled trial show a relative risk of x = 3.6. The probability, or likelihood, of the data under each hypothesis is P(x | H), where H is H0 or H1.

Bayes Theorem updates the probability of each hypothesis:

P(H | x) = P(H) P(x | H) / P(x);  H ∈ {H0, H1}.

Marginal probability: the denominator sums over all the competing hypotheses:

P(x) = P(H0) P(x | H0) + P(H1) P(x | H1).

Bayes theorem in English:

Posterior distribution = (prior × likelihood) / Σ(prior × likelihood).

The denominator is a fixed normalizing factor that is (usually) extremely difficult to evaluate (the curse of dimensionality), so we re-express:

P(H | x) ∝ P(H) P(x | H);  H ∈ {H0, H1}.

Markov chain Monte Carlo to the rescue! WinBUGS software: www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml

An Example

A clinician is interested in π, the proportion of children between ages 5 and 9 in a particular population having asthma symptoms. The clinician has prior beliefs about π, summarized as "prior support" (plausible values of π) and "prior weights" (probabilities attached to those values). Data: a random sample of 15 children shows 2 having asthma symptoms. The likelihood is obtained from the Binomial distribution:

(15 choose 2) π^2 (1 − π)^13.

Note: (15 choose 2) is a constant in π and can be ignored in the computations (though it is accounted for in the table below).

Prior support   Prior weight   Likelihood   Prior × Likelihood   Posterior
0.10            0.10           0.267        0.027                0.098
0.12            0.15           0.287        0.043                0.157
0.14            0.25           0.290        0.072                0.265
0.16            0.25           0.279        0.070                0.255
0.18            0.15           0.258        0.039                0.141
0.20            0.10           0.231        0.023                0.084
Total           1.00                        0.274                1.000

The posterior is obtained by dividing each Prior × Likelihood entry by the normalizing constant 0.274.
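As a quick check, the table can be reproduced in a few lines of R (a minimal sketch; the variable names are ours):

support <- c(0.10, 0.12, 0.14, 0.16, 0.18, 0.20)   # prior support for pi
weight  <- c(0.10, 0.15, 0.25, 0.25, 0.15, 0.10)   # prior weights

lik <- dbinom(2, size = 15, prob = support)        # Binomial likelihood of 2 cases in 15

posterior <- weight * lik / sum(weight * lik)      # normalize by 0.274
round(cbind(support, weight, lik, weight * lik, posterior), 3)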
Bayesian principles

Classical statistics regards model parameters as fixed and unknown. A Bayesian thinks of parameters as random, and thus as having distributions (just like the data). We can then reason about unknowns for which no reliable frequentist experiment exists, e.g. θ = the proportion of US men with untreated prostate cancer.

A Bayesian writes down a prior guess for the parameter(s) θ, say p(θ), and combines it with the information provided by the observed data y to obtain the posterior distribution of θ, which we denote by p(θ | y). All statistical inferences (point and interval estimates, hypothesis tests) then follow from posterior summaries: for example, the posterior means/medians/modes offer point estimates of θ, while the quantiles yield credible intervals.

The key to Bayesian inference is "learning", or updating, prior beliefs; thus posterior information ≥ prior information. Is the classical approach wrong? That may be a controversial statement, but it is certainly fair to say that the classical approach is limited in scope. The Bayesian approach expands the class of models and easily handles:

- repeated measures
- unbalanced or missing data
- nonhomogeneous variances
- multivariate data

and many other settings that are precluded (or much more complicated) in classical settings.

Basics of Bayesian inference

We start with a model (likelihood) f(y | θ) for the observed data y = (y1, ..., yn)′ given unknown parameters θ (perhaps a collection of several parameters). We add a prior distribution p(θ | λ), where λ is a vector of hyperparameters. The posterior distribution of θ is given by:

p(θ | y, λ) = p(θ | λ) f(y | θ) / p(y | λ) = p(θ | λ) f(y | θ) / ∫ f(y | θ) p(θ | λ) dθ.

We refer to this formula as Bayes Theorem.

Calculations (numerical and algebraic) are usually required only up to a proportionality constant, so we write the posterior as:

p(θ | y, λ) ∝ p(θ | λ) × f(y | θ).

If λ is known/fixed, the above represents the desired posterior. If, however, λ is unknown, we assign it a prior p(λ) and seek the joint posterior of a hierarchical model:

p(θ, λ | y) ∝ p(λ) p(θ | λ) f(y | θ).

The proportionality constant does not depend upon θ or λ:

1 / p(y) = 1 / ∫∫ p(λ) p(θ | λ) f(y | θ) dλ dθ.

The marginal posterior distribution for θ integrates λ out of the joint posterior:

p(θ | y) ∝ ∫ p(λ) p(θ | λ) f(y | θ) dλ.
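This integral rarely needs to be evaluated in closed form. The following minimal R sketch, with toy numbers of our own (y = 1.5 and standard normal layers throughout), approximates the marginal posterior p(θ | y) by evaluating p(λ) p(θ | λ) f(y | θ) on a two-dimensional grid and summing over λ:

y <- 1.5
theta  <- seq(-4, 6, length.out = 400)    # grid for the parameter
lambda <- seq(-4, 6, length.out = 400)    # grid for the hyperparameter

# joint[i, j] proportional to p(lambda_j) p(theta_i | lambda_j) f(y | theta_i)
joint <- outer(theta, lambda,
               function(t, l) dnorm(l) * dnorm(t, mean = l) * dnorm(y, mean = t))

post_theta <- rowSums(joint)                 # marginalize over lambda
post_theta <- post_theta / sum(post_theta)   # normalize on the grid
theta[which.max(post_theta)]                 # approximate posterior mode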
A simple example: Normal data and normal priors

Consider a single data point y from a Normal distribution, y ~ N(θ, σ²), and assume σ is known:

f(y | θ) = N(y | θ, σ²) = (1 / (σ√(2π))) exp(−(y − θ)² / (2σ²)).

Prior: θ ~ N(µ, τ²), i.e. p(θ) = N(θ | µ, τ²), where µ and τ² are known. The posterior distribution of θ is:

p(θ | y) ∝ N(θ | µ, τ²) × N(y | θ, σ²)
         = N(θ | ((1/τ²)µ + (1/σ²)y) / (1/σ² + 1/τ²), 1 / (1/σ² + 1/τ²))
         = N(θ | (σ²/(σ² + τ²))µ + (τ²/(σ² + τ²))y, σ²τ²/(σ² + τ²)).

Interpretation: the posterior mean is a weighted mean of the prior mean and the data point; the direct estimate is shrunk towards the prior.

What if we had n observations instead of one in the same iid setup? Say y = (y1, ..., yn)′, where yi ~ N(θ, σ²). Then ȳ is a sufficient statistic for θ, with ȳ ~ N(θ, σ²/n), and:

p(θ | ȳ) ∝ N(θ | µ, τ²) × N(ȳ | θ, σ²/n)
         = N(θ | ((1/τ²)µ + (n/σ²)ȳ) / (n/σ² + 1/τ²), 1 / (n/σ² + 1/τ²))
         = N(θ | (σ²/(σ² + nτ²))µ + (nτ²/(σ² + nτ²))ȳ, σ²τ²/(σ² + nτ²)).

A simple exercise using R

Consider the problem of estimating the current weight of a group of people. A sample of 10 people was taken and their average weight was calculated as ȳ = 176 lbs. Assume that the population standard deviation is known to be σ = 3. Assuming that the data y1, ..., y10 came from a N(θ, σ²) population, perform the following:

- Obtain a 95% confidence interval for θ using classical methods.
- Assume a prior distribution for θ of the form N(µ, τ²). Obtain 95% posterior credible intervals for θ for each of the cases (a) µ = 176, τ = 8; (b) µ = 176, τ = 1000; (c) µ = 0, τ = 1000. Which case gives results closest to those obtained by the classical method?
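A sketch of a solution in R, using the posterior formulas just derived (the helper name credible is ours):

ybar <- 176; n <- 10; sigma <- 3

# Classical 95% confidence interval: ybar +/- 1.96 sigma / sqrt(n)
ybar + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)

# Normal-normal posterior: precision-weighted mean and variance
credible <- function(mu, tau) {
  post_var  <- 1 / (n / sigma^2 + 1 / tau^2)
  post_mean <- post_var * (mu / tau^2 + n * ybar / sigma^2)
  post_mean + c(-1, 1) * qnorm(0.975) * sqrt(post_var)
}

credible(176, 8)      # (a) informative prior centered at the data
credible(176, 1000)   # (b) vague prior: nearly the classical interval
credible(0, 1000)     # (c) vague prior, far-off mean: still nearly classical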
Another simple example: The Beta-Binomial model

Let Y be the number of successes in n independent trials:

P(Y = y | θ) = f(y | θ) = (n choose y) θ^y (1 − θ)^(n−y).

Prior: p(θ) = Beta(θ | a, b), i.e.

p(θ) ∝ θ^(a−1) (1 − θ)^(b−1),

with prior mean µ = a/(a + b) and variance ab/((a + b)²(a + b + 1)). The posterior distribution of θ is again a Beta:

p(θ | y) = Beta(θ | a + y, b + n − y).

Bayesian inference: point estimation

Point estimation is easy: simply choose an appropriate summary of the posterior distribution, such as the posterior mean, median, or mode.

- Mode: sometimes easy to compute (no integration, simply optimization), but it often misrepresents the "middle" of the distribution, especially for one-tailed distributions.
- Mean: easy to compute, but it has the "opposite effect" of the mode: it chases tails.
- Median: probably the best compromise in being robust to tail behaviour, although it may be awkward to compute, as it solves:

∫_{−∞}^{θmedian} p(θ | y) dθ = 1/2.

Bayesian inference: interval estimation

The most popular method of inference in practical Bayesian modelling is interval estimation using credible sets. A 100(1 − α)% credible set C for θ is a set that satisfies:

P(θ ∈ C | y) = ∫_C p(θ | y) dθ ≥ 1 − α.

The most popular credible set is the simple equal-tail interval estimate (qL, qU) such that:

∫_{−∞}^{qL} p(θ | y) dθ = α/2 = ∫_{qU}^{∞} p(θ | y) dθ.

Then clearly P(θ ∈ (qL, qU) | y) = 1 − α. This interval is relatively easy to compute and has a direct interpretation: the probability that θ lies between qL and qU is 1 − α. The corresponding frequentist interpretation of a confidence interval is far more convoluted.

Sampling-based inference

The previous example evaluated the posterior probabilities directly, which is feasible only for simpler problems. Modern Bayesian analysis derives complete posterior densities, say p(θ | y), by drawing samples from that density. Samples are of the parameters themselves, or of their functions: if θ1, ..., θM are samples from p(θ | y), then density estimates are created by feeding them into a density plotter; similarly, samples from f(θ), for some function f, are obtained by simply feeding the θi's into f(·). In principle M can be arbitrarily large: it comes from the computer and depends only upon the time we have for the analysis. Do not confuse M with the data sample size n, which is limited by experimental constraints.
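To make the sampling idea concrete, here is a minimal R sketch using the Beta-Binomial model with the earlier asthma data (y = 2 successes in n = 15 trials); the flat Beta(1, 1) prior is our own choice, so the posterior is Beta(3, 14):

M <- 100000                          # Monte Carlo sample size (not the data size n)
theta <- rbeta(M, 1 + 2, 1 + 15 - 2) # draws from the Beta(3, 14) posterior

mean(theta)                          # posterior mean (point estimate)
median(theta)                        # posterior median
quantile(theta, c(0.025, 0.975))     # 95% equal-tail credible interval
plot(density(theta))                 # feed the samples to a density plotter

# Samples from a function f(theta), e.g. the odds theta / (1 - theta):
odds <- theta / (1 - theta)
quantile(odds, c(0.025, 0.975))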