Principles of Bayesian Inference

Sudipto Banerjee (1) and Andrew O. Finley (2)

(1) Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.
(2) Department of Forestry & Department of Geography, Michigan State University, East Lansing, Michigan, U.S.A.

October 30, 2013
LWF, ZWFH, DVFFA, & DR IBS: Biometry workshop

Basic probability

We like to think of "probability" formally as a function that assigns a real number to an event. Let E and F be any events that might occur under an experimental setup H. Then a probability function P(E) is defined by:

P1. 0 ≤ P(E) ≤ 1 for all E.
P2. P(H) = 1.
P3. P(E ∪ F) = P(E) + P(F) whenever E and F cannot occur together. We then write E ∩ F = ∅ and say that E and F are mutually exclusive.

If E is an event, we denote its complement ("not E") by Ē or E^c. Since E ∩ Ē = ∅, P3 gives

P(Ē) = 1 − P(E).

Conditional probability of E given F:

P(E | F) = P(E ∩ F) / P(F).

Sometimes we write EF for E ∩ F. The compound probability rule rewrites the above as

P(E | F) P(F) = P(EF).

Independent events: E and F are said to be independent if the occurrence of one does not affect the probability of the other. Then P(E | F) = P(E) and we have the multiplication rule:

P(EF) = P(E) P(F).

Note that if P(E | F) = P(E), then also P(F | E) = P(F).

Marginalization: we can express P(E) by "marginalizing" over the event F:

P(E) = P(EF) + P(EF̄) = P(F) P(E | F) + P(F̄) P(E | F̄).

Bayes Theorem

Observe that

P(EF) = P(E | F) P(F) = P(F | E) P(E)
⇒ P(F | E) = P(F) P(E | F) / P(E).

Bayesian principles

Two hypotheses:
H0: the excess relative risk for thrombosis for women taking a pill exceeds 2;
H1: it is under 2.
Data at hand, collected from a controlled trial, show a relative risk of x = 3.6.
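The marginalization and Bayes' rule identities above are easy to check numerically. A minimal sketch in Python, where all of the probabilities are hypothetical numbers chosen only for illustration:

```python
# Hypothetical probabilities for events F and E (illustrative only)
p_F = 0.01             # P(F)
p_E_given_F = 0.95     # P(E | F)
p_E_given_notF = 0.10  # P(E | F-bar)

# Marginalization: P(E) = P(F)P(E|F) + P(F-bar)P(E|F-bar)
p_E = p_F * p_E_given_F + (1 - p_F) * p_E_given_notF

# Bayes Theorem: P(F|E) = P(F)P(E|F) / P(E)
p_F_given_E = p_F * p_E_given_F / p_E

print(p_E, p_F_given_E)
```

Note how a large P(E | F) can still give a small P(F | E) when P(F) itself is small; the prior matters.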
This is Bayes' Theorem, named after the Reverend Thomas Bayes, an English clergyman with a passion for gambling! Often it is written as

P(F | E) = P(F) P(E | F) / [P(F) P(E | F) + P(F̄) P(E | F̄)].

The probability (or likelihood) of the data, given our prior beliefs, is P(x | H), where H is H0 or H1.

Likelihood and prior

Bayes Theorem updates the probability of each hypothesis:

P(H | x) = P(H) P(x | H) / P(x);  H ∈ {H0, H1}.

Bayes theorem in English:

Posterior distribution = (prior × likelihood) / Σ (prior × likelihood).

Marginal probability: the denominator is summed over all possible hypotheses,

P(x) = P(H0) P(x | H0) + P(H1) P(x | H1).

It is a fixed normalizing constant that is (usually) extremely difficult to evaluate: the curse of dimensionality. So we re-express:

P(H | x) ∝ P(H) P(x | H);  H ∈ {H0, H1}.

Markov chain Monte Carlo to the rescue! WinBUGS software: www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml

An Example

A clinician is interested in π, the proportion of children between ages 5 and 9 in a particular population having asthma symptoms. The clinician has prior beliefs about π, summarized as "prior support" (a grid of plausible values) and "prior weights" (probabilities attached to those values). Data: a random sample of 15 children shows 2 having asthma symptoms. The likelihood is obtained from the Binomial distribution:

L(π) = (15 choose 2) π² (1 − π)¹³.

Note: (15 choose 2) is a constant and can be ignored in the computations (though it is accounted for in the table that follows).
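The clinician's discrete prior-to-posterior update can be carried out in a few lines. A sketch in Python (the slides suggest R elsewhere; Python is used here only as an illustrative stand-in); it reproduces, to three decimals, the values reported in the table below:

```python
from math import comb

support = [0.10, 0.12, 0.14, 0.16, 0.18, 0.20]  # prior support for pi
weights = [0.10, 0.15, 0.25, 0.25, 0.15, 0.10]  # prior weights

# Binomial likelihood of y = 2 successes in n = 15 trials
def likelihood(pi, y=2, n=15):
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

lik = [likelihood(pi) for pi in support]
prior_times_lik = [w * l for w, l in zip(weights, lik)]
norm = sum(prior_times_lik)                      # normalizing constant P(x)
posterior = [pl / norm for pl in prior_times_lik]

print([round(l, 3) for l in lik])        # likelihood column
print(round(norm, 3))                    # normalizing constant
print([round(p, 3) for p in posterior])  # posterior column
```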
Prior support   Prior weight   Likelihood   Prior × likelihood   Posterior
0.10            0.10           0.267        0.027                0.098
0.12            0.15           0.287        0.043                0.157
0.14            0.25           0.290        0.072                0.265
0.16            0.25           0.279        0.070                0.255
0.18            0.15           0.258        0.039                0.141
0.20            0.10           0.231        0.023                0.084
Total           1.00                        0.274                1.000

Posterior: obtained by dividing each prior × likelihood entry by the normalizing constant 0.274.

Bayesian principles

Classical statistics treats model parameters as fixed and unknown. A Bayesian thinks of parameters as random, and thus as having distributions (just like the data). We can therefore reason about unknowns for which no reliable frequentist experiment exists, e.g. θ = the proportion of US men with untreated prostate cancer.

A Bayesian writes down a prior guess for the parameter(s) θ, say p(θ), and then combines it with the information provided by the observed data y to obtain the posterior distribution of θ, which we denote by p(θ | y). All statistical inferences (point and interval estimates, hypothesis tests) then follow from posterior summaries: for example, the posterior mean/median/mode offer point estimates of θ, while the quantiles yield credible intervals.

The key to Bayesian inference is "learning" or "updating" of prior beliefs. Thus, posterior information ≥ prior information.

Is the classical approach wrong? That may be a controversial statement, but it certainly is fair to say that the classical approach is limited in scope. The Bayesian approach expands the class of models and easily handles:
- repeated measures
- unbalanced or missing data
- nonhomogeneous variances
- multivariate data
and many other settings that are precluded (or much more complicated) in classical settings.

Basics of Bayesian inference

Calculations (numerical and algebraic) are usually required only up to a proportionality constant.
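The remark that multiplicative constants can be ignored is easy to verify: dropping the binomial coefficient from the likelihood rescales prior × likelihood, but the normalized posterior is unchanged. A small sketch reusing the asthma numbers above (illustrative only):

```python
from math import comb

support = [0.10, 0.12, 0.14, 0.16, 0.18, 0.20]
weights = [0.10, 0.15, 0.25, 0.25, 0.15, 0.10]

def posterior(include_constant):
    # Likelihood for y = 2 of n = 15, with or without the comb(15, 2) factor
    c = comb(15, 2) if include_constant else 1.0
    pl = [w * c * pi**2 * (1 - pi)**13 for w, pi in zip(weights, support)]
    s = sum(pl)                 # normalizing constant absorbs c
    return [v / s for v in pl]

with_c = posterior(True)
without_c = posterior(False)
print(max(abs(a - b) for a, b in zip(with_c, without_c)))
```

The constant cancels in the ratio, which is exactly why calculations are needed only up to proportionality.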
We start with a model (likelihood) f(y | θ) for the observed data y = (y1, …, yn)′, given unknown parameters θ (perhaps a collection of several parameters). We add a prior distribution p(θ | λ), where λ is a vector of hyperparameters. The posterior distribution of θ is

p(θ | y, λ) = p(θ | λ) f(y | θ) / p(y | λ) = p(θ | λ) f(y | θ) / ∫ f(y | θ) p(θ | λ) dθ.

We refer to this formula as Bayes Theorem. We therefore write the posterior as

p(θ | y, λ) ∝ p(θ | λ) × f(y | θ).

If λ is known/fixed, the above represents the desired posterior. If, however, λ is unknown, we assign it a prior, p(λ), and seek

p(θ, λ | y) ∝ p(λ) p(θ | λ) f(y | θ).

The proportionality constant does not depend on θ or λ:

1/p(y) = 1 / ∫∫ p(λ) p(θ | λ) f(y | θ) dλ dθ.

The above represents the joint posterior from a hierarchical model. The marginal posterior distribution of θ is

p(θ | y) ∝ ∫ p(λ) p(θ | λ) f(y | θ) dλ.

A simple example: Normal data and normal priors

Example: consider a single data point y from a Normal distribution, y ~ N(θ, σ²), with σ known:

f(y | θ) = N(y | θ, σ²) = (1 / (σ√(2π))) exp(−(y − θ)² / (2σ²)).

Prior: θ ~ N(µ, τ²), i.e. p(θ) = N(θ | µ, τ²), with µ and τ² known.

Posterior distribution of θ:

p(θ | y) ∝ N(θ | µ, τ²) × N(y | θ, σ²)
         = N(θ | [(1/τ²)µ + (1/σ²)y] / (1/σ² + 1/τ²), 1 / (1/σ² + 1/τ²))
         = N(θ | (σ²/(σ² + τ²))µ + (τ²/(σ² + τ²))y, σ²τ²/(σ² + τ²)).

Interpretation: the posterior mean is a weighted average of the prior mean and the data point; the direct estimate is shrunk towards the prior.

What if we had n observations instead of one in the same iid setup? Say y = (y1, …, yn)′, where yi ~ N(θ, σ²). Then ȳ is a sufficient statistic for θ, with ȳ ~ N(θ, σ²/n), and

p(θ | ȳ) ∝ N(θ | µ, τ²) × N(ȳ | θ, σ²/n)
         = N(θ | [(1/τ²)µ + (n/σ²)ȳ] / (n/σ² + 1/τ²), 1 / (n/σ² + 1/τ²)).
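The precision-weighted updates above can be packaged as a small helper. A sketch in Python; the numbers in the demonstration (µ = 0, τ² = 4, y = 2, σ² = 1) are made up purely for illustration:

```python
def normal_posterior(mu, tau2, ybar, sigma2, n=1):
    """Posterior mean and variance of theta, with prior N(mu, tau2)
    and data mean ybar ~ N(theta, sigma2/n)."""
    prior_prec = 1.0 / tau2     # prior precision 1/tau^2
    data_prec = n / sigma2      # data precision n/sigma^2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * mu + data_prec * ybar)
    return post_mean, post_var

# Single observation with hypothetical values mu=0, tau2=4, y=2, sigma2=1:
# closed form gives mean = tau2/(sigma2+tau2) * y = 1.6, var = sigma2*tau2/(sigma2+tau2) = 0.8
m, v = normal_posterior(0.0, 4.0, 2.0, 1.0)
print(m, v)
```

Increasing n shifts the weight from the prior mean towards ȳ, exactly as the σ² + nτ² form of the posterior suggests.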
1 τ2 σ2τ 2 σ2 µ+ 2 ȳ, 2 σ + nτ 2 σ + nτ 2 σ 2 + nτ 2 16 A simple exercise using R σ2 ȳ | θ, n LWF, ZWFH, DVFFA, & DR IBS: Biometry workshop Another simple example: The Beta-Binomial model Consider the problem of estimating the current weight of a group of people. A sample of 10 people were taken and their average weight was calculated as ȳ = 176 lbs. Assume that the population standard deviation was known as σ = 3. Assuming that the data y1 , ..., y10 came from a N (θ, σ 2 ) population perform the following: Example: Let Y be the number of successes in n independent trials. n y P (Y = y|θ) = f (y|θ) = θ (1 − θ)n−y y Prior: p(θ) = Beta(θ|a, b): Obtain a 95% confidence interval for θ using classical methods. p(θ) ∝ θa−1 (1 − θ)b−1 . Prior mean: µ = a/(a + b); Variance ab/((a + b)2 (a + b + 1)) Assume a prior distribution for θ of the form N (µ, τ 2 ). Obtain 95% posterior credible intervals for θ for each of the cases: (a) µ = 176, τ = 8; (b) µ = 176, τ = 1000 (c) µ = 0, τ = 1000. Which case gives results closest to that obtained in the classical method? 17 LWF, ZWFH, DVFFA, & DR IBS: Biometry workshop Posterior distribution of θ p(θ|y) = Beta(θ|a + y, b + n − y) 18 LWF, ZWFH, DVFFA, & DR IBS: Biometry workshop Bayesian inference: point estimation Bayesian inference: interval estimation The most popular method of inference in practical Bayesian modelling is interval estimation using credible sets. A 100(1 − α)% credible set C for θ is a set that satisfies: Z P (θ ∈ C | y) = p(θ | y)dθ ≥ 1 − α. Point estimation is easy: simply choose an appropriate distribution summary: posterior mean, median or mode. Mode sometimes easy to compute (no integration, simply optimization), but often misrepresents the “middle” of the distribution – especially for one-tailed distributions. C The most popular credible set is the simple equal-tail interval estimate (qL , qU ) such that: Z qL Z ∞ α p(θ | y)dθ = = p(θ | y)dθ 2 −∞ qU Mean: easy to compute. 
The mean, however, has the "opposite effect" of the mode: it chases the tails.
- Median: probably the best compromise in being robust to tail behaviour, although it may be awkward to compute, as it solves

∫_{−∞}^{θ_median} p(θ | y) dθ = 1/2.

For the equal-tail interval, clearly P(θ ∈ (qL, qU) | y) = 1 − α. This interval is relatively easy to compute and has a direct interpretation: the probability that θ lies between qL and qU is 1 − α. The corresponding frequentist interpretation of a confidence interval is extremely convoluted.

Sampling-based inference

The previous example evaluated the posterior probabilities directly; this is feasible only for simpler problems. Modern Bayesian analysis derives complete posterior densities, say p(θ | y), by drawing samples from that density. The samples are of the parameters themselves, or of functions of them. If θ1, …, θM are samples from p(θ | y), then density estimates are created by feeding them into a density plotter. Similarly, samples from f(θ), for some function f, are obtained by simply feeding the θi's to f(·).

In principle M can be arbitrarily large: it comes from the computer and depends only on the time we have for the analysis. Do not confuse M with the data sample size n, which is limited by experimental constraints.
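A minimal Monte Carlo sketch of these ideas: draw M samples from a known normal posterior (the single-observation normal example with the hypothetical values µ = 0, τ² = 4, y = 2, σ² = 1, giving posterior N(1.6, 0.8)) and summarize them. In a real application the direct draws would be replaced by MCMC output:

```python
import random
from math import sqrt

random.seed(42)

# Posterior from the earlier normal-normal example (hypothetical numbers)
post_mean, post_var = 1.6, 0.8
M = 200_000  # Monte Carlo sample size: limited only by computing time, not by n

samples = sorted(random.gauss(post_mean, sqrt(post_var)) for _ in range(M))

# Point estimate and 95% equal-tail credible interval from the samples
mc_mean = sum(samples) / M
q_lo = samples[int(0.025 * M)]
q_hi = samples[int(0.975 * M)]

# Samples of a function f(theta) are just f applied to each draw
f_samples = [s**2 for s in samples]  # e.g. f(theta) = theta^2
mc_f_mean = sum(f_samples) / M       # estimates E[theta^2 | y] = var + mean^2

print(mc_mean, (q_lo, q_hi), mc_f_mean)
```

The empirical quantiles approximate the equal-tail interval (qL, qU), and no new mathematics is needed to summarize f(θ): the draws are simply transformed.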