Lecture 1: Bayesian Inference and Data Analysis
Department of Statistics, Rajshahi University, Rajshahi
Anandamayee Majumdar, Visiting Scientist, University of North Texas School of Public Health, USA; Professor, University of Suzhou, China.

Overview
• Applications
• Introduction
• Steps and Components
• Motivation
• Bayes Rule
• Probability as a Measure of Certainty
• Simulation from a distribution using inverse CDF
• A one parameter model example
• Binomial example approached by Bayes and Laplace

Applications to Computer Science
• Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition since the late 1950s.
• Recently Bayesian inference has gained popularity in the phylogenetics community: a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously. In population genetics and dynamical systems theory, approximate Bayesian computation (ABC) is also becoming increasingly popular.
• As applied to statistical classification, Bayesian inference has been used in recent years to develop algorithms for identifying e-mail spam.

Application to the Court Room
• Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for ‘beyond a reasonable doubt’. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence.

Other Applications
• Population genetics
• Ecology
• Archaeology
• Environmental Science
• Finance
• ...and many more

Introduction: Bayesian Inference
• Practical methods for learning from data
• Use of probability models
• Quantify uncertainty

Steps
1.
Set up a full probability model (a joint distribution for all observable and unobservable quantities in a problem)
• Consistent with the underlying scientific problem
• Consistent with the data collection process

Steps (continued)
2. Conditioning on observed data: calculate and interpret the posterior distribution (the conditional probability distribution of the unobserved quantities given the observed data), P(θ | Data)

Steps (continued)
3. Evaluate the fit of the model and the implications of the resulting posterior distribution
• Does the model fit the data?
• Are the conclusions reasonable?
• How sensitive are the results to the modeling assumptions in step 1?

Step 3 continued
• If necessary, one can alter or expand the model and repeat the three steps

Step 1 is a stumbling block
• How do we go about constructing the joint distribution, i.e. the full probability model?
• Advanced, improved techniques in the second step may help
• Advances in carrying out the third step alleviate somewhat the issue of incorrect model specification in the first step.

Primary motivation for Bayesian thinking
• Facilitates a common-sense interpretation of statistical conclusions.
• E.g. a Bayesian (probability) interval for an unknown quantity of interest can be directly regarded as having a high probability of containing the unknown quantity, in contrast to a frequentist (confidence) interval, which is justified by a retrospective perspective and the sampling methodology.

Primary motivation for Bayesian thinking (continued)
• Increased emphasis has been placed on interval estimation rather than hypothesis testing, which adds a strong impetus to the Bayesian viewpoint
• We shall look at the extent to which Bayesian interpretations of common simple statistical procedures are justified.
Real Life Example
• A clinical trial of cancer patients might be designed to compare the 5-year survival probability under a new drug with that under the standard treatment
• Inference is based on a sample of patients
• We cannot assign a patient to both treatments
• Causal inference (compare the observed outcome in a patient to the unobserved outcome had the patient been exposed to the other treatment)

Two kinds of estimands
• Estimand = unobserved quantity for which inference is needed
1. Potentially observable quantities (Ÿ).
2. Quantities that are not directly observable (parameters) (θ).
• The first helps to understand how the model fits real data

General notation
• θ → unobservable vector quantities or population parameters of interest
• y → observed data, y = (y1, y2, …, yn)
• Ÿ → potentially observable but unknown quantity (replication, future prediction, etc.)
• In general these are multivariate quantities

General notation
• x → explanatory variable / covariate
• X → entire set of explanatory variables for all n units (of data)

Fundamental Difference
Bayesian Approach
• Inference on θ → based on p(θ|y)
• Inference on Ÿ → based on p(Ÿ|y)
* Bayesian statistical conclusions: made using probability statements (‘highly unlikely’, ‘very likely’)
Frequentist Approach
• Inference on θ → based on p(y|θ)
• Inference on Ÿ → based on θ → based on p(y|θ)
* Frequentist statistical conclusions: based on p-values (‘not significant’, ‘hypothesis cannot be rejected’, etc.)

Practical similarity? Difference?
• Despite the differences, in many simple analyses the two procedures yield superficially similar results (especially in asymptotic cases)
• Bayesian methods can be easily extended to more complex problems
• Usually Bayesian models work better with less data
• Bayesian methods can bring prior information into the analysis through the prior distribution
• Easy sequential updating of inference as new data become available: the previous posterior distribution is used as the new prior distribution (Bayesian updating).

A Fundamental Concept: The Prior distribution
• θ → random because it is unknown to us, though we may have some feeling about it beforehand
• Prior distribution → “subjective” probability that quantifies whatever belief (however vague) we may have about θ before having looked at the data.

Fundamental Result: Bayes Rule
• Due to Thomas Bayes (1702–1761)
• Joint distribution: p(θ, y) = p(y | θ) p(θ)
• p(θ | y) = p(θ, y)/p(y) = p(y | θ) p(θ)/p(y)

Gist: Main point to remember
• p(θ | y) ∝ p(y | θ) p(θ), as p(y) is free of θ
• Any two data sets that yield the same likelihood yield the same inference
• This encapsulates the technical core of Bayesian inference: the primary task is to develop the model p(θ, y) and perform the necessary computations to summarize p(θ|y) appropriately.

Posterior Prediction
• After data y have been observed, an unknown observable Ÿ can be predicted using similar conditional ideas.
• p(Ÿ|y) = ∫ p(θ, Ÿ|y) dθ = ∫ p(Ÿ|θ, y) p(θ|y) dθ = ∫ p(Ÿ|θ) p(θ|y) dθ

Attractive property of Bayes Rule
• Posterior odds = p(θ1|y)/p(θ2|y) = {p(θ1) p(y|θ1)/p(y)} / {p(θ2) p(y|θ2)/p(y)} = {p(θ1)/p(θ2)} {p(y|θ1)/p(y|θ2)} = Prior odds × Likelihood ratio

Example: Hemophilia Inheritance
• Father → XY, Mother → XX
• Hemophilia exhibits X-chromosome-linked recessive inheritance
• If a son receives a bad chromosome from his mother, he will be affected
• If a daughter receives one bad chromosome from her mother, she will not be affected, but will be a carrier
• If both X chromosomes are affected in a woman it is fatal (occurrence rare)
• A woman has an affected brother → her mother is a carrier of hemophilia
• Mother → Xgood Xbad
• Father not affected

Unknown quantity of interest
θ = 0 if the woman is not a carrier, 1 if the woman is a carrier
Prior: P(θ=0) = P(θ=1) = 0.5

Model and Likelihood
• Suppose the woman has two sons, neither of whom is affected. Let yi = 1 denote an affected son and yi = 0 an unaffected son
• The conditions of the two sons are independent given θ (the sons are not identical twins).
Pr(y1=0, y2=0 | θ=1) = (0.5)(0.5) = 0.25
Pr(y1=0, y2=0 | θ=0) = (1)(1) = 1

Posterior distribution
• Bayes Rule: combines the information in the data with the prior probability
• y = (y1, y2) is the joint data
• Posterior probability of interest:
p(θ=1|y) = p(y|θ=1) p(θ=1) / {p(y|θ=1) p(θ=1) + p(y|θ=0) p(θ=0)} = (0.25)(0.5) / {(0.25)(0.5) + (1)(0.5)} = 0.2

Conclusions
• It is clear that if the woman has unaffected children it is less probable that she is a carrier
• Bayes Rule provides a formal mechanism in terms of prior and posterior odds.
• Prior odds = 0.5/0.5 = 1
• Likelihood ratio = 0.25/1 = 0.25
• So posterior odds = (1)(0.25) = 0.25
• So P(θ=1|y) = 0.25/(1 + 0.25) = 0.2

Easy sequential analysis with Bayesian Analysis
• Suppose that the woman has a third son, also unaffected.
• We do not repeat the entire analysis
• Use the previous posterior distribution as the new prior:
P(θ=1 | y1, y2, y3) = P(y3|θ=1)(0.2) / {P(y3|θ=1)(0.2) + P(y3|θ=0)(0.8)} = (0.5)(0.2) / {(0.5)(0.2) + (1)(0.8)} = 0.111

Probability as a measure of uncertainty
Legitimate to ask in Bayesian Analysis
• Pr(Rain tomorrow)?
• Pr(Victory of Bangladesh in 20-20 match)?
• Pr(Heads if coin is tossed)?
• Pr(Average height of students within (4ft, 5ft)): of interest after data are acquired
• Pr(Sample average of students within (4ft, 5ft)): of interest before data are acquired
• Bayesian analysis methods enable statements to be made, in a systematic way, about the partial knowledge available (based on data) concerning some situation (unobservable, or as yet unobserved), using probability as the measure
• Guiding principle: the state of knowledge about anything unknown is described by a probability distribution

Usual Numerical Methods of Certainty
1. Symmetry/Exchangeability Argument
• Probability = # favourable cases / # possibilities
• (Coin tossing experiment)
• Involves assumptions about the physical conditions of the toss and the forces at work
• Dubious if we know a coin is either double-headed or double-tailed.

Usual Numerical Methods of Certainty
2. Frequency Argument
• Probability = relative frequency obtained in a very long sequence of experiments (assumed identically performed and physically independent of each other)

Other arguments in consideration
• Physical randomness induces uncertainty (we speak of ‘likely’, ‘less likely’ etc. events)
• Axiomatic approach: related to decision theory
• Coherence of bets (define probability through odds ratios)
A. Fundamental difficulties remain in defining odds
B. Ultimate test is success of application!
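The hemophilia calculation and its sequential update can be reproduced in a few lines. The sketch below is illustrative only: the function name `update_carrier_probability` is hypothetical, new mutations are ignored (as in the slides), and exact fractions are used so that the values 0.2 and 0.111 appear as 1/5 and 1/9.

```python
from fractions import Fraction


def update_carrier_probability(prior, son_affected):
    """One step of Bayes rule for theta = 1 (the woman is a carrier).

    prior is P(theta = 1) before this son is observed.
    """
    # A carrier passes on the bad X chromosome with probability 1/2,
    # so each son is affected (or unaffected) with probability 1/2.
    like_carrier = Fraction(1, 2)
    # A non-carrier cannot have an affected son (mutation is ignored here).
    like_noncarrier = Fraction(0) if son_affected else Fraction(1)
    numerator = like_carrier * prior
    denominator = numerator + like_noncarrier * (1 - prior)
    return numerator / denominator


prior = Fraction(1, 2)

# Two unaffected sons, processed one at a time.
after_two_sons = prior
for _ in range(2):
    after_two_sons = update_carrier_probability(after_two_sons, son_affected=False)

# Third unaffected son: the previous posterior becomes the new prior.
after_three_sons = update_carrier_probability(after_two_sons, son_affected=False)

print(after_two_sons, float(after_two_sons))      # 1/5 0.2
print(after_three_sons, float(after_three_sons))  # 1/9, about 0.111
```

Processing the sons one at a time gives the same answer as using the joint likelihood of both sons at once, which is the sequential-updating property described above.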
Summarizing inference using simulation
• Simulation forms a central part of Bayesian analysis → relative ease with which samples can be drawn from even a complex, explicitly unknown probability distribution
• For example: to estimate the 95th percentile of the posterior distribution of θ|y, draw a random sample of size L (large) from p(θ|y) and use the (0.95L)th order statistic.
• For most purposes L = 1000 is adequate for such estimates
• Generating values from a probability distribution is often straightforward with modern computing techniques
• This technique is based on (pseudo)random number generators → these yield a deterministic sequence that appears to have the same properties as a sequence of independent random draws from the uniform distribution on [0,1]

Sampling using inverse cumulative distribution function
• F is the c.d.f. of a random variable
• F^(-1)(U) = inf{x: F(x) ≥ U} will follow the distribution defined by F(.), where U ~ Uniform(0, 1).
• If F is discrete, F^(-1) can be tabulated

Posterior samples as building blocks of posterior distribution
• One can use the array of posterior draws to approximate the posterior distribution
• One can use this array to find the posterior distribution of, say, θ1/θ2 or log(θ3) by adding appropriate columns computed from the existing columns: extremely straightforward!

Single-parameter models
• Consider some fundamental and widely used one-dimensional models: the binomial, normal, Poisson, exponential, etc.
• We shall discuss important concepts and computational methods for Bayesian data analysis

Estimating a probability from binomial data
• Sequence of Bernoulli trials; data y1, …, yn, each of which is either 0 or 1 (n fixed). Exchangeability implies that the likelihood depends only on the sum of the yi (denoted y).
• Provides a relatively simple but important example
• Parallels the very first published Bayesian analysis, by Thomas Bayes in 1763

Proportion of female births
• 200 years ago it was established that the proportion of female births in European populations was less than 0.5
• This century, interest has focused on factors that may influence the gender ratio.
• The currently accepted value of the proportion of female births in very large European-race populations is 0.485.
• Define the parameter θ to be the proportion of female births
• An alternative way of reporting this parameter is as a ratio of male to female birth rates
• For Bayesian inference in the binomial model, we must specify a prior distribution for θ.
• For simplicity assume the prior to be Uniform(0,1)
• Bayes rule implies that p(θ|y) ∝ θ^y (1−θ)^(n−y)
• In single-parameter problems, this allows immediate graphical presentation of the posterior distribution.
• Since p(θ|y) is a density and should integrate to 1, the normalizing constant can be worked out.
• The posterior distribution is recognizable as a beta distribution
• θ|y ~ Beta(y+1, n−y+1)
• In analyzing the binomial model, Pierre-Simon Laplace (1749–1827) also used the uniform prior distribution.
• His first serious application was to estimate the proportion of female births in a population.
• A total of 241,945 girls and 251,527 boys were born in Paris from 1745 to 1770.
• Laplace used a (Normal) approximation and showed that P(θ ≥ 0.5 | y = 241,945, n = 251,527 + 241,945) ≈ 1.15 × 10^(-42)
• So he was ‘morally certain’ that θ < 0.5.

Lecture 2: Bayesian Inference and Data Analysis
Dept. of Statistics, Rajshahi University, Rajshahi
Anandamayee Majumdar, Visiting Scientist, University of North Texas School of Public Health, USA; Professor, University of Suzhou, China.
Overview
• Prediction in the Binomial example
• Justification of the Uniform prior in Binomial case
• Prior distributions: more discussion and an example
• Hyperparameters, hyperpriors
• Hierarchical models
• Posterior distribution as a compromise between prior distribution and data
• Graphical and Numerical Summaries
• Posterior probability intervals (or credible intervals)
• Normal example with unknown mean and known variance
• Central Limit Theorem in the Bayesian Context
• Large sample properties and results

Prediction in the Binomial Example
• In the binomial example with the uniform prior distribution, the prior predictive distribution (marginal of y) can be evaluated explicitly
• Marginal distribution of y: p(y=i) = 1/(n+1) for i = 0, 1, …, n
• All values of y are equally likely, a priori.
• For posterior prediction, we might be more interested in the outcome of one new trial, rather than another set of n new trials.

Prediction in the Binomial example
• Letting y_tilde denote the result of a new trial, exchangeable with the first n: Pr(y_tilde = 1 | y) = ∫ θ p(θ|y) dθ = E(θ|y) = (y+1)/(n+2)
• This result, based on the uniform prior distribution, is known as ‘Laplace’s law of succession.’

Justification of Uniform prior in Binomial problem
• Bayes: the resulting marginal p(y) is uniform over {0, 1, …, n}. The justification is good in the sense that it uses y and n only.
• Laplace: insufficient information about θ → justified by a flat distribution. This argument is often followed in Bayesian model building.

Interpretation: prior distributions
1. In the population interpretation, the prior distribution represents a population of possible parameter values, from which the θ of current interest has been drawn.
• Probability of failure in a new industrial process: there is no perfectly relevant population
2. In the more subjective state of knowledge interpretation, the guiding principle is that we must express our knowledge (and uncertainty) about θ as if its value could be thought of as a random realization from the prior distribution.
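The two prediction results for the binomial model with a uniform prior, the flat prior predictive distribution p(y=i) = 1/(n+1) and Laplace's law of succession (y+1)/(n+2), can be checked with exact arithmetic. This is a minimal sketch; the function names and the example values n = 10, y = 7 are chosen here for illustration and do not come from the lecture.

```python
from fractions import Fraction
from math import comb, factorial


def prior_predictive(i, n):
    """P(y = i) for y | theta ~ Binomial(n, theta) with theta ~ Uniform(0, 1).

    Uses the Beta integral: the integral of theta^i (1 - theta)^(n - i)
    over [0, 1] equals i! (n - i)! / (n + 1)!.
    """
    beta_integral = Fraction(factorial(i) * factorial(n - i), factorial(n + 1))
    return comb(n, i) * beta_integral


def law_of_succession(y, n):
    """P(next trial is a success | y successes in n trials), uniform prior."""
    return Fraction(y + 1, n + 2)


n = 10
marginal = [prior_predictive(i, n) for i in range(n + 1)]
print(marginal)                  # every entry equals 1/(n + 1)
print(law_of_succession(7, 10))  # (7 + 1) / (10 + 2)
```

With no data at all (y = 0, n = 0) the law of succession returns 1/2, the prior mean of the uniform distribution.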
Prior: Constraints and Flexibility
• The prior distribution should include all plausible values of θ
• The prior need not be realistically concentrated around the true value, because often the information about θ contained in the data will far outweigh any reasonable prior specification.

Posterior distribution as compromise between data and prior information
• The process of Bayesian inference involves passing from a prior distribution, p(θ), to a posterior distribution, p(θ|y) → natural to expect that some general relations might hold between these two distributions.
1. E(θ) = E(E(θ|y)): the prior mean of θ is the average of all possible posterior means over the distribution of possible data
2. Var(θ) = E(Var(θ|y)) + Var(E(θ|y)): the posterior variance is on average smaller than the prior variance, by an amount that depends on the variation in posterior means over the distribution of possible data.

Posterior distribution → compromise between the prior and the data
• In the binomial example with the uniform prior distribution, the prior mean is ½
• The posterior mean, (y+1)/(n+2), is a compromise between the prior mean ½ and the sample proportion y/n
• Clearly the prior mean has a smaller and smaller role as the size of the data sample increases.
• This is a very general feature of Bayesian inference
• The posterior distribution is centered at a point that represents a compromise between the prior information and the data
• The compromise is controlled to a greater extent by the data as the sample size increases.

Displaying & summarizing posterior inference
• Graphical displays are useful, e.g. histograms, boxplots
• Contour plots, scatterplots in multiparameter problems
• Numerical summaries are also desirable
• Summaries of location are the mean, median, and mode(s)
• Variation is commonly summarized by the standard deviation, IQR, and other quantiles

Posterior Interval Summaries in Bayesian Inference
1.
A 100(1−α)% central posterior interval: the range of values above and below which lies exactly 100(α/2)% of the posterior probability
• For simple models (Binomial, Normal, Poisson, etc.), posterior intervals can be computed directly from the c.d.f. (use standard computer functions)

Posterior Interval Summaries in Bayesian Inference
2. Highest posterior density (HPD) interval: the region of values that contains 100(1−α)% of the posterior probability and also has the characteristic that the density within the region is never lower than that outside.
• The HPD region is identical to a central posterior interval if the posterior distribution is unimodal and symmetric.
• In general, we prefer the central posterior interval to the HPD region because the former has a direct interpretation as the posterior α/2 and 1−α/2 quantiles, is invariant to one-to-one transformations of the estimand, and is usually easier to compute

A special case: comparison of central probability interval and a HPD interval

Prior: categorization (Andrew Gelman)
(1) Prior distributions giving numerical information that is crucial to estimation of the model. This would be a traditional informative prior, which might come from a literature review or explicitly from an earlier data analysis.
(2) Prior distributions that are not supplying any controversial information but are strong enough to pull the data away from inappropriate inferences that are consistent with the likelihood. This might be called a weakly informative prior.
(3) Prior distributions that are uniform, or nearly so, and basically allow the information from the likelihood to be interpreted probabilistically. These are noninformative priors, or maybe, in some cases, weakly informative.

Noninformative priors
• "Non-informative prior distribution: A prior distribution which is non-committal about a parameter, for example, a uniform distribution."
-Everitt (1998)

Improper Priors
• A ‘prior distribution’ which integrates to infinity over the parameter space
• E.g. assume a constant prior for the Normal mean.
• Many authors (Lindley, 1973; De Groot, 1937; Kass and Wasserman, 1996) warn against the danger of overinterpreting those priors since they are not probability densities.
• As long as it yields a proper posterior distribution, Bayesian methodology can be carried out
• Improper priors have been proved to be limits of data-adaptive proper priors (Akaike, JRSSB, 1980)

Jeffreys prior
• Named after Harold Jeffreys, it is a non-informative (objective) prior distribution on parameter space that is proportional to the square root of the determinant of the Fisher information: p(θ) ∝ √det(I(θ))
• It has the key feature that it is invariant under reparameterization of the parameter vector: if φ = f(θ) then p(φ) ∝ √det(I(φ))
• Sometimes the Jeffreys prior cannot be normalized, and thus one must use an improper prior. For example, the Jeffreys prior for the mean of a Gaussian distribution with known variance is uniform over the entire real line.

Informative priors
Conjugacy: Binomial example
• Likelihood of the parametric form p(y|θ) ∝ θ^a (1−θ)^b (Binomial family)
• Thus, if the prior has the same form, so will the posterior: p(θ) ∝ θ^(α−1) (1−θ)^(β−1) (Beta family)
• If the prior and posterior distributions follow the same parametric family/functional form, then we have conjugacy.
• The Beta and Binomial families are said to be conjugate families.
• Other examples: Poisson and Gamma, Normal (mean) and Normal, Normal (variance) and Inverse Gamma, etc.

Informative (Beta) prior (continued)
• p(θ|y) ∝ θ^y (1−θ)^(n−y) θ^(α−1) (1−θ)^(β−1) = θ^(y+α−1) (1−θ)^(n−y+β−1), i.e. θ|y ~ Beta(y+α, n−y+β)
• E(θ|y) = (y+α)/(n+α+β) → y/n as n→∞
• Var(θ|y) = (y+α)(n−y+β) / {(n+α+β)^2 (n+α+β+1)} = O(1/n) as n→∞
• As n becomes large the effect of the prior diminishes, and the posterior distribution also shrinks!
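The Beta-Binomial update can be sketched directly from the formulas above. The prior Beta(2, 2) and the data (sample proportion fixed at 0.3) are made up for illustration; the function names are likewise hypothetical. The loop shows the posterior mean as a weighted average of the prior mean and y/n, with the weight on the prior, (α+β)/(n+α+β), shrinking as n grows.

```python
def beta_binomial_update(alpha, beta, y, n):
    """Posterior Beta parameters after y successes in n trials."""
    return alpha + y, beta + n - y


def beta_mean(a, b):
    return a / (a + b)


def beta_variance(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))


alpha, beta = 2.0, 2.0  # hypothetical prior, prior mean 0.5
for n in (10, 100, 1000, 10000):
    y = 3 * n // 10  # hypothetical data with sample proportion 0.3
    a_post, b_post = beta_binomial_update(alpha, beta, y, n)
    prior_weight = (alpha + beta) / (alpha + beta + n)
    # Weighted average of prior mean and sample proportion;
    # algebraically identical to the posterior mean.
    compromise = prior_weight * beta_mean(alpha, beta) + (1 - prior_weight) * y / n
    print(n,
          round(beta_mean(a_post, b_post), 4),
          round(compromise, 4),
          beta_variance(a_post, b_post))
```

The printed variance falls roughly in proportion to 1/n, matching the O(1/n) rate stated above.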
Basic justification of conjugate priors
• Similar to that for using standard models (such as Binomial and Normal) for the likelihood:
1. Easy to understand the results, which can often be put in analytic form
2. They are often a good approximation
3. They simplify computations.
• Also, they are useful as building blocks for more complicated models, including in many dimensions, where conjugacy is typically impossible.
• For these reasons, conjugate models can be good starting points; for example, mixtures of conjugate families can sometimes be useful when simple conjugate distributions are not reasonable

Hyperparameters
• Parameters of prior distributions are called hyperparameters, to distinguish them from parameters of the model of the underlying data.
• E.g. we use the Beta(α, β) distribution to model the parameter p of a Binomial(n, p) distribution
• p is a parameter of the underlying system
• α and β are parameters of the prior distribution (beta distribution), hence hyperparameters.

Hyperpriors
• A hyperprior is a prior distribution on a hyperparameter
• They arise particularly in the use of conjugate priors.

Purpose of Hyperpriors
1. To express uncertainty about the hyperparameter. Assuming fixed hyperparameters is rigid; making them random allows the data to choose the hyperparameters, and lets the ‘data speak’.
2. By using a hyperprior, the prior distribution itself becomes a mixture distribution: a weighted average of the various prior distributions (over different hyperparameters), with the hyperprior being the weighting. This adds additional possible distributions (beyond the parametric family one is using), because parametric families of distributions are generally not convex sets: since a mixture density is a convex combination of distributions, it will in general lie outside the family.
For instance, the mixture of two normal distributions is not a normal distribution: if one takes different means (sufficiently distant) and mixes 50% of each, one obtains a bimodal distribution, which is thus not normal. In fact, the convex hull of normal distributions is dense in all distributions, so in some cases one can approximate a given prior arbitrarily closely by using a family with a suitable hyperprior.

Purpose of Hyperpriors
3. Dynamical system: a hyperprior is a distribution on the space of possible hyperparameters. If one is using conjugate priors, then this space is preserved by moving to posteriors. Thus, as data arrive, the distribution changes but remains on this space: it evolves as a dynamical system (each point of hyperparameter space evolving to the updated hyperparameters), converging over time, just as the prior itself converges.
4. Ideal for hierarchical / multilevel models where hierarchy arises as a natural phenomenon and for information sharing with many sources of data

Hierarchical/Multilevel model
• Generalization of linear and generalized linear modeling in which regression coefficients are themselves given a model, whose parameters are also estimated from data.

Results using noninformative priors
• Many simple Bayesian analyses based on noninformative prior distributions give similar results to standard non-Bayesian approaches (for example, the posterior t interval for the normal mean with unknown variance).
• The extent to which a noninformative prior distribution can be justified as an objective assumption depends on the amount of information available in the data; as the sample size n increases, the influence of the prior distribution on posterior inferences decreases.
Informative Nonconjugate prior distributions
• For more complex problems, conjugacy may not be possible
• Although they can make interpretations of posterior inferences less transparent and computation more difficult, nonconjugate prior distributions do not pose any new conceptual problems.

Example: estimating the probability of a female birth given placenta previa
• Placenta previa is an abnormal condition of pregnancy
• An early study concerning the gender of placenta previa births in Germany found that of a total of 980 births, 437 were female.
• How much evidence does this provide for the claim that the proportion of female births in the population of placenta previa births is less than 0.485, the proportion of female births in the general population?

Posterior summary using Uniform prior
• Posterior distribution is Beta(438, 544).
• Posterior mean of θ is 0.446
• Posterior standard deviation is 0.016
• Posterior median is 0.446
• Central 95% posterior interval is [0.415, 0.477].
• This 95% posterior interval matches, to three decimal places, the interval that would be obtained by using a normal approximation with the calculated posterior mean and standard deviation.
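The summaries quoted for the Beta(438, 544) posterior can be recomputed from the closed-form Beta mean and variance, with the 95% interval taken from the normal approximation mentioned in the last bullet. This is a sketch using only the Python standard library; the variable names are illustrative, and the final line (a normal-approximation estimate of the posterior probability that θ < 0.485) is an addition not shown in the slides.

```python
from math import sqrt
from statistics import NormalDist

y, n = 437, 980
a, b = y + 1, n - y + 1  # uniform prior gives Beta(438, 544)

post_mean = a / (a + b)
post_sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Normal approximation to the central 95% posterior interval.
z = NormalDist().inv_cdf(0.975)
lower, upper = post_mean - z * post_sd, post_mean + z * post_sd

print(round(post_mean, 3), round(post_sd, 3))  # 0.446 0.016
print(round(lower, 3), round(upper, 3))        # 0.415 0.477

# Approximate posterior probability that theta is below 0.485.
print(NormalDist(post_mean, post_sd).cdf(0.485))
```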
Check same summary using simulations
• Simulate 1000 iid draws from the Beta(438, 544) posterior distribution
• The 2.5th and 97.5th percentiles give the central 95% posterior interval [0.415, 0.476]
• The median of the 1000 draws from the posterior distribution is 0.446
• The sample mean and standard deviation of the 1000 draws are 0.445 and 0.016

Figure: Draws from the posterior distribution of (a) the probability of female birth, θ; (b) the logit transform, logit(θ); (c) the male-to-female gender ratio

Sensitivity to prior specification

α/(α+β)   α+β   E(θ|y)   95% posterior interval for θ
0.500       2   0.446    [0.415, 0.477]
0.485       2   0.446    [0.415, 0.477]
0.485       5   0.446    [0.415, 0.477]
0.485      10   0.446    [0.415, 0.477]
0.485      20   0.447    [0.416, 0.478]
0.485     100   0.450    [0.420, 0.479]
0.485     200   0.453    [0.424, 0.481]

* Interpret α/(α+β) as the center of the prior and α+β as the prior number of observations (if large, the prior is concentrated)

Discussion
• The first row corresponds to the uniform prior
• The lower the row, the more concentrated the prior distribution is around 0.485
• Only when α+β ≥ 100 (likened to the prior number of observations) does the posterior interval begin to change.
• Even then the intervals exclude the prior mean

Alternative: Instead of conjugate prior, use a ‘flat’ non-conjugate prior (weakly informative prior)
Figure: (a) Prior density for θ in nonconjugate analysis of birth ratio example; (b) histogram of 1000 draws from a discrete approximation to the posterior density. *Figures are plotted on different scales.

Details of the nonconjugate flat prior (piecewise linear)
• Centered around 0.485 but flat far away from this value, to admit the possibility that the truth is far away.
• 40% of the probability mass is outside the interval [0.385, 0.585]
• This prior distribution has mean 0.493 and standard deviation 0.21, similar to the standard deviation of a Beta distribution with α + β = 5.
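The E(θ|y) column of the sensitivity table follows from the conjugate formula (y+α)/(n+α+β). The sketch below recomputes it for each prior in the table, parameterizing the Beta prior by its mean α/(α+β) and its size α+β; the function and variable names are illustrative.

```python
y, n = 437, 980

# (prior mean alpha/(alpha+beta), prior size alpha+beta), one pair per table row.
priors = [(0.500, 2), (0.485, 2), (0.485, 5), (0.485, 10),
          (0.485, 20), (0.485, 100), (0.485, 200)]


def posterior_mean(prior_mean, prior_size, y, n):
    """Posterior mean (y + alpha) / (n + alpha + beta) for a Beta prior."""
    alpha = prior_mean * prior_size
    return (alpha + y) / (prior_size + n)


posterior_means = [round(posterior_mean(m, s, y, n), 3) for m, s in priors]
for (m, s), pm in zip(priors, posterior_means):
    print(f"{m:.3f}  {s:>3}  {pm:.3f}")
```

Even with a prior worth 200 observations, the 980 data points dominate, so the posterior mean moves only from 0.446 to 0.453.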
Evaluating the posterior distribution
• The unnormalized posterior distribution is obtained at a grid of θ values, (0.000, 0.001, …, 1.000), by multiplying the prior density and the binomial likelihood at each point.
• Samples from the posterior distribution can be obtained by normalizing the distribution on the discrete grid of θ values.
• Figure (b) is a histogram of 1000 draws from the discrete posterior distribution.
• The posterior median is 0.448; the 95% central posterior interval is [0.419, 0.480].
• Because the prior distribution is overwhelmed by the data, results match those in the table based on Beta distributions.
• In the grid approach, we must avoid grids that are too coarse, since they distort a significant portion of the posterior mass.

Estimating the mean of a normal distribution with known variance
• The normal distribution is fundamental to most statistical modeling.
• The CLT helps to justify using the normal likelihood in many statistical problems, as an approximation to a less analytically convenient actual likelihood.
• Also, even when the normal distribution does not itself provide a good model fit, it can be useful as a component of a more complicated model involving Student-t or finite mixture distributions.
• For now, we simply work through the Bayesian results assuming the Normal model is true

Normal model with multiple observations, variance known
• A sample of independent and identically distributed observations y = (y1, …, yn) is available.
• The posterior density is p(θ|y) ∝ p(θ) ∏i p(yi|θ) ∝ exp{ −(1/2) [ (θ − μ0)²/τ0² + Σi (yi − θ)²/σ² ] }, with prior θ ~ N(μ0, τ0²) and yi | θ ~ N(θ, σ²)

Posterior distribution also Normal
• θ | y ~ N(μn, τn²), where μn = (μ0/τ0² + n ȳ/σ²) / (1/τ0² + n/σ²) and 1/τn² = 1/τ0² + n/σ²

Remarks about posterior results
• Posterior variance converges to σ²/n if n→∞ or if the prior variance τ0² → ∞
• Posterior mean is a weighted average of the prior mean and the sample mean
• Incidentally, the same result is obtained by adding information for the data points y1, y2, …, yn one point at a time, using the posterior distribution at each step as the prior distribution for the next

CLT in Bayesian context
((θ − E(θ|y)) / √Var(θ|y) | y) → N(0,1) as n→∞
• Often used to justify approximating the posterior distribution with a normal distribution.
• For the binomial parameter θ, the normal distribution is a more accurate approximation in practice if we transform θ to the logit scale, that is, perform inference for log(θ/(1−θ)) instead of θ
• The parameter space expands from [0, 1] to (−∞, ∞), which is more fitting for a normal approximation.

Large sample results
• The large-sample results are not actually necessary for performing Bayesian data analysis, but are often useful as approximations and as tools for understanding.

Normal approximations to the posterior distribution
• A Taylor series expansion of log p(θ|y) centered at the posterior mode θ̂ (where the mode can be a vector and is assumed to be in the interior of the parameter space) gives log p(θ|y) = log p(θ̂|y) + (1/2)(θ − θ̂)ᵀ [d²/dθ² log p(θ|y)] (θ − θ̂) + …, with the second derivative evaluated at θ = θ̂
• Posterior distribution converges to: p(θ|y) ≈ N(θ̂, [I(θ̂)]⁻¹), where I(θ) = −d²/dθ² log p(θ|y) is the observed information

Remark:
• For a finite sample size n, the normal approximation is typically more accurate for conditional and marginal distributions of components of θ than for the full joint distribution.

Posterior Consistency
• If the true data distribution is included in the parametric family, that is, if f(y) = p(y|θ0) for some θ0, then, in addition to asymptotic normality, the property of consistency holds: the posterior distribution converges to a point mass at the true parameter value, θ0, as n→∞.
• When the true distribution is not included in the parametric family, there is no longer a true value θ0, but its role in the theoretical result is replaced by a value θ0 that makes the model distribution, p(y|θ), closest to the true distribution, in a technical sense involving Kullback-Leibler information

Large sample correspondence between Bayesian and Frequentist methods
• When n→∞, a 95% central posterior interval for θ will cover the true value 95% of the time under repeated sampling with any fixed true θ.

When asymptotic results fail
• These correspond to situations in which the prior distribution has an impact on the posterior inference, even in the limit of infinite sample sizes.
• Usually when the likelihood is flat
• For example, when the model is unidentifiable (there exist two distinct parameter values yielding the same likelihood), e.g. f(x) = p g(x) + (1−p) h(x) where 0 < p < 1 and (p, g, h) are unknown
• The number of parameters increases with the amount of data
• Prior distributions that exclude the point of convergence or yield improper posterior distributions

Lecture 3: Bayesian Inference and Data Analysis
Dept. of Statistics, Rajshahi University, Rajshahi
Anandamayee Majumdar, Visiting Scientist, University of North Texas School of Public Health, USA; Professor, University of Suzhou, China.

Overview
• Model checking and improvement
• Test quantities, P-values
• Starting the computation in Bayesian Inference
• Simulation of potentially observable quantities
• Posterior simulation methods: the Gibbs Sampler, Rejection sampling, Metropolis-Hastings algorithm
• Bivariate Unit Normal Example with Bivariate Jumping kernel
• Recommended strategies for simulation
• Advanced techniques for Monte Carlo simulation

Model checking and improvement
• Checking the model is crucial to statistical analysis.
• Bayesian inferences assume the whole structure of a probability model and can yield misleading inferences when the model is poor.
• A good Bayesian analysis, therefore, should include some check of the adequacy of the fit of the model to the data and the plausibility of the model for the purposes for which the model will be used.
• This is sometimes discussed as a problem of sensitivity to the prior distribution,
• but in practice the likelihood model is typically just as suspect;
• throughout, we use ‘model’ to encompass:
  1. the sampling distribution,
  2. the prior distribution,
  3. any hierarchical structure, and
  4. issues such as which explanatory variables have been included in a regression.

Judging model flaws by their practical implications
• Whether the model is TRUE or FALSE is not the question.
• Relevant question: ‘Do the model’s deficiencies have a noticeable effect on the substantive inferences?’
• Do the inferences from the model make sense? If NO, this suggests a potential for creating a more accurate probability model for the parameters and data collection process.
• Is the model consistent with the data?

Posterior predictive checking
• If the model fits, then replicated data generated under the model should look similar to the observed data.
• This is really a self-consistency check: an observed discrepancy can be due to model misfit or chance.

Basic technique for checking fit
• Draw simulated values from the posterior predictive distribution of replicated data and compare these samples to the observed data.
• Any systematic differences between the simulations and the data indicate potential failings of the model.

Example: Newcomb’s speed of light measurements
• 66 measurements on the speed of light,
• modeled as N(μ, σ²), with a non-informative uniform prior distribution on (μ, log σ).
• However, the lowest of Newcomb’s measurements look like outliers.
• Question: Could the extreme measurements have reasonably come from a normal distribution?
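
The basic technique above can be sketched in Python for a normal model with a uniform prior on (μ, log σ). This is only an illustrative sketch: Newcomb's 66 measurements are not reproduced in these notes, so the data below are synthetic stand-ins (64 normal values plus two low values added by hand), and the function name `posterior_predictive_min_check` is hypothetical. Under this prior the posterior can be drawn directly: σ² | y is scaled inverse-χ² with n − 1 degrees of freedom and scale s², and μ | σ², y is N(ȳ, σ²/n).

```python
import math
import random
import statistics


def posterior_predictive_min_check(y, n_sims=1000, seed=0):
    """Fraction of replicated datasets whose smallest value is <= min(y)."""
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    s2 = statistics.variance(y)  # sample variance (divisor n - 1)
    t_obs = min(y)               # test quantity: the smallest observation
    count = 0
    for _ in range(n_sims):
        # sigma^2 | y ~ scaled Inv-chi^2(n - 1, s^2), via a chi^2 draw
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        # mu | sigma^2, y ~ N(ybar, sigma^2 / n)
        mu = rng.gauss(ybar, math.sqrt(sigma2 / n))
        # one replicated dataset of the same size as y
        sigma = math.sqrt(sigma2)
        y_rep = [rng.gauss(mu, sigma) for _ in range(n)]
        if min(y_rep) <= t_obs:
            count += 1
    return count / n_sims


# Synthetic stand-in data (NOT Newcomb's measurements): 64 well-behaved
# values plus two low outliers inserted by hand.
data_rng = random.Random(1)
y_synthetic = [data_rng.gauss(26.0, 5.0) for _ in range(64)] + [-44.0, -2.0]

p_low = posterior_predictive_min_check(y_synthetic)
print("share of replications with a minimum as small as observed:", p_low)
```

A share close to 0 means that the normal model almost never reproduces a minimum as extreme as the observed one, which is the kind of systematic difference the check is meant to expose.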
Simulating replicated data using posterior sample
• y: the observed data
• θ: the vector of parameters
• yrep: replicated data that could have been observed (if x is the explanatory variable vector for y, then it is also for yrep)

[Figure: smallest observation of Newcomb’s speed of light data (the vertical line at the left of the graph), compared to the smallest observations from each of the 20 posterior predictive simulated datasets.]

Test quantity, or discrepancy measure
• T(y, θ) is a scalar summary of parameters and data that is used as a standard when comparing data to predictive simulations.
• Test quantities play the role in Bayesian model checking that test statistics play in classical testing.
• Unlike a classical test statistic, a test quantity may depend on both the data and the parameters.

P-values or tail area probabilities
• Classical p-value: pC = Pr(T(yrep) ≥ T(y) | θ), with the probability taken over the distribution of yrep for fixed θ.
• Bayesian p-value: pB = Pr(T(yrep, θ) ≥ T(y, θ) | y).
• For the Bayesian p-value, the probability is taken over the joint posterior distribution, p(θ, yrep|y).

Example
• Consider a sequence of binary outcomes, y1, …, yn,
• modeled as n iid Bernoulli trials,
• with a uniform prior distribution on θ.
• Suppose the observed data are, in order, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0.
• The observed autocorrelation is evidence that the model is flawed.
• T = number of switches between 0’s and 1’s in the sequence; the observed value is T(y) = 3.
• To simulate yrep under the model, we first draw θ from its Beta(8, 14) posterior distribution.
• Then draw yrep as 20 independent Bernoulli(θ) outcomes; repeating both draws gives 10,000 independent replications.
• P-value = 0.028

Other posterior predictive checks for model fit/model comparison
• Use part of the data for building the model.
• Use the rest of the data to make predictions (usually in the case when each value of y is associated with covariate and/or coordinate information).
• Compare prediction coverage for different competing models.

Computation in Bayesian Inference
• The distribution to be simulated is the target distribution; denote it as p(θ|y).
• Assume the target density p(θ|y) can be easily computed for any value of θ, up to a proportionality constant involving only y.
• Starting point: crude estimation of the parameters, which is often reliable and easy to compute.

Use posterior simulations to make inferences about
1. Predictive quantities: ŷk ~ p(ŷ|θk), or ŷk ~ p(ŷ|Xk, θk) in the regression model.
2. Or replications: yrep,k ~ p(y|θk), or yrep,k ~ p(y|X, θk) in the regression model.

How many simulation draws are needed?
• In general, few simulations are needed to estimate posterior medians, probabilities near 0.5, and low-dimensional summaries.
• More simulations are needed for extreme quantiles, posterior means, probabilities of rare events, and higher-dimensional summaries.
• The number of simulation draws is typically 100 to 2000.

Direct simulation
• In simple nonhierarchical Bayesian models, it is often easy to draw from the posterior distribution directly, especially if conjugate prior distributions have been assumed.
• In complex problems, we sometimes simulate the hyperparameters first (marginally) and then, conditionally on them, the intermediate parameters.

Rejection sampling
• Suppose we want to obtain a single random draw from a density p(θ|y). We require a positive function g(θ), defined for all θ for which p(θ|y) > 0, that has the following properties:
  1. We are able to draw random samples from the probability density proportional to g. It is not required that g(θ) integrate to 1, but g(θ) must have a finite integral.
  2. There must be some known constant M for which p(θ|y)/g(θ) ≤ M for all θ.
• The rejection sampling algorithm proceeds in two steps:
  1. Sample θ at random from the probability density proportional to g(θ).
  2. With probability p(θ|y)/(M g(θ)), accept θ as a draw from p. If the drawn θ is rejected, repeat step 1.

[Figure: illustration of rejection sampling. The top curve is an approximation function, M g(θ), and the bottom curve is the target density, p(θ|y). As required, M g(θ) ≥ p(θ|y) for all θ. The vertical line indicates a single random draw θ from the density proportional to g. The probability that a sampled draw θ is accepted is the ratio of the height of the lower curve to the height of the higher curve at the value θ.]

Posterior simulation
• The most commonly used Markov chain simulation methods:
  1. Gibbs sampler
  2. Metropolis-Hastings algorithm

Markov Chain Monte Carlo (MCMC) simulation
• Definition: A Markov chain is a sequence of random variables θ1, θ2, …, for which, for any t, the distribution of θt given all previous θ’s depends only on the most recent value, θt−1.
• Key: Create a Markov process whose stationary distribution is the specified p(θ|y), and run the simulation long enough that the distribution of the current draws is close enough to this stationary distribution.
• For any specific p(θ|y), a variety of Markov chains with the desired property can be constructed.

The Gibbs Sampler
• Suppose the parameter vector θ has been divided into d components or subvectors, θ = (θ1, …, θd).
• In iteration t, for j = 1, …, d, we simulate θj^t ~ p(θj | θ−j^(t−1), y), where θ−j^(t−1) represents all the components of θ, except for θj, at their current values (components already updated in iteration t take their new values; the rest keep their values from iteration t−1).

Bivariate Normal Example
• Suppose (y1, y2) ~ Bivariate Normal with mean (θ1, θ2) and covariance matrix with rows (1, ρ) and (ρ, 1).
• With a uniform prior distribution on θ, the Gibbs sampler alternates the following steps:
  θ1 | θ2, y ~ N(y1 + ρ(θ2 − y2), 1 − ρ²)
  θ2 | θ1, y ~ N(y2 + ρ(θ1 − y1), 1 − ρ²)

[Figure: four independent sequences of the Gibbs sampler for a bivariate normal distribution with fixed correlation ρ = 0.8, with overdispersed starting points indicated by solid squares. (a) First 10 iterations, showing the component-by-component updating of the Gibbs iterations. (b) After 500 iterations, the sequences have reached approximate convergence. (c) Iterates from the second halves of the sequences.]

The Metropolis algorithm
• The Metropolis algorithm is an adaptation of a random walk that uses an acceptance/rejection rule to converge to the specified target distribution.
1. Draw a starting point θ0, for which p(θ0|y) > 0, from a starting distribution p0(θ). Or we may simply choose starting values dispersed around a crude approximate estimate.
2. For t = 1, 2, …
  (a) Sample a proposal θ* from a jumping distribution at time t, Jt(θ*|θt−1). For the Metropolis algorithm, the jumping distribution must be symmetric: Jt(θa|θb) = Jt(θb|θa) for all θa, θb, t.
  (b) Calculate the ratio of the densities, r = p(θ*|y) / p(θt−1|y).
  (c) Set θt = θ* with probability min(r, 1); otherwise set θt = θt−1.

The Metropolis-Hastings algorithm
• Same as the Metropolis algorithm, except that the jumping rule does not have to be symmetric: Jt(θa|θb) ≠ Jt(θb|θa) is allowed for some θa, θb, t.
• To correct for the asymmetry in the jumping rule, the ratio r becomes r = [p(θ*|y) / Jt(θ*|θt−1)] / [p(θt−1|y) / Jt(θt−1|θ*)].
• Allowing asymmetric jumping rules can be useful in increasing the speed of the random walk.

Properties of a good jumping rule
• For any θ, it is easy to sample from J(θ*|θ).
• It is easy to compute the ratio r.
• Each jump goes a reasonable distance in the parameter space (otherwise the random walk moves too slowly).
• The jumps are not rejected too frequently (otherwise the random walk wastes too much time standing still).
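
The Metropolis steps above (propose from a symmetric jumping distribution, compute the density ratio r, move with probability min(r, 1) and otherwise stay put) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the lecture's own code: the target is taken to be the bivariate normal posterior with ρ = 0.8 centered at (0, 0), which assumes data y = (0, 0); the jumping distribution is a normal centered at the current point with standard deviation `jump_sd` in each coordinate; and the names `log_target` and `metropolis` are hypothetical.

```python
import math
import random


def log_target(theta, rho=0.8):
    """Log of the unnormalized density N(0, [[1, rho], [rho, 1]])."""
    t1, t2 = theta
    return -(t1 * t1 - 2.0 * rho * t1 * t2 + t2 * t2) / (2.0 * (1.0 - rho * rho))


def metropolis(log_p, theta0, n_iter, jump_sd, seed=0):
    """Random-walk Metropolis; returns the draws and the acceptance rate."""
    rng = random.Random(seed)
    theta = list(theta0)
    log_p_current = log_p(theta)
    draws = []
    n_accepted = 0
    for _ in range(n_iter):
        # (a) symmetric proposal centered at the current value
        proposal = [t + rng.gauss(0.0, jump_sd) for t in theta]
        # (b) density ratio r, computed on the log scale for stability
        log_p_proposal = log_p(proposal)
        log_r = log_p_proposal - log_p_current
        # (c) accept with probability min(r, 1); otherwise stay where we are
        if log_r >= 0.0 or rng.random() < math.exp(log_r):
            theta, log_p_current = proposal, log_p_proposal
            n_accepted += 1
        draws.append(tuple(theta))
    return draws, n_accepted / n_iter


# Start away from the mode, then keep only the second half of the chain.
draws, acceptance_rate = metropolis(
    log_target, theta0=(2.5, -2.5), n_iter=20000, jump_sd=1.0
)
kept = draws[len(draws) // 2:]
mean_theta1 = sum(d[0] for d in kept) / len(kept)
print(f"acceptance rate: {acceptance_rate:.2f}, mean of theta1: {mean_theta1:.2f}")
```

Only ratios of the target density enter the algorithm, which is why the unnormalized log density is enough. Changing `jump_sd` shows the trade-off listed under the properties of a good jumping rule: very small jumps are almost always accepted but move slowly, and very large jumps are rejected most of the time.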
Difficulties of inference from iterative simulation
• If the iterations have not proceeded long enough, the simulations may be grossly unrepresentative of the target distribution.
• Even when the simulations have reached approximate convergence, the early iterations are still influenced by the starting approximation rather than the target distribution.
• Within-sequence correlation: aside from any convergence issues, simulation inference from correlated draws is generally less precise than from the same number of independent draws.

Solutions
• Design the simulation runs to allow effective monitoring of convergence, in particular by simulating multiple sequences with starting points dispersed throughout parameter space.
• Monitor the convergence of all quantities of interest by comparing variation between and within simulated sequences until ‘within’ variation roughly equals ‘between’ variation.
• If the simulation efficiency is unacceptably low (in the sense of requiring too much real time on the computer to obtain approximate convergence of posterior inferences for quantities of interest), the algorithm can be altered.
• To diminish the effect of the starting distribution, discard the first half of each sequence and focus attention on the second half (the appropriate burn-in fraction depends on the problem at hand).
• Once approximate convergence has been reached, thin the sequences by keeping every kth simulation draw from each sequence and discarding the rest.

Monitoring convergence of each scalar estimand
• Suppose we have simulated m parallel sequences, each of length n (after discarding the first half of the simulations).
• For each scalar estimand ψ, we label the simulation draws as ψij (i = 1, …, n; j = 1, …, m), and we compute B and W, the between- and within-sequence variances:
  B = n/(m − 1) Σj (ψ̄.j − ψ̄..)², where ψ̄.j = (1/n) Σi ψij and ψ̄.. = (1/m) Σj ψ̄.j
  W = (1/m) Σj sj², where sj² = 1/(n − 1) Σi (ψij − ψ̄.j)²

Step 1
• Estimate var(ψ|y), the marginal posterior variance of the estimand, by a weighted average of W and B, namely var⁺(ψ|y) = ((n − 1)/n) W + (1/n) B.
• This quantity overestimates the marginal posterior variance assuming the starting distribution is appropriately overdispersed, but is unbiased under stationarity.

Step 2
• The ‘within’ variance W should be an underestimate of var(ψ|y) because the individual sequences have not had time to range over all of the target distribution.
• In the limit as n→∞, E(W) → var(ψ|y).
• The potential scale reduction is estimated by Rhat = √(var⁺(ψ|y) / W).
• This declines to 1 as n→∞.

Monitoring convergence for the entire distribution
• Once Rhat is near 1 for all scalar estimands of interest, just collect the mn simulations from the second halves of all the sequences together and treat them as a sample from the target distribution.

A small experimental dataset
• Coagulation time in seconds for blood drawn from 24 animals randomly allocated to four different diets.
• Different treatments have different numbers of observations because the randomization was unrestricted.

  Diet   Measurements
  A      62, 60, 63, 59
  B      63, 67, 71, 64, 65, 66
  C      68, 66, 71, 67, 68, 68
  D      56, 62, 60, 61, 63, 64, 63, 59

Hierarchical Model and Priors in this example
Under the hierarchical normal model:
• Data yij, i = 1, …, nj, j = 1, …, J, are independently normally distributed within each of J groups, with means θj and common variance σ². The total number of observations is n.
• The group means θj are assumed to follow a normal distribution with unknown mean μ and variance τ².
• A uniform prior distribution is assumed for (μ, log σ, τ).
• If we were to assign a uniform prior distribution to log τ, the posterior distribution would be improper.

Joint posterior density of all the parameters
• p(θ, μ, log σ, τ | y) ∝ Πj N(θj | μ, τ²) Πj Πi N(yij | θj, σ²)

Starting points
• Choose over-dispersed starting points for each parameter θj by simply taking random points from the data yij from group j.
• Starting points for μ can be taken as the average of the starting θj values.
• No starting values are needed for τ or σ, as they can be drawn as the first steps in the Gibbs sampler.

Gibbs sampler
The conditional distributions for this model all have simple conjugate forms:
1. Conditional on y and the other parameters in the model, each θj has a normal distribution with mean (μ/τ² + nj ȳ.j/σ²) / (1/τ² + nj/σ²) and variance 1 / (1/τ² + nj/σ²), where ȳ.j is the sample mean of group j.
2. Conditional on y and the other parameters in the model, μ has a normal distribution with mean (1/J) Σj θj and variance τ²/J.
3. Again conditional on y and the other parameters, σ² has a scaled inverse-χ² distribution with n degrees of freedom and scale (1/n) Σj Σi (yij − θj)².
4. Conditional on y and the other parameters, τ² has a scaled inverse-χ² distribution with J − 1 degrees of freedom and scale (1/(J − 1)) Σj (θj − μ)².

Posterior distribution

  Estimand   Posterior quantiles                          Rhat
             2.5%    25%     median   75%     97.5%
  θ1         58.9    60.6    61.3     62.1    63.5        1.01
  θ2         63.9    65.3    65.9     66.6    67.7        1.01
  θ3         66.0    67.1    67.8     68.5    69.5        1.01
  θ4         59.5    60.6    61.1     61.7    62.8        1.01
  μ          56.9    62.2    63.9     65.5    73.4        1.04
  σ          1.8     2.2     2.4      2.6     3.3         1.00

Bivariate unit Normal Example
• Suppose (y1, y2) ~ Bivariate Normal with mean (θ1, θ2) and identity covariance matrix.
• Data: (y1, y2) = (0, 0).
• Target density: p(θ|y) = N(θ|0, I), where I is the 2×2 identity matrix.
• The jumping distribution is also bivariate normal, centered at the current iteration and scaled to 1/5 the size: Jt(θ*|θt−1) = N(θ*|θt−1, 0.2² I). This rule is symmetric.
• Density ratio: r = N(θ*|0, I) / N(θt−1|0, I).

[Figure: five independent sequences of a Markov chain simulation for the bivariate unit normal distribution, with overdispersed starting points indicated by solid squares.]
(a) After 50 iterations, the sequences are still far from convergence (due to the inefficient jumping rule, deliberately chosen to demonstrate the mixing).
(b) After 1000 iterations, the sequences are nearer to convergence.
(c) The third panel shows the iterates from the second halves of the sequences. The points in the third panel have been jittered so that steps in which the random walks stood still are not hidden.
[Figure title: bivariate unit normal density with bivariate normal jumping kernel.]

Recommended strategy for posterior simulation
1. Start off with crude estimates and possibly a mode-based approximation to the posterior distribution.
2. If possible, simulate from the posterior distribution directly or sequentially, starting with hyperparameters and then moving to the main parameters.
3. Most likely, the best approach is to set up a Markov chain simulation algorithm. The updating can be done one parameter at a time or with parameters in batches (as is often convenient in regressions and hierarchical models).
4. For parameters (or batches of parameters) whose conditional posterior distributions have standard forms, use the Gibbs sampler.
5. For parameters whose conditional distributions do not have standard forms, use Metropolis jumps. Tune the scale of each jumping distribution so that acceptance rates are near 20% (when altering a vector of parameters) or 40% (when altering one parameter at a time).
6. Construct a transformation so that the parameters are approximately independent; this will speed the convergence of the Gibbs sampler. Or add auxiliary variables (data augmentation) to speed up the computation.
7. Start the Markov chain simulations with parameter values taken from the crude estimates or mode-based approximations, with noise added so they are over-dispersed with respect to the target distribution.
8. Run multiple Markov chains and monitor the mixing of the sequences. Run until approximate convergence appears to have been reached.
9. If Rhat is near 1 for all scalar estimands of interest, summarize inference about the posterior distribution by treating the set of all iterates from the second half of the simulated sequences as an identically distributed sample from the target distribution. At this point, simulations from the different sequences can be mixed.
10. Compare the posterior inferences from the Markov chain simulation to the approximate distribution used to start the simulations. If they are not close with respect to locations and approximate distributional shape, check for errors before believing that the Markov chain simulation has produced a better answer.

Advanced techniques for Markov Chain Simulation
Hybrid Monte Carlo methods: for moving rapidly through the target distribution
• Hybrid Monte Carlo borrows ideas from physics to add auxiliary variables that suppress the local random walk behavior in the Metropolis algorithm, thus allowing it to move much more rapidly through the target distribution.
• For each component θj in the target space, hybrid Monte Carlo adds a ‘momentum’ variable φj.
• Both θ and φ are then updated together in a new Metropolis algorithm, in which the jumping distribution for θ is determined largely by φ.
• Roughly, the momentum gives the expected distance and direction of the jump in θ, so that successive jumps tend to be in the same direction, allowing the simulation to move rapidly where possible through the space of θ.
• The MH accept/reject rule stops the movement when it reaches areas of low probability, at which point the momentum changes until the jumping can continue.
• Hybrid Monte Carlo is also called Hamiltonian Monte Carlo because it is related to the model of Hamiltonian dynamics in physics.

Advanced techniques for Markov Chain Simulation
Langevin methods
• The basic symmetric-jumping Metropolis algorithm is simple to apply but has the disadvantage of wasting many of its jumps by going into low-probability areas of the target distribution.
• For example, optimal Metropolis algorithms in high dimensions have acceptance rates below 25%, which means that, in the best case, over 3/4 of the jumps are wasted.
• A potential improvement is afforded by the Langevin algorithm, in which each jump is associated with a small shift in the direction of the gradient of the logarithm of the target density, thus moving the jumps toward higher density regions of the distribution.
• This jumping rule is not symmetric.

Thank you for your attention!