Lecture 1: Bayesian Inference
and Data Analysis
Department of Statistics,
Rajshahi University, Rajshahi
-Anandamayee Majumdar
Visiting Scientist, University of North Texas
School of Public Health, USA;
Professor, University of Suzhou, China.
Overview
• Applications
• Introduction
• Steps and Components
• Motivation
• Bayes Rule
• Probability as a Measure of Certainty
• Simulation from a distribution using inverse CDF
• A one-parameter model example
• Binomial example approached by Bayes and Laplace
Applications to Computer Science
• Bayesian inference has applications in artificial intelligence and expert systems.
Bayesian inference techniques have been a fundamental part of
computerized pattern recognition since the late 1950s.
• Bayesian inference has recently gained popularity in the phylogenetics
community, where a number of applications allow many demographic and
evolutionary parameters to be estimated simultaneously. In
the areas of population genetics and dynamical systems theory, approximate
Bayesian computation (ABC) is also becoming increasingly popular.
• As applied to statistical classification, Bayesian inference has been used in recent
years to develop algorithms for identifying e-mail spam.
Application to the Court Room
• Bayesian inference can be used by jurors to
coherently accumulate the evidence for and
against a defendant, and to see whether, in
totality, it meets their personal threshold for
‘beyond a reasonable doubt’. The benefit of a
Bayesian approach is that it gives the juror an
unbiased, rational mechanism for combining
evidence.
Other Applications
• Population genetics
• Ecology
• Archaeology
• Environmental Science
• Finance
• …and many more
Introduction: Bayesian Inference
• Practical methods for learning from data
• Use of Probability Models
• Quantify Uncertainty
Steps
1. Set up a full probability model
(a joint distribution for all observable and
unobservable quantities in a problem)
• Consistent with underlying scientific
problem
• Consistent with data collection process
Steps (continued)
2. Conditioning on observed data:
Calculate and interpret the
posterior distribution
(the conditional probability distribution of the
unobserved quantities given observed data)
P (θ | Data)
Steps (continued)
3. Evaluate the fit of the model and the
implications of the resulting posterior
distribution
• Does model fit data?
• Are conclusions reasonable?
• How sensitive are results to the modeling
assumptions in step 1?
Step 3 continued
• If necessary one can alter or expand the
model and repeat the three steps
Step 1 is a stumbling block
• How do we go about constructing the joint
distribution, i.e. the full probability model?
• Improved techniques for the second step
may help
• Advances in carrying out the third step
somewhat alleviate the issue of incorrect
model specification in the first step.
Primary motivation for Bayesian
thinking
• Facilitates common sense interpretation of
statistical conclusions.
• Eg. A Bayesian (probability) interval for an
unknown quantity of interest can be directly
regarded as having a high probability of
containing the unknown quantity, in contrast to a
frequentist (confidence) interval, which is justified
only through the behavior of the procedure under
repeated sampling.
Primary motivation for Bayesian
thinking (continued)
• Increased emphasis has been placed on
interval estimation rather than hypothesis testing,
which adds a strong impetus to the Bayesian
viewpoint
- We shall look at the extent to which
Bayesian interpretations of common
simple statistical procedures are justified.
Real Life Example
• A clinical trial of cancer patients might be
designed to compare the 5 year survival
probability given the new drug – with that in
the standard treatment
• Inference based on a sample of patients
• We cannot assign the same patient to both treatments
• Causal inference (compare the observed
outcome in a patient to the unobserved
outcome if exposed to the other treatment)
Two kinds of estimands
• Estimand = Unobserved quantity for which
inference is needed
1. Potentially observable quantity (Ÿ).
2. Quantities that are not directly observable
(parameters) (θ).
• The first helps to understand how model fits
real data
General notation
• θ → denotes unobservable vector quantities or
population parameters of interest
• y → observed data y= (y1, y2, …, yn)
• Ÿ → potentially observable but unknown
quantity (replication, future prediction etc)
• In general these are multivariate quantities
General notation
• x → explanatory variable / covariate
• X → entire set of explanatory variables for all
n units (of data)
Fundamental Difference
Bayesian Approach
• Inference of θ → based on p(θ|y)
• Inference of Ÿ → based on p(Ÿ|y)
*Bayesian statistical conclusions: Made using probability
statements (‘highly unlikely’, ‘very likely’)
Frequentist Approach
• Inference of θ → based on p(y |θ)
• Inference of Ÿ → based on θ → based on
p(y |θ)
*Frequentist statistical conclusions: Based on p-values (‘not
significant’, ‘null hypothesis cannot be rejected’, etc.)
Practical similarity? Difference?
• Despite these differences, in many simple analyses the two
procedures yield superficially similar results (especially
asymptotically)
• Bayesian methods can be easily extended to more complex
problems
• Bayesian models often work better when data are scarce
• Bayesian methods can bring prior information into the
analysis through the prior distribution
• Sequential updating of inference is easy: as new data become
available, the previous posterior distribution serves as the
new prior distribution (Bayesian updating).
A Fundamental Concept:
The Prior distribution
• θ→ random because it is unknown to us,
though we may have some feeling about it
from before
• Prior distribution → “subjective” probability
that quantifies whatever belief (however
vague) we may have about θ before having
looked at the data.
Fundamental Result:
Bayes Rule
• Due to Thomas Bayes (1702–1761)
• Joint distribution p(θ, y) = p(y | θ) p(θ )
• p(θ | y) = p(θ, y)/p(y)
= p(y | θ)p(θ)/p(y)
Gist – Main point to remember
• p(θ | y) ∝ p(y | θ) p(θ), as p(y) is free of θ
• Any two data sets that yield the same likelihood
yield the same inference about θ (for a given prior)
• This encapsulates the technical core of Bayesian
inference: the primary task is to develop the model
p(θ, y) and perform the necessary computations to
summarize p(θ|y) appropriately.
Posterior Prediction
• After data y has been observed, an unknown
observable Ÿ can be predicted using similar
conditional ideas.
• p(Ÿ|y) = ∫ p(θ, Ÿ|y) dθ
= ∫ p(Ÿ|θ, y) p(θ|y) dθ
= ∫ p(Ÿ|θ) p(θ|y) dθ
(the last step uses the conditional independence of Ÿ and y given θ)
Attractive property of Bayes Rule
• Posterior odds
= p(θ1|y) / p(θ2|y)
= {p(θ1) p(y|θ1) / p(y)} / {p(θ2) p(y|θ2) / p(y)}
= {p(θ1) / p(θ2)} {p(y|θ1) / p(y|θ2)}
= Prior odds × Likelihood ratio
Example: Hemophilia Inheritance
• Father →XY, Mother →XX
• Hemophilia exhibits X-chromosome-linked
recessive inheritance
• If son receives a bad chromosome from mother,
he will be affected
• If daughter receives one bad chromosome from
mother, she will not be affected, but will be a
carrier
• If both X are affected in a woman it is fatal
(occurrence rare)
• A woman has an affected brother → mother
carrier of hemophilia
• Mother →Xgood Xbad
• Father not affected
Unknown quantity of interest
θ = 0 if the woman is not a carrier,
θ = 1 if the woman is a carrier
Prior: P(θ=0) = P(θ=1) = 0.5
Model and Likelihood
• Suppose the woman has two sons, neither of
whom is affected.
Let yi = 1 denote an affected son and
yi = 0 an unaffected son (i = 1, 2)
• The conditions of the two sons are
independent given θ (they are not identical
twins).
Pr(y1=0, y2=0 | θ=1) =(0.5)(0.5)=0.25
Pr(y1=0, y2=0 | θ=0) =(1)(1)=1
Posterior distribution
• Bayes Rule: Combines the information in the
data with the prior probability
y = (y1, y2) joint data
Posterior probability of interest:
p(θ=1|y)
= p(y |θ=1)p(θ=1) / {p(y|θ=1)p(θ=1) + p(y|θ=0)p(θ=0)}
= (0.25)(0.5) / {(0.25)(0.5) + (1)(0.5)} = 0.2
Conclusions
• It is clear that if the woman has unaffected
children it is less probable she is a carrier
• Bayes Rule provides a formal mechanism in
terms of prior and posterior odds.
• Prior odds= 0.5/0.5=1
• Likelihood ratio = 0.25/1= 0.25
• So posterior odds = (1) (0.25) = 0.25
• So P(θ=1|y)=0.2
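The calculation above can be checked numerically. The following is a minimal sketch using exact fractions; the function name `posterior_carrier` is an illustrative choice, not part of the lecture.

```python
from fractions import Fraction

def posterior_carrier(prior_carrier, lik_carrier, lik_noncarrier):
    """Bayes rule for a two-state parameter: returns P(theta = 1 | y)."""
    numerator = lik_carrier * prior_carrier
    denominator = numerator + lik_noncarrier * (1 - prior_carrier)
    return numerator / denominator

prior = Fraction(1, 2)               # P(theta = 1)
lik_carrier = Fraction(1, 2) ** 2    # P(y1 = 0, y2 = 0 | theta = 1)
lik_noncarrier = Fraction(1)         # P(y1 = 0, y2 = 0 | theta = 0)

# Probability form
post = posterior_carrier(prior, lik_carrier, lik_noncarrier)

# Odds form: posterior odds = prior odds * likelihood ratio
post_odds = (prior / (1 - prior)) * (lik_carrier / lik_noncarrier)

print(post)                          # 1/5
print(post_odds)                     # 1/4
print(post_odds / (1 + post_odds))   # 1/5, the same posterior probability
```

Both routes give P(θ=1|y) = 1/5 = 0.2, matching the slide.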
Easy sequential analysis
with Bayesian Analysis
• Suppose that the woman has a third son, also
unaffected.
• We do not repeat entire analysis
• Use previous posterior distribution as new prior
P(θ=1| y1, y2,y3)
= P(y3|θ=1)(0.2)/{P(y3|θ=1)(0.2)+ P(y3|θ=0)(0.8)}
= (0.5)(0.2)/{(0.5)(0.2) + (1)(0.8)}
= 0.111
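As a rough illustration of the sequential update, the sketch below processes the three unaffected sons one at a time and compares the result with a single batch update. The helper `update` is a hypothetical name introduced here.

```python
def update(prior_carrier, lik_carrier, lik_noncarrier):
    """One application of Bayes rule for the carrier indicator theta."""
    numerator = lik_carrier * prior_carrier
    return numerator / (numerator + lik_noncarrier * (1 - prior_carrier))

# Each unaffected son has likelihood 0.5 if theta = 1 and 1 if theta = 0.
p = 0.5                      # prior P(theta = 1)
history = []
for _ in range(3):           # three unaffected sons, one at a time
    p = update(p, 0.5, 1.0)  # previous posterior becomes the new prior
    history.append(p)

# Batch analysis: all three sons at once, starting from the original prior.
batch = update(0.5, 0.5 ** 3, 1.0)

print([round(h, 3) for h in history])  # [0.333, 0.2, 0.111]
print(round(batch, 3))                 # 0.111
```

The one-at-a-time and batch routes agree, which is the point of the slide: the previous posterior carries all the information needed from the earlier data.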
Probability as a measure of
uncertainty
Legitimate to ask in Bayesian Analysis
• Pr(Rain tomorrow)?
• Pr(Victory of Bangladesh in 20-20 match)?
• Pr(Heads if coin is tossed)?
• Pr(Average height of students within (4ft, 5ft))
of interest after data is acquired
• Pr(Sample average of students within (4ft,
5ft)) of interest before data is acquired
• Bayesian Analysis methods enable statements
to be made about the partial knowledge
available (based on data) concerning some
situation (unobservable, or as yet unobserved)
in a systematic way, using probability as the
measure
• Guiding principle: State of knowledge about
anything unknown is described by a
probability distribution
Usual Numerical Methods of Certainty
1. Symmetry/ Exchangeability Argument
• Probability = # favourable cases/
# possibilities
• (Coin tossing experiment)
• Involves assumptions, on physical condition of
toss, physical conditions about forces at work
• Dubious if we know a coin is either double-headed or double-tailed.
Usual Numerical Methods of Certainty
2. Frequency Argument
• Probability = Relative frequency obtained in a
very long sequence of experiments (assumed to be
identically performed and physically
independent of each other)
Other arguments in consideration
• Physical randomness induces uncertainty (we
speak of ‘likely’, ‘less likely’ etc events)
• Axiomatic approach: Decision theory related
• Coherence of bets: (define probability through
odds ratio)
A. Fundamental difficulties remain defining
odds
B. Ultimate test is success of application!
Summarizing inference using
simulation
• Simulation:
Forms a central part of Bayesian Analysis
→ Relative ease with which samples can be
drawn from even a complex, explicitly unknown
probability distribution
• For example:
• To estimate 95th percentile of the posterior
distribution of θ|y, draw a random sample of
size L (large), from p(θ|y) and use the 0.95Lth
order statistic.
• For most purposes L=1000 is adequate for
such estimates
• Generating values from a probability
distribution is often straightforward with
modern computing techniques
• This technique is based on (pseudo) random
number generators, which yield a deterministic
sequence that appears to have the same
properties as a sequence of independent
random draws from the uniform distribution on
[0,1]
Sampling using inverse cumulative
distribution function
• F is the c.d.f. of a random variable
• F⁻¹(U) = inf{x: F(x) ≥ U} will follow the distribution defined
by F(.), where U ~ Uniform(0, 1).
• If F is discrete, F⁻¹ can be tabulated
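A minimal sketch of inverse-c.d.f. sampling follows, for one continuous case (an exponential distribution, chosen here only because its inverse c.d.f. has a closed form) and one discrete, tabulated case. The function names and the example distributions are assumptions made for illustration.

```python
import math
import random

def sample_exponential(rate, u):
    """Inverse c.d.f. of Exponential(rate): F^{-1}(u) = -log(1 - u) / rate."""
    return -math.log(1.0 - u) / rate

def sample_discrete(values, probs, u):
    """Tabulated inverse c.d.f.: the smallest value x with F(x) >= u."""
    cumulative = 0.0
    for x, p in zip(values, probs):
        cumulative += p
        if cumulative >= u:
            return x
    return values[-1]  # guard against rounding in the final cumulative sum

rng = random.Random(0)  # fixed seed so the run is reproducible

# Continuous case: 20000 draws from Exponential(rate = 2), whose mean is 0.5
draws = [sample_exponential(2.0, rng.random()) for _ in range(20000)]
mean_draw = sum(draws) / len(draws)

# Discrete case: P(X=0)=0.2, P(X=1)=0.5, P(X=2)=0.3
x = sample_discrete([0, 1, 2], [0.2, 0.5, 0.3], rng.random())
```

`mean_draw` should land close to 1/rate = 0.5, and `x` takes the values 0, 1, 2 with the tabulated probabilities.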
Posterior samples as building blocks
of posterior distribution
• Posterior draws can be stored as an array, with
one row per simulation draw and one column per
parameter
• One can use this array to approximate the
posterior distribution
• One can also use it to find the posterior
distribution of, say, θ1/θ2 or log(θ3), by
adding columns computed from the existing
ones: extremely straightforward!
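The idea can be sketched as follows. The "posterior" draws here are hypothetical (independent Beta and Gamma draws chosen only to have something to work with); in a real analysis they would come from p(θ|y). The derived columns and the 0.95L-th order statistic follow the recipe described above.

```python
import math
import random

rng = random.Random(1)
L = 1000  # number of simulation draws

# Hypothetical posterior draws (illustration only): one row per draw.
draws = [
    {
        "theta1": rng.betavariate(3, 5),
        "theta2": rng.betavariate(4, 4),
        "theta3": rng.gammavariate(2.0, 1.0),
    }
    for _ in range(L)
]

# Add derived columns computed row by row from the existing columns.
for row in draws:
    row["ratio"] = row["theta1"] / row["theta2"]
    row["log_theta3"] = math.log(row["theta3"])

# Estimate the 95th percentile of theta1/theta2 by the 0.95L-th order statistic.
ratios = sorted(row["ratio"] for row in draws)
k = (95 * L + 99) // 100      # ceil(0.95 * L) using integer arithmetic
q95 = ratios[k - 1]
```

Any other function of the parameters can be summarized the same way, without further analytic work.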
Single - parameter models
• Consider some fundamental and widely used
one dimensional models—the binomial,
normal, Poisson, and exponential etc
• We shall discuss important concepts and
computational methods for Bayesian data
analysis
Estimating a probability from
binomial data
• Sequence of Bernoulli trials; data y1 ,…, yn ,
each of which is either 0 or 1 (n fixed).
Exchangeability implies likelihood depends
only on sum of yi (y).
• Provides a relatively simple but important
example
• Parallels the very first published Bayesian
analysis by Thomas Bayes in 1763
Proportion of female births
• 200 years ago it was established that the
proportion of female births in European
populations was less than 0.5
• This century interest has focused on factors
that may influence the gender ratio.
• The currently accepted value of the proportion
of female births in very large European-race
populations is 0.485.
• Define the parameter θ to be the proportion of
female births
• Alternative way of reporting this parameter is
as a ratio of male to female birth rates
• For Bayesian inference in the binomial model, we
must specify a prior distribution for θ.
• For simplicity assume the prior to be
Uniform(0,1)
• Bayes rule implies that
p(θ|y) ∝ θ^y (1-θ)^(n-y)
• In single-parameter problems, this allows
immediate graphical presentation of the
posterior distribution.
• Since p(θ|y) is a density and should integrate to 1, the
normalizing constant can be worked out.
• The posterior distribution is recognizable as a beta
distribution
• θ|y ~ Beta(y+1, n-y+1)
• In analyzing the binomial model, Pierre-Simon
Laplace (1749–1827) also used the uniform prior
distribution.
• His first serious application was to estimate the proportion
of female births in a population.
• A total of 241,945 girls and 251,527 boys were born in Paris
from 1745 to 1770.
• Laplace used a normal approximation and showed that
P(θ ≥ 0.5 | y = 241,945, n = 251,527 + 241,945)
≈ 1.15 × 10^(−42)
• So he was ‘morally certain’ that θ < 0.5.
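The order of magnitude of Laplace's result can be checked with a short calculation. This sketch assumes the Beta(y+1, n−y+1) posterior from the uniform prior and approximates it by a normal distribution with the same mean and standard deviation; it is not a reconstruction of Laplace's own derivation, so the leading digits need not match his figure exactly.

```python
import math

girls, boys = 241945, 251527
a, b = girls + 1, boys + 1   # Beta(y + 1, n - y + 1) posterior parameters

post_mean = a / (a + b)
post_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Normal approximation to P(theta >= 0.5 | y)
z = (0.5 - post_mean) / post_sd
tail = 0.5 * math.erfc(z / math.sqrt(2))   # upper tail of a standard normal

print(f"posterior mean = {post_mean:.4f}")
print(f"posterior sd   = {post_sd:.6f}")
print(f"z = {z:.2f}, tail probability = {tail:.2e}")
```

The threshold 0.5 lies more than 13 posterior standard deviations above the posterior mean, so the tail probability comes out on the order of 10^(−42).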
Lecture 2: Bayesian Inference
and Data Analysis
Overview
• Prediction in the Binomial example
• Justification of the Uniform prior in Binomial case
• Prior distributions – more discussion and an example
• Hyperparameters, hyperpriors
• Hierarchical models
• Posterior distribution as a compromise between prior
distribution and data.
• Graphical and Numerical Summaries
• Posterior probability intervals (or credible intervals)
• Normal example with unknown mean and known variance
• Central Limit Theorem in the Bayesian Context
• Large sample properties and results
Prediction in the Binomial Example
• In the binomial example with the uniform prior
distribution, the prior predictive distribution
(marginal of y) can be evaluated explicitly
• Marginal distribution of y:
p(y=i) = 1/(n+1) for i=0,1,…, n
• All values of y are equally likely, a priori .
• For posterior prediction, we might be more
interested in the outcome of one new trial,
rather than another set of n new trials.
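The uniform prior predictive distribution can be verified exactly. The sketch below uses the standard identity ∫ θ^i (1−θ)^(n−i) dθ = i!(n−i)!/(n+1)! for integer i; the function name and the choice n = 10 are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, factorial

def prior_predictive(i, n):
    """P(y = i) under a Uniform(0, 1) prior on theta.

    P(y = i) = C(n, i) * integral of theta^i (1 - theta)^(n - i) over [0, 1],
    and the integral equals i! (n - i)! / (n + 1)!.
    """
    beta_integral = Fraction(factorial(i) * factorial(n - i), factorial(n + 1))
    return comb(n, i) * beta_integral

n = 10
probs = [prior_predictive(i, n) for i in range(n + 1)]
print(probs[0], probs[5], probs[10])  # 1/11 1/11 1/11
```

Every outcome y = 0, 1, …, n has prior predictive probability 1/(n+1).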
Prediction in the Binomial example
• Letting y_tilde denote the result of a new trial,
exchangeable with the first n:
Pr(y_tilde = 1 | y) = ∫ θ p(θ|y) dθ = E(θ|y) = (y+1)/(n+2)
• This result, based on the uniform prior
distribution, is known as ‘Laplace’s law of
succession.’
Justification of Uniform prior in
Binomial problem
• Bayes: The resulting marginal p(y) is uniform
over {0, 1, …, n}. The justification is appealing
because it involves only the observable quantities y and n.
• Laplace: Insufficient information about θ
→ justified by a flat distribution. This argument
is often followed in Bayesian model building.
Interpretation: prior distributions
1. In the population interpretation, the prior distribution
represents a population of possible parameter values, from
which the θ of current interest has been drawn.
• Probability of failure in a new industrial process: there is no
perfectly relevant population
2. In the more subjective state of knowledge interpretation,
the guiding principle is that we must express our knowledge
(and uncertainty) about θ as if its value could be thought of as
a random realization from the prior distribution.
Prior: Constraints and
Flexibility
• Prior distribution should include all plausible
values of θ
• Prior need not be realistically concentrated
around the true value…
• …because often the information about θ
contained in the data will far outweigh any
reasonable prior specification.
Posterior distribution as compromise
between data and prior information
• The process of Bayesian inference involves passing from a
prior distribution, p( θ ), to a posterior distribution, p( θ |y),
→ natural to expect that some general relations might hold
between these two distributions.
1. E (θ) =E(E(θ |y))
the prior mean of θ is the average of all possible posterior
means over the distribution of possible data
2. Var (θ) = E(Var(θ|y)) + Var(E(θ|y)),
the posterior variance is on average smaller than the prior
variance, by an amount that depends on the variation in
posterior means over the distribution of possible data.
Posterior distribution → compromise
between the prior and the data
• In the binomial example with the uniform prior
distribution, the prior mean is ½
• The posterior mean, (y+1)/(n+2), is a compromise between
the prior mean ½ and the sample proportion y/n
• Clearly the prior mean has a smaller and smaller role as
the size of the data sample increases.
• This is a very general feature of Bayesian inference
• The posterior distribution is centered at a point that
represents a compromise between the prior information
and the data
• The compromise is controlled to a greater extent by
the data as the sample size increases.
Displaying & summarizing posterior
inference
• Graphical displays useful
• Eg. Histograms, boxplots
• Contour plots, scatterplots in multiparameter
problems
• Numerical summaries also desirable
• Summaries of location are the mean, median, and
mode(s)
• Variation is commonly summarized by the
standard deviation, IQR, other quantiles
Posterior Interval Summaries in
Bayesian Inference
1. A 100(1 −α)% central posterior interval :
Range of values above and below which lies
exactly 100(α /2)% of the posterior probability
• For simple models (Binomial, Normal,
Poisson, etc), posterior intervals can be
computed directly from c.d.f. (use standard
computer functions)
Posterior Interval Summaries in
Bayesian Inference
2. Highest posterior density (HPD) interval: the region
of values that contains 100(1−α)% of the posterior
probability, also has the characteristic that the density
within the region is never lower than that outside.
• HPD region is identical to a central posterior interval if
the posterior distribution is unimodal and symmetric.
• In general, we prefer the central posterior interval to the
HPD region because the former has a direct
interpretation as the posterior α/2 and 1−α/2 quantiles,
is invariant to one-to-one transformations of the
estimand, and is usually easier to compute
A special case: comparison of central
probability interval and a HPD interval
Prior – categorization (Andrew Gelman)
(1) Prior distributions giving numerical information that is crucial to
estimation of the model. This would be a traditional informative
prior, which might come from a literature review or explicitly from
an earlier data analysis.
(2) Prior distributions that are not supplying any controversial
information but are strong enough to pull the data away from
inappropriate inferences that are consistent with the likelihood. This
might be called a weakly informative prior.
(3) Prior distributions that are uniform, or nearly so, and basically
allow the information from the likelihood to be interpreted
probabilistically. These are noninformative priors, or maybe, in
some cases, weakly informative.
Noninformative priors
• "Non-informative prior distribution: A prior
distribution which is non-commital about a
parameter, for example, a uniform distribution."
-Everitt (1998)
Improper Priors
• A ‘prior distribution’ whose integral over
the parameter space is infinite
• Eg. Assume a constant prior for the Normal mean.
• Many authors (Lindley, 1973; De Groot, 1937; Kass
and Wasserman, 1996) warn against the danger of over-interpreting
those priors since they are not probability
densities.
• As long as it yields a proper posterior distribution
Bayesian methodology can be carried out
• Improper priors have been proved to be limits of data
adaptive proper priors –Akaike (JRSSB, 1980)
Jeffreys prior
• Named after Harold Jeffreys, this is a non-informative (objective) prior
distribution on the parameter space that is proportional to the square
root of the determinant of the Fisher information:
p(θ) ∝ √det(I(θ))
• It has the key feature that it is invariant under reparameterization of
the parameter vector: if φ = f(θ), then p(φ) ∝ √det(I(φ))
• Sometimes the Jeffreys prior cannot be normalized, and thus one
must use an improper prior. For example, the Jeffreys prior for the
distribution mean is uniform over the entire real line in the case of
a Gaussian distribution of known variance.
Informative priors
Conjugacy: Binomial example
• Likelihood of the parametric form
p(y|θ) ∝ θ^a (1-θ)^b (Binomial family)
• Thus, if the prior has the same form, so will the posterior.
p(θ) ∝ θ^(α-1) (1-θ)^(β-1) (Beta family)
• If the prior and posterior distributions belong to the same
parametric family/functional form, then we get conjugacy.
• Beta and Binomial families are said to be conjugate
families.
• Other examples: Poisson and Gamma, Normal (mean) and
Normal, Normal (variance) and Inverse Gamma etc.
Informative (Beta) prior (continued)
• p(θ | y) ∝ θ^y (1-θ)^(n-y) θ^(α-1) (1-θ)^(β-1)
= θ^(y+α-1) (1-θ)^(n-y+β-1),
so θ | y ~ Beta(y+α, n-y+β)
• E(θ | y) = (y+α)/(n+α+β) ≈ y/n as n→∞
• Var(θ | y) = (y+α)(n-y+β) / {(n+α+β)² (n+α+β+1)}
= O(1/n) as n→∞
• As n becomes large the effect of the prior
diminishes, and the posterior distribution shrinks!
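A small sketch of the conjugate update, using the Beta(y+α, n−y+β) posterior and its mean and variance. The data are hypothetical: the sample proportion is held at 0.3 while n grows, and the Beta(8, 2) prior (mean 0.8) is deliberately far from it.

```python
def beta_binomial_posterior(y, n, alpha, beta):
    """Posterior Beta(y + alpha, n - y + beta), with its mean and variance."""
    a_post = alpha + y
    b_post = beta + n - y
    total = a_post + b_post
    mean = a_post / total
    var = a_post * b_post / (total ** 2 * (total + 1))
    return a_post, b_post, mean, var

# Hypothetical data: sample proportion fixed at 0.3, prior mean 0.8.
results = {}
for n in (10, 100, 10000):
    y = 3 * n // 10
    results[n] = beta_binomial_posterior(y, n, alpha=8, beta=2)
    a_post, b_post, mean, var = results[n]
    print(f"n={n:>5}: Beta({a_post}, {b_post}), mean={mean:.4f}, var={var:.2e}")
```

With n = 10 the posterior mean (0.55) sits between the prior mean and the sample proportion; by n = 10000 it is essentially y/n and the variance has shrunk roughly in proportion to 1/n.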
Basic justification of conjugate
priors
• Similar to that for using standard models (such as Binomial and
Normal) for the likelihood:
1. Easy to understand the results, which can often be put in analytic
form
2. They are often a good approximation
3. They simplify computations.
• Also, they are useful as building blocks for more complicated
models, including in many dimensions, where conjugacy is typically
impossible.
• For these reasons, conjugate models can be good starting points; for
example, mixtures of conjugate families can sometimes be useful
when simple conjugate distributions are not reasonable
Hyperparameters
• Parameters of prior distributions are
called hyperparameters, to distinguish them
from parameters of the model of the underlying
data.
• Eg. We use the Beta(α, β) to model the
distribution of the parameter p of a Binomial
distribution Binomial(n, p)
• p is a parameter of the underlying system
• α and β are parameters of the prior distribution
(beta distribution), hence hyperparameters.
Hyperpriors
• A hyperprior is a prior distribution on
a hyperparameter
• They arise particularly in the use of conjugate
priors.
Purpose of Hyperpriors
1. To express uncertainty about the hyperparameter. Assuming fixed
hyperparameters is rigid; making them random allows the data to choose the
hyperparameters, and lets the ‘data speak’.
2. By using a hyperprior, the prior distribution itself becomes a mixture distribution:
a weighted average of the various prior distributions (over different
hyperparameters), with the hyperprior being the weighting.
This adds additional possible distributions (beyond the parametric family one is
using), because parametric families of distributions are generally not convex
sets – as a mixture density is a convex combination of distributions, it will in
general lie outside the family.
For instance, a mixture of two normal distributions is generally not a normal distribution:
if one takes different means (sufficiently distant) and mixes 50% of each, one
obtains a bimodal distribution, which is thus not normal. In fact, the convex hull
of normal distributions is dense in all distributions, so in some cases you can
approximate a given prior arbitrarily closely by using a family with a suitable
hyperprior.
Purpose of Hyperpriors
3. Dynamical system
A hyperprior is a distribution on the space of possible hyperparameters. If
one is using conjugate priors, then this space is preserved by moving to
posteriors – thus as data arrives, the distribution changes, but remains on
this space: as data arrives, the distribution evolves as a dynamical
system (each point of hyperparameter space evolving to the updated
hyperparameters), over time converging, just as the prior itself converges.
4. Ideal for hierarchical / multilevel models where hierarchy arises as a
natural phenomenon and for information sharing with many sources of data
Hierarchical/Multilevel model
• Generalization of linear and generalized linear
modeling in which regression coefficients are
themselves given a model, whose parameters
are also estimated from data.
Results using noninformative priors
• Many simple Bayesian analyses based on noninformative
prior distributions give similar results to standard non-Bayesian
approaches (for example, the posterior t interval
for the normal mean with unknown variance).
• The extent to which a noninformative prior distribution can
be justified as an objective assumption depends on
the amount of information available in the data; as
the sample size n increases, the influence of the
prior distribution on posterior inferences decreases.
Informative Nonconjugate prior
distributions
• For more complex problems, conjugacy may
not be possible
• Although they can make interpretations of
posterior inferences less transparent
and computation more difficult,
nonconjugate prior distributions
do not pose any new conceptual problems.
Example: estimating the probability
of a female birth given placenta
previa
• Placenta previa is an unusual condition of pregnancy in which the
placenta is implanted low in the uterus
• An early study concerning the gender of placenta
previa births in Germany found that of a total of 980
births, 437 were female.
• How much evidence does this provide for the claim
that the proportion of female births in the population
of placenta previa births is less than 0.485, the
proportion of female births in the general population?
Posterior summary using Uniform
prior
• Posterior distribution is Beta(438, 544).
• Posterior mean of θ is 0.446
• Posterior standard deviation is 0.016
• Posterior median is 0.446
• Central 95% posterior interval is
[0.415, 0.477].
• This 95% posterior interval matches, to three decimal
places, the interval that would be obtained by using a
normal approximation with the calculated posterior
mean and standard deviation.
Check same summary using
simulations
• Simulate 1000 iid draws from the
Beta(438, 544) posterior distribution
• 2.5th and 97.5th percentiles give central 95%
posterior interval [0.415, 0.476]
• median of the 1000 draws from the posterior
distribution is 0.446
• The sample mean and standard deviation of the
1000 draws are 0.445 and 0.016
Draws from the posterior distribution of
(a) the probability of female birth, θ;
(b) the logit transform, logit(θ);
(c) the male-to-female gender ratio, (1 − θ)/θ
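The simulation check can be sketched with the standard library alone. The seed and the simple order-statistic percentile rule are choices made here, so the numbers will differ slightly from those on the slide, but they should agree to about two decimal places.

```python
import math
import random
import statistics

rng = random.Random(2024)   # arbitrary seed, for reproducibility only
draws = sorted(rng.betavariate(438, 544) for _ in range(1000))

def percentile(sorted_xs, p):
    """Order-statistic percentile without interpolation."""
    k = max(0, math.ceil(p * len(sorted_xs)) - 1)
    return sorted_xs[k]

interval = (percentile(draws, 0.025), percentile(draws, 0.975))
median = statistics.median(draws)
mean = statistics.fmean(draws)
sd = statistics.stdev(draws)

# Transformed quantities come directly from the same draws.
logit_draws = [math.log(t / (1 - t)) for t in draws]
ratio_draws = [(1 - t) / t for t in draws]   # male-to-female ratio

print(f"mean={mean:.3f}, sd={sd:.3f}, median={median:.3f}")
print(f"95% interval for theta: [{interval[0]:.3f}, {interval[1]:.3f}]")
print(f"median male-to-female ratio: {statistics.median(ratio_draws):.2f}")
```

Because the logit and ratio transforms are monotone, their posterior intervals can be read off the transformed draws in the same way.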
Sensitivity to prior specification
α/(α+β)   α+β   E(θ|y)   95% posterior interval for θ
0.500       2   0.446    [0.415, 0.477]
0.485       2   0.446    [0.415, 0.477]
0.485       5   0.446    [0.415, 0.477]
0.485      10   0.446    [0.415, 0.477]
0.485      20   0.447    [0.416, 0.478]
0.485     100   0.450    [0.420, 0.479]
0.485     200   0.453    [0.424, 0.481]
*Interpret α/(α+β) as the prior center and α+β as the prior number of
observations (a large value implies the prior is concentrated)
Discussion
• The first row corresponds to uniform prior
• The lower the row, the more concentrated is
the prior distribution towards 0.485
• Only when α+β ≥ 100 (likened to prior
number of observations), the posterior
interval begins to change.
• Even then the intervals exclude the prior
mean
Alternative: Instead of conjugate prior,
use a ‘flat’ non-conjugate prior
(weakly informative prior)
(a) Prior density for θ in nonconjugate analysis of birth ratio example;
(b) histogram of 1000 draws from a discrete approximation to the posterior
density.
*Figures are plotted on different scales.
Details of the nonconjugate flat prior
(piecewise linear)
• Centered around 0.485 but is flat far away
from this value to admit the possibility that the
truth is far away.
• 40% of the probability mass is outside the
interval [0.385, 0.585]
• This prior distribution has mean 0.493 and
standard deviation 0.21, similar to the standard
deviation of a Beta distribution with α + β = 5.
Evaluating the posterior distribution
• The unnormalized posterior distribution is obtained at a grid of
θ values, (0.000, 0.001,…, 1.000), by multiplying the prior
density and the binomial likelihood at each point.
• Samples from the posterior distribution can be obtained by
normalizing the distribution on the discrete grid of θ values.
• Figure (b) is a histogram of 1000 draws from the discrete
posterior distribution.
• The posterior median is 0.448, 95% central posterior interval is
[0.419, 0.480].
• Because the prior distribution is overwhelmed by the data,
results match those in table based on Beta distributions.
• In the grid approach, we avoid grids that are too coarse and
distort a significant portion of the posterior mass.
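The grid computation can be sketched as below. The exact prior used in the lecture is not given in the text, so `prior_density` is an assumed reconstruction: flat at height 0.5 outside [0.385, 0.585] and rising linearly to 5.5 at 0.485. That shape puts 40% of its mass outside the interval and has mean about 0.493 and standard deviation about 0.21, consistent with the description above, but it remains an assumption.

```python
import math
import random

def prior_density(theta):
    """Assumed piecewise-linear prior (see the lead-in); not from the source."""
    base, peak, center, half_width = 0.5, 5.5, 0.485, 0.1
    dist = abs(theta - center)
    if dist >= half_width:
        return base
    return base + (peak - base) * (1 - dist / half_width)

y, n = 437, 980
grid = [i / 1000 for i in range(1001)]   # 0.000, 0.001, ..., 1.000

def log_likelihood(theta):
    if theta <= 0.0 or theta >= 1.0:
        return -math.inf                 # likelihood is zero at the endpoints
    return y * math.log(theta) + (n - y) * math.log(1 - theta)

# Unnormalized log posterior on the grid, then normalize on the discrete grid.
log_post = [math.log(prior_density(t)) + log_likelihood(t) for t in grid]
shift = max(log_post)                    # subtract the max to avoid underflow
weights = [math.exp(lp - shift) for lp in log_post]
total = sum(weights)
posterior = [w / total for w in weights]

def grid_quantile(p):
    """Smallest grid point whose cumulative posterior mass reaches p."""
    cumulative = 0.0
    for t, w in zip(grid, posterior):
        cumulative += w
        if cumulative >= p:
            return t
    return grid[-1]

median = grid_quantile(0.5)
interval = (grid_quantile(0.025), grid_quantile(0.975))

# 1000 draws from the discrete approximation, as in the histogram.
rng = random.Random(0)
samples = rng.choices(grid, weights=posterior, k=1000)
```

Working on the log scale and subtracting the maximum before exponentiating keeps the very small likelihood values from underflowing to zero across the whole grid.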
Estimating the mean of a normal
distribution with known variance
• The normal distribution is fundamental to most
statistical modeling.
• CLT helps to justify using the normal likelihood
in many statistical problems, as an approximation
to a less analytically convenient actual likelihood.
• Also, even when the normal distribution does not
itself provide a good model fit, it can be useful as
a component of a more complicated model
involving Student-t or finite mixture distributions.
• For now, we simply work through the Bayesian
results assuming the normal model is correct
Normal model with multiple
observations, variance known
• A sample of independent and identically distributed
observations y =( y 1 , … , y n ) is available.
• The posterior density is
Posterior distribution also Normal
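For reference, the standard conjugate result can be written out as follows. This is a sketch that assumes a prior θ ~ N(μ0, τ0²) and a known data variance σ²; the symbols μ0, μn, τn² and ȳ (the sample mean) are notation assumed here to match the τ0² and σ² used in the remarks.

```latex
p(\theta \mid y) \;\propto\; p(\theta) \prod_{i=1}^{n} p(y_i \mid \theta)
\;\propto\; \exp\!\left( -\frac{1}{2} \left[ \frac{(\theta - \mu_0)^2}{\tau_0^2}
  + \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \theta)^2 \right] \right)

\theta \mid y \sim \mathrm{N}(\mu_n, \tau_n^2), \qquad
\mu_n = \frac{\dfrac{1}{\tau_0^2}\,\mu_0 + \dfrac{n}{\sigma^2}\,\bar{y}}
             {\dfrac{1}{\tau_0^2} + \dfrac{n}{\sigma^2}}, \qquad
\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}
```

The posterior precision is the sum of the prior precision and the data precision, and the posterior mean weights the prior mean and the sample mean by those precisions.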
Remarks about posterior results
• Posterior variance approaches σ²/n as n→∞ or
as the prior variance τ0² →∞
• Posterior mean is a weighted average of the prior
mean and the sample mean
• Incidentally, the same result is obtained by
adding information for the data points y1, y2,
…, yn one point at a time, using the posterior
distribution at each step as the prior distribution
for the next
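The equivalence of batch and one-at-a-time updating can be checked numerically. This sketch assumes the standard conjugate formulas for a N(μ0, τ0²) prior with known data variance σ² (posterior precision = 1/τ0² + n/σ²; posterior mean = precision-weighted average of μ0 and the sample mean). The observations and prior values are hypothetical.

```python
def normal_update(mu0, tau0_sq, ys, sigma_sq):
    """Posterior mean and variance of theta for a normal model, known variance."""
    n = len(ys)
    ybar = sum(ys) / n
    precision = 1 / tau0_sq + n / sigma_sq          # posterior precision
    mu_n = (mu0 / tau0_sq + n * ybar / sigma_sq) / precision
    return mu_n, 1 / precision

ys = [4.1, 5.3, 4.8, 5.9, 5.0]          # hypothetical observations
mu0, tau0_sq, sigma_sq = 0.0, 10.0, 1.0  # hypothetical prior and data variance

# All data at once
batch = normal_update(mu0, tau0_sq, ys, sigma_sq)

# One point at a time: each posterior becomes the next prior
mu, tau_sq = mu0, tau0_sq
for y in ys:
    mu, tau_sq = normal_update(mu, tau_sq, [y], sigma_sq)
sequential = (mu, tau_sq)

print(batch)
print(sequential)
```

The two results agree up to floating-point rounding, and the posterior mean lies between the prior mean and the sample mean.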
CLT in Bayesian context
((θ - E(θ | y) ) /√Var(θ | y) | y) → N(0,1)
as n→∞
• Often used to justify approximating the posterior
distribution with a normal distribution.
• For the binomial parameter θ , the normal distribution
is a more accurate approximation in practice if we
transform θ to the logit scale…
• …that is, perform inference for log( θ /(1 − θ ))
instead of θ
• The parameter space expands from [0, 1] to (−∞, ∞),
which is more fitting for a normal approximation.
Large sample results
• The large-sample results are not actually
necessary for performing Bayesian data
analysis… but are often useful as
approximations and as tools for
understanding.
Normal approximations to the
posterior distribution
• A Taylor series expansion of log p(θ|y)
centered at the posterior mode, (where mode
can be a vector and is assumed to be in the
interior of the parameter space), gives
Posterior distribution converges to…
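In their standard form, the expansion and the resulting approximation read as follows. This is a sketch in which θ̂ denotes the posterior mode and I(θ) the observed information; both symbols are notation assumed here. The linear term vanishes because the gradient of log p(θ|y) is zero at the mode.

```latex
\log p(\theta \mid y) = \log p(\hat{\theta} \mid y)
  + \frac{1}{2} (\theta - \hat{\theta})^{T}
    \left[ \frac{d^2}{d\theta^2} \log p(\theta \mid y) \right]_{\theta = \hat{\theta}}
    (\theta - \hat{\theta}) + \cdots

p(\theta \mid y) \approx \mathrm{N}\!\left( \hat{\theta},\; [I(\hat{\theta})]^{-1} \right),
\qquad
I(\theta) = -\frac{d^2}{d\theta^2} \log p(\theta \mid y)
```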
Remark:
• For a finite sample size n, the normal
approximation is typically more accurate for
conditional and marginal distributions
of components of θ than for the full
joint distribution.
Posterior Consistency
• If the true data distribution is included in the parametric
family—that is, if f(y)=p(y|θ0) for some θ0—then,
in addition to asymptotic normality, the property
of consistency holds: the posterior distribution converges
to a point mass at the true parameter value, θ0, as n→∞.
• When the true distribution is not included in the
parametric family, there is no longer a true value θ0,
but its role in the theoretical result is replaced by a
value θ0 that makes the model distribution, p(y|θ),
closest to the true distribution, in a technical sense
involving Kullback-Leibler information.
Large sample correspondence between
Bayesian and Frequentist methods
• When n→∞, a 95% central posterior interval
for θ will cover the true value 95% of the time
under repeated sampling with any fixed true θ.
When asymptotic results fail
• The asymptotic results fail in situations in which the prior
distribution has an impact on the posterior inference, even in the
limit of infinite sample sizes.
• Usually when likelihood is flat
• For example when the model is unidentifiable (there
exist two distinct parameters yielding same likelihood)
E.g. f(x) = p g(x) + (1−p) h(x), where 0<p<1 and (p, g, h) are
unknown
• Number of parameters increases with the sample size
• Prior distributions that exclude point of convergence
or yield improper posterior distributions
Lecture 3: Bayesian Inference
and Data Analysis
Dept. of Statistics,
Rajshahi University, Rajshahi
Anandamayee Majumdar
Visiting Scientist, University of North Texas
School of Public Health, USA;
Professor, University of Suzhou, China.
Overview
• Model checking and improvement
• Test quantities, P-values
• Starting the computation in Bayesian Inference
• Simulation of potentially observable quantities
• Posterior simulation methods: The Gibbs Sampler,
Rejection sampling, Metropolis Hastings algorithm
• Bivariate Unit Normal Example with Bivariate Jumping
kernel
• Recommended strategies for simulation.
• Advanced techniques for Monte Carlo simulation
Model checking and improvement
• Checking the model is crucial to statistical analysis.
• Bayesian inferences assume the whole structure of a probability
model and can yield misleading inferences when the model is poor.
• A good Bayesian analysis, therefore, should include some check of
the adequacy of the fit of the model to the data and the plausibility
of the model for the purposes for which the model will be used.
• This is sometimes discussed as a problem of sensitivity to the prior
distribution,
• but in practice the likelihood model is typically just as suspect;
• throughout, we use ‘model’ to encompass:
1. The sampling distribution, 2. the prior distribution,
3. Hierarchical structure, and 4. issues such as which
explanatory variables have been included in a regression.
Judging model flaws by their
practical implications
• Model TRUE or FALSE – is not the question
• Relevant question: ‘Do the model’s
deficiencies have a noticeable effect on the
substantive inferences?’
• Do the inferences from the model make sense?
If not, this suggests a potential for creating a more
accurate probability model for the parameters and
data collection process.
• Is the model consistent with data? Posterior
predictive checking
If the model fits, then replicated data generated
under the model should look similar to observed
data.
This is really a self-consistency check: an observed
discrepancy can be due to model misfit or chance.
Basic technique for checking fit
• Draw simulated values from the posterior
predictive distribution of replicated data and
compare these samples to the observed data.
• Any systematic differences between the
simulations and the data indicate potential
failings of the model
Example: Newcomb’s speed of light
measurements
• 66 measurements on the speed of light
• modeled as N(μ, σ²), with a non-informative
uniform prior distribution on (μ, log σ).
• However, the lowest of Newcomb’s
measurements look like outliers
• Question: Could the extreme measurements
have reasonably come from a normal
distribution?
Simulating replicated data using
posterior sample
• y: the observed data
• θ: the vector of parameters
• yrep replicated data that could have been
observed (if x is the explanatory variable
vector for y, then it is also for yrep)
Smallest observation of Newcomb’s speed of light
data (the vertical line at the left of the graph),
compared to the smallest observations from each of
the 20 posterior predictive simulated datasets
Test quantity, or discrepancy
measure
• T(y, θ), is a scalar summary of parameters and
data that is used as a standard when comparing
data to predictive simulations.
• Test quantities play the role in Bayesian
model checking that test statistics play in
classical testing.
• Test quantity depends on both data and
parameter.
P-values or tail area probabilities
• Classical p-value
• Bayesian p-value
• The probability is taken over the joint posterior
distribution, p(θ, yrep|y):
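The two definitions, whose formulas did not survive extraction, are conventionally written as below. This is the standard form used in posterior predictive checking, offered as a sketch rather than as the slide's exact notation. For the classical p-value the probability is taken over yrep with θ fixed; for the Bayesian p-value it is taken over both θ and yrep.

```latex
p_{C} = \Pr\bigl( T(y^{\mathrm{rep}}) \ge T(y) \mid \theta \bigr)

p_{B} = \Pr\bigl( T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta) \mid y \bigr)
      = \iint I_{\,T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta)}\;
        p(y^{\mathrm{rep}} \mid \theta)\, p(\theta \mid y)\; dy^{\mathrm{rep}}\, d\theta
```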
Example
• Consider a sequence of binary outcomes,
y1,…, yn,
• Modeled as n iid Bernoulli trials
• Uniform prior distribution on θ
• suppose the observed data are, in order, 1, 1,
0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0.
• The observed autocorrelation is evidence that
the model is flawed.
• T=number of switches between 0’s and 1’s
• To simulate yrep under the model, we first draw
θ from its Beta(8, 14) posterior distribution
• Then draw yrep as 20 independent Bernoulli(θ) outcomes, and repeat
both steps to obtain 10,000 independent replications
• P-value = 0.028
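The check above can be reproduced in a few lines. The following is a minimal sketch, assuming NumPy; the seed and the helper name `n_switches` are illustrative. The slide does not say which tail its p-value refers to, so both tail areas are computed and no attempt is made to match the reported figure exactly.

```python
import numpy as np

# Observed binary sequence, in order, as given on the slide.
y = np.array([1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

def n_switches(seq):
    """Test quantity T: number of switches between 0 and 1 along the last axis."""
    return np.sum(np.diff(seq, axis=-1) != 0, axis=-1)

rng = np.random.default_rng(1)
n_rep = 10_000

# Uniform prior + Bernoulli likelihood gives a Beta(1 + #ones, 1 + #zeros) posterior.
a_post = 1 + int(y.sum())
b_post = 1 + int(len(y) - y.sum())

# Draw theta from the posterior, then one replicated dataset per theta.
theta = rng.beta(a_post, b_post, size=n_rep)
y_rep = rng.binomial(1, theta[:, None], size=(n_rep, len(y)))

t_obs = n_switches(y)
t_rep = n_switches(y_rep)
p_upper = np.mean(t_rep >= t_obs)   # Pr(T(yrep) >= T(y) | y)
p_lower = np.mean(t_rep <= t_obs)   # Pr(T(yrep) <= T(y) | y)
print(a_post, b_post, t_obs, p_upper, p_lower)
```

The observed sequence switches far less often than most replications, which is the sense in which the independence assumption looks doubtful.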
Other posterior predictive checks for
model fit/model comparison
• Use partial data for building model
• Use the rest of the data, to make prediction
(usually in the case when each value of y is
associated with covariate and/or coordinate
information)
• Compare prediction coverage for different
competing models
Computation in Bayesian Inference
• We refer to the distribution to be simulated as the target
distribution and denote it as p(θ|y)
• Assume target density p(θ|y) can be easily
computed for any value of θ, up to a
proportionality constant involving only y
• Starting point -- Crude estimation of
parameters. Often reliable, and easy to
compute.
Use posterior simulations to make
inferences about
1. Predictive quantities : ŷk ~ p(ŷ |θk)
or ŷk ~ p(ŷ |Xk, θk) in the regression model
2. Or replications yrep,k ~ p(y |θk)
or yrep,k ~ p(y|X, θk) in the regression model
How many simulation draws are
needed?
• In general, few simulations are needed to
estimate posterior medians, probabilities near
0.5, and low-dimensional summaries
• More simulations needed for extreme
quantiles, posterior means, probabilities of rare
events, and higher-dimensional summaries.
• Simulation draws typically 100 - 2000
Direct simulation
• In simple nonhierarchical Bayesian models, it
is often easy to draw from the posterior
distribution directly, especially if conjugate
prior distributions have been assumed.
• In complex problems, we sometimes simulate
hyperparameters (marginally), and then
conditionally, the intermediate parameters
Rejection sampling
• Suppose we want to obtain a single random draw from a density p(θ|y). We
require a positive function g(θ) defined for all θ for which p(θ|y)>0 that
has the following properties:
1. We are able to draw random samples from the probability density
proportional to g. It is not required that g(θ) integrate to 1, but g(θ) must
have a finite integral.
2. there must be some known constant M for which p(θ|y)/g(θ)≤M for all θ.
• The rejection sampling algorithm proceeds in two steps:
1. Sample θ at random from the probability density
proportional to g(θ).
2. With probability p(θ|y)/(Mg(θ)), accept θ as a draw
from p. If the drawn θ is rejected, repeat step 1.
Illustration of rejection sampling. The top curve is an
approximation function, Mg(θ), and the bottom curve is the target
density, p(θ|y). As required, Mg(θ)≥p(θ|y) for all θ. The vertical line
indicates a single random draw θ from the density proportional to g.
The probability that a sampled draw θ is accepted is the ratio of the
height of the lower curve to the height of the higher curve at the
value θ
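The two steps of the algorithm can be sketched as follows. The target here is a Beta(8, 14) density and g is the uniform density on (0, 1); both are illustrative choices made for this example, not part of the lecture. With this g, the bound M is the height of the target at its mode.

```python
import math
import numpy as np

A, B = 8, 14  # illustrative target: Beta(8, 14)
LOG_NORM = math.lgamma(A + B) - math.lgamma(A) - math.lgamma(B)

def target_density(theta):
    """Beta(A, B) density on (0, 1), playing the role of p(theta|y)."""
    return np.exp(LOG_NORM + (A - 1) * np.log(theta) + (B - 1) * np.log1p(-theta))

def g(theta):
    """Approximating function: uniform density on (0, 1)."""
    return 1.0

# p/g is largest at the mode of the target, so M is the density there.
MODE = (A - 1) / (A + B - 2)
M = float(target_density(MODE))

def rejection_sample(n_draws, rng):
    draws = []
    n_proposed = 0
    while len(draws) < n_draws:
        theta = rng.uniform(0.0, 1.0)        # step 1: draw from g
        n_proposed += 1
        accept_prob = target_density(theta) / (M * g(theta))
        if rng.uniform() < accept_prob:      # step 2: accept with prob p/(M g)
            draws.append(theta)
    return np.array(draws), n_proposed

rng = np.random.default_rng(2)
samples, n_proposed = rejection_sample(5000, rng)
acceptance_rate = len(samples) / n_proposed
print(samples.mean(), acceptance_rate, 1.0 / M)
```

Because both the target and g integrate to 1 in this example, the long-run acceptance rate is 1/M, which shows why a poorly matched g (large M) makes the method inefficient.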
Posterior simulation
• The most commonly used Markov chain simulation
methods:
1. Gibbs sampler
2. Metropolis-Hastings algorithm
Markov Chain Monte Carlo (MCMC)
simulation
• Definition: A Markov chain is a sequence of
random variables θ1, θ2,…, for which, for any t,
the distribution of θt given all previous θ’s
depends only on the most recent value, θt−1.
• Key: Create a Markov process whose stationary
distribution is the specified p(θ|y), and run the
simulation long enough that the distribution of the
current draws is close enough to this stationary
distribution.
• For any specific p(θ|y), a variety of Markov
chains with the desired property can be
constructed
The Gibbs Sampler
• Suppose the parameter vector θ has been
divided into d components or subvectors,
θ= (θ1,…, θd).
In iteration t, for j = 1,…, d, we simulate
θ_j^t ~ p(θ_j | θ_{−j}^{t−1}, y),
where θ_{−j}^{t−1} represents all the components of θ,
except for θ_j, at their current values.
Bivariate Normal Example
• Suppose (y1, y2) ~ Bivariate Normal with mean (θ1, θ2) and
covariance matrix [[1, ρ], [ρ, 1]]
We note that the Gibbs sampler takes on the
following steps:
θ1 | θ2, y ~ N(y1 + ρ(θ2 − y2), 1 − ρ²)
θ2 | θ1, y ~ N(y2 + ρ(θ1 − y1), 1 − ρ²).
• Four independent sequences of the Gibbs
sampler for a bivariate normal distribution
with fixed correlation ρ=0.8, with
overdispersed starting points indicated by solid
squares.
(a) First 10 iterations, showing the component-by-component
updating of the Gibbs iterations.
(b) After 500 iterations, the sequences have reached
approximate convergence.
(c) Iterates from the second halves of the sequences.
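The alternating conditional draws for this example can be sketched as below, using ρ = 0.8, four sequences, and 500 iterations as in the figure. The observed data y = (0, 0) and the four starting points are assumptions made for illustration, since the slides do not state them for this example.

```python
import numpy as np

def gibbs_bivariate_normal(y, rho, theta_start, n_iter, rng):
    """Gibbs sampler for (theta1, theta2) given one bivariate normal observation y."""
    y1, y2 = y
    theta1, theta2 = theta_start
    cond_sd = np.sqrt(1.0 - rho**2)          # conditional standard deviation
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Update theta1 given the current theta2, then theta2 given the new theta1.
        theta1 = rng.normal(y1 + rho * (theta2 - y2), cond_sd)
        theta2 = rng.normal(y2 + rho * (theta1 - y1), cond_sd)
        draws[t] = theta1, theta2
    return draws

rng = np.random.default_rng(3)
starts = [(-2.5, -2.5), (-2.5, 2.5), (2.5, -2.5), (2.5, 2.5)]  # assumed, overdispersed
chains = [gibbs_bivariate_normal((0.0, 0.0), 0.8, s, 500, rng) for s in starts]

# Keep the second half of each sequence, as in panel (c).
kept = np.vstack([c[250:] for c in chains])
sample_corr = np.corrcoef(kept.T)[0, 1]
print(kept.mean(axis=0), sample_corr)
```

The retained draws should be centered near y with correlation close to ρ, which is a quick sanity check on the conditional distributions.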
The Metropolis algorithm
• The Metropolis algorithm is an adaptation of a random
walk that uses an acceptance/rejection rule to converge to
the specified target distribution.
1. Draw a starting point θ0, for which p(θ0|y)>0, from a
starting distribution p0(θ). Or we may simply choose
starting values dispersed around a crude approximate estimate.
2. For t=1, 2,…
(a) Sample a proposal θ* from a jumping distribution
at time t, Jt(θ*|θt−1). For the Metropolis algorithm,
Jt(θa|θb)=Jt(θb|θa) for all θa, θb, t
(b) Calculate the ratio of the densities,
r = \frac{p(\theta^{*} \mid y)}{p(\theta^{t-1} \mid y)}
(c) Set θt = θ* with probability min(r, 1), and θt = θt−1 otherwise.
The Metropolis-Hastings algorithm
• Same as Metropolis algorithm, except that the
jumping rule does not have to be symmetric
Jt(θa|θb)≠Jt(θb|θa) for some θa, θb, t
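• To correct for the asymmetry in the jumping rule,
the ratio r becomes:
r = \frac{p(\theta^{*} \mid y) / J_t(\theta^{*} \mid \theta^{t-1})}{p(\theta^{t-1} \mid y) / J_t(\theta^{t-1} \mid \theta^{*})}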
• Allowing asymmetric jumping rules can be
useful in increasing the speed of the random
walk.
Properties of a good jumping rule
• For any θ, it is easy to sample from J(θ*|θ).
• It is easy to compute the ratio r.
• Each jump goes a reasonable distance in the
parameter space (otherwise the random walk
moves too slowly).
• The jumps are not rejected too frequently
(otherwise the random walk wastes too much
time standing still).
Difficulties of inference from
iterative simulation
• If iterations have not proceeded long enough, the
simulations may be grossly unrepresentative of
the target distribution
• Even when the simulations have reached
approximate convergence, the early iterations
still are influenced by the starting
approximation rather than the target distribution
• Within-sequence correlation; aside from any
convergence issues, simulation inference from
correlated draws is generally less precise than
from the same number of independent draws
Solutions
• Design the simulation runs to allow effective monitoring of
convergence, in particular by simulating multiple sequences with
starting points dispersed throughout parameter space
• Monitor the convergence of all quantities of interest by comparing
variation between and within simulated sequences until ‘within’
variation roughly equals ‘between’ variation
• If the simulation efficiency is unacceptably low (in the sense of
requiring too much real time on the computer to obtain approximate
convergence of posterior inferences for quantities of interest), the
algorithm can be altered
• To diminish the effect of the starting distribution, discard the first
half of each sequence and focus attention on the second half (burn
in fraction depends on the problem at hand)
• Once approximate convergence has been reached, thin the
sequences by keeping every kth simulation draw from each
sequence and discard the rest.
Monitoring convergence of each
scalar estimand
• Suppose we have simulated m parallel sequences,
each of length n (after discarding the first half of the
simulations).
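• For each scalar estimand ψ, we label the simulation
draws as ψij (i=1,…, n; j=1,…, m), and we compute B
and W, the between- and within-sequence variances:
B = \frac{n}{m-1} \sum_{j=1}^{m} (\bar{\psi}_{.j} - \bar{\psi}_{..})^2, where \bar{\psi}_{.j} = \frac{1}{n} \sum_{i=1}^{n} \psi_{ij} and \bar{\psi}_{..} = \frac{1}{m} \sum_{j=1}^{m} \bar{\psi}_{.j}
W = \frac{1}{m} \sum_{j=1}^{m} s_j^2, where s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\psi_{ij} - \bar{\psi}_{.j})^2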
Step 1
• Estimate var(ψ|y), the marginal posterior variance
of the estimand, by a weighted average of W and
B, namely
\widehat{\mathrm{var}}^{+}(\psi \mid y) = \frac{n-1}{n} W + \frac{1}{n} B
• This quantity overestimates the marginal
posterior variance assuming the starting
distribution is appropriately overdispersed, but
is unbiased under stationarity
Step 2
• ‘Within’ variance W should be an
underestimate of var(ψ|y) because the
individual sequences have not had time to
range over all of the target distribution
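• In the limit as n→∞, E(W) → var(ψ|y).
• Potential scale reduction is estimated by
Rhat = \sqrt{ \widehat{\mathrm{var}}^{+}(\psi \mid y) / W },
where \widehat{\mathrm{var}}^{+}(\psi \mid y) is the weighted average of W and B from Step 1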
• This declines to 1 as n→∞.
Monitoring convergence for the
entire distribution
• Once Rhat is near 1 for all scalar estimands of
interest, just collect the mn simulations from
the second halves of all the sequences together
and treat them as a sample from the target
distribution.
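The convergence diagnostic can be computed directly from an m × n array of draws. The sketch below uses the standard definitions of B, W, and the weighted variance estimate; the function name and the synthetic test sequences are illustrative assumptions.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Estimate Rhat for one scalar estimand.

    chains: array of shape (m, n), i.e. m parallel sequences of length n
    (after the first halves have been discarded).
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-sequence variance
    W = chains.var(axis=1, ddof=1).mean()      # within-sequence variance
    var_plus = (n - 1) / n * W + B / n         # weighted average of W and B
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(4)
mixed = rng.normal(size=(4, 1000))                           # sequences that agree
stuck = mixed + np.array([[-3.0], [-1.0], [1.0], [3.0]])     # sequences that do not overlap

rhat_mixed = potential_scale_reduction(mixed)
rhat_stuck = potential_scale_reduction(stuck)
print(rhat_mixed, rhat_stuck)
```

Well-mixed sequences give a value near 1, while sequences centered at different locations give a value well above 1, signalling that further simulation is needed.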
A small experimental dataset
• Coagulation time in seconds for blood drawn from 24
animals randomly allocated to four different diets.
• Different treatments have different numbers of
observations because the randomization was
unrestricted.
Diet    Measurements
A       62, 60, 63, 59
B       63, 67, 71, 64, 65, 66
C       68, 66, 71, 67, 68, 68
D       56, 62, 60, 61, 63, 64, 63, 59
Hierarchical Model and Priors in this
example
Under the hierarchical normal model
• Data yij, i=1,…, nj, j=1,…, J, are independently
normally distributed within each of J groups, with
means θj and common variance σ². The total number of
observations is n.
• The group means θj are assumed to follow a normal
distribution with unknown mean μ and variance τ².
• A uniform prior distribution is assumed for
(μ, log σ, τ)
• If we were to assign a uniform prior distribution to
log τ, the posterior distribution would be improper
Joint posterior density of all the
parameters
Starting points
• Choose over-dispersed starting points for
each parameter θj by simply taking random
points from the data yij from group j.
• Starting points for μ can be taken as the
average of the starting θj values.
• No starting values are needed for τ or σ as they
can be drawn as the first steps in the Gibbs
sampler
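Gibbs sampler
1. The conditional distributions for this model
all have simple conjugate forms. Conditional on y and the other
parameters, each θj has a normal distribution:
\theta_j \mid \mu, \sigma, \tau, y \sim N(\hat{\theta}_j, V_{\theta_j}), with
\hat{\theta}_j = \frac{\mu/\tau^2 + n_j \bar{y}_{.j}/\sigma^2}{1/\tau^2 + n_j/\sigma^2} and V_{\theta_j} = \frac{1}{1/\tau^2 + n_j/\sigma^2}
2. Conditional on y and the other parameters in
the model, μ has a normal distribution
with mean = \hat{\mu} = \frac{1}{J} \sum_{j=1}^{J} \theta_j
and variance = τ²/J
3. Again, conditional on y and the other parameters, σ² has a
scaled inverse-χ² distribution:
\sigma^2 \mid \theta, \mu, \tau, y \sim \mathrm{Inv}\text{-}\chi^2(n, \hat{\sigma}^2), with \hat{\sigma}^2 = \frac{1}{n} \sum_{j=1}^{J} \sum_{i=1}^{n_j} (y_{ij} - \theta_j)^2
4. The conditional posterior distribution of τ² is also scaled inverse-χ²:
\tau^2 \mid \theta, \mu, \sigma, y \sim \mathrm{Inv}\text{-}\chi^2(J-1, \hat{\tau}^2), with \hat{\tau}^2 = \frac{1}{J-1} \sum_{j=1}^{J} (\theta_j - \mu)^2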
Posterior quantiles
Estimand   2.5%   25%    median   75%    97.5%   Rhat
θ1         58.9   60.6   61.3     62.1   63.5    1.01
θ2         63.9   65.3   65.9     66.6   67.7    1.01
θ3         66.0   67.1   67.8     68.5   69.5    1.01
θ4         59.5   60.6   61.1     61.7   62.8    1.01
μ          56.9   62.2   63.9     65.5   73.4    1.04
σ          1.8    2.2    2.4      2.6    3.3     1.00
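A Gibbs sampler for the coagulation data can be sketched as follows. It assumes the hierarchical normal model stated earlier with a uniform prior on (μ, log σ, τ), for which the conditional distributions are normal for θj and μ and scaled inverse-χ² for σ² and τ². The helper names, the seed, and the choice of four sequences of 2000 iterations are illustrative assumptions; the sketch is not the code that produced the table.

```python
import numpy as np

diets = {
    "A": [62, 60, 63, 59],
    "B": [63, 67, 71, 64, 65, 66],
    "C": [68, 66, 71, 67, 68, 68],
    "D": [56, 62, 60, 61, 63, 64, 63, 59],
}
groups = [np.array(v, dtype=float) for v in diets.values()]
J = len(groups)
n_j = np.array([len(g) for g in groups])
n = int(n_j.sum())
ybar_j = np.array([g.mean() for g in groups])

def scaled_inv_chi2(nu, s_sq, rng):
    """One draw from the scaled inverse-chi-square distribution Inv-chi2(nu, s_sq)."""
    return nu * s_sq / rng.chisquare(nu)

def gibbs_chain(n_iter, rng):
    # Starting points: a random observation from each group; mu is their average.
    theta = np.array([rng.choice(g) for g in groups])
    mu = theta.mean()
    out = np.empty((n_iter, J + 3))  # columns: theta_1..theta_J, mu, sigma, tau
    for t in range(n_iter):
        # sigma^2 and tau^2 are drawn first, so they need no starting values.
        ss = sum(np.sum((g - th) ** 2) for g, th in zip(groups, theta))
        sigma_sq = scaled_inv_chi2(n, ss / n, rng)
        tau_sq = scaled_inv_chi2(J - 1, np.sum((theta - mu) ** 2) / (J - 1), rng)
        # theta_j: precision-weighted combination of mu and the group mean.
        precision = 1.0 / tau_sq + n_j / sigma_sq
        theta_hat = (mu / tau_sq + n_j * ybar_j / sigma_sq) / precision
        theta = rng.normal(theta_hat, np.sqrt(1.0 / precision))
        # mu: normal with mean average(theta_j) and variance tau^2 / J.
        mu = rng.normal(theta.mean(), np.sqrt(tau_sq / J))
        out[t] = np.concatenate([theta, [mu, np.sqrt(sigma_sq), np.sqrt(tau_sq)]])
    return out

rng = np.random.default_rng(5)
chains = [gibbs_chain(2000, rng) for _ in range(4)]
kept = np.vstack([c[1000:] for c in chains])   # discard the first half of each sequence
medians = np.median(kept, axis=0)
print(np.round(medians, 1))
```

The posterior medians from such a run can be compared with the median column of the table; small differences are expected from simulation variability.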
Bivariate unit Normal Example
• Suppose (y1, y2) ~ Bivariate N((θ1, θ2), I), where I is the 2×2 identity matrix
• Data (y1, y2) = (0, 0)
• Target density p(θ|y) = N(θ|0, I), where I is the 2×2
identity matrix
• The jumping distribution is also bivariate normal,
centered at the current iteration and scaled to 1/5 the
size:
Jt(θ*|θt−1) = N(θ*|θt−1, 0.2² I). (symmetric)
• Density ratio r = N(θ*|0, I)/N(θt−1|0, I).
• Five independent sequences of a Markov chain simulation for
The bivariate unit normal distribution, with overdispersed starting points
indicated by solid squares.
(a) After 50 iterations, the sequences are still far from convergence (due to
inefficient jumping rule, deliberately chosen to demonstrate the mixing)
(b) After 1000 iterations, the sequences are nearer to convergence.
(c) 3rd figure shows the iterates from the second halves of the sequences.
The points in the 3rd figure have been jittered so that steps in which the
random walks stood still are not hidden.
Bivariate unit normal density with
bivariate normal jumping kernel
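The example above can be sketched in code as follows: the target is N(0, I), the jumping distribution is N(θt−1, 0.2² I), and five sequences are run for 1000 iterations. The starting points and the seed are assumptions made for illustration. The ratio r is computed on the log scale, which is a common numerical precaution rather than something the slides prescribe.

```python
import numpy as np

def log_target(theta):
    """Log of the unnormalized N(0, I) target density."""
    return -0.5 * np.dot(theta, theta)

def metropolis(theta_start, n_iter, jump_sd, rng):
    theta = np.asarray(theta_start, dtype=float)
    draws = np.empty((n_iter, theta.size))
    n_accept = 0
    for t in range(n_iter):
        # (a) Symmetric proposal centered at the current value.
        proposal = theta + jump_sd * rng.standard_normal(theta.size)
        # (b) Log of the density ratio r.
        log_r = log_target(proposal) - log_target(theta)
        # (c) Accept with probability min(r, 1); otherwise stay put.
        if np.log(rng.uniform()) < log_r:
            theta = proposal
            n_accept += 1
        draws[t] = theta
    return draws, n_accept / n_iter

rng = np.random.default_rng(6)
starts = [(-2.5, -2.5), (-2.5, 2.5), (2.5, -2.5), (2.5, 2.5), (0.0, 0.0)]  # assumed
results = [metropolis(s, 1000, 0.2, rng) for s in starts]

kept = np.vstack([d[500:] for d, _ in results])   # second halves of the sequences
mean_accept = np.mean([a for _, a in results])
print(kept.shape, mean_accept)
```

With such a small jumping scale nearly every proposal is accepted, yet each sequence explores the target slowly, which is why the slides call this jumping rule inefficient.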
Recommended strategy for posterior
simulation
1. Start off with crude estimates and possibly a mode-based
approximation to the posterior distribution.
2. If possible, simulate from the posterior distribution directly or
sequentially, starting with hyperparameters and then moving to the
main parameters.
3. Most likely, the best approach is to set up a Markov chain simulation
algorithm. The updating can be done one parameter at a time or with
parameters in batches (as is often convenient in regressions and
hierarchical models).
4. For parameters (or batches of parameters) whose conditional posterior
distributions have standard forms, use the Gibbs sampler.
5. For parameters whose conditional distributions do not have
standard forms, use Metropolis jumps. Tune the scale of each
jumping distribution so that acceptance rates are near 20% (when
altering a vector of parameters) or 40% (when altering one parameter
at a time).
6. Construct a transformation so that the parameters are
approximately independent—this will speed the convergence of the
Gibbs sampler. Or add auxiliary variables (data augmentation) to
speed up the computation.
7. Start the Markov chain simulations with parameter values taken from the
crude estimates or mode-based approximations, with noise added
so they are over-dispersed with respect to the target distribution.
8. Run multiple Markov chains and monitor the mixing of the sequences.
Run until approximate convergence appears to have been reached.
9. If Rhat is near 1 for all scalar estimands of interest,
summarize inference about the posterior distribution by
treating the set of all iterates from the second half of the
simulated sequences as an identically distributed sample
from the target distribution. At this point, simulations
from the different sequences can be mixed.
10. Compare the posterior inferences from the Markov
chain simulation to the approximate distribution used to
start the simulations. If they are not close with respect to
locations and approximate distributional shape, check for
errors before believing that the Markov chain simulation
has produced a better answer.
Advanced techniques for Markov
Chain Simulation
Hybrid Monte Carlo methods: For moving rapidly through the
target distribution
• Borrows ideas from physics to add auxiliary variables that suppress the local random
walk behavior in the Metropolis algorithm
• Thus allowing it to move much more rapidly through the target distribution. For
each component θj in the target space, hybrid Monte Carlo adds a 'momentum'
variable φj
• Both θ and φ are then updated together in a new Metropolis algorithm, in which the
jumping distribution for θ is determined largely by φ
• Roughly, the momentum gives the expected distance and direction of the jump in θ,
so that successive jumps tend to be in the same direction, allowing the simulation to
move rapidly where possible through the space of θ.
• The MH accept/reject rule stops the movement when it reaches areas of low
probability, at which point the momentum changes until the jumping
can continue.
• Hybrid Monte Carlo is also called Hamiltonian Monte Carlo because it is related to
the model of Hamiltonian dynamics in physics.
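The joint update of θ and φ can be sketched with the usual leapfrog discretization. Everything specific here is an assumption made for illustration: the target is a bivariate unit normal, the momentum has unit variance, and the step size, number of leapfrog steps, and starting point are arbitrary.

```python
import numpy as np

def neg_log_target(theta):
    """Potential energy: minus the log of the unnormalized N(0, I) density."""
    return 0.5 * np.dot(theta, theta)

def grad_neg_log_target(theta):
    return theta

def hmc_step(theta, step_size, n_leapfrog, rng):
    phi = rng.standard_normal(theta.size)            # fresh momentum draw
    h_old = neg_log_target(theta) + 0.5 * np.dot(phi, phi)

    # Leapfrog: half step for momentum, alternating full steps, final half step.
    theta_new = theta.copy()
    phi_new = phi - 0.5 * step_size * grad_neg_log_target(theta_new)
    for i in range(n_leapfrog):
        theta_new = theta_new + step_size * phi_new
        if i < n_leapfrog - 1:
            phi_new = phi_new - step_size * grad_neg_log_target(theta_new)
    phi_new = phi_new - 0.5 * step_size * grad_neg_log_target(theta_new)

    # Accept or reject the whole trajectory based on the change in total energy.
    h_new = neg_log_target(theta_new) + 0.5 * np.dot(phi_new, phi_new)
    if np.log(rng.uniform()) < h_old - h_new:
        return theta_new, True
    return theta, False

rng = np.random.default_rng(7)
theta = np.array([2.5, -2.5])                         # assumed starting point
draws, accepted = [], 0
for _ in range(2000):
    theta, ok = hmc_step(theta, step_size=0.3, n_leapfrog=10, rng=rng)
    draws.append(theta)
    accepted += ok
draws = np.array(draws)[1000:]                        # discard the first half
accept_rate = accepted / 2000
print(draws.mean(axis=0), draws.var(axis=0), accept_rate)
```

Each proposal travels along a trajectory of several steps in a consistent direction, so successive draws are much less correlated than with small random-walk jumps.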
Advanced techniques for Markov
Chain Simulation
Langevin methods
• The basic symmetric-jumping Metropolis algorithm is simple to apply but
has the disadvantage of wasting many of its jumps by going into
low-probability areas of the target distribution.
• For example, optimal Metropolis algorithms in high dimensions have
acceptance rates below 25%, which means that, in the best case,
over 3/4 of the jumps are wasted.
• A potential improvement is afforded by the Langevin algorithm, in which
each jump is associated with a small shift in the direction of the gradient of
the logarithm of the target density, thus moving the jumps toward higher
density regions of the distribution.
• This jumping rule is not symmetric.
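A minimal sketch of a Langevin jump is given below, again for a bivariate unit normal target chosen only for illustration. The proposal mean is shifted by half the squared step size times the gradient of the log target, and because this rule is asymmetric the Metropolis-Hastings form of the ratio r is used. The step size, starting point, and function names are assumptions.

```python
import numpy as np

def log_target(theta):
    """Log of the unnormalized N(0, I) target density."""
    return -0.5 * np.dot(theta, theta)

def grad_log_target(theta):
    return -theta

def log_jump_density(theta_to, theta_from, step):
    """Log density, up to a constant, of a Langevin jump theta_from -> theta_to."""
    mean = theta_from + 0.5 * step**2 * grad_log_target(theta_from)
    diff = theta_to - mean
    return -0.5 * np.dot(diff, diff) / step**2

def langevin_step(theta, step, rng):
    # Proposal: shift along the gradient of the log density, then add noise.
    proposal = (theta + 0.5 * step**2 * grad_log_target(theta)
                + step * rng.standard_normal(theta.size))
    # Metropolis-Hastings ratio corrects for the asymmetric jumping rule.
    log_r = (log_target(proposal) - log_jump_density(proposal, theta, step)
             - log_target(theta) + log_jump_density(theta, proposal, step))
    if np.log(rng.uniform()) < log_r:
        return proposal, True
    return theta, False

rng = np.random.default_rng(8)
theta = np.array([2.5, -2.5])          # assumed starting point
draws, accepted = [], 0
for _ in range(5000):
    theta, ok = langevin_step(theta, step=1.0, rng=rng)
    draws.append(theta)
    accepted += ok
draws = np.array(draws)[1000:]
accept_rate = accepted / 5000
print(draws.mean(axis=0), draws.var(axis=0), accept_rate)
```

The gradient shift pulls proposals toward higher-density regions, so fewer jumps are wasted than with a symmetric random walk of comparable scale.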
Thank you for your attention!