Principles of Bayesian Inference
Sudipto Banerjee (1) and Andrew O. Finley (2)
(1) Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.
(2) Department of Forestry & Department of Geography, Michigan State University, Lansing, Michigan, U.S.A.
October 30, 2013

Basic probability
We like to think of "Probability" formally as a function that assigns a real number to an event.
Let E and F be any events that might occur under an experimental setup H. Then a probability function P(E) is defined by:
P1. 0 ≤ P(E) ≤ 1 for all E.
P2. P(H) = 1.
P3. P(E ∪ F) = P(E) + P(F) whenever E and F cannot both occur. Usually we consider E ∩ F = ∅ and say they are mutually exclusive.
Basic probability
If E is an event, we denote its complement ("NOT" E) by Ē or E^c. Since E ∩ Ē = ∅, we have from P3:
P(Ē) = 1 − P(E).
Conditional probability of E given F:
P(E | F) = P(E ∩ F) / P(F).
Sometimes we write EF for E ∩ F.
Compound probability rule: write the above as
P(E | F)P(F) = P(EF).

Basic probability
Independent events: E and F are said to be independent if the occurrence of one does not change the probability of the other. Then P(E | F) = P(E) and we have the following multiplication rule:
P(EF) = P(E)P(F).
If P(E | F) = P(E), then P(F | E) = P(F).
Marginalization: we can express P(E) by "marginalizing" over the event F:
P(E) = P(EF) + P(EF̄)
     = P(F)P(E | F) + P(F̄)P(E | F̄).
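A quick numerical check of these rules in R; the probabilities below are made up purely for illustration:

# Hypothetical probabilities, chosen only to illustrate the rules above
p_F      <- 0.30    # P(F)
p_E_g_F  <- 0.60    # P(E | F)
p_E_g_Fc <- 0.10    # P(E | F-bar)

# Compound probability rule: P(EF) = P(E | F) P(F)
p_EF <- p_E_g_F * p_F                           # 0.18

# Marginalization: P(E) = P(F) P(E | F) + P(F-bar) P(E | F-bar)
p_E <- p_F * p_E_g_F + (1 - p_F) * p_E_g_Fc     # 0.25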
Bayes Theorem
Observe that:
P(EF) = P(E | F)P(F) = P(F | E)P(E)
⇒ P(F | E) = P(F)P(E | F) / P(E).
This is Bayes' Theorem, named after the Reverend Thomas Bayes – an English clergyman with a passion for gambling!
Often this is written as:
P(F | E) = P(F)P(E | F) / [P(F)P(E | F) + P(F̄)P(E | F̄)].

Bayesian principles
Two hypotheses: H0: the excess relative risk for thrombosis for women taking a pill exceeds 2; H1: it is under 2.
Data collected from a controlled trial show a relative risk of x = 3.6.
The probability, or likelihood, of the data under each hypothesis is P(x | H); H is H0 or H1.
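Continuing the same made-up numbers from the sketch above, Bayes' Theorem is a one-line computation in R:

# Same hypothetical inputs as in the previous sketch
p_F      <- 0.30    # P(F)
p_E_g_F  <- 0.60    # P(E | F)
p_E_g_Fc <- 0.10    # P(E | F-bar)

# Bayes' Theorem: P(F | E) = P(F) P(E | F) / [ P(F) P(E | F) + P(F-bar) P(E | F-bar) ]
p_F_g_E <- p_F * p_E_g_F / (p_F * p_E_g_F + (1 - p_F) * p_E_g_Fc)
p_F_g_E    # 0.72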
Bayesian principles
Likelihood and Prior
Bayes Theorem updates the probability of each hypothesis:
P(H | x) = P(H)P(x | H) / P(x); H ∈ {H0, H1}.
Marginal probability:
P(x) = P(H0)P(x | H0) + P(H1)P(x | H1)
Re-express:
P(H | x) ∝ P(H)P(x | H); H ∈ {H0, H1}.

Bayesian principles
Bayes theorem in English:
Posterior distribution = (prior × likelihood) / Σ(prior × likelihood)
The denominator is summed over all possible hypotheses.
It is a fixed normalizing factor that is (usually) extremely difficult to evaluate: the curse of dimensionality.
Markov Chain Monte Carlo to the rescue!
WinBUGS software: www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml
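The H0/H1 update above can be sketched the same way; the prior weights and the two likelihood values P(x | H0) and P(x | H1) below are hypothetical stand-ins, not numbers from the slides:

# Hypothetical prior weights and likelihood values for the two hypotheses
prior      <- c(H0 = 0.5, H1 = 0.5)     # P(H0), P(H1)
likelihood <- c(H0 = 0.40, H1 = 0.15)   # P(x | H0), P(x | H1): illustrative only

# Posterior = prior * likelihood / P(x), with P(x) the sum of prior * likelihood
posterior <- prior * likelihood / sum(prior * likelihood)
posterior    # H0 ~ 0.73, H1 ~ 0.27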
Bayesian principles
An Example
A clinician is interested in π: the proportion of children between ages 5–9 in a particular population having asthma symptoms.
The clinician has prior beliefs about π, summarized as "Prior support" and "Prior weight" in the table below.
Data: a random sample of 15 children shows 2 having asthma symptoms. The likelihood is obtained from the Binomial distribution:
(15 choose 2) π²(1 − π)^13
Note: (15 choose 2) is a "constant" and *can* be ignored in the computations (though it is accounted for in the table below).

Bayesian principles
Prior support   Prior weight   Likelihood   Prior × Likelihood   Posterior
0.10            0.10           0.267        0.027                0.098
0.12            0.15           0.287        0.043                0.157
0.14            0.25           0.290        0.072                0.265
0.16            0.25           0.279        0.070                0.255
0.18            0.15           0.258        0.039                0.141
0.20            0.10           0.231        0.023                0.084
Total           1.00                        0.274                1.000

Posterior: obtained by dividing Prior × Likelihood by the normalizing constant 0.274.
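The table can be reproduced (up to rounding) with a few lines of R, using dbinom for the Binomial likelihood of 2 cases out of 15:

# Prior support points and prior weights from the table
pi_vals <- c(0.10, 0.12, 0.14, 0.16, 0.18, 0.20)
prior   <- c(0.10, 0.15, 0.25, 0.25, 0.15, 0.10)

# Binomial likelihood for 2 children with symptoms among 15
lik <- dbinom(2, size = 15, prob = pi_vals)

# Posterior = prior * likelihood divided by the normalizing constant
prior_lik <- prior * lik
posterior <- prior_lik / sum(prior_lik)

round(cbind(pi_vals, prior, lik, prior_lik, posterior), 3)
sum(prior_lik)    # ~ 0.274, the normalizing constant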
Bayesian principles
Classical statistics: model parameters are fixed and unknown.
A Bayesian thinks of parameters as random, and thus having distributions (just like the data). We can thus think about unknowns for which no reliable frequentist experiment exists, e.g. θ = proportion of US men with untreated prostate cancer.
A Bayesian writes down a prior guess for parameter(s) θ, say p(θ), and then combines this with the information provided by the observed data y to obtain the posterior distribution of θ, which we denote by p(θ | y).
All statistical inferences (point and interval estimates, hypothesis tests) then follow from posterior summaries. For example, the posterior means/medians/modes offer point estimates of θ, while the quantiles yield credible intervals.

Bayesian principles
The key to Bayesian inference is "learning" or "updating" of prior beliefs. Thus, posterior information ≥ prior information.
Is the classical approach wrong? That may be a controversial statement, but it certainly is fair to say that the classical approach is limited in scope.
The Bayesian approach expands the class of models and easily handles:
repeated measures
unbalanced or missing data
nonhomogeneous variances
multivariate data
– and many other settings that are precluded (or much more complicated) in classical settings.
Basics of Bayesian inference
We start with a model (likelihood) f(y | θ) for the observed data y = (y1, ..., yn)′ given unknown parameters θ (perhaps a collection of several parameters).
Add a prior distribution p(θ | λ), where λ is a vector of hyperparameters.
The posterior distribution of θ is given by:
p(θ | y, λ) = p(θ | λ) × f(y | θ) / p(y | λ) = p(θ | λ) × f(y | θ) / ∫ f(y | θ)p(θ | λ) dθ.
We refer to this formula as Bayes Theorem.

Basics of Bayesian inference
Calculations (numerical and algebraic) are usually required only up to a proportionality constant. We therefore write the posterior as:
p(θ | y, λ) ∝ p(θ | λ) × f(y | θ).
If λ are known/fixed, the above represents the desired posterior. If, however, λ are unknown, we assign a prior, p(λ), and seek:
p(θ, λ | y) ∝ p(λ)p(θ | λ)f(y | θ).
The proportionality constant does not depend upon θ or λ:
1 / p(y) = 1 / ∫∫ p(λ)p(θ | λ)f(y | θ) dλ dθ.
The above represents a joint posterior from a hierarchical model. The marginal posterior distribution for θ is:
p(θ | y) = ∫ p(θ, λ | y) dλ ∝ ∫ p(λ)p(θ | λ)f(y | θ) dλ.
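As a concrete (toy) illustration of these formulas, the sketch below evaluates the joint posterior p(θ, λ | y) on a grid and marginalizes over λ by summation; the model y ~ N(θ, 1), θ | λ ~ N(λ, 1), λ ~ N(0, 1) and the data value are assumptions made only for this example:

# Toy hierarchical model (assumed for illustration): y ~ N(theta, 1),
# theta | lambda ~ N(lambda, 1), lambda ~ N(0, 1); one observation y = 1.5
y <- 1.5
theta  <- seq(-4, 6, length.out = 200)
lambda <- seq(-4, 6, length.out = 200)

# Unnormalized joint posterior p(theta, lambda | y) = p(lambda) p(theta | lambda) f(y | theta)
joint <- outer(theta, lambda,
               function(t, l) dnorm(l, 0, 1) * dnorm(t, l, 1) * dnorm(y, t, 1))
joint <- joint / sum(joint)      # normalize over the grid

# Marginal posterior p(theta | y): sum over the lambda dimension
marg_theta <- rowSums(joint)
theta[which.max(marg_theta)]     # posterior mode of theta, roughly 1.0 here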
A simple example: Normal data and normal priors
Example: consider a single data point y from a Normal distribution: y ~ N(θ, σ²); assume σ is known.
f(y | θ) = N(y | θ, σ²) = (1 / (σ√(2π))) exp(−(1 / (2σ²))(y − θ)²)
θ ~ N(µ, τ²), i.e. p(θ) = N(θ | µ, τ²); µ and τ² are known.
Posterior distribution of θ:
p(θ | y) ∝ N(θ | µ, τ²) × N(y | θ, σ²)
        = N(θ | [(1/τ²)µ + (1/σ²)y] / (1/σ² + 1/τ²), 1 / (1/σ² + 1/τ²))
        = N(θ | [σ²/(σ² + τ²)]µ + [τ²/(σ² + τ²)]y, σ²τ²/(σ² + τ²)).

A simple example: Normal data and normal priors
Interpret: the posterior mean is a weighted mean of the prior mean and the data point.
The direct estimate is shrunk towards the prior.
What if you had n observations instead of one in the earlier set up? Say y = (y1, ..., yn)′, where yi ~ iid N(θ, σ²).
ȳ is a sufficient statistic for θ; ȳ ~ N(θ, σ²/n).
Posterior distribution of θ:
p(θ | y) ∝ N(θ | µ, τ²) × N(ȳ | θ, σ²/n)
        = N(θ | [(1/τ²)µ + (n/σ²)ȳ] / (n/σ² + 1/τ²), 1 / (n/σ² + 1/τ²))
        = N(θ | [σ²/(σ² + nτ²)]µ + [nτ²/(σ² + nτ²)]ȳ, σ²τ²/(σ² + nτ²)).
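A small R sketch of this conjugate update; the values of µ, τ², σ², n and ȳ below are illustrative, and the same lines can be pointed at the weight exercise that follows:

# Illustrative inputs (not from the slides)
mu   <- 0;  tau2 <- 4      # prior mean and variance of theta
sig2 <- 1;  n    <- 10     # known data variance and sample size
ybar <- 2.3                # observed sample mean

# Posterior precision = prior precision + data precision
post_var  <- 1 / (1 / tau2 + n / sig2)
post_mean <- post_var * (mu / tau2 + n * ybar / sig2)
c(mean = post_mean, var = post_var)

# Equivalently: post_mean = (sig2 * mu + n * tau2 * ybar) / (sig2 + n * tau2)
# A 95% credible interval: qnorm(c(0.025, 0.975), post_mean, sqrt(post_var))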
A simple exercise using R
Consider the problem of estimating the current weight of a group of people. A sample of 10 people was taken and their average weight was calculated as ȳ = 176 lbs. Assume that the population standard deviation is known to be σ = 3. Assuming that the data y1, ..., y10 came from a N(θ, σ²) population, perform the following:
Obtain a 95% confidence interval for θ using classical methods.
Assume a prior distribution for θ of the form N(µ, τ²). Obtain 95% posterior credible intervals for θ for each of the cases: (a) µ = 176, τ = 8; (b) µ = 176, τ = 1000; (c) µ = 0, τ = 1000. Which case gives results closest to those obtained with the classical method?

Another simple example: The Beta-Binomial model
Example: let Y be the number of successes in n independent trials.
P(Y = y | θ) = f(y | θ) = (n choose y) θ^y (1 − θ)^(n−y)
Prior: p(θ) = Beta(θ | a, b):
p(θ) ∝ θ^(a−1)(1 − θ)^(b−1).
Prior mean: µ = a/(a + b); variance: ab/((a + b)²(a + b + 1)).
Posterior distribution of θ:
p(θ | y) = Beta(θ | a + y, b + n − y).
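A minimal sketch of the Beta-Binomial update; the prior parameters a, b and the data y, n below are illustrative choices, not values given in the slides:

# Illustrative prior and data
a <- 2; b <- 8       # Beta(a, b) prior on theta
y <- 2; n <- 15      # y successes in n trials

# Conjugate update: posterior is Beta(a + y, b + n - y)
a_post <- a + y
b_post <- b + n - y

a_post / (a_post + b_post)                   # posterior mean
qbeta(c(0.025, 0.975), a_post, b_post)       # 95% equal-tail credible interval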
Bayesian inference: point estimation
Point estimation is easy: simply choose an appropriate distribution summary: posterior mean, median or mode.
Mode: sometimes easy to compute (no integration, simply optimization), but often misrepresents the "middle" of the distribution – especially for one-tailed distributions.
Mean: easy to compute. It has the "opposite effect" of the mode – it chases tails.
Median: probably the best compromise in being robust to tail behaviour, although it may be awkward to compute as it needs to solve:
∫_{−∞}^{θ_median} p(θ | y) dθ = 1/2.

Bayesian inference: interval estimation
The most popular method of inference in practical Bayesian modelling is interval estimation using credible sets. A 100(1 − α)% credible set C for θ is a set that satisfies:
P(θ ∈ C | y) = ∫_C p(θ | y) dθ ≥ 1 − α.
The most popular credible set is the simple equal-tail interval estimate (qL, qU) such that:
∫_{−∞}^{qL} p(θ | y) dθ = α/2 = ∫_{qU}^{∞} p(θ | y) dθ.
Then clearly P(θ ∈ (qL, qU) | y) = 1 − α.
This interval is relatively easy to compute and has a direct interpretation: the probability that θ lies between qL and qU is 1 − α. The frequentist interpretation is extremely convoluted.
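Continuing the illustrative Beta(4, 21) posterior from the sketch above, the point and interval summaries described here are one-liners in R:

a_post <- 4; b_post <- 21     # illustrative Beta posterior from the earlier sketch

# Point estimates: posterior mean, median and mode
post_mean   <- a_post / (a_post + b_post)
post_median <- qbeta(0.5, a_post, b_post)
post_mode   <- (a_post - 1) / (a_post + b_post - 2)   # valid when a_post, b_post > 1

# 100(1 - alpha)% equal-tail credible interval (qL, qU)
alpha <- 0.05
ci <- qbeta(c(alpha / 2, 1 - alpha / 2), a_post, b_post)

c(mean = post_mean, median = post_median, mode = post_mode)
ci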
Sampling-based inference
Previous example: direct evaluation of the posterior probabilities. This is feasible only for simpler problems.
Modern Bayesian analysis: derive complete posterior densities, say p(θ | y), by drawing samples from that density. Samples are of the parameters themselves, or of functions of them.
If θ1, ..., θM are samples from p(θ | y), then densities are created by feeding them into a density plotter. Similarly, samples from f(θ), for some function f, are obtained by simply feeding the θi's to f(·).
In principle M can be arbitrarily large – it comes from the computer and depends only upon the time we have for analysis. Do not confuse this with the data sample size n, which is limited by experimental constraints.
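A sampling-based version of the same summaries, again using the illustrative Beta posterior (where direct sampling happens to be easy); with MCMC the draws would come from the sampler instead of rbeta:

set.seed(1)
M <- 10000                               # number of posterior draws; limited only by computing time
theta <- rbeta(M, 4, 21)                 # samples from the illustrative posterior p(theta | y)

# Density estimate and summaries computed from the samples
plot(density(theta), main = "Posterior of theta from samples")
quantile(theta, c(0.025, 0.5, 0.975))    # equal-tail interval and median

# Samples of a function of theta, e.g. the odds f(theta) = theta / (1 - theta)
odds <- theta / (1 - theta)
quantile(odds, c(0.025, 0.5, 0.975))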