Principles of Bayesian Inference

Sudipto Banerjee¹ and Andrew O. Finley²

¹ Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.
² Department of Forestry & Department of Geography, Michigan State University, Lansing, Michigan, U.S.A.

LWF, ZWFH, DVFFA, & DR IBS: Biometry workshop
October 30, 2013

Basic probability

We like to think of "Probability" formally as a function that assigns a real number to an event.

Let E and F be any events that might occur under an experimental setup H. Then a probability function P(E) is defined by:

P1: 0 ≤ P(E) ≤ 1 for all E.
P2: P(H) = 1.
P3: P(E ∪ F) = P(E) + P(F) whenever E and F cannot both occur. Usually we take E ∩ F = ∅ and say the events are mutually exclusive.
If E is an event, then we denote its complement ("NOT" E) by Ē or Eᶜ. Since E ∩ Ē = ∅, we have from P3:

P(Ē) = 1 − P(E).

Conditional probability of E given F:

P(E | F) = P(E ∩ F) / P(F).

Sometimes we write EF for E ∩ F.

Compound probability rule: write the above as

P(E | F)P(F) = P(EF).

Independent events: E and F are said to be independent if the occurrence of one does not change the probability of the other, i.e. P(E | F) = P(E). We then have the following multiplication rule:

P(EF) = P(E)P(F).

If P(E | F) = P(E), then P(F | E) = P(F).

Marginalization: we can express P(E) by "marginalizing" over the event F:

P(E) = P(EF) + P(EF̄)
     = P(F)P(E | F) + P(F̄)P(E | F̄).
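A quick Monte Carlo check of these rules in R, using two hypothetical events defined on a fair die roll (E: the roll is even; F: the roll is 1 or 2); the events and sample size are illustrative only:

```r
## Monte Carlo check of the conditional probability, marginalization and
## independence rules, using hypothetical events on a fair six-sided die.
set.seed(42)
roll <- sample(1:6, 100000, replace = TRUE)
isE <- roll %% 2 == 0   # event E: the roll is even
isF <- roll <= 2        # event F: the roll is 1 or 2

## Conditional probability: P(E | F) = P(E ∩ F) / P(F)
mean(isE & isF) / mean(isF)

## Marginalization: P(E) = P(F) P(E | F) + P(F̄) P(E | F̄)
mean(isE)
mean(isF) * (mean(isE & isF) / mean(isF)) +
  mean(!isF) * (mean(isE & !isF) / mean(!isF))

## These particular E and F happen to be independent, so P(EF) ≈ P(E) P(F)
mean(isE & isF)
mean(isE) * mean(isF)
```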
Bayes Theorem

Observe that:

P(EF) = P(E | F)P(F) = P(F | E)P(E)

⇒ P(F | E) = P(F)P(E | F) / P(E).

This is Bayes' Theorem, named after Reverend Thomas Bayes – an English clergyman with a passion for gambling!

Often this is written as:

P(F | E) = P(F)P(E | F) / [P(F)P(E | F) + P(F̄)P(E | F̄)].

Bayesian principles

Two hypotheses: H0: the excess relative risk for thrombosis for women taking a pill exceeds 2; H1: it is under 2.

Data collected from a controlled trial show a relative risk of x = 3.6.

The probability, or likelihood, of the data under each hypothesis is P(x | H); H is H0 or H1.
Bayes Theorem updates the probability of each hypothesis:

P(H | x) = P(H)P(x | H) / P(x);  H ∈ {H0, H1}.

Marginal probability:

P(x) = P(H0)P(x | H0) + P(H1)P(x | H1)

Reexpress:

P(H | x) ∝ P(H)P(x | H);  H ∈ {H0, H1}.

Likelihood and Prior

Bayes theorem in English:

Posterior distribution = (prior × likelihood) / Σ(prior × likelihood)

The denominator sums prior × likelihood over all possible hypotheses (more generally, over all values of the unknowns).

It is a fixed normalizing factor that is (usually) extremely difficult to evaluate: the curse of dimensionality.

Markov chain Monte Carlo to the rescue!

WinBUGS software: www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml
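As a minimal numerical illustration of the prior × likelihood update for the two hypotheses above (the prior probabilities and likelihood values are made up, since the slides give no numbers for P(H0), P(H1) or P(x | H)):

```r
## Updating two hypotheses with Bayes Theorem (all numbers hypothetical)
prior <- c(H0 = 0.5, H1 = 0.5)      # P(H0), P(H1): assumed prior beliefs
lik   <- c(H0 = 0.40, H1 = 0.10)    # P(x | H0), P(x | H1): assumed likelihoods

marginal  <- sum(prior * lik)       # P(x) = P(H0)P(x | H0) + P(H1)P(x | H1)
posterior <- prior * lik / marginal # P(H | x), H in {H0, H1}; sums to 1
posterior
```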
An Example

A clinician is interested in π: the proportion of children between ages 5 and 9 in a particular population having asthma symptoms.

The clinician has prior beliefs about π, summarized as "Prior support" and "Prior weight".

Data: a random sample of 15 children shows 2 having asthma symptoms. The likelihood is obtained from the Binomial distribution:

(15 choose 2) π²(1 − π)¹³

Note: (15 choose 2) is a "constant" and *can* be ignored in the computations (though it is accounted for in the table below).

Prior support   Prior weight   Likelihood   Prior × Likelihood   Posterior
0.10            0.10           0.267        0.027                0.098
0.12            0.15           0.287        0.043                0.157
0.14            0.25           0.290        0.072                0.265
0.16            0.25           0.279        0.070                0.255
0.18            0.15           0.258        0.039                0.141
0.20            0.10           0.231        0.023                0.084
Total           1.00                        0.274                1.000

Posterior: obtained by dividing each Prior × Likelihood entry by the normalizing constant 0.274.
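The table can be reproduced in a few lines of R, using the prior weights and the Binomial likelihood from the slide:

```r
## Discrete prior on pi, Binomial likelihood for 2 cases out of 15 children,
## posterior obtained by normalizing prior x likelihood.
pi_vals <- c(0.10, 0.12, 0.14, 0.16, 0.18, 0.20)   # prior support
prior   <- c(0.10, 0.15, 0.25, 0.25, 0.15, 0.10)   # prior weights

lik <- dbinom(2, size = 15, prob = pi_vals)        # choose(15, 2) * pi^2 * (1 - pi)^13
prior_x_lik <- prior * lik
posterior   <- prior_x_lik / sum(prior_x_lik)      # normalizing constant is about 0.274

round(data.frame(pi = pi_vals, prior, likelihood = lik, prior_x_lik, posterior), 3)
```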
Bayesian principles

Classical statistics: model parameters are fixed and unknown.

A Bayesian thinks of parameters as random, and thus having distributions (just like the data). We can thus think about unknowns for which no reliable frequentist experiment exists, e.g. θ = proportion of US men with untreated prostate cancer.

A Bayesian writes down a prior guess for the parameter(s) θ, say p(θ), and then combines it with the information provided by the observed data y to obtain the posterior distribution of θ, which we denote by p(θ | y).

All statistical inferences (point and interval estimates, hypothesis tests) then follow from posterior summaries. For example, the posterior means/medians/modes offer point estimates of θ, while the quantiles yield credible intervals.

The key to Bayesian inference is "learning" or "updating" of prior beliefs. Thus, posterior information ≥ prior information.

Is the classical approach wrong? That may be a controversial statement, but it certainly is fair to say that the classical approach is limited in scope.

The Bayesian approach expands the class of models and easily handles:
repeated measures
unbalanced or missing data
nonhomogeneous variances
multivariate data
– and many other settings that are precluded (or much more complicated) in classical settings.
Basics of Bayesian inference

We start with a model (likelihood) f(y | θ) for the observed data y = (y1, ..., yn)′ given unknown parameters θ (perhaps a collection of several parameters).

Add a prior distribution p(θ | λ), where λ is a vector of hyper-parameters.

The posterior distribution of θ is given by:

p(θ | y, λ) = p(θ | λ) f(y | θ) / p(y | λ)
            = p(θ | λ) f(y | θ) / ∫ f(y | θ) p(θ | λ) dθ.

We refer to this formula as Bayes Theorem.

Calculations (numerical and algebraic) are usually required only up to a proportionality constant. We therefore write the posterior as:

p(θ | y, λ) ∝ p(θ | λ) × f(y | θ).

If λ is known/fixed, the above is the desired posterior. If, however, λ is unknown, we assign a prior, p(λ), and seek:

p(θ, λ | y) ∝ p(λ) p(θ | λ) f(y | θ).

The proportionality constant does not depend upon θ or λ:

1 / p(y) = 1 / ∫∫ p(λ) p(θ | λ) f(y | θ) dλ dθ.

The above represents a joint posterior from a hierarchical model. The marginal posterior distribution for θ is:

p(θ | y) ∝ ∫ p(λ) p(θ | λ) f(y | θ) dλ.
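As a rough sketch of how the last integral can be handled, ∫ p(λ) p(θ | λ) f(y | θ) dλ can be approximated by averaging over draws from p(λ). Everything below is a hypothetical toy model (normal data with known σ, θ | λ ∼ N(0, λ), and an Exponential(1) hyperprior on the variance λ), chosen only to keep the example short:

```r
## Monte Carlo approximation of the marginal posterior
## p(theta | y) proportional to the integral of p(lambda) p(theta | lambda) f(y | theta) dlambda.
## Toy model (hypothetical): y_i ~ N(theta, sigma^2), theta | lambda ~ N(0, lambda),
## lambda ~ Exponential(1).
set.seed(1)
y     <- c(0.8, 1.3, 0.2, 1.1, 0.6)   # made-up data
sigma <- 1                            # treated as known

lambda_draws <- rexp(10000, rate = 1)               # draws from the hyperprior p(lambda)
theta_grid   <- seq(-2, 3, length.out = 400)

unnorm <- sapply(theta_grid, function(th) {
  lik   <- prod(dnorm(y, mean = th, sd = sigma))               # f(y | theta)
  prior <- mean(dnorm(th, mean = 0, sd = sqrt(lambda_draws)))  # integral of p(theta | lambda) p(lambda) dlambda
  prior * lik
})

post <- unnorm / sum(unnorm * diff(theta_grid)[1])  # normalize on the grid
plot(theta_grid, post, type = "l", xlab = "theta", ylab = "marginal posterior density")
```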
A simple example: Normal data and normal priors

Example: consider a single data point y from a Normal distribution: y ∼ N(θ, σ²); assume σ is known.

f(y | θ) = N(y | θ, σ²) = (1 / (σ√(2π))) exp(−(y − θ)² / (2σ²))

θ ∼ N(μ, τ²), i.e. p(θ) = N(θ | μ, τ²); μ and τ² are known.

Posterior distribution of θ:

p(θ | y) ∝ N(θ | μ, τ²) × N(y | θ, σ²)
         = N(θ | [(1/τ²)μ + (1/σ²)y] / (1/σ² + 1/τ²), 1 / (1/σ² + 1/τ²))
         = N(θ | (σ²/(σ² + τ²))μ + (τ²/(σ² + τ²))y, σ²τ²/(σ² + τ²)).

Interpret: the posterior mean is a weighted mean of the prior mean and the data point. The direct estimate is shrunk towards the prior.

What if you had n observations instead of one in the earlier setup? Say y = (y1, ..., yn)′, where the yi are iid N(θ, σ²).

ȳ is a sufficient statistic for θ; ȳ ∼ N(θ, σ²/n).

Posterior distribution of θ:

p(θ | y) ∝ N(θ | μ, τ²) × N(ȳ | θ, σ²/n)
         = N(θ | [(1/τ²)μ + (n/σ²)ȳ] / (n/σ² + 1/τ²), 1 / (n/σ² + 1/τ²))
         = N(θ | (σ²/(σ² + nτ²))μ + (nτ²/(σ² + nτ²))ȳ, σ²τ²/(σ² + nτ²)).
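The posterior mean and variance formulas above translate directly into R; here is a small sketch with made-up values for μ, τ, σ and the data:

```r
## Normal data, normal prior: closed-form posterior for theta (illustrative numbers)
mu    <- 0                       # prior mean
tau   <- 2                       # prior standard deviation
sigma <- 1                       # known data standard deviation
y     <- c(1.2, 0.7, 1.9, 1.1)   # made-up observations
n     <- length(y)
ybar  <- mean(y)

post_var  <- 1 / (n / sigma^2 + 1 / tau^2)
post_mean <- post_var * (mu / tau^2 + n * ybar / sigma^2)

## Same result in the "weighted mean" form: weight sigma^2 / (sigma^2 + n tau^2)
## on the prior mean and n tau^2 / (sigma^2 + n tau^2) on ybar.
c(post_mean, post_var)
```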
A simple exercise using R

Consider the problem of estimating the current weight of a group of people. A sample of 10 people was taken and their average weight was calculated as ȳ = 176 lbs. Assume that the population standard deviation is known to be σ = 3. Assuming that the data y1, ..., y10 came from a N(θ, σ²) population, perform the following:

Obtain a 95% confidence interval for θ using classical methods.

Assume a prior distribution for θ of the form N(μ, τ²). Obtain 95% posterior credible intervals for θ for each of the cases: (a) μ = 176, τ = 8; (b) μ = 176, τ = 1000; (c) μ = 0, τ = 1000. Which case gives results closest to those obtained with the classical method? (A possible R sketch follows the next example.)

Another simple example: The Beta-Binomial model

Example: let Y be the number of successes in n independent trials.

P(Y = y | θ) = f(y | θ) = (n choose y) θ^y (1 − θ)^(n−y)

Prior: p(θ) = Beta(θ | a, b):

p(θ) ∝ θ^(a−1) (1 − θ)^(b−1).

Prior mean: μ = a/(a + b); variance: ab/((a + b)²(a + b + 1)).

Posterior distribution of θ:

p(θ | y) = Beta(θ | a + y, b + n − y).
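A possible solution sketch for the R exercise above, applying the normal-normal formulas and using the usual ȳ ± 1.96 σ/√n interval for the classical part:

```r
## The weight exercise: ybar = 176, n = 10, sigma = 3 known
ybar <- 176; n <- 10; sigma <- 3

## Classical 95% confidence interval for theta
ybar + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)

## 95% equal-tail posterior credible intervals under N(mu, tau^2) priors
cred_int <- function(mu, tau) {
  post_var  <- 1 / (n / sigma^2 + 1 / tau^2)
  post_mean <- post_var * (mu / tau^2 + n * ybar / sigma^2)
  qnorm(c(0.025, 0.975), mean = post_mean, sd = sqrt(post_var))
}

cred_int(176, 8)      # (a) fairly informative prior centred at the data
cred_int(176, 1000)   # (b) very vague prior
cred_int(0, 1000)     # (c) very vague prior centred far from the data
```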
Bayesian inference: point estimation

Point estimation is easy: simply choose an appropriate distribution summary: posterior mean, median or mode.

Mode: sometimes easy to compute (no integration, simply optimization), but often misrepresents the "middle" of the distribution – especially for one-tailed distributions.

Mean: easy to compute. It has the "opposite effect" of the mode – it chases tails.

Median: probably the best compromise in being robust to tail behaviour, although it may be awkward to compute as it requires solving:

∫_{−∞}^{θ_median} p(θ | y) dθ = 1/2.

Bayesian inference: interval estimation

The most popular method of inference in practical Bayesian modelling is interval estimation using credible sets. A 100(1 − α)% credible set C for θ is a set that satisfies:

P(θ ∈ C | y) = ∫_C p(θ | y) dθ ≥ 1 − α.

The most popular credible set is the simple equal-tail interval estimate (qL, qU) such that:

∫_{−∞}^{qL} p(θ | y) dθ = α/2 = ∫_{qU}^{∞} p(θ | y) dθ.

Then clearly P(θ ∈ (qL, qU) | y) = 1 − α.

This interval is relatively easy to compute and has a direct interpretation: the probability that θ lies between qL and qU is 1 − α. The frequentist interpretation is extremely convoluted.
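For a concrete case, take the Beta-Binomial posterior from the earlier slide with the asthma data (y = 2 successes in n = 15 trials) and, as an assumption here, a uniform Beta(1, 1) prior; the equal-tail interval then comes straight from the posterior quantiles:

```r
## Equal-tail 95% credible interval under the Beta-Binomial model
## (asthma data: y = 2, n = 15; the Beta(1, 1) prior is an assumption)
a <- 1; b <- 1; y <- 2; n <- 15
qbeta(c(0.025, 0.975), a + y, b + n - y)   # (qL, qU): P(qL < theta < qU | y) = 0.95
```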
Sampling-based inference

Previous example: direct evaluation of the posterior probabilities. This is feasible only for simpler problems.

Modern Bayesian analysis: derive complete posterior densities, say p(θ | y), by drawing samples from that density. Samples are of the parameters themselves, or of their functions.

If θ1, ..., θM are samples from p(θ | y), then densities are created by feeding them into a density plotter. Similarly, samples from f(θ), for some function f, are obtained by simply feeding the θi's to f(·).

In principle M can be arbitrarily large – it comes from the computer and depends only on the time we have for analysis. Do not confuse this with the data sample size n, which is limited in size by experimental constraints.
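A minimal sampling-based sketch in R, reusing the Beta posterior above (again with the assumed Beta(1, 1) prior): draw M samples from p(θ | y), summarize them, and push them through a function of θ.

```r
## Sampling-based inference: work with M posterior draws instead of exact integrals
set.seed(7)
M     <- 50000                         # limited by computing time, not by the data size n
theta <- rbeta(M, 1 + 2, 1 + 15 - 2)   # samples from p(theta | y) for the asthma data

plot(density(theta))                   # estimated posterior density
c(mean(theta), median(theta))          # point estimates
quantile(theta, c(0.025, 0.975))       # equal-tail 95% credible interval

odds <- theta / (1 - theta)            # samples of f(theta): the posterior of the odds
quantile(odds, c(0.025, 0.975))
```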