9 Bayesian inference
[Portrait of Thomas Bayes, 1702 - 1761]
9.1 Subjective probability
This is probability regarded as degree of belief.
A subjective probability of an event A is assessed as p if you are prepared
to stake £pM to win £M and equally prepared to accept a stake of £pM to
win £M .
In other words ...
... the bet is fair and you are assumed to behave rationally.
9.1.1 Kolmogorov’s axioms
How does subjective probability fit in with the fundamental axioms?
Let A be the set of all subsets of a countable sample space Ω. Then
(i) P(A) ≥ 0 for every A ∈ A;
(ii) P(Ω) = 1;
(iii) If {Aλ : λ ∈ Λ} is a countable set of mutually exclusive events belonging to A, then

P( ∪_{λ∈Λ} Aλ ) = Σ_{λ∈Λ} P(Aλ).
Obviously the subjective interpretation has no difficulty in conforming with
(i) and (ii). (iii) is slightly less obvious.
Suppose we have 2 events A and B such that A ∩ B = ∅. Consider a stake of £pA M to win £M if A occurs and a stake of £pB M to win £M if B occurs. The total stake for bets on A or B occurring is £pA M + £pB M to win £M if A or B occurs. Thus we have £(pA + pB)M to win £M, and so

P(A ∪ B) = P(A) + P(B).
9.1.2 Conditional probability
Define pB , pAB , pA|B such that
£pB M is the fair stake for £M if B occurs;
£pAB M is the fair stake for £M if A and B occur;
£pA|B M is the fair stake for £M if A occurs given that B has occurred; otherwise the bet is off.
Suppose the three bets are placed simultaneously, with winnings M1 for the conditional bet on A given B, M2 for the bet on B, and M3 for the bet on A ∩ B. Consider the three possible outcomes: B does not occur, B occurs but A does not, and both A and B occur. If G1, G2, G3 are the corresponding gains, then

B does not occur:      −pB M2 − pAB M3 = G1
B occurs, A does not:  −pA|B M1 + (1 − pB)M2 − pAB M3 = G2
A and B both occur:    (1 − pA|B)M1 + (1 − pB)M2 + (1 − pAB)M3 = G3
The principle of rationality does not allow the possibility of setting the stakes to obtain a sure profit; if the coefficient matrix below were non-singular, the Mi could be chosen to make G1, G2, G3 all positive. Thus

| 0            −pB       −pAB     |
| −pA|B        1 − pB    −pAB     |  = 0
| 1 − pA|B     1 − pB    1 − pAB  |

⇒ pAB − pB pA|B = 0,

or

P(A | B) = P(A ∩ B) / P(B).
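The coherence argument can be checked symbolically. The following is a minimal sketch (not part of the original notes) in Python using sympy; the symbol names are arbitrary.

    import sympy as sp

    # Fair-stake probabilities for B, for A-and-B, and for A given B
    pB, pAB, pAgB = sp.symbols('p_B p_AB p_AgivenB', positive=True)

    # Coefficients of (M1, M2, M3) in the three gain equations above
    gains = sp.Matrix([
        [0,         -pB,     -pAB],      # B does not occur (conditional bet off)
        [-pAgB,     1 - pB,  -pAB],      # B occurs, A does not
        [1 - pAgB,  1 - pB,  1 - pAB],   # A and B both occur
    ])

    # No choice of M1, M2, M3 may guarantee a profit, so the determinant vanishes
    print(sp.expand(gains.det()))        # p_AB - p_AgivenB*p_B

Setting this determinant to zero and solving reproduces P(A | B) = P(A ∩ B)/P(B).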
9.2 Estimation formulated as a decision problem
A The action space, which is the set of all possible actions a ∈ A available.
Θ The parameter space, consisting of all possible “states of nature”, only one of which occurs or will occur (this “true” state being unknown).
L The loss function, having domain Θ × A (the set of all possible consequences (θ, a)) and codomain R.
RX The sample space, containing all possible realisations x ∈ RX of a random variable X belonging to the family {f(x; θ) : θ ∈ Θ}.
D The decision space, consisting of all possible decision functions δ ∈ D, each having domain RX and codomain A.
9.2.1 The basic idea
• The true state of the world is unknown when the action is chosen.
• The loss is known which results from each of the possible consequences
(θ, a).
• Data are collected to provide information about the unknown θ.
• A choice of action is taken for each possible observation of X .
Decision theory is concerned with the criteria for making a choice.
The form of the loss function is crucial; it depends upon the nature of the problem.
Example
You are a fashion buyer and must stock up with trendy dresses. The true
number you will sell is θ (unknown).
You stock up with a dresses. If you overstock you must sell off at your end-of-season sale and lose £A per dress. If you understock you lose the profit of £B per dress. Thus

L(θ, a) = A(a − θ),  a ≥ θ,
L(θ, a) = B(θ − a),  a ≤ θ.
Special forms of loss function
If A = B, you have modular loss,
L(θ, a) = A |θ − a|
To suit comparison with classical inference, quadratic loss is often used.
L(θ, a) = C (θ − a)2
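These loss functions are simple to write down in code. The sketch below is purely illustrative (Python, with made-up values of A, B and C).

    def stock_loss(theta, a, A=2.0, B=5.0):
        """Fashion-buyer loss: overstocking costs A per dress, understocking costs B."""
        return A * (a - theta) if a >= theta else B * (theta - a)

    def modular_loss(theta, a, A=1.0):
        return A * abs(theta - a)

    def quadratic_loss(theta, a, C=1.0):
        return C * (theta - a) ** 2

    # Overstocking by 20 dresses versus understocking by 20
    print(stock_loss(100, 120), stock_loss(100, 80))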
9.2.2 The risk function
This is a measure of the quality of a decision function. Suppose we choose a
decision δ(x) (i.e. choose a = δ(x)).
The function R with domain Θ × D and codomain R defined by
R(θ, δ) = ∫_{RX} L(θ, δ(x)) f(x; θ) dx   (X continuous)
or
R(θ, δ) = Σ_{x∈RX} L(θ, δ(x)) p(x; θ)   (X discrete)
is called the risk function.
We may write this as
R(θ, δ) = E[L(θ, δ(X))].
Example
Suppose X = (X1, . . . , Xn) is a random sample from N(θ, θ) and suppose we assume quadratic loss.
Consider

δ1(X) = X̄,   δ2(X) = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)².

R(θ, δ1) = E[(θ − X̄)²].

We know that E(X̄) = θ, so

R(θ, δ1) = V(X̄) = θ/n.

R(θ, δ2) = E[(θ − δ2(X))²].

But E[δ2(X)] = θ, so that

R(θ, δ2) = V[δ2(X)] = (θ²/(n − 1)²) V[ Σ_{i=1}^n (Xi − X̄)²/θ ] = (θ²/(n − 1)²) · 2(n − 1) = 2θ²/(n − 1),

since Σ(Xi − X̄)²/θ ∼ χ²(n − 1).
[Figure: the risk functions R(θ, δ1) = θ/n (a straight line) and R(θ, δ2) = 2θ²/(n − 1) (a parabola) plotted against θ; the two curves cross at θ = (n − 1)/2n.]
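A short simulation reproduces the two risk curves for the N(θ, θ) example; this is a sketch only (Python), with arbitrary choices of θ, n and the number of replications.

    import numpy as np

    rng = np.random.default_rng(0)
    n, theta, reps = 10, 2.0, 200_000

    x = rng.normal(theta, np.sqrt(theta), size=(reps, n))
    d1 = x.mean(axis=1)                # delta_1: sample mean
    d2 = x.var(axis=1, ddof=1)         # delta_2: sample variance

    # Monte Carlo risks under quadratic loss, compared with the exact formulae
    print(np.mean((d1 - theta) ** 2), theta / n)               # ~ theta/n
    print(np.mean((d2 - theta) ** 2), 2 * theta**2 / (n - 1))  # ~ 2*theta^2/(n-1)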
Without knowing θ, how do we choose between δ1 and δ2?
Two sensible criteria
1. Take precautions against the worst.
2. Take into account any further information you may have.
Criterion 1 leads to minimax. Choose the decision leading to the smallest
maximum risk, i.e. choose δ∗ ∈ D such that
sup_{θ∈Θ} R(θ, δ∗) = inf_{δ∈D} sup_{θ∈Θ} R(θ, δ).
Minimax rules have disadvantages, the main one being that they treat Nature
as an adversary. Minimax guards against the worst choice. The value of θ
should not be regarded as a value deliberately chosen by Nature to make the
risk large.
Fortunately in many situations they turn out to be reasonable in practice, if
not in concept.
Criterion 2 involves introducing beliefs about θ into the analysis. If π(θ)
represents our beliefs about θ, then define Bayes risk r(π, δ) as
r(π, δ) = ∫_Θ R(θ, δ)π(θ)dθ   if θ is continuous,
r(π, δ) = Σ_{θ∈Θ} R(θ, δ)π(θ)   if θ is discrete.

Now choose δ∗ such that

r(π, δ∗) = inf_{δ∈D} r(π, δ).
Bayes Rule
The Bayes rule with respect to a prior π is the decision rule δ∗ that minimises
the Bayes risk among all possible decision rules.
[Photograph: Bayes’ tomb]
9.2.3 Decision terminology
δ∗ is called a Bayes decision function.
δ∗(X) is a Bayes estimator.
δ∗(x) is a Bayes estimate.
If a decision function δ is such that ∃ δ′ satisfying

R(θ, δ′) ≤ R(θ, δ) ∀θ ∈ Θ

and

R(θ, δ′) < R(θ, δ) for some θ ∈ Θ,

then δ is dominated by δ′ and δ is said to be inadmissible.
All other decisions are admissible.
9.3 Prior and posterior distributions
The prior distribution represents our initial belief in the parameter θ. This
means that θ is regarded as a random variable with p.d.f. π(θ).
From Bayes Theorem,
f (x | θ)π(θ) = π(θ | x)f (x)
or
π(θ | x) = f(x | θ)π(θ) / ∫_Θ f(x | θ)π(θ)dθ.
π(θ | x) is called the posterior p.d.f.
It represents our updated belief in θ, having taken account of the data.
We write the expression in the form
π(θ | x) ∝ f (x | θ)π(θ)
or
Posterior ∝ Likelihood × Prior.
Example
Suppose X1, X2, . . . , Xn is a random sample from a Poisson distribution, Poi(θ), where θ has a prior distribution

π(θ) = θ^(α−1) e^(−θ) / Γ(α),   θ ≥ 0, α > 0.

The posterior distribution is given by

π(θ | x) ∝ [ θ^(Σxi) e^(−nθ) / Π xi! ] · [ θ^(α−1) e^(−θ) / Γ(α) ],

which simplifies to

π(θ | x) ∝ θ^(Σxi + α − 1) e^(−(n+1)θ),   θ ≥ 0.

This is recognisable as a Γ-distribution and, by comparison with π(θ), we can write

π(θ | x) = (n + 1)^(Σxi + α) θ^(Σxi + α − 1) e^(−(n+1)θ) / Γ(Σxi + α),   θ ≥ 0,
which represents our posterior belief in θ.
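The conjugate update can be verified numerically. The sketch below (Python with scipy; the counts and the value of α are invented for illustration) compares the normalised product of likelihood and prior with the Γ(Σxi + α, n + 1) density just derived.

    import numpy as np
    from scipy import stats

    alpha = 3.0
    x = np.array([2, 0, 1, 4, 2])        # hypothetical Poisson counts
    n, s = len(x), x.sum()

    theta = np.linspace(0.01, 8, 800)
    unnorm = theta**(s + alpha - 1) * np.exp(-(n + 1) * theta)   # likelihood x prior
    post_numeric = unnorm / np.trapz(unnorm, theta)

    post_exact = stats.gamma.pdf(theta, a=s + alpha, scale=1 / (n + 1))
    print(np.max(np.abs(post_numeric - post_exact)))   # small discretisation error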
9.4 Decisions
We have already defined the Bayes risk to be

r(π, δ) = ∫_Θ R(θ, δ)π(θ)dθ,

where

R(θ, δ) = ∫_{RX} L(θ, δ(x)) f(x | θ)dx,

so that

r(π, δ) = ∫_Θ ∫_{RX} L(θ, δ(x)) f(x | θ)dx π(θ)dθ
        = ∫_{RX} [ ∫_Θ L(θ, δ(x)) π(θ | x)dθ ] f(x)dx.

For any given x, minimising the Bayes risk is equivalent to minimising

∫_Θ L(θ, δ(x)) π(θ | x)dθ,

i.e. we minimise the posterior expected loss.
Example
Take the posterior distribution in the previous example and, for no special
reason, assume quadratic loss
L (θ,δ(x)) = [θ − δ(x)]2 .
Then we minimise

∫_Θ [θ − δ(x)]² π(θ | x)dθ

by differentiating with respect to δ to obtain

−2 ∫_Θ [θ − δ(x)] π(θ | x)dθ = 0
⇒ δ(x) = ∫_Θ θ π(θ | x)dθ.
This is the mean of the posterior distribution.
δ(x) = ∫_0^∞ θ · (n + 1)^(Σxi + α) θ^(Σxi + α − 1) e^(−(n+1)θ) / Γ(Σxi + α) dθ

     = [ (n + 1)^(Σxi + α) / Γ(Σxi + α) ] · Γ(Σxi + α + 1) / (n + 1)^(Σxi + α + 1)

     = (Σxi + α) / (n + 1).
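As a check, the posterior expected quadratic loss can be minimised numerically; the minimiser should agree with (Σxi + α)/(n + 1). Sketch only, reusing the hypothetical counts above.

    import numpy as np
    from scipy import stats, optimize

    alpha, x = 3.0, np.array([2, 0, 1, 4, 2])     # same hypothetical data as before
    n, s = len(x), x.sum()
    post = stats.gamma(a=s + alpha, scale=1 / (n + 1))
    grid = np.linspace(0.001, 10, 4000)

    def expected_loss(d):
        # posterior expected quadratic loss for the estimate d
        return np.trapz((grid - d) ** 2 * post.pdf(grid), grid)

    d_star = optimize.minimize_scalar(expected_loss, bounds=(0, 10), method='bounded').x
    print(d_star, (s + alpha) / (n + 1))          # both close to 2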
Theorem:
Suppose that δπ is a decision rule that is Bayes with respect to some prior distribution π. If the risk function of δπ satisfies

R(θ, δπ) ≤ r(π, δπ) for all θ ∈ Θ,

then δπ is a minimax decision rule.

Proof:
Suppose δπ is not minimax. Then there is a decision rule δ′ such that

sup_{θ∈Θ} R(θ, δ′) < sup_{θ∈Θ} R(θ, δπ).

For this decision rule we have (since, if f(x) ≤ k for all x, then E[f(X)] ≤ k)

r(π, δ′) ≤ sup_{θ∈Θ} R(θ, δ′) < sup_{θ∈Θ} R(θ, δπ) ≤ r(π, δπ),

contradicting the statement that δπ is Bayes with respect to π. Hence δπ is minimax.
Example
Let X1, . . . , Xn be a random sample from a Bernoulli distribution B(1, p). Let Y = Σ Xi and consider estimating p using the loss function

L(p, a) = (p − a)² / [p(1 − p)],
a loss which penalises mistakes more if p is near 0 or 1. The m.l.e. is p̂ = Y/n and

R(p, p̂) = E[ (p − p̂)² / (p(1 − p)) ] = [1/(p(1 − p))] · [p(1 − p)/n] = 1/n.

The risk is constant.
[NB A decision rule with constant risk is called an equalizer rule.]
We now show that p̂ is the Bayes estimator with respect to the U(0, 1) prior which, by the theorem above, will imply that p̂ is minimax.
π(p | y) ∝ p^y (1 − p)^(n−y),

so we want the value of a which minimises

∫_0^1 [ (p − a)² / (p(1 − p)) ] p^y (1 − p)^(n−y) dp.

Differentiation with respect to a yields

a = ∫_0^1 p^y (1 − p)^(n−y−1) dp / ∫_0^1 p^(y−1) (1 − p)^(n−y−1) dp.
Now a β-distribution with parameters α, β has pdf

f(x) = [ Γ(α + β) / (Γ(α)Γ(β)) ] x^(α−1) (1 − x)^(β−1),   x ∈ [0, 1],

so

a = [ Γ(y + 1)Γ(n − y) / Γ(n + 1) ] · [ Γ(n) / (Γ(y)Γ(n − y)) ]

and, using the identity Γ(α + 1) = αΓ(α),

a = y/n,   i.e. p̂ = Y/n.
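The claim that a = y/n minimises the posterior expected weighted loss under the U(0, 1) prior can also be checked numerically; the values of n and y below are arbitrary.

    import numpy as np
    from scipy import integrate, optimize

    n, y = 12, 4                # hypothetical number of trials and successes

    def post_expected_loss(a):
        # loss-weighted posterior integrand, (p - a)^2 p^(y-1) (1 - p)^(n-y-1)
        integrand = lambda p: (p - a) ** 2 * p**(y - 1) * (1 - p)**(n - y - 1)
        return integrate.quad(integrand, 0, 1)[0]

    a_star = optimize.minimize_scalar(post_expected_loss, bounds=(0.01, 0.99),
                                      method='bounded').x
    print(a_star, y / n)        # both close to 1/3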
9.5 Bayes estimates
(i) The mean of π(θ | x) is the Bayes estimate with respect to quadratic
loss.
Proof
Choose δ(x) to minimise

∫_Θ [θ − δ(x)]² π(θ | x)dθ

by differentiating with respect to δ and equating to zero. Then

δ(x) = ∫_Θ θ π(θ | x)dθ,

since ∫_Θ π(θ | x)dθ = 1.
(ii) The median of π(θ | x) is the Bayes estimate with respect to modular
loss (i.e. absolute value loss).
Proof
Choose δ(x) to minimise

∫_Θ |θ − δ(x)| π(θ | x)dθ = ∫_{−∞}^{δ(x)} (δ(x) − θ) π(θ | x)dθ + ∫_{δ(x)}^{∞} (θ − δ(x)) π(θ | x)dθ.

Differentiate with respect to δ and equate to zero:

∫_{−∞}^{δ(x)} π(θ | x)dθ = ∫_{δ(x)}^{∞} π(θ | x)dθ

⇒ 2 ∫_{−∞}^{δ(x)} π(θ | x)dθ = ∫_{−∞}^{∞} π(θ | x)dθ = 1,

so that

∫_{−∞}^{δ(x)} π(θ | x)dθ = 1/2,

and δ(x) is the median of π(θ | x).
(iii) The mode of π(θ | x) is the Bayes estimate with respect to zero-one loss (i.e. loss of zero for a correct estimate and loss of one for any incorrect estimate).
Proof
The loss function has the form

L(θ, δ(x)) = 1, if δ(x) ≠ θ,
L(θ, δ(x)) = 0, if δ(x) = θ.

This is only a useful idea for a discrete distribution. Writing δ(x) = θ∗,

Σ_{θ∈Θ} L(θ, δ(x)) π(θ | x) = Σ_{θ∈Θ\{θ∗}} π(θ | x) = 1 − π(θ∗ | x).

Minimisation means maximising π(θ∗ | x), so δ(x) is taken to be the ‘most likely’ value, the mode.
Remember that π(θ | x) represents our most up-to-date belief in θ and is all
that is needed for inference.
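For a concrete posterior the three estimates can be computed directly. The sketch below (Python with scipy) uses a Gamma posterior of the kind obtained in the Poisson example; the shape and rate values are hypothetical.

    from scipy import stats

    shape, rate = 12.0, 6.0                   # hypothetical Gamma posterior
    post = stats.gamma(a=shape, scale=1 / rate)

    mean = post.mean()                        # Bayes estimate under quadratic loss
    median = post.median()                    # Bayes estimate under modular loss
    mode = (shape - 1) / rate                 # Bayes estimate under zero-one loss (shape >= 1)

    print(mean, median, mode)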
9.6 Credible intervals
Definition
The interval (θ1, θ2) is a 100(1 − α)% credible interval for θ if
∫_{θ1}^{θ2} π(θ | x)dθ = 1 − α,   0 ≤ α ≤ 1.

[Figure: a posterior density π(θ | x) with two different intervals, (θ1, θ2) and (θ1′, θ2′), each containing posterior probability 1 − α.]

Both (θ1, θ2) and (θ1′, θ2′) are 100(1 − α)% credible intervals.
Definition
The interval (θ1, θ2) is a 100(1 − α)% highest posterior density interval if
(i) (θ1,θ2) is a 100(1 − α)% credible interval;
(ii) ∀θ ∈ (θ1, θ2) and ∀θ′ ∉ (θ1, θ2),

π(θ | x) ≥ π(θ′ | x).
Obviously this defines the shortest possible 100(1 − α)% credible interval.
[Figure: a highest posterior density interval (θ1, θ2) for a unimodal posterior π(θ | x), containing posterior probability 1 − α.]
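Equal-tailed and highest posterior density intervals are easy to obtain numerically for a unimodal posterior. The sketch below is illustrative only (Python with scipy, hypothetical Gamma posterior).

    import numpy as np
    from scipy import stats, optimize

    post = stats.gamma(a=12.0, scale=1 / 6.0)      # hypothetical posterior
    alpha = 0.05

    # Equal-tailed 95% credible interval
    equal_tail = post.ppf([alpha / 2, 1 - alpha / 2])

    # HPD interval: choose the lower tail probability that minimises the length
    def length(p_lo):
        return post.ppf(p_lo + 1 - alpha) - post.ppf(p_lo)

    p_lo = optimize.minimize_scalar(length, bounds=(1e-6, alpha - 1e-6),
                                    method='bounded').x
    hpd = (post.ppf(p_lo), post.ppf(p_lo + 1 - alpha))
    print(equal_tail, hpd)                         # the HPD interval is the shorter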
9.7 Features of Bayesian models
(i) If there is a sufficient statistic, say T , the posterior distribution will
depend upon the data only through T .
(ii) There is a natural parametric family of priors such that the posterior
distributions also belong to this family. Such priors are called conjugate
priors.
Example: Binomial likelihood, beta prior
A binomial likelihood has the form

p(x | θ) = (n choose x) θ^x (1 − θ)^(n−x),   x = 0, 1, . . . , n,

and a beta prior is

π(θ) = θ^(a−1) (1 − θ)^(b−1) / B(a, b),   θ ∈ [0, 1],

where

B(a, b) = Γ(a)Γ(b) / Γ(a + b),

so that, since

Posterior ∝ Likelihood × Prior,

π(θ | x) ∝ θ^(x+a−1) (1 − θ)^(n−x+b−1)

and comparison with the prior leads to

π(θ | x) = θ^(x+a−1) (1 − θ)^(n−x+b−1) / B(x + a, n − x + b),   θ ∈ [0, 1],

which is also a beta distribution.
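The beta-binomial update can be checked in the same way as the Poisson-gamma one; a sketch with invented values of n, x, a and b follows.

    import numpy as np
    from scipy import stats

    n, x = 20, 7                 # hypothetical number of trials and successes
    a, b = 2.0, 2.0              # beta prior parameters

    theta = np.linspace(0.001, 0.999, 999)
    unnorm = stats.binom.pmf(x, n, theta) * stats.beta.pdf(theta, a, b)
    post_numeric = unnorm / np.trapz(unnorm, theta)
    post_exact = stats.beta.pdf(theta, x + a, n - x + b)

    print(np.max(np.abs(post_numeric - post_exact)))   # small discretisation error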
9.8 Hypothesis testing
The Bayesian only has a concept of a hypothesis test in certain very specific
situations. This is because of the alien idea of having to regard a parameter
as a fixed point. However the concept has some validity in an example such
as the following.
Example:
Water samples are taken from a reservoir and checked for bacterial content.
These provide information about θ, the proportion of infected samples. If it
is believed that bacterial density is such that more than 1% of samples are
infected, there is a legal obligation to change the chlorination procedure.
The Bayesian procedure is to
(i) use the data to obtain the posterior distribution of θ;
(ii) calculate P (θ > 0.01).
EC rules are formulated in statistical terms and, in such cases, allow for a 5%
significance level. This could be interpreted in a Bayesian sense as allowing
for a wrong decision with probability no more than 0.05, so order increased
chlorination if P (θ > 0.01) > 0.05.
Where for some reason (e.g. legal reason) a fixed parameter value is of interest
because it defines a threshold, then a Bayesian concept of a test is possible.
It is usually performed either by calculating a credible bound or a credible
interval and looking at it in relation to that threshold value or values, or by
calculating posterior probabilities based on the threshold or thresholds.
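If the infected/not-infected counts are modelled as binomial with a beta prior (the conjugate choice, as in 9.7), the required posterior probability is a single tail calculation. The counts and the flat prior below are purely illustrative.

    from scipy import stats

    n, x = 400, 2                    # hypothetical: 2 infected samples out of 400
    a, b = 1.0, 1.0                  # Beta(1, 1), i.e. uniform, prior
    post = stats.beta(x + a, n - x + b)

    p_exceed = post.sf(0.01)         # P(theta > 0.01 | data)
    print(p_exceed, "increase chlorination" if p_exceed > 0.05 else "no action")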
9.9 Inferences for normal distributions
9.9.1 Unknown mean, known variance
Consider data from N(θ, σ²) and a prior N(τ, κ²).

Posterior ∝ Likelihood × Prior,

so, since

f(x | θ, σ²) = (2πσ²)^(−n/2) exp[ −(1/(2σ²)) Σ(xi − θ)² ]

and

π(θ) = (1/√(2πκ²)) exp[ −(1/(2κ²))(θ − τ)² ],   θ ∈ R,

we have

π(θ | x, σ², τ, κ²) ∝ exp[ −(n/(2σ²))(θ² − 2θx̄) − (1/(2κ²))(θ² − 2θτ) ].

But

(n/(2σ²))(θ² − 2θx̄) + (1/(2κ²))(θ² − 2θτ)
  = (1/2)(n/σ² + 1/κ²)θ² − (nx̄/σ² + τ/κ²)θ + garbage (terms not involving θ)
  = (1/2)(n/σ² + 1/κ²) [ θ − (nx̄/σ² + τ/κ²)/(n/σ² + 1/κ²) ]² + garbage,

so the posterior p.d.f. is proportional to

exp[ − ( θ − (nx̄/σ² + τ/κ²)/(n/σ² + 1/κ²) )² / ( 2(n/σ² + 1/κ²)^(−1) ) ].

This means that

θ | x, σ², τ, κ² ∼ N( (nx̄/σ² + τ/κ²)/(n/σ² + 1/κ²), (n/σ² + 1/κ²)^(−1) ).
The posterior mean is a weighted average.
The prior mean τ has weight 1/κ²: this is 1/(prior variance), . . .
and the observed sample mean x̄ has weight n/σ²: this is 1/(variance of the sample mean).
This is reasonable because

lim_{n→∞} (nx̄/σ² + τ/κ²)/(n/σ² + 1/κ²) = x̄

and

lim_{κ²→∞} (nx̄/σ² + τ/κ²)/(n/σ² + 1/κ²) = x̄.

The posterior mean tends to x̄ as the amount of data swamps the prior or as prior beliefs become very vague.
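The weighted-average form is easy to verify directly from the formula above; the sketch below (Python, arbitrary prior and simulated data) computes the posterior mean and variance.

    import numpy as np

    rng = np.random.default_rng(1)
    sigma2, tau, kappa2, n = 4.0, 0.0, 10.0, 25      # hypothetical values
    x = rng.normal(1.5, np.sqrt(sigma2), n)          # simulated data, true mean 1.5

    w_data, w_prior = n / sigma2, 1 / kappa2         # 1/(variance of x-bar), 1/(prior variance)
    post_mean = (w_data * x.mean() + w_prior * tau) / (w_data + w_prior)
    post_var = 1 / (w_data + w_prior)

    print(x.mean(), post_mean, post_var)             # posterior mean lies between x-bar and tau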
9.9.2 Unknown variance, known mean
We need to assign a prior p.d.f. π(σ2).
The Γ family of p.d.f.’s provides a flexible set of shapes over [0, ∞). We take
τ = 1/σ² ∼ Γ(ν/2, νλ/2),

where λ and ν are chosen to provide suitable location and shape.
In other words, the prior p.d.f. of τ = 1/σ² is chosen to be of the form

π(τ) = (νλ)^(ν/2) / (2^(ν/2) Γ(ν/2)) · τ^(ν/2 − 1) exp(−νλτ/2),   τ ≥ 0,

or

π(σ²) = (νλ)^(ν/2) / (2^(ν/2) Γ(ν/2)) · (σ²)^(−ν/2 − 1) exp(−νλ/(2σ²)).

τ = 1/σ² is called the precision.
The posterior p.d.f. is given by

π(σ² | x, θ) ∝ f(x | θ, σ²) π(σ²),

and the right-hand side, as a function of σ², is proportional to

(σ²)^(−n/2) exp[ −(1/(2σ²)) Σ(xi − θ)² ] × (σ²)^(−ν/2 − 1) exp[ −νλ/(2σ²) ].

Thus π(σ² | x, θ) is proportional to

(σ²)^(−(ν+n)/2 − 1) exp[ −(1/(2σ²)) ( νλ + Σ(xi − θ)² ) ].

This is the same form as the prior except that ν is replaced by ν + n and λ is replaced by

[ νλ + Σ(xi − θ)² ] / (ν + n).

Consider the random variable

W = [ νλ + Σ(xi − θ)² ] / σ².

Clearly

f_W(w) = C w^((ν+n)/2 − 1) e^(−w/2),   w ≥ 0.

This is a χ²-distribution with ν + n degrees of freedom. In other words, a posteriori,

[ νλ + Σ(xi − θ)² ] / σ² ∼ χ²(ν + n).

Again this agrees with intuition. As n → ∞ we approach the classical estimate, and also as ν → 0, which corresponds to vague prior knowledge.
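That [νλ + Σ(xi − θ)²]/σ² has a χ²(ν + n) posterior distribution can be checked by simulating σ² from the Gamma posterior of the precision; the prior parameters and data below are invented.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    nu, lam, theta = 4.0, 2.0, 0.0                   # hypothetical prior and known mean
    x = rng.normal(theta, 1.5, size=20)
    S = nu * lam + np.sum((x - theta) ** 2)

    # Posterior of tau = 1/sigma^2 is Gamma((nu + n)/2, rate S/2)
    tau = rng.gamma(shape=(nu + len(x)) / 2, scale=2 / S, size=100_000)
    W = S * tau                                      # should be chi-squared with nu + n d.f.

    print(stats.kstest(W, 'chi2', args=(nu + len(x),)).pvalue)   # large p-value expected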
9.9.3 Mean and variance unknown
We now have to assign a joint prior p.d.f. π (θ, σ2) to obtain a joint posterior
p.d.f.
π(θ, σ² | x) ∝ f(x | θ, σ²) π(θ, σ²).
We shall look at the special case of a non-informative prior. From theoretical
considerations beyond the scope of this course, a non-informative prior may
be approximated by
π(θ, σ²) ∝ 1/σ².
Note that this is not to be thought of as an expression of prior beliefs but
rather as an approximate prior which allows the posterior to be dominated
by the data.
π(θ, σ² | x) ∝ (σ²)^(−n/2 − 1) exp[ −(1/(2σ²)) Σ(xi − θ)² ].

The right-hand side may be written in the form

(σ²)^(−n/2 − 1) exp[ −(1/(2σ²)) ( Σ(xi − x̄)² + n(x̄ − θ)² ) ].
We can use this form to integrate out, in turn, θ and σ2.
Using

∫_{−∞}^{∞} exp[ −(n/(2σ²))(θ − x̄)² ] dθ = √(2πσ²/n)

we find

π(σ² | x) ∝ (σ²)^(−(n−1)/2 − 1) exp[ −(1/(2σ²)) Σ(xi − x̄)² ]

and, putting W = Σ(xi − x̄)² / σ²,

f_W(w) = C w^((n−1)/2 − 1) e^(−w/2),   w ≥ 0,

so we see that

Σ(xi − x̄)² / σ² ∼ χ²(n − 1).
To integrate out σ², make the substitution σ² = τ^(−1). Then π(θ | x) is proportional to

∫_0^∞ τ^(n/2 − 1) exp[ −(τ/2) ( Σ(xi − x̄)² + n(x̄ − θ)² ) ] dτ.

From the definition of the Γ-function,

∫_0^∞ y^(α−1) e^(−βy) dy = Γ(α)/β^α,

and we see, since only the terms in θ are relevant,

π(θ | x) ∝ [ Σ(xi − x̄)² + n(x̄ − θ)² ]^(−n/2),

or

π(θ | x) ∝ [ 1 + t²/(n − 1) ]^(−n/2),

where

t = √n (θ − x̄) / s,   s² = Σ(xi − x̄)²/(n − 1).

In other words, a posteriori, t = √n(θ − x̄)/s has a Student t-distribution with n − 1 degrees of freedom.
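Equivalently, a posteriori θ has a Student t location-scale distribution centred at x̄. The sketch below (Python with scipy, simulated data) normalises the proportionality above numerically and compares it with the t density; it is a check of the algebra, not part of the original notes.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.normal(5.0, 2.0, size=15)                # hypothetical data
    n, xbar = len(x), x.mean()
    s2 = np.sum((x - xbar) ** 2) / (n - 1)

    theta = np.linspace(xbar - 5, xbar + 5, 2001)
    unnorm = (np.sum((x - xbar) ** 2) + n * (xbar - theta) ** 2) ** (-n / 2)
    post_numeric = unnorm / np.trapz(unnorm, theta)

    # Student t with n-1 d.f., location x-bar, scale sqrt(s^2/n)
    post_t = stats.t.pdf(theta, df=n - 1, loc=xbar, scale=np.sqrt(s2 / n))
    print(np.max(np.abs(post_numeric - post_t)))     # small discretisation error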