Treatment and analysis of data – Applied statistics
Lecture 6: Bayesian estimation
Topics covered:
• Bayes' theorem again
• Relation to Likelihood
• Transformation of pdf
• A trivial example
• Wiener filter
• Malmquist bias
• Lutz-Kelker bias
• Bayes versus Likelihood
• The bus problem
Sept-Oct 2006, Statistics for astronomers (L. Lindegren, Lund Observatory)
Bayesian estimation
Thomas Bayes (1702-1761), British mathematician, Presbyterian minister and Fellow of the Royal Society.
The manuscript Essay towards solving a problem in the doctrine of chances was found after his death and published in 1763.
It establishes a mathematical basis for probability inference by:
• treating model parameters as random variables with a prior distribution
• prescribing how the distribution is modified by data (Bayes' theorem)
• basing the inference on the resulting posterior distribution
Bayes' theorem
P(A&B) = P(A)P(B|A) = P(B)P(A|B)   ⇒   P(A|B) = P(A)P(B|A)/P(B)

With A = model (M) and B = data (D):

P(M|D) = P(M)P(D|M)/P(D)

where
P(M)   = prior probability of M (before D)
P(M|D) = posterior probability of M (in light of D)
P(D|M) = likelihood of M (given D)
P(D)   = fixed [only needed to normalize P(M|D)]
In pdf form, for a continuous parameter θ with data x:

fΘ(θ|x) = fX(x|θ) fΘ(θ) / ∫ fX(x|θ) fΘ(θ) dθ   ∝   L(θ|x) fΘ(θ)
Relation to Likelihood
fΘ(θ|x) ∝ fΘ(θ) × L(θ|x)
(posterior ∝ prior × likelihood)
Treating θ as a random variable is still (after 240 years!) a somewhat controversial issue, and the choice of prior distribution is often seen as problematic ("subjective").
If the prior pdf is flat (over a reasonable interval of θ) then
“maximum a posteriori” (MAP) is equivalent to “maximum likelihood” (ML).
A slanted or peaked prior may push the MAP away from the ML.
If the data do not determine the parameter well (wide likelihood function),
then the posterior depends strongly on the prior.
Conversely, for well-determined problems, the prior has little influence.
Note that Bayes' theorem gives a pdf for θ, not a single value (point estimate).
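To make these points concrete, here is a minimal numerical sketch (my addition, not part of the lecture): it evaluates the log-likelihood of a Gaussian mean on a grid and compares the MAP estimate under a flat prior, which coincides with the ML estimate, with the MAP under a peaked prior, which is pulled towards the prior mean. The data values and the prior width are arbitrary assumptions.

import numpy as np

# Three hypothetical measurements of an unknown mean theta, with known error s.d.
x = np.array([1.8, 2.3, 2.1])
sigma = 0.5

theta = np.linspace(-2.0, 6.0, 8001)          # parameter grid

# log L(theta | x) for independent Gaussian errors (additive constants dropped)
loglike = -0.5 * np.sum((x[:, None] - theta[None, :])**2, axis=0) / sigma**2

# Flat prior: the MAP coincides with the ML estimate (the sample mean, ~2.07)
map_flat = theta[np.argmax(loglike)]

# Peaked prior N(0, 1): the MAP is pulled from the ML estimate towards 0
logprior = -0.5 * theta**2
map_peaked = theta[np.argmax(loglike + logprior)]

print(map_flat, map_peaked)                   # ~2.07 versus ~1.91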
Transformation of pdf
Let X be a random variable with pdf fX (x) and Y another random variable
obtained by the transformation Y = g(X) where g is some known function.
What is the pdf fY (y) of the transformed variable?
The general case is complex, and fY (y) may not even exist.
However, if g is continuous and monotone, the answer is simple:
fX(x)·|dx| = fY(y)·|dy|   ⇔   fY(y) = fX(x)·|dx/dy| = fX(x)·|g′(x)|⁻¹

Note that the absolute value | · | is needed in case g is decreasing. Multivariate case:

Y = g(X)   ⇒   fY(y) = fX(x)·|det(∂g/∂xᵀ)|⁻¹
The determinant is the Jacobian of the transformation.
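As a sanity check of the transformation formula (my addition), the sketch below compares the analytic pdf of Y = g(X) = exp(X), with X ~ N(0,1), against a histogram of transformed samples; the choice of g is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)

# X ~ N(0,1) and Y = g(X) = exp(X), a monotone (increasing) transformation
x_samples = rng.normal(size=200_000)
y_samples = np.exp(x_samples)

# Analytic pdf from the formula: fY(y) = fX(x)·|dx/dy| with x = ln y, dx/dy = 1/y
edges = np.linspace(0.5, 5.0, 46)
centers = 0.5 * (edges[:-1] + edges[1:])
f_y = np.exp(-0.5 * np.log(centers)**2) / (np.sqrt(2.0 * np.pi) * centers)

# Empirical density of the transformed samples on the same bins
hist, _ = np.histogram(y_samples, bins=edges, density=True)

print(np.max(np.abs(hist - f_y)))   # small: the formula matches the simulation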
A trivial (?) example (1/2)
Suppose we want to measure the intensity λ of a source by counting the number of
photons, n, detected in a certain time interval. We assume that n ~ Poisson(λ).
Given n = 10, what is the estimate of λ?
L(λ|n) = λ^n exp(−λ) / n!   ⇒   MLE λ* = n = 10   (reasonable!)
Bayesian estimation (MAP) gives the same result if the prior distribution is flat
between (say) 0 and 100.
But if the prior state of knowledge is that we have no idea even about the order of magnitude of λ, then it can be argued that the prior pdf should be flat in log λ rather than in λ. (Cf. the frequency of first digits in natural constants!) This implies a prior pdf inversely proportional to λ; thus
posterior pdf ∝ λ⁻¹ L(λ|n) = λ^(n−1) exp(−λ)
which has a maximum at λ = n – 1. Thus, the Bayesian MAP estimate is 9.
(This gets really weird when n = 1.)
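The same results can be checked numerically. A minimal sketch (my addition) that evaluates both posteriors on a grid for n = 10:

import numpy as np

n = 10
lam = np.linspace(1e-3, 40.0, 40001)        # grid for the parameter lambda

loglike = n * np.log(lam) - lam             # log L(lambda | n), dropping the n! term

# Flat prior in lambda: the MAP equals the maximum-likelihood estimate
map_flat = lam[np.argmax(loglike)]

# Prior flat in log(lambda), i.e. f(lambda) proportional to 1/lambda
map_logflat = lam[np.argmax(loglike - np.log(lam))]

print(map_flat)      # ~ 10  (= n)
print(map_logflat)   # ~ 9   (= n - 1)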
A trivial (?) example (2/2)
But is the MAP (maximum a posteriori) estimate really what we want? An
alternative in Bayesian estimation is to compute the posterior mean.
Using that ∫₀^∞ λ^n exp(−λ) dλ = n! we find
• for prior ∝ λ⁰ :   E(λ | n) = n + 1
• for prior ∝ λ⁻¹ :   E(λ | n) = n

The MAP estimate and the posterior mean are not invariant to transformation of λ (while the MLE is). There is yet another Bayesian estimate which is invariant to transformation, namely the posterior median. It is more complicated to compute, but to a good approximation we have

• for prior ∝ λ⁰ :   median(λ | n) = n + 2/3
• for prior ∝ λ⁻¹ :   median(λ | n) = n − 1/3   (n > 0)
With different estimators we get anything between n − 1 and n + 1, but does it matter?
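It is easy to check these numbers, since both posteriors are Gamma distributions: a flat prior gives Gamma(n+1, 1) and the 1/λ prior gives Gamma(n, 1). A short sketch (my addition) using SciPy:

from scipy.stats import gamma

n = 10

flat = gamma(a=n + 1)      # posterior for prior ∝ λ^0 :  Gamma(n+1, 1)
logflat = gamma(a=n)       # posterior for prior ∝ 1/λ :  Gamma(n, 1)

print(flat.mean(), flat.median())        # 11.0, ~10.67  (n+1, ~n+2/3)
print(logflat.mean(), logflat.median())  # 10.0, ~9.67   (n,   ~n-1/3)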
A more interesting example: Wiener filter (1/2)
Suppose we observe a continuous variable y = x + ε, where x and ε are independent
Gaussian random variables with zero mean and s.d. s (for signal) and n (for noise):
x ~ N(0, s²),   ε ~ N(0, n²)
Given the value y, what is the estimate of x?
Prior pdf for x:       fX(x) = (2π)^(−1/2) s⁻¹ exp[−x²/2s²]
Likelihood function:   L(x | y) = (2π)^(−1/2) n⁻¹ exp[−(y−x)²/2n²]
Posterior pdf for x:   fX(x | y) ∝ exp[−x²/2s² − (y−x)²/2n²]

Completing the square, we find that fX(x | y) is Gaussian with variance s²n²/(s² + n²) and mean value x̂ = [s²/(s² + n²)] y, which is the Bayesian estimate of x.

(A Wiener filter has the transfer function R²/(1 + R²), where R = S/N.)
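A quick Monte Carlo check (my addition, with arbitrary s = 2 and n = 1): the shrinkage estimate x̂ = [s²/(s² + n²)] y should have mean squared error s²n²/(s² + n²), smaller than the MSE n² of the naive estimate x̂ = y.

import numpy as np

rng = np.random.default_rng(1)
s, n = 2.0, 1.0                       # signal and noise standard deviations
N = 100_000

x = rng.normal(0.0, s, N)             # true signal  x ~ N(0, s^2)
y = x + rng.normal(0.0, n, N)         # observation  y = x + eps,  eps ~ N(0, n^2)

x_hat = (s**2 / (s**2 + n**2)) * y    # posterior mean / Wiener estimate

print(np.mean((y - x)**2))            # ~ n^2 = 1.0                  (naive estimate)
print(np.mean((x_hat - x)**2))        # ~ s^2 n^2/(s^2 + n^2) = 0.8  (Wiener estimate)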
Wiener filter (2/2)
Sometimes it is helpful to visualize Bayes' theorem by means of the joint pdf of the parameter (x) and the data (y):
[Figure: equiprobability curves of the joint pdf of the parameter x and the data y = x + ε; the line y = x, an observed value of y, and the corresponding estimate x̂ are indicated.]
Another example: Malmquist bias (1/3)
If a class of objects has an intrinsic spread in luminosity (or absolute magnitude
M), and we pick at random an object on the sky with apparent magnitude m, then
that object is likely to be more luminous than is typical for the class.
Malmquist (Lund Medd. Ser. II, No. 22, 1920) derived the required correction to
the mean observed M as function of the intrinsic spread in M and the observed
distribution of m.
In Bayesian terms, the effect can be understood as the difference between the
prior (intrinsic or true) distribution of M, and the posterior (apparent) distribution
of M for given m.
For a certain class of objects, assume for simplicity:
1. that the intrinsic luminosity function is M ~ N(M0, σ²);
2. that the objects are on average uniformly distributed in space;
3. that there is no extinction.
Let x = m − M = 5 log (r/10 pc) denote the distance modulus.
Malmquist bias (2/3)
Assumptions 2 and 3 imply that the distance modulus x = m − M = 5 log(r/10 pc) has the (improper) pdf fX(x) ∝ 10^(0.6x). Thus:

f(m | M) = fX(m − M) ∝ 10^(0.6(m−M)) = exp[γ(m − M)]     (γ = 0.6 ln 10 = 1.38...)

f(M) ∝ exp[−(M − M0)²/2σ²]

f(M | m) ∝ exp[γ(m − M)] × exp[−(M − M0)²/2σ²]
         = exp[γ(m − M) − (M − M0)²/2σ²]
         = exp[γ(m − M0) + γ²σ²/2] × exp[−(M − M0 + γσ²)²/2σ²]

thus f(M | m) ~ N(M0 − γσ², σ²), so the mean absolute magnitude of the objects with apparent magnitude m is M̄ = M0 − γσ².
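The Malmquist shift can be verified by simulation. The sketch below (my addition; M0, σ, the distance range and the reference magnitude are arbitrary choices) draws objects uniformly in space with M ~ N(M0, σ²) and checks that those found near a fixed apparent magnitude are on average brighter than M0 by about γσ².

import numpy as np

rng = np.random.default_rng(2)
M0, sigma = 0.0, 0.5
gamma_ = 0.6 * np.log(10.0)                     # ~ 1.3816

N = 2_000_000
# Uniform space density inside a sphere of radius 1000 pc: p(r) ∝ r^2
r = 1000.0 * rng.uniform(size=N) ** (1.0 / 3.0)
M = rng.normal(M0, sigma, N)                    # intrinsic luminosity function
m = M + 5.0 * np.log10(r / 10.0)                # apparent magnitudes

sel = np.abs(m - 5.0) < 0.1                     # objects observed near m = 5
print(M[sel].mean() - M0)                       # simulated shift (a few % Monte Carlo noise)
print(-gamma_ * sigma**2)                       # predicted shift: -γσ² ≈ -0.35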
Malmquist bias (3/3)
[Figure: equiprobability curves of the joint distribution of M and m, with a line of constant x (constant distance modulus); at a given apparent magnitude m the mean absolute magnitude M̄ is offset from M0.]
Yet another example: Lutz-Kelker bias (1/3)
Let p0 be the true parallax of a star and p the measured value. Assume that
the measurement errors are Gaussian with zero mean and s.d. σ. Then, for any
given star (with true parallax = p0 ),
P( p < p0 | p0 ) = P( p > p0 | p0 )
i.e., positive and negative errors are equally probable.
Now consider instead any given measured parallax value p. Then, in general,
P( p0 < p | p ) ≠ P( p0 > p | p )
that is, positive and negative errors are not equally probable!
This may at first seem paradoxical, but a single example may be enough to
make the statement credible: it is possible to obtain a negative value of the
measured parallax, in which case P( p0 < p | p ) = 0 and P( p0 > p | p ) = 1.
Lutz-Kelker bias (2/3)
Lutz & Kelker (PASP 85, 573, 1973) discussed the use of trigonometric
parallaxes for luminosity calibration and derived a systematic correction
depending on the relative parallax error (σ/p).
In a stellar sample selected according to a lower limit on the observed parallax,
the sample mean parallax is systematically too large, because the random errors
will scatter more stars into the volume (with positive errors) than out of it (with
negative errors).
The effect can be formulated in Bayesian terms. Let us assume
1. that the observed parallax has the distribution p ~ N(p0, σ²);
2. that the number density (n, in pc⁻³) of stars of a given class decreases exponentially with the height z above the Galactic plane: n = n0 exp(−|z|/H), where H is the scale height;
3. that there is no extinction.
As a function of distance r = 1/p0 we have n(r) = n0 exp(−βr), where β = |sin b|/H.
Lutz-Kelker bias (3/3)
The number of stars in solid angle ω with distance between r and r + dr is dN = ω r² n(r) dr.
With r = p0⁻¹ we have |dr| = p0⁻² |dp0|, so dN/dp0 ∝ p0⁻⁴ exp(−β/p0). Thus:

f(p0)     ∝ p0⁻⁴ exp(−β/p0)                      (prior)
f(p | p0) ∝ exp[−(p − p0)²/2σ²]                  (likelihood)
f(p0 | p) ∝ p0⁻⁴ exp[−β/p0 − (p − p0)²/2σ²]      (posterior)
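A grid-based sketch (my addition) of this posterior for the second case shown in the figure below (p = 10, σ = 2, β = 10); both the MAP estimate and the posterior mean fall below the measured parallax, which is the Lutz-Kelker effect.

import numpy as np

p, sigma, beta = 10.0, 2.0, 10.0         # measured parallax, its s.d., and beta = |sin b|/H

p0 = np.linspace(0.5, 25.0, 24501)       # grid of true-parallax values
dp = p0[1] - p0[0]

log_prior = -4.0 * np.log(p0) - beta / p0
log_like = -0.5 * (p - p0)**2 / sigma**2
log_post = log_prior + log_like

post = np.exp(log_post - log_post.max())
post /= post.sum() * dp                  # normalized posterior on the grid

print(p0[np.argmax(log_like)])           # ML estimate: the measured value, 10
print(p0[np.argmax(log_post)])           # MAP estimate: below 10 (about 8.7)
print(np.sum(p0 * post) * dp)            # posterior mean: also below 10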
[Figure: likelihood L(p0), prior P(p0) and Bayesian posterior B(p0) as functions of p0 for p = 10 and β = 10, with σ = 1 (left panel) and σ = 2 (right panel).]
Bayes versus Likelihood
Let D1 and D2 be two independent data sets relevant to the same model M. Since
the data sets are independent, the total likelihood is
L(M|D1,D2) = L(M|D1) × L(M|D2)
If we regard D1 as representing the knowledge about M before introducing D2,
we have, after renormalization, essentially Bayes' theorem.
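A minimal sketch (my addition, reusing the Poisson example from earlier) confirming that updating with D1 first and then with D2 gives the same posterior as applying the combined likelihood in a single step:

import numpy as np

lam = np.linspace(1e-3, 40.0, 4001)            # grid for the common parameter lambda
dlam = lam[1] - lam[0]
prior = np.ones_like(lam)                      # flat prior (unnormalized)

def update(pdf, n):
    # One Bayesian update with an independent Poisson count n, renormalized on the grid
    post = pdf * lam**n * np.exp(-lam)
    return post / (post.sum() * dlam)

n1, n2 = 10, 7                                 # two independent data sets D1, D2

sequential = update(update(prior, n1), n2)     # D1 first, then D2

joint = prior * lam**(n1 + n2) * np.exp(-2.0 * lam)   # combined likelihood in one step
joint /= joint.sum() * dlam

print(np.max(np.abs(sequential - joint)))      # ~ 0: the two routes agree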
The nice thing about Bayesian theory is that it gives a framework for treating the
prior and posterior knowledge on exactly the same footing.
The Bayesian approach also encourages us to think about the a priori
assumptions in any experiment, which is probably a good thing.
Acknowledgement:
In this lecture I have made use of some good ideas from Ned Wright's Journal
Club Talk on Statistics, www.astro.ucla.edu/~wright/statistics/