Treatment and analysis of data – Applied statistics

Lecture 6: Bayesian estimation

Topics covered: Bayes' theorem again; relation to likelihood; transformation of pdf; a trivial example; Wiener filter; Malmquist bias; Lutz-Kelker bias; Bayes versus likelihood; the bus problem.

Sept-Oct 2006, Statistics for astronomers (L. Lindegren, Lund Observatory), Lecture 6

Bayesian estimation

Thomas Bayes (1702-1761), British mathematician, Presbyterian minister and Fellow of the Royal Society. The manuscript "Essay towards solving a problem in the doctrine of chances" was found after his death and published in 1763. It establishes a mathematical basis for probability inference by:
- treating model parameters as random variables with a prior distribution
- prescribing how that distribution is modified by data (Bayes' theorem)
- basing the inference on the resulting posterior distribution

Bayes' theorem

P(A&B) = P(A)P(B|A) = P(B)P(A|B)  ⇒  P(A|B) = P(A)P(B|A)/P(B)

With A = model (M) and B = data (D):

P(M|D) = P(M)P(D|M)/P(D)

where
P(M)   = prior probability of M (before D)
P(M|D) = posterior probability of M (in light of D)
P(D|M) = likelihood of M (given D)
P(D)   = fixed [only needed to normalize P(M|D)]

In terms of pdfs, with parameter θ and data x:

fΘ(θ|x) = fX(x|θ) fΘ(θ) / ∫ fX(x|θ) fΘ(θ) dθ  ∝  L(θ|x) fΘ(θ)

Relation to likelihood

fΘ(θ|x) ∝ fΘ(θ) × L(θ|x)
(posterior ∝ prior × likelihood)

Treating θ as a random variable is still (after 240 years!) a somewhat controversial issue. Also the choice of prior distribution is seen as problematic ("subjective").

If the prior pdf is flat (over a reasonable interval of θ), then "maximum a posteriori" (MAP) estimation is equivalent to "maximum likelihood" (ML). A slanted or peaked prior may push the MAP estimate away from the ML estimate.
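As a minimal numerical sketch of posterior ∝ prior × likelihood: the posterior can be formed on a grid and renormalized, and a peaked prior visibly pulls the MAP estimate away from the ML estimate. The Gaussian prior, the datum x = 1.2, and the unit variances here are all invented for illustration.

```python
import numpy as np

# Grid of parameter values theta; a single datum assumed to be x ~ N(theta, 1).
theta = np.linspace(-5.0, 5.0, 2001)
x_obs = 1.2

# Gaussian likelihood L(theta|x) and an illustrative N(0,1) prior f(theta).
likelihood = np.exp(-0.5 * (x_obs - theta) ** 2)
prior = np.exp(-0.5 * theta ** 2)

# Posterior = prior * likelihood, normalized on the grid (the denominator
# in Bayes' theorem is just this normalizing integral).
dtheta = theta[1] - theta[0]
posterior = prior * likelihood
posterior /= posterior.sum() * dtheta

theta_ml = theta[np.argmax(likelihood)]   # ML estimate: x_obs itself
theta_map = theta[np.argmax(posterior)]   # MAP estimate, pulled toward zero
# Analytically the MAP here is x_obs/2, since prior and likelihood
# have equal width.
```

With a flat prior instead of the N(0,1) one, the MAP and ML estimates coincide, as stated above.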
If the data do not determine the parameter well (wide likelihood function), the posterior depends strongly on the prior. Conversely, for well-determined problems the prior has little influence.

Note that Bayes' theorem gives a pdf for θ, not a point estimate.

Transformation of pdf

Let X be a random variable with pdf fX(x), and Y another random variable obtained by the transformation Y = g(X), where g is some known function. What is the pdf fY(y) of the transformed variable?

The general case is complex, and fY(y) may not even exist. However, if g is continuous and monotone the answer is simple:

fX(x) |dx| = fY(y) |dy|  ⇔  fY(y) = fX(x) |dx/dy| = fX(x) |g′(x)|^−1

Note that the absolute value is needed in case g is decreasing.

Multivariate case:  Y = g(X)  ⇒  fY(y) = fX(x) |det(∂g/∂x^T)|^−1

The determinant is the Jacobian of the transformation.

A trivial (?) example (1/2)

Suppose we want to measure the intensity λ of a source by counting the number of photons, n, detected in a certain time interval. We assume that n ~ Poisson(λ). Given n = 10, what is the estimate of λ?

L(λ|n) = λ^n exp(−λ) / n!  ⇒  MLE λ* = n = 10 (reasonable!)

Bayesian estimation (MAP) gives the same result if the prior distribution is flat between (say) 0 and 100. But if the prior state of knowledge is that we have no idea even about the order of magnitude of λ, then it can be argued that the prior pdf should be flat in log λ rather than in λ. (Cf. the frequency of first digits in natural constants!) This implies a prior pdf inversely proportional to λ; thus

posterior pdf ∝ λ^−1 L(λ|n) = λ^(n−1) exp(−λ)

which has a maximum at λ = n − 1. Thus the Bayesian MAP estimate is 9. (This gets really weird when n = 1.)
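The two MAP results above (λ* = n for a flat prior, λ* = n − 1 for a prior ∝ 1/λ) can be verified on a grid; this sketch assumes n = 10 as in the text.

```python
import numpy as np

n = 10                                   # observed photon count
lam = np.linspace(0.01, 50.0, 5000)      # grid of intensity values

# Poisson log-likelihood of lam given n counts (constant n! dropped)
log_like = n * np.log(lam) - lam

# MAP with a flat prior in lam: maximize the likelihood itself -> lam = n
map_flat = lam[np.argmax(log_like)]

# MAP with prior ∝ 1/lam (flat in log lam): subtract log(lam) -> lam = n - 1
map_log = lam[np.argmax(log_like - np.log(lam))]
```

The grid step is 0.01, so both maxima land on (essentially) the analytic values 10 and 9.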
A trivial (?) example (2/2)

But is the MAP (maximum a posteriori) estimate really what we want? An alternative in Bayesian estimation is to compute the posterior mean. Using ∫ λ^n exp(−λ) dλ = n! we find
- for prior ∝ λ^0 :  E(λ|n) = n + 1
- for prior ∝ λ^−1 :  E(λ|n) = n

The MAP estimate and the posterior mean are not invariant under transformation of λ (while the MLE is). There is yet another Bayesian estimate which is invariant under transformation, namely the posterior median. It is more complicated to compute, but to a good approximation we have
- for prior ∝ λ^0 :  median(λ|n) = n + 2/3
- for prior ∝ λ^−1 :  median(λ|n) = n − 1/3  (n > 0)

With different estimators we get anything between n − 1 and n + 1, but does it matter?

A more interesting example: Wiener filter (1/2)

Suppose we observe a continuous variable y = x + ε, where x and ε are independent Gaussian random variables with zero mean and standard deviations s (for signal) and n (for noise): x ~ N(0, s²), ε ~ N(0, n²). Given the value y, what is the estimate of x?

Prior pdf for x:      fX(x) = (2π)^−1/2 s^−1 exp[−x²/2s²]
Likelihood function:  L(x|y) = (2π)^−1/2 n^−1 exp[−(y−x)²/2n²]
Posterior pdf for x:  fX(x|y) ∝ exp[−x²/2s² − (y−x)²/2n²]

Completing the square, we find that fX(x|y) is Gaussian with variance s²n²/(s² + n²) and mean value x̂ = y s²/(s² + n²), which is the Bayesian estimate of x. (A Wiener filter has the transfer function R²/(1 + R²), where R = S/N.)

Wiener filter (2/2)

Sometimes it is helpful to visualize Bayes' theorem by means of the joint pdf of the parameter (x) and the data (y). [Figure: equiprobability curves of the joint pdf of x and y = x + ε, with the line y = x; for an observed value of y, the conditional distribution in x gives the estimate x̂.]
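The shrinkage formula x̂ = y s²/(s² + n²) can be checked by simulation; the values s = 2, n = 1 and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
s, n = 2.0, 1.0                       # signal and noise standard deviations
x = rng.normal(0.0, s, 200_000)       # true signal values, x ~ N(0, s^2)
y = x + rng.normal(0.0, n, x.size)    # observed values, y = x + noise

# Bayesian (posterior-mean) estimate: shrink y by the Wiener factor
shrink = s**2 / (s**2 + n**2)         # here 4/5 = 0.8
x_hat = shrink * y

# Empirically, the best linear fit of x on y recovers the same factor,
# and the shrunk estimate has smaller mean squared error than raw y.
slope = np.sum(x * y) / np.sum(y * y)
```

With s = 2 and n = 1 the slope comes out near 0.8, and the MSE of x_hat approaches s²n²/(s² + n²) = 0.8, below the raw-measurement MSE n² = 1.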
Another example: Malmquist bias (1/3)

If a class of objects has an intrinsic spread in luminosity (or absolute magnitude M), and we pick at random an object on the sky with apparent magnitude m, then that object is likely to be more luminous than is typical for the class. Malmquist (Lund Medd. Ser. II, No. 22, 1920) derived the required correction to the mean observed M as a function of the intrinsic spread in M and the observed distribution of m.

In Bayesian terms, the effect can be understood as the difference between the prior (intrinsic or true) distribution of M and the posterior (apparent) distribution of M for given m.

For a certain class of objects, assume for simplicity:
1. that the intrinsic luminosity function is M ~ N(M0, σ²);
2. that the objects are on average uniformly distributed in space;
3. that there is no extinction.

Let x = m − M = 5 log(r/10 pc) denote the distance modulus.

Malmquist bias (2/3)

Assumptions 2 and 3 imply that the distance modulus x = m − M has the (improper) pdf fX(x) ∝ 10^0.6x. Thus:

f(m|M) = fX(m − M) ∝ 10^0.6(m−M) = exp[γ(m − M)]   (γ = 0.6 ln 10 = 1.38...)
f(M) ∝ exp[−(M − M0)²/2σ²]
f(M|m) ∝ exp[γ(m − M)] × exp[−(M − M0)²/2σ²]
       = exp[γ(m − M) − (M − M0)²/2σ²]
       = exp[γ(m − M0) + γ²σ²/2] × exp[−(M − M0 + γσ²)²/2σ²]

thus f(M|m) ~ N(M0 − γσ², σ²), so the mean absolute magnitude of the objects with apparent magnitude m is

⟨M⟩ = M0 − γσ²

Malmquist bias (3/3)

[Figure: equiprobability curves of the joint distribution of m and M, with lines of constant distance modulus x; at given m, the conditional mean ⟨M⟩ is shifted from M0.]

Yet another example: Lutz-Kelker bias (1/3)

Let p0 be the true parallax of a star and p the measured value. Assume that the measurement errors are Gaussian with zero mean and standard deviation σ.
Then, for any given star (with true parallax p0),

P(p < p0 | p0) = P(p > p0 | p0)

i.e. positive and negative errors are equally probable. Now consider instead any given measured parallax value p. Then, in general,

P(p0 < p | p) ≠ P(p0 > p | p)

that is, positive and negative errors are not equally probable! This may at first seem paradoxical, but a single example may be enough to make the statement credible: it is possible to obtain a negative value of the measured parallax, in which case P(p0 < p | p) = 0 and P(p0 > p | p) = 1.

Lutz-Kelker bias (2/3)

Lutz & Kelker (PASP 85, 573, 1973) discussed the use of trigonometric parallaxes for luminosity calibration and derived a systematic correction depending on the relative parallax error (σ/p). In a stellar sample selected according to a lower limit on the observed parallax, the sample mean parallax is systematically too large, because the random errors scatter more stars into the volume (with positive errors) than out of it (with negative errors).

The effect can be formulated in Bayesian terms. Let us assume
1. that the observed parallax has the distribution p ~ N(p0, σ²);
2. that the number density (n, in pc^−3) of stars of a given class decreases exponentially with the height z above the Galactic plane: n = n0 exp(−|z|/H), where H is the scale height;
3. that there is no extinction.

As a function of distance r = 1/p0 we have n(r) = n0 exp(−βr), where β = |sin b|/H.

Lutz-Kelker bias (3/3)

The number of stars in solid angle ω with distance between r and r + dr is dN = ω r² n(r) dr.
r = p0^−1 ⇒ |dr| = p0^−2 |dp0| ⇒ dN/dp0 ∝ p0^−4 exp(−β/p0), thus:

f(p0) ∝ p0^−4 exp(−β/p0)                         (prior)
f(p|p0) ∝ exp[−(p − p0)²/2σ²]                    (likelihood)
f(p0|p) ∝ p0^−4 exp[−β/p0 − (p − p0)²/2σ²]       (posterior)

[Figure: likelihood L(p0), prior P(p0) and posterior B(p0) as functions of p0, for p = 10 and β = 10, with σ = 1 (left panel) and σ = 2 (right panel); the posterior peaks at a smaller p0 than the likelihood.]

Bayes versus likelihood

Let D1 and D2 be two independent data sets relevant to the same model M. Since the data sets are independent, the total likelihood is

L(M|D1,D2) = L(M|D1) × L(M|D2)

If we regard D1 as representing the knowledge about M before introducing D2, we have, after renormalization, essentially Bayes' theorem.

The nice thing about Bayesian theory is that it gives a framework for treating the prior and posterior knowledge on exactly the same footing. The Bayesian approach also encourages us to think about the a priori assumptions in any experiment, which is probably a good thing.

Acknowledgement: In this lecture I have made use of some good ideas from Ned Wright's Journal Club Talk on Statistics, www.astro.ucla.edu/~wright/statistics/
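The equivalence between a joint analysis and sequential Bayesian updating (the posterior from D1 used as the prior for D2) can be sketched numerically; the Gaussian-mean model, the grid, and all numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 3.0
d1 = rng.normal(mu_true, 1.0, 50)    # first data set, unit Gaussian errors
d2 = rng.normal(mu_true, 1.0, 50)    # second, independent data set

theta = np.linspace(0.0, 6.0, 1201)  # grid for the unknown mean
dtheta = theta[1] - theta[0]

def log_like(data, theta):
    # Gaussian log-likelihood of the mean, summed over the data points
    return -0.5 * np.sum((data[:, None] - theta[None, :]) ** 2, axis=0)

def normalize(p):
    return p / (p.sum() * dtheta)

flat = np.ones_like(theta)

# Joint analysis: flat prior with the total likelihood L(M|D1) * L(M|D2)
joint = normalize(flat * np.exp(log_like(d1, theta) + log_like(d2, theta)))

# Sequential analysis: posterior from D1 serves as the prior for D2
post1 = normalize(flat * np.exp(log_like(d1, theta)))
sequential = normalize(post1 * np.exp(log_like(d2, theta)))
# joint and sequential agree to floating-point precision
```

With the flat prior, the peak of either posterior sits at the sample mean of the combined data, as maximum likelihood would also give.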