Recap: Finite-dimensional linear Gaussian statistical inverse problem

• The given data is $y_0 = M x_0 + \varepsilon_0$, where $M \in \mathbb{R}^{m\times n}$.
• The statistical model of the noise $\varepsilon$ is an $m$-dimensional Gaussian random vector distributed according to $N(0, C_\varepsilon)$, i.e.
$$f_\varepsilon(y) = \frac{1}{\sqrt{(2\pi)^m \det(C_\varepsilon)}}\, e^{-\frac{1}{2} y^T C_\varepsilon^{-1} y} \quad \text{for all } y \in \mathbb{R}^m.$$
• The statistical model of the unknown is an $n$-dimensional Gaussian random vector $X$ that is independent of $\varepsilon$ and distributed according to $N(0, C_X)$, i.e.
$$f_{pr}(x) = \frac{1}{\sqrt{(2\pi)^n \det(C_X)}}\, e^{-\frac{1}{2} x^T C_X^{-1} x} \quad \text{for all } x \in \mathbb{R}^n.$$
• The statistical model of the data is $Y = M X + \varepsilon$.
• The solution is the posterior pdf
$$f_{post}(x) = \frac{f_Y(y_0\,|\,X = x)\, f_{pr}(x)}{\int_{\mathbb{R}^n} f_Y(y_0\,|\,X = x)\, f_{pr}(x)\, dx}
= c_{y_0}\, e^{-\frac{1}{2}(y_0 - Mx)^T C_\varepsilon^{-1}(y_0 - Mx)}\, e^{-\frac{1}{2} x^T C_X^{-1} x},$$
which simplifies to
$$f_{post}(x) = \frac{1}{\sqrt{(2\pi)^n \det(C_{post})}}\, e^{-\frac{1}{2}(x - m_{post})^T C_{post}^{-1}(x - m_{post})},$$
where
$$m_{post} = \left(M^T C_\varepsilon^{-1} M + C_X^{-1}\right)^{-1} M^T C_\varepsilon^{-1} y_0
\quad\text{and}\quad
C_{post} = \left(M^T C_\varepsilon^{-1} M + C_X^{-1}\right)^{-1}.$$
• In more general cases, the unknown and the noise can have non-zero expectations, and the unknown and the noise need not be independent.

5.2.1 Likelihood function $f_Y(y_0\,|\,X = x)$

Consider a statistical inverse problem where the data $Y$ is an $m$-dimensional rv and the unknown $X$ is an $n$-dimensional rv.

Definition 34. Let $y_0 \in \mathbb{R}^m$ be a sample of $Y$. The function $x \mapsto f_Y(y_0\,|\,X = x)$ is called the likelihood function.

The likelihood function can contain information about
• inaccuracies due to external disturbances (noise),
• inaccuracies of the direct theory.

The case of an independent noise term

Let $X$ and $\varepsilon$ be independent random vectors and denote $Y = F(X) + \varepsilon$, where the forward mapping $F : \mathbb{R}^n \to \mathbb{R}^m$ is continuous. If the random vector $\varepsilon$ has a pdf, then the conditional pdf of $Y = F(X) + \varepsilon$ given $X = x$ is, by Corollary 5,
$$f_Y(y_0\,|\,X = x) = f_{\varepsilon + F(x)}(y_0) = f_\varepsilon(y_0 - F(x)). \tag{5.6}$$

Example 38 (CT scan). The unknown X-ray mass absorption coefficient $f = f(x', y')$ is approximated by
$$f(x', y') = \sum_{j=1}^n x_j \phi_j(x', y'), \quad (x', y') \in \mathbb{R}^2,$$
where $x = (x_1, \dots, x_n) \in \mathbb{R}^n$ contains the unknowns and the functions $\phi_j$ are fixed. The data can be (coarsely) modeled as a vector $y = (y_1, \dots, y_m)$ whose components are
$$y_i = \int_{C_i} f\, ds + \varepsilon_i = \sum_{j=1}^n \left(\int_{C_i} \phi_j\, ds\right) x_j + \varepsilon_i = (Mx)_i + \varepsilon_i,$$
where $i = 1, \dots, m$ and the random vector $\varepsilon$ is distributed according to $N(0, \delta I)$. Then we end up with the statistical inverse problem $Y = M X + \varepsilon$. When $X$ and $\varepsilon$ are taken to be statistically independent, the likelihood function is
$$f_Y(y_0\,|\,X = x) = \frac{1}{(2\pi\delta)^{m/2}}\, e^{-\frac{1}{2\delta}|y_0 - Mx|^2}.$$
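To make the linear Gaussian formulas of the recap concrete in the setting of Example 38, here is a minimal numerical sketch (assuming NumPy; the matrix $M$, the noise variance $\delta$ and the prior covariance are made-up illustration values, not taken from the notes) that evaluates $m_{post}$ and $C_{post}$ for a small problem of the form $Y = MX + \varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up small example: m = 5 measurements, n = 3 unknowns.
m, n = 5, 3
M = rng.standard_normal((m, n))      # forward matrix (illustrative)
delta = 0.01                         # noise variance, C_eps = delta * I
C_eps = delta * np.eye(m)
C_X = np.eye(n)                      # prior covariance, X ~ N(0, C_X)

x_true = rng.standard_normal(n)
y0 = M @ x_true + rng.multivariate_normal(np.zeros(m), C_eps)

# Posterior covariance and mean of the linear Gaussian model:
#   C_post = (M^T C_eps^{-1} M + C_X^{-1})^{-1}
#   m_post = C_post M^T C_eps^{-1} y0
A = M.T @ np.linalg.inv(C_eps) @ M + np.linalg.inv(C_X)
C_post = np.linalg.inv(A)
m_post = C_post @ M.T @ np.linalg.inv(C_eps) @ y0

print("posterior mean:", m_post)
print("posterior covariance diagonal:", np.diag(C_post))
```

For such small matrices the explicit inverses are harmless; in realistic problems one would instead solve the linear system $A\, m_{post} = M^T C_\varepsilon^{-1} y_0$.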
Model errors

Next, we allow model errors for the direct theory and the unknown.

Theorem 19. Let $Y$ be an $m$-dimensional rv, $X$ an $n$-dimensional rv and $U$ a $k$-dimensional rv such that the joint pdf $f_{(X,U)}$ is positive and the conditional pdfs $f_Y(y\,|\,(X,U) = (x,u))$ and $f_U(u\,|\,X = x)$ are given. Then the conditional pdf
$$f_Y(y\,|\,X = x) = \int_{\mathbb{R}^k} f_Y(y\,|\,(X,U) = (x,u))\, f_U(u\,|\,X = x)\, du$$
whenever $f_X(x) > 0$.

Proof. We need to determine
$$f_Y(y\,|\,X = x) = \frac{f_{(X,Y)}(x,y)}{f_X(x)}.$$
By definition, the marginal pdf
$$f_{(X,Y)}(x,y) = \int_{\mathbb{R}^k} f_{(X,Y,U)}(x,y,u)\, du,$$
where the integrand is determined by Theorem 16. Then
$$f_Y(y\,|\,X = x) = \int_{\mathbb{R}^k} \frac{f_{(X,Y,U)}(x,y,u)}{f_{(X,U)}(x,u)}\, \frac{f_{(X,U)}(x,u)}{f_X(x)}\, du,$$
which gives the claim by the definition of conditional pdfs.

Example 39 (Approximation error). Consider the statistical inverse problem $Y = F(X) + \varepsilon$, where the unknown $X$ and the noise $\varepsilon$ are statistically independent. For computational reasons, a high-dimensional $X$ is often approximated by a lower-dimensional rv $X_n$. Let us take $X_n = P_n X$, where $P_n : \mathbb{R}^N \to \mathbb{R}^N$ is an orthogonal projection onto some $n$-dimensional subspace of $\mathbb{R}^N$, where $n < N$ (and also $m < N$). Then
$$F(X) = F(X_n) + (F(X) - F(X_n)) =: F(X_n) + U,$$
which leads to
$$Y = F(X) + \varepsilon = F(X_n) + U + \varepsilon.$$
According to Theorem 19, the likelihood function for $X_n$ can be expressed as
$$f_Y(y\,|\,X_n = x) = \int_{\mathbb{R}^m} f_U(u\,|\,X_n = x)\, f_\varepsilon(y - F(x) - u)\, du, \tag{5.7}$$
whenever the assumptions of the theorem are fulfilled. In particular, $f_U(u\,|\,X_n = x)$ needs to be available. The integral (5.7) is often computationally costly. One approximation is to replace $U$ by a rv $\widetilde U$ that is similarly distributed but independent from $X$. When the prior distribution of $X$ is given, then $\widetilde U + \varepsilon$ has a known probability distribution. When this distribution has a pdf, then
$$f_Y(y\,|\,X_n = x) = f_{\varepsilon + \widetilde U}(y - F(x)).$$

Example 40 (Inaccuracies of the forward model). Let the forward model $F : \mathbb{R}^n \to \mathbb{R}^m$ be a linear mapping whose matrix $M = M^\sigma$ depends continuously on $\sigma \in \mathbb{R}$, where the value of $\sigma$ is not precisely known. For example, in image enhancement (Chapter 1.2) the blurring map
$$\widetilde m_{kl} = C_{kl} \sum_{i,j=1}^n e^{-\left(|k-i|^2/n^2 + |l-j|^2/n^2\right)/(2\sigma^2)}\, m_{ij}$$
contains such a parameter. Then we may model the inaccuracies of $\sigma$ with a probability distribution. Say $\sigma$, $X$ and $\varepsilon$ are statistically independent and $f_\sigma(s)$ is the pdf of $\sigma$. Then
$$Y = M^\sigma X + \varepsilon = G(\sigma, X, \varepsilon)$$
is a random vector, since $G : \mathbb{R} \times \mathbb{R}^n \times \mathbb{R}^m \ni (s,x,z) \mapsto M^s x + z$ is continuous. By Theorem 17,
$$f_Y(y\,|\,(X, \sigma) = (x, s)) = f_{G(s,x,\varepsilon)}(y) = f_\varepsilon(y - M^s x).$$
Under the assumptions of Theorem 19, we have
$$f_Y(y\,|\,X = x) = \int_{\mathbb{R}} f_\varepsilon(y - M^s x)\, f_\sigma(s)\, ds.$$
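As a numerical illustration of the marginalized likelihood in Example 40, here is a small sketch (assuming NumPy and SciPy; the 1-D blur-type forward matrix, the noise level and the hyperprior on $\sigma$ are all made-up assumptions for illustration) that approximates $\int f_\varepsilon(y - M^s x)\, f_\sigma(s)\, ds$ by quadrature.

```python
import numpy as np
from scipy.stats import norm, gamma
from scipy.integrate import quad

# Illustrative 1-D deconvolution setup (all values made up).
n = 20
t = np.linspace(0.0, 1.0, n)

def forward_matrix(sigma):
    """Gaussian blur matrix M^sigma with kernel width sigma (illustrative)."""
    K = np.exp(-(t[:, None] - t[None, :])**2 / (2.0 * sigma**2))
    return K / K.sum(axis=1, keepdims=True)

noise_std = 0.05                         # std of eps ~ N(0, noise_std^2 I)
sigma_prior = gamma(a=2.0, scale=0.05)   # assumed hyperprior f_sigma

def likelihood_given_sigma(y, x, sigma):
    """f_eps(y - M^sigma x) for independent Gaussian noise."""
    r = y - forward_matrix(sigma) @ x
    return np.prod(norm.pdf(r, scale=noise_std))   # in practice, work with log-densities

def marginal_likelihood(y, x):
    """f_Y(y | X = x): quadrature of f_eps(y - M^s x) * f_sigma(s) over s (Theorem 19)."""
    integrand = lambda s: likelihood_given_sigma(y, x, s) * sigma_prior.pdf(s)
    value, _ = quad(integrand, 1e-4, 1.0)
    return value

# Example evaluation with synthetic data.
rng = np.random.default_rng(1)
x_true = np.exp(-(t - 0.5)**2 / 0.02)
y = forward_matrix(0.08) @ x_true + noise_std * rng.standard_normal(n)
print(marginal_likelihood(y, x_true))
```

The same quadrature idea applies to the approximation-error integral (5.7), with the hyperparameter replaced by the modelling-error variable $U$.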
5.2.2 The prior pdf $f_{pr}(x)$

The prior pdf represents the information that we have about the unknown, and it also describes our perception of the lack of information.

Assume that $x \in \mathbb{R}^n$ corresponds to the values of some unknown function $g$ at fixed points of $[0,1] \times [0,1]$, say $x_i = g(t_i)$, where $t_i \in [0,1] \times [0,1]$ when $i = 1, \dots, n$.

Possible prior information about the function $g$, and the corresponding information about the vector $x$:
• Some values of $g$ are known exactly or inexactly — some components of $x$ are known exactly or inexactly.
• Smoothness of $g$ — behavior of the neighboring components of $x$.
• The image of $g$ is known (e.g. $g \ge 0$, or monotonicity) — the subset where $x$ belongs is known (e.g. $x_i \ge 0$, $x_i \ge x_{i+1}$).
• Symmetry of $g$ — symmetry of $x$.
• Other restrictions for $g$ (e.g. if $g : \mathbb{R}^3 \to \mathbb{R}^3$ is a magnetic field, then $\nabla \cdot g \equiv 0$) — restrictions for $x$, equations $G(x) = 0$.

Possible statistical models for the unknown vector $x \in \mathbb{R}^n$, expressed through a rv $X : \Omega \to \mathbb{R}^n$:
• Some components of $x$ are known exactly or inexactly — $X_i = m_i + Z_i$, where the rv $Z_i$ represents the inaccuracy of $m_i$.
• The vectors that span $x$ are known, $x = \sum_{i=1}^{n'} a_i e_i$, $n' \le n$ — $X = \sum_{i=1}^{n'} Z_i e_i$, where $Z_i$ models the uncertainty of the coefficients.
• The behavior of neighboring components of $x$ — statistical dependencies between the components of $X$, i.e. the joint distribution of $X$.
• The subset containing $x$ is known, e.g. $x_i \ge 0$ — e.g. $P(\cap_i \{X_i \ge 0\}) = 1$.

5.3 Different prior pdfs

Let $X : \Omega \to \mathbb{R}^n$ be a random vector that models the unknown and let $f_{pr} : \mathbb{R}^n \to [0, \infty)$ denote its pdf. Next, we meet some pdfs that are often used as $f_{pr}$.

Uniform distribution

Let $B \subset \mathbb{R}^n$ be a closed and bounded hyper-rectangle
$$B = \{x \in \mathbb{R}^n : a_i \le x_i \le b_i,\ i = 1, \dots, n\},$$
where $a_i < b_i$ for $i = 1, \dots, n$. The random vector $X$ is uniformly distributed on $B$ if
$$f_{pr}(x) = \frac{1}{|B|} 1_B(x),$$
where the number $|C| := \int_C dx$.

• The unknown belongs to the set $B$, i.e. the $i$th component belongs to the interval $[a_i, b_i]$.
• Reflects almost complete uncertainty about the values of the unknown: we only know that they belong to $B$.
• The set $B$ needs to be bounded in order for $f_{pr}$ to be a proper pdf.
• The posterior pdf
$$f_{post}(x) = \frac{f_Y(y_0\,|\,X = x)\, 1_B(x)}{f_Y(y_0)\, |B|}$$
is the restricted and renormalized likelihood.

$\ell^1$-prior

Define the norm $\|\cdot\|_1$ by
$$\|x\|_1 = \sum_{i=1}^n |x_i|$$
for all $x \in \mathbb{R}^n$. A random vector $X$ has an $\ell^1$-prior if
$$f_{pr}(x) = \left(\frac{\alpha}{2}\right)^n e^{-\alpha \|x\|_1}.$$

• The components $X_i$ are statistically independent.
• The pdf $f_{X_i}$ is symmetric with respect to the origin and the expectation is zero.
• The parameter $\alpha$ reflects how certain we are that the unknown attains large values.

5.3.1 $\ell^2$-prior

A random vector $X$ has an $\ell^2$-prior if
$$f_{pr}(x) = \left(\frac{\alpha}{\pi}\right)^{n/2} e^{-\alpha |x|^2}.$$

• The components of $X$ are independent and normally distributed.

[Figure 5.5: Pdf of the 1-dimensional $\ell^1$-prior for $\alpha = 0.5, 1, 2$.]

[Figure 5.6: Pdf of the 1-dimensional $\ell^2$-prior for $\alpha = 0.5, 1, 2$.]

Cauchy prior

A random vector $X$ has a Cauchy prior if
$$f_{pr}(x) = \left(\frac{\alpha}{\pi}\right)^n \prod_{i=1}^n \frac{1}{1 + \alpha^2 x_i^2},$$
where $x \in \mathbb{R}^n$.

• The components $X_i$ are independent.
• The pdf $f_{X_i}$ is symmetric with respect to the origin.
• No expectation (large tail probabilities).
• Reflects best a situation where some of the components of the unknown can attain large values.

[Figure 5.7: Pdf of the 1-dimensional Cauchy prior for $\alpha = 0.5, 1, 2$.]

Discrete Markov fields

Let the unknown represent the values of some $n'$-variable function $f : \mathbb{R}^{n'} \to \mathbb{R}$ at points $t_i \in \mathbb{R}^{n'}$, $i = 1, \dots, n$. The neighborhoods $N_i \subset \{1, \dots, n\}$ of the indices $i \in \{1, \dots, n\}$ are sets such that
1. $i \notin N_i$,
2. $i \in N_j$ if and only if $j \in N_i$.

[Figure 5.8: Pdf of $N(0, 1)$, pdf of the 1-dimensional Cauchy prior, and pdf of the 1-dimensional $\ell^1$-prior.]

Definition 35. A random vector $X$ is a discrete Markov field with respect to the neighborhood system $N_i$, $i = 1, \dots, n$, if
$$f_{X_i}(x\,|\,(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)) = f_{X_i}(x\,|\,X_k = x_k\ \forall k \in N_i).$$

The components $X_i$ of a discrete Markov field depend only on the neighboring components $X_k$, $k \in N_i$.

Theorem 20 (Hammersley-Clifford). Let the rv $X : \Omega \to \mathbb{R}^n$ be a discrete Markov field with respect to the neighborhood system $N_i$, $i = 1, \dots, n$. If $X$ has a pdf $f_X > 0$, then
$$f_X(x) = c\, e^{-\sum_{i=1}^n V_i(x)},$$
where $V_i : \mathbb{R}^n \to \mathbb{R}$ depends only on $x_i$ and its neighboring components $x_k$, $k \in N_i$.

Example 41 (Total variation prior). Let the rv $X$ model an image consisting of $N \times N$ pixels so that the corresponding matrix is organized as an $n = N^2$-dimensional vector. The rv $X : \Omega \to \mathbb{R}^n$ is distributed according to the total variation prior if
$$f_{pr}(x) = c\, e^{-\sum_{j=1}^n V_j(x)},$$
where
$$V_j(x) = \alpha \sum_{i \in N_j} l_{ij} |x_i - x_j|$$
and the neighborhood $N_j$ of the index $j$ contains only the indices of those pixels $i$ that share an edge with the pixel $j$. Moreover, the number $l_{ij}$ is the length of the common edge between the pixels $i$ and $j$.

• The total variation $\sum_{j=1}^n \frac{1}{2} \sum_{i \in N_j} l_{ij} |x_i - x_j|$ is small if the color value $x_i$ of each pixel differs little from the values of its neighboring pixels, except possibly across pixel-set boundaries of very short total length.
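As a small computational sketch of Example 41 (assuming NumPy, unit-square pixels so that every shared edge has length $l_{ij} = 1$, and $\alpha$ left as a free parameter), the potential $\sum_j V_j(x)$ of the total variation prior can be evaluated as follows. The comparison at the end illustrates that a piecewise constant image has a much smaller potential, and hence a larger prior density, than a noisy one.

```python
import numpy as np

def tv_potential(x, N, alpha=1.0):
    """Sum of V_j(x) = alpha * sum_{i in N_j} l_ij |x_i - x_j| for an N x N image.

    Assumes unit pixels, so l_ij = 1 for every pair of pixels sharing an edge.
    Each edge contributes twice, once through V_i and once through V_j.
    """
    img = x.reshape(N, N)
    horiz = np.abs(np.diff(img, axis=1)).sum()   # differences across vertical edges
    vert = np.abs(np.diff(img, axis=0)).sum()    # differences across horizontal edges
    return alpha * 2.0 * (horiz + vert)          # factor 2: |x_i - x_j| appears in V_i and V_j

def log_tv_prior(x, N, alpha=1.0):
    """Unnormalised log prior: log f_pr(x) = -sum_j V_j(x) + const."""
    return -tv_potential(x, N, alpha)

# A piecewise-constant image versus a noisy one.
N = 16
flat = np.zeros((N, N))
flat[:, N // 2:] = 1.0                           # a single sharp edge
noisy = np.random.default_rng(2).standard_normal((N, N))
print(tv_potential(flat.ravel(), N), tv_potential(noisy.ravel(), N))
```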
Example 42 (1D Gaussian smoothness priors). Let $X$ be a rv that corresponds to the values of an unknown function $g$ at points $t_i \in [0,1]$, $i = 1, \dots, n$, where $0 = t_0 < t_1 < \dots < t_n < 1$ are equidistant points and $g(t) = 0$ for $t \le 0$. Fix the prior pdf of $X$ as
$$f_{pr}(x) = c\, e^{-\alpha\left(x_1^2 + \sum_{i=2}^n (x_i - x_{i-1})^2\right)}.$$

• The boundary component is forced to zero, i.e. $X_0 = g(0) \equiv 0$.
• If $\alpha$ is large, then the neighboring components of $X$ are more likely to be close to each other.
• A random walk model.

Similarly, also higher differences can be used. For example,
$$f_{pr}(x) = c\, e^{-\frac{1}{2a^4}\left(x_1^2 + (x_2 - 2x_1)^2 + \sum_{i=3}^n (x_i - 2x_{i-1} + x_{i-2})^2\right)}$$
corresponds to the second differences.

Example 43 (2D Gaussian smoothness priors). Let $g : [0,1]^2 \to \mathbb{R}$ be a continuous function such that $g = 0$ outside $[0,1]^2$. Let $X$ be the rv corresponding to the values of $g(t,s)$ at the points
$$\{t_i \in [0,1] \times [0,1] : i = 1, \dots, n^2\} = \left\{\left(\tfrac{k}{n}, \tfrac{j}{n}\right) : k, j = 1, \dots, n\right\}.$$
Set
$$f_{pr}(x) = c\, e^{-\alpha \sum_j V_j(x)},$$
where
$$V_j(x) = \Big|4 x_j - \sum_{i \in N_j} x_i\Big|^2$$
and $N_j$ contains only the indices of the points $t_i$ that are next to the point $t_j$ (above it, below it, to its left or to its right).

Positivity constraint

If we know that the unknown has non-negative components, then we may restrict and renormalize the pdf:
$$f_{pr}(x) = c\, f_+(x)\, f_X(x), \quad\text{where}\quad f_+(x) = \begin{cases} 1, & x_i \ge 0\ \forall i = 1, \dots, n, \\ 0, & \text{otherwise.} \end{cases}$$

Hierarchical priors

When the unknown is modeled as a random vector whose pdf depends continuously on a parameter $\sigma \in \mathbb{R}^{n'}$, it is possible to model the uncertainty of the parameter $\sigma$ by attaching a pdf to it.

Let $X : \Omega \to \mathbb{R}^n$ be the rv that models the unknown and let the pdf of $X$ be $f_X$. Let $\sigma : \Omega \to \mathbb{R}^{n'}$ be a rv that models the unknown parameter and let its pdf be $f_\sigma$. Assume that we have the conditional pdf of $X$ given $\sigma = s$, that is,
$$x \mapsto f_X(x\,|\,\sigma = s) = f_X^s(x)$$
is known for all $s \in \mathbb{R}^{n'}$. When the product $f_X^s(x) f_\sigma(s)$ is integrable, we have the joint distribution
$$f_{(X,\sigma)}(x, s) = f_X^s(x)\, f_\sigma(s).$$

Option 1) The unknown is modeled as a rv $X$ with pdf
$$f_{pr}(x) = \int f_X^s(x)\, f_\sigma(s)\, ds_1 \cdots ds_{n'}$$
(whenever the marginal exists). The corresponding posterior pdf is
$$f_{post}(x) = c\, f_Y(y\,|\,X = x)\, f_{pr}(x)$$
whenever $f_Y(y) > 0$.

Option 2) Also the hyperparameter $\sigma$ is taken to be part of the unknown, and as a prior pdf we set the joint pdf
$$f_{pr}(x, s) = f_X^s(x)\, f_\sigma(s),$$
which implies that the posterior pdf is
$$f_{post}(x, s) = c\, f_Y(y\,|\,(X,\sigma) = (x,s))\, f_{pr}(x, s) = c\, f_Y(y\,|\,X = x)\, f_{pr}(x, s)$$
whenever $f_Y(y) > 0$ (note that the likelihood function does not depend on $s$ but only on $x$).

In options 1 and 2 the prior pdf is called a hierarchical prior, the parameter $\sigma : \Omega \to \mathbb{R}^{n'}$ is called a hyperparameter, and its distribution a hyperprior.

Example 44. Let $X : \Omega \to \mathbb{R}^3$ be the rv that models the unknown and has pdf
$$f_{pr}(x; s) = \frac{\sqrt{s}}{(\sqrt{2\pi})^3} \exp\left(-\frac{1}{2} x_1^2 - \frac{s}{2}(x_2 - x_1)^2 - \frac{1}{2}(x_3 - x_2)^2\right),$$
where $s \in \mathbb{R}$ is an unknown parameter. We model this parameter as a random variable $\sigma : \Omega \to \mathbb{R}$ and denote $f_X(x\,|\,\sigma = s) = f_{pr}(x; s)$. As the hyperprior, we set
$$f_\sigma(s) = \lambda f_+(s) e^{-\lambda s},$$
where $\lambda > 0$, and $f_+(s) = 1$ for $s > 0$ and $0$ otherwise. Then
$$f_{(X,\sigma)}(x, s) = \frac{\sqrt{s}\,\lambda}{(\sqrt{2\pi})^3} f_+(s) \exp\left(-\frac{1}{2} x_1^2 - \frac{s}{2}(x_2 - x_1)^2 - \frac{1}{2}(x_3 - x_2)^2\right) e^{-\lambda s}$$
and
$$\begin{aligned}
f_X(x) &= \frac{\lambda}{(\sqrt{2\pi})^3} \exp\left(-\frac{1}{2} x_1^2 - \frac{1}{2}(x_3 - x_2)^2\right) \int_0^\infty \sqrt{s}\, \exp\left(-s\left(\tfrac{1}{2}(x_2 - x_1)^2 + \lambda\right)\right) ds \\
&= \frac{\lambda}{(\sqrt{2\pi})^3} \exp\left(-\frac{1}{2} x_1^2 - \frac{1}{2}(x_3 - x_2)^2\right) \frac{1}{\left(\tfrac{1}{2}(x_2 - x_1)^2 + \lambda\right)^{3/2}} \int_0^\infty s^{1/2} e^{-s}\, ds \\
&= \frac{\lambda\, \Gamma(3/2)}{(\sqrt{2\pi})^3}\, \frac{\exp\left(-\frac{1}{2} x_1^2 - \frac{1}{2}(x_3 - x_2)^2\right)}{\left(\tfrac{1}{2}(x_2 - x_1)^2 + \lambda\right)^{3/2}} \\
&= \frac{\lambda\, \exp\left(-\frac{1}{2} x_1^2 - \frac{1}{2}(x_3 - x_2)^2\right)}{2\pi\left((x_2 - x_1)^2 + 2\lambda\right)^{3/2}}.
\end{aligned}$$
The value of the Gamma function is $\Gamma(3/2) = \sqrt{\pi}/2$.

[Figure 5.9: Pdf $f(x) = \lambda/(x^2 + 2\lambda)^{3/2}$ for $\lambda = 0.3, 1, 2$.]

• The differences between the components of $X$ are independent.
• The difference $X_2 - X_1$ has a Cauchy-type distribution (a transformed Beta distribution), which gives a slightly lower probability to the occurrence of very large values.
• Uncertainty about the variance of $X_2 - X_1$ produces a distribution that allows large values with higher probability than the Gaussian distribution.

[Figure 5.10: Cauchy prior and the pdf $f(x) = \lambda/(x^2 + 2\lambda)^{3/2}$ (transformed Beta).]
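The closed form derived in Example 44 can be checked numerically. Below is a minimal sketch (assuming NumPy and SciPy; $\lambda$ and the test point $x$ are arbitrary illustration values) that compares the quadrature of $\int_0^\infty f_{pr}(x; s)\, f_\sigma(s)\, ds$ with the closed-form expression.

```python
import numpy as np
from scipy.integrate import quad

lam = 1.0                       # hyperprior rate lambda (illustrative value)
x = np.array([0.3, -0.7, 1.1])  # test point x = (x1, x2, x3)

def f_pr_given_s(x, s):
    """Conditional prior f_pr(x; s) of Example 44."""
    x1, x2, x3 = x
    return (np.sqrt(s) / (2.0 * np.pi)**1.5) * np.exp(
        -0.5 * x1**2 - 0.5 * s * (x2 - x1)**2 - 0.5 * (x3 - x2)**2)

def f_sigma(s):
    """Exponential hyperprior f_sigma(s) = lambda * exp(-lambda * s), s > 0."""
    return lam * np.exp(-lam * s)

# Option 1: marginal prior by numerical integration over the hyperparameter.
numerical, _ = quad(lambda s: f_pr_given_s(x, s) * f_sigma(s), 0.0, np.inf)

# Closed form derived in Example 44 (uses Gamma(3/2) = sqrt(pi)/2).
x1, x2, x3 = x
closed_form = lam * np.exp(-0.5 * x1**2 - 0.5 * (x3 - x2)**2) / (
    2.0 * np.pi * ((x2 - x1)**2 + 2.0 * lam)**1.5)

print(numerical, closed_form)   # the two values should agree to quadrature accuracy
```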
5.4 Studying the posterior distribution

5.4.1 Decision theory

Let the pdfs $f_{(X,Y)}$, $f_X > 0$ and $f_Y > 0$ exist and be continuous. Denote $f_{post}(x; y) = f_X(x\,|\,Y = y)$ when $y \in \mathbb{R}^m$.

The multidimensional function $f_{post}(x; y)$ can be very hard to visualize properly. Can we extract some information about the unknown on the basis of the posterior pdf? We turn our attention to the field of statistics that is called decision theory.

Decision theory answers the question: which function $h : \mathbb{R}^m \to \mathbb{R}^n$ is such that the vector $h(y)$ resembles the most (in some sense) the unknown $x$ that has produced the observation $y = F(x) + \varepsilon$? In statistics, the function $h$ is called an estimator and the value $h(y)$ an estimate.

Let us fix in what sense the estimator is best. We first fix a loss function $L : \mathbb{R}^n \times \mathbb{R}^n \to [0, \infty)$ that measures the accuracy of the estimate $h(y)$, when the unknown is $x$, as $L(x, h(y))$ (low values of $L$ mean accurate estimates). For example, we can take $L(x, h(y)) = |x - h(y)|^2$.

Assume that $L$ is fixed and $x \mapsto L(x, z) f_{post}(x)$ is integrable for all $z \in \mathbb{R}^n$. If $y \in \mathbb{R}^m$, then the value $h(y) \in \mathbb{R}^n$ of the estimator $h$ is chosen so that it minimizes the posterior expectation
$$\int_{\mathbb{R}^n} L(x, h(y))\, f_{post}(x; y)\, dx,$$
i.e.
$$h(y) = \underset{z \in \mathbb{R}^n}{\operatorname{argmin}} \int_{\mathbb{R}^n} L(x, z)\, f_{post}(x; y)\, dx.$$
When the data is $y$, we look for the $h(y)$ that gives the smallest possible posterior expectation.

The number
$$r(h) = \int_{\mathbb{R}^m} \left(\int_{\mathbb{R}^n} L(x, h(y))\, f_{post}(x; y)\, dx\right) f_Y(y)\, dy$$
is called the Bayes risk. An application of the Fubini theorem leads to
$$r(h) = \int_{\mathbb{R}^n} \left(\int_{\mathbb{R}^m} L(x, h(y))\, f_Y(y\,|\,X = x)\, dy\right) f_{pr}(x)\, dx.$$
The interpretation of the Bayes risk is that when the unknown is $X$ and the noisy data is $Y$, the Bayes risk $r(h)$ of the estimator $h$ is the expected loss with respect to the joint distribution of $X$ and $Y$, i.e. $r(h) = E[L(X, h(Y))]$.

Example 45 (CM estimate). Take $L(x, z) = |x - z|^2$ as the loss function. Let $m_{post}(y)$ denote the posterior expectation
$$m_{post}(y) = \int_{\mathbb{R}^n} x\, f_{post}(x; y)\, dx$$
and $C_{post}(y)$ the posterior covariance matrix
$$(C_{post}(y))_{ij} = \int_{\mathbb{R}^n} (x_i - (m_{post}(y))_i)(x_j - (m_{post}(y))_j)\, f_{post}(x; y)\, dx.$$
Then
$$\begin{aligned}
\int_{\mathbb{R}^n} L(x, h(y))\, f_{post}(x; y)\, dx
&= \int_{\mathbb{R}^n} |x - h(y)|^2\, f_{post}(x; y)\, dx \\
&= \int_{\mathbb{R}^n} |x - m_{post}(y) + m_{post}(y) - h(y)|^2\, f_{post}(x; y)\, dx \\
&= \int_{\mathbb{R}^n} \Big(|x - m_{post}(y)|^2 + 2 \sum_{i=1}^n (x - m_{post}(y))_i\, (m_{post}(y) - h(y))_i \\
&\qquad\qquad + |m_{post}(y) - h(y)|^2\Big)\, f_{post}(x; y)\, dx \\
&= \int_{\mathbb{R}^n} |x - m_{post}(y)|^2\, f_{post}(x; y)\, dx \\
&\qquad + 2 \sum_{i=1}^n (m_{post}(y) - h(y))_i \int_{\mathbb{R}^n} (x - m_{post}(y))_i\, f_{post}(x; y)\, dx \\
&\qquad + |m_{post}(y) - h(y)|^2 \int_{\mathbb{R}^n} f_{post}(x; y)\, dx \\
&= \int_{\mathbb{R}^n} |x - m_{post}(y)|^2\, f_{post}(x; y)\, dx + |m_{post}(y) - h(y)|^2.
\end{aligned}$$
The minimum loss is attained when $|m_{post}(y) - h(y)|^2 = 0$, i.e. $h(y) = m_{post}(y)$, so that
$$\int_{\mathbb{R}^n} L(x, h(y))\, f_{post}(x; y)\, dx = \sum_{i=1}^n (C_{post}(y))_{ii}.$$
In other words, the expectation of the loss function is the sum of the diagonal elements of the posterior covariance matrix, i.e. its trace. The posterior expectation is often denoted by $\hat x_{CM}$ (CM = conditional mean).
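A minimal computational sketch of the CM estimate (with made-up illustration values: a scalar unknown, Gaussian noise and a Cauchy prior, so that the posterior has no convenient closed form): the product of likelihood and prior is normalized on a grid, and the posterior mean and variance are computed by numerical integration.

```python
import numpy as np

# Scalar toy problem (illustrative values): y0 = x + eps, eps ~ N(0, 1), Cauchy prior on x.
y0 = 2.5
noise_std = 1.0
alpha = 1.0

x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]
likelihood = np.exp(-0.5 * ((y0 - x) / noise_std)**2)   # f_Y(y0 | X = x), up to a constant
prior = (alpha / np.pi) / (1.0 + (alpha * x)**2)         # 1-D Cauchy prior

post = likelihood * prior
post /= post.sum() * dx                                  # normalise the posterior on the grid

x_cm = (x * post).sum() * dx                             # CM estimate = posterior mean
var_post = ((x - x_cm)**2 * post).sum() * dx             # posterior variance
print(f"CM estimate {x_cm:.3f}, posterior variance {var_post:.3f}")
```

Grid-based integration of this kind is only feasible in very low dimensions; in higher dimensions the posterior expectation is typically approximated by sampling methods.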
Example 46 (MAP estimate). We say that a pdf is unimodal if its global maximum is attained at only one point. Let $\delta > 0$ and
$$L_\delta(x, z) = 1_{\bar B(z,\delta)^C}(x)$$
when $x, z \in \mathbb{R}^n$. Let $x \mapsto f_{post}(x; y)$ be unimodal for the given data $y \in \mathbb{R}^m$. The limit of the estimates
$$h_\delta(y) = \underset{z \in \mathbb{R}^n}{\operatorname{argmin}} \int_{\mathbb{R}^n} 1_{\bar B(z,\delta)^C}(x)\, f_{post}(x; y)\, dx
= \underset{z \in \mathbb{R}^n}{\operatorname{argmin}} \int_{\mathbb{R}^n \setminus \bar B(z,\delta)} f_{post}(x; y)\, dx$$
is
$$\lim_{\delta \to 0^+} h_\delta(y) = \hat x_{MAP}(y),$$
where
$$\hat x_{MAP}(y) = \underset{x \in \mathbb{R}^n}{\operatorname{argmax}}\, f_{post}(x; y).$$
The maximum a posteriori estimate $\hat x_{MAP}(y)$ is useful when expectations are hard to obtain. It can also be written as
$$\hat x_{MAP}(y) = \underset{x \in \mathbb{R}^n}{\operatorname{argmax}}\, f_Y(y\,|\,X = x)\, f_{pr}(x).$$
In the linear Gaussian case this maximization reduces to a Tikhonov-type minimization problem (see the sketch at the end of this section). The MAP estimate is often used also in situations where the posterior pdf is not unimodal, in which case it is not unique.

In addition to the estimates $\hat x$, we can also determine their componentwise Bayesian confidence intervals, by choosing $a$ e.g. in such a way that
$$P_{post}(|X_i - \hat x_i| \le a) = 1 - \alpha,$$
where $\alpha = 0.05$.

5.5 Recap

• About probability theory
  – The conditional pdf of a random vector $X$ given $Y = y$ (with marginal pdf $f_Y(y) > 0$) is
$$f_X(x\,|\,Y = y) = \frac{f_{(X,Y)}(x, y)}{f_Y(y)}.$$
  – Bayes' formula
$$f_X(x\,|\,Y = y)\, f_Y(y) = f_{(X,Y)}(x, y) = f_Y(y\,|\,X = x)\, f_X(x)$$
    holds for continuous pdfs (in case of discontinuities, only up to versions).

• Statistical inverse problem
  – The unknown and the data are modeled as random vectors $X$ and $Y$.
  – The probability distributions of $X$ and $Y$ represent quantitative and qualitative information about $X$ and $Y$, and the lack of such information.
  – The given data $y_0$ is a sample of $Y$, i.e. $y_0 = Y(\omega_0)$ for some elementary event $\omega_0 \in \Omega$.
  – The solution of a statistical inverse problem is the conditional pdf of $X$ given $Y = y_0$ (with $f_Y(y_0) > 0$).

• Posterior pdf
  – consists of the (normalized) product of the likelihood function $x \mapsto f_Y(y_0\,|\,X = x)$ and the prior pdf $x \mapsto f_{pr}(x)$.
  – can be used in determining estimates and confidence intervals for the unknown.

• Typical priors include Gaussian priors (especially smoothness priors), the $\ell^1$-prior, the Cauchy prior and the total variation prior (e.g. for 2D images).

Please learn:
• the definitions of the prior and the posterior pdf
• how to define the posterior pdf (up to the normalizing constant) when the unknown and the noise are statistically independent and the needed pdfs are continuous
• how to write the expressions for the posterior pdf, its mean and covariance, in the linear Gaussian case
• how to explain the connection between Tikhonov regularization and Gaussian linear inverse problems
• how to form the hierarchical prior pdf when the conditional pdf and the hyperprior are given
• the definition of the CM estimate as the conditional mean
• the definition of the MAP estimate as a maximizer of the posterior pdf
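To close, a hedged numerical sketch of the MAP estimate and of the Tikhonov connection mentioned in the learning goals (all sizes and values below are made-up assumptions): in the linear Gaussian case with $C_\varepsilon = \delta I$ and $C_X = \gamma I$, minimizing the negative log posterior gives the MAP estimate, and it coincides with the Tikhonov-regularized solution with regularization parameter $\alpha = \delta/\gamma$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Linear Gaussian toy problem (illustrative sizes and values).
m, n = 8, 4
M = rng.standard_normal((m, n))
delta = 0.05            # noise variance,  eps ~ N(0, delta I)
gamma = 1.0             # prior variance,  X   ~ N(0, gamma I)
x_true = rng.standard_normal(n)
y0 = M @ x_true + np.sqrt(delta) * rng.standard_normal(m)

# MAP estimate: maximise f_Y(y0 | X = x) f_pr(x), i.e. minimise the negative log posterior.
def neg_log_post(x):
    return 0.5 * np.sum((y0 - M @ x)**2) / delta + 0.5 * np.sum(x**2) / gamma

x_map = minimize(neg_log_post, np.zeros(n)).x

# Tikhonov-regularised solution with alpha = delta / gamma.
alpha = delta / gamma
x_tik = np.linalg.solve(M.T @ M + alpha * np.eye(n), M.T @ y0)

# The difference should be small (up to optimiser tolerance):
# in this Gaussian case the MAP estimate, the CM estimate and m_post all coincide.
print(np.max(np.abs(x_map - x_tik)))
```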