Lecture 5: Probability and Statistics

Please read Doug Martinson's Chapter 3, "Statistics," available on Courseworks.

Abstraction: a vector of N random variables, x, with joint probability density p(x), expectation x̄, and covariance Cx. (Shown as 2D in the figures, but actually N-dimensional.)

The multivariate normal distribution

    p(x) = (2π)^(-N/2) |Cx|^(-1/2) exp{ -(1/2) (x - x̄)^T Cx^(-1) (x - x̄) }

has expectation x̄ and covariance Cx, and is normalized to unit area.

Special case: a diagonal covariance matrix,

    Cx = diag( σ1², σ2², ..., σN² )

the uncorrelated case. Note that

    |Cx| = σ1² σ2² ... σN²   and   (x - x̄)^T Cx^(-1) (x - x̄) = Σ_i (x_i - x̄_i)² / σ_i²

so

    p(x) = Π_i (2π)^(-1/2) σ_i^(-1) exp{ -(x_i - x̄_i)² / (2σ_i²) }

which is the product of N individual one-variable normal distributions.

How would you show that this distribution really has expectation x̄ and covariance Cx? How would you prove it?

Do you remember how to transform an integral from x to y? Given y(x),

    ∫...∫ p(x) d^N x = ∫...∫ p[x(y)] |dx/dy| d^N y

where |dx/dy| is the Jacobian determinant, that is, the determinant of the matrix J whose elements are J_ij = dx_i/dy_j.

Here's how you prove the expectation. Insert p(x) into the usual formula for expectation:

    E(x) = (2π)^(-N/2) |Cx|^(-1/2) ∫...∫ x exp{ -(1/2) (x - x̄)^T Cx^(-1) (x - x̄) } d^N x

Now use the transformation y = Cx^(-1/2) (x - x̄), noting that the Jacobian determinant is |Cx|^(1/2):

    E(x) = (2π)^(-N/2) ∫...∫ (x̄ + Cx^(1/2) y) exp{ -(1/2) y^T y } d^N y
         = x̄ ∫...∫ (2π)^(-N/2) exp{ -(1/2) y^T y } d^N y  +  (2π)^(-N/2) Cx^(1/2) ∫...∫ y exp{ -(1/2) y^T y } d^N y

The first integral is the area under an N-dimensional Gaussian, which is just unity. The second integral contains an odd function of y times an even function, and so is zero. Thus

    E(x) = x̄ · 1 + 0 = x̄

I've never tried to prove the covariance ... but how much harder could it be?

[Figure: five example plots of p(x1, x2), each with x̄ = (2, 1) and covariances Cx = I; diag(2, 1); diag(1, 2); [[1, 0.5], [0.5, 1]]; and [[1, -0.5], [-0.5, 1]].]

Remember this from last lecture?
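The claim that the multivariate normal really has expectation x̄ and covariance Cx can also be checked numerically by Monte Carlo. This is a minimal sketch, assuming NumPy; the particular x̄ and Cx are taken from the examples above, and the sample size is arbitrary.

```python
import numpy as np

# Draw many samples from a 2D normal and compare the sample mean and
# sample covariance against the parameters we put in.
rng = np.random.default_rng(0)
xbar = np.array([2.0, 1.0])               # expectation, as in the examples
Cx = np.array([[1.0, 0.5], [0.5, 1.0]])   # covariance with correlation 0.5

x = rng.multivariate_normal(xbar, Cx, size=200_000)
mean_sampled = x.mean(axis=0)             # should approach xbar
cov_sampled = np.cov(x.T)                 # should approach Cx
print(np.round(mean_sampled, 2))
print(np.round(cov_sampled, 2))
```

With a couple hundred thousand samples the sample statistics agree with x̄ and Cx to two decimal places or so; the agreement improves like 1/√N, the same square-root law that appears later in this lecture.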
[Figure: p(x1, x2) and its two marginal distributions.]

    p(x1) = ∫ p(x1, x2) dx2   — the distribution of x1, irrespective of x2
    p(x2) = ∫ p(x1, x2) dx1   — the distribution of x2, irrespective of x1

Remember p(x, y) = p(x|y) p(y) = p(y|x) p(x) from the last lecture? We can compute the conditional distributions p(x|y) and p(y|x) as follows:

    p(x|y) = p(x, y) / p(y)
    p(y|x) = p(x, y) / p(x)

Any linear function of a normal distribution is a normal distribution. If

    p(x) = (2π)^(-N/2) |Cx|^(-1/2) exp{ -(1/2) (x - x̄)^T Cx^(-1) (x - x̄) }

and y = Mx, then

    p(y) = (2π)^(-N/2) |Cy|^(-1/2) exp{ -(1/2) (y - ȳ)^T Cy^(-1) (y - ȳ) }

with ȳ = M x̄ and Cy = M Cx M^T.

The proof needs the rules [AB]^(-1) = B^(-1) A^(-1), |AB| = |A| |B|, and |A^(-1)| = |A|^(-1). The transformation is p(y) = p[x(y)] |dx/dy|, with |dx/dy| the Jacobian determinant. Substitute in x = M^(-1) y and |dx/dy| = |M^(-1)|, and insert the identities I = M^(-1) M and I = M^T M^(-T) into the quadratic form:

    p[x(y)] |dx/dy| = (2π)^(-N/2) |Cx|^(-1/2) |M^(-1)| exp{ -(1/2) (x - x̄)^T M^T M^(-T) Cx^(-1) M^(-1) M (x - x̄) }
                    = (2π)^(-N/2) |M|^(-1/2) |Cx|^(-1/2) |M^T|^(-1/2) exp{ -(1/2) [M(x - x̄)]^T [M Cx M^T]^(-1) [M(x - x̄)] }
                    = (2π)^(-N/2) |Cy|^(-1/2) exp{ -(1/2) (y - ȳ)^T Cy^(-1) (y - ȳ) }

So p(y) = (2π)^(-N/2) |Cy|^(-1/2) exp{ -(1/2) (y - ȳ)^T Cy^(-1) (y - ȳ) }.

Note that these rules work for the multivariate normal distribution: if y is linearly related to x, y = Mx, then

    ȳ = M x̄          (rule for means)
    Cy = M Cx M^T    (rule for propagating error)

Do you remember this from a previous lecture? If d = Gm, then the standard least-squares solution is

    m_est = [G^T G]^(-1) G^T d

Let's suppose the data, d, are uncorrelated and that they all have the same variance, Cd = σd² I. To compute the variance of m_est, note that m_est = [G^T G]^(-1) G^T d is a linear rule of the form m = Md, with M = [G^T G]^(-1) G^T, so we can apply the rule Cm = M Cd M^T:

    Cm = M Cd M^T = {[G^T G]^(-1) G^T} σd² I {[G^T G]^(-1) G^T}^T
       = σd² [G^T G]^(-1) G^T G [G^T G]^(-T)
       = σd² [G^T G]^(-T)
       = σd² [G^T G]^(-1)

(G^T G is a symmetric matrix, so its inverse is symmetric, too.) Memorize!
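The error-propagation rule Cy = M Cx M^T can itself be verified by sampling. This is an illustrative sketch, assuming NumPy; the particular M and Cx here are made up for the demonstration.

```python
import numpy as np

# Push samples of x through the linear map y = M x and compare the
# sample covariance of y against the propagation rule M Cx M^T.
rng = np.random.default_rng(1)
Cx = np.array([[2.0, 0.3], [0.3, 1.0]])   # covariance of x (illustrative)
M = np.array([[1.0, 2.0], [0.0, 3.0]])    # linear map (illustrative)

x = rng.multivariate_normal(np.zeros(2), Cx, size=500_000)
y = x @ M.T                               # applies y = M x to each sample row
Cy_sampled = np.cov(y.T)
Cy_rule = M @ Cx @ M.T                    # rule for propagating error
print(np.round(Cy_sampled, 2))
print(np.round(Cy_rule, 2))
```

The two matrices agree to within sampling error, which is the content of the "rule for propagating error" used in the least-squares derivation that follows.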
Example: all the data are assumed to have the same true value, m1, and each is measured with the same variance, σd². Then d = Gm with

    d = [d1, d2, d3, ..., dN]^T,   G = [1, 1, 1, ..., 1]^T

so

    G^T G = N,   [G^T G]^(-1) = 1/N,   G^T d = Σ_i d_i

and

    m1_est = [G^T G]^(-1) G^T d = (Σ_i d_i) / N

... the traditional formula for the mean! The estimated mean has variance

    Cm = σd² / N = σm²,   and so   σm = σd / √N

The estimated mean is a normally-distributed random variable, and the width of this distribution, σm, decreases with the square root of the number of measurements. Accuracy grows only slowly with N.

[Figure: the distribution of the estimated mean narrowing as N = 1, 10, 100, 1000.]

Another example: fitting a straight line, with all the data assumed to have the same variance, σd². Then d = Gm with

    d = [d1, d2, d3, ..., dN]^T,   G = [1, x1; 1, x2; 1, x3; ...; 1, xN],   m = [m1, m2]^T

so

    G^T G = [ N,      Σ_i x_i  ;
              Σ_i x_i, Σ_i x_i² ]

and

    Cm = σd² [G^T G]^(-1) = ( σd² / (N Σ_i x_i² - [Σ_i x_i]²) ) [ Σ_i x_i², -Σ_i x_i ;
                                                                  -Σ_i x_i,  N        ]

The diagonal elements give the standard errors of the intercept and slope:

    σ²_intercept = σd² Σ_i x_i² / (N Σ_i x_i² - [Σ_i x_i]²)
    σ²_slope     = σd² N / (N Σ_i x_i² - [Σ_i x_i]²)

95% confidence intervals:

    intercept: m1_est ± 2 σ_intercept
    slope:     m2_est ± 2 σ_slope

Beware! These 95% confidence intervals are probabilities of m1 irrespective of the value of m2, and of m2 irrespective of the value of m1; they are not the joint probability of m1 and m2 taken together.

[Figure: contours of p(m1, m2). The probability that m2 is in the band m2_est ± 2σ2 is 95%, and the probability that m1 is in the band m1_est ± 2σ1 is 95%, but the probability that both m1 and m2 are in the box where the two bands intersect is less than 95%.]

From the form of Cm above, the intercept and slope are uncorrelated only when Σ_i x_i = 0, that is, when the mean of the x's is zero, which occurs when the data straddle the origin. Remember this discussion from a few lectures ago?

What σd² do you use in these formulas?
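The straight-line example can be sketched in a few lines of NumPy. The data here are synthetic (a line with known intercept 3.0 and slope 0.7 plus noise of known σd), so we can see that the recovered m_est and its standard errors behave as the formulas say; everything else follows the slides' notation.

```python
import numpy as np

# Synthetic straight-line data with known parameters (illustrative values).
rng = np.random.default_rng(2)
sigma_d = 0.5
x = np.linspace(0.0, 10.0, 50)
d = 3.0 + 0.7 * x + sigma_d * rng.standard_normal(x.size)

G = np.column_stack([np.ones_like(x), x])   # rows are [1, x_i]
m_est = np.linalg.solve(G.T @ G, G.T @ d)   # [G^T G]^(-1) G^T d
Cm = sigma_d**2 * np.linalg.inv(G.T @ G)    # Cm = sigma_d^2 [G^T G]^(-1)

sig_intercept, sig_slope = np.sqrt(np.diag(Cm))
print("intercept:", m_est[0], "+/-", 2 * sig_intercept)
print("slope:    ", m_est[1], "+/-", 2 * sig_slope)
```

Note that because the x's here are all positive (they do not straddle the origin), the off-diagonal element of Cm is nonzero: the intercept and slope estimates are correlated, exactly as the discussion above warns.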
Prior estimates of σd are based on knowledge of the limits of your measuring technique:

    "My ruler has only mm tics, so I'm going to assume that σd = 0.5 mm."
    "The manufacturer claims that the instrument is accurate to 0.1%, so since my typical measurement is 25, I'll assume σd = 0.025."

A posterior estimate of the error is based on the error measured with respect to the best fit:

    σd² = (1/N) Σ_i (d_i^obs - d_i^pre)² = (1/N) Σ_i e_i²

This is dangerous, because it assumes that the model ("a straight line") accurately represents the behavior of the data. Maybe the data really followed an exponential curve ...

One refinement to this formula has to do with the appearance of N, the number of data. If there were only three data, then the best-fitting straight line would likely have just a little error. If there were only two data, then the best-fitting straight line would have no error at all. Therefore this formula very likely underestimates the error. An improved formula replaces N with N - 2:

    σd² = (1/(N-2)) Σ_i (d_i^obs - d_i^pre)²

where the "2" is chosen because two points exactly define a straight line. More generally, if there are M model parameters, then the formula is

    σd² = (1/(N-M)) Σ_i (d_i^obs - d_i^pre)²

The quantity N - M is often called the number of degrees of freedom.
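The posterior error estimate and its degrees-of-freedom correction can be demonstrated numerically. This is a sketch, assuming NumPy; the data are synthetic with a known true σd, so we can compare both estimators against the truth.

```python
import numpy as np

# Fit a straight line (M = 2 parameters), then estimate sigma_d^2 from
# the residuals, with and without the degrees-of-freedom correction.
rng = np.random.default_rng(3)
sigma_true = 0.4
x = np.linspace(0.0, 5.0, 200)
d_obs = 1.0 + 2.0 * x + sigma_true * rng.standard_normal(x.size)

G = np.column_stack([np.ones_like(x), x])
m_est = np.linalg.solve(G.T @ G, G.T @ d_obs)
d_pre = G @ m_est                 # predicted data from the best fit
e = d_obs - d_pre                 # residuals e_i

N, M = G.shape
var_biased = (e @ e) / N          # (1/N) sum e_i^2: tends to underestimate
var_dof = (e @ e) / (N - M)       # improved estimate using N - M
print(var_biased, var_dof, sigma_true**2)
```

The N - M version is always a bit larger than the 1/N version (the fit "uses up" M residual degrees of freedom), and it is the one whose expectation matches the true σd².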