Lecture 5
Probability and Statistics
Please Read
Doug Martinson’s
Chapter 3: ‘Statistics’
Available on Courseworks
Abstraction
Vector of N random variables, x,
with joint probability density p(x),
expectation x̄,
and covariance Cx.
[Figure: a 2D joint density with axes x1 and x2. Shown as 2D here, but actually N-dimensional.]
the multivariate normal distribution

p(x) = (2π)^(-N/2) |Cx|^(-1/2) exp{ -1/2 (x-x̄)^T Cx^(-1) (x-x̄) }

has expectation x̄,
covariance Cx,
and is normalized to unit area.
Special case: Cx = diag(σ1², σ2², σ3², …, σN²), the uncorrelated case.

For p(x) = (2π)^(-N/2) |Cx|^(-1/2) exp{ -1/2 (x-x̄)^T Cx^(-1) (x-x̄) },
note that |Cx| = σ1² σ2² … σN² and

(x-x̄)^T Cx^(-1) (x-x̄) = Σi (xi-x̄i)²/σi²

So p(x) = Πi (2π)^(-1/2) σi^(-1) exp{ -(xi-x̄i)² / 2σi² }

which is the product of N individual one-variable normal distributions.
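This factorization is easy to check numerically. A minimal Python sketch (the particular means and variances below are arbitrary illustrations, not from the lecture):

```python
import math

def mvn_diag_pdf(x, mean, var):
    """Multivariate normal density for diagonal covariance Cx = diag(var).
    Uses |Cx| = prod(var) and the quadratic form sum((xi - x̄i)^2 / σi^2)."""
    n = len(x)
    det = math.prod(var)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return (2 * math.pi) ** (-n / 2) * det ** -0.5 * math.exp(-0.5 * quad)

def product_of_1d_normals(x, mean, var):
    """Product of N independent one-variable normal densities."""
    p = 1.0
    for xi, mi, vi in zip(x, mean, var):
        p *= (2 * math.pi * vi) ** -0.5 * math.exp(-((xi - mi) ** 2) / (2 * vi))
    return p

# arbitrary illustrative point, mean, and variances (N = 3)
x, mean, var = [0.3, -1.2, 2.0], [0.0, -1.0, 1.5], [1.0, 4.0, 0.25]
```

Evaluating both functions at the same point gives identical values, confirming that the diagonal-covariance multivariate normal is the product of one-variable normals.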
How would you show that this distribution

p(x) = (2π)^(-N/2) |Cx|^(-1/2) exp{ -1/2 (x-x̄)^T Cx^(-1) (x-x̄) }

really has expectation x̄ and covariance Cx?

How would you prove this?
Do you remember how to transform an integral from x to y?

Given y(x),

∫…∫ p(x) d^N x = ∫…∫ p[x(y)] |dx/dy| d^N y

so p(y) = p[x(y)] |dx/dy|, where |dx/dy| is the Jacobian determinant, that is, the determinant of the matrix J whose elements are Jij = ∂xi/∂yj.
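The transformation rule can be checked numerically in one dimension. A sketch, with p(x) a standard normal, y(x) = 2x + 1 so x(y) = (y-1)/2 and |dx/dy| = 1/2 (the transform and the quadrature grid are arbitrary illustrations):

```python
import math

def p(x):
    # one-dimensional standard normal density
    return (2 * math.pi) ** -0.5 * math.exp(-0.5 * x * x)

def integrate(f, a, b, n=10000):
    """Simple midpoint-rule quadrature of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# integral of p(x) dx, and the same integral rewritten in the y variable
lhs = integrate(p, -8.0, 8.0)
rhs = integrate(lambda y: p((y - 1) / 2) * 0.5, -15.0, 17.0)
```

Both integrals come out to unity: changing variables does not change the value of the integral as long as the Jacobian determinant |dx/dy| is included.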
Here’s how you prove the expectation …

Insert p(x) into the usual formula for expectation:

E(x) = (2π)^(-N/2) |Cx|^(-1/2) ∫…∫ x exp{ -1/2 (x-x̄)^T Cx^(-1) (x-x̄) } d^N x

Now use the transformation y = Cx^(-1/2) (x-x̄), noting that the Jacobian determinant is |Cx|^(1/2):

E(x) = (2π)^(-N/2) ∫…∫ (x̄ + Cx^(1/2) y) exp{ -1/2 y^T y } d^N y
     = x̄ ∫…∫ (2π)^(-N/2) exp{ -1/2 y^T y } d^N y
       + (2π)^(-N/2) Cx^(1/2) ∫…∫ y exp{ -1/2 y^T y } d^N y

The first integral is the area under an N-dimensional gaussian, which is just unity.
The second integral contains an odd function of y times an even function, and so is zero. Thus

E(x) = x̄ · 1 + 0 = x̄
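The result E(x) = x̄ can also be checked by Monte Carlo: draw many samples and compare the sample mean with x̄. A sketch for a 2D normal (the mean and covariance values are arbitrary illustrations; samples are generated via the Cholesky factor L of Cx, so that x = x̄ + Lz for unit normals z):

```python
import math
import random

random.seed(0)

xbar = (2.0, 1.0)
# Cx = [[1, 0.5], [0.5, 1]]; its Cholesky factor L satisfies L L^T = Cx
L = ((1.0, 0.0), (0.5, math.sqrt(0.75)))

def sample():
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    # x = x̄ + L z, a linear transform of independent unit normals
    return (xbar[0] + L[0][0] * z1,
            xbar[1] + L[1][0] * z1 + L[1][1] * z2)

n = 100_000
sums = [0.0, 0.0]
for _ in range(n):
    x1, x2 = sample()
    sums[0] += x1
    sums[1] += x2
mean_est = (sums[0] / n, sums[1] / n)  # should be close to (2, 1)
```

With 100,000 samples the sample mean agrees with x̄ to a few thousandths, as expected from the σ/√n scaling discussed later in this lecture.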
I’ve never tried to prove the
covariance …
But how much harder could it be ?
examples

[Figure: five panels, each showing the 2D density p(x1,x2) for:]

x̄ = [2, 1]^T,  Cx = [1 0; 0 1]
x̄ = [2, 1]^T,  Cx = [2 0; 0 1]
x̄ = [2, 1]^T,  Cx = [1 0; 0 2]
x̄ = [2, 1]^T,  Cx = [1 0.5; 0.5 1]
x̄ = [2, 1]^T,  Cx = [1 -0.5; -0.5 1]
Remember this from last lecture?

[Figure: joint density p(x1,x2) with its two marginal curves p(x1) and p(x2).]

p(x1) = ∫ p(x1,x2) dx2, the distribution of x1 (irrespective of x2)

p(x2) = ∫ p(x1,x2) dx1, the distribution of x2 (irrespective of x1)
[Figure: the same construction in (x,y) notation, showing the joint density p(x,y) and its marginals p(x) and p(y).]

p(x) = ∫ p(x,y) dy

p(y) = ∫ p(x,y) dx
Remember

p(x,y) = p(x|y) p(y) = p(y|x) p(x)

from the last lecture? We can compute p(x|y) and p(y|x) as follows:

p(x|y) = p(x,y) / p(y)
p(y|x) = p(x,y) / p(x)

[Figure: the joint density p(x,y) and the two conditionals p(x|y) and p(y|x).]
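These rules are easiest to see in the discrete case, where the integrals become sums. A small sketch with an arbitrary illustrative joint table (the values are made up and sum to 1):

```python
# joint probabilities p(x, y) over x in {"a", "b"} and y in {0, 1, 2}
joint = {("a", 0): 0.10, ("a", 1): 0.20, ("a", 2): 0.10,
         ("b", 0): 0.05, ("b", 1): 0.25, ("b", 2): 0.30}

def marginal_y(y):
    # p(y) = sum over x of p(x, y)
    return sum(p for (xv, yv), p in joint.items() if yv == y)

def conditional_x_given_y(x, y):
    # p(x|y) = p(x, y) / p(y)
    return joint[(x, y)] / marginal_y(y)
```

For each fixed y, the conditional p(x|y) sums to one over x, as a probability distribution must.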
Any linear function of a normal distribution is a normal distribution.

If p(x) = (2π)^(-N/2) |Cx|^(-1/2) exp{ -1/2 (x-x̄)^T Cx^(-1) (x-x̄) }

and y = Mx, then

p(y) = (2π)^(-N/2) |Cy|^(-1/2) exp{ -1/2 (y-ȳ)^T Cy^(-1) (y-ȳ) }

with ȳ = Mx̄ and Cy = M Cx M^T.
Proof: needs the rules [AB]^(-1) = B^(-1) A^(-1), |AB| = |A||B|, and |A^(-1)| = |A|^(-1).

Start from p(x) = (2π)^(-N/2) |Cx|^(-1/2) exp{ -1/2 (x-x̄)^T Cx^(-1) (x-x̄) }.

The transformation is p(y) = p[x(y)] |dx/dy|, where |dx/dy| is the Jacobian determinant. Substitute in x = M^(-1) y and |dx/dy| = |M^(-1)|, and insert the identities I = M^T M^(-T) and I = M^(-1) M into the quadratic form:

p[x(y)] |dx/dy|
  = (2π)^(-N/2) |Cx|^(-1/2) |M^(-1)| exp{ -1/2 (x-x̄)^T M^T M^(-T) Cx^(-1) M^(-1) M (x-x̄) }
  = (2π)^(-N/2) |M|^(-1/2) |Cx|^(-1/2) |M^T|^(-1/2) exp{ -1/2 [M(x-x̄)]^T [M Cx M^T]^(-1) [M(x-x̄)] }
  = (2π)^(-N/2) |Cy|^(-1/2) exp{ -1/2 (y-ȳ)^T Cy^(-1) (y-ȳ) }

So p(y) = (2π)^(-N/2) |Cy|^(-1/2) exp{ -1/2 (y-ȳ)^T Cy^(-1) (y-ȳ) }
Note that these rules work for the multivariate normal distribution: if y is linearly related to x by y = Mx, then

ȳ = Mx̄  (rule for means)

Cy = M Cx M^T  (rule for propagating error)
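Both rules can be checked by simulation: draw samples of x, push them through y = Mx, and compare the sample mean and covariance of y with Mx̄ and M Cx M^T. A sketch (the mean, covariance, and matrix M are arbitrary illustrations):

```python
import math
import random

random.seed(1)

xbar = (2.0, 1.0)
Cx = ((1.0, 0.5), (0.5, 1.0))
L = ((1.0, 0.0), (0.5, math.sqrt(0.75)))   # Cholesky factor: L L^T = Cx
M = ((1.0, 2.0), (-1.0, 1.0))              # an arbitrary linear map y = M x

def matmul2(A, B):
    return tuple(tuple(sum(A[i][k] * B[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

MT = tuple(tuple(M[j][i] for j in range(2)) for i in range(2))
Cy_pred = matmul2(matmul2(M, Cx), MT)       # predicted Cy = M Cx M^T

# sample y = M x and accumulate its empirical mean and covariance
n = 100_000
ys = []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x = (xbar[0] + L[0][0] * z1, xbar[1] + L[1][0] * z1 + L[1][1] * z2)
    ys.append((M[0][0] * x[0] + M[0][1] * x[1],
               M[1][0] * x[0] + M[1][1] * x[1]))

my = [sum(y[i] for y in ys) / n for i in range(2)]
Cy_samp = [[sum((y[i] - my[i]) * (y[j] - my[j]) for y in ys) / n
            for j in range(2)] for i in range(2)]
```

The sample mean lands near Mx̄ and the sample covariance near M Cx M^T, within the expected Monte Carlo scatter.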
Do you remember this from a previous lecture? If d = Gm, then the standard least-squares solution is

m^est = [G^T G]^(-1) G^T d

Let’s suppose the data, d, are uncorrelated and that they all have the same variance, Cd = σd² I.

To compute the variance of m^est, note that m^est = [G^T G]^(-1) G^T d is a linear rule of the form m = Md, with M = [G^T G]^(-1) G^T, so we can apply the rule Cm = M Cd M^T:

Cm = M Cd M^T
   = {[G^T G]^(-1) G^T} σd² I {[G^T G]^(-1) G^T}^T
   = σd² [G^T G]^(-1) G^T G [G^T G]^(-T)
   = σd² [G^T G]^(-T)
   = σd² [G^T G]^(-1)

(G^T G is a symmetric matrix, so its inverse is symmetric, too.)

Memorize!
Example – all the data assumed to have the same true value, m1, and each measured with the same variance, σd².

Here d = Gm with G a column of N ones:

[d1; d2; d3; …; dN] = [1; 1; 1; …; 1] m1

G^T G = N, so [G^T G]^(-1) = N^(-1)
G^T d = Σi di
m^est = [G^T G]^(-1) G^T d = (Σi di) / N
Cm = σd² / N
m1^est = (Σi di) / N … the traditional formula for the mean!

The estimated mean has variance Cm = σd²/N = σm²; note then that σm = σd/√N.

The estimated mean is a normally-distributed random variable, and the width of this distribution, σm, decreases with the square root of the number of measurements. Accuracy grows only slowly with N.
[Figure: the distribution of the estimated mean narrows as N increases: N = 1, 10, 100, 1000.]
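The σd/√N scaling can be seen directly by repeating the averaging experiment many times and measuring the spread of the estimated mean. A sketch (the true value, σd, and trial counts are arbitrary illustrations):

```python
import math
import random

random.seed(2)

true_value, sigma_d = 5.0, 1.0

def mean_of_sample(n):
    # average of n measurements, each with standard deviation sigma_d
    return sum(random.gauss(true_value, sigma_d) for _ in range(n)) / n

def std_of_estimated_mean(n, trials=2000):
    # repeat the experiment and measure the spread of the estimated mean
    means = [mean_of_sample(n) for _ in range(trials)]
    mbar = sum(means) / trials
    return math.sqrt(sum((m - mbar) ** 2 for m in means) / trials)

s10 = std_of_estimated_mean(10)    # predicted sigma_d / sqrt(10), about 0.316
s100 = std_of_estimated_mean(100)  # predicted sigma_d / sqrt(100) = 0.1
```

Going from N = 10 to N = 100 measurements shrinks the spread of the estimate by only a factor of √10, matching σm = σd/√N.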
Another example – fitting a straight line, with all the data assumed to have the same variance, σd².

Here d = Gm with G an N×2 matrix:

[d1; d2; d3; …; dN] = [1 x1; 1 x2; 1 x3; …; 1 xN] [m1; m2]

G^T G = [ N       Σi xi
          Σi xi   Σi xi² ]

Cm = σd² [G^T G]^(-1) = ( σd² / (N Σi xi² - [Σi xi]²) ) [ Σi xi²   -Σi xi
                                                          -Σi xi    N     ]

σ²intercept = σd² Σi xi² / (N Σi xi² - [Σi xi]²)   (the squared standard error of the intercept)

σ²slope = σd² N / (N Σi xi² - [Σi xi]²)

95% confidence intervals:

intercept: m1^est ± 2 σintercept
slope: m2^est ± 2 σslope
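These formulas can be assembled into a few lines of code. A sketch, using synthetic illustrative data that lie exactly on the line d = 1.0 + 0.5x so the fit recovers it (the σd value is an assumed prior):

```python
import math

# synthetic illustrative data on an exact straight line
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ds = [1.0 + 0.5 * x for x in xs]
sigma_d = 0.1  # assumed prior data standard deviation

N = len(xs)
Sx = sum(xs)
Sxx = sum(x * x for x in xs)
Sd = sum(ds)
Sxd = sum(x * d for x, d in zip(xs, ds))

# m^est = [G^T G]^(-1) G^T d, written out for G = [1, x]
denom = N * Sxx - Sx ** 2
intercept = (Sxx * Sd - Sx * Sxd) / denom
slope = (N * Sxd - Sx * Sd) / denom

# standard errors from the diagonal of Cm = sigma_d^2 [G^T G]^(-1)
sigma_intercept = math.sqrt(sigma_d ** 2 * Sxx / denom)
sigma_slope = math.sqrt(sigma_d ** 2 * N / denom)
```

The 95% intervals are then intercept ± 2·sigma_intercept and slope ± 2·sigma_slope, with the caveat discussed next.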
Beware! The 95% confidence intervals

intercept: m1^est ± 2 σintercept
slope: m2^est ± 2 σslope

are probabilities of m1 irrespective of the value of m2, and of m2 irrespective of the value of m1, not the joint probability of m1 and m2 taken together.

[Figure: contours of p(m1,m2) with the boxes m1^est ± 2σ1 and m2^est ± 2σ2. The probability that m2 is in its box is 95%; the probability that m1 is in its box is 95%; but the probability that both m1 and m2 are in the joint box is less than 95%.]
Recall

Cm = σd² [G^T G]^(-1) = ( σd² / (N Σi xi² - [Σi xi]²) ) [ Σi xi²   -Σi xi
                                                          -Σi xi    N     ]

Intercept and slope are uncorrelated only when Σi xi = 0, that is, when the mean of the x’s is zero, which occurs when the data straddle the origin.
Remember this discussion from a few lectures ago? What σd² do you use in these formulas?

Prior estimates of σd, based on knowledge of the limits of your measuring technique:

my ruler has only mm tics, so I’m going to assume that σd = 0.5 mm

the manufacturer claims that the instrument is accurate to 0.1%, so since my typical measurement is 25, I’ll assume σd = 0.025
Posterior estimate of the error, based on the error measured with respect to the best fit:

σd² = (1/N) Σi (di^obs - di^pre)² = (1/N) Σi ei²

Dangerous … because it assumes that the model (“a straight line”) accurately represents the behavior of the data. Maybe the data really followed an exponential curve …
One refinement to the formula

σd² = (1/N) Σi (di^obs - di^pre)²

having to do with the appearance of N, the number of data.

[Figure: two panels of y vs. x. If there were only three data, then the best-fitting straight line would likely have just a little error. If there were only two data, then the best-fitting straight line would have no error at all.]

Therefore this formula very likely underestimates the error. An improved formula would replace N with N-2:

σd² = (1/(N-2)) Σi (di^obs - di^pre)²

where the “2” is chosen because two points exactly define a straight line.
More generally, if there are M model parameters, then the formula would be

σd² = (1/(N-M)) Σi (di^obs - di^pre)²

The quantity N-M is often called the number of degrees of freedom.
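The effect of dividing by N-M rather than N is easy to see for the straight-line case (M = 2). A sketch with arbitrary illustrative data:

```python
# posterior estimate of sigma_d^2 from the residuals of a straight-line fit
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ds = [1.1, 1.4, 2.1, 2.4, 3.1, 3.4]   # arbitrary illustrative measurements

N, M = len(xs), 2
Sx, Sxx = sum(xs), sum(x * x for x in xs)
Sd, Sxd = sum(ds), sum(x * d for x, d in zip(xs, ds))
denom = N * Sxx - Sx ** 2
intercept = (Sxx * Sd - Sx * Sxd) / denom
slope = (N * Sxd - Sx * Sd) / denom

residuals = [d - (intercept + slope * x) for x, d in zip(xs, ds)]
var_biased = sum(e * e for e in residuals) / N          # tends to underestimate
var_unbiased = sum(e * e for e in residuals) / (N - M)  # divides by degrees of freedom
```

The two estimates differ by the factor N/(N-M); dividing by the degrees of freedom inflates the variance estimate to compensate for the M parameters that were fit to the same data.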