Download 3. Model Fitting 3.1 The bivariate normal distribution

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Expectation–maximization algorithm wikipedia, lookup

System of linear equations wikipedia, lookup

Resampling (statistics) wikipedia, lookup

Linear least squares (mathematics) wikipedia, lookup

Least squares wikipedia, lookup

Ordinary least squares wikipedia, lookup

Coefficient of determination wikipedia, lookup

False position method wikipedia, lookup

Transcript
3. Model Fitting
In this section we apply the statistical tools introduced in Section 2 to
explore:
¾ how to estimate model parameters
¾ how to test the goodness of fit of models.
We will consider:
3.2 The method of least squares
3.3 The principle of maximum likelihood
3.4 Least squares as maximum likelihood estimators
3.5 Chi-squared goodness of fit tests
3.6 More general hypothesis testing
3.7 Computational methods for minimising / maximising functions
But before we do, we first introduce an important pdf:
the bivariate normal distribution
3.1 The bivariate normal distribution
(3.1)
(3.2)
p(x,y)
y
x
(3.3)
In fact, for any two variables x and y, we define
cov( x, y )
E > x E ( x) y E ( y ) @
(3.4)
Isoprobability contours for
y
the bivariate normal pdf
U !0
:
U
y
0.0
U
0.3
positive correlation
y tends to increase as x increases
x
x
U 0
:
negative correlation
y
U
y
0.5
U
0.7
y tends to decrease as x increases
Contours become narrower and
steeper as U o 1
x
y U 0.7
Ÿ
x
y U 0.9
stronger (anti) correlation
between x and y.
i.e.
Given value of x , value of
y is tightly constrained.
x
3.2 The method of Least Squares
o ‘workhorse’ method for fitting lines and curves to data
in the physical sciences
o method often encountered (as a ‘black box’?) in
elementary courses
o useful demonstration of underlying statistical
principles
o simple illustration of fitting straight line to (x,y) data
x
Ordinary Linear Least Squares
(3.8)
n
S
¦H
i 1
(3.8)
(3.9)
2
i
(3.10)
(3.11)
We can show that
E aˆ LS
E bˆ LS
a LS
b LS
i.e. LS estimators are unbiased.
(3.12)
(3.13)
(3.14)
(3.15)
Choosing the ^x i ` so that
¦
xi
0 we can make â LS and b̂ LS independent.
Weighted Linear Least Squares
( Common in astronomy )
Define
(3.16)
Again we find Least Squares estimators of a and b satisfying
Solving, we find
(3.17)
(3.18)
Also
(3.19)
(3.20)
cov aˆ WLS , bˆWLS
¦
1
xi2
¦V ¦V
2
i
2
i
xi
V i2
§
x ·
¨¨ ¦ i2 ¸¸
© Vi ¹
2
(3.21)
Extensions and Generalisations
o Errors on both variables?
Need to modify merit function accordingly.
(3.22)
Renders equations non-linear ; no simple analytic solution!
Not examinable, but see e.g.
Numerical Recipes 15.3
Extensions and Generalisations
o General linear models?
e.g.
(3.23)
We can write
(3.24)
Can formulate as a matrix equation and solve for parameters
Matrix approach to Least Squares
Define
a
ª a1 º
« »
« »
«¬aM »¼
y
Vector of model
parameters
X
ª X 1 ( x1 ) X M ( x1 ) º
« »
«
»
«¬ X 1 ( x N ) X M ( x N )»¼
ª y1 º
« »
« »
«¬ y N »¼
Vector of
observations
Design matrix of
model basis functions
Model:
Vector of model
parameters
Xa H
y
(3.25)
Vector of
errors
Vector of
observations
H
Design matrix of
model basis functions
ª H1 º
« »
« »
«¬H N »¼
where we assume
Hi
is drawn from some
pdf with mean zero and variance
V2
Matrix approach to Least Squares: weighting by errors
Define
ª a1 º
« »
« »
«¬aM »¼
a
b
Vector of model
parameters
A
ª X 1V( x1 ) X MV( x1 ) º
1
« 1
»
»
« « X1 ( xN ) X M ( xN ) »
VN
¬ VN
¼
ª y1 V 1 º
« »
«
»
«¬ y N V N »¼
Vector of
weighted
observations
Weighted design
matrix of model basis
functions
Weighted Model:
Vector of model
parameters
Aa e
b
Vector of
weighted
observations
where we assume
n
eT ˜ e
Hi
is drawn from some
pdf with mean zero and variance
We solve for the parameter vector
S
Vector of
weighted
errors
Weighted design
matrix of model basis
functions
ª H1 º
« V1 »
« »
»
«H N
« V »
N¼
¬
e
(3.26)
¦ ei
V i2
â LS that minimises
2
i 1
This has solution
aˆ LS
A A
1
T
AT ˜ b
M u M matrix
and
cov aˆ LS A A T
1
(3.28)
(3.27)
Extensions and Generalisations
o Non-linear models?
Model parameters
Suppose
Hi
drawn from pdf with mean zero, variance V i
2
Then
S
(3.29)
But no simple analytic method to minimise sum of squares
( e.g. no analytic solutions to wS wT i 0 )
3.3
The principle of maximum likelihood
Frequentist approach:
A parameter is a fixed (but unknown) constant
From actual data we can compute Likelihood,
L =
probability of obtaining the observed data, given the value of
the parameter T
Now define likelihood function: (infinite) family of curves
generated by regarding L as
a function of T , for data fixed.
Principle of Maximum Likelihood
A good estimator of T maximises L i.e.
wL
wT
0 and
w2L
0
wT 2
We set the parameter equal to
the value that makes the actual
data sample we did observe –
out of all the possible random
samples we could have observed
– the most likely.
Aside:
Likelihood function has same definition in Bayesian probability theory, but subtle difference in
meaning and interpretation – no need to invoke idea of (infinite) ensemble of different samples.
Principle of Maximum Likelihood
A good estimator of T maximises
i.e.
wL
wT
0
and
L -
w2L
0
wT 2
T = T1
x
Observed value
Principle of Maximum Likelihood
A good estimator of T maximises
i.e.
wL
wT
0
and
L -
w2L
0
wT 2
T = T2
x
Observed value
Principle of Maximum Likelihood
A good estimator of T maximises
i.e.
wL
wT
0
and
L -
w2L
0
wT 2
T = T3
x
Observed value
Principle of Maximum Likelihood
A good estimator of T maximises
wL
wT
0
and
w2L
0
wT 2
!
i.e.
L -
T = TML
x
Observed value
3.4
Least squares as maximum likelihood estimators
To see the maximum likelihood method in action, let’s consider again
weighted least squares for the simple model
(From Section 3.3)
Let’s assume the pdf is a Gaussian
n
–
L
Likelihood
i 1
(note:
Substitute
ª 1 H i2 º
1
exp «
2»
2S V i
¬ 2 Vi ¼
(3.30)
L is a product of 1-D Gaussians because we are assuming the H i are independent)
Hi
yi a bxi
n
Ÿ
–
L
i 1
ª 1 yi a bxi 2 º
1
exp «
»
2
2
V
2S V i
i
¼
¬
(3.31)
and the ML estimators of a and b satisfy wL wa 0 and wL wb 0
But maximising
Here
"
L is equivalent to maximising "
ln L
n
n
1 n § yi a bxi ·
¸¸
ln 2S ln ¦ V i ¦ ¨¨
V
2
2
i 1
i 1©
i
¹
constant So in this case maximising
1
S
2
2
(3.32)
This is exactly the same
sum of squares we
defined in Section 3.3
L is exactly equivalent to minimising the sum of squares.
i.e. for Gaussian, independent errors, ML and weighted LS estimators are identical.
3.5 Chi-squared goodness of fit test
In the previous 3 sections we have discussed how to estimate
parameters of an underlying pdf model from sample data.
We now consider the related question:
How good is our pdf model in the first place?
We now illustrate the frequentist approach to this question using the
chi-squared goodness of fit test, for the (very common) case where
the model pdf is a Gaussian.
We take an example from Gregory (Chapter 7)
(book focusses mainly on Bayesian probability, but
is very good on frequentist approach too)
Model: radio emission from a galaxy is constant in time.
Assume residuals are iid, drawn from N(0,V)
Goodness-of-fit Test: the basic ideas
From Gregory, pg. 164
(3.33)
The
F2
pdf
pQ F
2
p0 u F
2
Q
2
1
e
F 2 / 2
pQ F 2
Q 1
Q 2
Q 3
Q 4
Q 5
F2
(3.34)
(3.35)
n = 15 data points, but Q = 14 degrees of
freedom, because
statistic involves the
sample mean and not the true mean.
We subtract one d.o.f. to account for this.
p14 F 2
If the null hypothesis is true, how probable is it that we
would measure as large, or larger, a value of
?
If the null hypothesis were true, how probable is it that we
would measure as large, or larger, a value of
?
This is an important quantity, referred to as the P-value
P - value
2
1 P F obs
2
F obs
1
³
0
Q 1
§ x·
p0 x 2 exp¨ ¸ dx
© 2¹
0.02
(3.36)
What precisely does the P-value mean?
“If the galaxy flux density really is constant, and we repeatedly obtained sets
of 15 measurements under the same conditions, then only 2% of the F 2
values derived from these sets would be expected to be greater than our one
actual measured value of 26.76”
From Gregory, pg. 165
If we obtain a very small P-value (e.g. a few percent?) we can interpret this as
providing little support for the null hypothesis, which we may then choose to reject.
2
(Ultimately this choice is subjective, but F may provide objective ammunition for doing so)
Nevertheless, P-value based frequentist hypothesis testing remains very
common in the literature:
Type of problem
test
Line and curve
goodness-of-fit
References
NR: 15.1-15.6
test
Difference of means
Student’s t
NR: 14.2
Ratio of variances
F test
NR: 14.2
Sample CDF
K-S test
Rank sum tests
NR: 14.3, 14.6
Correlated variables?
Sample correlation
coefficient
NR: 14.5, 14.6
Discrete RVs
test
/
NR: 14.4
contingency table
See also supplementary handout
3.7 Minimising and Maximising Functions
Least squares and maximum likelihood involve, in practice, a lot of
minimising and maximising of functions – i.e. solving equations of the form:
wL wT i
0
(3.37)
In general these equations may not have an analytic solution, especially if
our pdf is a function of two or more parameters.
Some computational strategies for minimising/maximising functions:
w" wT i
0
where
"
ln L
1.
Solve
(may be easier to solve)
2.
Perform grid search over
3.
Use gradient ascent / descent for increased efficiency
ș , evaluating L(ș ) at each point
2.
Perform grid search over
1-D example
ș , evaluating L(ș ) at each point
L(ș )
Regularly spaced grid points.
No need to compute
derivatives of likelihood
But we need very fine grid
spacing to obtain accurate
estimate of the maximum
This is computationally very
costly, particularly if we need
to search a multi-dimensional
parameter space.
3.
ș
Estimated
maximum of L(ș )
Method of Steepest Ascent / Descent
Make jumps in direction where gradient of L(ș ) is changing most rapidly.
Need to estimate derivatives of likelihood, i.e. ’L(ș)
(See Num. Rec. Chap. 10)