3. Model Fitting

In this section we apply the statistical tools introduced in Section 2 to explore:
- how to estimate model parameters
- how to test the goodness of fit of models.

We will consider:
3.2 The method of least squares
3.3 The principle of maximum likelihood
3.4 Least squares as maximum likelihood estimators
3.5 Chi-squared goodness of fit tests
3.6 More general hypothesis testing
3.7 Computational methods for minimising / maximising functions

But before we do, we first introduce an important pdf: the bivariate normal distribution.

3.1 The bivariate normal distribution

The joint pdf of two normally distributed variables x and y is

  p(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left[ -\frac{z}{2(1 - \rho^2)} \right]   (3.1)

where

  z = \frac{(x - \mu_x)^2}{\sigma_x^2} - \frac{2\rho (x - \mu_x)(y - \mu_y)}{\sigma_x \sigma_y} + \frac{(y - \mu_y)^2}{\sigma_y^2}   (3.2)

and the correlation coefficient is

  \rho = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}   (3.3)

In fact, for any two variables x and y, we define

  \mathrm{cov}(x, y) = E\left[ (x - E(x))(y - E(y)) \right]   (3.4)

[Figure: isoprobability contours of the bivariate normal pdf for ρ = 0.0, 0.3, 0.5, ±0.7 and 0.9.]
- ρ > 0: positive correlation – y tends to increase as x increases.
- ρ < 0: negative correlation – y tends to decrease as x increases.
- The contours become narrower and steeper as |ρ| → 1: stronger (anti-)correlation between x and y, i.e. given the value of x, the value of y is tightly constrained.

3.2 The method of least squares

- the 'workhorse' method for fitting lines and curves to data in the physical sciences
- often encountered (as a 'black box'?) in elementary courses
- a useful demonstration of the underlying statistical principles
- simple illustration: fitting a straight line to (x, y) data

Ordinary linear least squares

For the straight-line model y_i = a + b x_i + \epsilon_i, define the merit function

  S = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2   (3.8)

The least squares estimators \hat{a}_{LS} and \hat{b}_{LS} satisfy \partial S / \partial a = 0 and \partial S / \partial b = 0. Solving these normal equations gives

  \hat{b}_{LS} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{a}_{LS} = \bar{y} - \hat{b}_{LS}\,\bar{x}   (3.9)–(3.11)

We can show that E[\hat{a}_{LS}] = a and E[\hat{b}_{LS}] = b, i.e. the LS estimators are unbiased, and we can derive their variances and covariance (3.12)–(3.15). Choosing the \{x_i\} so that \sum_i x_i = 0, we can make \hat{a}_{LS} and \hat{b}_{LS} independent.

Weighted linear least squares (common in astronomy)

Define

  S = \sum_{i=1}^{n} \frac{(y_i - a - b x_i)^2}{\sigma_i^2}   (3.16)

Again we find the least squares estimators of a and b by setting \partial S / \partial a = 0 and \partial S / \partial b = 0. Solving, with \Delta = \left( \sum_i \frac{1}{\sigma_i^2} \right)\left( \sum_i \frac{x_i^2}{\sigma_i^2} \right) - \left( \sum_i \frac{x_i}{\sigma_i^2} \right)^2, we find

  \hat{a}_{WLS} = \frac{1}{\Delta}\left[ \left( \sum_i \frac{x_i^2}{\sigma_i^2} \right)\left( \sum_i \frac{y_i}{\sigma_i^2} \right) - \left( \sum_i \frac{x_i}{\sigma_i^2} \right)\left( \sum_i \frac{x_i y_i}{\sigma_i^2} \right) \right]   (3.17)

  \hat{b}_{WLS} = \frac{1}{\Delta}\left[ \left( \sum_i \frac{1}{\sigma_i^2} \right)\left( \sum_i \frac{x_i y_i}{\sigma_i^2} \right) - \left( \sum_i \frac{x_i}{\sigma_i^2} \right)\left( \sum_i \frac{y_i}{\sigma_i^2} \right) \right]   (3.18)

Also

  \mathrm{var}(\hat{a}_{WLS}) = \frac{1}{\Delta} \sum_i \frac{x_i^2}{\sigma_i^2}, \qquad \mathrm{var}(\hat{b}_{WLS}) = \frac{1}{\Delta} \sum_i \frac{1}{\sigma_i^2}   (3.19)–(3.20)

  \mathrm{cov}(\hat{a}_{WLS}, \hat{b}_{WLS}) = -\frac{\sum_i x_i / \sigma_i^2}{\left( \sum_i \frac{1}{\sigma_i^2} \right)\left( \sum_i \frac{x_i^2}{\sigma_i^2} \right) - \left( \sum_i \frac{x_i}{\sigma_i^2} \right)^2}   (3.21)

Extensions and generalisations

- Errors on both variables? We need to modify the merit function accordingly, e.g. for a straight line

  S = \sum_{i=1}^{n} \frac{(y_i - a - b x_i)^2}{\sigma_{y,i}^2 + b^2 \sigma_{x,i}^2}   (3.22)

  This renders the equations non-linear; there is no simple analytic solution. (Not examinable, but see e.g. Numerical Recipes 15.3.)

- General linear models? e.g. a polynomial

  y(x) = a_1 + a_2 x + a_3 x^2 + \dots + a_M x^{M-1}   (3.23)

  More generally we can write

  y(x) = \sum_{k=1}^{M} a_k X_k(x)   (3.24)

  where the X_k(x) are fixed basis functions. We can formulate this as a matrix equation and solve for the parameters.

Matrix approach to least squares

Define
  a = (a_1, …, a_M)^T – the vector of model parameters
  y = (y_1, …, y_N)^T – the vector of observations
  X – the N × M design matrix of model basis functions, with elements X_{ij} = X_j(x_i)
  ε = (ε_1, …, ε_N)^T – the vector of errors

Model:

  y = X a + \epsilon   (3.25)

where we assume each ε_i is drawn from some pdf with mean zero and variance σ².

Matrix approach to least squares: weighting by errors

Define
  a = (a_1, …, a_M)^T – the vector of model parameters
  b = (y_1/σ_1, …, y_N/σ_N)^T – the vector of weighted observations
  A – the weighted design matrix, with elements A_{ij} = X_j(x_i)/σ_i
  e = (ε_1/σ_1, …, ε_N/σ_N)^T – the vector of weighted errors

Weighted model:

  b = A a + e   (3.26)

where we assume each ε_i is drawn from some pdf with mean zero and variance σ_i². We solve for the parameter vector \hat{a}_{LS} that minimises

  S = e^T e = \sum_{i=1}^{N} \frac{\epsilon_i^2}{\sigma_i^2}

This has the solution

  \hat{a}_{LS} = (A^T A)^{-1} A^T b   (3.27)

where A^T A is an M × M matrix, and

  \mathrm{cov}(\hat{a}_{LS}) = (A^T A)^{-1}   (3.28)

A minimal numerical sketch of this matrix solution is given below.
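The following sketch illustrates equations (3.27)–(3.28) with NumPy. It is not part of the original notes: the simulated straight-line dataset, the true parameter values and the variable names are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: straight line y = a + b*x with a = 1.0, b = 2.0,
# and heteroscedastic Gaussian errors with known sigma_i.
x = np.linspace(0.0, 10.0, 20)
sigma = rng.uniform(0.5, 1.5, size=x.size)
y = 1.0 + 2.0 * x + rng.normal(0.0, sigma)

# Weighted design matrix A (eq. 3.26): columns are the basis functions
# X_1(x) = 1 and X_2(x) = x, with each row divided by sigma_i.
A = np.column_stack([np.ones_like(x), x]) / sigma[:, None]
b = y / sigma                                  # weighted observation vector

# Least squares solution a_hat = (A^T A)^{-1} A^T b   (eq. 3.27)
AtA = A.T @ A
a_hat = np.linalg.solve(AtA, A.T @ b)

# Parameter covariance matrix cov(a_hat) = (A^T A)^{-1}   (eq. 3.28)
cov = np.linalg.inv(AtA)

print("parameter estimates (a, b):", a_hat)
print("standard errors:", np.sqrt(np.diag(cov)))
```

In practice one might instead call np.linalg.lstsq(A, b, rcond=None), which solves the same weighted system in a numerically more stable way; the explicit normal-equations form is used here only to mirror equations (3.27)–(3.28).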
Extensions and generalisations

- Non-linear models? e.g. y_i = f(x_i; θ_1, …, θ_M) + ε_i, where the model parameters θ enter non-linearly. Suppose each ε_i is drawn from a pdf with mean zero and variance σ_i². Then

  S = \sum_{i=1}^{n} \frac{\left[ y_i - f(x_i; \theta) \right]^2}{\sigma_i^2}   (3.29)

  but there is no simple analytic method for minimising this sum of squares (e.g. there are no analytic solutions to ∂S/∂θ_i = 0).

3.3 The principle of maximum likelihood

Frequentist approach: a parameter is a fixed (but unknown) constant. From the actual data we can compute the likelihood, L = probability of obtaining the observed data, given the value of the parameter θ.

Now define the likelihood function: the (infinite) family of curves generated by regarding L as a function of θ, with the data held fixed.

Principle of maximum likelihood: a good estimator of θ maximises L, i.e.

  \frac{\partial L}{\partial \theta} = 0 \quad \text{and} \quad \frac{\partial^2 L}{\partial \theta^2} < 0

We set the parameter equal to the value that makes the actual data sample we did observe – out of all the possible random samples we could have observed – the most likely.

Aside: the likelihood function has the same definition in Bayesian probability theory, but there is a subtle difference in meaning and interpretation – there is no need to invoke the idea of an (infinite) ensemble of different samples.

[Figure: a sequence of slides showing the model pdf for trial parameter values θ = θ_1, θ_2, θ_3 and θ = θ_ML, with the observed data value marked; the likelihood of the observed value is greatest for θ = θ_ML.]

3.4 Least squares as maximum likelihood estimators

To see the maximum likelihood method in action, let's consider again weighted least squares for the simple straight-line model of Section 3.2, y_i = a + b x_i + ε_i. Let's assume the pdf of the errors is a Gaussian. Then the likelihood is

  L = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[ -\frac{\epsilon_i^2}{2\sigma_i^2} \right]   (3.30)

(note: L is a product of 1-D Gaussians because we are assuming the ε_i are independent). Substituting ε_i = y_i − a − b x_i gives

  L = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[ -\frac{(y_i - a - b x_i)^2}{2\sigma_i^2} \right]   (3.31)

and the ML estimators of a and b satisfy ∂L/∂a = 0 and ∂L/∂b = 0.

But maximising L is equivalent to maximising ℓ = ln L. Here

  \ell = -\frac{n}{2}\ln 2\pi - \sum_{i=1}^{n} \ln \sigma_i - \frac{1}{2}\sum_{i=1}^{n}\left( \frac{y_i - a - b x_i}{\sigma_i} \right)^2 = \text{constant} - \frac{1}{2} S   (3.32)

This is exactly the same weighted sum of squares we defined in Section 3.2, so in this case maximising L is exactly equivalent to minimising the sum of squares, i.e. for Gaussian, independent errors the ML and weighted LS estimators are identical.

3.5 Chi-squared goodness of fit test

In the previous three sections we have discussed how to estimate the parameters of an underlying pdf model from sample data. We now consider the related question: how good is our pdf model in the first place?

We now illustrate the frequentist approach to this question using the chi-squared goodness of fit test, for the (very common) case where the model pdf is a Gaussian. We take an example from Gregory (Chapter 7) (the book focusses mainly on Bayesian probability, but is very good on the frequentist approach too).

Model: the radio emission from a galaxy is constant in time. Assume the residuals are iid, drawn from N(0, σ).

Goodness-of-fit test: the basic ideas (from Gregory, p. 164). Compute the statistic

  \chi^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{\sigma^2}   (3.33)

where the x_i are the n measured flux densities and \bar{x} is their sample mean. If the null hypothesis (constant flux) is true, this statistic follows the χ² pdf

  p_\nu(\chi^2) = p_0 \, (\chi^2)^{\nu/2 - 1} \, e^{-\chi^2/2}   (3.34)

where ν is the number of degrees of freedom and p_0 = \left[ 2^{\nu/2}\,\Gamma(\nu/2) \right]^{-1} is the normalisation constant.

[Figure: the χ² pdf p_ν(χ²) for ν = 1, 2, 3, 4, 5.]

For Gregory's data the observed value is χ²_obs = 26.76 (3.35). There are n = 15 data points, but ν = 14 degrees of freedom, because the statistic involves the sample mean and not the true mean; we subtract one degree of freedom to account for this, so the relevant pdf is p_14(χ²). A small simulation sketch of this point follows.
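To illustrate why using the sample mean costs one degree of freedom, here is a simulation sketch. It is not from the notes: the constant flux level, σ and number of simulated sets are made-up values. It draws many sets of n = 15 constant-signal measurements, computes χ² = Σ(x_i − x̄)²/σ² for each set, and checks that the statistic behaves like a χ² variable with ν = 14.

```python
import numpy as np

rng = np.random.default_rng(0)

n, sigma, n_sets = 15, 1.0, 20000
true_flux = 10.0                     # hypothetical constant flux level

# Simulate many sets of n measurements under the null hypothesis
data = rng.normal(true_flux, sigma, size=(n_sets, n))

# Chi-squared statistic built with the *sample* mean of each set (eq. 3.33)
xbar = data.mean(axis=1, keepdims=True)
chi2 = ((data - xbar) ** 2).sum(axis=1) / sigma**2

# Under the null hypothesis this follows the chi-squared pdf with
# nu = n - 1 = 14 degrees of freedom, so its mean should be close to 14.
print("mean of simulated chi^2:", chi2.mean())           # ~14, not 15
print("fraction exceeding 26.76:", (chi2 > 26.76).mean())  # ~0.02
```

The fraction of simulated χ² values exceeding 26.76 is the simulated counterpart of the P-value discussed next.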
If the null hypothesis were true, how probable is it that we would measure as large, or larger, a value of χ²? This is an important quantity, referred to as the P-value:

  P\text{-value} = 1 - P(\chi^2 < \chi^2_{\rm obs}) = 1 - \int_0^{\chi^2_{\rm obs}} p_0 \, x^{\nu/2 - 1} \exp\left( -\frac{x}{2} \right) dx \approx 0.02   (3.36)

What precisely does the P-value mean? "If the galaxy flux density really is constant, and we repeatedly obtained sets of 15 measurements under the same conditions, then only 2% of the χ² values derived from these sets would be expected to be greater than our one actual measured value of 26.76." (From Gregory, p. 165.)

If we obtain a very small P-value (e.g. a few percent?) we can interpret this as providing little support for the null hypothesis, which we may then choose to reject. (Ultimately this choice is subjective, but χ² may provide objective ammunition for doing so.)

3.6 More general hypothesis testing

Nevertheless, P-value based frequentist hypothesis testing remains very common in the literature:

  Type of problem                 | Test                            | References
  Line and curve goodness of fit  | χ² test                         | NR: 15.1–15.6
  Difference of means             | Student's t test                | NR: 14.2
  Ratio of variances              | F test                          | NR: 14.2
  Sample CDF                      | K–S test; rank sum tests        | NR: 14.3, 14.6
  Correlated variables?           | Sample correlation coefficient  | NR: 14.5, 14.6
  Discrete RVs                    | χ² test / contingency table     | NR: 14.4

See also the supplementary handout.

3.7 Minimising and maximising functions

Least squares and maximum likelihood involve, in practice, a lot of minimising and maximising of functions, i.e. solving equations of the form

  \frac{\partial L}{\partial \theta_i} = 0   (3.37)

In general these equations may not have an analytic solution, especially if our pdf is a function of two or more parameters. Some computational strategies for minimising/maximising functions (a sketch contrasting strategies 2 and 3 follows this list):

1. Solve ∂ℓ/∂θ_i = 0, where ℓ = ln L (these equations may be easier to solve).
2. Perform a grid search over θ, evaluating L(θ) at each point.
3. Use gradient ascent / descent for increased efficiency.

Grid search (strategy 2), 1-D example: [Figure: L(θ) evaluated at regularly spaced grid points, with the estimated maximum marked.] With regularly spaced grid points there is no need to compute derivatives of the likelihood, but we need very fine grid spacing to obtain an accurate estimate of the maximum. This is computationally very costly, particularly if we need to search a multi-dimensional parameter space.

Method of steepest ascent / descent (strategy 3): make jumps in the direction in which L(θ) changes most rapidly, i.e. along the gradient of L(θ). This requires estimating derivatives of the likelihood. (See Numerical Recipes, Chapter 10.)
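As a rough comparison of strategies 2 and 3, here is a sketch that maximises a one-parameter Gaussian log-likelihood first by brute-force grid search and then with a gradient-based optimiser from SciPy. The toy dataset, the grid limits and the choice of the BFGS method are assumptions for illustration, not prescriptions from the notes.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Toy data: n measurements with unknown mean theta and known sigma = 1
data = rng.normal(3.2, 1.0, size=50)

def neg_log_like(theta):
    # Negative Gaussian log-likelihood (constant terms dropped);
    # minimising this is equivalent to maximising ln L.
    return 0.5 * np.sum((data - theta) ** 2)

# Strategy 2: grid search -- no derivatives needed, but a fine grid is required
grid = np.linspace(0.0, 6.0, 601)
theta_grid = grid[np.argmin([neg_log_like(t) for t in grid])]

# Strategy 3: gradient-based minimiser (quasi-Newton BFGS)
result = minimize(neg_log_like, x0=np.array([0.0]), method="BFGS")
theta_grad = result.x[0]

print("grid-search estimate:     ", theta_grid)
print("gradient-based estimate:  ", theta_grad)
print("sample mean (analytic ML):", data.mean())
```

For this simple likelihood both strategies recover the analytic ML estimator (the sample mean), but the gradient-based optimiser needs far fewer likelihood evaluations than the grid search, and the gap grows rapidly with the dimensionality of the parameter space.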