Two sides to the modelling problem

Y = F(X, β, ε). How do you make F as flexible as possible? What effect does the behaviour of the error ε have on the inferences you make about F and β? Generally, to know about F you have to know about ε, and vice versa. 'It's some catch, that Catch-22.'

Resistant Regression - Least Squares

Least squares is ubiquitous:
Regression does it
Non-linear fits do it
Even neural nets and Box-Jenkins do it

And with good reason.

1) Our old friend Gauss (the normal distribution)

    (1/(σ√(2π))) exp(-(x - μ)²/(2σ²))

when modelling translates to

    (1/(σ√(2π))) exp(-(Y - F(X,B))²/(2σ²))

if you assume that the error/uncertainty about the proposed fit F(X,B) has a normal distribution. So least squares will maximise the likelihood for your parameter estimates B.

2) Our old friend Pythagoras (as used by Fisher). When you fit a linear model with p parameters, you are projecting your N-dimensional data vector into a p-dimensional subspace. If you wish to test whether a model with q < p parameters is as good, then you can simply partition the sums of squares of the projections. This is the basis of the analysis of variance.

[Figure: a data point R in N dimensions, with projection P into the p-dimensional subspace and projection Q into the q-dimensional subspace, so that RQ² = RP² + PQ².]

3) Our old friend Newton (as modified by Gauss, Seidel, Marquardt and Levenberg). The solution for the parameter values B usually consists of some sort of updating equation of the type

    B(n+1) = B(n) - [d²F(X,B)]⁻¹ dF(X,B)

where dF is the vector of first derivatives with respect to B and d²F is the matrix of second derivatives with respect to B. Life is a lot simpler if d²F(X,B) is positive definite; if F has a quadratic form, this tends to be the case. Note that the sharpness of the peak gives you the confidence in the estimate of B.

So when it's good, it's very, very good. Why look at anything else? Because when it's bad, it's awful. When is it bad?
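One answer can be shown numerically. A minimal numpy sketch (the data and numbers are invented for illustration): fit an ordinary least-squares line via the design matrix, then refit after corrupting a single observation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.2, size=x.size)   # true line: 2 + 0.5x

def ols(x, y):
    # least-squares intercept and slope via the design matrix [1, x]
    A = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(A, y, rcond=None)[0]

b_clean = ols(x, y)            # close to (2.0, 0.5)

# corrupt a single observation: one bad point is all it takes
y_bad = y.copy()
y_bad[-1] = 100.0
b_bad = ols(x, y_bad)          # the slope is dragged far from 0.5

print("clean fit:          ", b_clean)
print("with one outlier:   ", b_bad)
```

With 49 well-behaved points and one wild one, the fitted slope roughly triples: the squared loss lets a single large residual dominate the objective.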
When your data contains points that don't belong there, either because they are wrong or because they don't belong to the same population. In the jargon this is called contamination, and the amount of contamination that an algorithm can withstand before giving rubbish answers is called the breakdown point. Least squares has a very low breakdown point: in fact, one bad piece of data is all it takes. The big problem is that what is obvious in the case of 1 regressor is not obvious in the case of 100 regressors.

Resistant Regression - a new idea

Technology to deal with one outlier was developed in the late 1970s (Cook, Barnett et al.); more complicated schemes followed in the 1980s to deal with the masking effects of 2 or more outliers, but they were very complicated, very computer-intensive and, as the size of the contamination increases, less feasible. The problem is that the thinking behind these methods is still least-squares thinking: leave points out, do a least squares fit, see how different it is, and so on.

Peter J. Rousseeuw et al. had a much more original approach. Why least squares? Many other measures of the size of a set of residuals are possible:

1) the size of the Mth largest absolute residual in the set, or
2) the sum of squares of the M smallest residuals in the set.

The upshot of this is that all that matters is that the model fits P% of the data. So if, for example, we choose the size of the median of the absolute residuals, the idea is to find the multiple regression line that fits at least 50% of the data (any 50%) better than any other regression line. This has a very high breakdown point: you can take a data set in which a multiple regression holds true, add 49% rubbish, and the relationship will still be found. In particular, you can deal with data sets where there are 2 populations with different relationships, and segment out the 2.

Multivariate outlier detection - Minimum covariance
Method 1: find a subset of h points (P%) that minimises the determinant of the covariance matrix.
Method 2: find a subset of h points (P%) that minimises the volume of the smallest ellipsoid that contains the points.

Example - Hertzsprung-Russell star data

Relates temperature and luminosity of stars.

Example - Brownlee's stack loss plant data

Obtained from 21 days of operation of a plant for the oxidation of ammonia (NH3) to nitric acid (HNO3). The nitric oxides produced are absorbed in a counter-current absorption tower.

x1 Air.Flow - flow of cooling air
x2 Water.Temp - cooling water inlet temperature
x3 Acid.Conc. - concentration of acid [per 1000, minus 500]
y stack.loss - stack loss

Thin Plate Splines

Y = P(X, β) + f(X, β) + ε, where P is a polynomial and f is a thin plate spline. Minimise the residuals plus a smoothness penalty:

    Σ residuals² + λ ∫ f''(X)² dX

Thin Plate Splines - nice features

Computationally fast, therefore interactive graphics and bootstrap methods are feasible.
Can take out systematic components, therefore some interpretation is possible.
Can change the degree of the penalty function: penalising 3rd derivatives ensures that the spline will take out quadratic effects, etc.
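The penalty idea can be illustrated with a discrete one-dimensional analogue (a sketch, not the full thin-plate machinery; the data are invented): replace ∫ f''² by a sum of squared second differences, which gives a closed-form ridge-type solution.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

def smooth(y, lam):
    """Minimise ||y - f||^2 + lam * ||D2 f||^2, where D2 is the
    second-difference operator (a discrete stand-in for f'').
    Closed form: f = (I + lam * D2'D2)^-1 y."""
    n = y.size
    D2 = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)

f_rough = smooth(y, 0.1)      # light penalty: follows the noisy data closely
f_smooth = smooth(y, 1000.0)  # heavy penalty: the fit tends towards a straight line
```

Note that as λ grows the fit collapses onto the null space of the second-difference penalty, i.e. a straight line; penalising 3rd differences instead would leave quadratics unpenalised, matching the last point above.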