Two sides to the modelling problem
Y = F(X, B, ε)

How do you make F as flexible as possible?

What effect does the behaviour of ε have on the inferences you make about F and B?

Generally, to know about F you have to know about ε, and vice versa.
'It's some catch, that Catch-22.'
Resistant Regression- Least Squares
Least squares is ubiquitous.
Regression does it
Non-linear fits do it
Even neural nets and Box-Jenkins do it
Resistant Regression - Least Squares
And with good reason.
1) Our old friend Gauss (the normal distribution)
(1/(σ√(2π))) * exp(-(x/(√2*σ))**2)

when modelling translates to

(1/(σ√(2π))) * exp(-((Y - F(X,B))/(√2*σ))**2)
if you assume that the error/uncertainty about
the proposed fit F(X,B) has a normal distribution.
So least squares will maximise the likelihood for
your parameter estimates B.
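To see the point concretely, here is a minimal numpy sketch (not from the slides; the data and sigma are invented) showing that the Gaussian log-likelihood is a decreasing function of the sum of squared residuals, so the B that minimises one maximises the other.

```python
# Minimal sketch: under normal errors, the log-likelihood is a constant
# minus SSE/(2*sigma**2), so the least-squares slope is also the
# maximum-likelihood slope. Data, sigma and the grid are made up.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)   # true B = (2.0, 0.5)

sigma = 1.0
slopes = np.linspace(0.0, 1.0, 201)                      # grid over one parameter for clarity
sse = np.array([np.sum((y - (2.0 + b * x)) ** 2) for b in slopes])
loglik = -0.5 * x.size * np.log(2 * np.pi * sigma ** 2) - sse / (2 * sigma ** 2)

# The same slope minimises the sum of squares and maximises the likelihood.
assert slopes[np.argmin(sse)] == slopes[np.argmax(loglik)]
print(slopes[np.argmin(sse)])
```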
Resistant Regression - Least Squares
2) Our old friend Pythagoras (as used by Fisher).
When you fit a linear model with p parameters, you are projecting
your N dimensional data vector into a p dimensional subspace. If
you wish to test whether a model with q<p parameters is as good
then you can simply partition the sums of squares of the
projections. This is the basis of the analysis of variance.
[Figure: right-angled triangle formed by the data point R in N dimensions, its projection P into the p-dimensional model space, and its projection Q into the q-dimensional sub-space; RQ**2 = RP**2 + PQ**2.]
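A minimal numpy sketch of this picture (the example data are invented): project y onto a p-column design and onto a nested q-column design, and check that the sums of squares partition exactly, which is the ANOVA decomposition.

```python
# Project y onto a p-dimensional model space and a nested q-dimensional
# subspace and verify the Pythagorean partition RQ**2 = RP**2 + PQ**2.
import numpy as np

rng = np.random.default_rng(1)
N = 30
X_p = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # p = 3 columns
X_q = X_p[:, :2]                                              # nested q = 2 columns
y = X_p @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=N)

def project(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

y_p = project(X_p, y)            # point P: projection into p dimensions
y_q = project(X_q, y)            # point Q: projection into q dimensions

RQ2 = np.sum((y - y_q) ** 2)     # residual SS under the smaller model
RP2 = np.sum((y - y_p) ** 2)     # residual SS under the larger model
PQ2 = np.sum((y_p - y_q) ** 2)   # extra SS explained by the p - q extra columns

print(RQ2, RP2 + PQ2)            # equal up to rounding: the ANOVA partition
```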
Resistant Regression - Least Squares
3) Our old friend Newton (as modified by Gauss, Seidel, Marquardt, Levenberg)
Solution for parameter values B usually consists of some sort of updating
equation of the type
B(n+1) = B(n) - dF(X,B)/d2F(X,B)
where dF is a vector of 1st derivatives wrt B and d2F is a matrix of second
derivatives wrt B, and the division means premultiplying by the inverse of d2F.
Life is a lot simpler if d2F(X,B) is positive definite.
If F has a quadratic form this tends to be the case.
Note that the sharpness of the peak gives you
the confidence in the estimate of B.
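A rough one-parameter sketch of the updating equation (made-up data, model F(x, B) = exp(-B*x)); it illustrates the Newton idea behind Gauss-Newton and Levenberg-Marquardt, not any particular library's implementation.

```python
# Newton updating on the sum-of-squares objective S(B) = sum(r**2),
# r = y - exp(-B*x): B is improved by subtracting dS/dB divided by d2S/dB2.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 4, 40)
y = np.exp(-0.7 * x) + rng.normal(scale=0.02, size=x.size)   # true B = 0.7

def derivatives(b):
    f = np.exp(-b * x)
    r = y - f                                      # residuals
    d1 = 2 * np.sum(r * x * f)                     # dS/dB
    d2 = 2 * np.sum((x * f) ** 2 - r * x ** 2 * f) # d2S/dB2
    return d1, d2

b = 0.2                                            # rough starting value
for _ in range(10):
    d1, d2 = derivatives(b)
    b = b - d1 / d2                                # the updating equation
print(b)                                           # converges near 0.7
```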
Resistant Regression - Least Squares
So when it's good it's very, very good.
Why look at anything else?
Because when it's bad it's awful.
When is it bad? When your data contains points that don’t belong there, either
because they are wrong or because they don’t belong to the same population.
In the jargon this is called contamination, and the amount of contamination that
an algorithm can withstand before giving rubbish answers is called the
breakdown point. Least squares has a very low breakdown point. In fact one
bad piece of data is all it takes. The big problem is that what is obvious in the
case of 1 regressor is not obvious in the case of 100 regressors.
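A tiny invented demonstration of the low breakdown point: a single contaminated point drags the least-squares slope well away from the truth.

```python
# One bad piece of data is all it takes: compare the OLS slope on clean
# data with the slope after corrupting a single observation.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 20)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)   # true slope = 2
y_bad = y.copy()
y_bad[-1] = -100.0                                        # one contaminated point

X = np.column_stack([np.ones_like(x), x])
clean_fit, *_ = np.linalg.lstsq(X, y, rcond=None)
dirty_fit, *_ = np.linalg.lstsq(X, y_bad, rcond=None)
print(clean_fit[1], dirty_fit[1])   # slope near 2 versus a badly distorted slope
```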
Resistant Regression - a new idea
Technology to deal with one outlier was developed in the late 1970s
(Cook, Barnett et al.); more complicated schemes followed in the 1980s to
deal with the masking effects of 2 or more outliers, but they are very complicated, very
computer intensive and, as the size of the contamination increases, less feasible.
Problem - the thinking behind these methods is still least squares thinking.
So, leave points out, do a least squares fit, see how different it is, and so on.
Peter J. Rousseeuw et al. had a much more original approach. Why least squares?
Many other measures of the size of a set of residuals are possible.
1) The size of the Mth largest absolute residual in the set
or
2) The sums of squares of the M smallest residuals in the set
Resistant Regression - a new idea
The upshot of this is that all that matters is that the model fits P% of the data.
So, if for example we choose the size of the median of the absolute residuals,
then the idea is to find the multiple regression line that fits at least 50% of
the data (any 50%) better than any other regression line.
This has a very high breakdown point. You can take a data set in which
a multiple regression holds true, add 49% rubbish and the relationship will
still be found.
In particular you can deal with data sets where there are 2 populations
with different relationships and segment out the 2.
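The sketch below is a simplified illustration of the least-median-of-squares idea, in the spirit of Rousseeuw's random-subset search (it is not the published PROGRESS algorithm): fit exact lines through random pairs of points and keep the one with the smallest median squared residual, so 49% rubbish still leaves the underlying line recoverable. The data are invented.

```python
# Simplified least median of squares (LMS) by random elemental subsets:
# lines through random pairs of points, scored by the median squared residual.
import numpy as np

rng = np.random.default_rng(4)
n_good, n_bad = 51, 49
x = np.concatenate([rng.uniform(0, 10, n_good), rng.uniform(0, 10, n_bad)])
y = np.concatenate([1.0 + 2.0 * x[:n_good] + rng.normal(scale=0.3, size=n_good),
                    rng.uniform(-30, 0, n_bad)])           # 49% rubbish

best_coef, best_crit = None, np.inf
for _ in range(2000):
    i, j = rng.choice(x.size, size=2, replace=False)
    if x[i] == x[j]:
        continue
    slope = (y[j] - y[i]) / (x[j] - x[i])
    intercept = y[i] - slope * x[i]
    crit = np.median((y - (intercept + slope * x)) ** 2)   # median squared residual
    if crit < best_crit:
        best_coef, best_crit = (intercept, slope), crit

print(best_coef)   # close to the true (1.0, 2.0) despite the contamination
```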
Multivariate outlier detection - Minimum covariance.
Method 1: Find a subset of h points (P%) that minimises the determinant of the covariance
matrix.
Method 2: Find a subset of h points (P%) that minimises the volume of the smallest
ellipsoid that contains the points.
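A hedged sketch of Method 1 using scikit-learn's MinCovDet (the data here are invented): points far from the bulk show up as large robust Mahalanobis distances computed from the minimum covariance determinant fit.

```python
# Minimum covariance determinant on a tight bivariate cluster plus a few
# gross outliers; the contaminated rows stand out in the robust distances.
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(5)
good = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=95)
bad = rng.uniform(5, 8, size=(5, 2))          # 5% contamination
X = np.vstack([good, bad])

mcd = MinCovDet(support_fraction=0.75, random_state=0).fit(X)
d2 = mcd.mahalanobis(X)                       # squared robust distances
print(np.where(d2 > 20)[0])                   # the last five rows are flagged
```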
Example - Hertzsprung-Russell Star data
Relates Temperature and Luminosity of stars.
Example - Brownlee's Stack Loss Plant Data
Obtained from 21 days of operation of a plant for the oxidation of
ammonia (NH3) to nitric acid (HNO3). The nitric oxides produced
are absorbed in a countercurrent absorption tower.
x1  Air.Flow    - Flow of cooling air
x2  Water.Temp  - Cooling water inlet temperature
x3  Acid.Conc.  - Concentration of acid [per 1000, minus 500]
y   stack.loss  - Stack loss
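For illustration, a hedged sketch of fitting this data set, assuming the stackloss data ships with statsmodels (sm.datasets.stackloss). It contrasts ordinary least squares with a Huber M-estimator (RLM); note that M-estimation is a different robust technique from the high-breakdown LMS/LTS described above, which statsmodels does not provide directly.

```python
# Compare an ordinary least squares fit with a Huber M-estimate on the
# stack loss data; the robust fit downweights the outlying days.
import statsmodels.api as sm

data = sm.datasets.stackloss.load_pandas()
X = sm.add_constant(data.exog)                # air flow, water temperature, acid concentration
y = data.endog                                # stack loss

ols_fit = sm.OLS(y, X).fit()
rlm_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print(ols_fit.params)
print(rlm_fit.params)
```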
Thin Plate Splines
Y = P(X, B) + f(X, B) + ε

P is a polynomial,
f is a thin plate spline

Minimise the residual sum of squares plus a smoothness penalty:

Σ ε**2 + λ * ∫ f''(X, B)**2 dx
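A hedged sketch of this kind of penalised fit using SciPy's RBFInterpolator (SciPy 1.7+), whose 'thin_plate_spline' kernel augments the spline with a low-degree polynomial P and whose smoothing argument plays the role of the penalty weight λ; the data are invented and the correspondence to the formulation above is only approximate.

```python
# Thin plate spline smoothing of noisy 2-D data: smoothing=0 interpolates
# exactly, larger values trade residual size for smoothness.
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(200, 2))                    # 2-D design points
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

spline = RBFInterpolator(X, y, kernel='thin_plate_spline', smoothing=1.0)

grid = np.array([[0.0, 0.0], [0.5, -0.5]])               # new points to predict at
print(spline(grid))
```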
Thin Plate Splines - Nice features
Computationally fast - Therefore interactive graphics, bootstrap
methods are feasible.
Can take out systematic components - Therefore some interpretation
is possible.
Can change the degree of the penalty function - 3rd derivatives
ensure that the spline will take out quadratic effects etc.