Least Squares example
There are 3 mountains u, y, z whose heights from one site have been measured as 2474 ft., 3882 ft., and 4834 ft. But from u, y looks 1422 ft. taller and z looks 2354 ft. taller, and from y, z looks 950 ft. taller. Set up the overdetermined system.
    Ax =
      [  1   0   0 ]           [ 2474 ]
      [  0   1   0 ]   [ u ]   [ 3882 ]
      [  0   0   1 ]   [ y ] ~ [ 4834 ]  = b
      [ -1   1   0 ]   [ z ]   [ 1422 ]
      [ -1   0   1 ]           [ 2354 ]
      [  0  -1   1 ]           [  950 ]

Want to minimize ||Ax - b||_2
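A minimal NumPy sketch (my addition, not part of the original slides) that sets up this A and b and solves the least-squares problem directly:

    import numpy as np

    # Rows 1-3: direct height measurements of u, y, z.
    # Rows 4-6: the pairwise height differences y-u, z-u, z-y.
    A = np.array([[ 1,  0,  0],
                  [ 0,  1,  0],
                  [ 0,  0,  1],
                  [-1,  1,  0],
                  [-1,  0,  1],
                  [ 0, -1,  1]], dtype=float)
    b = np.array([2474, 3882, 4834, 1422, 2354, 950], dtype=float)

    x, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizes ||Ax - b||_2
    print(x)                                    # approximately [2472, 3886, 4832]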
Exponential example
[Figure: plot of the data points b against t, for t = 0 to 10 and b up to about 1.4]
Assume you are given the following data:

    t:   1     2     3    4    5    6    7     8     9     10
    b:   1.3   0.85  0.5  0.4  0.3  0.2  0.15  0.14  0.12  0.11
Which you think fits the model

    b = u t + w e^(-.5 t)

for some values of u and w, which you wish to determine so that

    A [ u ]  ~  b
      [ w ]

The matrix A would look like:

    A = [  1   e^(-.5)  ]
        [  2   e^(-1)   ]
        [  3   e^(-1.5) ]
        [  .      .     ]
        [  .      .     ]
        [ 10   e^(-5)   ]
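As a rough illustration (my own sketch, not from the original slides), here is NumPy code that builds this A from the data above and solves for u and w; the column choices follow the model b = u t + w e^(-.5 t) as written on this slide:

    import numpy as np

    t = np.arange(1, 11, dtype=float)
    b = np.array([1.3, 0.85, 0.5, 0.4, 0.3, 0.2, 0.15, 0.14, 0.12, 0.11])

    # Columns of A: t and exp(-0.5 t), matching b ~ u*t + w*exp(-0.5 t)
    A = np.column_stack([t, np.exp(-0.5 * t)])
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    u, w = coef
    print(u, w)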
Approaches to solve Ax ~ b
• Normal equations - quick and dirty
• QR - standard in libraries; uses an orthogonal decomposition
• SVD - a decomposition which also gives an indication of how linearly independent the columns are
• Conjugate gradient - no decompositions; good for large sparse problems
Quick and Dirty Approach
Multiply by A^T to get the normal equations:

    A^T A x = A^T b

For the mountain example the matrix A^T A is 3 x 3.
The matrix A^T A is symmetric.
However, sometimes A^T A can be nearly singular or singular.
Consider the matrix

    A = [ 1  1 ]
        [ e  0 ]
        [ 0  e ]

The matrix

    A^T A = [ 1+e^2    1    ]
            [   1    1+e^2  ]

becomes singular if e is less than the square root of the machine precision.
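To make this concrete, a small NumPy check (my own sketch, not from the slides); with e below the square root of double-precision machine epsilon (about 1.5e-8), the computed A^T A is exactly singular even though A itself still has full column rank:

    import numpy as np

    e = 1e-9    # smaller than sqrt(machine epsilon) for float64
    A = np.array([[1.0, 1.0],
                  [e,   0.0],
                  [0.0, e  ]])

    AtA = A.T @ A                        # rounds to [[1, 1], [1, 1]]
    print(np.linalg.matrix_rank(AtA))    # 1: the computed normal-equations matrix is singular
    print(np.linalg.matrix_rank(A))      # 2: A itself still has full column rank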
Normal equations for least squares - slide 2
For the hill problem:

              [  3  -1  -1 ] [ u ]   [ -1302 ]
    A^T A x = [ -1   3  -1 ] [ y ] = [  4354 ] = A^T b
              [ -1  -1   3 ] [ z ]   [  8138 ]

Solution is u = 2472, y = 3886, z = 4832.
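A quick NumPy check of this solve (my addition); it forms the normal equations above and solves the 3 x 3 system:

    import numpy as np

    AtA = np.array([[ 3., -1., -1.],
                    [-1.,  3., -1.],
                    [-1., -1.,  3.]])
    Atb = np.array([-1302., 4354., 8138.])
    print(np.linalg.solve(AtA, Atb))    # [2472. 3886. 4832.]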
QR: A more stable approach
Assume the m x n matrix A and the m-vector b can be partitioned into

    A = [ R ]   and   b = [ c ]
        [ 0 ]             [ d ]

where R is a nonsingular n x n matrix and c has length n.
Then

    ||Ax - b||_2^2 = ||Rx - c||_2^2 + ||d||_2^2

So pick x such that ||Rx - c||_2^2 = 0, i.e. solve Rx = c, which implies
||Ax - b||_2^2 = ||d||_2^2. This is the best one can do.
QR for least squares - slide 2
Most A matrices do not have this form, but using orthogonal transformations we can transform matrices of full column rank into this form.
A matrix Q is orthogonal if Q^T Q = I.
If Q is orthogonal then for any x, ||Qx||_2^2 = ||x||_2^2;
that is, an orthogonal matrix preserves the 2-norm.
Examples of orthogonal matrices:

    [ 1  0 ]     [ 0  1 ]     [  cos y   sin y ]
    [ 0  1 ]     [ 1  0 ]     [ -sin y   cos y ]   for some angle y

The last of these is a Givens rotation.
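A small NumPy illustration (my own, not from the slides) of how a Givens rotation can zero out a component of a vector while preserving its 2-norm:

    import numpy as np

    v = np.array([3.0, 4.0])
    c, s = v / np.linalg.norm(v)      # cos and sin of the rotation angle
    G = np.array([[ c,  s],
                  [-s,  c]])          # a Givens rotation

    print(G @ v)        # approximately [5, 0]: one entry eliminated, norm preserved
    print(G.T @ G)      # identity, so G is orthogonal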
QR for least squares - slide 3
We wish to pick a sequence of orthogonal matrices such that A is transformed into upper triangular form:

    x x x      x x x      x x x      x x x
    x x x      0 x x      0 x x      0 x x
    x x x  ->  0 x x  ->  0 0 x  ->  0 0 x   =  [ R ]
    x x x      0 x x      0 0 x      0 0 0      [ 0 ]
    x x x      0 x x      0 0 x      0 0 0

The transformations are then applied to the data vector b.
The solution is now found by solving Rx = c, where c is the first n elements of the transformed b.
QR for least squares - slide 4
Householder transformations of the form I - 2uu^T / u^T u can easily eliminate whole columns.
If the A matrix is almost square, the QR approach and the normal equations approach require about the same number of operations.
If A is tall and skinny, the QR approach takes about twice the number of operations.
Most good least squares solvers use the QR approach.
In Matlab: x = A\b.
Good project: Investigate structure of R if A is sparse.
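A sketch of a QR-based least-squares solve in NumPy (my illustration; a library routine would use a triangular solver for Rx = c):

    import numpy as np

    def qr_lstsq(A, b):
        # Reduced QR: A = Q R with Q (m x n, orthonormal columns) and R (n x n, upper triangular).
        Q, R = np.linalg.qr(A)
        c = Q.T @ b                   # the first n elements of the transformed b
        return np.linalg.solve(R, c)  # solve R x = c

For a full-column-rank A this gives the same answer as np.linalg.lstsq or Matlab's backslash.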
Singular Value Decomposition
The singular value decomposition (SVD) of a matrix A is given by

    A = U S V^T

where U and V are orthogonal and S is a diagonal matrix of the form

    S = [ s1          ]
        [     s2      ]
        [         s3  ]

If any of the s's are 0, the matrix is singular. Thus one can determine
how close A is to a singular matrix by looking at the smallest s's.
Good project: Investigate an application of SVD
SVD for least squares
If A is an m x n matrix of rank n, then

    A = U S V^T = [ U1  U2 ] [ S1 ] V^T = U1 S1 V^T
                             [ 0  ]

where U1 has the first n columns of U and S1 is n x n.
The solution to the least squares problem Ax ~ b is given by

    x = V S1^(-1) U1^T b

Requires 4 to 10 times more work than QR but shows dependencies in the model.
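A NumPy sketch of this formula (my addition); the thin SVD returns U1, the diagonal of S1, and V^T directly, and the ratio of smallest to largest singular value indicates how close the columns are to being dependent:

    import numpy as np

    def svd_lstsq(A, b):
        U1, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: A = U1 diag(s) Vt
        print("smallest/largest singular value:", s[-1] / s[0])
        return Vt.T @ ((U1.T @ b) / s)                      # x = V S1^(-1) U1^T b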
Conjugate gradient
• Does not require a decomposition of the matrix
• Good for large sparse problems, like PET
• Iterative method that requires a matrix-vector multiplication by A and by A^T each iteration
• In exact arithmetic, for n variables it is guaranteed to converge in n iterations - so 2 iterations for the exponential fit and 3 iterations for the hill problem
• Does not zigzag
Conjugate gradient algorithm
Initialization:
    Guess x
    Set r = b - Ax, p = A^T r   (p will be a direction)
    Set γ = p^T p
Until convergence:
    Set q = Ap, α = γ / q^T q
    Reset x to x + α p
    Reset r to r - α q
    Set s = A^T r, γ_new = s^T s, β = γ_new / γ, γ = γ_new
    Reset p to s + β p
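A direct translation of this pseudocode into NumPy (my own sketch); only products with A and A^T are used, never a factorization:

    import numpy as np

    def cgls(A, b, tol=1e-12, max_iter=None):
        m, n = A.shape
        x = np.zeros(n)
        r = b - A @ x
        p = A.T @ r                  # p will be a direction
        gamma = p @ p
        for _ in range(max_iter or n):
            q = A @ p
            alpha = gamma / (q @ q)
            x = x + alpha * p
            r = r - alpha * q
            s = A.T @ r
            gamma_new = s @ s
            if gamma_new < tol:      # a simple convergence test
                break
            beta = gamma_new / gamma
            gamma = gamma_new
            p = s + beta * p
        return x

In exact arithmetic this reaches the least-squares solution in n iterations, as noted above.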
Nonlinear example
Assume in the exponential example that the model was

    b = u + w e^(-kt)

with k added to the list of unknowns.
The variable k is nonlinear while u and w are linear.
Possible approaches:
• Treat all three variables as nonlinear and use a nonlinear solver
• Use a 1-dimensional solver for k; each time a function value is requested, solve for u and w with a linear least squares solver, plug in the best values for them, and give back the residual (see the sketch below)
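A sketch of the second approach (my own illustration; the 1-D solver here is SciPy's bounded minimize_scalar, and the search interval for k is an assumption):

    import numpy as np
    from scipy.optimize import minimize_scalar

    t = np.arange(1, 11, dtype=float)
    bdata = np.array([1.3, 0.85, 0.5, 0.4, 0.3, 0.2, 0.15, 0.14, 0.12, 0.11])

    def residual_norm(k):
        # For a fixed k, u and w enter linearly: solve that least-squares problem
        # and return the residual norm, which the outer 1-D solver minimizes.
        A = np.column_stack([np.ones_like(t), np.exp(-k * t)])
        coef, *_ = np.linalg.lstsq(A, bdata, rcond=None)
        return np.linalg.norm(A @ coef - bdata)

    best = minimize_scalar(residual_norm, bounds=(0.01, 5.0), method='bounded')
    print(best.x)    # the fitted k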
One dimensional minimization of f(x)
Assumption:
Given an interval [a,b] where one “knows” there is a minimum, assume f is unimodal on [a,b], i.e. there is only one local minimum.
With this assumption we can find the minimum by sampling and discarding portions of the interval that cannot contain the solution.
The best algorithms are combinations of golden section search and quadratic interpolation through 3 points, taking the minimum of the quadratic (Brent's method).
Golden section on f(x) = -(2x^3 - 9x^2 + 12x + 6)
Originally a = 0, x1 = .7639, x2 = 1.2361, b = 2.0
Subsequent values of a and b:

    a        b
    .7639    2.0
    .7639    1.5279
    .7639    1.2361
    .9443    1.2361
    .9443    1.1246
    .9443    1.0557
    .9868    1.0557
    .9868    1.0294   etc.
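A runnable version of golden section search (my sketch); applied to the function above on [0, 2] it converges to the minimizer near x = 1 that the table brackets:

    import numpy as np

    def golden_section(f, a, b, tol=1e-5):
        # Golden section search for the minimum of a unimodal f on [a, b].
        rho = (np.sqrt(5) - 1) / 2        # about 0.618
        x1, x2 = b - rho * (b - a), a + rho * (b - a)
        f1, f2 = f(x1), f(x2)
        while b - a > tol:
            if f1 < f2:                   # minimum lies in [a, x2]
                b, x2, f2 = x2, x1, f1
                x1 = b - rho * (b - a)
                f1 = f(x1)
            else:                         # minimum lies in [x1, b]
                a, x1, f1 = x1, x2, f2
                x2 = a + rho * (b - a)
                f2 = f(x2)
        return (a + b) / 2

    print(golden_section(lambda x: -(2*x**3 - 9*x**2 + 12*x + 6), 0.0, 2.0))  # about 1.0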
Unconstrained minimization of f(x) where x has n elements
• Steepest descent - requires first derivatives (gradient);
      might zigzag;
      good beginning strategy;
      sped up by conjugate gradient
• Newton - requires first and second derivatives (Hessian);
      requires solution of a linear system with n variables;
      fast if close to the solution
• Quasi-Newton (most practical) - requires first derivatives;
      no linear system to solve;
      builds up an approximation to the inverse of the Hessian
Newton’s method for minimization
Let g = gradient, H = matrix of second partials.
Taylor’s Theorem:

    f(x+s) ≈ f(x) + g^T s + 0.5 s^T H s

This quadratic function in s is minimized when

    s = -H^(-1) g

Algorithm:
    Guess x
    Until convergence:
        Solve H(x) s = -g(x)   (needs H and a linear solver)
        Reset x to x + s
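A compact NumPy version of this algorithm (my own sketch), run on the quadratic f(x) = 0.5 x1^2 + 2.5 x2^2 used in the comparison slide below; from (5, 1) a single Newton step reaches the minimizer (0, 0):

    import numpy as np

    def newton_min(grad, hess, x0, tol=1e-8, max_iter=50):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            s = np.linalg.solve(hess(x), -g)   # solve H(x) s = -g(x)
            x = x + s
        return x

    # f(x) = 0.5*x1^2 + 2.5*x2^2: gradient (x1, 5*x2), Hessian diag(1, 5)
    print(newton_min(lambda x: np.array([x[0], 5*x[1]]),
                     lambda x: np.diag([1.0, 5.0]),
                     [5.0, 1.0]))              # [0. 0.]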
Quasi-Newton Method
• Builds up an approximation to the inverse Hessian in the directions that have been searched
• Almost as fast as Newton
Initial: pick x, set B = I.
Until convergence:
    set s = -Bg   (no linear system to solve)
    set xnew = x + s
    let γ = g(xnew) - g(x);  δ = xnew - x;  x = xnew
    reset B to B + δ δ^T / (δ^T γ) - B γ (B γ)^T / (γ^T B γ)
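A NumPy sketch of this iteration (my own; the update shown on the slide is the DFP formula for the inverse Hessian). As written it takes the full step s with no line search:

    import numpy as np

    def quasi_newton(grad, x0, tol=1e-8, max_iter=100):
        x = np.asarray(x0, dtype=float)
        B = np.eye(x.size)                 # approximation to the inverse Hessian
        g = grad(x)
        for _ in range(max_iter):
            if np.linalg.norm(g) < tol:
                break
            s = -B @ g                     # no linear system to solve
            x_new = x + s
            g_new = grad(x_new)
            y = g_new - g                  # gamma on the slide
            d = x_new - x                  # delta on the slide
            By = B @ y
            B = B + np.outer(d, d) / (d @ y) - np.outer(By, By) / (y @ By)
            x, g = x_new, g_new
        return x

    # Same quadratic as the comparison slide: gradient of 0.5*x1^2 + 2.5*x2^2
    print(quasi_newton(lambda x: np.array([x[0], 5*x[1]]), [5.0, 1.0]))   # about [0, 0]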
Comparison of steepest descent, Newton, and quasi-Newton on f(x) = 0.5 x1^2 + 2.5 x2^2

    Iteration   Steepest descent   Newton         Quasi-Newton
    0           ( 5.00,   1.00)    (5.00, 1.00)   ( 5.00,   1.00)
    1           ( 3.333, -0.667)   (0.00, 0.00)   ( 0.00,  -4.00)
    2           ( 2.222,  0.444)                  (-2.222,  0.444)
    3           ( 1.481, -0.296)                  ( 0.816,  0.082)
    4           ( 0.988,  0.198)                  (-0.009, -0.015)
    5           ( 0.658, -0.132)                  (-0.001,  0.001)
    6           ( 0.439,  0.088)
    7           ( 0.293, -0.059)
Large Scale Problems
Conjugate gradient vs. Limited Memory Quasi-Newton
Conjugate gradient: each step is a linear combination of the previous step and the current gradient.
Limited Memory (Nocedal, Schnabel, Byrd, Kaufman):
    Do not multiply B out but keep the vectors.
    Need to keep 2 vectors per iteration.
    After k steps (k is about 5) reset B to I and start again.
Experimentation favors LMQN over CG.
Good project: How should LMQN be done?