23 Least squares approximation - II

23.1 The transpose of A
In the next section we'll develop an equation, known as the normal equation, which is much
easier to solve than Ax = Π(y), and which also gives the correct x. We need a bit of
background first.
The transpose of a matrix, which we haven't made much use of until now, begins to play a
more important role once the dot product has been introduced. If A is an m×n matrix, then
as you know, it can be regarded as a linear transformation from R^n to R^m. Its transpose,
A^t, then gives a linear transformation from R^m to R^n, since it's n×m. Note that there is no
implication here that A^t = A^{-1} – the matrices needn't be square, and even if they are, they
need not be invertible. But A and A^t are related by the dot product:
Theorem: x•A^t y = Ax•y.
Proof: The same proof given for square matrices works here, although we should notice that
the dot product on the left is in R^n, while the one on the right is in R^m.
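A quick numerical sanity check of this identity, as a sketch in NumPy (the particular matrix and vectors are arbitrary choices, not taken from the lecture):

import numpy as np

# Arbitrary 3x2 matrix: A maps R^2 -> R^3, so x is in R^2 and y is in R^3.
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])
x = np.array([2.0, -1.0])
y = np.array([1.0, 4.0, 0.5])

lhs = x @ (A.T @ y)   # dot product taken in R^2
rhs = (A @ x) @ y     # dot product taken in R^3
print(np.isclose(lhs, rhs))   # True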
We can “move” A from one side of the dot product to the other by replacing it with A^t. So
for instance, if Ax•y = 0, then x•A^t y = 0, and conversely. In fact, pushing this a bit, we
get an important result:
Theorem: Ker(A^t) = (Range(A))⊥. (In words, for the linear transformation determined by
the matrix A, the kernel of A^t is the same as the orthogonal complement of the range of A.)
Proof: Let y ∈ (Range(A))⊥. This means that for all x ∈ R^n, Ax•y = 0. But by the
previous theorem, this means that x•A^t y = 0 for all x ∈ R^n. But any vector in R^n which is
orthogonal to everything must be the zero vector (non-degenerate property of •). So A^t y = 0
and therefore y ∈ Ker(A^t). Conversely, if y ∈ Ker(A^t), then for any x ∈ R^n, x•A^t y = 0.
And again by the theorem, this means that Ax•y = 0 for all such x, which means that
y ⊥ Range(A).
We have shown that (Range(A))⊥ ⊆ Ker(A^t), and conversely, that Ker(A^t) ⊆ (Range(A))⊥.
So the two sets are equal.
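A small numerical illustration of the theorem, as a sketch (the 3×2 matrix below is an arbitrary choice; its two columns span the range of A, and the cross product of the columns is one way to produce a vector orthogonal to that plane):

import numpy as np

# Columns of this 3x2 matrix span a plane in R^3 (the range of A).
A = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])

# A vector orthogonal to both columns, i.e. orthogonal to the range of A.
n = np.cross(A[:, 0], A[:, 1])

# By the theorem, n should lie in Ker(A^t): A^t n = 0 (up to rounding).
print(np.allclose(A.T @ n, 0.0))   # True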
23.2 Least squares approximations – the normal equation
Now we're ready to take up the least squares problem again. Recall that the problem is
to solve Ax = Π(y), where y has been projected orthogonally onto the range of A. The
problem with solving this, as you’ll recall, is that finding the projection Π involves lots of
computation. And now we’ll see that it’s not necessary.
We write y = Π(y) + y⊥, where y⊥ is orthogonal to the range of A. Suppose that x is
a solution to the least squares problem Ax = Π(y). Multiply this equation by A^t to get
A^t Ax = A^t Π(y). So x is certainly also a solution to this. But now we notice that, in
consequence of the previous theorem,
A^t y = A^t (Π(y) + y⊥) = A^t Π(y),
since A^t y⊥ = 0. (It's orthogonal to the range, so the theorem says it's in Ker(A^t).)
So x is also a solution to the normal equation
A^t Ax = A^t y.
Conversely, if x is a solution to the normal equation, then
A^t (Ax − y) = 0,
and by the previous theorem, this means that Ax − y is orthogonal to the range of A. But
Ax − y is the error made using an approximate solution, and this shows that the error vector
is orthogonal to the range of A – this is our definition of the least squares solution!
The reason for all this fooling around is simple: we can compute A^t y by doing a simple
matrix multiplication. We don't need to find an orthonormal basis for the range of A to
compute Π. We summarize the results:
Theorem: x̃ is a least-squares solution to Ax = y ⇐⇒ x̃ is a solution to the normal
equation A^t Ax = A^t y.
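As a rough sketch of how the theorem is used computationally (assuming the columns of A are linearly independent, so that A^t A is invertible; the overdetermined system below is made up for illustration):

import numpy as np

def least_squares_via_normal_equation(A, y):
    # Solve A^t A x = A^t y; assumes A^t A is invertible.
    return np.linalg.solve(A.T @ A, A.T @ y)

# An arbitrary overdetermined system (4 equations, 2 unknowns).
A = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

x_tilde = least_squares_via_normal_equation(A, y)
# Agrees with NumPy's built-in least squares solver.
print(np.allclose(x_tilde, np.linalg.lstsq(A, y, rcond=None)[0]))   # True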
Example: Find the least squares regression line through the 4 points (1, 2), (2, 3), (−1, 1), (0, 1).
Solution: We've already set up this problem in the last lecture. We have

A = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ -1 & 1 \\ 0 & 1 \end{pmatrix}, \quad y = \begin{pmatrix} 2 \\ 3 \\ 1 \\ 1 \end{pmatrix}, \quad \text{and} \quad x = \begin{pmatrix} m \\ b \end{pmatrix}.

We compute

A^t A = \begin{pmatrix} 6 & 2 \\ 2 & 4 \end{pmatrix}, \quad A^t y = \begin{pmatrix} 7 \\ 7 \end{pmatrix},

and the solution to the normal equation is

x = (A^t A)^{-1} A^t y = (1/20) \begin{pmatrix} 4 & -2 \\ -2 & 6 \end{pmatrix} \begin{pmatrix} 7 \\ 7 \end{pmatrix} = \begin{pmatrix} 7/10 \\ 7/5 \end{pmatrix}.
So the regression line has the equation y = (7/10)x + 7/5.
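A quick NumPy check of the arithmetic above (just a sketch reproducing the numbers in this example):

import numpy as np

A = np.array([[ 1.0, 1.0],
              [ 2.0, 1.0],
              [-1.0, 1.0],
              [ 0.0, 1.0]])
y = np.array([2.0, 3.0, 1.0, 1.0])

# Solve the normal equation A^t A x = A^t y for x = (m, b).
m, b = np.linalg.solve(A.T @ A, A.T @ y)
print(m, b)   # 0.7 1.4, i.e. y = (7/10)x + 7/5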
Remark: We have not addressed the critical issue of whether or not the least squares
solution is a “good” approximate solution. The normal equation can always be solved, so
we’ll always get an answer, but how good is the answer? This is not a simple question, but
it’s discussed at length under the general subject of linear models in statistics texts.
Another issue which often arises: looking at the data, it might seem more reasonable to try
and fit the data points to an exponential or trigonometric function, rather than to a linear
one. This still leads to a least squares problem.
Example: Suppose we'd like to fit a cubic (rather than linear) function to our data set
{(x_1, y_1), . . . , (x_n, y_n)}. The cubic will have the form y = ax^3 + bx^2 + cx + d, where the
coefficients a, b, c, d have to be determined. Since the (x_i, y_i) are known, this still gives us
a linear problem:
y_1 = a x_1^3 + b x_1^2 + c x_1 + d
...
y_n = a x_n^3 + b x_n^2 + c x_n + d
or

y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_1^3 & x_1^2 & x_1 & 1 \\ \vdots & & & \vdots \\ x_n^3 & x_n^2 & x_n & 1 \end{pmatrix} \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}.
This is a least squares problem just like the regression line problem, just a bit bigger. It’s
solved the same way, using the normal equation.
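A possible NumPy sketch of the cubic fit (the data points below are made up for illustration; np.vander builds exactly the matrix of powers displayed above):

import numpy as np

# Made-up data points (x_i, y_i) just for illustration.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
ys = np.array([-9.0, -1.5, 0.5, 1.0, 7.0, 26.0])

# Rows of np.vander(xs, 4) are (x_i^3, x_i^2, x_i, 1).
A = np.vander(xs, 4)

# Normal equation: A^t A (a, b, c, d)^t = A^t ys.
a, b, c, d = np.linalg.solve(A.T @ A, A.T @ ys)
print(a, b, c, d)   # the fitted cubic's coefficients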
Exercises:
1. Find a least squares solution to the system Ax = y, where

A = \begin{pmatrix} 2 & 1 \\ -1 & 3 \\ 3 & 4 \end{pmatrix}, \quad \text{and} \quad y = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}.
2. Suppose you want to model your data {(x_i, y_i) : 1 ≤ i ≤ n} with an exponential
function y = a e^{bx}. Show how to do this using logarithms.
3. (*) For these problems, think of the row space as the column space of A^t. Show that
v is in the row space of A ⇐⇒ v = A^t y for some y. This means that the row space
of A is the range of f_{A^t} (analogous to the fact that the column space of A is the range
of f_A).
4. (*) Show that the null space of A is the orthogonal complement of the row space.
(Hint: use the above theorem with A^t instead of A.)