23 Least squares approximation – II

23.1 The transpose of A

In the next section we'll develop an equation, known as the normal equation, which is much easier to solve than $Ax = \Pi(y)$, and which also gives the correct $x$. We need a bit of background first.

The transpose of a matrix, which we haven't made much use of until now, begins to play a more important role once the dot product has been introduced. If $A$ is an $m \times n$ matrix, then as you know, it can be regarded as a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$. Its transpose $A^t$ then gives a linear transformation from $\mathbb{R}^m$ to $\mathbb{R}^n$, since it's $n \times m$. Note that there is no implication here that $A^t = A^{-1}$: the matrices needn't be square, and even if they are, they need not be invertible. But $A$ and $A^t$ are related by the dot product:

Theorem: $x \cdot A^t y = Ax \cdot y$.

Proof: The same proof given for square matrices works here, although we should notice that the dot product on the left is in $\mathbb{R}^n$, while the one on the right is in $\mathbb{R}^m$.

We can "move" $A$ from one side of the dot product to the other by replacing it with $A^t$. So for instance, if $Ax \cdot y = 0$, then $x \cdot A^t y = 0$, and conversely. In fact, pushing this a bit, we get an important result:

Theorem: $\mathrm{Ker}(A^t) = (\mathrm{Range}(A))^\perp$. (In words: for the linear transformation determined by the matrix $A$, the kernel of $A^t$ is the same as the orthogonal complement of the range of $A$.)

Proof: Let $y \in (\mathrm{Range}(A))^\perp$. This means that for all $x \in \mathbb{R}^n$, $Ax \cdot y = 0$. But by the previous theorem, this means that $x \cdot A^t y = 0$ for all $x \in \mathbb{R}^n$. And any vector in $\mathbb{R}^n$ which is orthogonal to everything must be the zero vector (the non-degeneracy of the dot product). So $A^t y = 0$, and therefore $y \in \mathrm{Ker}(A^t)$. Conversely, if $y \in \mathrm{Ker}(A^t)$, then for any $x \in \mathbb{R}^n$, $x \cdot A^t y = 0$. Again by the theorem, this means that $Ax \cdot y = 0$ for all such $x$, which means that $y \perp \mathrm{Range}(A)$. We have shown that $(\mathrm{Range}(A))^\perp \subseteq \mathrm{Ker}(A^t)$ and, conversely, that $\mathrm{Ker}(A^t) \subseteq (\mathrm{Range}(A))^\perp$. So the two sets are equal.

23.2 Least squares approximations – the normal equation

Now we're ready to take up the least squares problem again. Recall that the problem is to solve $Ax = \Pi(y)$, where $y$ has been projected orthogonally onto the range of $A$. The problem with solving this, as you'll recall, is that finding the projection $\Pi$ involves lots of computation. And now we'll see that it's not necessary.

We write $y = \Pi(y) + y^\perp$, where $y^\perp$ is orthogonal to the range of $A$. Suppose that $x$ is a solution to the least squares problem $Ax = \Pi(y)$. Multiply this equation by $A^t$ to get $A^t A x = A^t \Pi(y)$; so $x$ is certainly also a solution to this. But now we notice that, in consequence of the previous theorem,
$$A^t y = A^t(\Pi(y) + y^\perp) = A^t \Pi(y),$$
since $A^t y^\perp = 0$ ($y^\perp$ is orthogonal to the range, so the theorem says it's in $\mathrm{Ker}(A^t)$). So $x$ is also a solution to the normal equation
$$A^t A x = A^t y.$$

Conversely, if $x$ is a solution to the normal equation, then $A^t(Ax - y) = 0$, and by the previous theorem, this means that $Ax - y$ is orthogonal to the range of $A$. But $Ax - y$ is the error made using an approximate solution, and this shows that the error vector is orthogonal to the range of $A$ – this is our definition of the least squares solution!

The reason for all this fooling around is simple: we can compute $A^t y$ by doing a simple matrix multiplication. We don't need to find an orthonormal basis for the range of $A$ to compute $\Pi$. We summarize the results:

Theorem: $\tilde{x}$ is a least-squares solution to $Ax = y$ $\iff$ $\tilde{x}$ is a solution to the normal equation $A^t A x = A^t y$.
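To see the theorem in action numerically, here is a minimal sketch in Python with NumPy (the matrix $A$ and vector $y$ below are arbitrary illustrative data, not taken from the notes). It solves the problem both ways: via the normal equation, and by explicitly projecting $y$ onto $\mathrm{Range}(A)$ first.

```python
import numpy as np

# Minimal numerical check of the theorem above: the solution of the
# normal equation A^t A x = A^t y agrees with the solution obtained by
# first projecting y orthogonally onto Range(A). The data are made up.

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [2.0, 1.0]])        # 4x2 with full column rank
y = np.array([1.0, 2.0, 3.0, 4.0])

# Route 1: the normal equation -- no projection required.
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Route 2: project y onto Range(A) first (the columns of Q form an
# orthonormal basis of Range(A)), then solve Ax = Pi(y).
Q, _ = np.linalg.qr(A)
proj_y = Q @ (Q.T @ y)                                # Pi(y)
x_proj, *_ = np.linalg.lstsq(A, proj_y, rcond=None)

print(np.allclose(x_normal, x_proj))                  # True
# The error vector Ax - y is orthogonal to Range(A), i.e. A^t(Ax - y) = 0:
print(np.allclose(A.T @ (A @ x_normal - y), 0.0))     # True
```

Route 1 is one small matrix multiplication and a $2 \times 2$ solve; route 2 needs an orthonormal basis for the range, which is exactly the computation the normal equation lets us avoid.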
Example: Find the least squares regression line through the 4 points $(1, 2)$, $(2, 3)$, $(-1, 1)$, $(0, 1)$.

Solution: We've already set up this problem in the last lecture. We have
$$A = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ -1 & 1 \\ 0 & 1 \end{pmatrix}, \quad y = \begin{pmatrix} 2 \\ 3 \\ 1 \\ 1 \end{pmatrix}, \quad \text{and } x = \begin{pmatrix} m \\ b \end{pmatrix}.$$
We compute
$$A^t A = \begin{pmatrix} 6 & 2 \\ 2 & 4 \end{pmatrix}, \quad A^t y = \begin{pmatrix} 7 \\ 7 \end{pmatrix},$$
and the solution to the normal equation is
$$x = (A^t A)^{-1} A^t y = \frac{1}{20}\begin{pmatrix} 4 & -2 \\ -2 & 6 \end{pmatrix}\begin{pmatrix} 7 \\ 7 \end{pmatrix} = \begin{pmatrix} 7/10 \\ 7/5 \end{pmatrix}.$$
So the regression line has the equation $y = (7/10)x + 7/5$.

Remark: We have not addressed the critical issue of whether or not the least squares solution is a "good" approximate solution. The normal equation can always be solved, so we'll always get an answer, but how good is the answer? This is not a simple question, but it's discussed at length under the general subject of linear models in statistics texts. Another issue which often arises: looking at the data, it might seem more reasonable to fit the data points to an exponential or trigonometric function, rather than to a linear one. This still leads to a least squares problem.

Example: Suppose we'd like to fit a cubic (rather than linear) function to our data set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$. The cubic will have the form $y = ax^3 + bx^2 + cx + d$, where the coefficients $a, b, c, d$ have to be determined. Since the $(x_i, y_i)$ are known, this still gives us a linear problem:
$$y_1 = ax_1^3 + bx_1^2 + cx_1 + d, \quad \ldots, \quad y_n = ax_n^3 + bx_n^2 + cx_n + d,$$
or
$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_1^3 & x_1^2 & x_1 & 1 \\ \vdots & & & \vdots \\ x_n^3 & x_n^2 & x_n & 1 \end{pmatrix} \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}.$$
This is a least squares problem just like the regression line problem, just a bit bigger. It's solved the same way, using the normal equation. (A short computational sketch of this polynomial fit appears after the exercises below.)

Exercises:

1. Find a least squares solution to the system $Ax = y$, where
$$A = \begin{pmatrix} 2 & 1 \\ -1 & 3 \\ 3 & 4 \end{pmatrix}, \quad \text{and } y = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}.$$

2. Suppose you want to model your data $\{(x_i, y_i) : 1 \le i \le n\}$ with an exponential function $y = ae^{bx}$. Show how to do this using logarithms.

3. (*) For these problems, think of the row space as the column space of $A^t$. Show that $v$ is in the row space of $A$ $\iff$ $v = A^t y$ for some $y$. This means that the row space of $A$ is the range of $f_{A^t}$ (analogous to the fact that the column space of $A$ is the range of $f_A$).

4. (*) Show that the null space of $A$ is the orthogonal complement of the row space. (Hint: use the above theorem with $A^t$ instead of $A$.)
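As promised, here is a minimal sketch of the cubic-fit example, again in Python with NumPy; the data points below are made up purely for illustration. Each row of the matrix is $(x_i^3, x_i^2, x_i, 1)$, and the normal equation delivers the coefficients $(a, b, c, d)$.

```python
import numpy as np

# Minimal sketch of the cubic-fit example, with made-up data.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
ys = np.array([-9.5, -1.2, 0.8, 1.1, 7.9, 26.5])

# Design matrix for y = a x^3 + b x^2 + c x + d:
# row i is (x_i^3, x_i^2, x_i, 1).
A = np.column_stack([xs**3, xs**2, xs, np.ones_like(xs)])

# Solve the normal equation A^t A x = A^t y.
a, b, c, d = np.linalg.solve(A.T @ A, A.T @ ys)
print(f"fit: y = {a:.3f} x^3 + {b:.3f} x^2 + {c:.3f} x + {d:.3f}")

# Sanity check: the residual is orthogonal to Range(A).
coeffs = np.array([a, b, c, d])
print(np.allclose(A.T @ (A @ coeffs - ys), 0.0))   # True
```

One practical note: library routines such as np.linalg.lstsq (or np.polyfit for polynomial fits) solve the same minimization using QR or SVD factorizations rather than forming $A^t A$ explicitly, which behaves better numerically when the columns of $A$ are nearly dependent; for small well-conditioned problems like those in these notes, the normal equation is perfectly adequate.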