23 Least squares approximation - II

23.1 The transpose of A

In the next section we'll develop an equation, known as the normal equation, which is much easier to solve than $Ax = \Pi(y)$, and which also gives the correct $x$. We need a bit of background first.

The transpose of a matrix, which we haven't made much use of until now, begins to play a more important role once the dot product has been introduced. If $A$ is an $m \times n$ matrix, then as you know, it can be regarded as a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$. Its transpose $A^t$ then gives a linear transformation from $\mathbb{R}^m$ to $\mathbb{R}^n$, since it's $n \times m$. Note that there is no implication here that $A^t = A^{-1}$: the matrices needn't be square, and even if they are, they need not be invertible. But $A$ and $A^t$ are related by the dot product:

Theorem: $x \bullet A^t y = Ax \bullet y$.

Proof: The same proof given for square matrices works here, although we should notice that the dot product on the left is in $\mathbb{R}^n$, while the one on the right is in $\mathbb{R}^m$.

We can "move" $A$ from one side of the dot product to the other by replacing it with $A^t$. So for instance, if $Ax \bullet y = 0$, then $x \bullet A^t y = 0$, and conversely. In fact, pushing this a bit, we get an important result:

Theorem: $\mathrm{Ker}(A^t) = (\mathrm{Range}(A))^\perp$. (In words, for the linear transformation determined by the matrix $A$, the kernel of $A^t$ is the same as the orthogonal complement of the range of $A$.)

Proof: Let $y \in (\mathrm{Range}(A))^\perp$. This means that for all $x \in \mathbb{R}^n$, $Ax \bullet y = 0$. But by the previous theorem, this means that $x \bullet A^t y = 0$ for all $x \in \mathbb{R}^n$. But any vector in $\mathbb{R}^n$ which is orthogonal to everything must be the zero vector (non-degenerate property of $\bullet$). So $A^t y = 0$ and therefore $y \in \mathrm{Ker}(A^t)$. Conversely, if $y \in \mathrm{Ker}(A^t)$, then for any $x \in \mathbb{R}^n$, $x \bullet A^t y = 0$. And again by the theorem, this means that $Ax \bullet y = 0$ for all such $x$, which means that $y \perp \mathrm{Range}(A)$. We have shown that $(\mathrm{Range}(A))^\perp \subseteq \mathrm{Ker}(A^t)$, and conversely, that $\mathrm{Ker}(A^t) \subseteq (\mathrm{Range}(A))^\perp$. So the two sets are equal.

23.2 Least squares approximations – the Normal equation

Now we're ready to take up the least squares problem again. Recall that the problem is to solve $Ax = \Pi(y)$, where $y$ has been projected orthogonally onto the range of $A$. The problem with solving this, as you'll recall, is that finding the projection $\Pi$ involves lots of computation. And now we'll see that it's not necessary.

We write $y = \Pi(y) + y^\perp$, where $y^\perp$ is orthogonal to the range of $A$. Suppose that $x$ is a solution to the least squares problem $Ax = \Pi(y)$. Multiply this equation by $A^t$ to get $A^t A x = A^t \Pi(y)$. So $x$ is certainly also a solution to this. But now we notice that, in consequence of the previous theorem,
$$A^t y = A^t(\Pi(y) + y^\perp) = A^t \Pi(y),$$
since $A^t y^\perp = 0$. (It's orthogonal to the range, so the theorem says it's in $\mathrm{Ker}(A^t)$.) So $x$ is also a solution to the normal equation
$$A^t A x = A^t y.$$

Conversely, if $x$ is a solution to the normal equation, then $A^t(Ax - y) = 0$, and by the previous theorem, this means that $Ax - y$ is orthogonal to the range of $A$. But $Ax - y$ is the error made using an approximate solution, and this shows that the error vector is orthogonal to the range of $A$ – this is our definition of the least squares solution!

The reason for all this fooling around is simple: we can compute $A^t y$ by doing a simple matrix multiplication. We don't need to find an orthonormal basis for the range of $A$ to compute $\Pi$.

We summarize the results:

Theorem: $\tilde{x}$ is a least-squares solution to $Ax = y$ $\iff$ $\tilde{x}$ is a solution to the normal equation $A^t A x = A^t y$.
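The equivalence above is easy to check numerically. The following is a minimal sketch (not part of the original notes), using Python's numpy: it solves a small least squares problem once via the normal equation and once by explicitly projecting $y$ onto $\mathrm{Range}(A)$ with an orthonormal basis from a QR factorization (exactly the expensive step the normal equation lets us skip), and confirms the two answers agree. The matrix $A$ and vector $y$ are arbitrary illustrative data.

```python
import numpy as np

# Illustrative data only: a random 6x3 matrix (full column rank with
# probability 1) and a random right-hand side y.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
y = rng.standard_normal(6)

# Route 1: the normal equation  A^t A x = A^t y.
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Route 2: project y onto Range(A) first.  Q has orthonormal columns
# spanning Range(A), so Pi(y) = Q Q^t y; then solve A x = Pi(y).
Q, _ = np.linalg.qr(A)
proj_y = Q @ (Q.T @ y)
x_proj, *_ = np.linalg.lstsq(A, proj_y, rcond=None)

print(np.allclose(x_normal, x_proj))   # True: both give the same least-squares x
```

In numerical practice one usually calls np.linalg.lstsq (or works with the QR factors directly) rather than forming $A^t A$, which can be poorly conditioned; the normal equation is used here because it is what these notes derive.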
Example: Find the least squares regression line through the 4 points (1, 2), (2, 3), (−1, 1), (0, 1).

Solution: We've already set up this problem in the last lecture. We have
$$A = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ -1 & 1 \\ 0 & 1 \end{pmatrix}, \quad y = \begin{pmatrix} 2 \\ 3 \\ 1 \\ 1 \end{pmatrix}, \quad \text{and} \quad x = \begin{pmatrix} m \\ b \end{pmatrix}.$$
We compute
$$A^t A = \begin{pmatrix} 6 & 2 \\ 2 & 4 \end{pmatrix}, \quad A^t y = \begin{pmatrix} 7 \\ 7 \end{pmatrix},$$
and the solution to the normal equation is
$$x = (A^t A)^{-1} A^t y = \frac{1}{20}\begin{pmatrix} 4 & -2 \\ -2 & 6 \end{pmatrix}\begin{pmatrix} 7 \\ 7 \end{pmatrix} = \begin{pmatrix} 7/10 \\ 7/5 \end{pmatrix}.$$
So the regression line has the equation $y = (7/10)x + 7/5$. (A short numerical check of this example appears after the exercises below.)

Remark: We have not addressed the critical issue of whether or not the least squares solution is a "good" approximate solution. The normal equation can always be solved, so we'll always get an answer, but how good is the answer? This is not a simple question, but it's discussed at length under the general subject of linear models in statistics texts.

Another issue which often arises: looking at the data, it might seem more reasonable to try to fit the data points to an exponential or trigonometric function, rather than to a linear one. This still leads to a least squares problem.

Example: Suppose we'd like to fit a cubic (rather than linear) function to our data set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$. The cubic will have the form $y = ax^3 + bx^2 + cx + d$, where the coefficients $a, b, c, d$ have to be determined. Since the $(x_i, y_i)$ are known, this still gives us a linear problem:
$$y_1 = ax_1^3 + bx_1^2 + cx_1 + d, \quad \ldots, \quad y_n = ax_n^3 + bx_n^2 + cx_n + d,$$
or
$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_1^3 & x_1^2 & x_1 & 1 \\ \vdots & & & \vdots \\ x_n^3 & x_n^2 & x_n & 1 \end{pmatrix}\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}.$$
This is a least squares problem just like the regression line problem, just a bit bigger. It's solved the same way, using the normal equation.

Exercises:

1. Find a least squares solution to the system $Ax = y$, where
$$A = \begin{pmatrix} 2 & 1 \\ -1 & 3 \\ 3 & 4 \end{pmatrix}, \quad \text{and} \quad y = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}.$$

2. Suppose you want to model your data $\{(x_i, y_i) : 1 \le i \le n\}$ with an exponential function $y = ae^{bx}$. Show how to do this using logarithms.

3. (*) For these problems, think of the row space as the column space of $A^t$. Show that $v$ is in the row space of $A$ $\iff$ $v = A^t y$ for some $y$. This means that the row space of $A$ is the range of $f_{A^t}$ (analogous to the fact that the column space of $A$ is the range of $f_A$).

4. (*) Show that the null space of $A$ is the orthogonal complement of the row space. (Hint: use the above theorem with $A^t$ instead of $A$.)
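As referenced above, here is a minimal numpy sketch (not part of the original notes) that reproduces the regression-line example: it builds $A$ and $y$ from the four data points and solves the normal equation $A^t A x = A^t y$. A final comment indicates how the same code would change for the cubic example; only the design matrix is different.

```python
import numpy as np

# Reproduce the regression-line example: data points (1, 2), (2, 3), (-1, 1), (0, 1),
# model y = m*x + b, unknowns (m, b).
xs = np.array([1.0, 2.0, -1.0, 0.0])
ys = np.array([2.0, 3.0, 1.0, 1.0])

A = np.column_stack([xs, np.ones_like(xs)])   # each row is (x_i, 1)
AtA = A.T @ A                                 # [[6, 2], [2, 4]]
Aty = A.T @ ys                                # [7, 7]
m, b = np.linalg.solve(AtA, Aty)
print(m, b)                                   # approximately 0.7 and 1.4, i.e. y = (7/10)x + 7/5

# For the cubic example, only the design matrix changes:
# A = np.vander(xs, 4)   # columns x^3, x^2, x, 1; unknowns (a, b, c, d)
```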