The Chain Rule - McGill Math Department
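1 The Chain Rule

If $z = g(y)$ and $y = f(x)$, then the chain rule is a formula for the derivative of $z$ with respect to $x$. If we let $z = h(x) = g(f(x))$, the chain rule formula is

$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx} \quad\text{or}\quad h'(x) = g'(y)\,f'(x), \quad\text{where } y = f(x).$$

The second formula shows where the derivatives are to be evaluated. In other words, $\frac{dz}{dx}$ is evaluated at $x$, $\frac{dz}{dy}$ is evaluated at $y = f(x)$, and $\frac{dy}{dx}$ is evaluated at $x$.

For example, if $y = e^x$ and $z = \log y = \ln y$, then $z = x$, as $\log(e^x) = x$. This gives us

$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx} = 1 \quad\text{or}\quad \frac{dx}{dx} = \frac{dx}{dy}\,\frac{dy}{dx} = 1.$$

This formula needs to be read carefully, as it says that $\frac{dx}{dy}$, evaluated at $y = e^x$, is equal to the reciprocal of $\frac{dy}{dx}$ evaluated at $x$.

The typical chain rule in calculus of several variables is more complicated. For example, if $T = T(x, y)$ is viewed as the temperature at $(x, y)$ and $\vec{r}(t) = (x(t), y(t))$ is a path or curve in the plane, then the temperature $T(x(t), y(t))$ depends on $t$ as the curve is traversed. This is the temperature at the location of the "moving point" $(x(t), y(t))$ at "time" $t$. The rate of change of $T = T(x(t), y(t))$ is given by the formula

$$\frac{dT}{dt} = \frac{\partial T}{\partial x}\frac{dx}{dt} + \frac{\partial T}{\partial y}\frac{dy}{dt} = \left(\frac{\partial T}{\partial x}, \frac{\partial T}{\partial y}\right) \cdot \left(\frac{dx}{dt}, \frac{dy}{dt}\right) = \nabla T \cdot \frac{d\vec{r}}{dt} = (\nabla T)(\vec{r}(t)) \cdot \frac{d\vec{r}}{dt}.$$

Here, the partial derivatives with respect to $x$ and $y$ are to be evaluated at $(x, y) = (x(t), y(t))$.

One might also have $x$ and $y$ themselves functions of several variables, say, $u$, $v$, and $w$. This situation is given by a function $(x, y) = F(u, v, w) = (F_1(u, v, w), F_2(u, v, w))$. Then, if one fixes in turn all the variables but one, one gets three chain rule formulas:

$$\frac{\partial T}{\partial u} = \frac{\partial T}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial T}{\partial y}\frac{\partial y}{\partial u} = \left(\frac{\partial T}{\partial x}, \frac{\partial T}{\partial y}\right) \cdot \left(\frac{\partial x}{\partial u}, \frac{\partial y}{\partial u}\right),$$

$$\frac{\partial T}{\partial v} = \frac{\partial T}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial T}{\partial y}\frac{\partial y}{\partial v} = \left(\frac{\partial T}{\partial x}, \frac{\partial T}{\partial y}\right) \cdot \left(\frac{\partial x}{\partial v}, \frac{\partial y}{\partial v}\right),$$

$$\frac{\partial T}{\partial w} = \frac{\partial T}{\partial x}\frac{\partial x}{\partial w} + \frac{\partial T}{\partial y}\frac{\partial y}{\partial w} = \left(\frac{\partial T}{\partial x}, \frac{\partial T}{\partial y}\right) \cdot \left(\frac{\partial x}{\partial w}, \frac{\partial y}{\partial w}\right).$$

These three chain rule formulas can all be "packaged" as one equation if one uses matrix multiplication.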
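The path formula $\frac{dT}{dt} = \nabla T \cdot \frac{d\vec{r}}{dt}$ above can be sanity-checked numerically. The following is a minimal sketch, not part of the original notes: the temperature field `T` and the path `r` are hypothetical choices made only for illustration, and the derivative along the path is approximated by a central difference.

```python
import math

def T(x, y):
    # Hypothetical temperature field, chosen only for illustration.
    return x * x * y + math.sin(y)

def grad_T(x, y):
    # Analytic partial derivatives (dT/dx, dT/dy) of the field above.
    return (2.0 * x * y, x * x + math.cos(y))

def r(t):
    # Hypothetical path: the unit circle traversed counterclockwise.
    return (math.cos(t), math.sin(t))

def r_prime(t):
    # Velocity (dx/dt, dy/dt) of the path above.
    return (-math.sin(t), math.cos(t))

def dT_dt_chain(t):
    # Chain rule: gradient evaluated at r(t), dotted with the velocity.
    x, y = r(t)
    Tx, Ty = grad_T(x, y)
    dx, dy = r_prime(t)
    return Tx * dx + Ty * dy

def dT_dt_numeric(t, h=1e-6):
    # Central-difference approximation of d/dt T(x(t), y(t)).
    return (T(*r(t + h)) - T(*r(t - h))) / (2.0 * h)

if __name__ == "__main__":
    t0 = 0.7
    print(dT_dt_chain(t0), dT_dt_numeric(t0))
```

The two printed values should agree to several decimal places; the gradient is evaluated at the moving point $(x(t), y(t))$, exactly as the formula requires.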
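The single matrix equation that packages the three chain rule formulas involves the gradient of $T$ and what is called the derivative (matrix) $DF$ or $F'$ of $F$. This is the matrix

$$DF = F' = \begin{bmatrix} \dfrac{\partial x}{\partial u} & \dfrac{\partial x}{\partial v} & \dfrac{\partial x}{\partial w} \\[2mm] \dfrac{\partial y}{\partial u} & \dfrac{\partial y}{\partial v} & \dfrac{\partial y}{\partial w} \end{bmatrix}.$$

Notice that the pattern of the matrix follows the pattern of the variables $(x, y)$ and $(u, v, w)$: the rows correspond to $x$ and $y$, the columns to $u$, $v$, and $w$. The three formulas for the three partial derivatives of $T$ can be written as one matrix equation

$$\begin{bmatrix} \dfrac{\partial T}{\partial u} & \dfrac{\partial T}{\partial v} & \dfrac{\partial T}{\partial w} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial T}{\partial x} & \dfrac{\partial T}{\partial y} \end{bmatrix} \begin{bmatrix} \dfrac{\partial x}{\partial u} & \dfrac{\partial x}{\partial v} & \dfrac{\partial x}{\partial w} \\[2mm] \dfrac{\partial y}{\partial u} & \dfrac{\partial y}{\partial v} & \dfrac{\partial y}{\partial w} \end{bmatrix}.$$

Setting $p = (u, v, w)$, this can be written in the form

$$D(T \circ F)(p) = DT(F(p))\,DF(p) \quad\text{or}\quad (T \circ F)'(p) = T'(F(p))\,F'(p),$$

where the sign $\circ$ stands for composition of functions. In this way the chain rule for functions of several variables looks exactly the same as the one-variable case.

The derivative matrix $DF$ is often called the Jacobian matrix. For example, when $(y_1, y_2, \dots, y_m) = F(x_1, x_2, \dots, x_n)$, the Jacobian matrix (see Adams, p. 189 of the third edition) is the $m \times n$ matrix

$$DF = \begin{bmatrix} \dfrac{\partial y_1}{\partial x_1} & \dfrac{\partial y_1}{\partial x_2} & \cdots & \dfrac{\partial y_1}{\partial x_n} \\[2mm] \dfrac{\partial y_2}{\partial x_1} & \dfrac{\partial y_2}{\partial x_2} & \cdots & \dfrac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial y_m}{\partial x_1} & \dfrac{\partial y_m}{\partial x_2} & \cdots & \dfrac{\partial y_m}{\partial x_n} \end{bmatrix}.$$

In other words, the $(i, j)$-entry is $\dfrac{\partial y_i}{\partial x_j}$. When $m = n$, the determinant of the Jacobian matrix $DF$ is called the Jacobian of $F$ or the Jacobian determinant of $F$ and is sometimes denoted by

$$\frac{\partial(y_1, y_2, \dots, y_n)}{\partial(x_1, x_2, \dots, x_n)}.$$

One important consequence of the matrix version of the chain rule is that if $(y_1, y_2, \dots, y_n) = F(x_1, x_2, \dots, x_n)$ and $(x_1, x_2, \dots, x_n) = G(y_1, y_2, \dots, y_n)$ are two transformations such that $(x_1, x_2, \dots, x_n) = G(F(x_1, x_2, \dots, x_n))$, then the Jacobian matrices $DF$ and $DG$ are inverse to one another. This is because, if $I(x_1, x_2, \dots, x_n) = (x_1, x_2, \dots, x_n)$, then $DI$ is the $n \times n$ identity matrix $I_n$. Writing $x = (x_1, x_2, \dots, x_n)$ and using $I = G \circ F$, the chain rule gives

$$I_n = D(I) = D(G \circ F)(x) = DG(F(x))\,DF(x).$$

Since both matrices are square, it follows that $DF(x)\,DG(F(x)) = I_n$ as well.

Taylor's formula for a function of several variables is an immediate consequence of Taylor's formula in one variable and the chain rule.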
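The statement above that the Jacobian matrices of two mutually inverse transformations are inverse to one another can be checked numerically. The sketch below is an illustration only: it assumes polar coordinates as the transformation `F` (with `G` as its inverse for $r > 0$), and the helper names `jacobian` and `matmul` are hypothetical, written here with central differences rather than taken from any library.

```python
import math

def F(p):
    # Assumed example transformation: polar (r, theta) -> Cartesian (x, y).
    r, theta = p
    return [r * math.cos(theta), r * math.sin(theta)]

def G(q):
    # Inverse of F for r > 0 and -pi < theta < pi.
    x, y = q
    return [math.hypot(x, y), math.atan2(y, x)]

def jacobian(func, p, h=1e-6):
    # Central-difference approximation of the Jacobian matrix of func at p.
    # Entry (i, j) approximates d(func_i)/d(p_j).
    m, n = len(func(p)), len(p)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus, minus = list(p), list(p)
        plus[j] += h
        minus[j] -= h
        f_plus, f_minus = func(plus), func(minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * h)
    return J

def matmul(A, B):
    # Plain matrix product of two nested-list matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

p = [2.0, 0.5]              # the point (r, theta)
DF = jacobian(F, p)         # DF evaluated at p
DG = jacobian(G, F(p))      # DG evaluated at F(p), as the chain rule requires
product = matmul(DG, DF)    # should be close to the 2 x 2 identity matrix

if __name__ == "__main__":
    for row in product:
        print(row)
```

Note that `DG` is evaluated at `F(p)`, not at `p`; evaluating both matrices at the same point would not give the identity in general.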
For example, to obtain the Taylor polynomial of order 2 of a function $y = f(x)$ with $x = (x_1, x_2)$ at the point $a = (a_1, a_2)$, let $g(t) = f(a + t(x - a))$. Then it follows from the chain rule that

$$g'(t) = \nabla f(a + t(x - a)) \cdot (x - a) = \sum_{i=1}^{2} \frac{\partial f}{\partial x_i}(a + t(x - a))\,(x_i - a_i),$$

$$g''(t) = \sum_{i,j=1}^{2} \frac{\partial^2 f}{\partial x_i\,\partial x_j}(a + t(x - a))\,(x_i - a_i)(x_j - a_j).$$

The Taylor polynomial of order 2 for $g$ corresponding to $t = 0$ gives the approximation

$$g(t) \approx g(0) + t\,g'(0) + \frac{t^2}{2}\,g''(0).$$

If we evaluate this polynomial at $t = 1$ and observe that $g(1) = f(x)$ and $g(0) = f(a)$, then

$$f(x) \approx f(a) + \sum_{i=1}^{2} \frac{\partial f}{\partial x_i}(a)\,(x_i - a_i) + \frac{1}{2}\sum_{i,j=1}^{2} \frac{\partial^2 f}{\partial x_i\,\partial x_j}(a)\,(x_i - a_i)(x_j - a_j).$$
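The second-order Taylor formula above can be tried on a concrete function. The following is a minimal sketch under stated assumptions: the function `f`, the base point `a`, and the evaluation point `x` are hypothetical choices, and the quadratic term carries the factor $\frac{1}{2}$ that comes from $\frac{t^2}{2}\,g''(0)$ at $t = 1$.

```python
import math

def f(x1, x2):
    # Hypothetical smooth function of two variables, for illustration only.
    return math.exp(x1) * math.sin(x2)

def grad_f(x1, x2):
    # Analytic first partial derivatives of f.
    e = math.exp(x1)
    return [e * math.sin(x2), e * math.cos(x2)]

def hess_f(x1, x2):
    # Analytic matrix of second partial derivatives of f.
    e, s, c = math.exp(x1), math.sin(x2), math.cos(x2)
    return [[e * s, e * c],
            [e * c, -e * s]]

def taylor2(a, x):
    # Order-2 Taylor polynomial of f at a, evaluated at x.
    d = [x[0] - a[0], x[1] - a[1]]
    g = grad_f(*a)
    H = hess_f(*a)
    linear = sum(g[i] * d[i] for i in range(2))
    quadratic = sum(H[i][j] * d[i] * d[j]
                    for i in range(2) for j in range(2))
    return f(*a) + linear + 0.5 * quadratic

a = [0.0, 1.0]
x = [0.1, 1.1]
error = abs(f(*x) - taylor2(a, x))  # expected to be of third order in |x - a|

if __name__ == "__main__":
    print(f(*x), taylor2(a, x), error)
```

Because the remainder is of third order in $|x - a|$, moving `x` closer to `a` should shrink `error` much faster than the distance itself.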