AM 221: Advanced Optimization, Spring 2014
Prof. Yaron Singer
Lecture 9 — February 26th, 2014

1 Overview

Today we will recapitulate basic definitions and properties from multivariate calculus which we will need in the coming lectures on convex optimization.

2 Differentiable functions

2.1 First-order expansion

Recall that for a function $f : \mathbb{R} \to \mathbb{R}$ of a single variable, differentiable at a point $\bar{x} \in \mathbb{R}$, one can write:
$$\forall x \in \mathbb{R},\quad f(x) = f(\bar{x}) + f'(\bar{x})(x - \bar{x}) + o(|x - \bar{x}|)$$
where by definition:
$$\lim_{x \to \bar{x}} \frac{o(|x - \bar{x}|)}{|x - \bar{x}|} = 0$$
This expansion expresses that the function $f$ can be approximated by the affine function $l : x \mapsto f(\bar{x}) + f'(\bar{x})(x - \bar{x})$ when $x$ is "close to" $\bar{x}$.

Similarly, when $f : \mathbb{R}^n \to \mathbb{R}$ is a multivariate function, differentiable at a point $\bar{x} \in \mathbb{R}^n$, one can write:
$$\forall x \in \mathbb{R}^n,\quad f(x) = f(\bar{x}) + \nabla f(\bar{x})^T (x - \bar{x}) + o(\|x - \bar{x}\|)$$
where $\nabla f(\bar{x})$ is the gradient of $f$ at the point $\bar{x}$, defined by:
$$\nabla f(x) := \left( \frac{\partial f(x)}{\partial x_1}, \dots, \frac{\partial f(x)}{\partial x_n} \right)^T$$
and where $o(\|x - \bar{x}\|)$ satisfies:
$$\lim_{x \to \bar{x}} \frac{o(\|x - \bar{x}\|)}{\|x - \bar{x}\|} = 0$$

By strengthening the assumptions on $f$, we can obtain another form of first-order expansion which can prove useful when global properties of the gradient of $f$ are known. If $f$ is continuous over $[\bar{x}, x]$ and differentiable over $(\bar{x}, x)$, then:
$$f(x) = f(\bar{x}) + \nabla f(z)^T (x - \bar{x}) \quad \text{for some } z \in [\bar{x}, x].$$

2.2 Second-order expansion

When $f$ is twice differentiable it is possible to obtain an approximation of $f$ by quadratic functions. Assume that $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable at $\bar{x} \in \mathbb{R}^n$; then:
$$f(x) = f(\bar{x}) + \nabla f(\bar{x})^T (x - \bar{x}) + \frac{1}{2} (x - \bar{x})^T H_f(\bar{x}) (x - \bar{x}) + o(\|x - \bar{x}\|^2)$$
where by definition:
$$\lim_{x \to \bar{x}} \frac{o(\|x - \bar{x}\|^2)}{\|x - \bar{x}\|^2} = 0$$
and where $H_f(x)$ is the Hessian matrix of $f$ at $x$:
$$H_f(x) := \begin{pmatrix}
\frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2}
\end{pmatrix}$$
Similarly to the alternative first-order expansion, when $f$ is of class $C^1$ over $[\bar{x}, x]$ and twice differentiable over this interval, we have:
$$f(x) = f(\bar{x}) + \nabla f(\bar{x})^T (x - \bar{x}) + \frac{1}{2} (x - \bar{x})^T H_f(z) (x - \bar{x}) \quad \text{for some } z \in [\bar{x}, x].$$

2.3 Directional derivatives

Definition 1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function differentiable at $x \in \mathbb{R}^n$ and let us consider $d \in \mathbb{R}^n$ with $\|d\| = 1$. We define the derivative of $f$ at $x$ in direction $d$ as:
$$f'(x, d) = \lim_{\lambda \to 0} \frac{f(x + \lambda d) - f(x)}{\lambda}$$

Remark 2. We have $f'(x, d) = \nabla f(x)^T d$.

Proof. Using the first-order expansion of $f$ at $x$:
$$f(x + \lambda d) = f(x) + \nabla f(x)^T (\lambda d) + o(\|\lambda d\|)$$
hence, since $\|d\| = 1$:
$$\frac{f(x + \lambda d) - f(x)}{\lambda} = \nabla f(x)^T d + \frac{o(|\lambda|)}{\lambda}$$
Letting $\lambda$ go to $0$ concludes the proof.

Example. In which direction does $f$ change most rapidly? We want to find $d^* \in \arg\max_{\|d\|=1} f'(x, d) = \arg\max_{\|d\|=1} \nabla f(x)^T d$. From the Cauchy–Schwarz inequality we get $\nabla f(x)^T d \le \|\nabla f(x)\| \|d\|$, with equality when $d = \lambda \nabla f(x)$, $\lambda \in \mathbb{R}$. Since $\|d\| = 1$, this gives:
$$d = \pm \frac{\nabla f(x)}{\|\nabla f(x)\|}$$
where the $+$ sign attains the maximum (steepest ascent) and the $-$ sign the minimum (steepest descent).

3 Conditions for local optimality

3.1 Necessary conditions

We now want to find necessary and sufficient conditions for local optimality.

Theorem 3. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function differentiable at $\bar{x}$. If $\bar{x}$ is a local extremum, then $\nabla f(\bar{x}) = 0$.

Proof. Let us assume that $\bar{x}$ is a local minimum of $f$. Then for all $d \in \mathbb{R}^n$, $f(\bar{x}) \le f(\bar{x} + \lambda d)$ for $\lambda$ small enough. Hence:
$$0 \le f(\bar{x} + \lambda d) - f(\bar{x}) = \lambda \nabla f(\bar{x})^T d + o(\|\lambda d\|)$$
Dividing by $\lambda > 0$ and letting $\lambda \to 0^+$, we obtain $0 \le \nabla f(\bar{x})^T d$. Similarly, dividing by $\lambda < 0$ and letting $\lambda \to 0^-$, we obtain $0 \ge \nabla f(\bar{x})^T d$. As a consequence, $\nabla f(\bar{x})^T d = 0$ for all $d \in \mathbb{R}^n$. This implies that $\nabla f(\bar{x}) = 0$.
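To make Remark 2 and the steepest-ascent example concrete, here is a minimal numerical sketch (not part of the original notes). It assumes NumPy; the function $f$, the point $x$, and the direction $d$ are arbitrary choices for illustration.

```python
import numpy as np

# Illustrative function (our own choice): f(x) = x1^2 + 3*x1*x2
def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1]

def grad_f(x):
    # Analytic gradient of f
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, 2.0])
d = np.array([3.0, 4.0]) / 5.0  # a unit-norm direction

# Remark 2: the finite-difference directional derivative matches grad_f(x)^T d
lam = 1e-6
print((f(x + lam * d) - f(x)) / lam)  # ~7.2
print(grad_f(x) @ d)                  # 7.2 exactly

# Steepest ascent: d* = grad_f(x)/||grad_f(x)|| attains the
# Cauchy-Schwarz bound f'(x, d) <= ||grad_f(x)|| over unit vectors d
g = grad_f(x)
d_star = g / np.linalg.norm(g)
print(g @ d_star, np.linalg.norm(g))  # both ~8.544
```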
The case where $\bar{x}$ is a local maximum can be dealt with similarly.

Remark 4. A point $\bar{x} \in \mathbb{R}^n$ at which $\nabla f(\bar{x}) = 0$ is called a stationary point.

A second-order necessary condition allows us to distinguish local minima from local maxima. First, we need to introduce some definitions on symmetric matrices.

Definition 5. Let $A$ be a symmetric matrix in $\mathbb{R}^{n \times n}$. We say that $A$ is:
1. positive definite iff $x^T A x > 0$ for all $x \in \mathbb{R}^n \setminus \{0\}$;
2. negative definite iff $x^T A x < 0$ for all $x \in \mathbb{R}^n \setminus \{0\}$;
3. positive semidefinite iff $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$;
4. negative semidefinite iff $x^T A x \le 0$ for all $x \in \mathbb{R}^n$;
5. otherwise we say that $A$ is indefinite.

Example. Let us consider:
$$A := \begin{pmatrix} 4 & -1 \\ -1 & -2 \end{pmatrix}$$
Then we have $x^T A x = 4x_1^2 - 2x_1 x_2 - 2x_2^2$. For $x = [1\ 0]^T$ we have $x^T A x = 4 > 0$; for $x = [0\ 1]^T$ we have $x^T A x = -2 < 0$. Hence $A$ is indefinite.

The following theorem gives a useful characterization of (semi)definite matrices.

Theorem 6. Let $A$ be a symmetric matrix in $\mathbb{R}^{n \times n}$.
1. $A$ is positive (negative) definite iff all its eigenvalues are positive (negative).
2. $A$ is positive (negative) semidefinite iff all its eigenvalues are non-negative (non-positive).
3. Otherwise $A$ is indefinite.

Proof. The spectral theorem allows us to diagonalize $A$: $A = P^T D P$ with $P$ orthogonal and $D$ diagonal. Then one can write:
$$x^T A x = x^T P^T D P x = (Px)^T D (Px)$$
Let us define $y := Px$; then we have $x^T A x = \sum_{i=1}^n \lambda_i y_i^2$, where $\lambda_1, \dots, \lambda_n$ are the diagonal elements of $D$, the eigenvalues of $A$. Also note that since $P$ is invertible, $y = 0$ iff $x = 0$. If $\lambda_i > 0$ for all $i$, then we see that $x^T A x > 0$ for all $x \ne 0$. For the other direction, choosing $y$ such that $y_i = 1$ and $y_j = 0$ for $j \ne i$ yields $\lambda_i > 0$. The other cases can be dealt with similarly.

We can now give the second-order necessary conditions for local extrema.

Theorem 7. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function twice differentiable at $\bar{x} \in \mathbb{R}^n$.
1. If $\bar{x}$ is a local minimum, then $H_f(\bar{x})$ is positive semidefinite.
2. If $\bar{x}$ is a local maximum, then $H_f(\bar{x})$ is negative semidefinite.

Proof. Let us assume that $\bar{x}$ is a local minimum. We know from Theorem 3 that $\nabla f(\bar{x}) = 0$, hence the second-order expansion at $\bar{x}$ takes the form:
$$f(\bar{x} + d) = f(\bar{x}) + \frac{1}{2} d^T H_f(\bar{x}) d + o(\|d\|^2)$$
Because $\bar{x}$ is a local minimum, $0 \le f(\bar{x} + d) - f(\bar{x})$ when $\|d\|$ is small enough. Fix a direction $d$ and apply this inequality to $td$ for small $t > 0$: dividing by $t^2$ and letting $t \to 0^+$ gives $0 \le \frac{1}{2} d^T H_f(\bar{x}) d$. This holds for every $d$, hence $H_f(\bar{x})$ is positive semidefinite.

3.2 Sufficient conditions

Combining the conditions of Theorem 3 (stationary point) and Theorem 7 is almost enough to obtain a sufficient condition for local extrema. We only need to slightly strengthen the conditions of Theorem 7, from semidefiniteness to definiteness.

Theorem 8. Let $f$ be a function twice differentiable at a stationary point $\bar{x}$:
1. if $H_f(\bar{x})$ is positive definite then $\bar{x}$ is a local minimum;
2. if $H_f(\bar{x})$ is negative definite then $\bar{x}$ is a local maximum.

Proof. Let us assume that $H_f(\bar{x})$ is positive definite. We know that $\bar{x}$ is a stationary point, so we can write the second-order expansion at $\bar{x}$:
$$f(\bar{x} + d) = f(\bar{x}) + \frac{1}{2} d^T H_f(\bar{x}) d + o(\|d\|^2)$$
Because $H_f(\bar{x})$ is positive definite, $d^T H_f(\bar{x}) d \ge \lambda_{\min} \|d\|^2$ where $\lambda_{\min} > 0$ is the smallest eigenvalue of $H_f(\bar{x})$; the quadratic term therefore dominates the $o(\|d\|^2)$ term, and $f(\bar{x} + d) - f(\bar{x}) > 0$ for any $d \ne 0$ with $\|d\|$ small enough. This proves that $\bar{x}$ is a local minimum. We can make a similar proof when $H_f(\bar{x})$ is negative definite.
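Theorems 6 and 8 suggest a simple numerical recipe: compute the eigenvalues of the Hessian at a stationary point and inspect their signs. The sketch below is our addition, not part of the notes; it assumes NumPy, and `classify_stationary_point` is a hypothetical helper name. We apply it to the indefinite matrix $A$ from the example above.

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point from its symmetric Hessian H by the
    signs of its eigenvalues (Theorems 6 and 8)."""
    eig = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "local minimum (H positive definite)"
    if np.all(eig < -tol):
        return "local maximum (H negative definite)"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point (H indefinite)"
    return "inconclusive (H only semidefinite)"

# The indefinite matrix A from the example above:
A = np.array([[4.0, -1.0], [-1.0, -2.0]])
print(classify_stationary_point(A))  # saddle point (H indefinite)
```

Note that the semidefinite-but-not-definite case is genuinely inconclusive: the necessary condition of Theorem 7 holds, but the sufficient condition of Theorem 8 does not.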
Remark 9. When $H_f(\bar{x})$ is indefinite at a stationary point $\bar{x}$, we have what is known as a saddle point: $\bar{x}$ is a minimum along the eigenvectors of $H_f(\bar{x})$ whose eigenvalues are positive, and a maximum along the eigenvectors of $H_f(\bar{x})$ whose eigenvalues are negative.
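As an illustration of Remark 9, consider the standard example $f(x, y) = x^2 - y^2$ (our own choice, not from the notes): the origin is a stationary point with indefinite Hessian $\mathrm{diag}(2, -2)$, so it is a minimum along the first coordinate axis and a maximum along the second. A short NumPy sketch, under the same assumptions as above:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at the origin:
# gradient (2x, -2y) vanishes there, Hessian is diag(2, -2).
f = lambda v: v[0] ** 2 - v[1] ** 2
H = np.diag([2.0, -2.0])
eigvals, eigvecs = np.linalg.eigh(H)  # columns of eigvecs are eigenvectors

for lam, u in zip(eigvals, eigvecs.T):
    t = 1e-3
    # f(0) = 0, so the sign of f(t*u) tells min vs. max along direction u
    behavior = "minimum" if f(t * u) > 0 else "maximum"
    print(f"eigenvalue {lam:+.0f}: origin is a {behavior} along {u}")
```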