AM 221: Advanced Optimization — Spring 2014
Prof. Yaron Singer
Lecture 9 — February 26th, 2014

1 Overview
Today we will recapitulate basic definitions and properties from multivariate calculus which we will need in the coming lectures on convex optimization.
2 Differentiable functions

2.1 First-order expansion
Recall that for a function f : R → R of a single variable, differentiable at point x̄ ∈ R, one can write:

∀x ∈ R, f(x) = f(x̄) + f′(x̄)(x − x̄) + o(|x − x̄|)
where by definition:

lim_{x→x̄} o(|x − x̄|) / |x − x̄| = 0
This expansion expresses that the function f can be approximated by the affine function ℓ : x ↦ f(x̄) + f′(x̄)(x − x̄) when x is “close to” x̄.
Similarly, when f : Rⁿ → R is a multivariate function, differentiable at point x̄ ∈ Rⁿ, one can write:

∀x ∈ Rⁿ, f(x) = f(x̄) + ∇f(x̄)ᵀ(x − x̄) + o(‖x − x̄‖)
where ∇f(x̄) is the gradient of f at point x̄, defined by:

∇f(x) := (∂f(x)/∂x₁, …, ∂f(x)/∂xₙ)ᵀ
and where o(‖x − x̄‖) satisfies:

lim_{x→x̄} o(‖x − x̄‖) / ‖x − x̄‖ = 0
By strengthening the assumptions on f, we can obtain another form of first-order expansion, which can prove useful when global properties of the gradient of f are known.
If f is continuous over [x̄, x] and differentiable over (x̄, x), then:

f(x) = f(x̄) + ∇f(z)ᵀ(x − x̄) for some z ∈ [x̄, x].
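As a quick numerical sanity check (not in the original notes; the function f below is an arbitrary illustrative choice), the following Python sketch verifies that the error of the affine approximation, divided by ‖x − x̄‖, vanishes as x → x̄:

import numpy as np

# Illustrative function and its hand-computed gradient (hypothetical choices).
f = lambda x: x[0]**2 + 3 * x[1]
grad_f = lambda x: np.array([2 * x[0], 3.0])

x_bar = np.array([1.0, 2.0])
d = np.array([1.0, -1.0]) / np.sqrt(2)  # unit direction

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    x = x_bar + t * d
    affine = f(x_bar) + grad_f(x_bar) @ (x - x_bar)
    # o(||x - x_bar||) / ||x - x_bar|| should tend to 0 as t shrinks.
    print(t, abs(f(x) - affine) / np.linalg.norm(x - x_bar))

The printed ratios shrink linearly in t, as expected for a function with bounded second derivatives.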
2.2 Second-order expansion
When f is twice differentiable it is possible to obtain an approximation of f by quadratic functions.
Assume that f : Rⁿ → R is twice differentiable at x̄ ∈ Rⁿ; then:

f(x) = f(x̄) + ∇f(x̄)ᵀ(x − x̄) + (1/2)(x − x̄)ᵀHf(x̄)(x − x̄) + o(‖x − x̄‖²)

where by definition:

lim_{x→x̄} o(‖x − x̄‖²) / ‖x − x̄‖² = 0
and where Hf(x) is the Hessian matrix of f at x, the n × n matrix of second partial derivatives:

(Hf(x))ᵢⱼ := ∂²f(x)/∂xᵢ∂xⱼ,  i, j = 1, …, n
Similarly to the first-order alternative expansion, we have, when f is of class C¹ over [x̄, x] and twice differentiable over this interval:

f(x) = f(x̄) + ∇f(x̄)ᵀ(x − x̄) + (1/2)(x − x̄)ᵀHf(z)(x − x̄) for some z ∈ [x̄, x].
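The same kind of numerical check works at second order (again my sketch, with a hypothetical f whose gradient and Hessian are computed by hand): the error of the quadratic model, divided by ‖x − x̄‖², should vanish as x → x̄.

import numpy as np

# Illustrative function with hand-computed gradient and Hessian.
f = lambda x: np.exp(x[0]) + x[0] * x[1]**2
grad_f = lambda x: np.array([np.exp(x[0]) + x[1]**2, 2 * x[0] * x[1]])
hess_f = lambda x: np.array([[np.exp(x[0]), 2 * x[1]],
                             [2 * x[1],     2 * x[0]]])

x_bar = np.array([0.5, 1.0])
d = np.array([3.0, 4.0]) / 5.0  # unit direction

for t in [1e-1, 1e-2, 1e-3]:
    step = t * d
    quad = f(x_bar) + grad_f(x_bar) @ step + 0.5 * step @ hess_f(x_bar) @ step
    # o(||x - x_bar||^2) / ||x - x_bar||^2 should tend to 0.
    print(t, abs(f(x_bar + step) - quad) / np.linalg.norm(step)**2)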
2.3 Directional derivatives
Definition 1. Let f : Rⁿ → R be a function differentiable at x ∈ Rⁿ and let us consider d ∈ Rⁿ with ‖d‖ = 1. We define the derivative of f at x in direction d as:

f′(x, d) := lim_{λ→0} (f(x + λd) − f(x)) / λ
Remark 2. We have f′(x, d) = ∇f(x)ᵀd.
Proof. Using the first-order expansion of f at x:

f(x + λd) = f(x) + ∇f(x)ᵀ(λd) + o(‖λd‖)

hence, dividing by λ and using ‖d‖ = 1:

(f(x + λd) − f(x)) / λ = ∇f(x)ᵀd + o(|λ|) / λ

Letting λ go to 0 concludes the proof.
Example What is the direction in which f changes the most rapidly? We want to find d∗ ∈ arg max_{‖d‖=1} f′(x, d) = arg max_{‖d‖=1} ∇f(x)ᵀd.

From the Cauchy–Schwarz inequality we get ∇f(x)ᵀd ≤ ‖∇f(x)‖‖d‖, with equality iff d = λ∇f(x) for some λ ≥ 0. Since ‖d‖ = 1, this gives:

d∗ = ∇f(x) / ‖∇f(x)‖

i.e. the direction of steepest ascent is the normalized gradient, and symmetrically −∇f(x)/‖∇f(x)‖ is the direction of steepest descent.
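A small numeric sketch (with an f of my choosing) makes both facts concrete: the finite-difference quotient matches ∇f(x)ᵀd, and no random unit direction beats the normalized gradient.

import numpy as np

f = lambda x: x[0]**2 + np.sin(x[1])          # illustrative function
grad_f = lambda x: np.array([2 * x[0], np.cos(x[1])])

x = np.array([1.0, 0.5])
lam = 1e-6
rng = np.random.default_rng(0)

for _ in range(5):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)                    # random unit direction
    fd = (f(x + lam * d) - f(x)) / lam        # finite-difference f'(x, d)
    print(fd, grad_f(x) @ d)                  # the two columns agree

# The steepest-ascent direction attains the upper bound ||grad f(x)||:
d_star = grad_f(x) / np.linalg.norm(grad_f(x))
print(grad_f(x) @ d_star, np.linalg.norm(grad_f(x)))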
3 Conditions for local optimality

3.1 Necessary conditions
We now want to find necessary and sufficient conditions for local optimality.
Theorem 3. Let f : Rⁿ → R be a function differentiable at x̄. If x̄ is a local extremum, then ∇f(x̄) = 0.
Proof. Let us assume that x̄ is a local minimum for f. Then for all d ∈ Rⁿ, f(x̄) ≤ f(x̄ + λd) for λ small enough. Hence:

0 ≤ f(x̄ + λd) − f(x̄) = λ∇f(x̄)ᵀd + o(‖λd‖)

Dividing by λ > 0 and letting λ → 0⁺, we obtain 0 ≤ ∇f(x̄)ᵀd. Similarly, dividing by λ < 0 and letting λ → 0⁻, we obtain 0 ≥ ∇f(x̄)ᵀd. As a consequence, ∇f(x̄)ᵀd = 0 for all d ∈ Rⁿ. This implies that ∇f(x̄) = 0.

The case where x̄ is a local maximum can be dealt with similarly.
Remark 4. A point x̄ ∈ Rⁿ at which ∇f(x̄) = 0 is called a stationary point.
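As an illustration of Theorem 3 (my sketch, on a hypothetical quadratic), a few gradient-descent steps drive the gradient norm toward zero, i.e. the iterates approach a stationary point:

import numpy as np

# Illustrative function with a unique minimizer at (1, -2).
f = lambda x: (x[0] - 1)**2 + 2 * (x[1] + 2)**2
grad_f = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 2)])

x = np.array([5.0, 5.0])
for _ in range(100):
    x = x - 0.1 * grad_f(x)  # fixed-step gradient descent

# Near the minimizer the gradient vanishes, as Theorem 3 requires.
print(x, np.linalg.norm(grad_f(x)))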
A second-order necessary condition allows us to distinguish local minima from local maxima. First, we need to introduce some definitions on symmetric matrices.
Definition 5. Let A be a symmetric n × n real matrix. We say that A is:
1. positive definite iff xᵀAx > 0 for all x ∈ Rⁿ \ {0};
2. negative definite iff xᵀAx < 0 for all x ∈ Rⁿ \ {0};
3. positive semidefinite iff xᵀAx ≥ 0 for all x ∈ Rⁿ;
4. negative semidefinite iff xᵀAx ≤ 0 for all x ∈ Rⁿ;
5. indefinite otherwise.
Example Let us consider A:

A := [ 4  −1 ]
     [ −1 −2 ]

Then we have xᵀAx = 4x₁² − 2x₁x₂ − 2x₂². For x = [1 0]ᵀ, we have xᵀAx = 4 > 0. For x = [0 1]ᵀ, we have xᵀAx = −2 < 0. Hence A is indefinite.
The following theorem gives a useful characterization of (semi)definite matrices.
Theorem 6. Let A be a symmetric n × n real matrix.

1. A is positive (negative) definite iff all its eigenvalues are positive (negative).
2. A is positive (negative) semidefinite iff all its eigenvalues are non-negative (non-positive).
3. Otherwise, A is indefinite.
Proof. The spectral theorem allows us to diagonalize A: A = PᵀDP. Then one can write:

xᵀAx = xᵀPᵀDPx = (Px)ᵀD(Px)

Let us define y := Px; then we have xᵀAx = ∑ᵢ₌₁ⁿ λᵢyᵢ², where λ₁, …, λₙ are the diagonal elements of D, the eigenvalues of A. Also note that since P is invertible, y = 0 iff x = 0.

If λᵢ > 0 for all i, then we see that xᵀAx > 0 for all x ≠ 0. For the other direction, choosing y such that yᵢ = 1 and yⱼ = 0 for j ≠ i yields λᵢ > 0.

The other cases can be dealt with similarly.
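Theorem 6 is easy to check numerically. The sketch below (my addition) computes the eigenvalues of the indefinite matrix A from the example above, and of a positive definite matrix for contrast:

import numpy as np

A = np.array([[ 4.0, -1.0],
              [-1.0, -2.0]])
print(np.linalg.eigvalsh(A))   # one negative, one positive: A is indefinite

B = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(np.linalg.eigvalsh(B))   # eigenvalues 1 and 3, both positive: B is positive definite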
We can now give the second-order necessary conditions for local extrema.
Theorem 7. Let f : Rⁿ → R be a function twice differentiable at x̄ ∈ Rⁿ.

1. If x̄ is a local minimum, then Hf(x̄) is positive semidefinite.
2. If x̄ is a local maximum, then Hf(x̄) is negative semidefinite.
Proof. Let us assume that x̄ is a local minimum. We know from Theorem 3 that ∇f(x̄) = 0, hence the second-order expansion at x̄ takes the form:

f(x̄ + d) = f(x̄) + (1/2)dᵀHf(x̄)d + o(‖d‖²)

Because x̄ is a local minimum, 0 ≤ f(x̄ + d) − f(x̄) when ‖d‖ is small enough. Fix a direction d with ‖d‖ = 1 and apply this inequality to λd for small λ > 0:

0 ≤ (λ²/2)dᵀHf(x̄)d + o(λ²)

Dividing by λ² and letting λ → 0⁺ yields dᵀHf(x̄)d ≥ 0. This holds for every direction d, hence Hf(x̄) is positive semidefinite.
3.2 Sufficient conditions
Combining the conditions of Theorem 3 (stationary point) and Theorem 7 is almost enough to obtain a sufficient condition for local extrema. We only need to slightly strengthen the conditions of Theorem 7, from semidefinite to definite.
Theorem 8. Let f be a function twice differentiable at a stationary point x̄:

1. If Hf(x̄) is positive definite, then x̄ is a local minimum.
2. If Hf(x̄) is negative definite, then x̄ is a local maximum.
Proof. Let us assume that Hf(x̄) is positive definite. We know that x̄ is a stationary point, so we can write the second-order expansion at x̄:

f(x̄ + d) = f(x̄) + (1/2)dᵀHf(x̄)d + o(‖d‖²)

When ‖d‖ is small enough, f(x̄ + d) − f(x̄) has the same sign as (1/2)dᵀHf(x̄)d. Because Hf(x̄) is positive definite, this quantity is positive for any d ≠ 0, hence f(x̄ + d) − f(x̄) > 0 for all sufficiently small d ≠ 0. This proves that x̄ is a local minimum.

The proof is similar when Hf(x̄) is negative definite.
Remark 9. When Hf (x̄) is indefinite at a stationary point x̄, we have what is known as a saddle
point: x̄ will be a minimum along the eigenvectors of Hf (x̄) for which the eigenvalues are positive
and a maximum along the eigenvectors of Hf (x̄) for which the eigenvalues are negative.
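To tie Theorems 7 and 8 and this remark together, here is a short sketch (using the standard saddle example f(x) = x₁² − x₂², my choice) that classifies the stationary point at the origin from the eigenvalues of the Hessian:

import numpy as np

# f(x) = x1^2 - x2^2: grad f(0) = 0, so the origin is a stationary point.
H = np.array([[2.0,  0.0],
              [0.0, -2.0]])   # (constant) Hessian of f

eigvals, eigvecs = np.linalg.eigh(H)
if np.all(eigvals > 0):
    print("local minimum")    # Theorem 8, case 1
elif np.all(eigvals < 0):
    print("local maximum")    # Theorem 8, case 2
else:
    print("saddle point")     # Remark 9
    print("minimum along", eigvecs[:, eigvals > 0].ravel())  # eigenvectors with positive eigenvalues
    print("maximum along", eigvecs[:, eigvals < 0].ravel())  # eigenvectors with negative eigenvalues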