Math 273a: Optimization
Proximal Operator and Proximal-Point Algorithm
Wotao Yin, Department of Mathematics, UCLA
Online discussions on piazza.com

Outline
• Concept
• Definition
• Examples
• Optimization and operator-theoretic properties
• Algorithm
• Interpretations

Why proximal method?
• Newton's method: for $C^2$-smooth, unconstrained problems; allows modest problem sizes.
• Gradient method: for $C^1$-smooth, unconstrained problems; allows large sizes and sometimes distributed implementations.
• Proximal method: for smooth and non-smooth, constrained and unconstrained, but structured problems; allows large sizes and distributed implementations.

Why proximal method?
• Newton's method uses a low-level (explicit) operation: $x^{k+1} = x^k - \lambda H^{-1}(x^k)\nabla f(x^k)$.
• The gradient method uses a low-level (explicit) operation: $x^{k+1} = x^k - \lambda \nabla f(x^k)$.
• Proximal methods use a high-level (implicit) operation: $x^{k+1} = \operatorname{prox}_{\lambda f}(x^k)$. Evaluating $\operatorname{prox}_{\lambda f}$ is itself an optimization problem; it is simple only for structured $f$, but there are many such functions.

Proximal operator: assumptions and notation
1. $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is a closed, proper, convex function.
2. $f$ is proper if $\operatorname{dom} f \neq \emptyset$.
3. $f$ is closed if its epigraph is a closed set.
   $\Rightarrow$ $\operatorname{prox}_{\lambda f}$ is well defined and unique for $\lambda > 0$.
4. The raised $*$ (e.g., $x^*$) denotes a global minimizer of some function.
5. $\operatorname{dom} f$ is the domain of $f$, i.e., where $f(x)$ is finite.

Proximal operator: definition
• The proximal operator $\operatorname{prox}_f : \mathbb{R}^n \to \mathbb{R}^n$ of a function $f$ is defined by
  $\operatorname{prox}_f(v) = \arg\min_{x \in \mathbb{R}^n} f(x) + \tfrac{1}{2}\|x - v\|^2.$
• The scaled proximal operator $\operatorname{prox}_{\lambda f} : \mathbb{R}^n \to \mathbb{R}^n$ is defined by
  $\operatorname{prox}_{\lambda f}(v) = \arg\min_{x \in \mathbb{R}^n} f(x) + \tfrac{1}{2\lambda}\|x - v\|^2.$

Special case: projection
• Consider a closed convex set $C \neq \emptyset$.
• Let $\iota_C$ be the indicator function of $C$: $\iota_C(x) = 0$ if $x \in C$; $\infty$ otherwise. Then
  $\operatorname{prox}_{\iota_C}(x) = \arg\min_{y} \iota_C(y) + \tfrac{1}{2}\|y - x\|^2 = \arg\min_{y \in C} \tfrac{1}{2}\|y - x\|^2 =: P_C(x).$
[Figure: a point $x$, its projection $\operatorname{prox}_{\iota_C}(x) = P_C(x)$ onto $C$, and the reflection $\operatorname{refl}_{\iota_C}(x)$.]
• By generalizing $\iota_C$ to $f$, we generalize $P_C$ to $\operatorname{prox}_f$.

Proximal "step size"
  $\operatorname{prox}_{\lambda f}(v) = \arg\min_{x \in \mathbb{R}^n} f(x) + \tfrac{1}{2\lambda}\|x - v\|^2$
• $\lambda > 0$ is the "step size":
  • $\lambda \uparrow \infty \;\Longrightarrow\; \operatorname{prox}_{\lambda f}(v) \to \arg\min_{x \in \mathbb{R}^n} f(x)$ (in case there are multiple minimizers, the one closest to $v$).
  • $\lambda \downarrow 0 \;\Longrightarrow\; \operatorname{prox}_{\lambda f}(v) \to P_{\operatorname{dom} f}(v)$, where
    $P_{\operatorname{dom} f}(v) = \arg\min_{x \in \mathbb{R}^n} \left\{ \tfrac{1}{2}\|v - x\|^2 : f(x) \text{ is finite} \right\}.$
• Given $v$, $\operatorname{prox}_{\lambda f}(v) - v$ is not linear in $\lambda$, so $\lambda$ behaves like a step size but is not literally one.

Examples
• Linear function: let $a \in \mathbb{R}^n$, $b \in \mathbb{R}$, and $f(x) = a^T x + b$. Then
  $\operatorname{prox}_{\lambda f}(v) = \arg\min_{x \in \mathbb{R}^n} (a^T x + b) + \tfrac{1}{2\lambda}\|x - v\|^2$
  has the first-order optimality condition
  $a + \tfrac{1}{\lambda}\big(\operatorname{prox}_{\lambda f}(v) - v\big) = 0 \iff \operatorname{prox}_{\lambda f}(v) = v - \lambda a.$
• Application: proximal operator of the linear approximation of $f$.
  • Let $f^{(1)}(x) = f(x^0) + \langle \nabla f(x^0), x - x^0 \rangle$.
  • Then $\operatorname{prox}_{\lambda f^{(1)}}(x^0) = x^0 - \lambda \nabla f(x^0)$ is a gradient step with step size $\lambda$.

Examples
• Quadratic function: let $A \in S^n_+$ be a symmetric positive semidefinite matrix, $b \in \mathbb{R}^n$, and
  $f(x) = \tfrac{1}{2} x^T A x - b^T x + c.$
  The proximal operator $\operatorname{prox}_{\lambda f}(v) = \arg\min_{x \in \mathbb{R}^n} f(x) + \tfrac{1}{2\lambda}\|x - v\|^2$ has the first-order optimality condition
  $(A v^* - b) + \tfrac{1}{\lambda}(v^* - v) = 0
   \iff v^* = (\lambda A + I)^{-1}(\lambda b + v)
   \iff v^* = (\lambda A + I)^{-1}(\lambda b + \lambda A v + v - \lambda A v)
   \iff v^* = v + \big(A + \tfrac{1}{\lambda} I\big)^{-1}(b - A v).$
  This gives an iterative refinement method for least-squares problems (a numerical check of the two equivalent forms appears below).
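As a quick sanity check of the quadratic example (not part of the original slides), the following minimal NumPy sketch evaluates $\operatorname{prox}_{\lambda f}$ in both closed forms derived above and verifies the first-order optimality condition; the matrix $A$, vector $b$, point $v$, and step size $\lambda$ are arbitrary test data chosen for illustration.

```python
# Minimal sketch: prox of the quadratic f(x) = 0.5 x'Ax - b'x, in two equivalent forms.
# A, b, v, lam below are arbitrary test data (assumptions for illustration only).
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T                       # symmetric positive semidefinite
b = rng.standard_normal(n)
v = rng.standard_normal(n)
lam = 0.7                         # proximal "step size" lambda > 0

# Form 1: prox_{lam f}(v) = (lam*A + I)^{-1} (lam*b + v)
p1 = np.linalg.solve(lam * A + np.eye(n), lam * b + v)

# Form 2: prox_{lam f}(v) = v + (A + I/lam)^{-1} (b - A v)
p2 = v + np.linalg.solve(A + np.eye(n) / lam, b - A @ v)

print(np.allclose(p1, p2))        # True: the two closed forms agree

# First-order optimality: A p - b + (p - v)/lam should vanish at the prox
print(np.linalg.norm(A @ p1 - b + (p1 - v) / lam))   # ~ 0
```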
Examples (continued)
Recall, for the quadratic $f$ above,
  $\operatorname{prox}_{\lambda f}(v) = v + \big(A + \tfrac{1}{\lambda} I\big)^{-1}(b - A v).$
• Application: proximal operator of the quadratic approximation of $f$.
  • Let $f^{(2)}(x) = f(x^0) + \langle \nabla f(x^0), x - x^0 \rangle + \tfrac{1}{2}(x - x^0)^T \nabla^2 f(x^0)(x - x^0) = \tfrac{1}{2} x^T A x - b^T x + c$, where
    • $A = \nabla^2 f(x^0)$
    • $b = (\nabla^2 f(x^0))^T x^0 - \nabla f(x^0)$
  • By letting $v = x^0$, we get
    $\operatorname{prox}_{\lambda f^{(2)}}(x^0) = x^0 - \big(\nabla^2 f(x^0) + \tfrac{1}{\lambda} I\big)^{-1} \nabla f(x^0).$
  • This is a modified-Hessian Newton update (Levenberg–Marquardt update).

Examples
• $\ell_1$-norm: for $x \in \mathbb{R}^n$, let $f(x) = \|x\|_1$. Then
  $\operatorname{prox}_{\lambda f}(x) = \operatorname{sign}(x) \odot \max(|x| - \lambda, 0) = x - P_{[-\lambda, \lambda]^n}(x),$
  applied componentwise. The operator is often written as $\operatorname{shrink}(x, \lambda)$.
• $\ell_2$-norm: let $f(x) = \|x\|_2$. Then
  $\operatorname{prox}_{\lambda f}(x) = \operatorname{shrink}_{\|\cdot\|}(x, \lambda) = \max(\|x\| - \lambda, 0)\,\frac{x}{\|x\|} = x - P_{B(0,\lambda)}(x),$
  where we let $0/0 = 0$ if $x = 0$.
More examples:
• $\ell_\infty$-norm.
• $\ell_{2,1}$-norm.
• Unitarily invariant matrix norms: Frobenius norm, nuclear norm, maximal singular value.

Properties
Proposition (separable sums). Suppose that $f(x, y) = \varphi(x) + \psi(y)$ is a block-separable function. Then
  $\operatorname{prox}_{\lambda f}(v, w) = \big(\operatorname{prox}_{\lambda \varphi}(v), \operatorname{prox}_{\lambda \psi}(w)\big).$
• Note: we have already observed this with $f(x) = \sum_{i=1}^n |x_i|$.

Properties
Theorem (minimizer = fixed point). Let $\lambda > 0$. A point $x^* \in \mathbb{R}^n$ is a minimizer of $f$ if, and only if, $\operatorname{prox}_{\lambda f}(x^*) = x^*$.

Proximal-point algorithm (PPA)
• Iteration: $x^{k+1} = \operatorname{prox}_{\lambda f}(x^k)$ (a toy run on $f(x) = \|x\|_1$ appears at the end of these notes).
• Convergence:
  • For strongly convex $f$, $\operatorname{prox}_{\lambda f}$ is a contraction, i.e., there exists $C_0 < 1$ such that
    $\|\operatorname{prox}_{\lambda f}(x) - \operatorname{prox}_{\lambda f}(y)\| \le C_0 \|x - y\|.$
    Let $x = x^k$ and $y = x^*$. We get
    $\|x^{k+1} - x^*\| = \|\operatorname{prox}_{\lambda f}(x^k) - \operatorname{prox}_{\lambda f}(x^*)\| \le C_0 \|x^k - x^*\|.$
    Iterating gives $\|x^{k+1} - x^*\| \le C_0^{k+1} \|x^0 - x^*\|$, so $x^k$ converges to $x^*$ linearly.
  • For general convex $f$, $\operatorname{prox}_{\lambda f}$ is firmly nonexpansive, i.e.,
    $\|\operatorname{prox}_{\lambda f}(x) - \operatorname{prox}_{\lambda f}(y)\|^2 \le \|x - y\|^2 - \|(x - \operatorname{prox}_{\lambda f}(x)) - (y - \operatorname{prox}_{\lambda f}(y))\|^2.$
    From this inequality, one can show:
    • $x^k \to x^*$ weakly; this remains true if the computation error is summable.
    • Fixed-point residual: $\|\operatorname{prox}_{\lambda f}(x^k) - x^k\|^2 = o(1/k^2)$.
    • Objective: $f(x^k) - f(x^*) = o(1/k)$.

Proximal operator and resolvent
Definition. Given a mapping $T$, $(I + \lambda T)^{-1}$ is called the resolvent of $T$.
Proposition. Suppose that $f$ has subdifferential $\partial f$. Then $\operatorname{prox}_{\lambda f} = (I + \lambda \partial f)^{-1}$. In addition, both are single valued.

Interpretation: implicit (sub)gradient
  $x^{k+1} = \operatorname{prox}_{\lambda f}(x^k)
   \iff x^{k+1} = (I + \lambda \partial f)^{-1}(x^k)
   \iff x^k \in (I + \lambda \partial f)\, x^{k+1}
   \iff x^k \in x^{k+1} + \lambda \partial f(x^{k+1})
   \iff x^{k+1} = x^k - \lambda \tilde{\nabla} f(x^{k+1})$
• Notation: $\tilde{\nabla} f$ is the subgradient mapping uniquely defined by $\operatorname{prox}_{\lambda f}$, with $\tilde{\nabla} f(x^{k+1}) \in \partial f(x^{k+1})$.
• Let $y^{k+1} = \tilde{\nabla} f(x^{k+1})$. Plugging in the formula for $x^{k+1}$, we get $y^{k+1} \in \partial f(x^k - \lambda y^{k+1})$.
• Given $x^k$ and $\lambda$, computing $\operatorname{prox}_{\lambda f}(x^k)$ $\iff$ solving for $x^{k+1}$ (primal approach) $\iff$ solving for $y^{k+1}$ (dual approach).

Summary: proximal operator
• Conceptually simple, easy to understand and derive; a standard tool for nonsmooth and/or constrained optimization.
• Works for any $\lambda > 0$; more stable than gradient descent.
• Gives a fixed-point optimality condition and a convergent algorithm.
• "Sits at a high level of abstraction."
• Interpretations: generalized projection, implicit gradient, backward Euler.
• Closed-form or quick solutions exist for many basic functions.

Next few lectures:
• The prox operation is applied when it is easily evaluated, so it often appears as a step inside other algorithms, especially operator-splitting algorithms.
• Under separable structures, it is amenable to parallel and distributed algorithms, with interesting applications in machine learning, signal processing, compressed sensing, and large-scale modern convex problems.
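To make the PPA iteration concrete, here is a minimal sketch (not from the slides; the starting point and $\lambda$ are arbitrary choices) that runs $x^{k+1} = \operatorname{prox}_{\lambda f}(x^k)$ for $f(x) = \|x\|_1$, whose prox is the shrink operator from the Examples slide. The iterates reach the minimizer $x^* = 0$ and then stop moving, illustrating the minimizer = fixed point theorem.

```python
# Minimal sketch of the proximal-point algorithm x^{k+1} = prox_{lam f}(x^k)
# applied to f(x) = ||x||_1, whose prox is the shrink (soft-thresholding) operator.
# Starting point x0 and lam are arbitrary illustrative choices.
import numpy as np

def shrink(x, lam):
    """prox_{lam ||.||_1}(x) = sign(x) * max(|x| - lam, 0), componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

lam = 0.5
x = np.array([3.0, -1.2, 0.3, -4.0])      # arbitrary starting point x^0

for k in range(20):
    x_next = shrink(x, lam)
    residual = np.linalg.norm(x_next - x)  # fixed-point residual ||prox(x^k) - x^k||
    print(k, x_next, residual)
    x = x_next

# The iterates hit the minimizer x* = 0 of ||x||_1 after finitely many steps,
# after which prox_{lam f}(x*) = x* and the residual stays at zero.
```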