A Geometric Theory of Phase Transitions in Convex Optimization
Martin Lotz, School of Mathematics, The University of Manchester
In collaboration with Dennis Amelunxen (Manchester), Michael B. McCoy and Joel A. Tropp (Caltech)
Computational Mathematics and Applications Seminar, University of Oxford, October 24, 2013

Outline
- The phase transition phenomenon
- Statistical dimension
- Conic integral geometry
- Concentration of measure
- What else

Setting the stage
Problem: find a "structured" solution $x_0$ of the $m \times d$ system ($m < d$) $Ax = b$ by minimizing a convex regularizer:
$$\text{minimize } f(x) \quad \text{subject to } Ax = b. \qquad (\star)$$
When is $x_0$ the unique solution of $(\star)$?

Examples include:
- $x_0$ sparse: $f(x) = \|x\|_1 = \sum_i |x_i|$;
- $X_0$ a low-rank matrix: $f(X) = \|X\|_{S_1} = \sum_i \sigma_i(X)$;
- $\mathcal{X}_0$ a $K$-mode tensor: $f(\mathcal{X}) = \sum_i \lambda_i \|\mathcal{X}_{(i)}\|_{S_1}$.

Remarks:
- A convex problem comes with a solid theory and (in principle) efficient algorithms.
- A priori there is no reason to believe that the solution of $(\star)$ (regardless of the algorithm) has anything to do with $x_0$!
- There are powerful algorithms for finding sparse or low-rank solutions $x_0$ without resorting to convex optimisation; for references in the context of compressed sensing and matrix recovery, see (Cartis & Thompson 2013), (Blanchard & Tanner 2013), or (Tanner & Wei 2013).

Sparse recovery guarantees: what is known
Let $x_0$ be $s$-sparse and $b = Ax_0$ for random $A \in \mathbb{R}^{m \times d}$ ($s < m < d$), and solve
$$\text{minimize } \|x\|_1 \quad \text{subject to } Ax = b.$$
- Donoho; Candès, Romberg & Tao; Rudelson & Vershynin: recovery succeeds when $m \geq \mathrm{const} \cdot \log(d/m) \cdot s$ (the "complexity" $m$ is proportional to the "information content" $s$).
- Similar results hold for low-rank matrix recovery (Candès & Tao; Recht, Fazel & Parrilo).
- Phase transitions for successful recovery were observed and precisely located by Donoho & Tanner and by Stojnic.

Some experiments: $\ell_1$ minimization
Let $x_0$ be $s$-sparse and $b = Ax_0$ for random $A \in \mathbb{R}^{m \times d}$ ($s < m < d$), and solve
$$\text{minimize } \|x\|_1 \quad \text{subject to } Ax = b.$$
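The success curves below are easy to reproduce by Monte Carlo. The following sketch (assuming the numpy and cvxpy packages are available; the trial count, tolerance, and random seed are arbitrary illustrative choices, not the ones used for the plots) estimates the empirical success probability of $\ell_1$ minimization for fixed $s$ and $d$ over a range of $m$.

```python
import numpy as np
import cvxpy as cp

def l1_success_probability(s=50, d=200, m_values=range(25, 201, 25), trials=50, tol=1e-5):
    """Estimate P{l1 minimization recovers an s-sparse x0} for each number of equations m."""
    rng = np.random.default_rng(0)
    probs = {}
    for m in m_values:
        successes = 0
        for _ in range(trials):
            x0 = np.zeros(d)
            support = rng.choice(d, size=s, replace=False)
            x0[support] = rng.standard_normal(s)          # random s-sparse signal
            A = rng.standard_normal((m, d))               # Gaussian measurement matrix
            b = A @ x0
            x = cp.Variable(d)
            cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == b]).solve()
            if x.value is not None and np.linalg.norm(x.value - x0) <= tol * np.linalg.norm(x0):
                successes += 1
        probs[m] = successes / trials
    return probs

for m, p in l1_success_probability().items():
    print(f"m = {m:3d}: empirical success probability = {p:.2f}")
```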
[Figure: empirical probability of success against the number of equations $m$, for $s = 50$, $d = 200$, shown for $m = 25, 50, 75, \dots, 200$.]

Some experiments: low-rank matrix recovery
Let $X_0$ have rank $r$ and $b = \mathcal{A}X_0$ for a random linear map $\mathcal{A} : \mathbb{R}^{m \times n} \to \mathbb{R}^p$, and solve
$$\text{minimize } \|X\|_{S_1} \quad \text{subject to } \mathcal{A}X = b.$$
[Figure: empirical probability of success against the normalized number of equations $p/nm$, for $r/n = 0.3$, shown for $p/nm = 0.125, 0.250, \dots, 1.000$.]
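The low-rank experiment can be sketched in the same way with the nuclear norm. The code below is only an illustration of the idea: it uses square matrices, a Gaussian measurement map, and tolerance and trial counts chosen by me (not the setup behind the plots), and it assumes cvxpy with a semidefinite-capable solver such as the bundled SCS.

```python
import numpy as np
import cvxpy as cp

def nuclear_norm_success(n=12, r=3, p_over_n2=0.5, trials=10, tol=1e-3, seed=1):
    """Estimate the success probability of nuclear-norm minimization for rank-r recovery."""
    rng = np.random.default_rng(seed)
    p = int(round(p_over_n2 * n * n))
    successes = 0
    for _ in range(trials):
        X0 = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # random rank-r matrix
        As = rng.standard_normal((p, n, n))                              # p Gaussian measurement matrices
        b = np.tensordot(As, X0, axes=([1, 2], [0, 1]))                  # b_i = <A_i, X0>
        X = cp.Variable((n, n))
        constraints = [cp.sum(cp.multiply(As[i], X)) == b[i] for i in range(p)]
        cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()
        if X.value is not None and np.linalg.norm(X.value - X0) <= tol * np.linalg.norm(X0):
            successes += 1
    return successes / trials

for frac in (0.25, 0.5, 0.75):
    print(f"p/n^2 = {frac}: success probability ≈ {nuclear_norm_success(p_over_n2=frac):.2f}")
```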
Some experiments: Buffon's needle
Throw $n$ needles of length $\ell$ uniformly onto a surface ruled into parallel strips of width $d \geq \ell$. What is the probability that at most $m$ of them lie across adjacent strips?
[Figure: empirical probability of at most $m$ crossings, shown for $m = 25, 50, 75, \dots, 200$; the transition is located at $n \cdot 2/\pi$ (marked at $127.324$ in the plot).]
Threshold location: $n \cdot 2/\pi$.
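Since each needle crosses a line independently with probability $2\ell/(\pi d)$, the number of crossings is Binomial$(n, 2\ell/(\pi d))$ and concentrates around $n \cdot 2\ell/(\pi d)$. A small simulation sketch (parameters are my own illustrative choices, e.g. $n = 200$ needles with $\ell = d$, which places the threshold at $n \cdot 2/\pi \approx 127.3$):

```python
import numpy as np

def prob_at_most_m_crossings(n=200, ell=1.0, d=1.0, m_values=range(25, 201, 25), trials=2000):
    """Estimate P{at most m of n needles cross a line between strips} by direct simulation."""
    rng = np.random.default_rng(0)
    # distance of needle centre to nearest line ~ Uniform(0, d/2); acute angle ~ Uniform(0, pi/2)
    y = rng.uniform(0.0, d / 2, size=(trials, n))
    theta = rng.uniform(0.0, np.pi / 2, size=(trials, n))
    crossings = (y <= (ell / 2) * np.sin(theta)).sum(axis=1)   # a needle crosses iff y <= (l/2) sin(theta)
    return {m: float(np.mean(crossings <= m)) for m in m_values}

for m, p in prob_at_most_m_crossings().items():
    print(f"m = {m:3d}: P(at most m crossings) ≈ {p:.2f}")
print("threshold n*2/pi =", 200 * 2 / np.pi)
```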
- This example is not completely gratuitous, but intimately related to the phase transitions for convex regularization!
- We will see that the success probability of convex minimization with random constraints is determined by a discrete probability distribution that concentrates around its mean.

Phase transitions for linear inverse problems
Associate to a solution $x_0$ of $Ax = b$ and the convex problem
$$\text{minimize } f(x) \quad \text{subject to } Ax = b \qquad (\star)$$
a parameter $\delta(f, x_0)$, the statistical dimension of $f$ at $x_0$.

Theorem [Amelunxen, L., McCoy & Tropp 2013]. Let $\eta \in (0,1)$ and let $x_0 \in \mathbb{R}^d$ be a fixed vector. Suppose $A \in \mathbb{R}^{m \times d}$ has independent standard normal entries, and let $b = Ax_0$. Then
$$m \geq \delta(f, x_0) + a_\eta \sqrt{d} \implies (\star) \text{ succeeds with probability } \geq 1 - \eta;$$
$$m \leq \delta(f, x_0) - a_\eta \sqrt{d} \implies (\star) \text{ succeeds with probability } \leq \eta,$$
where $a_\eta := 4\sqrt{\log(4/\eta)}$ (so that $a_{0.01} < 10$ and $a_{0.001} < 12$).

- In a very precise sense, minimizing $f$ in $(\star)$ has the effect of adding $d - \delta(f, x_0)$ linear equations to the system $Ax = b$.
- The reason for this is that the descent cone of $f$ at $x_0$ behaves like a linear subspace in high dimension.

[Figure: phase transition diagrams for linear inverse problems.]
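The constants in the theorem are easy to evaluate. The following sketch (with illustrative values $\delta = 100$ and $d = 200$ chosen by me, not taken from the talk) prints the transition window $m = \delta(f, x_0) \pm a_\eta \sqrt{d}$, whose width $2 a_\eta \sqrt{d}$ is negligible compared to $d$ in high dimensions.

```python
import math

def transition_window(delta, d, eta):
    """Failure/success thresholds m = delta -/+ a_eta * sqrt(d) from the theorem above."""
    a_eta = 4 * math.sqrt(math.log(4 / eta))
    return delta - a_eta * math.sqrt(d), delta + a_eta * math.sqrt(d), a_eta

for eta in (0.01, 0.001):
    lo, hi, a = transition_window(delta=100, d=200, eta=eta)
    print(f"eta = {eta}: a_eta = {a:.2f}, failure below m = {lo:.1f}, success above m = {hi:.1f}")
```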
Demixing
Reconstruct two signals $x_0, y_0$ from the observation $z_0 = x_0 + U y_0$, where $U \in O(d)$ is a known orthogonal matrix that encodes the orientation of $y_0$ relative to $x_0$. Examples:
- both signals sparse (morphological component analysis);
- $x_0$ sparse (corruption), $y_0 \in \{\pm 1\}^d$ (message): a robust communication protocol;
- $X_0$ a low-rank matrix, $Y_0$ sparse (corruption): latent variable selection in machine learning.

Phase transitions for convex demixing
Separate the two signals from the observation $z = x_0 + U y_0$ by solving
$$\text{minimize } f(x) \quad \text{subject to } g(y) \leq g(y_0) \text{ and } z = x + U y \qquad (\star\star)$$
for convex functions $f$ and $g$ designed to capture the structure of $x_0$ and $y_0$.

Theorem [Amelunxen, L., McCoy & Tropp 2013]. Let $\eta \in (0,1)$, let $x_0, y_0$ be fixed vectors in $\mathbb{R}^d$, draw an orthogonal matrix $U \in \mathbb{R}^{d \times d}$ uniformly at random, and let $z = x_0 + U y_0$. Then
$$\delta(f, x_0) + \delta(g, y_0) \leq d - a_\eta \sqrt{d} \implies (\star\star) \text{ succeeds with probability } \geq 1 - \eta;$$
$$\delta(f, x_0) + \delta(g, y_0) \geq d + a_\eta \sqrt{d} \implies (\star\star) \text{ succeeds with probability } \leq \eta.$$

Two instances of $(\star\star)$:
$$\text{minimize } \|c\|_1 \quad \text{subject to } \|m\|_\infty \leq 1 \text{ and } c + Qm = z_0;$$
$$\text{minimize } \|X\|_{S_1} \quad \text{subject to } \|Y\|_1 \leq \alpha \text{ and } X + QY = Z_0.$$
[Figure: probability of success for deconvolution (demixing) problems; panels "Demixing sparse & sparse" (nonzero proportion of $y_0$ against nonzero proportion of $x_0$) and "Demixing sparse & low-rank" (nonzero proportion of $Y_0$ against normalized rank of $X_0$), each showing 5%, 50% and 95% empirical success contours together with the theoretical curve.]

When does convex relaxation work?
Consider: minimize $\|x\|_1$ subject to $Ax = b$.
[Figure: the $\ell_1$-ball $\{\|x\|_1 \leq \|x_0\|_1\}$, the affine space $\{Ax = b\}$ through $x_0$, and the descent cone at $x_0$: "success" when the affine space meets the ball only at $x_0$, "failure" otherwise.]

Nullspace property. The convex relaxation method succeeds if and only if the kernel of $A$ misses the cone of descent directions of $\|\cdot\|_1$ at $x_0$.

From optimization to geometry
The problem "minimize $f(x)$ subject to $Ax = b$" has $x_0$ as its unique solution if and only if the optimality condition
$$\ker A \cap D(f, x_0) = \{0\}$$
is satisfied, where
$$D(f, x_0) := \bigcup_{\tau > 0} \left\{ y \in \mathbb{R}^d : f(x_0 + \tau y) \leq f(x_0) \right\} \cong \mathrm{cone}(\partial f(x_0))^\circ$$
is the convex descent cone of $f$ at $x_0$.

Examples of descent cones, or of their polars, include:
- $x_0$ $s$-sparse: $\partial \|x_0\|_1$ is isomorphic to a $(d-s)$-face of the unit cube centred at the origin.
- $x_0$ on an $s$-face of the hypercube: $D(\|\cdot\|_\infty, x_0) \cong \mathbb{R}^{d-s} \times \mathbb{R}^s_{\leq 0}$.
- $X_0$ a rank-$r$ matrix: if $X_0$ is diagonal, then
$$\partial \|X_0\|_{S_1} = \left\{ \begin{pmatrix} I_r & 0 \\ 0 & W_0 \end{pmatrix} : \|W_0\|_2 \leq 1 \right\} \quad \text{(Watson 1992)}.$$

The statistical dimension
Definition. The statistical dimension of a convex cone $C$ is
$$\delta(C) := \mathbb{E}\, \|\Pi_C(g)\|^2,$$
where $g \sim \mathrm{Normal}(0, I)$ is a Gaussian vector and $\Pi_C(\cdot)$ denotes the Euclidean projection onto $C$. For a proper convex function $f$ and a point $x_0$ we define the statistical dimension as that of the descent cone of $f$ at $x_0$:
$$\delta(f, x_0) := \delta(D(f, x_0)).$$
- A direct generalisation of the dimension of a linear space.
- Closely related to the squared Gaussian width.

Basic properties
- Spherical formulation: $\delta(C) = d\, \mathbb{E}\, \|\Pi_C(\theta)\|^2$, where $\theta \sim \mathrm{Uniform}(S^{d-1})$.
- Rotational invariance: $\delta(UC) = \delta(C)$ for each $U \in O_d$.
- Subspaces: for a subspace $L \subset \mathbb{R}^d$, $\delta(L) = \dim(L)$.
- Totality: $\delta(C) + \delta(C^\circ) = d$; this generalises $\dim(L) + \dim(L^\perp) = d$ for linear $L$.
- Direct products: for each closed convex cone $K$, $\delta(C \times K) = \delta(C) + \delta(K)$; in particular, $\delta$ is invariant under embedding.
- Monotonicity: $C \subset K$ implies $\delta(C) \leq \delta(K)$.
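Whenever the projection onto $C$ is easy to compute, the definition gives a direct Monte Carlo estimate of $\delta(C)$. A minimal sketch for the nonnegative orthant (sample size chosen arbitrarily), which also checks the totality property via the Moreau decomposition $g = \Pi_C(g) + \Pi_{C^\circ}(g)$:

```python
import numpy as np

def statistical_dimension_orthant(d=200, samples=20000, seed=0):
    """Monte Carlo estimate of delta(R^d_{>=0}) = E ||Pi_C(g)||^2 (exact value: d/2)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((samples, d))
    proj_cone = np.maximum(g, 0.0)        # projection onto the orthant
    proj_polar = np.minimum(g, 0.0)       # projection onto its polar cone (Moreau decomposition)
    delta_cone = np.mean(np.sum(proj_cone**2, axis=1))
    delta_polar = np.mean(np.sum(proj_polar**2, axis=1))
    return delta_cone, delta_polar

d = 200
dC, dCpolar = statistical_dimension_orthant(d)
print(f"delta(orthant) ≈ {dC:.1f}   (exact: {d / 2})")
print(f"delta(C) + delta(C°) ≈ {dC + dCpolar:.1f}   (should equal d = {d})")
```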
Examples
- Linear subspaces: $\delta(L) = \dim L$.
- Non-negative orthant: $\delta(\mathbb{R}^d_{\geq 0}) = d/2$.
- Self-dual cones: since $\delta(C) + \delta(C^\circ) = d$, we have $\delta(C) = d/2$ for any self-dual cone (for example, the positive semidefinite matrices).
- Second-order (ice cream) cones of angle $\alpha$: with $\mathrm{Circ}(d, \alpha) := \{ x \in \mathbb{R}^d : x_1 / \|x\| \geq \cos(\alpha) \}$, one has $\delta(\mathrm{Circ}(d, \alpha)) \approx d \sin^2(\alpha)$.
- The cone $C_A = \{ x : x_1 \leq \cdots \leq x_d \}$: $\delta(C_A) = \sum_{k=1}^d \frac{1}{k} \sim \log(d)$.

Approximate kinematic formula

Theorem [ALMT13]. Fix a tolerance $\eta \in (0,1)$. Suppose that $C, K \subset \mathbb{R}^d$ are closed convex cones, one of which is not a subspace, and draw an orthogonal matrix $Q \in \mathbb{R}^{d \times d}$ uniformly at random. Then
$$\delta(C) + \delta(K) \leq d - a_\eta \sqrt{d} \implies \mathbb{P}\{C \cap QK = \{0\}\} \geq 1 - \eta;$$
$$\delta(C) + \delta(K) \geq d + a_\eta \sqrt{d} \implies \mathbb{P}\{C \cap QK = \{0\}\} \leq \eta,$$
where $a_\eta := 4\sqrt{\log(4/\eta)}$ ($a_{0.01} < 10$ and $a_{0.001} < 12$).
- Applying this with $C = D(f, x_0)$ and $K = \ker A$ gives the phase transitions for convex regularization.

In high dimensions, convex cones have intersection behaviour like linear subspaces:
- Linear subspaces (almost surely): $L_1 \cap L_2 = \{0\}$ iff $\dim(L_1) + \dim(L_2) \leq d$, and $L_1 \cap L_2 \neq \{0\}$ iff $\dim(L_1) + \dim(L_2) > d$.
- Convex cones (with overwhelming probability): $C \cap QD = \{0\}$ iff $\delta(C) + \delta(D) \lesssim d$, and $C \cap QD \neq \{0\}$ iff $\delta(C) + \delta(D) \gtrsim d$.
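The dichotomy can be checked numerically in a small example: for the nonnegative orthant $C$ (with $\delta(C) = d/2$) and a uniformly random subspace $L$ of dimension $k$, deciding whether $C \cap L = \{0\}$ is a linear-programming feasibility question. A sketch assuming numpy and scipy's linprog (the dimension and trial counts are arbitrary illustrative choices); the kinematic formula predicts a transition at $k = d/2$.

```python
import numpy as np
from scipy.optimize import linprog

def intersects_trivially(B):
    """True iff span(B) meets the nonnegative orthant only at 0.

    A nonzero x >= 0 in span(B) exists iff {B c >= 0, sum(B c) = 1} is feasible."""
    d, k = B.shape
    res = linprog(c=np.zeros(k),
                  A_ub=-B, b_ub=np.zeros(d),                       # B c >= 0
                  A_eq=(np.ones(d) @ B).reshape(1, -1), b_eq=[1.0],  # normalisation sum(B c) = 1
                  bounds=[(None, None)] * k)
    return res.status == 2                                         # infeasible <=> trivial intersection

d, trials = 60, 200
rng = np.random.default_rng(0)
for k in (20, 25, 30, 35, 40):   # delta(orthant) = d/2 = 30, so the transition is predicted at k = 30
    hits = sum(intersects_trivially(rng.standard_normal((d, k))) for _ in range(trials))
    print(f"dim L = {k}: P(C ∩ L = {{0}}) ≈ {hits / trials:.2f}")
```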
Computing the statistical dimension
For many cases, the statistical dimension $\delta(f, x_0)$ of a convex function $f$ at $x_0$ can be determined exactly or asymptotically.
- $x_0$ $s$-sparse, $f = \|\cdot\|_1$: an asymptotic formula for $\delta(\|\cdot\|_1, x_0)$ follows from Stojnic 2009 (a numerical sketch of this recipe appears below).
- $X_0$ a rank-$r$ matrix, $f = \|\cdot\|_{S_1}$: an asymptotic formula based on the Marčenko-Pastur characterisation of the empirical eigenvalue distribution of Wishart matrices.
- $x_0$ on an $s$-face of the hypercube, $f = \|\cdot\|_\infty$: $\delta(f, x_0) = d - s/2$.

Theory and experiment
[Figure: phase transitions in linear inverse problems; empirical success percentage against the number of generic measurements for sparse, sign, and low-rank recovery, together with the theoretical predictions.]

Relation to convex denoising
Let $z = x_0 + w$ with $w \sim \mathrm{Normal}(0, \sigma^2 I)$, and let
$$\hat{x} = \arg\min_x \tfrac{1}{2} \|x - z\|_2^2 + \mu f(x).$$
If, for example, $f = \|\cdot\|_1$, then this is soft thresholding. The associated minimax MSE risk is
$$R(f, x_0) = \sup_{\sigma > 0} \inf_{\mu > 0} \frac{1}{\sigma^2}\, \mathbb{E}\left[ \|\hat{x} - x_0\|_2^2 \right].$$

Lemma ([ALMT13] + [Oymak & Hassibi 2013]).
$$\delta(f, x_0) - R(f, x_0) = O\!\left( \sqrt{\delta(f, x_0)} \right).$$
- The relation of the minimax MSE risk of $\ell_1$ and $S_1$ denoising to phase transitions for recovery was conjectured and observed empirically by Donoho & Montanari.
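Returning to the $\ell_1$ item above: with $\rho = s/d$, the asymptotic recipe amounts to minimizing $\rho(1+\tau^2) + (1-\rho)\,\mathbb{E}[(|g|-\tau)_+^2]$ over $\tau \geq 0$, with $g$ standard normal. The closed form of the expectation used in the sketch below is standard Gaussian calculus that I supply, not a formula from the slides; treat it as an illustration of how the curve $\delta/d$ can be evaluated (assuming numpy and scipy).

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def l1_statistical_dimension_fraction(rho):
    """Asymptotic delta(||.||_1, x0)/d for an s-sparse x0 with rho = s/d."""
    def expected_sq_dist(tau):
        # E[(|g| - tau)_+^2] = 2 [ (1 + tau^2) * Phi(-tau) - tau * phi(tau) ]
        tail = 2.0 * ((1.0 + tau**2) * norm.cdf(-tau) - tau * norm.pdf(tau))
        return rho * (1.0 + tau**2) + (1.0 - rho) * tail
    res = minimize_scalar(expected_sq_dist, bounds=(0.0, 10.0), method="bounded")
    return res.fun

for rho in (0.05, 0.1, 0.25, 0.5):
    print(f"rho = {rho}: delta/d ≈ {l1_statistical_dimension_fraction(rho):.3f}")
```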
The kinematic formula
Let $Q$ be a random orthogonal transformation. The probability that a randomly rotated cone intersects another can be expressed in terms of a discrete probability distribution, the spherical intrinsic volumes $v_0(C), \dots, v_d(C)$:
$$\mathbb{P}\{ C \cap QD \neq \{0\} \} = 2 \sum_{k \text{ odd}} \; \sum_{i+j = d+k} v_i(C)\, v_j(D).$$
For the case where $D = L$ is a linear subspace of codimension $m$, we have $v_i(L) = 1$ if $i = d - m$ and $v_i(L) = 0$ otherwise, so
$$\mathbb{P}\{ C \cap QL \neq \{0\} \} = 2 \sum_{k \text{ odd}} v_{m+k}(C).$$
The right-hand side is essentially the tail of a discrete probability distribution.

Spherical intrinsic volumes
[Figure: a planar cone $C$ with the regions corresponding to $v_0(C)$, $v_1(C)$, $v_2(C)$.]
Let $C \subseteq \mathbb{R}^d$ be a polyhedral cone and let $\mathcal{F}_k(C)$ be the set of its $k$-dimensional faces. The $k$-th (spherical) intrinsic volume of $C$ is defined as
$$v_k(C) = \sum_{F \in \mathcal{F}_k(C)} \mathbb{P}\{ \Pi_C(g) \in \mathrm{relint}(F) \},$$
with $g$ a standard Gaussian vector.
- Clearly, the $v_k(C)$ describe a discrete probability distribution.

Spherical intrinsic volumes: examples
- Linear subspace $L$: $v_k(L) = 1$ if $\dim L = k$, and $v_k(L) = 0$ otherwise.
- Orthant $\mathbb{R}^d_{\geq 0}$: $v_k(\mathbb{R}^d_{\geq 0}) = \binom{d}{k} 2^{-d}$.
- Second-order cones:
$$v_k\big(\mathrm{Circ}(d, \alpha)\big) = \frac{1}{2} \binom{\tfrac{d-2}{2}}{\tfrac{k-1}{2}} \sin^{k-1}(\alpha)\, \cos^{d-k-1}(\alpha).$$
- Asymptotics for intrinsic volumes of descent cones at faces of the simplex and the $\ell_1$-ball were computed by Vershik & Sporyshev and by Donoho & Tanner (via polytope angles).
- Integral representations for the semidefinite cone were derived by Amelunxen & Bürgisser.

Concentration of measure

Theorem [ALMT13]. Let $C$ be a convex cone, and let $X_C$ be a discrete random variable with distribution $\mathbb{P}\{X_C = k\} = v_k(C)$. Then the statistical dimension $\delta(C)$ is the expected value of $X_C$:
$$\delta(C) = \mathbb{E}\, \|\Pi_C(g)\|^2 = \mathbb{E}[X_C].$$
Moreover, the intrinsic volumes satisfy
$$\mathbb{P}\{ |X_C - \delta(C)| > \lambda \} \leq \exp\!\left( \frac{-\lambda^2/8}{\omega(C) + \lambda} \right) \quad \text{for } \lambda \geq 0,$$
where $\omega(C) := \min\{\delta(C),\, d - \delta(C)\}$.

The spherical Steiner formula
Recall the statistical dimension $\delta(C) = \mathbb{E}\, \|\Pi_C(g)\|^2$. The measure of points on the sphere within angle $\arccos(\sqrt{\varepsilon})$ of the cone $C$ is given by the spherical Steiner formula (Herglotz, Allendoerfer, Santaló):
$$\mathbb{P}\{ \|\Pi_C(\theta)\|^2 \geq \varepsilon \} = \sum_{k=1}^d \mathbb{P}\{ \|\Pi_{L_k}(\theta)\|^2 \geq \varepsilon \}\, v_k(C),$$
where $L_k$ is a $k$-dimensional subspace and $\theta$ is uniform on $S^{d-1}$.
- The coefficients $\mathbb{P}\{ \|\Pi_{L_k}(\theta)\|^2 \geq \varepsilon \}$ are Beta distributed: the volume of the $\arccos(\sqrt{\varepsilon})$-neighbourhood of a $k$-dimensional subsphere satisfies
$$\mathbb{P}\{ \|\Pi_{L_k}(\theta)\|^2 \geq \varepsilon \} \approx \begin{cases} 0 & \text{if } \varepsilon > k/d, \\ 1 & \text{if } \varepsilon < k/d. \end{cases}$$
- Hence
$$\mathbb{P}\{ \|\Pi_C(\theta)\|^2 \geq \varepsilon \} \approx \sum_{k = \lceil \varepsilon d \rceil}^d v_k(C) \approx \begin{cases} 0 & \text{if } \varepsilon > \delta(C)/d, \\ 1 & \text{if } \varepsilon < \delta(C)/d, \end{cases}$$
which follows from concentration of measure, since the squared norm of the projection is Lipschitz and concentrates near its expected value $\delta(C)$.

Concentration of measure
[Figure: height above the equator of the band occupying 90 per cent of the measure of $S^2$, $S^{10}$ and $S^{100}$; from Matoušek, Lectures on Discrete Geometry.]

The spherical Steiner formula (continued)
Let $X_C$ be a random variable with distribution given by the spherical intrinsic volumes, $\mathbb{P}\{X_C = k\} = v_k(C)$. By the spherical Steiner formula,
$$\mathbb{P}\{ X_C \geq \varepsilon d \} = \sum_{k = \lceil \varepsilon d \rceil}^d v_k(C) \approx \begin{cases} 0 & \text{if } \varepsilon > \delta(C)/d, \\ 1 & \text{if } \varepsilon < \delta(C)/d. \end{cases}$$
- A rigorous implementation of this argument is based on log-Sobolev inequalities.
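For the orthant the intrinsic volumes are binomial, so both the concentration statement and the kinematic tail can be evaluated directly. The sketch below (the dimension $d$ and the list of codimensions are arbitrary illustrative choices) computes $v_k(\mathbb{R}^d_{\geq 0}) = \binom{d}{k} 2^{-d}$, checks that the mean is the statistical dimension $d/2$, and evaluates the crossing probability $2 \sum_{k \text{ odd}} v_{m+k}(C)$ for subspaces of codimension $m$, which drops from near 1 to near 0 as $m$ passes $d/2$.

```python
import numpy as np
from math import comb

d = 200
v = np.array([comb(d, k) / 2**d for k in range(d + 1)])     # intrinsic volumes of the orthant

print("sum of intrinsic volumes:", v.sum())                           # = 1: a probability distribution
print("mean (statistical dimension):", (np.arange(d + 1) * v).sum())  # = d/2

def crossing_probability(m):
    """P{C ∩ QL != {0}} = 2 * sum over odd k of v_{m+k}(C), for L of codimension m."""
    return 2.0 * sum(v[m + k] for k in range(1, d - m + 1, 2))

for m in (80, 90, 100, 110, 120):
    print(f"codim m = {m}: crossing probability ≈ {crossing_probability(m):.3f}")
```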
Summary
- Problems of recovery or demixing by convex optimization are equivalent to the problem of a cone intersecting a subspace or another cone.
- In high dimensions, the intersection behaviour of randomly oriented closed convex cones is determined by the statistical dimension.
- The reason: intersection probabilities are determined precisely by the kinematic formula in terms of intrinsic volumes, and...
- ...intrinsic volumes concentrate around the average dimension of the cone, which coincides with the statistical dimension.
- There are simple recipes for computing $\delta(C)/d$ asymptotically, and in some cases even exactly.

A curious example
Ultra slim cones (chambers of finite reflection groups):
$$\delta(C_A) = \sum_{k=1}^d \frac{1}{k}, \qquad C_A := \{ x_1 \leq \cdots \leq x_d \};$$
$$\delta(C_{BC}) = \frac{1}{2} \sum_{k=1}^d \frac{1}{k}, \qquad C_{BC} := \{ 0 \leq x_1 \leq \cdots \leq x_d \}.$$
Note that $\sum_{k=1}^d \frac{1}{k} \approx \log(d)$.

These cones appear as certain normal cones. [Figure: the cone $C_{BC}$ for $d = 2$.]

The logarithmic statistical dimension implies that "recovering vectors from lists" by the convex relaxation method works disappointingly badly.
[Figure: probability of finding a vector from a list; empirical success percentage against the number of measurements (axis from 85 to 100).]

Change of representation

setting       | synthesis sparsity    | analysis sparsity
signal        | $x_0 = D\alpha_0$     | $x_0$
sparsity      | $f(\alpha_0)$ small   | $f(\Omega x_0)$ small
measurements  | $AD\alpha = b$        | $Ax = b$

One needs to understand the statistical dimension of linear images of cones. Let $C \in \mathcal{C}_d$ be a closed convex cone and $T \in \mathrm{GL}_d$.
- TQC-Lemma: for $Q \in O(d)$ uniformly at random, $\mathbb{E}\, \delta(TQC) = \delta(C)$.
- Condition-based estimate: with $\kappa(T)$ the condition number of $T$,
$$\frac{\delta(TC)}{\delta(C)} \in \left[ \kappa(T)^{-2},\, \kappa(T)^{2} \right] \quad \text{(Amelunxen)}.$$

Recovery with noise
Let $b = Ax_0 + e$ with noise $\|e\| \leq \varepsilon$. Any $x$ that solves
$$\text{minimize } f(x) \quad \text{subject to } \|Ax - b\| \leq \varepsilon$$
satisfies $\|x - x_0\| \leq 2\varepsilon \cdot \sigma_C(A)^{-1}$, where
$$\sigma_C(A) = \min_{x \in C} \frac{\|Ax\|}{\|x\|}$$
and $C = D(f, x_0)$ is the descent cone of $f$ at $x_0$.
- Recovery requires $\sigma_C(A) > 0$; recovery with noise requires $\sigma_C(A) > t > 0$. Using a relation to condition numbers, one can derive the probability of such events in terms of Grassmann tube formulae (work in progress).

Some problems
- Spherical Hadwiger conjecture: every continuous, rotation-invariant valuation on closed convex cones is a linear combination of spherical intrinsic volumes.
- Are the spherical intrinsic volumes log-concave, that is, $v_k(C)^2 \geq v_{k-1}(C) \cdot v_{k+1}(C)$?
- Is the variance of $X_C$ maximised by the Lorentz cone $\mathrm{Circ}(d, \pi/4)$?
- Universality: what about other distributions (experiments by Donoho & Tanner, partial results by Montanari et al.)?
- Lower bounds on the number of measurements required for convex tensor recovery [Mu, Huang, Wright & Goldfarb 2013].
- Apply the developed technology to phase retrieval problems.

For more details:
D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: a geometric theory of phase transitions in convex optimization. arXiv:1303.6672.

Thank You!