Manchester Numerical Analysis
A Geometric Theory of Phase Transitions
in Convex Optimization
Martin Lotz
School of Mathematics
The University of Manchester
with the collaboration of
Dennis Amelunxen (Manchester), Michael B. McCoy, Joel A. Tropp (Caltech)
Computational Mathematics and Applications Seminar
University of Oxford, October 24, 2013
Outline
The phase transition phenomenon
Statistical dimension
Conic integral geometry
Concentration of measure
What else
Setting the stage
Problem: find a “structured” solution x0 of an m × d system (m < d)
    Ax = b
by minimizing a convex regularizer:
    minimize f(x)   subject to   Ax = b.    (?)
When is x0 the unique solution of (?)?
Examples include:
• x0 sparse: f(x) = ‖x‖_1 = Σ_i |x_i|
• X0 a low-rank matrix: f(X) = ‖X‖_{S_1} = Σ_i σ_i(X)
• X0 a K-mode tensor: f(X) = Σ_i ‖X_(i)‖_{S_1}
• A convex problem comes with a solid theory and (in principle) efficient algorithms.
• A priori there is no reason to believe that the solution of (?) (regardless of the algorithm) has anything to do with x0!
• There are powerful algorithms for searching for sparse/low-rank solutions x0 without resorting to convex optimisation.
• For references in the context of compressed sensing and matrix recovery, see (Cartis & Thompson 2013), (Blanchard & Tanner 2013), or (Tanner & Wei 2013).
Sparse recovery guarantees: what is known
Let x0 be s-sparse, b = Ax0 for random A ∈ R^{m×d} (s < m < d).
    minimize ‖x‖_1   subject to   Ax = b.
• Donoho, Candès, Romberg & Tao, Rudelson & Vershynin:
    m ≥ const · log(d/m) · s
  (the “complexity” m is proportional to the “information content” s)
• Similar results for low-rank matrix recovery (Candès & Tao, Recht, Fazel, Recht & Parrilo)
• Phase transitions for successful recovery were observed and precisely located by Donoho & Tanner and Stojnic
Some experiments: ℓ1 minimization
Let x0 be s-sparse, b = Ax0 for random A ∈ R^{m×d} (s < m < d).
    minimize ‖x‖_1   subject to   Ax = b.
[Figure: empirical probability of success against the number of equations m, for s = 50, d = 200, with m = 25, 50, 75, 100, 125, 150, 175, 200.]
Some experiments: low-rank matrix recovery
Let X0 have rank r, b = AX0 for a random linear map A : R^{m×n} → R^p.
    minimize ‖X‖_{S_1}   subject to   AX = b.
[Figure: empirical probability of success against the number of equations p/nm, for r/n = 0.3, with p/nm = 0.125, 0.250, 0.375, 0.500, 0.625, 0.750, 0.875, 1.000.]
Some experiments: Buffon’s needle
Throw n needles of length ℓ uniformly on a surface ruled into strips of width d ≥ ℓ. What is the probability that at most m of them lie across adjacent strips?
[Figure: probability of at most m crossings against m, for m = 25, 50, …, 200; the threshold is located at n · 2/π (≈ 127.324 here).]
• This example is not completely gratuitous, but intimately related to the phase transitions for convex regularization!
• We will see that the success probability of convex minimization with random constraints is determined by a discrete probability distribution that concentrates around its mean.
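For intuition, here is a small Monte Carlo sketch of the needle experiment (an illustration only; it assumes n = 200 needles whose length equals the strip width, which is the setting consistent with the quoted threshold value 127.324).

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 200, 2000
length = width = 1.0                    # needle length equal to strip width

def crossings():
    theta = rng.uniform(0.0, np.pi, size=n)        # needle angles
    dist = rng.uniform(0.0, width / 2, size=n)     # centre-to-nearest-line distances
    return np.sum(dist <= (length / 2) * np.sin(theta))

counts = np.array([crossings() for _ in range(trials)])
print("mean crossings:", counts.mean(), "   prediction n*2/pi:", 2 * n / np.pi)
for m in (100, 125, 127, 130, 150):
    print(f"P(at most {m} crossings) = {(counts <= m).mean():.2f}")
```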
Phase transitions for linear inverse problems
Associate to a solution x0 of Ax = b and a convex problem
    minimize f(x)   subject to   Ax = b    (?)
a parameter δ(f, x0), the statistical dimension of f at x0.

Theorem [Amelunxen, L, McCoy & Tropp, 2013]
Let η ∈ (0, 1) and let x0 ∈ R^d be a fixed vector. Suppose A ∈ R^{m×d} has independent standard normal entries, and let b = Ax0. Then
    m ≥ δ(f, x0) + a_η √d  ⟹  (?) succeeds with probability ≥ 1 − η;
    m ≤ δ(f, x0) − a_η √d  ⟹  (?) succeeds with probability ≤ η,
where a_η := 4 √(log(4/η))  (a_{0.01} < 10 and a_{0.001} < 12).

• In a very precise sense, minimizing f in (?) has the effect of adding d − δ(f, x0) linear equations to the system Ax = b.
• The reason for this is that the descent cone of f at x0 behaves like a linear subspace in high dimension.
[Figure: empirical phase transition diagrams.]
Demixing
Reconstruct two signals x0, y0 from the observation
    z0 = x0 + U y0,
where U ∈ O(d) is some known orthogonal matrix that encodes the orientation of y0 relative to x0.
• both signals are sparse (→ morphological component analysis)
• x0 sparse (corruption), y0 ∈ {±1}^d (message) (→ robust communication protocol)
• x0 a low-rank matrix, y0 sparse (corruption) (→ latent variable selection in machine learning)
Phase transitions for convex demixing
Separate two signals from the observation z = x0 + U y0 by solving
    minimize f(x)   subject to   g(y) ≤ g(y0) and z = x + U y    (?)
for convex functions f and g designed to capture the structure of x0 and y0.

Theorem [Amelunxen, L, McCoy & Tropp 2013]
Let η ∈ (0, 1), let x0, y0 be fixed vectors in R^d, draw an orthogonal U ∈ R^{d×d} uniformly at random, and let z = x0 + U y0. Then
    δ(f, x0) + δ(g, y0) ≤ d − a_η √d  ⟹  (?) succeeds with probability ≥ 1 − η;
    δ(f, x0) + δ(g, y0) ≥ d + a_η √d  ⟹  (?) succeeds with probability ≤ η.
Phase transitions for convex demixing
    minimize ‖c‖_1   subject to   ‖m‖_∞ ≤ 1 and c + Qm = z0.
    minimize ‖X‖_*   subject to   ‖X‖_1 ≤ α and X + QY = Z0.
[Figure: probability of success for deconvolution problems. Left panel, “Demixing sparse & sparse”: nonzero proportion of y0 against nonzero proportion of x0. Right panel, “Demixing sparse & low-rank”: nonzero proportion of Y0 against normalized rank of X0. Curves: 95% success, 50% success, 5% success, and the theoretical prediction.]
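As an illustration of the first program above, the following sketch sets up the sparse-corruption/sign-vector demixing problem with cvxpy (an assumed tool, not the authors' code; the problem sizes are chosen only for illustration).

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(5)
d, k = 100, 10                                     # dimension and number of corruptions
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]   # random orthogonal matrix
m0 = rng.choice([-1.0, 1.0], size=d)               # message: a sign vector
c0 = np.zeros(d); c0[:k] = 5 * rng.standard_normal(k)   # sparse corruption
z0 = c0 + Q @ m0

c = cp.Variable(d)
m = cp.Variable(d)
prob = cp.Problem(cp.Minimize(cp.norm1(c)),
                  [cp.norm_inf(m) <= 1, c + Q @ m == z0])
prob.solve()
print("recovered the message exactly:", np.allclose(np.sign(m.value), m0))
```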
Outline
The phase transition phenomenon
Statistical dimension
Conic integral geometry
Concentration of measure
What else
When does convex relaxation work?
    minimize ‖x‖_1   subject to   Ax = b
[Figure: the affine space {Ax = b} through x0 and the ball {‖x‖_1 ≤ ‖x0‖_1}; “success” when they meet only at x0, “failure” otherwise; the relevant object is the descent cone at x0.]
Nullspace property
The convex relaxation method succeeds if and only if the kernel of A misses the cone of descent directions of ‖·‖_1 at x0.
From optimization to geometry
The problem
    minimize f(x)   subject to   Ax = b
has x0 as unique solution if and only if the optimality condition
    ker A ∩ D(f, x0) = {0}
is satisfied, where
    D(f, x0) := ⋃_{τ>0} { y ∈ R^d : f(x0 + τy) ≤ f(x0) } ≅ cone(∂f(x0))°
is the convex descent cone of f at x0.
From optimization to geometry
Examples of descent cones
    D(f, x0) := ⋃_{τ>0} { y ∈ R^d : f(x0 + τy) ≤ f(x0) } ≅ cone(∂f(x0))°
or their polars include:
• x0 s-sparse: ∂‖x0‖_1 ≅ a (d − s)-face of the unit cube centred at the origin.
• x0 on an s-face of the hypercube: D(‖·‖_{ℓ∞}, x0) ≅ R^{d−s}_{≤0} × R^s.
• X0 a rank-r matrix: if X0 is diagonal,
    ∂‖X0‖_{S_1} = { [ 1_r  0 ; 0  W0 ] : ‖W0‖_2 ≤ 1 }   (Watson 1992).
The statistical dimension
Definition
The statistical dimension of a convex cone C is defined as
    δ(C) := E ‖Π_C(g)‖²,
where g ∼ Normal(0, I) is a Gaussian vector and Π_C(·) denotes the Euclidean projection onto C.
For a proper convex function f and a point x0 we define the statistical dimension as that of the descent cone of f at x0:
    δ(f, x0) := δ(D(f, x0)).
• Direct generalisation of the dimension of a linear space.
• Closely related to the squared Gaussian width.
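A minimal Monte Carlo sketch of this definition (assuming numpy), for the nonnegative orthant, whose projection is simply the entrywise positive part; the exact value d/2 appears on the following slides.

```python
import numpy as np

rng = np.random.default_rng(2)
d, samples = 100, 20000
g = rng.standard_normal((samples, d))
proj = np.maximum(g, 0.0)                        # Euclidean projection onto R^d_{>=0}
delta_hat = np.mean(np.sum(proj ** 2, axis=1))   # estimate of E ||Pi_C(g)||^2
print("estimated delta:", delta_hat, "   exact value d/2 =", d / 2)
```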
Basic properties
• Spherical formulation. δ(C) = d · E ‖Π_C(θ)‖², where θ ∼ Uniform(S^{d−1}).
• Rotational invariance. δ(UC) = δ(C) for each U ∈ O_d.
• Subspaces. For a subspace L ⊂ R^d, δ(L) = dim(L).
• Totality. δ(C) + δ(C°) = d. This generalises dim(L) + dim(L^⊥) = d for linear L.
  [Figure: a planar cone C and its polar C°.]
• Direct products. For each closed convex cone K, δ(C × K) = δ(C) + δ(K). In particular, invariance under embedding.
• Monotonicity. C ⊂ K implies that δ(C) ≤ δ(K).
Examples
• Linear subspaces. δ(L) = dim L.
• Non-negative orthant. δ(R^d_{≥0}) = d/2.
• Self-dual cones. We have δ(C) + δ(C°) = d, so that δ(C) = d/2 for any self-dual cone (for example, the positive semidefinite matrices).
• Second-order (ice cream) cones of angle α:
    Circ(d, α) := { x ∈ R^d : x_1/‖x‖ ≥ cos(α) }.
  Then δ(Circ(d, α)) ≈ d sin²(α).
• The cone C_A = {x : x_1 ≤ · · · ≤ x_d}:
    δ(C_A) = Σ_{k=1}^d 1/k ∼ log(d).
Approximate kinematic formula
Theorem [ALMT13]
Fix a tolerance η ∈ (0, 1). Suppose that C, K ⊂ R^d are closed convex cones, one of which is not a subspace. Draw an orthogonal matrix Q ∈ R^{d×d} uniformly at random. Then
    δ(C) + δ(K) ≤ d − a_η √d  ⟹  P{C ∩ QK = {0}} ≥ 1 − η;
    δ(C) + δ(K) ≥ d + a_η √d  ⟹  P{C ∩ QK = {0}} ≤ η,
where a_η := 4 √(log(4/η))  (a_{0.01} < 10 and a_{0.001} < 12).
• Applying this with C = D(f, x0) and K = ker A, we get the phase transitions for convex regularization.
Approximate kinematic formula
In high dimensions, convex cones have intersection behaviour like linear subspaces.
Linear subspaces (almost surely):
    L1 ∩ L2 = {0} iff dim(L1) + dim(L2) ≤ d;   L1 ∩ L2 ≠ {0} iff dim(L1) + dim(L2) > d.
Convex cones (with overwhelming probability):
    C ∩ QD = {0} iff δ(C) + δ(D) ≲ d;   C ∩ QD ≠ {0} iff δ(C) + δ(D) ≳ d.
Computing the statistical dimension
For many cases, the statistical dimension δ(f, x0) of a convex function f at x0 can be determined exactly or asymptotically.
• x0 s-sparse, f = ‖·‖_1: an asymptotic formula for δ(‖·‖_1, x0) follows from Stojnic 2009.
• X0 a rank-r matrix, f = ‖·‖_{S_1}: an asymptotic formula based on the Marčenko–Pastur characterisation of the empirical eigenvalue distribution of Wishart matrices.
• x0 on an s-face of the hypercube, f = ‖·‖_{ℓ∞}: δ(f, x0) = (d + s)/2.
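For the first item, the standard expression from the literature (quoted here as an assumption rather than derived; see Stojnic 2009 and the ALMT paper) can be evaluated numerically with a one-dimensional minimization over an auxiliary parameter τ:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def delta_l1(s, d):
    """Statistical dimension of the l1 descent cone at an s-sparse x0 in R^d."""
    rho = s / d
    def psi(tau):
        # rho*(1 + tau^2) + (1 - rho)*E[((|g| - tau)_+)^2] for g ~ N(0, 1)
        tail = 2 * ((1 + tau ** 2) * norm.cdf(-tau) - tau * norm.pdf(tau))
        return rho * (1 + tau ** 2) + (1 - rho) * tail
    res = minimize_scalar(psi, bounds=(0.0, 10.0), method="bounded")
    return d * res.fun

d, s = 200, 50
print("delta ≈", round(delta_l1(s, d), 1), "out of d =", d)   # predicted location of the transition in m
```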
Theory and experiment
[Figure: phase transitions in linear inverse problems. Empirical success % against the number of generic measurements for the sparse, sign, and low-rank examples, together with the theoretical prediction.]
Relation to convex denoising
Let z = x0 + w with w ∼ Normal(0, σ²I), and
    x̂ = arg min_x  ½ ‖x − z‖²_2 + µ f(x).
If, for example, f = ‖·‖_1, then this is soft thresholding. The associated minimax MSE risk is given by
    R(f, x0) = sup_{σ>0} inf_{µ>0} (1/σ²) E[ ‖x̂ − x0‖²_2 ].

Lemma ([ALMT13] + [Oymak & Hassibi 2013])
    δ(f, x0) − R(f, x0) = O(√(δ(f, x0))).

• The relation of the minimax MSE risk of ℓ1 and S1 denoising to phase transitions for recovery has been conjectured and observed empirically by Donoho & Montanari.
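A minimal sketch of the f = ‖·‖_1 case mentioned above (illustration only, assuming numpy): the minimizer x̂ is computed by entrywise soft thresholding.

```python
import numpy as np

def soft_threshold(z, mu):
    """argmin_x 0.5*||x - z||_2^2 + mu*||x||_1, computed entrywise."""
    return np.sign(z) * np.maximum(np.abs(z) - mu, 0.0)

rng = np.random.default_rng(3)
d, s, sigma, mu = 200, 10, 0.1, 0.3
x0 = np.zeros(d); x0[:s] = 1.0
z = x0 + sigma * rng.standard_normal(d)
x_hat = soft_threshold(z, mu)
print("squared error / sigma^2 :", np.sum((x_hat - x0) ** 2) / sigma ** 2)
```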
Outline
The phase transition phenomenon
Statistical dimension
Conic integral geometry
Concentration of measure
What else
The kinematic formula
Let Q be a random orthogonal transformation. The probability that a randomly rotated cone intersects another can be expressed in terms of a discrete probability distribution, the spherical intrinsic volumes v_0(C), . . . , v_d(C):
    P{C ∩ QD ≠ {0}} = 2 Σ_{k odd} Σ_{i+j=d+k} v_i(C) v_j(D).
For the case where D = L is a linear subspace of codimension m, v_i(L) = 1 if i = d − m and v_i(L) = 0 otherwise, so
    P{C ∩ QL ≠ {0}} = 2 Σ_{k odd} v_{m+k}(C).    (?)
The expression (?) is essentially the tail of a discrete probability distribution.
Spherical intrinsic volumes
[Figure: a planar cone C with regions labelled v_0(C), v_1(C), v_2(C).]
Let C ⊆ R^d be a polyhedral cone and F_k(C) the set of its k-dimensional faces. The k-th (spherical) intrinsic volume of C is defined as
    v_k(C) = Σ_{F ∈ F_k(C)} P{Π_C(g) ∈ relint(F)}.
• Clearly, the v_k(C) describe a discrete probability distribution.
Spherical intrinsic volumes: examples
• Linear subspace L: v_k(L) = 1 if dim L = k, and v_k(L) = 0 otherwise.
• Orthant R^d_{≥0}:
    v_k(R^d_{≥0}) = (d choose k) · 2^{−d}.
• For the second-order cones we have
    v_k(Circ(d, α)) = ½ · ((d−2)/2 choose (k−1)/2) · sin^{k−1}(α) cos^{d−k−1}(α).
• Asymptotics for intrinsic volumes of descent cones at faces of the simplex and the ℓ1-ball were computed by Vershik & Sporyshev and Donoho & Tanner (via polytope angles).
• Integral representations for the semidefinite cone were derived by Amelunxen & Bürgisser.
Outline
The phase transition phenomenon
Statistical dimension
Conic integral geometry
Concentration of measure
What else
Concentration of measure
Theorem [ALMT13]
Let C be a convex cone, and let X_C be a discrete random variable with distribution P{X_C = k} = v_k(C). Then the statistical dimension δ(C) is the expected value of X_C:
    δ(C) = E ‖Π_C(g)‖² = E[X_C].
Moreover, the intrinsic volumes satisfy
    P{|X_C − δ(C)| > λ} ≤ exp( −(λ²/8) / (ω(C) + λ) )   for λ ≥ 0,
where ω(C) := min{δ(C), d − δ(C)}.
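A small numerical illustration (a sketch using scipy, not part of the talk): for the orthant, the intrinsic volumes are the binomial weights listed on the examples slide above, so X_C is Binomial(d, 1/2) with mean δ(C) = d/2; its exact two-sided tail can be compared with the bound in the theorem.

```python
import numpy as np
from scipy.stats import binom

d = 200
delta = d / 2
omega = min(delta, d - delta)
for lam in (10, 20, 40):
    # exact tail of Binomial(d, 1/2) around its mean, versus the theorem's bound
    exact = binom.sf(delta + lam, d, 0.5) + binom.cdf(delta - lam - 1, d, 0.5)
    bound = np.exp(-(lam ** 2 / 8) / (omega + lam))
    print(f"lambda = {lam:3d}   P(|X_C - delta| > lambda) = {exact:.2e}   bound = {bound:.2e}")
```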
The spherical Steiner formula
Recall the statistical dimension δ(C) = E ‖Π_C(g)‖².
The measure of the set of points on the sphere within angle arccos(√ε) of the cone C is given by:
Spherical Steiner formula (Herglotz, Allendoerfer, Santaló)
    P{‖Π_C(θ)‖² ≥ ε} = Σ_{k=1}^d P{‖Π_{L_k}(θ)‖² ≥ ε} v_k(C)
• L_k : a k-dimensional subspace
• θ : uniform on S^{d−1}.
The spherical Steiner formula
Volume of neighbourhood of subspheres
    P{‖Π_C(θ)‖² ≥ ε} = Σ_{k=1}^d P{‖Π_{L_k}(θ)‖² ≥ ε} v_k(C),
where ‖Π_{L_k}(θ)‖² is Beta distributed.
• The volume of the arccos(√ε)-neighbourhood of a k-dimensional subsphere satisfies
    P{‖Π_{L_k}(θ)‖² ≥ ε} ≈ 0 if ε > k/d,   ≈ 1 if ε < k/d.
The spherical Steiner formula
Volume of neighbourhood of subspheres
    P{‖Π_C(θ)‖² ≥ ε} ≈ Σ_{k=⌈εd⌉}^d v_k(C).
• The volume of the arccos(√ε)-neighbourhood of a k-dimensional subsphere satisfies
    P{‖Π_{L_k}(θ)‖² ≥ ε} ≈ 0 if ε > k/d,   ≈ 1 if ε < k/d.
The spherical Steiner formula
Measure concentration
    P{‖Π_C(θ)‖² ≥ ε} ≈ Σ_{k=⌈εd⌉}^d v_k(C) ≈ 0 if ε > δ(C)/d,   ≈ 1 if ε < δ(C)/d.
This follows from concentration of measure, since the squared projection is Lipschitz and concentrates near its expected value δ(C).
Concentration of measure
[Figure: the spheres S², S¹⁰, and S¹⁰⁰ — height above the equator of the area occupying 90 per cent of the measure. (From Matoušek, Lectures on Discrete Geometry.)]
The spherical Steiner formula
Let X_C be a random variable with distribution given by the spherical intrinsic volumes,
    P{X_C = k} = v_k(C).
By the spherical Steiner formula we have
    P{X_C ≥ εd} ≈ Σ_{k=⌈εd⌉}^d v_k(C) ≈ 0 if ε > δ(C)/d,   ≈ 1 if ε < δ(C)/d.
• Rigorous implementation based on log-Sobolev inequalities.
Summary
• Problems of simple recovery or demixing by convex optimization are equivalent to the problem of a cone intersecting a subspace or another cone.
• In high dimensions, the intersection behaviour of randomly oriented closed convex cones is determined by the statistical dimension.
• The reason: intersection probabilities are determined precisely by the kinematic formula in terms of intrinsic volumes, and...
• ...intrinsic volumes concentrate around the average dimension of the cone, which coincides with the statistical dimension.
• There are simple recipes for computing δ(C)/d asymptotically, and in some cases even exactly.
Outline
The phase transition phenomenon
Statistical dimension
Conic integral geometry
Concentration of measure
What else
A curious example
Ultra slim cones (chambers of finite reflection groups)
    δ(C_A) = Σ_{k=1}^d 1/k,        C_A := {x_1 ≤ . . . ≤ x_d};
    δ(C_BC) = ½ Σ_{k=1}^d 1/k,     C_BC := {0 ≤ x_1 ≤ . . . ≤ x_d}.
Note that Σ_{k=1}^d 1/k ≈ log(d).
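The value δ(C_A) = Σ_{k=1}^d 1/k can be checked by simulation: the Euclidean projection onto C_A is isotonic regression. A Monte Carlo sketch (not the authors' code; it assumes scikit-learn is available for the isotonic fit):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(4)
d, samples = 50, 2000
positions = np.arange(d)
iso = IsotonicRegression(increasing=True)

sq_norms = []
for _ in range(samples):
    g = rng.standard_normal(d)
    proj = iso.fit_transform(positions, g)      # projection of g onto {x1 <= ... <= xd}
    sq_norms.append(np.sum(proj ** 2))

print("Monte Carlo estimate of delta(C_A):", np.mean(sq_norms))
print("harmonic number H_d               :", np.sum(1.0 / np.arange(1, d + 1)))
```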
A curious example
CA = {x1 ≤ . . . ≤ xd } ,
CBC = {0 ≤ x1 ≤ . . . ≤ xd }
These cones appear as certain normal cones: (CBC for d = 2)
A curious example
    C_A = {x_1 ≤ . . . ≤ x_d},    C_BC = {0 ≤ x_1 ≤ . . . ≤ x_d}
The logarithmic statistical dimension implies that “recovering vectors from lists” by the convex relaxation method is disappointingly bad.
[Figure: probability of finding a vector from a list — empirical success % against the number of measurements (85 to 100).]
Change of representation
    setting              signal        sparsity         measurements
    synthesis sparsity   x0 = Dα0      f(α0) small      ADα = b
    analysis sparsity    x0            f(Ωx0) small     Ax = b
→ One needs to understand the statistical dimension of linear images of cones.
Let C ∈ C_d, T ∈ GL_d.
• TQC-Lemma: for Q ∈ O(d) uniformly at random,
    E δ(TQC) = δ(C).
• Condition-based estimate: with κ(T) the condition number of T,
    δ(TC)/δ(C) ∈ [κ(T)^{−2}, κ(T)^{2}]   (Amelunxen).
Recovery with noise
Let b = Ax0 + e with noise ‖e‖ ≤ ε. Any x that solves
    minimize f(x)   subject to   ‖Ax − b‖ ≤ ε
satisfies
    ‖x − x0‖ ≤ 2ε · σ_C(A)^{−1},
where
    σ_C(A) = min_{x ∈ C} ‖Ax‖ / ‖x‖
and C = D(f, x0) is the cone of descent of f at x0.
Recovery with noise
    Recovery:               σ_C(A) = min_{x ∈ C} ‖Ax‖/‖x‖ > 0
    Recovery with noise:    σ_C(A) = min_{x ∈ C} ‖Ax‖/‖x‖ > t > 0
• Using a relation to condition numbers, one can study this and derive the probability of such events in terms of Grassmann tube formulae (work in progress).
Some problems
• Spherical Hadwiger conjecture: each continuous, rotation-invariant valuation on closed convex cones is a linear combination of spherical intrinsic volumes.
• Are the spherical intrinsic volumes log-concave?
    v_k(C)² ≥ v_{k−1}(C) · v_{k+1}(C)
• Is the variance of X_C maximised by the Lorentz cone Circ_{π/4}?
• Universality: what about other distributions (experiments by Donoho & Tanner, partial results by Montanari et al.)?
• Lower bounds on the number of measurements required for convex tensor recovery [Mu, Huang, Wright & Goldfarb 2013].
• Use the developed technology on phase retrieval problems.
For more details:
D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp.
Living on the edge: a geometric theory of phase transitions in convex optimization.
arXiv:1303.6672
Thank You!