Appendix
A
Proof of Theorem 3
Proof. Taking the expectation over the choice of edges $(i_k, j_k)$ gives the following inequality

$$
\begin{aligned}
\mathbb{E}_{i_k j_k}[f(x^{k+1}) \mid \eta_k]
&\le \mathbb{E}_{i_k j_k}\Big[ f(x^k) - \frac{1}{4L}\big\|\nabla_{y_{i_k}} f(x^k) - \nabla_{y_{j_k}} f(x^k)\big\|^2 - \frac{1}{2L}\big\|\nabla_{z_{i_k}} f(x^k)\big\|^2 - \frac{1}{2L}\big\|\nabla_{z_{j_k}} f(x^k)\big\|^2 \Big] \\
&\le f(x^k) - \frac{1}{2}\,\nabla_y f(x^k)^\top (L \otimes I_{n_y})\,\nabla_y f(x^k) - \frac{1}{2}\,\nabla_z f(x^k)^\top (D \otimes I_{n_z})\,\nabla_z f(x^k) \\
&\le f(x^k) - \frac{1}{2}\,\nabla f(x^k)^\top K\,\nabla f(x^k), \qquad (9)
\end{aligned}
$$
where $\otimes$ denotes the Kronecker product. This shows that the method is a descent method. Now we are ready to prove the
main convergence theorem. We have the following:
$$
f(x^k) - f^* \le \big\langle \nabla f(x^k),\, x^k - x^* \big\rangle \le \|x^k - x^*\|_K^*\,\|\nabla f(x^k)\|_K \le R(x^0)\,\|\nabla f(x^k)\|_K \qquad \forall k.
$$

Combining this with inequality (9), we obtain

$$
\mathbb{E}[f(x^{k+1}) \mid \eta_k] \le f(x^k) - \frac{(f(x^k) - f^*)^2}{2R^2(x^0)}.
$$

Taking the expectation of both sides and denoting $\delta_k = \mathbb{E}[f(x^k)] - f^*$ gives

$$
\delta_{k+1} \le \delta_k - \frac{\delta_k^2}{2R^2(x^0)}.
$$

Dividing both sides by $\delta_k \delta_{k+1}$ and using the fact that $\delta_{k+1} \le \delta_k$, we obtain

$$
\frac{1}{\delta_k} \le \frac{1}{\delta_{k+1}} - \frac{1}{2R^2(x^0)}.
$$

Adding these inequalities for $k$ steps, we obtain

$$
\frac{1}{\delta_k} \ge \frac{1}{\delta_0} + \frac{k}{2R^2(x^0)} \ge \frac{k}{2R^2(x^0)},
$$

from which we obtain the statement of the theorem where $C = 2R^2(x^0)$.
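To get a feel for the rate this recursion implies, the short sketch below (with arbitrary illustrative values standing in for $2R^2(x^0)$ and $\delta_0$; it is not part of the paper) iterates the worst case $\delta_{k+1} = \delta_k - \delta_k^2/C$ and compares it with the $C/k$ behaviour claimed by the theorem.

```python
# Illustrative only: iterate the worst case of the recursion
# delta_{k+1} <= delta_k - delta_k^2 / C with C = 2 R^2(x^0),
# and compare it with the C / k rate.
C = 10.0      # stands in for 2 R^2(x^0); value chosen purely for illustration
delta = 5.0   # stands in for delta_0 = E[f(x^0)] - f*

for k in range(1, 101):
    delta = delta - delta ** 2 / C
    if k % 25 == 0:
        print(f"k={k:3d}  delta_k={delta:.4f}  C/k={C / k:.4f}")
```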
B
Proof of Theorem 5
Proof. In this case, the expectation should be over the selection of the pair $(i_k, j_k)$ and random index $l_k \in [N]$. In this proof, the definition of $\eta_k$ includes $l_k$, i.e., $\eta_k = \{(i_0, j_0, l_0), \ldots, (i_{k-1}, j_{k-1}, l_{k-1})\}$. We define the following:

$$
d_{i_k}^k = \Big[ \tfrac{\alpha_k}{2L}\big(\nabla_{y_{j_k}} f_{l_k}(x^k) - \nabla_{y_{i_k}} f_{l_k}(x^k)\big)^\top,\; -\tfrac{\alpha_k}{L}\,\nabla_{z_{i_k}} f_{l_k}(x^k)^\top \Big]^\top,
$$

$$
d_{j_k}^k = \Big[ -\tfrac{\alpha_k}{2L}\big(\nabla_{y_{j_k}} f_{l_k}(x^k) - \nabla_{y_{i_k}} f_{l_k}(x^k)\big)^\top,\; -\tfrac{\alpha_k}{L}\,\nabla_{z_{j_k}} f_{l_k}(x^k)^\top \Big]^\top,
$$

$$
d_{i_k j_k}^{l_k} = U_{i_k} d_{i_k}^k + U_{j_k} d_{j_k}^k.
$$
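To make the role of these directions concrete, here is a minimal sketch of one such stochastic update under several assumptions of ours (the block sizes, the placeholder gradient arrays, and the constants are invented for illustration; they are not the paper's implementation):

```python
import numpy as np

# Illustrative sketch of forming d_{i_k}, d_{j_k} for one stochastic update.
# Shapes, constants, and the placeholder gradients are assumptions, not the paper's code.
rng = np.random.default_rng(0)
b, ny, nz = 4, 3, 2                     # number of blocks and per-block dimensions (hypothetical)
L, alpha_k = 1.0, 0.5

y = rng.standard_normal((b, ny))
z = rng.standard_normal((b, nz))
grad_y = rng.standard_normal((b, ny))   # stands for nabla_{y_i} f_{l_k}(x^k), i = 1..b
grad_z = rng.standard_normal((b, nz))   # stands for nabla_{z_i} f_{l_k}(x^k), i = 1..b

i_k, j_k = rng.choice(b, size=2, replace=False)   # sampled edge (i_k, j_k)

# y-parts of d_{i_k} and d_{j_k}: equal magnitude, opposite sign.
dy = (alpha_k / (2 * L)) * (grad_y[j_k] - grad_y[i_k])
y[i_k] += dy
y[j_k] -= dy
# z-parts: gradient steps on the unconstrained blocks.
z[i_k] -= (alpha_k / L) * grad_z[i_k]
z[j_k] -= (alpha_k / L) * grad_z[j_k]
```

The two $y$-parts carry opposite signs, so the coupled sum of the $y$-blocks is unchanged by the update, while the $z$-blocks take plain stochastic gradient steps.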
For the expectation of objective value at $x^{k+1}$, we have

$$
\begin{aligned}
\mathbb{E}[f(x^{k+1}) \mid \eta_k]
&\le \mathbb{E}_{i_k j_k}\,\mathbb{E}_{l_k}\Big[ f(x^k) + \big\langle \nabla f(x^k),\, d_{i_k j_k}^{l_k} \big\rangle + \tfrac{L}{2}\|d_{i_k j_k}^{l_k}\|^2 \Big] \\
&\le \mathbb{E}_{i_k j_k}\Big[ f(x^k) + \big\langle \nabla f(x^k),\, \mathbb{E}_{l_k}[d_{i_k j_k}^{l_k}] \big\rangle + \tfrac{L}{2}\,\mathbb{E}_{l_k}[\|d_{i_k j_k}^{l_k}\|^2] \Big] \\
&\le \mathbb{E}_{i_k j_k}\Big[ f(x^k) + \tfrac{\alpha_k}{2L}\big\langle \nabla_{y_{i_k}} f(x^k),\, \mathbb{E}_{l_k}[\nabla_{y_{j_k}} f_{l_k}(x^k) - \nabla_{y_{i_k}} f_{l_k}(x^k)] \big\rangle \\
&\qquad + \tfrac{\alpha_k}{2L}\big\langle \nabla_{y_{j_k}} f(x^k),\, \mathbb{E}_{l_k}[\nabla_{y_{i_k}} f_{l_k}(x^k) - \nabla_{y_{j_k}} f_{l_k}(x^k)] \big\rangle \\
&\qquad - \tfrac{\alpha_k}{L}\big\langle \nabla_{z_{i_k}} f(x^k),\, \mathbb{E}_{l_k}[\nabla_{z_{i_k}} f_{l_k}(x^k)] \big\rangle - \tfrac{\alpha_k}{L}\big\langle \nabla_{z_{j_k}} f(x^k),\, \mathbb{E}_{l_k}[\nabla_{z_{j_k}} f_{l_k}(x^k)] \big\rangle + \tfrac{L}{2}\,\mathbb{E}_{l_k}[\|d_{i_k j_k}^{l_k}\|^2] \Big].
\end{aligned}
$$
Taking expectation over $l_k$, we get the following relationship:

$$
\begin{aligned}
\mathbb{E}[f(x^{k+1}) \mid \eta_k]
&\le \mathbb{E}_{i_k j_k}\Big[ f(x^k) + \tfrac{\alpha_k}{2L}\big\langle \nabla_{y_{i_k}} f(x^k),\, \nabla_{y_{j_k}} f(x^k) - \nabla_{y_{i_k}} f(x^k) \big\rangle \\
&\qquad + \tfrac{\alpha_k}{2L}\big\langle \nabla_{y_{j_k}} f(x^k),\, \nabla_{y_{i_k}} f(x^k) - \nabla_{y_{j_k}} f(x^k) \big\rangle \\
&\qquad - \tfrac{\alpha_k}{L}\big\langle \nabla_{z_{i_k}} f(x^k),\, \nabla_{z_{i_k}} f(x^k) \big\rangle - \tfrac{\alpha_k}{L}\big\langle \nabla_{z_{j_k}} f(x^k),\, \nabla_{z_{j_k}} f(x^k) \big\rangle + \tfrac{L}{2}\,\mathbb{E}_{l_k}[\|d_{i_k j_k}^{l_k}\|^2] \Big].
\end{aligned}
$$

We first note that $\mathbb{E}_{l_k}[\|d_{i_k j_k}^{l_k}\|^2] \le 8M^2\alpha_k^2/L^2$ since $\|\nabla f_l\| \le M$. Substituting this in the above inequality and simplifying, we get

$$
\begin{aligned}
\mathbb{E}[f(x^{k+1}) \mid \eta_k]
&\le f(x^k) - \alpha_k\,\nabla_y f(x^k)^\top (L \otimes I_n)\,\nabla_y f(x^k) - \alpha_k\,\nabla_z f(x^k)^\top (D \otimes I_n)\,\nabla_z f(x^k) + \frac{4M^2\alpha_k^2}{L} \\
&\le f(x^k) - \alpha_k\,\nabla f(x^k)^\top K\,\nabla f(x^k) + \frac{4M^2\alpha_k^2}{L}. \qquad (10)
\end{aligned}
$$
Similar to Theorem 3, we obtain a lower bound on $\nabla f(x^k)^\top K \nabla f(x^k)$ in the following manner:

$$
f(x^k) - f^* \le \big\langle \nabla f(x^k),\, x^k - x^* \big\rangle \le \|x^k - x^*\|_K^*\,\|\nabla f(x^k)\|_K \le R(x^0)\,\|\nabla f(x^k)\|_K.
$$
Combining this with inequality (10), we obtain

$$
\mathbb{E}[f(x^{k+1}) \mid \eta_k] \le f(x^k) - \frac{\alpha_k\,(f(x^k) - f^*)^2}{R^2(x^0)} + \frac{4M^2\alpha_k^2}{L}.
$$

Taking the expectation of both sides and denoting $\delta_k = \mathbb{E}[f(x^k)] - f^*$ gives

$$
\delta_{k+1} \le \delta_k - \frac{\alpha_k\,\delta_k^2}{R^2(x^0)} + \frac{4M^2\alpha_k^2}{L}.
$$
Adding these inequalities from $i = 0$ to $i = k$ and using telescoping, we get

$$
\delta_{k+1} + \sum_{i=0}^{k} \frac{\alpha_i\,\delta_i^2}{R^2(x^0)} \le \delta_0 + \frac{4M^2}{L}\sum_{i=0}^{k}\alpha_i^2.
$$
Using the definition of $\bar{x}^{k+1} = \arg\min_{0 \le i \le k+1} f(x^i)$, we get

$$
\sum_{i=0}^{k} \frac{\alpha_i\,\big(\mathbb{E}[f(\bar{x}^{k+1}) - f^*]\big)^2}{R^2(x^0)} \le \delta_{k+1} + \sum_{i=0}^{k} \frac{\alpha_i\,\delta_i^2}{R^2(x^0)} \le \delta_0 + \frac{4M^2}{L}\sum_{i=0}^{k}\alpha_i^2.
$$
Therefore, from the above inequality we have

$$
\mathbb{E}[f(\bar{x}^{k+1}) - f^*] \le R(x^0)\,\sqrt{\frac{\delta_0 + 4M^2\sum_{i=0}^{k}\alpha_i^2/L}{\sum_{i=0}^{k}\alpha_i}}.
$$
Note that $\mathbb{E}[f(\bar{x}^{k+1}) - f^*] \to 0$ if we choose step sizes satisfying the condition that $\sum_{i=0}^{\infty}\alpha_i = \infty$ and $\sum_{i=0}^{\infty}\alpha_i^2 < \infty$. Substituting $\alpha_i = \sqrt{\delta_0 L}/(2M\sqrt{i+1})$, we get the required result using the reasoning from [24] (we refer the reader to Section 2.2 of [24] for more details).
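As a quick numerical illustration of this last step (not from the paper; the constants $\delta_0$, $L$, $M$, $R(x^0)$ below are arbitrary placeholders), the following sketch evaluates the bound above under the step-size schedule $\alpha_i = \sqrt{\delta_0 L}/(2M\sqrt{i+1})$ and shows it shrinking as $k$ grows:

```python
import math

# Placeholder constants, chosen only to illustrate how the bound behaves.
delta0, L, M, R = 1.0, 1.0, 1.0, 1.0

def alpha(i):
    # step size sqrt(delta0 * L) / (2 M sqrt(i + 1)) used at the end of the proof
    return math.sqrt(delta0 * L) / (2 * M * math.sqrt(i + 1))

for k in (10, 100, 1000, 10000):
    s1 = sum(alpha(i) for i in range(k + 1))       # sum of alpha_i, grows like sqrt(k)
    s2 = sum(alpha(i) ** 2 for i in range(k + 1))  # sum of alpha_i^2, grows only logarithmically
    bound = R * math.sqrt((delta0 + 4 * M ** 2 * s2 / L) / s1)
    print(f"k={k:6d}  bound={bound:.4f}")
```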
C
Proof of Theorem 4
Proof. For ease of exposition, we analyze the case where the unconstrained variables $z$ are absent. The analysis of the case with $z$ variables can be carried out in a similar manner. Consider the update on edge $(i_k, j_k)$. Recall that $D(k)$ denotes the index of the iterate used in the $k$-th iteration for calculating the gradients. Let

$$
d^k = \frac{\alpha_k}{2L}\big(\nabla_{y_{j_k}} f(x^{D(k)}) - \nabla_{y_{i_k}} f(x^{D(k)})\big)
$$

and $d_{i_k j_k}^k = x^{k+1} - x^k = U_{i_k} d^k - U_{j_k} d^k$. Note that $\|d_{i_k j_k}^k\|^2 = 2\|d^k\|^2$. Since $f$ has a Lipschitz continuous gradient, we have
$$
\begin{aligned}
f(x^{k+1}) &\le f(x^k) + \big\langle \nabla_{y_{i_k} y_{j_k}} f(x^k),\, d_{i_k j_k}^k \big\rangle + \tfrac{L}{2}\|d_{i_k j_k}^k\|^2 \\
&\le f(x^k) + \big\langle \nabla_{y_{i_k} y_{j_k}} f(x^{D(k)}) + \nabla_{y_{i_k} y_{j_k}} f(x^k) - \nabla_{y_{i_k} y_{j_k}} f(x^{D(k)}),\, d_{i_k j_k}^k \big\rangle + \tfrac{L}{2}\|d_{i_k j_k}^k\|^2 \\
&\le f(x^k) - \tfrac{L}{\alpha_k}\|d_{i_k j_k}^k\|^2 + \big\langle \nabla_{y_{i_k} y_{j_k}} f(x^k) - \nabla_{y_{i_k} y_{j_k}} f(x^{D(k)}),\, d_{i_k j_k}^k \big\rangle + \tfrac{L}{2}\|d_{i_k j_k}^k\|^2 \\
&\le f(x^k) - L\Big(\tfrac{1}{\alpha_k} - \tfrac{1}{2}\Big)\|d_{i_k j_k}^k\|^2 + \big\|\nabla_{y_{i_k} y_{j_k}} f(x^k) - \nabla_{y_{i_k} y_{j_k}} f(x^{D(k)})\big\|\,\|d_{i_k j_k}^k\| \\
&\le f(x^k) - L\Big(\tfrac{1}{\alpha_k} - \tfrac{1}{2}\Big)\|d_{i_k j_k}^k\|^2 + L\,\|x^k - x^{D(k)}\|\,\|d_{i_k j_k}^k\|.
\end{aligned}
$$
The third and fourth steps in the above derivation follow from the definition of $d^k_{ij}$ and the Cauchy-Schwarz inequality, respectively. The last step follows from the fact that the gradients are Lipschitz continuous. Using the assumption that staleness in the variables is bounded by $\tau$, i.e., $k - D(k) \le \tau$, and the definition of $d^k_{ij}$, we have
$$
\begin{aligned}
f(x^{k+1}) &\le f(x^k) - L\Big(\tfrac{1}{\alpha_k} - \tfrac{1}{2}\Big)\|d_{i_k j_k}^k\|^2 + L\Big(\sum_{t=1}^{\tau}\|d_{i_{k-t} j_{k-t}}^{k-t}\|\Big)\,\|d_{i_k j_k}^k\| \\
&\le f(x^k) - L\Big(\tfrac{1}{\alpha_k} - \tfrac{1}{2}\Big)\|d_{i_k j_k}^k\|^2 + \frac{L}{2}\sum_{t=1}^{\tau}\Big[\|d_{i_{k-t} j_{k-t}}^{k-t}\|^2 + \|d_{i_k j_k}^k\|^2\Big] \\
&\le f(x^k) - L\Big(\tfrac{1}{\alpha_k} - \tfrac{1+\tau}{2}\Big)\|d_{i_k j_k}^k\|^2 + \frac{L}{2}\sum_{t=1}^{\tau}\|d_{i_{k-t} j_{k-t}}^{k-t}\|^2.
\end{aligned}
$$
The first step follows from the triangle inequality. The second inequality follows from the fact that $ab \le (a^2 + b^2)/2$. Using expectation over the edges, we have
$$
\mathbb{E}[f(x^{k+1})] \le \mathbb{E}[f(x^k)] - L\Big(\frac{1}{\alpha_k} - \frac{1+\tau}{2}\Big)\,\mathbb{E}[\|d^k_{i_k j_k}\|^2] + \frac{L}{2}\,\mathbb{E}\Big[\sum_{t=1}^{\tau}\|d^{k-t}_{i_{k-t} j_{k-t}}\|^2\Big]. \qquad (11)
$$

We now prove that, for all $k \ge 0$,

$$
\mathbb{E}\big[\|d^{k-1}_{i_{k-1} j_{k-1}}\|^2\big] \le \rho\,\mathbb{E}\big[\|d^k_{i_k j_k}\|^2\big], \qquad (12)
$$

where we define $\mathbb{E}\big[\|d^{k-1}_{i_{k-1} j_{k-1}}\|^2\big] = 0$ for $k = 0$. Let $w^t$ denote the vector of size $|E|$ such that $w^t_{ij} = \sqrt{p_{ij}}\,\|d^t_{ij}\|$ (with slight abuse of notation, we use $w^t_{ij}$ to denote the entry corresponding to edge $(i, j)$). Note that $\mathbb{E}[\|d^t_{i_t j_t}\|^2] = \mathbb{E}[\|w^t\|^2]$. We prove Equation (12) by induction.

Let $u^k$ be a vector of size $|E|$ such that $u^k_{ij} = \sqrt{p_{ij}}\,\|d^k_{ij} - d^{k-1}_{ij}\|$. Consider the following:
$$
\begin{aligned}
\mathbb{E}[\|w^{k-1}\|^2] - \mathbb{E}[\|w^k\|^2]
&= 2\,\mathbb{E}[\|w^{k-1}\|^2] - \mathbb{E}[\|w^k\|^2 + \|w^{k-1}\|^2] \\
&\le 2\,\mathbb{E}[\|w^{k-1}\|^2] - 2\,\mathbb{E}[\langle w^{k-1}, w^k\rangle] \\
&= 2\,\mathbb{E}[\langle w^{k-1},\, w^{k-1} - w^k\rangle] \\
&\le 2\,\mathbb{E}[\|w^{k-1}\|\,\|u^k\|] \le 2\,\mathbb{E}\big[\|w^{k-1}\|\,\sqrt{2}\,\alpha_k\,\|x^{D(k)} - x^{D(k-1)}\|\big] \\
&\le \sqrt{2}\,\alpha_k \sum_{t=\min(D(k-1),D(k))}^{\max(D(k-1),D(k))} \Big(\mathbb{E}[\|w^{k-1}\|^2] + \mathbb{E}[\|d^t_{i_t j_t}\|^2]\Big). \qquad (13)
\end{aligned}
$$
The fourth step follows from the bound below on $|u^k_{ij}|$:

$$
|u^k_{ij}| = \sqrt{p_{ij}}\,\|d^k_{ij} - d^{k-1}_{ij}\|
\le \sqrt{p_{ij}}\,\frac{\alpha_k}{2L}\,\big\|(U_i - U_j)\big(\nabla_{y_i} f(x^{D(k)}) - \nabla_{y_j} f(x^{D(k)}) + \nabla_{y_j} f(x^{D(k-1)}) - \nabla_{y_i} f(x^{D(k-1)})\big)\big\|
\le \sqrt{2 p_{ij}}\,\alpha_k\,\|x^{D(k)} - x^{D(k-1)}\|.
$$

The fifth step follows from the triangle inequality. We now prove (12): the induction hypothesis is trivially true for $k = 0$. Assume it is true for some $k - 1 \ge 0$. Now using Equation (13), we have
$$
\mathbb{E}[\|w^{k-1}\|^2] - \mathbb{E}[\|w^k\|^2] \le \sqrt{2}\,\alpha_k(\tau + 2)\,\mathbb{E}[\|w^{k-1}\|^2] + \sqrt{2}\,\alpha_k(\tau + 2)\,\rho^{\tau+1}\,\mathbb{E}[\|w^k\|^2]
$$

for our choice of $\alpha_k$. The last step follows from the fact that $\mathbb{E}[\|d^t_{i_t j_t}\|^2] = \mathbb{E}[\|w^t\|^2]$ and mathematical induction. From the above, we get

$$
\mathbb{E}[\|w^{k-1}\|^2] \le \frac{1 + \sqrt{2}\,\alpha_k(\tau + 2)\,\rho^{\tau+1}}{1 - \sqrt{2}\,\alpha_k(\tau + 2)}\,\mathbb{E}[\|w^k\|^2] \le \rho\,\mathbb{E}[\|w^k\|^2].
$$
Thus, the statement holds for $k$. Therefore, the statement holds for all $k \in \mathbb{N}$ by mathematical induction. Substituting the above in Equation (11), we get

$$
\mathbb{E}[f(x^{k+1})] \le \mathbb{E}[f(x^k)] - L\Big(\frac{1}{\alpha_k} - \frac{1 + \tau + \tau\rho^{\tau}}{2}\Big)\,\mathbb{E}[\|d^k_{i_k j_k}\|^2].
$$
This proves that the method is a descent method in expectation. Using the definition of $d^k_{ij}$, we have

$$
\begin{aligned}
\mathbb{E}[f(x^{k+1})] &\le \mathbb{E}[f(x^k)] - \frac{\alpha_k^2}{4L}\Big(\frac{1}{\alpha_k} - \frac{1 + \tau + \tau\rho^{\tau}}{2}\Big)\,\mathbb{E}\big[\|\nabla_{y_{i_k}} f(x^{D(k)}) - \nabla_{y_{j_k}} f(x^{D(k)})\|^2\big] \\
&\le \mathbb{E}[f(x^k)] - \frac{\alpha_k^2}{4L}\Big(\frac{1}{\alpha_k} - \frac{1 + \tau + \tau\rho^{\tau}}{2}\Big)\,\mathbb{E}\big[\|\nabla f(x^{D(k)})\|_K^2\big] \\
&\le \mathbb{E}[f(x^k)] - \frac{\alpha_k^2}{2R^2(x^0)}\Big(\frac{1}{\alpha_k} - \frac{1 + \tau + \tau\rho^{\tau}}{2}\Big)\,\mathbb{E}\big[(f(x^{D(k)}) - f^*)^2\big] \\
&\le \mathbb{E}[f(x^k)] - \frac{\alpha_k^2}{2R^2(x^0)}\Big(\frac{1}{\alpha_k} - \frac{1 + \tau + \tau\rho^{\tau}}{2}\Big)\,\mathbb{E}\big[(f(x^k) - f^*)^2\big].
\end{aligned}
$$

The second and third steps are similar to the proof of Theorem 3. The last step follows from the fact that the method is a descent method in expectation. Following a similar analysis as in Theorem 3, we get the required result.
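For intuition about the bounded-staleness assumption used above, here is a purely illustrative sketch (the toy objective, dimensions, and random delay pattern are our own assumptions) in which each update reads gradients at a delayed iterate $x^{D(k)}$ with $k - D(k) \le \tau$:

```python
import numpy as np

# Illustrative sketch of an update that reads a stale iterate x^{D(k)} with
# k - D(k) <= tau, as assumed in the proof. The quadratic objective, the
# dimensions, and the random delays are placeholders, not the paper's setup.
rng = np.random.default_rng(1)
b, n, tau = 4, 3, 2
L, alpha = 1.0, 0.25

y = rng.standard_normal((b, n))
history = [y.copy()]                          # history[t] stores the iterate after t updates

def grad_blocks(y_blocks):
    # gradient of the toy objective 0.5 * sum_i ||y_i||^2, block by block
    return y_blocks

for k in range(20):
    Dk = max(0, k - rng.integers(0, tau + 1))  # stale index with k - D(k) <= tau
    g = grad_blocks(history[Dk])
    i_k, j_k = rng.choice(b, size=2, replace=False)
    d = (alpha / (2 * L)) * (g[j_k] - g[i_k])
    y[i_k] += d
    y[j_k] -= d
    history.append(y.copy())
```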
D
Proof of Theorem 6
Proof. Let $Ax = \sum_i x_i$. Let $\tilde{x}^{k+1}$ be the solution to the following optimization problem:

$$
\tilde{x}^{k+1} = \arg\min_{\{x \mid Ax = 0\}}\; \langle \nabla f(x^k),\, x - x^k\rangle + \frac{L}{2}\|x - x^k\|^2 + h(x).
$$
To prove our result, we first prove a few intermediate results. We say vectors $d \in \mathbb{R}^n$ and $d' \in \mathbb{R}^n$ are conformal if $d_i d'_i \ge 0$ for all $i \in [n]$. We use $d_{i_k j_k} = x^{k+1} - x^k$ and $d = \tilde{x}^{k+1} - x^k$. Our first claim is that for any $d$, we can always find conformal vectors whose sum is $d$ (see [22]). More formally, we have the following result.

Lemma 7. For any $d \in \mathbb{R}^n$ with $Ad = 0$, we have a multi-set $S = \{d'_{ij}\}_{i \ne j}$ such that $d$ and $d'_{ij}$ are conformal for all $i \ne j$ and $i, j \in [b]$, i.e., $\sum_{i \ne j} d'_{ij} = d$, $Ad'_{ij} = 0$, and $d'_{ij}$ can be non-zero only in coordinates corresponding to $x_i$ and $x_j$.
Proof. We prove this by an iterative construction, i.e., for every vector $d$ such that $Ad = 0$, we construct a set $S = \{s_{ij}\}$ ($s_{ij} \in \mathbb{R}^n$) with the required properties. We start with a vector $u^0 = d$ and a multi-set $S^0 = \{s^0_{ij}\}$ with $s^0_{ij} = 0$ for all $i \ne j$ and $i, j \in [n]$. At the $k$-th step of the construction, we will have $Au^k = 0$, $As = 0$ for all $s \in S^k$, $d = u^k + \sum_{s \in S^k} s$, and each element of $S^k$ is conformal to $d$.

In the $k$-th iteration, pick the element with the smallest absolute value (say $v$) in $u^{k-1}$. Let us assume it corresponds to $y_{pj}$. Now pick an element from $u^{k-1}$ corresponding to $y_{qj}$ for $p \ne q \in [m]$ with absolute value at least $v$, albeit with the opposite sign. Note that such an element must exist since $Au^{k-1} = 0$. Let $p_1$ and $p_2$ denote the indices of these elements in $u^{k-1}$. Let $S^k$ be the same as $S^{k-1}$ except for $s^k_{pq}$, which is given by $s^k_{pq} = s^{k-1}_{pq} + r = s^{k-1}_{pq} + u^{k-1}_{p_1} e_{p_1} - u^{k-1}_{p_1} e_{p_2}$, where $e_i$ denotes the vector in $\mathbb{R}^n$ with zero in all components except the $i$-th position (where it is one). Note that $Ar = 0$ and $r$ is conformal to $d$ since it has the same sign. Let $u^k = u^{k-1} - r$. Note that $Au^k = 0$ since $Au^{k-1} = 0$ and $Ar = 0$. Also observe that $As = 0$ for all $s \in S^k$ and $u^k + \sum_{s \in S^k} s = d$.

Finally, note that in each iteration the number of non-zero elements of $u^k$ decreases by at least one. Therefore, this algorithm terminates after a finite number of iterations. Moreover, at termination $u^k = 0$, since otherwise the algorithm can always pick an element and continue with the process. This gives us the required conformal multi-set.
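The construction above is easy to implement. Below is a small sketch for the simplest coupling $A = [I\; I\; \cdots\; I]$ (so $Ad = 0$ means the blocks of $d$ sum to zero coordinate-wise); the per-coordinate pairing follows the smallest-absolute-value rule from the proof, and the dense block layout is our own simplifying assumption:

```python
import numpy as np

# Sketch of the Lemma 7 construction for the simple coupling A = [I I ... I],
# i.e. Ad = 0 means the blocks of d sum to zero coordinate-wise. For each
# coordinate we pair a smallest-magnitude entry with an opposite-signed entry
# (whose magnitude is then automatically at least as large) and peel off a
# conformal piece supported on two blocks, as in the proof.
def conformal_decomposition(d, tol=1e-9):
    b, n = d.shape                      # d[i] is the block d_i; sum_i d[i] = 0 is assumed
    pieces = []                         # entries (p, q, s): s is non-zero only in blocks p, q
    u = d.copy()
    for c in range(n):                  # the coupling acts coordinate-wise
        while np.max(np.abs(u[:, c])) > tol:
            nz = [i for i in range(b) if abs(u[i, c]) > tol]
            p = min(nz, key=lambda i: abs(u[i, c]))       # smallest-magnitude entry
            v = u[p, c]
            q = next(i for i in nz if u[i, c] * v < 0)    # opposite sign, |u[q,c]| >= |v|
            s = np.zeros_like(d)
            s[p, c], s[q, c] = v, -v    # conformal two-block piece with zero block-sum
            pieces.append((p, q, s))
            u[p, c] -= v
            u[q, c] += v
    return pieces

rng = np.random.default_rng(2)
d = rng.standard_normal((4, 3))
d[-1] = -d[:-1].sum(axis=0)             # enforce Ad = 0, i.e. blocks sum to zero
pieces = conformal_decomposition(d)
assert np.allclose(sum(s for _, _, s in pieces), d)
```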
Now consider a set $\{d'_{ij}\}$ which is conformal to $d$. We define $\hat{x}^{k+1}$ in the following manner:

$$
\hat{x}^{k+1}_i =
\begin{cases}
x^k_i + d'_{ij} & \text{if } (i, j) = (i_k, j_k), \\
x^k_i & \text{if } (i, j) \ne (i_k, j_k).
\end{cases}
$$

Lemma 8. For any $x \in \mathbb{R}^n$ and $k \ge 0$,

$$
\mathbb{E}\big[\|\hat{x}^{k+1} - x^k\|^2\big] \le \nu\,\|\tilde{x}^{k+1} - x^k\|^2.
$$

We also have

$$
\mathbb{E}\big[h(\hat{x}^{k+1})\big] \le (1 - \nu)\,h(x^k) + \nu\,h(\tilde{x}^{k+1}).
$$
Proof. We have the following bound:

$$
\mathbb{E}_{i_k j_k}\big[\|\hat{x}^{k+1} - x^k\|^2\big] = \nu\sum_{i \ne j}\|d'_{ij}\|^2 \le \nu\,\Big\|\sum_{i \ne j} d'_{ij}\Big\|^2 = \nu\,\|d\|^2 = \nu\,\|\tilde{x}^{k+1} - x^k\|^2.
$$

The above statement directly follows from the fact that $\{d'_{ij}\}$ is conformal to $d$. The remaining part directly follows from [22].
The remaining part is essentially along similar lines as [22]. We give the details here for completeness. From Lemma 1, we have

$$
\begin{aligned}
\mathbb{E}_{i_k j_k}[F(x^{k+1})] &\le \mathbb{E}_{i_k j_k}\Big[f(x^k) + \langle \nabla f(x^k),\, d_{i_k j_k}\rangle + \tfrac{L}{2}\|d_{i_k j_k}\|^2 + h(x^k + d_{i_k j_k})\Big] \\
&\le \mathbb{E}_{i_k j_k}\Big[f(x^k) + \langle \nabla f(x^k),\, d'_{i_k j_k}\rangle + \tfrac{L}{2}\|d'_{i_k j_k}\|^2 + h(x^k + d'_{i_k j_k})\Big] \\
&= f(x^k) + \nu\Big(\big\langle \nabla f(x^k),\, \sum_{i \ne j} d'_{ij}\big\rangle + \sum_{i \ne j}\tfrac{L}{2}\|d'_{ij}\|^2 + \sum_{i \ne j} h(x^k + d'_{ij})\Big) \\
&\le (1 - \nu)F(x^k) + \nu\Big(f(x^k) + \langle \nabla f(x^k),\, d\rangle + \tfrac{L}{2}\|d\|^2 + h(x^k + d)\Big) \\
&\le \min_{\{y \mid Ay = 0\}}\, (1 - \nu)F(x^k) + \nu\Big(F(y) + \tfrac{L}{2}\|y - x^k\|^2\Big) \\
&\le \min_{\beta \in [0,1]}\, (1 - \nu)F(x^k) + \nu\Big(F\big(\beta x^* + (1 - \beta)x^k\big) + \tfrac{L\beta^2}{2}\|x^k - x^*\|^2\Big) \\
&\le (1 - \nu)F(x^k) + \nu\Big(F(x^k) - \frac{2\,(F(x^k) - F(x^*))^2}{L\,R^2(x^0)}\Big).
\end{aligned}
$$

The second step follows from the optimality of $d_{i_k j_k}$. The fourth step follows from Lemma 8. Now using a recurrence relation similar to that in Theorem 2, we get the required result.
E
Reduction of General Case
In this section we show how to reduce a problem with linear constraints to the form of Problem 4 in the paper. For
simplicity, we focus on smooth objective functions. However, the formulation can be extended to composite objective
functions along similar lines. Consider the optimization problem
$$
\min_x\; f(x) \quad \text{s.t.} \quad Ax = \sum_i A_i x_i = 0,
$$

where $f_i$ is a convex function with an $L$-Lipschitz gradient.
Let $\bar{A}_i$ be a matrix with orthonormal columns satisfying $\mathrm{range}(\bar{A}_i) = \ker(A_i)$; such a matrix can be obtained, e.g., using the SVD. For each $i$, define $y_i = A_i x_i$ and assume that the rank of $A_i$ is less than or equal to the dimensionality of $x_i$.⁴ Then we can rewrite $x$ as a function $\psi(y, z)$ satisfying

$$
x_i = A_i^+ y_i + \bar{A}_i z_i,
$$

for some unknown $z_i$, where $C^+$ denotes the pseudo-inverse of $C$. The problem then becomes
$$
\min_{y, z}\; g(y, z) \quad \text{s.t.} \quad \sum_{i=1}^{N} y_i = 0, \qquad (14)
$$

where

$$
g(y, z) = f(\psi(y, z)) = f\Big(\sum_i U_i\big(A_i^+ y_i + \bar{A}_i z_i\big)\Big). \qquad (15)
$$

It is clear that the sets $S_1 = \{x \mid Ax = 0\}$ and $S_2 = \{\psi(y, z) \mid \sum_i y_i = 0\}$ are equal and hence the problem defined in (14) is equivalent to that in (1).
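A small numerical sketch of this change of variables for a single block (the dimensions and the matrix below are arbitrary placeholders): $\bar{A}_i$ is taken as an orthonormal basis of $\ker(A_i)$ from the SVD, $A_i^+$ is the pseudo-inverse, and the identity $x_i = A_i^+ y_i + \bar{A}_i z_i$ is checked for $y_i = A_i x_i$ and $z_i = \bar{A}_i^\top x_i$:

```python
import numpy as np

# Sketch of the reduction for a single block i with arbitrary data; dimensions
# and the matrix A_i are placeholders chosen only to illustrate the construction.
rng = np.random.default_rng(3)
m, n_i = 2, 5                              # A_i maps x_i in R^5 to R^2 (rank <= dim(x_i))
A_i = rng.standard_normal((m, n_i))

# Orthonormal basis of ker(A_i) from the SVD: rows of Vt beyond the rank span the kernel.
U, svals, Vt = np.linalg.svd(A_i)
r = np.sum(svals > 1e-10)
A_bar_i = Vt[r:].T                         # columns are orthonormal, range(A_bar_i) = ker(A_i)
A_plus_i = np.linalg.pinv(A_i)             # pseudo-inverse A_i^+

x_i = rng.standard_normal(n_i)
y_i = A_i @ x_i                            # constrained part
z_i = A_bar_i.T @ x_i                      # free part

# x_i is recovered from (y_i, z_i), which is the substitution used in (14)-(15).
assert np.allclose(A_plus_i @ y_i + A_bar_i @ z_i, x_i)
```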
Note that such a transformation preserves convexity of the objective function. It is also easy to show that it preserves the
block-wise Lipschitz continuity of the gradients as we prove in the following result.
Lemma 9. Let $f$ be a function with $L_i$-Lipschitz gradient w.r.t. $x_i$. Let $g(y, z)$ be the function defined in (15). Then $g$ satisfies the following condition:

$$
\|\nabla_{y_i} g(y, z) - \nabla_{y_i} g(y', z)\| \le \frac{L_i}{\sigma_{\min}^2(A_i)}\,\|y_i - y'_i\|,
\qquad
\|\nabla_{z_i} g(y, z) - \nabla_{z_i} g(y, z')\| \le L_i\,\|z_i - z'_i\|,
$$

where $\sigma_{\min}(B)$ denotes the minimum non-zero singular value of $B$.
Proof. We have

$$
\begin{aligned}
\|\nabla_{y_i} g(y, z) - \nabla_{y_i} g(y', z)\| &= \big\|(U_i A_i^+)^\top\big[\nabla_x f(\psi(y, z)) - \nabla_x f(\psi(y', z))\big]\big\| \\
&\le \|A_i^+\|\,\|\nabla_i f(\psi(y, z)) - \nabla_i f(\psi(y', z))\| \\
&\le L_i\,\|A_i^+\|\,\|A_i^+(y_i - y'_i)\| \le L_i\,\|A_i^+\|^2\,\|y_i - y'_i\| = \frac{L_i}{\sigma_{\min}^2(A_i)}\,\|y_i - y'_i\|.
\end{aligned}
$$

A similar proof holds for $\|\nabla_{z_i} g(y, z) - \nabla_{z_i} g(y, z')\|$, noting that $\|\bar{A}_i\| = 1$.
It is worth noting that this reduction is mainly used to simplify the analysis. In practice, however, we observed that an algorithm that operates directly on the original variables $x_i$ (i.e., Algorithm 1) converges much faster and is much less sensitive to the conditioning of $A_i$ compared to an algorithm that operates on $y_i$ and $z_i$. Indeed, with appropriate step sizes, Algorithm 1 minimizes, in each step, a tighter bound on the objective function compared to the bound based on (14), as stated in the following result.
⁴ If the rank constraint is not satisfied, then one solution is to use a coarser partitioning of $x$ so that the dimensionality of $x_i$ is large enough.
Lemma 10. Let $g$ and $\psi$ be as defined in (15), and let $d_i = A_i^+ d_{y_i} + \bar{A}_i d_{z_i}$. Then, for any $d_i$ and $d_j$ satisfying $A_i d_i + A_j d_j = 0$ and any feasible $x = \psi(y, z)$, we have

$$
\begin{aligned}
&\langle \nabla_i f(x),\, d_i\rangle + \langle \nabla_j f(x),\, d_j\rangle + \frac{L_i}{2\alpha}\|d_i\|^2 + \frac{L_j}{2\alpha}\|d_j\|^2 \\
&\quad \le \langle \nabla_{y_i} g(y, z),\, d_{y_i}\rangle + \langle \nabla_{z_i} g(y, z),\, d_{z_i}\rangle + \langle \nabla_{y_j} g(y, z),\, d_{y_j}\rangle + \langle \nabla_{z_j} g(y, z),\, d_{z_j}\rangle \\
&\qquad + \frac{L_i}{2\alpha\,\sigma_{\min}^2(A_i)}\|d_{y_i}\|^2 + \frac{L_i}{2\alpha}\|d_{z_i}\|^2 + \frac{L_j}{2\alpha\,\sigma_{\min}^2(A_j)}\|d_{y_j}\|^2 + \frac{L_j}{2\alpha}\|d_{z_j}\|^2.
\end{aligned}
$$
Proof. The proof follows directly from the fact that

$$
\nabla_i f(x) = A_i^\top \nabla_{y_i} g(y, z) + \bar{A}_i \nabla_{z_i} g(y, z).
$$