A Proof of Theorem 3

Proof. Taking the expectation over the choice of edges $(i_k, j_k)$ gives the following inequality:
$$
\mathbb{E}_{i_k j_k}[f(x^{k+1})\,|\,\eta_k] \le \mathbb{E}_{i_k j_k}\Big[f(x^k) - \tfrac{1}{4L}\|\nabla_{y_{i_k}} f(x^k) - \nabla_{y_{j_k}} f(x^k)\|^2 - \tfrac{1}{2L}\|\nabla_{z_{i_k}} f(x^k)\|^2 - \tfrac{1}{2L}\|\nabla_{z_{j_k}} f(x^k)\|^2\Big]
$$
$$
= f(x^k) - \tfrac{1}{2}\nabla_y f(x^k)^\top (L \otimes I_{n_y})\,\nabla_y f(x^k) - \tfrac{1}{2}\nabla_z f(x^k)^\top (D \otimes I_{n_z})\,\nabla_z f(x^k)
$$
$$
= f(x^k) - \tfrac{1}{2}\nabla f(x^k)^\top K\,\nabla f(x^k), \qquad (9)
$$
where $\otimes$ denotes the Kronecker product. This shows that the method is a descent method. Now we are ready to prove the main convergence theorem. We have the following, for all $k$:
$$
f(x^k) - f^* \le \langle \nabla f(x^k), x^k - x^*\rangle \le \|x^k - x^*\|_K^*\,\|\nabla f(x^k)\|_K \le R(x^0)\,\|\nabla f(x^k)\|_K.
$$
Combining this with inequality (9), we obtain
$$
\mathbb{E}[f(x^{k+1})\,|\,\eta_k] \le f(x^k) - \frac{(f(x^k) - f^*)^2}{2R^2(x^0)}.
$$
Taking the expectation of both sides and denoting $\delta_k = \mathbb{E}[f(x^k)] - f^*$ gives
$$
\delta_{k+1} \le \delta_k - \frac{\delta_k^2}{2R^2(x^0)}.
$$
Dividing both sides by $\delta_k \delta_{k+1}$ and using the fact that $\delta_{k+1} \le \delta_k$, we obtain
$$
\frac{1}{\delta_{k+1}} - \frac{1}{\delta_k} \ge \frac{1}{2R^2(x^0)}.
$$
Adding these inequalities for $k$ steps gives
$$
\frac{1}{\delta_k} \ge \frac{1}{\delta_0} + \frac{k}{2R^2(x^0)} \ge \frac{k}{2R^2(x^0)},
$$
from which we obtain the statement of the theorem, where $C = 2R^2(x^0)$.

B Proof of Theorem 5

Proof. In this case, the expectation is taken over both the selection of the pair $(i_k, j_k)$ and the random index $l_k \in [N]$. In this proof, the definition of $\eta_k$ includes $l_k$, i.e., $\eta_k = \{(i_0, j_0, l_0), \ldots, (i_{k-1}, j_{k-1}, l_{k-1})\}$. We define the following:
$$
d^k_{i_k} = \Big[\tfrac{\alpha_k}{2L}\big(\nabla_{y_{j_k}} f_{l_k}(x^k) - \nabla_{y_{i_k}} f_{l_k}(x^k)\big)^\top,\; -\tfrac{\alpha_k}{L}\nabla_{z_{i_k}} f_{l_k}(x^k)^\top\Big]^\top,
$$
$$
d^k_{j_k} = \Big[-\tfrac{\alpha_k}{2L}\big(\nabla_{y_{j_k}} f_{l_k}(x^k) - \nabla_{y_{i_k}} f_{l_k}(x^k)\big)^\top,\; -\tfrac{\alpha_k}{L}\nabla_{z_{j_k}} f_{l_k}(x^k)^\top\Big]^\top,
$$
$$
d^{l_k}_{i_k j_k} = U_{i_k} d^k_{i_k} + U_{j_k} d^k_{j_k}.
$$
For the expectation of the objective value at $x^{k+1}$, we have
$$
\mathbb{E}[f(x^{k+1})\,|\,\eta_k] \le \mathbb{E}_{i_k j_k}\mathbb{E}_{l_k}\Big[f(x^k) + \big\langle \nabla f(x^k), d^{l_k}_{i_k j_k}\big\rangle + \tfrac{L}{2}\|d^{l_k}_{i_k j_k}\|^2\Big]
$$
$$
= \mathbb{E}_{i_k j_k}\Big[f(x^k) + \big\langle \nabla f(x^k), \mathbb{E}_{l_k}[d^{l_k}_{i_k j_k}]\big\rangle + \tfrac{L}{2}\mathbb{E}_{l_k}[\|d^{l_k}_{i_k j_k}\|^2]\Big]
$$
$$
= \mathbb{E}_{i_k j_k}\Big[f(x^k) + \tfrac{\alpha_k}{2L}\big\langle \nabla_{y_{i_k}} f(x^k), \mathbb{E}_{l_k}[\nabla_{y_{j_k}} f_{l_k}(x^k) - \nabla_{y_{i_k}} f_{l_k}(x^k)]\big\rangle + \tfrac{\alpha_k}{2L}\big\langle \nabla_{y_{j_k}} f(x^k), \mathbb{E}_{l_k}[\nabla_{y_{i_k}} f_{l_k}(x^k) - \nabla_{y_{j_k}} f_{l_k}(x^k)]\big\rangle
$$
$$
\qquad - \tfrac{\alpha_k}{L}\big\langle \nabla_{z_{i_k}} f(x^k), \mathbb{E}_{l_k}[\nabla_{z_{i_k}} f_{l_k}(x^k)]\big\rangle - \tfrac{\alpha_k}{L}\big\langle \nabla_{z_{j_k}} f(x^k), \mathbb{E}_{l_k}[\nabla_{z_{j_k}} f_{l_k}(x^k)]\big\rangle + \tfrac{L}{2}\mathbb{E}_{l_k}[\|d^{l_k}_{i_k j_k}\|^2]\Big].
$$
Taking the expectation over $l_k$, we get the following relationship:
$$
\mathbb{E}[f(x^{k+1})\,|\,\eta_k] \le \mathbb{E}_{i_k j_k}\Big[f(x^k) + \tfrac{\alpha_k}{2L}\big\langle \nabla_{y_{i_k}} f(x^k), \nabla_{y_{j_k}} f(x^k) - \nabla_{y_{i_k}} f(x^k)\big\rangle + \tfrac{\alpha_k}{2L}\big\langle \nabla_{y_{j_k}} f(x^k), \nabla_{y_{i_k}} f(x^k) - \nabla_{y_{j_k}} f(x^k)\big\rangle
$$
$$
\qquad - \tfrac{\alpha_k}{L}\big\langle \nabla_{z_{i_k}} f(x^k), \nabla_{z_{i_k}} f(x^k)\big\rangle - \tfrac{\alpha_k}{L}\big\langle \nabla_{z_{j_k}} f(x^k), \nabla_{z_{j_k}} f(x^k)\big\rangle + \tfrac{L}{2}\mathbb{E}_{l_k}[\|d^{l_k}_{i_k j_k}\|^2]\Big].
$$
We first note that $\mathbb{E}_{l_k}[\|d^{l_k}_{i_k j_k}\|^2] \le 8M^2\alpha_k^2/L^2$ since $\|\nabla f_l\| \le M$. Substituting this in the above inequality and simplifying, we get
$$
\mathbb{E}[f(x^{k+1})\,|\,\eta_k] \le f(x^k) - \alpha_k \nabla_y f(x^k)^\top (L \otimes I_{n_y})\,\nabla_y f(x^k) - \alpha_k \nabla_z f(x^k)^\top (D \otimes I_{n_z})\,\nabla_z f(x^k) + \frac{4M^2\alpha_k^2}{L}
$$
$$
\le f(x^k) - \alpha_k \nabla f(x^k)^\top K\,\nabla f(x^k) + \frac{4M^2\alpha_k^2}{L}. \qquad (10)
$$
Similar to Theorem 3, we obtain a lower bound on $\nabla f(x^k)^\top K\,\nabla f(x^k)$ in the following manner:
$$
f(x^k) - f^* \le \langle \nabla f(x^k), x^k - x^*\rangle \le \|x^k - x^*\|_K^*\,\|\nabla f(x^k)\|_K \le R(x^0)\,\|\nabla f(x^k)\|_K.
$$
Combining this with inequality (10), we obtain
$$
\mathbb{E}[f(x^{k+1})\,|\,\eta_k] \le f(x^k) - \alpha_k\frac{(f(x^k) - f^*)^2}{R^2(x^0)} + \frac{4M^2\alpha_k^2}{L}.
$$
Taking the expectation of both sides and denoting $\delta_k = \mathbb{E}[f(x^k)] - f^*$ gives
$$
\delta_{k+1} \le \delta_k - \frac{\alpha_k \delta_k^2}{R^2(x^0)} + \frac{4M^2\alpha_k^2}{L}.
$$
Adding these inequalities from $i = 0$ to $i = k$ and using telescoping, we get
$$
\delta_{k+1} + \sum_{i=0}^{k}\frac{\alpha_i \delta_i^2}{R^2(x^0)} \le \delta_0 + \frac{4M^2}{L}\sum_{i=0}^{k}\alpha_i^2.
$$
Using the definition of $\bar{x}^{k+1} = \arg\min_{0 \le i \le k+1} f(x^i)$, we get
$$
\sum_{i=0}^{k}\frac{\alpha_i\,\big(\mathbb{E}[f(\bar{x}^{k+1})] - f^*\big)^2}{R^2(x^0)} \le \delta_{k+1} + \sum_{i=0}^{k}\frac{\alpha_i \delta_i^2}{R^2(x^0)} \le \delta_0 + \frac{4M^2}{L}\sum_{i=0}^{k}\alpha_i^2.
$$
Therefore, from the above inequality we have
$$
\mathbb{E}[f(\bar{x}^{k+1})] - f^* \le R(x^0)\sqrt{\frac{\delta_0 + 4M^2\sum_{i=0}^{k}\alpha_i^2/L}{\sum_{i=0}^{k}\alpha_i}}.
$$
Note that $\mathbb{E}[f(\bar{x}^{k+1})] - f^* \to 0$ if we choose step sizes satisfying the conditions $\sum_{i=0}^{\infty}\alpha_i = \infty$ and $\sum_{i=0}^{\infty}\alpha_i^2 < \infty$. Substituting $\alpha_i = \sqrt{\delta_0 L}/(2M\sqrt{i+1})$, we get the required result using the reasoning from [24] (we refer the reader to Section 2.2 of [24] for more details).
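As a quick sanity check on this choice of step size (the elementary bounds below are ours and are only meant to illustrate why the right-hand side above vanishes; the precise constants follow from the argument in [24]): with $\alpha_i = \sqrt{\delta_0 L}/(2M\sqrt{i+1})$ we have
$$
\sum_{i=0}^{k}\alpha_i = \frac{\sqrt{\delta_0 L}}{2M}\sum_{i=0}^{k}\frac{1}{\sqrt{i+1}} \ge \frac{\sqrt{\delta_0 L}}{2M}\sqrt{k+1}
\qquad\text{and}\qquad
\frac{4M^2}{L}\sum_{i=0}^{k}\alpha_i^2 = \delta_0\sum_{i=0}^{k}\frac{1}{i+1} \le \delta_0\big(1 + \ln(k+1)\big),
$$
so the bound above is at most
$$
R(x^0)\sqrt{\frac{2M\,\delta_0\big(2 + \ln(k+1)\big)}{\sqrt{\delta_0 L}\,\sqrt{k+1}}} = O\!\left(\frac{\sqrt{\log k}}{k^{1/4}}\right) \quad \text{as } k \to \infty,
$$
which indeed tends to zero.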
C Proof of Theorem 4

Proof. For ease of exposition, we analyze the case where the unconstrained variables $z$ are absent. The analysis of the case with $z$ variables can be carried out in a similar manner. Consider the update on edge $(i_k, j_k)$. Recall that $D(k)$ denotes the index of the iterate used in the $k$-th iteration for calculating the gradients. Let
$$
d^k = \frac{\alpha_k}{2L}\big(\nabla_{y_{j_k}} f(x^{D(k)}) - \nabla_{y_{i_k}} f(x^{D(k)})\big) \qquad\text{and}\qquad d^k_{i_k j_k} = x^{k+1} - x^k = U_{i_k} d^k - U_{j_k} d^k.
$$
Note that $\|d^k_{i_k j_k}\|^2 = 2\|d^k\|^2$. Since $f$ has a Lipschitz continuous gradient, we have
$$
f(x^{k+1}) \le f(x^k) + \big\langle \nabla_{y_{i_k} y_{j_k}} f(x^k), d^k_{i_k j_k}\big\rangle + \tfrac{L}{2}\|d^k_{i_k j_k}\|^2
$$
$$
= f(x^k) + \big\langle \nabla_{y_{i_k} y_{j_k}} f(x^{D(k)}) + \nabla_{y_{i_k} y_{j_k}} f(x^k) - \nabla_{y_{i_k} y_{j_k}} f(x^{D(k)}),\, d^k_{i_k j_k}\big\rangle + \tfrac{L}{2}\|d^k_{i_k j_k}\|^2
$$
$$
= f(x^k) - \tfrac{L}{\alpha_k}\|d^k_{i_k j_k}\|^2 + \big\langle \nabla_{y_{i_k} y_{j_k}} f(x^k) - \nabla_{y_{i_k} y_{j_k}} f(x^{D(k)}),\, d^k_{i_k j_k}\big\rangle + \tfrac{L}{2}\|d^k_{i_k j_k}\|^2
$$
$$
\le f(x^k) - L\Big(\tfrac{1}{\alpha_k} - \tfrac{1}{2}\Big)\|d^k_{i_k j_k}\|^2 + \|\nabla_{y_{i_k} y_{j_k}} f(x^k) - \nabla_{y_{i_k} y_{j_k}} f(x^{D(k)})\|\,\|d^k_{i_k j_k}\|
$$
$$
\le f(x^k) - L\Big(\tfrac{1}{\alpha_k} - \tfrac{1}{2}\Big)\|d^k_{i_k j_k}\|^2 + L\|x^k - x^{D(k)}\|\,\|d^k_{i_k j_k}\|.
$$
The third and fourth steps in the above derivation follow from the definition of $d^k_{ij}$ and the Cauchy-Schwarz inequality, respectively. The last step follows from the fact that the gradients are Lipschitz continuous. Using the assumption that the staleness in the variables is bounded by $\tau$, i.e., $k - D(k) \le \tau$, and the definition of $d^k_{ij}$, we have
$$
f(x^{k+1}) \le f(x^k) - L\Big(\tfrac{1}{\alpha_k} - \tfrac{1}{2}\Big)\|d^k_{i_k j_k}\|^2 + L\Big(\sum_{t=1}^{\tau}\|d^{k-t}_{i_{k-t} j_{k-t}}\|\Big)\|d^k_{i_k j_k}\|
$$
$$
\le f(x^k) - L\Big(\tfrac{1}{\alpha_k} - \tfrac{1}{2}\Big)\|d^k_{i_k j_k}\|^2 + \tfrac{L}{2}\sum_{t=1}^{\tau}\Big[\|d^{k-t}_{i_{k-t} j_{k-t}}\|^2 + \|d^k_{i_k j_k}\|^2\Big]
$$
$$
= f(x^k) - L\Big(\tfrac{1}{\alpha_k} - \tfrac{1+\tau}{2}\Big)\|d^k_{i_k j_k}\|^2 + \tfrac{L}{2}\sum_{t=1}^{\tau}\|d^{k-t}_{i_{k-t} j_{k-t}}\|^2.
$$
The first step follows from the triangle inequality. The second inequality follows from the fact that $ab \le (a^2 + b^2)/2$. Taking the expectation over the edges, we have
$$
\mathbb{E}[f(x^{k+1})] \le \mathbb{E}[f(x^k)] - L\Big(\tfrac{1}{\alpha_k} - \tfrac{1+\tau}{2}\Big)\mathbb{E}[\|d^k_{i_k j_k}\|^2] + \tfrac{L}{2}\sum_{t=1}^{\tau}\mathbb{E}\big[\|d^{k-t}_{i_{k-t} j_{k-t}}\|^2\big]. \qquad (11)
$$
We now prove that, for all $k \ge 0$,
$$
\mathbb{E}\big[\|d^{k-1}_{i_{k-1} j_{k-1}}\|^2\big] \le \rho\,\mathbb{E}\big[\|d^{k}_{i_{k} j_{k}}\|^2\big], \qquad (12)
$$
where we define $\mathbb{E}[\|d^{k-1}_{i_{k-1} j_{k-1}}\|^2] = 0$ for $k = 0$. Let $w^t$ denote the vector of size $|E|$ such that $w^t_{ij} = \sqrt{p_{ij}}\,\|d^t_{ij}\|$ (with a slight abuse of notation, we use $w^t_{ij}$ to denote the entry corresponding to edge $(i,j)$). Note that $\mathbb{E}[\|d^t_{i_t j_t}\|^2] = \mathbb{E}[\|w^t\|^2]$. We prove Equation (12) by induction. Let $u^k$ be a vector of size $|E|$ such that $u^k_{ij} = \sqrt{p_{ij}}\,\|d^k_{ij} - d^{k-1}_{ij}\|$. Consider the following:
$$
\mathbb{E}[\|w^{k-1}\|^2] - \mathbb{E}[\|w^k\|^2] \le 2\mathbb{E}[\|w^{k-1}\|^2] - 2\mathbb{E}[\langle w^{k-1}, w^k\rangle] = 2\mathbb{E}[\langle w^{k-1}, w^{k-1} - w^k\rangle]
$$
$$
\le 2\mathbb{E}[\|w^{k-1}\|\,\|w^{k-1} - w^k\|] \le 2\mathbb{E}[\|w^{k-1}\|\,\|u^k\|] \le 2\sqrt{2}\,\alpha_k\,\mathbb{E}[\|w^{k-1}\|\,\|x^{D(k)} - x^{D(k-1)}\|]
$$
$$
\le \sqrt{2}\,\alpha_k\Big(\mathbb{E}[\|w^{k-1}\|^2] + \mathbb{E}[\|x^{D(k)} - x^{D(k-1)}\|^2]\Big) \le \sqrt{2}\,\alpha_k\Big(\mathbb{E}[\|w^{k-1}\|^2] + \sum_{t=\min(D(k-1),D(k))}^{\max(D(k-1),D(k))}\mathbb{E}\big[\|d^t_{i_t j_t}\|^2\big]\Big). \qquad (13)
$$
The step bounding $\|w^{k-1} - w^k\|$ by $\|u^k\|$ holds componentwise by the reverse triangle inequality, and the step introducing $\|x^{D(k)} - x^{D(k-1)}\|$ follows from the bound below on $|u^k_{ij}|$:
$$
|u^k_{ij}| = \sqrt{p_{ij}}\,\|d^k_{ij} - d^{k-1}_{ij}\| \le \frac{\alpha_k}{2L}\sqrt{p_{ij}}\,\big\|(U_i - U_j)\big(\nabla_{y_i} f(x^{D(k)}) - \nabla_{y_j} f(x^{D(k)}) + \nabla_{y_j} f(x^{D(k-1)}) - \nabla_{y_i} f(x^{D(k-1)})\big)\big\| \le \sqrt{2 p_{ij}}\,\alpha_k\,\|x^{D(k)} - x^{D(k-1)}\|.
$$
The last step of (13) follows from the triangle inequality, bounding $\|x^{D(k)} - x^{D(k-1)}\|$ in terms of the intermediate updates. We now prove (12): the induction hypothesis is trivially true for $k = 0$. Assume it is true for some $k - 1 \ge 0$. Now, using Equation (13), we have
$$
\mathbb{E}[\|w^{k-1}\|^2] - \mathbb{E}[\|w^k\|^2] \le \sqrt{2}\,\alpha_k(\tau+2)\,\mathbb{E}[\|w^{k-1}\|^2] + \sqrt{2}\,\alpha_k(\tau+2)\,\rho^{\tau+1}\,\mathbb{E}[\|w^k\|^2].
$$
The last step follows from the fact that $\mathbb{E}[\|d^t_{i_t j_t}\|^2] = \mathbb{E}[\|w^t\|^2]$ and mathematical induction. From the above, we get
$$
\mathbb{E}[\|w^{k-1}\|^2] \le \frac{1 + \sqrt{2}\,\alpha_k(\tau+2)\,\rho^{\tau+1}}{1 - \sqrt{2}\,\alpha_k(\tau+2)}\,\mathbb{E}[\|w^k\|^2] \le \rho\,\mathbb{E}[\|w^k\|^2]
$$
for our choice of $\alpha_k$. Thus, the statement holds for $k$. Therefore, the statement holds for all $k \in \mathbb{N}$ by mathematical induction.
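For concreteness, the last display can be turned into an explicit sufficient condition on the step size; the rearrangement below is ours and only illustrates what "our choice of $\alpha_k$" has to guarantee (the constant used in the theorem statement may be written differently). For $\rho > 1$ and $\sqrt{2}\,\alpha_k(\tau+2) < 1$,
$$
\frac{1 + \sqrt{2}\,\alpha_k(\tau+2)\,\rho^{\tau+1}}{1 - \sqrt{2}\,\alpha_k(\tau+2)} \le \rho
\;\Longleftrightarrow\;
\sqrt{2}\,\alpha_k(\tau+2)\big(\rho + \rho^{\tau+1}\big) \le \rho - 1
\;\Longleftrightarrow\;
\alpha_k \le \frac{\rho - 1}{\sqrt{2}\,(\tau+2)\big(\rho + \rho^{\tau+1}\big)},
$$
and any $\alpha_k$ satisfying the last condition automatically satisfies $\sqrt{2}\,\alpha_k(\tau+2) < 1$, since $\rho - 1 < \rho + \rho^{\tau+1}$.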
Substituting the above in Equation (11), we get
$$
\mathbb{E}[f(x^{k+1})] \le \mathbb{E}[f(x^k)] - L\Big(\frac{1}{\alpha_k} - \frac{1 + \tau + \tau\rho^{\tau}}{2}\Big)\mathbb{E}[\|d^k_{i_k j_k}\|^2].
$$
This proves that the method is a descent method in expectation. Using the definition of $d^k_{ij}$, we have
$$
\mathbb{E}[f(x^{k+1})] \le \mathbb{E}[f(x^k)] - \frac{\alpha_k^2}{4L}\Big(\frac{1}{\alpha_k} - \frac{1 + \tau + \tau\rho^{\tau}}{2}\Big)\mathbb{E}\big[\|\nabla_{y_{i_k}} f(x^{D(k)}) - \nabla_{y_{j_k}} f(x^{D(k)})\|^2\big]
$$
$$
\le \mathbb{E}[f(x^k)] - \frac{\alpha_k^2}{2}\Big(\frac{1}{\alpha_k} - \frac{1 + \tau + \tau\rho^{\tau}}{2}\Big)\mathbb{E}\big[\nabla f(x^{D(k)})^\top K\,\nabla f(x^{D(k)})\big]
$$
$$
\le \mathbb{E}[f(x^k)] - \frac{\alpha_k^2}{2R^2(x^0)}\Big(\frac{1}{\alpha_k} - \frac{1 + \tau + \tau\rho^{\tau}}{2}\Big)\mathbb{E}\big[(f(x^{D(k)}) - f^*)^2\big]
$$
$$
\le \mathbb{E}[f(x^k)] - \frac{\alpha_k^2}{2R^2(x^0)}\Big(\frac{1}{\alpha_k} - \frac{1 + \tau + \tau\rho^{\tau}}{2}\Big)\mathbb{E}\big[(f(x^k) - f^*)^2\big].
$$
The second and third steps are similar to the proof of Theorem 3. The last step follows from the fact that the method is a descent method in expectation. Following a similar analysis as in Theorem 3, we get the required result.

D Proof of Theorem 6

Proof. Let $Ax = \sum_i x_i$. Let $\tilde{x}^{k+1}$ be the solution to the following optimization problem:
$$
\tilde{x}^{k+1} = \arg\min_{\{x \,|\, Ax = 0\}}\; \langle \nabla f(x^k), x - x^k\rangle + \frac{L}{2}\|x - x^k\|^2 + h(x).
$$
To prove our result, we first prove a few intermediate results. We say vectors $d \in \mathbb{R}^n$ and $d' \in \mathbb{R}^n$ are conformal if $d_i d'_i \ge 0$ for all $i \in [n]$. We use $d_{i_k j_k} = x^{k+1} - x^k$ and $d = \tilde{x}^{k+1} - x^k$. Our first claim is that for any such $d$ we can always find conformal vectors whose sum is $d$ (see [22]). More formally, we have the following result.

Lemma 7. For any $d \in \mathbb{R}^n$ with $Ad = 0$, there exists a multi-set $S = \{d'_{ij}\}_{i \ne j}$ such that $d$ and $d'_{ij}$ are conformal for all $i \ne j$ with $i, j \in [b]$, $\sum_{i \ne j} d'_{ij} = d$, $Ad'_{ij} = 0$, and $d'_{ij}$ can be non-zero only in the coordinates corresponding to $x_i$ and $x_j$.

Proof. We prove this by an iterative construction, i.e., for every vector $d$ such that $Ad = 0$, we construct a set $S = \{s_{ij}\}$ ($s_{ij} \in \mathbb{R}^n$) with the required properties. We start with the vector $u^0 = d$ and the multi-set $S^0 = \{s^0_{ij}\}$ with $s^0_{ij} = 0$ for all $i \ne j$ and $i, j \in [b]$. At the $k$-th step of the construction, we maintain $Au^k = 0$, $As = 0$ for all $s \in S^k$, $d = u^k + \sum_{s \in S^k} s$, and each element of $S^k$ is conformal to $d$. In the $k$-th iteration, pick the element with the smallest non-zero absolute value (say $v$) in $u^{k-1}$; let us assume it corresponds to $y_{pj}$. Now pick an element of $u^{k-1}$ corresponding to $y_{qj}$ for some $q \ne p \in [m]$ with absolute value at least $v$ and with the opposite sign. Note that such an element must exist since $Au^{k-1} = 0$. Let $p_1$ and $p_2$ denote the indices of these two elements in $u^{k-1}$. Let $S^k$ be the same as $S^{k-1}$ except for $s^k_{pq}$, which is given by
$$
s^k_{pq} = s^{k-1}_{pq} + r, \qquad r = u^{k-1}_{p_1} e_{p_1} - u^{k-1}_{p_1} e_{p_2},
$$
where $e_i$ denotes the vector in $\mathbb{R}^n$ with zeros in all components except the $i$-th (where it is one). Note that $Ar = 0$ and $r$ is conformal to $d$ since it has the same signs. Let $u^k = u^{k-1} - r$. Note that $Au^k = 0$ since $Au^{k-1} = 0$ and $Ar = 0$. Also observe that $As = 0$ for all $s \in S^k$ and $u^k + \sum_{s \in S^k} s = d$. Finally, note that in each iteration the number of non-zero elements of $u^k$ decreases by at least one. Therefore, the construction terminates after a finite number of iterations. Moreover, at termination $u^k = 0$, since otherwise the algorithm could always pick an element and continue the process. This gives us the required conformal multi-set.

Now consider a set $\{d'_{ij}\}$ that is conformal to $d$. We define $\hat{x}^{k+1}$ in the following manner:
$$
\hat{x}^{k+1}_i = \begin{cases} x^k_i + d'_{ij} & \text{if } (i,j) = (i_k, j_k), \\ x^k_i & \text{if } (i,j) \ne (i_k, j_k), \end{cases}
$$
i.e., $\hat{x}^{k+1} = x^k + d'_{i_k j_k}$.

Lemma 8. For any $x \in \mathbb{R}^n$ and $k \ge 0$,
$$
\mathbb{E}\big[\|\hat{x}^{k+1} - x^k\|^2\big] \le \gamma\,\|\tilde{x}^{k+1} - x^k\|^2,
$$
where $\gamma$ denotes the (uniform) probability of selecting any particular pair $(i,j)$. We also have
$$
\mathbb{E}\big[h(\hat{x}^{k+1})\big] \le (1 - \gamma)\,h(x^k) + \gamma\,h(\tilde{x}^{k+1}).
$$

Proof. We have the following bound:
$$
\mathbb{E}_{i_k j_k}\big[\|\hat{x}^{k+1} - x^k\|^2\big] = \gamma\sum_{i \ne j}\|d'_{ij}\|^2 \le \gamma\,\Big\|\sum_{i \ne j} d'_{ij}\Big\|^2 = \gamma\,\|d\|^2 = \gamma\,\|\tilde{x}^{k+1} - x^k\|^2.
$$
The middle inequality follows directly from the fact that the $\{d'_{ij}\}$ are conformal to $d$ (and hence mutually conformal, so all cross terms are non-negative). The remaining part follows directly from [22].
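To make the construction in the proof of Lemma 7 concrete, the following is a minimal numpy sketch specialized to the constraint $Ax = \sum_i x_i = 0$ used in Theorem 6. The function name, data layout ($d$ stored as a $b \times n$ array of blocks), and tolerance handling are ours and purely illustrative; the loop mirrors the proof, repeatedly cancelling the smallest-magnitude entry of a coordinate against an opposite-signed entry of the same coordinate in another block, so the number of non-zeros strictly decreases.

```python
import numpy as np

def conformal_pair_decomposition(d_blocks, tol=1e-12):
    """Split d = (d_1, ..., d_b) with sum_i d_i = 0 into pairwise pieces
    s[(i, j)], each supported only on blocks i and j, conformal to d,
    summing to d, and with zero block-sum (illustrative sketch of Lemma 7
    for the special case A x = sum_i x_i)."""
    u = np.array([np.asarray(blk, dtype=float) for blk in d_blocks])  # b x n
    b, n = u.shape
    d = u.copy()
    S = {}
    for j in range(n):                        # the constraint acts coordinatewise
        while True:
            nz = np.flatnonzero(np.abs(u[:, j]) > tol)
            if nz.size == 0:
                break
            p = nz[np.argmin(np.abs(u[nz, j]))]   # smallest non-zero magnitude
            v = u[p, j]
            # an opposite-signed entry exists because column j sums to zero,
            # and it automatically has magnitude >= |v|
            opp = [q for q in nz if u[q, j] * v < 0]
            if not opp:                       # only reachable through round-off
                u[p, j] = 0.0
                continue
            q = opp[0]
            s = S.setdefault((min(p, q), max(p, q)), np.zeros((b, n)))
            s[p, j] += v                      # move the entry v from u into s
            s[q, j] -= v
            u[p, j] = 0.0
            u[q, j] += v
    # invariants from the proof: pieces sum to d, are conformal to d,
    # and each piece has zero block-sum (i.e., A s = 0)
    assert np.allclose(sum(S.values()), d)
    assert all((s * d >= -tol).all() for s in S.values())
    assert all(np.allclose(s.sum(axis=0), 0.0) for s in S.values())
    return S

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = rng.standard_normal((4, 3))
    d[-1] = -d[:-1].sum(axis=0)               # enforce sum_i d_i = 0
    pieces = conformal_pair_decomposition(list(d))
```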
The remaining part proceeds essentially along the same lines as [22]; we give the details here for completeness. From Lemma 1, we have
$$
\mathbb{E}_{i_k j_k}[F(x^{k+1})] \le \mathbb{E}_{i_k j_k}\Big[f(x^k) + \langle \nabla f(x^k), d_{i_k j_k}\rangle + \tfrac{L}{2}\|d_{i_k j_k}\|^2 + h(x^k + d_{i_k j_k})\Big]
$$
$$
\le \mathbb{E}_{i_k j_k}\Big[f(x^k) + \langle \nabla f(x^k), d'_{i_k j_k}\rangle + \tfrac{L}{2}\|d'_{i_k j_k}\|^2 + h(x^k + d'_{i_k j_k})\Big]
$$
$$
= f(x^k) + \gamma\Big(\Big\langle \nabla f(x^k), \sum_{i \ne j} d'_{ij}\Big\rangle + \sum_{i \ne j}\tfrac{L}{2}\|d'_{ij}\|^2 + \sum_{i \ne j} h(x^k + d'_{ij})\Big)
$$
$$
\le (1 - \gamma)F(x^k) + \gamma\Big(f(x^k) + \langle \nabla f(x^k), d\rangle + \tfrac{L}{2}\|d\|^2 + h(x^k + d)\Big)
$$
$$
\le \min_{\{y \,|\, Ay = 0\}}\; (1 - \gamma)F(x^k) + \gamma\Big(F(y) + \tfrac{L}{2}\|y - x^k\|^2\Big)
$$
$$
\le \min_{\beta \in [0,1]}\; (1 - \gamma)F(x^k) + \gamma\Big(F\big(\beta x^* + (1 - \beta)x^k\big) + \tfrac{\beta^2 L}{2}\|x^k - x^*\|^2\Big)
$$
$$
\le (1 - \gamma)F(x^k) + \gamma\Big(F(x^k) - \frac{(F(x^k) - F(x^*))^2}{2LR^2(x^0)}\Big).
$$
The second step follows from the optimality of $d_{i_k j_k}$. The fourth step follows from Lemma 8. Now, using a recurrence relation similar to the one in Theorem 2, we get the required result.

E Reduction of General Case

In this section we show how to reduce a problem with linear constraints to the form of Problem 4 in the paper. For simplicity, we focus on smooth objective functions; however, the formulation can be extended to composite objective functions along similar lines. Consider the optimization problem
$$
\min_x\; f(x) \qquad \text{s.t.} \qquad Ax = \sum_i A_i x_i = 0,
$$
where $f$ is a convex function with an $L$-Lipschitz gradient. Let $\bar{A}_i$ be a matrix with orthonormal columns satisfying $\mathrm{range}(\bar{A}_i) = \ker(A_i)$; such a matrix can be obtained, e.g., using the SVD. For each $i$, define $y_i = A_i x_i$ and assume that the rank of $A_i$ is less than or equal to the dimensionality of $x_i$.⁴ Then we can rewrite $x$ as a function $\phi(y, z)$ satisfying
$$
x_i = A_i^{+} y_i + \bar{A}_i z_i,
$$
for some unknown $z_i$, where $C^{+}$ denotes the pseudo-inverse of $C$. The problem then becomes
$$
\min_{y,z}\; g(y, z) \qquad \text{s.t.} \qquad \sum_{i=1}^{N} y_i = 0, \qquad (14)
$$
where
$$
g(y, z) = f(\phi(y, z)) = f\Big(\sum_i U_i\big(A_i^{+} y_i + \bar{A}_i z_i\big)\Big). \qquad (15)
$$
It is clear that the sets $S_1 = \{x \,|\, Ax = 0\}$ and $S_2 = \{\phi(y, z) \,|\, \sum_i y_i = 0\}$ are equal, and hence the problem defined in (14) is equivalent to that in (1).

Note that such a transformation preserves the convexity of the objective function. It is also easy to show that it preserves the block-wise Lipschitz continuity of the gradients, as we prove in the following result.

Lemma 9. Let $f$ be a function with an $L_i$-Lipschitz gradient w.r.t. $x_i$. Let $g(y, z)$ be the function defined in (15). Then $g$ satisfies the following conditions:
$$
\|\nabla_{y_i} g(y, z) - \nabla_{y_i} g(y', z)\| \le \frac{L_i}{\sigma_{\min}^2(A_i)}\|y_i - y'_i\|, \qquad \|\nabla_{z_i} g(y, z) - \nabla_{z_i} g(y, z')\| \le L_i\|z_i - z'_i\|,
$$
where $\sigma_{\min}(B)$ denotes the minimum non-zero singular value of $B$.

Proof. We have
$$
\|\nabla_{y_i} g(y, z) - \nabla_{y_i} g(y', z)\| = \big\|(U_i A_i^{+})^\top\big[\nabla_x f(\phi(y, z)) - \nabla_x f(\phi(y', z))\big]\big\| \le \|A_i^{+}\|\,\|\nabla_i f(\phi(y, z)) - \nabla_i f(\phi(y', z))\|
$$
$$
\le L_i\,\|A_i^{+}\|\,\|A_i^{+}(y_i - y'_i)\| \le L_i\,\|A_i^{+}\|^2\,\|y_i - y'_i\| = \frac{L_i}{\sigma_{\min}^2(A_i)}\|y_i - y'_i\|.
$$
A similar proof holds for $\|\nabla_{z_i} g(y, z) - \nabla_{z_i} g(y, z')\|$, noting that $\|\bar{A}_i\| = 1$.

It is worth noting that this reduction is mainly used to simplify the analysis. In practice, however, we observed that an algorithm that operates directly on the original variables $x_i$ (i.e., Algorithm 1) converges much faster and is much less sensitive to the conditioning of $A_i$ than an algorithm that operates on $y_i$ and $z_i$. Indeed, with appropriate step sizes, Algorithm 1 minimizes, in each step, a tighter bound on the objective function than the bound based on (14), as stated in the following result.

⁴ If the rank constraint is not satisfied, then one solution is to use a coarser partitioning of $x$ so that the dimensionality of $x_i$ is large enough.

Lemma 10. Let $g$ and $\phi$ be as defined in (15), and let $d_i = A_i^{+} d_{y_i} + \bar{A}_i d_{z_i}$.
Then, for any $d_i$ and $d_j$ satisfying $A_i d_i + A_j d_j = 0$ and any feasible $x = \phi(y, z)$, we have
$$
\langle \nabla_i f(x), d_i\rangle + \langle \nabla_j f(x), d_j\rangle + \frac{L_i}{2\alpha}\|d_i\|^2 + \frac{L_j}{2\alpha}\|d_j\|^2
$$
$$
\le \langle \nabla_{y_i} g(y, z), d_{y_i}\rangle + \langle \nabla_{z_i} g(y, z), d_{z_i}\rangle + \langle \nabla_{y_j} g(y, z), d_{y_j}\rangle + \langle \nabla_{z_j} g(y, z), d_{z_j}\rangle + \frac{L_i}{2\alpha\,\sigma_{\min}^2(A_i)}\|d_{y_i}\|^2 + \frac{L_i}{2\alpha}\|d_{z_i}\|^2 + \frac{L_j}{2\alpha\,\sigma_{\min}^2(A_j)}\|d_{y_j}\|^2 + \frac{L_j}{2\alpha}\|d_{z_j}\|^2.
$$

Proof. The proof follows directly from the fact that
$$
\nabla_i f(x) = (A_i^{+})^\top \nabla_{y_i} g(y, z) + \bar{A}_i \nabla_{z_i} g(y, z).
$$
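To make the change of variables of Section E concrete, here is a minimal numpy sketch for a single block, assuming only that $\mathrm{rank}(A_i) \le \dim(x_i)$ as above; the function and variable names are ours, not the paper's.

```python
import numpy as np

def reduction_maps(A_i, rtol=1e-10):
    """Change of variables for one block in the reduction of Section E:
    y_i = A_i x_i,  z_i = Abar_i^T x_i,  and  x_i = A_i^+ y_i + Abar_i z_i,
    where Abar_i has orthonormal columns spanning ker(A_i), obtained from
    the SVD. Names and tolerances are illustrative, not from the paper."""
    A_i = np.asarray(A_i, dtype=float)
    _, s, Vt = np.linalg.svd(A_i)
    rank = int(np.sum(s > rtol * s[0])) if s.size else 0
    Abar_i = Vt[rank:].T                    # orthonormal basis of ker(A_i)
    A_pinv = np.linalg.pinv(A_i)

    def to_yz(x_i):
        return A_i @ x_i, Abar_i.T @ x_i

    def to_x(y_i, z_i):
        return A_pinv @ y_i + Abar_i @ z_i

    return to_yz, to_x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((2, 5))         # rank(A_i) <= dim(x_i), as assumed
    to_yz, to_x = reduction_maps(A)
    x = rng.standard_normal(5)
    y, z = to_yz(x)
    assert np.allclose(to_x(y, z), x)       # recovers x_i = A_i^+ y_i + Abar_i z_i
    assert np.allclose(A @ to_x(y, z), y)   # and y_i = A_i x_i
```

The round trip checked in the final assertions is exactly the identity $x_i = A_i^{+} y_i + \bar{A}_i z_i$ on which the reduction relies.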