Numerical Linear Algebra in the Streaming Model
Ken Clarkson - IBM
David Woodruff - IBM

The Problems
Given an n x d matrix A and an n x d' matrix B, we want estimators for
• The matrix product A^T B
• The matrix X* minimizing ||AX-B||
– A slightly generalized version of linear regression
• Given an integer k, the matrix A_k of rank k minimizing ||A-A_k||
• We consider the Frobenius matrix norm: the square root of the sum of squares of the entries

General Properties of Our Algorithms
• 1 pass over the matrix entries, given in any order (multiple updates allowed)
• Maintain compressed versions, or "sketches", of the matrices
• Do small work per entry to maintain the sketches
• Output the result using the sketches
• Randomized approximation algorithms
• Since we minimize space complexity, we restrict matrix entries to O(log(nc)) or O(log(nd)) bits

Matrix Compression Methods
• In a line of similar efforts…
– Element-wise sampling [AM01], [AHK06]
– Row / column sampling: pick a small random subset of the rows, columns, or both [DK01], [DKM04], [DMM08]
– Sketching / random projection: maintain a small number of random linear combinations of rows or columns [S06]
– Usually more than 1 pass
• Here: sketching

Outline
• Matrix Product
• Linear Regression
• Low-rank approximation

An Optimal Matrix Product Algorithm
• A and B have n rows and a total of c columns; we want to estimate A^T B so that ||Est - A^T B|| ≤ ε ||A|| · ||B||
• Let S be an n x m sign (Rademacher) matrix
– Each entry is +1 or -1 with probability 1/2
– m small, set to O(log(1/δ)) ε^-2
– Entries are O(log(1/δ))-wise independent
• Observation:
– E[A^T S S^T B / m] = A^T E[S S^T] B / m = A^T B
• Wouldn't it be nice if all the algorithm did was maintain S^T A and S^T B, and output A^T S S^T B / m?

An Optimal Matrix Product Algorithm
• This does work, and we are able to improve the previous dependence on m:
• New tail estimate: for δ, ε > 0, there is m = O(log(1/δ)) ε^-2 so that
Pr[ ||A^T S S^T B / m - A^T B|| > ε ||A|| ||B|| ] ≤ δ
(again, ||C|| = [Σ_{i,j} C_{i,j}^2]^{1/2})
• Follows from bounding the O(log(1/δ))-th moment of ||A^T S S^T B / m - A^T B||
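To make this concrete, here is a minimal numpy sketch of the product estimator (our illustration, not code from the talk): the constant in m is a guess, and fully independent random signs stand in for the O(log(1/δ))-wise independent ones.

import numpy as np

rng = np.random.default_rng(0)
n, d, d2 = 10_000, 20, 30
eps = 0.1
m = int(np.ceil(6 / eps**2))          # m = O(log(1/delta)) * eps^-2; constant assumed

A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d2))
S = rng.choice([-1.0, 1.0], size=(n, m))   # n x m sign (Rademacher) matrix

# The stream only needs S^T A and S^T B (m x d and m x d2): an update
# of delta to entry A[i, j] adds delta * S[i, :] to column j of S^T A.
StA = S.T @ A
StB = S.T @ B

est = StA.T @ StB / m                 # estimate of A^T B
err = np.linalg.norm(est - A.T @ B)   # Frobenius norm of the error
bound = eps * np.linalg.norm(A) * np.linalg.norm(B)
print(err <= bound)                   # holds with probability >= 1 - delta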
Efficiency
• Easy to maintain the sketches given updates
– O(m) time per update, O(mc log(nc)) bits of space for S^T A and S^T B
– Improves Sarlos' algorithm by a log c factor
– Sarlos' algorithm is based on the JL Lemma
• JL preserves all entries of A^T B up to an additive error, whereas we only preserve the overall error
– Can compute [A^T S][S^T B] / m via fast rectangular matrix multiplication

Matrix Product Lower Bound
• Our algorithm is space-optimal for constant δ
– a new lower bound
• Reduction from a communication game
– Augmented Indexing, with players Alice and Bob
– Alice has a random x ∈ {0,1}^s
– Bob has a random i ∈ {1, 2, …, s}
• and also x_{i+1}, …, x_s
• Alice sends Bob one message
• Bob should output x_i with probability at least 2/3
• Theorem [MNSW]: the message must be Ω(s) bits on average

Lower Bound Proof
• Set s := Θ(c ε^-2 log(cn))
• Alice makes a matrix U
– Uses x_1 … x_s
• Bob makes matrices U' and B
– Uses i and x_{i+1}, …, x_s
• The algorithm's input will be A := U + U' and B
– A and B are n x c/2
• Alice:
– Runs the streaming matrix product Alg on U
– Sends Alg's state to Bob
– Bob continues Alg with A := U + U' and B
• A^T B determines x_i with probability at least 2/3
– By the choice of U, U', B
– Solving Augmented Indexing
• So the space of Alg must be Ω(s) = Ω(c ε^-2 log(cn)) bits

Lower Bound Details
• U = U(1); U(2); …; U(log(cn)); 0s (stacked blocks)
– Each U(k) is a Θ(ε^-2) x c/2 submatrix with entries in {-10^k, 10^k}
– U(k)_{i,j} = 10^k if the matched entry of x is 0, else U(k)_{i,j} = -10^k
• Bob's index i corresponds to U(k*)_{i*,j*}
• U' is such that A = U + U' = U(1); U(2); …; U(k*); 0s
– U' is determined from x_{i+1}, …, x_s
• A^T B is the i*-th row of U(k*)
• ||A|| ≈ ||U(k*)|| since the entries of A are geometrically increasing
• ε^2 ||A||^2 · ||B||^2, the squared error, is small, so most entries of the approximation to A^T B have the correct sign

Outline
• Matrix Product
• Linear Regression
• Low-rank approximation

Linear Regression
• The problem: min_X ||AX-B||
• The X* minimizing this is X* = A^- B, where A^- is the pseudoinverse of A
• Every matrix A = U Σ V^T via the singular value decomposition (SVD)
– If A is n x d of rank k, then
• U is n x k with orthonormal columns
• Σ is a k x k diagonal matrix with positive diagonal
• V is d x k with orthonormal columns
• A^- = V Σ^-1 U^T
• Normal equations: A^T A X = A^T B for the optimal X

Linear Regression
• Let S be an n x m sign matrix, m = O(d ε^-1 log(1/δ))
• The algorithm:
– Maintain S^T A and S^T B
– Return X' solving min_X ||S^T (AX-B)||
• Space is O(d^2 ε^-1 log(1/δ)) words
– Improves Sarlos' space by a log c factor
– Space is optimal via a new lower bound
• Main claim: with probability at least 1-δ, ||AX'-B|| ≤ (1+ε) ||AX*-B||
– That is, the relative error of X' is small
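A minimal numpy sketch of this regression algorithm (our illustration, not the talk's code; the constant in m is assumed, and numpy's lstsq stands in for solving the small sketched problem):

import numpy as np

rng = np.random.default_rng(1)
n, d = 10_000, 10
eps = 0.1
m = int(np.ceil(6 * d / eps))         # m = O(d * eps^-1 * log(1/delta)); constant assumed

A = rng.standard_normal((n, d))
B = A @ rng.standard_normal((d, 1)) + 0.1 * rng.standard_normal((n, 1))

S = rng.choice([-1.0, 1.0], size=(n, m))
StA, StB = S.T @ A, S.T @ B           # all the algorithm stores

X_sketch = np.linalg.lstsq(StA, StB, rcond=None)[0]   # solves min ||S^T(AX-B)||
X_opt = np.linalg.lstsq(A, B, rcond=None)[0]          # true minimizer, for comparison

res_sketch = np.linalg.norm(A @ X_sketch - B)
res_opt = np.linalg.norm(A @ X_opt - B)
print(res_sketch <= (1 + eps) * res_opt)   # main claim, w.p. >= 1 - delta

Note the sketched problem is m x d, so solving it takes time independent of n.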
Regression Analysis
• Why should X' solving min_X ||S^T (AX-B)|| be good?
• S^T approximately preserves AX-B for any fixed X
• If this worked for all X, we'd be done
• But S^T must preserve norms even for X', which is chosen using S
• First reduce to showing that ||A(X*-X')|| is small
• Use the normal equation A^T A X* = A^T B
– Implies ||AX'-B||^2 = ||AX*-B||^2 + ||A(X'-X*)||^2
• Bounding ||A(X'-X*)||^2 is equivalent to bounding ||U^T A(X'-X*)||^2, where A = U Σ V^T from the SVD, and U is an orthonormal basis of the column space of A

Regression Analysis Continued
• Bounding ||β||^2 := ||U^T A(X'-X*)||^2
– ||β|| ≤ ||U^T S S^T U β / m|| + ||U^T S S^T U β / m - β||
• The normal equations in sketch space imply (S^T A)^T (S^T A) X' = (S^T A)^T (S^T B)
• U^T S S^T U β = U^T S S^T A(X'-X*)   (since U U^T A = A)
  = U^T S S^T A(X'-X*) + U^T S S^T (B-AX')   (the added term is 0 by the sketch-space normal equations)
  = U^T S S^T (B-AX*)
• ||U^T S S^T U β / m|| = ||U^T S S^T (B-AX*) / m||
  ≤ (ε/k)^{1/2} ||U|| · ||B-AX*||   (new tail estimate)
  = ε^{1/2} ||B-AX*||   (since ||U|| = k^{1/2})

Regression Analysis Continued
• Hence, for ||β||^2 := ||U^T A(X'-X*)||^2,
– ||β|| ≤ ||U^T S S^T U β / m|| + ||U^T S S^T U β / m - β|| ≤ ε^{1/2} ||B-AX*|| + ||U^T S S^T U β / m - β||
• Recall the spectral norm: ||A||_2 = sup_x ||Ax|| / ||x||
• It implies ||CD|| ≤ ||C||_2 ||D||
• ||U^T S S^T U β / m - β|| ≤ ||U^T S S^T U / m - I||_2 ||β||
• Subspace JL: for m = Ω(k log(1/δ)), S^T approximately preserves the lengths of all vectors in a k-dimensional space
• So ||U^T S S^T U β / m - β|| ≤ ||β|| / 2
• Hence ||β|| ≤ 2 ε^{1/2} ||AX*-B||
• ||AX'-B||^2 = ||AX*-B||^2 + ||β||^2 ≤ ||AX*-B||^2 + 4ε ||AX*-B||^2 = (1+4ε) ||AX*-B||^2

Regression Lower Bound
• A tight Ω(d^2 log(nd) ε^-1) space lower bound
• Again a reduction from Augmented Indexing
• This time more complicated
• Embed log(nd) ε^-1 independent regression subproblems into the hard instance
– Uses deletions and the geometrically growing property, as in the matrix product lower bound
• Choose the entries of A and b so that the algorithm's output x encodes some entries of A

Regression Lower Bound
• A lower bound of Ω(d^2) is already tricky because of bit complexity
• Natural approach:
– Alice has a random d x d sign matrix A^-1
– b is a standard basis vector e_i
– Alice computes A = (A^-1)^-1 and puts it into the stream. The solution x to min_x ||Ax-b|| is the i-th column of A^-1
– Bob can isolate entries of A^-1, solving Indexing
• Wrong: A can have entries that are exponentially small!

Regression Lower Bound
• We design A and b together (Augmented Indexing):

  [ 1  A_{1,2}  A_{1,3}  A_{1,4}  A_{1,5} ] [x1]   [    0    ]
  [ 0    1      A_{2,3}  A_{2,4}  A_{2,5} ] [x2]   [ A_{2,4} ]
  [ 0    0        1      A_{3,4}  A_{3,5} ] [x3] = [ A_{3,4} ]
  [ 0    0        0        1      A_{4,5} ] [x4]   [    1    ]
  [ 0    0        0        0        1     ] [x5]   [    0    ]

• Back-substitution gives x5 = 0, x4 = 1, x3 = 0, x2 = 0, x1 = -A_{1,4}

Outline
• Matrix Product
• Regression
• Low-rank approximation

Best Low-Rank Approximation
• For any matrix A and integer k, there is a matrix A_k of rank k that is closest to A among all matrices of rank k
• Since the rank of A_k is k, it is the product C D^T of two k-column matrices C and D
– A_k can be found from the SVD (singular value decomposition) A = U Σ V^T, taking C = U_k and D = V_k Σ_k
– This is a good compression of A
– LSI, PCA, recommendation systems, clustering

Best Low-Rank Approximation
• Previously, nothing was known for 1-pass low-rank approximation with relative error
– Even for k = 1, the best upper bound was O(nd log(nd)) bits
– Problem 28 of [Mut]: can one get sublinear space?
• We get 1 pass and O(k ε^-2 (n + d ε^-2) log(nd)) space
• The update time is O(k ε^-4), so the total work is O(N k ε^-4), where N is the number of non-zero entries of A
• A new space lower bound shows this is optimal up to a factor of 1/ε

Best Low-Rank Approximation and S^T A
• The sketch S^T A holds information about A
• In particular, there is a rank-k matrix A_k' in the rowspace of S^T A nearly as close to A as the closest rank-k matrix A_k
– The rowspace of S^T A is the set of linear combinations of its rows
• That is, ||A-A_k'|| ≤ (1+ε) ||A-A_k||
• Why is there such an A_k'?

Low-Rank Approximation via Regression
• Apply the regression results with A → A_k and B → A
• The X' minimizing ||S^T (A_k X - A)|| has ||A_k X' - A|| ≤ (1+ε) ||A_k X* - A||
• But here X* = I, and X' = (S^T A_k)^- S^T A
• So the matrix A_k X' = A_k (S^T A_k)^- S^T A:
– Has rank k
– Is in the rowspace of S^T A
– Is within 1+ε of the smallest distance of any rank-k matrix
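A small numerical check of this claim (our illustration; the sketch size m is an assumed O(k/ε)-style choice): the matrix A_k (S^T A_k)^+ S^T A lies in the rowspace of S^T A and is nearly as good as the SVD truncation A_k itself.

import numpy as np

rng = np.random.default_rng(2)
n, d, k = 2_000, 100, 5
m = 200                                 # assumed sketch size, O(k / eps)-ish

# A = rank-k signal + noise, so the best rank-k approximation is meaningful
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.1 * rng.standard_normal((n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]       # best rank-k approximation, via SVD

S = rng.choice([-1.0, 1.0], size=(n, m))
A_k_prime = A_k @ np.linalg.pinv(S.T @ A_k) @ (S.T @ A)   # rank <= k, rows in rowspace(S^T A)

ratio = np.linalg.norm(A - A_k_prime) / np.linalg.norm(A - A_k)
print(ratio)                            # close to 1, i.e. (1 + eps)-competitive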
Low-Rank Approximation in 2 Passes
• We can't use A_k (S^T A_k)^- S^T A without finding A_k
• Instead: maintain S^T A
– Can show that if G^T has orthonormal rows, then the best rank-k approximation to A in the rowspace of G^T is A'_k G^T, where A'_k is the best rank-k approximation to AG
– After the 1st pass, compute an orthonormal basis G^T for the rowspace of S^T A
– In the 2nd pass, maintain AG
– Afterwards, compute A'_k and A'_k G^T

Low-Rank Approximation in 2 Passes
• A'_k is the best rank-k approximation to AG
• For any rank-k matrix Z,
||AGG^T - A'_k G^T|| = ||AG - A'_k|| ≤ ||AG - Z|| ≤ ||AGG^T - ZG^T||
• For all Y: (AGG^T - YG^T)(A - AGG^T)^T = 0, so we can apply the Pythagorean Theorem twice:
||A - A'_k G^T||^2 = ||A - AGG^T||^2 + ||AGG^T - A'_k G^T||^2
 ≤ ||A - AGG^T||^2 + ||AGG^T - ZG^T||^2
 = ||A - ZG^T||^2

1 Pass Algorithm
• With high probability, (*) ||AX'-B|| ≤ (1+ε) ||AX*-B||, where
– X* minimizes ||AX-B||
– X' minimizes ||S^T A X - S^T B||
• Apply (*) with A → AR and B → A, for a second sign matrix R, and X' minimizing ||S^T (ARX - A)||
• So X' = (S^T AR)^- S^T A has ||ARX' - A|| ≤ (1+ε) min_X ||ARX - A||
• The column space of AR contains a (1+ε)-approximation to A_k
• So ||ARX' - A|| ≤ (1+ε) min_X ||ARX - A|| ≤ (1+ε)^2 ||A - A_k||
• Key idea: ARX' = AR (S^T AR)^- S^T A is
– easy to compute in 1 pass with small space and fast update time
– behaves like A (similar to the SVD)
– used instead of A in our 2-pass algorithm! (a code sketch of the full pipeline appears at the end)

1 Pass Algorithm
• Algorithm:
– Maintain AR and S^T A
– Compute AR (S^T AR)^- S^T A
– Let G^T be an orthonormal basis for the rowspace of S^T A, as before
– Output the best rank-k approximation to AR (S^T AR)^- S^T A in the rowspace of S^T A
• Same as the 2-pass algorithm, except we don't need a second pass to project A onto the rowspace of S^T A
• The analysis is similar to that for regression

A Lower Bound
[Figure: the hard instance. Alice encodes a binary string x into a matrix A with k rows of signal (the remaining n-k rows are 0s), in blocks of k ε^-1 columns each, with entries ±10 in the first block, ±100 in the second, ±1000 in the third, and so on, growing geometrically. Bob holds index i and x_{i+1}, x_{i+2}, …, and deletes the blocks after the block of interest.]
• The error is now dominated by the block of interest
• Bob also inserts a k x k identity submatrix into the block of interest

Lower Bound Details
[Figure: the block of interest, k rows by k ε^-1 columns, containing P·I_k and 0s elsewhere; the other n-k rows are 0s.]
• Bob inserts a k x k identity submatrix, scaled by a large value P
• Show that any rank-k approximation must err on all of the shaded region
• So a good rank-k approximation likely has the correct sign on Bob's entry

Concluding Remarks
• Space bounds are tight for product and regression
– Sharpen prior upper bounds
– Prove optimal lower bounds
• Space bounds are off by a factor of ε^-1 for low-rank approximation
– First sub-linear (and near-optimal) 1-pass algorithm
– We have better upper bounds for restricted cases
• Improve the dependence on ε in the update time
• Lower bounds for multi-pass algorithms?
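As referenced above, a consolidated numpy sketch of the 1-pass low-rank algorithm (our illustration, not the authors' code; the sketch sizes m_S and m_R are illustrative guesses):

import numpy as np

rng = np.random.default_rng(3)
n, d, k = 2_000, 500, 5
m_S, m_R = 100, 100                    # assumed sketch sizes, O(k / eps)-ish

A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.1 * rng.standard_normal((n, d))

S = rng.choice([-1.0, 1.0], size=(n, m_S))
R = rng.choice([-1.0, 1.0], size=(d, m_R))

# The only state kept while streaming over A's entries:
AR, StA = A @ R, S.T @ A

# Post-processing (no further pass over A):
G, _ = np.linalg.qr(StA.T)             # columns of G = orthonormal basis of rowspace(S^T A)
W = AR @ np.linalg.pinv(S.T @ AR) @ StA   # proxy AR(S^T AR)^+ S^T A, behaves like A
WG = W @ G                             # project the proxy onto that rowspace
U, s, Vt = np.linalg.svd(WG, full_matrices=False)
W_k = (U[:, :k] * s[:k]) @ Vt[:k]      # best rank-k approximation of WG
A_k_approx = W_k @ G.T                 # final rank-k approximation of A

U2, s2, Vt2 = np.linalg.svd(A, full_matrices=False)
A_k = (U2[:, :k] * s2[:k]) @ Vt2[:k]   # best rank-k approximation, for comparison
print(np.linalg.norm(A - A_k_approx) / np.linalg.norm(A - A_k))   # close to 1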