Sketching as a Tool for Numerical
Linear Algebra
Lectures 1-2
David Woodruff
IBM Almaden
Massive data sets
Examples
 Internet traffic logs
 Financial data
 etc.
Algorithms
 Want nearly linear time or less
 Usually at the cost of a randomized approximation
2
Regression analysis
Regression
 Statistical method to study dependencies between
variables in the presence of noise.
3
Regression analysis
Linear Regression
 Statistical method to study linear dependencies
between variables in the presence of noise.
Example
 Ohm's law V = R ∙ I
 Find linear function that
best fits the data
[Figure "Example Regression": scatter of measurements, current (0–150) vs. voltage (0–250), with the fitted line]
6
Regression analysis
Linear Regression
 Statistical method to study linear dependencies between
variables in the presence of noise.
Standard Setting
 One measured variable b
 A set of predictor variables a_1, …, a_d
 Assumption:
b = x_0 + a_1 x_1 + … + a_d x_d + e
 e is assumed to be noise and the x_i are model
parameters we want to learn
 Can assume x_0 = 0
 Now consider n observations of b
7
Regression analysis
Matrix form
Input: nd-matrix A and a vector b=(b1,…, bn)
n is the number of observations; d is the number of
predictor variables
Output: x* so that Ax* and b are close
 Consider the over-constrained case, when n À d
 Can assume that A has full column rank
8
Regression analysis
Least Squares Method
 Find x* that minimizes |Ax-b|_2^2 = Σ_i (b_i – <A_{i*}, x>)²
 A_{i*} is the i-th row of A
 Certain desirable statistical properties
9
Regression analysis
Geometry of regression
 We want to find an x that minimizes |Ax-b|2
 The product Ax can be written as
A*1x1 + A*2x2 + ... + A*dxd
where A*i is the i-th column of A
 This is a linear d-dimensional subspace
 The problem is equivalent to computing the point of the
column space of A nearest to b in l2-norm
10
Regression analysis
Solving least squares regression via the normal equations
 How to find the solution x to min_x |Ax-b|_2 ?
 Equivalent problem: min_x |Ax-b|_2^2
 Write b = Ax' + b', where b' is orthogonal to the columns of A
 Cost is |A(x-x')|_2^2 + |b'|_2^2 by the Pythagorean theorem
 Optimal solution x if and only if A^T(Ax-b) = A^T(Ax-Ax') = 0
 Normal Equation: A^T A x = A^T b for any optimal x
 x = (A^T A)^{-1} A^T b
 If the columns of A are not linearly independent, the Moore-Penrose pseudoinverse gives a minimum norm solution x
11
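As a concrete illustration of the normal equations and the pseudoinverse route, here is a minimal Python/numpy sketch (not part of the lecture; sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20                        # over-constrained case: n >> d
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Normal equations: x = (A^T A)^{-1} A^T b (assumes A has full column rank)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Moore-Penrose pseudoinverse: minimum-norm solution, works without full column rank
x_pinv = np.linalg.pinv(A) @ b

print(np.linalg.norm(A @ x_normal - b), np.linalg.norm(A @ x_pinv - b))  # equal costs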
Moore-Penrose Pseudoinverse
• Any optimal solution x has the form A^− b + (I − V′V′^T) z, where
V′ corresponds to the rows i of V^T for which Σ_{i,i} > 0 (writing A = UΣV^T for the SVD of A)
• Why?
• Because A (I − V′V′^T) z = 0, so A^− b + (I − V′V′^T) z is a
solution. This is a (d − rank(A))-dimensional affine space, so it
spans all optimal solutions
• Since A^− b is in the column span of V′, by the Pythagorean theorem,
|A^− b + (I − V′V′^T) z|_2^2 = |A^− b|_2^2 + |(I − V′V′^T) z|_2^2 ≥ |A^− b|_2^2
12
Time Complexity
Solving least squares regression via the normal equations
 Need to compute x = A^− b
 Naively this takes nd^2 time
 Can do nd^{1.376} using fast matrix multiplication
 But we want much better running time!
13
Sketching to solve least squares regression
 How to find an approximate solution x to min_x |Ax-b|_2 ?
 Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2
with high probability
 Draw S from a k x n random family of matrices, for a
value k << n
 Compute S*A and S*b
 Output the solution x' to min_x |(SA)x-(Sb)|_2
 x' = (SA)^− Sb
14
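The sketch-and-solve recipe above can be tried directly; here is a hedged numpy sketch (assuming a dense Gaussian S, one of the choices discussed next, and an arbitrary synthetic instance):

import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 100_000, 10, 0.5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

k = int(d / eps**2)                            # k << n rows
S = rng.standard_normal((k, n)) / np.sqrt(k)   # i.i.d. N(0, 1/k) entries

x_sketch = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]   # solve the small problem
x_opt = np.linalg.lstsq(A, b, rcond=None)[0]              # exact solution, for comparison

# cost ratio |Ax'-b| / min_x |Ax-b| should be at most about 1+eps
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_opt - b))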
How to choose the right sketching matrix S?
 Recall: output the solution x' to min_x |(SA)x-(Sb)|_2
 Lots of matrices work
 S is a d/ε^2 x n matrix of i.i.d. Normal random variables
 To see why this works, we
introduce the notion of a
subspace embedding
15
Subspace Embeddings
• Let k = O(d/ε^2)
• Let S be a k x n matrix of i.i.d. normal
N(0,1/k) random variables
• For any fixed d-dimensional subspace, i.e.,
the column space of an n x d matrix A
– W.h.p., for all x in R^d, |SAx|_2 = (1±ε)|Ax|_2
• Entire column space of A is preserved
Why is this true?
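Before the proof, a quick numerical sanity check of the subspace-embedding property (a sketch, not from the slides; the constant 10 in k is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(2)
n, d, eps = 10_000, 5, 0.25
k = int(10 * d / eps**2)
A = np.linalg.qr(rng.standard_normal((n, d)))[0]   # orthonormal columns
S = rng.standard_normal((k, n)) / np.sqrt(k)       # i.i.d. N(0, 1/k) entries
SA = S @ A

ratios = []
for _ in range(1000):                              # many directions in the column space
    x = rng.standard_normal(d)
    ratios.append(np.linalg.norm(SA @ x) / np.linalg.norm(A @ x))
print(min(ratios), max(ratios))                    # both should lie within 1 ± eps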
Subspace Embeddings – A Proof
• Want to show |SAx|2 = (1±ε)|Ax|2 for all x
• Can assume columns of A are orthonormal
(since we prove this for all x)
• Claim: SA is a k x d matrix of i.i.d. N(0,1/k)
random variables
– First property: for two independent random variables X
and Y, with X drawn from N(0, a^2) and Y drawn from
N(0, b^2), we have that X+Y is drawn from N(0, a^2 + b^2)
• Why? The probability density function f_Z of Z = X+Y is the
convolution of the probability density functions f_X and f_Y:
f_Z(z) = ∫ f_Y(z−x) f_X(x) dx, where
f_X(x) = (1/(a (2π)^{.5})) e^{−x^2/(2a^2)} and f_Y(y) = (1/(b (2π)^{.5})) e^{−y^2/(2b^2)}
• Calculation (completing the square in the exponent):
e^{−(z−x)^2/(2b^2)} ⋅ e^{−x^2/(2a^2)} = e^{−z^2/(2(a^2+b^2))} ⋅ e^{−(x − a^2 z/(a^2+b^2))^2 (a^2+b^2)/(2a^2 b^2)}
• Density of Z is a Gaussian distribution:
f_Z(z) = (1/((2π)^{.5} (a^2+b^2)^{.5})) e^{−z^2/(2(a^2+b^2))} ∫ ((a^2+b^2)^{.5}/((2π)^{.5} a b)) e^{−(x − a^2 z/(a^2+b^2))^2 (a^2+b^2)/(2a^2 b^2)} dx,
and the remaining integral equals 1, so Z is drawn from N(0, a^2 + b^2)
Rotational Invariance
• Second property: if u, v are vectors with <u, v> = 0,
then <g,u> and <g,v> are independent, where g is a
vector of i.i.d. N(0,1/k) random variables
• Why?
• If g is an n-dimensional vector of i.i.d. N(0,1)
random variables, and R is a fixed matrix, then
the probability density function of Rg is
f(x) = (1/( det(RR^T)^{1/2} (2π)^{n/2} )) ⋅ e^{−x^T (RR^T)^{−1} x / 2}
– RRT is the covariance matrix
– For a rotation matrix R, the distribution of Rg
and of g are the same
Orthogonal Implies Independent
• Want to show: if u, v are vectors with <u, v> = 0, then
<g,u> and <g,v> are independent, where g is a vector of
i.i.d. N(0,1/k) random variables
• Choose a rotation R which sends u to αe1 , and sends v
to βe2
• <g, u> = <Rg, Ru> = <h, αe_1> = α h_1
• <g, v> = <Rg, Rv> = <h, βe_2> = β h_2
where h = Rg is a vector of i.i.d. N(0, 1/k) random variables
• Then h1 and h2 are independent by definition
Where were we?
• Claim: SA is a k x d matrix of i.i.d. N(0,1/k) random
variables
• Proof: The rows of SA are independent
– Each row is: < g, A1 >, < g, A2 > , … , < g, Ad >
– First property implies the entries in each row are
N(0,1/k) since the columns Ai have unit norm
– Since the columns Ai are orthonormal, the entries in a
row are independent by our second property
Back to Subspace Embeddings
• Want to show |SAx|_2 = (1±ε)|Ax|_2 for all x
• Can assume columns of A are orthonormal
• Can also assume x is a unit vector
• SA is a k x d matrix of i.i.d. N(0,1/k) random variables
• Consider any fixed unit vector x ∈ R^d
• |SAx|_2^2 = Σ_{i∈[k]} <g^i, x>^2, where g^i is the i-th row of SA
• Each <g^i, x> is distributed as N(0, 1/k)
• E[<g^i, x>^2] = 1/k, and so E[|SAx|_2^2] = 1
How concentrated is |SAx|_2^2 about its expectation?
Johnson-Lindenstrauss Theorem
• Suppose h1 , … , hk are i.i.d. N(0,1) random variables
• Then G = Σ_i h_i^2 is a χ^2-random variable with k degrees of freedom
• Apply known tail bounds to G:
– (Upper) Pr[ G ≥ k + 2(kx)^{.5} + 2x ] ≤ e^{−x}
– (Lower) Pr[ G ≤ k − 2(kx)^{.5} ] ≤ e^{−x}
• If x = ε^2 k/16, then Pr[ G ∈ k(1 ± ε) ] ≥ 1 − 2e^{−ε^2 k/16}
• If k = Θ(ε^{−2} log(1/δ)), this probability is at least 1 − δ
• For our choice k = Θ(d/ε^2), Pr[ |SAx|_2^2 ∈ 1 ± ε ] ≥ 1 − 2^{−Θ(d)}
This only holds for a fixed x, how to argue for all x?
Net for Sphere
• Consider the sphere Sd−1
• Subset N is a γ-net if for all x ∈ Sd−1 , there is a y ∈ N,
such that x − y 2 ≤ γ
• Greedy construction of N
– While there is a point x ∈ Sd−1 of distance larger than
γ from every point in N, include x in N
• The sphere of radius γ/2 around every point in N is
contained in the sphere of radius 1 + γ around 0^d
• Further, all such spheres are disjoint
• Ratio of the volume of a d-dimensional sphere of radius 1 + γ
to that of a d-dimensional sphere of radius γ/2 is (1 + γ)^d/(γ/2)^d, so
|N| ≤ (1 + γ)^d/(γ/2)^d
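A small illustrative sketch (not from the slides) of the greedy net construction, using random candidate points on the sphere rather than an exhaustive sweep:

import numpy as np

rng = np.random.default_rng(3)
d, gamma = 3, 0.5

net = []
for _ in range(20_000):                     # random candidates on S^{d-1}
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    # greedy rule: add x if it is farther than gamma from every current net point
    if all(np.linalg.norm(x - y) > gamma for y in net):
        net.append(x)

print(len(net), (1 + gamma)**d / (gamma / 2)**d)   # observed net size vs. the volume bound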
Net for Subspace
• Let M = {Ax | x in N}, so |M| ≤ (1 + γ)^d/(γ/2)^d
• Claim: For every x in S^{d−1}, there is a y in M for which
|Ax − y|_2 ≤ γ
• Proof: Let x′ in N be such that |x − x′|_2 ≤ γ.
Then |Ax − Ax′|_2 = |x − x′|_2 ≤ γ, using that the
columns of A are orthonormal. Set y = Ax′
Net Argument
• For a fixed unit x, Pr[ |SAx|_2^2 ∈ 1 ± ε ] ≥ 1 − 2^{−Θ(d)}
• For a fixed pair of unit x, x′: |SAx|_2^2, |SAx′|_2^2, |SA(x − x′)|_2^2
are all within 1 ± ε of their un-sketched counterparts with probability 1 − 2^{−Θ(d)}
• |SA(x − x′)|_2^2 = |SAx|_2^2 + |SAx′|_2^2 − 2<SAx, SAx′>
• |A(x − x′)|_2^2 = |Ax|_2^2 + |Ax′|_2^2 − 2<Ax, Ax′>
– So Pr[ <Ax, Ax′> = <SAx, SAx′> ± O(ε) ] ≥ 1 − 2^{−Θ(d)}
• Choose a ½-net M = {Ax | x in N} of size 5^d
• By a union bound, for all pairs y, y′ in M,
<y, y′> = <Sy, Sy′> ± O(ε)
• Condition on this event
• By linearity, if this holds for y, y′ in M, then for αy, βy′ we have
<αy, βy′> = αβ <Sy, Sy′> ± O(ε αβ)
Finishing the Net Argument
• Let y = Ax for an arbitrary x ∈ S^{d−1}
• Let y_1 ∈ M be such that |y − y_1|_2 ≤ γ
• Let α be such that |α(y − y_1)|_2 = 1
– α ≥ 1/γ (could be infinite)
• Let y_2′ ∈ M be such that |α(y − y_1) − y_2′|_2 ≤ γ
• Set y_2 = y_2′/α. Then |y − y_1 − y_2|_2 ≤ γ/α ≤ γ^2
• Repeat, obtaining y_1, y_2, y_3, … such that for
all integers i,
|y − y_1 − y_2 − … − y_i|_2 ≤ γ^i
• Implies |y_i|_2 ≤ γ^{i−1} + γ^i ≤ 2γ^{i−1}
Finishing the Net Argument
• Have y_1, y_2, y_3, … such that y = Σ_i y_i and |y_i|_2 ≤ 2γ^{i−1}
• |Sy|_2^2 = |S Σ_i y_i|_2^2
= Σ_i |Sy_i|_2^2 + 2 Σ_{i<j} <Sy_i, Sy_j>
= Σ_i |y_i|_2^2 + 2 Σ_{i<j} <y_i, y_j> ± O(ε) Σ_{i,j} |y_i|_2 |y_j|_2
= |Σ_i y_i|_2^2 ± O(ε)   (since Σ_{i,j} |y_i|_2 |y_j|_2 ≤ (Σ_i 2γ^{i−1})^2 = O(1) for γ = ½)
= |y|_2^2 ± O(ε)
= 1 ± O(ε)
• Since this held for an arbitrary y = Ax for unit x, by
linearity it follows that for all x, |SAx|_2 = (1±ε)|Ax|_2
Back to Regression
• We showed that S is a subspace
embedding, that is, simultaneously for all x,
|SAx|2 = (1±ε)|Ax|2
What does this have to do with regression?
Subspace Embeddings for
Regression
• Want x so that |Ax-b|_2 ≤ (1+ε) min_y |Ay-b|_2
• Consider subspace L spanned by columns of A
together with b
• Then for all y in L, |Sy|2 = (1± ε) |y|2
• Hence, |S(Ax-b)|2 = (1± ε) |Ax-b|2 for all x
• Solve argminy |(SA)y – (Sb)|2
• Given SA, Sb, can solve in poly(d/ε) time
Only problem is computing SA takes O(nd^2) time
How to choose the right sketching matrix S? [S]
 S is a Subsampled Randomized Hadamard Transform
 S = P*H*D
 D is a diagonal matrix with +1, -1 entries on the diagonal
 H is the Hadamard transform
 P just chooses a random (small) subset of rows of H*D
 S*A can be computed in O(nd log n) time
Why does it work?
31
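A hedged sketch (not the lecture's code) of applying S = P*H*D with a fast Walsh-Hadamard transform; n is assumed to be a power of 2, and the scaling sqrt(n/k) is one common normalization that keeps E[|SAx|^2] = |Ax|^2:

import numpy as np

def fwht(x):
    # iterative fast Walsh-Hadamard transform along axis 0 (unnormalized)
    y = x.copy()
    h = 1
    n = y.shape[0]
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b
            y[i + h:i + 2 * h] = a - b
        h *= 2
    return y

rng = np.random.default_rng(4)
n, d, k = 2**12, 8, 256
A = rng.standard_normal((n, d))

D = rng.choice([-1.0, 1.0], size=n)            # random +1/-1 diagonal
HDA = fwht(D[:, None] * A) / np.sqrt(n)        # H*D*A, scaled so HD is a rotation
rows = rng.integers(0, n, size=k)              # P: sample k rows uniformly with replacement
SA = np.sqrt(n / k) * HDA[rows]

x = rng.standard_normal(d)
print(np.linalg.norm(SA @ x) / np.linalg.norm(A @ x))   # should be close to 1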
Why does this work?
 We can again assume columns of A are orthonormal
 It suffices to show |SAx|_2^2 = |PHDAx|_2^2 = 1 ± ε for all unit x
 HD is a rotation matrix, so |HDAx|_2^2 = |Ax|_2^2 = 1 for any unit x
 Notation: let y = Ax
 Flattening Lemma: For any fixed y,
Pr[ |HDy|_∞ ≥ C log^{.5}(nd/δ) / n^{.5} ] ≤ δ/(2d)
32
Proving the Flattening Lemma
 Flattening Lemma: Pr[ |HDy|_∞ ≥ C log^{.5}(nd/δ) / n^{.5} ] ≤ δ/(2d)
 Let C > 0 be a constant. We will show for a fixed i in [n],
Pr[ |(HDy)_i| ≥ C log^{.5}(nd/δ) / n^{.5} ] ≤ δ/(2nd)
 If we show this, we can apply a union bound over all i
 (HDy)_i = Σ_j H_{i,j} D_{j,j} y_j
 (Azuma-Hoeffding) Pr[ |Σ_j Z_j| > t ] ≤ 2e^{−t^2/(2 Σ_j β_j^2)}, where |Z_j| ≤ β_j with probability 1
 Z_j = H_{i,j} D_{j,j} y_j has 0 mean
 |Z_j| ≤ |y_j| / n^{.5} = β_j with probability 1
 Σ_j β_j^2 = 1/n
 Pr[ |Σ_j Z_j| > C log^{.5}(nd/δ) / n^{.5} ] ≤ 2e^{−C^2 log(nd/δ)/2} ≤ δ/(2nd)
33
Consequence of the Flattening Lemma
 Recall columns of A are orthonormal
 HDA has orthonormal columns
 Flattening Lemma implies |HDAe_i|_∞ ≤ C log^{.5}(nd/δ) / n^{.5} with
probability 1 − δ/(2d) for a fixed i ∈ [d]
 With probability 1 − δ/2,
|e_j^T HDA e_i| ≤ C log^{.5}(nd/δ) / n^{.5} for all i,j
 Given this, |e_j^T HDA|_2 ≤ C d^{.5} log^{.5}(nd/δ) / n^{.5} for all j
(Can be optimized further)
34
Matrix Chernoff Bound
 Let X_1, …, X_s be independent copies of a symmetric random matrix X ∈ R^{d×d}
with E[X] = 0, |X|_2 ≤ γ, and |E[X^T X]|_2 ≤ σ^2. Let W = (1/s) Σ_{i∈[s]} X_i. For any ε > 0,
Pr[ |W|_2 > ε ] ≤ 2d ⋅ e^{−sε^2/(σ^2 + γε/3)}
(here |W|_2 = sup |Wx|_2 / |x|_2)
 Let V = HDA, and recall V has orthonormal columns
 Suppose P in the S = PHD definition samples uniformly with replacement. If
row i is sampled in the j-th sample, then P_{j,i} = (n/s)^{.5}, and is 0 otherwise
 Let Y_i be the i-th sampled row of V = HDA
 Let X_i = I_d − n ⋅ Y_i^T Y_i
 E[X_i] = I_d − n ⋅ (1/n) Σ_j V_j^T V_j = I_d − V^T V = 0_d
 |X_i|_2 ≤ |I_d|_2 + n ⋅ max_j |e_j^T HDA|_2^2 = 1 + n ⋅ C^2 log(nd/δ) ⋅ (d/n) = Θ(d log(nd/δ))
35
Matrix Chernoff Bound
 Recall: let Y_i be the i-th sampled row of V = HDA, and let X_i = I_d − n ⋅ Y_i^T Y_i
 E[X^T X] + I_d = I_d + I_d − 2n E[Y_i^T Y_i] + n^2 E[Y_i^T Y_i Y_i^T Y_i]
= 2I_d − 2I_d + n^2 ⋅ (1/n) Σ_i v_i^T v_i v_i^T v_i
= n Σ_i v_i^T v_i ⋅ |v_i|_2^2, where v_i is the i-th row of V and |v_i|_2^2 ≤ C^2 log(nd/δ) ⋅ d/n
 Define Z = n Σ_i v_i^T v_i ⋅ C^2 log(nd/δ) ⋅ d/n
 Note that E[X^T X] + I_d and Z are real symmetric, with non-negative eigenvalues
 Claim: for all vectors y, we have y^T E[X^T X] y + y^T y ≤ y^T Z y
 Proof: y^T E[X^T X] y + y^T y = n Σ_i <v_i, y>^2 |v_i|_2^2, and
y^T Z y = n Σ_i <v_i, y>^2 C^2 log(nd/δ) ⋅ d/n
 Hence, |E[X^T X]|_2 ≤ |E[X^T X] + I_d|_2 + |I_d|_2 ≤ |Z|_2 + 1 ≤ C^2 d log(nd/δ) + 1
 Hence, |E[X^T X]|_2 = O(d log(nd/δ))
36
Matrix Chernoff Bound
 Hence, |E[X^T X]|_2 = O(d log(nd/δ))
 Recall: (Matrix Chernoff) Let X_1, …, X_s be independent copies of a
symmetric random matrix X ∈ R^{d×d} with E[X] = 0, |X|_2 ≤ γ, and |E[X^T X]|_2 ≤ σ^2.
Let W = (1/s) Σ_{i∈[s]} X_i. For any ε > 0, Pr[ |W|_2 > ε ] ≤ 2d ⋅ e^{−sε^2/(σ^2 + γε/3)}
 Here W = I_d − (PHDA)^T PHDA, so
Pr[ |I_d − (PHDA)^T PHDA|_2 > ε ] ≤ 2d ⋅ e^{−sε^2/Θ(d log(nd/δ))}
 Set s = d log(nd/δ) ⋅ log(d/δ) / ε^2 to make this probability less than δ/2
37
SRHT Wrapup
 Have shown |I_d − (PHDA)^T PHDA|_2 < ε using the Matrix
Chernoff Bound and with s = d log(nd/δ) ⋅ log(d/δ) / ε^2 samples
 Implies for every unit vector x,
|1 − |PHDAx|_2^2| = |x^T x − x^T (PHDA)^T (PHDA) x| < ε,
so |PHDAx|_2^2 ∈ 1 ± ε for all unit vectors x
 Considering the column span of A adjoined with b, we can
again solve the regression problem
 The time for regression is now only O(nd log n) + poly(d log n / ε). Nearly optimal in matrix dimensions (n >> d)
38
Faster Subspace Embeddings S [CW,MM,NN]
 CountSketch matrix
 Define k x n matrix S, for k = O(d^2/ε^2)
 S is really sparse: single randomly chosen non-zero
entry per column. Example:
[  0  0  1  0  0  1  0  0
   1  0  0  0  0  0  0  0
   0  0  0 -1  1  0 -1  0
   0 -1  0  0  0  0  0  1 ]
 nnz(A) is the number of non-zero entries of A
Can compute S ⋅ A in nnz(A) time!
39
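A minimal CountSketch sketch in numpy/scipy (an illustration, not the lecture's code): S has one random ±1 per column, so applying it costs O(nnz(A)):

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(5)
n, d, k = 50_000, 10, 2_000
A = rng.standard_normal((n, d))

h = rng.integers(0, k, size=n)                 # bucket for each of the n coordinates
sigma = rng.choice([-1.0, 1.0], size=n)        # random sign for each coordinate
S = sp.csr_matrix((sigma, (h, np.arange(n))), shape=(k, n))   # one nonzero per column

SA = S @ A                                     # O(nnz(A)) when A is sparse
x = rng.standard_normal(d)
print(np.linalg.norm(SA @ x) / np.linalg.norm(A @ x))   # close to 1; the subspace guarantee needs k = O(d^2/eps^2)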
Simple Proof [Nguyen]
 Can assume columns of A are orthonormal
 Suffices to show |SAx|_2 = 1 ± ε for all unit x
 For regression, apply S to [A, b]
 SA is a 2d^2/ε^2 x d matrix
 Suffices to show |A^T S^T SA – I|_2 ≤ |A^T S^T SA – I|_F ≤ ε
 Matrix product result shown below:
Pr[ |CS^T SD – CD|_F^2 ≤ [3/(δ(# rows of S))] ⋅ |C|_F^2 |D|_F^2 ] ≥ 1 − δ
 Set C = A^T and D = A.
 Then |A|_F^2 = d and (# rows of S) = 3d^2/(δε^2)
40
Matrix Product Result
 Show: Pr[ |CS^T SD – CD|_F^2 ≤ [3/(δ(# rows of S))] ⋅ |C|_F^2 |D|_F^2 ] ≥ 1 − δ
 (JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL
moment property if for all x ∈ R^n with |x|_2 = 1,
E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ ⋅ δ
 (From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on
matrices S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL
moment property for some ℓ ≥ 2. Then for A, B matrices with n rows,
Pr_S[ |A^T S^T SB − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ
41
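A hedged numerical check (not from the slides) of this approximate matrix product bound, instantiated with a CountSketch S built by bucket accumulation:

import numpy as np

rng = np.random.default_rng(6)
n, p, q, k = 10_000, 15, 12, 3_000
C = rng.standard_normal((p, n))
D = rng.standard_normal((n, q))

h = rng.integers(0, k, size=n)                 # CountSketch hash buckets
sigma = rng.choice([-1.0, 1.0], size=n)        # CountSketch signs
SCt = np.zeros((k, p))
SD = np.zeros((k, q))
np.add.at(SCt, h, sigma[:, None] * C.T)        # accumulate S C^T bucket by bucket
np.add.at(SD, h, sigma[:, None] * D)           # accumulate S D

err = np.linalg.norm(SCt.T @ SD - C @ D)       # |C S^T S D - C D|_F
bound = np.sqrt(3.0 / (0.1 * k)) * np.linalg.norm(C) * np.linalg.norm(D)   # delta = 0.1
print(err, bound)                              # err should fall well below the bound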
From Vectors to Matrices
 (From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices
S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL moment property
for some ℓ ≥ 2. Then for A, B matrices with n rows,
Pr_S[ |A^T S^T SB − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ
 Proof: For a random scalar X, let |X|_p = (E|X|^p)^{1/p}
 Sometimes consider X = |T|_F for a random matrix T, i.e., | |T|_F |_p = (E |T|_F^p)^{1/p}
 Can show |⋅|_p is a norm
 Minkowski's Inequality: |X + Y|_p ≤ |X|_p + |Y|_p
 For unit vectors x, y, need to bound | <Sx, Sy> − <x, y> |_ℓ
42
From Vectors to Matrices
 For unit vectors x, y,
| <Sx, Sy> − <x, y> |_ℓ
= (1/2) | (|Sx|_2^2 − 1) + (|Sy|_2^2 − 1) − (|S(x−y)|_2^2 − |x−y|_2^2) |_ℓ
≤ (1/2) ( | |Sx|_2^2 − 1 |_ℓ + | |Sy|_2^2 − 1 |_ℓ + | |S(x−y)|_2^2 − |x−y|_2^2 |_ℓ )
≤ (1/2) ( ε δ^{1/ℓ} + ε δ^{1/ℓ} + |x−y|_2^2 ε δ^{1/ℓ} )
≤ 3 ε ⋅ δ^{1/ℓ}
 By linearity, for arbitrary x, y,
| ( <Sx, Sy> − <x, y> ) / (|x|_2 |y|_2) |_ℓ ≤ 3 ε ⋅ δ^{1/ℓ}
 Suppose A has d columns and B has e columns. Let the columns of A be
A_1, …, A_d and the columns of B be B_1, …, B_e
 Define X_{i,j} = (1/(|A_i|_2 |B_j|_2)) ⋅ ( <SA_i, SB_j> − <A_i, B_j> )
 |A^T S^T SB − A^T B|_F^2 = Σ_i Σ_j |A_i|_2^2 |B_j|_2^2 X_{i,j}^2
43
From Vectors to Matrices
 Have shown: for arbitrary x, y, | ( <Sx, Sy> − <x, y> ) / (|x|_2 |y|_2) |_ℓ ≤ 3 ε ⋅ δ^{1/ℓ}
 For X_{i,j} = (1/(|A_i|_2 |B_j|_2)) ⋅ ( <SA_i, SB_j> − <A_i, B_j> ):
|A^T S^T SB − A^T B|_F^2 = Σ_{i,j} |A_i|_2^2 |B_j|_2^2 X_{i,j}^2
 | |A^T S^T SB − A^T B|_F^2 |_{ℓ/2} = | Σ_{i,j} |A_i|_2^2 |B_j|_2^2 X_{i,j}^2 |_{ℓ/2}
≤ Σ_{i,j} |A_i|_2^2 |B_j|_2^2 | X_{i,j}^2 |_{ℓ/2}
= Σ_{i,j} |A_i|_2^2 |B_j|_2^2 | X_{i,j} |_ℓ^2
≤ (3 ε δ^{1/ℓ})^2 Σ_{i,j} |A_i|_2^2 |B_j|_2^2
= (3 ε δ^{1/ℓ})^2 |A|_F^2 |B|_F^2
 Since E[ |A^T S^T SB − A^T B|_F^ℓ ] = ( | |A^T S^T SB − A^T B|_F^2 |_{ℓ/2} )^{ℓ/2}, by Markov's inequality,
Pr[ |A^T S^T SB − A^T B|_F > 3ε |A|_F |B|_F ] ≤ ( 1/(3ε |A|_F |B|_F) )^ℓ ⋅ E[ |A^T S^T SB − A^T B|_F^ℓ ] ≤ δ
44
Result for Vectors
 Show: Pr[ |CS^T SD – CD|_F^2 ≤ [3/(δ(# rows of S))] ⋅ |C|_F^2 |D|_F^2 ] ≥ 1 − δ
 (JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL moment
property if for all x ∈ R^n with |x|_2 = 1,
E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ ⋅ δ
 (From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices
S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL moment property
for some ℓ ≥ 2. Then for A, B matrices with n rows,
Pr_S[ |A^T S^T SB − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ
 Just need to show that the CountSketch matrix S satisfies the JL property and
bound the number k of rows
45
CountSketch Satisfies the JL Property
 (JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL moment
property if for all x ∈ R^n with |x|_2 = 1,
E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ ⋅ δ
 We show this property holds with ℓ = 2. First, let us consider ℓ = 1
 For the CountSketch matrix S, let
 h: [n] -> [k] be a 2-wise independent hash function
 σ: [n] -> {−1, 1} be a 4-wise independent hash function
 Let δ(E) = 1 if event E holds, and δ(E) = 0 otherwise
 E[ |Sx|_2^2 ] = Σ_{j∈[k]} E[ ( Σ_{i∈[n]} δ(h(i) = j) σ_i x_i )^2 ]
= Σ_{j∈[k]} Σ_{i1,i2∈[n]} E[ δ(h(i1) = j) δ(h(i2) = j) σ_{i1} σ_{i2} ] x_{i1} x_{i2}
= Σ_{j∈[k]} Σ_{i∈[n]} E[ δ(h(i) = j)^2 ] x_i^2
= (1/k) Σ_{j∈[k]} Σ_{i∈[n]} x_i^2 = |x|_2^2
46
CountSketch Satisfies the JL Property
 E[ |Sx|_2^4 ] = E[ Σ_{j∈[k]} ( Σ_{i∈[n]} δ(h(i) = j) σ_i x_i )^2 ⋅ Σ_{j′∈[k]} ( Σ_{i′∈[n]} δ(h(i′) = j′) σ_{i′} x_{i′} )^2 ]
= Σ_{j1,j2,i1,i2,i3,i4} E[ σ_{i1} σ_{i2} σ_{i3} σ_{i4} δ(h(i1) = j1) δ(h(i2) = j1) δ(h(i3) = j2) δ(h(i4) = j2) ] x_{i1} x_{i2} x_{i3} x_{i4}
 We must be able to partition {i1, i2, i3, i4} into equal pairs
 Suppose i1 = i2 = i3 = i4. Then necessarily j1 = j2. Obtain Σ_j (1/k) Σ_i x_i^4 = |x|_4^4
 Suppose i1 = i2 and i3 = i4 but i1 ≠ i3. Then get Σ_{j1,j2} Σ_{i1≠i3} (1/k^2) x_{i1}^2 x_{i3}^2 ≤ |x|_2^4
 Suppose i1 = i3 and i2 = i4 but i1 ≠ i2. Then necessarily j1 = j2. Obtain
Σ_j (1/k^2) Σ_{i1≠i2} x_{i1}^2 x_{i2}^2 ≤ (1/k) |x|_2^4. Obtain the same bound if i1 = i4 and i2 = i3.
 Hence, E[ |Sx|_2^4 ] ∈ [ |x|_2^4, |x|_2^4 (1 + 2/k) ] = [1, 1 + 2/k]
 So, E_S ( |Sx|_2^2 − 1 )^2 = E[ |Sx|_2^4 ] − 2 E[ |Sx|_2^2 ] + 1 ≤ 1 + 2/k − 2 + 1 = 2/k.
Setting k = 2/(ε^2 δ) finishes the proof
47
Where are we?
 (JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL
moment property if for all x ∈ R^n with |x|_2 = 1,
E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ ⋅ δ
 (From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on
matrices S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL moment
property for some ℓ ≥ 2. Then for A, B matrices with n rows,
Pr_S[ |A^T S^T SB − A^T B|_F^2 ≥ (3ε)^2 |A|_F^2 |B|_F^2 ] ≤ δ
 We showed CountSketch has the JL property with ℓ = 2 and k = 2/(ε^2 δ)
 Matrix product result we wanted was:
Pr[ |CS^T SD – CD|_F^2 ≤ (3/(δk)) ⋅ |C|_F^2 |D|_F^2 ] ≥ 1 − δ
 We are now done with the proof that CountSketch is a subspace embedding
48
Course Outline
 Subspace embeddings and least squares regression
 Gaussian matrices
 Subsampled Randomized Hadamard Transform
 CountSketch
 Affine embeddings
 Application to low rank approximation
 High precision regression
49
Affine Embeddings

Want to solve min 𝐴𝑋 − 𝐵 2𝐹 , A is tall and thin with d columns, but B has a large
𝑋
number of columns

Can’t directly apply subspace embeddings

Let’s try to show SAX − SB
we need of S

Can assume A has orthonormal columns

Let 𝐵 ∗ = 𝐴𝑋 ∗ − 𝐵, where 𝑋 ∗ is the optimum

F
= 1 ± ϵ AX − B
F
for all X and see what properties
𝑆 𝐴𝑋 − 𝐵 2𝐹 − 𝑆𝐵 ∗ 2𝐹 = 𝑆𝐴 𝑋 − 𝑋 ∗ + 𝑆 𝐴𝑋 ∗ − 𝐵 2𝐹 − 𝑆𝐵 ∗ 2𝐹
= 𝑆𝐴 𝑋 − 𝑋 ∗ 2𝐹 − 2𝑡𝑟[ 𝑋 − 𝑋 ∗ 𝑇 𝐴𝑇 𝑆 𝑇 𝑆𝐵∗ ]
∈ 𝑆𝐴 𝑋 − 𝑋 ∗ 2𝐹 ± 2 𝑋 − 𝑋 ∗ 𝐹 𝐴𝑇 𝑆 𝑇 𝑆𝐵 ∗ 𝐹 (use 𝑡𝑟 𝐶𝐷 ≤ 𝐶 𝐹 𝐷 𝐹 )
∈ 𝑆𝐴 𝑋 − 𝑋 ∗ 2𝐹 ±2𝜖 𝑋 − 𝑋 ∗ 𝐹 B ∗ F (if we have approx. matrix product)
∈ 𝐴 𝑋 − 𝑋 ∗ 2𝐹 ± 𝜖( 𝐴 𝑋 − 𝑋 ∗ 2𝐹 + 2 𝑋 − 𝑋 ∗ 𝐹 𝐵 ∗ ) (subspace embedding for50A)
Affine Embeddings
 We have |S(AX − B)|_F^2 − |SB*|_F^2 ∈ |A(X − X*)|_F^2 ± ε( |A(X − X*)|_F^2 + 2 |X − X*|_F |B*|_F )
 Normal equations imply that
|AX − B|_F^2 = |A(X − X*)|_F^2 + |B*|_F^2
 |SB*|_F^2 = (1 ± ε) |B*|_F^2 (this holds with constant probability)
 Hence, |S(AX − B)|_F^2 − |SB*|_F^2 − ( |AX − B|_F^2 − |B*|_F^2 )
∈ ±ε( |A(X − X*)|_F^2 + 2 |X − X*|_F |B*|_F )
⊆ ±ε ( |A(X − X*)|_F + |B*|_F )^2   (since |X − X*|_F = |A(X − X*)|_F)
⊆ ±2ε ( |A(X − X*)|_F^2 + |B*|_F^2 )
= ±2ε |AX − B|_F^2
51
Affine Embeddings
 Know: |S(AX − B)|_F^2 − |SB*|_F^2 − ( |AX − B|_F^2 − |B*|_F^2 ) ∈ ±2ε |AX − B|_F^2
 Know: |SB*|_F^2 = (1 ± ε) |B*|_F^2
 |S(AX − B)|_F^2 = (1 ± 2ε) |AX − B|_F^2 ± ε |B*|_F^2
= (1 ± 3ε) |AX − B|_F^2
 Completes proof of affine embedding!
52
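A minimal numpy sketch (not from the slides) of using one sketch on both A and B in the multiple-response problem min_X |AX − B|_F, here with a CountSketch built by bucket accumulation; all sizes are illustrative:

import numpy as np

rng = np.random.default_rng(7)
n, d, m, k = 20_000, 10, 500, 2_000       # B has many (m) columns
A = rng.standard_normal((n, d))
B = A @ rng.standard_normal((d, m)) + rng.standard_normal((n, m))

h = rng.integers(0, k, size=n)            # CountSketch applied to [A, B]
sigma = rng.choice([-1.0, 1.0], size=n)
SA = np.zeros((k, d))
SB = np.zeros((k, m))
np.add.at(SA, h, sigma[:, None] * A)
np.add.at(SB, h, sigma[:, None] * B)

X_sketch = np.linalg.lstsq(SA, SB, rcond=None)[0]
X_opt = np.linalg.lstsq(A, B, rcond=None)[0]
print(np.linalg.norm(A @ X_sketch - B) / np.linalg.norm(A @ X_opt - B))  # about 1 + eps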
Low rank approximation
 A is an n x d matrix
 Think of n points in Rd
 E.g., A is a customer-product matrix
 Ai,j = how many times customer i purchased item j
 A is typically well-approximated by low rank matrix
 E.g., high rank because of noise
 Goal: find a low rank matrix approximating A
 Easy to store, data more interpretable
53
What is a good low rank approximation?
Singular Value Decomposition (SVD)
Any matrix A = U ⋅ Σ ⋅ V
 U has orthonormal columns
 Σ is diagonal with non-increasing positive
entries down the diagonal
 V has orthonormal rows
 Rank-k approximation: A_k = U_k ⋅ Σ_k ⋅ V_k
A_k = argmin_{rank k matrices B} |A-B|_F, where |C|_F = (Σ_{i,j} C_{i,j}^2)^{1/2}
The rows of V_k are the top k principal components
Computing A_k exactly is expensive
54
Low rank approximation
 Goal: output a rank k matrix A’, so that
|A-A'|_F ≤ (1+ε) |A-A_k|_F
 Can do this in nnz(A) + (n+d)*poly(k/ε) time [S,CW]
 nnz(A) is number of non-zero entries of A
55
Solution to low-rank approximation [S]
 Given n x d input matrix A
 Compute S*A using a random matrix S with k/ε << n
rows. S*A takes random linear combinations of rows of A
[Figure: the wide n x d matrix A is compressed to the much smaller sketch SA]
 Project rows of A onto SA, then find best rank-k
approximation to points inside of SA.
56
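A hedged numpy sketch of this algorithm (a Gaussian S and arbitrary sizes chosen for illustration): sketch the rows of A, project A onto the row span of SA, and keep the best rank-k part:

import numpy as np

rng = np.random.default_rng(8)
n, d, k, sketch_rows = 5_000, 400, 10, 60
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
A = A + 0.01 * rng.standard_normal((n, d))      # low rank plus noise

S = rng.standard_normal((sketch_rows, n)) / np.sqrt(sketch_rows)
SA = S @ A                                      # random linear combinations of rows of A

Q, _ = np.linalg.qr(SA.T)                       # orthonormal basis for rowspan(SA)
P = A @ Q                                       # project rows of A onto rowspan(SA)
U, s, Vt = np.linalg.svd(P, full_matrices=False)
A_approx = (U[:, :k] * s[:k]) @ Vt[:k] @ Q.T    # best rank-k matrix inside rowspan(SA)

s_full = np.linalg.svd(A, compute_uv=False)
best = np.sqrt((s_full[k:] ** 2).sum())         # |A - A_k|_F
print(np.linalg.norm(A - A_approx) / best)      # about 1 + eps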
What is the matrix S?
 S can be a k/ε x n matrix of i.i.d. normal random
variables
 [S] S can be a k/ε x n Fast Johnson Lindenstrauss
Matrix
 Uses Fast Fourier Transform
 [CW] S can be a poly(k/ε) x n CountSketch matrix
[  0  0  1  0  0  1  0  0
   1  0  0  0  0  0  0  0
   0  0  0 -1  1  0 -1  0
   0 -1  0  0  0  0  0  1 ]
S ⋅ A can be computed in nnz(A) time
57
Why do these Matrices Work?
 Consider the regression problem min_X |A_k X − A|_F
 Write A_k = WY, where W is n x k and Y is k x d
 Let S be an affine embedding
 Then |SA_k X − SA|_F = (1 ± ε) |A_k X − A|_F for all X
 By the normal equations, argmin_X |SA_k X − SA|_F = (SA_k)^− SA
 So, |A_k (SA_k)^− SA − A|_F ≤ (1 + ε) |A_k − A|_F
 But A_k (SA_k)^− SA is a rank-k matrix in the row span of SA!
 By the Pythagorean theorem, the best rank-k matrix in the row span of SA is
obtained by projecting the rows of A onto SA, then computing the SVD
58
Caveat: projecting the points onto SA is slow
 Current algorithm:
1. Compute S*A
2. Project each of the rows of A onto S*A
3. Find best rank-k approximation of the projected points
inside of the rowspace of S*A
 Bottleneck is step 2, i.e., the regression problem min_{rank-k X} |X(SA) − A|_F^2
 [CW] Approximate the projection: a fast algorithm for approximate regression
solves min_{rank-k X} |X(SA)R − AR|_F^2 instead, where R is an affine embedding applied on the right
 nnz(A) + (n+d)*poly(k/ε) time
59
High Precision Regression
 Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2
with high probability
 Our algorithms all have running time poly(d/ε)
 Goal: Sometimes we want running time poly(d)*log(1/ε)
 Want to make A well-conditioned
 κ(A) = sup_{|x|_2 = 1} |Ax|_2 / inf_{|x|_2 = 1} |Ax|_2
 Lots of algorithms' time complexity depends on κ(A)
 Use sketching to reduce κ(A) to O(1)!
60
Small QR Decomposition
 Let S be a (1 + ε_0)-subspace embedding for A
 Compute SA
 Compute a QR-factorization, SA = Q R^{-1}
 Claim: κ(AR) ≤ (1 + ε_0)/(1 − ε_0)
 For all unit x, (1 − ε_0) |ARx|_2 ≤ |SARx|_2 = 1
 For all unit x, (1 + ε_0) |ARx|_2 ≥ |SARx|_2 = 1
 So κ(AR) ≤ (1 + ε_0)/(1 − ε_0)
61
Finding a Constant Factor Solution
 Let S be a (1 + ε_0)-subspace embedding for AR
 Solve x_0 = argmin_x |SARx − Sb|_2
 Time to compute AR and x_0 is nnz(A) + poly(d) for constant ε_0
 Iterate: x_{m+1} ← x_m + R^T A^T (b − AR x_m)
 AR(x_{m+1} − x*) = AR(x_m + R^T A^T (b − AR x_m) − x*)
= (AR − AR R^T A^T AR)(x_m − x*)   [using R^T A^T b = R^T A^T AR x*, the normal equations for x*]
= U(Σ − Σ^3) V^T (x_m − x*),
where AR = U Σ V^T is the SVD of AR
 |AR(x_{m+1} − x*)|_2 = |(Σ − Σ^3) V^T (x_m − x*)|_2 = O(ε_0) |AR(x_m − x*)|_2
 |ARx_m − b|_2^2 = |AR(x_m − x*)|_2^2 + |ARx* − b|_2^2
62
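A hedged numpy sketch (not the lecture's code) of the whole high-precision recipe: sketch, take a QR factorization to precondition, then iterate. The sketch size and iteration count are illustrative choices, and the iteration starts from 0 instead of the sketched x_0, for brevity:

import numpy as np

rng = np.random.default_rng(9)
n, d = 10_000, 30
A = rng.standard_normal((n, d)) @ np.diag(np.logspace(0, 4, d))   # ill-conditioned A
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

k = 20 * d                                     # constant-factor subspace embedding
S = rng.standard_normal((k, n)) / np.sqrt(k)
_, Rfac = np.linalg.qr(S @ A)                  # S A = Q * Rfac
Rinv = np.linalg.inv(Rfac)                     # A @ Rinv plays the role of AR, kappa = O(1)

y = np.zeros(d)                                # iterate in the preconditioned variables
for _ in range(30):                            # each step shrinks the error by O(eps_0)
    y = y + Rinv.T @ (A.T @ (b - A @ (Rinv @ y)))
x = Rinv @ y                                   # map back to the original variables

x_opt = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(A @ x - b) / np.linalg.norm(A @ x_opt - b))   # converges to 1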