Sketching as a Tool for Numerical Linear Algebra
Lectures 1-2
David Woodruff
IBM Almaden

Massive data sets
• Examples: internet traffic logs, financial data, etc.
• Algorithms: want nearly linear time or less, usually at the cost of a randomized approximation

Regression analysis
• Regression: a statistical method to study dependencies between variables in the presence of noise
• Linear regression: a statistical method to study linear dependencies between variables in the presence of noise
• Example: Ohm's law, V = R · I. Given noisy measurements, find the linear function that best fits the data
[Figure: scatter plot of measured (current, voltage) pairs with the fitted line V = R · I]

Standard setting
• One measured variable b, and a set of predictor variables a_1, ..., a_d
• Assumption: b = x_0 + a_1 x_1 + ... + a_d x_d + e, where e is noise and the x_i are model parameters we want to learn
• Can assume x_0 = 0
• Now consider n observations of b

Matrix form
• Input: an n x d matrix A and a vector b = (b_1, ..., b_n); n is the number of observations and d is the number of predictor variables
• Output: x* so that Ax* and b are close
• Consider the over-constrained case, when n >> d
• Can assume that A has full column rank

Least squares method
• Find x* that minimizes |Ax − b|_2² = Σ_i (b_i − <A_{i*}, x>)², where A_{i*} is the i-th row of A
• Has certain desirable statistical properties

Geometry of regression
• We want to find an x that minimizes |Ax − b|_2
• The product Ax can be written as A_{*1} x_1 + A_{*2} x_2 + ... + A_{*d} x_d, where A_{*i} is the i-th column of A
• This is a linear d-dimensional subspace
• The problem is equivalent to computing the point of the column space of A nearest to b in l_2-norm

Solving least squares regression via the normal equations
• How to find the solution x to min_x |Ax − b|_2? Equivalent problem: min_x |Ax − b|_2²
• Write b = Ax' + b', where b' is orthogonal to the columns of A
• The cost is |A(x − x')|_2² + |b'|_2² by the Pythagorean theorem
• x is optimal if and only if A^T(Ax − b) = A^T(Ax − Ax') = 0
• Normal equation: A^T A x = A^T b for any optimal x, so x = (A^T A)^{-1} A^T b
• If the columns of A are not linearly independent, the Moore-Penrose pseudoinverse gives a minimum norm solution x

Moore-Penrose pseudoinverse
• Any optimal solution x has the form A^- b + (I − V'V'^T) z, where V' corresponds to the rows i of V^T for which Σ_{i,i} > 0
• Why? Because A(I − V'V'^T) z = 0, so A^- b + (I − V'V'^T) z is a solution. This is a (d − rank(A))-dimensional affine space, so it spans all optimal solutions
• Since A^- b is in the column span of V', by the Pythagorean theorem, |A^- b + (I − V'V'^T) z|_2² = |A^- b|_2² + |(I − V'V'^T) z|_2² ≥ |A^- b|_2²

Time complexity
• Need to compute x = A^- b
• Naively this takes nd² time; can do nd^{1.376} using fast matrix multiplication
• But we want a much better running time!

Sketching to solve least squares regression
• How to find an approximate solution x to min_x |Ax − b|_2?
• Goal: output x' for which |Ax' − b|_2 ≤ (1+ε) min_x |Ax − b|_2 with high probability
• Draw S from a k x n random family of matrices, for a value k << n
• Compute S·A and S·b, and output the solution x' to min_x |(SA)x − (Sb)|_2, i.e., x' = (SA)^- Sb

How to choose the right sketching matrix S?
• Recall: output the solution x' to min_x |(SA)x − (Sb)|_2
• Lots of matrices work, e.g., S a d/ε² x n matrix of i.i.d. normal random variables
• To see why this works, we introduce the notion of a subspace embedding

Subspace embeddings
• Let k = O(d/ε²), and let S be a k x n matrix of i.i.d. normal N(0, 1/k) random variables
• For any fixed d-dimensional subspace, i.e., the column space of an n x d matrix A: w.h.p., for all x in R^d, |SAx|_2 = (1±ε)|Ax|_2
• The entire column space of A is preserved. Why is this true?

Subspace embeddings - a proof
• Want to show |SAx|_2 = (1±ε)|Ax|_2 for all x
• Can assume the columns of A are orthonormal (since we prove this for all x)
• Claim: SA is a k x d matrix of i.i.d. N(0, 1/k) random variables
• First property: for two independent random variables X and Y, with X drawn from N(0, a²) and Y drawn from N(0, b²), X + Y is drawn from N(0, a² + b²)
• The probability density function f_Z of Z = X + Y is the convolution of the densities f_X and f_Y: with f_X(x) = (1/(a(2π)^{1/2})) e^{−x²/2a²} and f_Y(y) = (1/(b(2π)^{1/2})) e^{−y²/2b²},
  f_Z(z) = ∫ f_X(z − x) f_Y(x) dx
• Calculation (completing the square): e^{−(z−x)²/2a²} e^{−x²/2b²} = e^{−z²/2(a²+b²)} · e^{−(x − b²z/(a²+b²))² (a²+b²)/2a²b²}, so
  f_Z(z) = (1/((a²+b²)^{1/2} (2π)^{1/2})) e^{−z²/2(a²+b²)} · ∫ ((a²+b²)^{1/2}/(ab(2π)^{1/2})) e^{−(x − b²z/(a²+b²))² (a²+b²)/2a²b²} dx
• The remaining integrand is the density of a Gaussian with variance a²b²/(a²+b²), so the integral is 1, and Z has the N(0, a²+b²) density

Rotational invariance
• Second property: if u, v are vectors with <u, v> = 0, then <g, u> and <g, v> are independent, where g is a vector of i.i.d. N(0, 1/k) random variables
• Why? If g is an n-dimensional vector of i.i.d. N(0,1) random variables, and R is a fixed matrix, then the probability density function of Rg is
  f(x) = (1/(det(RR^T)^{1/2} (2π)^{d/2})) e^{−x^T (RR^T)^{-1} x / 2}
• RR^T is the covariance matrix; for a rotation matrix R, the distributions of Rg and of g are the same

Orthogonal implies independent
• Want to show: if u, v are vectors with <u, v> = 0, then <g, u> and <g, v> are independent
• Choose a rotation R which sends u to αe_1 and sends v to βe_2
• Since rotations preserve inner products, <g, u> = <Rg, Ru> = <h, αe_1> = αh_1 and <g, v> = <Rg, Rv> = <h, βe_2> = βh_2, where h = Rg is again a vector of i.i.d. N(0, 1/k) random variables
• Then h_1 and h_2 are independent by definition

Where were we?
• Claim: SA is a k x d matrix of i.i.d. N(0, 1/k) random variables
• Proof: the rows of SA are independent, and each row is (<g, A_1>, <g, A_2>, ..., <g, A_d>)
• The first property implies the entries in each row are N(0, 1/k), since the columns A_i have unit norm
• Since the columns A_i are orthonormal, the entries in a row are independent by our second property

Back to subspace embeddings
• Want to show |SAx|_2 = (1±ε)|Ax|_2 for all x; can assume the columns of A are orthonormal, and can also assume x is a unit vector
• SA is a k x d matrix of i.i.d. N(0, 1/k) random variables
• Consider any fixed unit vector x ∈ R^d: |SAx|_2² = Σ_{i∈[k]} <g_i, x>², where g_i is the i-th row of SA
• Each <g_i, x> is distributed as N(0, 1/k), so E[<g_i, x>²] = 1/k and E[|SAx|_2²] = 1
• How concentrated is |SAx|_2² about its expectation?

Johnson-Lindenstrauss theorem
• Suppose h_1, ..., h_k are i.i.d. N(0,1) random variables; then G = Σ_i h_i² is a χ²-random variable
• Apply known tail bounds to G:
  (Upper) Pr[G ≥ k + 2(kx)^{1/2} + 2x] ≤ e^{−x}
  (Lower) Pr[G ≤ k − 2(kx)^{1/2}] ≤ e^{−x}
• If x = ε²k/16, then Pr[G ∈ k(1 ± ε)] ≥ 1 − 2e^{−ε²k/16}
• If k = Θ(ε^{-2} log(1/δ)), this probability is 1 − δ
• For our k = O(d/ε²), Pr[|SAx|_2² ∈ 1 ± ε] ≥ 1 − 2^{−Θ(d)}
• This only holds for a fixed x; how to argue for all x?
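Before turning to the net argument, the sketch-and-solve recipe can be checked numerically. The following is a minimal sketch, not from the lecture: the matrix sizes, the noise level, and the constant in k = O(d/ε²) are illustrative assumptions. It uses a Gaussian S with i.i.d. N(0, 1/k) entries to solve least squares approximately.

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-constrained least squares instance: n >> d.
n, d, eps = 2000, 10, 0.5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Exact solution (numpy uses a stable solver, equivalent to the normal equations).
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
opt = np.linalg.norm(A @ x_opt - b)

# Gaussian sketch: k = O(d / eps^2) rows of i.i.d. N(0, 1/k) entries.
k = int(4 * d / eps**2)
S = rng.standard_normal((k, n)) / np.sqrt(k)

# Solve the much smaller k x d problem min_x |(SA)x - (Sb)|_2.
x_sk, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
ratio = np.linalg.norm(A @ x_sk - b) / opt
print(ratio)  # near 1; at most 1 + eps with high probability
```

Once SA and Sb are formed, the sketched problem only involves a k x d matrix, so solving it costs poly(d/ε), independent of n.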
Net for sphere
• Consider the sphere S^{d−1}
• A subset N is a γ-net if for all x ∈ S^{d−1}, there is a y ∈ N such that |x − y|_2 ≤ γ
• Greedy construction of N: while there is a point x ∈ S^{d−1} of distance larger than γ from every point in N, include x in N
• The sphere of radius γ/2 around every point in N is contained in the sphere of radius 1 + γ around 0^d; further, all such spheres are disjoint
• The ratio of the volume of the d-dimensional sphere of radius 1 + γ to that of the d-dimensional sphere of radius γ/2 is (1+γ)^d/(γ/2)^d, so |N| ≤ (1+γ)^d/(γ/2)^d

Net for subspace
• Let M = {Ax | x in N}, so |M| ≤ (1+γ)^d/(γ/2)^d
• Claim: for every x in S^{d−1}, there is a y in M for which |Ax − y|_2 ≤ γ
• Proof: let x' in S^{d−1} be such that |x − x'|_2 ≤ γ. Then |Ax − Ax'|_2 = |x − x'|_2 ≤ γ, using that the columns of A are orthonormal. Set y = Ax'

Net argument
• For a fixed unit x, Pr[|SAx|_2² ∈ 1 ± ε] ≥ 1 − 2^{−Θ(d)}
• For a fixed pair of unit x, x': |SAx|_2², |SAx'|_2², and |SA(x − x')|_2² are all 1 ± ε with probability 1 − 2^{−Θ(d)}
• |SA(x − x')|_2² = |SAx|_2² + |SAx'|_2² − 2<SAx, SAx'>, and |A(x − x')|_2² = |Ax|_2² + |Ax'|_2² − 2<Ax, Ax'>
• So Pr[<Ax, Ax'> = <SAx, SAx'> ± O(ε)] ≥ 1 − 2^{−Θ(d)}
• Choose a ½-net M = {Ax | x in N} of size 5^d
• By a union bound, for all pairs y, y' in M, <y, y'> = <Sy, Sy'> ± O(ε); condition on this event
• By linearity, if this holds for y, y' in M, then for αy, βy' we have <αy, βy'> = αβ<Sy, Sy'> ± O(ε)αβ

Finishing the net argument
• Let y = Ax for an arbitrary x ∈ S^{d−1}
• Let y_1 ∈ M be such that |y − y_1|_2 ≤ γ
• Let α be such that |α(y − y_1)|_2 = 1; note α ≥ 1/γ (could be infinite)
• Let y_2' ∈ M be such that |α(y − y_1) − y_2'|_2 ≤ γ; set y_2 = y_2'/α, so |y − y_1 − y_2|_2 ≤ γ/α ≤ γ²
• Repeat, obtaining y_1, y_2, y_3, ... such that for all integers i, |y − y_1 − y_2 − ... − y_i|_2 ≤ γ^i
• This implies |y_i|_2 ≤ γ^{i−1} + γ^i ≤ 2γ^{i−1}
• Now, with y = Σ_i y_i:
  |Sy|_2² = |S Σ_i y_i|_2² = Σ_i |Sy_i|_2² + 2 Σ_{i<j} <Sy_i, Sy_j>
  = Σ_i |y_i|_2² + 2 Σ_{i<j} <y_i, y_j> ± O(ε) Σ_{i,j} |y_i|_2 |y_j|_2
  = |Σ_i y_i|_2² ± O(ε) = |y|_2² ± O(ε) = 1 ± O(ε)
• Since this held for an arbitrary y = Ax for unit x, by linearity it follows that for all x, |SAx|_2 = (1±ε)|Ax|_2

Back to regression
• We showed that S is a subspace embedding: simultaneously for all x, |SAx|_2 = (1±ε)|Ax|_2
• What does this have to do with regression?

Subspace embeddings for regression
• Want x so that |Ax − b|_2 ≤ (1+ε) min_y |Ay − b|_2
• Consider the subspace L spanned by the columns of A together with b; then for all y in L, |Sy|_2 = (1±ε)|y|_2
• Hence |S(Ax − b)|_2 = (1±ε)|Ax − b|_2 for all x, so we can solve argmin_y |(SA)y − (Sb)|_2
• Given SA and Sb, this can be solved in poly(d/ε) time
• The only problem is that computing SA takes O(nd²) time

How to choose the right sketching matrix S? [S]
• S is a Subsampled Randomized Hadamard Transform: S = P·H·D
• D is a diagonal matrix with +1, −1 on the diagonal; H is the Hadamard transform; P just chooses a random (small) subset of the rows of H·D
• S·A can be computed in O(nd log n) time. Why does it work?

Why does this work?
• We can again assume the columns of A are orthonormal
• It suffices to show |SAx|_2² = |PHDAx|_2² = 1 ± ε for all x
• HD is a rotation matrix, so |HDAx|_2² = |Ax|_2² = 1 for any unit x
• Notation: let y = Ax
• Flattening lemma: for any fixed y, Pr[|HDy|_∞ ≥ C log^{1/2}(nd/δ)/n^{1/2}] ≤ δ/(2d)

Proving the flattening lemma
• Let C > 0 be a constant.
• We will show, for a fixed i in [n], that Pr[|(HDy)_i| ≥ C log^{1/2}(nd/δ)/n^{1/2}] ≤ δ/(2nd); if we show this, we can apply a union bound over all i
• (HDy)_i = Σ_j H_{i,j} D_{j,j} y_j
• (Azuma-Hoeffding) Pr[|Σ_j Z_j| > t] ≤ 2e^{−t²/(2 Σ_j β_j²)}, where |Z_j| ≤ β_j with probability 1
• Z_j = H_{i,j} D_{j,j} y_j has 0 mean, and |Z_j| ≤ |y_j|/n^{1/2} = β_j with probability 1, so Σ_j β_j² = 1/n
• Hence Pr[|Σ_j Z_j| > C log^{1/2}(nd/δ)/n^{1/2}] ≤ 2e^{−C² log(nd/δ)/2} ≤ δ/(2nd)

Consequence of the flattening lemma
• Recall the columns of A are orthonormal, so HDA has orthonormal columns
• The flattening lemma implies |HDA e_i|_∞ ≤ C log^{1/2}(nd/δ)/n^{1/2} with probability 1 − δ/(2d), for a fixed i ∈ [d]
• With probability 1 − δ/2, |e_j^T HDA e_i| ≤ C log^{1/2}(nd/δ)/n^{1/2} for all i, j
• Given this, |e_j^T HDA|_2 ≤ C d^{1/2} log^{1/2}(nd/δ)/n^{1/2} for all j (can be optimized further)

Matrix Chernoff bound
• Let X_1, ..., X_s be independent copies of a symmetric random matrix X ∈ R^{d×d} with E[X] = 0, |X|_2 ≤ γ, and |E[X^T X]|_2 ≤ σ². Let W = (1/s) Σ_{i∈[s]} X_i. For any ε > 0,
  Pr[|W|_2 > ε] ≤ 2d · e^{−sε²/(σ² + γε/3)}
  (here |W|_2 = sup |Wx|_2 / |x|_2)
• Let V = HDA, and recall V has orthonormal columns
• Suppose P in the S = PHD definition samples uniformly with replacement.
• If row i is sampled in the j-th sample, then P_{j,i} = (n/s)^{1/2}, and P_{j,i} = 0 otherwise
• Let Y_i be the i-th sampled row of V = HDA, and let X_i = I_d − n · Y_i^T Y_i
• E[X_i] = I_d − n · (1/n) Σ_j V_j^T V_j = I_d − V^T V = 0^d
• |X_i|_2 ≤ |I_d|_2 + n · max_j |e_j^T HDA|_2² = 1 + n · C² log(nd/δ) · (d/n) = Θ(d log(nd/δ))

Matrix Chernoff bound (continued)
• Recall: Y_i is the i-th sampled row of V = HDA and X_i = I_d − n · Y_i^T Y_i. Writing v_i for the i-th row of V:
  E[X^T X] + I_d = I_d + I_d − 2n E[Y_i^T Y_i] + n² E[Y_i^T Y_i Y_i^T Y_i] = 2I_d − 2I_d + n² · (1/n) Σ_i v_i^T v_i v_i^T v_i = n Σ_i v_i^T v_i |v_i|_2²
• Define Z = n Σ_i v_i^T v_i · C² log(nd/δ) · (d/n)
• Note that E[X^T X] + I_d and Z are real symmetric, with non-negative eigenvalues
• Claim: for all vectors y, we have y^T E[X^T X] y + y^T y ≤ y^T Z y
• Proof: y^T E[X^T X] y + y^T y = n Σ_i y^T v_i^T v_i y · |v_i|_2² = n Σ_i <v_i, y>² |v_i|_2², while
  y^T Z y = n Σ_i y^T v_i^T v_i y · C² log(nd/δ) · (d/n) = n Σ_i <v_i, y>² · C² log(nd/δ) · (d/n),
  and |v_i|_2² ≤ C² d log(nd/δ)/n by the consequence of the flattening lemma
• Hence |E[X^T X]|_2 ≤ |E[X^T X] + I_d|_2 + 1 ≤ |Z|_2 + 1 ≤ C² d log(nd/δ) + 1 = O(d log(nd/δ))

Applying the matrix Chernoff bound
• Recall (matrix Chernoff): Pr[|W|_2 > ε] ≤ 2d · e^{−sε²/(σ² + γε/3)}, and we have shown |E[X^T X]|_2 = O(d log(nd/δ))
• So Pr[|I_d − (PHDA)^T PHDA|_2 > ε] ≤ 2d · e^{−sε²/Θ(d log(nd/δ))}
• Set s = d log(nd/δ) log(d/δ)/ε² to make this probability less than δ/2

SRHT wrapup
• Have shown |I_d − (PHDA)^T PHDA|_2 < ε using the matrix Chernoff bound, with s = d log(nd/δ) log(d/δ)/ε² samples
• This implies that for every unit vector x, |1 − |PHDAx|_2²| = |x^T x − x^T (PHDA)^T (PHDA) x| < ε, so |PHDAx|_2² ∈ 1 ± ε for all unit vectors x
• Considering the column span of A adjoined with b, we can again solve the regression problem
• The time for regression is now only O(nd log n) + poly(d log(n)/ε).
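As a concrete illustration, the S = PHD construction can be sketched numerically. This is not from the lecture: the dimensions, the sample count s, and the 1/√n normalization of H (which makes HD a rotation) are illustrative assumptions.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along axis 0 (length must be a power of 2)."""
    y = x.copy().astype(float)
    h = 1
    while h < y.shape[0]:
        for i in range(0, y.shape[0], 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y

rng = np.random.default_rng(1)
n, d, s = 1024, 8, 400

# Orthonormal n x d matrix A (Q factor of a random matrix).
A, _ = np.linalg.qr(rng.standard_normal((n, d)))

# D: random signs on the diagonal; H scaled by 1/sqrt(n) so HD is a rotation.
D = rng.choice([-1.0, 1.0], size=n)
HDA = fwht(D[:, None] * A) / np.sqrt(n)

# P: sample s rows uniformly with replacement, scaled by sqrt(n/s).
rows = rng.integers(0, n, size=s)
PHDA = HDA[rows] * np.sqrt(n / s)

# Subspace embedding check: all singular values of PHDA should be near 1.
sv = np.linalg.svd(PHDA, compute_uv=False)
print(sv.min(), sv.max())
```

The transform is applied to all d columns at once, so forming PHDA costs O(nd log n), matching the running time claimed above.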
Nearly optimal in matrix dimensions (n >> d).

Faster subspace embeddings S [CW, MM, NN]
• CountSketch matrix: define a k x n matrix S, for k = O(d²/ε²)
• S is really sparse: a single randomly chosen non-zero entry (a random sign) per column
[Figure: example CountSketch matrix; each column has one nonzero entry from {−1, +1}]
• nnz(A) is the number of non-zero entries of A
• Can compute S · A in nnz(A) time!

Simple proof [Nguyen]
• Can assume the columns of A are orthonormal; it suffices to show |SAx|_2 = 1 ± ε for all unit x (for regression, apply S to [A, b])
• SA is a 2d²/ε² x d matrix
• It suffices to show |A^T S^T S A − I|_2 ≤ |A^T S^T S A − I|_F ≤ ε
• Matrix product result shown below: Pr[|C S^T S D − CD|_F² ≤ [3/(δ(# rows of S))] · |C|_F² |D|_F²] ≥ 1 − δ
• Set C = A^T and D = A. Then |A|_F² = d, and with (# rows of S) = 3d²/(δε²) the right-hand side is ε²

Matrix product result
• Show: Pr[|C S^T S D − CD|_F² ≤ [3/(δ(# rows of S))] · |C|_F² |D|_F²] ≥ 1 − δ
• (JL property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL moment property if for all x ∈ R^n with |x|_2 = 1,
  E_S[ | |Sx|_2² − 1 |^ℓ ] ≤ ε^ℓ · δ
• (From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL moment property for some ℓ ≥ 2. Then for A, B matrices with n rows,
  Pr_S[|A^T S^T S B − A^T B|_F ≥ 3ε |A|_F |B|_F] ≤ δ

From vectors to matrices
• Proof: for a random scalar X, let |X|_p = E[|X|^p]^{1/p}; we sometimes consider X = |T|_F for a random matrix T
• Can show |·|_p is a norm; Minkowski's inequality: |X + Y|_p ≤ |X|_p + |Y|_p
• For unit vectors x, y, we need to bound |<Sx, Sy> − <x, y>|_ℓ. By polarization:
  |<Sx, Sy> − <x, y>|_ℓ = (1/2) | (|Sx|_2² − 1) + (|Sy|_2² − 1) − (|S(x−y)|_2² − |x−y|_2²) |_ℓ
  ≤ (1/2) ( | |Sx|_2² − 1 |_ℓ + | |Sy|_2² − 1 |_ℓ + | |S(x−y)|_2² − |x−y|_2² |_ℓ )
  ≤ (1/2) ( ε δ^{1/ℓ} + ε δ^{1/ℓ} + |x−y|_2² ε δ^{1/ℓ} )
  ≤ 3ε δ^{1/ℓ}    (since |x−y|_2² ≤ 4)
• By linearity, for arbitrary x, y: |<Sx, Sy> − <x, y>|_ℓ ≤ 3ε δ^{1/ℓ} |x|_2 |y|_2
• Suppose A has d columns and B has e columns. Let the columns of A be A_1, ..., A_d and the columns of B be B_1, ..., B_e
• Define X_{i,j} = (1/(|A_i|_2 |B_j|_2)) (<SA_i, SB_j> − <A_i, B_j>); then
  |A^T S^T S B − A^T B|_F² = Σ_{i,j} |A_i|_2² |B_j|_2² X_{i,j}²
• Hence
  | |A^T S^T S B − A^T B|_F² |_{ℓ/2} ≤ Σ_{i,j} |A_i|_2² |B_j|_2² |X_{i,j}²|_{ℓ/2} = Σ_{i,j} |A_i|_2² |B_j|_2² |X_{i,j}|_ℓ² ≤ (3ε δ^{1/ℓ})² |A|_F² |B|_F²
• Since E[|A^T S^T S B − A^T B|_F^ℓ] = | |A^T S^T S B − A^T B|_F² |_{ℓ/2}^{ℓ/2}, by Markov's inequality,
  Pr[|A^T S^T S B − A^T B|_F > 3ε |A|_F |B|_F] ≤ (1/(3ε |A|_F |B|_F))^ℓ · E[|A^T S^T S B − A^T B|_F^ℓ] ≤ δ

Result for vectors
• It just remains to show that the CountSketch matrix S satisfies the JL moment property, and to bound the number k of rows

CountSketch satisfies the JL property
• We show this property holds with ℓ = 2.
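The nnz(A)-time claim for CountSketch is easy to see in code. Below is a minimal sketch, not from the lecture: it uses a fully random hash and sign function in place of the 2-wise/4-wise independent families the analysis needs, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 100_000, 5, 5000

# Orthonormal test matrix A.
A, _ = np.linalg.qr(rng.standard_normal((n, d)))

# CountSketch: column i of S has a single nonzero sign sigma[i] in row h[i].
# (h and sigma are fully random here; 2-wise / 4-wise independence suffices.)
h = rng.integers(0, k, size=n)
sigma = rng.choice([-1.0, 1.0], size=n)

# S @ A without materializing S: scatter-add signed rows of A; O(nnz(A)) work.
SA = np.zeros((k, d))
np.add.at(SA, h, sigma[:, None] * A)

# Subspace-embedding check: A^T S^T S A should be close to the identity.
err = np.linalg.norm(SA.T @ SA - np.eye(d))
print(err)
```

For sparse A the same scatter-add touches only the nonzero entries, which is where the nnz(A) running time comes from.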
First, let us consider ℓ = 1
• For the CountSketch matrix S, let h: [n] → [k] be a 2-wise independent hash function, and let σ: [n] → {−1, 1} be a 4-wise independent hash function
• Let δ[E] = 1 if event E holds, and δ[E] = 0 otherwise
• E[|Sx|_2²] = Σ_{j∈[k]} E[(Σ_{i∈[n]} δ[h(i)=j] σ_i x_i)²]
  = Σ_{j∈[k]} Σ_{i1,i2∈[n]} E[δ[h(i1)=j] δ[h(i2)=j] σ_{i1} σ_{i2}] x_{i1} x_{i2}
  = Σ_{j∈[k]} Σ_{i∈[n]} E[δ[h(i)=j]²] x_i²
  = Σ_{j∈[k]} (1/k) Σ_{i∈[n]} x_i² = |x|_2²

CountSketch satisfies the JL property (ℓ = 2)
• E[|Sx|_2⁴] = E[(Σ_{j∈[k]} (Σ_{i∈[n]} δ[h(i)=j] σ_i x_i)²)²]
  = Σ_{j1,j2} Σ_{i1,i2,i3,i4} E[σ_{i1} σ_{i2} σ_{i3} σ_{i4} δ[h(i1)=j1] δ[h(i2)=j1] δ[h(i3)=j2] δ[h(i4)=j2]] x_{i1} x_{i2} x_{i3} x_{i4}
• For a term to survive, we must be able to partition {i1, i2, i3, i4} into equal pairs
  – Suppose i1 = i2 = i3 = i4. Then necessarily j1 = j2; obtain Σ_j (1/k) Σ_i x_i⁴ = Σ_i x_i⁴ = |x|_4⁴
  – Suppose i1 = i2 and i3 = i4 but i1 ≠ i3. Then get Σ_{j1,j2} Σ_{i1≠i3} (1/k²) x_{i1}² x_{i3}² = |x|_2⁴ − |x|_4⁴
  – Suppose i1 = i3 and i2 = i4 but i1 ≠ i2. Then necessarily j1 = j2; obtain Σ_j (1/k²) Σ_{i1≠i2} x_{i1}² x_{i2}² ≤ (1/k) |x|_2⁴. Obtain the same bound if i1 = i4 and i2 = i3
• Hence E[|Sx|_2⁴] ∈ [|x|_2⁴, |x|_2⁴ (1 + 2/k)] = [1, 1 + 2/k]
• So E_S[(|Sx|_2² − 1)²] = E[|Sx|_2⁴] − 2E[|Sx|_2²] + 1 ≤ 1 + 2/k − 2 + 1 = 2/k
• Setting k = 2/(ε²δ) finishes the proof, since then E_S[(|Sx|_2² − 1)²] ≤ ε² δ

Where are we?
• We showed CountSketch has the (ε, δ, ℓ)-JL moment property with ℓ = 2 and k = 2/(ε²δ)
• Plugging into the "from vectors to matrices" theorem gives the matrix product result we wanted: Pr[|C S^T S D − CD|_F² ≤ (3/(δk)) |C|_F² |D|_F²] ≥ 1 − δ
• We are now done with the proof that CountSketch is a subspace embedding

Course outline
• Subspace embeddings and least squares regression: Gaussian matrices, Subsampled Randomized Hadamard Transform, CountSketch
• Affine embeddings
• Application to low rank approximation
• High precision regression

Affine embeddings
• Want to solve min_X |AX − B|_F², where A is tall and thin with d columns, but B has a large number of columns; can't directly apply subspace embeddings
• Let's try to show |SAX − SB|_F = (1 ± ε)|AX − B|_F for all X, and see what properties of S we need
• Can assume A has orthonormal columns
• Let B* = AX* − B, where X* is the optimum; by the normal equations, A^T B* = 0
• Then:
  |S(AX − B)|_F² − |SB*|_F² = |SA(X − X*) + SB*|_F² − |SB*|_F²
  = |SA(X − X*)|_F² + 2 tr[(X − X*)^T A^T S^T S B*]
  ∈ |SA(X − X*)|_F² ± 2 |X − X*|_F |A^T S^T S B*|_F    (use tr(CD) ≤ |C|_F |D|_F)
  ∈ |SA(X − X*)|_F² ± 2ε |X − X*|_F |B*|_F    (if we have approximate matrix product, since A^T B* = 0)
  ∈ |A(X − X*)|_F² ± ε(|A(X − X*)|_F² + 2 |X − X*|_F |B*|_F)    (subspace embedding for A)

Affine embeddings (continued)
• We have |S(AX − B)|_F² − |SB*|_F² ∈ |A(X − X*)|_F² ± ε(|A(X − X*)|_F² + 2 |X − X*|_F |B*|_F)
• The normal equations imply that |AX − B|_F² = |A(X − X*)|_F² + |B*|_F², so
  |S(AX − B)|_F² − |SB*|_F² − (|AX − B|_F² − |B*|_F²) ∈ ±ε(|A(X − X*)|_F² + 2 |X − X*|_F |B*|_F)
  ⊆ ±2ε(|A(X − X*)|_F² + |B*|_F²)    (using |X − X*|_F = |A(X − X*)|_F and 2ab ≤ a² + b²)
  = ±2ε |AX − B|_F²
• Also know: |SB*|_F² = (1 ± ε)|B*|_F² (this holds with constant probability)
• Hence |S(AX − B)|_F² = (1 ± 2ε)|AX − B|_F² ± ε|B*|_F² = (1 ± 3ε)|AX − B|_F²
• This completes the proof of the affine embedding!
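Numerically, the affine-embedding guarantee says a single sketch handles every column of B at once. A minimal check follows; it is illustrative, not from the lecture: the sizes are arbitrary, and a Gaussian S is used for simplicity in place of CountSketch.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, e, k = 3000, 8, 50, 600

A = rng.standard_normal((n, d))
B = A @ rng.standard_normal((d, e)) + 0.1 * rng.standard_normal((n, e))

# Optimal multiple-response least squares: min_X |AX - B|_F.
X_opt, *_ = np.linalg.lstsq(A, B, rcond=None)
opt = np.linalg.norm(A @ X_opt - B)

# One Gaussian sketch S applied to both A and B.
S = rng.standard_normal((k, n)) / np.sqrt(k)
X_sk, *_ = np.linalg.lstsq(S @ A, S @ B, rcond=None)
ratio = np.linalg.norm(A @ X_sk - B) / opt
print(ratio)  # near 1 when S is an affine embedding for (A, B)
```

Note that k does not need to grow with the number of columns e of B; that is exactly what the affine embedding buys over applying a subspace embedding to the span of A and all columns of B.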
Low rank approximation
• A is an n x d matrix; think of n points in R^d
• E.g., A is a customer-product matrix: A_{i,j} = how many times customer i purchased item j
• A is typically well-approximated by a low rank matrix, e.g., it has high rank only because of noise
• Goal: find a low rank matrix approximating A; it is easy to store, and the data is more interpretable

What is a good low rank approximation?
• Singular Value Decomposition (SVD): any matrix can be written A = U · Σ · V
  – U has orthonormal columns
  – Σ is diagonal with non-increasing positive entries down the diagonal
  – V has orthonormal rows
• Rank-k approximation: A_k = U_k · Σ_k · V_k; the top k rows of V are the top k principal components
• A_k = argmin_{rank k matrices B} |A − B|_F, where |C|_F = (Σ_{i,j} C_{i,j}²)^{1/2}
• Computing A_k exactly is expensive

Low rank approximation
• Goal: output a rank k matrix A', so that |A − A'|_F ≤ (1+ε)|A − A_k|_F
• Can do this in nnz(A) + (n+d)·poly(k/ε) time [S, CW], where nnz(A) is the number of non-zero entries of A

Solution to low-rank approximation [S]
• Given the n x d input matrix A, compute S·A using a random matrix S with k/ε << n rows; S·A takes random linear combinations of the rows of A
• Project the rows of A onto SA, then find the best rank-k approximation to the points inside of SA

What is the matrix S?
• S can be a k/ε x n matrix of i.i.d. normal random variables [S]
• S can be a k/ε x n Fast Johnson Lindenstrauss Matrix, which uses the Fast Fourier Transform
• S can be a poly(k/ε) x n CountSketch matrix [CW], in which case S · A can be computed in nnz(A) time

Why do these matrices work?
• Consider the regression problem min_X |A_k X − A|_F
• Write A_k = WY, where W is n x k and Y is k x d
• Let S be an affine embedding, so |SA_k X − SA|_F = (1 ± ε)|A_k X − A|_F for all X
• By the normal equations, argmin_X |SA_k X − SA|_F is X = (SA_k)^- SA
• So |A_k (SA_k)^- SA − A|_F ≤ (1 + ε)|A_k − A|_F
• But A_k (SA_k)^- SA is a rank-k matrix in the row span of SA!
• By the Pythagorean theorem, the best rank-k matrix in the row span of SA is obtained by projecting the rows of A onto SA, then computing the SVD
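The two-step recipe (sketch the rows, project onto the sketch's row span, then truncate to rank k) can be sketched numerically as follows. This is illustrative, not from the lecture: a Gaussian S is used, and the sketch size m stands in for k/ε with an arbitrary oversampling factor.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k, m = 1000, 200, 5, 40  # m = sketch size, a stand-in for k/eps

# Low-rank signal plus noise.
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))

# Best rank-k approximation via full SVD (the expensive baseline).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
opt = np.linalg.norm(A - (U[:, :k] * s[:k]) @ Vt[:k])

# Sketch: the row span of SA is a small subspace; project rows of A onto it.
S = rng.standard_normal((m, n)) / np.sqrt(m)
Q, _ = np.linalg.qr((S @ A).T)                   # orthonormal basis of rowspan(SA)
AP = A @ Q                                       # projected rows, n x m
U2, s2, Vt2 = np.linalg.svd(AP, full_matrices=False)
A_prime = (U2[:, :k] * s2[:k]) @ Vt2[:k] @ Q.T   # rank-k, inside rowspan(SA)

ratio = np.linalg.norm(A - A_prime) / opt
print(ratio)
```

The expensive n x d SVD is replaced by an SVD of the much smaller n x m projected matrix, which is the source of the (n+d)·poly(k/ε) term in the running time.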
Caveat: projecting the points onto SA is slow
• Current algorithm: 1. compute S·A; 2. project each of the rows of A onto S·A; 3. find the best rank-k approximation of the projected points inside of the rowspace of S·A
• The bottleneck is step 2
• [CW] Can solve with affine embeddings: approximate the projection with a fast algorithm for approximate regression, solving min_{rank-k X} |X(SA)R − AR|_F² in place of min_{rank-k X} |X(SA) − A|_F²
• Total: nnz(A) + (n+d)·poly(k/ε) time

High precision regression
• Goal: output x' for which |Ax' − b|_2 ≤ (1+ε) min_x |Ax − b|_2 with high probability
• Our algorithms so far all have running time poly(d/ε); sometimes we want running time poly(d)·log(1/ε)
• Idea: make A well-conditioned. Let κ(A) = sup_{|x|_2=1} |Ax|_2 / inf_{|x|_2=1} |Ax|_2
• Lots of algorithms' time complexity depends on κ(A); use sketching to reduce κ(A) to O(1)!

Small QR decomposition
• Let S be a (1+ε₀)-subspace embedding for A
• Compute SA and its QR-factorization, SA = Q R^{-1}
• Claim: κ(AR) ≤ (1+ε₀)/(1−ε₀)
• Since SAR = Q has orthonormal columns, for all unit x: (1−ε₀)|ARx|_2 ≤ |SARx|_2 = 1 and (1+ε₀)|ARx|_2 ≥ |SARx|_2 = 1
• So κ(AR) ≤ (1+ε₀)/(1−ε₀)

Finding a constant factor solution
• Let S be a (1+ε₀)-subspace embedding for AR, and solve x₀ = argmin_x |SARx − Sb|_2
• The time to compute AR and x₀ is nnz(A) + poly(d) for constant ε₀
• Iterate: x_{m+1} ← x_m + R^T A^T (b − AR x_m)
• Then, using the normal equations R^T A^T b = R^T A^T AR x*:
  AR(x_{m+1} − x*) = AR(x_m + R^T A^T (b − AR x_m) − x*) = (AR − AR R^T A^T AR)(x_m − x*) = U(Σ − Σ³)V^T (x_m − x*),
  where AR = UΣV^T is the SVD of AR
• So |AR(x_{m+1} − x*)|_2 = |(Σ − Σ³)V^T (x_m − x*)|_2 = O(ε₀)|AR(x_m − x*)|_2, since every singular value σ of AR lies in [1/(1+ε₀), 1/(1−ε₀)] and so |1 − σ²| = O(ε₀)
• By the Pythagorean theorem, |AR x_m − b|_2² = |AR(x_m − x*)|_2² + |AR x* − b|_2², so the residual converges geometrically and O(log(1/ε)) iterations suffice
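The preconditioned iteration can be sketched end to end. This is illustrative, not from the lecture: a Gaussian S, arbitrary sizes, and a fixed iteration count are assumptions, and note that numpy's qr returns SA = QR, so the preconditioner here is R^{-1}, matching the SA = QR^{-1} notation above.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 4000, 10, 500

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

# Sketch and factor: SA = Q R, so T = R^{-1} makes AT well-conditioned.
S = rng.standard_normal((k, n)) / np.sqrt(k)
Q, R = np.linalg.qr(S @ A)
T = np.linalg.inv(R)
AT = A @ T

# Iterate z_{m+1} = z_m + (AT)^T (b - AT z_m); converges fast since kappa(AT) = O(1).
z = np.zeros(d)
for _ in range(50):
    z = z + AT.T @ (b - AT @ z)
x = T @ z  # solution of the original problem min_x |Ax - b|_2

opt = np.linalg.norm(A @ np.linalg.lstsq(A, b, rcond=None)[0] - b)
res = np.linalg.norm(A @ x - b)
print(res / opt)
```

Each iteration costs one multiplication by AT and one by AT^T, i.e., nnz(A) + poly(d) time, and the geometric convergence means O(log(1/ε)) iterations reach relative error ε.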