Sketching as a Tool for Numerical
Linear Algebra
Lectures 1-2
David Woodruff
IBM Almaden
Massive data sets
Examples
 Internet traffic logs
 Financial data
 etc.
Algorithms
 Want nearly linear time or less
 Usually at the cost of a randomized approximation
2
Regression analysis
Regression
 Statistical method to study dependencies between
variables in the presence of noise.
3
Regression analysis
Linear Regression
 Statistical method to study linear dependencies
between variables in the presence of noise.
Example
 Ohm's law V = R ∙ I
 Find linear function that
best fits the data
[Figure "Example Regression": scatter of measurements, current (0–150) vs. voltage (0–250), with the fitted line]
6
Regression analysis
Linear Regression
 Statistical method to study linear dependencies between
variables in the presence of noise.
Standard Setting
 One measured variable b
 A set of predictor variables a_1, …, a_d
 Assumption:
b = x_0 + a_1 x_1 + … + a_d x_d + e
 e is assumed to be noise and the x_i are model
parameters we want to learn
 Can assume x_0 = 0
 Now consider n observations of b
7
Regression analysis
Matrix form
Input: nd-matrix A and a vector b=(b1,…, bn)
n is the number of observations; d is the number of
predictor variables
Output: x* so that Ax* and b are close
 Consider the over-constrained case, when n À d
 Can assume that A has full column rank
8
Regression analysis
Least Squares Method
 Find x* that minimizes |Ax-b|_2^2 = Σ_i (b_i – <A_{i*}, x>)²
 A_{i*} is the i-th row of A
 Certain desirable statistical properties
9
Regression analysis
Geometry of regression
 We want to find an x that minimizes |Ax-b|2
 The product Ax can be written as
A*1x1 + A*2x2 + ... + A*dxd
where A*i is the i-th column of A
 This is a linear d-dimensional subspace
 The problem is equivalent to computing the point of the
column space of A nearest to b in l2-norm
10
Regression analysis
Solving least squares regression via the normal equations
 How to find the solution x to min_x |Ax-b|_2 ?
 Equivalent problem: min_x |Ax-b|_2^2
 Write b = Ax' + b', where b' is orthogonal to the columns of A
 Cost is |A(x-x')|_2^2 + |b'|_2^2 by the Pythagorean theorem
 Optimal solution x if and only if A^T(Ax-b) = A^T(Ax-Ax') = 0
 Normal Equation: A^T A x = A^T b for any optimal x
 x = (A^T A)^{-1} A^T b
 If the columns of A are not linearly independent, the Moore-Penrose pseudoinverse gives a minimum norm solution x
11
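As a concrete illustration of the normal equations and the pseudoinverse route, here is a minimal Python/numpy sketch (not part of the lecture; sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20                        # over-constrained case: n >> d
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Normal equations: x = (A^T A)^{-1} A^T b (assumes A has full column rank)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Moore-Penrose pseudoinverse: minimum-norm solution, works without full column rank
x_pinv = np.linalg.pinv(A) @ b

print(np.linalg.norm(A @ x_normal - b), np.linalg.norm(A @ x_pinv - b))  # equal costs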
Moore-Penrose Pseudoinverse
• Any optimal solution x has the form A^− b + (I − V′V′^T) z, where
V′ corresponds to the rows i of V^T for which Σ_{i,i} > 0 (writing A = UΣV^T for the SVD of A)
• Why?
• Because A (I − V′V′^T) z = 0, so A^− b + (I − V′V′^T) z is a
solution. This is a (d − rank(A))-dimensional affine space, so it
spans all optimal solutions
• Since A^− b is in the column span of V′, by the Pythagorean theorem,
|A^− b + (I − V′V′^T) z|_2^2 = |A^− b|_2^2 + |(I − V′V′^T) z|_2^2 ≥ |A^− b|_2^2
12
Time Complexity
Solving least squares regression via the normal equations
 Need to compute x = A^− b
 Naively this takes nd^2 time
 Can do nd^{1.376} using fast matrix multiplication
 But we want much better running time!
13
Sketching to solve least squares regression
 How to find an approximate solution x to min_x |Ax-b|_2 ?
 Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2
with high probability
 Draw S from a k x n random family of matrices, for a
value k << n
 Compute S*A and S*b
 Output the solution x' to min_x |(SA)x-(Sb)|_2
 x' = (SA)^− Sb
14
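The sketch-and-solve recipe above can be tried directly; here is a hedged numpy sketch (assuming a dense Gaussian S, one of the choices discussed next, and an arbitrary synthetic instance):

import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 100_000, 10, 0.5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

k = int(d / eps**2)                            # k << n rows
S = rng.standard_normal((k, n)) / np.sqrt(k)   # i.i.d. N(0, 1/k) entries

x_sketch = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]   # solve the small problem
x_opt = np.linalg.lstsq(A, b, rcond=None)[0]              # exact solution, for comparison

# cost ratio |Ax'-b| / min_x |Ax-b| should be at most about 1+eps
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_opt - b))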
How to choose the right sketching matrix S?
 Recall: output the solution x' to min_x |(SA)x-(Sb)|_2
 Lots of matrices work
 S is a d/ε^2 x n matrix of i.i.d. Normal random variables
 To see why this works, we
introduce the notion of a
subspace embedding
15
Subspace Embeddings
• Let k = O(d/ε^2)
• Let S be a k x n matrix of i.i.d. normal
N(0,1/k) random variables
• For any fixed d-dimensional subspace, i.e.,
the column space of an n x d matrix A
– W.h.p., for all x in R^d, |SAx|_2 = (1±ε)|Ax|_2
• Entire column space of A is preserved
Why is this true?
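Before the proof, a quick numerical sanity check of the subspace-embedding property (a sketch, not from the slides; the constant 10 in k is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(2)
n, d, eps = 10_000, 5, 0.25
k = int(10 * d / eps**2)
A = np.linalg.qr(rng.standard_normal((n, d)))[0]   # orthonormal columns
S = rng.standard_normal((k, n)) / np.sqrt(k)       # i.i.d. N(0, 1/k) entries
SA = S @ A

ratios = []
for _ in range(1000):                              # many directions in the column space
    x = rng.standard_normal(d)
    ratios.append(np.linalg.norm(SA @ x) / np.linalg.norm(A @ x))
print(min(ratios), max(ratios))                    # both should lie within 1 ± eps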
Subspace Embeddings – A Proof
• Want to show |SAx|2 = (1±ε)|Ax|2 for all x
• Can assume columns of A are orthonormal
(since we prove this for all x)
• Claim: SA is a k x d matrix of i.i.d. N(0,1/k)
random variables
– First property: for two independent random variables X
and Y, with X drawn from N(0, a^2) and Y drawn from
N(0, b^2), we have that X+Y is drawn from N(0, a^2 + b^2)
• Why? The probability density function f_Z of Z = X+Y is the
convolution of the probability density functions f_X and f_Y:
f_Z(z) = ∫ f_Y(z−x) f_X(x) dx, where
f_X(x) = (1/(a (2π)^{.5})) e^{−x^2/(2a^2)} and f_Y(y) = (1/(b (2π)^{.5})) e^{−y^2/(2b^2)}
• Calculation (completing the square in the exponent):
e^{−(z−x)^2/(2b^2)} ⋅ e^{−x^2/(2a^2)} = e^{−z^2/(2(a^2+b^2))} ⋅ e^{−(x − a^2 z/(a^2+b^2))^2 (a^2+b^2)/(2a^2 b^2)}
• Density of Z is a Gaussian distribution:
f_Z(z) = (1/((2π)^{.5} (a^2+b^2)^{.5})) e^{−z^2/(2(a^2+b^2))} ∫ ((a^2+b^2)^{.5}/((2π)^{.5} a b)) e^{−(x − a^2 z/(a^2+b^2))^2 (a^2+b^2)/(2a^2 b^2)} dx,
and the remaining integral equals 1, so Z is drawn from N(0, a^2 + b^2)
Rotational Invariance
• Second property: if u, v are vectors with <u, v> = 0,
then <g,u> and <g,v> are independent, where g is a
vector of i.i.d. N(0,1/k) random variables
• Why?
• If g is an n-dimensional vector of i.i.d. N(0,1)
random variables, and R is a fixed matrix, then
the probability density function of Rg is
f(x) = (1/( det(RR^T)^{1/2} (2π)^{n/2} )) ⋅ e^{−x^T (RR^T)^{−1} x / 2}
– RRT is the covariance matrix
– For a rotation matrix R, the distribution of Rg
and of g are the same
Orthogonal Implies Independent
• Want to show: if u, v are vectors with <u, v> = 0, then
<g,u> and <g,v> are independent, where g is a vector of
i.i.d. N(0,1/k) random variables
• Choose a rotation R which sends u to αe1 , and sends v
to βe2
• <g, u> = <Rg, Ru> = <h, αe_1> = α h_1
• <g, v> = <Rg, Rv> = <h, βe_2> = β h_2
where h = Rg is a vector of i.i.d. N(0, 1/k) random variables
• Then h1 and h2 are independent by definition
Where were we?
• Claim: SA is a k x d matrix of i.i.d. N(0,1/k) random
variables
• Proof: The rows of SA are independent
– Each row is: < g, A1 >, < g, A2 > , … , < g, Ad >
– First property implies the entries in each row are
N(0,1/k) since the columns Ai have unit norm
– Since the columns Ai are orthonormal, the entries in a
row are independent by our second property
Back to Subspace Embeddings
• Want to show |SAx|_2 = (1±ε)|Ax|_2 for all x
• Can assume columns of A are orthonormal
• Can also assume x is a unit vector
• SA is a k x d matrix of i.i.d. N(0,1/k) random variables
• Consider any fixed unit vector x ∈ R^d
• |SAx|_2^2 = Σ_{i∈[k]} <g^i, x>^2, where g^i is the i-th row of SA
• Each <g^i, x> is distributed as N(0, 1/k)
• E[<g^i, x>^2] = 1/k, and so E[|SAx|_2^2] = 1
How concentrated is |SAx|_2^2 about its expectation?
Johnson-Lindenstrauss Theorem
• Suppose h1 , … , hk are i.i.d. N(0,1) random variables
• Then G = Σ_i h_i^2 is a χ^2-random variable with k degrees of freedom
• Apply known tail bounds to G:
– (Upper) Pr[ G ≥ k + 2(kx)^{.5} + 2x ] ≤ e^{−x}
– (Lower) Pr[ G ≤ k − 2(kx)^{.5} ] ≤ e^{−x}
• If x = ε^2 k/16, then Pr[ G ∈ k(1 ± ε) ] ≥ 1 − 2e^{−ε^2 k/16}
• If k = Θ(ε^{−2} log(1/δ)), this probability is at least 1 − δ
• For our choice k = Θ(d/ε^2), Pr[ |SAx|_2^2 ∈ 1 ± ε ] ≥ 1 − 2^{−Θ(d)}
This only holds for a fixed x, how to argue for all x?
Net for Sphere
• Consider the sphere Sd−1
• Subset N is a γ-net if for all x ∈ Sd−1 , there is a y ∈ N,
such that x − y 2 ≤ γ
• Greedy construction of N
– While there is a point x ∈ Sd−1 of distance larger than
γ from every point in N, include x in N
• The sphere of radius γ/2 around every point in N is
contained in the sphere of radius 1 + γ around 0^d
• Further, all such spheres are disjoint
• Ratio of the volume of a d-dimensional sphere of radius 1 + γ
to that of a d-dimensional sphere of radius γ/2 is (1 + γ)^d/(γ/2)^d, so
|N| ≤ (1 + γ)^d/(γ/2)^d
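A small illustrative sketch (not from the slides) of the greedy net construction, using random candidate points on the sphere rather than an exhaustive sweep:

import numpy as np

rng = np.random.default_rng(3)
d, gamma = 3, 0.5

net = []
for _ in range(20_000):                     # random candidates on S^{d-1}
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    # greedy rule: add x if it is farther than gamma from every current net point
    if all(np.linalg.norm(x - y) > gamma for y in net):
        net.append(x)

print(len(net), (1 + gamma)**d / (gamma / 2)**d)   # observed net size vs. the volume bound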
Net for Subspace
• Let M = {Ax | x in N}, so |M| ≤ (1 + γ)^d/(γ/2)^d
• Claim: For every x in S^{d−1}, there is a y in M for which
|Ax − y|_2 ≤ γ
• Proof: Let x′ in N be such that |x − x′|_2 ≤ γ.
Then |Ax − Ax′|_2 = |x − x′|_2 ≤ γ, using that the
columns of A are orthonormal. Set y = Ax′
Net Argument
• For a fixed unit x, Pr[ |SAx|_2^2 ∈ 1 ± ε ] ≥ 1 − 2^{−Θ(d)}
• For a fixed pair of unit x, x′: |SAx|_2^2, |SAx′|_2^2, |SA(x − x′)|_2^2
are all within 1 ± ε of their un-sketched counterparts with probability 1 − 2^{−Θ(d)}
• |SA(x − x′)|_2^2 = |SAx|_2^2 + |SAx′|_2^2 − 2<SAx, SAx′>
• |A(x − x′)|_2^2 = |Ax|_2^2 + |Ax′|_2^2 − 2<Ax, Ax′>
– So Pr[ <Ax, Ax′> = <SAx, SAx′> ± O(ε) ] ≥ 1 − 2^{−Θ(d)}
• Choose a ½-net M = {Ax | x in N} of size 5^d
• By a union bound, for all pairs y, y′ in M,
<y, y′> = <Sy, Sy′> ± O(ε)
• Condition on this event
• By linearity, if this holds for y, y′ in M, then for αy, βy′ we have
<αy, βy′> = αβ <Sy, Sy′> ± O(ε αβ)
Finishing the Net Argument
• Let y = Ax for an arbitrary x ∈ S^{d−1}
• Let y_1 ∈ M be such that |y − y_1|_2 ≤ γ
• Let α be such that |α(y − y_1)|_2 = 1
– α ≥ 1/γ (could be infinite)
• Let y_2′ ∈ M be such that |α(y − y_1) − y_2′|_2 ≤ γ
• Set y_2 = y_2′/α. Then |y − y_1 − y_2|_2 ≤ γ/α ≤ γ^2
• Repeat, obtaining y_1, y_2, y_3, … such that for
all integers i,
|y − y_1 − y_2 − … − y_i|_2 ≤ γ^i
• Implies |y_i|_2 ≤ γ^{i−1} + γ^i ≤ 2γ^{i−1}
Finishing the Net Argument
• Have y_1, y_2, y_3, … such that y = Σ_i y_i and |y_i|_2 ≤ 2γ^{i−1}
• |Sy|_2^2 = |S Σ_i y_i|_2^2
= Σ_i |Sy_i|_2^2 + 2 Σ_{i<j} <Sy_i, Sy_j>
= Σ_i |y_i|_2^2 + 2 Σ_{i<j} <y_i, y_j> ± O(ε) Σ_{i,j} |y_i|_2 |y_j|_2
= |Σ_i y_i|_2^2 ± O(ε)   (since Σ_{i,j} |y_i|_2 |y_j|_2 ≤ (Σ_i 2γ^{i−1})^2 = O(1) for γ = ½)
= |y|_2^2 ± O(ε)
= 1 ± O(ε)
• Since this held for an arbitrary y = Ax for unit x, by
linearity it follows that for all x, |SAx|_2 = (1±ε)|Ax|_2
Back to Regression
• We showed that S is a subspace
embedding, that is, simultaneously for all x,
|SAx|2 = (1±ε)|Ax|2
What does this have to do with regression?
Subspace Embeddings for
Regression
• Want x so that |Ax-b|_2 ≤ (1+ε) min_y |Ay-b|_2
• Consider subspace L spanned by columns of A
together with b
• Then for all y in L, |Sy|2 = (1± ε) |y|2
• Hence, |S(Ax-b)|2 = (1± ε) |Ax-b|2 for all x
• Solve argminy |(SA)y – (Sb)|2
• Given SA, Sb, can solve in poly(d/ε) time
Only problem is computing SA takes O(nd^2) time
How to choose the right sketching matrix S? [S]
 S is a Subsampled Randomized Hadamard Transform
 S = P*H*D
 D is a diagonal matrix with +1, -1 entries on the diagonal
 H is the Hadamard transform
 P just chooses a random (small) subset of rows of H*D
 S*A can be computed in O(nd log n) time
Why does it work?
31
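A hedged sketch (not the lecture's code) of applying S = P*H*D with a fast Walsh-Hadamard transform; n is assumed to be a power of 2, and the scaling sqrt(n/k) is one common normalization that keeps E[|SAx|^2] = |Ax|^2:

import numpy as np

def fwht(x):
    # iterative fast Walsh-Hadamard transform along axis 0 (unnormalized)
    y = x.copy()
    h = 1
    n = y.shape[0]
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b
            y[i + h:i + 2 * h] = a - b
        h *= 2
    return y

rng = np.random.default_rng(4)
n, d, k = 2**12, 8, 256
A = rng.standard_normal((n, d))

D = rng.choice([-1.0, 1.0], size=n)            # random +1/-1 diagonal
HDA = fwht(D[:, None] * A) / np.sqrt(n)        # H*D*A, scaled so HD is a rotation
rows = rng.integers(0, n, size=k)              # P: sample k rows uniformly with replacement
SA = np.sqrt(n / k) * HDA[rows]

x = rng.standard_normal(d)
print(np.linalg.norm(SA @ x) / np.linalg.norm(A @ x))   # should be close to 1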
Why does this work?
 We can again assume columns of A are orthonormal
 It suffices to show |SAx|_2^2 = |PHDAx|_2^2 = 1 ± ε for all unit x
 HD is a rotation matrix, so |HDAx|_2^2 = |Ax|_2^2 = 1 for any unit x
 Notation: let y = Ax
 Flattening Lemma: For any fixed y,
Pr[ |HDy|_∞ ≥ C log^{.5}(nd/δ) / n^{.5} ] ≤ δ/(2d)
32
Proving the Flattening Lemma
 Flattening Lemma: Pr[ |HDy|_∞ ≥ C log^{.5}(nd/δ) / n^{.5} ] ≤ δ/(2d)
 Let C > 0 be a constant. We will show for a fixed i in [n],
Pr[ |(HDy)_i| ≥ C log^{.5}(nd/δ) / n^{.5} ] ≤ δ/(2nd)
 If we show this, we can apply a union bound over all i
 (HDy)_i = Σ_j H_{i,j} D_{j,j} y_j
 (Azuma-Hoeffding) Pr[ |Σ_j Z_j| > t ] ≤ 2e^{−t^2/(2 Σ_j β_j^2)}, where |Z_j| ≤ β_j with probability 1
 Z_j = H_{i,j} D_{j,j} y_j has 0 mean
 |Z_j| ≤ |y_j| / n^{.5} = β_j with probability 1
 Σ_j β_j^2 = 1/n
 Pr[ |Σ_j Z_j| > C log^{.5}(nd/δ) / n^{.5} ] ≤ 2e^{−C^2 log(nd/δ)/2} ≤ δ/(2nd)
33
Consequence of the Flattening Lemma
 Recall columns of A are orthonormal
 HDA has orthonormal columns
 Flattening Lemma implies |HDAe_i|_∞ ≤ C log^{.5}(nd/δ) / n^{.5} with
probability 1 − δ/(2d) for a fixed i ∈ [d]
 With probability 1 − δ/2,
|e_j^T HDA e_i| ≤ C log^{.5}(nd/δ) / n^{.5} for all i,j
 Given this, |e_j^T HDA|_2 ≤ C d^{.5} log^{.5}(nd/δ) / n^{.5} for all j
(Can be optimized further)
34
Matrix Chernoff Bound
 Let X_1, …, X_s be independent copies of a symmetric random matrix X ∈ R^{d×d}
with E[X] = 0, |X|_2 ≤ γ, and |E[X^T X]|_2 ≤ σ^2. Let W = (1/s) Σ_{i∈[s]} X_i. For any ε > 0,
Pr[ |W|_2 > ε ] ≤ 2d ⋅ e^{−sε^2/(σ^2 + γε/3)}
(here |W|_2 = sup |Wx|_2 / |x|_2)
 Let V = HDA, and recall V has orthonormal columns
 Suppose P in the S = PHD definition samples uniformly with replacement. If
row i is sampled in the j-th sample, then P_{j,i} = (n/s)^{.5}, and is 0 otherwise
 Let Y_i be the i-th sampled row of V = HDA
 Let X_i = I_d − n ⋅ Y_i^T Y_i
 E[X_i] = I_d − n ⋅ (1/n) Σ_j V_j^T V_j = I_d − V^T V = 0_d
 |X_i|_2 ≤ |I_d|_2 + n ⋅ max_j |e_j^T HDA|_2^2 = 1 + n ⋅ C^2 log(nd/δ) ⋅ (d/n) = Θ(d log(nd/δ))
35
Matrix Chernoff Bound
 Recall: let Y_i be the i-th sampled row of V = HDA, and let X_i = I_d − n ⋅ Y_i^T Y_i
 E[X^T X] + I_d = I_d + I_d − 2n E[Y_i^T Y_i] + n^2 E[Y_i^T Y_i Y_i^T Y_i]
= 2I_d − 2I_d + n^2 ⋅ (1/n) Σ_i v_i^T v_i v_i^T v_i
= n Σ_i v_i^T v_i ⋅ |v_i|_2^2, where v_i is the i-th row of V and |v_i|_2^2 ≤ C^2 log(nd/δ) ⋅ d/n
 Define Z = n Σ_i v_i^T v_i ⋅ C^2 log(nd/δ) ⋅ d/n
 Note that E[X^T X] + I_d and Z are real symmetric, with non-negative eigenvalues
 Claim: for all vectors y, we have y^T E[X^T X] y + y^T y ≤ y^T Z y
 Proof: y^T E[X^T X] y + y^T y = n Σ_i <v_i, y>^2 |v_i|_2^2, and
y^T Z y = n Σ_i <v_i, y>^2 C^2 log(nd/δ) ⋅ d/n
 Hence, |E[X^T X]|_2 ≤ |E[X^T X] + I_d|_2 + |I_d|_2 ≤ |Z|_2 + 1 ≤ C^2 d log(nd/δ) + 1
 Hence, |E[X^T X]|_2 = O(d log(nd/δ))
36
Matrix Chernoff Bound
 Hence, |E[X^T X]|_2 = O(d log(nd/δ))
 Recall: (Matrix Chernoff) Let X_1, …, X_s be independent copies of a
symmetric random matrix X ∈ R^{d×d} with E[X] = 0, |X|_2 ≤ γ, and |E[X^T X]|_2 ≤ σ^2.
Let W = (1/s) Σ_{i∈[s]} X_i. For any ε > 0, Pr[ |W|_2 > ε ] ≤ 2d ⋅ e^{−sε^2/(σ^2 + γε/3)}
 Here W = I_d − (PHDA)^T PHDA, so
Pr[ |I_d − (PHDA)^T PHDA|_2 > ε ] ≤ 2d ⋅ e^{−sε^2/Θ(d log(nd/δ))}
 Set s = d log(nd/δ) ⋅ log(d/δ) / ε^2 to make this probability less than δ/2
37
SRHT Wrapup
 Have shown |I_d − (PHDA)^T PHDA|_2 < ε using the Matrix
Chernoff Bound and with s = d log(nd/δ) ⋅ log(d/δ) / ε^2 samples
 Implies for every unit vector x,
|1 − |PHDAx|_2^2| = |x^T x − x^T (PHDA)^T (PHDA) x| < ε,
so |PHDAx|_2^2 ∈ 1 ± ε for all unit vectors x
 Considering the column span of A adjoined with b, we can
again solve the regression problem
 The time for regression is now only O(nd log n) + poly(d log n / ε). Nearly optimal in matrix dimensions (n >> d)
38
Faster Subspace Embeddings S [CW,MM,NN]
 CountSketch matrix
 Define k x n matrix S, for k = O(d^2/ε^2)
 S is really sparse: single randomly chosen non-zero
entry per column. Example:
[  0  0  1  0  0  1  0  0
   1  0  0  0  0  0  0  0
   0  0  0 -1  1  0 -1  0
   0 -1  0  0  0  0  0  1 ]
 nnz(A) is the number of non-zero entries of A
Can compute S ⋅ A in nnz(A) time!
39
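A minimal CountSketch sketch in numpy/scipy (an illustration, not the lecture's code): S has one random ±1 per column, so applying it costs O(nnz(A)):

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(5)
n, d, k = 50_000, 10, 2_000
A = rng.standard_normal((n, d))

h = rng.integers(0, k, size=n)                 # bucket for each of the n coordinates
sigma = rng.choice([-1.0, 1.0], size=n)        # random sign for each coordinate
S = sp.csr_matrix((sigma, (h, np.arange(n))), shape=(k, n))   # one nonzero per column

SA = S @ A                                     # O(nnz(A)) when A is sparse
x = rng.standard_normal(d)
print(np.linalg.norm(SA @ x) / np.linalg.norm(A @ x))   # close to 1; the subspace guarantee needs k = O(d^2/eps^2)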
Simple Proof [Nguyen]
 Can assume columns of A are orthonormal
 Suffices to show |SAx|_2 = 1 ± ε for all unit x
 For regression, apply S to [A, b]
 SA is a 2d^2/ε^2 x d matrix
 Suffices to show |A^T S^T SA – I|_2 ≤ |A^T S^T SA – I|_F ≤ ε
 Matrix product result shown below:
Pr[ |CS^T SD – CD|_F^2 ≤ [3/(δ(# rows of S))] ⋅ |C|_F^2 |D|_F^2 ] ≥ 1 − δ
 Set C = A^T and D = A.
 Then |A|_F^2 = d and (# rows of S) = 3d^2/(δε^2)
40
Matrix Product Result
 Show: Pr[ |CS^T SD – CD|_F^2 ≤ [3/(δ(# rows of S))] ⋅ |C|_F^2 |D|_F^2 ] ≥ 1 − δ
 (JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL
moment property if for all x ∈ R^n with |x|_2 = 1,
E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ ⋅ δ
 (From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on
matrices S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL
moment property for some ℓ ≥ 2. Then for A, B matrices with n rows,
Pr_S[ |A^T S^T SB − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ
41
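A hedged numerical check (not from the slides) of this approximate matrix product bound, instantiated with a CountSketch S built by bucket accumulation:

import numpy as np

rng = np.random.default_rng(6)
n, p, q, k = 10_000, 15, 12, 3_000
C = rng.standard_normal((p, n))
D = rng.standard_normal((n, q))

h = rng.integers(0, k, size=n)                 # CountSketch hash buckets
sigma = rng.choice([-1.0, 1.0], size=n)        # CountSketch signs
SCt = np.zeros((k, p))
SD = np.zeros((k, q))
np.add.at(SCt, h, sigma[:, None] * C.T)        # accumulate S C^T bucket by bucket
np.add.at(SD, h, sigma[:, None] * D)           # accumulate S D

err = np.linalg.norm(SCt.T @ SD - C @ D)       # |C S^T S D - C D|_F
bound = np.sqrt(3.0 / (0.1 * k)) * np.linalg.norm(C) * np.linalg.norm(D)   # delta = 0.1
print(err, bound)                              # err should fall well below the bound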
From Vectors to Matrices
 (From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices
S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL moment property
for some ℓ ≥ 2. Then for A, B matrices with n rows,
Pr_S[ |A^T S^T SB − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ
 Proof: For a random scalar X, let |X|_p = (E|X|^p)^{1/p}
 Sometimes consider X = |T|_F for a random matrix T, i.e., | |T|_F |_p = (E |T|_F^p)^{1/p}
 Can show |⋅|_p is a norm
 Minkowski's Inequality: |X + Y|_p ≤ |X|_p + |Y|_p
 For unit vectors x, y, need to bound | <Sx, Sy> − <x, y> |_ℓ
42
From Vectors to Matrices
 For unit vectors x, y,
| <Sx, Sy> − <x, y> |_ℓ
= (1/2) | (|Sx|_2^2 − 1) + (|Sy|_2^2 − 1) − (|S(x−y)|_2^2 − |x−y|_2^2) |_ℓ
≤ (1/2) ( | |Sx|_2^2 − 1 |_ℓ + | |Sy|_2^2 − 1 |_ℓ + | |S(x−y)|_2^2 − |x−y|_2^2 |_ℓ )
≤ (1/2) ( ε δ^{1/ℓ} + ε δ^{1/ℓ} + |x−y|_2^2 ε δ^{1/ℓ} )
≤ 3 ε ⋅ δ^{1/ℓ}
 By linearity, for arbitrary x, y,
| ( <Sx, Sy> − <x, y> ) / (|x|_2 |y|_2) |_ℓ ≤ 3 ε ⋅ δ^{1/ℓ}
 Suppose A has d columns and B has e columns. Let the columns of A be
A_1, …, A_d and the columns of B be B_1, …, B_e
 Define X_{i,j} = (1/(|A_i|_2 |B_j|_2)) ⋅ ( <SA_i, SB_j> − <A_i, B_j> )
 |A^T S^T SB − A^T B|_F^2 = Σ_i Σ_j |A_i|_2^2 |B_j|_2^2 X_{i,j}^2
43
From Vectors to Matrices
 Have shown: for arbitrary x, y, | ( <Sx, Sy> − <x, y> ) / (|x|_2 |y|_2) |_ℓ ≤ 3 ε ⋅ δ^{1/ℓ}
 For X_{i,j} = (1/(|A_i|_2 |B_j|_2)) ⋅ ( <SA_i, SB_j> − <A_i, B_j> ):
|A^T S^T SB − A^T B|_F^2 = Σ_{i,j} |A_i|_2^2 |B_j|_2^2 X_{i,j}^2
 | |A^T S^T SB − A^T B|_F^2 |_{ℓ/2} = | Σ_{i,j} |A_i|_2^2 |B_j|_2^2 X_{i,j}^2 |_{ℓ/2}
≤ Σ_{i,j} |A_i|_2^2 |B_j|_2^2 | X_{i,j}^2 |_{ℓ/2}
= Σ_{i,j} |A_i|_2^2 |B_j|_2^2 | X_{i,j} |_ℓ^2
≤ (3 ε δ^{1/ℓ})^2 Σ_{i,j} |A_i|_2^2 |B_j|_2^2
= (3 ε δ^{1/ℓ})^2 |A|_F^2 |B|_F^2
 Since E[ |A^T S^T SB − A^T B|_F^ℓ ] = ( | |A^T S^T SB − A^T B|_F^2 |_{ℓ/2} )^{ℓ/2}, by Markov's inequality,
Pr[ |A^T S^T SB − A^T B|_F > 3ε |A|_F |B|_F ] ≤ ( 1/(3ε |A|_F |B|_F) )^ℓ ⋅ E[ |A^T S^T SB − A^T B|_F^ℓ ] ≤ δ
44
Result for Vectors
 Show: Pr[ |CS^T SD – CD|_F^2 ≤ [3/(δ(# rows of S))] ⋅ |C|_F^2 |D|_F^2 ] ≥ 1 − δ
 (JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL moment
property if for all x ∈ R^n with |x|_2 = 1,
E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ ⋅ δ
 (From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices
S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL moment property
for some ℓ ≥ 2. Then for A, B matrices with n rows,
Pr_S[ |A^T S^T SB − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ
 Just need to show that the CountSketch matrix S satisfies the JL property and
bound the number k of rows
45
CountSketch Satisfies the JL Property
 (JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL moment
property if for all x ∈ R^n with |x|_2 = 1,
E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ ⋅ δ
 We show this property holds with ℓ = 2. First, let us consider ℓ = 1
 For the CountSketch matrix S, let
 h: [n] -> [k] be a 2-wise independent hash function
 σ: [n] -> {−1, 1} be a 4-wise independent hash function
 Let δ(E) = 1 if event E holds, and δ(E) = 0 otherwise
 E[ |Sx|_2^2 ] = Σ_{j∈[k]} E[ ( Σ_{i∈[n]} δ(h(i) = j) σ_i x_i )^2 ]
= Σ_{j∈[k]} Σ_{i1,i2∈[n]} E[ δ(h(i1) = j) δ(h(i2) = j) σ_{i1} σ_{i2} ] x_{i1} x_{i2}
= Σ_{j∈[k]} Σ_{i∈[n]} E[ δ(h(i) = j)^2 ] x_i^2
= (1/k) Σ_{j∈[k]} Σ_{i∈[n]} x_i^2 = |x|_2^2
46
CountSketch Satisfies the JL Property
 E[ |Sx|_2^4 ] = E[ Σ_{j∈[k]} ( Σ_{i∈[n]} δ(h(i) = j) σ_i x_i )^2 ⋅ Σ_{j′∈[k]} ( Σ_{i′∈[n]} δ(h(i′) = j′) σ_{i′} x_{i′} )^2 ]
= Σ_{j1,j2,i1,i2,i3,i4} E[ σ_{i1} σ_{i2} σ_{i3} σ_{i4} δ(h(i1) = j1) δ(h(i2) = j1) δ(h(i3) = j2) δ(h(i4) = j2) ] x_{i1} x_{i2} x_{i3} x_{i4}
 We must be able to partition {i1, i2, i3, i4} into equal pairs
 Suppose i1 = i2 = i3 = i4. Then necessarily j1 = j2. Obtain Σ_j (1/k) Σ_i x_i^4 = |x|_4^4
 Suppose i1 = i2 and i3 = i4 but i1 ≠ i3. Then get Σ_{j1,j2} Σ_{i1≠i3} (1/k^2) x_{i1}^2 x_{i3}^2 ≤ |x|_2^4
 Suppose i1 = i3 and i2 = i4 but i1 ≠ i2. Then necessarily j1 = j2. Obtain
Σ_j (1/k^2) Σ_{i1≠i2} x_{i1}^2 x_{i2}^2 ≤ (1/k) |x|_2^4. Obtain the same bound if i1 = i4 and i2 = i3.
 Hence, E[ |Sx|_2^4 ] ∈ [ |x|_2^4, |x|_2^4 (1 + 2/k) ] = [1, 1 + 2/k]
 So, E_S ( |Sx|_2^2 − 1 )^2 = E[ |Sx|_2^4 ] − 2 E[ |Sx|_2^2 ] + 1 ≤ 1 + 2/k − 2 + 1 = 2/k.
Setting k = 2/(ε^2 δ) finishes the proof
47
Where are we?
 (JL Property) A distribution on matrices S ∈ R^{k×n} has the (ε, δ, ℓ)-JL
moment property if for all x ∈ R^n with |x|_2 = 1,
E_S | |Sx|_2^2 − 1 |^ℓ ≤ ε^ℓ ⋅ δ
 (From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on
matrices S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL moment
property for some ℓ ≥ 2. Then for A, B matrices with n rows,
Pr_S[ |A^T S^T SB − A^T B|_F^2 ≥ (3ε)^2 |A|_F^2 |B|_F^2 ] ≤ δ
 We showed CountSketch has the JL property with ℓ = 2 and k = 2/(ε^2 δ)
 Matrix product result we wanted was:
Pr[ |CS^T SD – CD|_F^2 ≤ (3/(δk)) ⋅ |C|_F^2 |D|_F^2 ] ≥ 1 − δ
 We are now done with the proof that CountSketch is a subspace embedding
48
Course Outline
 Subspace embeddings and least squares regression
 Gaussian matrices
 Subsampled Randomized Hadamard Transform
 CountSketch
 Affine embeddings
 Application to low rank approximation
 High precision regression
49
Affine Embeddings

Want to solve min 𝐴𝑋 − 𝐵 2𝐹 , A is tall and thin with d columns, but B has a large
𝑋
number of columns

Can’t directly apply subspace embeddings

Let’s try to show SAX − SB
we need of S

Can assume A has orthonormal columns

Let 𝐵 ∗ = 𝐴𝑋 ∗ − 𝐵, where 𝑋 ∗ is the optimum

F
= 1 ± ϵ AX − B
F
for all X and see what properties
𝑆 𝐴𝑋 − 𝐵 2𝐹 − 𝑆𝐵 ∗ 2𝐹 = 𝑆𝐴 𝑋 − 𝑋 ∗ + 𝑆 𝐴𝑋 ∗ − 𝐵 2𝐹 − 𝑆𝐵 ∗ 2𝐹
= 𝑆𝐴 𝑋 − 𝑋 ∗ 2𝐹 − 2𝑡𝑟[ 𝑋 − 𝑋 ∗ 𝑇 𝐴𝑇 𝑆 𝑇 𝑆𝐵∗ ]
∈ 𝑆𝐴 𝑋 − 𝑋 ∗ 2𝐹 ± 2 𝑋 − 𝑋 ∗ 𝐹 𝐴𝑇 𝑆 𝑇 𝑆𝐵 ∗ 𝐹 (use 𝑡𝑟 𝐶𝐷 ≤ 𝐶 𝐹 𝐷 𝐹 )
∈ 𝑆𝐴 𝑋 − 𝑋 ∗ 2𝐹 ±2𝜖 𝑋 − 𝑋 ∗ 𝐹 B ∗ F (if we have approx. matrix product)
∈ 𝐴 𝑋 − 𝑋 ∗ 2𝐹 ± 𝜖( 𝐴 𝑋 − 𝑋 ∗ 2𝐹 + 2 𝑋 − 𝑋 ∗ 𝐹 𝐵 ∗ ) (subspace embedding for50A)
Affine Embeddings
 We have |S(AX − B)|_F^2 − |SB*|_F^2 ∈ |A(X − X*)|_F^2 ± ε( |A(X − X*)|_F^2 + 2 |X − X*|_F |B*|_F )
 Normal equations imply that
|AX − B|_F^2 = |A(X − X*)|_F^2 + |B*|_F^2
 |SB*|_F^2 = (1 ± ε) |B*|_F^2 (this holds with constant probability)
 Hence, |S(AX − B)|_F^2 − |SB*|_F^2 − ( |AX − B|_F^2 − |B*|_F^2 )
∈ ±ε( |A(X − X*)|_F^2 + 2 |X − X*|_F |B*|_F )
⊆ ±ε ( |A(X − X*)|_F + |B*|_F )^2   (since |X − X*|_F = |A(X − X*)|_F)
⊆ ±2ε ( |A(X − X*)|_F^2 + |B*|_F^2 )
= ±2ε |AX − B|_F^2
51
Affine Embeddings
 Know: |S(AX − B)|_F^2 − |SB*|_F^2 − ( |AX − B|_F^2 − |B*|_F^2 ) ∈ ±2ε |AX − B|_F^2
 Know: |SB*|_F^2 = (1 ± ε) |B*|_F^2
 |S(AX − B)|_F^2 = (1 ± 2ε) |AX − B|_F^2 ± ε |B*|_F^2
= (1 ± 3ε) |AX − B|_F^2
 Completes proof of affine embedding!
52
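A minimal numpy sketch (not from the slides) of using one sketch on both A and B in the multiple-response problem min_X |AX − B|_F, here with a CountSketch built by bucket accumulation; all sizes are illustrative:

import numpy as np

rng = np.random.default_rng(7)
n, d, m, k = 20_000, 10, 500, 2_000       # B has many (m) columns
A = rng.standard_normal((n, d))
B = A @ rng.standard_normal((d, m)) + rng.standard_normal((n, m))

h = rng.integers(0, k, size=n)            # CountSketch applied to [A, B]
sigma = rng.choice([-1.0, 1.0], size=n)
SA = np.zeros((k, d))
SB = np.zeros((k, m))
np.add.at(SA, h, sigma[:, None] * A)
np.add.at(SB, h, sigma[:, None] * B)

X_sketch = np.linalg.lstsq(SA, SB, rcond=None)[0]
X_opt = np.linalg.lstsq(A, B, rcond=None)[0]
print(np.linalg.norm(A @ X_sketch - B) / np.linalg.norm(A @ X_opt - B))  # about 1 + eps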
Low rank approximation
 A is an n x d matrix
 Think of n points in Rd
 E.g., A is a customer-product matrix
 Ai,j = how many times customer i purchased item j
 A is typically well-approximated by low rank matrix
 E.g., high rank because of noise
 Goal: find a low rank matrix approximating A
 Easy to store, data more interpretable
53
What is a good low rank approximation?
Singular Value Decomposition (SVD)
Any matrix A = U ⋅ Σ ⋅ V
 U has orthonormal columns
 Σ is diagonal with non-increasing positive
entries down the diagonal
 V has orthonormal rows
 Rank-k approximation: A_k = U_k ⋅ Σ_k ⋅ V_k
A_k = argmin_{rank k matrices B} |A-B|_F, where |C|_F = (Σ_{i,j} C_{i,j}^2)^{1/2}
The rows of V_k are the top k principal components
Computing A_k exactly is expensive
54
Low rank approximation
 Goal: output a rank k matrix A’, so that
|A-A'|_F ≤ (1+ε) |A-A_k|_F
 Can do this in nnz(A) + (n+d)*poly(k/ε) time [S,CW]
 nnz(A) is number of non-zero entries of A
55
Solution to low-rank approximation [S]
 Given n x d input matrix A
 Compute S*A using a random matrix S with k/ε << n
rows. S*A takes random linear combinations of rows of A
[Figure: the wide n x d matrix A is compressed to the much smaller sketch SA]
 Project rows of A onto SA, then find best rank-k
approximation to points inside of SA.
56
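A hedged numpy sketch of this algorithm (a Gaussian S and arbitrary sizes chosen for illustration): sketch the rows of A, project A onto the row span of SA, and keep the best rank-k part:

import numpy as np

rng = np.random.default_rng(8)
n, d, k, sketch_rows = 5_000, 400, 10, 60
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
A = A + 0.01 * rng.standard_normal((n, d))      # low rank plus noise

S = rng.standard_normal((sketch_rows, n)) / np.sqrt(sketch_rows)
SA = S @ A                                      # random linear combinations of rows of A

Q, _ = np.linalg.qr(SA.T)                       # orthonormal basis for rowspan(SA)
P = A @ Q                                       # project rows of A onto rowspan(SA)
U, s, Vt = np.linalg.svd(P, full_matrices=False)
A_approx = (U[:, :k] * s[:k]) @ Vt[:k] @ Q.T    # best rank-k matrix inside rowspan(SA)

s_full = np.linalg.svd(A, compute_uv=False)
best = np.sqrt((s_full[k:] ** 2).sum())         # |A - A_k|_F
print(np.linalg.norm(A - A_approx) / best)      # about 1 + eps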
What is the matrix S?
 S can be a k/ε x n matrix of i.i.d. normal random
variables
 [S] S can be a k/ε x n Fast Johnson Lindenstrauss
Matrix
 Uses Fast Fourier Transform
 [CW] S can be a poly(k/ε) x n CountSketch matrix
[  0  0  1  0  0  1  0  0
   1  0  0  0  0  0  0  0
   0  0  0 -1  1  0 -1  0
   0 -1  0  0  0  0  0  1 ]
S ⋅ A can be computed in nnz(A) time
57
Why do these Matrices Work?
 Consider the regression problem min_X |A_k X − A|_F
 Write A_k = WY, where W is n x k and Y is k x d
 Let S be an affine embedding
 Then |SA_k X − SA|_F = (1 ± ε) |A_k X − A|_F for all X
 By the normal equations, argmin_X |SA_k X − SA|_F = (SA_k)^− SA
 So, |A_k (SA_k)^− SA − A|_F ≤ (1 + ε) |A_k − A|_F
 But A_k (SA_k)^− SA is a rank-k matrix in the row span of SA!
 By the Pythagorean theorem, the best rank-k matrix in the row span of SA is
obtained by projecting the rows of A onto SA, then computing the SVD
58
Caveat: projecting the points onto SA is slow
 Current algorithm:
1. Compute S*A
2. Project each of the rows of A onto S*A
3. Find best rank-k approximation of the projected points
inside of the rowspace of S*A
 Bottleneck is step 2, i.e., the regression problem min_{rank-k X} |X(SA) − A|_F^2
 [CW] Approximate the projection: a fast algorithm for approximate regression
solves min_{rank-k X} |X(SA)R − AR|_F^2 instead, where R is an affine embedding applied on the right
 nnz(A) + (n+d)*poly(k/ε) time
59
High Precision Regression
 Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2
with high probability
 Our algorithms all have running time poly(d/ε)
 Goal: Sometimes we want running time poly(d)*log(1/ε)
 Want to make A well-conditioned
 κ(A) = sup_{|x|_2 = 1} |Ax|_2 / inf_{|x|_2 = 1} |Ax|_2
 Lots of algorithms' time complexity depends on κ(A)
 Use sketching to reduce κ(A) to O(1)!
60
Small QR Decomposition
 Let S be a (1 + ε_0)-subspace embedding for A
 Compute SA
 Compute a QR-factorization, SA = Q R^{-1}
 Claim: κ(AR) ≤ (1 + ε_0)/(1 − ε_0)
 For all unit x, (1 − ε_0) |ARx|_2 ≤ |SARx|_2 = 1
 For all unit x, (1 + ε_0) |ARx|_2 ≥ |SARx|_2 = 1
 So κ(AR) ≤ (1 + ε_0)/(1 − ε_0)
61
Finding a Constant Factor Solution
 Let S be a (1 + ε_0)-subspace embedding for AR
 Solve x_0 = argmin_x |SARx − Sb|_2
 Time to compute AR and x_0 is nnz(A) + poly(d) for constant ε_0
 Iterate: x_{m+1} ← x_m + R^T A^T (b − AR x_m)
 AR(x_{m+1} − x*) = AR(x_m + R^T A^T (b − AR x_m) − x*)
= (AR − AR R^T A^T AR)(x_m − x*)   [using R^T A^T b = R^T A^T AR x*, the normal equations for x*]
= U(Σ − Σ^3) V^T (x_m − x*),
where AR = U Σ V^T is the SVD of AR
 |AR(x_{m+1} − x*)|_2 = |(Σ − Σ^3) V^T (x_m − x*)|_2 = O(ε_0) |AR(x_m − x*)|_2
 |ARx_m − b|_2^2 = |AR(x_m − x*)|_2^2 + |ARx* − b|_2^2
62
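A hedged numpy sketch (not the lecture's code) of the whole high-precision recipe: sketch, take a QR factorization to precondition, then iterate. The sketch size and iteration count are illustrative choices, and the iteration starts from 0 instead of the sketched x_0, for brevity:

import numpy as np

rng = np.random.default_rng(9)
n, d = 10_000, 30
A = rng.standard_normal((n, d)) @ np.diag(np.logspace(0, 4, d))   # ill-conditioned A
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

k = 20 * d                                     # constant-factor subspace embedding
S = rng.standard_normal((k, n)) / np.sqrt(k)
_, Rfac = np.linalg.qr(S @ A)                  # S A = Q * Rfac
Rinv = np.linalg.inv(Rfac)                     # A @ Rinv plays the role of AR, kappa = O(1)

y = np.zeros(d)                                # iterate in the preconditioned variables
for _ in range(30):                            # each step shrinks the error by O(eps_0)
    y = y + Rinv.T @ (A.T @ (b - A @ (Rinv @ y)))
x = Rinv @ y                                   # map back to the original variables

x_opt = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(A @ x - b) / np.linalg.norm(A @ x_opt - b))   # converges to 1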