Latent Semantic Indexing
(mapping onto a smaller space of latent concepts)
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Reading 18
Speeding up cosine computation

What if we could take our vectors and "pack" them into fewer dimensions
(say from 50,000 down to 100) while preserving distances?

Now: O(nm) to compute cos(d,q) for all d
Then: O(km + kn), where k << n,m
Two methods:

"Latent semantic indexing" (LSI)
Random projection

Briefly

LSI is data-dependent:
Create a k-dim subspace by eliminating redundant axes
Pull together "related" axes (hopefully), e.g. car and automobile
What about polysemy?

Random projection is data-independent:
Choose a k-dim subspace that guarantees good stretching properties
with high probability between any pair of points.
Notions from linear algebra

Matrix A, vector v
Matrix transpose (At)
Matrix product
Rank
Eigenvalues λ and eigenvectors v: Av = λv
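
As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of these notions on an arbitrary 2 × 2 matrix: transpose, rank, and the eigen-equation Av = λv.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # a small, arbitrary symmetric matrix
print(A.T)                            # transpose At
print(np.linalg.matrix_rank(A))       # rank

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors
lam, v = eigvals[0], eigvecs[:, 0]
print(np.allclose(A @ v, lam * v))    # checks Av = lambda*v  ->  True
```
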
Overview of LSI

Pre-process docs using a technique from
linear algebra called Singular Value
Decomposition

Create a new (smaller) vector space

Queries handled (faster) in this new space
Singular-Value Decomposition

Recall the m × n matrix of terms × docs, A.
A has rank r ≤ m, n.

Define the term-term correlation matrix T = AAt
T is a square, symmetric m × m matrix
Let P be the m × r matrix of eigenvectors of T

Define the doc-doc correlation matrix D = AtA
D is a square, symmetric n × n matrix
Let R be the n × r matrix of eigenvectors of D
A’s decomposition


Given P (for T, m × r) and R (for D, n × r), both formed by
orthonormal columns (unit dot-product), it turns out that

A = P S Rt

where S is a diagonal r × r matrix whose entries are the singular values of A
(the square roots of the eigenvalues of T = AAt), in decreasing order.
A (m × n)  =  P (m × r) · S (r × r) · Rt (r × n)
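
A hedged NumPy sketch of this decomposition on a made-up 4 × 3 term-document matrix: np.linalg.svd returns the factors P, S (as a vector of singular values) and Rt, and we can check that the columns of P are orthonormal and that the squared singular values are the eigenvalues of T = AAt.

```python
import numpy as np

# Toy term-document matrix A (4 terms x 3 docs); the counts are arbitrary.
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 0., 3.]])

# Economy-size SVD: A = P @ diag(s) @ Rt, singular values s in decreasing order.
P, s, Rt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(A, P @ np.diag(s) @ Rt))       # reconstruction holds
print(np.allclose(P.T @ P, np.eye(P.shape[1])))  # columns of P are orthonormal
# Columns of P are eigenvectors of T = AAt, rows of Rt are eigenvectors of
# D = AtA, and s[i]**2 are the corresponding eigenvalues of T.
print(np.allclose(np.sort(np.linalg.eigvalsh(A @ A.T))[::-1][:len(s)], s**2))
```
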
Dimensionality reduction

For some k << r, zero out all but the k biggest singular values in S [the choice of k is crucial]

Denote by Sk this new version of S, having rank k

Typically k is about 100, while r (A’s rank) is > 10,000
Ak (m × n)  =  Pk (m × k) · Sk (k × k) · Rkt (k × n)
where Pk keeps the first k columns of P and Rkt the first k rows of Rt;
the remaining r-k columns/rows are useless due to the 0 columns/rows of Sk.
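
A small sketch of the truncation step, with a random matrix standing in for A (in practice k ≈ 100 while m, n and r are much larger); keeping the k largest singular values is equivalent to keeping the first k columns of P and the first k rows of Rt.

```python
import numpy as np

def truncate_svd(A, k):
    """Rank-k approximation Ak: keep the k largest singular values and the
    corresponding k columns of P / k rows of Rt, zeroing out everything else."""
    P, s, Rt = np.linalg.svd(A, full_matrices=False)
    return P[:, :k] @ np.diag(s[:k]) @ Rt[:k, :]

A = np.random.rand(50, 30)            # stand-in for the m x n term-doc matrix
Ak = truncate_svd(A, k=5)
print(np.linalg.matrix_rank(Ak))      # 5: Ak has rank k
```
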
Guarantee

Ak is a pretty good approximation to A:



Relative distances are (approximately) preserved
Of all m × n matrices of rank k, Ak is the best approximation to A wrt the following measures:

min_{B: rank(B)=k} ||A - B||_2 = ||A - Ak||_2 = σ_{k+1}

min_{B: rank(B)=k} ||A - B||_F² = ||A - Ak||_F² = σ_{k+1}² + σ_{k+2}² + ... + σ_r²

where the Frobenius norm is ||A||_F² = σ_1² + σ_2² + ... + σ_r²
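
These optimality bounds can be checked numerically; a minimal sketch, again with a random stand-in matrix (for a 2-D array, np.linalg.norm with ord=2 gives the spectral norm, i.e. the largest singular value).

```python
import numpy as np

A = np.random.rand(40, 25)
P, s, Rt = np.linalg.svd(A, full_matrices=False)
k = 5
Ak = P[:, :k] @ np.diag(s[:k]) @ Rt[:k, :]

# Spectral-norm error equals the (k+1)-th singular value (sigma_{k+1} = s[k]) ...
print(np.isclose(np.linalg.norm(A - Ak, 2), s[k]))
# ... and the squared Frobenius error is the sum of the remaining squared singular values.
print(np.isclose(np.linalg.norm(A - Ak, 'fro') ** 2, np.sum(s[k:] ** 2)))
# Frobenius norm identity: ||A||_F^2 = sigma_1^2 + ... + sigma_r^2.
print(np.isclose(np.linalg.norm(A, 'fro') ** 2, np.sum(s ** 2)))
```
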
Reduction

Xk = Sk Rt is the doc-matrix, k × n, hence reduced to k dimensions
(recall that R and P are formed by orthonormal eigenvectors of the matrices D and T).

Since we are interested in doc/query correlation, we consider:

D = At A = (P S Rt)t (P S Rt) = (S Rt)t (S Rt)

Approximating S with Sk, we thus get At A ≈ Xkt Xk (both are n × n matrices).

We use Xk to define how to project A and q:

Xk = Sk Rt; substituting Rt = S^-1 Pt A, we get Xk = Pkt A.
In fact, Sk S^-1 Pt = Pkt, which is a k × m matrix.

This means that to reduce a doc/query vector it is enough to multiply it by Pkt,
thus paying O(km) per doc/query.
Cost of sim(q,d), for all d, is O(kn + km) instead of O(mn).
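
A minimal sketch of this projection step (the sizes and the random term-doc matrix are made up): every document and the query are mapped to k dimensions by multiplying with Pkt, and similarities are then computed in the reduced space.

```python
import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)

m, n, k = 1000, 200, 20               # toy sizes: terms, docs, latent dims
A = np.random.rand(m, n)              # stand-in for the term-doc matrix
P, s, Rt = np.linalg.svd(A, full_matrices=False)
Pk_t = P[:, :k].T                     # the k x m projection matrix Pkt

Xk = Pk_t @ A                         # = Sk Rt, the reduced doc-matrix, k x n
q = np.random.rand(m)                 # a query vector in term space
q_k = Pk_t @ q                        # projected query: O(km)

# Similarities are now computed in k dimensions: O(kn) instead of O(mn).
sims = [cosine(q_k, Xk[:, j]) for j in range(n)]
print(int(np.argmax(sims)))           # index of the most similar document
```
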
What are the concepts?

c-th concept = c-th row of Pkt
(which is k x m)

Denote it by Pkt [c], whose size is m = #terms

Pkt [c][i] = strength of association between c-th
concept and i-th term

Projected document: d’j = Pkt dj


d'j[c] = strength of concept c in dj
Projected query: q’ = Pkt q

q'[c] = strength of concept c in q
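
A toy example (the vocabulary and counts are made up) of inspecting the concepts: each row of Pkt associates a weight with every term, so the largest-magnitude entries show which terms dominate each concept.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal", "garden"]
# Made-up term-doc counts: docs 0-2 are about cars, docs 3-5 about gardening.
A = np.array([[3, 2, 1, 0, 0, 0],
              [2, 3, 1, 0, 0, 0],
              [1, 1, 2, 0, 0, 0],
              [0, 0, 0, 3, 1, 2],
              [0, 0, 0, 1, 2, 2],
              [0, 0, 0, 2, 2, 3]], dtype=float)

P, s, Rt = np.linalg.svd(A, full_matrices=False)
k = 2
Pk_t = P[:, :k].T                     # k x m: one row per concept

for c in range(k):
    weights = Pk_t[c]                 # strength of association concept c <-> each term
    top = np.argsort(-np.abs(weights))[:3]
    print(f"concept {c}:", [terms[i] for i in top])
```
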
Random Projections
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Slides only!
An interesting math result
Lemma (Johnson-Lindenstrauss, '82)
Let P be a set of n distinct points in m dimensions.
Given ε > 0, there exists a function f : P → R^k such that
for every pair of points u, v in P it holds:

(1 - ε) ||u - v||² ≤ ||f(u) - f(v)||² ≤ (1 + ε) ||u - v||²

where k = O(ε^-2 log n).

f() is called a JL-embedding.
Setting v = 0 we also get a bound on f(u)'s stretching!
What about the cosine-distance?
Since u·v = (||u||² + ||v||² - ||u - v||²) / 2, the bounds on f(u)'s and f(v)'s stretching,
together with the bound on ||f(u) - f(v)||² (substituting the formula above for ||u - v||²),
also control the dot-product f(u)·f(v), and hence the cosine-distance.
How to compute a JL-embedding?
If we set R = (r_ij) to be a random m × k matrix, whose components are
independent random variables drawn from a distribution with

E[r_ij] = 0
Var[r_ij] = 1
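
A hedged sketch of one standard choice that satisfies these conditions: independent standard Gaussian entries (another common choice is ±1, each with probability 1/2). The 1/√k scaling, which keeps the expected squared distances unbiased, and the use of a k × m matrix (so the projection is a single matrix product) are conventions assumed here, not taken from the slides.

```python
import numpy as np

def jl_project(X, k, seed=0):
    """Map the columns of X (m x n, one point per column) to k dimensions with a
    random Gaussian matrix; entries have E = 0 and Var = 1, and the 1/sqrt(k)
    factor makes the expected squared distances match the originals."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((k, X.shape[0]))
    return (R @ X) / np.sqrt(k)

m, n, k = 10_000, 100, 500            # toy sizes
X = np.random.rand(m, n)              # n points in m dimensions
Y = jl_project(X, k)

u, v = X[:, 0], X[:, 1]
fu, fv = Y[:, 0], Y[:, 1]
print(np.linalg.norm(fu - fv) ** 2 / np.linalg.norm(u - v) ** 2)  # close to 1 w.h.p.
```
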
Finally...

Random projections hide large constants:
k ≈ (1/ε)² * log n, so it may be large…
but it is simple and fast to compute.

LSI is intuitive and may scale to any k:
it is optimal under various metrics,
but it is costly to compute (although good libraries are now available).