Latent Semantic Indexing
(mapping onto a smaller space of latent concepts)
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Reading 18
Speeding up cosine computation

What if we could take our vectors and "pack" them into fewer dimensions
(say from 50,000 down to 100) while preserving distances?

Now: O(nm) to compute cos(d,q) for all d
Then: O(km + kn), where k << n,m
Two methods:

"Latent semantic indexing" (LSI)
Random projection

Briefly

LSI is data-dependent:
Create a k-dim subspace by eliminating redundant axes
Pull together "related" axes (hopefully), e.g. car and automobile
What about polysemy?

Random projection is data-independent:
Choose a k-dim subspace that guarantees good stretching properties
with high probability between any pair of points.
Notions from linear algebra

Matrix A, vector v
Matrix transpose (At)
Matrix product
Rank
Eigenvalues λ and eigenvectors v: Av = λv
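
As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of these notions on an arbitrary 2 × 2 matrix: transpose, rank, and the eigen-equation Av = λv.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # a small, arbitrary symmetric matrix
print(A.T)                            # transpose At
print(np.linalg.matrix_rank(A))       # rank

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors
lam, v = eigvals[0], eigvecs[:, 0]
print(np.allclose(A @ v, lam * v))    # checks Av = lambda*v  ->  True
```
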
Overview of LSI

Pre-process docs using a technique from
linear algebra called Singular Value
Decomposition

Create a new (smaller) vector space

Queries handled (faster) in this new space
Singular-Value Decomposition

Recall the m × n matrix of terms × docs, A.
A has rank r ≤ m, n.

Define the term-term correlation matrix T = AAt
T is a square, symmetric m × m matrix
Let P be the m × r matrix of eigenvectors of T

Define the doc-doc correlation matrix D = AtA
D is a square, symmetric n × n matrix
Let R be the n × r matrix of eigenvectors of D
A’s decomposition


Given P (for T, m × r) and R (for D, n × r), both formed by
orthonormal columns (unit dot-product), it turns out that

A = P S Rt

where S is a diagonal r × r matrix whose entries are the singular values of A
(the square roots of the eigenvalues of T = AAt), in decreasing order.
A (m × n)  =  P (m × r) · S (r × r) · Rt (r × n)
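
A hedged NumPy sketch of this decomposition on a made-up 4 × 3 term-document matrix: np.linalg.svd returns the factors P, S (as a vector of singular values) and Rt, and we can check that the columns of P are orthonormal and that the squared singular values are the eigenvalues of T = AAt.

```python
import numpy as np

# Toy term-document matrix A (4 terms x 3 docs); the counts are arbitrary.
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 0., 3.]])

# Economy-size SVD: A = P @ diag(s) @ Rt, singular values s in decreasing order.
P, s, Rt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(A, P @ np.diag(s) @ Rt))       # reconstruction holds
print(np.allclose(P.T @ P, np.eye(P.shape[1])))  # columns of P are orthonormal
# Columns of P are eigenvectors of T = AAt, rows of Rt are eigenvectors of
# D = AtA, and s[i]**2 are the corresponding eigenvalues of T.
print(np.allclose(np.sort(np.linalg.eigvalsh(A @ A.T))[::-1][:len(s)], s**2))
```
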
Dimensionality reduction

For some k << r, zero out all but the k biggest singular values in S [the choice of k is crucial]

Denote by Sk this new version of S, having rank k

Typically k is about 100, while r (A’s rank) is > 10,000
Ak (m × n)  =  Pk (m × k) · Sk (k × k) · Rkt (k × n)
where Pk keeps the first k columns of P and Rkt the first k rows of Rt;
the remaining r-k columns/rows are useless due to the 0 columns/rows of Sk.
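
A small sketch of the truncation step, with a random matrix standing in for A (in practice k ≈ 100 while m, n and r are much larger); keeping the k largest singular values is equivalent to keeping the first k columns of P and the first k rows of Rt.

```python
import numpy as np

def truncate_svd(A, k):
    """Rank-k approximation Ak: keep the k largest singular values and the
    corresponding k columns of P / k rows of Rt, zeroing out everything else."""
    P, s, Rt = np.linalg.svd(A, full_matrices=False)
    return P[:, :k] @ np.diag(s[:k]) @ Rt[:k, :]

A = np.random.rand(50, 30)            # stand-in for the m x n term-doc matrix
Ak = truncate_svd(A, k=5)
print(np.linalg.matrix_rank(Ak))      # 5: Ak has rank k
```
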
Guarantee

Ak is a pretty good approximation to A:



Relative distances are (approximately) preserved
Of all m × n matrices of rank k, Ak is the best approximation to A wrt the following measures:

min_{B: rank(B)=k} ||A - B||_2 = ||A - Ak||_2 = σ_{k+1}

min_{B: rank(B)=k} ||A - B||_F² = ||A - Ak||_F² = σ_{k+1}² + σ_{k+2}² + ... + σ_r²

where the Frobenius norm is ||A||_F² = σ_1² + σ_2² + ... + σ_r²
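
These optimality bounds can be checked numerically; a minimal sketch, again with a random stand-in matrix (for a 2-D array, np.linalg.norm with ord=2 gives the spectral norm, i.e. the largest singular value).

```python
import numpy as np

A = np.random.rand(40, 25)
P, s, Rt = np.linalg.svd(A, full_matrices=False)
k = 5
Ak = P[:, :k] @ np.diag(s[:k]) @ Rt[:k, :]

# Spectral-norm error equals the (k+1)-th singular value (sigma_{k+1} = s[k]) ...
print(np.isclose(np.linalg.norm(A - Ak, 2), s[k]))
# ... and the squared Frobenius error is the sum of the remaining squared singular values.
print(np.isclose(np.linalg.norm(A - Ak, 'fro') ** 2, np.sum(s[k:] ** 2)))
# Frobenius norm identity: ||A||_F^2 = sigma_1^2 + ... + sigma_r^2.
print(np.isclose(np.linalg.norm(A, 'fro') ** 2, np.sum(s ** 2)))
```
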
Reduction

Xk = Sk Rt is the doc-matrix, k × n, hence reduced to k dimensions
(recall that R and P are formed by orthonormal eigenvectors of the matrices D and T).

Since we are interested in doc/query correlation, we consider:

D = At A = (P S Rt)t (P S Rt) = (S Rt)t (S Rt)

Approximating S with Sk, we thus get At A ≈ Xkt Xk (both are n × n matrices).

We use Xk to define how to project A and q:

Xk = Sk Rt; substituting Rt = S^-1 Pt A, we get Xk = Pkt A.
In fact, Sk S^-1 Pt = Pkt, which is a k × m matrix.

This means that to reduce a doc/query vector it is enough to multiply it by Pkt,
thus paying O(km) per doc/query.
Cost of sim(q,d), for all d, is O(kn + km) instead of O(mn).
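
A minimal sketch of this projection step (the sizes and the random term-doc matrix are made up): every document and the query are mapped to k dimensions by multiplying with Pkt, and similarities are then computed in the reduced space.

```python
import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)

m, n, k = 1000, 200, 20               # toy sizes: terms, docs, latent dims
A = np.random.rand(m, n)              # stand-in for the term-doc matrix
P, s, Rt = np.linalg.svd(A, full_matrices=False)
Pk_t = P[:, :k].T                     # the k x m projection matrix Pkt

Xk = Pk_t @ A                         # = Sk Rt, the reduced doc-matrix, k x n
q = np.random.rand(m)                 # a query vector in term space
q_k = Pk_t @ q                        # projected query: O(km)

# Similarities are now computed in k dimensions: O(kn) instead of O(mn).
sims = [cosine(q_k, Xk[:, j]) for j in range(n)]
print(int(np.argmax(sims)))           # index of the most similar document
```
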
What are the concepts?

c-th concept = c-th row of Pkt
(which is k x m)

Denote it by Pkt [c], whose size is m = #terms

Pkt [c][i] = strength of association between c-th
concept and i-th term

Projected document: d’j = Pkt dj


d'j[c] = strength of concept c in dj
Projected query: q’ = Pkt q

q'[c] = strength of concept c in q
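
A toy example (the vocabulary and counts are made up) of inspecting the concepts: each row of Pkt associates a weight with every term, so the largest-magnitude entries show which terms dominate each concept.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal", "garden"]
# Made-up term-doc counts: docs 0-2 are about cars, docs 3-5 about gardening.
A = np.array([[3, 2, 1, 0, 0, 0],
              [2, 3, 1, 0, 0, 0],
              [1, 1, 2, 0, 0, 0],
              [0, 0, 0, 3, 1, 2],
              [0, 0, 0, 1, 2, 2],
              [0, 0, 0, 2, 2, 3]], dtype=float)

P, s, Rt = np.linalg.svd(A, full_matrices=False)
k = 2
Pk_t = P[:, :k].T                     # k x m: one row per concept

for c in range(k):
    weights = Pk_t[c]                 # strength of association concept c <-> each term
    top = np.argsort(-np.abs(weights))[:3]
    print(f"concept {c}:", [terms[i] for i in top])
```
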
Random Projections
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Slides only!
An interesting math result
Lemma (Johnson-Lindenstrauss, '82)
Let P be a set of n distinct points in m dimensions.
Given ε > 0, there exists a function f : P → R^k such that
for every pair of points u, v in P it holds:

(1 - ε) ||u - v||² ≤ ||f(u) - f(v)||² ≤ (1 + ε) ||u - v||²

where k = O(ε^-2 log n).

f() is called a JL-embedding.
Setting v = 0 we also get a bound on f(u)'s stretching!
What about the cosine-distance?
Since u·v = (||u||² + ||v||² - ||u - v||²) / 2, the bounds on f(u)'s and f(v)'s stretching,
together with the bound on ||f(u) - f(v)||² (substituting the formula above for ||u - v||²),
also control the dot-product f(u)·f(v), and hence the cosine-distance.
How to compute a JL-embedding?
If we set R = (r_ij) to be a random m × k matrix, whose components are
independent random variables drawn from a distribution with

E[r_ij] = 0
Var[r_ij] = 1
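
A hedged sketch of one standard choice that satisfies these conditions: independent standard Gaussian entries (another common choice is ±1, each with probability 1/2). The 1/√k scaling, which keeps the expected squared distances unbiased, and the use of a k × m matrix (so the projection is a single matrix product) are conventions assumed here, not taken from the slides.

```python
import numpy as np

def jl_project(X, k, seed=0):
    """Map the columns of X (m x n, one point per column) to k dimensions with a
    random Gaussian matrix; entries have E = 0 and Var = 1, and the 1/sqrt(k)
    factor makes the expected squared distances match the originals."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((k, X.shape[0]))
    return (R @ X) / np.sqrt(k)

m, n, k = 10_000, 100, 500            # toy sizes
X = np.random.rand(m, n)              # n points in m dimensions
Y = jl_project(X, k)

u, v = X[:, 0], X[:, 1]
fu, fv = Y[:, 0], Y[:, 1]
print(np.linalg.norm(fu - fv) ** 2 / np.linalg.norm(u - v) ** 2)  # close to 1 w.h.p.
```
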
Finally...

Random projections hide large constants:
k ≈ (1/ε)² * log n, so it may be large…
but it is simple and fast to compute.

LSI is intuitive and may scale to any k:
it is optimal under various metrics,
but it is costly to compute (although good libraries are now available).