Slide 1: Vector Space Model
Rong Jin

Slide 2: Basic Issues in a Retrieval Model
- How to represent text objects?
- How to refine the query according to users' feedback?
- What similarity function should be used?

Slide 3: Basic Issues in IR
- How to represent queries?
- How to represent documents?
- How to compute the similarity between documents and queries?
- How to utilize users' feedback to enhance retrieval performance?

Slide 4: IR: Formal Formulation
- Vocabulary $V = \{w_1, w_2, \dots, w_n\}$ of a language
- Query $q = q_1, \dots, q_m$, where $q_i \in V$
- Document $d_i = (d_{i1}, \dots, d_{im_i})$, where $d_{ij} \in V$
- Collection $C = \{d_1, \dots, d_k\}$
- Set of relevant documents $R(q) \subseteq C$: generally unknown and user-dependent; the query is a "hint" on which documents are in $R(q)$
- Task: compute $R'(q)$, an approximation of $R(q)$

Slide 5: Computing R(q). Strategy 1: Document Selection
- Classification function $f(d,q) \in \{0,1\}$: outputs 1 for relevance, 0 for irrelevance
- $R(q)$ is determined as the set $\{d \in C \mid f(d,q) = 1\}$
- The system must decide whether a document is relevant or not ("absolute relevance")
- Example: Boolean retrieval

Slide 6: Document Selection Approach
[Figure: the true relevant set R(q) compared with the set chosen by a classifier C(q); relevant (+) and non-relevant (-) documents fall on both sides of the classifier's boundary.]

Slide 7: Computing R(q). Strategy 2: Document Ranking
- Similarity function $f(d,q)$: outputs a similarity score between document $d$ and query $q$
- Cutoff $\theta$: the minimum similarity for a document and query to be considered relevant
- $R(q)$ is determined as the set $\{d \in C \mid f(d,q) > \theta\}$
- The system must decide whether one document is more likely to be relevant than another ("relative relevance")

Slides 8-9: Document Selection vs. Ranking
[Figure: the true R(q) shown next to the two strategies. Document selection assigns each document $f(d,q) \in \{0,1\}$ and returns a set R'(q); document ranking sorts documents by score and cuts the list off: 0.98 d1 (+), 0.95 d2 (+), 0.83 d3, 0.80 d4 (+), 0.76 d5, 0.56 d6, 0.34 d7, 0.21 d8 (+), 0.21 d9 (-).]

Slide 10: Ranking Is Often Preferred
- A similarity function is more general than a classification function
- A classifier is unlikely to be accurate: information needs are ambiguous and queries are short
- Relevance is a subjective concept: absolute relevance vs. relative relevance

Slide 11: Probability Ranking Principle
As stated by Cooper: "If a reference retrieval system's response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data [have been] made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."
Ranking documents by probability maximizes the utility of IR systems.

Slide 12: Vector Space Model
- Any text object can be represented by a term vector: documents, queries, sentences, and so on; a query is viewed as a short document
- Similarity is determined by the relationship between two vectors, e.g., the cosine of the angle between the vectors, or the distance between the vectors
- The SMART system: developed at Cornell University, 1960-1999, and still widely used

Slide 13: Vector Space Model: Illustration

        Java   Starbucks   Microsoft
D1       1        1            0
D2       0        1            1
D3       1        0            1
D4       1        1            1
Query    1        0.1          1

Slide 14: Vector Space Model: Illustration
[Figure: the documents D1-D4 and the query plotted as vectors in the three-dimensional term space with axes Java, Starbucks, and Microsoft; the question marks ask which documents lie closest to the query.]

Slide 15: Vector Space Model: Similarity
- Represent both documents and queries by word-histogram vectors, where $n$ is the number of unique words
- A query $q = (q_1, q_2, \dots, q_n)$, where $q_i$ is the number of occurrences of the $i$-th word in the query
- A document $d_k = (d_{k,1}, d_{k,2}, \dots, d_{k,n})$, where $d_{k,i}$ is the number of occurrences of the $i$-th word in the document
- The similarity of a query $q$ to a document $d_k$ is defined in terms of these vectors

Slide 16: Some Background in Linear Algebra
Dot product (scalar product):
$$q \cdot d_k = q_1 d_{k,1} + q_2 d_{k,2} + \dots + q_n d_{k,n}$$
Example:
- $q = [1,2,5]$, $d_k = [4,1,0]$: $q \cdot d_k = 1 \cdot 4 + 2 \cdot 1 + 5 \cdot 0 = 6$
- $q = [1,2,5]$, $d_k = [1,3,4]$: $q \cdot d_k = 1 \cdot 1 + 2 \cdot 3 + 5 \cdot 4 = 27$
Measure similarity by the dot product.

Slide 17: Some Background in Linear Algebra
Length of a vector:
$$\|q\| = \sqrt{q_1^2 + q_2^2 + \dots + q_n^2}, \qquad \|d_k\| = \sqrt{d_{k,1}^2 + d_{k,2}^2 + \dots + d_{k,n}^2}$$
Angle between two vectors:
$$\cos\theta(q, d_k) = \frac{q \cdot d_k}{\|q\| \, \|d_k\|} = \frac{q_1 d_{k,1} + q_2 d_{k,2} + \dots + q_n d_{k,n}}{\sqrt{q_1^2 + \dots + q_n^2} \, \sqrt{d_{k,1}^2 + \dots + d_{k,n}^2}}$$

Slide 18: Some Background in Linear Algebra
Example:
- $q = [1,2,5]$, $d_k = [4,1,0]$: $\cos\theta(q, d_k) = \dfrac{1 \cdot 4 + 2 \cdot 1 + 5 \cdot 0}{\sqrt{1^2 + 2^2 + 5^2} \, \sqrt{4^2 + 1^2 + 0^2}} \approx 0.27$
- $q = [1,2,5]$, $d_k = [1,3,4]$: $\cos\theta(q, d_k) = \dfrac{1 \cdot 1 + 2 \cdot 3 + 5 \cdot 4}{\sqrt{1^2 + 2^2 + 5^2} \, \sqrt{1^2 + 3^2 + 4^2}} \approx 0.97$
Measure similarity by the angle between the vectors.

Slide 19: Vector Space Model: Similarity
Given a query $q = (q_1, \dots, q_n)$ and a document $d_k = (d_{k,1}, \dots, d_{k,n})$, with $q_i$ and $d_{k,i}$ the occurrence counts of the $i$-th word:
$$\mathrm{sim}(q, d_k) = q \cdot d_k = q_1 d_{k,1} + q_2 d_{k,2} + \dots + q_n d_{k,n}$$
$$\mathrm{sim}'(q, d_k) = \cos\theta(q, d_k) = \frac{q \cdot d_k}{\|q\| \, \|d_k\|} = \frac{q_1 d_{k,1} + \dots + q_n d_{k,n}}{\sqrt{q_1^2 + \dots + q_n^2} \, \sqrt{d_{k,1}^2 + \dots + d_{k,n}^2}}$$

Slide 20: Vector Space Model: Similarity
Comparing the same query against two documents by dot product:
- $q = [1,2,5]$, $d_k = [0,0,8]$: $q \cdot d_k = 1 \cdot 0 + 2 \cdot 0 + 5 \cdot 8 = 40$
- $q = [1,2,5]$, $d_k = [1,3,4]$: $q \cdot d_k = 1 \cdot 1 + 2 \cdot 3 + 5 \cdot 4 = 27$

Slide 21: Vector Space Model: Similarity
Comparing the same two documents by cosine:
- $q = [1,2,5]$, $d_k = [0,0,8]$: $\cos\theta(q, d_k) = \dfrac{1 \cdot 0 + 2 \cdot 0 + 5 \cdot 8}{\sqrt{1^2 + 2^2 + 5^2} \, \sqrt{0^2 + 0^2 + 8^2}} \approx 0.913$
- $q = [1,2,5]$, $d_k = [1,3,4]$: $\cos\theta(q, d_k) \approx 0.97$
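To make the two similarity measures concrete, here is a short Python sketch (an added illustration, not part of the original slides) that reproduces the comparisons from slides 20-21 and then ranks the documents of the slide-13 table against the query. Note how the raw dot product favors the longer vector while the length-normalized cosine does not.

```python
import numpy as np

def cosine(a, b):
    # Dot product normalized by the lengths of both vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The comparison from slides 20-21: the dot product prefers the longer
# vector [0, 0, 8], while the cosine prefers [1, 3, 4].
q = np.array([1, 2, 5])
d1, d2 = np.array([0, 0, 8]), np.array([1, 3, 4])
print(q @ d1, q @ d2)                # 40 27
print(cosine(q, d1), cosine(q, d2))  # ~0.913 ~0.967

# Ranking the slide-13 table (axes: Java, Starbucks, Microsoft).
docs = {"D1": [1, 1, 0], "D2": [0, 1, 1], "D3": [1, 0, 1], "D4": [1, 1, 1]}
query = np.array([1, 0.1, 1])
ranked = sorted(docs, key=lambda name: -cosine(query, np.array(docs[name])))
print(ranked)  # ['D3', 'D4', 'D1', 'D2'] -> D3 lies closest to the query
```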
Slide 22: Term Weighting
The unweighted similarity
$$\mathrm{sim}(q, d_k) = q_1 d_{k,1} + q_2 d_{k,2} + \dots + q_n d_{k,n}$$
becomes
$$\mathrm{sim}(q, d_k) = q_1 d_{k,1} w_{k,1} + q_2 d_{k,2} w_{k,2} + \dots + q_n d_{k,n} w_{k,n}$$
where $w_{k,i}$ is the importance of the $i$-th word for document $d_k$. Why weighting? Some query terms carry more information than others.
TF.IDF weighting:
- TF (term frequency) = within-document frequency
- IDF (inverse document frequency)
- TF normalization: avoid a bias toward long documents

Slide 23: TF Weighting
A term is important if it occurs frequently in a document. Let $f(t,d)$ be the number of occurrences of word $t$ in document $d$.
Maximum frequency normalization:
$$\mathrm{TF}(t, d) = 0.5 + 0.5 \, \frac{f(t,d)}{\mathrm{MaxFreq}(d)}$$

Slide 24: TF Weighting
Term frequency normalization, "Okapi/BM25 TF":
$$\mathrm{TF}(t, d) = \frac{k \, f(t,d)}{f(t,d) + k \left(1 - b + b \, \frac{\mathrm{doclen}(d)}{\mathrm{avg\_doclen}}\right)}$$
- $\mathrm{doclen}(d)$: the length of document $d$
- $\mathrm{avg\_doclen}$: the average document length
- $k$, $b$: predefined constants

Slide 25: TF Normalization
Why normalize?
- "Repeated occurrences" are less informative than the "first occurrence"
- Two views of document length variation: a document may be long because it uses more words, or because it has more content
- Generally penalize long documents, but avoid over-penalizing them (pivoted normalization)

Slide 26: TF Normalization
[Figure: normalized TF plotted against raw TF, showing the "pivoted normalization" curve of the Okapi/BM25 TF formula above.]

Slide 27: IDF Weighting
A term is discriminative if it occurs in only a few documents.
$$\mathrm{IDF}(t) = 1 + \log(n/m)$$
- $n$: the total number of documents
- $m$: the number of documents containing term $t$ (document frequency)
IDF can be interpreted as mutual information.

Slide 28: TF-IDF Weighting
The importance of a term $t$ to a document $d$:
$$\mathrm{weight}(t,d) = \mathrm{TF}(t,d) \cdot \mathrm{IDF}(t)$$
- Frequent in the document: high TF, high weight
- Rare in the collection: high IDF, high weight
$$\mathrm{sim}(q, d_k) = q_1 d_{k,1} w_{k,1} + q_2 d_{k,2} w_{k,2} + \dots + q_n d_{k,n} w_{k,n}$$

Slide 29: TF-IDF Weighting
The same weighting as above, with the note that here both $q_i$ and $d_{k,i}$ are binary values, i.e., the presence or absence of a word in the query and the document.
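The weighting formulas combine into a single function. Below is a minimal sketch under stated assumptions (a made-up three-document toy collection, and placeholder values k = 1.2 and b = 0.75 for the predefined constants of slide 24); it is an illustration, not code from the lecture.

```python
import math

# Toy collection; each document is a list of tokens (hypothetical data).
docs = [
    "java coffee starbucks".split(),
    "java programming microsoft".split(),
    "microsoft windows".split(),
]
avg_doclen = sum(len(d) for d in docs) / len(docs)
k, b = 1.2, 0.75  # predefined constants; values here are placeholders

def tf_bm25(t, d):
    # Okapi/BM25 TF normalization from slide 24.
    f = d.count(t)
    return k * f / (f + k * (1 - b + b * len(d) / avg_doclen))

def idf(t):
    # IDF(t) = 1 + log(n/m): n documents in total, m containing t.
    n = len(docs)
    m = sum(1 for d in docs if t in d)
    return 1 + math.log(n / m) if m else 0.0

def weight(t, d):
    # weight(t, d) = TF(t, d) * IDF(t), as on slide 28.
    return tf_bm25(t, d) * idf(t)

print(weight("java", docs[0]), weight("microsoft", docs[0]))
```

A real system would look the document frequencies up in a precomputed inverted index rather than scanning the whole collection per term.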
Slide 30: Problems with the Vector Space Model
- Still limited to word-based matching: a document will never be retrieved if it does not contain any query word
- How can the vector space model be modified to address this?

Slides 31-35: Choice of Bases
[Figure sequence: a document D and a query Q plotted with document D1 in the term space with axes Java, Starbucks, and Microsoft; choosing new basis directions D' and Q' brings the document and query together even though they share few terms.]

Slides 36-37: Choosing Bases for VSM
Modify the bases of the vector space:
- Each basis is a concept: a group of words
- Every document is a vector in the concept space (slide 36), or more generally a mixture of concepts (slide 37)
Example term-concept matrix for two concepts A1 and A2:

      c1  c2  c3  c4  c5  m1  m2  m3  m4
A1     1   1   1   1   1   0   0   0   0
A2     0   0   0   0   0   1   1   1   1

Slide 38: Choosing Bases for VSM
- Each basis is a concept (a group of words), and every document is a mixture of concepts
- How to define or select the "basic concepts"? In the VS model, each term is viewed as an independent concept

Slides 39-40: Basics: Matrix Multiplication
[Figures illustrating matrix multiplication.]

Slide 41: Linear Algebra Basics: Eigen Analysis
Eigenvectors of a square $m \times m$ matrix $S$: a (right) eigenvector $v$ with eigenvalue $\lambda$ satisfies
$$S v = \lambda v$$

Slide 42: Linear Algebra Basics: Eigen Analysis
Example:
$$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$
- First eigenvalue: $\lambda_1 = 3$, with $v_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$
- Second eigenvalue: $\lambda_2 = 1$, with $v_2 = \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}$

Slides 43-44: Linear Algebra Basics: Eigen Decomposition
With the eigenvalues and eigenvectors above:
$$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{pmatrix} \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{pmatrix}$$
that is, $S = U \Lambda U^T$.

Slide 45: Linear Algebra Basics: Eigen Decomposition
$S = U \Lambda U^T$ holds in general for a symmetric square matrix:
- The columns of $U$ are the eigenvectors of $S$
- The diagonal elements of $\Lambda$ are the eigenvalues of $S$
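As a quick numerical check (added here, not from the slides), numpy recovers the eigenvalues and eigenvectors of the slide-42 matrix and confirms the decomposition $S = U \Lambda U^T$:

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# For a symmetric matrix, eigh returns real eigenvalues in ascending
# order and orthonormal eigenvectors as the columns of U.
eigvals, U = np.linalg.eigh(S)
print(eigvals)  # [1. 3.] -> lambda_2 = 1, lambda_1 = 3
print(U)        # columns are (1, -1)/sqrt(2) and (1, 1)/sqrt(2), up to sign

# Reassemble S from its eigen decomposition: S = U * Lambda * U^T.
Lambda = np.diag(eigvals)
print(np.allclose(S, U @ Lambda @ U.T))  # True
```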
Slide 46: Singular Value Decomposition
For an $m \times n$ matrix $A$ of rank $r$ there exists a factorization (singular value decomposition, SVD):
$$A = U \Sigma V^T$$
where $U$ is $m \times m$, $\Sigma$ is $m \times n$, and $V$ is $n \times n$. The columns of $U$ are the left singular vectors, the columns of $V$ are the right singular vectors, and $\Sigma$ is a diagonal matrix whose entries are the singular values.

Slides 47-49: Singular Value Decomposition
[Figures illustrating the dimensions and sparseness of the SVD factors.]

Slides 50-52: Low-Rank Approximation
Approximate the matrix using only the largest singular values and their singular vectors.
[Figures illustrating the approximation.]

Slide 53: Latent Semantic Indexing (LSI)
Computation: use singular value decomposition (SVD), keeping the first $m$ largest singular values and singular vectors, where $m$ is the number of concepts. One singular factor gives the representation of the concepts in term space, the other the representation of the concepts in document space.

Slide 54: Finding "Good Concepts"
[Figure.]

Slides 55-58: SVD Example (m = 2)
[Figures: a term-document matrix $X$ factored as $U \Sigma V^T$ and approximated with the two largest singular values, $\Sigma = \begin{pmatrix} 3.34 & 0 \\ 0 & 2.54 \end{pmatrix}$; slide 58 traces how individual entries of the approximation are formed from these factors.]

Slide 59: SVD: Orthogonality
In the same factorization, the left singular vectors are mutually orthogonal ($u_1 \cdot u_2 = 0$), and so are the right singular vectors ($v_1 \cdot v_2 = 0$).

Slide 60: SVD: Properties
- $\mathrm{rank}(S)$: the maximum number of row or column vectors within matrix $S$ that are linearly independent
- The original matrix has $\mathrm{rank}(X) = 9$; the approximation $X'$ built from the two largest singular values (3.34 and 2.54) has $\mathrm{rank}(X') = 2$
- SVD produces the best low-rank approximation

Slides 61-62: SVD: Visualization
[Figures of $X$ and its approximation: SVD tries to preserve the Euclidean distances between document vectors.]
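To close, here is a minimal LSI-style sketch (the term-document matrix below is a made-up toy, not the rank-9 example from the slides): compute the SVD, keep the $m = 2$ largest singular values, and check that the resulting approximation has rank 2.

```python
import numpy as np

# Hypothetical term-document matrix (rows: terms, columns: documents).
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

m = 2  # number of concepts to keep
X2 = U[:, :m] @ np.diag(s[:m]) @ Vt[:m, :]  # rank-m approximation

print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X2))  # 4 2
# U[:, :m] represents the concepts in term space,
# Vt[:m, :] represents the concepts in document space.
```

Queries and documents are then compared in the $m$-dimensional concept space rather than the original term space, so two documents that share no terms can still be similar through shared concepts.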