Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI)
Jasminka Dobša
Faculty of organization and informatics, Varaždin

Outline
• Information retrieval in the vector space model (VSM), or bag-of-words representation
• Techniques for conceptual indexing
– Latent semantic indexing
– Concept indexing
• Comparison: academic example
• Experiment
• Further work

Information retrieval in VSM 1/3
• The task of information retrieval: to extract from a document collection the documents that are relevant for a user query
• In the VSM documents are represented in a high-dimensional space
• The dimension of the space depends on the number of indexing terms chosen as relevant for the collection (4000-5000 in my experiments)
• The VSM is implemented by forming a term-document matrix

Information retrieval in VSM 2/3
• The term-document matrix is an m×n matrix, where m is the number of terms and n is the number of documents
• row of the term-document matrix = term
• column of the term-document matrix = document

Figure 1.
Term-document matrix

    A = [a_ij]  (m×n),  rows t_1, …, t_m (terms),  columns d_1, …, d_n (documents)

Information retrieval in VSM 3/3
• A query has the same shape as a document (an m-dimensional vector)
• The measure of similarity between a query q and a document a_j is the cosine of the angle between the two vectors:

    cos(a_j, q) = a_j^T q / (||a_j||_2 ||q||_2)

Retrieval performance evaluation
• Measures for evaluation:
– Recall
– Precision
– Average precision
• Recall: recall_i = r_i / r_n
• Precision: precision_i = r_i / i
• r_i is the number of relevant documents among the i highest-ranked documents
• r_n is the total number of relevant documents in the collection
• Average precision – the precision averaged over distinct levels of recall

Techniques for conceptual indexing
• In the term-matching method the similarity between the query and a document is tested lexically
• Polysemy (one word having multiple meanings) and synonymy (multiple words having the same meaning) are two fundamental obstacles to effective information retrieval
• Here we compare two techniques for conceptual indexing, both based on projecting document vectors (in the least-squares sense) onto a lower-dimensional vector space:
– Latent semantic indexing (LSI)
– Concept indexing (CI)

Latent semantic indexing
• Introduced in 1990; improved in 1995
• S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman: Indexing by latent semantic analysis, J. American Society for Information Science, 41, 1990, pp. 391-407
• M. W. Berry, S. T. Dumais, G. W. O'Brien: Using linear algebra for intelligent information retrieval, SIAM Review, 37, 1995, pp. 573-595
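The VSM machinery above (cosine ranking plus the recall/precision measures) can be sketched in a few lines of Python. This is a toy illustration only: the matrix, query, and relevance set below are made up, not taken from the experiments.

```python
import numpy as np

# Toy term-document matrix: m = 4 terms (rows), n = 3 documents (columns).
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 1.0, 0.0, 0.0])  # query over the same 4 terms

# cos(a_j, q) = a_j^T q / (||a_j||_2 ||q||_2) for every document column a_j
sims = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
ranking = np.argsort(-sims)          # document indices, most similar first

def precision_recall(ranking, relevant, i):
    """precision_i = r_i / i and recall_i = r_i / r_n for the i top-ranked docs."""
    r_i = len(set(ranking[:i]) & relevant)
    return r_i / i, r_i / len(relevant)
```

For the toy data above, document 0 matches the query exactly (cosine 1.0) and is ranked first.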
Latent semantic indexing
• Based on spectral analysis of the term-document matrix
• For every m×n matrix A there is a singular value decomposition (SVD)

    A = U Σ V^T

U – orthogonal m×m matrix whose columns are the left singular vectors of A
Σ – diagonal matrix whose diagonal holds the singular values of A in descending order
V – orthogonal n×n matrix whose columns are the right singular vectors of A

Latent semantic indexing
• For LSI the truncated SVD is used:

    A_k = U_k Σ_k V_k^T

where U_k is the m×k matrix whose columns are the first k left singular vectors of A, Σ_k is the k×k diagonal matrix whose diagonal is formed by the k leading singular values of A, and V_k is the n×k matrix whose columns are the first k right singular vectors of A
• Rows of U_k = terms
• Rows of V_k = documents

Figure: (Truncated) SVD

Latent semantic indexing
• Using the truncated SVD we keep only the first k independent linear components of A (singular vectors and values)
• Documents are projected, in the least-squares sense, onto the space spanned by the first k left singular vectors of A (the LSI space)
• The first k components capture the major associational structure in the term-document matrix and throw out the noise
• Minor differences in the terminology used in documents are ignored
• Closeness of objects (queries and documents) is determined by the overall pattern of term usage, so it is context based
• Documents which contain synonyms are closer in LSI space than in the original space; documents which use a polysemous word in different contexts are farther apart in LSI space than in the original space

Concept indexing (CI)
• Indexing using the concept decomposition (CD) instead of the SVD used in LSI
• Concept decomposition was introduced in 2001
• I. S. Dhillon, D. S. Modha: Concept decompositions for large sparse text data using clustering, Machine Learning, 42:1, 2001, pp. 143-175
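The truncated SVD from the LSI slides above can be computed directly with NumPy. A minimal sketch on synthetic data (the matrix, query, and k below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 20))   # synthetic "term-document" matrix, m=50 terms, n=20 docs

# Thin SVD: A = U @ diag(s) @ Vt, singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation A_k = U_k S_k V_k^T

# Documents in LSI space: project each column onto the first k left singular vectors.
docs_k = U[:, :k].T @ A        # shape (k, n)
q = rng.random(50)
q_k = U[:, :k].T @ q           # a query is projected into the same k-dim space
```

By the Eckart–Young theorem, A_k is the best rank-k approximation of A in the Frobenius norm, so the error ||A − A_k||_F can only shrink as k grows.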
Concept decomposition
• First step: clustering of the documents of term-document matrix A into k groups
• Clustering algorithms:
– Spherical k-means algorithm
– Fuzzy k-means algorithm
• The spherical k-means algorithm is a variant of the k-means algorithm which uses the fact that the document vectors have unit norm
• Centroids of the groups = concept vectors
• The concept matrix is the matrix whose columns are the centroids of the groups:

    C = [c_1 c_2 … c_k],  c_j – centroid of the j-th group

Concept decomposition
• Second step: calculating the concept decomposition
• The concept decomposition D_k of term-document matrix A is the least-squares approximation of A on the space of the concept vectors:

    D_k = C_k Z,  where Z = (C_k^T C_k)^{-1} C_k^T A

is the solution of the least-squares problem
• Rows of C_k = terms
• Columns of Z = documents

Comparison: Academic example
• Collection of 15 documents (titles of books):
– 9 from the field of data mining
– 5 from the field of linear algebra
– 1 combining these fields (application of linear algebra for data mining)
• The list of terms was formed as follows:
1) from words contained in at least two documents
2) words on the stop list were removed
3) stemming was performed
• To the term-document matrix we apply:
– Truncated SVD (k=2)
– Concept decomposition (k=2)

Documents 1/2
D1 Survey of text mining: clustering, classification, and retrieval
D2 Automatic text processing: the transformation analysis and retrieval of information by computer
D3 Elementary linear algebra: A matrix approach
D4 Matrix algebra & its applications statistics and econometrics
D5 Effective databases for text & document management
D6 Matrices, vector spaces, and information retrieval
D7 Matrix analysis and applied linear algebra
D8 Topological vector spaces and algebras

Documents 2/2
D9 Information retrieval: data structures & algorithms
D10 Vector spaces and algebras for chemistry and physics
D11 Classification, clustering and data analysis
D12 Clustering of large data sets
D13 Clustering algorithms
D14 Document warehousing and text
mining: techniques for improving business operations, marketing and sales
D15 Data mining and knowledge discovery

Terms
Data mining terms: text, mining, clustering, classification, retrieval, information, document, data
Linear algebra terms: linear, algebra, matrix, vector, space
Neutral terms: analysis, application, algorithm

Figure: Projection of terms by SVD (data mining terms vs. linear algebra terms)
Figure: Projection of terms by CD (data mining terms vs. linear algebra terms)

Queries
• Q1: Data mining
– Relevant documents: all data mining documents
• Q2: Using linear algebra for data mining
– Relevant document: D6

Figure: Projection of documents by SVD (relevant documents for queries Q1 and Q2 marked)
Figure: Projection of documents by CD (relevant documents for queries Q1 and Q2 marked)

Results of information retrieval (Q1)
Term matching: D15 (1,4142), D12 (0,7071), D14 (0,5774), D9 (0,5000), D11 (0,5000), D1 (0,4472), D2, D3, D4, D5, D6, D7, D8, D10, D13 (all 0,0000)
LSI: D1 (0,6141), D11 (0,5480), D12 (0,5465), D9 (0,4809), D15 (0,4644), D2 (0,4301), D14 (0,4127), D13 (0,3858), D5 (0,3165), D6 (0,1585), D7 (0,0013), D8 (-0,0631), D10 (-0,0631), D4 (-0,0712), D3 (-0,0712)
CI: D1 (0,6622), D14 (0,6057), D12 (0,5953), D11 (0,5763), D15 (0,5619), D5 (0,4727), D13 (0,3379), D2 (0,2690), D9 (0,2615), D7 (0,0016), D6 (-0,0063), D3 (-0,0462), D4 (-0,0468), D8 (-0,0473), D10 (-0,0473)

Results of information retrieval (Q2)
Term matching: D15 (1,4142), D3 (1,1547), D7 (0,8944), D12 (0,7071), D4 (0,5774), D8 (0,5774), D10 (0,5774), D14 (0,5774), D9 (0,5000), D11 (0,5000), D1 (0,4472), D2 (0,0000),
D5, D6, D13 (0,0000)
LSI: D6 (0,6737), D7 (0,6472), D8 (0,6100), D10 (0,6100), D3 (0,5924), D4 (0,5924), D10 (0,5789), D2 (0,5404), D11 (0,5268), D9 (0,5236), D12 (0,4656), D15 (0,3936), D14 (0,3560), D13 (0,3320), D5 (0,2800)
CI: D1 (0,7105), D11 (0,6204), D12 (0,5936), D2 (0,5517), D14 (0,5488), D6 (0,5452), D9 (0,5441), D7 (0,5280), D15 (0,5256), D8 (0,4797), D10 (0,4797), D3 (0,4693), D4 (0,4693), D5 (0,4459), D13 (0,4202)

Collections
• MEDLINE: 1033 documents, 30 queries, relevance judgements
• CRANFIELD: 1400 documents, 225 queries, relevance judgements

Test A
• Comparison of the errors of approximation of the term-document matrix by
1) the rank-k SVD: ||A − U_k Σ_k V_k^T||_F
2) the rank-k CD: ||A − C_k Z||_F

Figure: MEDLINE – errors of approximation vs. rank of approximation (k-means, fuzzy k-means, SVD)
Figure: CRANFIELD – errors of approximation vs. rank of approximation (k-means, fuzzy k-means, SVD)

Test B
• Average inner product between the concept vectors c_j, j=1,2,…,k:

    (2 / (k(k−1))) Σ_{j=1}^{k} Σ_{l=j+1}^{k} c_j^T c_l

• Comparison of the average inner product for
– concept vectors obtained by the spherical k-means algorithm
– concept vectors obtained by the fuzzy k-means algorithm

Figure: MEDLINE – average inner product vs. rank of approximation (k-means, fuzzy k-means)
Figure: CRANFIELD – average inner product vs. rank of approximation (k-means, fuzzy k-means)

Test C
• Comparison of the mean average precision of information retrieval, and of precision-recall plots
• Mean average precision for the term-matching method:
– MEDLINE: 43,54
– CRANFIELD: 20,89

Figure: MEDLINE – mean average precision vs. rank of approximation (k-means, fuzzy k-means, SVD)
Figure: CRANFIELD – mean average precision vs. rank of approximation (k) (k-means, fuzzy k-means, SVD)
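Tests A and B above can be reproduced in miniature: cluster the document columns, take the normalized centroids as concept vectors, form the concept decomposition D_k = C_k Z with Z = (C_k^T C_k)^{-1} C_k^T A, then measure the approximation error (Test A) and the average inner product between concept vectors (Test B). The sketch below uses synthetic data and a plain k-means loop, not the exact spherical/fuzzy variants of the experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((50, 20))                      # synthetic term-document matrix
A /= np.linalg.norm(A, axis=0)                # unit-norm document columns
k = 4

# Crude k-means on the document columns (illustration only).
centroids = A[:, rng.choice(20, k, replace=False)].copy()
for _ in range(20):
    # squared distance of every document to every centroid, shape (k, n)
    d2 = ((A[:, None, :] - centroids[:, :, None]) ** 2).sum(axis=0)
    labels = d2.argmin(axis=0)
    for i in range(k):
        if (labels == i).any():
            centroids[:, i] = A[:, labels == i].mean(axis=1)

C = centroids / np.linalg.norm(centroids, axis=0)   # concept matrix, unit columns

# Concept decomposition: D_k = C @ Z with Z the least-squares solution.
Z, *_ = np.linalg.lstsq(C, A, rcond=None)           # Z = (C^T C)^{-1} C^T A
err_cd = np.linalg.norm(A - C @ Z)                  # Test A: ||A - C_k Z||_F

# Test B: average inner product between concept vectors.
G = C.T @ C
avg_ip = (G.sum() - np.trace(G)) / (k * (k - 1))    # (2/(k(k-1))) * sum_{j<l} c_j^T c_l
```

The Gram-matrix trick in the last line counts each off-diagonal pair twice, which is exactly the factor 2 in the Test B formula.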
Figure: MEDLINE – precision-recall plot (k-means, fuzzy k-means, SVD, term-matching)
Figure: CRANFIELD – precision-recall plot (k-means, fuzzy k-means, SVD, term-matching)

Test D
• Correlation between mean average precision (MAP) and clustering quality
• Measure of cluster quality – the generalized within-groups sum of squared errors function J_fuzz:

    J_fuzz = Σ_{i=1}^{k} Σ_{j=1}^{n} μ_ij^b ||a_j − c_i||^2

• a_j, j=1,2,…,n are the document vectors
• c_i, i=1,2,…,k are the concept vectors
• μ_ij is the fuzzy membership degree of document a_j in the group whose concept vector is c_i
• b > 1 is the weight exponent

MEDLINE – correlation (clustering quality and MAP)
• 46 observations for rank of approximation k ∈ [1,100]
• Correlation between mean average precision and J_fuzz is r = −0,968198, with significance p << 0,01
• Correlation between rank of approximation and mean average precision is r = 0,70247 (p << 0,01)
• Correlation between rank of approximation and J_fuzz is r = −0,831071 (p << 0,01)

CRANFIELD – correlation (clustering quality and MAP)
• 46 observations for rank of approximation k ∈ [1,100]
• Correlation between mean average precision and J_fuzz is r = −0,988293, with significance p << 0,01
• Correlation between rank of approximation and mean average precision is r = 0,914489 (p << 0,01)
• Correlation between rank of approximation and J_fuzz is r = −0,904415 (p << 0,01)

Figure: Regression line – clustering quality and MAP (MEDLINE)
Figure: Regression line – clustering quality and MAP (CRANFIELD)

Conclusion 1/3
• By the SVD approximation the term-document matrix is projected onto the first k left singular vectors, which form an orthogonal basis for the LSI space
• By the CD approximation the term-document matrix is projected onto the k centroids of the groups (the concept vectors)
• Concept
vectors form the basis for the CI space; they tend to orthogonality as k increases
• Concept vectors obtained by the fuzzy k-means algorithm tend to orthogonality faster than those obtained by the spherical k-means algorithm
• CI using the CD obtained by the fuzzy k-means algorithm gives a higher MAP of information retrieval than LSI on both collections we have used

Conclusion 2/3
• CI using the CD obtained by the spherical k-means algorithm gives a lower (but comparable) MAP of information retrieval than LSI on both collections we have used
• According to the MAP results, k=75 for the MEDLINE collection and k=200 for the CRANFIELD collection are good choices for the rank of approximation
• With LSI and CI, documents are represented by smaller matrices:
– for the MEDLINE collection the term-document matrix is a 5940×1033 matrix; the approximations of the documents are stored in a 75×1033 matrix
– for the CRANFIELD collection the term-document matrix is a 4758×1400 matrix; the approximations of the documents are stored in a 200×1400 matrix

Conclusion 3/3
• LSI and CI work better on the MEDLINE collection
• When evaluated over different ranks of approximation, MAP is more stable for LSI than for CI
• There is a high correlation between MAP and clustering quality

Further work
1) To apply CI to the problem of classification in a supervised setting
2) To propose solutions to the problem of adding new documents to the collection for the CI method
• Adding new documents to the collection requires recomputation of the SVD or the CD
• This is computationally inefficient
• Two approximation methods have been developed for adding new documents to the collection for the LSI method
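One standard approximation for adding documents under LSI is folding-in: a new document vector d is mapped into the existing k-dimensional space as Σ_k^{-1} U_k^T d, so it lands in the same coordinates as the rows of V_k, without recomputing the SVD. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((50, 20))                   # synthetic term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
Uk, Sk = U[:, :k], np.diag(s[:k])

# Existing documents in LSI space: one row of V_k per document.
docs_k = Vt[:k, :].T                       # shape (20, k)

# Folding-in: map a new document d to Sk^{-1} Uk^T d.
d_new = rng.random(50)
d_k = np.linalg.solve(Sk, Uk.T @ d_new)    # shape (k,)

docs_k = np.vstack([docs_k, d_k])          # appended without recomputing the SVD
```

Folding-in is consistent with the original projection: applying it to an existing column of A reproduces that document's row of V_k exactly.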