TANA07: Data Mining using Matrix Methods
Text mining – Information retrieval
Lars Eldén and Berkant Savas
Department of Mathematics
Linköping University, Sweden
2012
Copyright Lars Eldén, Berkant Savas 2012
Contents
1. Information retrieval
2. Latent semantic indexing

Reference text: M. Berry and M. Browne, Understanding Search
Engines: Mathematical Modeling and Text Retrieval, second
edition, SIAM, 2005.
Programming environment: the “Text-mining toolbox” has
routines for LSI and cluster-based information retrieval.
Example of Software I
SAS Text Miner provides full text preprocessing within the
powerful, easy-to-use process flow environment of Enterprise
Miner. This enables users to enrich the overall data mining process
by integrating unstructured textual data with existing structured
data such as age, income and purchasing patterns. Features include:
- Access to numerous textual formats including PDF, extended ASCII, HTML, ...
- Numerous text preprocessing methods such as stemming, noun group extraction,
  user-defined synonym lists, multiword tokens and part-of-speech tagging.
- Extensive feature extraction capabilities with broad customizable data
  dictionaries.
- Singular value decomposition for dimension reduction.
- Unique clustering algorithms.
http://www.sas.com/technologies/analytics/datamining/textminer
TMG http://scgroup20.ceid.upatras.gr:8000/tmg/
Example
Search in the LiU library journal catalogue:

Search phrase                    Result
computer science engineering     Nothing found
computing science engineering    IEEE: Computing in Science and Engineering

Straightforward word matching is not good enough!
Information Retrieval
Database of documents: for a query with a set of keywords, find
all documents that are relevant.
Applications: databases of scientific abstracts, web search
engines.
SMART: System for the Mechanical Analysis and Retrieval of Text,
Gerard Salton, 1983.
Vector space model for information retrieval
Vector space IR model
Term-by-document matrix:

           Doc 1   Doc 2   Doc 3   Doc 4   Query
  Term 1     1       0       1       0       1
  Term 2     0       0       1       1       1
  Term 3     0       1       1       0       0
The documents and the query are represented by vectors in R^m
(here m = 3).
Is the query close to some document vectors? Use a distance
measure in R^m.
In applications m ≈ 10^6 is common.
Use linear algebra methods (e.g. the SVD) for data compression and
retrieval enhancement.
Latent Semantic Indexing (LSI)
More than literal matching: concept-based modelling
1. Document file preparation:
   1. Indexing: collect terms
   2. Use a stop list: eliminate “meaningless” words
   3. Stemming
2. Constructing the term-by-document matrix; sparse matrix
   storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Stop List I
Eliminate words that occur in “all documents”:
We consider the computation of an eigenvalue and
corresponding eigenvector of a Hermitian positive definite
matrix A ∈ Cn×n , assuming that good approximations of the
wanted eigenpair are already available, as may be the case in
applications such as structural mechanics. We analyze
efficient implementations of inexact Rayleigh quotient–type
methods, which involve the approximate solution of a linear
system at each iteration by means of the Conjugate Residuals
method.
ftp://ftp.cs.cornell.edu/pub/smart/english.stop
Stop list II
Beginning of the Cornell list:
a, a’s, able, about, above, according, accordingly, across, actually,
after, afterwards, again, against, ain’t, all, allow, allows, almost,
alone, along, already, also, although, always, am, among, amongst,
an, and, another, any, anybody, anyhow, anyone, anything,
anyway, anyways, anywhere, apart, appear, appreciate, appropriate,
are, aren’t, around, as, aside, ask, asking, associated, at, available,
away, awfully, b, be, became, because, become, becomes,
becoming, been, before, beforehand, behind, being, believe, below,
beside, besides, best, better, between, beyond, both, brief, but, by,
c, c’mon, c’s, came, can,...
Stemming

computable, computation, computing, computational =⇒ comput
http://www.comp.lancs.ac.uk/computing/research/stemming/index.htm
http://www.tartarus.org/~martin/PorterStemmer/
http://snowball.tartarus.org/algorithms/swedish/stemmer.html
Query
Queries are also preprocessed =⇒ natural language queries
I want to know how to compute singular values of data
matrices, especially such that are large and sparse.
becomes
comput singular value data matri large sparse
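A minimal Python sketch of this preprocessing step. The tiny stop list and the suffix rules below are illustrative stand-ins for the Cornell stop list and the Porter stemmer, so the output only approximates the stemmed query shown above:

```python
# Sketch of query preprocessing: lowercasing, stop-word removal, and crude
# suffix stripping. The stop list and suffix rules are illustrative only.

STOP_WORDS = {"i", "want", "to", "know", "how", "of", "such", "that",
              "are", "and", "especially", "the", "a"}
SUFFIXES = ("ational", "ation", "ing", "es", "e", "s")  # try longest first

def crude_stem(word):
    """Strip the first matching suffix (a toy stand-in for Porter stemming)."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(query):
    """Lowercase, tokenize, drop stop words, and stem the remaining terms."""
    words = query.lower().replace(",", " ").replace(".", " ").split()
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

query = ("I want to know how to compute singular values of data "
         "matrices, especially such that are large and sparse.")
tokens = preprocess(query)
print(" ".join(tokens))
```

A real system would use the Porter (or Snowball) stemmer, whose rules differ from the crude suffix stripping used here.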
Example
Example (Porter stemmer)
2. changes of the nucleic acid and phospholipid levels of the livers in the
course of fetal and postnatal development . we have followed the
evolution of dna, rna and pl in the livers of rat foeti removed between the
fifteenth and the twenty-first day of gestation and of young rats
newly-born or at weaning . we can observe the following facts.. 1. dna
concentration is 1100 ug p on the 15th day, it decreases from the 19th
day until it reaches a value of 280 ug 5 days after weaning .
becomes
2. chang of the nucleic acid and phospholipid level of the liver in the
cours of fetal and postnat develop . we have follow the evolut of dna, rna
and pl in the liver of rat foeti remov between the fifteenth and the
twenti-first dai of gestat and of young rat newli-born or at wean . we can
observ the follow fact.. 1. dna concentr is 1100 ug p on the 15th dai, it
decreas from the 19th dai until it reach a valu of 280 ug 5 dai after wean
.
Inverted File Structures
Document file: Each document has a number and all terms
are identified
Dictionary: Sorted list of all unique terms
Inversion List: Pointers from a term to the documents that
contain that term (column index for non-zeros in a row of the
matrix).
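The three structures can be sketched in a few lines of Python; the toy documents below are hypothetical and assumed to be already indexed and stemmed:

```python
# Sketch of an inverted file structure: a sorted dictionary of unique terms,
# and for each term an inversion list of the documents containing it
# (equivalently, the column indices of the non-zeros in that term's row).

from collections import defaultdict

documents = {
    1: ["google", "matrix", "internet"],
    2: ["link", "web", "page"],
    3: ["google", "matrix", "rank", "web", "page"],
}

inverted = defaultdict(list)
for doc_id, terms in sorted(documents.items()):
    for term in set(terms):          # each term counted once per document
        inverted[term].append(doc_id)

dictionary = sorted(inverted)        # sorted list of all unique terms
print(dictionary)
print(inverted["matrix"])            # → [1, 3]
```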
Example
Document 1: The Google matrix P is a model of the Internet.
Document 2: Pij is nonzero if there is a link from web page j to i.
Document 3: The Google matrix is used to rank all web pages.
Document 4: The ranking is done by solving a matrix eigenvalue problem.
Document 5: England dropped out of the top 10 in the FIFA ranking.
Term–document matrix I
Count the frequency of terms in each document:
Term         Doc 1   Doc 2   Doc 3   Doc 4   Doc 5
eigenvalue     0       0       0       1       0
England        0       0       0       0       1
FIFA           0       0       0       0       1
Google         1       0       1       0       0
Internet       1       0       0       0       0
link           0       1       0       0       0
matrix         1       0       1       1       0
page           0       1       1       0       0
rank           0       0       1       1       1
web            0       1       1       0       0
Term–Document Matrix II
        [0  0  0  1  0]
        [0  0  0  0  1]
        [0  0  0  0  1]
        [1  0  1  0  0]
        [1  0  0  0  0]
    A = [0  1  0  0  0]  ∈ R^(m×n)   (here m = 10, n = 5)
        [1  0  1  1  0]
        [0  1  1  0  0]
        [0  0  1  1  1]
        [0  1  1  0  0]

a_ij is the weighted frequency of term i in document j.
Simple Query Matching
Given a query vector q, find the columns a_j of A which have
dist(q, a_j) ≤ tol.
Common distance measures:
  angle:               dist(x, y) = arccos(x^T y)   (note: ||x||_2 = ||y||_2 = 1)
  Euclidean distance:  dist(x, y) = ||x − y||_2
Query
Query: “ranking of web pages”.
Query vector:

  q = (0, 0, 0, 0, 0, 0, 0, 1, 1, 1)^T ∈ R^10.
Cosine measure
Cosine measure: cos θ(x, y) = x^T y / (||x||_2 ||y||_2) ≥ tol.
Cosines:
  0   0.6667   0.7746   0.3333   0.3333
With a cosine threshold of tol = 0.4, documents 2 and 3 would be
returned.
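These cosines can be reproduced with a short NumPy sketch (Python here stands in for the course's Matlab); A and q are the Google example from the previous slides:

```python
# Query matching in the vector space model: cosine between the query vector
# and each column of the term-document matrix.
import numpy as np

# Rows: eigenvalue, England, FIFA, Google, Internet, link, matrix, page, rank, web
A = np.array([
    [0, 0, 0, 1, 0],   # eigenvalue
    [0, 0, 0, 0, 1],   # England
    [0, 0, 0, 0, 1],   # FIFA
    [1, 0, 1, 0, 0],   # Google
    [1, 0, 0, 0, 0],   # Internet
    [0, 1, 0, 0, 0],   # link
    [1, 0, 1, 1, 0],   # matrix
    [0, 1, 1, 0, 0],   # page
    [0, 0, 1, 1, 1],   # rank
    [0, 1, 1, 0, 0],   # web
], dtype=float)

q = np.zeros(10)
q[[7, 8, 9]] = 1       # "ranking of web pages": page, rank, web

cosines = (q @ A) / (np.linalg.norm(q) * np.linalg.norm(A, axis=0))
print(np.round(cosines, 4))   # → [0. 0.6667 0.7746 0.3333 0.3333]

tol = 0.4
print(np.where(cosines >= tol)[0] + 1)   # documents 2 and 3
```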
Sparse Matrix
Figure: The first 500 rows and columns of the Medline matrix. Each dot
represents a non-zero element (nz = 2676).
Sparse Matrix Storage
Typically only 1% of the matrix elements are non-zero: the matrix is
sparse.
Example:

        [0.6667   0        0        0.2887]
    A = [0        0.7071   0.4082   0.2887]
        [0.3333   0        0.4082   0.2887]
        [0.6667   0        0        0     ]

Compressed row storage:

  val      0.666  0.288  0.707  0.408  0.288  0.333  0.408  0.288  0.666
  col-ind  1      4      2      3      4      1      3      4      1
  row-ptr  1      3      6      9

Compressed column storage: analogous.
Sparse matrices are automatic in Matlab.
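The CRS arrays above can be built as sketched below, using 1-based indices as on the slide; in practice one would use a library such as scipy.sparse.csr_matrix, which stores the same three arrays with 0-based indices:

```python
# Sketch of compressed row storage (CRS): the non-zero values row by row,
# their (1-based) column indices, and a pointer to where each row starts.
import numpy as np

A = np.array([
    [0.6667, 0.0,    0.0,    0.2887],
    [0.0,    0.7071, 0.4082, 0.2887],
    [0.3333, 0.0,    0.4082, 0.2887],
    [0.6667, 0.0,    0.0,    0.0   ],
])

val, col_ind, row_ptr = [], [], []
for i in range(A.shape[0]):
    row_ptr.append(len(val) + 1)        # 1-based start of row i in val
    for j in range(A.shape[1]):
        if A[i, j] != 0.0:
            val.append(A[i, j])
            col_ind.append(j + 1)       # 1-based column index

print(col_ind)   # → [1, 4, 2, 3, 4, 1, 3, 4, 1]
print(row_ptr)   # → [1, 3, 6, 9]
```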
Performance Evaluation
Precision:
  P = D_r / D_t,
D_r: number of relevant documents retrieved
D_t: total number of documents retrieved
Recall:
  R = D_r / N_r,
N_r: total number of relevant documents in the database
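A sketch of the two measures with hypothetical retrieved and relevant document sets:

```python
# Precision P = Dr/Dt and recall R = Dr/Nr for made-up document sets.

retrieved = {2, 3, 5, 8}          # documents returned by the query
relevant = {3, 5, 8, 11, 17}      # all relevant documents in the database

Dr = len(retrieved & relevant)    # relevant documents retrieved
Dt = len(retrieved)               # total documents retrieved
Nr = len(relevant)                # total relevant documents in the database

P = Dr / Dt
R = Dr / Nr
print(P, R)   # → 0.75 0.6
```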
Performance evaluation II
[Venn diagram: returned documents and relevant documents as overlapping
subsets of all documents]
Figure: Returned and relevant documents for two values of the tolerance:
The dashed circle represents the retrieved documents for a high value of
the cosine tolerance.
Performance evaluation III
Cosine measure: cos θ(x, y) = x^T y / (||x||_2 ||y||_2) ≥ tol.
Large value of tol (≤ 1): high precision, but low recall.
Small value of tol: high recall, but low precision.
Our test query
Query Q9 in the Medline collection:
9. the use of induced hypothermia in heart surgery,
neurosurgery, head injuries and infectious diseases.
Precision-recall graph
Query matching for Q9 in the Medline collection (stemmed) using
the cosine measure
[Figure: precision (%) plotted against recall (%)]
Singular Value Decomposition (SVD)
A ∈ R^(m×n), m ≥ n:

  A = U [Σ; 0] V^T,   U ∈ R^(m×m), Σ ∈ R^(n×n), V ∈ R^(n×n),

where U and V are orthogonal and Σ is diagonal:

  Σ = diag(σ1, σ2, ..., σn),
  σ1 ≥ σ2 ≥ ··· ≥ σr > σ(r+1) = ··· = σn = 0,

rank(A) = r.
Sum of rank-1 matrices: A = U [Σ; 0] V^T = Σ_(i=1..n) σi ui vi^T.
Singular values of A
Google example: A has rank 5 (full column rank).
[Figure: the five singular values of A, plotted in decreasing order]
Singular values of Medline matrix
[Figure: the first 100 singular values of the Medline matrix]
Matrix approximation by SVD
Define the Frobenius matrix norm ||A||_F = (Σ_(i,j) a_ij^2)^(1/2), and
assume k ≤ r.

Theorem (Eckart–Young 1936)
The approximation problem

  min_(rank(Z)=k) ||A − Z||_F

has the solution

  Z = Σ_(i=1..k) σi ui vi^T = U_k Σ_k V_k^T
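The theorem can be checked numerically with a NumPy sketch on a random matrix: the best rank-k approximation is built from the k largest singular triplets, and the Frobenius error equals the norm of the discarded singular values, (σ(k+1)^2 + ··· + σr^2)^(1/2):

```python
# Best rank-k approximation via the SVD, and its Frobenius-norm error.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 5))   # a random test matrix
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Z = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # Z = U_k Sigma_k V_k^T

err = np.linalg.norm(A - Z, "fro")
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))   # → True
```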
Solution
Z = Σ_(i=1..k) σi ui vi^T = U_k Σ_k V_k^T,  where

U_k = (u1, ..., uk),  Σ_k = diag(σ1, ..., σk),  V_k = (v1, ..., vk)
First Singular Vectors: Medline
find(abs(U(:,k))>0.13)
Look-up in the dictionary of terms
U(:,1): cell, growth, hormon, patient
U(:,2): case, cell, children, defect, dna, growth, patient, ventricular
Basis Vectors
Use the columns of Uk as new basis vectors in the document space.
Express all the documents in terms of the new basis:
  min_D ||A − U_k D||_F

Least squares problem, with solution

  D = (U_k^T U_k)^(−1) U_k^T A = U_k^T A
    = U_k^T Σ_(i=1..n) σi ui vi^T
    = Σ_(i=1..k) σi ei vi^T = Σ_k V_k^T =: D_k

D_k = U_k^T A is the projection of A onto the subspace spanned by U_k.
Column j of D_k holds the coordinates of document j in the new
basis.
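A NumPy sketch of this identity on a random matrix: the least squares solution D_k = U_k^T A coincides with Σ_k V_k^T, so column j of D_k holds the coordinates of document j in the basis u1, ..., uk:

```python
# Check that the projected coordinates U_k^T A equal Sigma_k V_k^T.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 5))   # a random test matrix
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

Dk = Uk.T @ A                      # projection of A onto span(U_k)
print(np.allclose(Dk, Sk @ Vtk))   # → True
```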
SVD – PCA
Data compression by the SVD ≡ principal component analysis:

  A ≈ A_k = U_k (Σ_k V_k^T) =: U_k D_k

D_k holds the coordinates of all documents in terms of the first k
singular vectors u1, ..., uk (principal components).
Coordinates
D_k holds the coordinates of all documents in terms of the first k
singular vectors u1, ..., uk (principal components). Here k = 2:

[Figure: the five documents and the query q plotted in the (u1, u2)
coordinate plane]
Query: “ranking of web pages”.
Query Matching after Data Compression I
Data compression: represent the term-document matrix by
A_k = U_k D_k.
Compute q^T A_k = q^T U_k D_k = (U_k^T q)^T D_k.
Project the query: q_k := U_k^T q.
Cosines:

  cos θ_j = q_k^T (D_k e_j) / (||q_k||_2 ||D_k e_j||_2)

Cosines for query and original data:
  0   0.6667   0.7746   0.3333   0.3333
After projection onto the two-dimensional subspace:
  0.7857   0.8332   0.9670   0.4873   0.1819
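This computation can be sketched in NumPy on the Google example, with k = 2. The signs of the computed singular vectors may differ from those shown on the slides, but the cosines are unaffected, since flipping u_i flips the corresponding row of D_k as well:

```python
# Query matching after SVD compression: project the query, q_k = U_k^T q,
# and compute cosines against the columns of D_k = Sigma_k V_k^T.
import numpy as np

# Rows: eigenvalue, England, FIFA, Google, Internet, link, matrix, page, rank, web
A = np.array([
    [0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1], [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [1, 0, 1, 1, 0], [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1], [0, 1, 1, 0, 0],
], dtype=float)
q = np.zeros(10)
q[[7, 8, 9]] = 1              # "ranking of web pages": page, rank, web

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]
Dk = np.diag(s[:k]) @ Vt[:k, :]

qk = Uk.T @ q                 # projected query
cosines = (qk @ Dk) / (np.linalg.norm(qk) * np.linalg.norm(Dk, axis=0))
print(np.round(cosines, 4))
```

With the data above this should reproduce the projected cosines quoted on the slide, with document 3 scoring highest.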
LSI (SVD compression) enhances retrieval quality
Document 1: The Google matrix P is a model of the Internet.
Document 2: Pij is nonzero if there is a link from web page j to i.
Document 3: The Google matrix is used to rank all web pages.
Document 4: The ranking is done by solving a matrix eigenvalue problem.
Document 5: England dropped out of the top 10 in the FIFA ranking.
First Two Singular Vectors

u1 = (0.1425, 0.0787, 0.0787, 0.3924, 0.1297, 0.1020, 0.5348,
      0.3647, 0.4838, 0.3647)^T,

u2 = (0.2430, 0.2607, 0.2607, −0.0274, 0.0740, −0.3735, 0.2156,
      −0.4749, 0.4023, −0.4749)^T
Recall vs. precision for Medline, Q9
[Figure: precision (%) plotted against recall (%)]

Full vector space model (solid line), the rank 100 approximation
(dashed).
Error in matrix approximation: ||A − A_k||_F / ||A||_F ≈ 0.8
Latent Semantic Indexing
Represent the data matrix in terms of singular vectors.
Jessup & Martin: “Rank reduction removes the noise that obscures
the semantic content of the data.”
Park et al.: “LSI is based on the assumption that there is some
underlying latent semantic structure in the data ... that is
corrupted by the wide variety of words used ...
Berry p. 57: “automatic association of related terms”
The vector space model and LSI can deal with synonymy and
polysemy (cf. plain word matching)
Synonymy: different words with the same meaning (football, soccer)
Polysemy: the same word has different meanings
Latent Semantic Indexing
“A key feature of LSI is its ability to extract the conceptual
content of a body of text by establishing associations between
those terms that occur in similar contexts”
“... uncovers the underlying latent semantic structure in the usage
of words in a body of text”
“LSI will return results that are conceptually similar in meaning to
the search criteria even if the results don’t share a specific word or
words with the search criteria.”
Source:
http://en.wikipedia.org/wiki/Latent_semantic_indexing
Text Parser TMG I
TMG - Text to Matrix Generator
TMG parses a text collection and generates the term - document matrix.
A = TMG(FILENAME) returns the term - document matrix, that corresponds
to the text collection contained in files of directory (or file) FILENAME.
Each document must be separated by a blank line (or another delimiter
that is defined by the OPTIONS argument) in each file.
[A, DICTIONARY] = TMG(FILENAME) returns also the dictionary for the
collection, while [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZED_FACTORS]
= TMG(FILENAME) returns the vectors of global weights for the dictionary
and the normalization factor for each document in case such a factor is used.
If normalization is not used TMG returns a vector of all ones.
[A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC] =
TMG(FILENAME) returns statistics for each document, i.e. the number of
terms for each document.
....
Copyright 2004 Dimitrios Zeimpekis, Efstratios Gallopoulos
M. W. Berry and M. Browne. Understanding Search Engines: Mathematical
Modeling and Text Retrieval. SIAM, Philadelphia, PA, second edition, 2005.

E. R. Jessup and J. H. Martin. Taking a new look at the latent semantic
analysis approach to information retrieval. In M. W. Berry, editor,
Computational Information Retrieval, pages 121–144, Philadelphia, PA,
2001. SIAM.

H. Park, M. Jeon, and J. B. Rosen. Lower dimensional representation of
text data in vector space based information retrieval. In M. W. Berry,
editor, Computational Information Retrieval, pages 3–23, Philadelphia,
PA, 2001. SIAM.