Text Mining

Huaizhong KOU
PhD student of Georges Gardarin
PRiSM Laboratory

0. Content
1. Introduction
 – What happens
 – What is Text Mining
 – Text Mining vs Data Mining
 – Applications
2. Feature Extraction
 – Task
 – Indexing
 – Dimensionality Reduction
3. Document Categorization
 – Task
 – Architecture
 – Categorization Classifiers
 – Application: Trend Analysis
4. Document Clustering
 – Task
 – Algorithms
 – Application
5. Product
6. Reference

1. Text Mining: Introduction
1.1 What happens
1.2 What is Text Mining
1.3 Text Mining vs Data Mining
1.4 Applications

1.1 Introduction: What happens (1)
• Information explosion
• 80% of information is stored in text documents: journals, web pages, emails...
• It is difficult to extract specific information
• Current technologies...
[Figure: the Internet as a growing source of textual information]

1.1 Introduction: What happens (2)
• It is necessary to automatically analyze, organize, summarize... textual information.

1.2 Introduction: What is Text Mining (1)
• Text Mining ::= the procedure of synthesizing information by analyzing the relations, the patterns, and the rules among textual data - semi-structured or unstructured text.
• Techniques ::= data mining, machine learning, information retrieval, statistics, natural-language understanding, case-based reasoning.
Ref:[1]-[4]

1.2 Introduction: What is Text Mining (2)
[Figure: sample text documents are transformed into representation models; a learning phase produces domain-specific templates/models, which are then applied to new documents in a working phase and visualized]

1.3 Introduction: TM vs DM

                     | Data Mining                       | Text Mining
 Data object         | numerical & categorical data      | textual data
 Data structure      | structured                        | unstructured & semi-structured
 Data representation | straightforward                   | complex
 Space dimension     | < tens of thousands               | > tens of thousands
 Methods             | data analysis, machine learning,  | data mining, information
                     | statistics, neural networks       | retrieval, NLP, ...
 Maturity            | broad implementation since 1994   | broad implementation starting 2000
 Market              | 10^5 analysts at large and        | 10^8 analysts, corporate workers
                     | mid-size companies                | and individual users
Ref:[12][13]

1.4 Introduction: Applications
• The potential applications are countless:
• Customer profile analysis
• Trend analysis
• Information filtering and routing
• Event tracking
• News stories classification
• Web search
• ...
Ref:[12][13]

2. Feature Extraction
2.1 Task
2.2 Indexing
2.3 Weighting Model
2.4 Dimensionality Reduction
Ref:[7][11][14][18][20][22]

2.1 Feature Extraction: Task (1)
• Task: extract a good subset of words to represent documents.
[Figure: document collection → all unique words/phrases → feature extraction → all good words/phrases]
Ref:[7][11][14][18][20][22]

2.1 Feature Extraction: Task (2)
• While more and more textual information is available online, effective retrieval is difficult without good indexing of the text content.
• Ex. the 16-word sentence above reduces to 5 index terms: text, information, online, retrieval, index (with frequencies 2, 1, 1, 1, 1).
Ref:[7][11][14][18][20][22]

2.2 Feature Extraction: Indexing (1)
• Training documents → identification of all unique words
• Removal of stop words: non-informative words, e.g. {the, and, when, more}
• Word stemming: removal of suffixes to generate word stems; grouping words increases the relevance, e.g. {walker, walking} → walk
• Term weighting: naive terms; importance of a term in a document
Ref:[7][11][14][18][20][22]

2.2 Feature Extraction: Indexing (2)
• Document representation: vector space model
 d = (w1, w2, ..., wt) ∈ R^t
 where wi is the weight of the ith term in document d.
Ref:[7][11][14][18][20][22]

2.3 Feature Extraction: Weighting Model (1)
• tf - Term Frequency weighting:
 wij = Freqij
 Freqij ::= the number of times the jth term occurs in document Di.
• Drawback: does not reflect a term's importance for document discrimination.
• Ex. D1 = "ABRTSAQWA XAO", D2 = "RTABBAXA QSAK"

      A  B  K  O  Q  R  S  T  W  X
 D1   3  1  0  1  1  1  1  1  1  1
 D2   3  2  1  0  1  1  1  1  0  1
Ref:[11][22]
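To make the indexing pipeline concrete, here is a minimal Python sketch of sections 2.2-2.3(1): stop-word removal, crude suffix stripping, and raw tf weighting. The stop-word list and the suffix rules are toy placeholders, not those of any cited system.

```python
from collections import Counter

STOP_WORDS = {"the", "and", "when", "more", "is", "of", "a"}  # toy list
SUFFIXES = ("ing", "er", "ed", "s")  # toy suffix stripping, not a real stemmer

def stem(word):
    """Crude suffix removal, e.g. 'walker'/'walking' -> 'walk'."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def index_terms(text):
    """Identify words, drop stop words, stem the rest."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOP_WORDS]

def tf_vector(text):
    """tf weighting: w_ij = Freq_ij, the raw count of term j in document i."""
    return Counter(index_terms(text))

doc = "While more and more textual information is available online, ..."
print(tf_vector(doc))
```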
2.3 Feature Extraction: Weighting Model (2)
• tfidf - Inverse Document Frequency weighting:
 wij = Freqij * log(N / DocFreqj)
 N ::= the number of documents in the training document collection.
 DocFreqj ::= the number of documents in which the jth term occurs.
• Advantage: reflects a term's importance for document discrimination.
• Assumption: terms with low DocFreq are better discriminators than terms with high DocFreq in the document collection.
• Ex. (N = 2, log base 10):

      A  B  K    O    Q  R  S  T  W    X
 D1   0  0  0    0.3  0  0  0  0  0.3  0
 D2   0  0  0.3  0    0  0  0  0  0    0
Ref:[11][22][13]

2.3 Feature Extraction: Weighting Model (3)
• Entropy weighting:
 wij = log(Freqij + 1.0) * (1 + entropy(wi))
 where
 entropy(wi) = (1 / log N) * Σ(j=1..N) (Freqij / DocFreqi) * log(Freqij / DocFreqi)
 is the average entropy of the ith term:
 -1: if the word occurs once in every document
 0: if the word occurs in only one document
Ref:[11][22][13]

2.4 Feature Extraction: Dimension Reduction
• 2.4.1 Document Frequency Thresholding
• 2.4.2 χ²-statistic
• 2.4.3 Latent Semantic Indexing
Ref:[11][20][21][27]

2.4.1 Dimension Reduction: DocFreq Thresholding
• Document Frequency Thresholding:
 training documents D → naive terms
 → calculate DocFreq(w) for each term w
 → set a threshold θ
 → remove all words with DocFreq(w) < θ
 → feature terms
Ref:[11][20][21][27]

2.4.2 Dimension Reduction: χ²-statistic
• Assumption: a pre-defined category set C = {c1, c2, ..., cm} for a training collection D.
• Goal: estimate the independence between a term and a category.
• Term categorical score:
 χ²(w, cj) = N * (A*D - C*B)² / ((A+C) * (B+D) * (A+B) * (C+D))
 χ²max(w) = max(j) χ²(w, cj)
 where
 A ::= |{d | d ∈ cj ∧ w ∈ d}|
 B ::= |{d | d ∉ cj ∧ w ∈ d}|
 C ::= |{d | d ∈ cj ∧ w ∉ d}|
 D ::= |{d | d ∉ cj ∧ w ∉ d}|
 N ::= |{d | d ∈ D}|
• Set a threshold; remove all words with χ²max(w) below it → feature terms.
Ref:[11][20][21][27]

2.4.3 Dimension Reduction: LSI (1)
• LSI = Latent Semantic Indexing.
1. SVD model: singular value decomposition of the t×d term-document matrix X:
 X = T0 S0 D0'   with T0'T0 = I and D0'D0 = I
 (t,d) = (t,m)(m,m)(m,d), and X' = D0 S0 T0'
 T0: t×m orthogonal matrix; its columns are eigenvectors of XX'.
 D0: d×m orthogonal matrix; its columns are eigenvectors of X'X.
 S0: m×m diagonal matrix of singular values (square roots of eigenvalues) in decreasing order of importance.
 m: rank of matrix X, m <= min(t,d).
Ref:[11][20][21][27]

2.4.3 Dimension Reduction: LSI (2)
2. Approximate model: select k <= m and keep the k largest singular values:
 X ≈ appr(X) = T S D'
 (t,d)        (t,k)(k,k)(k,d)
 appr(X)' = D S T'
• Every row of the matrix appr(X)' approximately represents one document.
• Given a row xi and its corresponding row di, the following holds:
 xi = di S T'  and  di = xi T S^-1
Ref:[11][20][21][27]

2.4.3 Dimension Reduction: LSI (3)
3. Document representation model: a new document d over the naive terms,
 d = (w1, w2, ..., wt) ∈ R^t,
is mapped to
 appr(d) = d T S^-1 ∈ R^k
 (1,k)   (1,t)(t,k)(k,k)
• There is no good method to determine k; it depends on the application domain.
• Some experiments suggest 100 <= k <= 300.
Ref:[11][20][21][27]
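Continuing the sketch, the tfidf and entropy weightings of 2.3 follow directly from the formulas. A minimal illustration on the slide's two-document example, assuming terms are single letters and using log base 10, which reproduces the 0.3 = log(2/1) values in the tfidf table; the entropy variant encodes the reconstructed formula above and should be read with the same caution.

```python
import math
from collections import Counter

docs = [Counter("ABRTSAQWA" + "XAO"),   # D1 term counts (single-letter terms)
        Counter("RTABBAXA" + "QSAK")]   # D2 term counts
N = len(docs)

def doc_freq(term):
    """DocFreq: the number of documents containing the term."""
    return sum(1 for d in docs if term in d)

def tfidf(i, term):
    """w_ij = Freq_ij * log10(N / DocFreq_j)."""
    return docs[i][term] * math.log10(N / doc_freq(term))

def entropy_weight(i, term):
    """w_ij = log(Freq_ij + 1) * (1 + entropy), entropy in [-1, 0]."""
    df = doc_freq(term)
    entropy = sum((d[term] / df) * math.log10(d[term] / df)
                  for d in docs if term in d) / math.log10(N)
    return math.log10(docs[i][term] + 1.0) * (1.0 + entropy)

print(round(tfidf(0, "O"), 2))  # 0.3: 'O' occurs only in D1
print(round(tfidf(0, "A"), 2))  # 0.0: 'A' occurs in both documents
```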
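The χ²-statistic of 2.4.2 is equally short. A sketch assuming the training collection is a list of (term-set, category) pairs; the data and names are illustrative, not from any cited corpus.

```python
def chi_square(w, cj, docs):
    """chi2(w,c) = N*(A*D - C*B)^2 / ((A+C)*(B+D)*(A+B)*(C+D)).

    A: docs in cj containing w      B: docs outside cj containing w
    C: docs in cj without w         D: docs outside cj without w
    """
    A = sum(1 for terms, c in docs if c == cj and w in terms)
    B = sum(1 for terms, c in docs if c != cj and w in terms)
    C = sum(1 for terms, c in docs if c == cj and w not in terms)
    D = sum(1 for terms, c in docs if c != cj and w not in terms)
    N = len(docs)
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def chi_square_max(w, categories, docs):
    """chi2_max(w): keep w as a feature only if this beats the threshold."""
    return max(chi_square(w, c, docs) for c in categories)

docs = [({"stock", "price"}, "finance"), ({"match", "goal"}, "sport"),
        ({"stock", "trade"}, "finance"), ({"team", "goal"}, "sport")]
print(chi_square_max("stock", {"finance", "sport"}, docs))  # 4.0: perfect discriminator
```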
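The LSI construction of 2.4.3 is a truncated SVD. A sketch using numpy (assumed available); the matrix values and k = 2 are purely illustrative.

```python
import numpy as np

# X: t x d term-document matrix (t terms, d documents), toy values
X = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 2.]])

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)  # X = T0 * S0 * D0'
k = 2                                                # keep k largest singular values
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]

approx_X = T @ S @ Dt   # appr(X): the rank-k approximation of X
doc_coords = Dt.T       # each row: one document's coordinates in the k-dim LSI space

# Fold in a new document d (a term vector in R^t): appr(d) = d * T * S^-1
d_new = np.array([1., 0., 2., 0.])
d_k = d_new @ T @ np.linalg.inv(S)
print(d_k)              # R^k representation used for similarity search
```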
3. Document Categorization
3.1 Task
3.2 Architecture
3.3 Categorization Classifiers
3.4 Application
Ref:[1][2][4][5][11][18][23][24]

3.1 Categorization: Task
• Task: assignment of one or more predefined categories (topics, themes) to one document.

3.2 Categorization: Architecture
[Figure: training documents go through preprocessing, weighting, and feature selection to build a classifier over the predefined categories; a new document d follows the same preprocessing, and the classifier assigns d its category(ies)]

3.3 Categorization Classifiers
3.3.1 Centroid-Based Classifier
3.3.2 k-Nearest Neighbor Classifier
3.3.3 Naive Bayes Classifier

3.3.1 Model: Centroid-Based Classifier (1)
1. Input: new document d = (w1, w2, ..., wn);
2. Predefined categories: C = {c1, c2, ..., cl};
3. // Compute the centroid vector of each category:
 ci = (1/|ci|) * Σ d'  for all d' ∈ ci, ci ∈ C
4. // Similarity model - cosine function:
 Simil(di, dj) = cos(di, dj) = Σ(l) wil * wjl / (sqrt(Σ(l) wil²) * sqrt(Σ(l) wjl²))
5. // Compute the similarity Simil(ci, d) = cos(ci, d) for each category;
6. // Output: assign to document d the category cmax such that
 Simil(cmax, d) >= Simil(ci, d) for all ci ∈ C.

3.3.1 Model: Centroid-Based Classifier (2)
[Figure: two angles between document vectors with α > β, hence cos(α) < cos(β): d2 is closer to d1 than d3]
• The cosine-based similarity model can reflect the relations between features.

3.3.2 Model: k-Nearest Neighbor Classifier
1. Input: new document d;
2. Training collection: D = {d1, d2, ..., dn};
3. Predefined categories: C = {c1, c2, ..., cl};
4. // Compute similarities:
 for (di ∈ D) { Simil(d, di) = cos(d, di); }
5. // Select the k nearest neighbors: construct the k-document subset Dk so that
 Simil(d, di) < min{Simil(d, doc) | doc ∈ Dk} for all di ∈ D - Dk
 (i.e., Dk holds the k documents most similar to d).
6. // Compute the score of each category:
 for (ci ∈ C) { score(ci) = 0;
  for (doc ∈ Dk) { score(ci) += ((doc ∈ ci) ? 1 : 0); } }
7. // Output: assign to d the category c with the highest score:
 score(c) >= score(ci) for all ci ∈ C - {c}.

3.3.3 Model: Naive Bayes Classifier
• Basic assumption: all terms are distributed in documents independently.
1. Input: new document d;
2. Predefined categories: C = {c1, c2, ..., cl};
3. // Compute the probability that d belongs to each class ci ∈ C:
 Pr(ci|d) = Pr(d|ci) Pr(ci) / Pr(d) = Pr(d|ci) Pr(ci) / Σ(cj∈C) Pr(d|cj) Pr(cj)
 // note that the terms wj of a document are independent of each other:
 Pr(d|ci) = Π(j=1..|d|) Pr(wj|ci)
4. // Output: assign to d the category cmax with the highest probability:
 Pr(cmax|d) = max(c∈C) Pr(c|d).

3.4 Categorization: Application
• Trend analysis - the EAnalyst system.
• Goal: predict the trends in stock prices based on news stories.
[Figure: time-series data → piecewise linear fitting → trends, where trend ::= (slope, confidence); relevant sample documents are retrieved and aligned with the trends; a learning process builds a language model per trend; a Bayes classifier then categorizes new news documents to predict a new trend]
Ref:[28]
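The centroid-based and k-NN classifiers of 3.3.1-3.3.2 share the cosine similarity model; a minimal dict-based sketch, with toy training data and illustrative names.

```python
import math
from collections import Counter, defaultdict

def cos_sim(a, b):
    """cos(a,b) = sum_l a_l*b_l / (sqrt(sum a_l^2) * sqrt(sum b_l^2))."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid_classify(d, training):
    """3.3.1: assign d the category whose centroid is most cosine-similar."""
    sums, counts = defaultdict(Counter), Counter()
    for vec, c in training:
        sums[c].update(vec)
        counts[c] += 1
    centroids = {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}
    return max(centroids, key=lambda c: cos_sim(centroids[c], d))

def knn_classify(d, training, k=3):
    """3.3.2: score each category by its members among the k nearest neighbors."""
    nearest = sorted(training, key=lambda dc: cos_sim(d, dc[0]), reverse=True)[:k]
    return Counter(c for _, c in nearest).most_common(1)[0][0]

training = [({"goal": 2, "team": 1}, "sport"), ({"stock": 3, "price": 1}, "finance"),
            ({"match": 1, "goal": 1}, "sport"), ({"trade": 2, "stock": 1}, "finance")]
d = {"goal": 1, "match": 2}
print(centroid_classify(d, training), knn_classify(d, training))  # sport sport
```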
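The Naive Bayes classifier of 3.3.3 in the same style, with two additions the slide leaves open (both are assumptions of this sketch, not part of the slide's model): Laplace smoothing for unseen terms, and log-probabilities to avoid floating-point underflow on long documents.

```python
import math
from collections import Counter, defaultdict

def train_nb(training):
    """Estimate Pr(c) and term counts per class from (term-list, category) pairs."""
    prior, term_counts, vocab = Counter(), defaultdict(Counter), set()
    for terms, c in training:
        prior[c] += 1
        term_counts[c].update(terms)
        vocab.update(terms)
    return prior, term_counts, vocab

def nb_classify(d, prior, term_counts, vocab):
    """Pr(c|d) proportional to Pr(c) * prod_j Pr(w_j|c); terms assumed independent."""
    n_docs = sum(prior.values())
    best, best_logp = None, -math.inf
    for c in prior:
        total = sum(term_counts[c].values())
        logp = math.log(prior[c] / n_docs)
        for w in d:  # Laplace smoothing: (count + 1) / (total + |V|)
            logp += math.log((term_counts[c][w] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best, best_logp = c, logp
    return best

training = [(["goal", "team", "goal"], "sport"), (["stock", "price"], "finance")]
prior, tc, vocab = train_nb(training)
print(nb_classify(["goal", "match"], prior, tc, vocab))  # sport
```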
4. Document Clustering
4.1 Task
4.2 Algorithms
4.3 Application
Ref:[5][7][8][9][10][15][16][29]

4.1 Document Clustering: Task
• Task: group all documents so that the documents in the same group are more similar to each other than to documents in other groups.
• Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.
Ref:[5][7][8][9][10][15][16][29]

4.2 Document Clustering: Algorithms
• 4.2.1 k-means
• 4.2.2 Hierarchic Agglomerative Clustering (HAC)
• 4.2.3 Association Rule Hypergraph Partitioning (ARHP)
Ref:[5][7][8][9][10][15][16][29]

4.2.1 Document Clustering: k-means
• k-means: distance-based flat clustering
 0. Input: D ::= {d1, d2, ..., dn}; k ::= the cluster number;
 1. Select k document vectors as the initial centroids of k clusters
 2. Repeat
 3.  Select one vector d among the remaining documents
 4.  Compute the similarities between d and the k centroids
 5.  Put d in the closest cluster and recompute the centroid
 6. Until the centroids don't change
 7. Output: k clusters of documents
• Advantages:
 • linear time complexity
 • works relatively well in low-dimensional spaces
• Drawbacks:
 • distance computation is expensive in high-dimensional spaces
 • the centroid vector may not summarize the cluster documents well
 • the initial k clusters affect the quality of the final clusters
Ref:[5][7][8][9][10][15][16][29]

4.2.2 Document Clustering: HAC
• Hierarchic agglomerative clustering (HAC): distance-based hierarchic clustering
 0. Input: D ::= {d1, d2, ..., dn};
 1. Calculate the similarity matrix SIM[i,j]
 2. Repeat
 3.  Merge the two most similar clusters, K and L, to form a new cluster KL
 4.  Compute the similarities between KL and each of the remaining clusters and update SIM[i,j]
 5. Until there is a single cluster (or a specified number of clusters)
 6. Output: dendrogram of clusters
• Advantages:
 • produces better-quality clusters
 • works relatively well in low-dimensional spaces
• Drawbacks:
 • distance computation is expensive in high-dimensional spaces
 • quadratic time complexity
Ref:[5][7][8][9][10][15][16][29]

4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (1)
• Hypergraph H = (V, E)
 V ::= a set of vertices
 E ::= a set of hyperedges
[Figure: an example hypergraph with vertices A-G and hyperedges a, b, c, e, f, g, h, i, weighted a=1, b=1/3, c=1, e=3/4, f=3/2, g=1/3, h=5/6, i=1/2]
Ref:[30]-[35]

4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (2)
• Transactional view of documents and features:
 item ::= document, transaction ::= feature

        Doc1  Doc2  Doc3  ...  Docn
 w1     5     5     2     ...  1
 w2     2     4     3     ...  5
 w3     0     0     0     ...  1
 ...
 wt     6     0     0     ...  3
 (transactional database of documents and features)
Ref:[30]-[35]

4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (3)
• Clustering pipeline:
 document-feature transaction database
 → discover association rules (Apriori algorithm)
 → construct the association rule hypergraph
 → partition the hypergraph (hypergraph partitioning algorithm)
 → k partitions
• Hyperedges ::= frequent item sets
• Hyperedge weight ::= average of the confidences of all rules
• Assumption: documents occurring in the same frequent item set are more similar.
Ref:[30]-[35]

4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (4)
• Advantages:
 • no calculation of the means of clusters
 • linear time complexity
 • the quality of the clusters is not affected by the space dimensionality
 • performs much better than traditional clustering in high-dimensional spaces, in terms of both cluster quality and runtime
Ref:[30]-[35]

4.3 Document Clustering: Application
• Summarization of documents
• Navigation of large document collections
• Organization of Web search results
Ref:[10][15]-[17]
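A minimal sketch of k-means from 4.2.1 in the same dict-based style. Two deviations from the slide's wording, both assumptions of this sketch: centroids are recomputed once per pass (a batch variant) rather than after every single assignment, and the first k documents serve as initial centroids.

```python
import math
from collections import defaultdict

def cos_sim(a, b):  # cosine similarity over sparse dict vectors, as in 3.3.1
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(cluster):
    """Centroid: component-wise mean of the member document vectors."""
    sums = defaultdict(float)
    for d in cluster:
        for t, w in d.items():
            sums[t] += w
    return {t: s / len(cluster) for t, s in sums.items()}

def kmeans(docs, k, max_iter=100):
    centroids = [dict(d) for d in docs[:k]]       # naive initialisation
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for d in docs:                            # put d in the closest cluster
            best = max(range(k), key=lambda i: cos_sim(d, centroids[i]))
            clusters[best].append(d)
        new = [mean_vector(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:                      # centroids stable: done
            break
        centroids = new
    return clusters

docs = [{"goal": 2, "team": 1}, {"goal": 1, "match": 1},
        {"stock": 3, "price": 1}, {"stock": 1, "trade": 2}]
print(kmeans(docs, k=2))
```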
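And HAC from 4.2.2. This naive version recomputes pair similarities at every merge (costlier than the quadratic algorithm on the slide) and assumes single-link similarity between clusters, a choice the slide does not fix.

```python
import math

def cos_sim(a, b):  # as in the k-means sketch above
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hac(docs, n_clusters=1):
    """Merge the two most similar clusters until n_clusters remain."""
    clusters = [[d] for d in docs]
    while len(clusters) > n_clusters:
        best_pair, best_sim = None, -1.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: most similar pair of documents across the clusters
                sim = max(cos_sim(a, b) for a in clusters[i] for b in clusters[j])
                if sim > best_sim:
                    best_pair, best_sim = (i, j), sim
        i, j = best_pair
        clusters[i] += clusters.pop(j)   # merge K and L into KL
    return clusters

docs = [{"goal": 2}, {"goal": 1, "match": 1}, {"stock": 2}, {"stock": 1, "trade": 1}]
print(hac(docs, n_clusters=2))
```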
5. Product: Intelligent Miner for Text (IMT) (1)
• Feature extraction tools: name extraction, term extraction, abbreviation extraction, relationship extraction
• Text analysis tools: categorization, summarization, clustering (hierarchical clustering, binary relational clustering)
• Web searching tools: text search engine, NetQuestion Solution, Web crawler
Ref:[5][36]

5. Product: Intelligent Miner for Text (IMT) (2)
1. Feature extraction tools
1.1 Information extraction
 • Extract linguistic items that represent the document contents.
1.2 Feature extraction
 • Assign different categories to the vocabulary in documents,
 • Measure their importance to the document content.
1.3 Name extraction
 • Locate names in text,
 • Determine what type of entity each name refers to.
1.4 Term extraction
 • Discover terms in text, including multiword technical terms,
 • Recognize variants of the same concept.
1.5 Abbreviation recognition
 • Find abbreviations and match them with their full forms.
1.6 Relation extraction

5. Product: Intelligent Miner for Text (IMT) (3)
• Feature extraction demo.

5. Product: Intelligent Miner for Text (IMT) (4)
2. Clustering tools
2.1 Applications
 • Provide an overview of the content of a large document collection
 • Identify hidden structures between groups of objects
 • Improve the browsing process to find similar or related information
 • Find outstanding documents within a collection
2.2 Hierarchical clustering
 • Clusters are organized in a clustering tree; related clusters occur in the same branch of the tree.
2.3 Binary relational clustering
 • Relationship of topics: document → cluster → topic.
 • NB: a preprocessing step for the categorization tool.

5. Product: Intelligent Miner for Text (IMT) (5)
• Clustering demo: navigation of a document collection.

5. Product: Intelligent Miner for Text (IMT) (6)
3. Summarization tools
3.1 Steps
 • Compute the relevancy of each sentence to the document,
 • Extract the most relevant sentences,
 • Produce a summary of the document with a length set by the user.
3.2 Applications
 • Judge the relevancy of a full text: easily determine whether the document is relevant to read.
 • Enrich search results: the results of a query to a search engine can be enriched with a short summary of each document.
 • Get a fast overview over a document collection: read the summary instead of the full document.

5. Product: Intelligent Miner for Text (IMT) (7)
4. Categorization tool
• Applications:
 • Organize intranet documents
 • Assign documents to folders
 • Dispatch requests
 • Forward news to subscribers
[Figure: a categorizer routes incoming news articles such as "I like health news", "new router", "Black cat" into the categories sports, cultures, health, politics, economics, vacations]

6. Reference (1)
Bibliography
[1] Marti A. Hearst, Untangling Text Data Mining, Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999 (invited paper). http://www.sims.berkeley.edu/~hearst
[2] Feldman and Dagan, KDT - Knowledge Discovery in Texts, in Proceedings of the First Annual Conference on Knowledge Discovery and Data Mining (KDD), Montreal, 1995.
[3] IJCAI-99 Workshop: Text Mining: Foundations, Techniques and Applications, Stockholm, Sweden, August 2, 1999. http://www.cs.biu.ac.il/~feldman/ijcai-workshop%20cfp.html
[4] Taeho C. Jo, Text Categorization considering Categorical Weights and Substantial Weights of Informative Keywords, 1999. (http://www.sccs.chukyo-u.ac.jp/ICCS/olp/p3-13/p3-13.htm)
[5] Daniel Tkach (editor), Text Mining Technology: Turning Information Into Knowledge, a White Paper from IBM, February 17, 1998. (http://allen.comm.virginia.edu/jtl5t/whiteweb.html)
[6] http://allen.comm.virginia.edu/jtl5t/index.htm
[7] G. Salton et al., Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
[8] Michael Steinbach, George Karypis, and Vipin Kumar, A Comparison of Document Clustering Techniques, KDD-2000.
[9] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, SIGIR '92, pp. 318-329.
[10] Oren Zamir, Oren Etzioni, Omid Madani, and Richard M. Karp, Fast and Intuitive Clustering of Web Documents, KDD '97, pp. 287-290, 1997.

6. Reference (2)
Bibliography
[11] Kjersti Aas et al., Text Categorization: A Survey, 1999.
[12] Text Mining White Paper. http://textmining.krdl.org.sg/whiteppr.html
[13] Gartner Group, Text Mining White Paper, June 2000. http://www.xanalys.com/intelligence_tools/products/text_mining_text.html
[14] Yiming Yang and Jan O. Pedersen, A Comparative Study on Feature Selection in Text Categorization, in the 14th International Conference on Machine Learning, pp. 412-420, 1997.
[15] C. J. van Rijsbergen, Information Retrieval, Butterworths, London, 1979.
[16] Chris Buckley and Alan F. Lewit, Optimization of Inverted Vector Searches, SIGIR '85, pp. 97-110, 1985.
[17] Daphne Koller and Mehran Sahami, Hierarchically Classifying Documents Using Very Few Words, Proceedings of the 14th International Conference on Machine Learning, Nashville, Tennessee, July 1997, pp. 170-178.
[18] T. Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, in International Conference on Machine Learning, 1997.
[19] K. Lang, NewsWeeder: Learning to Filter Netnews, International Conference on Machine Learning, 1995. http://anther.learning.cs.cmu.edu/ml95.ps
[20] S. P. Harter, A Probabilistic Approach to Automatic Keyword Indexing, Journal of the American Society for Information Science, July-August 1975.

6. Reference (3)
Bibliography
[21] Scott Deerwester et al., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, Vol. 41, No. 6, pp. 391-407, 1990.
[22] S. T. Dumais, Improving the Retrieval of Information from External Sources, Behavior Research Methods, Instruments and Computers, Vol. 23, No. 2, pp. 229-236, 1991.
[23] T. Yavuz and H. A. Guvenir, Application of k-Nearest Neighbor on Feature Projections Classifier to Text Categorization, 1998.
[24] Eui-Hong Han and George Karypis, Centroid-Based Document Classification: Analysis & Experimental Results, in European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), 2000.
[25] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[26] Ji He, Ah-Hwee Tan, and Chew-Lim Tan, A Comparative Study on Chinese Text Categorization Methods, PRICAI 2000 Workshop on Text and Web Mining, Melbourne, pp. 24-35, August 2000. (http://textmining.krdl.org.sg/PRICAI2000/text-categorization.pdf)
[27] Erik Wiener, Jan O. Pedersen, and Andreas S. Weigend, A Neural Network Approach to Topic Spotting, in Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, pp. 22-34, 1993.
[28] V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, and J. Allan, in Proceedings of the KDD 2000 Conference, pp. 37-44.
[29] Peter Willett, Recent Trends in Hierarchic Document Clustering: A Critical Review, Information Processing & Management, Vol. 24, No. 5, pp. 577-597, 1988.

6. Reference (4)
Bibliography
[30] J. Larocca Neto, A. D. Santos, C. A. A. Kaestner, and A. A. Freitas, Document Clustering and Text Summarization, Proc. 4th Int. Conf. on Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), pp. 41-55, London: The Practical Application Company, 2000.
[31] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar, and Bamshad Mobasher, Clustering Based On Association Rule Hypergraphs, SIGMOD'97 Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.
[32] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore, Partitioning-Based Clustering for Web Document Categorization, Decision Support Systems Journal, Vol. 27, No. 3, pp. 329-341, 1999.
[34] George Karypis, Rajat Aggarwal, Vipin Kumar, and Shashi Shekhar, Multilevel Hypergraph Partitioning: Applications in VLSI Domain, in Proceedings of the Design Automation Conference, 1997.
[35] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar, and Bamshad Mobasher, Clustering in a High-Dimensional Space Using Hypergraph Models, Technical Report #97-019. http://www-users.cs.umn.edu/~han/
[36] Daniel S. Tkach, Information Mining with the IBM Intelligent Miner Family, IBM White Paper, February 1998.