Download Introduction to Text Mining - Indian Statistical Institute

Introduction to Text Mining
Mandar Mitra
Indian Statistical Institute
M. Mitra (ISI) Text Mining 1 / 29

Outline
1 Preliminaries
2 Preprocessing
3 Mining word associations
4 Opinion mining

What is Text Mining?
Strict definition: the nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data.
OR
Loose definition: the science of extracting useful information from large [textual] data sets.
Old wine in a new bottle? Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition).

Why is it interesting?
- Growth of Web / electronic information sources
- Multidisciplinary nature
- E-commerce potential: “Electronic commerce is emerging as the killer domain for data-mining technology” (Ronny Kohavi)

Data sources
- World Wide Web
  - unstructured and semi-structured text
  - “deep” web: pages that do not exist until they are created dynamically as the result of a specific search
  - social networks
- Intranet
  - internal correspondence, memos, presentations
  - white papers, technical reports
  - customer email, customer forums, product reviews
- News wires
- ...
No structure / general schema / tabular form fits text.

Indexing
Any text item (“document”) is represented as a list of terms and associated weights:
$D = (\langle t_1, w_1 \rangle, \ldots, \langle t_n, w_n \rangle)$
- Term = keyword or content descriptor
- Weight = measure of the importance of a term in representing the information contained in the document

Indexing: tokenization
Identify individual words.
Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.
⇓
Sachin Tendulkar made a tearful but ...

Indexing: stopword removal
Eliminate common words, e.g. and, of, the.

Indexing: stemming
Reduce words to a common root, e.g.
resignation, resigned, resigns → resign
analysis, analyze, analyzing → analy
Use standard algorithms (Porter).

Indexing: thesaurus
Find synonyms for words in the document.

Indexing: phrases
Find multi-word terms, e.g. computer science, data mining. Use syntactic/linguistic methods or “statistical” methods.
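The tokenization, stopword-removal, and stemming steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list is a tiny hand-picked sample, and `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter's (it is not the Porter algorithm). All function names are hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "as", "at", "but", "his", "of", "on", "the", "to"}

# Naive suffix list standing in for a real stemmer such as Porter's.
SUFFIXES = ("ation", "ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize, remove stopwords, then stem what is left."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]
```

With this sketch, `resignation`, `resigned`, and `resigns` all reduce to `resign`, as in the stemming example; a production system would use a tested stemmer and a curated stopword list instead.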
Indexing: named entities
Identify names of people, organizations, places; dates; monetary or other amounts, etc.
Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.

Indexing: Term Weights
- Term frequency (tf): repeated words are strongly related to content
- Inverse document frequency (idf): an uncommon term is more important
- Normalization by document length
  - long docs contain many distinct words
  - long docs contain the same word many times
  - term weights for long documents should be reduced
  - use # bytes, # distinct words, Euclidean length, etc.
Weight = tf × idf / normalization

Commonly used weighting schemes
Pivoted normalization [Singhal et al., SIGIR 96]:
$$\frac{\dfrac{1 + \log(tf)}{1 + \log(\text{average } tf)} \times \log\left(\dfrac{N}{df}\right)}{(1.0 - slope) \times pivot + slope \times \#\,\text{unique terms}}$$
BM25 (probabilistic model) [Robertson and Zaragoza, FTIR 2009]:
$$\frac{tf \times \log\left(\dfrac{N - df + 0.5}{df + 0.5}\right)}{k_1 \left((1 - b) + b \, \dfrac{dl}{avdl}\right) + tf}$$

Searching
Measure vocabulary overlap between the user query and the documents. Over the terms $t_1, \ldots, t_n$:
$Q = (q_1, \ldots, q_n)$
$D = (d_1, \ldots, d_n)$
$Sim(Q, D) = \vec{Q} \cdot \vec{D} = \sum_i q_i \times d_i$
Use an inverted list (index):
$Term_i \rightarrow (D_{i1}, w_{i1}), \ldots, (D_{ik}, w_{ik})$
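The weighting and search steps above can be illustrated with a minimal sketch. It assumes the simplest instance of "tf × idf / normalization": raw tf, idf = log(N / df), and Euclidean length as the normalizer. It does not implement pivoted normalization or BM25. The function names and the toy input format (a dict mapping document ids to term lists) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """docs: {doc_id: list of terms}. Returns (inverted index, idf table)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in docs.values() for term in set(terms))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    index = defaultdict(list)  # term -> [(doc_id, weight), ...]
    for doc_id, terms in docs.items():
        raw = {t: tf * idf[t] for t, tf in Counter(terms).items()}
        # Euclidean length normalization; guard against an all-zero vector.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        for term, w in raw.items():
            index[term].append((doc_id, w / norm))
    return index, idf


def search(query_terms, index, idf):
    """Score documents by Sim(Q, D) = sum_i q_i * d_i using the postings."""
    scores = defaultdict(float)
    for term, tf in Counter(query_terms).items():
        q_weight = tf * idf.get(term, 0.0)
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Only the postings of the query terms are touched, which is the point of the inverted list: documents sharing no term with the query are never scored.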
Stemming
YASS [Majumder et al., ACM TOIS 25(4), 2007]
- Stemming ≡ grouping morphologically related words together, e.g. { analysis, analyze, analyzing }
- Try clustering
  - distance measure: edit distance, or
    $$D(X, Y) = \begin{cases} \dfrac{n - m + 1}{m} \times \sum_{i=m}^{n} \dfrac{1}{2^{i-m}} & \text{if } m > 0 \\ \infty & \text{otherwise} \end{cases}$$
    In the examples below, $m$ is the position of the first mismatch and $n$ is the last position of the longer string, with positions counted from 0 and the shorter string padded (shown as x).
  - clustering algorithm: hierarchical agglomerative (single link / complete link / average link)

Stemming: distance examples
Position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
X:        a s t r o n o m e r x  x  x  x
Y:        a s t r o n o m i c a  l  l  y
Edit distance = 6
$D = \frac{6}{8} \times \left( \frac{1}{2^0} + \ldots + \frac{1}{2^{13-8}} \right) = 1.4766$

Position: 0 1 2 3 4 5 6 7 8 9
X:        a s t r o n o m e r
Y:        a s t o n i s h x x
Edit distance = 5
$D = \frac{7}{3} \times \left( \frac{1}{2^0} + \ldots + \frac{1}{2^{9-3}} \right) = 4.6302$

Stemming: clustering
[Figure: two-variable clustering illustration. Courtesy: http://espin086.files.wordpress.com/2011/02/2-variable-clustering.png]

Word Relations: motivation
Manual thesauri are:
- general purpose (Roget’s Thesaurus, WordNet): difficult to use for document retrieval
- retrieval-oriented (INSPEC, MeSH): expensive to build and maintain
Construct an automatic thesaurus, based on information about co-occurrence of words in a collection.

Word Relations: associations
- Association: if two terms co-occur within the same paragraph, they constitute an association ⟨term1, term2, assoc. frequency⟩
- Gather data about term associations over a large amount of text
- Refine associations:
  - discard associations with frequency 1
  - discard terms that are associated with too many other terms (people, state, company, etc.)

Word Relations: query expansion
- Each term is represented by a vector of associated terms, $T = (\langle t_1, w_1 \rangle, \ldots, \langle t_n, w_n \rangle)$, so each term acts as a pseudo-document
- Compare the query to the term vectors (instead of document vectors): $Sim(Q, T) = \sum_i wt(q_i) \times wt(t_i)$
- The most “similar” terms are added to the query
- Example: 1986 US Immigration Law; similar terms: illegal immigration, amnesty program, simpson-mazzoli

Word Relations: experimental results
- Data: 500,000 documents (news, computer abstracts, govt. documents); 50 queries
- Baseline average precision: 37%
- Improves by 6 - 30% when the thesaurus is used
- 2 weeks to generate the association data!
- Processing time can be reduced without major loss in performance by using a subset of the document collection

Challenges
- Does a document contain an opinion? In which portion?
  - sites with a review component: easy, e.g. CNET, Amazon, Epinions
  - blogs: harder
- Sentiment classification
  - overall (polarity) / specific
  - free form / grades or stars
  - quotations
- Presentation
  - highlighting
  - aggregation
  - community identification
  - estimating reliability
- Query classification: is the user looking for an opinion?
Opinion Mining: feature-based opinion summarization
- Identify the features of the product that customers have expressed opinions on (called opinion features)
- For each feature, identify how many customer reviews are positive / negative
- Examples:
  - The pictures are very clear.
  - Overall a fantastic, very compact, camera.
  - While light, it will not easily fit in pockets. (HARD!)

Opinion Mining: feature identification
1. POS tagging + chunking: identify nouns, verbs, adjectives, simple noun groups, verb groups
2. Transaction creation for each sentence: item ≡ normalized nouns / noun phrases
3. Association rule mining: all itemsets with > 1% support are candidate frequent features
4. Feature pruning:
   - keep features that have some compact occurrences
   - keep singleton itemsets only if they occur enough times in isolation, e.g. manual vs. manual mode, manual setting

Opinion Mining: sentiment / orientation identification
1. Examine each sentence in the review database
2. If it contains a frequent feature, extract all the adjectives as opinion words
3. For each feature in the sentence, the nearby adjective is recorded as its effective opinion
4. Look up the adjective in a list of adjectives with known orientation, or consult WordNet, where adjectives are arranged in bipolar structures (discard unknowns)

Datasets
- Blog06 (25GB), University of Glasgow: http://ir.dcs.gla.ac.uk/test_collections/access_to_data.htm
- Congressional floor-debate transcripts: http://www.cs.cornell.edu/home/llee/data/convote.html
- Cornell movie-review datasets: http://www.cs.cornell.edu/people/pabo/movie-review-data/
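As a rough illustration of step 3 of feature identification (itemsets with more than 1% support become candidate frequent features), the sketch below counts the support of small itemsets by brute-force enumeration. It is not an Apriori implementation and performs no feature pruning; the function name, the `max_size` cap, and the toy transactions are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_features(transactions, min_support=0.01, max_size=2):
    """Return {itemset: support} for itemsets whose support exceeds min_support.

    transactions: one set of normalized nouns / noun phrases per sentence.
    Itemsets are enumerated by brute force up to max_size items.
    """
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            # Sort so that each itemset has one canonical tuple form.
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items() if c / n > min_support}


# Toy usage: four review sentences reduced to their nouns.
transactions = [
    {"picture", "camera"},
    {"picture"},
    {"battery", "camera"},
    {"picture", "camera"},
]
candidates = frequent_features(transactions, min_support=0.4)
```

On a real review collection the 1% threshold from the slide would be used, and a proper frequent-itemset miner would replace the brute-force enumeration.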
