Mika Klemettinen and Pirjo Moen
University of Helsinki/Dept of CS, Autumn 2001
Course on Data Mining (581550-4)

[Course schedule diagram. Topics: Intro/Ass. Rules, Episodes, Text Mining, Clustering, KDD Process, Appl./Summary, Home Exam. Dates: 24./26.10., 30.10., 7.11., 14.11., 21.11., 28.11.]

Today 07.11.2001
• Today's subject:
  – Text Mining, focus on maximal frequent phrases or maximal frequent sequences (MaxFreq)
• Next week's program:
  – Lecture: Clustering, Classification, Similarity
  – Exercise: Text Mining
  – Seminar: Text Mining

• Text Mining Background
• What is Text Mining?
• MaxFreq Sequences
• MaxFreq Algorithms
• MaxFreq Experiments

Text Databases and Information Retrieval
• Text databases (document databases)
  – Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, etc.
• Information retrieval (IR)
  – Information is organized into (a large number of) documents
  – Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

Basic Measures for Text Retrieval
[Figure: Venn diagram of all documents, showing the Relevant set, the Retrieved set, and their overlap (Relevant & Retrieved)]
• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)
  $precision = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Retrieved\}|}$
• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
  $recall = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Relevant\}|}$

Keyword/Similarity-Based Retrieval
• A document is represented by a string, which can be identified by a set of keywords
• Find similar documents based on a set of common keywords
• The answer should be ranked by degree of relevance, based on the nearness of the keywords, the relative frequency of the keywords, etc.
• In the following, some basic techniques related to preprocessing and retrieval are briefly mentioned

Keyword/Similarity-Based Retrieval
• Basic techniques (1): Remove irrelevant words with a stop list
  – Set of words that are deemed "irrelevant", even though they may appear frequently
  – E.g., a, the, of, for, with, etc.
  – Stop lists may vary when the document set varies
• Basic techniques (2): Take the basic forms of words with word stemming
  – Several words are small syntactic variants of each other since they share a common word stem (basic form)
  – E.g., drug, drugs, drugged

Keyword/Similarity-Based Retrieval
• Basic techniques (3): Count the occurrences of terms into a term frequency table
  – Each entry frequent_table(i, j) = number of occurrences of the word t_i in document d_j (or just "0" or "1")
• Basic techniques (4): Similarity metrics: measure the closeness of a document to a query (a set of keywords)
  – Cosine similarity: $sim(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1| \, |v_2|}$
  – Relative term occurrences
• This is all nice to know, but where is the text mining and how does it relate to this?

What is Text Mining?
• Data mining in text: find something useful and surprising from a text collection
• Text mining vs. information retrieval is like data mining vs. database queries

Different Views on Text
• For example, we might have the following text:
  Documents are an interesting application field for data mining techniques.
• Remember the market basket data?
  – The text can then be considered as a shopping transaction, i.e., a row in the database
  – The words occurring in the text can be considered as the items bought

  Transaction ID   Items Bought
  100              A, B, C
  200              A, C

  Document ID      Words occurring
  100              an, application, ...
  200              ...
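The two views of a document met so far (a term frequency vector for retrieval, and a transaction of items for mining) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list is the one from the slides plus "are" (a hypothetical addition), no stemming is applied, and the helper names `terms` and `cosine` are made up for this example.

```python
import math
from collections import Counter

# Stop list from the slides, plus "are" as a hypothetical addition for this sentence.
STOP_WORDS = {"a", "an", "the", "of", "for", "with", "are"}

def terms(text):
    """Lower-case the words, strip trailing punctuation, and drop stop words."""
    words = (w.strip(".,").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def cosine(v1, v2):
    """Cosine similarity of two term frequency vectors stored as Counters."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

doc = "Documents are an interesting application field for data mining techniques."
tf = Counter(terms(doc))                # one column of the term frequency table
query = Counter(terms("data mining"))   # a query is a small keyword vector
transaction = set(tf)                   # market basket view: the words are the items

print(round(cosine(tf, query), 3))
print(sorted(transaction))
```

The same `Counter` serves both purposes: its counts feed the similarity measure, while its key set is the "transaction" that an association mining algorithm would see.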
Different Views on Text
• Recall the event sequence from episode rules:
  [Figure: an event sequence with events A, B, C, D placed on a time axis from 0 to 90]
• Now we can consider the text as a sequence of words!
  [Figure: the words "Documents are an interesting application field for data mining techniques" placed on a position axis from 0 to 11]

Text Preprocessing
• So, suppose that we have the following example text:
  Documents are an interesting application field for data mining techniques.
• To this text, we might apply the following preprocessing operations:
  1. Find the basic forms of the words (stemming)
  2. Use a stop list to remove uninteresting words
  3. Select, e.g., the wanted word classes (e.g., nouns)

Text Preprocessing
(Documents, 1) (are, 2) (an, 3) (interesting, 4) (application, 5) (field, 6) (for, 7) (data, 8) (mining, 9) (techniques, 10) (., 11)
=>
(document_N_PL, 1) (be_V_PRES_PL, 2) (an_DET, 3) (interesting_A_POS, 4) (application_N_SG, 5) (field_N_SG, 6) (for_PP, 7) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10) (STOP, 11)

Morphological information: N = noun, PL = plural, V = verb, PRES = present form, DET = determiner, A = adjective, POS = positive, SG = singular, PP = preposition

Text Preprocessing
(document_N_PL, 1) (be_V_PRES_PL, 2) (an_DET, 3) (interesting_A_POS, 4) (application_N_SG, 5) (field_N_SG, 6) (for_PP, 7) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10) (STOP, 11)
=>
(document_N_PL, 1) (interesting_A_POS, 4) (application_N_SG, 5) (field_N_SG, 6) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10)

Text Preprocessing
(document_N_PL, 1) (interesting_A_POS, 4) (application_N_SG, 5) (field_N_SG, 6) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10)
=>
(document_N_PL, 1) (application_N_SG, 5) (field_N_SG, 6) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10)

Text Preprocessing
• Now we have a preprocessed sequence of words
  [Figure: the words document, application, field, data, mining, technique placed on a position axis from 0 to 11, at their original positions]
• We might also just throw away the stop words etc., and put the remaining words in consecutive "time slots" (1, 2, 3, …)
• Preprocessing can be applied to transaction-based text data in a similar fashion

Types of Text Mining
• Keyword (or term) based association analysis
• Automatic document classification
• Similarity detection
  – Cluster documents by a common author
  – Cluster documents containing information from a common source
• Sequence analysis: predicting a recurring event, discovering trends
• Anomaly detection: find information that violates usual patterns

Term-Based Assoc. Analysis
• Collect sets of keywords or terms that occur frequently together and then find the association relationships among them
• First preprocess the text data by parsing, stemming, removing stopwords, etc.
• Then invoke association mining algorithms
  – Consider each document as a transaction
  – View a set of keywords/terms in the document as a set of items in the transaction

Term-Based Assoc. Analysis
• For example, we might find frequent sets such as:
  2%: application, field
  5%: data, mining
• …and association rules like:
  application => field (2%, 52%)
  data => mining (5%, 75%)
• These kinds of frequent sets etc. might help in expanding user queries or in describing the documents better than simple keywords
• Sometimes it would be nice to discover new descriptive phrases directly from the actual text - what then?

Term-Based Episode Analysis
• Now, we want to find words/terms that occur frequently close to each other in the actual text
• Take the preprocessed sequential text data and then find relationships among the words/terms by invoking episode mining algorithms (WINEPI or MINEPI)
• For example, we might find frequent episodes such as:
  data, mining, knowledge, discovery
• …and MINEPI style episode rules like:
  data, mining [4] => knowledge, discovery [8] (2%, 81%)

Problems
• Quite often, it could be interesting to try to find very long descriptive phrases to describe the documents…
• …but discovering long descriptive phrases might be tedious, especially if you have to create all the shorter phrases in order to get the longest ones
• One answer can be maximal frequent sequences or maximal frequent phrases (note: by the concepts "sequence" and "phrase" we mean basically the same)
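The two percentages attached to each association rule above can be computed directly once every document is treated as a transaction. The sketch below assumes that the pair reads as (frequency, confidence), as is usual for association rules; the four tiny "documents" and the function names are hypothetical and serve only to make the arithmetic visible.

```python
def support(itemset, transactions):
    """Fraction of transactions (documents) that contain every term of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the documents containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# Hypothetical documents, already preprocessed into sets of terms.
documents = [
    {"data", "mining", "application"},
    {"data", "mining", "field"},
    {"data", "analysis"},
    {"application", "field"},
]

print(support({"data", "mining"}, documents))       # 2 of 4 documents
print(confidence({"data"}, {"mining"}, documents))  # 2 of the 3 documents with "data"
```

Note that a set of terms ignores word order and distance, which is exactly the limitation that motivates the episode and phrase based views.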
Frequent Word Sequences
• Assume: S is a set of documents; each document consists of a sequence of words
• A phrase is a sequence of words
• A sequence p occurs in a document d if all the words of p occur in d, in the same order as in p
• A sequence p is frequent in S if p occurs in at least σ documents of S, where σ is a given frequency threshold
• A maximal gap n can be given: the original locations of any two consecutive words of a sequence can have at most n words between them

Frequent Word Sequences
1: (The,70) (Congress,71) (subcommittee,72) (backed,73) (away,74) (from,75) (mandating,76) (specific,77) (retaliation,78) (against,79) (foreign,80) (countries,81) (for,82) (unfair,83) (foreign,84) (trade,85) (practices,86)
2: (He,105) (urged,106) (Congress,107) (to,108) (reject,109) (provisions,110) (that,111) (would,112) (mandate,113) (U.S.,114) (retaliation,115) (against,116) (foreign,117) (unfair,118) (trade,119) (practices,120)
3: (Washington,407) (charged,408) (France,409) (West,410) (Germany,411) (the,412) (U.K.,413) (Spain,414) (and,415) (the,416) (EC,417) (Commission,418) (with,419) (unfair,420) (practices,421) (on,422) (behalf,423) (of,424) (Airbus,425)

Frequent Word Sequences
Examples from the previous slide:
• The phrase (retaliation, against, foreign, unfair, trade, practices) occurs in the first two documents, in the locations (78, 79, 80, 83, 85, 86) and (115, 116, 117, 118, 119, 120).
• The phrase (unfair, practices) occurs in all the documents, namely in the locations (83, 86), (118, 120), and (420, 421).
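The occurrence definition with a maximal gap can be checked mechanically against the three example documents. The following sketch is an illustration, not the lecture's implementation: `numbered` and `find_occurrence` are hypothetical helper names, words are lower-cased, and a maximal gap of 2 is assumed (the value used elsewhere in the slides).

```python
def numbered(text, start):
    """Turn a string into (word, position) pairs, numbering from start."""
    return [(w, start + i) for i, w in enumerate(text.split())]

def find_occurrence(phrase, doc, max_gap, first=0, prev_pos=None):
    """Return the positions of one occurrence of phrase in doc, or None.

    Consecutive phrase words may have at most max_gap other words between
    them. The search backtracks, so a repeated word (such as "foreign" in
    document 1) cannot lead it into a dead end.
    """
    if not phrase:
        return ()
    for idx in range(first, len(doc)):
        word, pos = doc[idx]
        if prev_pos is not None and pos - prev_pos > max_gap + 1:
            break  # too far from the previously matched word
        if word == phrase[0]:
            rest = find_occurrence(phrase[1:], doc, max_gap, idx + 1, pos)
            if rest is not None:
                return (pos,) + rest
    return None

docs = [
    numbered("the congress subcommittee backed away from mandating specific "
             "retaliation against foreign countries for unfair foreign trade practices", 70),
    numbered("he urged congress to reject provisions that would mandate u.s. "
             "retaliation against foreign unfair trade practices", 105),
    numbered("washington charged france west germany the u.k. spain and the ec "
             "commission with unfair practices on behalf of airbus", 407),
]

long_phrase = ("retaliation", "against", "foreign", "unfair", "trade", "practices")
short_phrase = ("unfair", "practices")
for phrase in (long_phrase, short_phrase):
    print(phrase, [find_occurrence(phrase, d, max_gap=2) for d in docs])
```

The frequency of a phrase is then simply the number of documents for which `find_occurrence` does not return `None`.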
Note that we count only one occurrence of a sequence per document!

Maximal Frequent Sequences
• Maximal frequent sequence:
  – A sequence p is a maximal frequent (sub)sequence in S if there does not exist any other sequence p' in S such that p is a subsequence of p' and p' is frequent in S
• In short, a maximal frequent sequence is a sequence of words that
  – appears frequently in the document collection
  – is not included in another, longer frequent sequence

Maximal Frequent Sequences
• Usually, it makes sense to concentrate on the maximal frequent sequences or maximal frequent phrases
  – Subsequences or subphrases usually do not have a meaning of their own
  – However, sometimes subsequences or subphrases may also be interesting, if they are much more frequent

A Maximal Seq. with Subseq.s
• Example (maximal sequence + subsequences):
  dow jones industrial average
    dow jones
    dow industrial
    dow average
    jones industrial
    jones average
    industrial average
    dow jones industrial
    dow jones average
    jones industrial average

Examples of Meaningful Subseqs
• Interesting subsequences can be distinguished by the characteristic that they are more frequent than the maximal sequences
  – The subsequence has its OWN occurrences in the text
  – The subsequence might be shared by MANY maximal sequences
  – A TOO FREQUENT subsequence might NOT be interesting

Examples of Meaningful Subseqs
• Maximal sequences:
  prime minister Lionel Jospin
  prime minister Paavo Lipponen
• Subsequences:
  prime minister
  Lionel Jospin
  Paavo Lipponen
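Given a set of sequences that are already known to be frequent, picking out the maximal ones is a pure subsequence test. The sketch below uses the prime minister example; the function names are hypothetical, and the input set is assumed to contain only frequent sequences.

```python
def is_subsequence(p, q):
    """True if all the words of p occur in q in the same order (gaps allowed)."""
    remaining = iter(q)
    return all(word in remaining for word in p)

def maximal(frequent):
    """Keep the sequences that are not contained in another frequent sequence."""
    return [p for p in frequent
            if not any(p != q and is_subsequence(p, q) for q in frequent)]

frequent = [
    ("prime", "minister", "lionel", "jospin"),
    ("prime", "minister", "paavo", "lipponen"),
    ("prime", "minister"),
    ("lionel", "jospin"),
    ("paavo", "lipponen"),
]
print(maximal(frequent))
```

Here "prime minister" is dropped as non-maximal, although it is shared by both maximal sequences and so is the kind of subsequence that may still be worth reporting separately.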
Discovery of Frequent Sequences
• The frequency of a sequence cannot be decided locally: all the instances in the collection have to be counted
• However: already a document of length 20 (words) contains over one million sequences
• Only a small fraction of the sequences are frequent
  – There are many sequences that have only very few occurrences

Naïve Discovery Approach
• Basic idea: the "standard" bottom-up approach
  – Collect all the pairs from the documents, count them, and select the frequent ones
  – Build sequences of length p+1 from frequent sequences of length p
  – Select sequences that are frequent
  – Iterate
• Finally: select the maximal sequences (by checking for each phrase whether it is contained in some other phrase)

Problems in the Naïve Approach
• Problem: frequent sequences in text can be long
  – In our experiments: longest phrase 22 words (Reuters-21578 newswire data, 19000 documents, frequency threshold 15, max gap 2)
  – Processing all the subphrases of all lengths is not possible
  – A straightforward bottom-up approach does not work
  – Restricting the length would produce a large number of slightly differing subphrases of any phrase that is longer than the threshold

Combining Bottom-Up and Greedy Approaches: MaxFreq
• Initial phase: first, frequent pairs are collected
• Discovery phase: longer sequences are constructed from shorter sequences (k-grams) as in the bottom-up approach
• Expansion step: maximal sequences are discovered directly, starting from a k-gram that is not a subsequence of any known maximal sequence

Combining Bottom-Up and Greedy Approaches: MaxFreq
• Each maximal sequence has at least one unique subsequence that distinguishes it from the other maximal sequences. A maximal sequence is discovered, at the latest, on the level k, where k is the length of the shortest unique subsequence.
• Pruning step: grams that cannot be used to construct any new maximal sequences are pruned away after each level, before the length of the grams is increased
• Let's take a closer look at these phases and steps!

Algorithm: Initial Phase
Input: a set of documents S, a frequency threshold, and a maximal gap
Output: a gram set Grams2 containing the frequent pairs

For all the documents d ∈ S
  collect all the ordered pairs of words (A, B) within d such that A and B occur in this order (wrt maximal gap)
Grams2 = all the ordered pairs that are frequent in the set S (wrt frequency threshold)
Return Grams2

Algorithm: Initial Phase
Document 1: (A,11) (B,12) (C,13) (D,14) (E,15)
Document 2: (P,21) (B,22) (C,23) (D,24) (K,25)
Document 3: (A,31) (B,32) (C,33) (H,34) (D,35) (K,36)
Document 4: (P,41) (B,42) (C,43) (D,44) (E,45) (N,46)
Document 5: (P,51) (B,52) (C,53) (K,54) (E,55) (L,56) (M,57)
Document 6: (R,61) (H,62) (K,63) (L,64) (M,65)

Algorithm: Initial Phase
• The following pairs of words can be found (with max gap = 2), each listed with the number of documents it occurs in. E.g., AB occurs in doc 1 ([11-12]) and in doc 3 ([31-32]), while AE does not qualify as a pair, since the distance in [11-15] exceeds the max gap.
  AB 2   AC 2   AD 1   AH 1   BC 5   BD 4
  BE 3   BH 1   BK 2   CD 4   CE 3   CH 1
  CK 3   CL 1   CN 1   DE 2   DK 2   DN 1
  EL 1   EM 1   EN 1   HD 1   HK 2   HL 1
  HM 1   KE 1   KL 2   KM 2   LM 2   PB 3
  PC 3   PD 2   PK 1   RH 1   RK 1   RL 1

Algorithm: Discovery Phase
Input: a gram set Grams2 containing the frequent pairs (A, B)
Output: the set Max of maximal frequent phrases

k := 2; Max := ∅
While Grams_k is not empty
  For all grams g ∈ Grams_k
    If g is not a subphrase of any m ∈ Max
      If g is frequent
        max := Expand(g)
        Max := Max ∪ {max}
        If max = g
          Remove g from Grams_k
      Else
        Remove g from Grams_k
  Prune(Grams_k)
  Join the grams of Grams_k to form Grams_k+1
  k := k + 1
Return Max

Algorithm: Expansion Step
Input: a phrase p
Output: a maximal frequent phrase p' such that p is a subphrase of p'

Repeat
  Let l be the length of the sequence p.
  Find a sequence p' such that the length of p' is l+1, and p is a subsequence of p'.
  If p' is frequent
    p := p'
Until there exists no frequent p'
Return p

Note! All the possibilities to expand have to be checked: tail, front, middle!
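The initial phase and the expansion step can be tried out on the six example documents. What follows is a minimal sketch rather than the lecture's implementation: every word is a single letter, positions are taken to be consecutive, the frequency threshold is assumed to be 2 (the value consistent with the pair counts in the example), and candidate expansions are generated by brute force over the whole vocabulary. The names `occurs`, `frequency`, `frequent_pairs` and `expand` are hypothetical.

```python
DOCS = [list(d) for d in ("ABCDE", "PBCDK", "ABCHDK", "PBCDEN", "PBCKELM", "RHKLM")]

def occurs(seq, doc, max_gap):
    """True if seq occurs in doc with at most max_gap words between neighbours."""
    ends = {i for i, w in enumerate(doc) if w == seq[0]}
    for word in seq[1:]:
        # Positions where the prefix matched so far can be extended by `word`.
        ends = {j for j, w in enumerate(doc)
                if w == word and any(0 < j - i <= max_gap + 1 for i in ends)}
    return bool(ends)

def frequency(seq, docs, max_gap):
    """Number of documents containing seq (one occurrence per document counts)."""
    return sum(occurs(seq, d, max_gap) for d in docs)

def frequent_pairs(docs, min_freq, max_gap):
    """Initial phase: collect the ordered pairs within the gap, keep the frequent ones."""
    counts = {}
    for doc in docs:
        pairs = {(doc[i], doc[j])
                 for i in range(len(doc))
                 for j in range(i + 1, min(i + max_gap + 2, len(doc)))}
        for pair in pairs:
            counts[pair] = counts.get(pair, 0) + 1
    return {p: c for p, c in counts.items() if c >= min_freq}

def expand(seq, docs, min_freq, max_gap):
    """Expansion step: grow seq by one word (front, middle or tail) while frequent."""
    vocabulary = sorted({w for d in docs for w in d})
    grown = True
    while grown:
        grown = False
        for pos in range(len(seq) + 1):
            for word in vocabulary:
                candidate = seq[:pos] + (word,) + seq[pos:]
                if frequency(candidate, docs, max_gap) >= min_freq:
                    seq, grown = candidate, True
                    break
            if grown:
                break
    return seq

grams2 = frequent_pairs(DOCS, min_freq=2, max_gap=2)
print(len(grams2), grams2[("B", "C")])
print("".join(expand(("A", "B"), DOCS, min_freq=2, max_gap=2)))
```

Which maximal phrase a gram is expanded into can depend on the order in which candidates are tried; for AB every order ends in ABCD, but BE, for instance, may grow into either BCDE or PBCE.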
Algorithm: Expansion Step
1: (A,11) (B,12) (C,13) (D,14) (E,15)
2: (P,21) (B,22) (C,23) (D,24) (K,25)
3: (A,31) (B,32) (C,33) (H,34) (D,35) (K,36)
4: (P,41) (B,42) (C,43) (D,44) (E,45) (N,46)
5: (P,51) (B,52) (C,53) (K,54) (E,55) (L,56) (M,57)
6: (R,61) (H,62) (K,63) (L,64) (M,65)

Freq: AB AC BC BD BE BK CD CE CK DE DK HK KL KM LM PB PC PD

Exp:
  AB => ABC => ABCD (ABCDE and ABCDK are not frequent)
  BE => BCE => BCDE

Example
• Maximal frequent sequences after the first expansion step:
  AB => ABC => ABCD
  BE => BCE => BCDE
  BK => BDK => BCDK
  KL => KLM
  PD => PBD => PBCD
  HK

Example
• 3-grams after join:
  ABC ABD ABE ABK ACD ACE ACK BCD BCE BCK BDE BDK CDE CDK PBC PBD PBE PBK PCD PCE PCK PDE PDK BKL BKM CKL CKM DKL DKM KLM
  (on the original slide, italics + underlining marked the grams that belong to an already found maximal phrase)
• New maximal frequent sequences:
  PBE => PBCE
  PBK => PBCK

Example
• 3-grams after the second expansion step:
  ABC ABD ACD BCD BCE BCK BDE BDK CDE CDK PBC PBD PBE PBK PCD PCE PCK
• 4-grams after join:
  ABCD ABCE ABCK ABDE ABDK ACDE ACDK BCDE BCDK PBCD PBCE PBCK PBDE PBDK PCDE PCDK

Algorithm: Pruning Step
• After the expansion step, every gram is a subsequence of some maximal sequence
• For any other maximal sequence m not found yet: m has to contain grams from two or more other maximal sequences, or from one sequence m' in a different order than in m'
• For each gram g: check if g can join grams of maximal sequences in a new way => extract sequences that are frequent and not yet included in any maximal sequence; mark the grams
• Remove the grams that are not marked

Pruning After the 1st Exp. Step
• BC: ABCD, BCDE, BCDK, PBCD
• Prefixes: A, P
• Suffixes: D, DE, DK
• Check the strings ABCDE, ABCDK, PBCDE, PBCDK
• Is there a subsequence that is frequent and not included in any maximal sequence?
  ABCDE
    - ABC
      - ABCD (maximal)
      - ABCE (not frequent)
    - BCD
      - BCDE (maximal)
      - ABCD (known)
    - BCE
      - ABCE (known)

Pruning After the 1st Exp. Step
  PBCDE
    - PBC
      - PBCD (maximal)
      - PBCE (frequent, not in maximal)
    - BCD
      - BCDE (maximal)
      - PBCD (known)
    - BCE
      - PBCE (known)
  PBCDK
    - PBC
      - PBCD (maximal)
      - PBCK (frequent, not in maximal)
    ...
Marked: PB, BC, CE, CK
All the other grams are removed.

Algorithm: Implementation
Data structures:
• Table: for each pair, its exact occurrences in the text
• Table: for each prefix, the grams that have this prefix
• Table: for each suffix, the grams that have this suffix
• Table: for each pair, the indexes of the maximal sequences within which it is a subsequence
• An array of maximal sequences
• Document identifiers are attached to the grams and occurrences

Testing Frequency
• The occurrences of frequent pairs are stored:
  AB: [11-12][31-32]
  AC: [11-13][31-33]
  BC: [12-13][22-23][32-33][42-43][52-53]
• The occurrences of longer sequences are computed from the occurrences of pairs
• All the occurrences computed are stored
  – The computation for ABC may later help to compute the frequency of ABCD

Testing Frequency
  – ABCD can only occur in places where ABC has occurred
• NOTE:
  – Already calculated occurrences can be used when adding elements to the front or to the tail
  – ABCD may occur in more documents than ABD, since in ABD the distance between B and D might be greater than the maximal gap

Experiments
• Data: Reuters-21578 newswire collection (year 1987)
• Around 19000 documents (average length 135 words)
• Originally 2.5 million words, after stopword pruning (400 stopwords) 1.3 million words
  – Stopwords: single letters, pronouns, prepositions, some abbreviations (e.g., pct, dlr, cts, shr), etc.
• 50,000 distinct words (stemming was not used)
• Frequency threshold 15, max gap 2 (stopwords pruned)
• Prototype implementation in Perl
• Sun Enterprise 450, with 1 GB of main memory

Experiments
• Numbers of maximal frequent sequences of different lengths (frequency threshold 15):

  Len    2      3      4    5    6   7   8  9  10  11  12
  f:15   7,664  1,320  353  146  65  17  8  4  13  12  13

  Len    13  14  15  16  17  18  19  20  21  22  23
  f:15   5   -   1   1   -   1   -   -   -   2   -

Examples of MaxFreq Sequences
• Solid, established phrases:
  bundesbank president karl otto poehl
  european monetary system ems
• Verb phrases:
  bank england provided money market assistance
  board declared stock split payable april
  boost domestic demand
• Short phrases:
  expects higher
  expects complete

Phrases Extracted from "Doc A"
• The following phrases are extracted from one document belonging to the Reuters data set
• The phrases contain both maximal phrases and subphrases that are more frequent than the maximal ones
• The document describes a situation where the persons monitoring the operation of a nuclear power plant were caught asleep during their shift, and the Nuclear Regulatory Commission ordered the power plant to be closed
• As you can see, the phrases do not actually reveal what happened; they just tell about the subject matter

Phrases Extracted from "Doc A"
(a leading "-" marks a subphrase of the maximal phrase listed above it)
  power station 11
  immediately after 26
  co operations 11
  effective april 63
  company's operations 20
  unit nuclear 12
  unit power 16
  early week 42
  senior management 28
  nuclear regulatory commission 14
  -regulatory commission 34
  nuclear power plant 26
  -power plant 55
  -nuclear power 42
  -nuclear plant 42
  electric co 143

Phrases Extracted from "Doc B"
• Maximal frequent sequence (frequency = 15):
  federal reserve entered u.s. government securities market arrange repurchase agreements fed dealers federal funds trading fed began temporary supply reserves banking system
• One occurrence of the phrase:
  The Federal Reserve entered the U.S. Government securities market to arrange 1.5 billion dlrs of customer repurchase agreements, a Fed spokesman said. Dealers said Federal funds were trading at 6-3/16 pct when the Fed began its temporary and indirect supply of reserves to the banking system.
Phrases Extracted from "Doc B"
• The frequency of the sequence is 13, and it contains the following subsequences that are more frequent:
  arrange repurchase 23
  fed federal 25
  fed funds 23
  fed temporary 23
  market arrange 23
  market trading 41
  u.s. government 160
  u.s. dealers 32
  u.s. trading 35
  u.s. supply 26
  reserves system 36
  securities arrange 23
  securities trading 32
  government arrange 23
  banking system 66
  trading fed 22
  trading system 25
  reserve u.s. 43
  supply reserves 36
  supply system 25
  dealers federal 30
  dealers funds 27
  dealers trading 33
  federal u.s. 28
  federal trading 30
  funds trading 43
  reserve u.s. government 31
  reserves banking system 25

Use of Frequent Phrases
• Goal: rich computational representation for documents
  – Feature sets for analysis
  – Human-readable description
• Applications
  – Key phrases in information retrieval
  – Overview of the collection: clustering
  – Summary of the content
  – Automatic generation of hypertext links
  – Associations between documents
  – Browsing of a document collection

Use of Frequent Phrases
• Example: suppose that the query "agricultur*" has been made
• The user has been given a "middle-level list" of phrases that tell something more about the context around the words in the query

Use of Frequent Phrases
  agricultural exports
  agricultural production
  agricultural products
  agricultural stabilization conservation service
  agricultural subsidies
  agricultural trade
  u.s. agriculture
  agriculture department usda
  agriculture department wheat
  agriculture minister
  agriculture officials
  agriculture undersecretary daniel amstutz
  common agricultural policy
  ec agriculture ministers
  european community agriculture

Use of Frequent Phrases
• Suppose that the user is interested in the subject "agricultural subsidies" and selects it from the list
• As an answer to the query, one might now return all the sentences containing the phrase "agricultural subsidies"
• Alternatively, the user might want to see directly the whole documents in which the phrase appears, or the other phrases that occur together with the phrase "agricultural subsidies" in the documents

Summary
• Text mining:
  – The "roots" are in text databases and information retrieval
  – Data mining techniques might complement or help the existing database/information retrieval techniques
• In this lecture, only a few methods based on association and episode style algorithms were given:
  – Naïve approaches are applicable to some extent; maximal frequent phrases might be useful in some cases
  – Many clustering, classification and similarity techniques, which will be presented in the next lectures, are useful for going a few steps further

References
• Helena Ahonen-Myka: Finding All Frequent Maximal Sequences in Text. In ICML-99 Workshop on Machine Learning in Text Data Analysis, p. 11-17, J. Stefan Institute, Ljubljana 1999.
  See electronic version at http://www.cs.helsinki.fi/u/hahonen/ham_icml99.ps
• Han, J., Kamber, M.: Data Mining: Concepts and Techniques, Section 9.5 of the book (also available at http://www.cs.sfu.ca/~han/DM_Book.html).
• Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and Inkeri Verkamo: Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections. In Advances in Digital Libraries'98, April 1998.
  See electronic version at http://wwwdb.informatik.uni-tuebingen.de/forschung/papers/adl98.ps

Course Organization: Next Week
• Lecture 14.11.: Clustering, Classification, Similarity
  – Pirjo gives the lecture
• Exercise 15.11.: Text mining
  – Pirjo takes care of you! :-)
• Seminar 9.11.: Text mining
  – Mika gives the lecture
  – 2 group presentations (groups 5-6)

Seminar Presentations/Groups 5-6
• Feldman et al.
  R. Feldman et al.: "Knowledge Management: A Text Mining Approach", PAKM 1998.
• Lent, Agrawal, Srikant
  B. Lent, R. Agrawal, R. Srikant: "Discovering Trends in Text Databases", KDD 1997.

Seminar Presentations
• Requirements:
  – Articles are given on the previous week's Wed
  – Presentation document (around 3-5 printed pages) due by the start of the seminar:
    • Can be either an HTML page or a printable document in PostScript/PDF format
  – 30 minutes of presentation
  – 5-15 minutes of discussion
  – Active participation
• Remember:
  – Try to understand the "message" in the article
  – Try to present the basic ideas as clearly as possible, use examples
  – Do not present detailed mathematics or algorithms
  – Test: do you understand your own presentation?
  – In the presentation, use PowerPoint or conventional slides

Text Mining
Thank you for your attention!
Thanks to Helena Ahonen-Myka and Jiawei Han for their slides which greatly helped in preparing this lecture!