Course on Data Mining (581550-4): Seminar Meetings

Seminar schedule:
• 02.11. Clustering (M)
• 09.11. KDD Process (M)
• 16.11. Association Rules (P)
• 23.11. Episodes (P)
• 30.11. Text Mining (M)
• Seminar by Mika, Seminar by Pirjo, Home Exam

Today 09.11.2001:
• Rakesh Agrawal and Ramakrishnan Srikant: Mining Sequential Patterns. Int'l Conference on Data Engineering, 1995.
• F. Masseglia, P. Poncelet and M. Teisseire: Incremental Mining of Sequential Patterns in Large Databases. 16èmes Journées Bases de Données Avancées, 2000.

Mining Sequential Patterns
Rakesh Agrawal and Ramakrishnan Srikant, IBM Almaden Research Center, USA
Published in ICDE'95 (Int'l Conf. on Data Engineering)
Summary by Mika Klemettinen, Data Mining course, Autumn 2001, University of Helsinki

• Problem statement:
  • A database D of customer transactions: customer id, transaction time, items purchased
  • Quantities of the items purchased are NOT considered
• Definitions:
  • Itemset: a non-empty set of items, (i1 i2 i3 ...)
  • Sequence: an ordered list of itemsets, <s1 s2 s3 ...>
  • A sequence <a1 a2 ... an> is contained in <b1 b2 ... bm> if there exist indices i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin
  • E.g., <(3)(4 5)(8)> is contained in <(7)(3 8)(9)(4 5 6)(8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8)
  • However, note that <(3)(5)> is NOT contained in <(3 5)> (and vice versa)
• Customer sequence: the sequence of the transactions ("shopping baskets") of a customer, ordered by transaction time Ti: <itemset(T1) itemset(T2) ... itemset(Tn)>
• A customer supports a sequence s if s is contained in the customer sequence of this customer
• The support of a sequence is defined as the fraction of all customers who support the sequence
• Task: given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support; each such maximal sequence represents a sequential pattern

Example:

  Customer Id  Transaction time  Items bought
  1            June 25, 1993     30
  1            June 30, 1993     90
  2            June 10, 1993     10, 20
  2            June 15, 1993     30
  2            June 20, 1993     40, 60, 70
  ...          ...               ...

  Customer Id  Customer sequence
  1            <(30)(90)>
  2            <(10 20)(30)(40 60 70)>
  3            <(30 50 70)>
  4            <(30)(40 70)(90)>
  5            <(90)>

With a minimum support of 25% (=> 2 customers), the maximal sequences are <(30)(90)> (customers 1 & 4) and <(30)(40 70)> (customers 2 & 4)

• Further definitions:
  • The length of a sequence is the number of itemsets in the sequence; a sequence of length k is called a k-sequence
  • The sequence concatenated from sequences x and y is denoted x.y
  • The support of an itemset i is defined as the fraction of customers who bought all the items of i in a single transaction
  • An itemset with minimum support is called a large itemset, or litemset
  • Each itemset in a large sequence must have minimum support, i.e., any large sequence must be a list of litemsets (the Apriori trick!)
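The containment relation and the support definition above can be sketched in code. A minimal illustration in Python (the helper names `contains` and `support` and the in-memory list-of-frozensets representation are this summary's own, not from the paper):

```python
def contains(big, small):
    """True if sequence `small` is contained in sequence `big`:
    each itemset of `small` must be a subset of a distinct itemset
    of `big`, in order (greedy earliest match is sufficient)."""
    i = 0
    for itemset in big:
        if i < len(small) and small[i] <= itemset:  # subset test
            i += 1
    return i == len(small)

def support(sequence, customer_sequences):
    """Fraction of customers whose sequence contains `sequence`."""
    n = sum(1 for cs in customer_sequences if contains(cs, sequence))
    return n / len(customer_sequences)

# The five customer sequences from the example slide:
db = [
    [frozenset({30}), frozenset({90})],                               # customer 1
    [frozenset({10, 20}), frozenset({30}), frozenset({40, 60, 70})],  # customer 2
    [frozenset({30, 50, 70})],                                        # customer 3
    [frozenset({30}), frozenset({40, 70}), frozenset({90})],          # customer 4
    [frozenset({90})],                                                # customer 5
]

pattern = [frozenset({30}), frozenset({90})]
print(support(pattern, db))  # customers 1 and 4 support it -> 0.4
```

Note that `contains` distinguishes <(3)(5)> from <(3 5)>: a single basket (3 5) can only match one itemset of the pattern, so <(3)(5)> is not contained in <(3 5)>.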
• Three algorithms, all for mining sequential patterns: AprioriAll, AprioriSome and DynamicSome

• Mining of sequential patterns proceeds in five phases:

• 1. Sort Phase
  • Sort the database by customer id and, within each customer, by transaction time

• 2. Litemset Phase
  • Find the large itemsets in Apriori fashion, but, as in MaxFreq, the support count is incremented only once per customer, even if the customer buys the same set of items in two different transactions
  • The large itemsets are mapped to a set of contiguous integers (e.g., (30), (40), (70), (40 70) and (90) become 1, 2, 3, 4 and 5); checking equality of itemsets is then fast (constant time!)

• 3. Transformation Phase
  • We need to repeatedly check which large itemsets are contained in a customer sequence
  • To make this fast, each customer sequence is transformed into a list of sets of large itemsets, and the large itemsets are then mapped to integers:

  CId  Original sequence        Transformed                        Mapping
  1    <(30)(90)>               <{(30)}{(90)}>                     <{1}{5}>
  2    <(10 20)(30)(40 60 70)>  <{(30)}{(40),(70),(40 70)}>        <{1}{2,3,4}>
  3    <(30 50 70)>             <{(30),(70)}>                      <{1,3}>
  4    <(30)(40 70)(90)>        <{(30)}{(40),(70),(40 70)}{(90)}>  <{1}{2,3,4}{5}>
  5    <(90)>                   <{(90)}>                           <{5}>

• 4. Sequence Phase
  • The large itemsets are used to find the desired sequences
  • AprioriAll:
    – Based on the normal Apriori algorithm
    – Counts all the large sequences
    – Prunes the non-maximal ones in the Maximal Phase
  • AprioriSome and DynamicSome:
    – Avoid counting sequences that are contained in longer sequences by counting the longer ones first; also avoid counting many subsequences whose supersequences are not large
    – Forward phase: find all large sequences of certain lengths
    – Backward phase: find all remaining large sequences
    – AprioriSome: uses only the large sequences of the previous pass to generate candidates and validate their supports (i.e., whether they are frequent or not)
    – DynamicSome: generates candidates on-the-fly from the large sequences found in previous passes and the customer sequences read from the database

• 5. Maximal Phase
  • Find the maximal sequences among the large sequences
  • In practice: starting from the longest sequences, delete all their subsequences

• AprioriAll:
  • Find all large sequences "normally" with Apriori
  • Prune the non-maximal ones: starting from <1 2 3 4>, delete all its subsequences (<1 2 3>, <1 2 4>, <1 3 4>, <2 3 4>, <1 2>, <1 3>, ..., <4>), then take the remaining <1 3 5> and prune all its subsequences, and so on
  • The maximal large sequences are <1 2 3 4>, <1 3 5> and <4 5>

• AprioriSome:
  • Count, e.g., only sequences of lengths 1, 2, 4 and 6 in the forward phase, and sequences of lengths 3 and 5 in the backward phase
  • Note: in the forward phase, candidates for all lengths are generated:
    – If the large sequences Lk-1 were counted, generate the new candidates Ck from them
    – If the large sequences Lk-1 were NOT counted, generate the new candidates Ck from the candidates Ck-1
  • In the backward phase: delete every k-sequence from the candidate collection if it is contained in some longer large sequence Li (i > k), then count the remaining candidates
  • The function "next" determines the next sequence length to be counted; it is based on the assumption that if, e.g., almost all sequences of length k are large (frequent), then many of the sequences of length k+1 are also large (frequent). E.g.:
    – Most of the sequences are large (85%) => next round is k+5
    – ...
    – Not many of the sequences are large (67%) => next round is k+1 (i.e., behave like AprioriAll)

• DynamicSome:
  • In the initialization phase, count only the sequences of lengths up to and including the value of the step variable
    – E.g., if step is 3, count sequences of lengths 1, 2 and 3
  • In the forward phase, generate sequences of lengths 2 × step, 3 × step, 4 × step, etc. on-the-fly, based on the previous passes and the customer sequences in the database
    – E.g., when generating sequences of length 9 with step 3: while passing over the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c at hand, and they do not overlap in c, then s6.s3 is a candidate 9-sequence (in general, sk.sj is a candidate (k+j)-sequence)
  • In the intermediate phase, generate the candidate sequences for the skipped lengths
    – E.g., if we have counted L6 and L3, and L9 turns out to be empty: we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5
  • The backward phase is identical to that of AprioriSome

• Next, we spare a few thoughts on incremental mining of sequential patterns

Incremental Mining of Sequential Patterns in Large Databases
F. Masseglia, P. Poncelet and M. Teisseire, Laboratoire PRiSM & LIRMM, UMR CNRS, France
Published in BDA'00 (Bases de Données Avancées)
Summary by Mika Klemettinen, Data Mining course, Autumn 2001, University of Helsinki

• Problem setting:
  • Consider an original customer transaction database and an increment to it
  • For the original database, the frequent patterns have already been computed
  • The increment database may contain new customers and new transactions for both old and new customers
  • To compute the set of sequential patterns in the updated database, we want to avoid counting everything from scratch
• Some main issues to consider:
  • Discover all sequential patterns that are NOT frequent in the original database but become frequent with the increment
  • Examine all transactions in the original database that can be extended to become frequent
  • Old frequent sequences may become invalid when a customer or customers are added
• The definitions are basically the same as in the "Mining Sequential Patterns" paper
• Again, the problem is to find all (maximal) sequences whose support is greater than a specified threshold (minimum support)
• Additional definitions:
  • DB is the original database; minSupp is the minimum support
  • db is the increment database
  • U = DB ∪ db is the updated database containing all sequences from DB and db
  • LDB is the set of frequent sequences in DB
  • The task is to find the frequent sequences in U, denoted LU, with respect to minSupp
• An example database was presented on a separate slide (figure not reproduced here)

• First problem (Figure 1): append new transactions to customers already present in the original database
  • Suppose that we have a minSupp threshold of
50%
  • In the original database, the frequent (maximal) sequences LDB are {<(10 20)(30)>, <(10 20)(40)>}
  • New transactions are appended to customers C2 and C3
  • Sequences <(60)(90)> and <(10 20)(50 70)> become frequent:
    – Customers C3 and C4 contain the first one, so its support is 50%
    – Customers C1, C2 and C3 contain <(10 20)>; with the increments for C2 and C3, the second one becomes frequent, since customers C1 and C2 now contain it; thus its support is 50%
  • Sequences <(10 20)(30)(50 60)(80)> and <(10 20)(40)(50 60)(80)> become frequent, since <(50 60)(80)> is frequent in db and was appended to the rows already containing the frequent sequences <(10 20)(30)> and <(10 20)(40)>

• Second problem (Figure 2): append new customers as well as new transactions to the original database
  • Suppose again a minSupp threshold of 50%
  • When one new customer is added to the database, a frequent sequence must now be observed for 3 customers (previously 2)
  • In the original database, the frequent (maximal) sequences LDB used to be {<(10 20)(30)>, <(10 20)(40)>}, but the set is now just {<(10 20)>}:
    – Sequences <(10 20)(30)> and <(10 20)(40)> occur only for customers C2 and C3
    – Sequence <(10 20)> occurs for C1, C2 and C3
  • With the increment database db, LU becomes {<(10 20)(50)>, <(10)(70)>, <(10)(80)>, <(40)(80)>, <(60)>}
  • E.g., sequence <(10 20)(50)> occurs in the original database only for C1 and is not frequent; as item 50 becomes frequent with the increment database, the sequence matches C2 and C3 as well

• Algorithm (ISE): the incremental mining task is decomposed into two subproblems (k = length of the longest frequent sequence in DB):
  • 1. Find all new frequent sequences of size j ≤ (k+1). During this phase, three kinds of frequent sequences are considered:
    – Sequences of DB can become frequent, since they gain sufficient support from the increment
    – New frequent sequences can appear in the increment db without occurring in the original DB
    – Sequences of DB can become frequent when extended with items of db
  • 2. Find all new frequent sequences of size j > (k+1)
    – This is a straightforward application of an Apriori-like algorithm, since all frequent (k+1)-sequences were discovered in the previous phase

• First iteration (1):
  • Make a pass over db and count the support of the individual items of db
  • Produce 1-candExt, the sequences occurring in db
  • Determine which items of db are frequent in U => L1^db
  • Prune sequences that used to be frequent in LDB but are no longer frequent in U

• First iteration (2):
  • Create candidate 2-sequences by joining L1^db with L1^db => 2-candExt
  • Generate from LDB the set of frequent subsequences
  • Scan U to find the frequent 2-sequences of 2-candExt and the frequent subsequences occurring before items of L1^db

• First iteration (3):
  • freqSeed <= the frequent subsequences occurring before an item of L1^db, appended with that item
  • 2-freqExt <= the frequent 2-sequences of 2-candExt

• j-th iteration with j ≤ (k+1):

  while (j-freqExt ≠ ∅ and j ≤ (k+1)) do
    candInc <= candidates generated from freqSeed and j-freqExt ;
    j++ ;
    j-candExt <= candidate j-sequences generated from (j-1)-freqExt ;
    scan db for j-candExt ;
    if (j-candExt ≠ ∅ and candInc ≠ ∅) then
      scan U for j-candExt and candInc ;
    endif
    j-freqExt <= the frequent j-sequences ;
    freqInc <= freqInc + candidates from candInc verifying the support on U ;
  enddo
  LU <= LDB ∪ {maximal frequent sequences in freqSeed ∪ freqInc ∪ freqExt} ;

• j-th iteration with j > (k+1):

  apply an Apriori-style algorithm until all frequent sequences are discovered ;
  LU <= LU ∪ {maximal frequent sequences obtained from the previous step} ;

• The processes of the first and the j-th (j > (k+1)) iterations were summarized on a separate slide (figure not reproduced here)
• Optimization of "candInc <= Generate candidates from freqSeed and j-freqExt": consider two sequences s ∈ freqSeed and s' ∈ freqExt such that an item i ∈ L1^db is both the last item of s and the first item of s'; do not append s' to s if there exists an item j ∈ L1^db such that j is in s' and j is not preceded by s

Unofficial Evaluation (Personal Views...)

• Mining Sequential Patterns:
  • The paper comes from one of the top research groups in the data mining area (the IBM Almaden data mining group, led by Rakesh Agrawal)
  • A quite well-written paper: good language, clear examples and presentation => rather "easy to read"
  • Simple ideas, not very "break-through" ideas (at least by today's interpretation); quite a good international conference
  • One has to remember: this was written already in 1995
• Incremental Mining of Sequential Patterns in Large Databases:
  • The paper comes from a not so well-known French research group
  • Good: lots of examples
  • Bad: the language is not always as good as it could be, and the definitions are sometimes somewhat "blurry"; maybe too many abbreviations are used
  • Probably not very "break-through" ideas; a national DB conference
  • Remember: this is from the year 2000 - rather new!
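As a recap of the level-wise idea shared by both papers, here is a minimal AprioriAll-style sketch in Python over transformed customer sequences (lists of mapped litemset ids). The function names and the small example database are this summary's own; the database is a hypothetical one, chosen so that the maximal large sequences come out as in the <1 2 3 4>, <1 3 5>, <4 5> example above:

```python
def is_subseq(big, small):
    """True if `small` is a subsequence of `big` (lists/tuples of litemset ids)."""
    i = 0
    for x in big:
        if i < len(small) and small[i] == x:
            i += 1
    return i == len(small)

def apriori_all(db, min_count):
    """AprioriAll-style sketch: level-wise candidate generation and counting,
    followed by the Maximal Phase (drop sequences contained in longer ones)."""
    items = sorted({x for seq in db for x in seq})
    level = [(x,) for x in items]          # candidate 1-sequences
    large = []                             # large sequences of every length
    while level:
        counted = [s for s in level
                   if sum(is_subseq(seq, s) for seq in db) >= min_count]
        large.extend(counted)
        # join step: two large k-sequences sharing a (k-1)-overlap
        level = sorted({a + (b[-1],) for a in counted for b in counted
                        if a[1:] == b[:-1]})
    # Maximal Phase: keep only sequences not contained in a longer large one
    return [s for s in large
            if not any(len(t) > len(s) and is_subseq(t, s) for t in large)]

# Hypothetical transformed database; with minimum support count 2 the
# maximal sequences match the slide example:
db = [[1, 2, 3, 4], [1, 3, 5], [4, 5], [1, 2, 3, 4, 5]]
print(sorted(apriori_all(db, 2)))  # [(1, 2, 3, 4), (1, 3, 5), (4, 5)]
```

The AprioriSome/DynamicSome and ISE variants differ only in *which* levels get counted and *where* the candidates come from; the subsequence test, the level-wise join and the final maximal pruning are the shared core.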