Lecture Notes for Chapter 7
Sequential Pattern Mining / Frequent Subgraph Mining
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Sequence Data

Sequence database:

  Object   Timestamp   Events
  A        10          2, 3, 5
  A        20          6, 1
  A        23          1
  B        11          4, 5, 6
  B        17          2
  B        21          7, 8, 1, 2
  B        28          1, 6
  C        14          1, 8, 7

Viewed on a timeline:
  Object A: {2,3,5} at t=10, {6,1} at t=20, {1} at t=23
  Object B: {4,5,6} at t=11, {2} at t=17, {7,8,1,2} at t=21, {1,6} at t=28
  Object C: {1,8,7} at t=14

Sequence Databases

A sequence database consists of ordered elements or events.

Transaction databases vs. sequence databases:

  A transaction database        A sequence database
  TID   itemsets                SID   sequence
  10    a, b, d                 10    <a(abc)(ac)d(cf)>
  20    a, c, d                 20    <(ad)c(bc)(ae)>
  30    a, d, e                 30    <(ef)(ab)(df)cb>
  40    b, e, f                 40    <eg(af)cbc>

Examples of Sequence Data

  Sequence Database   Sequence                                  Element (Transaction)                                    Event (Item)
  Customer            Purchase history of a given customer      A set of items bought by a customer at time t            Books, dairy products, CDs, etc.
  Web data            Browsing activity of a Web visitor        A collection of files viewed by a Web visitor            Home page, index page, contact info, etc.
                                                                after a single mouse click
  Event data          History of events generated by a sensor   Events triggered by a sensor at time t                   Types of alarms generated by sensors
  Genome sequences    DNA sequence of a particular species      An element of the DNA sequence                           Bases A, T, G, C

[Figure: a sequence drawn as a series of elements (transactions), e.g. {E1,E2} {E1,E3} {E2} {E2,E3} {E4}, each element containing one or more events (items).]

Formal Definition of a Sequence

A sequence is an ordered list of elements (transactions):
  s = <e1 e2 e3 …>
– Each element contains a collection of events (items): ei = {i1, i2, …, ik}
– Each element is attributed to a specific time or location

The length of a sequence, |s|, is the number of elements in the sequence.
A k-sequence is a sequence that contains k events (items).

Examples of Sequence

Web sequence:
  < {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >

Sequence of initiating events causing the nuclear accident at Three Mile Island
(http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm):
  < {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases} >

Sequence of books checked out at a library:
  < {Fellowship of the Ring} {The Two Towers} {Return of the King} >

Formal Definition of a Subsequence

A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin.

  Data sequence             Subsequence      Contain?
  < {2,4} {3,5,6} {8} >     < {2} {3,5} >    Yes
  < {1,2} {3,4} >           < {1} {2} >      No
  < {2,4} {2,4} {2,5} >     < {2} {4} >      Yes

The support of a subsequence w is defined as the fraction of data sequences that contain w.
A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup).
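The containment test is easy to state in code. Below is a minimal Python sketch (the function names are ours, not from the slides) that checks containment greedily, left to right, and computes support over a database; the assertions reproduce the three table rows above.

```python
def is_subsequence(sub, seq):
    """Check whether `sub` is contained in `seq`.

    Both arguments are lists of elements, each element a set of events,
    e.g. seq = [{2,4}, {3,5,6}, {8}].  `sub` is contained in `seq` if its
    elements map, in order, onto supersets among the elements of `seq`.
    """
    i = 0  # index into seq
    for a in sub:
        # advance through seq until an element contains a
        while i < len(seq) and not a <= seq[i]:
            i += 1
        if i == len(seq):
            return False
        i += 1  # the next element of sub must match a strictly later element
    return True

def support(sub, database):
    """Fraction of data sequences in `database` that contain `sub`."""
    return sum(is_subsequence(sub, s) for s in database) / len(database)

# The three examples from the table above:
assert is_subsequence([{2}, {3, 5}], [{2, 4}, {3, 5, 6}, {8}])   # Yes
assert not is_subsequence([{1}, {2}], [{1, 2}, {3, 4}])          # No
assert is_subsequence([{2}, {4}], [{2, 4}, {2, 4}, {2, 5}])      # Yes
```

Greedy leftmost matching is sufficient here: if any embedding of the subsequence exists, the embedding that always takes the earliest feasible element also exists.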
Sequential Pattern Mining: Definition

Given:
– a database of sequences
– a user-specified minimum support threshold, minsup

Task:
– Find all subsequences with support ≥ minsup

Sequential Pattern Mining: Challenge

Given a sequence: <{a b} {c d e} {f} {g h i}>
– Examples of subsequences: <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc.

How many k-subsequences can be extracted from a given n-sequence?
For <{a b} {c d e} {f} {g h i}> we have n = 9 events. Each k-subsequence corresponds to a choice of k of the n events, keeping the element structure; e.g. for k = 4, choosing a, d, e, i yields <{a} {d e} {i}>.

Answer: C(n, k) = C(9, 4) = 126.
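As a quick sanity check of this count (an illustrative snippet, not part of the original slides):

```python
from math import comb
from itertools import combinations

# Events of <{a b} {c d e} {f} {g h i}>, tagged with their element index
events = [('a', 0), ('b', 0), ('c', 1), ('d', 1), ('e', 1),
          ('f', 2), ('g', 3), ('h', 3), ('i', 3)]

# Every choice of k events yields a distinct k-subsequence (the chosen
# events are regrouped by their original element), so the count is C(n, k).
k = 4
assert comb(len(events), k) == 126
print(len(list(combinations(events, k))))  # 126
```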
Sequential Pattern Mining: Example

  Object   Timestamp   Events
  A        1           1, 2, 4
  A        2           2, 3
  A        3           5
  B        1           1, 2
  B        2           2, 3, 4
  C        1           1, 2
  C        2           2, 3, 4
  C        3           2, 4, 5
  D        1           2
  D        2           3, 4
  D        3           4, 5
  E        1           1, 3
  E        2           2, 4, 5

Minsup = 50%. Examples of frequent subsequences:

  < {1,2} >        s = 60%
  < {2,3} >        s = 60%
  < {2,4} >        s = 80%
  < {3} {5} >      s = 80%
  < {1} {2} >      s = 80%
  < {2} {2} >      s = 60%
  < {1} {2,3} >    s = 60%
  < {2} {2,3} >    s = 60%
  < {1,2} {2,3} >  s = 60%

Challenges on Sequential Pattern Mining

A huge number of possible sequential patterns are hidden in databases. A mining algorithm should:
– find the complete set of patterns satisfying the minimum support (frequency) threshold, when possible
– be highly efficient and scalable, involving only a small number of database scans
– be able to incorporate various kinds of user-specific constraints

Studies on Sequential Pattern Mining

Concept introduction and an initial Apriori-like algorithm:
– Agrawal & Srikant. Mining sequential patterns [ICDE'95]
Apriori-based method:
– GSP (Generalized Sequential Patterns): Srikant & Agrawal [EDBT'96]
Pattern-growth methods:
– FreeSpan & PrefixSpan (Han et al. [KDD'00]; Pei et al. [ICDE'01])
Vertical format-based mining:
– SPADE (Zaki [Machine Learning'01])
Constraint-based sequential pattern mining:
– SPIRIT (Garofalakis, Rastogi & Shim [VLDB'99]); Pei, Han & Wang [CIKM'02]
Mining closed sequential patterns:
– CloSpan (Yan, Han & Afshar [SDM'03])

Methods for Sequential Pattern Mining

Apriori-based approaches:
– AprioriAll
– GSP
Sequence-enumeration tree/lattice-based (also Apriori-based) approaches:
– SPADE
– SPAM
Pattern-growth-based approaches:
– FreeSpan
– PrefixSpan

The Apriori Property of Sequential Patterns

A basic property (Apriori, Agrawal & Srikant '94):
– If a sequence S is not frequent, then none of the supersequences of S is frequent.
– E.g., if <hb> is infrequent, then so are <hab> and <(ah)b>.
Major strength: candidate pruning by the Apriori property.

Extracting Sequential Patterns

Given n events i1, i2, i3, …, in:
Candidate 1-subsequences:
  <{i1}>, <{i2}>, <{i3}>, …, <{in}>
Candidate 2-subsequences:
  <{i1, i2}>, <{i1, i3}>, …, <{i1} {i1}>, <{i1} {i2}>, …, <{in-1} {in}>
Candidate 3-subsequences:
  <{i1, i2, i3}>, <{i1, i2, i4}>, …, <{i1, i2} {i1}>, <{i1, i2} {i2}>, …,
  <{i1} {i1, i2}>, <{i1} {i1, i3}>, …, <{i1} {i1} {i1}>, <{i1} {i1} {i2}>, …

Generalized Sequential Pattern (GSP)

Step 1:
– Make the first pass over the sequence database D to yield all the 1-element frequent sequences
Step 2: Repeat until no new frequent sequences are found
– Candidate generation: merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items
– Candidate pruning: prune candidate k-sequences that contain infrequent (k-1)-subsequences
– Support counting: make a new pass over the sequence database D to find the support for these candidate sequences
– Candidate elimination: eliminate candidate k-sequences whose actual support is less than minsup

Candidate Generation

Base case (k = 2):
– Merging two frequent 1-sequences <{i1}> and <{i2}> produces two candidate 2-sequences: <{i1} {i2}> and <{i1 i2}>
General case (k > 2):
– A frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2. The resulting candidate is the sequence w1 extended with the last event of w2:
– If the last two events in w2 belong to the same element, then the last event in w2 becomes part of the last element in w1
– Otherwise, the last event in w2 becomes a separate element appended to the end of w1

Candidate Generation Examples

Merging w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> produces the candidate <{1} {2 3} {4 5}>, because the last two events in w2 (4 and 5) belong to the same element.

Merging w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> produces the candidate <{1} {2 3} {4} {5}>, because the last two events in w2 (4 and 5) do not belong to the same element.

We do not have to merge w1 = <{1} {2 6} {4}> and w2 = <{1} {2} {4 5}> to produce the candidate <{1} {2 6} {4 5}>: if the latter is a viable candidate, it can be obtained by merging w1 with <{1} {2 6} {5}>.
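The merge rule can be made concrete with a short sketch. The representation (each element a tuple of items in a fixed order, so "first event" and "last event" are well defined) and the helper names are our own assumptions; the base case k = 2 is handled separately by enumeration, as above.

```python
def drop_first_event(seq):
    """Remove the first event; drop its element if it becomes empty."""
    head = seq[0][1:]
    return ([head] if head else []) + list(seq[1:])

def drop_last_event(seq):
    """Remove the last event; drop its element if it becomes empty."""
    tail = seq[-1][:-1]
    return list(seq[:-1]) + ([tail] if tail else [])

def merge(w1, w2):
    """Merge two frequent (k-1)-sequences into a candidate k-sequence,
    or return None if the GSP merge condition fails (k > 2 case)."""
    if drop_first_event(w1) != drop_last_event(w2):
        return None
    last = w2[-1][-1]
    if len(w2[-1]) > 1:
        # last two events of w2 share an element:
        # append `last` to the last element of w1
        return list(w1[:-1]) + [w1[-1] + (last,)]
    # otherwise `last` starts a new element at the end of w1
    return list(w1) + [(last,)]

w1 = [(1,), (2, 3), (4,)]
print(merge(w1, [(2, 3), (4, 5)]))      # [(1,), (2, 3), (4, 5)]
print(merge(w1, [(2, 3), (4,), (5,)]))  # [(1,), (2, 3), (4,), (5,)]
```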
GSP Example

  Frequent 3-sequences      Candidate generation      After candidate pruning
  < {1} {2} {3} >           < {1} {2} {3} {4} >       < {1} {2 5} {3} >
  < {1} {2 5} >             < {1} {2 5} {3} >
  < {1} {5} {3} >           < {1} {5} {3 4} >
  < {2} {3} {4} >           < {2} {3} {4} {5} >
  < {2 5} {3} >             < {2 5} {3 4} >
  < {3} {4} {5} >
  < {5} {3 4} >

More GSP Example

Given support threshold min_sup = 2:

  Seq. ID   Sequence
  10        <(bd)cb(ac)>
  20        <(bf)(ce)b(fg)>
  30        <(ah)(bf)abf>
  40        <(be)(ce)d>
  50        <a(bd)bcb(ade)>

Finding Length-1 Sequential Patterns

Initial candidates: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>.
Scan the database once and count support for each candidate (min_sup = 2):

  Cand.   Sup
  <a>     3
  <b>     5
  <c>     4
  <d>     3
  <e>     3
  <f>     2
  <g>     1
  <h>     1

Generating Length-2 Candidates

From the 6 frequent items there are 51 length-2 candidates: 36 of the form <xy> (two one-item elements)

  <aa> <ab> <ac> <ad> <ae> <af>
  <ba> <bb> <bc> <bd> <be> <bf>
  <ca> <cb> <cc> <cd> <ce> <cf>
  <da> <db> <dc> <dd> <de> <df>
  <ea> <eb> <ec> <ed> <ee> <ef>
  <fa> <fb> <fc> <fd> <fe> <ff>

and 15 of the form <(xy)> (one two-item element):

  <(ab)> <(ac)> <(ad)> <(ae)> <(af)>
  <(bc)> <(bd)> <(be)> <(bf)>
  <(cd)> <(ce)> <(cf)>
  <(de)> <(df)>
  <(ef)>

Without the Apriori property there would be 8*8 + 8*7/2 = 92 candidates; Apriori prunes 44.57% of them.

Finding Length-2 Sequential Patterns

Scan the database one more time and collect the support count for each length-2 candidate. There are 19 length-2 candidates that pass the minimum support threshold; they are the length-2 sequential patterns.

The GSP Mining Process

With min_sup = 2 on the database above:

  1st scan: 8 candidates (<a> … <h>) → 6 length-1 sequential patterns; candidates such as <g> and <h> cannot pass the support threshold
  2nd scan: 51 candidates (<aa> <ab> … <ff>, <(ab)> …) → 19 length-2 sequential patterns; 10 candidates do not appear in the DB at all
  3rd scan: 46 candidates → 19 length-3 sequential patterns; 20 candidates (e.g. <abb>, <aab>, <aba>, <baa>, <bab>) are not in the DB at all
  4th scan: 8 candidates → 6 length-4 sequential patterns; e.g. <(bd)cba> cannot pass the support threshold, <abba> and <(bd)bc> are not in the DB
  5th scan: 1 candidate → 1 length-5 sequential pattern

The GSP Algorithm

Benefits from the Apriori pruning:
– Reduces the search space
Bottlenecks:
– Scans the database multiple times
– Generates a huge set of candidate sequences, especially 2-item candidate sequences
– Inefficient for mining long sequential patterns: a long pattern grows from short patterns, via an exponential number of short candidates
There is a need for more efficient mining methods.
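Before moving on, here is a compact, deliberately naive sketch tying the four GSP steps together. It assumes a merge() like the one sketched above is in scope, and for brevity folds the Apriori-based candidate-pruning step into support counting, which a real implementation would not do.

```python
from itertools import combinations

def contains(seq, sub):
    """True if `sub` is a subsequence of `seq` (both: lists of tuples)."""
    i = 0
    for a in sub:
        while i < len(seq) and not set(a) <= set(seq[i]):
            i += 1
        if i == len(seq):
            return False
        i += 1
    return True

def gsp(database, minsup_count):
    """Level-wise GSP sketch: generate, count, eliminate.
    merge() is the candidate-merge sketch from above, assumed in scope."""
    items = sorted({x for s in database for e in s for x in e})
    # 1st pass: frequent 1-sequences
    Fk = [[(i,)] for i in items
          if sum(contains(s, [(i,)]) for s in database) >= minsup_count]
    patterns, k = list(Fk), 2
    while Fk:
        if k == 2:   # base case: enumerate both element forms
            cands = ([[(a,), (b,)] for a in items for b in items] +
                     [[(a, b)] for a, b in combinations(items, 2)])
        else:        # general case: merge compatible (k-1)-patterns
            cands = [c for w1 in Fk for w2 in Fk
                     if (c := merge(w1, w2)) is not None]
        # support counting + elimination (Apriori pruning omitted here)
        Fk = [c for c in cands
              if sum(contains(s, c) for s in database) >= minsup_count]
        patterns += Fk
        k += 1
    return patterns
```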
The SPADE Algorithm

SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki (2001):
– A vertical-format sequential pattern mining method
– The sequence database is mapped to a large set of entries of the form item: <SID, EID> (vertical data format)
– Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time, using Apriori candidate generation

[Figure: a worked SPADE example, showing the id-lists of 1-sequences and their temporal joins.]
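The vertical format and the one-item growth step can be sketched as follows. This is a simplification of SPADE with our own function names; real SPADE also performs itemset (I-step) joins and organizes the search into equivalence classes.

```python
from collections import defaultdict

def vertical_format(database):
    """Map a sequence database to id-lists: item -> [(sid, eid), ...]."""
    idlists = defaultdict(list)
    for sid, seq in database.items():
        for eid, element in enumerate(seq, start=1):
            for item in element:
                idlists[item].append((sid, eid))
    return idlists

def s_join(la, lb):
    """Temporal (sequence) join of two id-lists: occurrences of the second
    item that come strictly after an occurrence of the first, per sequence."""
    by_sid = defaultdict(list)
    for sid, eid in la:
        by_sid[sid].append(eid)
    return [(sid, eid) for sid, eid in lb
            if sid in by_sid and min(by_sid[sid]) < eid]

db = {10: [('b', 'd'), ('c',), ('b',), ('a', 'c')],
      20: [('b', 'f'), ('c', 'e'), ('b',), ('f', 'g')]}
L = vertical_format(db)
bc = s_join(L['b'], L['c'])            # id-list of the 2-sequence <bc>
support = len({sid for sid, _ in bc})  # number of distinct sequences
print(bc, support)                     # [(10, 2), (10, 4), (20, 2)] 2
```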
SPAM (Sequential PAttern Mining)

For S = <{a, b, c} {a, b}>:
– Sequence length: length(S) = 5 (number of items)
– Sequence size: size(S) = 2 (number of elements)
Sequence-extended sequence: S' = <{a, b, c} {a, b} {a}>
Itemset-extended sequence: S' = <{a, b, c} {a, b, d}>

SPAM explores a lexicographic sequence tree. For items {a, b} and maximum size 3, the tree rooted at the empty sequence {} grows level by level:
  Level 1: <a>, <b>, <(ab)>
  Level 2 (children of <a>): <aa>, <ab>, <a(ab)>, …
  Level 3 (children of <a(ab)>): <a(ab)a>, <a(ab)b>, <a(ab)(ab)>, …
Each node is expanded into sequence-extended (S-step) and itemset-extended (I-step) children; the full tree enumerates nodes such as <(ab)a>, <(ab)b>, <(ab)(ab)>, <(ab)(ab)a>, <(ab)(ab)b>, <(ab)(ab)(ab)> down to level 6.

Pruning: for items {a, b, c, d}, if an extension such as <a c> is infrequent, the corresponding deeper S-step and I-step extensions (e.g. <a a c>, <a (a c)>, <a b c>, <a (b c)>) are pruned from the tree.

Data Representation – Bitmap

[Figure: each item is represented as a vertical bitmap with one bit per (SID, EID) element slot of the database.]

S-type (sequence) extensions: S = <{a}> gives S' = <{a} {b}>, S' = <{a} {c}>, …
I-type (itemset) extensions: S = <{a}> gives S' = <{a, b}>, S' = <{a, c}>, …
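SPAM implements both extension types with bitmap operations. The sketch below works on one sequence's bitmap at a time, encoded as a Python integer with bit i standing for element i; actual SPAM keeps a single bitmap per item spanning all sequences, partitioned by sequence.

```python
def s_step(prefix_bits, item_bits, n):
    """S-step on an n-bit per-sequence bitmap (bit i = element i).
    Transform the prefix bitmap so that every position strictly after its
    first occurrence is set, then AND with the item's bitmap."""
    if prefix_bits == 0:
        return 0
    first = (prefix_bits & -prefix_bits).bit_length() - 1  # lowest set bit
    after = ((1 << n) - 1) & ~((1 << (first + 1)) - 1)     # bits > first
    return after & item_bits

def i_step(prefix_bits, item_bits):
    """I-step: the new item must occur in the same element as the prefix."""
    return prefix_bits & item_bits

# Sequence 10 = <(bd) c b (ac)>: elements indexed 0..3, bit i = element i.
n = 4
a = 0b1000   # a occurs in element 3
b = 0b0101   # b occurs in elements 0 and 2
c = 0b1010   # c occurs in elements 1 and 3
print(bin(s_step(b, c, n)))  # <bc>: c after the first b -> 0b1010
print(bin(i_step(a, c)))     # <(ac)>: a and c in one element -> 0b1000
```

A pattern's support is then the number of sequences whose resulting bitmap is nonzero, which is why the bitmap representation makes counting cheap.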
[Figures: runtime comparisons of SPADE, SPAM and PrefixSpan on small, medium ("middle") and large databases.]

PrefixSpan (Prefix-Projected Sequential Pattern Growth)

PrefixSpan:
– Projection-based
– But only prefix-based projection: fewer projections and quickly shrinking sequences
J. Pei, J. Han, et al. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. [ICDE'01]

Prefix and Suffix (Projection)

<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of the sequence <a(abc)(ac)d(cf)>.

Given the sequence <a(abc)(ac)d(cf)>:

  Prefix   Suffix (prefix-based projection)
  <a>      <(abc)(ac)d(cf)>
  <aa>     <(_bc)(ac)d(cf)>
  <ab>     <(_c)(ac)d(cf)>

Mining Sequential Patterns by Prefix Projections

  SID   sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>

Step 1: find length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
– the ones having prefix <a>
– the ones having prefix <b>
– …
– the ones having prefix <f>

Finding Sequential Patterns with Prefix <a>

Only need to consider projections w.r.t. <a>:
– <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
– Further partition into 6 subsets: having prefix <aa>, …, having prefix <af>

Completeness of PrefixSpan

SDB (min_sup = 2):

  SID   sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>

Length-1 sequential patterns (L1): <a>: 4, <b>: 4, <c>: 4, <d>: 3, <e>: 3, <f>: 3.

Having prefix <a> leads to the <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>.
Scanning the <a>-projected database once gives a:2, b:4, c:4, d:2, e:1, f:2, (_b):2, (_c):1, (_d):1, (_e):1, (_f):1, hence the length-2 patterns with prefix <a> (L2): <aa>: 2, <ab>: 4, <(ab)>: 2, <ac>: 4, <ad>: 2, <af>: 2.

Having prefix <aa> leads to the <aa>-projected database: <(_bc)(ac)d(cf)>, …
Having prefix <af> leads to the <af>-projected database: <cb>.
Scanning the <ab>-projected database once gives a:2, c:2, d:1, f:1, (_c):2, hence (L3): <a(bc)>: 2, <aba>: 2, <abc>: 2.

The same recursive partitioning is applied for prefixes <b>, <c>, …, <f>.

The Algorithm of PrefixSpan

Input: a sequence database S and the minimum support threshold min_sup
Output: the complete set of sequential patterns
Method: call PrefixSpan(<>, 0, S)

Subroutine PrefixSpan(α, l, S|α)
Parameters:
– α: a sequential pattern
– l: the length of α
– S|α: the α-projected database if α ≠ <>, otherwise the sequence database S

Method:
1. Scan S|α once and find the set of frequent items b such that:
   a) b can be assembled into the last element of α to form a sequential pattern (itemset extension, e.g. <(ab)>); or
   b) <b> can be appended to α to form a sequential pattern (sequence extension, e.g. <ab>).
2. For each frequent item b, append it to α to form a sequential pattern α', and output α'.
3. For each α', construct the α'-projected database S|α' and call PrefixSpan(α', l+1, S|α').
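A compact recursive implementation conveys the idea. The sketch below is simplified to sequences of single-item elements, so itemset extensions such as <(ab)> are not generated, and the function names are our own; it is not the authors' reference implementation.

```python
from collections import Counter

def prefixspan(db, min_sup):
    """Recursive PrefixSpan sketch over sequences of single items.
    `db` is a list of item lists; returns (pattern, support) pairs."""
    patterns = []

    def project(pdb, item):
        # suffix of each sequence after the first occurrence of `item`
        return [seq[seq.index(item) + 1:] for seq in pdb if item in seq]

    def grow(prefix, pdb):
        # count each item at most once per projected sequence
        counts = Counter(item for seq in pdb for item in set(seq))
        for item, sup in sorted(counts.items()):
            if sup >= min_sup:
                pat = prefix + [item]
                patterns.append((pat, sup))
                grow(pat, project(pdb, item))   # recurse on smaller DB

    grow([], db)
    return patterns

# The example database with itemset structure flattened away:
db = [list("abcacdcf"), list("adcbcae"), list("efabdfcb"), list("egafcbc")]
for pat, sup in prefixspan(db, 4):
    print(pat, sup)   # e.g. ['a'] 4, ['a', 'b'] 4, ['a', 'c'] 4, ...
```

Note that flattening the itemsets changes the supports relative to the slides' numbers; the point here is the recursion structure: count, output, project, recurse.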
Efficiency of PrefixSpan

– No candidate sequence needs to be generated
– Projected databases keep shrinking
– Major cost of PrefixSpan: constructing projected databases
  (can be improved by bi-level projections)

Optimization in PrefixSpan

Single-level vs. bi-level projection:
– Bi-level projection with 3-way checking may reduce the number and size of projected databases
Physical projection vs. pseudo-projection:
– Pseudo-projection may reduce the effort of projection when the projected database fits in main memory
Parallel projection vs. partition projection:
– Partition projection may avoid the blowup of disk space

[Figures: performance on data set C10T8S8I8, performance on data set Gazelle, and the effect of pseudo-projection.]

Timing Constraints (I)

For a pattern such as {A B} {C} {D E}, three timing parameters constrain an occurrence:
– xg: max-gap, the gap between consecutive matched elements must be ≤ xg
– ng: min-gap, the gap between consecutive matched elements must be > ng
– ms: maximum span, the time between the first and last matched elements must be ≤ ms

With xg = 2, ng = 0, ms = 4:

  Data sequence                            Subsequence        Contain?
  < {2,4} {3,5,6} {4,7} {4,5} {8} >        < {6} {5} >        Yes
  < {1} {2} {3} {4} {5} >                  < {1} {4} >        No
  < {1} {2,3} {3,4} {4,5} >                < {2} {3} {5} >    Yes
  < {1,2} {3} {2,3} {3,4} {2,4} {4,5} >    < {1,2} {5} >      No

Mining Sequential Patterns with Timing Constraints

Approach 1:
– Mine sequential patterns without timing constraints, then postprocess the discovered patterns
Approach 2:
– Modify GSP to directly prune candidates that violate the timing constraints
– Question: does the Apriori principle still hold?

Apriori Principle for Sequence Data

Using the earlier example database (objects A-E), suppose xg = 1 (max-gap), ng = 0 (min-gap), ms = 5 (maximum span), minsup = 60%:

  <{2} {5}> has support = 40%, but <{2} {3} {5}> has support = 60%.

The problem exists because of the max-gap constraint: a supersequence can be more frequent than its subsequence, so the Apriori principle fails. (There is no such problem if the max-gap is infinite.)

Contiguous Subsequences

s is a contiguous subsequence of w = <e1 e2 … ek> if any of the following conditions hold:
1. s is obtained from w by deleting an item from either e1 or ek
2. s is obtained from w by deleting an item from any element ei that contains more than 2 items
3. s is a contiguous subsequence of s' and s' is a contiguous subsequence of w (recursive definition)

Examples: s = <{1} {2}>
– is a contiguous subsequence of <{1} {2 3}>, <{1 2} {2} {3}>, and <{3 4} {1 2} {2 3} {4}>
– is not a contiguous subsequence of <{1} {3} {2}> and <{2} {1} {3} {2}>

Modified Candidate Pruning Step

Without the max-gap constraint:
– A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent
With the max-gap constraint:
– A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent

Timing Constraints (II)

In addition to xg (max-gap), ng (min-gap) and ms (maximum span), a window size ws allows the events of one pattern element to be matched across data elements that are at most ws apart.

With xg = 2, ng = 0, ws = 1, ms = 5:

  Data sequence                        Subsequence       Contain?
  < {2,4} {3,5,6} {4,7} {4,6} {8} >    < {3} {5} >       No
  < {1} {2} {3} {4} {5} >              < {1,2} {3} >     Yes
  < {1,2} {2,3} {3,4} {4,5} >          < {1,2} {3,4} >   Yes

Modified Support Counting Step

Given a candidate pattern <{a, c}>, any data sequence that contains
  <… {a c} …>,
  <… {a} … {c} …> where time({c}) − time({a}) ≤ ws, or
  <… {c} … {a} …> where time({a}) − time({c}) ≤ ws
will contribute to the support count of the candidate pattern.
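A containment check under these constraints can be written as a small backtracking search. The sketch below uses our own encoding (element position doubles as timestamp, and ws is fixed at 0, i.e. each pattern element must match inside a single data element); it reproduces the first two rows of the xg = 2, ng = 0, ms = 4 table.

```python
def contains_timed(seq, sub, xg, ng, ms):
    """Containment with timing constraints (window size ws = 0).
    `seq` is a list of (time, itemset); `sub` is a list of itemsets.
    Consecutive matched elements must satisfy ng < gap <= xg, and the
    whole occurrence must span at most ms."""
    def search(i, j, first_t, last_t):
        if j == len(sub):
            return True
        for k in range(i, len(seq)):
            t, items = seq[k]
            if not sub[j] <= items:
                continue
            if last_t is not None and not (ng < t - last_t <= xg):
                continue
            start = t if first_t is None else first_t
            if t - start > ms:      # times increase, so we can stop here
                break
            if search(k + 1, j + 1, start, t):
                return True
        return False
    return search(0, 0, None, None)

# Rows of the xg=2, ng=0, ms=4 table (time = element position):
s1 = [(1, {2, 4}), (2, {3, 5, 6}), (3, {4, 7}), (4, {4, 5}), (5, {8})]
print(contains_timed(s1, [{6}, {5}], xg=2, ng=0, ms=4))   # True
s2 = [(i, {i}) for i in range(1, 6)]
print(contains_timed(s2, [{1}, {4}], xg=2, ng=0, ms=4))   # False
```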
Other Formulation

In some domains we may have only one very long time series, for example:
– monitoring network traffic events for attacks
– monitoring telecommunication alarm signals
The goal is to find frequent sequences of events in the time series. This problem is also known as frequent episode mining.

[Figure: a single event stream E1 E3 E1 E1 E2 E4 E1 E2 … containing repeated occurrences of the pattern <E1> <E3>.]

General Support Counting Schemes

[Figure: one object's timeline with events p and q at positions 1-7, for the sequence (p)(q), assuming xg = 2 (max-gap), ng = 0 (min-gap), ws = 0 (window size), ms = 2 (maximum span).]

  Method    Support count
  COBJ      1
  CWIN      6
  CMINWIN   4
  CDIST_O   8
  CDIST     5

Frequent Subgraph Mining

Extend association rule mining to finding frequent subgraphs.
Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.

[Figure: a Web graph with pages Homepage, Research, Artificial Intelligence, Databases, Data Mining connected by hyperlinks.]

Graph Definitions

[Figure: (a) a labeled graph with vertex labels a, b, c and edge labels p, q, r, s, t; (b) a subgraph of it; (c) an induced subgraph of it.]

Representing Transactions as Graphs

Each transaction is a clique of items.

  Transaction Id   Items
  1                {A, B, C, D}
  2                {A, B, E}
  3                {B, C}
  4                {A, B, D, E}
  5                {B, C, D}

E.g., TID = 1 is represented as a clique on the vertices A, B, C, D.

Representing Graphs as Transactions

Each graph becomes a transaction over "edge items" of the form (vertex label, vertex label, edge label):

  Graph   (a,b,p)   (a,b,q)   (a,b,r)   (b,c,p)   (b,c,q)   (b,c,r)   …   (d,e,r)
  G1      1         0         0         0         0         1         …   0
  G2      1         0         0         0         0         0         …   0
  G3      0         0         1         1         0         0         …   0

Challenges

– Nodes may contain duplicate labels
– Support and confidence: how to define them?
– Additional constraints imposed by pattern structure:
  support and confidence are not the only constraints;
  assumption: frequent subgraphs must be connected
– Apriori-like approach: use frequent k-subgraphs to generate frequent (k+1)-subgraphs. But what is k?

Challenges…

Support:
– the number of graphs that contain a particular subgraph
The Apriori principle still holds.
Level-wise (Apriori-like) approach:
– Vertex growing: k is the number of vertices
– Edge growing: k is the number of edges

Vertex Growing

Two frequent k-graphs G1 and G2 whose adjacency matrices share a common core are joined into a candidate G3 = join(G1, G2) by adding one row and column:

  M(G1) = | 0 p p q |    M(G2) = | 0 p p 0 |    M(G3) = | 0 p p q 0 |
          | p 0 r 0 |            | p 0 r 0 |            | p 0 r 0 0 |
          | p r 0 0 |            | p r 0 r |            | p r 0 0 r |
          | q 0 0 0 |            | 0 0 r 0 |            | q 0 0 0 0 |
                                                        | 0 0 r 0 0 |

Edge Growing

[Figure: two graphs sharing a common (k-1)-edge core are joined; the candidate G3 = join(G1, G2) contains one additional edge.]

Apriori-like Algorithm

Find frequent 1-subgraphs.
Repeat:
– Candidate generation: use frequent (k-1)-subgraphs to generate candidate k-subgraphs
– Candidate pruning: prune candidate subgraphs that contain infrequent (k-1)-subgraphs
– Support counting: count the support of each remaining candidate
– Candidate elimination: eliminate candidate k-subgraphs that are infrequent
In practice it is not this easy; there are many other issues.
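Support counting is where the subgraph isomorphism tests discussed below come in. As a deliberately naive illustration (our own graph encoding, and brute force over vertex mappings, feasible only for tiny graphs):

```python
from itertools import permutations

def contains_subgraph(g, sg):
    """Brute-force test: is the small labeled graph `sg` isomorphic to a
    subgraph of `g`?  A graph is (vertex_labels, edges) with edges given
    as {(i, j): edge_label} for i < j."""
    vl, el = g
    svl, sel = sg
    for mapping in permutations(range(len(vl)), len(svl)):
        # vertex labels must match under the mapping
        if any(vl[mapping[i]] != lab for i, lab in enumerate(svl)):
            continue
        # every pattern edge must exist in g with the same label
        if all(el.get(tuple(sorted((mapping[i], mapping[j])))) == lab
               for (i, j), lab in sel.items()):
            return True
    return False

def support_count(db, sg):
    """Number of graphs in `db` that contain the candidate `sg`."""
    return sum(contains_subgraph(g, sg) for g in db)

# G1 and G2 both contain edge (a, b, p); G3 contains (a, b, r) instead.
G1 = (['a', 'b', 'c'], {(0, 1): 'p', (1, 2): 'r'})
G2 = (['a', 'b'],      {(0, 1): 'p'})
G3 = (['a', 'b', 'c'], {(0, 1): 'r', (1, 2): 'p'})
cand = (['a', 'b'], {(0, 1): 'p'})
print(support_count([G1, G2, G3], cand))  # 2
```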
Example: Dataset

  Graph   (a,b,p)   (a,b,q)   (a,b,r)   (b,c,p)   (b,c,q)   (b,c,r)   …   (d,e,r)
  G1      1         0         0         0         0         1         …   0
  G2      1         0         0         0         0         0         …   0
  G3      0         0         1         1         0         0         …   0
  G4      0         0         0         0         0         0         …   0

Example

With minimum support count = 2, the level-wise search proceeds from the frequent 1-subgraphs (single vertices such as a, b, c, d, e) to the frequent 2-subgraphs (edges such as a-b, c-d, d-e) to candidate 3-subgraphs, some of which are pruned because they contain an infrequent 2-subgraph.

Candidate Generation

In Apriori:
– Merging two frequent k-itemsets produces exactly one candidate (k+1)-itemset
In frequent subgraph mining (vertex/edge growing):
– Merging two frequent k-subgraphs may produce more than one candidate (k+1)-subgraph

Multiplicity of Candidates (Vertex Growing)

When G3 = join(G1, G2) is formed by vertex growing, the edge between the two vertices that are not part of the common core is unknown; its entry in M(G3) is a "?" that can be instantiated in more than one way (no edge, or any edge label):

  M(G3) = | 0 p p q 0 |
          | p 0 r 0 0 |
          | p r 0 0 r |
          | q 0 0 0 ? |
          | 0 0 r ? 0 |

Multiplicity of Candidates (Edge Growing)

Case 1: identical vertex labels. If the vertices outside the common core carry the same label, the join can overlay them in more than one way, producing several candidates.
Case 2: the core contains identical labels. When the common (k-1)-subgraph itself contains repeated vertex labels, the two graphs can be aligned on the core in several ways, each yielding a different candidate.
Case 3: core multiplicity. The two graphs may share more than one common (k-1)-subgraph (more than one core), and each core yields its own join results.

(The core is the (k-1)-subgraph that is common between the joined graphs.)

Adjacency Matrix Representation

The same graph can be represented in many ways. For example, a graph with four A vertices and four B vertices admits, among others, these two adjacency matrices, obtained from different vertex orderings:

        A(1) A(2) A(3) A(4) B(5) B(6) B(7) B(8)           A(1) A(2) A(3) A(4) B(5) B(6) B(7) B(8)
  A(1)   1    1    1    0    1    0    0    0       A(1)   1    1    0    1    0    1    0    0
  A(2)   1    1    0    1    0    1    0    0       A(2)   1    1    1    0    0    0    1    0
  A(3)   1    0    1    1    0    0    1    0       A(3)   0    1    1    1    1    0    0    0
  A(4)   0    1    1    1    0    0    0    1       A(4)   1    0    1    1    0    0    0    1
  B(5)   1    0    0    0    1    1    1    0       B(5)   0    0    1    0    1    0    1    1
  B(6)   0    1    0    0    1    1    0    1       B(6)   1    0    0    0    0    1    1    1
  B(7)   0    0    1    0    1    0    1    1       B(7)   0    1    0    0    1    1    1    0
  B(8)   0    0    0    1    0    1    1    1       B(8)   0    0    0    1    1    1    0    1

Graph Isomorphism

Two graphs are isomorphic if they are topologically equivalent, i.e., there is a label-preserving bijection between their vertices that preserves edges.

A test for graph isomorphism is needed:
– during candidate generation, to determine whether a candidate has already been generated
– during candidate pruning, to check whether its (k-1)-subgraphs are frequent
– during candidate counting, to check whether a candidate is contained within another graph

Use canonical labeling to handle isomorphism:
– Map each graph into an ordered string representation (known as its code) such that two isomorphic graphs are mapped to the same canonical encoding
– Example: use the lexicographically largest adjacency matrix

  | 0 0 1 0 |                             | 0 1 1 1 |
  | 0 0 1 1 |   String: 0010001111010110  | 1 0 1 0 |   Canonical: 0111101011001000
  | 1 1 0 1 |                             | 1 1 0 0 |
  | 0 1 1 0 |                             | 1 0 0 0 |
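The canonical code can be computed by brute force over all vertex orderings, which illustrates the definition even though it is exponential; practical miners restrict the permutations considered, e.g. by vertex label and degree. The example reproduces the two code strings above for an unlabeled graph.

```python
from itertools import permutations

def code(adj, perm):
    """Flatten the adjacency matrix under a vertex permutation."""
    n = len(adj)
    return ''.join(str(adj[perm[i]][perm[j]])
                   for i in range(n) for j in range(n))

def canonical(adj):
    """Lexicographically largest code over all vertex orderings.
    Exponential brute force -- a sketch of the definition, not a
    practical canonical-labeling routine."""
    n = len(adj)
    return max(code(adj, p) for p in permutations(range(n)))

A = [[0, 0, 1, 0],
     [0, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 1, 1, 0]]
print(code(A, range(4)))  # 0010001111010110 -- the "String" above
print(canonical(A))       # 0111101011001000 -- the canonical code
```

Two isomorphic graphs always produce the same canonical code, so candidates can be deduplicated by comparing strings instead of running isomorphism tests pairwise.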