Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining Association Rules: Advanced Concepts and Algorithms Lecture Notes Introduction to Data Mining by Tan, Steinbach, Kumar Sequence Data Timeline 10 Sequence Database: Object A A A B B B B C Timestamp 10 20 23 11 17 21 28 14 Events 2, 3, 5 6, 1 1 4, 5, 6 2 7, 8, 1, 2 1, 6 1, 8, 7 15 20 25 30 35 Object A: 2 3 5 6 1 1 Object B: 4 5 6 2 1 6 7 8 1 2 Object C: 1 7 8 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Examples of Sequence Data Sequence Database Sequence Element (Transaction) Event (Item) Customer Purchase history of a given customer A set of items bought by a customer at time t Books, diary products, CDs, etc Web Data Browsing activity of a particular Web visitor A collection of files viewed by a Web visitor after a single mouse click Home page, index page, contact info, etc Event data History of events generated by a given sensor Events triggered by a sensor at time t Types of alarms generated by sensors Genome sequences DNA sequence of a particular species An element of the DNA sequence Bases A,T,G,C Element (Transaction) Sequence © Tan,Steinbach, Kumar E1 E2 E1 E3 E2 Introduction to Data Mining E2 E3 E4 Event (Item) 4/18/2004 ‹#› Formal Definition of a Sequence A sequence is an ordered list of elements (transactions) s = < e1 e2 e3 … > – Each element contains a collection of events (items) ei = {i1, i2, …, ik} – Each element is attributed to a specific time or location Length of a sequence, |s|, is given by the number of elements of the sequence A k-sequence is a sequence that contains k events (items) © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Examples of Sequence Web sequence: < {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} > Sequence of initiating events causing the nuclear accident at 3-mile Island: (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm) < {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases}> Sequence of books checked out at a library: <{Fellowship of the Ring} {The Two Towers} {Return of the King}> © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Formal Definition of a Subsequence A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 bi1 , a2 bi1, …, an bin Data sequence Subsequence Contain? < {2,4} {3,5,6} {8} > < {2} {3,5} > Yes < {1,2} {3,4} > < {1} {2} > No < {2,4} {2,4} {2,5} > < {2} {4} > Yes The support of a subsequence w is defined as the fraction of data sequences that contain w A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup) © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Sequential Pattern Mining: Definition Given: – a database of sequences – a user-specified minimum support threshold, minsup Task: – Find all subsequences with support ≥ minsup © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Sequential Pattern Mining: Challenge Given a sequence: <{a b} {c d e} {f} {g h i}> – Examples of subsequences: <{a} {c d} {f} {g} >, < {c d e} >, < {b} {g} >, etc. How many k-subsequences can be extracted from a given n-sequence? <{a b} {c d e} {f} {g h i}> n = 9 k=4: Y_ <{a} © Tan,Steinbach, Kumar _YY _ _ _Y {d e} Introduction to Data Mining {i}> Answer : n 9 126 k 4 4/18/2004 ‹#› Sequential Pattern Mining: Example Object A A A B B C C C D D D E E Timestamp 1 2 3 1 2 1 2 3 1 2 3 1 2 © Tan,Steinbach, Kumar Events 1,2,4 2,3 5 1,2 2,3,4 1, 2 2,3,4 2,4,5 2 3, 4 4, 5 1, 3 2, 4, 5 Introduction to Data Mining Minsup = 50% Examples of Frequent Subsequences: < {1,2} > < {2,3} > < {2,4}> < {3} {5}> < {1} {2} > < {2} {2} > < {1} {2,3} > < {2} {2,3} > < {1,2} {2,3} > s=60% s=60% s=80% s=80% s=80% s=60% s=60% s=60% s=60% 4/18/2004 ‹#› Extracting Sequential Patterns Given n events: i1, i2, i3, …, in Candidate 1-subsequences: <{i1}>, <{i2}>, <{i3}>, …, <{in}> Candidate 2-subsequences: <{i1, i2}>, <{i1, i3}>, …, <{i1} {i1}>, <{i1} {i2}>, …, <{in-1} {in}> Candidate 3-subsequences: <{i1, i2 , i3}>, <{i1, i2 , i4}>, …, <{i1, i2} {i1}>, <{i1, i2} {i2}>, …, <{i1} {i1 , i2}>, <{i1} {i1 , i3}>, …, <{i1} {i1} {i1}>, <{i1} {i1} {i2}>, … © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Generalized Sequential Pattern (GSP) Step 1: – Make the first pass over the sequence database D to yield all the 1element frequent sequences Step 2: Repeat until no new frequent sequences are found – Candidate Generation: Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items – Support Counting: Make a new pass over the sequence database D to find the support for these candidate sequences – Candidate Elimination: Eliminate candidate k-sequences whose actual support is less than minsup © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› GSP Example Frequent 3-sequences < {1} {2} {3} > < {1} {2 5} > < {1} {5} {3} > < {2} {3} {4} > < {2 5} {3} > < {3} {4} {5} > < {5} {3 4} > © Tan,Steinbach, Kumar Candidate Generation < {1} {2} {3} {4} > < {1} {2 5} {3} > < {1} {5} {3 4} > < {2} {3} {4} {5} > < {2 5} {3 4} > Introduction to Data Mining Candidate Pruning < {1} {2 5} {3} > 4/18/2004 ‹#› Timing Constraints (I) {A B} {C} <= xg {D E} xg: max-gap >ng ng: min-gap ms: maximum span <= ms xg = 2, ng = 0, ms= 4 Data sequence Subsequence Contain? < {2,4} {3,5,6} {4,7} {4,5} {8} > < {6} {5} > Yes < {1} {2} {3} {4} {5}> < {1} {4} > No < {1} {2,3} {3,4} {4,5}> < {2} {3} {5} > Yes < {1,2} {3} {2,3} {3,4} {2,4} {4,5}> < {1,2} {5} > No © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Mining Sequential Patterns with Timing Constraints Approach 1: – Mine sequential patterns without timing constraints – Postprocess the discovered patterns Approach 2: – Modify GSP to directly prune candidates that violate timing constraints © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Frequent Subgraph Mining Extend association rule mining to finding frequent subgraphs Useful for Web Mining, computational chemistry, bioinformatics, spatial data sets, etc Homepage Research Artificial Intelligence Databases Data Mining © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Graph Definitions a a q p p b a p p a s s a s r p a r t t c c p q b (a) Labeled Graph © Tan,Steinbach, Kumar t t r r c c p p b (b) Subgraph Introduction to Data Mining b (c) Induced Subgraph 4/18/2004 ‹#› Representing Transactions as Graphs Each transaction is a clique of items A TID = 1: Transaction Id 1 2 3 4 5 Items {A,B,C,D} {A,B,E} {B,C} {A,B,D,E} {B,C,D} C B E D © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Representing Graphs as Transactions a q e p p p q r b d c G1 G1 G2 G3 G3 (a,b,p) 1 1 0 … © Tan,Steinbach, Kumar (a,b,r) 0 0 1 … (b,c,p) 0 0 1 … Introduction to Data Mining d r a G2 (a,b,q) 0 0 0 … q c r r p p b b d r e a G3 (b,c,q) 0 0 0 … (b,c,r) 1 0 0 … … … … … … 4/18/2004 (d,e,r) 0 0 0 … ‹#› Edge Growing a a q p p a q p f a r + p f p f p a a r r r r a a a G1 G2 G3 = join(G1,G2) © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Apriori-like Algorithm Find frequent 1-subgraphs Repeat – Candidate generation Use frequent (k-1)-subgraphs to generate candidate k-subgraph – Candidate pruning Prune candidate subgraphs that contain infrequent (k-1)-subgraphs – Support counting Count the support of each remaining candidate – Eliminate candidate k-subgraphs that are infrequent In practice, it is not as easy. There are many other issues © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Example: Dataset a q e p r r b p b b d © Tan,Steinbach, Kumar (a,b,r) 0 0 1 0 q Introduction to Data Mining p r c p d G3 (b,c,p) 0 0 1 0 e p d a G2 (a,b,p) (a,b,q) 1 0 1 0 0 0 0 0 c r r c G1 a p p p G1 G2 G3 G4 q q d r e a (b,c,q) 0 0 0 0 G4 (b,c,r) 1 0 0 0 … … … … … (d,e,r) 0 0 0 0 4/18/2004 ‹#› Example Minimum support count = 2 k=1 Frequent Subgraphs a b c p k=2 Frequent Subgraphs a a p a r e b d p d c e p p k=3 Candidate Subgraphs e q b c d d b c p r d e (Pruned candidate) © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Adjacency Matrix Representation A(1) A(1) A(2) A(3) A(4) B(5) B(6) B(7) B(8) A(1) A(2) A(3) A(4) B(5) B(6) B(7) B(8) 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 1 0 0 0 1 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 1 1 1 A(1) A(2) A(3) A(4) B(5) B(6) B(7) B(8) A(1) A(2) A(3) A(4) B(5) B(6) B(7) B(8) 1 1 0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 1 1 1 1 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 1 0 1 1 1 0 0 0 0 1 1 1 0 1 0 0 1 1 1 0 0 0 0 1 1 1 0 1 A(2) B (5) B (6) B (7) B (8) A(3) A(4) A(2) A(1) B (7) B (6) B (5) B (8) A(3) A(4) • The same graph can be represented in many ways © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Graph Isomorphism A graph is isomorphic if it is topologically equivalent to another graph B A A B B A B A B B A A A B A B © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Graph Isomorphism Test for graph isomorphism is needed: – During candidate generation step, to determine whether a candidate has been generated – During candidate pruning step, to check whether its (k-1)-subgraphs are frequent – During candidate counting, to check whether a candidate is contained within another graph © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›