Fast Algorithms for Mining Association Rules
Rakesh Agrawal
Ramakrishnan Srikant
Data Mining Seminar 2003
Outline
- Introduction
- Formal statement
- Apriori Algorithm
- AprioriTid Algorithm
- Comparison
- AprioriHybrid Algorithm
- Conclusions

©Ofer Pasternak

Introduction
- Bar-code technology
- Mining association rules over basket data (1993)
- Example rule: tires ∧ accessories → automotive service
- Applications: cross-marketing, attached mailings
- Very large databases

Notation
- Items: I = {i1, i2, …, im}
- Transaction T: a set of items, T ⊆ I
- Items in a transaction are kept sorted lexicographically
- TID: a unique identifier for each transaction

Notation
- Association rule: X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅

Confidence and Support
- An association rule X → Y has confidence c if c% of the transactions in D that contain X also contain Y.
- An association rule X → Y has support s if s% of the transactions in D contain X ∪ Y.

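The two definitions above can be sketched directly in Python (a minimal sketch; the representation of the database as a list of sets and the function names are mine, not the paper's):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

# The four-transaction example database used later in the deck.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(support({2, 5}, transactions))       # 0.75 (3 of 4 transactions)
print(confidence({2}, {5}, transactions))  # 1.0 (every transaction with 2 has 5)
```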
Notice
- X → A does not imply X ∪ Y → A: the larger rule may not have minimum support.
- X → A and A → Z do not imply X → Z: the composed rule may not have minimum confidence.

Define the Problem
Given a set of transactions D, generate all association rules that have support and confidence greater than the user-specified minimum support and minimum confidence.

Previous Algorithms
- AIS
- SETM
- Knowledge discovery
- Induction of classification rules
- Discovery of causal rules
- Fitting of functions to data
- KID3 (machine learning)

Discovering all Association Rules
- Find all large itemsets: itemsets with support above the minimum support.
- Use the large itemsets to generate the rules.

General idea
- Say ABCD and AB are large itemsets.
- Compute conf = support(ABCD) / support(AB).
- If conf >= minconf, the rule AB → CD holds.

Discovering Large Itemsets
- Multiple passes over the data.
- First pass: count the support of individual items.
- Each subsequent pass:
  - Generate candidates using the previous pass's large itemsets.
  - Go over the data and check the actual support of the candidates.
- Stop when no new large itemsets are found.

The Trick
Any subset of a large itemset is large. Therefore, to find the large k-itemsets:
- Create candidates by combining large (k-1)-itemsets.
- Delete those that contain any subset that is not large.

Algorithm Apriori

    L1 = {large 1-itemsets};                      // count item occurrences
    for (k = 2; Lk-1 ≠ ∅; k++) do begin
        Ck = apriori-gen(Lk-1);                   // generate new candidate k-itemsets
        forall transactions t ∈ D do begin        // find the support of all the candidates
            Ct = subset(Ck, t);
            forall candidates c ∈ Ct do
                c.count++;
        end
        Lk = {c ∈ Ck | c.count >= minsup};        // take only those with support over minsup
    end
    Answer = ∪k Lk;

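The loop above can be sketched in Python. This is a minimal illustration, not the paper's implementation: apriori-gen is inlined in simplified form, and the hash-tree subset() function is replaced by a direct containment test.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset (sorted tuple): support count} for all large itemsets."""
    items = sorted({i for t in transactions for i in t})
    counts = {(i,): sum(1 for t in transactions if i in t) for i in items}
    large = {c: n for c, n in counts.items() if n >= minsup}
    answer = dict(large)
    k = 2
    while large:
        prev = sorted(large)
        # Join: combine two large (k-1)-itemsets identical in their first k-2 items.
        cand = {p + (q[-1],) for p in prev for q in prev
                if p[:-1] == q[:-1] and p[-1] < q[-1]}
        # Prune: every (k-1)-subset of a candidate must itself be large.
        cand = {c for c in cand
                if all(s in large for s in combinations(c, k - 1))}
        # One pass over the data to count candidate occurrences.
        counts = {c: sum(1 for t in transactions if set(c) <= t) for c in cand}
        large = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(large)
        k += 1
    return answer

# The deck's example database, minsup = 2 (as an absolute count).
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(transactions, minsup=2))
```

On this database the result contains the large itemsets shown later in the deck, up to {2 3 5} with support 2.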
Candidate generation

Join step: p and q are two large (k-1)-itemsets identical in their first k-2 items; join them by adding the last item of q to p.

    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Prune step: check all the subsets; remove any candidate with a "small" subset.

    forall itemsets c ∈ Ck do
        forall (k-1)-subsets s of c do
            if (s ∉ Lk-1) then
                delete c from Ck

Example
L3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }
After joining: { {1 2 3 4}, {1 3 4 5} }
After pruning: { {1 2 3 4} }
({1 3 4 5} is pruned because {1 4 5} and {3 4 5} are not in L3.)

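The join/prune example above can be checked with a standalone apriori-gen sketch (itemsets as sorted tuples; the function name mirrors the pseudocode but the Python form is mine):

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Join large (k-1)-itemsets agreeing on their first k-2 items,
    then prune candidates having a (k-1)-subset that is not large."""
    k = len(next(iter(large_prev))) + 1
    joined = {p + (q[-1],) for p in large_prev for q in large_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in large_prev for s in combinations(c, k - 1))}

L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_gen(L3))  # {(1, 2, 3, 4)}; (1,3,4,5) is pruned since (1,4,5) is not in L3
```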
Correctness
- Show that Ck ⊇ Lk: any subset of a large itemset must also be large.
- The join is equivalent to extending Lk-1 with all items and then removing those candidates whose (k-1)-subsets are not in Lk-1.
- The condition p.itemk-1 < q.itemk-1 prevents duplicates.

Subset Function
- The candidate itemsets Ck are stored in a hash-tree.
- It finds in O(k) time whether a candidate itemset of size k is contained in transaction t.
- Total time: O(max(k, size(t))).

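The paper stores Ck in a hash-tree; the sketch below uses a simpler prefix trie (my simplification, not the paper's exact data structure) that supports the same subset(Ck, t) operation: find every candidate contained in a transaction without testing each candidate against the transaction separately.

```python
END = object()  # leaf marker: the node completes a candidate

def build_trie(candidates):
    """Build a prefix trie over candidates given as sorted tuples."""
    root = {}
    for c in candidates:
        node = root
        for item in c:
            node = node.setdefault(item, {})
        node[END] = c
    return root

def subset(node, items, start=0):
    """Yield every candidate in the trie contained in the sorted tuple `items`."""
    if END in node:
        yield node[END]
    for i in range(start, len(items)):
        child = node.get(items[i])
        if child is not None:
            # Only items after position i can extend the match (itemsets are sorted).
            yield from subset(child, items, i + 1)

C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
trie = build_trie(C2)
print(sorted(subset(trie, (2, 3, 5))))  # [(2, 3), (2, 5), (3, 5)]
```

This matches the C^2 entry for transaction 200 in the AprioriTid example later in the deck.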
Problem?
Every pass of Apriori goes over the whole database.

Algorithm AprioriTid
- Uses the database only once.
- Builds a storage set C^k whose members have the form <TID, {Xk}>, where the Xk are potentially large k-itemsets in transaction TID.
- For k = 1, C^1 is the database, with each item replaced by an itemset of size 1.
- Pass k+1 uses C^k instead of the database.

Advantage
- C^k can be smaller than the database: if a transaction does not contain any candidate k-itemset, it is excluded from C^k.
- For large k, each entry may be smaller than the corresponding transaction, since the transaction might contain only a few candidates.

Disadvantage
- For small k, each entry may be larger than the corresponding transaction: an entry includes all candidate k-itemsets contained in the transaction.

Algorithm AprioriTid

    L1 = {large 1-itemsets};                      // count item occurrences
    C^1 = database D;                             // the storage set is initialized with the database
    for (k = 2; Lk-1 ≠ ∅; k++) do begin
        Ck = apriori-gen(Lk-1);                   // generate new candidate k-itemsets
        C^k = ∅;                                  // build a new storage set
        forall entries t ∈ C^k-1 do begin
            // determine the candidate itemsets contained in transaction t.TID
            Ct = {c ∈ Ck | (c - c[k]) ∈ t.set-of-itemsets ∧ (c - c[k-1]) ∈ t.set-of-itemsets};
            forall candidates c ∈ Ct do           // find the support of all the candidates
                c.count++;
            if (Ct ≠ ∅) then C^k += <t.TID, Ct>;  // remove empty entries
        end
        Lk = {c ∈ Ck | c.count >= minsup};        // take only those with support over minsup
    end
    Answer = ∪k Lk;

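The pass structure above can be sketched in Python: after the first pass, counting scans the storage set (the <TID, set-of-itemsets> entries) instead of the raw database. As before, apriori-gen is inlined in simplified form, so this shows the logic rather than the paper's implementation.

```python
from itertools import combinations

def apriori_tid(transactions, minsup):
    """Return {itemset (sorted tuple): support count} for all large itemsets."""
    # Pass 1: C^1 is the database, each item replaced by a 1-itemset.
    storage = [(tid, {(i,) for i in t}) for tid, t in enumerate(transactions)]
    counts = {}
    for _, entry in storage:
        for c in entry:
            counts[c] = counts.get(c, 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup}
    answer = dict(large)
    k = 2
    while large:
        prev = sorted(large)
        cand = {p + (q[-1],) for p in prev for q in prev
                if p[:-1] == q[:-1] and p[-1] < q[-1]}
        cand = {c for c in cand
                if all(s in large for s in combinations(c, k - 1))}
        new_storage, counts = [], {c: 0 for c in cand}
        for tid, entry in storage:
            # c is in the transaction iff both generating (k-1)-subsets,
            # c - c[k] and c - c[k-1], appear in the transaction's entry.
            ct = {c for c in cand
                  if c[:-1] in entry and c[:-2] + (c[-1],) in entry}
            for c in ct:
                counts[c] += 1
            if ct:                       # remove empty entries
                new_storage.append((tid, ct))
        storage = new_storage
        large = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(large)
        k += 1
    return answer

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori_tid(transactions, minsup=2))
```

Running it on the deck's example database reproduces the C^k entries and large itemsets shown on the next slide.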
Example (minsup = 2)

Database:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

C^1:
TID | Set-of-itemsets
100 | { {1}, {3}, {4} }
200 | { {2}, {3}, {5} }
300 | { {1}, {2}, {3}, {5} }
400 | { {2}, {5} }

L1:
Itemset | Support
{1}     | 2
{2}     | 3
{3}     | 3
{5}     | 3

C2: { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5} }

C^2:
TID | Set-of-itemsets
100 | { {1 3} }
200 | { {2 3}, {2 5}, {3 5} }
300 | { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5} }
400 | { {2 5} }

L2:
Itemset | Support
{1 3}   | 2
{2 3}   | 2
{2 5}   | 3
{3 5}   | 2

C3: { {2 3 5} }

C^3:
TID | Set-of-itemsets
200 | { {2 3 5} }
300 | { {2 3 5} }

L3:
Itemset | Support
{2 3 5} | 2

Correctness
Show that the set Ct generated in the k-th pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID.

Correctness
- C^k is correct if, for each entry t of C^k, t.set-of-itemsets does not include any k-itemset not contained in the transaction with identifier t.TID.
- C^k is complete if, for each entry t of C^k, t.set-of-itemsets includes all large k-itemsets contained in the transaction with identifier t.TID.
- Lk is correct if it is the same as the set of all large k-itemsets.

Lemma 1
For k > 1, if C^k-1 is correct and complete, and Lk-1 is correct, then the set Ct generated at the k-th pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID.

Proof
Suppose a candidate itemset c = c[1]·c[2]·…·c[k] is in the transaction with identifier t.TID. Then c1 = (c - c[k]) and c2 = (c - c[k-1]) were in that transaction. Since Ck was built using apriori-gen(Lk-1), all (k-1)-subsets of c ∈ Ck must be large, so c1 and c2 must be large. Since C^k-1 is complete, c1 and c2 were members of t.set-of-itemsets, and therefore c will be a member of Ct.

Proof (continued)
Conversely, suppose c1 (or c2) is not in the transaction with identifier t.TID. Since C^k-1 is correct, c1 (or c2) is not in t.set-of-itemsets; then c ∈ Ck is not contained in transaction t.TID, and c will not be a member of Ct.

Correctness
Lemma 2
For k > 1, if Lk-1 is correct and the set Ct generated in the k-th step is the same as the set of candidate k-itemsets in Ck contained in transaction t.TID, then the set C^k is correct and complete.

Proof
- Completeness: apriori-gen guarantees Ck ⊇ Lk, so Ct includes all large k-itemsets in t.TID, and these are added to C^k. Hence C^k is complete.
- Correctness: Ct includes only itemsets contained in t.TID, and only itemsets in Ct are added to C^k. Hence C^k is correct.

Correctness
Theorem 1
For k > 1, the set Ct generated in the k-th pass is the same as the set of candidate k-itemsets in Ck contained in transaction t.TID.
To show: C^k is correct and complete, and Lk is correct, for all k >= 1.

Proof (by induction on k)
- k = 1: C^1 is the database, so it is correct and complete.
- Assume the claim holds for k = n. By Lemma 1, the set Ct generated in pass n+1 consists of exactly those itemsets in Cn+1 contained in transaction t.TID. Apriori-gen guarantees Cn+1 ⊇ Ln+1, and Ct is correct, so Ln+1 is correct. By Lemma 2, C^n+1 is then correct and complete.
- Hence C^k is correct and complete for all k >= 1, and the theorem holds.

General idea (reminder)
- Say ABCD and AB are large itemsets.
- Compute conf = support(ABCD) / support(AB).
- If conf >= minconf, the rule AB → CD holds.

Discovering Rules
For every large itemset l:
- Find all non-empty subsets of l.
- For every subset a, produce the rule a → (l - a) and accept it if support(l) / support(a) >= minconf.

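The step above can be sketched directly: enumerate every non-empty proper subset a of each large itemset l and keep a → (l - a) when its confidence clears minconf. Here `supports` maps each large itemset (as a sorted tuple) to its support count, as produced by Apriori; the naive enumeration is mine, before the deck's DFS optimizations.

```python
from itertools import combinations

def gen_rules(supports, minconf):
    """Return (antecedent, consequent, confidence) triples for all accepted rules."""
    rules = []
    for l in supports:
        if len(l) < 2:
            continue
        for m in range(1, len(l)):            # all non-empty proper subset sizes
            for a in combinations(l, m):
                conf = supports[l] / supports[a]
                if conf >= minconf:
                    consequent = tuple(i for i in l if i not in a)
                    rules.append((a, consequent, conf))
    return rules

# Large itemsets and support counts from the deck's example (minsup = 2).
supports = {(1,): 2, (2,): 3, (3,): 3, (5,): 3,
            (1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2, (2, 3, 5): 2}
for a, c, conf in gen_rules(supports, minconf=1.0):
    print(a, "->", c, conf)
```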
Checking the subsets
For efficiency, generate the subsets using recursive DFS: if a subset a does not produce a rule, we do not need to check subsets of a.
Example: given the itemset ABCD, if ABC → D does not have enough confidence, then surely AB → CD will not hold.

Why?
For any subset â of a:
support(â) >= support(a), so
confidence(â → (l - â)) = support(l) / support(â) <= support(l) / support(a) = confidence(a → (l - a)).

Simple Algorithm

    forall large itemsets lk, k >= 2 do           // check all the large itemsets
        genrules(lk, lk);

    procedure genrules(lk: large k-itemset, am: large m-itemset)
        A = {(m-1)-itemsets am-1 | am-1 ⊂ am};    // check all the subsets
        forall am-1 ∈ A do begin
            conf = support(lk) / support(am-1);   // check the confidence of the new rule
            if (conf >= minconf) then begin
                output the rule am-1 → (lk - am-1);
                if (m - 1 > 1) then
                    call genrules(lk, am-1);      // continue the DFS over the subsets
            end
            // if there is no confidence, the DFS branch is cut here
        end

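A Python sketch of genrules above (names mine): a recursive DFS over subsets of the large itemset, cutting a branch as soon as an antecedent fails minconf, since subsets of a failing antecedent can only give lower confidence. As in the pseudocode, the same subset can be reached along several DFS paths for larger itemsets.

```python
def genrules(lk, am, supports, minconf, out):
    """DFS over (m-1)-subsets of am, emitting rules a -> (lk - a) into out."""
    for drop in am:
        a = tuple(i for i in am if i != drop)     # an (m-1)-subset of am
        if not a:
            continue
        conf = supports[lk] / supports[a]
        if conf >= minconf:
            out.append((a, tuple(i for i in lk if i not in a), conf))
            if len(a) > 1:
                genrules(lk, a, supports, minconf, out)  # continue the DFS
        # else: branch cut, no recursion below a failing antecedent

supports = {(2,): 3, (3,): 3, (5,): 3,
            (2, 3): 2, (2, 5): 3, (3, 5): 2, (2, 3, 5): 2}
rules = []
genrules((2, 3, 5), (2, 3, 5), supports, minconf=1.0, out=rules)
print(rules)  # [((3, 5), (2,), 1.0), ((2, 3), (5,), 1.0)]
```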
Faster Algorithm
Idea: if (l - c) → c holds, then all the rules (l - ĉ) → ĉ must hold, where ĉ is any non-empty subset of c.
Example: if AB → CD holds, then so do ABC → D and ABD → C.

Faster Algorithm
- From a large itemset l, generate all rules with one item in the consequent.
- Use those consequents and apriori-gen to generate all possible 2-item consequents, and so on.
- The candidate set of the faster algorithm is a subset of the candidate set of the simple algorithm.

Faster Algorithm

    forall large k-itemsets lk, k >= 2 do begin
        // find all 1-item consequents (using one pass of the simple algorithm)
        H1 = {consequents of rules derived from lk with one item in the consequent};
        call ap-genrules(lk, H1);
    end

    procedure ap-genrules(lk: large k-itemset, Hm: set of m-item consequents)
        if (k > m + 1) then begin
            Hm+1 = apriori-gen(Hm);               // generate new (m+1)-item consequents
            forall hm+1 ∈ Hm+1 do begin
                conf = support(lk) / support(lk - hm+1);   // check the confidence of the new rule
                if (conf >= minconf) then
                    output the rule (lk - hm+1) → hm+1
                        with confidence = conf and support = support(lk);
                else
                    delete hm+1 from Hm+1;        // if a consequent doesn't hold, don't look for bigger ones
            end
            call ap-genrules(lk, Hm+1);           // continue with bigger consequents
        end

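A sketch of the level-wise scheme above (function names mine): 1-item consequents seed H1, apriori-gen over the m-item consequents proposes (m+1)-item consequents, and a failing consequent is deleted so no superset of it is ever tried.

```python
from itertools import combinations

def apriori_gen(hm):
    """Join m-itemsets agreeing on their first m-1 items, then prune."""
    k = len(next(iter(hm))) + 1
    joined = {p + (q[-1],) for p in hm for q in hm
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined if all(s in hm for s in combinations(c, k - 1))}

def rules_fast(lk, supports, minconf):
    """All rules from large itemset lk, growing consequents level-wise."""
    out, hm = [], set()
    for i in lk:                                  # 1-item consequents
        ante = tuple(j for j in lk if j != i)
        conf = supports[lk] / supports[ante]
        if conf >= minconf:
            out.append((ante, (i,), conf))
            hm.add((i,))
    while hm and len(lk) > len(next(iter(hm))) + 1:   # k > m + 1
        hm1 = apriori_gen(hm)
        for h in list(hm1):
            ante = tuple(i for i in lk if i not in h)
            conf = supports[lk] / supports[ante]
            if conf >= minconf:
                out.append((ante, h, conf))
            else:
                hm1.discard(h)                    # cut all supersets of h
        hm = hm1
    return out

supports = {(2,): 3, (3,): 3, (5,): 3,
            (2, 3): 2, (2, 5): 3, (3, 5): 2, (2, 3, 5): 2}
print(rules_fast((2, 3, 5), supports, minconf=1.0))
```

On this example the 2-item consequent {2 5} is proposed but rejected, so only the two 1-item-consequent rules survive, matching the simple algorithm's output.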
Advantage
Example: for the large itemset ABCDE, suppose the one-item-consequent rules are ACDE → B and ABCE → D.
- The simple algorithm will check ABC → DE, ABE → CD, BCE → AD and ACE → BD.
- The faster algorithm will check only ACE → BD, which is also the only rule that holds.

Example
- Simple algorithm, from the large itemset ABCDE: the rules that hold are ACDE → B and ABCE → D, and the DFS under them also tries CDE → AB, ADE → BC, ACD → BE, ACE → BD and BCE → AD.
- Faster algorithm, from ABCDE: the one-item consequents B and D are joined into BD, so beyond the one-item rules only ACE → BD is tried.

Results
- Compare the performance of Apriori and AprioriTid to each other and to the previously known algorithms AIS and SETM.
- Both AIS and SETM generate candidates "on the fly"; SETM was designed for use over SQL.
- The algorithms differ in the method of generating all large itemsets.

Method
Check the algorithms on the same databases:
- Synthetic data
- Real data

Synthetic Data
- Choose the parameters to be compared.
- Transaction sizes and large-itemset sizes are each clustered around a mean.
- Parameters for data generation:
  - D – number of transactions
  - T – average size of a transaction
  - I – average size of the maximal potentially large itemsets
  - L – number of maximal potentially large itemsets
  - N – number of items

Synthetic Data
Experiment values: N = 1000, L = 2000, and the datasets
- T5.I2.D100k
- T10.I2.D100k
- T10.I4.D100k
- T20.I2.D100k
- T20.I4.D100k
- T20.I6.D100k
(For example, T5.I2.D100k means T = 5, I = 2, D = 100,000.)

[Performance graphs on the synthetic datasets]
- SETM's values are too big to fit the graphs.
- Apriori always beats AIS.
- Apriori is better than AprioriTid on large problems.

Explaining the Results
- AprioriTid uses C^k instead of the database. If C^k fits in memory, AprioriTid is faster than Apriori.
- When C^k is too big to fit in memory, the computation time is much longer, so Apriori is faster than AprioriTid.

Reality Check
Retail sales:
- 63 departments
- 46,873 transactions (average size 2.47)
- A small database: C^k fits in memory.

Reality Check
Mail order:
- 15,836 items
- 2.9 million transactions (average size 2.62)

Mail customer:
- 15,836 items
- 213,972 transactions (average size 31)

So who is better?
Look at the passes: at the final stages, C^k is small enough to fit in memory.

Algorithm AprioriHybrid
- Use Apriori in the initial passes.
- Estimate the size of C^k as Σ_{candidates c ∈ Ck} support(c) + number of transactions.
- Switch to AprioriTid when C^k is expected to fit in memory.
- The switch takes time, but it is still better in most cases.

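The switch heuristic above is a one-liner once the candidate counts are available; the sketch below (names and the example numbers are illustrative) estimates the number of entries the next storage set would hold, to be compared against available memory.

```python
def estimated_ck_size(candidate_counts, num_transactions):
    """Estimated size of C^k: one entry per candidate occurrence in the
    data (the sum of the candidates' support counts), plus one TID per
    transaction that survives into the storage set."""
    return sum(candidate_counts.values()) + num_transactions

# Support counts of the C2 candidates that were large in the deck's example.
counts = {(1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2}
print(estimated_ck_size(counts, 4))  # 13
```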
[Performance graphs: AprioriHybrid vs. Apriori and AprioriTid]

Scale-up experiment
[Scale-up graphs]

Conclusions
- The Apriori algorithms are better than the previous algorithms: for small problems by factors, for large problems by orders of magnitude.
- The algorithms are best combined (AprioriHybrid).
- The algorithm shows good results in scale-up experiments.

Summary
- Association rules are an important tool in analyzing databases.
- We have seen an algorithm that finds all association rules in a database.
- The algorithm has better time results than previous algorithms.
- The algorithm maintains its performance for large databases.

End