Fast Algorithms for Mining Association Rules
Rakesh Agrawal, Ramakrishnan Srikant
Slides from Ofer Pasternak, Data Mining Seminar 2003

Introduction
Bar-code technology
Mining association rules over basket data (93)
Tires ∧ accessories ⇒ automotive service
Cross-marketing, attached mail.
Very large databases.

Notation
Items – I = {i_1, i_2, …, i_m}
Transaction – set of items T ⊆ I
  – Items are sorted lexicographically
TID – unique identifier for each transaction

Notation
Association rule – X ⇒ Y
X ⊂ I, Y ⊂ I and X ∩ Y = ∅

Confidence and Support
Association rule X ⇒ Y has confidence c if c% of the transactions in D that contain X also contain Y.
Association rule X ⇒ Y has support s if s% of the transactions in D contain both X and Y.

Define the Problem
Given a set of transactions D, generate all association rules that have support and confidence greater than the user-specified minimum support and minimum confidence.

Discovering all Association Rules
Find all large itemsets – itemsets with support above minimum support.
Use the large itemsets to generate the rules.

General idea
Say ABCD and AB are large itemsets.
Compute conf = support(ABCD) / support(AB).
If conf >= minconf, the rule AB ⇒ CD holds.

Discovering Large Itemsets
Multiple passes over the data
First pass – count the support of individual items.
Subsequent pass
  – Generate candidates using the previous pass's large itemsets.
  – Go over the data and check the actual support of the candidates.
Stop when no new large itemsets are found.

The Trick
Any subset of a large itemset is large.
Therefore, to find large k-itemsets:
  – Create candidates by combining large (k-1)-itemsets.
  – Delete those that contain any subset that is not large.
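The support and confidence definitions, and the rule test from the "General idea" slide, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`support`, `confidence`, `rule_holds`) and the toy transaction list `D` are hypothetical and do not come from the slides, and support is computed as a fraction rather than a percentage.

```python
from typing import FrozenSet, Iterable, Sequence

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: Sequence[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: Sequence[Transaction]) -> float:
    """conf(X => Y) = support(X union Y) / support(X)."""
    x_set, y_set = frozenset(x), frozenset(y)
    sx = support(x_set, transactions)
    if sx == 0:
        return 0.0
    return support(x_set | y_set, transactions) / sx


def rule_holds(x: Iterable[str], y: Iterable[str],
               transactions: Sequence[Transaction],
               minsup: float, minconf: float) -> bool:
    """True if X => Y meets both thresholds (given as fractions)."""
    x_set, y_set = frozenset(x), frozenset(y)
    return (support(x_set | y_set, transactions) >= minsup
            and confidence(x_set, y_set, transactions) >= minconf)


# Hypothetical toy data: each transaction is a set of single-letter items.
D = [frozenset("ABCD"), frozenset("ABCD"), frozenset("AB"),
     frozenset("ACD"), frozenset("BC")]
```

With this data, AB appears in 3 of 5 transactions and ABCD in 2 of 5, so the confidence of AB ⇒ CD is 2/3; the rule passes a minconf of 0.6 and fails a minconf of 0.7.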
Algorithm Apriori
    L_1 = {large 1-itemsets};
    for (k = 2; L_{k-1} ≠ ∅; k++) do begin
        C_k = apriori-gen(L_{k-1});
        forall transactions t ∈ D do begin
            C_t = subset(C_k, t);
            forall candidates c ∈ C_t do
                c.count++;
        end
        L_k = {c ∈ C_k | c.count ≥ minsup}
    end
    Answer = ∪_k L_k;
Notes on the steps:
  – L_1: count item occurrences.
  – apriori-gen: generate new k-itemset candidates.
  – Loop over the transactions: find the support of all the candidates.
  – L_k: take only those with support over minsup.

Candidate generation
Join step: p and q are two large (k-1)-itemsets identical in their first k-2 items. Join by adding the last item of q to p.
    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1};
Prune step: check all the subsets and remove any candidate with a "small" subset.
    forall itemsets c ∈ C_k do
        forall (k-1)-subsets s of c do
            if (s ∉ L_{k-1}) then
                delete c from C_k;

Example
L_3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }
After joining: { {1 2 3 4}, {1 3 4 5} }
After pruning: { {1 2 3 4} }
{1 3 4 5} is removed because {1 4 5} and {3 4 5} are not in L_3.

Correctness
Show that C_k ⊇ L_k.
Any subset of a large itemset must also be large.
The join is equivalent to extending L_{k-1} with all items and removing those whose (k-1)-subsets are not in L_{k-1}.
The join condition p.item_{k-1} < q.item_{k-1} prevents duplications.

Subset Function
subset(C_k, t) is the call in the inner loop of Algorithm Apriori (C_t = subset(C_k, t)).
Candidate itemsets C_k are stored in a hash-tree.
Finds in O(k) time whether a candidate itemset of size k is contained in transaction t.
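The containment test behind the subset function can be illustrated with a simple merge scan over two sorted sequences. This sketch is not the hash-tree from the slides: it checks every candidate against the transaction one by one, whereas the hash-tree narrows down which candidates need to be checked at all. The names `contains` and `subset` are illustrative, and both the transaction and the candidates are assumed to be sorted, as stated on the Notation slide.

```python
from typing import List, Sequence, Tuple

Itemset = Tuple[int, ...]


def contains(transaction: Sequence[int], candidate: Sequence[int]) -> bool:
    """Return True if every item of `candidate` occurs in `transaction`.

    Both sequences must be sorted in ascending order. The scan walks the
    transaction once and advances through the candidate on each match.
    """
    i = 0  # index of the next candidate item still to be matched
    for item in transaction:
        if i == len(candidate):
            break
        if item == candidate[i]:
            i += 1
        elif item > candidate[i]:
            # The transaction has moved past candidate[i]; it cannot match.
            return False
    return i == len(candidate)


def subset(candidates: Sequence[Itemset], t: Sequence[int]) -> List[Itemset]:
    """Candidates contained in transaction t (linear scan, no hash-tree)."""
    return [c for c in candidates if contains(t, c)]
```

For example, `subset([(1, 2, 3), (1, 3, 5), (2, 4, 6)], (1, 2, 3, 5))` keeps the first two candidates and drops `(2, 4, 6)`, because 4 and 6 do not occur in the transaction.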
Total time of the subset check: O(max(k, size(t))).
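Putting the pieces together, the Apriori loop with its join and prune steps can be sketched as below. This is a minimal illustration, not the authors' implementation: candidates are kept in a plain Python set instead of a hash-tree, itemsets are represented as sorted tuples of integers, and `minsup_count` is assumed to be an absolute transaction count rather than a percentage. The function names mirror the pseudocode but are otherwise hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Itemset = Tuple[int, ...]


def apriori_gen(prev_large: Set[Itemset]) -> Set[Itemset]:
    """Generate candidate k-itemsets from the large (k-1)-itemsets."""
    prev = sorted(prev_large)
    # Join step: p and q agree on the first k-2 items and p's last item is
    # smaller than q's, so each candidate is produced exactly once.
    joined = {p + (q[-1],)
              for p in prev for q in prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Prune step: drop candidates that have a (k-1)-subset that is not large.
    return {c for c in joined
            if all(s in prev_large for s in combinations(c, len(c) - 1))}


def apriori(transactions: Iterable[Iterable[int]],
            minsup_count: int) -> Dict[Itemset, int]:
    """Return every large itemset together with its support count."""
    data: List[Set[int]] = [set(t) for t in transactions]

    # First pass: count the support of individual items.
    counts: Dict[Itemset, int] = defaultdict(int)
    for t in data:
        for item in t:
            counts[(item,)] += 1
    large = {c for c, n in counts.items() if n >= minsup_count}
    answer = {c: counts[c] for c in large}

    # Subsequent passes: generate candidates, then count them in the data.
    while large:
        candidates = apriori_gen(large)
        counts = defaultdict(int)
        for t in data:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        large = {c for c in candidates if counts[c] >= minsup_count}
        answer.update({c: counts[c] for c in large})
    return answer


# The L_3 example from the slides.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_gen(L3))  # {(1, 2, 3, 4)}
```

Representing itemsets as sorted tuples keeps the join condition a simple prefix comparison, and `itertools.combinations` on a sorted tuple yields sorted subsets, so the prune step can look them up directly in the previous level.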