Data Mining Association Rules
Jen-Ting Tony Hsiao and Alexandros Ntoulas
CS 240B, May 21, 2002

Outline
- Overview of data mining association rules
- Algorithm Apriori
- FP-Trees

Data Mining Association Rules
- Data mining is motivated by the decision support problem faced by most large retail organizations
- Bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as basket data
- Successful organizations view such databases as important pieces of the marketing infrastructure
- Organizations are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies

Data Mining Association Rules
- For example, 98% of customers that purchase tires and auto accessories also get automotive services done
- Finding all such rules is valuable for cross-marketing and attached mailing applications
- Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns
- The databases involved in these applications are very large, so fast algorithms for this task are imperative

Formal Definition of the Problem
Formal statement of the problem [AIS93]:
- Let I = {i1, i2, ..., im} be a set of literals, called items
- Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I
- A transaction T contains X, a set of some items in I, if X ⊆ T
- An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅
- The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y
- The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y

Example of Confidence and Support
Given the following transactions:
    {1 2 3 4}
    {1 2 4 5}
    {1 2 5}
- 1 ⇒ 2 has 100% confidence, with 100% support
- 3 ⇒ 4 has 100% confidence, with 33% support
- 2 ⇒ 3 has 33% confidence, with 33% support
(These numbers are re-checked in a short code sketch after the apriori-gen example below.)

Problem Decomposition
1. Find all sets of items (itemsets) whose transaction support is above the minimum support. The support for an itemset is the number of transactions that contain the itemset. Itemsets with minimum support are called large itemsets; all others are small itemsets.
2. Use the large itemsets to generate the desired rules. If ABCD and AB are large itemsets, we can determine whether the rule AB ⇒ CD holds by computing the ratio conf = support(ABCD)/support(AB). If conf ≥ minconf, the rule holds. (The rule will surely have minimum support, because ABCD is large.)

Discovering Large Itemsets
- Algorithms for discovering large itemsets make multiple passes over the data
- In the first pass, we count the support of individual items and determine which of them are large, i.e., have minimum support
- In each subsequent pass, we start with a seed set of itemsets found to be large in the previous pass. We use this seed set to generate new potentially large itemsets, called candidate itemsets, and count the actual support for these candidates during the pass over the data
- At the end of the pass, we determine which of the candidate itemsets are actually large; they become the seed for the next pass
- This process continues until no new large itemsets are found

Notations
- The number of items in an itemset is its size; we call an itemset of size k a k-itemset
- Items within an itemset are kept in lexicographic order
- k-itemset: an itemset having k items
- Lk: the set of large k-itemsets (those with minimum support); each member has two fields: i) itemset and ii) support count
- Ck: the set of candidate k-itemsets (potentially large itemsets); each member has two fields: i) itemset and ii) support count

Algorithm Apriori
    L1 = {large 1-itemsets};
    for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
        Ck = apriori-gen(Lk-1);        // new candidates
        forall transactions t ∈ D do begin
            Ct = subset(Ck, t);        // candidates contained in t
            forall candidates c ∈ Ct do
                c.count++;
        end
        Lk = {c ∈ Ck | c.count ≥ minsup}
    end
    Answer = ∪k Lk;

Apriori Candidate Generation
The apriori-gen function takes as argument Lk-1, the set of all large (k-1)-itemsets. It returns a superset of the set of all large k-itemsets.

First, in the join step, we join Lk-1 with Lk-1:
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;

Next, in the prune step, we delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1:
    forall itemsets c ∈ Ck do
        forall (k-1)-subsets s of c do
            if (s ∉ Lk-1) then
                delete c from Ck;

Apriori-gen Example
- Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
- In the join step:
  - {1 2 3} joins with {1 2 4} to produce {1 2 3 4}
  - {1 3 4} joins with {1 3 5} to produce {1 3 4 5}
  - After the join step, C4 is {{1 2 3 4}, {1 3 4 5}}
- In the prune step:
  - {1 2 3 4} is tested for the existence of its 3-item subsets within L3: {1 2 3}, {1 2 4}, {1 3 4}, and {2 3 4} are all found
  - {1 3 4 5} is tested for {1 3 4}, {1 3 5}, {1 4 5}, and {3 4 5}; {1 4 5} is not found, so this set is deleted
- We are then left with only {1 2 3 4} in C4
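The join and prune steps above translate almost line for line into code. Below is a minimal Python sketch of candidate generation (the function and variable names are ours, and itemsets are represented as lexicographically sorted tuples; this is an illustration, not the paper's implementation), reproducing the L3 example:

    from itertools import combinations

    def apriori_gen(L_prev):
        """Generate candidate k-itemsets Ck from the large (k-1)-itemsets.

        L_prev is a set of (k-1)-itemsets, each a lexicographically
        sorted tuple.
        """
        # Join step: merge pairs that agree on all but the last item.
        candidates = set()
        for p in L_prev:
            for q in L_prev:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    candidates.add(p + (q[-1],))
        # Prune step: drop candidates with a (k-1)-subset not in L_prev.
        return {c for c in candidates
                if all(s in L_prev for s in combinations(c, len(c) - 1))}

    L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
    print(apriori_gen(L3))  # {(1, 2, 3, 4)} -- {1 3 4 5} is pruned away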
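Before moving on to rule generation, the support and confidence numbers from the three-transaction example earlier can be re-checked with a few lines of Python (again a sketch, with helper names of our own choosing):

    # Transactions from the confidence-and-support example slide.
    transactions = [{1, 2, 3, 4}, {1, 2, 4, 5}, {1, 2, 5}]

    def support(itemset, db):
        """Fraction of transactions that contain every item of `itemset`."""
        return sum(itemset <= t for t in db) / len(db)

    def confidence(antecedent, consequent, db):
        """conf(X => Y) = support(X u Y) / support(X)."""
        return support(antecedent | consequent, db) / support(antecedent, db)

    print(confidence({1}, {2}, transactions))  # 1.0:  1 => 2, 100% confidence
    print(support({1, 2}, transactions))       # 1.0:  ...with 100% support
    print(confidence({3}, {4}, transactions))  # 1.0:  3 => 4, 100% confidence
    print(support({3, 4}, transactions))       # 0.33: ...but only 33% support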
Discovering Rules
- For every large itemset l, we find all non-empty subsets of l
- For every such subset a, we output a rule of the form a ⇒ (l - a) if the ratio of support(l) to support(a) is at least minconf
- We consider all subsets of l in order to generate rules with multiple consequents

Simple Algorithm
- We can improve the preceding procedure by generating the subsets of a large itemset in a recursive, depth-first fashion
- For example, given an itemset ABCD, we first consider the subset ABC, then AB, etc.
- If a subset a of a large itemset l does not generate a rule, then the subsets of a need not be considered for generating rules using l
- For example, if ABC ⇒ D does not have enough confidence, we need not check whether AB ⇒ CD holds

Simple Algorithm
    forall large itemsets lk, k ≥ 2 do
        call genrules(lk, lk)

    procedure genrules(lk: large k-itemset, am: large m-itemset)
        A = {(m-1)-itemsets am-1 | am-1 ⊂ am};
        forall am-1 ∈ A do begin
            conf = support(lk) / support(am-1);
            if (conf ≥ minconf) then begin
                output the rule am-1 ⇒ (lk - am-1),
                    with confidence = conf and support = support(lk);
                if (m - 1 > 1) then
                    // generate rules with subsets of am-1 as antecedents
                    call genrules(lk, am-1);
            end
        end

Faster Algorithm
- Earlier, we showed that if a rule does not hold, then no rule formed by moving items from its antecedent into its consequent holds either
- For example, if ABC ⇒ D does not have enough confidence, AB ⇒ CD will not hold either
- This can also be applied in the reverse direction: if a rule holds, then every rule formed by moving items from its consequent into its antecedent holds as well
- For example, if AB ⇒ CD has enough confidence, ABC ⇒ D will also hold

Faster Algorithm
    forall large itemsets lk, k ≥ 2 do begin
        H1 = {consequents of rules derived from lk with one item in the consequent};
        call ap-genrules(lk, H1);
    end

    procedure ap-genrules(lk: large k-itemset, Hm: set of m-item consequents)
        if (k > m + 1) then begin
            Hm+1 = apriori-gen(Hm);
            forall hm+1 ∈ Hm+1 do begin
                conf = support(lk) / support(lk - hm+1);
                if (conf ≥ minconf) then
                    output the rule (lk - hm+1) ⇒ hm+1
                        with confidence = conf and support = support(lk);
                else
                    delete hm+1 from Hm+1;
            end
            call ap-genrules(lk, Hm+1);
        end

Faster Algorithm
- For example, consider the large itemset ABCDE, and assume that ACDE ⇒ B and ABCE ⇒ D are the only one-item-consequent rules with minimum confidence
- In the simple algorithm, the recursive call genrules(ABCDE, ACDE) will test whether the two-item-consequent rules ACD ⇒ BE, ADE ⇒ BC, CDE ⇒ AB, and ACE ⇒ BD hold
- The first of these rules cannot hold, because E ∈ BE and ABCD ⇒ E does not have minimum confidence; the second and third rules cannot hold for similar reasons
- The only two-item-consequent rule that can possibly hold is ACE ⇒ BD, where B and D are the consequents of the valid one-item-consequent rules; this is the only rule the faster algorithm will test
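The simple algorithm's depth-first rule generation also fits in a short Python sketch. This is an illustration with our own names, not the paper's code; we additionally keep a `seen` set, which the pseudocode above does not have, so that an antecedent reachable along several recursion paths is only emitted once:

    from itertools import combinations

    def gen_rules(itemset, support, minconf):
        """Depth-first rule generation for one large itemset.

        `itemset` is a sorted tuple; `support` maps each subset of it
        (as a sorted tuple) to a support count.
        """
        rules, seen = [], set()

        def recurse(antecedent):
            for a in combinations(antecedent, len(antecedent) - 1):
                if not a or a in seen:
                    continue
                seen.add(a)
                conf = support[itemset] / support[a]
                if conf >= minconf:
                    consequent = tuple(i for i in itemset if i not in a)
                    rules.append((a, consequent, conf))
                    if len(a) > 1:
                        recurse(a)  # subsets of `a` may still yield rules
                # else: no subset of `a` can yield a rule, prune this branch

        recurse(itemset)
        return rules

With some made-up support counts for an itemset ABC (the numbers are purely illustrative):

    sup = {('A', 'B', 'C'): 2, ('A', 'B'): 2, ('A', 'C'): 3, ('B', 'C'): 2,
           ('A',): 3, ('B',): 2, ('C',): 3}
    for ant, cons, conf in gen_rules(('A', 'B', 'C'), sup, minconf=0.9):
        print(ant, '=>', cons)   # AB => C, B => AC, BC => A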
Mining Frequent Itemsets
- If we use candidate generation:
  - We need to generate a huge number of candidate sets
  - We need to repeatedly scan the database and check a large set of candidates by pattern matching
- Can we avoid that? FP-Trees (Frequent Pattern Trees)

FP-Tree General Idea
- Divide and conquer
- Outline of the procedure:
  - For each item, create its conditional pattern base
  - Then expand it and create its conditional FP-Tree
  - Repeat the process recursively on each constructed FP-Tree
  - Stop when the FP-Tree is either empty or contains only one path

Example
    TID   Items bought
    100   {f, a, c, d, g, i, m, p}
    200   {a, b, c, f, l, m, o}
    300   {b, f, h, j, o}
    400   {b, c, k, s, p}
    500   {a, f, c, e, l, p, m, n}

For support = 50%:
- Step 1: Scan the DB and find the frequent 1-itemsets: f:4, c:4, a:3, b:3, m:3, p:3
- Step 2: Order them in descending frequency order
- Step 3: Scan the DB again and construct the FP-Tree

Example – Conditional Pattern Base
- Start with the frequent-item header table
- Traverse the tree by following the links from each frequent item
- Accumulate the prefix paths to form the item's conditional pattern base

Header table: f:4, c:4, a:3, b:3, m:3, p:3

FP-Tree (item:count):
    root
    ├── f:4
    │   ├── c:3
    │   │   └── a:3
    │   │       ├── m:2
    │   │       │   └── p:2
    │   │       └── b:1
    │   │           └── m:1
    │   └── b:1
    └── c:1
        └── b:1
            └── p:1

Conditional pattern bases:
    item   conditional pattern base
    c      f:3
    a      fc:3
    b      fca:1, f:1, c:1
    m      fca:2, fcab:1
    p      fcam:2, cb:1

Example – Conditional FP-Tree
- For every pattern base:
  - Accumulate the frequency of each item
  - Construct an FP-Tree over the frequent items of the pattern base
- m-conditional pattern base: fca:2, fcab:1
- m-conditional FP-Tree: root → f:3 → c:3 → a:3
- All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

Mining The FP-Tree
- Construct the conditional pattern base for each node in the initial FP-Tree
- Construct a conditional FP-Tree from each conditional pattern base
- Recursively visit the conditional FP-Trees and expand the frequent patterns
- If a conditional FP-Tree contains only one path, enumerate all the combinations of its items

Example – Mining The FP-Tree
    Item   Conditional pattern base     Conditional FP-Tree
    p      {(fcam:2), (cb:1)}           {(c:3)} | p
    m      {(fca:2), (fcab:1)}          {(f:3, c:3, a:3)} | m
    b      {(fca:1), (f:1), (c:1)}      none
    a      {(fc:3)}                     {(f:3, c:3)} | a
    c      {(f:3)}                      {(f:3)} | c
    f      none                         none

Example – Mining The FP-Tree
Recursively visit the conditional FP-Trees:
- m-conditional FP-Tree:   root → f:3 → c:3 → a:3
- am-conditional FP-Tree:  root → f:3 → c:3
- cm-conditional FP-Tree:  root → f:3
- cam-conditional FP-Tree: root → f:3

Example – Mining The FP-Tree
- If we have a single path, the frequent itemsets of the tree can be generated by enumerating all combinations of its nodes
- For example, the m-conditional FP-Tree root → f:3 → c:3 → a:3 yields all the patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
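To make the construction step concrete, here is a minimal Python sketch of an FP-Tree that reproduces the conditional pattern bases above. It is a simplification under stated assumptions: all names are ours, the frequency-descending item order is passed in explicitly so it matches the slides' header table, and the header table keeps a plain list of nodes per item instead of threaded node-links:

    from collections import defaultdict

    class Node:
        """One FP-Tree node: item label, count, parent and child links."""
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 0, {}

    def build_fp_tree(transactions, rank):
        """Insert each transaction's frequent items, in `rank` order.

        Returns the root plus a header table: item -> list of its nodes.
        """
        root, header = Node(None, None), defaultdict(list)
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                if item not in node.children:
                    node.children[item] = Node(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
                node.count += 1
        return root, header

    def conditional_pattern_base(header, item):
        """Collect the prefix path (and count) above every `item` node."""
        base = []
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append(("".join(reversed(path)), node.count))
        return base

    db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
    rank = {item: i for i, item in enumerate("fcabmp")}  # header-table order
    root, header = build_fp_tree(db, rank)
    print(conditional_pattern_base(header, "m"))  # [('fca', 2), ('fcab', 1)]
    print(conditional_pattern_base(header, "p"))  # [('fcam', 2), ('cb', 1)]

Mining then proceeds exactly as the slides describe: build a conditional FP-Tree from each pattern base and recurse, enumerating all item combinations once a tree degenerates to a single path.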
Advantages of FP-Tree
- Completeness:
  - Never breaks a long pattern of any transaction
  - Preserves complete information for frequent pattern mining
- Compactness:
  - Reduces irrelevant information: infrequent items are gone
  - Frequency-descending ordering: more frequent items are more likely to be shared
  - Never larger than the original database (if we do not count node-links and counts)

Comparison of FP-Tree Method and Apriori
[Chart: run time (sec., 0-100) versus support threshold (%, 0-3) for FP-growth and Apriori on data set D1 (T25I20D10K).]

References
- R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93), pages 207-216, May 1993.
- R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), 1994.
- Jiawei Han, Jian Pei, and Yiwen Yin. Mining Frequent Patterns without Candidate Generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1-12, 2000.