Mining Generalized Association Rules
Ramakrishnan Srikant and Rakesh Agrawal
Data Mining Seminar, spring semester 2003, Prof. Amos Fiat
Presented by: Idit Haran

Outline
* Motivation
* Terms & definitions
* Interest measure
* Algorithms for mining generalized association rules
* Comparison
* Conclusions

Motivation
* We want to find association rules of the form: Diapers → Beer.
* There are different kinds of diapers (Huggies/Pampers, S/M/L, etc.) and different kinds of beers (Heineken/Maccabi, in a bottle/in a can, etc.).
* The information on the bar-code is of the most specific type: "Huggies Diapers, M" → "Heineken Beer in a bottle".
* Such an overly specific rule is not interesting, and probably will not have minimum support.

Taxonomy
* A taxonomy is a set of is-a hierarchies over the items, e.g.:
    Clothes is the parent of Outerwear and Shirts;
    Outerwear is the parent of Jackets and Ski Pants;
    Footwear is the parent of Shoes and Hiking Boots.

Taxonomy - Example
* Say we found the rule Outerwear → Hiking Boots with minimum support and confidence.
* The rule Jackets → Hiking Boots may not have minimum support.
* The rule Clothes → Hiking Boots may not have minimum confidence.

Taxonomy (cont.)
* Users are interested in generating rules that span different levels of the taxonomy.
* Rules at lower levels may not have minimum support.
* The taxonomy can be used to prune uninteresting or redundant rules.
* Multiple taxonomies may be present, for example: category, price (cheap/expensive), "items-on-sale", etc.
* Multiple taxonomies may be modeled as a forest, or as a DAG.

Notations
* An edge in the taxonomy represents an is-a relationship: an edge from child c to parent p means c is-a p.
* x^ denotes an ancestor of x.

Notations (cont.)
* I = {i1, i2, ..., im} is the set of items.
* A transaction T is a set of items, T ⊆ I (we expect the items in T to be leaves of the taxonomy).
* D is the set of transactions.
* T supports an item x if x is in T or x is an ancestor of some item in T.
* T supports X ⊆ I if it supports every item in X (see the Python sketch below).

Notations (cont.)
* A generalized association rule: X → Y, where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X.
* The rule X → Y has confidence c in D if c% of the transactions in D that support X also support Y.
* The rule X → Y has support s in D if s% of the transactions in D support X ∪ Y.

Problem Statement
* Find all generalized association rules whose support and confidence are greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf), respectively.
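As a concrete illustration of the support definition above, here is a minimal Python sketch of the extended-transaction support test, assuming the taxonomy is encoded as a child-to-parents dict (the function names and encoding are illustrative, not from the paper):

```python
# Sketch of the "T supports X" test from the Notations slides.
# Taxonomy: dict mapping each item to the list of its parents (a DAG).

def ancestors(item, taxonomy):
    """All ancestors of an item under the is-a taxonomy."""
    result, stack = set(), [item]
    while stack:
        for parent in taxonomy.get(stack.pop(), []):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

def supports(transaction, itemset, taxonomy):
    """T supports X iff every item of X is in T or is an ancestor of an item in T."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item, taxonomy)
    return set(itemset) <= extended

taxonomy = {
    "Jackets": ["Outerwear"], "Ski Pants": ["Outerwear"],
    "Outerwear": ["Clothes"], "Shirts": ["Clothes"],
    "Shoes": ["Footwear"], "Hiking Boots": ["Footwear"],
}
# The transaction contains only leaves; the itemset mixes taxonomy levels.
print(supports({"Jackets", "Hiking Boots"}, {"Outerwear", "Footwear"}, taxonomy))  # True
```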
Example
* Recall the taxonomy: Clothes → {Outerwear, Shirts}; Outerwear → {Jackets, Ski Pants}; Footwear → {Shoes, Hiking Boots}.

Frequent Itemsets - Example (minsup = 30%, minconf = 60%)

Database D:

    Transaction | Items bought
    100         | Shirt
    200         | Jacket, Hiking Boots
    300         | Ski Pants, Hiking Boots
    400         | Shoes
    500         | Shoes
    600         | Jacket

Frequent itemsets:

    Itemset                    | Support
    {Jacket}                   | 2
    {Outerwear}                | 3
    {Clothes}                  | 4
    {Shoes}                    | 2
    {Hiking Boots}             | 2
    {Footwear}                 | 4
    {Clothes, Hiking Boots}    | 2
    {Outerwear, Footwear}      | 2
    {Clothes, Footwear}        | 2
    {Outerwear, Hiking Boots}  | 2

Rules:

    Rule                      | Support | Confidence
    Outerwear → Hiking Boots  | 33%     | 66.6%
    Outerwear → Footwear      | 33%     | 66.6%
    Hiking Boots → Outerwear  | 33%     | 100%
    Hiking Boots → Clothes    | 33%     | 100%

Observation 1
* If the set {x, y} has minimum support, so do {x^, y}, {x, y^} and {x^, y^}.
* For example: if {Jacket, Shoes} has minsup, so will {Outerwear, Shoes}, {Jacket, Footwear}, and {Outerwear, Footwear}.

Observation 2
* If the rule x → y has minimum support and confidence, only x → y^ is guaranteed to have both minsup and minconf.
* Example: if the rule Outerwear → Hiking Boots has minsup and minconf, then the rule Outerwear → Footwear has both as well.

Observation 2 (cont.)
* However, while the rules x^ → y and x^ → y^ will have minsup, they may not have minconf.
* For example: the rules Clothes → Hiking Boots and Clothes → Footwear have minsup, but not minconf.

Interesting Rules - Previous Work
* A rule X → Y is not interesting if: support(X → Y) ≈ support(X) · support(Y).
* Previous work did not consider taxonomies.
* This interest measure pruned less than 1% of the rules on a real database.

Interesting Rules - Using the Taxonomy
* Suppose Milk → Cereal has 8% support and 70% confidence.
* Milk is the parent of Skim Milk, and 25% of the sales of Milk are Skim Milk.
* We therefore expect Skim Milk → Cereal to have 2% support and 70% confidence.

R-Interesting Rules
* A rule X → Y is R-interesting w.r.t. an ancestor rule X^ → Y^ if:
    support(X → Y) > R · (expected support of X → Y based on X^ → Y^), or
    confidence(X → Y) > R · (expected confidence of X → Y based on X^ → Y^).
* With R = 1.1, about 40-55% of the rules were pruned. (A sketch of this test follows the algorithm overview below.)

Problem Statement (revised)
* Find all generalized R-interesting association rules (R is a user-specified minimum interest, called min-interest) whose support and confidence are greater than minsup and minconf, respectively.

Algorithms - 3 Steps
1. Find all itemsets whose support is greater than minsup. These itemsets are called frequent itemsets.
2. Use the frequent itemsets to generate the desired rules: if ABCD and AB are frequent, then conf(AB → CD) = support(ABCD) / support(AB).
3. Prune all uninteresting rules from this set.
* All the algorithms presented here implement only step 1.
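The following minimal sketch illustrates the R-interest test, assuming (as the Milk/Skim Milk example implies) that the expected support of a specialized rule scales the ancestor rule's support by each specialized item's share of its ancestor's support; the function names are illustrative:

```python
# Sketch of the R-interest test, using the slides' Milk/Skim Milk example.

def expected_support(ancestor_rule_support, shares):
    """shares: for each item generalized in the ancestor rule, the fraction
    of the ancestor's support contributed by the specialized item."""
    expected = ancestor_rule_support
    for share in shares:
        expected *= share
    return expected

def is_r_interesting(actual_support, ancestor_rule_support, shares, r=1.1):
    """R-interesting (on the support side) w.r.t. one ancestor rule."""
    return actual_support > r * expected_support(ancestor_rule_support, shares)

# Milk -> Cereal has 8% support; Skim Milk is 25% of Milk sales,
# so Skim Milk -> Cereal is expected at 0.08 * 0.25 = 2% support.
print(expected_support(0.08, [0.25]))        # 0.02
print(is_r_interesting(0.04, 0.08, [0.25]))  # True: 4% > 1.1 * 2%
```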
Algorithms (step 1)
* Input: database, taxonomy.
* Output: all frequent itemsets.
* Three algorithms (same output, different run-time): Basic, Cumulate, EstMerge.

Algorithm Basic - Main Idea
* Is itemset X frequent? Does transaction T support X?
* X may contain items from different levels of the taxonomy, while T contains only leaves.
* Let T' = T + ancestors(T). Answer: T supports X ⟺ X ⊆ T'.

Algorithm Basic

    L1 := {frequent 1-itemsets}                    // count item occurrences
    for (k := 2; L(k-1) ≠ ∅; k++) do
        Ck := apriori-gen(L(k-1))                  // generate new k-itemset candidates
        forall transactions t ∈ D do
            t := add-ancestors(t, T)               // add all ancestors of each item in t, removing duplicates
            Ct := subset(Ck, t)                    // candidates in Ck contained in t
            forall candidates c ∈ Ct do c.count++  // find the support of all the candidates
        Lk := {c ∈ Ck | c.count ≥ minsup}          // keep only candidates with support at least minsup
    Answer := ∪k Lk

Candidate Generation
* Join step: p and q are two frequent (k-1)-itemsets identical in their first k-2 items; join them by adding the last item of q to p:

    insert into Ck
    select p.item1, p.item2, ..., p.item(k-1), q.item(k-1)
    from L(k-1) p, L(k-1) q
    where p.item1 = q.item1, ..., p.item(k-2) = q.item(k-2), p.item(k-1) < q.item(k-1)

* Prune step: check all the (k-1)-subsets of each candidate, and remove any candidate with an infrequent subset (a runnable sketch follows the Cumulate pseudocode below):

    forall itemsets c ∈ Ck do
        forall (k-1)-subsets s of c do
            if (s ∉ L(k-1)) then delete c from Ck

Optimization 1: filtering the ancestors added to transactions
* We only need to add to transaction t the ancestors that appear in one of the candidates.
* If the original item does not appear in any candidate, it can be dropped from the transaction.
* Example: if the only candidate is {Clothes, Shoes}, transaction t = {Jacket, ...} can be replaced with {Clothes, ...}.

Optimization 2: pre-computing ancestors
* Rather than finding the ancestors of each item by traversing the taxonomy graph, we can pre-compute the ancestors of each item.
* At the same time, we can drop ancestors that are not contained in any of the candidates.

Optimization 3: pruning itemsets containing an item and its ancestor
* If we have {Jacket} and {Outerwear}, we will generate the candidate {Jacket, Outerwear}, which is not interesting: support({Jacket}) = support({Jacket, Outerwear}).
* Deleting {Jacket, Outerwear} at k = 2 ensures it will never reappear at k > 2 (because of the prune step of the candidate-generation method).
* Therefore we need to prune candidates containing an item and its ancestor only at k = 2; in all later passes no candidate will include an item together with its ancestor.

Algorithm Cumulate

    compute T* (the set of ancestors of each item) from T  // Optimization 2: pre-compute ancestors
    L1 := {frequent 1-itemsets}
    for (k := 2; L(k-1) ≠ ∅; k++) do
        Ck := apriori-gen(L(k-1))
        if (k = 2) then prune(C2)                  // Optimization 3: delete candidates consisting of an item and its ancestor
        T* := remove-unnecessary(T*, Ck)           // Optimization 1: delete ancestors in T* not present in any candidate in Ck
        forall transactions t ∈ D do
            t := add-ancestors(t, T*)              // for each item x ∈ t, add all ancestors of x in T* to t, then remove duplicates
            Ct := subset(Ck, t)
            forall candidates c ∈ Ct do c.count++
        Lk := {c ∈ Ck | c.count ≥ minsup}
    Answer := ∪k Lk
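Below is a runnable Python sketch of the apriori-gen join and prune steps described above, with itemsets encoded as sorted tuples (the encoding is an illustrative choice, not from the paper):

```python
# Sketch of apriori-gen: join frequent (k-1)-itemsets sharing their first
# k-2 items, then prune candidates that have an infrequent (k-1)-subset.
from itertools import combinations

def apriori_gen(frequent_prev):
    """Generate k-itemset candidates from the frequent (k-1)-itemsets."""
    prev = set(frequent_prev)
    candidates = set()
    # Join step: merge pairs identical in all but the last item.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune step: drop candidates with an infrequent (k-1)-subset.
    return {c for c in candidates
            if all(s in prev for s in combinations(c, len(c) - 1))}

L2 = {("Clothes", "Hiking Boots"), ("Clothes", "Shoes"),
      ("Hiking Boots", "Shoes")}
print(apriori_gen(L2))  # {('Clothes', 'Hiking Boots', 'Shoes')}
```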
Stratification
* Recall the taxonomy, and consider the candidates {Clothes, Shoes}, {Outerwear, Shoes}, {Jacket, Shoes}.
* If {Clothes, Shoes} does not have minimum support, we need not count either {Outerwear, Shoes} or {Jacket, Shoes}.
* We will count in steps: step 1: count {Clothes, Shoes}; if it has minsup, step 2: count {Outerwear, Shoes}; if that has minsup, step 3: count {Jacket, Shoes}.

Version 1: Stratify
* Depth of an itemset: itemsets with no parents are of depth 0; otherwise depth(X) = max({depth(X^) | X^ is a parent of X}) + 1. (See the depth sketch after the EstMerge pseudocode below.)
* The algorithm: count all itemsets C0 of depth 0; delete candidates that are descendants of itemsets in C0 that did not have minsup; count the remaining itemsets at depth 1 (C1); delete candidates that are descendants of itemsets in C1 that did not have minsup; count the remaining itemsets at depth 2 (C2); etc.

Tradeoff & Optimizations
* The tradeoff: the number of candidates counted vs. the number of passes over the database (counting each depth in its own pass minimizes counting but maximizes passes; Cumulate is the other extreme).
* Optimization 1: count multiple depths together from a certain level onwards.
* Optimization 2: count more than 20% of the candidates per pass.

Version 2: Estimate
* Estimate candidate supports using a sample of the database.
* 1st pass (C'k): count the candidates that are expected to have minsup (we treat a candidate as expected-frequent if its support in the sample is at least 0.9 × minsup), together with the candidates whose parents are expected to have minsup.
* 2nd pass (C"k): count the children of candidates in C'k that were not expected to have minsup.

Example for Estimate (minsup = 5%)

    Candidate itemset   | Support in sample | Support in database
                        |                   | Scenario A | Scenario B
    {Clothes, Shoes}    | 8%                | 7%         | 9%
    {Outerwear, Shoes}  | 4%                | -          | 6%
    {Jacket, Shoes}     | 2%                | -          | 4%

Version 3: EstMerge
* Motivation: eliminate the 2nd pass of algorithm Estimate.
* Implementation: count the candidates of C"k together with the candidates in C'(k+1).
* Restriction: to create C'(k+1), we must assume that all candidates in C"k have minsup.
* The tradeoff: the extra candidates counted by EstMerge vs. the extra pass made by Estimate.

Algorithm EstMerge

    L1 := {frequent 1-itemsets}                        // count item occurrences
    Ds := generate-sample(D)                           // generate a sample of the database in the first pass
    for (k := 2, C"1 := ∅; L(k-1) ≠ ∅ or C"(k-1) ≠ ∅; k++) do
        Ck := generate-candidates(L(k-1), C"(k-1))     // generate new k-itemset candidates from L(k-1) ∪ C"(k-1)
        C'k := expected-frequent-and-sons(Ds, Ck)      // estimate Ck's support over the sample Ds: C'k = candidates expected to have minsup, plus candidates whose parents are expected to have minsup
        find-support(D, C'k, C"(k-1))                  // find the support of C'k ∪ C"(k-1) in one pass over D
        Ck := prune-descendants(Ck, C'k)               // delete candidates in Ck whose ancestors in C'k do not have minsup
        C"k := Ck - C'k                                // remaining candidates in Ck that are not in C'k
        Lk := {c ∈ C'k | c.count ≥ minsup}             // all candidates in C'k with minsup
        L(k-1) := L(k-1) ∪ {c ∈ C"(k-1) | c.count ≥ minsup}  // add all candidates in C"(k-1) with minsup
    Answer := ∪k Lk
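Here is a small sketch of Stratify's depth computation, assuming (an inference following the paper's use of depth within a candidate set) that only parent itemsets that are themselves candidates count toward the depth; the helper names are illustrative:

```python
# Sketch of the itemset-depth computation used by Stratify: depth 0 for
# itemsets with no (candidate) parents, otherwise 1 + max depth over parents.
# A "parent" of an itemset generalizes exactly one item one level up.

taxonomy = {"Jackets": ["Outerwear"], "Outerwear": ["Clothes"],
            "Shoes": ["Footwear"]}

def itemset_parents(itemset):
    """Itemsets obtained by replacing one item with one of its parents."""
    for item in itemset:
        for parent in taxonomy.get(item, []):
            yield (itemset - {item}) | {parent}

def depth(itemset, candidates):
    parents = [p for p in itemset_parents(itemset) if p in candidates]
    return 0 if not parents else 1 + max(depth(p, candidates) for p in parents)

candidates = {frozenset(s) for s in [{"Clothes", "Shoes"},
                                     {"Outerwear", "Shoes"},
                                     {"Jackets", "Shoes"}]}
for c in sorted(candidates, key=lambda s: depth(s, candidates)):
    print(sorted(c), depth(c, candidates))
# ['Clothes', 'Shoes'] 0
# ['Outerwear', 'Shoes'] 1
# ['Jackets', 'Shoes'] 2
```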
C’k = candidates that are expected to have minsup + candidates whose parents are expected to have minsup Find the support of C’kC”k-1 by making a pass over D Delete candidates in Ck whose ancestors in C’k don’t have minsup Remaining candidates in Ck that are not in C’k k Idit Haran, Data Mining Seminar, 2003 Add all candidate in C”k with minsup 36 All candidate in C’k with minsup Stratify - Variants Idit Haran, Data Mining Seminar, 2003 37 Size of Sample P=5% P=1% P=0.5% P=0.1% a=.8p a=.9p a=.8p a=.9p a=.8p a=.9p a=.8p a=.9p n=1000 0.32 0.76 0.80 0.95 0.89 0.97 0.98 0.99 n=10,000 0.00 0.07 0.11 0.59 0.34 0.77 0.80 0.95 n=100,000 0.00 0.00 0.00 0.01 0.00 0.07 0.12 0.60 n=1,000,000 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 Pr[support in sample < a] Idit Haran, Data Mining Seminar, 2003 38 Size of Sample Idit Haran, Data Mining Seminar, 2003 39 Performance Evaluation Compare running time of 3 algorithms: Basic, Cumulate and EstMerge On synthetic data: effect On of each parameter on performance real data: Supermarket Data Department Store Data Idit Haran, Data Mining Seminar, 2003 40 Synthetic Data Generation Parameter Default Value |D| Number of transactions 1,000,000 |T| Average size of the Transactions 10 |I| Average size of the maximal potentially frequent itemsets 4 |I | Number of maximal potentially frequent itemsets 10,000 N Number of items 100,000 R Number of Roots 250 L Number of Levels 4-5 F Fanout 5 D Depth-ration ( probability that item in a rule comes from 1 level i / probability that item Data comes from level2003 i+1) Idit Haran, Mining Seminar, 41 Minimum Support Idit Haran, Data Mining Seminar, 2003 42 Number of Transactions Idit Haran, Data Mining Seminar, 2003 43 Fanout Idit Haran, Data Mining Seminar, 2003 44 Number of Items Idit Haran, Data Mining Seminar, 2003 45 Reality Check Supermarket Data 548,000 items Taxonomy: 4 levels, 118 roots ~1.5 million transactions Average of 9.6 items per transaction Department Store Data 228,000 items Taxonomy: 7 levels, 89 roots 570,000 transactions Average of 4.4 items per transaction Idit Haran, Data Mining Seminar, 2003 46 Results Idit Haran, Data Mining Seminar, 2003 47 Conclusions Cumulate and EstMerge were 2 to 5 times faster than Basic on all synthetic datasets. On the supermarket database they were 100 times faster ! EstMerge was ~25-30% faster than Cumulate. Both EstMerge and Cumulate exhibits linear scale-up with the number of transactions. Idit Haran, Data Mining Seminar, 2003 48 Summary The use of taxonomy is necessary for finding association rules between items at any level of hierarchy. The obvious solution (algorithm Basic) is not very fast. New algorithms that use the taxonomy benefits are much faster We can use the taxonomy to prune uninteresting rules. Idit Haran, Data Mining Seminar, 2003 49 Idit Haran, Data Mining Seminar, 2003 50