Frequent Itemsets: Association Rules and Market Basket Analysis
Carlo Zaniolo, UCLA CSD

Outline
- Association rules & correlations: basic concepts
- Efficient and scalable frequent itemset mining methods: Apriori and improvements
- FP-growth
- Rule derivation, visualization and validation
- Multi-level associations
- Summary

Market Basket Analysis: the context
- Study customer buying habits by finding associations and correlations between the different items that customers place in their "shopping basket":
  Customer 1: milk, eggs, sugar, bread
  Customer 2: milk, eggs, cereal, bread
  Customer 3: eggs, sugar
- Given a database of customer transactions, where each transaction is a set of items, find the groups of items that are frequently purchased together.

Goal of MBA
- Extract information on purchasing behavior.
- Actionable information: it can suggest
  - new store layouts,
  - new product assortments,
  - which products to put on promotion.
- MBA is applicable whenever a customer purchases multiple things in proximity: credit cards, telecommunication services, banking services, medical treatments.

MBA: applicable to many other contexts
- Telecommunication: each customer is a transaction containing the set of that customer's phone calls.
- Atmospheric phenomena: each time interval (e.g., a day) is a transaction containing the set of observed events (rain, wind, etc.).
- Etc.

Association Rules
- Express how products/services relate to each other and tend to group together.
- "If a customer purchases three-way calling, then he will also purchase call-waiting."
- Simple to understand.
- Actionable information: bundle three-way calling and call-waiting in a single package.

Frequent Itemsets
- Transactions can be stored in relational format, one <Tid, item> pair per row (e.g., <1, item1>, <1, item2>, <2, item3>), or in compact format, one <Tid, itemset> row per transaction (e.g., <1, {item1, item2}>, <2, {item3}>).
- Item: a single element; itemset: a set of items.
- Support of an itemset I: the number (or fraction) of transactions containing I.
- Minimum support: a threshold on the support.
- Frequent itemset: an itemset whose support is at least the minimum support.
- Frequent itemsets represent sets of items that are positively correlated.

Frequent Itemsets: Example

  Transaction ID | Items Bought
  1              | dairy, fruit
  2              | dairy, fruit, vegetable
  3              | dairy
  4              | fruit, cereals

- support({dairy}) = 3 (75%)
- support({fruit}) = 3 (75%)
- support({dairy, fruit}) = 2 (50%)
- If the minimum support is 60%, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.

Itemset Support & Rule Confidence
- Let A and B be disjoint itemsets, and let s = support(A ∪ B) and c = support(A ∪ B) / support(A).
- Then the rule A ⇒ B holds with support s and confidence c; we write A ⇒ B [s, c].
- Objective of the mining task: find all rules with at least a given minimum support and minimum confidence.
- Thus A ⇒ B [s, c] holds if s ≥ min_sup and c ≥ min_conf.

Association Rules: meaning of A ⇒ B [s, c]
- Support denotes the frequency of the rule within the transactions. A high value means that the rule involves a large part of the database: support(A ⇒ B) = p(A ∪ B).
- Confidence denotes the percentage of transactions containing A which also contain B. It is an estimate of the conditional probability: confidence(A ⇒ B) = p(B|A) = p(A ∪ B) / p(A).
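To make the definitions concrete, here is a minimal Python sketch (not part of the original slides; the function and variable names are illustrative) that computes support and confidence over the toy dairy/fruit database above.

```python
# Toy transaction database from the example above.
transactions = [
    {"dairy", "fruit"},
    {"dairy", "fruit", "vegetable"},
    {"dairy"},
    {"fruit", "cereals"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) = support(A u B) / support(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"dairy"}, transactions))                # 0.75
print(support({"dairy", "fruit"}, transactions))       # 0.50
print(confidence({"dairy"}, {"fruit"}, transactions))  # 0.666...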
Association Rules: Example

  Transaction ID | Items Bought
  2000           | A, B, C
  1000           | A, C
  4000           | A, D
  5000           | B, E, F

With minimum support 50% and minimum confidence 50%, the frequent itemsets are:

  Frequent Itemset | Support
  {A}              | 75%
  {B}              | 50%
  {C}              | 50%
  {A, C}           | 50%

For the rule A ⇒ C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.

Scalable Methods for Mining Frequent Patterns
- The downward closure (antimonotone) property of frequent patterns: every subset of a frequent itemset must be frequent.
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}, since every transaction containing {beer, diaper, nuts} also contains {beer, diaper}.
- Scalable mining methods, three major approaches:
  - Apriori (Agrawal & Srikant, VLDB'94)
  - Frequent-pattern growth (FP-growth: Han, Pei & Yin, SIGMOD'00)
  - Vertical data format approach (CHARM: Zaki & Hsiao, SDM'02)

Frequent Patterns = Frequent Itemsets: Closed Patterns and Max-Patterns
- A long pattern contains very many sub-patterns: combinatorial explosion.
- An itemset is closed if none of its supersets has the same support (i.e., the frequency drops for each item added). Closed patterns are a lossless compression of the frequent patterns, reducing the number of patterns and rules.
- An itemset is maximal frequent if none of its supersets is frequent (i.e., the frequency drops below the threshold for each item added); thus maximal itemsets are also closed.

Frequent, Maximal, and Closed Itemsets: Example
Transactions (minimum support count = 2):

  TID | Items
  1   | A, B, C
  2   | A, B, C, D
  3   | B, C, E
  4   | A, C, D, E
  5   | D, E

[Figure: the itemset lattice over A-E, annotated with the supporting TIDs of each itemset. The slides mark the frequent itemsets (# Frequent = 13), the maximal frequent itemsets (# Maximal = 4), and the closed frequent itemsets (# Closed = 9), distinguishing "closed but not maximal" from "closed and maximal" nodes.]

Maximal vs Closed Itemsets
As we move from a frequent itemset A to its supersets, the support can:
1. remain the same for some superset: then A is not closed;
2. drop for every superset, while staying above the threshold for some of them: then A is closed (but not maximal);
3. drop below the threshold for every superset: then A is maximal (and therefore also closed).
The maximal itemsets are sufficient to determine which itemsets are frequent, but the closed itemsets (and hence the supports) cannot be reconstructed from them. Crucial patterns: an information-preserving subset of the closed patterns.
In general: Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets.
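As a sanity check on the closed/maximal definitions, here is a small brute-force Python sketch (illustrative only, not from the slides) that recomputes the classification for the five-transaction example above.

```python
from itertools import combinations

# Transactions from the lattice example above (minimum support count = 2).
transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
min_count = 2
items = sorted(set().union(*transactions))

# Brute-force enumeration of frequent itemsets (fine for a 5-item universe).
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        count = sum(1 for t in transactions if s <= t)
        if count >= min_count:
            frequent[s] = count

def is_maximal(itemset):
    """Maximal frequent: no superset of it is frequent."""
    return not any(itemset < other for other in frequent)

def is_closed(itemset):
    """Closed: no superset has the same support. A same-support superset of a
    frequent itemset is itself frequent, so checking `frequent` is enough."""
    return not any(itemset < other and frequent[other] == frequent[itemset]
                   for other in frequent)

maximal = [set(s) for s in frequent if is_maximal(s)]
closed = [set(s) for s in frequent if is_closed(s)]
print("maximal:", maximal)                 # {C,E}, {D,E}, {A,B,C}, {A,C,D}
print("closed:", len(closed), "itemsets")  # 9 itemsets
```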
Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if any itemset is infrequent, its supersets need not be generated or tested (Agrawal & Srikant, VLDB'94; Mannila et al., KDD'94).
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets.
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
  - Test the candidates against the DB.
  - Terminate when no frequent or candidate set can be generated.

The Apriori Algorithm: An Example (minimum support count = 2)

Database TDB:
  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

1st scan, C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

2nd scan, C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

3rd scan, C3: {B,C,E}
L3: {B,C,E}:2

Important Details of Apriori
- How to generate candidates?
  - Step 1: self-join Lk with itself.
  - Step 2: prune.
- How to count the supports of the candidates?
- Example of candidate generation: L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3*L3 gives abcd (from abc and abd) and acde (from acd and ace).
  - Pruning: acde is removed because ade is not in L3.
  - C4 = {abcd}

How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck

How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be huge.
  - One transaction may contain many candidates.
- Data structures used: candidate itemsets can be stored in a hash-tree or in a prefix-tree (trie).
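A minimal Python rendering (illustrative, not the authors' code) of the candidate-generation step just described: self-join L(k-1) with itself, then prune any candidate that has an infrequent (k-1)-subset.

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Generate C_k from the frequent (k-1)-itemsets L_prev, given as sorted tuples."""
    L_set = set(L_prev)
    candidates = []
    # Step 1: self-join -- merge two itemsets that agree on their first k-2 items.
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))
    # Step 2: prune -- every (k-1)-subset of a candidate must be frequent.
    return [c for c in candidates
            if all(s in L_set for s in combinations(c, len(c) - 1))]

# Example from the slides: L3 = {abc, abd, acd, ace, bcd}
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
print(apriori_gen(L3))  # [('a', 'b', 'c', 'd')] -- acde is pruned because ade is not in L3
```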
Effect of Support Distribution (slides adapted from Tan, Steinbach & Kumar, Introduction to Data Mining)
- Many real data sets have a skewed support distribution (e.g., retail data).
- How to set an appropriate minsup threshold?
  - If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products).
  - If minsup is set too low, mining is computationally expensive and the number of itemsets becomes very large.
- Using a single minimum support threshold may not be effective.

Rule Generation
- How to efficiently generate rules from frequent itemsets?
- In general, confidence does not have an antimonotone property: c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D).
- But the confidence of rules generated from the same itemset is anti-monotone with respect to the number of items on the right-hand side of the rule. E.g., for L = {A,B,C,D}: c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD).

Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f ⇒ L−f satisfies the minimum confidence requirement.
- Theorem: if AB ⇒ C fails the confidence requirement, so does A ⇒ BC.
- If |L| = k, then there are 2^k candidate association rules (including the trivial rules L ⇒ ∅ and ∅ ⇒ L).
- Example: if L = {A,B,C,D} is the frequent itemset, the non-trivial candidate rules are:
  ABC ⇒ D, ABD ⇒ C, ACD ⇒ B, BCD ⇒ A, AB ⇒ CD, AC ⇒ BD, AD ⇒ BC, BC ⇒ AD, BD ⇒ AC, CD ⇒ AB, A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC
- Antimonotonicity makes the search converge fast by building on (i) successes and (ii) failures:
  1. A candidate rule is generated by merging two rules that share the same prefix in the rule consequent.
  2. If AD ⇒ BC fails the confidence requirement, then the rule D ⇒ ABC is pruned.
  3. If AD ⇒ BC and BD ⇒ AC both succeed, then the candidate rule D ⇒ ABC is tested.

Lattice of Rules
confidence(f ⇒ L−f) = support(L) / support(f), for L = {A,B,C,D}.
[Figure: the lattice of rules generated from L = {A,B,C,D}, from ABCD ⇒ {} down to single-item antecedents. If CD ⇒ AB is a low-confidence rule, all rules below it in the lattice (D ⇒ ABC, C ⇒ ABD, and their descendants) are pruned.]

Rules: Some Useful, Some Trivial, Others Inexplicable
- Useful: "On Thursdays, grocery store customers often purchase diapers and beer together."
- Trivial: "Customers who purchase maintenance agreements are very likely to purchase large appliances."
- Inexplicable: "When a new hardware store opens, one of the most sold items is toilet rings."
- Conclusion: inferred rules must be validated by a domain expert before they can be used in the marketplace: post-mining of association rules.

Mining for Association Rules: the main steps in the process
1. Select a minimum support/confidence level.
2. Find the frequent itemsets.
3. Find the association rules.
4. Validate (post-mine) the rules so found.

Mining for Association Rules: Checkpoint
- Apriori opened up a big commercial market for data mining: association rules came from the database field, classifiers from AI, and clustering precedes both ... and data mining itself.
- Many open problem areas, including:
  1. Performance: faster algorithms are needed for frequent itemsets.
  2. Improving the statistical/semantic significance of the rules.
  3. Data stream mining for association rules: even faster algorithms are needed, plus incremental computation, adaptability, etc. The post-mining process also becomes more challenging.

Performance: Efficient Implementation of Apriori in SQL
- It is hard to get good performance out of pure SQL (SQL-92) approaches alone.
- Make use of object-relational extensions like UDFs, BLOBs, table functions, etc.
  S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98.
- A much better solution: use UDAs, native or imported.
  Haixun Wang and Carlo Zaniolo. ATLaS: A Native Extension of SQL for Data Mining. SIAM International Conference on Data Mining 2003, San Francisco, CA, May 1-3, 2003.

Performance of Apriori: Challenges
- Multiple scans of the transaction database (not viable for data streams).
- A huge number of candidates.
- Tedious workload of support counting for the candidates.
- Many improvements have been suggested; the general ideas are to
  - reduce the number of passes over the transaction database,
  - shrink the number of candidates,
  - facilitate the counting of candidates.

Partition: Scan the Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Scan 1: partition the database and find the local frequent patterns.
- Scan 2: consolidate the global frequent patterns.
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
- Does this scale up to larger partitions?

Sampling for Frequent Patterns
- Select a sample S of the original database and mine frequent patterns within the sample using Apriori.
- To avoid losses, mine the sample with a lower support threshold than the one required.
- Scan the rest of the database to find the exact counts.
- H. Toivonen. Sampling large databases for association rules. VLDB'96.
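A hedged sketch of the sampling idea above: mine a random sample at a lowered threshold, then verify candidate counts with one scan of the full database. The brute-force `mine` helper and the slack factor are illustrative choices, not part of Toivonen's algorithm.

```python
import random
from itertools import combinations

def mine(transactions, min_count):
    """Brute-force frequent itemset miner (only suitable for tiny item universes)."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            n = sum(1 for t in transactions if s <= t)
            if n >= min_count:
                result[s] = n
    return result

def sample_then_verify(transactions, minsup_frac, sample_size, slack=0.8):
    sample = random.sample(transactions, sample_size)
    # Mine the sample at a lowered threshold to reduce the risk of missing
    # itemsets that are frequent in the full database but rare in the sample.
    candidates = mine(sample, slack * minsup_frac * sample_size)
    # One scan of the full database to obtain exact counts for the candidates.
    exact = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    threshold = minsup_frac * len(transactions)
    return {c: n for c, n in exact.items() if n >= threshold}
```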
DIC: Reduce the Number of Scans
- Dynamic itemset counting: the counting of new candidates can start at checkpoints during a scan, rather than only after a full pass.
  - Once both A and D are determined frequent, the counting of AD begins.
  - Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97.
[Figure: the itemset lattice for {A,B,C,D} alongside the stream of transactions, contrasting Apriori's pass-per-level counting (1-itemsets, then 2-itemsets, ...) with DIC's earlier start of 2-itemset and 3-itemset counting.]

Improving Performance (cont.)
- Apriori bottleneck:
  - Multiple database scans are costly.
  - Mining long patterns needs many passes of scanning and generates lots of candidates.
  - To find the frequent itemset i1 i2 ... i100: number of scans: 100; number of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30.
- The bottleneck is candidate generation-and-test. Can we avoid candidate generation?

Mining Frequent Patterns Without Candidate Generation: the FP-Growth Algorithm
1. Build the FP-tree: items are listed by decreasing frequency.
2. For each suffix (recursively): build its conditionalized subtree and compute its frequent items.
- An order of magnitude faster than Apriori.

Frequent Patterns (FP) Algorithm
The algorithm consists of two steps:
- Step 1: build the FP-tree (Frequent Patterns tree).
- Step 2: use the FP-Growth algorithm to find the frequent itemsets from the FP-tree.
(These slides are based on those by Yousry Taha, Taghrid Al-Shallali, Ghada AL Modaifer, and Nesreen AL Boiez.)

Frequent Pattern Tree Algorithm: Example

  T-ID | List of Items
  101  | Milk, bread, cookies, juice
  792  | Milk, juice
  1130 | Milk, eggs
  1735 | Bread, cookies, coffee

- The first scan of the database is the same as in Apriori: it derives the set of 1-itemsets and their support counts.
- The set of frequent items is sorted in order of descending support count.
- An FP-tree is constructed.
- The FP-tree is conditionalized and mined for frequent itemsets.

FP-Tree for the Example
Item header table (item, support, node-link): milk 3, bread 2, cookies 2, juice 2.
[Figure: the FP-tree for the four transactions above. From the null root there is one path milk:3 → bread:1 → cookies:1 → juice:1, a branch milk → juice:1, and a separate path bread:1 → cookies:1; the header-table node-links connect the occurrences of each item.]

FP-Growth
For each suffix, find (1) its supporting paths, (2) its conditional FP-tree, and (3) the frequent patterns with that ending (suffix):

  Suffix  | Tree paths supporting the suffix (conditional pattern base) | Conditional FP-tree | Frequent pattern generated
  juice   | {(milk, bread, cookies: 1), (milk: 1)}                      | {milk: 2}           | {juice, milk: 2}
  cookies | {(milk, bread: 1), (bread: 1)}                              | {bread: 2}          | {cookies, bread: 2}
  bread   | {(milk: 1)}                                                 | -                   | -
  milk    | -                                                           | -                   | -

... then expand the suffix and repeat these operations.

Conditional Pattern Base
The prefix paths having a given suffix; we start from the least frequent item, juice.
[Figure: the FP-tree with the two prefix paths of juice highlighted: milk → bread → cookies → juice:1 and milk → juice:1.]

Conditionalized Tree for Suffix "Juice"
The conditional FP-tree contains only milk:2; thus (juice, milk: 2) is a frequent pattern.

Patterns with Suffix "Cookies"
[Figure: the header table with juice already processed and cookies being processed next; the prefix paths of cookies (milk → bread:1 and bread:1) collapse into the conditional FP-tree bread:2.]
Thus (cookies, bread: 2) is frequent.
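A minimal sketch of step 1 (FP-tree construction) for the milk/bread/cookies example above; the class and function names are illustrative, and the mining step (FP-growth over conditional pattern bases) is not shown.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions, min_count):
    # First scan: global item counts; keep only the frequent items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode(None, None)
    header = defaultdict(list)  # item -> list of nodes (the node-links)
    # Second scan: insert each transaction with its items in descending frequency.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
    return root, header

transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
root, header = build_fp_tree(transactions, min_count=2)
# Summing the counts along each item's node-links recovers its global support.
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# e.g. {'milk': 3, 'bread': 2, 'cookies': 2, 'juice': 2}
```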
Why Is Frequent Pattern Growth Fast?
- Performance studies show that FP-growth is an order of magnitude faster than Apriori.
- Reasoning:
  - No candidate generation and no candidate test.
  - It uses a compact data structure.
  - It eliminates repeated database scans.
  - The basic operations are counting and FP-tree building.

Other Types of Association Rules
- Association rules among hierarchies.
- Multidimensional associations.
- Negative associations.

FP-growth vs. Apriori: Scalability
[Figure: run time (seconds) vs. support threshold (%) on data set T25I20D10K: FP-growth stays low while Apriori's run time grows sharply as the threshold decreases.]
[Figure: run time (seconds) vs. number of transactions (K) on data set T25I20D100K at 1.5% support: both grow roughly linearly, with FP-growth consistently faster than Apriori.]

FP-Growth: Pros and Cons
- The FP-tree is complete:
  - it preserves the complete information for frequent pattern mining,
  - it never breaks a long pattern of any transaction.
- The FP-tree is compact:
  - it drops irrelevant information (infrequent items are gone),
  - items are in frequency-descending order: the more frequently occurring, the more likely to be shared,
  - it is never larger than the original database (not counting node-links and the count fields).
- The FP-tree is generated in one scan of the database (data stream mining?).
- However, deriving the frequent patterns from the FP-tree is still computationally expensive; improved algorithms are needed for data streams.

Rule Validation
- Only a small subset of the derived rules might be meaningful/useful.
- A domain expert must validate the rules.
- Useful tools: visualization and correlation analysis.

Outline (recap)
- Association rules & correlations: basic concepts
- Efficient and scalable frequent itemset mining methods: Apriori, and improvements
- FP-growth
- Rule derivation, visualization and validation
- Multi-level associations
- Summary

Finding Frequent Sequential Patterns
- The problem: given a sequence of discrete events that may repeat, e.g. A B A C D A C E B A B C ..., find patterns that repeat frequently, for example A followed by B (A -> B), or AC followed by D (AC -> D).
- The patterns should occur within a window W.
- Applications in telecommunication data, networks, and biology.

Summary
- Association rule mining is probably the most significant contribution from the database community to KDD.
- New interesting research directions:
  - association analysis in other types of data: spatial data, multimedia data, time series data;
  - association rule mining for data streams: a very difficult challenge.

Post-Processing: Rule Validation and Interestingness Measures
[Figure: visualization of association rules as a plane graph.]

Computing Interestingness Measures
Given a rule X ⇒ Y, the information needed to compute the rule's interestingness can be obtained from a contingency table.

  Contingency table for X ⇒ Y:
             |  Y   | not Y |
  X          | f11  | f10   | f1+
  not X      | f01  | f00   | f0+
             | f+1  | f+0   | |T|

  f11: support count of X and Y
  f10: support count of X and not Y
  f01: support count of not X and Y
  f00: support count of not X and not Y

These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.

Drawback of Confidence
Consider the drinks of 100 people in a day: 75 drink only coffee, 5 drink only tea, 15 drink both coffee and tea (and 5 drink neither):

          | Coffee | not Coffee |
  Tea     |   15   |     5      |  20
  not Tea |   75   |     5      |  80
          |   90   |    10      | 100

For the association rule Tea ⇒ Coffee:
- support = 15/100 = 0.15
- confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9 > 0.75.
- The confidence is high, yet the rule is misleading, since drinking tea actually lowers the probability of drinking coffee; conversely, P(Coffee | no Tea) = 75/80 = 0.9375 > 0.75.
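A quick numeric check (illustrative, not from the slides) of the tea/coffee example, reading support, confidence and the relevant probabilities straight off the contingency table using the f11/f10/f01/f00 notation above.

```python
# Counts out of 100 people: (Tea, Coffee), (Tea, no Coffee), (no Tea, Coffee), (no Tea, no Coffee)
f11, f10, f01, f00 = 15, 5, 75, 5

support_rule  = f11 / 100              # s(Tea => Coffee)    = 0.15
conf_rule     = f11 / (f11 + f10)      # P(Coffee | Tea)     = 0.75
p_coffee      = (f11 + f01) / 100      # P(Coffee)           = 0.90
conf_converse = f01 / (f01 + f00)      # P(Coffee | no Tea)  = 0.9375

print(support_rule, conf_rule, p_coffee, conf_converse)
```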
A Better Measure: Lift/Interest
For the same tea/coffee table, consider the association rule Tea ⇒ Coffee:
- lift(Tea ⇒ Coffee) = c(Tea ⇒ Coffee) / s(Coffee) = s(Tea & Coffee) / (s(Tea) · s(Coffee))
- confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
- lift = 0.75 / 0.9 = 0.8333 < 1, therefore Tea and Coffee are negatively associated.

Drawback of Lift & Interest
Statistical independence: if P(X,Y) = P(X) P(Y), then lift = 1.

  Example 1:  |  Y  | not Y |
  X           | 10  |   0   |  10
  not X       |  0  |  90   |  90
              | 10  |  90   | 100
  Lift = 0.1 / (0.1 × 0.1) = 10

  Example 2:  |  Y  | not Y |
  X           | 90  |   0   |  90
  not X       |  0  |  10   |  10
              | 90  |  10   | 100
  Lift = 0.9 / (0.9 × 0.9) = 1.11

Lift favors infrequent items; other criteria have been proposed (Gini, J-measure, etc.).

Statistical Independence
- Population of 1000 students:
  - 600 students know how to swim (S)
  - 700 students know how to bike (B)
  - 420 students know how to swim and bike (S, B)
- P(S,B) = 420/1000 = 0.42 and P(S) P(B) = 0.6 × 0.7 = 0.42
- P(S,B) = P(S) P(B): statistical independence
- P(S,B) > P(S) P(B): positively correlated
- P(S,B) < P(S) P(B): negatively correlated

Statistical Measures
Measures that take statistical dependence into account:
- Lift: does X lift the probability of Y? I.e., the probability of Y given X over the probability of Y: Lift = P(Y | X) / P(Y).
- Interest: Interest = P(X, Y) / (P(X) P(Y)); I = 1 means independence, I > 1 a positive association, I < 1 a negative one.
- PS = P(X, Y) − P(X) P(Y).

- There are lots of measures proposed in the literature.
- Some measures are good for certain applications, but not for others.
- What criteria should we use to determine whether a measure is good or bad?
- What about Apriori-style support-based pruning? How does it affect these measures?

Pattern Evaluation
- Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant.
- confidence(A ⇒ B) = p(B|A) = p(A & B)/p(A) is not a discriminative enough criterion.
- Beyond the original support & confidence, interestingness measures can be used to prune/rank the derived patterns.

Other Association Mining Methods
- CHARM: mining frequent itemsets using a vertical data format
- Mining frequent closed patterns
- Mining max-patterns
- Mining quantitative associations (e.g., what is the relationship between age and income?)
- Constraint-based association mining
- Frequent patterns in data streams: a very difficult problem, where performance is a real issue
- Constraint-based (query-directed) mining
- Mining sequential and structured patterns
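Closing with a quick numeric check (illustrative helpers, not from the slides) of the statistical measures defined above, applied to the swim/bike population and to the earlier tea/coffee table.

```python
def lift(p_xy, p_x, p_y):
    """P(Y|X) / P(Y): >1 positive association, <1 negative, 1 independence."""
    return (p_xy / p_x) / p_y

def interest(p_xy, p_x, p_y):
    """P(X,Y) / (P(X) P(Y)); equals 1 under statistical independence."""
    return p_xy / (p_x * p_y)

def ps(p_xy, p_x, p_y):
    """PS measure: P(X,Y) - P(X) P(Y); equals 0 under independence."""
    return p_xy - p_x * p_y

# Swim/bike population of 1000 students: independent.
p_s, p_b, p_sb = 600 / 1000, 700 / 1000, 420 / 1000
print(interest(p_sb, p_s, p_b))  # 1.0
print(ps(p_sb, p_s, p_b))        # 0.0

# Tea/coffee from the earlier table: negatively associated.
print(lift(0.15, 0.20, 0.90))    # 0.833...
```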