Data Mining: Learning Association Rules

Primary References
• Tan, P., Steinbach, M., and Kumar, V. (2006) Introduction to Data Mining, 1st edition, Addison-Wesley, ISBN: 0-321-32136-7.
• McCombs Business School lecture notes.
• Han, J. and Kamber, M. (2000) Data Mining: Concepts and Techniques, Morgan Kaufmann. (Database oriented.)
• SAS Enterprise Miner.

Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

Examples of association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!

The Apriori Algorithm
• Join step: Ck is generated by joining Lk-1 with itself.
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
• Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

  L1 = {frequent items};
  for (k = 1; Lk ≠ ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with support ≥ min_support;
  end
  return ∪k Lk;

Definition: Frequent Itemset
• Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}.
  – k-itemset: an itemset that contains k items.
• Support count (σ): the number of transactions containing an itemset, e.g. σ({Milk, Bread, Diaper}) = 2.
• Support (s): the fraction of transactions that contain an itemset, e.g.
s({Milk, Bread, Diaper}) = 2/5.
• Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

Definition: Association Rule
• Association rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}.
• Rule evaluation metrics (computed on the market-basket transactions above):
  – Support (s): the fraction of transactions that contain both X and Y.
  – Confidence (c): how often items in Y appear in transactions that contain X.
• Example, for {Milk, Diaper} → {Beer}:

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

Mining Association Rules
Example rules from the market-basket transactions:
  {Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}  (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}.
• Rules originating from the same itemset have identical support but can have different confidence.
• Thus we may decouple the support and confidence requirements.

Frequent Itemset Generation
[Figure: the lattice of all itemsets over items A–E, from the null set up to ABCDE.] Given d items, there are 2^d possible candidate itemsets.

Illustrating Apriori Principle
[Figure: once an itemset is found to be infrequent, all of its supersets are pruned from the lattice.]

The Apriori Algorithm — Example (minimum support count = 2)

Database D:
  TID  Items
  100  1 3 4
  200  2 3 5
  300  1 2 3 5
  400  2 5

Scan D for counts of the candidate 1-itemsets:
  C1: {1}:2  {2}:3  {3}:3  {4}:1  {5}:3
  L1: {1}:2  {2}:3  {3}:3  {5}:3

Generate C2 from L1 and scan D:
  C2: {1 2}:1  {1 3}:2  {1 5}:1  {2 3}:2  {2 5}:3  {3 5}:2
  L2: {1 3}:2  {2 3}:2  {2 5}:3  {3 5}:2

Generate C3 from L2 and scan D:
  C3: {2 3 5}
  L3: {2 3 5}:2

Presentation of Association Rules (Table Form)

Visualization of Association Rules Using a Plane Graph

Evaluation of Association Rules
• What rules should be considered valid? An association rule is valid if it satisfies some evaluation measures.
• There are many measures proposed in the literature; some are good for certain applications but not for others.
• What criteria should we use to determine whether a measure is good or bad?
• What about Apriori-style support-based pruning? How does it affect these measures?

Rule Evaluation: Support
• Milk & Wine co-occur, but only 2 out of 200K transactions contain both items.

  Transaction No.  Item 1     Item 2     Item 3
  100              Beer       Diapers    Chocolate
  101              Milk       Chocolate  Wine
  102              Beer       Wine       Vodka
  103              Beer       Cheese     Diapers
  104              Ice Cream  Diapers    Beer
  …

Rule Evaluation: Support
• Support: the frequency with which the items in the LHS (body) and RHS (head) co-occur. E.g., the support of the rule If {Diapers} then {Beer} is 3/5: 60% of the transactions contain both items.

  Support = (no. of transactions containing the items in both body and head) / (total no. of transactions in the database)

  Transaction No.  Item 1     Item 2     Item 3
  100              Beer       Diapers    Chocolate
  101              Milk       Chocolate  Shampoo
  102              Beer       Wine       Vodka
  103              Beer       Cheese     Diapers
  104              Ice Cream  Diapers    Beer
  …

  LHS (body): If {Diapers}    RHS (head): Then {Beer}

Rule Evaluation: Strength of the Implication
• Which implication is stronger? Of the transactions containing Milk:
  – 1% contain Wine: If {Milk} then {Wine}.
  – 20% contain Beer: If {Milk} then {Beer}.

Rule Evaluation: Confidence
• A rule's strength is measured by its confidence: how strongly does the body imply the head?
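The pseudocode and the worked example above can be checked mechanically. Below is a minimal plain-Python sketch (the function name `apriori` and the representation of transactions as Python sets are choices made here, not notation from the slides). Run on database D with a minimum support count of 2, it reproduces the L1, L2, and L3 tables:

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {itemset: support count} for all frequent itemsets."""
    # L1: frequent 1-itemsets, counted in one scan.
    items = {item for t in transactions for item in t}
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    Lk = {s: c for s, c in counts.items() if c >= min_support_count}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Join step: combine frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: drop any candidate with an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count the surviving candidates in one scan of the database.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Lk = {s: n for s, n in counts.items() if n >= min_support_count}
        frequent.update(Lk)
        k += 1
    return frequent

# Database D from the worked example.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
freq = apriori(D, min_support_count=2)
assert freq[frozenset({2, 5})] == 3       # matches L2
assert freq[frozenset({2, 3, 5})] == 2    # matches L3
assert frozenset({1, 5}) not in freq      # support 1 < 2, never survives
```

Itemsets are stored as `frozenset`s so they can serve as dictionary keys; the prune step is exactly the Apriori principle, since a candidate with any infrequent subset cannot itself be frequent.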
• Confidence: the proportion of transactions containing the body that also contain the head. Example: the confidence of the rule If {Diapers} then {Beer} is 3/3, i.e., 100% of the transactions that contain Diapers also contain Beer.

  Confidence = (no. of transactions containing both LHS and RHS) / (no. of transactions containing the LHS)

Rule Evaluation: Confidence
• Is the rule Milk → Wine equivalent to the rule Wine → Milk?
• When is the implication Milk → Wine more likely than the reverse?

Rule Evaluation Example: Lift
• Consider the rule If {Milk} then {Beer}, with support 20% and confidence 100%.
  – What if all shoppers at the store buy beer?
• Now assume the confidence is 60%.
  – What if 60% of shoppers buy beer?
→ Find rules where the frequency of the head given the body is greater than the expected frequency of the head.

Another Example (Aggarwal & Yu, PODS '98)
• Among 5000 students:
  – 3000 play basketball
  – 3750 eat cereal
  – 2000 both play basketball and eat cereal
• play basketball ⇒ eat cereal [support 40%, confidence 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

               basketball  not basketball  sum(row)
  cereal       2000        1750            3750
  not cereal   1000        250             1250
  sum(col.)    3000        2000            5000

Lift
• Measures how much more likely the head is given the body than the head alone: lift = confidence / frequency of the head.
• Example, for the rule If {Milk} then {Beer}:
  – Total number of customers in the database: 1000
  – No. of customers buying Milk: 200
  – No. of customers buying Beer: 50
  – No. of customers buying Milk & Beer: 20
  – Frequency of head: 50/1000 (5%)
  – Confidence: 20/200 (10%)
  – Lift: 10% / 5% = 2

Comparison to Traditional Database Queries
• Traditional methods such as database queries support hypothesis verification about a relationship, e.g. the co-occurrence of diapers & beer.
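The lift arithmetic in the two examples above can be checked directly. A minimal sketch (the helper name `lift` and its argument order are choices made here for illustration):

```python
def lift(n_total, n_body, n_head, n_both):
    """lift = confidence / frequency of head
            = (n_both / n_body) / (n_head / n_total)."""
    confidence = n_both / n_body
    head_frequency = n_head / n_total
    return confidence / head_frequency

# Milk -> Beer slide: 1000 customers, 200 buy Milk, 50 buy Beer, 20 buy both.
# Confidence is 10% and the head frequency is 5%, so the lift is 2.
print(lift(1000, 200, 50, 20))

# basketball -> cereal (Aggarwal & Yu): confidence 66.7% is below the 75%
# base rate of cereal eating, so the lift comes out below 1.
print(lift(5000, 3000, 3750, 2000))
```

A lift above 1 means the body makes the head more likely than its base rate; a lift below 1, as in the basketball/cereal case, signals a negative association despite the high confidence.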
Comparison to Traditional Database Queries
• Data mining: explore the data for patterns. Data mining methods automatically discover significant association rules from the data.
• They find whatever patterns exist in the database, without the user having to specify in advance what to look for (data driven).
• They therefore allow finding unexpected correlations.

Applications
• Store planning: should associated items be placed together (Milk & Bread)? Doing so may reduce the basket's total value (customers buy less unplanned merchandise).
• Fraud detection: finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity.

Sequential Patterns
• Instead of finding associations between items in a single transaction, find associations between items bought by the same customer on different occasions.

  Customer ID  Transaction Date  Item 1                 Item 2
  AA           2/2/2001          Laptop                 Case
  AA           1/13/2002         Wireless network card  Router
  BB           4/5/2002          Laptop                 iPaq
  BB           8/10/2002         Wireless network card  Router
  …

• Sequence: {Laptop}, {Wireless Card, Router}
• A sequence has to satisfy some predetermined minimum support.
• SAS Demo
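The sequence notion on this last slide can be made concrete. A minimal sketch, assuming the two customer histories from the table ordered by date (the names `contains_sequence` and `sequence_support` are illustrative, not from any library):

```python
# Each customer's history is a date-ordered list of transactions (item sets),
# taken from the purchase table above.
histories = {
    "AA": [{"Laptop", "Case"}, {"Wireless network card", "Router"}],
    "BB": [{"Laptop", "iPaq"}, {"Wireless network card", "Router"}],
}

def contains_sequence(history, sequence):
    """True if the itemsets of `sequence` occur, in order, as subsets
    of distinct transactions in `history`."""
    i = 0
    for basket in history:
        if i < len(sequence) and sequence[i] <= basket:
            i += 1
    return i == len(sequence)

def sequence_support(sequence):
    """Fraction of customers whose history contains the sequence."""
    hits = sum(contains_sequence(h, sequence) for h in histories.values())
    return hits / len(histories)

# The sequence {Laptop}, {Wireless network card, Router} is supported by
# both AA and BB, so its support is 1.0.
print(sequence_support([{"Laptop"}, {"Wireless network card", "Router"}]))
```

Note the order matters: the reversed sequence {Wireless network card, Router}, {Laptop} has support 0, which is exactly what distinguishes sequential patterns from plain association rules.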