Data Mining Association Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 6, Introduction to Data Mining, by Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of association rules:
{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!

Definition: Frequent Itemset
• Itemset
– A collection of one or more items, e.g. {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2 in the transactions above
• Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule
• Association rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
• Rule evaluation metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): how often items in Y appear in transactions that contain X
– Example, for {Milk, Diaper} → {Beer}:
  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
– Computationally prohibitive!
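As a quick check on the definitions above, the support and confidence of {Milk, Diaper} → {Beer} can be computed directly from the five market-basket transactions. This is a minimal Python sketch, not code from the lecture notes; the variable and function names are my own.

```python
# Sketch: support and confidence of {Milk, Diaper} -> {Beer}
# over the market-basket transactions from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item in X."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)              # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions) # 2/3 ≈ 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")
```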
Frequent Itemset Generation
(Figure: the itemset lattice over five items A–E, from the null set down to ABCDE.)
Given d items, there are 2^d possible candidate itemsets.

Illustrating Apriori Principle
(Figure: the same lattice, with one itemset marked "found to be infrequent" and all of its supersets shaded as "pruned supersets": once an itemset is infrequent, none of its supersets need to be generated or counted.)

Illustrating Apriori Principle
Minimum support count = 3.

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.

Apriori Algorithm
Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  • Generate length (k+1) candidate itemsets from length-k frequent itemsets
  • Prune candidate itemsets containing subsets of length k that are infrequent
  • Count the support of each candidate by scanning the DB
  • Eliminate candidates that are infrequent, leaving only those that are frequent

Apriori in DB2
S. Sarawagi, S. Thomas and R. Agrawal. "Integrating association rule mining with databases: alternatives and implications". Proc. of the ACM SIGMOD Int'l Conference on Management of Data, Seattle, Washington, June 1998. Best Paper Award. An extended version also appeared in the Data Mining and Knowledge Discovery Journal, 4(2/3), July 2000.
Because of the obvious challenges, reinforced by this paper, vendors and researchers gave up on the idea of turning the DBMS into a data mining system: OLAP is tightly integrated into the DBMS, but the KDD methods are not.
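To make the level-wise method above concrete, here is a compact Python sketch of the Apriori loop: candidate generation by joining frequent k-itemsets, pruning candidates with an infrequent subset, then counting support against the database. It is an illustrative reconstruction, not code from the lecture notes; the function name and the use of a support count threshold are my own choices.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining sketch.
    transactions: list of sets of items; minsup: minimum support count."""
    # Frequent 1-itemsets
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= minsup}
    all_frequent, k = set(freq), 1
    while freq:
        # Generate (k+1)-candidates by joining frequent k-itemsets
        candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Prune candidates that contain an infrequent k-subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Count support by scanning the DB; keep only frequent candidates
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= minsup}
        all_frequent |= freq
        k += 1
    return all_frequent
```

Run on the market-basket transactions with a minimum support count of 3, this returns the four frequent items, the four frequent pairs, and the single frequent triplet {Bread, Milk, Diaper}.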
But much research went into better methods for frequent itemsets
Many methods explored different data representations and algorithms.

Horizontal data layout:
TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B

Vertical data layout (for each item, the list of TIDs containing it):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6

The best algorithm combines both: the FP-growth Algorithm
• Uses a compressed representation of the database, the FP-tree
• Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets

FP-tree construction
Transaction database:
TID  Items
1    {A, B}
2    {B, C, D}
3    {A, C, D, E}
4    {A, D, E}
5    {A, B, C}
6    {A, B, C, D}
7    {B, C}
8    {A, B, C}
9    {A, B, D}
10   {B, C, E}

After reading TID=1, the tree is the single path null → A:1 → B:1.
After reading TID=2, a second branch is added: null → A:1 → B:1 and null → B:1 → C:1 → D:1.

FP-Tree
(Figure: the complete FP-tree for this database. The root's children are A:7 and B:3; shared prefixes are merged, e.g. the path A:7 → B:5 → C:3. A header table with one entry per item A–E points to the chain of nodes carrying that item; these pointers are used to assist frequent itemset generation.)

FP-growth
(Figure: the conditional pattern base for D, obtained by following D's pointer chain and collecting the prefix paths ending in D; FP-growth is then applied recursively to the resulting conditional tree.)
Frequent itemsets found (with support > 1): AD, BD, CD, ACD, BCD.
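A minimal Python sketch of the FP-tree construction just described (my own illustrative code, not from the slides; the names FPNode and build_fptree are assumptions): items of each transaction are inserted in a fixed order along a path from the root, counts are incremented on shared prefixes, and a header table chains together all nodes for each item.

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}                       # item -> FPNode

def build_fptree(transactions, item_order):
    """Insert each transaction (items sorted by item_order) into the tree."""
    root = FPNode(None)
    header = {item: [] for item in item_order}   # item -> list of its nodes
    for t in transactions:
        node = root
        for item in sorted(t, key=item_order.index):
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)       # pointer chain used by FP-growth
            node = node.children[item]
            node.count += 1
    return root, header

# The 10-transaction database from the FP-tree construction slide
db = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"}, {"A","B","C"},
      {"A","B","C","D"}, {"B","C"}, {"A","B","C"}, {"A","B","D"}, {"B","C","E"}]
root, header = build_fptree(db, ["A", "B", "C", "D", "E"])
print([c.count for c in root.children.values()])   # root's children: A:7 and B:3
print(sum(n.count for n in header["D"]))           # total support of D = 5
```

FP-growth would then follow the header chain for an item such as D upward to the root to collect its prefix paths, build the conditional FP-tree, and recurse; with support greater than 1 this yields the itemsets AD, BD, CD, ACD, BCD listed on the last slide.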