Lecture 10 – Association Rule Mining
Dr. Songsri Tangsripairoj, Dr. Benjarath Pupacdi
Faculty of ICT, Mahidol University
ITCS453 Data Warehousing and Data Mining (2/2009)

Topics
- What is Frequent Pattern Analysis?
- Association Rule Mining
- A Two-Step Process of Association Rule Mining
- The Apriori Algorithm: Frequent Itemset Generation, Rule Generation
- Mining Association Rules: An Example

Frequent Pattern Analysis
A frequent pattern is a pattern that occurs frequently in a data set.
- Frequent itemset: a set of items, such as milk and bread, that appear frequently together in a transaction data set.
- Frequent sequential pattern: buying first a PC, then a digital camera, and then a memory card, if this sequence occurs frequently in a shopping-history database.

Motivation: finding inherent regularities in data.
- What products were often purchased together? Beer and diapers?!
- What are the subsequent purchases after buying a PC?
Applications: market-basket analysis, cross-marketing, catalog design, sales-campaign analysis.

Market Basket Analysis
- Helps retailers plan marketing or advertising strategies, or design a new catalog.
- Helps retailers design different store layouts.
- Helps retailers plan which items to put on sale at reduced prices.

Association Rule Mining
Mining for interesting rules (the gold) through a large database (the mountain). "Interesting" rules tell you something about your database that you did not already know and probably could not articulate explicitly, because the data set is so large.
Problem Statement
Given a set of items in a transaction database, retrieve all possible patterns in the form of association rules. The number of rules may be massively large, so a filter may be needed to select the most valuable or interesting rules.

Components of an Association Rule
A => B [support = %, confidence = %]
Example: Milk => Bread [support = 3%, confidence = 80%]
If milk is purchased, then bread is purchased 80 percent of the time, and this pattern occurs in 3 percent of all shopping baskets.

A and B are sets of items, i.e. itemsets; for example, A = {bread, milk} and B = {jam, eggs}.
- A = antecedent
- B = consequent
Measures of rule interestingness:
- Support = P(A U B)
- Confidence = P(B | A)

Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID | Items
 1  | Bread, Milk
 2  | Bread, Diaper, Beer, Eggs
 3  | Milk, Diaper, Beer, Coke
 4  | Bread, Milk, Diaper, Beer
 5  | Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} => {Beer}
{Milk, Bread} => {Eggs, Coke}
{Beer, Bread} => {Milk}

Binary Representation
- Each row corresponds to a transaction; each column corresponds to an item.
- An entry is 1 if the item is present in the transaction, 0 otherwise.
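As an illustration (not part of the slides), a minimal Python sketch that builds this 0/1 matrix from the five market-basket transactions above:

```python
# Binary (one-hot) representation of the market-basket transactions:
# each row is a transaction, each column an item; 1 if present, 0 otherwise.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

items = ["Bread", "Milk", "Diaper", "Beer", "Eggs", "Coke"]
matrix = [[1 if item in t else 0 for item in items] for t in transactions]

for tid, row in enumerate(matrix, start=1):
    print(tid, row)
```

The resulting rows match the table on the next slide.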
TID | Bread | Milk | Diaper | Beer | Eggs | Coke
 1  |   1   |  1   |   0    |  0   |  0   |  0
 2  |   1   |  0   |   1    |  1   |  1   |  0
 3  |   0   |  1   |   1    |  1   |  0   |  1
 4  |   1   |  1   |   1    |  1   |  0   |  0
 5  |   1   |  1   |   1    |  0   |  0   |  1

Definitions
- Itemset X = {x1, ..., xk}: a collection of one or more items. Example: {Milk, Bread, Diaper}.
- k-itemset: an itemset that contains k items. Example: {Milk, Bread} is a 2-itemset.
- Support count (σ): frequency of occurrence of an itemset. Example: σ({Milk, Bread, Diaper}) = 2.
- Support (s): percentage of transactions that contain an itemset. Example: s({Milk, Bread, Diaper}) = 2/5.
- Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.
- Association rule: an implication expression of the form X => Y, where X and Y are disjoint itemsets (X ∩ Y = Ø). Example: {Milk, Diaper} => {Beer}.

Rule Evaluation Metrics
- Support (s): percentage of transactions in D that contain both X and Y.
  support(X => Y) = P(X U Y) = σ(X U Y) / |T|
- Confidence (c): percentage of transactions in D containing X that also contain Y.
  confidence(X => Y) = P(Y | X) = σ(X U Y) / σ(X)

Example: {Milk, Diaper} => {Beer}
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

Example of rules from the market-basket transactions:
  {Milk, Diaper} => {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} => {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} => {Milk}   (s=0.4, c=0.67)
  {Beer} => {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} => {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} => {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}.
- Rules originating from the same itemset have identical support but can have different confidence.

Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold
Rules that satisfy both minsup and minconf are called strong rules.

Two-Step Process
1. Find all frequent itemsets: generate all itemsets whose support ≥ minsup.
2. Generate strong association rules from the frequent itemsets: from each frequent itemset, generate rules that satisfy minsup and minconf.
The overall performance of mining association rules is determined by Step 1.
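The six rules above can be reproduced by enumerating every binary partition of {Milk, Diaper, Beer}. A Python sketch (illustrative, not the slides' code), which also makes the observations concrete: the support numerator is the same for every partition, while the confidence denominator depends on the antecedent:

```python
from itertools import combinations

# The five market-basket transactions from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

itemset = {"Milk", "Diaper", "Beer"}
n = len(transactions)

# Every binary partition (A, B) of the itemset yields a rule A => B.
rules = {}
for r in range(1, len(itemset)):
    for ante in combinations(sorted(itemset), r):
        A = frozenset(ante)
        B = frozenset(itemset) - A
        s = sigma(itemset) / n          # same numerator for every partition
        c = sigma(itemset) / sigma(A)   # denominator depends on the antecedent
        rules[(A, B)] = (round(s, 2), round(c, 2))
        print(f"{sorted(A)} => {sorted(B)}  (s={s:.2f}, c={c:.2f})")
```

All six rules come out with s = 0.4, and the confidences match the slide values.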
Frequent Itemset Generation
Given k items, there are 2^k possible candidate itemsets. For the items {A, B, C, D, E}, the candidates form a lattice from the null set up through the 1-itemsets (A, ..., E), the 2-itemsets (AB, ..., DE), the 3- and 4-itemsets, and the full 5-itemset ABCDE.

Reducing the Number of Candidates
The Apriori property: any subset of a frequent itemset must be frequent.
- If {beer, diaper, nuts} is frequent, so is {beer, diaper}: every transaction having {beer, diaper, nuts} also contains {beer, diaper}.
- Formally: for all X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y).
- Note that the support of an itemset never exceeds the support of its subsets (support is anti-monotone).

Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
Method:
1. Initially, scan the DB once to get the frequent 1-itemsets.
2. Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
3. Test the candidates against the DB.
4. Terminate when no frequent or candidate set can be generated.

Illustrating the Apriori Pruning Principle
In the candidate lattice over {A, B, C, D, E}, once an itemset (say {A, B}) is found to be infrequent, all of its supersets are pruned without ever being generated or counted.

Illustrating the Apriori Principle (minimum support = 3)

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
Itemset         | Count
{Bread, Milk}   | 3
{Bread, Beer}   | 2
{Bread, Diaper} | 3
{Milk, Beer}    | 2
{Milk, Diaper}  | 3
{Beer, Diaper}  | 3

Triplets (3-itemsets):
Itemset               | Count
{Bread, Milk, Diaper} | 3
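The anti-monotone property that underpins the pruning principle (X ⊆ Y implies s(X) ≥ s(Y)) can be spot-checked by brute force on the five market-basket transactions. An illustrative Python sketch, not part of the slides:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count of an itemset over the transaction list."""
    return sum(1 for t in transactions if itemset <= t)

# For every itemset Y over these items, check that each proper subset X
# has a support count at least as large as Y's.
items = ["Bread", "Milk", "Diaper", "Beer"]
for k in range(2, len(items) + 1):
    for y in combinations(items, k):
        Y = frozenset(y)
        for j in range(1, k):
            for x in combinations(y, j):
                X = frozenset(x)
                assert sigma(X) >= sigma(Y), (X, Y)
print("anti-monotone property holds on this data")
```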
The Apriori Algorithm
Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent 1-itemsets};
  for (k = 1; Lk != Ø; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;

The Apriori Algorithm – An Example (minimum support count = 2)

Database TDB:
TID | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan: C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 = {A}:2, {B}:3, {C}:3, {E}:3   ({D} is pruned: support count 1 < 2)

C2 = {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 = {B,C,E}
3rd scan: {B,C,E}:2
L3 = {B,C,E}:2

Generating Association Rules from Frequent Itemsets
Given a frequent itemset L, find all non-empty proper subsets f of L such that f => L − f satisfies the minimum confidence requirement. If |L| = k, there are 2^k − 2 candidate association rules (ignoring Ø => L and L => Ø).

If {I1, I2, I5} is a frequent itemset, its non-empty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}, and the resulting association rules are:
  I1, I2 => I5    confidence = 2/4 = 50%
  I1, I5 => I2    confidence = 2/2 = 100%
  I2, I5 => I1    confidence = 2/2 = 100%
  I1 => I2, I5    confidence = 2/6 = 33%
  I2 => I1, I5    confidence = 2/7 = 29%
  I5 => I1, I2    confidence = 2/2 = 100%

Mining Association Rules: An Example
A subset of the Credit Card Promotion Database:

Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex
Yes | No  | No  | No  | Male
Yes | Yes | Yes | No  | Female
No  | No  | No  | No  | Male
Yes | Yes | Yes | Yes | Male
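The pseudo-code above can be sketched as a compact Python implementation and run on the TDB example (minimum support count = 2). This is an illustrative sketch, not the slides' exact code; it uses self-joining plus Apriori pruning for candidate generation:

```python
from itertools import combinations

# Database TDB from the slides; minimum support count = 2.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def sigma(itemset):
    """Support count of an itemset over the database."""
    return sum(1 for t in db if itemset <= t)

# L1: frequent 1-itemsets (first DB scan).
items = sorted({i for t in db for i in t})
Lk = [frozenset({i}) for i in items if sigma(frozenset({i})) >= min_sup]
frequent = list(Lk)

k = 1
while Lk:
    # Join Lk with itself to form (k+1)-item candidates, then prune any
    # candidate that has an infrequent k-subset (Apriori pruning).
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(Lk) for s in combinations(c, k))}
    # Test the surviving candidates against the DB.
    Lk = [c for c in candidates if sigma(c) >= min_sup]
    frequent.extend(Lk)
    k += 1

for f in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(f), sigma(f))
```

Running this yields exactly the slides' result: L1 = {A}, {B}, {C}, {E}; L2 = {A,C}, {B,C}, {B,E}, {C,E}; L3 = {B,C,E}.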
Yes | No  | Yes | No  | Female
No  | No  | No  | No  | Female
Yes | No  | Yes | Yes | Male
No  | Yes | No  | No  | Male
Yes | No  | No  | No  | Male
Yes | Yes | Yes | No  | Female

1-Itemsets
1-Itemset                    | Number of Items
Magazine Promotion=Yes       | 7
Watch Promotion=Yes          | 4
Watch Promotion=No           | 6
Life Insurance Promotion=Yes | 5
Life Insurance Promotion=No  | 5
Credit Card Insurance=No     | 8
Sex=Male                     | 6
Sex=Female                   | 4

2-Itemsets
2-Itemset                                                | Number of Items
Magazine Promotion=Yes & Watch Promotion=No              | 4
Magazine Promotion=Yes & Life Insurance Promotion=Yes    | 5
Magazine Promotion=Yes & Credit Card Insurance=No        | 5
Magazine Promotion=Yes & Sex=Male                        | 4
Watch Promotion=No & Life Insurance Promotion=No         | 4
Watch Promotion=No & Credit Card Insurance=No            | 5
Watch Promotion=No & Sex=Male                            | 4
Life Insurance Promotion=No & Credit Card Insurance=No   | 5
Life Insurance Promotion=No & Sex=Male                   | 4
Credit Card Insurance=No & Sex=Male                      | 4
Credit Card Insurance=No & Sex=Female                    | 4

3-Itemsets
3-Itemset                                                                  | Number of Items
Watch Promotion=No & Life Insurance Promotion=No & Credit Card Insurance=No | 4

Two possible 2-itemset rules are:
  IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes (5/7)
  IF Life Insurance Promotion = Yes THEN Magazine Promotion = Yes (5/5)

Three possible 3-itemset rules are:
  IF Watch Promotion = No & Life Insurance Promotion = No THEN Credit Card Insurance = No (4/4)
  IF Watch Promotion = No THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)
  IF Credit Card Insurance = No THEN Watch Promotion = No & Life Insurance Promotion = No (4/8)
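The 3-itemset rules above follow mechanically from the itemset counts: each rule's confidence is the count of the full itemset divided by the count of the antecedent. A Python sketch of this rule-generation step (the short labels such as "Watch=No" are my shorthand for the slide's attribute-value pairs, and the counts are copied from the tables above):

```python
from itertools import combinations

# Support counts read off the 1-, 2-, and 3-itemset tables.
counts = {
    frozenset({"Watch=No"}): 6,
    frozenset({"LifeIns=No"}): 5,
    frozenset({"CreditIns=No"}): 8,
    frozenset({"Watch=No", "LifeIns=No"}): 4,
    frozenset({"Watch=No", "CreditIns=No"}): 5,
    frozenset({"LifeIns=No", "CreditIns=No"}): 5,
    frozenset({"Watch=No", "LifeIns=No", "CreditIns=No"}): 4,
}

L = frozenset({"Watch=No", "LifeIns=No", "CreditIns=No"})

# For each non-empty proper subset f of L, emit the rule f => L - f with
# confidence sigma(L) / sigma(f).
confidences = {}
for r in range(1, len(L)):
    for ante in combinations(sorted(L), r):
        f = frozenset(ante)
        conf = counts[L] / counts[f]
        confidences[f] = conf
        print(sorted(f), "=>", sorted(L - f), f"confidence = {conf:.2f}")
```

The rule with the two-attribute antecedent comes out at 4/4 = 1.0, matching the first 3-itemset rule on the slide; the single-attribute antecedents give 4/6, 4/5, and 4/8.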