Data Mining: Association Rules
Dr. Engin YILDIZTEPE

Reference Books
- Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann Publishers.
- Larose, Daniel T. (2005). Discovering Knowledge in Data – An Introduction to Data Mining. New Jersey: John Wiley and Sons Ltd.
- Tan, Pang-Ning, Steinbach, Michael, Kumar, Vipin (2006). Introduction to Data Mining. Addison Wesley.
- Alpaydın, E. (2010). Introduction to Machine Learning. Second edition. London: MIT Press.

Association Rules: Definition
- Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.
- In general, association rule mining can be viewed as a two-step process:
  1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup.
  2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.

Association Rule Mining
Market-basket transactions:

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

Examples of association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!

Data Representation for Transactions
- There are two principal methods of representing this type of market-basket data: the transactional data format or the tabular data format.
- The transactional data format requires only two fields, an ID field and a content field, with each record representing a single item only:

  Transaction ID  Item
  1               Milk
  1               Bread
  1               Eggs
  2               Bread
  2               Sugar
  3               Bread
  3               Cereal
  4               Milk
  4               Bread
  4               Sugar
  …               …

- In the tabular data format, each record represents a separate transaction, with as many 0/1 flag fields as there are items:

  Transaction ID  Milk  Bread  Eggs  Sugar  Cereal
  1               1     1      1     0      0
  2               0     1      0     1      0
  3               0     1      0     0      1
  4               1     1      0     1      0
  5               1     0      0     0      1
  6               0     1      0     0      1
  7               1     0      0     0      1
  8               1     1      1     0      1
  9               1     1      0     0      1
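To make the relationship between the two formats concrete, here is a minimal sketch (plain Python, standard library only) that converts transactional-format records into the tabular 0/1 format. The record list mirrors the transactional table above; the helper name to_tabular is illustrative rather than taken from any of the reference books.

```python
# Minimal sketch: transactional format (one row per item) -> tabular 0/1 format.
from collections import defaultdict

# Transactional-format records: (transaction ID, item), as in the table above.
transactional = [
    (1, "Milk"), (1, "Bread"), (1, "Eggs"),
    (2, "Bread"), (2, "Sugar"),
    (3, "Bread"), (3, "Cereal"),
    (4, "Milk"), (4, "Bread"), (4, "Sugar"),
]

def to_tabular(records):
    """Return the sorted item list and, per transaction ID, a 0/1 flag vector."""
    items = sorted({item for _, item in records})
    baskets = defaultdict(set)
    for tid, item in records:
        baskets[tid].add(item)
    table = {tid: [int(i in basket) for i in items]
             for tid, basket in sorted(baskets.items())}
    return items, table

items, table = to_tabular(transactional)
print(items)     # ['Bread', 'Cereal', 'Eggs', 'Milk', 'Sugar']
print(table[1])  # [1, 0, 1, 1, 0] -> transaction 1 contains Bread, Eggs and Milk
```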
Definition: Frequent Itemset
- Itemset: a collection of one or more items, e.g. {Milk, Bread, Coke}.
- k-itemset: an itemset that contains k items.
- Support count (σ): frequency of occurrence of an itemset; e.g. σ({Milk, Bread, Diaper}) = 2 in the market-basket transactions above.
- Support (s): fraction of transactions that contain an itemset; e.g. s({Milk, Bread, Diaper}) = 2/5.
- Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

Definition: Association Rule
- Association rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}.
- Rule evaluation metrics:
  - Support (s): the fraction of transactions that contain both X and Y.
  - Confidence (c): measures how often items in Y appear in transactions that contain X.
- Example, for {Milk, Diaper} → {Beer} on the transactions above:

  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67

- Based on the types of values, association rules can be classified into two categories: Boolean association rules and quantitative association rules.
  - Boolean association rule: Keyboard ⇒ Mouse [support = 6%, confidence = 70%]
  - Quantitative association rule: (Age = 26…30) ⇒ (Cars = 1, 2) [support = 3%, confidence = 36%]
- The support s of a particular association rule A ⇒ B is the proportion of transactions in D that contain both A and B.
- The confidence c of the association rule A ⇒ B is a measure of the accuracy of the rule, as determined by the percentage of transactions in D containing A that also contain B.

Mining Association Rules
Example rules from the market-basket transactions above:

  {Milk, Diaper} → {Beer}    (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}    (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}    (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}    (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}    (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}    (s=0.4, c=0.5)

Observations:
- All of the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}.
- Rules originating from the same itemset have identical support but can have different confidence.
- Thus, we may decouple the support and confidence requirements.

Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having:
  - support ≥ minsup threshold
  - confidence ≥ minconf threshold
- Brute-force approach:
  - List all possible association rules.
  - Compute the support and confidence for each rule.
  - Prune rules that fail the minsup and minconf thresholds.
  ⇒ Computationally prohibitive!
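As a small illustration of the support and confidence definitions above, the following sketch evaluates a candidate rule X → Y over the five market-basket transactions. The helper names support_count and rule_metrics are illustrative, not from any particular library.

```python
# Sketch: support count, and support/confidence of a rule X -> Y.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def rule_metrics(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y."""
    s = support_count(X | Y, transactions) / len(transactions)
    c = support_count(X | Y, transactions) / support_count(X, transactions)
    return s, c

# {Milk, Diaper} -> {Beer}: s = 2/5 = 0.4, c = 2/3 ~ 0.67, as on the slide above.
print(rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions))
```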
Mining Association Rules: Two-Step Approach
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
- Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation
(Figure: the itemset lattice over five items A–E, from the null itemset through all subsets up to ABCDE.)
- Given d items, there are 2^d possible candidate itemsets.
- Brute-force approach:
  - Each itemset in the lattice is a candidate frequent itemset.
  - Count the support of each candidate by scanning the database, matching each transaction against every candidate.
  - Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates and w the maximum transaction width ⇒ expensive, since M = 2^d.
- This is an instance of the curse of dimensionality.

Computational Complexity
- Given d unique items:
  - Total number of itemsets = 2^d
  - Total number of possible association rules:

    R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1

  - If d = 6, R = 602 rules (a brute-force check of this count is sketched below).
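As a sanity check on the closed form above, the short sketch below counts the rules by brute force: every rule is an ordered pair of nonempty, disjoint itemsets drawn from d items. The function name count_rules is illustrative.

```python
# Brute-force check of R = 3^d - 2^(d+1) + 1 for small d.
from itertools import combinations

def count_rules(d):
    """Count rules X -> Y with X and Y nonempty, disjoint subsets of d items."""
    items = range(d)
    total = 0
    for k in range(1, d):                       # size of the antecedent X
        for X in combinations(items, k):
            rest = [i for i in items if i not in X]
            for j in range(1, len(rest) + 1):   # size of the consequent Y
                total += len(list(combinations(rest, j)))
    return total

d = 6
print(count_rules(d), 3**d - 2**(d + 1) + 1)    # both print 602
```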
Frequent Itemset Generation Strategies
- Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M.
- Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases; used by DHP and vertical-based mining algorithms.
- Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions; there is then no need to match every candidate against every transaction.

Reducing the Number of Candidates
- The Apriori algorithm (Agrawal and Srikant, 1994) is the most popular association rule algorithm.
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- The Apriori principle holds due to the following property of the support measure:

  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

- The support of an itemset never exceeds the support of its subsets.
- This is known as the anti-monotone property of support.

Illustrating the Apriori Principle
(Figure: the itemset lattice again; once an itemset is found to be infrequent, all of its supersets are pruned from the search.)

With minimum support count = 3 on the market-basket transactions above:

Items (1-itemsets):
  Item    Count
  Bread   4
  Coke    2
  Milk    4
  Beer    3
  Diaper  4
  Eggs    1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
  Itemset          Count
  {Bread, Milk}    3
  {Bread, Beer}    2
  {Bread, Diaper}  3
  {Milk, Beer}     2
  {Milk, Diaper}   3
  {Beer, Diaper}   3

Triplets (3-itemsets):
  Itemset                Count
  {Bread, Milk, Diaper}  2

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.

Apriori Algorithm
Method (a sketch of this loop in Python is given after Example 1 below):
- Let k = 1.
- Generate frequent itemsets of length 1.
- Repeat until no new frequent itemsets are identified:
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
  - Prune candidate itemsets containing subsets of length k that are infrequent.
  - Count the support of each candidate by scanning the DB.
  - Eliminate candidates that are infrequent, leaving only those that are frequent.

Generating Association Rules from Frequent Itemsets
- Once the frequent itemsets from the transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence).
- Association rules can be generated as follows:
  - For each frequent itemset l, generate all nonempty subsets of l.
  - For every nonempty subset s of l, output the rule s ⇒ (l − s) if its confidence ≥ the minimum confidence threshold.
- Because the rules are generated from frequent itemsets, each one automatically satisfies minimum support.

Continuing the previous example, L3 = {Bread, Milk, Diaper} with count 2. The nonempty subsets of L3 are {Bread}, {Milk}, {Diaper}, {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, giving the association rules:

  {Milk, Diaper} → {Bread}    (s=0.4, c=0.67)
  {Bread, Diaper} → {Milk}    (s=0.4, c=0.67)
  {Bread, Milk} → {Diaper}    (s=0.4, c=0.67)
  {Diaper} → {Milk, Bread}    (s=0.4, c=0.5)
  {Milk} → {Bread, Diaper}    (s=0.4, c=0.5)
  {Bread} → {Diaper, Milk}    (s=0.4, c=0.5)

Example 1
A subset of a market-basket database is shown in the following table. Find the 1-itemsets and 2-itemsets.

  T ID  Bread  Milk  Cheese  Cleaner
  1     Yes    Yes   Yes     Yes
  2     Yes    No    No      No
  3     Yes    Yes   Yes     No
  4     No     No    No      No
  5     Yes    No    No      No
  6     Yes    No    Yes     Yes
  7     No     No    No      No
  8     Yes    Yes   Yes     No
  9     Yes    No    Yes     No
  10    No     Yes   No      No

Single items (1-itemsets):
  Itemset  Count
  Bread    7
  Milk     4
  Cheese   5
  Cleaner  2

Pairs (2-itemsets):
  Itemset           Count
  Bread & Milk      3
  Bread & Cheese    5
  Bread & Cleaner   2
  Milk & Cheese     3
  Milk & Cleaner    1
  Cheese & Cleaner  2
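The level-wise loop on the Apriori Algorithm slide can be sketched in plain Python as below. This is only an illustration of the join, prune and support-counting steps, not the implementation from any of the reference books; it reuses the five market-basket transactions from earlier, and with a minimum support count of 3 it stops after the 2-itemsets, exactly as in the pruning illustration above.

```python
# Sketch of Apriori frequent-itemset generation (join + prune + support counting).
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    return sum(1 for t in transactions if set(itemset) <= t)

def apriori_frequent_itemsets(transactions, min_count):
    """Level-wise search: L1, L2, ... until no new frequent itemsets are found."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items
               if support_count([i], transactions) >= min_count]      # L1
    frequent = list(current)
    k = 1
    while current:
        k += 1
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count support by scanning the DB; keep only the frequent candidates.
        current = [c for c in candidates
                   if support_count(c, transactions) >= min_count]
        frequent.extend(current)
    return frequent

for itemset in apriori_frequent_itemsets(transactions, min_count=3):
    print(sorted(itemset))
```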
Example 2
Consider the following transactions and develop association rules using the Apriori algorithm. Let φ = 3 (minimum support count threshold). Using 75% minimum confidence and 20% minimum support, generate one-antecedent and two-antecedent association rules.

Notation: Ck is the candidate itemset of size k; Lk is the frequent itemset of size k.

(Transaction table and the intermediate C1/L1 and C2/L2 tables omitted.)

C3 candidate itemsets and their support counts:

  Itemset    Count
  {A, B, C}  1
  {A, B, D}  1
  {A, B, F}  2
  {A, D, F}  3
  {B, C, D}  1
  {B, C, F}  1
  {B, D, F}  1

With φ = 3, only {A, D, F} is frequent. The frequent itemset {A, D, F} has the nonempty subsets {A}, {D}, {F}, {A, D}, {A, F}, {D, F}, and the generated association rules are:

  A ⇒ F & D      confidence = 3/8 = 37.5%
  F ⇒ A & D      confidence = 3/4 = 75%
  D ⇒ F & A      confidence = 3/4 = 75%
  A & D ⇒ F      confidence = 3/3 = 100%
  A & F ⇒ D      confidence = 3/4 = 75%
  D & F ⇒ A      confidence = 3/3 = 100%

With the 75% minimum confidence threshold, all of these rules except A ⇒ F & D are strong.
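Finally, the rule-generation step used in both examples (every nonempty proper subset of a frequent itemset becomes an antecedent, and rules below the minimum confidence are discarded) can be sketched as follows. Because the Example 2 transactions are not reproduced above, the sketch runs on the earlier market-basket data and its itemset {Bread, Milk, Diaper}; the function name rules_from_itemset is illustrative.

```python
# Sketch: generate association rules from one frequent itemset, filtered by confidence.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    return sum(1 for t in transactions if set(itemset) <= t)

def rules_from_itemset(itemset, transactions, min_conf):
    """Yield (antecedent, consequent, confidence) for rules meeting min_conf."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            X = frozenset(antecedent)
            Y = itemset - X
            conf = support_count(itemset, transactions) / support_count(X, transactions)
            if conf >= min_conf:
                yield set(X), set(Y), conf

# Rules from {Bread, Milk, Diaper} with a 60% minimum confidence threshold;
# only the three rules with confidence 2/3 survive, as in the earlier example.
for X, Y, c in rules_from_itemset({"Bread", "Milk", "Diaper"}, transactions, 0.6):
    print(sorted(X), "->", sorted(Y), f"confidence = {c:.2f}")
```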