Data Mining — Association Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 6 of Introduction to Data Mining, by Tan, Steinbach, Kumar. © Tan, Steinbach, Kumar, 4/18/2004.

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions (the running example):

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

Example association rules:

  {Diaper} -> {Beer}
  {Milk, Bread} -> {Eggs, Coke}
  {Beer, Bread} -> {Milk}

Implication here means co-occurrence, not causality!

Definition: Frequent Itemset

- Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}.
  - k-itemset: an itemset that contains k items.
- Support count (σ): the frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2.
- Support (s): the fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5.
- Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

Definition: Association Rule

- Association rule: an implication expression of the form X -> Y, where X and Y are itemsets, e.g. {Milk, Diaper} -> {Beer}.
- Rule evaluation metrics:
  - Support (s): the fraction of transactions that contain both X and Y.
  - Confidence (c): how often items in Y appear in transactions that contain X.
- Example, for the rule {Milk, Diaper} -> {Beer}:

  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having

- support ≥ minsup threshold, and
- confidence ≥ minconf threshold.

Brute-force approach:

- List all possible association rules.
- Compute the support and confidence for each rule.
- Prune rules that fail the minsup and minconf thresholds.
- ⇒ Computationally prohibitive!

Mining Association Rules

Example rules over the market-basket transactions above:

  {Milk, Diaper} -> {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} -> {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} -> {Milk}   (s=0.4, c=0.67)
  {Beer} -> {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} -> {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} -> {Diaper, Beer}   (s=0.4, c=0.5)

Observations:

- All of the above rules are binary partitions of the same itemset, {Milk, Diaper, Beer}.
- Rules originating from the same itemset have identical support but can have different confidence.
- Thus, we may decouple the support and confidence requirements.

Two-step approach:

1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

Frequent itemset generation is still computationally expensive.
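To make the definitions concrete, here is a minimal Python sketch (not part of the original notes; the function names are illustrative) that computes support count, support, and confidence over the market-basket transactions above:

```python
# The five market-basket transactions from the running example.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(lhs | rhs) / support_count(lhs)

print(support({"Milk", "Diaper", "Beer"}))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}))  # 0.666...
```

Run on {Milk, Diaper} -> {Beer}, this reproduces s = 0.4 and c ≈ 0.67 from the definition above.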
Frequent Itemset Generation

Given d items, there are 2^d possible candidate itemsets, organized in a lattice from the empty (null) itemset down to the itemset containing all d items.

[Figure: itemset lattice for d = 5 items, from null through the 1-itemsets A..E, the 2-itemsets AB..DE, and so on, down to ABCDE.]

Brute-force approach:

- Each itemset in the lattice is a candidate frequent itemset.
- Count the support of each candidate by scanning the database, matching each transaction against every candidate.
- Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the transaction width ⇒ expensive, since M = 2^d!

Computational Complexity

Given d unique items:

- Total number of itemsets = 2^d.
- Total number of possible association rules:

  R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1

  If d = 6, R = 3^6 − 2^7 + 1 = 602 rules.

Frequent Itemset Generation Strategies

- Reduce the number of candidates (M):
  - Complete search has M = 2^d.
  - Use pruning techniques to reduce M.
- Reduce the number of comparisons (NM):
  - Use efficient data structures to store the candidates or transactions.
  - No need to match every candidate against every transaction.

Reducing the Number of Candidates

- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- The Apriori principle holds due to the following property of the support measure:

  ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

- The support of an itemset never exceeds the support of its subsets; this is known as the anti-monotone property of support.
- Conversely, once an itemset is found to be infrequent, all of its supersets can be pruned from the lattice.

Illustrating the Apriori Principle

On the market-basket data, with minimum support count = 3:

  Items (1-itemsets):
    Bread 4, Coke 2, Milk 4, Beer 3, Diaper 4, Eggs 1

  Coke and Eggs are infrequent, so there is no need to generate candidates involving Coke or Eggs.

  Pairs (2-itemsets):
    {Bread, Milk} 3, {Bread, Beer} 2, {Bread, Diaper} 3,
    {Milk, Beer} 2, {Milk, Diaper} 3, {Beer, Diaper} 3

  Triplets (3-itemsets):
    {Bread, Milk, Diaper} 2

If every subset were considered, 6C1 + 6C2 + 6C3 = 41 candidates; with support-based pruning, only 6 + 6 + 1 = 13.

Apriori Algorithm

Method:

- Let k = 1.
- Generate frequent itemsets of length 1.
- Repeat until no new frequent itemsets are identified:
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
  - Prune candidate itemsets containing subsets of length k that are infrequent.
  - Count the support of each candidate by scanning the DB.
  - Eliminate candidates that are infrequent, leaving only those that are frequent.
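As a concrete sketch of this method (not code from the notes, and written for clarity rather than efficiency), the Python below implements the generate / prune / count / eliminate loop, merging pairs of frequent k-itemsets into (k+1)-candidates. Run on the transactions of the worked example that follows, it reproduces the level-by-level tables below.

```python
from itertools import combinations

# Transactions of the worked example below; minsup is a support count.
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"},
    {"A", "B", "D"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
minsup = 2

def support_count(itemset):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support_count({i}) >= minsup]
all_frequent = list(frequent)

k = 1
while frequent:
    freq_set = set(frequent)
    candidates = set()
    # Candidate generation: merge two frequent k-itemsets whose union has
    # k+1 items; prune any candidate with an infrequent k-subset.
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) == k + 1 and all(
                    frozenset(s) in freq_set
                    for s in combinations(union, k)):
                candidates.add(union)
    # Support counting over the database; eliminate infrequent candidates.
    frequent = [c for c in candidates if support_count(c) >= minsup]
    all_frequent.extend(frequent)
    k += 1

for itemset in sorted(all_frequent, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(itemset)), support_count(itemset))
```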
Apriori Algorithm: Example

  TID  Items
  1    A, B, E
  2    B, D
  3    B, C
  4    A, B, D
  5    A, C
  6    B, C
  7    A, C
  8    A, B, C, E
  9    A, B, C

Minsup = 2 (as a support count).

Level 1:

  Candidate  Support  Frequent?
  A          6        Y
  B          7        Y
  C          6        Y
  D          2        Y
  E          2        Y

Level 2:

  Candidate  Support  Frequent?
  AB         4        Y
  AC         4        Y
  AD         1        N
  AE         2        Y
  BC         4        Y
  BD         2        Y
  BE         2        Y
  CD         0        N
  CE         1        N
  DE         0        N

Level 3:

  Candidate  Support  Frequent?
  ABC        2        Y
  ABE        2        Y

There are no level-4 candidates: the only possible merge, ABCE, contains the infrequent subset CE, so it is pruned.

Reducing the Number of Comparisons

- Candidate counting: scan the database of transactions to determine the support of each candidate itemset.
- To reduce the number of comparisons, store the candidates in a hash structure. Instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets.

Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3:

  {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4},
  {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:

- A hash function; here, items 1, 4, 7 hash to the first branch, items 2, 5, 8 to the second, and items 3, 6, 9 to the third.
- A max leaf size: the maximum number of itemsets stored in a leaf node. If the number of candidate itemsets in a leaf exceeds the max leaf size, split the node.

[Figure: the candidate hash tree; the root hashes on a candidate's first item (1, 4 or 7 / 2, 5 or 8 / 3, 6 or 9), the next level on its second item, and so on, with the 15 candidates distributed among the leaves.]
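As a rough illustration of the bucketing idea (a simplification, not the full tree from the notes), the sketch below hashes each candidate by its first item only and probes buckets with the size-3 subsets of a transaction; the real hash tree recurses on the second and third items and splits leaves that exceed the max leaf size.

```python
from itertools import combinations

# The 15 candidate 3-itemsets from the slide, as sorted tuples.
candidates = [
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
    (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]

def h(item):
    """Hash function from the slide: 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2."""
    return (item - 1) % 3

# One-level approximation of the hash tree: bucket each candidate by
# the hash of its first item.
buckets = {0: [], 1: [], 2: []}
for c in candidates:
    buckets[h(c[0])].append(c)

def matches(transaction):
    """Return the candidates contained in the transaction, probing only
    the bucket each size-3 subset hashes to."""
    found = set()
    for subset in combinations(sorted(transaction), 3):
        for c in buckets[h(subset[0])]:
            if c == subset:
                found.add(c)
    return found

print(matches({1, 2, 3, 5, 6}))  # {(1, 2, 5), (1, 3, 6), (3, 5, 6)}
```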
Subset Operation

Given a transaction t, what are the possible subsets of size 3? For t = {1, 2, 3, 5, 6}, enumerate them by fixing the first item, then the second, then the third: 1 followed by a size-2 subset of {2, 3, 5, 6} (written 1+2356), then 2+356, then 3+56; at the next level 12+356, 13+56, 15+6, and so on.

Subset Operation Using Hash Tree

Matching t against the hash tree follows the same recursion: at the root, hash on items 1, 2 and 3 in turn; at the next level, e.g. after 1, hash on 2, 3 and 5 (12+356, 13+56, 15+6); continue until leaves are reached, then compare t against the candidates stored in each visited leaf.

[Figure: the transaction 1 2 3 5 6 descending the candidate hash tree; the visited leaves together hold 11 of the 15 candidates.]

In this example, the transaction is matched against only 11 out of 15 candidates.

Factors Affecting Complexity

- Choice of minimum support threshold:
  - Lowering the support threshold results in more frequent itemsets.
  - This may increase the number of candidates and the maximum length of frequent itemsets.
- Dimensionality (number of items) of the data set:
  - More space is needed to store the support count of each item.
  - If the number of frequent items also increases, both computation and I/O costs may also increase.
- Size of database:
  - Since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions.
- Average transaction width:
  - Transaction width increases with denser data sets.
  - This may increase the maximum length of frequent itemsets and the number of hash tree traversals (the number of subsets in a transaction increases with its width).

Rule Generation

- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.
- If {A, B, C, D} is a frequent itemset, the candidate rules are:

  ABC → D, ABD → C, ACD → B, BCD → A,
  A → BCD, B → ACD, C → ABD, D → ABC,
  AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB

- If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).

How can rules be generated efficiently from frequent itemsets?

- In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).
- But the confidence of rules generated from the same itemset does have an anti-monotone property. For example, with L = {A, B, C, D}:

  c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

- That is, confidence is anti-monotone with respect to the number of items on the right-hand side of the rule.

[Figure: lattice of rules for L = {A, B, C, D}; once a low-confidence rule is found, all rules below it in the lattice (those with larger consequents) are pruned.]
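This pruning can be sketched as a level-wise search over consequents. The Python below is illustrative, not from the notes: it extends each surviving consequent one item at a time (a slight simplification of the consequent-merging scheme described next), but exploits the same anti-monotone property — a consequent whose rule fails minconf is never extended.

```python
def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def generate_rules(L, transactions, minconf):
    """Level-wise rule generation from one frequent itemset L."""
    rules = []
    sigma_L = support_count(L, transactions)
    # Start from all 1-item consequents.
    consequents = {frozenset([i]) for i in L}
    while consequents:
        next_level = set()
        for H in consequents:
            antecedent = L - H
            if not antecedent:
                continue
            conf = sigma_L / support_count(antecedent, transactions)
            if conf >= minconf:
                rules.append((antecedent, H, conf))
                # Only high-confidence consequents are extended.
                for i in antecedent:
                    next_level.add(H | {i})
        consequents = next_level
    return rules

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for a, c, conf in generate_rules(frozenset({"Milk", "Diaper", "Beer"}),
                                 transactions, minconf=0.6):
    print(sorted(a), "->", sorted(c), round(conf, 2))
```

On the market-basket data this keeps exactly the four rules from {Milk, Diaper, Beer} with c ≥ 0.6, and never evaluates rules below the pruned consequents {Milk, Beer} and {Diaper, Beer}.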
Rule Generation for the Apriori Algorithm

- A candidate rule is generated by merging two rules that share the same prefix in the rule consequent.
- Example: join(CD ⇒ AB, BD ⇒ AC) produces the candidate rule D ⇒ ABC.
- Prune the rule D ⇒ ABC if its subset rule AD ⇒ BC does not have high confidence.

Compact Representation of Frequent Itemsets

- Some itemsets are redundant because they have identical support to their supersets.
- Example: consider a data set of 30 items A1..A10, B1..B10, C1..C10, in which each transaction contains either all ten A-items, all ten B-items, or all ten C-items. The number of frequent itemsets is then

  3 \times \sum_{k=1}^{10} \binom{10}{k}

- We therefore need a compact representation.

Maximal Frequent Itemset

- An itemset is maximal frequent if none of its immediate supersets is frequent.
- In the itemset lattice, the maximal frequent itemsets sit on the border between the frequent and the infrequent itemsets.

Closed Itemset

- An itemset is closed if none of its (immediate) supersets has the same support as the itemset.
- Example:

  TID  Items
  1    {A, B}
  2    {B, C, D}
  3    {A, B, C, D}
  4    {A, B, D}
  5    {A, B, C, D}

  Itemset  Support      Itemset    Support
  {A}      4            {A,B,C}    2
  {B}      5            {A,B,D}    3
  {C}      3            {A,C,D}    2
  {D}      4            {B,C,D}    3
  {A,B}    4            {A,B,C,D}  2
  {A,C}    2
  {A,D}    3
  {B,C}    3
  {B,D}    4
  {C,D}    3

Maximal vs Closed Itemsets

- Every maximal frequent itemset is closed (if a superset had the same support it would also be frequent), but an itemset can be closed without being maximal.
- Example: for the transactions

  TID  Items
  1    ABC
  2    ABCD
  3    BCE
  4    ACDE
  5    DE

  with minimum support = 2, there are 9 closed frequent itemsets but only 4 maximal frequent itemsets.

[Figure: itemset lattice annotated with the transaction IDs supporting each itemset; itemsets are marked as closed but not maximal or as closed and maximal, and itemsets not supported by any transaction are crossed out.]
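A brute-force check of these two definitions on the closed-itemset example above (a sketch; the minsup = 2 threshold is an assumption, chosen so every itemset in the table is frequent):

```python
from itertools import combinations

transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "B", "D"}, {"A", "B", "C", "D"},
]
minsup = 2  # assumed threshold, not stated in the notes

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Enumerate all frequent itemsets by brute force (fine at this scale).
items = sorted({i for t in transactions for i in t})
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support_count(set(combo))
        if s >= minsup:
            frequent[frozenset(combo)] = s

def immediate_supersets(X):
    return [X | {i} for i in items if i not in X]

# Maximal: no immediate superset is frequent.
maximal = [X for X in frequent
           if not any(Y in frequent for Y in immediate_supersets(X))]
# Closed: no immediate superset has the same support.
closed = [X for X in frequent
          if not any(frequent.get(Y) == frequent[X]
                     for Y in immediate_supersets(X))]

print("maximal:", sorted(sorted(X) for X in maximal))  # only ABCD
print("closed:", sorted(sorted(X) for X in closed))    # B, AB, BD, ABD, BCD, ABCD
```

On this data the only maximal frequent itemset is {A,B,C,D}, while {B}, {A,B}, {B,D}, {A,B,D}, {B,C,D} and {A,B,C,D} are closed — confirming that closed itemsets need not be maximal.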