Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

    TID  Items
    1    Bread, Milk
    2    Bread, Diaper, Beer, Eggs
    3    Milk, Diaper, Beer, Coke
    4    Bread, Milk, Diaper, Beer
    5    Bread, Milk, Diaper, Coke

Example of Association Rules:
    {Diaper} → {Beer}
    {Milk, Bread} → {Eggs, Coke}
    {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

(© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004)
Definition: Frequent Itemset

Example transactions:

    TID  Items
    1    Bread, Milk
    2    Bread, Diaper, Beer, Eggs
    3    Milk, Diaper, Beer, Coke
    4    Bread, Milk, Diaper, Beer
    5    Bread, Milk, Diaper, Coke

Itemset
– A collection of one or more items
  Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): the fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example, over the transactions above: {Milk, Diaper} ⇒ {Beer}

    s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
    c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
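To make these definitions concrete, here is a minimal Python sketch (ours, not part of the original slides) that computes σ, support, and confidence over the example transactions; the names support_count and rule_metrics are our own.

    # Example market-basket transactions from the slides.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        """sigma(itemset): number of transactions containing the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    def rule_metrics(X, Y, transactions):
        """Return (support, confidence) of the rule X -> Y."""
        s = support_count(X | Y, transactions) / len(transactions)
        c = support_count(X | Y, transactions) / support_count(X, transactions)
        return s, c

    s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
    print(s, c)  # 0.4 0.666... as on the slide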
Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
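A direct rendering of this brute-force approach as a Python sketch (ours; it reuses transactions and support_count from the sketch above). Skipping the partitions of a low-support itemset is a small shortcut, since every rule built from it fails minsup anyway:

    from itertools import combinations

    def brute_force_rules(transactions, minsup, minconf):
        """Enumerate every rule X -> Z - X over every itemset Z, then prune."""
        items = sorted({i for t in transactions for i in t})
        n = len(transactions)
        rules = []
        for k in range(2, len(items) + 1):
            for Z in map(set, combinations(items, k)):   # every candidate itemset
                sZ = support_count(Z, transactions)
                if sZ / n < minsup:
                    continue  # all rules from Z share support s(Z), so all fail
                for r in range(1, k):
                    for X in map(set, combinations(sorted(Z), r)):
                        c = sZ / support_count(X, transactions)
                        if c >= minconf:
                            rules.append((X, Z - X, sZ / n, c))
        return rules

    print(len(brute_force_rules(transactions, 0.4, 0.6)))

With d items this enumerates on the order of 2^d itemsets, each with up to 2^k − 2 partitions, which is exactly why the slide calls it computationally prohibitive.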
Mining Association Rules

    TID  Items
    1    Bread, Milk
    2    Bread, Diaper, Beer, Eggs
    3    Milk, Diaper, Beer, Coke
    4    Bread, Milk, Diaper, Beer
    5    Bread, Milk, Diaper, Coke

Example of Rules:
    {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
    {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
    {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
    {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
    {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
    {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
   – Generate all itemsets whose support ≥ minsup
2. Rule Generation
   – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation

[Figure: itemset lattice over items A–E, from the null set up through all 2-, 3-, and 4-itemsets to ABCDE]

Given d items, there are 2^d possible candidate itemsets.
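The 2^d blow-up is easy to verify with a short sketch (ours) that enumerates the candidate lattice explicitly:

    from itertools import chain, combinations

    # With d items there are 2**d subsets (including the empty set).
    def powerset(items):
        s = list(items)
        return chain.from_iterable(combinations(s, k) for k in range(len(s) + 1))

    candidates = list(powerset("ABCDE"))
    print(len(candidates))  # 32 == 2**5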
Frequent Itemset Generation

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database

Transactions:

    TID  Items
    1    Bread, Milk
    2    Bread, Diaper, Beer, Eggs
    3    Milk, Diaper, Beer, Coke
    4    Bread, Milk, Diaper, Beer
    5    Bread, Milk, Diaper, Coke

– Match each transaction against every candidate
– Complexity ~ O(NMw), with N transactions, M candidates, and w the maximum transaction width ⇒ expensive, since M = 2^d !!!
Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Reducing Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

    ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– The support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
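A two-line check of the anti-monotone property on the example data (ours; reuses transactions and support_count from the earlier sketch):

    X = {"Milk", "Diaper"}
    Y = {"Milk", "Diaper", "Beer"}           # X ⊆ Y
    sX = support_count(X, transactions) / len(transactions)
    sY = support_count(Y, transactions) / len(transactions)
    assert X <= Y and sX >= sY               # s(X) = 0.6 >= s(Y) = 0.4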
Illustrating Apriori Principle

[Figure: itemset lattice in which one itemset is found to be infrequent; all of its supersets are pruned]
Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):

    Item    Count
    Bread   4
    Coke    2
    Milk    4
    Beer    3
    Diaper  4
    Eggs    1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:

    Itemset          Count
    {Bread, Milk}    3
    {Bread, Beer}    2
    {Bread, Diaper}  3
    {Milk, Beer}     2
    {Milk, Diaper}   3
    {Beer, Diaper}   3

Triplets (3-itemsets):

    Itemset                Count
    {Bread, Milk, Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
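The candidate counts on this slide are easy to reproduce (a quick check, ours):

    from math import comb

    # All subsets of 6 items up to size 3 versus Apriori's pruned candidates.
    print(comb(6, 1) + comb(6, 2) + comb(6, 3))  # 41
    print(6 + 6 + 1)                             # 13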
Frequent Itemset Mining

Two strategies:
– Breadth-first: Apriori
  Exploits monotonicity to the maximum
– Depth-first: Eclat
  Prunes the database
  Does not fully exploit monotonicity
Apriori

Worked example. Database (minsup = 2):

    TID  Items
    1    B, C
    2    B, C
    3    A, C, D
    4    A, B, C, D
    5    B, D

Counting the candidate 1-itemsets, one transaction at a time:

    initially:                A:0  B:0  C:0  D:0
    after TID 1 (B, C):       A:0  B:1  C:1  D:0
    after TID 2 (B, C):       A:0  B:2  C:2  D:0
    after TID 3 (A, C, D):    A:1  B:2  C:3  D:1
    after TID 4 (A, B, C, D): A:2  B:3  C:4  D:2
    after TID 5 (B, D):       A:2  B:4  C:4  D:3

All four items meet minsup = 2, so A, B, C, D are frequent.

Candidate 2-itemsets generated from the frequent 1-itemsets:

    AB, AC, AD, BC, BD, CD

Counting them against the database:

    AB:1  AC:2  AD:2  BC:3  BD:2  CD:2

AB fails minsup and is eliminated.

Candidate 3-itemsets generated from the frequent 2-itemsets:

    ACD, BCD

(ABC and ABD are never generated, because their subset AB is infrequent.)

Counting them against the database:

    ACD:2  BCD:1

Only ACD is frequent; no further candidates can be generated.
Apriori Algorithm

Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  – Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  – Prune candidate itemsets containing subsets of length k that are infrequent
  – Count the support of each candidate by scanning the DB
  – Eliminate candidates that are infrequent, leaving only those that are frequent
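A compact, self-contained Python sketch of this method (ours, simplified; not the slides' own code). Run on the worked example above, it reproduces the frequent itemsets found there:

    from itertools import combinations

    def apriori(transactions, minsup_count):
        """Return all itemsets with support count >= minsup_count."""
        transactions = [frozenset(t) for t in transactions]

        def count(itemset):
            return sum(1 for t in transactions if itemset <= t)

        # Level 1: frequent single items.
        items = sorted({i for t in transactions for i in t})
        level = {frozenset([i]) for i in items
                 if count(frozenset([i])) >= minsup_count}
        result = set(level)

        k = 1
        while level:
            # Join step: size-(k+1) candidates from frequent k-itemsets.
            candidates = {a | b for a in level for b in level
                          if len(a | b) == k + 1}
            # Prune step: every k-subset of a candidate must be frequent.
            candidates = {c for c in candidates
                          if all(frozenset(s) in level
                                 for s in combinations(c, k))}
            # Count step: one scan of the database per level.
            level = {c for c in candidates if count(c) >= minsup_count}
            result |= level
            k += 1
        return result

    db = [{"B","C"}, {"B","C"}, {"A","C","D"}, {"A","B","C","D"}, {"B","D"}]
    print(sorted("".join(sorted(s)) for s in apriori(db, 2)))
    # ['A', 'AC', 'ACD', 'AD', 'B', 'BC', 'BD', 'C', 'CD', 'D']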
Frequent Itemset Mining

Recall the two strategies:
– Breadth-first: Apriori
  Exploits monotonicity to the maximum
– Depth-first: Eclat
  Prunes the database
  Does not fully exploit monotonicity
Depth-First Algorithms

Find all frequent itemsets in the database (minsup = 2):

    TID  Items
    1    B, C
    2    B, C
    3    A, C, D
    4    A, B, C, D
    5    B, D

Divide and conquer on item D:

– Find all frequent itemsets without D, i.e. in the database with D removed:
  A, B, C, AC, BC

– Find all frequent itemsets with D: restrict the database to the transactions containing D, mine it (giving A, B, C, AC), then add D again:
  AD, BD, CD, ACD

Combined result: AD, BD, CD, ACD + A, B, C, AC, BC (and D itself, which has support 3).
Depth-First Algorithm

Worked example. Count single items in the full database (minsup = 2):

    DB:
    TID  Items
    1    B, C
    2    B, C
    3    A, C, D
    4    A, B, C, D
    5    B, D
    Counts: A:2  B:4  C:4  D:3

Project on D (keep only the transactions containing D, with D removed):

    DB[D]:
    3    A, C
    4    A, B, C
    5    B
    Counts: A:2  B:2  C:2

Project DB[D] on C:

    DB[CD]:
    3    A
    4    A, B
    Counts: A:2 (B:1 is infrequent)
    ⇒ AC:2 in DB[D]

Project DB[D] on B:

    DB[BD]:
    4    A
    Counts: A:1 (infrequent), so nothing extends BD

Adding D back to the frequent itemsets of DB[D] yields:

    AD:2  BD:2  CD:2  ACD:2

Project DB on C (only the items before C, i.e. A and B, are kept):

    DB[C]:
    1    B
    2    B
    3    A
    4    A, B
    Counts: A:2  B:3

    DB[BC]:
    4    A
    Counts: A:1 (infrequent)

Adding C back yields:

    AC:2  BC:3

Project DB on B (only item A is kept):

    DB[B]:
    1    (empty)
    2    (empty)
    4    A
    5    (empty)
    Counts: A:1 (infrequent)

Final set of frequent itemsets:

    A:2  B:4  C:4  D:3  AC:2  BC:3  AD:2  BD:2  CD:2  ACD:2
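The recursion in this walkthrough can be written down directly. A recursive Python sketch (ours): mine frequent itemsets by repeatedly projecting the database on one item, keeping only the items that precede it in the ordering:

    def depth_first(transactions, minsup_count, suffix=()):
        """Return {itemset: support} for all frequent itemsets.

        transactions: list of sets; suffix: items already fixed by projection.
        """
        result = {}
        items = sorted({i for t in transactions for i in t})
        for pos, item in enumerate(items):
            support = sum(1 for t in transactions if item in t)
            if support < minsup_count:
                continue
            itemset = frozenset((item,) + suffix)
            result[itemset] = support
            # Projected database: transactions containing `item`,
            # restricted to the items that come before it in the order.
            before = set(items[:pos])
            projected = [t & before for t in transactions if item in t]
            result.update(depth_first(projected, minsup_count, (item,) + suffix))
        return result

    db = [{"B","C"}, {"B","C"}, {"A","C","D"}, {"A","B","C","D"}, {"B","D"}]
    frequent = depth_first(db, 2)
    for s in sorted(frequent, key=lambda f: (len(f), sorted(f))):
        print("".join(sorted(s)), frequent[s])
    # A:2 B:4 C:4 D:3 AC:2 AD:2 BC:3 BD:2 CD:2 ACD:2, matching the final slide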
ECLAT

For each item, store a list of transaction ids (tids).

Horizontal Data Layout:

    TID  Items
    1    A, B, E
    2    B, C, D
    3    C, E
    4    A, C, D
    5    A, B, C, D
    6    A, E
    7    A, B
    8    A, B, C
    9    A, C, D
    10   B

Vertical Data Layout (TID-lists):

    A: 1, 4, 5, 6, 7, 8, 9
    B: 1, 2, 5, 7, 8, 10
    C: 2, 3, 4, 8, 9
    D: 2, 4, 5, 9
    E: 1, 3, 6
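Building the vertical layout from the horizontal one is a single pass over the data; a short sketch (ours):

    from collections import defaultdict

    # One tid-list (here a set of tids) per item.
    horizontal = {
        1: {"A","B","E"}, 2: {"B","C","D"}, 3: {"C","E"}, 4: {"A","C","D"},
        5: {"A","B","C","D"}, 6: {"A","E"}, 7: {"A","B"}, 8: {"A","B","C"},
        9: {"A","C","D"}, 10: {"B"},
    }

    tidlists = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            tidlists[item].add(tid)

    print(sorted(tidlists["A"]))  # [1, 4, 5, 6, 7, 8, 9]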
ECLAT

Determine the support of any k-itemset by intersecting the tid-lists of two of its (k−1)-subsets:

    A: 1, 4, 5, 6, 7, 8, 9
    B: 1, 2, 5, 7, 8, 10
    A ∧ B → AB: 1, 5, 7, 8

Depth-first traversal of the search lattice.
Advantage: very fast support counting.
Disadvantage: intermediate tid-lists may become too large for memory.
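With the tidlists built above, the intersection on this slide is one set operation (ours, reusing tidlists from the previous sketch):

    # Support of AB: intersect the tid-lists of A and B and count.
    AB = tidlists["A"] & tidlists["B"]
    print(sorted(AB), len(AB))  # [1, 5, 7, 8] 4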
Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

– If {A,B,C,D} is a frequent itemset, the candidate rules are:
    ABC → D,  ABD → C,  ACD → B,  BCD → A,
    AB → CD,  AC → BD,  AD → BC,  BC → AD,  BD → AC,  CD → AB,
    A → BCD,  B → ACD,  C → ABD,  D → ABC

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).
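A small sketch (ours) that enumerates the 2^k − 2 candidate rules of a frequent itemset:

    from itertools import combinations

    def candidate_rules(L):
        L = frozenset(L)
        for r in range(1, len(L)):               # proper, non-empty antecedents
            for X in combinations(sorted(L), r):
                yield set(X), L - set(X)         # rule X -> L - X

    rules = list(candidate_rules({"A", "B", "C", "D"}))
    print(len(rules))  # 14 == 2**4 - 2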
Rule Generation

How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property:
  c(ABC → D) can be larger or smaller than c(AB → D)
– But the confidence of rules generated from the same itemset does have an anti-monotone property
– E.g., for L = {A,B,C,D}:
    c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
  Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
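This property supports level-wise pruning within one itemset: all rules from L share the numerator s(L), and shrinking the antecedent can only shrink the confidence, so once X → L − X fails minconf, no antecedent below X needs to be tried. A sketch (ours; reuses transactions and support_count from the first sketch):

    def rules_from_itemset(L, transactions, minconf):
        """Confidence-pruned rule generation from one frequent itemset L."""
        L = frozenset(L)
        sL = support_count(L, transactions)
        out = []
        # Start from the largest antecedents (one item moved to the RHS).
        antecedents = [L - {i} for i in L]
        while antecedents:
            next_level = set()
            for X in antecedents:
                if not X:
                    continue
                conf = sL / support_count(X, transactions)
                if conf >= minconf:
                    out.append((set(X), set(L - X), conf))
                    # Only shrink antecedents that passed; the rest are pruned.
                    next_level |= {X - {i} for i in X}
            antecedents = next_level
        return out

    for X, Y, conf in rules_from_itemset({"Milk", "Diaper", "Beer"},
                                         transactions, 0.6):
        print(X, "->", Y, round(conf, 2))
    # {Milk} -> {Diaper, Beer} (c=0.5) is never emitted: it fails minconf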
Rule Generation for Apriori Algorithm

[Figure: lattice of rules; one low-confidence rule is found and all rules below it are pruned]