Association Rule Mining

(Notes adapted from Tan, Steinbach, Kumar, "Introduction to Data Mining", lecture slides, 4/18/2004.)

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID 1: Bread, Milk
TID 2: Bread, Diaper, Beer, Eggs
TID 3: Milk, Diaper, Beer, Coke
TID 4: Bread, Milk, Diaper, Beer
TID 5: Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication here means co-occurrence, not causality!

Definition: Frequent Itemset

– Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}. A k-itemset is an itemset that contains k items.
– Support count (σ): frequency of occurrence of an itemset. E.g. σ({Milk, Bread, Diaper}) = 2.
– Support (s): fraction of transactions that contain an itemset. E.g. s({Milk, Bread, Diaper}) = 2/5.
– Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

Definition: Association Rule

– Association rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}.
– Rule evaluation metrics:
  – Support (s): the fraction of transactions that contain both X and Y.
  – Confidence (c): how often items in Y appear in transactions that contain X.

For the rule {Milk, Diaper} → {Beer}:

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

(These numbers are recomputed in the code sketch at the end of this section.)

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules.
– Compute the support and confidence for each rule.
– Prune rules that fail the minsup and minconf thresholds.
⇒ Computationally prohibitive!

Mining Association Rules

Example of rules from the transactions above:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
– All of the above rules are binary partitions of the same itemset, {Milk, Diaper, Beer}.
– Rules originating from the same itemset have identical support but can have different confidence.
– Thus we may decouple the support and confidence requirements.

Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

Frequent itemset generation is still computationally expensive.
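Returning to the rule metrics defined above: they translate directly into code. Here is a minimal Python sketch that recomputes the numbers for {Milk, Diaper} → {Beer} from the five example transactions (the function names are illustrative, not from the slides):

```python
# Minimal sketch: computing support and confidence on the
# market-basket example. Names below are illustrative.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...
```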
Frequent Itemset Generation

The candidate itemsets form a lattice: from the null set at the top, through the single items A, B, C, D, E, the pairs AB, ..., DE, and so on down to the full itemset ABCDE. Given d items, there are 2^d possible candidate itemsets.

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset.
– Count the support of each candidate by scanning the database, matching each transaction against every candidate.
– Complexity ~ O(NMw), with N transactions, M candidates, and maximum transaction width w. Expensive, since M = 2^d!

Frequent Itemset Generation Strategies

– Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M.
– Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases; used by DHP and vertical-based mining algorithms.
– Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction.

Reducing Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent. The principle holds due to the following property of the support measure:

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

The support of an itemset never exceeds the support of its subsets; this is known as the anti-monotone property of support. In the lattice, as soon as an itemset (say {A, B}) is found to be infrequent, all of its supersets can be pruned without being counted.

Illustrating Apriori Principle

Minimum support count = 3, on the market-basket transactions above.

Items (1-itemsets):
Bread: 4, Coke: 2, Milk: 4, Beer: 3, Diaper: 4, Eggs: 1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
{Bread, Milk}: 3, {Bread, Beer}: 2, {Bread, Diaper}: 3, {Milk, Beer}: 2, {Milk, Diaper}: 3, {Beer, Diaper}: 3

Triplets (3-itemsets):
{Bread, Milk, Diaper}: 2 is the only candidate whose pairs are all frequent; it is counted and turns out to be infrequent (2 < 3).

If every subset were considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
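The candidate pruning that the Apriori principle licenses is easy to state in code: keep a (k+1)-candidate only if every one of its k-subsets survived the previous pass. A small sketch under that reading, with made-up helper names:

```python
from itertools import combinations

def prune_candidates(candidates, frequent_k):
    """Keep a (k+1)-candidate only if all of its k-subsets are frequent."""
    frequent_k = set(frequent_k)
    kept = []
    for cand in candidates:
        k = len(cand) - 1
        if all(frozenset(sub) in frequent_k for sub in combinations(cand, k)):
            kept.append(cand)
    return kept

# {Bread,Milk,Diaper} survives because all three of its pairs are frequent;
# {Bread,Beer,Diaper} is pruned because {Bread,Beer} is not frequent.
frequent_2 = [frozenset(p) for p in
              [("Bread", "Milk"), ("Bread", "Diaper"),
               ("Milk", "Diaper"), ("Beer", "Diaper")]]
cands_3 = [frozenset(("Bread", "Milk", "Diaper")),
           frozenset(("Bread", "Beer", "Diaper"))]
print(prune_candidates(cands_3, frequent_2))  # only {Bread, Milk, Diaper}
```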
Frequent Itemset Mining

Two strategies:
– Breadth-first (Apriori): exploits monotonicity to the maximum to prune candidates, and prunes the database as well.
– Depth-first (Eclat and related algorithms): does not fully exploit monotonicity, but makes support counting cheap.

Apriori

Worked example. Database (minsup = 2):

1: B, C
2: B, C
3: A, C, D
4: A, B, C, D
5: B, D

Scan 1. Candidate 1-itemsets and their counts: A: 2, B: 4, C: 4, D: 3. All four are frequent.

Scan 2. Candidate pairs generated from the frequent items: AB, AC, AD, BC, BD, CD. Counts: AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2. All are frequent except AB.

Scan 3. Candidate triples whose 2-subsets are all frequent: ACD and BCD (ABC and ABD are pruned because AB is infrequent). Counts: ACD: 2, BCD: 1. Only ACD is frequent.

No candidate 4-itemset can be generated, so the algorithm terminates.

Apriori Algorithm

Method:
– Let k = 1.
– Generate frequent itemsets of length 1.
– Repeat until no new frequent itemsets are identified:
  – Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
  – Prune candidate itemsets containing subsets of length k that are infrequent.
  – Count the support of each candidate by scanning the DB.
  – Eliminate candidates that are infrequent, leaving only those that are frequent.
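Putting candidate generation, pruning, and counting together, here is one compact way the method could look in Python. This is an illustrative sketch, not the book's reference implementation; all names are my own.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support_count} for all frequent itemsets.

    transactions: list of sets of items; minsup: absolute support count.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    # k = 1: count single items.
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)
    k = 1
    while frequent:
        # Join frequent k-itemsets into (k+1)-candidates,
        # then prune by the Apriori principle.
        fsets = list(frequent)
        candidates = {a | b for a in fsets for b in fsets if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # One database scan counts all surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= minsup}
        result.update(frequent)
        k += 1
    return result

db = [{"B", "C"}, {"B", "C"}, {"A", "C", "D"}, {"A", "B", "C", "D"}, {"B", "D"}]
for s, n in sorted(apriori(db, minsup=2).items(),
                   key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(s), n)
```

Run on the worked example above, this prints exactly the frequent itemsets derived by hand: the four single items, the five frequent pairs, and ACD.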
Depth-First Algorithms

Instead of growing all itemsets level by level, depth-first algorithms divide and conquer on a single item. To find all frequent itemsets in a database, pick an item, say D, and:
– find all frequent itemsets with D: restrict the database to the transactions that contain D and delete D from them (the conditional database DB[D]), mine that recursively, and add D back to every itemset found;
– find all frequent itemsets without D: delete D from the whole database and recurse.

Worked example (same database, minsup = 2). Item counts in DB: A: 2, B: 4, C: 4, D: 3.

With D. DB[D] keeps transactions 3, 4, 5 and drops D:
3: A, C
4: A, B, C
5: B
Counts in DB[D]: A: 2, B: 2, C: 2, so AD, BD, CD are all frequent. Recursing inside DB[D]: DB[CD] (transactions 3 and 4, restricted to the remaining items) gives A: 2, so ACD is frequent; DB[BD] (transaction 4 only) gives A: 1, so ABD is infrequent.
Frequent itemsets containing D: D: 3, AD: 2, BD: 2, CD: 2, ACD: 2.

Without D. Drop D everywhere and recurse, splitting next on C. DB[C] keeps transactions 1, 2, 3, 4 and drops C:
1: B
2: B
3: A
4: A, B
Counts in DB[C]: A: 2, B: 3, so AC and BC are frequent. DB[BC] (transactions 1, 2, 4, restricted to the remaining items) gives A: 1, so ABC is infrequent. Finally, DB[B] (transactions 1, 2, 4, 5 with B and the already-processed items removed) gives A: 1, so AB is infrequent.

Final set of frequent itemsets:
A: 2, B: 4, C: 4, D: 3, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2, ACD: 2
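The projection scheme can be written as a short recursion. The code below is my own illustration under one common design choice (process items in sorted order and project each conditional database onto strictly larger items, which makes the with/without split implicit); the names are hypothetical:

```python
def mine(db, minsup, prefix=(), out=None):
    """Depth-first frequent itemset mining via database projection.

    db: list of sets; prefix: items already chosen on this branch.
    """
    if out is None:
        out = {}
    # Count items in the current (conditional) database.
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    for item, n in sorted(counts.items()):
        if n < minsup:
            continue
        out[prefix + (item,)] = n
        # DB[item]: transactions containing item, keeping only larger items
        # so that each itemset is enumerated exactly once.
        projected = [{i for i in t if i > item} for t in db if item in t]
        mine([t for t in projected if t], minsup, prefix + (item,), out)
    return out

db = [{"B", "C"}, {"B", "C"}, {"A", "C", "D"}, {"A", "B", "C", "D"}, {"B", "D"}]
for itemset, n in sorted(mine(db, 2).items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, n)
```

On the example database this reproduces the final set above, including ACD: 2, while the infrequent branches (AB, ABD, ABC, BCD) die out as soon as a conditional count falls below minsup.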
ECLAT

Eclat works on a vertical data layout: for each item, store a list of transaction ids (tids).

Horizontal layout:
TID 1: A, B, E
TID 2: B, C, D
TID 3: C, E
TID 4: A, C, D
TID 5: A, B, C, D
TID 6: A, E
TID 7: A, B
TID 8: A, B, C
TID 9: A, C, D
TID 10: B

Vertical layout (tid-lists):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6

Determine the support of any k-itemset by intersecting the tid-lists of two of its (k−1)-subsets. For example:

A: {1, 4, 5, 6, 7, 8, 9} ∧ B: {1, 2, 5, 7, 8, 10} → AB: {1, 5, 7, 8}, so σ(AB) = 4.

– Traversal: depth-first traversal of the search lattice.
– Advantage: very fast support counting.
– Disadvantage: intermediate tid-lists may become too large for memory.

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement. If {A, B, C, D} is a frequent itemset, the candidate rules are:

ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).

Rule Generation for Apriori Algorithm

How can rules be generated efficiently from frequent itemsets?
– In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).
– But the confidence of rules generated from the same itemset does have an anti-monotone property. E.g., for L = {A, B, C, D}:

c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule: since c(f → L − f) = σ(L) / σ(f), shrinking the antecedent f can only increase σ(f) and hence decrease the confidence. In the lattice of rules for L, once a low-confidence rule is found, every rule below it (same itemset, larger RHS) can be pruned.
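This pruning suggests generating rules levelwise over consequents: start from 1-item right-hand sides and only extend the ones that meet minconf. A hedged sketch of that idea (the support counts are assumed to come from a prior frequent itemset pass, here hard-coded from the depth-first example; the function names are my own):

```python
from itertools import combinations  # noqa: F401  (kept for symmetry with earlier sketches)

def generate_rules(itemset, support, minconf):
    """Rules f -> L - f with confidence >= minconf, pruned anti-monotonically.

    itemset: the frequent itemset L; support: dict {frozenset: support count}.
    c(f -> L - f) = support[L] / support[f].
    """
    L = frozenset(itemset)
    rules = []
    # Start with 1-item consequents and grow them (shrinking the antecedent).
    consequents = [frozenset([i]) for i in L]
    while consequents and len(consequents[0]) < len(L):
        kept = []
        for rhs in consequents:
            lhs = L - rhs
            conf = support[L] / support[lhs]
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                kept.append(rhs)  # only high-confidence RHSs are extended
        # Merge surviving consequents pairwise into larger ones.
        consequents = list({a | b for a in kept for b in kept
                            if len(a | b) == len(kept[0]) + 1})
    return rules

# Support counts taken from the depth-first example above.
sup = {frozenset(s): n for s, n in [
    (("A",), 2), (("C",), 4), (("D",), 3),
    (("A", "C"), 2), (("A", "D"), 2), (("C", "D"), 2), (("A", "C", "D"), 2)]}
for lhs, rhs, c in generate_rules({"A", "C", "D"}, sup, minconf=0.8):
    print(set(lhs), "->", set(rhs), round(c, 2))
```

For L = {A, C, D} with minconf = 0.8, the rules CD → A, AD → C, AC → D, and A → CD pass (confidence 1.0), while C → AD and D → AC are pruned without their supersets ever being tested.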