DATA MINING: ASSOCIATION RULES
VESIT – M. Vijayalakshmi

Outline
– Large Itemsets
– Basic Algorithms
– Simple Apriori Algorithm
– FP Algorithm
– Advanced Topics

Association Rules Outline
• Association Rules Problem Overview
  – Large itemsets
• Association Rules Algorithms
  – Apriori
  – Sampling
  – Partitioning
  – Parallel Algorithms
• Comparing Techniques
• Incremental Algorithms
• Advanced AR Techniques

Example: Market Basket Data
• Items frequently purchased together: Computer → Printer
• Uses:
  – Placement
  – Advertising
  – Sales
  – Coupons
• Objective: increase sales and reduce costs
• Called Market Basket Analysis or Shopping Cart Analysis

Association Rule Definitions
• Set of items: I = {I1, I2, …, Im}
• Transactions: D = {t1, t2, …, tn}, where each tj ⊆ I
• Itemset: {Ii1, Ii2, …, Iik} ⊆ I
• Support of an itemset: the percentage of transactions that contain that itemset
• Large (frequent) itemset: an itemset whose number of occurrences is above a threshold

Transaction Data: Supermarket Data
• Market basket transactions:
  t1: {bread, cheese, milk}
  t2: {apple, jam, salt, ice-cream}
  …
  tn: {biscuit, jam, milk}
• Concepts:
  – An item: an item/article in a basket
  – I: the set of all items sold in the store
  – A transaction: the items purchased in one basket; it may have a TID (transaction ID)
  – A transactional dataset: a set of transactions

Transaction Data: A Set of Documents
• A text document data set; each document is treated as a "bag" of keywords:
  doc1: Student, Teach, School
  doc2: Student, School
  doc3: Teach, School, City, Game
  doc4: Baseball, Basketball
  doc5: Basketball, Player, Spectator
  doc6: Baseball, Coach, Game, Team
  doc7: Basketball, Team, City, Game

Association Rules Example
  TID  Items
  1    Bread, Milk
  2    Bread, Biscuit, FruitJuice, Eggs
  3    Milk, Biscuit, FruitJuice, Coke
  4    Bread, Milk, Biscuit, FruitJuice
  5    Bread, Milk, Biscuit, Coke
• Example rules: {Biscuit} → {FruitJuice}, {Milk, Bread} → {Eggs, Coke}, {FruitJuice, Bread} → {Milk}

Association Rule Definitions
• Association Rule (AR): an implication X → Y, where X, Y ⊆ I and X ∩ Y = ∅
• Support of AR (s), X → Y: the percentage of transactions that contain X ∪ Y
• Confidence of AR (α), X → Y: the ratio of the number of transactions that contain X ∪ Y to the number that contain X

Association Rules Example (cont'd)
• Itemset
  – A collection of one or more items, e.g. {Milk, Bread, Biscuit}
  – k-itemset: an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Biscuit}) = 2
• Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Biscuit}) = 2/5
• Frequent itemset
  – An itemset whose support is greater than or equal to a minsup threshold
(The counts above refer to the five-transaction TID/Items table.)

Association Rule Problem
• Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X → Y with a minimum support and confidence.
• Link Analysis
• NOTE: the support of X → Y is the same as the support of X ∪ Y.
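As a concrete illustration of the support definitions above, the following is a minimal Python sketch (the function names and layout are my own, not the lecture's) that recomputes the support count and support of {Milk, Bread, Biscuit} over the five-transaction table:

    # Support count and support over the five-transaction example above.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Biscuit", "FruitJuice", "Eggs"},
        {"Milk", "Biscuit", "FruitJuice", "Coke"},
        {"Bread", "Milk", "Biscuit", "FruitJuice"},
        {"Bread", "Milk", "Biscuit", "Coke"},
    ]

    def support_count(itemset, transactions):
        """sigma(itemset): number of transactions that contain the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    def support(itemset, transactions):
        """s(itemset): fraction of transactions that contain the itemset."""
        return support_count(itemset, transactions) / len(transactions)

    print(support_count({"Milk", "Bread", "Biscuit"}, transactions))  # 2
    print(support({"Milk", "Bread", "Biscuit"}, transactions))        # 0.4, i.e. 2/5

With a minsup of 0.6, for instance, this itemset (support 0.4) would not be counted as frequent.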
Definition: Association Rule
• An implication expression of the form X → Y, where X and Y are itemsets
• Example: {Milk, Biscuit} → {FruitJuice}
• Rule evaluation metrics (over the TID/Items table above):
  – Support (s): the fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X
• Example, for {Milk, Biscuit} → {FruitJuice}:
  s = σ(Milk, Biscuit, FruitJuice) / |T| = 2/5 = 0.4
  c = σ(Milk, Biscuit, FruitJuice) / σ(Milk, Biscuit) = 2/3 ≈ 0.67

Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold
• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds
  – Computationally prohibitive!

Mining Association Rules
• Example rules (from the TID/Items table above):
  {Milk, Biscuit} → {FruitJuice}    (s = 0.4, c = 0.67)
  {Milk, FruitJuice} → {Biscuit}    (s = 0.4, c = 1.0)
  {Biscuit, FruitJuice} → {Milk}    (s = 0.4, c = 0.67)
  {FruitJuice} → {Milk, Biscuit}    (s = 0.4, c = 0.67)
  {Biscuit} → {Milk, FruitJuice}    (s = 0.4, c = 0.5)
  {Milk} → {Biscuit, FruitJuice}    (s = 0.4, c = 0.5)

Observations
• All of the above rules are binary partitions of the same itemset: {Milk, Biscuit, FruitJuice}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements

Example 2
• Transaction data:
  t1: Butter, Cocoa, Milk
  t2: Butter, Cheese
  t3: Cheese, Boots
  t4: Butter, Cocoa, Cheese
  t5: Butter, Cocoa, Clothes, Cheese, Milk
  t6: Cocoa, Clothes, Milk
  t7: Cocoa, Milk, Clothes
• Assume: minsup = 30%, minconf = 80%
• An example frequent itemset: {Cocoa, Clothes, Milk} [sup = 3/7]
• Association rules from the itemset:
  Clothes → Milk, Cocoa   [sup = 3/7, conf = 3/3]
  …
  Clothes, Cocoa → Milk   [sup = 3/7, conf = 3/3]

Mining Association Rules
• Two-step approach:
  1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
  2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive

Frequent Itemset Generation
(figure: the itemset lattice over items A–E, from the empty set at the top down to ABCDE at the bottom)
• Given d items, there are 2^d possible candidate itemsets

Frequent Itemset Generation
• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Match each transaction against every candidate
  – Complexity ~ O(NMw) ⇒ expensive, since M = 2^d
(figure: each of the N transactions of the TID/Items table, of average width w, is matched against the list of M candidate itemsets)
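The rule metrics above can be recomputed in a few lines of Python. This is an illustrative sketch (names are mine, not the lecture's); it confirms that rules derived from {Milk, Biscuit, FruitJuice} share the same support but differ in confidence:

    # Support and confidence of X -> Y: s = sigma(X u Y)/|T|, c = sigma(X u Y)/sigma(X).
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Biscuit", "FruitJuice", "Eggs"},
        {"Milk", "Biscuit", "FruitJuice", "Coke"},
        {"Bread", "Milk", "Biscuit", "FruitJuice"},
        {"Bread", "Milk", "Biscuit", "Coke"},
    ]

    def sigma(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def rule_metrics(X, Y):
        s = sigma(X | Y) / len(transactions)    # support of the rule
        c = sigma(X | Y) / sigma(X)             # confidence of the rule
        return round(s, 2), round(c, 2)

    print(rule_metrics({"Milk", "Biscuit"}, {"FruitJuice"}))    # (0.4, 0.67)
    print(rule_metrics({"Milk", "FruitJuice"}, {"Biscuit"}))    # (0.4, 1.0)
    print(rule_metrics({"Biscuit"}, {"Milk", "FruitJuice"}))    # (0.4, 0.5)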
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
  – Complete search: M = 2^d
  – Use pruning techniques to reduce M
• Reduce the number of transactions (N)
  – Reduce the size of N as the size of the itemset increases
• Reduce the number of comparisons (NM)
  – Use efficient data structures to store the candidates or transactions
  – No need to match every candidate against every transaction

Reducing the Number of Candidates
• Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent
• The Apriori principle holds due to the following (anti-monotone) property of the support measure:
  ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
• The support of an itemset never exceeds the support of its subsets

Apriori Rule
• Frequent itemset property: any subset of a frequent itemset is frequent.
• Contrapositive: if an itemset is not frequent, none of its supersets are frequent.

Illustrating the Apriori Principle
(figure: the itemset lattice over A–E; once an itemset, e.g. {A, B}, is found to be infrequent, all of its supersets are pruned from the search)

Illustrating the Apriori Principle
• Minimum support (count) = 3
• Items (1-itemsets):
  Bread 4, Coke 2, Milk 4, FruitJuice 3, Biscuit 4, Eggs 1
• Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
  {Bread, Milk} 3, {Bread, FruitJuice} 2, {Bread, Biscuit} 3, {Milk, FruitJuice} 2, {Milk, Biscuit} 3, {FruitJuice, Biscuit} 3
• Triplets (3-itemsets):
  {Bread, Milk, Biscuit} 3
• If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates; with support-based pruning: 6 + 6 + 1 = 13

Apriori Algorithm
• Method:
  – Let k = 1
  – Generate frequent itemsets of length 1
  – Repeat until no new frequent itemsets are identified:
    • Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
    • Prune candidate itemsets containing subsets of length k that are infrequent
    • Count the support of each candidate by scanning the DB
    • Eliminate candidates that are infrequent, leaving only those that are frequent

Apriori Algorithm
• A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994)
• Worked example over database D (shown as a figure in the slides), with min_sup = 2:
  – Scan D → 1-candidates: a:2, b:3, c:3, d:1, e:3 → frequent 1-itemsets: a:2, b:3, c:3, e:3
  – 2-candidates: ab, ac, ae, bc, be, ce; counting → ab:1, ac:2, ae:1, bc:2, be:3, ce:2 → frequent 2-itemsets: ac:2, bc:2, be:3, ce:2
  – 3-candidates: bce; scan D → bce:2 → frequent 3-itemsets: bce:2

Example: Finding Frequent Itemsets
• minsup = 0.5 (notation itemset:count)
• Dataset T:
  T100: 1, 3, 4
  T200: 2, 3, 5
  T300: 1, 2, 3, 5
  T400: 2, 5
• 1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
     F1: {1}:2, {2}:3, {3}:3, {5}:3
     C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
• 2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
     F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
     C3: {2,3,5}
• 3. Scan T → C3: {2,3,5}:2
     F3: {2,3,5}
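The worked example above can be reproduced with a compact level-wise loop. This is an illustrative Python sketch, not the lecture's code; the helper names are my own:

    # A compact level-wise Apriori loop; reproduces the minsup = 0.5 example above.
    from itertools import combinations

    T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    min_count = 0.5 * len(T)          # minsup = 0.5  ->  count threshold of 2

    def frequent(candidates):
        """One scan of T per level: keep candidates whose count meets the threshold."""
        return {c: n for c in candidates
                if (n := sum(1 for t in T if c <= t)) >= min_count}

    items = {i for t in T for i in t}
    F = frequent({frozenset({i}) for i in items})     # F1
    all_frequent = dict(F)
    k = 1
    while F:
        # Join frequent k-itemsets into (k+1)-candidates, then prune any candidate
        # that has an infrequent k-subset (Apriori principle).
        cands = {a | b for a in F for b in F if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in F for s in combinations(c, k))}
        F = frequent(cands)
        all_frequent.update(F)
        k += 1

    print(all_frequent)   # ends with F3 = { {2, 3, 5} : 2 }, as in the example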
Details: Ordering of Items
• The items in I are sorted in lexicographic order (which is a total order).
• The order is used throughout the algorithm in each itemset.
• {w[1], w[2], …, w[k]} represents a k-itemset w consisting of items w[1], w[2], …, w[k], where w[1] < w[2] < … < w[k] according to the total order.

Apriori-Gen
• Generate candidates of size i+1 from large itemsets of size i.
• Approach used: join large itemsets of size i if they agree on the first i−1 items.
• May also prune candidates that have subsets that are not large.

Candidate-Gen Function
  Function candidate-gen(Fk−1)
      Ck ← ∅;
      forall f1, f2 ∈ Fk−1
          with f1 = {i1, …, ik−2, ik−1}
          and  f2 = {i1, …, ik−2, i′k−1}
          and  ik−1 < i′k−1 do
          c ← {i1, …, ik−1, i′k−1};          // join f1 and f2
          Ck ← Ck ∪ {c};
          for each (k−1)-subset s of c do
              if (s ∉ Fk−1) then
                  delete c from Ck;           // prune
          end
      end
      return Ck;

An Example
• F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}
• After the join: C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
• After pruning: C4 = {{1, 2, 3, 4}}, because {1, 4, 5} is not in F3 ({1, 3, 4, 5} is removed)

Apriori-Gen Example / Apriori-Gen Example (cont'd)
(figure-only slides; the worked examples are not reproduced here)

Apriori: Advantages / Disadvantages
• Advantages:
  – Uses the large itemset property.
  – Easily parallelized.
  – Easy to implement.
• Disadvantages:
  – Assumes the transaction database is memory resident.
  – Requires up to m database scans.

Step 2: Generating Rules from Frequent Itemsets
• Frequent itemsets are not yet association rules; one more step is needed to generate the rules.
• For each frequent itemset X, for each proper nonempty subset A of X:
  – Let B = X − A
  – A → B is an association rule if confidence(A → B) ≥ minconf
  – support(A → B) = support(A ∪ B) = support(X)
  – confidence(A → B) = support(A ∪ B) / support(A)

Generating Rules: An Example
• Suppose {2, 3, 4} is frequent, with sup = 50%
  – Its proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively
  – These generate the association rules:
    • 2,3 → 4, confidence = 100%
    • 2,4 → 3, confidence = 100%
    • 3,4 → 2, confidence = 67%
    • 2 → 3,4, confidence = 67%
    • 3 → 2,4, confidence = 67%
    • 4 → 2,3, confidence = 67%
  – All rules have support = 50%

Rule Generation
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
• If {A, B, C, D} is a frequent itemset, the candidate rules are:
  ABC → D, ABD → C, ACD → B, BCD → A,
  AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB,
  A → BCD, B → ACD, C → ABD, D → ABC
• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)

Generating Rules
• To recap: in order to obtain A → B, we need support(A ∪ B) and support(A).
• All the information required for confidence computation has already been recorded during itemset generation, so there is no need to read the data T again.
• This step is not as time-consuming as frequent itemset generation.
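The candidate-gen function above is easy to turn into runnable code. The following Python sketch is my own illustrative version (using sorted tuples as itemsets); it reproduces the F3 → C4 example:

    # Runnable version of the join + prune candidate generation described above.
    from itertools import combinations

    def candidate_gen(F_prev, k):
        """Generate C_k from the frequent (k-1)-itemsets F_prev (sorted tuples)."""
        C = set()
        for f1 in F_prev:
            for f2 in F_prev:
                # Join step: same first k-2 items, last items in increasing order.
                if f1[:-1] == f2[:-1] and f1[-1] < f2[-1]:
                    c = f1 + (f2[-1],)
                    # Prune step: every (k-1)-subset of c must itself be frequent.
                    if all(s in F_prev for s in combinations(c, k - 1)):
                        C.add(c)
        return C

    F3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
    print(candidate_gen(F3, 4))
    # {(1, 2, 3, 4)} -- the join also produces (1, 3, 4, 5), but it is pruned
    # because its subset {1, 4, 5} is not in F3.

Unlike the pseudocode, this version checks the (k−1)-subsets before adding the candidate; the resulting Ck is the same.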
Rule Generation
• How can rules be generated efficiently from frequent itemsets?
  – In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
  – But the confidence of rules generated from the same itemset does have an anti-monotone property
  – e.g., for L = {A, B, C, D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
  – Confidence is anti-monotone with respect to the number of items on the right-hand side of the rule.

Rule Generation for Apriori Algorithm
(figure: the lattice of rules generated from {A, B, C, D}; if a rule such as CD → AB is found to have low confidence, then all rules with a smaller left-hand side, e.g. D → ABC and C → ABD, can be pruned)

Apriori: Performance Bottlenecks
• The core of the Apriori algorithm:
  – Use frequent (k−1)-itemsets to generate candidate frequent k-itemsets
  – Use database scans and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori is candidate generation:
  – Huge candidate sets:
    • 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g. {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
  – Multiple scans of the database:
    • Needs (n + 1) scans, where n is the length of the longest pattern

Mining Frequent Patterns Without Candidate Generation
• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  – highly condensed, but complete for frequent pattern mining
  – avoids costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
  – A divide-and-conquer methodology: decompose mining tasks into smaller ones
  – Avoid candidate generation: sub-database test only!

Construct an FP-tree from a Transaction DB (min_support = 0.5)
  TID  Items bought                  (ordered) frequent items
  100  {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
  200  {a, b, c, f, l, m, o}         {f, c, a, b, m}
  300  {b, f, h, j, o}               {f, b}
  400  {b, c, k, s, p}               {c, b, p}
  500  {a, f, c, e, l, p, m, n}      {f, c, a, m, p}
• Steps:
  1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
  2. Order the frequent items in frequency-descending order
  3. Scan the DB again and construct the FP-tree
• Header table: f:4, c:4, a:3, b:3, m:3, p:3
(figure: the resulting FP-tree — a main path f:4 → c:3 → a:3 → m:2 → p:2, with side branches b:1 → m:1 under a, b:1 under f, and c:1 → b:1 → p:1 directly under the root)

Benefits of the FP-tree Structure
• Completeness:
  – never breaks a long pattern of any transaction
  – preserves complete information for frequent pattern mining
• Compactness:
  – reduces irrelevant information — infrequent items are gone
  – frequency-descending ordering: more frequent items are more likely to be shared
  – never larger than the original database (if node-links and counts are not counted)

Major Steps to Mine an FP-tree
1) Construct the conditional pattern base for each node in the FP-tree
2) Construct the conditional FP-tree from each conditional pattern base
3) Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far
   – If a conditional FP-tree contains a single path, simply enumerate all the patterns

Step 1: FP-tree to Conditional Pattern Base
• Start at the frequent-item header table of the FP-tree
• Traverse the FP-tree by following the node-link of each frequent item
• Accumulate all of the transformed prefix paths of that item to form its conditional pattern base
• Conditional pattern bases:
  item   conditional pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1
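The construction procedure described above (one scan for item counts, frequency-descending ordering, a second scan inserting each transaction) can be sketched in a few lines of Python. This is an illustrative implementation, not the lecture's code; the class and function names are mine, and ties in the ordering are broken alphabetically, which the slides do not specify:

    # A bare-bones FP-tree builder following the three construction steps above.
    from collections import Counter

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}          # item -> FPNode

    def build_fp_tree(transactions, min_count):
        # Step 1: one scan to find the frequent single items.
        counts = Counter(i for t in transactions for i in t)
        freq = {i: c for i, c in counts.items() if c >= min_count}
        root, header = FPNode(None, None), {}    # header: item -> node-link list
        for t in transactions:
            # Step 2: keep only frequent items, in frequency-descending order.
            ordered = sorted((i for i in t if i in freq),
                             key=lambda i: (-freq[i], i))
            # Step 3: insert the ordered items as a path, sharing common prefixes.
            node = root
            for i in ordered:
                child = node.children.get(i)
                if child is None:
                    child = FPNode(i, node)
                    node.children[i] = child
                    header.setdefault(i, []).append(child)
                child.count += 1
                node = child
        return root, header, freq

    # The five-transaction example from the slides, min_support = 0.5 (count >= 3).
    T = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
    root, header, freq = build_fp_tree(T, min_count=3)
    print(freq)   # the header-table items and counts: f:4, c:4, a:3, b:3, m:3, p:3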
Properties of the FP-tree for Conditional Pattern Base Construction
• Node-link property: for any frequent item ai, all of the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header table.
• Prefix-path property: to calculate the frequent patterns for a node ai on a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count carries the same count as node ai.

Step 2: Construct the Conditional FP-tree
• For each pattern base:
  – Accumulate the count for each item in the base
  – Construct the FP-tree for the frequent items of the pattern base
• Example: the m-conditional pattern base is fca:2, fcab:1; the m-conditional FP-tree is the single path f:3 → c:3 → a:3
• All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

Mining Frequent Patterns by Creating Conditional Pattern Bases
  Item   Conditional pattern base       Conditional FP-tree
  p      {(fcam:2), (cb:1)}             {(c:3)} | p
  m      {(fca:2), (fcab:1)}            {(f:3, c:3, a:3)} | m
  b      {(fca:1), (f:1), (c:1)}        Empty
  a      {(fc:3)}                       {(f:3, c:3)} | a
  c      {(f:3)}                        {(f:3)} | c
  f      Empty                          Empty

Step 3: Recursively Mine the Conditional FP-tree
• Starting from the m-conditional FP-tree (the single path f:3 → c:3 → a:3):
  – conditional pattern base of "am": (fc:3) → am-conditional FP-tree: f:3 → c:3
  – conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: f:3
  – conditional pattern base of "cam": (f:3) → cam-conditional FP-tree: f:3

Single FP-tree Path Generation
• Suppose an FP-tree T has a single path P.
• The complete set of frequent patterns of T can be generated by enumerating all combinations of the sub-paths of P.
• Example: the m-conditional FP-tree is the single path f:3 → c:3 → a:3, so all frequent patterns concerning m are m, fm, cm, am, fcm, fam, cam, fcam.

Principles of Frequent Pattern Growth
• Pattern-growth property: let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB if and only if β is frequent in B.
• "abcdef" is a frequent pattern if and only if
  – "abcde" is a frequent pattern, and
  – "f" is frequent in the set of transactions containing "abcde"
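Single-path enumeration, as described above, is simple enough to show directly. Here is an illustrative Python sketch (the helper name and data layout are mine, not the lecture's):

    # Enumerate patterns from a single-path conditional FP-tree.
    from itertools import combinations

    def patterns_from_single_path(path, suffix, suffix_count):
        """path: [(item, count), ...] along the single path, root excluded.
        Yields (pattern, support_count) for every combination of path items
        appended to the suffix itemset."""
        yield frozenset(suffix), suffix_count
        for r in range(1, len(path) + 1):
            for combo in combinations(path, r):
                items = frozenset(i for i, _ in combo) | frozenset(suffix)
                yield items, min(c for _, c in combo)

    # m-conditional FP-tree from the slides: the single path f:3 -> c:3 -> a:3.
    for pat, cnt in patterns_from_single_path([("f", 3), ("c", 3), ("a", 3)], {"m"}, 3):
        print(sorted(pat), cnt)
    # prints the eight patterns m, fm, cm, am, fcm, fam, cam, fcam, each with count 3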
Why Is Frequent Pattern Growth Fast?
• A performance study shows that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection.
• Reasoning:
  – No candidate generation, no candidate test
  – Uses a compact data structure
  – Eliminates repeated database scans
  – The basic operations are counting and FP-tree building

FP-Tree Construction (second example)
  TID  Items
  1    {A, B}
  2    {B, C, D}
  3    {A, C, D, E}
  4    {A, D, E}
  5    {A, B, C}
  6    {A, B, C, D}
  7    {B, C}
  8    {A, B, C}
  9    {A, B, D}
  10   {B, C, E}
(figures: the tree after reading TID = 1, after reading TID = 2, and the complete FP-tree with its header table over items A–E; the header-table pointers are used to assist frequent itemset generation)

FP-growth
• Build the conditional pattern base for E: P = {(A:1, C:1, D:1), (A:1, D:1), (B:1, C:1)}, and recursively apply FP-growth on P.
• Conditional tree for E: the count for E is 3, so {E} is a frequent itemset; recursively apply FP-growth on P.
• Conditional tree for D within the conditional tree for E: the conditional pattern base for D within E is P = {(A:1, C:1, D:1), (A:1, D:1)}; the count for D is 2, so {D, E} is a frequent itemset; recursively apply FP-growth on P.
• Conditional tree for C within D within E: the conditional pattern base is P = {(A:1, C:1)}; the count for C is 1, so {C, D, E} is NOT a frequent itemset.
• Conditional tree for A within D within E: the count for A is 2, so {A, D, E} is a frequent itemset.
• Next step: construct the conditional tree for C within the conditional tree for E, and continue until the conditional tree for A (which has only the node A) has been explored.

Benefits of the FP-tree Structure
• A performance study shows that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection.
• Reasoning: no candidate generation or candidate test, a compact data structure, no repeated database scans, and counting plus FP-tree building as the basic operations.
(figure: run time in seconds versus support threshold (%) on dataset D1, comparing FP-growth and Apriori run times)

Iceberg Queries
• An iceberg query computes aggregates over one attribute or a set of attributes only for those groups whose aggregate value is above a certain threshold.
• Example:
    select P.custID, P.itemID, sum(P.qty)
    from purchase P
    group by P.custID, P.itemID
    having sum(P.qty) >= 10
• Iceberg queries can be computed efficiently with an Apriori-style strategy:
  – First compute the lower dimensions
  – Then compute higher dimensions only when all of the lower ones are above the threshold
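The Apriori-style strategy for iceberg queries can be illustrated with a small sketch. The purchase rows below are made-up sample data, not from the lecture; since quantities are non-negative, a (custID, itemID) group can only reach the threshold if both of its one-dimensional totals do:

    # Iceberg query via Apriori-style pruning (illustrative sample data).
    from collections import defaultdict

    purchases = [            # (custID, itemID, qty)
        ("c1", "i1", 6), ("c1", "i1", 5), ("c1", "i2", 3),
        ("c2", "i1", 12), ("c2", "i2", 2), ("c3", "i3", 4),
    ]
    threshold = 10

    # Lower dimensions first: total qty per customer and per item.
    by_cust, by_item = defaultdict(int), defaultdict(int)
    for cust, item, qty in purchases:
        by_cust[cust] += qty
        by_item[item] += qty

    # Higher dimension only for combinations whose lower-dimensional totals
    # all reach the threshold.
    by_pair = defaultdict(int)
    for cust, item, qty in purchases:
        if by_cust[cust] >= threshold and by_item[item] >= threshold:
            by_pair[(cust, item)] += qty

    answer = {k: v for k, v in by_pair.items() if v >= threshold}
    print(answer)   # {('c1', 'i1'): 11, ('c2', 'i1'): 12}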
Multiple-Level Association Rules
• Items often form a hierarchy.
(figure: a concept hierarchy over items such as Food, Dairy, milk — skim, 2%, Amul — and bread — wheat, white)
• Items at the lower levels are expected to have lower support.
• Rules involving itemsets at the appropriate levels can be quite useful.
• The transaction database can be encoded based on dimensions and levels.
• We can explore shared multi-level mining.
• Encoded transactions:
  TID  Items
  T1   {111, 121, 211, 221}
  T2   {111, 211, 222, 323}
  T3   {112, 122, 221, 411}
  T4   {111, 121}
  T5   {111, 122, 211, 221, 413}

Mining Multi-Level Associations
• A top-down, progressive deepening approach:
  – First find high-level strong rules: milk → bread [20%, 60%]
  – Then find their lower-level "weaker" rules: 2% milk → wheat bread [6%, 50%]
• Variations on mining multiple-level association rules:
  – Level-crossed association rules: 2% milk → Wonder wheat bread
  – Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread

Multi-level Association: Uniform Support vs. Reduced Support
• Uniform support: the same minimum support for all levels
  – (+) One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support
  – (−) Lower-level items do not occur as frequently; if the support threshold is
    • too high: low-level associations are missed
    • too low: too many high-level associations are generated
• Reduced support: a reduced minimum support at lower levels

Multi-level Association: Redundancy Filtering
• Some rules may be redundant due to "ancestor" relationships between items.
• Example:
  – milk → wheat bread [support = 8%, confidence = 70%]
  – 2% milk → wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the "expected" value based on the rule's ancestor (a small numeric sketch of this check follows at the end of this topic).

Multi-Dimensional Association: Concepts
• Single-dimensional rules: buys(X, "milk") → buys(X, "bread")
• Multi-dimensional rules: two or more dimensions or predicates
  – Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") → buys(X, "coke")
  – Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") → buys(X, "coke")
• Categorical attributes: a finite number of possible values, with no ordering among the values
• Quantitative attributes: numeric, with an implicit ordering among the values

Techniques for Mining Multi-Dimensional Associations
• Search for frequent k-predicate sets:
  – Example: {age, occupation, buys} is a 3-predicate set.
  – Techniques can be categorized by how quantitative attributes such as age are treated.
  1. Static discretization of quantitative attributes: quantitative attributes are statically discretized using predefined concept hierarchies.
  2. Quantitative association rules: quantitative attributes are dynamically discretized into "bins" based on the distribution of the data.
  3. Distance-based association rules: a dynamic discretization process that considers the distance between data points.

Interestingness Measurements
• Objective measures — two popular measurements:
  1. support
  2. confidence
• Subjective measures — a rule (pattern) is interesting if
  1. it is unexpected (surprising to the user), and/or
  2. it is actionable (the user can do something with it)
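To make the redundancy-filtering check referenced above concrete, here is a tiny illustrative sketch; the 25% share of milk purchases attributed to 2% milk is an assumed figure, not from the slides:

    # Redundancy check for the milk / 2% milk example above (illustrative).
    ancestor_support   = 0.08   # milk -> wheat bread
    descendant_support = 0.02   # 2% milk -> wheat bread
    share_of_ancestor  = 0.25   # ASSUMED: 2% milk is roughly 1/4 of milk purchases

    expected = ancestor_support * share_of_ancestor    # 0.02
    redundant = abs(descendant_support - expected) < 0.005
    print(expected, redundant)   # 0.02 True -- the lower-level rule adds nothing
                                 # beyond what its ancestor already predicts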
Criticism of Support and Confidence
• Example 1: among 5000 students,
  – 3000 play basketball
  – 3750 eat cereal
  – 2000 both play basketball and eat cereal
• play basketball → eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
• play basketball → not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.
• Contingency table:
                 basketball   not basketball   sum (row)
  cereal            2000          1750           3750
  not cereal        1000           250           1250
  sum (col.)        3000          2000           5000

Criticism of Support and Confidence (cont.)
• Example 2:
  – X and Y are positively correlated
  – X and Z are negatively related
  – yet the support and confidence of X → Z dominate
  Transactions (8 rows):
    X: 1 1 1 1 0 0 0 0
    Y: 1 1 0 0 0 0 0 0
    Z: 0 1 1 1 1 1 1 1
  Rule     Support   Confidence
  X → Y    25%       50%
  X → Z    37.5%     75%
• We need a measure of dependent or correlated events:
  corr(A, B) = P(A ∪ B) / (P(A) · P(B)),
  where P(A ∪ B) is the probability that a transaction contains the itemset A ∪ B (i.e. both A and B).
• P(B|A) / P(B) is also called the lift of the rule A → B.
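Both examples above can be checked with a tiny lift computation. An illustrative sketch (function names are mine, not the lecture's):

    # lift(A -> B) = P(A and B) / (P(A) * P(B)) = conf(A -> B) / P(B).
    def lift(n_ab, n_a, n_b, n):
        return (n_ab / n) / ((n_a / n) * (n_b / n))

    # Example 1: 5000 students, 3000 play basketball, 3750 eat cereal, 2000 do both.
    print(lift(2000, 3000, 3750, 5000))   # ~0.89 < 1: negatively correlated, so
                                          # "basketball -> cereal" is misleading

    # Example 2: the eight transactions with columns X, Y, Z shown above.
    X = [1, 1, 1, 1, 0, 0, 0, 0]
    Y = [1, 1, 0, 0, 0, 0, 0, 0]
    Z = [0, 1, 1, 1, 1, 1, 1, 1]

    def count_both(a, b):
        return sum(1 for x, y in zip(a, b) if x and y)

    print(lift(count_both(X, Y), sum(X), sum(Y), len(X)))  # 2.0  -> X, Y positively correlated
    print(lift(count_both(X, Z), sum(X), sum(Z), len(X)))  # ~0.86 -> X, Z negatively correlated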