Data Mining Association Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 6, Introduction to Data Mining, by Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar, 4/18/2004

Alternative Methods for Frequent Itemset Generation

* Traversal of the itemset lattice:
  – General-to-specific vs. specific-to-general
  – Equivalence classes
  – Breadth-first vs. depth-first

FP-growth Algorithm

* Uses a compressed representation of the database: the FP-tree.
* Once the FP-tree has been constructed, a recursive divide-and-conquer approach mines the frequent itemsets.
* Requires only two passes over the database.

FP-tree Construction: Pass 1

* Scan the database once and count the support of each item (minsup = 2):

  Item   Count
  A      6
  B      7
  C      6
  D      2
  E      2
  F      1

* F is infrequent and is discarded. The items in each transaction are then reordered by decreasing support count (B, A, C, D, E):

  TID   Items   Reordered
  1     ABE     BAE
  2     BCF     BC
  3     BD      BD
  4     ABC     BAC
  5     AC      AC
  6     BC      BC
  7     AC      AC
  8     ABCE    BACE
  9     ABD     BAD

FP-tree Construction: Pass 2

* Scan the database a second time and insert the reordered transactions one by one into a tree rooted at a null node. Transactions that share a prefix share the corresponding path, and each node's counter records how many transactions pass through it:
  – TID 1 (BAE): null → B:1 → A:1 → E:1
  – TID 2 (BC): B:2; new child C:1 under B
  – TID 3 (BD): B:3; new child D:1 under B
  – TID 4 (BAC): B:4, A:2; new child C:1 under A
  – TID 5 (AC): new branch off the root: A:1 → C:1
  – TID 6 (BC): B:5; the C under B becomes C:2
  – TID 7 (AC): the root branch becomes A:2 → C:2
  – TID 8 (BACE): B:6, A:3; the C under A becomes C:2, with new child E:1
  – TID 9 (BAD): B:7, A:4; new child D:1 under A
* Final tree: null → B:7 with children A:4 (whose children are E:1, C:2 → E:1, and D:1), C:2, and D:1, plus the root branch A:2 → C:2.

FP-tree Construction: Reverse Order

* If the items are instead sorted by increasing support, the transactions become EAB, CB, DB, CAB, CA, CB, CA, ECAB, DAB. The root now sprouts separate branches E:2, C:5, and D:2, prefixes are rarely shared, and the tree has many more nodes: less compression. Sorting by decreasing support is what keeps the FP-tree compact.
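To make the two passes concrete, here is a minimal Python sketch of FP-tree construction. It is an illustration under assumptions, not the book's reference code: the `Node` class, the `build_fp_tree` function, and the string encoding of transactions are invented for this example.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item label, a count, a parent link,
    and children keyed by item."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, minsup):
    # Pass 1: count the support of every item and drop infrequent ones.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    counts = {i: c for i, c in counts.items() if c >= minsup}

    # Pass 2: reorder each transaction by decreasing support count and
    # insert it into the tree; shared prefixes share nodes.
    root = Node(None, None)
    for t in transactions:
        ordered = sorted((i for i in t if i in counts),
                         key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root, counts

# The nine transactions from the example (minsup = 2); F is discarded
# in pass 1, and e.g. "ABE" is inserted as the path B -> A -> E.
db = ["ABE", "BCF", "BD", "ABC", "AC", "BC", "AC", "ABCE", "ABD"]
root, counts = build_fp_tree(db, minsup=2)
print(sorted(counts.items()))
# [('A', 6), ('B', 7), ('C', 6), ('D', 2), ('E', 2)]
```

After construction, the root's children are B:7 and A:2, matching the final tree above.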
FP-growth: Finding Frequent Itemsets

* After construction, the FP-tree is used to generate all frequent itemsets.
* We generate all frequent itemsets ending in E, D, C, A, and B, respectively.
* To generate all frequent itemsets ending in E, we generate all that end in DE, CE, AE, and BE, and so on recursively.

FP-growth: Patterns Ending in E

* Follow every path that ends in an E node and record the prefix together with that node's count. This gives the conditional pattern base for E: {BA:1, BAC:1}.
* Counting items within the conditional pattern base gives B:2, A:2, C:1. C is infrequent (count 1 < minsup 2), so it can be removed: CE and all its supersets are infrequent.
* The conditional FP-tree for E is therefore the single path null → B:2 → A:2 (conditional pattern BA:2).
* The tree is a single path, so output βE, where β is any subset of {B, A}. The support of each pattern is the minimum item support appearing in β. Hence, we output BAE:2, BE:2, and AE:2.

FP-growth: Patterns Ending in C

* The conditional pattern base for C is {BA:2, B:2, A:2}, giving counts B:4 and A:4; both are frequent.
* The conditional FP-tree for C consists of the path null → B:4 → A:2 together with the branch null → A:2. The tree is not a single path, so:
  – (1) Output iC, where i is any item in the tree, with the support of i. Hence, we output BC:4 and AC:4.
  – (2) Recurse. For AC, the conditional pattern base is {B:2}, so the conditional FP-tree for AC is the single path null → B:2. The tree is a single path: output BAC:2.
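Continuing the sketch above (same hypothetical `Node` class), the two core steps of the recursion can be written out: collecting the conditional pattern base for a suffix item, and enumerating all subsets of a single-path conditional tree. Both function names are again invented for illustration.

```python
from itertools import combinations

def conditional_pattern_base(root, target):
    """Collect (prefix-path, count) pairs for every occurrence of
    `target` in the tree, e.g. [('BA', 1), ('BAC', 1)] for E."""
    base, stack = [], [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children.values())
        if node.item == target:
            path, p = [], node.parent
            while p.item is not None:   # climb back up to the root
                path.append(p.item)
                p = p.parent
            if path:
                base.append((''.join(reversed(path)), node.count))
    return base

def single_path_patterns(path, suffix):
    """When a conditional tree is a single path, every non-empty subset
    of the path joins the suffix; support is the minimum count used."""
    for r in range(1, len(path) + 1):
        for subset in combinations(path, r):
            yield (''.join(i for i, _ in subset) + suffix,
                   min(c for _, c in subset))

print(conditional_pattern_base(root, 'E'))
# [('BA', 1), ('BAC', 1)]  (order may vary with traversal)

# The conditional FP-tree for E is the single path B:2 -> A:2, so:
print(list(single_path_patterns([('B', 2), ('A', 2)], 'E')))
# [('BE', 2), ('AE', 2), ('BAE', 2)]
```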
FP-growth Algorithm

  FP-growth(T(α), α):
    IF T(α) is a single path P THEN
      for all β ⊆ P, output β ∪ α with support equal to the minimum
      support in T(α) of the items occurring in β
    ELSE
      FOR EACH item i in T(α):
        output β = {i} ∪ α with the support of i in T(α)
        construct T(β)
        IF T(β) is not empty THEN call FP-growth(T(β), β)
      END FOR
    END IF

  Initial call: FP-growth(T(∅), ∅).

Pattern Evaluation

* Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant.
  – For example, {A,B,C} → {D} is redundant if it has the same support and confidence as {A,B} → {D}.
* Interestingness measures can be used to prune or rank the derived patterns.
* In the original formulation of association rules, support and confidence are the only measures used.

Computing Interestingness Measures

* Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table:

  Contingency table for X → Y

         Y     ¬Y
  X      f11   f10   f1+
  ¬X     f01   f00   f0+
         f+1   f+0   |T|

  f11: support of X and Y
  f10: support of X and ¬Y
  f01: support of ¬X and Y
  f00: support of ¬X and ¬Y

* These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.

Drawback of Confidence

         Coffee   ¬Coffee
  Tea    15       5          20
  ¬Tea   75       5          80
         90       10         100

* Association rule: Tea → Coffee. Confidence = P(Coffee|Tea) = 15/20 = 0.75.
* But P(Coffee) = 0.9, and P(Coffee|¬Tea) = 75/80 = 0.9375.
* Although the confidence is high, the rule is misleading: tea drinkers are actually less likely than non-tea-drinkers to drink coffee.

Statistical Independence

* Population of 1000 students:
  – 600 students know how to swim (S)
  – 700 students know how to bike (B)
  – 420 students know how to swim and bike (S,B)
* Here P(S∧B) = 420/1000 = 0.42 and P(S) × P(B) = 0.6 × 0.7 = 0.42:
  – P(S∧B) = P(S) × P(B) ⇒ statistical independence
  – P(S∧B) > P(S) × P(B) ⇒ positively correlated
  – P(S∧B) < P(S) × P(B) ⇒ negatively correlated

Statistical-based Measures

* Measures that take statistical dependence into account:

  Lift = P(Y|X) / P(Y)
  Interest = P(X,Y) / (P(X) P(Y))
  PS = P(X,Y) − P(X) P(Y)
  φ-coefficient = (P(X,Y) − P(X) P(Y)) / √(P(X)[1 − P(X)] P(Y)[1 − P(Y)])

Example: Lift/Interest

* For Tea → Coffee above, confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9, so
  Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore negatively associated).

Drawback of Lift & Interest

         Y     ¬Y                    Y     ¬Y
  X      10    0      10      X      90    0      90
  ¬X     0     90     90      ¬X     0     10     10
         10    90     100            90    10     100

  Lift = 0.1 / (0.1 × 0.1) = 10     Lift = 0.9 / (0.9 × 0.9) = 1.11

* Both tables show a perfect association between X and Y, yet the rarer pattern receives the far higher lift.
* Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1.
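These measures are one-liners once the contingency table is in hand. The following is a small sketch (the `measures` helper is invented here, not a library function), evaluated on the Tea → Coffee table above.

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Support, confidence, lift/interest, PS, and phi for X -> Y,
    computed from the four contingency-table counts."""
    n = f11 + f10 + f01 + f00
    px, py, pxy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    return {
        'support':    pxy,
        'confidence': pxy / px,          # P(Y|X)
        'lift':       pxy / (px * py),   # equals Interest
        'PS':         pxy - px * py,     # Piatetsky-Shapiro
        'phi':        (pxy - px * py)
                      / sqrt(px * (1 - px) * py * (1 - py)),
    }

# Tea -> Coffee: f11 = 15, f10 = 5, f01 = 75, f00 = 5.
for name, value in measures(15, 5, 75, 5).items():
    print(f'{name}: {value:.4f}')
# confidence: 0.7500 looks strong, but lift: 0.8333 (< 1) and
# PS: -0.0300 expose the negative association.
```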
Comparing Different Measures

* There are lots of measures proposed in the literature. Some measures are good for certain applications, but not for others.
* What criteria should we use to determine whether a measure is good or bad?
* Ten example contingency tables, which different measures rank quite differently:

  Example   f11    f10    f01    f00
  E1        8123   83     424    1370
  E2        8330   2      622    1046
  E3        9481   94     127    298
  E4        3954   3080   5      2961
  E5        2886   1363   1320   4431
  E6        1500   2000   500    6000
  E7        4000   2000   1000   3000
  E8        4000   2000   2000   2000
  E9        1720   7121   5      1154
  E10       61     2483   4      7452

Property under Variable Permutation

* Does M(A,B) = M(B,A)? Permuting the variables transposes the contingency table, exchanging q and r:

         B    ¬B               A    ¬A
  A      p    q          B     p    r
  ¬A     r    s          ¬B    q    s

* Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
* Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
* Example: confidence.
  c(A → B) = P(B|A) = σ(A,B) / σ(A) = p / (p + q)
  c(B → A) = P(A|B) = σ(A,B) / σ(B) = p / (p + r)
  Hence, confidence is not symmetric.

Property under Row/Column Scaling

* Grade-gender example (Mosteller, 1968). Scale the Male column by k1 = 2 and the Female column by k2 = 10:

         Male   Female               Male   Female
  High   2      3        5     High  4      30       34
  Low    1      4        5     Low   2      40       42
         3      7        10          6      70       76

* Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.
* Example: the cross-product ratio has this property:

  cpr = f(H,M) f(L,F) / (f(H,F) f(L,M))

  After column scaling:

  cpr = k1 f(H,M) · k2 f(L,F) / (k2 f(H,F) · k1 f(L,M)) = f(H,M) f(L,F) / (f(H,F) f(L,M))

Property under Inversion Operation

* Inversion flips every 0 and 1 in the item columns, which swaps f11 with f00 and f10 with f01. (The original slide illustrates this with bit vectors A–F over N transactions, where one pair of vectors is the inversion of another.)
* The φ-coefficient is analogous to the correlation coefficient for continuous variables, and it is invariant under inversion. Example:

         Y    ¬Y                   Y    ¬Y
  X      60   10    70      X      20   10    30
  ¬X     10   20    30      ¬X     10   60    70
         70   30    100            30   70    100

  φ = (0.6 − 0.7 × 0.7) / √(0.7 × 0.3 × 0.7 × 0.3) = 0.5238
  φ = (0.2 − 0.3 × 0.3) / √(0.7 × 0.3 × 0.7 × 0.3) = 0.5238

* The φ-coefficient is the same for both tables, even though co-presence is far more common in the first table than in the second.

Invariant under Null Addition?

* Null addition adds transactions that contain neither A nor B, i.e., it increases only s:

         B    ¬B              B    ¬B
  A      p    q         A     p    q
  ¬A     r    s         ¬A    r    s + k

* Example: confidence. c(A → B) = P(B|A) = σ(A,B) / σ(A) = p / (p + q), which does not involve s. Hence, confidence is invariant under null addition (i.e., adding transactions in which neither A nor B was bought).
* Invariant measures: support, cosine, Jaccard, etc.
* Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc.

Different Measures Have Different Properties

* The chapter tabulates about 21 measures (correlation φ, lambda λ, odds ratio α, Yule's Q, Yule's Y, Cohen's κ, mutual information M, J-measure, Gini index, support, confidence, Laplace, conviction, interest, IS/cosine, Piatetsky-Shapiro's PS, certainty factor, added value, collective strength, Jaccard, Klosgen's K) together with their ranges and whether each satisfies properties P1–P3 and O1–O4 (symmetry, scaling, inversion, and null-addition invariance, among others). (The full Yes/No table is not reproduced here.)
* No single measure satisfies every property, which is why some measures suit certain applications but not others.
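The invariance properties above can be checked numerically; the sketch below reuses the illustrative `measures` helper from the earlier example together with a hypothetical `cross_product_ratio` function.

```python
def cross_product_ratio(f11, f10, f01, f00):
    """Odds ratio of the 2x2 table; invariant under row/column scaling."""
    return (f11 * f00) / (f10 * f01)

# Mosteller's grade-gender table, then Male column x2 and Female x10:
print(cross_product_ratio(2, 3, 1, 4))          # 2.6667
print(cross_product_ratio(4, 30, 2, 40))        # 2.6667 -- unchanged

# Inversion swaps f11 <-> f00 and f10 <-> f01; phi cannot tell
# co-presence from co-absence:
print(measures(60, 10, 10, 20)['phi'])          # 0.5238
print(measures(20, 10, 10, 60)['phi'])          # 0.5238 -- unchanged

# Null addition inflates only f00; confidence p/(p+q) never sees it:
print(measures(15, 5, 75, 5)['confidence'])     # 0.75
print(measures(15, 5, 75, 5000)['confidence'])  # 0.75 -- unchanged
```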
Subjective Interestingness Measures

* Objective measure: rank patterns based on statistics computed from data, e.g., the 21 measures of association above (support, confidence, Laplace, Gini, mutual information, Jaccard, etc.).
* Subjective measure: rank patterns according to the user's interpretation:
  – A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin).
  – A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin).

Interestingness via Unexpectedness

* Need to model the expectation of users (domain knowledge): mark each pattern as expected to be frequent (+) or expected to be infrequent (−), then compare this with whether the pattern is actually found to be frequent or infrequent in the data. Patterns that match the expectation are expected patterns; patterns that contradict it are unexpected patterns.
* Need to combine the expectation of users with evidence from the data (i.e., the extracted patterns).