Association Analysis: Basic Concepts and Algorithms
Dr. Hui Xiong, Rutgers University
(Introduction to Data Mining)

Basic Concepts

Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
- Market-basket transactions:

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

- Examples of association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}
- Implication here means co-occurrence, not causality!

Definition: Frequent Itemset
- Itemset: a collection of one or more items, e.g., {Milk, Bread, Diaper}.
  - A k-itemset is an itemset that contains k items.
- Support count (σ): frequency of occurrence of an itemset.
  - E.g., σ({Milk, Bread, Diaper}) = 2 in the transactions above.
- Support (s): fraction of transactions that contain an itemset.
  - E.g., s({Milk, Bread, Diaper}) = 2/5.
- Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

Definition: Association Rule
- Association rule: an implication expression of the form X → Y, where X and Y are disjoint itemsets.
  - Example: {Milk, Diaper} → {Beer}.
- Rule evaluation metrics:
  - Support (s): the fraction of transactions that contain both X and Y.
  - Confidence (c): how often items in Y appear in transactions that contain X.
- For {Milk, Diaper} → {Beer}:

  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
  - support ≥ minsup threshold, and
  - confidence ≥ minconf threshold.
- Brute-force approach:
  - List all possible association rules.
  - Compute the support and confidence for each rule.
  - Prune rules that fail the minsup and minconf thresholds.
  ⇒ Computationally prohibitive!

Computational Complexity
- Given d unique items:
  - Total number of itemsets = 2^d.
  - Total number of possible association rules:

    R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1

  - If d = 6, R = 602 rules.

Mining Association Rules
- Example rules from the transactions above:
  {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)
- Observations:
  - All the above rules are binary partitions of the same itemset, {Milk, Diaper, Beer}.
  - Rules originating from the same itemset have identical support but can have different confidence.
  - Thus we may decouple the support and confidence requirements.
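To make these definitions concrete, here is a minimal sketch (not part of the slides; the helper names are illustrative) that computes support count, support, and confidence on the market-basket transactions above:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """c(X -> Y) = sigma(X union Y) / sigma(X)."""
    return support_count(lhs | rhs) / support_count(lhs)

print(support({"Milk", "Diaper", "Beer"}))        # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}))   # 0.666...
```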
Mining Association Rules
- Two-step approach:
  1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
  2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
- Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation
- [Figure: the itemset lattice over items A-E, from the null set through all 1-, 2-, 3-, and 4-itemsets down to ABCDE.]
- Given d items, there are 2^d possible candidate itemsets.
- Brute-force approach:
  - Each itemset in the lattice is a candidate frequent itemset.
  - Count the support of each candidate by scanning the database, matching each transaction against every candidate.
  - Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width.
  ⇒ Expensive, since M = 2^d!

Frequent Itemset Generation Strategies
- Reduce the number of candidates (M):
  - Complete search: M = 2^d.
  - Use pruning techniques to reduce M.
- Reduce the number of transactions (N):
  - Reduce the size of N as the size of the itemset increases.
  - Used by DHP and vertical-based mining algorithms.
- Reduce the number of comparisons (NM):
  - Use efficient data structures to store the candidates or transactions.
  - No need to match every candidate against every transaction.

Reducing the Number of Candidates
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- The Apriori principle holds due to the following property of the support measure:

  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  - The support of an itemset never exceeds the support of its subsets.
  - This is known as the anti-monotone property of support.

Illustrating the Apriori Principle
- [Figure: once an itemset is found to be infrequent, all of its supersets are pruned from the lattice.]
- With minimum support = 3 on the transactions above:

  Items (1-itemsets):
    Bread 4, Coke 2, Milk 4, Beer 3, Diaper 4, Eggs 1

  Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
    {Bread, Milk} 3, {Bread, Beer} 2, {Bread, Diaper} 3,
    {Milk, Beer} 2, {Milk, Diaper} 3, {Beer, Diaper} 3

  Triplets (3-itemsets):
    {Bread, Milk, Diaper} 2

- If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates.
  With support-based pruning: 6 + 6 + 1 = 13 candidates.

Apriori Algorithm
- Method (a runnable sketch follows below):
  - Let k = 1.
  - Generate frequent itemsets of length 1.
  - Repeat until no new frequent itemsets are identified:
    - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
    - Prune candidate itemsets containing a length-k subset that is infrequent.
    - Count the support of each remaining candidate by scanning the DB.
    - Eliminate candidates that are infrequent, leaving only those that are frequent.

Reducing the Number of Comparisons
- Candidate counting: scan the database of transactions (the same ones shown earlier) to determine the support of each candidate itemset.
- To reduce the number of comparisons, store the candidates in a hash structure: instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets.
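The Apriori loop above can be sketched in a few lines of Python. This is an illustrative implementation under the simplest design choices (sets of frozensets, candidate join by unioning any two frequent k-itemsets, no hash tree), not the optimized algorithm from the slides:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    # frequent 1-itemsets
    items = {i for t in transactions for i in t}
    freq = [{frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= minsup_count}]
    k = 1
    while freq[-1]:
        prev = freq[-1]
        # candidate generation: union of two frequent k-itemsets of size k+1
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # candidate pruning: every k-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # support counting and elimination of infrequent candidates
        freq.append({c for c in candidates
                     if sum(c <= t for t in transactions) >= minsup_count})
        k += 1
    return [s for level in freq for s in level]

transactions = [frozenset(t) for t in (
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
)]
print(apriori(transactions, minsup_count=3))
# Frequent items: Bread, Milk, Beer, Diaper; frequent pairs: {Bread,Milk},
# {Bread,Diaper}, {Milk,Diaper}, {Beer,Diaper}. The triple {Bread,Milk,Diaper}
# is generated as a candidate but has support count 2 here, so it is eliminated.
```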
Association Analysis: Basic Concepts and Algorithms (Algorithms and Complexity)
© Tan, Steinbach, Kumar

Factors Affecting the Complexity of Apriori
- Choice of minimum support threshold:
  - Lowering the support threshold results in more frequent itemsets.
  - This may increase the number of candidates and the maximum length of frequent itemsets.
- Dimensionality (number of items) of the data set:
  - More space is needed to store the support count of each item.
  - If the number of frequent items also increases, both computation and I/O costs may increase.
- Size of database:
  - Since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions.
- Average transaction width:
  - Transaction width increases with denser data sets.
  - This may increase the maximum length of frequent itemsets and the number of hash-tree traversals (the number of subsets in a transaction increases with its width).

Compact Representation of Frequent Itemsets
- Some itemsets are redundant because they have identical support to their supersets.
- [Table: 15 transactions over 30 items A1-A10, B1-B10, C1-C10; transactions 1-5 contain exactly {A1,...,A10}, transactions 6-10 exactly {B1,...,B10}, and transactions 11-15 exactly {C1,...,C10}.]
- Number of frequent itemsets = 3 × \sum_{k=1}^{10} \binom{10}{k}
- We need a compact representation.

Maximal Frequent Itemset
- An itemset is maximal frequent if none of its immediate supersets is frequent.
- [Figure: itemset lattice showing the border between frequent and infrequent itemsets; the maximal itemsets lie just inside the border, with all pruned supersets outside it.]

Closed Itemset
- An itemset is closed if none of its immediate supersets has the same support as the itemset.

  TID  Items
  1    {A,B}
  2    {B,C,D}
  3    {A,B,C,D}
  4    {A,B,D}
  5    {A,B,C,D}

  Itemset  Support      Itemset    Support
  {A}      4            {A,B,C}    2
  {B}      5            {A,B,D}    3
  {C}      3            {A,C,D}    2
  {D}      4            {B,C,D}    3
  {A,B}    4            {A,B,C,D}  2
  {A,C}    2
  {A,D}    3
  {B,C}    3
  {B,D}    4
  {C,D}    3

Maximal vs Closed Itemsets
- Example transactions:

  TID  Items
  1    ABC
  2    ABCD
  3    BCE
  4    ACDE
  5    DE

- [Figure: itemset lattice in which each itemset is annotated with the IDs of the transactions that contain it; itemsets supported by no transaction are marked.]

Maximal vs Closed Frequent Itemsets
- With minimum support = 2 on the same lattice, closed itemsets are those whose support differs from every immediate superset; maximal itemsets are frequent itemsets with no frequent immediate superset.
- [Figure: the lattice with closed-but-not-maximal and closed-and-maximal itemsets highlighted.]
- # Closed = 9, # Maximal = 4.
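A small sketch (hypothetical helper, assuming all support counts are available in a dict) that classifies frequent itemsets as closed and/or maximal according to the definitions above, using the five transactions from the Closed Itemset table:

```python
from itertools import combinations

def classify(supports, minsup):
    """supports: dict mapping frozenset -> support count, for all itemsets.
    Returns {itemset: (is_closed, is_maximal)} for the frequent itemsets."""
    frequent = {s for s, c in supports.items() if c >= minsup}
    items = set().union(*supports)
    result = {}
    for s in frequent:
        supersets = [s | {i} for i in items - s]
        is_maximal = all(sup not in frequent for sup in supersets)
        is_closed = all(supports.get(sup, 0) != supports[s] for sup in supersets)
        result[s] = (is_closed, is_maximal)
    return result

transactions = [{"A","B"}, {"B","C","D"}, {"A","B","C","D"},
                {"A","B","D"}, {"A","B","C","D"}]
supports = {}
for k in range(1, 5):
    for combo in combinations("ABCD", k):
        s = frozenset(combo)
        supports[s] = sum(s <= t for t in transactions)

labels = classify(supports, minsup=2)
print([sorted(s) for s, (c, m) in labels.items() if m])  # [['A','B','C','D']]
print(sum(1 for c, m in labels.values() if c))           # 6 closed itemsets:
# {B}, {A,B}, {B,D}, {A,B,D}, {B,C,D}, {A,B,C,D}
```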
Maximal vs Closed Itemsets
- [Figure: the maximal frequent itemsets are a subset of the closed frequent itemsets, which are a subset of all frequent itemsets.]

Alternative Methods for Frequent Itemset Generation
- Traversal of the itemset lattice:
  - General-to-specific vs specific-to-general.
  - Equivalence classes.
  - Breadth-first vs depth-first.
- Representation of the database: horizontal vs vertical data layout.

  Horizontal layout:            Vertical layout:
  TID  Items                    A: 1, 4, 5, 6, 7, 8, 9
  1    A, B, E                  B: 1, 2, 5, 7, 8, 10
  2    B, C, D                  C: 2, 3, 4, 5, 8, 9
  3    C, E                     D: 2, 4, 5, 9
  4    A, C, D                  E: 1, 3, 6
  5    A, B, C, D
  6    A, E
  7    A, B
  8    A, B, C
  9    A, C, D
  10   B

Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.
- If {A,B,C,D} is a frequent itemset, the candidate rules are:
  ABC→D, ABD→C, ACD→B, BCD→A, A→BCD, B→ACD, C→ABD, D→ABC,
  AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB
- If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).

Rule Generation
- How can we efficiently generate rules from frequent itemsets?
  - In general, confidence does not have an anti-monotone property: c(ABC→D) can be larger or smaller than c(AB→D).
  - But the confidence of rules generated from the same itemset does have an anti-monotone property. E.g., for L = {A,B,C,D}:

    c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

  - Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.

Rule Generation for the Apriori Algorithm
- [Figure: lattice of rules; once a rule has low confidence, all rules below it with larger consequents are pruned.]
- A candidate rule is generated by merging two rules that share the same prefix in the rule consequent.
  - E.g., join(CD⇒AB, BD⇒AC) produces the candidate rule D⇒ABC.
  - Prune rule D⇒ABC if its subset rule AD⇒BC does not have high confidence.
- A sketch of this consequent-growing procedure follows below.
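A sketch of rule generation from one frequent itemset, assuming a `support` function over itemsets such as the one sketched earlier (names and the exact join strategy are illustrative; the confidence anti-monotone property guarantees that growing only the high-confidence consequents loses no rules):

```python
def gen_rules(L, support, minconf):
    """Return (antecedent, consequent, confidence) triples with c >= minconf."""
    rules = []
    consequents = [frozenset([i]) for i in L]   # start with 1-item consequents
    while consequents:
        next_level = set()
        for Y in consequents:
            X = L - Y
            if not X:
                continue
            conf = support(L) / support(X)
            if conf >= minconf:
                rules.append((X, Y, conf))
                next_level.add(Y)   # only high-confidence consequents grow
        # merge surviving consequents, growing the RHS by one item
        consequents = {a | b for a in next_level for b in next_level
                       if len(a | b) == len(a) + 1}
    return rules

L = frozenset({"Milk", "Diaper", "Beer"})
for X, Y, conf in gen_rules(L, support, minconf=0.6):
    print(set(X), "->", set(Y), round(conf, 2))
# {Milk,Beer} -> {Diaper} (1.0), {Milk,Diaper} -> {Beer} (0.67),
# {Diaper,Beer} -> {Milk} (0.67), {Beer} -> {Milk,Diaper} (0.67)
```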
Association Analysis: Basic Concepts and Algorithms (Pattern Evaluation)
© Tan, Steinbach, Kumar

Effect of Support Distribution
- Many real data sets have a skewed support distribution.
- [Figure: support distribution of a retail data set; a few items have very high support, while most items have very low support.]
- How should we set the minsup threshold?
  - If minsup is too high, we could miss itemsets involving interesting rare items (e.g., expensive products).
  - If minsup is too low, mining becomes computationally expensive and the number of itemsets is very large.

Cross-Support Patterns
- A cross-support pattern involves items with widely varying degrees of support.
  - Example: {caviar, milk}.
- [Figure: items sorted by support; caviar lies in the low-support tail, milk near the high-support end.]
- How can we avoid such patterns?
- Observation: conf(caviar→milk) is very high, but conf(milk→caviar) is very low. Therefore

  min( conf(caviar→milk), conf(milk→caviar) )

  is also very low.

h-Confidence
- The h-confidence (all-confidence) of an itemset X is

  hconf(X) = s(X) / max_{i∈X} s({i}),

  i.e., the minimum confidence of any rule from a single item of X to the rest of X.
- Advantages of h-confidence:
  1. It eliminates cross-support patterns such as {caviar, milk}.
  2. The min function has the anti-monotone property, so the algorithm can be applied to efficiently discover low-support, high-confidence patterns.

Pattern Evaluation
- Association rule algorithms can produce a large number of rules, many of them uninteresting or redundant.
  - Redundant, e.g., if {A,B,C} → {D} and {A,B} → {D} have the same support and confidence.
- Interestingness measures can be used to prune or rank the patterns.
- In the original formulation, support and confidence are the only measures used.

Application of Interestingness Measures
- [Figure: knowledge-discovery pipeline (data selection, preprocessing, mining, postprocessing); interestingness measures are applied to the mined patterns during postprocessing.]

Computing an Interestingness Measure
- Given a rule X → Y, the information needed to compute its interestingness can be obtained from a contingency table.

  Contingency table for X → Y:

            Y     ¬Y
  X        f11   f10   | f1+
  ¬X       f01   f00   | f0+
           f+1   f+0   | |T|

  f11: support count of X and Y
  f10: support count of X and ¬Y
  f01: support count of ¬X and Y
  f00: support count of ¬X and ¬Y

- These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
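A small sketch (illustrative helpers, not from the slides) of several such measures computed from the four contingency counts; the formulas match the statistical measures defined on the following slides:

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    pxy, px, py = f11 / n, (f11 + f10) / n, (f11 + f01) / n
    return {
        "support":    pxy,
        "confidence": pxy / px,                 # P(Y|X)
        "lift":       (pxy / px) / py,          # P(Y|X) / P(Y)
        "interest":   pxy / (px * py),          # equal to lift for a 2x2 table
        "PS":         pxy - px * py,            # Piatetsky-Shapiro
        "phi":        (pxy - px * py) /
                      sqrt(px * (1 - px) * py * (1 - py)),
    }

# Counts from the Tea -> Coffee example on the next slide:
print(measures(15, 5, 75, 5))
# confidence = 0.75, but lift = 0.75 / 0.9 = 0.833 (< 1: negative association)
```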
Drawback of Confidence

         Coffee  ¬Coffee
  Tea      15       5    |  20
  ¬Tea     75       5    |  80
           90      10    | 100

- Association rule: Tea → Coffee.
- Confidence = P(Coffee|Tea) = 15/20 = 0.75, but P(Coffee) = 0.9.
- Although the confidence is high, the rule is misleading: P(Coffee|¬Tea) = 75/80 = 0.9375.

Statistical Independence
- Population of 1000 students:
  - 600 students know how to swim (S).
  - 700 students know how to bike (B).
  - 420 students know how to swim and bike (S,B).
- P(S∧B) = 420/1000 = 0.42 and P(S) × P(B) = 0.6 × 0.7 = 0.42.
  - P(S∧B) = P(S) × P(B)  ⇒ statistical independence
  - P(S∧B) > P(S) × P(B)  ⇒ positively correlated
  - P(S∧B) < P(S) × P(B)  ⇒ negatively correlated

Statistical-based Measures
- Measures that take statistical dependence into account:

  Lift = P(Y|X) / P(Y)
  Interest = P(X,Y) / ( P(X) P(Y) )
  PS = P(X,Y) − P(X) P(Y)
  φ-coefficient = ( P(X,Y) − P(X) P(Y) ) / √( P(X)(1−P(X)) P(Y)(1−P(Y)) )

Example: Lift/Interest
- For the Tea → Coffee table above: confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9.
  ⇒ Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore negatively associated).

Drawback of Lift & Interest

       Y   ¬Y               Y   ¬Y
  X    10    0 |  10   X    90    0 |  90
  ¬X    0   90 |  90   ¬X    0   10 |  10
       10   90 | 100        90   10 | 100

  Lift = 0.1 / (0.1 × 0.1) = 10      Lift = 0.9 / (0.9 × 0.9) = 1.11

- Although X and Y are perfectly associated in both tables, lift is much higher when they are rare.
- Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1.

There are lots of measures proposed in the literature.

Comparing Different Measures
- 10 examples of contingency tables:

  Example    f11    f10    f01    f00
  E1        8123     83    424   1370
  E2        8330      2    622   1046
  E3        9481     94    127    298
  E4        3954   3080      5   2961
  E5        2886   1363   1320   4431
  E6        1500   2000    500   6000
  E7        4000   2000   1000   3000
  E8        4000   2000   2000   2000
  E9        1720   7121      5   1154
  E10         61   2483      4   7452

- [Figure: rankings of these contingency tables under various measures; different measures can order the same tables very differently.]

Property under Variable Permutation
- Does M(A,B) = M(B,A)? (I.e., is the measure unchanged when antecedent and consequent are swapped, transposing the contingency table?)
- Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
- Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.

Property under Row/Column Scaling
- Grade-gender example (Mosteller, 1968):

         Male  Female             Male  Female
  High     2      3  |  5   High    4     30  | 34
  Low      1      4  |  5   Low     2     40  | 42
           3      7  | 10           6     70  | 76

  (the second table scales the Male column by 2 and the Female column by 10)

- Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.
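A quick numeric check of Mosteller's point. The odds ratio, f11·f00 / (f10·f01), is not defined in these slides but is the classic measure that is invariant under row/column scaling, while φ is not; the helper names below are illustrative:

```python
from math import sqrt

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def phi(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    pxy, px, py = f11 / n, (f11 + f10) / n, (f11 + f01) / n
    return (pxy - px * py) / sqrt(px * (1 - px) * py * (1 - py))

# X = High grade, Y = Male: f11 = (High, Male), f10 = (High, Female), ...
t1 = (2, 3, 1, 4)      # original table
t2 = (4, 30, 2, 40)    # Male column scaled x2, Female column x10
print(odds_ratio(*t1), odds_ratio(*t2))   # 2.67  2.67  (invariant)
print(phi(*t1), phi(*t2))                 # 0.218 vs 0.129 (changes)
```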
Property under Inversion Operation
- [Figure: transaction bit vectors over items A-F; inversion flips every 0 to 1 and every 1 to 0, exchanging the roles of presence and absence.]

Example: φ-Coefficient
- The φ-coefficient is analogous to the correlation coefficient for continuous variables.

       Y   ¬Y               Y   ¬Y
  X    60   10 |  70   X    20   10 |  30
  ¬X   10   20 |  30   ¬X   10   60 |  70
       70   30 | 100        30   70 | 100

  φ = (0.6 − 0.7 × 0.7) / √(0.7 × 0.3 × 0.7 × 0.3) = 0.5238
  φ = (0.2 − 0.3 × 0.3) / √(0.7 × 0.3 × 0.7 × 0.3) = 0.5238

- The φ-coefficient is the same for both tables, even though the second table is the inversion of the first.

Property under Null Addition
- Null addition: add k transactions that contain neither A nor B (i.e., f00 → f00 + k).
- Invariant measures: support, cosine, Jaccard, etc.
- Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc.

Different Measures Have Different Properties
- [Table: summary of many interestingness measures and which properties (symmetry, scaling invariance, inversion invariance, null invariance, etc.) each satisfies.]

Simpson's Paradox
- Overall:
  c({HDTV = Yes} → {Exercise Machine = Yes}) = 99/180 = 55%
  c({HDTV = No} → {Exercise Machine = Yes}) = 54/120 = 45%
  ⇒ Customers who buy HDTVs appear more likely to buy exercise machines.
- College students:
  c({HDTV = Yes} → {Exercise Machine = Yes}) = 1/10 = 10%
  c({HDTV = No} → {Exercise Machine = Yes}) = 4/34 = 11.8%
- Working adults:
  c({HDTV = Yes} → {Exercise Machine = Yes}) = 98/170 = 57.7%
  c({HDTV = No} → {Exercise Machine = Yes}) = 50/86 = 58.1%
- Within each stratum the direction reverses: the observed relationship in the data may be influenced by confounding factors (hidden variables), which can cause the observed relationship to disappear or reverse its direction!
- Proper stratification is needed to avoid generating spurious patterns.
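A small numeric check (sketch) of the figures above, showing how the aggregate comparison reverses once the data are stratified:

```python
def conf(buyers, total):
    return buyers / total

strata = {  # (exercise-machine buyers, group size) for HDTV = Yes / No
    "college": {"yes": (1, 10),   "no": (4, 34)},
    "working": {"yes": (98, 170), "no": (50, 86)},
}

# aggregate: sum numerators and denominators over the strata
agg_yes = tuple(map(sum, zip(*(s["yes"] for s in strata.values()))))
agg_no  = tuple(map(sum, zip(*(s["no"]  for s in strata.values()))))
print(conf(*agg_yes), conf(*agg_no))   # 0.55 vs 0.45: HDTV buyers look more likely
for name, s in strata.items():         # but in each stratum the order flips
    print(name, conf(*s["yes"]), conf(*s["no"]))
# college 0.10 vs 0.118; working 0.577 vs 0.581
```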
Extensions of Association Analysis
Dr. Hui Xiong, Rutgers University

Association Analysis: Advanced Concepts
Extensions of Association Analysis to Continuous and Categorical Attributes and Multi-level Rules

Continuous and Categorical Attributes
- How do we apply association analysis to non-asymmetric binary variables, i.e., categorical and continuous attributes?
- Example of an association rule:
  {Gender=Male, Age ∈ [21,30)} → {No. of hours online ≥ 10}

Handling Categorical Attributes
- Example (Internet usage data):
  {Level of Education=Graduate, Online Banking=Yes} → {Privacy Concerns=Yes}
- Introduce a new "item" for each distinct attribute-value pair.
- Some attributes can have many possible values, and many of their attribute values then have very low support.
  - Potential solution: aggregate the low-support attribute values.
- The distribution of attribute values can be highly skewed.
  - Example: 85% of survey participants own a computer at home, so most records have Computer at home = Yes. Computation becomes expensive, and many frequent itemsets involve the binary item (Computer at home = Yes).
  - Potential solutions: discard the highly frequent items, or use alternative measures such as h-confidence.
- Computational complexity:
  - Binarizing the data increases the number of items,
  - but the width of the "transactions" remains the same as the number of original (non-binarized) attributes.
  - More frequent itemsets are produced, but the maximum size of a frequent itemset is limited to the number of original attributes.

Handling Continuous Attributes
- Different methods:
  - Discretization-based
  - Statistics-based
  - Non-discretization based (minApriori)
- Different kinds of rules can be produced:
  - {Age ∈ [21,30), No. of hours online ∈ [10,20)} → {Chat Online=Yes}
  - {Age ∈ [21,30), Chat Online=Yes} → No. of hours online: μ=14, σ=4

Discretization-based Methods
- Unsupervised:
  - Equal-width binning
  - Equal-depth binning
  - Cluster-based
- Supervised discretization, e.g., for a continuous attribute v:

  v:                  v1   v2   v3   v4   v5   v6   v7   v8   v9
  Chat Online = Yes    0    0   20   10   20    0    0    0    0
  Chat Online = No   150  100    0    0    0  100  100  150  100
                     (---bin1---) (---bin2---) (------bin3------)

Discretization Issues
- Interval width matters. [Figure: three patterns A, B, C over the attribute's range.]
- Interval too wide (e.g., bin size = 30):
  - May merge several disparate patterns (patterns A and B are merged together).
  - May lose some of the interesting patterns (pattern C may not have enough confidence).
- Interval too narrow (e.g., bin size = 2):
  - Pattern A is broken up into two smaller patterns; it can be recovered by merging adjacent subpatterns.
  - Pattern B is broken up into smaller patterns that cannot be recovered by merging adjacent subpatterns.
- Potential solution: use all possible intervals.
  - Start with narrow intervals, then consider all possible mergings of adjacent intervals (see the sketch below).
- Execution time:
  - If the range is partitioned into k intervals, there are O(k²) new items.
  - If an interval [a,b) is frequent, then all intervals that subsume [a,b) must also be frequent. E.g., if {Age ∈ [21,25), Chat Online=Yes} is frequent, then {Age ∈ [10,50), Chat Online=Yes} is also frequent.
  - Improving efficiency: use a maximum support to avoid intervals that are too wide, and modify the algorithm to exclude candidate itemsets containing more than one interval of the same attribute.
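A sketch (hypothetical helper names) of turning a continuous attribute into interval "items": start from narrow equal-width base bins and also generate every merging of adjacent bins, as suggested above:

```python
def base_bins(lo, hi, width):
    """Equal-width base intervals [lo, lo+width), [lo+width, ...), ..."""
    edges, x = [], lo
    while x < hi:
        edges.append((x, min(x + width, hi)))
        x += width
    return edges

def all_merged_intervals(bins):
    """All intervals formed by merging runs of adjacent base bins: O(k^2)."""
    return [(bins[i][0], bins[j][1])
            for i in range(len(bins)) for j in range(i, len(bins))]

bins = base_bins(10, 50, 5)              # 8 base bins over Age in [10, 50)
intervals = all_merged_intervals(bins)   # 8 * 9 / 2 = 36 interval items
items = [f"Age∈[{a},{b})" for a, b in intervals]
print(len(items), items[:3])
```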
Discretization Issues: Redundant Rules
- R1: {Age ∈ [18,20), No. of hours ∈ [10,12)} → {Chat Online=Yes}
- R2: {Age ∈ [18,23), No. of hours ∈ [10,20)} → {Chat Online=Yes}
- If both rules have the same support and confidence, prune the more specific rule (R1).

Statistics-based Methods
- Example: {Income > 100K, Online Banking=Yes} → Age: μ=34
- The rule consequent consists of a continuous variable, characterized by its statistics: mean, median, standard deviation, etc.
- Approach:
  - Withhold the target attribute from the rest of the data.
  - Extract frequent itemsets from the remaining attributes (binarizing the continuous attributes, except for the target attribute).
  - For each frequent itemset, compute the corresponding descriptive statistics of the target attribute; the frequent itemset becomes a rule by introducing the target variable as the rule consequent.
  - Apply a statistical test to determine the interestingness of the rule.
- Example:

  Frequent itemsets:                        Association rules:
  {Male, Income > 100K}                     {Male, Income > 100K} → Age: μ = 30
  {Income < 40K, No. hours ∈ [10,15)}       {Income < 40K, No. hours ∈ [10,15)} → Age: μ = 24
  {Income > 100K, Online Banking = Yes}     {Income > 100K, Online Banking = Yes} → Age: μ = 34
  …                                         …

Statistics-based Methods: Rule Interestingness
- How do we determine whether an association rule is interesting?
  - Compare the statistics for the segment of the population covered by the rule (mean μ) versus the segment of the population not covered by the rule (mean μ′).
  - Statistical hypothesis testing:

    Z = (μ′ − μ − Δ) / √( s₁²/n₁ + s₂²/n₂ )

    - Null hypothesis H0: μ′ = μ + Δ; alternative hypothesis H1: μ′ > μ + Δ.
    - Z has zero mean and variance 1 under the null hypothesis.
- Example: r: Browser=Mozilla ∧ Buy=Yes → Age: μ=23
  - The rule is interesting if the difference between μ and μ′ is more than 5 years (i.e., Δ = 5).
  - For r, suppose n₁ = 50, s₁ = 3.5; for its complement r′, n₂ = 250, s₂ = 6.5, μ′ = 30:

    Z = (30 − 23 − 5) / √( 3.5²/50 + 6.5²/250 ) = 3.11

  - For a one-sided test at the 95% confidence level, the critical Z-value for rejecting the null hypothesis is 1.64. Since Z = 3.11 > 1.64, r is an interesting rule.
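A direct transcription (sketch) of the Z computation above:

```python
from math import sqrt

def z_score(mu_rule, n1, s1, mu_rest, n2, s2, delta):
    """Z for H0: mu_rest = mu_rule + delta vs H1: mu_rest > mu_rule + delta."""
    return (mu_rest - mu_rule - delta) / sqrt(s1**2 / n1 + s2**2 / n2)

z = z_score(mu_rule=23, n1=50, s1=3.5, mu_rest=30, n2=250, s2=6.5, delta=5)
print(round(z, 2))   # 3.11
print(z > 1.64)      # True: reject H0 at the one-sided 95% level
```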
Min-Apriori
- Document-term matrix:

  TID  W1  W2  W3  W4  W5
  D1    2   2   0   0   1
  D2    0   0   1   2   2
  D3    2   3   0   0   0
  D4    0   0   1   0   1
  D5    1   1   1   0   2

- Example: W1 and W2 tend to appear together in the same document.
- The data contains only continuous attributes of the same "type", e.g., the frequency of words in a document.
- Potential solutions:
  - Convert into a 0/1 matrix and apply existing algorithms; but this loses the word frequency information.
  - Discretization does not apply, as users want associations among words, not among ranges of word frequencies.

Min-Apriori
- How to determine the support of a word?
  - If we simply sum up its frequency, the support count will be greater than the total number of documents!
  - Instead, normalize the word vectors, e.g., using the L1 norm, so that each word has a support equal to 1.0:

  Normalized:
  TID   W1    W2    W3    W4    W5
  D1   0.40  0.33  0.00  0.00  0.17
  D2   0.00  0.00  0.33  1.00  0.33
  D3   0.40  0.50  0.00  0.00  0.00
  D4   0.00  0.00  0.33  0.00  0.17
  D5   0.20  0.17  0.33  0.00  0.33

- New definition of support:

  sup(C) = \sum_{i∈T} \min_{j∈C} D(i, j)

  - Example: sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17 = 0.17.

Anti-monotone Property of This Support
- Example, using the normalized matrix above:
  - sup(W1) = 0.4 + 0 + 0.4 + 0 + 0.2 = 1
  - sup(W1, W2) = 0.33 + 0 + 0.4 + 0 + 0.17 = 0.9
  - sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17 = 0.17
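A tiny sketch of min-Apriori's support measure: L1-normalize each word column, then sum the row-wise minimum over the itemset's columns (variable names are illustrative):

```python
docs = {  # document-term matrix from the slides
    "D1": {"W1": 2, "W2": 2, "W3": 0, "W4": 0, "W5": 1},
    "D2": {"W1": 0, "W2": 0, "W3": 1, "W4": 2, "W5": 2},
    "D3": {"W1": 2, "W2": 3, "W3": 0, "W4": 0, "W5": 0},
    "D4": {"W1": 0, "W2": 0, "W3": 1, "W4": 0, "W5": 1},
    "D5": {"W1": 1, "W2": 1, "W3": 1, "W4": 0, "W5": 2},
}

words = ["W1", "W2", "W3", "W4", "W5"]
col_sum = {w: sum(d[w] for d in docs.values()) for w in words}
norm = {tid: {w: d[w] / col_sum[w] for w in words} for tid, d in docs.items()}

def min_support(itemset):
    """sup(C) = sum over documents of min_{j in C} D(i, j)."""
    return sum(min(row[w] for w in itemset) for row in norm.values())

print(round(min_support({"W1"}), 2))               # 1.0
print(round(min_support({"W1", "W2"}), 2))         # 0.9
print(round(min_support({"W1", "W2", "W3"}), 2))   # 0.17
```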
Concept Hierarchies
- [Figure: a two-rooted hierarchy, roughly:]

  Food                                  Electronics
    Bread: Wheat, White                   Computers: Desktop, Laptop, Accessory (Printer, Scanner)
    Milk: Skim, 2%                        Home: TV, DVD
      (brands such as Foremost, Kemps)

Multi-level Association Rules
- Why should we incorporate a concept hierarchy?
  - Rules at lower levels may not have enough support to appear in any frequent itemsets.
  - Rules at lower levels of the hierarchy are overly specific: e.g., skim milk → white bread, 2% milk → wheat bread, skim milk → wheat bread, etc., are all indicative of an association between milk and bread.

Multi-level Association Rules
- How do support and confidence vary as we traverse the concept hierarchy?
  - If X is the parent item for both X1 and X2, then σ(X) ≤ σ(X1) + σ(X2).
  - If σ(X1 ∪ Y1) ≥ minsup, and X is the parent of X1 and Y is the parent of Y1, then σ(X ∪ Y1) ≥ minsup, σ(X1 ∪ Y) ≥ minsup, and σ(X ∪ Y) ≥ minsup.
  - If conf(X1 ⇒ Y1) ≥ minconf, then conf(X1 ⇒ Y) ≥ minconf.

Multi-level Association Rules: Approach 1
- Extend the current association rule formulation by augmenting each transaction with its higher-level items.
  - Original transaction: {skim milk, wheat bread}
  - Augmented transaction: {skim milk, wheat bread, milk, bread, food}
- Issues:
  - Items that reside at higher levels have much higher support counts; if the support threshold is low, there are too many frequent patterns involving items from the higher levels.
  - Increased dimensionality of the data.

Multi-level Association Rules: Approach 2
- Generate frequent patterns at the highest level first; then generate frequent patterns at the next-highest level, and so on.
- Issues:
  - I/O requirements increase dramatically because we need to perform more passes over the data.
  - May miss some potentially interesting cross-level association patterns.

Association Analysis: Advanced Concepts
Subgraph Mining

Frequent Subgraph Mining
- Extends association analysis to finding frequent subgraphs.
- Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.
- [Figure: a small Web graph whose nodes are pages (Homepage, Research, Artificial Intelligence, Databases, Data Mining) and whose edges are hyperlinks.]

Graph Definitions
- [Figure: (a) a labeled graph with vertex labels a, b, c and edge labels p, q, r, s, t; (b) one of its subgraphs; (c) an induced subgraph.]

Representing Transactions as Graphs
- Each transaction can be viewed as a clique of items: e.g., for TID = 1 with items {A,B,C,D}, the graph is the clique on the vertices A, B, C, D.

Representing Graphs as Transactions
- Each graph becomes a "transaction" whose "items" are its labeled edges:

  Graph  (a,b,p)  (a,b,q)  (a,b,r)  (b,c,p)  (b,c,q)  (b,c,r)  …  (d,e,r)
  G1        1        0        0        0        0        1     …     0
  G2        1        0        0        0        0        0     …     0
  G3        0        0        1        1        0        0     …     0

Challenges
- A node may contain duplicate labels.
- Support and confidence: how do we define them?
  - Support: the number of graphs that contain a particular subgraph.
- Additional constraints are imposed by the pattern structure: support and confidence are not the only constraints; a common assumption is that frequent subgraphs must be connected.
- Apriori-like approach: use frequent k-subgraphs to generate frequent (k+1)-subgraphs. But what is k?
  - The Apriori principle still holds, and a level-wise (Apriori-like) approach works with either choice:
  - Vertex growing: k is the number of vertices.
  - Edge growing: k is the number of edges.

Vertex Growing
- Two k-vertex graphs sharing a (k−1)-vertex core are merged into a (k+1)-vertex candidate; the edge between the two non-core vertices is undetermined (shown as "?"):

  M_{G1} = \begin{pmatrix} 0 & p & p & q \\ p & 0 & r & 0 \\ p & r & 0 & 0 \\ q & 0 & 0 & 0 \end{pmatrix},
  M_{G2} = \begin{pmatrix} 0 & p & p & 0 \\ p & 0 & r & 0 \\ p & r & 0 & r \\ 0 & 0 & r & 0 \end{pmatrix},
  M_{G3} = \begin{pmatrix} 0 & p & p & q & 0 \\ p & 0 & r & 0 & 0 \\ p & r & 0 & 0 & r \\ q & 0 & 0 & 0 & ? \\ 0 & 0 & r & ? & 0 \end{pmatrix}
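A sketch of vertex growing on adjacency matrices. The `vertex_grow` helper is hypothetical, written only to reproduce the merge shown above; edge labels are strings and 0 means "no edge":

```python
def vertex_grow(m1, m2, unknown="?"):
    """Merge two k-vertex graphs whose first k-1 vertices (the core)
    coincide into one (k+1)-vertex candidate. The edge between the two
    non-core vertices is undetermined ('?'); a separate candidate would
    be produced for each possible label of that edge."""
    k = len(m1)
    g = [[0] * (k + 1) for _ in range(k + 1)]
    for i in range(k - 1):                 # copy the shared core
        for j in range(k - 1):
            g[i][j] = m1[i][j]
    for i in range(k - 1):                 # attach each parent's extra vertex
        g[i][k - 1] = g[k - 1][i] = m1[i][k - 1]
        g[i][k] = g[k][i] = m2[i][k - 1]
    g[k - 1][k] = g[k][k - 1] = unknown    # undetermined edge
    return g

p, q, r = "p", "q", "r"
G1 = [[0, p, p, q], [p, 0, r, 0], [p, r, 0, 0], [q, 0, 0, 0]]
G2 = [[0, p, p, 0], [p, 0, r, 0], [p, r, 0, r], [0, 0, r, 0]]
for row in vertex_grow(G1, G2):
    print(row)   # reproduces M_G3 above, with '?' marking the unknown edge
```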
Edge Growing
- [Figure: two k-edge graphs sharing a common core are merged by adding one new edge, producing a (k+1)-edge candidate.]

Apriori-like Algorithm
- Find frequent 1-subgraphs.
- Repeat:
  - Candidate generation: use frequent (k−1)-subgraphs to generate candidate k-subgraphs.
  - Candidate pruning: prune candidate subgraphs that contain infrequent (k−1)-subgraphs.
  - Support counting: count the support of each remaining candidate.
  - Eliminate candidate k-subgraphs that are infrequent.
- In practice it is not as easy; there are many other issues.

Example: Dataset

  Graph  (a,b,p)  (a,b,q)  (a,b,r)  (b,c,p)  (b,c,q)  (b,c,r)  …  (d,e,r)
  G1        1        0        0        0        0        1     …     0
  G2        1        0        0        0        0        0     …     0
  G3        0        0        1        1        0        0     …     0
  G4        0        0        0        0        0        0     …     0

- [Figure: the graphs G1-G4 and the candidates generated from them.]

Candidate Generation
- In Apriori, merging two frequent k-itemsets produces one candidate (k+1)-itemset.
- In frequent subgraph mining (vertex or edge growing), merging two frequent k-subgraphs may produce more than one candidate (k+1)-subgraph.

Multiplicity of Candidates (Vertex Growing)
- Merging the same pair of graphs under a different alignment of the core produces a different candidate, e.g.:

  M_{G3'} = \begin{pmatrix} 0 & p & p & 0 & q \\ p & 0 & r & 0 & 0 \\ p & r & 0 & r & 0 \\ 0 & 0 & r & 0 & ? \\ q & 0 & 0 & ? & 0 \end{pmatrix}

Multiplicity of Candidates (Edge Growing)
- Case 1: identical vertex labels. The merged graphs can overlap in more than one way, so several candidates result. [Figure]
- Case 2: the core contains identical labels. The core, i.e., the (k−1)-subgraph that is common to the joined graphs, itself admits multiple mappings. [Figure]
- Case 3: core multiplicity. The two graphs share more than one possible core. [Figure]

Candidate Generation by Edge Growing
- Given G1 = core plus edge (a,b) and G2 = core plus edge (c,d), form G3 = merge(G1, G2):
  - Case 1: a ≠ c and b ≠ d produces one candidate (the core plus both edges).
  - Case 2: a = c and b ≠ d produces two candidates.
  - Case 3: a ≠ c and b = d produces two candidates.
  - Case 4: a = c and b = d produces three candidates.

Graph Isomorphism
- Two graphs are isomorphic if they are topologically equivalent, i.e., there is a one-to-one mapping between their vertices that preserves edges and labels.
- A test for graph isomorphism is needed:
  - During candidate generation, to determine whether a candidate has already been generated.
  - During candidate pruning, to check whether its (k−1)-subgraphs are frequent.
  - During support counting, to check whether a candidate is contained within another graph.
Graph Isomorphism
- [Figure: the same 8-vertex graph, with vertices labeled A(1)-A(4) and B(5)-B(8), drawn two different ways; its adjacency matrix differs depending on how the vertices are ordered.]
- The same graph can be represented in many ways.

Graph Isomorphism: Canonical Labeling
- Use canonical labeling to handle isomorphism:
  - Map each graph into an ordered string representation (known as its code) such that two isomorphic graphs are mapped to the same canonical encoding.
  - Example: use the lexicographically largest adjacency matrix. Reading the upper triangle of

    \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}

    column by column gives the string 011011; reordering the vertices to obtain the lexicographically largest matrix,

    \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix},

    gives the canonical code 111100.
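A brute-force sketch of canonical labeling for small unlabeled graphs: try every vertex permutation and keep the lexicographically largest upper-triangle code. This is feasible only for the small graphs that arise as candidate subgraphs; real implementations use smarter canonical-form algorithms:

```python
from itertools import permutations

def code(adj, order):
    """Upper-triangle string, read column by column, under a vertex order."""
    n = len(adj)
    return "".join(str(adj[order[i]][order[j]])
                   for j in range(1, n) for i in range(j))

def canonical_code(adj):
    n = len(adj)
    return max(code(adj, p) for p in permutations(range(n)))

g = [[0, 0, 1, 0],   # the graph from the example above
     [0, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 1, 1, 0]]
print(code(g, (0, 1, 2, 3)))   # 011011: one of its many representations
print(canonical_code(g))       # 111100: the canonical encoding
```

Two graphs are then isomorphic exactly when their canonical codes are equal, which is how the candidate-generation and counting steps above can detect duplicates.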