Download Data Mining Using SQL Queries

Association Rule Mining Data Mining and Knowledge Discovery Prof. Carolina Ruiz and Weiyang Lin Department of Computer Science Worcester Polytechnic Institute Sample Applications   Sample Commercial Applications  Market basket analysis  cross-marketing  attached mailing  store layout, catalog design  customer segmentation based on buying patterns  … Sample Scientific Applications  Genetic analysis  Analysis of medical data  … Transactions and Assoc. Rules  Association Rule: Transaction Id 1 2 3 4 Purchased Items {a, b, c} {a, d} {a, c} {b, e, f} a→c  support: 50% = P(a & c) percentage of transactions that contain both a and c  confidence: 66% = P(c | a) percentage of transactions that contain c among those transactions that contain a Association Rules - Intuition   Given a set of transactions where each transaction is a set of items Find all rules X → Y that relate the presence of one set of items X with another set of items Y    Example: 98% of people who purchase diapers and baby food also buy beer A rule may have any number of items in the antecedent and in the consequent Possible to specify constraints on rules Mining Association Rules Problem Statement Given:  a set of transactions (each transaction is a set of items)  user-specified minimum support  user-specified minimum confidence Find:  all association rules that have support and confidence greater than or equal to the user-specified minimum support and minimum confidence Naïve Procedure to mine rules   List all the subsets of the set of items For each subset     Split the subset into two parts (one for the antecedent and one for the consequent of the rule Compute the support of the rule Compute the confidence of the rule IF support and confidence are no lower than userspecified min. support and confident THEN output the rule Complexity: Let n be the number of items. The number of rules naively considered is: n i-1 n i S[(i ) * S(k )] i=2 k=1 n n i = S[(i ) * (2 -2) ] i=2 n =3 –2 (n+1) +1 The Apriori Algorithm 1. Find all frequent itemsets: sets of items whose support is greater than or equal to the userspecified minimum support. 2. Generate the desired rules: if {a, b, c, d} and {a, b} are frequent itemsets, then compute the ratio conf (a & b → c & d) = P(c & d | a & b) = P( a & b & c & d)/P(a & b) = support({a, b, c, d})/support({a, b}). If conf >= mincoff, then add rule a & b → c & d The Apriori Algorithm — Example slide taken from J. Han & M. Kamber’s Data Mining book Min. supp = 50%, I.e. min support count = 2 Database D TID 100 200 300 400 itemset sup. C1 {1} 2 {2} 3 Scan D {3} 3 {4} 1 {5} 3 Items 134 235 1235 25 C2 itemset sup L2 itemset sup 2 2 3 2 {1 {1 {1 {2 {2 {3 C3 itemset {2 3 5} Scan D {1 3} {2 3} {2 5} {3 5} 2} 3} 5} 3} 5} 5} 1 2 1 2 3 2 L1 itemset sup. {1} {2} {3} {5} 2 3 3 3 C2 itemset {1 2} Scan D L3 itemset sup {2 3 5} 2 {1 {1 {2 {2 {3 3} 5} 3} 5} 5} Apriori Principle  Key observation: Every subset of a frequent itemset is also a frequent itemset Or equivalently,  The support of an itemset is greater than or equal to the support of any superset of the itemset  Apriori - Compute Frequent Itemsets Making multiple passes over the data for pass k {candidate generation: Ck := Lk-1 joined with Lk-1 ; support counting in Ck; Lk := All candidates in Ck with minimum support; } terminate when Lk==  or Ck+1==  Frequent-Itemsets =  k Lk Lk - Set of frequent itemsets of size k. (those with minsup) Ck - Set of candidate itemsets of size k. (potentially frequent itemsets) Apriori – Generating rules For each frequent itemset: - Generate the desired rules: if {a, b, c, d} and {a, b} are frequent itemsets, then compute the ratio conf (a & b → c & d) = support({a, b, c, d})/support({a, b}). If conf >= mincoff, then add rule a & b → c & d

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Data Mining Using SQL Queries