Data Mining: Concepts and Techniques (3rd ed.)
Chapter 6
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.
April 30, 2017

We have now finished data preprocessing
• Data cleaning
• Data integration
• Data reduction
• Data transformation
Next: mining

Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods
• Basic Concepts
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting? Pattern Evaluation Methods
• Summary

What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• Frequent pattern mining searches for recurring relationships in a given data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
• Motivation: finding inherent regularities in data
  - What products were often purchased together?
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
• Applications: basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

Market Basket Analysis: A Motivating Example
• To understand the basic concepts of mining frequent patterns and associations, let's look at the earliest form of frequent pattern mining for association rules.
• A typical example of frequent itemset mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”.
• Market basket analysis may be performed on the retail data of customer transactions at your store. You can then use the results to plan marketing or advertising strategies, or to design a new catalog.

What Is an Association Rule?
• Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases
• Goal: understand customer buying habits by finding associations and correlations between the different items that customers place in their “shopping basket”
• The ideas come from market basket analysis (MBA)
• Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, web log analysis, fraud detection

What Is an Association Rule? Rule Form
• Rule form: Antecedent → Consequent [support, confidence]
  (support and confidence are user-defined measures of interestingness)
• Examples:
  - buys(x, “computer”) → buys(x, “financial management software”) [0.5%, 60%]
  - age(x, “30..39”) ^ income(x, “42..48K”) → buys(x, “car”) [1%, 75%]

Association Rule: Basic Concepts
• Given: (1) a database of transactions, where (2) each transaction is a list of items (purchased by a customer in one visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items

Basic Rule Measures: Support and Confidence
• Support of the rule A ⇒ B: the frequency of the rule within all transactions in the database, i.e., the probability that a transaction contains both A and B. A high value means that the rule involves a large part of the database.
  support(A ⇒ B) = P(A ∪ B) = (number of transactions containing both A and B) / N, where N is the total number of transactions
  Here A ∪ B denotes the union of the two itemsets, so a transaction "contains A ∪ B" when it contains every item of A and every item of B.
• Confidence of the rule A ⇒ B: the percentage of transactions containing A that also contain B, i.e., the probability that a transaction containing A also contains B. It is an estimate of the conditional probability.
  confidence(A ⇒ B) = P(B | A) = P(A ∪ B) / P(A) = support(A ∪ B) / support(A)

Why Use Support and Confidence?
• Support is an important measure because a rule that has very low support may occur simply by chance. A low-support rule is also likely to be uninteresting from a business perspective, because it may not be profitable to promote items that customers seldom buy together. Support is therefore often used to eliminate uninteresting rules.
• Confidence measures the reliability of the inference made by a rule. For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X. Confidence also provides an estimate of the conditional probability of Y given X.
• Example: Rule 1: Computer → Antivirus-software [support = 2%, confidence = 60%]
  - A support of 2% for Rule 1 means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together.
  - A confidence of 60% means that 60% of the customers who purchased a computer also bought the software.

Formulation of the Association Rule Problem
The association rule mining problem can be formally stated as follows:
• Association rule discovery: given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.
• If an itemset is infrequent (its support is below minsup), all rules derived from it can be pruned immediately, without computing their confidence values.
• Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. Association rules are considered interesting if they are strong.

Formulation of the Association Rule Problem (continued)
• The total number of possible rules that can be extracted from a data set containing d items is
  R = 3^d - 2^(d+1) + 1
• Example: given the frequent itemset {A, B, E}, how many association rules are possible?
  d = 3, so R = 3^3 - 2^(3+1) + 1 = 27 - 16 + 1 = 12 rules
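
The rule-count formula can be checked by listing the rules directly. The following is a minimal sketch, not part of the original slides: the helper names `enumerate_rules` and `rule_count` are made up for illustration, and a rule is assumed to be any pair of non-empty, disjoint itemsets (antecedent, consequent) drawn from the d items.

```python
from itertools import combinations

def enumerate_rules(items):
    """List every rule (antecedent, consequent) with non-empty, disjoint sides."""
    items = sorted(items)
    rules = []
    # Pick the items that take part in the rule (at least two are needed).
    for size in range(2, len(items) + 1):
        for chosen in combinations(items, size):
            # Split the chosen items into a non-empty antecedent and consequent.
            for k in range(1, size):
                for antecedent in combinations(chosen, k):
                    consequent = tuple(i for i in chosen if i not in antecedent)
                    rules.append((antecedent, consequent))
    return rules

def rule_count(d):
    """Closed-form count from the slide: R = 3^d - 2^(d+1) + 1."""
    return 3**d - 2**(d + 1) + 1

rules = enumerate_rules({"A", "B", "E"})
for lhs, rhs in rules:
    print(", ".join(lhs), "->", ", ".join(rhs))
print(len(rules), rule_count(3))  # 12 12
```

For {A, B, E} the listing contains six rules over two items (such as A -> B) and six rules that use all three items (such as A, B -> E), which matches the 12 given by the formula.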

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Example rules: Beer → Diaper; Nuts, Eggs → Milk; ...

Formulation of the Association Rule Problem
A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:
1. Frequent itemset generation, whose objective is to find all the itemsets that satisfy the minsup threshold. These itemsets are called frequent itemsets.
2. Rule generation, whose objective is to extract all the high-confidence rules from the frequent itemsets found in the previous step. These rules are called strong rules.

Basic Concepts: Frequent Patterns

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (Absolute) support, or support count, of X: the frequency or number of occurrences of the itemset X
• (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold

Basic Concepts: Association Rules

Tid   Items bought
10    Beer, Nuts, Ice
20    Beer, Coffee, Ice
30    Beer, Ice, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Ice, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy ice, and customers who buy both]

• Find all the rules X → Y with minimum support and confidence
  - Support, s: the probability that a transaction contains X ∪ Y
  - Confidence, c: the conditional probability that a transaction containing X also contains Y
• Let minsup = 50%, minconf = 50%
• Frequent patterns: Beer:3, Nuts:3, Ice:4, Eggs:3, {Beer, Ice}:3
• Association rules (many more!):
  - Beer → Ice (60%, 100%)
  - Ice → Beer (60%, 75%)

Exercise 1. Basic association rule creation manually.
The “database” below has four transactions. What association rules can be found in this set if the minimum support (i.e., coverage) is 60% and the minimum confidence (i.e., accuracy) is 80%?

Trans_id   Itemlist
T1         K, A, D, B
T2         D, A, C, E, B
T3         C, A, B, E
T4         B, A, D
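
A hand-derived answer to an exercise like this one can be checked with a small brute-force search. The sketch below is an illustration only, not the method of the slides or of any particular library: the function names `count` and `strong_rules` are hypothetical, and the search tries every itemset, so it is suitable only for toy data. It is demonstrated on the Beer/Ice table from the earlier slide (minsup = 50%, minconf = 50%) rather than on the exercise itself.

```python
from itertools import combinations

def count(itemset, transactions):
    """Absolute support: number of transactions containing every item of itemset."""
    return sum(set(itemset).issubset(t) for t in transactions)

def strong_rules(transactions, min_sup, min_conf):
    """Brute-force search for rules X -> Y that meet both thresholds."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    found = []
    for size in range(2, len(items) + 1):
        for candidate in combinations(items, size):
            n_xy = count(candidate, transactions)
            if n_xy == 0 or n_xy / n < min_sup:
                continue  # infrequent itemset: none of its rules can qualify
            for k in range(1, size):
                for lhs in combinations(candidate, k):
                    # confidence = count(X and Y together) / count(X)
                    conf = n_xy / count(lhs, transactions)
                    if conf >= min_conf:
                        rhs = tuple(i for i in candidate if i not in lhs)
                        found.append((lhs, rhs, n_xy / n, conf))
    return found

# Beer/Ice transactions from the "Basic Concepts: Association Rules" slide.
baskets = [
    {"Beer", "Nuts", "Ice"},
    {"Beer", "Coffee", "Ice"},
    {"Beer", "Ice", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Ice", "Eggs", "Milk"},
]
for lhs, rhs, s, c in strong_rules(baskets, 0.5, 0.5):
    print(f"{', '.join(lhs)} -> {', '.join(rhs)} ({s:.0%}, {c:.0%})")
```

Confidence is computed from raw counts rather than by dividing two relative supports, which avoids floating-point rounding near the threshold. To check the exercise, pass its four transactions with thresholds 0.6 and 0.8.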