CS 349: Market Basket Data Mining
All about beer and diapers.

Overview
• What is Data Mining?
• Market Baskets
• How fast does it run?
• What does it do?

What is Data Mining?
• Statistics
• Data Analysis
• Machine Learning
• Databases

Types of Data that can be Mined
• market basket
• classification
• time series
• text

Applications of Market Basket
• supermarkets
• data with boolean attributes
  – census data: single vs. married
• word occurrence

Some Measures of the Data
• number of baskets: N
• number of items: M
• average number of items per basket: W (width)

Aspects of Market Basket Mining
• What is interesting?
• How do you make it run fast?

What is Interesting? (first try)
• Itemset I = set of items
• association rule: A -> B
• support(I) = fraction of baskets that contain I
• confidence(A -> B) = probability that a basket contains B given that it contains A

How do you find Itemsets with high support?
• Apriori algorithm, Agrawal et al. (1993)
• Find all itemsets with support > s
• 1-itemset = itemset with 1 item, ..., k-itemset = itemset with k items
• large itemset = itemset with support > s
• candidate itemset = itemset that may have support > s

Apriori Algorithm
• start with all 1-itemsets
• go through the data, count their support, and find all "large" 1-itemsets
• combine them to form "candidate" 2-itemsets
• go through the data, count their support, and find all "large" 2-itemsets
• combine them to form "candidate" 3-itemsets
• ...

Run Time
• k passes over the data, where k is the size of the largest candidate itemset
• memory-chunking algorithm ==> 2 passes over data on disk, but multiple passes in memory
• Toivonen (1996) gives a statistical technique: 1 + e passes (but more memory)
• Brin (1997): Dynamic Itemset Counting

But what is really interesting?
• A -> B
• Support = P(AB)
• Confidence = P(B|A)
• Interest = P(AB) / P(A)P(B)
• Implication Strength = P(A)P(~B) / P(A~B)

But what is really really interesting?
• Causality
• Surprise

Summary
• What is Data Mining?
• Market Baskets
• Finding Itemsets with high support
• Finding Interesting Rules
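The level-wise Apriori loop described above (count large 1-itemsets, combine into candidate 2-itemsets, and so on) can be sketched in Python. The basket data is a made-up toy example, and this in-memory version makes one pass per level rather than the disk-efficient variants mentioned under Run Time:

```python
from itertools import combinations

# Hypothetical toy data: each basket is a set of purchased items.
baskets = [
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"beer", "milk"},
    {"diapers", "milk"},
    {"beer", "diapers", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def apriori(baskets, min_support):
    """Return all itemsets with support > min_support, level by level."""
    items = sorted(set().union(*baskets))
    # Large 1-itemsets: single items with support above the threshold.
    large = [frozenset([i]) for i in items
             if support([i], baskets) > min_support]
    all_large = list(large)
    k = 2
    while large:
        prev = set(large)
        # Candidate k-itemsets: unions of large (k-1)-itemsets ...
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # ... pruned so every (k-1)-subset is itself large (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        # One pass over the data counts support for the surviving candidates.
        large = [c for c in candidates if support(c, baskets) > min_support]
        all_large.extend(large)
        k += 1
    return all_large
```

With `min_support=0.5` on this toy data, `apriori` keeps {beer}, {diapers}, {milk}, and the pair {beer, diapers}; {bread} and every milk pairing fall below the threshold, so no candidate 3-itemsets are ever generated.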
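The four rule measures listed under "But what is really interesting?" are all simple ratios of basket probabilities, so they can be computed directly from the data. A minimal sketch, reusing a hypothetical toy basket list (the function name `measures` is my own, not from the slides):

```python
# Hypothetical toy data: each basket is a set of purchased items.
baskets = [
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"beer", "milk"},
    {"diapers", "milk"},
    {"beer", "diapers", "bread"},
]

def prob(pred, baskets):
    """Fraction of baskets satisfying the predicate."""
    return sum(pred(b) for b in baskets) / len(baskets)

def measures(A, B, baskets):
    """Support, confidence, interest, and implication strength for A -> B."""
    A, B = set(A), set(B)
    p_a = prob(lambda b: A <= b, baskets)            # P(A)
    p_b = prob(lambda b: B <= b, baskets)            # P(B)
    p_ab = prob(lambda b: A | B <= b, baskets)       # P(AB)
    p_a_notb = prob(lambda b: A <= b and not B <= b, baskets)  # P(A~B)
    return {
        "support": p_ab,                  # P(AB)
        "confidence": p_ab / p_a,         # P(B|A)
        "interest": p_ab / (p_a * p_b),   # P(AB) / P(A)P(B)
        # Expected vs. actual rate of counterexamples to the rule:
        "implication": (p_a * (1 - p_b)) / p_a_notb
                       if p_a_notb else float("inf"),
    }
```

For the rule beer -> diapers on this data, confidence is 0.6/0.8 = 0.75, while interest is 0.6/(0.8*0.8) ≈ 0.94; an interest below 1 means beer and diapers co-occur slightly *less* often than independence would predict, which is why support and confidence alone can mislead.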