CS 349: Market Basket
Data Mining
All about beer and diapers.
Overview
• What is Data Mining?
• Market Baskets
• How fast does it run?
• What does it do?
What is Data Mining?
• Statistics
• Data Analysis
• Machine Learning
• Databases
Types of Data that can be Mined
• market basket
• classification
• time series
• text
Applications of Market Basket
• supermarkets
• data with boolean attributes
– census data: single vs married
• word occurrence
Some Measures of the Data
• number of baskets: N
• number of items: M
• average number of items per basket: W (width)
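As a small illustration, these three measures can be computed directly from a list of baskets. The transactions below are made-up examples, not data from the lecture:

```python
# Hypothetical transactions; each basket is a set of items.
baskets = [
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
]

N = len(baskets)                      # number of baskets
M = len(set().union(*baskets))        # number of distinct items
W = sum(len(b) for b in baskets) / N  # average basket width
```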
Aspects of Market Basket Mining
• What is interesting?
• How do you make it run fast?
What is Interesting? (first try)
• Itemset I = set of items
• association rule: A -> B
• support(I) = fraction of baskets that contain I
• confidence(A -> B) = probability that a basket contains B given that it contains A
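These two definitions translate directly into code. A minimal sketch, using made-up transactions (not data from the lecture):

```python
# Hypothetical transactions; each basket is a set of items.
baskets = [
    {"beer", "diapers"},
    {"beer", "diapers", "milk"},
    {"diapers", "milk"},
    {"beer"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(A, B, baskets):
    """P(basket contains B | basket contains A)."""
    return support(A | B, baskets) / support(A, baskets)

s = support({"beer", "diapers"}, baskets)       # 2 of 4 baskets -> 0.5
c = confidence({"beer"}, {"diapers"}, baskets)  # 2 of the 3 beer baskets
```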
How do you find Itemsets with high support?
• Apriori algorithm, Agrawal et al. (1993)
• Find all itemsets with support > s
• 1-itemset = itemset with 1 item … k-itemset = itemset with k items
• large itemset = itemset with support > s
• candidate itemset = itemset that may have support > s
Apriori Algorithm
• start with all 1-itemsets
• go through the data, count their support, and find all “large” 1-itemsets
• combine them to form “candidate” 2-itemsets
• go through the data, count their support, and find all “large” 2-itemsets
• combine them to form “candidate” 3-itemsets …
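The loop above can be sketched in a few lines of Python. This is a minimal in-memory version for illustration, not the original algorithm's disk-based implementation; the demo transactions are made up:

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Return all itemsets with support >= min_support (a fraction)."""
    n = len(baskets)

    def count_large(candidates):
        # One pass over the data: keep candidates with enough support.
        counts = {c: 0 for c in candidates}
        for basket in baskets:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        return {c for c, v in counts.items() if v / n >= min_support}

    # Pass 1: large 1-itemsets.
    large = count_large({frozenset([i]) for b in baskets for i in b})
    result = set(large)
    k = 2
    while large:
        # Combine large (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # Prune any candidate with a small (k-1)-subset (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in large
                             for s in combinations(c, k - 1))}
        large = count_large(candidates)  # another pass over the data
        result |= large
        k += 1
    return result

# Made-up demo transactions:
demo = [{"beer", "diapers"}, {"beer", "diapers", "milk"},
        {"diapers", "milk"}, {"beer"}]
large_itemsets = apriori(demo, min_support=0.5)
```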
Run Time
• k passes over the data, where k is the size of the largest candidate itemset
• Memory chunking algorithm ==> 2 passes over the data on disk, but multiple passes in memory
• Toivonen 1996 gives a statistical technique: 1 + e passes (but more memory)
• Brin 1997 - Dynamic Itemset Counting
But what is really interesting?
• A -> B
• Support = P(AB)
• Confidence = P(B|A)
• Interest = P(AB) / (P(A)P(B))
• Implication Strength = P(A)P(~B) / P(A~B)
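The last two measures can be sketched as straightforward probability estimates over the baskets. A minimal version, with made-up demo transactions (note that implication strength is undefined when A -> B always holds, since P(A~B) = 0):

```python
def interest(A, B, baskets):
    """P(AB) / (P(A)P(B)); 1.0 means A and B are independent."""
    n = len(baskets)
    pA = sum(A <= b for b in baskets) / n
    pB = sum(B <= b for b in baskets) / n
    pAB = sum((A | B) <= b for b in baskets) / n
    return pAB / (pA * pB)

def implication_strength(A, B, baskets):
    """P(A)P(~B) / P(A~B); undefined (ZeroDivisionError) if A -> B never fails."""
    n = len(baskets)
    pA = sum(A <= b for b in baskets) / n
    pB = sum(B <= b for b in baskets) / n
    pA_notB = sum(A <= b and not (B <= b) for b in baskets) / n
    return pA * (1 - pB) / pA_notB

# Made-up demo transactions:
demo = [{"beer", "diapers"}, {"beer", "diapers"}, {"beer"}, {"milk"}]
i = interest({"beer"}, {"diapers"}, demo)              # 0.5 / (0.75 * 0.5)
s = implication_strength({"beer"}, {"diapers"}, demo)  # 0.75 * 0.5 / 0.25
```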
But what is really really interesting?
• Causality
• Surprise
Summary
• What is Data Mining?
• Market Baskets
• Finding Itemsets with high support
• Finding Interesting Rules