MIS2502:
Data Analytics
Association Rule Mining
Jeremy Shafer
http://community.mis.temple.edu/jshafer
Association Rule Mining
Find out which items predict the occurrence of other items
Also known as “affinity analysis” or “market basket” analysis
Uses
• What products are bought together?
• Amazon’s recommendation engine
• Telephone calling patterns
Examples of Association Rule Mining
• Market basket analysis/affinity analysis
– What products are bought together?
– Where to place items on grocery store shelves?
• Amazon’s recommendation engine
– “People who bought this product also bought…”
• Telephone calling patterns
– Whom does a set of people tend to call most often?
• Social network analysis
– Determine who you “may know”
Market-Basket Transactions
Basket  Items
1       Bread, Milk
2       Bread, Diapers, Beer, Eggs
3       Milk, Diapers, Beer, Coke
4       Bread, Milk, Diapers, Beer
5       Bread, Milk, Diapers, Coke
Some Association Rules from these transactions
X → Y (antecedent → consequent)
{Diapers} → {Beer}
{Milk, Bread} → {Diapers}
{Beer, Bread} → {Milk}
{Bread} → {Milk, Diapers}
Core idea: The itemset
Itemset
A group of items of interest
{Milk, Beer, Diapers}
Association rules express relationships between itemsets
X → Y
{Milk, Diapers} → {Beer}
“when you have milk and diapers, you also have beer”
Support
• Support count (σ)
– In how many baskets does the itemset appear?
– σ({Milk, Beer, Diapers}) = 2 (i.e., in baskets 3 and 4)
• Support (s)
– Fraction of transactions that contain all items in the rule X → Y (the items of X and Y combined)
– s({Milk, Diapers, Beer}) = 2/5 = 0.4
• You can calculate support for both X and Y separately (here X = {Milk, Diapers} and Y = {Beer})
– Support for X = 3/5 = 0.6
– Support for Y = 3/5 = 0.6
• 2 of the 5 baskets in total have milk, beer, and diapers
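The two support measures can be sketched in a few lines of Python. This is a minimal illustration, assuming the five baskets from the Market-Basket Transactions table; the function names `support_count` and `support` are made up for this example and do not come from any library.

```python
# Baskets copied from the Market-Basket Transactions table.
baskets = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diapers", "Beer", "Eggs"},
    3: {"Milk", "Diapers", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diapers", "Beer"},
    5: {"Bread", "Milk", "Diapers", "Coke"},
}


def support_count(itemset):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for items in baskets.values() if itemset <= items)


def support(itemset):
    """Fraction of all baskets that contain every item in itemset."""
    return support_count(itemset) / len(baskets)


X = {"Milk", "Diapers"}
Y = {"Beer"}

print(support_count(X | Y))   # 2 (baskets 3 and 4)
print(support(X | Y))         # 0.4
print(support(X), support(Y)) # 0.6 0.6
```

The set comparison `itemset <= items` is Python's subset test, which matches the slide's definition: a basket counts only if it contains all items of the itemset.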
Confidence
• Confidence is the strength of the association
– Measures how often items in Y appear in transactions that contain X

$$c = \frac{\sigma(X \rightarrow Y)}{\sigma(X)} = \frac{\sigma(\text{Milk, Diapers, Beer})}{\sigma(\text{Milk, Diapers})} = \frac{2}{3} \approx 0.67$$

c must be between 0 and 1
1 is a complete association
0 is no association
This says that 67% of the time, when a basket contains milk and diapers, it also contains beer!
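As a rough check of the confidence calculation, here is a minimal Python sketch. It assumes the same five baskets as the transactions table; `support_count` and `confidence` are illustrative names, not part of any library.

```python
# Baskets copied from the Market-Basket Transactions table.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]


def support_count(itemset):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(X, Y):
    """Of the baskets that contain X, the fraction that also contain Y."""
    return support_count(X | Y) / support_count(X)


c = confidence({"Milk", "Diapers"}, {"Beer"})
print(f"{c:.2f}")  # 0.67
```

Note that confidence depends on direction: swapping X and Y changes the denominator, so the reversed rule can have a different value.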
Some sample rules
Association Rule             Support (s)   Confidence (c)
{Milk, Diapers} → {Beer}     2/5 = 0.4     2/3 = 0.67
{Milk, Beer} → {Diapers}     2/5 = 0.4     2/2 = 1.0
{Diapers, Beer} → {Milk}     2/5 = 0.4     2/3 = 0.67
{Beer} → {Milk, Diapers}     2/5 = 0.4     2/3 = 0.67
{Diapers} → {Milk, Beer}     2/5 = 0.4     2/4 = 0.5
{Milk} → {Diapers, Beer}     2/5 = 0.4     2/4 = 0.5
All the above rules are binary partitions of the same itemset: {Milk, Diapers, Beer}
(That is, we are splitting the set into two parts.)
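The six rules can also be generated mechanically by splitting the itemset into every possible antecedent and consequent. The following is a sketch under the assumption that the baskets are the five from the transactions table; the variable names are illustrative only.

```python
from itertools import combinations

# Baskets copied from the Market-Basket Transactions table.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]


def support_count(itemset):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


itemset = {"Milk", "Diapers", "Beer"}
rules = []

# Every non-empty proper subset becomes an antecedent X;
# whatever is left over becomes the consequent Y.
for size in range(1, len(itemset)):
    for antecedent in combinations(sorted(itemset), size):
        X = set(antecedent)
        Y = itemset - X
        s = support_count(itemset) / len(baskets)
        c = support_count(itemset) / support_count(X)
        rules.append((X, Y, s, c))

for X, Y, s, c in rules:
    print(f"{sorted(X)} -> {sorted(Y)}  s={s:.2f}  c={c:.2f}")
```

All six rules share the same support because they are built from the same itemset; only the confidence changes with the choice of antecedent.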
Rule Pairs
#  Association Rule             Support (s)   Confidence (c)
1  {Milk, Diapers} → {Beer}     2/5 = 0.4     2/3 = 0.67
2  {Milk, Beer} → {Diapers}     2/5 = 0.4     2/2 = 1.0
3  {Diapers, Beer} → {Milk}     2/5 = 0.4     2/3 = 0.67
4  {Beer} → {Milk, Diapers}     2/5 = 0.4     2/3 = 0.67
5  {Diapers} → {Milk, Beer}     2/5 = 0.4     2/4 = 0.5
6  {Milk} → {Diapers, Beer}     2/5 = 0.4     2/4 = 0.5
A rule pair is a pair of association rules that represent the two possible combinations of a binary partition. If I have a rule X → Y, then its corresponding rule pair is Y → X.
One rule pair is starred above. Can you find the others?
Don’t blindly follow the numbers
i.e., high confidence suggests a strong association…
• But this can be deceptive
• Consider {Bread} → {Diapers}
• Support for the total itemset is 0.6 (3/5)
• And confidence is 0.75 (3/4) – pretty high
• But is this just because both are frequently occurring items (s = 0.8 for each)?
• You’d almost expect them to show up in the same baskets by chance
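The "by chance" intuition can be checked numerically. This is a minimal sketch, assuming the five baskets from the transactions table and assuming that "by chance" means the two items are picked independently of each other, so the expected co-occurrence is the product of their individual supports.

```python
# Baskets copied from the Market-Basket Transactions table.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]


def support(itemset):
    """Fraction of all baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


observed = support({"Bread", "Diapers"})
# Under independence, P(Bread and Diapers) = P(Bread) * P(Diapers).
expected_by_chance = support({"Bread"}) * support({"Diapers"})

print(f"observed  = {observed:.2f}")            # observed  = 0.60
print(f"by chance = {expected_by_chance:.2f}")  # by chance = 0.64
```

The observed co-occurrence (0.60) is slightly below what independence alone would give (0.64), so the high confidence of 0.75 says little about a real relationship between bread and diapers.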
Lift
Takes into account how co-occurrence differs from what is expected by chance
– i.e., if items were selected independently from one another

$$\text{Lift} = \frac{s(X \rightarrow Y)}{s(X) \cdot s(Y)}$$

Numerator: support for the total itemset (X and Y together)
Denominator: support for X times support for Y
Lift Example
• What’s the lift for the rule {Milk, Diapers} → {Beer}?
• So X = {Milk, Diapers} and Y = {Beer}
s({Milk, Diapers, Beer}) = 2/5 = 0.4
s({Milk, Diapers}) = 3/5 = 0.6
s({Beer}) = 3/5 = 0.6

$$\text{Lift} = \frac{0.4}{0.6 \times 0.6} = \frac{0.4}{0.36} \approx 1.11$$

When Lift > 1, the occurrence of X and Y together is more likely than what you would expect by chance
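The lift calculation can be reproduced with a short Python sketch. It assumes the five baskets from the transactions table; `support` and `lift` are illustrative helper names, not functions from a library.

```python
# Baskets copied from the Market-Basket Transactions table.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]


def support(itemset):
    """Fraction of all baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def lift(X, Y):
    """Support of X and Y together, relative to what independence predicts."""
    return support(X | Y) / (support(X) * support(Y))


value = lift({"Milk", "Diapers"}, {"Beer"})
print(f"{value:.2f}")  # 1.11
```

Unlike confidence, lift is symmetric: swapping X and Y leaves both the numerator and the denominator unchanged.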
Another example
                   Checking Account
Savings Account    No      Yes     Total
No                 500     3500    4000
Yes                1000    5000    6000
Total              1500    8500    10000

Are people more inclined to have a checking account if they have a savings account?
Support({Savings} → {Checking}) = 5000/10000 = 0.5
Support({Savings}) = 6000/10000 = 0.6
Support({Checking}) = 8500/10000 = 0.85

$$\text{Lift} = \frac{0.5}{0.6 \times 0.85} = \frac{0.5}{0.51} \approx 0.98$$

Answer: No. In fact, it’s slightly less than what you’d expect by chance!
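When the data arrive as counts rather than individual baskets, lift can be computed directly from the table. The sketch below uses the four totals from the checking/savings table; the variable names are chosen for this example only.

```python
# Counts taken from the checking/savings table.
total_customers = 10000
with_savings = 6000        # row total for Savings = Yes
with_checking = 8500       # column total for Checking = Yes
with_both = 5000           # Savings = Yes and Checking = Yes

s_both = with_both / total_customers
s_savings = with_savings / total_customers
s_checking = with_checking / total_customers

lift = s_both / (s_savings * s_checking)
print(f"{lift:.2f}")  # 0.98
```

A lift just under 1 means that customers with a savings account hold a checking account very slightly less often than independence would predict.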
But this can be overwhelming
Thousands of products
Many customer types
Millions of combinations
So where do you start?
Selecting the rules
• We know how to calculate the measures for each rule
– Support
– Confidence
– Lift
• Then we set up thresholds for the minimum rule strength we want to accept
The steps
• List all possible association rules
• Compute the support and confidence for each rule
• Drop rules that don’t make the thresholds
• Use lift to further check the association
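The steps above can be sketched as a brute-force search in Python. This is an illustration rather than a production algorithm: it assumes the five baskets from the transactions table, and the thresholds `MIN_SUPPORT = 0.4` and `MIN_CONFIDENCE = 0.6` are arbitrary values picked for the example, not values given in the slides.

```python
from itertools import combinations

# Baskets copied from the Market-Basket Transactions table.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]

MIN_SUPPORT = 0.4      # assumed threshold, for illustration only
MIN_CONFIDENCE = 0.6   # assumed threshold, for illustration only


def support(itemset):
    """Fraction of all baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


all_items = sorted(set().union(*baskets))
kept = []

# Step 1: list every itemset of two or more items and every way
# of splitting it into an antecedent X and a consequent Y.
for size in range(2, len(all_items) + 1):
    for itemset in map(frozenset, combinations(all_items, size)):
        # Steps 2 and 3: compute support and drop weak itemsets.
        s_xy = support(itemset)
        if s_xy < MIN_SUPPORT:
            continue
        for k in range(1, size):
            for X in map(frozenset, combinations(sorted(itemset), k)):
                Y = itemset - X
                # Steps 2 and 3 again, this time for confidence.
                conf = s_xy / support(X)
                if conf < MIN_CONFIDENCE:
                    continue
                # Step 4: use lift as a further check.
                rule_lift = s_xy / (support(X) * support(Y))
                kept.append((X, Y, s_xy, conf, rule_lift))

for X, Y, s, c, rule_lift in kept:
    print(f"{sorted(X)} -> {sorted(Y)}  s={s:.2f}  c={c:.2f}  lift={rule_lift:.2f}")
```

Checking every combination is only practical for a handful of items. With thousands of products the number of candidate rules explodes, which is why real tools prune the search instead of enumerating everything.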
Once you are confident in a rule, take action
{Milk, Diapers} → {Beer}
Possible Marketing Actions
• Create “New Parent Coping Kits” of beer, milk, and diapers
• What are some others?
Recap
In this presentation, we identified four statistics
used in Association Rule mining.
1. Support Count
2. Support
3. Confidence
4. Lift
Support Count (σ)
Support Count answers the question: “How many?”
Specifically, how many occurrences in a data set contain a particular combination of items.
Support count can be calculated for either a rule or a specific itemset.
So, if:
X = {Milk, Diapers}
Y = {Beer}
Then:
σ(X) = 3
σ(Y) = 3
σ(X → Y) = 2
Support (s)
Support answers the question: “What percent?” Specifically, what percentage of occurrences in a data set contain a particular combination of items. Support (s) may be expressed as a percentage or a probability.
Support can be calculated for either a rule or a specific itemset.
So, if:
X = {Milk, Diapers}
Y = {Beer}
Then:
s(X) = 3/5 = 0.6 = 60%
s(Y) = 3/5 = 0.6 = 60%
s(X → Y) = 2/5 = 0.4 = 40%
Confidence (c)
Confidence answers a question about the quality (or strength) of a particular association rule. Specifically, if I see itemset X in my data, how likely am I to see itemset Y as well?
Confidence can be expressed as a percentage or a probability.
Confidence can only be calculated for an association rule.
So, if:
X = {Milk, Diapers} Y = {Beer}
Then:

$$c(X \rightarrow Y) = \frac{\sigma(X \rightarrow Y)}{\sigma(X)} = \frac{2}{3} \approx 0.67$$
Lift
Lift answers the question: does the support for an association rule in our data appear because of some influence that is more than just dumb luck?
Lift can only be calculated for an association rule.
• A lift > 1 indicates a positive association: X and Y appear together more often than chance alone would predict. The greater the lift, the stronger the association.
• A lift = 1 is what we expect to see if there is no connection between the itemsets.
• A lift < 1 indicates a negative association. That is, if we have X we are actually less likely to see Y.
Lift
If:
X = {Milk, Diapers}
Y = {Beer}
Then:
s(X → Y) = 2/5 = 0.4
s(X) = 3/5 = 0.6
s(Y) = 3/5 = 0.6

$$\text{Lift} = \frac{s(X \rightarrow Y)}{s(X) \cdot s(Y)} = \frac{0.4}{0.6 \times 0.6} = \frac{0.4}{0.36} \approx 1.11$$

When Lift > 1, the occurrence of X and Y together is more likely than what you would expect by chance
What are these numbers good for?
Measurement: Remark
Support Count: A basic measurement. Support Count can be used to get a rough estimate, and is also useful in the computation of support and confidence.
Support: Gives me a probability (or percentage) of an itemset's occurrence within a body of data.
Confidence: A good, preliminary indication of the strength of an association rule.
Lift: The best indicator of the quality or strength of an association rule.