Data Mining
Learning Association Rules
Primary References
• Tan, P., Steinbach, M., and Kumar, V. (2006) Introduction to Data Mining, 1st edition, Addison-Wesley, ISBN: 0-321-32136-7.
• McCombs Business School lecture notes
• Han, J. and Kamber, M. (2000) Data Mining: Concepts and Techniques, Morgan Kaufmann.
• SAS Enterprise Miner
Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions:
TID | Items
 1  | Bread, Milk
 2  | Bread, Diaper, Beer, Eggs
 3  | Milk, Diaper, Beer, Coke
 4  | Bread, Milk, Diaper, Beer
 5  | Bread, Milk, Diaper, Coke
Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
The Apriori Algorithm
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k
  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
      increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
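The pseudo-code above can be sketched in Python. This is a minimal, unoptimized implementation for illustration; the data encoding (transactions as sets of item names) and helper names are my own choices, not from the slides.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their support counts.

    transactions: list of sets of items; min_support: minimum support count.
    """
    # L1: count single items, keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 1
    while current:
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1:
                # Prune step: every k-subset of a candidate must be frequent
                if all(frozenset(sub) in current
                       for sub in combinations(union, k)):
                    candidates.add(union)
        # Count all candidates in one scan of the database
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# The market-basket transactions from the example above:
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
freq = apriori(transactions, min_support=3)
```

With a minimum support count of 3, this yields the four frequent items (Bread, Milk, Diaper, Beer) and four frequent pairs; no 3-itemset survives.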
Definition: Frequent Itemset
• Itemset
  – A collection of one or more items, e.g. {Milk, Bread, Diaper}
  – A k-itemset is an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule
• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X
Example: {Milk, Diaper} → {Beer}
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
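Both metrics can be computed directly on the market-basket transactions above. A small sketch (the set-based transaction encoding and the `sigma` helper name are my own):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, transactions):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Rule {Milk, Diaper} -> {Beer}
X, Y = {"Milk", "Diaper"}, {"Beer"}
support = sigma(X | Y, transactions) / len(transactions)       # 2/5 = 0.4
confidence = sigma(X | Y, transactions) / sigma(X, transactions)  # 2/3
```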
Mining Association Rules
Example of Rules (from the market-basket transactions above):
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
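The observation can be reproduced by enumerating every binary partition of the frequent itemset and scoring each resulting rule; a sketch under the same set-based encoding as before (helper names are my own):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

itemset = frozenset({"Milk", "Diaper", "Beer"})
rules = []
for r in range(1, len(itemset)):
    for lhs in combinations(itemset, r):
        lhs = frozenset(lhs)
        rhs = itemset - lhs
        s = sigma(itemset) / len(transactions)  # identical for every rule
        c = sigma(itemset) / sigma(lhs)         # depends on the LHS only
        rules.append((lhs, rhs, s, c))
```

All six rules share support 0.4, while confidence ranges from 0.5 to 1.0, which is exactly why support and confidence can be checked in separate phases.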
Frequent Itemset Generation
[Itemset lattice: all subsets of {A, B, C, D, E}, from the null set at the top down to ABCDE at the bottom]
Given d items, there are 2^d possible candidate itemsets
Illustrating Apriori Principle
[Lattice diagram: once an itemset is found to be infrequent, all of its supersets are pruned]
The Apriori Algorithm — Example (min. support count = 2)
Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1:
itemset | sup.
{1}     | 2
{2}     | 3
{3}     | 3
{4}     | 1
{5}     | 3

L1 ({4} is pruned):
itemset | sup.
{1}     | 2
{2}     | 3
{3}     | 3
{5}     | 3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 counts:
itemset | sup
{1 2}   | 1
{1 3}   | 2
{1 5}   | 1
{2 3}   | 2
{2 5}   | 3
{3 5}   | 2

L2:
itemset | sup
{1 3}   | 2
{2 3}   | 2
{2 5}   | 3
{3 5}   | 2

C3 (generated from L2): {2 3 5}

Scan D → L3:
itemset | sup
{2 3 5} | 2
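The counts in this trace can be checked with a few lines of Python (the `count` helper is my own):

```python
# Database D from the example, items encoded as integers
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def count(itemset):
    """Support count of an itemset in D."""
    return sum(1 for t in D if itemset <= t)

# C1: {4} has support 1 and is pruned at min. support count 2
# L2 example: {2, 5} appears in transactions 200, 300, 400
# L3: {2, 3, 5} appears in transactions 200 and 300
```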
Presentation of Association Rules (Table Form)
Visualization of Association Rules Using Plane Graph
Evaluation of Association Rules
What rules should be considered valid? An association rule is valid if it satisfies some evaluation measures.
• There are lots of measures proposed in the literature
• Some measures are good for certain applications, but not for others
• What criteria should we use to determine whether a measure is good or bad?
• What about Apriori-style support-based pruning? How does it affect these measures?
Rule Evaluation
Support
– Milk & Wine co-occur
– But… only 2 out of 200K transactions contain these items

Transaction No. | Item 1    | Item 2    | Item 3
100             | Beer      | Diapers   | Chocolate
101             | Milk      | Chocolate | Wine
102             | Beer      | Wine      | Vodka
103             | Beer      | Cheese    | Diapers
104             | Ice Cream | Diapers   | Beer
…
Rule Evaluation
Support: the frequency with which the items in the LHS (body) and RHS (head) co-occur.
E.g., the support of the rule If {Diapers} (LHS) then {Beer} (RHS) is 3/5: 60% of the transactions contain both items.

Support = (No. of transactions containing items in body and head) / (Total no. of transactions in database)

Transaction No. | Item 1    | Item 2    | Item 3
100             | Beer      | Diapers   | Chocolate
101             | Milk      | Chocolate | Shampoo
102             | Beer      | Wine      | Vodka
103             | Beer      | Cheese    | Diapers
104             | Ice Cream | Diapers   | Beer
…
Rule Evaluation
Strength of the Implication
• Which implication is stronger? Of the transactions containing Milk:
  – 1% contain Wine
  – 20% contain Beer
Rules compared: If {Milk} (body) then {Wine} (head) vs. If {Milk} (body) then {Beer} (head)
Rule Evaluation
Confidence
A rule’s strength is measured by its confidence: how strongly does the body imply the head?
Confidence: the proportion of transactions containing the body that also contain the head.
Example: the confidence of If {Diapers} (LHS) then {Beer} (RHS) is 3/3, i.e., 100% of the transactions that contain Diapers also contain Beer.

Confidence = (No. of transactions containing both LHS and RHS) / (No. of transactions containing LHS)
Rule Evaluation: Confidence
• Is the rule Milk → Wine equivalent to the rule Wine → Milk?
• When is the implication Milk → Wine more likely than the reverse?
Rule Evaluation
Example: Lift
• Consider the rule: If {Milk} (body) then {Beer} (head)
• Assume: support 20%, confidence 100%
  – What if all shoppers at the store buy beer?
• Assume confidence is 60%
  – What if 60% of shoppers buy beer?
→ Find rules where the frequency of head given body > expected frequency of head
Another example
• Example 1: (Aggarwal & Yu, PODS98)
  – Among 5000 students:
    • 3000 play basketball
    • 3750 eat cereal
    • 2000 both play basketball and eat cereal
  – play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

           | basketball | not basketball | sum (row)
cereal     | 2000       | 1750           | 3750
not cereal | 1000       | 250            | 1250
sum (col.) | 3000       | 2000           | 5000
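The lift of play basketball ⇒ eat cereal makes the problem explicit: confidence divided by the overall cereal frequency comes out below 1, so playing basketball actually makes eating cereal slightly *less* likely. A quick check of the arithmetic:

```python
# Contingency-table counts from the Aggarwal & Yu example
n = 5000
basketball, cereal, both = 3000, 3750, 2000

support = both / n              # 0.40
confidence = both / basketball  # ~0.667
lift = confidence / (cereal / n)  # ~0.889, i.e. below 1
```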
Lift
• Measures how much more likely the head is given the body than merely the head
• Lift = confidence / frequency of head
Example (rule: If {Milk} (body) then {Beer} (head)):
• Total number of customers in database: 1000
• No. of customers buying Milk: 200
• No. of customers buying Beer: 50
• No. of customers buying Milk & Beer: 20
• Frequency of head: 50/1000 (5%)
• Confidence: 20/200 (10%)
• Lift: 10% / 5% = 2
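As a sketch, the definition can be wrapped in a small helper and applied to the numbers above (the function and parameter names are my own):

```python
def lift(n_total, n_body, n_head, n_both):
    """Lift = confidence / frequency of head."""
    confidence = n_both / n_body
    head_frequency = n_head / n_total
    return confidence / head_frequency

# Milk -> Beer: (20/200) / (50/1000) = 0.10 / 0.05 = 2
milk_beer_lift = lift(n_total=1000, n_body=200, n_head=50, n_both=20)
```

A lift above 1 means the body raises the likelihood of the head; a lift below 1 (as in the basketball/cereal example) means it lowers it.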
Comparison to Traditional Database Queries
Traditional methods such as database queries support hypothesis verification about a relationship, such as the co-occurrence of diapers & beer.
Comparison to Traditional Database Queries
– Data Mining: explore the data for patterns. Data mining methods automatically discover significant association rules from data.
• Find whatever patterns exist in the database, without the user having to specify in advance what to look for (data driven).
• Therefore they allow finding unexpected correlations.
Applications
• Store planning:
  – Placing associated items together (Milk & Bread)?
    • May reduce total basket value (customers buy less unplanned merchandise)
• Fraud detection:
  – Finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity.
Sequential Patterns
Instead of finding associations between items in a single transaction, find associations between items bought by the same customer on different occasions.

Customer ID | Transaction Date | Item 1                | Item 2
AA          | 2/2/2001         | Laptop                | Case
AA          | 1/13/2002        | Wireless network card | Router
BB          | 4/5/2002         | Laptop                | iPaq
BB          | 8/10/2002        | Wireless network card | Router
…
• Sequence: {Laptop}, {Wireless network card, Router}
• A sequence has to satisfy some predetermined minimum support
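A minimal sketch of what it means for one customer's history to support a sequential pattern (the helper name and data encoding are my own, not from the slides): each element of the pattern must be contained in some transaction, and the transactions must occur in the pattern's order.

```python
def contains_sequence(history, pattern):
    """True if the ordered transaction history supports the pattern:
    each pattern element is a subset of some transaction, in order."""
    i = 0
    for basket in history:
        if i < len(pattern) and pattern[i] <= basket:
            i += 1
    return i == len(pattern)

# Customer AA bought a laptop first, then a wireless card and router
aa = [{"Laptop", "Case"}, {"Wireless network card", "Router"}]
pattern = [{"Laptop"}, {"Wireless network card", "Router"}]
supported = contains_sequence(aa, pattern)  # True
```

The support of a sequence would then be the fraction of customers whose histories contain it, checked against the minimum support threshold.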
• SAS Demo