Lecture 10 – Association Rule Mining
Dr. Songsri Tangsripairoj
Dr. Benjarath Pupacdi
Faculty of ICT, Mahidol University
ITCS453 Data Warehousing and Data Mining (2/2009)
Topics
What is Frequent Pattern Analysis?
Association Rule Mining
A Two-Step Process of Association Rule Mining
The Apriori Algorithm
Frequent Itemset Generation
Rule Generation
Mining Association Rules: An Example
Frequent Pattern Analysis
A frequent pattern: a pattern that occurs frequently in a data set
Frequent itemset
▪ A set of items, such as milk and bread, that appear frequently together in a transaction data set
Frequent sequential pattern
▪ Buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database
Frequent Pattern Analysis
Motivation: Finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
Applications
Market-basket analysis, cross-marketing, catalog design, sale campaign analysis
Market Basket Analysis
Help retailers plan marketing or advertising strategies, or in design of a
new catalog
Help retailers design different store layouts
Help retailers plan which items to put on sale at reduced prices
Association Rule Mining
Mining for interesting rules (= gold) through a large database (= mountain)
"Interesting" rules tell you something about your database that you did not already know and probably were not able to explicitly articulate because the data set is so large.
Problem Statement
Given a set of items in a transaction database
Retrieve all possible patterns in the form of association rules
The number of rules may be massively large
May need a filter to select a set of the most valuable or interesting rules
Components of Association Rule
A ⇒ B [support = %, confidence = %]
Milk ⇒ Bread [support = 3%, confidence = 80%]
If milk is purchased, then bread is also purchased 80 percent of the time, and this pattern occurs in 3 percent of all shopping baskets.
Components of Association Rule
A ⇒ B [support = %, confidence = %]
A and B are sets of items, i.e., itemsets. For example, A = {bread, milk} and B = {jam, eggs}.
A = Antecedent
B = Consequent
Measures of Rule Interestingness:
Support = P(A ∪ B)
Confidence = P(B | A)
Association Rule Mining
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
Market-basket transactions:

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} ⇒ {Beer}
{Milk, Bread} ⇒ {Eggs, Coke}
{Beer, Bread} ⇒ {Milk}
Binary Representation
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

• Each row corresponds to a transaction; each column corresponds to an item.
• A cell is 1 if the item is present in the transaction, 0 otherwise.

TID   Bread  Milk  Diaper  Beer  Eggs  Coke
1     1      1     0       0     0     0
2     1      0     1       1     1     0
3     0      1     1       1     0     1
4     1      1     1       1     0     0
5     1      1     1       0     0     1
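The binary table above can be derived mechanically from the transaction list. A small sketch in plain Python (no library assumed; the fixed `items` column order matches the table):

```python
# Transactions keyed by TID, as in the table above.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}
items = ["Bread", "Milk", "Diaper", "Beer", "Eggs", "Coke"]

# One row of 0/1 flags per transaction: 1 if the item is present.
binary = {tid: [1 if item in basket else 0 for item in items]
          for tid, basket in transactions.items()}

# binary[2] == [1, 0, 1, 1, 1, 0]  (Bread, Diaper, Beer, Eggs present)
```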
Definitions
Itemset X = {x1, …, xk}
A collection of one or more items
Example: {Milk, Bread, Diaper}
k-itemset
An itemset that contains k items
Example: {Milk, Bread} is a 2-itemset
Support count (σ)
Frequency of occurrence of an itemset
Example: σ({Milk, Bread, Diaper}) = 2
Support (s)
Percentage of transactions that contain an itemset
Example: s({Milk, Bread, Diaper}) = 2/5
(Counts refer to the market-basket transactions above.)
Definitions
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
Association Rule
An implication expression of the form X ⇒ Y, where X and Y are disjoint itemsets (X ∩ Y = Ø)
Example: {Milk, Diaper} ⇒ {Beer}
Definitions
Rule Evaluation Metrics
Support (s)
Percentage of transactions in D that contain both X and Y
support(X ⇒ Y) = P(X ∪ Y)
Confidence (c)
Percentage of transactions in D containing X that also contain Y
confidence(X ⇒ Y) = P(Y | X) = σ(X ∪ Y) / σ(X)

Example: {Milk, Diaper} ⇒ {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
Association Rule Mining
Example of Rules (from the market-basket transactions above):
{Milk, Diaper} ⇒ {Beer}   (s = 0.4, c = 0.67)
{Milk, Beer} ⇒ {Diaper}   (s = 0.4, c = 1.0)
{Diaper, Beer} ⇒ {Milk}   (s = 0.4, c = 0.67)
{Beer} ⇒ {Milk, Diaper}   (s = 0.4, c = 0.67)
{Diaper} ⇒ {Milk, Beer}   (s = 0.4, c = 0.5)
{Milk} ⇒ {Diaper, Beer}   (s = 0.4, c = 0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
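The observation above can be checked by enumerating every binary partition of {Milk, Diaper, Beer} and computing each rule's support and confidence. A sketch assuming the five-transaction market-basket data from these slides (`sigma` is a hypothetical helper name):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
itemset = frozenset({"Milk", "Diaper", "Beer"})

def sigma(s):
    """Support count: number of transactions containing s."""
    return sum(1 for t in transactions if s <= t)

rules = []
for r in range(1, len(itemset)):          # every binary partition A ⇒ B
    for a in map(frozenset, combinations(sorted(itemset), r)):
        b = itemset - a
        support = sigma(itemset) / len(transactions)   # same for every rule
        conf = sigma(itemset) / sigma(a)               # differs per rule
        rules.append((a, b, support, round(conf, 2)))
# All six rules share support 0.4; confidences are 0.5, 0.67, or 1.0.
```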
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
Rules that satisfy both minsup and minconf are called strong rules.
Two-step process
1. Find all frequent itemsets
   Generate all itemsets whose support ≥ minsup
2. Generate strong association rules from the frequent itemsets
   Generate strong rules from each frequent itemset. These rules must satisfy minsup and minconf.
The overall performance of mining association rules is determined by Step 1.
Frequent Itemset Generation
[Itemset lattice over items A–E, from the null set down to ABCDE:]
null
A  B  C  D  E
AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
ABCD  ABCE  ABDE  ACDE  BCDE
ABCDE

Given k items, there are 2^k possible candidate itemsets.
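The 2^k count can be verified by brute-force enumeration with itertools. Here we count the 2^5 − 1 = 31 non-empty itemsets; the lattice above also includes the null set, giving 2^5 = 32 nodes in total:

```python
from itertools import combinations

def all_candidate_itemsets(items):
    """Brute-force enumeration of every non-empty subset of `items`."""
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

cands = all_candidate_itemsets(["A", "B", "C", "D", "E"])
# 2**5 - 1 == 31 non-empty itemsets (32 counting the null set)
```

This exponential blow-up is exactly why Apriori prunes the lattice instead of enumerating it.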
Reducing Number of Candidates
The Apriori property:
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Note that the support of an itemset never exceeds the support of its subsets
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: If there is any itemset which is infrequent, its supersets should not be generated/tested!
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated
Illustrating Apriori Pruning Principle
[Figure: the itemset lattice over {A, B, C, D, E}. Once an itemset (here {A, B}) is found to be infrequent, all of its supersets are pruned from the search space without being generated or counted.]
Illustrating Apriori Principle
Minimum Support (count) = 3

Items (1-itemsets):
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
Itemset           Count
{Bread, Milk}     3
{Bread, Beer}     2
{Bread, Diaper}   3
{Milk, Beer}      2
{Milk, Diaper}    3
{Beer, Diaper}    3

Triplets (3-itemsets):
Itemset                  Count
{Bread, Milk, Diaper}    3
The Apriori Algorithm
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != Ø; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
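The pseudo-code above can be sketched as runnable Python. This is a minimal level-wise implementation, not an optimized one; the self-join and subset-pruning details follow the standard Apriori candidate-generation step, which the pseudo-code leaves implicit:

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Level-wise Apriori: returns {frozenset: support_count}
    for every frequent itemset."""
    # L1: count items, keep those meeting the minimum support count
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support_count}
    frequent = dict(Lk)

    k = 1
    while Lk:
        # Candidate generation: join Lk with itself, then prune any
        # candidate that has an infrequent k-subset (Apriori property).
        prev = set(Lk)
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k))}
        # Count the surviving candidates in one pass over the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in cands}
        Lk = {s: n for s, n in counts.items() if n >= min_support_count}
        frequent.update(Lk)
        k += 1
    return frequent

# Worked example database TDB (minimum support count = 2)
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(tdb, 2)
# freq contains L1 ∪ L2 ∪ L3; e.g. freq[frozenset({"B", "C", "E"})] == 2
```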
The Apriori Algorithm: An Example
Database TDB (minimum support count = 2):
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: count the candidate 1-itemsets C1, then keep those meeting minsup as L1.
C1:                    L1:
Itemset   sup          Itemset   sup
{A}       2            {A}       2
{B}       3            {B}       3
{C}       3            {C}       3
{D}       1            {E}       3
{E}       3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan: count C2, then keep those meeting minsup as L2.
C2:                    L2:
Itemset   sup          Itemset   sup
{A, B}    1            {A, C}    2
{A, C}    2            {B, C}    2
{A, E}    1            {B, E}    3
{B, C}    2            {C, E}    2
{B, E}    3
{C, E}    2

C3 (generated from L2): {B, C, E}
3rd scan:
L3:
Itemset      sup
{B, C, E}    2
Generation of candidate itemsets and
frequent itemsets (where minsup count=2)
Generating Association Rules from Frequent Itemsets
Given a frequent itemset L, find all non-empty proper subsets f ⊂ L such that f ⇒ L − f satisfies the minimum confidence requirement.
If {I1, I2, I5} is a frequent itemset:
The non-empty proper subsets of L are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}
The resulting association rules are:
I1 ∧ I2 ⇒ I5,   confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2,   confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1,   confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5,   confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5,   confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2,   confidence = 2/2 = 100%
If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L ⇒ Ø and Ø ⇒ L)
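Under the support counts the slide assumes (σ(L) = 2 and the denominators 4, 2, 2, 6, 7, 2 for the six subsets; the underlying database is not shown here), the rule-generation step can be sketched as:

```python
from itertools import combinations
from fractions import Fraction

# Support counts as given on the slide; the source database is omitted.
sigma = {
    frozenset({"I1", "I2", "I5"}): 2,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2,
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I5"}): 2,
}

def rules_from_itemset(L, sigma, min_conf):
    """Emit f ⇒ L − f for every non-empty proper subset f whose
    confidence σ(L)/σ(f) meets min_conf."""
    out = []
    for r in range(1, len(L)):
        for f in map(frozenset, combinations(sorted(L), r)):
            conf = Fraction(sigma[L], sigma[f])
            if conf >= min_conf:
                out.append((f, L - f, conf))
    return out

L = frozenset({"I1", "I2", "I5"})
strong = rules_from_itemset(L, sigma, Fraction(1))  # min_conf = 100%
# Three rules reach 100%: {I1,I5}⇒{I2}, {I2,I5}⇒{I1}, {I5}⇒{I1,I2}
```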
Mining Association Rules: An Example
A Subset of the Credit Card Promotion Database

Magazine    Watch       Life Insurance   Credit Card   Sex
Promotion   Promotion   Promotion        Insurance
Yes         No          No               No            Male
Yes         Yes         Yes              No            Female
No          No          No               No            Male
Yes         Yes         Yes              Yes           Male
Yes         No          Yes              No            Female
No          No          No               No            Female
Yes         No          Yes              Yes           Male
No          Yes         No               No            Male
Yes         No          No               No            Male
Yes         Yes         Yes              No            Female
1-Itemsets

1-Itemsets                      Number of Items
Magazine Promotion=Yes          7
Watch Promotion=Yes             4
Watch Promotion=No              6
Life Insurance Promotion=Yes    5
Life Insurance Promotion=No     5
Credit Card Insurance=No        8
Sex=Male                        6
Sex=Female                      4
2-Itemsets

2-Itemsets                                                Number of Items
Magazine Promotion=Yes & Watch Promotion=No               4
Magazine Promotion=Yes & Life Insurance Promotion=Yes     5
Magazine Promotion=Yes & Credit Card Insurance=No         5
Magazine Promotion=Yes & Sex=Male                         4
Watch Promotion=No & Life Insurance Promotion=No          4
Watch Promotion=No & Credit Card Insurance=No             5
Watch Promotion=No & Sex=Male                             4
Life Insurance Promotion=No & Credit Card Insurance=No    5
Life Insurance Promotion=No & Sex=Male                    4
Credit Card Insurance=No & Sex=Male                       4
Credit Card Insurance=No & Sex=Female                     4
3-Itemsets

3-Itemsets                                                                    Number of Items
Watch Promotion=No & Life Insurance Promotion=No & Credit Card Insurance=No   4
Two possible 2-itemset rules are:

IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)

IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)
Three possible 3-itemset rules are:

IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)

IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)

IF Credit Card Insurance = No
THEN Watch Promotion = No & Life Insurance Promotion = No (4/8)
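The rule confidences above (5/7, 5/5, 4/4, 4/6, 4/8) can be reproduced from the ten-row table. A sketch with the table hard-coded as dictionaries (the abbreviated column keys and the `rule_confidence` helper are my own naming, not from the source):

```python
from fractions import Fraction

# The ten rows of the credit card promotion subset shown above.
rows = [
    {"Magazine": "Yes", "Watch": "No",  "Life": "No",  "CCIns": "No",  "Sex": "Male"},
    {"Magazine": "Yes", "Watch": "Yes", "Life": "Yes", "CCIns": "No",  "Sex": "Female"},
    {"Magazine": "No",  "Watch": "No",  "Life": "No",  "CCIns": "No",  "Sex": "Male"},
    {"Magazine": "Yes", "Watch": "Yes", "Life": "Yes", "CCIns": "Yes", "Sex": "Male"},
    {"Magazine": "Yes", "Watch": "No",  "Life": "Yes", "CCIns": "No",  "Sex": "Female"},
    {"Magazine": "No",  "Watch": "No",  "Life": "No",  "CCIns": "No",  "Sex": "Female"},
    {"Magazine": "Yes", "Watch": "No",  "Life": "Yes", "CCIns": "Yes", "Sex": "Male"},
    {"Magazine": "No",  "Watch": "Yes", "Life": "No",  "CCIns": "No",  "Sex": "Male"},
    {"Magazine": "Yes", "Watch": "No",  "Life": "No",  "CCIns": "No",  "Sex": "Male"},
    {"Magazine": "Yes", "Watch": "Yes", "Life": "Yes", "CCIns": "No",  "Sex": "Female"},
]

def rule_confidence(if_part, then_part):
    """Confidence of an attribute=value rule:
    (rows matching IF and THEN) / (rows matching IF)."""
    matches_if = [r for r in rows
                  if all(r[a] == v for a, v in if_part.items())]
    matches_both = [r for r in matches_if
                    if all(r[a] == v for a, v in then_part.items())]
    return Fraction(len(matches_both), len(matches_if))

# IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes → 5/7
c = rule_confidence({"Magazine": "Yes"}, {"Life": "Yes"})
```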