Lecture 10 – Association Rule Mining
Dr. Songsri Tangsripairoj
Dr.Benjarath Pupacdi
Faculty of ICT, Mahidol University
ITCS453 Data Warehousing and Data Mining (2/2009)
Topics

• What is Frequent Pattern Analysis?
• Association Rule Mining
• A Two-Step Process of Association Rule Mining
• The Apriori Algorithm
• Frequent Itemset Generation
• Rule Generation
• Mining Association Rules: An Example
Frequent Pattern Analysis

• A frequent pattern: a pattern that occurs frequently in a data set
• Frequent itemset
  - A set of items, such as milk and bread, that appear frequently together in a transaction data set
• Frequent sequential pattern
  - Buying first a PC, then a digital camera, and then a memory card, if this sequence occurs frequently in a shopping history database
Frequent Pattern Analysis

• Motivation: finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
• Applications
  - Market-basket analysis, cross-marketing, catalog design, sale campaign analysis
Market Basket Analysis

• Helps retailers plan marketing or advertising strategies, or design a new catalog
• Helps retailers design different store layouts
• Helps retailers plan which items to put on sale at reduced prices
Association Rule Mining

• Mining for interesting rules (= gold) through a large database (= mountain)
• "Interesting" rules tell you something about your database that you did not already know and probably were not able to explicitly articulate because the data set is so large.
Problem Statement

• Given a set of items in a transaction database
• Retrieve all possible patterns in the form of association rules
  - The number of rules may be massively large
• May need a filter to select a set of the most valuable or interesting rules
Components of Association Rule

A ⇒ B [support = %, confidence = %]

Milk ⇒ Bread [support = 3%, confidence = 80%]

If milk is purchased, then bread is purchased 80 percent of the time, and this pattern occurs in 3 percent of all shopping baskets.
Components of Association Rule

A ⇒ B [support = %, confidence = %]

A and B are sets of items, i.e. itemsets. For example, A = {bread, milk} and B = {jam, eggs}.

A = antecedent
B = consequent

Measures of rule interestingness:
Support = P(A ∪ B)
Confidence = P(B | A)
Association Rule Mining

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} ⇒ {Beer}
{Milk, Bread} ⇒ {Eggs, Coke}
{Beer, Bread} ⇒ {Milk}
Binary Representation

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

• Each row corresponds to a transaction; each column corresponds to an item.
• 1 if the item is present in a transaction, 0 otherwise.

TID  Bread  Milk  Diapers  Beer  Eggs  Coke
1    1      1     0        0     0     0
2    1      0     1        1     1     0
3    0      1     1        1     0     1
4    1      1     1        1     0     0
5    1      1     1        0     0     1
Definitions

• Itemset X = {x1, …, xk}
  - A collection of one or more items
  - Example: {Milk, Bread, Diaper}
• k-itemset
  - An itemset that contains k items
  - Example: {Milk, Bread} is a 2-itemset
• Support count (σ)
  - Frequency of occurrence of an itemset
  - Example: σ({Milk, Bread, Diaper}) = 2
• Support
  - Percentage of transactions that contain an itemset
  - Example: s({Milk, Bread, Diaper}) = 2/5
Definitions

• Frequent itemset
  - An itemset whose support is greater than or equal to a minsup threshold
• Association rule
  - An implication expression of the form X ⇒ Y, where X and Y are disjoint itemsets (X ∩ Y = Ø)
  - Example: {Milk, Diaper} ⇒ {Beer}
Definitions

• Rule evaluation metrics
  - Support (s): percentage of transactions in D that contain both X and Y
    support(X ⇒ Y) = P(X ∪ Y) = σ(X ∪ Y) / |T|
  - Confidence (c): percentage of transactions in D containing X that also contain Y
    confidence(X ⇒ Y) = P(Y | X) = σ(X ∪ Y) / σ(X)

Example: {Milk, Diaper} ⇒ {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
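The support and confidence formulas can be checked directly against the five example transactions (a minimal sketch; the helper names are my own):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    """s(X => Y) = sigma(X u Y) / |T|"""
    return support_count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """c(X => Y) = sigma(X u Y) / sigma(X)"""
    return support_count(antecedent | consequent) / support_count(antecedent)

# {Milk, Diaper} => {Beer}
print(support({"Milk", "Diaper"}, {"Beer"}))     # 2/5 = 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}))  # 2/3
```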
Association Rule Mining

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of rules:
{Milk, Diaper} ⇒ {Beer}   (s=0.4, c=0.67)
{Milk, Beer} ⇒ {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} ⇒ {Milk}   (s=0.4, c=0.67)
{Beer} ⇒ {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} ⇒ {Milk, Beer}   (s=0.4, c=0.5)
{Milk} ⇒ {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  - support ≥ minsup threshold
  - confidence ≥ minconf threshold
• Rules that satisfy both minsup and minconf are called strong rules.
Two-step process

1. Find all frequent itemsets
   - Generate all itemsets whose support ≥ minsup
2. Generate strong association rules from the frequent itemsets
   - Generate strong rules from each frequent itemset. These rules must satisfy minsup and minconf.

The overall performance of mining association rules is determined by Step 1.
Frequent Itemset Generation

[Figure: itemset lattice over five items A–E, from the null set at the top through all 1-, 2-, 3-, and 4-itemsets down to ABCDE at the bottom.]

Given k items, there are 2^k possible candidate itemsets.
Reducing Number of Candidates

• The Apriori property:
  - Any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}

∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• Note that the support of an itemset never exceeds the support of its subsets
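The anti-monotone property can be verified numerically on the example transactions (a small sketch; `sup` is an illustrative helper, not from the slides):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sup(itemset):
    """Support count of an itemset over the transaction list."""
    return sum(1 for t in transactions if set(itemset) <= t)

# Support never increases as an itemset grows:
X = ("Beer", "Diaper")
for Y in [("Beer",), ("Diaper",)]:
    assert sup(Y) >= sup(X)  # subset support >= superset support

print(sup(("Beer",)), sup(("Diaper",)), sup(X))  # 3 4 3
```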
Apriori: A Candidate Generation-and-Test Approach

• Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested!
• Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated
Illustrating Apriori Pruning Principle

[Figure: the itemset lattice over A–E again; one itemset is found to be infrequent, so all of its supersets are pruned from the search space and never generated or tested.]
Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum support = 3

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  3
The Apriori Algorithm

• Pseudo-code:

Ck : candidate itemsets of size k
Lk : frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
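The pseudo-code above can be fleshed out into a small, unoptimized working version (a sketch under the slide's assumptions; the helper names are my own, and real implementations use smarter candidate joins and hash-tree counting):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (frozensets) with their support counts."""
    transactions = [set(t) for t in transactions]

    def count(candidates):
        # Count each candidate over the DB; keep those meeting min_support.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    Lk = count({frozenset([i]) for i in items})
    frequent = dict(Lk)

    while Lk:
        k = len(next(iter(Lk)))
        # Join step: (k+1)-candidates from pairs of frequent k-itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: drop candidates with an infrequent k-subset (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = count(candidates)
        frequent.update(Lk)

    return frequent

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(db, min_support=2)
print(result[frozenset({"B", "C", "E"})])  # 2
```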
The Apriori Algorithm: An Example

Database TDB (minimum support count = 2):

Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (candidates): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 counts:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3 (candidates): {B, C, E}

3rd scan → L3:
Itemset    sup
{B, C, E}  2
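The step from L2 to C3 in the trace above (join, then prune by the Apriori property) can be reproduced in a few lines (variable names are illustrative):

```python
from itertools import combinations

# Frequent 2-itemsets from the 2nd scan.
L2 = {frozenset(p) for p in [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]}

# Join: union pairs of frequent 2-itemsets into 3-itemset candidates.
joined = {a | b for a in L2 for b in L2 if len(a | b) == 3}
# Prune: drop any candidate with an infrequent 2-subset (Apriori property);
# {A,B,C} and {A,C,E} are dropped because {A,B} and {A,E} are not in L2.
C3 = {c for c in joined
      if all(frozenset(s) in L2 for s in combinations(c, 2))}

print(C3)  # {frozenset({'B', 'C', 'E'})}
```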
Generation of candidate itemsets and
frequent itemsets (where minsup count=2)
Generating Association Rules from Frequent Itemsets

• Given a frequent itemset L, find all non-empty proper subsets f ⊂ L such that f ⇒ L – f satisfies the minimum confidence requirement
• If {I1, I2, I5} is a frequent itemset:
  - The non-empty proper subsets of L are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}
  - The resulting association rules are:
    I1 ∧ I2 ⇒ I5,  confidence = 2/4 = 50%
    I1 ∧ I5 ⇒ I2,  confidence = 2/2 = 100%
    I2 ∧ I5 ⇒ I1,  confidence = 2/2 = 100%
    I1 ⇒ I2 ∧ I5,  confidence = 2/6 = 33%
    I2 ⇒ I1 ∧ I5,  confidence = 2/7 = 29%
    I5 ⇒ I1 ∧ I2,  confidence = 2/2 = 100%
• If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L ⇒ ∅ and ∅ ⇒ L)
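Rule generation from a single frequent itemset can be sketched against the market-basket transactions used earlier (a minimal sketch; `rules_from` and `sup` are my own names):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sup(itemset):
    """Support count of an itemset over the transaction list."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from(L, min_conf):
    """Yield f => L - f for every non-empty proper subset f of frequent itemset L."""
    L = frozenset(L)
    out = []
    for r in range(1, len(L)):
        for f in map(frozenset, combinations(L, r)):
            conf = sup(L) / sup(f)  # c(f => L - f) = sigma(L) / sigma(f)
            if conf >= min_conf:
                out.append((set(f), set(L - f), round(conf, 2)))
    return out

for lhs, rhs, conf in rules_from({"Milk", "Diaper", "Beer"}, min_conf=0.6):
    print(lhs, "=>", rhs, conf)
```

With min_conf = 0.6 this keeps the four rules from the earlier slide with c ≥ 0.6 and drops the two with c = 0.5.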
Mining Association Rules: An Example

• A subset of the Credit Card Promotion Database:

Magazine Promotion  Watch Promotion  Life Insurance Promotion  Credit Card Insurance  Sex
Yes                 No               No                        No                     Male
Yes                 Yes              Yes                       No                     Female
No                  No               No                        No                     Male
Yes                 Yes              Yes                       Yes                    Male
Yes                 No               Yes                       No                     Female
No                  No               No                        No                     Female
Yes                 No               Yes                       Yes                    Male
No                  Yes              No                        No                     Male
Yes                 No               No                        No                     Male
Yes                 Yes              Yes                       No                     Female
27
1-Itemsets
It
t
1-Itemsets
Number of Items
Magazine Promotion=Yes
7
Watch Promotion=Yes
4
Watch Promotion=No
6
Lif IInsurance P
Life
Promotion=Yes
ti
Y
5
Life Insurance Promotion=No
5
Credit Card Insurance=No
8
Sex=Male
6
Sex=Female
4
ITCS453 Data Warehousing and Data Mining (2/2009)
28
2-Itemsets

2-Itemset                                               Number of Items
Magazine Promotion=Yes & Watch Promotion=No             4
Magazine Promotion=Yes & Life Insurance Promotion=Yes   5
Magazine Promotion=Yes & Credit Card Insurance=No       5
Magazine Promotion=Yes & Sex=Male                       4
Watch Promotion=No & Life Insurance Promotion=No        4
Watch Promotion=No & Credit Card Insurance=No           5
Watch Promotion=No & Sex=Male                           4
Life Insurance Promotion=No & Credit Card Insurance=No  5
Life Insurance Promotion=No & Sex=Male                  4
Credit Card Insurance=No & Sex=Male                     4
Credit Card Insurance=No & Sex=Female                   4
29
3-Itemsets
It
t
3-Itemsets
Number of Items
Watch Promotion=No & Life Insurance Promotion=No & Credit
Card Insurance=No
4
ITCS453 Data Warehousing and Data Mining (2/2009)
30
Two possible 2-itemset rules are:

IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)

IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)
Three possible 3-itemset rules are:

IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)

IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)

IF Credit Card Insurance = No
THEN Watch Promotion = No & Life Insurance Promotion = No (4/8)