Data Mining: Association Rules
Dr. Engin YILDIZTEPE
Reference Books
- Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann Publishers.
- Larose, Daniel T. (2005). Discovering Knowledge in Data: An Introduction to Data Mining. New Jersey: John Wiley and Sons Ltd.
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar (2006). Introduction to Data Mining. Addison Wesley.
- Alpaydın, E. (2010). Introduction to Machine Learning. Second Ed. London: MIT Press.
Association Rules: Definition
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
- In general, association rule mining can be viewed as a two-step process:
  1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup.
  2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.
Association Rule Mining

Market-basket transactions:

  TID | Items
   1  | Bread, Milk
   2  | Bread, Diaper, Beer, Eggs
   3  | Milk, Diaper, Beer, Coke
   4  | Bread, Milk, Diaper, Beer
   5  | Bread, Milk, Diaper, Coke

Example of association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Data Representation for Transactions
- There are two principal methods of representing this type of market basket data: using either the transactional data format or the tabular data format.
Data Representation for Transactions
- The transactional data format requires only two fields, an ID field and a content field, with each record representing a single item only.

  Transaction ID | Items
        1        | Milk
        1        | Bread
        1        | Eggs
        2        | Bread
        2        | Sugar
        3        | Bread
        3        | Cereal
        4        | Milk
        4        | Bread
        4        | Sugar
        …        | …
Data Representation for Transactions
- In the tabular data format, each record represents a separate transaction, with as many 0/1 flag fields as there are items.

  Transaction ID | Milk | Bread | Eggs | Sugar | Cereal
        1        |  1   |   1   |  1   |   0   |   0
        2        |  0   |   1   |  0   |   1   |   0
        3        |  0   |   1   |  0   |   0   |   1
        4        |  1   |   1   |  0   |   1   |   0
        5        |  1   |   0   |  0   |   0   |   1
        6        |  0   |   1   |  0   |   0   |   1
        7        |  1   |   0   |  0   |   0   |   1
        8        |  1   |   1   |  1   |   0   |   1
        9        |  1   |   1   |  0   |   0   |   1
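The conversion between the two formats is mechanical; the sketch below illustrates it in Python over the transactional example above (the variable names are illustrative, not part of the lecture):

```python
# Sketch: converting the transactional format (one row per item)
# into the tabular 0/1 format. Item names follow the example table.
transactional = [
    (1, "Milk"), (1, "Bread"), (1, "Eggs"),
    (2, "Bread"), (2, "Sugar"),
    (3, "Bread"), (3, "Cereal"),
    (4, "Milk"), (4, "Bread"), (4, "Sugar"),
]

items = sorted({item for _, item in transactional})
tids = sorted({tid for tid, _ in transactional})
baskets = {tid: {i for t, i in transactional if t == tid} for tid in tids}

# One row per transaction, one 0/1 flag per item.
tabular = {tid: {item: int(item in baskets[tid]) for item in items}
           for tid in tids}

print(items)       # ['Bread', 'Cereal', 'Eggs', 'Milk', 'Sugar']
print(tabular[1])  # {'Bread': 1, 'Cereal': 0, 'Eggs': 1, 'Milk': 1, 'Sugar': 0}
```

The transactional format is more compact for sparse baskets; the tabular format makes per-item counting a simple column sum.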
Definition: Frequent Itemset
- Itemset
  - A collection of one or more items
  - Example: {Milk, Bread, Coke}
  - k-itemset: an itemset that contains k items
- Support count (σ)
  - Frequency of occurrence of an itemset
  - E.g. σ({Milk, Bread, Diaper}) = 2
- Support (s)
  - Fraction of transactions that contain an itemset
  - E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent itemset
  - An itemset whose support is greater than or equal to a minsup threshold

  TID | Items
   1  | Bread, Milk
   2  | Bread, Diaper, Beer, Eggs
   3  | Milk, Diaper, Beer, Coke
   4  | Bread, Milk, Diaper, Beer
   5  | Bread, Milk, Diaper, Coke
Definition: Association Rule
- Association rule
  - An implication expression of the form X → Y, where X and Y are itemsets
  - Example: {Milk, Diaper} → {Beer}
- Rule evaluation metrics
  - Support (s): fraction of transactions that contain both X and Y
  - Confidence (c): measures how often items in Y appear in transactions that contain X

  TID | Items
   1  | Bread, Milk
   2  | Bread, Diaper, Beer, Eggs
   3  | Milk, Diaper, Beer, Coke
   4  | Bread, Milk, Diaper, Beer
   5  | Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
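Both metrics fall out of simple subset tests over the transaction list; a minimal Python sketch over the five-transaction table above (the helper name `sigma` is illustrative):

```python
# Support and confidence of {Milk, Diaper} -> {Beer} over the
# market-basket table above. A sketch, not a library implementation.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y) / len(transactions)  # 2/5 = 0.4
c = sigma(X | Y) / sigma(X)           # 2/3 ≈ 0.67
print(round(s, 2), round(c, 2))       # 0.4 0.67
```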
Definition: Association Rule
- Based on the types of values, association rules can be classified into two categories: Boolean association rules and quantitative association rules.
- Boolean association rule:
    Keyboard → Mouse  [support = 6%, confidence = 70%]
- Quantitative association rule:
    (Age = 26…30) → (Cars = 1, 2)  [support = 3%, confidence = 36%]
Definition: Association Rule
- The support s for a particular association rule A ⇒ B is the proportion of transactions in D that contain both A and B. That is,
    support(A ⇒ B) = P(A ∪ B).
- The confidence c of the association rule A ⇒ B is a measure of the accuracy of the rule, as determined by the percentage of transactions in D containing A that also contain B. That is,
    confidence(A ⇒ B) = P(B | A) = support(A ∪ B) / support(A).
Mining Association Rules

  TID | Items
   1  | Bread, Milk
   2  | Bread, Diaper, Beer, Eggs
   3  | Milk, Diaper, Beer, Coke
   4  | Bread, Milk, Diaper, Beer
   5  | Bread, Milk, Diaper, Coke

Example of rules:
  {Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
  - support ≥ minsup threshold
  - confidence ≥ minconf threshold
- Brute-force approach:
  - List all possible association rules
  - Compute the support and confidence for each rule
  - Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!
Mining Association Rules
- Two-step approach:
  1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
  2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive
Frequent Itemset Generation

[Figure: the itemset lattice over five items A–E, from the null set at the top through all 1-, 2-, 3- and 4-itemsets down to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
- Brute-force approach:
  - Each itemset in the lattice is a candidate frequent itemset
  - Count the support of each candidate by scanning the database, matching each of the N transactions (of width w) against every one of the M candidates
  - Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!
- This is an instance of the curse of dimensionality.
Computational Complexity
- Given d unique items:
  - Total number of itemsets = 2^d
  - Total number of possible association rules:

    R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1

  - If d = 6, R = 602 rules
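The closed form can be checked against the double sum directly; a short Python sketch:

```python
from math import comb

def rule_count(d):
    """Sum over antecedent sizes k and consequent sizes j of
    C(d, k) * C(d - k, j), matching the double sum above."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

def closed_form(d):
    """The closed form 3^d - 2^(d+1) + 1."""
    return 3 ** d - 2 ** (d + 1) + 1

# The two expressions agree for small d.
for d in range(2, 10):
    assert rule_count(d) == closed_form(d)
print(rule_count(6))  # 602
```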
Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
  - Complete search: M = 2^d
  - Use pruning techniques to reduce M
- Reduce the number of transactions (N)
  - Reduce the size of N as the size of the itemset increases
  - Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
  - Use efficient data structures to store the candidates or transactions
  - No need to match every candidate against every transaction
Reducing Number of Candidates
- The Apriori algorithm (Agrawal and Srikant, 1994) is the most popular association rule algorithm.
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- The Apriori principle holds due to the following property of the support measure:

    ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

- The support of an itemset never exceeds the support of its subsets.
- This is known as the anti-monotone property of support.
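The anti-monotone property can be verified exhaustively on the small running example; a sketch (the brute-force check is only feasible because there are just 2^6 itemsets here):

```python
from itertools import combinations

# Exhaustive check of the anti-monotone property on the example
# market-basket table: for every X ⊆ Y, s(X) >= s(Y).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

def sup(itemset):
    """Support count of an itemset (given as any iterable of items)."""
    return sum(1 for t in transactions if set(itemset) <= t)

for r in range(1, len(items) + 1):
    for Y in combinations(items, r):          # every itemset Y
        for k in range(1, r):
            for X in combinations(Y, k):      # every proper subset X of Y
                assert sup(X) >= sup(Y)
print("anti-monotone property holds on all", 2 ** len(items), "itemsets")
```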
Illustrating the Apriori Principle

[Figure: the itemset lattice over items A–E. Once an itemset is found to be infrequent, all of its supersets are pruned from the lattice.]
Illustrating the Apriori Principle

Minimum support count = 3

Items (1-itemsets):
  Item   | Count
  Bread  |  4
  Coke   |  2
  Milk   |  4
  Beer   |  3
  Diaper |  4
  Eggs   |  1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
  Itemset         | Count
  {Bread, Milk}   |  3
  {Bread, Beer}   |  2
  {Bread, Diaper} |  3
  {Milk, Beer}    |  2
  {Milk, Diaper}  |  3
  {Beer, Diaper}  |  3

Triplets (3-itemsets):
  Itemset               | Count
  {Bread, Milk, Diaper} |  2

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
Apriori Algorithm
- Method:
  - Let k = 1
  - Generate frequent itemsets of length 1
  - Repeat until no new frequent itemsets are identified:
    - Generate length (k+1) candidate itemsets from length k frequent itemsets
    - Prune candidate itemsets containing subsets of length k that are infrequent
    - Count the support of each candidate by scanning the DB
    - Eliminate candidates that are infrequent, leaving only those that are frequent
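The method above can be sketched as a minimal Apriori loop in Python — an illustrative implementation over the running five-transaction example, not production code:

```python
from itertools import combinations

transactions = [frozenset(t) for t in [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]]
min_sup = 3  # minimum support count

def sup(itemset):
    """Support count: scan the DB and count containing transactions."""
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets
items = sorted(set().union(*transactions))
levels = [{frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup}]

k = 1
while levels[-1]:
    prev = levels[-1]
    # Join: build (k+1)-candidates from pairs of frequent k-itemsets
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    # Prune: a candidate with any infrequent k-subset cannot be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k))}
    # Count support by scanning the DB; keep only frequent candidates
    levels.append({c for c in candidates if sup(c) >= min_sup})
    k += 1

for level in levels[:-1]:
    for fs in sorted(level, key=sorted):
        print(sorted(fs), sup(fs))
```

With min_sup = 3 this reproduces the pruning illustration: four frequent items, four frequent pairs, and no frequent triplet, since {Bread, Milk, Diaper} has support count 2.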
Generating Association Rules from Frequent Itemsets
- Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence).
Generating Association Rules from Frequent Itemsets
- Association rules can be generated as follows:
  - For each frequent itemset l, generate all nonempty subsets of l.
  - For every nonempty subset s of l, output the rule s → (l − s) if confidence ≥ minimum confidence threshold.
- Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support.
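That subset procedure is short to express in code; a sketch over the running example (the min_conf value here is illustrative):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sup(itemset):
    return sum(1 for t in transactions if itemset <= t)

l = frozenset({"Bread", "Milk", "Diaper"})
min_conf = 0.6  # illustrative threshold

# Every nonempty proper subset s of l yields a candidate rule s -> (l - s);
# support is inherited from l, only confidence = sup(l)/sup(s) is checked.
rules = []
for k in range(1, len(l)):
    for s in map(frozenset, combinations(l, k)):
        conf = sup(l) / sup(s)
        if conf >= min_conf:
            rules.append((sorted(s), sorted(l - s), round(conf, 2)))

for antecedent, consequent, conf in rules:
    print(antecedent, "->", consequent, conf)
```

With min_conf = 0.6 only the three two-antecedent rules survive (confidence 2/3 ≈ 0.67); the one-antecedent rules have confidence 2/4 = 0.5.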
Generating Association Rules from Frequent Itemsets
- Continue previous example

L3 itemset: {Bread, Milk, Diaper}, count = 2

The nonempty subsets of L3 are {Bread}, {Milk}, {Diaper}, {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}

  TID | Items
   1  | Bread, Milk
   2  | Bread, Diaper, Beer, Eggs
   3  | Milk, Diaper, Beer, Coke
   4  | Bread, Milk, Diaper, Beer
   5  | Bread, Milk, Diaper, Coke

Association rules:
  {Milk, Diaper} → {Bread}  (s=0.4, c=0.67)
  {Bread, Diaper} → {Milk}  (s=0.4, c=0.67)
  {Bread, Milk} → {Diaper}  (s=0.4, c=0.67)
  {Diaper} → {Milk, Bread}  (s=0.4, c=0.5)
  {Milk} → {Bread, Diaper}  (s=0.4, c=0.5)
  {Bread} → {Diaper, Milk}  (s=0.4, c=0.5)
Example 1
- A subset of a market basket database is shown in the following table:

  T ID | Bread | Milk | Cheese | Cleaner
    1  |  Yes  | Yes  |  Yes   |  Yes
    2  |  Yes  | No   |  No    |  No
    3  |  Yes  | Yes  |  Yes   |  No
    4  |  No   | No   |  No    |  No
    5  |  Yes  | No   |  No    |  No
    6  |  Yes  | No   |  Yes   |  Yes
    7  |  No   | No   |  No    |  No
    8  |  Yes  | Yes  |  Yes   |  No
    9  |  Yes  | No   |  Yes   |  No
   10  |  No   | Yes  |  No    |  No

- Find the 1-itemsets and 2-itemsets.
Example 1

Single items (1-itemsets):
  Itemset | Count
  Bread   |  7
  Milk    |  4
  Cheese  |  5
  Cleaner |  2

Pair items (2-itemsets):
  Itemset          | Count
  Bread & Milk     |  3
  Bread & Cheese   |  5
  Bread & Cleaner  |  2
  Milk & Cheese    |  3
  Milk & Cleaner   |  1
  Cheese & Cleaner |  2
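The counts in Example 1 can be cross-checked with a few lines of Python, encoding each "Yes" row entry as set membership:

```python
from itertools import combinations

# Example 1 table: each basket holds the items marked "Yes".
baskets = {
    1: {"Bread", "Milk", "Cheese", "Cleaner"},
    2: {"Bread"},
    3: {"Bread", "Milk", "Cheese"},
    4: set(),
    5: {"Bread"},
    6: {"Bread", "Cheese", "Cleaner"},
    7: set(),
    8: {"Bread", "Milk", "Cheese"},
    9: {"Bread", "Cheese"},
    10: {"Milk"},
}

def count(itemset):
    """Number of baskets containing every item in the itemset."""
    return sum(1 for b in baskets.values() if set(itemset) <= b)

items = ["Bread", "Milk", "Cheese", "Cleaner"]
for i in items:                       # 1-itemsets
    print(i, count([i]))
for pair in combinations(items, 2):   # 2-itemsets
    print(" & ".join(pair), count(pair))
```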
Example 2
- Consider the following transactions. Develop association rules using the apriori algorithm. Let φ = 3 (minimum support threshold).
- Using 75% minimum confidence and 20% minimum support, generate one-antecedent and two-antecedent association rules.

Ck : candidate itemset of size k
Lk : frequent itemset of size k

[Figures: the transaction table and the C1/L1 and C2/L2 passes appear only as slide images in the transcript.]
Example 2

C3:
  Itemset   | Support count
  {A, B, C} |  1
  {A, B, D} |  1
  {A, B, F} |  2
  {A, D, F} |  3
  {B, C, D} |  1
  {B, C, F} |  1
  {B, D, F} |  1

Only {A, D, F} meets the minimum support count φ = 3, so L3 = {{A, D, F}}.
Example 2
- The frequent itemset {A, D, F} has the nonempty proper subsets {A}, {D}, {F}, {A, D}, {A, F}, {D, F}.
- The generated association rules are:
  - A ⇒ F & D      confidence = 3 / 8 = 37.5 %
  - F ⇒ A & D      confidence = 3 / 4 = 75 %
  - D ⇒ F & A      confidence = 3 / 4 = 75 %
  - A & D ⇒ F      confidence = 3 / 3 = 100 %
  - A & F ⇒ D      confidence = 3 / 4 = 75 %
  - D & F ⇒ A      confidence = 3 / 3 = 100 %
- At the 75% minimum confidence threshold, every rule except A ⇒ F & D is strong.