Data Mining
Association Rules
Dr. Engin YILDIZTEPE
Reference Books
- Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann Publishers.
- Larose, Daniel T. (2005). Discovering Knowledge In Data – An Introduction to Data Mining. New Jersey: John Wiley and Sons Ltd.
- Tan, Pang-Ning, Steinbach, Michael, Kumar, Vipin (2006). Introduction to Data Mining. Addison Wesley.
- Alpaydın, E. (2010). Introduction to Machine Learning. Second Ed. London: MIT Press.
Association Rules: Definition
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
- In general, association rule mining can be viewed as a two-step process:
  1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup.
  2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.
Association Rule Mining

Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Data representation for transactions
- There are two principal methods of representing this type of market-basket data: using either the transactional data format or the tabular data format.
Data representation for transactions
- The transactional data format requires only two fields, an ID field and a content field, with each record representing a single item only.

Transaction ID  Item
1               Milk
1               Bread
1               Eggs
2               Bread
2               Sugar
3               Bread
3               Cereal
4               Milk
4               Bread
4               Sugar
…               …
Data representation for transactions
- In the tabular data format, each record represents a separate transaction, with as many 0/1 flag fields as there are items.

Transaction ID  Milk  Bread  Eggs  Sugar  Cereal
1               1     1      1     0      0
2               0     1      0     1      0
3               0     1      0     0      1
4               1     1      0     1      0
5               1     0      0     0      1
6               0     1      0     0      1
7               1     0      0     0      1
8               1     1      1     0      1
9               1     1      0     0      1
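The conversion between the two formats can be sketched in a few lines of Python. This is a minimal illustration (not from the slides), using the first four transactions of the transactional table above; the trailing "…" rows are omitted as in the source.

```python
from collections import defaultdict

# Transactional format: one (transaction ID, item) record per row,
# taken from the table above.
records = [
    (1, "Milk"), (1, "Bread"), (1, "Eggs"),
    (2, "Bread"), (2, "Sugar"),
    (3, "Bread"), (3, "Cereal"),
    (4, "Milk"), (4, "Bread"), (4, "Sugar"),
]
items = ["Milk", "Bread", "Eggs", "Sugar", "Cereal"]

# Group the items of each transaction together.
baskets = defaultdict(set)
for tid, item in records:
    baskets[tid].add(item)

# Tabular format: one row per transaction, one 0/1 flag per item.
tabular = {tid: [1 if it in basket else 0 for it in items]
           for tid, basket in sorted(baskets.items())}

for tid, flags in tabular.items():
    print(tid, flags)
```

The same grouping step is what most association-rule libraries perform internally before counting supports.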
Definition: Frequent Itemset
- Itemset: a collection of one or more items. Example: {Milk, Bread, Coke}
- k-itemset: an itemset that contains k items
- Support count (σ): frequency of occurrence of an itemset. E.g. σ({Milk, Bread, Beer}) = 2
- Support (s): fraction of transactions that contain an itemset. E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Association Rule
- Association Rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}
- Rule Evaluation Metrics:
  - Support (s): fraction of transactions that contain both X and Y
  - Confidence (c): measures how often items in Y appear in transactions that contain X
- Example (using the market-basket transactions above): {Milk, Diaper} ⇒ {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
Definition: Association Rule
- Based on the types of values, association rules can be classified into two categories: Boolean Association Rules and Quantitative Association Rules.
- Boolean Association Rule:
  Keyboard ⇒ Mouse [support = 6%, confidence = 70%]
- Quantitative Association Rule:
  (Age = 26…30) ⇒ (Cars = 1, 2) [support = 3%, confidence = 36%]
Definition: Association Rule
- The support s for a particular association rule A ⇒ B is the proportion of transactions in D that contain both A and B.
- The confidence c of the association rule A ⇒ B is a measure of the accuracy of the rule, as determined by the percentage of transactions in D containing A that also contain B.
Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
  - support ≥ minsup threshold
  - confidence ≥ minconf threshold
- Brute-force approach:
  - List all possible association rules
  - Compute the support and confidence for each rule
  - Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!
Mining Association Rules
- Two-step approach:
  1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
  2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive
Frequent Itemset Generation

[Figure: itemset lattice for items A–E, from the empty set (null) at the top through the 1-itemsets A…E, 2-itemsets AB…DE, and so on, down to ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
- Brute-force approach:
  - Each itemset in the lattice is a candidate frequent itemset
  - Count the support of each candidate by scanning the database (the market-basket transactions above)
  - Match each transaction against every candidate
  - Complexity ~ O(NMw) ⇒ expensive, since M = 2^d!
- This is an instance of the curse of dimensionality.
Computational Complexity
- Given d unique items:
  - Total number of itemsets = 2^d
  - Total number of possible association rules:

    R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

  - If d = 6, R = 602 rules
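The rule-count formula can be checked numerically. This sketch (not from the slides) evaluates both the double sum and the closed form:

```python
from math import comb

def total_rules(d):
    """Number of possible rules over d items: sum over antecedent
    size k and consequent size j of C(d,k) * C(d-k,j)."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

def total_rules_closed(d):
    """Closed form from the slide: 3^d - 2^(d+1) + 1."""
    return 3**d - 2**(d + 1) + 1

print(total_rules(6), total_rules_closed(6))  # 602 602
```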
Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
  - Complete search: M = 2^d
  - Use pruning techniques to reduce M
- Reduce the number of transactions (N)
  - Reduce the size of N as the size of the itemset increases
  - Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
  - Use efficient data structures to store the candidates or transactions
  - No need to match every candidate against every transaction
Reducing Number of Candidates
- The Apriori algorithm (Agrawal and Srikant, 1994) is the most popular association rule algorithm.
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- The Apriori principle holds due to the following property of the support measure:

  ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

- The support of an itemset never exceeds the support of its subsets.
- This is known as the anti-monotone property of support.
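The anti-monotone property can be verified exhaustively on a small database. This sketch (not from the slides) checks s(X) ≥ s(Y) for every subset pair X ⊆ Y over the five market-basket transactions:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))

# Check: for every X subset of Y, sigma(X) >= sigma(Y).
holds = True
for k in range(1, len(items) + 1):
    for Y in combinations(items, k):
        for r in range(1, k):
            for X in combinations(Y, r):
                if support_count(set(X)) < support_count(set(Y)):
                    holds = False

print("anti-monotone property holds:", holds)
```

This is exactly why Apriori may discard every superset of an infrequent itemset without counting it.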
Illustrating Apriori Principle

[Figure: the same itemset lattice for items A–E; once an itemset is found to be infrequent, all of its supersets are pruned from the lattice.]
Illustrating Apriori Principle

Minimum Support Count = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  2

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.
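The counts in the tables above can be reproduced programmatically. This sketch (not from the slides) generates the frequent 1-itemsets and then builds pair candidates only from them, as the pruning argument describes:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
min_count = 3  # minimum support count from the slide

def count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

items = sorted(set().union(*transactions))

# Frequent 1-itemsets: Coke (2) and Eggs (1) fall below the threshold.
L1 = [i for i in items if count({i}) >= min_count]

# Candidate pairs are built only from frequent single items.
C2 = [frozenset(p) for p in combinations(L1, 2)]
L2 = [c for c in C2 if count(c) >= min_count]

print(sorted(L1))
print(sorted(tuple(sorted(c)) for c in L2))
```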
Apriori Algorithm
- Method:
  - Let k = 1
  - Generate frequent itemsets of length 1
  - Repeat until no new frequent itemsets are identified:
    - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
    - Prune candidate itemsets containing subsets of length k that are infrequent
    - Count the support of each candidate by scanning the DB
    - Eliminate candidates that are infrequent, leaving only those that are frequent
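The loop above can be sketched as a small self-contained function. This is an illustrative implementation, not the authors' code; the function name `apriori` and its return shape are my own choices:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return all frequent itemsets (frozensets) with their support counts."""
    transactions = [set(t) for t in transactions]

    def count(c):
        return sum(1 for t in transactions if c <= t)

    items = sorted(set().union(*transactions))
    # k = 1: frequent single items.
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) >= min_count}
    frequent = {c: count(c) for c in Lk}
    k = 1
    while Lk:
        # Generate length-(k+1) candidates from length-k frequent itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune candidates with an infrequent length-k subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count supports by scanning the DB; eliminate infrequent candidates.
        Lk = {c for c in candidates if count(c) >= min_count}
        frequent.update({c: count(c) for c in Lk})
        k += 1
    return frequent

freq = apriori([
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
], min_count=3)

for itemset, n in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With min_count = 3 this finds the four frequent single items and the four frequent pairs from the previous slide; {Bread, Milk, Diaper} (count 2) is counted but eliminated, so no frequent triplet survives.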
Generating Association Rules from Frequent Itemsets
- Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence).
Generating Association Rules from Frequent Itemsets
- Association rules can be generated as follows:
  - For each frequent itemset l, generate all nonempty subsets of l.
  - For every nonempty subset s of l, output the rule s → (l − s) if confidence ≥ minimum confidence threshold.
- Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support.
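The subset-enumeration procedure can be sketched directly. This illustration (not from the slides) applies it to the frequent itemset {Bread, Milk, Diaper} over the five market-basket transactions; the 0.6 confidence threshold is an assumed value chosen so that the c = 0.67 rules survive:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(l, min_conf):
    """For every nonempty proper subset s of l, output s -> (l - s)
    when confidence = sigma(l) / sigma(s) meets the threshold."""
    out = []
    l = frozenset(l)
    for r in range(1, len(l)):
        for s in combinations(sorted(l), r):
            s = frozenset(s)
            conf = count(l) / count(s)
            if conf >= min_conf:
                out.append((set(s), set(l - s), round(conf, 2)))
    return out

for antecedent, consequent, conf in rules_from_itemset({"Bread", "Milk", "Diaper"}, 0.6):
    print(sorted(antecedent), "->", sorted(consequent), "confidence", conf)
```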
Generating Association Rules from Frequent Itemsets
- Continuing the previous example:

Itemset L3             Count
{Bread, Milk, Diaper}  2

The nonempty subsets of L3 are {Bread}, {Milk}, {Diaper}, {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Association Rules:
{Milk, Diaper} → {Bread} (s=0.4, c=0.67)
{Bread, Diaper} → {Milk} (s=0.4, c=0.67)
{Bread, Milk} → {Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Bread} (s=0.4, c=0.5)
{Milk} → {Bread, Diaper} (s=0.4, c=0.5)
{Bread} → {Diaper, Milk} (s=0.4, c=0.5)
Example 1
- A subset of a market basket database is shown in the following table:

T ID  Bread  Milk  Cheese  Cleaner
1     Yes    Yes   Yes     Yes
2     Yes    No    No      No
3     Yes    Yes   Yes     No
4     No     No    No      No
5     Yes    No    No      No
6     Yes    No    Yes     Yes
7     No     No    No      No
8     Yes    Yes   Yes     No
9     Yes    No    Yes     No
10    No     Yes   No      No

- Find 1-itemsets and 2-itemsets.
Example 1

Single Items (1-itemsets):
Itemset  Count
Bread    7
Milk     4
Cheese   5
Cleaner  2

Pair Items (2-itemsets):
Itemset           Count
Bread & Milk      3
Bread & Cheese    5
Bread & Cleaner   2
Milk & Cheese     3
Milk & Cleaner    1
Cheese & Cleaner  2
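These counts follow directly from the Yes/No table. The sketch below (not from the slides) recomputes both tables from the ten transactions of Example 1:

```python
from itertools import combinations

# The Yes/No table from Example 1, one row per transaction:
# (Bread, Milk, Cheese, Cleaner)
rows = [
    ("Yes", "Yes", "Yes", "Yes"),
    ("Yes", "No",  "No",  "No"),
    ("Yes", "Yes", "Yes", "No"),
    ("No",  "No",  "No",  "No"),
    ("Yes", "No",  "No",  "No"),
    ("Yes", "No",  "Yes", "Yes"),
    ("No",  "No",  "No",  "No"),
    ("Yes", "Yes", "Yes", "No"),
    ("Yes", "No",  "Yes", "No"),
    ("No",  "Yes", "No",  "No"),
]
items = ["Bread", "Milk", "Cheese", "Cleaner"]
baskets = [{it for it, v in zip(items, row) if v == "Yes"} for row in rows]

# 1-itemset and 2-itemset support counts.
singles = {it: sum(1 for b in baskets if it in b) for it in items}
pairs = {p: sum(1 for b in baskets if set(p) <= b)
         for p in combinations(items, 2)}

print(singles)
print(pairs)
```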
Example 2
- Consider the following transactions. Develop association rules using the Apriori algorithm. Let φ = 3 (minimum support threshold).
- Using 75% minimum confidence and 20% minimum support, generate one-antecedent and two-antecedent association rules.

Example 2
- Ck : candidate itemset of size k
- Lk : frequent itemset of size k
Example 2

C3:
Item-Set    # of Items
{A, B, C}   1
{A, B, D}   1
{A, B, F}   2
{A, D, F}   3
{B, C, D}   1
{B, C, F}   1
{B, D, F}   1
Example 2
- The frequent itemset {A, D, F} has subsets ({A}, {D}, {F}, {A, D}, {A, F}, {D, F})
- The generated association rules are:
  - A ⇒ F & D, confidence = 3 / 8 = 37.5 %
  - F ⇒ A & D, confidence = 3 / 4 = 75 %
  - D ⇒ F & A, confidence = 3 / 4 = 75 %
  - A & D ⇒ F, confidence = 3 / 3 = 100 %
  - A & F ⇒ D, confidence = 3 / 4 = 75 %
  - D & F ⇒ A, confidence = 3 / 3 = 100 %
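These confidences can be recomputed from the support counts alone. The transaction table of Example 2 is not reproduced in the transcript, so the counts below are inferred from the stated confidences (e.g. σ(A) = 8, since conf(A ⇒ F & D) = 3/8 = 37.5%); treat them as an assumption of this sketch:

```python
# Support counts inferred from the confidence values above (assumption:
# sigma(A)=8, sigma(D)=4, sigma(F)=4, and sigma of each pair as listed).
sigma = {
    frozenset("A"): 8,
    frozenset("D"): 4,
    frozenset("F"): 4,
    frozenset("AD"): 3,
    frozenset("AF"): 4,
    frozenset("DF"): 3,
    frozenset("ADF"): 3,
}

l = frozenset("ADF")
for s in sorted(sigma, key=lambda x: (len(x), sorted(x))):
    if s == l:
        continue
    conf = sigma[l] / sigma[s]
    print("".join(sorted(s)), "=>", "".join(sorted(l - s)),
          f"confidence = {sigma[l]}/{sigma[s]} = {conf:.1%}")
```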