Fast Algorithms for Mining Association Rules
Rakesh Agrawal
Ramakrishnan Srikant
Data Mining Seminar 2003
Outline
- Introduction
- Formal statement
- Apriori Algorithm
- AprioriTid Algorithm
- Comparison
- AprioriHybrid Algorithm
- Conclusions

©Ofer Pasternak

Introduction
- Bar-code technology
- Mining association rules over basket data (1993)
- Example rule: tires ∧ accessories → automotive service
- Applications: cross-marketing, attached mailings
- Very large databases

Notation
- Items: I = {i1, i2, …, im}
- Transaction T: a set of items, T ⊆ I
- Items in a transaction are kept sorted lexicographically
- TID: a unique identifier for each transaction

Notation
- Association rule: X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅

Confidence and Support
- An association rule X → Y has confidence c if c% of the transactions in D that contain X also contain Y.
- An association rule X → Y has support s if s% of the transactions in D contain X ∪ Y.

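The two definitions above can be sketched directly in Python (a minimal sketch; the representation of the database as a list of sets and the function names are mine, not the paper's):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

# The four-transaction example database used later in the deck.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(support({2, 5}, transactions))       # 0.75 (3 of 4 transactions)
print(confidence({2}, {5}, transactions))  # 1.0 (every transaction with 2 has 5)
```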
Notice
- X → A does not imply X ∪ Y → A: the larger rule may not have minimum support.
- X → A and A → Z do not imply X → Z: the composed rule may not have minimum confidence.

Define the Problem
Given a set of transactions D, generate all association rules that have support and confidence greater than the user-specified minimum support and minimum confidence.

Previous Algorithms
- AIS
- SETM
- Knowledge discovery
- Induction of classification rules
- Discovery of causal rules
- Fitting of functions to data
- KID3 (machine learning)

Discovering all Association Rules
- Find all large itemsets: itemsets with support above the minimum support.
- Use the large itemsets to generate the rules.

General idea
- Say ABCD and AB are large itemsets.
- Compute conf = support(ABCD) / support(AB).
- If conf >= minconf, the rule AB → CD holds.

Discovering Large Itemsets
- Multiple passes over the data.
- First pass: count the support of individual items.
- Each subsequent pass:
  - Generate candidates using the previous pass's large itemsets.
  - Go over the data and check the actual support of the candidates.
- Stop when no new large itemsets are found.

The Trick
Any subset of a large itemset is large. Therefore, to find the large k-itemsets:
- Create candidates by combining large (k-1)-itemsets.
- Delete those that contain any subset that is not large.

Algorithm Apriori

    L1 = {large 1-itemsets};                      // count item occurrences
    for (k = 2; Lk-1 ≠ ∅; k++) do begin
        Ck = apriori-gen(Lk-1);                   // generate new candidate k-itemsets
        forall transactions t ∈ D do begin        // find the support of all the candidates
            Ct = subset(Ck, t);
            forall candidates c ∈ Ct do
                c.count++;
        end
        Lk = {c ∈ Ck | c.count >= minsup};        // take only those with support over minsup
    end
    Answer = ∪k Lk;

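The loop above can be sketched in Python. This is a minimal illustration, not the paper's implementation: apriori-gen is inlined in simplified form, and the hash-tree subset() function is replaced by a direct containment test.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset (sorted tuple): support count} for all large itemsets."""
    items = sorted({i for t in transactions for i in t})
    counts = {(i,): sum(1 for t in transactions if i in t) for i in items}
    large = {c: n for c, n in counts.items() if n >= minsup}
    answer = dict(large)
    k = 2
    while large:
        prev = sorted(large)
        # Join: combine two large (k-1)-itemsets identical in their first k-2 items.
        cand = {p + (q[-1],) for p in prev for q in prev
                if p[:-1] == q[:-1] and p[-1] < q[-1]}
        # Prune: every (k-1)-subset of a candidate must itself be large.
        cand = {c for c in cand
                if all(s in large for s in combinations(c, k - 1))}
        # One pass over the data to count candidate occurrences.
        counts = {c: sum(1 for t in transactions if set(c) <= t) for c in cand}
        large = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(large)
        k += 1
    return answer

# The deck's example database, minsup = 2 (as an absolute count).
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(transactions, minsup=2))
```

On this database the result contains the large itemsets shown later in the deck, up to {2 3 5} with support 2.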
Candidate generation

Join step: p and q are two large (k-1)-itemsets identical in their first k-2 items; join them by adding the last item of q to p.

    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Prune step: check all the subsets; remove any candidate with a "small" subset.

    forall itemsets c ∈ Ck do
        forall (k-1)-subsets s of c do
            if (s ∉ Lk-1) then
                delete c from Ck

Example
L3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }
After joining: { {1 2 3 4}, {1 3 4 5} }
After pruning: { {1 2 3 4} }
({1 3 4 5} is pruned because {1 4 5} and {3 4 5} are not in L3.)

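The join/prune example above can be checked with a standalone apriori-gen sketch (itemsets as sorted tuples; the function name mirrors the pseudocode but the Python form is mine):

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Join large (k-1)-itemsets agreeing on their first k-2 items,
    then prune candidates having a (k-1)-subset that is not large."""
    k = len(next(iter(large_prev))) + 1
    joined = {p + (q[-1],) for p in large_prev for q in large_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in large_prev for s in combinations(c, k - 1))}

L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_gen(L3))  # {(1, 2, 3, 4)}; (1,3,4,5) is pruned since (1,4,5) is not in L3
```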
Correctness
- Show that Ck ⊇ Lk: any subset of a large itemset must also be large.
- The join is equivalent to extending Lk-1 with all items and then removing those candidates whose (k-1)-subsets are not in Lk-1.
- The condition p.itemk-1 < q.itemk-1 prevents duplicates.

Subset Function
- The candidate itemsets Ck are stored in a hash-tree.
- It finds in O(k) time whether a candidate itemset of size k is contained in transaction t.
- Total time: O(max(k, size(t))).

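The paper stores Ck in a hash-tree; the sketch below uses a simpler prefix trie (my simplification, not the paper's exact data structure) that supports the same subset(Ck, t) operation: find every candidate contained in a transaction without testing each candidate against the transaction separately.

```python
END = object()  # leaf marker: the node completes a candidate

def build_trie(candidates):
    """Build a prefix trie over candidates given as sorted tuples."""
    root = {}
    for c in candidates:
        node = root
        for item in c:
            node = node.setdefault(item, {})
        node[END] = c
    return root

def subset(node, items, start=0):
    """Yield every candidate in the trie contained in the sorted tuple `items`."""
    if END in node:
        yield node[END]
    for i in range(start, len(items)):
        child = node.get(items[i])
        if child is not None:
            # Only items after position i can extend the match (itemsets are sorted).
            yield from subset(child, items, i + 1)

C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
trie = build_trie(C2)
print(sorted(subset(trie, (2, 3, 5))))  # [(2, 3), (2, 5), (3, 5)]
```

This matches the C^2 entry for transaction 200 in the AprioriTid example later in the deck.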
Problem?
Every pass of Apriori goes over the whole database.

Algorithm AprioriTid
- Uses the database only once.
- Builds a storage set C^k whose members have the form <TID, {Xk}>, where the Xk are potentially large k-itemsets in transaction TID.
- For k = 1, C^1 is the database, with each item replaced by an itemset of size 1.
- Pass k+1 uses C^k instead of the database.

Advantage
- C^k can be smaller than the database: if a transaction does not contain any candidate k-itemset, it is excluded from C^k.
- For large k, each entry may be smaller than the corresponding transaction, since the transaction might contain only a few candidates.

Disadvantage
- For small k, each entry may be larger than the corresponding transaction: an entry includes all candidate k-itemsets contained in the transaction.

Algorithm AprioriTid

    L1 = {large 1-itemsets};                      // count item occurrences
    C^1 = database D;                             // the storage set is initialized with the database
    for (k = 2; Lk-1 ≠ ∅; k++) do begin
        Ck = apriori-gen(Lk-1);                   // generate new candidate k-itemsets
        C^k = ∅;                                  // build a new storage set
        forall entries t ∈ C^k-1 do begin
            // determine the candidate itemsets contained in transaction t.TID
            Ct = {c ∈ Ck | (c - c[k]) ∈ t.set-of-itemsets ∧ (c - c[k-1]) ∈ t.set-of-itemsets};
            forall candidates c ∈ Ct do           // find the support of all the candidates
                c.count++;
            if (Ct ≠ ∅) then C^k += <t.TID, Ct>;  // remove empty entries
        end
        Lk = {c ∈ Ck | c.count >= minsup};        // take only those with support over minsup
    end
    Answer = ∪k Lk;

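The pass structure above can be sketched in Python: after the first pass, counting scans the storage set (the <TID, set-of-itemsets> entries) instead of the raw database. As before, apriori-gen is inlined in simplified form, so this shows the logic rather than the paper's implementation.

```python
from itertools import combinations

def apriori_tid(transactions, minsup):
    """Return {itemset (sorted tuple): support count} for all large itemsets."""
    # Pass 1: C^1 is the database, each item replaced by a 1-itemset.
    storage = [(tid, {(i,) for i in t}) for tid, t in enumerate(transactions)]
    counts = {}
    for _, entry in storage:
        for c in entry:
            counts[c] = counts.get(c, 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup}
    answer = dict(large)
    k = 2
    while large:
        prev = sorted(large)
        cand = {p + (q[-1],) for p in prev for q in prev
                if p[:-1] == q[:-1] and p[-1] < q[-1]}
        cand = {c for c in cand
                if all(s in large for s in combinations(c, k - 1))}
        new_storage, counts = [], {c: 0 for c in cand}
        for tid, entry in storage:
            # c is in the transaction iff both generating (k-1)-subsets,
            # c - c[k] and c - c[k-1], appear in the transaction's entry.
            ct = {c for c in cand
                  if c[:-1] in entry and c[:-2] + (c[-1],) in entry}
            for c in ct:
                counts[c] += 1
            if ct:                       # remove empty entries
                new_storage.append((tid, ct))
        storage = new_storage
        large = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(large)
        k += 1
    return answer

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori_tid(transactions, minsup=2))
```

Running it on the deck's example database reproduces the C^k entries and large itemsets shown on the next slide.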
Example (minsup = 2)

Database:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

C^1:
TID | Set-of-itemsets
100 | { {1}, {3}, {4} }
200 | { {2}, {3}, {5} }
300 | { {1}, {2}, {3}, {5} }
400 | { {2}, {5} }

L1:
Itemset | Support
{1}     | 2
{2}     | 3
{3}     | 3
{5}     | 3

C2: { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5} }

C^2:
TID | Set-of-itemsets
100 | { {1 3} }
200 | { {2 3}, {2 5}, {3 5} }
300 | { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5} }
400 | { {2 5} }

L2:
Itemset | Support
{1 3}   | 2
{2 3}   | 2
{2 5}   | 3
{3 5}   | 2

C3: { {2 3 5} }

C^3:
TID | Set-of-itemsets
200 | { {2 3 5} }
300 | { {2 3 5} }

L3:
Itemset | Support
{2 3 5} | 2

Correctness
Show that the set Ct generated in the k-th pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID.

Correctness
- C^k is correct if, for each entry t of C^k, t.set-of-itemsets does not include any k-itemset not contained in the transaction with identifier t.TID.
- C^k is complete if, for each entry t of C^k, t.set-of-itemsets includes all large k-itemsets contained in the transaction with identifier t.TID.
- Lk is correct if it is the same as the set of all large k-itemsets.

Lemma 1
For k > 1, if C^k-1 is correct and complete, and Lk-1 is correct, then the set Ct generated at the k-th pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID.

Proof
Suppose a candidate itemset c = c[1]·c[2]·…·c[k] is in the transaction with identifier t.TID. Then c1 = (c - c[k]) and c2 = (c - c[k-1]) were in that transaction. Since Ck was built using apriori-gen(Lk-1), all (k-1)-subsets of c ∈ Ck must be large, so c1 and c2 must be large. Since C^k-1 is complete, c1 and c2 were members of t.set-of-itemsets, and therefore c will be a member of Ct.

Proof (continued)
Conversely, suppose c1 (or c2) is not in the transaction with identifier t.TID. Since C^k-1 is correct, c1 (or c2) is not in t.set-of-itemsets; then c ∈ Ck is not contained in transaction t.TID, and c will not be a member of Ct.

Correctness
Lemma 2
For k > 1, if Lk-1 is correct and the set Ct generated in the k-th step is the same as the set of candidate k-itemsets in Ck contained in transaction t.TID, then the set C^k is correct and complete.

Proof
- Completeness: apriori-gen guarantees Ck ⊇ Lk, so Ct includes all large k-itemsets in t.TID, and these are added to C^k. Hence C^k is complete.
- Correctness: Ct includes only itemsets contained in t.TID, and only itemsets in Ct are added to C^k. Hence C^k is correct.

Correctness
Theorem 1
For k > 1, the set Ct generated in the k-th pass is the same as the set of candidate k-itemsets in Ck contained in transaction t.TID.
To show: C^k is correct and complete, and Lk is correct, for all k >= 1.

Proof (by induction on k)
- k = 1: C^1 is the database, so it is correct and complete.
- Assume the claim holds for k = n. By Lemma 1, the set Ct generated in pass n+1 consists of exactly those itemsets in Cn+1 contained in transaction t.TID. Apriori-gen guarantees Cn+1 ⊇ Ln+1, and Ct is correct, so Ln+1 is correct. By Lemma 2, C^n+1 is then correct and complete.
- Hence C^k is correct and complete for all k >= 1, and the theorem holds.

General idea (reminder)
- Say ABCD and AB are large itemsets.
- Compute conf = support(ABCD) / support(AB).
- If conf >= minconf, the rule AB → CD holds.

Discovering Rules
For every large itemset l:
- Find all non-empty subsets of l.
- For every subset a, produce the rule a → (l - a) and accept it if support(l) / support(a) >= minconf.

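The step above can be sketched directly: enumerate every non-empty proper subset a of each large itemset l and keep a → (l - a) when its confidence clears minconf. Here `supports` maps each large itemset (as a sorted tuple) to its support count, as produced by Apriori; the naive enumeration is mine, before the deck's DFS optimizations.

```python
from itertools import combinations

def gen_rules(supports, minconf):
    """Return (antecedent, consequent, confidence) triples for all accepted rules."""
    rules = []
    for l in supports:
        if len(l) < 2:
            continue
        for m in range(1, len(l)):            # all non-empty proper subset sizes
            for a in combinations(l, m):
                conf = supports[l] / supports[a]
                if conf >= minconf:
                    consequent = tuple(i for i in l if i not in a)
                    rules.append((a, consequent, conf))
    return rules

# Large itemsets and support counts from the deck's example (minsup = 2).
supports = {(1,): 2, (2,): 3, (3,): 3, (5,): 3,
            (1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2, (2, 3, 5): 2}
for a, c, conf in gen_rules(supports, minconf=1.0):
    print(a, "->", c, conf)
```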
Checking the subsets
For efficiency, generate the subsets using recursive DFS: if a subset a does not produce a rule, we do not need to check subsets of a.
Example: given the itemset ABCD, if ABC → D does not have enough confidence, then surely AB → CD will not hold.

Why?
For any subset â of a:
support(â) >= support(a), so
confidence(â → (l - â)) = support(l) / support(â) <= support(l) / support(a) = confidence(a → (l - a)).

Simple Algorithm

    forall large itemsets lk, k >= 2 do           // check all the large itemsets
        genrules(lk, lk);

    procedure genrules(lk: large k-itemset, am: large m-itemset)
        A = {(m-1)-itemsets am-1 | am-1 ⊂ am};    // check all the subsets
        forall am-1 ∈ A do begin
            conf = support(lk) / support(am-1);   // check the confidence of the new rule
            if (conf >= minconf) then begin
                output the rule am-1 → (lk - am-1);
                if (m - 1 > 1) then
                    call genrules(lk, am-1);      // continue the DFS over the subsets
            end
            // if there is no confidence, the DFS branch is cut here
        end

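A Python sketch of genrules above (names mine): a recursive DFS over subsets of the large itemset, cutting a branch as soon as an antecedent fails minconf, since subsets of a failing antecedent can only give lower confidence. As in the pseudocode, the same subset can be reached along several DFS paths for larger itemsets.

```python
def genrules(lk, am, supports, minconf, out):
    """DFS over (m-1)-subsets of am, emitting rules a -> (lk - a) into out."""
    for drop in am:
        a = tuple(i for i in am if i != drop)     # an (m-1)-subset of am
        if not a:
            continue
        conf = supports[lk] / supports[a]
        if conf >= minconf:
            out.append((a, tuple(i for i in lk if i not in a), conf))
            if len(a) > 1:
                genrules(lk, a, supports, minconf, out)  # continue the DFS
        # else: branch cut, no recursion below a failing antecedent

supports = {(2,): 3, (3,): 3, (5,): 3,
            (2, 3): 2, (2, 5): 3, (3, 5): 2, (2, 3, 5): 2}
rules = []
genrules((2, 3, 5), (2, 3, 5), supports, minconf=1.0, out=rules)
print(rules)  # [((3, 5), (2,), 1.0), ((2, 3), (5,), 1.0)]
```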
Faster Algorithm
Idea: if (l - c) → c holds, then all the rules (l - ĉ) → ĉ must hold, where ĉ is any non-empty subset of c.
Example: if AB → CD holds, then so do ABC → D and ABD → C.

Faster Algorithm
- From a large itemset l, generate all rules with one item in the consequent.
- Use those consequents and apriori-gen to generate all possible 2-item consequents, and so on.
- The candidate set of the faster algorithm is a subset of the candidate set of the simple algorithm.

Faster Algorithm

    forall large k-itemsets lk, k >= 2 do begin
        // find all 1-item consequents (using one pass of the simple algorithm)
        H1 = {consequents of rules derived from lk with one item in the consequent};
        call ap-genrules(lk, H1);
    end

    procedure ap-genrules(lk: large k-itemset, Hm: set of m-item consequents)
        if (k > m + 1) then begin
            Hm+1 = apriori-gen(Hm);               // generate new (m+1)-item consequents
            forall hm+1 ∈ Hm+1 do begin
                conf = support(lk) / support(lk - hm+1);   // check the confidence of the new rule
                if (conf >= minconf) then
                    output the rule (lk - hm+1) → hm+1
                        with confidence = conf and support = support(lk);
                else
                    delete hm+1 from Hm+1;        // if a consequent doesn't hold, don't look for bigger ones
            end
            call ap-genrules(lk, Hm+1);           // continue with bigger consequents
        end

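A sketch of the level-wise scheme above (function names mine): 1-item consequents seed H1, apriori-gen over the m-item consequents proposes (m+1)-item consequents, and a failing consequent is deleted so no superset of it is ever tried.

```python
from itertools import combinations

def apriori_gen(hm):
    """Join m-itemsets agreeing on their first m-1 items, then prune."""
    k = len(next(iter(hm))) + 1
    joined = {p + (q[-1],) for p in hm for q in hm
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined if all(s in hm for s in combinations(c, k - 1))}

def rules_fast(lk, supports, minconf):
    """All rules from large itemset lk, growing consequents level-wise."""
    out, hm = [], set()
    for i in lk:                                  # 1-item consequents
        ante = tuple(j for j in lk if j != i)
        conf = supports[lk] / supports[ante]
        if conf >= minconf:
            out.append((ante, (i,), conf))
            hm.add((i,))
    while hm and len(lk) > len(next(iter(hm))) + 1:   # k > m + 1
        hm1 = apriori_gen(hm)
        for h in list(hm1):
            ante = tuple(i for i in lk if i not in h)
            conf = supports[lk] / supports[ante]
            if conf >= minconf:
                out.append((ante, h, conf))
            else:
                hm1.discard(h)                    # cut all supersets of h
        hm = hm1
    return out

supports = {(2,): 3, (3,): 3, (5,): 3,
            (2, 3): 2, (2, 5): 3, (3, 5): 2, (2, 3, 5): 2}
print(rules_fast((2, 3, 5), supports, minconf=1.0))
```

On this example the 2-item consequent {2 5} is proposed but rejected, so only the two 1-item-consequent rules survive, matching the simple algorithm's output.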
Advantage
Example: for the large itemset ABCDE, suppose the one-item-consequent rules are ACDE → B and ABCE → D.
- The simple algorithm will check ABC → DE, ABE → CD, BCE → AD and ACE → BD.
- The faster algorithm will check only ACE → BD, which is also the only rule that holds.

Example
- Simple algorithm, from the large itemset ABCDE: the rules that hold are ACDE → B and ABCE → D, and the DFS under them also tries CDE → AB, ADE → BC, ACD → BE, ACE → BD and BCE → AD.
- Faster algorithm, from ABCDE: the one-item consequents B and D are joined into BD, so beyond the one-item rules only ACE → BD is tried.

Results
- Compare the performance of Apriori and AprioriTid to each other and to the previously known algorithms AIS and SETM.
- Both AIS and SETM generate candidates "on the fly"; SETM was designed for use over SQL.
- The algorithms differ in the method of generating all large itemsets.

Method
Check the algorithms on the same databases:
- Synthetic data
- Real data

Synthetic Data
- Choose the parameters to be compared.
- Transaction sizes and large-itemset sizes are each clustered around a mean.
- Parameters for data generation:
  - D – number of transactions
  - T – average size of a transaction
  - I – average size of the maximal potentially large itemsets
  - L – number of maximal potentially large itemsets
  - N – number of items

Synthetic Data
Experiment values: N = 1000, L = 2000, and the datasets
- T5.I2.D100k
- T10.I2.D100k
- T10.I4.D100k
- T20.I2.D100k
- T20.I4.D100k
- T20.I6.D100k
(For example, T5.I2.D100k means T = 5, I = 2, D = 100,000.)

[Performance graphs on the synthetic datasets]
- SETM's values are too big to fit the graphs.
- Apriori always beats AIS.
- Apriori is better than AprioriTid on large problems.

Explaining the Results
- AprioriTid uses C^k instead of the database. If C^k fits in memory, AprioriTid is faster than Apriori.
- When C^k is too big to fit in memory, the computation time is much longer, so Apriori is faster than AprioriTid.

Reality Check
Retail sales:
- 63 departments
- 46,873 transactions (average size 2.47)
- A small database: C^k fits in memory.

Reality Check
Mail order:
- 15,836 items
- 2.9 million transactions (average size 2.62)

Mail customer:
- 15,836 items
- 213,972 transactions (average size 31)

So who is better?
Look at the passes: at the final stages, C^k is small enough to fit in memory.

Algorithm AprioriHybrid
- Use Apriori in the initial passes.
- Estimate the size of C^k as Σ_{candidates c ∈ Ck} support(c) + number of transactions.
- Switch to AprioriTid when C^k is expected to fit in memory.
- The switch takes time, but it is still better in most cases.

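The switch heuristic above is a one-liner once the candidate counts are available; the sketch below (names and the example numbers are illustrative) estimates the number of entries the next storage set would hold, to be compared against available memory.

```python
def estimated_ck_size(candidate_counts, num_transactions):
    """Estimated size of C^k: one entry per candidate occurrence in the
    data (the sum of the candidates' support counts), plus one TID per
    transaction that survives into the storage set."""
    return sum(candidate_counts.values()) + num_transactions

# Support counts of the C2 candidates that were large in the deck's example.
counts = {(1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2}
print(estimated_ck_size(counts, 4))  # 13
```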
[Performance graphs: AprioriHybrid vs. Apriori and AprioriTid]

Scale-up experiment
[Scale-up graphs]

Conclusions
- The Apriori algorithms are better than the previous algorithms: for small problems by factors, for large problems by orders of magnitude.
- The algorithms are best combined (AprioriHybrid).
- The algorithm shows good results in scale-up experiments.

Summary
- Association rules are an important tool in analyzing databases.
- We have seen an algorithm that finds all association rules in a database.
- The algorithm has better time results than previous algorithms.
- The algorithm maintains its performance for large databases.

End