
Fast Algorithms for Mining
Association Rules
Rakesh Agrawal
Ramakrishnan Srikant
Slides from Ofer Pasternak
Data Mining Seminar 2003
Introduction
• Bar-code technology
• Mining association rules over basket data (1993)
• Tires ∧ accessories ⇒ automotive service
• Cross-marketing, attached mailing
• Very large databases

©Ofer Pasternak
Notation
• Items – I = {i_1, i_2, …, i_m}
• Transaction – a set of items T ⊆ I
  – Items are sorted lexicographically
• TID – unique identifier for each transaction
Notation

• Association rule – X ⇒ Y
  – X ⊂ I, Y ⊂ I and X ∩ Y = ∅
Confidence and Support


• Association rule X ⇒ Y has confidence c if c% of the transactions in D that contain X also contain Y.
• Association rule X ⇒ Y has support s if s% of the transactions in D contain X ∪ Y (that is, both X and Y).
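The two definitions can be checked on a toy database. The following is a minimal sketch, not code from the slides: the function names `support` and `confidence` and the basket data `D` are hypothetical, and both measures are returned as fractions rather than percentages.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y.

    Assumes X occurs in at least one transaction (otherwise this divides by zero).
    """
    x, y = frozenset(x), frozenset(y)
    return support(x | y, transactions) / support(x, transactions)


# Hypothetical basket data, for illustration only.
D = [
    frozenset({"tires", "accessories", "service"}),
    frozenset({"tires", "accessories"}),
    frozenset({"tires", "service"}),
    frozenset({"accessories"}),
]

X, Y = {"tires", "accessories"}, {"service"}
print(support(X | Y, D))    # 0.25: one of four transactions contains X and Y
print(confidence(X, Y, D))  # 0.5: half of the transactions with X also contain Y
```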
Define the Problem
Given a set of transactions D, generate
all association rules that have support
and confidence greater than the
user-specified minimum support and
minimum confidence.
Discovering all Association Rules
• Find all large itemsets
  – itemsets with support above minimum support.
• Use the large itemsets to generate the rules.
General idea
• Say ABCD and AB are large itemsets
• Compute
  conf = support(ABCD) / support(AB)
• If conf ≥ minconf
  AB ⇒ CD holds.
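The same computation can be applied to every large itemset and every split of it into antecedent and consequent. The sketch below is illustrative only: `generate_rules` is a hypothetical name, the support counts are made up, and the input dictionary is assumed to contain every large itemset (so each antecedent can be looked up).

```python
from itertools import combinations


def generate_rules(supports, minconf):
    """Return (antecedent, consequent, confidence) for rules meeting minconf.

    `supports` maps each large itemset (a frozenset) to its support count.
    """
    rules = []
    for itemset, sup in supports.items():
        # Try every non-empty proper subset of the itemset as antecedent.
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                conf = sup / supports[a]
                if conf >= minconf:
                    rules.append((a, itemset - a, conf))
    return rules


# Hypothetical support counts, for illustration only.
supports = {
    frozenset({"A"}): 4,
    frozenset({"B"}): 3,
    frozenset({"A", "B"}): 3,
}
rules = generate_rules(supports, minconf=0.8)
# A => B has confidence 3/4 and is rejected; B => A has confidence 3/3 and is kept.
```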
Discovering Large Itemsets
• Multiple passes over the data
• First pass – count the support of individual items.
• Subsequent passes
  – Generate candidates using the previous pass's large itemsets.
  – Go over the data and check the actual support of the candidates.
• Stop when no new large itemsets are found.
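The first pass is the simplest one to write down. A minimal sketch follows, assuming support is given as an absolute count; the function name `large_1_itemsets` and the four transactions are chosen here for illustration.

```python
from collections import Counter


def large_1_itemsets(transactions, minsup_count):
    """First pass: count each item once per transaction, keep the frequent ones."""
    counts = Counter(item for t in transactions for item in set(t))
    return {frozenset([item]): n for item, n in counts.items() if n >= minsup_count}


# Illustrative transactions; a minimum support count of 2 is an assumption.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
L1 = large_1_itemsets(D, minsup_count=2)
# Item 4 occurs in only one transaction, so {4} is not large.
```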
The Trick
Any subset of a large itemset is large.
Therefore, to find the large k-itemsets:
– Create candidates by combining large (k-1)-itemsets.
– Delete those that contain any subset that is not large.
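The property holds because every transaction that contains an itemset also contains each of its subsets, so a subset can never have lower support. A small check on illustrative data (the helper names are hypothetical):

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))


def subsets_at_least_as_frequent(itemset, transactions):
    """True if every non-empty proper subset has support >= that of `itemset`."""
    items = sorted(itemset)
    whole = support_count(items, transactions)
    return all(
        support_count(sub, transactions) >= whole
        for r in range(1, len(items))
        for sub in combinations(items, r)
    )


# Illustrative transactions.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(support_count({2, 3, 5}, D))                 # 2
print(subsets_at_least_as_frequent({2, 3, 5}, D))  # True
```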
Algorithm Apriori
L_1 ← {large 1-itemsets};                        // Count item occurrences
for (k = 2; L_{k-1} ≠ ∅; k++) do begin
    C_k ← apriori-gen(L_{k-1});                  // Generate new k-itemset candidates
    forall transactions t ∈ D do begin           // Find the support of all the candidates
        C_t ← subset(C_k, t);
        forall candidates c ∈ C_t do
            c.count++;
    end
    L_k ← {c ∈ C_k | c.count ≥ minsup}           // Take only those with support over minsup
end
Answer ← ∪_k L_k;
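The loop translates fairly directly into Python. The sketch below makes several assumptions that are not in the slides: itemsets are sorted tuples, minsup is an absolute count, candidate generation is folded into a compact helper, and the subset function is replaced by a naive scan over all candidates instead of a hash-tree.

```python
from itertools import combinations


def apriori_gen(prev_large):
    """Join (k-1)-itemsets that share their first k-2 items, then prune."""
    prev = sorted(prev_large)
    prev_set = set(prev)
    candidates = set()
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Keep c only if all of its (k-1)-subsets are large.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, minsup_count):
    """Return a dict mapping every large itemset (sorted tuple) to its count."""
    transactions = [frozenset(t) for t in transactions]

    # First pass: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup_count}
    answer = dict(large)

    # Subsequent passes: generate candidates, then count them against the data.
    while large:
        candidates = apriori_gen(large)
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:  # naive stand-in for subset(C_k, t)
                if t.issuperset(c):
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup_count}
        answer.update(large)
    return answer


# Illustrative transactions and threshold.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, minsup_count=2)
```

On this data the loop stops after the third pass, with (2, 3, 5) as the only large 3-itemset.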
Candidate generation

Join step
    insert into C_k
    select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1};
– p and q are two large (k-1)-itemsets identical in their first k-2 items.
– Join by adding the last item of q to p.

Prune step
    forall itemsets c ∈ C_k do
        forall (k-1)-subsets s of c do
            if (s ∉ L_{k-1}) then
                delete c from C_k;
– Check all the subsets; remove a candidate with a "small" subset.
Example
L_3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }
After joining
{ {1 2 3 4}, {1 3 4 5} }
After pruning
{ {1 2 3 4} }
{1 3 4 5} is removed because {1 4 5} and {3 4 5} are not in L_3.
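The join and prune steps can be replayed on this example. This is a sketch under the assumption that itemsets are represented as sorted tuples; `join_step` and `prune_step` are names chosen here so that the intermediate result can be inspected.

```python
from itertools import combinations


def join_step(prev_large):
    """Join pairs that agree on all but the last item; p.last < q.last avoids duplicates."""
    prev = sorted(prev_large)
    return [
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]


def prune_step(candidates, prev_large):
    """Drop every candidate that has a (k-1)-subset outside the previous large itemsets."""
    prev = set(prev_large)
    return [
        c for c in candidates
        if all(s in prev for s in combinations(c, len(c) - 1))
    ]


L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
joined = join_step(L3)
pruned = prune_step(joined, L3)
print(joined)  # [(1, 2, 3, 4), (1, 3, 4, 5)]
print(pruned)  # [(1, 2, 3, 4)]
```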
Correctness
Show that C_k ⊇ L_k
• Any subset of a large itemset must also be large.
• The join is equivalent to extending L_{k-1} with all items and removing those whose (k-1)-subsets are not in L_{k-1}:
    insert into C_k
    select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1};
• The condition p.item_{k-1} < q.item_{k-1} prevents duplications.
• The prune step deletes only itemsets that cannot be large:
    forall itemsets c ∈ C_k do
        forall (k-1)-subsets s of c do
            if (s ∉ L_{k-1}) then
                delete c from C_k;
Subset Function
• Candidate itemsets C_k are stored in a hash-tree.
• Finds in O(k) time whether a candidate itemset of size k is contained in transaction t.
• Total time O(max(k, size(t)))

L_1 ← {large 1-itemsets};
for (k = 2; L_{k-1} ≠ ∅; k++) do begin
    C_k ← apriori-gen(L_{k-1});
    forall transactions t ∈ D do begin
        C_t ← subset(C_k, t);
        forall candidates c ∈ C_t do
            c.count++;
    end
    L_k ← {c ∈ C_k | c.count ≥ minsup}
end
Answer ← ∪_k L_k;
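A simplified hash-tree can illustrate how the subset function narrows the search. The sketch below is an assumption-laden illustration, not the authors' implementation: the class names, the `buckets` and `leaf_capacity` parameters, and the use of Python's built-in `hash` are all choices made here, and the containment test at a leaf uses a plain set. Interior nodes at depth d hash on the d-th item of a candidate; leaves hold candidate lists and are split when they grow past `leaf_capacity`.

```python
class Node:
    def __init__(self):
        self.is_leaf = True
        self.itemsets = []   # candidates stored here while the node is a leaf
        self.children = {}   # bucket number -> Node once the node is interior


class HashTree:
    def __init__(self, k, buckets=3, leaf_capacity=2):
        self.k = k
        self.buckets = buckets
        self.leaf_capacity = leaf_capacity
        self.root = Node()

    def _bucket(self, item):
        return hash(item) % self.buckets

    def insert(self, itemset):
        """Insert a candidate given as a sorted tuple of length k."""
        node, depth = self.root, 0
        while not node.is_leaf:
            node = node.children.setdefault(self._bucket(itemset[depth]), Node())
            depth += 1
        node.itemsets.append(itemset)
        # Split an overfull leaf, unless every item has already been hashed on.
        if len(node.itemsets) > self.leaf_capacity and depth < self.k:
            stored, node.itemsets, node.is_leaf = node.itemsets, [], False
            for s in stored:
                child = node.children.setdefault(self._bucket(s[depth]), Node())
                child.itemsets.append(s)

    def subset(self, transaction):
        """Return the set of stored candidates contained in `transaction`."""
        t = tuple(sorted(transaction))
        t_set = set(t)
        found = set()

        def visit(node, start):
            if node.is_leaf:
                found.update(s for s in node.itemsets if t_set.issuperset(s))
                return
            # Hash on each remaining item of the transaction in turn.
            for i in range(start, len(t)):
                child = node.children.get(self._bucket(t[i]))
                if child is not None:
                    visit(child, i + 1)

        visit(self.root, 0)
        return found


# Illustrative candidate 3-itemsets.
candidates = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4), (2, 4, 5), (3, 4, 5)]
tree = HashTree(k=3)
for c in candidates:
    tree.insert(c)
print(sorted(tree.subset({1, 2, 3, 5})))  # [(1, 2, 3), (1, 3, 5)]
```

Only the leaves reachable by hashing items of the transaction are examined, so candidates that start with an item missing from the transaction are skipped unless they happen to share a bucket.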