Mining Generalized Association Rules
Ramakrishnan Srikant
Rakesh Agrawal
Data Mining Seminar, spring semester, 2003
Prof. Amos Fiat
Student: Idit Haran
Outline
• Motivation
• Terms & Definitions
• Interest Measure
• Algorithms for mining generalized association rules
• Comparison
• Conclusions
Motivation
• Find association rules of the form:
  Diapers → Beer
• Different kinds of diapers:
  Huggies/Pampers, S/M/L, etc.
• Different kinds of beers:
  Heineken/Maccabi, in a bottle/in a can, etc.
• The information on the bar-code is of the form:
  Huggies Diapers, M → Heineken Beer in a bottle
• A rule at this bar-code level of detail is not interesting, and will
  probably not have minimum support.
Taxonomy
• is-a hierarchies, for example:

  Clothes                    Footwear
   ├─ Outwear                 ├─ Shoes
   │   ├─ Jackets             └─ Hiking Boots
   │   └─ Ski Pants
   └─ Shirts
Taxonomy - Example
• Say we found the rule
  Outwear → Hiking Boots
  with minimum support and confidence.
• The rule
  Jackets → Hiking Boots
  may not have minimum support.
• The rule
  Clothes → Hiking Boots
  may not have minimum confidence.
Taxonomy
• Users are interested in generating rules that span
  different levels of the taxonomy.
• Rules at lower levels may not have minimum support.
• The taxonomy can be used to prune uninteresting or
  redundant rules.
• Multiple taxonomies may be present,
  for example: category, price (cheap/expensive),
  "items-on-sale", etc.
• Multiple taxonomies may be modeled as a forest, or
  a DAG.
Notations
• The taxonomy is a graph whose edges represent is_a relationships:
  p is the parent of c1 and c2; c1 and c2 are children of p.
• Ancestors (marked with ^) and descendants of an item are reached by
  following the is_a edges upward and downward, respectively.
Notations
• I = {i1, i2, …, im} - the set of items.
• T - a transaction, a set of items T ⊆ I
  (we expect the items in T to be leaves of the taxonomy).
• D - the set of transactions.
• T supports an item x if x is in T or x is an
  ancestor of some item in T.
• T supports X ⊆ I if it supports every item in X.
Notations
• A generalized association rule: X → Y,
  where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in
  Y is an ancestor of any item in X.
• The rule X → Y has confidence c in D if c% of the
  transactions in D that support X also support Y.
• The rule X → Y has support s in D if s% of the
  transactions in D support X ∪ Y.
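To make the support and confidence definitions concrete, here is a minimal Python sketch (illustrative only, not from the paper; the child → parent dictionary, the helper names and the variable db are my assumptions, with item names taken from the running example in these slides). A transaction supports a generalized itemset once it has been extended with the ancestors of its items:

# Minimal sketch: support and confidence of a generalized rule X -> Y.
# The taxonomy is assumed to be a child -> parent dictionary (is_a edges).

taxonomy = {
    "Jacket": "Outwear", "Ski Pants": "Outwear",
    "Outwear": "Clothes", "Shirt": "Clothes",
    "Shoes": "Footwear", "Hiking Boots": "Footwear",
}

def ancestors(item):
    """All ancestors of an item, following the is_a edges upward."""
    result = set()
    while item in taxonomy:
        item = taxonomy[item]
        result.add(item)
    return result

def extend(transaction):
    """The transaction plus the ancestors of every item in it."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item)
    return extended

def support_and_confidence(database, x, y):
    """Support and confidence of the generalized rule X -> Y over the database."""
    x, y = set(x), set(y)
    n_x = n_xy = 0
    for t in database:
        t_ext = extend(t)
        if x <= t_ext:
            n_x += 1
            if y <= t_ext:
                n_xy += 1
    return n_xy / len(database), (n_xy / n_x if n_x else 0.0)

db = [{"Shirt"}, {"Jacket", "Hiking Boots"}, {"Ski Pants", "Hiking Boots"},
      {"Shoes"}, {"Shoes"}, {"Jacket"}]
print(support_and_confidence(db, {"Outwear"}, {"Hiking Boots"}))  # about (0.33, 0.67)

On the example database used a few slides below, this reproduces the rule Outwear → Hiking Boots with 33% support and 66.6% confidence.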
Problem Statement
• To find all generalized association rules that have support and
  confidence greater than the user-specified minimum support (called
  minsup) and minimum confidence (called minconf), respectively.
Example
• Recall the taxonomy:

  Clothes                    Footwear
   ├─ Outwear                 ├─ Shoes
   │   ├─ Jackets             └─ Hiking Boots
   │   └─ Ski Pants
   └─ Shirts
Frequent Itemsets - Example
(minsup = 30%, minconf = 60%)

Database D
  Transaction   Items Bought
  100           Shirt
  200           Jacket, Hiking Boots
  300           Ski Pants, Hiking Boots
  400           Shoes
  500           Shoes
  600           Jacket

Frequent Itemsets
  Itemset                    Support
  {Jacket}                   2
  {Outwear}                  3
  {Clothes}                  4
  {Shoes}                    2
  {Hiking Boots}             2
  {Footwear}                 4
  {Clothes, Hiking Boots}    2
  {Outwear, Footwear}        2
  {Clothes, Footwear}        2
  {Outwear, Hiking Boots}    2

Rules
  Rule                        Support   Confidence
  Outwear → Hiking Boots      33%       66.6%
  Outwear → Footwear          33%       66.6%
  Hiking Boots → Outwear      33%       100%
  Hiking Boots → Clothes      33%       100%
Observation 1
• If the set {x, y} has minimum support,
  so do {x^, y}, {x, y^} and {x^, y^}.
• For example:
  if {Jacket, Shoes} has minsup, so will
  {Outwear, Shoes}, {Jacket, Footwear}, and
  {Outwear, Footwear}.
Observation 2
• If the rule x → y has minimum support and
  confidence, only x → y^ is guaranteed to have both
  minsup and minconf.
• The rule Outwear → Hiking Boots has minsup and minconf.
• Hence, the rule Outwear → Footwear also has both minsup and minconf.
Observation 2 – cont.
• However, while the rules x^ → y and x^ → y^ will have
  minsup, they may not have minconf.
• For example:
  the rules Clothes → Hiking Boots and Clothes → Footwear
  have minsup, but not minconf.
Interesting Rules – Previous Work
• A rule X → Y is not interesting if:
  support(X → Y) ≈ support(X) · support(Y)
• Previous work does not consider the taxonomy.
• The previous interest measure pruned less than 1% of the rules
  on a real database.
Interesting Rules – Using the Taxonomy
• Milk → Cereal (8% support, 70% conf)
• Milk is the parent of Skim Milk, and 25% of the sales
  of Milk are Skim Milk.
• We therefore expect:
  Skim Milk → Cereal
  to have 2% support and 70% confidence.
R-Interesting Rules
• A rule X → Y is R-interesting w.r.t. an ancestor rule X^ → Y^ if:
  real support(X → Y) > R · expected support of (X → Y) based on (X^ → Y^)
  or
  real confidence(X → Y) > R · expected confidence of (X → Y) based on (X^ → Y^)
• With R = 1.1, about 40-55% of the rules were pruned.
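A minimal sketch of the R-interest test (illustrative; the function names and the single-item specialization are my simplification, following the Milk / Skim Milk example above):

# Minimal sketch of the R-interest test. The expected support of a rule is
# its ancestor rule's support scaled by support(item)/support(ancestor item)
# for each item that was specialized; the expected confidence scales only
# with the specialized items of the consequent.

def expected_support(ancestor_support, specialization_fractions):
    exp = ancestor_support
    for frac in specialization_fractions:
        exp *= frac
    return exp

def is_r_interesting(real_sup, real_conf, exp_sup, exp_conf, R=1.1):
    return real_sup > R * exp_sup or real_conf > R * exp_conf

# Milk -> Cereal: 8% support, 70% confidence; Skim Milk is 25% of Milk sales.
exp_sup = expected_support(0.08, [0.25])   # expected 2% support for Skim Milk -> Cereal
exp_conf = 0.70                            # consequent unchanged, so confidence is unchanged
# Suppose Skim Milk -> Cereal actually has 2.1% support and 70% confidence:
print(is_r_interesting(0.021, 0.70, exp_sup, exp_conf))   # False - close to what we expected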
Problem Statement (new)
• To find all generalized R-interesting association rules
  (R is a user-specified minimum interest, called min-interest)
  that have support and confidence greater than minsup and
  minconf, respectively.
Algorithms – 3 steps
1. Find all itemsets whose support is greater than minsup.
   These itemsets are called frequent itemsets.
2. Use the frequent itemsets to generate the desired rules:
   if ABCD and AB are frequent, then
   conf(AB → CD) = support(ABCD) / support(AB)
3. Prune all uninteresting rules from this set.

* All presented algorithms implement only step 1.
Algorithms (step 1)
• Input: Database, Taxonomy
• Output: All frequent itemsets
• 3 algorithms (same output, different run-time):
  Basic, Cumulate, EstMerge
Algorithm Basic – Main Idea
• Is itemset X frequent?
• Does transaction T support X?
  (X may contain items from different levels of the taxonomy,
  while T contains only leaves)
• Let T' = T + ancestors(T);
• Answer: T supports X  ⇔  X ⊆ T'
Algorithm Basic
L1 = {frequent 1-itemsets};                  // count item occurrences
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);                  // generate new candidate k-itemsets
    forall transactions t ∈ D do begin
        t = add-ancestors(t, T);             // add all ancestors of each item in t to t,
                                             // removing any duplicates
        Ct = subset(Ck, t);                  // candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;                       // find the support of all the candidates
    end
    Lk = {c ∈ Ck | c.count ≥ minsup};        // take only those with support over minsup
end
Answer = ∪k Lk;
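A minimal Python sketch (illustrative, not the paper's code) of Basic's counting pass, reusing extend() and the example database db from the earlier sketch:

# Count, for each candidate itemset, how many transactions support it,
# after extending every transaction with the ancestors of its items.

def count_candidates(database, candidates):
    counts = {frozenset(c): 0 for c in candidates}
    for t in database:
        t_ext = extend(t)              # t' = t + ancestors(t)
        for c in counts:
            if c <= t_ext:             # candidate contained in the extended transaction
                counts[c] += 1
    return counts

c2 = [{"Clothes", "Hiking Boots"}, {"Outwear", "Footwear"}]
print(count_candidates(db, c2))        # both candidates are supported by 2 transactions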
Candidate generation
• Join step: p and q are two frequent (k-1)-itemsets that are identical
  in their first k-2 items; join them by adding the last item of q to p:

  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

• Prune step: check all the (k-1)-subsets, and remove any candidate
  with an infrequent ("small") subset:

  forall itemsets c ∈ Ck do
      forall (k-1)-subsets s of c do
          if (s ∉ Lk-1) then
              delete c from Ck
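A minimal Python sketch of apriori-gen (illustrative; representing itemsets as sorted tuples is my choice, not dictated by the paper):

# Join frequent (k-1)-itemsets that share their first k-2 items, then prune
# any candidate that has a (k-1)-subset which is not frequent.

from itertools import combinations

def apriori_gen(l_prev, k):
    """l_prev: list of frequent (k-1)-itemsets, each a sorted tuple of items."""
    l_set = set(l_prev)
    candidates = []
    # Join step: p and q agree on the first k-2 items; append q's last item to p.
    for p in l_prev:
        for q in l_prev:
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                candidates.append(p + (q[k - 2],))
    # Prune step: drop candidates with an infrequent (k-1)-subset.
    return [c for c in candidates
            if all(s in l_set for s in combinations(c, k - 1))]

l2 = [("Clothes", "Footwear"), ("Clothes", "Hiking Boots"), ("Footwear", "Hiking Boots")]
print(apriori_gen(l2, 3))   # [('Clothes', 'Footwear', 'Hiking Boots')]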
Optimization 1
Filtering the ancestors added to transactions
• We only need to add to a transaction t the ancestors
  that appear in one of the candidates.
• If the original item is not in any itemset, it can
  be dropped from the transaction.
• Example:
  Candidates: {Clothes, Shoes}.
  Transaction t: {Jacket, …}
  can be replaced with {Clothes, …}.
Optimization 2
Pre-computing ancestors
• Rather than finding the ancestors of each item by traversing
  the taxonomy graph, we can pre-compute the ancestors for each item.
• We can drop ancestors that are not contained in any of the
  candidates at the same time.
Optimization 3
Pruning itemsets containing an item and its ancestor
• If we have {Jacket} and {Outwear}, we will have the candidate
  {Jacket, Outwear}, which is not interesting:
  support({Jacket}) = support({Jacket, Outwear}).
• Deleting {Jacket, Outwear} at k=2 ensures it will not appear for
  k>2 (because of the prune step of the candidate-generation method).
• Therefore, we need to prune candidates containing an item and its
  ancestor only for k=2; in the following passes no candidate will
  include an item together with its ancestor.
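A minimal sketch of this pruning step (illustrative; it reuses the taxonomy dictionary and the ancestors() helper from the earlier sketch):

# Delete any candidate 2-itemset that pairs an item with one of its own
# ancestors, since its support equals the support of the item alone.

def contains_item_and_ancestor(itemset):
    return any(a in itemset for item in itemset for a in ancestors(item))

def prune_item_ancestor_pairs(c2):
    """Applied only at k = 2; the apriori-gen prune step then keeps such
    pairs from ever reappearing inside larger candidates."""
    return [c for c in c2 if not contains_item_and_ancestor(c)]

c2 = [{"Jacket", "Outwear"}, {"Jacket", "Shoes"}, {"Outwear", "Footwear"}]
print(prune_item_ancestor_pairs(c2))   # keeps {Jacket, Shoes} and {Outwear, Footwear}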
Algorithm Cumulate
Compute T*, the set of ancestors of each item, from T   // Optimization 2: pre-compute ancestors
L1 = {frequent 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);
    if (k = 2) then prune(C2);               // Optimization 3: delete any candidate in C2 that
                                             // consists of an item and its ancestor
    T* = remove-unnecessary(T*, Ck);         // Optimization 1: delete any ancestor in T* that is
                                             // not present in any of the candidates in Ck
    forall transactions t ∈ D do begin
        t = add-ancestors(t, T*);            // Optimization 2: for each item x ∈ t, add all
                                             // ancestors of x in T* to t, then remove duplicates
        Ct = subset(Ck, t);
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup};
end
Answer = ∪k Lk;
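A minimal sketch of Cumulate's ancestor handling (illustrative; it reuses the taxonomy dictionary and ancestors() helper from the earlier sketch): pre-compute each item's ancestors once (Optimization 2) and keep only the ancestors that actually occur in some candidate (Optimization 1).

def precompute_ancestors(items):
    return {item: ancestors(item) for item in items}

def filter_ancestor_table(anc_table, candidates):
    """Drop ancestors that appear in no candidate; they can never help a
    transaction support a candidate, so adding them is wasted work."""
    in_candidates = set().union(*candidates) if candidates else set()
    return {item: anc & in_candidates for item, anc in anc_table.items()}

def extend_with_table(transaction, anc_table):
    extended = set(transaction)
    for item in transaction:
        extended |= anc_table.get(item, set())
    return extended

anc_table = filter_ancestor_table(
    precompute_ancestors(["Jacket", "Ski Pants", "Shirt", "Shoes", "Hiking Boots"]),
    [{"Clothes", "Shoes"}])
print(extend_with_table({"Jacket", "Hiking Boots"}, anc_table))
# only Clothes is added: it is the one ancestor that occurs in a candidate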
Stratification
• Candidates: {Clothes, Shoes}, {Outwear, Shoes}, {Jacket, Shoes}
  (recall that Jacket is-a Outwear is-a Clothes, and Shoes is-a Footwear)
• If {Clothes, Shoes} does not have minimum support, we do not need
  to count either {Outwear, Shoes} or {Jacket, Shoes}.
• We therefore count in steps:
  Step 1: count {Clothes, Shoes}; only if it has minsup –
  Step 2: count {Outwear, Shoes}; only if it has minsup –
  Step 3: count {Jacket, Shoes}.
Version 1: Stratify
• Depth of an itemset:
  - itemsets with no parents are of depth 0;
  - otherwise:
    depth(X) = max({depth(X^) | X^ is a parent of X}) + 1
• The algorithm:
  - Count all itemsets C0 of depth 0.
  - Delete candidates that are descendants of the itemsets in C0 that
    did not have minsup.
  - Count the remaining itemsets of depth 1 (C1).
  - Delete candidates that are descendants of the itemsets in C1 that
    did not have minsup.
  - Count the remaining itemsets of depth 2 (C2), etc.
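A minimal sketch of the depth computation (illustrative; it reuses the taxonomy dictionary from the earlier sketch, and treats a parent of an itemset as the itemset obtained by replacing one item with that item's parent):

def parents_of_itemset(itemset):
    """All itemsets obtained by replacing one item with its direct parent."""
    itemset = frozenset(itemset)
    return [(itemset - {item}) | {taxonomy[item]}
            for item in itemset if item in taxonomy]

def itemset_depths(candidates):
    """depth(X) = max(depth of X's parents among the candidates) + 1,
    computed by repeated relaxation; candidates with no parent among the
    candidates get depth 0."""
    candidates = [frozenset(c) for c in candidates]
    depth = {c: 0 for c in candidates}
    changed = True
    while changed:
        changed = False
        for c in candidates:
            parent_depths = [depth[p] for p in parents_of_itemset(c) if p in depth]
            d = max(parent_depths, default=-1) + 1
            if d > depth[c]:
                depth[c] = d
                changed = True
    return depth

cands = [{"Clothes", "Shoes"}, {"Outwear", "Shoes"}, {"Jacket", "Shoes"}]
print(itemset_depths(cands))   # depths 0, 1 and 2, matching the counting steps above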
Tradeoff & Optimizations
• Tradeoff: the number of candidates counted vs. the number of passes
  over the database. Counting each depth on a different pass lies at one
  extreme, Cumulate (all candidates in one pass) at the other.
• Optimization 1: count multiple depths together, from a certain level on.
• Optimization 2: count more than 20% of the candidates per pass.
Version 2: Estimate
• Estimate the candidates' support using a sample.
• 1st pass (C'k):
  - count the candidates that are expected to have minsup
    (those whose support in the sample is at least 0.9·minsup), and
  - count the candidates whose parents are expected to have minsup.
• 2nd pass (C"k):
  - count the children of those candidates in C'k that were not expected
    to have minsup but turned out to be frequent.
Example for Estimate
minsup = 5%

Candidate           Support in   Support in Database
Itemset             Sample       Scenario A   Scenario B
{Clothes, Shoes}    8%           7%           9%
{Outwear, Shoes}    4%           4%           6%
{Jacket, Shoes}     2%           -            -

In Scenario A, {Outwear, Shoes} turns out not to have minsup, so
{Jacket, Shoes} never has to be counted; in Scenario B it does have
minsup, so {Jacket, Shoes} must be counted in a second pass.
Version 3: EstMerge
• Motivation: eliminate the 2nd pass of algorithm Estimate.
• Implementation: count the candidates of C"k together with
  the candidates in C'k+1.
• Restriction: to create C'k+1 we assume that all
  candidates in C"k have minsup.
• The tradeoff: extra candidates counted by EstMerge
  vs. the extra pass made by Estimate.
Algorithm EstMerge
L1 = {frequent 1-itemsets};                         // count item occurrences
Ds = generate-sample(D);                            // generate a sample of the database
                                                    // during the first pass
for (k = 2, C"1 = ∅; Lk-1 ≠ ∅ or C"k-1 ≠ ∅; k++) do begin
    Ck = generate-candidates(Lk-1, C"k-1);          // generate new candidate k-itemsets
                                                    // from Lk-1 ∪ C"k-1
    C'k = expected-frequent-and-sons(Ds, Ck);       // estimate the support of Ck over the sample Ds;
                                                    // C'k = candidates expected to have minsup, plus
                                                    // candidates whose parents are expected to have minsup
    find-support(D, C'k, C"k-1);                    // find the support of C'k ∪ C"k-1
                                                    // in one pass over D
    Ck = prune-descendants(Ck, C'k);                // delete candidates in Ck whose ancestors
                                                    // in C'k do not have minsup
    C"k = Ck - C'k;                                 // remaining candidates in Ck that are not in C'k
    Lk = {c ∈ C'k | c.count ≥ minsup};              // all candidates in C'k with minsup
    Lk-1 = Lk-1 ∪ {c ∈ C"k-1 | c.count ≥ minsup};   // add all candidates in C"k-1 with minsup
end
Answer = ∪k Lk;
Stratify - Variants
Size of Sample
Pr[support in sample < a]:

               P=5%           P=1%           P=0.5%         P=0.1%
               a=.8p  a=.9p   a=.8p  a=.9p   a=.8p  a=.9p   a=.8p  a=.9p
n=1000         0.32   0.76    0.80   0.95    0.89   0.97    0.98   0.99
n=10,000       0.00   0.07    0.11   0.59    0.34   0.77    0.80   0.95
n=100,000      0.00   0.00    0.00   0.01    0.00   0.07    0.12   0.60
n=1,000,000    0.00   0.00    0.00   0.00    0.00   0.00    0.00   0.01
Performance Evaluation
• Compare the running time of the 3 algorithms:
  Basic, Cumulate and EstMerge
• On synthetic data:
  - effect of each parameter on performance
• On real data:
  - Supermarket Data
  - Department Store Data
Synthetic Data Generation
Parameter                                                          Default Value
|D|  Number of transactions                                        1,000,000
|T|  Average size of the transactions                              10
|I|  Average size of the maximal potentially frequent itemsets     4
|I|  Number of maximal potentially frequent itemsets               10,000
N    Number of items                                               100,000
R    Number of roots                                               250
L    Number of levels                                              4-5
F    Fanout                                                        5
D    Depth-ratio (probability that an item in a rule comes from
     level i / probability that the item comes from level i+1)     1
Results on synthetic data (graphs): effect of Minimum Support, Number of Transactions, Fanout, and Number of Items.
Reality Check
• Supermarket Data
  - 548,000 items
  - Taxonomy: 4 levels, 118 roots
  - ~1.5 million transactions
  - Average of 9.6 items per transaction
• Department Store Data
  - 228,000 items
  - Taxonomy: 7 levels, 89 roots
  - 570,000 transactions
  - Average of 4.4 items per transaction
Results
Conclusions
• Cumulate and EstMerge were 2 to 5 times faster than Basic on all
  synthetic datasets. On the supermarket database they were 100
  times faster!
• EstMerge was ~25-30% faster than Cumulate.
• Both EstMerge and Cumulate exhibit linear scale-up with the
  number of transactions.
Summary
• The use of a taxonomy is necessary for finding association rules
  between items at any level of the hierarchy.
• The obvious solution (algorithm Basic) is not very fast.
• New algorithms that exploit the taxonomy are much faster.
• We can use the taxonomy to prune uninteresting rules.