Fast and Memory Efficient Mining of Frequent Closed Itemsets
Claudio Lucchese, Salvatore Orlando, Raffaele Perego
DB group seminar
Presenter: Leonidas
Abstract
• Frequent Itemsets Mining
• Closed Itemsets
• Mining Frequent Closed Itemsets
• Handling duplicates
• Brief introduction of the algorithm
• Experimental results
Frequent Itemsets Mining
• A set of items I, a set of transactions D
• Discover all the itemsets from I with support > min_supp
• Support of a k-itemset X, supp(X): number of transactions in D that include X
• X is a set of k items from I
• Transaction t in D is a set of items from I
• Well-known algorithm: Apriori
• Discovers frequent itemsets
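To make the definitions concrete, here is a minimal brute-force sketch (not Apriori itself) that counts supports and lists the frequent itemsets. It assumes the four-transaction data set used as the running example later in this deck; the names `D`, `supp` and `frequent_itemsets` are illustrative only.

```python
from itertools import combinations

# Toy data set: the four transactions of the running example (TID -> items).
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ITEMS = sorted(set().union(*D.values()))

def supp(itemset):
    """Number of transactions in D that include every item of `itemset`."""
    return sum(1 for t in D.values() if set(itemset) <= t)

def frequent_itemsets(min_supp):
    """Brute force: every non-empty itemset with support > min_supp."""
    result = {}
    for k in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, k):
            s = supp(combo)
            if s > min_supp:
                result[frozenset(combo)] = s
    return result

frequent = frequent_itemsets(min_supp=1)
for itemset, s in sorted(frequent.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(itemset)), s)
```

The enumeration is exponential in the number of items, which is exactly the cost that Apriori-style pruning and closed itemsets try to reduce.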
Weaknesses & Solutions
• Number of frequent itemsets grows quickly as min_supp decreases
• Complexity of mining task increases rapidly
• Huge size of output
• Complex for analysis
• Closed itemsets are one of the solutions
• Unique maximal elements of the equivalence classes
defined over the lattice of all the frequent itemsets
Weaknesses & Solutions
• Equivalence class
• Distinct group of frequent itemsets
• Supported by same set of transactions
• Represent same knowledge
• Vertical bitwise representation of data set
• Association Rules extracted are more meaningful
[ZAKI04]
• Redundancies are removed
• Suitable for dense data set
• Frequent closed itemsets are much fewer than frequent
itemsets
Closed Itemsets
• X is a subset of the items appearing in D
• T is a subset of the transactions in D
• Define two functions:
f(T) = {i ∈ I | ∀t ∈ T, i ∈ t}
g(X) = {t ∈ D | ∀i ∈ X, i ∈ t}

TID   Items
1     B D
2     A B C D
3     A C D
4     C

• Itemset X is closed iff
c(X) = f(g(X)) = (f ∘ g)(X) = X
• The function c = f ∘ g is called the Galois operator / closure operator
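The two functions and the closure operator can be sketched directly from their definitions. This is an illustrative sketch on the example table above; the function names mirror the slide's symbols, and treating f(Ø) as the full item set is the usual convention, stated here as an assumption.

```python
# Example data set from the table above (TID -> items).
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ALL_ITEMS = frozenset().union(*D.values())

def g(itemset):
    """Transactions (by TID) that contain every item of `itemset`."""
    return frozenset(tid for tid, t in D.items() if set(itemset) <= t)

def f(tids):
    """Items that occur in every transaction of `tids`."""
    tids = list(tids)
    if not tids:
        return ALL_ITEMS  # assumed convention: f(Ø) is the whole item set
    return frozenset(set.intersection(*(set(D[t]) for t in tids)))

def c(itemset):
    """Closure operator c = f ∘ g."""
    return f(g(itemset))

def is_closed(itemset):
    return c(itemset) == frozenset(itemset)

print(sorted(c({"A"})))          # closure of {A}
print(is_closed({"C", "D"}))     # {C, D} is not closed: A occurs wherever C and D do
```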
Equivalence classes
• Two itemsets belong to same equivalence class iff
• They have same closure
• Supported by same set of transactions
• An itemset I is closed iff
• No supersets of I have the same support
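TID   Items
1     B D
2     A B C D
3     A C D
4     C

[Figure: lattice of the itemsets over {A, B, C, D} annotated with their supports: Ø 4; A 2, B 2, C 3, D 3; AB 1, AC 2, AD 2, BC 1, BD 2, CD 2; ABC 1, ABD 1, ACD 2, BCD 1; ABCD 1. Legend: support, frequent itemset, frequent closed itemset, equivalence class.]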
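The equivalence classes of the example can be recovered by grouping itemsets on their supporting transaction set. The sketch below is illustrative only: it enumerates all non-empty itemsets by brute force and takes the longest member of each class as its closed itemset.

```python
from collections import defaultdict
from itertools import combinations

D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ITEMS = "ABCD"

def g(itemset):
    """TIDs of the transactions that contain every item of `itemset`."""
    return frozenset(tid for tid, t in D.items() if set(itemset) <= t)

# Itemsets supported by the same set of transactions share an equivalence class.
classes = defaultdict(list)
for k in range(1, len(ITEMS) + 1):
    for combo in combinations(ITEMS, k):
        tids = g(combo)
        if tids:  # ignore itemsets that occur in no transaction
            classes[tids].append("".join(combo))

# The closed itemset is the unique maximal (here: longest) member of each class.
closed = {tids: max(members, key=len) for tids, members in classes.items()}

for tids, members in classes.items():
    print(sorted(tids), members, "-> closed:", closed[tids])
```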
Mining Frequent Closed Itemsets
• Search Space Browsing
• Traverse the lattice of frequent itemsets from one
equivalence class to another
• Closure computation
• Compute the closure of frequent itemsets
• Determine the closed itemsets
Closure generator:
• A single representative of an equivalence class
• Can mine all the closed itemsets by computing the
closure of the generator for each class
Browsing the Search Space
• Choose the key patterns (minimal elements) as
generators
• Traverse the lattice formed by key patterns with
Apriori-like algorithm[TAOU00]
• Unfortunately, the same closed itemset can be reached from more than one key pattern
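[Figure: lattice of the itemsets over {A, B, C, D} annotated with their supports: Ø 4; A 2, B 2, C 3, D 3; AB 1, AC 2, AD 2, BC 1, BD 2, CD 2; ABC 1, ABD 1, ACD 2, BCD 1; ABCD 1.]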
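The drawback of key patterns can be seen on the example data set. The following sketch (illustrative names, brute-force enumeration) marks an itemset as a key pattern when no subset obtained by dropping one item has the same support, then groups the key patterns by their closure.

```python
from collections import defaultdict
from itertools import combinations

D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ITEMS = "ABCD"

def supp(itemset):
    return sum(1 for t in D.values() if set(itemset) <= t)

def is_key_pattern(itemset):
    """Minimal element of its class: dropping any single item changes the support.
    Checking the immediate subsets is enough because support is anti-monotone."""
    items = set(itemset)
    return all(supp(items - {i}) != supp(items) for i in items)

def closure(itemset):
    """Intersection of all transactions that contain `itemset` (assumes support > 0)."""
    covering = [t for t in D.values() if set(itemset) <= t]
    return frozenset(set.intersection(*map(set, covering)))

by_closure = defaultdict(list)
for k in range(1, len(ITEMS) + 1):
    for combo in combinations(ITEMS, k):
        if supp(combo) > 0 and is_key_pattern(combo):
            by_closure[closure(combo)].append("".join(combo))

for closed, keys in by_closure.items():
    print("".join(sorted(closed)), "<-", keys)
```

In this data set both {A} and {C, D} are key patterns of the class whose closed itemset is {A, C, D}, so a key-pattern traversal would compute that closure twice.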
Browsing the Search Space
• Closure climbing
• New generators are built as the supersets of the
closed itemset discovered so far
• Jump from an equivalence class to another
• Cannot guarantee that the equivalence class has not been visited already
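[Figure: lattice of the itemsets over {A, B, C, D} annotated with their supports: Ø 4; A 2, B 2, C 3, D 3; AB 1, AC 2, AD 2, BC 1, BD 2, CD 2; ABC 1, ABD 1, ACD 2, BCD 1; ABCD 1.]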
Problem of duplicate
• Need duplicate checking to avoid generating the same closed itemset more than once
• To avoid useless, expensive closure operations, use the following lemma:
Given two itemsets X and Y, if X ⊂ Y and supp(X) = supp(Y) (i.e., |g(X)| = |g(Y)|), then c(X) = c(Y)
• However, it is still expensive in time and space
• All the mined closed itemsets need to be kept in main memory
• Several algorithms are forced to adopt a strict lexicographic
visiting order of the search space to ensure correct duplicate
avoidance
• CHARM[ZAKI02], CLOSET[PEI00], CLOSET+[PEI03]
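The lemma-based duplicate check can be sketched as a lookup against the closed itemsets mined so far. This is only an illustration of the idea (a linear scan over an in-memory dictionary), not the indexing scheme of any of the cited algorithms; `mined_closed` and `is_duplicate` are hypothetical names.

```python
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}

def supp(itemset):
    return sum(1 for t in D.values() if set(itemset) <= t)

def is_duplicate(gen, mined_closed):
    """True if an already mined closed itemset Y has gen ⊆ Y and supp(gen) = supp(Y).
    By the lemma c(gen) = c(Y) = Y, so computing the closure again is wasted work.
    (⊆ rather than ⊂ so that a generator equal to a mined itemset is also caught.)"""
    gen = frozenset(gen)
    s = supp(gen)
    return any(gen <= y and s == supp_y for y, supp_y in mined_closed.items())

# Every closed itemset found so far has to stay in memory for this check to work.
mined_closed = {frozenset("ACD"): 2}

print(is_duplicate({"C", "D"}, mined_closed))  # same class as {A, C, D}
print(is_duplicate({"B"}, mined_closed))       # a class not visited yet
```

The dictionary grows with the output, which is the memory cost the order-preserving technique on the following slides avoids.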
Computing Closures
• Besides the Galois operator, make use of the lemma:
Given an itemset X and an item i ∈ I, g(X) ⊆ g(i) ⇔ i ∈ c(X)
• Perform the inclusion check for all items in I
• The check benefits from the vertical representation of the data set as a list of tidlists

Bitmap of the example data set (each column is the tidlist of an item):

Item  A  B  C  D
T1    0  1  0  1
T2    1  1  1  1
T3    1  0  1  1
T4    0  0  1  0

• Calculation can be either offline or online
• Offline: compute closures for the entire set of generators
• Use key patterns, generators are shorter
• Online: compute the closure of each generator as it is discovered
• Use closure climbing, generators are longer
• Fewer checks for longer generators, more efficient
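A minimal sketch of the closure computation through the inclusion lemma, with each tidlist stored as an integer bitmap. The encoding (bit k-1 stands for transaction Tk) and the helper names are assumptions made for this illustration.

```python
# Vertical bitmaps of the example: bit k-1 is set iff the item occurs in Tk.
TIDLIST = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ALL_TIDS = 0b1111

def g(itemset):
    """Tidlist of an itemset: bitwise AND of the tidlists of its items."""
    tids = ALL_TIDS
    for item in itemset:
        tids &= TIDLIST[item]
    return tids

def included(a, b):
    """Inclusion check a ⊆ b on bitmaps."""
    return (a & b) == a

def closure(itemset):
    """c(X) = every item i whose tidlist contains g(X)."""
    gx = g(itemset)
    return {i for i in TIDLIST if included(gx, TIDLIST[i])}

print(sorted(closure({"A"})))
print(sorted(closure({"B"})))
```

Each closure costs one inclusion check per item, so the bitwise AND is the operation worth optimizing.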
Handling duplicates
• To identify the unique generator for each
equivalence class
• Define order-preserving property of generator
• Check whether a given generator is order-preserving or not
• Compute the closure of order-preserving generators only
• Prune other generators
Handling duplicates
• Order-preserving property of generators:
A generator of the form X = Y ∪ i, where Y is a closed itemset and i ∉ Y, is said to be order-preserving iff either c(X) = X or i ≺ (c(X) \ X)
• It means that if items need to be added to an
order-preserving generator to compute the closure,
they need to follow the item i
• The introduction of order-preserving generator is
used to avoid duplicate generation of closed
itemset
Example
• {A} = Ø ∪ {A} is an order-preserving generator
• A ≺ c(A) \ A = {C, D}
• {C,D} = {C} ∪ {D} is not order-preserving
• D ⊀ c({C,D}) \ {C,D} = {A}
[Figure: lattice of the itemsets over {A, B, C, D} annotated with their supports: Ø 4; A 2, B 2, C 3, D 3; AB 1, AC 2, AD 2, BC 1, BD 2, CD 2; ABC 1, ABD 1, ACD 2, BCD 1; ABCD 1. Bitmap of the data set over items A B C D: T1 0101, T2 1111, T3 1011, T4 0010.]
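The definition can be checked literally by computing the closure and comparing the added items with i. The sketch below assumes that ≺ is plain alphabetical order on the item names, which matches the example but is an assumption of this illustration.

```python
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}

def closure(itemset):
    """Intersection of all transactions that contain `itemset` (assumes support > 0)."""
    covering = [t for t in D.values() if set(itemset) <= t]
    return frozenset(set.intersection(*map(set, covering)))

def is_order_preserving(Y, i):
    """Generator X = Y ∪ {i}: every item that the closure adds must follow i.
    When c(X) = X nothing is added and the test holds trivially."""
    X = set(Y) | {i}
    added = closure(X) - X
    return all(i < j for j in added)  # '<' on strings stands in for ≺

print(is_order_preserving(set(), "A"))   # closure adds C and D, both after A
print(is_order_preserving({"C"}, "D"))   # closure adds A, which precedes D
```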
Handling duplicates
• We need to check whether a generator is order-preserving or not
• Define a set called pre-set(gen) of a generator gen = Y ∪ i:
pre-set(gen) = {j | j ∈ I, j ∉ gen, and j ≺ i}
• We can now check whether a generator is order-preserving by checking:
∃ j ∈ pre-set(gen) such that g(gen) ⊆ g(j)
• If yes, then gen is not order-preserving
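The pre-set test needs only tidlist inclusions, with no closure computation and no stored itemsets. A minimal sketch on the example bitmaps follows; the bit encoding (bit k-1 for transaction Tk) and alphabetical item order are assumptions of the illustration.

```python
# Vertical bitmaps of the example: bit k-1 is set iff the item occurs in Tk.
TIDLIST = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ORDER = "ABCD"  # assumed item order ≺

def g(itemset):
    tids = 0b1111
    for item in itemset:
        tids &= TIDLIST[item]
    return tids

def pre_set(gen, i):
    """Items that precede i in the order and do not belong to gen."""
    return [j for j in ORDER if j not in gen and ORDER.index(j) < ORDER.index(i)]

def is_order_preserving(Y, i):
    """gen = Y ∪ {i} fails the test iff some j in pre-set(gen) has g(gen) ⊆ g(j)."""
    gen = set(Y) | {i}
    g_gen = g(gen)
    return not any((g_gen & TIDLIST[j]) == g_gen for j in pre_set(gen, i))

print(is_order_preserving(set(), "B"))       # g(B) is not included in g(A)
print(is_order_preserving({"C"}, "D"))       # g(CD) ⊆ g(A): pruned
```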
Handling duplicates
• The goal is to compute the closure of order-preserving generators only
• For any closed itemset Y, there exists a sequence of order-preserving generators
• Using closure climbing to climb a sequence of closed itemsets and reach Y
• For each closed itemset Y, the sequence of order-preserving generators is unique
Handling duplicates
• Example :
Y = {A, B, C, D}
Y0 = c(Ø)
gen0 = {A}
Y1 = c(gen0) = {A, C, D}
gen1 = {A, C, D} ∪ {B}
Y = c(gen1) = {A, B, C, D}
Note: Y0 ⊂ Y1 ⊂ Y
[Figure: lattice fragment with Ø (4), A (2), AC (2), ACD (2) and ABCD (1), labelled Generator = {A} and Generator = {A, C, D} ∪ {B}]
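The sequence in the example can be reproduced with a small closure-climbing loop. One assumption is made loudly here: at each step the generator is built by adding the smallest item of Y (in alphabetical order) that is still missing; the slides state only that the sequence exists and is unique, not how it is constructed.

```python
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}

def closure(itemset):
    """Intersection of all transactions that contain `itemset` (assumes support > 0)."""
    covering = [t for t in D.values() if set(itemset) <= t]
    return frozenset(set.intersection(*map(set, covering)))

def generator_chain(Y):
    """Climb from c(Ø) to the closed itemset Y; returns [(gen_k, Y_{k+1}), ...]."""
    Y = frozenset(Y)
    if closure(Y) != Y:
        raise ValueError("Y must be a closed itemset")
    current = closure(set())            # Y0 = c(Ø)
    chain = []
    while current != Y:
        i = min(Y - current)            # assumed rule: smallest missing item of Y
        gen = current | {i}
        current = closure(gen)
        chain.append((gen, current))
    return chain

for gen, closed in generator_chain("ABCD"):
    print(sorted(gen), "->", sorted(closed))
```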
The DCI_CLOSED Algorithm
• Two different types of data sets
• Dense & Sparse
• Dense data set
• Transactions are long
• Contain strongly correlated items
• In sparse data sets, the number of closed itemsets may be nearly equal to the number of frequent itemsets
• Mining closed itemsets becomes more expensive
• Separated into two parts
• DCI_CLOSEDs() & DCI_CLOSEDd()
The DCI_CLOSED Algorithm
• Discriminate between sparse and dense data
sets:
• Scan data set to find out frequent single items
F1⊆ I
• Build bitwise vertical data set VD
• Items are sorted by increasing frequency
A 1010111
B 0101101
E 0101100
…
• Decide whether a data set is sparse or dense
• If percentage of 1s is large
• If a large set of items is strongly correlated
• Compute the percentage of the most frequent items
that co-occur in the same transaction
The DCI_CLOSED Algorithm
• 3 input parameters:
• CLOSED_SET=c(Ø), PRE_SET=Ø, POST_SET=F1\c(Ø)
• Get an item i from POST_SET (minimum in order)
• Add i to CLOSED_SET to build new_gen (closure
climbing)
• Check validity of generator new_gen with PRE_SET
• Compute closure of new_gen using lemma 2 for
CLOSED_SET
• New closed set generated from new_gen
The DCI_CLOSED Algorithm
• Use PRE_SET to check validity of new_gen
• Guarantee duplicate generators will be correctly pruned
out
• POST_SET is used to guarantee generators are
produced according to Theorem 1
• POST_SET contains the items j that follow i in lexicographic order and are not yet included in CLOSED_SET
POST_SETnew = { j ∈ POST_SET | i ≺ j and j ∉ X }
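The steps on these two slides can be put together in a compact recursive sketch. It is a simplified illustration, not the authors' implementation: items are taken in alphabetical order rather than sorted by frequency, tidlists are plain integers (bit k-1 for transaction Tk), none of the optimizations are applied, and the threshold follows the "support > min_supp" convention of the first slide.

```python
# Vertical bitmaps of the example: bit k-1 is set iff the item occurs in Tk.
TIDLIST = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
N_TRANSACTIONS = 4

def popcount(bits):
    return bin(bits).count("1")

def included(a, b):
    return (a & b) == a

def dci_closed(closed_set, closed_tids, pre_set, post_set, min_supp, out):
    pre_set, post_set = list(pre_set), list(post_set)  # local copies per call
    while post_set:
        i = post_set.pop(0)                       # minimum item of POST_SET
        gen_tids = closed_tids & TIDLIST[i]       # g(new_gen), new_gen = CLOSED_SET ∪ i
        if popcount(gen_tids) <= min_supp:
            continue                              # infrequent generator
        if any(included(gen_tids, TIDLIST[j]) for j in pre_set):
            continue                              # not order-preserving: prune
        new_closed, new_post = closed_set | {i}, []
        for j in post_set:                        # closure via inclusion checks
            if included(gen_tids, TIDLIST[j]):
                new_closed = new_closed | {j}
            else:
                new_post.append(j)
        out[frozenset(new_closed)] = popcount(gen_tids)
        dci_closed(new_closed, gen_tids, pre_set, new_post, min_supp, out)
        pre_set.append(i)
    return out

def mine(min_supp):
    all_tids = (1 << N_TRANSACTIONS) - 1
    items = sorted(TIDLIST)
    root = frozenset(i for i in items if TIDLIST[i] == all_tids)   # c(Ø)
    post = [i for i in items if i not in root]
    return dci_closed(root, all_tids, [], post, min_supp, {})

for itemset, s in mine(min_supp=0).items():
    print("".join(sorted(itemset)), s)
```

No previously mined closed itemset is consulted: the only state is the three sets passed down the recursion.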
Running example of DCI_CLOSEDd()
• CLOSED_SET = c(Ø)=Ø, PRE_SET=Ø,
POST_SET={A,B,C,D}
• Compute closure of generator gen= Ø∪{A}={A}
• Check against PRE_SET: gen is order-preserving
• Check whether g(A) ⊆ g(j), ∀j ∈ POST_SET
• If yes, include j in CLOSED_SET
[Figure: lattice fragment with Ø (4), A (2), AC (2) and ACD (2), labelled Generator = {A}. Bitmap of the data set over items A B C D: T1 0101, T2 1111, T3 1011, T4 0010.]
Running example of DCI_CLOSEDd()
• CLOSED_SET={A,C,D}, PRE_SET=Ø, POST_SET={B}
• New generator gen= {A,C,D}∪{B}={A,B,C,D}
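• Check against PRE_SET: gen is order-preserving
• gen is closed since POST_SET is empty
• Note: {A,C,D} is visited before {A,B,C,D}; closed itemsets need not be visited in lexicographic order
[Figure: lattice fragment with Ø (4), A (2), AC (2), ACD (2) and ABCD (1), labelled Generator = {A, C, D} ∪ {B}. Bitmap of the data set over items A B C D: T1 0101, T2 1111, T3 1011, T4 0010.]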
Running example of DCI_CLOSEDd()
• gen=Ø∪{B}, PRE_SET={A}, POST_SET={C,D}
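• gen is found to be order-preserving by checking g(B) against g(A)
• Checking g(B) against g(C) and g(D) gives c(B) = {B,D}
• {B,D} is closed, as confirmed by the check against POST_SET
[Figure: lattice fragment with Ø (4), A (2), AC (2), ACD (2), ABCD (1), B (2) and BD (2), labelled Generator = {B}. Bitmap of the data set over items A B C D: T1 0101, T2 1111, T3 1011, T4 0010.]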
Running example of DCI_CLOSEDd()
• CLOSED_SET={B,D}, PRE_SET={A}, POST_SET={C}
• gen now is {B,D}∪{C} = {B,C,D}
• Check g({B,C,D}) against g(A): g({B,C,D}) ⊆ g(A)
• gen is not order-preserving and can be pruned together with all its possible extensions
[Figure: lattice fragment with Ø (4), A (2), AC (2), ACD (2), ABCD (1), B (2), BD (2) and BCD (1), labelled Generator = {B, D} ∪ {C}. Bitmap of the data set over items A B C D: T1 0101, T2 1111, T3 1011, T4 0010.]
Running example of DCI_CLOSEDd()
• gen=Ø∪{C}, PRE_SET={A,B}, POST_SET={D}
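• gen is found to be order-preserving by checking g(C) against g(A) and g(B)
• The check against POST_SET shows that gen cannot be extended, so it is closed
[Figure: lattice fragment with Ø (4), A (2), AC (2), ACD (2), ABCD (1), B (2), BD (2), BCD (1) and C (3), labelled Generator = {C}. Bitmap of the data set over items A B C D: T1 0101, T2 1111, T3 1011, T4 0010.]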
Running example of DCI_CLOSEDd()
• CLOSED_SET={C}, PRE_SET={A,B}, POST_SET={D}
• gen now is {C}∪{D} = {C,D}
• Check g({C,D}) against g(A): g({C,D}) ⊆ g(A)
• gen is not order-preserving and can be pruned without considering its possible extensions
[Figure: lattice fragment with Ø (4), A (2), AC (2), ACD (2), ABCD (1), B (2), BD (2), BCD (1), C (3) and CD (2), labelled Generator = {C} ∪ {D}. Bitmap of the data set over items A B C D: T1 0101, T2 1111, T3 1011, T4 0010.]
Running example of DCI_CLOSEDd()
• gen=Ø∪{D}, PRE_SET={A,B,C}, POST_SET= Ø
• gen is found to be order-preserving by checking g(D) against g(A), g(B) and g(C)
• POST_SET is empty, so gen cannot be extended and it is closed
[Figure: lattice fragment with Ø (4), A (2), AC (2), ACD (2), ABCD (1), B (2), BD (2), BCD (1), C (3), CD (2) and D (3), labelled Generator = {D}. Bitmap of the data set over items A B C D: T1 0101, T2 1111, T3 1011, T4 0010.]
Optimizations
• Vertical data set (frequent single items) is represented by an M×N bitmap matrix VD
• VD(i,j) = 1 iff frequent item i occurs in transaction j
• Row i of the matrix represents g(i), the tidlist of item i
• Optimize the bitwise AND operations for
• tidlist intersections
• Inclusion checks
• 3 optimization techniques
Optimizations
• Data Set Projection (projection)
• For every closed itemset Z discovered by extending a closed itemset X, g(Z) ⊆ g(X)
• Delete from VD all the columns corresponding to transactions that do not occur in g(X)
• This process is limited to the generators of the 1st level of recursion, since it is expensive
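Projection can be pictured as dropping the bit positions outside g(X) and packing the rest. The sketch below is illustrative only: the integer bitmap encoding (bit k-1 for transaction Tk) and the function name `project` are assumptions, and a real implementation would work on machine words rather than bit by bit.

```python
# Vertical bitmaps of the example: bit k-1 is set iff the item occurs in Tk.
TIDLIST = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}

def project(tidlists, gx):
    """Keep only the columns (transactions) whose bit is set in gx, packed densely."""
    kept = [k for k in range(gx.bit_length()) if (gx >> k) & 1]
    projected = {}
    for item, bits in tidlists.items():
        packed = 0
        for new_pos, old_pos in enumerate(kept):
            if (bits >> old_pos) & 1:
                packed |= 1 << new_pos
        projected[item] = packed
    return projected

# Project on g({B, D}) = {T1, T2}: only two columns survive.
g_bd = TIDLIST["B"] & TIDLIST["D"]
small = project(TIDLIST, g_bd)
print({item: format(bits, "02b") for item, bits in small.items()})
```

Inclusion checks for every extension of X give the same answers on the projected bitmaps, but touch fewer bits.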
Optimizations
• Data Sets with Highly Correlated Items (section eq)
• Columns of VD are reordered to exploit data correlation
• Maximize the submatrix VE of VD whose rows are all identical
• VE is likely to be large and to include the most frequent items
• Many frequent itemsets can be mined within VE

Before reordering:
     T1  T2  T3  T4
A    0   1   0   1
B    1   1   1   1
C    1   1   0   1
D    0   1   0   1

After reordering:
     T2  T4  T1  T3
A    1   1   0   0
B    1   1   1   1
C    1   1   1   0
D    1   1   0   0
Optimizations
• Reusing Results of Previous Bitwise Intersections
(included)
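• To check whether an itemset X is closed, compare g(X) with g(j) for every item j in its PRE_SET
• This requires testing g(X) ⊆ g(j) for all j
• A large part of g(X) may be included in g(j)
• Let h be a portion of the tidlists where g_h(X) ⊆ g_h(j); then g_h(X∪Y) ⊆ g_h(j) for any Y
• We can limit the check of the various g(j) to the part complementary to h
[Figure: tidlists g(X), g(X∪Y) and g(j), with the portion h marked and the remaining part labelled "check"]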
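The reuse idea can be sketched with tidlists split into words. The data below is hypothetical (three 4-bit words per tidlist, chosen only to show the mechanism), and the function names are illustrative.

```python
def included_words(gx, gj):
    """Indices h of the words where g_h(X) ⊆ g_h(j) already holds."""
    return {h for h, (x, j) in enumerate(zip(gx, gj)) if (x & j) == x}

def is_included(gxy, gj, known=frozenset()):
    """Test g(X∪Y) ⊆ g(j), skipping the words listed in `known`.
    Skipping is safe because g(X∪Y) ⊆ g(X) word by word."""
    return all((gxy[h] & gj[h]) == gxy[h] for h in range(len(gj)) if h not in known)

# Hypothetical tidlists, one small integer per word.
g_X = [0b1111, 0b1010, 0b0110]
g_j = [0b1111, 0b1110, 0b0100]
g_Y = [0b0111, 0b1111, 0b0101]

known = included_words(g_X, g_j)             # inclusion holds in words 0 and 1
g_XY = [x & y for x, y in zip(g_X, g_Y)]     # tidlist of the extension X ∪ Y

# Only word 2 has to be examined for the extension.
print(sorted(known), is_included(g_XY, g_j, known))
```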
Optimizations
• Actual number of bitwise AND operations vs.
support threshold
• Optimizations “section eq” & “included” are most
effective
Performance Analysis
• Competitors: FP-CLOSE[GRAH03], CLOSET+[PEI03]
• Environment: Windows XP, Pentium IV 2.8GHz, 512MB
• Sparse & Dense data sets

Dataset       Items   Avg. Trans. Size   Transactions
T40I10D100K   1000    40                 100000
Retail        16471   13                 88162
Chess         76      37                 3196
Pumsb         7117    74                 49046
Performance Analysis
• Data set: T40I10D100K, Retail
• DCI_CLOSED is faster by about one order of magnitude
Performance Analysis
• Data sets: Chess, Pumsb
Performance Analysis
• Time efficiency of duplicate checking
• Speedup of up to a factor of six when support thresholds are small
[Figure: two plots for the Chess data set]
References
• [GRAH03] G. Grahne and J. Zhu, “Efficiently Using Prefix-Trees in
Mining Frequent Itemsets,” Proc. ICDM Workshop Frequent Itemset
Mining Implementations, Dec. 2003.
• [PEI00] J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for
Mining Frequent Closed Itemsets,” Proc. ACM SIGMOD Int’l Workshop
Data Mining and Knowledge Discovery, May 2000.
• [PEI03] J. Pei, J. Han, and J. Wang, “CLOSET+: Searching for the Best
Strategies for Mining Frequent Closed Itemsets,” Proc. Ninth ACM
SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, Aug. 2003.
• [TAOU00] R. Taouil, N. Pasquier, Y. Bastide, L. Lakhal, and G. Stumme,
“Mining Frequent Patterns with Counting Inference,” SIGKDD
Explorations, vol. 2, no. 2, Dec. 2000.
• [ZAKI02] M.J. Zaki and C.-J. Hsiao, “Charm: An Efficient Algorithm for
Closed Itemsets Mining,” Proc. Second SIAM Int’l Conf. Data Mining,
Apr. 2002.
• [ZAKI04] M.J. Zaki, “Mining Non-Redundant Association Rules,” Data
Mining and Knowledge Discovery, vol. 9, no.3, pp. 223-248, 2004.