Fast and Memory Efficient Mining of Frequent Closed Itemsets
Claudio Lucchese, Salvatore Orlando, Raffaele Perego
DB group seminar
Presenter: Leonidas
Abstract
• Frequent Itemsets Mining
• Closed Itemsets
• Mining Frequent Closed Itemsets
• Handling duplicates
• Brief introduction of the algorithm
• Experimental results
Frequent Itemsets Mining
• A set of items $\mathcal{I}$ and a set of transactions D
• Discover all the itemsets over $\mathcal{I}$ with support > min_supp
• Support of a k-itemset I, supp(I): the number of transactions in D that include I
• I is a set of items from $\mathcal{I}$
• A transaction t in D is a set of items from $\mathcal{I}$
• Well-known algorithm: Apriori
• Discovers frequent itemsets
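The definitions above can be sketched in a few lines of Python. The four-transaction data set is the one used in the later slides; the names `support` and `frequent_itemsets` are illustrative, and the enumeration is plain brute force rather than Apriori.

```python
from itertools import combinations

# Example data set from the later slides: four transactions over items A-D.
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ITEMS = sorted(set().union(*D.values()))

def support(itemset, transactions=D):
    """Number of transactions that include every item of `itemset`."""
    return sum(1 for t in transactions.values() if set(itemset) <= t)

def frequent_itemsets(min_supp, transactions=D):
    """Brute-force enumeration of all itemsets with support > min_supp."""
    result = {}
    for k in range(1, len(ITEMS) + 1):
        for cand in combinations(ITEMS, k):
            s = support(cand, transactions)
            if s > min_supp:  # strict inequality, as stated on the slide
                result[frozenset(cand)] = s
    return result

freq = frequent_itemsets(min_supp=1)
for itemset, s in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), s)
```

The candidate space is exponential in the number of items, which is what the following slides set out to address.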
Weaknesses & Solutions
• The number of frequent itemsets grows quickly as min_supp decreases
• Complexity of the mining task increases rapidly
• Huge size of the output
• Complex to analyse
• Closed itemsets are one of the solutions
• They are the unique maximal elements of the equivalence classes defined over the lattice of all the frequent itemsets
Weaknesses & Solutions
• Equivalence class
• A distinct group of frequent itemsets
• Supported by the same set of transactions
• Representing the same knowledge
• Vertical bitwise representation of the data set
• The association rules extracted are more meaningful [ZAKI04]
• Redundancies are removed
• Suitable for dense data sets
• Frequent closed itemsets are much fewer than frequent itemsets
Closed Itemsets
• I is a subset of the items appearing in D
• T is a subset of the transactions in D
• Define two functions:
$f(T) = \{\, i \in \mathcal{I} \mid \forall t \in T,\ i \in t \,\}$
$g(I) = \{\, t \in D \mid \forall i \in I,\ i \in t \,\}$
TID   Items
1     B D
2     A B C D
3     A C D
4     C
• Itemset I is closed iff
$c(I) = f(g(I)) = (f \circ g)(I) = I$
• The function $c = f \circ g$ is called the Galois operator or closure operator
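A minimal sketch of the two functions and of the closure operator on the example data set. The Python names `f`, `g`, `closure` and `is_closed` mirror the slide notation but are otherwise illustrative.

```python
# Example data set: TID -> set of items.
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ALL_ITEMS = set().union(*D.values())

def g(itemset):
    """TIDs of the transactions that contain every item of `itemset`."""
    return {tid for tid, t in D.items() if set(itemset) <= t}

def f(tids):
    """Items common to all transactions in `tids` (all items if `tids` is empty)."""
    common = set(ALL_ITEMS)
    for tid in tids:
        common &= D[tid]
    return common

def closure(itemset):
    """Galois closure c(I) = f(g(I))."""
    return f(g(itemset))

def is_closed(itemset):
    """An itemset is closed iff it equals its own closure."""
    return closure(itemset) == set(itemset)

print(sorted(closure({"A"})))   # items shared by every transaction containing A
print(is_closed({"C", "D"}))
```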
Equivalence classes
• Two itemsets belong to the same equivalence class iff
• They have the same closure
• Equivalently, they are supported by the same set of transactions
• An itemset I is closed iff
• No proper superset of I has the same support
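[Figure: the example data set (T1 = {B, D}, T2 = {A, B, C, D}, T3 = {A, C, D}, T4 = {C}) and the lattice of its itemsets with supports: Ø 4; A 2, B 2, C 3, D 3; AB 1, AC 2, AD 2, BC 1, BD 2, CD 2; ABC 1, ABD 1, ACD 2, BCD 1; ABCD 1. Legend: frequent itemset, frequent closed itemset, equivalence class.]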
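As an illustration, the equivalence classes of the example can be recovered by grouping itemsets by their supporting transactions; the closed itemset of each class is then the union of its members. This is a brute-force sketch with illustrative names, not a mining algorithm.

```python
from collections import defaultdict
from itertools import combinations

D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ITEMS = sorted(set().union(*D.values()))

def tidset(itemset):
    """Set of TIDs whose transactions contain `itemset`."""
    return frozenset(tid for tid, t in D.items() if set(itemset) <= t)

# Group every non-empty itemset with non-zero support by its tidset.
classes = defaultdict(list)
for k in range(1, len(ITEMS) + 1):
    for cand in combinations(ITEMS, k):
        tids = tidset(cand)
        if tids:
            classes[tids].append(frozenset(cand))

# The unique maximal element of a class is the union of its members.
closed = {tids: frozenset().union(*members) for tids, members in classes.items()}

for tids, members in classes.items():
    print(sorted(tids), "closed:", "".join(sorted(closed[tids])),
          "members:", ["".join(sorted(m)) for m in members])
```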
Mining Frequent Closed Itemsets
• Search space browsing
• Traverse the lattice of frequent itemsets from one equivalence class to another
• Closure computation
• Compute the closure of frequent itemsets
• Determine the closed itemsets
• Closure generator
• A single representative of an equivalence class
• All the closed itemsets can be mined by computing the closure of the generator of each class
Browsing the Search Space
• Choose the key patterns (minimal elements) as generators
• Traverse the lattice formed by key patterns with an Apriori-like algorithm [TAOU00]
• Unfortunately, the same closed itemset can be reached from more than one key pattern
[Figure: lattice of the itemsets over {A, B, C, D} with their supports.]
Browsing the Search Space
• Closure climbing
• New generators are built as supersets of the closed itemsets discovered so far
• Jump from one equivalence class to another
• Cannot guarantee that the equivalence class reached has not been visited already
[Figure: lattice of the itemsets over {A, B, C, D} with their supports.]
Problem of duplicates
• Duplicate checking is needed to avoid generating the same closed itemset more than once
• To avoid useless, expensive closure operations, use the following lemma:
Given two itemsets X and Y, if $X \subset Y$ and $\mathrm{supp}(X) = \mathrm{supp}(Y)$ (i.e., $|g(X)| = |g(Y)|$), then $c(X) = c(Y)$
• However, this is still expensive in time and space
• All the mined closed itemsets need to be kept in main memory
• Several algorithms are forced to adopt a strict lexicographic visiting order of the search space to ensure correct duplicate avoidance
• CHARM [ZAKI02], CLOSET [PEI00], CLOSET+ [PEI03]
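A small sketch of the subsumption test that the lemma enables, assuming the closed itemsets mined so far are stored in a dictionary `mined` (a hypothetical name). It also shows why this approach costs memory: every mined closed itemset has to stay available for the comparison.

```python
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}

def support(itemset):
    """Number of transactions that include `itemset`."""
    return sum(1 for t in D.values() if set(itemset) <= t)

def is_duplicate(gen, mined):
    """True if a stored closed itemset strictly contains `gen` with equal support.

    By the lemma, c(gen) then equals that stored itemset, so computing the
    closure of `gen` would only reproduce an itemset already mined.
    """
    s = support(gen)
    return any(gen < closed and s == supp for closed, supp in mined.items())

# Closed itemsets found so far, with their supports (kept in main memory).
mined = {frozenset("ACD"): 2}

print(is_duplicate(frozenset("CD"), mined))  # CD is subsumed by ACD
print(is_duplicate(frozenset("BD"), mined))  # BD is not
```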
Computing Closures
• Besides the Galois operator, make use of the lemma:
Given an itemset X and an item $i \in \mathcal{I}$, $g(X) \subseteq g(i) \Leftrightarrow i \in c(X)$
• Perform the inclusion check for all items in $\mathcal{I}$
• The check benefits from the vertical representation, a list of tidlists
• Calculation can be either offline or online
• Offline: compute closures for the entire set of generators
• Use key patterns, generators are shorter
• Online: compute the closure of each generator as it is discovered
• Use closure climbing, generators are longer
• Fewer checks for longer generators, more efficient
Bitmap of the example data set (each item column is that item's tidlist):
Item   A  B  C  D
T1     0  1  0  1
T2     1  1  1  1
T3     1  0  1  1
T4     0  0  1  0
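A minimal sketch of closure computation by inclusion checks on bitwise tidlists, using Python integers as bitmaps. The bit layout (bit k-1 stands for transaction Tk) and the names are assumptions made for the illustration.

```python
# Bit k-1 of TIDLIST[item] is set iff the item occurs in transaction Tk.
TIDLIST = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ALL_TIDS = 0b1111

def g(itemset):
    """Tidlist of an itemset: bitwise AND of the tidlists of its items."""
    tids = ALL_TIDS
    for item in itemset:
        tids &= TIDLIST[item]
    return tids

def closure(itemset):
    """c(X) = {i : g(X) is included in g(i)}, following the lemma."""
    gx = g(itemset)
    return {i for i, gi in TIDLIST.items() if (gx & gi) == gx}

print(sorted(closure({"A"})))
print(sorted(closure({"B", "C"})))
```

Each inclusion check is a single AND plus a comparison per machine word, which is why the vertical bitmap layout suits this step.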
Handling duplicates
• Identify a unique generator for each equivalence class
• Define the order-preserving property of a generator
• Check whether a given generator is order-preserving or not
• Compute the closure of order-preserving generators only
• Prune the other generators
Handling duplicates
• Order-preserving property of generators:
A generator of the form $X = Y \cup i$, where Y is a closed itemset and $i \notin Y$, is said to be order-preserving iff either $c(X) = X$ or $i \prec (c(X) \setminus X)$
• That is, if items need to be added to an order-preserving generator to compute its closure, they all have to follow item i in the item order
• Order-preserving generators are introduced to avoid generating the same closed itemset more than once
Example
• {A} = Ø ∪ {A} is an order-preserving generator
• $A \prec c(A) \setminus A = \{C, D\}$
• {C,D} = {C} ∪ {D} is not order-preserving
• $D \not\prec c(\{C, D\}) \setminus \{C, D\} = \{A\}$
[Figure: lattice of the itemsets over {A, B, C, D} with their supports, next to the bitmap of the example data set.]
Handling duplicates
• We need to check whether a generator is order-preserving or not
• Define a set called pre-set(gen) for a generator $gen = Y \cup i$:
$\text{pre-set}(gen) = \{\, j \mid j \in \mathcal{I},\ j \notin gen,\ \text{and } j \prec i \,\}$
• We can now check whether a generator is order-preserving by testing:
$\exists j \in \text{pre-set}(gen)$ such that $g(gen) \subseteq g(j)$
• If such a j exists, then gen is not order-preserving
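The pre-set test can be sketched as follows, assuming the item order A < B < C < D and integer bitmaps as tidlists (bit k-1 stands for transaction Tk). The function names are illustrative.

```python
ORDER = ["A", "B", "C", "D"]  # assumed item order: A < B < C < D
TIDLIST = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ALL_TIDS = 0b1111

def g(itemset):
    """Tidlist of an itemset as a bitmap."""
    tids = ALL_TIDS
    for item in itemset:
        tids &= TIDLIST[item]
    return tids

def pre_set(closed, i):
    """Items that precede i in ORDER and are not in gen = closed + {i}."""
    return [j for j in ORDER[:ORDER.index(i)] if j not in closed]

def is_order_preserving(closed, i):
    """gen is NOT order-preserving iff g(gen) is included in g(j) for some j in pre-set."""
    g_gen = g(set(closed) | {i})
    return not any((g_gen & TIDLIST[j]) == g_gen for j in pre_set(closed, i))

print(is_order_preserving(set(), "A"))   # generator {A}
print(is_order_preserving({"C"}, "D"))   # generator {C, D}
```

No list of previously mined closed itemsets is consulted; only the tidlists of the items in the pre-set are needed.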
Handling duplicates
• The goal is to compute the closure of order-preserving generators only
• For any closed itemset Y, there exists a sequence of order-preserving generators
• Closure climbing follows this sequence through a chain of closed itemsets until it reaches Y
• For each closed itemset Y, the sequence of order-preserving generators is unique
Handling duplicates
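• Example: $Y = \{A, B, C, D\}$
$Y_0 = c(\emptyset) = \emptyset$
$gen_0 = \emptyset \cup \{A\}$
$Y_1 = c(gen_0) = \{A, C, D\}$
$gen_1 = \{A, C, D\} \cup \{B\}$
$Y = c(gen_1) = \{A, B, C, D\}$
Note: $Y_0 \subset Y_1 \subset Y$
[Figure: the lattice nodes Ø (4), A (2), AC (2), ACD (2) and ABCD (1), with the generators Ø ∪ {A} and {A, C, D} ∪ {B} marked.]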
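The climb in this example can be reproduced with a short sketch. It assumes that each new generator adds the smallest item of Y not yet covered by the current closed itemset; that selection rule and the function names are assumptions of the sketch, not something stated on the slide.

```python
TIDLIST = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ALL_TIDS = 0b1111

def closure(itemset):
    """c(X): all items whose tidlist includes g(X)."""
    gx = ALL_TIDS
    for item in itemset:
        gx &= TIDLIST[item]
    return {i for i, gi in TIDLIST.items() if (gx & gi) == gx}

def generator_sequence(target):
    """Climb from c(empty set) to the closed itemset `target`.

    Returns a list of (generator, closed itemset reached) pairs.
    """
    target = set(target)
    if closure(target) != target:
        raise ValueError("target must be a closed itemset")
    current = closure(set())
    steps = []
    while current != target:
        i = min(target - current)   # assumed rule: smallest uncovered item
        gen = current | {i}
        current = closure(gen)
        steps.append((gen, current))
    return steps

for gen, reached in generator_sequence("ABCD"):
    print(sorted(gen), "->", sorted(reached))
```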
The DCI_CLOSED Algorithm
• Two different types of data sets
• Dense & Sparse
• Dense data sets
• Transactions are long
• They contain strongly correlated items
• In sparse data sets, the number of closed itemsets may be nearly equal to the number of frequent itemsets
• Mining closed itemsets becomes more expensive
• The algorithm is separated into two parts
• DCI_CLOSEDs() for sparse and DCI_CLOSEDd() for dense data sets
The DCI_CLOSED Algorithm
• Discriminate between sparse and dense data sets:
• Scan the data set to find the frequent single items $F_1 \subseteq \mathcal{I}$
• Build the bitwise vertical data set VD
• Items are sorted by increasing frequency
A 1010111
B 0101101
E 0101100
…
• Decide whether a data set is sparse or dense
• If percentage of 1s is large
• If a large set of items is strongly correlated
• Compute the percentage of the most frequent items
that co-occur in the same transaction
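A rough sketch of the preprocessing step: find the frequent single items, sort them by increasing frequency, build the vertical bitmaps, and measure the fraction of 1s. The slides give no numeric threshold for calling a data set dense, so none is assumed here; all names are illustrative.

```python
from collections import Counter

def build_vertical(transactions, min_supp):
    """Return (frequent items sorted by increasing frequency, {item: bitmap}).

    Bit `col` of an item's bitmap is set iff the item occurs in
    transactions[col].
    """
    counts = Counter(i for t in transactions for i in t)
    f1 = sorted((i for i, c in counts.items() if c > min_supp),
                key=lambda i: (counts[i], i))
    vd = {i: 0 for i in f1}
    for col, t in enumerate(transactions):
        for i in t:
            if i in vd:
                vd[i] |= 1 << col
    return f1, vd

def density(vd, n_transactions):
    """Fraction of 1s in the bitmap matrix."""
    if not vd:
        return 0.0
    ones = sum(bin(bits).count("1") for bits in vd.values())
    return ones / (len(vd) * n_transactions)

transactions = [{"B", "D"}, {"A", "B", "C", "D"}, {"A", "C", "D"}, {"C"}]
f1, vd = build_vertical(transactions, min_supp=1)
print(f1, density(vd, len(transactions)))
```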
The DCI_CLOSED Algorithm
• 3 input parameters:
• CLOSED_SET = c(Ø), PRE_SET = Ø, POST_SET = F1 \ c(Ø)
• Get an item i from POST_SET (the minimum in the order)
• Add i to CLOSED_SET to build new_gen (closure climbing)
• Check the validity of generator new_gen against PRE_SET
• Compute the closure of new_gen using Lemma 2 to obtain the new CLOSED_SET
• A new closed set is generated from new_gen
The DCI_CLOSED Algorithm
• Use PRE_SET to check the validity of new_gen
• Guarantees that duplicate generators are correctly pruned
• POST_SET is used to guarantee that generators are produced according to Theorem 1
• POST_SET contains the items j that follow i in lexicographic order and are not yet included in CLOSED_SET
$POST\_SET_{new} = \{\, j \in POST\_SET \mid i \prec j \text{ and } j \notin X \,\}$
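A minimal sketch of the recursion described above, assuming lexicographic item order (A < B < C < D), Python integers as tidlist bitmaps, and the strict threshold support > min_supp from the definition slide. Function and parameter names are illustrative; the sparse variant and the optimizations are not covered.

```python
ORDER = ["A", "B", "C", "D"]  # assumed item order
TIDLIST = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ALL_TIDS = 0b1111

def popcount(bits):
    return bin(bits).count("1")

def dci_closed_d(closed_set, closed_tids, pre_set, post_set, min_supp, out):
    """Append (closed itemset, support) pairs reachable from closed_set to `out`."""
    post_set = list(post_set)
    pre_set = list(pre_set)  # local copy: shared by siblings, not by the caller
    while post_set:
        i = post_set.pop(0)                     # minimum item of POST_SET
        gen_tids = closed_tids & TIDLIST[i]     # g(new_gen), closure climbing
        if popcount(gen_tids) <= min_supp:
            continue                            # new_gen is not frequent
        if any((gen_tids & TIDLIST[j]) == gen_tids for j in pre_set):
            continue                            # not order-preserving: prune
        new_closed = closed_set | {i}
        new_post = []
        for j in post_set:                      # closure via inclusion checks
            if (gen_tids & TIDLIST[j]) == gen_tids:
                new_closed.add(j)
            else:
                new_post.append(j)
        out.append((frozenset(new_closed), popcount(gen_tids)))
        dci_closed_d(new_closed, gen_tids, pre_set, new_post, min_supp, out)
        pre_set.append(i)
    return out

# c(empty set) is empty for this data set, so the initial CLOSED_SET is empty.
result = dci_closed_d(set(), ALL_TIDS, [], ORDER, 0, [])
for itemset, supp in result:
    print("".join(sorted(itemset)), supp)
```

No previously mined closed itemset is looked up anywhere in the recursion; the duplicate test only touches the tidlists of the items in PRE_SET.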
Running example of DCI_CLOSEDd()
• CLOSED_SET = c(Ø) = Ø, PRE_SET = Ø, POST_SET = {A,B,C,D}
• Compute the closure of generator gen = Ø ∪ {A} = {A}
• Check against PRE_SET → order-preserving
• Check whether g(A) ⊆ g(j) for each j ∈ POST_SET
• If yes, include j in CLOSED_SET
[Figure: the lattice nodes Ø (4), A (2), AC (2) and ACD (2) with the generator Ø ∪ {A} marked, next to the bitmap of the example data set.]
Running example of DCI_CLOSEDd()
• CLOSED_SET = {A,C,D}, PRE_SET = Ø, POST_SET = {B}
• New generator gen = {A,C,D} ∪ {B} = {A,B,C,D}
• Check against PRE_SET → order-preserving
• gen is closed since POST_SET is now empty
• Note: in the step {A,C,D} → {A,B,C,D}, the items need not be added in lexicographic order
[Figure: the generator {A, C, D} ∪ {B} and the closed itemset ABCD (1) marked on the lattice, next to the bitmap of the example data set.]
Running example of DCI_CLOSEDd()
• gen = Ø ∪ {B}, PRE_SET = {A}, POST_SET = {C,D}
• gen is order-preserving, as shown by checking g(B) against g(A)
• Checking g(B) against g(C) and g(D) gives c(B) = {B,D}
• {B,D} is closed, as confirmed by the check against POST_SET
[Figure: the generator Ø ∪ {B} and the closed itemset BD (2) marked on the lattice, next to the bitmap of the example data set.]
Running example of DCI_CLOSEDd()
• CLOSED_SET = {B,D}, PRE_SET = {A}, POST_SET = {C}
• gen is now {B,D} ∪ {C} = {B,C,D}
• Check g({B,C,D}) against g(A): g({B,C,D}) ⊂ g(A)
• gen is not order-preserving and can be pruned together with all its possible extensions
[Figure: the generator {B, D} ∪ {C} and the pruned node BCD (1) marked on the lattice, next to the bitmap of the example data set.]
Running example of DCI_CLOSEDd()
• gen = Ø ∪ {C}, PRE_SET = {A,B}, POST_SET = {D}
• gen is order-preserving, as shown by checking against g(A) and g(B)
• gen cannot be extended, as shown by checking against POST_SET, so it is closed
[Figure: the generator Ø ∪ {C} and the closed itemset C (3) marked on the lattice, next to the bitmap of the example data set.]
Running example of DCI_CLOSEDd()
• CLOSED_SET = {C}, PRE_SET = {A,B}, POST_SET = {D}
• gen is now {C} ∪ {D} = {C,D}
• Check g({C,D}) against g(A): g({C,D}) ⊆ g(A)
• gen is not order-preserving and can be pruned without considering its possible extensions
[Figure: the generator {C} ∪ {D} and the pruned node CD (2) marked on the lattice, next to the bitmap of the example data set.]
Running example of DCI_CLOSEDd()
• gen = Ø ∪ {D}, PRE_SET = {A,B,C}, POST_SET = Ø
• gen is order-preserving, as shown by checking against g(A), g(B) and g(C)
• gen cannot be extended, as shown by checking against POST_SET, so it is closed
[Figure: the generator Ø ∪ {D} and the closed itemset D (3) marked on the lattice, next to the bitmap of the example data set.]
Optimizations
• The vertical data set (frequent single items) is represented by a bitmap matrix VD of size M×N
• VD(i,j) = 1 iff frequent item i occurs in transaction j
• Row i of the matrix represents g(i), the tidlist of item i
• Optimize the bitwise AND operations used for
• tidlist intersections
• inclusion checks
• 3 optimization techniques
Optimizations
• Data set projection ("projection")
• For every closed itemset Z discovered from a closed itemset X
• g(Z) is a subset of g(X)
• Delete from VD all the columns corresponding to transactions not occurring in g(X)
• Since it is expensive, this is applied only to the generators of the first level of recursion
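Projection can be sketched as repacking each tidlist so that only the columns set in g(X) survive. The integer bitmap layout (bit k-1 stands for transaction Tk) and the function name `project` are assumptions of this sketch.

```python
VD = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}

def project(vd, gx, n_transactions):
    """Keep only the columns whose bit is set in gx, packed densely."""
    keep = [c for c in range(n_transactions) if (gx >> c) & 1]
    projected = {}
    for item, bits in vd.items():
        packed = 0
        for new_col, old_col in enumerate(keep):
            if (bits >> old_col) & 1:
                packed |= 1 << new_col
        projected[item] = packed
    return projected

# Project on g({A, C, D}) = {T2, T3}.
gx = VD["A"] & VD["C"] & VD["D"]
small = project(VD, gx, n_transactions=4)
print({item: bin(bits) for item, bits in small.items()})
```

Supports of extensions of X are unchanged by the projection: intersecting the projected rows of A, C, D and B still leaves exactly one transaction, matching supp(ABCD) = 1.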
Optimizations
• Data sets with highly correlated items ("section eq")
• The columns of VD are reordered to exploit data correlation
• Maximize the submatrix VE of VD whose rows are all identical
• VE is likely to be large and to include the most frequent items
• Many frequent itemsets can be mined within VE
Before reordering:
      T1  T2  T3  T4
A      0   1   0   1
B      1   1   1   1
C      1   1   0   1
D      0   1   0   1
After reordering:
      T2  T4  T1  T3
A      1   1   0   0
B      1   1   1   1
C      1   1   1   0
D      1   1   0   0
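The reordering in this example can be reproduced with a simple heuristic: sort the columns by decreasing number of 1s, then measure the leading block of columns on which all rows agree. This heuristic is an assumption chosen because it matches the example; the slides do not state how the authors pick the order.

```python
ROWS = {"A": [0, 1, 0, 1], "B": [1, 1, 1, 1], "C": [1, 1, 0, 1], "D": [0, 1, 0, 1]}
COLS = ["T1", "T2", "T3", "T4"]

def reorder_columns(rows, cols):
    """Sort columns by decreasing number of 1s (stable sort; assumed heuristic)."""
    ones = [sum(r[c] for r in rows.values()) for c in range(len(cols))]
    perm = sorted(range(len(cols)), key=lambda c: -ones[c])
    new_rows = {item: [r[c] for c in perm] for item, r in rows.items()}
    return new_rows, [cols[c] for c in perm]

def identical_prefix(rows):
    """Width of the leading column block on which all rows are identical."""
    n_cols = len(next(iter(rows.values())))
    width = 0
    for c in range(n_cols):
        if len({r[c] for r in rows.values()}) != 1:
            break
        width += 1
    return width

new_rows, new_cols = reorder_columns(ROWS, COLS)
print(new_cols, identical_prefix(new_rows))
```

Inside that leading block every row is the same, so intersections and inclusion checks restricted to it never need to be recomputed per item.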
Optimizations
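• Reusing results of previous bitwise intersections ("included")
• To check whether an itemset X is closed, compare g(X) with g(j) for every item j in its PRE_SET
• Each comparison tests whether g(X) ⊆ g(j)
• Even when the inclusion fails, a large part of g(X) may be included in g(j)
• Let h be a portion of the bitmap with $g_h(X) \subseteq g_h(j)$; then $g_h(X \cup Y) \subseteq g_h(j)$
• The check of the various g(j) can therefore be limited to the complementary part of $g_h(j)$
[Figure: bitmaps g(X ∪ Y), g(X) and g(j), with the portion h already known to be included and the remaining part still to check.]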
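A sketch of the idea, assuming tidlists are stored as lists of fixed-width words (4-bit words here, for readability) and that the words where inclusion already held for X are remembered. The names and the word size are illustrative.

```python
def included_words(gx, gj):
    """Indices h of the words where g_h(X) is included in g_h(j)."""
    return {h for h, (x, j) in enumerate(zip(gx, gj)) if (x & ~j) == 0}

def is_included(gxy, gj, known=frozenset()):
    """Check g(X u Y) included in g(j), skipping words already known to be included.

    Valid because g(X u Y) is a subset of g(X): inclusion that held for a
    word of g(X) still holds for the same word of g(X u Y).
    """
    return all((x & ~j) == 0
               for h, (x, j) in enumerate(zip(gxy, gj))
               if h not in known)

gx = [0b1100, 0b0110, 0b0001]   # g(X), three words
gj = [0b1110, 0b0100, 0b0011]   # g(j)
gy = [0b0101, 0b0101, 0b1111]   # g(Y)
gxy = [x & y for x, y in zip(gx, gy)]  # g(X u Y) = g(X) AND g(Y)

known = included_words(gx, gj)
print(sorted(known))                 # words that need no further checking
print(is_included(gxy, gj, known))   # only the remaining word is examined
```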
Optimizations
• Actual number of bitwise AND operations vs.
support threshold
• Optimizations “section eq” & “included” are most
effective
Performance Analysis
• Competitors: FP-CLOSE[GRAH03], CLOSET+[PEI03]
• Environment: Windows XP, Pentium IV 2.8GHz, 512MB
• Sparse & dense data sets
Dataset       Items   Avg. Trans. Size   Transactions
T40I10D100K    1000                 40         100000
Retail        16471                 13          88162
Chess            76                 37           3196
Pumsb          7117                 74          49046
Performance Analysis
• Data sets: T40I10D100K, Retail
• DCI_CLOSED is faster by one order of magnitude
Performance Analysis
• Data sets: Chess, Pumsb
Performance Analysis
• Time efficiency of duplicate checking
• Speedup of up to a factor of six when support thresholds are small
[Plots: results on the chess data set.]
References
• [GRAH03] G. Grahne and J. Zhu, “Efficiently Using Prefix-Trees in Mining Frequent Itemsets,” Proc. ICDM Workshop Frequent Itemset Mining Implementations, Dec. 2003.
• [PEI00] J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,” Proc. ACM SIGMOD Int’l Workshop Data Mining and Knowledge Discovery, May 2000.
• [PEI03] J. Pei, J. Han, and J. Wang, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, Aug. 2003.
• [TAOU00] R. Taouil, N. Pasquier, Y. Bastide, L. Lakhal, and G. Stumme, “Mining Frequent Patterns with Counting Inference,” SIGKDD Explorations, vol. 2, no. 2, Dec. 2000.
• [ZAKI02] M.J. Zaki and C.-J. Hsiao, “Charm: An Efficient Algorithm for Closed Itemsets Mining,” Proc. Second SIAM Int’l Conf. Data Mining, Apr. 2002.
• [ZAKI04] M.J. Zaki, “Mining Non-Redundant Association Rules,” Data Mining and Knowledge Discovery, vol. 9, no. 3, pp. 223-248, 2004.