Fast and Memory Efficient Mining of Frequent Closed Itemsets
Claudio Lucchese, Salvatore Orlando, Raffaele Perego
DB group seminar
Presenter: Leonidas

Abstract
• Frequent Itemsets Mining
• Closed Itemsets
• Mining Frequent Closed Itemsets
• Handling duplicates
• Brief introduction of the algorithm
• Experimental results

Frequent Itemsets Mining
• A set of items ℐ, a set of transactions D
• Discover all the itemsets from ℐ with support > min_supp
• Support of a k-itemset I, supp(I): the number of transactions in D that include I
  • I is a set of items from ℐ
  • A transaction t in D is a set of items from ℐ
• Well-known algorithm: Apriori
  • Discovers frequent itemsets

Weaknesses & Solutions
• The number of frequent itemsets grows quickly as min_supp decreases
  • Complexity of the mining task increases rapidly
  • Huge size of output
  • Complex for analysis
• Closed itemsets are one of the solutions
  • Unique maximal elements of the equivalence classes defined over the lattice of all the frequent itemsets

Weaknesses & Solutions
• Equivalence class
  • Distinct group of frequent itemsets
  • Supported by the same set of transactions
  • Represent the same knowledge
• Vertical bitwise representation of the data set
• Association rules extracted are more meaningful [ZAKI04]
  • Redundancies are removed
• Suitable for dense data sets
  • Frequent closed itemsets are much fewer than frequent itemsets

Closed Itemsets
• I is a subset of the items appearing in D
• T is a subset of the transactions in D
• Define two functions:
    f(T) = {i ∈ ℐ | ∀t ∈ T, i ∈ t}
    g(I) = {t ∈ D | ∀i ∈ I, i ∈ t}
• Itemset I is closed iff c(I) = f(g(I)) = (f ∘ g)(I) = I
• The function c = f ∘ g is called the Galois operator / closure operator

TID  Items
1    B D
2    A B C D
3    A C D
4    C

Equivalence classes
• Two itemsets belong to the same equivalence class iff
  • They have the same closure
  • They are supported by the same set of transactions
• An itemset I is closed iff
  • No superset of I has the same support
[Figure: lattice of the frequent itemsets of the data set above, annotated with supports and grouped into equivalence classes. Legend: frequent itemset, frequent closed itemset, equivalence class. Supports: Ø 4; A 2, B 2, C 3, D 3; AB 1, AC 2, AD 2, BC 1, BD 2, CD 2; ABC 1, ABD 1, ACD 2, BCD 1; ABCD 1.]
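The definitions of f, g and c can be checked by brute force on the four-transaction example. The Python sketch below is only an illustration and is not taken from the slides: the names D, ITEMS, g, f, closure and closed_itemsets are hypothetical, and it enumerates every subset of the items, which is feasible only for toy data.

```python
from itertools import combinations

# Example data set from the slides (TID -> items).
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ITEMS = sorted(set().union(*D.values()))


def g(itemset):
    """Transactions (TIDs) that contain every item of `itemset`."""
    return frozenset(tid for tid, t in D.items() if set(itemset) <= t)


def f(tids):
    """Items that occur in every transaction of `tids`."""
    tids = list(tids)
    if not tids:  # vacuous case: every item qualifies
        return frozenset(ITEMS)
    return frozenset(set.intersection(*(D[t] for t in tids)))


def closure(itemset):
    """Galois operator c = f o g."""
    return f(g(itemset))


def closed_itemsets(min_supp=0):
    """Map each closed itemset to its support (support > min_supp, as in the slides)."""
    result = {}
    for k in range(len(ITEMS) + 1):
        for combo in combinations(ITEMS, k):
            tids = g(combo)
            if len(tids) > min_supp:
                # All members of an equivalence class share this closure.
                result[closure(combo)] = len(tids)
    return result
```

On this data set, closure({"A"}) is {A, C, D}, and the 16 itemsets of the lattice collapse into six equivalence classes, one per closed itemset.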
Mining Frequent Closed Itemsets
• Search space browsing
  • Traverse the lattice of frequent itemsets from one equivalence class to another
• Closure computation
  • Compute the closure of frequent itemsets
  • Determine the closed itemsets
• Closure generator:
  • A single representative of an equivalence class
  • All the closed itemsets can be mined by computing the closure of the generator of each class

Browsing the Search Space
• Choose the key patterns (minimal elements) as generators
• Traverse the lattice formed by the key patterns with an Apriori-like algorithm [TAOU00]
• Unfortunately, the same closed itemset can be reached from more than one key pattern
[Figure: itemset lattice with supports. Ø 4; A 2, B 2, C 3, D 3; AB 1, AC 2, AD 2, BC 1, BD 2, CD 2; ABC 1, ABD 1, ACD 2, BCD 1; ABCD 1.]

Browsing the Search Space
• Closure climbing
  • New generators are built as supersets of the closed itemsets discovered so far
  • Jump from one equivalence class to another
  • Cannot ensure that the equivalence class has not been visited yet
[Figure: the same itemset lattice with supports.]

Problem of duplicates
• Duplicate checking is needed to avoid generating the same closed itemset more than once
• To avoid useless, expensive closure operations, use the following lemma:
    Given two itemsets X and Y, if X ⊂ Y and supp(X) = supp(Y) (i.e., |g(X)| = |g(Y)|), then c(X) = c(Y)
• However, this is still expensive in time and space
  • All the mined closed itemsets need to be kept in main memory
• Several algorithms are forced to adopt a strict lexicographic visiting order of the search space to ensure correct duplicate avoidance
  • CHARM [ZAKI02], CLOSET [PEI00], CLOSET+ [PEI03]

Computing Closures
• Besides the Galois operator, make use of the lemma:
    Given an itemset X and an item i ∈ ℐ, g(X) ⊆ g(i) ⇔ i ∈ c(X)
• Perform the inclusion check for all items in ℐ
• The check benefits from the vertical representation of the data set (one tidlist per item)
• Calculation can be either offline or online
• Offline: compute closures for the entire set of generators
  • Use key patterns, generators are shorter
• Online: compute the closure of each generator as it is discovered
  • Use closure climbing, generators are longer
  • Fewer checks for longer generators, more efficient

Vertical representation (the column of each item is its tidlist):
      A  B  C  D
T1    0  1  0  1
T2    1  1  1  1
T3    1  0  1  1
T4    0  0  1  0

Handling duplicates
• Goal: identify a unique generator for each equivalence class
• Define the order-preserving property of a generator
• Check whether a given generator is order-preserving or not
• Compute the closure of order-preserving generators only
• Prune the other generators

Handling duplicates
• Order-preserving property of generators:
    A generator of the form X = Y ∪ i, where Y is a closed itemset and i ∉ Y, is said to be order-preserving iff either c(X) = X or i ≺ (c(X) \ X)
• That is, if items need to be added to an order-preserving generator to compute its closure, they must all follow the item i
• Order-preserving generators are introduced to avoid generating the same closed itemset more than once

Example
• {A} = Ø ∪ {A} is an order-preserving generator
  • A ≺ c(A) \ A = {C, D}
• {C,D} = {C} ∪ {D} is not order-preserving
  • D ⊀ c({C,D}) \ {C,D} = {A}
[Figure: itemset lattice with supports and the bitmap of the data set.]

Handling duplicates
• We need to check whether a generator is order-preserving or not
• Define a set called pre-set(gen) for a generator gen = Y ∪ i:
    pre-set(gen) = {j | j ∈ ℐ, j ∉ gen, and j ≺ i}
• We can now check whether a generator is order-preserving by checking:
    ∃ j ∈ pre-set(gen) such that g(gen) ⊆ g(j)
• If yes, then gen is not order-preserving

Handling duplicates
• The goal is to compute the closure of order-preserving generators only
• For any closed itemset Y, there exists a sequence of order-preserving generators
  • Closure climbing follows this sequence through a chain of closed itemsets until it reaches Y
• For each closed itemset Y, the sequence of order-preserving generators is unique

Handling duplicates
• Example: Y = {A, B, C, D}
    Y0 = c(Ø) = Ø
    gen0 = Ø ∪ {A} = {A}
    Y1 = c(gen0) = {A, C, D}
    gen1 = {A, C, D} ∪ {B}
    Y = c(gen1) = {A, B, C, D}
• Note: Y0 ⊂ Y1 ⊂ Y
[Figure: path in the lattice from Ø (support 4) through generator {A} to ACD (support 2), then through generator {A,C,D} ∪ {B} to ABCD (support 1).]

The DCI_CLOSED Algorithm
• Two different types of data sets
  • Dense & Sparse
• Dense data set
  • Transactions are long
  • Contain strongly correlated items
• In sparse data sets the number of closed itemsets may be nearly equal to the number of frequent itemsets
  • Mining closed itemsets then becomes more expensive
• The algorithm is separated into two parts
  • DCI_CLOSEDs() & DCI_CLOSEDd()

The DCI_CLOSED Algorithm
• Discriminate between sparse and dense data sets:
  • Scan the data set to find the frequent single items F1 ⊆ ℐ
  • Build the bitwise vertical data set VD
    • Items are sorted by increasing frequency
        A 1010111
        B 0101101
        E 0101100
        …
• Decide whether a data set is sparse or dense
  • Is the percentage of 1s large?
  • Is a large set of items strongly correlated?
    • Compute the percentage of the most frequent items that co-occur in the same transaction

The DCI_CLOSED Algorithm
• 3 input parameters:
  • CLOSED_SET = c(Ø), PRE_SET = Ø, POST_SET = F1 \ c(Ø)
• Get an item i from POST_SET (the minimum in the order)
• Add i to CLOSED_SET to build new_gen (closure climbing)
• Check the validity of the generator new_gen with PRE_SET
• Compute the closure of new_gen using Lemma 2 (the inclusion check g(X) ⊆ g(i) ⇔ i ∈ c(X))
• A new closed set is generated from new_gen

The DCI_CLOSED Algorithm
• Use PRE_SET to check the validity of new_gen
  • Guarantees that duplicate generators are correctly pruned
• POST_SET is used to guarantee that generators are produced according to Theorem 1
  • POST_SET contains the items j that follow i in lexicographic order and are not yet included in CLOSED_SET:
      POST_SET_new = {j ∈ POST_SET | i ≺ j and j ∉ X}

Running example of DCI_CLOSEDd()
• CLOSED_SET = c(Ø) = Ø, PRE_SET = Ø, POST_SET = {A,B,C,D}
• Compute the closure of the generator gen = Ø ∪ {A} = {A}
• Check with PRE_SET that gen is order-preserving
• For each j ∈ POST_SET, check whether g(A) ⊆ g(j)
  • If yes, include j in CLOSED_SET
[Figure: lattice and bitmap; generator = {A}, closure ACD (support 2).]
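A single step of DCI_CLOSEDd() (order-preserving test against PRE_SET, then closure by inclusion checks against POST_SET) can be sketched with Python integers standing in for the tidlist bitmaps. This is an illustration under stated assumptions, not the authors' code: VD, included and closure_step are hypothetical names, and each bitmap is written with T1 as the leftmost bit.

```python
# Tidlists of the example data set; bits are T1 T2 T3 T4 from left to right.
VD = {"A": 0b0110, "B": 0b1100, "C": 0b0111, "D": 0b1110}


def included(a, b):
    """Inclusion check g(a) subset-of g(b), done with one bitwise AND."""
    return (a & b) == a


def closure_step(closed_set, g_closed, i, pre_set, post_set):
    """Extend CLOSED_SET with item i.

    Returns None when the generator is not order-preserving, otherwise
    (new CLOSED_SET, its tidlist, new POST_SET).
    `post_set` must hold only the items that follow i.
    """
    g_gen = g_closed & VD[i]  # tidlist of new_gen = CLOSED_SET + {i}
    if any(included(g_gen, VD[j]) for j in pre_set):
        return None  # some j preceding i belongs to c(new_gen): duplicate
    new_closed, new_post = set(closed_set) | {i}, []
    for j in post_set:
        if included(g_gen, VD[j]):  # Lemma 2: j is in c(new_gen)
            new_closed.add(j)
        else:
            new_post.append(j)
    return new_closed, g_gen, new_post


# First step of the running example: gen = {A}, starting from c(Ø) = Ø.
STEP_A = closure_step(set(), 0b1111, "A", [], ["B", "C", "D"])
# A later step: gen = {B,D} + {C} with PRE_SET = {A} is rejected.
STEP_BDC = closure_step({"B", "D"}, 0b1100, "C", ["A"], [])
```

STEP_A evaluates to ({A, C, D}, 0b0110, ["B"]), which is the CLOSED_SET and POST_SET used by the next slide, and STEP_BDC is None because g({B,C,D}) ⊆ g(A).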
Running example of DCI_CLOSEDd()
• CLOSED_SET = {A,C,D}, PRE_SET = Ø, POST_SET = {B}
• New generator gen = {A,C,D} ∪ {B} = {A,B,C,D}
• Check with PRE_SET that gen is order-preserving
• gen is closed since POST_SET is empty
• Note: {A,C,D} ∪ {B} = {A,B,C,D}; items need not be added in lexicographic order
[Figure: lattice and bitmap; generator = {A,C,D} ∪ {B}, closure ABCD (support 1).]

Running example of DCI_CLOSEDd()
• gen = Ø ∪ {B}, PRE_SET = {A}, POST_SET = {C,D}
• Checking g(B) against g(A) shows that gen is order-preserving
• Checking g(B) against g(C) and g(D) gives c(B) = {B,D}
• {B,D} is closed by checking with POST_SET
[Figure: lattice and bitmap; generator = {B}, closure BD (support 2).]

Running example of DCI_CLOSEDd()
• CLOSED_SET = {B,D}, PRE_SET = {A}, POST_SET = {C}
• gen is now {B,D} ∪ {C} = {B,C,D}
• Check g({B,C,D}) against g(A): g({B,C,D}) ⊂ g(A)
• gen is not order-preserving and can be pruned together with all its possible extensions
[Figure: lattice and bitmap; generator = {B,D} ∪ {C}, itemset BCD (support 1).]

Running example of DCI_CLOSEDd()
• gen = Ø ∪ {C}, PRE_SET = {A,B}, POST_SET = {D}
• Checking g(C) against g(A) and g(B) shows that gen is order-preserving
• Checking with POST_SET shows that gen cannot be extended, so it is closed
[Figure: lattice and bitmap; generator = {C}, closed itemset C (support 3).]

Running example of DCI_CLOSEDd()
• CLOSED_SET = {C}, PRE_SET = {A,B}, POST_SET = {D}
• gen is now {C} ∪ {D} = {C,D}
• Check g({C,D}) against g(A): g({C,D}) ⊆ g(A)
• gen is not order-preserving and can be pruned without considering its possible extensions
[Figure: lattice and bitmap; generator = {C} ∪ {D}, itemset CD (support 2).]

Running example of DCI_CLOSEDd()
• gen = Ø ∪ {D}, PRE_SET = {A,B,C}, POST_SET = Ø
• Checking g(D) against g(A), g(B) and g(C) shows that gen is order-preserving
• Checking with POST_SET shows that gen cannot be extended, so it is closed
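The steps of the running example can be combined into one recursive procedure. The Python sketch below follows the slides' description of DCI_CLOSEDd() but is not the authors' implementation; several details are assumptions: the name dci_closed_d and its signature, the use of Python integers as tidlist bitmaps (T1 as the leftmost bit), the frequency test support > min_supp, and adding every processed item to PRE_SET, which mirrors the definition of pre-set(gen). The starting set c(Ø) is not reported.

```python
def dci_closed_d(vd, items, n_trans, min_supp=0):
    """Return (closed itemset, support) pairs in discovery order.

    vd maps each item to its tidlist bitmap (an int of n_trans bits);
    items lists the frequent single items in processing order.
    """
    out = []

    def included(a, b):  # inclusion check between two tidlists
        return (a & b) == a

    def supp(bits):  # support = number of 1 bits
        return bin(bits).count("1")

    def mine(closed_set, g_closed, pre_set, post_set):
        pre_set = list(pre_set)  # private copy for this recursion level
        for pos, i in enumerate(post_set):
            g_gen = g_closed & vd[i]  # tidlist of new_gen = CLOSED_SET + {i}
            order_preserving = not any(included(g_gen, vd[j]) for j in pre_set)
            if supp(g_gen) > min_supp and order_preserving:
                new_closed, new_post = set(closed_set) | {i}, []
                for j in post_set[pos + 1:]:  # items that follow i
                    if included(g_gen, vd[j]):
                        new_closed.add(j)  # j belongs to the closure
                    else:
                        new_post.append(j)
                out.append((frozenset(new_closed), supp(g_gen)))
                mine(new_closed, g_gen, pre_set, new_post)
            pre_set.append(i)  # i now precedes the remaining generators

    all_tids = (1 << n_trans) - 1
    closed0 = {i for i in items if vd[i] == all_tids}  # c(Ø)
    mine(closed0, all_tids, [], [i for i in items if i not in closed0])
    return out


# Data set of the running example.
EXAMPLE_VD = {"A": 0b0110, "B": 0b1100, "C": 0b0111, "D": 0b1110}
RESULT = dci_closed_d(EXAMPLE_VD, ["A", "B", "C", "D"], n_trans=4)
```

On the example, RESULT lists ACD (2), ABCD (1), BD (2), C (3) and D (3) in that order, while the generators {B,D} ∪ {C} and {C} ∪ {D} are pruned. No previously mined closed itemset is consulted at any point, which is the memory saving that the order-preserving test is meant to provide.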
Optimizations
• The vertical data set (frequent single items) is represented by a bitmap matrix VD of size M×N
  • VD(i,j) = 1 when frequent item i occurs in transaction j
  • Row i of the matrix represents g(i), the tidlist of item i
• Optimize the bitwise AND operations used for
  • tidlist intersections
  • inclusion checks
• 3 optimization techniques

Optimizations
• Data Set Projection (projection)
  • For every closed itemset Z discovered starting from a closed set X
    • g(Z) is a subset of g(X)
  • Delete from VD all the columns corresponding to transactions not occurring in g(X)
  • Because it is expensive, this process is limited to the generators of the first level of recursion

Optimizations
• Data Sets with Highly Correlated Items (section eq)
  • Columns of VD are reordered to take advantage of data correlation
  • Maximize the submatrix VE of VD in which all rows and all columns are identical
  • VE is likely to be large and to include the most frequent items
  • Many frequent itemsets can be mined within VE

Before reordering        After reordering
     T1 T2 T3 T4              T2 T4 T1 T3
A    0  1  0  1          A    1  1  0  0
B    1  1  1  1          B    1  1  1  1
C    1  1  0  1          C    1  1  1  0
D    0  1  0  1          D    1  1  0  0

Optimizations
• Reusing Results of Previous Bitwise Intersections (included)
  • To check whether an itemset X is closed, compare X with its PRE_SET
  • This requires testing g(X) ⊆ g(j) for every item j
  • A large part of g(X) may be included in g(j)
  • If g_h(X) ⊆ g_h(j) on a portion h of the bitmaps, then g_h(X∪Y) ⊆ g_h(j)
  • For the extensions X∪Y of X, the check against g(j) can therefore be limited to the part of the bitmaps complementary to h
[Figure: bitmaps of g(X∪Y), g(X) and g(j), with the portion h that no longer needs to be checked.]

Optimizations
• Actual number of bitwise AND operations vs. support threshold
• The optimizations "section eq" and "included" are the most effective

Performance Analysis
• Competitors: FP-CLOSE [GRAH03], CLOSET+ [PEI03]
• Environment: Windows XP, Pentium IV 2.8 GHz, 512 MB
• Sparse & dense data sets

Dataset        Items    Avg. Trans. Size    Transactions
T40I10D100K    1000     40                  100000
Retail         16471    13                  88162
Chess          76       37                  3196
Pumsb          7117     74                  49046

Performance Analysis
• Data sets: T40I10D100K, Retail
• DCI_CLOSED is about one order of magnitude faster

Performance Analysis
• Data sets: Chess, Pumsb

Performance Analysis
• Time efficiency of duplicate checking
• Speedup of up to a factor of six when support thresholds are small
[Figure: two plots for the Chess data set.]

References
• [GRAH03] G. Grahne and J. Zhu, “Efficiently Using Prefix-Trees in Mining Frequent Itemsets,” Proc. ICDM Workshop Frequent Itemset Mining Implementations, Dec. 2003.
• [PEI00] J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,” Proc. ACM SIGMOD Int’l Workshop Data Mining and Knowledge Discovery, May 2000.
• [PEI03] J. Pei, J. Han, and J. Wang, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, Aug. 2003.
• [TAOU00] R. Taouil, N. Pasquier, Y. Bastide, L. Lajhal, and G. Stumme, “Mining Frequent Patterns with Counting Inference,” SIGKDD Explorations, vol. 2, no. 2, Dec. 2000.
• [ZAKI02] M.J. Zaki and C.-J. Hsiao, “Charm: An Efficient Algorithm for Closed Itemsets Mining,” Proc. Second SIAM Int’l Conf. Data Mining, Apr. 2002.
• [ZAKI04] M.J. Zaki, “Mining Non-Redundant Association Rules,” Data Mining and Knowledge Discovery, vol. 9, no. 3, pp. 223-248, 2004.