Fast and Memory Efficient Mining of Frequent Closed Itemsets
Claudio Lucchese, Salvatore Orlando, Raffaele Perego
DB group seminar
Presenter: Leonidas

Abstract
• Frequent Itemsets Mining
• Closed Itemsets
• Mining Frequent Closed Itemsets
• Handling duplicates
• Brief introduction of the algorithm
• Experimental results

Frequent Itemsets Mining
• Given a set of items ℐ and a set of transactions D
  • An itemset I is a set of items from ℐ
  • A transaction t in D is a set of items from ℐ
• Support of a k-itemset I, supp(I): the number of transactions in D that include I
• Discover all the itemsets from ℐ with support > min_supp
• Well-known algorithm: Apriori
  • Discovers frequent itemsets

Weaknesses & Solutions
• The number of frequent itemsets grows quickly as min_supp decreases
  • Complexity of the mining task increases rapidly
  • Huge size of output
  • Complex for analysis
• Closed itemsets are one of the solutions
  • Unique maximal elements of the equivalence classes defined over the lattice of all the frequent itemsets

Weaknesses & Solutions
• Equivalence class
  • Distinct group of frequent itemsets
  • Supported by the same set of transactions
  • Represent the same knowledge
• Vertical bitwise representation of the data set
• Association rules extracted are more meaningful [ZAKI04]
  • Redundancies are removed
• Suitable for dense data sets
  • Frequent closed itemsets are much fewer than frequent itemsets

Closed Itemsets
• I is a subset of the items appearing in D (I ⊆ ℐ)
• T is a subset of the transactions in D (T ⊆ D)
• Define two functions:
    f(T) = {i ∈ ℐ | ∀t ∈ T, i ∈ t}
    g(I) = {t ∈ D | ∀i ∈ I, i ∈ t}
• Itemset I is closed iff c(I) = f(g(I)) = (f ∘ g)(I) = I
• The function c = f ∘ g is called the Galois operator or closure operator

    TID  Items
    1    B D
    2    A B C D
    3    A C D
    4    C

Equivalence classes
• Two itemsets belong to the same equivalence class iff
  • They have the same closure
  • They are supported by the same set of transactions
• An itemset I is closed iff
  • No superset of I has the same support

[Figure: lattice of the frequent itemsets of the example data set, annotated with supports and grouped into equivalence classes. ∅: 4; A: 2, B: 2, C: 3, D: 3; AB: 1, AC: 2, AD: 2, BC: 1, BD: 2, CD: 2; ABC: 1, ABD: 1, ACD: 2, BCD: 1; ABCD: 1. Legend: support, frequent itemset, frequent closed itemset, equivalence class.]

Mining Frequent Closed Itemsets
• Search space browsing
  • Traverse the lattice of frequent itemsets from one equivalence class to another
• Closure computation
  • Compute the closure of frequent itemsets
  • Determine the closed itemsets
• Closure generator:
  • A single representative of an equivalence class
  • All the closed itemsets can be mined by computing the closure of the generator of each class

Browsing the Search Space
• Choose the key patterns (minimal elements) as generators
• Traverse the lattice formed by the key patterns with an Apriori-like algorithm [TAOU00]
• Unfortunately, the same closed itemset can be reached from more than one key pattern

[Figure: itemset lattice of the example]

Browsing the Search Space
• Closure climbing
  • New generators are built as supersets of the closed itemsets discovered so far
  • Jump from one equivalence class to another
  • Cannot ensure that the equivalence class has not been visited yet

[Figure: itemset lattice of the example]

Problem of duplicates
• Duplicate checking is needed to avoid generating the same closed itemset more than once
• To avoid useless, expensive closure operations, use the following lemma (Lemma 1):
    Given two itemsets X and Y, if X ⊂ Y and supp(X) = supp(Y) (i.e., |g(X)| = |g(Y)|), then c(X) = c(Y)
• However, this is still expensive in time and space
  • All the mined closed itemsets need to be kept in main memory
• Several algorithms are forced to adopt a strict lexicographic visiting order of the search space to ensure correct duplicate avoidance
  • CHARM [ZAKI02], CLOSET [PEI00], CLOSET+ [PEI03]

Computing Closures
• Besides the Galois operator, make use of the lemma (Lemma 2):
    Given an itemset X and an item i ∈ ℐ, g(X) ⊆ g(i) ⇔ i ∈ c(X)
• Perform the inclusion check for all items in ℐ
• The check benefits from the vertical representation of the data set (a list of tidlists)
• The calculation can be either offline or online
  • Offline: compute closures for the entire set of generators
    • Uses key patterns, generators are shorter
  • Online: compute the closure of each generator as it is discovered
    • Uses closure climbing, generators are longer
    • Fewer checks for longer generators, more efficient tidlist operations

          A  B  C  D
    T1    0  1  0  1
    T2    1  1  1  1
    T3    1  0  1  1
    T4    0  0  1  0

Handling duplicates
• Goal: identify the unique generator of each equivalence class
  • Define the order-preserving property of a generator
  • Check whether a given generator is order-preserving or not
  • Compute the closure of order-preserving generators only
  • Prune the other generators

Handling duplicates
• Order-preserving property of generators:
    A generator of the form X = Y ∪ i, where Y is a closed itemset and i ∉ Y, is said to be order-preserving iff either c(X) = X or i ≺ (c(X) \ X)
• It means that if items need to be added to an order-preserving generator to compute its closure, they must follow the item i in the order
• Order-preserving generators are introduced to avoid generating the same closed itemset more than once

Example
• {A} = ∅ ∪ {A} is an order-preserving generator
  • A ≺ c(A) \ A = {C, D}
• {C,D} = {C} ∪ {D} is not order-preserving
  • D ⊀ c({C,D}) \ {C,D} = {A}

[Figure: itemset lattice and bitmap of the example]

Handling duplicates
• We need to check whether a generator is order-preserving or not
• Define a set called pre-set(gen) for a generator gen = Y ∪ i:
    pre-set(gen) = {j | j ∈ ℐ, j ∉ gen, and j ≺ i}
• We can now check whether a generator is order-preserving by testing:
    ∃ j ∈ pre-set(gen) such that g(gen) ⊆ g(j)
• If yes, then gen is not order-preserving

Handling duplicates
• The goal is to compute the closure of order-preserving generators only
• For any closed itemset Y, there exists a sequence of order-preserving generators
  • Closure climbing follows a sequence of closed itemsets and reaches Y
• For each closed itemset Y, the sequence of order-preserving generators is unique

Handling duplicates
• Example: Y = {A, B, C, D}
    Y0 = c(∅) = ∅
    gen0 = ∅ ∪ {A}
    Y1 = c(gen0) = {A, C, D}
    gen1 = {A, C, D} ∪ {B}
    Y = c(gen1) = {A, B, C, D}
• Note: Y0 ⊂ Y1 ⊂ Y

[Figure: path in the lattice from ∅ (4) through the generator ∅ ∪ {A} to ACD (2), then through the generator {A, C, D} ∪ {B} to ABCD (1)]

The DCI_CLOSED Algorithm
• Two different types of data sets
  • Dense & sparse
• Dense data sets
  • Transactions are long
  • Contain strongly correlated items
  • Mining closed itemsets becomes more expensive
• In sparse data sets, the number of closed itemsets may be nearly equal to the number of frequent itemsets
• The algorithm is separated into two parts
  • DCI_CLOSEDs() & DCI_CLOSEDd()

The DCI_CLOSED Algorithm
• Discriminate between sparse and dense data sets:
  • Scan the data set to find the frequent single items F1 ⊆ ℐ
  • Build the bitwise vertical data set VD
    • Items are sorted by increasing frequency
        A 1010111
        B 0101101
        E 0101100
        …
• Decide whether the data set is sparse or dense; dense if:
  • The percentage of 1s is large
  • A large set of items is strongly correlated
    • Compute the percentage of the most frequent items that co-occur in the same transactions

The DCI_CLOSED Algorithm
• 3 input parameters, initially:
  • CLOSED_SET = c(∅), PRE_SET = ∅, POST_SET = F1 \ c(∅)
• Get an item i from POST_SET (the minimum in the order)
• Add i to CLOSED_SET to build new_gen (closure climbing)
• Check the validity of the generator new_gen against PRE_SET
• Compute the closure of new_gen using Lemma 2 to obtain the new CLOSED_SET
• A new closed set is generated from new_gen

The DCI_CLOSED Algorithm
• PRE_SET is used to check the validity of new_gen
  • Guarantees that duplicate generators are correctly pruned
• POST_SET is used to guarantee that generators are produced according to Theorem 1
  • POST_SET contains the items j that follow i in lexicographic order and are not yet included in CLOSED_SET
      POST_SET_new = {j ∈ POST_SET | i ≺ j and j ∉ X}

Running example of DCI_CLOSEDd()
• CLOSED_SET = c(∅) = ∅, PRE_SET = ∅, POST_SET = {A,B,C,D}
• Compute the closure of the generator gen = ∅ ∪ {A} = {A}
• Check with PRE_SET → order-preserving
• For each j ∈ POST_SET, check whether g(A) ⊆ g(j)
  • If yes, include j in CLOSED_SET

[Figure: lattice; generator = ∅ ∪ {A}, closed itemset ACD]

Running example of DCI_CLOSEDd()
• CLOSED_SET = {A,C,D}, PRE_SET = ∅, POST_SET = {B}
• New generator gen = {A,C,D} ∪ {B} = {A,B,C,D}
• Check with PRE_SET → order-preserving
• gen is closed since POST_SET is empty
• Note: going from {A,C,D} to {A,B,C,D}, the items need not be added in order

[Figure: lattice; generator = {A, C, D} ∪ {B}, closed itemset ABCD]

Running example of DCI_CLOSEDd()
• gen = ∅ ∪ {B}, PRE_SET = {A}, POST_SET = {C,D}
• gen is order-preserving, as found by checking against g(A)
• Checking g(B) against g(C) and g(D) gives c(B) = {B,D}
• {B,D} is closed, as found by checking against POST_SET

[Figure: lattice; generator = ∅ ∪ {B}, closed itemset BD]

Running example of DCI_CLOSEDd()
• CLOSED_SET = {B,D}, PRE_SET = {A}, POST_SET = {C}
• gen is now {B,D} ∪ {C} = {B,C,D}
• Check g({B,C,D}) against g(A): g({B,C,D}) ⊂ g(A)
• gen is not order-preserving and can be pruned together with all its possible extensions

[Figure: lattice; generator = {B, D} ∪ {C}, pruned]

Running example of DCI_CLOSEDd()
• gen = ∅ ∪ {C}, PRE_SET = {A,B}, POST_SET = {D}
• gen is order-preserving, as found by checking against g(A) and g(B)
• gen cannot be extended, as found by checking against POST_SET, so it is closed

[Figure: lattice; generator = ∅ ∪ {C}, closed itemset C]

Running example of DCI_CLOSEDd()
• CLOSED_SET = {C}, PRE_SET = {A,B}, POST_SET = {D}
• gen is now {C} ∪ {D} = {C,D}
• Check g({C,D}) against g(A): g({C,D}) ⊆ g(A)
• gen is not order-preserving and can be pruned together with all its possible extensions

[Figure: lattice; generator = {C} ∪ {D}, pruned]

Running example of DCI_CLOSEDd()
• gen = ∅ ∪ {D}, PRE_SET = {A,B,C}, POST_SET = ∅
• gen is order-preserving, as found by checking against g(A), g(B) and g(C)
• gen cannot be extended, as found by checking against POST_SET, so it is closed

[Figure: lattice; generator = ∅ ∪ {D}, closed itemset D]

Optimizations
• The vertical data set (frequent single items) is represented by an M×N bitmap matrix VD
  • VD(i,j) = 1 iff frequent item i occurs in transaction j
  • Row i of the matrix represents g(i), the tidlist of item i
• Optimize the bitwise AND operations used for
  • tidlist intersections
  • inclusion checks
• 3 optimization techniques

Optimizations
• Data Set Projection (projection)
  • For every closed itemset Z discovered starting from the closed set X, g(Z) ⊆ g(X)
  • Delete from VD all columns corresponding to transactions not occurring in g(X)
  • Since projection is expensive, it is applied only to generators at the first level of recursion

Optimizations
• Data Sets with Highly Correlated Items (section eq)
  • Columns of VD are reordered to profit from data correlation
  • Maximize the submatrix VE of VD whose rows are all identical
  • VE is likely to be large and to include the most frequent items
  • Many frequent itemsets can be mined within VE

    Before reordering          After reordering
       T1 T2 T3 T4                T2 T4 T1 T3
    A   0  1  0  1             A   1  1  0  0
    B   1  1  1  1             B   1  1  1  1
    C   1  1  0  1             C   1  1  1  0
    D   0  1  0  1             D   1  1  0  0

Optimizations
• Reusing Results of Previous Bitwise Intersections (included)
  • To check whether an itemset X is closed, compare g(X) with g(j) for the items j in its PRE_SET
  • The check tests whether g(X) ⊆ g(j) for each j
  • A large part of g(X) may be included in g(j)
  • If g_h(X) ⊆ g_h(j) on a portion h of the bitmap, then g_h(X∪Y) ⊆ g_h(j) as well
  • We can limit the check of the various g(j) to the complementary part of g_h(j)

[Figure: tidlists g(X), g(X∪Y) and g(j), with the portion h marked]

Optimizations
• Actual number of bitwise AND operations vs. support threshold
• The optimizations “section eq” & “included” are the most effective

Performance Analysis
• Competitors: FP-CLOSE [GRAH03], CLOSET+ [PEI03]
• Environment: Windows XP, Pentium IV 2.8GHz, 512MB
• Sparse & dense data sets

    Dataset       Items   Avg. trans. size   Transactions
    T40I10D100K   1000    40                 100000
    Retail        16471   13                 88162
    Chess         76      37                 3196
    Pumsb         7117    74                 49046

Performance Analysis
• Data sets: T40I10D100K, Retail
• DCI_CLOSED is faster by one order of magnitude

Performance Analysis
• Data sets: Chess, Pumsb

Performance Analysis
• Time efficiency of duplicate checking
• Speedup of up to a factor of six when support thresholds are small

[Figure: results on the chess data set]

References
• [GRAH03] G. Grahne and J. Zhu, “Efficiently Using Prefix-Trees in Mining Frequent Itemsets,” Proc. ICDM Workshop Frequent Itemset Mining Implementations, Dec. 2003.
• [PEI00] J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,” Proc. ACM SIGMOD Int’l Workshop Data Mining and Knowledge Discovery, May 2000.
• [PEI03] J. Pei, J. Han, and J. Wang, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, Aug. 2003.
• [TAOU00] R. Taouil, N. Pasquier, Y. Bastide, L. Lakhal, and G. Stumme, “Mining Frequent Patterns with Counting Inference,” SIGKDD Explorations, vol. 2, no. 2, Dec. 2000.
• [ZAKI02] M.J. Zaki and C.-J. Hsiao, “Charm: An Efficient Algorithm for Closed Itemsets Mining,” Proc. Second SIAM Int’l Conf. Data Mining, Apr. 2002.
• [ZAKI04] M.J. Zaki, “Mining Non-Redundant Association Rules,” Data Mining and Knowledge Discovery, vol. 9, no. 3, pp. 223-248, 2004.
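
Sketch: the closure operator on the example data set
The functions f and g from the Closed Itemsets slide, and the closure c = f ∘ g, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transactions are taken from the example bitmap (T1 = {B,D}, T2 = {A,B,C,D}, T3 = {A,C,D}, T4 = {C}), and the names DATASET, ITEMS, closure and is_closed are hypothetical helpers chosen here, not part of DCI_CLOSED.

```python
# Example data set, read off the bitmap in the slides (an assumption of this sketch).
DATASET = {
    1: frozenset("BD"),
    2: frozenset("ABCD"),
    3: frozenset("ACD"),
    4: frozenset("C"),
}
ITEMS = frozenset().union(*DATASET.values())


def g(itemset):
    """Tidset of an itemset: ids of the transactions containing every item."""
    return frozenset(tid for tid, t in DATASET.items() if set(itemset) <= t)


def f(tids):
    """Items shared by all transactions in tids (all items if tids is empty)."""
    common = set(ITEMS)
    for tid in tids:
        common &= DATASET[tid]
    return frozenset(common)


def closure(itemset):
    """Galois closure c = f o g."""
    return f(g(itemset))


def is_closed(itemset):
    """An itemset is closed iff it equals its own closure."""
    return closure(itemset) == frozenset(itemset)
```

With this data, closure("A") and closure("CD") both give {A, C, D}, so A, CD and ACD fall into the same equivalence class, while C and D are closed on their own.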

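Sketch: closure climbing with PRE_SET and POST_SET
The recursion described in the DCI_CLOSED slides can be approximated as follows. This is a simplified sketch, not the authors' implementation: it leaves out the sparse/dense split and all three optimizations, uses Python integers as bitmaps, orders items lexicographically as in the running example, and treats an itemset as frequent when its support is >= min_supp (an assumption made here). The function name dci_closed_sketch and its helpers are hypothetical.

```python
def dci_closed_sketch(transactions, min_supp):
    """Return {closed itemset: support} using PRE_SET/POST_SET pruning."""
    items = sorted({i for t in transactions for i in t})
    # Vertical bitmap: bit k of tid[i] is set iff transaction k contains item i.
    tid = {i: sum(1 << k for k, t in enumerate(transactions) if i in t)
           for i in items}
    n = len(transactions)
    all_tids = (1 << n) - 1
    result = {}

    def supp(bits):
        return bin(bits).count("1")

    def included(a, b):
        # Bitwise inclusion check: every tid set in a is also set in b.
        return a & ~b == 0

    def mine(closed_set, closed_tids, pre_set, post_set):
        pre_set = list(pre_set)  # local copy, extended as siblings are visited
        for pos, i in enumerate(post_set):
            gen_tids = closed_tids & tid[i]  # g(new_gen), new_gen = closed_set + i
            if supp(gen_tids) < min_supp:
                continue
            # new_gen is not order-preserving if an earlier item covers g(new_gen).
            if any(included(gen_tids, tid[j]) for j in pre_set):
                continue
            new_closed = closed_set | {i}
            new_post = []
            for j in post_set[pos + 1:]:
                if included(gen_tids, tid[j]):  # Lemma 2: j belongs to c(new_gen)
                    new_closed = new_closed | {j}
                else:
                    new_post.append(j)
            result[frozenset(new_closed)] = supp(gen_tids)
            mine(new_closed, gen_tids, pre_set, new_post)
            pre_set.append(i)

    frequent = [i for i in items if supp(tid[i]) >= min_supp]
    root = frozenset(i for i in frequent if tid[i] == all_tids)  # c(empty set)
    if n >= min_supp:
        result[root] = n
    mine(root, all_tids, [], [i for i in frequent if i not in root])
    return result


example = [set("BD"), set("ABCD"), set("ACD"), set("C")]
closed = dci_closed_sketch(example, min_supp=1)
```

On the four example transactions with min_supp = 1, the generators {B,D} ∪ {C} and {C} ∪ {D} are pruned by the PRE_SET check, and the closed itemsets found are ∅, ACD, ABCD, BD, C and D, which matches the running example.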