
Recent Advances in Electrical and Computer Engineering
ISBN: 978-1-61804-228-6

Reducing the number of sub-trees for frequent itemsets mining

Supatra Sahaphong and Gumpon Sritanratana

This work was supported in part by Ramkhamhaeng University, Ramkhamhaeng Road, Bangkapi District, Bangkok 10240, Thailand.
Supatra Sahaphong is with the Department of Computer Science, Faculty of Science, Ramkhamhaeng University, Ramkhamhaeng Road, Bangkapi District, Bangkok 10240, Thailand (corresponding e-mail: [email protected], [email protected]).
Gumpon Sritanratana is with the Department of Mathematics, Faculty of Science, Buriram Rajabhat University, Muang, 31000, Thailand (e-mail: [email protected]).

Abstract—This paper aims to develop a new algorithm to mine all frequent itemsets from a transaction database. The new mining algorithm, called the vertical index list (VIL) tree, scans the database only once and does not generate any candidate itemsets. The decision maker can change the minimum support threshold without rescanning the database. The VIL-tree algorithm uses a sorted VIL, so that many of the frequent itemsets are generated first. The subsequent trees are smaller, which reduces tree construction and traversal as well as the number of recursive mining steps. Reducing node construction and the number of sub-trees in turn reduces run time and memory consumption. Experiments compare the run time and memory consumption of the proposed algorithm with those of the frequent pattern (FP) growth algorithm. Both algorithms are evaluated on benchmark synthetic datasets. The experimental results demonstrate that VIL-tree provides better performance than FP-growth in terms of run time and space consumption.

Keywords—Association rule mining, data mining, frequent itemsets mining, knowledge discovery.

I. INTRODUCTION

Frequent itemsets mining is an essential step in association rule mining. Association rule mining can be decomposed into two major subtasks: first, the generation of all frequent itemsets that satisfy the minimum support threshold; second, the extraction of all high-confidence rules from the frequent itemsets found in the previous step. Our work focuses on the first subtask.

The first classic algorithm is Apriori, proposed in [1]. The Apriori principle is "If an itemset is frequent, then all of its subsets must also be frequent" [2]. The Apriori algorithm uses a level-wise, breadth-first search approach for generating association rules. It uses support-based pruning to control the exponential growth of candidate itemsets. Algorithms based on generating and testing candidate itemsets have two major problems. The database must be scanned multiple times to generate candidate sets, which increases the I/O load and is time-consuming. Moreover, generating huge candidate sets and calculating their support consumes a lot of CPU time.

These drawbacks were overcome by the next generation of algorithms, represented by the FP-growth algorithm [3]. The advantages of mining frequent itemsets with the FP-growth algorithm are as follows. The database is scanned only twice, so the time consumed is decreased. The generation of candidate sets is not required, so the I/O load is reduced. The FP-growth algorithm performs a depth-first search of the search space. It encodes the data set using a compact data structure called the FP-tree and extracts frequent patterns directly from this prefix tree [4].

The following research has improved on this idea. In [5], the H-mine algorithm was introduced, using array-based and trie-based data structures. The Patricia Mine algorithm proposed in [6] uses a compressed Patricia trie to store the data sets. The FPgrowth* algorithm reduced the FP-tree traversal time by using an array technique [7]. In [8], the SFI-Mine algorithm constructs a pattern base using a new method that differs from the pattern base in FP-growth, and mines frequent itemsets with a new combination method without recursive construction of conditional FP-trees. However, most FP-tree-based algorithms have the following drawbacks. First, mining frequent itemsets from the FP-tree generates a huge number of conditional FP-trees and takes a lot of time and space. Second, when the minimum support changes, the algorithm may have to restart and scan the database twice.

Many researchers have proposed ways to scan the database only once. The Eclat algorithm uses the join step from the Apriori property to generate frequent patterns [9]. In [10], a new data structure called the LIB-graph was proposed to hold the data while the database is scanned, with frequent patterns discovered using recursive conditional FP-trees. The Sorted-List structure, created from the Vertical Index List, was proposed to hold the data from a single database scan, with frequent itemsets mined by depth-first search [11]. Moreover, if the decision maker wants to change the minimum support threshold, the algorithm runs without rescanning the database [12].

This paper proposes a new algorithm to mine all frequent itemsets. The features of the proposed algorithm are as follows. The database is scanned only once to mine frequent itemsets, and the algorithm mines frequent itemsets without generating candidate sets. The decision maker can change the minimum support threshold at any time without rescanning the database. The proposed algorithm reduces the number of sub-trees and loops in the mining steps. Therefore, the proposed method can reduce both run time and space consumption; experiments test the run time and memory consumption of the VIL-tree and FP-growth algorithms.
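The single scan and the changeable minimum support threshold described above can be illustrated with a minimal Python sketch. It is not the authors' C implementation and it does not build the VIL-tree; it only assumes a vertical layout in which each item is mapped to the IDs of the transactions that contain it. The function names and the five-transaction database are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def build_vertical_index(transactions):
    """Scan the transactions once; map each item to the IDs of the transactions containing it."""
    index = defaultdict(list)
    for tid, items in enumerate(transactions, start=1):
        for item in set(items):
            index[item].append(tid)  # IDs are appended in ascending order
    # Rows ordered by descending support (ties broken by item name).
    return sorted(index.items(), key=lambda row: (-len(row[1]), row[0]))


def frequent_items(vil, minsup):
    """Filter the stored rows; the original transactions are not read again."""
    return [(item, len(tids)) for item, tids in vil if len(tids) >= minsup]


def support(vil, itemset):
    """Support of a non-empty itemset: size of the intersection of its items' ID lists."""
    rows = dict(vil)
    tid_sets = [set(rows.get(item, ())) for item in itemset]
    return len(set.intersection(*tid_sets))


# Hypothetical database, for illustration only.
transactions = [
    ["a", "b", "c"],
    ["a", "c"],
    ["a", "d"],
    ["b", "c"],
    ["a", "b", "c"],
]
vil = build_vertical_index(transactions)       # the only pass over the data
print(vil[0])                                  # ('a', [1, 2, 3, 5])
print(frequent_items(vil, minsup=3))           # [('a', 4), ('c', 4), ('b', 3)]
print(frequent_items(vil, minsup=4))           # [('a', 4), ('c', 4)]
print(support(vil, {"a", "c"}))                # 3
```

Changing `minsup` only re-filters the stored rows, which is why no second scan of `transactions` is needed in this sketch.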
The method still obtains the complete and correct set of frequent itemsets. This paper is organized as follows. Prior knowledge is presented in Section II, followed by the approach in Section III; the results and discussion are shown in Section IV and, finally, the conclusion is given in Section V.

II. PRIOR KNOWLEDGE

This section introduces basic concepts for mining frequent itemsets. The following definitions are given by Han et al. in [4, 12]. Let $I = \{x_1, x_2, \ldots, x_m\}$ be a set of items and $DB = \{T_1, T_2, \ldots, T_n\}$ be a transaction database, where $T_1, T_2, \ldots, T_n$ are transactions that contain items in $I$. The support, or supp (occurrence frequency), of a pattern A, where A is a set of items, is the number of transactions containing A in DB. A pattern A is frequent if A's support is no less than a predefined minimum support threshold, minsup. Given a DB and a minimum support threshold minsup, the problem of finding the complete set of frequent itemsets is called the frequent-itemsets mining problem.

A data structure called a vertical index list (VIL), introduced in [2, 9, 11], is summarized as follows. Let $T_i = \{x_1, x_2, \ldots, x_m\}$ be a transaction in DB, where $i = 1, 2, \ldots, n$ and $x_j$ is an item for $j = 1, 2, \ldots, m$. A vertical index list (VIL) is the structure constructed from a single scan of each $T_i$ in DB. Each row in the VIL contains an item in $I$, the support of that item, and the transactions in DB that contain the item. The transactions are listed in ascending order of their identification numbers. The items are listed in descending order of their support. The algorithm in Fig. 1 shows how to construct the VIL.

Fig. 1 The construction of VIL

III. THE APPROACH

This section introduces a new algorithm called VIL-tree. The features of the proposed algorithm are as follows. The transaction database is scanned only once to mine frequent itemsets. The VIL-tree algorithm mines frequent itemsets without generating candidate sets and reduces the number of sub-trees and loops in the mining steps. Therefore, the proposed method can reduce both run time and space consumption.

The transaction database is scanned once to construct a VIL. An itemset-tree structure, which is a general tree, is then constructed from the VIL; it is called a vertical index list-tree of itemsets, denoted VIL-tree (itemsets). It is a finite set of one or more nodes. It consists of the root of the tree, a set of item sub-trees as the children of the root, and a set of header tables. Each node in the tree comprises five fields: two value fields, item-name and support, and three pointer fields, same-item, parent and child. Each member of the header table consists of two fields, item-name and head of node link. Each node of the tree has the form (frequent itemset : support). The algorithm in Fig. 2 shows how to construct the VIL-tree and the algorithm in Fig. 3 shows how to mine all frequent itemsets.

Fig. 2 The construction of VIL-tree

Fig. 3 VIL-tree algorithm

IV. RESULTS AND DISCUSSION

This section presents experiments that test the run time and memory consumption of the VIL-tree and FP-growth algorithms on two synthetic datasets with varying minimum support thresholds. The experiments were performed on Microsoft Windows 7 Home Premium with an Intel(R) Core(TM) i5-2467M processor and 4 GB of RAM. All algorithms were coded in C. The two synthetic datasets, generated by the IBM Almaden Quest research group [13-14], were used to produce the experimental results. The datasets are available from the FIMI repository, which is a result of the workshops on frequent itemset mining implementations [15-16]. The two original synthetic databases are T10I4D100K and T40I10D100K.

In Fig. 4 and Fig. 5, when the minimum support is high, the number of frequent itemsets is low; when the minimum support is low, many frequent itemsets are obtained. VIL-tree is always faster than the FP-growth method because the frequent items in the VIL are processed in order of decreasing support. Node construction and sub-trees are reduced, resulting in a reduction in run time and memory consumption. Fig. 6 and Fig. 7 show that the memory consumption of VIL-tree is less than that of FP-growth at every minimum support threshold. This is because node construction and the size of the trees are reduced, so memory consumption decreases.

Fig. 4 Run time of mining on T10I4D100K

Fig. 5 Run time of mining on T40I10D100K

Fig. 6 Memory consumption of mining on T10I4D100K

Fig. 7 Memory consumption of mining on T40I10D100K

V. CONCLUSION

This paper aimed to develop a new algorithm to mine all frequent itemsets from a transaction database. The new mining algorithm, called the vertical index list (VIL) tree, scans the database only once and does not generate any candidate itemsets. The decision maker can change the minimum support threshold at any time without rescanning the database. The research provided experiments in which run time and memory consumption were tested in comparison with the frequent pattern (FP) growth algorithm. Both algorithms were evaluated on benchmark synthetic datasets. The experimental results demonstrate that VIL-tree provides better performance than FP-growth in terms of run time and space consumption.

We summarize the features of this research as follows. The VIL-tree algorithm scans the database only once; moreover, it uses a sorted VIL so that many of the frequent itemsets are generated first. The next VIL-tree is smaller and sub-trees are reduced, which reduces the number of loops in the mining steps. Therefore, run time and space consumption are reduced.

ACKNOWLEDGMENT

We express our deep appreciation and gratitude to Associate Professor Dr. Veera Boonjing of the Department of Mathematics and Computer Science, Faculty of Science, King Mongkut’s Institute of Technology Ladkrabang, Thailand for his guidance and suggestions. We would also like to express our appreciation and gratitude to Associate Professor Dr. Tawesak Tanwandee of Mahidol University, Thailand for his help in proofreading this paper.

REFERENCES

[1] R. Agrawal and R. Srikant: Fast Algorithm for Mining Association Rules, Proceedings of the 20th International Conference on Very Large Data Bases, Chile, September 1994, 487-499.
[2] P-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining (Pearson Education Inc., 2006).
[3] J. Han, J. Pei, and Y. Yin: Mining Frequent Pattern without Candidate Generation, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Texas, May 2000, 1-12.
[4] J. Han, J. Pei, Y. Yin, and R. Mao: Mining Frequent Pattern without Candidate Generation: a Frequent Pattern Tree, Springer, vol. 8, no. 1, 2004, 53-87.
[5] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang: H-mine: Hyper-Structure Mining of Frequent Patterns in Large Databases, Proceedings of the 2001 IEEE International Conference on Data Mining, USA, November 2001, 441-448.
[6] A. Pietracaprina and D. Zandolin: Mining Frequent Itemsets Using Patricia Tries, Proceedings of the 3rd IEEE International Conference on Data Mining, Florida, USA, November 2003.
[7] G. Grahne and J. Zhu: Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proceedings of the 3rd IEEE International Conference on Data Mining, Florida, USA, November 2003.
[8] S. Sahaphong and V. Boonjing: The Combination Approach to Frequent Itemsets Mining, Proceedings of the 2008 International Conference on Convergence and Hybrid Information Technology, Korea, November 2008, 565-570.
[9] M. J. Zaki: Scalable Algorithms for Association Mining, IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, 2000, 372-390.
[10] D. J. Chai, L. Jin, B. Hwang, and K. H. Ryu: Frequent Pattern Mining Using Bipartite Graph, Proceedings of the 18th International Conference on Database and Expert Systems Applications, Germany, August 2007, 182-186.
[11] S. Sahaphong: Frequent Itemsets Mining Using Vertical Index List, Proceedings of the 2nd IEEE International Conference on Computer Science and Information Technology, China, August 2009, 480-484.
[12] J. Han and M. Kamber: Data Mining: Concepts and Techniques, Elsevier, Maryland Heights MO, 2006.
[13] Frequent Itemset Mining Dataset Repository, “T10I4D100K”. Available: http://fimi.cs.helsinki.fi/data/
[14] Frequent Itemset Mining Dataset Repository, “T40I10D100K”. Available: http://fimi.cs.helsinki.fi/data/
[15] Workshop on Frequent Itemset Mining Implementations. Available: http://fimi.ua.ac.be/fimi03/
[16] Workshop on Frequent Itemset Mining Implementations. Available: http://fimi.ua.ac.be/fimi04/

Supatra Sahaphong received her B.S. (Computer Science) from Ramkhamhaeng University (RU), Thailand, in 1990, an M.S. (Information Technology) from King Mongkut’s Institute of Technology Ladkrabang (KMITL), Thailand, in 1998, and a Ph.D. (Computer Science) from KMITL in 2011. From 1991 to 1999, she worked at the Institute of Computer, RU, and in 1999 she joined the Department of Computer Science, Faculty of Science, RU, where she is currently an Assistant Professor. She has been invited to be a program committee member and organizing committee member of many international conferences, as well as a reviewer for many journals.
She received the best paper award from the Second International Workshop on Frontiers of Information Technology, Applications and Tools. Her research interests include data mining, knowledge discovery, and image segmentation.

Gumpon Sritanratana is an Assistant Professor at the Department of Mathematics, Faculty of Science, Buriram Rajabhat University, Thailand. He received a Ph.D. in Mathematics from Chiang Mai University, Thailand. He worked at the Department of Mathematics, Mahidol University, Thailand. He has been invited to be a program committee member and organizing committee member of many international conferences and a reviewer for many journals. His research activities involve partial differential operators such as the Laplacian, ultra-hyperbolic and domain operators. His interest is in certain partial differential equations involving tempered distribution solutions. Another area of interest involves distribution theory and matrix calculus, especially some matrix convolutions of functions and distributions.
