Fast Mining Frequent Patterns with Secondary Memory

Kawuu W. Lin, Dept. of Computer Science and Information Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, [email protected]
Sheng-Hao Chung, Dept. of Industrial Engineering and Management, National Chiao Tung University, Hsinchu, Taiwan, [email protected]
Sheng-Shiung Huang, Dept. of Computer Science and Information Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, [email protected]
Chun-Cheng Lin, Dept. of Industrial Engineering and Management, National Chiao Tung University, Hsinchu, Taiwan, [email protected]

ABSTRACT

Data mining technology has been widely studied and applied in recent years, and frequent pattern mining is one of its important technical fields, popular not only in academia but also in the business community. With advances in technology, however, databases have grown so large that mining them entirely in main memory is often impossible. In this study we propose a novel algorithm, Hybrid Mine (H-Mine), to improve this situation. H-Mine stores part of the mining information on disk rather than in memory and, by mining with a hybrid of hard disk and memory, completes the task even with limited memory. Empirical evaluation under various simulation conditions shows that H-Mine delivers excellent performance in terms of execution efficiency and scalability.

CCS Concepts
• Information systems ➝ Information systems applications ➝ Data mining ➝ Association rules

Keywords
Data Mining; Frequent Pattern Mining; Main Memory; Disk Storage.

1. INTRODUCTION

With advances in technology and the ubiquity of computers, many types of data on human behavior can be stored for later analysis.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASE BD&SI 2015, October 07-09, 2015, Kaohsiung, Taiwan
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3735-9/15/10…$15.00
DOI: http://dx.doi.org/10.1145/2818869.2818933

Although these data obey no laws at first glance, data mining techniques can extract the hidden information within them. For example, in recent years the popular social networking sites Facebook, Google+, and YouTube have used frequent pattern mining technology on fan pages, applications, club memberships, and the like to understand user preferences; this lets them direct related information to the user or draw the user's attention to specific websites. By mining transaction data, businesses can deduce customer buying habits and then use this information effectively to increase profits. Data mining has been successfully applied to various fields, fostering the ability to detect information in vast datasets; this information can be subdivided into five parts: association rules, classification, clustering, sequential patterns, and time sequences. Here we focus on association rules. In 1994, Rakesh Agrawal et al. first proposed the Apriori algorithm [1] for mining association rules. Although Apriori is effective for finding frequent patterns, its main shortcomings are its long execution time and the memory required to store the 2-candidate sets. In 2000, Jiawei Han et al.
[2] proposed a novel data structure, the Frequent-Pattern tree (FP-tree), together with the FP-growth mining algorithm, to remedy the shortcomings of Apriori. This method compresses the data and reduces database scanning so that frequent pattern mining runs more efficiently. With the rapid development of information technology, however, databases have grown increasingly large, and the FP-growth method alone can no longer satisfy the efficiency users expect. In recent years many researchers have proposed improvements, which can be divided into three categories. The first uses multiple computing resources to improve mining efficiency, e.g., QFP-growth [3], TPFP [4], PFP-Tree [5], and FD-Mine [6]. The second reduces the required memory capacity, e.g., FP-growth and CFP-Tree [7]. The third uses the hard disk to store mining information, e.g., Database Projection [8], Aggressive Projection [9], DRFP-Tree [10], and DSP-Tree [11]. Although these recent algorithms show good results, most of them discard the partially built FP-tree when memory runs out and only then carry out subsequent processing, which wastes a great deal of time and information. Our proposed method, Hybrid Mine (H-Mine), instead keeps the unfinished FP-tree when memory runs out and builds the remaining tree nodes on the hard disk to complete the mining. The algorithm uses mixed hard-disk and memory mining to solve large-data problems even with limited memory.

2. RELATED WORKS

2.1 FP-growth Algorithm

Jiawei Han et al. [2] proposed a tree-based data structure, the Frequent Pattern tree (FP-tree), and a corresponding mining algorithm, Frequent Pattern growth (FP-growth), to mine frequent patterns. This algorithm remedies the shortcomings of Apriori, i.e., the excessive number of candidates and the long execution time. The FP-growth algorithm begins by scanning the database; it then creates a header table and counts the frequency of occurrence of each item.
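As a minimal sketch of this first scan, the following Java fragment counts item frequencies, drops infrequent items, and orders the survivors by descending count for the header table. The container types, item names, and the absolute-count support convention are our own illustrative choices, not the authors' implementation.

```java
import java.util.*;
import java.util.stream.*;

public class FirstScan {
    // First database scan of FP-growth: count each item's frequency,
    // discard items below the minimum support count, and order the
    // surviving items by descending frequency for the header table.
    public static LinkedHashMap<String, Integer> buildHeaderTable(
            List<List<String>> transactions, int minSupportCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> t : transactions)
            for (String item : t)
                counts.merge(item, 1, Integer::sum);
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= minSupportCount)
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        List<List<String>> db = List.of(
                List.of("a", "b", "c"),
                List.of("a", "c"),
                List.of("a", "d"));
        // Items a (3) and c (2) survive a minimum support count of 2.
        System.out.println(buildHeaderTable(db, 2)); // prints {a=3, c=2}
    }
}
```

The descending-frequency order produced here is what allows the second scan to insert each transaction's frequent items along shared FP-tree prefixes.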
If the count of an individual item is less than the minimum support, the item is discarded. The header table records every frequent item, its count, and a reference (called the Link) to the item's first appearance in the FP-tree.

The FP-growth algorithm finds the frequent patterns after the FP-tree has been created. It selects items from the header table one by one and, following each item's Link, builds that item's conditional sub-FP-tree. Recursion rebuilds the sub-FP-tree repeatedly until it becomes a single path or an empty tree, at which point it stops; mining then continues recursively with the next item until every item has been mined and all frequent patterns have been obtained.

2.2 Database Projection Algorithm

Jiawei Han et al. [8] addressed a shortcoming of the FP-growth algorithm [2] that causes big-data mining to fail: with a big database the FP-tree cannot be fully built in memory, so FP-growth mining cannot proceed because of insufficient memory. They therefore proposed the Database Projection algorithm. It is based on the framework of FP-growth: when confronted with insufficient memory it projects the database into smaller parts and attempts FP-growth again; if mining still fails, it continues projecting until memory can accommodate the mining and the task completes. The overall approach is thus: if memory suffices to build the FP-tree, perform FP-growth to obtain the frequent patterns; otherwise perform the Database Projection algorithm, which creates a new sub-database for each frequent item and executes recursively until the mining is complete.

2.3 CARM Algorithm

Kawuu W. Lin and Der-Jiunn Deng proposed the CARM algorithm [12], a novel parallel computing method for the cloud environment. CARM comprises two algorithms: the high-workability distributed FP-mine (HD-Mine) and the fast distributed FP-mine (FD-Mine). HD-Mine splits the mining information into small pieces, stores them on computing nodes, and then merges the nodes' information to find the frequent patterns. Because this method can be very time consuming, FD-Mine was proposed to speed up the execution. FD-Mine is aimed at improving data transmission in distributed computing, since transmission between the various clouds relies on the Internet; to reduce the amount of data transmitted, FD-Mine uses a matrix to retain only the necessary FP-tree node information (Label, Count, and Parent).

Here we propose the Hybrid Mine algorithm (H-Mine). It is based on FD-Mine and its method of retaining only the necessary FP-tree node information, with a number of improvements: the finished FP-tree can now have its necessary node information saved onto disk, which releases more memory and makes the approach suitable for mining large databases.

3. PROPOSED METHOD

In this section we introduce the proposed Hybrid Mine algorithm (H-Mine) and give details of its data structures.

3.1 Memory warning mechanism

Java provides a memory management interface with warning alerts, which we used directly. An alert is sent when memory usage reaches a value we pre-set. After this alert we store FP-tree node information on the hard disk while keeping part of the memory space free for use in subsequent mining.

3.2 Reserved node mapping disk mechanism

To search the FP-tree node information on disk quickly, we use a directory that records each node's location in an in-memory table of disk mapping addresses (the mapping table). Each node's disk mapping address is initialized to -1; when a new node is written to the hard disk, its current disk address is recorded in memory and the mapping table is updated.
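The warning alert of Section 3.1 can be sketched with Java's standard java.lang.management API, which the paper says it uses directly. The exact threshold fraction, method names, and the listener body below are our illustrative assumptions, not the authors' code.

```java
import java.lang.management.*;
import javax.management.*;

public class MemoryWarning {
    // Arm a usage-threshold alert at fraction x of each heap pool's
    // maximum, as in Section 3.1: once the threshold is crossed, the
    // caller's callback runs (e.g. to start writing tree nodes to disk).
    public static void install(double x, Runnable onWarning) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP
                    && pool.isUsageThresholdSupported()) {
                long max = pool.getUsage().getMax();
                if (max > 0)
                    pool.setUsageThreshold((long) (max * x));
            }
        }
        // The platform MemoryMXBean emits a notification when any armed
        // pool's usage threshold is exceeded.
        NotificationEmitter emitter =
                (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener((notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED
                    .equals(notification.getType()))
                onWarning.run(); // e.g. switch tree building to disk mode
        }, null, null);
    }
}
```

A caller would install the listener once, before the second database scan, and have the callback flip a flag that the tree-building loop consults for each new node.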
3.3 Disk information structure quick search and tree-building

To continue building the FP-tree after the memory warning occurs, we add two files that store node information: nodeToSeek.data and ChildNode.data. The data structure of nodeToSeek.data, which records each node's index and the disk addresses of its children in ChildNode.data, is shown in Figure 1. It is laid out as groups of three columns, and creating a new node index appends one new group: the first column uses one space to store the node index, the second column uses one space to store the current number of children, and the third column uses n spaces to store the disk addresses of the children in ChildNode.data. If the count in the second column exceeds n, a new file nodeToSeek2.data is opened for further storage, and so on.

Figure 1. The data structure of nodeToSeek.data: repeated groups of [Index | Count (max: n) | Disk addresses in ChildNode.data ...].

The data structure of ChildNode.data is shown in Figure 2. It records each node's index, label, and next sub-node (ChildNode), laid out as groups of three columns; creating a new node index appends one new group, with each column using one space to store its information.

Figure 2. The data structure of ChildNode.data: repeated groups of [Index | Label | ChildNode].

3.4 Storage FP-tree node in the disk information structure

After the memory warning has occurred, we store each node's label, count, and parent data in a TreeNodeInDisk.data file, as CARM [12] does when recording tree nodes. As shown in Figure 3, TreeNodeInDisk.data records each node's index, label, count, and parent data, laid out as groups of four columns; creating a new node appends one new group.

Figure 3. The data structure of TreeNodeInDisk.data: repeated groups of [Node | Label | Count | Parent].

3.5 LINK Header Table in the disk information structure

We add two attributes on disk to support the Link of the header table when the memory warning occurs.
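The fixed-width record files of Sections 3.3 and 3.4 (Figures 1-3) can be sketched with java.io.RandomAccessFile. The four-int record layout, file name, and method names below are illustrative assumptions, not the paper's exact on-disk format.

```java
import java.io.*;

public class TreeNodeFile {
    // One Figure 3 record = {node index, label, count, parent}: four
    // ints (16 bytes), so record i starts at byte offset 16 * i.
    static final int RECORD_BYTES = 4 * Integer.BYTES;
    private final RandomAccessFile file;

    public TreeNodeFile(File path) throws IOException {
        file = new RandomAccessFile(path, "rw");
    }

    // Append a node record and return its record number; this plays the
    // role of the disk address kept in the Section 3.2 mapping table.
    public long write(int index, int label, int count, int parent)
            throws IOException {
        long recNo = file.length() / RECORD_BYTES;
        file.seek(file.length());
        file.writeInt(index);
        file.writeInt(label);
        file.writeInt(count);
        file.writeInt(parent);
        return recNo;
    }

    // Random-access read of one record by its record number.
    public int[] read(long recNo) throws IOException {
        file.seek(recNo * RECORD_BYTES);
        return new int[] { file.readInt(), file.readInt(),
                           file.readInt(), file.readInt() };
    }
}
```

Because every group has the same width, a record number alone is enough to seek directly to any node, which is what makes the quick search of Section 3.3 possible without scanning the file.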
The first attribute, the file Next.data, stores the Link disk addresses, which connect to the node indexes of TreeNodeInDisk.data. The other attribute, NextsInDiskCOUNT, stores the number of entries in Next.data to enable fast searches. During FP-growth mining the Link disk address has an initial value of zero; whenever a new Link is created in Next.data, the corresponding header-table count in the NextsInDiskCOUNT attribute is updated.

3.6 H-Mine Algorithm

The complete H-Mine algorithm is shown in Figure 4. The input is a transaction database D, a minimum support S, and a percentage X% of reserved memory space; the output is the set of frequent patterns FP. First, we obtain the maximum available memory capacity and from it calculate the upper limit of the memory warning value (lines 2 and 3). We then scan the database to produce C1, filter by support to obtain the header table HT, establish the node mapping table, and initialize the tree (lines 4 to 7). The database is scanned again to build the tree; as each node is added we check whether memory usage exceeds the warning limit, and if so the node's information is stored on disk rather than in working memory (lines 8 to 19). After the tree is complete, mining it with the FP-growth algorithm yields the frequent patterns (line 20). Note that if the memory warning occurs during the FP-growth mining, the same hybrid memory-and-disk approach is used to store the sub-trees and finish the task.

Algorithm: Hybrid-Mine (H-Mine)
Input: Database D, minimum support S, percentage of reserved memory space X%.
Output: Frequent Patterns FP.
1.  Procedure H-Mine(D, S, X%){
2.    AvailableMemory = getMaxOfAvailableMemoryCapacity();
3.    Warning = AvailableMemory * X%;
4.    C1 = ScanDB(D);
5.    HT = getHT(C1, S);
6.    MappingNodeInDisk(HT);  // Establish the mapping table of nodes.
7.    Tree = Ø;
8.    For( d = each transaction of D ){
9.      freqitem = filter(d, HT);
10.     For( Node = each item of freqitem ){
11.       If( getCurrentMemory() < Warning ){
12.         Tree = Tree ∪ BuildTreeInMemory(Node);
13.       }Else{
14.         DiskPosition = BuildTreeInMemory(Node, Tree);  // Get the disk address of the newly added node.
15.         UpdateMappingNodeInDisk(Node, DiskPosition);   // Update the mapping table.
16.         Tree = Tree ∪ BuildTreeInDisk(Node);
17.       }
18.     }
19.   }
20.   FP = FP-growth(Tree, HT);
21.   Return FP;
22. }

Figure 4. The H-Mine algorithm.

4. EXPERIMENTAL RESULTS

4.1 Experimental Setup

To evaluate the performance of the proposed H-Mine, we used IBM's Quest synthetic data generator [13] to generate the workload, and we also experimented with real data. All experiments were run on a PC with Windows 7 Enterprise Service Pack 1, an Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz, and 1 TB of disk storage. The algorithm was implemented in Java. To verify the overall running time, we chose FP-growth [2] and Database Projection (DP) [8] for comparison because of how they handle large datasets.

4.2 Varying Support base data

The experimental data was generated with IBM's Quest Synthetic Data Generator [13] and is modeled on general transaction-record databases. The average transaction length (T), average frequent itemset length (I), number of items (N), and number of transactions (D) were 20, 10, 10K, and 1000K, respectively; this was the basic dataset for our experiments. We limited the memory to 1 GB, set the reserved memory space to 95%, and observed the efficiency of FP-growth, DP, and H-Mine while varying the support. To evaluate the execution times, we lowered the support continually to find the exact threshold at which the out-of-memory error occurs and the FP-tree no longer fits into main memory. We found that this scenario arises once the support drops below 0.5%, so we set the experimental support range from 0.3% to 0.7% for comparison.
The experimental results, summarized in Table 1 and Figure 5, show that H-Mine outperformed the FP-growth and DP algorithms in terms of execution time.

Table 1. Experimental data showing varying support, T20I10N10KD1000K (time in seconds)

Support(%) | FP-growth | DP      | H-Mine
0.3        | Fail      | 612.39  | 433.76
0.4        | Fail      | 417.535 | 177.262
0.5        | 39.161    | 39.161  | 39.161
0.6        | 8.761     | 8.761   | 8.761
0.7        | 5.22      | 5.22    | 5.22

Figure 5. Varying support: effect on execution time (T20I10N10KD1000K).

4.3 T40I10D100K.data by varying support

The T40I10D100K.data [14] dataset was generated by the IBM Almaden Quest research group. The average transaction length (T), average frequent itemset length (I), and number of transactions (D) were 40, 10, and 100K, respectively. We limited the memory to 500 MB, set the reserved memory space to 95%, and observed the efficiency of FP-growth, DP, and H-Mine while varying the support. The experimental results are shown in Table 2 and Figure 6.

Table 2. Experimental data, T40I10D100K.data (time in seconds)

Support(%) | FP-growth | DP      | H-Mine
0.8        | Fail      | 752.315 | 286.435
0.9        | Fail      | 734.075 | 273.337
1          | Fail      | 714.771 | 265.264
5          | 42.935    | 42.935  | 42.935
7          | 4.143     | 4.143   | 4.143

Figure 6. T40I10D100K.data: effect on execution time.

4.4 Webdoc.data by varying support

The Webdoc.data [15] dataset was created by Claudio Lucchese et al. The file size, average transaction length (T), number of items (N), and number of transactions (D) were 1.48 GB, 70K, 5000K, and 1600K, respectively. We limited the memory to 1 GB, set the reserved memory space to 95%, and observed the efficiency of FP-growth, DP, and H-Mine while varying the support. The experimental results are summarized in Table 3 and Figure 7.

Table 3. Experimental data, Webdoc.data (time in seconds)

Support(%) | FP-growth | DP       | H-Mine
21         | Fail      | 8614.025 | 6686.069
22         | Fail      | 6875.271 | 4610.803
23         | Fail      | 3826.958 | 2854.743
24         | 152.555   | 152.555  | 1789.202
25         | 60.428    | 60.428   | 60.428

Figure 7. Webdoc.data: effect on execution time.

5. CONCLUSIONS

Identifying the valuable frequent patterns hidden in large databases is a fundamental task in association rule mining. Past research has relied on multiple computing resources, compressed data structures, or hard-disk processing; we observe, however, that these methods will not suffice in the future as the amount of available data continues to grow rapidly. We therefore propose an efficient hybrid method that mines with a mixture of hard disk and memory, solving big-data problems even with limited memory. The experiments show that when the dataset size increases rapidly, the execution time of the H-Mine algorithm increases but its curve remains steady. For future work, we intend to further improve the efficiency of the algorithm by combining it with cloud computing technology across multiple nodes.

6. ACKNOWLEDGMENTS

Part of this work was supported by the National Science Council of Taiwan, R.O.C., under grant MOST 103-2221-E-151-033.

7. REFERENCES

[1] Agrawal, Rakesh, and Ramakrishnan Srikant, 1994, "Fast algorithms for mining association rules", Proceedings of the 20th International Conference on Very Large Data Bases, vol. 1215, pp. 487-499.
[2] Jiawei Han, Jian Pei, R. Mao, Yiwen Yin, 2000, "Mining frequent patterns without candidate generation", Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1-12.
[3] Qing-Song Xie, Yong Qiu, Yongjie Lan, 2004, "An improved algorithm of mining from FP-tree", Proceedings of the Third International Conference on Machine Learning and Cybernetics, pp. 26-29, August.
[4] Jiayi Zhou, Kun-Ming Yu, 2008, "Tidset-based parallel FP-tree algorithm for the frequent pattern mining problem on PC clusters", Advances in Grid and Pervasive Computing, Lecture Notes in Computer Science, vol. 5036, pp. 18-28.
[5] Asif Javed, Ashfaq Khokhar, 2004, "Frequent pattern mining on message passing multiprocessor systems", Distributed and Parallel Databases, vol. 16, no. 3, pp. 321-334.
[6] Kawuu W. Lin, Yu-Chin Lo, 2013, "Efficient algorithms for frequent pattern mining in many-task computing environments", Knowledge-Based Systems, vol. 49, pp. 10-21.
[7] Benjamin Schlegel, Rainer Gemulla, Wolfgang Lehner, 2011, "Memory-efficient frequent-itemset mining", Proceedings of the 14th International Conference on Extending Database Technology (EDBT/ICDT '11), pp. 461-472, ACM, New York, NY, USA.
[8] Jiawei Han, Jian Pei, R. Mao, Yiwen Yin, 2004, "Mining frequent patterns without candidate generation: a frequent-pattern tree approach", Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53-87.
[9] G. Grahne, J. Zhu, 2004, "Mining frequent itemsets from secondary memory", International Conference on Data Mining, pp. 91-98, November.
[10] Muhaimenul Adnan, Reda Alhajj, 2007, "DRFP-tree: disk-resident frequent pattern tree", Springer Science+Business Media, LLC.
[11] Alfredo Cuzzocrea, Carson K. Leung, Juan J. Cameron, 2013, "Stream mining of frequent sets with limited memory", Proceedings of the 28th Annual ACM Symposium on Applied Computing (SAC '13), pp. 173-175, ACM, New York, NY, USA.
[12] Der-Jiunn Deng, Kawuu W. Lin, 2010, "A novel parallel algorithm for frequent pattern mining with privacy preserved in cloud computing environments", International Journal of Ad Hoc and Ubiquitous Computing, pp. 205-215.
[13] Agrawal, Rakesh, Ramakrishnan Srikant, Quest Synthetic Data Generator, IBM Almaden Research Center, San Jose, California.
[14] Fabrizio Silvestri, Paolo Palmerini, Raffaele Perego, Salvatore Orlando, 2001, "DCI: a hybrid algorithm for frequent set counting", High Performance Computing Lab at ISTI-CNR, University of Venice.
[15] C. Lucchese, F. Silvestri, G. Tolomei, R. Perego, S. Orlando, 2011, "Identifying task-based sessions in search engine query logs", Proceedings of the Fourth International Conference on Web Search and Web Data Mining (WSDM 2011), ACM, pp. 277-286.