A New PC Cluster Based Distributed Association Rule Mining

Ferenc Kovács, Sándor Juhász
Department of Automation and Applied Informatics, Budapest University of Technology and Economics, Faculty of Electrical Engineering and Informatics, 1111 Budapest, Goldmann György tér 3. IV. em., Hungary
E-Mail: {kovacs.ferenc, juhasz.sandor}@aut.bme.hu

Abstract – Association rule mining is one of the most important problems in data mining. It requires very large computation and I/O capacity, so several parallel mining algorithms have been developed that can exploit the performance of cluster systems. These algorithms were optimised and developed on supercomputer platforms, but nowadays the performance of PCs makes it possible to build cluster systems much more cheaply. The use of PC cluster systems raises some issues in the optimisation of distributed mining algorithms, especially the cost of node-to-node communication and of data distribution. The current association rule mining algorithms do not consider the cost of data distribution, and they do not perform any data processing during the data distribution phase. Most distributed association rule mining algorithms are based on the Apriori algorithm, and therefore they inherit its drawbacks. In this paper a new distributed association rule mining algorithm is introduced. It is based on a modified version of the Apriori algorithm that resolves the bottlenecks of the original, and it can perform significant data processing already during the data distribution phase.

Keywords: data mining, association rule mining, distributed data mining

I. INTRODUCTION

Association rule mining (ARM) is a very important task within the area of data mining [1]. Many algorithms have been developed to find association rules, but Apriori is the best known [2].
One of the main disadvantages of the Apriori algorithm is its I/O cost. The paper [3] introduces an algorithm that cuts the I/O cost, but it has higher memory requirements. Because of the complexity of the ARM task, several parallel algorithms have been developed, most of them based on the Apriori algorithm. The count distribution (CD) like algorithms [4] generate the smallest network traffic, because each node sends only its own counters to the other nodes. The data distribution (DD) based algorithms [4] generate higher network traffic, because the nodes send not only their local counters but their own database partitions as well. There are some distributed algorithms that are not based on Apriori; for instance, [5] contributes such an algorithm. The paper [6] presents a detailed analysis of the Apriori algorithm and concludes that finding the frequent 2-itemsets takes the dominant part of the execution time. The paper [7] introduces two modifications to avoid the bottlenecks of the Apriori algorithm. The distributed algorithms were developed and evaluated in supercomputer environments, but PC cluster systems differ from traditional cluster systems in several ways. The published distributed association rule mining algorithms do not consider the data distribution cost, and they begin the itemset counting only after the data distribution phase. However, the initial data distribution does appear in PC cluster systems, as the database is usually stored separately from the mining cluster. Only a few distributed association rule mining algorithms have been investigated in a PC cluster environment. The paper [8] is concerned with the behaviour of the HPA algorithm [9] in a PC cluster environment. Node synchronization and the possibility of heterogeneous node types can cause serious performance degradation in PC clusters. PC cluster based modifications of the CD and DD algorithms were discussed in [10].
The algorithm synchronization problem was examined in [11]. An asynchronous distributed algorithm was introduced in [12]. This paper is organized as follows: first the widely used sequential ARM algorithms are described. Afterwards the basic distributed ARM algorithms are summarized. Then the bottlenecks of the Apriori based algorithms are investigated, and the novel distributed association rule mining algorithm is introduced. Finally we outline some test results of the new algorithm.

II. PROBLEM STATEMENT

First we elaborate on some basic concepts of association rules using the formalism presented in [1]. Let I = {i1, i2, …, im} be a set of literals, called items. Let D = {t1, t2, …, tn} be a set of transactions, where each transaction t is a set of items such that t ⊆ I. The itemset X has support s in the transaction set D if s% of the transactions contain X; we denote this s = support(X). An association rule is an implication of the form X → Y, where X, Y ⊆ I and X ∩ Y = ∅. Each rule has two measures of value, support and confidence. The support of the rule X → Y is support(X ∪ Y). The confidence c of the rule X → Y in the transaction set D means that c% of the transactions in D that contain X also contain Y, which can be written as c = support(X ∪ Y) / support(X). The problem of mining association rules is to find all the rules that satisfy a user specified minimum support and minimum confidence. If support(X) is larger than a user defined minimum support (denoted here min_sup), then the itemset X is called a large (frequent) itemset. Association rule mining can be decomposed into two subproblems: (1) finding all of the large itemsets, and (2) generating rules from these large itemsets. The second subproblem is much easier than the first one; this is the reason why ARM algorithms differ from each other only in the way they handle the first subproblem.

III. BASIC ALGORITHMS

In this section some basic association rule mining algorithms are described.
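Before turning to the algorithms, the support and confidence definitions of Section II can be sketched in a few lines. This is an illustrative sketch (the function and type names are not from the paper), assuming transactions are stored as sorted sets of integer item ids:

```cpp
#include <algorithm>
#include <set>
#include <vector>

using Itemset = std::set<int>;

// support(X): fraction of transactions in D that contain every item of X.
double support(const Itemset& x, const std::vector<Itemset>& d) {
    int hits = 0;
    for (const Itemset& t : d) {
        // std::includes requires sorted ranges; std::set is always sorted.
        if (std::includes(t.begin(), t.end(), x.begin(), x.end())) ++hits;
    }
    return static_cast<double>(hits) / d.size();
}

// confidence(X -> Y) = support(X union Y) / support(X)
double confidence(const Itemset& x, const Itemset& y,
                  const std::vector<Itemset>& d) {
    Itemset xy(x);
    xy.insert(y.begin(), y.end());
    return support(xy, d) / support(x, d);
}
```

For instance, with D = {{1,2,3}, {1,2}, {2,3}, {1,3}}, support({1,2}) is 2/4 and confidence({1} → {2}) is (2/4)/(3/4) = 2/3.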
First the Apriori algorithm is introduced, because it is the base of the investigated distributed algorithms. Then three fundamental reference algorithms are described: count distribution [4], data distribution [4] and the hash-based partition algorithm [14]. Table 1 contains the notations that are used in the detailed descriptions of the sequential algorithms.

k-itemset: an itemset having k items
L: set of the frequent itemsets
Li: set of frequent i-itemsets
Ci: set of candidate i-itemsets (potentially frequent itemsets)
|A|: number of elements in set A
Table 1. Notations in the sequential algorithms

Apriori Algorithm

The Apriori algorithm [2] uses the following theorem to reduce the search space: if an itemset is frequent then all of its subsets are frequent as well. This means it is possible to generate the potentially frequent (i+1)-itemsets from the frequent i-itemsets: each i-subset of a candidate (i+1)-itemset must be a frequent itemset. Hereby it is possible to find all frequent itemsets by scanning the database repeatedly. During the i-th database scan the algorithm counts the occurrences of the i-itemsets, and by the end of pass i it generates the candidates containing i+1 items. Figure 1 shows the pseudo code of the Apriori algorithm. The main disadvantage of this algorithm is the multiple database scan; there are many solutions to reduce the number of database scans [2][3][5][13].

L1 ← frequent 1-itemsets
C2 ← GenerateCandidate(L1)
i ← 2
while Li-1 ≠ ∅ {
    foreach t ∈ D { IncrementCounter(Ci, t) }
    Li ← { c ∈ Ci : c.counter / |D| ≥ min_supp }
    i ← i + 1
    Ci ← GenerateCandidate(Li)
}
L ← ∪i Li
Figure 1. Pseudo Code of the Apriori Algorithm

Count Distribution Algorithm

The count distribution algorithm [4] is a fundamental distributed association rule mining algorithm. The basic idea of this algorithm is that each node keeps locally the frequent itemsets and the candidate counters that relate to the whole database. These counters are maintained in accordance with the local dataset and the incoming counter values.
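This counter maintenance can be sketched in a few lines. The names below (`merge_counters`, `frequent_indices`) are illustrative, assuming the candidate counters of one pass are stored as a flat vector in the same order on every node:

```cpp
#include <cstddef>
#include <vector>

// Elementwise sum of candidate counters: the step where a node adds the
// counter vector received from another node to its local counters, so
// that every node ends up with the global counts.
std::vector<long> merge_counters(std::vector<long> local,
                                 const std::vector<long>& remote) {
    for (std::size_t i = 0; i < remote.size(); ++i) local[i] += remote[i];
    return local;
}

// After merging all remote counters, keep the candidates whose global
// support reaches the minimum support threshold.
std::vector<std::size_t> frequent_indices(const std::vector<long>& counts,
                                          long total_transactions,
                                          double min_supp) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < counts.size(); ++i)
        if (static_cast<double>(counts[i]) / total_transactions >= min_supp)
            out.push_back(i);
    return out;
}
```

Because only these small counter vectors cross the network, the communication volume stays proportional to the number of candidates, not to the database size.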
The nodes locally execute the Apriori algorithm, and after reading through the local dataset they broadcast their own counters to the other nodes. Hence each node can generate the new candidates on the basis of the global counter values. Figure 2 shows the pseudo code of the CD algorithm and Figure 3 gives an overview of it.

C1j ← {1-itemsets}
i ← 1
do {
    foreach t ∈ Dj { IncrementCounter(Cij, t) }
    Broadcast(Cij)
    for k ← 1 to N, k ≠ j {
        Receive(Cik)
        ComputeCounters(Cij, Cik)
    }
    Li ← { c ∈ Cij : c.counter / |D| ≥ min_supp }
    i ← i + 1
    Cij ← GenerateCandidate(Li-1)
} while Li-1 ≠ ∅
L ← ∪i Li
Figure 2. Pseudo Code of the Count Distribution algorithm

Figure 3. Overview of the count distribution algorithm (each node runs Apriori item counting on its own database partition, the nodes exchange counter values, and each node generates the candidates locally)

This algorithm keeps the data transmission low, but the nodes are synchronized after each database scan. The other disadvantage of this algorithm is that if there are too many candidates, it cannot exploit the fact that the cluster as a whole may have enough memory to keep all the candidates in memory.

Data Distribution Algorithm

The data distribution algorithm [4] gives a solution for the situation when a single node does not have enough memory for all of the candidates. In this case each node is responsible for only a part of the candidates, and it counts the occurrences of its own candidates in the whole dataset. For this, every node has to broadcast its own local database partition to the other nodes. Figure 4 gives an overview of the DD algorithm and Figure 5 shows its pseudo code.
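All of the pseudo code above relies on a GenerateCandidate step. The following is a minimal sketch of the usual join-and-prune construction (the name and representation are illustrative, not from [2]), assuming itemsets are kept as sorted vectors of item ids:

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

using ItemVec = std::vector<int>;  // an itemset, kept sorted

// Join two frequent k-itemsets that share their first k-1 items, then keep
// the resulting (k+1)-candidate only if every k-subset of it is frequent
// (the Apriori prune step).
std::vector<ItemVec> generate_candidates(const std::vector<ItemVec>& lk) {
    std::set<ItemVec> frequent(lk.begin(), lk.end());
    std::vector<ItemVec> out;
    for (std::size_t a = 0; a < lk.size(); ++a) {
        for (std::size_t b = a + 1; b < lk.size(); ++b) {
            const ItemVec& x = lk[a];
            const ItemVec& y = lk[b];
            // Joinable only when the first k-1 items agree.
            if (!std::equal(x.begin(), x.end() - 1, y.begin())) continue;
            ItemVec cand(x);
            cand.push_back(y.back());
            std::sort(cand.begin(), cand.end());
            // Prune: drop the candidate if any k-subset is not frequent.
            bool ok = true;
            for (std::size_t drop = 0; ok && drop < cand.size(); ++drop) {
                ItemVec sub;
                for (std::size_t i = 0; i < cand.size(); ++i)
                    if (i != drop) sub.push_back(cand[i]);
                ok = frequent.count(sub) > 0;
            }
            if (ok) out.push_back(cand);
        }
    }
    return out;
}
```

For example, from the frequent 2-itemsets {1,2}, {1,3}, {2,3} the only surviving 3-candidate is {1,2,3}, since all three of its 2-subsets are frequent.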
The disadvantage of the algorithm is that it generates very large network traffic, because it does not use any kind of optimisation to reduce it. It can be noticed that the number of candidates decreases sharply during the later iteration steps, so the broadcast of the local database is no longer necessary once the number of counters has dropped.

HPA Algorithm

The HPA algorithm [14] is an improved version of the data distribution algorithm; it uses a hash function to determine the owner node of each candidate. With the help of this hash function it is possible to decide where to send each itemset read from the database. In this way the network traffic can be reduced.

Figure 4. Overview of the Data Distribution algorithm (the nodes exchange their database partitions, and each node runs Apriori item counting for its own candidates over the remote data as well)

C1j ← {1-itemsets assigned to node j}
i ← 1
do {
    foreach t ∈ Dj { IncrementCounter(Cij, t) }
    Broadcast(Dj)
    for k ← 1 to N, k ≠ j {
        Receive(Dk)
        foreach t ∈ Dk { IncrementCounter(Cij, t) }
    }
    Ci ← ∪j Cij
    Li ← { c ∈ Ci : c.counter / |D| ≥ min_supp }
    i ← i + 1
    Cij ← GenerateCandidate(Li-1)
} while Li-1 ≠ ∅
L ← ∪i Li
Figure 5. Pseudo Code of the Data Distribution algorithm

IV. BOTTLENECK OF THE APRIORI BASED ALGORITHMS

Several Apriori-based algorithms have been developed, but the modifications focus on reducing the number of database scans. However, the Apriori-based algorithms have another drawback. The role of the frequent 2-itemsets is fundamental in the Apriori-based algorithms [6]. According to the evaluation of the Apriori algorithm, it is useful to handle the 2-candidates in a special way [7]. The modified version of the Apriori algorithm does not generate 2-candidates, but considers all possible 2-itemsets as candidates. The itemset counters can be stored in a triangular matrix indexed by the item identifiers. This matrix contains the counter values of all 2-itemsets, and it can keep the counter values of the 1-itemsets as well.
The diagonal of the matrix can be the placeholder of the 1-itemset counters. The advantages of this solution are as follows:
1. It is not necessary to generate the 2-candidates.
2. It is possible to count the 1- and 2-itemsets in one step.
3. Each of the counters can be reached in O(1) time.

Figure 6. The data structure of the counters (a triangular matrix indexed by the item identifiers; the diagonal holds the 1-itemset counters)

The main benefit of this structure is the O(1) counter lookup. This can greatly reduce the cost of the first database scans, which is important because they take a substantial part of the execution time of the level wise algorithms. After the first database scan the level wise algorithms can continue their work as before, but of course it is necessary to convert the frequent itemsets from the triangular matrix into their usual data structure (typically a hash tree or trie). The conversion can be sped up if the counters of the 1-itemsets are checked first, because if a 1-itemset is not frequent, then its row and column cannot contain a frequent counter. This is a consequence of the Apriori condition: if an itemset is frequent then each of its sub-itemsets is frequent as well.

V. THE NOVEL ALGORITHM: COUNTING WHILE DISTRIBUTING

The Counting While Distributing (CWD) algorithm is based on the count distribution algorithm, but it uses the triangular matrix to find the frequent 2-itemsets [7]. The other modification is that the traditional CD algorithm uses an all-to-all broadcast to share the local itemset counters when the nodes have finished a database scan; in a PC cluster environment the all-to-all broadcast can be very expensive, therefore in this algorithm there is a coordinator node. This node collects the local counter values of the candidates and generates the new candidates [10][12]. The traditional distributed association rule mining algorithms do not consider the initial database distribution cost, hence they start the itemset counting only after the data distribution.
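The triangular counter matrix of Section IV can be sketched as follows. The layout (a row-major lower triangle that includes the diagonal) is one possible realization, and the names are illustrative:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct TriangularCounters {
    // cell(a, b): offset of the counter of the item pair {a, b} in a
    // row-major lower triangle including the diagonal.
    static std::size_t cell(int a, int b) {
        if (a < b) std::swap(a, b);  // normalize so that a >= b
        return static_cast<std::size_t>(a) * (a + 1) / 2 + b;
    }

    explicit TriangularCounters(int items)
        : counts(cell(items - 1, items - 1) + 1, 0) {}

    // O(1) access; the diagonal (a == b) holds the 1-itemset counters.
    long& at(int a, int b) { return counts[cell(a, b)]; }

    std::vector<long> counts;
};

// One pass over a transaction updates the 1- and 2-itemset counters
// together, which is the "count in one step" property described above.
void count_transaction(TriangularCounters& m, const std::vector<int>& t) {
    for (std::size_t i = 0; i < t.size(); ++i) {
        ++m.at(t[i], t[i]);                    // 1-itemset counter
        for (std::size_t j = i + 1; j < t.size(); ++j)
            ++m.at(t[i], t[j]);                // 2-itemset counter
    }
}
```

The arithmetic index `a*(a+1)/2 + b` is what gives the O(1) lookup: no hashing or tree traversal is needed for any item pair.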
Due to the O(1) lookup of the triangular matrix representation of the 2-itemsets, it is possible to count the 2-itemsets already during the data distribution phase. Therefore, right after the initial data distribution, the coordinator can immediately collect the counters of the 2-itemsets. This solution allows overlapping the initial data distribution with a significant part of the itemset counting. The asynchronous communication model can improve the overall efficiency of the cluster, because it facilitates overlapping the communication and the data processing. During the initial data distribution the coordinator node creates small data fragments from the database and sends them to the worker nodes. This fragmentation makes it possible to increase the efficiency of the data processing: due to the asynchronous communication the worker nodes can process a data fragment while a new one is arriving through the network communication channel. In the current implementation the size of a fragment is 128 kilobytes. The sequence diagram of the communication is shown in Figure 7. The coordinator node can send the following messages to the worker nodes:
Start: the coordinator informs the worker nodes about the beginning of the data distribution.
Data Fragment: the coordinator node sends a data fragment to a worker node.
Delete Candidate: the candidate did not become frequent, so the worker nodes have to delete it from their memory.
Insert Candidate: new candidates have been generated, and the coordinator node broadcasts them to the worker nodes.
Continue Counting: the candidate generation process is finished; the worker nodes can continue the itemset counting.
Finish Counting: there is no new candidate, hence the itemset counting process is finished.
The worker nodes can send the following messages to the coordinator node:
Triangular Matrix: the worker nodes send back their own triangular matrix to the coordinator.
Counter Update: the worker nodes send back their own candidate counter values to the coordinator.

Figure 7. Sequence diagram of the communication (Start, Data Fragment, Finished Distribution, Triangular Matrix, Insert/Delete Counter, Continue Working, Counter Update, Finished)

VI. SIMULATION RESULTS

The introduced CWD algorithm and the count distribution algorithm were implemented in standard C++ using the Pyramid [15] asynchronous message passing library on the Windows XP platform. The simulation took place in the PC laboratory of the department using 11 uniform PCs, each with an Intel Pentium IV processor at 2.26 GHz, 256 MB RAM, and an Intel 82801DB PRO/100 VE network adapter of 100 Mbit/s. The nodes were connected through a 3Com SuperStack 4226T switching hub. The datasets used by the algorithms were generated by the IBM synthetic data generator [2]. We used the T20I7D500K dataset for the experimentation.

Figure 8. Response time of the CWD and CD algorithm in case of 10 working nodes
Figure 11. Response time of the CWD and CD algorithm in case of 0.9% minimum support
Figure 12. Finding the frequent 2-itemsets in case of 10 working nodes using the CWD algorithm
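The fragment-by-fragment 2-itemset counting behind these measurements can be sketched as follows. This is a self-contained sketch in which a plain `std::map` stands in for the triangular matrix and the vector of arrived fragments stands in for the asynchronous network channel; all names are illustrative:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Transaction = std::vector<int>;
using Fragment = std::vector<Transaction>;   // e.g. ~128 KB of transactions
using PairCounts = std::map<std::pair<int, int>, long>;

// Count the 2-itemsets of one fragment as soon as it has arrived.
void count_fragment(const Fragment& frag, PairCounts& counts) {
    for (const Transaction& t : frag)
        for (std::size_t i = 0; i < t.size(); ++i)
            for (std::size_t j = i + 1; j < t.size(); ++j) {
                int a = std::min(t[i], t[j]);
                int b = std::max(t[i], t[j]);
                ++counts[{a, b}];            // key normalized so that a < b
            }
}

// Worker side: process the fragments in arrival order. With asynchronous
// communication the next fragment can arrive while this loop is counting,
// which overlaps the data distribution with the 2-itemset counting.
PairCounts count_fragments(const std::vector<Fragment>& arrived) {
    PairCounts counts;
    for (const Fragment& f : arrived) count_fragment(f, counts);
    return counts;
}
```

Once the last fragment is processed, the worker already holds its complete local 2-itemset counts and can send them to the coordinator immediately, which is why the data distribution cost largely disappears.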
The dataset name means that there are 500K transactions in it, the average size of a transaction is 20 items, and the average size of the maximal frequent itemset is 7. The simulation results are shown in Figure 8 to Figure 14.

Figure 9. Response time of the CWD and CD algorithm in case of 5 working nodes
Figure 10. Response time of the CWD and CD algorithm in case of 0.5% minimum support
Figure 13. Finding the frequent 2-itemsets in case of 10 working nodes using the CD algorithm
Figure 14. Comparison of the two algorithms: finding the frequent 2-itemsets in case of 10 working nodes

Figure 8 and Figure 9 show that the response time of the CWD algorithm is significantly better than that of the CD algorithm. Figure 10 and Figure 11 show the scale-up capability of both algorithms: the novel CWD algorithm always has a better response time than the CD algorithm. Both algorithms saturate in their scale-up capability, but the asymptotic limit of the CWD algorithm is significantly smaller than that of the CD algorithm. Figure 12, Figure 13 and Figure 14 show that the speed-up of the CWD algorithm comes from speeding up the Apriori algorithm. These measurement results also reinforce that the main bottleneck of the Apriori algorithm is finding the frequent 2-itemsets; this is why the Apriori-based association rule mining algorithms have worse performance than the introduced CWD algorithm.

VII. CONCLUSION

In this paper a novel distributed association rule mining algorithm has been introduced.
This algorithm is based on a modified version of the Apriori algorithm. It handles the 2-itemsets in a special way, and therefore it can significantly reduce the response time of the Apriori algorithm. The basic distributed association rule mining algorithms are based on the Apriori algorithm, hence they inherit all of its problems; the introduced speed-up solution can be used in any kind of Apriori-based algorithm. The CWD algorithm has a substantially better response time than the reference CD algorithm. One of the main advantages of this algorithm is that it counts the occurrences of the 2-itemsets while the data distribution is taking place. Therefore the cost of the data distribution, which can be high in case of a very large data source, disappears in the introduced novel algorithm.

ACKNOWLEDGMENTS

This work has been supported by the fund of the Hungarian Academy of Sciences for control research and the Hungarian National Research Fund (grant number: T042741).

REFERENCES

[1] R. Agrawal, T. Imielinski and A. Swami, Mining association rules between sets of items in large databases, Proc. ACM SIGMOD Conference, 1993
[2] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, Proc. 20th Very Large Databases Conference, 1994
[3] J. Han, J. Pei and Y. Yin, Mining frequent patterns without candidate generation, Proc. ACM SIGMOD International Conference on Management of Data, 2000
[4] R. Agrawal and J.C. Shafer, Parallel mining of association rules, IEEE Trans. Knowledge and Data Engineering, vol. 8, no. 6, 1996
[5] O. R. Zaïane, M. El-Hajj and P. Lu, Fast parallel association rule mining without candidacy generation, Proc. IEEE International Conference on Data Mining, 2001
[6] R. Iváncsy, F. Kovács and I. Vajk, An analysis of association rule mining algorithms, Proc. of International Conference of ICSC EIS, 2004
[7] F. Kovács, R. Iváncsy and I. Vajk, Evaluation of the serial association rule mining algorithms, Proc.
of International Conference of IASTED DBA, 2004
[8] M. Tamura and M. Kitsuregawa, Dynamic load balancing for parallel association rule mining on heterogeneous PC cluster systems, Proc. 25th Very Large Databases Conference, 1999
[9] T. Shintani and M. Kitsuregawa, Hash based parallel algorithms for mining association rules, Proc. Parallel and Distributed Information Systems Conference, 1996
[10] T. Shimomura and S. Shibusawa, Performance evaluation of distributed algorithms for mining association rules on workstation cluster, Proc. IEEE International Workshops on Parallel Processing (ICPP'00 Workshops), 2000
[11] J. Zhang, H. Shi and L. Zheng, A method and algorithm of distributed mining association rules in synchronism, Proc. IEEE International Conference on Machine Learning and Cybernetics, 2002
[12] F. Kovács, R. Iváncsy and I. Vajk, Dynamic itemset counting in PC cluster based association rule mining, Proc. of International Conference of ICSC EIS, 2004
[13] S. Brin, R. Motwani, J.D. Ullman and S. Tsur, Dynamic itemset counting and implication rules for market basket data, Proc. ACM SIGMOD Conference, 1997
[14] T. Shintani and M. Kitsuregawa, Hash based parallel algorithms for mining association rules, Proc. Parallel and Distributed Information Systems Conference, 1996
[15] S. Juhász et al., "The Pyramid Project", Budapest University of Technology and Economics, Department of Automation and Applied Informatics, 2002. http://avalon.aut.bme.hu/~sanyo/piramis