An Efficient Hash Based Algorithm for Mining Closed Frequent Itemsets

Ms. Dhara Patel (1) and Prof. Ketan Sarvakar (2)
(1) ME (CSE Student), UVPCE, Kherva, Gujarat, India
(2) Asst. Professor, UVPCE, Kherva, Gujarat, India

Abstract: Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets and then forming conditional implication rules among them. Efficient algorithms to discover frequent patterns are crucial in data mining research. Finding frequent itemsets is computationally the most expensive step in association rule discovery, and it has therefore attracted significant research attention. In this paper we develop an improved procedure for generating frequent itemsets and present a result analysis on the wine dataset. Our improved procedure is compared with the ILLT algorithm, and the time required for generating itemsets is lower.

I. Introduction

Association rule mining is one of the most important and well-researched data mining techniques. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in a transaction database or other data repositories. It is widely used in areas such as cross marketing, new product development, personalized service and commercial credit evaluation in e-business. The process of discovering all association rules consists of two steps: 1) discovering all frequent itemsets that have minimum support, and 2) creating, from the discovered frequent itemsets, all rules that meet the confidence threshold. Most research has focused on efficient methods for finding frequent itemsets, because it is computationally the most expensive step; these methods attack the candidate itemset problem by avoiding candidate generation altogether and by reducing the time needed to scan the database. However, they do not concentrate on mining frequent itemsets over highly similar transactions as found in the real world, e.g. wholesale transactions and medical prescription transactions.

Since their introduction in 1993 by Agrawal et al., the frequent itemset and association rule mining problems have received a great deal of attention, and many papers have been published presenting new algorithms and improvements on existing ones. Analyzing huge amounts of data makes it possible to exploit customer behaviour and make correct decisions. For example, an association rule "beer ⇒ chips (60%)" states that 60% of the customers that bought beer also bought chips. Such rules can be useful for decisions about promotions, store layout, product pricing and more [1].

The Apriori algorithm achieves a good reduction in the size of candidate sets. However, when there exist a large number of frequent patterns and/or long patterns, candidate generation-and-test methods may still generate huge numbers of candidates and take many scans of large databases for frequency checking.

Large amounts of data are collected routinely in the course of day-to-day management in business, administration, banking, e-commerce, the delivery of social and health services, environmental protection, security and politics. With this tremendous growth of data, users expect more relevant and sophisticated information, which may be lying hidden in the data, and existing analysis and evaluation techniques do not keep pace with this growth.
Data mining is often described as a discipline for finding hidden information in databases. It involves different techniques and algorithms to discover useful knowledge lying hidden in the data. Association rule mining has been one of the most popular data mining subjects; it can be simply defined as finding interesting rules in collections of data. The first step in association rule mining is finding frequent itemsets. This is a very resource-consuming task, and for that reason it has been one of the most popular research fields in data mining. At the same time, very large databases do exist in real life: in a medium-sized business, or in a company as big as Walmart, it is very easy to collect a few gigabytes of data, and terabytes of raw data are ubiquitously being recorded in commerce, science and government. The question of how to handle these databases is still one of the most difficult problems in data mining.

The remainder of this paper is organized as follows: Section 2 discusses related work on frequent itemset mining; Section 3 discusses the methodology for frequent itemset mining; Section 4 discusses the result analysis; finally, Section 5 concludes the paper.

II. Related Work

Generally, the methods for finding frequent itemsets can be divided into two approaches: candidate generation-and-test and pattern growth. A basic candidate generation-and-test algorithm is Apriori, which builds candidates from items and uses a pre-given threshold value to count frequent itemsets in the database. This algorithm requires multiple database scans, as many as the length of the longest frequent itemset.

Formally, as defined in [2], the problem of mining association rules is stated as follows. Let I = {i1, i2, …, im} be a set of items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its transaction id (TID). A transaction T contains X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The meaning of such a rule is that transactions in the database which contain the items in X tend to also contain the items in Y. The rule X ⇒ Y holds in the transaction set D with confidence c if, among the transactions that contain X, c% of them also contain Y. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. The problem is to mine all association rules whose support and confidence are greater than the user-specified minimum support and minimum confidence, respectively.

Conventionally, the problem of discovering all association rules is decomposed into two steps: 1) find the large itemsets whose transaction support is above the minimum support, and 2) generate the desired association rules from the discovered large itemsets. The overall performance of mining association rules is determined by the first step; once the large itemsets have been identified, the corresponding association rules can be derived in a straightforward manner. All the algorithms produce frequent itemsets on the basis of minimum support. The advantage of the Apriori algorithm is that it is a simple and easy way to find the frequent elements in the database; its disadvantages are that more search space is needed, I/O cost increases, and the number of database scans grows, so candidate generation increases and with it the computational cost [1, 2].
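To make the definitions above concrete, here is a minimal, self-contained Python sketch (not the authors' code) that computes the support and confidence of a rule X ⇒ Y over a list of transactions and performs one naive level of Apriori-style candidate generation-and-test. The toy basket data and all names in it are illustrative assumptions.

```python
# Minimal sketch of support, confidence, and one naive level of
# candidate generation-and-test, following the definitions in the text.
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t)) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X => Y: support(X u Y) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def frequent_k_itemsets(transactions, k, min_sup):
    """Naive generation-and-test: enumerate every k-item candidate and
    keep those whose support meets the threshold (one scan per call)."""
    items = sorted({i for t in transactions for i in t})
    return {c: support(c, transactions)
            for c in combinations(items, k)
            if support(c, transactions) >= min_sup}

if __name__ == "__main__":
    # Toy market-basket data in the spirit of the beer/chips example.
    db = [{"beer", "chips"}, {"beer", "chips", "milk"},
          {"beer", "milk"}, {"chips"}, {"beer", "chips"}]
    print(frequent_k_itemsets(db, 2, 0.4))
    # -> {('beer', 'chips'): 0.6, ('beer', 'milk'): 0.4}
    print(confidence({"beer"}, {"chips"}, db))
    # ~0.75: 3 of the 4 beer transactions also contain chips
```

This also illustrates the cost argument made above: the candidate enumeration is exponential in k, and each call to `support` rescans the whole database.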
The Apriori algorithm is quite successful for market basket analysis, where transactions are large but the number of frequent items generated is small. The strengths and weaknesses of the other main algorithms can be summarized as follows:

- Eclat: produces good results when the database is very large, but generates a larger number of candidates than Apriori. Vertical-layout algorithms claim to be faster than Apriori but require more memory than horizontal-layout ones, because they need to load the candidates, the database and the TID lists into main memory [3].
- FP-Growth: needs only two passes over the dataset and, with no candidate generation, is faster than Apriori; however, the tree may not fit in memory and is expensive to build [4].
- H-mine: no candidate generation and no need to store any frequent pattern in memory, but it requires more memory and offers no random access [5]. FP-Tree and H-mine perform better than the algorithms discussed above because they generate no candidate sets, but the pointers that must be kept in memory require a large memory space.
- Frequent Item Graph (FIG): a quick mining process that scans the entire database only once and uses no candidates, but it requires a full scan of the frequent 2-itemsets when building the graphical structure in the second phase [6].
- Frequent Itemsets Algorithm for Similar Transactions (FIAST): saves space and reduces time, trying to lower I/O, space and time overall, but it relies on AND operations to find itemsets and its performance decreases for sparse datasets [7].
- Indexed Limited Level Tree (ILLT): makes it easy to find frequent itemsets at different support levels while scanning the database only once, but it still performs candidate generation [8]. ILLT performs better overall but requires a large memory space to store the tree structure.

In data mining, frequent itemset mining is widely acknowledged because many applications, such as correlation analysis, association rules based on frequent patterns, and sequential pattern tasks, build on it. The major difficulty in frequent pattern mining is the large number of resulting patterns: as the minimum threshold becomes lower, an exponentially large number of itemsets is generated. Pruning therefore becomes an important topic in mining frequent patterns. The goal is to optimize the process of finding frequent patterns so that it is scalable and efficient and yields the important patterns [9].

III. Methodology

A large transactional database contains multiple items, so the database surely contains various transactions with the same set of items. By taking advantage of these transactions, we try to find the frequent itemsets and prune off the candidate itemsets whose node count is lower than the minimum support using an improved procedure, resulting in efficient execution time. Sampling is a popular method in computational statistics; two important terminologies related to it are population and sample. The population is defined in keeping with the objectives of the study; a sample is a subset of the population.
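As a minimal illustration of the sampling step just described, the following sketch (a reconstruction, not the paper's code) draws a simple random sample S from a transaction database P. The sample fraction and the fixed seed are illustrative choices the paper does not specify.

```python
# Step 1 of the methodology: draw a simple random sample S from the
# population P (the transaction database). Fraction and seed are
# illustrative assumptions.
import random

def draw_sample(population, fraction=0.5, seed=42):
    """Return a random sample S of the transaction database P."""
    k = max(1, int(len(population) * fraction))
    rng = random.Random(seed)   # fixed seed for reproducibility
    return rng.sample(population, k)

P = [["i1", "i2", "i5"], ["i2", "i4"], ["i2", "i3"], ["i1", "i2", "i4"],
     ["i1", "i3"], ["i2", "i3"], ["i1", "i3"], ["i1", "i2", "i3", "i5"],
     ["i1", "i2", "i3"]]
S = draw_sample(P)
print(len(S), "of", len(P), "transactions sampled")
```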
Usually, when the population is large, a scientifically chosen sample can be used to represent the population, because the sample reflects the characteristics of the population from which it is drawn. In data mining the population is usually large, so the sampling method is appropriate. As an example, suppose that the sample data S in Table 2 is a carefully chosen sample of some population P in Table 1.

Table 1: Population Data

TID   List of item_IDs
1     i1, i2, i5
2     i2, i4
3     i2, i3
4     i1, i2, i4
5     i1, i3
6     i2, i3
7     i1, i3
8     i1, i2, i3, i5
9     i1, i2, i3
10    i1, i2, i5
11    i2, i4
12    i2, i3
13    i1, i2, i4
14    i1, i3
15    i2, i3
16    i1, i3
17    i1, i2, i3, i5
18    i1, i2, i3

Using the sampling method can save much time: if the sample is carefully chosen, it can represent the population, and then the table built from the sample can represent the one built from the population. The 2-itemsets with high frequency in the sample's table are likely to be the ones with high frequency in the population's table.

Table 2: Sample Data

TID   List of item_IDs
1     i1, i2, i5
2     i2, i4
3     i2, i3
4     i1, i2, i4
5     i1, i3
6     i2, i3
7     i1, i3
8     i1, i2, i3, i5
9     i1, i2, i3

Procedure
1) Carefully draw a sample S from the population P, usually by random sampling.
2) Process the sample S to obtain a hash table, denoted HS.
3) Rank the table HS by the frequency of the column content, so that addresses with high-frequency content come first and those with low-frequency content come last, giving a new table HSR.
4) Based on HSR, process the rest of the population P, i.e. P − S; when finished, this gives a table denoted HP.
5) Obtain the frequent itemsets according to the predetermined minimum support count.

Pseudo Code

Input: Database D, min_sup
Output: Frequent itemsets L

Procedure MiningFrequentItemsets
  For some transactions t ∈ D
    Insert t into sample S;
  End for;
  For sample S
    build a hash table HS;
  End for;
  Rank the hash table HS in descending order of count, giving a new hash table HSR;
  Based on HSR, process P − S to obtain a hash table HP;
  For each item in HP
    If count(item) ≥ min_sup then
      L = L ∪ {item};
    End if;
  End for;

Example

Take the data in Table 1 as an example; we show how the procedure works. Draw a sample S from the population P, shown in Table 2. Process the sample S to obtain a hash table HS, shown in Table 3. In Tables 3 to 5, the count is the number of times the content itemset occurs.

Table 3: Table HS

Address   Count   Content
(1, 2)    4       {i1, i2}
(1, 5)    2       {i1, i5}
(2, 5)    2       {i2, i5}
(2, 4)    2       {i2, i4}
(2, 3)    4       {i2, i3}
(1, 4)    1       {i1, i4}
(1, 3)    4       {i1, i3}
(3, 5)    1       {i3, i5}

Rank the table HS by the frequency of the column content, so that addresses with high-frequency content come first and those with low-frequency content come last; this gives the new table HSR, shown in Table 4.

Table 4: Table HSR

Address   Count   Content
(1, 2)    4       {i1, i2}
(2, 3)    4       {i2, i3}
(1, 3)    4       {i1, i3}
(1, 5)    2       {i1, i5}
(2, 5)    2       {i2, i5}
(2, 4)    2       {i2, i4}
(1, 4)    1       {i1, i4}
(3, 5)    1       {i3, i5}

Based on HSR, process the rest of the population P, i.e. P − S; when finished, this gives the table HP, shown in Table 5.

Table 5: Table HP

Address   Count   Content
(1, 2)    8       {i1, i2}
(2, 3)    8       {i2, i3}
(1, 3)    8       {i1, i3}
(1, 5)    4       {i1, i5}
(2, 5)    4       {i2, i5}
(2, 4)    4       {i2, i4}
(1, 4)    2       {i1, i4}
(3, 5)    2       {i3, i5}

Obtain the frequent itemsets according to the predetermined minimum support count. If we set the support count to 6, we find that the 2-itemsets {i1, i2}, {i2, i3} and {i1, i3} are frequent.
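The following runnable Python sketch reconstructs the five-step procedure and the pseudocode above. Plain dicts stand in for the hash tables HS, HSR and HP, the sorted item pair itself serves as the hash address, and the assumption that the sample S forms the first |S| transactions of P mirrors Tables 1 and 2 but is not stated by the paper; this is a reconstruction, not the authors' code.

```python
# Sketch of the five-step procedure, reconstructed from the pseudocode.
from itertools import combinations

def two_itemsets(transaction):
    """All 2-itemsets of a transaction, as sorted tuples."""
    return combinations(sorted(transaction), 2)

def mine_frequent_2_itemsets(P, S, min_sup_count):
    # Step 2: build hash table HS by counting 2-itemsets in the sample S.
    HS = {}
    for t in S:
        for pair in two_itemsets(t):
            HS[pair] = HS.get(pair, 0) + 1
    # Step 3: HSR is HS ranked by descending count, so buckets with
    # high-frequency content come first (cf. Table 4).
    HSR = dict(sorted(HS.items(), key=lambda kv: kv[1], reverse=True))
    # Step 4: continue counting over the rest of the population P - S,
    # giving HP, the counts over the whole population (cf. Table 5).
    # Assumption for illustration: S is the first |S| transactions of P,
    # as in Tables 1 and 2.
    HP = dict(HSR)
    for t in P[len(S):]:
        for pair in two_itemsets(t):
            HP[pair] = HP.get(pair, 0) + 1
    # Step 5: keep the itemsets meeting the minimum support count.
    return {pair: c for pair, c in HP.items() if c >= min_sup_count}

if __name__ == "__main__":
    # Worked example from Tables 1 and 2 with support count 6.
    S = [["i1", "i2", "i5"], ["i2", "i4"], ["i2", "i3"],
         ["i1", "i2", "i4"], ["i1", "i3"], ["i2", "i3"],
         ["i1", "i3"], ["i1", "i2", "i3", "i5"], ["i1", "i2", "i3"]]
    P = S * 2   # Table 1 repeats the nine sample transactions twice
    print(mine_frequent_2_itemsets(P, S, 6))
    # -> {('i1', 'i2'): 8, ('i2', 'i3'): 8, ('i1', 'i3'): 8}
```

Run on the example data, the sketch reproduces Tables 3 to 5 and the final result: the counts 8, 8 and 8 for {i1, i2}, {i2, i3} and {i1, i3} meet the support count of 6.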
IV. Result Analysis

In our experiments we chose datasets with different numbers of records to demonstrate the efficiency of the algorithm. Table 6 shows the datasets, which come from the UCI repository of machine learning databases [10].

Table 6: The characteristics of the datasets

Dataset                        Number of Records
agaricus-lepiota.data.txt      8124
balance-scale.data.txt         625
bridges.data.txt               108
flag.data.txt                  194
imports-85.data.txt            205
letter-recognition.data.txt    20000
machine.data.txt               209
tic-tac-toe.data.txt           958
wine.data.txt                  178

The experimental study compares the performance of the efficient algorithm with the Apriori algorithm. The run time is the time to mine the frequent itemsets. The execution times shown in Table 7 to Table 15 reveal that the algorithm outperforms the Apriori algorithm; the same results are plotted in Figure 1 to Figure 9. As the comparison makes clear, the efficient algorithm performs well at low support values for all the datasets in Table 6, but at higher supports its advantage over the Apriori algorithm shrinks slightly: the difference between the execution times of the efficient algorithm and the Apriori algorithm decreases at the later stages.
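A minimal timing-harness sketch of how run times like those in Tables 7 to 15 can be collected is shown below. It is not the authors' benchmark code; the miners passed in are placeholders for any implementation with a (transactions, min_sup_count) signature, such as the mine_frequent_2_itemsets sketch above.

```python
# Sketch of a timing harness for comparing two mining implementations.
import time

def time_run(miner, transactions, min_sup_count):
    """Wall-clock seconds for one mining run."""
    start = time.perf_counter()
    miner(transactions, min_sup_count)
    return time.perf_counter() - start

def compare(miners, transactions, support_percents=(40, 50, 60)):
    """Print a Table-7-style comparison for several support levels.
    `miners` maps a display name to a mining callable."""
    for pct in support_percents:
        min_count = len(transactions) * pct // 100
        row = ", ".join(
            f"{name}: {time_run(fn, transactions, min_count):.2f}s"
            for name, fn in miners.items())
        print(f"support {pct}% -> {row}")
```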
Table 7: Execution Time for Apriori Algorithm and Efficient Algorithm using Agaricus-lepiota dataset

Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            173.89                  134.73
50            80.78                   62.87
60            53.86                   43.81

Figure 1: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Agaricus-lepiota dataset

Table 8: Execution Time for Apriori Algorithm and Efficient Algorithm using Balance-scale dataset

Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            13.37                   11.51
50            6.21                    5.34
60            4.14                    3.71

Figure 2: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Balance-scale dataset

Table 9: Execution Time for Apriori Algorithm and Efficient Algorithm using Bridges dataset

Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            2.31                    1.89
50            1.07                    0.86
60            0.72                    0.58

Figure 3: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Bridges dataset

Table 10: Execution Time for Apriori Algorithm and Efficient Algorithm using Flag dataset

Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            4.15                    3.58
50            1.93                    1.62
60            1.29                    1.10

Figure 4: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Flag dataset

Table 11: Execution Time for Apriori Algorithm and Efficient Algorithm using Imports-85 dataset

Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            5.71                    4.73
50            2.45                    1.92
60            1.50                    1.17

Figure 5: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Imports-85 dataset

Table 12: Execution Time for Apriori Algorithm and Efficient Algorithm using Letter-recognition dataset

Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            406.69                  331.69
50            169.05                  130.34
60            99.44                   75.51

Figure 6: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Letter-recognition dataset

Table 13: Execution Time for Apriori Algorithm and Efficient Algorithm using Machine dataset

Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            4.47                    3.66
50            2.09                    1.79
60            1.39                    1.24

Figure 7: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Machine dataset

Table 14: Execution Time for Apriori Algorithm and Efficient Algorithm using Tic-tac-toe dataset

Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            23.59                   17.65
50            10.48                   7.80
60            6.67                    5.17

Figure 8: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Tic-tac-toe dataset

Table 15: Execution Time for Apriori Algorithm and Efficient Algorithm using Wine dataset

Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            3.89                    3.28
50            1.81                    1.45
60            1.20                    0.96

Figure 9: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Wine dataset
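As a sketch of how charts like Figures 1 to 9 can be regenerated from the tabulated results, the snippet below plots the Table 15 (wine dataset) numbers; matplotlib is an assumed dependency the paper does not mention.

```python
# Regenerate a Figure-9-style chart from the Table 15 data.
import matplotlib.pyplot as plt

supports = [40, 50, 60]
apriori = [3.89, 1.81, 1.20]       # Table 15, Apriori Algorithm
efficient = [3.28, 1.45, 0.96]     # Table 15, Efficient Algorithm

plt.plot(supports, apriori, marker="o", label="Apriori Algorithm")
plt.plot(supports, efficient, marker="s", label="Efficient Algorithm")
plt.xlabel("Support (%)")
plt.ylabel("Total execution time (s)")
plt.title("Execution time on the Wine dataset (cf. Figure 9)")
plt.legend()
plt.show()
```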
V. Conclusion

We chose nine datasets with different numbers of records to demonstrate the efficiency of the algorithm: agaricus-lepiota, balance-scale, bridges, flag, imports-85, letter-recognition, machine, tic-tac-toe and wine, all from the UCI repository of machine learning databases. The frequent pattern mining problem has been studied extensively; in designing our efficient algorithm we considered time consumption as the decisive factor, and this factor is affected by the approach used for finding the frequent itemsets. Work has been done to develop an efficient algorithm that is an improvement over the Apriori algorithm. Across the different datasets, the running time of our efficient algorithm outperformed the Apriori algorithm: it performed well at the lower support levels and also at the higher support levels. It thus saves much time and, as the results show, can be considered an efficient algorithm.

References

[1] Agrawal, R.; Imielinski, T.; Swami, A., "Mining Association Rules between Sets of Items in Large Databases," in Proc. ACM SIGMOD Conference, Washington DC, USA, 1993.
[2] Agrawal, R.; Srikant, R., "Fast algorithms for mining association rules," in Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 487-499, September 1994.
[3] Borgelt, C., "Efficient Implementations of Apriori and Eclat," in Proc. 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations, CEUR Workshop Proceedings 90, Aachen, Germany, 2003.
[4] Han, J.; Pei, J.; Yin, Y., "Mining frequent patterns without candidate generation," in Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD), 2000.
[5] Pei, J.; Han, J.; Lu, H.; Nishio, S.; Tang, S.; Yang, D., "H-mine: Hyper-structure mining of frequent patterns in large databases," in Proc. Int'l Conf. Data Mining (ICDM), November 2001.
[6] Kumar, A.V.S.; Wahidabanu, R.S.D., "A Frequent Item Graph Approach for Discovering Frequent Itemsets," in Proc. Int'l Conf. Advanced Computer Theory and Engineering (ICACTE '08), pp. 952-956, 20-22 Dec. 2008.
[7] Duemong, F.; Preechaveerakul, L.; Vanichayobon, S., "FIAST: A Novel Algorithm for Mining Frequent Itemsets," in Proc. Int'l Conf. Future Computer and Communication (ICFCC 2009), pp. 140-144, 3-5 April 2009.
[8] Venkateswari, S.; Suresh, R.M., "An efficient for discovery of frequent itemsets," in Proc. Int'l Conf. Signal and Image Processing (ICSIP), pp. 531-533, 15-17 Dec. 2010.
[9] Pramod, S.; Vyas, O. P., "Survey on Frequent Item set Mining Algorithms," International Journal of Computer Applications, pp. 86-91, 2010.
[10] Blake, C.L.; Merz, C.J., UCI Repository of Machine Learning Databases, Dept. of Information and Computer Science, University of California at Irvine, CA, USA, 1998.