An Improved Procedure for Frequent Pattern Mining in Transactional Database

Ms. Dhara Patel (ME CSE Student, UVPCE, Kherva, Gujarat, India) and Prof. Ketan Sarvakar (Asst. Professor, UVPCE, Kherva, Gujarat, India)

Abstract: Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets and then forming conditional implication rules among them. Efficient algorithms to discover frequent patterns are crucial in data mining research. Finding frequent itemsets is computationally the most expensive step in association rule discovery, and it has therefore attracted significant research attention. In this paper we develop an improved procedure for generating frequent itemsets and analyse its results on the wine dataset. The improved procedure is compared with the ILLT algorithm and requires less time to generate itemsets.

I. Introduction

Association rule mining is one of the most important and well-researched data mining techniques. It aims to extract interesting correlations, frequent patterns, associations, or causal structures among sets of items in a transaction database or other data repository. It is widely used in areas such as cross marketing, new product development, personalized service, and commercial credit evaluation in e-business. The process of discovering all association rules consists of two steps: 1) discovery of all frequent itemsets that have minimum support, and 2) generation, from the discovered frequent itemsets, of all rules that meet the confidence threshold. Most research has focused on efficient methods for finding frequent itemsets, because it is computationally the most expensive step; these methods attack the cost of candidate itemset generation by avoiding candidate generation altogether and by reducing the time needed to scan the database.
However, they do not concentrate on mining frequent itemsets based on highly similar transactions in the database, as found in the real world, e.g. wholesale transactions and medical prescription transactions. Since their introduction in 1993 by Agrawal et al., the frequent itemset and association rule mining problems have received a great deal of attention. Many papers have been published presenting new algorithms and improvements on existing algorithms to solve these mining problems efficiently. Exploiting customer behaviour and making correct decisions requires analyzing huge amounts of data. For example, an association rule "beer ⇒ chips (60%)" states that 60% of the customers who bought beer also bought chips. Such rules can be useful for decision making about promotions, store layout, product pricing and more [1]. The Apriori algorithm achieves a good reduction in the size of the candidate sets. However, when there exist a large number of frequent patterns and/or long patterns, candidate generation-and-test methods may still suffer from generating huge numbers of candidates and taking many scans of large databases for frequency checking. Large amounts of data are collected routinely in the course of day-to-day management in business, administration, banking, e-commerce, the delivery of social and health services, environmental protection, security and politics. With the tremendous growth of data, users expect more relevant and sophisticated information, which may be lying hidden in the data. Existing analysis and evaluation techniques do not match this tremendous growth. Data mining is often described as a discipline for finding hidden information in databases. It involves different techniques and algorithms to discover useful knowledge lying hidden in the data. Association rule mining has been one of the most popular data mining subjects; it can be simply defined as finding interesting rules from a collection of data.
The first step in association rule mining is finding the frequent itemsets. It is a very resource-consuming task, and for that reason it has been one of the most popular research fields in data mining. At the same time, very large databases do exist in real life. In a medium-sized business, or in a company as big as Walmart, it is very easy to collect a few gigabytes of data. Terabytes of raw data are ubiquitously being recorded in commerce, science and government. The question of how to handle these databases is still one of the most difficult problems in data mining. In this paper, Section 2 discusses related work on frequent itemset mining; Section 3 discusses the methodology for frequent itemset mining; Section 4 discusses the result analysis; finally, Section 5 concludes the paper.

II. Related Work

Generally, the methods for finding frequent itemsets can be divided into two approaches: candidate generation-and-test and pattern growth. The basic algorithm for candidate generation-and-test is Apriori, which builds candidates from the items and uses a pre-given threshold value to count frequent itemsets in the database. This algorithm requires multiple database scans, as many as the length of the longest frequent itemset. Formally, as defined in [2], the problem of mining association rules is stated as follows. Let I = {i1, i2, …, im} be a set of items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its transaction id (TID). A transaction T contains X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = φ. The meaning of such a rule is that transactions in the database which contain the items in X tend to also contain the items in Y. The rule X ⇒ Y holds in the transaction set D with confidence c if, among the transactions that contain X, c% of them also contain Y.
The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. The problem is to mine all association rules that have support and confidence greater than the user-specified minimum support and minimum confidence, respectively. Conventionally, the problem of discovering all association rules is composed of two steps: 1) find the large itemsets that have transaction support above the minimum support, and 2) from the discovered large itemsets, generate the desired association rules. The overall performance of mining association rules is determined mainly by the first step; after the large itemsets have been identified, the corresponding association rules can be derived in a straightforward manner. All the algorithms produce frequent itemsets on the basis of minimum support. The advantage of the Apriori algorithm is that it is a simple and easy way to find the frequent elements in the database; its disadvantages are that it needs a large search space, its I/O cost is high, and the increased number of database scans and candidate generations raises the computational cost [1, 2]. Apriori is quite successful for market basket analysis, in which transactions are large but the number of frequent items generated is small. The advantage of the Eclat algorithm is that it produces good results when the database is very large; its disadvantage is that it generates a larger number of candidates than Apriori. Vertical-layout-based algorithms claim to be faster than Apriori but require more memory than horizontal-layout-based ones, because they need to load the candidates, the database and the TID lists into main memory [3]. The advantages of the FP-Growth algorithm are that it makes only two passes over the dataset and generates no candidates, making it faster than Apriori; its disadvantages are that the tree may not fit in memory and is expensive to build [4].
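The support and confidence measures defined above can be illustrated with a short sketch (a minimal, hypothetical example in Python; the item names, the toy database and the function names are invented for illustration and are not from the paper):

```python
def support(D, itemset):
    """Fraction of transactions in D containing every item of `itemset`."""
    return sum(itemset <= t for t in D) / len(D)

def confidence(D, X, Y):
    """Of the transactions containing X, the fraction also containing Y."""
    return support(D, X | Y) / support(D, X)

# Hypothetical toy database: each transaction is a set of items.
D = [{'beer', 'chips'}, {'beer', 'chips', 'milk'},
     {'beer', 'milk'}, {'chips'}, {'beer', 'chips'}]

# Rule {beer} => {chips}: support counts transactions with beer AND chips.
print(support(D, {'beer', 'chips'}))       # 3 of 5 transactions -> 0.6
print(confidence(D, {'beer'}, {'chips'}))  # 3 of 4 beer-buyers (~0.75)
```

A rule is reported only when both values exceed the user-specified minimum support and minimum confidence.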
The advantages of the H-mine algorithm are that it generates no candidates and does not need to store any frequent pattern in memory; its disadvantages are that it requires more memory and offers no random access [5]. FP-Tree and H-mine perform better than all the algorithms discussed above because they generate no candidate sets, but the pointers that must be kept in memory require a large memory space. The advantages of the Frequent Item Graph (FIG) algorithm are a quick mining process and scanning the entire database only once; its disadvantage is that it requires a full scan of the frequent 2-itemsets when building the graphical structure [6]. The FIG algorithm thus mines quickly without candidates, but needs the frequent 2-itemsets scanned in full to construct the graph in its second phase. The advantages of the Frequent Itemsets Algorithm for Similar Transactions (FIAST) are that it saves space and reduces time; its disadvantage is that it uses AND operations to find itemsets [7]. FIAST tries to reduce I/O, space and time, but its performance decreases on sparse datasets. The advantages of the Indexed Limited Level Tree (ILLT) algorithm are that it is easy to find frequent itemsets for different support levels and that it scans the database only once; its disadvantage is candidate generation [8]. ILLT performs well but requires a large memory space to store its tree structure. In data mining, frequent itemset mining is widely acknowledged because of its many applications, such as correlation analysis, association rules based on frequent patterns, and sequential pattern tasks. Finding association rules from frequent itemsets is as important as the other data mining tasks. The major difficulty in frequent pattern mining is the large number of resulting patterns: as the minimum threshold is lowered, an exponentially large number of itemsets are generated. Pruning therefore becomes an important topic in mining frequent patterns.
Therefore, the goal is to optimize the process of finding frequent patterns so that it is scalable and efficient and yields the important patterns [9].

III. Methodology

A large transactional database holds many items, so it is sure to contain various transactions with the same set of items. By taking advantage of these transactions, the improved procedure tries to find the frequent itemsets and prune off the candidate itemsets whose node count is lower than the minimum support, resulting in efficient execution time. Sampling is a popular method in computational statistics; two important terms related to it are population and sample. The population is defined in keeping with the objectives of the study; a sample is a subset of the population. Usually, when the population is large, a scientifically chosen sample can be used to represent it, because the sample reflects the characteristics of the population from which it is drawn. In data mining the population is usually large, so the sampling method is appropriate. As an example, suppose that the sample data S in Table 2 is a carefully chosen sample of the population P in Table 1.

Table 1: Population Data

TID   List of item_IDs
1     i1, i2, i5
2     i2, i4
3     i2, i3
4     i1, i2, i4
5     i1, i3
6     i2, i3
7     i1, i3
8     i1, i2, i3, i5
9     i1, i2, i3
10    i1, i2, i5
11    i2, i4
12    i2, i3
13    i1, i2, i4
14    i1, i3
15    i2, i3
16    i1, i3
17    i1, i2, i3, i5
18    i1, i2, i3

The sampling method can save much time: if the sample is carefully chosen, it represents the population, so the table built from the sample represents the one built from the population, and the 2-itemsets with high frequency in the sample's table are liable to be the ones with high frequency in the population's table.

Table 2: Sample Data

TID   List of item_IDs
1     i1, i2, i5
2     i2, i4
3     i2, i3
4     i1, i2, i4
5     i1, i3
6     i2, i3
7     i1, i3
8     i1, i2, i3, i5
9     i1, i2, i3

Procedure
1.) Carefully draw a sample S from the population P, usually by random sampling.
2.) Process the sample S to get a table, denoted HS.
3.) Rank the table HS by the frequency of the column content, so that the column addresses with high frequency come first and those with low frequency last; this gives a new table HSR.
4.) Based on HSR, process the rest of the population P, i.e. P − S; when finished, this gives a table denoted HP.
5.) Obtain the frequent itemsets according to the predetermined minimum support count.

Pseudo Code

Input: Database D, min_sup
Output: Frequent itemsets L
Procedure MiningFrequentItemsets
  For some transactions t ∈ D:
    Insert t into sample S;
  End for;
  Build a hash table HS from sample S;
  Rank HS in descending order of frequency, giving a new hash table HSR;
  Based on HSR, count the rest of the database (P − S), giving a hash table HP;
  For each item in HP:
    If count(item) ≥ min_sup, add item to L;
  End for;
  Return L;

Example

Take the data in Table 1 as an example; we show how the procedure works. Draw a sample S from the population P, shown in Table 2. Process the sample S to get a table, denoted HS, shown in Table 3.

Table 3: Table HS

Address  Count  Content
(1, 2)   4      {i1, i2}
(1, 5)   2      {i1, i5}
(2, 5)   2      {i2, i5}
(2, 4)   2      {i2, i4}
(2, 3)   4      {i2, i3}
(1, 4)   1      {i1, i4}
(1, 3)   4      {i1, i3}
(3, 5)   1      {i3, i5}

Rank the table HS by the frequency of the column content, so that the addresses with high-frequency content come first and those with low-frequency content last; this gives a new table HSR, shown in Table 4.
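The five steps of the procedure can be sketched in Python as follows (a minimal illustration using the 2-itemset counts from the running example; the function and variable names are ours, not the paper's, and a hash table is approximated with a Counter):

```python
from collections import Counter
from itertools import combinations

def mine_frequent_2itemsets(population, sample_size, min_sup):
    """Sketch of the sampling-based procedure: count 2-itemsets in a
    sample first, then finish the count over the rest of the database."""
    sample, rest = population[:sample_size], population[sample_size:]

    # Steps 1-2: build hash table HS of 2-itemset counts from the sample.
    hs = Counter()
    for t in sample:
        hs.update(combinations(sorted(t), 2))

    # Step 3: rank by descending frequency to get HSR (frequent pairs first;
    # here the ranking is mimicked by reinsertion order).
    hsr = Counter(dict(hs.most_common()))

    # Step 4: extend the counts over the remaining transactions P - S.
    hp = hsr.copy()
    for t in rest:
        hp.update(combinations(sorted(t), 2))

    # Step 5: keep the 2-itemsets meeting the minimum support count.
    return {pair: n for pair, n in hp.items() if n >= min_sup}

# Population of Table 1 (TIDs 1-18); its first nine rows are sample S
# (Table 2), and rows 10-18 repeat the same nine transactions.
P = [
    {'i1', 'i2', 'i5'}, {'i2', 'i4'}, {'i2', 'i3'}, {'i1', 'i2', 'i4'},
    {'i1', 'i3'}, {'i2', 'i3'}, {'i1', 'i3'}, {'i1', 'i2', 'i3', 'i5'},
    {'i1', 'i2', 'i3'},
] * 2

# With support count 6, the pairs {i1,i2}, {i2,i3}, {i1,i3} are frequent
# (count 8 each), matching Table 5.
print(mine_frequent_2itemsets(P, 9, min_sup=6))
```

Because the sample's counts (Table 3) are only extended, never recomputed, the full database is scanned exactly once.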
Table 4: Table HSR

Address  Count  Content
(1, 2)   4      {i1, i2}
(2, 3)   4      {i2, i3}
(1, 3)   4      {i1, i3}
(1, 5)   2      {i1, i5}
(2, 5)   2      {i2, i5}
(2, 4)   2      {i2, i4}
(1, 4)   1      {i1, i4}
(3, 5)   1      {i3, i5}

Based on HSR, process the rest of the population P, i.e. P − S; when finished, this gives a table denoted HP, shown in Table 5.

Table 5: Table HP

Address  Count  Content
(1, 2)   8      {i1, i2}
(2, 3)   8      {i2, i3}
(1, 3)   8      {i1, i3}
(1, 5)   4      {i1, i5}
(2, 5)   4      {i2, i5}
(2, 4)   4      {i2, i4}
(1, 4)   2      {i1, i4}
(3, 5)   2      {i3, i5}

Obtain the frequent itemsets according to the predetermined minimum support count. If we set the support count to 6, we find that the 2-itemsets {i1, i2}, {i2, i3} and {i1, i3} are frequent.

IV. Result Analysis

In our experiments we chose the wine dataset to test the efficiency of the procedure. The wine dataset contains 178 records and 14 columns. Table 6 describes the dataset, taken from the UCI repository of machine learning databases [10].

Table 6: Characteristics of the Dataset

Dataset        Number of Records  Number of Columns
Wine.data.txt  178                14

The experimental study revealed the performance of the improved procedure relative to the ILLT algorithm. The run time is the time to mine the frequent itemsets. The execution times shown in Table 7 reveal that the improved procedure outperforms the ILLT algorithm; the same result is shown in Figure 1. As is clear from the comparison, the improved procedure performs well at low support values on the Wine dataset, which contains 178 transactions and 14 columns.
At higher support values, however, its advantage over the ILLT algorithm shrinks slightly: the difference in execution time between the improved procedure and ILLT decreases at the later support levels.

Table 7: Execution Time for ILLT and the Improved Procedure on the Wine Dataset

Support (%)  ILLT (s)  Improved Procedure (s)
40           3.81      3.28
50           1.77      1.45
60           1.18      0.96

Figure 1: Total execution time for ILLT and the improved procedure on the Wine dataset

V. Conclusion

The frequent pattern mining problem has been studied extensively. In creating our improved procedure we considered time consumption, a factor that is determined by the approach used to find the frequent itemsets. Work has been done to develop an improved procedure that is an improvement over the ILLT algorithm. On the wine dataset the running time of our improved procedure outperformed ILLT: it performed well at the lower support levels and also at the higher support levels. Thus it saves much time and, as the results show, can be considered an efficient method.

References

[1] R. Agrawal, T. Imielinski, A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD Conference, Washington DC, USA, 1993.
[2] R. Agrawal, R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. Int'l Conf. on Very Large Data Bases (VLDB), pp. 487–499, September 1994.
[3] C. Borgelt, "Efficient Implementations of Apriori and Eclat," Proc. 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations, CEUR Workshop Proceedings 90, Aachen, Germany, 2003.
[4] J. Han, J. Pei, Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), 2000.
[5] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, D. Yang, "H-mine: Hyper-Structure Mining of Frequent Patterns in Large Databases," Proc. Int'l Conf. on Data Mining (ICDM), November 2001.
[6] A. V. S. Kumar, R. S. D. Wahidabanu, "A Frequent Item Graph Approach for Discovering Frequent Itemsets," Proc. Int'l Conf. on Advanced Computer Theory and Engineering (ICACTE '08), pp. 952–956, 20–22 Dec. 2008.
[7] F. Duemong, L. Preechaveerakul, S. Vanichayobon, "FIAST: A Novel Algorithm for Mining Frequent Itemsets," Proc. Int'l Conf. on Future Computer and Communication (ICFCC 2009), pp. 140–144, 3–5 April 2009.
[8] S. Venkateswari, R. M. Suresh, "An efficient for discovery of frequent itemsets," Proc. Int'l Conf. on Signal and Image Processing (ICSIP), pp. 531–533, 15–17 Dec. 2010.
[9] S. Pramod, O. P. Vyas, "Survey on Frequent Item set Mining Algorithms," International Journal of Computer Applications, pp. 86–91, 2010.
[10] C. L. Blake, C. J. Merz, UCI Repository of Machine Learning Databases, Dept. of Information and Computer Science, University of California at Irvine, CA, USA, 1998.