2016 3rd International Conference on Engineering Technology and Application (ICETA 2016) ISBN: 978-1-60595-383-0

Research on an Improved Algorithm Based on AprioriTid Transaction and Candidate Item Compression

Yonggui Zou & Bowen Tan
School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China

ABSTRACT: Association rule mining is one of the most popular data mining technologies. This paper studies the AprioriTid association rule mining algorithm and identifies its two main problems: the candidate Tid lists can become very large, and a large number of useless items are kept in storage. To address these problems, this paper proposes an improved AprioriTid algorithm based on transaction set compression and candidate item set compression. The algorithm optimizes the self-join of the frequent item sets to reduce the candidate item sets by deleting transactions and shrinking the data set. Tests comparing the two algorithms in several respects on a UCI standard test set show that the improved algorithm improves time efficiency by 10%-20%.

Keywords: data mining; association rule; AprioriTid Algorithm; improved AprioriTid Algorithm

1 INTRODUCTION

Data mining[1] is the process of extracting previously unknown but potentially useful information from enormous, incomplete, noisy, fuzzy, and random practical application data. The information obtained from data mining is effective and practical. The core techniques of data mining include statistics, artificial intelligence, and machine learning. Data mining is one of the most active areas of database research, and association rules are an important research topic within it. Association rule learning is a method for discovering interesting relations between variables in large databases.
It is intended to identify strong rules discovered in databases using some measures of interestingness.[1] Based on the concept of strong rules, Rakesh Agrawal et al.[2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements. In addition to the above example from market basket analysis, association rules are employed today in many application areas including web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.

Association rules are one of the important topics of KDD (Knowledge Discovery in Databases) research, first proposed by R. Agrawal. Association rules are discovered in two steps: first, iteratively identify all frequent item sets whose support is not lower than the minimum support set by the user; second, from these frequent item sets, build all rules whose confidence is not lower than the user-specified minimum confidence. This is the core of association rule discovery algorithms.

2 ASSOCIATION ALGORITHM ANALYSIS: BASIC CONCEPTS AND ALGORITHMS

Many business enterprises accumulate large quantities of data from their day-to-day operations. For example, huge amounts of customer purchase data are collected daily at the checkout counters of grocery stores. Table 1 illustrates an example of such data, commonly known as market basket transactions.
Each row in this table corresponds to a transaction, which contains a unique identifier labeled TID and the set of items bought by a given customer. Retailers are interested in analyzing the data to learn about the purchasing behavior of their customers. Such valuable information can be used to support a variety of business-related applications such as marketing promotions, inventory management, and customer relationship management.

Table 1. An example of market basket transactions.

TID  Items
1    {Bread, Milk}
2    {Butter}
3    {Beer, Diaper}
4    {Bread, Milk, Butter}
5    {Bread}

Association analysis is useful for discovering interesting relationships hidden in large data sets. The uncovered relationships can be represented in the form of association rules or sets of frequent items. For example, a rule such as {Butter, Bread} ⇒ {Milk} can be extracted from the data set shown in Table 1. The rule suggests that a strong relationship exists between the sale of butter, bread, and milk: if butter and bread are bought, customers also tend to buy milk. Retailers can use this type of rule to help them identify new opportunities for cross-selling their products to customers.

Besides market basket data, association analysis is also applicable to other application domains such as bioinformatics, medical diagnosis, web mining, and scientific data analysis. In the analysis of Earth science data, for example, association patterns may reveal interesting connections among ocean, land, and atmospheric processes. Such information may help Earth scientists develop a better understanding of how the different elements of the Earth system interact with each other. Even though the techniques presented here are generally applicable to a wider variety of data sets, for illustrative purposes our discussion will focus mainly on market basket data.

There are two key issues that need to be addressed when applying association analysis to market basket data. First, discovering patterns from a large transaction data set can be computationally expensive. Second, some of the discovered patterns are potentially spurious because they may happen simply by chance.

2.1 Basic concepts

Following the original definition proposed by Agrawal et al.,[2] the problem of association rule mining is defined as follows. Let I = {i1, i2, …, in} be a set of n binary attributes called items. Let D = {t1, t2, …, tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID (TID) and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. Every rule is composed of two different sets of items, also known as item sets: X, called the antecedent or left-hand side (LHS), and Y, the consequent or right-hand side (RHS).

To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {Milk, Bread, Butter, Beer, Diaper}, and Table 2 shows a binary 0/1 representation of the small database of Table 1: in each entry, the value 1 means the presence of the item in the corresponding transaction, and the value 0 represents the absence of the item in that transaction.

Table 2. A binary 0/1 representation of market basket data.

TID  Milk  Bread  Butter  Beer  Diaper
1    1     1      0       0     0
2    0     0      1       0     0
3    0     0      0       1     1
4    1     1      1       0     0
5    0     1      0       0     0

An example rule for the supermarket could be {Butter, Bread} ⇒ {Milk}, meaning that if butter and bread are bought, customers also buy milk. In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence. Let X be an item set, X ⇒ Y an association rule, and T a set of transactions of a given database.
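As a concrete illustration, the conversion from the transaction form of Table 1 to the binary 0/1 representation of Table 2 can be done in a few lines of Python (a minimal sketch; the variable names are our own, not part of the paper):

```python
# Transactions from Table 1: TID -> set of items bought.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Butter"},
    3: {"Beer", "Diaper"},
    4: {"Bread", "Milk", "Butter"},
    5: {"Bread"},
}

# The item set I, in the column order of Table 2.
items = ["Milk", "Bread", "Butter", "Beer", "Diaper"]

# Binary representation: one row per transaction,
# 1 if the item occurs in that transaction, 0 otherwise.
binary = {tid: [1 if item in bought else 0 for item in items]
          for tid, bought in transactions.items()}

for tid, row in sorted(binary.items()):
    print(tid, row)
```

Each printed row reproduces the corresponding row of Table 2.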
Support: The support value of X with respect to T is defined as the proportion of transactions in the database which contain the item set X:

sup(X) = |{t ∈ T; X ⊆ t}| / |T|

In the example database, the item set {Milk, Bread, Butter} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions). The argument of sup(X) is a set of preconditions, and thus becomes more restrictive as it grows (instead of more inclusive).

Confidence: The confidence value of a rule X ⇒ Y, with respect to a set of transactions T, is the proportion of the transactions containing X which also contain Y:

conf(X ⇒ Y) = sup(X ∪ Y) / sup(X)

For example, the rule {Butter, Bread} ⇒ {Milk} has a confidence of 0.2/0.2 = 1.0 in the example database, which means that the rule is correct for 100% of the transactions containing butter and bread (100% of the time a customer buys butter and bread, milk is bought as well).

2.2 The algorithms

Association rule mining can be divided into two sub-tasks:[2]
(1) Find all the frequent item sets in data set D according to the minimum support.
(2) Generate association rules from the frequent item sets according to the minimum confidence.

The first task, finding all the frequent item sets in D, is the main problem and the standard benchmark of association rule research, and most studies focus on it. The Apriori algorithm[3,4] is a level-wise algorithm for association rule mining with good overall performance. Apriori needs multiple passes to find the frequent item sets: the first pass finds the frequent 1-item sets, and the process then repeats until no new frequent item sets are generated. In the k-th pass, the apriori-gen function generates the candidate k-item sets Ck; the support of each candidate in Ck is then counted by scanning the database and compared with the minimum support to obtain the frequent k-item sets Lk.
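The support and confidence measures defined above, together with the level-wise search just outlined, can be sketched on the toy database in Python (a simplified illustration, not the paper's implementation; the function names are our own, and candidate generation here is naive, without the join and prune refinements of apriori-gen):

```python
# Example database from Table 1, one set of items per transaction.
D = [{"Bread", "Milk"}, {"Butter"}, {"Beer", "Diaper"},
     {"Bread", "Milk", "Butter"}, {"Bread"}]

def sup(X):
    """Support: fraction of transactions that contain item set X."""
    return sum(1 for t in D if X <= t) / len(D)

def conf(X, Y):
    """Confidence of the rule X => Y."""
    return sup(X | Y) / sup(X)

def apriori(min_sup):
    """Level-wise search: frequent k-item sets built from level k-1."""
    items = sorted(set().union(*D))
    L = [frozenset([i]) for i in items if sup({i}) >= min_sup]
    frequent = list(L)
    k = 2
    while L:
        # Naive candidate generation: unions of frequent (k-1)-item
        # sets whose union has exactly k items.
        Ck = {a | b for a in L for b in L if len(a | b) == k}
        L = [c for c in Ck if sup(c) >= min_sup]
        frequent.extend(L)
        k += 1
    return frequent

print(sup({"Milk", "Bread", "Butter"}))     # 0.2, as in the worked example
print(conf({"Butter", "Bread"}, {"Milk"}))  # 1.0, as in the worked example
```

With min_sup = 0.4, for instance, the search keeps {Bread}, {Milk}, {Butter}, and {Bread, Milk}, and stops once no frequent 3-item set can be formed.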
The apriori-gen function consists of a join step and a prune step. The prune step reduces the number of entries in the candidate item set Ck. The main idea is:

(1) L1 = {large 1-itemsets};
(2) for (k = 2; Lk-1 ≠ ∅; k++) do begin
(3)   Ck = apriori-gen(Lk-1);   // new candidates
(4)   for all transactions t ∈ D do begin
(5)     Ct = subset(Ck, t);     // candidates contained in transaction t
(6)     for all candidates c ∈ Ct do
(7)       c.count++;
(8)   end
(9)   Lk = {c ∈ Ck | c.count ≥ minsup}
(10) end
(11) Answer = ∪k Lk

AprioriTid replaces the repeated database scans with a set C̄k. Each entry of C̄k has the form <TID, {Xk}>, where each Xk is a potentially large k-item set present in the transaction uniquely identified by TID. The first step of the AprioriTid algorithm is to scan the data set D, calculate the support of the candidate 1-item sets C1, and generate the frequent 1-item set L1; C̄1 corresponds to the data set D itself. The algorithm then iterates until no new frequent item sets are generated; in each iteration every transaction t ∈ D is recorded as <t.TID, {c ∈ Ck | c ⊆ t}> in C̄k. In the k-th step, the frequent (k-1)-item sets Lk-1 from the previous step are used to generate the candidate k-item sets Ck; then C̄k-1 is traversed to count the support of the candidates in Ck and to generate the frequent item set Lk, building C̄k at the same time. As k increases, C̄k becomes far smaller than the transaction data set D; this reduces the I/O time and the size of the data set to be scanned, and improves the efficiency of the algorithm.

3 OPTIMIZED APRIORITID ALGORITHM

3.1 Properties of the optimized AprioriTid Algorithm

Property 1: All the non-empty subsets of a frequent item set are frequent item sets, and every superset of a non-frequent item set is a non-frequent item set.[6]

Property 2: If the frequent k-item sets can generate (k+1)-item sets, then the number of frequent k-item sets must be greater than k. This follows from Property 1: every (k+1)-item set in Lk+1 has k+1 distinct k-subsets, all of which must belong to the frequent k-item sets.
Property 3: Any transaction that supports a frequent k-item set in Lk supports at least k of the (k-1)-item sets in Lk-1.

Property 4: The non-frequent items in the AprioriTid transaction data C̄k can be discarded when calculating Lk+1. By contradiction: if a non-frequent item could not be discarded, it would have to support some item set in Lk+1, which contradicts Property 1.

3.2 The optimization idea of AprioriTid

The AprioriTid algorithm[5] is a classic association rule algorithm improved from the Apriori algorithm. Apriori scans the data set D in every iteration to count the support of the candidate item sets, whereas AprioriTid uses D only once, when counting the support of the 1-item sets. Afterwards, AprioriTid works on C̄k instead: C̄k stores, for each transaction, the k-dimensional candidate item sets it contains, rather than the transaction item sets themselves, and this is used to decide whether a k-dimensional candidate item set can become a k-dimensional frequent item set.

(1) The improved algorithm based on transaction item set compression.[7] In the k-th step, while traversing C̄k-1 to process the candidates c ∈ Ck, if the number of potential large item sets in a transaction record is less than or equal to 1, the transaction record is deleted directly. Justification: in the k-th step, while C̄k-1 is traversed and C̄k is generated, each transaction record of C̄k-1 is checked for whether it contains the two (k-1)-item sets from which each candidate c was joined; therefore any transaction record that can still support a candidate must contain at least two potential large item sets.

(2) The improved algorithm based on candidate item set compression.[8] Before mining the association rules, the algorithm preprocesses the original data and builds a data dictionary to classify the attributes and map them to simple numeric codes.
In this paper, the algorithm sorts the data set in ascending dictionary order to reduce the candidate item sets. When generating the candidate k-item sets, only item sets whose support is not less than the minimum support are kept, and for k ≥ 2 the Tid table stores, for each transaction, the set of frequent item sets it contains instead of the raw item sets (in the form <t.TID, {c ∈ Lk-1 | c ⊆ t}>), which reduces the data storage. After obtaining the frequent item set Lk, the occurrences of each item are counted; if an item's count is less than minsup, it is removed, and a new frequent item set L'k replaces the original Lk. This reduces the number of candidate item sets generated in the next iteration, as well as the storage space and time.

3.3 Improved AprioriTid Algorithm pseudocode

(1)  C̄1 = D
(2)  L1 = getFreq1Itemset(C̄1)
(3)  L'1 = L1
(4)  for (k = 2; (L'k-1 ≠ ∅) && (C̄k-1 ≠ ∅); k++) {
(5)    Ck = Apriori_gen(L'k-1)
(6)    C̄k = ∅
(7)    for all transactions t ∈ C̄k-1 {
(8)      if (transaction t contains fewer than k items) continue
(9)      else for all candidates c ∈ Ck {
(10)       if (c ⊆ t) {
(11)         Ct = Ct ∪ c
(12)         c.sup++
(13)       }
(14)     }
(15)     if ((Ct ≠ ∅) && (the number of item sets in Ct ≥ k))
(16)       C̄k = C̄k ∪ <t.TID, Ct>
(17)   }
(18)   Lk = {c ∈ Ck | c.sup ≥ minsup}
(19)   L'k = getFreqKItemSet(Lk)
(20) }

4 EXPERIMENT AND RESULTS

To verify the efficiency of the improved algorithm, this experiment compares the frequent-item-set generation time of the improved algorithm (named NewAprioriTid) and the AprioriTid algorithm under different supports. To better verify the efficiency of NewAprioriTid, we use a UCI standard test data set, the Mushroom data set. The UCI standard test data sets are popular in data mining; they are freely available on the Internet and widely accepted as authoritative. The experimental environment is the Windows 7 operating system, a 2.50 GHz CPU, and 4.0 GB of memory, with Java as the development language.

4.1 Comparison between the original algorithm and the improved algorithm, and analysis

(1) Fix the number of transactions in the data set and compare the original algorithm and the improved algorithm under different support thresholds. 4,000 records were chosen randomly from the transactional database; the resulting running times are shown in Table 3.

Table 3. The run time of 4000 transactions.

Support  AprioriTid (ms)  NewAprioriTid (ms)  Improvement
0.5      340              313                 8%
0.55     330              290                 12%
0.6      304              270                 11%
0.65     283              242                 14%
0.7      261              214                 18%

To compare the running efficiency clearly, the results are also plotted in Figure 1.

Figure 1. The run time of 4000 transactions. (Run time in ms versus support for AprioriTid and NewAprioriTid.)

It can be seen from the table above that the behavior of the two algorithms is relatively stable: the running time decreases as the support increases. The time-saving curve shows that the improved algorithm takes less time than the original; its running time is roughly 10%-20% lower, and the saving becomes more apparent as the support threshold increases. At the start of a run, a lower support threshold leaves more frequent item sets and association rules satisfying the condition, so the running time is relatively long; as the support increases, the variation in running time becomes smaller. A high support threshold restrains the generation of the large numbers of candidate and frequent item sets that satisfy only lower supports and thus saves a great deal of time, which shows the superiority of the improved algorithm.

(2) To compare the efficiency under different numbers of transactions, take a minimum support of 0.5 and randomly select 5000, 6000, 7000, and 8000 records for comparison and analysis.
The results are shown in Table 4.

Table 4. The efficiency under different numbers of transactions.

Number of transactions  AprioriTid (ms)  NewAprioriTid (ms)
5000                    263              227
6000                    291              258
7000                    326              283
8000                    354              314

For better comparison, the data are also plotted in Figure 2.

Figure 2. The efficiency under different numbers of transactions. (Run time in ms versus number of transactions for AprioriTid and NewAprioriTid.)

It can be seen that the running times of both the original algorithm and the improved one increase with the size of the transaction set under the fixed support of 0.5, and that the general tendencies of the two algorithms are the same. Meanwhile, the improved algorithm has the advantage of a lower growth rate of running time.

5 CONCLUSION

The association rule mining problem can be divided into two sub-tasks: finding the frequent item sets and generating the association rules. Finding the frequent item sets is the core of association rule mining. This paper proposed an improved AprioriTid algorithm based on transaction set compression and candidate item set compression. On the one hand, it compresses the size of the data set efficiently and reduces the number of data set scans; on the other hand, it reduces the candidate item sets and improves the algorithm. The experimental results show that the improved AprioriTid algorithm outperforms the original one.

REFERENCES

[1] Han J., Kamber M., Pei J. 2011. Data Mining: Concepts and Techniques. Elsevier.
[2] Agrawal R., Mannila H., Srikant R., et al. 1996. Fast discovery of association rules. Advances in Knowledge Discovery and Data Mining, 12(1): 307-328.
[3] Park J.S., Chen M.S., Yu P.S. 1995. An effective hash-based algorithm for mining association rules. ACM SIGMOD Record, 24(2): 175-186.
[4] Pasquier N., Bastide Y., Taouil R., et al. 1999. Discovering frequent closed itemsets for association rules. Database Theory - ICDT'99, pp. 398-416.
[5] Li Z.C., He P.L., Lei M. 2005. A high efficient AprioriTid algorithm for mining association rule. Proceedings of the 2005 International Conference on Machine Learning and Cybernetics. IEEE, 3: 1812-1815.
[6] Lan C., Liu Y., Tang Z. 2010. Improvement of AprioriTid algorithm for mining frequent items. Computer Applications and Software, 27: 234-236.
[7] Peng Y., Xiong Y. 2006. Study on optimization of AprioriTid algorithm for mining association rules. Computer Engineering, 5: 019.
[8] Athiyaman B., Sahu R., Srivastava A. 2013. Hybrid data mining algorithm: An application to weather data. Journal of Indian Research, 1(4): 71-83.