International Journal of Computer Trends and Technology, July to Aug Issue 2011

Distributed Count Association Rule Mining Algorithm

Tirumala Prasad B #1, Dr. MHM Krishna Prasad *2
# Dept. of Information Technology, UCEV-JNTUK, Vizianagaram, A.P., India
* Associate Professor & Head, Dept. of IT, UCEV-JNTUK, Vizianagaram, A.P., India

Abstract- With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources and to track and analyze their usage patterns. Association rule mining (ARM) is an active data mining research area; however, most ARM algorithms cater to a centralized environment. In contrast to previous ARM algorithms, Optimized Distributed Association Rule Mining (ODARM) is a distributed algorithm for geographically spread data sets that aims to reduce operational and communication costs. Recently, as the need to mine patterns across distributed databases has grown, Distributed Association Rule Mining (DARM) algorithms have been developed. These algorithms assume that the databases are either horizontally or vertically distributed. In the special case of databases populated from information extracted from textual data, existing DARM algorithms cannot discover rules based on higher-order associations between items in distributed textual documents that are neither vertically nor horizontally distributed, but rather a hybrid of the two. Hence, this paper proposes a Distributed Count Association Rule Mining Algorithm (DCARM), which is evaluated on real datasets obtained from the UCI Machine Learning Repository.

Keywords - ARM algorithm, ODARM, DARM, PARM.

I. INTRODUCTION

Data mining [1], the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought online. When implemented on high-performance client/server or parallel processing computers, data mining tools can analyse massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?" Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, and clustering.
Data mining is a complex topic with links to multiple core fields such as computer science, and it adds value to rich seminal computational techniques from statistics, information retrieval, machine learning, and pattern recognition. Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently has generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:

- massive data collection,
- powerful multiprocessor computers, and
- data mining algorithms.

Commercial databases are growing at unprecedented rates. A META Group survey of data warehouse projects found that 19% of respondents were beyond the 50-gigabyte level, while 59% expected to be there by the second quarter of 1996. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

Web mining [2] is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. There are roughly three knowledge discovery domains that pertain to web mining: web content mining, web structure mining, and web usage mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions; web document text mining, resource discovery based on concept indexing, and agent-based technology also fall in this category. Web structure mining is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents on the Web. Finally, web usage mining, also known as web log mining, is the process of extracting interesting patterns from web access logs.

Web content mining is an automatic process that goes beyond keyword extraction. Since the content of a text document presents no machine-readable semantics, some approaches have suggested restructuring the document content into a representation that can be exploited by machines. The usual approach to exploiting known structure in documents is to use wrappers to map documents to some data model.

A) Classification: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to the specific variable(s) you are trying to predict. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with values "Good" and "Bad."
B) Clustering: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from each other, where distance is measured with respect to all available or considered variables.

Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviours. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data, quickly. A typical example of a predictive problem is targeted marketing: data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data-entry keying errors.

DARM [3], [4] discovers rules from various geographically distributed data sets. However, the network connection between those data sets isn't as fast as in a parallel environment, so distributed mining usually aims to minimize communication costs.

II. ASSOCIATION RULE MINING

Association rule mining [5] finds interesting associations and/or correlation relationships among large sets of data items. Association rules show attribute-value conditions that occur together frequently in a given dataset. A typical and widely used example of association rule mining is market basket analysis. For example, data are collected using bar-code scanners in supermarkets; such "market basket" databases consist of a large number of transaction records, each listing all items bought by a customer in a single purchase transaction. Association rules provide information of this type in the form of "if-then" statements. These rules are computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature. In addition to the antecedent (the "if" part) and the consequent (the "then" part), an association rule has two numbers that express the degree of uncertainty about the rule: support and confidence.

Support: In association analysis, the antecedent and consequent are sets of items (called itemsets) that are disjoint (they do not have any items in common). The first number is called the support for the rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. (The support is sometimes expressed as a percentage of the total number of records in the database.)

Confidence: The other number is known as the confidence of the rule. Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent.
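To make these two measures concrete, the following sketch counts support and confidence for a small transaction database. It is written in Java, the language of our implementation, but the class and variable names here are illustrative only, not part of the proposed system:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RuleMeasures {
    // Count the transactions that contain every item of the given itemset.
    static int supportCount(List<Set<String>> db, Set<String> items) {
        int count = 0;
        for (Set<String> t : db)
            if (t.containsAll(items)) count++;
        return count;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.<Set<String>>asList(
            new HashSet<String>(Arrays.asList("bread", "milk", "jam")),
            new HashSet<String>(Arrays.asList("bread", "milk")),
            new HashSet<String>(Arrays.asList("bread", "butter")),
            new HashSet<String>(Arrays.asList("milk", "jam")));

        Set<String> antecedent = new HashSet<String>(Arrays.asList("bread", "milk"));
        Set<String> rule = new HashSet<String>(antecedent);
        rule.add("jam"); // antecedent plus consequent

        int n1 = supportCount(db, antecedent); // transactions with bread and milk: 2
        int n2 = supportCount(db, rule);       // of those, transactions that also have jam: 1
        System.out.println("support = " + n2 + "/" + db.size()    // 1/4 = 25%
                         + ", confidence = " + (double) n2 / n1); // 1/2 = 50%
    }
}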
Let us see an example based on these two association rule numbers. If a supermarket database has 100,000 point-of-sale transactions, out of which 2,000 include both items A and B and 800 of these also include item C, then the association rule "if A and B are purchased, then C is purchased on the same trip" has a support of 800 transactions (alternatively, 0.8% = 800/100,000) and a confidence of 40% (= 800/2,000). One way to think of support is as the probability that a randomly selected transaction from the database will contain all items in the antecedent and the consequent, whereas the confidence is the conditional probability that a randomly selected transaction will include all the items in the consequent given that the transaction includes all the items in the antecedent.

An association rule tells us about the association between two or more items. For example, in 80% of the cases when people buy bread, they also buy milk. This tells us of the association between bread and milk. We represent it as

bread => milk | 80%

This should be read as "bread implies milk, 80% of the time"; here 80% is the "confidence factor" of the rule. Association rules can involve more than two items, for example:

bread, milk => jam | 60%
bread => milk, jam | 40%

Given any rule, we can easily find its confidence. For example, for the rule bread, milk => jam, we count the number, say n1, of records that contain bread and milk; of these, we count how many contain jam as well, say n2. Then the required confidence is n2/n1. This means the user has to guess which rule is interesting and ask for its confidence. But our goal was to "automatically" find all interesting rules, and this is difficult because the database is bound to be very large: we might have to go through the entire database many times to find all interesting rules.

III. DISTRIBUTED COUNT ASSOCIATION RULE MINING ALGORITHM

We assume each site [6] performs the same tasks as sequential association mining, except that it broadcasts support counts of candidate itemsets after every pass. DCARM first computes support counts of 1-itemsets at each site in the same manner as the sequential Apriori [7]. It then broadcasts those itemsets to the other sites and discovers the globally frequent 1-itemsets. Subsequently, each site generates candidate 2-itemsets and computes their support counts. At the same time, DCARM eliminates all globally infrequent 1-itemsets from every transaction and inserts the new transaction (that is, a transaction without infrequent 1-itemsets) into memory. While inserting the new transaction, it checks whether that transaction is already in memory. If it is, DCARM increases that transaction's counter by one; otherwise, it inserts the transaction into main memory with a count equal to one. After generating support counts of candidate 2-itemsets at each site, DCARM generates the globally frequent 2-itemsets.
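The transaction-reduction step just described can be sketched as follows. This is a minimal illustration under our own naming (the paper does not spell out its data structures): each transaction is stripped of globally infrequent 1-itemsets and then merged with any identical transaction already in memory by incrementing a counter.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TransactionTable {
    // Maps a reduced transaction (canonically sorted) to its occurrence count.
    private final Map<Set<String>, Integer> table = new HashMap<Set<String>, Integer>();

    // Remove globally infrequent 1-itemsets, then insert or bump the counter.
    void add(Set<String> transaction, Set<String> infrequent) {
        Set<String> reduced = new TreeSet<String>(transaction);
        reduced.removeAll(infrequent);
        if (reduced.isEmpty()) return;
        Integer c = table.get(reduced);
        table.put(reduced, c == null ? 1 : c + 1);
    }

    public static void main(String[] args) {
        TransactionTable tt = new TransactionTable();
        Set<String> nf = new TreeSet<String>(Arrays.asList("caviar")); // infrequent item
        tt.add(new TreeSet<String>(Arrays.asList("bread", "milk", "caviar")), nf);
        tt.add(new TreeSet<String>(Arrays.asList("bread", "milk")), nf);
        // Both transactions reduce to {bread, milk}, stored once with count 2.
        System.out.println(tt.table); // {[bread, milk]=2}
    }
}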
It then iterates through the main memory (transactions without infrequent 1-itemsets) and generates the support counts of candidate itemsets of the respective length. Next, it generates the globally frequent itemsets of that length by broadcasting the support counts of candidate itemsets after every pass.

Algorithm: Distributed Count Association Rule Mining Algorithm

NF = {non-frequent global 1-itemsets}
for all transactions t ∈ D {
    for all 2-subsets s of t
        if (s ∈ C2) s.sup++;
    t' = delete_nonfrequent_items(t);
    Table.add(t');
}
send_to_receiver(C2);
/* receive global frequent support counts from the receiver */
F2 = receive_from_receiver(Fg);
C3 = {candidate itemsets};
T = Table.getTransactions();
k = 3;
while (Ck ≠ {}) {
    for all transactions t ∈ T
        for all k-subsets s of t
            if (s ∈ Ck) s.sup++;
    k++;
    send_to_receiver(Ck);
    /* generate the candidate itemsets of pass k+1 */
    Ck+1 = {candidate itemsets};
}

Because DCARM eliminates all globally infrequent items from every transaction and inserts the reduced transactions into main memory, it reduces the transaction size (the number of items) and finds more identical transactions; this is because the data set initially contains both frequent and infrequent items. However, the total number of transactions could exceed the main-memory limit. To deal with this problem, we propose a technique that fragments the data set into different horizontal partitions. Then, from each partition, DCARM removes the infrequent items and inserts each transaction into main memory. While inserting the transactions, it checks whether they are already in memory; if so, it increases that transaction's counter by one, and otherwise it inserts that transaction into main memory with a count equal to one. Finally, it writes all main-memory entries for this partition into a temp file.

In the count distribution (CD) approach, each local site generates support counts and broadcasts them to all other sites, so that each site can calculate the globally frequent itemsets for that pass [8]. The total number of messages broadcast from each site therefore equals (n − 1) * |C|, and the total message size across all n sites is

n * (n − 1) * |C|

where n is the total number of sites and |C| is the number of candidate itemsets. In contrast with CD, DCARM sends the support counts of candidate itemsets to a single site, which calculates the globally frequent itemsets for that pass. We refer to the sites that send local support counts as senders, and to the site that generates the globally frequent itemsets as the receiver.

For example, with three sites, two broadcast their local support counts of candidate itemsets to the third site, which is responsible for generating that iteration's globally frequent itemsets. The total number of messages sent from a sender site to the receiver site equals 1 * |C|. Once the receiver site generates the globally frequent itemsets, it broadcasts them to all sender sites, so the total number of messages broadcast from the receiver is (n − 1) * |Fg|. We can calculate the total message-broadcasting size (the aggregate of the sender and receiver sites' messages) as

(n − 1) * |C| + (n − 1) * |Fg|

where n is the number of sites, |C| is the number of candidate itemsets, and |Fg| is the number of globally frequent itemsets.
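As a quick sanity check of these two cost expressions, the short program below evaluates them for the three-site example; the figures of 1,000 candidate itemsets and 200 globally frequent itemsets are hypothetical, chosen only for illustration:

public class MessageCost {
    public static void main(String[] args) {
        int n = 3;      // number of sites
        int c = 1000;   // |C|: candidate itemsets per pass (hypothetical figure)
        int fg = 200;   // |Fg|: globally frequent itemsets (hypothetical figure)

        // CD-style: every site broadcasts its counts to all other sites.
        long cd = (long) n * (n - 1) * c;
        // DCARM: senders ship counts to one receiver; the receiver broadcasts Fg back.
        long dcarm = (long) (n - 1) * c + (long) (n - 1) * fg;

        System.out.println("CD messages:    " + cd);    // 6000
        System.out.println("DCARM messages: " + dcarm); // 2400
    }
}

With these figures the exchange drops from 6,000 to 2,400 messages, a 60% reduction, consistent with the 60 to 80 percent range reported in the next section.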
Minimum Support: To make the problem tractable, we introduce the concept of minimum support. The user has to specify this parameter; let us call it minsupport. Then any rule

i1, i2, ..., in => j1, j2, ..., jm

needs to be considered only if the set of all items in this rule, {i1, i2, ..., in, j1, j2, ..., jm}, has support greater than minsupport. Consider the rule bread, milk => jam: if the number of people buying bread, milk, and jam together is very small, then this rule is hardly worth consideration, even if it has high confidence. Our problem now becomes: find all rules that have a given minimum confidence and involve itemsets whose support is more than minsupport. Clearly, once we know the supports of all these itemsets, we can easily determine the rules and their confidences. Hence we need to concentrate on the problem of finding all itemsets that have minimum support; we call such itemsets frequent itemsets.

IV. EXPERIMENTAL EVALUATION

In this section, we present the experimental observations obtained while evaluating the behaviour of the proposed distributed association rule mining algorithm on different real and synthetic datasets.

Datasets: Experiments are performed on the Connect-4 and Covertype real datasets from the UCI Machine Learning Repository [9], as well as on synthetic datasets.

Implementation details: We implemented the proposed distributed association rule mining algorithm on different machines that communicate via Java socket programming (JDK 1.6). Experiments are performed on machines with Intel Core 2 Duo 3.0-GHz CPUs and 4 GB of RAM running Windows.

Observations: We have extensively studied DCARM's performance to confirm its effectiveness. We implemented the system in Java and established a socket-based, client-server distributed environment to evaluate DCARM's message-reduction techniques. Each site has a receiving and a sending unit and assigns a specific port to send and receive candidate support counts. Because the candidate itemsets that each site generates are based on the globally frequent itemsets of the previous pass, the candidate itemsets are identical across the various sites.

We organize our performance evaluation by executing DCARM at multiple sites and comparing how much time the algorithm takes to generate frequent itemsets of length n for different numbers of instances. At each site we generate candidate frequent itemsets after removing non-frequent itemsets, to improve the memory utilization of the system. Using this technique at the three sites, we generate candidate frequent itemsets for different numbers of instances; a coordinator then collects the generated candidate frequent itemsets, applies the DCARM algorithm to obtain the globally frequent itemsets, calculates the minimum support count, and generates the association rules.

DCARM exchanges fewer messages among the different sites to generate globally frequent itemsets, reducing the communication cost by 60 to 80 percent. By contrast, a polling-based scheme sends each support count to polling sites: after receiving a request, each polling site sends a polling request to all remote sites other than the originator site; upon receiving the responses from all other sites, the polling site computes whether that candidate itemset is globally frequent and broadcasts only globally frequent itemsets to all other sites. Hence, such a scheme exchanges more messages, because each polling site both sends and receives support counts from remote sites. It also needs to send global support counts to all participating sites when a candidate itemset is heavy, which further increases the communication cost. Furthermore, each polling site receives polling requests only from one site.

Table 1 below shows the evaluated execution times at the three different sites for instances of size 1k to 5k, where k = 1000. We conducted experiments at multiple sites with different sizes of datasets and compared the execution times; from this comparison we can say that DCARM is an efficient algorithm for generating the support counts. Finally, the coordinator's support-count results confirm that the DCARM algorithm is efficient.
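As a rough illustration of the socket-based exchange described above (a minimal sketch only: the port handling, message format, and class names are our assumptions, not the actual protocol of the implementation), a sender site can ship its local support counts to the receiver as a serialized map, and the receiver can fold each sender's counts into a global tally:

import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Map;

public class SupportExchange {
    // Sender site: connect to the receiver and ship this pass's candidate support counts.
    static void send(String host, int port, Map<String, Integer> counts) throws Exception {
        Socket s = new Socket(host, port);
        ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
        out.writeObject(counts);
        out.flush();
        s.close();
    }

    // Receiver site: accept one sender's local counts and fold them into the global tally.
    static void receiveInto(int port, Map<String, Integer> global) throws Exception {
        ServerSocket server = new ServerSocket(port);
        Socket s = server.accept();
        ObjectInputStream in = new ObjectInputStream(s.getInputStream());
        @SuppressWarnings("unchecked")
        Map<String, Integer> local = (Map<String, Integer>) in.readObject();
        for (Map.Entry<String, Integer> e : local.entrySet()) {
            Integer g = global.get(e.getKey());
            global.put(e.getKey(), (g == null ? 0 : g) + e.getValue());
        }
        s.close();
        server.close();
    }
}

In a real run, the receiver would accept one connection per sender per pass and then broadcast the globally frequent itemsets back over the same sockets.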
[Figure 1: Comparison of multiple sites]

Figure 1 shows the comparative evaluations for the multiple sites and the coordinator.

Table 1: Execution times for multiple sites and coordinator

No. of instances               1000   2000   3000   4000   5000
Site 1 local time               125    281    406    547    672
Site 2 local time               156    261    390    531    641
Site 3 local time               109    250    407    562    672
Algorithm computational time     31     32     62     78    109

V. CONCLUSION

With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources and to track and analyze their usage patterns. Association rule mining is an active data mining research area. This paper presents a distributed association rule mining algorithm that utilizes the global count of frequent itemsets. From the experimental results, we observed that Distributed Count Association Rule Mining provides an efficient method for generating association rules from different datasets distributed among various sites.

REFERENCES

[1] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach", Data Mining and Knowledge Discovery, vol. 8, pp. 53-87, 2004.
[2] R. Cooley, B. Mobasher, and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", Proc. 9th IEEE Int'l Conf. Tools with Artificial Intelligence (ICTAI '97), pp. 558-567, 1997.
[3] R. Agrawal and J.C. Shafer, "Parallel Mining of Association Rules", IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, pp. 962-969, 1996.
[4] D.W. Cheung et al., "A Fast Distributed Algorithm for Mining Association Rules", Proc. Parallel and Distributed Information Systems, IEEE CS Press, pp. 31-42, 1996.
[5] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases", Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), Morgan Kaufmann, pp. 487-499, 1994.
[6] M.Z. Ashrafi, D. Taniar, and K. Smith, "ODAM: An Optimized Distributed Association Rule Mining Algorithm", IEEE Distributed Systems Online, vol. 5, no. 3, 2004.
[7] A. Inokuchi, T. Washio, and H. Motoda, "An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data", Lecture Notes in Computer Science, pp. 13-23, 2000.
[8] M.J. Zaki, "Parallel and Distributed Association Mining: A Survey", IEEE Concurrency, vol. 7, no. 4, pp. 14-25, 1999.
[9] UCI Machine Learning Repository, accessed 12 Feb 2011.