IJCSC Volume 5 • Number 1 March 2014 pp.10-15 ISSN-0973-7391
Association Rule Mining: An Overview
Neesha Sharma 1, Dr. Chander Kant Verma 2
1 M.Tech Scholar, DCSA, Kurukshetra University, Kurukshetra
2 Assistant Professor, DCSA, Kurukshetra University, Kurukshetra
[email protected], [email protected]
Abstract: Association rule mining (ARM) has long been, and continues to be, an area of interest for many researchers. It is one of the important tasks of data mining and aims at discovering relationships among the items in a database. The objective of this paper is to review the basic concepts of the ARM technique along with recent related work in this field. The paper also discusses the issues and challenges of association rule mining and presents a small comparison of the performance of various association rule mining algorithms.
Keywords: Association rule mining, Weka, XLMiner.
1. Introduction
The process of knowledge discovery in databases (KDD) includes selection of data, its preprocessing, transformation, data mining, and interpretation. Data mining, the analysis step of KDD and an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets. The main goal of the data mining process is to extract information from a data set and transform it into a structure that can be understood for future use. The tasks involved in data mining are anomaly detection, association rule mining, clustering, classification, regression, and summarization [4].
Association rule mining is one of the important tasks of data mining, and indeed one of the most well-studied. It was first introduced in [13]. It discovers relationships among attributes in databases by producing if-then statements concerning attribute values [3]. A more formal definition of association rule mining is as follows. Let I = {I1, I2, ..., In} be a set of n binary attributes called items, and let D = {T1, T2, ..., Tm} be a set of transactions called the database. Every transaction in the database D has a unique transaction ID and contains a subset of the items in I. A rule in the database is defined as an implication of the form I1 => I2, where I1, I2 ⊆ I and I1 ∩ I2 = ∅. The sets of items (itemsets for short) I1 and I2 are called the antecedent (left-hand side) and the consequent (right-hand side) of the rule, respectively [6].
Association rule mining aims at finding, from a given database, the association rules that satisfy the minimum support and confidence thresholds specified by the user. The problem is generally decomposed into two subproblems [17].
Figure 1: Generating Association Rules
As shown in Figure 1, the first subproblem is to find those itemsets whose occurrence in the database exceeds a predefined threshold; these itemsets are called frequent or large itemsets. The second subproblem is to generate association rules from those large itemsets under the constraint of minimal confidence. Suppose one of the large itemsets is Lk = {I1, I2, ..., Ik}. Association rules from this itemset can be generated in the following way: the first rule is {I1, I2, ..., Ik-1} => {Ik}, and by checking the confidence specified by the user this rule is determined to be interesting or not. Further rules are then generated by deleting the last item from the antecedent and inserting it into the consequent, after which the confidences of the new rules are checked to determine their interestingness. This process is iterated while the antecedent is non-empty. Since the second subproblem is fairly straightforward, most research focuses on the first subproblem. The first subproblem, finding all frequent itemsets, can be further divided into two parts [17]: candidate large itemset generation and frequent itemset generation. The itemsets whose support is greater than the support threshold are called large or frequent itemsets, and the itemsets that are expected to be large or frequent are called candidate itemsets.
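The rule-generation step just described can be sketched in a few lines. The following Python fragment is only an illustration of the scheme above; the support helper, the function name and the toy transactions are assumptions, not taken from the paper. Starting from one large itemset, the last antecedent item is repeatedly moved into the consequent, and each resulting rule is kept if it meets the minimum confidence.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def rules_from_large_itemset(large_itemset, transactions, min_conf):
    """Yield (antecedent, consequent, confidence) triples that meet `min_conf`."""
    antecedent = list(large_itemset[:-1])
    consequent = [large_itemset[-1]]
    sup_lk = support(large_itemset, transactions)
    while antecedent:                              # iterate while the antecedent is non-empty
        conf = sup_lk / support(antecedent, transactions)
        if conf >= min_conf:
            yield tuple(antecedent), tuple(consequent), conf
        consequent.insert(0, antecedent.pop())     # move last antecedent item to the consequent

# Hypothetical usage with a tiny transaction database:
transactions = [{"I1", "I2", "I3"}, {"I1", "I2"}, {"I2", "I3"}, {"I1", "I3"}]
for lhs, rhs, conf in rules_from_large_itemset(["I1", "I2", "I3"], transactions, 0.2):
    print(lhs, "=>", rhs, round(conf, 2))
```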
There are two important basic measures of the interestingness of association rules: support (s) and confidence (c). They reflect, respectively, the usefulness and the certainty of discovered rules [10]. The support of an association rule is defined as the fraction of transactions that contain I1 ∪ I2 out of the total number of transactions in the database. If the support of a particular item is 0.1%, it means that only 0.1 percent of the transactions contain a purchase of this item [8]. The confidence of an association rule is defined as the fraction of transactions that contain I1 ∪ I2 out of the transactions that contain I1. Confidence is a measure of the strength of an association rule: if the confidence of a rule X => Y is 70%, it means that 70% of the transactions that contain X also contain Y [8].
To illustrate these concepts, a small example from the supermarket domain is used [6]. The set of items is I = {bread, jam, butter, milk}, and a small database (Table 1) contains the transactions (1 indicates that an item is present in a transaction and 0 that it is not). An example rule for the supermarket could be {bread, jam} => {butter}, meaning that if bread and jam are bought together, customers also buy butter.
Table 1: Sample Database to Find Association Rules

Transaction  Bread  Jam  Butter  Milk
T1             1     1     0      0
T2             1     1     1      0
T3             1     0     1      1
T4             0     1     1      0
T5             1     1     0      0
In the example database, the itemset {bread, jam, butter} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions). The rule {bread, jam} => {butter} has a confidence of 0.2/0.6 ≈ 0.33 in the database, meaning that for about one third of the transactions containing bread and jam the rule is correct (a third of the time a customer buys bread and jam, butter is bought as well).
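As a quick check of these figures, the short Python sketch below (an illustration, not part of the original paper) encodes the transactions of Table 1 as sets and recomputes the support and confidence of the rule {bread, jam} => {butter}.

```python
# Transactions of Table 1, encoded as sets of the items present in each.
transactions = [
    {"bread", "jam"},                  # T1
    {"bread", "jam", "butter"},        # T2
    {"bread", "butter", "milk"},       # T3
    {"jam", "butter"},                 # T4
    {"bread", "jam"},                  # T5
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

sup_all = support({"bread", "jam", "butter"})      # 1/5 = 0.2
conf = sup_all / support({"bread", "jam"})         # 0.2 / 0.6 ≈ 0.33
print(f"support = {sup_all}, confidence = {conf:.2f}")
```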
Another example of association rule mining is as follows [12]. In an online book store, tips are shown after a person purchases some books; for instance, once a person buys the book Data Mining Concepts and Techniques, a list of related books such as Database System (40%) and Data Warehouse (25%) is presented as a recommendation for further purchasing. In this example, the association rules are: when the book Data Mining Concepts and Techniques is bought, 40% of the time the book Database System is bought with it, and 25% of the time the book Data Warehouse is bought with it. The rules discovered from the transaction database of the book store can be used to rearrange the placement of related books, thus further strengthening those rules. They can also help the owner of the store shape marketing strategies; for example, a promotion of the book Data Mining Concepts and Techniques can increase the sales of the other two books mentioned.
2. Basic Association Rule Mining Algorithm
Some of the well-known algorithms for association rule mining are Apriori, FP-growth, AIS, Apriori-TID, Apriori Hybrid, partitioning algorithms, the Tertius algorithm, and many more. Various authors have compared the existing algorithms, and others have proposed new algorithms to remove the drawbacks of older ones.
In general, a set of items (such as the antecedent (LHS) or the consequent (RHS) of a rule) is called an itemset. The number of items contained in an itemset is called its length, and itemsets of length k are called k-itemsets. Generally, an association rule mining algorithm contains the following steps [17]:
a) The set of candidate k-itemsets is generated by 1-extensions of the large (k-1)-itemsets generated in the previous iteration.
b) Supports of the candidate k-itemsets are computed in a pass over the database.
c) Itemsets that do not have the minimum support are discarded, and the remaining itemsets are called large k-itemsets.
This process is repeated until no more large itemsets are found in the database. The most commonly used approach for finding association rules is based on the Apriori algorithm. The efficiency of the level-wise generation of frequent itemsets is improved by using the Apriori property, which states that all nonempty subsets of a frequent itemset must also be frequent [10].
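The level-wise procedure of steps a) to c), together with the Apriori-property pruning just mentioned, can be sketched as follows. This is a compact illustrative implementation, not the pseudocode of any cited algorithm; the function name and the toy database are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search for all itemsets whose support meets `min_support`."""
    n = len(transactions)
    # Large 1-itemsets.
    items = {i for t in transactions for i in t}
    large = [{frozenset([i]) for i in items
              if sum(i in t for t in transactions) / n >= min_support}]
    k = 2
    while large[-1]:
        prev = large[-1]
        # Step a): candidate k-itemsets are 1-extensions of the large (k-1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori property: every (k-1)-subset of a candidate must itself be large.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Steps b) and c): count supports in one pass and keep candidates meeting min_support.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        large.append({c for c, cnt in counts.items() if cnt / n >= min_support})
        k += 1
    return [s for level in large for s in level]

# Hypothetical usage with the Table 1 transactions and a 40% support threshold:
db = [{"bread", "jam"}, {"bread", "jam", "butter"}, {"bread", "butter", "milk"},
      {"jam", "butter"}, {"bread", "jam"}]
print(frequent_itemsets(db, min_support=0.4))
```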
3. Related Work
3.1 Recent Advancements
An efficient algorithm was proposed in 2008 [5] to mine combined association rules on imbalanced datasets. Unlike conventional association rules, combined association rules are organized as a number of rule sets; within each rule set, the individual combined association rules consist of various types of attributes. A novel frequent pattern generation algorithm was proposed to discover the complex inter-rule and intra-rule relationships, and the data imbalance problem was also tackled.
An experiment was performed in 2009 [7] using the association rule mining technique to predict healthy and sick heart status from heart disease data. Three association rule mining algorithms, Apriori, Predictive Apriori and Tertius, were compared in the experiment, and the results clearly showed that the Apriori algorithm is best suited for this type of task.
An efficient version of the Apriori algorithm was proposed in 2010 [11] for mining multi-level association rules in large databases, finding the maximal frequent itemsets at lower levels of abstraction. The new algorithm, SC-BF Multilevel, is fast and efficient and requires only a single scan of the database for mining the complete frequent itemsets. Also in 2010, the results of the Apriori and K-Means algorithms were compared across their implementations in Weka and XLMiner [1]. For this comparison, transaction data of Sales Day (a super store) was used, and the results suggested some strategy changes for the organization that could lead to better sales.
In 2012 [18], work was done on student enrollment data, namely the courses the students would like to learn. Three data mining techniques, clustering (K-Means algorithm), classification (ADTree algorithm) and association rules (Apriori algorithm), were applied to the data to find the best combination of courses. A comparison showed that this combined approach is better than using association rule mining alone for such a task.
In 2013 [16], three association rule algorithms were considered by the author: Apriori, Predictive Apriori and Tertius. The author tried to find the best association rules using Weka, then discussed the limitations of the Apriori algorithm and the ways to improve it. The association rules generated by the three algorithms were compared, and Apriori was found to be faster than the other two. In the same year, the association rule mining technique was applied to a dataset on crimes against women [2], using the Weka tool. One dataset was taken from the UCI repository and another was collected from the session court of Sirsa; the Apriori algorithm was applied to the data to extract hidden information about the crime and the culprit, and a comparison between Apriori and Predictive Apriori found Apriori to be better and faster. Also in 2013, a kidney dataset containing 7 attributes and 157 instances was used to integrate clustering and association rule mining [14]; using Weka, some essential and interesting rules were determined for each cluster. Supermarket data was likewise examined in 2013 [8]: a comparison among the Apriori, FP-Growth and Tertius algorithms, applied to the supermarket data using Weka, showed that FP-Growth performs better than the other two algorithms.
3.2 Performance Review
The advantages and disadvantages of certain association rule mining algorithms are discussed below in tabular form
(Table 2):
3.3 Issues and Challenges
Although a lot of research work has been done in the field of association rule mining and various authors have proposed different algorithms, many issues and challenges still remain to be solved before the technique can be used to full advantage. The main drawbacks of association rule mining algorithms are [9]:
a) Obtaining non-interesting rules
b) A huge number of discovered rules
c) Low algorithm performance
Table 2: Performance Review of a Few Algorithms

1. AIS
Advantages:
a) An estimation is used in the algorithm to prune those candidate itemsets that have no hope to be large.
b) Memory management is used in AIS when memory is not enough.
Disadvantages:
a) Requires multiple scanning of the database.
b) Only rules of the kind X ∪ Y => Z can be generated; rules of the kind X => Y ∪ Z cannot be generated.
c) Too many candidate itemsets that finally turn out to be small are generated.

2. Apriori
Advantages:
a) Easy implementation.
b) It uses the Apriori property for pruning, therefore fewer itemsets are left for further support checking.
Disadvantages:
a) It explains only the presence or absence of an item in the database.
b) It scans the complete database multiple times.
c) It has a complex candidate generation process that uses most of the time, space and memory.
d) It works well only for small databases with a large support factor.

3. FP-Growth
Advantages:
a) It scales better than the Apriori algorithm.
b) It requires only two scans of the database, without any candidate generation.
c) It is not influenced by the support factor.
Disadvantages:
a) The resulting FP-tree is not unique for the same logical database.
b) It cannot be used in an interactive mining system.
c) It cannot be used for incremental mining.
Extracting all the association rules from a database requires counting all the possible combinations of attributes present in that database. Support and confidence factors can be used to retain only the interesting rules, i.e., those whose values for these factors exceed a threshold specified by the user. In most methods the confidence is determined only after the relevant support for the rules is computed. However, when the number of attributes is large, the computational time increases exponentially: for a database of m records and n attributes, assuming binary encoding of attributes in a record, enumerating the subsets of attributes requires about m · 2^n computational steps. For small values of n traditional algorithms are simple and efficient, but for large values of n the computation becomes infeasible [9].
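To get a feeling for this bound, the small sketch below (illustrative only; the record and attribute counts are hypothetical) evaluates m · 2^n for a few values of n and shows how quickly exhaustive enumeration becomes infeasible.

```python
def brute_force_steps(m, n):
    """Approximate number of (record, attribute-subset) checks for exhaustive
    support counting: every non-empty subset of n attributes, once per record."""
    return m * (2 ** n - 1)

# Hypothetical database of 100,000 records with growing numbers of attributes.
for n in (10, 20, 40):
    print(f"n = {n:2d}: about {brute_force_steps(m=100_000, n=n):.3e} steps")
```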
While the general association rule model described in section 2 suffices for many applications, it is not adequate
when the frequency of each item in the basket is variable and cannot be ignored[10].
The key element that makes association rule mining practical is minsup, i.e., the minimum support specified by the user, which is used to prune the uninteresting rules. But using only a single minsup implicitly assumes that all items in the database are of the same nature, which may not always be the case. For example, in the retail business customers frequently buy low-priced items, while higher-priced items may not be bought as frequently. In such a situation, if the minsup is set too high, the generated rules will involve only the low-priced items, which contribute less to the profit of the organization; on the other hand, if the minsup is set too low, too many meaningless frequent patterns will be generated and will overload the decision makers. This dilemma is called the rare item problem [19].
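A common formulation of the multiple-minimum-supports idea proposed to address this, in the line of work cited above [19], is to give each item its own minimum item support (MIS) and to treat an itemset as frequent when its support reaches the smallest MIS among its items. The sketch below assumes that formulation; the MIS values and data are hypothetical and are not taken from [19].

```python
def is_frequent(itemset, transactions, mis):
    """True if support(itemset) >= the smallest MIS over the items it contains."""
    itemset = set(itemset)
    support = sum(1 for t in transactions if itemset <= t) / len(transactions)
    return support >= min(mis[item] for item in itemset)

# Hypothetical data: rarer (e.g. higher-priced) items are given lower MIS values.
transactions = [{"bread", "jam"}, {"bread", "jam", "butter"},
                {"bread", "butter", "milk"}, {"jam", "butter"}, {"bread", "jam"}]
mis = {"bread": 0.4, "jam": 0.4, "butter": 0.3, "milk": 0.1}

print(is_frequent({"bread", "jam"}, transactions, mis))    # 0.6 >= 0.4 -> True
print(is_frequent({"butter", "milk"}, transactions, mis))  # 0.2 >= 0.1 -> True despite low support
```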
Association rule mining has been very successful in various fields, be they commercial, social or human activities. But the technique also poses a threat to privacy, since one can easily disclose others' information by using it. So, before a database is released, sensitive information must be hidden from unauthorized access. It has been recognized that one of the current technical challenges is the development of techniques that incorporate security and privacy issues. The association rule hiding problem aims at sanitizing the database in such a way that, through association rule mining, one will not be able to discover the sensitive rules and only the non-sensitive rules will be mined [15].
4. Conclusion
Association rule mining has gained a lot of attention in the field of research. The technique aims at finding dependencies among the attributes in a database and has therefore found applicability in many fields; one of the major areas of application is market basket analysis. Many algorithms have been proposed that mine association rules from a database based on the minsup and minconf specified by the user. This paper presents a review of association rule mining. It first defines what association rule mining is about, then presents a generalized association rule mining algorithm, and surveys the research work done by various authors in this field. While reviewing the literature it was found that algorithms which do not involve a candidate generation process are faster than those which do. Some of the issues and challenges related to this field have also been presented in this paper.
References
1. A.M. Khattak, A.M. Khan, Sungyoung Lee and Young-Koo Lee, "Analyzing Association Rule Mining and Clustering on Sales Day Data with XLMiner and Weka", International Journal of Database Theory and Application, Volume 3, No. 1, March 2010.
2. Divya Bansal and Lekha Bhambhu, "Execution of Apriori Algorithm of Data Mining Directed Towards Tumultuous Crimes Concerning Women", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277 128X, Volume 3, Issue 9, September 2013.
3. Enrique Garcia, Cristobal Romero, Sebastian Ventura and Toon Calders, "Drawbacks and Solutions of Applying Association Rule Mining in Learning Management Systems", Proceedings of the International Workshop on Applying Data Mining in e-Learning, 2007.
4. http://en.wikipedia.org/wiki/Data_mining
5. Huaifeng Zhang, Yanchang Zhao, Longbing Cao and Chengqi Zhang, "Combined Association Rule Mining", PAKDD 2008, LNAI 5012, pp. 1069-1074, Springer-Verlag Berlin Heidelberg, 2008.
6. Ila Chandrakar and A. Mari Kirthima, "A Survey on Association Rule Mining Algorithms", International Journal of Mathematics and Computer Research, ISSN: 2320-7167, Volume 1, Issue 10, pp. 270-272, November 2013.
7. Jesmin Nahar, Kevin S. Tickle, Shawkat Ali and Yi-Ping Phoebe Chen, "Diagnosis Heart Disease Using an Association Rule Discovery Approach", Proceedings of the IASTED International Conference on Computational Intelligence (CI 2009), August 17-19, 2009, Honolulu, Hawaii, USA.
8. Jyoti Arora, Shelza and Sanjeev Rao, "An Efficient ARM Technique for Information Retrieval in Data Mining", International Journal of Engineering Research and Technology (IJERT), ISSN: 2278-0181, Volume 2, Issue 10, October 2013.
9. Maria N. Moreno, Saddys Segrera and Vivian F. Lopez, "Association Rules: Problems, Solutions and New Applications", Actas del III Taller Nacional de Mineria de Datos y Aprendizaje (TAMIDA 2005), pp. 317-323, ISBN: 84-9732-449-8, Thomson, 2005.
10. Nitin Gupta, Nitin Mangal, Kamal Tiwari and Pabitra Mitra, "Mining Quantitative Association Rules in Protein Sequences", Data Mining, LNAI 3755, pp. 273-281, Springer-Verlag Berlin Heidelberg, 2006.
11. Pratima Gautam and K.R. Pardasani, "Algorithm for Efficient Multilevel Association Rule Mining", International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Volume 02, No. 05, pp. 1700-1704, 2010.
12. Qiankun Zhao and Sourav S. Bhowmick, "Association Rule Mining: A Survey", Technical Report, CAIS, Nanyang Technological University, Singapore, No. 2003116, 2003.
13. Rakesh Agrawal, Tomasz Imielinski and Arun Swami, "Mining Association Rules Between Sets of Items in Large Databases", Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93), pp. 207-216, 1993.
14. Ritu Ganda, "Knowledge Discovery from Database Using an Integration of Clustering and Association Rule Mining", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277 128X, Volume 3, Issue 9, September 2013.
15. Sanjay Keer and Anju Singh, "Hiding Sensitive Association Rule Using Clusters of Sensitive Association Rule", International Journal of Computer Science and Network (IJCSN), ISSN: 2277-5420, Volume 1, Issue 3, June 2012.
16. Shweta and Kanwal Garg, "Mining Efficient Association Rules Through Apriori Algorithm Using Attributes and Comparative Analysis of Various Association Rule Algorithms", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277 128X, Volume 3, Issue 6, June 2013.
17. Sotiris Kotsiantis and Dimitris Kanellopoulos, "Association Rule Mining: A Recent Overview", GESTS International Transactions on Computer Science and Engineering, Vol. 32(1), pp. 71-82, 2006.
18. Sunita B. Aher and Lobo L.M.R.J., "Combination of Clustering, Classification and Association Rule Based Approach for Course Recommender System in E-learning", International Journal of Computer Applications (0975-8887), Volume 39, No. 7, February 2012.
19. Ya-Han Hu and Yen-Liang Chen, "Mining Association Rules with Multiple Minimum Supports: A New Mining Algorithm and a Support Tuning Mechanism", Elsevier B.V., 2004.