International Journal of Advanced Technology and Innovative Research (IJATIR), ISSN 2348–2370, Vol.07, Issue.12, August-2015, Pages: 2199-2204, www.ijatir.org

New Approach for Classification Based Association Rule Mining

K. CH. PRAVALLIKA1, CH. N. S. PRIYANKA2, B. V. BALAJI3
1 PG Scholar, Dept of MCA, Sri Vasavi Engineering College, Tadepalligudem, West Godavari, AP, India, E-mail: [email protected].
2 PG Scholar, Dept of MCA, Sri Vasavi Engineering College, Tadepalligudem, West Godavari, AP, India, E-mail: [email protected].
3 Assistant Professor, Dept of MCA, Sri Vasavi Engineering College, Tadepalligudem, West Godavari, AP, India, E-mail: [email protected].

Abstract: Recent studies in data mining have proposed associative classification, which, according to several reports, achieves higher classification accuracy than C4.5. In this paper, we propose to integrate two mining techniques. The integration is done by focusing on mining a special subset of association rules, called class association rules (CARs). This paper presents a new approach for building a classifier, based on an extended association rule mining technique applied from the perspective of classification. The distinctiveness of this approach is threefold: first, applying the information gain measure to the generation of candidate item sets; second, integrating the process of frequent item set generation with the process of rule generation; third, integrating strategies for avoiding rule redundancy and conflicts into the mining process. The corresponding mining algorithms proposed, namely CPAR (Classification Based on Predictive Association Rules) and CARM (Classification Association Rule Mining) using decision tree information gain, produce a classifier with satisfactory classification accuracy compared with other classifiers; CARM can filter out many candidate item sets during the generation process.

Keywords: Data Mining, Association Rule Mining, Frequent Item Set, Electronic Commerce, Information Gain, Decision Tree, FOIL, PRM, CPAR, CARM.

I.
INTRODUCTION
Classification and association rule mining are two major areas of research and application in knowledge discovery today. An association rule (AR) is of the form X ⇒ Y, where X and Y are sets of data items. The goal of association rule mining is to generate associative relationships between data items whose degrees of confidence and support are greater than user-specified thresholds. The Apriori algorithm is a well-known algorithm in this field. A typical association rule application is market basket analysis, describing, for example, customers' buying behavior: a rule such as "Fruit ⇒ Meat" means that customers who bought fruit also tended to buy meat, which reflects an association between occurrences of data items. Classification is used to find a logical description, namely a classifier, which results from training datasets with predetermined targets and can group unlabeled datasets. Existing research efforts have proposed a number of approaches and systems. A noteworthy type of approach is classification based on association rules, aimed at building a classifier by discovering a small set of rules to form a so-called associative classifier. Classification Rule Mining (CRM) is a well-known data mining technique for the extraction of hidden Classification Rules (CRs) from a given database that is coupled with a set of pre-defined class labels, the objective being to build a classifier to classify "unseen" data records. One recent approach to CRM is to employ Association Rule Mining (ARM) methods to identify the desired CRs, i.e. Classification Association Rule Mining (CARM). CARM aims to mine a set of Classification Association Rules (CARs) from a class-transaction database, where a CAR describes an implicative co-occurring relationship between a set of binary-valued data attributes (items) and a pre-defined class, expressed in the form of an "antecedent ⇒ consequent-class" rule.
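As a concrete illustration of the support and confidence measures used above, here is a minimal sketch; the toy basket data and all names are invented for illustration:

```python
transactions = [
    {"fruit", "meat", "bread"},
    {"fruit", "meat"},
    {"fruit", "milk"},
    {"meat", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule X => Y: support(X union Y) / support(X)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

# "Fruit => Meat": 2 of 4 baskets contain both, 3 of 4 contain fruit
print(support({"fruit", "meat"}, transactions))        # 0.5
print(confidence({"fruit"}, {"meat"}, transactions))   # about 0.67
```

An AR miner would keep "Fruit ⇒ Meat" only if both values clear the user-specified support and confidence thresholds.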
CARM seems to offer greater accuracy, in many cases, than other classification methods such as decision trees, rule induction and probabilistic approaches. In the past decade, a number of CARM approaches have been developed, including TFPC (Total From Partial Classification), CBA (Classification Based on Associations), CMAR (Classification based on Multiple Association Rules), CPAR (Classification based on Predictive Association Rules), etc. Although these CARM approaches employ different ARM techniques to extract CARs from a given class-transaction database, a similar set of CARs is always generated, based on a pair of specific values for the support and confidence thresholds. Regardless of which particular CARM method is utilized, a classifier is usually presented as an ordered list of CARs, based on a selected rule ordering strategy. Hence, the key to producing a more accurate CARM classifier is to develop a better (more rational) rule ordering approach.

Copyright @ 2015 IJATIR. All rights reserved.

II. RELATED WORK
The data analysis algorithms (or data mining algorithms, as they are more popularly known nowadays) can be divided into three major categories based on the nature of their information extraction: clustering (also called segmentation or unsupervised learning), predictive modeling (also called classification or supervised learning), and frequent pattern extraction.

Clustering is the first major class of data mining algorithms. The goal of the search process used by these algorithms is to identify all sets of similar examples in the data, in some optimal fashion. One of the oldest algorithms for clustering is k-means. The two disadvantages of this algorithm are the initialization problem and the requirement that the clusters be linearly separable. To deal with the initialization problem, global k-means has been proposed, an incremental-deterministic algorithm that employs k-means as a local search procedure. The kernel k-means algorithm avoids the limitation of linearly separable clusters: it maps the data points from the input space to a higher-dimensional feature space through a nonlinear transformation Ø, and k-means is then applied in the feature space. Global kernel k-means is an algorithm which maps data points from the input space to a higher-dimensional feature space through the use of a kernel function and optimizes the clustering error in the feature space by locating near-optimal solutions. Its deterministic nature makes it independent of the initialization problem and gives it the ability to identify nonlinearly separable clusters in the input space, so the global kernel k-means algorithm combines the advantages of both global k-means and kernel k-means. Another approach to clustering is hierarchical clustering based on the Hungarian method, whose computational complexity is O(n2).

The important classification algorithms are decision trees, the Naive Bayes classifier and statistical methods. They use heuristic and greedy search techniques to find the subsets of rules that form the classifiers. C4.5 and CART are the most well-known decision tree algorithms.

The final class of data mining algorithms is frequent pattern extraction. For large databases, an Apriori algorithm has been described that generates all significant association rules between items in the database. The algorithm makes multiple passes over the database. The frontier set for a pass consists of those item sets that are extended during the pass. In each pass, the support for candidate item sets, which are derived from the tuples in the database and the item sets contained in the frontier set, is measured. Initially the frontier set consists of only one element, the empty set. At the end of a pass, the support of each candidate item set is compared with the minimum support, and at the same time it is determined whether an item set should be added to the frontier set for the next pass. The algorithm terminates when the frontier set is empty. After finding all the item sets that satisfy the minimum support threshold, association rules are generated from those item sets.

Bing Liu et al. proposed the Classification Based on Associations (CBA) algorithm, which discovers Class Association Rules (CARs). It consists of two parts: a rule generator, called CBA-RG, which is based on the Apriori algorithm for finding the association rules, and a classifier builder, called CBA-CB. The Apriori algorithm works with item sets (sets of items), while CBA-RG works with rule items, each consisting of a condset (a set of items) and a class. The class association rules used to create a classifier in CBA are more accurate than the C4.5 algorithm. But the CBA algorithm needs to rank the rules before it can create a classifier; ranking depends on the support and confidence of each rule.

The goal of classification is to build a model from classified objects in order to classify previously unseen objects as accurately as possible. There are many classification approaches for extracting knowledge from data, such as divide-and-conquer, separate-and-conquer, covering and statistical approaches. The divide-and-conquer approach starts by selecting an attribute as a root node, and then makes a branch for each possible level of that attribute. This splits the training instances into subsets, one for each possible value of the attribute. The same process is repeated until all instances that fall in one branch have the same classification or the remaining instances cannot be split any further. The separate-and-conquer approach, on the other hand, starts by building up the rules in greedy fashion (one by one). After a rule is found, all instances covered by the rule are deleted. The same process is repeated until the best rule found has a large error rate. Statistical approaches such as Naive Bayes use probabilistic measures, i.e. likelihood, to classify test objects. Finally, the covering approach selects each of the available classes in turn and looks for a way of covering most of the training objects of that class, in order to come up with maximum-accuracy rules. Numerous algorithms have been derived from these approaches, such as decision trees, PART, RIPPER and Prism. While single-label classification, which assigns each rule in the classifier to the most obvious label, has been widely studied, little work has been done on multi-label classification; most of the previous research to date on multi-label classification is related to text categorization. In this paper, only traditional classification algorithms that generate rules with a single class will be considered.

III. SYSTEM DESIGN AND IMPLEMENTATION
The overall system design of Classification Based Association Rule Mining is described in Fig.1 below. The system is divided into 4 modules:
Data Source/Data Base Module
Classification Module
Association Rule Generation Module
Performance Analysis Module

Data Source/Data Base Module: This module maintains data in the form of data sets. Here we have a data set of several attribute values in the form of transaction records, and a data set that contains the schema of the data set. This schema is useful for classifying the data.

Classification Module: This module reads the data from the data set, performs the classification operation and generates classes.
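The level-wise Apriori search described in Section II can be sketched as follows; this is a simplification (the subset-based candidate pruning step is omitted, and all names are ours):

```python
def apriori(transactions, min_support):
    """Level-wise frequent item set mining: the frequent k-item sets
    of one pass are joined to form the (k+1)-candidates of the next,
    mirroring the frontier-set description above."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]   # pass-1 candidates
    frequent = {}
    k = 1
    while candidates:
        # count the support of each candidate in one pass over the data
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        survivors = [c for c in candidates if counts[c] / n >= min_support]
        for c in survivors:
            frequent[c] = counts[c] / n
        # join step: (k+1)-candidates from pairs of frequent k-item sets
        candidates = list({a | b for a in survivors for b in survivors
                           if len(a | b) == k + 1})
        k += 1
    return frequent

baskets = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
freq = apriori(baskets, 0.5)
```

Association rules would then be generated from `freq` by splitting each frequent item set into antecedent and consequent and checking confidence.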
Association Rule Generation Module: This module takes the generated classes, performs association rule mining, generates frequent item sets, and generates association rules.

Performance Analysis Module: This module computes the time complexity, space complexity, accuracy and number of association rules for each execution, based on the number of classes, for different algorithms such as CARM using information gain, CARM using random gain, FOIL and PRM. It then compares their values and identifies the most efficient algorithms.

Fig.1. System Architecture.

IV. PROPOSED WORK
A. CPAR and CARM Using Decision Tree Info Gain
CPAR: CPAR (Classification based on Predictive Association Rules) combines the advantages of both associative classification and traditional rule-based classification. Instead of generating a large number of candidate rules as in associative classification, CPAR adopts a greedy algorithm to generate rules directly from the training data. Moreover, CPAR generates and tests more rules than traditional rule-based classifiers to avoid missing important rules. To avoid overfitting, CPAR uses expected accuracy to evaluate each rule and uses the best k rules in prediction. CPAR stands in the middle between exhaustive and greedy algorithms and combines the advantages of both.

Example: An example of how CPAR generates rules. After the first literal (A1 = 2) is selected, two literals (A2 = 1) and (A3 = 1) are found to have similar gain, which is higher than that of the other literals. Literal (A2 = 1) is selected first and a rule is generated along this direction. After that, the rule (A1 = 2, A3 = 1) is taken as the current rule, as shown in Fig.2. Again two literals with similar gain, (A4 = 2) and (A2 = 1), are selected and a rule is generated along each of the two directions. In this way, three rules are generated: (A1 = 2, A2 = 1, A4 = 1); (A1 = 2, A3 = 1, A4 = 2, A2 = 3); (A1 = 2, A3 = 1, A2 = 1).

Fig.2. Some rules generated by CPAR.
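The branching behaviour in the example above can be sketched as follows. CPAR (like FOIL and PRM) scores literals with FOIL gain, shown here as `foil_gain`; `generate_rules` drives the branching with an abstract `gain` callback, and the toy score table at the bottom is entirely our invention for illustration:

```python
import math

def foil_gain(pos, neg, pos_after, neg_after):
    """FOIL gain of adding a literal: the positive examples still
    covered, weighted by the improvement in the rule's log-precision."""
    if pos_after == 0:
        return float("-inf")           # literal covers no positives
    return pos_after * (math.log2(pos_after / (pos_after + neg_after))
                        - math.log2(pos / (pos + neg)))

SIMILARITY = 0.99  # literals within 1% of the best gain also spawn rules

def generate_rules(literals, gain, min_gain, max_len):
    """CPAR-style rule growth: extend the current rule with the best
    literal, and queue a copy extended with every close-to-best
    literal for later depth-first expansion."""
    rules, queue = [], [[]]            # start from the empty rule
    while queue:
        rule = queue.pop()             # depth-first: most recent rule first
        while len(rule) < max_len:
            scored = [(gain(rule, l), l) for l in literals if l not in rule]
            scored = [(g, l) for g, l in scored if g >= min_gain]
            if not scored:
                break
            best_g, best_l = max(scored)
            for g, l in scored:        # branch on close-to-best literals
                if l != best_l and g >= best_g * SIMILARITY:
                    queue.append(rule + [l])
            rule = rule + [best_l]
        if rule:
            rules.append(rule)
    return rules

# Toy gain table: "A2=1" and "A3=1" have similar gain after "A1=2"
def toy_gain(rule, literal):
    if not rule:
        return {"A1=2": 1.0}.get(literal, 0.1)
    return {"A2=1": 0.80, "A3=1": 0.795}.get(literal, 0.3)

rules = generate_rules(["A1=2", "A2=1", "A3=1", "A4=2"], toy_gain,
                       min_gain=0.5, max_len=2)
# Two rules sharing the prefix (A1=2): one via A2=1, one via A3=1
```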
CPAR builds rules by adding literals one by one, which is similar to PRM. However, instead of ignoring all literals except the best one, CPAR keeps all close-to-the-best literals during the rule building process. By doing so, CPAR can select more than one literal at a time and build several rules simultaneously. The following is a detailed description of the rule generation algorithm of CPAR. Suppose that at a certain step in the process of building a rule, after finding the best literal p, another literal q is found that has a gain similar to that of p (e.g., differing by at most 1%). Besides continuing to build the rule by appending p to r, q is also appended to the current rule r to create a new rule r′, which is pushed into a queue. Each time a new rule is to be built, the queue is first checked; if it is not empty, a rule is extracted from it and taken as the current rule. This forms the depth-first search in rule generation.

B. CARM
Classification Association Rule Mining (CARM) is a recent classification rule mining approach that builds an Association Rule Mining based classifier using Classification Association Rules (CARs). Regardless of which particular CARM algorithm is used, a similar set of CARs is always generated from the data, and a classifier is usually presented as an ordered CAR list, based on a selected rule ordering strategy.

Decision Tree: A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node. The decision tree in Fig.3 is for the concept "buy computer"; it indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute and each leaf node represents a class, as shown in Fig.3.

Fig.3. Decision Tree.
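The attribute test at each internal node of such a tree is typically chosen with the information gain measure formalized in the next subsection. A minimal sketch, where the dict-based record layout and the tiny "buy computer"-style table are our assumptions:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, target):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v),
    where S_v holds the examples with value v for attribute A."""
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for v in {e[attr] for e in examples}:
        s_v = [e[target] for e in examples if e[attr] == v]
        gain -= len(s_v) / len(examples) * entropy(s_v)
    return gain

# Invented four-record table: "age" separates the classes perfectly,
# so its gain equals Entropy(S) = 1 bit
data = [
    {"age": "youth", "buys": "no"},
    {"age": "youth", "buys": "no"},
    {"age": "senior", "buys": "yes"},
    {"age": "senior", "buys": "yes"},
]
```

A tree builder would place at each node the attribute with the highest `information_gain` over the examples reaching that node.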
General Definition of Information Gain: In general terms, the expected information gain is the change in information entropy from a prior state to a state that takes some information as given:

IG(T, a) = H(T) − H(T | a)    (1)

Information Gain: We now return to the problem of trying to determine the best attribute to choose for a particular node in a tree. The following measure calculates a numerical value for a given attribute, A, with respect to a set of examples, S. Note that the values of attribute A will range over a set of possibilities which we call Values(A), and that, for a particular value v from that set, we write S_v for the set of examples which have value v for attribute A. The information gain of attribute A, relative to a collection of examples S, is calculated as:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)    (2)

The information gain of an attribute can be seen as the expected reduction in entropy caused by knowing the value of attribute A.

V. EXPERIMENTAL RESULTS
We have conducted an extensive performance study to evaluate the accuracy and efficiency of CPAR and CARM using decision tree info gain, and to compare them with FOIL and PRM. We validated our approach by means of a large set of experiments addressing the following issues: performance of the classification and association rules in terms of execution time and memory usage; performance in terms of classes and accuracy; performance in terms of classes and number of rules generated; and scalability of the approach. All the experiments were performed on a Pentium IV with 2 GB main memory, running Microsoft Windows XP.

Fig.5. Space Complexity comparison of algorithms (line chart).
Fig.6. Accuracy comparison of algorithms (line chart).
Fig.4. Time Complexity comparison of algorithms (line chart).
Fig.7. No. of Rules comparison of algorithms (line chart).
Fig.8. Time Complexity comparison of algorithms (bar chart).
Fig.9. Space Complexity comparison of algorithms (bar chart).
Fig.10. Accuracy comparison of algorithms (bar chart).
Fig.11. No. of Rules comparison of algorithms (bar chart).

VI. CONCLUSIONS AND FUTURE WORK
In this paper, we examined two major challenges in associative classification: (1) efficiency in handling the huge number of mined association rules, and (2) effectiveness in predicting new class labels with high classification accuracy. We proposed two associative classification methods, CARM using info gain and CPAR (Classification Based on Predictive Association Rules). Our experiments show that both CARM using info gain and CPAR achieve better efficiency than FOIL and PRM.

VII. REFERENCES
[1] Devasri Rai, A. S. Thoke and Keshri Verma, “Enhancement of Associative Rule based FOIL and PRM Algorithms”, Proc., 2012.
[2] J. R. Quinlan (1993),
C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA; P. Clark and T. Niblett, “The CN2 induction algorithm”, Machine Learning, 3:261–283, 1989.
[3] M. J. A. Berry and G. S. Linoff, Memory Based Reasoning (Data Mining Techniques), 2004.
[4] R. Andrews, J. Diederich and A. Tickle, “A survey and critique of techniques for extracting rules from trained artificial neural networks”, Knowledge-Based Systems, pp. 373–389, 1995.
[5] P. Langley, W. Iba and K. Thompson, “An analysis of Bayesian classifiers”, in National Conference on Artificial Intelligence (1992), pp. 223–228.
[6] B. Liu, W. Hsu and Y. Ma, “Integrating classification and association rule mining”, in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD-98), pages 80–86, New York, USA, August 1998. The AAAI Press.
[7] J. R. Quinlan and R. M. Cameron-Jones, “FOIL: A midterm report”, in Proc. 1993 European Conference on Machine Learning, pp. 3–20, Vienna, Austria, 1993.
[8] X. Yin and J. Han, “CPAR: Classification Based on Predictive Association Rules”, in Proceedings of the SIAM International Conference on Data Mining, 2003, pp. 331–335.
[9] Prafulla Gupta and Durga Toshniwal, “Performance Comparison of Rule Based Classification Algorithms”, International Journal of Computer Science & Informatics, Volume 1, Issue II, 2011, pp. 37–42.
[10] W. Li, J. Han and J. Pei, “CMAR: Accurate and efficient classification based on multiple class-association rules”, in ICDM'01, pp. 369–376, San Jose, CA, Nov. 2001.
[11] F. Thabtah, P. Cowling and Y. H. Peng (2004), “MMAC: A New Multi-Class, Multi-Label Associative Classification Approach”, Fourth IEEE International Conference on Data Mining (ICDM'04).
[12] “Classification Based on Predictive Association Rule”, available online: http://www.csc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/cpar.html.
[13] C. L. Blake and C. J. Merz (1998), UCI repository of machine learning databases, www.ics.uci.edu/~mlearn/MLrepository.html.
[14] F. Coenen (2003), The LUCS-KDD discretised/normalised ARM and CARM data library, http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/, Department of Computer Science, The University of Liverpool, UK.
[15] Zuoliang Chen and Guoqing Chen, “Building an Associative Classifier Based on Fuzzy Association Rules”, International Journal of Computational Intelligence Systems, Vol. 1, No. 3 (August 2008), pp. 262–273.
[16] Xin Lu, Barbara Di Eugenio and Stellan Ohlsson, “Learning Tutorial Rules Using Classification Based on Associations”, Computer Science (M/C 152), University of Illinois at Chicago, 851 S Morgan St., Chicago, IL 60607, USA. E-mail: [email protected].
[17] “Alaa Al Deen” Mustafa Nofal and Sulieman Bani-Ahmad, “Classification based on association-rule mining techniques: a general survey and empirical comparative evaluation”, Ubiquitous Computing and Communication Journal, Volume 5, Number 3, www.ubicc.org.
[18] A. Zemirline, L. Lecornu, B. Solaiman and A. Ech-Cherif, “An Efficient Association Rule Mining Algorithm for Classification”, in L. Rutkowski et al. (Eds.), ICAISC 2008, LNAI 5097, pp. 717–728, Springer-Verlag, Berlin Heidelberg, 2008.
[19] Bing Liu, Wynne Hsu and Yiming Ma, “Integrating Classification and Association Rule Mining”, in KDD-98, New York, Aug 27–31, 1998.

Author's Profile:
K. CH. Pravallika is pursuing the Master of Computer Applications at Sri Vasavi Engineering College, Tadepalligudem, affiliated to JNTUK.
CH. N. S. Priyanka is studying for the Master of Computer Applications at Sri Vasavi Engineering College, Tadepalligudem, affiliated to JNTUK.
Mr. B. V. Balaji completed his M.Tech (Computer Science Engineering) at JNTUK University, Kakinada. He is currently working as an Assistant Professor in the MCA department at Sri Vasavi Engineering College, Tadepalligudem.