Discovering Association Rules and Classification for Biological Data using Data Mining Methods

J. Tsiligaridis, M. Pagela
Math & Computer Science Department, Heritage University, Toppenish, USA

Abstract - This project presents a set of algorithms, and their efficiency, for discovering association rules and performing classification using Genetic Algorithms (GA), Decision Trees (DT), and Neural Networks (NN). A GA generates a large set of possible solutions to a given problem. Apriori is the basic algorithm for association rules; here, a GA is developed for finding the frequent conditions. The proposed GA, based on an encoding and generation construction method (GA_EN), can mine association rules with improved performance through appropriate generation of the rules. In the GA classification algorithm (GA_CL), rules are classified using predefined constraints. A Decision Tree algorithm (DTA) is built from data using probabilities, with the goal of creating an accurate decision tree (DT) on demand. Based on the rules produced by GA_CL, a Neural Network classifier (NNC_GA) is created; a backpropagation algorithm is used to adjust its weights during learning. Simulation results are provided.

Keywords: Genetic Algorithm, Decision Trees, Neural Network, Data Mining

1 Introduction

A GA is based on the biological principles of natural selection. The key idea of Apriori is to find the frequent conditions constructed from the possible values of attributes in any data set. The GA finds all possible associations between conditions constructed from attribute values under given constraints (e.g. support and confidence). Association rule mining, integrated with classification, yields associative classification [1],[2],[3]. The objective of Classification Association Rules (CAR) is to generate a set of class association rules that satisfy the minimum support (msup) and minimum confidence (mconf) constraints, and to build a classifier from that rule set. GA_EN, based on an encoding method and the construction of generations, has advantages over Apriori because it includes GA mining techniques that improve performance. GA_CL discovers rules with at least the minimum class support (mcsupp) and at most the maximum classification error (maxclerror).

There are two types of DT [3]: the complete and the incomplete. In the incomplete one there are subtrees exhibiting repetition and replication. A DT represents a procedure for classifying objects based on their attributes, and a rule set can be created by traversing the tree. Decision trees are used to find predictive rules combining numeric and categorical attributes; the splitting process is repeated recursively until the data are exhausted. The DTA creates a DT using the criterion of maximum probability, in two phases. To avoid repetition or replication of subtrees, a new criterion, the Criterion of Elimination of a Branch (CEB), is applied; it eliminates redundant branches. The pruning of decision rules is also examined, with its consequences on accuracy.

A NN is a collection of units connected in some pattern to allow communication between the units. The backpropagation algorithm is in wide use because it learns by adapting the weights using the generalized delta rule, which attempts to minimize the squared error between the desired and the actual network output. The NNC_GA is a classifier that uses the NN methodology and takes as input the rules created by GA_CL.

The article is organized as follows. Sections 2 and 3 introduce definitions for association rules and CAR. Section 4 deals with GA_EN and GA_CL. Section 5 describes the DTA. Section 6 covers the NN and NNC_GA. Simulation results appear in Section 7.

2 Association rules

Let D = {T1, T2, ..., Tn} be a set of n transactions and let I = {i1, i2, ..., im} be a set of items. Each transaction [5] is a set of items, i.e. Ti ⊆ I. An association rule is an implication of the form X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅; X is called the antecedent and Y the consequent of the rule. In general, a set of items such as X or Y is called an itemset. For an itemset X ⊆ I, support(X) is defined as the fraction of transactions Ti ∈ D such that X ⊆ Ti; that is, P(X) = support(X). The support of a rule X ⇒ Y is defined as support(X ⇒ Y) = P(X ∪ Y). An association rule X ⇒ Y has a measure of reliability called confidence(X ⇒ Y), defined as P(Y|X) = P(X ∪ Y)/P(X) = support(X ∪ Y)/support(X).

For CAR, it is assumed that data samples are given with n attributes (A1, A2, ..., An) and that each sample carries a class label from C = {c1, c2, ..., cm}. A pattern P = {a1, a2, ..., ak} is a set of attribute values for different attributes (1 ≤ k ≤ n). For a rule R: P -> c, the number of data samples matching pattern P with class label c is called the support of R. The ratio of the number of samples matching P with class label c to the total number of samples matching P is called the confidence of R.
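To make these definitions concrete, the following is a minimal sketch (not code from the paper) that computes support and confidence for transactions stored as Python sets; the function names and the small example data set are illustrative.

# Minimal sketch of the support and confidence definitions above,
# with transactions stored as Python sets of items.
def support(itemset, transactions):
    # fraction of transactions Ti with itemset a subset of Ti, i.e. P(X)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    # P(Y|X) = support(X union Y) / support(X)
    return support(X | Y, transactions) / support(X, transactions)

# illustrative example: four transactions over items a, b, c
D = [{'a', 'b'}, {'a', 'b', 'c'}, {'a', 'c'}, {'b', 'c'}]
print(support({'a', 'b'}, D))       # 2/4 = 0.5
print(confidence({'a'}, {'b'}, D))  # 0.5 / 0.75 = 0.666...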
3 CAR

CAR [3] encompasses methods for associative classification; CBA is one of them. CBA uses an iterative approach to frequent itemset mining similar to Apriori and constructs the classifier by ordering the rules according to a precedence based on rule confidence and support. More details appear in [3].

4 GA_EN, GA_CL

A GA is an iterative procedure that works with a population of individuals represented by finite strings of characters or binary values. A traditional method usually searches a very large space for the solution using a large number of iterations, whereas a GA reduces the search by adopting a fitness function. Each iteration produces an evolved population of genomes, a new generation, through three operations: selection, crossover, and mutation.

For GA_EN, the number of conditions determines the construction of the chromosome and the population size. The next generations can be created from either one or two previous generations using the "or" operation for the crossover of chromosomes. This requires fewer memory operations and produces fewer offspring chromosomes. The msup and mconf are the constraints set by the user. A sketch of this generation construction appears below.
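The following sketch is one reading of that mechanism, not the authors' implementation: each chromosome is a bit vector over the conditions, rows records which conditions hold in each transaction, and the next generation is formed by a bitwise "or" of parent pairs that satisfy the minimum support; all names are assumptions.

# Hypothetical sketch of GA_EN-style generation construction.
from itertools import combinations

def supp(chrom, rows):
    # support of the condition set encoded by the bit vector `chrom`:
    # fraction of rows in which every active condition holds
    hits = sum(1 for r in rows
               if all(r[i] for i, bit in enumerate(chrom) if bit))
    return hits / len(rows)

def next_generation(population, rows, msup):
    # keep only chromosomes meeting the minimum support, then combine
    # parent pairs with a bitwise "or" crossover
    parents = [c for c in population if supp(c, rows) >= msup]
    children = set()
    for p1, p2 in combinations(parents, 2):
        child = tuple(b1 | b2 for b1, b2 in zip(p1, p2))  # "or" crossover
        if supp(child, rows) >= msup:
            children.add(child)
    return children

Because offspring are kept only if they already satisfy msup, each generation stays small, which is one way to read the paper's claim of fewer memory operations and fewer offspring chromosomes.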
GA_CL, working with the next generations under these constraints and the crossover of chromosomes, and taking the predefined classification error into account, discovers the classification rules. The constraints msup and maxclerror reduce the number of undesired rules in the mining process. In GA_CL, rules are extracted only from chromosomes whose classification error does not exceed maxclerror. The next generation initially includes chromosomes with support greater than minsup (gr_mins_chrom); the "or" operation among gr_mins_chrom creates the chromosomes of the next generation. Chromosomes with class error less than maxclerror yield classification rules; if no chromosome has an acceptable classification error, no classification rule is produced for that attribute. The GA mines rules of interest as defined by msup and mconf; with loose constraints the number of discovered association rules becomes extremely large.

5 DTA

In decision trees [4],[5] the input data set has one attribute, called the class C, that takes one of K discrete values 1, ..., K, and a set of numeric and categorical attributes A1, ..., Ap. The goal is to predict C given A1, ..., Ap. Decision tree algorithms automatically split numeric attributes Ai into two ranges and split categorical attributes Aj into subsets at each node, splitting the records by the attribute test that optimizes a certain criterion (a greedy strategy), where Ai are the attributes of the tuples and Ci the classes. A multi-way split is used, with as many partitions as distinct values; nodes with a homogeneous class distribution are preferred. In an incomplete DT there are subtrees exhibiting repetition and replication: repetition occurs when an attribute (e.g. age) is repeatedly tested along a given branch of the tree, and replication when duplicate subtrees exist within the tree, such as a subtree headed by the node "credit_rating".

The DTA uses the criterion of maximum probability and builds the tree in two phases.

Phase 1: Discover the root among all the attributes using

P(E_Ai) = Σ_Ci p(Ai) * p(Ci | Ai)

MP = max(P(E_Ai))   // maximum attribute-test criterion

Phase 2: Split the data into smaller subsets so that each partition is as pure as possible, using the same formula; MP is the measure of node impurity. Continue until the attributes are exhausted. The stopping criterion for expanding a node is that all its records belong to the same class.

DTA
Input: training data
Output: decision tree
1. Define the root node (Phase 1).
2. Discover the branches from the root.
   while (not end of the attributes)
3.    Split on the attribute (Phase 2).

Example:

Weather   Parents   Money   Decision
Sunny     yes       rich    cinema
Sunny     no        rich    tennis
Windy     no        rich    cinema
...       ...       ...     ...

Parents: P(E) = (5/10)*(5/10) + (5/10)*(1/10) = 0.3 (Phase 1)
Weather: P(E) = (3/10)*(1/10) + (4/10)*(3/10) + (3/10)*(2/10) = 0.21
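This selection step can be reproduced with the runnable sketch below. It assumes one reading of P(E_Ai) that is consistent with the worked numbers above (for each attribute value, the majority-class count divided by the total number of tuples); that interpretation, and all names, are assumptions rather than the authors' code.

# Sketch of the DTA attribute-selection and recursive split.
from collections import Counter

def p_e(attr, rows, classes):
    # score P(E_A): sum over the values a of attribute A of
    # p(A = a) * (majority-class count among rows with A = a) / n,
    # the reading most consistent with the Parents/Weather numbers above
    n = len(rows)
    total = 0.0
    for a in set(r[attr] for r in rows):
        sub = [c for r, c in zip(rows, classes) if r[attr] == a]
        total += (len(sub) / n) * (Counter(sub).most_common(1)[0][1] / n)
    return total

def dta(rows, classes, attrs):
    # pick the attribute with maximum P(E) (the MP criterion), split on
    # its values, and recurse; stop when a node is pure or no attributes remain
    if len(set(classes)) == 1 or not attrs:
        return Counter(classes).most_common(1)[0][0]
    best = max(attrs, key=lambda a: p_e(a, rows, classes))
    tree = {}
    for a in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == a]
        tree[(best, a)] = dta([rows[i] for i in idx],
                              [classes[i] for i in idx],
                              [x for x in attrs if x != best])
    return tree

Here rows is a list of dicts mapping attribute names to values, and classes holds the corresponding class labels.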
The CEB is used to eliminate redundant branches; it is a prepruning approach [3]. For an attribute attr1 with value v1, if the tuples of another attribute attr2 cover all the values in relation to v1 (of attr1), then attr2 is called a "don't care" attribute. Define

P_CEB = P(A1 = a1, ..., A|A| = a|A| | C = ci) = Π_{i=1..|A|} p(Ai = ai | C = cj)

Criterion of Elimination of a Branch (CEB): when P_CEB ≠ 0 between two attributes (A1, A2), the attribute A2 is a "don't care" attribute and the branch is eliminated. The CEB develops the DT so as to avoid repetition and replication.

Theorem: The CEB criterion can determine the existence of a small DT with the best accuracy (100%, i.e. complete), avoiding repetitions and replications.
Proof (sketch): when the CEB criterion holds, the repetition is discouraged.

Example:

Age      Has_job   Own_house   Credit_rating   Class
Young    false     false       fair            No
Young    false     false       good            No
Young    true      false       good            Yes
Young    true      true        fair            Yes
Middle   true      true        good            Yes
Old      false     true        excellent       Yes
...      ...       ...         ...             ...

In the DT with own_house as root it is not necessary to extend further with the attribute age for all its possible partitions ("young", "middle", "old"), since

P_CEB = P(age = young, own_house = "y" | C = "yes") = P(age = young | C = "yes") * P(own_house = "y" | C = "yes") = 2/5 * 6/9 ≠ 0.

DTs provide fewer rules than CAR, and a DT can find rules with very low support (such as medical rules). CAR requires discrete attributes, whereas DT learning handles both discrete and continuous attributes. ID3 is a method for discovering DTs that uses information gain as its attribute selection measure [3].

6 Neural Network

For supervised learning it is necessary to have data with a known classification, sufficient data to represent all aspects of the problem being solved, and sufficient data to allow for testing. The backpropagation algorithm learns by adapting the weights using the generalized delta rule, which attempts to minimize the squared error between the desired and the actual network output. During learning it continually cycles through the data until the error is low enough for the task to be considered solved. There are two activation functions, one from the input layer to the hidden layer and the other from the hidden layer to the output layer; both are logistic functions.

Example: for tennis participation, with tuples "outlook, temperature, humidity, windy, class", distributed coding has been used. The logistic function is applied to the hidden and output layers. A set of five input processing units and five hidden-layer units has been used; the weight matrix is of dimension 5x5 with bias input z0 = 1. The parameters are: m, the number of input vectors of length 5 (including the bias input); x[i], the input values; and d[i], the desired output for each input x[i]. Two activation (logistic) functions fh (input layer to hidden layer) and fo (hidden layer to output layer) are assumed. Convergence is tested by checking whether the magnitude of the output error function is below a given threshold.

Some disadvantages of a NN are that (a) it does not explain how the solution is derived, needs more examples, and needs appropriate examples that match the real-world situation, and (b) training a NN with a large number of high-dimensional training examples can be a lengthy and computationally expensive process. The advantage of a DT over a NN is that it requires much less training time. Comparing decision trees and neural networks, their advantages and drawbacks are almost complementary: humans easily understand the knowledge representation of decision trees, which is not the case for neural networks; decision trees have trouble dealing with noise in the training data, which again is not the case for neural networks; decision trees learn very fast while neural networks learn relatively slowly; and so on.

The classification rules created by GA_CL are used as input to the construction of the NNC_GA. The NNC_GA architecture has three layers. First is the input layer, whose nodes each represent one characteristic of the rules (the attribute value in a rule is called a characteristic). Second is the hidden layer, in which each node is connected with the characteristics of one rule; the number of rules equals the number of hidden nodes. Third is the output layer, which has the class nodes (i.e. c1, ..., cn). For learning in the NNC_GA, the input vector can be created in binary format by taking '1' for an activated input node and '0' for a non-activated one. Gradient descent proceeds in small steps along the direction established by the gradient; the learning rate n is selected large enough to make the network converge quickly without oscillations. A sketch of this training loop follows.
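The sketch below assumes one hidden layer with logistic activations on both layers and generalized-delta-rule updates, as described above; the layer sizes, weight initialization, and learning rate are illustrative choices, not the paper's exact configuration.

# Minimal one-hidden-layer backpropagation sketch in the spirit of NNC_GA:
# binary input vectors (1 = activated characteristic), logistic activations,
# generalized-delta-rule weight updates.
import math, random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(samples, n_in, n_hid, n_out, eta=0.5, epochs=1000):
    # weight rows include a bias column (bias input z0 = 1, as in the paper)
    W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
    W2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hid + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, d in samples:                 # x: input vector, d: desired output
            xb = [1.0] + x                   # prepend bias input
            h = [logistic(sum(w * i for w, i in zip(row, xb))) for row in W1]
            hb = [1.0] + h
            o = [logistic(sum(w * i for w, i in zip(row, hb))) for row in W2]
            # output deltas: (d - o) * o * (1 - o), from the squared error
            do = [(dk - ok) * ok * (1 - ok) for dk, ok in zip(d, o)]
            # hidden deltas backpropagate the output deltas
            dh = [h[j] * (1 - h[j]) * sum(do[k] * W2[k][j + 1] for k in range(n_out))
                  for j in range(n_hid)]
            for k in range(n_out):
                for j in range(n_hid + 1):
                    W2[k][j] += eta * do[k] * hb[j]
            for j in range(n_hid):
                for i in range(n_in + 1):
                    W1[j][i] += eta * dh[j] * xb[i]
    return W1, W2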
7 Simulation

There are three scenarios for the simulation.

1. Apriori vs GA_EN: GA_EN, using its particular way of constructing the next generations, has better performance than Apriori when mining the rules of the "heart" data set.

Fig. 1 GA_EN vs Apriori (running time vs association rules)

2. CBA vs NNC_GA: on the "hepatitis" data set, NNC_GA achieves better accuracy because it follows the NN method for learning and adjusting the weights.

Fig. 2 CBA vs NNC_GA (accuracy)

3. ID3 vs DTA: on the "iris" data set, the DTA results are slightly better than those of ID3.

Fig. 3 ID3 vs DTA (running time)

8 Conclusions

In this project a new framework of algorithms is developed, based on the ability of Genetic Algorithms to discover association rules and perform classification on biological data. Certain advantages apply depending on each algorithm's particular mode of operation. Future work will focus on NN.

9 References

[1] B. Bringmann, S. Nijssen and A. Zimmermann, "Pattern-based classification: a unifying perspective", in Proceedings of the ECML-PKDD Workshop "From Local Patterns to Global Models", pp. 36-50, 2009.
[2] W. Li, J. Han and J. Pei, "CMAR: Accurate and efficient classification based on multiple class-association rules", in Proceedings of the IEEE International Conference on Data Mining (ICDM '01), pp. 369-376, Nov. 2001.
[3] J. Han, M. Kamber and J. Pei, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd ed., 2012.
[4] U. Fayyad and G. Piatetsky-Shapiro, "From Data Mining to Knowledge Discovery", MIT Press, 1995.
[5] C. Ordonez, "Comparing Association Rules and Decision Trees for Disease Prediction", HIKM 2006, Nov. 11, Virginia.
[6] M. Kantardzic, "Data Mining: Concepts, Models, Methods, and Algorithms", IEEE Press, 2003.