The Use of Evolutionary Algorithms in Data Mining

Ayush Joshi (MSc ISE), Jordan Wallwork (BSc AICS), Khulood AlYahya (MSc ISE), Sultanah AlOtaibi (MSc ACS)

Abstract

With the huge amount of data generated in the world every day, at a rate far higher than human comprehension alone can analyse it, data mining becomes an extremely important task for extracting as much useful information from this data as possible. Standard data mining techniques are satisfactory to a certain extent, but they are constrained by certain limitations, and it is in these cases that evolutionary approaches are both more capable and more efficient. In this paper we present the application of nature-inspired evolutionary techniques to data mining, augmented with human interaction, to handle situations in which concept definitions are abstract and hard to define, and hence not quantifiable in an absolute sense. Finally, we propose some ideas for future implementations of these techniques.

Keywords: data mining, knowledge discovery, evolutionary algorithms, interactive evolutionary algorithms, genetic algorithms, genetic programming, co-evolutionary algorithms, rule discovery, classification, clustering, data mining tasks, data mining algorithms

Table of Contents

1 Introduction
2 Overview of Data Mining and Knowledge Discovery
 2.1 Data Mining Pre-processing
 2.2 Data Mining Tasks
  2.2.1 Models and Patterns
 2.3 Conventional Techniques of Data Mining
3 Evolutionary Algorithms and Data Mining
 3.1 Genetic Algorithms
 3.2 Genetic Programming
 3.3 Co-evolutionary Algorithms
 3.4 Representation and Encoding
  3.4.1 Rules Representation
  3.4.2 Fuzzy Logic Based Rules Representation
 3.5 Genetic Operators
  3.5.1 Crossover
  3.5.2 Mutation
  3.5.3 Fuzzy Logic Operators
 3.6 Fitness Evaluation
  3.6.1 Objective Fitness Evaluation
  3.6.2 Subjective Fitness Evaluation (Interactive Evolutionary Algorithms)
 3.7 Selection and Replacement
 3.8 Integrating Conventional Techniques with Evolutionary Algorithms
4 Applications of Data Mining Using IEA
 4.1 Extracting Knowledge from a Text Database
 4.2 Extracting Marketing Rules from User Data
 4.3 Fraud Detection Using Data Mining and IEA Techniques
 4.4 Some Current Work Being Done
5 Conclusion and Future Work
6 References
1 Introduction

In recent years, the massive growth in the amount of stored data has increased the demand for effective data mining methods to discover the hidden knowledge and patterns in these data sets. Data mining means to "mine", or extract, relevant information from whatever data is of concern to the user. It is not a new technique: it has been around for centuries in the form of problems such as regression analysis and knowledge discovery from records of various types. As computers invaded almost every conceivable field of human knowledge and occupation, their advantages were advocated everywhere, but it soon became apparent that with the increasing amounts of data that could be generated, stored, and analysed, some way was needed to sift through it all and pick out what mattered. In earlier days a human, or a group of humans, would analyse the data by going through it manually and applying statistical techniques, but the curve of data generation was far steeper than what could realistically be processed by hand. This led to the emergence of the field of data mining, whose purpose was essentially to define and formalize standard techniques for extracting knowledge from large data warehouses.

As data mining evolved, it was observed that the data at hand was almost never perfect or suitable to be fed directly into data mining engines, and that it needed several pre-processing steps before it could be put through "mining". Generally these inconsistencies concern data format, noisy or incorrect data, unnecessary data, redundant data, and so on. Pre-processing cleans, integrates, and discretizes the data and selects the most relevant attributes before any mining is performed. A whole new area called intelligent data analysis has emerged, which applies efficient techniques for mining large data sets, keeping in mind that the extracted knowledge must be useful while the time available for mining is constrained and the user requires results as soon as possible. Some of the methods used to mine data include support vector machines, decision trees, nearest neighbour analysis, Bayesian classification, and latent semantic analysis. The problems associated with conventional data mining techniques called for clever new ways to overcome them, and the application of AI techniques to the field resulted in a very powerful hybrid.
Evolutionary optimization techniques provided a useful and novel solution to these issues, and once data mining was enhanced with evolutionary computation, many of the previously mentioned problems ceased to be serious obstacles. This paper presents some applications of evolutionary algorithms in data mining that involve human interaction. When dealing with concepts that are abstract and hard to define, or with cases that have a large or variable number of parameters, we still do not have reliable methods for finding solutions. In cases where we are unable to quantify what we want to measure, for instance 'beauty' in images or 'pleasantness' in music, we almost always require a human to drive the search through their choices. In these situations we combine evolutionary computation with data mining, with a human interacting with the engine to steer the computation towards the solutions or answers they are looking for.

This paper begins by describing relevant concepts in data mining and general evolutionary algorithms. In the later sections we discuss some of the areas where these techniques are implemented, and lastly we give a few ideas of where they may be applied in the future.

2 Overview of Data Mining and Knowledge Discovery

Knowledge discovery and data mining, as defined by Fayyad et al. (1996), is "the process of identifying valid, novel, useful, and understandable patterns in data". Data mining has emerged particularly in situations where analysing the data manually or through simple queries is either impossible or very complicated (Cantú-Paz & Kamath, 2001). It is a multi-disciplinary field that incorporates knowledge from many disciplines, mainly machine learning, artificial intelligence, statistics, signal and image processing, mathematical optimization, and pattern recognition (ibid.). Knowledge discovery and data mining consist of three main steps that convert a collection of raw data into valuable knowledge: data pre-processing, knowledge extraction, and data post-processing (Freitas, 2003). For the data mining process to be considered successful, the discovered knowledge should be accurate, comprehensible, relevant, and interesting to the end user (Cantú-Paz & Kamath, 2001). This section gives an overview of data mining pre-processing, data mining tasks, and the conventional techniques for data mining.

2.1 Data Mining Pre-processing

The purpose of data mining pre-processing is to eliminate outliers, inconsistency, and incompleteness in the data in order to obtain accurate results (Freitas, 2003). The pre-processing steps are listed below:

• Data cleaning: prepares the data for the following steps by removing irrelevant data and as much noise as possible, guaranteeing the accuracy and validity of the data.
• Data integration: removes redundant and inconsistent data from data collected from different sources.
• Discretization: converts continuous attribute values into discrete ones; for the attribute Age, for example, we can set a minimum value of 21 and a maximum value of 60 (see the sketch after this list).
• Attribute selection: selects the data relevant to the analysis process from all the data sets.
• Data mining: after all the previous steps, data mining algorithms or techniques can be applied to the data in order to extract the desired knowledge.
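As a concrete illustration of the discretization step, the following minimal Python sketch maps continuous Age values onto discrete labels. The boundaries (21 and 60) follow the Age example above; the label names are our own invention.

    def discretize_age(age, low=21, high=60):
        """Map a continuous age onto one of three discrete labels."""
        if age < low:
            return "young"
        elif age <= high:
            return "adult"
        return "senior"

    raw_ages = [18, 35, 64, 21, 60]
    print([discretize_age(a) for a in raw_ages])
    # ['young', 'adult', 'senior', 'adult', 'adult']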
2.2 Data Mining Tasks

It is very important to define the data mining task that an algorithm should address before designing it for application to a particular problem. There are several data mining tasks, and each has a specific purpose in terms of the knowledge to be discovered (Freitas, 2002).

2.2.1 Models and Patterns

In data mining, a model "is a high level description of the data set" (Hand et al., 2001). A model can be either descriptive or predictive. As the names imply, a descriptive model is an unsupervised model that aims to describe the data, while a predictive model is a supervised model that aims to predict values from the data. Patterns are used to capture the important and interesting features of the data; an unusual combination of purchased items in a supermarket is an example of a pattern. Models describe the whole data set, while patterns highlight particular aspects of it.

2.2.1.1 Predictive Models

According to Han and Kamber (2001), data analysis can generally take either a classification or a prediction form. Regression analysis is an example of a prediction task, namely numeric prediction. The difference between classification and regression is that the target value (response variable) is a quantitative value in regression modeling, while it is a qualitative or categorical value in classification modeling.

Classification Task

Some terms need to be introduced in order to describe classification tasks. The data sets to which classification techniques or algorithms are applied are composed of a number of instances (objects). Each instance has a number of attributes, which take discrete values. The records in a database table, for example, represent the instances, and the fields represent the attributes: each row represents an object, and the columns describe this object in terms of its attributes. The classification task is to extract hidden knowledge, in the form of patterns over some attribute values, in order to predict the value of a particular field or attribute. This target value is known as the class (Dzeroski & Lavrac, 2001). The inputs to a classification algorithm are data instances, and the outputs are the patterns used to predict the class each instance belongs to. Here is an example of a classification rule (Freitas, 2002):

IF (a_given_set_of_conditions_is_satisfied_by_an_instance)   [antecedent part]
THEN (predict_a_certain_class_for_that_instance)             [consequent part]

The data in the classification task is divided into two mutually exclusive sets, the training dataset and the testing dataset (Freitas, 2003). The training dataset is used to build the classification model, and the test dataset is used to evaluate the model's predictive performance. Overfitting occurs when the model is over-trained on the training dataset and is simply "memorizing" it, which results in poor predictive performance on the testing dataset. By contrast, underfitting occurs when the model is under-trained and has not learned enough from the training data; in underfitting situations, the model consists of a number of rules that each cover too many training instances (ibid.).
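To make the anatomy of such a rule concrete, here is a small hypothetical Python sketch: a rule is stored as its antecedent (a set of attribute conditions) and its consequent (the predicted class). The attribute names and values are invented for illustration.

    def rule_matches(rule, instance):
        """True if the instance satisfies every condition in the antecedent."""
        return all(instance.get(attr) == value for attr, value in rule["if"].items())

    rule = {"if": {"age_band": "adult", "owns_car": True},  # antecedent part
            "then": "buyer"}                                # consequent part

    instance = {"age_band": "adult", "owns_car": True, "income": 40000}
    if rule_matches(rule, instance):
        print(rule["then"])  # the rule predicts class 'buyer' for this instance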
Regression Task

In general, regression modeling is very similar to classification modeling, except that the target value in regression modeling is a continuous or ordered value. In statistics, regression analysis models the relation of the response value (the output or target value of the predictor) to specified predictor values (the input variables of the predictor). Regression analysis can be simple (linear) or multiple: the former approximates the relation of a single response variable to a single continuous predictor value, the latter to multiple continuous predictor values (Larose, 2006).

2.2.1.2 Descriptive Models

Clustering Task

Clustering simply means grouping: placing data instances into different groups, or clusters, such that instances in the same cluster are similar to one another and easily distinguished from the instances belonging to other clusters (Zaki et al., 2010).

Association Analysis Task

Association analysis refers to the process of extracting association rules from a data set that describe interesting relations hidden within it. As an illustration, imagine the market basket transactions example, where we have two items A and B and the following rule is extracted from the data: {A} -> {B}. This rule suggests a strong relation between item A and item B in terms of the frequency of their occurrence together (Tan et al., 2006): if item A is in the basket, there is a high probability that item B will be in the basket as well.
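The strength of such a rule is usually quantified by its support (how often A and B occur together) and its confidence (how often B occurs given A). A minimal sketch over made-up transactions:

    transactions = [
        {"A", "B", "C"},
        {"A", "B"},
        {"A", "C"},
        {"B", "D"},
        {"A", "B", "D"},
    ]

    n = len(transactions)
    count_a = sum(1 for t in transactions if "A" in t)
    count_ab = sum(1 for t in transactions if {"A", "B"} <= t)

    support = count_ab / n           # frequency of A and B occurring together
    confidence = count_ab / count_a  # estimate of P(B in basket | A in basket)
    print(f"support={support:.2f}, confidence={confidence:.2f}")
    # support=0.60, confidence=0.75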
2.3 Conventional Techniques of Data Mining

Several tools and techniques are available for data mining and knowledge discovery. These techniques have been developed mainly in two fields: statistics and machine learning. Multivariate analysis, logistic regression, linear discriminants, ID3, k-nearest neighbour, Bayesian classifiers, principal component analysis, and support vector machines are examples. These techniques are designed to discover accurate and comprehensible rules, but most of them are not designed to discover interesting rules (Freitas, 2003).

Statistics and machine learning techniques are the most widely used techniques for data mining, but they have some drawbacks. The models or rules they discover are not always optimal, owing to their sensitivity to noise in the data set, which may cause them to overfit the data (Vafaie & De Jong, 1994). They also tend to generate models with more features than really necessary, which increases the computational cost of the model (ibid.). Another drawback is that they typically assume a priori knowledge about the data set, which is not available in most cases. Statistical methods have the additional problem of assuming linearity of the models and of the distribution of the data (Terano & Ishino, 1996).

3 Evolutionary Algorithms and Data Mining

Evolutionary algorithms have several features that make them attractive for the data mining process (Freitas, 2003; Vafaie & De Jong, 1994). They are a domain-independent technique, which makes them ideal for applications where domain knowledge is difficult to provide. They can explore large search spaces, consistently finding good solutions. In addition, they are relatively insensitive to noise and can manage attribute interaction better than conventional data mining techniques. Consequently, much work has been done in recent years to develop new data mining techniques based on evolutionary algorithms. These attempts have used evolutionary algorithms for different data mining tasks such as feature extraction, feature selection, classification, and clustering (Cantú-Paz & Kamath, 2001). The main role of evolutionary algorithms in most of these approaches is optimization: they are used to improve the robustness and accuracy of some of the traditional data mining techniques.

Different types of evolutionary algorithms have been developed over the years, including genetic algorithms, genetic programming, evolution strategies, evolutionary programming, differential evolution, cultural evolution algorithms, and co-evolutionary algorithms (Engelbrecht, 2007). The types used in data mining include genetic algorithms, genetic programming, and co-evolutionary algorithms. Genetic algorithms are used for data pre-processing and for post-processing the discovered knowledge, while genetic programming is used for rule discovery and data pre-processing (Freitas, 2003). This section gives a general overview of genetic algorithms, genetic programming, and co-evolutionary algorithms, followed by an overview of different representation schemes, genetic operators, and fitness evaluation for the purpose of data mining. Finally, a brief discussion of integrating conventional data mining techniques with evolutionary algorithms is given.

3.1 Genetic Algorithms

Genetic algorithms "have been originally proposed as a general model of adaptive processes, but by far the largest application of the techniques is in the domain of optimization" (Back et al., 1997). Inspired by natural evolution, they maintain a population of individual solutions that are acted upon by a series of genetic operators in order to generate new, and hopefully better, solutions to a particular problem. The term 'genetic algorithm' was coined in the early 1970s by John Holland, who had been working since the early 1960s with systems that generate populations of potential solutions using natural methods. In his paper "Outline for a Logical Theory of Adaptive Systems", Holland (1962) describes a system in which a "generation tree" of populations is built. By applying a number of solutions (the population) to a number of problems (the environment), the solutions that successfully solve the problems are given reward/activation scores, which enable solutions to be compared with one another, and the best of these are used to generate the next branch of the generation tree.

Virtually all modern evolutionary systems have the same general stages, sketched in code after this list:

• A random population of solutions is generated, to be used as the initial population
• The solutions within the population are evaluated to determine their 'fitness'
• Solution pairs are first selected, based on their fitness, and then combined to create offspring, which are added to the next generation of the population
• Other genetic operators, such as mutation, are also applied to the offspring
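The following minimal generational GA follows these stages on a toy problem (maximizing the number of 1-bits in a chromosome, often called OneMax). The fitness function, parameter values, and operator choices are all illustrative assumptions, not prescriptions.

    import random

    GENES, POP_SIZE, GENERATIONS, MUTATION_RATE = 16, 20, 50, 0.01

    def fitness(bits):
        return sum(bits)  # toy objective: count the 1-bits

    def select_pair(population):
        # fitness-proportionate ("roulette wheel") selection
        return random.choices(population, weights=[fitness(p) for p in population], k=2)

    def crossover(a, b):
        point = random.randint(1, GENES - 1)  # one-point crossover
        return a[:point] + b[point:]

    def mutate(bits):
        return [b ^ 1 if random.random() < MUTATION_RATE else b for b in bits]

    # stage 1: a random initial population
    population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        offspring = []
        for _ in range(POP_SIZE):
            parent_a, parent_b = select_pair(population)  # stage 3: selection
            child = crossover(parent_a, parent_b)         # stage 3: recombination
            offspring.append(mutate(child))               # stage 4: mutation
        population = offspring                            # next generation

    print(max(fitness(p) for p in population))  # best fitness found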
3.2 Genetic Programming

Genetic programming is a specific application of genetic algorithms, used to evolve computer programs. The paradigm was named and developed by John Koza in the early 1990s, who initially used genetic programming to evolve LISP programs (Koza, 1992). The general premise of genetic programming is the same as that of the basic genetic algorithm: selecting the fittest members of a given population and then crossing and mutating them. The representation of the solutions, however, is radically different, which creates the need for alternative crossover methods. Instead of representing solutions as chromosomes, fixed-length sets in which each gene has a specific meaning and a limited range of values, genetic programming represents programs as strings that can grow indefinitely long. This means that crossover cannot simply occur at a random position in the program, as this would most likely break it, so more careful crossover algorithms need to be developed.

[Figure 1: Examples of evolved LISP programs; fitness calculated by the number of outputs within 20% of the correct output (Mitchell, 1998).]

Koza realized that genetic algorithms could be used to evolve not only programs but also other complex structures, such as equations and rule sets. This is particularly useful when looking at genetically evolving data mining techniques, since we can use the principles of genetic programming in the evolution of rule constructs. In data mining, genetic programming is considered a more open-ended search technique that can produce many different combinations of attributes, and hence it is very useful for classification and prediction tasks (Freitas, 2003).

3.3 Co-evolutionary Algorithms

In co-evolutionary algorithms, two populations are evolved together, with the fitness function involving the relationship with individuals of the other population. The individuals of the two populations evolve either by competing against each other or by cooperating with each other (Engelbrecht, 2007). The competitive approach is used to "obtain exclusivity on a limited resource" (Tan et al., 2005), while the cooperative approach is used to "gain access to some hard to attain resource" (ibid.). In the competitive approach, the fitness of an individual in one population is based on direct competition with individuals in the other population. In the cooperative approach, on the other hand, the fitness of an individual in one population is based on how well it cooperates with the individuals in the other population. Co-evolutionary approaches, particularly cooperative ones, can address some of the problems of single-population evolutionary algorithms, such as poor performance and convergence to local optima when dealing with problems that have complex solutions (Tan et al., 2005).

Several attempts have been made to apply co-evolutionary algorithms to the field of data mining. One of them is the distributed evolutionary classifier for knowledge discovery proposed by Tan et al. (2005). In their approach, a cooperative evolutionary algorithm evolves two populations: each individual of the first population represents a single rule, while each individual of the second population represents a set of rules. They validated their approach on six datasets, and their classifier performed better than the C4.5 classifier (a well-known algorithm for generating decision trees). The proposed co-evolutionary approach reduces computation time by sharing the workload among multiple computers. It also achieved a smaller number of rules in the rule set compared with other classification techniques, which increases the comprehensibility of the classification model. Moreover, it is more robust to noise in the data and has robust prediction accuracy.
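A heavily simplified sketch of this cooperative scheme is given below: one population holds single rules (Michigan-style) and the other holds rule sets (Pittsburgh-style, stored as indices into the rule population), and a rule's fitness is the accuracy of the best rule set using it. The toy data, rule encoding, and scoring are our own illustrative choices, not Tan et al.'s actual design.

    import random

    random.seed(1)

    def make_instance():
        x, y = random.random(), random.random()
        return x, y, x + y > 1.0  # class label: positive if x + y > 1

    data = [make_instance() for _ in range(200)]

    def rule_fires(rule, x, y):
        attr, thresh = rule
        return (x if attr == 0 else y) > thresh  # IF attribute > threshold THEN positive

    def ruleset_accuracy(ruleset):
        hits = sum((any(rule_fires(r, x, y) for r in ruleset) == label)
                   for x, y, label in data)
        return hits / len(data)

    rules = [(random.randint(0, 1), random.random()) for _ in range(10)]  # population 1
    rulesets = [random.sample(range(len(rules)), k=3) for _ in range(5)]  # population 2

    def rule_fitness(i):
        # cooperative credit: a rule is scored by the best rule set that contains it
        scores = [ruleset_accuracy([rules[j] for j in rs]) for rs in rulesets if i in rs]
        return max(scores, default=0.0)

    print([round(rule_fitness(i), 2) for i in range(len(rules))])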
Another approach that applies co-evolutionary algorithms to data mining is the co-evolutionary system for discovering fuzzy classification rules developed by Mendes et al. (2001). They used two evolutionary algorithms in their system: a genetic programming algorithm that evolves a population of fuzzy rule sets, and an evolutionary algorithm that evolves a population of membership function definitions. The advantage of the co-evolutionary process is the discovery of fuzzy rule sets and membership function definitions that are well adjusted to each other.

3.4 Representation and Encoding

The traditional method of encoding genetic material, which perhaps most closely resembles the way evolution occurs in nature, is a direct representation scheme that encodes the population data as a series of bitstrings: binary strings representing the genes that build each chromosome in the population. An 8-bit binary string would represent a population whose genetic data consists of 8 Boolean values, where each bit has some specific meaning. For example, a system looking to design a new car could use its first bit to represent whether the car has two or four doors, the second to represent whether it has three or four wheels, the third to represent whether the car has a spoiler, and so on. In this system, a population member with the value 011xxxxx would represent a car with two doors, four wheels, and a spoiler.

The key issue with this kind of representation, however, is that it defines a very specific search space, with a set number of genetic 'parameters' and a very restricted set of values each parameter can take. A simple way to make the genetic algorithm considerably more powerful is to alter the representation so that rather than a set of Boolean values, each gene stores a number, such as an integer or a floating point value. The search space defined by the system described above could then be hugely expanded, allowing a far greater number of genetic possibilities in its population. For example, the second bit represented whether the car had three or four wheels; by using an integer rather than a Boolean, we broaden the system so that it can represent a car with any number of wheels. Expanding the representation in this way comes with an additional memory overhead: a gene encoded as a 1-byte integer is 8 times the size of a binary gene. However, the number of possible values leaps to 256, so the memory increase is a small cost to pay for a significant improvement in the representation. Of course, not all genes need to be encoded in the same way; a chromosome can be constructed from any combination of data types that best fits the space being represented. For example, it would not make sense to represent whether a car has a spoiler with an integer, as there are only two possibilities, so a Boolean is sufficient.
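The sort of mixed-type chromosome described above can be sketched directly; the attribute set is the illustrative car example from the text, with each gene using the cheapest type that still covers its range of values.

    from dataclasses import dataclass

    @dataclass
    class CarChromosome:
        two_doors: bool    # Boolean gene: two doors (True) or four (False)
        wheels: int        # integer gene: any wheel count, not just three or four
        has_spoiler: bool  # only two possibilities, so a Boolean is sufficient

    car = CarChromosome(two_doors=True, wheels=4, has_spoiler=True)
    print(car)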
3.4.1 Rules Representation

Classification is the most common application of evolutionary algorithms in data mining. Among the many techniques for performing the classification task, rule-based techniques are preferred because rules are more comprehensible (Freitas, 2003). There are two approaches to representing individuals when using evolutionary algorithms for rule discovery: the Michigan approach and the Pittsburgh approach (ibid.). In the Michigan approach each individual represents a single rule, whereas in the Pittsburgh approach each individual represents a set of rules. The Pittsburgh approach is more suitable for classification tasks because the quality of the rule set is evaluated as a whole, rather than the quality of a single rule. The Michigan approach, on the other hand, is more suitable for other kinds of data mining tasks, such as finding a small set of high-quality prediction rules, because each rule is evaluated independently of the other rules (ibid.).

3.4.2 Fuzzy Logic Based Rules Representation

Fuzzy logic based rules are not only more readable by humans, but are also easier to evolve than classic rule types. Fuzzy rules use unary functions to classify variables; for example, IS_LOW(Age) could be equivalent to Age < 27. The advantage of this fuzzy representation is that it allows us to classify variables without needing to know explicitly the range of values that could realistically be expected, as values are simply classified into three distinct groups (low, medium, and high) defined over the normalized values of all the data contained in the database. An example of a fuzzy classification function is depicted in Figure 2, where the x-axis denotes the normalized data values and the y-axis denotes the degree of membership in the fuzzy groups.

[Figure 2: Fuzzy membership classification (Walter & Mohan, 2000).]

For our rule set, therefore, we will use just two binary operators, AND and OR, four unary operators, NOT, LOW, MEDIUM and HIGH, and an integer value to represent each data variable. We assign a numerical value to each of these, so values 0-5 represent the operators and values 6 and above represent the variables. Using a 1-byte binary representation gives us up to 250 possible variables; if we require more, we can simply choose a larger representation (2 bytes gives us around 65,000). If we consider a sample rule, LOW(Age) AND NOT(HIGH(Height) OR HIGH(Weight)), we can see how the binarized form of the rule, 0000 0011 0110 0010 0001 0101 1000 0101 0111, can be parsed and understood (in this example, a 4-bit representation is used).

3.5 Genetic Operators

New population members are evolved by applying a number of genetic operators to combine existing chromosomes. Two operators mimic natural methods of recombining genetic material: crossover, which merges two chromosomes in much the same way as sexual reproduction does between animals, and mutation, where genes are altered randomly, as happens in nature when organisms reproduce.

3.5.1 Crossover

A common method for crossover is one-point crossover (Rawlins, 1991). Bitstring representations are discussed here for simplicity, but the method does not vary with the data type used. In one-point crossover, the two parent chromosomes are split at the same place, and half of one set of genes is combined with the other half of the other. The crossover point tends to be randomized each time a pair of chromosomes reproduces. As an example, consider the two parent chromosomes 01100101 and 10011100. Since there are eight genes in the chromosome, the crossover point can be anywhere between bits 1-2 and 7-8. Say the crossover point is between bits 3-4; the two halves of each parent are then 011|00101 and 100|11100. Depending on which way the parents are combined, the offspring will be either 011|11100 or 100|00101.
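The worked example above can be executed directly. This minimal sketch splits both parents at the chosen point and swaps the tails, returning both possible offspring:

    def one_point_crossover(a: str, b: str, point: int):
        """Split both parents at `point` and exchange the tails."""
        return a[:point] + b[point:], b[:point] + a[point:]

    child1, child2 = one_point_crossover("01100101", "10011100", point=3)
    print(child1, child2)  # 01111100 10000101, i.e. 011|11100 and 100|00101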
Generalizing / Specializing Crossover

The purpose of this type of crossover is to generalize a rule that is overfitting the data, or to specialize a rule that is underfitting it (Freitas, 2003). If binary encoding is used, generalizing crossover is performed with logical OR and specializing crossover with logical AND. Below is an example of generalizing and specializing crossover, where "|" marks the crossover points (ibid.); a code sketch of this operator appears at the end of this section.

Parents:                                  01|01|11  and  00|10|10
Offspring (generalizing crossover, OR):   01|11|11  and  00|11|10
Offspring (specializing crossover, AND):  01|00|11  and  00|00|10

3.5.2 Mutation

Mutation is a fairly simple operator, in which bits are flipped to alter a chromosome's genetic makeup. The mutation rate controls how often these mutations occur: a system with a high mutation rate will produce many mutated offspring. Mutation is necessary because it provides renewable variety: it allows the system to explore solutions that may not be reachable by recombination alone.

Generalizing / Specializing Mutation

Different mutation operators can be used to generalize or specialize a rule. A simple generalizing mutation deletes one of the conditions in the rule's antecedent; conversely, adding a condition to the rule's antecedent is a specializing mutation (Freitas, 2003). Another generalizing/specializing mutation operator subtracts or adds a randomly generated value to "attribute-value conditions" (ibid.). For example, if the condition is (years_of_experience > 20), then subtracting a randomly generated value (e.g. years_of_experience > 10) is a generalizing mutation, and adding one (e.g. years_of_experience > 25) is a specializing mutation.

3.5.3 Fuzzy Logic Operators

With the fuzzy logic representation, we need to come up with new ways to cross and mutate the individuals in our population while ensuring that the rules remain valid within the structure of the grammar.

Mutation

Mutation is a simple enough process: we can interchange the binary functions AND and OR, leaving a syntactically correct rule; we can add or remove NOT before any operator, whether unary or binary; and we can substitute any of the fuzzy classification functions LOW, MEDIUM or HIGH for one another.

Crossover

Crossover, however, is a more difficult problem. One-point crossover is not a suitable method here, since it can result in syntactically incorrect rules. Figure 4 shows the result of crossing the sample rule from Section 3.4.2 with another rule, NOT( (MEDIUM(Age) OR LOW(Age)) OR (LOW(Height)) ), at a random point:

Rule 1: 0000 0011 0110 0010 0001 0101 | 1000 0101 0111
Rule 2: 0010 0001 0001 0100 0110 | 0011 0110 0011 1000
Rule 3: 0000 0011 0110 0010 0001 0101 | 0011 0110 0011 1000

[Figure 4: One-point crossover of two binarized fuzzy rules at a random point.]

As can be seen, crossing at this point has cut a binary operator in half, resulting in a rule that cannot be parsed. For this reason, it is important to look more carefully at the crossover points: crossover may not occur after a fuzzy classification function, but it may occur at any point where an AND, OR, or NOT branches. In addition to this method of merging rules, other systems have occasionally simply combined whole rules using AND or OR. This can be a useful technique if used infrequently, so we can use this combination method 10% of the time and the merging method the rest of the time (Walter & Mohan, 2000).
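Referring back to the generalizing/specializing crossover example earlier in this section, the following sketch reproduces it on binary rule encodings: between the crossover points, the parents' genes are combined bit by bit with OR (generalizing) or AND (specializing). The encoding as Python strings is our own illustrative choice.

    def genspec_crossover(a: str, b: str, lo: int, hi: int, op):
        """Combine the segment between the crossover points lo..hi using `op`."""
        def child(base, other):
            mid = "".join(op(x, y) for x, y in zip(base[lo:hi], other[lo:hi]))
            return base[:lo] + mid + base[hi:]
        return child(a, b), child(b, a)

    OR = lambda x, y: str(int(x) | int(y))   # generalizing crossover
    AND = lambda x, y: str(int(x) & int(y))  # specializing crossover

    print(genspec_crossover("010111", "001010", 2, 4, OR))   # ('011111', '001110')
    print(genspec_crossover("010111", "001010", 2, 4, AND))  # ('010011', '000010')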
3.6 Fitness Evaluation

Each member of the population in an evolutionary system has a fitness level, defined by how effective the solution is deemed to be at solving a particular problem. The general aim of any genetic algorithm is to adapt the parameters of its population in order to evolve solutions with maximal fitness. Fitness functions can be extremely complicated, as they require some method for quantitatively and qualitatively evaluating solutions when the knowledge of what makes a good solution is often unavailable. In data mining, the fitness function is used to evaluate the fitness of the prediction rules. As mentioned earlier in this paper, prediction accuracy, comprehensibility, and interestingness represent the quality criteria of the discovered rules and can be used to measure their fitness. Two main types of fitness function are used in data mining to evaluate the fitness of an individual: objective and subjective fitness evaluation. A major issue with using evolutionary algorithms in data mining, which needs to be considered when designing the fitness function, is the interestingness of the discovered rules: evolutionary algorithms are powerful techniques that can perform a global search and generate huge numbers of rules, but these rules can be trivial and uninteresting.

3.6.1 Objective Fitness Evaluation

This section discusses objective, or quantitative, approaches to evaluating the fitness of discovered rules (Freitas, 2003). Different approaches have been proposed to design effective objective fitness functions; they are organized here according to the quality criteria of discovered rules mentioned earlier. The examples used to illustrate these approaches assume the Michigan scheme, in which each individual represents a single rule.

Prediction Accuracy Criteria

One approach to measuring the prediction accuracy of a rule is to use the confidence factor, CF (Freitas, 2003). Suppose that the rule to be evaluated is: IF A THEN B. The confidence factor can then be calculated as:

CF = |A&B| / |A|

where |A| is the number of instances in the data that satisfy all the conditions in the antecedent part A of the rule, and |A&B| is the number of instances that satisfy all the conditions in A and are classified as class B (Freitas, 2003). As an example of how to calculate CF: if |A| = 100 and |A&B| = 60, then CF is 60%, which gives an idea of how accurate the rule is. The higher a rule's accuracy on the training set, the more likely it is to be selected. This is a very simple way to define prediction accuracy, but one obvious drawback is that it is likely to overfit the data, resulting in poor prediction performance on the testing data set.

Another approach mentioned in (Freitas, 2003) is to use a confusion matrix, a 2 x 2 matrix that describes the "predictive performance" of the rule. Recalling the rule IF A THEN B, the confusion matrix of this rule is:

                Predicted B    Predicted not B
Actual B        TP             FN
Actual not B    FP             TN

where:
TP = True Positives = number of examples satisfying A and B
FP = False Positives = number of examples satisfying A but not B
FN = False Negatives = number of examples not satisfying A but satisfying B
TN = True Negatives = number of examples satisfying neither A nor B (Freitas, 2003).

Using this matrix, CF can be computed as CF = TP / (TP + FP). One important advantage of this approach is that it also introduces a measurement of rule completeness, Comp:

Comp = TP / (TP + FN)

The rule fitness can then be calculated as:

Fitness = CF * Comp
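These formulas translate directly into code. A small sketch with made-up confusion matrix counts:

    def rule_fitness(tp, fp, fn, tn):
        # tn is carried for completeness; CF and Comp do not use it
        cf = tp / (tp + fp)    # CF = TP / (TP + FP)
        comp = tp / (tp + fn)  # Comp = TP / (TP + FN)
        return cf * comp       # Fitness = CF * Comp

    print(rule_fitness(tp=60, fp=40, fn=20, tn=880))
    # CF = 0.60, Comp = 0.75, fitness = 0.45 (up to float rounding)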
Comprehensibility Criteria

The fitness function can be extended to cover the comprehensibility criterion as follows (Freitas, 2003):

Fitness = w1 * (CF * Comp) + w2 * Simp

where w1 and w2 are user-defined weights and Simp is a measure of the rule's simplicity. One obvious way to measure simplicity is to count the number of conditions in the rule: the fewer the conditions, the simpler the rule. For data mining approaches that use genetic programming, the simplicity of a rule can be measured by counting the number of nodes. One possible method, mentioned in (Freitas, 2003), is to define a maximum number of nodes in a tree (individual) and then calculate the simplicity as:

Simp = (MaxNodes - 0.5 * NumNodes - 0.5) / (MaxNodes - 1)

Interestingness Criteria

Noda et al. (1999) proposed a fitness function composed of two parts: the first measures the degree of interestingness and the second measures the predictive accuracy. The degree-of-interestingness part itself consists of two parts, and users are expected to set the weights of the interestingness and predictive accuracy parts. Another measurement of rule interestingness, introduced by Piatetsky-Shapiro and cited in (Gebhardt, 1991), is the PS measure:

PS = |A&B| - |A| * |B| / N

According to Piatetsky-Shapiro, as summarized in (Gebhardt, 1991), there are three principles for rule interestingness (RI) measures:

• RI = 0 if |A&B| = |A| * |B| / N, i.e. when the antecedent and the consequent of the rule are statistically independent.
• RI monotonically increases with |A&B| when the other parameters, |A| and |B|, are fixed. In this case the CF and Comp factors also increase, which means a more interesting rule.
• RI monotonically decreases with |A| or |B| when the other parameters, namely |A&B|, are fixed. In this case the CF and Comp factors also decrease, which means a less interesting rule.
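The measures in this section can be combined in a short sketch, again assuming one rule per individual; the weights and counts are illustrative.

    def simplicity(num_nodes, max_nodes):
        # Simp = (MaxNodes - 0.5 * NumNodes - 0.5) / (MaxNodes - 1)
        return (max_nodes - 0.5 * num_nodes - 0.5) / (max_nodes - 1)

    def weighted_fitness(cf, comp, simp, w1=0.7, w2=0.3):
        return w1 * (cf * comp) + w2 * simp  # Fitness = w1 * (CF * Comp) + w2 * Simp

    def ps_measure(n_ab, n_a, n_b, n):
        return n_ab - n_a * n_b / n  # PS = |A&B| - |A| * |B| / N

    simp = simplicity(num_nodes=7, max_nodes=21)                # 0.85
    print(weighted_fitness(cf=0.60, comp=0.75, simp=simp))      # about 0.57
    print(ps_measure(n_ab=60, n_a=100, n_b=300, n=1000))        # 30.0 > 0: positively associated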
3.6.2 Subjective Fitness Evaluation (Interactive Evolutionary Algorithms)

Writing an appropriate objective fitness function can be a very hard task. This is particularly true when domain knowledge or a priori knowledge is not available, which makes it difficult to decide what counts as interesting knowledge. In such cases, subjective fitness evaluation can be very useful. Subjective fitness evaluation is done by human experts: in data mining, domain experts evaluate the fitness of the discovered rules according to their interestingness. Rules can be interesting if they are unexpected and actionable for the user (Liu et al., 1997). In many domains, however, knowledge about the domain data varies from one user to another. A user's prior knowledge of the domain can be either a general impression (GI), when the user has feelings about the domain, or reasonably precise knowledge (RPK), when the user has a definite idea. Generally, discovered rules are evaluated and ranked against these two types of concept (ibid.).

A major problem with data mining is that the discovered models do not necessarily contain important or interesting rules; they sometimes include trivial rules, or worse, counterintuitive ones (Pazzani, 2002). Previous attempts to address this problem interact with the domain experts, who evaluate the models and identify what is interesting and important; the found model is then adjusted according to the experts' feedback (for example by adding or removing variables) until an acceptable model is found (ibid.). Subjective fitness evaluation and interactive evolutionary algorithms accelerate this process and are likely to generate more interesting rules, by involving the domain expert in the search process to bias the search toward models that are more novel and comprehensible.

[Figure 5: Combining artificial intelligence, statistics, databases, and cognitive psychology in data mining (Pazzani, 2002).]

Using subjective fitness evaluation in data mining offers many opportunities for future research. For example, Pazzani's idea (2002), illustrated in Figure 5, could be realised through subjective fitness evaluation. He describes how the fields of artificial intelligence, statistics, databases, and cognitive psychology should be combined to improve the performance of the multi-disciplinary field of data mining. Interactive evolutionary algorithms can bring cognitive psychology into tools and techniques for data mining and knowledge discovery by involving the human cognitive process in the search for interesting patterns and the discovery of new knowledge from data sets.

3.7 Selection and Replacement

Selection is the process of choosing which individuals in the population to use for reproduction; replacement is the process of choosing which individuals will go through to the next generation. Whilst selection and replacement can use the same methods almost interchangeably, they need not both be implemented the same way in a particular genetic system: a system may use the 'roulette wheel' method for selection and the 'absolute' method for replacement. There are a number of different methods for making these selections:

Absolute

The n fittest individuals in the population are chosen for breeding, or the n least fit individuals are replaced. Whilst this seems like a good strategy, it can result in losing individuals that, while less fit, hold genetic material that could be useful in evolving even better solutions. It is important to keep a mix of genetic material within the population to stop the solutions converging prematurely on local optima.

Random

The opposite of absolute selection is random selection. Here, no regard is given to fitness at all: all individuals are selected with uniform probability. Whilst this method does preserve variety, it can take a very long time to find a good solution, and if it is used as a replacement strategy, good solutions have as much chance of being discarded as bad ones, meaning that good solutions may never develop.

Roulette Wheel

The previous two methods show that whilst it is important to focus on the individuals with higher fitness levels, we also need to ensure that we do not throw away potentially useful solutions. The roulette wheel method addresses this by picking randomly, but in proportion to the fitness of the individuals, so that very fit individuals have a higher chance of being selected for breeding, and less fit individuals have a higher chance of being replaced.
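A minimal sketch of the roulette wheel, used both ways: the selection weight is proportional to fitness, while the replacement weight is proportional to the fitness shortfall, so weaker individuals are more likely to be replaced. The population and fitness values are invented.

    import random

    def roulette(items, weights):
        """Pick one item with probability proportional to its weight."""
        r = random.uniform(0, sum(weights))
        acc = 0.0
        for item, w in zip(items, weights):
            acc += w
            if r <= acc:
                return item
        return items[-1]  # guard against floating-point round-off

    population = ["rule1", "rule2", "rule3", "rule4"]
    fitness = {"rule1": 0.9, "rule2": 0.5, "rule3": 0.2, "rule4": 0.1}

    parent = roulette(population, [fitness[p] for p in population])  # selection
    best = max(fitness.values())
    victim = roulette(population, [best - fitness[p] + 0.01 for p in population])  # replacement
    print(parent, victim)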
3.8 Integrating Conventional Techniques with Evolutionary Algorithms

Several hybrid approaches have been proposed that integrate evolutionary algorithms with conventional techniques, to tackle some of the problems of the conventional techniques, such as minimizing the number of selected features and selecting more interesting features. One of the successful attempts is the approach developed by Terano and Ishino (1996), which integrates an evolutionary algorithm with a machine learning data mining technique, namely an inductive learning technique that generates decision trees. They used the inductive learning algorithm to find rules in the data, and then an interactive evolutionary algorithm to refine these rules. Their work is discussed in greater depth in the following section.

4 Applications of Data Mining Using IEA

In this section we examine some areas where data mining with interactive evolutionary algorithm (IEA) techniques has been successfully applied. The first approach detailed is very general, in the sense that it can be used to classify any text-based data and is hence not limited to any specific discipline. The approach requires textual data in the form of reports, which can be plain text files corresponding to the database from which the knowledge needs to be extracted.

4.1 Extracting Knowledge from a Text Database

This technique, proposed by Sakurai et al. (2001), details a means of extracting knowledge from any database with the help of domain-dependent dictionaries. The particular application in the paper deals with text mining of daily business reports generated by some institution, and classification of the reports based on knowledge dictionaries. In their experiment, two kinds of knowledge dictionaries were used: the key concept dictionary and the concept relation dictionary. The daily business reports are decomposed into words using lexical analysis, and the words are checked against entries in the key concept dictionary. All reports are then tagged with particular concepts, according to the words in the report that represent those concepts in the key concept dictionary. Each report is also checked to see whether its key concepts are assigned in the concept relation dictionary. Reports are then classified according to the set of concept relations, and reports with the same text class are put into the same group. This helps the end users, as they can read only those reports in groups whose topics match their interests; it also gives them an indication of trends in report topics.

The key concept dictionary contains concepts with common features, concepts and related keywords, and expressions and phrases concerned with the target problem; an example can be seen in Figure 6.

[Figure 6: Example of a key concept dictionary (Sakurai et al., 2001).]

The concept relation dictionary contains relations, each describing a condition and a result; it is a mapping from key concepts to classes. Since creating a dictionary manually is time-consuming and prone to errors, the paper describes an automatic way of creating the concept relation dictionary. A relation in the concept relation dictionary is like a rule and can be acquired by inductive learning if training examples are available. To do so, words are extracted from the documents by lexical analysis, and these words are checked against the expressions in the key concept dictionary.
Thus we have the following setup: concept classes are attributes, concepts are their values, and the text classes given by the reader are the result classes we want; together these form a training example. Attributes that have no value are assigned 0. An overview of this is depicted in Figure 7.

[Figure 7: Generating training examples from reports and the key concept dictionary (Sakurai et al., 2001).]

For the inductive learning to work, a fuzzy algorithm is needed, because reports written by humans do not adhere strictly to fixed descriptions. The method used for the learning is therefore the IDF algorithm, a fuzzy inductive learning algorithm. It builds rules from the generated training examples, and the generated rules have the genotype of a tree. The whole process, with its inputs and the processing steps that turn the input dictionaries and data into the final outputs, is shown in Figure 8.

[Figure 8: Inputs, processes, and outputs of the text mining system (Sakurai et al., 2001).]

The algorithm was tested on daily reports from a retail business, classifying each report into three classes that describe a sales opportunity as best, missed, or other. The key concept dictionary was composed of 13 concept classes, each with its own subset of concepts. Reports containing contradictory descriptions were regarded as unusable, and no training examples were generated from them. The results, using 10-fold cross-validation, showed that the authors were able to generate the concept relation dictionary automatically and obtain better results than plain IDF on the retail reports.
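A toy sketch of the training example generation just described: words extracted from a report are looked up in a key concept dictionary, the matched concepts become attribute values (concept classes with no matching value get 0, as in the text), and the reader's text class becomes the result class. The dictionary contents here are entirely invented.

    key_concept_dictionary = {
        "shortage": ("stock", "stock-out"),   # keyword -> (concept class, concept)
        "sold_out": ("stock", "stock-out"),
        "discount": ("price", "markdown"),
    }

    def make_training_example(report_words, text_class, concept_classes):
        example = {cls: 0 for cls in concept_classes}  # 0 = attribute has no value
        for word in report_words:
            if word in key_concept_dictionary:
                cls, concept = key_concept_dictionary[word]
                example[cls] = concept
        example["class"] = text_class  # result class given by the reader
        return example

    words_from_report = ["discount", "sold_out", "rain"]
    print(make_training_example(words_from_report, "missed", ["stock", "price", "weather"]))
    # {'stock': 'stock-out', 'price': 'markdown', 'weather': 0, 'class': 'missed'}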
4.2 Extracting Marketing Rules from User Data

Marketing decisions require good rules extracted from customer data, which can be very noisy. Simulated breeding and inductive learning methods have been tested for creating such rules, and they have been able to generate simple, easy-to-understand results in a form that can be used directly by the marketing agent. This work was developed by Terano and Ishino (1996). The conventional way to generate efficient decision-making rules was to use statistical methods, but these prove weak, since they assume the mined data follows linear models. Multivariate analysis, which is popularly used, fails to satisfy the need for both quantitative and qualitative analysis of the data. AI techniques, on the other hand, focus on the problem of feature selection, which is based on machine learning and aims to find the optimal number of features to describe the target concept. This does not work for the problem at hand, so we cannot apply well-known standard techniques to choose the appropriate features. The approach proposed by the authors is therefore to use both simulated breeding and inductive learning: inductive learning generates the decision rules from the data, emphasising the relationship between product and features, while simulated breeding finds the effective features. This work was the first of its kind to specifically address the problem of clarifying the relationship between product image and features using user questionnaire data.

Simulated breeding is a GA-based technique for evolving offspring. The offspring judged by a human expert to have the desired features are allowed to breed, with the judgment made interactively; it is used in cases where the fitness function is hard to define. Inductive learning is used to generate the rules, in the form of a decision tree, for the analysis of features and attribute-value pairs. This specific implementation used C4.5.

Marketing decisions are made by analysts who must devise promotion strategies for their product according to an abstract image of that product. They need to keep in mind that the data gathered from users is inherently noisy and based on complicated models, so simple rules are needed to explain the characteristics of the products. Moreover, choosing the product features that realize the desired image is usually left to the intuition of experts, with no clear way to do it; the information therefore needs to be organized in a clear manner so that the relationship between the features and the image of the product can be understood. The proposed algorithm consists of the following steps:

1. Inductive learning to classify the data
2. Genetic operators to enhance the flexibility of feature selection
3. Decision tree selection based on human judgment
4. Development of decision trees with a small number of features that fully explain the data

Offspring selection is not automated, both to promote human creativity, which incorporates appropriate explanations, and because the problem needs subjective judgment, which makes it very hard to define a formal fitness function. The analysis was carried out on oral care products, with 2,300 users filling in the questionnaire; the knowledge obtained was tested by a domain expert at the manufacturing company. The domain expert must know the basic principles of inductive learning and statistics, and must understand the outputs obtained. Using the decision tree outputs, she interactively evaluates the quality of the obtained knowledge.

4.3 Fraud Detection Using Data Mining and IEA Techniques

An interesting application of genetic programming and rule-based data mining can be seen in the work of Bentley (2000), where a system is designed to analyze data provided by a bank and discover cases of fraud, in this particular case in insurance applications. The motivation stems from a very pressing issue: the rise of fraud in all kinds of financial institutions. For a large bank this is typically hard to handle, as the fraudulent cases are masked by the far larger number of genuine applicants and so simply slip through. An effective method is needed to find such cases in huge amounts of data, which is where data mining comes in, and evolutionary computation techniques are used to generate rules that might be the underlying explanations for fraud cases. In this experiment the data was first clustered into three segments, which correspond to the domains of the rule-generation membership functions. These functions give the "degree of membership" of the input data in the fuzzy logic sets LOW, MEDIUM and HIGH.
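A hedged sketch of this membership idea: after normalizing the data to [0, 1], triangular membership functions assign each value a degree of membership in LOW, MEDIUM and HIGH. The exact shapes used in Bentley's system are not given in the text; these are illustrative.

    def triangular(x, left, peak, right):
        """Degree of membership in a triangular fuzzy set, in [0, 1]."""
        if x <= left or x >= right:
            return 0.0
        if x <= peak:
            return (x - left) / (peak - left)
        return (right - x) / (right - peak)

    def memberships(x):
        return {
            "LOW": triangular(x, -0.5, 0.0, 0.5),
            "MEDIUM": triangular(x, 0.0, 0.5, 1.0),
            "HIGH": triangular(x, 0.5, 1.0, 1.5),
        }

    print(memberships(0.2))  # {'LOW': 0.6, 'MEDIUM': 0.4, 'HIGH': 0.0}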
A GP is used to evolve the rules, each represented as a tree, where each tree corresponds to a particular rule. After a set of rules has been generated, they are evaluated by an expert system and assigned a score before being applied to the training data: data for which the bank has certified the number of fraud cases, used to generate rules that describe fraud cases accurately. The fitness function uses these scores and assigns fitness values with three key objectives: to ensure that as few items as possible are misclassified; to differentiate between the "suspicious" and "unknown" classes, giving "suspicious" more relevance; and to ensure that the generated rule is concise yet understandable. A single GP run generates one rule, which might not classify all suspicious items, so the GP is run several times, until all suspicious items are classified, yielding more than one rule; any rule that misclassifies a number of claims is removed from the final set.

Here the human interaction enters the process. Since the evolutionary system has many variables, each of which can affect the outcome, there is no single choice of settings that will classify every data set correctly: cluster sizes, membership functions, rule interpreters, fitness functions, GA settings, and so on can all be tweaked. To help the human decision maker, four versions of the system with different settings are run in parallel, and all the generated results are presented. The human then selects the best results from the four by performing a series of tasks, finding the most accurate, the most intelligible, and the best combination of the two among the rule sets. The chosen rule set is finally evaluated on the global data, and if need be the parameter settings are adjusted by the human to generate better rules. With this setup and data obtained from a bank, the research team achieved accuracies of up to 60%, which is impressive given that the training data containing reported incidents of fraud was small and spread over a number of years, while the data to be tested covered only the past couple of months and the percentage of suspicious items was unknown.

4.4 Some Current Work Being Done

A brief glimpse of research under way at the Takagi laboratory, Kyushu University, involving interactive evolutionary computation (IEC) and data mining from different sources:

• Constructing image or music feature spaces and impression spaces. Neural networks are used to learn the mapping from features to impressions and to search for a point in feature space starting from impression space, with the aim of retrieving images or music based on human impressions.
• Investigating the diagnostic data of mentally ill patients by measuring their psychological dynamic range (happy, sad, and so on) using IEC, with a therapist as the trained evaluator.

5 Conclusion and Future Work

In this paper we have discussed the use of different evolutionary algorithms in the field of data mining and knowledge discovery. The main motive for using evolutionary algorithms in data mining is their attractive features, which resolve some of the drawbacks of conventional data mining techniques and enable them to discover novel solutions: their robustness when dealing with noisy data, and their ability to interpret data without any a priori knowledge. The difficulty of discovering novel and interesting knowledge is one of the main issues in data mining, and interactive evolutionary algorithms have been used to address this problem. Interactive evolutionary algorithms provide a promising research area for data mining and knowledge discovery, and there is a wide range of applications that use evolutionary algorithms in data mining, a number of which have been presented in this paper.
4.4 Some current work being done

This is a brief glimpse of research going on at the TAKAGI lab, Kyushu University, involving interactive evolutionary computation and data mining from different sources. One line of work constructs image or music feature spaces and impression spaces: neural networks learn the mapping from features to impressions and search for a point in feature space starting from impression space, the aim being to retrieve images or music based on human impressions. Another investigates diagnostic data of mental illness patients by measuring psychological dynamic ranges, such as happy and sad, using IEC with a therapist as the trained evaluator.

5 Conclusion and Future Work

In this paper we have discussed the use of different evolutionary algorithms in the data mining and knowledge discovery field. The main motive behind using evolutionary algorithms in data mining is their attractive features, which enable them to resolve some of the drawbacks of conventional data mining techniques and to discover novel solutions, such as their robustness when dealing with noisy data and their ability to interpret data without any a priori knowledge. The difficulty of discovering novel and interesting knowledge is one of the main issues in data mining, and interactive evolutionary algorithms have been used to address this problem. Interactive evolutionary algorithms provide a promising research area for data mining and knowledge discovery, and there is a wide range of applications that use evolutionary algorithms in data mining, a number of which have been presented in this paper.

Possible Future Applications

In this section we propose some applications where data mining combined with IEA methods could be fruitful to implement; we hope this will become evident in the near future.

Generating effective strategies for share/stock market data analysis

For the analysis of stock market data at any given point, the number of variables to take into account can be enormous, and each of these variables can in turn offer many different choices. An expert needs to decide which subset of these variables to consider before making any assumptions. One example of a variable is the selection of trading rules: constraints or choices that define when to buy, sell or hold a stock. Several rules are available, such as filter rules, moving averages, support and resistance, and abnormal return. Another variable, more evident in the share market, is the effect of other stocks on the stock being considered, including how many other stocks to take into account; hence it is almost never true that a single strategy fits all situations at a given time. We propose that future implementations will let humans specify the fitness function by selecting the best rules for a given situation, specifying the number and types of variables to be chosen at any given time, and then testing their effectiveness by deploying them.

Intrusion Detection in Networks or Websites

Another application that we propose might appear soon is intrusion detection in networks or websites. Taking some ideas from the approach of Li (2004), assume that the intrusion detection systems of the future will use interactive techniques to enhance already-implemented systems that use genetic algorithms to define rules based on network traffic activity. In the system described in that paper, the GA generates rules that classify traffic activity as suspicious, normal or other. Almost all suspicious activity is added to the IDS to prevent further attacks of that kind, and such a system requires no human intervention in use; however, a shortcoming of this technique is that it does not catch novel attacks. For a network administrator it is precisely these novel and unique attacks that can cause the most harm, as they lie in the unknown range and go unexamined. We envisage an interactive system in which the suspicious category is handled exclusively by the GA, while for the unknown cases a network administrator analyzes the traffic and makes subsequent changes to the GA, and hence to the IDS, so that newer attacks are added into the system. The human in charge would examine each novel attack and modify the GA, adjusting the fitness function so that rules capable of capturing such attacks in the future can be evolved.
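As a purely hypothetical sketch of how such a GA-evolved traffic rule might be represented and scored (the field names, value pools and weights below are our illustrative assumptions, not the encoding from Li (2004)):

import random

# Each rule is a tuple over these connection attributes; "*" is a wildcard.
FIELDS = ("protocol", "dst_port", "duration", "flag")
WILDCARD = "*"

def matches(rule, connection):
    # A rule matches when every non-wildcard field equals the connection's value.
    return all(r == WILDCARD or r == connection[f] for f, r in zip(FIELDS, rule))

def fitness(rule, traffic, w_attack=2.0, w_false=1.0):
    # traffic: list of (connection_dict, is_attack) records from labelled logs.
    # Reward matched attacks, penalize false alarms; the administrator's
    # interactive feedback would enter here by reweighting or relabelling.
    score = 0.0
    for connection, is_attack in traffic:
        if matches(rule, connection):
            score += w_attack if is_attack else -w_false
    return score

def mutate(rule, pool, p=0.2):
    # Randomly generalize a field to a wildcard or specialize it to a pooled value.
    return tuple(
        (WILDCARD if random.random() < 0.5 else random.choice(pool[f]))
        if random.random() < p else r
        for f, r in zip(FIELDS, rule))

# Toy usage with an assumed value pool per field.
pool = {"protocol": ["tcp", "udp"], "dst_port": [22, 80, 443],
        "duration": ["short", "long"], "flag": ["SYN", "ACK", "FIN"]}
traffic = [({"protocol": "tcp", "dst_port": 22, "duration": "long", "flag": "SYN"}, True),
           ({"protocol": "tcp", "dst_port": 80, "duration": "short", "flag": "ACK"}, False)]
rule = ("tcp", 22, WILDCARD, "SYN")
print(fitness(rule, traffic))  # matches only the attack record: 2.0
print(mutate(rule, pool))

In the interactive variant we envisage, the administrator would inspect connections that fall in the unknown range, label them, and re-run evolution with adjusted weights so that the rule population drifts toward the novel attack pattern.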
6 References

Zaki, M. J., Yu, J. X., Ravindran, B., & Pudi, V. (Eds.). (2010). Advances in Knowledge Discovery and Data Mining: Proceedings of the 14th Pacific-Asia Conference (PAKDD 2010), Part I. Springer.

Walter, D., & Mohan, C. K. (2000). ClaDia: A fuzzy classifier system for disease diagnosis. In Proceedings of the Congress on Evolutionary Computation.

Vafaie, H., & De Jong, K. (1994). Improving a rule induction system using genetic algorithms. In R. S. Michalski & G. Tecuci (Eds.), Machine Learning: A Multistrategy Approach (pp. 453-470). San Francisco, CA: Morgan Kaufmann.

Bäck, T., Hammel, U., & Schwefel, H.-P. (1997). Evolutionary computation: Comments on the history and current state. IEEE Transactions on Evolutionary Computation, 1(1), 3-17.

Bentley, P. J. (2000). "Evolutionary, my dear Watson": Investigating committee-based evolution of fuzzy rules for the detection of suspicious insurance claims. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000). Morgan Kaufmann.

Cantú-Paz, E., & Kamath, C. (2001). On the use of evolutionary algorithms in data mining. In H. A. Abbass, R. A. Sarker, & C. S. Newton (Eds.), Data Mining: A Heuristic Approach (pp. 48-71). Idea Group Inc.

Engelbrecht, A. P. (2007). Computational Intelligence: An Introduction (2nd ed.). Sussex: Wiley.

Dzeroski, S., & Lavrac, N. (2001). Relational Data Mining. Secaucus, NJ: Springer.

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in Knowledge Discovery and Data Mining. Menlo Park, CA: The MIT Press.

Freitas, A. A. (2003). A survey of evolutionary algorithms for data mining and knowledge discovery. In A. Ghosh & S. Tsutsui (Eds.), Advances in Evolutionary Computing: Theory and Applications (pp. 819-846). New York, NY: Springer-Verlag.

Freitas, A. A. (2002). Data Mining and Knowledge Discovery with Evolutionary Algorithms. Berlin: Springer-Verlag.

Gebhardt, F. (1991). Choosing among competing generalizations. Knowledge Acquisition, 3(4), 361-380.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. MIT Press.

Holland, J. (1962). Outline for a logical theory of adaptive systems. Journal of the ACM, 9(3), 297-314.

Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann.

Koza, J. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. Massachusetts: The MIT Press.

Larose, D. T. (2006). Data Mining: Methods and Models. New York: Wiley-Interscience.

Liu, B., Hsu, W., & Chen, S. (1997). Using general impressions to analyze discovered classification rules. In Proceedings of Knowledge Discovery & Data Mining (pp. 31-36).

Li, W. (2004). Using genetic algorithm for network intrusion detection. In United States Department of Energy Cyber Security Group 2004 Training Conference (pp. 24-27). Kansas City, Kansas.

Noda, E., Freitas, A. A., & Lopes, H. S. (1999). Discovering interesting prediction rules with a genetic algorithm. In Proceedings of the Congress on Evolutionary Computation 1999 (CEC-99) (pp. 1322-1329). Washington, D.C.

Mendes, R. R., Voznika, F. de B., Freitas, A. A., & Nievola, J. C. (2001). Discovering fuzzy classification rules with genetic programming and co-evolution. In L. De Raedt & A. Siebes (Eds.), Principles of Data Mining and Knowledge Discovery (Vol. 2168, pp. 314-325). Berlin, Heidelberg: Springer-Verlag.

Mitchell, M. (1998). An Introduction to Genetic Algorithms. Massachusetts: MIT Press.

Pazzani, M. J. (2000). Knowledge discovery from data? IEEE Intelligent Systems and Their Applications, 15(2), 10-12.

Sakurai, S., Ichimura, Y., Suyama, A., & Orihara, R. (2001). Acquisition of a knowledge dictionary for a text mining system using an inductive learning method. In IJCAI 2001 Workshop on Text Learning: Beyond Supervision (pp. 45-52).

Rawlins, G. J. (1991). Foundations of Genetic Algorithms. California: Morgan Kaufmann Publishers.

Tan, K. C., Yu, Q., & Lee, T. H. (2005). A distributed evolutionary classifier for knowledge discovery in data mining. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 35(2), 131-142.

Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Association analysis: Basic concepts and algorithms. In Introduction to Data Mining. Pearson Addison Wesley.
Terano, T., & Ishino, Y. (1996). Knowledge acquisition from questionnaire data using simulated breeding and inductive learning methods. Expert Systems with Applications, 11(4), 507-518.