A Distributed-Population Genetic Algorithm for Discovering Interesting Prediction Rules

Edgar Noda (1), Alex A. Freitas (2), Akebo Yamakami (1)
(1) School of Electrical and Computer Engineering (FEEC), State University of Campinas (Unicamp), Brazil
(2) Computing Laboratory, University of Kent at Canterbury, UK

Introduction
– Data Mining: extraction of knowledge from data.
– Data mining tasks:
  • Classification.
    – One goal attribute, whose value is to be predicted.
  • Dependence Modeling.
    – A generalization of classification: more than one possible goal attribute.
– Knowledge is expressed as prediction rules of the form:
  • IF conditions on the values of predicting attributes are true THEN predict a value for some goal attribute.

Discovered Knowledge
– Desirable properties: in principle, three.
– 1. Predictive accuracy.
  • The property most emphasized in the literature.
  • Discovered knowledge should have high predictive accuracy.
– 2. Comprehensibility.
  • High-level rules.
  • The output of rule discovery algorithms tends to be more comprehensible than the output of other kinds of algorithms.
– 3. Interestingness.
  • Discovered knowledge should be interesting to the user.
  • Among the three desirable properties, interestingness seems to be the most difficult to quantify and to achieve.
  • By "interesting" we mean that discovered knowledge should be novel or surprising to the user.
  • The notion of interestingness goes beyond the notions of predictive accuracy and comprehensibility.

Motivation for using a Genetic Algorithm (GA) in rule discovery
– A GA is essentially a search algorithm inspired by the principle of natural selection.
– In general, GAs tend to cope better with attribute-interaction problems than greedy rule induction algorithms.
– GAs perform a global search.
– GAs use stochastic search operators, which makes them more robust and less sensitive to noise.
– The execution of a GA can be regarded as a parallel search engine acting upon a population of candidate rules.

Motivation for using a Distributed Genetic Algorithm (DGA) in rule discovery
– The basic idea is to partition the population into several small, semi-isolated subpopulations.
– Each subpopulation is associated with an independent GA, possibly exploring a different promising region of the search space.
– Occasionally these subpopulations interact with one another through the exchange of a few individuals, simulating a seasonal migration process.
– The newly injected genetic material helps ensure that good genetic material is shared from time to time.
– This approach also helps to avoid premature convergence and restricts the occurrence of "illegal mating".

GA-Nuggets
– Overview: designed for the dependence modeling task.
– Individual encoding:
  • Genotype: fixed-length individual.
  • Phenotype: a rule with a variable number of attributes.
– Fitness function, with two parts:
  • Degree of interestingness:
    – an objective (information-theoretical) measure;
    – covers both antecedent and consequent interestingness.
  • Predictive accuracy.
– The fitness function:

  Fitness = (w1 * (AntInt + ConsInt) / 2 + w2 * PredAcc) / (w1 + w2)

  • AntInt: antecedent degree of interestingness.
  • ConsInt: consequent degree of interestingness.
  • PredAcc: predictive accuracy.
  • w1 and w2 are user-defined weights.
– Selection method:
  • Tournament selection (tournament size 2).
– Genetic operators:
  • Uniform crossover.
  • Mutation.
  • Condition insertion / removal operators.
    – These influence the size of the discovered prediction rule.
  • Consequent formation.
  • All operators guarantee the maintenance of valid genetic material.

DGA-Nuggets
– Fitness, selection and genetic operators: the same as in the single-population version.
– Subpopulations: a specific fitness function in each subpopulation (each searches for a different goal attribute).
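The fixed-genotype / variable-phenotype encoding, the fitness combination, and size-2 tournament selection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the gene layout and the names `ant_int`, `cons_int` and `pred_acc` (assumed to be computed elsewhere from the data) are my own.

```python
import random

def decode(genotype):
    """Decode a fixed-length genotype into a variable-length rule antecedent.

    Each gene is an (attribute, value, active) triple; inactive genes are
    skipped, so the phenotype rule has a variable number of conditions.
    """
    return [(attr, val) for (attr, val, active) in genotype if active]

def fitness(ant_int, cons_int, pred_acc, w1=1.0, w2=1.0):
    """GA-Nuggets-style fitness, mixing interestingness and accuracy:

    Fitness = (w1 * (AntInt + ConsInt) / 2 + w2 * PredAcc) / (w1 + w2)
    """
    return (w1 * (ant_int + cons_int) / 2 + w2 * pred_acc) / (w1 + w2)

def tournament_select(population, fitnesses, k=2, rng=random):
    """Pick k random contenders and return the fittest (tournament size k)."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]
```

For example, with equal weights w1 = w2 = 1, a rule with AntInt = 0.8, ConsInt = 0.6 and PredAcc = 0.9 gets fitness (0.7 + 0.9) / 2 = 0.8, so interestingness and accuracy contribute equally.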
– Number of subpopulations = number of possible goal attributes.
– Migration policy:
  • Migration takes place every m generations.
  • Each subpopulation sends its best individual, as judged by the "foreign" fitness function of the destination subpopulation.

Computational Results
– Datasets: obtained from the UCI repository of machine learning databases (http://www.ics.uci.edu/AI/MachineLearning.html).
– The data sets used are Zoo, Car Evaluation, Auto Imports and Nursery:
  • Zoo: 101 instances and 18 attributes.
  • Car Evaluation: 1728 instances and 6 attributes.
  • Auto Imports 85M: 205 instances and 26 categorical attributes.
  • Nursery: 12960 instances and 9 attributes.
– Summary of results:
  • Predictive accuracy:
    – DGA-Nuggets obtained somewhat better results than single-population GA-Nuggets.
    – In one case GA-Nuggets found rules with significantly higher predictive accuracy, whereas DGA-Nuggets significantly outperformed the single-population GA in six cases.
  • Degree of interestingness:
    – DGA-Nuggets obtained considerably better results than single-population GA-Nuggets.
    – DGA-Nuggets outperformed the latter in 22 out of 44 cases (considering all the discovered rules in all four data sets), whereas the reverse was true in just five out of 44 cases. In the remaining cases the difference between the two algorithms was not statistically significant.

Discussion
– Performance of the distributed GA:
  • Predictive accuracy: somewhat better.
  • Degree of interestingness: considerably better.
– Subdivision of the problem:
  • Using the subpopulations as an explicit way of subdividing the task resulted in a considerable decrease in the number of generations needed for convergence (the number of individuals evaluated was the same in both algorithms).
– Migration policy:
  • The cooperative approach helps to decrease the number of generations needed for convergence, but hinders the maintenance of diversity.
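The migration policy above (each subpopulation sends the resident that scores best under the destination's "foreign" fitness function) can be sketched as follows. This is a hypothetical sketch, not the authors' code: replacing the worst resident of the receiving subpopulation is one common injection policy, and the exact replacement rule in DGA-Nuggets may differ.

```python
def migrate(subpops, fitness_fns):
    """One migration round between semi-isolated subpopulations.

    subpops[i] is a list of individuals; fitness_fns[i] scores an
    individual with respect to goal attribute i, so fitness_fns[j] is
    the "foreign" fitness from subpopulation i's point of view (i != j).
    """
    # First decide all migrants, judging each by the destination's fitness.
    migrants = []
    for i, pop in enumerate(subpops):
        for j, foreign_fit in enumerate(fitness_fns):
            if j == i:
                continue
            best = max(pop, key=foreign_fit)
            migrants.append((j, best))
    # Then inject each migrant, replacing the destination's worst resident.
    for j, individual in migrants:
        worst = min(range(len(subpops[j])),
                    key=lambda k: fitness_fns[j](subpops[j][k]))
        subpops[j][worst] = individual
```

In a full run this round would be triggered once every m generations, with each subpopulation otherwise evolving independently under its own fitness function.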
Future Work
– Developing a new version of the distributed-population GA where each subpopulation is associated with a goal attribute value, rather than with a goal attribute as in the current distributed version.
– Comparing the performance of this future version with the performance of the current distributed version, in order to empirically determine the cost-effectiveness of these approaches.
– Extending the computational experiments reported in this paper to other data sets and other migration policies.