* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Clustering Genetic Algorithm
Point mutation wikipedia , lookup
Genetic studies on Bulgarians wikipedia , lookup
Frameshift mutation wikipedia , lookup
Pharmacogenomics wikipedia , lookup
Medical genetics wikipedia , lookup
Genetic code wikipedia , lookup
History of genetic engineering wikipedia , lookup
Koinophilia wikipedia , lookup
Behavioural genetics wikipedia , lookup
Gene expression programming wikipedia , lookup
Heritability of IQ wikipedia , lookup
Genetic drift wikipedia , lookup
Genetic engineering wikipedia , lookup
Public health genomics wikipedia , lookup
Human genetic variation wikipedia , lookup
Genome (book) wikipedia , lookup
Microevolution wikipedia , lookup
Population genetics wikipedia , lookup
Introduction Clustering Genetic Algorithm Experimental results Conclusion Clustering Genetic Algorithm Petra Kudová Department of Theoretical Computer Science Institute of Computer Science Academy of Sciences of the Czech Republic ETID 2007 ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Outline Introduction Clustering Genetic Algorithm Experimental results Conclusion ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Motivation Goals study applicability of GAs to clustering design genetic operators suitable for clustering application to tasks with unknown number of clusters compare to standard techniques Clustering partitioning of a data set into subsets - clusters, so that the data in each subset share some common trait often based on some similarity or distance measure the notion of similarity is always problem-dependent. wide range of algorithms (k-means, SOMs, etc.) ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Clustering Definition of cluster Basic idea: cluster groups together similar objects More formally: clusters are connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by an low density of points Applications Marketing - find groups of customers with similar behaviour Biology - classify of plants/animals given their features WWW - document classification, clustering weblog data to discover groups of similar access patterns ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Genetic algorithms Genetic algorithms stochastic optimization technique applicable on a wide range of problems work with population of solutions - individuals new populations produced by genetic operators Genetic operators selection - the better the solution is the higher probability to be selected for reproduction crossover - creates new individuals by combining old ones mutation - random changes ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Clustering Genetic Algorithm (CGA) Representation of the individual 1. approach (Hruschka, Campelo, Castro) for each data point store cluster ID long individuals (high space requirements) 2. approach (Maulik, Bandyopadhyay) store centres of the clusters need to assign data points to clusters before each fitness evaluation ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Fitness Normalization partition the data set into clusters using the given individual move the centres to the actual gravity centres Fitness evaluation fit(I) = −EVQ clustering error: EVQ = K X ||x~i − ~cf (xi ) ||2 , f (~xi ) = arg min ||~xi − ~ck ||2 k i=1 silhouette function: s(~x ) = fit(i) = PN ~ i=1 s(xi ) b(~x ) − a(~x ) max{b(~x ), a(~x )} ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Crossover One-point Crossover exchange the whole blocks (i.e. centres) Combining Crossover match the centres and combine them ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Mutation One-point mutation, Biased one-point mutation One-point Mutation: ~cnew = ~xi , where i ← random(1, N) Bias one-point Mutation: ~ where ∆ ~ is a random small vector ~cnew = ~cold + ∆, K-means mutation several steps of k-means clustering Cluster addition, Cluster removal Cluster Addition – adds one centre Cluster Removal – removes randomly selected centre ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Experiments Goals demonstrate the performance of CGA compare variants of genetic operators Data Sets 0.9 data 0.8 0.7 0.6 25 centres 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 vowels (UCI machine learning repository) 11 kinds of vowels, dimension 9 990 examples mushrooms (UCI machine learning repository) 23 kinds of mushrooms, dimension 125 8124 examples ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Operators Comparison Mutation 1-point Biased 1-point K-means 1-point + Biased 1-pt 1-point + K-means All Crossover 1-point Combining Both 25clusters 0.20 0.25 0.26 0.21 0.21 0.22 25clusters 0.201 0.222 0.202 Vowels 927.7 927.3 940.7 927.3 927.6 927.3 Vowels 927.7 927.4 927.4 ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Convergence Rate – Mutation Mutation Comparison -900 1-point biased 1-point k-means 1-point + k-means -950 -1000 fitness -1050 -1100 -1150 -1200 -1250 0 5 10 15 iteration 20 ETID’2007 25 Introduction Clustering Genetic Algorithm Experimental results Conclusion Convergence Rate – Crossover Crossover Comparison -900 1-point combining both -950 -1000 fitness -1050 -1100 -1150 -1200 -1250 0 2 4 6 8 iteration 10 12 14 ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Comparison to other clustering algorithms method k-means CLARA CGA HCA Mushroom data set accuracy 95.8% 96.8% 97.3% 99.2% 25 centers CGA k-means 0.9 0.9 data centers data centers 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Estimating the number of clusters Initial population: 2 to 15 centres 26 # centers 24 22 20 18 16 14 12 0 10 20 30 40 50 60 70 80 ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Estimating the number of clusters Initial population: 10 to 30 centres 25.5 # centers 25 24.5 24 23.5 23 22.5 22 21.5 21 2 4 6 8 10 12 14 ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Conclusion Summary Clustering Genetic Algorithm proposed several genetic operators proposed and compared CGA compared to available clustering algorithms estimating the number of clusters tested Future work application of CGA to large data sets reducing time requirements, lazy evaluations, etc. applications ETID’2007 Introduction Clustering Genetic Algorithm Experimental results Conclusion Thank you. Any questions? ETID’2007