Download Clustering Genetic Algorithm

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Point mutation wikipedia , lookup

Genetic studies on Bulgarians wikipedia , lookup

Frameshift mutation wikipedia , lookup

Pharmacogenomics wikipedia , lookup

Medical genetics wikipedia , lookup

Genetic code wikipedia , lookup

History of genetic engineering wikipedia , lookup

Koinophilia wikipedia , lookup

Behavioural genetics wikipedia , lookup

Gene expression programming wikipedia , lookup

Heritability of IQ wikipedia , lookup

Genetic drift wikipedia , lookup

Genetic engineering wikipedia , lookup

Public health genomics wikipedia , lookup

Human genetic variation wikipedia , lookup

Genome (book) wikipedia , lookup

Microevolution wikipedia , lookup

Population genetics wikipedia , lookup

Genetic testing wikipedia , lookup

Genetic engineering in science fiction wikipedia , lookup

Transcript
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Clustering Genetic Algorithm
Petra Kudová
Department of Theoretical Computer Science
Institute of Computer Science
Academy of Sciences of the Czech Republic
ETID 2007
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Outline
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Motivation
Goals
study applicability of GAs to clustering
design genetic operators suitable for clustering
application to tasks with unknown number of clusters
compare to standard techniques
Clustering
partitioning of a data set into subsets - clusters, so that the
data in each subset share some common trait
often based on some similarity or distance measure
the notion of similarity is always problem-dependent.
wide range of algorithms (k-means, SOMs, etc.)
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Clustering
Definition of cluster
Basic idea: cluster groups together similar objects
More formally: clusters are connected regions of a
multi-dimensional space containing a relatively high
density of points, separated from other such regions by an
low density of points
Applications
Marketing - find groups of customers with similar behaviour
Biology - classify of plants/animals given their features
WWW - document classification, clustering weblog data to
discover groups of similar access patterns
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Genetic algorithms
Genetic algorithms
stochastic optimization technique
applicable on a wide range of problems
work with population of solutions - individuals
new populations produced by genetic operators
Genetic operators
selection - the better the solution is the higher probability to
be selected for reproduction
crossover - creates new individuals by combining old ones
mutation - random changes
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Clustering Genetic Algorithm (CGA)
Representation of the individual
1. approach (Hruschka, Campelo, Castro)
for each data point store cluster ID
long individuals (high space requirements)
2. approach (Maulik, Bandyopadhyay)
store centres of the clusters
need to assign data points to clusters before each fitness
evaluation
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Fitness
Normalization
partition the data set into clusters using the given individual
move the centres to the actual gravity centres
Fitness evaluation
fit(I) = −EVQ
clustering error:
EVQ =
K
X
||x~i − ~cf (xi ) ||2 ,
f (~xi ) = arg min ||~xi − ~ck ||2
k
i=1
silhouette function:
s(~x ) =
fit(i) =
PN
~
i=1 s(xi )
b(~x ) − a(~x )
max{b(~x ), a(~x )}
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Crossover
One-point Crossover
exchange the whole blocks (i.e. centres)
Combining Crossover
match the centres and combine them
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Mutation
One-point mutation, Biased one-point mutation
One-point Mutation:
~cnew = ~xi , where i ← random(1, N)
Bias one-point Mutation:
~ where ∆
~ is a random small vector
~cnew = ~cold + ∆,
K-means mutation
several steps of k-means clustering
Cluster addition, Cluster removal
Cluster Addition – adds one centre
Cluster Removal – removes randomly selected centre
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Experiments
Goals
demonstrate the performance of CGA
compare variants of genetic operators
Data Sets
0.9
data
0.8
0.7
0.6
25 centres
0.5
0.4
0.3
0.2
0.1
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
vowels (UCI machine learning repository)
11 kinds of vowels, dimension 9
990 examples
mushrooms (UCI machine learning repository)
23 kinds of mushrooms, dimension 125
8124 examples
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Operators Comparison
Mutation
1-point
Biased 1-point
K-means
1-point + Biased 1-pt
1-point + K-means
All
Crossover
1-point
Combining
Both
25clusters
0.20
0.25
0.26
0.21
0.21
0.22
25clusters
0.201
0.222
0.202
Vowels
927.7
927.3
940.7
927.3
927.6
927.3
Vowels
927.7
927.4
927.4
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Convergence Rate – Mutation
Mutation Comparison
-900
1-point
biased 1-point
k-means
1-point + k-means
-950
-1000
fitness
-1050
-1100
-1150
-1200
-1250
0
5
10
15
iteration
20
ETID’2007
25
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Convergence Rate – Crossover
Crossover Comparison
-900
1-point
combining
both
-950
-1000
fitness
-1050
-1100
-1150
-1200
-1250
0
2
4
6
8
iteration
10
12
14
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Comparison to other clustering algorithms
method
k-means
CLARA
CGA
HCA
Mushroom data set
accuracy
95.8%
96.8%
97.3%
99.2%
25 centers
CGA
k-means
0.9
0.9
data
centers
data
centers
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Estimating the number of clusters
Initial population: 2 to 15 centres
26
# centers
24
22
20
18
16
14
12
0
10
20
30
40
50
60
70
80
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Estimating the number of clusters
Initial population: 10 to 30 centres
25.5
# centers
25
24.5
24
23.5
23
22.5
22
21.5
21
2
4
6
8
10
12
14
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Conclusion
Summary
Clustering Genetic Algorithm proposed
several genetic operators proposed and compared
CGA compared to available clustering algorithms
estimating the number of clusters tested
Future work
application of CGA to large data sets
reducing time requirements, lazy evaluations, etc.
applications
ETID’2007
Introduction
Clustering Genetic Algorithm
Experimental results
Conclusion
Thank you.
Any questions?
ETID’2007