A Genetic Algorithm Approach to K-Means Clustering
Craig Stanek
CS401
November 17, 2004

What Is Clustering?
“Partitioning the data being mined into several groups (or clusters) of data instances, in such a way that: a) each cluster has instances that are very similar (or “near”) to each other, and b) the instances in each cluster are very different (or “far away”) from the instances in the other clusters.”
--Alex A. Freitas, “Data Mining and Knowledge Discovery with Evolutionary Algorithms”

Why Cluster? Segmentation and Differentiation
[Figure: Age vs. Income scatter plot]

Why Cluster? Outlier Detection
[Figure: Age vs. Income scatter plot with one extreme-income instance]

Why Cluster? Classification
[Figure: Petal Width vs. Petal Length scatter plot]

K-Means Clustering
1) Specify K clusters
2) Randomly initialize K “centroids”
3) Assign each data instance to the closest cluster according to its distance from each centroid
4) Recalculate the cluster centroids
5) Repeat steps (3) and (4) until no data instance moves to a different cluster

Drawbacks of the K-Means Algorithm
• Converges to a local rather than a global optimum
• Sensitive to the initial choice of centroids
• K must be chosen a priori
• Minimizes intra-cluster distance but does not consider inter-cluster distance

Problem Statement
• Can a genetic algorithm (GA) approach do better than the standard k-means algorithm?
• Is there an alternative fitness measure that takes into account both intra-cluster similarity and inter-cluster differentiation?
• Can a GA be used to find the optimum number of clusters for a given data set?

Representation of Individuals
• Randomly generated number of clusters
• Medoid-based integer string (each gene is a distinct data instance)
Example: 58 244 23 162 113
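The five-step k-means procedure listed earlier can be sketched in plain Python. This is an illustrative sketch on 2-D points, not the author's code; the function name and toy data are assumptions.

```python
# Minimal sketch of the five-step k-means loop from the slide (illustrative).
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: randomly initialize K "centroids" (here: K distinct data points).
    centroids = rng.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign each instance to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop when no instance moves to a different cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recalculate each centroid as the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids
```

Because the initial centroids are random (step 2), different seeds can converge to different local optima, which is exactly the drawback the slides call out.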
Genetic Algorithm Approach: Why Medoids?
[Figure: three Age vs. Income scatter plots marking the instances, the medoid, and the centroid; with an extreme outlier present the centroid is pulled far from the data, while the medoid remains an actual data instance]

Recombination
Parent #1: 36 108 82
Parent #2: 5 80 147 82 108
Child #1: 36 108 147 82 6
Child #2: 5 80 82 6

Fitness Function
Let r_ij represent the jth data instance of the ith cluster, and let M_i be the medoid of the ith cluster.

Let X = Σ_{i=1..K} Σ_{j=1..n_i} Dist(r_ij, M_i)

Let Y = Σ_{i=1..K} Σ_{j=1..n_i} Σ_{k=1..K, k≠i} Dist(r_ij, M_k) / (K − 1)

Fitness = Y / X

Experimental Setup
Iris Plant Data (UCI Repository)
• 150 data instances
• 4 dimensions
• Known classifications: 3 classes, 50 instances of each

Experimental Setup: Iris Data Set
[Figure: Sepal Width vs. Sepal Length scatter plot, colored by class (Iris-setosa, Iris-versicolor, Iris-virginica)]

Experimental Setup: Iris Data Set
[Figure: Petal Width vs. Petal Length scatter plot, colored by class (Iris-setosa, Iris-versicolor, Iris-virginica)]

Standard K-Means vs. Medoid-Based EA

                 K-Means   EA
Total Trials     30        30
Avg. Correct     120.1     134.9
Avg. % Correct   80.1%     89.9%
Min. Correct     77        133
Max. Correct     134       135
Avg. Fitness     78.94     84.00

Standard K-Means Clustering: Iris Data Set
[Figure: Petal Width vs. Petal Length plots comparing the true classes with the three clusters found by standard k-means]

Medoid-Based EA: Iris Data Set
[Figure: Petal Width vs. Petal Length plots comparing the true classes with the three clusters found by the medoid-based EA]
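The fitness measure above (X = total distance of each instance to its own medoid, Y = average distance to the other K−1 medoids, fitness = Y/X, higher is better) can be sketched as follows. The function name and the cluster/medoid data layout are assumptions for illustration.

```python
# Sketch of the proposed fitness: Y / X, where X is intra-cluster distance
# to each instance's own medoid and Y averages distance to the other
# clusters' medoids. Illustrative reading of the slide, not the author's code.
import math

def fitness(clusters, medoids):
    """clusters: list of K lists of points; medoids[i] is cluster i's medoid."""
    K = len(clusters)
    # X = sum over i, j of Dist(r_ij, M_i)
    X = sum(math.dist(r, medoids[i]) for i in range(K) for r in clusters[i])
    # Y = sum over i, j, k != i of Dist(r_ij, M_k) / (K - 1)
    Y = sum(
        math.dist(r, medoids[k]) / (K - 1)
        for i in range(K) for r in clusters[i]
        for k in range(K) if k != i
    )
    return Y / X
```

A good clustering makes X small (tight clusters) and Y large (well-separated clusters), so maximizing Y/X rewards both properties at once, unlike standard k-means, which only minimizes the intra-cluster term.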
Standard Fitness EA vs. Proposed Fitness EA

                  Standard   Proposed
Total Trials      30         30
Avg. Correct      134.9      134.0
Avg. % Correct    89.9%      89.3%
Min. Correct      133        134
Max. Correct      135        134
Avg. Generations  82.7       24.9

Fixed vs. Variable Number of Clusters EA

                    Fixed #   Variable #
Total Trials        30        30
Avg. Correct        134.0     134.0
Avg. % Correct      89.3%     89.3%
Min. Correct        134       134
Max. Correct        134       134
Avg. # of Clusters  3         7

Variable Number of Clusters EA: Iris Data Set
[Figure: Petal Width vs. Petal Length plots comparing the true classes with the seven clusters (Cluster 1 through Cluster 7) found by the variable-K EA]

Conclusions
• The GA is better at obtaining a globally optimal solution
• The proposed fitness function shows promise
• Difficulty letting the GA determine the “correct” number of clusters on its own

Future Work
• Other data sets
• Alternative fitness functions
• Scalability
• Comparison of the GA to simulated annealing
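The recombination slide earlier pairs two variable-length medoid strings to produce two children. One plausible reading is a cut-and-splice crossover that skips duplicate genes so medoids stay distinct; the exact operator is not fully recoverable from the slide, so the sketch below is a hypothetical illustration, not the author's operator.

```python
# Hypothetical cut-and-splice crossover for variable-length medoid strings.
# Each individual is a list of distinct data-instance indices (medoids).
# Illustrative reading of the recombination slide, not the author's operator.
import random

def crossover(parent1, parent2, rng):
    cut1 = rng.randint(1, len(parent1) - 1)  # keep at least one gene per side
    cut2 = rng.randint(1, len(parent2) - 1)

    def splice(head, tail):
        child = list(head)
        for gene in tail:
            if gene not in child:  # medoids must remain distinct instances
                child.append(gene)
        return child

    return (splice(parent1[:cut1], parent2[cut2:]),
            splice(parent2[:cut2], parent1[cut1:]))

# Example with the parents from the recombination slide:
rng = random.Random(3)
c1, c2 = crossover([36, 108, 82], [5, 80, 147, 82, 108], rng)
```

Because the cut points on the two parents are chosen independently, the children can differ in length from both parents, which is what lets the variable-K version of the EA explore different numbers of clusters.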