Download Kernel K-Means++ to

DATA CLUSTERING WITH KERNAL K-MEANS++ Matt Strautmann, Dept. of Electrical and Computer Engineering PROJECT OBJECTIVES PROJECT GOAL Iris Plant Dataset Experimentally demonstrate the application of Kernel K-Means to non-linearly clusterable data sets ACADEMIC IMPORTANCE  Expand the application of the Kernel K-Means clustering algorithm to non-traditional uses  K-Means clustering aims to divide the dataset into clusters (“groups”) in which each data point belongs to the cluster with the nearest mean vector. o WHAT IS KERNAL K-MEANS? QuickTime™ and a decompressor are needed to see this picture. lans.ece.utexas.edu  Two step process: data point assignment and update Kernel data-mapping was seen to solve the overlapping data sets by: • Partitioning the points with linear separators in the new space • Soft K-Means++ could not successfully cluster the Lung Cancer Dataset; results were for one cluster out of three successfully clustered • Soft K-Means++ clustered the two dimension, two cluster Gaussian dataset with only one error out of the one thousand data points CONCLUDING REMARKS WHAT IS THE PLUS PLUS INITIALIZATION SCHEME?  The first mean vector is a randomly selected data point 3.) Cluster Centroid Becomes New Cluster Mean QuickTime™ and a d eco mpres sor are nee ded to s ee this picture. http://en.wikipedia.org/wiki/K-means_clustering http://en.wikipedia.org/wiki/K-means_clustering QuickTime™ and a d eco mpres sor are nee ded to s ee this picture. QuickTime™ and a d eco mpres sor are nee ded to s ee this picture. 4.) Step 2 and 3 Repeated until Convergence APPROACH Clustering Accuracy Average (over ten runs) Variance of Accuracy Calculation (over ten runs) 28.00% 8.218% 2.867% Lung Cancer Dataset 43.75% - - 2D2K Gaussian Dataset 99.00% - - 8D5K Gaussian Dataset 58.50% 2.082% 0.043% Clustering Accuracy Average (over ten runs) Standard Deviation of Accuracy Calculation (over ten runs) Iris Plant Dataset 57.00% 5.009% 2.238% Lung Cancer Dataset 62.00% 6.878% 0.473% 2D2K Gaussian Dataset 96.50% 1.677% 0.028% 8D5K Gaussian Dataset 76.31% 10.366% 1.075% The initialization was seen to be the most important factor in the algorithm converging • The “PLUS PLUS” cluster mean initialization was seen to improve the results • Kernel assignment works better than the maximum responsibility calculation of Soft K-Means • Kernel K-Means++ can handle small or large dimension datasets well; the increase of dimensionally seemed to be advantageous for the Lung Cancer Dataset (56 dimensions) over the lower clustering accuracy of the Iris Plant Dataset (4 dimensions) • Kernel K-Means++ produced superior results to Soft KMeans++ when clustering the Lung Cancer Dataset and demonstrated recognition of all three clusters Variance of Accuracy Calculation (over ten runs) FUTURE WORK RESULTS COMPARISON • Kernel K-Means++ clustering accuracy superior in all cases except the two dimensional, two cluster dataset. • Further improvement of the mean vector initialization is believed possible over the “PLUS PLUS” initialization • The clustering accuracy of the datasets increased by the following amounts: • Other options for the mean-squared error calculation for data point evaluation are possible • The time analysis of the algorithm must be calculate • Iris Plant: 104% Evaluate standard K-Means (Soft++) against 4 datasets to form benchmark • Lung Cancer: 38% Hybridize Soft K-Means++ with Kernel K-Means to form Kernel K-Means++ • 8D5K: 30% Test Kernel K-Means++ on small size, small dimension Gaussian, large dimension Gaussian, and large size datasets Standard Deviation of Accuracy Calculation (over ten runs) Iris Plant Dataset Kernel K-Means++ 2.) Voronoi Diagram Generated by the Means (data points associated with nearest cluster mean) http://en.wikipedia.org/wiki/K-means_clustering 1.) Initial Mean Orientations http://en.wikipedia.org/wiki/K-means_clustering Soft K-Means++ Quic kTime™ and a dec ompres sor are needed to see t his pic ture. • SOFT K-MEANS++ VS. KERNEL K-MEANS++  Each subsequent mean vector is created by evaluating randomly selected data points against a vector weighting probability • • QuickTime™ and a decompressor are needed to see this picture.  Sum-of-squares algorithm • Kernel K-Means++ was found to cluster the test datasets in a superior manner over Soft K-Means++ • Mapping the data before clustering to a higherdimensional feature space using a nonlinear function 2 Dimension, 2 Cluster Dataset (Gaussian 2D2K) 2 Dimension, 2 Cluster Dataset (Gaussian 2D2K) o WHAT IS K-MEANS CLUSTERING? • • QuickTime™ and a decompressor are needed to see this picture. BACKGROUND o DISCUSSION lans.ece.utexas.edu  o PROJECT DATASETS eleves.ens.fr o Dr. Donald C. Wunsch II, Dept. of Electrical and Computer Engineering • 2D2k: -2.5% 120 100 KERNEL++ SOFT++ 80 60 Acknowledgements 40 The author would like to acknowledge the expertise of Dr. Rui Xu in advising this project. 20 0 IRIS PLANT DATASET LUNG CANCER DATASET 2D2K GAUSSIAN DATASET 8D5K GAUSSIAN DATASET

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Kernel K-Means++ to