Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
DATA CLUSTERING WITH KERNAL K-MEANS++ Matt Strautmann, Dept. of Electrical and Computer Engineering PROJECT OBJECTIVES PROJECT GOAL Iris Plant Dataset Experimentally demonstrate the application of Kernel K-Means to non-linearly clusterable data sets ACADEMIC IMPORTANCE Expand the application of the Kernel K-Means clustering algorithm to non-traditional uses K-Means clustering aims to divide the dataset into clusters (“groups”) in which each data point belongs to the cluster with the nearest mean vector. o WHAT IS KERNAL K-MEANS? QuickTime™ and a decompressor are needed to see this picture. lans.ece.utexas.edu Two step process: data point assignment and update Kernel data-mapping was seen to solve the overlapping data sets by: • Partitioning the points with linear separators in the new space • Soft K-Means++ could not successfully cluster the Lung Cancer Dataset; results were for one cluster out of three successfully clustered • Soft K-Means++ clustered the two dimension, two cluster Gaussian dataset with only one error out of the one thousand data points CONCLUDING REMARKS WHAT IS THE PLUS PLUS INITIALIZATION SCHEME? The first mean vector is a randomly selected data point 3.) Cluster Centroid Becomes New Cluster Mean QuickTime™ and a d eco mpres sor are nee ded to s ee this picture. http://en.wikipedia.org/wiki/K-means_clustering http://en.wikipedia.org/wiki/K-means_clustering QuickTime™ and a d eco mpres sor are nee ded to s ee this picture. QuickTime™ and a d eco mpres sor are nee ded to s ee this picture. 4.) Step 2 and 3 Repeated until Convergence APPROACH Clustering Accuracy Average (over ten runs) Variance of Accuracy Calculation (over ten runs) 28.00% 8.218% 2.867% Lung Cancer Dataset 43.75% - - 2D2K Gaussian Dataset 99.00% - - 8D5K Gaussian Dataset 58.50% 2.082% 0.043% Clustering Accuracy Average (over ten runs) Standard Deviation of Accuracy Calculation (over ten runs) Iris Plant Dataset 57.00% 5.009% 2.238% Lung Cancer Dataset 62.00% 6.878% 0.473% 2D2K Gaussian Dataset 96.50% 1.677% 0.028% 8D5K Gaussian Dataset 76.31% 10.366% 1.075% The initialization was seen to be the most important factor in the algorithm converging • The “PLUS PLUS” cluster mean initialization was seen to improve the results • Kernel assignment works better than the maximum responsibility calculation of Soft K-Means • Kernel K-Means++ can handle small or large dimension datasets well; the increase of dimensionally seemed to be advantageous for the Lung Cancer Dataset (56 dimensions) over the lower clustering accuracy of the Iris Plant Dataset (4 dimensions) • Kernel K-Means++ produced superior results to Soft KMeans++ when clustering the Lung Cancer Dataset and demonstrated recognition of all three clusters Variance of Accuracy Calculation (over ten runs) FUTURE WORK RESULTS COMPARISON • Kernel K-Means++ clustering accuracy superior in all cases except the two dimensional, two cluster dataset. • Further improvement of the mean vector initialization is believed possible over the “PLUS PLUS” initialization • The clustering accuracy of the datasets increased by the following amounts: • Other options for the mean-squared error calculation for data point evaluation are possible • The time analysis of the algorithm must be calculate • Iris Plant: 104% Evaluate standard K-Means (Soft++) against 4 datasets to form benchmark • Lung Cancer: 38% Hybridize Soft K-Means++ with Kernel K-Means to form Kernel K-Means++ • 8D5K: 30% Test Kernel K-Means++ on small size, small dimension Gaussian, large dimension Gaussian, and large size datasets Standard Deviation of Accuracy Calculation (over ten runs) Iris Plant Dataset Kernel K-Means++ 2.) Voronoi Diagram Generated by the Means (data points associated with nearest cluster mean) http://en.wikipedia.org/wiki/K-means_clustering 1.) Initial Mean Orientations http://en.wikipedia.org/wiki/K-means_clustering Soft K-Means++ Quic kTime™ and a dec ompres sor are needed to see t his pic ture. • SOFT K-MEANS++ VS. KERNEL K-MEANS++ Each subsequent mean vector is created by evaluating randomly selected data points against a vector weighting probability • • QuickTime™ and a decompressor are needed to see this picture. Sum-of-squares algorithm • Kernel K-Means++ was found to cluster the test datasets in a superior manner over Soft K-Means++ • Mapping the data before clustering to a higherdimensional feature space using a nonlinear function 2 Dimension, 2 Cluster Dataset (Gaussian 2D2K) 2 Dimension, 2 Cluster Dataset (Gaussian 2D2K) o WHAT IS K-MEANS CLUSTERING? • • QuickTime™ and a decompressor are needed to see this picture. BACKGROUND o DISCUSSION lans.ece.utexas.edu o PROJECT DATASETS eleves.ens.fr o Dr. Donald C. Wunsch II, Dept. of Electrical and Computer Engineering • 2D2k: -2.5% 120 100 KERNEL++ SOFT++ 80 60 Acknowledgements 40 The author would like to acknowledge the expertise of Dr. Rui Xu in advising this project. 20 0 IRIS PLANT DATASET LUNG CANCER DATASET 2D2K GAUSSIAN DATASET 8D5K GAUSSIAN DATASET