Clustering of Dynamic Data

Introduction

Data clustering is a field of active research in machine learning and data mining. Most of the work has focused on static data sets; there has been little work on clustering of dynamic data. We define a dynamic data set as a set of elements whose parameters change over time. A flock of flying birds is an example of a dynamic data set. We are interested in exploring algorithms that are capable of finding relationships amongst the elements in a dynamic data set. In this paper we evaluate the use of data clustering techniques developed for static data sets on dynamic data.

Hypothesis

Traditional clustering algorithms used in data mining will not perform well on dynamic data sets. A clustering algorithm must consider the elements’ history in order to efficiently and effectively find clusters in dynamic data.

Experiment

- Characterize a set of traditional clustering algorithms against dynamic data sets.
  o Data sets:
      - Swarm style data
      - Traffic style data
      - Ant Colony data ????
- Augment traditional clustering to make the distance measure a function of the elements’ history.
  o Use a moving average of each attribute.
  o Use past cluster labels in the distance measure.
  o Start partitioning algorithms with the output from the last time interval.
- Characterize augmented algorithms against dynamic data sets.
  o Metrics:
      - Timing
      - Accuracy
      - Consistency: the cluster label is consistent over time (the label does not thrash; an element stays with the “core” group)

Clustering Algorithms

In general, clustering algorithms can be divided into two categories: hierarchical and partitioning. Hierarchical algorithms build clusters gradually. An agglomerative hierarchical algorithm starts with each element in its own cluster. The clusters are iteratively combined to form a dendrogram. Divisive algorithms start with all elements in one cluster and create the dendrogram from the top down.
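
The agglomerative procedure described above can be illustrated with a minimal sketch. This is not the implementation planned for the experiments; the function name `agglomerate`, the Euclidean distance, and the default choice of `min` as the rule for combining pairwise distances are all assumptions made for illustration. The returned list of merges is a flat encoding of the dendrogram.

```python
from itertools import combinations
from math import dist


def agglomerate(points, linkage=min):
    """Merge clusters bottom-up and return the merge history.

    `linkage` reduces the pairwise distances between two clusters to one
    number. The default `min` is an assumption made for this sketch.
    Each entry of the result is (cluster_id_a, cluster_id_b, distance).
    """
    # Start with every element in its own cluster, keyed by a cluster id.
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest inter-cluster distance.
        best = None
        for a, b in combinations(clusters, 2):
            d = linkage(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two clusters with their union under a fresh id.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (5, 0), (5, 2)]
    for a, b, d in agglomerate(sample):
        print(f"merge {a} and {b} at distance {d:.3f}")
```

Passing `linkage=max` instead changes only how the distance between two clusters is measured; the merge loop itself stays the same.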
Partitioning algorithms create clusters directly by optimizing a function (locally or globally) without creating a structure like the dendrogram. Partitioning algorithms typically run faster than hierarchical algorithms, but need the number of clusters to be specified in advance. Partitioning algorithms can be further divided into relocation methods and density-based methods. Relocation methods work by minimizing a cost function by iteratively relocating elements to clusters. Density-based methods attempt to cluster densely connected elements.

We plan to implement a representative set of clustering algorithms and evaluate their performance on dynamic data sets.

From the hierarchical clustering category we plan to implement a single-link and a complete-link algorithm. These two algorithms differ in the way they measure distance between clusters. The single-link algorithm measures the distance between two clusters A and B as the minimum distance between any member of A and any member of B. The complete-link algorithm measures the distance between clusters A and B as the maximum distance between any member of A and any member of B. The complete-link algorithm can find compact clusters, while the single-link algorithm finds elongated clusters.

From the partitioning-relocation category we plan to implement k-means, k-medoids, and Expectation Maximization (EM). The k-means and k-medoids algorithms work by randomly assigning elements to k clusters and then iteratively reassigning elements to the closest cluster and re-computing the cluster’s parameters. They differ in how the clusters are represented. K-means represents a cluster as the centroid of its members. The k-medoids algorithm selects a member of the cluster to represent the cluster. The EM algorithm attempts to determine the distribution of the elements in clusters.
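
The k-means loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned implementation: the names `kmeans` and `centroid`, the Euclidean distance, the `seed` and `max_iter` parameters, and the rule that an empty cluster keeps its old centroid are all hypothetical choices. The optional `init` argument shows one way to express the idea, listed in the Experiment section, of starting a partitioning algorithm from the output of the last time interval.

```python
import random
from math import dist


def centroid(members):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Return (labels, centroids) for a plain k-means run.

    `init` is an optional list of k starting centroids, for example the
    centroids found at the previous time step (a warm start).
    """
    if init is None:
        # Random initial assignment; round-robin over a shuffled order keeps
        # every cluster non-empty as long as len(points) >= k.
        order = list(range(len(points)))
        random.Random(seed).shuffle(order)
        start = [0] * len(points)
        for rank, idx in enumerate(order):
            start[idx] = rank % k
        centroids = [centroid([p for p, l in zip(points, start) if l == c])
                     for c in range(k)]
    else:
        centroids = [tuple(c) for c in init]

    labels = None
    for _ in range(max_iter):
        # Reassign each element to its closest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Re-compute each centroid; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = centroid(members)
    return labels, centroids


if __name__ == "__main__":
    t0 = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels0, cents0 = kmeans(t0, k=2)
    # Next time step: every element has drifted slightly.
    t1 = [(x + 0.5, y + 0.5) for x, y in t0]
    labels1, _ = kmeans(t1, k=2, init=cents0)
    print(labels0, labels1)
```

Because the second call starts from the previous centroids, cluster index 0 at one time step refers to the same group at the next, which is relevant to the label-consistency metric. A k-medoids variant would replace `centroid` with a rule that picks an actual member of each cluster as its representative.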
EM iteratively reassigns elements to the clusters that maximize their probability of membership and then re-estimates the distribution of each cluster.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and DENCLUE (DENsity-based CLUstEring) will be implemented to represent density-based partitioning algorithms. DBSCAN creates clusters from highly connected elements, while DENCLUE clusters elements in highly populated areas. Both algorithms handle outliers well and will not include them in any cluster.

Bibliography

P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264-323, Sept. 1999.