Clustering of Dynamic Data
Introduction
Data clustering is an area of active research in machine learning and data mining. Most of
this work has focused on static data sets; there has been little work on clustering
dynamic data. We define a dynamic data set as a set of elements whose parameters
change over time; a flock of flying birds is one example. We are
interested in exploring algorithms that are capable of finding relationships among the
elements of a dynamic data set. In this paper we evaluate the use of data clustering
techniques developed for static data sets on dynamic data.
Hypothesis
Traditional clustering algorithms used in data mining will not perform well on dynamic
data sets. A clustering algorithm must consider the elements’ history in order to
efficiently and effectively find clusters in dynamic data.
Experiment
- Characterize a set of traditional clustering algorithms against dynamic data sets.
  o Data sets:
    - Swarm-style data
    - Traffic-style data
    - Ant colony data
    - ????
- Augment traditional clustering so that the distance measure is a function of the
  elements' history (see the sketch after this list).
  o Use a moving average of each attribute.
  o Use past cluster labels in the distance measure.
  o Start partitioning algorithms with the output from the last time interval.
- Characterize the augmented algorithms against dynamic data sets.
  o Metrics:
    - Timing
    - Accuracy
    - Consistency: an element's cluster label is stable over time (the label does
      not thrash; the element stays with its "core" group).
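To make the moving-average idea concrete, the following sketch (in Python) shows one way a distance measure could operate on an exponentially weighted moving average of each element's attributes rather than on the raw values. The class, the alpha parameter, and the use of Euclidean distance are illustrative assumptions, not a settled design.

    import numpy as np

    class HistoryAwareDistance:
        """Distance over smoothed attributes: each element's position is
        replaced by an exponentially weighted moving average of its history."""

        def __init__(self, alpha=0.3):
            self.alpha = alpha   # weight given to the newest observation
            self.smoothed = {}   # element id -> smoothed attribute vector

        def update(self, element_id, attributes):
            # Fold the current attributes into the element's moving average.
            attributes = np.asarray(attributes, dtype=float)
            prev = self.smoothed.get(element_id, attributes)
            self.smoothed[element_id] = self.alpha * attributes + (1 - self.alpha) * prev

        def distance(self, id_a, id_b):
            # Euclidean distance between the two elements' smoothed attributes.
            return np.linalg.norm(self.smoothed[id_a] - self.smoothed[id_b])

At each time interval, update() would be called once per element before clustering, so that a momentarily erratic element is still measured as close to the group it has historically traveled with.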
Clustering Algorithms
In general, clustering algorithms can be divided into two categories: hierarchical and
partitioning. Hierarchical algorithms build clusters gradually. An agglomerative
hierarchical algorithm starts with each element in its own cluster and iteratively merges
clusters to form a dendrogram. Divisive algorithms start with all elements in one cluster
and build the dendrogram from the top down. Partitioning algorithms create clusters
directly by optimizing a function (locally or globally) without creating a structure like
the dendrogram. Partitioning algorithms typically run faster than hierarchical
algorithms, but they require the number of clusters to be specified in advance.
Partitioning algorithms can be further divided into relocation methods and density-based
methods. Relocation methods minimize a cost function by iteratively relocating elements
among clusters. Density-based methods attempt to cluster densely connected elements.
We plan to implement a representative set of clustering algorithms and evaluate their
performance on dynamic data sets. From the hierarchical category we plan to implement
a single-link and a complete-link algorithm. These two algorithms differ in how they
measure the distance between clusters. The single-link algorithm measures the distance
between clusters A and B as the minimum distance between any member of A and any
member of B; the complete-link algorithm uses the maximum such distance instead. The
complete-link algorithm tends to find compact clusters, while the single-link algorithm
finds elongated clusters.
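As a concrete illustration, the two linkage criteria reduce to a minimum or a maximum over pairwise distances. The Python sketch below assumes Euclidean distance and represents a cluster as a list of points; the function names are our own.

    import numpy as np

    def euclidean(a, b):
        # Euclidean distance between two points.
        return np.linalg.norm(np.asarray(a) - np.asarray(b))

    def single_link(cluster_a, cluster_b):
        # Minimum distance between any member of A and any member of B.
        return min(euclidean(a, b) for a in cluster_a for b in cluster_b)

    def complete_link(cluster_a, cluster_b):
        # Maximum distance between any member of A and any member of B.
        return max(euclidean(a, b) for a in cluster_a for b in cluster_b)

An agglomerative algorithm repeatedly merges the pair of clusters with the smallest inter-cluster distance under the chosen criterion.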
From the partitioning-relocation category we plan to implement k-means, k-medoids, and
Expectation Maximization (EM). The k-means and k-medoids algorithms work by
randomly assigning elements to k clusters and then iteratively reassigning elements to the
closest cluster and recomputing the clusters' parameters. They differ in how the clusters
are represented: k-means represents a cluster as the centroid of its members, while
k-medoids selects a member of the cluster as its representative. The EM algorithm
attempts to determine the distribution of the elements in each cluster: it iteratively
reassigns elements to the clusters that maximize their probability of membership and
then re-estimates the distribution of each cluster.
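The assign-and-update loop shared by these relocation methods is short enough to sketch. The following is a minimal k-means in Python (NumPy); the initialization and convergence test are illustrative choices. Note that seeding the initial centroids from the previous time interval's output, via the init parameter below, is exactly the warm-start augmentation proposed in the Experiment section.

    import numpy as np

    def kmeans(points, k, iterations=100, init=None, seed=0):
        # Minimal k-means: assign each point to its nearest centroid,
        # then recompute each centroid as the mean of its members.
        rng = np.random.default_rng(seed)
        centroids = (np.asarray(init) if init is not None
                     else points[rng.choice(len(points), size=k, replace=False)])
        for _ in range(iterations):
            # Assignment step: index of the nearest centroid for each point.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each centroid to the mean of its members
            # (a centroid with no members is left where it is).
            new_centroids = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

k-medoids replaces the mean in the update step with the member that minimizes the total distance to the rest of the cluster, and EM replaces the hard assignment with membership probabilities under the estimated distributions.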
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and DENCLUE
(DENsity-based CLUstEring) will be implemented to represent the density-based
partitioning algorithms. DBSCAN creates clusters from highly connected elements, while
DENCLUE clusters elements in highly populated areas. Both algorithms handle outliers
well and will not include them in any cluster.
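This outlier handling is easy to see in practice. As a reference point (our plan is to implement the algorithms ourselves), scikit-learn's DBSCAN implementation labels noise points -1; the eps and min_samples values below are illustrative, not tuned.

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    # Two dense blobs plus one far-away point.
    points = np.concatenate([
        rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
        rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
        [[100.0, 100.0]],   # outlier
    ])

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)
    print(labels)   # the outlier is labeled -1 and belongs to no cluster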