COMPARISON OF CLUSTERING ALGORITHMS:
PARTITIONAL AND HIERARCHICAL
Principal Investigator
Dr. Sanjay Ranka
Professor
Department of Computer Science, University of Florida
Teaching Assistant
Manas Somaiya
Authors
Joyesh Mishra, Gnana Sundar Rajendiran, Vasanth Prabhu Sundararaj
Department of Computer Science,
University of Florida
Gainesville
www.cise.ufl.edu
Final Report

December 2007
TABLE OF CONTENTS
I. ABSTRACT
II. DETAILED REPORT
1. K-Means Partitional clustering
1.1 Characteristics of K - means
1.2 Algorithm
1.3 Observations
2. Agglomerative Hierarchical Clustering
2.1 Definition
2.2 Algorithms implemented in this Project
2.3 Datasets and Experiments
3. DBSCAN (Using KD Trees)
3.1 DBSCAN Algorithm
3.2 DBSCAN Performance Enhancements Using KD Trees
3.3 Observations regarding DBSCAN Issues
4. CURE – Hierarchical Clustering (Using KD Trees)
4.1 CURE Hierarchical Clustering Algorithm
4.2 CURE Overview
4.3 CURE - Data Structures Used
4.4 Benefits of CURE against Other Algorithms
4.5 Observations towards Sensitivity to Parameters
III. CONCLUSION
IV. REFERENCES
LIST OF FIGURES
Figure 1 K – means Initial K clusters
Figure 2 K – means Clusters getting rearranged by computing new centroids
Figure 3 K – means Converged clusters
Figure 4 Union By Rank
Figure 5 SPAETH dataset
Figure 6 Agglomerative Clusters After 28000 iterations
Figure 7 Agglomerative Clusters After 64000 iterations
Figure 8 Agglomerative Clusters After 65388 iterations
Figure 9 Agglomerative Non – globular clusters
Figure 10 CURE Non – globular clusters
Figure 11 Complete Link
Figure 12 Complete Link clusters After 2000 iterations
Figure 13 Complete Link clusters After 2012 iterations
Figure 14 CURE clusters
Figure 15 DBSCAN performance measurements
Figure 16 Partitioning results
I. ABSTRACT
Clustering is an important area of data mining, useful for discovering groups
and identifying interesting distributions in the underlying data.
This project analyzes and compares partitional and hierarchical clustering
algorithms, namely DBSCAN and k-means (partitional) against Agglomerative and CURE
(hierarchical). The comparison is based on the extent to which each of these algorithms
identifies the clusters, their pros and cons, and the time each algorithm takes to identify the
clusters present in the dataset. For each clustering algorithm, computation time was measured
as the size of the data set increased. This was used to test the scalability of the algorithm and
whether it could be partitioned and executed concurrently on several machines.
k-means is a partitional clustering technique that identifies k clusters from a given
set of n data points in d-dimensional space. It starts with k random centers and a single cluster,
and refines the clustering at each step until it arrives at k clusters. The time complexity of our
k-means implementation is O(I * k * d * n), where I is the number of iterations. Using the KD-Tree
data structure in the implementation could further reduce the complexity to
O(I * k * d * log(n)).
DBSCAN discovers clusters of arbitrary shape, relying on a density-based notion of
clusters. Given Eps as an input parameter, and unlike k-means clustering, it tries to find all
possible clusters by classifying each point as core, border or noise. DBSCAN can be expensive
because computing nearest neighbors requires computing all pairwise proximities. Our
implementation additionally uses KD-Trees to store the data, which allows efficient retrieval of
data and brings the time complexity down from O(m^2) to O(m log m).
Agglomerative Hierarchical Clustering is one of the non-parametric approaches to
clustering, based on measures of the dissimilarity among the current set of clusters in
each iteration. In general, we start with the points as individual clusters and at each step
merge the closest pair of clusters, using a defined notion of cluster proximity. We implement
two algorithms, namely Single-Linkage Clustering and Complete-Linkage Clustering. We
analyze the advantages and drawbacks of Agglomerative Hierarchical Clustering by
comparing it with the other algorithms: CURE, DBSCAN and K-Means.
The CURE clustering algorithm helps attain scalability for clustering in large databases
without sacrificing the quality of the generated clusters. The algorithm uses KD-Trees and Min
Heaps for efficient data analysis and repetitive clustering. Random sampling, partitioning of
clusters and two-pass merging help scale the algorithm to large datasets. Our
implementation provides a comparative study of CURE against the other partitional and
hierarchical algorithms.
II. DETAILED REPORT
1. K-Means Partitional clustering
Clustering based on k-means is closely related to a number of other clustering and
location problems. These include the Euclidean k-medians problem, in which the objective is to
minimize the sum of distances to the nearest center, and the geometric k-center problem, in which
the objective is to minimize the maximum distance from every point to its closest center. There
are no efficient exact solutions known to any of these problems and some formulations are NP-hard.
The large constant factors of the known approximation algorithms suggest that they are not good
candidates for practical implementation.
One of the most popular heuristics for solving the k-means problem is based on a simple iterative
scheme for finding a locally minimal solution. This algorithm is often called the k-means
algorithm.
1.1 Characteristics of K - means
a. It is a prototype-based clustering technique. It can only be applied to data that has the
notion of a centre.
b. The algorithm has a time complexity of O(I * K * m * n), where I is the number of
iterations, K is the number of clusters, m is the number of dimensions and n is the number
of points.
c. Using KD Trees, the overall time complexity reduces to O(n * log n). A KD Tree is a data
structure that groups nearby points, so that points likely to belong to the same cluster can
be handled together when deciding which centroid they are assigned to.
1.2 Algorithm
a. Select K initial centroids
b. Repeat
• For each point, find its closest centroid and assign the point to that centroid. This
results in the formation of K clusters
• Recompute the centroid of each cluster
Until the centroids do not change
In the first step, points are assigned to the initial centroids, which are all in the larger
group of points. After points are assigned to a centroid, the centroid is then updated. In the
second step, points are assigned to the updated centroids, and the centroids are updated again.
When the k-means algorithm terminates, the centroids have identified the natural groupings
of points. For some combinations of proximity functions and types of centroids, k-means always
converges to a solution, i.e., k-means reaches a state in which no points are shifting from one
cluster to another and hence the centroids do not change.
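The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the LabVIEW implementation used in this project: it assumes Euclidean distance, mean centroids, and initial centroids drawn at random from the input points, and the names `kmeans`, `max_iter` and `seed` are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step a: pick K distinct input points as the initial centroids (an assumption).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step b.1: assign every point to its closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step b.2: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(d) / len(members) for d in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

Each pass over the data costs one distance computation per point per centroid, which is where the I * K * n factor in the complexity comes from.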
1.3 Observations
The dataset used for running the k-means algorithm is a 2D array of (x, y) points obtained from
SPAETH (http://people.scs.fsu.edu/~burkardt/datasets/spaeth/spaeth.html). The figures
listed below show how the k-means algorithm converges for this set of data points.
Figure 1 K – means Initial K clusters
Figure 2 K – means Clusters getting rearranged by computing new centroids
Figure 3 K – means Converged clusters
With the LabVIEW 8.2.1 compiler and 3360 points, the k-means algorithm took 355
ms to converge. The hardware used was an Intel® Core™2 IV 1.73 GHz with 1 GB RAM.
The pros of the k-means algorithm are:
a. It is very simple to implement
b. It is very fast for low-dimensional data
c. It can find pure sub-clusters if a large number of clusters is specified
The cons of the k-means algorithm are:
a. K-Means cannot handle non-globular clusters or clusters of different sizes and densities
b. K-Means will not identify outliers
c. K-Means is restricted to data which has the notion of a centre (centroid)
2. Agglomerative Hierarchical Clustering
2.1 Definition
Hierarchical clustering builds a cluster hierarchy or, in other words, a tree of clusters, also
known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the
points covered by their common parent. Such an approach allows exploring data on different
levels of granularity. Hierarchical clustering methods are categorized into agglomerative
(bottom-up) and divisive (top-down). An agglomerative clustering starts with one-point
(singleton) clusters and recursively merges two or more most appropriate clusters. A divisive
clustering starts with one cluster of all data points and recursively splits the most appropriate
cluster. The process continues until a stopping criterion (frequently, the requested number k of
clusters) is achieved. In this project, we deal with Agglomerative Hierarchical Clustering.
Advantages of hierarchical clustering:
• Embedded flexibility regarding the level of granularity
• Ease of handling of any forms of similarity or distance
• Applicability to any attribute types
Disadvantages of hierarchical clustering:
• Vagueness of termination criteria
• The fact that most hierarchical algorithms do not revisit once constructed (intermediate)
clusters with the purpose of their improvement
2.2 Algorithms implemented in this Project
In this project, we have implemented two linkage-metric algorithms: Single-Link (MIN)
and Complete-Link (MAX). The time complexity is O(n^2 log n).
Single Link Algorithm
In this algorithm, the proximity of two clusters is defined as the minimum of the distance
(maximum of the similarity) between any two points in the two different clusters. In graph
terminology, if you start with all points as singleton clusters and add links between points one at
a time, shortest links first, then these single links combine the points into clusters. In the
project, the single link algorithm is implemented through a minimum spanning tree built
using Kruskal’s algorithm. Union-by-Rank and Path Compression are used for
optimization.
Minimum Spanning Tree - Given a connected, undirected graph, a spanning tree of that graph
is a sub-graph which is a tree and connects all the vertices together. A single graph can have
many different spanning trees. We can also assign a weight to each edge, which is a number
representing how unfavorable it is, and use this to assign a weight to a spanning tree by
computing the sum of the weights of the edges in that spanning tree. A minimum spanning tree
or minimum weight spanning tree is then a spanning tree with weight less than or equal to the
weight of every other spanning tree. More generally, any undirected graph (not necessarily
connected) has a minimum spanning forest, which is a union of minimum spanning trees for its
connected components.
Kruskal’s algorithm - Kruskal's algorithm is an algorithm in graph theory that finds a minimum
spanning tree for a connected weighted graph. This means it finds a subset of the edges that
forms a tree that includes every vertex, where the total weight of all the edges in the tree is
minimized. If the graph is not connected, then it finds a minimum spanning forest (a minimum
spanning tree for each connected component). Kruskal's algorithm is an example of a greedy
algorithm.
It works as follows:
• create a forest F (a set of trees), where each vertex in the graph is a separate tree
• create a set S containing all the edges in the graph
• while S is nonempty
  o remove an edge with minimum weight from S
  o if that edge connects two different trees, then add it to the forest, combining two
    trees into a single tree
  o otherwise discard that edge
At the termination of the algorithm, the forest has only one component and forms a minimum
spanning tree of the graph.
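The steps listed above can be sketched as follows. This is an illustrative sketch rather than the project's code: the function name `kruskal` and the `(weight, u, v)` edge format are assumptions, and the `find` helper is deliberately naive (no Union-by-Rank or Path Compression).

```python
def kruskal(num_vertices, edges):
    """Return the edges of a minimum spanning forest.

    edges is a list of (weight, u, v) tuples with vertices numbered from 0.
    """
    # Forest F: every vertex starts as its own tree (parent[v] == v).
    parent = list(range(num_vertices))

    def find(v):
        # Follow parent links up to the root that names v's tree.
        while parent[v] != v:
            v = parent[v]
        return v

    forest = []
    # Set S: visiting edges in sorted order removes the lightest edge first.
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            # The edge joins two different trees: keep it and merge them.
            parent[root_u] = root_v
            forest.append((weight, u, v))
        # Otherwise the edge would close a cycle and is discarded.
    return forest
```

For single-link clustering, stopping the loop once the forest contains the desired number of trees leaves each remaining tree as one cluster.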
Union By Rank – The root of the shallower tree is made to point to the root of the other tree.
We maintain rank(x) as an upper bound on the depth of the tree rooted at x. Consider the
following example,
Figure 4 Union By Rank
If rank(x) = 3 and rank(y) = 2, then Union (x, y) results in a tree whose rank is the greater
of the two ranks, i.e. 3.
If the two trees are of the same rank then the rank of the resultant tree increases by one.
Path Compression
• 1st walk: Find the name of the set. Walk up the tree until we reach the root.
• 2nd walk: Retrace the path and make every element along it point directly to the root.
This enables future finds to take shorter paths.
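A minimal sketch of a disjoint-set structure combining Union By Rank with the two-walk Path Compression described above is given here. The class name `DisjointSet` and its method names are illustrative assumptions, not the names used in the project.

```python
class DisjointSet:
    """Disjoint sets over the integers 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))  # every element starts as its own root
        self.rank = [0] * n           # upper bound on the depth of each tree

    def find(self, x):
        # 1st walk: climb to the root, which names the set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # 2nd walk: retrace the path and point every element at the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, x, y):
        root_x, root_y = self.find(x), self.find(y)
        if root_x == root_y:
            return root_x
        # Make root_x the root with the greater (or equal) rank.
        if self.rank[root_x] < self.rank[root_y]:
            root_x, root_y = root_y, root_x
        # The shallower tree is attached below the deeper one.
        self.parent[root_y] = root_x
        # Only equal ranks make the merged tree one level deeper.
        if self.rank[root_x] == self.rank[root_y]:
            self.rank[root_x] += 1
        return root_x
```

With both optimizations, a sequence of find and union operations runs in nearly constant amortized time per operation, so sorting the edges dominates the cost of Kruskal's algorithm.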
In our implementation of the Single Link Algorithm, each point is initially considered a singleton
cluster. At each step, the two clusters (trees) separated by the smallest Euclidean distance are
merged into a single cluster (tree) and the root node is updated.
Complete Link Algorithm
In this algorithm, the proximity of two clusters is defined as the maximum of the distance
(minimum of the similarity) between any two points in the two different clusters. Using graph
terminology, if you start with all points as singleton clusters and add links between points one at
a time, shortest links first, then a group of points is not a cluster until all the points in it are
completely linked, i.e. form a clique.
Single Link is susceptible to noise/outliers. Complete Link may not work well with non-globular
clusters.
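The two proximity definitions can be compared side by side with a short sketch. It is a naive illustration that recomputes all pairwise distances at every merge, not the Kruskal-based implementation used in the project; the function names are hypothetical and Euclidean distance is assumed.

```python
import math
from itertools import product


def single_link(cluster_a, cluster_b):
    """MIN: distance between the closest pair of points across the clusters."""
    return min(math.dist(p, q) for p, q in product(cluster_a, cluster_b))


def complete_link(cluster_a, cluster_b):
    """MAX: distance between the farthest pair of points across the clusters."""
    return max(math.dist(p, q) for p, q in product(cluster_a, cluster_b))


def agglomerate(points, k, linkage):
    """Merge the closest pair of clusters (under `linkage`) until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest proximity value.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters.pop(j)   # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters
```

For example, for the clusters `[(0, 0), (1, 0)]` and `[(3, 0), (5, 0)]`, `single_link` gives 2.0 and `complete_link` gives 5.0.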
2.3 Datasets and Experiments
Single Link Algorithm Testing
Dataset: SPAETH2 dataset (2D- voice modulation data) from the Florida State University’s
website (Around 900 data points)
Figure 5 SPAETH dataset
Output Cluster – Plot
Globular Clusters
After 28000 iterations (3 clusters remain)
Figure 6 Agglomerative Clusters After 28000 iterations
After 64000 iterations (2 Clusters remain)
Figure 7 Agglomerative Clusters After 64000 iterations
Final Cluster (After 65388 iterations)
Figure 8 Agglomerative Clusters After 65388 iterations
Non-Globular Clusters (Run on CheckBoard data)
Single Link
Figure 9 Agglomerative Non – globular clusters
CURE
Figure 10 CURE Non – globular clusters
Complete Link
It was executed on a part of the Census data obtained from UCI Repository
Figure 11 Complete Link
Output Cluster – Plot (Compared with CURE algorithm)
After 2000 iterations (13 clusters remain)
Figure 12 Complete Link clusters After 2000 iterations
Final Cluster (after 2012 iterations)
Figure 13 Complete Link clusters After 2012 iterations
CURE
Figure 14 CURE clusters
3. DBSCAN (Using KD Trees)
The main reason why natural clusters are recognizable is that within each cluster we have
a typical density of points which is considerably higher than outside of the cluster. Furthermore,
the density within the areas of noise is lower than the density in any of the clusters. With this
understanding, we can describe core, border and noise points in a given data set next.
Core points: A point is a core point if the number of points within a given neighborhood around
the point, as determined by the distance function and a user-specified distance parameter Eps,
exceeds a certain threshold, MinPts, which is also a user-specified parameter.
Border points: A border point is not a core point, but falls within the neighborhood of a core
point.
Noise points: A noise point is any point that is neither a core point nor a border point.
3.1 DBSCAN Algorithm
1. Label all points as core, border or noise points
2. Eliminate noise points
3. Put an edge between all core points that are within Eps of each other
4. Make each group of connected core points into a separate cluster
5. Assign each border point to one of the clusters of its associated core points
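The five steps can be sketched as below. This is a brute-force O(m^2) illustration without KD Trees, not the project's Java implementation. It assumes Euclidean distance, counts a point as part of its own neighbourhood, and treats a point as core when that count is at least `min_pts`; the function name and the use of -1 for noise are also assumptions.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Eps-neighbourhood of every point (the point itself is included).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Step 1: core points have at least min_pts points in their neighbourhood.
    is_core = [len(nb) >= min_pts for nb in neighbours]
    # Step 2: every point starts as noise; points never reached stay noise.
    labels = [-1] * n
    cluster_id = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue
        # Steps 3-4: flood-fill across core points within eps of each other.
        labels[start] = cluster_id
        stack = [start]
        while stack:
            i = stack.pop()  # only core points are ever pushed
            for j in neighbours[i]:
                if labels[j] == -1:
                    # Step 5: a border point joins the first cluster to reach it.
                    labels[j] = cluster_id
                    if is_core[j]:
                        stack.append(j)
        cluster_id += 1
    return labels
```

The neighbourhood computation in the first statement is the part a spatial index would replace.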
3.2 DBSCAN Performance Enhancements Using KD Trees
We used KD Trees to improve the efficiency of DBSCAN clustering. The worst-case time
complexity of the DBSCAN algorithm is O(m^2). However, it can be shown that for low-dimensional
data this time complexity can be reduced to O(m * log m) using KD Trees.
The initialization of the KD Tree is a one-time cost which the algorithm incurs while reading the
data points from file. Once the KD Tree has been initialized, it can be used throughout the
algorithm to classify core points, border points and noise points based on the number of nearest
neighbors found, as well as to find the nearest core point for a border point. A KD Tree helps to
decrease the average search time for the nearest neighbor of a point from O(n) to O(log n), where
n is the size of the data set.
We saw performance improvements by using KD Trees. The algorithm was run on an Intel
Pentium IV 1.8 GHz (Duo Core) system with 1 GB RAM. The program was compiled using the Java
1.6 compiler.
No. of Points    Clustering Time (sec)
1572             3.5
3568             10.9
7502             39.5
10256            78.4
Figure 15 DBSCAN performance measurements
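As a rough check on how these timings scale, the growth exponent can be estimated from the first and last rows of the table, assuming the time behaves approximately like n^p. The helper name `growth_exponent` is hypothetical, and a two-point estimate over four measurements is only indicative.

```python
import math

# (number of points, clustering time in seconds) from Figure 15.
measurements = [(1572, 3.5), (3568, 10.9), (7502, 39.5), (10256, 78.4)]


def growth_exponent(first, last):
    """Estimate p in time ~ n**p from two (n, time) measurements."""
    (n0, t0), (n1, t1) = first, last
    return math.log(t1 / t0) / math.log(n1 / n0)


exponent = growth_exponent(measurements[0], measurements[-1])
```

The estimate comes out at about 1.66. An exponent of 1 would correspond to linear growth and 2 to quadratic growth, so the measured times sit between the O(m log m) and O(m^2) bounds quoted above.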
3.3 Observations regarding DBSCAN Issues
The following are our observations:
1. The DBSCAN algorithm performs efficiently for low-dimensional data.
2. The algorithm is robust towards outliers and noise points.
3. Using a KD Tree improves the efficiency over the traditional DBSCAN algorithm.
4. DBSCAN is highly sensitive to the user parameters MinPts and Eps. A slight change in the
values may produce different clustering results, and prior knowledge about these values
cannot be inferred easily.
5. The dataset cannot be sampled, as sampling would affect the density measures.
6. The algorithm is not partitionable for multi-processor systems.
7. DBSCAN fails to identify clusters if the density varies and if the dataset is too sparse.
4. CURE – Hierarchical Clustering (Using KD Trees)
Partitional clustering algorithms attempt to determine k partitions that optimize a certain
criterion function. The square error criterion, defined below, is the most commonly used (m_i is
the mean of the cluster C_i).
E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2
The square error is a good measure of the within-cluster variation across all the partitions. This
objective tries to make the k clusters as compact and separated as possible. However, when there
are large differences in the sizes or geometries of clusters, the square error method could split
large clusters to minimize the square error.
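The effect can be illustrated with a small sketch that computes the square error as the sum, over all clusters, of squared distances from each point to its cluster mean. The function name `square_error` and the one-dimensional data are made up for illustration; they are not taken from the project's experiments.

```python
def square_error(clusters):
    """Sum of squared distances from every point to the mean of its cluster."""
    total = 0.0
    for cluster in clusters:
        # Mean of the cluster, computed one dimension at a time.
        mean = tuple(sum(d) / len(cluster) for d in zip(*cluster))
        for p in cluster:
            total += sum((x - m) ** 2 for x, m in zip(p, mean))
    return total


# Hypothetical data: one large, spread-out cluster and one small, tight cluster.
large = [(float(x),) for x in range(10)]
small = [(11.0,), (11.5,)]

# The intended grouping keeps the large cluster whole.
natural = [large, small]
# An alternative grouping cuts the large cluster in half and absorbs the small one.
split = [large[:5], large[5:] + small]

natural_error = square_error(natural)
split_error = square_error(split)
```

Here `split_error` is smaller than `natural_error`, so a method that only minimizes the square error prefers to split the large cluster.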
Next we considered DBSCAN, which has been explained above. Apart from its problems with
clusters of varying density, DBSCAN could not be made concurrent. At a time when computation
power is getting cheaper day by day and dedicated grids have been set up for data-intensive
computations, an algorithm is required which can be parallelized and takes advantage of all the
resources available. DBSCAN, as seen above, had scaling problems as the data sets grew.
In comparison to Agglomerative Clustering, CURE performs well. Agglomerative
Clustering provides the option of choosing Single Link or Complete Link to cluster points. While
the former identifies only globular clusters efficiently, the latter is computationally intensive.
Hence, for data sets of more than 5000 points, agglomerative clustering was highly inefficient,
though quality of clustering could be achieved by using one of the above options.
Our experiments on the CURE clustering algorithm suggest that CURE depends on a few parameters
and that, once they are tuned for a given data set pertaining to a domain, the algorithm can scale
well by adding more resources and partitioning the data.
4.1 CURE Hierarchical Clustering Algorithm
The CURE clustering algorithm is a hierarchical algorithm which merges two clusters at every
step, and the clustering process is carried out in two passes. The overall hierarchical algorithm
is as follows:
To enhance performance, scalability and quality of clustering, CURE adds a few more
pre-clustering and post-clustering steps.
4.2 CURE Overview
While drawing the random sample, care was taken to ensure that all clusters were
represented and none of them was missed, by estimating a minimum probability.
4.3 CURE - Data Structures Used
We used two data structures, namely the KD Tree and the Min Heap. A brief description of
each follows.
4.3.1 KD Tree
A KD-Tree (short for k-dimensional tree) is a space-partitioning data structure for organizing
points in a k-dimensional space. KD-Trees are a useful data structure for several applications,
such as searches involving a multidimensional search key (e.g. range searches and nearest
neighbour searches). KD-Trees are a special case of BSP trees.
A KD-Tree uses only splitting planes that are perpendicular to one of the coordinate system axes.
This differs from BSP trees, in which arbitrary splitting planes can be used. In addition, in the
typical definition every node of a KD-Tree, from the root to the leaves, stores a point. This differs
from BSP trees, in which leaves are typically the only nodes that contain points (or other
geometric primitives). As a consequence, each splitting plane must go through one of the points
in the KD-Tree. KD-Tries are a variant that store data only in leaf nodes. It is worth noting that in
an alternative definition of KD-Tree the points are stored in its leaf nodes only, although each
splitting plane still goes through one of the points.
In CURE, the KD Tree is initialized during the initial phase of clustering to hold all the points.
Later in the algorithm, we use this tree for nearest neighbor search and for finding the closest
clusters based on the representative points of a cluster. When a new cluster is formed, its new
representative points are added to the KD Tree and the representative points of the older clusters
are deleted from the tree.
A KD Tree improves the search for points in k-dimensional space from O(n) to O(log n) on
average, as it uses binary partitioning across the coordinate axes.
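A compact sketch of such a tree and a nearest-neighbour query is shown below. It follows the definition in which every node stores a point and splits on the median along alternating axes. It is not the project's Java implementation: the nested-tuple representation and the function names are assumptions, and insertion and deletion of representative points are omitted.

```python
import math


def build_kdtree(points, depth=0):
    """Build a KD-Tree as nested (point, left, right) tuples; None is empty."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the coordinate axes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # the median point becomes this node
    return (
        points[mid],
        build_kdtree(points[:mid], depth + 1),
        build_kdtree(points[mid + 1:], depth + 1),
    )


def nearest(node, target, depth=0, best=None):
    """Return the stored point closest to target (Euclidean distance)."""
    if node is None:
        return best
    point, left, right = node
    if best is None or math.dist(point, target) < math.dist(best, target):
        best = point
    axis = depth % len(target)
    diff = target[axis] - point[axis]
    # Search the side of the splitting plane that contains the target first.
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, depth + 1, best)
    # The far side can only help if the plane is closer than the best so far.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, depth + 1, best)
    return best
```

The pruning test in the last step is what lets a query skip most of the tree on well-distributed, low-dimensional data; in the worst case the whole tree is still visited.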
4.3.2 Min Heap
A Min Heap is a simple heap data structure created using a binary tree. It can be seen as a binary
tree with two additional constraints:
1. The shape property: all levels of the tree, except possibly the last one (deepest) are fully
filled, and, if the last level of the tree is not complete, the nodes of that level are filled
from left to right.
2. The heap property: each node is less than or equal to each of its children.
The Min Heap stores the minimum element at the root of the heap. In CURE, we always merge
two clusters at every step. Because the heap is ordered by inter-cluster distance, the cluster at
the root is necessarily the one with the smallest distance to another nearby cluster, so the next
cluster to merge can always be read in O(1) time.
We used java.util.PriorityQueue, which supports all the Min Heap operations.
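The same idea can be shown with Python's `heapq` module, used here only as a stand-in for the Java priority queue. The cluster names and distances are made-up values for illustration.

```python
import heapq

# Hypothetical clusters, keyed by the distance to their closest neighbouring cluster.
heap = []
for distance, name in [(4.2, "C1"), (0.7, "C2"), (2.5, "C3"), (1.1, "C4")]:
    heapq.heappush(heap, (distance, name))   # O(log n) insertion

# The root of the heap is the cluster with the smallest distance: an O(1) read.
closest_distance, closest_cluster = heap[0]

# Removing it for merging restores the heap property in O(log n).
popped = heapq.heappop(heap)
next_distance, next_cluster = heap[0]
```

Reading the root is constant time, while removing it or inserting a merged cluster costs O(log n), which is the cost paid at every merge step.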
4.4 Benefits of CURE against Other Algorithms
K-Means (and centroid-based algorithms): Unsuitable for non-spherical clusters and clusters of
differing sizes.
CLARANS: Needs multiple scans of the data (R* Trees were proposed later on). CURE inherently
uses KD Trees to store the dataset and reuses them across passes.
BIRCH: Identifies only convex or spherical clusters of uniform size.
DBSCAN: No parallelism, high sensitivity to parameters, and sampling of the data may affect
density measures.
4.5 Observations towards Sensitivity to Parameters
We observed that the random sample size was an important criterion while pre-clustering the
data set. Hence we used the Chernoff bounds as given in [3] to calculate the
minimum size of the sample to be selected. Random sampling often missed some of the smaller
clusters. The next important parameter was the shrink factor of the representative points (α). If
we increased α to 1, the algorithm would degenerate to MST-based algorithms. If the
parameter α is reduced to 0.1, CURE starts behaving as a centroid-based algorithm. Thus for α
in the range of 0.3 to 0.7, CURE identified the right clusters.
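The role of the shrink factor can be sketched as follows. The convention assumed here matches the behaviour described above: the shrink factor is the fraction of a representative point's offset from the centroid that is kept, so 1 leaves the scattered points untouched and values near 0 collapse them onto the centroid. Guha et al. [3] write the update as p + α(mean − p), under which the two extremes swap roles. The function name, the farthest-point selection rule and the parameter names are assumptions for illustration.

```python
import math


def representatives(cluster, c, alpha):
    """Pick up to c well-scattered points and shrink them toward the centroid.

    alpha is the fraction of the offset from the centroid that is kept
    (assumed convention): 1 means no shrinking, 0 means full collapse.
    """
    centroid = tuple(sum(d) / len(cluster) for d in zip(*cluster))
    candidates = list(dict.fromkeys(cluster))   # distinct points, order kept
    scattered = []
    for _ in range(min(c, len(candidates))):
        if not scattered:
            # First pick: the point farthest from the centroid.
            pick = max(candidates, key=lambda p: math.dist(p, centroid))
        else:
            # Later picks: the point farthest from those already chosen.
            pick = max(
                (p for p in candidates if p not in scattered),
                key=lambda p: min(math.dist(p, s) for s in scattered),
            )
        scattered.append(pick)
    # Shrink each scattered point toward the centroid.
    return [
        tuple(m + alpha * (x - m) for x, m in zip(p, centroid))
        for p in scattered
    ]
```

For the square `[(0, 0), (4, 0), (0, 4), (4, 4)]` with two representatives, a factor of 0.5 moves the corner points (0, 0) and (4, 4) halfway to the centroid (2, 2), giving (1.0, 1.0) and (3.0, 3.0).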
The number of representative points in a cluster is an important
parameter. If the cluster is sparse, it may need more representative points than a compact,
smaller cluster. We observed that if the number of representative points is increased to 8 or 10,
sparse clusters with variable size and density were identified properly. However, as the number of
representative points increases, the computation time for clustering also increases, because for
every new cluster formed, new representative points have to be calculated and shrunk.
One of the most important observations of our experiments concerned the
partitioning of data sets, as CURE supports concurrent execution of the first pass of the
algorithm. As the number of partitions was increased from 2 to 6 or 10, the clustering time dropped
significantly. Although the number of clusters to be merged in the second step increased, the
advantage of concurrent execution was far greater. However, if we increased the
number of partitions to higher numbers such as 50, the clustering would not give proper results,
as some of the partitions would not have any data to cluster. Hence, though less time was
consumed, the quality of the clusters was affected and CURE could not identify all the clusters
correctly; some of them were merged to form bigger clusters. A partitioning of 10 – 20
would therefore give an efficient speed-up of the algorithm while maintaining the quality of the
clusters.
Partitioning Results
Time (sec)
No. of Points    Partition P = 2    Partition P = 2    Partition P = 5
1572             6.4                6.5                6.1
3568             7.8                7.6                7.3
7502             29.4               21.6               12.2
10256            75.7               43.6               21.2
Figure 16 Partitioning results
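If a chart is plotted for the same, we can see that as the number of partitions is increased, the
time taken to cluster grows more slowly with the data set size: with P = 5, the time increases
from 6.1 s to 21.2 s (about 3.5 times) while the data set grows from 1572 to 10256 points (about
6.5 times).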
III. CONCLUSION
From the clusters obtained through the various algorithms and the time taken by each
algorithm on the datasets, we can say that K-means is not the best of the clustering methods,
given its high complexity. For high-dimensional data, K-means takes a lot of time and memory.
Also, it is not guaranteed to converge for every combination of proximity function and centroid
type.
Our experiments suggest that DBSCAN fared well for low-dimensional data. Also, if the
density of the clusters did not vary too much, DBSCAN identified all the clusters fairly well. But
if the size of the data increases and the shapes and densities of the clusters vary too much,
DBSCAN ends up combining or splitting those clusters.
CURE could identify all the clusters properly. But CURE depends on some user
parameters which have to be data specific. The range of such parameters does not vary too much,
many of them lying between 0 and 1. CURE could identify several clusters with high purity which
K-means and DBSCAN failed to identify.
With respect to agglomerative clustering, clusters with high purity could be obtained but
the computation time for clustering was high. Applying Kruskal's algorithm with Union-By-Rank
helped to improve the efficiency, but the computation time still increased significantly
as the size of the data set increased.
IV. REFERENCES
1. Tapas Kanungo, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu.
An Efficient k-Means Clustering Algorithm: Analysis and Implementation.
2. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise. KDD '96.
3. S. Guha, R. Rastogi, K. Shim. CURE: An Efficient Clustering Algorithm for Large Databases.
1998.
4. Leo Wanner. Introduction to Clustering Techniques.
5. Glenn Fung. A comprehensive overview of Basic Clustering Algorithms.
6. Tan, Steinbach, Kumar. Introduction to Data Mining.
7. Thomas T. Cormen, Charles E. Leiserson, Ronald L. Rivest. Introduction to Algorithms.
MIT Press, Cambridge, MA, 1990.
8. Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: an efficient data clustering
method for very large databases. Proceedings of the 1996 ACM SIGMOD International
Conference on Management of Data, pp. 103-114, June 4-6, 1996, Montreal, Quebec, Canada.
9. K. Alsabti, S. Ranka, V. Singh. An Efficient K-Means Clustering Algorithm. 1998.
10. K. Bennett, U. Fayyad, D. Geiger. Density based Indexing for Approximate Nearest Neighbor
Queries. Microsoft Research, 1998.
11. T. Kanungo, D. M. Mount, N. S. Netanyahu, C. Piatko, R. Silverman, A. Y. Wu. The Analysis
of a Simple K-Means Algorithm. 2000.
12. D. Pelleg, A. Moore. Accelerating exact K-Means algorithms with Geometric Reasoning.
1999.