EFFICIENT ALGORITHMS FOR MINING ARBITRARY
SHAPED CLUSTERS
By
Vineet Chaoji
A Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Subject: COMPUTER SCIENCE
Approved by the
Examining Committee:
Dr. Mohammed J. Zaki, Thesis Adviser
Dr. Boleslaw Szymanski, Member
Dr. Mark Goldberg, Member
Dr. Malik Magdon-Ismail, Member
Dr. Taneli Mielikäinen, External Member
Rensselaer Polytechnic Institute
Troy, New York
July 2009
(For Graduation August 2009)
© Copyright 2009
by
Vineet Chaoji
All Rights Reserved
CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENT
ABSTRACT

1. Introduction
   1.1 Clustering – Application Domains
   1.2 Shape-based Clustering
       1.2.1 Motivating Applications
       1.2.2 Problem Formulation and Contribution
   1.3 Thesis Outline

2. Background and Related Work
   2.1 Clustering Preliminaries
       2.1.1 Data Types
       2.1.2 Distance Measures
   2.2 Dominant Clustering Paradigms
       2.2.1 Differentiating Properties
       2.2.2 Categorization
             2.2.2.1 Partitional Clustering
             2.2.2.2 Hierarchical Clustering
             2.2.2.3 Probabilistic/fuzzy Clustering
             2.2.2.4 Graph-theoretic Clustering
             2.2.2.5 Grid-based Clustering
             2.2.2.6 Evolution and Neural-net based Clustering
   2.3 Review of Shape-based Clustering Methods
       2.3.1 Density-based Clustering
       2.3.2 Hierarchical Clustering
       2.3.3 Spectral Clustering
       2.3.4 SPARCL – Brief Overview
       2.3.5 Backbone based Clustering – An Overview

3. SPARCL: Efficient Shape-based Clustering
   3.1 The SPARCL Algorithm
   3.2 Phase 1 – Kmeans Algorithm
       3.2.1 Kmeans Initialization Methods
       3.2.2 Initialization using Local Outlier Factor
       3.2.3 Complexity Analysis of LOF Based Initialization
   3.3 Phase 2 – Merging Neighboring Clusters
       3.3.1 Cluster Similarity
   3.4 Complexity Analysis
   3.5 Estimating the Value of K
   3.6 Experiments and Results
       3.6.1 Datasets
             3.6.1.1 Synthetic Datasets
             3.6.1.2 Real Datasets
       3.6.2 Comparison of Kmeans Initialization Methods
       3.6.3 Results on Synthetic Datasets
             3.6.3.1 Scalability Experiments
             3.6.3.2 Clustering Quality
             3.6.3.3 Varying Number of Clusters
             3.6.3.4 Varying Number of Dimensions
             3.6.3.5 Varying Number of Seed-Clusters (K)
       3.6.4 Results on Real Datasets
       3.6.5 Comparison with Locally Linear Embedding
   3.7 Conclusions

4. Shape-based Clustering through Backbone Identification
   4.1 Related Techniques
       4.1.1 Skeletonization
   4.2 The Clustering Algorithm
       4.2.1 Preliminaries
       4.2.2 Phase 1 – Backbone Identification
             4.2.2.1 Minimum Description Length principle
       4.2.3 Phase 2 – Cluster Identification
       4.2.4 Complexity Analysis
   4.3 Experimental Evaluation
       4.3.1 Datasets
       4.3.2 Scalability Results
       4.3.3 Clustering Quality Results
       4.3.4 Parameter Sensitivity Results
   4.4 Conclusion
       4.4.1 Comparison with SPARCL

5. Conclusion and Future Work
   5.1 Efficient Subspace Clustering
   5.2 Shape Indexing
LIST OF TABLES

2.1 Summary of spatial (shape-based) Clustering Algorithms
3.1 Comparison on synthetic datasets. The distortion scores are shown for each method. The value in bold indicates the best result for each row.
3.2 Runtime Performance on Synthetic Datasets. All times are reported in seconds. '-' for DBSCAN and the Spectral method denotes that they ran out of memory for these cases.
4.1 Scalability results on a dataset with 13 true clusters. The size of the dataset is varied keeping the noise at 5% of the dataset size.
LIST OF FIGURES

1.1 Applications of Shape-based Clustering in Image Analysis, Geographical Information Systems and Sensor Data.
2.1 Contingency Table for Jaccard Co-efficient
2.2 Taxonomy of Clustering Algorithms
2.3 The k-means Algorithm
2.4 DBScan – Density reachability, core points and noise points. minPts = 2
2.5 CHAMELEON Clustering Steps. Figure from [64]
2.6 CHAMELEON – Relative Interconnectivity. Figure from [64].
2.7 CHAMELEON – Relative Closeness. Figure from [64].
3.1 The SPARCL Algorithm
3.2 Effect of Choosing mean or actual data point
3.3 Bad Choice of Cluster Centers
3.4 Local Outlier Based Center Selection
3.5 Projection of points onto the vector connecting the centers
3.6 Estimating the value of K
3.7 Generating seed representatives with cdistmin
3.8 Sensitivity Comparison: LOF vs. Random
3.9 SPARCL clustering on standard synthetic datasets from the literature.
3.10 Results on Swiss-roll
3.11 Scalability Results on Dataset DS5
3.12 Clustering Quality on Dataset DS5
3.13 Clustering Results on 3D Dataset
3.14 Clustering quality for varying dataset size
3.15 Varying Number of Natural Clusters
3.16 Varying Number of Dimensions
3.17 10 dimensional dataset (size=500K, k=10) projected onto a 3D subspace
3.18 Clustering quality for varying number of seed-clusters
3.19 Protein Dataset
3.20 Cluster separation with Locally Linear Embedding
3.21 Cancer Dataset: (a)-(c) are the actual benign tissue images. (d)-(f) gives the clustering of the corresponding tissues by SPARCL.
4.1 Initial dataset (4.1(a)); after iterations 3 and 6; and the backbone after 8 iterations (right) of the algorithm
4.2 Example skeleton of a binary image (in black). The white outline is the skeleton.
4.3 Sample dataset showing one iteration of glob and movement
4.4 k-NN matrices for sample dataset
4.5 Bubble plot for Figure 4.1(d). The size of a bubble is proportionate to the weight wi of a point.
4.6 The Backbone Identification Based Clustering Algorithm
4.7 Example illustrating the globbing-movement twin process.
4.8 Reconstructed (and original) k-NN matrices for sample dataset
4.9 The number of points moved and globbed per iteration for a dataset with 1000K points.
4.10 Balancing the two contradicting influences in the clustering formulation.
4.11 Scalability Results for Backbone Based Clustering
4.12 Backbone/skeleton of 2D synthetic datasets in our study. Left column: original dataset, right column: skeletons.
4.13 Backbone/skeleton of 3D synthetic datasets in our study. Left column: original dataset, right column: skeletons.
4.14 Purity score with varying dataset size
4.15 Execution time and purity for varying number of nearest neighbors
5.1 Subspace clustering – Challenges for SPARCL
5.2 Local Outlier Factor based representatives are rotation invariant.
ACKNOWLEDGMENT
As I think back on the interesting graduate school years at RPI, I realize this journey
would not have come anywhere close to completion had it not been for the following people.
First and foremost, I sincerely thank my adviser, Professor Zaki for his support,
both in research and otherwise. He has this amazing style of advising – giving
freedom but at the same time questioning; helping us to think but not hand-holding;
being gently pushy but never overbearing; acknowledging lack of progress but still
being optimistic. Above all, he has been very approachable and always willing to
discuss ideas. I have enjoyed many involved discussions in his office. I am extremely
fortunate to have had him as my adviser, otherwise my frequent India trips to meet
my wife would not have been possible :)
I would also like to thank my committee members, Professor Goldberg and Professor
Magdon-Ismail, for agreeing to be a part of my committee. Their courses during
graduate school were the most informative. Professor Magdon-Ismail gave very good
suggestions and ideas during my candidacy and beyond.
I have had the opportunity of working with the remaining members of my committee
on other projects. I thank Professor Szymanski for allowing me to be associated
with the RDM project after my active participation ended. Meetings with Professor
Szymanski have always been very exciting – full of good research ideas sprinkled
with anecdotes from history, literature and science. It was a pleasure interacting
with him.
Apart from agreeing to be a part of my committee, I have many things to thank
Taneli for. He was a wonderful mentor when I interned at Nokia Research Center
over Summer 2008. I had a fruitful and a memorable summer. He also provided
multiple opportunities for me to continue my association with Nokia Research. I
am very grateful to him for those opportunities and look forward to his guidance
and mentorship further ahead in my career.
Huge thanks goes to Terry Hayden and Chris Coonrad. They are the pillars of
the department. Because of their efforts, graduate life in the department seems so
comfortable. I will miss those csgrads mails from Chris and Terry :)
A large part of graduate life is spent in the lab with fellow graduate students. I
have my fondest memories at RPI with my labmates and colleagues. I have enjoyed
working with Hasan and Saeed on many projects. In particular, Hasan’s enthusiasm
while working on a project has been contagious. We discussed many ideas and Hasan
always had thought provoking insight. Apart from research related aspects, I have
seen (and try to imbibe) the merits of tremendous hard-work and perseverance from
Hasan. Saeed has the knack of grasping new concepts and building on those. It was
fun working with him on designing the experiments for the ORIGAMI and the
SPARCL papers. I owe a lot of my awareness of world politics to Saeed :) I would
also like to acknowledge a few other colleagues/friends – Asif, Ali, Apirak, Hilmi,
Krishna and Medha.
Last but most important, I owe a lot to my family members. Without their collective
support obtaining this degree would have been much more difficult. I am indebted
to my wife, Anjali, who stood by me throughout these years. She was pushy at
times but always very understanding and supportive. Credit goes to my immediate
family members – my parents and sister (Aai, Baba and Priti), and Anjali’s parents
and sister (Mom, Dad and Neha) for their constant support.
This thesis is dedicated to Ajoba (my grandfather), who passed away on the day of
my candidacy. Apart from being the first one to have a PhD in my family, he has
been a great source of inspiration for me. He was always very curious about my
research and its progress.
ABSTRACT
Clustering is one of the fundamental data mining tasks. Many different clustering
paradigms have been developed over the years, which include partitional, hierarchical, mixture model based, density-based, spectral, subspace, and so on. Traditional
algorithms approach clustering as an optimization problem, wherein the objective is
to minimize certain quality metrics such as the squared error. The resulting clusters
are convex polytopes in d-dimensional metric space. For clusters that have arbitrary shapes, such a strategy does not work well. Clusters with arbitrary shapes
are observed in many areas of science. For instance, spatial data gathered from
Geographic Information Systems, data from weather satellites, data from studies on
epidemiology and sensor data rarely possess regular shaped clusters. Image segmentation is an area of technology that deals extensively with arbitrary shaped regions
and boundaries. In addition to the complex shapes, some of the above applications generate large volumes of data. The set of clustering algorithms that identify irregular shaped clusters is referred to as shape-based clustering algorithms. These
algorithms are the focus of this thesis.
Existing methods for identifying arbitrary shaped clusters include density-based,
hierarchical and spectral algorithms. These methods suffer either in terms of the
memory or time complexity, which can be quadratic or even cubic. This shortcoming has restricted these algorithms to datasets of moderate sizes. In this thesis we
propose SPARCL, a simple and scalable algorithm for finding clusters with arbitrary
shapes and sizes. SPARCL has a linear space and time complexity. SPARCL consists of two stages – the first stage runs a carefully initialized version of the Kmeans
algorithm to generate many small seed clusters. The second stage iteratively merges
the generated clusters to obtain the final shape-based clusters. The merging stage
is guided by a similarity metric between the seed clusters. Experiments conducted
on a variety of datasets highlight the effectiveness, efficiency, and scalability of our
approach. On large datasets SPARCL is an order of magnitude faster than the
best existing approaches. SPARCL can identify irregular shaped clusters that are
full-dimensional, i.e., the clusters span all the input dimensions.
We also propose an alternate algorithm for shape-based clustering. In prior clustering algorithms the objects remain static whereas the cluster representatives are
modified iteratively. We propose an algorithm based on the movement of objects
under a systematic process. On convergence, the core structure (or the backbone) of
each cluster is identified. From the core, we can identify the shape-based clusters
more easily. The algorithm operates in an iterative manner. During each iteration,
a point can either be subsumed by another representative point (the term "globbing" is used in this text) or move towards a dense neighborhood, or both. The stopping
condition for this iterative process is formulated as an MDL model selection criterion.
Experiments on large datasets indicate that the new approach can be an order of
magnitude faster, while maintaining clustering quality comparable with SPARCL.
In the future, we plan to extend our work to identify subspace clusters. A subspace
cluster spans a subset of the dimensions in the input space. The task of subspace
clustering thus involves not only identifying the cluster members, but also the relevant dimensions for each cluster. Indexing spatial objects using the seed selection
approach proposed in SPARCL is another line of work we intend to explore.
CHAPTER 1
Introduction
Clustering has been a traditional and a prominent area of research within the data
mining, machine learning and statistical learning communities. Together with classification and regression, cluster analysis accounts for most of the techniques proposed in these
communities. The growing interest in the field of cluster analysis is fueled by a large,
constantly growing, set of applications that have benefited greatly by the progress
in this area.
Broadly, clustering can be defined as follows. Given a set D of n objects in
d-dimensional space, cluster analysis assigns the objects into k groups such that
each object in a group is more similar to other objects in its group than to objects
in other groups (this is a very general definition; variations and specializations of it can be seen for different flavors of clustering problems). While some clustering algorithms merely identify
members of different groups, others also provide the characteristic representative(s)
for each group. Some other algorithms are able to identify isolated objects that
do not belong to any specific group. These objects are atypical and are commonly
known as outliers or noise. Another class of clustering algorithms assigns each
object a probability of belonging to each of the k groups. A more thorough review
of clustering paradigms appears in Section 2.2.
Within the machine learning community, cluster analysis is popularly known
as unsupervised learning. Unsupervised learning derives its name from the complementary field of supervised learning, popularly known as classification. On one
hand, the classification task is aided (or supervised) by the presence of labels for
the objects, and the goal is to assign labels to new, unseen objects. In contrast,
cluster analysis is devoid of any such supervision. Hence the name unsupervised
learning. On a related note, techniques that combine both supervised and unsupervised learning have also been studied within the machine learning community. They
are aptly known as semi-supervised learning algorithms.
In this thesis, we focus on clustering algorithms that can capture clusters of
arbitrary shapes and sizes. These algorithms are commonly known in the literature
as spatial clustering algorithms, although we mostly refer to them as shape-based
clustering algorithms. Throughout this document, objects will be interchangeably
referred to as data points or instances. Similarly, the groups will be referred to
as clusters and terms ‘cluster analysis’ and clustering will be used interchangeably.
The d dimensions will be referred to as the features or attributes of the objects.
1.1 Clustering – Application Domains
This notion of capturing similarity between objects lends itself to a variety of
applications. As a result, cluster analysis plays an important role in almost every
area of science and engineering, including bioinformatics [55], market research [96],
privacy and security [62], image analysis [105], web search [125], health care [89]
and many others. Some of the key application domains of clustering are described
in this section.
Market Research: Cluster analysis is widely used in market research [118, 102].
Researchers have used cluster analysis to group or segment populations/customers.
Such a segmenting can help gain useful insight into market penetration, customer
base size and product positioning [117]. Moreover, the result of cluster analysis can
also reveal the correlations between the various segments. For market surveys and
test panels, cluster analysis can help determine the size and composition of test
markets [30].
Finance: As market research is crucial for new product developers, a good understanding of stocks is important for brokers/traders [41]. Grouping related stock
options helps traders plan their hedging strategy. Similarly, knowledge of related
stocks can help infer similar behavior under different market conditions [87]. Studies have been conducted using cluster analysis to understand regional differences in
market behavior [17].
Customer profiling: Increased availability of customer behavior data (online purchases, website visits, reviews, comments, wish lists, etc.) has made it possible to
build models to capture customer preferences [24]. Customer profiling has enabled
organizations to not only provide targeted products (and recommendations) but also
superior customer service. Analyzing customer data has also helped financial institutions and online stores identify fraudulent behavior [51]. Within clinical research,
grouping patient symptoms and diagnoses has helped health care practitioners identify diseases effectively.
Planning and governance: Identifying population dynamics enables authorities
to better distribute facilities (schools, hospitals, etc.) across a town or city. Similarly, spread of diseases can also be contained based on identification of normal and
“abnormal” clusters [86].
Sciences: Within the scientific computing community, clustering has been used by
astronomers to categorize constellations and stars [60]. Cluster analysis has also been applied to interesting problems in the life sciences: grouping protein sequences [78], analysis of gene expression data [27] and building phylogenetic trees [85] are a few
examples.
Internet-based Applications: Internet based applications such as search (both
image and text) [16], books and movie recommendations, and music portals have
effectively utilized cluster analysis to provide improved and customized results.
Miscellaneous: Other applications within computer science include detecting communities in social network graphs [113], image segmentation [105] and grouping
related documents [109].
The applications described above clearly indicate that clustering is one
of the major data mining methods. Despite the vast amount of research in this area,
the emergence of new applications creates the need for more effective and efficient
clustering algorithms.
1.2 Shape-based Clustering
In this thesis, our focus is on the arbitrary-shape clustering task. We use the
term shape-based clustering for all algorithmic techniques that capture clusters with
arbitrary shapes, varying densities and sizes. Shape-based clustering remains of
active interest, and several previous approaches have been proposed; spectral [105],
density-based (DBSCAN [34]), and nearest-neighbor graph based (Chameleon [64])
approaches are the most successful among the many shape-based clustering methods.
However, they either suffer from poor scalability or are very sensitive to the choice of
the parameter values. On the one hand, simple and efficient algorithms like Kmeans
are unable to mine arbitrary-shaped clusters, and on the other hand, clustering
methods that can cluster such datasets are not very efficient. Considering the volume
of data generated by current sources (e.g., geo-spatial satellites) there is a need for
efficient algorithms in the shape-based clustering domain that can scale to much
larger datasets.
1.2.1 Motivating Applications
The need for spatial clustering can be illustrated by the following historical incident. Figure 1.1(a) shows the map of London during the 1854 cholera outbreak. The map marks the locations of deaths caused by cholera and the positions of water pumps. The story goes that Dr. John Snow (one of the founders of medical epidemiology) used the map to correlate the deaths with the sources of water. Grouping/clustering the occurrences of deaths, along with the locations of the pumps, helped him identify the pump with contaminated water. Although this is a small example, and at that time techniques from the field of cartography were more popular than cluster analysis, it goes to emphasize the role of shape-based clustering in modern day Geographic Information Systems.
Astronomy-related studies, research on epidemiology, location-based applications and seismological observations are a few sources of spatial data. Improvements in sensor devices, observation satellites and GPS devices have enabled the gathering of finer and more varied types of data, resulting in large data volumes. Identifying
shape-based clusters could aid in resource allocation, urban planning and marketing,
health care and criminology. Some recent applications of spatial data and clustering
are outlined below.
Location-based Search: With the popularity of Google Local, Ovi Maps from
Nokia, Placemaker from Yahoo and similar location specific services, efficient algorithms for indexing and querying spatial data have come to the forefront. Efficient
response to spatial queries such as "Find all restaurants in the vicinity of the Empire State Building" is contingent on building indices that group the points-of-interest (POI) data.

Figure 1.1: Applications of Shape-based Clustering in Image Analysis, Geographical Information Systems and Sensor Data. (a) GIS Data – London Cholera (image from http://en.wikipedia.org/wiki/GIS); (b) Satellite Image of Hurricane Katrina (image from http://www.usgs.gov/ngpo/); (c) Satellite Image of California Forest Fire (image from http://www.geology.com); (d) Chromosome Separation (image from http://www.riken.go.jp/asi/images/kaleidoscope/chromosome.jpg).
Earth sciences and geo-spatial data: Spatial clustering on climatology data
has been used to characterize regions with varying weather patterns [95]. In another effort [110], large datasets from astronomical studies are analyzed to identify clusters of galaxies. Shape-based clustering has also been applied to seismic data to understand earthquake patterns [5]. [47] serves as a good survey of related techniques.
Epidemiology and disease clusters: Another application of shape-based clustering comes from the health care domain, wherein early detection of disease outbreaks can be achieved by observing the spatial distribution of the disease instances [72, 69]. Furthermore, understanding disease clusters can help analyze the spread of an outbreak [71].
Geographic Information Systems (GIS) data: Ecoregions – areas of land or water
characterized by presence of certain types of flora or fauna, soil type, animal communities, etc. – can be captured [49] using shape-based clustering. The detection of
such regions impacts conservation activities for wildlife and forests. Segmentation
of sensor fields is crucial for understanding placement of sensors and master nodes.
Spatial clustering methods have been applied for segmenting sensor networks [128]
and for grouping sensor networks considering the energy efficiency criteria [42]. Clustering algorithms have been used on geo-spatial satellite images and remote sensing
digital images to identify regions with irregular shapes [101], such as bridges, rivers
and roads.
Figure 1.1(c) shows a remote sensing image of a California forest fire. Similarly, Figure 1.1(b) shows a satellite image of the cloud formation during Hurricane Katrina. Image segmentation methods based on shape-based clustering can help identify regions of interest in these images. Similar applications appear within biological image analysis, such as chromosome separation. Figure 1.1(d) shows an
image used for chromosome separation.
1.2.2 Problem Formulation and Contribution
Traditional clustering algorithms have focused on the following objectives: 1)
improving the clustering quality, 2) improving efficiency, and 3) designing algorithms
that can scale to larger datasets. The last consideration is gaining prominence as
the sizes of the datasets are growing at a steady pace. Shape-based clustering becomes challenging for large datasets since standard distance measures and parametric models are unable to capture arbitrary shaped clusters. Although density-based
algorithms have been proposed to overcome this limitation, they suffer from a high
computational complexity and sensitivity to parameters. Similarly, spectral clustering algorithms suffer from scalability issues.
Existing approaches to clustering tackle scalability issues through the following
methods:
1. Distributed clustering algorithms: This approach to clustering large scale
datasets scales up the resources (CPU and memory) proportionately. Distributed and parallel versions of existing clustering algorithms can deal with
large datasets. For instance, [61] discusses a distributed density based clustering algorithm and [92] describes a parallel version of BIRCH. Recently,
cluster computing paradigms such as Map-Reduce have also been employed
for scalability purposes [21].
2. Sampling based methods: These rely on applying clustering algorithms on a
randomly selected sample of the data [119]. The underlying assumption is that
the results on the sample would apply to the entire dataset. CLARANS [88],
CURE [45] and DBRS [116] are examples of this approach. Factors such as
the size of the sample affect the quality of the clustering. These methods
scale well, but on the downside they rely on uniform size and density
of the clusters.
3. Data summarization methods: Somewhat related to the sampling based
methods, this class of algorithms aims at identifying representatives within
the large dataset. Standard clustering algorithms can be applied over this
summary dataset. Final clustering is obtained by mapping the representatives
to the original set of points. This approach is taken by CSM [75].
Another approach is to “intelligently” reduce the dataset size. With a significantly smaller dataset, even computationally expensive algorithms can be applied.
Reducing the dataset size in a principled manner to achieve scalability is the driving theme of this thesis, in which we propose two simple, yet highly scalable algorithms for mining clusters of arbitrary shapes, sizes
and densities.
We call our first algorithm SPARCL (an anagram of the bold letters in ShAPe-based CLusteRing). In order to achieve scalability, we exploit the linear (in the number of objects) runtime of Kmeans based algorithms while avoiding their drawbacks. Kmeans based algorithms assign all points to the nearest cluster center; thus each center represents a set of objects that collectively approximates the shape of a d-dimensional hypersphere. When the number of centers is small, each such hypersphere covers a larger region, thus leading to incorrect partitioning of a
dataset with arbitrary shapes. Increasing the number of centers reduces the region
covered by each center. SPARCL exploits this observation by first using a smart
strategy for sampling objects from the entire dataset. These objects are used as initial seeds of the Kmeans algorithm. On termination, Kmeans yields a set of centers.
In the second step, a similarity metric for each pair of centers is computed. The
similarity graph representing pairwise similarities between the centers is partitioned
to generate the desired final number of clusters.
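The two-phase pipeline just described can be sketched as follows. This is a minimal illustration under our own naming (not the thesis implementation): KMeans over-segments the data into many small seed clusters, and an agglomerative merge over the seed centers stands in for the density-aware similarity graph partitioning described above.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def sparcl_like(X, num_seeds=50, num_clusters=5):
    """Two-phase sketch: many small seed clusters, then merge their centers."""
    # Phase 1: over-segment the data into many small, roughly convex seed clusters.
    km = KMeans(n_clusters=num_seeds, n_init=10).fit(X)
    seed_labels, centers = km.labels_, km.cluster_centers_

    # Phase 2: merge neighboring seed clusters. Plain Euclidean distance between
    # centers stands in for the density-aware similarity metric of the thesis.
    merge = AgglomerativeClustering(n_clusters=num_clusters, linkage="single")
    center_groups = merge.fit_predict(centers)

    # Map every point to the group of its seed center.
    return center_groups[seed_labels]

# Example usage: labels = sparcl_like(np.random.rand(10000, 2), num_seeds=100, num_clusters=3)
```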
The second algorithm proposed in this thesis is inspired by the concept of
skeletonization from the image processing literature. A skeletonized dataset is much
smaller as compared to the original dataset and has much less noise. The reduction
in the amount of noise makes the data cleaner resulting in efficient identification of
the clusters. The reduction in the dataset size, on the other hand, contributes to
the scalability of the clustering algorithm. In order to achieve the same effect as
skeletonization we define two operations on the data – globbing and displacement.
These two operations are performed on the dataset in an iterative fashion. The
stopping criterion for the iterative process is based on a Minimum Description Length
(MDL) principle formulation. On termination, a skeletonized dataset is obtained.
In a second step, clusters from the reduced dataset are obtained, either by applying
a hierarchical or a spectral clustering algorithm.
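The glob-and-move iteration can be roughed out as below. This is our own simplified sketch, not the exact criteria of Chapter 4: in each pass, lighter points within a small radius of a heavier point are absorbed into it ("globbing"), surviving points drift towards the mean of their nearest neighbors, and a fixed iteration budget stands in for the MDL-based stopping test.

```python
import numpy as np
from scipy.spatial import cKDTree

def backbone_sketch(X, k=10, glob_radius=0.05, num_iters=8):
    """Iteratively glob and move points to expose a skeleton-like backbone."""
    pts = np.asarray(X, dtype=float).copy()
    weights = np.ones(len(pts))      # w_i: number of original points a survivor represents
    for _ in range(num_iters):
        # Globbing: absorb lighter points that fall within glob_radius of a point.
        tree = cKDTree(pts)
        keep = np.ones(len(pts), dtype=bool)
        for i in range(len(pts)):
            if not keep[i]:
                continue
            for j in tree.query_ball_point(pts[i], glob_radius):
                if j != i and keep[j] and weights[j] <= weights[i]:
                    weights[i] += weights[j]
                    keep[j] = False
        pts, weights = pts[keep], weights[keep]
        if len(pts) <= 1:
            break
        # Movement: shift each survivor towards the mean of its k nearest neighbors.
        tree = cKDTree(pts)
        _, nbrs = tree.query(pts, k=min(k, len(pts)))
        pts = pts[nbrs].mean(axis=1)
    return pts, weights
```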
To summarize, we make the following key contributions in this work:
1. We propose a new, highly scalable algorithm, SPARCL, for arbitrary shaped
clusters, that combines partitional and hierarchical clustering in the two phases
of its operation. The overall complexity of the algorithm is linear in the number
of objects in the dataset.
2. SPARCL takes only two parameters – number of initial centers and the number
of final clusters expected from the dataset. Note that the number of final
clusters to find is typically a hyper-parameter of most clustering algorithms.
3. Within the second phase of SPARCL we define a new function that captures
similarity between a pair of cluster centers. This function encapsulates the
distance between the clusters as well as the density of the pair of clusters.
4. The second, backbone detection based algorithm applies two simple operations – globbing and displacement – on the dataset to identify the skeleton
(backbone) of the clusters. On repeated application of these operations, the
cluster backbone emerges. Hierarchical clustering on the backbone produces
the final set of clusters.
5. We perform a variety of experiments on both real and synthetic shape clustering datasets to show the strengths and weaknesses of our approaches. We
show that our methods are an order of magnitude faster than the best current
approaches.
1.3 Thesis Outline
Chapter 2 provides a comprehensive introduction to clustering algorithms,
with specific emphasis on shape-based clustering algorithms. Certain key shape-based clustering algorithms from the literature are discussed. Chapter 3 focuses on
the SPARCL algorithm. Chapter 3 also provides a thorough experimental comparison with related algorithms. Chapter 4 introduces the second algorithm for identifying arbitrary shaped clusters. Since SPARCL performs better than the state-of-the-art algorithms, the backbone based algorithm in Chapter 4 focuses
on comparison with SPARCL. Finally, Chapter 5 discusses the future directions.
Future efforts involve extending SPARCL to identify clusters in subspaces. Some
other directions include using the seed selection procedure outlined in Chapter 3 to
index shapes. Using concepts from the graph sparsification literature to obtain a sparse dataset is another interesting line of work to improve scalability. The concept of tree
spanners is one such idea.
CHAPTER 2
Background and Related Work
This chapter covers fundamentals of clustering followed by an overview of various
clustering algorithms. A few shape-based clustering algorithms (such as DBSCAN
and CHAMELEON) are discussed in further detail since we compare SPARCL with
them in Chapter 3.
2.1 Clustering Preliminaries
2.1.1 Data Types
The objects in the dataset are assumed to be in a d-dimensional feature space.
The type of data associated with each feature determines the overall type of the
object. Depending on the application, each feature of an object can have a different
data type associated with it. The most common data types include numeric, binary,
categorical (also known as nominal), ordinal, or a combination of them. Numeric
features have real values. Binary features capture the presence or absence of the
feature for an object. Categorical data is a generalization of binary data to more
than two choices. Ordinal data is characterized by the presence of order information between the values. For instance, in the medal tally of countries taking part in the Olympics, gold, silver and bronze have an order associated with them. Most clustering algorithms cater to numeric data, while some others can handle categorical data [38, 46]. A few can cluster mixed data, i.e., some categorical features
along with numerical features [20].
2.1.2 Distance Measures
The main operation in clustering is to group similar objects together and to
keep dissimilar objects far apart. The similarity is defined in terms of some distance
metric. The distance measure is chosen based on the data types associated with the
features of an object. A variety of distance measures have been proposed in the
literature, which include:
Minkowski distance: The Minkowski distance of order p (the p-norm distance) is given by

dist(x, y) = \left( \sum_{i=1}^{d} | x_i - y_i |^p \right)^{1/p}   (2.1)

where x, y ∈ R^d and x_i represents the value at the ith dimension. The Minkowski distance is applicable for d-dimensional Euclidean spaces. Manhattan distance and Euclidean distance are special cases of the Minkowski distance with p = 1 and p = 2,
respectively.
                 Object y
                  1     0
   Object x   1   a     b
              0   c     d

Figure 2.1: Contingency Table for Jaccard Co-efficient
Jaccard Coefficient: The distance between two binary valued objects can be
calculated with the Jaccard Coefficient. Given the contingency table as shown in
Figure 2.1, the Jaccard coefficient is given by
sim(x, y) = \frac{a}{a + b + c}   (2.2)
where a and d indicate the number of dimensions in which the two objects have
the same binary value of 1 and 0, respectively. Similarly, c and b count the number
of dimensions in which the two objects have different binary values. The Jaccard
coefficient is an asymmetric measure. A symmetric distance measure for binary
data computes the ratio of number of dissimilar features to the total number of
features, given by
sim(x, y) = \frac{b + c}{a + b + c + d}   (2.3)
Cosine Measure: In order to compute the cosine distance, the set of features for
each object is treated as a vector. The distance between two objects is the cosine
of the angle between the corresponding vectors. This measure is frequently used for
text documents, due to its scale and length invariant properties.
The choice of the distance/similarity measure is also dependent on the application and the properties of the data. For instance, Discrete Wavelet Transform [63],
Discrete Fourier Transform [79] and Dynamic Time Warping [66] are commonly used for time-series data; edit distance and its variations for sequence data such
as protein/gene sequence; Pearson Correlation [98] for collaborative filtering; and
Spearman correlation [107] for ordinal data.
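For concreteness, the measures above translate directly into a few lines of NumPy. The sketch below uses our own function names and assumes real-valued vectors for the Minkowski and cosine measures and 0/1 vectors for the binary measures.

```python
import numpy as np

def minkowski(x, y, p=2):
    """p-norm distance (Eq. 2.1); p=1 gives Manhattan, p=2 Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def jaccard_similarity(x, y):
    """Jaccard coefficient (Eq. 2.2) for binary 0/1 vectors."""
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    return a / (a + b + c)

def symmetric_binary_distance(x, y):
    """Ratio of mismatching features to all features (Eq. 2.3)."""
    return np.mean(x != y)          # equals (b + c) / (a + b + c + d)

def cosine_distance(x, y):
    """One minus the cosine of the angle between the two feature vectors."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```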
2.2 Dominant Clustering Paradigms
In the following section, we outline some properties based on which the clus-
tering algorithms can be distinguished. Some of these properties pertain to the
output of the clustering algorithms, some others to the type of data accepted by the
algorithm and the rest to the parameters associated with the algorithms.
2.2.1 Differentiating Properties
Since the field of cluster analysis has been in existence for a long time, many
clustering algorithms have been proposed. Although many of these algorithms might
seem similar at the outset, there is a set of properties that can help differentiate
between them. Observing the algorithms with respect to these properties brings
out the differences between them. The properties have been organized into related
groups.
Performance: Properties related to the efficiency and scalability of the algorithm.
• Time and Space complexity: The time and space complexities are important from the point of view of scalability to larger datasets. This also includes whether the algorithm needs the entire pair-wise similarity (or distance) matrix to be computed. For large datasets, computing the entire pair-wise
similarity matrix is prohibitive, both in terms of space and time.
• High-dimensionality: Can the algorithm scale to higher dimensions?
Membership: Properties related to the membership/representative information
resulting from the algorithm.
• Representatives: Does the algorithm produce a set of representatives for
the identified clusters?
• Hard versus soft: Does the algorithm assign each object to a single fixed
cluster (hard clustering) or does it result in a probability distribution over
cluster membership (soft clustering)?
• Outlier Detection: Does the algorithm distinguish between outliers and
cluster members, or are outliers assigned to one of the identified clusters?
Robustness: Properties capturing sensitivity to external effects.
• Data order dependency: Does the output or performance of the algorithm
depend on the order in which the points are processed?
• Parameters and their effect: Algorithms with a smaller number of parameters are favored. Moreover, it is important to understand the effect of changes
in the parameter values on the final clustering. Sensitivity of the clustering
results to changes in parameter values reflects the lack of robustness of the
algorithm. A robust algorithm is definitely preferred.
Cluster type: Properties that capture the type of clusters identified by the algorithm.
• Shape-based clusters: Is the algorithm able to identify clusters with arbitrary shapes and diverse densities?
• Subspace clusters: Can the algorithm identify clusters that lie in a subset of the d dimensions, called a subspace? Each cluster can lie in a different
subspace.
Input Parameters: Properties related to inputs provided to the algorithm.
• Data type: Defines the data types (from Section 2.1.1) that can be handled
by the algorithm.
• Distance measure: The distance measures (from Section 2.1.2) that can be
used by the algorithm.
• Prior knowledge: Does the algorithm depend on assumptions regarding the data? Strong assumptions naturally restrict the applicability of the algorithm across a wide range of datasets.
2.2.2 Categorization
Due to the large number of potential application domains, many flavors of
clustering algorithms have been proposed [59, 84]. Categorizing them helps in understanding their differences. Although Section 2.2.1 outlined some of the differentiating properties, the mode of operation is the most common basis of categorization.
Figure 2.2 provides a taxonomy of clustering algorithms based on their mode of
operation. Broadly, they can be categorized as variance-based, hierarchical, partitional, spectral, probabilistic/fuzzy and density-based. However, the common task
among all algorithms is that they compute the similarities (distances) among the
data points to solve the clustering problem. The definition of similarity or distance
varies based on the application domain. For instance, if the data instance is modeled
as a point in d-dimensional linear subspace, Euclidean distance generally works well.
However, in applications like image segmentation or spatial data mining, Euclidean
distance based measures do not generate the desired clustering solution. Clusters in
these applications generally form a dense set of points that can represent (physical)
objects of arbitrary shapes. The Euclidean distance measure fails to isolate those
objects since it favors compact and spherical shaped clusters. Below we review each
of the major clustering paradigms (as illustrated in Figure 2.2).

Figure 2.2: Taxonomy of Clustering Algorithms – hierarchical (agglomerative, divisive), partitional (k-means, k-medoids), probabilistic (expectation maximization), graph-theoretic, density-based (connectivity-based, density-function based), grid-based, spectral, and evolution/neural-net based (SOM).

2.2.2.1 Partitional Clustering
The partitioning based methods aim to divide the set of objects D into k
disjoint sets. An optimal clustering is obtained when the division of objects results
in k sets such that the following two conditions are satisfied: (1) points in a set are
“close” to other points in the same set, (2) points belonging to two different sets are
as “far apart” as possible. To obtain the optimal clustering one has to enumerate all
possible partitions. Since the number of partitions is exponential in the number of objects, this approach is naturally infeasible.
This leads to (non-optimal) algorithms that incrementally obtain a better partitioning based on certain heuristics. Such algorithms are known as iterative relocation methods, based on their mode of operation. The algorithms start with a random partitioning of the objects (other initializations have also been proposed in the literature and are discussed later). In each subsequent iteration the points can
be moved to a different cluster, as long as the quality of the clustering is improved.
The procedure concludes when moving the points does not lead to any further improvement in the quality of the clustering. Partitional algorithms broadly fall under
two sub-categories:
k-means: In this strategy, each cluster is represented by a mean point. The mean
point is the arithmetic mean of all the points belonging to a cluster, along each
dimension. During each iteration, objects are assigned to the mean point closest
to them. This could change the objects assigned to a cluster, which in turn could
change the mean point. This process continues until one of the following conditions
is satisfied: (1) no object moves to a different cluster, or (2) the change in the mean
points for each cluster is below a pre-determined threshold. The clustering quality
metric for k-Means is the Sum of Square Error (SSE). The SSE also serves as
the optimization criterion for the k-Means procedure. Given a clustering C with
clusters C1 , C2 , ..., Ck , the Sum of Square Error is given by
SSE(C) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - c_j \|^2   (2.4)
where the objects are represented by xi , and cj is the mean point for cluster Cj .
It can be shown that for the above SSE, the k-means algorithm converges monotonically to a local minimum. We outline the k-means algorithm in Figure 2.3. This
version of k-means is popularly known as Lloyd's algorithm [77]. The random
initialization is attributed to Forgy [36]. The described algorithm has a time complexity that is linear in the number of points and the number of clusters. The
complexity can be denoted by O(nke), where e is the number of times lines 4–9 are
executed.
k-means(D, k):
1. Cinit = pick_random_init_center(D, k)
2. M = assign_obj_to_center(D, Cinit)
3. repeat
4.   Cnew = compute_centers(D, M)
5.   Mnew = assign_obj_to_center(D, Cnew)
6.   change = compute_change(M, Mnew)
7.   M = Mnew
8. until change == false

Figure 2.3: The k-means Algorithm

k-medoids: The k-medoids is a variation of the k-means strategy. Here, each cluster is represented by the medoid of the cluster. A medoid is the object that is closest to the center of the cluster. Medoids are less affected by outlier points, as
a result the medoids based approach is more robust. At the same time, computing
medoids is computationally more expensive as compared to computing means.
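As a concrete companion to Figure 2.3, the following is a minimal NumPy sketch of Lloyd's algorithm with Forgy (random) initialization; the names and structure are ours, not code from this thesis.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm with Forgy (random) initialization, mirroring Figure 2.3."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Pick k distinct input points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no membership change: converged
        labels = new_labels
        # Update step: recompute each center as the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    sse = ((X - centers[labels]) ** 2).sum()  # Sum of Square Error, Eq. (2.4)
    return labels, centers, sse

# Example usage: labels, centers, sse = kmeans(np.random.rand(1000, 2), k=5)
```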
2.2.2.2 Hierarchical Clustering
Hierarchical clustering, as the name suggests, creates a hierarchy of clusters.
The hierarchical arrangement of the clusters results in a tree-like structure called a
dendrogram. Broadly, two disparate approaches are proposed in the literature for
obtaining hierarchical clusters. Agglomerative hierarchical clustering starts
out with each point being in a separate cluster. During each subsequent step, the
“closest” clusters are merged to form a new cluster at a higher level in the hierarchy.
This process continues till the desired number of clusters are obtained. Divisive
hierarchical clustering takes a top-down approach. It starts with a single cluster
consisting of all the objects. At each step, a cluster is broken into two sub-clusters,
until a stopping condition is satisfied. Examples of stopping conditions include: (1)
reaching the desired number of clusters, or (2) the minimum distance between a
pair of clusters being greater than a predetermined threshold.
For agglomerative
clustering the following distance measures are commonly used:
1. Single-link [106]: The distance between two clusters Ci and Cj is given by the
minimum distance between two points, one of which is in Ci and the other in
Cj .
SL(Ci , Cj ) = min{dist(x, y) | x ∈ Ci , y ∈ Cj }
(2.5)
2. Complete-link, also known as the farthest neighbor, is given by the expression
CL(Ci , Cj ) = max{dist(x, y) | x ∈ Ci , y ∈ Cj }
(2.6)
3. Average-link [111], also known as the minimum variance method, determines
the distance between two clusters by the expression
AL(C_i, C_j) = \frac{\sum_{x \in C_i, y \in C_j} dist(x, y)}{| C_i | \times | C_j |}   (2.7)
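The three linkage criteria above translate directly from Equations 2.5–2.7; below is a minimal Python sketch (function names are ours) in which each cluster is an array of its member points.

```python
import numpy as np
from scipy.spatial.distance import cdist

def single_link(Ci, Cj):
    """SL: minimum pairwise distance between the two clusters (Eq. 2.5)."""
    return cdist(Ci, Cj).min()

def complete_link(Ci, Cj):
    """CL: maximum pairwise distance, i.e., farthest neighbor (Eq. 2.6)."""
    return cdist(Ci, Cj).max()

def average_link(Ci, Cj):
    """AL: mean of all pairwise distances between the clusters (Eq. 2.7)."""
    return cdist(Ci, Cj).mean()

# Example usage: single_link(np.random.rand(30, 2), np.random.rand(40, 2))
```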
Hierarchical clustering, unlike partitional clustering, does not change the cluster
membership of an object. Smaller clusters can be merged to form bigger clusters,
but otherwise objects cannot drastically change membership. This is an inherent
drawback of the hierarchical mode of clustering.
Some clustering algorithms combine hierarchical clustering with other clustering algorithms. BIRCH [126] builds a tree-like summary structure (called a Clustering
Feature tree) corresponding to the hierarchical arrangement. Conceptually, the CF
tree is similar to a B+-tree. Each node of the CF tree contains the summary statistics for the cluster corresponding to the tree node. In the second step, BIRCH
employs any clustering algorithm to cluster the leaf nodes of the CF tree. Another
algorithm, CURE [45], combines the centroid-based approach with hierarchical clustering. Instead of assigning a single centroid to a cluster, a large number of centroids
are associated with a cluster. The distance between two clusters is the single-link
distance between the centroids of the clusters. Additionally, the centroids associated
with a cluster are pulled in towards the center of the cluster by a fixed fraction of the
distance. This enables CURE to capture clusters that are non-spherical in shape.
CHAMELEON is another popular clustering algorithm. CHAMELEON employs a
combination of graph-partitioning and hierarchical clustering to obtain the final set
of clusters. It can capture clusters with arbitrary shapes and sizes. CHAMELEON
is discussed in detail in Section 2.3.2.
2.2.2.3 Probabilistic/fuzzy Clustering
Under fuzzy/probabilistic clustering each object x is assigned a probability of
belonging to a cluster Ci . The concept of probabilistic membership is commonly
known as soft clustering. The most popular fuzzy clustering algorithm is the Fuzzy
C-Means [10] which is a variation of the regular k-means. In Fuzzy C-Means, the
following weighted squared error is minimized
J = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \| x_i - c_j \|^2, \qquad c_j = \frac{\sum_{i=1}^{n} u_{ij} x_i}{\sum_{i=1}^{n} u_{ij}}   (2.8)
where uij is the fraction denoting the likelihood of object xi belonging to cluster Cj
with center cj. It can be shown that with this objective function, a local optimum can be reached by following a k-means style algorithm.
The Expectation Maximization (EM) algorithm [28] is a popular algorithm for probabilistic clustering. For a mixture model wherein each cluster is generated from
a distribution, the EM algorithm determines the parameter values for the distributions. Expectation Maximization is a Maximum Likelihood Estimation (MLE)
method for the mixture parameters. Intuitively, the maximum likelihood estimate selects the parameter values that maximize the likelihood of the observed data. If
the parameters are indicated by the variable α = {α1 , α2 , . . . , αk }, the EM selects
the value of α that maximizes the probability Pr(D | α). Like k-means, EM is an
iterative algorithm. Each iteration is composed of two steps:
• E-step: In this step the algorithm computes a lower bound for the expected
value of the likelihood function, under the current estimates of the parameters.
• M-step: In the M-step, the algorithm computes the new estimate for the
parameters which maximizes the expected value of the likelihood function
computed in the E-step.
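For intuition, the E- and M-steps can be exercised with an off-the-shelf Gaussian mixture implementation. The snippet below is a minimal sketch using scikit-learn (not notation from this thesis); predict_proba plays the role of the soft memberships computed in the E-step.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic Gaussian clusters in 2D.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", max_iter=100)
gmm.fit(X)                    # alternates E- and M-steps internally until convergence
soft = gmm.predict_proba(X)   # soft memberships: probability of each object under each component
hard = gmm.predict(X)         # argmax over components gives a hard clustering
print(gmm.means_, soft[:3])
```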
2.2.2.4 Graph-theoretic Clustering
A vast amount of work has been done within the graph-theory and network
analysis community on clustering algorithms. Usually, the approach taken involves
concepts related to influence propagation, graph cut algorithms or community detection methods. In [37], the authors use the concept of affinity propagation between
data points (objects). In this iterative algorithm, messages reflecting the affinity
of a node towards another are passed between nodes. The result of this process
is a set of “exemplars” that correspond to the cluster representatives. Each point
is associated with the closest “exemplar”. Other algorithms that are grounded in
concepts from network flow include [33, 56]. With the popularity of social networks
there has been a renewed interest in graph-based clustering methods. In an earlier
work [48], the authors propose the concept of separating operators which, when applied iteratively to the nodes, bring out the clusters within a graph. The authors define
a separating operator based on the circular escape probability between nodes in a
graph.
2.2.2.5 Grid-based Clustering
The grid based clustering methods operate by partitioning the d-dimensional
space along each dimension. This results in a grid-like structure over the input space.
STING (STatistical INformation Grid) [115] is an example of grid-based clustering.
For each cell resulting from the grid, STING captures statistical information (e.g.
standard deviation, mean, etc.) from the objects within that cell. This forms the
first level of cells. Like hierarchical clustering, cells belonging to the first level are
combined to form larger cells at the next level. The statistical attributes of cells
at higher resolution can be computed from the cells in the immediate lower level.
As such, STING provides a multi-resolution clustering. Another grid-based algorithm, WaveCluster [104], utilizes concepts from signal processing. After generating
the grids, WaveCluster applies discrete Wavelet transform to the objects within a
cell. The wavelet transform identifies boundaries of the clusters as high frequency
regions. Neighboring cells are combined using connected components.
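A grid summary of this kind can be sketched in a few lines. This is a minimal illustration of the general idea, not STING itself: bucket the points into equal-width cells and keep per-cell statistics that a coarser level could later aggregate.

```python
import numpy as np

def grid_summary(X, cells_per_dim=10):
    """Assign each point to a grid cell and record count/mean/std per occupied cell."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Integer cell index of each point along every dimension (clipped into range).
    idx = np.clip(((X - lo) / (hi - lo + 1e-12) * cells_per_dim).astype(int),
                  0, cells_per_dim - 1)
    summary = {}
    for key in set(map(tuple, idx)):
        members = X[(idx == np.array(key)).all(axis=1)]
        summary[key] = {"count": len(members),
                        "mean": members.mean(axis=0),
                        "std": members.std(axis=0)}
    return summary

# Example usage: stats = grid_summary(np.random.rand(5000, 2), cells_per_dim=20)
```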
2.2.2.6 Evolution and Neural-net based Clustering
Many clustering algorithms have been proposed from the neural networks and
genetic algorithms communities. A Self Organizing Map (SOM), also known as a
Kohonen map, is a type of artificial neural network that uses a vector quantization
technique to map objects in high dimensional data space to a lower dimensional
space. This mapping leads to grouping of objects in the lower dimensional space.
SOMs were initially designed as a data visualization technique. From the evolutionary computing side, genetic algorithms have also been used for clustering [82], to
navigate the feature space in search of appropriate cluster centers.
Due to the large body of work related to clustering, a complete coverage of
the algorithms is beyond the scope of any single document. For instance, clustering
algorithms that draw inspiration from natural and physical phenomena – colonies
of ants [11], flocks of birds [25], force of gravity [58] and magnetic fields [12] – have
not been discussed.
The algorithms discussed above are specifically for identifying groups of related
objects. Although not stated explicitly, these algorithms assume that the objects in
a cluster span across all the dimensions. Such algorithms are known as full space
clustering algorithms. Certain clustering algorithms, called subspace clustering
algorithms, identify clusters that lie in a space spanned by a subset of the dimensions or some linear/non-linear combination of the dimensions. Given a dataset of
objects, these algorithms are able to capture the subspaces spanned by the objects
in a cluster [3, 94].
Additional data in the form of constraints can be provided to a clustering
algorithm. Common forms of constraints include instance level must-link constraints
and instance level cannot-link constraints. The must-link constraint between a pair
of objects enforces the points to belong to the same cluster. On the other hand,
a cannot-link constraint disallows a pair of objects from being grouped together in
the same cluster. Algorithms that incorporate constraints are commonly known as
constraint based clustering algorithms [26].
Certain recent classes of algorithms treat the objects in d-dimensional space
as a 2-dimensional n × d matrix. Linear algebra based factorization methods, such
as Non-negative Matrix Factorization [73], have been shown to be related to conventional methods such as k-means and spectral clustering. Another branch within
clustering, termed co-clustering [29], aims at grouping features in addition to the
objects. For every object cluster, a group of features is associated, such that a high
correlation exists between the object-feature pairs.
2.3 Review of Shape-based Clustering Methods
A comprehensive survey of arbitrary shape clustering with a focus towards
spatial clustering is provided in [84]. Here we review some of the pioneering methods.
2.3.1 Density-based Clustering
Figure 2.4: DBScan – Density reachability, core points and noise points (minPts = 2).
DBSCAN [34] was one of the earliest algorithms that addressed arbitrary shape
clustering. It defines two parameters – eps which is the radius of the neighborhood
of a point, and MinPts which is the minimum threshold for the number of points
within eps radius of a point. A point is labeled as a core point if the number of points
within its eps neighborhood is at least MinPts. Based on the notion of density-based
reachability, a cluster can be defined as the maximal set of reachable core points, i.e.,
such that each core point is within the eps neighborhood of at least one other core
point in the cluster. Other (border) points that are within the neighborhood of core
points are also added to the same cluster (ties are broken arbitrarily or in the order
of visitation). Points that are not core and not reachable from a core are labeled
as noise. Figure 2.4 shows the three clusters obtained with minPts set to 2. Points
A through D are core points and D is density reachable from A. Two noise points
are shown in Figure 2.4(b). The main advantages of DBSCAN are that it does not
require the number of desired clusters as an input, and it explicitly identifies outliers.
On the flip side, DBSCAN can be quite sensitive to the values of eps and MinPts,
and choosing correct values for these parameters is not that easy. DBSCAN is also
an expensive method, since in general it needs to compute the eps neighborhood
for each point, which takes O(n2 ) time, especially with increasing dimensions; this
time can be brought down to O(n log n) in lower dimensional spaces, via the use of
spatial index structures like R∗ -trees.
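As a concrete illustration of the eps/MinPts definitions above, the following is a minimal brute-force sketch (quadratic in n, matching the worst-case cost just discussed); it is illustrative and omits the spatial-index optimization.

import numpy as np

def dbscan(points, eps, min_pts):
    # Brute-force DBSCAN sketch; label -1 marks noise points.
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]      # eps-neighborhoods (include the point)
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster_id                                       # start a new cluster at a core point
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id                               # border/unvisited point joins the cluster
                if core[j]:
                    frontier.extend(neighbors[j])                    # core points keep growing the cluster
        cluster_id += 1
    return labels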
DENCLUE [52, 54] is a density based clustering algorithm based on kernel
density estimation. DENCLUE models the impact of a data point within its neighborhood as an influence function. The influence function is defined in terms of
the distance between the two points. The density function at a point in the data
space is expressed in terms of the influence functions acting on that point. Clusters are determined by identifying density attractors which are local maxima of the
density function. The density attractors are identified by performing a gradient
ascent type algorithm over the space of influence functions. Both center-defined
and arbitrary-shaped clusters can be identified by finding the set of points that are
density attracted by a density attractor. DENCLUE shares some of the same limitations of DBSCAN, namely, sensitivity to parameter values, and its complexity is
O(n log m + m2 ), where n is the number of points, and m is the number of populated cells. In the worst case m = O(n), and thus its complexity is also O(n2 ). The
recent DENCLUE2.0 [53] method practically speeds up the time by adjusting the
step size in the hill climbing approach. An extension [31] of DENCLUE, proposes a
grid approximation to deal with large datasets.
2.3.2 Hierarchical Clustering
The arbitrary shape clustering problem has also been modeled as a hierarchi-
cal clustering task. For example, Kaufman and Rousseeuw [65] proposed one of the earliest agglomerative methods that can handle arbitrary shape clusters, which they termed elongated clusters. They compute the similarity between two clusters A
and B as the smallest distance between a pair of objects from A and B respectively.
This method is computationally very expensive due to the pairwise similarity computations, with a complexity of O(n² log n). Moreover, the presence of outlier points in the boundary region between two distinct clusters can cause wrong merging decisions. In a recent work [68], the authors propose a hierarchical clustering algorithm
based on an approximate nearest neighbor search – Locality-Sensitive Hashing [4].
This approach considerably improves the time complexity of the algorithm.
CURE [45] is another hierarchical agglomerative clustering algorithm that handles shape-based clusters. It follows the nearest neighbor distance to measure the
similarity between two clusters as in [65], but reduces the computational cost significantly. The reduction is achieved by taking a set of representative points from each
cluster and engaging only these points in similarity computations. To ensure that
the representative points are not outlier points, the representatives are pulled in, by a
predetermined factor, towards the mean of the cluster. CURE is still expensive with
its quadratic complexity, and more importantly, the quality of clustering depends
enormously on the sampling quality. In [64], the authors show several examples
where CURE failed to obtain the desired shape-based clusters.

Figure 2.5: CHAMELEON Clustering Steps. Figure from [64].

CHAMELEON [64] also formulates shape-based clustering as a hierarchical clustering problem over a graph partitioning algorithm. An m-nearest neighbor graph is generated for the
input dataset, for a given number of neighbors m. This graph is partitioned into
a predefined number of sub-graphs (also referred as sub-clusters). The partitioned
sub-graphs are then merged to obtain the desired number of final k clusters. This
process is illustrated in Figure 2.5. CHAMELEON introduces two measures – relative inter-connectivity (RI) and relative closeness (RC) – that determine if a pair
of clusters can be merged. Relative inter-connectivity is defined as ratio of the total
edge cut between the two sub-clusters and the mean internal connectivity of the
sub-clusters. It is given by the expression
RI = \frac{EC(C_i, C_j)}{\frac{1}{2}\left(EC(C_i) + EC(C_j)\right)}    (2.9)
where EC(Ci , Cj ) is the sum of the edges in the m-nearest neighbor graph that
connect cluster Ci and Cj , EC(Ci ) is the minimum sum of the cut edges if cluster Ci
is bisected. The internal connectivity is defined as the weight of the cut that divides a sub-cluster into equal parts.

Figure 2.6: CHAMELEON – Relative Interconnectivity. Figure from [64].

The relative inter-connectivity measure ensures that
sub-clusters having a small bridge connecting them are not merged together. The RI
measure can be explained using Figure 2.6. Although both Figures 2.6(a) and 2.6(b)
have almost the same edge cut, the mean internal connectivity is very different.
The two circular clusters in Figure 2.6(b) have a much higher internal connectivity
resulting in a smaller value for RI.
Relative closeness is the ratio of the absolute closeness to the internal closeness of
the two sub-clusters, where absolute closeness is the mean edge cut between the two
clusters, and the internal closeness of a cluster is the average edge cut that splits it
into two equal parts. It is given by the expression
RC = \frac{\bar{S}_{EC}(C_i, C_j)}{\frac{m_i}{m_i + m_j}\bar{S}_{EC}(C_i) + \frac{m_j}{m_i + m_j}\bar{S}_{EC}(C_j)}    (2.10)
where mi and mj are the sizes of clusters Ci and Cj respectively. S̄EC (Ci , Cj ) is the
average weight of the edges between clusters Ci and Cj and S̄EC (Ci ) is the average
weight of the edges if cluster Ci was bisected. Relative closeness ensures that the
two merged sub-clusters have the same density. Moreover, this measure ensures
that the distance between the two sub-clusters is comparable with their internal
densities. Sub-clusters having high relative closeness and relative inter-connectivity
are merged. CHAMELEON is robust to the presence of outliers, partly due to the
m-nearest neighbor graph which eliminates these noise points. This very advantage
turns into an overhead when the dataset size becomes considerably large, since computing the nearest neighbor graph can take O(n2 ) time as the dimensions increase.
Figure 2.7 helps understand the RC measure. The S̄EC (Ci , Cj ) measure for the
clusters in Figure 2.7(a) is small as compared to the denominator in Equation 2.10.
This results in a small value for RC for these clusters. On the other hand, even
though the S̄EC (Ci , Cj ) value for the clusters in Figure 2.7(b) might be small, the
denominator is also small due to the within cluster sparsity. The net effect is a high
value for RC, indicating a possible merger of the two clusters.
Figure 2.7: CHAMELEON – Relative Closeness. Figure from [64].
2.3.3 Spectral Clustering
Proposed in the pattern recognition community, the spectral clustering meth-
ods are capable of handling arbitrary shaped clusters. The data points are represented as a weighted undirected graph, where the weights denote the similarities
between the nodes (data points). Let W be the symmetric weight matrix. The degrees of the nodes in the graph are captured in the diagonal matrix D with entries d1 , d2 , . . . , dn . The normalized Laplacian matrix L is given by L = I − D^{-1}W.
Table 2.1: Summary of spatial (shape-based) clustering algorithms:

• BIRCH (1996) – Time: O(n), incremental; Space: O(nB), B: branching factor; Parameters: T: radius of leaf clusters, L: entries in a leaf; Noise handling: identifies noise points.
• DBSCAN (1996) – Time: worst case O(n²), with a spatial index O(n log n); Space: O(n); Parameters: eps: radius, minPts: minimum no. of points within eps; Noise handling: identifies noise points.
• CURE (1998) – Time: O(n² log n), sampling+partitioning schemes reduce complexity; Space: O(n); Parameters: c: # of representatives, α: shrinkage factor, s: random sample size; Noise handling: identifies noise points at 2 stages.
• DENCLUE (1998) – Time: O(n log m + m²), m: # of populated cells; Space: O(n); Parameters: σ: density parameter, ξ: noise threshold; Noise handling: identifies noise points.
• WaveCluster (1998) – Time: O(n); Space: O(c^d); Parameters: c: # of cells in each dimension; Noise handling: assigns noise points to clusters.
• CHAMELEON (1999) – Time: O(ns + n log n + s² log s), s: # of small clusters generated; Space: O(pn), p: # of nearest neighbors; Parameters: p, MIN_SIZE, k, parameters for RI and RC; Noise handling: noise points become part of clusters.
• Spectral (Shi-Malik, 2000) – Time: worst case O(n³); Space: O(n²); Parameters: no. of final clusters; Noise handling: noise points become part of clusters.
• SPARCL (2008) – Time: O(n); Space: O(n); Parameters: K: # of pseudo-centers, k: no. of final clusters; Noise handling: identifies noise points.
The Laplacian matrix possesses some nice linear algebra properties, such as,
being positive semi-definite. In [105] the authors formulate the arbitrary shape
clustering problem as a normalized min-cut problem. The normalized cut for a
graph with partitions A1 , . . . , Ak is given by the expression
N cut(A_1, \cdots, A_k) = \sum_{i=1}^{k} \frac{cut(A_i, \bar{A}_i)}{vol(A_i)}    (2.11)

where cut(A, B) = \sum_{i \in A, j \in B} w_{ij} and vol(A) = \sum_{i \in A, j \in A} w_{ij}. The minimization of
the normalized cut criterion tends to result in clusters that are “balanced”. While
the simple cut() criterion has a polynomial time solution, the same cannot be said
for the normalized cut problem [114]. A relaxed version of the problem is solved
using spectral graph theory. The solution to the relaxed version is an approximation
that is obtained by computing the eigenvectors of the graph Laplacian matrix.
The basic idea is to partition the similarity graph based on the eigenvector
corresponding to the second smallest eigenvalue (the smallest eigenvalue is always
0 with eigenvector 1) of the Laplacian matrix. If the desired number of clusters is not obtained, the subgraphs are further partitioned using the lower eigenvectors as approximations for the second eigenvectors of the subgraphs. The intuitive reason for its success is its alternative similarity measure, which is shape-insensitive. [83]
shows that the similarity between two data points in the normalized-cut framework
is equivalent to their connectedness with respect to the random walks in the graph,
where the transition probability between nodes is inversely proportional to the distance between the pair of points. Although based on strong theoretical foundation,
this method, unfortunately, is not scalable, due to its high computational time and
space complexity. It requires O(n3 ) time to solve the Eigensystem of the symmetric Laplacian matrix, and storing the matrix also requires at least Ω(n2 ) memory.
There are some variations of this general approach [124], but all suffer from the
poor scalability problem. von Luxburg [80] provides a good ground up tutorial on
spectral clustering. Table 2.1 provides a summarized comparison of shape-based
clustering algorithms that have been proposed in the literature. They are arranged
in chronological order.
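As a concrete illustration of the recursive bipartitioning described above, the following minimal sketch builds a Gaussian similarity graph, forms a normalized Laplacian (the symmetric variant I − D^{-1/2}WD^{-1/2}, whose spectrum is closely related to the I − D^{-1}W form defined earlier), and splits the data using the eigenvector of the second smallest eigenvalue; the choice of sigma and the median split rule are illustrative assumptions, not part of the original method.

import numpy as np

def spectral_bipartition(points, sigma=1.0):
    # Relaxed normalized-cut sketch: split the data using the second eigenvector
    # of a normalized graph Laplacian built from a Gaussian similarity graph.
    points = np.asarray(points, dtype=float)
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    d = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    lap = np.eye(len(points)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(lap)         # eigenvalues in ascending order
    fiedler = vecs[:, 1]                     # eigenvector of the second smallest eigenvalue
    return fiedler >= np.median(fiedler)     # two-way split; recurse for more clusters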
2.3.4 SPARCL – Brief Overview
Our proposed method SPARCL [18] is based on the well known family of
Kmeans based algorithms, which are widely popular for their simplicity and efficiency [122]. Kmeans based algorithms operate in an iterative fashion. From an
initial set of k selected objects, the algorithm iteratively refines the set of representatives with the objective of minimizing the mean squared distance (also known as
distortion) from each object to its nearest representative. Kmeans based methods
are characterized by O(ndke) time complexity, where e represents the number of
iterations the algorithm runs before convergence. They are related to Voronoi tessellation, which leads to convex polytopes in metric spaces [90]. As a consequence,
Kmeans based algorithms are unable to partition spaces with non-spherical clusters
or in general arbitrary shapes. However, in this thesis we show that one can use
Kmeans type algorithms to obtain a set of seed representatives, which in turn can
be used to obtain the final arbitrary shaped clusters. In this way, SPARCL retains
the linear time complexity in terms of the data points, and is surprisingly effective
as well, as we discuss next. Note that SPARCL focuses on full-space shape-based
clusters.
2.3.5 Backbone based Clustering – An Overview
The backbone based clustering algorithm is inspired by the concept of skele-
tonization from the image processing literature. We assume a hypothetical generative process for obtaining clusters with arbitrary shapes given the backbone. Our
idea is based on outlining this generative process in reverse. The resulting skeletonized dataset is much smaller as compared to the original dataset and has much
less noise. The reduction in the amount of noise makes the data cleaner leading to
efficient identification of the clusters. The reduction in the dataset size, on the other
hand, contributes to the scalability of the clustering algorithm. In order to achieve
the same effect as skeletonization we define two operations on the data – globbing
and displacement. These two operations are performed on the dataset in an iterative fashion. The stopping criterion for the iterative process is based on a Minimum
Description Length (MDL) principle formulation. On termination, a skeletonized
dataset is obtained. The entire process takes a single parameter – the number of
nearest neighbors. In the second step of the algorithm, clusters from the reduced
dataset are obtained, either by applying a hierarchical or a spectral clustering algorithm. We are currently exploring methods by which the clusters can be detected
without specifying the desired number of clusters as a parameter to the hierarchical
or spectral clustering algorithm.
CHAPTER 3
SPARCL: Efficient Shape-based Clustering
In this chapter we focus on a scalable algorithm for obtaining clusters with arbitrary
shapes. In order to capture arbitrary shapes, we want to divide such shapes into convex pieces. This approach is motivated by the concept of convex decomposition [100]
from computational geometry.
Convex Decomposition: Due to the simplicity of dealing with convex shapes,
the problem of decomposing non-convex shapes into a set of convex shapes has been
of great interest in the area of computational geometry. A convex decomposition
is a partition, if the polyhedron is decomposed into disjoint pieces, and it is a
cover, if the pieces are overlapping. While algorithms for convex decomposition
are well understood in 2-dimensional space, the same cannot be said about higher
dimensions [19]. In this work, we approximate the convex decomposition of an
arbitrary shape cluster by the convex polytopes generated by the Kmeans centers
that are within that cluster. Depending on the complexity of the shape, higher
number of centers may be required to obtain a good approximation of that shape.
Essentially, we can reformulate the original problem of identifying arbitrary shaped
clusters in terms of a sampling problem. Ideally, we want to minimize the number
of centers, with the constraint that the space covered by each center is a convex
polytope. One can immediately identify this optimization problem as a modified
version of the facility location problem. In fact, this optimization problem is exactly
the Minimum Consistent Subset Cover Problem (MCSC) [39]. Given a finite set
S and a constraint, the MCSC problem considers finding a minimal collection T of subsets such that \bigcup_{C \in T} C = S, where C ⊂ S, and each C satisfies the given constraint. In our case, S is the set of points and the constraint is the convex polytope constraint.
The MCSC problem is NP-hard, and thus finding the optimal centers is hard. We
thus rely on the iterative Kmeans type method to approximate the centers.
3.1 The SPARCL Algorithm
The pseudo-code for the SPARCL algorithm is given in Figure 3.1. The algo-
rithm takes two input parameters. The first one k is the final number of clusters
desired. We refer to these as the natural clusters in the dataset, and like most other
methods, we assume that the user has a good guess for k. In addition SPARCL
requires another parameter K, which gives the number of seed centers to consider
to approximate a good convex decomposition; we also refer to these seed centers as
pseudo-centers. Note that k < K ≪ n = |D|. Depending on the variant of Kmeans
used to obtain the seed centers, SPARCL uses a third parameter mp, denoting the
number of nearest neighbors to consider during a smart initialization of Kmeans
that avoids outliers as centers. The random initialization based Kmeans does not
require the mp parameter.
SPARCL operates in two stages. In the first stage we run the Kmeans algorithm on the entire dataset to obtain K convex clusters. The initial set of centers
for the Kmeans algorithm may be chosen randomly, or in such a manner that they
are not outlier points. Following the Kmeans run, the second stage of the algorithm computes a similarity metric between every seed cluster pair. The resulting
similarity matrix can act as input either for a hierarchical or a spectral clustering
algorithm. It is easy to observe that this two-stage refinement employs a cheaper
(first stage) algorithm to obtain a coarse-grained clustering. The first phase has
complexity O(ndKe), where d is the data dimensionality and e is the number of iterations Kmeans takes to converge, which is linear in n. This approach considerably
reduces the problem space as we only have to compute O(K 2 ) similarity values in
the second phase. For the second phase we can use a more expensive algorithm to
obtain the final set of k natural clusters.
SPARCL(D, K, k, mp):
1. Cinit = seed center initialization(D, K, mp)
2. Cseed = Kmeans(Cinit , K)
3. forall distinct pairs (Ci , Cj ) ∈ Cseed × Cseed
4.     S(i, j) = compute_similarity(Ci , Cj )
5. cluster_centers(Cseed , S, k)
Figure 3.1: The SPARCL Algorithm
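A compact sketch of this two-stage control flow is given below; it assumes random seed initialization, uses a simple inverse center-distance placeholder in place of the compute_similarity measure of Section 3.3, and merges seed clusters greedily, so it illustrates the structure of Figure 3.1 rather than reproducing the actual implementation.

import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    # Plain Lloyd iterations returning centers and point-to-center assignments.
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    assign = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        assign = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = points[assign == j].mean(axis=0)
    return centers, assign

def sparcl_sketch(points, big_k, k):
    # Stage 1: K (= big_k) convex seed clusters via Kmeans.
    centers, assign = kmeans(points, big_k)
    # Stage 2: pairwise seed-cluster similarities (placeholder: inverse center distance).
    sim = np.zeros((big_k, big_k))
    for i in range(big_k):
        for j in range(i + 1, big_k):
            sim[i, j] = sim[j, i] = 1.0 / (1e-9 + np.linalg.norm(centers[i] - centers[j]))
    # Greedy (single-linkage style) merging of seed clusters until k groups remain.
    group = list(range(big_k))
    while len(set(group)) > k:
        best, pair = -1.0, None
        for i in range(big_k):
            for j in range(i + 1, big_k):
                if group[i] != group[j] and sim[i, j] > best:
                    best, pair = sim[i, j], (group[i], group[j])
        a, b = pair
        group = [a if g == b else g for g in group]
    return np.array([group[s] for s in assign])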
3.2 Phase 1 – Kmeans Algorithm
The first stage of SPARCL is shown in steps 1–2 of Figure 3.1. This stage
involves running the Kmeans algorithm with a set of initial centers Cinit (line 1),
until convergence, at which point we obtain the final seed clusters Cseed . There is one
subtlety in this step: instead of using the mean point in each iteration of Kmeans, we use the actual data point in the cluster that is closest to the cluster mean.
We do this for two reasons. First, if the cluster centers are not actual points in the
dataset, chances are higher that points from two different natural clusters would
belong to a seed cluster, considering that the clusters are arbitrarily shaped. When
this happens, the hierarchical clustering in the second phase would merge parts
of two different natural clusters. Second, our approach is more robust to outliers,
since the mean point can get easily influenced by outliers. Figure 3.2(a) outlines an
example. There are two natural clusters in the form of the two rings. When we run a
regular Kmeans, using the mean point as the center representative, we obtain some
seed centers that lie in empty space, between the two ring-shaped clusters (e.g., 4, 5,
and 7). By choosing an actual data point, we avoid the “dangling” means problem,
and are more robust to outliers, as shown in Figure 3.2(b).
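One way to realize this choice is to snap each center to the cluster member nearest to the mean after every assignment step; the sketch below (illustrative, with a fixed iteration count and squared-Euclidean assignment) shows the idea.

import numpy as np

def kmeans_with_medoid_snap(points, k, n_iter=50, seed=0):
    # Lloyd iterations in which each center is snapped to the cluster member
    # closest to the mean, so centers are always actual data points.
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    assign = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        assign = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) == 0:
                continue
            mean = members.mean(axis=0)
            centers[j] = members[((members - mean) ** 2).sum(axis=1).argmin()]
    return centers, assign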
This phase starts by selecting the initial set of centers for the Kmeans algorithm. In order for the second stage to capture the natural clusters in the datasets,
it is important that the final set of seed centers, Cseed , generated by the Kmeans
algorithm satisfy the following properties:
1. Points in Cseed are not outlier points,
2. Representatives in Cseed are spread evenly over the natural clusters.
In general, random initialization is fast, and works well. However, selecting the
centers randomly can violate either of the above properties, which can lead to ill-formed clusters for the second phase. Figure 3.3 shows an example of such a case.
In Figure 3.3(a) seed center 1 is almost an outlier point. As a result the members
belonging to seed center 1 come from two different natural clusters. This results in
the small (middle) cluster merging with the larger cluster to its right.
In order to avoid such cases and to achieve both the properties mentioned
Figure 3.2: Effect of Choosing mean or actual data point. (a) Using mean point. (b) Using actual data point.
above we utilize our recently proposed outlier and density insensitive based selection
of initial centers [50]. Let us take a quick look at other initialization methods before
discussing our Local Outlier Factor based initialization technique.
3.2.1 Kmeans Initialization Methods
Although there are numerous initialization methods, we briefly discuss some
of the key ones. One of the first schemes of center initialization was proposed by
Ball and Hall [8]. They suggested use of a user defined threshold, d, to ensure that
the seed points are well apart from each other. The first point is chosen as a seed,
Figure 3.3: Bad Choice of Cluster Centers. (a) Randomly selected centers (6 centers). (b) Natural cluster split by bad center assignment.
and for any subsequent point considered, it is selected as a seed if it is at least d
distance apart from already chosen seeds, until k seeds are found. With a right
choice of the value of d, this approach can restrict the splitting of natural clusters,
but guessing a right value of d is very difficult and the quality of seeds depends on
the order in which the data points are considered.
Astrahan [7] suggested using two distance parameters, d1 and d2 . The method
first computes the density of each point in the dataset, which is given as the number
of neighboring points within the distance d1 , and it then sorts the data points
according to decreasing value of density. The highest density point is chosen as the
first seed. Subsequent seed point are chosen in order of decreasing density subject to
the condition that each new seed point be at least at a distance of d2 from all other
previously chosen seed points. This step is continued until no more seed points can
be chosen. Finally, if more than k seeds are generated from the above approach,
hierarchical clustering is used to group the seed points into the final k seeds. The
main problem with this approach is that it is very sensitive to the values of d1 and d2 .
Furthermore, users have insufficient knowledge regarding the good choices of these
parameters, and the method is computationally very expensive. A range search
query needs to be made for every data point followed by a hierarchical clustering of
a set of points. Small values of d1 and d2 may produce an enormously large number of
seeds, and hierarchical clustering of those seeds can be very expensive (O(n2 log n)
in the worst case). This method also performs poorly when there exist different
clusters in the dataset with variable density and size.
Katsavounidis et al. [57] suggested a parameterless approach, which we call the KKZ method based on the initials of the authors. KKZ chooses the first center near the “edge” of the data, by choosing the vector with the highest norm
as the first center. Then, it chooses the next center to be the point that is farthest
from the nearest seed in the set chosen so far. This method is very inexpensive
(O(kn)) and is easy to implement. It does not depend on the order of points and
is deterministic by nature; a single run suffices to obtain the seeds. However, KKZ
is sensitive to outliers, since the presence of noise at the edge of the dataset may
cause a small set of outlier/noise points to make up a cluster.
Bradley and Fayyad [14] proposed an initialization method that is suitable for
large datasets. We call their approach Subsample, since they take a small subsample
(less than 5%) of the dataset and use k-means clustering on the subsample and
record the cluster centers. This process is repeated and cluster centers from all the
different iterations are accumulated in a dataset. Finally, a last round of k-means
is performed on this dataset and the cluster centers of this round are returned as
the initial seeds for the entire dataset. This method generally performs better than
k-means and converges to the local optimal faster. However, it still depends on
the random choice of the subsamples and hence, can obtain a poor clustering in an
unlucky session.
More recently, Arthur and Vassilvitskii [6] proposed the k-means++ approach,
which is similar to the KKZ method. However, when choosing the seeds, they do
not choose the farthest point from the already chosen seeds, but choose a point with
a probability proportional to its distance from the already chosen seeds.
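The following sketch illustrates such distance-proportional seeding; the squared=True default follows the published k-means++ rule, which samples proportionally to the squared distance, and all names are illustrative.

import numpy as np

def distance_proportional_seeds(points, k, seed=0, squared=True):
    # Pick k seeds, sampling each new seed with probability proportional to its
    # (squared) distance to the nearest seed chosen so far.
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    seeds = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d2 = np.min([((points - s) ** 2).sum(axis=1) for s in seeds], axis=0)
        weights = d2 if squared else np.sqrt(d2)     # k-means++ uses the squared distance
        seeds.append(points[rng.choice(len(points), p=weights / weights.sum())])
    return np.array(seeds)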
3.2.2 Initialization using Local Outlier Factor
We chose the local outlier factor (LOF) criterion for selecting the initial set of
cluster centers. LOF was proposed in [15] as a measure for determining the degree
to which a point is an outlier. For a point x ∈ D, define the local neighborhood of
x, given the minimum points threshold mp as follows:
N(x, mp) = {y ∈ D | dist(x, y) ≤ dist(x, x_{mp})}
where xmp is the mp-th nearest neighbor of x. Thus N (x, mp) contains at least mp
points. The density of x is then computed as follows:
density(x, mp) = \left( \frac{\sum_{y \in N(x, mp)} dist(x, y)}{|N(x, mp)|} \right)^{-1}
Essentially, the lower the distance between x and neighboring points, the higher the
density of x. The average relative density (ard) of x, is then computed as the ratio
of the density of x and the average density of its nearest neighbors, given as follows:
ard(x, mp) = \frac{density(x, mp)}{\frac{1}{|N(x, mp)|}\sum_{y \in N(x, mp)} density(y, mp)}
Finally the LOF score of x is just the inverse of the average relative density
of x:
LOF(x, mp) = ard(x, mp)^{-1}
If a point is in a low density neighborhood compared to all its neighbors, then its
ard score is low and hence its LOF value is high. Thus LOF value represents the
extent to which a point is an outlier. A point that belongs to a cluster has an LOF
value approximately equal to 1, since its density and the density of its neighbors is
approximately the same.
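A brute-force sketch of the quantities defined above (neighborhood, density, ard and LOF) is given below; it computes all pairwise distances, so it is meant only to make the definitions concrete, not to be efficient.

import numpy as np

def lof_scores(points, mp):
    # LOF as defined above: the inverse of the average relative density of each point.
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    radius = np.sort(dist, axis=1)[:, mp]            # distance to the mp-th nearest neighbor
    neigh = [np.where((dist[i] <= radius[i]) & (np.arange(n) != i))[0] for i in range(n)]
    density = np.array([len(nb) / dist[i, nb].sum() for i, nb in enumerate(neigh)])
    ard = np.array([density[i] / density[nb].mean() for i, nb in enumerate(neigh)])
    return 1.0 / ard                                 # LOF(x, mp) = ard(x, mp)^{-1}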
LOF has three excellent properties: (1) It is very robust when the dataset has
clusters with different sizes and densities. (2) Even though the LOF value may vary
somewhat with mp, it is generally robust in making the decision whether a point is
an outlier or not. That is, for a large range of values of mp, the outlier points will
have LOF value well above 1, whereas points belonging to a cluster will assume an
LOF value close to 1. (3) It leads to practically faster convergence of the Kmeans
algorithm, i.e., fewer iterations.
As we reported in [50], to select the initial seeds, we use the following approach.
A non-outlier point with the largest norm is selected as the first center. The largest
norm ensures that the selected center is farthest from the origin. For selecting
subsequent centers assume that i initial centers have been chosen. To choose the
i + 1-th center, we first compute the distance of each point to each of the i chosen
centers, and sort them in decreasing order of distance. Next, in that sorted order,
we pick the first point that is not an outlier as the next seed. We repeat until K
initial seeds have been chosen, and then run the Kmeans algorithm to converge with
those initial starting centers, to obtain the final set of seed centers Cseed .
3.2.3 Complexity Analysis of LOF Based Initialization
The overall complexity of this approach can be analyzed in terms of the steps
involved. Let us assume that t ≪ n is the number of outliers in the data. While
choosing the i + 1-th center, the minimum distance of each of the n − i non-center
points from the i centers is computed. Aggregated over the K centers, the total
computational cost of this step amounts to O(nK 2 d), where d is the dimensionality
of the data. Once the minimum distances are computed, the i + 1-th center is chosen by examining points in descending order of minimum distance and selecting the
point that has an LOF value close to 1. The linear-time partition-based selection
algorithm [22] for computing the p-th largest number can be used to find the points
in descending order. In the worst case, the selection algorithm has to be invoked t
times (with p = 1 . . . t) for the i + 1-th center selection. If t is a small constant, the
selection based approach can be much more efficient as compared to the O(n log n)
sorting based selection algorithm. The aggregated computational cost for the selection phase of K centers is given by O(tnK) or O(tKn log n) in the worst case.
The LOF value for each examined point is computed during the selection stage.
The cost of computing the LOF value of a point is given by O(n^{1-1/d} · mp), since nearest neighbor queries are performed on the mp neighbors of the point and each nearest neighbor query takes O(n^{1-1/d}) time. In the worst case, t LOF computations need to be performed for each center. As a result, the LOF computation aggregated over K centers comes out to O(n^{1-1/d} · mp · t · K). Finally, adding up the costs of the above steps, the complexity of the entire process is given by O(nK²d + tKn log n + n^{1-1/d} · mp · t · K). As seen from the previous expression, the
overall time complexity is linear in the number of points in the data. On the other
hand, a random initialization for the seed centers takes O(K) time.

Figure 3.4: Local Outlier Based Center Selection. (a) Selected centers on D1 dataset.

As an example
of LOF-based seed selection, Figure 3.4 shows the initial set of centers for one of
the shape-based datasets. Section 3.6.2 provides an empirical comparison of LOF
based initialization with other initialization methods.
3.3 Phase 2 – Merging Neighboring Clusters
As the output of the first phase of the algorithm, we have a relatively small
number K of seed cluster centers (compared to the size of the dataset) along with
the point assignments for each cluster. During the second phase of the algorithm,
a similarity measure for each pair of seed clusters is computed (see lines 3-4 in
Figure 3.1). The similarity between clusters is then used to drive any clustering
algorithm that can use the similarity function to merge the K seed clusters into the
final set of k natural clusters. We applied both hierarchical as well as spectral
methods on the similarity matrix. Since the size of the similarity matrix is O(K 2 ),
as opposed to O(n2 ), even spectral methods can be conveniently applied.
Figure 3.5: Projection of points onto the vector connecting the centers.
3.3.1 Cluster Similarity
Let us consider that the d-dimensional points belonging to a cluster X are
denoted by PX and similarly points belonging to cluster Y are denoted by PY . The
corresponding centers are denoted by cX and cY , respectively. A similarity score
is assigned to each cluster pair. Conceptually, each cluster can be considered to
represent a Gaussian and the similarity captures the overlap between the Gaussians. Intuitively, two clusters should have a high similarity score if they satisfy the
following conditions:
1. The clusters are close to each other in the Euclidean space.
2. The densities of the two clusters are comparable, which implies that one cluster
is an extension of the other.
3. The face (hyperplane) at which the clusters meet is wide.
The compute similarity function in Figure 3.1 computes the similarity for a given
pair of centers. For computing the similarity, points belonging to the two clusters
are projected on the vector connecting the two centers as shown in Figure 3.5. Even
though the figure just shows points above the vector being projected, this is merely
for the convenience of exposition and illustration. fx represents the distance from
the center X to the farthest projected image Ii of a point pi belonging to X. Hi is
the horizontal (along the vector joining the two centers) distance of the projection
of point pi from the center, and Vi is the perpendicular (vertical) distance of the
point from its projection. The means (mHX and mHY ) and standard deviations
(sHX and sHY ) of the horizontal distances for points belonging to the clusters are
computed. Similarly, means and standard deviations for perpendicular distances are
computed. A histogram with bin size s_i/2 (i ∈ {H_X, H_Y}) is constructed for the
projected points. The bins are numbered starting from the farthest projected point
fi (i ∈ X, Y ), i.e., bin BX0 is the first bin for the histogram constructed on points
in cluster X. The number of bins for cluster X is given by |BX |. Then, we compute
the average of horizontal distances for points in each bin; d_{ij} denotes the average distance for bin j in cluster i. max_bin_i = arg max_j d_{ij} represents the bin with the largest number of projected points in cluster i. The number of points in bin X_i is given by N[X_i]. The ratio N[X_i] / N[X_{max_bin_X}] is denoted by sz_ratio_{X_i}.
Now, the size based similarity between two bins in clusters X and Y is given
by the equation:
size_sim(B_{X_i}, B_{Y_j}) = sz_ratio(B_{X_i}) · sz_ratio(B_{Y_j})    (3.1)
The distance-based similarity between two bins in clusters X and Y is given by the
following equation, where dist(BXi , BYj ) is the horizontal distance between the bins
Xi and Yj :
dist_sim(B_{X_i}, B_{Y_j}) = \frac{2 \cdot dist(B_{X_i}, B_{Y_j})}{s_{H_X} + s_{H_Y}}    (3.2)
The overall similarity between the clusters X and Y is then given as
S(X, Y) = \sum_{i=0}^{t} size_sim(B_{X_i}, B_{Y_i}) \cdot \exp(-dist_sim(B_{X_i}, B_{Y_i}))    (3.3)
where t = min(|BX |, |BY |). Also, while projecting the points onto the vector, we
discarded points that had a vertical distance greater than twice the vertical standard
deviation, considering them as noise points.
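The following sketch is one way to turn Equations 3.1–3.3 into code: both clusters are projected onto the line joining their centers, bins of width s_H/2 are counted starting from the face each cluster presents to the other, and the per-bin size and distance similarities are combined. The binning and bin-distance conventions here are one plausible reading of the construction, so the sketch is illustrative rather than a reproduction of the actual implementation.

import numpy as np

def cluster_pair_similarity(px, py, cx, cy):
    # Sketch of S(X, Y) from Equations 3.1-3.3 (illustrative reading of the construction).
    cx, cy = np.asarray(cx, float), np.asarray(cy, float)
    u = (cy - cx) / np.linalg.norm(cy - cx)            # unit vector from center X to center Y
    def profile(pts, center, toward):
        pts = np.asarray(pts, float)
        h = (pts - center) @ (u * toward)              # horizontal distance toward the other cluster
        v = np.linalg.norm(pts - center - np.outer(h * toward, u), axis=1)
        if v.std() > 0:                                # drop points beyond twice the vertical std (noise)
            h = h[v <= 2 * v.std()]
        s = h.std() if h.std() > 0 else 1.0            # s_H; bin width is s_H / 2
        width = s / 2
        idx = np.floor((h.max() - h) / width).astype(int)          # bin 0 starts at the farthest projection
        counts = np.bincount(idx)
        pos = h.max() - (np.arange(len(counts)) + 0.5) * width     # bin centre, measured from the cluster center
        return counts, pos, s
    cnt_x, pos_x, s_x = profile(px, cx, +1.0)
    cnt_y, pos_y, s_y = profile(py, cy, -1.0)
    d = np.linalg.norm(cy - cx)
    total = 0.0
    for i in range(min(len(cnt_x), len(cnt_y))):
        size_sim = (cnt_x[i] / cnt_x.max()) * (cnt_y[i] / cnt_y.max())   # Eq. 3.1
        dist_sim = 2 * abs(d - pos_x[i] - pos_y[i]) / (s_x + s_y)        # Eq. 3.2
        total += size_sim * np.exp(-dist_sim)                            # Eq. 3.3
    return total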
Let us look closely at the above similarity metric to understand how it satisfies
the above mentioned three conditions for good cluster similarity. Since the bins start
from the farthest projected points, for bordering clusters the distance between X_0 and Y_0 will be very small. This gives a small value to dist_sim(B_{X_0}, B_{Y_0}). As a result,
the exponential function gets a high value due to the exponent taking a low value.
This causes the first term of the summation in Equation 3.3 to be high, especially if
the size sim score is also high. A high value for the first term indicates that the two
clusters are close by and that there are a large number of points along the surface of
intersection of the two clusters. If the size sim(BX0 , BY0 ) is small, which can happen
when the two clusters meet at a tangent point, the first term in the summation will
be small. This is exactly as expected intuitively and captures conditions 1 and 3
mentioned above. Both the size sim and dist sim measures are averse to outliers
and would give a low score for bins containing outlier points. For outlier bins, the
sz ratio will have a low score, resulting in a lower score for size sim. Similarly,
clusters having outlier points would tend to have a high standard deviation, which
would result in a low score for dist sim.
We considered the possibility of extending the histogram to multiple dimensions, along the lines of grid-based algorithms [40], but the additional computational
cost does not justify the improvement in the quality of the results.
Finally, once the similarity between pairs of seed clusters has been computed, we can
use spectral or hierarchical agglomerative clustering to obtain the final set of k natural clusters. For our experiments, we used the agglomerative clustering algorithm
provided with CLUTO.
Our similarity metric S(X, Y ) can be shown to be a kernel. The following
lemmas regarding kernels allow us to prove that the similarity function is a kernel.
Lemma 3.3.1 [103] Let κ1 and κ2 be kernels over X × X, X ⊆ Rn , and let f (.) be
a real-valued function on X. Then the following functions are kernels:
i. κ(x, z) = κ1 (x, z) + κ2 (x, z),
ii. κ(x, z) = f (x)f (z),
iii. κ(x, z) = κ1 (x, z)κ2 (x, z),
Lemma 3.3.2 [103] Let κ1 (x, z) be a kernel over X × X, where x, z ∈ X. Then
the function κ(x, z) = exp(κ1 (x, z)) is also a kernel.
Theorem 3.3.3 Function S(X, Y ) in Equation 3.3 is a kernel function.
Proof: Since dist and sz ratio are real valued functions, dist sim and size sim are
kernels by Lemma 3.3.1(ii). This makes exp(−dist sim(., .)) a kernel by Lemma 3.3.2.
Product of size sim and exp(−dist sim(., .)) is a kernel by Lemma 3.3.1(iii). And finally, S(X, Y ) is a kernel since the sum of kernels is also a kernel by Lemma 3.3.1(i).
□
The matrix obtained by computing S(X, Y ) for all pairs of clusters turns out to
be a kernel matrix. This nice property provides the flexibility to utilize any kernel
based method, such as spectral clustering [88] or kernel k-means [29], for the second
phase of SPARCL.
3.4 Complexity Analysis
The first stage of SPARCL starts with computing initial K centers randomly
or based on the local outlier factor. If we use random initialization, phase 1 takes
O(Knde) time, where e is the number of Kmeans iterations. The time for computing
the LOF-based seeds is O(nK²d + tKn log n + n^{1-1/d} · mp · t · K) [50], where t is the
number of outliers in the dataset, followed by the O(Knde) time for Kmeans. The
second phase of the algorithm projects points belonging to every cluster pair on the
vector connecting the centers of the two clusters. The projected points are placed
in appropriate bins of the histogram. The projection and the histogram creation
requires time linear in the number of points in the seed cluster. For the sake of
simplifying the analysis, let us assume that each seed cluster has the same number
of points, n/K. Projection and histogram construction requires O(n/K) time. In practice
only points from a cluster that lie between the two centers are processed, reducing
the computation by half on an average. Since there are O(K 2 ) pairs of centers,
the total complexity for generating the similarity map is K² × O(n/K) = O(Kn).
The final stage applies a hierarchical or spectral algorithm to find the final set of
k clusters. The spectral approach will take O(K³) time in the worst case, whereas agglomerative clustering will take O(K² log K) time. Overall, the time for SPARCL is O(Knd + K² log K) (ignoring the small number of iterations it takes Kmeans to converge) if using the random initialization, or O(K²nd + K² log K), assuming mp = O(K), if using the LOF-based initialization. In our experimental evaluation,
we obtained comparable results using random initialization for datasets with uniform
density of clusters. With random initialization, the algorithm runs in time linear in
the number of points as well as the dimensionality of the dataset.
3.5 Estimating the Value of K
As discussed in Section 3.3, neighboring clusters are merged to obtain the
final set of k natural clusters. A “good” clustering is guaranteed if the following
conditions are satisfied:
1. Merging Condition – Only pseudo-centers belonging to a single natural
cluster are merged together,
2. Pseudo-center Condition – No pseudo-center exists such that it is a representative for points belonging to more than one natural cluster.
The Merging Condition is influenced by the effectiveness of the similarity measure
and the merging process itself. The similarity score in turn depends on the satisfiability of the Pseudo-center Condition. This transitive dependence between the
above two conditions indicates that satisfying the second condition is crucial for
obtaining a good clustering result.
The value of K can adversely influence the Pseudo-center condition. Underestimating K can result in points from two or more natural clusters being assigned to
the same pseudo-center, since each center has to now account for a larger number
of points. This is emphasized by our results (Section 3.6.3.5) in Figure 3.18, which
shows a lower clustering quality score for smaller values of K. Hence, having a good
estimate of K can considerably improve the clustering quality. At the same time,
the clustering outcome is not sensitive to small changes in the value of K, which
implies that a rough estimate suffices.
In many application domains the expert has an insight into approximate distances between natural clusters. For instance, cell biologists might have an estimate
of the distance between nearby chromosomes; a radiologist might have an intuition
regarding average distance between bones in an X-ray image; or distance between
regions of interest on a weather forecast map might be known a priori. Let us assume that an expert can estimate the minimum distance between any two true clusters,
denoted by cdistmin . Given cdistmin , we can estimate the value of K such that
the Pseudo-center Condition is satisfied. Figure 3.6(a) shows the true clusters for
an illustrative dataset, with the noise points removed. The figure also shows the
cdistmin for this dataset.
Assume point A is selected as one of the pseudo-centers. In order to assign
point B to a center other than A, there has to be another center C closer than
cdistmin to point B. Any point closer than cdistmin to B has to belong to Cluster
1, which implies that center C would belong to Cluster 1. In other words, if the
nearest center for each point is at a distance less than cdistmin, condition two is satisfied. If the dataset is scattered over a 2-dimensional region with area (volume for higher dimensions) V, then the value of K is given by ⌈V / (2πcdistmin)⌉. The 2πcdistmin expression is an approximation for the area of a circle (convex polyhedron) around a center point. The above expression can be similarly generalized for higher dimensional datasets to ⌈Vol. occupied by cluster shape / Vol. of hypersphere with radius cdistmin⌉.

Figure 3.6: Estimating the value of K. (a) Estimating K from cdistmin. (b) Seed centers with cdistmin = 20. (c) Members of each seed cluster with cdistmin = 20.
The area or volume occupied
by the clusters does not have to be computed explicitly. Based on the above idea,
the LOF based algorithm (Section 3.2) for obtaining initial seeds can be modified
to automatically select the required number of seeds. Figure 3.7 shows the modified
LOF initialization algorithm. The algorithm seed center initialization takes as
input the dataset of points D, cdistmin and a parameter for LOF computation. The
algorithm returns the set of seed pseudo-centers.

seed_center_initialization(D, cdistmin, mp):
1.  Take any reference point, r (origin suffices)
2.  Insert r in C
3.  do
4.      sort the points in D in decreasing order of
5.        minimum distance from points in C
6.      for each x in sorted order
7.        if (LOF(x, mp) ≈ 1)
8.          insert x in C
9.          min_dist = dist_x
10.         break
11.       endif
12.     endfor
13. while min_dist ≥ cdistmin
14. remove r from C
15. return C

Figure 3.7: Generating seed representatives with cdistmin

In the LOF based initialization, it
can be shown that each subsequently selected seed representative results in a monotonic decrease in the minimum distance (min dist). As a result, the condition on
Line 13 is violated after a finite number of iterations of the while loop. At the time
the condition in the while loop is violated, the maximum distance of a point to its
nearest seed representative is less than cdistmin. As a result, no pseudo-cluster has points from more than one natural cluster. The LOF function on Line
7 computes the Local Outlier Factor for a point. Recall that the LOF value is an
indicator of the degree to which a point is an outlier. A value close to 1 signifies a
point is not an outlier.
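A compact sketch of this seeding loop is given below; it uses the simplified LOF of Section 3.2.2 computed by brute force, an illustrative lof_threshold in place of the "LOF ≈ 1" test, and the dataset mean as the reference point r, so it mirrors the structure of Figure 3.7 rather than reproducing it exactly.

import numpy as np

def seed_centers_with_cdist(points, cdist_min, mp, lof_threshold=1.3):
    # Add the farthest non-outlier point as a seed until every point is within
    # cdist_min of some seed (lof_threshold is an illustrative cutoff).
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Simplified LOF of Section 3.2.2, computed by brute force.
    radius = np.sort(dist, axis=1)[:, mp]
    neigh = [np.where((dist[i] <= radius[i]) & (np.arange(n) != i))[0] for i in range(n)]
    density = np.array([len(nb) / dist[i, nb].sum() for i, nb in enumerate(neigh)])
    lof = np.array([density[nb].mean() / density[i] for i, nb in enumerate(neigh)])
    seeds = []
    min_dist = np.linalg.norm(points - points.mean(axis=0), axis=1)   # reference point r = data mean
    while True:
        order = np.argsort(-min_dist)                                 # decreasing distance to current seeds
        cand = next((i for i in order if lof[i] < lof_threshold), order[0])
        if min_dist[cand] < cdist_min:
            break
        seeds.append(cand)
        min_dist = np.minimum(min_dist, dist[cand])                   # update nearest-seed distances
    return points[seeds]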
We would like to caution the reader that this is a worst case analysis which
guarantees that the Pseudo-center Condition is satisfied. On the downside, for
pathological datasets much larger number of seed representatives could be selected
(in the worst case O(n) seeds) as compared to what might suffice for a good clustering. This can be seen in the results for dataset DS2 wherein a good clustering is
obtained with K = 60 (as shown in Figure 3.9(b)). Applying the algorithm in Figure 3.7 generates K = 287 seed representatives with cdistmin = 20. For the dataset
in Figure 3.6(a), the seed centers and the points assigned to them are shown in Figures 3.6(b) and 3.6(c), respectively. As one can see in Figure 3.6(b), the LOF based
initialization generates centers that are uniformly distributed. For regular shaped
clusters, the Pseudo-center Condition can be preserved even with a non-uniform
distribution of the centers as long as points such as A and B are not assigned to
the same seed-cluster. This approach could result in a reduced number of seed centers and a faster overall computation time. In this work, we do not address the
non-uniform selection of seed centers.
3.6 Experiments and Results
Experiments were performed to compare the performance of our algorithm
with Chameleon [64], DBSCAN [34] and spectral clustering [105]. The Chameleon
code was obtained as a part of the CLUTO [127]7 package. The DBSCAN implementation in Weka was used for the sake of comparison. Similarly, for spectral
clustering the SpectraLIB Matlab implementation8 based on the Shi-Malik algorithm [105] was used initially. Since this implementation could not scale to larger
datasets, we implemented the algorithm in C++ using the GNU Scientific Library9
and SVDLIBC10 . Even this implementation would not scale to very large datasets
since the entire affinity matrix would not fit in memory. The results in Table 3.2
for spectral clustering are based on this implementation.
Even though the implementations are in different languages, some of which
might be inherently slower than others, the speedup due to our algorithm far surpasses any implementation biases. All the experiments were performed on a Mac G5 machine with a 2.66 GHz processor, running Mac OS X 10.4. Our code is written
in C++ using the Computational Geometry Algorithms Library (CGAL). We show
results for both LOF based as well as random initialization of seed clusters.
7 http://glaros.dtc.umn.edu/gkhome/cluto/
8 http://www.stat.washington.edu/spectral/
9 http://www.gnu.org/software/gsl/
10 http://tedlab.mit.edu/~dr/svdlibc/
3.6.1 Datasets

3.6.1.1 Synthetic Datasets
We used a variety of synthetic and real datasets to test the different methods.
DS1, DS2, DS3, and DS4, shown in Figures 3.9(a), 3.9(b), 3.9(c), and 3.9(d), are
those that have been used in previous studies including Chameleon and CURE.
These are all 2D datasets with points ranging from 8000 to 100000. The Swiss-roll dataset in Figure 3.10 is the classic non-linear manifold used in non-linear dimensionality reduction [74]. We simply split the manifold into four clusters to see
how our methods handle this case.
For the scalability tests, and for generating 3D datasets, we wrote our own
shape-based cluster generator. To generate a shape in 2D, we randomly choose
points in the drawing canvas and accept the points which lie in our desired shape.
All the shapes are generated with point (0,0) as the origin. To get complex shapes,
we combine rotated and translated basic shapes (circle, rectangle, ellipse, circular
strip, etc.). Our 3D shape generation is built on the 2D shapes. We randomly
choose points in the three coordinates; if the x and y coordinates satisfy the shape, we randomly choose the z coordinate from a given range. This approach generates true 3D
shapes, and not only layers of 2D shapes. Similar to the case for 2D, we combine
rotated and translated basic 3D shapes to get more sophisticated shapes. Once we
generate all the shapes, we randomly add noise (1% to 2%) to the drawing frame.
An example of a synthetic 3D dataset is shown in Figure 3.13(b). This 3D dataset
has 100000 points, and 10 clusters.
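A minimal sketch of this rejection-sampling generator is shown below; the ring shape, canvas size and noise fraction are illustrative stand-ins for the shapes actually used.

import numpy as np

def sample_shape(n, contains, canvas=(0.0, 700.0), seed=0):
    # Rejection sampling: draw random canvas points and keep those inside the shape.
    rng = np.random.default_rng(seed)
    pts = []
    while len(pts) < n:
        x, y = rng.uniform(canvas[0], canvas[1], size=2)
        if contains(x, y):
            pts.append((x, y))
    return np.array(pts)

def ring(x, y, cx=350.0, cy=175.0, r_in=60.0, r_out=100.0):
    # Example basic shape: a circular strip centred at (cx, cy).
    r = np.hypot(x - cx, y - cy)
    return r_in <= r <= r_out

cluster = sample_shape(2000, ring)
noise = np.random.default_rng(1).uniform(0.0, 700.0, size=(40, 2))   # ~2% noise over the frame
data = np.vstack([cluster, noise])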
3.6.1.2 Real Datasets
We used two real shape-based datasets: cancer images and protein structures.
The first is a set of 2D images from benign breast cancer11. The actual images are
divided into 2D grid cells (80 × 80) and the intensity level is used to assign each
grid cell to either a cell or the background. The final dataset contains only the actual cells,
along with their x and y co-ordinates.
Proteins are 3D objects where the coordinates of the atoms represent points.
11 These were obtained from Prof. Bulent Yener at RPI.
Since proteins are deposited in the protein data bank (PDB) in different reference
frames, the coordinates of the protein need to be centered such that the minimum
point (atom) is above the (0,0,0) origin. We translate the proteins to get separated
clusters. Once the translation is done, we add the noise points. Our protein dataset
has 15000 3D points obtained from the following proteins: 1A1T (865 atoms), 1B24
(619 atoms), 1DWK (11843 atoms), 1B25 (1342 atoms), and 331 noise points.
3.6.2 Comparison of Kmeans Initialization Methods
We compared different initialization methods on a set of synthetic datasets.
These datasets contain regular shaped clusters of varying sizes and densities. The
exact details of the datasets are omitted as they do not influence the comparative
results. Since the clusters generated by Kmeans are hyperspheres, the distortion score is used as the evaluation metric for comparing the initialization methods. The distortion score is defined as the sum of the distances between each point and its closest center. A smaller distortion value implies a better clustering.
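For reference, the distortion score can be computed directly as follows (a small sketch; variable names are illustrative):

import numpy as np

def distortion(points, centers):
    # Sum over all points of the distance to the closest center.
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1).sum()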
Multiple runs are performed for algorithms that depend either on the order of
objects in the dataset or on randomization. For those algorithms, the minimum and
average distortion values are shown. Results for different dimensions (d) and number
of natural clusters (k) are shown in Table 3.1. The value in bold indicates the best
result for each row. The optimal distortion measure is also shown since the datasets
are synthetically generated. The results show that the LOF based initialization
performs better on most of the datasets.
Results in Figure 3.8 highlight the
robustness of LOF-based initialization as compared to random initialization. In
Figure 3.8(a), the distortion score is computed for varying percentage of random
noise in the dataset. Again, a lower distortion score indicates robustness to random
noise. Similarly, Figure 3.8(b) shows that the LOF based initialization is robust to
small changes in the parameter (mp) value.
3.6.3 Results on Synthetic Datasets
Results of SPARCL on the synthetic datasets are shown in Table 3.2, and
in Figure 3.9 and 3.10. We refer the reader to [64] for the clustering results of
Chameleon and DBSCAN on datasets DS1-4. In essence, Chameleon is able to
d    k    Optimal   LOF      Random (min/avg)   Subsample (min/avg)   k-means++ (min/avg)   KKZ
8    10   7738      7755     7904 / 8421        7887 / 8092           8008 / 8508           9204
8    25   9365      9382     9774 / 10185       9639 / 10044          9641 / 9951           10743
8    50   8694      8754     9244 / 9565        9136 / 9407           9289 / 9598           17042
16   10   16865     16882    17406 / 18496      17356 / 18314         16870 / 17951         19346
16   25   17241     17261    18298 / 19219      17647 / 18812         17732 / 18550         20567
16   50   17580     17622    18866 / 19507      18469 / 19084         18974 / 19661         21632
24   10   26149     26150    26706 / 28733      26150 / 28340         26755 / 28860         29413
24   25   22233     22261    23241 / 24582      23034 / 23942         22803 / 23787         27052
24   50   21453     21467    22838 / 23818      22477 / 23387         23003 / 23762         26599

Table 3.1: Comparison on synthetic datasets. The distortion scores are shown for each method; the value in bold indicates the best result for each row.
Figure 3.8: Sensitivity Comparison: LOF vs. Random (distortion score). (a) Varying random noise percentage. (b) Varying mp value (d = 8, k = 15).
Name         |D| (d)       k   SPARCL (LOF/Random)   Chameleon   DBSCAN   Spectral
DS1          8000 (2)      6   5.74/1.277            4.02        14.16    199
DS2          10000 (2)     9   8.6/1.386             5.72        24.2     380
DS3          8000 (2)      8   6.88/1.388            4.24        14.52    239
DS4          100000 (2)    5   35.24/20.15           280.47      -        -
Swiss-roll   19200 (3)     4   23.92/17.89           19.38       -        -

Table 3.2: Runtime Performance on Synthetic Datasets. All times are reported in seconds. '-' for DBSCAN and the Spectral method denotes the fact that it ran out of memory for all these cases.
Figure 3.9: SPARCL clustering on standard synthetic datasets from the literature. (a) Results on DS1. (b) Results on DS2. (c) Results on DS3. (d) Results on DS4.
perfectly cluster these datasets, whereas both DBSCAN and CURE make mistakes,
or are very dependent on the right parameter values to find the clusters. As we
can see SPARCL had no difficulty in identifying the shape-based clusters in these
datasets. However, SPARCL does make minor mistakes at the boundaries in the
Swiss-roll dataset (Figure 3.10). The reason for this is that SPARCL is designed
mainly for full-space clusters, whereas this is a 2D manifold embedded in a 3D
space. In other words, it is a nonlinear subspace cluster. What is remarkable is that SPARCL can actually find a fairly good clustering even in this case.

Figure 3.10: Results on Swiss-roll
Table 3.2 shows the characteristics of the synthetic datasets along with their
running times. The default parameters for running Chameleon in CLUTO were retained (number of neighbors was set at 40). Parameters that were set for Chameleon
include the use of graph clustering method (clmethod=graph) with similarity set to
inverse of Euclidean distance (sim=dist) and the use of agglomeration (agglofrom=30),
as suggested by the authors. Results for both the LOF and random initialization
are presented for SPARCL. Also, we used K = 50, 60, 70, 50 for each of the datasets
DS1-4, respectively. For swiss-roll we use K = 530.
We can see that DBSCAN is 2-3 times slower than both SPARCL and Chameleon
on smaller datasets. However, even for these small datasets, the spectral approach
ran out of memory. The times for SPARCL (with LOF) and Chameleon are comparable for the smaller datasets, though the random initialization gives the same
results and can be 3-4 times faster. For the larger DS4 dataset, SPARCL is an order of magnitude faster, demonstrating the real strength of our approach.
For DBSCAN we do not show the results for DS4 and Swiss-roll since it returned
only one cluster, even when we played with different parameter settings.
Figure 3.11: Scalability Results on Dataset DS5
3.6.3.1 Scalability Experiments
Using our synthetic dataset generator, we generated DS5, in order to perform
experiments on varying number of points, varying densities and varying noise levels.
For studying the scalability of our approach, different versions of DS5 were generated
with different number of points, but keeping the number of clusters constant at 13.
The noise level was kept at 1% of the dataset size. Figure 3.11 compares the runtime
performance of Chameleon, DBScan and our approach for dataset sizes ranging from
100000 points to 1 million points. We chose not to go beyond 1 million as the time
taken by Chameleon and DBSCAN was quite large. In fact, we had to terminate
DBSCAN beyond 100K points. Figure 3.11 shows that our approach, with random initialization, is around 22 times faster than Chameleon, while it is around 12 times
faster when LOF based initialization is considered. Note that the time for LOF also
increases with increase in the size of the dataset. For Chameleon, the parameters
agglofrom, sim, clmethod were set to 30, dist and graph, respectively. For DBSCAN
the eps was set at 0.05 and MinPts was set at 150 for the smallest dataset. MinPts
was increased linearly with the size of the dataset. In our case, for all datasets,
K = 100 seed centers were selected for the first phase and mp was set to 15.
Figure 3.12: Clustering Quality on Dataset DS5. (a) SPARCL (K = 100). (b) DBSCAN (minPts=150, eps=0.05). (c) Chameleon (agglofrom=30, sim=dist, clmethod=graph).
Figure 3.12 shows the clusters obtained as a result of executing our algorithm,
DBSCAN and Chameleon on the dataset DS5 of size 50K points. We can see that
DBSCAN makes the most mistakes, whereas both SPARCL and Chameleon do well.
Scalability experiments were performed on 3D datasets as well. The result for one of those datasets is shown in Figure 3.13(b). The 3D dataset consists of shapes in
full 3D space (and not 2D shapes embedded in 3D space). The dataset contained
random noise too (2% of the dataset size). As seen in Figure 3.13(a), SPARCL
(with random initialization) can be more than four times as fast as Chameleon.
Figure 3.13: Clustering Results on 3D Dataset. (a) Runtime comparison. (b) Clustering result.
3.6.3.2 Clustering Quality
Since two points in the same cluster can be very far apart, traditional metrics such as cluster diameter, k-Means/k-Medoid objective function (sum of squared
errors) and compactness (avg. intra-cluster distance over the avg. inter-cluster distance) are generally not appropriate for shape-based clustering. We apply supervised
metrics, wherein the true clustering is known a priori, to evaluate clustering quality. Popular supervised metrics include purity, Normalized Mutual Information, the Rand index, and so on. In this work, we use purity as the metric of choice due to its intuitive interpretation. Given the true set of clusters (referred to as classes henceforth
to avoid confusion), $C_T = \{c_1, c_2, \ldots, c_L\}$, and the clusters obtained from SPARCL, $C_S = \{s_1, s_2, \ldots, s_M\}$, purity is given by the expression:
\[ purity(C_S, C_T) = \frac{1}{N} \sum_{k} \max_{j} \| s_k \cap c_j \| \qquad (3.4) \]
where N is the number of points in the dataset. Purity lies in the range [0,1], with
a perfect clustering corresponding to purity value of 1.
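For concreteness, a minimal sketch of this purity computation (assuming pred_labels and true_labels are non-negative integer label arrays over the same N points; the names and the integer-label assumption are ours):

```python
import numpy as np

def purity(pred_labels, true_labels):
    """Purity as in Eq. (3.4): each found cluster votes for its majority true class."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)   # assumed to be non-negative integers
    total = 0
    for s in np.unique(pred_labels):        # iterate over the found clusters s_k
        members = true_labels[pred_labels == s]
        total += np.bincount(members).max() # max_j |s_k ∩ c_j|
    return total / len(true_labels)
```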
As a side note, purity tends to favor larger number of clusters, reaching a score
of 1 even when each point is in its own cluster. To overcome this bias Normalized
Mutual Information (NMI) [81] is used as a metric for clustering quality. Normalized
Mutual Information is given by the expression:
\[ NMI(C_S, C_T) = \frac{I(C_S; C_T)}{[H(C_S) + H(C_T)]/2} \qquad (3.5) \]
where $I(C_S; C_T)$ is the mutual information and $H(\cdot)$ is the entropy. The mutual information is given by
\[ I(C_S, C_T) = \sum_{s_i \in C_S} \sum_{c_j \in C_T} p(s_i, c_j) \log \frac{p(s_i, c_j)}{p(s_i) \cdot p(c_j)} \qquad (3.6) \]
and the entropy is given by $H(X) = -\sum_{i \in X} p(i) \log p(i)$. Since entropy increases
with the increase in number of clusters, overall NMI score decreases with increase
in the number of clusters. NMI overcomes this shortcoming of the purity measure
while maintaining the [0,1] range of the score. In our case, the number of clusters
obtained from SPARCL is the same as the true number of clusters. As a result, purity can be safely used as the quality score without being concerned about the above-mentioned drawback.
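If an off-the-shelf implementation were used, scikit-learn's NMI with arithmetic averaging corresponds to the normalization of Equation 3.5 (a hedged sketch; the label array names are hypothetical):

```python
from sklearn.metrics import normalized_mutual_info_score

# Arithmetic averaging matches the [H(C_S) + H(C_T)] / 2 denominator in Eq. (3.5).
nmi = normalized_mutual_info_score(true_labels, pred_labels,
                                   average_method="arithmetic")
```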
Since DS1-DS4 and the real datasets do not provide the class information,
experiments were conducted on varying sizes of the DS5 dataset. The class information was recorded during the dataset generation. Fig. 3.14 shows the purity score
for clusters generated by SPARCL and CHAMELEON (parameters agglofrom=100,
sim=dist, clmethod=graph). Since these algorithms cluster noise points differently,
for fair comparison they are ignored while computing the purity, although the noise
points are retained during the algorithm execution. Note that for datasets larger
than 600K, CHAMELEON did not finish in reasonable time. When CHAMELEON was run with the default parameters, which run much faster, the purity score dropped to 0.6, whereas SPARCL's purity score is more than 0.9.
Figure 3.14: Clustering quality for varying dataset size
3.6.3.3 Varying Number of Clusters
Experiments were conducted to see the impact of varying the number of natural clusters k. To achieve this, the DS5 dataset was replicated by tiling the dataset
in a grid form. Since the DS5 dataset contains 13 natural clusters, a 1 × 3 tiling contains 39 natural clusters (see Fig. 3.15(a)). The number of points is held constant at 180K. The number of natural clusters is varied from 13 to 117. The number of seed-clusters is set at 5 times the number of natural clusters, i.e., K = 5k. We see that
SPARCL finds most of the clusters correctly, but it does make one mistake, i.e., the
center ring has been split into two. Here we find that since there are many more
clusters, the time to compute the LOF goes up. In order to obtain each additional
center the LOF method examines a constant number of points, resulting in a linear relation between the number of clusters and the runtime. Thus we prefer to use the random initialization approach when the number of clusters is large. With that, SPARCL is still 4 times faster than Chameleon (see Fig. 3.15(b)).

Figure 3.15: Varying Number of Natural Clusters. (a) 1x3 grid tiling; results of SPARCL. (b) Runtime for varying number of natural clusters.
Even though Chameleon produces results comparable to those of SPARCL, it requires tuning the parameters to obtain these results. Especially when the nearest neighbor graph contains disconnected components, CHAMELEON tends to break natural clusters in an effort to return the desired number of clusters. Hence CHAMELEON expects the user to have a certain degree of intuition regarding the dataset
in order to set parameters that would yield the expected results.
3.6.3.4 Varying Number of Dimensions
Synthetic data generator SynDECA (http://cde.iiit.ac.in/~soujanya/syndeca/) was used to generate higher dimensional datasets. The number of points
and clusters were set to 500K and 10, respectively. 5% of the points were uniformly
distributed as noise points. SynDECA can generate regular (circle, ellipse, square
and rectangle) as well as random/irregular shapes. Although SynDECA can generate subspace clusters, for our experiments full dimensional clusters were generated.
Fig. 3.16(b) shows the runtime for both LOF based and random initialization of
seed clusters.

Figure 3.16: Varying Number of Dimensions. (a) Purity. (b) Runtime.

With increasing number of dimensions, LOF computation takes substantial time. This effect can be attributed to a combination of two factors. First, since a kd-tree is used for nearest-neighbor queries, the performance degrades with
increasing dimensionality. Second, since we keep the number of points constant in
this experiment, the sparsity of the input space increases for higher dimensions. On
the other hand, random initialization is computationally inexpensive. Fig. 3.16(a)
shows the purity for higher dimensions. Both SPARCL and CHAMELEON perform
well on this measure. The quality of the clustering can also be visually inspected.
The points in the high dimensional space can be projected onto a lower dimensional
space. For our experiments, we use Principal Component Analysis [32] (PCA) as
the dimensionality reduction technique. PCA has the distinction of being a linear
transformation that is optimal for preserving the subspace with the largest variance
in the data. Figure 3.17 shows the above dataset with 10 dimensions projected onto
a 3-dimensional subspace. The noise points have been purposely suppressed in order
to view the projected clusters clearly. As seen in Figure 3.17, the compact regions
representing the clusters contain points from the same natural clusters. Projecting points onto a lower-dimensional subspace results in a small overlap of some of the clusters.
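A minimal sketch of this projection step, using scikit-learn's PCA (the array name X for the point matrix is a placeholder):

```python
from sklearn.decomposition import PCA

# Project the 10-dimensional points onto the 3 directions of largest variance,
# i.e., the linear projection that best preserves variance for visual inspection.
X3 = PCA(n_components=3).fit_transform(X)
```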
Figure 3.17: 10 dimensional dataset (size=500K, k=10) projected onto a 3D subspace
Figure 3.18: Clustering quality for varying number of seed-clusters
3.6.3.5 Varying Number of Seed-Clusters (K)
SPARCL has a single parameter, K. Fig. 3.18 shows the effect of changing K
on the quality of the clustering. The dataset used is the same as in Fig. 3.15(a), with 300K points. As seen in the figure, the purity stabilizes around K=180 and remains almost constant till K=450. As K is increased further, a significant number of seed-centers lie between two clusters. As a result SPARCL tends to merge parts of one
cluster with the other, leading to a gradual decline in the purity. Overall, the figure
shows that SPARCL is fairly insensitive to the K value.
Figure 3.19: Protein Dataset
3.6.4 Results on Real Datasets
We applied SPARCL on the protein dataset. As shown in Figure 3.19, SPARCL
is able to perfectly identify the four proteins. The largest, doughnut-shaped cluster is the 1DWK protein, while the other smaller ones are amoeba-like irregular shapes. The
K value for the protein dataset is 30. On this dataset, Chameleon returns similar results. The results on the benign cancer datasets are shown in Figure 3.21.
Here too SPARCL successfully identifies the regions of interest. The K value for
this dataset is 100. The distinct clusters in the cancer dataset represent the nuclei,
whereas the surrounding region is the tissue. Clusters that are globular in shape
correspond to healthy tissues whereas irregular shapes of the nuclei correspond to
cancerous tissues. We do not show the time for the real datasets since the datasets
are fairly small and both Chameleon and SPARCL perform similarly.
Figure 3.20: Cluster separation with Locally Linear Embedding. (a) Original Swiss-roll dataset. (b) Swiss roll embedded in 2D with l=8. (c) Original Protein Dataset. (d) Protein dataset embedded in 2D with l=8.
3.6.5 Comparison with Locally Linear Embedding
In this section we compare our approach with a popular embedding technique.
Locally Linear Embedding (LLE) [99] embeds the d dimensional points in the dataset
into a smaller set of dimensions d′ , such that the local geometry of each point
is retained. Formally, given a dataset D and a parameter l, LLE first expresses
each point as a linear combination of its l neighbors such that the error ǫ(W ) =
P
P
2
Xi ∈D | Xi −
Xj ∈Rl (Xi ) Wij Xj | is minimized. The matrix W assigns weights
to the l nearest neighbors (given by Rl (Xi )) for reconstructing a point Xi . Once
the W matrix is obtained by solving a constrained least squares problem, the new
′
embedding Yi (Yi ∈ Rd ) for a point Xi is obtained by minimizing the cost function
64
Φ(Y ) =
P
Yi
| Yi −
P
Yj ∈Rl (Yi )
Wij Yj |2 .
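Purely as an illustration, an embedding of this form can also be obtained with scikit-learn's LocallyLinearEmbedding instead of the original MATLAB code of [99]; X is a placeholder for the 3D input points:

```python
from sklearn.manifold import LocallyLinearEmbedding

# l = 8 neighbors are used to build the reconstruction weights W,
# and the points are embedded into d' = 2 dimensions.
lle = LocallyLinearEmbedding(n_neighbors=8, n_components=2)
Y = lle.fit_transform(X)
```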
We use LLE to embed two 3D datasets (swiss-roll and proteins datasets) into 2
dimensions. The objective is to check if the clusters in the embedded 2-dimensional
space are well separated. We use the LLE implementation provided by the author of [99] (http://www.cs.toronto.edu/~roweis/lle/code.html). Figure 3.20 shows the original and the embedded datasets. Notice that
LLE is able to separate out three clusters out of four from the protein dataset. Two
clusters are almost merged with each other (top right corner of Figure 3.20(d)). As
for the swiss-roll dataset, LLE linearizes the points in 2D, but a clear separation
between the clusters is not obtained. One can see that neighboring clusters in the
manifold overlap. The overlapping is an artifact of the small gap between the true
clusters.
3.7 Conclusions
In this work, we made use of a very simple idea, namely, to capture arbitrary
shapes by means of convex decomposition, via the use of the highly efficient Kmeans
approach. By selecting a large enough number of seed clusters K, we are able
to capture most of the dense areas in the dataset. Next, we compute pairwise
similarities between the seed clusters to obtain a K ×K symmetric similarity matrix.
The similarity is designed to capture the extent to which the points from the two
clusters come close to each other, namely close to the d-dimensional hyperplane that
separates the two clusters. The similarity is computed rapidly by projecting all the
points in the two clusters on the line joining the two centers (which is reminiscent
of linear discriminant analysis). We bin the distances horizontally and compute a
one-dimensional histogram to approximate the closeness of the points, which in turn
gives the similarities between the clusters. We then apply a merging based approach
to obtain the final set of user-specified (natural) clusters.
Our experimental evaluation shows that this simple approach, SPARCL, is remarkably effective in finding arbitrary shape-based clusters in a variety of 2D and 3D datasets. It has the same accuracy as Chameleon, a state-of-the-art shape-based method, and at the same time it is over an order of magnitude faster, since
its running time is essentially linear in the number of points as well as dimensions.
SPARCL can also find clusters in the classic Swiss-roll dataset, effectively discovering
the 2D manifold via Kmeans approximations. It does make some small errors on
that dataset. In general SPARCL works well for full-space clusters, and is not
yet tuned for subspace shape-based clusters. In fact, finding arbitrary shape-based subspace clusters is one avenue of future work for our method.
Figure 3.21: Cancer Dataset: (a)-(c) are the actual benign tissue images. (d)-(f) gives the clustering of the corresponding tissues by SPARCL.
CHAPTER 4
Shape-based Clustering through Backbone Identification
This chapter introduces another scalable approach to clustering spatial data. This
approach is motivated by the concepts of skeletonization within the image processing
community.
In this chapter, we present a scalable clustering algorithm that aims to identify
the underlying shape of the clusters or what we refer to as the intrinsic shape of
the clusters. We also refer to the intrinsic shape as the backbone of the clusters.
The intrinsic shape of a cluster is conceptually similar to image skeletonization in the image processing literature. Figure 4.1(a) shows an example dataset and Figure 4.1(d) shows its intrinsic shape. The intrinsic shape of the dataset has two
benefits:
1. Removal or reduction of noise points from the dataset, and
2. Reduction in the size of the dataset
Both these effects help in reducing the computational cost and improving the quality
of the clustering. The basic idea is to recursively collapse a set of points into a single
representative point. Over a few iterations, the dataset is repeatedly summarized
until the intrinsic shape (also referred to as the backbone) of the data is identified,
as illustrated in Figure 4.1. The contributions of this chapter can be summarized
as follows:
1. We propose a new algorithm to identify the shape (skeleton) of the clusters.
This step enables easy identification of the final set of clusters.
2. Many clustering algorithms need the true number of clusters as an input to
the algorithm. Note that DBSCAN is an exception. In this work, we outline methods that allow us to identify the true clusters without requiring the actual number of clusters as an input parameter.
Figure 4.1: Initial dataset (a); the dataset after 3 iterations (b) and 6 iterations (c); and the backbone after 8 iterations of the algorithm (d).
3. The clustering algorithm can identify clusters of varying shapes, sizes and
densities.
4.1 Related Techniques
Considerable work has been done in the field of arbitrary shape clustering.
Apart from the standard clustering methods (hierarchical, density-based and spectral) that have been described in Section 2.2.2, certain alternate clustering algorithms, such as those inspired by various physical and natural phenomena, have also been proposed. More specifically, clustering algorithms motivated by concepts
from swarm intelligence [1] and biologically inspired models have been proposed recently [91]. These algorithms are characterized by individual data points that are
termed agents, that interact with and alter their local environment under defined
principles. In [35], the authors propose a flocking based spatial clustering algorithm.
Each agent within the flocking model moves under separation, cohesion and alignment behaviors modeled after the flocking phenomenon of birds. The movement of
agents under these behaviors repeated over pre-defined iterations results in clusters
of agents. Separation ensures that agents maintain certain distance from neighboring agents. Movement of agent under the cohesive behavior allows agents to form
clusters.
Another line of work, inspired from the physical sciences, models cluster centers as centers of gravitational forces [121, 93, 43, 70] exerted by data points on each other (each data point is considered as a unit-mass particle that is exposed to gravitational forces). In [43], the authors determine clusters by moving points under the
Gravitational Law and Newton’s motion law. While the authors claim that the
number of clusters is determined automatically, the proposed algorithm requires a number of other parameters (α: minimum number of points permissible in a cluster, ε: cluster merging distance bound). Moreover, the clusters captured are convex
in shape. In [93], the authors compare hierarchical clustering with gravitational
clustering.
Clustering algorithms motivated by principles from magnetism have been proposed [12, 9] as well. In [12], the authors use the spin-spin correlation function to
determine clusters. Each data point is assumed to possess a Potts spin variable.
Within the Potts spin model, at super–paramagnetic temperatures spin variables
belonging to the same cluster get aligned with each other. The spin-spin correlation
function is estimated using Monte Carlo methods, which in turn governs the size
and number of clusters.
4.1.1 Skeletonization
Skeletonization (also known as thinning) from the image processing literature conceptually resembles the approach proposed in this chapter. Let us take a brief look
at skeletonization from the image processing perspective, although we will not make
use of image processing algorithms in this work.
Skeletonization is the process of peeling off from an image as many pixels
as possible without affecting the general shape of the image. In other words, after
pixels have been peeled off, the image should still be recognized. The skeleton hence
obtained must have the following properties. It should be 1) as thin as possible, 2)
connected, and 3) centered. Figure 4.2 shows an image and its skeleton.

Figure 4.2: Example skeleton of a binary image (in black). The white outline is the skeleton.

Skeletons have many mathematical definitions. Some of the definitions are as follows:
1. Fire propagation model: Blum [13] described a skeleton in terms of a fire
propagation model. Assume that a region of prairie grass has the shape of
the object whose skeleton has to be determined. Assume that a fire is lighted
along each edge of the region. The points on the prairie region where two or
more “fronts” of fire meet form the skeleton of the region. Blum defined the
Medial Axis Transform (MAT) to determine the skeleton.
2. Centers of bi-tangent circles: This approach is conceptually similar to the
Medial Axis Transformation. This method considers a point p (within the
region R) to be a part of the skeleton if p is equidistant from two points on
the boundary of R.
3. Centers of maximal disks: If disks are placed on the region R to be skeletonized, such that the size of the disk cannot be increased without intersecting
the boundary of R, then the centers of these maximal disks define the skeleton.
Numerous algorithms for skeletonization have been proposed in the image processing community. Morphology based techniques [44] along with iterative thinning algorithms [23] are commonly used for identifying skeletons. These algorithms rely
heavily on the binary pixel based representation of the data and on the notion of
foreground and background pixels. Morphological operators are applied on a pixel
and its neighborhood to achieve various effects (e.g. thinning, thickening, hole filling,
etc.). The strict adherence to the pixel based representation makes these methods
infeasible for our data.
4.2 The Clustering Algorithm
Our approach to clustering is motivated by the notion that a cluster possesses
an intrinsic shape or a core shape. Intuitively, for a 2-dimensional Gaussian cluster, points around the mean of the cluster could be considered as points forming
the core shape of the cluster. For an arbitrary shaped cluster, such as shown in
Figure 4.1(a), the intrinsic shape of the cluster is captured by the backbone of the
cluster (Figure 4.1(d)). The clustering algorithm has two phases. In the first phase
we identify the intrinsic shape of the clusters. In the following phase, the individual
clusters are separately identified.
4.2.1 Preliminaries
Consider a dataset D of N points in d-dimensional Euclidean space. The
distance between points i and j is represented by $d_{ij}$. The k-nearest neighbors (k-NN) of a data point i are given by the set $R_k(i)$. The nearest neighbors for all points are captured in the matrix A. Each entry A(i, j) is given by
\[ A(i,j) = \begin{cases} 1 & \text{if } j \in R_k(i) \\ 0 & \text{if } j \notin R_k(i) \end{cases} \]
The term k-NN matrix is used for A henceforth. The entry A(i, j) can be viewed
as the probability of point j being in the k-NN set of i.
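As a sketch (not the actual C++/ANN implementation), the k-NN matrix could be built with a kd-tree as follows; the dense N × N matrix is purely illustrative and would not scale to large datasets:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_matrix(X, k):
    """Binary k-NN matrix A: A[i, j] = 1 iff j is among the k nearest neighbors of i."""
    tree = cKDTree(X)
    # Each point is returned as its own nearest neighbor, so query k + 1 and drop column 0.
    _, idx = tree.query(X, k=k + 1)
    A = np.zeros((len(X), len(X)), dtype=np.uint8)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, idx[:, 1:].ravel()] = 1
    return A
```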
Figure 4.3: Sample dataset showing one iteration of glob and movement. (Left) A sample dataset of 7 points. (Right) The dataset after one iteration, with the globbed groups 1' = {1, 2}, 2' = {3}, 3' = {4, 5} and 4' = {6, 7}.

Figure 4.3(a) (left)
shows a sample dataset and Figure 4.4(a) shows the corresponding k-NN matrix.
Figure 4.3(a) (right) shows the sample dataset after one iteration, while Figure 4.8(a)
shows the corresponding updated k-NN matrix.
For the sake of clarity, the notation $k_n$ (n subscript for neighborhood) is used to denote the k parameter in a nearest neighbor context. Whenever required, the notation $k_c$ would imply the true number of clusters in a dataset.





Figure 4.4: k-NN matrices for the sample dataset. (a) Initial k-NN matrix A_0 for the sample data. (b) Updated k-NN matrix A_1 after the first iteration.
4.2.2 Phase 1 – Backbone Identification
We hypothesize that, given the points in the backbone, the initial dataset can
be obtained through the following hypothetical generative process. Let us assume
that the backbone of a cluster has Nt points. Assume that each backbone point
pi has two parameters associated with it. The weight parameter for pi , denoted
by wi , indicates the number of points that can be generated from pi . The spread
parameter, indicates the region around pi within which wi points can be generated.
For a d-dimensional input space, a covariance matrix, σi could represent the spread
parameter. For the sake of simplicity, we assume that the covariance matrix is a
diagonal matrix with the same variance vi along each dimension. Now, assume
that a Gaussian process generates m (m < wi ) points at random, with mean at pi
and the covariance matrix σi dictating the distribution of these points. The weight
wi is distributed (either uniformly or as a function of the distance of the point
from pi ) across the generated m points. The covariance matrix for the m points is
obtained by updating σi such that the spread volume for the m points is decreased
in proportion to weights assigned to them. This generative process is repeated at
each new point until the weight assigned to a new point has reduced to one. A
weight wi = 1 indicates that a point cannot generate any points. We propose this
simple generative model for obtaining the original dataset from the backbone.
Contrary to the generative model, we want to identify the points belonging
to the backbone given the original dataset. In essence, our approach tries to run
the generative model in reverse. In other words, we follow a “reverse generative”
process. Notice that the backbone has much less noise as compared to the original
set of points, making it easier to identify the individual clusters. The task now is to
capture this backbone or the core shape. To identify this core backbone shape, we
propose an algorithm based on two simple concepts:
1. Globbing: Globbing involves assigning a representative to a group of points.
The globbed points are removed from the dataset and their representative
(point) is added. Each point, in the dataset, has a weight w assigned to it.
Initially, the weight of each point is set to 1. The weight of a representative
is proportional to the number of points globbed by the representative. All
points that lie within a d-dimensional ball of radius b, around a representative
r are globbed by r. As discussed later, the value of b is estimated from the
dataset. To illustrate the effect of globbing, Figure 4.5 shows the bubble plot
corresponding to Figure 4.1(d). Each point is replaced by a bubble wherein the size of the bubble is proportional to the number of points globbed by it.
Figure 4.5: Bubble plot for Figure 4.1(d). The size of a bubble is proportional to the weight w_i of a point.
2. Object Movement: In our model, each point experiences a force of attraction from its neighboring points. Under the influence of these forces a point can change its position. The magnitude and direction of the movement are proportional to the forces exerted on the point.
find_core_clusters(D, k):
1.  Initialize w_i = 1, for all i
2.  r = estimate_knn_radius(D, k)
3.  repeat
4.      glob_objects(D, r, k)
5.      D_new = move_objects(D, r, k)
6.      D = D_new
7.  until stopping condition satisfied
8.  C = identify_clusters(D)

Figure 4.6: The Backbone Identification Based Clustering Algorithm
The backbone identification algorithm involves two steps that are repeated iteratively. In the first step, objects are globbed starting at the densest regions of the dataset. In the second step, objects move under the influence of mutual forces. Figure 4.3(a) shows the initial dataset consisting of 7 points and the
effect of one iteration (globbing followed by movement) on the dataset. Similarly,
Figures 4.4(a) and 4.8(a) show the initial k-NN matrix A0 and the updated k-NN
matrix A1 after one iteration, respectively. On convergence of the iterative process,
the intrinsic shape of the cluster is expected to emerge. Figure 4.1(d) shows the
backbone of the dataset in Figure 4.1(a), on convergence. Note that the two steps
outlined in the algorithm are essentially simulating the generative model in reverse.
The algorithm is outlined in Figure 4.6. estimate_knn_radius computes an estimate of the average distance to the k-th nearest neighbor for objects in the dataset. The radius is estimated by first obtaining the distance to the k-th nearest neighbor
over a random sample from the original dataset. The largest 5% of these distances
are eliminated and the average radius is computed from the remaining 95%. This
average radius is used as the globbing radius r.
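A rough sketch of this estimate follows; the sample size and random seed are our own choices and are not specified by the algorithm:

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_knn_radius(X, k, sample_size=1000, seed=0):
    """Globbing radius r: mean k-th NN distance over a random sample, ignoring the top 5%."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    dists, _ = cKDTree(X).query(sample, k=k + 1)   # column 0 is the point itself
    kth = np.sort(dists[:, -1])
    kth = kth[: int(np.ceil(0.95 * len(kth)))]     # drop the largest 5% of distances
    return kth.mean()
```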
During glob objects all points within a radius r of a point a are marked as being
“globbed” by a. The use of the globbing radius r in the globbing step ensures that
only points in the close proximity of a can be represented by a. Such selective globbing also ensures that outlier or noise points do not glob points belonging to dense
cluster regions. Globbing modifies the dataset by removing the globbed points and
by updating the weight $w_a$ of the globbing point to include the weights of all the globbed points (i.e., $w_a = \sum_{p \,:\, dist(p,a) < r} w_p$).
In the move objects step, a point b in d-dimensional space is displaced under the
influence of its nearest neighbors’ force of attraction. Out of the k nearest neighbors,
only those that have not been globbed by b participate in displacing b. The force
exerted by an object c on object b is proportional to wc and inversely proportional
to dist(b, c), where dist() is some distance function. The updated position of b in
dimension i is given by Equation 4.1, where bi is the ith dimension of b.
\[ b_i = \frac{b_i \cdot w_b + \sum_{c \in R_k(b)} c_i \cdot w_c \cdot \frac{1}{dist(b,c)}}{w_b + \sum_{c \in R_k(b)} w_c \cdot \frac{1}{dist(b,c)}} \qquad (4.1) \]
Figure 4.7: Example illustrating the globbing-movement twin process.
Figure 4.7 elaborates the globbing and movement steps. For a point a (shown in
red) the 8 nearest neighbors are marked in blue. Out of the 8 nearest neighbors,
5 lie within radius r, as a result they are globbed by a. The remaining 3 nearest
neighbors are responsible for moving a. The forces exerted by these points on a are
shown by the vectors f1 , f2 and f3 . The resultant direction in which a moves is the
vector sum of f1 , f2 and f3 .
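To make the twin process concrete, the following is a simplified, unoptimized sketch of one globbing-and-movement iteration. It processes points in index order rather than starting from the densest regions, and it lets every surviving neighbor exert a force; both simplifications are ours and differ from the actual implementation:

```python
import numpy as np
from scipy.spatial import cKDTree

def glob_and_move(X, w, r, k):
    """One iteration: glob points within radius r, then move the survivors via Eq. (4.1)."""
    tree = cKDTree(X)
    alive = np.ones(len(X), dtype=bool)
    # Globbing: an un-globbed point absorbs the weight of its still-alive neighbors within r.
    for i in range(len(X)):
        if not alive[i]:
            continue
        for j in tree.query_ball_point(X[i], r):
            if j != i and alive[j]:
                w[i] += w[j]
                alive[j] = False
    # Movement: Eq. (4.1), a weighted pull from the (surviving) k nearest neighbors.
    dists, idx = tree.query(X, k=k + 1)
    X_new = X.copy()
    for i in np.flatnonzero(alive):
        num, den = X[i] * w[i], float(w[i])
        for dist, j in zip(dists[i, 1:], idx[i, 1:]):
            if alive[j] and dist > 0:
                num = num + X[j] * (w[j] / dist)
                den = den + w[j] / dist
        X_new[i] = num / den
    return X_new[alive], w[alive]
```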
One can extrapolate that the above two steps repeated without a suitable
stopping condition would result in a dataset with a single point which globs all the
points in the dataset. Let $D_n$ be the dataset after iteration n, with $D = D_0$, and let $D_{final}$ be the dataset obtained after Line 6 of Figure 4.6. Clustering quality is poor if $D_{final}$ has points that represent globbed points from more than one natural cluster. At the same time, if $D_{final}$ is very similar to $D_0$, then we have not achieved any reduction in
the dataset size. Hence, a “good” stopping condition needs to balance the reduction
in the dataset size and the degree to which Dn captures D0 .
To formalize this notion, let An be the k-NN matrix after iteration n. The
initial k-NN matrix for the dataset is A (or A0 ). Let the size of An be Nn × Nn ,
where Nn is the number of points in the dataset at the end of iteration n. Consider
fn : Rd → Rd be an onto function for iteration n. Function fn maps a point a in the
original dataset D0 to a point in Dn that has globbed a. Given that fn (a) = fn (b),
i.e., both a and b are globbed by the same point in Dn , the probability that b is in
the k-neighborhood of a is approximated by the expression:
\[ \Pr[b \in R_k(a)] = \frac{C_{k-2}^{\,s(f_n(a))-2}}{C_{k-1}^{\,s(f_n(a))-1}}, \quad \text{if } f_n(a) = f_n(b) \qquad (4.2) \]
where s(x) is the number of points globbed within x. The above equation can be
explained as follows. The numerator corresponds to the number of sets of points
of size k − 2 (considering a and b to be included in the set) that can be selected
from s(fn (a)) − 2 points. The denominator corresponds to the number of sets (of
points) that include point a. Essentially, the probability is given by the ratio of
the two selections. In the alternate scenario, when $f_n(a) \neq f_n(b)$, the probability of $b \in R_k(a)$ is given by the expression
\[ \Pr[b \in R_k(a)] = \frac{1}{d(f_n(a), f_n(b))} \cdot \frac{C_{k-2}^{\,s(f_n(a))+s(f_n(b))-2}}{C_{k-1}^{\,s(f_n(a))+s(f_n(b))-1}}, \quad \text{if } f_n(a) \neq f_n(b) \qquad (4.3) \]
The probability in Equation 4.3 depends on two factors: 1) the number of points
globbed by the representatives of a and b in Dn , and 2) the distance between the
representatives fn (a) and fn (b). The larger the distance between fn (a) and fn (b),
the smaller the probability of b belonging to Rk (a). Similarly, the probability in
Equation 4.3 is less than that in Equation 4.2. This resonates with the intuition that
nearby points should have higher probability. Note that although the k-NN relation
is not symmetric, the above probabilities are symmetric, i.e., Pr[b ∈ Rk (a)] =
Pr[a ∈ R_k(b)]. Note that for Equations 4.2 and 4.3 to represent probabilities, the right hand side should be normalized by dividing by the term $Z_a = \sum_b \Pr[b \in R_k(a)]$. Let $M_i$ be the $N_0 \times N_0$ matrix with the entry $M_i[x, y]$ representing Pr[y ∈ R_k(x)]. Figure 4.8 shows the $M_1$ matrix (without the normalizing factor Z) obtained using $A_1$ (from the example in Figure 4.4) and Equations 4.2 and 4.3. For the sake of comparison, the original k-NN matrix $A_0$ is also shown alongside.

Figure 4.8: Reconstructed (and original) k-NN matrices for the sample dataset. (a) The reconstructed matrix $M_1$ after the first iteration:
\[ M_1 = \begin{pmatrix} 0 & 1 & 0.5 & 0.083 & 0.083 & 0.066 & 0.066 \\ 1 & 0 & 0.5 & 0.083 & 0.083 & 0.066 & 0.066 \\ 0.5 & 0.5 & 0 & 0.16 & 0.16 & 0.083 & 0.083 \\ 0.083 & 0.083 & 0.16 & 0 & 1 & 0.33 & 0.33 \\ 0.083 & 0.083 & 0.16 & 1 & 0 & 0.33 & 0.33 \\ 0.066 & 0.066 & 0.083 & 0.33 & 0.33 & 0 & 1 \\ 0.066 & 0.066 & 0.083 & 0.33 & 0.33 & 1 & 0 \end{pmatrix} \]
(b) The initial k-NN matrix $A_0$ for the sample data, shown for comparison.
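Purely as an illustration, the unnormalized probabilities of Equations 4.2 and 4.3 can be evaluated as follows, where f_n, s and d are placeholders for the representative map, the glob-count function and the distance function:

```python
from scipy.special import comb

def pr_b_in_knn(a, b, f_n, s, d, k):
    """Unnormalized Pr[b in R_k(a)] following Eqs. (4.2) and (4.3)."""
    if f_n(a) == f_n(b):
        sa = s(f_n(a))
        return comb(sa - 2, k - 2) / comb(sa - 1, k - 1)
    sab = s(f_n(a)) + s(f_n(b))
    return comb(sab - 2, k - 2) / comb(sab - 1, k - 1) / d(f_n(a), f_n(b))
```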
Given the above description, the stopping condition for Algorithm 4.6 can be
formulated in terms of the Minimum Description Length (MDL) principle.
4.2.2.1 Minimum Description Length principle
The Minimum Description Length (MDL) principle [97] is a model selection criterion. The MDL principle originates from the more general theory of Occam's razor.
Occam’s razor states that given two approaches that are equivalent in all other
respects, select the one that is simpler. By considering that the simplest representation corresponds to the most compressed representation, the MDL principle takes
an information theoretic approach towards selecting a model. Let us assume that P1
is interested in transmitting the data D to P2 . Given a set of methods (hypotheses)
H used to encode the data, P1 needs to pick the hypothesis hx that has the largest
compression. For P2 to decode the data, it needs the hypothesis hx as well as the
data that is encoded using hx . Let L(hx ) be the number of bits required to represent
the hypothesis and L(D | hx ) be the encoded data given hypothesis hx . The MDL
principle suggests selecting the model that minimizes L(hx ) + L(D | hx ). One can
notice that the MDL principle balances the generality and the specificity (the biasvariance trade-off) in model selection for the data. A simple model requires fewer
number of bits corresponding to the L(hx ) term, but it results in a larger number
of bits to represent the data L(D | hx ). On the contrary, a complex model would
exhibit just the opposite effect. The term L(D | hx ) also corresponds to the error
introduced in the transmission as a result of selecting the model hx .
In the context of our clustering algorithm, the set of hypotheses/models is
represented by $D_i$, ∀i > 0. The simplest model $D_1$ requires the largest number of bits, but requires the fewest bits to encode D. Stated another way, the simplest model has the smallest error when it comes to reconstructing the original data. This is often called the reconstruction error. For subsequent hypotheses,
as L(Di ) decreases, the additional data required to represent D0 (given by L(D0 |
Di ), i > 0) increases. L(D0 | Di ) can be interpreted as the error introduced in
reconstructing D0 from Di .
Representing the model and the data: As seen before, Ai represents the k-NN
matrix for Di . Let Mi represent the k-NN matrix “reconstructed” from Ai using
Equations 4.2 and 4.3. The probability that the reconstructed k-NN matrix Mi
faithfully captures A0 is given by Pr[A0 | Mi ]. Since each element in A0 can be
considered independent, Pr[A0 | Mi ] can be expressed as
\[ \Pr[A_0 \mid M_i] = \prod_{m=0}^{N_0} \prod_{n=0}^{N_0} \Pr[A_0(m, n) \mid M_i(m, n)] \qquad (4.4) \]
Since $A_0$ is a binary matrix, the expression $\Pr[A_0(m, n) \mid M_i(m, n)]$ is given by
\[ \Pr[A_0(m,n) \mid M_i(m,n)] = \begin{cases} M_i(m,n) & \text{if } A_0(m, n) = 1 \\ 1 - M_i(m,n) & \text{if } A_0(m, n) = 0 \end{cases} \qquad (4.5) \]
The number of bits required to represent the total reconstruction error is captured
by
\[ L(D_0 \mid D_i) = -\log \Pr[A_0 \mid M_i] = -\sum_{m=1}^{N_0} \sum_{n=1}^{N_0} \log \Pr[A_0(m, n) \mid M_i(m, n)] \qquad (4.6) \]
The number of bits to represent the model depends on the relative size of Di . It is
given by the following expression
\[ L(D_i) = -\log\left(\frac{|D_i|}{|D_0|}\right) \qquad (4.7) \]
Hence the trade-off at the end of any iteration i is between the average reconstruction error, given by $\frac{1}{N_0^2} L(D_0 \mid D_i)$, and the size of the model, $L(D_i)$. The term $L(D_0 \mid D_i)$ normalized by $N_0^2$ gives the average number of bits per entry in the matrix. Computing the reconstruction matrix is $O(N^2)$ for each iteration, where N
is the number of points in the original dataset. As a result, this is not a feasible
approach to ascertain the stopping condition. We present a stopping condition that
is simpler to compute but captures the same trade-off between the reconstruction
error and the dataset size.
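For illustration only, the two description-length terms of Equations 4.6 and 4.7 could be evaluated as follows, given the binary matrix A0, a reconstructed probability matrix M, and the current size ratio |D_i|/|D_0| (all argument names are hypothetical):

```python
import numpy as np

def description_lengths(A0, M, size_ratio):
    """Return (L(D0|Di), L(Di)) following Eqs. (4.6) and (4.7)."""
    eps = 1e-12                                   # guard against log(0)
    per_entry = np.where(A0 == 1, M, 1.0 - M)     # Eq. (4.5), applied entrywise
    L_data = -np.log(np.clip(per_entry, eps, 1.0)).sum()
    L_model = -np.log(size_ratio)                 # size_ratio = |Di| / |D0|
    return L_data, L_model
```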
A Practical Stopping Condition: Since the above stopping condition is computationally expensive, we provide a practical stopping condition that can intuitively be shown to be related to the MDL based formulation stated above.
In a dataset, if points are only globbed (without moving), it results in the
sparsification (reduction) of the data. In addition, moving points enables further
globbing in subsequent iterations. In other words, if $g_i$ is the number of points globbed in an iteration and $m_i$ is the number of points that are moved in an iteration, then $g_i \propto m_{i-1}$. This is shown in Figure 4.9.
Figure 4.9: The number of points moved and globbed per iteration for a dataset with 1000K points.
Intuitively, as more points are globbed across the iterations, the reconstruction error
increases. Let Ei be the reconstruction error at the end of iteration i and let the
error difference between two consecutive iterations be ∆Ei (∆Ei = Ei − Ei−1 ).
The difference between the errors is proportional to the number of points globbed,
i.e., ∆Ei ∝ gi . Combining this with the previous observation (gi ∝ mi−1 ) yields
∆Ei ∝ mi−1 .
As fewer points move in subsequent iterations ($m_i < m_{i-1}$), it reflects the decline in the size of the dataset, i.e., $N_{i+1} < N_i$. The ratio $\frac{m_i}{m_{i-1}} (< 1)$ captures the relative rate of this decline.
To state upfront, if the expression
\[ \frac{m_{j-1}}{m_j} < \frac{m_{j-2}}{m_{j-1}} \qquad (4.8) \]
does not hold then the iterative process is halted, else continued. The above discussion helps understand this stopping condition. Let us look at Figure 4.10 showing
the two contradicting influences – dataset size and reconstruction error – in an MDL
based formulation of our clustering problem.

Figure 4.10: Balancing the two contradicting influences (reconstruction error and relative dataset size) in the clustering formulation.

Although the ratio $\frac{m_j}{m_{j-1}}$ is less than
one, the condition in Equation 4.8 is encouraging an increase in this ratio. It implies
that the stopping condition encourages a rapid decrease in the size of the dataset,
by the relation mi ∝ Ni . The downward sloping arrow along the ‘Reconstruction
Error' curve in Figure 4.10 represents this effect. Moreover, as $m_i$ is less than $m_{i-1}$, $\Delta E_{i+1}$ is less than $\Delta E_i$. The relative error difference ($\frac{\Delta E_{i+1}}{\Delta E_i}$) is increasing as long as the condition in Equation 4.8 is satisfied, as a result of $m_i$ being less than $m_{i-1}$. Hence, the condition in Equation 4.8 does not favor a decline in the relative error
difference. In the context of Figure 4.10, this is depicted by the downward sloping
arrow along the ‘Reconstruction Error’ curve. At the iteration at which the stopping condition in Equation 4.8 is violated, both the above effects (increasing relative
reconstruction error and the rate at which the dataset size is decreasing) are not
satisfied. We chose to stop at this iteration. Intuitively, this is indicated by the
intersection point of the two curves in Figure 4.10.
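A tiny sketch of this check, assuming a list m whose entries are the number of points moved in each iteration (the list name is ours):

```python
def should_stop(m):
    """Stop when the condition of Eq. (4.8) no longer holds."""
    if len(m) < 3:
        return False          # need at least three iterations to evaluate the ratios
    return not (m[-2] / m[-1] < m[-3] / m[-2])
```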
At the end of the iterative process a much smaller dataset Dt , as compared to the
original dataset, is obtained. This dataset preserves the structural shape of the
original dataset.
4.2.3 Phase 2 – Cluster Identification
Once the intrinsic shape of the clusters is identified, the task remains to isolate
the individual clusters. The first phase (Section 4.2.2) helps drastically reduce the
noise while reducing the size of the dataset considerably.
Let us consider two cases within cluster identification. The first case deals with the possibility that the desired number of clusters is pre-specified. In the second
case, the algorithm needs to determine the number of clusters automatically.
Number of clusters c specified: In this scenario, obtaining the clusters is fairly straightforward. Since the original dataset is significantly reduced in size after the
first phase, any computationally inexpensive clustering algorithm can be applied
to Dn . We show results of applying hierarchical clustering (CHAMELEON) in
Section 4.3.
Number of clusters unspecified: When the number of clusters is not specified, the identify_clusters(D) method in Figure 4.6 proceeds in two stages. In the first stage, running a connected components algorithm on $D_t$ delivers the set of preliminary clusters C. In the second stage, the clusters in C are merged to obtain the final set of clusters. The merging process is based on a similarity metric. For
each pair of clusters in C, two similarity measures are defined. Let $B(C_i, C_j)$ be the points in cluster $C_i$ that have a point from $C_j$ in their k-NN set, i.e., $B(C_i, C_j) = \{p_i \in C_i \mid \exists p_j \in R_k(p_i) \wedge p_j \in C_j\}$. We call $B(C_i, C_j)$ the border points in cluster $C_i$ to cluster $C_j$. Note that $B(C_i, C_j)$ need not be the same as $B(C_j, C_i)$. Let $E(C_i, C_j)$ be the total number of occurrences of points in $C_j$ in the k-neighborhood of points in $C_i$, i.e., $E(C_i, C_j) = \sum_{p_i \in C_i} |\{p_j \mid p_j \in R_k(p_i) \wedge p_j \in C_j\}|$. Let $B(C_i)$ be the set of all border points in cluster $C_i$, i.e., $B(C_i) = \bigcup_{C_j \neq C_i} B(C_i, C_j)$. The first similarity metric $S_1$ is given by
\[ S_1(C_i, C_j) = \frac{|E(C_i, C_j)|}{|B(C_i, C_j)|} > \alpha \qquad (4.9) \]
The higher the value of the ratio in Equation 4.9, the greater the similarity between
the clusters. A high value for S1 (Ci , Cj ) indicates that the points in Cj are close
to the border points in Ci . This similarity metric captures the degree of closeness,
measured in terms of local neighborhood of border points, between a cluster pair.
The second similarity measure we define, S2 , is given by
\[ S_2(C_i, C_j) = \frac{|B(C_i, C_j)|}{|B(C_i)|} > \beta \qquad (4.10) \]
The similarity $S_2(\cdot,\cdot)$ ensures that two clusters can be merged only if the interaction
“face” (fraction of border points) between the two clusters is above the β threshold.
Cluster pairs are iteratively merged, starting with the pair with highest similarity.
For two clusters Ci and Cj to be merged both conditions in Equations 4.9 and 4.10
must be satisfied. Since the true number of clusters is not specified, we need to provide lower-bound thresholds (α and β) for the similarity criteria to continue
merging of clusters.
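A sketch of how the two scores could be computed for a cluster pair, assuming a precomputed (N, k) array knn_idx of nearest-neighbor indices and an array labels of preliminary cluster ids (both names are hypothetical):

```python
import numpy as np

def merge_scores(knn_idx, labels, ci, cj):
    """Similarity scores S1 (Eq. 4.9) and S2 (Eq. 4.10) from cluster ci towards cluster cj."""
    in_ci = np.flatnonzero(labels == ci)
    nbr_labels = labels[knn_idx[in_ci]]             # (|ci|, k) labels of each point's k-NN
    B_ci_cj = (nbr_labels == cj).any(axis=1).sum()  # |B(ci, cj)|: border points towards cj
    E_ci_cj = (nbr_labels == cj).sum()              # |E(ci, cj)|: neighbor occurrences in cj
    B_ci = (nbr_labels != ci).any(axis=1).sum()     # |B(ci)|: border points towards any cluster
    S1 = E_ci_cj / max(B_ci_cj, 1)
    S2 = B_ci_cj / max(B_ci, 1)
    return S1, S2
```

A pair would then be merged, starting from the most similar pair, only when both S1 > α and S2 > β hold.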
4.2.4 Complexity Analysis
Let us assume that the above algorithm converges after t iterations. The
number of points at the end of each iteration is given by N0 , N1 , ..., Nt . For each point
p (in each iteration i) that globs its nearest neighbors, a k-NN search is performed on the dataset. Since we use a kd-tree to store the points, a k-NN search takes $O(N_i^{1-\frac{1}{d}})$ time. Let $G_1, G_2, \ldots, G_t$ be the number of points that have globbed other points in each of the iterations. The total complexity of the k-NN searches is given by $O(\sum_{i=1}^{t} G_i \cdot N_i^{1-\frac{1}{d}})$. Moving the points involves computing the new location based on the k-NN. If $M_1, M_2, \ldots, M_t$ represents the number of points that move in each iteration, the total cost of moving across all iterations is given by $O(k \cdot \sum_{i=1}^{t} M_i)$.
When CHAMELEON is applied to the set of points after the iterative process, the
computational cost is O(Nt log Nt ). Hence the total computational cost is the sum
of the above terms. Let us assume that a constant fraction of points are globbed
and moved in each iteration, i.e., Gi = Mi = O(1). Also let us assume that in the
worst case the number of points in each iteration are O(N0 ), i.e., Ni = O(N0 ). In
the worst case, the runtime complexity of the algorithm is O(tN0 +kN0 +N0 log N0 ),
where t is the number of iterations of the algorithms and k is the number of nearest
neighbors selected.
4.3 Experimental Evaluation
In this section we will briefly look at the performance results for the above algorithm. We only cover the scenario where the true number of clusters is specified.
4.3.1 Datasets
The same datasets that were used for SPARCL in Chapter 3 are used here.
Broadly, a set of synthetic datasets containing 13 clusters of arbitrary shapes are
used for the scalability experiments. Some of the commonly used datasets in the
literature have also been explored. The experiments are conducted on a Mac G5
machine with a 2.66 GHz processor, running Mac OS X 10.4. Our code is written in C++ using the Approximate Nearest Neighbor (ANN) library (http://www.cs.umd.edu/~mount/ANN/).

4.3.2 Scalability Results
To study the scalability of the proposed algorithm, we generate synthetic
datasets of varying number of points. The number of noise points is kept constant at 5% of the total dataset size. The dimensionality of the dataset is d = 2 and the number of clusters is fixed at 13. For each dataset k is set at 70. The first
column in Table 4.1 specifies the size of the datasets, the largest being a dataset
with 1 million points. The table breaks down the total execution time into the time
taken during the iterative process (Column 2) and the CHAMELEON execution
time on the final dataset after the iterative steps (Column 4). The number of iterations performed and the size of the final dataset (as a percentage of the initial
dataset) after t iterations are shown in Columns 3 and 5, respectively. Execution
time results for 3D datasets (protein and swiss roll) have also been shown in this
table. As observed from the table, the time taken by the iterative process increases
with the increasing size of the dataset. Also, different datasets exhibit varying degrees of dataset reduction. The time taken by CHAMELEON is proportional to
the dataset reduction achieved. This is evident from the observation that the time
taken by CHAMELEON on the 1000K dataset is ten times less than that for the
800K dataset. The reduction in the dataset is purely a factor of the density of the
points and also the relative position of the points. As such, no concrete reasoning
for better reduction with the 1000K dataset can be tendered.
Figure 4.11: Scalability Results for Backbone Based Clustering
Figure 4.11 compares the execution time taken by random seeded SPARCL with
the method proposed in this chapter. The time reported is the total execution
time, i.e., time for iterative steps and time taken by CHAMELEON. To remind
the reader, randomly seeded SPARCL is faster as compared to SPARCL that is
seeded using the LOF technique. Moreover, both forms of SPARCL are an order of magnitude faster as compared to contemporary clustering algorithms. As a
result, those comparisons have been omitted here. To summarize, the backbone based clustering approach is around an order of magnitude faster as compared to SPARCL. Since SPARCL itself is an order of magnitude faster than contemporary
clustering algorithms (CHAMELEON and DBSCAN), the backbone based approach
is two orders of magnitude faster than CHAMELEON and DBSCAN. Figure 4.12
shows the “skeletons” or reduced datasets for some of the common datasets in the
literature (Fig. 4.12(b) and 4.12(d)). The skeletons for 3D datasets (protein and
swiss roll) are also shown in Figure 4.13. For the sake of comparison, the number
of points in each dataset are also show. The 3D datasets exhibit a predominant
86
Dataset
(no. of
points)
10K
50K
100K
200K
400K
600K
800K
1000K
protein
(14669)
swiss roll
(19386)
Time for t Number of
Time for
Dataset size
iterations iterations CHAMELEON on
after t
(sec)
(t)
dataset post t
iterations (% of
iterations (sec)
initial size)
0.503
4
0.428
4.41%
3.00
4
1.08
4.07%
5.597
4
1.616
5.2%
12.159
4
7.66
5.98%
26.467
4
25.130
6.94%
40.923
4
58.732
6.88%
57.503
4
109.935
7.49%
113.861
10
10.501
1.78%
1.119
5
1.068
13.8%
1.38
6
1.16
12.74%
Table 4.1: Scalability results on dataset with 13 true clusters. The size
of the dataset is varied keeping the noise at 5% of the dataset
size.
4.3.3 Clustering Quality Results
As in Chapter 3, an external criterion, namely purity score is used to measure
the clustering quality. Recall that the purity score lies in the range [0, 1]. Like
before, the noise points are eliminated while computing the purity, since different
algorithms deal with noise points differently. Figure 4.14 shows the purity score for
the synthetic datasets used earlier. The purity is fairly stable apart from the score for the dataset of size 600K. As compared to SPARCL, the purity score of the proposed
method is less by a small fraction. This is because the globbing and movement
process at times tends to glob border points with noise points.
Figure 4.14: Purity score with varying dataset size

4.3.4 Parameter Sensitivity Results
We performed experiments to test the sensitivity of the algorithm to the input
parameter k (number of nearest neighbors). For a given dataset, we alter k and
record the clustering quality. We selected the dataset with 800K points for this experiment.

Figure 4.12: Backbone/skeleton of 2D synthetic datasets in our study. Left column: original dataset, right column: skeletons. (a) 8000 points. (b) 1077 points. (c) 1000 points. (d) 838 points.

Figures 4.15(a) and 4.15(b) show the run time and purity, respectively,
as the value of k is varied. Figure 4.15 shows the execution time and purity as the number of nearest neighbors is increased. Note that the purity score remains
almost the same. From Figure 4.15(a) one can see that the execution time increases
linearly as the k parameter is gradually increased.
Figure 4.15: Execution time and purity for varying number of nearest neighbors. (a) Execution time. (b) Purity score.

4.4 Conclusion
In this chapter, we proposed another method for clustering large spatial point
datasets. Like SPARCL, this method too results in a reduced dataset, which we call the backbone of the original dataset. Finding clusters in the backbone amounts
to identifying clusters in the original dataset.

Figure 4.13: Backbone/skeleton of 3D synthetic datasets in our study. Left column: original dataset, right column: skeletons. (a) 19386 points. (b) 2471 points. (c) 14669 points. (d) 2023 points.

The algorithm performs two steps
(globbing and movement) iteratively, resulting in a substantially reduced dataset
that still captures the structural shape of the clusters. From the experimental
evaluation we see that the algorithm is more scalable as compared to SPARCL.
4.4.1 Comparison with SPARCL
Since both SPARCL and the proposed approach target the space of scalable
clustering algorithms, they have considerable similarities and some subtle differences. Following are some notes comparing the two.
1. In its first stage, SPARCL aims to identify representatives from the entire
dataset that would capture the dense regions (clusters) in the dataset. On
the other hand, the current method tries to retain the structure in the data,
while globbing and moving the points. In some sense, the backbone is the
representative for the entire dataset.
2. SPARCL takes a projective approach, wherein the points belonging to the two
clusters are projected onto the line connecting their centers. As the dimensionality of the data is increased, this approach is likely to result in misleading similarity scores. This is because even points that are far apart can get projected onto the same bin, resulting in a misleading score. The backbone based approach, on the other hand, can suffer from the curse of dimensionality as it
relies extensively on nearest neighbor queries.
3. Although SPARCL has superior run time complexity, in practice the backbone
based clustering algorithm turns out to be more efficient.
4. In some sense SPARCL is a parametric approach, since it assumes that regular isotropic clusters can be overlaid on the true clusters. The current algorithm is non-parametric in that respect, since it makes no such assumptions and infers all information directly from the data.
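As referenced in point 2 above, the following is a minimal sketch of a projection-and-binning similarity between two seed clusters. The number of bins and the histogram-overlap score are assumptions made for illustration; they do not reproduce the exact similarity function of Chapter 3.

```python
import numpy as np

def projection_similarity(cluster_x, cluster_y, n_bins=20):
    """Illustrative projection-based similarity: project the points of two
    seed clusters onto the line joining their centers, histogram the
    projections, and score the overlap of the two profiles.  Bin count and
    overlap measure are assumptions for this sketch."""
    X, Y = np.asarray(cluster_x, float), np.asarray(cluster_y, float)
    cx, cy = X.mean(axis=0), Y.mean(axis=0)
    direction = cy - cx
    direction /= np.linalg.norm(direction)
    tx = (X - cx) @ direction          # signed positions along the center line
    ty = (Y - cx) @ direction
    lo, hi = min(tx.min(), ty.min()), max(tx.max(), ty.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    hx, _ = np.histogram(tx, bins=bins, density=True)
    hy, _ = np.histogram(ty, bins=bins, density=True)
    # High overlap of the projected densities suggests the two seed clusters
    # belong to the same natural cluster.
    return np.minimum(hx, hy).sum() * (bins[1] - bins[0])
```

In high dimensions, points that are far apart in the full space can still land in the same bin along this single projection direction, which is exactly the failure mode noted in point 2.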
CHAPTER 5
Conclusion and Future Work
Chapters 3 and 4 covered our contributions to shape-based clustering. This chapter outlines some future directions in this area.
5.1 Efficient Subspace Clustering
Figure 5.1: Subspace clustering – Challenges for SPARCL. (a) View perpendicular to the YZ plane; (b) view perpendicular to the XZ plane.
As a future direction, we are interested in exploring how SPARCL and the backbone method can be extended to subspace-based clustering. Subspace clustering is useful for applications where the patterns/clusters lie within a smaller set of dimensions. As noted earlier, our current solutions are designed for full-space clustering, i.e., the
clusters are assumed to span all the dimensions. Preliminary experiments show, for
example, that SPARCL cannot detect subspace clusters effectively. This is because
a subspace cluster can have very sparsely distributed points in dimensions other
than the subspace dimensions. As a result, the similarity score between the seed
clusters computed by SPARCL will end up being small, indicating separate clusters.
This is illustrated in Figure 5.1. Consider a cluster in the XZ subspace, as shown in Figure 5.1(b). The same cluster appears as a sparse set of points when viewed along the direction perpendicular to the YZ plane. If a projection-based approach is taken (as in SPARCL), the seed clusters are likely to end up with very small inter-cluster similarity scores, due to the sparsity of the data. The subspaces occupied by the seed clusters are not taken into account in the similarity computation. As a result, the similarity between two clusters in different subspaces can be close to the similarity between two clusters in the same subspace. Different approaches can be tried for identifying subspace clusters. Preliminary ideas for some of these approaches are outlined below:
1. A new unified similarity metric needs to be defined that takes into account the dimensionality of the seed clusters along with the distance-based similarity. Currently, in SPARCL, the similarity S(X, Y) is a function of the distance and density of the two pseudo-clusters. This similarity should also include the dimension compatibility between the two pseudo-clusters X and Y (a minimal sketch of this idea appears after this list).
2. A projective approach such as the one taken in [2] can be combined with SPARCL. Using the concepts in [2], a large number of convex polytopes in different subspaces can be determined. The convex polytopes can then be merged to obtain the final set of clusters. In this method too, the SPARCL similarity metric has to be a function of the dimensions involved.
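To illustrate the first idea, the sketch below blends a precomputed distance-based similarity with a dimension-compatibility term. Estimating a pseudo-cluster's relevant dimensions from per-dimension variance and combining the two terms by multiplication are assumptions made for this sketch, not a finalized metric.

```python
import numpy as np

def relevant_dims(cluster, var_fraction=0.1):
    """Estimate the subspace of a pseudo-cluster as the set of dimensions in
    which its points are most concentrated (variance below a fraction of the
    largest per-dimension variance).  The threshold is an assumption."""
    var = np.asarray(cluster, float).var(axis=0)
    return set(np.where(var <= var_fraction * var.max())[0])

def subspace_similarity(cluster_x, cluster_y, distance_sim):
    """Blend a plain distance-based similarity with the Jaccard overlap of the
    two clusters' estimated subspaces, so that seed clusters living in
    unrelated subspaces are penalized even if they appear close in space."""
    dims_x, dims_y = relevant_dims(cluster_x), relevant_dims(cluster_y)
    union = dims_x | dims_y
    dim_compat = len(dims_x & dims_y) / len(union) if union else 1.0
    return distance_sim * dim_compat
```

Here distance_sim stands for the existing SPARCL similarity between the two pseudo-clusters; only the dimension-compatibility factor is new.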
5.2 Shape Indexing
Considerable work has been done on indexing and matching shapes [123, 120]. Some of the work from the data mining community in this area has focused on converting the shapes to a time series and then indexing the time series [67]. The drawback of such time-series-based methods is that they only consider the boundary of the shape, independent of other factors such as the density of points within the shape. Our idea is based on the Local Outlier Factor (LOF) based seeding approach proposed in Chapter 3. Although never explicitly stated, the representatives selected by the LOF approach are rotation invariant, as shown in Figure 5.2, which also means that the distances to the k-nearest neighbors from each point are preserved. This fact can be used to index shapes.
Figure 5.2: Local Outlier Factor based representatives are rotation invariant.
From the set of selected seeds, the distances from each seed to its k-nearest neighbors constitute a feature vector. Let A be such a feature vector for a shape a and let B be the feature vector for a shape b. If the shapes a and b are similar, then A = αB, where α is the scaling factor. A minimal sketch of this matching scheme is given below.
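The sketch assumes that both shapes are summarized by the same number of representatives; the value of k, the global sorting of the distances, and the least-squares estimate of the scale factor α are choices made for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_feature_vector(representatives, k=5):
    """Distances from each LOF-selected representative to its k nearest
    fellow representatives (assumes at least k+1 representatives).  A global
    sort removes any dependence on the order of the representatives."""
    reps = np.asarray(representatives, float)
    dists, _ = cKDTree(reps).query(reps, k=k + 1)   # column 0 is the point itself
    return np.sort(dists[:, 1:].ravel())

def shape_match_score(feat_a, feat_b):
    """If shapes a and b are similar up to scale, A is approximately alpha*B.
    Estimate alpha by least squares and report the relative residual
    (0 means a perfect scaled match)."""
    a, b = np.asarray(feat_a, float), np.asarray(feat_b, float)
    alpha = (a @ b) / (b @ b)                       # least-squares scale factor
    residual = np.linalg.norm(a - alpha * b) / np.linalg.norm(a)
    return alpha, residual
```

Because the feature vectors depend only on inter-point distances, they are unaffected by rotation or translation of the shape; the estimated α accounts for uniform scaling.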
Other directions: Other areas of future exploration include scaling the proposed algorithms to higher dimensions using methods such as locality-sensitive hashing (LSH) and efficient nearest neighbor search in high dimensions [112, 76].
A new spatial clustering algorithm based on concepts from graph theory is another potential direction. Graph sparsification methods, such as the efficient methods proposed in [108], can be used to sparsify the kNN graph of the data points. Identifying clusters from the sparse kNN graph should then be an easier task, as illustrated by the sketch below.
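The sketch below illustrates this direction using a simple mutual-kNN rule in place of the effective-resistance sparsification of [108], which would be the more principled choice; the parameter k and the use of connected components to read off clusters are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def mutual_knn_clusters(points, k=10):
    """Cluster spatial points by (1) building a kNN graph, (2) sparsifying it
    by keeping only mutual kNN edges, and (3) reading clusters off as
    connected components.  The mutual-kNN rule is a simple stand-in for the
    effective-resistance sparsification of [108]."""
    pts = np.asarray(points, float)
    n = len(pts)
    tree = cKDTree(pts)
    _, nbrs = tree.query(pts, k=k + 1)      # first neighbor is the point itself
    nbrs = nbrs[:, 1:]
    neighbor_sets = [set(row) for row in nbrs]
    rows, cols = [], []
    for i in range(n):
        for j in nbrs[i]:
            if i in neighbor_sets[j]:       # keep only mutual edges
                rows.append(i)
                cols.append(j)
    graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    n_clusters, labels = connected_components(graph, directed=False)
    return n_clusters, labels
```

Arbitrarily shaped clusters tend to survive this construction as long as they are internally denser than the regions separating them, since mutual kNN edges rarely cross sparse gaps.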
BIBLIOGRAPHY
[1] Ajith Abraham, Swagatam Das, and Sandip Roy. Swarm intelligence algorithms for data clustering. In Soft Computing for Knowledge Discovery and
Data Mining, pages 279–313. 2008.
[2] Pankaj K. Agarwal and Nabil H. Mustafa. k-means projective clustering. In
PODS ’04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART
symposium on Principles of database systems, pages 155–165, New York, NY,
USA, 2004. ACM.
[3] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar
Raghavan. Automatic subspace clustering of high dimensional data for data
mining applications. SIGMOD Record, 27(2):94–105, 1998.
[4] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–
122, 2008.
[5] Anooshiravan Ansari, Assadollah Noorzad, and Hamid Zafarani. Clustering
analysis of the seismic catalog of Iran. Comput. Geosci., 35(3):475–486, 2009.
[6] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007.
[7] M. M. Astrahan. Speech analysis by clustering, or the hyperphoneme method.
Technical report, Stanford A.I. Project Memo, Stanford University, 1970.
[8] G. H. Ball and D. J. Hall. Promenade– an online pattern recognition system.
Technical report, Stanford Research Institute, Stanford University, 1967.
[9] Mats Bengtsson and Johan Schubert. Dempster-shafer clustering using potts
spin mean field theory. Soft Computing, 5(3):215–228, 2001.
[10] James C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell, MA, USA, 1981.
[11] Wu Bin and Shi Zhongzhi. A clustering algorithm based on swarm intelligence. Volume 3, pages 58–66, 2001.
[12] Marcelo Blatt, Shai Wiseman, and Eytan Domany. Clustering data through
an analogy to the potts model. In Advances in Neural Information Processing
Systems 8, pages 416–422. MIT Press, 1996.
[13] Harry Blum. A transformation for extracting new descriptors of shape. Models
for the Perception of Speech and Visual Form, pages 362–380, 1967.
[14] P. S. Bradley and U. M. Fayyad. Refining initial points for k-means clustering.
In Fifteenth Intl. Conf. on Machine Learning, pages 91–99, 1998.
[15] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In ACM SIGMOD 2000 Int. Conf. on Management of Data, 2000.
[16] Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma, and Ji-Rong Wen. Hierarchical clustering of www image search results using visual, textual and link
information. In MULTIMEDIA ’04: Proceedings of the 12th annual ACM international conference on Multimedia, pages 952–959, New York, NY, USA,
2004. ACM.
[17] Man-chung Chan, Yuen-Mei Li, and Chi-Cheong Wong. Web-based cluster
analysis system for china and hong kong’s stock market. In IDEAL ’00:
Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning, Data Mining, Financial Engineering, and
Intelligent Agents, pages 545–550, London, UK, 2000. Springer-Verlag.
[18] V. Chaoji, M. Al Hasan, S. Salem, and M.J. Zaki. Sparcl: Efficient and
effective shape-based clustering. In Data Mining, 2008. ICDM ’08. Eighth
IEEE International Conference on, pages 93–102, Dec. 2008.
[19] B. Chazelle and L. Palios. Algebraic Geometry and its Applications. Springer-Verlag, 1994.
[20] Tom Chiu, DongPing Fang, John Chen, Yao Wang, and Christopher Jeris.
A robust and scalable clustering algorithm for mixed type attributes in large
database environment. In KDD ’01: Proceedings of the seventh ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 263–
268, New York, NY, USA, 2001. ACM.
[21] Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski,
Andrew Y. Ng, and Kunle Olukotun. Map-reduce for machine learning on
multicore. In Bernhard Schölkopf, John C. Platt, and Thomas Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 281–288.
MIT Press, 2006.
[22] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to
Algorithms. McGraw-Hill, 2nd edition, 2005.
[23] M. Couprie. Note on fifteen 2d parallel thinning algorithms. Technical report,
Universit de Marne-laValle, IGM2006-01, 2006.
[24] Glendon Cross and Wayne Thompson. Understanding your customer: Segmentation techniques for gaining customer insight and predicting risk in the
telecom industry. SAS Global Forum, 2008.
[25] Xiaohui Cui, Jinzhu Gao, and Thomas E. Potok. A flocking based algorithm
for document clustering analysis. J. Syst. Archit., 52(8):505–515, 2006.
[26] Ian Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues
and the k-means algorithm. In SIAM Data Mining Conference, 2005.
[27] Marcilio de Souto, Ivan Costa, Daniel de Araujo, Teresa Ludermir, and
Alexander Schliep. Clustering cancer gene expression data: a comparative
study. BMC Bioinformatics, 9(1):497, 2008.
[28] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from
incomplete data via the em algorithm. Journal of the Royal Statistical Society.
Series B (Methodological), 39(1):1–38, 1977.
[29] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means, spectral clustering
and normalized cuts. In KDD ’04: Proceedings of the tenth ACM SIGKDD
international conference on Knowledge discovery and data mining, 2004.
[30] Sara Dolniar and Friedrich Leisch. Behavioral market segmentation of binary
guest survey data with bagged clustering. In ICANN ’01: Proceedings of
the International Conference on Artificial Neural Networks, pages 111–118,
London, UK, 2001. Springer-Verlag.
[31] Carlotta Domeniconi and Dimitrios Gunopulos. An efficient density-based approach for data mining tasks. Knowledge and Information Systems, 6(6):750–
770, 2004.
[32] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification
(2nd Edition). Wiley-Interscience, 2000.
[33] A. J. Enright, S. Van Dongen, and C. A. Ouzounis. An efficient algorithm for
large-scale detection of protein families. Nucleic Acids Res, 30(7):1575–1584,
2002.
[34] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise.
In ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pages
226–231, 1996.
[35] Gianluigi Folino and Giandomenico Spezzano. An adaptive flocking algorithm
for spatial clustering. In PPSN VII: Proceedings of the 7th International Conference on Parallel Problem Solving from Nature, pages 924–933, London, UK,
2002. Springer-Verlag.
[36] E. W. Forgy. Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics, 21:768–769, 1965.
[37] Brendan J. Frey and Delbert Dueck. Clustering by passing messages between
data points. Science, 315:972–976, 2007.
[38] Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan. CACTUS clustering categorical data using summaries. In KDD ’99: Proceedings of the
fifth ACM SIGKDD international conference on Knowledge discovery and data
mining, pages 73–83, 1999.
[39] Byron J. Gao, Martin Ester, Jin-Yi Cai, Oliver Schulte, and Hui Xiong. The
minimum consistent subset cover problem and its applications in data mining.
In KDD ’07: Proceedings of the 13th ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 310–319, New York, NY, USA,
2007. ACM.
[40] J. A. García, J. Fdez-Valdivia, F. J. Cortijo, and R. Molina. A dynamic
approach for clustering data. Signal Process., 44(2):181–196, 1995.
[41] Martin Gavrilov, Dragomir Anguelov, Piotr Indyk, and Rajeev Motwani. Mining the stock market (extended abstract): which measure is best? In KDD ’00:
Proceedings of the sixth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 487–496, New York, NY, USA, 2000. ACM.
[42] Soheil Ghiasi, Ankur Srivastava, Xiaojian Yang, and Majid Sarrafzadeh. Optimal energy aware clustering in sensor networks. Sensors, 2(7):258–269, 2002.
[43] Jonatan Gomez, Dipankar Dasgupta, and Olfa Nasraoui. A new gravitational
clustering algorithm. In In Proc. of the SIAM Int. Conf. on Data Mining
(SDM), 2003.
[44] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing (3rd
Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2006.
[45] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: an efficient clustering algorithm for large databases. In ACM SIGMOD International Conference
on Management of Data, pages 73–84, 1998.
[46] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: A robust clustering algorithm for categorical attributes. In ICDE ’99: Proceedings of the 15th
International Conference on Data Engineering, page 512, Washington, DC,
USA, 1999. IEEE Computer Society.
[47] J. Han, M. Kamber, and A. K. H. Tung. Spatial Clustering Methods in Data
Mining: A Survey. Taylor and Francis, 2001.
[48] David Harel and Yehuda Koren. Clustering spatial data using random walks.
In KDD ’01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 281–286, New York, NY,
USA, 2001. ACM.
[49] William W. Hargrove and Forrest M. Hoffman. Using multivariate clustering
to characterize ecoregion borders. Computing in Science and Engg., 1(4):18–
25, 1999.
[50] Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, and Mohammed J. Zaki.
Robust partitional clustering by outlier and density insensitive seeding. Pattern Recogn. Lett., 30(11):994–1002, 2009.
[51] Constantinos S. Hilas and Paris As. Mastorocostas. An application of supervised and unsupervised learning approaches to telecommunications fraud
detection. Know.-Based Syst., 21(7):721–726, 2008.
[52] A. Hinneburg and D.A Keim. An efficient approach to clustering in multimedia
databases with noise. In 4th Int’l Conf. on Knowledge Discovery and Data
Mining, 1999.
[53] Alexander Hinneburg and Hans-Henning Gabriel. Denclue 2.0: Fast clustering
based on kernel density estimation. In International Symposium on Intelligent
Data Analysis, 2007.
[54] Alexander Hinneburg and Daniel A. Keim. A general approach to clustering
in large databases with noise. Knowledge and Information Systems, 5(4):387–
415, 2003.
[55] Xiaohua Hu and Yi Pan. Knowledge Discovery in Bioinformatics: Techniques,
Methods, and Applications (Wiley Series in Bioinformatics). John Wiley &
Sons, Inc., New York, NY, USA, 2007.
[56] Woochang Hwang, Young-Rae Cho, Aidong Zhang, and Murali Ramanathan.
A novel functional module detection algorithm for protein-protein interaction
network. Algorithms for Molecular Biology, 1, 2006.
[57] I. Katsavounidis, C. C. J. Kuo, and Z. Zhang. A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1:144–146, 1994.
[58] M. Indulska and M. E. Orlowska. Gravity based spatial clustering. In GIS
’02: Proceedings of the 10th ACM international symposium on Advances in
geographic information systems, pages 125–130, New York, NY, USA, 2002.
ACM.
[59] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.
[60] Woncheol Jang and Martin Hendry. Cluster analysis of massive datasets in
astronomy. Statistics and Computing, 17(3):253–262, 2007.
[61] Eshref Januzaj, Hans-Peter Kriegel, and Martin Pfeifle. Towards effective and
efficient distributed clustering. In In Workshop on Clustering Large Data Sets
(ICDM), pages 49–58, 2003.
[62] Klaus Julisch. Clustering intrusion detection alarms to support root cause
analysis. ACM Transactions on Information and System Security, 6(4):443–
471, 2003.
[63] Konstantinos Kalpakis, Dhiral Gada, and Vasundhara Puttagunta. Distance
measures for effective clustering of arima time-series. In ICDM ’01: Proceedings of the 2001 IEEE International Conference on Data Mining, pages
273–280, Washington, DC, USA, 2001. IEEE Computer Society.
[64] George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar. Chameleon: Hierarchical clustering using dynamic modeling. IEEE Computer, 32(8):68–75,
1999.
[65] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. New York: Wiley, 1990.
[66] Eamonn Keogh. Exact indexing of dynamic time warping. In VLDB ’02:
Proceedings of the 28th international conference on Very Large Data Bases,
pages 406–417. VLDB Endowment, 2002.
[67] Eamonn Keogh, Li Wei, Xiaopeng Xi, Sang-Hee Lee, and Michail Vlachos.
Lb keogh supports exact indexing of shapes under rotation invariance with
arbitrary representations and distance measures. In VLDB ’06: Proceedings
of the 32nd international conference on Very large data bases, pages 882–893.
VLDB Endowment, 2006.
[68] Hisashi Koga, Tetsuo Ishibashi, and Toshinori Watanabe. Fast agglomerative
hierarchical clustering algorithm using locality-sensitive hashing. Knowledge
and Information Systems, 12(1):25–53, 2007.
[69] Martin Kulldorff and N. Nagarwalla. Spatial Disease Clusters: Detection and
Inference. Statistics in Medicine, 14:799–810, 1995.
[70] Sukhamay Kundu. Gravitational clustering: a new approach based on the
spatial distribution of the points. Pattern Recognition, 32(7):1149–1160, 1999.
[71] P. C. Lai, C. M. Wong, A. J. Hedley, S. V. Lo, P. Y. Leung, J. Kong, and
G. M. Leung. Understanding the spatial clustering of severe acute respiratory
syndrome (sars) in hong kong. Environ Health Perspectives, 112(15):1550–
1556, 2004.
[72] Andrew B. Lawson, Silvia Simeon, Martin Kulldorff, Annibale Biggeri, and
Corrado Magnani. Line and point cluster models for spatial health data.
Comput. Stat. Data Anal., 51(12):6027–6043, 2007.
[73] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative
matrix factorization. Nature, 401(6755):788–791, October 1999.
[74] John Lee and Michel Verleysen. Nonlinear Dimensionality Reduction. Springer, 2007.
[75] Cheng-Ru Lin and Ming-Syan Chen. A robust and efficient clustering algorithm based on cohesion self-merging. In KDD ’02: Proceedings of the
eighth ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 582–587, New York, NY, USA, 2002. ACM.
[76] King Ip Lin, H. V. Jagadish, and Christos Faloutsos. The tv-tree: an index
structure for high-dimensional data. The VLDB Journal, 3(4):517–542, 1994.
[77] S. Lloyd. Least squares quantization in PCM. Technical Note, Bell Laboratories, 1957; IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[78] Yaniv Loewenstein, Elon Portugaly, Menachem Fromer, and Michal Linial. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling
the entire protein space. Bioinformatics, 24(13):41–49, July 2008.
[79] G.E. Lowitz. What the fourier transform can really bring to clustering. Pattern
Recognition, 17(6):657–665, 1984.
[80] Ulrike Luxburg. A tutorial on spectral clustering. Statistics and Computing,
17(4):395–416, 2007.
[81] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[82] Ujjwal Maulik and Sanghamitra Bandyopadhyay. Genetic algorithm-based clustering technique. Pattern Recognition, 33:1455–1465, 2000.
[83] M. Meila and J. Shi. A random walks view of spectral segmentation. In AI
and Statistics (AISTATS), 2001.
[84] Harvey J. Miller and Jiawei Han. Geographic Data Mining and Knowledge
Discovery. Taylor & Francis, Inc., Bristol, PA, USA, 2001.
[85] Masatoshi Nei and Sudhir Kumar. Molecular Evolution and Phylogenetics.
Oxford University Press, USA, 2000.
[86] Daniel B. Neill and Andrew W. Moore. A fast multi-resolution method for
detection of significant spatial disease clusters. In Advances in Neural Information Processing Systems 16, 2003.
[87] Newton Da Costa Jr., Jefferson Cunha, and Sergio Da Silva. Stock selection based on cluster analysis. Economics Bulletin, 13(1):1–9, 2005.
[88] A. Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an
algorithm. In Advances in Neural Information Processing Systems 14, 2001.
[89] Mary K. Obenshain. Application of data mining techniques to healthcare
data. Infection Control and Hospital Epidemiology, 25:690–695, 2004.
[90] A. Okabe, B. Boots, and K. Sugihara. Spatial Tessellations: Concepts and
Applications of Voronoi Diagrams. John Wiley & Sons, 1992.
[91] Stephan Olariu and Albert Y. Zomaya. Handbook Of Bioinspired Algorithms
And Applications (Chapman & Hall/CRC Computer & Information Science).
Chapman & Hall/CRC, 2005.
[92] Clark F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21(8):1313–1325, 1995.
[93] Yen-Jen Oyang, Chien-Yu Chen, and Tsui-Wei Yang. A study on the hierarchical data clustering algorithm based on gravity theory. In PKDD ’01:
Proceedings of the 5th European Conference on Principles of Data Mining and
Knowledge Discovery, pages 350–361, 2001.
[94] Lance Parsons, Ehtesham Haque, and Huan Liu. Subspace clustering for high
dimensional data: a review. SIGKDD Explor. Newsl., 6(1):90–105, 2004.
[95] L. F. Pineda-Martinez and N. Carbajal. Climatology of Mexico: a Description
Based on Clustering Analysis. American Geophysical Union Spring Meeting
Abstracts, pages A7+, May 2007.
[96] Girish Punj and David W. Stewart. Cluster analysis in marketing research:
Review and suggestions for application. Journal of Marketing Research,
20(2):134–148, May, 1983.
[97] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471,
1978.
[98] Joseph Lee Rodgers and W. Alan Nicewander. Thirteen ways to look at the
correlation coefficient. The American Statistician, 42(1):59–66, 1988.
[99] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by
locally linear embedding. Science, 290(5500):2323–2326, December 2000.
[100] J.-R. Sack and J. Urrutia. Handbook of Computational Geometry. North-Holland Publishing Co., Amsterdam, The Netherlands, 2000.
[101] Sriparna Saha and Sanghamitra Bandyopadhyay. Application of a new
symmetry-based cluster validity index for satellite image segmentation. IEEE
Geoscience and Remote Sensing Letters, 5(2):166–170, 2008.
[102] Michael J. Shaw, Chandrasekar Subramaniam, Gek Woo Tan, and Michael E.
Welge. Knowledge management and data mining for marketing. Decision
Support Systems, 31(1):127–137, 2001.
[103] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis.
Cambridge University Press, 2004.
[104] Gholamhosein Sheikholeslami, Surojit Chatterjee, and Aidong Zhang.
WaveCluster: A multi-resolution clustering approach for very large spatial
databases. In 24th Int. Conf. Very Large Data Bases, VLDB, pages 428–439,
24–27 1998.
[105] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–
905, 2000.
[106] R. Sibson. Slink: An optimally efficient algorithm for the single-link cluster
method. The Computer Journal, 16(1):30–34, 1973.
[107] C. Spearman. The proof and measurement of association between two things.
The American journal of psychology, 100(3-4):441–471, 1987.
[108] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective
resistances. In STOC ’08: Proceedings of the 40th annual ACM symposium
on Theory of computing, pages 563–568, New York, NY, USA, 2008. ACM.
[109] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques, 2000.
[110] Alexander Szalay, Tamas Budavari, Andrew Connolly, Jim Gray, Takahiko
Matsubara, Adrian Pope, and Istvan Szapudi. Spatial clustering of galaxies in
large datasets. volume 4847, pages 1–12. Proceedings- SPIE The International
Society for Optical Engineering, 2002.
[111] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data
Mining. Addison Wesley, 2005.
[112] Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. Quality and efficiency
in high dimensional nearest neighbor search. In SIGMOD ’09: Proceedings
of the 35th SIGMOD international conference on Management of data, pages
563–576, New York, NY, USA, 2009. ACM.
[113] Stijn Van Dongen. Graph clustering via a discrete uncoupling process. SIAM
J. Matrix Anal. Appl., 30(1):121–141, 2008.
[114] Dorothea Wagner and Frank Wagner. Between min cut and graph bisection.
In MFCS ’93: Proceedings of the 18th International Symposium on Mathematical Foundations of Computer Science, pages 744–750, London, UK, 1993.
Springer-Verlag.
[115] Wei Wang, Jiong Yang, and Richard R. Muntz. Sting: A statistical information grid approach to spatial data mining. In VLDB ’97: Proceedings of the
23rd International Conference on Very Large Data Bases, pages 186–195, San
Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[116] Xin Wang and Howard J. Hamilton. DBRS: A density-based spatial clustering
method with random sampling. In Proceedings of the Seventh Pacific-Asia
Conference on Knowledge Discovery and Data Mining, pages 563–575, 2003.
[117] M. Wedel and J. Steenkamp. A clusterwise regression method for simultaneous
fuzzy market structuring and benefit segmentation. Journal of Marketing
Research, pages 385–396, 1991.
[118] Michel Wedel and Wagner A. Kamakura. Market Segmentation: Conceptual
and Methodological Foundations. Kluwer Academic Publisher, 2000.
[119] Ron Wehrens, Lutgarde M.C. Buydens, Chris Fraley, and Adrian E. Raftery.
Model-based clustering for image segmentation and large datasets via sampling. Journal of Classification, 21(2):231–253, September 2004.
[120] Li Wei, Eamonn J. Keogh, and Xiaopeng Xi. Saxually explicit images: Finding
unusual shapes. In ICDM, pages 711–720. IEEE Computer Society, 2006.
[121] W. E. Wright. Gravitational clustering. Pattern Recognition, 9(3):151–166,
1977.
[122] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu,
Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10
algorithms in data mining. Knowledge and Information Systems, 14(1):1–37,
2007.
[123] Dragomir Yankov, Eamonn J. Keogh, Li Wei, Xiaopeng Xi, and Wendy L. Hodges. Fast best-match shape searching in rotation invariant metric spaces. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2007.
[124] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In 18th Annual
Conference on Neural Information Processing Systems, 2004.
[125] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, and Jinwen Ma. Learning to cluster web search results. In SIGIR ’04: Proceedings of the 27th annual
international ACM SIGIR conference on Research and development in information retrieval, pages 210–217, New York, NY, USA, 2004. ACM.
[126] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery,
1(2):141–182, 1997.
[127] Ying Zhao and George Karypis. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10(2):141–168, 2005.
[128] Xianjin Zhu, Rik Sarkar, and Jie Gao. Shape segmentation and applications
in sensor networks. In Proceedings of the 26th IEEE International Conference
on Computer Communications, pages 1838–1846, 2007.