SYNTHETIC DATASETS FOR CLUSTERING ALGORITHMS
By
Jhansi Rani Vennam
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE BY RESEARCH
(COMPUTER SCIENCES)
Under the Guidance of
Prof. Kamalakar Karlapalem
at the
International Institute of Information Technology - Hyderabad
June 2006
Abstract
The process of grouping similar objects in a given dataset is known as clustering. A large variety of clustering algorithms have been proposed to find clusters in a given dataset. Not many real-life datasets are available for testing the proposed algorithms, and the existing datasets do not come with the actual clustering result. This leads to the idea of generating "benchmarking" datasets with high dimensionality and noise, which can evaluate clustering algorithms on various aspects like scalability, accuracy and robustness to noise.
We first propose a few algorithms and methodologies that generate high-dimensional cluster datasets in R^d space along with the original clustering results. We developed a toolkit called SynDECA[1] that generates synthetic datasets based on the proposed algorithms. Given user inputs such as the number of clusters, the dimensionality, the maximum value of a dimension and the size of the dataset, SynDECA generates the clustering dataset. The proposed methods ensure that there are exactly the requested number of clusters in the dataset.
Traditional clustering algorithms try to find clusters in all dimensions of the dataset. When the dimensionality of the dataset increases, some dimensions could be irrelevant for some data points. There could be clusters which are spread in only a subset of the dimensions of the dataset; such clusters may not be visible when seen in all the dimensions. A number of subspace clustering algorithms have been proposed to find such clusters. We propose methods to generate datasets which are useful for evaluating subspace clustering algorithms. When the subspace cluster datasets were used as input, DBSCAN could not identify the subspace clusters, whereas K-Means got confused by the presence of noise, as expected.
Contents
1 Introduction
  1.1 Cluster Data
  1.2 Contributions
  1.3 Structure of Thesis
2 Clustering Methods
  2.1 Categories of Clustering Algorithms
  2.2 Outliers
3 Related Work
4 Synthetic Datasets
  4.1 SynDECA - Framework
    4.1.1 Notation
  4.2 Algorithms
    4.2.1 Algorithm for Cluster Placement
    4.2.2 Algorithm for Cluster Points Generation
    4.2.3 Algorithm for Noise Points Generation
    4.2.4 Complexity of Algorithms
  4.3 Evaluation of datasets generated
    4.3.1 Discussion
  4.4 Summary
5 Subspace Cluster Data
  5.1 Subspace Clustering Methods
  5.2 Subspace Clustering Datasets
  5.3 Summary
6 Experimental Results
  6.1 Comparison with the Existing Work
  6.2 Validation of SynDECA Datasets
7 Conclusions and Future Work
List of Figures
2.1 Working of partition clustering method
2.2 Working of hierarchical clustering methods
2.3 Working of DBSCAN algorithm
4.1 Generation of data
4.2 Algorithm for Cluster Placement
4.3 Division of the given space into cells which can accommodate clusters with minimum radius
4.4 Algorithm for Cluster Points Generation
4.5 Algorithm for Noise Points Generation
5.1 Example how different dimensions are important to different clusters
5.2 Clusters according to projected subspace clustering
5.3 Subspace clustering dataset
6.1 Few examples of the generated data in 2-Dimensions
6.2 Few examples of the generated data in 3-Dimensions
6.3 Top row datasets are from "clusutils"; bottom row datasets are from "SynDECA"
6.4 Plot of generated datasets (a) dataset1 (b) dataset2
6.5 Plot of clustering result of K-Means algorithm on (a) dataset1 (b) dataset2
6.6 Plot of dataset3 (a) Original (b) Result of K-Means algorithm
6.7 Plot of dataset4 (a) Original (b) Result of K-Means algorithm
6.8 Plot of clustering result of DBSCAN algorithm on (a) dataset1 (b) dataset2
6.9 Plot of dataset5 (a) Original (b) Result of DBSCAN algorithm
6.10 Plot of subspace-dataset1 (a) three dimensional plot (b) two dimensional projection
6.11 Clustering result of subspace-dataset1 (a) K-Means (b) DBSCAN
6.12 Plot of subspace-dataset2 (a) three dimensional plot (b) two dimensional projection
6.13 Clustering result of subspace-dataset2 (a) K-Means (b) DBSCAN
6.14 Plot of subspace-dataset3 (a) Original (b) Result of DBSCAN
6.15 Plot of clustering result of ReCkless algorithm on dataset1 (a) k=25 (b) k=60
6.16 Plot of clustering result of ReCkless algorithm on dataset1 (a) k=90 (b) k=162
6.17 Plot of clustering result of ReCkless algorithm on dataset5 (a) k=32 (b) k=33
6.18 Plot of clustering result of ReCkless algorithm on dataset5 (a) k=50 (b) k=78
7.1 Clusters in arbitrary shaped bounding boxes
Chapter 1
Introduction
For complex problems such as data clustering, intrusion detection and spatio-temporal applications, it is difficult to evaluate solutions and algorithms because not enough real-life datasets are available. Therefore there is a need to come up with synthetic datasets for comprehensive evaluation of the algorithms. Manually generating test data is a time-consuming process and prone to human error, so when real datasets are not available, people do create synthetic data. While generating data for a particular algorithm or problem, the following properties must be satisfied.
1. Validity of dataset.
2. Circumstances for which the dataset is valid.
3. Expected behavior of the algorithm with the dataset.
4. Required format.
In the area of spatio-temporal data management, datasets are required in order to evaluate spatio-temporal databases,
spatio-temporal data modeling, query languages, spatio-temporal data mining and spatio-temporal indexing. A recent
work on data generation for spatio-temporal data can be found in [2]. Given a few parameters such as the duration, shift and resizing of an object, tools like GSTD[3], G-TERD[4], and Oporto[5] will generate spatio-temporal data. Of these, Oporto mimics a very specific scenario: fishing at sea. However, since these techniques are spatio-temporal, the generated data is at most four-dimensional.
In fields such as intrusion detection, synthetic data helps in analyzing at which level an intrusion can be predicted or detected without affecting security. Even if such an attack occurs, there will be a financial loss but not a loss of life. In contrast, when analyzing the effects of earthquakes or changes in a nuclear reactor, synthetic datasets help mankind in estimating the effects and taking the necessary actions in advance.
Benchmarks are similar to synthetic datasets, but benchmarks help in evaluating the performance of a system rather than just checking the validity of an algorithm. In the field of databases, benchmarking attempts to measure how fast a database is and how much cost is involved, so that it can help in determining the best suited database for an application. The Transaction Processing Performance Council (TPC), a group of hardware vendors and database companies, was formed in the late 1980s to give third-party objectivity to database benchmarking. The TPC benchmarks have evolved and multiplied over time, but they have always provided two measurements: the transaction rate of a database, and the
cost per transaction including hardware, software, and maintenance. Almost all the major companies in this field benchmark their DBMSs.
1.1 Cluster Data
Clustering is the process of grouping a set of objects into classes of similar objects. A cluster is a collection of data
objects that are similar to objects within the same cluster and dissimilar to those in other clusters [6]. Similarity
between two objects is calculated using a distance measure. The family of L_k-norm distances and the Mahalanobis distance functions are a few distance measures to mention.
The dimensionality of the data plays an important role in the process of clustering. The result of any clustering algorithm can be visualized and verified only if the dimensionality of the data is less than or equal to three. If there are datasets with the actual clustering result, they help in correcting the behavior of the clustering algorithm. There are very few real-life datasets with large dimensionality in this field, and those datasets do not provide the actual clustering result. Our aim is to generate datasets in R^d (d is the number of dimensions) together with the clustering result, given the following inputs from the user.
1. No. of clusters
2. No. of Points
3. Dimensionality
4. Max value for each Dimension.
Though there are a few existing methods to generate cluster data, they do not provide solid proof about the number of clusters. They divide the attribute range into non-overlapping sub-ranges and assign each sub-range to a cluster. A few methods generate a random point and then distribute the points according to the randomly generated range and variance assigned to that cluster. In such cases there is a possibility that two clusters merge into a single cluster.
The generated data must contain exactly the requested number of clusters, otherwise the basic aim of generating synthetic cluster data will not be met. We ensure that the data generated will have exactly the requested number of clusters. In general there are a few objects in the dataset which are not attached to any cluster; those points are referred to as outliers/noise. The data generated should include some amount of noise. The attributes of each object could be floating-point or categorical. Categorical values do not have an order defined among them, so it is difficult to work with such attributes.
Of late, researchers have found that some clusters may not be spread in all the dimensions of the dataset; that is, a cluster can be identified in some subset of the given dimensions. We also propose a method to generate subspace cluster datasets.
1.2 Contributions
The contributions of this thesis are as follows.
We proposed efficient algorithms for generating cluster datasets in R^d space. The method of generation ensures that there are exactly the requested number of clusters in the dataset. None of the existing techniques generate arbitrary shaped clusters; we proposed a method to generate such clusters. We also provide the clustering result along with the dataset, which can be used to verify the result of clustering algorithms. Along with the traditional cluster data, we generate datasets containing subspace clusters.
1.3 Structure of Thesis
The remainder of this thesis is divided into six chapters. In chapter two we introduce the concept of clustering and the various techniques available. Chapter three deals with the existing cluster data generators. Chapter four is the heart of this thesis; it explains how we generate the cluster data and gives the required proofs. The fifth chapter discusses subspace clustering methods and subspace cluster data. We provide some experimental results in chapter six. Conclusions and future work are given in chapter seven, which is followed by the bibliography.
Chapter 2
Clustering Methods
Given a set of objects, the process of dividing this set into subsets of objects is known as clustering. Objects in a cluster are more similar to each other when compared to those of a different cluster. Similarity between two objects is calculated using a distance measure.¹ The family of L_k-norm distances and the Mahalanobis distance functions are a few distance measures to mention. The major goal of clustering is to identify underlying patterns based on the similarities between the objects. Clustering is also known as unsupervised learning. Since clustering forms groups, it can be used as a pre-processing step for methods like classification.
Of late, with the explosion of data, the amount of data to be processed for clustering has increased tremendously. This imposes the following requirements on clustering algorithms:
1. Scalability of the algorithm: Firstly, the order of the input records in a dataset should not affect the output of the
clustering algorithm. The algorithm should also be able to handle very large datasets.
2. High-dimensional data: The algorithms should be able to handle large dimensional data.
3. Heterogeneous attributes: The ability to handle various types of attributes (numerical, categorical) is important, due to the increasing prevalence of heterogeneous data.
4. Complex shapes of clusters: The accuracy of the clustering result should be acceptable despite the presence of
complex shaped clusters.
5. Noise: The algorithm should be able to handle noise too. The presence of noise should not degrade the accuracy or efficiency of the clustering algorithm.
The most recent work in clustering algorithms [7], [8], [9], [10] addresses almost all the issues mentioned above. However, any clustering algorithm needs to be evaluated for its ability to handle the above mentioned issues. For the sake of evaluation, we need datasets which are large, noisy and high-dimensional with the presence of complex clusters. In addition to this, the "actual" clustering result should also be available to benchmark the results given by various clustering algorithms. Such datasets along with the clustering results are very few in number. Though there are a few readily available real-life datasets, the actual clustering results are not known. This necessitates a tool capable of generating high-dimensional, noisy datasets along with the original clustering results.
¹ The terms similarity and distance are used interchangeably: the greater the distance between two objects, the less similar they are.
Our toolkit SynDECA (Synthetic Datasets to Evaluate Clustering Algorithms) generates large, noisy, high-dimensional datasets. SynDECA also provides the information about each point (whether it belongs to a cluster or to noise) and a brief statistical description of the clusters present in the dataset.
2.1 Categories of Clustering Algorithms
Grouping of the objects can be done using partitioning, hierarchical, density-based, grid-based or model-based techniques, each of which is explained below.
1. Partitioning Method: Given n objects and the number of partitions k (k < n), the partitioning method divides the objects into k partitions (i.e. k clusters). These algorithms work iteratively to improve the quality of the partitioning. In each iteration a representative object is found for each partition and all the remaining objects are associated with one of these representatives. The most popular partitioning algorithms are K-Means and K-Medoids. The K-Means algorithm works as follows: in each iteration the mean of each partition is calculated, and then the distance of each object with respect to all the means; each object is associated with the mean to which its distance is minimum. The method is illustrated in figure 2.1.
Figure 2.1: Working of partition clustering method
The major drawback of partitioning algorithms is that they cannot tackle noise/outliers in the dataset; they are not able to recognize such points. Moreover, the user is supposed to give the number of clusters as input. Figuring out the number of clusters may not be easy all the time, due to the dimensionality of the data and heterogeneous attributes. Also, these algorithms are not suitable for discovering non-convex shapes.
2. Hierarchical Method: In hierarchical clustering methods, objects are grouped into a tree of clusters, known as a dendrogram. There are two ways to perform the grouping: agglomerative and divisive. Agglomerative clustering is a bottom-up approach. It takes each object as a cluster and at each iteration merges these clusters to form larger and larger clusters until all the objects form a single cluster or until certain threshold conditions are satisfied. The divisive method is the reverse of agglomeration. It starts with all the objects in one single cluster and divides the clusters into smaller and smaller clusters until each object is a cluster or until threshold conditions are met. AGNES, DIANA, BIRCH, CURE, CHAMELEON and ROCK are a few hierarchical clustering algorithms. The working of these algorithms is illustrated in figure 2.2.
Figure 2.2: Working of hierarchical clustering methods (agglomerative and divisive)
The problem with these clustering algorithms is that a decision once taken cannot be undone, and there is no way to move objects from one group to another so as to form a better clustering.
3. Density-Based Method: Most of the distance based algorithms try to find only spherical shaped clusters. In order to find arbitrary shaped clusters, density based methods were proposed. The idea is to grow a cluster as long as the density around a point in the cluster is above a certain threshold. These methods not only find arbitrary shaped clusters but also outliers/noise points. DBSCAN and OPTICS are examples of density based clustering algorithms. The DBSCAN algorithm takes a radius epsilon and minpoints as input. It takes each object and verifies whether it has at least minpoints within an epsilon radius; if so, it is considered as a part of some cluster. If minpoints is three then points A, B, C, D, E, F and G in figure 2.3 help the cluster to grow, whereas points H and I are considered as outliers (a minimal sketch of this core-point test is given after Figure 2.3).
DBSCAN needs epsilon and minpoints as input, and calculating these in high-dimensional space is a tough task. The OPTICS algorithm was proposed to overcome this difficulty. OPTICS computes an augmented cluster ordering for automatic and interactive density based clustering.
4. Grid based Method: Grid based algorithms divide the given data space into a finite number of cells that form a grid structure. This grid structure is used to perform all the clustering operations. The advantage here is that the clustering processing can be done in parallel. The quality of the clusters depends on the granularity of the grid. STING, CLIQUE and WaveCluster are a few grid based methods.
5. Model based Method: In this method, a model is hypothesized for each cluster that will best fit the given data.
The algorithm might construct a density function that reflects the spatial distribution of the data. Model based
methods follow two approaches: statistical approach and neural network based approach.
Sometimes it is tough to classify a clustering algorithm into one of the above methods, as the algorithms integrate
ideas from more than one of the methods.
Figure 2.3: Working of DBSCAN algorithm
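To make the neighborhood test concrete, the following Python sketch shows a core-point check in the spirit of DBSCAN as described above. It is only an illustration: the function names, the toy data and the parameter values are assumptions, not part of any particular implementation.

# Minimal sketch of the DBSCAN core-point test (assumed names and values).
import math

def region_query(points, p, eps):
    # All points lying within distance eps of point p (p itself included).
    return [q for q in points if math.dist(p, q) <= eps]

def is_core_point(points, p, eps, min_points):
    # A point is a core point if its eps-neighborhood contains at least min_points points.
    return len(region_query(points, p, eps)) >= min_points

# Toy 2-d data: a small dense blob plus two isolated points (outliers).
data = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.2), (1.2, 1.1), (5.0, 5.0), (9.0, 0.5)]
for pt in data:
    print(pt, "core" if is_core_point(data, pt, eps=0.5, min_points=3) else "not core")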
2.2 Outliers
A few objects in the data behave very differently from the general behavior of the entire data; these objects are called outliers. Outlier mining can be seen as two subproblems: (1) defining what constitutes inconsistent data and (2) finding methods to determine such data. Defining what an outlier is, is not a trivial job. Data visualization methods help in detecting outliers to an extent, but in the presence of categorical and/or very high-dimensional data, visualization techniques are not effective. Detecting such points is useful, as applications such as fraud detection need to identify such objects.
Chapter 3
Related Work
There have been some early attempts to generate synthetic data to test clustering algorithms. A few schemes for generating artificial data have appeared in the literature ([11, 12, 13]). One of the processes used multivariate normal mixtures with fairly complex covariance matrices, which leads to the generation of overlapping clusters. The algorithm proposed by Glenn [11] generates data in either four, six, or eight dimensional space containing up to five clusters. Three different methods are followed in assigning points to each cluster.
The overall approach followed by Glenn's algorithm is as follows. Initially the extent of each cluster in the first dimension is fixed in such a way that there will not be any overlap between clusters in this dimension. Then the extent of each cluster in the remaining dimensions is calculated. Points are generated within the bounding box¹ of each cluster. Outliers associated with each cluster are generated; these outliers are not within the bounding box of the cluster. Finally error measures (such as adding an error perturbation to each dimension of each point in the dataset) are added. Since the clusters are non-overlapping in the first dimension, they remain non-overlapping even if other dimensions are also considered. In this case any clustering algorithm will be able to identify the clusters properly in any n − i (i = 1 . . . n − 1, where n is the dimensionality of the data) dimensions. For each cluster some percentage of the cluster points are added as noise to that cluster. There is every chance that a noise point assigned to a cluster may fall within the bounding box of some other cluster. The major drawback is that the algorithm limits the number of dimensions as well as the number of clusters. "Clusutils" [14] has a component called "clusgen", which is a tool based on Glenn's [11] algorithm. Though this tool does not limit the number of dimensions and the number of clusters, it can generate only rectangular/square shaped (a fixed shape) clusters but not random shaped clusters. Moreover, clusgen does not guarantee the presence of the requested number of clusters.
While evaluating their clustering algorithms, the authors of [15, 16, 17, 18, 19, 20, 21, 12] generated some synthetic datasets. These datasets are tailor-made for their particular needs, i.e. as the DBSCAN algorithm needs arbitrary shaped clusters, they generated such datasets. Most of the papers used the method explained in BIRCH [12]. The synthetic generator explained in the BIRCH paper takes a pattern, the number of clusters, the maximum and minimum number of points in a cluster, the maximum and minimum radius of a cluster and a few more inputs, and generates the data. Cluster centers are fixed depending on the pattern of the clusters. After fixing a few more characteristics of the cluster, data points are generated using a 2-d normal distribution whose mean is the cluster center and whose variance is r²/2. They also generate some amount of noise points. Since the maximum distance between the cluster center and a cluster point is
¹ The term bounding box is used extensively to mean the minimum bounding box.
unbounded, there are many chances that a point that belongs to cluster c_i is much closer to the center of cluster c_j, which makes the generated data a bit fuzzy. As there is an overlap, there is no guarantee that the data will have exactly the requested number of clusters.
Algorithms like [22] used the data generation method explained by M. Zait et al. in [13]. The data generator explained by M. Zait et al. takes the number of objects, the number of dimensions, and the type and range of values for each dimension as input. The process is divided into three steps. In the first step a file containing details about each and every dimension is generated. The second step is to define the ranges of each cluster for each dimension. Each dimension is divided into non-overlapping ranges, each of which is in turn assigned to one of the clusters. If there are categorical dimensions then the values are divided into a set of subsets and each subset is randomly assigned to a cluster. An equal number of points is generated according to the ranges of dimensions assigned to each cluster. It is evident that the generated data has the required number of clusters, but the clusters can be identified by just looking at one dimension, much like the output of the algorithm proposed by Glenn. Generating such data is not a good option. Moreover, none of the clusters generated has an arbitrary shape.
The method proposed by us takes the number of clusters, the number of points, the number of dimensions and the maximum range of each dimension as input and generates the data. We guarantee that there will be exactly the requested number of clusters. We are not restricted to rectangular shaped clusters; in fact we generate random shaped clusters. We include some amount of noise too. The generated clusters can overlap when seen in a subset of the dimensions, so there is little chance of figuring out the clusters by just looking at some dimension.
Chapter 4
Synthetic Datasets
4.1 SynDECA - Framework
SynDECA currently generates datasets in R^d, where d is the number of dimensions. In the following sections, we describe the various components of SynDECA and their functionalities.
4.1.1 Notation
Let the dataset to be generated be represented as X. Let the dimensionality of the dataset be d, the dimensional space be D ⊆ R^d and the number of points in the dataset be n. Let the range of each axis be [0, m], where m is the maximum allowable value in each dimension. Let c be the number of clusters to be present in the dataset X, X_c be the set of points that belong to clusters and X_n be the set of points that are noise. Let η be the noise percentage, i.e. η = (|X_n| / |X|) × 100. Let µ represent the set of cluster centers, µ_i represent the cluster center of cluster i and µ_ij be the value of the j-th dimension of cluster center i.
With the above notation, we define the problem of synthetic numerical dataset generation as:
Given the number of points to be present in the dataset n, the dimensionality of the dataset d, the maximum range
of each dimension m, the number of non-overlapping clusters c to be present in the dataset, our aim is to generate
a dataset that is spread across all the d dimensions with c non-overlapping clusters and η percentage of noise points
within the dataset.
We list out the tasks, step-by-step, that together address the above problem definition. We divide the whole problem
into four smaller tasks:
1. Cluster placement: For all the c clusters, the radius of the cluster and the cluster center are to be determined such that there is no overlapping between the clusters in all d dimensions. The algorithm is given in Figure 4.2.
2. Cluster and Noise Cardinality estimation: For each cluster, the number of points to be placed is in proportion to its radius. The number of points to be placed as noise also needs to be determined.
3. Filling clusters with points: With the cluster centers and their radii in place, we sprinkle randomly generated points into the various clusters, till the required number of cluster points |X_c| is achieved (as described in Figure 4.4).
4. Sprinkling noise: After filling up the clusters with random points, noise points need to be added such that they do not lie within the region of any cluster in which cluster points are placed. Noise points are added till |X_n| is achieved (as described in Figure 4.5).
Figure 4.1: Generation of data
4.2 Algorithms
With the brief mention of the various tasks in the previous section, we now explain the algorithms in detail. The
inputs from the user are: the number of points in the dataset n, the dimensionality of the dataset d, the number of
non-overlapping clusters c and the maximum allowable range m.
Apart from these user-given parameters, we use another set of parameters which are set dynamically during the
algorithm execution. They are:
1. Parameter ε: This parameter is used to ensure a minimum gap between any two clusters, i.e. within a (1 + ε) ∗ r (r is the radius of the cluster) region around the center of any cluster no other cluster can be placed. The range of ε is (0, 1] and the value is set randomly.
2. Parameter r_max: r_max denotes the maximum allowable value for the cluster radius, which is calculated from the user inputs and ε.
3. Parameter r_min: This parameter determines the minimum allowable radius for a cluster. It depends on how small a cluster can be when compared with the largest cluster. It is determined by k (the ratio of r_max to r_min) and r_max. In our experiments we have taken k = 3.
4. Parameter η: η denotes the percentage of noise points. The number of noise points lies in the range (0, n/(1+c)]. The rest of the points are allocated to each cluster in proportion to its radius.
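As a rough illustration of how these derived parameters interact, the following Python sketch computes ε, r_max, r_min and the noise budget from the user inputs. The variable names are assumptions, and the lower bound used for ε anticipates the condition derived in Section 4.3; this is a sketch, not the SynDECA implementation.

# Illustrative computation of the dynamically set parameters (assumed names).
import random

def derived_parameters(n, d, c, m, k=3.0):
    s = m ** d                                     # total space provided
    eps_low = ((1.0 + c) / c) ** (1.0 / d) - 1.0   # epsilon must exceed this (see Section 4.3)
    eps = random.uniform(eps_low, 1.0)             # epsilon drawn from (eps_low, 1]
    r_max = (s / c) ** (1.0 / d) / (2.0 * (1.0 + eps))
    r_min = r_max / k                              # k is the ratio r_max / r_min (k = 3 in the thesis)
    eta = random.uniform(0.0, 100.0 / (1.0 + c))   # noise percentage
    n_noise = int(eta / 100.0 * n)                 # at most n / (1 + c) noise points
    return eps, r_max, r_min, n_noise

print(derived_parameters(n=10000, d=2, c=10, m=100))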
4.2.1 Algorithm for Cluster Placement
The algorithm proceeds by assigning the first cluster's center and radius randomly. For the remaining clusters, the cluster center and radius are set depending on the placement of the previous clusters. For every cluster, the center µ_i and radius r_i are generated randomly. The generated center is checked to be at a distance of at least the radius r_i from the edges along each dimension, i.e. m − µ_ij is at least the radius of that cluster (j = 1 . . . d). Once the center is adjusted, it is checked against each of the already placed clusters for the minimum gap that needs to be maintained. If there is any cluster with which the current cluster does not have the minimum required gap, the radius of the current cluster is reduced. If the reduced radius is less than the minimum allowable radius then the process is restarted by generating a new center point as well as a new radius. The algorithm is given in Figure 4.2.
Input: µ_i and the existing set of cluster centers µ; radii of already placed clusters
Output: µ_i and radius r_i of the current cluster C_i
Algorithm: place_cluster(i, µ_i, r_i): Cluster Placement
1. r_i ← Generate_radius(r_min, r_max)
   /* Generate_radius randomly generates a value between r_min and r_max */
2. µ_i ← Generate_center(r_i)
   /* Generate_center randomly generates a point in d-dimensional space such that the distance between the center and any edge of the bounding box of the given space is at least the radius of that cluster */
3. do
4.   is-cluster-placed ← true
5.   for j in 1 to (i−1) do
6.     if distance(µ_i, µ_j) < (1+ε) ∗ (r_i + r_j) then
7.       is-cluster-placed ← false
8.       Reduce(r_i)
9.       if (r_i < r_min) then
10.        place_cluster(i, µ_i, r_i)
11.      end if
12.      break
13.    end if
14.  end for
15. while (! is-cluster-placed)
16. end
Figure 4.2: Algorithm for Cluster Placement
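A compact Python rendering of the placement loop of Figure 4.2 is sketched below. It follows the same logic (random radius and center, shrink the radius on a conflict, restart when the radius falls below r_min), but the helper names and the shrink factor are assumptions rather than the actual SynDECA code.

# Minimal sketch of the cluster-placement logic of Figure 4.2 (assumed names).
import math
import random

def place_clusters(c, d, m, r_min, r_max, eps, shrink=0.9):
    centers, radii = [], []
    for _ in range(c):
        while True:
            r = random.uniform(r_min, r_max)
            # Center at least r away from every edge of the [0, m]^d box.
            mu = [random.uniform(r, m - r) for _ in range(d)]
            ok = True
            for mu_j, r_j in zip(centers, radii):
                while math.dist(mu, mu_j) < (1 + eps) * (r + r_j):
                    r *= shrink          # reduce the radius on a conflict
                    if r < r_min:        # too small: give up and restart with a new center
                        ok = False
                        break
                if not ok:
                    break
            if ok:
                centers.append(mu)
                radii.append(r)
                break
    return centers, radii

print(place_clusters(c=5, d=2, m=100.0, r_min=4.4, r_max=13.2, eps=0.2))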
The algorithm for cluster placement will definitely be able to find continuous free space in the given space s, so that it can place a cluster with r_min as radius. The following proof shows the same. Please look into section 4.3 to verify the equations.
\[ r_{max} = \left(\frac{s}{c}\right)^{\frac{1}{d}} \cdot \frac{1}{2(1+\epsilon)}, \qquad s = m^d \]
\[ m = r_{max} \cdot c^{\frac{1}{d}} \cdot 2(1+\epsilon) \]
As \( r_{min} = \frac{r_{max}}{k} \),
\[ m = k \cdot r_{min} \cdot c^{\frac{1}{d}} \cdot 2(1+\epsilon) \tag{4.1} \]
Figure 4.3: Division of the given space into cells which can accommodate clusters with minimum radius (each axis of length m is divided into k·c^{1/d} cells of width 2(1+ε)·r_min)
From equation 4.1, we can divide each dimension into k ∗ c^{1/d} units as shown in figure 4.3. While placing a cluster with r_max as radius, it might at most span k + 1 units along each dimension. Similarly, the number of units occupied in each dimension by a cluster having radius r_min is at most two. The dotted squares in figure 4.3 are the bounding boxes of clusters. The algorithm generates a radius in the range [r_min, r_max], so not all the clusters are going to have the maximum radius. Each cluster will leave (k + 1)^d − (cells_i)^d empty cells, where cells_i is the number of cells occupied along each dimension by the i-th cluster. As d increases the number of unoccupied cells increases, and there is a chance that some cells are occupied by more than one cluster. This shows that there is an ample number of cells left free, each of which is capable of hosting a cluster with r_min as radius.
In some iteration, the algorithm will generate a center in the available free space. If the generated radius does not allow the cluster to be placed there, the algorithm shrinks the radius and tries to accommodate the cluster in that position.
We ran the cluster placement algorithm 50 times for each input. The results are tabulated in Table 4.1. The results show that the average radius of the clusters is not even half of r_max, which shows that there is an ample amount of free space left. The algorithm is recursively called at most about 2 times on average in most of the cases. In some cases the value is a little higher due to the randomness in finding the center of the cluster; the same applies to the number of times the cluster radius is reduced to fit in that position.
Dimensions | Clusters | r_max   | r_min    | Avg Radius  | Avg Trials | Avg Radius Reduction
2          | 10       | 13.1762 | 4.38766  | 6.9583308   | 1.618      | 1.496
2          | 100      | 4.16667 | 1.3875   | 2.1744378   | 1.6226     | 1.5188
2          | 1000     | 1.31762 | 0.438766 | 0.68921596  | 1.6068     | 1.51924
3          | 10       | 19.34   | 6.4402   | 9.9235446   | 1.588      | 1.506
3          | 100      | 8.97681 | 2.98928  | 4.5308042   | 1.5952     | 1.5472
3          | 1000     | 4.16667 | 1.3875   | 2.1070136   | 1.56458    | 1.50642
5          | 10       | 26.2899 | 8.75453  | 12.399806   | 1.628      | 1.78
5          | 100      | 16.5878 | 5.52374  | 7.8629638   | 1.4942     | 1.5088
5          | 1000     | 10.4662 | 3.48524  | 4.9784418   | 1.48072    | 1.46624
10         | 10       | 33.097  | 11.0213  | 14.526136   | 7.85       | 13.762
10         | 100      | 26.2899 | 8.75453  | 11.594366   | 1.35       | 1.3378
10         | 1000     | 20.8828 | 6.95397  | 9.2591416   | 1.3389     | 1.32902
20         | 10       | 37.1355 | 12.3661  | 15.864384   | 4.442      | 7.386
20         | 100      | 33.097  | 11.0213  | 13.58244    | 4.7032     | 7.9074
20         | 1000     | 29.4977 | 9.82275  | 12.394634   | 1.25274    | 1.27538
50         | 10       | 39.7914 | 13.2505  | 16.450032   | 57.964     | 109.782
50         | 100      | 38.0005 | 12.6542  | 15.341142   | 3.896      | 6.3966
50         | 1000     | 36.2901 | 12.0846  | 14.555292   | 2.30472    | 3.37338
100        | 10       | 40.7182 | 13.5592  | 17.001032   | 11.7       | 21.232
100        | 100      | 39.7914 | 13.2505  | 15.94729    | 3.5456     | 5.7674
100        | 1000     | 38.8856 | 12.9489  | 15.538592   | 1.8887     | 2.58478
Table 4.1: Average radius, average number of trials to place the cluster and average number of times the radius was reduced
4.2.2 Algorithm for Cluster Points Generation
After obtaining the cluster positions and radii from the Cluster Placement algorithm, a random number (∈ (0, n/(1+c)]) of points is allotted for noise. The remaining points are allotted to each cluster in proportion to the cluster's radius. A shape (circle, ellipse, rectangle, square or irregular) for each cluster is assigned randomly. Points for a cluster are generated according to its shape. For a circular shaped cluster, points are generated in such a way that all the points are within a distance of the radius r_i from its center. For rectangular and elliptical shaped clusters, the extent of the radius (along a dimension) is reduced in at most d − 1 dimensions, and the points generated will be within the structure. For irregular shaped clusters there are two different methods that we employ:
1. The first method is analogous to the reverse mechanism of the DBSCAN algorithm [15]. A point p is randomly chosen and a small percentage of the total allocated points (min_pts) are sprinkled around p within a distance of eps (eps ≪ r). A new point p′, which is at a distance q′ (eps < q′ < eps + eps′, and eps′ ≪ r) from p, is chosen and min_pts′ points are sprinkled around p′ within a distance of eps′. This process is continued till the required number of points is generated. The condition that the distance q′ between points p and p′ should be within (eps, eps + eps′) ensures the connectivity of the generated points.
2. The second technique is based on hyper-dimensional grids. In this method, the bounding box of the cluster is divided into small d-dimensional boxes, from which one box is randomly chosen to fill with points. Following that, an adjacent box of the filled box is chosen to fill with points, and this process is continued till the required number of points is generated for that cluster. By using grids, it is ensured that each point that is generated for a cluster is within the bounding box of the cluster.
Since the rand function in C generates random numbers with a uniform distribution, the density of regular shaped clusters is uniform, whereas irregular shaped clusters do not follow any particular distribution. The algorithm is given in Figure 4.4. Given a center µ_i and radius r_i, the function generate-point generates a point within the bounding box of the cluster. The function is-within-required-boundary checks the extent of a point around the center of the cluster depending upon the shape of the cluster. The function sprinkle-points generates min_pts points within the eps region of a selected point, and the function get-a-point generates a point p′ which is at a distance q′ from p.
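The first ("reverse DBSCAN") method can be sketched as follows in Python. The helper names mirror the functions described above but are assumptions, the parameter values are arbitrary, and the bounding-box check is omitted for brevity; this is an illustration, not the SynDECA source.

# Sketch of irregular cluster generation: sprinkle min_pts points around a pivot,
# then hop to a nearby new pivot and repeat (assumed names and values).
import random

def sprinkle_points(p, eps, min_pts):
    # min_pts points placed uniformly inside the eps-box around pivot p.
    return [[x + random.uniform(-eps, eps) for x in p] for _ in range(min_pts)]

def get_a_point(p, eps, eps2):
    # A new pivot shifted by between eps and eps + eps2 along each dimension.
    return [x + random.choice([-1, 1]) * random.uniform(eps, eps + eps2) for x in p]

def irregular_cluster(center, total, eps=1.0, min_pts=20):
    points, p = [], list(center)
    while len(points) < total:
        points.extend(sprinkle_points(p, eps, min_pts))
        p = get_a_point(p, eps, eps)   # hop to a connected neighbouring pivot
    return points[:total]

pts = irregular_cluster(center=(50.0, 50.0), total=200)
print(len(pts), pts[:3])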
4.2.3 Algorithm for Noise Points Generation
With the cluster points generated, we now generate η percent of n (the number of points in the dataset) noise points. While generating noise points, care is taken that a point generated as noise does not lie within the space of any cluster where cluster points are generated. The generated point p is a potential noise point if it is not within the bounding box of any cluster. Otherwise, if one of the following conditions is satisfied then p is treated as a valid noise point. If the shape of the cluster is
• circle: p is not within a distance r_i from the center µ_i of the cluster.
• ellipse: p is not within the space of the ellipse.
• rectangle: for some dimension j, |p_j − µ_ij| > Radii_ij (Radii_ij is the radius of cluster i along dimension j).
• irregular: p is not within the eps distance of any point which was used as a pivot to sprinkle the points.
The algorithm is given in Figure 4.5.
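A compressed sketch of this rejection test might look as follows in Python, handling only circular and rectangular clusters; the names and values are assumptions and the ellipse and irregular cases are omitted for brevity.

# Sketch of noise generation by rejection: keep a uniformly drawn point only if it
# does not fall inside the occupied region of any cluster (circles and rectangles only).
import math
import random

def is_inside_cluster(p, center, radii, shape):
    if shape == "circle":
        return math.dist(p, center) <= radii[0]
    if shape == "rectangle":
        return all(abs(pj - cj) <= rj for pj, cj, rj in zip(p, center, radii))
    return False

def generate_noise(count, d, m, clusters):
    noise = []
    while len(noise) < count:
        p = [random.uniform(0, m) for _ in range(d)]
        if not any(is_inside_cluster(p, c, r, s) for (c, r, s) in clusters):
            noise.append(p)
    return noise

clusters = [((30.0, 30.0), (10.0,), "circle"), ((70.0, 60.0), (8.0, 5.0), "rectangle")]
print(len(generate_noise(50, d=2, m=100.0, clusters=clusters)))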
4.2.4 Complexity of Algorithms
SynDECA generates the required datasets within a reasonable time (it does not take more than 10 seconds to generate a 1,000,000-point, 2-dimensional dataset containing 100 clusters). For high-dimensional large datasets, it is not easy to give a time analysis because, at times, getting the position of a point such that it satisfies the various conditions is a time consuming task.
4.3 Evaluation of datasets generated
Given the user inputs (the number of points in the dataset n, the dimensionality d, the number of clusters c and the maximum allowable value in each dimension m), the toolkit generates the required n points in R^d space with c clusters, along with some noise points (calculated based on the other parameter values).
Input: No. of points to be generated, center µ_i, radius r_i and shape of the cluster;
       Radii_i, the extent in each dimension for ellipse and rectangle
Output: Required number of points within the bounding box of the cluster.
Algorithm: Cluster Points Generation
1. Let i ← 1
2. switch (Shape)
3.   case CIRCLE:
4.   case ELLIPSE:
5.   case RECTANGLE:
6.   case SQUARE:
7.     while (i < No. of points to be generated)
8.       point ← generate-point(µ_i, r_i)
9.       if (is-within-required-boundary(µ_i, point, r_i, Shape))
10.        generated-points[i] ← point
11.        increment i
12.      end if
13.    end while
14.  case IRREGULAR:
15.    p ← generate-point(µ_i, r_i)
16.    sprinkle-points(p, eps, min_pts)
17.    no-of-generated-points ← min_pts
18.    while (no-of-generated-points < No. of points to be generated)
19.      p′ ← get-a-point(p, eps, eps′)
20.      sprinkle-points(p′, eps′, min_pts′)
21.      increment no-of-generated-points by min_pts′
22.      p ← p′; eps ← eps′
23.    end while
24. end switch
25. end
Figure 4.4: Algorithm for Cluster Points Generation
A random number between 0 and 100/(1+c) is selected as the noise percent η. The maximum limit on the noise percent ensures that the average number of points given to a cluster is greater than or equal to the number of noise points, which is explained below.
\[ \frac{100-\eta}{c} \cdot n \;\geq\; \eta \cdot n \]
\[ 100 \;\geq\; (1+c)\cdot\eta \]
\[ \eta \;\leq\; \frac{100}{1+c} \]
The amount of space provided for generating the dataset is s = m^d. Since bounding boxes of arbitrary size (for c clusters) cannot fit in the space provided, there should be a restriction on the size of the bounding box of a cluster. So we need to define a maximum and a minimum allowable radius (r_max, r_min) for each cluster.
Input: No. of noise points (η% of n), center µ_i, shape and radius/radii of each cluster, and the selected pivot points.
Output: η% of n noise points.
Algorithm: Noise Points Generation
1. Let i ← 1
2. while (i < No. of noise points)
3.   point ← Generate-point-in-space()
4.   if (not within-any-cluster-bounding-box(point))
5.     generated-noise[i] ← point
6.     increment i
7.   else if (is-within-free-space-of-cluster(point, µ_j, r_j, Shape_j))
8.     generated-noise[i] ← point
9.     increment i
10.  end if
11. end while
12. end
Figure 4.5: Algorithm for Noise Points Generation
Computation of r_max:
The maximum space allocated to each cluster is s/c. The size of the bounding box of the largest cluster (including the ε space) in terms of r_max is [2 ∗ r_max ∗ (1 + ε)]^d.
\[ \frac{s}{c} = \left[2\, r_{max}\,(1+\epsilon)\right]^d \]
\[ r_{max} = \left(\frac{s}{c}\right)^{\frac{1}{d}} \cdot \frac{1}{2(1+\epsilon)} \]
The minimum allowable radius is
\[ r_{min} = \frac{r_{max}}{k} = \left(\frac{s}{c}\right)^{\frac{1}{d}} \cdot \frac{1}{2k(1+\epsilon)} \]
where k (> 1) is the ratio of r_max to r_min. k can take any value, depending on how small a cluster can be when compared to the largest cluster. In our experiments we have taken k as 3.
The size of the bounding box of a cluster with r_max as radius is
\[ \left[2\left(\frac{s}{c}\right)^{\frac{1}{d}} \cdot \frac{1}{2(1+\epsilon)}\right]^d = \frac{s}{c}\cdot\left(\frac{1}{1+\epsilon}\right)^d \]
Similarly, the size of the bounding box of a cluster with r_min as radius is
\[ \frac{s}{c}\cdot\left(\frac{1}{k(1+\epsilon)}\right)^d \]
If every cluster takes r_max as its radius, then
\[ \text{Minimum free space available} = s - \frac{s}{c}\left(\frac{1}{1+\epsilon}\right)^d \cdot c = s\cdot\frac{(1+\epsilon)^d - 1}{(1+\epsilon)^d} \]
Similarly, if every cluster takes r_min as its radius, then
\[ \text{Maximum free space available} = s - \frac{s}{c}\left(\frac{1}{k(1+\epsilon)}\right)^d \cdot c = s\cdot\frac{[k(1+\epsilon)]^d - 1}{[k(1+\epsilon)]^d} \]
Step 6 of the algorithm for Cluster Placement (given in Figure 4.2) checks whether the current cluster j is within the (1 + ε) ∗ r_i (r_i is the radius of cluster i) space around µ_i (the center of cluster i, which is already placed). If there is any such overlap then the radius of cluster j is reduced. Moreover, an ε ∗ r_i space around the bounding box of cluster i is left free. Hence we have the following lemma.
Lemma 1 The algorithm for Cluster Placement (Figure 4.2) generates clusters that are non-overlapping.
Theorem 1 The dataset generated by the step-by-step process described above will contain exactly c clusters.
Proof: From the above lemma it is assured that the bounding boxes of the clusters will not overlap. Each cluster has its own private bounding box, within which there will not be any other cluster's points or noise points. Now, if we can prove that the density of the points within the bounding box of each cluster is more than that of the noise, then we can assure that there exist exactly c clusters (i.e. there will be exactly c sets of points which are denser than the surrounding space) in the generated dataset.
Case 1:
When every cluster has r_max as radius and the noise percentage is η = 100/(1+c):
\[ \text{Number of noise points} = \frac{\eta}{100}\cdot n = \frac{1}{1+c}\cdot n \]
\[ \text{Total number of cluster points} = \frac{100-\eta}{100}\cdot n = \left(1 - \frac{1}{1+c}\right) n = \frac{c}{1+c}\cdot n \]
Since all the clusters have the same radius,
\[ \text{Number of points for each cluster} = \frac{c}{(1+c)\,c}\cdot n = \frac{1}{1+c}\cdot n \]
The number of points allotted to every cluster is the same as the number of noise points.
\[ \text{Bounding box size of a cluster} = [2\,r_{max}]^d = \left[2\left(\frac{s}{c}\right)^{\frac{1}{d}}\cdot\frac{1}{2(1+\epsilon)}\right]^d = \frac{s}{c}\left(\frac{1}{1+\epsilon}\right)^d \]
\[ \text{Total free space} = s - \sum_{i=1}^{c} (\text{bounding box size of cluster } i) = s - c\cdot\frac{s}{c}\left(\frac{1}{1+\epsilon}\right)^d = s\cdot\frac{(1+\epsilon)^d - 1}{(1+\epsilon)^d} \]
In order for the density of any cluster to be more than the density of the noise (since each cluster has the same number of points as the noise), the available free space should be greater than the size of the bounding box of the cluster:
\[ s\cdot\frac{(1+\epsilon)^d - 1}{(1+\epsilon)^d} > \frac{s}{c}\cdot\left(\frac{1}{1+\epsilon}\right)^d \]
Since \( \frac{s}{(1+\epsilon)^d} > 0 \),
\[ (1+\epsilon)^d - 1 > \frac{1}{c}, \qquad (1+\epsilon)^d > \frac{1+c}{c}, \qquad \epsilon > \left(\frac{1+c}{c}\right)^{\frac{1}{d}} - 1 \]
If the above condition is ensured then the amount of free space is greater than the size of the bounding box of each cluster. So we make ε take values within the range \( \left(\left(\frac{1+c}{c}\right)^{\frac{1}{d}} - 1,\; 1\right] \).
Even when every cluster takes r_min as its radius, the number of points in every cluster is the same as the number of noise points. In this case the size of the bounding box of each cluster is reduced by a factor of 1/k^d and the amount of free space is increased. This means that the same number of noise points (as above) are spread over an increased free space and the same number of cluster points (as above) occupy less space. Therefore, in this case also the density of any cluster is more than the noise density.
Case 2:
When x clusters get r_min as radius, the remaining c − x clusters get r_max as radius and η = 100/(1+c), the c − x clusters will get more points (due to these x clusters having the minimum radius) than in Case 1, and the amount of free space is also increased. Therefore, the density of the c − x clusters (with r_max as radius) is greater than the noise density.
\[ \text{Points allotted to a cluster with } r_{max} \text{ as radius} = \frac{k}{k(c-x)+x}\cdot\frac{c}{1+c}\cdot n \]
\[ \text{Points allotted to a cluster with } r_{min} \text{ as radius} = \frac{1}{k(c-x)+x}\cdot\frac{c}{1+c}\cdot n \]
When x = 1 the cluster with r_min as radius gets the least share of points, since the denominator is larger than in all the other cases.
The free space is increased by
\[ (2\,r_{max})^d - (2\,r_{min})^d = (2\,r_{max})^d - \left(\frac{2}{k}\,r_{max}\right)^d = \frac{s}{c}\cdot\frac{k^d - 1}{[k(1+\epsilon)]^d} \]
\[ \text{Total free space} = s\cdot\frac{(1+\epsilon)^d - 1}{(1+\epsilon)^d} + \frac{s}{c}\cdot\frac{k^d - 1}{[k(1+\epsilon)]^d} \]
\[ \text{Size of the bounding box of the cluster with } r_{min} \text{ as radius} = \frac{s}{c}\left(\frac{1}{k(1+\epsilon)}\right)^d \]
\[ \text{Density of the cluster with } r_{min} \text{ as radius} = \frac{\frac{1}{k(c-1)+1}\cdot\frac{c}{1+c}\cdot n}{\frac{s}{c}\left(\frac{1}{k(1+\epsilon)}\right)^d} \]
\[ \text{Density of the noise} = \frac{\frac{n}{1+c}}{s\cdot\frac{(1+\epsilon)^d-1}{(1+\epsilon)^d} + \frac{s}{c}\cdot\frac{k^d-1}{[k(1+\epsilon)]^d}} \]
Since density of cluster should be greater than noise density,
\[ \frac{\frac{1}{k(c-1)+1}\cdot\frac{c}{1+c}\cdot n}{\frac{s}{c}\left(\frac{1}{k(1+\epsilon)}\right)^d} \;>\; \frac{\frac{n}{1+c}}{s\cdot\frac{(1+\epsilon)^d-1}{(1+\epsilon)^d} + \frac{s}{c}\cdot\frac{k^d-1}{[k(1+\epsilon)]^d}} \]
\[ \frac{c^2\, n\, k^d (1+\epsilon)^d}{s(1+c)\,(k(c-1)+1)} \;>\; \frac{n\, c\, k^d (1+\epsilon)^d}{s(1+c)\,[c\,k^d((1+\epsilon)^d - 1) + (k^d - 1)]} \]
Since \( \frac{n\, c\, k^d (1+\epsilon)^d}{s(1+c)} > 0 \),
\[ \frac{c}{k(c-1)+1} \;>\; \frac{1}{c\,k^d((1+\epsilon)^d - 1) + (k^d - 1)} \]
\[ c^2 k^d\left((1+\epsilon)^d - 1\right) \;>\; kc - k + 1 - c(k^d - 1) \]
\[ c^2 k^d\left((1+\epsilon)^d - 1\right) \;>\; 1 - k - c(k^d - k - 1) \]
\[ (1+\epsilon)^d \;>\; 1 + \frac{1 - k - c(k^d - k - 1)}{c^2 k^d} \]
\[ \epsilon \;>\; \left(1 + \frac{1 - k - c(k^d - k - 1)}{c^2 k^d}\right)^{\frac{1}{d}} - 1 \tag{4.2} \]
Since \( 1 + \frac{1 - k - c(k^d - k - 1)}{c^2 k^d} < 1 \), the right hand side of equation 4.2 is always negative, and as ε is always positive, the above condition is satisfied.
Therefore, for each cluster the density is more than that of noise. Hence, the theorem is proved.
4.3.1 Discussion
Let m_i be the maximum allowable value in each dimension. When max_{i=1..d}(m_i) ≫ min_{i=1..d}(m_i), the given space is skewed towards a rectangular shape. The bounding box of a cluster can no longer be square shaped, since the r_max calculated will be more than min_{i=1..d}(m_i). So in this case we should consider rectangular shaped bounding boxes to generate points.
Let l_i be the length of the side of the bounding box of a large cluster along dimension i, and let m_i / l_i = ρ, i = 1 . . . d. When all the bounding boxes of the clusters are large,
\[ \text{space occupied by each cluster} = \frac{s}{c} = \prod_{i=1}^{d} l_i \cdot (1+\epsilon)^d \]
where \( \epsilon\, l_i / 2 \) is the amount of space left free around the bounding box of the cluster along dimension i.
\[ s = \prod_{i=1}^{d} m_i \]
\[ \frac{\prod_{i=1}^{d} m_i}{\prod_{i=1}^{d} l_i} = \rho^d = c\,(1+\epsilon)^d \]
The value of either ρ or ε needs to be fixed in order to get the size of the large bounding box (and hence the density of the cluster with the largest bounding box). If ρ is fixed then there is every chance that for some values of c and d, ε could be negligible (and the gap between clusters would also be negligible); in this scenario we cannot ensure non-overlapping clusters. Therefore ε (< 1) is fixed to get the value of ρ with the help of c and d. We can prove that there are exactly c clusters in the generated data in a way similar to the above theorem.
4.4 Summary
In this chapter we proposed methods to generate clustering datasets in R^d (d is the dimensionality of the dataset). Our algorithms are capable of generating regular as well as arbitrary shaped clusters. The algorithms generate the requested number of clusters in the dataset along with the actual clustering result. This property makes them helpful in evaluating the results of clustering algorithms.
Chapter 5
Subspace Cluster Data
A subspace cluster can be defined as a pair consisting of a subset of the given data and a subspace. The data objects in a subspace cluster are similar in the associated subspace. When we consider all the dimensions, we may not be able to figure out the similarity between these data points. Traditional clustering algorithms take all the dimensions into account in order to figure out the clusters. In the case of very high-dimensional data, all the dimensions may not be relevant. If a dimension is irrelevant for all the objects in the database, it can be removed using dimensionality reduction techniques. Feature transformation techniques, such as principal component analysis, might look lucrative, but after applying these methods the distance between any two points remains approximately the same as in the original dataset. Moreover, interpreting the meaning of the new dimensions is difficult. Dimensionality reduction techniques may not always work, as different dimensions play an important role in finding different clusters, so removing a dimension could be a great loss for some cluster. For example, points A, B, C and D form a cluster in dimensions Y and Z, whereas points E, F, G, H and I form a cluster in dimensions X and Z (Figure 5.1). If the dimensionality reduction techniques remove either the X or the Y dimension, one of the clusters will not be identified. Subspace clustering algorithms try to give a solution to all these problems.
Figure 5.1: Example of how different dimensions are important to different clusters (points A-I plotted in two different two-dimensional projections)
5.1 Subspace Clustering Methods
Lance Parsons et al. [23] give a very good survey of subspace clustering algorithms. The first algorithm proposed to figure out subspace clusters was CLIQUE [24]. CLIQUE is a grid based technique and works in a bottom-up fashion. The given data space is divided into axis-parallel hyper-cubes of equal size, and each cube is labeled as dense if it contains a certain number of points. The algorithm first checks for one-dimensional dense hyper-cubes and proceeds to higher dimensions. At each stage the candidate subspaces are generated in an apriori manner, and these candidates are further used to find clusters in higher dimensions. Adjacent dense cubes at each stage are combined into one single cluster. Adjusting the density parameter and the granularity of the cuboid is difficult: if the size of the cuboid is not set properly then there is a chance that some noise points are added into a cluster or some cluster points are ignored. Objects might belong to many subspace clusters at each level. CLIQUE does not partition the data into non-overlapping subsets of points, which is required for most applications, like clustering documents based on language.
Charu Aggarwal et al. proposed PROCLUS (PROjected CLUStering) [22], which works like the K-medoids algorithm. Initially a large set of medoids is found, out of which K are chosen, and the rest of the objects are attached to one of the medoids. In the next run the medoids are calculated again, to refine the quality of the clusters. Each medoid is associated with some subspace. A dimension is relevant to a cluster if the variance of the cluster points along it is small. PROCLUS takes two inputs, the number of clusters and the average number of dimensions per subspace cluster, which are again difficult to figure out. ORCLUS [25], proposed by the same authors, handles non-axis-aligned subspaces. In the case of these projected clustering algorithms, a few points can be reported in multiple clusters, as shown in Figure 5.2.
Figure 5.2: Clusters according to projected subspace clustering
LAC (Locally Adaptive Clustering) [26] calculates clusters using an attribute-weighting method. It partitions the data into the requested number of clusters. Each cluster is associated with a weight vector; weights are larger if the dimension is more relevant, and dimensions are more relevant if the variance along them is smaller. Though LAC resembles PROCLUS, the algorithms are quite different.
H.P. Kriegel et al. proposed a density based subspace clustering algorithm [27], which also works in a bottom-up fashion like CLIQUE. The algorithm computes 1-dimensional subspace clusters by applying the DBSCAN algorithm on one dimension. In the next iterations it generates candidate subspaces in an apriori manner, and again applies the DBSCAN algorithm to find the clusters in those subspaces. The same authors, along with Christian, proposed another density based approach with local subspace preferences. The variance is calculated within the ε-neighborhood of a point along a dimension to determine whether it is a relevant dimension, and according to the calculated variance each dimension is given a weight. This algorithm works much like DBSCAN, but at each stage the weights of the dimensions are considered in the calculation of the ε-neighborhood.
5.2 Subspace Clustering Datasets
Synthetic datasets have been generated for testing the subspace clustering algorithms. Most of them used the methods given in BIRCH [12] or by M. Zait [13]. For the dimensions in which the cluster is present, they followed the given method, but for the remaining dimensions they used the entire range of those dimensions. The problem with such datasets is that there is a possibility that some points of a cluster mix with some other cluster's points. It might happen that the density in the area common to both clusters is higher, and that particular portion can be identified as a cluster in the high-dimensional space. In Figure 5.3, the black dots belong to one cluster, for which X is chosen as one of the subspace dimensions, and Y is chosen for the gray dots. The other dimension is not selected, so the data points are spread over the entire range of that dimension.
Figure 5.3: Subspace clustering dataset
While generating datasets for subspace clustering algorithms, we restrict the whole range of the data points to lie within the bounding box of the corresponding cluster. Restricting the points to the bounding box ensures that the cluster points do not mix with points of any other cluster or with noise. The aim is to generate subspace cluster points which form a cluster only when seen in a subset of the dimensions, but not in all dimensions, i.e., the subspace cluster points should look almost like noise when seen in all the dimensions. As we show below, it is not possible to generate subspace cluster points whose density is at most that of noise when seen in all dimensions while being higher than that of noise when seen in the particular subspace associated with the cluster. Since the main issue is density, the number of points allotted to a cluster plays a major role. If we allot points to a cluster based on the size of its bounding box, then, because the cluster points are restricted to lie within that bounding box, the density of all clusters would be almost the same when seen in all the dimensions. Therefore, we allot points to a cluster based on the area/volume it occupies in its associated dimensions. The notation is as given in Chapter 4.
Let n be the total number of points, d the total number of dimensions, c the number of clusters, η the percentage of noise, s the total space determined by the user's inputs, m the maximum value of each dimension (so that s = m^d), r the average radius of a cluster, x the number of subspace clusters, and ds the average number of dimensions chosen for the subspace clusters. As in Chapter 4, ε and k are the cluster-placement parameters, with r_min = r_max / k.
Volume of a normal cluster = (2r)^d

Volume occupied by a subspace cluster = (2r)^{ds}

Maximum points allotted to noise = n / (1 + c)

Points allotted to a normal cluster
    = [ (2r)^d / ( (c - x)(2r)^d + x(2r)^{ds} ) ] * [ c / (1 + c) ] * n

Points allotted to a subspace cluster
    = [ (2r)^{ds} / ( (c - x)(2r)^d + x(2r)^{ds} ) ] * [ c / (1 + c) ] * n

Free space when seen in all the given d dimensions = s - c * (2r)^d

    r_max = [ 1 / (2(1 + ε)) ] * (s / c)^{1/d}
    r_min = r_max / k

Average radius r = (1/2) * (1 + 1/k) * r_max = [ (k + 1) / (4k(1 + ε)) ] * (s / c)^{1/d},

so that s = (2r)^d * [ 2k(1 + ε) / (k + 1) ]^d * c.

Density of noise (seen in all d dimensions)
    = [ n / (1 + c) ] / [ s - c * (2r)^d ]
    = [ n / (1 + c) ] / ( r^d * c * [ (4k(1 + ε) / (k + 1))^d - 2^d ] )

Density of a subspace cluster (seen in all d dimensions)
    = [ (2r)^{ds} / ( (c - x)(2r)^d + x(2r)^{ds} ) ] * [ c / (1 + c) ] * n / (2r)^d
Density of noise when seen in all dimensions should be greater than or equal to the density of the subspace cluster:

    [ n / (1 + c) ] / [ s - c * (2r)^d ]
        ≥ [ (2r)^{ds} / ( (c - x)(2r)^d + x(2r)^{ds} ) ] * [ c / (1 + c) ] * n / (2r)^d

which simplifies to

    (c - x)(2r)^d + x(2r)^{ds}  ≥  c^2 * (2r)^{ds} * [ (2k(1 + ε) / (k + 1))^d - 1 ]        (5.1)

Free space in the ds dimensions of a subspace cluster = m^{ds} - c * (2r)^{ds}, where

    m = s^{1/d} = 2r * [ 2k(1 + ε) / (1 + k) ] * c^{1/d},

so that

    Free space in ds dimensions = (2r)^{ds} * [ (2k(1 + ε) / (1 + k))^{ds} * c^{ds/d} - c ]

    Density of noise in ds dimensions
        = [ n / (1 + c) ] / ( (2r)^{ds} * [ (2k(1 + ε) / (1 + k))^{ds} * c^{ds/d} - c ] )

    Volume of a subspace cluster in its ds dimensions = (2r)^{ds}

    Density of a subspace cluster in its ds dimensions
        = [ (2r)^{ds} / ( (c - x)(2r)^d + x(2r)^{ds} ) ] * [ c / (1 + c) ] * n / (2r)^{ds}

In order to identify the subspace cluster in its associated ds dimensions, the density of noise when seen in those ds dimensions should be less than the density of the subspace cluster:

    [ (2r)^{ds} / ( (c - x)(2r)^d + x(2r)^{ds} ) ] * [ c / (1 + c) ] * n / (2r)^{ds}
        > [ n / (1 + c) ] / ( (2r)^{ds} * [ (2k(1 + ε) / (1 + k))^{ds} * c^{ds/d} - c ] )

which simplifies to

    c * (2r)^{ds} * [ (2k(1 + ε) / (1 + k))^{ds} * c^{ds/d} - c ]  >  (c - x)(2r)^d + x(2r)^{ds}        (5.2)

From Equations 5.1 and 5.2,

    c * (2r)^{ds} * [ (2k(1 + ε) / (1 + k))^{ds} * c^{ds/d} - c ]  >  c^2 * (2r)^{ds} * [ (2k(1 + ε) / (k + 1))^d - 1 ],

which reduces to

    c^{ds/d}  >  [ 2k(1 + ε) / (1 + k) ]^{d - ds} * c        (5.3)
Since c^{ds/d} < c (because ds/d < 1) while [ 2k(1 + ε) / (1 + k) ]^{d - ds} > 1, Equation 5.3 can never be satisfied. Thus it is not possible to generate subspace cluster points whose density is at most that of noise in all dimensions while being greater than that of noise in the particular subspace associated with the cluster. However, when d is large (ds ≪ d) and we compare the density of a normal cluster in all d dimensions with that of a subspace cluster, the density of the normal cluster is much larger than that of the subspace cluster, so the subspace cluster points can still be treated as noise.
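As a worked check of this argument, the short program below (in C++, like SynDECA) plugs arbitrary sample values into the two sides of Equation 5.3 and confirms that the left-hand side stays below the right-hand side; the particular parameter values are illustrative only.

    // Worked check of Equation 5.3 for sample parameter values (arbitrary choices):
    // the left-hand side c^(ds/d) stays below the right-hand side
    // (2k(1+eps)/(1+k))^(d-ds) * c, so the inequality can never be satisfied.
    #include <cmath>
    #include <iostream>

    int main() {
        double c = 20;      // number of clusters
        double d = 50;      // total number of dimensions
        double ds = 10;     // average subspace dimensionality
        double k = 3;       // ratio r_max / r_min
        double eps = 0.1;   // separation parameter

        double lhs = std::pow(c, ds / d);
        double rhs = std::pow(2 * k * (1 + eps) / (1 + k), d - ds) * c;
        std::cout << "c^(ds/d)                     = " << lhs << "\n";
        std::cout << "(2k(1+eps)/(1+k))^(d-ds) * c = " << rhs << "\n";
        std::cout << (lhs < rhs ? "Equation 5.3 fails, as argued.\n"
                                : "Equation 5.3 holds (unexpected).\n");
    }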
After fixing the cardinality of the clusters, subspace cluster points are generated in the same way as described for the traditional clusters, except that for the non-selected dimensions the entire range of the cluster's bounding box is used.
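A minimal sketch of this generation step is given below; it is an illustration under simplifying assumptions (a spherical feasible region for the selected dimensions and rejection sampling), not the exact SynDECA routine.

    // Sketch of generating one point of a subspace cluster (illustrative, not the
    // exact SynDECA routine). For the selected dimensions the point is drawn inside
    // the cluster's feasible region (here: a ball of radius r, i.e. a circular
    // cluster); for the non-selected dimensions it is drawn uniformly over the
    // entire range of the cluster's bounding box, [centre - r, centre + r].
    #include <cstddef>
    #include <iostream>
    #include <random>
    #include <vector>

    std::vector<double> subspaceClusterPoint(const std::vector<double>& centre,
                                             const std::vector<bool>& selected,
                                             double r, std::mt19937& rng) {
        std::uniform_real_distribution<double> offset(-r, r);
        std::vector<double> p(centre.size());
        while (true) {
            double sq = 0.0;                     // squared distance in selected dims only
            for (std::size_t i = 0; i < centre.size(); ++i) {
                double o = offset(rng);
                p[i] = centre[i] + o;
                if (selected[i]) sq += o * o;
            }
            if (sq <= r * r) return p;           // inside the ball in the chosen subspace
        }
    }

    int main() {
        std::mt19937 rng(7);
        std::vector<double> centre = {50.0, 50.0, 50.0};
        std::vector<bool> selected = {true, true, false};   // subspace = dimensions 0 and 1
        for (int i = 0; i < 3; ++i) {
            for (double v : subspaceClusterPoint(centre, selected, 5.0, rng)) std::cout << v << " ";
            std::cout << "\n";
        }
    }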
5.3 Summary
In this chapter we discussed what a subspace cluster is and a few of the existing techniques for subspace clustering, and we proposed a method to generate subspace cluster datasets. A subspace cluster in such a dataset looks almost like noise when seen in all the dimensions, whereas it is identifiable in the subspace associated with it.
Chapter 6
Experimental Results
SynDECA [28] is developed in C++ on the Linux platform. In a high-dimensional data space, not all points generated within the bounding box of a cluster will fall inside the feasible region of circular, elliptical or random shaped clusters: as the number of dimensions increases, the ratio of the feasible region to the size of the cluster's bounding box decreases, and the time taken to generate cluster points would increase. To overcome this, care is taken to ensure that each generated point lies within the feasible region of the cluster.
In the case of the grid method for random shaped clusters, the number of cells in the grid explodes as the number of dimensions increases; for example, with a maximum value of 100 per dimension and unit-sized cells, a 10-dimensional grid already has 100^10 = 10^20 cells. Moreover, there is no control over the size of the bounding box of the cluster, and since the maximum allowable value of each dimension can be large, it is not possible to maintain information about each and every cell. Since the number of points allotted to a cluster can be very large while only a small percentage of points falls in each cell, maintaining information about the filled cells alone can also be cumbersome.
A few examples of two-dimensional datasets are given in Figure 6.1, and a few three-dimensional datasets without noise are shown in Figure 6.2. The time taken to generate datasets for various inputs is given in Table 6.1. Since generating points for a subspace cluster is not very different from generating normal cluster points, the time taken is almost the same. SynDECA can generate a large number of data points in a very high-dimensional data space, including complex-shaped clusters, with some amount of noise. The generated datasets help in evaluating clustering algorithms on many of the issues listed in Chapter 2.
6.1 Comparison with the Existing Work
Since the current version of the “clusutils” tool is not working properly, we could not carry out an extensive comparison of its output with SynDECA datasets. Figure 6.3 shows results from “clusutils” in the top row and results from SynDECA in the bottom row. The major inputs to clusutils are as follows.
• Number of clusters.
• Number of points.
• Number of dimensions.
No. of Points    No. of Dimensions    No. of Clusters    Max Value in each dimension    Time taken (in sec)
1000             2                    10                 100                            0.30
10000            2                    10                 100                            0.10
100000           2                    50                 100                            0.99
1000000          2                    50                 100                            8.22
1000000          2                    100                100                            8.25
1000000          2                    100                1000                           8.93
10000000         2                    100                1000                           88.21
100000           10                   100                100                            8.55
100000           10                   100                1000                           7.42
1000000          10                   100                1000                           47.06
100000           50                   50                 100                            397.92
100000           50                   100                100                            500.54
100000           100                  50                 100                            1196.1
100000           100                  50                 1000                           1568.31

Table 6.1: Time taken for generating datasets for various inputs
• Density level.
• Random noise (this value will be added to each and every dimension of every point).
• Percentage of outliers (the percentage of extra points to be added as outliers/noise).
The inputs given to both tools are as follows:

    Number of points     = 500
    Number of clusters   = 5
    Number of dimensions = 2

The density level parameter of clusutils is set to 1, meaning all clusters have the same density. The first two columns show datasets without noise points; SynDECA was modified slightly to generate noise-free datasets. In the last column the clusutils dataset is given a random noise of 50.0, while noise points are generated in the case of SynDECA. Even with identical inputs, clusutils chose different ranges for the x and y dimensions, whereas in the case of SynDECA the first image is in a 10 x 10 space and the second in a 100 x 100 space (recall that the user can specify the maximum value taken by a dimension). The last image from clusutils shows no clusters at all, even though five clusters were requested, whereas SynDECA ensures that the required number of clusters is present in every generated dataset.
The method explained in [12] generates only two-dimensional datasets, and since the regions in which clusters are spread can overlap, there is no guarantee that the requested number of clusters is present in the generated dataset. The method of M. Zait et al. [13] can handle any number of dimensions, but the datasets contain no noise and each cluster has a dedicated range in each dimension. SynDECA, in contrast, can generate clusters that do not have any dedicated range and guarantees the requested number of clusters.
6.2 Validation of SynDECA Datasets
We generated some datasets (details in Table 6.2) to study the behavior of some of the existing clustering algorithms. We used the implementation of K-Means available at [29], the DBSCAN implementation available at [30], and the ReCkless algorithm [31] to test our datasets. Since we could not obtain the code of any of the subspace clustering algorithms, we are unable to show results for those algorithms. The generated datasets are in two and three dimensions.
Name of Dataset       No. of Points    No. of Dimensions    No. of Clusters
dataset1              10000            2                    10
dataset2              10000            3                    6
dataset3              10000            2                    10
dataset4              10000            2                    30
dataset5              10000            2                    30
subspace-dataset1     10000            3                    5 (2 subspace)
subspace-dataset2     10000            3                    5 (1 subspace)
subspace-dataset3     10000            3                    6 (3 subspace)

Table 6.2: Details of datasets used in the experiments
Each cluster is plotted in a different color, and the noise in yet another color, for better visibility of the datasets. The original datasets can be seen in Figures 6.4, 6.6, 6.7, 6.9, 6.10, 6.12 and 6.14.
Clustering algorithms that do not account for the presence of noise are unable to find the clusters properly when noisy datasets are given as input; the results of the K-Means algorithm in Figures 6.5, 6.11 and 6.13 show this. When there is no noise and the clusters are well separated, K-Means is able to find the clusters (Figure 6.6). But when ε takes a very small value, K-Means gets confused (Figure 6.7) and gives a wrong result.
Density-based algorithms such as DBSCAN are able to find the clusters properly (Figure 6.8) when all the clusters are traditional clusters. When ε takes a very small value, the distance between two clusters can become negligible, and in such cases DBSCAN clubs two clusters into a single one (Figure 6.9). In the presence of subspace clusters it is sometimes able to identify the correct number of clusters (Figures 6.11, 6.13), but sometimes it is unable to figure out the clusters (Figure 6.14). The DBSCAN implementation used has a provision for not specifying the epsilon value, in which case the program chooses it itself. Different minpoints values were given to the program and the results verified. For subspace-dataset3, when minpoints is in [1, 7] the number of clusters identified is in the range [7, 26]; for minpoints in [8, 18] the number of clusters identified is only five; for other inputs it identified very few clusters.
ReCkless [9] is an agglomerative algorithm based on reverse nearest neighbors. With a small k value it finds a large number of clusters and gradually merges them as the k value increases; this change can be seen in Figure 6.15. As k is increased, the random-shaped clusters merge and form a single cluster (Figure 6.16), but at this point more noise points are treated as cluster points. When dataset5 (whose ε value is very small) is used as input, ReCkless is able to identify the regular clusters for smaller k values (≤ 32) (Figure 6.17), but there is a sudden merge at k = 33 because the clusters are nearby. Even at higher k values the two clusters remain identified as one cluster (Figure 6.18), since ReCkless is a hierarchical clustering algorithm.
Grid-based clustering algorithms such as STING build a hierarchical grid structure and try to find the dense regions. Datasets generated with some subspace clusters will make grid-based methods fail, but when only normal clusters are generated they give the correct result.
Projection-based subspace clustering methods are easily misled by the presence of noise, as they ignore the noise points. Density-based subspace clustering algorithms can also get confused by noise, since the density of the noise is almost the same as the density of a subspace cluster when seen in all dimensions. The various datasets presented here show that SynDECA is capable of generating datasets on which algorithms can form the clusters properly, and also, by choosing different parameter values, datasets which confuse the clustering algorithms. Table 6.3 gives an overview of the results from the various algorithms along with some remarks.
Figure 6.1: Few examples of the generated data in 2-Dimensions.
Figure 6.2: Few examples of the generated data in 3-Dimensions.
Figure 6.3: Top row datasets are from “clusutils” — Bottom row datasets are from “SynDECA”.
Figure 6.4: Plot of generated datasets (a) dataset1 (b) dataset2
Figure 6.5: Plot of clustering result of K-Means algorithm on (a) dataset1 (b) dataset2
Figure 6.6: Plot of dataset3 (a) Original (b) Result of K-Means algorithm
Figure 6.7: Plot of dataset4 (a) Original (b) Result of K-Means algorithm
Figure 6.8: Plot of clustering result of DBSCAN algorithm on (a) dataset1 (b) dataset2
Figure 6.9: Plot of dataset5 (a) Original (b) Result of DBSCAN algorithm
Figure 6.10: Plot of subspace-dataset1 (a) three dimensional plot (b) two dimensional projection
Figure 6.11: Clustering result of subspace-dataset1 (a) K-Means (b) DBSCAN
Figure 6.12: Plot of subspace-dataset2 (a) three dimensional plot (b) two dimensional projection
Figure 6.13: Clustering result of subspace-dataset2 (a) K-Means (b) DBSCAN
Figure 6.14: Plot of subspace-dataset3 (a) Original (b) Result of DBSCAN
Figure 6.15: Plot of clustering result of ReCkless algorithm on dataset1 (a) k=25 (b) k=60
Figure 6.16: Plot of clustering result of ReCkless algorithm on dataset1 (a) k=90 (b) k=162
Figure 6.17: Plot of clustering result of ReCkless algorithm on dataset5 (a) k=32 (b) k=33
Figure 6.18: Plot of clustering result of ReCkless algorithm on dataset5 (a) k=50 (b) k=78
Chapter 7
Conclusions and Future Work
Datasets are required for testing the correctness of any clustering algorithm. Since not many real-life datasets are available, there is a need for a tool that generates clustering datasets, and the existing methods to generate such datasets are not good enough. We proposed algorithms to generate clustering datasets in R^d (where d is the number of dimensions). We generate clusters having regular and arbitrary shapes, and the proposed methods make sure that there are exactly the requested number of clusters in the dataset. Researchers have found that clusters may spread over only a subset of the dimensions of a dataset; we therefore also generate datasets which help in evaluating subspace clustering algorithms. Experimental results show that the existing traditional clustering algorithms may not identify the subspace clusters.
The work presented in this thesis can be extended as follows.
1. The proposed algorithms take a cuboid as the minimum bounding box. Methods are to be developed to handle arbitrary bounding boxes; datasets generated in such bounding boxes would contain clusters like those shown in Figure 7.1.
2. We concentrated only on numerical datasets. Some clustering algorithms have been proposed [32, 33] to deal with purely categorical data or a mixture of categorical and real-valued dimensions. Techniques need to be devised to generate datasets for the evaluation of such clustering methods.
Figure 7.1: Clusters in arbitrary shaped bounding boxes
Bibliography
[1] J. R. Vennam, “Syndeca.” http://cde.iiit.net/syndeca/.
[2] M. A. Nascimento, D. Pfoser, and Y. Theodoridis, “Synthetic and real spatiotemporal datasets,” Bulletin of the
IEEE Computer Society Technical Committee on Data Engineering, 2003.
[3] “Gstd: Generate spatio temporal data.” http://www.cti.gr/RD3/GSTD/index2.html.
[4] “G-terd: Generator for time-evolving regional data.” http://delab.csd.auth.gr/stdbs/g-terd.html.
[5] J.-M. Saglio and J. Moreira, “Oporto: A realistic scenario generator for moving objects,” GeoInformatica, vol. 5, no. 1, pp. 71–93, 2001.
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
[7] L. Ertoz, M. Steinbach, and V. Kumar, “Finding clusters of different sizes, shapes, and densities in noisy, high
dimensional data,” SIAM International Conference on Data Mining (SIAM), 2003.
[8] L. Ertoz, M. Steinbach, and V. Kumar, “A new shared nearest neighbor clustering algorithm and its applications,”
Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on
Data Mining, 2002.
[9] S. Vadapalli, S. R. Valluri, K. Karlapalem, and P. Gupta, “Cluster analysis and outlier detection using reverse
nearest neighbors,” Tech. Rep. IIIT-H/TR/2004/009, International Institute of Information Technology, Hyderabad.
[10] A. Foss and O. R. Zaïane, “A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets,” International Conference on Data Mining (ICDM), pp. 179–186, 2002.
[11] G. W. Milligan, “An algorithm for generating artificial test clusters,” Psychometrika, 1985.
[12] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: An efficient data clustering method for very large databases,” Proc. of ACM SIGMOD International Conference on Management of Data, 1996.
[13] M. Zait and H. Messatfa, “A comparative study of clustering methods,” FGCS Journal, Special Issue on Data Mining, 1997.
[14] D. X. Pape, “Clusutils.” http://clusutils.sourceforge.net/, http://clusutils.sourceforge.net/manual/cg-man.html,
September 2000.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases,” International Conference on Knowledge Discovery and Data Mining (KDD), pp. 226–231, 1996.
[16] S. Guha, R. Rastogi, and K. Shim, “CURE: An efficient clustering algorithm for large databases,” Proc. of ACM SIGMOD International Conference on Management of Data, 1998.
[17] R. T. Ng and J. Han, “Efficient and effective clustering methods for spatial data mining,” Proc. of International
Conference on Very Large Databases (VLDB), 1994.
[18] M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander, “OPTICS: Ordering points to identify the clustering structure,” Proc. of ACM SIGMOD International Conference on Management of Data, 1999.
[19] E. Knorr and R. Ng, “Algorithms for mining distance-based outliers in large datasets,” Proc. of International
Conference on Very Large Databases (VLDB), 1998.
[20] G. Sheikholeslami, S. Chatterjee, and A. Zhang, “WaveCluster: A multi-resolution clustering approach for very large spatial databases,” Proc. of International Conference on Very Large Databases (VLDB), 1998.
[21] W. Wang, J. Yang, and R. Muntz, “STING: A statistical information grid approach to spatial data mining,” Proc. of International Conference on Very Large Databases (VLDB), 1997.
[22] C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park, “Fast algorithms for projected clustering,” Proc. of ACM SIGMOD International Conference on Management of Data, 1999.
[23] L. Parsons, E. Haque, and H. Liu, “Evaluating subspace clustering algorithms,” SIAM International Conference on Data Mining (SIAM), 2004.
[24] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic subspace clustering of high dimensional data for data mining applications,” Proc. of ACM SIGMOD International Conference on Management of Data, 1998.
[25] C. C. Aggarwal and P. S. Yu, “Finding generalized projected clusters for high dimensional spaces,” Proc. of ACM SIGMOD International Conference on Management of Data, 2000.
[26] C. Domeniconi, D. Papadopoulos, D. Gunopulos, and S. Ma, “Subspace clustering of high dimensional data,” SIAM International Conference on Data Mining (SIAM), 2004.
[27] K. Kailing, H.-P. Kriegel, and P. Kröger, “Density-connected subspace clustering for high-dimensional data,” SIAM International Conference on Data Mining (SIAM), 2004.
[28] J. R. Vennam and S. Vadapalli, “SynDECA: A toolkit to generate synthetic datasets for evaluation of clustering algorithms,” 11th International Conference on Management of Data (COMAD), 2005.
[29] “Efficient algorithms for k-means clustering.” http://www.cs.umd.edu/~mount/Projects/KMeans/.
[30] “Matlab implementation of dbscan.” http://www.chemometria.us.edu.pl/download.html.
[31] S. Vadapalli, “k-reverse nearest neighbor (k-rnn) cluster analysis and outlier detection.” http://cde.iiit.ac.in/RNNs/.
[32] V. Ganti, J. Gehrke, and R. Ramakrishnan, “CACTUS: Clustering categorical data using summaries,” International Conference on Knowledge Discovery and Data Mining (KDD), 1999.
[33] M. Peters and M. J. Zaki, “CLICK: Clustering categorical data using k-partite maximal cliques,” International Conference on Data Engineering (ICDE), 2005.