AN EFFICIENT HILBERT CURVE-BASED CLUSTERING STRATEGY FOR
LARGE SPATIAL DATABASES
A Thesis
Submitted to the Faculty
of
National Sun Yat-sen University
by
Yun-Tai Lu
In Partial Fulfillment of the
Requirements for the Degree
of
Master of Science
June 2003
ABSTRACT
Recently, millions of databases have come into use, and we need new techniques that can automatically transform the processed data into useful information and knowledge. Data mining is the technique of analyzing data to discover previously unknown information, and spatial data mining is the branch of data mining that deals with spatial data. In spatial data mining, clustering is one of the useful techniques for discovering interesting patterns in the underlying data objects. The problem of clustering is: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than data points in different clusters. Cluster analysis has been widely applied to many areas such as medicine, social studies, bioinformatics, map regions and GIS, etc. In recent years, many researchers have focused on finding efficient methods for the clustering problem. In general, we can classify these clustering algorithms into four approaches: partitioning, hierarchical, density-based, and grid-based approaches. The k-means algorithm, which is based on the partitioning approach, is probably the most widely applied clustering method. But a major drawback of the k-means algorithm is that it is difficult to determine the parameter k to represent "natural" clusters, and it is only suitable for convex spherical clusters. The k-means algorithm also has high computational complexity and is unable to handle large databases. Therefore, in this thesis, we present an efficient clustering algorithm for large spatial databases. It combines the hierarchical approach with the grid-based approach. We apply the grid-based approach because it is efficient for large spatial databases. Moreover, we apply the hierarchical approach to find the genuine clusters by repeatedly combining blocks. Basically, we make use of the Hilbert curve to provide a way to linearly order the points of a grid. Note that the Hilbert curve is a kind of space-filling curve, where a space-filling curve is a continuous path which passes through every point in a space exactly once, forming a one-to-one correspondence between the coordinates of the points and the one-dimensional sequence numbers of the points on the curve. The goal of using a space-filling curve is to preserve distance, i.e., points which are close in 2-D space and represent similar data should be stored close together in the linear order. This kind of mapping can also minimize the disk access effort and provide high speed for clustering. This new algorithm requires only one input parameter and supports the user in determining an appropriate value for it. In our simulation, we show that our proposed clustering algorithm has a shorter execution time than other algorithms for large databases: as the number of data points increases, the execution time of our algorithm increases slowly. Moreover, our algorithm can deal with clusters of arbitrary shapes, which the k-means algorithm cannot discover.
TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
   1.1 Spatial Data Mining
   1.2 Clustering
   1.3 Space Filling Curves
   1.4 Motivation
   1.5 Organization of Thesis
2. A Survey
   2.1 K-Means
   2.2 CLARANS
   2.3 HAC
   2.4 CURE
   2.5 DBSCAN
   2.6 STING
3. The Clustering Algorithm Based on the Hilbert Curve
   3.1 The Basic Idea
   3.2 The Algorithm
      3.2.1 The First Round
      3.2.2 The Second Round
      3.2.3 The Third Round
      3.2.4 The Fourth Round
4. Performance
   4.1 Performance Measures
   4.2 The Simulation Model
   4.3 Simulation Results
      4.3.1 Time Scalability
      4.3.2 Sensitivity to Parameters
      4.3.3 Special Data Sets
5. Conclusion
   5.1 Summary
   5.2 Future Work
LIST OF FIGURES

1.1 The object of clustering
1.2 A classification of clustering algorithms
1.3 The concept of the partitioning approach
1.4 Distinction between agglomerative and divisive methods
1.5 The concept of the density-based approach
1.6 Peano curves of order: (a) 1; (b) 2; (c) 3.
1.7 Reflected binary gray-code curves of order: (a) 1; (b) 2; (c) 3.
1.8 Hilbert curves of order: (a) 1; (b) 2; (c) 3.
1.9 Clusters are formed in the space filling curve of order 3: (a) the Peano curve; (b) the RBG curve; (c) the Hilbert curve.
1.10 Sample databases: (a) clusters of widely different sizes (Case 1); (b) clusters with convex shapes (Case 2); (c) clusters with elongated shapes (Case 3); (d) clusters with double circles (Case 4).
1.11 Clusters discovered by different algorithms: (1) the k-means method; (2) the new algorithm.
2.1 An overview of the k-means method: (a) (c) (e) calculate the new centroid; (b) (d) (f) assign the points to the new centroid.
2.2 The different linkage measures: (a) single linkage; (b) complete linkage; (c) average linkage.
2.3 The overview of CURE
2.4 Directly density-reachable
2.5 Example for: (a) density-reachability; (b) density-connectivity.
2.6 The concept of the grid-based method
3.1 Partition the spatial data into the rectangle blocks
3.2 Using the Hilbert curve to connect these blocks
3.3 The initial state of the first round
3.4 Procedure FirstRound(BI, m)
3.5 The result of the first round
3.6 The initial state of the second round
3.7 The different cases in the second round: (a) C04; (b) C14; (c) C24; (d) C34; (e) C44.
3.8 Procedure SecondRound(BI, m)
3.9 The result of the second round
3.10 The initial state of the third round
3.11 Basic unit U3 in the third round
3.12 The basic idea of finding the neighboring block
3.13 The cases in the third round
3.14 The different units in the third round
3.15 The result of the third round
3.16 The initial state of the fourth round
3.17 The cases in the fourth round
3.18 The result of the fourth round
3.19 The process for merging clusters
4.1 The actual data sets used in our first experiments: (a) DS1; (b) DS2; (c) DS3.
4.2 A comparison of the execution time (DS1)
4.3 A comparison of the execution time (DS2)
4.4 A comparison of the execution time (DS3)
4.5 A comparison of the execution time (the degenerated case)
4.6 A comparison of the quality of the clustering of the k-means algorithm under a different parameter k: (a) k = 2; (b) k = 5; (c) k = 7.
4.7 A comparison of the execution time under a different order h of our algorithm
4.8 The special data sets: (a) SDS1; (b) SDS2; (c) SDS3; (d) SDS4; (e) SDS5; (f) SDS6; (g) SDS7; (i) SDS8.
4.9 The result of SDS1: (a) the k-means algorithm; (b) our algorithm.
4.10 The result of SDS2: (a) the k-means algorithm; (b) our algorithm.
4.11 The result of SDS3: (a) the k-means algorithm; (b) our algorithm.
4.12 The result of SDS4: (a) the k-means algorithm; (b) our algorithm.
4.13 The result of SDS5: (a) the k-means algorithm; (b) our algorithm.
4.14 The result of SDS6: (a) the k-means algorithm; (b) our algorithm.
4.15 The result of SDS7: (a) the k-means algorithm; (b) our algorithm.
4.16 The result of SDS8: (a) the k-means algorithm; (b) our algorithm.
LIST OF TABLES

3.1 The basic unit in each round
3.2 Case 1 in the third round
3.3 Case 2 in the third round
3.4 Case 1 in the fourth round
3.5 Case 2 in the fourth round
3.6 Definitions of parameters
3.7 The rules for Case 1
3.8 The rules for Case 2
3.9 The number of stopping points
3.10 The relationships in the 3rd round
3.11 The relationships in the 3rd round
4.1 Parameters for data generation and their values (or ranges)
4.2 Data sets used in the simulation
4.3 A comparison of the execution time (DS1)
4.4 A comparison of the execution time (DS2)
4.5 A comparison of the execution time (DS3)
4.6 A comparison of the execution time (the degenerated case)
4.7 A comparison of the execution time under a different order h of our algorithm
4.8 Execution time (in milliseconds) for different special data sets
4.9 A comparison
CHAPTER I
Introduction
Data mining, or the efficient discovery of interesting patterns from large collections of data, has been recognized as an important area of database research [17]. The goal is to reveal regularities and relationships that are non-trivial. This is accomplished through an analysis of the patterns that form in the data [16]. Data mining techniques can be classified into the following categories: classification, clustering, association rules, sequential patterns, time-series patterns, link analysis and text mining [28]. Spatial data mining is the branch of data mining that deals with spatial (location) data.
1.1 Spatial Data Mining

Spatial Database Systems (SDBS) are database systems designed to handle spatial data and the non-spatial information used to identify the data. Spatial data describes information related to the space occupied by objects. SDBS are used for everything from geo-spatial data to bio-medical knowledge, and the number of such databases and their uses are increasing rapidly. The amount of spatial data being collected is also increasing exponentially. The complexity of the data contained in these databases means that it is not possible for humans to completely analyze the data being collected. Data mining techniques have been used with relational databases to discover unknown information, searching for unexpected results and correlations [16]. Therefore, automated knowledge discovery becomes more and more important in spatial databases [6].
Spatial data mining in particular is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases [19]. Spatial data mining differs from regular data mining in parallel with the differences between non-spatial data and spatial data. The attributes of a spatial object stored in a database may be affected by the attributes of the spatial neighbors of that object. In addition, spatial location, and implicit information about the location of an object, may be exactly the information that can be extracted through spatial data mining [16]. Knowledge discovered from spatial data can take various forms, such as characteristic and discriminant rules, extraction and description of prominent structures or clusters, spatial associations, and others.

Usually, the spatial relationships are implicit in nature. Because of the huge amounts of spatial data that may be obtained from satellite images, medical equipment, Geographic Information Systems (GIS), image database exploration, etc., it is expensive and unrealistic for the users to examine spatial data in detail. Spatial data mining aims to automate the process of understanding spatial data by representing the data in a concise manner and reorganizing spatial databases to accommodate data semantics. It can be used in many applications such as seismology (grouping earthquakes clustered along seismic faults), minefield detection (grouping mines in a minefield), and astronomy (grouping stars in galaxies) [24]. So, a crucial challenge in spatial data mining is the efficiency of spatial data mining algorithms, due to the often huge amount of spatial data and the complexity of spatial data types and spatial access methods [27]. Another challenge is that, since there is no ordering by spatial proximity among spatial objects, the computation of spatial operators is more difficult than that of their non-spatial counterparts. Due to its undirected nature, clustering is often the best technique to adopt first when a large, complex data set with many variables and many internal structures is encountered [28]. Clustering is a technique that is quite useful in spatial data mining applications. It divides the initial set of objects into a number of non-overlapping subsets, in order to identify classes or groups of objects whose locations (in some k-dimensional space) are close to each other.
1.2 Clustering
In spatial data mining, clustering is a useful technique for discovering interesting data distributions and patterns in the underlying data. Cluster analysis helps construct meaningful partitionings of a large set of objects based on a "divide and conquer" methodology, which decomposes a large-scale system into smaller components to simplify design and implementation. As a data mining task, data clustering identifies clusters, or densely populated regions, according to some distance measurement, in a large, multidimensional data set. Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. Data clustering identifies the sparse and the crowded places, and hence discovers the overall distribution patterns of the data set [5].

Cluster analysis does not use category labels that tag objects with prior identifiers. The absence of category labels distinguishes cluster analysis from discriminant analysis (and pattern recognition and decision analysis). The objective of cluster analysis is simply to find a convenient and valid organization of the data, not to establish rules for separating future data into categories. Clustering algorithms are geared toward finding structure in the data [12].

The problem of clustering can be defined formally as follows: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than data points in different clusters [10]. An example of clustering is depicted in Figure 1.1. The input patterns are shown in Figure 1.1-(a), and the desired clusters are shown in Figure 1.1-(b) [13].

In the past years, cluster analysis has been widely applied to many areas such as medicine (classification of diseases), chemistry (grouping of compounds), social studies (classification of statistical findings), bioinformatics [3, 25], etc. [20]. In general, we can classify the clustering algorithms into four approaches: partitioning, hierarchical, density-based, and grid-based approaches. Figure 1.2 shows the relationship between the different approaches of spatial clustering algorithms.
Figure 1.1 The object of clustering: (a) the input patterns; (b) the desired clusters.
Figure 1.2 A classification of clustering algorithms (partitioning, hierarchical, density-based, and grid-based approaches)
Figure 1.3 The concept of the partitioning approach
The partitioning approach constructs a partition of a database D of n objects into a set of k clusters, where k is an input parameter for these algorithms. The partitioning approach typically starts with an initial partition of a database D and then uses an iterative control strategy to optimize an objective function. Each cluster is represented by the gravity center of the cluster (k-means method) or by one of the objects of the cluster located near its center (k-medoids method) [6]. That is, it classifies the data into k groups, which together satisfy the requirements of a partition: (1) Each group must contain at least one object. (2) Each object must belong to exactly one group. These conditions imply that there are at most as many groups as there are objects (k ≤ n). The second condition says that two different clusters cannot have any objects in common and that the k groups together add up to the full data set. Figure 1.3 shows an example of a partition of 20 points into three clusters [15].

Consequently, the partitioning approach uses a two-step procedure. First, it determines k representatives minimizing the objective function. Second, it assigns each object to the cluster whose representative is "closest" to the considered object. It is important to note that k is given by the user. Of course, not all values of k lead to "natural" clusterings, so it is advisable to run the algorithm several times with different values of k and to select the k for which certain characteristics or graphics look best, or to retain the clustering that appears to give rise to the most meaningful interpretation [6, 15]. Several algorithms belong to the partitioning approach, including k-means [15], PAM [15], CLARA [15], and CLARANS [19].
The hierarchical approach creates a hierarchical decomposition of a database D. The hierarchical decomposition is represented by a dendrogram, a tree that iteratively splits D into smaller subsets until each subset consists of only one object. In such a hierarchy, each node of the tree represents a cluster of D. In contrast to partitioning approaches, the hierarchical approach does not need k as an input. However, a termination condition has to be defined indicating when the merge or division process should be terminated [6]. The hierarchical algorithms can be further divided into the agglomerative and the divisive methods.

The agglomerative methods: If the clustering hierarchy is formed bottom-up, at the start each data object is a cluster by itself. Then, small clusters are merged into bigger clusters at each level of the hierarchy until, at the top of the hierarchy, all the data objects are in one cluster [29].

The divisive methods: In the divisive methods, initially the set of all objects is viewed as one cluster, and at each level some clusters are binary divided into smaller clusters [25]. The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved.

The distinction between the agglomerative and the divisive methods is shown in Figure 1.4. Figure 1.4 shows what happens with a data set of n = 5 objects. The agglomerative methods (indicated by the upper arrow, pointing to the right) start when all objects are apart (that is, at step 0 we have n clusters). Then, in each step, two clusters are merged, until only one is left. On the other hand, the divisive methods start when all objects are together (that is, at step 0, there is one cluster) and in each following step, a cluster is split up, until there are n of them [15]. Many algorithms belong to the hierarchical approach, including HAC [26], BIRCH [30], ROCK [9], CURE [10], and CHAMELEON [14].

Figure 1.4 Distinction between agglomerative and divisive methods

The density-based approach applies a local cluster criterion. Clusters are regarded as regions in the data space in which the objects are dense, and which are separated by regions of low object density (noise). These regions may have an arbitrary shape and the points inside a region may be arbitrarily distributed.
The key idea of the density-based approach is that, for each object of a cluster, the neighborhood of a given radius (ε) has to contain at least a minimum number of objects (MinPts), i.e., the cardinality of the neighborhood has to exceed a threshold. As shown in Figure 1.5, the point q has five neighbors, so point q and its neighbors can become a cluster [2, 6].

An open set in the Euclidean space can be divided into a set of its connected components. The implementation of this idea for partitioning a finite set of points requires concepts of density, connectivity and boundary. They are closely related to a point's nearest neighbors. A cluster, defined as a connected dense component, grows in any direction that density leads. Therefore, the density-based approach is capable of discovering clusters of arbitrary shapes. Also, this provides a natural protection against outliers [4]. The advantages of the density-based approach are that it can discover clusters with arbitrary shapes and it does not need to preset the number of clusters [29]. Many algorithms belong to the density-based approach, including CLIQUE [1], OPTICS [2], CAST [3], DBSCAN [6], and WaveCluster [24].

Figure 1.5 The concept of the density-based approach (Eps = 2 cm, MinPts = 5)

The grid-based approach first quantizes the clustering space into a finite number of cells, and then performs clustering on the gridded cells.
Overall, the grid-based approach shifts our attention from data to space partitioning. Data partitioning is induced by the membership of points in segments resulting from space partitioning, while space partitioning is based on grid characteristics accumulated from the input data. One advantage of this indirect handling (data → grid-data → space-partitioning → data-partitioning) is that the accumulation of grid-data makes the grid-based clustering approach independent of data ordering. In contrast, relocation methods and all incremental methods are very sensitive to data ordering [4].

The main advantage of the grid-based approach is that its speed only depends on the resolution of the gridding, not on the size of the data set. The grid-based approach is more suitable for high-density data sets with a huge number of data objects in a limited space [29]. Many algorithms belong to the grid-based approach, including CLIQUE [1], STING [27], and WaveCluster [24].
1.3 Space Filling Curves

The amount of spatial data being collected is increasing rapidly, and spatial databases are becoming larger and larger. So, we need an efficient clustering algorithm to help us analyze large spatial databases. The grid-based approach is very efficient for large databases, because it quantizes the space into a finite number of blocks and transforms the focus from the number of objects to the number of blocks.
A space-filling curve is a continuous path which passes through every point in a space exactly once, forming a one-to-one correspondence between the coordinates of the points and the one-dimensional sequence numbers of the points on the curve. The space-filling curve provides a way to linearly order the points of a grid.

In any case, the term spatial usually refers to objects and operators in a space of dimension two or higher. However, there is no total ordering among spatial objects that preserves spatial proximity. In other words, there is no mapping from two- or higher-dimensional space into one-dimensional space such that any two objects that are spatially close in the higher-dimensional space are close to each other in the one-dimensional sorted sequence. One way is to look for total orders that preserve spatial proximity at least to some extent.

The goal of the space-filling curve is to preserve distance, i.e., points which are close in space and represent similar data should be stored close together in the linear order [8]. Some examples of space-filling curves are the Peano curve, the RBG curve and the Hilbert curve [7, 18, 22, 23]. These space-filling curves can help us preserve the spatial locality of the blocks and provide high speed for clustering.

In general, space-filling curves start with a basic path on a k-dimensional square grid of side 2. The path visits every point in the grid exactly once without crossing itself. It has two free ends which may be joined with other paths. The basic curve is said to be of order 1. To derive a curve of order i, each vertex of the basic curve is replaced by the curve of order i − 1, which may be appropriately rotated and/or reflected to fit the new curve [8].

The Peano curve was first proposed by Orenstein [22, 23]. In this mapping, the one-dimensional sequence number (1D-number) of a point is obtained by simply interleaving the bits of the binary representations of the X and Y coordinates of the point in the two-dimensional space (2D-space). The basic Peano curve for a 2 × 2 grid, denoted as P1, is shown in Figure 1.6-(a). To derive higher orders of the Peano curve, we replace each vertex of the basic curve with the previous order curve. Figures 1.6-(b) and (c) show the Peano curves of order 2 and 3, respectively. We can think of dividing the given region into quadrants and drawing a curve such as Figure 1.6-(a).
Figure 1.6 Peano curves of order: (a) 1; (b) 2; (c) 3.
Then each quadrant is divided in turn into 4 sub-quadrants, and the same basic curve is repeated in each, in place of each node in the previous step. One more recursive step, again dividing each node into 4 sub-quadrants joined by the basic curve, gives rise to Figure 1.6-(c) [11].
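To make the bit-interleaving idea concrete, the following minimal Python sketch (not from the thesis) computes the 1D-number of a point by interleaving the bits of its X and Y coordinates; the function name is an assumption, and the bit order is chosen here so that the order-1 labels match Figure 1.6-(a).

```python
def peano_number(x, y, order):
    """Interleave the bits of x and y (each in [0, 2**order)) to obtain the
    1D sequence number of the cell on the Peano (Z-order) curve."""
    z = 0
    for bit in range(order):
        z |= ((y >> bit) & 1) << (2 * bit)       # y contributes the even bit positions
        z |= ((x >> bit) & 1) << (2 * bit + 1)   # x contributes the odd bit positions
    return z

# Example: the four cells of the basic 2 x 2 Peano curve P1 (Figure 1.6-(a))
if __name__ == "__main__":
    for x in range(2):
        for y in range(2):
            print((x, y), "->", peano_number(x, y, order=1))  # 0, 1, 2, 3
```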
In a (binary) Gray code, numbers are coded into binary representations such that successive numbers differ in exactly one bit position. Faloutsos [7, 8] observed that differing in only one bit position has a relationship with locality. He proposed that the numbers produced by interleaving the coordinates of a point in 2D-space, as in the Peano curve technique, be used to obtain the 1D-number. The basic reflected binary gray-code curve (the RBG curve) of a 2 × 2 grid, denoted as R1, is shown in Figure 1.7-(a). The procedure to derive higher orders of this curve is to reflect the previous order curve over the x-axis and then over the y-axis. Figures 1.7-(b) and (c) show the reflected binary gray-code curves of order 2 and 3, respectively. As in the case of the Peano curve, the RBG curve begins with the curve of Figure 1.7-(a). It divides each quadrant into 4 sub-quadrants and replicates. While replicating, it rotates the two upper quadrants through 180° as shown in Figure 1.7-(b). It divides into 4 sub-quadrants once again, with replication and upper quadrant rotation, to get Figure 1.7-(c) [11].

Figure 1.7 Reflected binary gray-code curves of order: (a) 1; (b) 2; (c) 3.

The Hilbert curve is a mapping in which the four nearest neighbors in 2D-space are usually mapped to points not too far away in the linear traversal. It begins with Figure 1.8-(a). As in the case of the previous curves, it replicates in four quadrants. When replicating, the lower left quadrant is rotated clockwise 90°, the lower right
quadrant is rotated anti-clockwise 90°, and the sense (or direction of traversal) of both lower quadrants is reversed. The two upper quadrants have no rotation and no change of sense. Thus, we obtain Figure 1.8-(b). Remembering that all rotation and sense computations are relative to the previously obtained rotation and sense in a particular quadrant, a repetition of this step gives rise to Figure 1.8-(c). So, the basic Hilbert curve of a 2 × 2 grid, denoted as H1, is shown in Figure 1.8-(a). The procedure to derive higher orders of this curve is to rotate and reflect the curve at vertex 0 and at vertex 3. The curve can keep growing recursively by following the same rotation and reflection pattern at each vertex of the basic curve. Figures 1.8-(b) and (c) show the Hilbert curves of order 2 and 3, respectively [11].

Figure 1.8 Hilbert curves of order: (a) 1; (b) 2; (c) 3.
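For reference, the widely used iterative conversion from grid coordinates to the Hilbert 1D-number can be sketched as follows (an illustrative Python sketch, not taken from the thesis; the orientation convention is adjusted only so that the order-1 labels agree with Figure 1.8-(a)).

```python
def hilbert_index(order, x, y):
    """Map the cell (x, y) of a 2**order x 2**order grid to its 1D position
    along the Hilbert curve, using the standard iterative conversion."""
    x, y = y, x                      # orientation convention: match Figure 1.8-(a)
    side = 2 ** order
    d = 0
    s = side // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/reflect the quadrant so the sub-curve is traversed consistently
        if ry == 0:
            if rx == 1:
                x = side - 1 - x
                y = side - 1 - y
            x, y = y, x
        s //= 2
    return d

# Example: on the 2 x 2 grid (order 1) the cells map to 0, 1, 2, 3 as in Figure 1.8-(a).
```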
For a given distance-preserving mapping that maps the points in 2D-space onto a one-dimensional index, a cluster is defined to be a group of points with consecutive ordering values. If a range query retrieves few clusters, the distance-preserving mapping method requires few disk accesses on the actual file [8]. Figure 1.9 illustrates a range query for which the Hilbert curve (c) is better than the Peano curve (a). The shaded area is the result of the range query. The Peano curve (a) and the RBG curve (b) have four clusters in the shaded area, while the Hilbert curve (c) only has two.

Figure 1.9 Clusters are formed in the space filling curve of order 3: (a) the Peano curve; (b) the RBG curve; (c) the Hilbert curve.
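The effect shown in Figure 1.9 can be quantified with a small helper (illustrative Python, not from the thesis): given the 1D sequence numbers of the grid cells covered by a query window, the number of clusters is the number of maximal runs of consecutive numbers, and each run corresponds roughly to one contiguous disk read.

```python
def count_runs(cell_numbers):
    """Count maximal runs of consecutive 1D sequence numbers; fewer runs mean
    fewer contiguous ranges (and fewer disk accesses) for a range query."""
    nums = sorted(set(cell_numbers))
    runs = 0
    previous = None
    for n in nums:
        if previous is None or n != previous + 1:
            runs += 1          # a gap starts a new run
        previous = n
    return runs

# e.g. count_runs([5, 6, 7, 12, 13]) == 2
```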
1.4 Motivation
Clustering is one of the most important techniques in spatial data mining, and four different approaches have been proposed to achieve the goal. In this section, we first make a comparison of those different clustering approaches for spatial data in the presence of physical constraints [29], and then propose a new efficient clustering algorithm for large spatial databases.

The advantage of the partitioning approach is that it is very easy to understand and implement. But all the algorithms based on the partitioning approach have a similar clustering quality, and the major difficulties with these algorithms include: (1) The number k of clusters to be found needs to be known prior to clustering, requiring at least some domain knowledge which is often not available. (2) It is difficult to identify clusters with large variations in sizes (large genuine clusters tend to be split). (3) This approach is only suitable for convex spherical clusters, and is unable to find arbitrarily shaped clusters [29].
The advantage of the hierarchical clustering approach is that it does not need k as an input parameter, which is an obvious advantage over the partitioning approach. It usually uses a dendrogram to represent the whole process, which makes it easy to understand which clusters are merged at each step. The disadvantage of the hierarchical clustering approach is setting a termination condition, which requires some domain knowledge for parameter setting. The problem of parameter setting makes these algorithms less useful for real-world applications. Also, the hierarchical clustering approach typically has high computational complexity [29]. In addition, some hierarchical algorithms are order-sensitive and cannot handle outliers.

The advantage of the density-based approach is that it can discover clusters with arbitrary shapes and it does not need to preset the number of clusters [29]. The major disadvantage of the density-based approach is the choice of the value of Eps, and the resulting clustering quality is highly dependent on the Eps parameter [27].

The advantage of the grid-based approach is that it can handle large databases, because the run-time complexity of STING is O(K), where K is the number of blocks, and K ≪ N, where N is the number of objects, so this approach provides fast speed. But the disadvantage of the grid-based approach is its low quality: the lower the K we use, the lower the quality will be. Otherwise, the higher the K, the slower the algorithm will run [16].
In addition to considering the above advantages and disadvantages of the different approaches for large data clustering, many applications for large spatial databases raise the following requirements for clustering approaches:

1. Minimization of input parameters: Minimal requirements of domain knowledge to determine the input parameters are needed, because appropriate values are often not known in advance when dealing with large databases [6].
2. Discovery of clusters with arbitrary shapes: Due to the diverse nature of the spatial objects, the clusters may be of arbitrary shapes. For example, there may be a large variation in cluster sizes (as shown in Figure 1.10-(a)), and clusters may be nested within one another (as shown in Figure 1.10-(b)), elongated (as shown in Figure 1.10-(c)), or formed of double circles (as shown in Figure 1.10-(d)). A good clustering approach should be able to identify clusters irrespective of their shapes or relative positions [24].

Figure 1.10 Sample databases: (a) clusters of widely different sizes (Case 1); (b) clusters with convex shapes (Case 2); (c) clusters with elongated shapes (Case 3); (d) clusters with double circles (Case 4).
3. Good efficiency for large databases: Due to the huge amount of spatial data, an important challenge for clustering approaches is to provide good time efficiency [24].

4. Robust with regard to noise: Another important issue is the handling of noise. Noise refers to spatial objects which are not contained in any cluster and should be discarded during the mining process [24].

5. Insensitive to the data input order: The results of a good clustering approach should not be affected by a different ordering of the input data; it should produce the same clusters. In other words, it should be order-insensitive with respect to the input data [24].
Figure 1.11 Clusters discovered by different algorithms: (1) the k-means method; (2) the new algorithm.
But, there is no single algorithm that can fully satisfy all the above requirements. Therefore, in this thesis, we present a new clustering algorithm for large spatial databases. It combines the hierarchical approach with the grid-based approach. We use the grid-based approach because it is efficient for large spatial databases, and we use the hierarchical approach to find the genuine clusters by repeatedly combining blocks. We also make use of the Hilbert curve to provide a way to linearly order the points of a grid. The goal is to preserve distance, so that points which are close in 2-D space and represent similar data are stored close together in the linear order. This kind of mapping can also minimize the disk access effort [11] and provide high speed for clustering. This new algorithm requires only one input parameter and supports the user in determining an appropriate value for it. It is not affected by the ordering of the input data, because we use the Hilbert curve to store our data, so the ordering of the input data is fixed. Our algorithm can also discover clusters of arbitrary shapes, as shown in Figure 1.11. Finally, it is efficient even for large spatial databases.

From our simulation and performance measures, we show that our proposed algorithm has a shorter mean execution time than the k-means algorithm. When the number of data points is increased, the k-means algorithm requires a large computation time for clustering, but the execution time of our algorithm increases very slowly. As mentioned before, some hierarchical algorithms need a termination condition to stop the algorithm; our algorithm does not need to set this condition. Moreover, our algorithm can find clusters with arbitrary shapes which the k-means algorithm cannot discover, as shown in Figure 1.11 (a)-(d), because in the k-means algorithm, the quality of the clustering result is highly affected by the parameter k and it cannot deal with arbitrary shapes. In our algorithm, the execution time and the quality of the clustering results are also affected by the parameter h, but the effect is slight. Therefore, our proposed algorithm has its own advantages and applicable domains, and it can reduce the execution time for clustering while keeping the high quality of the clustering results.
1.5 Organization of Thesis
The rest of the thesis is organized as follows. In Chapter 2, we give a survey of several well-known clustering algorithms. In Chapter 3, we present a new clustering algorithm based on the Hilbert curve for large spatial databases. In Chapter 4, we give a comparison of the performance of the k-means algorithm and our new clustering algorithm. Finally, we give a summary and point out some future research directions.
CHAPTER II
A Survey
Several clustering methods have been proposed. Basically, these clustering methods can be classified into the following four approaches: the partitioning, the hierarchical, the density-based and the grid-based approaches. In this chapter, we will describe some well-known clustering methods, namely k-means [15], CLARANS [19], HAC [26], CURE [10], DBSCAN [6], and STING [27], which are based on the partitioning approach, the partitioning approach, the hierarchical approach, the hierarchical approach, the density-based approach, and the grid-based approach, respectively.
2.1 K-Means
The k-means method is a partitioning approach. It is probably the most widely applied partitioning clustering method [15]. The k-means method can be illustrated as follows: suppose that n objects described by the attribute vectors {x_1, x_2, ..., x_n} are to be partitioned into k clusters, where k < n. Let m_i be the mean of the vectors in cluster i. Initially, select k points arbitrarily to be the centers of the clusters; then find the distance from each point to each center, and assign each point to the closest center. Next, for each set of points assigned to a center, find the middle (mean) of the cluster, take that value as the new center, and repeat the process until the centers do not seem to move.
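As an illustration of this procedure (a minimal sketch, not code from the thesis), the following Python function implements the standard k-means loop on 2-D points; the function name, the tuple representation of points, and the Euclidean distance are assumptions.

```python
import math
import random

def kmeans(points, k, max_iterations=100):
    """Standard k-means on 2-D points: pick k arbitrary centers, assign each
    point to its closest center, recompute centers, repeat until stable."""
    centers = random.sample(points, k)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:      # the centers no longer move
            break
        centers = new_centers
    return centers, clusters
```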
Figure 2.1 An overview of the k-means method: (a) (c) (e) calculate the new centroid; (b) (d) (f) assign the points to the new centroid.

Figure 2.1 shows the process of the standard k-means clustering method. First, Figure 2.1-(a) shows that the objects are represented as points in space, where we randomly select 2 points (represented by X) to be the centroids of two clusters. Then, Figure 2.1-(b) shows the next step, which partitions the points into 2 groups (clusters), indicated by the two large circles. Figure 2.1-(c) shows that the new centroid of each group of points is calculated, and the points are then reassigned to the new centroid to which they are closest, as shown in Figure 2.1-(d). Figure 2.1-(e) shows that when all the new centroids are the same as the previous centroids, the centroids are stable and the algorithm stops. The final clusters are shown in Figure 2.1-(f), where each point is assigned to the cluster to which it is most similar.
2.2 CLARANS
The CLARANS (Clustering Large Applications based on RANdomized Search) method is a partitioning approach. The CLARANS method [19] was proposed as a partitioning clustering method for large databases which is based on randomized search. The CLARANS method is an improvement of the k-medoids method. The idea of the k-medoids method is like that of the k-means method, but the k-medoids method selects k medoids (which are called representative objects) in the data set, not the centroids of the clusters. The corresponding clusters are then found by assigning each remaining object to the nearest representative object. To be exact, the average distance (or average dissimilarity) of the representative object to all the other objects of the same cluster is minimized [15].

The CLARANS method is based on the k-medoids method, in which each cluster is also represented by its medoid, the most centrally located point in the cluster, and the objective is to find the k best medoids that optimize the criterion function. This method reduces this problem to that of graph search by representing each set of k medoids as a node in the graph [10]. In the CLARANS method, given n objects, the process described above of finding k medoids can be viewed abstractly as searching through a certain graph. In this graph, denoted by G_{n,k}, a node is represented by a set of k objects {O_m1, ..., O_mk}, intuitively indicating that O_m1, ..., O_mk are the selected medoids. The set of nodes in the graph is the set { {O_m1, ..., O_mk} | O_m1, ..., O_mk are objects in the data set }. Two nodes are neighbors (i.e., connected by an arc) if their sets differ by only one object. More formally, two nodes S1 = {O_m1, ..., O_mk} and S2 = {O_w1, ..., O_wk} are neighbors if and only if the cardinality of the intersection of S1 and S2 is k − 1, i.e., |S1 ∩ S2| = k − 1. It is easy to see that each node has k(n − k) neighbors. Since a node represents a collection of k medoids, each node corresponds to a clustering. Thus, each node can be assigned a cost that is defined to be the total dissimilarity between every object and the medoid of its cluster [19].
This method has two inputs: maxneighbor and numlocal. Maxneighbor is the maximum number of neighbors of a node that are to be examined. Numlocal is the maximum number of local minima that can be collected. The CLARANS method begins by selecting a random node. It then checks a sample of the neighbors of the node, and if a better neighbor is found based on the "cost differential of the two nodes," it moves to the neighbor and continues processing until the maxneighbor criterion is met. Otherwise, it declares the current node a local minimum and starts a new pass to search for other local minima. After a specified number of local minima (numlocal) are collected, the method returns the best of these local values as the medoids of the clusters [16].
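The randomized search can be sketched as follows (an illustrative Python sketch under simplifying assumptions, not the original CLARANS implementation): a node is a list of k medoids, a neighbor swaps one medoid for a non-medoid object, and the cost of a node is the total distance of every object to its nearest medoid.

```python
import math
import random

def cost(medoids, points):
    """Total dissimilarity: each point is charged its distance to the nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def clarans(points, k, numlocal=2, maxneighbor=50):
    best_medoids, best_cost = None, float("inf")
    for _ in range(numlocal):                     # collect numlocal local minima
        current = random.sample(points, k)        # start from a random node
        current_cost = cost(current, points)
        tried = 0
        while tried < maxneighbor:
            # a neighbor node swaps one medoid for one non-medoid object
            i = random.randrange(k)
            candidate = random.choice([p for p in points if p not in current])
            neighbor = current[:i] + [candidate] + current[i + 1:]
            neighbor_cost = cost(neighbor, points)
            if neighbor_cost < current_cost:      # move and restart the neighbor count
                current, current_cost = neighbor, neighbor_cost
                tried = 0
            else:
                tried += 1
        if current_cost < best_cost:              # keep the best local minimum found
            best_medoids, best_cost = current, current_cost
    return best_medoids, best_cost
```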
2.3 HAC
The HAC method is a hierarchical approach. It is a hierarchical agglomerative clustering method, and it is a bottom-up clustering method. This method starts with each object in a separate cluster. At each step of the method, the most "similar" clusters are joined together. "Similarity" depends on the clustering criterion used. Joining operations continue until all the objects are merged into a single cluster. The most commonly used methods join a single pair of clusters at each step, which results in a binary tree. At each step, the pair of clusters merged are the ones between which the distance is the minimum. The widely used measures for the distance between the points in two clusters are described as follows, where C_i represents a cluster and n_i is the number of points in C_i [10].
Single link: The distance between two clusters is given by the minimum cost edge between points in the two clusters.

    d(C_i, C_j) = min{ d(s, t) | s ∈ C_i, t ∈ C_j }

Complete link: The distance between two clusters is given by the maximum cost edge between points in the two clusters.

    d(C_i, C_j) = max{ d(s, t) | s ∈ C_i, t ∈ C_j }

Average link: The distance between two clusters is the average of all of the edge costs between points in the two clusters [21].

    d(C_i, C_j) = ( Σ_{s ∈ C_i} Σ_{t ∈ C_j} d(s, t) ) / (n_i n_j)
Now, we show an example to describe the HAC method using single linkage. Suppose that the similarity matrix D is given by:
D =
         1    2    3    4    5
    1    0    9    3    6   11
    2    9    0    7    5   10
    3    3    7    0    9    2
    4    6    5    9    0    8
    5   11   10    2    8    0
Treating each object as a cluster, the first join corresponds to merging clusters 3 and 5, because they have the minimum distance. To implement the next level of clustering, we need to calculate the distances between the cluster (35) and the remaining objects 1, 2 and 4. We have

    d((35), 1) = min{d(3, 1), d(5, 1)} = min{3, 11} = 3
    d((35), 2) = min{d(3, 2), d(5, 2)} = min{7, 10} = 7
    d((35), 4) = min{d(3, 4), d(5, 4)} = min{9, 8} = 8
The new similarity matrix is then changed as follows:

D =
          (35)    1    2    4
   (35)     0     3    7    8
     1      3     0    9    6
     2      7     9    0    5
     4      8     6    5    0
The smallest distance between pairs of clusters is now d((35), 1) = 3; hence, the next cluster obtained is (135). Calculating d((135), 2) = 7 and d((135), 4) = 6, we get the new similarity matrix

D =
           (135)    2    4
   (135)      0     7    6
      2       7     0    5
      4       6     5    0
The smallest distance between pairs of clusters is now d(2, 4), and therefore the resulting dendrogram looks like Figure 2.2-(a). If we use a different linkage measure to calculate the similarity matrix, we will get different results, as in Figure 2.2-(b) and Figure 2.2-(c).

Figure 2.2 The different linkage measures: (a) single linkage; (b) complete linkage; (c) average linkage.
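The worked example can be reproduced with a short single-linkage sketch (illustrative Python, not from the thesis); the label strings and the distance dictionary simply re-encode the matrix D given above.

```python
from itertools import combinations

# Pairwise distances of the matrix D above, keyed by the unordered pair of labels.
D = {frozenset(p): d for p, d in [
    (("1", "2"), 9), (("1", "3"), 3), (("1", "4"), 6), (("1", "5"), 11),
    (("2", "3"), 7), (("2", "4"), 5), (("2", "5"), 10),
    (("3", "4"), 9), (("3", "5"), 2), (("4", "5"), 8),
]}

def single_link(a, b):
    """Single linkage: minimum distance over all pairs drawn from clusters a and b."""
    return min(D[frozenset((s, t))] for s in a for t in b)

clusters = [frozenset([x]) for x in "12345"]
while len(clusters) > 1:
    a, b = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
    print("merge", sorted(a), "and", sorted(b), "at distance", single_link(a, b))
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]
# Expected merges: (3,5) at 2, then (1,3,5) at 3, then (2,4) at 5, then all at 6.
```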
2.4 CURE
CURE (Clustering Using Representatives) [10] is a bottom-up hierarchical clustering algorithm, but instead of using a single centroid to represent a cluster, a constant
number of representative points are chosen to represent a cluster. In fact, CURE
begins by choosing a constant number c of well-scattered points from a cluster. The
scattered points capture the shape and extent of the cluster. The next step of the CURE algorithm shrinks the scattered points toward the centroid of the cluster using some pre-determined fraction α. The chosen scattered points after shrinking are used as representatives of the cluster. Varying the fraction α between 0 and 1 helps CURE to identify different types of clusters. The similarity between two clusters is measured by the similarity of the closest pair of the representative points belonging to different clusters. These are the clusters that are chosen to be merged as part of the hierarchical algorithm. Merging continues until the desired number of clusters, k, an input parameter, is reached [14, 16].
CURE addresses the clustering problem for large data sets in two ways. First, CURE begins by drawing a random sample from the database; it uses pre-clustering of all the data points in order to handle larger data sets. Random sampling has two positive effects. First, the sample can be designed to fit in main memory, which eliminates significant I/O costs, and second, random sampling helps to filter outliers. The random samples must be selected such that the probability of missing clusters is low. The authors analytically derive sample sizes for which the risk is low, and show empirically that random samples will still preserve accurate information about the geometry of the clusters [16]. Second, in order to further speed up clustering,
CURE first partitions the random sample and partially clusters the data points in each partition. After eliminating outliers, the pre-clustered data in each partition is then clustered in a final pass to generate the final clusters.

Once clustering of the random sample is completed, instead of a single centroid, multiple representative points from each cluster are used to label the remainder of the data set. The use of multiple points enables the algorithm to identify arbitrarily shaped clusters. The steps involved in clustering using CURE are described in Figure 2.3. The worst-case time complexity of CURE is O(n^2 log n), where n is the number of sampled points and not N, the size of the entire database. The computational complexity of CURE is quadratic with respect to the sample size, and is not related to the size of the data set.

Figure 2.3 The overview of CURE
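A rough Python sketch of the representative-point idea follows (illustrative only; the farthest-point selection heuristic, the parameter names c and alpha, and the 2-D tuples are assumptions that mirror the description above).

```python
import math

def representatives(cluster, c=4, alpha=0.3):
    """Pick c well-scattered points from the cluster (farthest-point heuristic)
    and shrink them toward the centroid by the fraction alpha."""
    n = len(cluster)
    centroid = (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)
    scattered = [max(cluster, key=lambda p: math.dist(p, centroid))]
    while len(scattered) < min(c, n):
        # next scattered point: the one farthest from the already chosen ones
        scattered.append(max(cluster,
                             key=lambda p: min(math.dist(p, s) for s in scattered)))
    return [(x + alpha * (centroid[0] - x), y + alpha * (centroid[1] - y))
            for x, y in scattered]

def cluster_distance(reps_a, reps_b):
    """CURE merges the pair of clusters whose closest representatives are nearest."""
    return min(math.dist(p, q) for p in reps_a for q in reps_b)
```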
2.5 DBSCAN
The DBSCAN method is a density-based approach. It relies on a density-based notion of clusters. The DBSCAN method requires the users to specify two parameters that are used to define the minimum density for clustering: the radius Eps (ε) of the neighborhood of a point and the minimum number of points MinPts in the neighborhood. Clusters are then found by starting from an arbitrary point and, if its neighborhood satisfies the minimum density, including the points in its neighborhood into the cluster. The process is then repeated for the newly added points [10]. The formal definitions of this notion of clustering are briefly introduced as follows.
Definition 1: (directly density-reachable) A point p is directly density-reachable from a point q wrt. Eps and MinPts in a set of objects D if

1) p ∈ N_ε(q) (N_ε(q) is the subset of D contained in the ε-neighborhood of q), and
2) Card(N_ε(q)) ≥ MinPts (Card(N) denotes the cardinality of the set N).

The condition Card(N_ε(q)) ≥ MinPts is called the "core object condition". If this condition holds for an object p, then we call p a "core point". As in Figure 2.4, q is a core point. Only from core objects can other objects be directly density-reachable.

Figure 2.4 Directly density-reachable: p is directly density-reachable from q, but q is not directly density-reachable from p.
Definition 2: (density-reachable) A point p is density-reachable from a point q wrt. ε and MinPts, if there is a chain of points p_1, ..., p_n, p_1 = q, p_n = p such that p_{i+1} is directly density-reachable from p_i.

Density-reachability is the transitive hull of direct density-reachability. This relation is not symmetric in general. Only core objects can be mutually density-reachable.

Definition 3: (density-connected) A point p is density-connected to a point q wrt. ε and MinPts, if there is a point o such that both p and q are density-reachable from o wrt. ε and MinPts.

Density-connectivity is a symmetric relation. Figure 2.5 illustrates the definitions on a sample database of 2-dimensional points from a vector space. Note that the above definitions only require a distance measure and will also apply to data from a metric space. A density-based cluster is now defined as a set of density-connected objects which is maximal wrt. density-reachability, and the noise is the set of objects not contained in any cluster.
Figure 2.5 Example for: (a) density-reachability (p is density-reachable from q, but q is not density-reachable from p); (b) density-connectivity (p and q are density-connected to each other by o).
Definition 4: (cluster and noise) Let D be a set of objects. A cluster C wrt. ε and MinPts in D is a non-empty subset of D satisfying the following conditions:

1) Maximality: ∀ p, q ∈ D: if p ∈ C and q is density-reachable from p wrt. ε and MinPts, then also q ∈ C.

2) Connectivity: ∀ p, q ∈ C: p is density-connected to q wrt. ε and MinPts in D.

Every object not contained in any cluster is noise.
Note that a cluster contains not only core objects but also objects that do not
satisfy the core object condition. These objects - called \border objects" of the cluster
- are, however, directly density-reachable from at least one core object of the cluster
(in contrast to noise objects).
The DBSCAN method, which discovers the clusters and the noise in a database according to the above definitions, is based on the fact that a cluster is equivalent to the set of all objects in D which are density-reachable from an arbitrary core object in the cluster. The retrieval of density-reachable objects is performed by iteratively collecting directly density-reachable objects. The DBSCAN method checks the ε-neighborhood of each point in the database. If the ε-neighborhood N_ε(p) of a point p has more than MinPts points, a new cluster C containing the objects in N_ε(p) is created. Then, the ε-neighborhood of all points q in C which have not yet been processed is checked. If N_ε(q) contains more than MinPts points, the neighbors of q which are not already contained in C are added to the cluster and their ε-neighborhoods are checked in the next step. This procedure is repeated until no new point can be added to the current cluster C [2].
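The expansion procedure just described can be sketched as follows (an illustrative Python sketch, not the original DBSCAN code); points are assumed to be 2-D tuples and distances Euclidean.

```python
import math

def region_query(points, p, eps):
    """Return the eps-neighborhood N_eps(p), including p itself."""
    return [q for q in points if math.dist(p, q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or None for noise, by expanding
    clusters from core objects as described above."""
    labels = {p: "unvisited" for p in points}
    cluster_id = 0
    for p in points:
        if labels[p] != "unvisited":
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = None                 # noise (may later become a border point)
            continue
        cluster_id += 1
        labels[p] = cluster_id
        seeds = [q for q in neighbors if q != p]
        while seeds:
            q = seeds.pop()
            if labels[q] in ("unvisited", None):
                if labels[q] == "unvisited":
                    q_neighbors = region_query(points, q, eps)
                    if len(q_neighbors) >= min_pts:   # q is itself a core point
                        seeds.extend(q_neighbors)
                labels[q] = cluster_id
    return labels
```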
2.6 STING
The STING method is a grid-based approach. The STatistical INformation Grid-based method (STING) [27] divides the spatial area into rectangular cells using a hierarchical structure, as shown in Figure 2.6. A hierarchical structure is used to manipulate the grid. Each cell at level i is partitioned into a fixed number k of cells at the next level. This is similar to spatial index structures. They store the statistical parameters (such as mean, variance, minimum, maximum, and type of distribution) of each numerical feature of the objects within cells [16, 24].

Clustering operations are performed using a top-down method, starting with the root. The relevant cells are determined using the statistical information, and only the paths from those cells down the tree are followed. Once the leaf cells are reached, the clusters are formed using a breadth-first search, by merging cells based on their proximity and whether the average density of the area is greater than some specified threshold. The time complexity for STING is O(K), where K is the number of cells at the bottom layer; [27] assumed that K ≪ N. However, the smaller the K, the more approximate are the clusters; the lower the granularity (the higher the K), the slower the algorithm will run. The STING method in its approximation mode (high granularity) is very fast. Tests showed that its execution rate was almost independent of the number of data points for both generation and query operations. However, because of the approximation characteristics, the quality of the clusters is not as good as other algorithms. Although it can handle large amounts of data, and is not sensitive to noise, it cannot handle higher-dimensional data without a serious degradation of performance [16].
Figure 2.6 The concept of the grid-based method: the 1st (top) level could have only one cell, and a cell at the (i−1)-th level corresponds to 4 cells at the i-th level.
CHAPTER III
The Clustering Algorithm Based on the Hilbert Curve
Clustering, as applied to large data sets, is the process of creating groups of objects organized by some similarity among the members. In spatial data sets, clustering permits a generalization of the spatial component that allows for successful data mining. The traditional clustering algorithms cannot handle spatial databases well. Spatial databases, in particular, have unique requirements that call for new, specialized clustering algorithms [16]. Another issue is that spatial databases are growing exponentially, so we also need an efficient clustering algorithm to handle large databases.
3.1 The Basic Idea
In many spatial databases, in order to preserve the spatial locality in the linear space, a space-filling curve provides a continuous path that visits every point in a high-dimensional grid exactly once and never crosses itself. Three well-known space-filling curves are the Peano curve, the reflected binary gray-code (RBG) curve, and the Hilbert curve. [11] showed that under most circumstances, the Hilbert mapping performed as well as or better than the other mapping methods in minimizing the number of disk blocks accessed, so it is widely believed that the Hilbert space-filling curve achieves the best clustering [18]. Based on this property, we use this structure to store spatial data ordered by the Hilbert curve. It can help us to speed up clustering and minimize the disk access effort. Another advantage of using the Hilbert curve is that it can preserve object "locality". On the other hand, the traditional algorithms use graph metrics.
Table 3.1 The basic unit in each round

  Round       Unit (blocks)
  1st Round   U1 = 2^0 × 2^0 = 1
  2nd Round   U2 = 2^1 × 2^1 = 4
  3rd Round   U3 = 2^2 × 2^2 = 16
  4th Round   U4 = 2^3 × 2^3 = 64

The graph metrics need to determine the inter-cluster distance according to the cost function of the edges between the points in the two clusters. Therefore, using the Hilbert curve preserves this "locality", and we do not need to calculate these distances.
Figure 3.1 shows that we divide the spatial data into rectangular blocks (e.g., using latitude and longitude). We transform the focus from n objects to m blocks, where n ≫ m. The number of blocks is 2^h × 2^h = m, where h is the order of the Hilbert curve. The size of the blocks depends on the density of the objects. The path of a space-filling curve imposes a linear ordering, which may be calculated by starting at one end of the curve and following the path to the other end. Orenstein [22, 23] used the term h-ordering to refer to the ordering of the Hilbert curve. Take Figure 3.2 as an example, in which the blocks are numbered along the Hilbert curve. The number in a block represents its h-ordering, and a gray block represents a block which has data.
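To make the linear ordering concrete, the following Java sketch converts the (x, y) coordinates of a block on a 2^h × 2^h grid into its h-ordering number; it is the standard iterative Hilbert-index conversion, not the thesis's own code, and its orientation may differ from Figure 3.2 by a rotation or reflection.

// Convert block coordinates (x, y) on a 2^h x 2^h grid to its position d
// along the Hilbert curve (the h-ordering).
final class HilbertOrder {
    static long xyToIndex(int h, long x, long y) {
        long n = 1L << h;                      // grid side length
        long d = 0;
        for (long s = n / 2; s > 0; s /= 2) {
            long rx = (x & s) > 0 ? 1 : 0;
            long ry = (y & s) > 0 ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            // rotate/flip the quadrant so the sub-curve keeps its orientation
            if (ry == 0) {
                if (rx == 1) {
                    x = n - 1 - x;
                    y = n - 1 - y;
                }
                long t = x; x = y; y = t;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // print the h-ordering of a 4 x 4 grid (order h = 2), row by row
        int h = 2;
        for (long row = (1L << h) - 1; row >= 0; row--) {
            StringBuilder line = new StringBuilder();
            for (long col = 0; col < (1L << h); col++)
                line.append(String.format("%3d", xyToIndex(h, col, row)));
            System.out.println(line);
        }
    }
}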
In our algorithm, the order of the Hilbert curve determines how many rounds we will execute. If the order of the Hilbert curve is h, the number of execution rounds is k = h + 1. Another issue we are concerned with is the capacity of the basic unit. In every round, we use a different basic unit Uk to run our algorithm. When we execute the k-th round, the basic unit Uk contains 2^(k−1) × 2^(k−1) blocks. Take Figure 3.2 as an example. The order of the Hilbert curve is 3, so the total number of execution rounds is 4. Moreover, the basic unit Uk used in every round is shown in Table 3.1.
Figure 3.1 Partition of the spatial data into rectangular blocks (blocks are shaded according to whether data is located in them)
3.2 The Algorithm
Now, we use an example to illustrate our algorithm.
3.2.1 The First Round
The initial state of the first round is shown in Figure 3.3. In this round, the basic unit U1 contains one block (2^0 × 2^0). According to the property of the Hilbert curve, it is easy to show that when two numbers are close together in the 2-D space, they are always close together in the one-dimensional space [11]. Based on this property, we can be sure that if two block numbers are continuous, they must be in the same cluster. The FirstRound(BI, m) procedure is shown in Figure 3.4, where BI[i] represents the status of block i and m represents the number of blocks. If BI[i] = 1, it indicates
Figure 3.2 Using the Hilbert curve to connect these blocks (the 8 × 8 grid of blocks is numbered 0 to 63 in Hilbert order; gray blocks contain data)
Figure 3.3 The initial state of the first round
Procedure FirstRound(BI, m);
/* The first round to merge data. */
/* BI[i] = 1 indicates that block i has an object. */
/* BC[i] = j indicates that block i belongs to cluster j. */
/* Flag is used to control the meaningful increment of the cluster number. */
begin
  j := 1;
  Flag := 1;
  for i := 0 to m − 1 do
  begin
    if (BI[i] ≠ 0) then /* Case 1 */
    /* Set the cluster number of the current block */
    begin
      BC[i] := j;
      Flag := 0;
    end
    else /* Case 2 */
    begin
      BC[i] := 0;
      if (Flag = 0) then
        /* increase the cluster number */
        j := j + 1;
      Flag := 1;
    end;
  end;
end;

Figure 3.4 Procedure FirstRound(BI, m)
that block i has objects. On the other hand, if BI[i] = 0, it indicates that block i does not have objects. Then, we check each block i from 0 to m − 1. If a block has objects, Case 1 of procedure FirstRound is executed. Otherwise, Case 2 of procedure FirstRound is executed. In Case 1, we check this block i. If block i and the previous block l (l < i) which has objects are continuous (i.e., l = i − 1), we set them to the same cluster number. That is, BC[i] = BC[i − 1] = j, where j is the cluster number. Otherwise, we set BC[i] = j + 1, i.e., the next cluster number. In Case 2, we set BC[i] = 0, which indicates that the block does not belong to any cluster.
The result of the first round is shown in Figure 3.5, where the number in parentheses represents the cluster number. Different cluster numbers indicate that the blocks are not in the same cluster. Take Figure 3.5 as an example. Block 0 has no object, so Case 2 of procedure FirstRound(BI, m) is executed and we set BC[0] = 0. Blocks 2 and 4 have objects, but their Hilbert curve numbers are not continuous. Therefore, Case 1 of procedure FirstRound(BI, m) is executed for both of them, and they are assigned different cluster numbers. That is, we have BC[2] = 1 and
Figure 3.5 The result of the first round (the number in parentheses is the cluster number of the block)
BC[4] = 2. On the other hand, Blocks 4 and 5 have objects and their Hilbert curve numbers are continuous. Therefore, Case 1 of procedure FirstRound(BI, m) is executed and the same cluster number is assigned to them. That is, we have BC[4] = BC[5] = 2. Similarly, the other blocks follow this procedure to set their cluster numbers.
3.2.2 The Second Round
The basic unit U2 in the second round contains 4 (2^1 × 2^1) blocks, and the initial state of the second round is shown in Figure 3.6. In the second round, we have 16 different cases in total. We classify them into five groups, as shown in Figure 3.7 (a)-(e), respectively. In Figure 3.7, we find that Case (a) and Case (b) can be ignored, since there is nothing to be merged. Next, Cases (c), (d), and (e) have already been processed in the first round, except Cases (c)-(6), (d)-(3), and (d)-(4), which are marked by a dotted line. We observe that in those special cases, the first block and the last
Figure 3.6 The initial state of the second round
block are mergable although their block numbers are not continuous. Therefore, we conclude that we only have to check whether the first block and the last block are mergable for the basic unit U2 in the second round. If these two blocks do not have the same cluster number, we will merge them.
Procedure SecondRound(BI, m) is shown in Figure 3.8. We check each basic unit U2i successively. In any U2i, if blocks 4i and (4i + 3) have objects and their cluster numbers are not the same, we merge them. In the mergable case, we let BC[4i + 3] = BC[4i]. In addition, we assign, to each block whose cluster number in the first round equals the cluster number of block (4i + 3), the same cluster number as block 4i. That is why a for loop is used for the mergable case. Otherwise, we do nothing.
The result of the second round is shown in Figure 3.9. Take Figure 3.9 as an example. For unit U21 (0,1,2,3), it is Case (b)-(3) in Figure 3.7. Therefore, nothing
Figure 3.7 The different cases in the second round: (a) C_0^4; (b) C_1^4; (c) C_2^4; (d) C_3^4; (e) C_4^4
Procedure SecondRound(BI, m);
/* The second round to merge data. */
/* m is the number of blocks. */
begin
  k := m/4;
  for i := 0 to (k − 1) do
  begin
    if (BI[4i + 3] ≠ 0) and (BI[4i] ≠ 0) and (BC[4i] ≠ BC[4i + 3]) then
    begin
      for j := 1 to 4 do
        if (BI[4i + j] ≠ 0) then
          BC[4i + j] := BC[4i];
    end;
  end;
end;

Figure 3.8 Procedure SecondRound(BI, m)
Figure 3.9 The result of the second round
is done in procedure SecondRound(BI, m). For unit U22 (4,5,6,7), it is Case (d)-(4) in Figure 3.7. Therefore, they must be merged and we reassign BC[7] = BC[4]; that is, BC[7] = 2. In procedure SecondRound, we also reassign BC[8] = BC[4]; that is, BC[8] = 2, because in the first round, Blocks 7 and 8 are in the same cluster. Therefore, if the cluster number of Block 7 is reassigned, the cluster number of Block 8 should also be reassigned. Similarly, we reassign BC[22] = BC[23] = BC[20] = 7 (Case (d)-(3)), BC[31] = BC[32] = BC[28] = 10 (Case (c)-(6)), BC[47] = BC[48] = BC[49] = BC[50] = BC[44] = 12 (Case (c)-(6)), and BC[59] = BC[60] = BC[61] = BC[56] = 15 (Case (c)-(6)). Finally, we merge five cases in this round, and the number of clusters is reduced from 16 to 11.
3.2.3 The Third Round
In the third round, the basic unit U3 contains 16 (2^2 × 2^2) blocks, and the initial state of the third round is shown in Figure 3.10. Basically, we divide the basic unit U31 (blocks 0 to 15) into four parts (P0, P1, P2, and P3) in the third round, as shown in Figure 3.11. We find that we only have to consider the crossing parts shown in Figure 3.11, because the blocks inside each of the parts P0, P1, P2, and P3 have already been checked in the previous two rounds. So, in this round, we only consider the relationships among these four parts, and our algorithm checks these relationships from the rear part to the front part, as shown in Figure 3.12. These relationships are described as follows:
N1,0 : P1 → P0
N2,1 : P2 → P1
N3,2 : P3 → P2
N3,0 : P3 → P0
We can separate these relationships Na,b, where a > b, into two cases. Case 1 includes N2,1 and N3,0, and Case 2 includes N1,0 and N3,2. In Case 1 of the basic unit U31, as shown in Figure 3.13, we observe that Blocks 8 and 7 are contained in relationship N2,1, and so are Blocks 9 and 6. Similarly, Blocks 13 and 2 are contained in relationship N3,0, and so are Blocks 14 and 1. We find that these pairs of blocks have a special property: their sum is 15. The other basic units U32, U33, and U34 also have a similar property, as shown in Table 3.2. In Case 2 of the basic unit U31, we observe that Blocks 4 and 3 are contained in relationship N1,0, and so are Blocks 7 and 2. Similarly, Blocks 12 and 11 are contained in relationship N3,2, and so are Blocks 13 and 8. These pairs of blocks, which have the same differences from the outer blocks to the inner blocks, are shown in Figure 3.13. The other basic units U32, U33, and U34 also have a similar property, as shown in Table 3.3.
Figure 3.10 The initial state of the third round
Figure 3.11 Basic unit U3 in the third round (divided into the four parts P0, P1, P2, and P3)
Figure 3.12 The basic idea of finding the neighboring block (the relationships N1,0, N2,1, N3,2, and N3,0 among the four parts)
Figure 3.13 The cases in the third round (Case 1: sum = 2^2 × 2^2 − 1 = 15; Case 2: difference = 1 or 5 from the outer block to the inner block)
Table 3.2 Case 1 in the third round

  Unit       Sum   Blocks (the rear block → the front block)
  Unit U31   15    (8 → 7), (9 → 6), (13 → 2), (14 → 1)
  Unit U32   47    (25 → 22), (24 → 23), (29 → 18), (30 → 17)
  Unit U33   79    (41 → 38), (40 → 39), (45 → 34), (46 → 33)
  Unit U34   111   (57 → 54), (56 → 55), (61 → 50), (62 → 49)
Table 3.3 Case 2 in the third round

  Unit       Difference   Blocks (the rear block → the front block)
  Unit U31   1 or 5       (4 → 3), (12 → 11); (7 → 2), (13 → 8)
  Unit U32   1 or 5       (20 → 19), (28 → 27); (23 → 18), (29 → 24)
  Unit U33   1 or 5       (36 → 35), (44 → 43); (39 → 34), (45 → 40)
  Unit U34   1 or 5       (52 → 51), (60 → 59); (55 → 50), (61 → 56)
Figure 3.14 The different units in the third round: (a) Unit U31; (b) Unit U32; (c) Unit U33; (d) Unit U34
We observe another interesting property. The directions (horizontal / vertical) of Case 1 and Case 2 may be exchanged for each basic unit U3i (1 ≤ i ≤ 4), as shown in Figure 3.14. Unit U31 and Unit U34 have the same directions for Cases 1 and 2. But for Unit U32 and Unit U33, the positions of Cases 1 and 2 are different from those of Unit U31 and Unit U34; their Case 1 and Case 2 are exchanged. The reason is that, according to the property of the Hilbert curve, the direction of both lower quadrants is rotated by 90° in Units U31 and U34.
The result of the third round is shown in Figure 3.15. Take Figure 3.15 as an example. For unit U31, the difference of Blocks 7 and 2 is 5, so the pair belongs to Case 2 in U31. Therefore, they must be merged, and we reassign BC[7] = BC[2]; that is, BC[7] = 1. We also reassign BC[4] = BC[5] = BC[8] = 1, because in the second round they are in the same cluster. Therefore, if Block 7 is reassigned, Blocks 4, 5, and 8 should be reassigned. Blocks 13 and 8 also belong to Case 2 in U31. Therefore, they must be merged, and we reassign BC[13] = BC[8]; that is, we have BC[13] = 1.
For unit U32, the difference of Blocks 23 and 18 is 5. It is Case 2 in U32. Therefore, they must be merged, and we reassign BC[23] = BC[18]; that is, we have BC[23] = 6. We also reassign BC[20] = BC[22] = 6, because in the second round they are in the same cluster. Therefore, if Block 23 is reassigned, Blocks 20 and 22 should be reassigned. The sum of Blocks 25 and 22 is 47. It is Case 1 in U32. Therefore, they must be merged, and we reassign BC[25] = BC[22]; that is, we have BC[25] = 6.
For unit U34, the sum of Blocks 61 and 50 is 111. It is Case 1 in U34, as shown in Table 3.2. Therefore, they must be merged, and we reassign BC[61] = BC[50]; that is, BC[61] = 12. We also reassign BC[56] = BC[59] = BC[60] = 12, because in the second round they are in the same cluster. Therefore, if Block 61 is reassigned, Blocks 56, 59, and 60 should be reassigned. Therefore, we merge these five cases in the third round, and the number of clusters is reduced from 11 to 6.
3.2.4 The Fourth Round
In the fourth round, the basic unit U4 contains 64 (2^3 × 2^3) blocks, and the initial state of the fourth round is shown in Figure 3.16. Basically, we also divide the basic unit U4 (blocks 0 to 63) into four parts (P0, P1, P2, and P3) in the fourth round, as shown in Figure 3.17. We find that we only have to consider the crossing parts shown in Figure 3.17, because the blocks inside each of the parts P0, P1, P2, and P3 have already been checked in the previous three rounds. So, in this round, we only consider the relationships (N1,0, N2,1, N3,2, and N3,0) among these four parts.
Similar to the third round, we also separate these relationships Na,b, where a > b, into two cases. Case 1 includes N2,1 and N3,0, and Case 2 includes N1,0 and N3,2. In Case 1 of the basic unit U41, as shown in Figure 3.17, we observe that Blocks 37 and 26 are contained in relationship N2,1, and so are Blocks 36 and 27, 35 and 28, and 32 and 31. Similarly, Blocks 58 and 5 are contained in relationship N3,0, and so are Blocks 57 and
Figure 3.15 The result of the third round
6, 54 and 9, and 53 and 10. We find that these pairs of blocks have a special property: their sum is 63, as shown in Table 3.4. In Case 2 of the basic unit U41, we observe that Blocks 16 and 15 are contained in relationship N1,0, and so are Blocks 17 and 12, 30 and 11, and 31 and 10. Similarly, Blocks 48 and 47 are contained in relationship N3,2, and so are Blocks 51 and 46, 52 and 33, and 53 and 32. These pairs of blocks have the same differences from the outer block to the inner block, as shown in Table 3.5.
The result of the fourth round is shown in Figure 3.18. Take Figure 3.18 as an example. For unit U4, the difference of Blocks 31 and 10 is 21, so the pair belongs to Case 2 in U4. Therefore, they must be merged, and we reassign BC[31] = BC[10]; that is, we have BC[31] = 4. We also reassign BC[28] = BC[32] = 4, because in the third round they are in the same cluster. Therefore, if Block 31 is reassigned, Blocks 28
Figure 3.16 The initial state of the fourth round
Figure 3.17 The cases in the fourth round (Case 1: sum = 2^3 × 2^3 − 1 = 63; Case 2: difference = 1, 5, 19, or 21 from the outer block to the inner block)
Table 3.4 Case 1 in the fourth round

  Unit     Sum   Blocks (the rear block → the front block)
  Unit 1   63    (37 → 26), (36 → 27), (35 → 28), (32 → 31),
                 (53 → 10), (54 → 9), (57 → 6), (58 → 5)
Table 3.5 Case 2 in the fourth round

  Unit     Difference   Blocks (the rear block → the front block)
  Unit 1   1            (16 → 15), (48 → 47)
           5            (17 → 12), (51 → 46)
           19           (30 → 11), (52 → 33)
           21           (31 → 10), (53 → 32)
and 32 should be reassigned. Blocks 53 and 32 also belong to Case 2 in U4. Therefore, they must be merged, and we reassign BC[53] = BC[32]; that is, we have BC[53] = 4. Finally, we merge two cases in the fourth round, so the number of clusters is reduced from 6 to 4.
After executing these four rounds, we finally obtain 4 clusters. Figure 3.19 displays the process of merging these clusters. From the above example, we observe some interesting facts and find some special rules in the third and the fourth rounds. Table 3.6 shows the parameters used in our rules. First, we divide the basic unit Uk in the k-th round into four parts and consider the relationships of these four parts. There are two cases to be considered in each basic unit. Case 1 (the relationships N2,1 and N3,0) concerns the sum of two blocks, and Case 2 (the relationships N1,0 and N3,2) concerns the difference of two blocks. When we consider the basic units U5 and U6 in some other examples, we also find that Case 1 and Case 2 of these basic units have similar properties. In Case 1, the sum of the two blocks is the same in every basic unit Uki in the k-th round (i identifies the basic unit in the k-th round, 1 ≤ i ≤ m/2^(2(k−1))), as shown in Table 3.7, where m is the
Figure 3.18 The result of the fourth round
Figure 3.19 The process for merging clusters
number of blocks. For example, in the 4th round, the sum for basic unit U41 is 63 (= 63 + (1 − 1) × 128). In Case 2, the differences from the outer block to the inner block are the same in every basic unit Uk in the k-th round, as shown in Table 3.8. For example, in the 4th round, the differences for basic unit U41 are {1, 5, 19, 21} from the outer block to the inner block.
In general, the rule for Case 1 is that in a basic unit Uki (1 ≤ i ≤ m/2^(2(k−1))), the two blocks whose sum equals (2^(2(k−1)) − 1) + (i − 1) × 2 × 2^(2(k−1)) are mergable. For the rule in Case 2, the set of the differences (denoted as DSk) from the outer block to the inner block in the basic unit Uki is the same, and the two blocks whose difference equals some element in DSk are mergable. For example, in the third round, the differences from the outer block to the inner block are {1, 5}. We store this information in DS3 in this sequence, so DS3 is {1, 5}. DSk can be formulated as follows:
DS2 = {1}
DS3 = {1, 5}
DS4 = {1, 5, 19, 21}
DSk = DS(k−1) ∪ { a | a = 5 × 4^(k−3) − w or a = 5 × 4^(k−3) + w, w ∈ DS(k−2) }        (1)
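As a sanity check of these rules, the following Java sketch generates DSk from Equation (1) and evaluates the Case 1 sum of Table 3.7; the method names and the use of sorted lists are our own choices rather than part of the thesis implementation.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of Equation (1) and of the Case 1 sum rule of Table 3.7.
final class MergeRules {
    static List<Long> ds(int k) {
        if (k == 2) return new ArrayList<>(List.of(1L));        // DS_2
        if (k == 3) return new ArrayList<>(List.of(1L, 5L));    // DS_3
        List<Long> result = new ArrayList<>(ds(k - 1));
        long base = 5L * (1L << (2 * (k - 3)));                 // 5 * 4^(k-3)
        for (long w : ds(k - 2)) {
            result.add(base - w);
            result.add(base + w);
        }
        Collections.sort(result);
        return result;
    }

    // Two blocks of U_k^i are mergable in Case 1 if their numbers add up to this sum.
    static long case1Sum(int k, int i) {
        long unit = 1L << (2 * (k - 1));                        // 2^(2(k-1))
        return (unit - 1) + (long) (i - 1) * 2 * unit;
    }

    public static void main(String[] args) {
        for (int k = 3; k <= 6; k++)
            System.out.println("DS_" + k + " = " + ds(k));      // matches Table 3.8
        System.out.println(case1Sum(4, 1));                     // 63, as in Table 3.7
    }
}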
Table 3.6 Definitions of parameters

  Parameter   Description
  n           The number of objects
  m           The number of blocks
  h           The order of the Hilbert curve
  k           The number of the execution rounds
  SBka        The block at which we want to stop in the k-th round (1 ≤ a ≤ 4)
  USBk(j)     The blocks SBka recorded in the j-th element of array USBk,
              at which we want to stop in the k-th round
  NBka        The block which is near the block SBka in the k-th round (1 ≤ a ≤ 4)
  UNBk(j)     The neighboring blocks NBka recorded in the j-th element of array UNBk,
              corresponding to the neighbor of block USBk(j) in the k-th round
  DSk         The difference set which is used in Case 2 of the k-th round
  Sk          The array which is used to build USBk(j) in the k-th round
  Na,b        The relationship in every round (0 ≤ b < a ≤ 3)
Table 3.7 The rules for Case 1

  Basic Unit                     Sum
  U3i                            15 + (i − 1) × 32
  U4i                            63 + (i − 1) × 128
  U5i                            255 + (i − 1) × 512
  U6i                            1023 + (i − 1) × 2048
  Uki (1 ≤ i ≤ m/2^(2(k−1)))     (2^(2(k−1)) − 1) + (i − 1) × 2 × 2^(2(k−1))
Table 3.8 The rules for Case 2

  Basic Unit   Difference (from the outer block to the inner block)
  U3           1, 5
  U4           1, 5, 19, 21
  U5           1, 5, 19, 21, 75, 79, 81, 85
  U6           1, 5, 19, 21, 75, 79, 81, 85, 299, 301, 315, 319, 321, 325, 339, 341
DSk represents the difference set which contains the differences used in the k-th round, and we use Equation (1) to generate DSk. Note that, in our implementation, we use an array to store such a set. Next, to speed up the processing time, we decide at which blocks we want to stop. In this way, we do not have to check every block in each round. Table 3.9 shows the number of stopping blocks in each round. In each round, we quarter these stopping blocks into four parts (SBk1, SBk2, SBk3, and SBk4), and each part corresponds to a different relationship (N1,0, N2,1, N3,2, and N3,0). Each part uses its relationship to find its corresponding neighbor. For example, SBk1 uses the relationship N1,0 to find its corresponding neighbor NBk1 in the k-th round, as shown in Table 3.10. These stopping blocks (SBk1, SBk2, SBk3, and SBk4) are recorded in the j-th element of array USBk in the k-th round, and their corresponding neighbors (NBk1, NBk2, NBk3, and NBk4) are recorded in the j-th element of array UNBk, as shown in Table 3.11. According to Table 3.10 and Table 3.11, we know that the union of SBk1, SBk2, SBk3, and SBk4 is USBk, and the union of NBk1, NBk2, NBk3, and NBk4 is UNBk.
Take an example as shown in Table 3.10. In the third round, we will stop at 8 blocks in the basic unit U31, i.e., Blocks 4, 7, 8, 9, 12, 13, 13, and 14. These stopping blocks are recorded in USB3(j), 0 ≤ j ≤ 7. We separate them into four parts as shown in Table 3.10, and each part uses a different relationship Na,b to find the corresponding neighbors NBka (1 ≤ a ≤ 4), as shown in Table 3.10. How to use these relationships to find the corresponding neighbors will be described later. For example, when we stop at Block 7, it belongs to SB31. We then use the relationship N1,0 to find its corresponding neighbor NB31. In this case,
Table 3.9 The number of stopping points

  Round                The number of stopping points
  1st Round            m
  2nd Round            (m / (2^1 × 2^1)) × 1
  3rd Round            (m / (2^2 × 2^2)) × 8
  4th Round            (m / (2^3 × 2^3)) × 16
  k-th Round (k ≥ 3)   (m / (2^(k−1) × 2^(k−1))) × 2^k
Table 3.10 The relationships in the 3rd round

  Relationship Na,b   The stopping block (the block which should be checked), SBka (NBka), 1 ≤ a ≤ 4
  N1,0                SB31 (NB31): 4 (3), 7 (2)
  N2,1                SB32 (NB32): 8 (7), 9 (6)
  N3,2                SB33 (NB33): 12 (11), 13 (8)
  N3,0                SB34 (NB34): 13 (2), 14 (1)
we find Block 2, and we will check whether it is mergable with Block 7. Therefore, to speed up our process, the first step is to find the block x at which we should stop, and the next step is to find the block y which we should check for mergability with block x. In order to compute USBk(j), we also use an array USk which records some information that helps us to generate USBk(j). The formulas to generate USk and USBk(j) are described as follows.
Table 3.11 The relationships in the 3rd round

  Relationship Na,b   The stopping blocks USBk(j), 0 ≤ j ≤ 7   The blocks which should be checked, UNBk(j)
  N1,0                USB3(0) = 4                              UNB3(0) = 3
  N1,0                USB3(1) = 7                              UNB3(1) = 2
  N2,1                USB3(2) = 8                              UNB3(2) = 7
  N2,1                USB3(3) = 9                              UNB3(3) = 6
  N3,2                USB3(4) = 12                             UNB3(4) = 11
  N3,2                USB3(5) = 13                             UNB3(5) = 8
  N3,0                USB3(6) = 13                             UNB3(6) = 2
  N3,0                USB3(7) = 14                             UNB3(7) = 1
US3 = {0, 3, 0, 1, 0, 1, 1, 2}
S41 = {0, 1, 2, 3}
USk = Sk1 ∪ Sk2 ∪ Sk3 ∪ Sk4, where the arrays Sk1, Sk2, Sk3, and Sk4 are built from the corresponding S arrays of the (k−1)-th and (k−2)-th rounds.
USB2(0) = 1, USB2(1) = 2, USB2(2) = 3, USB2(3) = 3
USBk(2j) = USBk−1(j) × 4 + USk(2j)        (2)
USBk(2j + 1) = USBk−1(j) × 4 + USk(2j + 1)        (3)
USBk(j) represents the block at which we should stop, and we record this information in the j-th element of array USBk in the k-th round. We can use the information of USBk−1 from the (k−1)-th round, USk, and Equations (2)-(3) to generate every stopping block in the k-th round, and use array USBk to record these stopping blocks. For example, suppose we want to find the stopping blocks in the 3rd round, and array US3 is {0, 3, 0, 1, 0, 1, 1, 2}. We can use USB2 and Equations (2)-(3) to generate USB3(0): USB3(0) = USB2(0) × 4 + US3(0) = 1 × 4 + 0 = 4. So, the first block at which we want to stop in the third round is Block 4.
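The following Java sketch applies Equations (2) and (3) to the arrays given in the text (USB2 and US3); the array types and the method name are our own.

import java.util.Arrays;

// Sketch of Equations (2) and (3): derive the stopping blocks USB_k of the
// k-th round from USB_(k-1) and the offset array US_k.
final class StoppingBlocks {
    static int[] nextUsb(int[] usbPrev, int[] usK) {
        int[] usb = new int[usbPrev.length * 2];
        for (int j = 0; j < usbPrev.length; j++) {
            usb[2 * j]     = usbPrev[j] * 4 + usK[2 * j];       // Equation (2)
            usb[2 * j + 1] = usbPrev[j] * 4 + usK[2 * j + 1];   // Equation (3)
        }
        return usb;
    }

    public static void main(String[] args) {
        int[] usb2 = {1, 2, 3, 3};                  // USB_2, from the text
        int[] us3  = {0, 3, 0, 1, 0, 1, 1, 2};      // US_3, from the text
        System.out.println(Arrays.toString(nextUsb(usb2, us3)));
        // prints [4, 7, 8, 9, 12, 13, 13, 14], i.e. USB_3 as in Table 3.11
    }
}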
After we get USBk(j) and DSk, we can use them to generate the corresponding neighbors UNBk(j). We record this information in array UNBk. The formulas to generate UNBk are shown as follows. Because we quarter the stopping blocks USBk into four parts, and each part corresponds to a different relationship Na,b, different relationships use different equations. Equation (4) represents the relationship N1,0, Equation (5) represents the relationship N2,1, Equation (6) represents the relationship N3,2, and Equation (7) represents the relationship N3,0. They generate NBk1, NBk2, NBk3, and NBk4, respectively. Finally, we combine them and record this information in array UNBk (Equation (8)), which contains the blocks we want to check in the k-th round.
NBk1 = SBk1 − DSk        (4)
NBk2 = [(2^(2(k−1)) − 1) + (i − 1) × 2 × 2^(2(k−1))] − SBk2        (5)
NBk3 = SBk3 − DSk        (6)
NBk4 = [(2^(2(k−1)) − 1) + (i − 1) × 2 × 2^(2(k−1))] − SBk4        (7)
UNBk = NBk1 ∪ NBk2 ∪ NBk3 ∪ NBk4        (8)
Take the third round as an example. We show how to find the first block at which we want to stop and its neighbor, which we may or may not merge with it, in the third round (a compact sketch of this walk is given after the steps).
Step 1: First, we check the stopping blocks USB3 and find USB3(0) = 4. So, we know that the first block at which we want to stop is Block 4. According to Table 3.11, we know that Block 4 belongs to SB31 and relationship N1,0.
Step 2: Because Block 4 belongs to relationship N1,0, it is Case 2 of the third round. Therefore, we need to consider the difference and use Equation (4) to compute its neighbor.
Step 3: In Case 2, we consider the difference. We check the first element in the difference set DS3. Since DS3 = {1, 5}, the corresponding first value for Block 4 is 1.
Step 4: By using Equation (4), NB31 = SB31 − DS3 = 4 − 1 = 3, so we find that the neighbor of Block 4 is Block 3.
Step 5: Finally, we check whether these two blocks have the same cluster number. If they do not have the same cluster number, then we can merge them.
Step 6: Go back to Step 1 and check the next stopping block. Repeat Step 1 to Step 5 until all stopping points in array USB3 in the 3rd round are checked.
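The walk described in Steps 1-6 can be summarized by the following Java sketch for the first basic unit of a round; BC, USBk, and DSk are assumed to be precomputed as above, the quartering of USBk follows Table 3.11, and the relabeling loop is our simplified stand-in for the cluster-number propagation used in the examples.

// Sketch of Steps 1-5 for the first basic unit of the k-th round (i = 1).
// bc[b] is the cluster number of block b (0 = no object); usbK and dsK are
// the precomputed arrays USB_k and DS_k described in the text.
final class RoundCheckSketch {
    static void checkFirstUnit(int[] bc, int[] usbK, int[] dsK, int k) {
        int sum = (1 << (2 * (k - 1))) - 1;        // Case 1 sum for i = 1
        int quarter = usbK.length / 4;             // |SB_k^a| = |DS_k|
        for (int j = 0; j < usbK.length; j++) {
            int stop = usbK[j];                    // Step 1: the stopping block
            int part = j / quarter;                // 0..3 -> SB_k^1 .. SB_k^4
            int neighbor = (part == 0 || part == 2)
                    ? stop - dsK[j % quarter]      // N_{1,0}, N_{3,2}: Eqs. (4), (6)
                    : sum - stop;                  // N_{2,1}, N_{3,0}: Eqs. (5), (7)
            // Step 5: merge if both blocks are occupied but in different clusters
            if (bc[stop] != 0 && bc[neighbor] != 0 && bc[stop] != bc[neighbor]) {
                int from = bc[stop], to = bc[neighbor];
                for (int b = 0; b < bc.length; b++)   // propagate the new number
                    if (bc[b] == from) bc[b] = to;
            }
        }
    }
}

For the example above (k = 3, usbK = {4, 7, 8, 9, 12, 13, 13, 14}, dsK = {1, 5}), this walk reproduces exactly the neighbor column UNB3 of Table 3.11.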
CHAPTER IV
Performance
In this Chapter, we study the performance of the proposed Hilbert curve-based
clustering algorithm by simulation, and make a comparison with the other clustering
algorithms.
4.1 Performance Measures
Clustering validation refers to procedures that evaluate the results of cluster analysis in a quantitative and objective fashion. Clustering validation usually uses the execution time and the cluster quality to evaluate the clusters. Methods that use cluster quality to evaluate whether a clustering is "good" or "bad" are called validation techniques. One way of validating a clustering structure is to compare it to an a priori structure, which is assigned without regard to the measurements. For example, the cluster numbers assigned to objects by a clustering algorithm can be compared to category labels assigned independently of the clustering. The validation technique that has come to be known as Hubert's Γ [12] has been shown to be effective in assessing the fit between data and a priori structures.
The abstract problem to which Hubert's Γ is applicable can be stated as follows. Let X = [X(i, j)] and Y = [Y(i, j)] be two n × n proximity matrices on the same n objects. The matrices must contain data having no built-in or implied relationships. For example, X(i, j) could denote the observed proximity between objects i and j, and Y(i, j) could be defined as
Y(i, j) = 0 if objects i and j are in the same cluster, and Y(i, j) = 1 if not.
The Hubert Γ statistic is, simply, the point serial correlation between the two matrices. It can be expressed in raw form as follows when the two matrices are symmetric:

Γ = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} X(i, j) Y(i, j)

In normalized form, Γ is the sample correlation coefficient between the entries of the two matrices. If mx and my denote the sample means and sx and sy denote the sample standard deviations of the entries of matrices X and Y, the normalized Γ statistic is

Γ = { (1/M) Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} [X(i, j) − mx][Y(i, j) − my] } / (sx sy)
where M = n(n − 1)/2 is the number of entries in the double sum and the moments are given by

mx = (1/M) ΣΣ X(i, j),    my = (1/M) ΣΣ Y(i, j)        (4.1)
sx^2 = (1/M) ΣΣ X^2(i, j) − mx^2,    sy^2 = (1/M) ΣΣ Y^2(i, j) − my^2        (4.2)
All sums are over the set {(i, j) : 1 ≤ i ≤ (n − 1), (i + 1) ≤ j ≤ n}. The Γ statistic measures the degree of linear correspondence between the entries of X and Y. Unusually large absolute values of Γ suggest that the two matrices agree with each other. The normalized Γ is always between −1 and 1, while the range of the raw Γ depends on the ranges of the values in the matrices and on the number of entries. A higher value of Γ indicates better clustering quality [12].
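The following Java sketch is a direct transcription of the formulas above; the representation of the matrices as double[][] is our own choice, and no special handling is included for the degenerate case of zero variance.

// Sketch of the normalized Hubert statistic for two symmetric n x n
// proximity matrices X and Y; only the entries with i < j are used.
final class HubertGamma {
    static double normalizedGamma(double[][] x, double[][] y) {
        int n = x.length;
        double m = n * (n - 1) / 2.0;              // M = n(n-1)/2
        double mx = 0, my = 0;
        for (int i = 0; i < n - 1; i++)
            for (int j = i + 1; j < n; j++) { mx += x[i][j]; my += y[i][j]; }
        mx /= m; my /= m;                          // Equation (4.1)
        double sx2 = 0, sy2 = 0, cov = 0;
        for (int i = 0; i < n - 1; i++)
            for (int j = i + 1; j < n; j++) {
                sx2 += x[i][j] * x[i][j];
                sy2 += y[i][j] * y[i][j];
                cov += (x[i][j] - mx) * (y[i][j] - my);
            }
        sx2 = sx2 / m - mx * mx;                   // Equation (4.2)
        sy2 = sy2 / m - my * my;
        return (cov / m) / Math.sqrt(sx2 * sy2);   // normalized Gamma
    }
}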
4.2 The Simulation Model
In this section, we present the experimental evaluation of our algorithm and compare its performance with other clustering algorithms. Our experiments were run on a Pentium IV 1.4 GHz with 654 MB RAM, running Windows 2000 Server, and coded in Java. We have used a collection of synthetic data sets to run our simulation; these data sets are generated by a generator developed for the BIRCH algorithm [30]. These synthetic data sets are needed to simulate the spatial data sets in a spatial database. We perform the evaluation with synthetic data sets which contain data points in the two-dimensional space, so the generated clusters are easy to visualize and the comparison of different schemes is visible.
The generation of data sets is controlled by a set of parameters that are summarized in Table 4.1. In our experiments, we use the square [0, 1024)^2 as the data space and generate data sets in this data space. Each data set consists of w clusters of two-dimensional data points that are not uniformly distributed in the data space. Each cluster is characterized by the number of data points in it (n), its radius (r), and its center (c). n is in the range [nl, nh]; nl and nh are the minimum and the maximum numbers of data points in each cluster, respectively. r is in the range [rl, rh]; rl and rh are the minimum and the maximum radius of a cluster, respectively. The total number of data points in the data set is N.
First, we randomly generate w cluster centers and place them in the data space. The location of each cluster is determined by its center (c). In order to prevent the clusters from overlapping, the distance between the centers of neighboring clusters on the same row/column is controlled by the parameter kg; we set the distance between the centers of neighboring clusters to kg × (rl + rh)/2. Next, we generate n data points around the center of each cluster. The distance between these data points and the center is less than the radius r. We use this method to generate our data sets, and we use these data sets as the input to run the k-means algorithm.
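The sketch below is one possible Java reading of this generation procedure; the grid placement of the centers and the uniform sampling inside each radius are our assumptions, not the exact generator of [30].

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// A rough sketch of the synthetic data generation described above.
final class DataGeneratorSketch {
    static final double SPACE = 1024.0;

    static List<double[]> generate(int w, int nl, int nh,
                                   double rl, double rh, double kg, long seed) {
        Random rnd = new Random(seed);
        double gap = kg * (rl + rh) / 2.0;         // spacing of neighboring centers
        int perRow = (int) (SPACE / gap);
        List<double[]> points = new ArrayList<>();
        for (int c = 0; c < w; c++) {
            // place centers on a grid with the required spacing (assumption)
            double cx = (c % perRow + 0.5) * gap;
            double cy = (c / perRow + 0.5) * gap;
            int n = nl + rnd.nextInt(nh - nl + 1);
            double r = rl + rnd.nextDouble() * (rh - rl);
            for (int p = 0; p < n; p++) {
                // uniform direction, distance less than r from the center
                double angle = 2 * Math.PI * rnd.nextDouble();
                double dist = r * Math.sqrt(rnd.nextDouble());
                points.add(new double[] {cx + dist * Math.cos(angle),
                                         cy + dist * Math.sin(angle)});
            }
        }
        return points;
    }
}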
Table 4.1 Parameters for data generation and their values (or ranges)

  Parameter   Description                                              Value (or Range)
  m           The number of blocks                                     16384
  h           The order of the Hilbert curve                           7
  w           The number of clusters which we generate initially       7..20
  N           The total number of data points                          10000..30000
  nl          The minimum number of data points in each cluster        1000
  nh          The maximum number of data points in each cluster        3000
  rl          The minimum radius of each cluster                       20
  rh          The maximum radius of each cluster                       100
  kg          Distance multiplier                                      3
  α           The percentage of small clusters over all clusters       0%..100%
In our algorithm, we have assumed that these synthetic data points are stored on disk based on the order (h) of the Hilbert curve. So, we divide the data space into rectangular blocks by using the Hilbert curve. Then, we partition the synthetic data points into m blocks and assign each synthetic data point to the corresponding disk block on the Hilbert curve. In the Hilbert curve in the two-dimensional space, the order h of the curve decides the 2^(2h) (= 2^h × 2^h) sequence numbers that linearly order the spatial data points. If each sequence number represents one disk block, the maximum order of the Hilbert curve affects the execution time. The higher the order h is, the longer the execution time will be. Moreover, different values of the parameter h may affect the quality of the clustering. So, first we must decide on an appropriate parameter h of the Hilbert curve to divide our data sets. (Note that the points which were assigned to the same disk block are linked.) In our simulation, we use a Hilbert curve of order 7 to order these data points; i.e., there are 16384 (= 2^7 × 2^7) disk blocks which are used to store these data points. We use this input structure to run our algorithm, and compare the execution time and clustering quality with other clustering algorithms.
4.3 Simulation Results
In our simulation results, the performance measures which we are concerned with are the CPU execution time of the computation, the sensitivity to parameters, and the quality of the clustering results. First, we make a comparison of the execution time for three synthetic data sets. Next, we perform a sensitivity analysis with different parameters. Finally, we use some special data sets to analyze the execution time and the quality of the clustering results with these different clustering algorithms. For the comparison of the first two performance measures, the data space is 1024 × 1024, and we took 1000 averages for each data set containing 10000 to 30000 data points. For the last performance measure, the data space is 512 × 512, and we took 1000 averages for each data set containing 6000 to 10000 data points.
4.3.1 Time Scalability
The first performance measure of our experiments was to evaluate the execution time of these algorithms to cluster various synthetic data sets. The time is presented in milliseconds. Three different synthetic data sets were generated by the generator. Furthermore, when a single parameter is varied, the default settings are used for the remaining parameters. Table 4.2 presents the default settings and the different generator settings for these synthetic data sets.
Figure 4.1 visualizes the actual clusters of these data sets by plotting them. DS1 is a data set in which each cluster has the same radius and the same number of data points. We use data set DS1 to observe the relationship between the execution time and the change of the number of clusters (w) from 7 to 20. DS2 is a data set in which each cluster has the same radius and the number of clusters is fixed. We use data set DS2 to observe the relationship between the execution time and the change of the number of data points (n), from 1000 to 3000, in each cluster. DS3
Table 4.2 Data sets used in the simulation

  Data set           Generator Setting
  Default Settings   w = 10, nl = nh = 1500, rl = rh = 60, kg = 3
  DS1                w = 7..20, nl = nh = 1500, rl = rh = 60, kg = 3
  DS2                w = 10, nl = 1000, nh = 3000, rl = rh = 60, kg = 3
  DS3                w = 10, nl = nh = 1500, rl = 20, rh = 100, kg = 3
is a data set in which the number of clusters is fixed and each cluster has the same number of data points. We use data set DS3 to observe the relationship between the execution time and the change of the radius of each cluster. In data set DS3, we separate the clusters into two parts by radius (r). A cluster with radius 20 ≤ r ≤ 60 belongs to the small clusters, and a cluster with radius 60 < r ≤ 100 belongs to the big clusters. We use a parameter α to control the percentage of small clusters. As α is increased, the number of small clusters is increased. We analyze the results for these data sets as follows.
Increasing the Number of Clusters: For the data set DS1 shown in Figure 4.1-(a), Figure 4.2 shows the comparison of the execution time between the k-means algorithm and our algorithm, and the detailed information is shown in Table 4.3. From this result, we observe that our algorithm requires shorter time than the k-means algorithm. Moreover, as the number of clusters (w) is increased, the execution time of the k-means algorithm is increased quickly, while the execution time of our algorithm is increased slowly. Note that when the number of clusters is increased in data set DS1, the total number of data points (N) is also increased.
Increasing the Number of Data Points in Each Cluster: For the data set
DS2 shown in Figure 4.1-(b), Figure 4.3 shows the comparison of the execution time
between the k-means algorithm and our algorithm, and the detailed information is
shown in Table 4.4. From this result, we observe that our algorithm requires shorter
time than the k-means algorithm. Moreover, as the number of data points (n) in
each cluster is increased, the execution time of the k-means algorithm is increased
Figure 4.1 The actual data sets used in our first experiments: (a) DS1; (b) DS2; (c) DS3
Figure 4.2 A comparison of the execution time (DS1): execution time (in milliseconds) versus the number of clusters (w) for the k-means algorithm and our algorithm
Table 4.3 A comparison of the execution time (DS1)

  Number of clusters (w)   the k-means algorithm   our algorithm
  7                        947.37                  312.70
  10                       2071.87                 328.10
  13                       3793.27                 342.30
  17                       8651.17                 362.50
  20                       15062.47                379.60
Figure 4.3 A comparison of the execution time (DS2): execution time (in milliseconds) versus the number of data points in each cluster (n) for the k-means algorithm and our algorithm
quickly, while the execution time in our algorithm is almost stable. Note that when
the number of data points in each cluster is increased in data set DS2, the total
number of data points (N ) is also increased.
The Different Radii of Clusters: For the data set DS3 shown in Figure 4.1-(c), Figure 4.4 shows the comparison of the execution time between the k-means algorithm and our algorithm, and the detailed information is shown in Table 4.5. From this result, we observe that our algorithm requires shorter time than the k-means algorithm. Moreover, as α is increased, the execution time of the k-means algorithm is decreased, while the execution time of our algorithm is decreased slowly. Because the number of small clusters is increased, the number of centroid calculations in the k-means algorithm is decreased, and the execution time of the k-means algorithm is therefore also decreased.
From the above simulation results, we observe that the execution time of the k-means algorithm is largely affected by the total number of data points. When the number of data points is increased from 10000 to 30000, the execution time of the
Table 4.4 A comparison of the execution time (DS2)

  Number of data points in each cluster (n)   the k-means algorithm   our algorithm
  1000                                        1268.77                 379.80
  1500                                        2803.03                 337.60
  2000                                        4249.70                 337.40
  2500                                        5769.13                 334.60
  3000                                        8226.57                 343.80

Figure 4.4 A comparison of the execution time (DS3): execution time (in milliseconds) versus the parameter α for the k-means algorithm and our algorithm
Table 4.5 A comparison of the execution time (DS3)

  The percentage of small clusters (α)   the k-means algorithm   our algorithm
  0%                                     3139.20                 358.33
  20%                                    2974.90                 346.00
  40%                                    2564.15                 327.93
  60%                                    2388.15                 313.73
  80%                                    2118.00                 310.46
  100%                                   1680.90                 307.33
k-means algorithm is increased quickly. The execution time of our algorithm is largely affected by the number of blocks which have data points, which is one of the main properties of the grid-based approach and is also the main reason for being able to support large spatial databases, as mentioned in Chapter 1.3. The more blocks contain data points, the more blocks have to be checked, and the longer the execution time will be. Therefore, from the results of our algorithm, in the case of the input data set DS1, since the radius of each cluster is the same, the increase of the number of clusters implies that the number of blocks which should be checked is also increased. Consequently, the execution time for DS1 increases as w (the number of clusters) is increased. For data set DS2, the radius of each cluster is also the same. Because only the density of each cluster is increased, which does not affect the number of blocks that should be checked, it does not largely affect the execution time of our algorithm. Therefore, the execution time for DS2 is almost stable as n (the number of data points in each cluster) is increased. In data set DS3, the result is similar to that of data set DS1. When the radius of the clusters is increased, it implies that the number of blocks which should be checked is also increased. Consequently, the execution time for DS3 increases as the number of big clusters is increased.
The Degenerated Case: For the previous simulation results, while the k-means
algorithm deals with the data points, our algorithm deals with the blocks. Now, let's
Figure 4.5 A comparison of the execution time (the degenerated case): execution time (in milliseconds) versus the total number of data points (N) for the k-means algorithm and our algorithm
consider a degenerated input case in which each block contains only one data point. In this way, both algorithms deal with data points only. For this case, we use data set DS3 and degenerate the data space to 128 × 128, i.e., the order of the Hilbert curve is 7. In our algorithm, we also divide the data space into 128 × 128 blocks. That means that one block stores only one data point. In this way, we can consider that our algorithm deals with the data points, the same as the k-means algorithm, instead of with blocks. The simulation result is shown in Figure 4.5, and the detailed information is shown in Table 4.6. We observe that in this degenerated case, the execution time of our algorithm is still shorter than that of the k-means algorithm.
4.3.2 Sensitivity to Parameters
Next, we studied the sensitivity of the k-means algorithm and our algorithm to the change of some parameters. We generate the data set with parameters w = 5, n = 1500, and 40 ≤ r ≤ 80 for our experiment. Since the parameters of concern for these two
Table 4.6 A comparison of the execution time (the degenerated case)

  The total number of data points (N)   the k-means algorithm   our algorithm
  4500                                  688.2                   631.2
  5400                                  1000.0                  755.0
  6300                                  1134.4                  862.6
  7200                                  1484.6                  931.2
  8100                                  1544.0                  1003.0
  9000                                  1862.4                  1071.3
  9900                                  2156.4                  1123.2
algorithms are different, we discuss the parameters of the two algorithms separately.
The k-means algorithm: In the k-means algorithm, the parameter k affects the quality of the clustering result. Figure 4.6 shows the clusters analyzed by the k-means algorithm when k is varied from 2 and 5 to 7. Note that in this simulation, we always let the number of clusters which we generate initially (w) be 5. As the figure illustrates, when k is not equal to the real number of clusters (w), the algorithm merges two real clusters into one cluster (k = 2 < 5), or splits one real cluster into two parts (k = 7 > 5). Figure 4.6 tells us that when the parameter k of the k-means algorithm is not assigned the real number of data clusters (w), the quality of the k-means algorithm decreases.
Our algorithm: Figure 4.7 illustrates the execution time of our algorithm as the order of the Hilbert curve is varied from 5 and 6 to 7, and the detailed information is shown in Table 4.7. From this result, we observe that when h is increased, the execution time of our algorithm is increased, because when the order h is increased, the total number of blocks which should be checked is also increased. The execution time increases by about 11% when the order of the Hilbert curve is increased by 1.
Figure 4.6 A comparison of the quality of clustering of the k-means algorithm under different values of the parameter k: (a) k = 2; (b) k = 5; (c) k = 7
Figure 4.7 A comparison of the execution time under different orders h of our algorithm: execution time (in milliseconds) versus the total number of data points (N) for h = 5, 6, and 7
Table 4.7 A comparison of the execution time under different orders h of our algorithm

  The total number of data points (N)   h = 7   h = 6   h = 5
  10000                                 300.1   284.4   256.3
  15000                                 309.4   282.9   256.3
  20000                                 323.4   290.6   261.0
  25000                                 344.0   298.4   270.6
  30000                                 356.2   303.4   268.8

4.3.3 Special Data Sets
Finally, we experiment with some special data sets containing data points in two dimensions [6, 10, 14, 24]. A particularly challenging feature of these data sets is that the clusters are very close to each other and they have different densities and shapes. The size of these data sets ranges from 6,000 to 10,000 data points, and their features are described as follows. The first data set, SDS1, as shown in Figure 4.8-(a), has five clusters with different sizes, shapes, and densities. The second data set, SDS2, as shown in Figure 4.8-(b), contains two clusters, each consisting of two concentric rings. A small ring is contained in the big ring, and different regions of the clusters have different densities [14]. The third data set, SDS3, as shown in Figure 4.8-(c), has two clusters of the same shape and density, but different orientations and positions. The fourth data set, SDS4, as shown in Figure 4.8-(d), contains big and small ellipsoids in two rows. The fifth data set, SDS5, as shown in Figure 4.8-(e), has concave shapes of data points; a small curve is contained in another big curve. The sixth data set, SDS6, as shown in Figure 4.8-(f), has two clusters of similar shape, but different size. The seventh data set, SDS7, as shown in Figure 4.8-(g), has four clusters of some special shapes, with different densities and orientations. The eighth data set, SDS8, as shown in Figure 4.8-(i), has two rings. Each ring is composed of some small circles, and the circles in different rings have different radii.
Figure 4.8 The special data sets: (a) SDS1; (b) SDS2; (c) SDS3; (d) SDS4; (e) SDS5; (f) SDS6; (g) SDS7; (i) SDS8
Table 4.8 Execution time (in milliseconds) for different special data sets

                          SDS1   SDS2   SDS3    SDS4
  Number of data          8000   7500   10000   7000
  the k-means algorithm   3140   1407   1110    1187
  our algorithm           297    313    313     312

                          SDS5   SDS6   SDS7   SDS8
  Number of data          6000   6500   7000   7600
  the k-means algorithm   2141   1125   1031   3172
  our algorithm           259    312    328    328
A comparison of the execution time between the k-means algorithm and our algorithm is shown in Table 4.8. We observe that, on the average, the k-means algorithm is 3 to 10 times slower than our algorithm. We also visualize the results of running both algorithms on these data sets to compare the quality of clustering. The results for the special data sets are shown in the following way: the points in different clusters are represented by different colors and different cluster numbers. As a result, data points that belong to the same cluster have both the same color and the same cluster number.
Figure 4.9 shows the clusters analyzed by these two algorithms for special data set SDS1. As expected, since the k-means algorithm uses a centroid-based clustering method, it cannot distinguish between the big and small clusters. It splits the larger cluster while merging the two smaller clusters adjacent to it. Moreover, it merges the two ellipsoids, because it cannot handle the elongated shapes. But our algorithm successfully discovers these clusters in SDS1.
Figure 4.10 shows the clusters analyzed by these two algorithms for special data set SDS2. The k-means algorithm splits one pair of concentric rings into three parts, and these parts merge the outer and the inner rings into a single cluster, respectively.
Figure 4.9 The result of SDS1: (a) the k-means algorithm; (b) our algorithm
The other concentric ring belongs to another cluster. But our algorithm successfully identifies each ring as a separate cluster.
Figure 4.11 shows the clusters analyzed by these two algorithms for special data set SDS3. The k-means algorithm separates the extreme portion of the upper cluster and merges this part with the lower cluster. But our algorithm separates them well in special data set SDS3.
Figure 4.12 shows the clusters analyzed by these two algorithms for special data set SDS4. The k-means algorithm separates the extreme portions of the elongated clusters and merges them with some close clusters. But our algorithm can find the right clusters with the right parameter settings.
Figure 4.13 shows the clusters analyzed by these two algorithms for special data set SDS5. The k-means algorithm splits these two clusters into right and left parts. The right parts of these two clusters are merged into one cluster, and the left parts similarly. But our algorithm identifies each curve as a separate cluster.
Figure 4.14 shows the clusters analyzed by these two algorithms for special data set SDS6. As for SDS1, the k-means algorithm cannot distinguish between the big and small clusters. It splits the larger cluster into upper and lower parts, and merges the lower
Figure 4.10 The result of SDS2: (a) the k-means algorithm; (b) our algorithm
part with another small cluster. But, our algorithm distinguishes these two clusters
very well.
Figure 4.15 shows the clusters analyzed by these two algorithms for special data set SDS7. The k-means algorithm separates some real clusters into two parts, and merges two real clusters into one cluster. But our algorithm can distinguish these clusters with arbitrary shapes in SDS7.
Figure 4.16 shows the clusters analyzed by these two algorithms for special data set SDS8. The k-means algorithm cannot distinguish big and small circles. It separates some big circles into two parts, and merges small circles into one cluster, because the k-means algorithm cannot deal with clusters which have different sizes. But our algorithm can distinguish these clusters with different sizes very well.
Figure 4.11 The result of SDS3: (a) the k-means algorithm; (b) our algorithm
Figure 4.12 The result of SDS4: (a) the k-means algorithm; (b) our algorithm
Figure 4.13 The result of SDS5: (a) the k-means algorithm; (b) our algorithm
Figure 4.14 The result of SDS6: (a) the k-means algorithm; (b) our algorithm
Figure 4.15 The result of SDS7: (a) the k-means algorithm; (b) our algorithm
Figure 4.16 The result of SDS8: (a) the k-means algorithm; (b) our algorithm
Table 4.9 A comparison (columns: the partitioning approach / the hierarchical approach / the density-based approach / the grid-based approach / our algorithm)

  (1) Scalable: Not well / Some well / Some well / Well / Well
  (2) Handle arbitrary shaped clusters: Not completely / Some algorithms can / Better than the hierarchical approach / Better than the hierarchical approach / Yes
  (3) Independent of data input order: Yes / Some not / Yes / Yes / Yes
  (4) No a-priori knowledge of inputs required: Required / Required / Some not / Some not / Required
  (5) Insensitive to noises: Not completely / Some partially / Yes / Yes / Not completely
  (6) Handle higher dimensionality: No / Some can / No / Not well / No
For the issues which a good spatial clustering algorithm should be concerned with, we can summarize them into six requirements. A comparison with respect to these requirements is shown in Table 4.9. It can be seen that none of the existing approaches can match all the requirements [16]. However, our proposed algorithm can match most of the requirements.
CHAPTER V
Conclusion
In this Chapter, we give a summary of the thesis and point out some future
directions.
5.1
Summary
Spatial data mining has became more important due to it can discover interesting
relationships and characteristics in large spatial databases [19]. However, how to nd
an eÆciency algorithm in spatial data mining is an important problem to be solved
because the amount of spatial databases is increasing exponentially. In spatial data
mining, clustering is a useful technique for discovering interesting data and patterns in
the implicit data. Clustering can help construct meaningful partitioning of a large set
of data points and cluster analysis can simply nd a convenient and valid organization
of the data set.
Another difficulty in spatial data mining is that there is no natural total ordering
among spatial data points that preserves spatial proximity. A space-filling curve
provides a way to order the data points of a grid while largely preserving the
distance relationships among the points in the original data space. In this thesis,
we have addressed the problems of traditional clustering algorithms, which either
favor clusters with spherical shapes and similar sizes or cannot handle large
databases. We have then proposed a new clustering algorithm that meets these
requirements: it scales well to large databases without sacrificing clustering quality.
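As a concrete illustration of such an ordering, the following is a minimal sketch (not the implementation used in this thesis; the function name hilbert_index and the grid size are illustrative choices) of the standard iterative mapping from a cell (x, y) of an n x n grid, with n a power of two, to its position d along the Hilbert curve. Cells that are close in the plane tend to receive nearby positions d, which is exactly the proximity-preserving total order described above.

```python
def hilbert_index(n, x, y):
    """Position d of grid cell (x, y) along a Hilbert curve filling an
    n x n grid, where n is a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next (finer) level of the curve
        # is traversed in a consistent orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# For a 4 x 4 grid the curve visits all 16 cells exactly once; neighbouring
# cells in the plane tend to appear close together in this list.
order = sorted((hilbert_index(4, x, y), (x, y)) for x in range(4) for y in range(4))
print([cell for _, cell in order])
```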
In Chapter 2, we have reviewed several well-known approaches to clustering data
points in a spatial database, including the k-means [15], CLARANS [19], HAC [26],
CURE [10], DBSCAN [6], and STING [27] algorithms.
In Chapter 3, we have presented the proposed clustering method, which uses the
Hilbert curve to order our spatial data points. The basic idea is that data points
which are close in 2-D space, and hence represent similar data, should be stored
close together in the linear order induced by the Hilbert curve. This method can
also minimize the disk access effort and provide high speed for clustering.
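Building on the hilbert_index sketch above (the two snippets are meant to be run together, since this one reuses that hypothetical helper), the following shows one way such a linear order could be obtained for raw 2-D points: overlay a 2^k x 2^k grid on the data, map each point to its grid cell, and sort the points by the cell's Hilbert index. The grid resolution k and the random example data are assumptions for illustration, not the parameter settings or data sets of this thesis.

```python
import numpy as np

def hilbert_order(points, k=4):
    """Indices that sort 2-D `points` along a Hilbert curve laid over a
    2**k x 2**k grid covering the data's bounding box."""
    n = 1 << k
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)              # guard against a flat axis
    cells = np.clip(((pts - lo) / span * n).astype(int), 0, n - 1)
    # hilbert_index is the helper from the previous sketch.
    keys = [hilbert_index(n, int(cx), int(cy)) for cx, cy in cells]
    return np.argsort(keys)

rng = np.random.default_rng(1)
points = rng.random((10, 2))
print(points[hilbert_order(points, k=3)])  # the points listed in Hilbert-curve order
```

Storing the points in this order keeps spatially adjacent points in nearby positions, and hence in nearby disk pages, which is where the reduction in disk access effort comes from.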
In Chapter 4, from our simulation, we have compared the execution time of different
clustering algorithms, including the k-means algorithm and our algorithm. We have
shown that the execution time of our proposed algorithm is shorter than that of the
other algorithms. Furthermore, we have shown that our algorithm achieves higher
clustering quality than the k-means algorithm and can deal with the special data
sets in our simulation. Moreover, our algorithm can handle large spatial databases
efficiently: as the number of data points increases, its execution time grows only
slowly.
5.2 Future Work
So far, we have only considered data sets in which the data points are concentrated
in clusters. How to efficiently process data sets that contain a large amount of
noise is a topic for future research. Another topic is a clustering algorithm that
works effectively for high d, where d is the number of dimensions; how our algorithm
behaves in a high-dimensional data space should be investigated. Furthermore, in
our method, we only consider the two main (vertical and horizontal) directions when
merging blocks. How to find diagonal-direction blocks to be merged, which may
improve the execution time and the quality of the clustering results, is another
direction for future work.
BIBLIOGRAPHY

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," ACM SIGMOD Conf., pp. 94-105, 1998.
[2] M. Ankerst, M. M. Breunig, H. P. Kriegel, and J. Sander, "OPTICS: Ordering Points To Identify the Clustering Structure," ACM SIGMOD Conf., pp. 49-60, 1999.
[3] A. Ben-Dor, R. Shamir, and Z. Yakhini, "Clustering Gene Expression Patterns," Proc. of the 3rd Annual Int. Conf. on Computational Molecular Biology, pp. 33-42, 1999.
[4] P. Berkhin, "Survey of Clustering Data Mining Techniques," Accrue Software, Inc., 2002.
[5] M. S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. on Knowledge and Data Eng., Vol. 8, No. 6, pp. 866-883, Dec. 1996.
[6] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. of the 2nd Int. Conf. on KDD, pp. 226-231, 1996.
[7] C. Faloutsos, "Gray Codes for Partial Match and Range Queries," IEEE Trans. on Software Eng., Vol. 14, No. 10, pp. 1381-1393, Aug. 1988.
[8] C. Faloutsos and S. Roseman, "Fractals for Secondary Key Retrieval," ACM SIGACT-SIGMOD-SIGART Symposium on PODS, pp. 247-252, 1989.
[9] S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes," Int. Conf. on Data Eng., pp. 512-521, 1999.
[10] S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Information Systems, Vol. 26, No. 1, pp. 35-58, March 2001.
[11] H. V. Jagadish, "Linear Clustering of Objects with Multiple Attributes," ACM SIGMOD Conf., pp. 332-342, 1990.
[12] A. K. Jain and R. C. Dubes, "Algorithms for Clustering Data," Prentice Hall, Englewood Cliffs, New Jersey, 1988.
[13] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323, Sept. 1999.
[14] G. Karypis, E. H. Han, and V. Kumar, "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling," IEEE Computer, Vol. 32, No. 8, pp. 68-75, Aug. 1999.
[15] L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis," John Wiley & Sons, Inc., New York, 1990.
[16] E. Kolatch, "Clustering Algorithms for Spatial Databases: A Survey," Dept. of Computer Science, University of Maryland, College Park, 2001.
[17] B. Lent, A. Swami, and J. Widom, "Clustering Association Rules," Proc. of the 13th Int. Conf. on Data Eng., pp. 220-231, 1997.
[18] B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz, "Analysis of the Clustering Properties of the Hilbert Space-Filling Curve," IEEE Trans. on Knowledge and Data Eng., Vol. 13, No. 1, pp. 124-141, Jan. 2001.
[19] R. T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proc. of the 20th VLDB Conf., pp. 144-155, 1994.
[20] R. T. Ng and J. Han, "CLARANS: A Method for Clustering Objects for Spatial Data Mining," IEEE Trans. on Knowledge and Data Eng., Vol. 14, No. 5, pp. 1003-1016, Sept./Oct. 2002.
[21] C. F. Olson, "Parallel Algorithms for Hierarchical Clustering," Parallel Computing, Vol. 21, No. 8, pp. 1313-1325, Aug. 1995.
[22] J. A. Orenstein and T. H. Merrett, "A Class of Data Structures for Associative Searching," Proc. Symp. on PODS, pp. 181-190, 1984.
[23] J. A. Orenstein, "Spatial Query Processing in an Object-Oriented Database System," Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 326-336, 1986.
[24] G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases," Proc. of the 24th VLDB Conf., pp. 428-439, 1998.
[25] L. Y. Tseng and S. B. Yang, "A Genetic Approach to the Automatic Clustering Problem," Pattern Recognition, Vol. 34, No. 2, pp. 415-424, Feb. 2001.
[26] E. M. Voorhees, "Implementing Agglomerative Hierarchical Clustering Algorithms for Use in Document Retrieval," Information Processing & Management, pp. 465-476, 1986.
[27] W. Wang, J. Yang, and R. Muntz, "STING: A Statistical Information Grid Approach to Spatial Data Mining," Proc. of the 23rd VLDB Conf., pp. 186-195, 1997.
[28] C. P. Wei, Y. H. Lee, and C. M. Hsu, "Empirical Comparison of Fast Clustering Algorithms for Large Data Sets," Proc. of the 33rd Hawaii Int. Conf. on System Sciences, Maui, Hawaii, Jan. 2000.
[29] O. R. Zaiane, A. Foss, C. H. Lee, and W. Wang, "On Data Clustering Analysis: Scalability, Constraints, and Validation," Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 28-39, 2002.
[30] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," ACM SIGMOD Int. Conf. on Management of Data, pp. 103-114, 1996.