AN EFFICIENT HILBERT CURVE-BASED CLUSTERING STRATEGY FOR LARGE SPATIAL DATABASES

A Thesis Submitted to the Faculty of National Sun Yat-sen University by Yun-Tai Lu, In Partial Fulfillment of the Requirements for the Degree of Master of Science, June 2003

ABSTRACT

Recently, millions of databases have come into use, and we need new techniques that can automatically transform the stored data into useful information and knowledge. Data mining is the technique of analyzing data to discover previously unknown information, and spatial data mining is the branch of data mining that deals with spatial data. In spatial data mining, clustering is one of the useful techniques for discovering interesting patterns in the underlying data objects. The problem of clustering is: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than data points in different clusters. Cluster analysis has been widely applied to many areas, such as medicine, social studies, bioinformatics, map regions, and GIS. In recent years, many researchers have focused on finding efficient methods for the clustering problem. In general, we can classify these clustering algorithms into four approaches: partitioning, hierarchical, density-based, and grid-based. The k-means algorithm, which is based on the partitioning approach, is probably the most widely applied clustering method. But a major drawback of the k-means algorithm is that it is difficult to determine the parameter k to represent "natural" clusters, and it is only suitable for convex spherical clusters. The k-means algorithm also has high computational complexity and is unable to handle large databases. Therefore, in this thesis, we present an efficient clustering algorithm for large spatial databases. It combines the hierarchical approach with the grid-based approach.
We apply the grid-based approach because it is efficient for large spatial databases. Moreover, we apply the hierarchical approach to find the genuine clusters by repeatedly merging blocks together. Basically, we make use of the Hilbert curve to provide a way to linearly order the points of a grid. Note that the Hilbert curve is a kind of space-filling curve, where a space-filling curve is a continuous path which passes through every point in a space exactly once, forming a one-to-one correspondence between the coordinates of the points and the one-dimensional sequence numbers of the points on the curve. The goal of using a space-filling curve is to preserve distance: points which are close in 2-D space and represent similar data should be stored close together in the linear order. This kind of mapping can also minimize the disk access effort and provide high speed for clustering. The new algorithm requires only one input parameter and supports the user in determining an appropriate value for it. Our simulations show that the proposed clustering algorithm has shorter execution time than other algorithms on large databases: as the number of data points increases, the execution time of our algorithm increases only slowly. Moreover, our algorithm can deal with clusters of arbitrary shapes, which the k-means algorithm cannot discover.

TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
   1.1 Spatial Data Mining
   1.2 Clustering
   1.3 Space Filling Curves
   1.4 Motivation
   1.5 Organization of Thesis
2. A Survey
   2.1 K-Means
   2.2 CLARANS
   2.3 HAC
   2.4 CURE
   2.5 DBSCAN
   2.6 STING
3. The Clustering Algorithm Based on the Hilbert Curve
   3.1 The Basic Idea
   3.2 The Algorithm
       3.2.1 The First Round
       3.2.2 The Second Round
       3.2.3 The Third Round
       3.2.4 The Fourth Round
4. Performance
   4.1 Performance Measures
   4.2 The Simulation Model
   4.3 Simulation Results
       4.3.1 Time Scalability
       4.3.2 Sensitivity to Parameters
       4.3.3 Special Data Sets
5. Conclusion
   5.1 Summary
   5.2 Future Work
LIST OF FIGURES

1.1 The object of clustering
1.2 A classification of clustering algorithms
1.3 The concept of the partitioning approach
1.4 Distinction between agglomerative and divisive methods
1.5 The concept of the density-based approach
1.6 Peano curves of order: (a) 1; (b) 2; (c) 3
1.7 Reflected binary gray-code curves of order: (a) 1; (b) 2; (c) 3
1.8 Hilbert curves of order: (a) 1; (b) 2; (c) 3
1.9 Clusters are formed in the space-filling curve of order 3: (a) the Peano curve; (b) the RBG curve; (c) the Hilbert curve
1.10 Sample databases: (a) clusters of widely different sizes (Case 1); (b) clusters with convex shapes (Case 2); (c) clusters with elongated shapes (Case 3); (d) clusters with double circles (Case 4)
1.11 Clusters discovered by different algorithms: (1) the k-means method; (2) the new algorithm
2.1 An overview of the k-means method: (a) (c) (e) calculate the new centroid; (b) (d) (f) assign the points to the new centroid
2.2 The different linkage measures: (a) single linkage; (b) complete linkage; (c) average linkage
2.3 The overview of CURE
2.4 Directly density-reachable
2.5 Examples of: (a) density-reachability; (b) density-connectivity
2.6 The concept of the grid-based method
3.1 Partition of the spatial data into rectangular blocks
3.2 Using the Hilbert curve to connect these blocks
3.3 The initial state of the first round
3.4 Procedure FirstRound(BI, m)
3.5 The result of the first round
3.6 The initial state of the second round
3.7 The different cases in the second round: (a) C04; (b) C14; (c) C24; (d) C34; (e) C44
3.8 Procedure SecondRound(BI, m)
3.9 The result of the second round
3.10 The initial state of the third round
3.11 Basic unit U3 in the third round
3.12 The basic idea of finding the neighboring block
3.13 The cases in the third round
3.14 The different units in the third round
3.15 The result of the third round
3.16 The initial state of the fourth round
3.17 The cases in the fourth round
3.18 The result of the fourth round
3.19 The process for merging clusters
4.1 The actual data sets used in our first experiments: (a) DS1; (b) DS2; (c) DS3
4.2 A comparison of the execution time (DS1)
4.3 A comparison of the execution time (DS2)
4.4 A comparison of the execution time (DS3)
4.5 A comparison of the execution time (the degenerated case)
4.6 A comparison of the clustering quality of the k-means algorithm under different parameters k: (a) k = 2; (b) k = 5; (c) k = 7
4.7 A comparison of the execution time under different orders h of our algorithm
4.8 The special data sets: (a) SDS1; (b) SDS2; (c) SDS3; (d) SDS4; (e) SDS5; (f) SDS6; (g) SDS7; (h) SDS8
4.9 The result of SDS1: (a) the k-means algorithm; (b) our algorithm
4.10 The result of SDS2: (a) the k-means algorithm; (b) our algorithm
4.11 The result of SDS3: (a) the k-means algorithm; (b) our algorithm
4.12 The result of SDS4: (a) the k-means algorithm; (b) our algorithm
4.13 The result of SDS5: (a) the k-means algorithm; (b) our algorithm
4.14 The result of SDS6: (a) the k-means algorithm; (b) our algorithm
4.15 The result of SDS7: (a) the k-means algorithm; (b) our algorithm
4.16 The result of SDS8: (a) the k-means algorithm; (b) our algorithm

LIST OF TABLES

3.1 The basic unit in each round
3.2 Case 1 in the third round
3.3 Case 2 in the third round
3.4 Case 1 in the fourth round
3.5 Case 2 in the fourth round
3.6 Definitions of parameters
3.7 The rules for Case 1
3.8 The rules for Case 2
3.9 The number of stopping points
3.10 The relationships in the 3rd round
3.11 The relationships in the 3rd round
4.1 Parameters for data generation and their values (or ranges)
4.2 Data sets used in the simulation
4.3 A comparison of the execution time (DS1)
4.4 A comparison of the execution time (DS2)
4.5 A comparison of the execution time (DS3)
4.6 A comparison of the execution time (the degenerated case)
4.7 A comparison of the execution time under different orders h of our algorithm
4.8 Execution time (in milliseconds) for different special data sets
4.9 A comparison

CHAPTER I
Introduction

Data mining, or the efficient discovery of interesting patterns from large collections of data, has been recognized as an important area of database research [17]. The goal is to reveal regularities and relationships that are non-trivial. This is accomplished through an analysis of the patterns that form in the data [16]. Data mining techniques can be classified into the following categories: classification, clustering, association rules, sequential patterns, time-series patterns, link analysis, and text mining [28]. Spatial data mining is the branch of data mining that deals with spatial (location) data.

1.1 Spatial Data Mining

Spatial Database Systems (SDBS) are database systems designed to handle spatial data and the non-spatial information used to identify the data. Spatial data describes information related to the space occupied by objects. SDBS are used for everything from geo-spatial data to bio-medical knowledge, and the number of such databases and their uses are increasing rapidly. The amount of spatial data being collected is also increasing exponentially.
The complexity of the data contained in these databases means that it is not possible for humans to completely analyze the data being collected. Data mining techniques have been used with relational databases to discover unknown information, searching for unexpected results and correlations [16]. Therefore, automated knowledge discovery becomes more and more important in spatial databases [6]. Spatial data mining in particular is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases [19]. Spatial data mining differs from regular data mining in parallel with the differences between non-spatial data and spatial data. The attributes of a spatial object stored in a database may be affected by the attributes of the spatial neighbors of that object. In addition, spatial location, and implicit information about the location of an object, may be exactly the information that can be extracted through spatial data mining [16]. Knowledge discovered from spatial data can take various forms, like characteristic and discriminant rules, extraction and description of prominent structures or clusters, spatial associations, and others. Usually, the spatial relationships are implicit in nature. Because of the huge amounts of spatial data that may be obtained from satellite images, medical equipment, Geographic Information Systems (GIS), image database exploration, etc., it is expensive and unrealistic for users to examine spatial data in detail. Spatial data mining aims to automate the process of understanding spatial data by representing the data in a concise manner and reorganizing spatial databases to accommodate data semantics. It can be used in many applications, such as seismology (grouping earthquakes clustered along seismic faults), minefield detection (grouping mines in a minefield), and astronomy (grouping stars in galaxies) [24].
So, a crucial challenge in spatial data mining is the efficiency of spatial data mining algorithms, due to the often huge amount of spatial data and the complexity of spatial data types and spatial access methods [27]. Another challenge is that there is no ordering by spatial proximity among spatial objects, so the computation of spatial operators is more difficult than that of their non-spatial counterparts. Due to its undirected nature, clustering is often the best technique to adopt first when a large, complex data set with many variables and many internal structures is encountered [28]. Clustering is a technique that is quite useful in spatial data mining applications. It divides the initial set of objects into a number of non-overlapping subsets, in order to identify classes or groups of objects whose locations (in some k-dimensional space) are close to each other.

1.2 Clustering

In spatial data mining, clustering is a useful technique for discovering interesting data distributions and patterns in the underlying data. Cluster analysis helps construct a meaningful partitioning of a large set of objects based on a "divide and conquer" methodology, which decomposes a large-scale system into smaller components to simplify design and implementation. As a data mining task, data clustering identifies clusters, or densely populated regions, according to some distance measurement, in a large, multidimensional data set. Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. Data clustering identifies the sparse and the crowded places, and hence discovers the overall distribution patterns of the data set [5]. Cluster analysis does not use category labels that tag objects with prior identifiers. The absence of category labels distinguishes cluster analysis from discriminant analysis (and pattern recognition and decision analysis).
The objective of cluster analysis is simply to find a convenient and valid organization of the data, not to establish rules for separating future data into categories. Clustering algorithms are geared toward finding structure in the data [12]. The problem of clustering can be defined formally as follows: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than data points in different clusters [10]. An example of clustering is depicted in Figure 1.1. The input patterns are shown in Figure 1.1-(a), and the desired clusters are shown in Figure 1.1-(b) [13]. In the past years, cluster analysis has been widely applied to many areas, such as medicine (classification of diseases), chemistry (grouping of compounds), social studies (classification of statistical findings), bioinformatics [3, 25], etc. [20]. In general, we can classify the clustering algorithms into four approaches: partitioning, hierarchical, density-based, and grid-based. Figure 1.2 shows the relationship between the different approaches of spatial clustering algorithms.

[Figure 1.1: The object of clustering]
[Figure 1.2: A classification of clustering algorithms: partitioning (k-means, k-medoids: PAM, CLARA, CLARANS), hierarchical (bottom-up: BIRCH, CURE, ROCK, CHAMELEON; top-down), density-based (DBSCAN, OPTICS, CAST), and grid-based (CLIQUE, STING, STING+, WaveCluster) methods]
[Figure 1.3: The concept of the partitioning approach]

The partitioning approach constructs a partition of a database D of n objects into a set of k clusters, where k is an input parameter for these algorithms. The partitioning approach typically starts with an initial partition of a database D and then uses an iterative control strategy to optimize an objective function.
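This iterative control strategy is easiest to see in the k-means method discussed next. The following is a minimal sketch (plain Python); the toy point set and the deterministic choice of the first k points as initial representatives are illustrative assumptions, not the procedure analyzed in this thesis:

```python
def kmeans(points, k, iters=20):
    """Minimal k-means sketch: alternately (1) assign each point to its
    nearest representative and (2) move each representative to the
    gravity center of its group."""
    # Illustrative deterministic start: the first k points act as representatives.
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            # Assignment step: nearest centroid by squared Euclidean distance.
            i = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2
                                + (y - centroids[c][1]) ** 2)
            clusters[i].append((x, y))
        # Update step: recompute the gravity center of each non-empty cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# Two well-separated groups of three points each; k = 2 recovers them.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```

With a poorly chosen k the split becomes arbitrary, which is exactly the parameter-selection difficulty discussed in Section 1.4.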
Each cluster is represented by the gravity center of the cluster (the k-means method) or by one of the objects of the cluster located near its center (the k-medoids method) [6]. That is, it classifies the data into k groups, which together satisfy the requirements of a partition: (1) each group must contain at least one object; (2) each object must belong to exactly one group. These conditions imply that there are at most as many groups as there are objects (k ≤ n). The second condition says that two different clusters cannot have any objects in common and that the k groups together add up to the full data set. Figure 1.3 shows an example of a partition of 20 points into three clusters [15]. Consequently, the partitioning approach uses a two-step procedure. First, it determines k representatives minimizing the objective function. Second, it assigns each object to the cluster whose representative is "closest" to the considered object. It is important to note that k is given by the user. Of course, not all values of k lead to "natural" clusterings, so it is advisable to run the algorithm several times with different values of k and to select the k for which certain characteristics or graphics look best, or to retain the clustering that appears to give rise to the most meaningful interpretation [6, 15]. Several algorithms belong to the partitioning approach, including k-means [15], PAM [15], CLARA [15], and CLARANS [19].

The hierarchical approach creates a hierarchical decomposition of a database D. The hierarchical decomposition is represented by a dendrogram, a tree that iteratively splits D into smaller subsets until each subset consists of only one object. In such a hierarchy, each node of the tree represents a cluster of D. In contrast to partitioning approaches, the hierarchical approach does not need k as an input. However, a termination condition has to be defined, indicating when the merge or division process should be terminated [6].
The hierarchical algorithms can be further divided into the agglomerative and the divisive methods. The agglomerative methods: the clustering hierarchy is formed from the bottom up; at the start, each data object is a cluster by itself, and then small clusters are merged into bigger clusters at each level of the hierarchy, until at the top of the hierarchy all the data objects are in one cluster [29]. The divisive methods: initially, the set of all objects is viewed as one cluster, and at each level some clusters are binary-divided into smaller clusters [25]. The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved. The distinction between the agglomerative and the divisive methods is shown in Figure 1.4, which shows what happens with a data set of n = 5 objects. The agglomerative methods (indicated by the upper arrow, pointing to the right) start when all objects are apart (that is, at step 0 we have n clusters). Then, in each step, two clusters are merged, until only one is left. On the other hand, the divisive methods start when all objects are together (that is, at step 0, there is one cluster), and in each following step a cluster is split up, until there are n of them [15]. Many algorithms belong to the hierarchical approach, including HAC [26], BIRCH [30], ROCK [9], CURE [10], and CHAMELEON [14].

[Figure 1.4: Distinction between agglomerative and divisive methods; agglomerative (steps 0 to 4): a and b merge into ab, d and e into de, then cde, finally abcde; divisive runs the same steps in reverse]

The density-based approach applies a local cluster criterion. Clusters are regarded as regions in the data space in which the objects are dense, and which are separated by regions of low object density (noise). These regions may have an arbitrary shape, and the points inside a region may be arbitrarily distributed.
The key idea of the density-based approach is that for each object of a cluster, the neighborhood of a given radius (ε) has to contain at least a minimum number of objects (MinPts), i.e., the cardinality of the neighborhood has to exceed a threshold. As shown in Figure 1.5, the point q has five neighbors, so point q and its neighbors can become a cluster [2, 6]. An open set in the Euclidean space can be divided into a set of its connected components. The implementation of this idea for partitioning a finite set of points requires concepts of density, connectivity, and boundary. They are closely related to a point's nearest neighbors. A cluster, defined as a connected dense component, grows in any direction that density leads. Therefore, the density-based approach is capable of discovering clusters of arbitrary shapes. This also provides a natural protection against outliers [4]. The advantages of the density-based approach are that it can discover clusters with arbitrary shapes and that it does not need to preset the number of clusters [29]. Many algorithms belong to the density-based approach, including CLIQUE [1], OPTICS [2], CAST [3], DBSCAN [6], and WaveCluster [24].

[Figure 1.5: The concept of the density-based approach (MinPts = 5, Eps = 2 cm)]

The grid-based approach first quantizes the clustering space into a finite number of cells, and then performs clustering on the gridded cells. Overall, the grid-based approach shifts our attention from the data to the space partitioning. Data partitioning is induced by the membership of points in segments resulting from the space partitioning, while the space partitioning is based on grid characteristics accumulated from the input data. One advantage of this indirect handling (data → grid-data → space-partitioning → data-partitioning) is that the accumulation of grid-data makes the grid-based clustering approach independent of data ordering. In contrast, relocation methods and all incremental methods are very sensitive with respect to data ordering [4].
The main advantage of the grid-based approach is that its speed depends only on the resolution of the grid, not on the size of the data set. The grid-based approach is more suitable for high-density data sets with a huge number of data objects in a limited space [29]. Many algorithms belong to the grid-based approach, including CLIQUE [1], STING [27], and WaveCluster [24].

1.3 Space Filling Curves

The amount of spatial data being collected is increasing rapidly, and the size of spatial databases is growing larger and larger. So, we need an efficient clustering algorithm to help us analyze large spatial databases. The grid-based approach is very efficient for large databases, because it quantizes the space into a finite number of blocks and transforms the focus from the number of objects to the number of blocks.

A space-filling curve is a continuous path which passes through every point in a space exactly once, forming a one-to-one correspondence between the coordinates of the points and the one-dimensional sequence numbers of the points on the curve. The space-filling curve provides a way to linearly order the points of a grid. In any case, the term spatial usually refers to objects and operators in a space of dimension two or higher. However, there is no total ordering among spatial objects that preserves spatial proximity. In other words, there is no mapping from two- or higher-dimensional space into one-dimensional space such that any two objects that are spatially close in the higher-dimensional space are also close to each other in the one-dimensional sorted sequence. One way is to look for total orders that preserve spatial proximity at least to some extent. The goal of the space-filling curve is to preserve distance, i.e., points which are close in space and represent similar data should be stored close together in the linear order [8]. Some examples of space-filling curves are the Peano curve, the RBG curve, and the Hilbert curve [7, 18, 22, 23].
These space-filling curves can help us preserve the spatial locality of the blocks and provide high speed for clustering. In general, space-filling curves start with a basic path on a k-dimensional square grid of side 2. The path visits every point in the grid exactly once without crossing itself. It has two free ends, which may be joined with other paths. The basic curve is said to be of order 1. To derive a curve of order i, each vertex of the basic curve is replaced by the curve of order i − 1, which may be appropriately rotated and/or reflected to fit the new curve [8].

The Peano curve was proposed first by Orenstein [22, 23]. In this mapping, the one-dimensional sequence number (1D-number) of a point is obtained by simply interleaving the bits of the binary representations of the X and Y coordinates of the point in the two-dimensional space (2D-space). The basic Peano curve for a 2 × 2 grid, denoted as P1, is shown in Figure 1.6-(a). To derive higher orders of the Peano curve, we replace each vertex of the basic curve with the previous-order curve. Figures 1.6-(b) and (c) show the Peano curves of order 2 and 3, respectively. We can think of dividing the given region into quadrants and drawing a curve such as Figure 1.6-(a). Then each quadrant is divided in turn into 4 sub-quadrants, and the same basic curve is repeated in each, in place of each node in the previous step. One more recursive step, again dividing each node into 4 sub-quadrants joined by the basic curve, gives rise to Figure 1.6-(c) [11].

[Figure 1.6: Peano curves of order: (a) 1; (b) 2; (c) 3]

In a (binary) Gray code, numbers are coded into binary representations such that successive numbers differ in exactly one bit position. Faloutsos [7, 8] observed that this difference in only one bit position has a relationship with locality. He proposed Gray-coding the numbers produced by interleaving the coordinates of a point in 2D-space (as in the Peano curve technique) to obtain the 1D-number.
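The bit-interleaving mapping of the Peano curve described above can be sketched directly (plain Python; `order` denotes the curve order, so the coordinates range over 0 … 2^order − 1):

```python
def peano_index(x, y, order):
    """1D-number on the Peano (Z-order) curve: interleave the bits of the
    binary representations of the X and Y coordinates, X bit first."""
    d = 0
    for bit in range(order - 1, -1, -1):
        d = (d << 2) | (((x >> bit) & 1) << 1) | ((y >> bit) & 1)
    return d

# Order-1 curve: the four cells of the 2 x 2 grid, visited as 0, 1, 2, 3,
# matching the labeling of Figure 1.6-(a).
order1 = [peano_index(x, y, 1) for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Because the interleaving is a permutation of the coordinate bits, the mapping is a bijection between grid cells and the 1D-numbers 0 … 4^order − 1.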
The basic reflected binary gray-code curve (the RBG curve) on a 2 × 2 grid, denoted as R1, is shown in Figure 1.7-(a). The procedure to derive higher orders of this curve is to reflect the previous-order curve over the x-axis and then over the y-axis. Figures 1.7-(b) and (c) show the reflected binary gray-code curves of order 2 and 3, respectively. As in the case of the Peano curve, the RBG curve begins with the curve of Figure 1.7-(a). It divides each quadrant into 4 sub-quadrants and replicates. While replicating, it rotates the two upper quadrants through 180°, as shown in Figure 1.7-(b). It divides into 4 sub-quadrants once again, with replication and upper-quadrant rotation, to get Figure 1.7-(c) [11].

[Figure 1.7: Reflected binary gray-code curves of order: (a) 1; (b) 2; (c) 3]

The Hilbert curve is a mapping in which the four nearest neighbors in 2D-space are usually mapped to points not too far away in the linear traversal. It begins with Figure 1.8-(a). As in the case of the previous curves, it replicates in four quadrants. When replicating, the lower-left quadrant is rotated clockwise 90°, the lower-right quadrant is rotated anti-clockwise 90°, and the sense (or direction of traversal) of both lower quadrants is reversed. The two upper quadrants have no rotation and no change of sense. Thus, we obtain Figure 1.8-(b). Remembering that all rotation and sense computations are relative to the previously obtained rotation and sense in a particular quadrant, a repetition of this step gives rise to Figure 1.8-(c). So, the basic Hilbert curve of a 2 × 2 grid, denoted as H1, is shown in Figure 1.8-(a). The procedure to derive higher orders of this curve is to rotate and reflect the curve at vertex 0 and at vertex 3. The curve can keep growing recursively by following the same rotation and reflection pattern at each vertex of the basic curve.

[Figure 1.8: Hilbert curves of order: (a) 1; (b) 2; (c) 3]
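The rotate-and-reflect rule just described is what the classic iterative coordinate-to-index conversion implements. A sketch (plain Python, following the well-known `xy2d`-style routine rather than any procedure defined later in this thesis):

```python
def hilbert_index(x, y, order):
    """Map a cell (x, y) of a 2^order x 2^order grid to its sequence
    number on the Hilbert curve, applying a rotation/reflection at
    every level of the recursion."""
    d = 0
    s = 1 << (order - 1)            # half of the current grid side
    while s > 0:
        rx = 1 if x & s else 0      # which horizontal half?
        ry = 1 if y & s else 0      # which vertical half?
        d += s * s * ((3 * rx) ^ ry)
        x &= s - 1                  # reduce to quadrant-local coordinates
        y &= s - 1
        if ry == 0:                 # rotate/reflect the lower quadrants
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Invert the mapping for the order-2 curve: sequence number -> cell.
cell = {hilbert_index(x, y, 2): (x, y) for x in range(4) for y in range(4)}
```

Checking that consecutive sequence numbers always land on grid-adjacent cells is a quick way to see the locality property attributed to the Hilbert curve in this section.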
Figures 1.8-(b) and (c) show the Hilbert curves of order 2 and 3, respectively [11].

For a given distance-preserving mapping that maps the points in 2D-space onto a one-dimensional index, a cluster is defined to be a group of points with consecutive ordering values. If a range query retrieves few clusters, the distance-preserving mapping method requires few disk accesses on the actual file [8]. Figure 1.9 illustrates a range query where the Hilbert curve (c) is better than the Peano curve (a). The shaded area is the result of the range query. The Peano curve (a) and the RBG curve (b) have four clusters in the shaded area, while the Hilbert curve (c) has only two.

[Figure 1.9: Clusters are formed in the space-filling curve of order 3: (a) the Peano curve; (b) the RBG curve; (c) the Hilbert curve]

1.4 Motivation

Clustering is one of the most important techniques in spatial data mining, and four different approaches have been proposed to achieve its goal. In this section, we first make a comparison of those different clustering approaches for spatial data in the presence of physical constraints [29], and then propose a new efficient clustering algorithm for large spatial databases.

The advantage of the partitioning approach is that it is very easy to understand and implement. But all the algorithms based on the partitioning approach have a similar clustering quality, and the major difficulties with these algorithms include: (1) the number k of clusters to be found needs to be known prior to the clustering, requiring at least some domain knowledge which is often not available; (2) it is difficult to identify clusters with large variations in sizes (large genuine clusters tend to be split); (3) this approach is only suitable for convex spherical clusters, and is unable to find arbitrarily shaped clusters [29].
The advantage of the hierarchical clustering approach is that it does not need k as an input parameter, which is an obvious advantage over the partitioning approach. It usually uses a dendrogram to represent the whole process, which makes it easy to see which clusters are merged at each step. The disadvantage of the hierarchical clustering approach is setting a termination condition, which requires some domain knowledge for parameter setting. The problem of parameter setting makes these methods less useful for real-world applications. Typically, the hierarchical clustering approach also has high computational complexity [29]. Moreover, some hierarchical algorithms are order-sensitive and cannot handle outliers.

The advantage of the density-based approach is that it can discover clusters with arbitrary shapes and it does not need to preset the number of clusters [29]. The major disadvantage of the density-based approach is the choice of the value of Eps; the resulting clustering quality is highly dependent on the Eps parameter [27].

The advantage of the grid-based approach is that it can handle large databases, because the run-time complexity of STING is O(K), where K is the number of blocks, and K ≪ N, where N is the number of objects; so this approach provides fast speed. But the disadvantage of the grid-based approach is its low quality: the lower the K we use, the lower the quality will be, while the higher the K, the slower the algorithm will run [16].

In addition to the above advantages and disadvantages of the different approaches for large data clustering, many applications for large spatial databases raise the following requirements for clustering approaches:

1. Minimization of input parameters: Minimal requirements of domain knowledge to determine the input parameters are needed, because appropriate values are often not known in advance when dealing with large databases [6].

2.
Discovery of clusters with arbitrary shapes: Due to the diverse nature of spatial objects, the clusters may be of arbitrary shapes. For example, there may be a large variation in cluster sizes (as shown in Figure 1.10-(a)); clusters may be nested within one another (as shown in Figure 1.10-(b)), elongated (as shown in Figure 1.10-(c)), or doubly curved (as shown in Figure 1.10-(d)). A good clustering approach should be able to identify clusters irrespective of their shapes or relative positions [24].

Figure 1.10 Sample databases: (a) clusters of widely different sizes (Case 1); (b) clusters with convex shapes (Case 2); (c) clusters with elongated shapes (Case 3); (d) clusters with double circles (Case 4).

3. Good efficiency for large databases: Due to the huge amount of spatial data, an important challenge for clustering approaches is to provide good time efficiency [24].

4. Robustness with regard to noise: Another important issue is the handling of noise. Noise refers to spatial objects which are not contained in any cluster and should be discarded during the mining process [24].

5. Insensitivity to the data input order: The results of a good clustering approach should not be affected by a different ordering of the input data, and the same clusters should be produced. In other words, it should be order-insensitive with respect to the input data [24].

Figure 1.11 Clusters discovered by different algorithms: (1) the k-means method; (2) the new algorithm.

But there is no single algorithm that can fully satisfy all the above requirements. Therefore, in this thesis, we present a new clustering algorithm for large spatial databases. It combines the hierarchical approach with the grid-based structure. We use the grid-based approach because it is efficient for large spatial databases.
We use the hierarchical approach to find the genuine clusters by repeatedly combining blocks together. We also make use of the Hilbert curve to provide a way to linearly order the points of a grid. The goal is to preserve distance, so that points which are close in 2-D space and represent similar data are stored close together in the linear order. This kind of mapping can also minimize the disk access effort [11] and provide high speed for clustering. The new algorithm requires only one input parameter and supports the user in determining an appropriate value for it. It is not affected by the ordering of the input data: because we use the Hilbert curve to store our data, the ordering of the input data is fixed. Our algorithm can also discover clusters of arbitrary shapes, as shown in Figure 1.11. Finally, it is efficient even for large spatial databases.

From our simulation and performance measures, we show that our proposed algorithm has a shorter mean execution time than the k-means algorithm. When the number of data points is increased, the k-means algorithm requires a large computation time for clustering, but the execution time of our algorithm increases very slowly. As mentioned before, some hierarchical algorithms need a termination condition to stop; our algorithm does not need to set such a condition. Moreover, our algorithm can find clusters with arbitrary shapes which the k-means algorithm cannot discover, as shown in Figure 1.11 (a)-(d), because in the k-means algorithm the quality of the clustering result is highly affected by the parameter k, and it cannot deal with arbitrary shapes. In our algorithm, the execution time and the quality of the clustering results are also affected by the parameter h, but only slightly. Therefore, our proposed algorithm has its own advantages and applicable domains; it can reduce the execution time for clustering while keeping the quality of the clustering results high.
1.5 Organization of Thesis

The rest of the thesis is organized as follows. In Chapter 2, we give a survey of several well-known clustering algorithms. In Chapter 3, we present a new clustering algorithm based on the Hilbert curve for large spatial databases. In Chapter 4, we give a comparison of the performance of the k-means algorithm and our new clustering algorithm. Finally, we give a summary and point out some future research directions.

CHAPTER II
A Survey

Several clustering methods have been proposed. Basically, these clustering methods can be classified into the following four approaches: the partitioning, the hierarchical, the density-based and the grid-based approaches. In this chapter, we describe some well-known clustering methods, k-means [15], CLARANS [19], HAC [26], CURE [10], DBSCAN [6], and STING [27], which are based on the partitioning approach, the partitioning approach, the hierarchical approach, the hierarchical approach, the density-based approach and the grid-based approach, respectively.

2.1 K-Means

The k-means method is a partitioning approach. It is probably the most widely applied partition-based clustering method [15]. The k-means method can be illustrated as follows: suppose that n objects described by the attribute vectors {x1, x2, ..., xn} are to be partitioned into k clusters, where k < n. Let mi be the mean of the vectors in cluster i. Initially, select k points arbitrarily to be the centers of the clusters, then find the distance from each point to each center, and assign the point to the closest center. Next, for each set of points assigned to a center, find the mean of the cluster, take that value as the new center, and repeat the process until the centers no longer move. Figure 2.1 shows the process of the standard k-means clustering method.
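The iteration just described can be sketched as a minimal 2-D implementation. (The initial centers are passed in explicitly here for determinism, whereas the text above selects them at random; the point values are illustrative.)

```python
def kmeans(points, centers, max_iter=100):
    """Assign points to the nearest center, recompute means, repeat."""
    k = len(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment step
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        new_centers = [                       # update step: cluster means
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)]
        if new_centers == centers:            # centers stable: stop
            break
        centers = new_centers
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, [(0, 0), (10, 10)])
```

With two well-separated groups and one initial center in each, the iteration converges after one update, splitting the six points three and three.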
First, Figure 2.1-(a) shows that objects are represented as points in space, where we randomly select 2 points (represented by ×) to be the centroids of two clusters. Figure 2.1-(b) shows that the next step is to partition the points into 2 groups (clusters), indicated by the two large circles. Figure 2.1-(c) shows that the new centroid of each group of points is calculated, and the points are then reassigned to the new centroid to which they are closest, as shown in Figure 2.1-(d). Figure 2.1-(e) shows that when all the new centroids are the same as the previous centroids, the centroids are stable and the algorithm stops. The final clusters are shown in Figure 2.1-(f): the points are assigned to the cluster to which they are most similar.

Figure 2.1 An overview of the k-means method: (a) (c) (e) calculate the new centroid; (b) (d) (f) assign the points to the new centroid.

2.2 CLARANS

The CLARANS (Clustering Large Applications based on RANdomized Search) method is a partitioning approach. The CLARANS method [19] is a partition-based clustering method for large databases which is based on randomized search. The CLARANS method is an improvement of the k-medoid method. The idea of the k-medoids method is like that of the k-means method, but the k-medoids method selects k medoids (called representative objects) from the data set, not the centroids of the clusters. The corresponding clusters are then found by assigning each remaining object to the nearest representative object. To be exact, the average distance (or average dissimilarity) of the representative object to all the other objects of the same cluster is minimized [15].
The CLARANS method is based on the k-medoids method, in which each cluster is also represented by its medoid, the most centrally located point in the cluster, and the objective is to find the k best medoids that optimize the criterion function. This method reduces the problem to one of graph search by representing each set of k medoids as a node in a graph [10]. Given n objects, the process of finding k medoids can be viewed abstractly as searching through a certain graph. In this graph, denoted by G_{n,k}, a node is represented by a set of k objects {Om1, ..., Omk}, intuitively indicating that Om1, ..., Omk are the selected medoids. The set of nodes in the graph is the set { {Om1, ..., Omk} | Om1, ..., Omk are objects in the data set }. Two nodes are neighbors (i.e., connected by an arc) if their sets differ by only one object. More formally, two nodes S1 = {Om1, ..., Omk} and S2 = {Ow1, ..., Owk} are neighbors if and only if the cardinality of the intersection of S1 and S2 is k − 1, i.e., |S1 ∩ S2| = k − 1. It is easy to see that each node has k(n − k) neighbors. Since a node represents a collection of k medoids, each node corresponds to a clustering. Thus, each node can be assigned a cost that is defined to be the total dissimilarity between every object and the medoid of its cluster [19].

This method has two inputs: maxneighbor and numlocal. Maxneighbor is the maximum number of neighbors of a node that are to be examined. Numlocal is the maximum number of local minima that can be collected. The CLARANS method begins by selecting a random node. It then checks a sample of the neighbors of the node, and if a better neighbor is found based on the cost differential of the two nodes, it moves to the neighbor and continues processing until the maxneighbor criterion is met. Otherwise, it declares the current node a local minimum and starts a new pass to search for other local minima.
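The neighbor structure of G_{n,k} stated above, with each node having k(n − k) neighbors, is easy to verify on a toy instance (a sketch; the object identifiers are arbitrary):

```python
from itertools import combinations

def neighbors(node, objects):
    """All k-subsets of `objects` sharing exactly k-1 members with `node`."""
    k = len(node)
    return [frozenset(c) for c in combinations(objects, k)
            if len(node & frozenset(c)) == k - 1]

objs = frozenset(range(6))       # n = 6 toy objects
node = frozenset({0, 1, 2})      # one choice of k = 3 medoids
print(len(neighbors(node, objs)))   # k(n - k) = 3 * 3 = 9
```

Each neighbor keeps two of the three medoids and swaps the third for one of the n − k non-medoids, giving k(n − k) combinations.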
After a specified number of local minima (numlocal) have been collected, the method returns the best of these local values as the medoids of the clustering [16].

2.3 HAC

The HAC method is a hierarchical approach. It is hierarchical agglomerative clustering, a bottom-up clustering method. This method starts with each object in a separate cluster. At each step of the method, the most "similar" clusters are joined together; "similarity" depends on the clustering criterion used. Joining operations continue until all the objects are merged into a single cluster. The most commonly used methods join a single pair of clusters at each step, which results in a binary tree. At each step, the pair of clusters merged is the one between which the distance is minimal. The widely used measures for the distance between the points in two clusters are described as follows, where Ci represents a cluster and ni is the number of points in Ci [10].

Single link: The distance between two clusters is given by the minimum cost edge between points in the two clusters.

    d(Ci, Cj) = min{ d(s, t) | s ∈ Ci, t ∈ Cj }

Complete link: The distance between two clusters is given by the maximum cost edge between points in the two clusters.

    d(Ci, Cj) = max{ d(s, t) | s ∈ Ci, t ∈ Cj }

Average link: The distance between two clusters is the average of all the edge costs between points in the two clusters [21].

    d(Ci, Cj) = ( Σ_{s ∈ Ci} Σ_{t ∈ Cj} d(s, t) ) / (ni · nj)

Now, we show an example to describe the HAC method using single linkage. Suppose that the similarity matrix D is given by:

              1    2    3    4    5
         1    0    9    3    6   11
         2    9    0    7    5   10
    D =  3    3    7    0    9    2
         4    6    5    9    0    8
         5   11   10    2    8    0

Treating each object as a cluster, the first join corresponds to merging clusters 3 and 5, because they have the minimum distance. To implement the next level of clustering, we need to calculate the distances between the cluster (35) and the remaining objects 1, 2 and 4.
We have

    d((35), 1) = min{d(3, 1), d(5, 1)} = min{3, 11} = 3
    d((35), 2) = min{d(3, 2), d(5, 2)} = min{7, 10} = 7
    d((35), 4) = min{d(3, 4), d(5, 4)} = min{9, 8} = 8

The new similarity matrix is then:

             (35)   1    2    4
       (35)    0    3    7    8
    D =   1    3    0    9    6
          2    7    9    0    5
          4    8    6    5    0

The smallest distance between pairs of clusters is now d((35), 1) = 3; hence, the next cluster obtained is (135). Calculating d((135), 2) = 7 and d((135), 4) = 6, we get the new similarity matrix

             (135)   2    4
       (135)   0     7    6
    D =   2    7     0    5
          4    6     5    0

The smallest distance between pairs of clusters is now d(2, 4), and therefore the resulting dendrogram looks like Figure 2.2-(a). If we use a different linkage measure to calculate the similarity matrix, we will get different results, as in Figure 2.2-(b) and Figure 2.2-(c).

Figure 2.2 The different linkage measures: (a) single linkage; (b) complete linkage; (c) average linkage.

2.4 CURE

CURE (Clustering Using Representatives) [10] is a bottom-up hierarchical clustering algorithm, but instead of using a single centroid to represent a cluster, a constant number of representative points are chosen to represent a cluster. In fact, CURE begins by choosing a constant number c of well-scattered points from a cluster. The scattered points capture the shape and extent of the cluster. The next step of the CURE algorithm shrinks the scattered points toward the centroid of the cluster using some pre-determined fraction α. The chosen scattered points after shrinking are used as representatives of the cluster. Varying the fraction α between 0 and 1 helps CURE to identify different types of clusters. The similarity between two clusters is measured by the similarity of the closest pair of representative points belonging to the different clusters. These are the clusters that are chosen to be merged as part of the hierarchical algorithm.
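The single-linkage walkthrough in Section 2.3 can be reproduced with a short sketch (naive O(n^3) agglomeration; the function and variable names are mine):

```python
def single_linkage(dist):
    """Agglomerate using single linkage; dist maps sorted pairs (i, j) -> cost.
    Returns the merge history as (merged cluster, merge distance) tuples."""
    items = sorted({i for pair in dist for i in pair})
    clusters = [frozenset([i]) for i in items]

    def d(a, b):  # single link: minimum cost edge between the two clusters
        return min(dist[tuple(sorted((i, j)))] for i in a for j in b)

    merges = []
    while len(clusters) > 1:
        a, b = min(((a, b) for i, a in enumerate(clusters)
                    for b in clusters[i + 1:]), key=lambda p: d(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a | b, d(a, b)))
    return merges

# the 5-object similarity matrix from Section 2.3
D = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11, (2, 3): 7,
     (2, 4): 5, (2, 5): 10, (3, 4): 9, (3, 5): 2, (4, 5): 8}
```

Running single_linkage(D) merges (3, 5) at distance 2, then (1, 3, 5) at 3, then (2, 4) at 5, and finally joins the two remaining clusters at 6, exactly as in the worked example.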
Merging continues until the desired number of clusters, k, an input parameter, is reached [14, 16]. CURE adapts to the clustering problem for large data sets in two ways. First, CURE begins by drawing a random sample from the database; it uses pre-clustering of all the data points in order to handle larger data sets. Random sampling has two positive effects: first, the sample can be designed to fit in main memory, which eliminates significant I/O costs; second, random sampling helps to filter outliers. The random samples must be selected such that the probability of missing clusters is low. The authors analytically derive sample sizes for which the risk is low, and show empirically that random samples still preserve accurate information about the geometry of the clusters [16]. Second, in order to further speed up clustering, CURE first partitions the random sample and partially clusters the data points in each partition. After eliminating outliers, the pre-clustered data in each partition are then clustered in a final pass to generate the final clusters. Once clustering of the random sample is completed, instead of a single centroid, the multiple representative points from each cluster are used to label the remainder of the data set. The use of multiple points enables the algorithm to identify arbitrarily shaped clusters. The steps involved in clustering using CURE are described in Figure 2.3.

Figure 2.3 The overview of CURE

The worst-case time complexity of CURE is O(n^2 log n), where n is the number of sampled points and not N, the size of the entire database. The computational complexity of CURE is quadratic with respect to the sample size, and is not related to the size of the data set.

2.5 DBSCAN

The DBSCAN method is a density-based approach. It relies on a density-based notion of clusters.
The DBSCAN method requires the user to specify two parameters that together define the minimum density for clustering: the radius Eps (ε) of the neighborhood of a point and the minimum number of points MinPts in the neighborhood. Clusters are then found by starting from an arbitrary point and, if its neighborhood satisfies the minimum density, including the points in its neighborhood in the cluster. The process is then repeated for the newly added points [10]. The formal definitions of this notion of clustering are introduced as follows.

Definition 1: (directly density-reachable) A point p is directly density-reachable from a point q w.r.t. ε and MinPts in a set of objects D if

1) p ∈ N_ε(q) (N_ε(q) is the subset of D contained in the ε-neighborhood of q), and
2) Card(N_ε(q)) ≥ MinPts (Card(N) denotes the cardinality of the set N).

The condition Card(N_ε(q)) ≥ MinPts is called the "core object condition". If this condition holds for an object p, then we call p a "core point". In Figure 2.4, q is a core point. Only from core objects can other objects be directly density-reachable.

Figure 2.4 Directly density-reachable: p is directly density-reachable from q; q is not directly density-reachable from p.

Definition 2: (density-reachable) A point p is density-reachable from a point q w.r.t. ε and MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi. Density-reachability is the transitive hull of direct density-reachability. This relation is not symmetric in general. Only core objects can be mutually density-reachable.

Definition 3: (density-connected) A point p is density-connected to a point q w.r.t. ε and MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. ε and MinPts. Density-connectivity is a symmetric relation. Figure 2.5 illustrates these definitions on a sample database of 2-dimensional points from a vector space.
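Definition 1 and the core-object condition can be made concrete with a small sketch (Euclidean distance; the point values and parameter settings are illustrative only):

```python
def neighborhood(points, q, eps):
    """N_eps(q): all points within distance eps of q (q itself included)."""
    return [p for p in points
            if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps ** 2]

def is_core(points, q, eps, min_pts):
    """Core object condition: Card(N_eps(q)) >= MinPts."""
    return len(neighborhood(points, q, eps)) >= min_pts

def directly_reachable(points, p, q, eps, min_pts):
    """p is directly density-reachable from q (Definition 1)."""
    return p in neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
```

With eps = 1.5 and MinPts = 3, the four points near the origin are core points, while the isolated point (5, 5) is not; nothing is directly density-reachable from it, which is the asymmetry illustrated in Figure 2.4.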
Note that the above definitions require only a distance measure, and so they also apply to data from a metric space. A density-based cluster is now defined as a set of density-connected objects which is maximal w.r.t. density-reachability, and the noise is the set of objects not contained in any cluster.

Figure 2.5 Examples of: (a) density-reachability (p is density-reachable from q; q is not density-reachable from p); (b) density-connectivity (p and q are density-connected to each other by o).

Definition 4: (cluster and noise) Let D be a set of objects. A cluster C w.r.t. ε and MinPts in D is a non-empty subset of D satisfying the following conditions:

1) Maximality: ∀p, q ∈ D: if p ∈ C and q is density-reachable from p w.r.t. ε and MinPts, then also q ∈ C.
2) Connectivity: ∀p, q ∈ C: p is density-connected to q w.r.t. ε and MinPts in D.

Every object not contained in any cluster is noise. Note that a cluster contains not only core objects but also objects that do not satisfy the core object condition. These objects, called "border objects" of the cluster, are, however, directly density-reachable from at least one core object of the cluster (in contrast to noise objects).

The DBSCAN method, which discovers the clusters and the noise in a database according to the above definitions, is based on the fact that a cluster is equivalent to the set of all objects in D which are density-reachable from an arbitrary core object in the cluster. The retrieval of density-reachable objects is performed by iteratively collecting directly density-reachable objects. The DBSCAN method checks the ε-neighborhood of each point in the database. If the ε-neighborhood N_ε(p) of a point p has more than MinPts points, a new cluster C containing the objects in N_ε(p) is created. Then, the ε-neighborhood of all points q in C which have not yet been processed is checked.
If N_ε(q) contains more than MinPts points, the neighbors of q which are not already contained in C are added to the cluster, and their ε-neighborhoods are checked in the next step. This procedure is repeated until no new point can be added to the current cluster C [2].

2.6 STING

The STING method is a grid-based approach. The STatistical INformation Grid-based method (STING) [27] divides the spatial area into rectangular cells using a hierarchical structure, as in Figure 2.6. A hierarchical structure is used to manipulate the grid: each cell at level i is partitioned into a fixed number k of cells at the next level, which is similar to spatial index structures. The statistical parameters (such as the mean, variance, minimum, maximum, and type of distribution) of each numerical feature of the objects within a cell are stored in the cell [16, 24]. Clustering operations are performed using a top-down method, starting with the root. The relevant cells are determined using the statistical information, and only the paths from those cells down the tree are followed. Once the leaf cells are reached, the clusters are formed using a breadth-first search, by merging cells based on their proximity and on whether the average density of the area is greater than some specified threshold.

The time complexity of STING is O(K), where K is the number of cells at the bottom layer; [27] assumed that K ≪ N. However, the smaller the K, the more approximate the clusters are; and the finer the granularity (the higher the K), the slower the algorithm will run. The STING method in its approximation mode (high granularity) is very fast. Tests showed that its execution rate was almost independent of the number of data points for both generation and query operations. However, because of the approximation characteristics, the quality of the clusters is not as good as that of other algorithms.
Although it can handle large amounts of data, and is not sensitive to noise, it cannot handle higher-dimensional data without a serious degradation of performance [16].

Figure 2.6 The concept of the grid-based method: the 1st (top) level could have only one cell; a cell of the (i−1)th level corresponds to 4 cells of the ith level.

CHAPTER III
The Clustering Algorithm Based on the Hilbert Curve

Clustering, as applied to large data sets, is the process of creating groups of objects organized by some similarity among the members. In spatial data sets, clustering permits a generalization of the spatial component that allows for successful data mining. The traditional clustering algorithms cannot handle spatial databases well; spatial databases, in particular, have unique requirements that create special needs for clustering algorithms [16]. Another issue is that spatial databases are growing exponentially, so we also need an efficient clustering algorithm to handle large databases.

3.1 The Basic Idea

In many spatial databases, in order to preserve spatial locality in the linear space, a space-filling curve provides a continuous path which visits every point in a high-dimensional grid exactly once and never crosses itself. The three space-filling curves considered are the Peano curve, the reflected binary gray-code (RBG) curve, and the Hilbert curve. [11] showed that under most circumstances, the Hilbert mapping performs as well as or better than the other mapping methods in minimizing the number of disk blocks accessed, and it is widely believed that the Hilbert space-filling curve achieves the best clustering [18]. Based on this property, we store spatial data by using the Hilbert curve. It can help us to speed up clustering and minimize the disk access effort. Another advantage of using the Hilbert curve is that it can preserve object "locality". The traditional algorithms, on the other hand, use graph metrics.
29 Table 3.1 The basic unit in each round Round 1st Round 2nd Round 3rd Round 4th Round Unit (blocks) U1 = 20 20 = 1 U2 = 21 21 = 4 U3 = 22 22 = 16 U4 = 23 23 = 64 The graph metrics need to determine the intercluster distance according to the cost function of the edges between the points in the two clusters. Therefore, using the Hilbert curve can preserve this "locality" and we do not need to calculate these distance. Figure 3.1 shows that we divide the spatial data into rectangle blocks. (e.g., using latitude and longitude). We transform the focus from n objects to m blocks, where n m. The number of blocks are 2h 2h = m, where h is the order of the Hilbert curves. The size of the blocks is dependent on the density of objects. The path of a space-lling curve imposes a linear ordering, which may be calculated by starting at one end of the curve and following the path to the other end. Orenstein [22, 23] used the term h-ordering to refer to the ordering of the Hilbert curve. Take Figure 3.2 as an example, numbers are ordered in the Hilbert curve. The number in the block represents the h-ordering and the gray block represents the block which has data. In our algorithm, according to the order of the Hilbert curves, we will determine how many rounds which we will execute. If the order of the Hilbert curves is h, the execution rounds are k = h +1. Another issue which we are concerned is the capacity of the basic unit. In every round, we use dierent basic unit Uk to run our algorithm. When we execute the k0 th round, the basic unit Uk contains 2(k 1) 2(k 1) blocks. Take Figure 3.2 as an example. The order of the Hilbert curves is 3, so the total execution rounds are 4. Moreover, in every round, the basic unit Uk which we used is shown in Table 3.1. 30 the block that the data is located the block that the data is not located Figure 3.1 Partition the spatial data into the rectangle blocks 3.2 The Algorithm Now, we use an example to illustrate our algorithm. 
3.2.1 The First Round

The initial state of the first round is shown in Figure 3.3. In this round, the basic unit U1 contains one block (2^0 × 2^0). According to the properties of the Hilbert curve, it is easy to show that when two numbers are close together in the 2-D space, they are always close together in the one-dimensional space [11]. Based on this property, we can be sure that if two block numbers are consecutive, the blocks must be in the same cluster. The FirstRound(BI, m) procedure is shown in Figure 3.4, where BI[i] represents the status of block i and m represents the number of blocks.

Figure 3.2 Using the Hilbert curve to connect the blocks
Figure 3.3 The initial state of the first round

Procedure FirstRound(BI, m);
/* The first round to merge data. */
/* BI[i] = 1 indicates that block i has an object. */
/* BC[i] = j indicates that block i belongs to cluster j. */
/* Flag is used to control the meaningful increment of the cluster number. */
begin
  j := 1; Flag := 1;
  for i := 0 to m - 1 do
  begin
    if (BI[i] ≠ 0) then /* Case 1: set the cluster number of the current block */
    begin
      BC[i] := j; Flag := 0;
    end
    else /* Case 2 */
    begin
      BC[i] := 0;
      if (Flag = 0) then /* increase the cluster number */
        j := j + 1;
      Flag := 1;
    end;
  end;
end;

Figure 3.4 Procedure FirstRound(BI, m)

If BI[i] = 1, block i has objects; on the other hand, if BI[i] = 0, block i does not have objects. We check each block i from 0 to m − 1. If a block has objects, Case 1 of procedure FirstRound is executed; otherwise, Case 2 is executed.
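A direct Python transcription of Procedure FirstRound (a sketch, keeping the same Case 1/Case 2 structure) is:

```python
def first_round(bi):
    """bi[i] == 1 iff block i (in Hilbert order) contains objects.
    Returns bc, where bc[i] is the cluster number of block i (0 = none)."""
    bc = [0] * len(bi)
    j, flag = 1, True          # flag: the previous block was empty
    for i, occupied in enumerate(bi):
        if occupied:
            bc[i] = j          # Case 1: continue (or start) cluster j
            flag = False
        else:
            bc[i] = 0          # Case 2: an empty block closes the cluster
            if not flag:
                j += 1
            flag = True
    return bc
```

Consecutive occupied blocks receive the same cluster number, and each empty block meaningfully increments the cluster counter only once, just as the Flag variable does in Figure 3.4.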
In Case 1, we check block i: if block i and the previous object-containing block l (l < i) are consecutive (i.e., l = i − 1), we give them the same cluster number; that is, BC[i] = BC[i − 1] = j, where j is the cluster number. Otherwise, we set BC[i] = j + 1, i.e., the next cluster number. In Case 2, we set BC[i] = 0, which indicates that block i does not belong to any cluster. The result of the first round is shown in Figure 3.5, where the number in parentheses represents the cluster number; different cluster numbers indicate that the blocks are not in the same cluster.

Figure 3.5 The result of the first round

Take Figure 3.5 as an example. Block 0 has no object, so Case 2 of procedure FirstRound(BI, m) is executed and we set BC[0] = 0. Blocks 2 and 4 have objects, but their Hilbert curve numbers are not consecutive; therefore, Case 1 is executed for both of them and they are assigned different cluster numbers. That is, we have BC[2] = 1 and BC[4] = 2. On the other hand, blocks 4 and 5 have objects and their Hilbert curve numbers are consecutive; therefore, Case 1 is executed and the same cluster number is assigned to them. That is, we have BC[4] = BC[5] = 2. Similarly, the other blocks follow this procedure to set their cluster numbers.

3.2.2 The Second Round

The basic unit U2 in the second round contains 4 (2^1 × 2^1) blocks, and the initial state of the second round is shown in Figure 3.6. In the second round, we have 16 different cases in total. We classify them into five groups, as shown in Figures 3.7 (a)-(e), respectively.
In Figure 3.7, we find that Cases (a) and (b) can be ignored, since there is nothing to be merged. Cases (c), (d) and (e) have already been processed in the first round, except Cases (c)-(6), (d)-(3) and (d)-(4), which are marked with a dotted line. We observe that in those special cases the first block and the last block are mergeable although their block numbers are not consecutive. Therefore, we conclude that in the second round we only have to check whether the first block and the last block of each basic unit U2 are mergeable: if these two blocks do not have the same cluster number, we merge them.

Figure 3.6 The initial state of the second round
Figure 3.7 The different cases in the second round: (a) C(4,0); (b) C(4,1); (c) C(4,2); (d) C(4,3); (e) C(4,4).

Procedure SecondRound(BI, m) is shown in Figure 3.8. We check each basic unit in U2 successively. In any unit U2^i, if blocks 4i and (4i + 3) have objects and their cluster numbers are not the same, we merge them. In the mergeable case, we let BC[4i + 3] = BC[4i]; in addition, each block whose cluster number equals the first-round cluster number of block (4i + 3) is assigned the same cluster number as block 4i, which is why a for loop is used in the mergeable case. Otherwise, we do nothing.

Procedure SecondRound(BI, m);
/* The second round to merge data. */
/* m is the number of blocks. */
begin
  k := m / 4;
  for i := 0 to (k - 1) do
  begin
    if (BI[4i + 3] ≠ 0) and (BI[4i] ≠ 0) and (BC[4i] ≠ BC[4i + 3]) then
    begin
      for j := 1 to 4 do
        if (BI[4i + j] ≠ 0) then
          BC[4i + j] := BC[4i];
    end;
  end;
end;

Figure 3.8 Procedure SecondRound(BI, m)

The result of the second round is shown in Figure 3.9. Take Figure 3.9 as an example. For unit U2^1 (blocks 0, 1, 2, 3), the situation is Case (b)-(3) of Figure 3.7; therefore, nothing is done in procedure SecondRound(BI, m). For unit U2^2 (blocks 4, 5, 6, 7), the situation is Case (d)-(4) of Figure 3.7; therefore, the blocks must be merged, and we reassign BC[7] = BC[4], that is, BC[7] = 2. In procedure SecondRound we also reassign BC[8] = BC[4], that is, BC[8] = 2, because in the first round blocks 7 and 8 were in the same cluster: if the cluster number of block 7 is reassigned, the cluster number of block 8 should also be reassigned. Similarly, we reassign BC[22] = BC[23] = BC[20] = 7 (Case (d)-(3)), BC[31] = BC[32] = BC[28] = 11 (Case (c)-(6)), BC[47] = BC[48] = BC[49] = BC[50] = BC[44] = 13 (Case (c)-(6)) and BC[59] = BC[60] = BC[61] = BC[56] = 15 (Case (c)-(6)).

Figure 3.9 The result of the second round

Finally, we merge five cases in this round, and the number of clusters is reduced from 16 to 11.

3.2.3 The Third Round

In the third round, the basic unit U3 contains 16 (2^2 × 2^2) blocks, and the initial state of the third round is shown in Figure 3.10. In this round, we divide the basic unit U3^1 (blocks 0 to 15) into four parts (P0, P1, P2 and P3), as shown in Figure 3.11. We find that we only have to consider the crossing parts shown in Figure 3.11, because the blocks inside each of the parts P0, P1, P2 and P3 have already been checked in the previous two rounds.
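Procedure SecondRound from Section 3.2.2 translates directly to Python (a sketch; note that the inner loop touches block 4i + 4, which lies in the next unit — this is how block 8 gets relabeled in the thesis example). The occupancy and first-round arrays below are a small illustrative fragment, not the full 64-block example:

```python
def second_round(bi, bc):
    """Merge the first and last blocks of each 4-block unit, as in Figure 3.8.
    bi: occupancy array; bc: cluster numbers from the first round (modified)."""
    m = len(bi)
    for i in range(m // 4):
        first, last = 4 * i, 4 * i + 3
        if bi[first] and bi[last] and bc[first] != bc[last]:
            for j in range(1, 5):          # blocks 4i+1 .. 4i+4
                if first + j < m and bi[first + j]:
                    bc[first + j] = bc[first]
    return bc

# blocks 0-11 of the running example: first-round clusters 1, 2 and 3
bi = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0]
bc = [0, 0, 1, 0, 2, 2, 0, 3, 3, 0, 0, 0]
```

Running second_round(bi, bc) merges blocks 7 and 8 into cluster 2, matching the reassignment BC[7] = BC[8] = 2 described in the walkthrough.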
*/ begin k := m=4; for i := 0 to (k 1) do begin if (BI [4i + 3] 6= 0) and (BI [4i] 6= 0) and (BC [4i] 6= BC [4i + 3]) then begin for j := 1 to 4 do if (BI [4i + j ]) 6= 0 then BC [4i + j ] := BC [4i]; end; end; end; Figure 3.8 Procedure SecondRound(BI; m) 36 ; ;; ;; ; ;;;; ;;;; ;; ; ;; ;; ;; 21 22 (7) 25 (9) 26 37 38 41 42 20 (7) 23 (7) 24 27 36 39 40 43 19 18 (6) 29 28 (10) 35 34 45 44 (12) 16 17 30 31 (10) 32 (10) 33 46 47 (12) 15 12 11 10 (4) 53 (14) 52 51 48 (12) 14 13 (5) 8 (2) 9 54 55 50 (12) 49 (12) 1 2 (1) 7 (2) 6 57 56 (15) 61 (15) 62 0 3 4 (2) 5 (2) 58 59 (15) 60 (15) 63 Stop block Merged blocks Figure 3.9 The result of the second round is done in procedure SecondRound(BI,m). For unit U22 (4,5,6,7), it is Case (d)-(4) in Figure 3.7. Therefore, they must be merged and we reassign BC [7] = BC [4]; that is, BC [7] = 2. And in procedure SecondRound, we also reassign BC [8] = BC [4]; that is, BC [8] = 2. Because in the rst round, Blocks 7 and 8 are in the same cluster. Therefore, if the cluster number of Block 7 is reassigned, the cluster number of Block 8 should also be reassigned. Similarly, we reassign BC [22] = BC [23] = BC [20] = 7 (Case (d)-(3)), BC [31] = BC [32] = BC [28] = 11 (Case (c)-(6)), BC [47] = BC [48] = BC [49] = BC [50] = BC [44] = 13 (Case (c)-(6)) and BC [59] = BC [60] = BC [61] = BC [56] = 15 (Case (c)-(6)). Finally, we merge ve cases in this round, and the number of clusters is reduced form 16 to 11. 37 3.2.3 The Third Round In the third round, the basic unit U3 contains 16 (22 22 ) blocks and the initial state of the third round is shown in Figure 3.10. Basically, we will divide the basic unit U31 (blocks 0 to 15) into four parts (P1 ; P2 ; P3 and P4 ) in the third round as shown in Figure 3.11. We can nd that we only have to consider the crossing parts as shown in Figure 3.11. Because the blocks inside each of P0 ; P1 ; P2 and P3 parts, they are already checked in the previous two rounds. 
So, in this round, we only consider the relationships among these four parts, and our algorithm checks these relationships from the rear part to the front part, as shown in Figure 3.12. These relationships are described as follows:

N1,0: P1 → P0
N2,1: P2 → P1
N3,2: P3 → P2
N3,0: P3 → P0

We can separate these relationships Na,b, where a > b, into two cases. Case 1 includes N2,1 and N3,0, and Case 2 includes N1,0 and N3,2. In Case 1 of the basic unit U31, as shown in Figure 3.13, we observe that Blocks 8 and 7 are contained in relationship N2,1, and so are Blocks 9 and 6. Similarly, Blocks 13 and 2 are contained in relationship N3,0, and so are Blocks 14 and 1. We find that these pairs of blocks have a special property: their sum is 15. The other basic units U32, U33, and U34 have a similar property, as shown in Table 3.2. In Case 2 of the basic unit U31, we observe that Blocks 4 and 3 are contained in relationship N1,0, and so are Blocks 7 and 2. Similarly, Blocks 12 and 11 are contained in relationship N3,2, and so are Blocks 13 and 8. These pairs of blocks, which have the same difference from the outer block to the inner block, are shown in Figure 3.13. The other basic units U32, U33, and U34 have a similar property, as shown in Table 3.3.

[Figure 3.10: The initial state of the third round.]

[Figure 3.11: Basic unit U3 in the third round, divided into parts P0 (blocks 0-3), P1 (blocks 4-7), P2 (blocks 8-11), and P3 (blocks 12-15).]

[Figure 3.12: The basic idea of finding the neighboring block: relationships N1,0, N2,1, N3,2, and N3,0 among parts P0-P3.]

[Figure 3.13: The cases in the third round. Case 1: sum = 15 = 2^2 × 2^2 − 1; Case 2: difference = 1 or 5 from the outer block to the inner block.]

Table 3.2 Case 1 in the third round

Unit      Sum   Blocks (the rear block → the front block)
Unit U31   15   (8 → 7); (9 → 6); (13 → 2); (14 → 1)
Unit U32   47   (25 → 22); (24 → 23); (29 → 18); (30 → 17)
Unit U33   79   (41 → 38); (40 → 39); (45 → 34); (46 → 33)
Unit U34  111   (57 → 54); (56 → 55); (61 → 50); (62 → 49)

Table 3.3 Case 2 in the third round

Unit      Difference   Blocks (the rear block → the front block)
Unit U31  1 or 5       (4 → 3); (12 → 11); (7 → 2); (13 → 8)
Unit U32  1 or 5       (20 → 19); (28 → 27); (23 → 18); (29 → 24)
Unit U33  1 or 5       (36 → 35); (44 → 43); (39 → 34); (45 → 40)
Unit U34  1 or 5       (52 → 51); (60 → 59); (55 → 50); (61 → 56)

[Figure 3.14: The different units in the third round: (a) Unit U31; (b) Unit U32; (c) Unit U33; (d) Unit U34.]

We observe another interesting property. The directions (horizontal/vertical) of Case 1 and Case 2 may be exchanged between the basic units U3i (1 ≤ i ≤ 4), as shown in Figure 3.14. Units U31 and U34 have the same directions for Cases 1 and 2, but for Units U32 and U33 the positions of Cases 1 and 2 differ from those of Units U31 and U34; their Case 1 and Case 2 are exchanged. The reason is that, according to the property of the Hilbert curve, the direction of both lower quadrants is rotated 90° in Units U31 and U34. The result of the third round is shown in Figure 3.15. Take Figure 3.15 as an example. For unit U31, the difference of Blocks 7 and 2 is 5, so this pair belongs to Case 2 in U31. Therefore, they must be merged, and we reassign BC[7] = BC[2]; that is, BC[7] = 1. We also reassign BC[4] = BC[5] = BC[8] = 1, because in the second round they are in the same cluster as Block 7; therefore, if Block 7 is reassigned, Blocks 4, 5, and 8 should also be reassigned. For Blocks 13 and 8, they also belong to Case 2 in U31.
Therefore, they must be merged, and we reassign BC[13] = BC[8]; that is, we have BC[13] = 1. For unit U32, the difference of Blocks 23 and 18 is 5, which is Case 2 in U32. Therefore, they must be merged, and we reassign BC[23] = BC[18]; that is, we have BC[23] = 6. We also reassign BC[20] = BC[22] = 6, because in the second round they are in the same cluster as Block 23; therefore, if Block 23 is reassigned, Blocks 20 and 22 should also be reassigned. The sum of Blocks 25 and 22 is 47, which is Case 1 in U32. Therefore, they must be merged, and we reassign BC[25] = BC[22]; that is, we have BC[25] = 6. For unit U34, the sum of Blocks 61 and 50 is 111, which is Case 1 in U34, as shown in Table 3.2. Therefore, they must be merged, and we reassign BC[61] = BC[50]; that is, BC[61] = 12. We also reassign BC[56] = BC[59] = BC[60] = 12, because in the second round they are in the same cluster as Block 61; therefore, if Block 61 is reassigned, Blocks 56, 59, and 60 should also be reassigned. Altogether, we merge these five cases in the third round, so the number of clusters is reduced from 11 to 6.

3.2.4 The Fourth Round

In the fourth round, the basic unit U4 contains 64 (2^3 × 2^3) blocks, and the initial state of the fourth round is shown in Figure 3.16. Basically, we again divide the basic unit U4 (blocks 0 to 63) into four parts (P0, P1, P2, and P3), as shown in Figure 3.17. We find that we only have to consider the crossing parts shown in Figure 3.17, because the blocks inside each of the parts P0, P1, P2, and P3 have already been checked in the previous three rounds. So, in this round, we only consider the relationships (N1,0, N2,1, N3,2, and N3,0) among these four parts. Similar to the third round, we also separate these relationships Na,b, where a > b, into two cases: Case 1 includes N2,1 and N3,0, and Case 2 includes N1,0 and N3,2.
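The block pairs of Tables 3.2 and 3.3 can be cross-checked against an explicit Hilbert-curve decoding. The sketch below is not code from the thesis: it uses the standard distance-to-coordinate conversion for a Hilbert curve (the decoded orientation may be a reflection of the thesis figures, which does not affect adjacency) and verifies that every third-round pair consists of two spatially adjacent blocks on the 8 × 8 grid, and that the Case 1 sums match Table 3.2.

```python
def d2xy(n, d):
    """Standard Hilbert-curve decoding: distance d -> cell (x, y) on an n x n grid."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/flip the quadrant if needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Case 1 pairs (sums 15, 47, 79, 111) and Case 2 pairs (difference 1 or 5),
# copied from Tables 3.2 and 3.3.
case1 = [(8, 7), (9, 6), (13, 2), (14, 1), (25, 22), (24, 23), (29, 18),
         (30, 17), (41, 38), (40, 39), (45, 34), (46, 33), (57, 54),
         (56, 55), (61, 50), (62, 49)]
case2 = [(4, 3), (12, 11), (7, 2), (13, 8), (20, 19), (28, 27), (23, 18),
         (29, 24), (36, 35), (44, 43), (39, 34), (45, 40), (52, 51),
         (60, 59), (55, 50), (61, 56)]
for a, b in case1 + case2:
    (x1, y1), (x2, y2) = d2xy(8, a), d2xy(8, b)
    assert abs(x1 - x2) + abs(y1 - y2) == 1   # the two blocks share an edge
for i, (a, b) in enumerate(case1):
    assert a + b == 15 + (i // 4) * 32        # the Case 1 sums of Table 3.2
```

Every stopping-block/neighbor pair used in the third round is thus a pair of edge-adjacent cells, even when the two Hilbert curve numbers are far apart.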
In Case 1 of the basic unit U41, as shown in Figure 3.17, we observe that Blocks 37 and 26 are contained in relationship N2,1, and so are Blocks 36 and 27, Blocks 35 and 28, and Blocks 32 and 31. Similarly, Blocks 58 and 5 are contained in relationship N3,0, and so are Blocks 57 and 6, Blocks 54 and 9, and Blocks 53 and 10. We find that these pairs of blocks have a special property: their sum is 63, as shown in Table 3.4.

[Figure 3.15: The result of the third round (stop blocks and merged blocks are marked).]

In Case 2 of the basic unit U41, we observe that Blocks 16 and 15 are contained in relationship N1,0, and so are Blocks 17 and 12, Blocks 30 and 11, and Blocks 31 and 10. Similarly, Blocks 48 and 47 are contained in relationship N3,2, and so are Blocks 51 and 46, Blocks 52 and 33, and Blocks 53 and 32. These pairs of blocks have the same differences from the outer block to the inner block, as shown in Table 3.5. The result of the fourth round is shown in Figure 3.18. Take Figure 3.18 as an example. For unit U4, the difference of Blocks 31 and 10 is 21, so this pair belongs to Case 2 in U4. Therefore, they must be merged, and we reassign BC[31] = BC[10]; that is, we have BC[31] = 4. We also reassign BC[28] = BC[32] = 4. Because in the third round, they are in the same cluster.
Therefore, if Block 31 is reassigned, Blocks 28 and 32 should also be reassigned. For Blocks 53 and 32, they also belong to Case 2 in U4. Therefore, they must be merged, and we reassign BC[53] = BC[32]; that is, we have BC[53] = 4. Finally, we merge two cases in the fourth round, so the number of clusters is reduced from 6 to 4. After executing these four rounds, we obtain 4 clusters. Figure 3.19 displays the process of merging these clusters.

[Figure 3.16: The initial state of the fourth round.]

[Figure 3.17: The cases in the fourth round. Case 1: sum = 63 = 2^3 × 2^3 − 1; Case 2: difference = 1, 5, 19, or 21 from the outer block to the inner block.]

Table 3.4 Case 1 in the fourth round

Unit    Sum   Blocks (the rear block → the front block)
Unit 1   63   (37 → 26); (36 → 27); (35 → 28); (32 → 31); (53 → 10); (54 → 9); (57 → 6); (58 → 5)

Table 3.5 Case 2 in the fourth round

Unit    Difference        Blocks (the rear block → the front block)
Unit 1  1, 5, 19, or 21   (16 → 15); (48 → 47); (17 → 12); (51 → 46); (30 → 11); (52 → 33); (31 → 10); (53 → 32)

From the above example, we observe some interesting facts and find some special rules for the third and the fourth rounds. Table 3.6 shows the parameters used in our rules. First, we divide the basic unit Uk in the kth round into four parts and consider the relationships among these four parts. There are two cases to be considered in each basic unit: Case 1 (the relationships N2,1 and N3,0) concerns the sum of two blocks, and Case 2 (the relationships N1,0 and N3,2) concerns the difference of two blocks.
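The sums used in Case 1 follow a closed form, which reappears as the general rule in Table 3.7. A quick sketch (the function name is ours; the formula is reconstructed from the sums 15, 47, 79, 111, and 63 seen above):

```python
def case1_sum(k, i):
    """Sum shared by the two mergeable blocks of a Case 1 pair in basic
    unit U_ki of the kth round (closed form of Table 3.7)."""
    base = 2 ** (2 * (k - 1))            # number of blocks in one basic unit
    return (base - 1) + (i - 1) * 2 * base

# Third round: units U31..U34 use sums 15, 47, 79, 111 (Table 3.2);
# fourth round: unit U41 uses sum 63 (Table 3.4).
assert [case1_sum(3, i) for i in (1, 2, 3, 4)] == [15, 47, 79, 111]
assert case1_sum(4, 1) == 63
```

The first term is the largest block number inside the first basic unit of the round; the second term shifts the sum by two unit-widths per additional unit.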
Then, when we consider the basic units U5 and U6 in some other examples, we find that Case 1 and Case 2 of these basic units have similar properties. In Case 1, the sum of the two blocks is the same within every basic unit Uki in the kth round (i indexes the basic units of the kth round, 1 ≤ i ≤ m/2^(2(k−1))), as shown in Table 3.7, where m is the number of blocks. For example, in the 4th round, the sum for basic unit U41 is 63 (= 63 + (1 − 1) × 128). In Case 2, the set of differences from the outer block to the inner block is the same for every basic unit Uk in the kth round, as shown in Table 3.8. For example, in the 4th round, the differences for basic unit U41 are {1, 5, 19, 21} from the outer block to the inner block.

[Figure 3.18: The result of the fourth round (stop blocks and merged blocks are marked).]

[Figure 3.19: The process for merging clusters: 16 clusters after round 1, 11 after round 2, 6 after round 3, and 4 after round 4.]

In general, the rule for Case 1 is that, in a basic unit Uki (1 ≤ i ≤ m/2^(2(k−1))), the two blocks whose sum equals (2^(2(k−1)) − 1) + (i − 1) × 2 × 2^(2(k−1)) are mergeable. For the rule in Case 2, the set of differences (denoted DSk) from the outer block to the inner block is the same in every basic unit Uki, and the two blocks whose difference equals some element of DSk are mergeable. For example, in the third round, the differences from the outer block to the inner block are {1, 5}. We store this information in DS3 in this order, so DS3 = {1, 5}.
DSk can be formulated as follows:

DS2 = {1}
DS3 = {1, 5}
DS4 = {1, 5, 19, 21}
DSk = DS(k−1) ∪ {a | a = 5 × 4^(k−3) − w or a = 5 × 4^(k−3) + w, w ∈ DS(k−2)}      (1)

Table 3.6 Definitions of parameters

Parameter  Description
n          The number of objects
m          The number of blocks
h          The order of the Hilbert curve
k          The number of the execution rounds
SBka       The block at which we want to stop in the kth round (1 ≤ a ≤ 4)
USBk(j)    The blocks SBka, recorded in the jth element of array USBk, at which we want to stop in the kth round
NBka       The block which is near the block SBka in the kth round (1 ≤ a ≤ 4)
UNBk(j)    The neighboring blocks NBka, recorded in the jth element of array UNBk, corresponding to the neighbor of block USBk(j) in the kth round
DSk        The difference set which is used in Case 2 of the kth round
Sk         The array which is used to build USBk(j) in the kth round
Na,b       The relationship in every round (0 ≤ b < a ≤ 3)

Table 3.7 The rules for Case 1

Basic Unit                            Sum
U3i                                   15 + (i − 1) × 32
U4i                                   63 + (i − 1) × 128
U5i                                   255 + (i − 1) × 512
U6i                                   1023 + (i − 1) × 2048
Uki (1 ≤ i ≤ m/2^(2(k−1)))            (2^(2(k−1)) − 1) + (i − 1) × 2 × 2^(2(k−1))

Table 3.8 The rules for Case 2

Basic Unit  Difference (from the outer block to the inner block)
U3          1, 5
U4          1, 5, 19, 21
U5          1, 5, 19, 21, 75, 79, 81, 85
U6          1, 5, 19, 21, 75, 79, 81, 85, 299, 301, 315, 319, 321, 325, 339, 341

DSk represents the difference set which contains the differences used in the kth round, and we use Equation (1) to generate DSk. Note that, in our implementation, we use an array to store such a set. Next, to speed up the processing time, we decide the blocks at which we want to stop; in this way, we do not have to check every block in each round. Table 3.9 shows the number of stopping blocks in each round. In each round, we quarter these stopping blocks into four parts (SBk1, SBk2, SBk3, and SBk4), and each part corresponds to a different relationship (N1,0, N2,1, N3,2, and N3,0). Each part uses its relationship to find its corresponding neighbor.
For example, SBk1 uses the relationship N1,0 to find its corresponding neighbor NBk1 in the kth round, as shown in Table 3.10. These stopping blocks (SBk1, SBk2, SBk3, and SBk4) are recorded in the jth element of array USBk in the kth round, and their corresponding neighbors are recorded in the jth element of array UNBk, as shown in Table 3.11. According to Tables 3.10 and 3.11, the union of SBk1, SBk2, SBk3, and SBk4 is USBk, and the union of NBk1, NBk2, NBk3, and NBk4 is UNBk.

Table 3.9 The number of stopping points

Round       The number of stopping points
1st round   m
2nd round   (m / (2^1 × 2^1)) × 1
3rd round   (m / (2^2 × 2^2)) × 8
4th round   (m / (2^3 × 2^3)) × 16
kth round   (m / (2^(k−1) × 2^(k−1))) × 2^k   (k ≥ 3)

Table 3.10 The relationships in the 3rd round

Relationship Na,b   The stopping block (the block which should be checked): SBka (NBka), 1 ≤ a ≤ 4
N1,0                SB31 (NB31): 4 (3); 7 (2)
N2,1                SB32 (NB32): 8 (7); 9 (6)
N3,2                SB33 (NB33): 12 (11); 13 (8)
N3,0                SB34 (NB34): 13 (2); 14 (1)

Take the example shown in Table 3.10. In the third round, we stop at 8 blocks in the basic unit U31, i.e., Blocks 4, 7, 8, 9, 12, 13, 13, and 14 (Block 13 is checked twice, once for N3,2 and once for N3,0). These stopping blocks are recorded in USB3(j), 0 ≤ j ≤ 7. We separate them into the four parts shown in Table 3.10, and each part uses a different relationship Na,b to find the corresponding neighbors NBka (1 ≤ a ≤ 4). How these relationships are used to find the corresponding neighbors will be described later. For example, when we stop at Block 7, it belongs to SB31. We then use the relationship N1,0 to find its corresponding neighbor NB31. In this case, we find Block 2, and we check whether it is mergeable with Block 7. Therefore, to speed up our process, the first step is to find the block x at which we should stop, and the next step is to find the block y which we should check for mergeability with block x.
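The difference sets DSk used in these Case 2 checks can be generated directly from Equation (1). A small Python sketch (written with sets for clarity; the thesis stores DSk as an array in outer-to-inner order):

```python
def DS(k):
    """Difference set DS_k of Equation (1): DS_2 = {1}, DS_3 = {1, 5},
    DS_k = DS_{k-1} | {5 * 4^(k-3) - w, 5 * 4^(k-3) + w : w in DS_{k-2}}."""
    if k == 2:
        return {1}
    if k == 3:
        return {1, 5}
    c = 5 * 4 ** (k - 3)
    return DS(k - 1) | {c - w for w in DS(k - 2)} | {c + w for w in DS(k - 2)}

# Reproduces the rows of Table 3.8:
assert DS(4) == {1, 5, 19, 21}
assert DS(5) == {1, 5, 19, 21, 75, 79, 81, 85}
assert DS(6) == {1, 5, 19, 21, 75, 79, 81, 85,
                 299, 301, 315, 319, 321, 325, 339, 341}
```

Each round doubles the side length of the basic unit, so two new differences (5 × 4^(k−3) ∓ w) appear for every difference w of two rounds earlier.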
In order to compute USBk(j), we also use an array USk which records information that helps us to generate USBk(j). The formulas to generate USk and USBk(j) are described as follows.

Table 3.11 The relationships in the 3rd round

Relationship Na,b   The stopping blocks USBk(j), 0 ≤ j ≤ 7   The blocks which should be checked UNBk(j)
N1,0                USB3(0) = 4                              UNB3(0) = 3
N1,0                USB3(1) = 7                              UNB3(1) = 2
N2,1                USB3(2) = 8                              UNB3(2) = 7
N2,1                USB3(3) = 9                              UNB3(3) = 6
N3,2                USB3(4) = 12                             UNB3(4) = 11
N3,2                USB3(5) = 13                             UNB3(5) = 8
N3,0                USB3(6) = 13                             UNB3(6) = 2
N3,0                USB3(7) = 14                             UNB3(7) = 1

US3 = {0, 3, 0, 1, 0, 1, 1, 2}
S41 = {0, 1, 2, 3}
USk = Sk1 ∪ Sk2 ∪ Sk3 ∪ Sk4, where each part Ska (1 ≤ a ≤ 4) is built recursively from the S arrays of the previous rounds.

USB2(0) = 1, USB2(1) = 2, USB2(2) = 3, USB2(3) = 3

USBk(2j) = USBk−1(j) × 4 + USk(2j)      (2)
USBk(2j + 1) = USBk−1(j) × 4 + USk(2j + 1)      (3)

USBk(j) represents the block at which we should stop, and we record this information in the jth element of array USBk in the kth round. We can use the information of USBk−1 from the (k−1)th round, USk, and Equations (2)-(3) to generate every stopping block in the kth round, and we use array USBk to record these stopping blocks. For example, to find the first stopping block in the 3rd round, where array US3 is {0, 3, 0, 1, 0, 1, 1, 2}, we use USB2 and Equations (2)-(3) to generate USB3(0): USB3(0) = USB2(0) × 4 + US3(0) = 1 × 4 + 0 = 4. So, the first block at which we want to stop in the third round is Block 4. After we get USBk(j) and DSk, we can use them to generate the corresponding neighbors UNBk(j), which we record in array UNBk. The formulas to generate UNBk are shown as follows. Because we quarter the stopping blocks USBk into four parts, and each part corresponds to a different relationship Na,b, different relationships use different equations: Equation (4) represents the relationship N1,0, Equation (5) represents N2,1, Equation (6) represents N3,2, and Equation (7) represents N3,0. They generate NBk1, NBk2, NBk3, and NBk4, respectively. Finally, we combine them in array UNBk (Equation (8)), which records the blocks we want to check in the kth round.

NBk1 = SBk1 − DSk      (4)
NBk2 = (2^(2(k−1)) − 1) + (i − 1) × 2 × 2^(2(k−1)) − SBk2      (5)
NBk3 = SBk3 − DSk      (6)
NBk4 = (2^(2(k−1)) − 1) + (i − 1) × 2 × 2^(2(k−1)) − SBk4      (7)
UNBk = NBk1 ∪ NBk2 ∪ NBk3 ∪ NBk4      (8)

Take the third round as an example. We show how to find the first block at which we want to stop, and its neighbor, which we may or may not merge with it.

Step 1: First, we check the stopping blocks USB3 and find USB3(0) = 4. So, the first block at which we want to stop is Block 4. According to Table 3.11, Block 4 belongs to SB31 and relationship N1,0.

Step 2: Because Block 4 belongs to relationship N1,0, it is Case 2 of the third round. Therefore, we need to consider the difference and use Equation (4) to compute its neighbor.

Step 3: In Case 2, we consider the difference. We check the first element of the difference set DS3. Since DS3 = {1, 5}, the corresponding first value for Block 4 is 1.

Step 4: By using Equation (4), NB31 = SB31 − DS3 = 4 − 1 = 3, so the neighbor of Block 4 is Block 3.

Step 5: Finally, we check whether these two blocks have the same cluster number. If they do not have the same cluster number, we can merge them.

Step 6: Go back to Step 1, and check the next stopping block. Repeat Steps 1 to 5 until all stopping points in array USB3 of the 3rd round are checked.

CHAPTER IV

Performance

In this chapter, we study the performance of the proposed Hilbert curve-based clustering algorithm by simulation, and make a comparison with other clustering algorithms.

4.1 Performance Measures

Clustering validation refers to procedures that evaluate the results of cluster analysis in a quantitative and objective fashion.
Clustering validation usually uses the execution time and the cluster quality to evaluate the clusters; the methods used to evaluate whether a cluster is "good" or "bad" are called validation techniques. One way of validating a clustering structure is to compare it to an a priori structure, which is assigned without regard to the measurements. For example, the cluster numbers assigned to objects by a clustering algorithm can be compared to category labels assigned independently of the clustering. The validation technique that has come to be known as Hubert's Γ [12] has been shown to be effective in assessing the fit between data and a priori structures. The abstract problem to which Hubert's Γ is applicable can be stated as follows. Let X = [X(i, j)] and Y = [Y(i, j)] be two n × n proximity matrices on the same n objects. The matrices must contain data having no built-in or implied relationships. For example, X(i, j) could denote the observed proximity between objects i and j, and Y(i, j) could be defined as

Y(i, j) = 0 if objects i and j are in the same cluster, and 1 if not.

The Hubert Γ statistic is, simply, the point serial correlation between the two matrices. It can be expressed in raw form as follows when the two matrices are symmetric:

Γ = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} X(i, j) Y(i, j)

In normalized form, Γ is the sample correlation coefficient between the entries of the two matrices. If mx and my denote the sample means and sx and sy the sample standard deviations of the entries of matrices X and Y, the normalized statistic is

Γ = { (1/M) Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} [X(i, j) − mx][Y(i, j) − my] } / (sx sy)

where M = n(n − 1)/2 is the number of entries in the double sum and the moments are given by

mx = (1/M) ΣΣ X(i, j)            my = (1/M) ΣΣ Y(i, j)      (4.1)
sx² = (1/M) ΣΣ X²(i, j) − mx²    sy² = (1/M) ΣΣ Y²(i, j) − my²      (4.2)

All sums are over the set {(i, j) : 1 ≤ i ≤ (n − 1), (i + 1) ≤ j ≤ n}. The Γ statistic measures the degree of linear correspondence between the entries of X and Y.
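The normalized statistic is straightforward to compute from these definitions. A sketch (plain lists of lists standing in for the proximity matrices; only the upper triangle is used, as in the definition):

```python
import math

def hubert_gamma(X, Y):
    """Normalized Hubert statistic between two symmetric n x n proximity
    matrices (assumes the entries of X and Y are not all equal, so the
    standard deviations are nonzero)."""
    n = len(X)
    pairs = [(i, j) for i in range(n - 1) for j in range(i + 1, n)]
    M = len(pairs)                                    # M = n(n - 1) / 2
    mx = sum(X[i][j] for i, j in pairs) / M           # means, Equation (4.1)
    my = sum(Y[i][j] for i, j in pairs) / M
    sx = math.sqrt(sum(X[i][j] ** 2 for i, j in pairs) / M - mx ** 2)  # (4.2)
    sy = math.sqrt(sum(Y[i][j] ** 2 for i, j in pairs) / M - my ** 2)
    cov = sum((X[i][j] - mx) * (Y[i][j] - my) for i, j in pairs) / M
    return cov / (sx * sy)
```

When Y is a perfectly correlated copy of X, the statistic is 1; when the two matrices vary in opposite directions, it approaches −1.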
Unusually large absolute values of Γ suggest that the two matrices agree with each other. The normalized Γ is always between −1 and 1, while the range of the raw Γ depends on the ranges of values in the matrices and on the number of entries. A higher value of Γ represents better clustering quality [12].

4.2 The Simulation Model

In this section, we present the experimental evaluation of our algorithm and compare its performance with other clustering algorithms. Our experiments were run on a Pentium IV 1.4 GHz with 654 MB RAM, running Windows 2000 Server, and were coded in Java. We have used a collection of synthetic data sets to run our simulation; these data sets are produced by the generator developed for the BIRCH algorithm [30]. These synthetic data sets are intended to simulate the spatial data sets in a spatial database. We perform the evaluation with synthetic data sets that contain data points in the two-dimensional space, so the generated clusters are easy to visualize, which makes the comparison of different schemes visible. The generation of data sets is controlled by a set of parameters that are summarized in Table 4.1. In our experiments, we use a square [0, 1024)^2 as the data space and generate data sets in this data space. Each data set consists of w clusters of two-dimensional data points that are not uniformly distributed in the data space. Each cluster is characterized by the number of data points in it (n), its radius (r), and its center (c). n is in the range [nl, nh], where nl and nh are the minimum and the maximum numbers of data points in each cluster, respectively. r is in the range [rl, rh], where rl and rh are the minimum and the maximum radius of a cluster, respectively. The total number of data points in the data set is N. First, we randomly generate w cluster centers and place them in the data space. The location of each cluster is determined by its center (c).
To prevent the clusters from overlapping, the distance between the centers of neighboring clusters on the same row/column is controlled by the parameter kg; we set the distance between the centers of neighboring clusters to kg × (rl + rh)/2. Next, we generate n data points around the center of each cluster; the distance between each of these data points and its center is less than the radius r. We use this method to generate our data sets, and use these data sets as input to the k-means algorithm.

Table 4.1 Parameters for data generation and their values (or ranges)

Parameter  Description                                           Value (or range)
m          The number of blocks                                  16384
h          The order of the Hilbert curve                        7
w          The number of clusters which we generate initially    7..20
N          The total number of data points                       10000..30000
nl         The minimum number of data points in each cluster     1000
nh         The maximum number of data points in each cluster     3000
rl         The minimum radius of each cluster                    20
rh         The maximum radius of each cluster                    100
kg         Distance multiplier                                   3
α          The percentage of small clusters among all clusters   0%..100%

In our algorithm, we have assumed that these synthetic data points are stored on disk based on the order (h) of the Hilbert curve. So, we divide the data space into rectangular blocks by using the Hilbert curve. Then, we partition these synthetic data points into m blocks and assign each synthetic data point to the corresponding disk block on the Hilbert curve. For the Hilbert curve in the two-dimensional space, the order h of the curve yields 2^(2h) (= 2^h × 2^h) sequential numbers that linearly order the spatial data points. If each sequential number is represented as one disk block, the maximum order of the Hilbert curve affects the execution time: the higher the order h is, the longer the execution time will be. Moreover, different values of the parameter h may affect the quality of the clustering.
So, we must first decide on an appropriate parameter h of the Hilbert curve to divide our data sets. (Note that the points which are assigned to the same disk block are linked.) In our simulation, we use a Hilbert curve of order 7 to order these data points; i.e., there are 16384 (= 2^7 × 2^7) disk blocks used to store these data points. We use this input structure to run our algorithm, and compare the execution time and the clustering quality with those of other clustering algorithms.

4.3 Simulation Results

In our simulation results, the performance measures which we are concerned with are the CPU execution time of the computation, the sensitivity to parameters, and the quality of the clustering results. First, we make a comparison of the execution time for three synthetic data sets. Next, we perform a sensitivity analysis with different parameters. Finally, we use some special data sets to analyze the execution time and the quality of the clustering results for these different clustering algorithms. For the comparison of the first two performance measures, the data space is 1024 × 1024, and we took the average over 1000 runs for each data set containing 10000 to 30000 data points. For the last performance measure, the data space is 512 × 512, and we took the average over 1000 runs for each data set containing 6000 to 10000 data points.

4.3.1 Time Scalability

The first performance measure of our experiments was the execution time of these algorithms in clustering various synthetic data sets. The time is presented in milliseconds. Three different synthetic data sets were generated by the generator. Furthermore, when a single parameter is varied, the default settings are used for the remaining parameters. Table 4.2 presents the default settings and the different generator settings for these synthetic data sets. Figure 4.1 visualizes the actual clusters of these data sets by plotting them. DS1 is a data set in which each cluster has the same radius and the same number of data points in it.
We use data set DS1 to observe the relationship between the execution time and the number of clusters (w), varied from 7 to 20. DS2 is a data set in which each cluster has the same radius, and the number of clusters is fixed. We use data set DS2 to observe the relationship between the execution time and the number of data points (n) in each cluster, varied from 1000 to 3000.

Table 4.2 Data sets used in the simulation

Data set          Generator settings
Default settings  w = 10, nl = nh = 1500, rl = rh = 60, kg = 3
DS1               w = 7..20, nl = nh = 1500, rl = rh = 60, kg = 3
DS2               w = 10, nl = 1000, nh = 3000, rl = rh = 60, kg = 3
DS3               w = 10, nl = nh = 1500, rl = 20, rh = 100, kg = 3

DS3 is a data set in which the number of clusters and the number of data points in each cluster are fixed. We use data set DS3 to observe the relationship between the execution time and the radius of each cluster. In data set DS3, we separate the clusters into two groups by their radius (r): a cluster with radius 20 ≤ r ≤ 60 is a small cluster, and a cluster with radius 60 < r ≤ 100 is a big cluster. We use a parameter α to control the percentage of small clusters; as α is increased, the number of small clusters is increased. We analyze the results for these data sets as follows.

Increasing the Number of Clusters: For the data set DS1 shown in Figure 4.1-(a), Figure 4.2 shows the comparison of the execution time between the k-means algorithm and our algorithm, and the detailed information is shown in Table 4.3. From this result, we observe that our algorithm requires less time than the k-means algorithm. Moreover, as the number of clusters (w) is increased, the execution time of the k-means algorithm increases quickly, while the execution time of our algorithm increases slowly. Note that when the number of clusters is increased in data set DS1, the total number of data points (N) is also increased.
Increasing the Number of Data Points in Each Cluster: For the data set DS2 shown in Figure 4.1-(b), Figure 4.3 shows the comparison of the execution time between the k-means algorithm and our algorithm, and the detailed information is shown in Table 4.4. From this result, we observe that our algorithm requires less time than the k-means algorithm. Moreover, as the number of data points (n) in each cluster is increased, the execution time of the k-means algorithm increases quickly, while the execution time of our algorithm is almost stable. Note that when the number of data points in each cluster is increased in data set DS2, the total number of data points (N) is also increased.

[Figure 4.1: The actual data sets used in our first experiments: (a) DS1; (b) DS2; (c) DS3.]

[Figure 4.2: A comparison of the execution time (DS1).]

Table 4.3 A comparison of the execution time (DS1)

Number of clusters (w)   The k-means algorithm   Our algorithm
7                        947.37                  312.70
10                       2071.87                 328.10
13                       3793.27                 342.30
17                       8651.17                 362.50
20                       15062.47                379.60

[Figure 4.3: A comparison of the execution time (DS2).]

The Different Radiuses of Clusters: For the data set DS3 shown in Figure 4.1-(c), Figure 4.4 shows the comparison of the execution time between the k-means algorithm and our algorithm, and the detailed information is shown in Table 4.5. From this result, we observe that our algorithm requires less time than the k-means algorithm. Moreover, as α is increased, the execution time of the k-means algorithm decreases, while the execution time of our algorithm decreases slowly.
Because the number of small clusters is increased, the number of centroid calculations in the k-means algorithm is decreased, and thus its execution time is also decreased. From the above simulation results, we observe that the execution time of the k-means algorithm is affected largely by the total number of data points. When the number of data points is increased from 10000 to 30000, the execution time of the k-means algorithm increases quickly.

Table 4.4 A comparison of the execution time (DS2)

  Number of data points in each cluster (n)   1000     1500     2000     2500     3000
  the k-means algorithm                       1268.77  2803.03  4249.70  5769.13  8226.57
  our algorithm                               379.80   337.60   337.40   334.60   343.80

Figure 4.4 A comparison of the execution time (DS3)

Table 4.5 A comparison of the execution time (DS3)

  the percentage of small clusters (α)   0%       20%      40%      60%      80%      100%
  the k-means algorithm                  3139.20  2974.90  2564.15  2388.15  2118.00  1680.90
  our algorithm                          358.33   346.00   327.93   313.73   310.46   307.33

The execution time of our algorithm is largely affected by the number of blocks which contain data points, which is one of the main properties of the grid-based approach and is also the main reason it is able to support large spatial databases, as mentioned in Chapter 1.3. The more blocks contain data points, the more blocks must be checked, and the longer the execution time will be. Therefore, from the results of our algorithm, in the case of the input data set DS1, since the radius of each cluster is the same, the increase of the number of clusters implies that the number of blocks which should be checked is also increased. Consequently, the execution time on DS1 increases as w (the number of clusters) is increased. For data set DS2, the radius of each cluster is also the same.
Because only the density of each cluster is increased, which may not affect the number of blocks that should be checked, it does not largely affect the execution time of our algorithm. Therefore, the execution time on DS2 is almost stable as n (the number of data points in each cluster) is increased. In data set DS3, the result is similar to data set DS1. When the radius of the clusters is increased, the number of blocks which should be checked is also increased. Consequently, the execution time on DS3 increases as the number of big clusters is increased.

The Degenerated Case: In the previous simulation results, the k-means algorithm deals with the data points, while our algorithm deals with the blocks. Now, let's consider a degenerated input case in which each block contains only one data point. In this way, both algorithms deal with data points only. For this case, we use data set DS3 and degenerate the data space to 128 × 128, i.e., the order of the Hilbert curve h = 7. In our algorithm, we also divide the data space into 128 × 128 blocks; that is, we let one block store only one data point. In this way, we can consider that our algorithm deals with the data points, the same as the k-means algorithm, instead of blocks. The simulation result is shown in Figure 4.5, and the detailed information is shown in Table 4.6. We observe that in this degenerated case, the execution time of our algorithm is still shorter than that of the k-means algorithm.

Figure 4.5 A comparison of the execution time (the degenerated case)

4.3.2 Sensitivity to Parameters

Next, we study the sensitivity of the k-means algorithm and our algorithm to the change of some parameters. We generate the data set with parameters w = 5, n = 1500, and 40 ≤ r ≤ 80 for our experiment.
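The degenerated case fixes the Hilbert curve order at h = 7, so the 128 × 128 blocks are linearized by their Hilbert index. The thesis does not reproduce the conversion routine in this chapter; the following is the standard iterative coordinate-to-index conversion for a curve of a given order, shown here as a sketch of what such a mapping looks like.

```python
def xy2d(order, x, y):
    """Return the Hilbert-curve index of block (x, y) on a 2**order x 2**order grid."""
    n = 1 << order                      # grid side length (128 when order = 7)
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)    # which quadrant, in curve order
        if ry == 0:                     # rotate/reflect into the canonical orientation
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Sorting blocks by xy2d(7, x, y) produces a linear order in which consecutive blocks are always grid-adjacent, which is exactly the locality property that lets the algorithm treat nearby blocks as neighbors in one dimension.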
Because the parameters of concern are different for these two algorithms, we discuss the parameters of the two algorithms separately.

Table 4.6 A comparison of the execution time (the degenerated case)

  The total number of data points (N)   4500   5400    6300    7200    8100    9000    9900
  the k-means algorithm                 688.2  1000.0  1134.4  1484.6  1544.0  1862.4  2156.4
  our algorithm                         631.2  755.0   862.6   931.2   1003.0  1071.3  1123.2

The k-means algorithm: In the k-means algorithm, the parameter k affects the quality of the clustering result. Figure 4.6 shows the clusters analyzed by the k-means algorithm when k is varied from 2 and 5 to 7. Note that in this simulation, we always let the number of clusters which we generate initially (w) be 5. As the figure illustrates, when k is not equal to the real number of clusters (w), the algorithm merges two real clusters into one cluster (k = 2 < 5), or splits one real cluster into two parts (k = 7 > 5). Figure 4.6 tells us that when the parameter k of the k-means algorithm is not assigned the real number of clusters (w), the quality of the k-means clustering is decreased.

Our algorithm: Figure 4.7 illustrates the execution time of our algorithm as the order of the Hilbert curve is varied from 5 and 6 to 7, and the detailed information is shown in Table 4.7. From this result, we observe that when h is increased, the execution time of our algorithm is increased, because when the order h is increased, the total number of blocks which should be checked is also increased. The execution time increases by about 11% when the order of the Hilbert curve is increased by 1.

Figure 4.6 A comparison of the quality of the clustering of the k-means algorithm under a different parameter k: (a) k = 2; (b) k = 5; (c) k = 7.
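The sensitivity to k can be reproduced with a plain k-means (Lloyd's algorithm) sketch. This is a generic illustration, not the implementation used in the simulation, and the helper name kmeans is our own; with k smaller than the true number of clusters, nearby clusters are forced to merge, and with k larger, a real cluster is split.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns a cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins its nearest centre
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
            )
        # update step: move each centre to the mean of its members
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels
```

Running this with k equal to the true number of well-separated clusters recovers them; any other k produces the merge/split behavior shown in Figure 4.6.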
Figure 4.7 A comparison of the execution time under a different order h of our algorithm

Table 4.7 A comparison of the execution time under a different order h of our algorithm

  The total number of data points (N)   10000  15000  20000  25000  30000
  h = 7                                 300.1  309.4  323.4  344.0  356.2
  h = 6                                 284.4  282.9  290.6  298.4  303.4
  h = 5                                 256.3  256.3  261.0  270.6  268.8

4.3.3 Special Data Sets

Finally, we experiment with some special data sets containing data points in two dimensions [6, 10, 14, 24]. A particularly challenging feature of these data sets is that the clusters are very close to each other and have different densities and shapes. The size of these data sets ranges from 6,000 to 10,000 data points, and their features are indicated as follows. The first data set, SDS1, as shown in Figure 4.8-(a), has five clusters with different sizes, shapes, and densities. The second data set, SDS2, as shown in Figure 4.8-(b), contains two clusters which form two concentric rings; the small ring is contained in the big ring, and different regions of the clusters have different densities [14]. The third data set, SDS3, as shown in Figure 4.8-(c), has two clusters of the same shape and density, but different orientations and positions. The fourth data set, SDS4, as shown in Figure 4.8-(d), contains big and small ellipsoids in two rows. The fifth data set, SDS5, as shown in Figure 4.8-(e), has concave shapes: a small curve is contained in another big curve. The sixth data set, SDS6, as shown in Figure 4.8-(f), has two clusters of similar shape, but different sizes. The seventh data set, SDS7, as shown in Figure 4.8-(g), has four clusters of special shapes, different densities, and different orientations. The eighth data set, SDS8, as shown in Figure 4.8-(h), has two rings; each ring is composed of some small circles, and the circles in different rings have different radii.
Figure 4.8 The special data sets: (a) SDS1; (b) SDS2; (c) SDS3; (d) SDS4; (e) SDS5; (f) SDS6; (g) SDS7; (h) SDS8.

Table 4.8 Execution time (in milliseconds) for different special data sets

                          SDS1   SDS2   SDS3    SDS4
  Number of data          8000   7500   10000   7000
  the k-means algorithm   3140   1407   1110    1187
  our algorithm           297    313    313     312

                          SDS5   SDS6   SDS7    SDS8
  Number of data          6000   6500   7000    7600
  the k-means algorithm   2141   1125   1031    3172
  our algorithm           259    312    328     328

A comparison of the execution time between the k-means algorithm and our algorithm is shown in Table 4.8. We observe that on average, the k-means algorithm is 3 to 10 times slower than our algorithm. We also visualize the results of running both algorithms on these data sets to compare the quality of clustering. The results on the special data sets are shown in the following way: the points in different clusters are represented by different colors and different cluster numbers, so data points that belong to the same cluster have both the same color and the same cluster number.

Figure 4.9 shows the clusters analyzed by these two algorithms for special data set SDS1. As expected, since the k-means algorithm uses a centroid-based method for clustering the data points, it cannot distinguish between the big and small clusters. It splits the larger cluster while merging the two smaller clusters adjacent to it. Moreover, it merges the two ellipsoids, because it cannot handle the elongated shapes. But our algorithm successfully discovers these clusters in SDS1. Figure 4.10 shows the clusters analyzed by these two algorithms for special data set SDS2. The k-means algorithm splits one concentric ring into three parts, and each part merges pieces of the outer and the inner rings into a single cluster.

Figure 4.9 The result of SDS1: (a) the k-means algorithm; (b) our algorithm.
The other concentric ring belongs to another cluster. But our algorithm successfully identifies each ring as a separate cluster. Figure 4.11 shows the clusters analyzed by these two algorithms for special data set SDS3. The k-means algorithm separates the extreme portion of the upper cluster and merges this part with the lower cluster. But our algorithm separates the clusters in special data set SDS3 well. Figure 4.12 shows the clusters analyzed by these two algorithms for special data set SDS4. The k-means algorithm separates the extreme portions of the elongated clusters and merges them with some close clusters. But our algorithm can find the right clusters with the right parameter settings. Figure 4.13 shows the clusters analyzed by these two algorithms for special data set SDS5. The k-means algorithm splits these two clusters into right and left parts; the right parts of the two clusters merge into one cluster, and similarly for the left parts. But our algorithm identifies each curve as a separate cluster. Figure 4.14 shows the clusters analyzed by these two algorithms for special data set SDS6. As with SDS1, the k-means algorithm cannot distinguish between the big and small clusters. It splits the larger cluster into upper and lower parts, and merges the lower part with the other small cluster. But our algorithm distinguishes these two clusters very well.

Figure 4.10 The result of SDS2: (a) the k-means algorithm; (b) our algorithm.

Figure 4.15 shows the clusters analyzed by these two algorithms for special data set SDS7. The k-means algorithm separates some real clusters into two parts, and merges two real clusters into one cluster. But our algorithm can distinguish these clusters with arbitrary shapes in SDS7. Figure 4.16 shows the clusters analyzed by these two algorithms for special data set SDS8. The k-means algorithm cannot distinguish big and small circles.
It separates some big circles into two parts and merges small circles into one cluster, because the k-means algorithm cannot deal with clusters of different sizes. But our algorithm can distinguish these clusters of different sizes very well.

Figure 4.11 The result of SDS3: (a) the k-means algorithm; (b) our algorithm.

Figure 4.12 The result of SDS4: (a) the k-means algorithm; (b) our algorithm.

Figure 4.13 The result of SDS5: (a) the k-means algorithm; (b) our algorithm.

Figure 4.14 The result of SDS6: (a) the k-means algorithm; (b) our algorithm.

Figure 4.15 The result of SDS7: (a) the k-means algorithm; (b) our algorithm.

Figure 4.16 The result of SDS8: (a) the k-means algorithm; (b) our algorithm.

For the issues which a good spatial clustering algorithm should address, we can summarize them into six requirements. A comparison of the approaches against these requirements is shown in Table 4.9.

Table 4.9 A comparison

                               The partitioning  The hierarchical  The density-based      The grid-based         Our
                               approach          approach          approach               approach               algorithm
  (1) Scalable                 Not well          Some well         Some well              Well                   Well
  (2) Handle arbitrary         Not completely    Some algorithms   Better than the        Better than the        Yes
      shaped clusters                            can               hierarchical approach  hierarchical approach
  (3) Independent of data      Yes               Some not          Yes                    Yes                    Yes
      input order
  (4) No a-priori knowledge    Required          Required          Some not               Some not               Required
      of inputs required
  (5) Insensitive to noises    Not completely    Some partially    Yes                    Yes                    Not completely
  (6) Handle higher            No                Some can          No                     Not well               No
      dimensionality
It can be seen that none of them can match all the requirements [16]. However, our proposed algorithm can match most of the requirements.

CHAPTER V Conclusion

In this chapter, we give a summary of the thesis and point out some future directions.

5.1 Summary

Spatial data mining has become more important because it can discover interesting relationships and characteristics in large spatial databases [19]. However, finding an efficient algorithm for spatial data mining is an important problem to be solved, because the amount of spatial data is increasing exponentially. In spatial data mining, clustering is a useful technique for discovering interesting data and patterns in the implicit data. Clustering can help construct a meaningful partitioning of a large set of data points, and cluster analysis can find a convenient and valid organization of the data set. Another difficulty in spatial data mining is that there is no total ordering among spatial data points that preserves spatial proximity. A space-filling curve provides a way to order the data points of a grid while preserving the distances among the data points in the high-dimensional data space. So, in this thesis, we address the problems of traditional clustering algorithms, which either favor clusters with spherical shapes and similar sizes, or cannot handle large databases. Then, we propose a new clustering algorithm that meets these requirements: it scales well for large databases without sacrificing clustering quality. In Chapter 2, we have reviewed some related works on several well-known approaches to clustering data points in a spatial database, including the k-means [15], CLARANS [19], HAC [26], CURE [10], DBSCAN [6], and STING [27] algorithms. In Chapter 3, we have presented the proposed new clustering method, which uses the Hilbert curve to store our spatial data points.
The basic idea is to use the Hilbert curve to preserve distance: data points which are close in the 2-D space, and which represent similar data, should be stored close together in the linear order. This method can also minimize the disk access effort and provide high speed for clustering. In Chapter 4, from our simulation, we have made a comparison of the execution time between different clustering algorithms, namely the k-means algorithm and our algorithm. We have shown that the execution time of our proposed algorithm is shorter than that of the other algorithms. Furthermore, we have shown that our algorithm achieves a higher quality of clustering than the k-means algorithm; it can deal with the special data sets in our simulation. Moreover, our algorithm can handle large spatial databases efficiently: when the number of data points is increased, the execution time of our algorithm is only slightly affected.

5.2 Future Work

So far, we have only considered data sets in which the data points are centralized. How to efficiently process large amounts of noise in the data set is a future research topic. Moreover, another topic is a clustering algorithm which can work effectively for high d, where d is the number of dimensions; the behavior of our algorithm in high-dimensional data spaces should be investigated. Furthermore, in our method, we only consider the two main (vertical and horizontal) directions. How to find diagonal-direction blocks to be merged, which may improve the execution time and the quality of the clustering results, is another direction for future work.

BIBLIOGRAPHY

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," ACM SIGMOD Conf., pp. 94-105, 1998.

[2] M. Ankerst, M. M. Breunig, H. P. Kriegel, and J. Sander, "OPTICS: Ordering Points To Identify the Clustering Structure," ACM SIGMOD Conf., pp. 49-60, 1999.

[3] A. Ben-Dor, R. Shamir, and Z. Yakhini, "Clustering Gene Expression Patterns," Proc.
of the 3rd Annual Int. Conf. on Computational Molecular Biology, pp. 33-42, 1999.

[4] P. Berkhin, "Survey of Clustering Data Mining Techniques," Accrue Software, Inc., 2002.

[5] M. S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview from Database Perspective," IEEE Trans. on Knowledge and Data Eng., Vol. 8, No. 6, pp. 866-883, Dec. 1996.

[6] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. of the 2nd Int. Conf. on KDD, pp. 226-231, 1996.

[7] C. Faloutsos, "Gray Codes for Partial Match and Range Queries," IEEE Trans. on Software Eng., Vol. 14, No. 10, pp. 1381-1393, 1988.

[8] C. Faloutsos and S. Roseman, "Fractals for Secondary Key Retrieval," ACM SIGACT-SIGMOD-SIGART Symposium on PODS, pp. 247-252, 1989.

[9] S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes," Int. Conf. on Data Eng., pp. 512-521, 1999.

[10] S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Information Systems, Vol. 26, No. 1, pp. 35-58, March 2001.

[11] H. V. Jagadish, "Linear Clustering of Objects with Multiple Attributes," ACM SIGMOD Conf., pp. 332-342, 1990.

[12] A. K. Jain and R. C. Dubes, "Algorithms for Clustering Data," Prentice Hall, Englewood Cliffs, New Jersey, 1988.

[13] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323, Sept. 1999.

[14] G. Karypis, E. H. Han, and V. Kumar, "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling," IEEE Computer, Vol. 32, No. 8, pp. 68-75, Aug. 1999.

[15] L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis," John Wiley & Sons, Inc., New York, 1990.

[16] E. Kolatch, "Clustering Algorithms for Spatial Databases: A Survey," Dept. of Computer Science, University of Maryland, College Park, 2001.

[17] B. Lent, A.
Swami, and J. Widom, "Clustering Association Rules," Proc. of the 13th Int. Conf. on Data Eng., pp. 220-231, 1997.

[18] B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz, "Analysis of the Clustering Properties of the Hilbert Space-Filling Curve," IEEE Trans. on Knowledge and Data Eng., Vol. 13, No. 1, pp. 124-141, Jan. 2001.

[19] R. T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proc. of the 20th VLDB Conf., pp. 144-155, 1994.

[20] R. T. Ng and J. Han, "CLARANS: A Method for Clustering Objects for Spatial Data Mining," IEEE Trans. on Knowledge and Data Eng., Vol. 14, No. 5, pp. 1003-1016, Sept./Oct. 2002.

[21] C. F. Olson, "Parallel Algorithms for Hierarchical Clustering," Parallel Computing, Vol. 21, No. 8, pp. 1313-1325, Aug. 1995.

[22] J. A. Orenstein and T. H. Merrett, "A Class of Data Structures for Associative Searching," Proc. Symp. on PODS, pp. 181-190, 1984.

[23] J. A. Orenstein, "Spatial Query Processing in an Object-Oriented Database System," Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 326-336, 1986.

[24] G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases," Proc. of the 24th VLDB Conf., pp. 428-439, 1998.

[25] L. Y. Tseng and S. B. Yang, "A Genetic Approach to the Automatic Clustering Problem," Pattern Recognition, Vol. 34, No. 2, pp. 415-424, Feb. 2001.

[26] E. M. Voorhees, "Implementing Agglomerative Hierarchical Clustering Algorithms for Use in Document Retrieval," Information Proc. & Management, pp. 465-476, 1986.

[27] W. Wang, J. Yang, and R. Muntz, "STING: A Statistical Information Grid Approach to Spatial Data Mining," Proc. of the 23rd VLDB Conf., pp. 186-195, 1997.

[28] C. P. Wei, Y. H. Lee, and C. M. Hsu, "Empirical Comparison of Fast Clustering Algorithms for Large Data Sets," Proc. of the 33rd Hawaii Int. Conf. on System Sciences, Maui, Hawaii, Jan. 2000.

[29] O. R. Zaiane, A. Foss, C. H.
Lee, and W. Wang, "On Data Clustering Analysis: Scalability, Constraints, and Validation," Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 28-39, 2002.

[30] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," ACM SIGMOD Int. Conf. on Management of Data, pp. 103-114, 1996.