USING CLUSTERING METHODS IN
GEOSPATIAL INFORMATION SYSTEMS
Xin Wang and Jing Wang
Department of Geomatics Engineering, Schulich School of Engineering
University of Calgary, Calgary, Alberta
Spatial clustering is the process of grouping similar objects based on their distance, connectivity, or relative density in space. It has been employed in the field of spatial analysis for years. In order to select the proper spatial clustering methods for geospatial information systems, we need to consider the characteristics of
different clustering methods, relative to the objectives that we are trying to achieve. In this paper, we give a
detailed discussion of different types of clustering methods from a data mining perspective. An analysis of the
advantages and limitations of some classical clustering methods is given. Subsequently, we discuss applying
spatial clustering methods as part of geospatial information systems, with respect to distance functions, data
models, non-spatial attributes and performance.
1. Introduction
A Geospatial Information System (GIS) is a
computer-based information system for both managing geographical data and for using these data to
solve spatial problems [Lo and Yeung 2007]. Rapid
growth is occurring in the number and the size of
GIS applications, including geo-marketing, traffic
control, and environmental studies [Han et al. 2001].
Spatial clustering is the process of grouping similar
objects based on their distance, connectivity, or relative density in space [Han et al. 2001]. This has
been employed for spatial analysis over a number of
years. Currently it is commonly used in such diverse
fields as disease surveillance, spatial epidemiology,
population genetics, landscape ecology, crime
analysis, as well as in many other fields [Jacquez
2008]. Therefore, spatial clustering is potentially a
very useful tool for spatial analysis in GIS.
Various clustering methods have been proposed
in both the area of spatial data mining and the area of
geospatial analysis [Agrawl et al. 1998; Ester et al.
1996; Estivill-Castro and Lee 2000a; Estivill-Castro
and Lee 2000b; Gaffney et al. 2006; Gaffney and
Smyth 1999; Kaymak and Setnes 2002; Klawonn and
Hoppner 2003; Lee et al. 2007; Martino et al. 2008;
Nanni and Pedreschi 2006; Mu and Wang 2008; Ng
and Han 1994; Sander et al. 1998; Stefanakis 2007;
Tung et al. 2001a; Tung et al. 2001b; Wang and
Hamilton 2003; Wang et al. 2004; Wang et al. 1997;
Zaïane and Lee 2002; Zhang et al. 1996; Zhou et al.
2005]. In spatial data mining, clustering methods can
be classified into different categories. In terms of the
techniques adopted to define clusters, clustering
algorithms have been categorized into four broad
categories: hierarchical, partitional, density-based,
and grid-based [Han et al. 2001]. Hierarchical
clustering methods group objects into a tree-like
structure that progressively reduces the search space.
Partitional clustering methods partition the points
into clusters, such that the points in a cluster are
more similar to each other than to points in different
clusters. Density-based clustering methods can find
arbitrarily shaped clusters that are ‘grown’ from
seeds and established once the density in the clusters’ neighborhoods exceeds certain density thresholds. Grid-based clustering methods divide the
information spaces into a finite number of grid cells
and then cluster objects based on this structure.
In terms of domain constraints, clustering
methods also include a large set of constraint-based
spatial clustering. These are often used by GIS applications. A constraint describes the incorporation of background or prior knowledge into the spatial clustering. For example, suppose a criminologist
wanted to analyze the connections between the road
networks and crime rates. We may assume that rivers
and lakes act as obstacles for criminals, while major
streets and highways act as facilitators. Therefore the
simple Euclidean distances between the locations do
not provide an appropriate basis for clustering. For
example, if rivers and lakes exist in the area, they
should not be ignored because they can block the
reachability from side to side. In addition, since traveling on the major streets in the city is faster than
traveling on other streets, the length of these streets
should be shortened for this analysis. Ignoring the
role of both obstacles (rivers and lakes) and facilitators (major streets for driving) when performing
clustering may lead to distorted or useless results.
As discussed above, a wide variety of clustering
methods have been developed and used over the past
several years. There are typically two challenges that
users encounter when they need to use geospatial
clustering methods as part of their application: the
first is how to choose the proper clustering methods,
and the second is that, if current methods are not
suitable for the application, how we can extend the
current methods.
How to select and extend the proper spatial
clustering method for a specific geospatial information system is not a simple topic. This is due to
the differing requirements of applications and due to
the different types of data being used. In Section 2 of
this paper we give a detailed overview of different
types of clustering methods. An analysis of advantages and limitations of specific clustering methods
is given. The clustering methods discussed in this
section are drawn from data mining research since
these fundamental clustering methods can be easily
extended and adopted for use in GIS. Additionally,
these methods usually show good performance on
large spatial datasets, which is critical to some GIS
applications that require a fast response. Since spatial constraints play an important role in spatial clustering, we discuss a specific type of constraint-based clustering method (i.e., spatial clustering for
obstacle and facilitator constraints) in Section 3. In
Section 4, we discuss how to select and extend the
spatial clustering methods in geospatial information
systems with respect to four key issues. Finally, in
Section 5, we conclude with a summary of the paper
as well as important directions and priorities for
future research.
2. Classification of
Clustering Methods
In this section, we provide a detailed review of
some classical clustering methods in terms of clustering techniques, discussing both their pros and
cons. Many new clustering methods have been proposed in recent years, most of which are based on
classical methods. Therefore, in this paper, we
selected a classic method for each category with
the aim of providing readers with a general
overview of clustering methods.
Based on the techniques adopted to define
clusters, clustering algorithms have been categorized into four broad categories: hierarchical, partitional, density-based, and grid-based [Han et al.
2001]. These methods have been widely used in
geospatial information systems, such as point pattern analysis, hot spot detection, and regionalization. In this section, we discuss some classic
methods for each category. The example methods
chosen for each category are efficient and scalable
on large spatial datasets.
2.1 Hierarchical Clustering Methods
Hierarchical clustering methods can be
either agglomerative or divisive. An agglomerative
method starts with each point as a separate cluster,
and successively performs merging until a stopping
criterion is met. A divisive method begins with all
points in a single cluster and performs splitting
until a stopping criterion is met. The result of a
hierarchical clustering method is a tree of clusters
called a dendrogram. A hierarchical clustering
method is useful in geospatial information applications for clustering at different spatial scales.
However, the method has difficulties in setting up
merge and split decisions, because each decision
requires the examination and evaluation of a large
number of objects or clusters [Han et al. 2001].
An example of a hierarchical clustering method
is BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies) [Zhang et al. 1996].
BIRCH is an integrated hierarchical clustering
method. It incrementally and dynamically clusters
incoming multi-dimensional metric data points to try
to produce the best quality clustering within specified memory and time constraints.
The concepts of ‘clustering feature’ and ‘CF
tree’ are crucial to the BIRCH approach. They are
used to summarize representations. A clustering
feature (CF) is a triple summary of the information
that we maintain about a cluster. Specifically, given N d-dimensional data points X_i (i = 1, 2, …, N) in a cluster, the CF entry of the cluster is {N, LS, SS}, where N is the number of data points in the cluster, LS is the linear sum of the N data points, i.e., LS = Σ_{i=1}^{N} X_i, and SS is the square sum of the N data points, i.e., SS = Σ_{i=1}^{N} X_i². A CF tree is a height-balanced tree that stores the CFs for
a hierarchical clustering. It requires two parameters, a branching factor B and a threshold T. The
branching factor B defines the maximum number of
entries in non-leaf nodes. Figure 1 depicts a CF-tree
structure. In this tree structure, each non-leaf node
has at most B entries and each leaf node has at most
L entries. All entries in leaf nodes must satisfy the
threshold T. That is, the radius of each entry in a
leaf node must be less than T. These two parameters influence the size of the CF-tree and thus the
effectiveness of clustering.
Figure 1: BIRCH’s CF tree structure.

The BIRCH clustering method includes four phases [Zhang et al. 1996]. The main task of Phase 1 is to scan all data and build an initial in-memory CF tree using the available amount of memory and recycling space on the disk. This CF tree is intended to reflect the clustering information of the data set as finely as possible under the existing memory limit. The CF tree is built dynamically as objects are inserted, hence the method is incremental. An object is inserted into the closest leaf entry (subcluster). After the insertion of a new object, information about it is passed toward the root of the tree. The size of the tree can be changed by modifying the threshold. If the need for memory is greater than the available memory, a smaller threshold can be specified and the process of rebuilding the tree begins without the necessity of rereading all of the objects or points. Phase 2 is optional. Since there is potentially a gap between the size of the Phase 1 results and the input range of Phase 3, Phase 2 serves as a cushion that bridges this gap. Phase 2 scans the leaf entries in the initial CF tree while removing more outliers and grouping crowded subclusters into larger ones. Phase 3 uses a global or semi-global algorithm to cluster all leaf entries. After Phase 3, a set of clusters that captures the major distribution pattern in the data is obtained. Phase 4 uses the centroids of the clusters produced by Phase 3 as seeds, and redistributes the data points to their closest seeds in order to obtain a set of new clusters. The complexity of the algorithm is O(n), where n is the number of points in the dataset.

BIRCH can typically achieve good clustering with a single scan of the data, and improve the quality further with a few additional iterations. BIRCH was also the first clustering algorithm proposed in the database research area to handle “noise” effectively. However, since each node in a CF-tree can hold only a limited number of entries, a CF-tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well because it uses the notion of radius to control the boundary of a cluster [Han et al. 2001].
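To make this concrete, the following is a minimal sketch in Python (our illustration, not the authors’ implementation) of how a clustering feature is computed and how the additivity of the triple {N, LS, SS} lets two subclusters be merged, and their centroid and radius derived, without revisiting the raw points:

```python
import numpy as np

def make_cf(points):
    # CF = (N, LS, SS): point count, linear sum (a vector), square sum (a scalar).
    pts = np.asarray(points, dtype=float)
    return (len(pts), pts.sum(axis=0), (pts ** 2).sum())

def merge_cf(cf1, cf2):
    # CF additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2).
    return (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])

def centroid(cf):
    n, ls, _ = cf
    return ls / n

def radius(cf):
    # Root mean squared distance of the points to the centroid,
    # derived from the CF alone: R^2 = SS/N - ||LS/N||^2.
    n, ls, ss = cf
    c = ls / n
    return np.sqrt(max(ss / n - (c ** 2).sum(), 0.0))

cf = merge_cf(make_cf([[1, 1], [2, 2]]), make_cf([[3, 3]]))
print(centroid(cf), radius(cf))   # [2. 2.] 1.154...
```

Because merging is a component-wise sum, inserting an object only requires updating the CFs along one root-to-leaf path, which is what makes Phase 1, described above, a single-scan, incremental process.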
2.2 Partitional Clustering Methods
Partitional clustering methods determine a
partition for dividing a group of points into different clusters, such that the points in a cluster are
more similar to each other than to points in different clusters. These methods start with some arbitrary initial clusters and iteratively reallocate points
into clusters until a stopping criterion is met. They
tend to find clusters with hyperspherical shapes.
Examples of partitional clustering algorithms
include k-means, PAM [Kaufman and Rousseeuw
1990], CLARA [Kaufman and Rousseeuw 1990],
CLARANS [Ng and Han 1994], and EM [Kaufman
and Rousseeuw 1990]. Partitional clustering methods, like k-means, are usually not scalable. In this
section, we choose CLARANS as the example,
which makes the clustering process more scalable.
CLARANS is a spatial clustering method
based on randomized search [Ng and Han 1994]. It
was the first clustering method proposed for spatial
data mining and it led to a significant improvement
in efficiency for clustering large spatial datasets. It
finds a medoid for each of k clusters. Informally, a
medoid is the center of a cluster. The great insight
behind CLARANS is that the clustering process can
be described as searching a graph where every node
is a potential solution (i.e., a partition of the points
into k clusters). In this graph, each node is represented by a set of k medoids. Two nodes are neighbors if their partitions differ by only one point.
CLARANS performs the following steps. First, it
selects an arbitrary possible clustering node current. Next, the algorithm randomly picks a neighbor of current and compares the quality of clusters
at current and the neighbor node. A swap between
current and a neighbor is made if such a swap
would result in an improvement of the clustering
quality. The number of neighbors of a single node to
be randomly tried is restricted by a parameter called
maxneighbor. If a swap happens, CLARANS moves
to the neighbor’s node and the process is started
again; otherwise the current clustering produces a
local optimum. If the local optimum is found,
CLARANS starts with a new randomly selected
node in search of a new local optimum. The number
of local optima to be searched is bounded by a
parameter called numlocal. Based upon
CLARANS, two spatial data mining algorithms
were developed: a spatially dominant approach,
called SD_CLARANS, and a non-spatially dominant approach, called NSD_CLARANS. In
SD_CLARANS, the spatial component(s) of the
relevant data items are collected and clustered
using CLARANS. Then, the algorithm performs
attribute-oriented induction on the non-spatial
description of points in each cluster.
NSD_CLARANS first applies attribute-oriented
generalization to the non-spatial attributes to produce a number of generalized tuples. Then, for each
such generalized tuple, all spatial components are
collected and clustered using CLARANS.
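The following sketch illustrates this randomized search, assuming Euclidean distance and an in-memory point array; the parameters numlocal and maxneighbor are those described above, while all other names are illustrative:

```python
import random
import numpy as np

def total_cost(points, medoids):
    # Sum of distances from every point to its nearest medoid.
    d = np.linalg.norm(points[:, None, :] - points[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans(points, k, numlocal=2, maxneighbor=50, seed=0):
    points = np.asarray(points, dtype=float)
    rng = random.Random(seed)
    n = len(points)
    best_medoids, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = rng.sample(range(n), k)        # arbitrary start node
        cost = total_cost(points, current)
        tried = 0
        while tried < maxneighbor:
            # A neighbor node differs from the current one in exactly one medoid.
            neighbor = list(current)
            neighbor[rng.randrange(k)] = rng.choice(
                [i for i in range(n) if i not in current])
            neighbor_cost = total_cost(points, neighbor)
            if neighbor_cost < cost:             # swap improves clustering quality
                current, cost, tried = neighbor, neighbor_cost, 0
            else:
                tried += 1
        if cost < best_cost:                     # local optimum for this restart
            best_medoids, best_cost = current, cost
    return best_medoids, best_cost
```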
CLARANS suffers from some weaknesses [Ng
and Han 1994]. First, it assumes that the points to be
clustered are all stored in main memory. Second, the
run time of the algorithm is prohibitive on large
datasets. In [Ng and Han 1994], its authors claim
that CLARANS is “linearly proportional to the size
of dataset”. However, in the worst case, O(nᵏ) steps
may be needed to find a local optimum, where n is
the size of the dataset, and k is the desired number of
clusters. The time complexity of CLARANS is
Ω(kn²) in the best case, and O(nᵏ) in the worst case.
2.3 Density-Based Clustering Methods
Density-based clustering methods try to
find clusters based on the density of points in
regions. Dense regions that are reachable from each
other are merged to form clusters. Density-based
clustering methods excel at finding clusters of arbitrary shapes. Examples of density-based clustering
methods include OPTICS [Ankerst et al. 1999],
DBSCAN [Ester et al. 1996] and DBRS [Wang and
Hamilton 2003]. DBSCAN was the first density-based spatial clustering method proposed [Ester et
al. 1996], and can be easily extended for different
applications. To define a new cluster or to extend
an existing cluster, a neighborhood around a point
of a given radius (Eps) must contain at least a minimum number of points (MinPts), the minimum
density for the neighborhood. DBSCAN uses an
efficient spatial access data structure, called an R*-tree, to retrieve the neighborhood of a point from
the dataset. The average-case time complexity of
DBSCAN is O(n log n). DBSCAN can follow arbitrarily shaped clusters [Ester et al. 1996].
Given a dataset D, a distance function dist, and
parameters Eps and MinPts, the following definitions (adapted from [Ester et al. 1996]) are used to
specify DBSCAN.
Definition 1 The Eps-neighborhood (or neighborhood) of a point p, denoted by NEps(p), is
defined by NEps(p) = { q∈ D | dist(p,q) ≤ Eps}.
Definition 2 A point p is directly density-reachable
from a point q if (1) p∈ NEps(q) and (2)
|NEps(q)| ≥ MinPts.
Definition 3 A point p is density-reachable from
a point q if there is a chain of points p₁, …, pₙ,
p₁ = q, pₙ = p such that pᵢ₊₁ is directly density-reachable from pᵢ for 1 ≤ i ≤ n−1.
Definition 4 A point p is density-connected to a
point q if there is a point o such that both p and
q are density-reachable from o.
Definition 5 A density-based cluster C is a non-empty subset of D satisfying the following
conditions: (1) ∀p, q: if p∈C and q is density-reachable from p, then q∈C; (2) ∀p, q∈C: p is
density-connected to q.
DBSCAN starts from an arbitrary point q. It
begins by performing a region query, which finds the
neighborhood of point q. If the neighborhood is
sparsely populated (i.e., it contains fewer than
MinPts points), then point q is labeled as noise.
Otherwise, a cluster is created and all points in q’s
neighborhood are placed in this cluster. Then the
neighborhood of each of q’s neighbors is examined
to see if it can be added to the cluster. If so, the
process is repeated for every point in this neighborhood, and so on. If a cluster cannot be expanded further, DBSCAN chooses another arbitrary unlabelled
point and repeats the process. This procedure is iterated until all points in the dataset have either been
placed in clusters or labeled as noise. For a dataset
containing n points, n region queries are required.
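The procedure can be summarized in the following sketch, which follows Definitions 1 to 5 but uses a brute-force region query (DBSCAN itself uses an R*-tree to answer each query in O(log n)):

```python
import numpy as np
from collections import deque

NOISE, UNLABELLED = -1, 0

def region_query(points, p, eps):
    # Eps-neighborhood of point p (Definition 1), including p itself.
    return np.where(np.linalg.norm(points - points[p], axis=1) <= eps)[0]

def dbscan(points, eps, min_pts):
    points = np.asarray(points, dtype=float)
    labels = np.zeros(len(points), dtype=int)   # 0 = not yet visited
    cluster_id = 0
    for p in range(len(points)):
        if labels[p] != UNLABELLED:
            continue
        if len(region_query(points, p, eps)) < min_pts:
            labels[p] = NOISE                   # may later become a border point
            continue
        cluster_id += 1                         # p is a core point: start a cluster
        labels[p] = cluster_id
        seeds = deque([p])
        while seeds:                            # expand via density-reachability
            q = seeds.popleft()
            neighbors = region_query(points, q, eps)
            if len(neighbors) < min_pts:
                continue                        # q is a border point, not expanded
            for r in neighbors:
                if labels[r] in (UNLABELLED, NOISE):
                    if labels[r] == UNLABELLED:
                        seeds.append(r)
                    labels[r] = cluster_id
    return labels
```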
Although DBSCAN gives extremely good
results and is efficient in many datasets, it may not
be suitable for the following cases. First, if a dataset
has clusters of widely varying densities, DBSCAN is
not able to handle it efficiently. On such a dataset,
the density of the least dense cluster must be applied
to the whole dataset, regardless of the density of the
other clusters in the dataset. Since all neighbors are
checked, much time may be spent in dense clusters
examining the neighborhoods of all points. Simply
using random sampling to reduce the input size will
not be effective with DBSCAN because the density
of points within clusters can vary so substantially in
a random sample, that density-based clustering
becomes ineffective [Estivill-Castro and Lee 2000a].
Secondly, if non-spatial attributes play a role in
determining the desired clustering result, DBSCAN
is not appropriate, because it does not consider non-spatial attributes in the dataset.
Thirdly, DBSCAN is not suitable for finding
approximate clusters in very large datasets.
DBSCAN starts to create and expand a cluster from
a randomly chosen point. It works on this cluster
thoroughly and accurately until all points in the
cluster have been found. Then another point outside
the cluster is randomly selected and the procedure
is repeated. This method is not suited to being
stopped early for the purpose of settling on an
approximate identification of clusters.
NBC (Neighborhood-Based Clustering algorithm) is a density-based algorithm [Zhou et al.
2005]. It can automatically discover arbitrary shaped
clusters of differing local densities with only one
input parameter k. Two basic neighborhoods are
defined by the algorithm. The k-Neighborhood (kNB) of a point is the set of its k nearest neighbors. The Reverse k-Neighborhood (R-kNB) of a point consists of all other points whose kNB includes the point. Based on
the two neighborhoods, the Neighborhood-Based
Density Factor (NDF) of a point is defined as the
ratio of the size of R-kNB over the size of kNB. The
basic idea of NBC is that the value of NDF of each
point in a cluster should be no less than 1. In other words, the number of points in its reverse k-nearest neighborhood should be no smaller than the number of points in its k-nearest neighborhood. The algorithm starts by calculating the NDF for each point in the dataset, and the clustering follows the same steps as DBSCAN, only with a different density threshold: in DBSCAN, the size of the neighborhood must be at least MinPts, while in NBC the NDF must be no less than one.
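A minimal sketch of the NDF computation, assuming a brute-force k-nearest-neighbor search over an in-memory point array (the actual algorithm uses a cell-based structure to speed up the kNB queries):

```python
import numpy as np

def ndf(points, k):
    # NDF(p) = |R-kNB(p)| / |kNB(p)|; points with NDF >= 1 play the role
    # that core points play in DBSCAN.
    points = np.asarray(points, dtype=float)
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # kNB: indices of each point's k nearest neighbors (position 0 is the
    # point itself, so it is skipped; assumes no duplicate points).
    knb = np.argsort(d, axis=1)[:, 1:k + 1]
    # |R-kNB|: how often each point appears in the kNB of other points.
    rknb_size = np.bincount(knb.ravel(), minlength=n)
    return rknb_size / k
```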
Although NBC gives very good results and is efficient on many datasets, it may not be suitable for the following cases. First, when the distance function for
applications needs to be Euclidean distance, NBC is neither efficient nor suitable. As mentioned, the neighborhood queries are the most time-consuming step. In NBC, the NDF is calculated for each point. The complexity of finding the k-nearest neighborhood of a point is O(log n), where n is the size of the dataset. The
paper uses a cell-based approach to support rapid
kNB query processing. However, finding the neighborhood in a rectangular cell implicitly uses the
Manhattan distance function. Secondly, when the
ranges of point coordinates in the datasets are large
and the clusters are very close, NBC may not produce good results. The size of the cell is determined
by the parameter k and the range of the point coordinates in the dataset. However, when the range of the
point coordinates is too large, the cell generated will
be too big to find the correct clusters.
2.4 Grid-Based Clustering Methods
Grid-based clustering methods quantize the
clustering space into a finite number of cells and
then perform the required operations on the quantized space. Cells containing more than a certain
number of points are considered to be dense.
Contiguous dense cells are connected to form clusters. Examples of grid-based clustering methods
include CLIQUE [Agrawl et al. 1998] and STING
[Wang et al. 1997].
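The general scheme can be sketched as follows, with an illustrative fixed cell size and a flood fill over 4-connected dense cells (the published methods differ in how cells and densities are defined):

```python
import numpy as np
from collections import deque

def grid_cluster(points, cell_size, density):
    # Quantize points into square cells and keep the cells holding at
    # least `density` points.
    points = np.asarray(points, dtype=float)
    cells = {}
    for x, y in points:
        cells.setdefault((int(x // cell_size), int(y // cell_size)), []).append((x, y))
    dense = {c for c, pts in cells.items() if len(pts) >= density}
    # Join contiguous dense cells into clusters via flood fill.
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        group, queue = [], deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            group.append((cx, cy))
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(group)
    return clusters
```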
STING (Statistical Information Grid) is a statistical information grid-based approach for spatial
databases. The overall spatial universe for the data is
divided into rectangular cells. Several levels of such
rectangular cells are used, corresponding to different
resolutions. The cells at different resolutions form a
hierarchical structure. Statistical information of each
cell is calculated and stored and the information is
used to answer queries.
To perform clustering on such a data structure,
users must first supply the density level as an input
parameter. Using this parameter, the following top-down method is used to find regions with sufficient
density. First, a layer within the hierarchical structure is selected, where the query answering process
is to start. The layer typically contains a small number of cells. For each cell in the current layer, we
compute the confidence interval (or estimated range of probability) that the cell will be relevant (i.e., related) to the result of the clustering. Cells
that do not meet the confidence condition are
labeled as not relevant and removed from further
consideration. The relevant cells are then refined to
a finer resolution by repeating the procedure at the
next level of the structure. The process is repeated
until the bottom layer is reached. At that time, if the
query specification is met, the regions of relevant
cells are retrieved, and further processed until they
meet the requirements of the query.
STING is a very efficient algorithm. It goes
through the database once to compute the statistical
parameters of the cells. Hence the time complexity
for generating clusters is O(n), where n is the total
number of data objects. If the hierarchical structure
fits into main memory, the cost of constructing it
can be ignored. Otherwise, the cost of construction is O(k log k). After generating the hierarchical structure,
the query processing time is O(K), where K is the
number of cells at bottom layers of the structure,
which is usually much smaller than n.
The quality of the clusters generated by
STING is heavily dependent on the granularity of
the bottom level of the hierarchical structure [Han
et al. 2001]. If the grid structure at the bottom level
is very fine, the cost of generating the grid structure
will be high; however, if it is too coarse, the quality of the cluster will be reduced. Moreover, STING
does not consider the spatial relationship between the children and their neighboring cells when constructing the parent cell. As a result, the boundaries of the clusters are either horizontal or vertical,
and no diagonal boundary is detected. This may
lower the quality and accuracy of the clusters
despite the fast processing time of the algorithm.
3. Constraint-Based
Clustering Methods
Besides the general clustering methods, constraint-based spatial clustering methods are sometimes more desirable for real geospatial information systems, since they can lead to effective and fruitful results by capturing application semantics [Han et al. 1999; Tung et al. 2001a].
Depending on the nature of the constraints and
applications, the constrained clustering problem
includes four categories: constraints on individual
objects, obstacle objects as constraints, clustering
parameters as constraints, and constraints imposed
on each individual cluster [Han et al. 1999; Tung et
al. 2001a]. A constraint on individual objects confines the set of objects to be clustered. For example,
the objects to be clustered might be restricted to
luxury mansions valued at more than one million
dollars. Using obstacle objects as constraints means
that physical obstacles such as mountains and
rivers could affect the ‘reachability’ among data
objects. Constraints can also restrict the possible
values of the parameters to a clustering algorithm
(e.g., the number of result clusters might be restricted to five). Constraints can also be imposed on an
individual cluster. For example, the number of
points in each cluster might be restricted to be at
most 50. [Tung et al. 2001a] proposes an algorithm
for the last type of constraints.
In this section, we present one type of constrained-based clustering method (i.e., physical
obstacles and facilitators as constraints). Typically,
a clustering task consists of separating a set of
objects into different groups according to a measure of goodness, which can vary depending on the
application. A common measure of goodness is
Euclidean distance (i.e., straight-line distance).
However, in many applications of clustering to spatial datasets, the presence of obstacles and facilitators makes Euclidean distance an ineffective measure. An obstacle is a physical object that obstructs
the reachability among the points, such as fences,
rivers, and highways (when walking), and a facilitator is a physical object that enhances the reachability among points, such as bridges, tunnels, and
highways (when driving). Both obstacles and facilitators are assumed to be modeled as simple polygons with no data points inside them. Constraints
of obstacles and facilitators exist in many geospatial datasets. Handling constraints due to obstacles
and facilitators can increase the effectiveness of
data mining for geospatial information systems by
capturing application semantics. In the following,
we will discuss four different methods to handle
physical obstacle and facilitator constraints.
3.1 COD_CLARANS
COD_CLARANS [Tung et al. 2001b] was the
first obstacle constraint partitioning clustering
method. It is a modified version of the CLARANS
partitioning algorithm [Ng and Han 1994] adapted
for clustering in the presence of obstacles.
CLARANS finds a medoid for each of k clusters. A
medoid is the center of a cluster that minimizes the
sum of the distances to all objects in the cluster. The
great insight behind CLARANS is that the clustering
process entails searching a graph where every node
is a potential solution (i.e., a partition of the points
into k clusters). CLARANS randomly searches the
graph until a local minimum is found. The property
being minimized is the total Euclidean distance
between every object and the medoid of its cluster.
The main idea of COD_CLARANS is to
replace the Euclidean distance between two points in
CLARANS with the unobstructed distance (called
the “obstructed distance” by Tung et al.), which is
the length of the shortest Euclidean path between two points that does not intersect any obstacles. The
calculation of the unobstructed distance includes
three preprocessing steps. The first step is to build a
visibility graph. A visibility graph is an undirected
graph, where the vertices correspond to vertices of
the obstacles and the edges are generated if and only
if the connection between the corresponding vertices
of the obstacles does not intersect any obstacles.
The second preprocessing step is micro-clustering. A micro-cluster is a compressed representation
of a group of one or more points that are so close
together that they are likely to belong to the same
cluster. Instead of representing each point in the
micro-cluster individually, COD_CLARANS represents them using their center, and a count of the
points in the micro-cluster. If a point is not close to
any other points, it is represented as a separate
micro-cluster.
The third preprocessing step creates three spatial join indexes, the VV index, the MV index, and
the MM index. Creating the VV index computes the
all-pairs shortest paths in the visibility graph; that is,
for every pair of obstacle vertices, the VV index
gives the shortest path that does not intersect any
obstacle. Creating the MV index computes an index
entry for every pair of a micro-cluster and an obstacle vertex. The MM index is created by computing
the unobstructed distance between every pair of
micro-clusters. The size of the MM index may be too
huge to be materialized. These three indices are used
whenever it is necessary to calculate the distance
between any two points. Tung et al. ignored the
computational cost of the preprocessing steps in
their performance evaluation of COD_CLARANS.
Figure 2: Micro-Clustering is not applicable for clusters of varying densities.
After preprocessing, COD_CLARANS works
efficiently on a large number of obstacles.
However, the algorithm may not be suitable for the
following cases. First, if the dataset has varying
densities, COD_CLARANS’s micro-clustering
approach may not be suitable for the sparse clusters. For the clusters shown in Figure 2, the left
cluster is much denser than the other two clusters.
If we use a small radius (as shown by the larger circles in Figure 2) to form three micro-clusters, the
micro-clustering process does not have much effect
on the majority of the points in the two clusters on
the right side. However, if we were to pick a larger
radius, the micro-clustering process might mistakenly merge the two clusters on the right side into
one cluster, with the obstacle inside the micro-cluster or with a micro-cluster intersecting the obstacle.
Second, as given, COD_CLARANS was not
designed to handle intersecting obstacles. As a result, the model used in preprocessing for determining visibility and building the spatial join index
would need to be changed significantly. Third, the
algorithm does not take into consideration facilitator constraints that connect data objects. A simple
modification of the distance function in
COD_CLARANS is inadequate for handling facilitators because the model used in preprocessing for
determining visibility and building the spatial join
index would need to be changed significantly.
3.2 AUTOCLUST+
AUTOCLUST+ [Estivill-Castro and Lee
2000b] is an enhanced version of AUTOCLUST
[Estivill-Castro and Lee 2000a], which handles
obstacles. AUTOCLUST+ is an effective graph-based clustering algorithm. With AUTOCLUST+,
the user does not need to supply parameter values.
To understand the AUTOCLUST+ algorithm,
the terms Voronoi diagram and Delaunay diagram
should first be understood. A Voronoi diagram is a
partitioning of a plane with n points into n convex
polygons such that each polygon contains exactly
one point and every location in a given polygon is
closer to its point than to any other point. A
Delaunay diagram is the dual of a Voronoi diagram
[Agrawl et al. 1998]. In a Delaunay diagram, the
same points as in the original data are used and an
edge, called a Delaunay edge, is present between
two points if and only if their corresponding
Voronoi regions share a boundary.
AUTOCLUST+ uses four steps. In the first
step, a Delaunay diagram is constructed. In the second step, for each point, the standard deviation of
the lengths of the Delaunay edges directly connected (incident) to the point is calculated. A global
variation indicator, the average of these standard
deviations, is then calculated before considering
any obstacles. In the third step, any Delaunay edge
that intersects any obstacle is deleted and local
strength indicators for each point are calculated.
The local strength indicator for a point is the mean
length of all Delaunay edges incident to the point.
In the fourth step, AUTOCLUST is applied to the
planar graph, resulting from the previous steps with
the calculated global variation indicator and local
strength indicators.
Let us consider the third step in more detail. If
an original Delaunay edge traverses an obstacle,
the length of the unobstructed distance (called the
“detour distance” by Estivill-Castro and Lee
[2000b]) between the two end points is approximated by a detour path (i.e., a minimal length path
in the Delaunay diagram that does not intersect any
obstacles between the two points). Figure 3 shows
an example illustrating the unobstructed distance
and the corresponding detour path, as well as a case
where the detour path cannot be substituted for the
unobstructed distance. The dotted lines represent
Delaunay edges obstructed by the obstacle. The
thick, solid lines in Figure 3(a) and (b) illustrate the
unobstructed distance and the detour path between
point 2 and point 4, respectively. For the case
shown in Figure 3(c), the unobstructed distance
between point 2 and point 4 is the same as that
observed in Figure 3(a). However, no corresponding detour path can be found in the Delaunay diagram. In cases where a detour path between two
points cannot be found, AUTOCLUST+ does not
include any estimate whatsoever of the unobstructed distance between the two points, when calculating their local strength indicators. Ignoring the distance between two points in this manner could
decrease the quality of the clustering results.
As well, AUTOCLUST+ does not consider
facilitator constraints that connect points. Since the
points connected by facilitators usually do not
share a boundary in Voronoi regions, no simple
modification of the distance function in AUTOCLUST+ would allow it to handle facilitators.

Figure 3: A case where AUTOCLUST+ may fail. (a) Detour distance (b) Detour path (c) No detour path.
3.3 DBCLuC
DBCLuC [Zaïane and Lee 2002], which is
based on DBSCAN, is a density-based clustering
algorithm that can deal with obstacle constraints.
Instead of finding the shortest path between the two
objects by traversing the edges of the obstacles as
in COD_CLARANS, DBCLuC determines visibility by using obstruction lines. An obstruction line,
which is constructed during preprocessing, is an
internal edge that maintains visible spaces for the
obstacle polygons. In preprocessing, a convexity
test is first applied to all obstructing polygons. The
convexity test includes two steps: first, for each
vertex of the polygon, an assessment edge is constructed. An assessment edge of a vertex is a line
segment whose two end vertices respectively lie in
two adjacent edges of the vertex, and which does
not intersect the polygon. Second, each vertex is labeled as either a convex vertex or a concave vertex. A vertex is convex if its assessment edge is interior to the polygon; otherwise, it is concave. If there is a concave vertex in a polygon, the polygon is concave; otherwise, it is convex. For each polygon,
the convex vertices are bi-partitioned (into two
groups that are as equal in size as possible) according to an enumeration order, such as clockwise
from a specified vertex. For convex polygons, an
obstruction line is drawn between each convex vertex in one partition and the corresponding convex
vertex in the other partition. To construct a set of
obstruction lines for a concave polygon, an admission test is performed for each possible obstruction
line candidate. If an obstruction line candidate is not admissible (i.e., totally outside of the obstacle polygon, or intersecting it), a set of obstruction lines is generated by finding the shortest path between the two vertices, which then replaces the
candidate. Then the admission test is repeated on
each line in the set. The maximum number of
obstruction lines that can be generated for an obstacle is equal to the number of edges in the obstacle.
DBCLuC can also deal with facilitators, otherwise known by Zaïane and Lee as “bridges” or
“crossings”. In dealing with facilitators, entry
edges and entrances are identified. A non-empty
subset of the edges of every facilitator is assumed
as entry edges, where the facilitator can be entered
[Lee 2003]. The lengths of facilitators are ignored.
For each entry edge, a series of locations, called
entrances, is defined from one vertex of the edge to the other vertex, such that each consecutive pair of
entrances is separated by a distance less than or
equal to the radius of the neighborhood area.
DBCLuC starts clustering from entrances and maximally expands a set of clusters such that all data
points that are reachable by facilitators are grouped
together. Then it continues processing the remaining
data points using the ‘obstacles’ method.
Once preprocessing to find obstruction lines
has been performed, DBCLuC is an effective density-based clustering approach for large datasets
containing obstacles with many edges. However,
preprocessing is expensive for concave polygons
with many vertices, because its complexity is O(v²), where v is the number of convex vertices in
all obstacles. Also, since any two points are defined
as being reachable if the unobstructed path between
them is calculated using obstruction lines, the algorithm does not work correctly for the two examples
shown in Figure 4. In both diagrams, the circle represents the neighborhood area of the central point.
For the flattened diamond shaped obstacle shown
in Figure 4(a), the obstruction line shown satisfies
the definition of an obstruction line that is provided in [Zaïane and Lee 2002]. The algorithm incorrectly considers point p to be in the central region,
because the distance via the obstruction line is less
than the radius. In Figure 4(b), point p and the central point are blocked by an obstacle with the
obstruction lines shown. The algorithm considers
point p to be unreachable from the central point,
which is incorrect, because the shortest distance
between these points is actually less than the radius.

Figure 4: Two Cases Where DBCLuC May Fail.
3.4 DBRS+
DBRS+ [Wang and Hamilton 2003] is a constrained density-based spatial clustering method. It
first chooses one point at random from the dataset
and retrieves its neighborhood without considering
obstacles and facilitators. The neighborhood consists
of the selected point, which becomes the central
point, and all points within a specified distance Eps
from it, according to a distance function, such as
Euclidean distance or Manhattan distance. The
neighborhood area is the area within a distance of
Eps from the central point. If any obstacles appear in
the neighborhood area, DBRS+ removes all neighbors separated from the central point by obstacles.
Thus, it keeps only those points in the same region
as the central point. A region is a maximal contiguous portion of a neighborhood area that does not
contain any obstacles. The central region of a neighborhood area is the region that contains the central
point. If no obstacles appear in the neighborhood
area, the only region is the original neighborhood
area. If any facilitators appear in the region, DBRS+
first determines the entrances available for each
facilitator. Then for each possible exit of any such
facilitator, it finds the corresponding extra points that
can be reached. The points in the central region,
together with any reachable extra points, form the
new neighborhood. If the new neighborhood is not
dense enough (i.e., if the number of points is less than a user-specified threshold MinPts), the central point of the neighborhood is classified as noise.
Otherwise, if the new neighborhood intersects one or
more existing clusters, the existing clusters are
merged together with the new neighborhoods to
form a single cluster, but if the new neighborhood
does not intersect any existing clusters, it becomes a
new cluster. These steps are iterated until all points
have been clustered or classified as noise.
DBRS+ has four major strengths. First, it can
handle both obstacles, such as fences, rivers, and
highways (when walking), and facilitators, such as
bridges, tunnels, major streets, and highways
(when driving), whereas most previous methods
can only deal with obstacles. Second, DBRS+ can
deal with any combination of intersecting obstacles
and facilitators. No previous method considers
intersecting obstacles, which are common in real
data. For example, highways or rivers often cross
each other, and bridges and tunnels often cross
rivers. Although previous methods can merge
obstacles during preprocessing, the resulting polygons cannot be guaranteed to be simple polygons,
and these methods do not work on complex polygons. Third, DBRS+ is simple and efficient. It does
not require any preprocessing, because the constraints are dealt with during the clustering process.
Almost all previous methods include complicated
preprocessing. Fourth, due to capabilities inherited
from DBRS, DBRS+ can work on datasets containing clusters with widely varying shapes, datasets
having significant non-spatial attributes, and
datasets comprising more than 100 000 points.
4. Discussion on Integrating
Clustering Methods in GIS
Geospatial clustering focuses on generalization or classification by aggregating similar geographic objects into clusters. When integrating clustering
methods in GIS, the method that is the most appropriate depends heavily on the application goal, the
trade-off between cluster quality and clustering
performance, and the characteristics of data [Han et
al. 2001]. Due to the specialization of geospatial
information systems, the following issues should
be carefully considered when applying the clustering methods in GIS. The following discussion
focuses on four main issues concerning integrating
clustering methods in GIS: Distance Functions,
Similarity on Non-Spatial Attributes, Data Types
and Performance.
4.1 Distance Functions
The first law of geography indicates that everything is related to everything else, but near things
are more related than distant things [Tobler 1970].
Distance is a numerical description of how similar
two objects are in space. According to [Tobler
1970], we usually use geometric distance as the
scale of measurement in the ideal model. The geometric distances are defined by exact mathematical
formulas to reflect the physical length between two
objects in defined coordinate systems, such as
Euclidean Distance and Manhattan Distance. Table
1 gives a list of geometric distance functions. Each
of these functions, because of their geometry,
implies a different view of the data.
When we use clustering methods in a large geographical area, the Earth’s spherical shape cannot be
ignored. Geographical distance is the distance
measured along the surface of the earth. These
methods calculate distances between points that are
defined by geographical coordinates in terms of latitude and longitude, such as Great Circle distance and
Vincenty’s formulae. As for a large geographical
area, the factor of map projections needs to be taken
into account. Map projection is the process of
mathematical transformation of locations in the
three-dimensional space of the Earth’s surface onto
the two-dimensional space of a map sheet [Lo and Yeung 2007]. Since some of the properties of the spherical
Earth are lost after projection, the areas and shapes of
the features as they appear on paper are also altered.
Consequently, after projection, the distance and direction between individual features often cannot be maintained. Depending on the properties that are
preserved, map projections can be classified as
equal-area, conformal, equidistant and true direction
map projections. Among them, the equidistant map
projection is the most important one because it results
in little distortion of distance. In the case where location information is represented on spherical and ellipsoidal surfaces, geographical distance can also be
used as the distance function for clustering methods.
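As an example of such a geographical distance function, the following sketch computes the great-circle distance using the haversine formula (one common formulation; Vincenty’s formulae are more accurate on the ellipsoid), with 6371 km as a commonly used mean Earth radius:

```python
import math

EARTH_RADIUS_KM = 6371.0

def great_circle_distance(lat1, lon1, lat2, lon2):
    # Haversine formula; inputs are latitude/longitude in degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Calgary to Toronto: roughly 2700 km along the Earth's surface.
print(great_circle_distance(51.05, -114.07, 43.65, -79.38))
```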
The distance functions for geographical applications should be defined properly before using
spatial clustering methods. For example, to find the
shortest traveling path in a city, the distance function could be defined based on the road network,
speed limits, road direction, and traffic volume. The
number of traffic lights and stop signs can also be
considered. From this perspective, the constraint-based clustering methods for physical objects discussed in Section 3 are extended from general clustering methods to include newly defined physical-constraint distance functions. For example, for
physical obstacles, the unobstructed distance
between two points x and y in a neighborhood area N, denoted by dist_uno^N(x, y), can be defined as

dist_uno^N(x, y) = d(x, y), if x and y are in the same neighborhood region within N;
dist_uno^N(x, y) = ∞, otherwise.
The neighborhood area is the area within a
distance of Eps from the central point. A neighborhood region used in the definition is a maximal
contiguous portion of a neighborhood area that
does not contain any obstacle. Then we can apply
different density-based spatial clustering methods
discussed in Section 2.
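A direct transcription of this piecewise definition might look as follows, where same_region is a placeholder for the obstacle test (in practice, a check of whether the straight segment between x and y stays inside one neighborhood region):

```python
import math

def dist_uno(x, y, same_region):
    # Unobstructed distance: ordinary distance when the two points share a
    # neighborhood region, infinity when an obstacle separates them.
    if same_region(x, y):
        return math.dist(x, y)
    return math.inf
```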
Table 1: Selected geometric distance functions between points x and y.

Distance Function      Definition
Euclidean distance     d(x, y) = ( Σ_{i=1}^{n} (x_i − y_i)² )^{1/2}
Manhattan distance     d(x, y) = Σ_{i=1}^{n} |x_i − y_i|
Tchebyschev distance   d(x, y) = max_{i=1,…,n} |x_i − y_i|
Minkowski distance     d(x, y) = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p}, p > 0
Canberra distance      d(x, y) = Σ_{i=1}^{n} |x_i − y_i| / (x_i + y_i), where x_i and y_i are positive
Cosine similarity      d(x, y) = cos(x, y) = Σ_{i=1}^{n} x_i y_i / (‖x‖ ‖y‖)
Net-DBSCAN [Stefanakis 2007] is an example
of a clustering method used in GIS to apply the
above idea. It extends DBSCAN to cluster the
nodes of a dynamic linear network. In a dynamic
linear network, each network edge has a cost value
for traversing it and a set of accessible temporal
intervals. In this method, the distance function
between two nodes is defined as the minimum
accumulated cost of traversing the network edges
between the two nodes during the accessible temporal intervals of all edges. The distance function
can be represented as follows:

dist(x, y) = Min Σ cost(e), summed over the edges e on a path between x and y, when the edges on the path between x and y are accessible;
dist(x, y) = ∞, otherwise.
Based on this distance function, the algorithm
first computes the accessible nodes for each node
on the network. The neighborhood of a node is a set
of nodes in the network with an accumulated cost
less than or equal to Eps. Then an initial cluster is
retrieved with the given Eps and MinPts. Clusters
are expanded for each node in the neighborhood.
The cluster expansion process is the same as
DBSCAN. The only difference is that the distance
definition is expanded to the linear network with
temporal intervals.
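This distance function amounts to a shortest-path computation restricted to accessible edges, which the following sketch illustrates with Dijkstra’s algorithm; accessibility is modelled here as a boolean per edge rather than the temporal intervals used by Net-DBSCAN:

```python
import heapq

def network_distance(graph, source, target):
    # graph: {node: [(neighbor, cost, accessible), ...]}
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, cost, accessible in graph.get(u, []):
            if not accessible:
                continue                  # inaccessible edges contribute infinity
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")                   # no accessible path between the nodes
```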
In summary, the distance function should be
tailored to meet different clustering purposes. Once
the distance function is determined for GIS applications, various clustering methods can be extended.
4.2 Similarity on
Non-Spatial Attributes
Spatial clustering has previously been based
on only the topological features of the data.
However, one or more non-spatial attributes may
have a significant influence on the results of the
clustering process. For example, in image processing, the general procedure for region-based segmentation compares a pixel with its immediate surrounding neighbors. We may not want to include a
pixel in a region if its non-spatial attribute does not
match our requirements. That is, region growing is
heavily based on not only the location of pixels but
also the non-spatial attributes or “a priori” knowledge [Cramariuc et al. 1997]. As another example,
suppose we want to cluster soil samples that were
taken from different sites according to their types
and locations. If a selected site has soil of a certain
type, and some neighboring sites have the same
type of soil, while others have different types, we
may wish to decide on the viability of a cluster
around the selected site according to both the number of neighbors with the same type of soil and the
percentage of neighbors with the same type.
One option is to handle non-spatial attributes and spatial attributes in two separate steps, as described in CLARANS. The other option is to handle the non-spatial attributes and spatial attributes
together in the clustering process. The similarity
function for non-spatial attributes and distance
functions for spatial attributes, are handled at the
same time in order to define the overall similarity
between objects.
In GIS applications, we can classify non-spatial attributes (alphanumeric attributes) into two
categories: numerical and non-numerical. Different
methods to handle them have been proposed in the
data mining research field. For numerical non-spatial attributes, we usually transform the numerical
values into some standardized values, and then calculate the similarity by using one of the distance
functions mentioned in 4.1. The Cosine similarity
function in Table 1 is one of the most popular methods. For the non-numerical attributes, new functions are defined to transform non-numerical values
to numerical. For example, GDBSCAN [Sander et
al. 1998] takes into account the non-spatial attributes of an object as a “weight” attribute, defined by
the weighted cardinality of the singleton containing
the object. The weight could be the size of the area
of the clustering object, or a calculated value from
several non-spatial attributes. In another instance,
DBRS [Wang and Hamilton 2003] introduces the
concept of purity to handle the non-numerical attributes of objects in the neighborhood. It is defined in terms of the number of matching values within the neighborhood, expressed as a percentage. For non-spatial attributes this can
avoid creating clusters of points with different values, even though they may be close to each other
[Wang and Hamilton 2003]. The following shows
the result of applying DBRS to Calgary (Canada)
theft crime data. The dataset has 1044 records from
January 31 to February 15, 2009 in Calgary. This
includes two types of theft crime: “Theft from
Vehicle” and “Other Theft”. Each record has the spatial attributes of longitude and latitude and the non-spatial category attribute of ‘type of the theft’. In this example, Calgary police officers want to find the dense clusters of theft crime based on location and theft
type. Two different kinds of crimes are considered
separately while performing the clustering. In other
words, only the crime records with the same nonspatial attribute are clustered together. Figure 5
shows the results when the search distance Eps is
500 meters, MinPts is 10, and the purity of the neighborhood is more than 60%. In total, 15 clusters were found. Among them, 11 clusters are of the
“Other Theft” type and 4 clusters are of the “Theft
from Vehicle” type. In the close up of the downtown
area in Figure 5, it is evident that a large cluster from
two kinds of crime types is separated into two clusters, although the clustering areas overlap each other.

Figure 5: Using DBRS on the Calgary theft crime dataset.
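To illustrate the purity test used in this example, here is a minimal sketch (illustrative names, not the DBRS implementation) that checks whether a neighborhood meets a purity threshold such as the 60% used above:

```python
def purity(neighborhood_values, central_value):
    # Fraction of points in the neighborhood sharing the central point's
    # non-spatial value (e.g., the theft type).
    if not neighborhood_values:
        return 0.0
    matches = sum(1 for v in neighborhood_values if v == central_value)
    return matches / len(neighborhood_values)

values = ["Theft from Vehicle"] * 7 + ["Other Theft"] * 3
print(purity(values, "Theft from Vehicle") >= 0.6)   # True: 70% of values match
```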
4.3 Data Types
Vector data represents spatial data as points,
lines and polygons. Locations of spatial objects or
events are usually represented as point objects. Most
of the clustering methods are designed for point
objects. Recently, the need to cluster line and polygon data, and even moving objects, is increasing.
Simplifying lines and polygons to points usually
does not work for these applications as the direction
and length of lines and the size of polygons can be
greatly distorted or overlooked.
A more practical method is to extend the current
point-centric clustering methods for use with lines
and polygons. For example, a distance based neighborhood is a natural notion of a neighborhood for
point objects, but if clustering spatially extended
objects such as a set of polygons of largely differing
sizes it may be more appropriate to use neighborhood (spatial) predicates like intersects or meets for
finding clusters of polygons. The definition of a cluster in [Ester et al. 1996] (i.e., NEps(o) = {o’ ∈ D| |o –
o’| ≤ Eps}, where o and o’ are points, can be extended to the neighborhood for polygon objects). For an
instance, it can be defined as NEps(o) = {o’ ∈ D| o
{meet, intersect} o’ ≤ Eps}, where o and o’ are two
polygons. By using this mechanism, various clustering methods [Gaffney et al. 2006; Gaffney and Smyth
1999; Lee et al. 2007; Nanni and Pedreschi 2006;
Guo 2008; Mu and Wang 2008] have been improved
for data types besides data points. For example, for the lines or trajectories of a moving object, Lee et al. [2007] proposed TRACLUS, which was adapted from the traditional clustering method DBSCAN [Ester et al. 1996] by defining a distance function
between line segments. For polygon data, Guo
[2008] presented the REDCAP family of six hierarchical regionalization methods based on taking the
length of different edges as the distance function
between polygon clusters.
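As a sketch of such a predicate-based neighborhood, the following uses the touches and intersects predicates of the shapely library in place of a point-distance test:

```python
from shapely.geometry import Polygon

def polygon_neighborhood(o, candidates):
    # Neighbors of polygon o: all polygons that meet (touch) or intersect it.
    return [p for p in candidates if o.touches(p) or o.intersects(p)]

a = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
b = Polygon([(2, 0), (4, 0), (4, 2), (2, 2)])   # shares an edge with a
c = Polygon([(5, 5), (6, 5), (6, 6), (5, 6)])   # disjoint from a
print(len(polygon_neighborhood(a, [b, c])))     # 1
```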
For a raster dataset, the centers of the grid cells can be converted to a set of points, and most clustering
methods can be applied after the transformation.
Grid-based clustering methods can be easily
applied to raster data because we can simply consider the pixels of the raster image as equivalent to
the grids of the clustering methods. Various methods have also been proposed in the fields of remote
sensing and image processing, but these are beyond
the scope of this paper.
4.4 Performance
The performance of applying clustering methods in a GIS is affected by two factors. One is the
efficiency of the clustering algorithms; the other is the
data processing performance of the system.
As spatial databases are usually huge, time complexity is an important issue when applying algorithms to find clusters. Most of the efficient clustering methods have a time complexity of O(nlogn), where n is the size of the dataset. To improve the efficiency of the clustering process, current clustering methods use different spatial data structures. Table 2 lists the complexity of some common clustering algorithms and the spatial data structures they use.
For example, DBRS applies an SR-tree [Katayama and Satoh 1997] to organize the whole dataset into smaller SR-tree nodes, which reduces a neighborhood query from O(n) to O(logn), where n is the size of the dataset.
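The effect of such an index is easy to demonstrate with standard tools. In the sketch below, a KD-tree stands in for the SR-tree; the indexing principle is the same, namely pruning the search space so that a region query does not have to scan all n points.

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(100_000, 2)   # 100 000 random planar points
tree = cKDTree(points)                # build the spatial index once

eps = 0.01
# Indices of all points within eps of the first point, found without
# a full linear scan of the dataset.
neighbors = tree.query_ball_point(points[0], r=eps)
print(len(neighbors))
```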
Clustering techniques have also been used in web-based GIS applications. Because these applications run in the internet environment and users expect the system to respond quickly, response time is the most important criterion when choosing a clustering technique. For example, many web-based GISs use clustering for cartographic generalization (i.e., to generalize map symbols). From Table 2, we can see that grid-based methods provide low time complexity. In a typical implementation, the current map view is divided into several grid cells based on the size of the user's screen; the points in each cell are then clustered based on the pixel distance between points, as sketched below.
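The following is a sketch of this screen-grid generalization; the grid dimensions and the centroid placement of the symbols are our assumptions.

```python
from collections import defaultdict

def grid_cluster(points_px, screen_w, screen_h, n_cols=8, n_rows=6):
    """Group points (already projected to pixel coordinates) by the
    screen-grid cell they fall in; each non-empty cell becomes one
    map symbol placed at the cell's centroid."""
    cw, ch = screen_w / n_cols, screen_h / n_rows
    cells = defaultdict(list)
    for x, y in points_px:
        cells[(int(x // cw), int(y // ch))].append((x, y))
    # One symbol per cell: centroid position plus a point count for sizing.
    return [(sum(p[0] for p in pts) / len(pts),
             sum(p[1] for p in pts) / len(pts),
             len(pts)) for pts in cells.values()]

symbols = grid_cluster([(10, 10), (12, 14), (700, 500)], 800, 600)
print(symbols)  # two symbols: one pair near the origin, one singleton
```

Each non-empty cell yields one symbol whose size can be scaled by its point count, which keeps the number of rendered symbols bounded by the grid size no matter how many points are in view.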
To further improve the response time in a GIS, other techniques and methods have been used. One technique is pre-processing: datasets are pre-clustered at different zoom levels and saved in the form of static images. ClustrMap [2009] is an example of a web-based GIS that uses this pre-processing
method. The clusters are displayed as symbols of different sizes according to the number of points in each cluster. Figure 6 shows the archived clustering maps of visitors to the website www.clustrmap.com from May 1 to June 1, 2009: Figure 6(a) shows the clustering results at the global level and Figure 6(b) shows them at the continental level. Both clustering results are saved on the server side as static images; when users zoom to a certain level, the server returns the corresponding images.
Figure 6: Static clustering maps of visitors to www.clustrmap.com: (a) at the global level; (b) at the continental level.

Table 2: Time complexity of some clustering algorithms.

Clustering Algorithm | Time Complexity | Clustering Method Type | Spatial Data Structure
BIRCH                | O(n)            | Hierarchical           | CF-tree
K-means              | O(nkd)          | Partitional            | n/a
CLARANS              | Ω(kn²)          | Partitional            | n/a
EM                   | O(knd² + kd³)   | Partitional            | n/a
DBSCAN               | O(nlogn)        | Density-based          | R*-tree
DBRS                 | O(nlogn)        | Density-based          | SR-tree
NBC                  | O(nlogn)        | Density-based          | High-dimensional cells
CLIQUE               | O(nd⁴)          | Grid-based             | Hash-tree
STING                | O(n)            | Grid-based             | Multiple-level granularity structure

Where n is the size of the dataset, k is the number of clusters, and d is the dimension of the objects.
5. Conclusions
Spatial clustering, which has been employed for spatial analysis over many years, is the process of grouping similar objects based on their distance, connectivity, or relative density in space. To select a proper spatial clustering method for geospatial information systems, we need to consider the characteristics of the different clustering methods. This paper has reviewed the different types of clustering methods from a data mining perspective, with an analysis of the advantages and limitations of some classical clustering methods. In addition, the paper discusses four important issues that arise when using clustering methods within a GIS, and provides solutions and examples for selecting and extending spatial clustering methods within geospatial information systems with respect to distance functions, non-spatial attributes, data types and system performance.
Domain knowledge can play an important role during method selection and extension. Since
constraint-based clustering methods only consider sharply limited knowledge concerning the domain and the user's goals, they are typically difficult to reuse. In particular, they usually have very restricted means of incorporating domain-related information from non-geospatial attributes. An ontology is a formal explicit specification of a shared conceptualization [Gruber 1993]. It provides domain knowledge relevant to the conceptualization and axioms for reasoning with it. For geospatial clustering, an appropriate ontology must include a rich set of geospatial and clustering concepts. Therefore, using domain ontologies to help users choose the most appropriate algorithms is one key future research topic.
In addition, various methods have been developed for modeling distance relations within the geospatial domain. For instance, Gahegan [1995] proposes a fuzzy logic model for proximity reasoning, in which each proximity expression, such as near or far, has a corresponding fuzzy membership function. Adapting this research on distance relations and other spatial relations to clustering methods will be a useful step toward the development of new clustering methods for GIS applications.
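As a simple illustration of such a membership function, the sigmoid below maps distance to a degree of "nearness"; the functional form and the parameters are our assumptions, not Gahegan's [1995] exact formulation.

```python
import math

def near(distance_m, midpoint=1000.0, steepness=0.005):
    """Fuzzy membership for the proximity term 'near': close to 1 nearby,
    falling smoothly toward 0 as distance grows."""
    return 1.0 / (1.0 + math.exp(steepness * (distance_m - midpoint)))

for d in (100, 1000, 3000):
    print(d, round(near(d), 2))  # e.g. 100 m is 'near' to degree ~0.99
```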
References
Agrawal, R., J. Gehrke, D. Gunopulos, and P. Raghavan.
1998. Automatic Subspace Clustering of High
Dimensional Data for Data Mining Applications,
SIGMOD Record, 27(2), pp.94-105.
Ankerst, M., M. Breunig, H.P. Kriegel, and J. Sander.
1999. OPTICS: Ordering points to identify the clustering structure. Proc. 1999 ACM-SIGMOD Intl.
Conf. Management of Data (SIGMOD’99),
Philadelphia, PA, pp. 49–60.
Cramariuc, B., M. Gabbouj, and J. Astola. 1997.
Clustering Based Region Growing Algorithm for
Color Image Segmentation, Proc. Digital Signal
Processing, Stockholm, Sweden. pp.857–860.
ClustrMap. 2009. http://www.clustrmaps.com/ (accessed May 30, 2009).
Ester, M., H. Kriegel, J. Sander, and X. Xu. 1996. A
Density-Based Algorithm for Discovering Clusters
in Large Spatial Databases with Noise. In:
Proceedings of the Second International Conference
on Knowledge Discovery and Data Mining.
Portland, OR, pp. 226-231.
Estivill-Castro, V., and I.J. Lee. 2000a. AUTOCLUST:
Automatic Clustering via Boundary Extraction for
Mining Massive Point-Data Sets. Proceedings of the
Fifth International Conference On Geocomputation.
University of Greenwich, Medway Campus, U.K.,
pp. 23-25.
Estivill-Castro, V., and I.J. Lee. 2000b. AUTOCLUST+:
Automatic Clustering of Point-Data Sets in the
Presence of Obstacles. Proceedings of the
International Workshop on Temporal, Spatial and
Spatio-Temporal Data Mining. Lyon, France, pp.
133-146.
Gaffney, S., A. Robertson, P. Smyth, S. Camargo, and M.
Ghil. 2006. Probabilistic clustering of extratropical
cyclones using regression mixture models. In
Technical Report, Bren School of Information and
Computer Sciences, University of California, Irvine.
Gaffney, S., and P. Smyth. 1999. Trajectory clustering
with mixtures of regression models. In Proc. 1999
Intl. Conf. Knowledge Discovery and Data Mining
(KDD’99), San Diego, CA, August 1999, pp. 63–72.
Gahegan, M. 1995. Proximity operators for qualitative
spatial reasoning. In COSIT ’95 Proceedings:
Spatial Information Theory: A Theoretical Basis for
GIS. A. U. Frank and W. Kuhn (eds). Berlin:
Springer-Verlag.
Gruber, T.R. 1993. A translation approach to portable
ontologies. Knowledge Acquisition, 5(2), pp. 199-220.
Guo, D. 2008. Regionalization with Dynamically Constrained Agglomerative Clustering and Partitioning (REDCAP), International Journal of Geographical Information Science, 22(7), pp. 801-823.
Han, J., L.V.S. Lakshmanan, and R.T. Ng. 1999.
Constraint-Based Multidimensional Data Mining.
Computer, 32(8), pp. 46-50.
Han, J., M. Kamber, and A.K.H. Tung. 2001. Spatial
clustering methods in data mining: A survey.
Geographic Data Mining and Knowledge
Discovery, Miller H. and Han J. (eds), Taylor and
Francis, 2001.
Jacquez, G.M. 2008. Spatial Cluster Analysis. Chapter
22 In The Handbook of Geographic Information
Science, S. Fotheringham and J. Wilson (Eds.).
Blackwell Publishing, pp. 395-416.
Kaymak, U., and M. Setnes. 2002. Fuzzy Clustering with
Volume Prototype and Adaptive Cluster Merging.
IEEE Trans. on Fuzzy Systems. 10(6), pp.705-712.
Katayama, N., and S. Satoh. 1997. The SR-tree: An Index
Structure for High-Dimensional Nearest Neighbor
Queries. Proceedings of ACM SIGMOD
International Conference on Management of Data,
Tucson, AZ, pp. 369-380.
Kaufman, L., and P.J. Rousseeuw. 1990. Finding Groups
in Data: An Introduction to Cluster Analysis, Wiley.
Klawonn, F., and F. Hoppner. 2003. What is Fuzzy about
Fuzzy Clustering? Understanding and Improving
the Concept of the Fuzzifier. Berthold M.R., Lenz
H.J., Bradley E., Kruse R. and Borgelt C. (eds)
Advances in Intelligent Data Analysis, Berlin:
Springer, pp. 254-264.
Lee, C.H. 2003. Personal communication.
Lee, J.G., J. Han, and K.Y. Whang. 2007. Trajectory clustering: A partition-and-group framework. In Proc.
2007 ACM-SIGMOD Intl. Conf. Management of
Data (SIGMOD’07), Beijing, China, June 2007, pp.
593–604.
Lo, C.P., and A.K.W. Yeung. 2007. Concepts and
Techniques in Geographic Information Systems,
Pearson Prentice Hall.
Martino, F. Di., V. Loia, and S. Sessa. 2008. Extended
fuzzy C-means clustering algorithm for hotspot
events in spatial analysis. International Journal of
Hybrid Intelligent Systems, 5(1), pp. 31-44.
Nanni, M., and D. Pedreschi. 2006. Time-focused clustering of trajectories of moving objects, Journal of
Intelligent Information Systems, 27(3), pp. 267–289.
Mu, L., and F. Wang. 2008. A Scale-Space Clustering
Method: Mitigating the Effect of Scale in the
Analysis of Zone-Based Data, Annals of the
Association of American Geographers, 98(1), pp.
85-101.
Ng, R., and J. Han. 1994. Efficient and Effective
Clustering Method for Spatial Data Mining.
Proceedings of the Twentieth International
Conference on Very Large Data Bases. Santiago,
Chile, pp. 144-155.
Sander, J., M. Ester, H. Kriegel, and X. Xu. 1998. Densitybased Clustering in Spatial Databases: the algorithm
GDBSCAN and its applications. Data Mining and
Knowledge Discovery, 2(2), pp. 169-194.
Stefanakis, E. 2007. NET-DBSCAN: clustering the
nodes of a dynamic linear network. International
Journal of Geographical Information Science,
21(4), pp. 427-442.
Tobler, W. 1970. A computer movie simulating urban
growth in the Detroit region. Economic Geography,
46(2), pp. 234-240.
Tung, A.K.H., J. Hou, L.V.S. Lakshmanan, and R.T. Ng.
2001a. Constraint-Based Clustering in Large
Databases. Proceedings of the Eighth International
Conference on Database Theory, London, U.K, pp.
405-419.
Tung, A.K.H., J. Hou, and J. Han. 2001b. Spatial
Clustering in the Presence of Obstacles. Proceedings
of the Seventeenth International Conference On Data
Engineering. Heidelberg, Germany, pp. 359-367.
Wang, X., and H.J. Hamilton. 2003. DBRS: A Density-Based Spatial Clustering Method with Random Sampling. Proceedings of the Seventh Pacific-Asia Conference on Knowledge Discovery and Data Mining. Seoul, Korea, pp. 563-575.
Wang, X., C. Rostoker, and H.J. Hamilton. 2004. Density-based spatial clustering in the presence of obstacles and facilitators. Proceedings of the Eighth European Conference on Principles and Practice of Knowledge Discovery in Databases, Pisa, Italy, September 2004, pp. 446-458.
Wang, W., J. Yang, and R. Muntz. 1997. STING: A
Statistical Information Grid Approach to Spatial Data
Mining, Proceedings of the twenty-third International
Conference on Very Large Data Bases, Athens,
Greece, pp. 186-195.
Zaïane, O.R., and C.H. Lee. 2002. Clustering Spatial Data
When Facing Physical Constraints. Proceedings of
the IEEE International Conference on Data Mining.
Maebashi City, Japan, pp. 737-740.
Zhang, T., R. Ramakrishnan, and M. Livny. 1996. BIRCH:
An Efficient Data Clustering Method for Very Large
Databases, SIGMOD Record, 25(2), pp. 103-114.
Zhou, S.G., Y. Zhao, J.H. Guan, and J. Huang. 2005. A Neighborhood-Based Clustering Algorithm. Advances in Knowledge Discovery and Data Mining, pp. 361-371.
MS Rec'd 08/09/03
Revised MS rec'd 09/11/12
Authors
Dr. Xin Wang is an assistant professor of
Geomatics Engineering at the University of Calgary.
Dr. Wang received her doctorate in computer science from the University of Regina in 2006, and an M.Eng. in software engineering and a B.Sc. from Northwest University, China. She was with the SaskTel Information Technology Management Department from 2006 to 2007, and was previously a lecturer at East China University of Science and Technology, a software engineer with ASTI Shanghai, and a researcher at Fudan University and the Shanghai Software Centre from 1996 to 1999. Her research interests are spatial data mining, ontology and knowledge engineering in GIS, web GIS, and privacy protection in GIS.
Mr. Jing Wang is an M.Sc. student in the Department of Geomatics Engineering, University of Calgary, specializing in GIS. He graduated with a B.Sc. in Geographical Information Systems from Henan University, China, in 2007. His research interests are spatial analysis, spatial data mining, and web GIS.