USING CLUSTERING METHODS IN
GEOSPATIAL INFORMATION SYSTEMS
Xin Wang and Jing Wang
Department of Geomatics Engineering, Schulich School of Engineering
University of Calgary, Calgary, Alberta
Spatial clustering is the process of grouping similar objects based on their distance, connectivity, or relative density in space. It has been employed in the field of spatial analysis for years. In order to select the proper spatial clustering methods for geospatial information systems, we need to consider the characteristics of
different clustering methods, relative to the objectives that we are trying to achieve. In this paper, we give a
detailed discussion of different types of clustering methods from a data mining perspective. An analysis of the
advantages and limitations of some classical clustering methods is given. Subsequently, we discuss applying
spatial clustering methods as part of geospatial information systems, with respect to distance functions, data
models, non-spatial attributes and performance.
1. Introduction
A Geospatial Information System (GIS) is a
computer-based information system for both managing geographical data and for using these data to
solve spatial problems [Lo and Yeung 2007]. Rapid
growth is occurring in the number and the size of
GIS applications, including geo-marketing, traffic
control, and environmental studies [Han et al. 2001].
Spatial clustering is the process of grouping similar
objects based on their distance, connectivity, or relative density in space [Han et al. 2001]. This has
been employed for spatial analysis over a number of
years. Currently it is commonly used in such diverse
fields as disease surveillance, spatial epidemiology,
population genetics, landscape ecology, crime
analysis, as well as in many other fields [Jacquez
2008]. Therefore, spatial clustering is potentially a
very useful tool for spatial analysis in GIS.
Various clustering methods have been proposed
in both the area of spatial data mining and the area of
geospatial analysis [Agrawl et al. 1998; Ester et al.
1996; Estivill-Castro and Lee 2000a; Estivill-Castro
and Lee 2000b; Gaffney et al. 2006; Gaffney and
Smyth 1999; Kaymak and Setnes 2002; Klawonn and
Hoppner 2003; Lee et al. 2007; Martino et al. 2008;
Nanni and Pedreschi 2006; Mu and Wang 2008; Ng
and Han 1994; Sander et al. 1998; Stefanakis 2007;
Tung et al. 2001a; Tung et al. 2001b; Wang and
Hamilton 2003; Wang et al. 2004; Wang et al. 1997;
Zaïane and Lee 2002; Zhang et al. 1996; Zhou et al.
2005]. In spatial data mining, clustering methods can
be classified into different categories. In terms of the
techniques adopted to define clusters, clustering
algorithms have been categorized into four broad
categories: hierarchical, partitional, density-based,
and grid-based [Han et al. 2001]. Hierarchical
clustering methods group objects into a tree-like
structure that progressively reduces the search space.
Partitional clustering methods partition the points
into clusters, such that the points in a cluster are
more similar to each other than to points in different
clusters. Density-based clustering methods can find
arbitrarily shaped clusters that are ‘grown’ from
seeds and established once the density in the clusters’ neighborhoods exceeds certain density thresholds. Grid-based clustering methods divide the
information spaces into a finite number of grid cells
and then cluster objects based on this structure.
In terms of domain constraints, clustering
methods also include a large set of constraint-based
spatial clustering. These are often used by GIS applications. A constraint describes the incorporation of background or prior knowledge into the spatial clustering. For example, suppose a criminologist
wanted to analyze the connections between the road
networks and crime rates. We may assume that rivers
and lakes act as obstacles for criminals, while major
streets and highways act as facilitators. Therefore the
simple Euclidean distances between the locations do
not provide an appropriate basis for clustering. For
example, if rivers and lakes exist in the area, they
should not be ignored because they can block the
reachability from side to side. In addition, since traveling on the major streets in the city is faster than
traveling on other streets, the length of these streets
should be shortened for this analysis. Ignoring the
role of both obstacles (rivers and lakes) and facilitators (major streets for driving) when performing
clustering may lead to distorted or useless results.
As discussed above, a wide variety of clustering
methods have been developed and used over the past
several years. There are typically two challenges that
users encounter when they need to use geospatial
clustering methods as part of their application: the
first is how to choose the proper clustering methods,
and the second is that, if current methods are not
suitable for the application, how we can extend the
current methods.
How to select and extend the proper spatial
clustering method for a specific geospatial information system is not a simple topic. This is due to
the differing requirements of applications and due to
the different types of data being used. In Section 2 of
this paper we give a detailed overview of different
types of clustering methods. An analysis of advantages and limitations of specific clustering methods
is given. The clustering methods discussed in this
section are drawn from data mining research since
these fundamental clustering methods can be easily
extended and adopted for use in GIS. Additionally,
these methods usually show good performance on
large spatial datasets, which is critical to some GIS
applications that require a fast response. Since spatial constraints play an important role in spatial clustering, we discuss a specific type of constraint-based clustering method (i.e., spatial clustering for
obstacle and facilitator constraints) in Section 3. In
Section 4, we discuss how to select and extend the
spatial clustering methods in geospatial information
systems with respect to four key issues. Finally, in
Section 5, we conclude with a summary of the paper
as well as important directions and priorities for
future research.
2. Classification of
Clustering Methods
In this section, we provide a detailed review of
some classical clustering methods in terms of clustering techniques, discussing both their pros and
cons. Many new clustering methods have been proposed in recent years, most of which are based on
classical methods. Therefore, in this paper, we
selected a classic method for each category with
the aim of providing readers with a general
overview of clustering methods.
Based on the techniques adopted to define
clusters, clustering algorithms have been categorized into four broad categories: hierarchical, partitional, density-based, and grid-based [Han et al.
2001]. These methods have been widely used in
geospatial information systems, such as point pattern analysis, hot spot detection, and regionalization. In this section, we discuss some classic
methods for each category. The example methods
chosen for each category are efficient and scalable
on large spatial datasets.
2.1 Hierarchical Clustering Methods
Hierarchical clustering methods can be
either agglomerative or divisive. An agglomerative
method starts with each point as a separate cluster,
and successively performs merging until a stopping
criterion is met. A divisive method begins with all
points in a single cluster and performs splitting
until a stopping criterion is met. The result of a
hierarchical clustering method is a tree of clusters
called a dendrogram. A hierarchical clustering
method is useful in geospatial information applications for clustering at different spatial scales.
However, the method has difficulties in setting up
merge and split decisions, because each decision
requires the examination and evaluation of a large
number of objects or clusters [Han et al. 2001].
An example of a hierarchical clustering method
is BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies) [Zhang et al. 1996].
BIRCH is an integrated hierarchical clustering
method. It incrementally and dynamically clusters
incoming multi-dimensional metric data points to try
to produce the best quality clustering within specified memory and time constraints.
The concepts of ‘clustering feature’ and ‘CF
tree’ are crucial to the BIRCH approach. They are
used to summarize representations. A clustering
feature (CF) is a triple summary of the information
that we maintain about a cluster. Specifically, given N d-dimensional data points X_i (i = 1, 2, …, N) in a cluster, the CF entry of the cluster is {N, LS, SS}, where N is the number of data points in the cluster, LS is the linear sum of the N data points, i.e., LS = Σ_{i=1}^{N} X_i, and SS is the square sum of the N data points, i.e., SS = Σ_{i=1}^{N} X_i². A CF tree is a height-balanced tree that stores the CFs for
a hierarchical clustering. It requires two parameters, a branching factor B and a threshold T. The
branching factor B defines the maximum number of
entries in non-leaf nodes. Figure 1 depicts a CF-tree
structure. In this tree structure, each non-leaf node
has at most B entries and each leaf node has at most
L entries. All entries in leaf nodes must satisfy the
threshold T. That is, the radius of each entry in a
leaf node must be less than T. These two parameters influence the size of the CF-tree and thus the
effectiveness of clustering.
Figure 1: BIRCH’s CF tree structure.

The BIRCH clustering method includes four phases [Zhang et al. 1996]. The main task of Phase 1 is to scan all data and build an initial in-memory CF tree using the available amount of memory and recycling space on the disk. This CF tree is intended to reflect the clustering information of the data set as finely as possible under the existing memory limit. The CF tree is built dynamically as objects are inserted, hence the method is incremental. An object is inserted into the closest leaf entry (subcluster). After the insertion of a new object, information about it is passed toward the root of the tree. The size of the tree can be changed by modifying the threshold. If the need for memory is greater than the available memory, a smaller threshold can be specified and the process of rebuilding the tree begins without the necessity of rereading all of the objects or points. Phase 2 is optional. Since there is potentially a gap between the size of the Phase 1 results and the input range of Phase 3, Phase 2 serves as a cushion that bridges this gap. Phase 2 scans the leaf entries in the initial CF tree while removing more outliers and grouping crowded subclusters into larger ones. Phase 3 uses a global or semi-global algorithm to cluster all leaf entries. After Phase 3, a set of clusters that captures the major distribution pattern in the data is obtained. Phase 4 uses the centroids of the clusters produced by Phase 3 as seeds, and redistributes the data points to their closest seeds in order to obtain a set of new clusters. The complexity of the algorithm is O(n), where n is the number of points in the dataset.

BIRCH can typically achieve good clustering with a single scan of the data, and improve the quality further with a few additional iterations. BIRCH was also the first clustering algorithm proposed in the database research area to handle “noise” effectively. However, since each node in a CF-tree can hold only a limited number of entries, a CF-tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well because it uses the notion of radius to control the boundary of a cluster [Han et al. 2001].
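To make this concrete, the following is a minimal sketch in Python (our illustration, not the authors’ implementation) of how a clustering feature is computed and how the additivity of the triple {N, LS, SS} lets two subclusters be merged, and their centroid and radius derived, without revisiting the raw points:

```python
import numpy as np

def make_cf(points):
    # CF = (N, LS, SS): point count, linear sum (a vector), square sum (a scalar).
    pts = np.asarray(points, dtype=float)
    return (len(pts), pts.sum(axis=0), (pts ** 2).sum())

def merge_cf(cf1, cf2):
    # CF additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2).
    return (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])

def centroid(cf):
    n, ls, _ = cf
    return ls / n

def radius(cf):
    # Root mean squared distance of the points to the centroid,
    # derived from the CF alone: R^2 = SS/N - ||LS/N||^2.
    n, ls, ss = cf
    c = ls / n
    return np.sqrt(max(ss / n - (c ** 2).sum(), 0.0))

cf = merge_cf(make_cf([[1, 1], [2, 2]]), make_cf([[3, 3]]))
print(centroid(cf), radius(cf))   # [2. 2.] 1.154...
```

Because merging is a component-wise sum, inserting an object only requires updating the CFs along one root-to-leaf path, which is what makes Phase 1, described above, a single-scan, incremental process.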
2.2 Partitional Clustering Methods
Partitional clustering methods determine a
partition for dividing a group of points into different clusters, such that the points in a cluster are
more similar to each other than to points in different clusters. These methods start with some arbitrary initial clusters and iteratively reallocate points
into clusters until a stopping criterion is met. They
tend to find clusters with hyperspherical shapes.
Examples of partitional clustering algorithms
include k-means, PAM [Kaufman and Rousseeuw
1990], CLARA [Kaufman and Rousseeuw 1990],
CLARANS [Ng and Han 1994], and EM [Kaufman
and Rousseeuw 1990]. Partitional clustering methods, like k-means, are usually not scalable. In this
section, we choose CLARANS as the example,
which makes the clustering process more scalable.
CLARANS is a spatial clustering method
based on randomized search [Ng and Han 1994]. It
was the first clustering method proposed for spatial
data mining and it led to a significant improvement
in efficiency for clustering large spatial datasets. It
finds a medoid for each of k clusters. Informally, a
medoid is the center of a cluster. The great insight
behind CLARANS is that the clustering process can
be described as searching a graph where every node
is a potential solution (i.e., a partition of the points
into k clusters). In this graph, each node is represented by a set of k medoids. Two nodes are neighbors if their partitions differ by only one point.
CLARANS performs the following steps. First, it
selects an arbitrary possible clustering node current. Next, the algorithm randomly picks a neighbor of current and compares the quality of clusters
at current and the neighbor node. A swap between
current and a neighbor is made if such a swap
would result in an improvement of the clustering
quality. The number of neighbors of a single node to
be randomly tried is restricted by a parameter called
maxneighbor. If a swap happens, CLARANS moves
to the neighbor’s node and the process is started
again; otherwise the current clustering produces a
local optimum. If the local optimum is found,
CLARANS starts with a new randomly selected
node in search of a new local optimum. The number
of local optima to be searched is bounded by a
parameter called numlocal. Based upon
CLARANS, two spatial data mining algorithms
were developed: a spatially dominant approach,
called SD_CLARANS, and a non-spatially dominant approach, called NSD_CLARANS. In
SD_CLARANS, the spatial component(s) of the
relevant data items are collected and clustered
using CLARANS. Then, the algorithm performs
attribute-oriented induction on the non-spatial
description of points in each cluster.
NSD_CLARANS first applies attribute-oriented
generalization to the non-spatial attributes to produce a number of generalized tuples. Then, for each
such generalized tuple, all spatial components are
collected and clustered using CLARANS.
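The following sketch illustrates this randomized search, assuming Euclidean distance and an in-memory point array; the parameters numlocal and maxneighbor are those described above, while all other names are illustrative:

```python
import random
import numpy as np

def total_cost(points, medoids):
    # Sum of distances from every point to its nearest medoid.
    d = np.linalg.norm(points[:, None, :] - points[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans(points, k, numlocal=2, maxneighbor=50, seed=0):
    points = np.asarray(points, dtype=float)
    rng = random.Random(seed)
    n = len(points)
    best_medoids, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = rng.sample(range(n), k)        # arbitrary start node
        cost = total_cost(points, current)
        tried = 0
        while tried < maxneighbor:
            # A neighbor node differs from the current one in exactly one medoid.
            neighbor = list(current)
            neighbor[rng.randrange(k)] = rng.choice(
                [i for i in range(n) if i not in current])
            neighbor_cost = total_cost(points, neighbor)
            if neighbor_cost < cost:             # swap improves clustering quality
                current, cost, tried = neighbor, neighbor_cost, 0
            else:
                tried += 1
        if cost < best_cost:                     # local optimum for this restart
            best_medoids, best_cost = current, cost
    return best_medoids, best_cost
```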
CLARANS suffers from some weaknesses [Ng
and Han 1994]. First, it assumes that the points to be
clustered are all stored in main memory. Second, the
run time of the algorithm is prohibitive on large
datasets. In [Ng and Han 1994], its authors claim
that CLARANS is “linearly proportional to the size
of dataset”. However, in the worst case, O(nᵏ) steps
may be needed to find a local optimum, where n is
the size of the dataset, and k is the desired number of
clusters. The time complexity of CLARANS is
Ω(kn²) in the best case, and O(nᵏ) in the worst case.
2.3 Density-Based Clustering Methods
Density-based clustering methods try to
find clusters based on the density of points in
regions. Dense regions that are reachable from each
other are merged to form clusters. Density-based
clustering methods excel at finding clusters of arbitrary shapes. Examples of density-based clustering
methods include OPTICS [Ankerst et al. 1999],
DBSCAN [Ester et al. 1996] and DBRS [Wang and
Hamilton 2003]. DBSCAN was the first density-based spatial clustering method proposed [Ester et
al. 1996], and can be easily extended for different
applications. To define a new cluster or to extend
an existing cluster, a neighborhood around a point
of a given radius (Eps) must contain at least a minimum number of points (MinPts), the minimum
density for the neighborhood. DBSCAN uses an
efficient spatial access data structure, called an R*-tree, to retrieve the neighborhood of a point from
the dataset. The average-case time complexity of
DBSCAN is O(n log n). DBSCAN can follow arbitrarily shaped clusters [Ester et al. 1996].
Given a dataset D, a distance function dist, and
parameters Eps and MinPts, the following definitions (adapted from [Ester et al. 1996]) are used to
specify DBSCAN.
Definition 1 The Eps-neighborhood (or neighborhood) of a point p, denoted by NEps(p), is
defined by NEps(p) = { q∈ D | dist(p,q) ≤ Eps}.
Definition 2 A point p is directly density-reachable
from a point q if (1) p∈ NEps(q) and (2)
|NEps(q)| ≥ MinPts.
Definition 3 A point p is density-reachable from
a point q if there is a chain of points p₁, …, pₙ,
p₁ = q, pₙ = p such that pᵢ₊₁ is directly density-reachable from pᵢ for 1 ≤ i ≤ n−1.
Definition 4 A point p is density-connected to a
point q if there is a point o such that both p and
q are density-reachable from o.
Definition 5 A density-based cluster C is a non-empty subset of D satisfying the following
conditions: (1) ∀p, q: if p∈C and q is density-reachable from p, then q∈C; (2) ∀p, q∈C: p is
density-connected to q.
DBSCAN starts from an arbitrary point q. It
begins by performing a region query, which finds the
neighborhood of point q. If the neighborhood is
sparsely populated (i.e., it contains fewer than
MinPts points), then point q is labeled as noise.
Otherwise, a cluster is created and all points in q’s
neighborhood are placed in this cluster. Then the
neighborhood of each of q’s neighbors is examined
to see if it can be added to the cluster. If so, the
process is repeated for every point in this neighborhood, and so on. If a cluster cannot be expanded further, DBSCAN chooses another arbitrary unlabelled
point and repeats the process. This procedure is iterated until all points in the dataset have either been
placed in clusters or labeled as noise. For a dataset
containing n points, n region queries are required.
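The procedure can be summarized in the following sketch, which follows Definitions 1 to 5 but uses a brute-force region query (DBSCAN itself uses an R*-tree to answer each query in O(log n)):

```python
import numpy as np
from collections import deque

NOISE, UNLABELLED = -1, 0

def region_query(points, p, eps):
    # Eps-neighborhood of point p (Definition 1), including p itself.
    return np.where(np.linalg.norm(points - points[p], axis=1) <= eps)[0]

def dbscan(points, eps, min_pts):
    points = np.asarray(points, dtype=float)
    labels = np.zeros(len(points), dtype=int)   # 0 = not yet visited
    cluster_id = 0
    for p in range(len(points)):
        if labels[p] != UNLABELLED:
            continue
        if len(region_query(points, p, eps)) < min_pts:
            labels[p] = NOISE                   # may later become a border point
            continue
        cluster_id += 1                         # p is a core point: start a cluster
        labels[p] = cluster_id
        seeds = deque([p])
        while seeds:                            # expand via density-reachability
            q = seeds.popleft()
            neighbors = region_query(points, q, eps)
            if len(neighbors) < min_pts:
                continue                        # q is a border point, not expanded
            for r in neighbors:
                if labels[r] in (UNLABELLED, NOISE):
                    if labels[r] == UNLABELLED:
                        seeds.append(r)
                    labels[r] = cluster_id
    return labels
```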
Although DBSCAN gives extremely good
results and is efficient in many datasets, it may not
be suitable for the following cases. First, if a dataset
has clusters of widely varying densities, DBSCAN is
not able to handle it efficiently. On such a dataset,
the density of the least dense cluster must be applied
to the whole dataset, regardless of the density of the
other clusters in the dataset. Since all neighbors are
checked, much time may be spent in dense clusters
examining the neighborhoods of all points. Simply
using random sampling to reduce the input size will
not be effective with DBSCAN because the density
of points within clusters can vary so substantially in
a random sample, that density-based clustering
becomes ineffective [Estivill-Castro and Lee 2000a].
Secondly, if non-spatial attributes play a role in
determining the desired clustering result, DBSCAN
is not appropriate, because it does not consider non-spatial attributes in the dataset.
Thirdly, DBSCAN is not suitable for finding
approximate clusters in very large datasets.
DBSCAN starts to create and expand a cluster from
a randomly chosen point. It works on this cluster
thoroughly and accurately until all points in the
cluster have been found. Then another point outside
the cluster is randomly selected and the procedure
is repeated. This method is not suited to being
stopped early for the purpose of settling on an
approximate identification of clusters.
NBC (Neighborhood-Based Clustering algorithm) is a density-based algorithm [Zhou et al.
2005]. It can automatically discover arbitrary shaped
clusters of differing local densities with only one
input parameter k. Two basic neighborhoods are
defined by the algorithm. The k-Neighborhood (kNB) of a point is the set of its k nearest neighbors. The Reverse k-Neighborhood (R-kNB) of a point consists of all other points whose kNB includes the point. Based on
the two neighborhoods, the Neighborhood-Based
Density Factor (NDF) of a point is defined as the
ratio of the size of R-kNB over the size of kNB. The
basic idea of NBC is that the value of NDF of each
point in a cluster should be no less than 1. In other words, the number of points in its reverse k-nearest neighborhood should be no smaller than the number of points in its k-nearest neighborhood. The algorithm starts by calculating the NDF for each point in the dataset, and the clustering follows the same steps as DBSCAN, only with a different density threshold: in DBSCAN, the size of the neighborhood must be at least MinPts, while in NBC the NDF must be no less than one.
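A minimal sketch of the NDF computation, assuming a brute-force k-nearest-neighbor search over an in-memory point array (the actual algorithm uses a cell-based structure to speed up the kNB queries):

```python
import numpy as np

def ndf(points, k):
    # NDF(p) = |R-kNB(p)| / |kNB(p)|; points with NDF >= 1 play the role
    # that core points play in DBSCAN.
    points = np.asarray(points, dtype=float)
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # kNB: indices of each point's k nearest neighbors (position 0 is the
    # point itself, so it is skipped; assumes no duplicate points).
    knb = np.argsort(d, axis=1)[:, 1:k + 1]
    # |R-kNB|: how often each point appears in the kNB of other points.
    rknb_size = np.bincount(knb.ravel(), minlength=n)
    return rknb_size / k
```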
Although NBC gives very good results and is efficient on many datasets, it may not be suitable for the following cases. First, when the distance function for
applications needs to be Euclidean distance, NBC is neither efficient nor suitable. As mentioned, the neighborhood queries are the most time-consuming step. In NBC, the NDF is calculated for each point. The complexity of finding the k-nearest neighborhood of a point is O(log n), where n is the size of the dataset. The
paper uses a cell-based approach to support rapid
kNB query processing. However, finding the neighborhood in a rectangular cell implicitly uses the
Manhattan distance function. Secondly, when the
ranges of point coordinates in the datasets are large
and the clusters are very close, NBC may not produce good results. The size of the cell is determined
by the parameter k and the range of the point coordinates in the dataset. However, when the range of the
point coordinates is too large, the cell generated will
be too big to find the correct clusters.
2.4 Grid-Based Clustering Methods
Grid-based clustering methods quantize the
clustering space into a finite number of cells and
then perform the required operations on the quantized space. Cells containing more than a certain
number of points are considered to be dense.
Contiguous dense cells are connected to form clusters. Examples of grid-based clustering methods
include CLIQUE [Agrawl et al. 1998] and STING
[Wang et al. 1997].
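The general scheme can be sketched as follows, with an illustrative fixed cell size and a flood fill over 4-connected dense cells (the published methods differ in how cells and densities are defined):

```python
import numpy as np
from collections import deque

def grid_cluster(points, cell_size, density):
    # Quantize points into square cells and keep the cells holding at
    # least `density` points.
    points = np.asarray(points, dtype=float)
    cells = {}
    for x, y in points:
        cells.setdefault((int(x // cell_size), int(y // cell_size)), []).append((x, y))
    dense = {c for c, pts in cells.items() if len(pts) >= density}
    # Join contiguous dense cells into clusters via flood fill.
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        group, queue = [], deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            group.append((cx, cy))
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(group)
    return clusters
```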
STING (Statistical Information Grid) is a statistical information grid-based approach for spatial
databases. The overall spatial universe for the data is
divided into rectangular cells. Several levels of such
rectangular cells are used, corresponding to different
resolutions. The cells at different resolutions form a
hierarchical structure. Statistical information of each
cell is calculated and stored and the information is
used to answer queries.
To perform clustering on such a data structure,
users must first supply the density level as an input
parameter. Using this parameter, the following top-down method is used to find regions with sufficient
density. First, a layer within the hierarchical structure is selected, where the query answering process
is to start. The layer typically contains a small number of cells. For each cell in the current layer, we
compute the confidence interval (or estimated range of probability) that the cell will be relevant (i.e., related) to the result of the clustering. Cells
that do not meet the confidence condition are
labeled as not relevant and removed from further
consideration. The relevant cells are then refined to
a finer resolution by repeating the procedure at the
next level of the structure. The process is repeated
until the bottom layer is reached. At that time, if the
query specification is met, the regions of relevant
cells are retrieved, and further processed until they
meet the requirements of the query.
STING is a very efficient algorithm. It goes
through the database once to compute the statistical
parameters of the cells. Hence the time complexity
for generating clusters is O(n), where n is the total
number of data objects. If the hierarchical structure
fits into main memory, the cost of constructing it
can be ignored. Otherwise, the cost of construction is O(k log k). After generating the hierarchical structure,
the query processing time is O(K), where K is the
number of cells at bottom layers of the structure,
which is usually much smaller than n.
The quality of the clusters generated by
STING is heavily dependent on the granularity of
the bottom level of the hierarchical structure [Han
et al. 2001]. If the grid structure at the bottom level
is very fine, the cost of generating the grid structure
will be high; however, if it is too coarse, the quality of the cluster will be reduced. Moreover, STING
does not consider the spatial relationship between the children and their neighboring cells when constructing the parent cell. As a result, the boundaries of the clusters are either horizontal or vertical,
and no diagonal boundary is detected. This may
lower the quality and accuracy of the clusters
despite the fast processing time of the algorithm.
3. Constraint-Based
Clustering Methods
Besides the general clustering methods, constraint-based spatial clustering methods are sometimes more desirable for real geospatial information systems, since they can lead to effective and fruitful results by capturing application semantics [Han et al. 1999; Tung et al. 2001a].
Depending on the nature of the constraints and
applications, the constrained clustering problem
includes four categories: constraints on individual
objects, obstacle objects as constraints, clustering
parameters as constraints, and constraints imposed
on each individual cluster [Han et al. 1999; Tung et
al. 2001a]. A constraint on individual objects confines the set of objects to be clustered. For example,
the objects to be clustered might be restricted to
luxury mansions valued at more than one million
dollars. Using obstacle objects as constraints means
that physical obstacles such as mountains and
rivers could affect the ‘reachability’ among data
objects. Constraints can also restrict the possible
values of the parameters to a clustering algorithm
(e.g., the number of result clusters might be restricted to five). Constraints can also be imposed on an
individual cluster. For example, the number of
points in each cluster might be restricted to be at
most 50. [Tung et al. 2001a] proposes an algorithm
for the last type of constraints.
In this section, we present one type of constrained-based clustering method (i.e., physical
obstacles and facilitators as constraints). Typically,
a clustering task consists of separating a set of
objects into different groups according to a measure of goodness, which can vary depending on the
application. A common measure of goodness is
Euclidean distance (i.e., straight-line distance).
However, in many applications of clustering to spatial datasets, the presence of obstacles and facilitators makes Euclidean distance an ineffective measure. An obstacle is a physical object that obstructs
the reachability among the points, such as fences,
rivers, and highways (when walking), and a facilitator is a physical object that enhances the reachability among points, such as bridges, tunnels, and
highways (when driving). Both obstacles and facilitators are assumed to be modeled as simple polygons with no data points inside them. Constraints
of obstacles and facilitators exist in many geospatial datasets. Handling constraints due to obstacles
and facilitators can increase the effectiveness of
data mining for geospatial information systems by
capturing application semantics. In the following,
we will discuss four different methods to handle
physical obstacle and facilitator constraints.
3.1 COD_CLARANS
COD_CLARANS [Tung et al. 2001b] was the
first obstacle constraint partitioning clustering
method. It is a modified version of the CLARANS
partitioning algorithm [Ng and Han 1994] adapted
for clustering in the presence of obstacles.
CLARANS finds a medoid for each of k clusters. A
medoid is the center of a cluster that minimizes the
sum of the distances to all objects in the cluster. The
great insight behind CLARANS is that the clustering
process entails searching a graph where every node
is a potential solution (i.e., a partition of the points
into k clusters). CLARANS randomly searches the
graph until a local minimum is found. The property
being minimized is the total Euclidean distance
between every object and the medoid of its cluster.
The main idea of COD_CLARANS is to
replace the Euclidean distance between two points in
CLARANS with the unobstructed distance (called
the “obstructed distance” by Tung et al.), which is
the length of the shortest Euclidean path between two points that does not intersect any obstacles. The
calculation of the unobstructed distance includes
three preprocessing steps. The first step is to build a
visibility graph. A visibility graph is an undirected
graph, where the vertices correspond to vertices of
the obstacles and the edges are generated if and only
if the connection between the corresponding vertices
of the obstacles does not intersect any obstacles.
The second preprocessing step is micro-clustering. A micro-cluster is a compressed representation
of a group of one or more points that are so close
together that they are likely to belong to the same
cluster. Instead of representing each point in the
micro-cluster individually, COD_CLARANS represents them using their center, and a count of the
points in the micro-cluster. If a point is not close to
any other points, it is represented as a separate
micro-cluster.
The third preprocessing step creates three spatial join indexes, the VV index, the MV index, and
the MM index. Creating the VV index computes the
all-pairs shortest paths in the visibility graph; that is,
for every pair of obstacle vertices, the VV index
gives the shortest path that does not intersect any
obstacle. Creating the MV index computes an index
entry for every pair of a micro-cluster and an obstacle vertex. The MM index is created by computing
the unobstructed distance between every pair of
micro-clusters. The size of the MM index may be too
huge to be materialized. These three indices are used
whenever it is necessary to calculate the distance
between any two points. Tung et al. ignored the
computational cost of the preprocessing steps in
their performance evaluation of COD_CLARANS.
Figure 2: Micro-Clustering is not applicable for clusters of varying densities.
After preprocessing, COD_CLARANS works
efficiently on a large number of obstacles.
However, the algorithm may not be suitable for the
following cases. First, if the dataset has varying
densities, COD_CLARANS’s micro-clustering
approach may not be suitable for the sparse clusters. For the clusters shown in Figure 2, the left
cluster is much denser than the other two clusters.
If we use a small radius (as shown by the larger circles in Figure 2) to form three micro-clusters, the
micro-clustering process does not have much effect
on the majority of the points in the two clusters on
the right side. However, if we were to pick a larger
radius, the micro-clustering process might mistakenly merge the two clusters on the right side into
one cluster, with the obstacle inside the micro-cluster or with a micro-cluster intersecting the obstacle.
Second, as given, COD_CLARANS was not
designed to handle intersecting obstacles. As a result, the model used in preprocessing for determining visibility and building the spatial join index
would need to be changed significantly. Third, the
algorithm does not take into consideration facilitator constraints that connect data objects. A simple
modification of the distance function in
COD_CLARANS is inadequate for handling facilitators because the model used in preprocessing for
determining visibility and building the spatial join
index would need to be changed significantly.
3.2 AUTOCLUST+
AUTOCLUST+ [Estivill-Castro and Lee
2000b] is an enhanced version of AUTOCLUST
[Estivill-Castro and Lee 2000a], which handles
obstacles. AUTOCLUST+ is an effective graph-based clustering algorithm. With AUTOCLUST+,
the user does not need to supply parameter values.
To understand the AUTOCLUST+ algorithm,
the terms Voronoi diagram and Delaunay diagram
should first be understood. A Voronoi diagram is a
partitioning of a plane with n points into n convex
polygons such that each polygon contains exactly
one point and every location in a given polygon is
closer to its point than to any other point. A
Delaunay diagram is the dual of a Voronoi diagram
[Agrawl et al. 1998]. In a Delaunay diagram, the
same points as in the original data are used and an
edge, called a Delaunay edge, is present between
two points if and only if their corresponding
Voronoi regions share a boundary.
AUTOCLUST+ uses four steps. In the first
step, a Delaunay diagram is constructed. In the second step, for each point, the standard deviation of
the lengths of the Delaunay edges directly connected (incident) to the point is calculated. A global
variation indicator, the average of these standard
deviations, is then calculated before considering
any obstacles. In the third step, any Delaunay edge
that intersects any obstacle is deleted and local
strength indicators for each point are calculated.
The local strength indicator for a point is the mean
length of all Delaunay edges incident to the point.
In the fourth step, AUTOCLUST is applied to the
planar graph, resulting from the previous steps with
the calculated global variation indicator and local
strength indicators.
Let us consider the third step in more detail. If
an original Delaunay edge traverses an obstacle,
the length of the unobstructed distance (called the
“detour distance” by Estivill-Castro and Lee
[2000b]) between the two end points is approximated by a detour path (i.e., a minimal length path
in the Delaunay diagram that does not intersect any
obstacles between the two points). Figure 3 shows
an example illustrating the unobstructed distance
and the corresponding detour path, as well as a case
where the detour path cannot be substituted for the
unobstructed distance. The dotted lines represent
Delaunay edges obstructed by the obstacle. The
thick, solid lines in Figure 3(a) and (b) illustrate the
unobstructed distance and the detour path between
point 2 and point 4, respectively. For the case
shown in Figure 3(c), the unobstructed distance
between point 2 and point 4 is the same as that
observed in Figure 3(a). However, no corresponding detour path can be found in the Delaunay diagram. In cases where a detour path between two
points cannot be found, AUTOCLUST+ does not
include any estimate whatsoever of the unobstructed distance between the two points, when calculating their local strength indicators. Ignoring the distance between two points in this manner could
decrease the quality of the clustering results.
As well, AUTOCLUST+ does not consider
facilitator constraints that connect points. Since the
points connected by facilitators usually do not
share a boundary in Voronoi regions, no simple
modification of the distance function in AUTOCLUST+ would allow it to handle facilitators.

Figure 3: A case where AUTOCLUST+ may fail. (a) Detour distance (b) Detour path (c) No detour path.
3.3 DBCLuC
DBCLuC [Zaïane and Lee 2002], which is
based on DBSCAN, is a density-based clustering
algorithm that can deal with obstacle constraints.
Instead of finding the shortest path between the two
objects by traversing the edges of the obstacles as
in COD_CLARANS, DBCLuC determines visibility by using obstruction lines. An obstruction line,
which is constructed during preprocessing, is an
internal edge that maintains visible spaces for the
obstacle polygons. In preprocessing, a convexity
test is first applied to all obstructing polygons. The
convexity test includes two steps: first, for each
vertex of the polygon, an assessment edge is constructed. An assessment edge of a vertex is a line
segment whose two end vertices respectively lie in
two adjacent edges of the vertex, and which does
not intersect the polygon. Second, each vertex is labeled as either a convex vertex or a concave vertex. A vertex is convex if its assessment edge is interior to the polygon; otherwise, it is concave. If there is a concave vertex in a polygon, the polygon is concave; otherwise, it is convex. For each polygon,
the convex vertices are bi-partitioned (into two
groups that are as equal in size as possible) according to an enumeration order, such as clockwise
from a specified vertex. For convex polygons, an
obstruction line is drawn between each convex vertex in one partition and the corresponding convex
vertex in the other partition. To construct a set of
obstruction lines for a concave polygon, an admission test is performed for each possible obstruction
line candidate. If an obstruction line candidate is not admissible (i.e., totally outside of the obstacle polygon, or intersecting it), a set of obstruction lines is generated by finding the shortest path between the two vertices, which then replaces the
candidate. Then the admission test is repeated on
each line in the set. The maximum number of
obstruction lines that can be generated for an obstacle is equal to the number of edges in the obstacle.
DBCLuC can also deal with facilitators, otherwise known by Zaïane and Lee as “bridges” or
“crossings”. In dealing with facilitators, entry
edges and entrances are identified. A non-empty
subset of the edges of every facilitator is assumed
as entry edges, where the facilitator can be entered
[Lee 2003]. The lengths of facilitators are ignored.
For each entry edge, a series of locations, called
entrances, is defined from one vertex of the edge to the other vertex, such that each consecutive pair of
entrances is separated by a distance less than or
equal to the radius of the neighborhood area.
DBCLuC starts clustering from entrances and maximally expands a set of clusters such that all data
points that are reachable by facilitators are grouped
together. Then it continues processing the remaining
data points using the ‘obstacles’ method.
Once preprocessing to find obstruction lines
has been performed, DBCLuC is an effective density-based clustering approach for large datasets
containing obstacles with many edges. However,
preprocessing is expensive for concave polygons
with many vertices, because its complexity is O(v²), where v is the number of convex vertices in
all obstacles. Also, since any two points are defined
as being reachable if the unobstructed path between
them is calculated using obstruction lines, the algorithm does not work correctly for the two examples
shown in Figure 4. In both diagrams, the circle represents the neighborhood area of the central point.
For the flattened diamond shaped obstacle shown
in Figure 4(a), the obstruction line shown satisfies
the definition of an obstruction line that is provided in [Zaïane and Lee 2002]. The algorithm incorrectly considers point p to be in the central region,
because the distance via the obstruction line is less
than the radius. In Figure 4(b), point p and the central point are blocked by an obstacle with the
obstruction lines shown. The algorithm considers
point p to be unreachable from the central point,
which is incorrect, because the shortest distance
between these points is actually less than the radius.

Figure 4: Two Cases Where DBCLuC May Fail.
3.4 DBRS+
DBRS+ [Wang and Hamilton 2003] is a constrained density-based spatial clustering method. It
first chooses one point at random from the dataset
and retrieves its neighborhood without considering
obstacles and facilitators. The neighborhood consists
of the selected point, which becomes the central
point, and all points within a specified distance Eps
from it, according to a distance function, such as
Euclidean distance or Manhattan distance. The
neighborhood area is the area within a distance of
Eps from the central point. If any obstacles appear in
the neighborhood area, DBRS+ removes all neighbors separated from the central point by obstacles.
Thus, it keeps only those points in the same region
as the central point. A region is a maximal contiguous portion of a neighborhood area that does not
contain any obstacles. The central region of a neighborhood area is the region that contains the central
point. If no obstacles appear in the neighborhood
area, the only region is the original neighborhood
area. If any facilitators appear in the region, DBRS+
first determines the entrances available for each
facilitator. Then for each possible exit of any such
facilitator, it finds the corresponding extra points that
can be reached. The points in the central region,
together with any reachable extra points, form the
new neighborhood. If the new neighborhood is not
dense enough (i.e., if the number of points is less than a user-specified threshold MinPts), the central point of the neighborhood is classified as noise.
Otherwise, if the new neighborhood intersects one or
more existing clusters, the existing clusters are
merged together with the new neighborhoods to
form a single cluster, but if the new neighborhood
does not intersect any existing clusters, it becomes a
new cluster. These steps are iterated until all points
have been clustered or classified as noise.
DBRS+ has four major strengths. First, it can
handle both obstacles, such as fences, rivers, and
highways (when walking), and facilitators, such as
bridges, tunnels, major streets, and highways
(when driving), whereas most previous methods
can only deal with obstacles. Second, DBRS+ can
deal with any combination of intersecting obstacles
and facilitators. No previous method considers
intersecting obstacles, which are common in real
data. For example, highways or rivers often cross
each other, and bridges and tunnels often cross
rivers. Although previous methods can merge
obstacles during preprocessing, the resulting polygons cannot be guaranteed to be simple polygons,
and these methods do not work on complex polygons. Third, DBRS+ is simple and efficient. It does
not require any preprocessing, because the constraints are dealt with during the clustering process.
Almost all previous methods include complicated
preprocessing. Fourth, due to capabilities inherited
from DBRS, DBRS+ can work on datasets containing clusters with widely varying shapes, datasets
having significant non-spatial attributes, and
datasets comprising more than 100 000 points.
4. Discussion on Integrating
Clustering Methods in GIS
Geospatial clustering focuses on generalization or classification by aggregating similar geographic objects into clusters. When integrating clustering
methods in GIS, the method that is the most appropriate depends heavily on the application goal, the
trade-off between cluster quality and clustering
performance, and the characteristics of data [Han et
al. 2001]. Due to the specialization of geospatial
information systems, the following issues should
be carefully considered when applying the clustering methods in GIS. The following discussion
focuses on four main issues concerning integrating
clustering methods in GIS: Distance Functions,
Similarity on Non-Spatial Attributes, Data Types
and Performance.
4.1 Distance Functions
The first law of geography indicates that everything is related to everything else, but near things
are more related than distant things [Tobler 1970].
Distance is a numerical description of how similar
two objects are in space. According to [Tobler
1970], we usually use geometric distance as the
scale of measurement in the ideal model. The geometric distances are defined by exact mathematical
formulas to reflect the physical length between two
objects in defined coordinate systems, such as
Euclidean Distance and Manhattan Distance. Table
1 gives a list of geometric distance functions. Each
of these functions, because of their geometry,
implies a different view of the data.
When we use clustering methods in a large geographical area, the Earth’s spherical shape cannot be
ignored. Geographical distance is the distance
measured along the surface of the earth. These
methods calculate distances between points that are
defined by geographical coordinates in terms of latitude and longitude, such as Great Circle distance and
Vincenty’s formulae. As for a large geographical
area, the factor of map projections needs to be taken
into account. Map projection is the process of
mathematical transformation of locations in the
three-dimensional space of the Earth’s surface onto
the two-dimensional space of a map sheet [Lo and Yeung 2007]. Since some of the properties of the spherical
Earth are lost after projection, the areas and shapes of
the features as they appear on paper are also altered.
Consequently, after projection, the distance and direction between individual features often cannot be maintained. Depending on the properties that are
preserved, map projections can be classified as
equal-area, conformal, equidistant and true direction
map projections. Among them, the equidistant map
projection is the most important one because it results
in little distortion of distance. In the case where location information is represented on spherical and ellipsoidal surfaces, geographical distance can also be
used as the distance function for clustering methods.
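As an example of such a geographical distance function, the following sketch computes the great-circle distance using the haversine formula (one common formulation; Vincenty’s formulae are more accurate on the ellipsoid), with 6371 km as a commonly used mean Earth radius:

```python
import math

EARTH_RADIUS_KM = 6371.0

def great_circle_distance(lat1, lon1, lat2, lon2):
    # Haversine formula; inputs are latitude/longitude in degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Calgary to Toronto: roughly 2700 km along the Earth's surface.
print(great_circle_distance(51.05, -114.07, 43.65, -79.38))
```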
The distance functions for geographical applications should be defined properly before using
spatial clustering methods. For example, to find the
shortest traveling path in a city, the distance function could be defined based on the road network,
speed limits, road direction, and traffic volume. The
number of traffic lights and stop signs can also be
considered. From this perspective, the constraint-based clustering methods for physical objects discussed in Section 3 are extended from general clustering methods to include newly defined physical-constraint distance functions. For example, for
physical obstacles, the unobstructed distance
between two points x and y in a neighborhood area N, denoted by dist_uno^N(x, y), can be defined as

dist_uno^N(x, y) = d(x, y), if x and y are in the same neighborhood region within N;
dist_uno^N(x, y) = ∞, otherwise.
The neighborhood area is the area within a
distance of Eps from the central point. A neighborhood region used in the definition is a maximal
contiguous portion of a neighborhood area that
does not contain any obstacle. Then we can apply
different density-based spatial clustering methods
discussed in Section 2.
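A direct transcription of this piecewise definition might look as follows, where same_region is a placeholder for the obstacle test (in practice, a check of whether the straight segment between x and y stays inside one neighborhood region):

```python
import math

def dist_uno(x, y, same_region):
    # Unobstructed distance: ordinary distance when the two points share a
    # neighborhood region, infinity when an obstacle separates them.
    if same_region(x, y):
        return math.dist(x, y)
    return math.inf
```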
Table 1: Selected geometric distance functions between points x and y.

Distance Function      Definition
Euclidean distance     d(x, y) = ( Σ_{i=1}^{n} (x_i − y_i)² )^{1/2}
Manhattan distance     d(x, y) = Σ_{i=1}^{n} |x_i − y_i|
Tchebyschev distance   d(x, y) = max_{i=1,…,n} |x_i − y_i|
Minkowski distance     d(x, y) = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p}, p > 0
Canberra distance      d(x, y) = Σ_{i=1}^{n} |x_i − y_i| / (x_i + y_i), where x_i and y_i are positive
Cosine similarity      d(x, y) = cos(x, y) = Σ_{i=1}^{n} x_i y_i / (‖x‖ ‖y‖)
Net-DBSCAN [Stefanakis 2007] is an example
of a clustering method used in GIS to apply the
above idea. It extends DBSCAN to cluster the
nodes of a dynamic linear network. In a dynamic
linear network, each network edge has a cost value
for traversing it and a set of accessible temporal
intervals. In this method, the distance function
between two nodes is defined as the minimum
accumulated cost of traversing the network edges
between the two nodes during the accessible temporal intervals of all edges. The distance function
can be represented as follows:

dist(x, y) = Min Σ cost(e), summed over the edges e on a path between x and y, when the edges on the path between x and y are accessible;
dist(x, y) = ∞, otherwise.
Based on this distance function, the algorithm
first computes the accessible nodes for each node
on the network. The neighborhood of a node is a set
of nodes in the network with an accumulated cost
less than or equal to Eps. Then an initial cluster is
retrieved with the given Eps and MinPts. Clusters
are expanded for each node in the neighborhood.
The cluster expansion process is the same as
DBSCAN. The only difference is that the distance
definition is expanded to the linear network with
temporal intervals.
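This distance function amounts to a shortest-path computation restricted to accessible edges, which the following sketch illustrates with Dijkstra’s algorithm; accessibility is modelled here as a boolean per edge rather than the temporal intervals used by Net-DBSCAN:

```python
import heapq

def network_distance(graph, source, target):
    # graph: {node: [(neighbor, cost, accessible), ...]}
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, cost, accessible in graph.get(u, []):
            if not accessible:
                continue                  # inaccessible edges contribute infinity
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")                   # no accessible path between the nodes
```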
In summary, the distance function should be
tailored to meet different clustering purposes. Once
the distance function is determined for GIS applications, various clustering methods can be extended.
4.2 Similarity on
Non-Spatial Attributes
Spatial clustering has previously been based
on only the topological features of the data.
However, one or more non-spatial attributes may
have a significant influence on the results of the
clustering process. For example, in image processing, the general procedure for region-based segmentation compares a pixel with its immediate surrounding neighbors. We may not want to include a
pixel in a region if its non-spatial attribute does not
match our requirements. That is, region growing is
heavily based on not only the location of pixels but
also the non-spatial attributes or “a priori” knowledge [Cramariuc et al. 1997]. As another example,
suppose we want to cluster soil samples that were
taken from different sites according to their types
and locations. If a selected site has soil of a certain
type, and some neighboring sites have the same
type of soil, while others have different types, we
may wish to decide on the viability of a cluster
around the selected site according to both the number of neighbors with the same type of soil and the
percentage of neighbors with the same type.
One option is to handle non-spatial attributes and spatial attributes in two separate steps, as described in CLARANS. The other option is to handle the non-spatial attributes and spatial attributes
together in the clustering process. The similarity
function for non-spatial attributes and distance
functions for spatial attributes, are handled at the
same time in order to define the overall similarity
between objects.
In GIS applications, we can classify non-spatial attributes (alphanumeric attributes) into two
categories: numerical and non-numerical. Different
methods to handle them have been proposed in the
data mining research field. For numerical non-spatial attributes, we usually transform the numerical
values into some standardized values, and then calculate the similarity by using one of the distance
functions mentioned in 4.1. The Cosine similarity
function in Table 1 is one of the most popular methods. For the non-numerical attributes, new functions are defined to transform non-numerical values
to numerical. For example, GDBSCAN [Sander et
al. 1998] takes into account the non-spatial attributes of an object as a “weight” attribute, defined by
the weighted cardinality of the singleton containing
the object. The weight could be the size of the area
of the clustering object, or a calculated value from
several non-spatial attributes. In another instance,
DBRS [Wang and Hamilton 2003] introduces the
concept of purity to handle the non-numerical attributes of objects in the neighborhood. It is defined in terms of the number of matching values within the neighborhood, expressed as a percentage. For non-spatial attributes this can
avoid creating clusters of points with different values, even though they may be close to each other
[Wang and Hamilton 2003]. The following shows
the result of applying DBRS to Calgary (Canada)
theft crime data. The dataset has 1044 records from
January 31 to February 15, 2009 in Calgary. This
includes two types of theft crime: “Theft from
Vehicle” and “Other Theft”. Each record has the spatial attributes of longitude and latitude and the non-spatial category attribute of ‘type of the theft’. In this example, Calgary police officers want to find the dense clusters of theft crime based on location and theft
type. Two different kinds of crimes are considered
separately while performing the clustering. In other
words, only the crime records with the same nonspatial attribute are clustered together. Figure 5
shows the results when the search distance Eps is
500 meters, MinPts is 10, and the purity of the neighborhood is more than 60%. In total, 15 clusters were found. Among them, 11 clusters are of the
“Other Theft” type and 4 clusters are of the “Theft
from Vehicle” type. In the close up of the downtown
area in Figure 5, it is evident that a large cluster from
two kinds of crime types is separated into two clusters, although the clustering areas overlap each other.

Figure 5: Using DBRS on the Calgary theft crime dataset.
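To illustrate the purity test used in this example, here is a minimal sketch (illustrative names, not the DBRS implementation) that checks whether a neighborhood meets a purity threshold such as the 60% used above:

```python
def purity(neighborhood_values, central_value):
    # Fraction of points in the neighborhood sharing the central point's
    # non-spatial value (e.g., the theft type).
    if not neighborhood_values:
        return 0.0
    matches = sum(1 for v in neighborhood_values if v == central_value)
    return matches / len(neighborhood_values)

values = ["Theft from Vehicle"] * 7 + ["Other Theft"] * 3
print(purity(values, "Theft from Vehicle") >= 0.6)   # True: 70% of values match
```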
4.3 Data Types
Vector data represents spatial data as points,
lines and polygons. Locations of spatial objects or
events are usually represented as point objects. Most
of the clustering methods are designed for point
objects. Recently, the need to cluster line and polygon data, and even moving objects, is increasing.
Simplifying lines and polygons to points usually
does not work for these applications as the direction
and length of lines and the size of polygons can be
greatly distorted or overlooked.
A more practical method is to extend the current
point-centric clustering methods for use with lines
and polygons. For example, a distance based neighborhood is a natural notion of a neighborhood for
point objects, but if clustering spatially extended
objects such as a set of polygons of largely differing
sizes it may be more appropriate to use neighborhood (spatial) predicates like intersects or meets for
finding clusters of polygons. The definition of a cluster in [Ester et al. 1996] (i.e., NEps(o) = {o’ ∈ D| |o –
o’| ≤ Eps}, where o and o’ are points, can be extended to the neighborhood for polygon objects). For an
instance, it can be defined as NEps(o) = {o’ ∈ D| o
{meet, intersect} o’ ≤ Eps}, where o and o’ are two
polygons. By using this mechanism, various clustering methods [Gaffney et al. 2006; Gaffney and Smyth
1999; Lee et al. 2007; Nanni and Pedreschi 2006;
Guo 2008; Mu and Wang 2008] have been improved
for data types besides data points. For example, for the lines or trajectories of a moving object, Lee et al. [2007] proposed TRACLUS, which was adapted from the traditional clustering method DBSCAN [Ester et al. 1996] by defining a distance function
between line segments. For polygon data, Guo
[2008] presented the REDCAP family of six hierarchical regionalization methods based on taking the
length of different edges as the distance function
between polygon clusters.
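As a sketch of such a predicate-based neighborhood, the following uses the touches and intersects predicates of the shapely library in place of a point-distance test:

```python
from shapely.geometry import Polygon

def polygon_neighborhood(o, candidates):
    # Neighbors of polygon o: all polygons that meet (touch) or intersect it.
    return [p for p in candidates if o.touches(p) or o.intersects(p)]

a = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
b = Polygon([(2, 0), (4, 0), (4, 2), (2, 2)])   # shares an edge with a
c = Polygon([(5, 5), (6, 5), (6, 6), (5, 6)])   # disjoint from a
print(len(polygon_neighborhood(a, [b, c])))     # 1
```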
For a raster dataset, the centers of the grid cells can be converted to a set of points, and most clustering
methods can be applied after the transformation.
Grid-based clustering methods can be easily
applied to raster data because we can simply consider the pixels of the raster image as equivalent to
the grids of the clustering methods. Various methods have also been proposed in the fields of remote
sensing and image processing, but these are beyond
the scope of this paper.
4.4 Performance
The performance of applying clustering methods in a GIS is affected by two factors. One is the
efficiency of the clustering algorithms; the other is the
data processing performance of the system.
As spatial databases are usually huge, time complexity is an important issue when applying algorithms to find clusters. Most of the efficient clustering methods have a time complexity of O(nlogn), where n is the size of the dataset. To improve the efficiency of the clustering process, current clustering methods use different spatial data structures. Table 2 lists the complexity of some common clustering algorithms and the spatial data structures they use.
For example, DBRS applies an SR-tree [Katayama and Satoh 1997] to organize the whole dataset into smaller SR-tree nodes, which reduces a neighborhood query from O(n) to O(logn), where n is the size of the dataset.
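The effect of such an index is easy to demonstrate with standard tools. In the sketch below, a KD-tree stands in for the SR-tree; the indexing principle is the same, namely pruning the search space so that a region query does not have to scan all n points.

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(100_000, 2)   # 100 000 random planar points
tree = cKDTree(points)                # build the spatial index once

eps = 0.01
# Indices of all points within eps of the first point, found without
# a full linear scan of the dataset.
neighbors = tree.query_ball_point(points[0], r=eps)
print(len(neighbors))
```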
Clustering techniques have also been used in web-based GIS applications. Because these applications run in the internet environment and users expect the system to respond quickly, response time is the most important criterion when choosing a clustering technique. For example, many web-based GISs use clustering for cartographic generalization (i.e., to generalize map symbols). From Table 2, we can see that grid-based methods provide low time complexity. In a typical implementation, the current map view is divided into several grid cells based on the size of the user's screen; the points in each cell are then clustered based on the pixel distance between points, as sketched below.
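The following is a sketch of this screen-grid generalization; the grid dimensions and the centroid placement of the symbols are our assumptions.

```python
from collections import defaultdict

def grid_cluster(points_px, screen_w, screen_h, n_cols=8, n_rows=6):
    """Group points (already projected to pixel coordinates) by the
    screen-grid cell they fall in; each non-empty cell becomes one
    map symbol placed at the cell's centroid."""
    cw, ch = screen_w / n_cols, screen_h / n_rows
    cells = defaultdict(list)
    for x, y in points_px:
        cells[(int(x // cw), int(y // ch))].append((x, y))
    # One symbol per cell: centroid position plus a point count for sizing.
    return [(sum(p[0] for p in pts) / len(pts),
             sum(p[1] for p in pts) / len(pts),
             len(pts)) for pts in cells.values()]

symbols = grid_cluster([(10, 10), (12, 14), (700, 500)], 800, 600)
print(symbols)  # two symbols: one pair near the origin, one singleton
```

Each non-empty cell yields one symbol whose size can be scaled by its point count, which keeps the number of rendered symbols bounded by the grid size no matter how many points are in view.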
To further improve the response time in a GIS, other techniques and methods have been used. One technique is pre-processing: datasets are pre-clustered at different zoom levels and saved in the form of static images. ClustrMap [2009] is an example of a web-based GIS that uses this pre-processing
method. The clusters are displayed as symbols of different sizes according to the number of points in each cluster. Figure 6 shows the archived clustering maps of visitors to the website www.clustrmap.com from May 1 to June 1, 2009: Figure 6(a) shows the clustering results at the global level and Figure 6(b) shows them at the continental level. Both clustering results are saved on the server side as static images; when users zoom to a certain level, the server returns the corresponding images.
Figure 6: Static clustering maps of visitors to www.clustrmap.com: (a) at the global level; (b) at the continental level.

Table 2: Time complexity of some clustering algorithms.

Clustering Algorithm | Time Complexity | Clustering Method Type | Spatial Data Structure
BIRCH                | O(n)            | Hierarchical           | CF-tree
K-means              | O(nkd)          | Partitional            | n/a
CLARANS              | Ω(kn²)          | Partitional            | n/a
EM                   | O(knd² + kd³)   | Partitional            | n/a
DBSCAN               | O(nlogn)        | Density-based          | R*-tree
DBRS                 | O(nlogn)        | Density-based          | SR-tree
NBC                  | O(nlogn)        | Density-based          | High-dimensional cells
CLIQUE               | O(nd⁴)          | Grid-based             | Hash-tree
STING                | O(n)            | Grid-based             | Multiple-level granularity structure

Where n is the size of the dataset, k is the number of clusters, and d is the dimension of the objects.
5. Conclusions
Spatial clustering, which has been employed for spatial analysis over many years, is the process of grouping similar objects based on their distance, connectivity, or relative density in space. To select a proper spatial clustering method for geospatial information systems, we need to consider the characteristics of the different clustering methods. This paper has reviewed the different types of clustering methods from a data mining perspective, with an analysis of the advantages and limitations of some classical clustering methods. In addition, the paper discusses four important issues that arise when using clustering methods within a GIS, and provides solutions and examples for selecting and extending spatial clustering methods within geospatial information systems with respect to distance functions, non-spatial attributes, data types and system performance.
Domain knowledge can play an important role during method selection and extension. Since
constraint-based clustering methods only consider sharply limited knowledge concerning the domain and the user's goals, they are typically difficult to reuse. In particular, they usually have very restricted means of incorporating domain-related information from non-geospatial attributes. An ontology is a formal explicit specification of a shared conceptualization [Gruber 1993]. It provides domain knowledge relevant to the conceptualization and axioms for reasoning with it. For geospatial clustering, an appropriate ontology must include a rich set of geospatial and clustering concepts. Therefore, using domain ontologies to help users choose the most appropriate algorithms is one key future research topic.
In addition, various methods have been developed for modeling distance relations within the geospatial domain. For instance, Gahegan [1995] proposes a fuzzy logic model for proximity reasoning, in which each proximity expression, such as near or far, has a corresponding fuzzy membership function. Adapting this research on distance relations and other spatial relations to clustering methods will be a useful step toward the development of new clustering methods for GIS applications.
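As a simple illustration of such a membership function, the sigmoid below maps distance to a degree of "nearness"; the functional form and the parameters are our assumptions, not Gahegan's [1995] exact formulation.

```python
import math

def near(distance_m, midpoint=1000.0, steepness=0.005):
    """Fuzzy membership for the proximity term 'near': close to 1 nearby,
    falling smoothly toward 0 as distance grows."""
    return 1.0 / (1.0 + math.exp(steepness * (distance_m - midpoint)))

for d in (100, 1000, 3000):
    print(d, round(near(d), 2))  # e.g. 100 m is 'near' to degree ~0.99
```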
References
Agrawal, R., J. Gehrke, D. Gunopulos, and P. Raghavan.
1998. Automatic Subspace Clustering of High
Dimensional Data for Data Mining Applications,
SIGMOD Record, 27(2), pp.94-105.
Ankerst, M., M. Breunig, H.P. Kriegel, and J. Sander.
1999. OPTICS: Ordering points to identify the clustering structure. Proc. 1999 ACM-SIGMOD Intl.
Conf. Management of Data (SIGMOD’99),
Philadelphia, PA, pp. 49–60.
Cramariuc, B., M. Gabbouj, and J. Astola. 1997.
Clustering Based Region Growing Algorithm for
Color Image Segmentation, Proc. Digital Signal
Processing, Stockholm, Sweden. pp.857–860.
ClustrMap. 2009. http://www.clustrmaps.com/ (accessed May 30, 2009).
Ester, M., H. Kriegel, J. Sander, and X. Xu. 1996. A
Density-Based Algorithm for Discovering Clusters
in Large Spatial Databases with Noise. In:
Proceedings of the Second International Conference
on Knowledge Discovery and Data Mining.
Portland, OR, pp. 226-231.
Estivill-Castro, V., and I.J. Lee. 2000a. AUTOCLUST:
Automatic Clustering via Boundary Extraction for
Mining Massive Point-Data Sets. Proceedings of the
Fifth International Conference On Geocomputation.
University of Greenwich, Medway Campus, U.K.,
pp. 23-25.
Estivill-Castro, V., and I.J. Lee. 2000b. AUTOCLUST+:
Automatic Clustering of Point-Data Sets in the
Presence of Obstacles. Proceedings of the
International Workshop on Temporal, Spatial and
Spatio-Temporal Data Mining. Lyon, France, pp.
133-146.
Gaffney, S., A. Robertson, P. Smyth, S. Camargo, and M.
Ghil. 2006. Probabilistic clustering of extratropical
cyclones using regression mixture models. In
Technical Report, Bren School of Information and
Computer Sciences, University of California, Irvine.
Gaffney, S., and P. Smyth. 1999. Trajectory clustering
with mixtures of regression models. In Proc. 1999
Intl. Conf. Knowledge Discovery and Data Mining
(KDD’99), San Diego, CA, August 1999, pp. 63–72.
Gahegan, M. 1995. Proximity operators for qualitative
spatial reasoning. In COSIT ’95 Proceedings:
Spatial Information Theory: A Theoretical Basis for
GIS. A. U. Frank and W. Kuhn (eds). Berlin:
Springer-Verlag.
Gruber, T.R. 1993. A translation approach to portable
ontologies. Knowledge Acquisition, 5(2), pp. 199-220.
Guo, D. 2008. Regionalization with Dynamically Constrained Agglomerative Clustering and Partitioning (REDCAP), International Journal of Geographical Information Science, 22(7), pp. 801-823.
Han, J., L.V.S. Lakshmanan, and R.T. Ng. 1999.
Constraint-Based Multidimensional Data Mining.
Computer, 32(8), pp. 46-50.
Han, J., M. Kamber, and A.K.H. Tung. 2001. Spatial
clustering methods in data mining: A survey.
Geographic Data Mining and Knowledge
Discovery, Miller H. and Han J. (eds), Taylor and
Francis, 2001.
Jacquez, G.M. 2008. Spatial Cluster Analysis. Chapter
22 In The Handbook of Geographic Information
Science, S. Fotheringham and J. Wilson (Eds.).
Blackwell Publishing, pp. 395-416.
Kaymak, U., and M. Setnes. 2002. Fuzzy Clustering with
Volume Prototype and Adaptive Cluster Merging.
IEEE Trans. on Fuzzy Systems. 10(6), pp.705-712.
Katayama, N., and S. Satoh. 1997. The SR-tree: An Index
Structure for High-Dimensional Nearest Neighbor
Queries. Proceedings of ACM SIGMOD
International Conference on Management of Data,
Tucson, AZ, pp. 369-380.
Kaufman, L., and P.J. Rousseeuw. 1990. Finding Groups
in Data: An Introduction to Cluster Analysis, Wiley.
Klawonn, F., and F. Hoppner. 2003. What is Fuzzy about
Fuzzy Clustering? Understanding and Improving
the Concept of the Fuzzifier. Berthold M.R., Lenz
H.J., Bradley E., Kruse R. and Borgelt C. (eds)
Advances in Intelligent Data Analysis, Berlin:
Springer, pp. 254-264.
Lee, C.H. 2003. Personal communication.
Lee, J.G., J. Han, and K.Y. Whang. 2007. Trajectory clustering: A partition-and-group framework. In Proc.
2007 ACM-SIGMOD Intl. Conf. Management of
Data (SIGMOD’07), Beijing, China, June 2007, pp.
593–604.
Lo, C.P., and A.K.W. Yeung. 2007. Concepts and
Techniques in Geographic Information Systems,
Pearson Prentice Hall.
Martino, F. Di., V. Loia, and S. Sessa. 2008. Extended
fuzzy C-means clustering algorithm for hotspot
events in spatial analysis. International Journal of
Hybrid Intelligent Systems, 5(1), pp. 31-44.
Nanni, M., and D. Pedreschi. 2006. Time-focused clustering of trajectories of moving objects, Journal of
Intelligent Information Systems, 27(3), pp. 267–289.
Mu, L., and F. Wang. 2008. A Scale-Space Clustering
Method: Mitigating the Effect of Scale in the
Analysis of Zone-Based Data, Annals of the
Association of American Geographers, 98(1), pp.
85-101.
Ng, R., and J. Han. 1994. Efficient and Effective
Clustering Method for Spatial Data Mining.
Proceedings of the Twentieth International
Conference on Very Large Data Bases. Santiago,
Chile, pp. 144-155.
Sander, J., M. Ester, H. Kriegel, and X. Xu. 1998. Densitybased Clustering in Spatial Databases: the algorithm
GDBSCAN and its applications. Data Mining and
Knowledge Discovery, 2(2), pp. 169-194.
Stefanakis, E. 2007. NET-DBSCAN: clustering the
nodes of a dynamic linear network. International
Journal of Geographical Information Science,
21(4), pp. 427-442.
Tobler, W. 1970. A computer movie simulating urban
growth in the Detroit region. Economic Geography,
46(2), pp. 234-240.
Tung, A.K.H., J. Hou, L.V.S. Lakshmanan, and R.T. Ng.
2001a. Constraint-Based Clustering in Large
Databases. Proceedings of the Eighth International
Conference on Database Theory, London, U.K, pp.
405-419.
Tung, A.K.H., J. Hou, and J. Han. 2001b. Spatial
Clustering in the Presence of Obstacles. Proceedings
of the Seventeenth International Conference On Data
Engineering. Heidelberg, Germany, pp. 359-367.
Wang, X., and H.J. Hamilton. 2003. DBRS: A Density-Based Spatial Clustering Method with Random Sampling. Proceedings of the Seventh Pacific-Asia Conference on Knowledge Discovery and Data Mining. Seoul, Korea, pp. 563-575.
Wang, X., C. Rostoker, and H.J. Hamilton. 2004. Density-based spatial clustering in the presence of obstacles and facilitators. Proceedings of the Eighth European Conference on Principles and Practice of Knowledge Discovery in Databases, Pisa, Italy, September 2004, pp. 446-458.
Wang, W., J. Yang, and R. Muntz. 1997. STING: A
Statistical Information Grid Approach to Spatial Data
Mining, Proceedings of the twenty-third International
Conference on Very Large Data Bases, Athens,
Greece, pp. 186-195.
Zaïane, O.R., and C.H. Lee. 2002. Clustering Spatial Data
When Facing Physical Constraints. Proceedings of
the IEEE International Conference on Data Mining.
Maebashi City, Japan, pp. 737-740.
Zhang, T., R. Ramakrishnan, and M. Livny. 1996. BIRCH:
An Efficient Data Clustering Method for Very Large
Databases, SIGMOD Record, 25(2), pp. 103-114.
Zhou, S.G., Y. Zhao, J.H. Guan, and J. Huang. 2005. A Neighborhood-Based Clustering Algorithm. Advances in Knowledge Discovery and Data Mining, pp. 361-371.
MS Rec'd 08/09/03
Revised MS rec'd 09/11/12
Authors
Dr. Xin Wang is an assistant professor of
Geomatics Engineering at the University of Calgary.
Dr. Wang received her doctorate in computer science from the University of Regina in 2006, and an M.Eng. in software engineering and a B.Sc. from Northwest University, China. She was with the SaskTel Information Technology Management Department from 2006 to 2007, and was previously a lecturer at East China University of Science and Technology, a software engineer with ASTI Shanghai, and a researcher at Fudan University and the Shanghai Software Centre from 1996 to 1999. Her research interests are spatial data mining, ontology and knowledge engineering in GIS, web GIS, and privacy protection in GIS.
Mr. Jing Wang is an M.Sc. student in the Department of Geomatics Engineering, University of Calgary, specializing in GIS. He graduated with a B.Sc. in Geographical Information Systems from Henan University, China, in 2007. His research interests are spatial analysis, spatial data mining, and web GIS.