GeoInformatica 7:3, 229–253, 2003
© 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.
ICEAGE: Interactive Clustering and Exploration of
Large and High-Dimensional Geodata
DIANSHENG GUO, DONNA J. PEUQUET AND MARK GAHEGAN
Department of Geography and GeoVISTA Center, Pennsylvania State University, 302 Walker Building,
University Park, PA 16802
E-mail: [email protected], [email protected], [email protected]
Abstract
The unprecedented large size and high dimensionality of existing geographic datasets make the complex patterns that potentially lurk in the data hard to find. Clustering is one of the most important techniques for geographic knowledge discovery. However, existing clustering methods have two severe drawbacks for this purpose. First, spatial clustering methods focus on the specific characteristics of distributions in 2- or 3-D space, while general-purpose high-dimensional clustering methods have limited power in recognizing spatial patterns that involve neighbors. Second, clustering methods in general are not geared toward allowing the human-computer interaction needed to effectively tease out complex patterns.
In the current paper, an approach is proposed to open up the "black box" of the clustering process for easy understanding, steering, focusing and interpretation, and thus to support an effective exploration of large and high-dimensional geographic data. The proposed approach involves building a hierarchical spatial cluster structure within the high-dimensional feature space, and using this combined space for discovering multi-dimensional (combined spatial and non-spatial) patterns with efficient computational clustering methods and highly interactive visualization techniques. More specifically, this includes the integration of: (1) a hierarchical spatial clustering method to generate a 1-D spatial cluster ordering that preserves the hierarchical cluster structure, and (2) a density- and grid-based technique to effectively support the interactive identification of interesting subspaces and the subsequent searching for clusters in each subspace. The implementation of the proposed approach is fully open and interactive, supported by various visualization techniques.
Keywords: geographic knowledge discovery, spatial clustering and ordering, hierarchical subspace clustering,
visualization and interaction
1. Introduction
Increasingly large volumes of geographic data are being collected, but the spatial data analysis capabilities currently available have not kept up with the need for deriving meaningful information from these data [26], [28]. The unprecedented large size and high dimensionality of existing geographic datasets make the complex patterns that potentially lurk in the data hard to find. It is critical to develop new techniques to efficiently and effectively assist in deriving information from these large and heterogeneous spatial datasets. Towards this goal, spatial data mining and knowledge discovery approaches have been gaining momentum [26].
Clustering is one of the most important tasks in data mining and knowledge discovery
[14]. The aim of clustering is to find subsets within the data that behave enough alike to
warrant further analysis, organizing a set of objects into groups (or clusters) such that
objects in the same group are similar to each other and different from those in other groups
[17], [18]. These groups or clusters should have meaning in the context of a particular
problem [23].
General-purpose high-dimensional clustering methods discussed in the data mining and
knowledge discovery literature mainly deal with non-spatial feature spaces and have very
limited power in recognizing spatial patterns that involve neighbors. Spatial dimensions
(e.g., expressed as latitude and longitude, or x and y coordinates) cannot simply be treated as two additional dimensions in a high-dimensional data space. Spatial dimensions, which are not independent of each other, carry real-world significance. Their unique and complex inter-relationships thus cause difficulties for clustering methods that do not recognize these specific inter-relationships [16].
Clustering techniques specific to spatial data have long been used as an important
process in geographic analysis. Various spatial clustering approaches have been
developed, including statistical approaches [38], Delaunay triangulation [13], [25],
variable density [12], grid-based division [37], random walks [20], and even brute-force
exhaustive search [29]. Existing spatial clustering methods, however, can only deal with a
low-dimensional data space (usually 2-D or 3-D space, plus a geo-referenced attribute).
Spatial clustering methods often adopt real-world dissimilarity measures, e.g., road
distance or travel time, and consider complex situations, e.g., geographic obstacles [34].
Such unique clustering considerations are hard to integrate within high-dimensional
clustering methods.
Geospatial datasets currently encountered often have a high dimensionality (i.e., 2-D or
3-D space plus many attributes). Such data sets are often compiled from multiple data
sources, which are of different themes and might be collected for different purposes. By
putting them together for analysis, we are hoping to find some unknown (and unexpected)
multivariate relationships or patterns. Inevitably, the quality and relevance of these
attributes can vary dramatically. Irrelevant or noisy attributes often exist in the data set.
Therefore, clustering functions that use all dimensions of the data can be ineffective and
even counter-productive (in terms of hiding clusters). The intent of subspace clustering (or
projective clustering) is to identify subspaces (subsets of dimensions from the original
high-dimensional data space) that contain meaningful clusters and then to search clusters
in each subspace [4]. Although several subspace clustering methods have recently been proposed [1], [4], [10], [31], none of them is yet practical for analyzing real
geospatial datasets.
To achieve both efficiency and effectiveness in exploring large and high-dimensional geospatial datasets, it is critical to develop a highly interactive analysis environment that integrates the best of both human and machine capabilities [6]. Computational methods can search large volumes of data for a specific type of pattern very quickly with mechanical accuracy and consistency, but they have very limited ability to adapt to various data sets and to interpret complex patterns. In contrast, humans can visually pick out complex patterns very quickly, attach meaning to patterns (judge and interpret patterns), and generate hypotheses for further analysis [30]. A knowledge discovery system for handling current geospatial datasets should thus have automated computational methods integrated with interactive visualization techniques in order to leverage the human expert's knowledge and inference capabilities in a human-machine collaborative environment.
The goal of the research reported upon in the current paper is to develop a novel
approach to explore complex and unexpected patterns in geospatial datasets via: (1) the
integration of spatial clustering and general-purpose, high-dimensional clustering
methods; and (2) the integration of automatic computational methods and highly
interactive visualization techniques. We name our approach ICEAGE (Interactive
Clustering and Exploration of Geodata), which involves three elements:
* An efficient hierarchical spatial clustering method that can identify arbitrary-shaped hierarchical 2-D clusters at different scales. This method generates a spatial cluster ordering that fully preserves all hierarchical clusters; in other words, any set of points that constitutes a cluster at some hierarchical level is contiguous in the 1-D ordering. By transforming hierarchical spatial clusters into a linear ordering, the integration of spatial and non-spatial information is made simpler, since the spatial cluster structure is reduced to a single axis (an additional "common" attribute) in the high-dimensional feature space.
* A density- and grid-based hierarchical subspace clustering method. A subspace is formed by a subset of dimensions from the original data space. The subspace clustering method can first identify (with human interaction) interesting subspaces and then search for clusters in each subspace. It is efficient because it first generalizes the data into a small set of hyper-cells and then performs clustering with those cells. The spatial cluster ordering (above) is then integrated with this subspace clustering method to identify multivariate spatial patterns.
* A fully open and interactive environment including various visualization techniques. The user can interactively control the parameters of the clustering methods and see the immediate result corresponding to each parameter change. Several visualization techniques are developed to facilitate human interaction and interpretation.
The remainder of the paper is organized as follows. Section 2 gives a review of related
research. Section 3 presents the interactive hierarchical spatial clustering. Section 4
introduces the hierarchical high-dimensional subspace clustering method. In Section 5, an
integrated system for interactively searching high-dimensional (multivariate) spatial
patterns is presented, with a working demo showcasing census data analysis. Section 6
provides conclusions and includes a brief list of future work. Related material and color
figures of this paper are available in the digital library of the GeoVISTA research center
(www.geovista.psu.edu).
2. Related work

2.1. A general classification of clustering methods
Clustering methods can be divided into two types: partitioning and hierarchical approaches (figure 1). The partitioning approach aims to divide the data set into several clusters, which do not overlap with each other but together cover the whole data space. A data item is assigned to the "closest" cluster based on a proximity or dissimilarity measure. Hierarchical clustering approaches decompose the data set with a sequence of nested partitions, from fine to coarse resolution. Hierarchical clustering can be presented with dendrograms, which consist of layers of nodes, each representing a cluster [23].
Within each type, according to their definitions of a cluster, clustering methods may also be classified into three groups: distance-based, model-based (or distribution-based), and density-based methods (figure 1). Distance-based clustering methods rely on a distance or dissimilarity measure and an optimization criterion to group the most similar objects into clusters. K-means and CLARANS [27] are distance-based partitioning methods, while the single-link and graph-based methods [11], [17], [18], [23] can perform distance-based hierarchical clustering.
Model-based or distribution-based clustering methods assume the data of each cluster
conform to a specific statistical distribution (e.g., the Gaussian distribution) and the whole
dataset is a mixture of several distribution models. Maximum likelihood estimation (MLE)
and expectation-maximization (EM) are two examples of distribution-based partitioning
clustering methods [9], [11]. Model-based hierarchical clustering has been studied in [15],
[35].
Density-based approaches regard a cluster as a dense region (relative to sparse regions)
of data objects [4], [12], [23]. Density-based clustering can adopt two different strategies:
grid-based or neighborhood-based. A grid-based approach divides the data space into a finite set of multi-dimensional grid cells, calculates the density of each grid cell, and then groups neighboring dense cells into clusters. Such methods include Grid-Clustering [32], CLIQUE [4], OptiGrid [21], and ENCLUS [10]. The key idea of neighborhood-based approaches is that, given a radius e (as in DBSCAN [12] and OPTICS [5]) or a side length w (as in DOC [31]), the neighborhood (either a hyper-sphere of radius e or a hyper-cube of side length w) of an object has to contain at least a minimum number of objects (MinPts) to form a cluster around this object. Among these density-based methods, Grid-Clustering and OPTICS can perform hierarchical clustering.

Figure 1. An overview of different clustering methods. Those shown in bold font are used in this paper.
2.2. Hierarchical spatial clustering
Spatial data sets often contain hierarchical structures, and different patterns may exist at
different levels (scales) within the hierarchy. Two groups of methods have been used to
detect hierarchical clustering structures in spatial data. The first group consists of the traditional hierarchical clustering methods, e.g., the single-link and graph-based methods [11], [17], [18], [23]. For clustering 2-D spatial points, Delaunay triangulation has been used extensively [13], [25] to reduce the construction complexity of the dissimilarity matrix and to efficiently locate the neighbors of each point. AMOEBA [13] is a Delaunay-based hierarchical spatial clustering method, which automatically derives a criterion function F(p) as the threshold for cutting off long edges and then recursively processes each sub-graph to construct a hierarchy of clusters. AMOEBA tries to avoid the single-link effect (see Section 3.2) by detecting noise points and excluding them from any clusters at each hierarchical level. However, the criterion function F(p) is not easy to justify and customize for different application data sets and tasks. AMOEBA can only work with 2-D points, which gives it very limited power for exploring high-dimensional spatial data sets.
The second alternative for hierarchical spatial clustering is to use a density-based partitioning algorithm with different parameter settings. As an extension of DBSCAN [12], OPTICS [5] is a neighborhood-based hierarchical clustering method (see figure 1). Given a "generating distance" (e) and MinPts, OPTICS first identifies core objects and non-core objects (edge objects or noise). Core objects can be connected with any other core objects, while non-core objects can only be reached via core objects (no connection is allowed between non-core objects). OPTICS develops a cluster ordering to support an interactive exploration of the hierarchical cluster structure. Although OPTICS is a density-based method, after the identification of core objects and the removal of connections between non-core objects, it works exactly like a single-link method. It avoids the single-link effect at a specific level in the hierarchy (depending on the generating distance e and MinPts). OPTICS relies on multidimensional index structures to speed up k-nearest-neighbor queries and to maintain an O(n log n) complexity.
2.3. Subspace clustering methods
A subspace is formed by a subset of dimensions from the original high-dimensional data space. Let S be a d-dimensional data space with the set of dimensions (attributes) S = {a_1, a_2, ..., a_d}. A subspace of S is defined as S' = {a_s1, ..., a_sk | 0 < k ≤ d, a_si ∈ S}. Subspace clustering (or projective clustering) is to identify subspaces of a high-dimensional data space that allow better clustering of the data objects than the original
space [4]. It is often not meaningful to look for clusters using all input dimensions because
some dimensions can be noisy and irrelevant, which may blur or even hide strong clusters
residing in lower-dimensional subspaces [4]. Traditional dimensionality reduction
methods, e.g., principal component analysis (PCA) [11], transform the original data
space into a lower-dimensional space by forming new dimensions that are linear
combinations of the original dimensions (attributes). Such dimensionality reduction
techniques have severe drawbacks for clustering high-dimensional data [4], [31]. Firstly,
they cannot preserve clusters existing in different subspaces of the original data space.
Secondly, new dimensions can be very difficult to interpret, making the resulting clusters hard to
understand. Thirdly, global techniques such as PCA can fail to take account of local
structures in data.
Existing subspace clustering methods include CLIQUE [4], ENCLUS [10], ORCLUS
[1] and DOC [31]. CLIQUE partitions a subspace into multi-dimensional grid cells. These
cells are constructed by partitioning each dimension into x equal-length intervals. The
selectivity of a grid cell is the percentage of total data points contained in the cell. A cell is
dense if its selectivity is greater than a density threshold t. A cluster is a maximal set of
connected dense cells. Two k-D cells are connected if they share (k - 1) intervals.
ENCLUS is similar to CLIQUE but uses an entropy-based strategy for pruning subspaces.
ORCLUS introduces the problem of generalized projected clusters. A generalized
projected cluster is a set E of vectors together with a set D of data points such that the
points in D are closely clustered in the subspace de®ned by the vectors in E, which may
have much lower dimensionality than the original data space. DOC is a Monte Carlo
algorithm that computes, with high probability, a good approximation of an optimal
projective cluster. The algorithm can be iterated and for each iteration it generates one new
cluster. The iteration stops when some criterion is met. DOC is a density- and
neighborhood-based method, while CLIQUE and ENCLUS are density- and grid-based
methods (see figure 1).
Nevertheless, the identification of subspaces that contain clusters remains a challenging research problem, for two reasons. First, existing subspace clustering techniques always try to find clusters and their associated subspaces simultaneously. Thus the identification of interesting subspaces depends heavily on a specific clustering algorithm. Even worse, it may also depend on several subjective input parameters of the clustering algorithm. For example, CLIQUE needs the interval number (x) and the density threshold (t), ORCLUS needs the number of clusters (k) and the dimensionality of subspaces (l), and DOC needs the side length (w), the density threshold (a) and the balance factor (b). All these parameters are critical to the algorithms but problematic to configure beforehand; in fact
they correspond to strong hypotheses regarding how clusters will manifest, or what types
of clusters are of interest. Second, existing subspace clustering methods cannot perform
hierarchical clustering in each subspace and cannot adapt well to different application data
sets and patterns of different scales. The user needs to run the algorithm many times with
different settings of one or more parameters to gain an overall understanding of the data
set.
3. Hierarchical spatial clustering and ordering
Our method for hierarchical spatial clustering (of 2-D spatial points) is efficient, achieving O(n log n) complexity without using any index structure, and fully supports interactive exploration of hierarchical clusters. It combines advantages of both AMOEBA and OPTICS. Our method can generate an optimal spatial cluster ordering that preserves hierarchical clusters and encodes spatial proximity information as much as possible. It is based on Delaunay triangulation (DT) and the minimum spanning tree (MST) and overcomes the single-link effect by singling out boundary points for special treatment. To simplify the description of the method, we first introduce it without considering boundary points. Then a method is introduced for singling out boundary points and treating them differently.
3.1. Description of the method
The input consists of a set of 2-D points V = {v1, v2, ..., vn}, where vi = (x, y) is a
location in geographic space. Our clustering method (without tackling the single-link
effect) takes three steps: (1) construct a DT, and then construct an MST from the DT; (2)
derive an optimal cluster ordering of points in V; (3) visualize the cluster ordering and
interactively explore the hierarchical structure.
3.1.1. Construct DT and MST. A DT is constructed from the input points using the Guibas-Stolfi algorithm [19], which adopts a divide-and-conquer strategy and is of O(n log n) complexity. The triangulation result (figures 2 and 4) is stored in memory with efficient access to each point, each edge, the end points of an edge, and the edges incident to a point. Each edge has a length, which is the dissimilarity between its two end points.

Figure 2. Constructing the MST. The thinner lines show the triangulation, the thicker lines the MST. Edges are selected in the order AB, BC, BE, CD, JK, HJ, HI, HG, JL, EF, DL. Numbers indicate the length of each edge.

Figure 3. Derivation of an optimal ordering of points. Horizontal lines under points show the hierarchy of clusters.
Kruskal's algorithm [7], which is also of O(n log n) complexity, is used to construct an MST from the DT. Basically, an MST is a subset of the edges in the DT. At the beginning of the construction phase, the MST contains no edges and each point by itself is a connected graph (altogether n graphs). The algorithm first sorts all edges in the DT in increasing order. Following this order (from the shortest edge), each edge is considered in turn. If an edge connects two points in two different connected graphs, the algorithm adds the edge to the MST. If an edge connects two points in the same graph (i.e., it would form a cycle in the graph), the edge is discarded. When all the points are in a single graph, the spanning tree is complete (figure 2).

Figure 4. The DT, MST (thicker edges) and hierarchical clusters.
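For illustration only, the following minimal Python sketch (not the authors' implementation; function and variable names are ours) shows the Kruskal step over Delaunay edges, using a union-find structure in place of explicit connected graphs.

```python
def kruskal_mst(n_points, dt_edges):
    """Build an MST from Delaunay edges given as (length, a, b) tuples,
    where a and b are point indices. Returns the list of MST edges."""
    parent = list(range(n_points))          # union-find forest: one tree per point

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    mst = []
    for length, a, b in sorted(dt_edges):   # consider edges from shortest to longest
        root_a, root_b = find(a), find(b)
        if root_a != root_b:                # the edge joins two different connected graphs
            parent[root_a] = root_b
            mst.append((length, a, b))
            if len(mst) == n_points - 1:    # all points connected: the tree is complete
                break
        # an edge inside one graph would form a cycle and is discarded
    return mst
```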
3.1.2. Derive a cluster ordering. From the MST, an optimal ordering of all points can
be derived to completely preserve the hierarchical cluster structure and additional spatial
proximity information. A cluster (connected graph) can be viewed as a chain of points
[36]. At the lowest level, each cluster (or chain) contains a single point. Each chain has two
end points (at the very beginning they are the same point). When two clusters are merged
into one with an edge, the closest two ends (each from a distinct chain) will be connected
in the new chain. The ordering of the points in figure 2 is shown in figure 3. All hierarchical clusters (points underscored by a line) are preserved (i.e., contiguous) in the 1-D ordering. Moreover, the ordering preserves additional spatial proximity information as well as the hierarchical clusters. For example, when D is merged into cluster {E, C, B, A} with edge DC, it can be placed next to A or next to E in the ordering; either choice will equally preserve the cluster {D, E, C, B, A}. D is placed next to E rather than A in the ordering because DE < DA. Thus the proximity among D, E, and C is also preserved although they do not form a hierarchical cluster at any level.
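The chain-merging rule can be made concrete with a short sketch (ours, not the authors' code), assuming the MST edges are available as (length, a, b) tuples and dist(i, j) is a callable returning the dissimilarity between points i and j.

```python
from collections import deque

def derive_ordering(n_points, mst_edges, dist):
    """Derive the 1-D cluster ordering from the MST by merging chains at
    their closest ends; returns a list of point indices."""
    chain_id = list(range(n_points))                 # chain membership of each point
    chains = {i: deque([i]) for i in range(n_points)}

    for _, a, b in sorted(mst_edges):                # merge clusters from short edges up
        ca, cb = chain_id[a], chain_id[b]
        A, B = chains[ca], chains[cb]
        # choose the closest pair of chain ends, one end from each chain
        _, a_end, b_end = min(
            (dist(A[0], B[0]), "head", "head"),
            (dist(A[0], B[-1]), "head", "tail"),
            (dist(A[-1], B[0]), "tail", "head"),
            (dist(A[-1], B[-1]), "tail", "tail"))
        if a_end == "head":
            A.reverse()                              # connecting end of A becomes its tail
        if b_end == "tail":
            B.reverse()                              # connecting end of B becomes its head
        A.extend(B)                                  # the merged chain keeps A's identity
        for p in B:
            chain_id[p] = ca
        del chains[cb]

    return list(chains.popitem()[1])                 # the single remaining chain
```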
3.1.3. Visualize the ordering and interactively explore the hierarchy. Now we consider the larger data set shown in figure 4. Its cluster ordering is visualized in figure 5 (upper half). This visualization idea has already been sketched in figure 3. The horizontal axis in the ordering graph (figure 5: upper half) represents the ordering of points (labeled "instances" because this visualization tool can also be used for non-spatial data sets and orderings). Here there are altogether 74 points. The vertical axis represents the length of each edge. Each vertical line segment is an edge in the MST. Between two neighboring points in the ordering there is an edge; thus there are altogether 73 edges. With this visualization technique, a cluster appears as a valley in the graph. Distinct clusters are separated by long edges (high ridges). The second horizontal line (other than the bottom axis) is the threshold value for cutting off long edges. By dragging this threshold line (bar), one can interactively explore clusters at different hierarchical levels. Given a length threshold and a minimal number of points (MinClusSize) that a major cluster should have, the algorithm can automatically extract major clusters and minor clusters (those having fewer than MinClusSize points). Major clusters are colored differently in both the ordering visualization and the 2-D map (see figures 4 and 5).
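The extraction itself is a simple cut of the ordering at long edges; a minimal sketch with hypothetical names:

```python
def extract_clusters(ordering, edge_lengths, threshold, min_clus_size):
    """ordering: point ids in cluster-ordering sequence; edge_lengths[k] is the
    MST edge length between ordering[k] and ordering[k+1]."""
    clusters, current = [], [ordering[0]]
    for point, length in zip(ordering[1:], edge_lengths):
        if length > threshold:                 # a long edge (high ridge) ends a cluster
            clusters.append(current)
            current = [point]
        else:
            current.append(point)
    clusters.append(current)
    major = [c for c in clusters if len(c) >= min_clus_size]
    minor = [c for c in clusters if len(c) < min_clus_size]
    return major, minor
```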
This visualization of ordering is different from the ordering of OPTICS [5] in two
aspects. First, here each vertical line segment is an edge (not a point as in OPTICS). Thus it
is much easier than OPTICS (which needs to analyze the steepness of the start and end
points of a cluster) to automatically extract clusters from this ordering. Secondly, as
introduced above, this ordering not only preserves all hierarchical clusters, but also tries to
preserve other spatial proximity information as much as possible, i.e., connecting the
closest pair of end points (each from a chain) when merging two clusters.
A trend plot (bottom of figure 5) is developed to visualize the relationship between a distance threshold and the total number of clusters in the data set, given a minimal number of points (MinClusSize) that a major cluster should have. We call clusters that have fewer than MinClusSize points minor clusters. In a trend plot, the horizontal axis represents possible values of the threshold length. The vertical axis indicates the number of clusters for a given threshold edge length. The threshold can be set interactively by dragging a vertical bar. This vertical bar is linked with the horizontal bar in the cluster ordering (top of figure 5): when you drag one, the other moves accordingly. With the cluster ordering and the trend plot, one can clearly understand the overall hierarchical cluster structure of the data and explore it interactively with ease.

Figure 5. The cluster ordering and the trend plot of the data in figure 4 (MinClusSize = 3). Above the threshold bar in the ordering, the total number of clusters (#major/#minor) is shown. Given the threshold (9.8), there are three major clusters and six minor clusters.
3.2. Tackling the single-link effect
Without further improvement, an MST-based clustering method can suffer from the single-link effect (also called the chaining effect), which is caused by linearly connected points that run through a sparse area. In figure 6 (left), points a, b, c, d, e, f, and g can potentially cause single-link effects at different hierarchical levels. For example, C21 will first merge with C13, rather than C22, through the connection between points a, b, and c. As reviewed in Section 2, AMOEBA and OPTICS both try to avoid the single-link effect, but the former cannot support a flexible hierarchical clustering while the latter can only avoid the single-link effect at a specific level.

Figure 6. Left: a simple MST, with no consideration of boundary points. Right: our modified MST considering boundary points (light gray points). The single-link effect is avoided at all levels.
We propose a measure, deviation-to-minimum-length (DML), to detect boundary points, which are located either on the boundary of a cluster (at various hierarchical levels) or on a line in a sparse area. By treating these boundary points differently, the single-link effect at several levels can be avoided. For a point p, its DML value is calculated with the following equation:
DML(p) = \sqrt{\frac{\sum_{e=1}^{N_e} (L_e - L_{min})^2}{N_e}}
N_e is the number of edges incident to point p in the DT, L_e is the length of an edge incident to p, and L_min is the length of the shortest edge incident to p. A high DML value indicates that the point is located on a boundary: some neighbors are very close while other neighbors are far away. We now name the non-boundary points core points (after OPTICS). A visual interface is developed for the user to interactively configure the DML threshold value and to visualize the boundary points and core points on a map, which can help the user set a reasonable DML threshold value. The map in figure 7 can show either the resulting clusters or the core/boundary points, but not both at the same time. Which is shown depends on which component, the ordering (figure 7: middle) or the DML graph (figure 7: bottom), has the focus.
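The DML computation itself is straightforward; a minimal sketch of the formula above (our own illustration, with hypothetical names):

```python
import math

def dml(incident_edge_lengths):
    """DML of one point, given the lengths of all DT edges incident to it."""
    l_min = min(incident_edge_lengths)
    n_e = len(incident_edge_lengths)
    return math.sqrt(sum((l - l_min) ** 2 for l in incident_edge_lengths) / n_e)

def boundary_points(incident_lengths_by_point, dml_threshold):
    """Flag points whose DML exceeds the user-chosen threshold as boundary points."""
    return [p for p, lengths in incident_lengths_by_point.items()
            if dml(lengths) > dml_threshold]
```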
In the improved MST, core points can only be connected through core points, i.e., on the path from one core point to another core point in the MST, no boundary point is allowed. Boundary points can connect to other boundary points or to core points (figure 6: right). This treatment is different from both AMOEBA and OPTICS in that boundary points are not necessarily "noisy" points and they are allowed to be included in clusters. The construction of our modified MST maintains an O(n log n) time complexity.

Figure 7. Cluster ordering of 3,128 USA cities in the contiguous 48 states.

The cluster ordering of 3,128 USA cities (excluding cities in Alaska and Hawaii) is shown in figure 7. In the map, only cities belonging to a major cluster are visible. The single-link effect is successfully avoided. Some major clusters are numbered to show the mapping between clusters in the 2-D space and the valleys in the ordering. With the ordering, it is easy to see the hierarchical cluster structure within each major cluster. In other words, the ordering is a complete representation of the hierarchical cluster structure discovered by the clustering algorithm.
3.3. Cluster ordering as input to high-dimensional clustering
The spatial cluster ordering derived above, which preserves all hierarchical spatial clusters
and some additional spatial proximity information, can be treated as a single attribute
occupying only one dimension in any general purpose high-dimensional clustering method
for identifying multivariate clusters. Such integration will be introduced in Section 5.
Section 4 will introduce an interactive grid-based and density-based method for effective
and efficient hierarchical subspace clustering.
4. High-dimensional subspace evaluation and clustering
We develop a density- and grid-based approach (see figure 1) for hierarchical subspace clustering, which is similar to CLIQUE but improved in several aspects. First, our approach uses a nested-mean discretization method instead of the equal-interval method used in CLIQUE, making it more flexible locally. Second, an entropy-based evaluation method is adopted to rank subspaces before searching for clusters in each of them. Third, by treating each multi-dimensional grid cell as a "point" and calculating a synthetic distance measure between two such "points", the hierarchical spatial clustering method introduced above can easily be extended to perform hierarchical subspace clustering. Fourth, with various visualization techniques, our approach supports a fully interactive exploration and interpretation of clusters. Our approach can efficiently process very large data sets.
4.1. Discretization of each dimension
Each dimension needs to be discretized into a set of intervals. Intervals from different dimensions together divide a data space into a set of hyper-cells, each of which contains some data points. There are many existing discretization (classification) methods for single-dimensional data [33]. CLIQUE adopted the equal-interval (EI) method. We choose the nested-mean (NM) method (figure 9) over the EI method (figure 8) to improve effectiveness.
Figure 8. Equal-interval discretization of a 2-D data set. This synthetic data set has 3,500 points and contains three clusters (of different sizes) and a portion of noisy points.

The EI approach divides a dimension into a number of intervals, each of which has the same length. This approach does not consider the distribution of the data; it only uses the minimum and maximum values. Although it can effectively locate strong clusters, it often results in an extremely uneven assignment of data items to cells and fails to examine detailed patterns within a dense area. As in figure 8, with the EI approach the two smaller but much denser clusters fall in a single cell. Therefore these two clusters can never be distinguished in further analysis with those cells. Extreme outlier values can also severely affect the effectiveness of the EI approach.
The NM approach adapts well to the data distribution and is robust to outlier values and noisy points. It recursively calculates the mean value of the data and cuts the data set into two halves at the mean value (figure 9). Each half is then cut into halves at its own mean value. This recursive process stops when the required number of intervals is obtained. The NM discretization can examine detailed structures within a dense region and, at the same time, can capture coarse patterns in a comparatively sparse region. Although NM tends to divide a cluster into several cells, the cells that constitute the cluster are always denser than neighboring cells. As in figure 9, the two smaller but denser clusters now fall in eight cells, each of which is still much denser than the cells in a sparse area. Thus these two clusters are distinguishable in further analysis. The synthetic distances (see the next section) among the cells of the same cluster are very small and the clustering procedure can easily restore the cluster by connecting them.

Figure 9. Nested-means discretization of the same data set used in figure 8. The length of each interval is no longer the same: it is shorter in dense areas and longer in sparse areas. However, the cells of a dense area are always denser (in terms of how many points a cell contains) than those of a sparse area.
The number of intervals (r) needed for each dimension is determined by the subspace size (d, the number of dimensions involved in the subspace) and the data set size (n). A general rule adopted here is that r^d should be roughly equal to n, i.e., r should be around the value of n^(1/d). For the nested-mean discretization, r should also equal 2^k (k is a positive integer). We use these two rules to determine r. For example, if d = 4 and n = 3,800, then since 2^3 = 8 and 8^4 = 4,096 (close to 3,800), r should be 8. With this strategy, our approach is scalable to very large data sets.
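As an illustration of the two rules (a sketch under the stated assumptions, not the system's code), nested-mean breaks and the choice of r can be computed as follows; the worked example from the text is checked at the end.

```python
import math

def nested_mean_breaks(values, k):
    """Recursively split at the mean to obtain 2**k intervals; returns the
    sorted interior break values (nested-mean discretization)."""
    def split(vals, depth):
        if depth == 0 or len(vals) < 2:
            return []
        m = sum(vals) / len(vals)
        lower = [v for v in vals if v <= m]
        upper = [v for v in vals if v > m]
        return split(lower, depth - 1) + [m] + split(upper, depth - 1)
    return sorted(split(list(values), k))

def intervals_per_dimension(n, d):
    """Pick r = 2**k such that r**d is roughly equal to n."""
    k = max(1, round(math.log(n, 2) / d))
    return 2 ** k

# Example from the text: d = 4 and n = 3,800 give r = 8 (since 8**4 = 4,096).
assert intervals_per_dimension(3800, 4) == 8
```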
4.2. Entropy-based subspace evaluation
A subspace clustering method needs to have an effective approach to evaluate each subspace and rank them according to their "interestingness". Thus the user can start from the top of this ranking list to quickly locate important subspaces and hence significant patterns. We adopt an entropy-based evaluation criterion developed by Cheng and others for pruning subspaces [10]. The entropy of a grid-based subspace is calculated with the following equation:

H(X) = -\sum_{x \in X} d(x) \log d(x)
H(X) is the entropy value of subspace X, which is a collection of grid cells. The density of a cell, d(x), is the fraction of the total data items contained in the cell. Figure 10 shows an example of how to apply the entropy-based evaluation. Two subspaces, each consisting of two dimensions, are discretized into grid cells. Each subspace has 16 cells in this example. The base of the log function is therefore 16 to ensure the entropy values fall in the range [0, 1]. The calculation of the entropy values for the left subspace (S1) and the right subspace (S2) is shown below. H(S1) is smaller than H(S2) because subspace S1 is more "clustered".
H(S_1) = -(0.02 \log_{16} 0.02 + 0.01 \log_{16} 0.01 + 0.15 \log_{16} 0.15 + \cdots) = 0.678
H(S_2) = -(0.08 \log_{16} 0.08 + 0.1 \log_{16} 0.1 + 0.05 \log_{16} 0.05 + \cdots) = 0.957
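A minimal sketch of this entropy computation (ours; names are illustrative), assuming each subspace is represented by the point count of every grid cell, including empty ones:

```python
import math

def subspace_entropy(cell_counts, n_total):
    """H(X) = -sum d(x) log d(x) over the cells of a subspace, with the log
    base set to the number of cells so that H falls in [0, 1]."""
    n_cells = len(cell_counts)
    h = 0.0
    for count in cell_counts:
        if count > 0:                       # empty cells contribute nothing
            d = count / n_total             # d(x): fraction of all data items in the cell
            h -= d * math.log(d, n_cells)
    return h
```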
Figure 10. Two discretized subspaces. The number in each cell is the density (a percentage value) of the cell.

Although entropy is an effective measure of the interestingness of a subspace, the calculation of entropy values for all possible subspaces is impossible for a high-dimensional data set (e.g., 20 or more dimensions). Within this paper, the user needs to specify the dimensionality of subspaces. The algorithm then searches all subspaces of this dimensionality and ranks them according to their entropy values. A much more efficient (and more complicated) approach for identifying interesting subspaces was also developed, but it is beyond the scope of this paper.
4.3. A synthetic distance measure between two cells
A similarity or dissimilarity (distance) measure is always needed to perform hierarchical clustering. The choice of a distance measure can dramatically influence the clustering result. For clustering analysis, especially high-dimensional clustering, many distance measures have been used and evaluated [2], [3], [8], [24]. A Hamming distance between two hyper-cells can be defined from the number of intervals shared by the two cells, and is potentially suitable for hierarchical clustering. However, a Hamming distance does not consider the distribution of data points within each cell. Two diagonal cells never share an interval and are thus always treated as maximally dissimilar, although the majority of points in the two cells can be very close to each other. For example, in figure 9, the smallest (but densest) cluster is divided into 4 cells and the data distribution in each cell is skewed towards the others.
To effectively identify hierarchical clusters given a set of dense cells, we calculate a synthetic distance measure, which considers both the nominal position of intervals and the distribution of data points within each cell. First, a synthetic value (SynVal) is calculated for each interval within each cell, based on three values: (1) the nominal position (i) of the interval for the dimension, (2) the interval bounding values [Min_i, Max_i], and (3) the dimension mean value (Mean_i) of all data points contained in the cell.

SynVal = \frac{Mean_i - (Max_i + Min_i)/2}{Max_i - Min_i} + i
Note that the SynVal of the same interval in different cells can be different, due to the different data points they contain and hence different mean dimension values. For easy explanation, let's consider a 1-D space, where each cell is defined by a single interval (figure 11). The dimension is of range [0, 100] and is divided with the NM discretization into 4 intervals. Thus there are 4 cells, each of which is defined by a single interval. The synthetic value of each interval in each cell is shown in figure 11.
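A minimal sketch of the synthetic value and the resulting cell distance (our own illustration, not the authors' code):

```python
import math

def synthetic_value(mean_i, min_i, max_i, i):
    """Synthetic value of interval i within one cell: the nominal interval
    position plus the normalized offset of the cell's per-dimension mean
    from the interval midpoint."""
    return (mean_i - (max_i + min_i) / 2) / (max_i - min_i) + i

def cell_distance(synvals_a, synvals_b):
    """Euclidean distance between two hyper-cells, each represented by the
    vector of synthetic values of its constituent intervals."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(synvals_a, synvals_b)))
```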
These synthetic interval values, which integrate both the global nominal ordering and the local numerical variance, can preserve the data distribution. Each hyper-cell is defined as a high-dimensional "point" with a vector of the synthetic values of its constituent intervals. The distance between two cells is the Euclidean distance between the two vectors of synthetic values. One advantage of this synthetic distance measure is that the distance between two diagonal cells (cells that do not share intervals) may be very short if the data points in each cell are skewed towards each other. With this distance measure, the three clusters in figure 9 can be easily identified by the clustering algorithm. For example, in figure 9, although cell (X1, Y2) shares an interval with cell (X2, Y2), the data points in them are attracted to different clusters and therefore their synthetic values are very different from each other. Thus the two small clusters can be easily separated. Our algorithm design is also flexible enough to support a collection of distance measures for the user to choose and compare.

Figure 11. The calculation of the synthetic value of an interval within a hyper-cell. Here we take a 1-D space as an example. Each cell is defined by a single interval.
Once a distance measure is chosen, the hierarchical spatial clustering method
introduced previously can easily be extended here to perform hierarchical clustering
with a set of hyper-cells of a subspace.
4.4. Interactive hierarchical subspace clustering and visualization
To facilitate an interactive exploration and interpretation of the hierarchical subspace
clustering process, a subspace chooser, a density plot, an HD cluster ordering and an HD
cluster viewer (figure 12) are developed to cooperatively support a human-led exploration
of hierarchical clusters in different subspaces.
A subspace chooser (bottom right in figure 12) is a visualization component that lists
subspaces ordered by their entropy values. The user chooses a subspace size for the system
to enumerate and evaluate all subspaces of that size. The constituent dimensions of each
subspace and its entropy value are shown in the subspace chooser. Once a subspace is
selected from the list, an array of non-empty cells of this subspace is passed to the density
plot to visualize.
A density plot (right middle in figure 12) is a visualization component that helps the user to understand the overall distribution of cell densities and to interactively configure the density threshold by dragging the threshold bar (the horizontal line in the middle of the density plot). Taking as input an array of the non-empty cells of a subspace (selected in the subspace chooser), the density plot first orders the cells according to their densities and then plots them on a 2-D plane. The number right above the threshold bar is the total coverage (the total percentage of all data) of the current dense cells according to the current density threshold. For example, in figure 12 there are 264 non-empty cells. The current density threshold is 0.6%, i.e., about 10 data points in a cell. With this threshold, 30 dense cells (out of 264) are extracted and altogether they contain 57.3% of all data points. The plot can be zoomed in or out for better views. The density plot can facilitate a reasonable configuration of the density threshold. Once the user sets a new threshold (by dragging the threshold bar), a new set of dense cells is extracted and passed to the HD cluster ordering component for interactive hierarchical clustering.

Figure 12. The ordered subspace list, HD density plot, HD cluster ordering, and the HD cluster viewer. This is the overall interface for high-dimensional hierarchical subspace clustering. The data set used here is a remote sensing data set with 1,705 data records.
The HD cluster ordering (top right in figure 12) is similar to, and extended from, the spatial cluster ordering and visualization in Section 3. The construction of an HD cluster ordering from the dense cells takes four steps: (1) construct a pair-wise distance (dissimilarity) matrix (since the number of dense cells is much smaller than the data set size (n), this step does not cause a time complexity problem), (2) construct a hyper-MST from the distance matrix, (3) derive an HD cluster ordering, and (4) visualize the ordering for interactive control and exploration. The ordering clearly shows the hierarchical structure within the data and conveniently supports dynamic browsing of clusters at different hierarchical levels. While the user interactively controls the distance threshold, the immediate clustering result is visualized in the HD cluster viewer, with each cluster in a different color.
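For concreteness, steps (1)-(3) map onto the earlier sketches roughly as follows; kruskal_mst, derive_ordering, and cell_distance are the hypothetical helpers sketched above, not components of the actual system.

```python
def hd_cluster_ordering(dense_cells):
    """dense_cells: list of hyper-cells, each a vector of synthetic interval
    values. Returns the HD cluster ordering of cell indices."""
    n = len(dense_cells)

    def dist(i, j):
        return cell_distance(dense_cells[i], dense_cells[j])

    # (1) pair-wise dissimilarity between every pair of dense cells
    edges = [(dist(i, j), i, j) for i in range(n) for j in range(i + 1, n)]
    # (2) hyper-MST over the complete graph of dense cells
    mst = kruskal_mst(n, edges)
    # (3) derive the HD cluster ordering from the hyper-MST
    return derive_ordering(n, mst, dist)
```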
The HD cluster viewer (left in figure 12) is based around the PCP (parallel coordinate
plot) that allows investigation of high dimensional spaces [22]. In this case the PCP is used
to visualize hyper-cells rather than actual data points. Each string (consisting of a series of
line segments) represents a hyper-cell. The color of the string represents the current cluster
label of the cell. The width of the string roughly represents the density of the cell. When
the user interacts with the subspace chooser, HD density plot, or the HD cluster ordering,
the HD cluster viewer is automatically updated, thus the clustering result associated with
different input parameters can be immediately seen during interactions. The user can also
select strings in the HD cluster viewer to highlight those cells in the ordering. Several
types of selection are supported, including single selection, intersect selection (an
intersection of multiple single selections), and union selection (a union of multiple single
selections). The user can thus visually and interactively explore multivariate patterns
based on the clustering result.
5. Integration for high-dimensional spatial clustering
Integration of the hierarchical spatial clustering method introduced in Section 3 and the hierarchical subspace clustering method introduced in Section 4 is actually very simple: add the spatial cluster ordering (generated by the hierarchical spatial clustering component) as an additional attribute to the original data set and then input the combined data set to the hierarchical subspace clustering component. Let this "new attribute" be "SpaOrdering". The SpaOrdering value for a spatial point is its nominal position in the cluster ordering. If a subspace involving SpaOrdering has a low entropy value and ranks high in the subspace list, then this subspace might have significant multivariate spatial clusters and is an interesting candidate for interactive clustering and exploration.
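A minimal sketch of this integration step (assuming in-memory records as dictionaries; names are illustrative only):

```python
def add_spa_ordering(records, ordering):
    """records: one attribute dictionary per spatial point, indexed like the
    point set; ordering: the 1-D spatial cluster ordering as a list of point
    indices. Adds each point's nominal position as a new attribute."""
    position = {point: pos for pos, point in enumerate(ordering)}
    for idx, record in enumerate(records):
        record["SpaOrdering"] = position[idx]
    return records
```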
Now we integrate both the hierarchical spatial clustering and the hierarchical
subspace clustering methods together to analyze a census data set of 3,128 USA cities.
The data set has many attributes: LOCATION_X, LOCATION_Y, ELEVATION, WHITE_P,
BLACK_P, AMERI_ES_P, ASIAN_PI_P, OTHER_P, HISPANIC_P, DIVORCED_P, and
MOBILEHOME_P, etc. LOCATION_X and LOCATION_Y together represent the location
of a city. WHITE_P is the percentage of the white population in the total population of the city. Similarly, the other race population attributes (BLACK_P, ASIAN_P, etc.) are all percentage values of the total population. With the spatial cluster ordering (SpaOrdering) as a common attribute, the subspace clustering method can efficiently and effectively identify multivariate spatial clusters with human interaction. Figure 13 shows a system snapshot, where a subspace involving SpaOrdering, WHITE_P, and BLACK_P is selected and several strong clusters emerge quickly with human interaction and steering. Different subspaces need different parameters to locate the strongest clusters they contain. It is our contention that these parameters are most effectively configured via visualization and human interaction.
The subspace shown in figure 13 has 238 non-empty cells, which are passed to the HD density plot, where the cells are ordered and visualized. Visually, the user can easily see where to set the density threshold according to the shape of the density plot curve. The currently selected dense cells have a total coverage of 52.52%. These dense cells (42 altogether) are then passed to the HD cluster ordering component, which constructs a hyper-MST from the cells and derives a cluster ordering. With this ordering we can clearly see the hierarchical cluster structure. By manipulating the distance threshold we can see five major clusters emerging (see figure 13). Cluster 1 represents cities that have a very high (compared to other cities) black population, a very low white population, and are spatially close (as decided by their similar SpaOrdering values: the southeast and east region). Cluster 2 is similar to cluster 1 (high black population and low white population) except that its cities concentrate in a different region (the mid-south). Cluster 3 represents cities that have a medium black population, a medium white population, and are spatially close, north of (and partly overlapping with) the regions of clusters 1 and 2. Cluster 4 represents cities that have a low black population, a very high white population, and are also spatially close (the vast north and northeast region). Cluster 5 represents cities that have a low black population, a medium-high white population, and are also spatially close (the west and Pacific coast region, where the Hispanic, Asian, and American Indian populations are comparatively high).
The above finding is only a very small part of the whole discovery process. The user can interactively select subspaces, control parameters, interpret emerging patterns, and then decide the next action for further exploration. The integrated approach and the implemented system have also been applied to several other real geographic data sets and have shown their adaptability and effectiveness for searching for multivariate spatial patterns. Since this paper focuses on introducing the methodology, the analysis of other data sets is not presented here.
Figure 13. An application demo: Analyzing census data for U.S.A. cities.
6. Conclusion and future work
This paper reported a novel approach, ICEAGE (Interactive Clustering and Exploration of
Geodata), for exploring complex and unexpected patterns in geospatial datasets via: (1) the
integration of spatial clustering and general-purpose, high-dimensional clustering
methods; and (2) the integration of automatic computational methods and highly
interactive visualization techniques. The contribution of the research is in three parts: (1) an efficient hierarchical spatial clustering method that can identify arbitrary-shaped hierarchical 2-D clusters at different scales, (2) a density- and grid-based hierarchical subspace clustering method, and (3) a fully open and interactive environment including various visualization techniques.
With such an open and interactive approach and the use of various visualization techniques, the "black box" of the clustering process is opened up for easy understanding, steering, focusing and interpretation. Multivariate spatial patterns can be discovered efficiently and effectively. We are currently exploring the use of these methods to search
for interesting patterns in datasets combining cancer epidemiology, health infrastructure
and census variables, from which we hope to identify possible relationships between
disease incidence, healthcare accessibility and socio-demographic variables.
Currently our approach can process only numerical data (nominal data are treated as numerical data). It is not difficult to extend the system to address nominal data properly, because numerical data are actually first discretized into nominal intervals in the method, and nominal data types are easier to discretize. Further investigation is needed to justify and evaluate the effectiveness of spatial cluster ordering as a spatial pattern representation approach. A more efficient strategy for identifying interesting subspaces will be incorporated into the reported system in the future.
Acknowledgment
This paper is partly based upon work funded by NSF Digital Government grant (No.
9983445) and grant CA95949 from the National Cancer Institute.
References
1. C. Aggarwal and P. Yu. ``Finding generalized projected clusters in high dimensional spaces,'' ACM
SIGMOD International Conference on Management of Data, 2000.
2. C.C. Aggarwal. ``Re-designing distance functions and distance-based applications for high dimensional
data,'' SIGMOD Rec., Vol. 30:13–18, 2001.
3. C.C. Aggarwal, A. Hinneburg, and D.A. Keim, ``On the surprising behavior of distance metrics in high
dimensional space,'' in Database Theory – ICDT 2001, Vol. 1973, Springer-Verlag: Berlin, 2001.
4. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. ``Automatic subspace clustering of high
dimensional data for data mining applications,'' ACM SIGMOD International Conference on Management
of Data, Seattle, WA, USA, 1998.
5. M. Ankerst, M.M. Breunig, H.-P. Kriegel, and J. Sander. ``OPTICS: Ordering Points To Identify the
Clustering Structure,'' ACM SIGMOD International Conference on Management of Data, Philadelphia, PA,
1999.
6. M. Ankerst, M. Ester, and H.-P. Kriegel. ``Towards an effective cooperation of the user and the computer for
classification,'' Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery
and data mining, Boston, Massachusetts, United States, 2000.
7. S. Baase and A.V. Gelder. Computer Algorithms. Addison-Wesley, 2000.
8. A. Bookstein, V.A. Kulyukin, and T. Raita. ``Generalized Hamming Distance,'' Information Retrieval, Vol.
5:353–375, 2002.
9. P. Bradley, U. Fayyad, and C. Reina. ``Scaling clustering algorithms to large databases,'' ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, New York City, 1998.
10. C. Cheng, A. Fu, and Y. Zhang. ``Entropy-based subspace clustering for mining numerical data,'' ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA,
1999.
11. R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley & Sons, New York, 2000.
12. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. ``A density-based algorithm for discovering clusters in large
spatial databases with noise,'' The 2nd International Conference on Knowledge Discovery and Data
Mining, Portland, Oregon, 1996.
13. V. Estivill-Castro and I. Lee. ``Amoeba: Hierarchical clustering based on spatial proximity using Delaunay
diagram,'' 9th International Symposium on Spatial Data Handling, Beijing, China, 2000.
14. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. ``From data mining to knowledge discovery: an overview,'' in U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), Advances in Knowledge Discovery,
AAAI Press/The MIT Press: Cambridge, MA, 1996.
15. C. Fraley. ``Algorithms for model-based Gaussian hierarchical clustering,'' SIAM Journal on Scientific Computing, Vol. 20:270–281, 1998.
16. M. Gahegan. ``On the application of inductive machine learning tools to geographical analysis,''
Geographical Analysis, Vol. 32:113–139, 2000.
17. A.D. Gordon. ``A review of hierarchical classification,'' Journal of the Royal Statistical Society. Series A (General), Vol. 150:119–137, 1987.
18. A.D. Gordon, ``Hierarchical classification,'' in P. Arabie, L.J. Hubert, and G.D. Soete (Eds.), Clustering and Classification, World Scientific Publ.: River Edge, NJ, 1996.
19. L. Guibas and J. Stolfi. ``Primitives for the manipulation of general subdivisions and the computation of Voronoi diagrams,'' ACM TOG, Vol. 4, 1985.
20. D. Harel and Y. Koren. ``Clustering spatial data using random walks,'' Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, 2001.
21. A. Hinneburg and D.A. Keim. ``Optimal grid-clustering: towards breaking the curse of dimensionality in
high-dimensional clustering,'' Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.
22. A. Inselberg. ``The plane with parallel coordinates,'' The Visual Computer, Vol. 1:69–97, 1985.
23. A.K. Jain and R.C. Dubes, Algorithms for clustering data. Prentice Hall: Englewood Cliffs, NJ, 1988.
24. A.K. Jain, M.N. Murty, and P.J. Flynn. ``Data clustering: A review,'' ACM Computing Surveys (CSUR), Vol.
31:264–323, 1999.
25. I.-S. Kang, T.-W. Kim, and K.-J. Li. ``A spatial data mining method by Delaunay triangulation,'' The 5th
international workshop on Advances in geographic information systems, Las Vegas, Nevada, 1997.
26. H.J. Miller and J. Han. ``Geographic data mining and knowledge discovery: an overview,'' in H.J. Miller
and J. Han (Eds.), Geographic Data Mining and Knowledge Discovery, Taylor & Francis: London and New
York, 2001.
27. R. Ng and J. Han. ``Efficient and effective clustering methods for spatial data mining,'' Proc. 20th
International Conference on Very Large Databases, Santiago, Chile, 1994.
28. S. Openshaw. ``Developing appropriate spatial analysis methods for GIS,'' in D.J. Maguire (Ed.),
Geographical Information Systems, Vol. 1: Principles, Longman/Wiley, 1991.
29. S. Openshaw, M. Charlton, C. Wymer, and A. Craft. ``A Mark 1 geographical analysis machine for the
automated analysis of point data sets,'' International Journal of Geographical Information Science, Vol.
1:335–358, 1987.
30. D.J. Peuquet. Representations of Space and Time. New York: Guilford Press, 2002.
31. C.M. Procopiuc, M. Jones, P.K. Agarwal, and T.M. Murali. ``A Monte Carlo algorithm for fast projective
clustering,'' ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, USA,
2002.
32. E. Schikuta. ``Grid clustering: An efficient hierarchical clustering method for very large data sets,'' 13th
Conf. on Pattern Recognition, Vol. 2, 1996.
33. T.A. Slocum. Thematic Cartography and Visualization. Upper Saddle River, N.J.: Prentice Hall, 1999.
34. A.K.H. Tung, J. Hou, and J. Han. ``Spatial clustering in the presence of obstacles,'' The 17th International
Conference on Data Engineering (ICDE'01), 2001.
35. S. Vaithyanathan and B. Dom. ``Model-based hierarchical clustering,'' The Sixteenth Conference on
Uncertainty in Artificial Intelligence, Stanford, CA, 2000.
36. D. Vandev and Y.G. Tsvetanova. ``Perfect chains and single linkage clustering algorithm,'' Statistical Data
Analysis, Proceedings SDA-95, 1995.
37. W. Wang, J. Yang, and R. Muntz. ``STING: A statistical information grid approach to spatial data mining,''
23rd Int. Conf on Very Large Data Bases, Athens, Greece, 1997.
38. C. Zhang and Y. Murayama. ``Testing local spatial autocorrelation using k-order neighbors,'' International
Journal of Geographical Information Science, Vol. 14:681–692, 2000.
Diansheng Guo is a Ph.D. student (ABD) at the Department of Geography and the GeoVISTA Center,
Pennsylvania State University, USA. He received his M.S. degree in GIS and cartography from the State Key Lab
of Resources and Environmental Information Systems, Chinese Academy of Sciences, 1999. He obtained his B.S.
degree from the Department of Urban and Environmental Sciences, Peking University, 1996. His research
interests are data mining, exploratory data analysis, spatial databases, geovisualization, and their applications in
environmental and social data analysis.
Donna Peuquet is currently Professor in the Department of Geography, The Pennsylvania State University. She
holds degrees from the University of Cincinnati and the State University of New York at Buffalo. Her principal
research interests are spatio-temporal data models, spatial cognition, and data mining.
Mark Gahegan is a professor of geography and associate director of the GeoVISTA research center at
Pennsylvania State University, USA. His research interests are broad, covering most aspects of GIS, specifically: geovisualization, semantic models of geographic information, geo-computation, digital remote sensing, artificial
intelligence tools, spatial analysis and spatial data structures (Voronoi diagrams).