GeoInformatica 7:3, 229–253, 2003. © 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

ICEAGE: Interactive Clustering and Exploration of Large and High-Dimensional Geodata

DIANSHENG GUO, DONNA J. PEUQUET AND MARK GAHEGAN
Department of Geography and GeoVISTA Center, Pennsylvania State University, 302 Walker Building, University Park, PA 16802
E-mail: [email protected], [email protected], [email protected]

Abstract

The unprecedented size and high dimensionality of existing geographic datasets make the complex patterns that potentially lurk in the data hard to find. Clustering is one of the most important techniques for geographic knowledge discovery. However, existing clustering methods have two severe drawbacks for this purpose. First, spatial clustering methods focus on the specific characteristics of distributions in 2- or 3-D space, while general-purpose high-dimensional clustering methods have limited power in recognizing spatial patterns that involve neighbors. Second, clustering methods in general are not geared toward allowing the human-computer interaction needed to effectively tease out complex patterns. In the current paper, an approach is proposed to open up the "black box" of the clustering process for easy understanding, steering, focusing and interpretation, and thus to support an effective exploration of large and high-dimensional geographic data. The proposed approach involves building a hierarchical spatial cluster structure within the high-dimensional feature space, and using this combined space for discovering multi-dimensional (combined spatial and non-spatial) patterns with efficient computational clustering methods and highly interactive visualization techniques.
More specifically, this includes the integration of: (1) a hierarchical spatial clustering method to generate a 1-D spatial cluster ordering that preserves the hierarchical cluster structure, and (2) a density- and grid-based technique to effectively support the interactive identification of interesting subspaces and the subsequent search for clusters in each subspace. The implementation of the proposed approach is fully open and interactive, supported by various visualization techniques.

Keywords: geographic knowledge discovery, spatial clustering and ordering, hierarchical subspace clustering, visualization and interaction

1. Introduction

Increasingly large volumes of geographic data are being collected, but the spatial data analysis capabilities currently available have not kept up with the need for deriving meaningful information from these data [26], [28]. The unprecedented size and high dimensionality of existing geographic datasets make the complex patterns that potentially lurk in the data hard to find. It is critical to develop new techniques to efficiently and effectively assist in deriving information from these large and heterogeneous spatial datasets. Towards this goal, spatial data mining and knowledge discovery approaches have been gaining momentum [26]. Clustering is one of the most important tasks in data mining and knowledge discovery [14]. The aim of clustering is to find subsets within the data that behave alike enough to warrant further analysis, organizing a set of objects into groups (or clusters) such that objects in the same group are similar to each other and different from those in other groups [17], [18]. These groups or clusters should have meaning in the context of a particular problem [23].
General-purpose high-dimensional clustering methods discussed in the data mining and knowledge discovery literature mainly deal with non-spatial feature spaces and have very limited power in recognizing spatial patterns that involve neighbors. Spatial dimensions (e.g., expressed as latitude and longitude, or x and y coordinates) cannot simply be treated as two additional dimensions in a high-dimensional data space. Spatial dimensions, which are not independent of each other, carry real-world significance. Their unique and complex inter-relationships thus cause difficulties for clustering methods that do not recognize these specific inter-relationships [16]. Clustering techniques specific to spatial data have long been used as an important process in geographic analysis. Various spatial clustering approaches have been developed, including statistical approaches [38], Delaunay triangulation [13], [25], variable density [12], grid-based division [37], random walks [20], and even brute-force exhaustive search [29]. Existing spatial clustering methods, however, can only deal with a low-dimensional data space (usually 2-D or 3-D space, plus a geo-referenced attribute). Spatial clustering methods often adopt real-world dissimilarity measures, e.g., road distance or travel time, and consider complex situations, e.g., geographic obstacles [34]. Such unique clustering considerations are hard to integrate within high-dimensional clustering methods. Geospatial datasets currently encountered often have a high dimensionality (i.e., 2-D or 3-D space plus many attributes). Such data sets are often compiled from multiple data sources, which cover different themes and might have been collected for different purposes. By putting them together for analysis, we hope to find unknown (and unexpected) multivariate relationships or patterns. Inevitably, the quality and relevance of these attributes can vary dramatically.
Irrelevant or noisy attributes often exist in the data set. Therefore, clustering functions that use all dimensions of the data can be ineffective and even counter-productive (in terms of hiding clusters). The intent of subspace clustering (or projective clustering) is to identify subspaces (subsets of dimensions from the original high-dimensional data space) that contain meaningful clusters and then to search for clusters in each subspace [4]. Although several subspace clustering methods have recently been proposed [1], [4], [10], [31], none of them is yet practical for analyzing real geospatial datasets. To achieve both efficiency and effectiveness in exploring large and high-dimensional geospatial datasets, it is critical to develop a highly interactive analysis environment that integrates the best of both human and machine capabilities [6]. Computational methods can search large volumes of data for a specific type of pattern very quickly with mechanical accuracy and consistency, but they have very limited ability to adapt to various data sets and to interpret complex patterns. In contrast, humans can visually pick out complex patterns very quickly, attach meaning to patterns (judge and interpret them), and generate hypotheses for further analysis [30]. A knowledge discovery system for handling current geospatial datasets should thus have automated computational methods integrated with interactive visualization techniques, in order to leverage the human expert's knowledge and inference capabilities in a more human-machine collaborative environment.
The goal of the research reported in the current paper is to develop a novel approach to explore complex and unexpected patterns in geospatial datasets via: (1) the integration of spatial clustering and general-purpose, high-dimensional clustering methods; and (2) the integration of automatic computational methods and highly interactive visualization techniques. We name our approach ICEAGE (Interactive Clustering and Exploration of Geodata), which involves three elements:

* An efficient hierarchical spatial clustering method that can identify arbitrarily shaped hierarchical 2-D clusters at different scales. This method generates a spatial cluster ordering that fully preserves all hierarchical clusters. In other words, any set of points that constitutes a cluster at some hierarchical level is contiguous in the 1-D ordering. By transforming hierarchical spatial clusters into a linear ordering, the integration of spatial and non-spatial information is made simpler, since the spatial cluster structure is reduced to a single axis (an additional "common" attribute) in the high-dimensional feature space.

* A density- and grid-based hierarchical subspace clustering method. A subspace is formed by a subset of dimensions from the original data space. The subspace clustering method can first identify (with human interaction) interesting subspaces and then search for clusters in each subspace. It is efficient because it first generalizes data into a small set of hyper-cells and then performs clustering with those cells. The spatial cluster ordering (above) is then integrated with this subspace clustering method to identify multivariate spatial patterns.

* A fully open and interactive environment including various visualization techniques. The user can interactively control parameters of the clustering methods and see the immediate result corresponding to each parameter change.
Several visualization techniques are developed to facilitate human interaction and interpretation.

The remainder of the paper is organized as follows. Section 2 gives a review of related research. Section 3 presents the interactive hierarchical spatial clustering. Section 4 introduces the hierarchical high-dimensional subspace clustering method. In Section 5, an integrated system for interactively searching for high-dimensional (multivariate) spatial patterns is presented, with a working demo showcasing census data analysis. Section 6 provides conclusions and includes a brief list of future work. Related material and color figures for this paper are available in the digital library of the GeoVISTA research center (www.geovista.psu.edu).

2. Related work

2.1. A general classification of clustering methods

Clustering methods can be divided into two types: partitioning and hierarchical approaches (figure 1). The partitioning approach aims to divide the data set into several clusters, which do not overlap with each other but together cover the whole data space. A data item is assigned to the "closest" cluster based on a proximity or dissimilarity measure. Hierarchical clustering approaches decompose the data set with a sequence of nested partitions, from fine to coarse resolution. Hierarchical clustering can be presented with dendrograms, which consist of layers of nodes, each representing a cluster [23]. Within each type, according to their definitions of a cluster, clustering methods may also be classified into three groups: distance-based, model-based (or distribution-based), and density-based methods (figure 1). Distance-based clustering methods rely on a distance or dissimilarity measure and an optimization criterion to group the most similar objects into clusters.
K-means and CLARANS [27] are distance-based partitioning methods, while the single-link and graph-based methods [11], [17], [18], [23] can perform distance-based hierarchical clustering. Model-based or distribution-based clustering methods assume that the data of each cluster conform to a specific statistical distribution (e.g., a Gaussian distribution) and that the whole dataset is a mixture of several distribution models. Maximum likelihood estimation (MLE) and expectation-maximization (EM) are two examples of distribution-based partitioning clustering methods [9], [11]. Model-based hierarchical clustering has been studied in [15], [35]. Density-based approaches regard a cluster as a dense region (relative to sparse regions) of data objects [4], [12], [23]. Density-based clustering can adopt two different strategies: grid-based or neighborhood-based. A grid-based approach divides the data space into a finite set of multi-dimensional grid cells, calculates the density of each grid cell, and then groups neighboring dense cells into a cluster. Such methods include Grid-Clustering [32], CLIQUE [4], OptiGrid [21] and ENCLUS [10]. The key idea of neighborhood-based approaches is that, given a radius ε (as in DBSCAN [12] and OPTICS [5]) or a side length w (as in DOC [31]), the neighborhood (either a hyper-sphere of radius ε or a hyper-cube of side length w) of an object has to contain at least a minimum number of objects (MinPts) to form a cluster around this object. Among these density-based methods, Grid-Clustering and OPTICS can perform hierarchical clustering.

Figure 1. An overview of different clustering methods. Those shown in bold font are used in this paper.

2.2. Hierarchical spatial clustering

Spatial data sets often contain hierarchical structures, and different patterns may exist at different levels (scales) within the hierarchy.
Two groups of methods have been used to detect hierarchical clustering structures in spatial data. The first group consists of the traditional hierarchical clustering methods, e.g., the single-link and graph-based methods [11], [17], [18], [23]. For clustering 2-D spatial points, Delaunay triangulation has been used extensively [13], [25] to reduce the construction complexity of the dissimilarity matrix and to efficiently locate the neighbors of each point. AMOEBA [13] is a Delaunay-based hierarchical spatial clustering method, which automatically derives a criterion function F(p) as the threshold to cut off long edges and then recursively processes each sub-graph to construct a hierarchy of clusters. AMOEBA tries to avoid the single-link effect (see Section 3.2) by detecting noise points and excluding them from all clusters at each hierarchical level. However, the criterion function F(p) is not easy to justify and customize for different application data sets and tasks. AMOEBA can only work with 2-D points, which gives it very limited power for exploring high-dimensional spatial data sets. The second alternative for hierarchical spatial clustering is to use a density-based partitioning algorithm with different parameter settings. As an extension of DBSCAN [12], OPTICS [5] is a neighborhood-based hierarchical clustering method (see figure 1). Given a "generating distance" ε and MinPts, OPTICS first identifies core objects and non-core objects (edge objects or noise). Core objects can be connected with any other core objects, while non-core objects can only be reached via core objects (no connection is allowed between non-core objects). OPTICS develops a cluster ordering to support an interactive exploration of the hierarchical cluster structure. Although OPTICS is a density-based method, after the identification of core objects and the removal of connections between non-core objects, it works exactly like a single-link method.
It avoids the single-link effect at a specific level in the hierarchy (depending on the generating distance ε and MinPts). OPTICS relies on multidimensional index structures to speed up k-nearest-neighbor queries and to maintain an O(n log n) complexity.

2.3. Subspace clustering methods

A subspace is formed by a subset of dimensions from the original high-dimensional data space. Let S be a d-dimensional data space having a set of dimensions (attributes) S = {a1, a2, ..., ad}. A subspace of S is defined as S′ = {as1, ..., ask | 0 < k < d, asi ∈ S}. Subspace clustering (or projective clustering) aims to identify subspaces of a high-dimensional data space that allow better clustering of the data objects than the original space [4]. It is often not meaningful to look for clusters using all input dimensions because some dimensions can be noisy and irrelevant, which may blur or even hide strong clusters residing in lower-dimensional subspaces [4]. Traditional dimensionality reduction methods, e.g., principal component analysis (PCA) [11], transform the original data space into a lower-dimensional space by forming new dimensions that are linear combinations of the original dimensions (attributes). Such dimensionality reduction techniques have severe drawbacks for clustering high-dimensional data [4], [31]. First, they cannot preserve clusters existing in different subspaces of the original data space. Second, the new dimensions can be very difficult to interpret, making the resulting clusters hard to understand. Third, global techniques such as PCA can fail to take account of local structures in the data. Existing subspace clustering methods include CLIQUE [4], ENCLUS [10], ORCLUS [1] and DOC [31]. CLIQUE partitions a subspace into multi-dimensional grid cells. These cells are constructed by partitioning each dimension into ξ equal-length intervals. The selectivity of a grid cell is the percentage of the total data points contained in the cell.
A cell is dense if its selectivity is greater than a density threshold τ. A cluster is a maximal set of connected dense cells. Two k-dimensional cells are connected if they share k − 1 intervals. ENCLUS is similar to CLIQUE but uses an entropy-based strategy for pruning subspaces. ORCLUS introduces the problem of generalized projected clusters. A generalized projected cluster is a set E of vectors together with a set D of data points such that the points in D are closely clustered in the subspace defined by the vectors in E, which may have a much lower dimensionality than the original data space. DOC is a Monte Carlo algorithm that computes, with high probability, a good approximation of an optimal projective cluster. The algorithm can be iterated, and each iteration generates one new cluster. The iteration stops when some criterion is met. DOC is a density- and neighborhood-based method, while CLIQUE and ENCLUS are density- and grid-based methods (see figure 1). Nevertheless, the identification of subspaces that contain clusters remains a challenging research problem, for two reasons. First, existing subspace clustering techniques always try to find clusters and their associated subspaces simultaneously. Thus the identification of interesting subspaces depends heavily on a specific clustering algorithm. Even worse, it may also depend on several subjective input parameters of the clustering algorithm. For example, CLIQUE needs the interval number ξ and the density threshold τ, ORCLUS needs the number of clusters k and the dimensionality of subspaces l, and DOC needs the side length w, the density threshold α and the balance factor β. All these parameters are critical to the algorithms but problematic to configure beforehand; in fact they correspond to strong hypotheses regarding how clusters will manifest, or what types of clusters are of interest.
Second, existing subspace clustering methods cannot perform hierarchical clustering in each subspace and cannot adapt well to different application data sets and patterns of different scales. The user needs to run the algorithm many times with different settings of one or more parameters to gain an overall understanding of the data set.

3. Hierarchical spatial clustering and ordering

Our method for hierarchical spatial clustering (of 2-D spatial points) is efficient, achieving O(n log n) complexity without using any index structure, and fully supports interactive exploration of hierarchical clusters. It combines advantages of both AMOEBA and OPTICS. Our method can generate an optimal spatial cluster ordering that preserves hierarchical clusters and encodes as much spatial proximity information as possible. It is based on Delaunay triangulation (DT) and a minimum spanning tree (MST), and overcomes the single-link effect by singling out boundary points for special treatment. To simplify the description of the method, we first introduce it without considering boundary points. Then a method is introduced for singling out boundary points and treating them differently.

3.1. Description of the method

The input consists of a set of 2-D points V = {v1, v2, ..., vn}, where vi = (xi, yi) is a location in geographic space. Our clustering method (without tackling the single-link effect) takes three steps: (1) construct a DT, and then construct an MST from the DT; (2) derive an optimal cluster ordering of the points in V; (3) visualize the cluster ordering and interactively explore the hierarchical structure.

3.1.1. Construct DT and MST. A DT is constructed from the input points using the Guibas-Stolfi algorithm [19], which adopts a divide-and-conquer strategy and is of O(n log n) complexity. The triangulation result (figures 2 and 4) is stored in memory with
efficient access for: each point, each edge, the end points of an edge, and the edges incident on a point. Each edge has a length, which is the dissimilarity between its two end points. Kruskal's algorithm [7], which is also of O(n log n) complexity, is used to construct an MST from the DT. Basically, an MST is a subset of the edges in a DT. At the beginning of the construction phase, the MST contains no edges and each point is itself a connected graph (altogether n graphs). The algorithm first sorts all edges in the DT in increasing order. Following this order (from the shortest edge), each edge is examined in turn. If an edge connects two points in two different connected graphs, the algorithm adds the edge to the MST. If an edge connects two points in the same graph (i.e., forms a cycle in the graph), the edge is discarded. When all the points are in a single graph, the spanning tree is complete (figure 2).

Figure 2. Constructing the MST. The thinner lines show the triangulation; the thicker lines are the MST. Edges are selected in the order AB, BC, BE, CD, JK, HJ, HI, HG, JL, EF, DL. Numbers indicate the length of each edge.

Figure 3. Derivation of an optimal ordering of points. Horizontal lines under points show the hierarchy of clusters.

Figure 4. The DT, MST (thicker edges) and hierarchical clusters.

3.1.2. Derive a cluster ordering. From the MST, an optimal ordering of all points can be derived that completely preserves the hierarchical cluster structure and additional spatial proximity information. A cluster (connected graph) can be viewed as a chain of points [36]. At the lowest level, each cluster (or chain) contains a single point. Each chain has two end points (at the very beginning they are the same point). When two clusters are merged into one with an edge, the closest two ends (one from each chain) are connected in the new chain. The ordering of the points in figure 2 is shown in figure 3.
All hierarchical clusters (points underscored by a line) are preserved (i.e., contiguous) in the 1-D ordering. Moreover, the ordering preserves additional spatial proximity information as well as the hierarchical clusters. For example, when D is merged into cluster {E, C, B, A} with edge DC, it can be placed next to A or next to E in the ordering; either choice would equally preserve the cluster {D, E, C, B, A}. D is placed next to E rather than A because DE < DA. Thus the proximity among D, E, and C is also preserved, although they do not form a hierarchical cluster at any level.

3.1.3. Visualize the ordering and interactively explore the hierarchy. Now we consider the larger data set shown in figure 4. Its cluster ordering is visualized in figure 5 (upper half). This visualization idea has already been sketched in figure 3. The horizontal axis in the ordering graph (figure 5: upper half) represents the ordering of points (labeled "instances" because this visualization tool can also be used for non-spatial data sets and orderings). Here there are altogether 74 points. The vertical axis represents the length of each edge. Each vertical line segment is an edge in the MST. Between two neighboring points in the ordering there is an edge; thus there are altogether 73 edges. With this visualization technique, a cluster appears as a valley in the graph. Distinct clusters are separated by long edges (high ridges). The second horizontal line (other than the bottom axis) is the threshold value for cutting off long edges. By interactively dragging this threshold line (bar), one can explore clusters at different hierarchical levels. Given a length threshold and a minimum number of points (MinClusSize) that a major cluster should have, the algorithm can automatically extract major clusters and minor clusters (those having fewer than MinClusSize points). Major clusters are colored differently in both the ordering visualization and the 2-D map (see figures 4 and 5).
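As a concrete illustration of Sections 3.1.1 and 3.1.2, the sketch below runs Kruskal's algorithm over a set of candidate (e.g., Delaunay) edges and then merges point chains in the same edge order, connecting the closest pair of chain ends. This is our own minimal reading of the steps described in the text, not the authors' implementation; the point names and coordinates are illustrative.

```python
from math import dist  # Python 3.8+

def kruskal_mst(points, edges):
    """points: dict name -> (x, y); edges: iterable of (name_a, name_b) pairs.
    Returns the MST edges in increasing order of length (Kruskal's order)."""
    parent = {p: p for p in points}
    def find(p):                       # union-find with path halving
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p
    mst = []
    for a, b in sorted(edges, key=lambda e: dist(points[e[0]], points[e[1]])):
        ra, rb = find(a), find(b)
        if ra != rb:                   # joins two components: keep the edge
            parent[ra] = rb
            mst.append((a, b))
    return mst                         # edges forming a cycle were skipped

def cluster_ordering(points, mst_edges):
    """Merge chains in MST (Kruskal) order. When an edge joins two chains,
    connect the closest pair of chain end points, so every hierarchical
    cluster stays contiguous in the final 1-D ordering."""
    chain = {p: [p] for p in points}   # each point starts as its own chain
    for a, b in mst_edges:             # assumed already in increasing length
        ca, cb = chain[a], chain[b]
        # try both orientations of both chains; join the closest ends
        best = min(((x, y) for x in (ca, ca[::-1]) for y in (cb, cb[::-1])),
                   key=lambda xy: dist(points[xy[0][-1]], points[xy[1][0]]))
        merged = best[0] + best[1]
        for p in merged:
            chain[p] = merged
    return chain[next(iter(points))]   # all points share one chain at the end
```

For the four collinear points below, the two near-by points C and D form the tightest cluster, and the derived ordering keeps them adjacent, as the text requires.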
This visualization of the ordering differs from the ordering of OPTICS [5] in two respects. First, here each vertical line segment is an edge (not a point as in OPTICS). Thus it is much easier than in OPTICS (which needs to analyze the steepness of the start and end points of a cluster) to automatically extract clusters from the ordering. Second, as introduced above, this ordering not only preserves all hierarchical clusters, but also tries to preserve as much other spatial proximity information as possible, i.e., by connecting the closest pair of end points (one from each chain) when merging two clusters. A trend plot (bottom of figure 5) is developed to visualize the relationship between a distance threshold and the total number of clusters in the data set, given a minimum number of points (MinClusSize) that a major cluster should have. We call clusters that have fewer than MinClusSize points minor clusters. In a trend plot, the horizontal axis represents values of the possible threshold length. The vertical axis indicates the number of clusters given a threshold edge length. The threshold can be set interactively by dragging a vertical bar. This vertical bar is linked with the horizontal bar in the cluster ordering (top of figure 5): when you drag one, the other moves accordingly. With the cluster ordering and the trend plot, one can clearly understand the overall hierarchical cluster structure of the data and explore it interactively with ease.

Figure 5. The cluster ordering and the trend plot of the data in figure 4 (MinClusSize = 3). Above the threshold bar in the ordering, the total number of clusters (#major/#minor) is shown. Given the threshold (9.8), there are three major clusters and six minor clusters.
3.2. Tackling the single-link effect

Without further improvement, an MST-based clustering method can suffer from the single-link effect (also called the chaining effect), which is caused by linearly connected points that run through a sparse area. In figure 6 (left), points a, b, c, d, e, f, and g can potentially cause single-link effects at different hierarchical levels. For example, C21 will first merge with C13 rather than C22, through the connection between points a, b, and c. As reviewed in Section 2, AMOEBA and OPTICS both try to avoid the single-link effect, but the former cannot support a flexible hierarchical clustering while the latter can only avoid the single-link effect at a specific level.

Figure 6. Left: A simple MST, with no consideration of boundary points. Right: our modified MST considering boundary points (light gray points). The single-link effect is avoided at all levels.

We propose a measure, deviation-to-minimum-length (DML), to detect boundary points, which are located either on the boundary of a cluster (at various hierarchical levels) or on a line in a sparse area. By treating these boundary points differently, the single-link effect at several levels can be avoided. For a point p, its DML value is calculated with the following equation:

DML(p) = sqrt( Σ_{e=1..Ne} (Le − Lmin)² / Ne )

where Ne is the number of edges incident to point p in the DT, Le is the length of an edge incident to p, and Lmin is the length of the shortest edge incident to p. A high DML value indicates that the point lies on a boundary: some neighbors are very close while other neighbors are far away. We name the non-boundary points core points (after OPTICS). A visual interface is developed for the user to interactively configure the DML threshold value and visualize the boundary points and core points on a map, which can help the user to set a reasonable DML threshold value.
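The DML measure is straightforward to compute from the lengths of the Delaunay edges incident to a point. The following sketch is our own illustration of the formula; the function name is an assumption.

```python
from math import sqrt

def dml(incident_edge_lengths):
    """DML(p) = sqrt( sum_e (L_e - L_min)^2 / N_e ), over the N_e edges
    incident to p; L_min is the shortest incident edge length."""
    lengths = list(incident_edge_lengths)
    n_e = len(lengths)
    l_min = min(lengths)
    return sqrt(sum((l - l_min) ** 2 for l in lengths) / n_e)
```

A point whose incident edges all have similar lengths (an interior point) gets a DML near zero; a point with a few short and a few long incident edges (a boundary point) gets a high DML.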
The map in figure 7 can show either the resulting clusters or the core/boundary points, but cannot show both at the same time. Which is shown depends on which component, the ordering (figure 7: middle) or the DML graph (figure 7: bottom), has the focus. In the improved MST, core points can only be connected through core points, i.e., on the path from one core point to another core point in the MST, no boundary point is allowed. Boundary points can connect to other boundary points or to core points (figure 6: right). This treatment differs from both AMOEBA and OPTICS in that boundary points are not necessarily "noisy" points, and they are allowed to be included in clusters. The construction of our modified MST maintains an O(n log n) time complexity.

Figure 7. Cluster ordering of 3,128 USA cities in the contiguous 48 states.

The cluster ordering of the 3,128 USA cities (excluding cities in Alaska and Hawaii) is shown in figure 7. In the map, only cities belonging to a major cluster are visible. The single-link effect is successfully avoided. Some major clusters are numbered to show the mapping between clusters in the 2-D space and the valleys in the ordering. With the ordering, it is easy to see the hierarchical cluster structure within each major cluster. In other words, the ordering is a complete representation of the hierarchical cluster structure discovered by the clustering algorithm.

3.3. Cluster ordering as input to high-dimensional clustering

The spatial cluster ordering derived above, which preserves all hierarchical spatial clusters and some additional spatial proximity information, can be treated as a single attribute occupying only one dimension in any general-purpose high-dimensional clustering method for identifying multivariate clusters. Such an integration will be introduced in Section 5.
Section 4 will introduce an interactive grid- and density-based method for effective and efficient hierarchical subspace clustering.

4. High-dimensional subspace evaluation and clustering

We develop a density- and grid-based approach (see figure 1) for hierarchical subspace clustering, which is similar to CLIQUE but improved in several respects. First, our approach uses a nested-mean discretization method instead of the equal-interval method used in CLIQUE, making it more flexible locally. Second, an entropy-based evaluation method is adopted to rank subspaces before searching for clusters in each of them. Third, by treating each multi-dimensional grid cell as a "point" and calculating a synthetic distance measure between two "points", the hierarchical spatial clustering method introduced above can easily be extended to perform hierarchical subspace clustering. Fourth, with various visualization techniques, our approach supports a fully interactive exploration and interpretation of clusters. Our approach can efficiently process very large data sets.

4.1. Discretization of each dimension

Each dimension needs to be discretized into a set of intervals. Intervals from different dimensions together divide a data space into a set of hyper-cells, each of which contains some data points. There are many existing discretization (classification) methods for single-dimensional data [33]. CLIQUE adopted the equal-interval (EI) method. We choose the nested-mean (NM) method (figure 9) over the EI method (figure 8) to improve effectiveness.

Figure 8. Equal-interval discretization of a 2-D data set. This synthetic data set has 3,500 points and contains three clusters (of different sizes) and a portion of noise points.

The EI approach divides a dimension into a number of intervals, each of which has the same length. This approach does not consider the distribution of the data; it uses only the minimum and maximum values.
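For reference, the equal-interval scheme just described can be written in a few lines (an illustrative sketch; the function name is ours):

```python
def equal_interval_breaks(values, r):
    """Internal break points dividing [min, max] into r equal-length intervals.
    Only the minimum and maximum are used; the distribution is ignored."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / r
    return [lo + i * step for i in range(1, r)]
```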
Although it can effectively locate strong clusters, it often results in an extremely uneven assignment of data items to cells and fails to examine detailed patterns within a dense area. As shown in figure 8, with the EI approach the two smaller but much denser clusters fall within a single cell. These two clusters can therefore never be distinguished in further analysis with those cells. Extreme outlier values can also severely affect the effectiveness of the EI approach.

The NM approach adapts well to the data distribution and is robust to outlier values and noisy points. It recursively calculates the mean value of the data and cuts the data set into two halves at the mean value (figure 9). Each half is then cut into halves at its own mean value. This recursive process stops when the required number of intervals is obtained.

Figure 9. Nested-mean discretization of the same data set used in figure 8. The length of each interval is no longer the same: it is shorter in dense areas and longer in sparse areas. However, the cells of a dense area are always denser (in terms of how many points a cell contains) than those of a sparse area.

The NM discretization can examine detailed structures within a dense region and, at the same time, can capture coarse patterns in a comparatively sparse region. Although NM tends to divide a cluster into several cells, the cells that constitute the cluster are always denser than neighboring cells. As shown in figure 9, the two smaller but denser clusters now fall into eight cells, each of which is still much denser than the cells in sparse areas. Thus these two clusters remain distinguishable in further analysis. The synthetic distances (see next section) among the cells of the same cluster are very small, and the clustering procedure can easily restore the cluster by connecting them.
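The nested-mean splitting described above can be sketched as follows (a minimal reading of the method, not the paper's implementation; degenerate splits on constant data are not handled). `nested_mean_breaks` returns the interior break points for 2**k intervals, and `intervals_per_dimension` applies the rule of thumb, discussed next in the text, that r = 2^k should roughly equal n^(1/d):

```python
import math

def nested_mean_breaks(values, k):
    """Nested-mean discretization sketch: recursively split the data at
    its mean until 2**k intervals (up to 2**k - 1 break points) result."""
    def split(vals, depth):
        if depth == 0 or not vals:
            return []
        m = sum(vals) / len(vals)
        left = [v for v in vals if v <= m]
        right = [v for v in vals if v > m]
        # break points from the left half, this mean, then the right half
        return split(left, depth - 1) + [m] + split(right, depth - 1)
    return split(list(values), k)

def intervals_per_dimension(n, d):
    """Rule of thumb from the text: pick r = 2**k with r**d roughly n."""
    k = max(1, round(math.log2(n ** (1.0 / d))))
    return 2 ** k
```

Because the splits follow the means, a heavy cluster and a few extreme outliers end up in different intervals, which is exactly the robustness the text claims over equal-interval binning.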
The number of intervals r needed for each dimension is determined by the subspace size (d, the number of dimensions involved in the subspace) and the data set size n. A general rule adopted here is that r^d should be roughly equal to n, i.e., r should be around n^(1/d). For the nested-mean discretization, r should also equal 2^k (where k is a positive integer). We use these two rules to determine r. For example, if d = 4 and n = 3,800, then since 2^3 = 8 and 8^4 = 4,096 (close to 3,800), r should be 8. With this strategy, our approach is scalable to very large data sets.

4.2. Entropy-based subspace evaluation

A subspace clustering method needs an effective approach to evaluate subspaces and rank them according to their ``interestingness''. The user can then start from the top of this ranking list to quickly locate important subspaces and hence significant patterns. We adopt an entropy-based evaluation criterion developed by Cheng and others for pruning subspaces [10]. The entropy of a grid-based subspace is calculated with the following equation:

H(X) = −∑_{x∈X} d(x) log d(x)

H(X) is the entropy value of subspace X, which is a collection of grid cells. The density of a cell, d(x), is the fraction of the total data items contained in the cell. Figure 10 shows an example of how to apply the entropy-based evaluation. Two subspaces, each consisting of two dimensions, are discretized into grid cells. Each subspace has 16 cells in this example. The base of the log function is therefore 16, to ensure that the entropy values fall in the range [0, 1]. The calculations of the entropy values for the left subspace S1 and the right subspace S2 are shown below. H(S1) is smaller than H(S2) because subspace S1 is more ``clustered''.
H(S1) = −(0.02 log16 0.02 + 0.01 log16 0.01 + … + 0.15 log16 0.15) = 0.678;
H(S2) = −(0.08 log16 0.08 + 0.1 log16 0.1 + … + 0.05 log16 0.05) = 0.957.

Figure 10. Two discretized subspaces. The number in each cell is the density (a percentage value) of the cell.

Although entropy is an effective measure of the interestingness of a subspace, calculating entropy values for all possible subspaces is impossible given a high-dimensional data set (e.g., 20 or more dimensions). In this paper, the user needs to specify the dimensionality of subspaces. The algorithm then searches all subspaces of this dimensionality and ranks them according to their entropy values. A much more efficient (and more complicated) approach for identifying interesting subspaces was also developed, but it is beyond the scope of this paper.

4.3. A synthetic distance measure between two cells

A similarity or dissimilarity (distance) measure is always needed to perform hierarchical clustering. The choice of a distance measure can dramatically influence the clustering result. For clustering analysis, especially high-dimensional clustering, many distance measures have been used and evaluated [2], [3], [8], [24]. A Hamming-type distance between two hyper-cells can be derived from the number of intervals shared by the two cells, and is potentially suitable for hierarchical clustering. However, the Hamming distance does not consider the distribution of data points within each cell. Two diagonal cells share no intervals and therefore always receive the maximum Hamming distance, although the majority of points in the two cells can be very close to each other. For example, in figure 9, the smallest (but densest) cluster is divided into four cells, and the data distribution in each cell is skewed towards the others. To effectively identify hierarchical clusters given a set of dense cells, we calculate a synthetic distance measure that considers both the nominal position of intervals and the distribution of data points within each cell.
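Before turning to distances, the entropy criterion of Section 4.2 can be made concrete with a short sketch. The function name is ours, and the cell densities are assumed to have been computed already as fractions of the data set; setting the log base to the number of cells keeps H in [0, 1], as in the text:

```python
import math

def subspace_entropy(densities, num_cells):
    """H(X) = -sum over non-empty cells of d(x) * log(d(x)),
    with the log base equal to the number of cells."""
    return -sum(d * math.log(d, num_cells) for d in densities if d > 0)
```

A perfectly clustered subspace (all data in one cell) scores 0, a uniform subspace scores 1, and more clustered subspaces always score lower than more spread-out ones, which is why the ranking starts from the low-entropy end.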
First, a synthetic value (SynVal) is calculated for each interval within each cell, based on three values: (1) the nominal position i of the interval in its dimension, (2) the interval bounding values [Min_i, Max_i], and (3) the mean value Mean_i, for that dimension, of all data points contained in the cell:

SynVal = (Mean_i − (Max_i + Min_i)/2) / (Max_i − Min_i) + i

Note that the SynVal of the same interval in different cells can differ, because the cells contain different data points and hence have different mean values. For ease of explanation, let us consider a 1-D space, where each cell is defined by a single interval (figure 11). The dimension has the range [0, 100] and is divided with the NM discretization into four intervals. Thus there are four cells, each of which is defined by a single interval. The synthetic value of each interval in each cell is shown in figure 11.

Figure 11. The calculation of the synthetic value of an interval within a hyper-cell. Here we take a 1-D space as an example. Each cell is defined by a single interval.

These synthetic interval values, which integrate both the global nominal ordering and the local numerical variance, can preserve the data distribution. Each hyper-cell is defined as a high-dimensional ``point'' with a vector of the synthetic values of its constituent intervals. The distance between two cells is the Euclidean distance between their two vectors of synthetic values. One advantage of this synthetic distance measure is that the distance between two diagonal cells (cells that do not share intervals) may be very short if the data points in each cell are skewed towards each other. With this distance measure, the three clusters in figure 9 can be easily identified by the clustering algorithm. For example, in figure 9, although cell (X1, Y2) shares an interval with cell (X2, Y2), the data points in them are attracted to different clusters and therefore their synthetic values are very different from each other. Thus the two small clusters can be easily separated.
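A sketch of the synthetic values and the resulting cell distance. This is our reading of the SynVal definition (the nominal interval position i plus the normalized offset of the in-cell mean from the interval midpoint); the function names and the tuple layout are our own:

```python
import math

def synthetic_value(i, lo, hi, mean):
    """SynVal: nominal position i shifted by the normalized offset of the
    in-cell mean from the interval midpoint (range roughly i +/- 0.5)."""
    return (mean - (lo + hi) / 2.0) / (hi - lo) + i

def cell_distance(cell_a, cell_b):
    """Euclidean distance between two hyper-cells, each given as a list of
    (interval index, interval min, interval max, in-cell mean) per dimension."""
    va = [synthetic_value(*dim) for dim in cell_a]
    vb = [synthetic_value(*dim) for dim in cell_b]
    return math.dist(va, vb)
```

Two adjacent 1-D cells whose points pile up against their shared boundary receive a distance near 0, which is exactly the behavior the synthetic measure is designed to produce and the Hamming measure cannot.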
Our algorithm design is also flexible enough to support a collection of distance measures for the user to choose from and compare. Once a distance measure is chosen, the hierarchical spatial clustering method introduced previously can easily be extended to perform hierarchical clustering on a set of hyper-cells of a subspace.

4.4. Interactive hierarchical subspace clustering and visualization

To facilitate an interactive exploration and interpretation of the hierarchical subspace clustering process, a subspace chooser, a density plot, an HD cluster ordering, and an HD cluster viewer (figure 12) are developed to cooperatively support a human-led exploration of hierarchical clusters in different subspaces.

Figure 12. The ordered subspace list, HD density plot, HD cluster ordering, and the HD cluster viewer. This is the overall interface for high-dimensional hierarchical subspace clustering.

The subspace chooser (bottom right in figure 12) is a visualization component that lists subspaces ordered by their entropy values. The user chooses a subspace size for the system to enumerate and evaluate all subspaces of that size. The constituent dimensions of each subspace and its entropy value are shown in the subspace chooser. Once a subspace is selected from the list, an array of the non-empty cells of this subspace is passed to the density plot for visualization.

The density plot (middle right in figure 12) is a visualization component that helps the user understand the overall distribution of cell densities and interactively configure the density threshold by dragging the threshold bar (the horizontal line in the middle of the density plot). Taking as input an array of non-empty cells of a subspace (selected in the subspace chooser), the density plot first orders the cells according to their densities and then plots them on a 2-D plane. The number right above the threshold bar is the total coverage
(the total percentage of all data) of the current dense cells under the current density threshold. For example, in figure 12, which shows a remote sensing data set with 1,705 data records, there are 264 non-empty cells. The current density threshold is 0.6%, i.e., about 10 data points in a cell. With this threshold, 30 dense cells (out of 264) are extracted, and altogether they contain 57.3% of all data points. The plot can be zoomed in or out for better views. The density plot can thus facilitate a reasonable configuration of the density threshold. Once the user sets a new threshold (by dragging the threshold bar), a new set of dense cells is extracted and passed to the HD cluster ordering component for interactive hierarchical clustering.

The HD cluster ordering (top right in figure 12) is similar to, and extended from, the spatial cluster ordering and visualization in Section 3. The construction of an HD cluster ordering from the dense cells takes four steps: (1) construct a pairwise distance (dissimilarity) matrix (since the number of dense cells is much smaller than the data set size n, this step does not cause a time-complexity problem), (2) construct a hyper-MST from the distance matrix, (3) derive an HD cluster ordering, and (4) visualize the ordering for interactive control and exploration. The ordering can clearly show the hierarchical structure within the data and conveniently supports dynamic browsing of clusters at different hierarchical levels. While the user interactively controls the distance threshold, the immediate clustering result is visualized in the HD cluster viewer with each cluster in a different color.

The HD cluster viewer (left in figure 12) is based around the PCP (parallel coordinate plot), which allows investigation of high-dimensional spaces [22]. In this case the PCP is used to visualize hyper-cells rather than actual data points. Each string (consisting of a series of line segments) represents a hyper-cell.
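The thresholding step that the density plot supports interactively can be sketched as below; the cell-to-density mapping and the numbers in the test are hypothetical toy values, while the real component works on the ordered array of non-empty cells:

```python
def extract_dense_cells(cell_densities, threshold):
    """Apply the density-plot threshold: keep cells whose density
    (fraction of all data points) is at least the threshold, and report
    the total coverage of the kept cells."""
    dense = {cell: d for cell, d in cell_densities.items() if d >= threshold}
    return dense, sum(dense.values())
```

Dragging the threshold bar simply re-runs this extraction, and the reported coverage tells the user what fraction of the data the subsequent clustering will actually see.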
The color of a string represents the current cluster label of the cell. The width of a string roughly represents the density of the cell. When the user interacts with the subspace chooser, the HD density plot, or the HD cluster ordering, the HD cluster viewer is automatically updated; thus the clustering result associated with different input parameters can be seen immediately during interaction. The user can also select strings in the HD cluster viewer to highlight those cells in the ordering. Several types of selection are supported, including single selection, intersect selection (an intersection of multiple single selections), and union selection (a union of multiple single selections). The user can thus visually and interactively explore multivariate patterns based on the clustering result.

5. Integration for high-dimensional spatial clustering

Integrating the hierarchical spatial clustering method introduced in Section 3 with the hierarchical subspace clustering method introduced in Section 4 is actually very simple: add the spatial cluster ordering (generated by the hierarchical spatial clustering component) as an additional attribute to the original data set, and then input the combined data set to the hierarchical subspace clustering component. Let this ``new attribute'' be ``SpaOrdering''. The SpaOrdering value of a spatial point is its nominal position in the cluster ordering. If a subspace involving SpaOrdering has a low entropy value and ranks high in the subspace list, then this subspace might contain significant multivariate spatial clusters and is an interesting candidate for interactive clustering and exploration.

Now we integrate both the hierarchical spatial clustering and the hierarchical subspace clustering methods to analyze a census data set of 3,128 USA cities.
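The integration step can be sketched as follows. The record layout and the `id` field are hypothetical; SpaOrdering is the attribute name used in the text, and `ordering` is assumed to list record identifiers in cluster-ordering sequence:

```python
def add_spatial_ordering(records, ordering):
    """Append each record's position in the 1-D spatial cluster ordering
    as a new attribute, 'SpaOrdering', so that any subspace clustering
    run can treat space as a single ordinary dimension."""
    rank = {rid: pos for pos, rid in enumerate(ordering)}
    return [dict(rec, SpaOrdering=rank[rec["id"]]) for rec in records]
```

Because spatially close points receive nearby SpaOrdering values, a subspace such as (SpaOrdering, WHITE_P, BLACK_P) can expose clusters that are simultaneously compact in space and homogeneous in the attributes.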
The data set has many attributes: LOCATION_X, LOCATION_Y, ELEVATION, WHITE_P, BLACK_P, AMERI_ES_P, ASIAN_PI_P, OTHER_P, HISPANIC_P, DIVORCED_P, MOBILEHOME_P, etc. LOCATION_X and LOCATION_Y together represent the location of a city. WHITE_P is the percentage of the white population in the total population of the city. Similarly, the other race-population attributes (BLACK_P, ASIAN_PI_P, etc.) are all percentages of the total population. With the spatial cluster ordering (SpaOrdering) as a common attribute, the subspace clustering method can efficiently and effectively identify multivariate spatial clusters with human interaction. Figure 13 shows a system snapshot, in which a subspace involving SpaOrdering, WHITE_P, and BLACK_P is selected and several strong clusters emerge quickly with human interaction and steering.

Different subspaces need different parameters to locate the strongest clusters they contain. It is our contention that these parameters are most effectively configured via visualization and human interaction. The subspace shown in figure 13 has 238 non-empty cells, which are passed to the HD density plot, where they are ordered and visualized. From the shape of the density-plot curve, the user can easily see where to set the density threshold. The currently selected dense cells have a total coverage of 52.52%. These dense cells (42 altogether) are then passed to the HD cluster ordering component, which constructs a hyper-MST from the cells and derives a cluster ordering. With this ordering we can clearly see the hierarchical cluster structure. By manipulating the distance threshold we can see five major clusters emerging (see figure 13).

Cluster 1 represents those cities that have a very high (compared to other cities) black population and a very low white population, and are spatially close (indicated by their similar SpaOrdering values: the southeast and east region).
Cluster 2 is similar to cluster 1 (high black population and low white population) except that its cities concentrate in a different region (the mid-south). Cluster 3 represents those cities that have a medium black population and a medium white population, and are spatially close: north of (and partly overlapping with) the regions of clusters 1 and 2. Cluster 4 represents those cities that have a low black population and a very high white population, and are also spatially close (the vast north and northeast region). Cluster 5 represents those cities that have a low black population and a medium-high white population, and are also spatially close (the west and Pacific coast region, where the Hispanic, Asian, and American Indian populations are comparatively high).

Figure 13. An application demo: analyzing census data for USA cities.

The above finding is only a very small part of the whole discovery process. The user can interactively select subspaces, control parameters, interpret emerging patterns, and then decide the next action for further exploration. The integrated approach and the implemented system have also been applied to several other real geographic data sets and have shown their adaptability and effectiveness for searching for multivariate spatial patterns. Since this paper focuses on introducing the methodology, analyses of other data sets are not presented here.

6. Conclusion and future work

This paper reported a novel approach, ICEAGE (Interactive Clustering and Exploration of Geodata), for exploring complex and unexpected patterns in geospatial datasets via: (1) the integration of spatial clustering and general-purpose, high-dimensional clustering methods; and (2) the integration of automatic computational methods and highly interactive visualization techniques.
The contribution of the research is in three parts: (1) an efficient hierarchical spatial clustering method that can identify arbitrarily shaped hierarchical 2-D clusters at different scales, (2) a density- and grid-based hierarchical subspace clustering method, and (3) a fully open and interactive environment incorporating various visualization techniques. With such an open and interactive approach and the use of various visualization techniques, the ``black box'' of the clustering process is opened up for easy understanding, steering, focusing, and interpretation. Multivariate spatial patterns can be discovered efficiently and effectively.

We are currently exploring the use of these methods to search for interesting patterns in datasets combining cancer epidemiology, health infrastructure, and census variables, from which we hope to identify possible relationships between disease incidence, healthcare accessibility, and socio-demographic variables. Currently our approach can process only numerical data (nominal data are treated as numerical data). It is not difficult to extend the system to address nominal data properly, because numerical data is in any case first discretized into nominal intervals in the method, and nominal data types are easier to discretize. Further investigation is needed to justify and evaluate the effectiveness of the spatial cluster ordering as a spatial pattern representation. A more efficient strategy for identifying interesting subspaces will be incorporated into the reported system in the future.

Acknowledgment

This paper is partly based upon work funded by NSF Digital Government grant No. 9983445 and grant CA95949 from the National Cancer Institute.

References

1. C. Aggarwal and P. Yu. ``Finding generalized projected clusters in high dimensional spaces,'' ACM SIGMOD International Conference on Management of Data, 2000.
2. C.C. Aggarwal.
``Re-designing distance functions and distance-based applications for high dimensional data,'' SIGMOD Record, Vol. 30:13–18, 2001.
3. C.C. Aggarwal, A. Hinneburg, and D.A. Keim. ``On the surprising behavior of distance metrics in high dimensional space,'' in Database Theory (ICDT 2001), Vol. 1973, Springer-Verlag: Berlin, 2001.
4. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. ``Automatic subspace clustering of high dimensional data for data mining applications,'' ACM SIGMOD International Conference on Management of Data, Seattle, WA, USA, 1998.
5. M. Ankerst, M.M. Breunig, H.-P. Kriegel, and J. Sander. ``OPTICS: Ordering points to identify the clustering structure,'' ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, 1999.
6. M. Ankerst, M. Ester, and H.-P. Kriegel. ``Towards an effective cooperation of the user and the computer for classification,'' Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 2000.
7. S. Baase and A.V. Gelder. Computer Algorithms. Addison-Wesley, 2000.
8. A. Bookstein, V.A. Kulyukin, and T. Raita. ``Generalized Hamming distance,'' Information Retrieval, Vol. 5:353–375, 2002.
9. P. Bradley, U. Fayyad, and C. Reina. ``Scaling clustering algorithms to large databases,'' ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York City, 1998.
10. C. Cheng, A. Fu, and Y. Zhang. ``Entropy-based subspace clustering for mining numerical data,'' ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 1999.
11. R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley & Sons, New York, 2000.
12. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. ``A density-based algorithm for discovering clusters in large spatial databases with noise,'' The 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.
13. V. Estivill-Castro and I. Lee. ``AMOEBA: Hierarchical clustering based on spatial proximity using Delaunay diagram,'' 9th International Symposium on Spatial Data Handling, Beijing, China, 2000.
14. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. ``From data mining to knowledge discovery: An overview,'' in U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press: Cambridge, MA, 1996.
15. C. Fraley. ``Algorithms for model-based Gaussian hierarchical clustering,'' SIAM Journal on Scientific Computing, Vol. 20:270–281, 1998.
16. M. Gahegan. ``On the application of inductive machine learning tools to geographical analysis,'' Geographical Analysis, Vol. 32:113–139, 2000.
17. A.D. Gordon. ``A review of hierarchical classification,'' Journal of the Royal Statistical Society, Series A (General), Vol. 150:119–137, 1987.
18. A.D. Gordon. ``Hierarchical classification,'' in P. Arabie, L.J. Hubert, and G.D. Soete (Eds.), Clustering and Classification, World Scientific Publ.: River Edge, NJ, 1996.
19. L. Guibas and J. Stolfi. ``Primitives for the manipulation of general subdivisions and the computation of Voronoi diagrams,'' ACM Transactions on Graphics, Vol. 4, 1985.
20. D. Harel and Y. Koren. ``Clustering spatial data using random walks,'' Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, 2001.
21. A. Hinneburg and D.A. Keim. ``Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering,'' Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.
22. A. Inselberg. ``The plane with parallel coordinates,'' The Visual Computer, Vol. 1:69–97, 1985.
23. A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall: Englewood Cliffs, NJ, 1988.
24. A.K. Jain, M.N. Murty, and P.J. Flynn. ``Data clustering: A review,'' ACM Computing Surveys (CSUR), Vol.
31:264–323, 1999.
25. I.-S. Kang, T.-W. Kim, and K.-J. Li. ``A spatial data mining method by Delaunay triangulation,'' The 5th International Workshop on Advances in Geographic Information Systems, Las Vegas, Nevada, 1997.
26. H.J. Miller and J. Han. ``Geographic data mining and knowledge discovery: An overview,'' in H.J. Miller and J. Han (Eds.), Geographic Data Mining and Knowledge Discovery, Taylor & Francis: London and New York, 2001.
27. R. Ng and J. Han. ``Efficient and effective clustering methods for spatial data mining,'' Proc. 20th International Conference on Very Large Databases, Santiago, Chile, 1994.
28. S. Openshaw. ``Developing appropriate spatial analysis methods for GIS,'' in D.J. Maguire (Ed.), Geographical Information Systems, Vol. 1: Principles, Longman/Wiley, 1991.
29. S. Openshaw, M. Charlton, C. Wymer, and A. Craft. ``A Mark 1 geographical analysis machine for the automated analysis of point data sets,'' International Journal of Geographical Information Science, Vol. 1:335–358, 1987.
30. D.J. Peuquet. Representations of Space and Time. New York: Guilford Press, 2002.
31. C.M. Procopiuc, M. Jones, P.K. Agarwal, and T.M. Murali. ``A Monte Carlo algorithm for fast projective clustering,'' ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, USA, 2002.
32. E. Schikuta. ``Grid clustering: An efficient hierarchical clustering method for very large data sets,'' 13th Conf. on Pattern Recognition, Vol. 2, 1996.
33. T.A. Slocum. Thematic Cartography and Visualization. Upper Saddle River, NJ: Prentice Hall, 1999.
34. A.K.H. Tung, J. Hou, and J. Han. ``Spatial clustering in the presence of obstacles,'' The 17th International Conference on Data Engineering (ICDE'01), 2001.
35. S. Vaithyanathan and B. Dom. ``Model-based hierarchical clustering,'' The Sixteenth Conference on Uncertainty in Artificial Intelligence, Stanford, CA, 2000.
36. D. Vandev and Y.G. Tsvetanova.
``Perfect chains and single linkage clustering algorithm,'' Statistical Data Analysis, Proceedings SDA-95, 1995.
37. W. Wang, J. Yang, and R. Muntz. ``STING: A statistical information grid approach to spatial data mining,'' 23rd Int. Conf. on Very Large Data Bases, Athens, Greece, 1997.
38. C. Zhang and Y. Murayama. ``Testing local spatial autocorrelation using k-order neighbors,'' International Journal of Geographical Information Science, Vol. 14:681–692, 2000.

Diansheng Guo is a Ph.D. student (ABD) at the Department of Geography and the GeoVISTA Center, Pennsylvania State University, USA. He received his M.S. degree in GIS and cartography from the State Key Lab of Resources and Environmental Information Systems, Chinese Academy of Sciences, in 1999. He obtained his B.S. degree from the Department of Urban and Environmental Sciences, Peking University, in 1996. His research interests are data mining, exploratory data analysis, spatial databases, geovisualization, and their applications in environmental and social data analysis.

Donna Peuquet is currently Professor in the Department of Geography, The Pennsylvania State University. She holds degrees from the University of Cincinnati and the State University of New York at Buffalo. Her principal research interests are spatio-temporal data models, spatial cognition, and data mining.

Mark Gahegan is a professor of geography and associate director of the GeoVISTA research center at Pennsylvania State University, USA. His research interests are broad, covering most aspects of GIS, specifically: geovisualization, semantic models of geographic information, geo-computation, digital remote sensing, artificial intelligence tools, spatial analysis, and spatial data structures (Voronoi diagrams).