Interactive Subspace Clustering for Mining High-Dimensional Spatial Patterns

Diansheng Guo, Donna Peuquet, and Mark Gahegan
GeoVISTA Center & Department of Geography, Pennsylvania State University
302 Walker Building, University Park, PA 16802, USA
[email protected], [email protected], [email protected]

Statement of Problem

The unprecedented size and high dimensionality of existing geographic datasets make complex patterns that potentially lurk in the data hard to find. Spatial data analysis capabilities currently available have not kept up with the need to derive the full potential of these data. “Traditional spatial analytical techniques cannot easily discover new and unexpected patterns, trends and relationships that can be hidden deep within very large and diverse geographic datasets” (Miller and Han 2000). We are facing a data-rich but knowledge-poor era. To bridge this gap, spatial data mining and knowledge discovery has been gaining momentum.

Clustering is one of the most important tasks in the data mining and knowledge discovery literature (Fayyad, Piatetsky-Shapiro et al. 1996). Spatial clustering has also long served as an important process in geographic analysis (Openshaw, Charlton et al. 1987; Ester, Kriegel et al. 1996; Kang, Kim et al. 1997; Wang, Yang et al. 1997; Estivill-Castro and Lee 2000; Zhang and Murayama 2000; Harel and Koren 2001). Nevertheless, existing clustering methods have three major drawbacks for searching high-dimensional (multivariate) spatial patterns. First, existing spatial clustering methods mainly deal with low-dimensional spaces, often spatial dimensions only (e.g., location <x, y>), while general-purpose high-dimensional clustering methods developed in the data mining and knowledge discovery literature mainly deal with non-spatial feature spaces and have very limited power in recognizing spatial patterns that involve neighbors.
Second, most existing high-dimensional clustering methods use all input dimensions to identify clusters; noisy or irrelevant dimensions may blur or even hide strong clusters residing in subspaces. Third, existing clustering methods tend to be ‘closed’ and are not geared toward the interaction needed to effectively support a human-led exploratory analysis.

Objective and Methodology

The objective of this research is to develop an effective and efficient approach to discovering high-dimensional spatial patterns (clusters) in large geospatial datasets. We develop a novel computational approach that integrates spatial clustering information into the non-spatial attribute (feature) space, and then uses this combined space to discover high-dimensional spatial clusters with a hierarchical subspace clustering method and highly interactive visualization techniques. The research includes three major parts. First, a graph-based hierarchical spatial clustering method is developed, which is efficient and, more importantly, can generate an ordering of spatial objects that fully preserves all hierarchical spatial clusters at every level. Second, a density- and grid-based high-dimensional subspace clustering method is developed, which is effective in dealing with both large data size and high dimensionality. Third, various interactive visualization techniques are developed to leverage the human expert’s knowledge and inference capabilities, supporting better human-machine collaboration. Thus, computational power and human expertise are integrated for searching and interpreting patterns.

Spatial Clustering and Ordering

Spatial dimensions cannot simply be treated as two additional non-spatial dimensions in high-dimensional clustering methods, for two important reasons. First, the combination of spatial dimensions, which are not independent of each other, bears unique, real-world meaning.
Their complex inter-relationship can cause difficulties for clustering methods (Gahegan 2000). Second, spatial clustering methods often adopt real-world dissimilarity (distance) measures, e.g., road distance or travel time, and consider complex situations, e.g., geographic obstacles (Tung, Hou et al. 2001). Such clustering considerations are hard to integrate into high-dimensional clustering methods.

A graph-based hierarchical spatial clustering method, which achieves O(n log n) time complexity and avoids the single-link effect, is developed for identifying arbitrarily shaped clusters of spatial points defined by <x, y>. This method can then generate a spatial cluster ordering/encoding of those points that fully preserves and represents all hierarchical clusters. This 1-D ordering/encoding has two important properties: (1) any set of points that constitutes a cluster at some hierarchical level will be contiguous in the 1-D ordering; (2) points of the same cluster at some hierarchical level will have similar values in the 1-D encoding (the degree of similarity depends on the hierarchical level). By transforming hierarchical spatial clusters into a linear ordering/encoding, the integration of spatial and non-spatial information is made simpler, since the spatial cluster structure is reduced to a single axis (a “common” attribute) in the feature space. For 2-D point data, our method outperforms OPTICS (Ankerst, Breunig et al. 1999) in several respects: (1) it achieves O(n log n) complexity without using any index; (2) it avoids the single-link effect at various hierarchical levels; (3) the ordering can be used in other clustering methods to effectively search for high-dimensional spatial patterns.

High-Dimensional Subspace Clustering

A subspace is formed by a subset of dimensions of the original data space. Subspace clustering is very important for effective identification of patterns in a high-dimensional data space.
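As an aside, the cluster-preserving property of such a 1-D ordering can be illustrated with a minimal sketch. This is not the authors’ O(n log n) graph-based algorithm: it substitutes a brute-force single-linkage merge (Kruskal-style, over all pairwise Euclidean distances) and simply concatenates cluster member lists as clusters merge, which is enough to make every hierarchical cluster occupy a contiguous run of the final ordering.

```python
from itertools import combinations

def spatial_cluster_ordering(points):
    """Order 2-D points so that every single-linkage hierarchical
    cluster occupies a contiguous run of the returned ordering.
    Brute-force O(n^2) stand-in, not the paper's O(n log n) method."""
    n = len(points)
    # All pairwise Euclidean edges, sorted by increasing distance.
    edges = sorted(
        (((points[i][0] - points[j][0]) ** 2
          + (points[i][1] - points[j][1]) ** 2) ** 0.5, i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))
    members = {i: [i] for i in range(n)}  # ordered member list per cluster root

    def find(x):  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, i, j in edges:          # Kruskal: merge closest clusters first;
        ri, rj = find(i), find(j)  # concatenating their member lists keeps
        if ri != rj:               # every hierarchical cluster contiguous.
            parent[rj] = ri
            members[ri] += members.pop(rj)
    return members[find(0)]

# Three spatial clusters: {0, 1}, {2, 4}, {3, 5} (indices into pts).
pts = [(0, 0), (0, 1), (10, 10), (50, 0), (10, 11), (51, 1)]
order = spatial_cluster_ordering(pts)
# Each cluster appears as a contiguous block in `order`.
```

The point names (`spatial_cluster_ordering`, the sample coordinates) are illustrative only; the paper’s actual encoding additionally assigns similar numeric codes to points of the same cluster, which this sketch omits.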
It is often not meaningful to look for clusters using all input dimensions, because noisy or irrelevant dimensions may blur or even hide strong clusters residing in subspaces. Traditional dimensionality reduction (or multi-dimensional scaling) methods (Duda, Hart et al. 2000), e.g., principal component analysis (PCA) or the self-organizing map, have three severe drawbacks (Agrawal, Gehrke et al. 1998): (1) the new dimensions can be very difficult to interpret, making the resulting clusters hard to understand; (2) they cannot preserve clusters existing in different subspaces of the original data space; (3) they assume that globally calculated measures of structure or correlation are meaningful.

A density- and grid-based hierarchical subspace clustering method is developed to identify high-dimensional clusters within the combined data space, which includes the spatial cluster ordering/encoding as a “common” dimension together with the non-spatial dimensions. The method can automatically construct, evaluate, and select/prune subspaces of the data space. For a selected subspace, the clustering method first generalizes the large volume of data into a small set of hyper-cells, each of which contains many similar data instances; the number of cells is much smaller than the original data size. With those cells, it evaluates the subspace with an entropy-based approach, which helps single out the “interesting” subspaces that contain strong clusters. For an “interesting” subspace, its cells are further filtered with a density threshold. The dense cells are treated as “points” under a synthetic distance measure, and the hierarchical spatial clustering method above is then applied to these dense cells to identify hierarchical high-dimensional clusters. If the spatial cluster ordering/encoding is involved in the resulting clusters, then strong high-dimensional spatial patterns have been found. This method outperforms CLIQUE (Agrawal, Gehrke et al. 1998) in several respects: (1) a better and more flexible discretization method is used; (2) it fully supports human interaction; (3) it can perform hierarchical clustering; (4) an entropy-based evaluation method is adopted.

Visualization and Human Interaction

To achieve both efficiency and effectiveness in exploring very large spatial datasets, it is desirable to develop a highly interactive analysis environment that integrates the best of both human and machine capabilities with the support of various visualization techniques. The two methods above (hierarchical spatial clustering and subspace clustering) are implemented in a fully open and interactive manner, supported by various visualization techniques. Both methods are efficient enough to support real-time user interaction. The user can interactively control the parameters of the clustering methods and examine the immediate result of each parameter change. This opens up the “black box” of the clustering process for easy understanding, steering, focusing, interpretation, and iterative exploration.

Application Demo

The developed system can be applied to various datasets from different application fields. Here a working demo with a large, high-dimensional census dataset of US cities is presented. The dataset has spatial dimensions (the location <x, y> of each city) and over 20 non-spatial dimensions (e.g., area, total population, Black, White, and Asian population, retail trade, manufacturing, and household income).

Acknowledgments

This paper is partly based upon work funded by NSF Digital Government grant No. 9983445.

References

Agrawal, R., J. Gehrke, et al. (1998). Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD International Conference on Management of Data, Seattle, WA, USA.
Ankerst, M., M. M. Breunig, et al. (1999). OPTICS: Ordering Points To Identify the Clustering Structure. Proc. ACM SIGMOD’99 Int. Conf. on Management of Data, Philadelphia, PA.
Duda, R. O., P. E. Hart, et al. (2000). Pattern classification and scene analysis. New York, Wiley.
Ester, M., H.-P. Kriegel, et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. The 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, AAAI Press.
Estivill-Castro, V. and I. Lee (2000). AMOEBA: Hierarchical clustering based on spatial proximity using Delaunay diagram. 9th International Symposium on Spatial Data Handling, Beijing, China.
Fayyad, U., G. Piatetsky-Shapiro, et al. (1996). From data mining to knowledge discovery: An overview. Advances in Knowledge Discovery and Data Mining. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy. Cambridge, MA, AAAI Press/The MIT Press: 1-33.
Gahegan, M. (2000). “On the application of inductive machine learning tools to geographical analysis.” Geographical Analysis 32(2): 113-139.
Harel, D. and Y. Koren (2001). Clustering spatial data using random walks. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California.
Kang, I.-S., T.-W. Kim, et al. (1997). A spatial data mining method by Delaunay triangulation. The 5th International Workshop on Advances in Geographic Information Systems, Las Vegas, Nevada.
Miller, H. J. and J. Han (2000). “Discovering Geographic Knowledge in Data Rich Environments: A Report on a Specialist Meeting.” SIGKDD Explorations 1(2): 105-107.
Openshaw, S., M. Charlton, et al. (1987). “A Mark 1 Geographical Analysis Machine for the automated analysis of point data sets.” International Journal of Geographical Information Science 1(4): 335-358.
Tung, A. K. H., J. Hou, et al. (2001). Spatial clustering in the presence of obstacles. The 17th International Conference on Data Engineering (ICDE’01).
Wang, W., J. Yang, et al. (1997). STING: A statistical information grid approach to spatial data mining. 23rd Int. Conf. on Very Large Data Bases, Athens, Greece, Morgan Kaufmann.
Zhang, C. and Y. Murayama (2000). “Testing local spatial autocorrelation using k-order neighbors.” International Journal of Geographical Information Science 14(7): 681-692.