Interactive Subspace Clustering for Mining High-Dimensional Spatial Patterns
Diansheng Guo, Donna Peuquet, and Mark Gahegan
GeoVISTA Center & Department of Geography
Pennsylvania State University
302 Walker Building, University Park, PA 16802, USA
[email protected], [email protected], [email protected]
Statement of Problem
The unprecedented large size and high dimensionality of existing geographic datasets
make complex patterns that potentially lurk in the data hard to find. Spatial data analysis
capabilities currently available have not kept up with the need for deriving the full
potential of these data. “Traditional spatial analytical techniques cannot easily discover
new and unexpected patterns, trends and relationships that can be hidden deep within
very large and diverse geographic datasets” (Miller and Han 2000). We are facing a data-rich but knowledge-poor era. To bridge this gap, spatial data mining and knowledge
discovery has been gaining momentum. Clustering is one of the most important tasks in
data mining and knowledge discovery literature (Fayyad, Piatetsky-Shapiro et al. 1996).
Spatial clustering has also long been used as an important process in geographic analysis
(Openshaw, Charlton et al. 1987; Ester, Kriegel et al. 1996; Kang, Kim et al. 1997;
Wang, Yang et al. 1997; Estivill-Castro and Lee 2000; Zhang and Murayama 2000; Harel
and Koren 2001).
Nevertheless, existing clustering methods have three major drawbacks for searching
high-dimensional (multivariate) spatial patterns. First, on one hand, existing spatial
clustering methods mainly deal with low-dimensional spaces, or spatial dimensions only
(e.g. location <x, y>); on the other hand, general-purpose high-dimensional clustering
methods developed in the data mining and knowledge discovery literature mainly deal
with non-spatial feature spaces and have very limited power in recognizing spatial
patterns that involve neighbors. Second, most existing high-dimensional clustering
methods use all input dimensions to identify clusters. Some noisy or irrelevant
dimensions may blur or even hide strong clusters residing in subspaces. Third, existing
clustering methods tend to be ‘closed’ and are not geared toward allowing the interaction
needed to effectively support a human-led exploratory analysis.
Objective and Methodology
The objective of the research is to develop an effective and yet efficient approach to
discover high-dimensional spatial patterns (clusters) from large geospatial datasets. We
develop a novel computational approach to integrate spatial clustering information within
the non-spatial attribute or feature space, and then to use this combined space for
discovering high-dimensional spatial clusters with a hierarchical subspace clustering
method and highly interactive visualization techniques.
The research includes three major parts of work. First, a graph-based hierarchical
spatial clustering method is developed, which is efficient and, more importantly, can
generate an ordering of spatial objects. The ordering can fully preserve all hierarchical
spatial clusters of any level. Second, a density- and grid-based high-dimensional
subspace clustering is developed, which is effective in dealing with both the large size
and high dimensionality. Third, various interactive visualization techniques are developed
to leverage the human expert’s knowledge and inference capabilities, to support a better
human-machine collaboration. Thus, both the computational power and human expertise
are integrated together for searching and interpreting patterns.
Spatial Clustering and Ordering
Spatial dimensions cannot simply be treated as two additional non-spatial dimensions in
high-dimensional clustering methods because of two important reasons. First, the
combination of spatial dimensions, which are not independent from each other, bears
unique and real-world meanings. Their complex inter-relationship can cause difficulties
for clustering methods (Gahegan 2000). Second, spatial clustering methods often adopt
real-world dissimilarity (distance) measures, e.g. road distance or travel time, and
consider complex situations, e.g. geographic obstacles (Tung, Hou et al. 2001). Such
unique clustering considerations are hard to integrate within high-dimensional clustering
methods.
A graph-based hierarchical spatial clustering method, which achieves O(nlogn) time
complexity and avoids the single-link effect, is developed for identifying arbitrary-shaped
clusters of spatial points, defined with <x, y>. This method can then generate a spatial
cluster ordering/encoding of those points to fully preserve and represent all hierarchical
clusters. This 1-D ordering/encoding has two important properties: (1) any set of points
that constitute a cluster at some hierarchical level will be contiguous in the 1-D ordering;
(2) points of the same cluster at some hierarchical level will have similar values in the
1-D encoding (the degree of similarity depends on the hierarchical level). By transforming
hierarchical spatial clusters into a linear ordering/encoding, the integration of spatial and
non-spatial information is made simpler since the spatial cluster structure is reduced to a
single axis (a “common” attribute) in the feature space.
For 2D point data, our method outperforms OPTICS (Ankerst, Breunig et al. 1999) in
several aspects: (1) it achieves O(nlogn) complexity without using any index; (2) it
avoids the single-link effect at various hierarchical levels; (3) the ordering can be used in
other clustering methods to effectively search high-dimensional spatial patterns.
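The ordering property described above can be illustrated with a small sketch. The following Python code is our simplified illustration, not the authors' algorithm: it builds a Euclidean minimum spanning tree with Prim's algorithm (O(n²) here, whereas the paper's graph-based method reaches O(n log n)) and then merges components in order of increasing edge weight, concatenating their orderings, so that every single-link hierarchical cluster occupies a contiguous run of the final 1-D ordering.

```python
import math

def mst_prim(points):
    """Prim's algorithm: Euclidean minimum spanning tree over 2-D points.
    Returns a list of (weight, u, v) edges. O(n^2) for simplicity."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n          # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def cluster_ordering(points):
    """Order points so that every single-link hierarchical cluster is a
    contiguous run: process MST edges by increasing weight, concatenating
    the two components' orderings (a union-find sketch)."""
    n = len(points)
    parent = list(range(n))
    order = {i: [i] for i in range(n)}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for w, u, v in sorted(mst_prim(points)):
        ru, rv = find(u), find(v)
        parent[rv] = ru
        order[ru] = order[ru] + order[rv]   # clusters stay contiguous
        del order[rv]
    return order[find(0)]

# Two obvious spatial clusters: {0, 1, 2} and {3, 4}.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
ordering = cluster_ordering(pts)
```

In the resulting ordering, the members of each cluster occupy consecutive positions, which is exactly the contiguity property the text describes.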
High-Dimensional Subspace Clustering
A subspace is formed by a subset of dimensions of the original data space. Subspace
clustering is very important for effective identification of patterns in a high-dimensional
data space. It is often not meaningful to look for clusters using all input dimensions
because some noisy or irrelevant dimensions may blur or even hide strong clusters
residing in subspaces. Traditional dimensionality reduction (or multi-dimensional
scaling) methods (Duda, Hart et al. 2000), e.g. principal component analysis (PCA) or
the self-organizing map, have three severe drawbacks (Agrawal, Gehrke et al. 1998): (1) the new
dimensions can be very difficult to interpret, making the resulting clusters hard to understand; (2)
they cannot preserve clusters existing in different subspaces of the original data space; (3)
they are based on the assumption that globally-calculated measures of structure or
correlation are meaningful.
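The blurring effect of irrelevant dimensions can be made concrete with a small synthetic experiment (our illustration, not from the paper): two clusters that are well separated in one informative dimension become nearly indistinguishable once twenty uniformly random noise dimensions are appended, because between-cluster and within-cluster distances converge in the full space.

```python
import math
import random

random.seed(42)

def mean_pairwise(pts_a, pts_b=None):
    """Mean Euclidean distance within one set, or between two sets."""
    if pts_b is None:
        pairs = [(p, q) for i, p in enumerate(pts_a) for q in pts_a[i + 1:]]
    else:
        pairs = [(p, q) for p in pts_a for q in pts_b]
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

NOISE_DIMS = 20
# Dimension 0 separates the clusters; the remaining dimensions are noise.
A = [[random.uniform(0.0, 0.5)] + [random.uniform(0, 10) for _ in range(NOISE_DIMS)]
     for _ in range(30)]
B = [[random.uniform(10.0, 10.5)] + [random.uniform(0, 10) for _ in range(NOISE_DIMS)]
     for _ in range(30)]

def separation(pts_a, pts_b, dims):
    """Between-cluster / within-cluster mean distance, restricted to `dims`.
    Large values mean the clusters are easy to tell apart."""
    pa = [[p[d] for d in dims] for p in pts_a]
    pb = [[p[d] for d in dims] for p in pts_b]
    within = (mean_pairwise(pa) + mean_pairwise(pb)) / 2
    return mean_pairwise(pa, pb) / within

sub = separation(A, B, [0])                       # informative subspace only
full = separation(A, B, range(NOISE_DIMS + 1))    # all dimensions
```

In the one-dimensional subspace the separation ratio is very large, while in the full 21-dimensional space it falls close to 1, i.e. the strong cluster structure is effectively hidden.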
A density- and grid-based hierarchical subspace clustering method is developed to
identify high-dimensional clusters within the combined data space, which includes the
spatial cluster ordering/encoding as a “common” dimension and other non-spatial
dimensions. The method can automatically construct, evaluate, and select/prune
subspaces of the data space. For a selected subspace, the clustering method first
generalizes the large volume of data into a small set of hyper-cells, each of which
contains many similar data instances. The number of cells is much smaller than the
original data size. With those cells, it evaluates the subspace with an entropy-based
approach, which helps to single out those “interesting” subspaces that have strong
clusters. For an “interesting” subspace, its cells are further filtered with a density
threshold. Those dense cells are treated as “points” with a synthetic distance
measurement. Then the above hierarchical spatial cluster method is easily applied to
these dense cells to identify hierarchical high-dimensional clusters. If the spatial cluster
ordering/encoding is involved in the result clusters, then strong high-dimensional spatial
patterns are found.
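A minimal sketch of the grid-and-entropy idea follows (our illustration, assuming simple equal-width binning rather than the paper's more flexible discretization): records are generalized into hyper-cells, a subspace is scored by the Shannon entropy of its cell-occupancy distribution (low entropy signals strong clustering), and cells passing a density threshold become the "points" fed to the hierarchical clustering step.

```python
import math
import random
from collections import Counter

def grid_cells(data, dims, bins=5, lo=0.0, hi=1.0):
    """Assign each record to a hyper-cell of an equal-width grid over `dims`
    (assumes attributes rescaled to [lo, hi])."""
    width = (hi - lo) / bins
    def cell(rec):
        return tuple(min(int((rec[d] - lo) / width), bins - 1) for d in dims)
    return Counter(cell(rec) for rec in data)

def subspace_entropy(cells, n):
    """Shannon entropy of the cell-occupancy distribution; low entropy
    suggests strong clustering in this subspace."""
    return -sum((c / n) * math.log2(c / n) for c in cells.values())

def dense_cells(cells, threshold):
    """Keep cells whose count meets the density threshold; these become
    the 'points' for the subsequent hierarchical clustering step."""
    return {cell: c for cell, c in cells.items() if c >= threshold}

# Demo: a tightly clustered data set versus a scattered one.
random.seed(1)
clustered = [(random.uniform(0.0, 0.15), random.uniform(0.0, 0.15))
             for _ in range(100)]
scattered = [(random.random(), random.random()) for _ in range(100)]

e_clustered = subspace_entropy(grid_cells(clustered, [0, 1]), 100)
e_scattered = subspace_entropy(grid_cells(scattered, [0, 1]), 100)
dense = dense_cells(grid_cells(clustered, [0, 1]), threshold=50)
```

The clustered data collapses into a single dense cell with near-zero entropy, while the scattered data spreads over many cells with high entropy, which is how the entropy score singles out "interesting" subspaces.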
This method outperforms CLIQUE (Agrawal, Gehrke et al. 1998) in several aspects:
(1) a better and more flexible discretization method is used; (2) it fully supports human
interaction; (3) it can perform hierarchical clustering; (4) an entropy-based evaluation
method is adopted.
Visualization and Human Interaction
To achieve both efficiency and effectiveness for exploring very large spatial data sets, it
is desirable to develop a highly interactive analysis environment, which integrates the
best of both human and machine capabilities with support of various visualization
techniques.
The above two methods (hierarchical spatial clustering and subspace clustering) are
implemented in a fully open and interactive manner with support of various visualization
techniques. Both methods are efficient enough to support real-time user interactions. The
user can interactively control parameters of the clustering methods and examine the
immediate result corresponding to the parameter change. This opens up the “black box”
of the clustering process for easy understanding, steering, focusing, interpretation and
iterative exploration.
Application Demo
The developed system can be applied to various data sets of different application fields.
Here a working demo with a large and high-dimensional census data set of US cities is
presented. The data set has spatial dimensions (e.g., location <x, y> of cities) and more
than 20 non-spatial dimensions (e.g., the area, total population, Black, White, and Asian
populations, retail trade, manufacturing, household income, etc., of cities).
Acknowledgments
This paper is partly based upon work funded by NSF Digital Government grant (No.
9983445).
References
Agrawal, R., J. Gehrke, et al. (1998). Automatic subspace clustering of high dimensional
data for data mining applications. ACM SIGMOD international conference on
Management of data, Seattle, WA USA.
Ankerst, M., M. M. Breunig, et al. (1999). OPTICS: Ordering Points To Identify the
Clustering Structure. Proc. ACM SIGMOD’99 Int. Conf. on Management of Data,
Philadelphia PA.
Duda, R. O., P. E. Hart, et al. (2000). Pattern classification and scene analysis. New
York, Wiley.
Ester, M., H.-P. Kriegel, et al. (1996). A density-based algorithm for discovering clusters
in large spatial databases with noise. the 2nd International Conference on Knowledge
Discovery and Data Mining, Portland, Oregon, AAAI Press.
Estivill-Castro, V. and I. Lee (2000). Amoeba: Hierarchical Clustering Based On Spatial
Proximity Using Delaunay Diagram. 9th International Symposium on Spatial Data
Handling, Beijing, China.
Fayyad, U., G. Piatetsky-Shapiro, et al. (1996). From data mining to knowledge
discovery: an overview. Advances in Knowledge Discovery and Data Mining. U. Fayyad,
G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy. Cambridge, MA, AAAI Press/The
MIT Press: 1-33.
Gahegan, M. (2000). “On the application of inductive machine learning tools to
geographical analysis.” Geographical Analysis 32(2): 113-139.
Harel, D. and Y. Koren (2001). Clustering spatial data using random walks. Proceedings
of the seventh ACM SIGKDD international conference on knowledge discovery and
data mining, San Francisco, California.
Kang, I.-S., T.-W. Kim, et al. (1997). A spatial data mining method by Delaunay
triangulation. the 5th international workshop on Advances in geographic information
systems, Las Vegas, Nevada.
Miller, H. J. and J. Han (2000). “Discovering Geographic Knowledge in Data Rich
Environments: A Report on a Specialist Meeting.” SIGKDD Explorations 1(2): 105-107.
Openshaw, S., M. Charlton, et al. (1987). “A Mark 1 Geographical Analysis Machine for
the automated analysis of point data sets.” International Journal of Geographical
Information Science 1(4): 335-358.
Tung, A. K. H., J. Hou, et al. (2001). Spatial clustering in the presence of obstacles. The
17th International Conference on Data Engineering (ICDE'01).
Wang, W., J. Yang, et al. (1997). STING : A Statistical Information Grid Approach to
Spatial Data Mining. 23rd Int. Conf on Very Large Data Bases, Athens, Greece,
Morgan Kaufmann.
Zhang, C. and Y. Murayama (2000). “Testing local spatial autocorrelation using k-order
neighbors.” International Journal of Geographical Information Science 14(7): 681-692.