Dutch-Belgian Data Base Day (DBDBD 2007), Eindhoven, NL.
Removing Dimensionality Bias in Density-based Subspace Clustering
Ira Assent*, Ralph Krieger*, Emmanuel Müller*, Thomas Seidl+
Dept. Computer Science 9 (data management and data exploration), RWTH Aachen University, Germany
* PhD students, + professor; {krieger, assent, mueller, seidl}@cs.rwth-aachen.de
Abstract.
In many scientific settings, engineering processes, and business applications ranging from experimental sensor
data and process control data to telecommunication traffic observation and financial transaction monitoring,
huge amounts of high-dimensional measurement data are produced and stored. While sensor equipment as
well as large storage devices get cheaper every day, data analysis tools and techniques lag behind. Clustering
methods are common solutions to unsupervised learning problems where neither expert knowledge nor any
helpful annotation of the data is available. In general, clustering groups the data objects such that similar
objects are placed together in clusters, whereas objects from different clusters are highly dissimilar.
A typical observation, however, is that clustering reveals almost no structure even though it is known that there
must be groups of similar objects. In many cases, the reason is that the cluster structure is induced by only a
subset of the space's dimensions, while the many additional dimensions contribute nothing but noise that hinders
the discovery of the clusters in the data. As a solution to this problem, clustering algorithms are applied to the
relevant subspaces only. This immediately raises the question of how to determine the relevant subspaces among
the dimensions of the full space. Since the candidate subspaces form the power set of the set of dimensions, a
brute-force trial of all subsets is infeasible: their number grows exponentially with the original dimensionality.
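For a quick sense of scale (a back-of-the-envelope illustration of our own, not part of the original abstract), the
number of non-empty candidate subspaces of a d-dimensional space is 2^d - 1:

    # Every non-empty subset of the d dimensions is a candidate subspace,
    # so a brute-force search faces 2^d - 1 candidates.
    for d in (10, 20, 50):
        print(f"d={d}: {2 ** d - 1:,} candidate subspaces")
    # d=10: 1,023   d=20: 1,048,575   d=50: 1,125,899,906,842,623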
Subspace clustering solves this problem by identifying, in parallel, both the relevant subspaces and the cluster
structures that become apparent in these subspaces. In the literature, several density-based approaches to
subspace clustering can be found [PHL 04] [SZ 04]. They share the problem that their notions of density for
clusters depend significantly on the dimensionality of the respective subspace. As the algorithms require a
density threshold to be set by the user, they run into the difficulty of how to set the threshold value reasonably:
the higher the value, the fewer clusters are detected in high-dimensional subspaces; low values, on the other
hand, detect high-dimensional clusters but produce tremendous amounts of trivial low-dimensional subspace
clusters. Thus, these methods fail to separate cluster patterns from noise across subspaces of different
dimensionality.
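To make the bias concrete (a back-of-the-envelope figure of our own, not taken from the abstract): for n points
distributed uniformly on the unit cube and a normalized kernel with per-dimension bandwidth h, the expected
kernel density at any point is roughly n * h^d when boundary effects are ignored, so it collapses exponentially
with the subspace dimensionality d:

    # Expected kernel density under uniform data is ~ n * h^d (boundary
    # effects ignored); no single absolute threshold fits all values of d.
    n, h = 10_000, 0.1
    for d in (1, 2, 5, 10):
        print(f"d={d}: expected density ~ {n * h ** d:.3g}")
    # d=1: 1e+03   d=2: 100   d=5: 0.1   d=10: 1e-06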
Our new density definition for subspace clusters compensates for this dimensionality bias by incorporating the
expected density into the notion of density, resulting in the concept DUSC (dimensionality-unbiased subspace
clustering) [AKMS 07]. The new definition normalizes the density measure by its expected value in the current
dimensionality under the assumption of uniformly distributed data.
Along with the formal notion of dimensionality-unbiased density measures, we propose a generic unbiased
density estimator and demonstrate its effectiveness based on the Epanechnikov kernel as a particular example.
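The sketch below (a minimal illustration of our own in the spirit of DUSC, not the authors' reference
implementation; the function names, the product-kernel form, and the default bandwidth are our assumptions)
shows how normalizing by the expected density makes values comparable across subspaces, using the
Epanechnikov kernel:

    import numpy as np

    def epanechnikov(u):
        # 1-d Epanechnikov kernel: K(u) = 3/4 * (1 - u^2) on [-1, 1], else 0.
        return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

    def density(point, data, subspace, h=0.1):
        # Kernel density of `point` w.r.t. `data`, restricted to the
        # dimensions listed in `subspace`, with per-dimension bandwidth h.
        u = (data[:, subspace] - point[subspace]) / h
        return np.prod(epanechnikov(u), axis=1).sum()

    def unbiased_density(point, data, subspace, h=0.1):
        # Normalize by the expected density under uniform data on [0, 1]^d:
        # each kernel dimension contributes an expected factor h, hence
        # E[density] ~ n * h^d, and the ratio is comparable across subspaces.
        expected = len(data) * h ** len(subspace)
        return density(point, data, subspace, h) / expected

A single threshold on unbiased_density then carries the same meaning in a 2-dimensional as in a 10-dimensional
subspace, which is exactly what a fixed threshold on the raw density cannot provide.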
On top of the definitions that improve the quality of subspace clustering, we also work on the efficiency of
subspace cluster detection. For this purpose, we introduce some powerful pruning properties and tackle the
efficiency problem with a multi-step approach including lossless filter steps on cluster approximations. Our
experiments on real-world as well as synthetic data sets demonstrate that our approach outperforms existing
subspace clustering algorithms with respect to both quality and efficiency.
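As a rough illustration of the lossless-filter idea (a generic filter-and-refine sketch of our own; the concrete
approximations and bounds used in DUSC differ), a cheap upper bound on the density discards candidates that
can never reach the threshold, and only the survivors undergo the exact, expensive test, so no qualifying object
is lost:

    def filter_and_refine(candidates, upper_bound, exact_density, tau):
        # Filter step: a provable upper bound prunes hopeless candidates
        # losslessly; the refinement step applies the exact, costly measure.
        results = []
        for o in candidates:
            if upper_bound(o) < tau:      # can never reach the threshold
                continue
            if exact_density(o) >= tau:   # exact but expensive check
                results.append(o)
        return results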
References.
[AKMS 07] I. Assent, R. Krieger, E. Müller, and T. Seidl. DUSC: Dimensionality Unbiased Subspace Clustering.
Proc. IEEE International Conference on Data Mining (ICDM), Omaha, Nebraska, USA, 2007.
[PHL 04] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. SIGKDD
Explorations Newsletter, 6(1):90–105, June 2004.
[SZ 04] K. Sequeira and M. Zaki. SCHISM: A new approach for interesting subspace mining. Proc. IEEE
International Conference on Data Mining (ICDM), Hong Kong, 2004.