Dutch-Belgian Data Base Day (DBDBD 2007), Eindhoven, NL.
Removing Dimensionality Bias in Density-based Subspace Clustering
Ira Assent*, Ralph Krieger*, Emmanuel Müller*, Thomas Seidl+
Dept. Computer Science 9 (data management and data exploration), RWTH Aachen University, Germany
* PhD students, + professor; {krieger, assent, mueller, seidl}@cs.rwth-aachen.de
Abstract.
In many scientific settings, engineering processes, and business applications ranging from experimental sensor
data and process control data to telecommunication traffic observation and financial transaction monitoring,
huge amounts of high-dimensional measurement data are produced and stored. While sensor equipment as
well as large storage devices get cheaper every day, data analysis tools and techniques lag behind. Clustering
methods are common solutions to unsupervised learning problems where neither expert knowledge nor any
helpful annotation of the data is available. In general, clustering groups the data objects such that similar
objects are placed together in clusters, whereas objects from different clusters are highly dissimilar.
A typical observation, however, is that clustering reveals almost no structure even though it is known that there
must be groups of similar objects. In many cases, the reason is that the cluster structure is induced by only a
subset of the space's dimensions, while the many additional dimensions contribute nothing but noise that hinders
the discovery of the clusters in the data. As a solution to this problem, clustering algorithms are applied to the
relevant subspaces only. This immediately raises the question of how to determine the relevant subspaces among
the dimensions of the full space. Since the candidate subspaces form the power set of the set of dimensions, a
brute-force trial of all subsets is infeasible: their number grows exponentially with the original dimensionality.
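For a quick sense of scale (a back-of-the-envelope illustration of our own, not part of the original abstract), the
number of non-empty candidate subspaces of a d-dimensional space is 2^d - 1:

    # Every non-empty subset of the d dimensions is a candidate subspace,
    # so a brute-force search faces 2^d - 1 candidates.
    for d in (10, 20, 50):
        print(f"d={d}: {2 ** d - 1:,} candidate subspaces")
    # d=10: 1,023   d=20: 1,048,575   d=50: 1,125,899,906,842,623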
Subspace clustering solves this problem by identifying, in parallel, both the relevant subspaces and the cluster
structures that become apparent in these subspaces. In the literature, several density-based approaches to
subspace clustering can be found [PHL 04] [SZ 04]. They share the problem that their notions of density for
clusters depend significantly on the dimensionality of the respective subspace. As the algorithms require a
density threshold to be set by the user, they run into the difficulty of how to set the threshold value reasonably:
the higher the value, the fewer clusters are detected in high-dimensional subspaces; low values, on the other
hand, detect high-dimensional clusters but produce tremendous amounts of trivial low-dimensional subspace
clusters. Thus, these methods fail to separate cluster patterns from noise across subspaces of different
dimensionality.
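To make the bias concrete (a back-of-the-envelope figure of our own, not taken from the abstract): for n points
distributed uniformly on the unit cube and a normalized kernel with per-dimension bandwidth h, the expected
kernel density at any point is roughly n * h^d when boundary effects are ignored, so it collapses exponentially
with the subspace dimensionality d:

    # Expected kernel density under uniform data is ~ n * h^d (boundary
    # effects ignored); no single absolute threshold fits all values of d.
    n, h = 10_000, 0.1
    for d in (1, 2, 5, 10):
        print(f"d={d}: expected density ~ {n * h ** d:.3g}")
    # d=1: 1e+03   d=2: 100   d=5: 0.1   d=10: 1e-06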
Our new density definition for subspace clusters compensates for this dimensionality bias by incorporating the
expected density into the notion of density, resulting in the concept DUSC (dimensionality-unbiased subspace
clustering) [AKMS 07]. The new definition normalizes the density measure by its expected value in the current
dimensionality under the assumption of uniformly distributed data.
Along with the formal notion of dimensionality-unbiased density measures, we propose a generic unbiased
density estimator and demonstrate its effectiveness based on the Epanechnikov kernel as a particular example.
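The sketch below (a minimal illustration of our own in the spirit of DUSC, not the authors' reference
implementation; the function names, the product-kernel form, and the default bandwidth are our assumptions)
shows how normalizing by the expected density makes values comparable across subspaces, using the
Epanechnikov kernel:

    import numpy as np

    def epanechnikov(u):
        # 1-d Epanechnikov kernel: K(u) = 3/4 * (1 - u^2) on [-1, 1], else 0.
        return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

    def density(point, data, subspace, h=0.1):
        # Kernel density of `point` w.r.t. `data`, restricted to the
        # dimensions listed in `subspace`, with per-dimension bandwidth h.
        u = (data[:, subspace] - point[subspace]) / h
        return np.prod(epanechnikov(u), axis=1).sum()

    def unbiased_density(point, data, subspace, h=0.1):
        # Normalize by the expected density under uniform data on [0, 1]^d:
        # each kernel dimension contributes an expected factor h, hence
        # E[density] ~ n * h^d, and the ratio is comparable across subspaces.
        expected = len(data) * h ** len(subspace)
        return density(point, data, subspace, h) / expected

A single threshold on unbiased_density then carries the same meaning in a 2-dimensional as in a 10-dimensional
subspace, which is exactly what a fixed threshold on the raw density cannot provide.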
On top of the definitions that improve the quality of subspace clustering, we also work on the efficiency of
subspace cluster detection. For this purpose, we introduce some powerful pruning properties and tackle the
efficiency problem with a multi-step approach including lossless filter steps on cluster approximations. Our
experiments on real-world as well as synthetic data sets demonstrate that our approach outperforms existing
subspace clustering algorithms with respect to both quality and efficiency.
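As a rough illustration of the lossless-filter idea (a generic filter-and-refine sketch of our own; the concrete
approximations and bounds used in DUSC differ), a cheap upper bound on the density discards candidates that
can never reach the threshold, and only the survivors undergo the exact, expensive test, so no qualifying object
is lost:

    def filter_and_refine(candidates, upper_bound, exact_density, tau):
        # Filter step: a provable upper bound prunes hopeless candidates
        # losslessly; the refinement step applies the exact, costly measure.
        results = []
        for o in candidates:
            if upper_bound(o) < tau:      # can never reach the threshold
                continue
            if exact_density(o) >= tau:   # exact but expensive check
                results.append(o)
        return results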
References.
[AKMS 07] I. Assent, R. Krieger, E. Müller, and T. Seidl. DUSC: Dimensionality Unbiased Subspace Clustering.
Proc. IEEE International Conference on Data Mining (ICDM), Omaha, Nebraska, USA, 2007.
[PHL 04] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. SIGKDD
Explorations Newsletter, 6(1):90–105, June 2004.
[SZ 04] K. Sequeira and M. Zaki. SCHISM: A new approach for interesting subspace mining. Proc. IEEE
International Conference on Data Mining (ICDM), Hong Kong, 2004.