A SUBSPACE CLUSTERING OF HIGH DIMENSIONAL DATA REDUNDANCY
R.TAMILSELVAN
LECTURER IN IT DEPARTMENT,
VIVEKANANDHA INSTITUTE OF ENGINEERING AND TECHNOLOGY,
NAMAKKAL, TAMILNADU STATE, INDIA.
E-mail: [email protected]
V.HARIHARAPRABU
ASSISTANT PROFESSOR IN IT DEPARTMENT,
VIVEKANANDHA INSTITUTE OF ENGINEERING AND TECHNOLOGY,
NAMAKKAL, TAMILNADU STATE, INDIA.
E-mail: [email protected]
PROF.R.BHASKARAN
ASSISTANT PROFESSOR IN CSE DEPARTMENT
MUTHAYAMMAL ENGINEERING COLLEGE,
NAMAKKAL, TAMILNADU STATE, INDIA.
E-mail: [email protected]
DR.S.CHITRA
PRINCIPAL
M.KUMARASAMY COLLEGE OF ENGINEERING,
KARUR, TAMILNADU STATE, INDIA.
E-mail: [email protected]
DR.C.PALANISAMY
PROFESSOR
BANNARI AMMAN INSTITUTE OF TECHNOLOGY,
ERODE, TAMILNADU STATE, INDIA.
E-mail: [email protected]
ABSTRACT
This paper proposes a new algorithm, called Non-Redundant Subspace Clustering (NORSC), to efficiently
discover a succinct collection of subspace clusters while also maintaining the required degree of data coverage.
We first study an important but unsolved dilemma in the literature of subspace clustering, which is referred to
as the "information overlapping-data coverage" challenge. NORSC not only avoids generating the redundant
clusters whose contained data are mostly covered by higher dimensional clusters, thereby resolving the information
overlapping problem, but also limits the information loss to cope with the data coverage problem. High-dimensional
data is inherently more complex for clustering, classification, and similarity search. NORSC produces
identical results irrespective of the order in which input records are presented and does not presume any specific
mathematical form for the data distribution. The number of dimensions in each such cluster-specific subspace may
also vary; hence, it may be impossible to find a single small subset of dimensions for all the clusters. The similarity
search and indexing problem is well known to be a difficult one for high-dimensional applications. Due to the
monotonicity property in Apriori-like procedures, if a region is identified as
dense, all its projected regions are also identified as dense, causing overlapping/redundant clustering
information to be inevitably reported to users when generating clusters from such highly correlated regions. As
shown by our experimental results, NORSC is very effective in identifying a concise and small set of subspace
clusters, while incurring a time complexity orders of magnitude better than that of previous work.
Keywords -- Data Mining, Subspace Clustering, Redundancy Filtering, Redundant Clustering, High-Dimensional Data.
I. INTRODUCTION
Clustering techniques have been recognized as important and valuable capabilities in the data mining field. The
increased research attention paid to subspace clustering stems from recent reports on the curse of dimensionality, which
show that the difference between the distances from a point to its nearest and farthest neighbors diminishes as the
dimension cardinality increases. The applicability of subspace clustering has been demonstrated in various
applications, including gene expression data analysis, E-commerce, DNA microarray analysis, and so forth [3], [4]. In
this paper, we explore an unsolved dilemma in previous works, which we call the information overlapping-data
coverage dilemma. In grid-based methods, the data space is first partitioned into a number of equal-sized units/grids [5]. The dense
units whose densities exceed a predefined density threshold are identified, and finally, the groups of connected dense
units are discovered as clusters.
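To make the grid-based procedure concrete, the following is a minimal sketch of dense-unit identification, assuming equal-width partitioning and a count-based density threshold; the function name find_dense_units and its parameters are ours for illustration, not part of any cited system.

```python
import numpy as np
from collections import Counter

def find_dense_units(data, dims, bins=10, tau=0.01):
    """Grid-based dense-unit search in the subspace spanned by `dims`.

    Each chosen dimension is split into `bins` equal-width intervals;
    a unit (grid cell) is dense when it holds at least a `tau` fraction
    of all points, mirroring the density threshold described above.
    """
    proj = data[:, dims]
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)      # avoid division by zero
    # Interval index of every point in each chosen dimension.
    idx = np.clip(((proj - lo) / span * bins).astype(int), 0, bins - 1)
    counts = Counter(map(tuple, idx))
    n = len(data)
    return {unit for unit, c in counts.items() if c / n >= tau}
```

For example, find_dense_units(X, dims=[0, 2], bins=10, tau=0.02) would return the dense 2D cells of the subspace spanned by dimensions 0 and 2.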
In this paper, we categorize such a problem as the information overlapping problem. To solve the information
overlapping problem, a naive extension of previous works is to remove a cluster if all its dense units are projections of
dense units of a higher dimensional cluster. However, this may seriously eliminate important information due to another
problem, called the data coverage problem. This problem refers to the phenomenon that some data points in a lower
dimensional cluster may not be members of any higher dimensional cluster, even though all the dense units of the lower
dimensional cluster are projections of the dense units of higher dimensional clusters, i.e., information overlapping [9].
The information overlapping and data coverage problems are not special cases in subspace clustering. As will be shown
by the experimental results in Section IV, we study these problems by using three real data sets from varying applications
adopted from the UCI machine learning repository [7]: the yeast database, the adult database, and the covertype
database. We have clearly found that 1) a massive and user-unacceptable number of subspace clusters will be reported
if no cluster is removed on account of information overlapping, and 2) data coverage will be strikingly
sacrificed if a cluster is removed whenever all its dense units are projections of units in higher dimensional clusters. These
results show that there is a dilemma between information overlapping and data coverage, calling for a novel
solution that provides a good balance between the two. In our algorithm NORSC, the
nonredundant clusters are discovered efficiently by first identifying the maximal dense regions. The
experimental results on extensive real data sets reveal that NORSC obtains a concise set of subspace clusters, while
incurring a time complexity orders of magnitude better than that of previous works [21], [25].
The rest of this paper is organized as follows. Section II presents related work on subspace clustering. Section III
gives the NORSC algorithm. Section IV presents the experimental results. Section V concludes this paper.
II. RELATED WORK
The information overlapping and data coverage problems are not special cases in subspace clustering. We study
these problems by using three real data sets from varying applications adopted from the UCI machine learning repository [3]. As
shown by the experimental results in this paper, CLIQUE suffers from the data coverage problem: directly
removing the clusters whose constituent units are projections of dense units in higher dimensional clusters causes
large information loss. The subspace entropy is devised by considering three criteria of good clustering in a subspace,
i.e., high data coverage, high density, and correlated dimensions. The subspaces with good clustering have lower
entropy and are selected for discovering the clusters [1], [2]. ENCLUS also suffers from the information overlapping and
data coverage problems because ENCLUS adopts the same clustering model as CLIQUE to discover the subspace
clusters.
By introducing two parameters ε and m, the core objects are defined as the data points with at least m data
points in their ε-neighborhood, where the distance between two data points in a subspace is calculated as the distance
between their projections in this subspace. As such, if a data point is a core object in a subspace, it is also a core
object when projected into lower subspaces, thus resulting in a massive number of clusters with a high degree of
information overlapping. In addition, SUBCLU bears a huge workload in performing range queries to find the data points
in the ε-neighborhood of each data point in arbitrary subspaces, thus reducing its practicability in subspace clustering. The
subspace clustering methods in the literature fall into two categories, hard and soft subspace clustering, as discussed below [4].
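Before turning to these categories, the following is a minimal sketch of the core-object test just described, assuming Euclidean distance on the projected coordinates; is_core_object is a hypothetical helper, not SUBCLU's actual implementation.

```python
import numpy as np

def is_core_object(data, idx, dims, eps, m):
    """Core-object test in the subspace spanned by `dims`: point `idx`
    is a core object when at least `m` points (itself included) lie
    within distance `eps` of its projection."""
    proj = data[:, dims]
    dist = np.linalg.norm(proj - proj[idx], axis=1)
    return int(np.count_nonzero(dist <= eps)) >= m
```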
A. Hard Subspace Clustering
The subspace clustering methods in this category can be further divided into bottom-up and top-down subspace
search methods [7]. The bottom-up methods for subspace clustering consist of the following main steps: dividing each
dimension into intervals and identifying the dense intervals in each dimension; from the intersections of the dense
intervals, identifying the dense cells in all two-dimensional subspaces; from the intersections of the 2D dense cells and the dense
intervals of other dimensions, identifying the dense cells in all three-dimensional subspaces, repeating this process until the
dense cells in all k-dimensional subspaces are identified; and finally merging the adjacent dense cells in the same subsets of dimensions to
identify clusters (a sketch of this level-wise search follows). The efficacy of this method depends on how the clustering problem is addressed in the first place in
the original feature space. A potentially serious problem with such a technique is the lack of data to locally perform
PCA on each cluster to derive the principal components; therefore, it is inflexible in determining the dimensionality of
the data representation.
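As referenced in the steps above, the following level-wise sketch illustrates the bottom-up search, reusing the hypothetical find_dense_units from the earlier sketch; it exploits the monotonicity property by examining a k-dimensional subspace only when all its (k-1)-dimensional projections contain dense cells.

```python
from itertools import combinations

def bottom_up_dense_cells(data, bins=10, tau=0.02):
    """Level-wise bottom-up search for dense cells in all subspaces.

    Level k only examines subspaces whose every (k-1)-dimensional
    projection already contains dense cells, which is exactly the
    monotonicity property described in the text above.
    """
    n_dims = data.shape[1]
    # Level 1: dense intervals of each single dimension.
    level = {(d,): find_dense_units(data, dims=[d], bins=bins, tau=tau)
             for d in range(n_dims)}
    level = {s: u for s, u in level.items() if u}
    all_dense, k = dict(level), 1
    while level:
        k += 1
        # Keep a candidate subspace only if all its projections survived.
        candidates = [s for s in combinations(range(n_dims), k)
                      if all(p in level for p in combinations(s, k - 1))]
        level = {s: find_dense_units(data, dims=list(s), bins=bins, tau=tau)
                 for s in candidates}
        level = {s: u for s, u in level.items() if u}
        all_dense.update(level)
    return all_dense
```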
B. Soft Subspace Clustering
Instead of identifying exact subspaces for clusters, this approach assigns a weight to each dimension in the
clustering process to measure the contribution of the dimension in forming a particular cluster. In a clustering, every
dimension contributes to every cluster, but the contributions differ. The subspaces of the clusters can be identified
from the weight values after clustering. Variable weighting for clustering is an important research topic in statistics and
data mining [8], [9], [10]; however, its purpose there is to select important variables for clustering. Extensions of some
variable weighting methods, for example, the k-means-type variable weighting methods, can perform the task of
subspace clustering, and a number of algorithms in this direction have been reported recently [11], [12]. We can observe
that the weight value for a dimension in a cluster is inversely proportional to the dispersion of the values from the
center in that dimension of the cluster. Since the dispersions differ across dimensions and clusters,
the weight values for different clusters differ as well. A high weight indicates a small dispersion in a dimension of the
cluster; therefore, that dimension is more important in forming the cluster. This subspace clustering approach has a
problem in handling sparse data: if the dispersion of a dimension in a cluster happens to be zero, the weight for
that dimension is not computable.
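The weighting just described can be sketched as follows; this is an illustrative formula assuming weights inversely proportional to the per-dimension dispersion, not the exact update rule of any particular k-means-type algorithm.

```python
import numpy as np

def dimension_weights(cluster_points, center, eps=1e-9):
    """Soft-subspace weights for one cluster: each dimension's weight is
    inversely proportional to the dispersion of values from the cluster
    center in that dimension, then normalized to sum to one.

    `eps` guards against the zero-dispersion case noted in the text,
    where the weight would otherwise be not computable.
    """
    dispersion = np.mean((cluster_points - center) ** 2, axis=0)
    inv = 1.0 / (dispersion + eps)
    return inv / inv.sum()
```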
III. NORSC ALGORITHM
In this paper, we propose an innovative algorithm, called NORSC, to cope with the nonredundant cluster
discovering problem. In NORSC, the identification of the nonredundant clusters relies on computing the
number of data points contained in each dense unit and the number of nonextensible data points of each dense unit. We
use this information to calculate, for each cluster, the number of its enclosed data points and the total number of data
points in its extension clusters, which are then utilized to identify whether it is redundant. In this manner, the major
challenge in developing an efficient nonredundant cluster discovering algorithm hinges on computing the nonextensible
data points for the dense units. A naive approach is to repeatedly scan the higher dimensional units for each dense unit
to extract its dense super units, which are then used to compute the data points they cover and, in turn, the
nonextensible data points of the dense unit. However, this naive approach faces an enormous workload from the
exhaustive search through a tremendous number of higher dimensional units. This challenge is tackled in NORSC by
leveraging the maximal dense units to efficiently discover the nonextensible data points for the dense units. The
maximal dense units are the dense units whose super units are not dense [16]. Thus, NORSC is devised with a two-step
approach. The first step discovers the maximal dense units, and the second step utilizes the maximal dense units to
extract the nonextensible data points for the dense units and discover the nonredundant clusters afterward. Fig. 1 gives
the flowchart of algorithm NORSC. We describe the details of the two steps in the following sections; a
discussion on NORSC is presented in Section III-C.
Fig. 1. The NORSC algorithm.
A. Mining Maximal Dense Units
The first step of NORSC extracts the maximal dense units to accelerate the discovery of the nonredundant
clusters. In mining maximal dense units, each k-dimensional unit is represented by a k-element set. The discovery of
maximal dense units is performed by traversing the lattice in depth-first order.
Fig. 2. Procedure for mining maximal dense units.
In the following, we first formulate the depth-first traversal procedure, which does not traverse the descendant nodes of
the utmost extended dense units. Each traversed unit u carries a possible set P_u containing the IDs of units that are
valid for extending u. Unit u extracts from P_u the subset C_u of IDs whose concatenation with u yields dense units,
and these concatenations generate the list of child nodes; if C_u is empty, unit u is an utmost extended dense unit and is
taken to discover the maximal dense units. The possible set of the child generated by an ID u' consists of the IDs
following u' in C_u, removing the ones whose dimension coincides with that of u'. To start the lattice traversal, this
procedure is invoked at the root with the possible set containing all 1D dense units, sorted by dimension ID.
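A minimal sketch of this traversal follows, assuming each 1D dense-unit ID is a (dimension, interval) pair and that an is_dense predicate over concatenations is available; the helper names are ours.

```python
def dimension_of(unit_id):
    # Assumes a 1D dense-unit ID is a (dimension, interval) pair.
    return unit_id[0]

def traverse(unit, possible, is_dense, utmost):
    """Depth-first lattice traversal as formulated above: `unit` is a
    tuple of 1D dense-unit IDs and `possible` holds the IDs still valid
    for extending it (the possible set P_u)."""
    c_u = [x for x in possible if is_dense(unit + (x,))]
    if not c_u:
        utmost.append(unit)      # utmost extended dense unit
        return
    for i, x in enumerate(c_u):
        # A child keeps only the IDs after x in C_u, dropping those that
        # share x's dimension, as in the possible-set construction above.
        child = [y for y in c_u[i + 1:] if dimension_of(y) != dimension_of(x)]
        traverse(unit + (x,), child, is_dense, utmost)
```

The traversal is started from the root as traverse((), sorted(one_d_units), is_dense, utmost), with utmost collecting the utmost extended dense units.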
B. Mining Nonredundant Clusters
After efficiently extracting the maximal dense units in the last section, we leverage these maximal dense units to
find the nonextensible data points for the dense units, which is the requisite information for discovering the
nonredundant clusters. For discovering the maximal dense units from the utmost extended dense units obtained in the
lattice traversal, each utmost extended dense unit must check whether it has dense super units in other branches. Thus,
an utmost extended dense unit is a maximal dense unit if and only if there is no other utmost extended dense unit
that is its super unit.
Fig. 3. Procedure for discovering nonextensible data points.
The dense units in different subspaces can identify their nonextensible data points after the data points have
discovered their nonextensible dense units, as in Fig. 3. However, we propose to discover the nonextensible data points
for dense units of different cardinalities in different iterations. The reason is that storing all the nonextensible dense
units of all data points would require a prominent amount of memory due to the numerous dense units in
high-dimensional data. In iteration k, to discover the nonextensible data points for the k-dimensional dense units, we
make each data point extract only its k-dimensional nonextensible dense units.
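The following sketch outlines this second step, assuming each unit is a tuple of (dimension, interval) pairs and that a points_in mapping from units to their enclosed point IDs is available; NORSC's actual iteration-by-cardinality bookkeeping is more memory-economical than this direct form.

```python
def is_super_unit(big, small):
    """`big` is a super unit of `small` when `small`'s constituent 1D
    units form a proper subset of `big`'s."""
    return set(small) < set(big)

def maximal_dense_units(utmost):
    """An utmost extended dense unit is maximal iff no other utmost
    extended dense unit is its super unit, as stated above."""
    return [u for u in utmost
            if not any(is_super_unit(v, u) for v in utmost if v != u)]

def nonextensible_points(unit, points_in, maximal_units):
    """Data points of `unit` not covered by any maximal dense super
    unit; `points_in` maps each unit to its set of point IDs."""
    covered = set()
    for m in maximal_units:
        if is_super_unit(m, unit):
            covered |= points_in[m]
    return points_in[unit] - covered
```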
C. Discussion about NORSC Algorithm
NORSC outperforms the traditional grid-based subspace clustering mechanisms [6], [12] that use post-processing to
filter redundant clusters. The main reason is that we leverage the maximal dense units to find the nonextensible data
points for the dense units, so we only need to find the unit counts of a small number of dense units. Consider the
efficiency of computing the nonextensible data points: a naive approach to finding the nonextensible data points of a
dense unit is to exhaustively scan its dense super units and compute the data points covered by these units. The results
below illustrate the time complexity of computing the nonextensible data points for all k-dimensional dense units in
this naive approach and in NORSC, respectively. As shown in the experimental results, since the number of maximal
dense units is much smaller than the total number of k-dimensional dense units, NORSC incurs a much smaller
computational time in computing the nonextensible data points than the naive approach. The
following data sets are used.
TABLE 1
Example data sets

S. No.   Data Set    Number of Dimensions   Number of Data Points
1        Yeast       6                      1484
2        Adult       6                      32561
3        Covertype   10                     16125
IV. EXPERIMENTAL RESULTS
An extension of the traditional grid-based subspace clustering mechanisms [9], [10] diminishes the information
overlapping between the highly correlated units by removing the clusters all of whose dense units are projections of dense
units in higher dimensional clusters. However, some data points in the removed clusters may not be covered by higher
dimensional clusters, causing information loss on these uncovered data. This is the data coverage problem indicated in
Section I. In this section, we study the data coverage problem by using the three real data sets shown in Table 1, where
we evaluate the coverage loss ratio of each removed cluster in the above extension approach.
Fig. 4(a). Cluster ratio versus redundancy threshold.
We first evaluate the accuracy of NORSC on the three data sets listed in Table 1 with τ set to 5. Fig. 4 gives the
experimental results, where we vary the density threshold µ. Fig. 4 shows the "cluster ratio," defined as the ratio of the
number of subspace clusters discovered by NORSC to the number of clusters discovered by CLIQUE, for varied thresholds.
Fig. 4(b). Cluster ratio versus redundancy threshold.
Fig. 4(c). Cluster ratio versus redundancy threshold.
We generate the synthetic data sets by implementing the data generator in [10], which is used in most traditional
subspace clustering works. Four synthetic data sets are generated, and their characteristics are shown in Table 3a. In
Table 3b, we show the experimental results of our algorithm NORSC and of the traditional grid-based
clustering algorithm CLIQUE [13], [14], [15]. These results are the best ones derived by testing a broad range of
parameter settings. In these experiments, τ is set to 6. As can be seen, on these data sets NORSC and
CLIQUE both achieve a recall of 1; that is, NORSC and CLIQUE can accurately discover the true clusters in the data
sets [22], [23], [24]. However, CLIQUE generates a massive set of subspace clusters, because the true clusters
make their projected regions in lower subspaces also be identified as dense regions, thus resulting in a massive
set of clusters discovered by CLIQUE. As shown in Table 3b, this problem is even worse for CLIQUE when the number
of clusters is increased or the clusters are embedded in higher subspaces. In contrast, from the experimental results, we
note that NORSC not only accurately extracts the true clusters but also effectively discovers a succinct clustering
result. As shown in Table 3b, NORSC discovers the same true clusters in data set 1 and data set 2; for data set 3 and
data set 4, NORSC discovers succinct clustering results compared to CLIQUE [16], [17].
A. Redundancy Threshold
In this section, we propose a procedure to recommend a redundancy threshold to help the users discover a succinct
clustering result. When selecting the value, we face a trade-off between the information loss in the identified
redundant clusters and the information redundancy in the identified nonredundant clusters. For the identified redundant
clusters, we may lose the information on those of their enclosed data points that are not contained in higher
dimensional clusters. To limit the information loss, we may consider setting a much higher value (such as
α = 0.95). However, in such a case, the higher α value is useless for solving the information overlapping problem, so
the clusters discovered may still have large information redundancy [18], [19], [20].
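The following sketch shows how such a redundancy threshold might be applied, assuming a cluster is deemed redundant when at least an α fraction of its points is covered by its higher dimensional clusters, as described in the abstract; the function and its inputs are illustrative.

```python
def is_redundant(cluster_points, higher_cluster_points, alpha=0.95):
    """Flag a cluster as redundant when at least an `alpha` fraction of
    its point IDs is covered by its higher dimensional clusters.

    `cluster_points` is a set of point IDs; `higher_cluster_points` is a
    list of such sets, one per higher dimensional cluster.
    """
    covered = set().union(*higher_cluster_points)
    return len(cluster_points & covered) / len(cluster_points) >= alpha
```

A higher α keeps more clusters (less information loss) at the cost of more redundancy, which is exactly the trade-off discussed above.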
Fig. 5. Marital status counts for clusters 1 and 2 obtained using the NORSC algorithm.
We compare the time performance of NORSC and the extended CLIQUE algorithm [6], where the two algorithms
identify the same nonredundant clusters. We extend CLIQUE by first executing CLIQUE to generate all subspace
clusters and then identifying and filtering the redundant clusters by means of 4(a) and 4(b). To compare the performance,
the three data sets shown in Table 1 are utilized with varying density thresholds [20], [21]. Fig. 10 compares the scalability of
NORSC and the extended CLIQUE under different density thresholds, where α is 0.95. As shown in the three plots,
NORSC outperforms the extended CLIQUE in execution time. For a larger density threshold, NORSC leads to a smaller
execution time because fewer dense units need to be identified, and also because fewer nodes in the lattice need to be
traversed to find the maximal dense units.
V. CONCLUSIONS
In this paper, we propose a new algorithm for reducing redundancy in subspace clustering of high-dimensional data. We have
studied a dilemma in the literature of subspace clustering called the "information overlapping-data coverage" challenge. Naive
extensions of previous works cannot strike a good balance between these issues. We have proposed the NORSC
algorithm to automatically discover a succinct collection of subspace clusters while also maintaining the required
degree of data coverage. NORSC does not generate the clusters with most of the contained data covered by higher
dimensional clusters, thereby avoiding the information overlapping problem. In addition, NORSC limits the information loss
to cope with the data coverage problem. Our algorithm leverages the maximal dense units to generate the nonredundant clusters. As
demonstrated by our experimental results, NORSC discovers a concise and small collection of subspace clusters,
and its time efficiency outperforms the extension of previous works.
ACKNOWLEDGEMENT
The authors would like to thank all the staff of the Department of Information Technology at Bannari Amman Institute of
Technology, Sathyamangalam, and Dr. Amitabh Wahi for providing useful online material on subspace clustering of
high-dimensional data and for suggestions on the experiments.
REFERENCES
[1] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, "A Framework for Projected Clustering of High Dimensional Data Streams," Proc. 30th Int'l Conf. Very Large Data Bases (VLDB), 2004.
[2] C.C. Aggarwal, A. Hinneburg, and D. Keim, "On the Surprising Behavior of Distance Metrics in High Dimensional Space," Proc. Eighth Int'l Conf. Database Theory (ICDT), 2001.
[3] C.C. Aggarwal and C. Procopiuc, "Fast Algorithms for Projected Clustering," Proc. ACM SIGMOD, 1999.
[4] C.C. Aggarwal and P.S. Yu, "Finding Generalized Projected Clusters in High Dimensional Spaces," Proc. ACM SIGMOD, 2000.
[5] C.C. Aggarwal and P.S. Yu, "The IGrid Index: Reversing the Dimensionality Curse for Similarity Indexing in High Dimensional Space," Proc. ACM SIGKDD, 2000.
[6] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," Proc. ACM SIGMOD, 1998.
[7] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB), 1994.
[8] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When Is Nearest Neighbors Meaningful?" Proc. Seventh Int'l Conf. Database Theory (ICDT), 1999.
[9] M.-S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. Knowledge and Data Eng., 1996.
[10] C.H. Cheng, A.W. Fu, and Y. Zhang, "Entropy-Based Subspace Clustering for Mining Numerical Data," Proc. ACM SIGKDD, 1999.
[11] Y.-H. Chu, J.-W. Huang, K.-T. Chuang, and M.-S. Chen, "On Subspace Clustering with Density Consciousness," Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM), 2006.
[12] Y.-H. Chu, Y.-J. Chen, and D.-N. Yang, "Reducing Redundancy in Subspace Clustering," IEEE Trans. Knowledge and Data Eng., 2010.
[13] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. ACM SIGKDD, 1996.
[14] H. Fang, C. Zhai, L. Liu, and J. Yang, "Subspace Clustering for Microarray Data Analysis: Multiple Criteria and Significance," Proc. Computational Systems Bioinformatics Conf. (CSB), 2004.
[15] S. Goil, H. Nagesh, and A. Choudhary, "MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets," technical report, Northwestern Univ., 1999.
[16] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[17] A. Hinneburg, C.C. Aggarwal, and D. Keim, "What Is the Nearest Neighbor in High Dimensional Spaces?" Proc. 26th Int'l Conf. Very Large Data Bases (VLDB), 2000.
[18] K. Kailing, H.-P. Kriegel, and P. Kroger, "Density-Connected Subspace Clustering for High-Dimensional Data," Proc. Fourth IEEE Int'l Conf. Data Mining (ICDM), 2004.
[19] Y.B. Kim, J.H. Oh, and J. Gao, "Emerging Pattern Based Subspace Clustering of Microarray Gene Expression Data Using Mixture Models," Proc. 23rd Int'l Conf. Machine Learning (ICML), 2006.
[20] J. Liu, K. Strohmaier, and W. Wang, "Revealing True Subspace Clusters in High Dimensions," Proc. Fourth IEEE Int'l Conf. Data Mining (ICDM), 2004.
[21] L. Lu and R. Vidal, "Combined Central and Subspace Clustering for Computer Vision Applications," Proc. 23rd Int'l Conf. Machine Learning (ICML), 2006.
[22] H.S. Nagesh, S. Goil, and A. Choudhary, "Adaptive Grids for Clustering Massive Data Sets," Proc. First IEEE Int'l Conf. Data Mining (ICDM), 2001.
[23] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz, "UCI Repository of Machine Learning Databases," http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[24] K.Y. Yip, D.W. Cheung, and M.K. Ng, "HARP: A Practical Projected Clustering Algorithm," IEEE Trans. Knowledge and Data Eng., 2004.
[25] M.L. Yiu and N. Mamoulis, "Iterative Projected Clustering by Subspace Mining," IEEE Trans. Knowledge and Data Eng., 2005.