A SUBSPACE CLUSTERING OF HIGH DIMENSIONAL DATA REDUNDANCY

R. TAMILSELVAN, Lecturer, IT Department, Vivekanandha Institute of Engineering and Technology, Namakkal, Tamil Nadu, India. E-mail: [email protected]
V. HARIHARAPRABU, Assistant Professor, IT Department, Vivekanandha Institute of Engineering and Technology, Namakkal, Tamil Nadu, India. E-mail: [email protected]
Prof. R. BHASKARAN, Assistant Professor, CSE Department, Muthayammal Engineering College, Namakkal, Tamil Nadu, India. E-mail: [email protected]
Dr. S. CHITRA, Principal, M. Kumarasamy College of Engineering, Karur, Tamil Nadu, India. E-mail: [email protected]
Dr. C. PALANISAMY, Professor, Bannari Amman Institute of Technology, Erode, Tamil Nadu, India. E-mail: [email protected]

ABSTRACT

This paper proposes a new algorithm, called NORSC (NOn-Redundant Subspace Cluster mining), to efficiently discover a succinct collection of subspace clusters while maintaining the required degree of data coverage. We first study an important but unsolved dilemma in the subspace clustering literature, referred to as the "information overlapping-data coverage" challenge. NORSC not only avoids generating redundant clusters whose contained data are mostly covered by higher dimensional clusters, thereby resolving the information overlapping problem, but also limits the information loss to cope with the data coverage problem. High dimensional data is inherently more complex for clustering, classification, and similarity search. NORSC produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for the data distribution. The number of dimensions in each cluster-specific subspace may also vary; hence, it may be impossible to find a single small subset of dimensions that suits all the clusters. The similarity search and indexing problem is well known to be difficult in high dimensional applications. Due to the monotonicity property of Apriori-like procedures, if a region is identified as dense, all its projected regions are also identified as dense, so overlapping/redundant clustering information is inevitably reported to users when clusters are generated from such highly correlated regions. As shown by our experimental results, NORSC is very effective in identifying a concise and small set of subspace clusters, while incurring a time complexity orders of magnitude better than that of previous work.

Keywords -- Data Mining, Subspace Clustering, Redundancy Filtering, Redundant Clustering, High Dimensional Data.

I. INTRODUCTION

Clustering techniques have been recognized as important and valuable capabilities in the data mining field. The increased research attention on subspace clustering stems from reports on the curse of dimensionality, which show that the difference between the distances from a point to its nearest and farthest points diminishes as the dimension cardinality increases. The applicability of subspace clustering has been demonstrated in various applications, including gene expression data analysis, E-commerce, DNA microarray analysis, and so forth [3], [4]. In this paper, we explore an unsolved dilemma left by previous works, called the information overlapping-data coverage challenge. In grid-based methods, the data space is first partitioned into a number of equal-sized units/grids [5].
The dense units whose densities exceed a predefined density threshold are identified, and finally, groups of connected dense units are reported as clusters. We categorize the resulting redundancy as the information overlapping problem. To solve it, a naive extension of previous works is to remove a cluster if all its dense units are projections of the dense units of a higher dimensional cluster. However, this may eliminate important information due to another problem, called the data coverage problem: some data points in a lower dimensional cluster may not be members of any higher dimensional cluster, even though all the dense units of the lower dimensional cluster are projections of dense units of higher dimensional clusters, i.e., information overlapping [9]. The information overlapping and data coverage problems are not special cases in subspace clustering. As will be shown by the experimental results in Section IV, we study these problems using three real data sets from varying applications, adopted from the UCI machine learning repository [7]: the yeast database, the adult database, and the covertype database. We clearly found that 1) a massive and user-unacceptable number of subspace clusters is reported if no cluster is removed by measuring information overlapping, and 2) data coverage is strikingly sacrificed if a cluster is removed whenever all its dense units are projections of units in higher dimensional clusters. These results show that there is a dilemma between information overlapping and data coverage, calling for a novel solution that provides a good balance between the two. In our algorithm NORSC, the efficiency of discovering the nonredundant clusters comes from first identifying the maximal dense regions. Experimental results on extensive real data sets reveal that NORSC obtains a concise set of subspace clusters, while incurring a time complexity orders of magnitude better than that of previous works [21], [25].

The rest of this paper is organized as follows. Section II presents related work on subspace clustering. Section III gives the NORSC algorithm. Section IV presents the experimental results. Section V concludes this paper.

II. RELATED WORK

The information overlapping and data coverage problems are not special cases in subspace clustering; we study them using the three real data sets adopted from the UCI machine learning repository [3] mentioned above. As shown by the experimental results in this paper, CLIQUE suffers from the data coverage problem: directly removing the clusters whose constituent units are projections of dense units in higher dimensional clusters causes large information loss. In ENCLUS, the subspace entropy is devised by considering three criteria of good clustering in a subspace, i.e., high data coverage, high density, and correlated dimensions; the subspaces with good clustering have lower entropy and are selected for discovering the clusters [1], [2]. ENCLUS also suffers from the information overlapping and data coverage problems because it adopts the same clustering model as CLIQUE to discover the subspace clusters.
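To make the grid-based clustering model underlying CLIQUE and ENCLUS concrete, the following minimal Python sketch identifies the dense units of a given subspace; the function and parameter names (dense_units, the interval count xi, the density threshold tau) and the frozenset-of-(dimension, interval)-pairs representation are our own illustrative choices, not part of either algorithm's published implementation.

from collections import Counter
from itertools import combinations

def dense_units(points, dims, xi, tau):
    # Partition each dimension in dims into xi equal-width intervals and
    # return the units (grid cells) of this subspace holding at least tau
    # points. Coordinates are assumed to be scaled into [0, 1].
    counts = Counter()
    for p in points:
        # A unit is identified by its interval index in each chosen dimension.
        unit = frozenset((d, min(int(p[d] * xi), xi - 1)) for d in dims)
        counts[unit] += 1
    return {u: n for u, n in counts.items() if n >= tau}

# Example: enumerate the dense 1D and 2D units of a toy 3-dimensional set.
data = [(0.11, 0.52, 0.93), (0.12, 0.55, 0.91), (0.13, 0.51, 0.95),
        (0.80, 0.10, 0.20)]
for k in (1, 2):
    for dims in combinations(range(3), k):
        print(dims, dense_units(data, dims, xi=10, tau=3))

Because every projection of a dense unit is itself dense (the monotonicity property noted in Section I), enumerating units subspace by subspace in this way inevitably reports many overlapping units, which is precisely the redundancy NORSC targets.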
In SUBCLU, by introducing two parameters ε and m, core objects are defined as the data points containing at least m data points in their ε-neighborhood, where the distance between two data points in a subspace is calculated as the distance between their projections in that subspace. As such, if a data point is a core object in a subspace, it is also a core object when projected into lower subspaces, resulting in a massive number of clusters with a high degree of information overlapping. In addition, SUBCLU carries a huge workload in performing range queries to find the data points in the ε-neighborhood of each data point in arbitrary subspaces, which reduces its practicability in subspace clustering. The subspace clustering methods in the literature can be grouped into two categories, hard and soft subspace clustering [4], discussed below.

A. Hard Subspace Clustering

The subspace clustering methods in this category can be further divided into bottom-up and top-down subspace search methods [7]. The bottom-up methods consist of the following main steps: dividing each dimension into intervals and identifying the dense intervals in each dimension; from the intersections of the dense intervals, identifying the dense cells in all two-dimensional subspaces; from the intersections of the 2D dense cells and the dense intervals of other dimensions, identifying the dense cells in all three-dimensional subspaces, and repeating this process until all dense cells in all k-dimensional subspaces are identified; and finally merging the adjacent dense cells in the same subsets of dimensions to identify clusters. The efficacy of this approach depends on how the clustering problem is addressed in the original feature space. A potentially serious problem with the top-down techniques is the lack of data to locally perform PCA on each cluster to derive the principal components; therefore, they are inflexible in determining the dimensionality of the data representation.

B. Soft Subspace Clustering

Instead of identifying exact subspaces for clusters, this approach assigns a weight to each dimension in the clustering process to measure the contribution of that dimension to forming a particular cluster. In such a clustering, every dimension contributes to every cluster, but the contributions differ; the subspaces of the clusters can be identified from the weight values after clustering. Variable weighting for clustering is an important research topic in statistics and data mining [8], [9], [10], although its purpose there is to select important variables for clustering. Extensions to some variable weighting methods, for example, the k-means-type variable weighting methods, can perform the task of subspace clustering, and a number of algorithms in this direction have been reported recently [11], [12]. We can observe that the weight value for a dimension in a cluster is inversely proportional to the dispersion of the values from the center in that dimension of the cluster. Since the dispersions differ across dimensions and clusters, the weight values for different clusters differ. A high weight indicates a small dispersion in a dimension of the cluster; therefore, that dimension is more important in forming the cluster. This approach has a problem in handling sparse data: if the dispersion of a dimension in a cluster happens to be zero, the weight for that dimension is not computable, as illustrated by the sketch below.
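As a hedged illustration of this inverse-dispersion weighting (the exact formula varies among the k-means-type methods cited above, and the smoothing term eps is our own device, not part of those methods), consider the following Python sketch:

import numpy as np

def subspace_weights(cluster_points, center, eps=0.0):
    # Dispersion of the cluster's values from its center in each dimension.
    dispersion = ((cluster_points - center) ** 2).sum(axis=0) + eps
    # Inverse-dispersion weight: with eps = 0, a dimension whose values all
    # coincide with the center gets an infinite, meaningless weight -- the
    # sparse-data problem noted above.
    inv = 1.0 / dispersion
    return inv / inv.sum()  # normalize so the weights sum to 1

pts = np.array([[1.0, 5.0, 0.2],
                [1.1, 5.2, 0.8],
                [0.9, 4.8, 0.5]])
print(subspace_weights(pts, pts.mean(axis=0), eps=1e-6))

In this example, the first dimension has the smallest dispersion and therefore receives the largest weight, i.e., it contributes most to forming the cluster.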
III. NORSC ALGORITHM

In this paper, we propose a new algorithm, called NORSC, to cope with the nonredundant cluster discovering problem. In NORSC, the identification of the nonredundant clusters relies on computing the number of data points contained in each dense unit and the number of nonextensible data points of each dense unit. We use this information to calculate, for each cluster, the number of its enclosed data points and the total number of data points in its extension clusters, which are then used to decide whether the cluster is redundant. The major challenge in developing an efficient nonredundant cluster discovering algorithm therefore hinges on computing the nonextensible data points of the dense units. A naive approach is to scan the higher dimensional units for each dense unit to extract its dense super units, which are then used to compute its nonextensible data points; however, this naive approach faces an enormous workload due to the exhaustive search over a tremendous number of higher dimensional units. NORSC tackles this challenge by leveraging the maximal dense units, i.e., the dense units whose super units are not dense [16], to efficiently discover the nonextensible data points of the dense units. Thus, NORSC is devised as a two-step approach: the first step discovers the maximal dense units, and the second step utilizes the maximal dense units to extract the nonextensible data points of the dense units and discover the nonredundant clusters afterward. Fig. 1 gives the flowchart of algorithm NORSC. We describe the two steps of NORSC in Sections III.A and III.B and discuss the efficiency of NORSC in Section III.C.

A. Mining Maximal Dense Units

The first step of NORSC extracts the maximal dense units to accelerate the discovery of the nonredundant clusters. In mining maximal dense units, each k-dimensional unit is represented by a k-element set, and the discovery is performed by traversing the unit lattice in a depth-first manner; Fig. 2 shows the procedure. Each visited dense unit u carries a candidate set Cu containing the 1D dense units that are valid extensions of u, i.e., those whose concatenation with u yields a dense unit; the candidates of a child node are the 1D units following the extending unit in Cu, excluding the ones on dimensions already contained in u. If Cu is empty, u is an utmost extended dense unit and becomes a candidate maximal dense unit. The traversal is started at the root with the candidate set containing all 1D dense units, sorted by dimension ID. To discover the maximal dense units among the utmost extended dense units obtained in the lattice traversal, each utmost extended dense unit must check whether it has dense super units in other branches: an utmost extended dense unit is a maximal dense unit if and only if no other utmost extended dense unit is its super unit.
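The following Python sketch (a reconstruction under our own naming, reusing the (dimension, interval) frozenset representation of the earlier sketch and assuming the set of dense units has already been computed) illustrates this traversal and the final maximality check:

def maximal_dense_units(dense, one_d_units):
    # dense       : set of frozensets of (dim, interval) pairs, the dense
    #               units of all subspaces (e.g., gathered via dense_units)
    # one_d_units : the 1D dense units, sorted by dimension id
    utmost = []

    def visit(unit, candidates):
        dims = {d for d, _ in unit}
        # Keep only candidates on a new dimension whose concatenation with
        # `unit` is still dense; these generate the child nodes.
        ext = [c for c in candidates
               if next(iter(c))[0] not in dims and (unit | c) in dense]
        if not ext:
            # No valid extension: `unit` is an utmost extended dense unit.
            utmost.append(unit)
            return
        for i, c in enumerate(ext):
            # Pass on only the candidates that follow c, so that every
            # subspace is generated exactly once during the traversal.
            visit(unit | c, ext[i + 1:])

    visit(frozenset(), one_d_units)
    # An utmost extended unit is maximal iff no other utmost extended unit
    # (found in another branch) is a proper super unit of it.
    return [u for u in utmost if not any(u < v for v in utmost)]

Passing only the candidates that follow the chosen extension keeps each subspace from being generated twice, which is exactly why the closing superset check against the utmost extended units from other branches is still required.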
B. Mining Nonredundant Clusters

After efficiently extracting the maximal dense units in the previous step, we leverage them to find the nonextensible data points of the dense units, which is the requisite information for discovering the nonredundant clusters; Fig. 3 shows the procedure for discovering the nonextensible dense units. All the dense units in different subspaces can identify their nonextensible data points once the data points have discovered their nonextensible dense units (Fig. 3). However, we propose to discover the nonextensible data points for dense units of different cardinalities in different iterations, because storing all the nonextensible dense units of all data points at once would require a prohibitive amount of memory, given the numerous dense units in high dimensional data. In iteration k, which discovers the nonextensible data points of the k-dimensional dense units, each data point extracts only its nonextensible dense units of cardinality k.

C. Discussion of the NORSC Algorithm

NORSC outperforms the traditional grid-based subspace clustering mechanisms [6], [12] that use postprocessing to filter redundant clusters. The main reason is that we leverage the maximal dense units to find the nonextensible data points of the dense units, so only the unit counts of a small number of dense units need to be computed. Regarding the efficiency of computing the nonextensible data points: a naive approach is to exhaustively scan the dense super units of each dense unit and compute the data points covered by those units. Since the number of maximal dense units is much smaller than the total number of k-dimensional dense units, NORSC incurs a much smaller computational cost for the nonextensible data points than this naive approach, as confirmed by the experimental results.
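Before turning to the experiments, the following hedged sketch shows one way the maximal dense units can serve this coverage test; the reasoning via the monotonicity property is our reconstruction of the idea, and the names (point_unit, non_extensible_points) continue the representation of the earlier sketches.

def point_unit(p, xi, n_dims):
    # The full-dimensional unit that data point p falls into.
    return frozenset((d, min(int(p[d] * xi), xi - 1)) for d in range(n_dims))

def non_extensible_points(unit, members, maximal, xi, n_dims):
    # members : the data points contained in the dense `unit`
    # maximal : the maximal dense units from the first step
    # By monotonicity, a unit is dense iff it is contained in some maximal
    # dense unit; hence a member p of `unit` is extensible (covered by a
    # dense super unit) iff some maximal unit m overlaps p's
    # full-dimensional unit in a strict superset of `unit`.
    result = []
    for p in members:
        pu = point_unit(p, xi, n_dims)
        if not any(unit < (m & pu) for m in maximal):
            result.append(p)  # p belongs to no dense super unit of `unit`
    return result

Checking each point against the small set of maximal dense units, rather than against all dense super units, is what gives NORSC its efficiency advantage over the naive approach described above.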
IV. EXPERIMENTAL RESULTS

The three real data sets listed in Table 1 are used in the experiments.

TABLE 1. Real data sets used in the experiments.

S. No.   Data Set    Dimensions   Data Points
1        Yeast       6            1484
2        Adult       6            32561
3        Covertype   10           16125

An extension of the traditional grid-based subspace clustering mechanisms [9], [10] diminishes the information overlapping between the highly correlated units by removing every cluster whose dense units are all projections of dense units in higher dimensional clusters. However, some data points in the removed clusters may not be covered by any higher dimensional cluster, causing information loss on these uncovered data; this is the data coverage problem indicated in Section I. In this section, we study the data coverage problem using the three real data sets shown in Table 1, where we evaluate the coverage loss ratio of each cluster removed by the above extension approach.

We first evaluate the accuracy of NORSC on the three data sets listed in Table 1 with τ set to 5. Fig. 4 gives the experimental results, where we vary the density threshold. Fig. 4 reports the "cluster ratio," defined as the ratio of the number of subspace clusters discovered by NORSC to the number of clusters discovered by CLIQUE.

Fig. 4(a)-(c). Cluster ratio versus redundancy threshold on the three real data sets.

We generate the synthetic data sets with the data generator in [10], which is used in most traditional subspace clustering works. Four synthetic data sets are generated, and their characteristics are shown in Table 3a. Table 3b shows the experimental results of our algorithm NORSC and of the traditional grid-based clustering algorithm CLIQUE [13], [14], [15]. These results are the best ones, derived by testing a broad range of parameter settings; in these experiments, τ is set to 6. As can be seen from Table 3b, NORSC and CLIQUE both achieve a recall of 1 on these data sets; that is, both can accurately discover the true clusters [22], [23], [24]. However, CLIQUE generates a massive set of subspace clusters, because the true clusters make their projected regions in lower subspaces also be identified as dense regions. As shown in Table 3b, this problem becomes even worse for CLIQUE when the number of clusters increases or the clusters are embedded in higher dimensional subspaces. In contrast, the experimental results show that NORSC not only accurately extracts the true clusters but also discovers a succinct clustering result: NORSC finds the same true clusters in data sets 1 and 2, and for data sets 3 and 4 it produces succinct clustering results compared to CLIQUE [16], [17].

A. Redundancy Threshold

In this section, we propose a procedure for recommending a redundancy threshold that helps users discover a succinct clustering result. When selecting the value, we face a trade-off between the information loss in the identified redundant clusters and the information redundancy in the identified nonredundant clusters. For the identified redundant clusters, we may lose the information of those enclosed data points that are not contained in any higher dimensional cluster. To limit this information loss, we may consider setting a much higher value (such as α = 0.95); however, such a high α value is less able to resolve the information overlapping problem, so the discovered clusters may still carry large information redundancy [18], [19], [20] (a simplified post-filter illustrating this trade-off is sketched at the end of this section).

Fig. 5. Marital status versus number of people for clusters 1 and 2 discovered by NORSC.

We also compare the time performance of NORSC and the extended CLIQUE algorithm [6], where the two algorithms identify the same nonredundant clusters. We extend CLIQUE by first executing it to generate all subspace clusters and then identifying and filtering the redundant clusters. To compare the performance, the three real data sets shown in Table 1 are utilized [20], [21]. Fig. 6 compares the scalability of NORSC and the extended CLIQUE under different density thresholds, with α set to 0.95. As shown in the three plots, NORSC outperforms the extended CLIQUE in execution time. For a larger density threshold, NORSC requires less execution time because fewer dense units need to be identified and fewer nodes in the lattice need to be traversed to find the maximal dense units.
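As a hedged illustration of how the redundancy threshold governs this trade-off (a simplified post-filter in the spirit of the extended CLIQUE baseline, not NORSC's actual extension-cluster computation; the cluster representation is our own):

def filter_redundant(clusters, alpha):
    # clusters : list of (dims, points) pairs, where dims is a frozenset of
    #            dimension ids and points is the set of member point ids
    # alpha    : redundancy threshold in (0, 1]
    kept = []
    for dims, pts in clusters:
        covered = set()
        for other_dims, other_pts in clusters:
            if dims < other_dims:  # a strictly higher dimensional cluster
                covered |= pts & other_pts
        # Keep the cluster only while less than an alpha fraction of its
        # points is already reported by higher dimensional clusters.
        if len(covered) < alpha * len(pts):
            kept.append((dims, pts))
    return kept

With α close to 1, almost every cluster survives the filter (little information loss, high redundancy); lowering α removes more of the lower dimensional clusters whose points are already covered (less redundancy, more potential coverage loss).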
V. CONCLUSIONS

In this paper, we have proposed a new algorithm for reducing redundancy in high dimensional data. We studied a dilemma in the subspace clustering literature called the "information overlapping-data coverage" challenge; naive extensions of previous works cannot strike a good balance between these issues. We have proposed the NORSC algorithm to automatically discover a succinct collection of subspace clusters while maintaining the required degree of data coverage. NORSC does not generate the clusters with most of their contained data covered by higher dimensional clusters, thereby avoiding the information overlapping problem; in addition, NORSC limits the information loss to cope with the data coverage problem. Our algorithm leverages the maximal dense units to generate the nonredundant clusters. As demonstrated by our experimental results, NORSC discovers a concise and small collection of subspace clusters, and its time efficiency outperforms the extensions of previous works.

ACKNOWLEDGEMENT

The authors would like to thank all the staff of the Department of Information Technology at Bannari Amman Institute of Technology, Sathyamangalam, and Dr. Amitabh Wahi for providing useful online material on subspace clustering in high dimensional data and for suggestions on the experiments.

REFERENCES

[1] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, "A Framework for Projected Clustering of High Dimensional Data Streams," Proc. 30th Int'l Conf. Very Large Data Bases (VLDB), 2004.
[2] C.C. Aggarwal, A. Hinneburg, and D. Keim, "On the Surprising Behavior of Distance Metrics in High Dimensional Space," Proc. Eighth Int'l Conf. Database Theory (ICDT), 2001.
[3] C.C. Aggarwal and C. Procopiuc, "Fast Algorithms for Projected Clustering," Proc. ACM SIGMOD, 1999.
[4] C.C. Aggarwal and P.S. Yu, "Finding Generalized Projected Clusters in High Dimensional Spaces," Proc. ACM SIGMOD, 2000.
[5] C.C. Aggarwal and P.S. Yu, "The IGrid Index: Reversing the Dimensionality Curse for Similarity Indexing in High Dimensional Space," Proc. ACM SIGKDD, 2000.
[6] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," Proc. ACM SIGMOD, 1998.
[7] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB), 1994.
[8] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When Is Nearest Neighbors Meaningful?" Proc. Seventh Int'l Conf. Database Theory (ICDT), 1999.
[9] M.-S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. Knowledge and Data Eng., 1996.
[10] C.H. Cheng, A.W. Fu, and Y. Zhang, "Entropy-Based Subspace Clustering for Mining Numerical Data," Proc. ACM SIGKDD, 1999.
[11] Y.-H. Chu, J.-W. Huang, K.-T. Chuang, and M.-S. Chen, "On Subspace Clustering with Density Consciousness," Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM), 2006.
[12] Y.-H. Chu, Y.-J. Chen, and D.-N. Yang, "Reducing Redundancy in Subspace Clustering," IEEE Trans. Knowledge and Data Eng., 2010.
[13] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. ACM SIGKDD, 1996.
[14] H. Fang, C. Zhai, L. Liu, and J. Yang, "Subspace Clustering for Microarray Data Analysis: Multiple Criteria and Significance," Proc. Computational Systems Bioinformatics Conf. (CSB), 2004.
[15] S. Goil, H. Nagesh, and A. Choudhary, "MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets," technical report, Northwestern Univ., 1999.
[16] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[17] A. Hinneburg, C.C. Aggarwal, and D. Keim, "What Is the Nearest Neighbor in High Dimensional Spaces?" Proc. 26th Int'l Conf. Very Large Data Bases (VLDB), 2000.
[18] K. Kailing, H.-P. Kriegel, and P. Kroger, "Density-Connected Subspace Clustering for High-Dimensional Data," Proc. Fourth IEEE Int'l Conf. Data Mining (ICDM), 2004.
[19] Y.B. Kim, J.H. Oh, and J. Gao, "Emerging Pattern Based Subspace Clustering of Microarray Gene Expression Data Using Mixture Models," Proc. 23rd Int'l Conf. Machine Learning (ICML), 2006.
[20] J. Liu, K. Strohmaier, and W. Wang, "Revealing True Subspace Clusters in High Dimensions," Proc. Fourth IEEE Int'l Conf. Data Mining (ICDM), 2004.
[21] L. Lu and R. Vidal, "Combined Central and Subspace Clustering for Computer Vision Applications," Proc. 23rd Int'l Conf. Machine Learning (ICML), 2006.
[22] H.S. Nagesh, S. Goil, and A. Choudhary, "Adaptive Grids for Clustering Massive Data Sets," Proc. First IEEE Int'l Conf. Data Mining (ICDM), 2001.
[23] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz, "UCI Repository of Machine Learning Databases," http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[24] K.Y. Yip, D.W. Cheung, and M.K. Ng, "HARP: A Practical Projected Clustering Algorithm," IEEE Trans. Knowledge and Data Eng., 2004.
[25] M.L. Yiu and N. Mamoulis, "Iterative Projected Clustering by Subspace Mining," IEEE Trans. Knowledge and Data Eng., 2005.