Indian Journal of Science and Technology, Vol 9(38), DOI: 10.17485/ijst/2016/v9i38/101792, October 2016
ISSN (Print): 0974-6846 ISSN (Online): 0974-5645

H-D and Subspace Clustering of Paradoxical High Dimensional Clinical Datasets with Dimension Reduction Techniques – A Model

S. Rajeswari1*, M. S. Josephine2 and V. Jeyabalaraja3
1 Bharathiyar University, Coimbatore - 641046, India; [email protected]
2 Dr. M.G.R. Educational and Research Institute, Chennai - 600095, India; [email protected]
3 Velammal Engineering College, Chennai - 600066, India; [email protected]
*Author for correspondence

Abstract
Objectives: Heterogeneous high dimensional data clustering is the analysis of data with multiple dimensions. Large numbers of dimensions are not easy to handle, and the complexity increases exponentially with the dimensionality. Dimensionality reduction is the conversion of high dimensional data into a meaningful representation of reduced dimensionality that corresponds to the essential dimensionality of the data. To address the problem, we put forward a general framework for clustering high dimensional datasets. Methods: Clustering is the method of finding groups of objects such that the objects in a group are similar to one another and different from the objects in other groups. In our framework, a heterogeneous high dimensional clustering task is partitioned into several one or two dimensional clustering phases. Findings: In this paper, a model is designed in which Hierarchical-Divisive clustering and subspace clustering are used to make non-overlapping clusters, and are combined with dimension reduction techniques to reduce the dimensions of paradoxical high dimensional clinical datasets. Applications: The model offers a solution for processing heterogeneous high dimensional datasets in combination with techniques such as PCA, LDA and PSO.

Keywords: High Dimensional Data, Hierarchical-Divisive (H-D) Clustering, Subspace Clustering

1. Introduction
Data mining refers to the mining or discovery of new information, in terms of patterns or rules, from a large collection of data. It is a process that takes data as input and outputs knowledge. Clustering is a process by which the data are divided into groups called clusters, such that objects in one cluster are closely related and objects in different clusters are very dissimilar to each other1,2. Figure 1 shows the data clusters. In other words, clusters should have low inter-cluster similarity and high intra-cluster similarity.

Applying standard clustering algorithms to high dimensional datasets is a great challenge for traditional data mining techniques, both in terms of efficiency and in practice. As the number of dimensions grows, the distances between data points become less distinct and the data become sparse, which causes the “dimensionality disaster problem” and makes clustering difficult3. Existing algorithms suffer from high computational complexity and poor cluster accuracy on high dimensional data, so the proposed model should maintain the quality of the data and the speed of processing and be more effective than the existing algorithms.

Research in the area of clustering has therefore introduced many new concepts, such as subspace clustering, ensemble clustering and the H-K clustering process4,5. Applying these concepts to a heterogeneous high dimensional dataset still leads to a dimensionality problem, which is the focus of this work. Subspace clustering, an extension of the traditional clustering model, finds the clusters in various datasets6. It deals with the detection of groups of clusters that are scattered across different subspaces of the same dataset. The problem becomes how to find such subspace clusters effectively and efficiently.
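The effect described above, where groups that are clear in a few attributes become hard to see once many irrelevant attributes are included, can be illustrated with a small synthetic experiment. The sketch below is not taken from the paper: the data, the helper names (`mean_distance`, `separation`, `make_group`) and the choice of 50 noise dimensions are all assumptions made for illustration. It compares how well two groups are separated in the one relevant subspace against how well they are separated in the full space.

```python
import math
import random


def mean_distance(points_a, points_b, dims):
    """Mean Euclidean distance between two point sets, using only `dims`."""
    total, count = 0.0, 0
    for p in points_a:
        for q in points_b:
            if p is q:  # skip self-pairs when a set is compared with itself
                continue
            total += math.sqrt(sum((p[d] - q[d]) ** 2 for d in dims))
            count += 1
    return total / count


def separation(group_a, group_b, dims):
    """Between-group mean distance divided by within-group mean distance."""
    between = mean_distance(group_a, group_b, dims)
    within = 0.5 * (mean_distance(group_a, group_a, dims)
                    + mean_distance(group_b, group_b, dims))
    return between / within


def make_group(rng, centre, n_points, n_noise_dims):
    """Points concentrated around `centre` in dimension 0; all other
    dimensions are uniform noise that carries no cluster structure."""
    return [
        [rng.gauss(centre, 0.1)]
        + [rng.uniform(0.0, 10.0) for _ in range(n_noise_dims)]
        for _ in range(n_points)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
group_a = make_group(rng, 0.0, 30, 50)
group_b = make_group(rng, 1.0, 30, 50)

subspace_ratio = separation(group_a, group_b, dims=[0])
full_space_ratio = separation(group_a, group_b, dims=range(51))

print(f"separation in subspace {{0}}:     {subspace_ratio:.2f}")
print(f"separation in all 51 dimensions: {full_space_ratio:.2f}")
```

A ratio well above 1 means the two groups are easy to tell apart; a ratio close to 1 means between-group and within-group distances are nearly indistinguishable. In this toy setting the groups separate clearly in the single relevant dimension and hardly at all in the full space, which is the situation subspace clustering is meant to handle.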
Ensemble clustering is the “knowledge reuse framework” proposed in7. Traditional clustering algorithms give less efficient results when dealing with high dimensional data because they suffer from problems such as the “curse of dimensionality”. The problems that are often quoted, such as irrelevant noisy features and sparsity of data, should be reduced as far as possible. The highest priority is given to these problems in order to provide an advanced clustering algorithm that clusters the data efficiently. We propose a model that combines advanced clustering algorithms to improve the quality of the clusters and the speed of processing large amounts of data. The proposed model combines three techniques: hierarchical (divisive) clustering, subspace clustering (PROCLUS) and dimension reduction techniques, which may be PCA, SVD, LDA, PSO etc., to improve cluster efficiency and reduce the curse of dimensionality.

Figure 1. Data Clusters.

1.1 Paradoxical High Dimensional Clinical Datasets
A heterogeneous high dimensional dataset is a set of interrelated components that are autonomous in nature. The attributes present in one component may be completely different from the attributes in the other component datasets, which complicates the integration of their semantics into the overall heterogeneous database. Different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases or file systems, are combined to form heterogeneous databases, which are referred to as legacy databases8. Here we represent these data as Paradoxical High Dimensional Clinical Datasets. One of the most significant challenges of data mining in the medical domain is to obtain relevant clinical trial data of good quality. Medical data are complex and heterogeneous in nature because they are collected from various sources, such as laboratory reports, discussions with the patient or reviews by physicians. Medical information is characterised by redundancy, multiple attributes, incompleteness and a close relationship with time.

Vol 9 (38) | October 2016 | www.indjst.org

1.2 Hierarchical Clustering Analysis
Hierarchical clustering and partition clustering are the basic types of clustering algorithms. Hierarchical clustering builds a hierarchy of clusters using linkage criteria such as single link and complete link. It is further classified into agglomerative (bottom-up approach) and divisive (top-down approach).

Agglomerative Clustering
A hierarchical process that begins with each object or observation in a separate cluster. In each subsequent step, the most similar clusters are combined to form a new aggregate cluster. The process is repeated until all objects are finally combined into a single cluster, going from n clusters to 1. The similarity between the merged clusters decreases over successive steps, and clusters cannot be split once they have been merged. In short, the method starts with single data points and recursively merges two or more clusters (AGNES).

Divisive Clustering
A hierarchical process that starts with all objects in a single cluster, which is then divided step by step. The single cluster is first segregated into two clusters that contain the most dissimilar objects. One of these clusters is then split, for a total of three clusters. The iteration continues until every observation is in a cluster of its own, going from 1 cluster to n clusters. DIANA is a hierarchical divisive clustering algorithm that starts with one big cluster and divides it successively into smaller clusters.
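The top-down procedure described above can be sketched in a few lines. This is a minimal illustration in the style of DIANA, not the authors' implementation (the paper does not give one): the function names (`split`, `divisive_clustering`), the use of Euclidean distance, the rule of always splitting the cluster with the largest diameter, and the toy data are all assumptions.

```python
import math


def avg_dist(point, others):
    """Average Euclidean distance from `point` to every point in `others`."""
    if not others:
        return 0.0
    return sum(math.dist(point, o) for o in others) / len(others)


def diameter(cluster):
    """Largest pairwise distance inside a cluster."""
    return max((math.dist(p, q) for p in cluster for q in cluster), default=0.0)


def split(cluster):
    """Split one cluster into (remainder, splinter), DIANA-style."""
    # Seed the splinter group with the point that is, on average,
    # farthest from all the other points in the cluster.
    seed = max(cluster,
               key=lambda p: avg_dist(p, [q for q in cluster if q is not p]))
    splinter = [seed]
    remainder = [p for p in cluster if p is not seed]

    def gain(p):
        # Positive when p is closer to the splinter group than to the
        # rest of the remainder.
        rest = [q for q in remainder if q is not p]
        return avg_dist(p, rest) - avg_dist(p, splinter)

    while len(remainder) > 1:
        best = max(remainder, key=gain)
        if gain(best) <= 0:
            break  # no remaining point prefers the splinter group
        remainder = [q for q in remainder if q is not best]
        splinter.append(best)
    return remainder, splinter


def divisive_clustering(points, n_clusters):
    """Start with one cluster and keep splitting until n_clusters remain."""
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        target = max(clusters, key=diameter)  # split the widest cluster
        if len(target) < 2:
            break  # nothing left to split
        clusters = [c for c in clusters if c is not target]
        clusters.extend(split(target))
    return clusters


# Hypothetical toy data: three visually obvious groups in two dimensions.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (8.0, 8.0), (8.1, 7.9), (7.9, 8.2),
          (1.0, 8.0), (1.1, 8.1)]
clusters = divisive_clustering(points, n_clusters=3)
for c in clusters:
    print(sorted(c))
```

Stopping at a requested number of clusters rather than continuing down to n singleton clusters matches the use of a given cluster count in Phase 2 of the proposed model; running the loop with `n_clusters=len(points)` would produce the full hierarchy.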
For comparing the clusters of a heterogeneous high dimensional dataset, hierarchical cluster analysis (HCA) provides a useful framework. The HCA method helps to evaluate how many clusters should be considered.

The advantages of Hierarchical Clustering Analysis (HCA) are:
Simplicity: With the help of the dendrogram structure, HCA provides a simple yet wide-ranging depiction of clustering solutions.
Measure of Similarity: HCA can be applied to almost any type of research question.
Speed: HCA has the advantage of generating an entire set of clustering solutions in a convenient manner8.

1.4 Subspace Clustering
Subspace clustering is an extension of attribute subset selection that has shown its strength in high dimensional clustering. It is based on the observation that different subspaces may contain different, meaningful clusters. Subspace clustering searches for groups of clusters within different subspaces of the same data set. The problem becomes how to find such subspace clusters effectively and efficiently. Three approaches are dimension-growth subspace clustering (CLIQUE), dimension-reduction projected clustering (PROCLUS) and frequent pattern-based clustering (pCluster).

CLIQUE (CLustering In QUEst) splits the n-dimensional data space into non-overlapping rectangular units and identifies the dense units among them. This is done for each dimension. CLIQUE automatically finds the subspaces of highest dimensionality in which high-density clusters exist.

PROCLUS (PROjected CLUStering) is a dimension-reduction subspace clustering method. Rather than starting from single-dimensional spaces, PROCLUS starts by finding an initial approximation of the clusters in the high-dimensional attribute space.
Each dimension is then assigned a weight for each cluster9. These weights are passed to the next iteration to regenerate the clusters. This leads to the exploration of dense regions in all subspaces of the required dimensionality and avoids the generation of a large number of overlapped clusters in projected dimensions of lower dimensionality. In contrast to CLIQUE, PROCLUS finds non-overlapped partitions of points. The discovered clusters may help in understanding the high-dimensional data better and facilitate subsequent analyses.

Frequent pattern-based cluster analysis can discover significant associations and correlations among the data objects in the clusters. Rather than growing the clusters dimension by dimension, it grows sets of frequent itemsets, which eventually lead to cluster descriptions. An advantage of frequent term-based clustering is that the description of each cluster is generated automatically from the frequent itemsets. Traditional clustering methods produce only clusters, and several further processing steps have to be included to generate the cluster descriptions9. Recent work in the area of high dimensional data is explained briefly in10,11.

Dimensionality Reduction
Feature extraction and feature transformation are the most popular techniques of dimension reduction. Some experimental evaluations indicate that, with both methods, the accuracy and effectiveness of the data are affected by the lost information, and that feature selection algorithms run into difficulty when clusters are found in different subspaces. This type of data motivated the evolution of subspace clustering algorithms.

2. Proposed Model
The complete flow diagram of the proposed model is shown in Figure 2. Based on this flowchart, the following content describes the stages in detail.

Figure 2. Model of Dimension Reduction.

Phase 1: Dataset Pre-Processing
Import the dataset for pre-processing, since the clinical dataset has many missing values and outliers. Pre-processing is needed to remove this kind of noise and to turn the raw data into processed data.

Phase 2: H-D Clustering Process
With the divisive (top-down) approach, the dataset is divided into n clusters from the top. The clusters are formed according to the given number of clusters and threshold value, and they are represented by a dendrogram. When heterogeneous high-dimensional clinical datasets are clustered, overlapping may occur, with some clusters formed as subsets of others. At this stage the data have therefore not yet been converted from high dimensional to low dimensional form. Figure 3 shows the H-D clusters of data.

Figure 3. H-D clusters of data.

Phase 3: The Subspace Clustering Process
At the end of divisive clustering, the overlapped clusters are refined by the subspace clustering process. These overlapping clusters contain the required data. Once the number of clusters has been assigned and the subspaces determined, the process shows the number of clusters present in them. Finally, the number of clusters is reduced by combining the groups that are close and similar to each other, as shown in Figure 4.

Figure 4. Cluster Refining process.

Phase 4: Dimension Reduction Techniques
The subspace process produces a reduced set of clusters, but these clusters still have a large number of attributes or dimensions. In combination with subspace clustering, Principal Component Analysis, Linear Discriminant Analysis, Singular Value Decomposition, Factor Analysis etc. can be used to reduce the multi-attribute datasets.

According to our domain knowledge, the paradoxical clinical datasets are heterogeneous and high dimensional in nature. The blood report and the scan report of a particular patient, for example, show different numbers of attributes. These attributes are clustered by the H-D clustering algorithm. After H-D clustering, some overlapping clusters are formed. The subspace clustering algorithm reduces these overlapping clusters to form the prominent clusters, and combining the result with dimension reduction techniques yields the required reduced datasets. By applying the above phases, the proposed model obtains a reduced number of clusters and, finally, accurate and efficient reduced clinical datasets, which will be very useful for diagnosing a patient's problem.

3. Conclusion and Future Enhancement
Heterogeneous high dimensional dataset processing faces complications such as “the curse of dimensionality” and the sparsity of data in high dimensional space. The proposed model provides a solution for processing heterogeneous high dimensional datasets that is composed of hierarchical clustering (divisive), subspace clustering (PROCLUS) and dimension reduction algorithms such as PCA, LDA and PSO. The hierarchical clusters of the dataset are passed to subspace clustering, which generates subsets of non-overlapping, low dimensional clusters. Combined with dimension reduction techniques, this reaches the final stage of converting high dimensional or multi-attribute datasets into lower dimensional clinical datasets. This paper provides a model for dimension reduction in paradoxical high dimensional clinical datasets.
Future work will be to develop an algorithm for the combined concepts above, implement it on benchmark clinical datasets, obtain efficient results and visualise them.

4. References
1. Aastha Joshi, Rajneet Kaur. A Review: Comparative Study of Various Clustering Techniques in Data Mining. International Journal of Advanced Research in Computer Science and Software Engineering. 2013 Mar; 3(3):55-7.
2. Smyth P. Clustering using Monte Carlo cross-validation. Learning, Probability, & Graphical Models. 1996; p. 126-33.
3. Painthankar Rashmi, Tidke Bharat. A H-K clustering algorithm for high dimensional data using ensemble learning. International Journal of Information Technology Convergence and Services. 2014 Dec; 4(5/6):1-9.
4. Müller Emmanuel. Evaluating Clustering in subspace projections of high dimensional Data. Proceedings of the VLDB Endowment. 2009 Aug; 2(1):1270-81.
5. A novel approach for high dimensional data clustering. Date Accessed: 9/01/2010: Available from: http://ieeexplore.ieee.org/document/5432636/.
6. Parsons Lance, Haque Ehtesham, Liu Huan. Subspace clustering for high dimensional Data: A Review. ACM SIGKDD Explorations Newsletter. 2004 Jun; 6(1):90-105.
7. Strehl A, Ghosh J. Cluster ensembles – A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research. 2003 Jan; 3:583-617.
8. He Ying, Wang Jian, Liang-Xi Qin, Mei Lin. A H-K Clustering-algorithm for high dimensional data using ensemble learning. IET International Conference on Smart and Sustainable City 2013 (ICSSC 2013). 2013 Aug; p. 300-305.
9. Jiawei Han, Micheline Kamber. Morgan Kaufmann Publishers: Data Mining Concepts and Techniques, 3rd(Edn). 2011 Jul.
10. Sim K, Gopalkrishnan V, Zimek A, Cong G. A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery. 2013 Mar; 26(2):332-97.
11. Moise G, Zimek A, Kröger P, Kriegel HP, Sander J. Subspace and Projected Clustering: Experimental Evaluation and Analysis. Knowledge and Information Systems. 2009 Dec; 21:299-326.