Imperial Journal of Interdisciplinary Research (IJIR), Vol. 2, Issue 12, 2016, ISSN: 2454-1362, http://www.onlinejournal.in

A Mutual Subspace Clustering Algorithm for High Dimensional Datasets

K. Venkata Narayana(1) & Dr. A. Mary Sowjanya(2)
(1) M.Tech., (2) Assistant Professor, Dept. of Computer Science and Systems Engineering, A.U. College of Engineering, Andhra University, Visakhapatnam, Andhra Pradesh, India

Abstract: Generating consistent clusters is a long-standing research issue in the field of knowledge and data engineering. In real applications, different similarity measures and different clustering techniques may be adopted in different clustering spaces. In such a case, it is very difficult, or even impossible, to define an appropriate similarity measure and clustering criteria in the union space. Mutual subspace clustering over multiple clustering spaces is critically different from subspace clustering in one (union) clustering space: it finds the common clusters agreed upon by subspace clustering in both clustering spaces, which traditional subspace clustering analysis cannot handle. The partitioning model divides the points in a data set into k exclusive clusters, where k is the number of clusters desired by the user, and a signature subspace is found for each cluster. This model improves on k-means by eliminating random centroid selection, using the average pairwise distance and other parameters to generate consistent clusters. Experimental results on a cancer data set demonstrate the efficiency of mutual subspace clustering.

Key words: Subspace clustering, high dimensional, average pairwise distance, k-means, mutual subspace, signature subspace

1. Introduction

With the evolution of technology, both the amount of data and its dimensionality are increasing tremendously. Data mining techniques have to be applied to this huge amount of available data in order to obtain the desired results, but traditional algorithms cannot scale to high-dimensional data, as they lead to inconsistent results. In recent years, research on clustering has moved beyond improving single clustering algorithms. In 2005, Tung-Shou Chen et al. [1] proposed the H-K (Hierarchical K-means) clustering algorithm, which combines hierarchical clustering and partition clustering. Compared with a single algorithm, H-K clustering can solve the problem of the random, a priori selection of initial centers in the k-means clustering process and obtain better clustering results, but it still incurs high computational complexity. As H-K clustering has become more widely used in practice, some problems have been highlighted; in particular, when used for clustering high-dimensional data it fails to avoid the curse of dimensionality, sometimes even producing invalid clustering results. To address this problem, this paper adopts ensemble learning to improve H-K clustering and obtain better clustering results. Ensemble learning is an approach that trains a variety of learners to solve the same problem. In recent years, ensemble learning has been introduced into clustering analysis, where it is known as ensemble clustering.
Ensemble clustering uses a fusion method to obtain an ensemble clustering result from a given set of input clustering results. To develop effective therapies for cancers, both clinical data and genomic data have been accumulated for cancer patients. Exploring clinical data or genomic data independently may not disclose the inherent patterns and correlations present in both datasets; it is therefore important to combine clinical and genomic data and to mine knowledge from both sources. Clustering is a powerful tool for revealing underlying patterns without requiring prior knowledge about the data, and subspace clustering has been widely used on such data to discover cancer phenotypes. For a cluster that is mutual in a clinical subspace and a genomic subspace, the genomic attributes can be used to verify and justify the clinical attributes. Mutual clusters are more understandable and more robust, and mutual subspace clustering is also helpful in combining multiple data sources.

2. Methodology

2.1 Finding the signature subspace

Finding the signature subspace of a particular cluster is a critical issue. Given a set of points C forming a cluster in a space S, we want to find a subspace U ⊆ S that manifests the similarity between the points in C. Assume again that the attributes are normalized. Ideally, if a suitable similarity measure simU(x, y) were available to compute the similarity between points x and y in a subspace U, we could measure ∑x,y∈C simU(x, y) for every non-empty subspace U ⊆ S and take the subspace maximizing the sum of similarities as the signature subspace. However, such a method is often impractical for two reasons. First, defining an ideal similarity measure is very difficult: many similarity and distance measures are biased towards low-dimensional subspaces, and similarities in different subspaces often cannot be compared directly. Second, if S has m dimensions, 2^m − 1 subspaces have to be checked. When the dimensionality is high, enumerating all subspaces and computing the sums of similarities in them is often far too costly.

If U is the signature subspace of C, manifesting the similarity among the points in C, then the points must be similar on every attribute in U and largely dissimilar on the attributes not in U. The average pairwise distance (APD) between the points in C on an attribute D can be used to measure the compactness of the cluster:

APD(C, D) = [∑x,y∈C distD(x, y)] / (|C|·(|C| − 1)/2) = 2·[∑x,y∈C distD(x, y)] / (|C|·(|C| − 1))

The APD thus measures how well the points in C are clustered on an attribute: the attributes in the signature subspace of C should have a small APD, while the attributes not in the signature subspace should have a large APD. We sort all attributes in ascending order of APD. The first attribute is the one that best manifests the similarity; the problem is then how to select the other attributes that, together with the first one, manifest the similarity of the cluster. The attributes fall into two sets, those in the signature subspace and those not in it, and the attributes in the signature subspace should have similar APDs.
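To make the APD computation concrete, the following is a minimal Python sketch (not code from the paper); the function name, the dict-based point representation, and the per-attribute absolute distance |xD − yD| are illustrative assumptions.

```python
from itertools import combinations

def apd(cluster, attribute):
    """Average pairwise distance of a cluster on one attribute.

    `cluster` is a list of points, each a dict mapping attribute
    names to normalized numeric values. Implements
    APD(C, D) = 2 * sum_{x,y in C} |x_D - y_D| / (|C| * (|C| - 1)).
    """
    n = len(cluster)
    if n < 2:
        return 0.0
    total = sum(abs(x[attribute] - y[attribute])
                for x, y in combinations(cluster, 2))
    return 2.0 * total / (n * (n - 1))

# Example: three points measured on two (hypothetical) attributes.
cluster = [{"gene1": 0.10, "age": 0.90},
           {"gene1": 0.12, "age": 0.30},
           {"gene1": 0.11, "age": 0.60}]
print(apd(cluster, "gene1"))  # ~0.013: points agree on gene1
print(apd(cluster, "age"))    # 0.4: points scatter on age
```

Attributes with a small APD, such as gene1 here, are candidates for the signature subspace, while attributes with a large APD, such as age, are not.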
Chebyshev's inequality [2] can be applied to select the attributes in the signature subspace with confidence. Let D1, …, Dn be the attributes in APD ascending order, and suppose D1, …, Di (1 ≤ i ≤ n) form the signature subspace. The expectation of the APD is

E(APD) = (1/i) · ∑j=1..i APD(C, Dj)

Let σ be the standard deviation of APD(C, D1), …, APD(C, Di). Then, for every attribute Dj (1 ≤ j ≤ i), we require

|APD(C, Dj) − E(APD)| ≤ t·σ

where t is a small integer; 1 − 1/t² is the confidence level of the selection. Algorithmically, we initialize the signature subspace U with D1, the first attribute in the sorted list, and then add the attributes one by one in APD ascending order. After each addition, we check whether the confidence level is maintained; as soon as it is violated, the attribute just added is removed and the selection procedure terminates.

Algorithm: Signature subspace selection
Input: a set of points O = {o1, o2, …, on}, a cluster C ⊆ O, and a threshold t
1. Calculate the average pairwise distance APD(C, D) for each attribute D.
2. Sort the attributes in ascending order of APD; let D1, D2, …, Dn be the sorted list.
3. Compute E(APD) and the standard deviation σ of the APDs.
4. Let U = {}
   For i = 1 to n do
       If |APD(C, Di) − E(APD)| ≤ t·σ then
           U = U ∪ {Di}
       Else terminate the loop
   End for
5. Return U
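The incremental selection can be sketched in Python as follows. This builds on the apd() helper from the previous sketch and checks every attribute currently in U against the Chebyshev bound after each addition; the function signature and the default t = 3 (confidence level 1 − 1/9 ≈ 0.89) are illustrative assumptions, not the paper's code.

```python
import statistics

def signature_subspace(cluster, attributes, t=3):
    """Grow the signature subspace U in APD ascending order.

    Sketch of the Section 2.1 procedure: start from the attribute
    with the smallest APD, add attributes one by one, and stop as
    soon as any attribute in the current subspace violates
    |APD(C, Dj) - E(APD)| <= t * sigma. Reuses apd() from above.
    """
    ranked = sorted(attributes, key=lambda a: apd(cluster, a))
    subspace = [ranked[0]]  # D1, the most compact attribute
    for attr in ranked[1:]:
        candidate = subspace + [attr]
        apds = [apd(cluster, a) for a in candidate]
        mean = statistics.mean(apds)
        sigma = statistics.pstdev(apds)
        # Stop the first time the confidence level is violated.
        if max(abs(v - mean) for v in apds) > t * sigma:
            break
        subspace = candidate
    return subspace
```

Note that with only two attributes in the candidate set the bound can never be violated (each deviation equals σ exactly), so the test only becomes selective as U grows.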
2.2 Algorithm for generation of clusters

In this paper we use the k-means algorithm for full-space clustering and PROCLUS [3] for subspace clustering, adopting an iterative greedy search. The central idea of our top-down mutual subspace clustering is to interleave the iterative k-means clustering procedures in the two clustering spaces. The process starts with k arbitrary points c1, …, ck in clustering space S1 as the temporary centers of clusters C1, …, Ck respectively; the k centers need not belong to O. The points in O are assigned to the clusters according to their distances to the centers in space S1: a point o ∈ O is assigned to the cluster whose center is closest to o. This is the first step of the k-means clustering procedure.

To find mutual subspace clusters, the information in clustering space S2 is then used to refine the clusters. For each cluster Ci, we find a subspace Vi ⊆ S2 as the signature subspace of Ci in S2 and calculate the center of Ci in Vi. To improve the cluster assignment, for each point o ∈ O we check distVi(o, ci) for 1 ≤ i ≤ k and assign o to the cluster of the closest center in its signature subspace. This forms the refined clustering. The refined clustering is then fed back into a refinement using the information in S1: the signature subspaces and centers of the clusters in S1 are computed, and the cluster assignment is adjusted. As the iteration progresses, the information in both clustering spaces is used to form the mutual subspace clusters. A cluster becomes stable when the signature subspaces in the two clustering spaces agree on the cluster assignment, that is, when the centers in both clustering spaces attract approximately the same set of points to the cluster. For some clusters, however, such as temperature clusters, the signature subspaces do not agree with each other, because the cluster members, and hence the centers, change continuously with the continuous change in temperature.

When should the iteration terminate? If mutual subspace clusters exist, the above iterative refinement greedily approaches them, since the alternating refinement in S1 and S2 iteratively reduces the variance of the clusters in both spaces. Exact convergence may require a large number of iterations, so in practice the mis-assignment rate is defined as the fraction of points in O that are assigned to a different cluster in a single iteration. The clustering is considered stable when the signature subspaces of the clusters become stable and the mis-assignment rate is low. In each round, the signature subspaces in both clustering spaces S1 and S2 are refined; the iterative refinement stops when the signature subspaces in the clustering spaces no longer change.

Some points may not belong to any mutual cluster. If the two clustering spaces do not agree with each other on those points, the iterative refinement may fall into an infinite loop. To detect such a loop, the cluster assignments in the two clustering spaces are compared over two consecutive rounds of refinement. Consider a mis-assigned point, one that is assigned to different clusters in the two clustering spaces: if it is repeatedly assigned to the same cluster within each clustering space and the cluster centers are stable, then the point does not belong to a mutual cluster and should be removed. Such a point is called a conflict point. After the conflict points are removed, the centers and the cluster assignment become stable, and the mutual subspace clusters can be derived.

Algorithm: Mutual subspace clustering
Input: a set of points O in clustering spaces S1 and S2, and the number of clusters k specified by the user
Select k centers c1, …, ck randomly in S1, and assign each point to its closest center.
Repeat
    For each cluster Ci do
        Find its signature subspace and center in S2.
    End do
    Assign each point in O to the cluster of its closest center in its signature subspace in S2.
    For each cluster Ci do
        Find its signature subspace and center in S1.
    End do
    Assign each point in O to the cluster of its closest center in its signature subspace in S1.
    If the clustering is stable, then remove the conflict points.
Until the clustering is stable
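A compact Python sketch of this interleaved refinement loop follows. It is not the paper's implementation: for brevity, the signature subspace of each cluster is approximated by the half of the attributes with the smallest within-cluster variance (standing in for the APD/Chebyshev selection of Section 2.1), and conflict points are flagged by a single-round disagreement test rather than the repeated-disagreement test described above. All names and defaults are illustrative.

```python
import numpy as np

def mutual_subspace_clustering(X1, X2, k, max_iter=50, tol=0.01):
    """Interleave subspace-aware reassignment between spaces S1 and S2.

    X1, X2: the same n points represented in the two clustering
    spaces (rows are points, columns are normalized attributes).
    Returns final labels and a boolean mask of conflict points.
    """
    n = X1.shape[0]
    rng = np.random.default_rng(0)
    # Initial step: k random centers in S1, full-space assignment.
    centers = X1[rng.choice(n, size=k, replace=False)]
    labels = np.argmin(
        np.linalg.norm(X1[:, None, :] - centers[None, :, :], axis=2), axis=1)

    def refine(X, labels):
        """Reassign points to the closest center in each cluster's
        signature subspace (low-variance attributes) of space X."""
        dists = np.full((n, k), np.inf)
        for i in range(k):
            members = X[labels == i]
            if len(members) < 2:
                continue  # skip empty/degenerate clusters
            center = members.mean(axis=0)
            sig = np.argsort(members.var(axis=0))[: max(1, X.shape[1] // 2)]
            dists[:, i] = np.linalg.norm(X[:, sig] - center[sig], axis=1)
        return np.argmin(dists, axis=1)

    labels_s1 = labels_s2 = labels
    for _ in range(max_iter):
        labels_s2 = refine(X2, labels)      # refine using S2
        labels_s1 = refine(X1, labels_s2)   # refine using S1
        misassigned = np.mean(labels_s1 != labels)
        labels = labels_s1
        if misassigned <= tol:              # mis-assignment rate is low
            break
    # Simplified proxy for conflict points: the spaces still disagree.
    conflicts = labels_s1 != labels_s2
    return labels, conflicts
```

Points flagged in `conflicts` correspond to candidates for removal; in the full procedure they would only be removed after repeated disagreement across consecutive rounds with stable centers.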
3. Experiments and Results

The proposed method consists of the following modules:

Data selection: The user must select a valid data set containing numeric data.

Divide into subspaces: The given input is divided into two clustering spaces.

Basic clustering: Based on the user-specified number of clusters, that many centers are selected randomly and the points are clustered.

Compute the signature subspaces: After the clusters are generated, compute the signature subspace of each cluster using the average pairwise distance (APD).

Compute new clusters: From the above signature subspaces, find the new centroids and form the clusters based on the generated centroids.

The last two steps are repeated until stable clusters are formed or the user-specified number of iterations is reached.

4. Conclusions and Future Work

This project work deals with efficient subspace clustering, which finds sets of objects that are homogeneous in subspaces of high-dimensional datasets, and with signature subspaces, combinations of attributes from each data source that identify the clusters most prominently; the APD (average pairwise distance) between points in a cluster is used to measure the compactness of the cluster. This approach gives more consistent clusters than traditional approaches. The current work can be improved by resolving the zero cluster size problem: if a centroid is not the closest center for any data item in the dataset, the corresponding cluster becomes empty, since a newly computed centroid may not be present in the dataset.

5. References

[1] T.-S. Chen, T.-H. Tsai, Y.-T. Chen, C.-C. Lin and R.-C. Chen, "A combined k-means and hierarchical clustering method", Proceedings of the 2005 International Symposium on Intelligent Signal Processing and Communication Systems, pp. 405-408, 2005.
[2] M. Abramowitz and I. A. Stegun (Eds.), Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing, New York: Dover, p. 11, 1972.
[3] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc and J. S. Park, "Fast algorithms for projected clustering", Proceedings of the 1999 ACM-SIGMOD International Conference on Management of Data (SIGMOD'99), Philadelphia, PA, pp. 61-72, June 1999.
[4] M. Ester, R. Ge, B. J. Gao, Z. Hu and B. Ben-Moshe, "Joint cluster analysis of attribute data and relationship data: the connected k-center problem", Proceedings of the SIAM International Conference on Data Mining (SDM), 2006.
[5] R. Agarwal, J. Gehrke, D. Gunopulos and P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications", Proceedings of the ACM-SIGMOD Conference, pp. 94-105, 1998.
[6] Behera Gayathri and A. Mary Sowjanya, "Dimensionality Reduction Using CLIQUE and Genetic Algorithm", IJCST, Vol. 6, Issue 3, July-Sept 2015.