International Conference on Intelligent Computational Systems (ICICS'2012) Jan. 7-8, 2012 Dubai

Finding and Visualizing Subspace Clusters of High Dimensional Dataset Using Advanced Star Coordinates

Rajashri Kulkarni, A.J. Patankar, Sunita Jahirabadkar

Rajashri Kulkarni is an M.E. student in the Computer Engineering Department, D.Y. Patil College of Engineering, Akurdi, Pune University, Pune (India) (e-mail: [email protected]).
A.J. Patankar is an Asst. Professor in the Computer Engineering Department, D.Y. Patil College of Engineering, Akurdi, Pune University, Pune (India) (e-mail: [email protected]).
Sunita Jahirabadkar is an Asst. Professor in the Computer Engineering Department of Cummins College of Engineering, Pune University, Pune (India) (e-mail: [email protected]).

Abstract: Analysis of high dimensional data has been a research area for many years. Analysts can detect similarity of data points within a cluster. Subspace clustering detects the dimensions that are useful for clustering a high dimensional dataset. Visualization gives better insight into subspace clusters. However, displaying clusters of such a high dimensional database on a 2-dimensional display is a challenging task. We propose an ISC-ASC approach which first identifies subspace clusters in a high dimensional dataset and then displays these clusters on a 2-dimensional display device. Algorithm ISC detects the subspace clusters using a density notion of clustering. Algorithm ASC visualizes these subspace clusters. In ISC-ASC, instead of considering all the dimensions as ASC does, only the dimensions which take part in subspace clustering are considered to find the projection points. ISC-ASC helps users to identify subspace clusters. Visualizing these subspace clusters using ASC supports efficient knowledge discovery, which helps in deciding about the quality of the subspace clusters.

Keywords: Subspace clustering, high dimensional data subspace clustering, visualization

I. INTRODUCTION

As a data mining function, clustering is the process of grouping physical or abstract objects into classes of similar objects. Analyzing these clusters helps in understanding the distribution of data, identifying the characteristics of the clusters and focusing on a particular set of clusters for further analysis. Visualization helps the user by representing information visually. Subspace clustering identifies the subsets of attributes relevant for clustering. Visualization of such high dimensional clusters is an important subfield of scientific visualization. It allows the user to explore data in different ways at different levels of abstraction to find the right level of detail [5]. By visualizing these subspace clusters, the user can find how well defined the clusters are, which dimensions are relevant and what further experiments to conduct [2].

Visualization of such subspace clusters in an environment where large data is available for analysis can help in various application areas such as Web text mining, DNA analysis and financial analysis. For example, DNA microarray analysis is a new field in biomedicine. Microarrays consist of thousands of genes under different conditions. If there are 50 cancer profiles with 1000 features, the user cannot analyze all the cancer subtypes on a genetic level. Also, a particular cancer is divided into more than one set of characteristics. Identifying the specimens requires one set of genes, while identifying the subtype based on cell division would require a different set of genes. Subspace clustering visualization is very helpful for analyzing such complex cellular mechanisms. It extends the power of traditional clustering to understand the meaningful subspaces and subspace clusters [3].

In this paper, we propose the new ISC-ASC approach, in which the visualization of ISC (Intelligent Subspace Clustering) [8] is done through ASC (Advanced Star Coordinates) [10], which helps users to detect and analyze the clusters at different dimensionality levels. In ISC, the Rank algorithm gives the list of dimensions in descending order of interestingness. Starting with two dimensions, ISC detects the clusters in these dimensions. For this, DBSCAN [4], a robust density based clustering algorithm, is used. These clusters are visualized using Advanced Star Coordinates. ASC is based on star coordinates [1], a traditional data visualization technique. In the star coordinates approach, a radius is used to represent a dimension axis, whereas ASC uses a diameter to represent the dimension axis. It projects the high dimensional data items on the dimension axes. The projection point is the advanced star coordinate for the data point, set up using Cartesian coordinates. To improve the efficiency of the algorithm, the projection point can be represented with polar coordinates. Every high dimensional data object found in the subspace clusters using ISC is displayed on the screen using ASC. The ISC-ASC approach helps to identify clusters of different size, shape and density, and visualizing them in two dimensions helps the user in in-depth analysis. It is beneficial for deciding what further experiments should be carried out to improve the quality of the clusters.

II. RELATED WORK

In this paper, the major challenge described is to combine the ISC (Intelligent Subspace Clustering) algorithm and the ASC (Advanced Star Coordinates) algorithm for high dimensional databases. Hence the work has to be reviewed at two levels: subspace clustering approaches and visualization techniques.

A. Subspace Clustering approaches

Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a high dimensional database. When clustering a high dimensional dataset, the data points become sparser. A cluster is a dense region of data points. We can recognize the clusters because each cluster has a typical density of points. There are different subspace clustering algorithms. The first well known algorithm is CLIQUE [3], which combines density and grid based clustering and uses an APRIORI style search technique to find dense subspaces. ENCLUS [3] is based on entropy computation of a discrete random variable. MAFIA [3] uses adaptive, variable-sized grids in each dimension. All these are grid based approaches, which depend on the positioning of the grids.

There are different density based approaches to find the clusters, which give better results. The first algorithm, DOC [3], is a hybrid method which combines grid based and iterative improvement methods from top down approaches. It is based on the mathematical notion of an 'optimal projective cluster'. SUBCLU [3] can detect arbitrarily shaped and positioned clusters in subspaces. Another efficient algorithm for subspace clustering of high dimensional datasets is ISC (Intelligent Subspace Clustering) [8]. ISC implements dynamic and adaptive determination of meaningful clustering parameters using a hierarchical filtering approach. ISC detects the subspace clusters at intermediate levels and allows the parameters to be modified adaptively. It is based on the density notion of clustering, which helps to identify clusters of different shapes and sizes.

B. Visualization Techniques

For visual data exploration, a number of visualization techniques are available. Geometric projection techniques are used to find informative projections. The methods in this category help to find correlations among dimensions, detect outliers and work with high dimensional datasets [9]. The first well-known technique is parallel coordinates [6], where attributes are represented by parallel vertical axes linearly scaled within their data range. Circular parallel coordinates visualization [6] is similar to parallel coordinates, except that n lines emanate radially from the centre of the circle and terminate at the perimeter. However, when a dataset with a large number of data items is visualized using circular parallel coordinates, the number of polygon lines increases, which has a serious impact on the arrangement of dimensions.

In RadViz [6], which is based on Hooke's law, n lines emanate from the centre of the circle and terminate on its perimeter. Each dimension axis is associated with an attraction factor as in an imaginary spring system, where there are special end points. The equally spaced points are called dimensional anchors. One end of each spring is attached to a dimensional anchor. The other end of the spring is attached to the data point. Each line is associated with one attribute value. RadViz gives very good time complexity. In RadViz, similar records in the n-dimensional space are projected close together in the 2D space, favoring identification of clusters. However, very different records may also be projected close together.

Another popular visualization approach is Star Coordinates [1]. It projects the high dimensional information object in two dimensions on the screen. In this approach, each dimension is represented as a vector radiating from the centre of a unit circle in a two-dimensional plane. However, in this approach, the angle between axes is equal and all axes have the same length. Data values are scaled to the axis length, different records may be projected close together, and arranging the dimensions is more complicated.

Advanced Star Coordinates (ASC) is an extension of Star Coordinates [10]. ASC uses a diameter to represent each dimension axis. Its dimension configuration strategy optimizes the order and angle of the dimension axes, which avoids manual arrangement of dimensions on the 2-dimensional display device. The visualization results of ASC are easily understandable and express dimension distribution information effectively, which helps the user to view high dimensional data and to discover implicit information in the knowledge discovery process.

III. ISC-ASC APPROACH ALGORITHM

The ISC-ASC algorithm consists of the following two major algorithms.

A. ISC (Intelligent Subspace Clustering)

The ISC (Intelligent Subspace Clustering) algorithm [8] is based on the density notion of hierarchical subspace clustering. The concept of hierarchy is used in ISC at the dimension level. ISC finds low dimensional subspace clusters and then tries to extend them to form higher dimensional meaningful clusters. Objects are assigned to subspace clusters using the density notion of clustering. As the number of possible subspaces is exponential in the number of dimensions, this is a challenging task with respect both to the runtime of the algorithm and to the typically enormous number of output clusters. To cope with this, ISC computes the relevance of the dimensions and ranks them according to their interestingness. ISC takes the highest ranked dimension, builds 1-d clusters and then continues in descending order of ranking. As the dimensionality increases, the subspace clusters become sparser, so a global density threshold for finding dense areas does not give meaningful results. To eliminate this problem, cluster quality is checked at each dimension level. The density threshold (µ) is changed adaptively: the algorithm is repeated with new values of the density threshold at that dimensionality level until clusters of the required quality are found.
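The adaptive loop described above can be sketched as follows. This is a minimal illustration, not the authors' Matlab implementation: the DBSCAN routine is a plain textbook version, and the acceptance test (at least one cluster and a noise fraction no larger than `max_noise`) is a hypothetical stand-in for the cluster quality check, which the paper does not specify. The names `dbscan`, `cluster_with_adaptive_threshold` and `max_noise` are assumptions introduced for this sketch, and a non-empty list of points is assumed.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 marking noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point, keep growing
    return labels


def cluster_with_adaptive_threshold(points, eps, thresholds, max_noise=0.2):
    """Try density thresholds in order and stop at the first acceptable result.

    The acceptance rule is an assumption made for this sketch, not ISC's
    actual quality measure.
    """
    labels = [-1] * len(points)
    for mu in thresholds:
        labels = dbscan(points, eps, mu)
        noise_fraction = labels.count(-1) / len(labels)
        n_clusters = len(set(labels) - {-1})
        if n_clusters >= 1 and noise_fraction <= max_noise:
            return mu, labels
    return None, labels  # no threshold met the criterion


# Toy data: two tight groups of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
mu, labels = cluster_with_adaptive_threshold(points, eps=1.5, thresholds=[6, 5, 4])
print(mu, labels)
```

With the toy data, thresholds 6 and 5 label every point as noise, so the loop settles on a density threshold of 4, which yields two clusters and one outlier.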
ISC has the following important improvements over state of the art subspace clustering approaches:
• Use of the density approach for clustering the data set, even at each dimension level, allows building / finding clusters of any size, shape and density.
• Detects subspace clusters at different dimensionality levels.
• Builds a hierarchy of nested subspace clusters.
• Users can interact for parameter settings at various dimension levels to find meaningful clusters.
To find clusters that are hidden in different subspaces, parameters such as the density threshold have to be set depending upon the number of dimensions considered for clustering. Additionally, the intermediate level clustering results are helpful for changing parameters during in-depth analysis and interaction.

B. ASC (Advanced Star Coordinates)

When displaying a multidimensional object on a 2-dimensional display device, it can be shown simply as a point in multidimensional coordinates, as in Fig. 1. Let A(F) be a k-dimensional dataset, where A(F) = {F1, F2, ..., Fm} and m is the total number of records in the dataset. Each record Fi represents a multidimensional information object. ASC represents the multidimensional data item as a point in Cartesian coordinates. The projection lines from this point are perpendicular to every dimension axis, and the coordinates where a projection line intersects a dimension axis are called visual coordinates. In ASC, all the dimensions are taken into consideration while mapping the actual values to the visual coordinates. The projection lines from the visual coordinates converge to a point, which represents the high dimensional data item in two-dimensional coordinates on the screen.

Advanced Star Coordinates uses the diameter instead of the radius to define the dimension axis. For this, ASC coordinates are set up in the Cartesian coordinate system. ASC finds the visual coordinates on every dimension axis. The projection lines from the visual coordinates converge to a point, which is the advanced star coordinate. The object function is constructed for each record in the dataset. In this algorithm, the pattern search method is used to solve the object function. However, when the polar coordinates are set up in Cartesian coordinates, the high dimensional data point is represented as shown in Fig. 1. To solve the object function in polar coordinates, the Quasi-Newton method is used, where differentials are calculated. So representing the point in polar coordinates increases the efficiency of the algorithm.

Fig. 1 Representation of a data point using polar coordinates

ASC arranges the dimensions automatically to avoid manual arrangement. The dimension configuration of ASC is based on finding the correlation coefficient among all the dimensions. Dimensions showing high correlations are positioned next to each other. The dimensions are represented as diameters of the circle. The direction of the arrow indicates the positive direction of the dimension. The angle between two dimensions is measured as the angle between their positive directions. ASC computes the correlation matrix and drops dimensions which are highly correlated. It finds the largest value of the correlation matrix and makes a circular arrangement of the dimensions. The angle between neighbouring dimensions is calculated, and according to these angles ASC makes the circular arrangement of dimensions.

C. ISC-ASC algorithm

The algorithm starts with two dimensions and iterates up to k dimensions. ISC-ASC starts with the Rank algorithm, which gives the list of dimensions in descending order of interestingness. The dimension configuration strategy of ASC gives the arrangement of dimensions. Advanced star coordinates are set up in polar coordinates. In ASC, all the dimensions are taken into consideration while finding the projection point. In the ISC-ASC approach, however, we consider only those dimensions which have taken part in subspace clustering. From the Rank algorithm, the two dimensions with the highest interestingness are taken first.
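The restriction of the projection to the clustering subspace can be illustrated with a short sketch. It is not the authors' implementation and rests on stated assumptions: axis i is a diameter through the origin at angle theta_i, attribute values are min-max scaled to [-1, 1] along that diameter, and the object function is taken to be the sum of squared distances from (x, y) to the projection lines perpendicular to the axes. Under these assumptions the minimiser has a closed form via 2x2 normal equations, which is used here in place of the pattern search and Quasi-Newton solvers named in the text. The names `asc_project`, `project_record`, `angles` and `ranges` are hypothetical.

```python
import math


def asc_project(values, angles):
    """Least-squares 2-D point for scaled axis values and axis angles (radians).

    Each axis contributes the line {p : p . u_i = v_i} with u_i = (cos, sin) of
    its angle; the returned point minimises the sum of squared distances to
    these lines (an assumed form of the object function).
    """
    a = b = c = rx = ry = 0.0
    for v, theta in zip(values, angles):
        ux, uy = math.cos(theta), math.sin(theta)
        a += ux * ux
        b += ux * uy
        c += uy * uy
        rx += ux * v
        ry += uy * v
    det = a * c - b * b
    if abs(det) < 1e-12:
        raise ValueError("need at least two non-parallel dimension axes")
    # Cramer's rule on [[a, b], [b, c]] [x, y]^T = [rx, ry]^T
    return ((c * rx - b * ry) / det, (a * ry - b * rx) / det)


def project_record(record, subspace_dims, angles, ranges):
    """Project one record using only the dimensions of its subspace cluster.

    ranges[d] = (lo, hi) with hi > lo is assumed; values are scaled to [-1, 1].
    """
    values = []
    for d in subspace_dims:
        lo, hi = ranges[d]
        values.append(2.0 * (record[d] - lo) / (hi - lo) - 1.0)
    return asc_project(values, [angles[d] for d in subspace_dims])


# Hypothetical record with three attributes; only "a" and "b" form the subspace.
record = {"a": 7.5, "b": 1.0, "c": 99.0}
ranges = {"a": (0.0, 10.0), "b": (0.0, 4.0), "c": (0.0, 100.0)}
angles = {"a": 0.0, "b": math.pi / 2, "c": math.pi / 4}
print(project_record(record, ["a", "b"], angles, ranges))
```

Leaving dimension "c" out of `subspace_dims` means its value has no influence on the plotted position, which is the behaviour the ISC-ASC approach describes for dimensions outside the subspace.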
Starting with two dimensions, the projection point is calculated, and the calculation continues up to k dimensions. Different combinations of dimensions are also tried. The user can decide which dimensions give meaningful clusters. Starting with two dimensions, the two dimension axes are drawn. At each dimensionality level, ISC selects the density threshold, the ε-distance is calculated, and DBSCAN is applied with these parameters. The clusters found using ISC are visualized on the screen using ASC coordinates. The user can change the density threshold, which is helpful for in-depth analysis.

D. Algorithm ISC-ASC

1. Apply the Rank algorithm to select the most interesting dimensions.
2. Apply DBSCAN, with the density threshold as input parameter, to the two dimensional dataset.
3. Set up advanced star coordinates in polar coordinates. Starting with two dimensions, draw the two dimension axes.
4. Find the unit vector of every dimension axis as
   [...]
   where k = total number of dimensions.
5. Compute the equation of the line which passes through the dimension starting point and is perpendicular to the dimension axis.
6. Construct the object function as
   min f(x, y) = [...]   (1)
   with [...], r = [...] and y = [...]
7. Find the point ( , ) on every dimension axis according to [...], and solve the object function (1) using the Quasi-Newton method.
8. The user can change the density threshold and see the quality of the clusters.
9. Repeat steps 2 to 6 so that the algorithm iterates up to k dimensions.

The same algorithm can be repeated for various density thresholds to get the clearest and most visible clusters on the screen.

IV. EXPERIMENTAL EVALUATION

We implemented the ISC-ASC algorithm in Matlab. All the experiments were run on a Microsoft Windows XP platform with a 2.0 GHz CPU and at least 2.0 GB RAM. We evaluated this algorithm using several synthetic datasets. The well known Iris flower dataset with four dimensions was used to test the clustering results. We started with two dimensions. Clusters with two dimensions are displayed on the screen as shown in Fig. 3. The density threshold considered is three, and clusters in blue, green and violet are displayed on the screen. Outliers are shown in red. At first only two dimension axes are drawn. There are different combinations of the dimensions. The user can adaptively change the density threshold so that the quality of the clusters can be observed. The user can decide which dimensions are most useful in clustering. We continued the algorithm further with three dimensions and later with four dimensions. Clustering results with the same density threshold of three are shown in Fig. 4.

Fig. 3 Visualization of clusters with two dimensions of Iris dataset

Fig. 4 Visualization of clusters with three dimensions of Iris dataset

V. CONCLUSION

Subspace clustering visualization has tremendous applications in science, engineering and business decision making. In this paper, we first motivated the need for the visualization of subspace clusters. We then proposed the ISC-ASC approach, which visualizes the subspace clusters formed using ISC by ASC coordinates. Visualizing the clusters on the screen using the ISC-ASC algorithm leads to better cluster formation using high dimensional data. The experimental evaluation showed that the ISC-ASC approach helps to identify clusters with different thresholds. It will benefit large application domains such as web information systems, in which a huge amount of data is available. It can be used in DNA microarray analysis, where the analyst has to deal with a huge number of genes. The ISC-ASC algorithms help to analyze and understand complex cellular mechanisms in DNA microarrays. Currently the approach has been tested with four dimensions; in future work we will try to extend it up to 10 dimensions.

REFERENCES

[1] E. Kandogan, "Star Coordinates: A high-dimensional visualization technique with uniform treatment of dimensions," in Proc. of the IEEE Information Visualization Symposium, 2000, pp. 4-8.
[2] Ian Davidson, "Visualizing clustering results," in Proceedings of the Second SIAM International Conference on Data Mining, Arlington, VA, USA, April 11-13, 2002. SIAM, 2002, ISBN 0-89871-517-2.
[3] Lance Parsons, Ehtesham Haque, Huan Liu, "Subspace clustering for high dimensional data: A review," Department of Computer Science Engineering, Arizona State University, Tempe, SIGKDD Explorations, 2004, Volume 6, Issue 1.
[4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, 1996.
[5] P.C. Wong and R.D. Bergeron, "30 Years of multidimensional multivariate visualization," in Scientific Visualization: Overviews, Methodologies, and Techniques, Washington, IEEE Computer Society, 1997.
[6] P.E. Hoffman, "Table Visualizations: A Formal Model and its Applications," Doctoral Diss., Computer Science Department, University of Massachusetts Lowell, 1999.
[7] M. Steinbach, L. Ertoz, and V. Kumar, "Challenges of clustering high dimensional data," in L. T. Wille (Ed.), New Vistas in Statistical Physics – Applications in Econophysics, Bioinformatics, and Pattern Recognition. Springer-Verlag.
[8] Sunita Jahirabadkar, Parag Kulkarni, "ISC – Intelligent Subspace Clustering, A Density based Clustering approach for High Dimensional Dataset," in World Congress on Science, Engineering & Technology (WCSET – 09), July 29-31, 2009, Oslo, Norway, pp. 69-73.
[9] Winnie Wing-Yi Chan, "A survey on Multivariate Data visualization," June 2006.
[10] Yang Sun, Jiuyang Tang, Daquan Tang, Weidong Xiao, "Advanced star coordinate," in Proceedings of the 2008 Ninth International Conference on Web-Age Information Management (WAIM '08), IEEE Computer Society, Washington, DC, USA, 2008.