Mining Projected Clusters in High Dimensional Spaces
Mohamed Bouguessa

Data Mining
• Data mining is the process of extracting potentially useful information from a dataset.
• Clustering is one of the central data mining problems.

Clustering?
• Clustering techniques aim to partition a given set of data or objects into groups such that elements drawn from the same group are highly similar, while those assigned to different groups are dissimilar.

Similarity measure
• Clustering algorithms employ a distance metric, typically the Euclidean distance, in order to partition the database.

High dimensional data
• Problem with high dimensional data: similarity functions require similar objects to have close values in all dimensions.
• The concept of similarity between objects in the full-dimensional space is often invalid.

Clustering in high dimensional space
• Illustration:

  Person   Age   Virus Level   Blood Type   Disease
  1        35    0.95          AB           Uninfected
  2        64    0.9           AB           Uninfected
  3        27    1.0           AB           Uninfected
  4        18    9.8           O            Infected
  5        42    8.6           AB           Infected
  6        53    11.3          B            Infected
  7        37    0.75          O            Recovered
  8        28    0.8           A            Recovered
  9        65    0.89          B            Recovered

• Problem with high dimensional data:
  - Presence of irrelevant dimensions (attributes/features).
  - Clusters may exist in different subspaces (not in the full-dimensional space).
• Illustration: a data set with 4 clusters.
  - Cluster 1 and cluster 2 exist in dimensions a & b.
  - Cluster 3 and cluster 4 exist in dimensions b & c.
  [Figure: sample data plotted in one dimension and in two dimensions.]

Projected clustering
• Definition: a projected cluster is a subset S of data points, together with a subspace of dimensions D, such that the points in S are closely clustered in D.
• Example: a data matrix with attributes A1, ..., A10 and points x1, ..., xN grouped into Cluster 1 to Cluster 4;
  - the third cluster: (S3, D3) = ({xb, …, xc}, {A2, A4, A10}).

Previous approaches
• Major limitations of previous approaches (each approach has one or more of the following):
  - They produce clusters all of the same dimensionality.
  - They are unable to determine the dimensionality of each cluster automatically.
  - They are unable to identify clusters with low dimensionality (clusters with a low percentage of relevant dimensions, e.g. only 5% of the input dimensions/features are relevant).

Our approach: PCKA
• PCKA: Projected Clustering based on the K-means Algorithm.
• Notation (used below, and made concrete in the sketch that follows):
  - DB: data set of d-dimensional points.
  - A = {A1, A2, ..., Ad}: set of attributes.
  - X = {x1, x2, ..., xN}: set of N data points, where xi = (xi1, ..., xij, ..., xid).
  - xij: 1-d point.
  - nc: given number of clusters.
  - Cs: a projected cluster containing Ns data points, defined in a ds-dimensional subspace formed by the set As of its relevant dimensions. The remaining set A - As represents the irrelevant dimensions of Cs.
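To make this notation concrete, here is a minimal illustrative sketch (not taken from the slides) of a projected cluster represented as its member points plus its relevant attributes; the point indices standing in for {xb, …, xc} are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ProjectedCluster:
    """A projected cluster Cs: the indices of its Ns member points together
    with the set As of relevant attributes in which those points are dense."""
    points: set          # indices into X = {x1, ..., xN}
    relevant_dims: set   # indices into A = {A1, ..., Ad}

    def irrelevant_dims(self, d: int) -> set:
        """A - As: the remaining, irrelevant dimensions of Cs."""
        return set(range(d)) - self.relevant_dims

# Example cluster (S3, D3) = ({xb, ..., xc}, {A2, A4, A10}), with hypothetical
# 0-based point indices standing in for xb, ..., xc.
c3 = ProjectedCluster(points=set(range(40, 70)), relevant_dims={1, 3, 9})
print(sorted(c3.irrelevant_dims(d=10)))   # -> [0, 2, 4, 5, 6, 7, 8]
```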
PCKA
• In order to discover clusters in different subspaces, PCKA proceeds in three phases:
  1. Attribute relevance analysis.
  2. Outlier handling.
  3. Discovery of projected clusters.

Phase 1: Attribute relevance analysis
• Identify relevant dimensions, i.e. those which exhibit some cluster structure.
• By cluster structure we mean a region that has a higher density of points than its surrounding regions.
• Detect dense regions and their location in each dimension.

Sparseness degree
• In order to detect densely populated regions in each attribute, we compute a sparseness degree yij for each 1-d point xij by measuring the variance of its k nearest (1-d) neighbors (a computational sketch follows at the end of this phase).
• Intuitively, a large value of yij means that xij belongs to a sparse region, while a small one indicates that xij belongs to a dense region.

Dense regions
• In order to identify dense regions in each dimension, we are interested in all sets of xij having a small sparseness degree.
• We model the sparseness degree in each dimension as a mixture distribution.

PDF estimation
• Identify the statistical properties of the sparseness degree.
  [Figure: histograms of the sparseness degree of (a) A1, (b) A2, (c) A3 and (d) A4 for the example data set.]
• The histograms suggest the existence of components with different shapes and/or heavy tails, which inspires us to use a gamma mixture model.
• Formally, we assume that the sparseness degrees in each dimension follow a mixture density of the form p(yij) = Σ_{l=1..m} p_l · Gl(yij), where each Gl is a gamma component and the p_l are the mixing proportions.
• How do we estimate the parameters of the gamma components Gl? With the maximum likelihood technique.
• How do we estimate the number of components? With the Bayesian Information Criterion (BIC).
  [Figure: estimated PDFs of the sparseness degree for the example dimensions.]

Dense region detection
• Observation: the locations of the components that represent dense regions are close to zero in comparison to those that represent sparse regions.
• Therefore: examine the location of each component, i.e. find a typical value that best describes the sparseness degrees belonging to it.
• Let LOC = {loc1, ..., locj, ..., loc_m_total}, where locj denotes the location of component j and m_total is the total number of components. To determine the location of each component we calculate its median.
• A large value of locj means that component j corresponds to a sparse region; a small value of locj means that it corresponds to a dense region.
• Our objective: divide LOC into two groups E and F, where E is the group of the highest locj values and F is the group of the lowest locj values. The split is obtained with the MDL (Minimum Description Length) principle.
  [Figure: locj values plotted against their rank (1 to 17) for the example; the set E contains the high locj values and the set F the low ones.]
• As a result we obtain a binary matrix Z of size N × d, indicating for each 1-d point xij whether it falls in a dense region of dimension Aj.
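As a rough illustration of phase 1, the sketch below computes the per-dimension sparseness degree yij as the variance of the k nearest 1-d neighbors of xij. It is a minimal reading of the description above (the exact neighborhood convention, e.g. whether xij itself is excluded, is an assumption of this sketch), and the gamma-mixture/BIC/MDL modelling of the resulting values is not shown.

```python
import numpy as np

def sparseness_degrees(X: np.ndarray, k: int) -> np.ndarray:
    """Return Y with y_ij = variance of the k nearest 1-d neighbors of x_ij
    along dimension j. Small y_ij suggests a dense region, large a sparse one."""
    N, d = X.shape
    Y = np.empty((N, d))
    for j in range(d):
        col = X[:, j]
        order = np.argsort(col)
        sorted_col = col[order]
        for rank, i in enumerate(order):
            # In 1-d, the k nearest neighbors lie among the k predecessors
            # and k successors of x_ij in sorted order.
            lo, hi = max(0, rank - k), min(N, rank + k + 1)
            window = sorted_col[lo:hi]
            dists = np.abs(window - col[i])
            nearest = window[np.argsort(dists)[1:k + 1]]  # drop x_ij itself
            Y[i, j] = np.var(nearest)
    return Y

# Example use: Y = sparseness_degrees(X, k=10); each column of Y would then be
# modelled with a gamma mixture (parameters by maximum likelihood, number of
# components by BIC) before the MDL-based split into dense and sparse regions.
```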
Phase 2: Handling outliers
• Outliers can be defined as data points that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data.
• Our outlier handling mechanism makes efficient use of the properties of the binary matrix Z: outliers do not belong to any of the identified dense regions in Z, and most of them are located in sparse regions.
• We use a binary similarity coefficient, the Jaccard coefficient, to measure the similarity between binary data points zi in the matrix Z.
• Let JC(zi, zj) be the Jaccard coefficient between two binary points. A pair of points is considered similar if the Jaccard coefficient between them exceeds a certain threshold ε.
• Our outlier handling mechanism is based on a definition that exploits the fact that points belonging to dense regions (clusters) have, in general, a large number of similar points, in contrast to outliers.

Phase 3: Discovery of projected clusters
• The main goal of phase 3 is to identify the clusters and their relevant dimensions.
• The clustering process is based on the K-means algorithm, which partitions the data into a number of clusters, each represented by a center:
  - Pick nc points as cluster centers.
  - Alternate: assign each data instance to the closest center; set each center to the average of its assigned points.
  - Stop when no point assignment changes.
  [Figure: illustration of the K-means principle.]
• Problem: each cluster has its own relevant dimensions. Consequently, using a classical distance function to compute the similarity between two data points is not effective with high-dimensional data, because every dimension is weighted equally.
• Proposed solution: we associate the binary weights tij of the matrix T (extracted from the matrix Z) with the Euclidean distance. This makes the distance measure more effective because the computation is restricted to the subsets (i.e. projections) where the object values are dense; formally, the distance between a point and a cluster center is computed only over the dimensions whose binary weights equal 1 (a sketch is given right after this phase).
• How do we identify the relevant dimensions of each cluster? The sum of the binary weights of the data points belonging to the same cluster, taken over each dimension, gives a meaningful measure of the relevance of that dimension to the cluster.
• We propose a relevance index Wsj for each dimension Aj in cluster Cs. Its value always lies between 0 and 1: it is large (close to 1) when the dimension is relevant to the cluster, and very small (close to 0) when the dimension is irrelevant.
• δ is a user-defined parameter that controls the degree of relevancy of dimension Aj to cluster Cs.
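A minimal sketch of the two quantities just described, assuming zi and ti are rows of the binary matrices Z and T. The outlier test below, with its min_similar threshold, is an illustrative reading of the phase-2 idea rather than the exact definition from the slides, and the restricted distance is only a plausible form of the phase-3 measure.

```python
import numpy as np

def jaccard(z_i: np.ndarray, z_j: np.ndarray) -> float:
    """Jaccard coefficient between two binary vectors: |common 1s| / |union of 1s|."""
    inter = np.sum((z_i == 1) & (z_j == 1))
    union = np.sum((z_i == 1) | (z_j == 1))
    return float(inter) / union if union > 0 else 0.0

def looks_like_outlier(i: int, Z: np.ndarray, eps: float, min_similar: int) -> bool:
    """Phase-2 idea (sketch): flag point i when too few other points are
    Jaccard-similar to it above the threshold eps."""
    similar = sum(1 for j in range(len(Z)) if j != i and jaccard(Z[i], Z[j]) > eps)
    return similar < min_similar

def weighted_distance(x: np.ndarray, center: np.ndarray, t: np.ndarray) -> float:
    """Phase-3 distance: Euclidean distance restricted to the dimensions whose
    binary weight t_ij equals 1 (the dense projections of point x)."""
    return float(np.sqrt(np.sum(t * (x - center) ** 2)))
```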
Empirical evaluation
• Accuracy: the aim is to test whether our algorithm, in comparison with other existing approaches, is able to correctly identify projected clusters in complex situations.
• Efficiency: the aim is to determine how the running time scales with (1) the size and (2) the dimensionality of the dataset.
• We evaluate the performance of PCKA on a number of synthetic datasets.
• Performance measure: we use the Clustering Error (CE) distance for projected clustering. This metric performs comparisons in a more objective way since it takes into account the data point groups and the associated subspaces simultaneously. The value of CE always lies between 0 and 1; the more similar the original partition and the partition generated by the clustering algorithm, the smaller the CE value.

Experiments on synthetic data sets
• Robustness to the average cluster dimensionality: we generated sixteen different datasets with N = 3000 data points, d = 100 dimensions and nc = 5 clusters, with the average cluster dimensionality varying from 2% to 70% of d. No outliers were added to these datasets. (A rough generator sketch is given at the end of this section.)
  [Figure: CE distance between the output of the five algorithms (PCKA, SSPC(best), HARP, PROCLUS(best), FASTDOC(best)) and the true clustering, as the average cluster dimensionality varies from 2% to 70% of d.]
• Outlier immunity: we generated three groups of datasets. Each group contains five datasets with N = 1000, d = 100 and nc = 3, in which the percentage of outliers varies from 0% to 20% of N. The average cluster dimensionality is fixed to 2% of d in the first group, and to 15% and 30% of d in the second and third groups, respectively.
  [Figures: CE distance of the five algorithms for outlier percentages from 0% to 20%, on datasets with average cluster dimensionality equal to 2%, 15% and 30% of d.]
• Scalability with the dataset size:
  [Figure: running time of PCKA in seconds w.r.t. the dataset size, from 1,000 to 100,000 points (log scale).]
• Scalability with the dimensionality of the data:
  [Figure: running time of PCKA in seconds w.r.t. the data dimensionality, from 100 to 1,000 dimensions.]
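For readers who want to reproduce experiments of this kind, here is a rough sketch of how synthetic projected-cluster data could be generated. It is an assumption-laden stand-in, not the generator actually used for the datasets above; the cluster spread, value ranges and noise model are all illustrative choices.

```python
import numpy as np

def make_projected_clusters(N=3000, d=100, nc=5, avg_dim_frac=0.10, seed=0):
    """Sketch of a synthetic projected-clustering dataset: each cluster gets its
    own random subset of relevant dimensions, on which its points are tightly
    concentrated; all remaining dimensions are filled with uniform noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 100, size=(N, d))       # irrelevant dimensions: noise
    labels = rng.integers(0, nc, size=N)
    relevant = []
    n_rel = max(2, int(avg_dim_frac * d))      # relevant dims per cluster
    for c in range(nc):
        dims = rng.choice(d, size=n_rel, replace=False)
        relevant.append(sorted(dims))
        members = np.where(labels == c)[0]
        centers = rng.uniform(0, 100, size=n_rel)
        # tight Gaussian spread on this cluster's relevant dimensions only
        X[np.ix_(members, dims)] = centers + rng.normal(0.0, 1.0,
                                                        size=(len(members), n_rel))
    return X, labels, relevant
```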
Experiments on real data
• Wisconsin Diagnostic Breast Cancer Data (WDBC): the set contains 569 samples, each with 30 features. The samples are grouped into two clusters: 357 benign and 212 malignant.
• Saccharomyces Cerevisiae Gene Expression Data (SCGE): this data set contains the expression levels of 205 genes under 80 experiments. The data set is presented as a matrix in which each row corresponds to a gene and each column to an experiment. The genes are grouped into four clusters.
• Multiple Features Data (MF): the set consists of features of handwritten numerals ("0"-"9") extracted from a collection of Dutch utility maps; 200 patterns per cluster (for a total of 2,000 patterns) have been digitized in binary images. For our experiments we used five feature sets (files):
  1. mfeat-fou: 76 Fourier coefficients of the character shapes;
  2. mfeat-fac: 216 profile correlations;
  3. mfeat-kar: 64 Karhunen-Loève coefficients;
  4. mfeat-zer: 47 Zernike moments;
  5. mfeat-mor: 6 morphological features.
  In summary, this gives a data set with 2,000 patterns, 409 features, and 10 clusters.
• We use the class labels as ground truth and measure the accuracy of clustering by matching the points in the input and output clusters (see the sketch below).
  [Table: accuracy of clustering on the real data sets.]
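The matching step is not spelled out on the slide; one standard way to compute such an accuracy is to pair output clusters with ground-truth classes so that agreement is maximal. The sketch below does this with the Hungarian algorithm, which is an assumption of this sketch rather than the slide's stated procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_accuracy(true_labels, pred_labels) -> float:
    """Accuracy after optimally matching predicted clusters to ground-truth
    classes (Hungarian algorithm on the confusion matrix)."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(pred_labels)
    # confusion[i, j] = number of points of class i placed in cluster j
    confusion = np.array([[np.sum((true_labels == c) & (pred_labels == k))
                           for k in clusters] for c in classes])
    row, col = linear_sum_assignment(-confusion)   # maximise matched points
    return confusion[row, col].sum() / len(true_labels)
```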