DB Seminar Series
HARP: A Hierarchical Algorithm with Automatic Relevant Attribute Selection for Projected Clustering
Presented by: Kevin Yip, 20 September 2002

Short Summary
Our own work (unpublished), supervised by Dr. Cheung and Dr. Ng.
- Problem: to cluster datasets of very high dimensionality.
- Assumption: clusters are formed in subspaces.
- Previous approaches either place special restrictions on the dataset or the target clusters, or cannot determine the dimensionality of the clusters automatically.
- Our approach is not restricted by these limitations.

Presentation Outline
- Clustering
- Projected clustering
- Previous approaches to projected clustering
- Our approach: HARP
  - Concepts
  - Implementation: HARP.1
  - Experiments
- Future work and conclusions

Clustering
Goal: given a dataset D with N records and d attributes (dimensions), partition the records into k disjoint clusters such that
- intra-cluster similarity is maximized, and
- inter-cluster similarity is minimized.

How to measure similarity?
- Distance-based: Manhattan distance, Euclidean distance, etc.
- Correlation-based: cosine correlation, Pearson correlation, etc.
- Link-based (common neighbors)
- Pattern-based

Two common types of clustering algorithms:
- Partitional: select some representative points for each cluster, assign every other point to its closest cluster, then re-determine the representative points.
- Hierarchical (agglomerative): repeatedly determine the two most similar clusters and merge them.

Partitional clustering: dataset -> representatives -> assignment -> replacement.
Hierarchical clustering: dataset -> similarity calculation -> best merge determination -> merging.

Projected Clustering
- Assumption (general case): each cluster is formed in a subspace.
- Assumption (special case): each cluster has a set of relevant attributes.
- Goal: determine the records and the relevant attributes of each cluster, i.e. to “select” the “relevant attributes”.
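The hierarchical (agglomerative) scheme described above, repeatedly finding the two most similar clusters and merging them, can be sketched in a few lines. This is a minimal full-space version with Euclidean distance and single-linkage similarity; the function names and toy data are illustrative, not part of HARP:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage_distance(c1, c2):
    # Distance of two clusters = distance of their closest pair of points.
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerate(points, k):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[p] for p in points]  # start with one cluster per record
    while len(clusters) > k:
        # Best merge determination: find the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_linkage_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters[j])  # merge the pair
        del clusters[j]
    return clusters

clusters = agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

HARP keeps this merge loop but replaces the fixed full-space distance with a score computed over each cluster's own selected attributes, and refuses merges that violate its thresholds.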
How to define “relevance”?
(Figure sources: ORCLUS, SIGMOD 2000; the 3-D view: DOC, SIGMOD 2002.)

An example dataset:

Person  Age  Virus Level  Blood Type  Disease A
1       35   0.6          AB          Uninfected
2       64   0.9          AB          Uninfected
3       27   1.1          AB          Uninfected
4       18   9.8          O           Infected
5       42   8.6          AB          Infected
6       53   11.3         B           Infected
7       37   0.7          O           Recovered
8       28   0.4          A           Recovered
9       65   0.9          B           Recovered

Projected clustering vs. feature selection:
- Feature selection selects one feature set for all records; projected clustering selects an attribute set individually for each cluster.
- Feature selection is a preprocessing task; projected clustering selects attributes during the clustering process.

Why is projected clustering important?
- At high dimensionality the data points are sparse, and the distance between any two points is almost the same.
- There are many noise attributes that we are not interested in.
- High dimensionality implies high computational complexity.

Previous Approaches
(Refer to my previous DB seminar on 17 May 2002, “The Subspace Clustering Problem”.)
- Grid-based dimension selection (CLIQUE, ENCLUS, MAFIA)
- Association rule hypergraph partitioning
- Context-specific Bayesian clustering
- Monte Carlo algorithm (DOC)
- Projective clustering (PROCLUS, ORCLUS)

PROCLUS:
1. Draw medoids
2. Determine neighbors
3. Select attributes
4. Assign records
5. Replace medoids
6. Go to 2

ORCLUS:
1. Draw medoids
2. Assign records
3. Select vectors
4. Merge (reselect vectors and determine the centroid)
5. Go to 2

Summary of the limitations of previous approaches (each approach has one or more of the following):
- Produces non-disjoint clusters
- Has exponential time complexity w.r.t. cluster dimensionality
- Allows each attribute value to be selected by only one cluster
- Unable to determine the dimensionality of each cluster automatically
- Produces clusters all of the same dimensionality
- Considers only local statistical values in attribute selection
- Unable to handle datasets with mixed attribute types
- Assigns records to clusters regardless of their distances
- Requires datasets to have many more records than attributes

Our Approach: HARP
Motivations:
- From datasets: we want to study gene expression profile datasets (usually with thousands of genes and fewer than a hundred samples).
- From previous algorithms: we want to develop a new algorithm that has none of the above limitations.

HARP: a Hierarchical algorithm with Automatic Relevant attribute selection for Projected clustering.
Special features:
- Automatic attribute selection
- Customizable procedures
- Mutual disagreement prevention

A special implementation based on attribute value density, HARP.1:
- Uses global statistics in attribute selection
- Generic similarity calculations that can handle both categorical and numeric attributes
- Implements all mutual disagreement mechanisms defined by HARP
- Reduces time complexity by pre-clustering

Basic idea:
- In the partitional approaches:
  - At the beginning, each record is assigned to a cluster by calculating distances/similarities using all attributes.
  - It is very likely that some assignments are incorrect.
  - There is no clue for finding the dimensionality of the clusters.
- Our approach: allow only the “best” merges at any time.
- “Best”: a merge is permitted only if
  - each selected attribute of the resulting cluster has a relevance of at least dt,
  - the resulting cluster has more than mapc selected attributes, and
  - the two participating clusters have a mutual disagreement not larger than md.
- mapc, dt, md: threshold variables.
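The three merge conditions above can be sketched as a single predicate, assuming the relevance values for the would-be merged cluster have already been computed by whatever measure an implementation supplies (HARP.1 uses density improvement); all names are illustrative:

```python
def merge_allowed(new_cluster_attrs, relevance, mutual_disagreement, dt, mapc, md):
    """Return True iff a merge satisfies HARP's three threshold conditions.

    new_cluster_attrs   -- attributes selected for the would-be merged cluster
    relevance           -- maps each such attribute to its relevance in that cluster
    mutual_disagreement -- disagreement between the two participating clusters
    dt, mapc, md        -- the three threshold variables
    """
    return (
        all(relevance[a] >= dt for a in new_cluster_attrs)  # every selected attribute is relevant enough
        and len(new_cluster_attrs) > mapc                   # enough attributes are selected
        and mutual_disagreement <= md                       # the two clusters agree with each other
    )

# Illustrative check with made-up numbers:
ok = merge_allowed(["A1", "A2", "A3"], {"A1": 0.9, "A2": 0.8, "A3": 0.7},
                   mutual_disagreement=0.05, dt=0.6, mapc=2, md=0.1)
```

In the multi-step scheme the thresholds start tight, so only clearly good merges pass; loosening them gradually admits the remaining merges.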
Our Approach: HARP
Multi-step clustering:

[Figure: the thresholds (dt, mapc, md, imd) start at their initial values; in each step, merge scores are calculated, all permitted merges are performed, and then the thresholds are loosened.]

Merge scores:
Cluster 1  Cluster 2  Merge score
2          6          27.6
3          8          24.3
12         13         24.1
1          5          18.5
...

Expected resulting clusters:
- Have all relevant attributes selected (due to mapc).
- Selected attributes have high relevance to the cluster (due to dt).
- Not biased towards either participating cluster (due to md and some other mechanisms).

More details: attribute relevance
- Depends on the definition of the similarity measure.
- E.g. the density-based measure defines the relevance of an attribute to a cluster by the “compactness” of its values in the cluster; compactness can be reflected by the variance value.

Attribute  C1 (mean/variance)  C2 (mean/variance)
A1         4.9/0.1             5.0/0.1
A2         5.1/0.1             4.9/0.1
A3         2.8/0.5             7.3/0.5
A4         3.1/0.5             7.2/0.5

Which attributes are relevant to the clusters?
- A1, A2: local statistics
- A3, A4: global statistics

More details: mutual disagreement
The two clusters participating in a merge do not agree with each other.
- Case 1: 100 rec. {A1, A2} + 5 rec. {A3, A4} -> 105 rec. {A1, A2}
  One cluster dominates the selection of attributes.
- Case 2: 50 rec. {A1, A2, ..., A6} + 50 rec. {A1, A2} -> 100 rec. {A1, A2}
  The clusters lose some information due to the merge.

Mutual disagreement prevention:
- Set up the md threshold to limit the maximum disagreement on the new set of attributes.
- Get the statistics of the loss of information in all possible merges, and discard those with extraordinarily high loss.
- Add a punishment factor to the similarity score.

Our Approach: HARP.1
HARP.1: an implementation of HARP that defines the relevance of an attribute to a cluster by its density improvement over the global density.

Relevance score of an attribute to a cluster:
- Categorical: 1 - (1 - Mode-ratio_local) / (1 - Mode-ratio_global)
- Numeric: 1 - Var_local / Var_global
- When Mode-ratio_global = 1 or Var_global = 0, the score is 0.

If C1 and C2 merge into Cnew, we can use the records of C1 and C2 to evaluate their “agreement” on the selected attributes of Cnew in a similar way.

Mutual disagreement calculations:
- Den(Ci, a): how good attribute a is in Ci.
- Den(Ci, Cnew, a): how good attribute a is in Ci, evaluated using the properties of a in Cnew.
- Both values are in the range [0, 1].

  MD(C1, C2) = 1 - (1/|A_Cnew|) * Σ_{a ∈ A_Cnew} min(Den(C1, Cnew, a), Den(C2, Cnew, a)) / max(Den(C1, Cnew, a), Den(C2, Cnew, a))

  ILoss(Ci, Cnew) = 1 - (Σ_{a ∈ A_Cnew} Den(Ci, Cnew, a)) / (Σ_{a ∈ A_Ci} Den(Ci, a))   (truncated to 0 if negative)

Similarity score:

  Sim(C1, C2) = ((Σ_{a ∈ A_Cnew} Den(Cnew, a)) / |A_Cnew|) * (1 - ILoss(C1, Cnew)) * (1 - ILoss(C2, Cnew))

Initial and baseline values for the md variable: user parameters, with defaults 10 and 50.

Multi-step clustering, with mutual disagreement prevention:
1. MD(C1, C2) <= md.
2. The sum of, and difference between, ILoss(C1, Cnew) and ILoss(C2, Cnew) must not be more than a certain number of standard deviations from the mean.
3. A punishment factor is applied in the similarity score.
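A sketch of the HARP.1 scores above, assuming numeric attributes (variance-based density) and that the per-attribute density values have already been computed into dictionaries; the function and argument names are illustrative:

```python
def relevance_numeric(var_local, var_global):
    """Relevance of a numeric attribute: density improvement over the global variance."""
    if var_global == 0:
        return 0.0
    return 1.0 - var_local / var_global

def mutual_disagreement(den1, den2):
    """MD(C1, C2): den1[a] = Den(C1, Cnew, a) and den2[a] = Den(C2, Cnew, a)
    for each selected attribute a of Cnew (densities assumed positive)."""
    ratio_sum = sum(min(den1[a], den2[a]) / max(den1[a], den2[a]) for a in den1)
    return 1.0 - ratio_sum / len(den1)

def information_loss(den_in_new, den_own):
    """ILoss(Ci, Cnew): den_in_new sums over A_Cnew, den_own over A_Ci;
    truncated to 0 if negative."""
    return max(0.0, 1.0 - sum(den_in_new.values()) / sum(den_own.values()))

def similarity(den_new, iloss1, iloss2):
    """Sim(C1, C2): average density of Cnew's selected attributes,
    discounted by both clusters' information losses."""
    avg_den = sum(den_new.values()) / len(den_new)
    return avg_den * (1.0 - iloss1) * (1.0 - iloss2)
```

In the full algorithm these scores are recomputed for every candidate merge; Sim ranks the candidates in the score lists, while MD and ILoss gate which merges are permitted at all.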
The baseline value for each dt variable is the global statistical value.

Each cluster keeps a local score list (a binary tree) containing merges with all other clusters; the best scores are propagated to a global score list.

Our Approach: HARP.1
Time complexity: O(N^2 (mls log N + d max_a |dom(a)|))
- Speeding up: use a fast projected clustering algorithm to pre-cluster the data.
Space complexity: O(N^2 + Nd)

Accuracy experiments (datasets):

Name      Type       Rec.  Class  Cat./Num. Attr.  Avg. Rel. Attr.  Outlier (%)
Soybean   Real-life  47    4      35 / 0           26               ?
Voting    Real-life  435   2      16 / 0           11               ?
Mushroom  Real-life  8124  2      22 / 0           15               ?
SynCat1   Synthetic  500   5      20 / 0           12               5
SynMix1   Synthetic  500   5      10 / 10          12               5
SynNum1   Synthetic  500   5      0 / 20           12               5
SynCat2   Synthetic  500   5      20 / 0           7                5
SynMix2   Synthetic  500   5      10 / 10          7                5
SynNum2   Synthetic  500   5      0 / 20           7                5

Accuracy experiments (results 1); entries are error% / outlier%, with best-score and average runs reported for PROCLUS and the traditional algorithm:

Dataset   HARP.1    PROCLUS (best)  PROCLUS (avg)  Trad. (best)  Trad. (avg)  ROCK
Soybean   0.0/0.0   0.0/0.0         17.3/0.0       2.1/0.0       9.2/0.0      no published result
Voting    6.4/13.6  2.1/55.6        13.8/7.9       13.1/11.3     13.1/1.9     6.2/14.5
Mushroom  1.4/0.0   3.2/0.0         9.0/0.0        6.0/0.0       5.2/0.0      0.4/0.0

Accuracy experiments (results 2); entries are error% / outlier% (ORCLUS is N/A on datasets with categorical attributes):

Dataset  HARP.1    PROCLUS (best)  PROCLUS (avg)  Trad. (best)  Trad. (avg)  ORCLUS (best)  ORCLUS (avg)
SynCat1  0.0/5.0   3.6/1.4         6.7/3.7        2.6/26.4      5.8/5.3      N/A            N/A
SynMix1  0.4/6.8   2.2/17.0        6.8/10.1       11.6/11.2     7.9/4.6      N/A            N/A
SynNum1  0.8/5.0   1.8/21.4        7.2/8.3        4.4/32.0      5.9/9.2      0.4/23.8       2.31/8.15
SynCat2  4.0/8.4   11.0/31.0       25.0/14.4      17.8/23.8     28.5/5.4     N/A            N/A
SynMix2  11.4/4.4  16.6/62.2       25.4/32.8      17.6/38.6     24.1/11.6    N/A            N/A
SynNum2  18.8/4.4  11.6/50.8       18.7/20.7      11.6/28.0     23.3/10.9    50.8/0.0       57.2/0.0

Accuracy experiments (results 3):
- Dataset: 500 records, 200 attributes, on average 13 relevant, 5 classes.
- Pre-clustering: form 50 clusters.
[Figure: percentage of error cases (0-50) against the input parameter l (0-100), comparing PROCLUS (best score), HARP.1 with pre-clustering, and pre-clustering alone (best score).]

Scalability experiments (scaling N):
[Figure: total execution time (s) against the number of records, both axes in log scale.]

Scalability experiments (scaling d):
[Figure: total execution time against the number of attributes.]

Scalability experiments (scaling the average number of relevant attributes):
[Figure: total execution time (s, 0-60) against the average dimensionality of the clusters (0-120).]

Scalability experiments (scaling N with pre-clustering):
[Figure: total execution time (s) against the number of records, both axes in log scale.]

Application: gene expression datasets
- Lymphoma dataset: Nature 403 (2000); 96 samples, 4026 genes, 9 classes (DLBCL, activated blood B, FL, germinal centre B, resting/activated T, resting blood B, NI. lymph node/tonsil, transformed cell lines, CLL).
- Can also use genes as records and samples as attributes:
  - E.g. use the dendrogram to produce an ordering of all genes.
  - Validate the ordering based on some domain knowledge.
  - If the ordering is valid, the positions of other genes of unknown function can be analyzed.

Future Work
- Produce more implementations based on other similarity measures.
- Study the definition of “relevance” in gene expression datasets.
- Consider very large datasets that cannot fit into main memory.
- Extend the approach to solve other problems, e.g.
k-NN in high-dimensional space.

Conclusions
A hierarchical projected clustering algorithm, HARP, has been developed, with:
- dynamic selection of relevant attributes,
- mutual disagreement prevention, and
- generic similarity calculation.
A density-based implementation, HARP.1, has been developed, with:
- good accuracy,
- reasonable time complexity, and
- real applications on gene expression datasets.

References
C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In ACM SIGMOD International Conference on Management of Data, 1999.
C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high dimensional spaces. In ACM SIGMOD International Conference on Management of Data, pages 70-81, 2000.
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In ACM SIGMOD International Conference on Management of Data, 1998.
A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. Hudson, L. Lu, D. B. Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, R. Levy, W. Wilson, M. R. Grever, J. C. Byrd, D. Botstein, P. O. Brown, and L. M. Staudt. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769):503-511, 2000.
Y. Barash and N. Friedman. Context-specific Bayesian clustering for gene expression data. In Annual Conference on Research in Computational Molecular Biology, 2001.
C. H. Cheng, A. W.-C. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical data. In Knowledge Discovery and Data Mining, pages 84-93, 1999.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In 15th International Conference on Data Engineering, 1999.
E.-H. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering based on association rule hypergraphs. In 1997 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.
G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph partitioning: Applications in VLSI domain. In ACM/IEEE Design Automation Conference, 1997.
H. Nagesh, S. Goil, and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets, 1999.
C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A Monte Carlo algorithm for fast projective clustering. In ACM SIGMOD International Conference on Management of Data, 2002.
H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. In ACM SIGMOD International Conference on Management of Data, 2002.