Semi-Supervised Clustering
CS 685: Special Topics in Data Mining
Spring 2008
Jinze Liu
The University of Kentucky

Outline of lecture
• Overview of clustering and classification
• What is semi-supervised learning?
  – Semi-supervised clustering
  – Semi-supervised classification
• Semi-supervised clustering
  – What is semi-supervised clustering?
  – Why semi-supervised clustering?
  – Semi-supervised clustering algorithms

Supervised classification versus unsupervised clustering
• Unsupervised clustering
  – Group similar objects together to find clusters
    • Minimize intra-class distance
    • Maximize inter-class distance
• Supervised classification
  – A class label for each training sample is given
  – Build a model from the training data
  – Predict class labels for unseen future data points

What is clustering?
• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
  – Intra-cluster distances are minimized
  – Inter-cluster distances are maximized

What is classification?
• A learning algorithm induces a model from a labeled training set (induction); the model is then applied to an unlabeled test set to predict the class labels (deduction).

  Training set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

  Test set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

Clustering algorithms
• K-Means
• Hierarchical clustering
• Graph-based clustering (spectral clustering)

Classification algorithms
• Decision trees
• Naïve Bayes classifier
• Support Vector Machines (SVM)
• K-nearest-neighbor classifiers
• Logistic regression
• Neural networks
• Linear Discriminant Analysis (LDA)
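To make the contrast between the two paradigms concrete, here is a minimal sketch (my own illustration, not part of the slides) that runs one algorithm from each list above on the same toy data. It assumes NumPy and scikit-learn are installed; the data, the choice of K-Means and a decision tree, and all parameter values are arbitrary.

```python
# Unsupervised clustering vs. supervised classification on the same toy data.
# Illustrative sketch only; assumes NumPy and scikit-learn are available.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs of 20 points each.
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)   # labels are used only by the classifier

# Unsupervised: groups are discovered from the geometry alone, no labels.
cluster_ids = KMeans(n_clusters=2, n_init=10).fit_predict(X)

# Supervised: induce a model from labeled training data (induction),
# then apply it to unseen points (deduction).
model = DecisionTreeClassifier().fit(X, y)
predicted = model.predict(rng.normal(2.5, 1, (5, 2)))
```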
Supervised Classification Example
[figure: 2-D point set with class labels, built up over several animation steps]

Unsupervised Clustering Example
[figure: an unlabeled 2-D point set, shown over several animation steps]

Semi-Supervised Learning
• Combines labeled and unlabeled data during training to improve performance:
  – Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
  – Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
• Semi-supervised learning sits between unsupervised clustering and supervised classification.

Semi-Supervised Classification Example
[figure: 2-D point set with a few labeled points, shown over several animation steps]

Semi-Supervised Classification
• Algorithms:
  – Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]
  – Co-training [Blum:COLT98]
  – Transductive SVMs [Vapnik:98, Joachims:ICML99]
  – Graph-based algorithms
• Assumptions:
  – A known, fixed set of categories is given in the labeled data.
  – The goal is to improve classification of examples into these known categories.
• More discussion next week

Semi-Supervised Clustering Example
[figure: 2-D point set, shown over several animation steps]

Second Semi-Supervised Clustering Example
[figure: 2-D point set, shown over several animation steps]

Semi-supervised clustering: problem definition
• Input:
  – A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
  – A small amount of domain knowledge
• Output:
  – A partitioning of the objects into k clusters (possibly with some discarded as outliers)
• Objective:
  – Maximum intra-cluster similarity
  – Minimum inter-cluster similarity
  – High consistency between the partitioning and the domain knowledge

Why semi-supervised clustering?
• Why not clustering?
  – The clusters produced may not be the ones required.
  – Sometimes there are multiple possible groupings.
• Why not classification?
  – Sometimes there are insufficient labeled data.
• Potential applications:
  – Bioinformatics (gene and protein clustering)
  – Document hierarchy construction
  – News/email categorization
  – Image categorization

Semi-Supervised Clustering
• Domain knowledge
  – Partial label information is given
  – Apply some constraints (must-links and cannot-links)
• Approaches
  – Search-based semi-supervised clustering (this class): alter the clustering algorithm using the constraints
  – Similarity-based semi-supervised clustering (next class): alter the similarity measure based on the constraints
  – Combination of both

Search-Based Semi-Supervised Clustering
• Alter the clustering algorithm that searches for a good partitioning by:
  – Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]
  – Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]
  – Using the labeled data to initialize clusters in an iterative refinement algorithm such as K-Means [Basu:ICML02]

Overview of K-Means Clustering
• K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.
• Algorithm: initialize the K cluster centers $\{\mu_l\}_{l=1}^{K}$ randomly, then repeat until convergence:
  – Cluster assignment step: assign each data point $x$ to the cluster $X_l$ whose center $\mu_l$ has minimum L2 distance from $x$.
  – Center re-estimation step: re-estimate each cluster center $\mu_l$ as the mean of the points in that cluster.

K-Means Objective Function
• Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:
  $\sum_{l=1}^{K} \sum_{x_i \in X_l} \| x_i - \mu_l \|^2$
• Initialization of the K cluster centers:
  – Totally random
  – Random perturbation from the global mean
  – Heuristic to ensure well-separated centers

K-Means Example
[figure: animation of K-Means on a 2-D point set]
– Randomly initialize means
– Assign points to clusters
– Re-estimate means
– Re-assign points to clusters
– Re-estimate means
– Re-assign points to clusters
– Re-estimate means and converge
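The two steps above translate directly into code. Below is a minimal NumPy sketch (my own illustration, not the course's code); it initializes the centers from randomly chosen data points and assumes no cluster ever becomes empty.

```python
# Minimal K-Means sketch in NumPy, mirroring the two steps above.
# Illustrative only; assumes no cluster ever becomes empty.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick K distinct data points as the centers mu_1..mu_K.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: each point goes to its nearest center (L2).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: each center becomes the mean of its cluster.
        new_centers = np.array([X[labels == l].mean(axis=0) for l in range(K)])
        if np.allclose(new_centers, centers):   # no more relocation: converged
            break
        centers = new_centers
    # Objective: sum of squared distances ||x_i - mu_l||^2 to assigned centers.
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, centers, sse
```

Each iteration can only decrease the objective, so the procedure converges, but only to a local minimum; hence the interest in the initialization heuristics listed above.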
Semi-Supervised K-Means
• Partial label information is given
  – Seeded K-Means
  – Constrained K-Means
• Constraints (must-link, cannot-link)
  – COP K-Means

Semi-Supervised K-Means for partially labeled data
• Seeded K-Means:
  – Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
  – Seed points are only used for initialization, not in subsequent steps.
• Constrained K-Means:
  – Labeled data provided by the user are used to initialize the K-Means algorithm.
  – Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated.
• (A code sketch of both variants, together with COP K-Means, follows the COP K-Means algorithm below.)

Seeded K-Means
• Use labeled data to find the initial centroids, then run K-Means.
• The labels of seed points may change.

Seeded K-Means Example
[figure: animation of Seeded K-Means on a 2-D point set]
– Initialize means using labeled data
– Assign points to clusters
– Re-estimate means
– Assign points to clusters and converge (note: the label of one seed point is changed)

Constrained K-Means
• Use labeled data to find the initial centroids, then run K-Means.
• The labels of seed points will not change.

Constrained K-Means Example
[figure: animation of Constrained K-Means on a 2-D point set]
– Initialize means using labeled data
– Assign points to clusters
– Re-estimate means and converge

COP K-Means
• COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
• Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster).
• Algorithm: during the cluster assignment step, a point is assigned to its nearest cluster such that none of its constraints are violated. If no such assignment exists, abort.

COP K-Means Algorithm
[figure: pseudocode of the COP K-Means algorithm]
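To make the three variants concrete, here is a sketch building on the K-Means code above (again my own illustration, not code from the papers). The encoding of seeds (seeds[i] is the label of point i, or -1 if unlabeled) and of constraints (dicts mapping a point index to the indices it is linked with) are assumptions of this sketch.

```python
# Sketch of Seeded, Constrained, and COP K-Means (illustrative only).
# seeds[i] is the cluster label of point i, or -1 if point i is unlabeled;
# every label 0..K-1 is assumed to have at least one seed point.
import numpy as np

def seeded_kmeans(X, K, seeds, n_iters=100, clamp_seeds=False):
    """Seeded K-Means; clamp_seeds=True gives Constrained K-Means instead."""
    seeds = np.asarray(seeds)
    # Initialization: center l is the mean of the seed points labeled l.
    centers = np.array([X[seeds == l].mean(axis=0) for l in range(K)])
    for _ in range(n_iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if clamp_seeds:
            # Constrained K-Means: seed points always keep their given labels.
            labels[seeds >= 0] = seeds[seeds >= 0]
        new_centers = np.array([X[labels == l].mean(axis=0) for l in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def cop_assign(i, nearest_first, labels, must, cannot):
    """COP K-Means assignment step for one point i: take the nearest cluster
    that violates none of i's constraints; abort if every cluster violates one.
    nearest_first lists the cluster indices sorted by distance from point i;
    labels[j] == -1 means point j has not been assigned yet in this pass."""
    for l in nearest_first:
        violated = any(labels[j] not in (-1, l) for j in must.get(i, ())) \
            or any(labels[j] == l for j in cannot.get(i, ()))
        if not violated:
            return l
    raise RuntimeError("no consistent assignment exists: COP K-Means aborts")
```

The RuntimeError branch is exactly the failure case in the third illustration below: when a point's must-link and cannot-link constraints together rule out every cluster, COP K-Means gives up rather than break a constraint.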
Illustration: determining the label of a new point
• With a must-link to an already-assigned point: assign to the red class.
• With a cannot-link to an already-assigned point: assign to the red class.
• With a must-link and a cannot-link that cannot both be satisfied: the clustering algorithm fails.

Summary
• Seeded K-Means and Constrained K-Means use partially labeled data; COP K-Means uses must-link and cannot-link constraints.
• Constrained K-Means and COP K-Means require all the constraints to be satisfied.
  – They may not be effective if the seeds contain noise.
• Seeded K-Means uses the seeds only in the first step, to determine the initial centroids.
  – It is less sensitive to noise in the seeds.
• Experiments show that semi-supervised K-Means outperforms traditional K-Means.

References
• Semi-supervised Clustering by Seeding
  – http://www.cs.utexas.edu/users/ml/papers/semiicml-02.pdf
• Constrained K-means Clustering with Background Knowledge
  – http://www.litech.org/~wkiri/Papers/wagstaffkmeans-01.pdf

Next class
• Topics
  – Similarity-based semi-supervised clustering
• Readings
  – Distance Metric Learning, with Application to Clustering with Side-Information
    • http://ai.stanford.edu/~ang/papers/nips02-metric.pdf
  – From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering
    • http://www.cs.berkeley.edu/~klein/papers/constrained_clustering-ICML_2002.pdf