Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006 The UNIVERSITY of Kansas Administrative Student presentations start from next week Send me the list of references that you use, if your choices are different from what are listed at the class webpage Speaker: Allocate enough time for background. 20 minutes for general audience 20 minutes for people working in “your area” <20 minutes for experts Keep in mind of five “w” questions: What is it? Why is it important? Who cares? Works or not? and So what? 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide2 Administrative Participants Need to ask at least one question 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide3 Overview Semi-supervised learning Semi-supervised classification Semi-supervised clustering Semi-supervised clustering Search based methods Cop K-mean Seeded K-mean Constrained K-mean Similarity based methods 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide4 Supervised Classification Example . . . . 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide5 Supervised Classification Example . . . . .. .. . .. . . . . . . . . . 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide6 Supervised Classification Example . . . . .. .. . .. . . . . . . . . . 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide7 Unsupervised Clustering Example . . . . . .. .. . .. . . . . . . . . . 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide8 Unsupervised Clustering Example . . . . .. .. . .. . . . . . . . . . 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide9 Semi-Supervised Learning Combines labeled and unlabeled data during training to improve performance: Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier. Semi-supervised clustering: Uses small amount of labeled data to aid and bias the clustering of unlabeled data. 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide10 Semi-Supervised Classification Example . . . . .. .. . .. . . . . . . . . . 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide11 Semi-Supervised Classification Example . . . . .. .. . .. . . . . . . . . . 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide12 Semi-Supervised Classification Algorithms: Semisupervised EM [Ghahramani:NIPS94,Nigam:ML00]. Co-training [Blum:COLT98]. Transductive SVM’s [Vapnik:98,Joachims:ICML99]. Assumptions: Known, fixed set of categories given in the labeled data. Goal is to improve classification of examples into these known categories. 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide13 Semi-Supervised Clustering Example . . . . .. .. . .. . . . . . . . . . 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide14 Semi-Supervised Clustering Example . . . .. .. . .. . . . . . . . . . 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide15 Second Semi-Supervised Clustering Example . . . . .. .. . .. . . . . . . . . . 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide16 Second Semi-Supervised Clustering Example . . . .. .. . .. . . . . . . . . . 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide17 Semi-Supervised Clustering Can group data using the categories in the initial labeled data. Can also extend and modify the existing set of categories as needed to reflect other regularities in the data. Can cluster a disjoint set of unlabeled data using the labeled data as a “guide” to the type of clusters desired. 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide18 Problem definition Input: A set of unlabeled objects Some domain knowledge Output: A partitioning of the objects into clusters Objective: Maximum intra-cluster similarity Minimum inter-cluster similarity High consistency between the partitioning and the domain knowledge 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide19 What is Domain Knowledge? Must-link and cannot-link Class labels Ontology 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide20 Why semi-supervised clustering? Why not clustering? Could not incorporate prior knowledge into clustering process Why not classification? Sometimes there are insufficient labeled data. Potential applications Bioinformatics (gene and protein clustering) Document hierarchy construction News/email categorization Image categorization 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide21 Semi-Supervised Clustering Approaches Search-based Semi-Supervised Clustering Alter the clustering algorithm using the constraints Similarity-based Semi-Supervised Clustering Alter the similarity measure based on the constraints Combination of both 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide22 Search-Based Semi-Supervised Clustering Alter the clustering algorithm that searches for a good partitioning by: Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]. Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]. Use the labeled data to initialize clusters in an iterative refinement algorithm (kMeans, EM) [Basu:ICML02]. 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide23 Unsupervised KMeans Clustering KMeans iteratively partitions a dataset into K clusters. Algorithm: Initialize K cluster centers {l } randomly. Repeat until convergence: K l 1 Cluster Assignment Step: Assign each data point x to the cluster Xl, such that L2 distance of x from l (center of Xl) is minimum Center Re-estimation Step: Re-estimate each cluster center l as the mean of the points in that cluster 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide24 KMeans Objective Function Locally minimizes sum of squared distance between the data points and their corresponding cluster centers: K l 1 xi X l || xi l || 2 Initialization of K cluster centers: Totally random Random perturbation from global mean Heuristic to ensure well-separated 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 centers etc. slide25 K Means Example 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide26 K Means Example Randomly Initialize Means x x 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide27 K Means Example Assign Points to Clusters x x 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide28 K Means Example Re-estimate Means x x 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide29 K Means Example Re-assign Points to Clusters x x 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide30 K Means Example Re-estimate Means x x 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide31 K Means Example Re-assign Points to Clusters x x 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide32 K Means Example Re-estimate Means and Converge x x 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide33 Semi-Supervised K-Means Constraints (Must-link, Cannot-link) COP K-Means Partial label information is given Seeded K-Means (Basu, ICML’02) Constrained K-Means 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide34 COP K-Means COP K-Means is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points. Initialization: Cluster centers are chosen randomly but no must-link constraints that may be violated Algorithm: During cluster assignment step in COP-KMeans, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort. Based on Wagstaff et al.: ICML01 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide35 COPK-Means K-MeansAlgorithm Algorithm COP 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide36 Illustration Determine its label Must-link x x Assign to the red class 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide37 Illustration Determine its label x x Cannot-link Assign to the red class 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide38 Illustration Determine its label Must-link x x Cannot-link The clustering algorithm fails 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide39 Evaluation Rand index: measures the agreement between two partitions, P1 and P2, of the same data set D. Each partition is viewed as a collection of n(n-1)/2 pairwise decisions, where n is the size of D. a is the number of decisions where P1 and P2 put a pair of objects into the same cluster b is the number of decisions where two instances are placed in different clusters in both partitions. Total agreement can then be calculated using Rand(P1; P2) = (a + b)/ (n (n -1)/2): 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide40 Evaluation 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide41 Semi-Supervised K-Means Seeded K-Means: Labeled data provided by user are used for initialization: initial center for cluster i is the mean of the seed points having label i. Seed points are only used for initialization, and not in subsequent steps. Constrained K-Means: Labeled data provided by user are used to initialize K-Means algorithm. Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are reestimated. Based on Basu et al., ICML’02. 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide42 Seeded K-Means Use labeled data to find the initial centroids and then run K-Means. The labels for seeded points may change. 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide43 Seeded K-Means Example 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide44 Seeded K-Means Example Initialize Means Using Labeled Data x 11/01/2006 Semi-supervised learning x Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide45 Seeded K-Means Example Assign Points to Clusters x 11/01/2006 Semi-supervised learning x Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide46 Seeded K-Means Example Re-estimate Means x 11/01/2006 Semi-supervised learning x Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide47 Seeded K-Means Example Assign points to clusters and Converge x 11/01/2006 Semi-supervised learning x Mining Biological Data KU EECS 800, Luke Huan, Fall’06 the label is changed slide48 Constrained K-Means Use labeled data to find the initial centroids and then run K-Means. The labels for seeded points will not change. 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide49 Constrained K-Means Example 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide50 Constrained K-Means Example Initialize Means Using Labeled Data x 11/01/2006 Semi-supervised learning x Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide51 Constrained K-Means Example Assign Points to Clusters x 11/01/2006 Semi-supervised learning x Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide52 Constrained K-Means Example Re-estimate Means and Converge x 11/01/2006 Semi-supervised learning x Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide53 Datasets Data sets: UCI Iris (3 classes; 150 instances) CMU 20 Newsgroups (20 classes; 20,000 instances) Yahoo! News (20 classes; 2,340 instances) Data subsets created for experiments: Small-20 newsgroup: random sample of 100 documents from each newsgroup, created to study effect of datasize on algorithms. Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study effect of data separability on algorithms. Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x). 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide54 Evaluation Mutual information Objective function 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide55 Results: MI and Seeding Zero noise in seeds [Small-20 NewsGroup] Semi-Supervised KMeans substantially better than unsupervised KMeans 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide56 Results: Objective function and Seeding User-labeling consistent with KMeans assumptions [Small-20 NewsGroup] Obj. function of data partition increases exponentially with seed fraction 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide57 Results: Objective Function and Seeding User-labeling inconsistent with KMeans assumptions [Yahoo! News] Objective function of constrained algorithms decreases with seeding 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide58 Similarity Based Methods Questions: given a set of points and the class labels, can we learn a distance matrix such that inter-cluster distance are minimized and intra-cluster distance are maximized? 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide59 Distance metric learning Define a new distance measure of the form: Linear transformation of the original data 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide60 Distance metric learning 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide61 Semi-Supervised Clustering Example Similarity Based 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide62 Semi-Supervised Clustering Example Distances Transformed by Learned Metric 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide63 Semi-Supervised Clustering Example Clustering Result with Trained Metric 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide64 Evaluation Source: 11/01/2006 Semi-supervised learning Mining Biological Data learning E. Xing, et al. Distance metric KU EECS 800, Luke Huan, Fall’06 slide65 Evaluation Source: E. Xing, et al. Distance metric learning 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide66 Additional Readings Combining Similarity and Search-Based Semi-Supervised Clustering “Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering”, Basu, et al. Ontology based semi-supervised clustering “A framework for ontology-driven subspace clustering”, Liu et al. 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide67 Summary Semi-supervised clustering Its applications in Bioinformatics? Not fully explored 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide68 Reference UT machine learning group http://www.cs.utexas.edu/~ml/publication/unsupervised.html Semi-supervised Clustering by Seeding http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf Constrained K-means clustering with background knowledge http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf Some slides are from Jieping Ye at Arizona State 11/01/2006 Semi-supervised learning Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide69