Applied Machine Learning
Annalisa Marsico
OWL RNA Bioinformatics group, Max Planck Institute for Molecular Genetics
Free University of Berlin
27 May, SoSe 2015

Supervised vs Unsupervised Learning
Typical scenario: we have an outcome, quantitative (price of a stock, risk factor, ...) or categorical (heart attack yes or no), that we want to predict based on some features. We have a training set of data and build a prediction model, a learner, able to predict the outcome of new, unseen objects.
– Supervised learning: the presence of the outcome variable guides the learning process
– Unsupervised learning: we have only features, no outcome; the task is rather to describe the data

Supervised Learning
• Find the connection between two sets of observations: the input set and the output set
• Given $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in X$ (feature space), find a function $f$ (from a set of hypotheses $H$) such that $y_i = f(x_i)$ $\forall i \in [1, \ldots, N]$
• $f$ can be a linear function, a polynomial function, a classifier (e.g. logistic regression), a NN, an SVM, a random forest, ...

Unsupervised Learning
• Find a structure in the data
• Given measurements / observations / features $X = \{x_n\}$, find a model $M$ such that $p(M|X)$ is maximized, i.e. find the process that is most likely to have generated the data

Gaussian Mixture Models
$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$

Gaussian Mixture Models - EM algorithm
• Choose a number of clusters $K$
• Initialize the priors $\pi_k$, the means $\mu_k$ and the covariances $\Sigma_k$
• Repeat until convergence:
– E-step: compute the probability $p(k \mid x_n)$ of each data point $x_n$ to belong to cluster $k$
– M-step: update the parameters of the model (cluster priors $\pi_k$, means $\mu_k$ and covariances $\Sigma_k$) by taking the weighted average number / location / variance of all points, where the weight of point $x_n$ is $p(k \mid x_n)$

Semi-supervised Learning
• Middle ground between supervised and unsupervised learning
• We are in the following situation:
– A set of instances $X = \{x_n\}$ drawn from some unknown probability distribution $p(X)$
– We wish to learn a target function $f: X \rightarrow Y$ (or a learner), given a set $H$ of possible hypotheses
– Given: $L = \{(x_1, y_1), \ldots, (x_l, y_l)\}$ labeled examples and $U = \{x_{l+1}, \ldots, x_{l+u}\}$ unlabeled examples; usually $l \ll u$
– We wish to find the hypothesis (function) with the lowest error, $h^* = \operatorname{argmin}_{h \in H} err(h)$, where $err(h) \equiv \Pr_{x \in X}[h(x) \neq f(x)]$

Why the name?
• Supervised learning (classification, regression): $\{(x_{1:n}, y_{1:n})\}$
• Semi-supervised classification / regression: $\{(x_{1:l}, y_{1:l}), x_{l+1:n}\}$
• Semi-supervised clustering: $\{x_{1:n}\}$ plus partial supervision on some of the points
• Unsupervised learning (clustering, mixture models): $\{x_{1:n}\}$

Why bother?
• There are cases where labeled examples are really few
• People want better performance 'for free'
• Unlabeled data are cheap, labeled data might be hard to get:
– Drug design, i.e. effect of a drug on protein activity
– Genomics: real binding sites vs non-real binding sites
– SNP data for disease classification

Hard-to-get labels
• Image categorization of "eclipse": one could in principle label 1000+ images manually
• Nonetheless, the number of images to classify might be huge
• We will show how to improve classification by using the unlabeled examples

Semi-supervised mixture models
• First approach: modify the posterior probability $p(k \mid x_n)$ of each data point in the E-step (set $p(k \mid x_n) = 1$ if the point is labeled as positive, 0 otherwise). Example on the blackboard (miRNA promoters); see the sketch after this slide.
• Second approach: train a classifier on the labeled examples and estimate the parameters (M-step), then compute the 'expected' class for the unlabeled examples (E-step). Example on the blackboard.
• How can unlabeled data ever help? This is only one of the many ways to use unlabeled data.
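A minimal sketch of the first approach, assuming Python with NumPy/SciPy and a one-dimensional two-component mixture; the function name and the encoding of labels (-1 for unlabeled) are illustrative choices, not from the lecture. The only change to standard EM is the clamping of the responsibilities of labeled points.

```python
import numpy as np
from scipy.stats import norm

def semisupervised_gmm_em(x, labels, K=2, n_iter=100):
    """EM for a 1-D Gaussian mixture where some points carry cluster labels.

    x      : (N,) array of data points
    labels : (N,) array, cluster index in {0..K-1} for labeled points, -1 if unlabeled
    """
    N = len(x)
    pi = np.full(K, 1.0 / K)                    # cluster priors
    mu = np.random.choice(x, K, replace=False)  # initial means
    sigma = np.full(K, x.std())                 # initial standard deviations

    for _ in range(n_iter):
        # E-step: responsibilities p(k|x_n)
        resp = np.vstack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(K)]).T
        resp /= resp.sum(axis=1, keepdims=True)
        # Clamp the posteriors of labeled points: p(k|x_n)=1 for the known cluster
        is_lab = labels >= 0
        resp[is_lab] = 0.0
        resp[is_lab, labels[is_lab]] = 1.0
        # M-step: weighted updates of priors, means and variances
        Nk = resp.sum(axis=0)
        pi = Nk / N
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return pi, mu, sigma
```

Because labeled points contribute their full weight to their known cluster in the M-step, they anchor the component parameters while the unlabeled points still shape them.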
Semi-supervised regression / classification - Goal
• Use both labeled and unlabeled data to build better learners than using either one alone
• Boost the performance of a learning algorithm when only a small set of labeled data is available

Example 1 - Web page classification
We want a system which automatically classifies web pages into faculty (academic) web pages vs other pages.
– Labeled examples: pages labeled by hand
– Unlabeled examples: millions of pages on the web
– Nice property: web pages can have multiple representations (can be described by different kinds of information)

Example 1: Redundantly sufficient features
[Figure: an example faculty web page, described both by the text of a hyperlink pointing to it ("Dr. Bernhard Renard, my advisor") and by the content of the page itself.]
Setting: redundant features, i.e. the description of each example can be partitioned into distinct views.

Co-training, multi-view learning, co-regularization - Main idea
• In some settings data features are redundant:
– We can train different classifiers on disjoint features
– The classifiers should agree on unlabeled examples
– We can use unlabeled data to constrain the joint training of both classifiers
– Different algorithms differ in the way they constrain the classifiers

The Co-training algorithm - Variant 1
Assumptions:
– Either view of the features is sufficient for the learning task
– Compatibility assumption (a strong one): the classifiers in each view agree on the labels of most unlabeled examples
– Independence assumption: the views are independent given the class labels (conditional independence)

The Co-training algorithm - Variant 1
Given: labeled data L, unlabeled data U
Loop:
• Train g1 (hyperlink classifier) using L
• Train g2 (page classifier) using L
• Sample N1 points from U, let g1 label p1 positives and n1 negatives
• Sample N2 points from U, let g2 label p2 positives and n2 negatives
• Add all (N1 + N2) self-labeled examples to L
A. Blum & T. Mitchell, 1998, 'Combining Labeled and Unlabeled Data with Co-Training'

The Co-training algorithm - Variant 2
Given: labeled data L, unlabeled data U
Loop:
• Train g1 (hyperlink classifier) using L
• Train g2 (page classifier) using L
• Sample N1 points from U, let g1 label p1 positives and n1 negatives
• Sample N2 points from U, let g2 label p2 positives and n2 negatives
• Add only the self-labeled n1 < N1 examples where g1 is most confident and the n2 < N2 examples where g2 is most confident to L (a code sketch of this loop follows the next slides)
A. Blum & T. Mitchell, 1998, 'Combining Labeled and Unlabeled Data with Co-Training'

The Co-training algorithm - Results
Web-page classification task. Experiment:
– 1051 web pages
– 203 (25%) left out (randomly) for testing
– 12 out of the remaining 848 are labeled (L)
– 5 experiments conducted using different test+training splits
[Figure: plot of the error on the test set vs the number of iterations; thanks to co-training, the errors converge after a while.]
A. Blum & T. Mitchell, 1998

The Co-training algorithm - Intuition and limitations
• If the hyperlink classifier finds a page it is highly confident about, then it can share it with the other classifier (the page classifier). This is how they benefit from each other.
• Starting from a set of labeled examples, they co-train each other at each iteration (this has been described as a 'greedy' EM)
• Limitation: even if each classifier picks the examples it is most confident about, it ignores the other view
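A minimal sketch of the co-training loop, closest to Variant 2, assuming Python with scikit-learn; the choice of naive Bayes base classifiers, the confidence heuristic and all names are illustrative, not prescribed by Blum & Mitchell.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X1, X2, y, n_rounds=10, per_round=2):
    """Co-training on two feature views X1, X2; y holds -1 for unlabeled points."""
    g1, g2 = GaussianNB(), GaussianNB()
    y = y.copy()
    for _ in range(n_rounds):
        L = y != -1                              # current labeled set
        U = np.where(~L)[0]                      # indices of unlabeled points
        if len(U) == 0:
            break
        g1.fit(X1[L], y[L])                      # train one classifier per view on L
        g2.fit(X2[L], y[L])
        for g, X in ((g1, X1), (g2, X2)):
            if len(U) == 0:
                break
            conf = g.predict_proba(X[U]).max(axis=1)
            # self-label the unlabeled points this view is most confident about
            top = U[np.argsort(conf)[-per_round:]]
            y[top] = g.predict(X[top])
            U = np.setdiff1d(U, top)
    return g1, g2, y
```

In each round, each view donates its most confident self-labeled points to the shared labeled pool; this is exactly how the two classifiers 'share' confident examples with each other.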
Multi-view learning (Co-regularization) - Motivation
• It relaxes the compatibility assumption:
– If the classifiers don't agree on unlabeled examples, then we have noisy data
– What is a way to reduce noise (variance) in the data?

Multi-view learning (Co-regularization)
• A framework where the classifiers are learnt in each view through forms of multi-view regularization
• We are in the case where $g_1(x) \neq g_2(x)$ for some unlabeled $x$
• Joint regularization to minimize the disagreement between them:
$\langle g_1, g_2 \rangle \leftarrow \operatorname*{argmin}_{(\theta_1, \theta_2)} \sum_{i \in L} \big(y_i - g_1(x_i; \theta_1)\big)^2 + \sum_{i \in L} \big(y_i - g_2(x_i; \theta_2)\big)^2 + \mu \sum_{i \in U} \big(g_1(x_i; \theta_1) - g_2(x_i; \theta_2)\big)^2$
where the last sum is the penalization term (a minimal implementation sketch appears at the end of this section)

Co-regularization - Intuition and limitations
• Builds on the idea that instead of making g1 and g2 agree on examples afterwards, the agreement must go into the objective function we are optimizing
• Unlabeled data are incorporated into the regularization
• The algorithms are not greedy (Mmm..)
• I can decide how much compatibility I want to force (i.e. how much I want to penalize disagreement). Any idea how?
• Limitation: no good implementation so far

Essential ingredients for Co-training / Co-regularization - Summary
• A big number of unlabeled examples
• Multiple views (particularly suitable for biology)
• Conditional independence
• A good way to solve your optimization problem when you have joint regularization

Semi-supervised learning in Biology
• Regulatory element prediction (e.g. promoters)
• Protein function prediction
• Diseases hard to diagnose (few labels)
• Long non-coding RNA function prediction
• miRNA target prediction, ...
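As referenced in the co-regularization slide above, here is a minimal sketch of the joint objective minimized by plain gradient descent, assuming Python with NumPy and two linear regressors; all names, the learning rate and the default disagreement weight mu are illustrative assumptions, not from the lecture.

```python
import numpy as np

def co_regularized_fit(X1_L, X2_L, y_L, X1_U, X2_U, mu=1.0, lr=0.01, n_iter=2000):
    """Jointly fit linear models w1, w2 (one per view) by gradient descent on
    sum_L (y - X1 w1)^2 + sum_L (y - X2 w2)^2 + mu * sum_U (X1 w1 - X2 w2)^2."""
    w1 = np.zeros(X1_L.shape[1])
    w2 = np.zeros(X2_L.shape[1])
    for _ in range(n_iter):
        r1 = X1_L @ w1 - y_L                            # labeled residuals, view 1
        r2 = X2_L @ w2 - y_L                            # labeled residuals, view 2
        d = X1_U @ w1 - X2_U @ w2                       # disagreement on unlabeled points
        w1 -= lr * 2 * (X1_L.T @ r1 + mu * X1_U.T @ d)  # gradient step for w1
        w2 -= lr * 2 * (X2_L.T @ r2 - mu * X2_U.T @ d)  # gradient step for w2
    return w1, w2
```

The weight mu is one answer to the "any idea how?" question on the intuition slide: it is the knob that decides how strongly disagreement between the two views on unlabeled data is penalized.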