Applied Machine Learning
Annalisa Marsico
OWL RNA Bioinformatics group
Max Planck Institute for Molecular Genetics
Free University of Berlin
27 May, SoSe 2015
Supervised vs Unsupervised Learning
Typical Scenario
We have an outcome, quantitative (price of a stock, risk factor, ...) or
categorical (heart attack: yes or no), that we want to predict based on some
features. We have a training set of data and build a prediction model, a
learner, able to predict the outcome of new, unseen objects.
- Supervised learning: the presence of the outcome variable guides the
learning process
- Unsupervised learning: we have only features, no outcome
- The task is rather to describe the data
Supervised Learning
• Find the connection between two sets of observations: the
input set and the output set
• Given {(x_n, y_n)}, where x_n ∈ X (feature space), find a function f
(from a set of hypotheses H) such that ∀ n ∈ [1, ..., N]:
y_n = f(x_n)
f can be a linear function, a polynomial function, a classifier (e.g. logistic
regression), a neural network, an SVM, a random forest, ...
Unsupervised learning
• Find a structure in the data
• Given X = {x_n} measurements / observations / features,
find a model M such that p(M | X) is maximized,
i.e. find the process that is most likely to have generated the data
Gaussian Mixture Models
p(x) = Σ_k π_k N(x | μ_k, Σ_k)
Gaussian Mixture models
• Choose a number of clusters K
• Initialize the priors π_k, the means μ_k and the covariances Σ_k
• Repeat until convergence:
– E-step: compute the probability p(k | x_n) of each data point x_n to
belong to cluster k
– M-step: update the parameters of the model (cluster priors π_k, means μ_k
and covariances Σ_k) by taking the weighted average number / location /
variance of all points, where the weight of point x_n is p(k | x_n)
EM algorithm
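The E-step / M-step loop above can be sketched in a few lines of NumPy for the one-dimensional case. This is an illustrative reconstruction, not code from the lecture: the quantile-based initialization, the fixed iteration count in place of a convergence check, and all names are assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    # density of a univariate Gaussian N(x | mu, var)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm(x, K, n_iter=50):
    # initialization: uniform priors, means spread over data quantiles,
    # variances set to the overall variance (all heuristic choices)
    pi = np.full(K, 1.0 / K)
    mu = np.quantile(x, (np.arange(K) + 1) / (K + 1))
    var = np.full(K, x.var())
    for _ in range(n_iter):
        # E-step: responsibility p(k | x_n) of cluster k for point x_n
        r = pi * gaussian_pdf(x[:, None], mu, var)      # shape (N, K)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted updates of priors, means and variances
        Nk = r.sum(axis=0)
        pi = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var
```

A production version would monitor the log-likelihood for convergence and guard against collapsing variances.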
Gaussian Mixture models
Semi-supervised learning
• Middle ground between supervised and unsupervised
learning
• We are in the following situation:
– Set of instances X = {x_n} drawn from some unknown probability
distribution p(X)
– Wish to learn a target function f : X ⟶ Y (or a learner), given a set H of
possible hypotheses and:
L = {(x_1, y_1), ..., (x_l, y_l)} labeled examples
U = {x_{l+1}, ..., x_{l+u}} unlabeled examples
Usually l ≪ u
Wish to find the hypothesis (function) with the lowest error:
err(h) ≡ Pr_{x ∈ X} [h(x) ≠ f(x)]
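The error of a hypothesis can be estimated on a finite sample as the fraction of examples on which it disagrees with the observed labels. A minimal sketch; the threshold hypothesis `h` is purely illustrative:

```python
def empirical_error(h, xs, ys):
    # fraction of examples where the hypothesis disagrees with the label
    return sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)

# a toy hypothesis from some hypothesis set H (illustrative)
h = lambda x: 1 if x > 0.5 else 0
```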
Why the name?
Supervised learning
• Classification, regression: data {(x_1 : y_1), ..., (x_n : y_n)}, every example labeled
Semi-supervised classification / regression
• Labeled data {(x_1 : y_1), ..., (x_l : y_l)} plus unlabeled data {x_{l+1}, ..., x_{l+u}}
Semi-supervised clustering
• Mostly unlabeled data, plus a few labels or pairwise constraints
Unsupervised learning
• Clustering, mixture models: data {x_1, ..., x_n}, no labels at all
Why bother?
• There are cases where labeled examples are really few
• People want better performance ‘for free’
• Unlabeled data are cheap, labeled data might be hard to get:
– Drug design, e.g. effect of a drug on protein activity
– Genomics: real binding sites vs non-real binding sites
– SNP data for disease classification
Hard-to-get labels
Image categorization of “eclipse”: one could in principle
label 1000+ images manually.
Nonetheless, the number of images to classify might be huge.
We will show how to improve classification
by using the unlabeled examples.
Semi-supervised mixture models
• First approach: modify the posterior probability p(k | x_n)
of each data point in the E-step (set p(k | x_n) = 1 if point x_n is
labeled as positive, 0 otherwise). Example on the
blackboard (miRNA promoters)
• Second approach: train a classifier on the labeled examples
and estimate the parameters (M-step). Compute the
‘expected’ class for the unlabeled examples (E-step). Example
on the blackboard.
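The first approach amounts to a one-line change to the E-step: after computing the usual responsibilities, clamp those of the labeled points. A one-dimensional illustrative sketch; the `labels` mapping (data index to known cluster) and all names are assumptions, not from the lecture.

```python
import numpy as np

def e_step_with_labels(x, pi, mu, var, labels):
    # standard responsibilities p(k | x_n) under the current 1-D model
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)
    # clamp labeled points: probability 1 for the known cluster, 0 elsewhere
    for n, k in labels.items():
        r[n] = 0.0
        r[n, k] = 1.0
    return r
```

The M-step then proceeds unchanged, so labeled points pull their cluster's parameters with full weight.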
How can unlabeled data ever help?
This is only one of the many ways to use unlabeled data
Semi-supervised regression /
classification
Goal
Using both labeled and unlabeled data to build
better learners than using either alone
– Boost the performance of a learning algorithm
when only a small set of labeled data is available
Example 1 – web page classification
We want a system which automatically classifies web pages
into faculty (academic) web pages vs other pages
– Labeled examples: pages labeled by hand
– Unlabeled examples: millions of pages on the web
– Nice features: web pages can have multiple representations
(can be described by different kinds of information)
Example 1: Redundantly sufficient
features
(Figure: a faculty web page for Dr. Bernhard Renard, linked by the
hyperlink text “my advisor”.)
Setting: redundant features, i.e. the description of each example can be
partitioned into distinct views
Co-training, multi-view learning, Co-regularization – Main idea
In some settings, data features are redundant:
– We can train different classifiers on disjoint feature sets
– The classifiers should agree on the unlabeled examples
– We can use unlabeled data to constrain the joint training of both
classifiers
– Different algorithms differ in the way they constrain the
classifiers
The Co-training algorithm – Variant 1
Assumptions:
– Either view of the features is sufficient for the
learning task
– Compatibility assumption (a strong one): the
classifiers trained in each view agree on the labels of most
unlabeled examples
– Independence assumption: the views are
independent given the class labels (conditional
independence)
The Co-training algorithm – Variant 1
Given:
• Labeled data L
• Unlabeled data U
Loop
• Train g1 (hyperlink classifier) using L
• Train g2 (page classifier) using L
• Sample N1 points from U, let g1 label p1 positives and
n1 negatives
• Sample N2 points from U, let g2 label p2 positives and
n2 negatives
• Add self-labeled (N1+N2) examples to L
A. Blum & T. Mitchell, 1998, ‘Combining Labeled and Unlabeled Data with Co-Training’
The Co-training algorithm – Variant 2
Given:
• Labeled data L
• Unlabeled data U
Loop
• Train g1 (hyperlink classifier) using L
• Train g2 (page classifier) using L
• Sample N1 points from U, let g1 label p1 positives and
n1 negatives
• Sample N2 points from U, let g2 label p2 positives and
n2 negatives
• Add to L only the self-labeled examples where each classifier is most
confident: n1 < N1 examples from g1 and n2 < N2 examples from g2
A. Blum & T. Mitchell, 1998, ‘Combining Labeled and Unlabeled Data with Co-Training’
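The loop of Variant 2 can be sketched as follows. This is an illustrative reconstruction, not Blum & Mitchell's code: the classifier interface (`fit` / `predict_proba`), the tie-breaking when both views nominate the same point, and all names are assumptions.

```python
import numpy as np

def co_train(g1, g2, X1_l, X2_l, y_l, X1_u, X2_u, n_rounds=5, n_add=2):
    # L = (X1_l, X2_l, y_l): labeled examples seen under both views
    # U = (X1_u, X2_u): unlabeled examples under both views
    X1_l, X2_l, y_l = list(X1_l), list(X2_l), list(y_l)
    X1_u, X2_u = list(X1_u), list(X2_u)
    for _ in range(n_rounds):
        if not X1_u:
            break
        g1.fit(np.array(X1_l), np.array(y_l))   # view 1, e.g. hyperlink text
        g2.fit(np.array(X2_l), np.array(y_l))   # view 2, e.g. page text
        p1 = g1.predict_proba(np.array(X1_u))
        p2 = g2.predict_proba(np.array(X2_u))
        # each classifier nominates the n_add unlabeled points it is most
        # confident about; if both pick the same point, g2's label wins here
        picks = {}
        for p in (p1, p2):
            for i in np.argsort(p.max(axis=1))[-n_add:]:
                picks[int(i)] = int(p[i].argmax())
        # move the self-labeled examples from U to L (highest index first)
        for i in sorted(picks, reverse=True):
            X1_l.append(X1_u.pop(i))
            X2_l.append(X2_u.pop(i))
            y_l.append(picks[i])
    return g1, g2, y_l
```

Any pair of probabilistic classifiers with this interface (e.g. two naive Bayes models, as in the web-page experiment) can be plugged in for g1 and g2.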
The Co-training algorithm – Results
Web-page classification task:
– Experiment: 1051 web pages
– 203 (25%) left out (randomly) for testing
– 12 out of the remaining 848 are labeled (L)
– 5 experiments conducted using different test+training splits
(Figure: plot of the error on the test set vs the number of iterations;
thanks to co-training, the errors converge after a while.)
A. Blum & T. Mitchell, 1998
The Co-training algorithm – Intuition
and limitations
• If the hyperlink classifier finds a page it is highly confident
about, then it can share it with the other classifier (the page
classifier). This is how they benefit from each other.
• Starting from a set of labeled examples, they co-train each
other at each iteration (a procedure described as ‘greedy’ EM)
• Limitation: even if each classifier picks the examples it is most
confident about, it ignores the other view
Multi-view learning (Co-regularization)
– Motivation
• It relaxes the compatibility assumption:
– If the classifiers don’t agree on the unlabeled examples, then we have
noisy data.
– What is a way to reduce noise (variance) in the data?
Multi-view learning (Co-regularization)
A framework where classifiers are learnt in each view through
forms of multi-view regularization:
– We are in the case where f1(x) ≠ f2(x) for some examples
– Joint regularization to minimize the disagreement between them:

<f1*, f2*> ← argmin over (f1, f2) of
    Σ_{i ∈ L} (y_i − f1(x_i; θ1))²
  + Σ_{i ∈ L} (y_i − f2(x_i; θ2))²
  + λ Σ_{i ∈ U} (f1(x_i; θ1) − f2(x_i; θ2))²   ← penalization term
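The joint objective can be written down directly. A minimal sketch, where `lam` plays the role of the penalization weight and `f1`, `f2` are arbitrary predictors; all names are illustrative:

```python
def coreg_loss(f1, f2, labeled, unlabeled, lam=1.0):
    # squared loss of each view's predictor on the labeled set
    fit1 = sum((y - f1(x)) ** 2 for x, y in labeled)
    fit2 = sum((y - f2(x)) ** 2 for x, y in labeled)
    # disagreement penalty between the two views on the unlabeled set
    disagree = sum((f1(x) - f2(x)) ** 2 for x in unlabeled)
    return fit1 + fit2 + lam * disagree
```

Minimizing this over the parameters of f1 and f2 (e.g. by gradient descent) trains both views jointly; lam controls how strongly agreement on the unlabeled data is enforced.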
Co-regularization – Intuition and
Limitations
• Builds on the idea that, instead of making g1 and g2 agree on
examples afterwards, the agreement is built into the objective
function we are optimizing.
• Unlabeled data are incorporated into the regularization term
• The algorithms are not greedy (Mmm..)
• I can decide how much compatibility I want to force (i.e. how
much I want to penalize disagreement). Any idea how?
• Limitation: no good implementation so far
Essential ingredients for Co-training /
Co-regularization – Summary
• A big number of unlabeled examples
• Multi-view data (particularly suitable for biology)
• Conditional independence of the views
• A good way to solve your optimization
problem when you have joint regularization
Semi-supervised learning in Biology
• Regulatory elements prediction (e.g.
promoter)
• Protein function prediction
• Diseases hard to diagnose (few labels)
• Long non-coding RNA function prediction
• miRNA target prediction, ...