Why clustering and classification?
Advice from Charlotte Soneson, Qlucore
Supervised vs unsupervised
When working with machine learning methods it is important to
distinguish between supervised and unsupervised methods, since they
are used in very different circumstances. Unsupervised methods do
not use any information about the samples (annotations), and rather
try to find dominating structure and patterns in the data, patterns that
can then be interpreted. Clustering is an example of an unsupervised
method, where the goal is to find subgroups in the data (without
using any sample annotation information). PCA is also unsupervised.
Supervised methods typically aim to build models that explain or
predict some pre-specified sample annotation. This annotation may
or may not correspond to the main pattern in the data. Classification,
or predictive modeling, is an example of supervised learning. Given
some data and a sample annotation, the aim is to build a model from
the data that is able to predict the value of the sample annotation in a
new sample for which we are only given the data.
It is important to recognize whether the goal of a study requires a
supervised or unsupervised approach. For example, if the goal is to
build a model that can predict the disease status of a patient, one
should use a supervised approach. Using an unsupervised approach
like clustering or PCA will likely mix the signal that we are interested
in with other, unrelated, signals and give a worse predictor, unless
the disease status is the main signal in the data. On the other hand, if
the goal is to get an overview of a data set, to see which are the
strongest signals and if the samples naturally group into subgroups,
an unsupervised method like clustering should be used.
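To make this concrete, here is a minimal sketch in plain Python (not Qlucore). The data are invented for illustration: a hypothetical batch effect is the strongest signal, and a hypothetical disease annotation is a weaker one. The k-means routine and the nearest-class-mean rule are simple stand-ins chosen for brevity, not the methods of any particular software.

```python
import random

random.seed(0)

# Hypothetical data, invented for illustration: feature 0 carries a strong
# batch effect (shift of 4), feature 1 a weaker disease signal (shift of 2).
samples, batch, disease = [], [], []
for b in (0, 1):
    for d in (0, 1):
        for _ in range(25):
            samples.append((4 * b + random.gauss(0, 0.2),
                            2 * d + random.gauss(0, 0.2)))
            batch.append(b)
            disease.append(d)

def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def two_means(points, n_iter=20):
    """k-means with k=2, started from the first point and the point farthest from it."""
    centroids = [points[0], max(points, key=lambda p: sqdist(p, points[0]))]
    for _ in range(n_iter):
        labels = [min((0, 1), key=lambda k: sqdist(p, centroids[k])) for p in points]
        centroids = [mean_point([p for p, lab in zip(points, labels) if lab == k])
                     for k in (0, 1)]
    return labels

def agreement(labels, annotation):
    """Fraction of samples whose cluster matches the annotation (cluster names may be swapped)."""
    same = sum(lab == a for lab, a in zip(labels, annotation)) / len(labels)
    return max(same, 1 - same)

# Unsupervised: the clusters follow the strongest signal, here the batch.
clusters = two_means(samples)
batch_agreement = agreement(clusters, batch)
disease_agreement = agreement(clusters, disease)

# Supervised: nearest class mean, built on half the samples, checked on the other half.
train = range(0, len(samples), 2)
test = range(1, len(samples), 2)
class_means = {d: mean_point([samples[i] for i in train if disease[i] == d])
               for d in (0, 1)}
predicted = [min((0, 1), key=lambda d: sqdist(samples[i], class_means[d])) for i in test]
test_accuracy = sum(p == disease[i] for p, i in zip(predicted, test)) / len(test)

print("clusters vs batch:  ", batch_agreement)
print("clusters vs disease:", disease_agreement)
print("supervised accuracy on held-out samples:", test_accuracy)
```

In this toy setting the clusters reproduce the batch annotation and are no better than chance with respect to disease status, while the supervised rule, which is told which annotation to look for, recovers it.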
One thing to keep in mind when we use supervised methods is that
since we are explicitly looking for patterns that are associated with a
given annotation, we will most certainly find something that can
predict the annotation in the current data set. However, this is not
what we are interested in (since we already know the annotation
values in this data set). We are interested in seeing whether the
derived model can predict the value of the annotation in an
independent data set, where we have only the data, but no
information about the annotation. Thus, supervised models must
always be validated in an independent data set (a good predictive
performance in the current data says nothing at all about how well
the model will do on new data). A model that cannot predict the
correct annotation values in independent data is not good. Such
validation is usually not necessary for unsupervised methods, which
are usually used to summarize, explore and describe a data set.
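The point about validation can be illustrated with a small sketch in plain Python. The data below are pure noise with coin-flip annotations, so there is nothing real to learn, and the 1-nearest-neighbour rule is only an assumed example of a flexible model. All names are hypothetical.

```python
import random

random.seed(1)

def random_dataset(n_samples, n_features=20):
    """Pure-noise features with coin-flip labels: there is no real signal."""
    data = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    labels = [random.randint(0, 1) for _ in range(n_samples)]
    return data, labels

def predict_1nn(train_data, train_labels, sample):
    """Return the label of the training sample closest to `sample`."""
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    nearest = min(range(len(train_data)), key=lambda i: sqdist(train_data[i], sample))
    return train_labels[nearest]

def accuracy(train_data, train_labels, data, labels):
    """Fraction of `data` whose predicted label equals the true label."""
    hits = sum(predict_1nn(train_data, train_labels, x) == y
               for x, y in zip(data, labels))
    return hits / len(data)

train_x, train_y = random_dataset(100)
test_x, test_y = random_dataset(200)  # independent data from the same (noise) source

train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(train_x, train_y, test_x, test_y)
print(f"accuracy on the data used to build the model: {train_acc:.2f}")
print(f"accuracy on independent data: {test_acc:.2f}")
```

On the data used to build the model, every sample is its own nearest neighbour, so the accuracy is perfect even though the annotation is random. On the independent data the accuracy falls back to roughly what guessing would give.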
Clustering
In Qlucore we have two types of clustering methods: hierarchical
clustering (in the heatmaps) and k-means clustering. Both are used
for the same purpose: to find subgroups among the samples and to
see whether the samples naturally distribute themselves into distinct
clusters. The difference is that the hierarchical clustering builds a
“cluster tree” (or a dendrogram), which organizes the samples
hierarchically but does not directly divide them into clusters, while
the k-means splits the samples into a pre-defined number of groups.
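The difference in output can be sketched as follows in plain Python. This is not Qlucore's implementation: single linkage, the one-dimensional toy values and the k-means starting points are all assumptions made to keep the example short.

```python
def single_linkage_merges(values):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Returns the sequence of merges (the information a dendrogram displays),
    not a division into a fixed number of clusters.
    """
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        # Single linkage: cluster distance = smallest pairwise distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(abs(x - y)
                               for x in clusters[ab[0]] for y in clusters[ab[1]]),
        )
        merges.append((sorted(clusters[i]), sorted(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return merges

def k_means(values, k, n_iter=50):
    """k-means in one dimension, started from k evenly spaced sorted values."""
    ordered = sorted(values)
    centroids = [ordered[(len(ordered) - 1) * c // max(k - 1, 1)] for c in range(k)]
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return labels

values = [1.0, 1.5, 1.2, 5.0, 5.8, 9.0, 9.4]  # invented measurements

merges = single_linkage_merges(values)
for left, right in merges:
    print("merge", left, "+", right)

labels = k_means(values, k=3)
print("k-means labels:", labels)
```

The hierarchical routine returns a merge order from which any number of groups could be read off afterwards, whereas the k-means routine must be told the number of groups in advance and returns one label per sample.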
Practical situations where one would like to use a clustering
approach are e.g.:
- to see whether there are subtypes of a particular disease, i.e., if
the samples group into different clusters. These clusters may
represent different disease types, which have different
prognoses and behavior.
- to explore the data set and look for artifacts. This can be done
by clustering the data and seeing whether the clusters agree with
the signals that one expects to be the dominating ones, or if
they rather correspond to batch effects or other technical
artifacts.
Classification
Classification models consist of two parts: the variables that are used
and a rule to combine the values of these variables in order to obtain
a predicted value of a given sample annotation. Both are important,
and are usually determined together.
Practical situations where one would like to use a classification
approach are e.g.:
- to build a model that can use gene expression data to predict
the prognosis of a cancer patient
- to build a model that can use some numeric data to assign a
sample to one of several disease subtypes
As noted above, it is important that a predictive model is evaluated
on independent data, and not on the same data where it was built.
Overfitting refers to the situation where a model is “too specifically
adapted” to a given data set, and does not generalize to other data
sets. Usually this is a sign that the model has fitted the random
noise in the training data set rather than the underlying signal, so
that it fits well specifically to this data. The noise in an
independent data set will likely be different, and then the model does
not work any more.
Cross-validation is a technique that can be used to evaluate a model
based on a single data set. Basically, the idea is to split the
entire data set into a training set and a test set (multiple times),
each time building the model on the training part and evaluating the
performance on the test part (which was not used to build the model).
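A minimal sketch of this procedure in plain Python is given below. The k-fold splitting scheme, the function names, the toy nearest-class-mean model and the simulated data are all assumptions made for illustration. Other ways of splitting the data exist.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and deal them into n_folds disjoint test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]

def cross_validate(data, labels, fit, predict, n_folds=5):
    """Average test accuracy over the folds; each sample is tested exactly once."""
    accuracies = []
    for test_idx in k_fold_indices(len(data), n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(data)) if i not in held_out]
        # The model only ever sees the training part.
        model = fit([data[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, data[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / len(accuracies)

def fit_class_means(values, labels):
    """Toy model: the mean value of each class in the training part."""
    means = {}
    for label in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == label]
        means[label] = sum(members) / len(members)
    return means

def predict_nearest_mean(means, value):
    """Assign the class whose training mean is closest to `value`."""
    return min(means, key=lambda label: abs(value - means[label]))

# Simulated one-variable data: the two classes differ by 3 units, noise sd 1.
rng = random.Random(42)
labels = [i % 2 for i in range(60)]
data = [rng.gauss(3.0 * label, 1.0) for label in labels]

cv_accuracy = cross_validate(data, labels, fit_class_means, predict_nearest_mean)
print(f"cross-validated accuracy: {cv_accuracy:.2f}")
```

Everything that is learned from the data, including the choice of variables if there is one, belongs inside `fit`, so that the test part of each split stays untouched until the evaluation.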
The word classification is usually used to describe predictive
modeling where the sample annotation is categorical. To predict a
numeric/continuous annotation, one uses regression.
END
Author: Charlotte Soneson, Qlucore
http://www.qlucore.com/