Americo Pereira, Jan Otto
Review of feature selection techniques in bioinformatics by
Yvan Saeys, Iñaki Inza and Pedro Larrañaga.
ABSTRACT
In this paper we explain what feature selection is and what it is used for. We discuss methods for selecting
features and present them in the context of classification, that is, supervised learning. Our goal is to present
applications of these techniques in bioinformatics and to match them with the appropriate type of method. We
also discuss how to deal with small sample sizes and the problems related to them. Finally, we present some
areas in bioinformatics that can be improved using feature selection methods.
1 - INTRODUCTION
The need for feature selection (FS) techniques has grown in recent years due to the size of the data that has to
be analyzed in areas related to bioinformatics. FS can be classified as one of many dimensionality reduction
techniques. What distinguishes it from the others is that it selects a subset of variables from the original data
without transforming them.
2 - FEATURE SELECTION TECHNIQUES
The main objectives of using FS are:
- avoiding overfitting
- improving model performance
- gaining a deeper insight into the underlying processes that generated the data
Unfortunately, the optimal parameters of a model built from the full feature set are not always the same as the
optimal parameters of a model built from the feature set selected by FS. FS must therefore be used wisely, as
there is a danger of losing information, and the optimal model settings need to be re-determined for the new
set of features. Feature selection techniques can be divided into three categories, depending on how they
combine the feature selection search with the construction of the classification model: filter methods, wrapper
methods and embedded methods.
2.1 - FILTER TECHNIQUES
These techniques assess the relevance of features by looking only at the intrinsic properties of the data. In most
cases a feature relevance score is calculated, and low-scoring features are removed. The remaining subset of
features is then presented as input to the classification algorithm. Advantages of filter techniques are that they
easily scale to very high-dimensional datasets, they are computationally simple and fast, and they are
independent of the classification algorithm. A common disadvantage is that univariate filters ignore feature
dependencies; to address this, multivariate filter techniques were introduced. These are not perfect either, as
they only capture dependencies up to a certain degree.
2.2 - WRAPPER TECHNIQUES
Unlike filter techniques, wrapper methods embed the model hypothesis search within the feature subset
search. In this setup, the search is performed in the space of possible feature subsets: various candidate subsets
are generated and each is evaluated by training and testing a specific classification model. Since an exhaustive
search over all subsets takes exponential time, heuristic search algorithms are used, which can be deterministic
or randomized. A common disadvantage of these techniques is that they have a higher risk of overfitting than
filter techniques and are very computationally intensive, especially if building the classifier has a high
computational cost.
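As a hedged illustration of a deterministic wrapper (our own sketch, not the review's method), the following greedy sequential forward selection adds, at each step, the feature whose addition gives the best cross-validated accuracy of an arbitrarily chosen classifier.

# Greedy sequential forward selection: an illustrative wrapper sketch.
# The classifier and the stopping size are arbitrary choices, not prescribed by the review.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, max_features=10):
    selected, remaining = [], list(range(X.shape[1]))
    clf = KNeighborsClassifier(n_neighbors=3)
    while remaining and len(selected) < max_features:
        # Evaluate every candidate subset "selected + [f]" by cross-validation.
        scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected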
2.3 - EMBEDDED TECHNIQUES
In embedded techniques, the search for an optimal feature subset is built into the construction of the classifier
itself, so the search takes place in the combined space of feature subsets and hypotheses. A great advantage is
that they are far less computationally intensive than wrapper methods.
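As an illustration (our own sketch, not from the review), an L1-penalized linear classifier performs selection while it is being trained: features whose learned weights are driven to zero are effectively discarded.

# Embedded selection sketch: the classifier's own sparsity does the selecting.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embedded_select(X, y, C=0.1):
    """Fit an L1-regularized logistic regression and keep non-zero-weight features."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    kept = np.flatnonzero(clf.coef_.ravel() != 0.0)   # surviving features
    return clf, kept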
3 - APPLICATIONS IN BIOINFORMATICS
3.1 - SEQUENCE ANALYSIS
Sequence analysis is one of the most traditional areas of bioinformatics. The problems encountered here can be
divided into two types, which differ in the scope we are interested in. If we want to focus on general
characteristics and to reason from statistical features of the whole sequence, we are interested in performing
content analysis. On the other hand, if we want to detect the presence of a particular motif or some specified
aberration in a sequence, we are in fact interested in analyzing only a small part (or parts) of the sequence; in
that case we want to perform signal analysis. As we can imagine, in both cases (content/signal) the sequence
contains a lot of garbage: data that provides information that is either irrelevant or redundant. And that is
exactly where FS can be exploited!
3.1.1 FS dedicated to content analysis (Filter multivariate)
Because features are derived from a sequence, which is ordered, keeping them ordered is often beneficial, as
it preserves dependencies between adjacent features. That is why Markov models were used in the first
approach mentioned in the Saeys review [1]; improvements of it are still being developed, but the basic idea
stays the same. Genetic algorithms and SVMs are also used for scoring feature subsets.
3.1.2 FS dedicated to signal analysis (Wrapper)
Usually signal analysis is performed to recognize binding sites or other parts of a sequence with a special
function. For feature selection, a common approach is to relate motifs to gene expression levels and to keep the
motifs that fit their regression models best. Which motifs are discarded by FS depends on the threshold number
of misclassifications (TNoM) with respect to the regression models, and the importance of motifs can be ranked
by a P-value derived directly from the TNoM score.
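To make TNoM concrete, here is a minimal sketch (our interpretation, not code from the review): for a single feature with binary labels, TNoM is the smallest number of misclassifications achievable by any threshold that splits the samples into two classes.

# Threshold Number of Misclassifications (TNoM) for one feature, binary labels.
import numpy as np

def tnom(values, labels):
    """Smallest misclassification count over all thresholds (and both orientations)."""
    order = np.argsort(values)
    y = np.asarray(labels)[order]          # labels sorted by feature value
    n = len(y)
    best = n
    for i in range(n + 1):                 # cut after position i: left = y[:i], right = y[i:]
        left_ones = int(y[:i].sum())
        right_ones = int(y[i:].sum())
        err1 = left_ones + (n - i - right_ones)     # predict 0 left, 1 right
        err2 = (i - left_ones) + right_ones         # predict 1 left, 0 right
        best = min(best, err1, err2)
    return best

print(tnom([0.1, 0.4, 0.35, 0.8, 0.9], [0, 0, 1, 1, 1]))   # -> 1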
3.2 - MICROARRAY ANALYSIS
The main characteristics of the datasets produced by microarrays are large dimensionality combined with small
sample sizes. What makes the analysis even more challenging is that it has to cope with noise and variability.
3.2.1 Univariate (only filter)
Reasons why this approach is the most widely used:
- the output is intuitive and easy to understand
- it is faster than multivariate approaches
- it is easier to validate the selection with biological lab methods
- domain experts usually do not feel the need to consider gene interactions
The simplest heuristic techniques include setting a threshold on the observed differences in gene expression
between the states, and detecting, for each gene, the threshold point that minimizes the number of
misclassifications (TNoM). Beyond these, univariate methods have developed in two directions:
3.2.1.1 Parametric methods:
Parametric methods assume a given distribution from which the samples have been generated; before using
them, the choice of that distribution should therefore be justified. Unfortunately, sample sizes are often so
small that it is very hard to even validate such a choice. The most common choice is a Gaussian
distribution.
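A hedged sketch of the parametric route (our own illustration, assuming Gaussian-distributed expression values): a two-sample t-test per gene, with genes ranked by p-value. The synthetic matrix is a placeholder.

# Per-gene two-sample t-test under a Gaussian assumption (illustrative sketch).
import numpy as np
from scipy import stats

def t_test_ranking(X, y):
    """X: samples x genes expression matrix, y: binary labels. Rank genes by p-value."""
    group0, group1 = X[y == 0], X[y == 1]
    _, pvals = stats.ttest_ind(group0, group1, axis=0)   # one test per column/gene
    return np.argsort(pvals), pvals                       # genes ordered by significance

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 500))
y = np.array([0] * 10 + [1] * 10)
ranking, pvals = t_test_ranking(X, y)
print(ranking[:5], pvals[ranking[:5]])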
3.2.1.2 Model-free methods:
These work like parametric methods, but the null distribution is estimated from random permutations of the
data, which enhances robustness against outliers. There is also another group of non-parametric methods
which, instead of trying to identify differentially expressed genes at the whole-population level, are able to
capture genes that are significantly dysregulated in only a subset of samples. These methods can select genes
with specific patterns that are missed by the previously mentioned metrics.
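A minimal model-free sketch (our own, assuming a simple difference-in-means statistic): the null distribution for one gene is estimated by randomly permuting the class labels.

# Permutation test for one gene: no distributional assumption on expression values.
import numpy as np

def permutation_pvalue(values, labels, n_permutations=1000, seed=0):
    rng = np.random.default_rng(seed)
    values, labels = np.asarray(values), np.asarray(labels)
    observed = abs(values[labels == 1].mean() - values[labels == 0].mean())
    hits = 0
    for _ in range(n_permutations):
        perm = rng.permutation(labels)                      # break the gene/label link
        stat = abs(values[perm == 1].mean() - values[perm == 0].mean())
        hits += stat >= observed
    return (hits + 1) / (n_permutations + 1)                # smoothed p-value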
3.2.2 Multivariate
3.2.2.1 Filter methods:
The application of multivariate filter methods ranges from simple bivariate interactions towards more
advanced solutions exploring higher order interactions, such as correlation-based feature selection (CFS)
and several variants of the Markov blanket filter method.
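As an illustration of the relevance-versus-redundancy trade-off behind CFS (our own sketch of the usual merit formula, not code from the review): a subset scores well when its features correlate strongly with the class and weakly with each other.

# CFS-style merit of a feature subset: high feature-class correlation,
# low feature-feature correlation (illustrative implementation of the standard formula).
import numpy as np

def cfs_merit(X, y, subset):
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)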
3.2.2.2 Wrapper methods:
In the context of microarray analysis, most wrapper methods use population-based, randomized search
heuristics, although a few examples use sequential search techniques. An interesting hybrid filter-wrapper
approach crosses a univariately pre-ordered gene ranking with an incrementally augmenting wrapper method.
Another characteristic of any wrapper procedure is the scoring function used to evaluate each gene subset
found. As the 0–1 accuracy measure allows for comparison with previous work, the vast majority of papers use
this measure.
3.2.2.3 Embedded methods:
The embedded capacity of several classifiers to discard input features and thus propose a subset of
discriminative genes, has been exploited by several authors. Examples include the use of random forests in
an embedded way to calculate the importance of each gene. Another line of embedded FS techniques uses
the weights of each feature in linear classifiers, such as SVMs and logistic regression. These weights are
used to reflect the relevance of each gene in a multivariate way and allow the removal of genes with very
small weights.
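A hedged sketch of this weight-based idea (commonly known as SVM-RFE; the parameter choices here are our own): train a linear SVM, drop the genes with the smallest absolute weights, and repeat.

# Recursive elimination driven by linear-SVM weights (illustrative sketch, binary labels).
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep=50, drop_fraction=0.2):
    active = np.arange(X.shape[1])
    while len(active) > n_keep:
        clf = LinearSVC(C=1.0, max_iter=5000).fit(X[:, active], y)
        weights = np.abs(clf.coef_).ravel()              # relevance of each remaining gene
        n_drop = max(1, int(drop_fraction * len(active)))
        n_drop = min(n_drop, len(active) - n_keep)
        active = active[np.argsort(weights)[n_drop:]]    # remove smallest-weight genes
    return active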
Partially due to the higher computational complexity of wrapper and to a lesser degree embedded
approaches, these techniques have not received as much interest as filter proposals. However, an advisable
practice is to pre-reduce the search space using a univariate filter method, and only then apply wrapper or
embedded methods, hence fitting the computation time to the available resources.
3.3 - MASS SPECTRA ANALYSIS
Mass spectrometry technology is emerging as a new and attractive framework for disease diagnosis and
protein-based biomarker profiling. A mass spectrum sample is characterized by thousands of different
mass/charge (m/z) ratios on the x-axis, each with its corresponding signal intensity value on the y-axis. A
low-resolution profile can contain up to 15,500 data points in the spectrum between 500 and 20,000 m/z.
That is why the data analysis step is severely constrained by both high-dimensional input spaces and their
inherent sparseness, just as is the case with gene expression datasets.
3.3.1 Filter methods:
Starting from the raw data, and after an initial step to reduce noise and normalize the spectra from different
samples, we need to extract the variables that will constitute the initial pool of candidate discriminative
features. Some studies employ the simplest approach of considering every measured value as a feature
(15,000–100,000 variables!). A great deal of the current studies, on the other hand, perform aggressive
feature extraction procedures that limit the number of variables to as few as 500. FS therefore has to be
computationally cheap in this case. Similar to the domain of microarray analysis, univariate filter techniques
seem to be the most commonly used, although the use of embedded techniques is certainly emerging as an
alternative. Multivariate filter techniques, on the other hand, are still somewhat underrepresented.
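A minimal sketch of one common feature-extraction step (our own illustration, with an arbitrary bin width): summarizing a raw spectrum into fixed-width m/z bins so that the subsequent filter works on hundreds rather than tens of thousands of variables.

# Bin a raw spectrum (m/z, intensity pairs) into fixed-width m/z windows.
import numpy as np

def bin_spectrum(mz, intensity, mz_min=500.0, mz_max=20000.0, width=40.0):
    """Return one summed-intensity feature per m/z bin (illustrative widths)."""
    edges = np.arange(mz_min, mz_max + width, width)
    idx = np.digitize(mz, edges) - 1                 # bin index of every data point
    features = np.zeros(len(edges) - 1)
    for i, inten in zip(idx, intensity):
        if 0 <= i < len(features):
            features[i] += inten
    return features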
3.3.2 Wrapper methods:
In the wrapper approaches, different types of population-based randomized heuristics are used as search
engines in most of these papers: genetic algorithms, particle swarm optimization and ant colony procedures.
It is worth noting that improvements in these methods tend to aim at reducing the initial number of variables.
Variations of the popular method originally proposed for gene expression domains, which uses the weights of
the variables in the SVM formulation to discard features with small weights, have been broadly and
successfully applied in the mass spectrometry domain. Neural network classifiers (using the weights of the
input masses to rank the features' importance) and different types of decision tree-based algorithms (including
random forests) are an alternative to this strategy.
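A compact, hedged sketch of a population-based randomized wrapper (a plain genetic algorithm; every parameter and the fitness classifier are our own arbitrary choices): feature subsets are bit masks that are selected, recombined and mutated according to cross-validated accuracy.

# Genetic-algorithm wrapper over binary feature masks (illustrative sketch).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def ga_select(X, y, pop_size=20, generations=15, mutation_rate=0.02, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    clf = KNeighborsClassifier(n_neighbors=3)

    def fitness(mask):
        if mask.sum() == 0:
            return 0.0
        return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

    pop = rng.random((pop_size, n_features)) < 0.1          # sparse initial masks
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]   # keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)                # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_features) < mutation_rate  # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    best = pop[np.argmax([fitness(ind) for ind in pop])]
    return np.flatnonzero(best)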
4 - DEALING WITH SMALL SAMPLE DOMAINS
When models are built from small sample sizes, problems start to appear: the risks of overfitting and of
imprecise models grow as the amount of data used to train the model shrinks. This poses a great challenge to
many modeling problems in bioinformatics. To overcome these problems in combination with feature selection,
two approaches are used: adequate evaluation criteria, and stable and robust feature selection models.
4.1 - ADEQUATE EVALUATION CRITERIA
In some cases a discriminative subset of features is selected from the data and this same subset is used to test
the final model. Since the model is also built using this subset, the same samples are used for both training
and testing, which biases the estimate. To obtain a better estimate of the accuracy, an external feature
selection process is needed: the selection has to be repeated on the training data of each stage of testing the
model. The bolstered error estimation [2] is an example of a method that provides a good estimate of the
predictive accuracy and can deal with small sample domains.
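A minimal sketch of keeping the selection external to the evaluation (our own illustration using scikit-learn's Pipeline, not the review's setup): the filter is re-fitted inside every training fold, so the test fold never influences which features are chosen.

# Feature selection performed inside each cross-validation training fold
# (avoids the optimistic "selection bias" described above). Illustrative sketch.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),   # fitted on the training part only
    ("classify", LinearSVC(max_iter=5000)),
])

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2000))                 # small-sample, high-dimensional data
y = rng.integers(0, 2, size=40)
print(cross_val_score(pipeline, X, y, cv=5).mean())   # less biased accuracy estimate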
4.2 - ENSEMBLE FEATURE SELECTION APPROACHES
The idea of ensembles is that, instead of using just one feature selection method and accepting its outcome, we
can combine different feature selection methods (or runs) to obtain better results. This is useful because it is
not certain that a specific optimal feature subset is the only optimal one. Although ensembles are
computationally more complex and require more resources, they give decent results even for small sample
domains, so spending the extra computational resources is rewarded with better results. The random forest
method is a particular example of an ensemble that is based on a collection of decision trees; it can be used to
estimate the relevance of each feature, which helps in selecting interesting features.
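Two hedged sketches of the ensemble idea (our own illustration, not the review's code): aggregating the rankings of a simple filter over bootstrap resamples, and reading feature relevances off a random forest.

# Ensemble feature selection: aggregate rankings over bootstrap resamples,
# or use a random forest's built-in importances. Illustrative sketch.
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.ensemble import RandomForestClassifier

def bootstrap_rank_aggregation(X, y, n_bootstraps=30, seed=0):
    """Average the per-bootstrap filter ranks of every feature (lower = better)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    rank_sum = np.zeros(d)
    for _ in range(n_bootstraps):
        idx = rng.integers(0, n, size=n)                 # bootstrap resample
        scores, _ = f_classif(X[idx], y[idx])
        rank_sum += np.argsort(np.argsort(-scores))      # rank of each feature
    return np.argsort(rank_sum)                          # consensus ordering

def forest_importances(X, y):
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    return np.argsort(forest.feature_importances_)[::-1] # most relevant first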
5 - FEATURE SELECTION IN UPCOMING DOMAINS
5.1 - SINGLE NUCLEOTIDE POLYMORPHISM ANALYSIS
Single nucleotide polymorphisms (SNPs) are mutations at a single nucleotide position that occurred during
evolution and were passed on through heredity; they account for most of the genetic variation among different
individuals. They are used in many disease-gene association studies, and their number in the human genome is
estimated to be 7 million. Because of this, there is a need to select a subset of SNPs that is sufficiently
informative yet small enough to reduce the genotyping overhead, which is an important step towards
disease-gene association.
5.2 - TEXT AND LITERATURE MINING
Text and literature mining extracts information from texts, which can then be used to build classification
models. A common representation of these texts and documents is the bag-of-words, in which each word in the
text corresponds to a variable (feature) whose value is its frequency. As a consequence, for some texts the
dimensionality of the data becomes very large and the data very sparse, which is why feature selection must be
used to choose the most relevant features. Although feature selection is commonly used in text classification,
in the context of bioinformatics it is still not well developed. For the clustering and classification of biomedical
documents, it is expected that most of the methods developed by the text mining community can be reused.
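A brief sketch of bag-of-words plus feature selection for document classification (our own illustration with toy documents): each word becomes a count feature and a chi-squared filter keeps the most class-informative ones.

# Bag-of-words representation followed by chi-squared feature selection (sketch).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

documents = ["protein binds receptor", "kinase inhibits receptor",
             "stock market rises", "market prices fall"]        # toy corpus
labels = [1, 1, 0, 0]                                            # biomedical vs. not

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)          # sparse word-count matrix
selector = SelectKBest(chi2, k=3).fit(X, labels)
kept = selector.get_support(indices=True)
print([vectorizer.get_feature_names_out()[i] for i in kept])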
6 - CONCLUSION
Feature selection is becoming very important because the amount of data that has to be processed in
bioinformatics and other areas is constantly growing in both size and dimensionality. Nevertheless, not
everyone is aware of the variety of feature selection methods that exist; practitioners often pick a univariate
filter without considering other options. Prudent use of feature selection methods helps to deal with two
central issues of bioinformatics: large input dimensionality and small sample sizes. We hope the reader feels
convinced that it is worth the effort, before actually performing dimensionality reduction, to find the FS
method that best fits the problem at hand.
REFERENCES
[1] Yvan Saeys, Iñaki Inza and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. 2007.
[2] Chao Sima. High-dimensional bolstered error estimation. 2011.