Learning Distance Functions For
Gene Expression Data
Diploma Thesis in Bioinformatics
submitted
by
Torsten Schön
born 23rd February, 1986 in Wassertrüdingen
Written at
Department of Bioinformatics
Hochschule Weihenstephan-Triesdorf in Freising
in Cooperation with
Siemens AG, CT T DE TC 4, Erlangen
Advisors
Prof. Dr. Martin Stetter, Dr. Alexey Tsymbal (Siemens AG)
Started
01. September 2009
Finished
26. February 2010
Declaration in Lieu of an Oath (Eidesstattliche Erklärung):
I hereby declare in lieu of an oath that I have written this thesis myself, without outside help, and that it has not previously been submitted elsewhere for examination purposes. No sources or aids other than those stated were used. Literal and paraphrased quotations are marked as such.
Erlangen, ...................................
...................................
Signature
Abstract
This thesis addresses problems of classifying genetic data with distance function learners.
Two common learning algorithms, plain k-Nearest Neighbour (with the canonical Euclidean distance) and Random Forest, are compared with two distance function learning-based techniques, learning from equivalence constraints and the intrinsic Random Forest
Similarity on different benchmark datasets. These datasets include gene expression data
for patients with Breast Cancer, Colon Cancer, Embryonal Tumours, Brain Tumours,
Leukemia, Lung Cancer, Lupus and Lymphoma. Each dataset contains healthy subjects,
too. First, seven established and two novel distance functions are evaluated for learning
from equivalence constraints in the difference space. To consider gene interactions in the
classification algorithms, the original datasets are transformed into a new representation,
comprising gene-pairs and not single genes. All combinations of pairs between the genes
of the original datasets are constructed. The most discriminative gene pairs are selected
and the new representation is evaluated on the benchmark datasets. The novel gene-pair
representation is shown to increase the accuracy for genetic datasets. Based on the gene-pair representation, the GeneOntology semantic similarity of the gene pairs is calculated
with different methods and is first used for feature weighting. A comparison of eight approaches is presented, in which one new algorithm for calculating the semantic similarity between two genes is introduced. Further, the semantic similarity is used to pre-select pairs with a high similarity value. The GeneOntology-based feature selection approach is compared to the common feature selection and is shown to increase the accuracy on average
over the datasets.
Zusammenfassung (Summary)
This diploma thesis deals with the challenges of classifying genetic data, specifically by means of distance-function-based algorithms. Two established learning algorithms, k-Nearest Neighbour and Random Forest, are compared with two distance-based methods, learning from equivalence constraints and the intrinsic Random Forest Similarity, and tested on several benchmark datasets. These benchmark datasets comprise gene expression data from patients with breast cancer, colon cancer, embryonal tumours, leukemia, lung cancer, lupus and malignant lymphoma, as well as gene expression data from healthy subjects. First, seven established and two newly developed distance functions are evaluated for classification with equivalence constraints. To take the interplay between genes into account during classification, the datasets are transformed into a novel representation of gene pairs. To this end, all pairwise combinations of the individual genes of the datasets are constructed. After selecting the pairs most important for class discrimination, the new representation is validated on the benchmark datasets and compared with the original representation, showing an improvement of the classification accuracy for genetic datasets. Subsequently, the semantic similarity of the two elements of each pair is calculated with different methods based on the GeneOntology, and this similarity is incorporated into the classification as a weighting of the pairs. A comparison of eight different methods for calculating this similarity is presented, one of which is newly introduced. Finally, the semantic similarity is used to pre-select pairs with a very high similarity value. This kind of pair selection is compared with the previously used method and shows an improvement of the mean classification result over the tested datasets.
Contents

1 Introduction
2 Related Work
  2.1 Machine Learning with Gene Expression Data
  2.2 Learning Distance Functions
  2.3 Semantic Similarity Calculations Based on Gene Ontology
3 Material
  3.1 Biological and Medical Aspects
    3.1.1 Gene Expression
    3.1.2 Genetic Microarray Experiments
  3.2 Machine Learning
    3.2.1 Supervised Learning
    3.2.2 Unsupervised Learning
    3.2.3 Random Forest
    3.2.4 k-Nearest Neighbour Classification
    3.2.5 AdaBoost
    3.2.6 Learning Distance Functions
    3.2.7 Feature Selection
    3.2.8 Cross Validation
  3.3 Weka - A Machine Learning Framework in Java
    3.3.1 The ARFF Format
    3.3.2 The Structure of the Weka Framework
  3.4 Distance Learning Framework
  3.5 Gene Ontology
    3.5.1 The Key Components of the Ontology
    3.5.2 The GO File Format: OBO
    3.5.3 The GO Annotation Database
  3.6 Gene Ontology API for Java
  3.7 NCBI EUtils
  3.8 Benchmark Datasets
  3.9 RSCTC'2010 Discovery Challenge
    3.9.1 Basic Track
    3.9.2 Advanced Track
4 Methods
  4.1 Reorganization of the Distance Learning Framework
  4.2 Distance Function Learning From Equivalence Constraints
    4.2.1 The L1 Distance and Modifications of the L1 Distance
    4.2.2 The Simplified Mahalanobis Distance
    4.2.3 The Chi-Square Distance
    4.2.4 The Weighted Frequency Distance
    4.2.5 The Canberra Distance
    4.2.6 The Variance Threshold Distance
    4.2.7 Test Configurations
  4.3 Transformation of Feature Representation for Gene Expression Data
    4.3.1 Motivation
    4.3.2 Framework Updates
    4.3.3 Test Configurations
  4.4 Integration of GO Semantic Similarity
    4.4.1 Motivation
    4.4.2 Framework Modifications
    4.4.3 GO-Based Feature Weighting
    4.4.4 GO-Based Feature Selection
    4.4.5 Test Configurations
  4.5 RSCTC'2010 Discovery Challenge
    4.5.1 Basic Track
    4.5.2 Advanced Track
5 Empirical Analysis
  5.1 Distance Comparison for Learning From Equivalence Constraints in the Difference Space
  5.2 Transformation of Representation for Gene Expression Data
    5.2.1 Analysis of the Genetic Datasets
    5.2.2 Analysis of the Non-Genetic Datasets
    5.2.3 Robustness of the Gene-Pair Representation to Noise
    5.2.4 Benefits and Limitations
  5.3 Integration of Gene Ontology Semantic Similarity into Data Classification
    5.3.1 GO-Based Feature Weighting
    5.3.2 GO-Based Feature Selection
  5.4 Preliminary Results for the RSCTC'2010 Discovery Challenge
6 Conclusion
  6.1 Summary
  6.2 Limitations and Future Work
7 Acknowledgments
CHAPTER 1
Introduction
Since the human genome was sequenced in 2003 [13], biological research has become increasingly interested in the genetic causes of diseases. Biologists try to find the genes responsible for the disease under study by analyzing the expression values of genes. The human
genome includes thousands of genes, which makes it difficult to find the genes that are
associated with a certain disease. Genetic microarray experiments are usually performed
to analyze the expression values of thousands of genes within a single experiment. To
extract useful information out of the genetic experiments and to get a better understanding of the disease under study, usually a computational analysis is performed [39]. To
determine which genes are modified in a certain disease, the gene expression data of
healthy subjects and diseased patients are analyzed and compared. Machine learning
techniques [36] are often used to find specific patterns in the data of healthy subjects and
diseased patients that can be used to discriminate the patients based on their disease
state. These learning methods are used to train models with the data from patients with
known disease state. A trained model can be used to predict the disease state of an unseen patient by searching for similarities between the gene expression profile of the patient with the unknown disease state and the gene expression profiles of the diseased and the healthy subjects.
Usually, in machine learning, an example describing a single training or testing item (a patient, for example) is called an instance. An instance can be described by an arbitrary number of attributes and can have a label. For bioinformatics datasets as described above, the attributes are normally genes and the class label is defined by the disease state.
One of the learning algorithms commonly used with genetic data is k-Nearest Neighbour classification. Normally, the Euclidean distance from an unseen patient's data to the patient records used for training the model is calculated and the k nearest neighbours are determined [53]. The new instance is then usually labeled with the most frequent class among its k nearest neighbour instances.
The k-Nearest Neighbour algorithm is not the only approach that uses a distance
function for classification. During the last three decades, the importance of the distance
function in machine learning has been gradually acknowledged [4]. However, for many
years only canonical distance functions or hand-crafted distance functions were used. Recently, a growing body of work has addressed the problem of supervised or semi-supervised
learning of customized distance functions [24]. In particular, two different approaches, learning from equivalence constraints and the intrinsic Random Forest Similarity, have been introduced and shown to perform well on image data [24, 4]. Many characteristics of gene expression data are similar to those of image data: both usually have a large number of features, many of which are redundant, irrelevant or noisy. For this reason, we assume that learning distance functions can improve the classification of genetic data, too.
Another possibility to improve the analysis of genetic data is to use external biological
knowledge [57, 29, 38, 12]. The GeneOntology [2] is a good source of biological knowledge
that can be incorporated into machine learning techniques. In addition, the GeneOntology content is consistently growing, becoming a more complete and reliable source of
knowledge. The number of entries, for example, increased from 27,867 on January 1, 2009 to 30,716 on January 1, 2010.
In this thesis, we will try to incorporate the biological knowledge of the GeneOntology
to improve the classification accuracy of genetic benchmark datasets. Further, the benchmark datasets are used to compare different machine learning approaches with different
configurations. The main task of this thesis is to improve the classification accuracy for
the benchmark datasets, in particular to increase the number of correctly classified test
instances. This thesis will address the following general problems of classification based
on genetic data:
• Reducing the large number of genes to the most discriminative ones
• Removing noisy, irrelevant and redundant information
• Learning from a small number of training instances
• Classification of binary and multiclass problems
• Learning with unequal distribution of classes in a dataset
• Comparison of different learning algorithms
• Comparison of different distance functions for k-Nearest Neighbour classification
• Exploitation of biological information provided by the expression values for classification
• Incorporation of biological knowledge from external sources to improve classification
The following problems are important for classification with genetic data, too, but they are out of the scope of this thesis:
• Imputation of missing data
• Normalization of gene expression values
The present study is motivated by the research activities in the EU Framework Programme 6 project Health-e-Child (www.health-e-child.org), aimed at improving personalized healthcare in selected
areas of paediatrics, especially focusing on integrating medical data across disciplines,
modalities, and vertical levels such as molecular, organ, individual and population. In
medicine, large complex heterogeneous data sets are commonplace. Today, a single patient record may include, for example, demographic data, familial history, laboratory
test results, images (including echocardiograms, MRI, CT, angiograms etc.), other signals (e.g. EKG), genomic and proteomic samples, and a history of appointments, prescriptions and interventions. Much, if not all, of this data may be relevant and may contain
important information for decision support. A successful integration of heterogeneous
data within a patient record thus becomes of paramount importance, and, in particular,
learning a distance function for such data for patient case retrieval or classification is
rather non-trivial and forms an important task.
This thesis is organized as follows: In Section 2, related work is presented to give an
overview of the state-of-the-art techniques in the area of machine learning with genetic
data. In Section 3, the algorithms, approaches, software and biological background used
in this thesis are described. The Methods Section (4) describes the implementations
and techniques used for executing the empirical tests for this thesis. The results and
the analysis of the performed experiments are presented in Section 5. Section 6 gives a
summary of the thesis and concludes with proposals for future work.
CHAPTER 2
Related Work
2.1 Machine Learning with Gene Expression Data
Much work has been done in the area of machine learning over the last few decades and
as bioinformatics is gaining more attention, different techniques have been applied to
process genetic data. Larrañaga et al. [31] presented a summary of machine learning
methods in bioinformatics describing how to apply modeling methods, supervised/unsupervised learning and optimization to gene expression data. The most crucial problems
in classifying genetic data are missing data imputation, discriminative feature selection,
classification and clustering. A single microarray experiment may contain thousands of genes (rows) under different conditions (columns) and is scanned automatically by a robot. The scanning procedure can sometimes result in missing or corrupted values caused by scanning problems, insufficient resolution, image corruption, scratches, dust or defective spots. However, most data mining algorithms require a complete data matrix for proper processing. In a related paper, Troyanskaya et al. [53] implemented and evaluated different methods of missing data imputation for gene expression data. Three different methods were compared: Singular Value Decomposition (SVD), row average and k-Nearest Neighbor regression, of which the last was shown to perform best.
In addition to a usually small sample of instances (conditions), the large amount of
gene expression values complicates classification. For that reason, a feature selection step
is normally conducted before classification to determine discriminative genes and eliminate redundancy in the dataset. For an exhaustive review of feature selection techniques for gene expression data, see “A review of feature selection techniques in bioinformatics” [44]. In this thesis, different feature selection methods are used for pre-processing. They will
be explained in detail in Section 3.2.7.
After the imputation of missing values and selection of the most meaningful features,
the data can either be used for unsupervised or supervised learning. The main task in
unsupervised learning is to cluster unlabeled data into groups such that an instance is more similar to the members of its own group than to any instance of the other groups. Ben-Dor et al. [6] introduced a data mining algorithm called CAST (Cluster Affinity Search Technique), which was further improved in 2002 by Bellaachia et al. [5] and was shown to perform well on clustering gene expression data. In contrast
to unsupervised learning, supervised learning uses labeled data to train a classification
system which is able to predict the labels of unseen (unlabeled) data. A few empirical
comparisons of classification algorithms introducing k-Nearest Neighbor (k-NN), Linear
Discriminant Analysis (LDA), classification trees (like C4.5 [41] or CART [40]), Support
Vector Machines (SVM), Artificial Neural Networks (ANN) and their combinations using Bagging and Boosting for several cancer gene expression studies have been published
[16, 65, 50, 32]. As the ambition of this thesis is to build a framework for testing different classification methods for predicting class labels of unknown data with respect to
labeled training data, only supervised learning is used. For that reason, this thesis will
not present any other information about clustering than a short review in the Material
section.
So far, some useful machine learning workbenches have been presented. Langlois
[30] for example developed an open source machine learning workbench for studying the
structure, function, evolution and mutation of proteins. Arguably, the most popular
machine learning framework was developed by the University of Waikato and is called
Weka [17]. As the software developed in the course of preparation of this thesis is based
on Weka, this framework is described in detail in Section 3.3.
For a more detailed review of data mining techniques for bioinformatics data, the interested reader is referred to “Pattern recognition techniques for the emerging field of bioinformatics: A review” [33] or “Machine learning in bioinformatics” [31].
2.2 Learning Distance Functions
Historically, research on distance functions in machine learning started with supervised learning of distance functions for k-Nearest Neighbor classification [49]. Since that time, canonical distance functions like the Euclidean or Mahalanobis distance have been used. Even today, the Euclidean distance function is used in many applications, although it is well known that its use is justified only when the feature data distribution is Gaussian. Mahalanobis metric learning has received much research attention but is often inferior to many non-linear and non-metric distance techniques and usually fails for learning distances from image data [24], too.
Discriminative distance functions have been used in various classification domains and
in image processing and pattern recognition. Hertz [24] performed extensive research in
this area and presented three novel distance learning algorithms: Relevant Component
Analysis (RCA), DistBoost and Kernel Boost. In RCA a Mahalanobis distance metric
is learned which is optimal under several conditions using generative learning with positive equivalence constraints. Hertz described the application of these novel algorithms
to various data domains including clustering and classification. It was shown that using the presented improved distance functions instead of off-the-shelf distance functions yields a significant improvement in all of these application domains. Bar-Hillel
[4] discussed learning from equivalence constraints where distance function learning is
considered as learning a classifier defined over instance pairs. A new method, termed
coding similarity, has been introduced and shown to hold an empirical advantage over
the common Mahalanobis metric. Based on this algorithm, a two-step method for subordinate
class recognition in images was developed.
It was shown that the most popular Euclidean and Manhattan distance metrics are not suitable for many data distributions. Yu et al. [66] proposed a novel algorithm that finds the best distance metric dynamically for a given dataset. This boosted distance metric was shown to give robust results on fifteen UCI repository [7] benchmark datasets and two image retrieval applications. Later, Yu et al. [67] further improved their metric and presented a general guideline for finding a more optimal distance function for similarity estimation with a specific dataset. Tsymbal et al. [55, 54] first performed
a detailed comparison of learning discriminative distance functions for case retrieval and
decision support. Two types of discriminative distance learning methods, learning from
equivalence constraints and the intrinsic Random Forest similarity have been compared.
They showed that both techniques are competitive with plain learning, where the Random Forest similarity exhibits a more robust behavior and is more stable with missing data and noise. A more thorough introduction to learning distance functions is given in Section 3.2.6.
2.3 Semantic Similarity Calculations Based on Gene Ontology
“The Gene Ontology (GO) [2] provides a controlled vocabulary of terms for describing
gene product characteristics and gene product annotation data from GO Consortium
members, as well as tools to access and process this data” (from the official Gene Ontology website, http://www.geneontology.org/index.shtml, 7 January 2010). Based on this controlled
vocabulary, two genes can be semantically associated, and further a similarity can be calculated based on their annotations in the GO. Sevilla et al. [48] applied semantic similarity, which is well known in the fields of lexical taxonomies, artificial intelligence and psychology, to the Gene Ontology by calculating the information content of each term in the ontology based on different methods like those of Resnik [43], Jiang [25] or Lin [15]. Sevilla et al. computed correlation coefficients to compare physical intergene similarity
with the GO semantic similarity. The results demonstrated a benefit for the similarity
measure of Resnik [43]. A correlation between the physical intergene similarity and the
GO semantic similarity has been shown to exist for all three GO branches.
Later, Wang and Azuaje [57] integrated similarity information from the GO into the clustering of gene expression data and provided a novel way to use the GO for knowledge-driven data mining systems. They showed that this method not only produces competitive results in clustering accuracy, but also has the ability to detect new biological dependencies for the given problem. In 2008, they discussed alternative techniques for measuring
GO semantic similarity and relationships between these types of information such as gene
co-expression [3]. Kustra and Zagdański [29] provided a framework to integrate different
biological data sources through the combination of corresponding dissimilarity measures
and have shown that the combination of gene expression data and protein-protein interaction knowledge may improve cluster analysis and results in more biologically meaningful
clusters.
In 2007, Wang et al. [58] presented a more effective method to determine the semantic similarity of two terms in the GO graph. The novel technique aggregates the semantic similarity of their ancestor terms and weights the different relations that terms can have with their ancestors. A formal definition of a term A in the Gene Ontology was given
with their ancestor. A formal definition of a term A in the Gene Ontology was given
as DAGA = (A, TA , EA ), where TA is the set of GO terms in DAGA including term A
and all its ancestors in the GO graph, and EA is the set of semantic relations (edges)
connecting the terms of TA . With respect to the definition of a single term A, a semantic
value can be derived by aggregating the relations to its ancestors where a term closer
to A contributes more to its semantics than a term further from it. For any term t in
DAGA a contribution to the semantics of the GO term A was called S-Value SA (t), and
is defined as:

 1,
if t = A
SA (t) =
 max{w ∗ S (t0 ) | t0 ∈ children of (t)}, if t 6= A
e
A
(2.1)
where w_e is a semantic weight factor for the relation between term t and its child term t', with 0 < w_e < 1. After obtaining the S-value for each term in DAG_A, the semantic value SV(A) of a GO term A is defined as:

$$SV(A) = \sum_{t \in T_A} S_A(t) \qquad (2.2)$$
To derive the semantic similarity S_GO(A, B) between two GO terms A and B with DAG_A = (A, T_A, E_A) and DAG_B = (B, T_B, E_B), Equation 2.3 can be used:

$$S_{GO}(A, B) = \frac{\sum_{t \in T_A \cap T_B} \left( S_A(t) + S_B(t) \right)}{SV(A) + SV(B)} \qquad (2.3)$$
This method determines the semantic similarity of two GO terms based on both their position in the GO graph and their relations to their ancestor terms. To obtain a numeric semantic similarity between two genes, where a gene can have several GO terms, Wang et al. [58] designed an algorithm to combine the similarities of the single terms. They have shown that their results are more in line with hand-crafted similarity measurements by human experts than the algorithms commonly used. Next to the method by Wang
et al. described above, a few more methods have been used to determine the semantic similarity of two genes g_1 and g_2, like Maximum [8, 63], Average [57, 56], Tao [52] or Schlicker [47, 46]. Xu et al. [64] evaluated these five methods by analyzing correlation coefficients and ROC curves and showed that the Maximum method, shown in Equation 2.4, outperforms the other ones:

$$\mathrm{sim}(g_1, g_2) = \max\left[ \mathrm{sim}(c_1, c_2) \right] \qquad (2.4)$$
where c_1 ∈ A(g_1), c_2 ∈ A(g_2), and A(g_1), A(g_2) are the corresponding sets of GO terms annotated by g_1 and g_2. This means that all pairwise semantic similarities between the terms annotated by g_1 and g_2 are calculated and the maximum value is determined. Further, they also found that genes annotated with multiple GO terms may be more reliable for separating true positives from noise than genes annotated with only one GO term.
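To make the Maximum method concrete, the following minimal Java sketch (an illustration, not code from any of the cited works) computes Equation 2.4 for two genes, given their GO annotation sets and an arbitrary term-level similarity function; the class and method names are hypothetical.

import java.util.Set;
import java.util.function.BiFunction;

public class GeneSemanticSimilarity {

    // Maximum method (Equation 2.4): the gene-level similarity is the largest
    // term-level similarity over all pairs of GO terms annotated to the two genes.
    // 'termSim' is a placeholder for any term-level measure (Resnik, Wang, ...).
    public static double maxSimilarity(Set<String> termsOfG1,
                                       Set<String> termsOfG2,
                                       BiFunction<String, String, Double> termSim) {
        double best = 0.0;
        for (String c1 : termsOfG1) {
            for (String c2 : termsOfG2) {
                best = Math.max(best, termSim.apply(c1, c2));
            }
        }
        return best;
    }
}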
Another approach to include the knowledge available in GO into machine learning
is to use it for feature selection. Qi and Tang [38] introduced a novel method to select
genes not only by their individual discriminative power, but also by integrating the GO
annotations. The algorithm corrects invalid information by ranking the genes based on
their GO annotations and was shown to boost accuracy in all four tested public datasets.
Chen and Tang [12] further investigated this idea, presented a novel approach to aggregate semantic similarities and integrated it into traditional redundancy evaluation for feature selection. This resulted in higher or comparable classification accuracies on public benchmark sets while using fewer features compared to common feature selection methods.
CHAPTER 3
Material
The whole framework used for conducting experiments for this thesis was implemented
in Java 1.6 using the development environment Eclipse IDE for Java EE Developers on
a Microsoft Windows XP Professional PC with Service Pack 3 installed. The computer
has an Intel(R) Xeon(R) E5440 quad-core processor at 2.83 GHz and 3.0 GB of RAM. This chapter describes the material and software used for experiments and the
implementation of the developed framework for the thesis.
3.1 Biological and Medical Aspects

3.1.1 Gene Expression
Gene expression analysis has become a standard procedure in biological and medical
research [9, 18], since gene expression is directly associated with the biological behavior of the organism containing the gene. A variation in the expression values of a single gene can cause a serious disease or dysfunction of the whole organism. Gene expression is the process of transforming a gene into a functional gene product and is used by all known living organisms including eukaryotes, prokaryotes and viruses. In this process, the double-stranded DNA sequence is transcribed into RNA in a biological cell and, in the case of proteins, further translated into a protein, see Figure 3.1. The regulation of this process defines the function and structure of cells by ensuring a controlled protein expression. As the cell structure and proteins define a biological system, changes in gene expression directly influence the
Figure 3.1: A gene is expressed into a gene product like RNA or a protein. For this, the double-stranded DNA of the gene is transformed into single-stranded RNA by transcription. For building a protein, the RNA is translated into a protein sequence by protein biosynthesis.
whole organism. Cystic fibrosis (also called mucoviscidosis), for example, is caused by a mutation in the single gene cystic fibrosis transmembrane conductance regulator (CFTR), where only three missing nucleotides affect the entire body [23]. The product of this mutated gene is an ion channel protein responsible for the chloride exchange of cells. In the mutated version, the channel is not able to pass chloride through, which results in symptoms like salty skin, clogging of the airways by mucus and usually a short life expectancy.
The expression of a gene in a cell can be measured by quantifying the amount of its gene products, and this is often informative: a mutation, deletion or multiplication of a gene can be detected, and a viral infection, susceptibility to congenital disease or resistance against bacteria can be deduced.
For most diseases, not just one gene is responsible but a co-operation of several genes, each of which can be altered in a different way. As these alterations can be detected by analyzing the modified expression of the genes, specific gene expression patterns can be found in diseased patients. If the gene expression pattern for a disease is known, it is possible to recognize whether a patient is affected by gene expression analysis. Further, given a set of gene expression data from diseased and healthy people, a computer can be used to find differences between the healthy and diseased data sets. After recognizing these pattern differences, new patients can be classified as diseased or healthy by calculating similarities to the different patterns. Today, the most established technique to
obtain gene expression data from a biological system for computational analyses is to use
a microarray [51].
3.1.2 Genetic Microarray Experiments
A cDNA microarray is a sample of thousands of spots on a glass or silicon surface where
different DNA oligonucleotides, called features, are attached to each spot. Each spot
contains picomoles (10⁻¹² moles) of a specific DNA oligonucleotide, cDNA or mRNA sequence, called probes. The core task of a microarray is to detect and quantify DNA strands complementary to those attached to the array. First, two experimental samples are obtained under different conditions. For example, for testing genes associated with cancer, one sample may be taken from a healthy subject and the other one from a diseased patient. After extracting the mRNA from the probe cells, cDNA is synthesized and the
healthy sample is labeled with fluorescent green and the diseased one with fluorescent
red markers. After labeling the two samples, they are merged to one sample for further
analysis. The mixed sample is incubated with the DNA chip. The labeled cDNA binds
to spots containing a complementary sequence by forming hydrogen bonds between the
complementary nucleotide base pairs. Genes of the sample, having a complementary sequence on a spot on the array, adhere to the spot, while all the other genes get washed
off. The cDNA enriched microarray is placed in a dark box and scanned by a green laser
where the glowing of the fluorescent green markers is detected by a camera and the image
is stored on a computer. The same procedure is done with a red laser for detecting the red markers. The result of the screening is two images, one with the green intensities of the spots and another with the red intensities, where the more sample cDNA is bound to a spot, the higher the intensity of the colour. These two images are merged
computationally and the result is an image representing the expression of the genes in the
sample. A green spot means that the gene binding to this spot is expressed only in the
healthy subject, a red spot only in the diseased patient, a yellow spot is expressed in both
and a black spot is expressed in none of the two samples. An example of a microarray
expression image is shown in Figure 3.2.
For obtaining expression intensities for a single probe, the probe is applied to the
Figure 3.2: a) An example of a microarray visualization with approximately 37,500
probes. b) An enlarged view of the blue rectangle of the microarray shown in a).
microarray and the procedure described above is done solely with one laser. The result
of a single probe experiment is therefore only one image, containing spots with different
colour saturations where the expression intensities can be derived from the brightness of
the spots. In the experiments conducted for this thesis, only datasets of single-probe intensities have been used.
The raw data of a microarray experiment has to be normalized to correct systematic failures caused by a wrong calibration, different chips and scanning parameters or
variable fluorescent marker characteristics. For this purpose, each microarray contains
several control spots. Another problem to be addressed before data analysis is the removal of background noise, which is out of the scope of this thesis.
3.2 Machine Learning
Machine learning is a scientific discipline of finding previously unknown information and
potentially interesting relations and patterns in a dataset or database and is highly related to the fields of statistics, data mining and artificial intelligence [36]. The human
ability to learn from known data and to draw intelligent decisions from it is a great gift of nature, and since the introduction of computers, people have tried to transfer this ability to machines. In 1965, Herbert Simon (American mathematical social scientist, 1916-2001) prophesied: “Machines will be capable, within twenty years, of doing any work that a man can do.” This has been proven to be
too optimistic. Even now, 45 years later, computers are still far from being comparable to human intelligence. Today, machines are able to learn from data and make generalizations from it using different learning algorithms, which can be split into supervised learning, unsupervised learning and combinations of both, called semi-supervised learning. Especially supervised learning has been successfully applied in several application domains, ranging from stock price analysis, speech recognition, text analysis, data mining, computer vision, bioinformatics and computational neuroscience to many more. But as machine learning algorithms use heuristics and probability calculations, an optimal solution cannot be guaranteed and the deduced results may be wrong
in some cases. Therefore, the main challenge in machine learning is to optimally classify
or cluster new cases in a preferably short time. The differences between classification (supervised learning) and clustering (unsupervised learning) as well as the two main learning
algorithms used for this thesis are described in the following sections.
3.2.1 Supervised Learning
Predicting a class label (classification) or a continuous value (regression) for an unseen instance based on deducing a general model from previously learned training instances is called supervised learning. The training data contains several different instances
{(x_i, y_i)}_{i=1}^N, represented by a list of objects x_i ∈ X and associated labels y_i ∈ Y. A group of instances having the same label is called a class. The algorithm learns from training data by finding patterns specific to cases sharing the same label and builds a generalized global function f : X → Y from them. The key challenge for supervised learning algorithms is to predict a class label or a continuous value for an unseen instance x ∈ X after learning a model from a usually small set of training instances. The classification accuracy of an algorithm highly depends on the composition of the training set, which should contain as many different instances as possible. For best performance, the instances should reflect all possible real-world characteristics. The number of features an instance is represented by should be large enough to define the classes precisely but
should not be too large to avoid overfitting and save computational time. As there is
no classification algorithm that works best for each dataset (no free lunch theorem [62]),
different algorithms should be validated to reach the best accuracy. The most widely used
classifiers include Artificial Neural Networks (ANN), Support Vector Machines (SVM),
k-Nearest Neighbour and decision trees (like C4.5 [41]). The ones used for this thesis are
described in detail in Sections 3.2.3 - 3.2.6.
3.2.2 Unsupervised Learning
The key difference of unsupervised learning to supervised learning is the absence of class
labels. The algorithm is provided with a set of input objects {x_i}_{i=1}^N for which no class labels y_i are provided. The key task in this area is not to classify unseen data but to
analyze the given dataset by clustering, feature extraction, density estimation, visualization, anomaly detection or information retrieval. As this thesis concentrates only on
supervised learning, unsupervised learning is not described in any more detail in this
report.
3.2.3 Random Forest
Breiman’s Random Forest [10] is an ensemble classifier composed of a set of decision
trees [40, 41]. A decision tree is represented as a set of nodes and branches. Each node
of the tree corresponds to a feature (attribute) and each child node gets labeled with a
particular range of values for that feature. Together, a node and its children define a
decision path in the tree that an example follows when being classified by the tree. A
decision tree is induced in a top-down fashion by recursively splitting the feature space
until each of the leaf nodes is nearly of one class. At each step, all features of a node
are evaluated for their ability to improve the “purity” of the class distribution in the
partition they produce. An example of a decision tree can be seen in Figure 3.3.
A Random Forest grows a set of decision trees. A new instance is put down all the
trees in the forest and each tree returns a classification result. The final Random Forest
result is the mode of the votes of the single trees. This means that the instance is classified
into the class which has been selected the most times in the single trees. As this may
only give improvement compared to a single decision tree if the trees are different, feature
subsets for each node of the trees are selected randomly. Let the number of cases in the
training set be N and the number of features be M. N cases are sampled randomly with
replacement from the original dataset and a tree is grown where at each node, a number
Figure 3.3: Example of a decision tree where nodes are represented as brown circles and a
blue rectangle represents a leaf. For classifying a new instance, two decisions are needed
at most. Starting at A, the value of the feature can either be <= 1 or > 1. In the first
case, the new instance reaches a leaf and is classified as diseased. For the second case,
one more decision is needed, node B, where the feature value can either be true or false.
m ≪ M of features is chosen. Further, the best split of these m features is determined and used to split the node. The value of m is constant over all the trees in the forest, and the trees are grown to their largest extent without pruning, as long as no minimum leaf size is given.
The forest's error rate depends mostly on two things: the correlation between any two trees in the forest and the strength of each individual tree. A high correlation increases the
error rate of the forest while strong individuals decrease the error rate [10]. Therefore,
m has to be chosen carefully to produce strong enough individuals but also keep the
correlation low.
The most essential benefits of Random Forests are that they are fast to train, memory efficient and able to achieve competitive accuracy on big datasets. Yet another advantage is that Random Forests normally do not overfit, no matter how many trees are used [10].
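As an illustration of how such a forest can be trained in practice, the following minimal sketch uses the Weka library described in Section 3.3; it is not code from the thesis framework, and the dataset file name is a placeholder.

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestSketch {
    public static void main(String[] args) throws Exception {
        // "leukemia.arff" is a placeholder for any ARFF dataset (see Section 3.3.1).
        Instances data = new DataSource("leukemia.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class is the last attribute

        RandomForest forest = new RandomForest();
        forest.setNumTrees(100);   // number of trees in the ensemble
        forest.setNumFeatures(0);  // 0 lets Weka choose m (log2(M) + 1 by default)
        forest.buildClassifier(data);

        // Classify the first instance; the result is the index of the predicted class.
        double predicted = forest.classifyInstance(data.instance(0));
        System.out.println(data.classAttribute().value((int) predicted));
    }
}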
3.2.4 k-Nearest Neighbour Classification
The k-Nearest Neighbour algorithm is an instance-based (or lazy) learning algorithm
storing a series of training examples in memory and using a distance function to compare
new instances to the stored ones. The prediction of the class label of the new instance is
based only on the k closest example cases with respect to the distance function. Often,
the Euclidean distance (Equation 3.1) is used, where p and q are n-dimensional data vectors:
$$d_{\mathrm{euclidean}}(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (3.1)$$
As shown in Figure 3.4, the test instance is classified into the most frequent class value
of its k nearest neighbors. For k = 3 (solid circle), the 3 nearest neighbors are analyzed
Figure 3.4: A 2-dimensional example of k-NN classification. The test case (red circle)
should be classified either to the green squares or to the blue stars. The solid border
includes k = 3 nearest neighbors of the test case. In this case, the test instance is
classified as a green square as there are two squares and only one blue star. The dashed
border includes k = 6 nearest neighbors. In this case, there are more blue stars (4) than
green squares (2), which means that the test case is classified as a blue star.
where 2 of them are labeled as green squares and one is labeled as a blue star. In this
case, the test instance is classified as the most frequently occurring class, green squares.
A different case is shown within the dashed circle where k = 6 nearest neighbors are
considered. As there are four blue stars and only two green squares, the test instance is
classified as a blue star. The same procedure can be used for regression, where the result
value is normally the average over the k nearest neighbors' values.
A certain disadvantage of the algorithm is that classes occurring more frequently in the training set will tend, due to their larger number, to come up more often in the neighbourhood by chance. One solution to this problem is to weight the k nearest neighbours by their distance to the test case. Then, a neighbour closer to the test case has a bigger influence on the class prediction than a more distant one.
The accuracy of k-Nearest Neighbour classification often highly depends on the value of k, and the optimal value of k differs for each dataset. Generally, a high value of k reduces noise but also makes the class borders fuzzy. For a good estimation of parameters, several heuristics can be used, such as cross validation, which will be described in Section 3.2.8.
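The following self-contained Java sketch (illustrative only, with hypothetical toy data) implements the Euclidean distance of Equation 3.1 and the majority vote described above.

import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class KnnSketch {

    // Euclidean distance between two n-dimensional vectors (Equation 3.1).
    static double euclidean(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = p[i] - q[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Predict the label of 'query' as the majority class among the k nearest
    // training instances.
    static String classify(double[][] train, String[] labels, double[] query, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training indices by distance to the query instance.
        Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(train[i], query)));

        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) {
            votes.merge(labels[idx[i]], 1, Integer::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }

    public static void main(String[] args) {
        double[][] train = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
        String[] labels = {"healthy", "healthy", "diseased", "diseased"};
        System.out.println(classify(train, labels, new double[]{2, 1}, 3)); // healthy
    }
}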
3.2.5 AdaBoost
AdaBoost [19] is another successful ensemble algorithm for boosting the performance of a
machine learning algorithm, called base learner, and is short for adaptive boosting. In the
experiments done for this thesis, C4.5 [41] was used as the base learner with AdaBoost.
The algorithm is called adaptive boosting because it runs several iterations t = 1 . . . T with the base classifier, and in each iteration, instances classified incorrectly in the previous steps are emphasized more. To this end, a distribution of weights D_t is updated in each round by increasing the weights of the incorrectly classified instances and decreasing the weights of the correctly classified instances. The final classifier is a combination of the single classifiers added at each step. The emphasis on badly classified instances makes AdaBoost overly sensitive to noisy data and to outliers.
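A minimal sketch of this setup with Weka (Section 3.3) could look as follows; J48 is Weka's implementation of C4.5, and the dataset file name is a placeholder.

import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AdaBoostSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("colon.arff").getDataSet(); // placeholder name
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new J48());  // J48 is Weka's C4.5 implementation
        booster.setNumIterations(10);      // number of boosting rounds T
        booster.buildClassifier(data);

        System.out.println(booster);       // prints the boosted ensemble
    }
}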
3.2.6 Learning Distance Functions
Distance functions in machine learning have been addressed in many research studies in
the last three decades. Special emphasis was given to learning distance functions for classification tasks [24], motivated by several reasons. First, learning a distance function for
classification helps to combine the power of strong learners (Random Forest, C4.5) with
the transparency of instance-based classifiers (k-NN). Also, it was shown that choosing an
optimal distance function makes classifier learning redundant [34]. Besides, where most
traditional methods would fail, learning a proper distance function is especially helpful
for high-dimensional data with many weakly relevant, irrelevant or correlated features.
As gene expression data contains exactly these kinds of features, distance learning is particularly suitable for classifying genetic data. Next, learning distance functions breaks
the learning process into two separate steps, distance learning followed by classification.
Each step requires search in a less complex functional space compared to straight learning. The separation makes the model more flexible and modular and enables component
reuse of the two parts. In this thesis, two different approaches to distance learning are
used, learning from equivalence constraints and the intrinsic Random Forest similarity.
Learning From Equivalence Constraints
Today, the most commonly used representation of distance learning is the one based on
equivalence constraints [4, 24]. Usually, equivalence constraints are represented as triplets
(x_1, x_2, y), where x_1 and x_2 are data points in the original space and y is a label indicating whether x_1 and x_2 correspond to the same class or not. This is also called learning in the
product space. The product space is out of scope for this thesis and will not be explained
in more detail. Another possible approach is to learn in the space of vector differences,
called the difference space, which is often used with homogeneous high-dimensional data such
as pixel intensities in imaging. A more detailed description of the difference space will be
presented in Section 4.2. Both methods were shown to demonstrate promising empirical
results in different contexts. The availability of equivalence constraints in most learning contexts and the fact that they are a natural input for optimal distance function learning [4] are two crucial reasons that motivate their use. It was shown that the optimal distance function for classification is of the form p(y_i ≠ y_j | x_i, x_j). For each class, the optimal distance function under the i.i.d. assumption can be expressed in terms of generative models p(x | y) [34], as shown in Equation 3.2:
$$p(y_i \neq y_j \mid x_i, x_j) = \sum_{y} p(y \mid x_i)\left(1 - p(y \mid x_j)\right) \qquad (3.2)$$
The function was shown to approach the Bayesian optimal accuracy [34] and was analytically proven to be at least as good as any distance metric.
Yet another approach for representing equivalence constraints is relative comparisons, which are usually used in information retrieval contexts. This representation uses triplets of the form “x is more similar to y than to z” for learning.
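For illustration, the following Java sketch (hypothetical names, not the thesis implementation) constructs equivalence constraints in the difference space from a labeled dataset; a binary classifier trained on such pairs then acts as the learned distance function.

import java.util.ArrayList;
import java.util.List;

public class DifferenceSpace {

    // A difference-space example: the element-wise absolute difference of two
    // instances, labeled by whether the two instances share the same class.
    static class Constraint {
        final double[] diff;
        final boolean sameClass;
        Constraint(double[] diff, boolean sameClass) {
            this.diff = diff;
            this.sameClass = sameClass;
        }
    }

    // Turn labeled instances into equivalence constraints in the difference space.
    static List<Constraint> buildConstraints(double[][] x, String[] y) {
        List<Constraint> constraints = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            for (int j = i + 1; j < x.length; j++) {
                double[] diff = new double[x[i].length];
                for (int f = 0; f < diff.length; f++) {
                    diff[f] = Math.abs(x[i][f] - x[j][f]);
                }
                constraints.add(new Constraint(diff, y[i].equals(y[j])));
            }
        }
        return constraints; // train any binary classifier on these pairs
    }
}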
The Intrinsic Random Forest Distance
In contrast to learning from equivalence constraints, the intrinsic Random Forest distance acts as a black box, and few applications of it have been reported so far. The basic concept of this algorithm is to learn a Random Forest for the given classification problem and use the proportion of trees in which two instances appear together in the same leaf as a measure of similarity between them [10]. Equation 3.3 shows the calculation of the similarity between two instances x_1 and x_2 for a given Random Forest f. The instances are propagated down all K trees of the forest f and their leaf positions z in each of the trees (z_1 = (z_11, . . . , z_1K) for x_1, similarly z_2 for x_2) are recorded. The indicator function is represented as I.
$$s(x_1, x_2) = \frac{1}{K} \sum_{i=1}^{K} I(z_{1i} = z_{2i}) \qquad (3.3)$$
This similarity can be used for classification tasks and also for related problems like clustering or nearest neighbor data imputation. In our experiments, we used this similarity as a replacement for the canonical Euclidean distance in k-Nearest Neighbour classification.
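The following sketch illustrates Equation 3.3 directly; the Tree interface and its leafIndex method are hypothetical stand-ins, since Weka's Random Forest does not expose leaf positions through a public method of this form, so a real implementation must record them while traversing each tree.

interface Tree {
    // Hypothetical: index of the leaf that the instance falls into.
    int leafIndex(double[] instance);
}

class RandomForestSimilarity {
    // Intrinsic Random Forest similarity (Equation 3.3): the fraction of trees
    // in which the two instances end up in the same leaf.
    static double similarity(Tree[] forest, double[] x1, double[] x2) {
        int matches = 0;
        for (Tree tree : forest) {
            if (tree.leafIndex(x1) == tree.leafIndex(x2)) {
                matches++; // indicator I(z1i = z2i)
            }
        }
        return (double) matches / forest.length; // average over the K trees
    }
}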
3.2.7 Feature Selection
Before training a machine learning model, most data has to be pre-processed in order to
reduce the usually large number of features. Especially when classifying gene expression
data, a careful selection of discriminative genes is important to speed up the learning
process, remove noisy, irrelevant and redundant genes and fight the curse of dimensionality
[11]. A good feature reduction can improve the classification accuracy significantly.
Generally, feature selection algorithms can be divided into two basic groups: feature
ranking and subset selection. While feature ranking scores the features by a metric [27]
and removes the features that do not reach a defined threshold value, subset selection
methods evaluate different feature subsets to find the optimal one. Theoretically, an
optimal feature selection for supervised learning requires an evaluation of all possible
feature subsets. In practice, it is often impossible to find an optimal solution due to the
very large number of available features. Hence, most methods search for a set of features satisfying certain criteria. Evaluation methods can follow two different approaches,
so-called filters and wrappers, both of which use a search algorithm to search through the possible feature subsets. Wrappers evaluate the search result by running a classifier on the subset. This makes wrapper methods computationally expensive and increases the risk of overfitting the training data. Contrary to the wrapper approaches, a filter algorithm
evaluates the subset on a statistical filter instead of explicitly training a model. This
yields a usually faster but often less accurate algorithm. Both of these methods often use
a search algorithm which can either be optimal or heuristic. Typical search techniques for
optimal solutions are depth-first or breadth-first. For heuristics, often sequential forward
or backward selection or best-first searches are used. In this thesis, several different
feature selection methods are used, including Information Gain [44], Gain Ratio [44],
Correlation-based Feature Subset Selection (CFS) [22] and ReliefF [26].
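As a concrete example of filter-based feature ranking, the following minimal sketch (not code from the thesis framework; the dataset name and the number of selected genes are placeholders) uses Weka's Information Gain evaluator with a Ranker search to keep the highest-ranked genes.

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class FeatureSelectionSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("lymphoma.arff").getDataSet(); // placeholder
        data.setClassIndex(data.numAttributes() - 1);

        Ranker ranker = new Ranker();
        ranker.setNumToSelect(200); // keep the 200 highest-ranked genes

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new InfoGainAttributeEval()); // Information Gain ranking
        filter.setSearch(ranker);
        filter.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, filter);
        System.out.println("Attributes kept: " + reduced.numAttributes());
    }
}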
3.2.8 Cross Validation
Cross validation is a technique often used in machine learning to evaluate the generalization accuracy of a learning algorithm with respect to a specific dataset. It is mostly
used for classification in order to predict how accurately a classifier will perform in practice
or to compare two different classifiers on a single dataset. When testing a classification
algorithm, usually a set of labeled data is given. To train and evaluate an algorithm, the
data is split randomly into a set of training instances and a set of instances the algorithm
is tested on. This division can clearly bias the classification result. For better understanding, assume that all instances of a certain class are put in the test set. The classifier
would have no chance to train the model for classifying this class because there are no
training instances. For this reason, a cross validation, partitioning the data randomly
several times and applying the classification algorithm to it, is performed. The classification accuracies are reported over the single runs and an average value is presented at
the end.
There are three basic kinds of cross validation: random sub-sampling validation, k-fold cross validation and leave-one-out cross validation. In random sub-sampling, the algorithm randomly splits the dataset into training and test sets so that the validation subsets can overlap. In contrast, k-fold cross validation breaks the data into k subsets of the same size. The cross validation algorithm runs k times, where in each run another subset is used exactly once as the test set and the remaining k − 1 subsets provide the training instances. Leave-one-out cross validation is a special case of k-fold cross validation where k equals the number of instances; it is usually used for data with a small number of instances. Because the process has to be repeated as many times as there are instances, leave-one-out cross validation can become computationally expensive. In this thesis, k-fold cross validation and leave-one-out cross validation are used for the evaluation of learning algorithms.
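A minimal sketch of k-fold cross validation with Weka (Section 3.3) could look as follows; the dataset file name and the choice of k = 10 folds are placeholders.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast.arff").getDataSet(); // placeholder
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk(3); // k-Nearest Neighbour with k = 3

        // 10-fold cross validation; using data.numInstances() folds instead
        // yields leave-one-out cross validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}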
3.3 Weka - A Machine Learning Framework in Java
As machine learning attracts much attention in computer science, several frameworks
have been developed for Java. Probably the most established one is the open source
library Weka [17] (version 3.6.1 was used in this thesis). Weka has been developed at the
University of Waikato in New Zealand and provides a collection of algorithms for data
classification, preprocessing, regression, clustering, association rules and data visualization. These techniques can be accessed from the provided shell via a graphical user interface, from the command line, or directly from Java code via the transparent API provided. To handle datasets, Weka defines a data format called ARFF (Attribute-Relation File Format), where each ARFF file describes a single dataset.
3.3.1 The ARFF Format
An ARFF file is an ASCII text file describing a list of instances sharing a set of attributes; it is separated into two basic parts, the header and the data section. The
header contains the relation name of the dataset as well as the attributes and their types.
Comments are indicated by a % sign. An example header of an artificial animal dataset
is given in Listing 3.1.
% ---------------------------------------
% This is a dataset of different animals
% ---------------------------------------
@Relation "some animals"
@Attribute name string
@Attribute number_of_legs numeric
@Attribute has_fur {yes, no}
@Attribute height numeric
@Attribute class {dangerous, friendly}
Listing 3.1: Example of an ARFF header section
The header is followed by the data section. An example is shown in Listing 3.2 for the
same dataset.
@Data
Dog,4,yes,76,?
Snake,0,no,8,dangerous
Cat,4,yes,?,friendly
Bird,2,no,21,friendly
Listing 3.2: Example of an ARFF data section
The @Relation declaration
The first line in the ARFF file contains the declaration of the relation name for the
dataset. If the name includes spaces, the string has to be quoted.
The @Attribute declaration
The attribute section contains a list of attributes where each line defines one attribute and
starts with @Attribute followed by a name and a type separated by a space character. If
the attribute name contains spaces, the name has to be quoted. The type of the attribute
can be of the following types:
• numeric (can also be specified as integer or real)
• nominal-specification (a list of possible categorical values, separated by commas within curly braces)
• date
• string
The @Data declaration
The data section starts with the @Data command, followed by the list of instances, where each line defines exactly one instance. Each line contains values for all the attributes defined in the header section. The attribute values are separated by commas, and missing values are represented by a single question mark.
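As a small usage sketch, the animal dataset from Listings 3.1 and 3.2 can be loaded and inspected with the Weka API introduced in the next section; the file name animals.arff is an assumption.

import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;

public class ArffInspection {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("animals.arff")));
        data.setClassIndex(data.numAttributes() - 1); // the nominal "class" attribute

        // Print every attribute together with its type
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute att = data.attribute(i);
            System.out.println(att.name() + ": nominal=" + att.isNominal()
                    + ", numeric=" + att.isNumeric());
        }
        // Missing values ("?" in the data section) can be queried per instance
        for (int i = 0; i < data.numInstances(); i++) {
            Instance inst = data.instance(i);
            if (inst.hasMissingValue()) {
                System.out.println("Instance " + i + " contains a missing value");
            }
        }
    }
}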
3.3.2 The Structure of the Weka Framework
Weka provides a complete Java project designed to be modular and object-oriented to
easily add new classifiers, data, filters, clustering algorithms and more. These classes can
easily be accessed from any Java code by simply importing them. The workbench is separated into several top-level packages, each containing classes for a specific machine learning task and an abstract Java class for this task. For example, the package “classifiers”
contains sub packages of different classifiers where each extends the common base class
called “Classifier”, providing the interface for all classifier implementations. The sub-packages are further grouped by functionality or purpose. Filters, for example, are separated into unsupervised and supervised ones, and further into those operating on an attribute or instance basis. The system kernel methods are collected in the “core” package, providing classes
and data structures that represent instances and attributes, read and save datasets and
provide common utility methods as well as interfaces. The top-level package “gui” contains classes for the shell, which follows the Java Bean convention. Weka supplies a Web API² describing all classes in the format of the Java API.

² The Weka API can be found at http://weka.sourceforge.net/doc

3.4 Distance Learning Framework
Weka supplies a bundle of machine learning algorithms, but does not offer algorithms for
learning distance functions with equivalence constraints or the intrinsic random forest
similarity. Tsymbal et al. [55] implemented a Java framework for the empirical analysis
of these two techniques. The source code of this project is based on Weka and is the basis
for the framework implemented for experiments presented in this thesis.
The basic concept behind the framework is to use an ARFF file to configure and evaluate different classification techniques. The attributes of the ARFF configuration file are
used as system parameters and each instance defines a single experiment. A demonstrative example is shown in Listing 3.3; line 3, for example, defines the name of the dataset
to be used for the experiment. The general workflow is demonstrated in Figure 3.5.
1   @Relation "configuration"
2
3   @Attribute dataset string
4   @Attribute use_random_forest {yes, no}
5   @Attribute number_of_trees numeric
6   @Attribute ...
7
8   @Data
9   breast_cancer.arff, yes, 50, ...
10  leukemia.arff, yes, 100, ...
Listing 3.3: Part of a configuration file for the “distance learning” framework
First, the configuration file is read line by line and a Weka “Instances” object is created
containing the information from the read configuration ARFF file. For each instance,
meaning for each line in the ARFF data section (lines 9, 10, ... in Listing 3.3), an independent experiment is started. The defined classification and system parameters are set
and the dataset is loaded into memory. A single experiment includes 1 . . . n repetitions of
k -fold or leave-one-out cross validation. For each cross validation iteration, the dataset
is split randomly into a training and a test set. Next, a data imputation algorithm is run
to fill in missing data if there is any. For training a model for learning from equivalence constraints, the training data has to be transformed into the product or the difference space. A transformation to learn from comparative constraints is also possible. Then, the
specified learning algorithm is used to train a model on the training set. The trained
models are used to classify the instances of the test set and the predicted class values for
each instance are compared with their ground truth values. Depending on these results,
a classification accuracy is calculated for each cross validation fold and is averaged over
the folds. To get the final result for each model, the cross validation results are averaged further over all repetitions and saved into a result text file. Moreover, the overall
classification accuracies are printed out to the console.
The workbench includes basic algorithms to compare distance learners with canonical
machine learning algorithms and provides a basis for further experiments for distance function learning. Probably the biggest disadvantage of this framework is the procedural style of the code, making it hard to understand and extend.
Figure 3.5: The general workflow of the distance learning framework is shown. An
ARFF file, containing system configuration parameters as attributes and single experiments as instances, is read first. For each line, corresponding to a single
run, a new experiment is started with the classification configuration specified in the
ARFF file. For each experiment, several cross validation iterations can be performed. For
each cross validation iteration, the data is split randomly into a training and testing set.
Next, an imputation method is run to predict missing values. Before training the different
classification models specified in the configuration ARFF file, the sets are transformed
into the product, difference and the comparative space for learning distance functions.
Then, the algorithm classifies the test set with unseen class labels. At the end of all
iterations, the classification accuracies are averaged over all runs and reported to the
resulting text file and to the console.
3.5 Gene Ontology

3.5.1 The Key Components of the Ontology
The GeneOntology (GO) Project [2] offers a controlled vocabulary of gene products containing their annotations and characteristics. The main effort was to build a controlled structure describing the interactions and relations between gene products. Therefore, three species-independent ontologies (structures) have been developed, describing each gene product in terms of its associated biological process, cellular component and molecular function, where each of the three parts is represented with a separate graph.
• Cellular Component
A cellular component is a part of a cell, describing subcellular structures and macromolecular complexes. Generally, a gene product is a subcomponent of, or is located in, a particular cellular component.
• Molecular Function
Molecular functions are the abilities a gene product can have including transport,
binding, holding or changing different components.
• Biological Process
A biological process is a recognized series of events or molecular functions with a
defined beginning and end.
A gene product can be associated with or located in one or more cellular components, be
active in one or more biological processes and can perform several molecular functions.
The GO is a set of words and phrases used for indexing and retrieving information and
also provides the relationships between two terms, making it a structured vocabulary.
The structure of the GO vocabulary is represented with a directed acyclic graph, where each
node represents a GO term and each arc describes a directed relationship between these
two terms. Each term can have more than one parent. In each of the three graphs, a term
closer to the root is more general than a term deeper in the graph and a child term is
more specialized than its parents. An example of a GO graph structure is given in Figure
3.6. The coloured arrows indicate relations between two GO terms where the letter in
the box shows the relation type. The different relation types will be described later on.
Figure 3.6: An example of a set of terms under the biological process node. A screenshot
from the ontology editing tool OBO-Edit (http://www.oboedit.org). The nodes represent
GO terms where the labeled arcs indicate relations between two GO terms.
Each term has at least one path to one of the three root nodes. A term has the following
essential elements:
• Unique Identifier and term name
Every term has a unique seven-digit zero-padded number prefixed with GO:, e.g.
GO:0005125, where the number does not code for any structural information. Further, every term has a name. For GO:0005125, for example, the name is cytokine
activity.
• Namespace
Denotes which of the three sub-ontologies the term belongs to.
• Definition
A description of the concept represented by this term extended with some additional information, like the knowledge source.
• Relationships to other terms
A list of related terms specified with the type of their relation which can either be
is a, part of or regulates.
The GO is structured as a graph with terms as nodes and arcs connecting these nodes.
Not only the nodes are defined, but also the arcs are categorized. They represent a
directed relation a term has with respect to another term. There are three different
relation types:
• The “is a” relation
The is a relation denotes the fact that a term A is a subtype of another term B.
Example: “Mitochondrion” − is a → “intracellular organelle”.
• The “part of ” relation
The part of relation indicates that a term A is a necessary part of another term B
where the presence of B implies the presence of A but not vice versa.
Example: “Cytoplasm” − part of → “Cell”.
• The “regulates” relation
The regulates relation can have two subtypes, negative regulates or positive regulates.
A term A is said to regulate another term B if, whenever A occurs, it regulates B; B, however, may also be regulated by terms other than A.
Example: “Regulation of mitotic spindle organization” − regulates → “mitotic
spindle organization”.
3.5.2 The GO File Format: OBO
The GeneOntology can be accessed in two different ways, web-based or locally. The GeneOntology Consortium and the open developer community provide a large variety of
tools to access, edit and work with the ontology. While the web-based tools access the
ontology via a web interface, the local tools need to have a local copy of the ontology.
Therefore a specific file format was introduced, OBO. The .obo file contains a header
section providing information about the obo-format version and some additional facts,
like the date of creation. Next, the terms are listed, where the beginning of a new term
30
is indicated by [Term]. An example of a term is given in Listing 3.4.
1  [Term]
2  id: GO:0015239
3  name: multidrug transporter activity
4  namespace: molecular_function
5  def: "Enables the directed movement of drugs across a membrane into, out of, within or between cells."
6  subset: gosubset_prok
7  is_a: GO:0015238 ! drug transporter activity
8  is_a: GO:0022857 ! transmembrane transporter activity
Listing 3.4: Term definition of multidrug transporter activity in OBO format
First, the unique GO identifier is defined, followed by the term’s name (lines 2, 3). The “namespace” declaration in line 4 indicates to which of the three general categories the term belongs; here it is molecular function. Next, additional information is provided. At the end of each term, relations to other terms are listed (lines 7, 8). In the example of Listing 3.4, the term is connected to two other terms with is a relations. The actual ontology.obo file can be downloaded from the official GO website³ and currently includes 29,429 terms⁴ (18,118 biological processes, 2,642 cellular components and 8,669 molecular functions). These terms are associated with 353,869 gene products⁵ (252,697 biological processes, 250,657 cellular components and 267,874 molecular functions). Note that a gene product can be associated with more than one sub-ontology.
3.5.3 The GO Annotation Database
The annotation database contains the assignments of gene products to GO terms and is available for several species including Homo sapiens, mouse and yeast. The datasets used for experiments in this thesis include only Homo sapiens genes and therefore only the corresponding annotation database has been used. The database can be downloaded from the official GO website⁶ in tab-delimited plain text format, containing not only the assignments, but also the evidence code. The evidence code is a measure of how trustworthy the assignments are and where they have been extracted from.
³ http://www.geneontology.org/GO.downloads.ontology.shtml
⁴ As counted on January 28, 2010
⁵ As counted on January 28, 2010
⁶ http://www.geneontology.org/GO.current.annotations.shtml
3.6 Gene Ontology API for Java
In order to use the OBO file and the annotation database from Java code for gene
ontology vocabulary manipulation, an API was used called GO4J (version 1.1) [21].
The API contains four modules where each module handles a different task. First, GO4J
provides a GO definition parser that supports common GO formats. Next, classes are
provided to construct a directed graph containing all specified GO identities. Further,
a graphical user interface is given to visualize GO pathways and graph models. But
the most important functionality for this thesis is the API for GO semantic similarity
calculation based on different algorithms (see Section 2.3). While GO4J includes methods
introduced by Resnik [43] and Lin [15], the algorithm of Wang [57] had to be implemented
to extend the library.
3.7 NCBI EUtils
The National Center for Biotechnology Information (NCBI) is one of the main resources for information retrieval in genetics. It offers a wide variety of databases and tools for public and free use [45]. One of these tools is called Entrez Programming Utilities, or EUtils for short. It provides access to the NCBI databases outside of the regular web-query interface. The tool is located on one of the NCBI servers and can be accessed at http://eutils.ncbi.nlm.nih.gov/. EUtils parameters can be specified by adding “GET” parameters to the URL. EUtils offers eight different tools for accessing the databases, each of which corresponds to a specific task. For this thesis, the EUtils subprogram ESearch has been used for searching the corresponding official gene names for known NCBI accession numbers. Different options can be set for ESearch; some of them are listed
below.
• Database (db)
The database to be searched in.
• Term (term)
The term to be searched for. By default, the search term expects a search string.
To search for other types, like NCBI accession numbers, the type can be specified by appending a type declaration in square brackets.
• Retrieval Mode (retmode)
Specifies the format of the search result. By default, the result shows up as an
HTML page. The retrieval mode can be set to XML, too.
This is an example of an EUtils URL call used for experiments done for this thesis:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term=M26697[accn]&retmode=xml
The result of this call is an XML file including information about the NCBI entry for the
accession number M26697 in the “Genes” database. Note that [accn] indicates that the
search string is an accession number.
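A minimal sketch of issuing such a call from Java and extracting the gene UID from the XML response is given below; the regular expression is a simplification of full XML parsing, and the class name is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ESearchSketch {
    public static void main(String[] args) throws Exception {
        String accession = "M26697";
        URL url = new URL("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
                + "esearch.fcgi?db=gene&term=" + accession + "[accn]&retmode=xml");

        // Read the complete XML response returned by the EUtils server
        StringBuilder xml = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        for (String line; (line = in.readLine()) != null; ) {
            xml.append(line);
        }
        in.close();

        // Extract the first <Id> element, i.e. the NCBI UID of the matching gene entry
        Matcher m = Pattern.compile("<Id>(\\d+)</Id>").matcher(xml);
        if (m.find()) {
            System.out.println("Gene UID for " + accession + ": " + m.group(1));
        }
    }
}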
3.8 Benchmark Datasets
For evaluating different machine learning approaches, thirteen independent datasets have been used, see Table 3.1. The datasets are associated with different clinical problems and have been received from various sources. The Colon, Embrional Tumours, Leukemia and Lymphoma datasets have been obtained from the Bioinformatics Research Group, Spain⁷, Heart, Liver, Thyroid and Arcene from the UCI repository [7], Lupus from the Human-Computer Interaction Lab at the University of Maryland, USA⁸, Breast Cancer from the University of Minnesota⁹ and Lung Cancer from the Division of Thoracic Surgery at Brigham and Women’s Hospital, USA¹⁰. The Mesh dataset was generated from cardiac aortic valve images at Siemens [42]. The last dataset, HeC Brain Tumours, was obtained from hospitals participating in the Health-e-Child project.
The datasets vary in the number of attributes and the number of instances, see column
two and three of Table 3.1. All of these datasets represent binary classification problems,
so that there are only two possible classes an instance can belong to. The data type indicates the source where the data was extracted from, while the eight datasets above the double line in Table 3.1 are microarray gene expression datasets and the last five are non-genetic.

⁷ http://www.upo.es/eps/aguilar/datasets.html
⁸ http://hcil.cs.umd.edu/localphp/hcil/vast/archive/task.php?ts_id=145
⁹ http://www-users.itlabs.umn.edu/classes/Spring-2009/csci5461/index.php?page=homework
¹⁰ http://www.chestsurg.org/publications/expression-data-181.txt
Name               No. of Attributes   No. of Instances   Data Type           Gene Names
Breast Cancer      24482               97                 Genetic             yes
Colon              2001                62                 Genetic             yes
Embr. Tumour       7130                60                 Genetic             yes
HeC Brain Tumours  5580                67                 Genetic             yes
Leukemia           41                  72                 Genetic             yes
Lung Cancer        12534               181                Genetic             no
Lupus              172                 90                 Genetic             yes
Lymphoma           4027                45                 Genetic             no
Arcene             10001               200                Mass Spectrometry   no
Heart              14                  270                Medical             no
Liver              7                   345                Medical             no
Mesh               2941                63                 Image Data          no
Thyroid            6                   215                Unknown             no
Table 3.1: A list of the benchmark datasets used for the experiments conducted for this thesis.
For experiments with the GO semantic similarity, it is necessary to know the exact gene
names for the given features. This is only possible, if a certain gene identifier is given,
like a gene name or an accession number. The last column indicates if it is possible
to match the included features to genes, meaning that these datasets can be used for
experiments with the GO similarity. For the other datasets, the attributes may only
have an undefined number as feature name and therefore, it is not possible to derive any
semantic information about these features.
In order to run the experiments in a defined period of time and to remove redundant,
noisy and irrelevant features, the dataset attributes have been preselected by a filter feature selection method called ReliefF [26]. While all datasets with more than 200 features
have been reduced to 200 features, Lupus was reduced to 75 features and Leukemia was
used with the original number of features.
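A sketch of this pre-selection step with the Weka attribute selection API might look as follows; the class and method names are hypothetical, and the filter keeps the class attribute automatically.

import weka.attributeSelection.Ranker;
import weka.attributeSelection.ReliefFAttributeEval;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class ReliefFPreselection {
    // Reduce a dataset to its numFeatures top-ranked attributes according to ReliefF
    public static Instances reduceTo(Instances data, int numFeatures) throws Exception {
        ReliefFAttributeEval evaluator = new ReliefFAttributeEval();
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(numFeatures); // e.g. 200, or 75 for Lupus

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(evaluator);
        filter.setSearch(ranker);
        filter.setInputFormat(data);
        return Filter.useFilter(data, filter);
    }
}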
3.9 RSCTC’2010 Discovery Challenge
The RSCTC’2010 Discovery Challenge [1] is a part of the International Rough Sets and
Current Trends in Computing Conference to be held in Warsaw, Poland on June 28-30,
2010. The challenge task is to design a machine-learning algorithm that will classify
patients for the purpose of medical diagnosis and treatment by analyzing data from DNA
microarrays. Patients are categorized by gene transcription data containing between
20,000 and 65,000 features. The challenge includes two independent tracks, Basic Track
and Advanced Track, varying in the form of solutions to be submitted. In the Basic
Track, text files containing class labels for test samples of the datasets are submitted and
compared to ground truth on the TunedIT server¹¹. In the Advanced Track, Java source
code of a classification algorithm is submitted and compiled on the server. The classifier
is trained on a subset of data and evaluated on another subset.
3.9.1 Basic Track
In the Basic Track, six different datasets are given and split into labeled training data
and unlabeled test data. The patients of the test set have to be classified and a text file
containing only the class labels for the test instances is to be submitted for each dataset.
The six datasets vary in their number of classes, number of attributes and number of
instances. Another difference is the class distribution of the training instances, see the
last column of Table 3.2. Some of the classes occur in a small number of instances, making the classification task more difficult.

ID   No. of Attributes   No. of Instances   No. of Classes   Class Distribution
1    54676               123                2                88-35
2    22284               105                3                40-58-7
3    22278               95                 5                23-5-27-20-20
4    54676               113                5                11-31-51-10-10
5    54614               89                 4                16-10-43-20
6    59005               92                 5                11-7-14-53-7

Table 3.2: The six datasets provided for the Basic Track of the RSCTC’2010 Machine Learning Challenge

¹¹ http://tunedit.org/

The predicted class labels are uploaded online in a
zip compressed file and evaluated on the server by comparing them with the true class
labels. The accuracy within each single class is calculated and averaged for each dataset,
which is also called “within-class” accuracy or mean true positive rate. The final result is
the average percentage over all six datasets and is presented on a so-called leaderboard (a web page with a table including the preliminary results for the participants) without decimal places. These preliminary results are evaluated on a part of the test sets to
guarantee that the final result is not biased. The final results are to be calculated on a
separate subset of the test sets at the end of the challenge on February 28, 2010. The
number of submissions for each participant is limited to 100. After the challenge is closed,
the last solution submitted is used for final evaluations.
3.9.2 Advanced Track
In the Advanced Track, participants submit Java source code of the classification algorithm instead of class labels. The submitted classifier is automatically trained on a subset
of the challenge data and evaluated on a test set on the TunedIT server. The algorithm
is evaluated with cross validation 5 (preliminary) or 20 (final) times on each dataset. The
five datasets used for this evaluation are not disclosed to participants, but it is known that
they follow the same characteristics as the datasets from the Basic Track. The classifier
is evaluated on the server using five-fold cross validation with a fixed random seed for
data splitting for reproducibility and in order to be able to compare different solutions.
This track is more challenging as the datasets are unknown and the classifier has
to fulfill time and memory restrictions. The total server runtime for all five datasets is
limited to 5 hours, so that each run of cross validation for each dataset can take not
more than twelve minutes, on average. Identically to the Basic Track, 100 submissions
of solutions are allowed and the last one is used for final classification on six different
datasets where the algorithm will be evaluated 20 times with cross validation and has to
terminate in less than 24 hours.
The memory consumption is not allowed to exceed 1,500 MB where up to 500 MB
is normally used by the evaluation procedure to load the dataset into memory, so that
about 1000 MB can be used by the classification algorithm. The classification accuracy
is calculated the same way as in the Basic Track, but the leaderboard here shows the accuracy of the preliminary solution with two decimal places, in contrast to the Basic Track, where only an integer classification accuracy is presented.
CHAPTER 4

Methods
To improve the classification accuracy for distance learning with genetic data, three general solutions have been studied: replacement of the elementary distance function for learning from equivalence constraints, transformation of the data into another representation and incorporation of biological knowledge into classification. The introduced distance learning framework (Section 3.4) was modified to fulfill the requirements needed to perform the experiments for this thesis. The different approaches have been tested in several experiments under different conditions and compared to each other. The approaches and the methods used to perform the experiments are introduced in the following sections.
4.1 Reorganization of the Distance Learning Framework
To make the distance learning framework more modular, easier to extend and to maintain,
the framework was restructured from procedural to object-oriented Java code. Moreover, some of the hard-coded system parameters have been moved into the ARFF configuration file to be able to configure each run in as much detail as possible without changing
the code. The decision to use leave-one-out or 10-fold cross validation, to enable/disable
classification with k-NN, Random Forest, learning from equivalence constraints and the
intrinsic Random Forest similarity, the method to impute missing data and the distance
function for learning from equivalence constraints have been transferred to the ARFF configuration file. Further, the configuration file was extended by the attribute skip, which disables a single experiment line. The configuration file includes many different experimental settings for different tasks. To study one task, for example only the experiments where GO semantic similarity is incorporated, only the corresponding experiments are normally of interest and the other experiments can be skipped.

Figure 4.1: The structure of the reorganized object-oriented distance learning framework. The TestExecuter class (green rectangle) is detailed in Figure 4.2.
The main class of the new structure is called Experiment. Within this class, ConfigReader is called to read the configuration ARFF file (see Figure 4.1). Yet another class, ConfigChecker, verifies each configuration line and reports obvious misconfigurations. The ConfigChecker checks that the number of trees to be used for the Random Forest is a positive integer value, that at least one classifier is activated and that the distance function to be used for learning from equivalence constraints is defined. Further, the class checks if the configured dataset files exist at the supplied path. If the configurations are correct, the HTMLCreator is initialized. This class saves the results of all single experiment runs into an HTML result file. Afterwards, a loop over the single runs (experiments) configured in the ARFF file is started, where in each iteration a new TestExecuter object is declared and the execution time for this experiment is measured. The TestExecuter class performs the tests specified in the ARFF file; its structure is described in Figure 4.2.
Figure 4.2: Detailed view of the TestExecuter class. The parts not exploited in the
experiments in this thesis are grayed out. The cross validation procedure (green rectangle)
is described in more detail in Figure 4.3.
Such a structure makes it possible to automatically run an unlimited number of experiments one after another, while their results are reported to the same HTML file.
For each experiment, a new TestExecuter is launched with the corresponding dataset
declared in the configuration file.
First, the configuration line is read to set the experiment variables. Then, the dataset
ARFF file is loaded into memory. As this framework extends the WEKA framework [17],
a new Instances object is declared containing the dataset. Next, there is an opportunity
to impute missing data. In each cross validation run, one of four methods can be used to estimate the missing values: Mean/Mode imputation, k-NN imputation and multiple k-NN imputation with Random Subspacing or Bagging. In the experiments conducted
for this thesis, the imputation of missing data was disabled as this problem was out
of scope for our study. All datasets considered here have no missing values. After
data preprocessing, cross validation for the datasets with different splits of training and
testing sets is performed. To eliminate the bias caused by randomness of some learning
techniques, several iterations of cross validation are performed. The procedure of each
cross validation iteration is shown in Figure 4.3.
Figure 4.3: Detailed structure of a cross validation iteration where four types of learning
algorithms are used in one experiment.
In each run, the dataset is split randomly into a training and a testing set. The
size of the training set is specified in the configuration file. Before training the models,
artificial class noise can be injected into the training set. For learning from equivalence
constraints, the data has to be transformed into the product or the difference space. The four learning techniques k-Nearest Neighbour, Meta, learning from equivalence constraints and the intrinsic Random Forest similarity can be performed. Meta is a general name for different ensemble classifiers such as AdaBoost or Random Forest, where only one can be evaluated in each experiment. The configured Meta classifier is also used for experiments with learning from equivalence constraints. The representation type for learning from
equivalence constraints, product or difference space, is set in the configuration file, too.
For each experiment, the usage of each of these learning techniques can be switched on
or off in the configuration file.
After training the models, the test set is separately evaluated with each classification
model and the accuracy is averaged over the cross validation iterations. At the end of
each experiment, the resulting accuracies are provided to the HTMLCreator. After all
the experiments have been completed, the HTML result file is generated, containing a
separate table for each experiment, including the experimental configurations and the
accuracies achieved by the activated classifiers. An example of the HTML output table
for one experiment is shown in Figure 4.4.
The experiments specified in the configuration file are numbered sequentially. The
experiment identification number is shown on the top left of the HTML output. In Figure
4.4, for example, the output of experiment 85 is shown, meaning that this experiment is
defined in line 85 in the configuration file and 84 other experiments have been performed
before or have been skipped. First, the parameters set in the ARFF configuration file for
the actual experiment are listed with their corresponding values. In the shown experiment
output, the HeC Brain Tumours Dataset was used, as can be seen in the first line on the
right. For a better view of experimental results, the configuration values can be hidden
by clicking the “hide config” link at the top right. At the bottom, the accuracies for
the activated learning algorithms are shown within a table. In this example, Random
Forest was used as Meta classifier and reached an accuracy of 91.18 %. The instance-based classifier, k-Nearest Neighbour with k = 7 (IB7), got 83.82 %, learning distance functions from equivalence constraints in the difference space (LDF EC) performed best with 92.65 % and the intrinsic Random Forest similarity (RF LDF) reached 89.71 %. The worse result of the intrinsic Random Forest similarity classifier, in this case, was caused by the relatively small number of trees (25) used for this experiment (see m_ntrees_LDF_RF_dist in Figure 4.4) and could be improved by using more trees, usually up to 1000.

Figure 4.4: Output of a single experiment for the Health-e-Child Brain Tumours dataset in HTML format
4.2 Distance Function Learning From Equivalence Constraints
The first experiments conducted for this thesis have been intended to study the change of
the elementary distance function for learning from equivalence constraints in the difference
space to improve the classification accuracy. For learning from equivalence constraints
in the difference space, the dataset needs to be transformed. For two instances from the
original set A with attributes a1 , . . . , an and B with attributes b1 , . . . , bn , the transformation into the difference space is performed as shown in Equation 4.1, where d indicates a
distance function.
difference space(A, B) = \{d(a_1, b_1), \ldots, d(a_n, b_n)\}    (4.1)
A certain learning algorithm, Random Forest or AdaBoost, is then used to learn a distance
function from the equivalence constraints. Afterwards, at the application phase, for each
test instance, the k nearest neighbours are calculated with respect to the learned distance
function and the test instance is classified as the most frequent class of the determined
neighbours.
To transform the dataset, different distance functions can be used to calculate the
distance between attribute values an and bn , influencing the classification accuracy. To
evaluate which distance function performs best with genetic data, several approaches have
been tested under identical conditions. In some of the following distance calculations,
the distance is squared. This has the effect of transforming the negative distances into
positive distances which is often causes a crucial loss of information and was shown to
always decrease the accuracy in our preliminary experiments. To avoid this effect, the
difference between x and y is calculated for each pair of attribute values first and the sign
of this result is then preserved for each final distance calculation. For each two values x
and y of a certain feature fi from two instances A and B where i = 1, . . . , n, the distance
44
d(x, y) can be calculated with one of the following distance approaches.
4.2.1 The L1 Distance and Modifications of the L1 Distance
• The L1 distance between two values x and y is calculated by subtracting y from x. Taking the absolute value, as is usual for the L1 distance, is avoided in order not to lose crucial information.

d_{L1}(x, y) = x - y    (4.2)
• The Squared distance is a modification of the L1 distance where the result of the
L1 distance is squared.
d_{Squared}(x, y) = (x - y)^2 \cdot sgn(x - y)    (4.3)
• Another modification of the L1 distance is the Square-Root distance that is calculated by taking the square root of the L1 distance.
d_{Root}(x, y) = \sqrt{|x - y|} \cdot sgn(x - y)    (4.4)
• Different to the Squared distance (Equation 4.3), where the result of the L1 distance
is squared, in the Element Squared distance, the single values x and y get squared
before subtracting them.
d_{ElementSquare}(x, y) = (x^2 - y^2) \cdot sgn(x - y)    (4.5)
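The four elementary distances above can be summarized in a few lines of Java; Math.signum plays the role of sgn, and the class name is hypothetical.

public final class ElementaryDistances {
    public static double l1(double x, double y) {
        return x - y;                                            // Equation 4.2
    }
    public static double squared(double x, double y) {
        return (x - y) * (x - y) * Math.signum(x - y);           // Equation 4.3
    }
    public static double root(double x, double y) {
        return Math.sqrt(Math.abs(x - y)) * Math.signum(x - y);  // Equation 4.4
    }
    public static double elementSquared(double x, double y) {
        return (x * x - y * y) * Math.signum(x - y);             // Equation 4.5
    }
}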
4.2.2 The Simplified Mahalanobis Distance
The Mahalanobis distance [14] is based on correlations between two vectors and is scale-invariant, i.e. it does not depend on the scale of the attribute values. To calculate the
Mahalanobis distance, the standard deviation δ is prepared in advance over all instances
for the attribute for which the actual distance is calculated.
d_{Mahalanobis}(x, y) = \sqrt{\frac{(x - y)^2}{\delta^2}} \cdot sgn(x - y)    (4.6)
It must be noted that the general Mahalanobis distance, including a sum of elements, is simplified in our case to one dimension, as the distance has to be calculated between the two one-element vectors x and y.
4.2.3 The Chi-Square Distance
The Chi-Square distance [60, 35] is a normalized distance, where x is divided by the sum
of the a1 , . . . , an attributes of the instance A and y by the sum of the b1 , . . . , bn attributes
of the instance B.
2

1
dChi−Square (x, y) = P
m P
n
 x
y 
 ∗ sgn(x − y)
∗
−
n
n

P
P
ji
ai
bi
j=1 i=1
i=1
(4.7)
i=1
where m is the number of instances in the dataset. Less formally, \sum_{j=1}^{m}\sum_{i=1}^{n} V_{ji}, where V_{ji} is the value of attribute i of instance j, is calculated by summing the values of all attributes over all instances of the dataset, while \sum_{i=1}^{n} a_i is the sum of the attribute values only for instance A, and x is the current element the distance is calculated for. \sum_{i=1}^{n} b_i is calculated respectively for B, which contains y.
4.2.4 The Weighted Frequency Distance
The weighted frequency distance was developed for this thesis and is based on the idea
that the distance between x and y may depend on the distance of x and y to the corresponding attribute values of the other instances in the dataset. With this distance, we tried to take the distribution of the attribute values (for which the distance is calculated) over all instances into account. For example, let x = a_1 = 2 and y = b_1 = 8, meaning the first
attribute of instances A and B. First, the number of instances where the value of their
first attribute is in the range between x and y is counted. The number of instances m
with an attribute value between x and y is divided by the number of all instances M to
get a percentage value p.
p = \frac{m}{M}    (4.8)
The L1 distance between x and y is calculated and multiplied by p. As the Frequency Distance performed badly by itself in our preliminary tests, it has been extended by combining it with the L1 distance. Both parts, the Frequency Distance and the L1, are weighted by
a factor w, where 0 < w < 1. Note that if w = 1, the distance is equivalent to L1 .
d_{Frequency}(x, y) = (w \cdot (x - y) - (1 - w) \cdot (x - y) \cdot p) \cdot sgn(x - y)    (4.9)
For the experiments conducted for this thesis, a weight of w = 0.5 was used.
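For illustration, a sketch of the weighted frequency distance for a single attribute is given below; whether the boundary values x and y themselves are counted is an implementation choice not fixed by the description above, and the class name is hypothetical.

public final class FrequencyDistance {
    // values holds the attribute (for which the distance is computed) over all instances
    public static double distance(double x, double y, double[] values, double w) {
        double lo = Math.min(x, y);
        double hi = Math.max(x, y);
        int m = 0; // number of instances whose attribute value lies between x and y
        for (double v : values) {
            if (v >= lo && v <= hi) {
                m++;
            }
        }
        double p = (double) m / values.length; // Equation 4.8
        // Equation 4.9: weighted combination of the L1 part and the frequency part
        return (w * (x - y) - (1 - w) * (x - y) * p) * Math.signum(x - y);
    }
}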
4.2.5 The Canberra Distance
The Canberra distance function [35] calculates the absolute value of x - y divided by the absolute value of the sum x + y.
d_{Canberra}(x, y) = \frac{|x - y|}{|x + y|} \cdot sgn(x - y)    (4.10)

4.2.6 The Variance Threshold Distance
This new distance was developed for experiments for this thesis; the basic idea is to group
the attributes based on their expression values. We tried to model a biological view on
the expression values of the attributes. It was assumed that the exact expression value is
not the most important information, but the distance to the mean value of the attribute
(for which the distance is calculated) over all instances. We defined that a gene can either be strongly overexpressed, overexpressed, normally expressed, underexpressed or strongly underexpressed. First, the mean m_i and standard deviation δ_i are calculated for the attribute over the training set and five ordered groups are prepared, where each attribute value, x and y, is classified independently into one of these groups by the following rules:
x_{new} = \begin{cases} 2 & \text{if } x > m + \delta & \text{(strongly overexpressed)} \\ 1 & \text{if } m < x < m + \delta & \text{(overexpressed)} \\ 0 & \text{if } -m < x < m & \text{(normal)} \\ -1 & \text{if } -m - \delta < x < -m & \text{(underexpressed)} \\ -2 & \text{if } x < -m - \delta & \text{(strongly underexpressed)} \end{cases}    (4.11)
The group value for y is calculated analogously. Next, the distance is defined as the
absolute L1 distance of group values of x and y.
d_{VarianceThreshold}(x, y) = |x_{new} - y_{new}|    (4.12)
For example, if x is strongly overexpressed (2) and y is underexpressed (-1), the Variance Threshold distance is d(x, y) = 3. Note that if x and y are in the same expression group, the distance is zero.
We called a special variant of this distance Zero Variance Distance, where zero is
used instead of the mean value. We used both the Mean and the Zero Variance Distance for experiments conducted in this thesis.
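A compact sketch of the grouping rules of Equation 4.11 and the distance of Equation 4.12 is shown below; the handling of values falling exactly on a group boundary is an assumption, and the Zero Variance variant corresponds to calling group() with m = 0.

public final class VarianceThresholdDistance {
    // Map an expression value to one of the five groups of Equation 4.11
    static int group(double x, double m, double delta) {
        if (x > m + delta)   return  2;  // strongly overexpressed
        if (x > m)           return  1;  // overexpressed
        if (x > -m)          return  0;  // normal
        if (x > -m - delta)  return -1;  // underexpressed
        return -2;                       // strongly underexpressed
    }

    // Equation 4.12: absolute L1 distance between the group values
    public static double distance(double x, double y, double m, double delta) {
        return Math.abs(group(x, m, delta) - group(y, m, delta));
    }
}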
4.2.7 Test Configurations
The introduced distances have been tested on nine of the benchmark datasets described
in Table 3.1: Mesh, Lymphoma, Embr. Tumours, Colon, Leukemia, Arcene, Liver, Thyroid and Heart. The other datasets have been skipped for this experiment as they were
not available at the time the tests were conducted. For datasets with fewer than 90
instances, Leave-One-Out (LOO) cross validation was used and for the other datasets,
10-fold cross validation with 30 iterations was used. All tests have been done in the difference space with Random Forest as the learning algorithm and the average classification
accuracy over the datasets was calculated and is shown in Table 5.1.
4.3 Transformation of Feature Representation for Gene Expression Data

4.3.1 Motivation
The genetic datasets described in Table 3.1 contain gene expression values, where each attribute is the expression of a single gene. In biology, the influence of a gene on a certain disease often depends not only on its own expression, but also on the expression of some other genes interacting with it. In a certain disease, a high expression of a gene A might only influence the etiopathology if a second gene B is overexpressed, too. In another
case, it might only be important if B is underexpressed. By training a classifier with
the microarray datasets in the usual gene expression representation, these dependencies
are often not considered. To take these gene co-operations into account, the data can be transformed
into another representation. A normalization of the genes with respect to other genes is
needed.
48
Another motivation for changing the feature representation was to incorporate the GO
semantic similarity into classification. The GeneOntology provides similarity measures
between two genes A and B, while most learning algorithms, when the usual representation is used, consider differences between patients with given gene expression values. The GO provides information about gene-gene interactions, while the classifiers normally use patient-patient relations for classification. There is no obvious way to incorporate the semantic similarity between two genes to improve the classification of patients when
the plain representation is used.
For these reasons, the original datasets can be transformed into a new representation
of gene-pairs instead of single genes. First, all possible pairs of single genes available in
the dataset are generated. As the information of the two pairs of a gene A and a gene
B, P air(A, B) and P air(B, A) is redundant, only one of these pairs is used. If n is the
number of genes (features) in the dataset, then (n^2 - n)/2 pairs are generated.

\{Gene A, Gene B, Gene C\} \xrightarrow{\text{Transformation}} \{Pair(A, B), Pair(A, C), Pair(B, C)\}    (4.13)
We call this new representation Gene-Pair Representation. In (4.13), three genes are
transformed into three gene pairs. But the number of pairs grows quadratically with the number of genes. Most datasets in our experiments have been reduced to 200 features, so that
19,900 gene pairs are constructed. Training a classification model may not finish in a
limited period of time with this large number of features. Moreover, most of these pairs
will not be useful for classification. Therefore, a feature selection has to be done to
select the most discriminative pairs in advance. With this new gene-pair representation,
the genes are normalized with respect to the other genes and it is possible to include
the GO semantic similarity as the transformed datasets are represented with gene-gene
relations. Therefore, an instance I in the new gene-pair representation is defined by a
certain amount of pairs of genes pi and a class label c.
I = (p_1, \ldots, p_n, c)    (4.14)
4.3.2 Framework Updates
To transform the original datasets into the gene-pair representation, the distance learning
framework was extended by a preprocessor class for constructing and selecting the pairs.
In each cross validation iteration, the training set and the testing set are transformed into
the gene-pair representation before learning the models. For each iteration, the selected
pairs are provided to a class reporting statistics of the pairs (Figure 4.5). This class
counts how often a pair is selected over all cross validation iterations. Further, the
ranking of the pairs in each iteration by the feature selection algorithm is provided to the
statistics reporter.
Figure 4.5: Modified Framework for transforming the dataset into the gene-pair representation. The new components are shown in blue.
The usage of the gene-pair representation can be enabled in the ARFF configuration
file and is shown in the HTML output on the bottom left of the configuration parameters.
To access the pair statistics for an experiment, one can click on the dataset name in the
header of the result table for each experiment. The detailed process of transforming the
data representation can be seen in Figure 4.6.
Figure 4.6: Workflow of the transformation from the single-gene to the gene-pair representation, where n is the number of features and N is the number of pairs to be selected.

Before training the model of the actual cross validation iteration, the preprocessor is started on the training set. Let n be the number of genes in the dataset and N be the number of pairs to be used for classification. First, all (n^2 - n)/2 pairs are generated by
subtracting the gene expression of one gene from that of another. The Pair(A, B), for example, is constructed by computing A - B. Note that compared to the common usage of a distance, where the absolute value is often used, the sign is kept in the pair calculations in order not to lose information. For example, let A = -15.65 and B = -3.43, then Pair(A, B) = -12.22. Next, the transformed training set is reduced to a configured number of pairs, N. Therefore, a filter feature selection method, ReliefF, is used first to pre-select 4N pairs out of all. The filter selection is used to pre-select a certain number of pairs because the wrapper method is too computationally expensive. After the pre-selection, a subset evaluation method using a greedy stepwise search is applied to select a set of the N most discriminative pairs out of the 4N pre-selected pairs. In the experiments done for this thesis, Correlation-based Feature Subset Selection (CFS) [22] has been used as the wrapper. The number of pairs to be selected for classification, N, was set to 100 for all the experiments. With a dataset containing 200 genes, the procedure would transform and reduce the number of features as follows:
200 \xrightarrow{\text{build pairs}} 19{,}900 \xrightarrow{\text{ReliefF}} 400 \xrightarrow{\text{CFS}} 100    (4.15)
The selected pairs are saved within the preprocessor class and are used to reduce the
test set, too. First, all pairs are generated, then the test set is reduced to the same
pairs as the training set. After transforming both, the training and the testing set, the
classification models are learned using the same techniques as with the datasets having
the plain representation. The classification accuracy depends highly on the number of pairs to be selected and on the multiplication factor for the pairs selected preliminarily by the filter. Our preliminary experiments have demonstrated that the models perform best with 100 pair-features. We pre-selected 400 features because using fewer features in preliminary tests resulted in worse accuracy. By using more than 400 features, the execution time of the wrapper increased considerably while no considerable increase in the accuracy was observed.
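The pair construction step itself can be sketched as follows, assuming the expression values are given as a plain matrix data[instance][gene]; the subsequent ReliefF and CFS selection steps are omitted, and the class name is hypothetical.

public final class GenePairTransform {
    // Transform a single-gene expression matrix into the gene-pair representation
    public static double[][] toPairs(double[][] data) {
        int n = data[0].length;              // number of genes
        int numPairs = (n * n - n) / 2;      // all unordered pairs
        double[][] pairs = new double[data.length][numPairs];
        for (int inst = 0; inst < data.length; inst++) {
            int p = 0;
            for (int a = 0; a < n; a++) {
                for (int b = a + 1; b < n; b++) {
                    // Pair(A, B) = A - B; the sign is kept deliberately
                    pairs[inst][p++] = data[inst][a] - data[inst][b];
                }
            }
        }
        return pairs;
    }
}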
4.3.3 Test Configurations
For all the following experiments described in this thesis, the datasets Heart, Liver and Thyroid have been neglected as they contain fewer features and are non-genetic. Arcene and Mesh are non-genetic too, but they contain enough features to get comparable results and are used as non-genetic reference datasets. The other datasets have been used to compare the classification performance of the gene-pair representation with that of the
original one. As described before, the reduced datasets have been used, containing 200
single gene features. The number of pairs to select with CFS was set to 100 with 400
preselected by ReliefF. For datasets containing more than 90 instances, 10-fold cross
validation with 30 iterations has been used and leave-one-out cross validation for the
others. Each dataset was evaluated with four different classifiers, k-Nearest-Neighbours
(kNN) classification, Random Forest (RF), learning from equivalence constraints (EC)
and the intrinsic Random Forest Similarity (iRF). To measure the robustness of the genepair representation to noise, the same tests have been conducted also with noisy data.
First, 10% class noise has been injected artificially into the training set followed by 20%
in the second experiment. The classification accuracies of the two different representations were compared over the four classifiers.
4.4 Integration of GO Semantic Similarity

4.4.1 Motivation
With the novel representation of features as differences in the expression values of pairs of genes introduced in Section 4.3, it becomes possible to incorporate external biological
knowledge into classification algorithms. There are several reasons that motivate us to
use the semantic similarity to guide the classification of gene expression data. First, the
selected pairs may itself contain valuable biological knowledge that can be used to guide
the classification. A set of two genes that are known to interact, might be more useful
for classification than two genes that are not associated with each other. The usually large number of features in gene expression data makes it difficult for common feature selection methods not to ignore some important features. The support of the semantic similarity
might guide the selection process and help to consider pairs that would otherwise be
neglected by the feature selection algorithm but might be representative for the studied
task. Next, the importance of a pair can be weighted by the semantic similarity between
the two genes. Gene-pair features with a high semantic similarity might be more useful for classification than others. In both approaches described above, the background
biological knowledge is used to support classification decisions and is assumed to increase
the classification accuracy. Compared to all other reports of using semantic similarity
for classification support, the approach used for this thesis is different as the gene-pair
representation was developed offering the opportunity to apply the semantic similarity
directly to the expression values.
4.4.2 Framework Modifications
In order to incorporate the GO semantic similarity, the distance learning framework had
to be extended. A serious challenge for using the semantic similarity was to retrieve
the similarity in a short time. The genetic datasets used for testing do not include GO
identifiers, but the NCBI accession numbers (except the HeC Brain Tumours dataset,
where the features are named by the official gene symbol). The GO does not include
53
NCBI accession numbers, therefore, it is not possible to retrieve the corresponding GO
identifiers for a gene directly by the given feature name from the GO OBO file or the
annotation database. A GO identifier for a certain gene can be searched by the gene’s
official symbol as included in the HeC Brain Tumours dataset. For the other datasets,
an intermediate step is needed to get the official gene symbol for the given unique NCBI
accession number. Therefore, a new class was implemented, calling the EUtils web service
via a URL to search the NCBI gene database for an accession number and for extracting
the official gene symbol out of the XML response (an example can be seen in Expression
4.16). The same class accesses the local GO annotation database to extract all GO
identifiers associated with this gene symbol. A string is constructed, containing a list of
GO identifiers delimited by exclamation marks. This procedure translates the original
dataset into a new dataset, containing the associated GO identifiers of the genes as
attribute names and is described in Figure 4.7.
Figure 4.7: Translation of the original dataset into a GO ID dataset
AF055033 \xrightarrow{\text{NCBI EUtils}} IGFBP5 \xrightarrow{\text{Annotation DB}} GO:0007242!GO...    (4.16)
The result is a copy of the dataset’s ARFF file where the attribute names are replaced
by the GO identifier string. This translation has to be done only once; afterwards, the new ARFF file can be used for classification tests. The process of retrieving the GO identifiers for a gene is time consuming because for each gene, the NCBI web service has to be called, the server response has to be parsed and an SQL database query has to be executed. Saving the GO identifiers as a string in the attribute names of the ARFF file reduces the execution time, as the retrieval of the GO identifiers can be done once in advance.
This gives a big time advantage compared to retrieving the GO identifiers separately for
each experiment, as for some datasets, the retrieval procedure can take several hours.
Another time consuming part of retrieving the semantic similarity information is the
calculation of the similarity for two genes based on the associated GO terms. The GO
identifiers for each gene are saved in the dataset ARFF file, but the similarity of two of
these genes is not yet calculated. As the different methods introduced for calculating the
semantic similarity are relatively time consuming, the similarities for all possible gene
pairs of a dataset are calculated in advance and saved into a .govalues file.
To calculate the similarity, a new class, SimilarityCalulator, was implemented,
using the GO4J API. The class extends the API by new methods for calculating the semantic similarity between two GO identifiers as described by Wang [58]. The GO4J API
and the SimilarityCalulator both use the GO OBO file for calculating semantic similarities between two GO terms. Moreover, different approaches have been implemented
to derive the similarity between two genes: Max, Average, Azuaje and a self-developed method referred to as Schoen. The method Schoen is a combination of the Max and Average methods. First, all pairwise GO term similarities between the terms of the two genes are calculated. The upper third of the values with the highest semantic similarity
are considered and the average over this group is used as the final similarity value, see the
Java code in Listing 4.1. Note that in line 8, the similarity between two GO identifiers
(terms) is calculated by using the GO4J API with the Lin similarity.
1   private double schoenSimilarity(ArrayList<GOTerm> terms1, ArrayList<GOTerm> terms2) {
2       int numOfValues = terms1.size() * terms2.size();
3       double[] maxsim = new double[numOfValues];   // all pairwise term similarities
4       int count = 0;
5
6       for (int i = 0; i < terms1.size(); i++) {
7           for (int j = 0; j < terms2.size(); j++) {
8               double s = model.evalSimilarity(terms1.get(i), terms2.get(j), Similarity.LIN);
9               maxsim[count] = s;
10              count++;
11          }
12      }
13      Arrays.sort(maxsim);                          // sort in ascending order
14      int bestthird = 2 * (maxsim.length / 3);      // start index of the upper third
15      double[] result = Arrays.copyOfRange(maxsim, bestthird, maxsim.length);
16      double average = 0;
17      for (int i = 0; i < result.length; i++) {
18          average += result[i];
19      }
20      double endsim = 0;
21      if (result.length != 0) {
22          endsim = average / result.length;         // mean of the upper third
23      }
24      return endsim;
25  }
Listing 4.1: Java code of the Schoen similarity calculation between two genes. The
method is called with the GO terms for each gene as input parameters and uses the class
variable model that includes the GO graph.
One more method was implemented for the framework, the Graph Information Content
(GIC) [37]. This algorithm constructs two sets, the union and the intersection of the sets of
GO terms of the two genes. It then divides the sum of the Information Content (IC) of
the GO terms shared by both genes (the intersection) by the sum of the IC of all GO terms
associated with either gene (the union). For two genes A and B, the GIC similarity
is defined as:
\mathrm{simGIC}(A, B) = \frac{\sum_{t \in A \cap B} IC(t)}{\sum_{t \in A \cup B} IC(t)} \qquad (4.17)
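The following minimal sketch illustrates Equation (4.17). It assumes a precomputed map from GO term to its information content and is an illustration rather than the framework's actual implementation:

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // simGIC: sum of IC over the shared GO terms, divided by the sum of IC
    // over all GO terms of either gene (assumes every term has an IC entry).
    static double simGIC(Set<String> termsA, Set<String> termsB, Map<String, Double> ic) {
        Set<String> union = new HashSet<String>(termsA);
        union.addAll(termsB);
        Set<String> intersection = new HashSet<String>(termsA);
        intersection.retainAll(termsB);

        double icIntersection = 0.0, icUnion = 0.0;
        for (String t : intersection) icIntersection += ic.get(t);
        for (String t : union) icUnion += ic.get(t);
        return icUnion == 0.0 ? 0.0 : icIntersection / icUnion;
    }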
The similarity values of the presented approaches are used to generate different .govalues
files, one for each method. Each line of such a file defines one gene pair: the GO identifier
strings of the genes serve as the key, followed by a tab-delimited similarity value. For each
method tested, a .govalues file was created and used. This is again a precomputation
step to reduce the running time of the framework.
The framework was extended to load the .govalues file by start up for each experiment
and to save the similarity values for each gene-pair of the dataset of the current experiment
into a Java HashMap. The similarity to be used can be defined in the framework and tests
can be run with similarity values calculated by different methods. The modifications of
the framework to handle the GO semantic similarity are described in Figure 4.8.
In the experiments conducted for this thesis, the GO semantic similarity can be applied
to the learning algorithms generally in two different ways, by feature weighting or feature
selection.
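As a minimal sketch, the startup loading of such a .govalues file into a Java HashMap could look as follows; one pair per line, with the key string and the similarity value separated by a tab, is assumed (the exact key layout of the format is simplified here):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    static Map<String, Double> loadGoValues(String path) throws IOException {
        Map<String, Double> sim = new HashMap<String, Double>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        while ((line = in.readLine()) != null) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) continue; // skip malformed lines
            String key = line.substring(0, tab);
            sim.put(key, Double.valueOf(line.substring(tab + 1)));
        }
        in.close();
        return sim;
    }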
Figure 4.8: Modification of the framework to incorporate the GO semantic similarity in three different ways. (For each experiment, the prepared .govalues similarities are loaded; within each cross-validation iteration, the GO similarity can be used in the Preprocessor class to select pairs, to multiply the pair values, or to weight the features in the k-NN model.)
4.4.3 GO-Based Feature Weighting
As feature weighting has been applied to the k-Nearest Neighbour classifier before [49],
the experiments with GO feature weighting have only been evaluated with the k-Nearest
Neighbour classifier. The weighting can be done at two different positions in the
framework: in the preprocessor, when the pairs are constructed, or in the k-Nearest
Neighbour algorithm.
In the first solution, the weighted Euclidean distance [28] is used to incorporate the weight
of each gene-pair immediately while constructing the pairs. Each pair value is weighted
by the corresponding semantic similarity of the pair:

Pair(a, b) = \sqrt{(a - b)^2 \cdot sim(a, b)} \qquad (4.18)
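In code, Equation (4.18) amounts to scaling the absolute expression difference by the square root of the pair's semantic similarity; a minimal sketch (the method name is hypothetical):

    // Weighted pair value: sqrt((a - b)^2 * sim) = |a - b| * sqrt(sim),
    // so pairs of semantically similar genes keep more of their difference.
    static double weightedPairValue(double a, double b, double sim) {
        return Math.sqrt((a - b) * (a - b) * sim);
    }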
The second possibility is to use the semantic similarity values as weights within the k-Nearest
Neighbour classifier. We used k-Nearest Neighbour as it is one of the best studied and
most robust distance-based classifiers, and we used only this one classifier here, as the purpose of
these experiments is solely to test the hypothesis that distance learning methods can be
improved by GO-based feature weighting.
4.4.4 GO-Based Feature Selection
The second approach to use the semantic similarity to support classification is to select
pair-features based on their GO similarity values. A new feature selection algorithm has
been implemented that can select pairs in different ways:

1. The n pairs with the biggest semantic similarities.
2. The n pairs with the lowest semantic similarities.
3. All pairs with a similarity bigger than a given threshold.
4. All pairs with a similarity less than a given threshold.
5. All pairs whose similarity y differs from a defined value x by no more than a given threshold t: x - t < y < x + t.
For the experiments done for this thesis, only the first method of the enumeration is used:
selecting the n pairs with the biggest similarity. This feature selection has been combined
with the CFS feature selection to exclude noisy and redundant pairs. First, the 400 pairs
with the biggest GO similarity have been selected, followed by CFS to select the 100
most discriminative gene-pairs out of these 400.
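The following sketch outlines this two-step selection with the WEKA attribute selection API. The map from attribute name to GO similarity is assumed to be precomputed (for example, loaded from a .govalues file, with an entry for every pair attribute); the sketch is an illustration, not the framework's exact code:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.attributeSelection.GreedyStepwise;
    import weka.core.Instances;

    static Instances goGuidedSelection(final Instances pairs,
            final Map<String, Double> goSim) throws Exception {
        // rank the pair attributes by their GO semantic similarity, highest first
        List<Integer> indices = new ArrayList<Integer>();
        for (int i = 0; i < pairs.numAttributes(); i++) {
            if (i != pairs.classIndex()) indices.add(i);
        }
        Collections.sort(indices, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                double sa = goSim.get(pairs.attribute(a).name());
                double sb = goSim.get(pairs.attribute(b).name());
                return Double.compare(sb, sa);
            }
        });

        // step 1: keep only the 400 pairs with the highest similarity
        Set<Integer> keep = new HashSet<Integer>(
                indices.subList(0, Math.min(400, indices.size())));
        Instances reduced = new Instances(pairs);
        for (int i = reduced.numAttributes() - 1; i >= 0; i--) {
            if (i != reduced.classIndex() && !keep.contains(i)) {
                reduced.deleteAttributeAt(i); // delete top-down so indices stay valid
            }
        }

        // step 2: CFS selects the most discriminative pairs out of the 400
        AttributeSelection cfs = new AttributeSelection();
        cfs.setEvaluator(new CfsSubsetEval());
        GreedyStepwise search = new GreedyStepwise();
        search.setGenerateRanking(true);
        search.setNumToSelect(100);
        cfs.setSearch(search);
        cfs.SelectAttributes(reduced);
        return cfs.reduceDimensionality(reduced);
    }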
To identify the best matching semantic similarity calculation technique for genetic
data, different combinations of similarity calculation methods have been tested. Three
parameters can influence the classification result: first, the method used to calculate
the similarity between single GO terms; next, the algorithm used to combine these
similarities for two genes; and last, the way the similarity is applied to the learning
algorithms.
Besides the accuracy evaluation, the average semantic similarity of all calculated pairs
and of the selected pairs has been reported. It was assumed that pairs selected by the
common feature selection have a higher average similarity than all pairs. This would mean
that pairs with a high similarity have a better chance to be discriminative for classification,
and that GO-based feature selection is therefore likely to improve classification accuracy.
4.4.5 Test Configurations
For the GO semantic similarity, only six datasets could be used, as genetic datasets are
needed in which the gene names of the attributes are known. Only Breast, Colon,
Embrional Tumours, HeC Brain Tumours, Leukemia and Lupus include attribute names
for which the corresponding gene can be identified. As Leukemia includes only 38 attributes
that can be matched to genes, this dataset has been excluded from the experiments.
Apart from the HeC Brain Tumours dataset, which contains gene symbols, the other four
datasets contain NCBI accession numbers as attribute names. The described method
was used to find GO identifiers for the genes. As the GeneOntology is not complete and
some attribute names cannot be matched to genes, the number of features of the datasets
has been reduced so that at least one GO term can be retrieved for each attribute
remaining in the dataset.
First, the datasets have been tested with different combinations of similarity
calculation methods and the two possibilities to include feature weighting. The following
similarity calculation methods have been compared:
• Max-Resnik: Maximum value of Resnik similarities;
• Max-Lin: Maximum value of Lin similarities;
• Max-Wang: Maximum value of Wang similarities;
• Azuaje-Resnik: Azuaje combination of Resnik similarities;
• Azuaje-Lin: Azuaje combination of Lin similarities;
• Azuaje-Wang: Azuaje combination of Wang similarities;
• Schoen-Lin: Schoen combination of Lin similarities;
• GIC: Graph information content;
• Random: Random values in the range [0, 1] are used as similarity;
The Random method was used as a reference, to see whether a change in
accuracy is caused by the methods or by chance. For this, a new .govalues file was
created in which the similarity for each pair was defined as a random value in the range from
zero to one. The Schoen method was tested only in combination with the Lin similarity;
the intention was merely to test whether the Schoen similarity calculation is comparable
in classification accuracy to the Max and the Azuaje approaches.
The described methods have been compared by performing tests on the datasets
Breast, Colon, Embrional Tumours and Lupus. The two described methods to incorporate GO-based feature weighting have been compared under the same classification
conditions and the average semantic similarities of the pairs and the selected pairs have
been reported.
Next, the GO-based feature selection technique was tested without feature weighting.
The 400 gene-pairs with the highest semantic similarity have been preselected,
followed by a CFS-based reduction to 100 pairs. This method has been evaluated on the five
datasets, including the HeC Brain Tumours dataset, and compared with respect to the
classification accuracy.
4.5 RSCTC'2010 Discovery Challenge
In order to compare distance learning functions to other state-of-the-art learning methods,
we registered for the RSCTC'2010 Discovery Challenge [1]. In the previous sections, two
new techniques, the gene-pair representation and the incorporation of external knowledge
into classification, have been introduced. As the challenge datasets do not provide
information about the gene names, the incorporation of the GO semantic similarity was
impossible. Further, the experiments done for this thesis with the gene-pair representation
have only been validated on binary classification problems, but all challenge
datasets except one include more than two classes. Therefore, the gene-pair representation
was not tested on the challenge datasets. Instead, we tried to optimize
the learning algorithms for the challenge datasets by testing different feature selection
methods, oversampling techniques (explained later) and configurations of the learning
methods. We tried to find the best settings with respect to the classification accuracy
for each dataset separately, as there is no learning technique that performs best on all
datasets (no free lunch theorem [62]).
4.5.1 Basic Track
For the Basic Track, six labeled training sets are provided. The task is to predict class
labels for the corresponding test sets. The predicted labels for each dataset are saved into
a separate text file. The six files are compressed into a zip file and uploaded to the challenge
website. The uploaded class labels are compared to the ground truth immediately
after the submission, and the preliminary classification accuracy of the submitted solution is
presented on the so-called leaderboard.
First, the training sets have been used to cross validate the four different learning
methods, k-Nearest Neighbour classification, Random Forest, learning from equivalence
constraints and the intrinsic Random Forest Similarity, on our local computer. A 10-fold
cross validation with 30 iterations was used. Out of these four classifiers, the intrinsic
Random Forest Similarity performed best on all datasets and has been selected to be
used in all further experiments.
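A minimal sketch of such a local cross-validation run with WEKA is given below; the Random Forest stands in for any of the four learners, the tree count of 25 is an assumption for this sketch, and the seed loop yields the repeated iterations:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;

    // average accuracy of repeated 10-fold cross validation on one training set
    static double cvAccuracy(Instances train, int iterations) throws Exception {
        double sum = 0.0;
        for (int seed = 0; seed < iterations; seed++) {
            RandomForest rf = new RandomForest();
            rf.setNumTrees(25); // tree count is an assumption here
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(rf, train, 10, new Random(seed));
            sum += eval.pctCorrect();
        }
        return sum / iterations;
    }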
One problem with the challenge datasets is the unequal distribution of classes, because
for many learning algorithms the classification accuracy can suffer from unbalanced data [59].
Some classes are represented by fewer than ten instances while other classes include
more than 50 instances. For equally distributed classes, the probability for an unlabeled
instance to be classified into the correct class by chance is the same for all classes. For
unequally distributed classes, the probability of an instance of an over-represented class
to be classified correctly by chance is higher than for an under-represented class. Suppose,
for example, that class A is represented by 1000 instances and class B by 10 instances;
an instance of class A is then more likely to be classified correctly by chance than an instance
of class B. This unequal class distribution can bias the classification accuracy for many learning
algorithms. To eliminate this problem, the training datasets have been oversampled by
duplicating the instances of under-represented classes until the class distributions were
roughly equal (see Listing 4.2).
public Instances oversample(Instances inst, int limit) {
    // count the class occurrences
    int[] classCounts = new int[inst.classAttribute().numValues()];
    for (int i = 0; i < inst.numInstances(); i++) {
        int classValue = (int) inst.instance(i).value(inst.numAttributes() - 1);
        classCounts[classValue]++;
    }

    // get the most frequent class
    int mostFrequentClass = Utils.maxIndex(classCounts);

    // oversample the other classes
    int numInst = inst.numInstances();
    for (int i = 0; i < numInst; i++) {
        int classValue = (int) inst.instance(i).value(inst.numAttributes() - 1);
        int border = (int) classCounts[mostFrequentClass] / classCounts[classValue] - 1;
        for (int j = 0; j < border && j < limit; j++) {
            inst.add(new Instance(inst.instance(i)));
        }
    }
    return inst;
}
Listing 4.2: Algorithm to oversample a dataset (inst) with unequally distributed classes.
At first, the most frequent class is identified by counting the number of instances of
each class. For each other class, the number of instances of the most frequent class is
divided by the number of instances of the current class to get the number of times the
instances of the under-represented class have to be duplicated. Moreover, a limit can be
defined to stop oversampling once a certain number of copies has been made; this produces
lighter oversampling and avoids generating too many synthetic cases. For the example
given above, where class A contained 1000 and class B contained 10 instances,
class B has to be copied 100 times to be distributed equally to class A. The limit
defines a maximal multiplication factor for the under-represented class: in this example,
if the limit is set to 20, class B can be multiplied only 20 times and will contain not more than
200 instances. Oversampling with different limits has been tested by submitting the
classification results to the challenge website, where the best accuracy was reached with
no specified limit. The oversampled datasets have been used for further experiments.
With the oversampled datasets, different two-step filter-wrapper feature selection approaches
have been tested. A GainRatio filter was used to pre-select a set of features,
followed by CFS to select the most discriminative features out of the pre-selected ones.
Different numbers of features have been tested, where a filter selection of 400 features
followed by a wrapper selection of 200 features performed best.
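A minimal sketch of the GainRatio filter step with the WEKA attribute selection API is shown below; the subsequent CFS wrapper step can be set up analogously to the sketch in Section 4.4.4 (the method name is hypothetical):

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.GainRatioAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;

    // filter step: rank all features by GainRatio and keep the 400 best ones
    static Instances gainRatioPreselect(Instances data) throws Exception {
        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new GainRatioAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(400);
        filter.setSearch(ranker);
        filter.SelectAttributes(data);
        return filter.reduceDimensionality(data);
    }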
The CFS algorithm evaluates different subsets composed by a greedy stepwise search
algorithm. One problem with this technique is that a small number of features, for example
five, can already reach an accuracy that is close to the Bayesian optimal accuracy
reached by considering all the features. Therefore, it becomes difficult to reliably detect
a classification improvement caused by adding further features. We have developed a new
feature selection algorithm that tries to solve this problem by applying the CFS algorithm
over a number of iterations. In each iteration, only a small group of features is
selected; the algorithm was therefore called Group Selection. At each iteration, n (the group
size) features are selected by CFS and removed from the dataset for the next iteration.
To select a total number of N genes, N/n iterations are needed. The benefit of this
method is that at each iteration the strong features are removed, so the influence of the
remaining features on the decision is easier to detect.
Array selected;
Dataset original;
Dataset data = original;

for each iteration {
    Array bestn = data.selectnBestFeatures(use CFS);
    selected += bestn;
    data.remove(bestn);
}

original.reduceTo(selected);
Listing 4.3: Pseudo code of the Group Selection algorithm.
Different group sizes and numbers of iterations have been tested. In our experiments, 40
iterations of selecting 5 features performed best. The filter pre-selection to 400 features
followed by the 5-features-40-iterations Group Selection has been used for all following
experiments.
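The following Java sketch shows one possible realization of the Group Selection loop on top of WEKA's CFS-based attribute selection; it mirrors the pseudocode of Listing 4.3 and is an illustration, not the framework's exact implementation:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.attributeSelection.GreedyStepwise;
    import weka.core.Instances;

    // returns the names of (iterations * groupSize) features, selected in groups
    static List<String> groupSelection(Instances data, int iterations, int groupSize)
            throws Exception {
        Instances working = new Instances(data); // work on a copy
        List<String> selected = new ArrayList<String>();

        for (int g = 0; g < iterations; g++) {
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new CfsSubsetEval());
            GreedyStepwise search = new GreedyStepwise();
            search.setGenerateRanking(true);
            search.setNumToSelect(groupSize);
            selector.setSearch(search);
            selector.SelectAttributes(working);

            // remember the chosen features and remove them from the working copy
            List<Integer> toDelete = new ArrayList<Integer>();
            for (int idx : selector.selectedAttributes()) {
                if (idx != working.classIndex()) {
                    selected.add(working.attribute(idx).name());
                    toDelete.add(idx);
                }
            }
            Collections.sort(toDelete, Collections.reverseOrder());
            for (int idx : toDelete) {
                working.deleteAttributeAt(idx); // delete top-down to keep indices valid
            }
        }
        return selected;
    }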
Again, a cross validation on the training sets has been performed on our local PC to
determine the best algorithm parameters of the intrinsic Random Forest Similarity for
each dataset separately. We tested different values for the number of trees, the minimum
leaf size, the number of attributes to be considered when splitting a node in a tree, and the number
of nearest neighbours. Further, the intrinsic Random Forest Similarity was tested both with
the original Random Forest implementation and with extremely randomized trees [20].
Based on these results, the test configuration has been updated to use the best parameters
for each dataset.
4.5.2 Advanced Track
Solutions to the Advanced Track are submitted by uploading a Java jar file with a WEKA
classifier included. Therefore, a new classifier was implemented that can use the classes
used for the Basic Track. The classifier and all dependent classes have been extracted
into the jar file for the challenge.
In the submitted procedure for generating class models, we first tried to oversample
the full dataset before doing feature selection, but the oversampled dataset exceeded
the memory limitations. Therefore, the provided dataset is now always reduced to 3000
features immediately after the training is started. Different feature selection algorithms
and different numbers of features have been tested, where the weighted ReliefF algorithm
with 3000 features performed best for the initial feature selection.
After reducing the dataset, there was enough memory for the normal oversampling
procedure. Again, different filter-wrapper feature selection methods have been tested. A
pre-selection of 400 features with ReliefF, followed by a selection of 200 features with
CFS, reached the best accuracy. Notice that the introduced Group Selection could
not be used for this track, as this algorithm violated the time restrictions, resulting in a
time-out error returned by the server.
Next, a cross validation step was implemented to find the best value for the number
of nearest neighbours, the number of attributes considered for splitting a node, and the
Random Forest type, canonical Random Forest or extremely randomized trees, for the
intrinsic Random Forest Similarity. The time restrictions, however, forced a limitation of the
cross validation to validating only one of these parameters at a time. Further, the
number of trees of the intrinsic Random Forest Similarity used during cross validation was
set to 25 to reduce the execution time.
Finally, different numbers of trees for final classification with the intrinsic Random
Forest Similarity have been tested. With a high number of trees, the classifier exceeded
the time limitations and with a low number of trees, the accuracy decreased. Our tests
showed that a good compromise was to use 1000 trees.
CHAPTER 5

Empirical Analysis
In the previous sections, we introduced different approaches to increase the classification accuracy
for genetic data. Different elementary distance functions for learning from equivalence
constraints have been compared. Next, the new data representation with gene-pairs
was tested on different datasets and compared to the original representation. Two basic
ways to incorporate external biological knowledge, by feature weighting or feature selection,
have been tested on selected datasets. Finally, distance learning algorithms have
been cross validated on a given set of datasets and evaluated at an international discovery
challenge. The results of the experiments performed for this thesis are presented and
analyzed in this chapter. For all experiments conducted for this thesis, the presented
classification accuracy is the percentage of test instances correctly classified by a
model previously trained on the training data. An instance is correctly classified if the
class label predicted by the classification model is identical to the ground truth.
5.1 Distance Comparison for Learning From Equivalence Constraints in the Difference Space
As can be seen from Table 5.1, the L1 distance clearly outperforms the other distances.
L1 reached the best accuracy in seven of the nine used datasets and was only one percent
worse in the other two. In these two, Lymphoma and Liver, the Zero Variance
Threshold distance reached the best result. The second best result was achieved by the
Square-Root distance and the third best by the Squared distance. This result is not surprising,
as these distances do not differ much from the L1 distance, which was shown to perform best
in previous comparisons [61].

Distance Function           Average Accuracy in %
L1                          81.81
Square-Root                 73.17
Squared                     68.18
Mahalanobis                 64.36
Canberra                    63.02
Variance Threshold          62.21
Frequency (w = 0.5)         61.07
Zero Variance Threshold     59.29
Chi-Squared                 57.91
Single Squared              56.49

Table 5.1: Average accuracies over the nine datasets for the distance comparison for learning
from equivalence constraints.
5.2 Transformation of Representation for Gene Expression Data
The gene-pair representation was compared with the plain single gene representation and
evaluated with the four different classifiers on 10 benchmark datasets. The following
configurations have been used for the experiments:
• 400 preselected pairs by ReliefF.
• 100 pairs selected by CFS and used for training the models.
• 10-fold cross validation for datasets with more than 90 instances.
• Leave-one-out cross validation otherwise.
The four classifiers have been used with the following settings:
• RF : Random Forest with 25 trees.
• 7 NN : K-Nearest-Neighbours with k = 7.
• EC : Learning equivalence constraints with L1 distance in difference space.
• iRF : The intrinsic Random Forest with 25 trees.
It was shown that 7 is the most robust parameter choice for k in our preliminary tests.
The tests have been conducted three times, first with no class noise injected, second with
10% and third with 20% random class noise. The results for the datasets are presented in
separate tables, where rows called Original correspond to the original representation and
rows named Pair include results for the gene-pair representation. Each column includes
results for one learning algorithm while the average over all four classifiers is presented
in the last column. The best accuracy reached for each noise level is given in bold. The
average value is green if the gene-pair representation outperforms the original one and
red vice versa.
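The exact noise injection routine of the framework is not shown here; a minimal sketch of one plausible variant, in which a given percentage of instances receives a randomly drawn class label, is the following:

    import java.util.Random;
    import weka.core.Instances;

    // overwrite the class labels of roughly 'percent' percent of the instances
    // with labels drawn uniformly at random (assumes a nominal class attribute)
    static void injectClassNoise(Instances data, double percent, long seed) {
        Random rnd = new Random(seed);
        int numNoisy = (int) Math.round(data.numInstances() * percent / 100.0);
        for (int k = 0; k < numNoisy; k++) {
            int i = rnd.nextInt(data.numInstances());
            int label = rnd.nextInt(data.classAttribute().numValues());
            data.instance(i).setClassValue(label);
        }
    }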
Breast Cancer    RF       7 NN     EC       iRF      Average
Original         74.11    81.22    75.56    72.44    75.83
Pair             73.11    78.56    77.00    72.44    75.28
Original 10%     73.44    81.11    75.33    73.33    75.80
Pair 10%         74.22    78.33    75.00    74.11    75.42
Original 20%     74.67    77.56    73.33    73.89    74.86
Pair 20%         73.33    75.56    75.44    73.11    74.36

Table 5.2: Classification accuracy for Breast Cancer of pair vs. original representation.
Colon            RF       7 NN     EC       iRF      Average
Original         80.65    87.10    87.10    83.87    84.68
Pair             87.10    83.87    90.32    88.71    87.40
Original 10%     85.48    87.10    87.10    83.87    85.89
Pair 10%         80.65    83.87    83.87    82.26    82.66
Original 20%     79.03    85.48    83.87    83.87    83.06
Pair 20%         82.26    80.65    80.65    82.26    81.46

Table 5.3: Classification accuracy for Colon of pair vs. original representation.
5.2.1 Analysis of the Genetic Datasets

Tables 5.2 - 5.11 demonstrate the classification accuracies for the genetic and the non-genetic datasets with the original and the gene-pair representation.
Embrional Tumours    RF       7 NN     EC       iRF      Average
Original             76.67    75.00    73.33    75.00    75.00
Pair                 80.00    78.33    78.33    76.67    78.33
Original 10%         78.33    71.67    68.33    80.00    74.58
Pair 10%             78.33    75.00    76.67    80.00    77.50
Original 20%         63.33    73.33    75.33    71.67    70.92
Pair 20%             70.00    73.33    73.33    71.67    72.08

Table 5.4: Classification accuracy for Embrional Tumours of pair vs. original representation.
HeC Brain Tumours    RF       7 NN     EC       iRF      Average
Original             92.65    86.76    92.65    92.65    91.18
Pair                 94.12    94.12    92.65    95.59    94.12
Original 10%         91.18    83.82    92.65    89.71    89.34
Pair 10%             91.18    91.18    92.65    89.71    91.18
Original 20%         86.76    85.29    91.18    89.71    88.24
Pair 20%             83.82    83.82    86.76    82.35    84.19

Table 5.5: Classification accuracy for HeC Brain Tumours of pair vs. original representation.
Leukemia         RF       7 NN     EC       iRF      Average
Original         95.83    95.83    95.83    94.44    95.48
Pair             97.22    97.22    97.22    97.22    97.22
Original 10%     95.83    98.61    95.83    95.83    96.53
Pair 10%         97.22    94.44    97.22    97.22    96.53
Original 20%     87.50    95.83    97.22    93.06    93.40
Pair 20%         97.22    91.67    95.83    95.83    95.14

Table 5.6: Classification accuracy for Leukemia of pair vs. original representation.
Lung Cancer      RF       7 NN     EC       iRF      Average
Original         98.48    95.52    98.91    98.48    97.85
Pair             98.12    98.79    98.85    97.94    98.43
Original 10%     98.42    95.21    98.06    98.85    97.64
Pair 10%         97.15    96.85    97.09    97.70    97.20
Original 20%     96.79    91.76    96.42    97.94    95.73
Pair 20%         92.73    92.48    95.09    94.67    93.74

Table 5.7: Classification accuracy for Lung Cancer of pair vs. original representation.
Lupus            RF       7 NN     EC       iRF      Average
Original         78.57    77.26    78.45    77.38    77.92
Pair             78.69    76.67    77.98    77.74    77.77
Original 10%     76.31    75.71    77.98    77.98    77.00
Pair 10%         77.62    75.00    78.21    77.38    77.05
Original 20%     75.00    71.55    76.19    74.52    74.32
Pair 20%         74.76    72.86    75.36    74.52    74.38

Table 5.8: Classification accuracy for Lupus of pair vs. original representation.
Lymphoma         RF       7 NN     EC       iRF      Average
Original         95.56    100      88.89    93.33    94.45
Pair             95.56    100      97.78    95.56    97.23
Original 10%     95.56    100      97.78    95.56    97.23
Pair 10%         95.56    97.78    97.78    95.56    96.67
Original 20%     95.56    95.56    97.78    93.33    95.56
Pair 20%         95.56    97.78    95.56    95.56    96.12

Table 5.9: Classification accuracy for Lymphoma of pair vs. original representation.
Arcene (non-genetic)    RF       7 NN     EC       iRF      Average
Original                86.06    84.89    87.00    85.44    85.85
Pair                    84.17    83.94    85.72    83.89    84.43
Original 10%            82.28    82.61    84.11    83.39    83.10
Pair 10%                79.94    80.89    82.06    79.89    80.70
Original 20%            76.17    78.28    79.56    77.72    77.93
Pair 20%                74.94    77.76    76.72    76.06    76.37

Table 5.10: Classification accuracy for Arcene of pair vs. original representation.
Mesh (non-genetic)    RF       7 NN     EC       iRF      Average
Original              92.06    87.30    88.89    90.48    89.68
Pair                  85.71    85.71    76.19    84.13    82.96
Original 10%          88.89    87.30    87.30    87.30    87.70
Pair 10%              82.54    84.13    80.95    84.13    82.94
Original 20%          93.65    84.13    85.71    87.30    87.70
Pair 20%              80.95    82.54    88.89    80.95    83.33

Table 5.11: Classification accuracy for Mesh of pair vs. original representation.
The results are analyzed for the no-noise experiments on the genetic datasets first. For six out of eight genetic datasets,
the novel gene-pair representation outperformed the original one. Only for Breast and
Lupus did the original representation reach better results, with less than 0.56% difference
each. A summary of the averaged results can be found in Table 5.12. Note that
for the following Tables 5.12-5.16, the last column (Difference) is the difference between
the average accuracy of the gene-pair representation and that of the original representation.
Dataset               Average Original    Average Pair    Difference
Breast                75.83               75.28           -0.55
Colon                 84.68               87.40           2.72
Embr. Tumours         75.00               78.33           3.33
HeC Brain Tumours     91.18               94.12           2.94
Leukemia              95.48               97.22           1.74
Lung Cancer           97.85               98.43           0.58
Lupus                 77.92               77.77           -0.15
Lymphoma              94.45               97.23           2.78
Average               86.55               88.22           1.67

Table 5.12: Average classification accuracy over the four different classifiers for the genetic
datasets.
Classifier    Average Original    Average Pair    Difference
RF            86.57               87.99           1.42
7 NN          87.34               88.45           1.11
EC            86.34               88.77           2.43
iRF           85.95               87.73           1.79

Table 5.13: Average classification accuracy over the genetic datasets for the tested classifiers.
The gene-pair representation increased the average classification accuracy over
all classifiers and datasets by 1.67%. It is better for six out of eight datasets, and all
four classifiers reached a better average result with the gene-pair representation, with
learning from equivalence constraints showing the biggest increase.

The results of Tables 5.12 and 5.13 show a clear benefit in classification accuracy for
the novel representation, which is motivated by the fact that genes often depend
on each other. To validate the assumption that this is the reason for the increase, the
two non-genetic datasets have also been tested. If the assumption is right, the boost in
accuracy should be absent for these two datasets, or at least not as pronounced as for the genetic
data.
5.2.2 Analysis of the Non-Genetic Datasets
Dataset    Average Original    Average Pair    Difference
Arcene     85.85               84.43           -1.42
Mesh       89.68               82.96           -6.72
Average    87.77               83.70           -4.07

Table 5.14: Average classification accuracy over the four different classifiers for the non-genetic datasets.
Classifier    Average Original    Average Pair    Difference
RF            89.06               84.94           -4.12
7 NN          86.10               84.83           -1.27
EC            87.95               80.96           -6.99
iRF           87.96               84.01           -3.95

Table 5.15: Average classification accuracy over the non-genetic datasets for the tested
classifiers.
Tables 5.14 and 5.15 show how the gene-pair representation may fail for non-genetic
data. These results indicate that the benefit of the gene-pair representation presumably
relies on the interactions of genes.
5.2.3 Robustness of the Gene-Pair Representation to Noise
To measure the robustness of the gene-pair representation to class noise for the genetic
datasets, experiments with 10% and 20% artificial class noise have been conducted.
Class Noise    Average Original    Average Pair    Difference
0%             86.55               88.22           1.67
10%            86.75               86.78           0.03
20%            84.51               83.93           -0.58

Table 5.16: Average classification accuracies for noisy datasets.
From the results of Table 5.16, it can be seen that the novel representation is not
as robust to noise as the original single-feature representation. With 0% noise, the pairs
outperform the single genes by 1.67%, while with 10% class noise they perform almost
equally. With more noise, the gene-pair representation loses its benefit and performs
worse. With the gene-pair representation, it is easier to overfit noise.
5.2.4 Benefits and Limitations
The tests comparing the transformed representation with the original one showed that
gene-pair features can definitely increase the classification accuracy for genetic datasets.
It was also shown that the pairs often do not improve the accuracy on non-genetic
datasets. Therefore, the assumption that the benefit relies on the reflection of gene
interactions could not be disproved. It was also shown that the original representation
is more robust to class noise, but it is still clearly worse if no noise is present. The
learning from equivalence constraints classifier has been shown to be sensitive to the novel
representation, as this algorithm exhibited the biggest increase in accuracy for genetic
data whereas it performed worse for non-genetic data.
Further, the gene-pair datasets have been reduced to 100 pairs for classification, while
the original representation uses 200 features for training the models. The gene-pair
representation was thus shown to improve the classification accuracy using only half
the original number of attributes. Note that this does not mean that only the
information of 100 genes is present, as each pair is created from two different genes. The
statistics of the selected pairs show a tendency to select a large number of pairs that
share the same feature. In some cases, this feature was highly ranked in the single
representation as well, but some features that are part of many pairs have not been ranked
high in the single representation. A deeper and more thorough analysis of these statistics
is a direction for future work.

The reduction to 100 attributes improves the model training time and the memory
consumption at least by a factor of two. But the two-step feature selection needed for detecting
the best 100 pairs is very time consuming and increases the time of the whole training
procedure dramatically.
For each dataset describing a certain disease, a feature selection is done to find the 100
most important genes for differentiating between healthy subjects and diseased patients.
The selected gene pairs might not only be used for classification in machine learning, but
can also give useful information about gene interactions relevant for the studied disease.
The most frequently selected and most discriminative gene pairs can be studied with
respect to their biological background, and new knowledge can be extracted about how and why
a pair influences the disease. This data might be of high interest for biology and health
professionals.
Another benefit is the fact that attributes are represented by gene pairs. This makes it
possible to retrieve external biological knowledge for the pairs and to integrate this
knowledge into the classification algorithm to improve the accuracy.
5.3 Integration of Gene Ontology Semantic Similarity into Data Classification
The datasets have been reduced to contain only those attributes that can be matched
exactly to a gene. Therefore, the number of attributes of the five test sets used for the
experiments in this section decreased, as can be seen in Table 5.17.
Dataset               Original No. of Attributes    No. of Attributes with GO ids
Breast                24482                         2678
Colon                 2001                          942
Embr. Tumours         7130                          5896
HeC Brain Tumours     5580                          4447
Lupus                 172                           140

Table 5.17: Number of attributes for the original and the reduced datasets for experiments
with the GO semantic similarity.
The datasets had to be reduced to contain only attributes that can be associated
with GO identifiers. The classification power of an attribute was not considered in this
reduction. Some discriminative attributes may have been deleted, and therefore the
accuracies reported for these experiments are not comparable with the results of the pair
representation evaluation.
5.3.1 GO-Based Feature Weighting
Four datasets have been tested with different similarity calculation methods and the two
different approaches of semantic similarity based feature weighting. All tests have been
run under the same conditions as described for the tests with the gene-pair representation.
The result Tables 5.18-5.21 show the classification accuracies for the tested similarity
methods. Each row corresponds to one method; the last one, called Random, uses no
real similarity values but random values in the range between 0 and 1.
The first column shows the results for feature weighting within the k-Nearest Neighbour
classifier, and the second one shows the corresponding results for weighting the pairs
directly while constructing them. The third column shows the average similarity
value of the 100 selected pairs, while the last column shows the average similarity value over all
calculated pairs. The row No FW shows the results of the reduced pair representation
without feature weighting.
Breast           FW in KNN    FW in Pair    Av. Sim Selected    Av. Sim All
No FW            70.22        70.22         -                   -
Max-Resnik       68.22        69.11         2.6258              2.5968
Max-Lin          70.33        69.33         0.7171              0.6666
Max-Wang         70.56        70.22         0.7915              0.7684
Azuaje-Resnik    69.33        69.67         0.8828              0.8596
Azuaje-Lin       70.11        68.67         0.2457              0.2266
Azuaje-Wang      69.56        69.56         0.3617              0.3445
Schoen-Lin       70.67        68.78         0.2379              0.2135
GIC              66.11        71.89         369.2300            156.7375
Random           70.00        70.56         0.4933              0.4997

Table 5.18: Classification accuracy of the GO-based feature weighting for Breast.
Colon            FW in KNN    FW in Pair    Av. Sim Selected    Av. Sim All
No FW            88.71        88.71         -                   -
Max-Resnik       87.10        88.71         3.8565              3.7388
Max-Lin          87.10        88.71         0.7414              0.8292
Max-Wang         87.10        88.71         0.7723              0.8774
Azuaje-Resnik    87.10        88.71         1.0621              1.1924
Azuaje-Lin       87.10        88.71         0.2329              0.2755
Azuaje-Wang      87.10        88.71         0.3422              0.3894
Schoen-Lin       87.10        88.71         0.1664              0.1890
GIC              85.48        85.48         207.7249            180.0117
Random           87.10        88.71         0.4753              0.5019

Table 5.19: Classification accuracy of the GO-based feature weighting for Colon.
Embrional Tumours    FW in KNN    FW in Pair    Av. Sim Selected    Av. Sim All
No FW                73.33        73.33         -                   -
Max-Resnik           68.33        70.00         3.4470              3.5950
Max-Lin              78.33        70.00         0.8385              0.8012
Max-Wang             75.00        71.67         0.8837              0.8576
Azuaje-Resnik        76.67        70.00         1.1160              1.0777
Azuaje-Lin           76.67        70.00         0.2695              0.2541
Azuaje-Wang          76.67        71.67         0.3763              0.3672
Schoen-Lin           73.33        70.00         0.1957              0.1734
GIC                  76.67        76.67         455.7339            204.6917
Random               76.67        73.33         0.5219              0.4976

Table 5.20: Classification accuracy of the GO-based feature weighting for Embrional Tumours.
Lupus            FW in KNN    FW in Pair    Av. Sim Selected    Av. Sim All
No FW            79.17        79.17         -                   -
Max-Resnik       78.57        79.52         4.4469              4.5364
Max-Lin          78.57        79.64         0.8446              0.8565
Max-Wang         79.05        79.17         0.8803              0.8931
Azuaje-Resnik    78.45        79.52         1.2879              1.2942
Azuaje-Lin       78.21        79.52         0.2814              0.2809
Azuaje-Wang      78.57        79.05         0.3743              0.3832
Schoen-Lin       79.05        79.40         0.1872              0.1791
GIC              75.95        78.21         172.5945            154.6439
Random           79.05        79.17         0.5189              0.5052

Table 5.21: Classification accuracy of the GO-based feature weighting for Lupus.
The average accuracies over the four datasets have been calculated and are shown in
Table 5.22.
Method           FW in KNN    FW in Pair
No FW            77.86        77.86
Max-Resnik       75.56        76.84
Max-Lin          78.58        76.92
Max-Wang         77.93        77.44
Azuaje-Resnik    77.89        76.98
Azuaje-Lin       78.02        76.73
Azuaje-Wang      77.98        77.25
Schoen-Lin       77.54        76.72
GIC              76.05        78.06
Random           78.21        77.94
Average          77.53        77.21

Table 5.22: Average accuracies for semantic similarity based feature weighting with different methods.
The results demonstrate that for each dataset, at least one combination of methods
could beat or at least match the non-weighted gene-pair representation. But this does not
seem to be an effect of the semantic similarity: the random similarity was able to match the
accuracy of the experiments without weighting on two datasets and increased it on the other
two. This result is surprising and was not expected. The experiments
with the gene-pair representation had already shown that the new representation tends to overfit noise.
We assume that the reason for the increase in accuracy is not the weighting
with semantic similarity, but the injection of noise into the pairs, making the system more
robust to overfitting. More experiments to test this hypothesis are a direction for future
work. Table 5.22 shows the average values for each combination. For weighting the pairs,
the Max-Lin combination performed best, while for feature weighting in k-NN, the GIC
method reached the best accuracy. But again, the Random reference obtained a comparable
result. It can be concluded that weighting pair features as introduced in this thesis can
improve the classification accuracy, but a reference test with random similarity values
could improve the accuracy, too.
5.3.2 GO-based Feature Selection
Next, experiments with the GO semantic similarity based feature selection have been
performed. The results of the feature weighting tests could not show a clear benefit of any
similarity method for increasing the classification accuracy, so it was not obvious
which similarity method should be used for GO-based feature selection. Since for
all combinations including the Max algorithm a large fraction of the similarities equals 1, these
methods cannot be used. GO-based feature selection is defined to pre-select pairs with a
high similarity; therefore, the pairs selected during the feature weighting experiments have been
analyzed with the goal of finding a similarity method for which the average value of the selected
pairs is higher than the average over all pairs. The ratio of the average similarity of the
selected pairs (s_selected) to the average similarity of all pairs (s_all), in percent, was calculated
and is shown in Table 5.23.
The difference in average semantic similarity between the selected pairs and all
pairs is negligible for most methods. For the Schoen method, the similarity of the selected
pairs is 4% higher, but the GIC semantic similarity was shown to be 71% higher for the
selected pairs than for all pairs. Therefore, the GIC similarity was selected for testing
the GO-guided feature selection. It was assumed that if the pairs selected by common
feature selection algorithms have a higher average GIC similarity, then a pre-selection of
pairs with high GIC similarity can improve the feature selection. To test this assumption, the
400 pairs with the highest GIC semantic similarity have been pre-selected for each dataset,
followed by a CFS feature selection to reduce the 400 pairs to 100.
Similarities     Ratio in %
Max-Resnik       99.54
Max-Lin          100.06
Max-Wang         98.16
Azuaje-Resnik    98.71
Azuaje-Lin       99.80
Azuaje-Wang      98.26
Schoen-Lin       104.21
GIC              171.30
Random           100.25

Table 5.23: Comparing the ratio of the average similarity between selected pairs and all
pairs.
The combination with a common feature selection method (CFS) is needed to reduce redundant features. The
feature weighting was switched off, and the tests have been performed under the same
classification conditions. The results can be seen in Tables 5.24-5.28. Random Forest,
k-Nearest Neighbour and the intrinsic Random Forest Similarity classifier have been
evaluated in this context.
Breast (GIC selection)    RF       k-NN     iRF      Average
no                        69.00    70.22    68.00    69.07
yes                       74.22    73.56    72.33    73.37

Table 5.24: Classification accuracy of GO-based feature selection for Breast.
Colon (GIC selection)    RF       k-NN     iRF      Average
no                       88.71    88.71    87.10    88.17
yes                      83.87    88.71    85.48    86.02

Table 5.25: Classification accuracy of GO-based feature selection for Colon.
Embrional Tumours (GIC selection)    RF       k-NN     iRF      Average
no                                   75.00    73.33    73.33    73.89
yes                                  81.67    81.67    78.33    80.56

Table 5.26: Classification accuracy of GO-based feature selection for Embrional Tumours.
HeC Brain Tumours (GIC selection)    RF       k-NN     iRF      Average
no                                   89.71    91.18    89.71    90.20
yes                                  92.65    94.12    92.65    93.14

Table 5.27: Classification accuracy of GO-based feature selection for HeC Brain Tumours.
Lupus (GIC selection)    RF       k-NN     iRF      Average
no                       80.60    79.17    79.76    79.84
yes                      78.21    79.05    78.10    78.45

Table 5.28: Classification accuracy of GO-based feature selection for Lupus.
Based on these results, the average accuracy over all datasets and classification methods
for the gene-pair representation without semantic feature selection support was calculated
as 80.23%. The pair feature pre-selection with the GIC semantic similarity improved
this result by 2.08% to an average value of 82.31%. Note that for datasets where the
average similarity of the selected pairs was much higher than for all pairs, the GIC-based
pre-selection of features could improve the accuracy. The biggest increase in accuracy was
reached for the Embrional Tumours dataset, where the accuracy increased by 6.67% to
80.56%. This is the best accuracy reached for this dataset in all experiments conducted
for this thesis, and it is even higher than for the experiments with the full set of attributes
(Table 5.4).

The fact that the GIC similarity of the selected pairs is higher is another piece of evidence
that the boost of the pair representation is associated with the biological semantics of the
corresponding genes of the pairs. It was shown that including external biological knowledge
into classification can boost the accuracy for most datasets. The bad performance on
the Lupus dataset is not surprising, as the gene-pair representation failed for this dataset
before, too. One reason might be that for the Lupus dataset gene interactions are
not important or not reflected in the dataset. The other dataset that failed, Colon, performed
well in the gene-pair representation tests, but its results for the semantic similarity based
feature weighting tests are strange, as the classification accuracies of nearly all tested
similarity methods are the same (Table 5.19). The number of attributes and instances as
well as the calculated similarities do not differ significantly from the other datasets.
5.4 Preliminary Results for the RSCTC'2010 Discovery Challenge
For all experiments, the intrinsic Random Forest Similarity was used as described in Section 4.5.
The training sets were cross validated with different numbers of nearest neighbours
(k), minimum leaf sizes (l), numbers of features to split a node (K), numbers of trees (n), and the two different
learning algorithms, plain Random Forest (RF) and Random Forest with extremely randomized
trees (ERT) [20]. The best accuracy was reached with the configurations of the intrinsic Random Forest
Similarity shown in Table 5.29.
Dataset    k     l    K             n         learner
1          7     1    log(N) + 1    50,000    RF
2          7     1    log(N) + 1    50,000    ERT
3          5     1    log(N) + 1    50,000    RF
4          7     1    log(N) + 1    50,000    ERT
5          7     1    log(N) + 1    50,000    RF
6          11    1    log(N) + 1    50,000    ERT

Table 5.29: Best configurations of the intrinsic Random Forest Similarity for the Basic Track
datasets, where N is the number of attributes of the dataset.
The datasets have been reduced to 400 genes by GainRatio, followed by a 5-features-40-iterations
Group feature selection. On February 23, 2010, four participants shared the
first place with an average accuracy of 75% on the preliminary evaluations. We reached
74% and shared the second place with another participant. In the preliminary evaluations,
a position within the top six of 73 participants was thus reached for the Basic Track.

In the first trials for the Advanced Track, the memory or time restrictions were
exceeded. The best accuracy we reached in the preliminary evaluations was 78.73%. 22
participants submitted solutions; the leader has currently reached a classification
accuracy of 79.78%. With our best accuracy of 78.73%, position ten was reached, with a
gap of about 1% to the current leader.
CHAPTER 6

Conclusion
6.1 Summary
A flexible, object-oriented Java machine learning workbench has been developed for comparing
different learning algorithms, in particular distance learning techniques. The
WEKA-based framework performs a set of experiments declared in a configuration file
and presents the classification accuracies. A summary of the experimental results is saved
into an HTML file. The workbench was designed for tests with genetic data but can also
be used in any other area.
The workbench was used to compare different distances obtained with learning from
equivalence constraints, where the L1 distance was shown to perform best. Two novel
distance functions, the Variance Threshold and the Frequency distance, have been introduced
as candidates to replace the L1 distance in learning from equivalence constraints. The
Variance Threshold distance was designed to group the expression values
into five categories: strongly underexpressed, underexpressed, normal, overexpressed and
strongly overexpressed. The second new distance function, the Frequency distance, was
designed to consider the distribution of the data and has been tested in combination with
the L1 distance. In the presented experiments, the weight was set to 0.5, so that the
influence of the Frequency and the L1 distance was equal. Most of the other distance
functions used were originally created for distance calculations between two vectors with
more than one element. As the distance functions have been applied to scalar values,
the strength of these functions could not be utilized. The strength of the
L1 distance is not surprising, as the information and the original dimensionality of the data
are unchanged and the data is not distorted.
A new representation for genetic datasets has been developed that considers the interaction
between genes. The new representation was shown to increase the accuracy on genetic
data by 1.67% on average. The assumption that this increase is caused by the reflection
of gene-gene interactions could be supported by testing the gene-pair representation on
non-genetic datasets, where it produced worse results. Another fact supporting this assumption is that
the GIC semantic similarity of the selected pairs was about 70% higher compared to all pairs. The
gene-pair representation increased the accuracy for six out of eight genetic datasets. Only
for Breast and Lupus did the accuracy decrease, but by less than 0.55%. Out of the tested
classifiers, learning from equivalence constraints was shown to have the biggest change in
classification accuracy when using the new representation. The tests with artificially
injected noise showed that the gene-pair representation is prone to overfitting noise.
Algorithms for automatically retrieving the GO terms associated with a provided NCBI
accession number or an official gene symbol have been developed. Further, the workbench
is able to calculate the GO semantic similarity of two GO terms and of two genes by several
different methods, such as Lin, Resnik, Wang, Azuaje, Maximum and GIC. A novel algorithm
(called the Schoen similarity) for calculating the GO similarity of two genes based on their
GO terms was introduced and shown to perform comparably to state-of-the-art
methods. The Schoen similarity calculates the average over the best third of the similarities
between the GO terms of the two genes. In our experiments, the Schoen similarity
performed worse than the Max and Azuaje similarities when combined with
the Lin algorithm. For a better comparison between the Schoen, Max and the
Azuaje similarities, they could be tested on a set of gene-pair examples and compared to
manually crafted similarity values provided by a microbiologist, as presented by Wang
et al. [58].
Three different approaches for incorporating the semantic similarity into data classification
have been implemented and compared. The GO-based feature weighting was
shown to reach no better results than a reference with random similarity values. As
described above, the good result for weighting with random values may also have been
caused by the fact that the gene-pair representation without GO-based guidance is overfitting
noise. We could not find a similarity calculation method that clearly outperforms
the other methods on average over all the tested datasets. But we recognized that, for the
GIC similarity, the average semantic similarity of the most discriminative gene-pairs is significantly higher
than the average semantic similarity over all pairs in most datasets.
We tried to use this finding, that gene pairs with a higher semantic similarity tend
to be more discriminative, to select the pairs with the highest semantic similarity
for classification. We have no explanation why the average similarity of the selected pairs is
higher than the similarity of all pairs only for the GIC similarity. One reason might be
that the other similarity calculation methods split the similarity calculation into elementary
similarities between two GO terms, while the GIC method retrieves the
similarity directly by analyzing the GO graph for all the terms as a whole. Another reason
can be that the GO still contains too much noise. To avoid the usage of unreliable
terms, the methods can be modified to use only GO terms with a high evidence level.
The evidence level is given for each term in the annotation database and is a measure of
how trustworthy the annotation is.
The GO-based feature selection was shown to perform well in combination with a
common data-driven feature selection method (CFS) and the GIC similarity, and
could improve the classification accuracy by 2.06% on average over the tested datasets in
comparison to the plain feature selection alone. The GO-based feature selection
chooses features based only on their semantic similarity values, resulting in a set of
features that will include redundancy and noise. It is clear that, by itself, the semantic
similarity is not a good feature selector, but it has proven to be a good guide for a preliminary
feature selection that reduces the complexity of the usual feature selection
step. The common feature selection method can then eliminate the redundant and irrelevant
features; therefore, a combination of both normally leads to a better classification accuracy.
To summarize, it was shown in this thesis that including background biological
knowledge into classification tasks, in the form of guiding the feature selection for the
gene-pair representation, can improve the accuracy for genetic datasets.
The intrinsic Random Forest Similarity was compared to other state-of-the-art learning
techniques by participating in an international classification challenge with genetic
datasets. A new feature selection method, Group Selection, has been developed, and the
parameters of the classifier were optimized for the challenge datasets.
6.2 Limitations and Future Work
The implemented framework needs access to a running MySQL server to use the
GO annotation database. Further, the GO OBO file as well as the WEKA and GO4J packages
need to be available on the classpath. The code is written in an object-oriented style, but not every
described method or algorithm can be accessed in an object-oriented manner. In its current
state, the framework is not easily transferable to another computer. The workbench could
be reorganized to include all dependent files and be packed into a jar file. Each method
described in this thesis could be implemented within a separate class accessible
from any other Java code. One should then be able to use every algorithm and technique
described in this thesis by including a single jar file, without any further installations or
configurations.
The gene-pair representation was shown to perform well on binary classification problems,
but was not tested on multiclass datasets. Minor tests on the challenge datasets
showed worse results for the multiclass datasets. More tests have to be performed to better
understand why the gene-pair representation works well only on binary classification
problems and how this approach can be adapted to multiclass datasets. The classification
accuracy of the gene-pair representation is highly dependent on several parameters, for
example the number of pairs to be selected. Different filter and wrapper methods can
be tested with different numbers of features to be selected at each iteration. The used
wrapper method, CFS, is very time consuming; further experiments can be conducted
to find a faster evaluation method that performs comparably to CFS. The
Group Selection introduced for the Basic Track of the RSCTC'2010 Machine Learning
Challenge might outperform the common CFS algorithm. The gene-pair representation
selects the best pairs for classification, while the original representation selects the
best genes. Each of the two representations, the single and the gene-pair representation,
completely ignores the information provided by the other technique. A combination of
both methods, selecting the best pairs and adding them to the best single genes, might
combine the benefits of both and result in a better accuracy. Further, a more complex
representation could be tested, for example triplets of genes. Another crucial point for the
gene-pair representation is the calculation of the pair values. The value of a pair is calculated
as the L1 distance between the two genes of the pair; other distances might be
tested to better reflect the differences between two single genes. One more way to improve
the gene-pair representation could be to normalize the pairs, which can be especially
helpful when including the semantic similarity.
The GO semantic similarities tested, used all GO terms found for the given gene. A
better solution might be to filter the GO terms by their level of evidence. Tests can be
performed with GO terms derived from the three different branches of the GeneOntology
separately. The results depend on the version of the GO OBO file and the annotation
database. As they are updated in short time periods, the usage of newer versions can
change the results. For feature weighting experiments, the Euclidean weighting has been
used. Different weighting approaches can be compared. For feature selection, only one
of the presented methods was tested. The other methods can be tested, too. Analyzing
dependencies between the feature selection ranking and the semantic similarity of the
pairs can give a better understanding how to select features based on their GO similarity.
Histograms of the dependency between the GainRatio of the gene-pairs and the semantic similarity of the pairs can be found in Appendix A for the tested datasets Breast Cancer, Colon, Embrional Tumours, HeC Brain Tumours and Lupus. The pairs are divided into 100 bins based on their semantic similarity values, where each bin contains the same number of pairs. These histograms show that the first groups, those with the highest semantic similarity, are always ranked low (for Breast Cancer, for example, the first three groups are ranked below 0.035, while the best ranked group (5) reaches 0.24). Note that this low value is an average over the GainRatio of the pairs included in the bin; individual gene-pairs of high importance can still occur within these bins. Therefore, the GO-based feature selection can perhaps be further improved by excluding those pairs with the highest similarity values that are not discriminative. Analysing these trends more closely and using the knowledge derived from such correlations for better feature selection is a promising direction for future work.
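To make the binning concrete, the following minimal sketch (assuming per-pair arrays of semantic similarity and GainRatio values; not the framework code) divides the pairs into equal-size bins by decreasing semantic similarity and averages the GainRatio per bin, as in the Appendix A histograms:

import java.util.Arrays;
import java.util.Comparator;

public final class SimilarityBins {

    public static double[] averageGainRatioPerBin(
            double[] similarity, double[] gainRatio, int nBins) {
        // Sort pair indices by decreasing semantic similarity.
        Integer[] order = new Integer[similarity.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> -similarity[i]));

        // Equal-size bins; any remainder beyond nBins * binSize is ignored
        // in this sketch.
        int binSize = similarity.length / nBins;
        double[] avg = new double[nBins];
        for (int b = 0; b < nBins; b++) {
            double sum = 0;
            for (int k = 0; k < binSize; k++) {
                sum += gainRatio[order[b * binSize + k]];
            }
            avg[b] = sum / binSize;
        }
        return avg;
    }
}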
The GO-based feature selection was shown to perform well only when, in preliminary tests, the average similarity of the selected pairs was much higher than the average over all pairs. A more careful study could help to understand why the semantic similarity of the selected pairs is higher only in some of the datasets and how these datasets differ from the others.
CHAPTER 7
Acknowledgments
I wish to thank my supervisor Dr. Alexey Tsymbal for helping me with any question, making time whenever I needed help, and giving professional expert advice. Further, I thank Prof. Dr. Martin Stetter for recommending me to Siemens, for guiding my thesis and for supporting my work with expert knowledge. The thesis was funded by the Health-e-Child project and by Siemens, with special thanks to Dr. Martin Huber. Next, I want to thank the University of Applied Sciences Weihenstephan for providing me with broad knowledge of different bioinformatics-related topics, especially Prof. Dr. Frank Lesske, who always had time for extended discussions. Finally, I wish to thank my wonderful girlfriend Maria for supporting me over the last years and giving me new strength in stressful times, and my parents for their moral and financial support.
List of Listings

3.1 Example of an ARFF header section . . . 23
3.2 Example of an ARFF data section . . . 24
3.3 Part of a configuration file for the “distance learning” framework . . . 26
3.4 Term definition of multidrug transporter activity in OBO format . . . 31
4.1 Java code of the Schoen similarity calculation between two genes. The method is called with the GO terms for each gene as input parameters and uses the class variable model that includes the GO graph. . . . 55
4.2 Algorithm to oversample a dataset (inst) with unequally distributed classes. . . . 62
4.3 Pseudo code of the Group Selection algorithm. . . . 63
List of Figures

3.1 A gene is expressed into a gene product like RNA or proteins. For this, the double-stranded DNA of the gene is transformed into a single-stranded RNA by transcription. For building a protein, the RNA is translated into a protein sequence by protein biosynthesis. . . . 12
3.2 a) An example of a microarray visualization with approximately 37,500 probes. b) An enlarged view of the blue rectangle of the microarray shown in a). . . . 14
3.3 Example of a decision tree where nodes are represented as brown circles and a blue rectangle represents a leaf. For classifying a new instance, at most two decisions are needed. Starting at A, the value of the feature can either be <= 1 or > 1. In the first case, the new instance reaches a leaf and is classified as diseased. In the second case, one more decision is needed at node B, where the feature value can either be true or false. . . . 17
3.4 A 2-dimensional example of k-NN classification. The test case (red circle) should be classified either to the green squares or to the blue stars. The solid border includes the k = 3 nearest neighbors of the test case; in this case, the test instance is classified as a green square, as there are two squares and only one blue star. The dashed border includes the k = 6 nearest neighbors; in this case there are more blue stars (4) than green squares (2), so the test case is classified as a blue star. . . . 18
3.5 The general workflow of the distance learning framework. An ARFF file containing several system configuration parameters as attributes and instances representing single experiments is read first. For each line, corresponding to a single run, a new experiment is started with the classification configuration specified in the ARFF file. For each experiment, several cross validation iterations can be performed. For each cross validation iteration, the data is split randomly into a training and a testing set. Next, an imputation method is run to predict missing values. Before training the different classification models specified in the configuration ARFF file, the sets are transformed into the product, difference and comparative spaces for learning distance functions. Then, the algorithm classifies the test set with unseen class labels. At the end of all iterations, the classification accuracies are averaged over all runs and reported to the resulting text file and to the console. . . . 27
3.6 An example of a set of terms under the biological process node. A screenshot from the ontology editing tool OBO-Edit (http://www.oboedit.org). The nodes represent GO terms and the labeled arcs indicate relations between two GO terms. . . . 29
4.1 The structure of the reorganized object oriented distance learning framework. The TestExecuter class (green rectangle) is detailed in Figure 4.2. . . . 39
4.2 Detailed view of the TestExecuter class. The parts not exploited in the experiments in this thesis are grayed out. The cross validation procedure (green rectangle) is described in more detail in Figure 4.3. . . . 40
4.3 Detailed structure of a cross validation iteration where four types of learning algorithms are used in one experiment. . . . 41
4.4 Output of a single experiment for the Health-e-Child Brain Tumours dataset in HTML format. . . . 43
4.5 Modified framework for transforming the dataset into the gene-pair representation. The new components are shown in blue. . . . 50
4.6 Workflow of the transformation from the single-gene to the gene-pair representation, where n is the number of features and N is the number of pairs to be selected. . . . 51
4.7 Translation of the original dataset into a GO ID dataset. . . . 54
4.8 Modification of the framework to incorporate the GO semantic similarity in three different ways. . . . 57
List of Tables

3.1 A list of the benchmark datasets used for experiments conducted for this thesis. . . . 34
3.2 The six datasets provided for the Basic Track of the RSCTC’2010 Machine Learning Challenge. . . . 35
5.1 Average accuracies over the nine datasets for distance comparison for learning from equivalence constraints. . . . 67
5.2 Classification accuracy for Breast Cancer of pair vs. original representation. . . . 68
5.3 Classification accuracy for Colon of pair vs. original representation. . . . 68
5.4 Classification accuracy for Embrional Tumours of pair vs. original representation. . . . 69
5.5 Classification accuracy for HeC Brain Tumours of pair vs. original representation. . . . 69
5.6 Classification accuracy for Leukemia of pair vs. original representation. . . . 69
5.7 Classification accuracy for Lung Cancer of pair vs. original representation. . . . 69
5.8 Classification accuracy for Lupus of pair vs. original representation. . . . 70
5.9 Classification accuracy for Lymphoma of pair vs. original representation. . . . 70
5.10 Classification accuracy for Arcene of pair vs. original representation. . . . 70
5.11 Classification accuracy for Mesh of pair vs. original representation. . . . 70
5.12 Average classification accuracy over the four different classifiers for the genetic datasets. . . . 71
5.13 Average classification accuracy over the genetic datasets for the tested classifiers. . . . 71
5.14 Average classification accuracy over the four different classifiers for the non-genetic datasets. . . . 72
5.15 Average classification accuracy over the non-genetic datasets for the tested classifiers. . . . 72
5.16 Average classification accuracies for noisy datasets. . . . 72
5.17 Number of attributes for the original and the reduced datasets for experiments with the GO semantic similarity. . . . 74
5.18 Classification accuracy of the GO-based feature weighting for Breast. . . . 75
5.19 Classification accuracy of the GO-based feature weighting for Colon. . . . 76
5.20 Classification accuracy of the GO-based feature weighting for Embrional Tumours. . . . 76
5.21 Classification accuracy of the GO-based feature weighting for Lupus. . . . 77
5.22 Average accuracies for semantic similarity based feature weighting with different methods. . . . 77
5.23 Comparing the ratio of the average similarity between selected pairs and all pairs. . . . 79
5.24 Classification accuracy of GO-based feature selection for Breast. . . . 79
5.25 Classification accuracy of GO-based feature selection for Colon. . . . 79
5.26 Classification accuracy of GO-based feature selection for Embrional Tumours. . . . 79
5.27 Classification accuracy of GO-based feature selection for HeC Brain Tumours. . . . 80
5.28 Classification accuracy of GO-based feature selection for Lupus. . . . 80
5.29 Best configurations of the intrinsic Random Forest Similarity for Basic Track datasets, where N is the number of attributes of the dataset. . . . 81
Bibliography

[1] RSCTC’2010 Discovery Challenge: Mining DNA microarray data for medical diagnosis and treatment. http://tunedit.org/challenge/RSCTC-2010-A, to be published in RSCTC’2010 proceedings, 2010.

[2] Michael Ashburner, Catherine A. Ball, and Judith A. Blake. Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, May 2000.

[3] Francisco Azuaje, Haiying Wang, and Olivier Bodenreider. Ontology-driven similarity approaches to supporting gene functional assessment. The Eighth Annual Bio-Ontologies Meeting, 2008.

[4] Aharon Bar-Hillel. Learning from Weak Representations Using Distance Functions and Generative Models. PhD thesis, The Hebrew University of Jerusalem, 2006.

[5] Abdelghani Bellaachia, David Portnoy, et al. E-CAST: A data mining algorithm for gene expression data. In BIOKDD02: Workshop on Data Mining in Bioinformatics, 1999.

[6] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3-4):281–297, 1999.

[7] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://archive.ics.uci.edu/ml/, 1998.

[8] Markus Brameier and Carsten Wiuf. Co-clustering and visualization of gene expression data and gene ontology terms for Saccharomyces cerevisiae using self-organizing maps. Journal of Biomedical Informatics, 40(2):160–173, 2007.

[9] Alvis Brazma and Jaak Vilo. Gene expression data analysis. FEBS Letters, 480:17–24, 2000.

[10] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001.

[11] Edgar Chávez and Gonzalo Navarro. A probabilistic spell for the curse of dimensionality. In Algorithm Engineering and Experimentation, pages 147–160, 2001.

[12] Zheng Chen and Jian Tang. Using gene ontology to enhance effectiveness of similarity measures for microarray data. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, pages 66–71, 2008.

[13] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–945, October 2004.

[14] R. De Maesschalck. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50(1):1–18, January 2000.

[15] Dekang Lin. An information-theoretic definition of similarity. pages 296–304. Morgan Kaufmann, 1998.

[16] Sandrine Dudoit, Jane Fridlyand, and Terence P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77–87, 2002.

[17] Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.

[18] Valur Emilsson, Gudmar Thorleifsson, and Bin Zhang. Genetics of gene expression and its effect on disease. Nature, 452(7186):423–428, March 2008.

[19] Yoav Freund and Robert E. Schapire. A short introduction to boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, volume 14, pages 1401–1406, 1999.

[20] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

[21] Zhan Guoqing, Lu Qingming, Lu Qiang, and Li Yixue. A set of API used to manipulate Gene Ontology vocabulary. http://www.bioinformatics.org/GO4J/, 2006.

[22] M. Hall. Correlation-based Feature Selection for Machine Learning. PhD thesis, Department of Computer Science, University of Waikato, 1998.

[23] H. Heijerman. Infection and inflammation in cystic fibrosis: A short review. Journal of Cystic Fibrosis, 4:3–5, August 2005.

[24] Tomer Hertz. Learning Distance Functions: Algorithms and Applications. PhD thesis, The Hebrew University of Jerusalem, 2006.

[25] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In International Conference on Research in Computational Linguistics (ROCLING X), September 1997.

[26] Kenji Kira and Larry A. Rendell. A practical approach to feature selection. In ML’92: Proceedings of the Ninth International Workshop on Machine Learning, pages 249–256, San Francisco, CA, USA, 1992. Morgan Kaufmann.

[27] Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.

[28] Ron Kohavi, Pat Langley, and Yeogirl Yun. The utility of feature weighting in nearest-neighbor algorithms. In Proceedings of the Ninth European Conference on Machine Learning, pages 85–92. Springer-Verlag, 1997.

[29] Rafal Kustra and Adam Zagdanski. Incorporating Gene Ontology in clustering gene expression data. In CBMS ’06: Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems, pages 555–563, Washington, DC, USA, 2006. IEEE Computer Society.

[30] Robert Ezra Langlois. Machine Learning in Bioinformatics: Algorithms, Implementations and Applications. PhD thesis, University of Illinois at Chicago, 2002.

[31] Pedro Larranaga, Borja Calvo, Roberto Santana, Concha Bielza, Josu Galdiano, Inaki Inza, Jose A. Lozano, Ruben Armananzas, Guzman Santafe, Aritz Perez, and Victor Robles. Machine learning in bioinformatics. Briefings in Bioinformatics, 7(1):86–112, 2006.

[32] Jae W. Lee, Jung B. Lee, Mira Park, and Seuck H. Song. An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48(4):869–885, April 2005.

[33] Alan W. Liew, Hong Yan, and Mengsu Yang. Pattern recognition techniques for the emerging field of bioinformatics: A review. Pattern Recognition, 38(11):2055–2073, November 2005.

[34] S. Mahamud and M. Hebert. The optimal distance measure for object detection. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 248–255, 2003.

[35] Ryszard S. Michalski, Robert E. Stepp, and Edwin Diday. A recent advance in data analysis: Clustering objects into classes characterized by conjunctive concepts. In L. Kanal and A. Rosenfeld, editors, Progress in Pattern Recognition, 1:33–55, 1981.

[36] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[37] C. Pesquita, D. Faria, H. Bastos, A. O. Falcão, and F. M. Couto. Evaluating GO-based semantic similarity measures. In Proceedings of the 10th Annual Bio-Ontologies Meeting, 2007.

[38] Jianlong Qi and Jian Tang. Integrating gene ontology into discriminative powers of genes for feature selection in microarray data. In SAC ’07: Proceedings of the ACM Symposium on Applied Computing, pages 430–434, New York, NY, USA, 2007. ACM.

[39] J. Quackenbush. Computational analysis of microarray data. Nature Reviews Genetics, 2(6):418–427, June 2001.

[40] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, March 1986.

[41] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, January 1993.

[42] A. Tsymbal, D. Vitanovski, B. Georgescu, S. Kevin Zhou, N. Navab, R. Ionasec, and D. Comaniciu. Shape-based diagnosis of the aortic valve. In SPIE Medical Imaging, 2009.

[43] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. Cognitive Modelling, November 1995.

[44] Yvan Saeys, Inaki Inza, and Pedro Larranaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, October 2007.

[45] Eric W. Sayers, Tanya Barrett, Dennis A. Benson, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 37(1):D5–D15, January 2009.

[46] Andreas Schlicker and Mario Albrecht. FunSimMat: a comprehensive functional similarity database. Nucleic Acids Research, pages 806+, October 2007.

[47] Andreas Schlicker, Francisco S. Domingues, Jorg Rahnenfuhrer, and Thomas Lengauer. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics, 7:302+, June 2006.

[48] Jose L. Sevilla, Victor Segura, Adam Podhorski, Elizabeth Guruceaga, Jose M. Mato, Luis A. Martinez-Cruz, Fernando J. Corrales, and Angel Rubio. Correlation between gene expression and GO semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(4):330–338, 2005.

[49] R. D. Short and K. Fukunaga. The optimal distance measure for nearest neighbour classification. IEEE Transactions on Information Theory, 27(5):622–627, 1981.

[50] Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos, Douglas Hardin, and Shawn Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5):631–643, March 2005.

[51] Roland B. Stoughton. Applications of DNA microarrays in biology. Annual Review of Biochemistry, 74:53–82, 2005.

[52] Ying Tao, Lee Sam, Jianrong Li, Carol Friedman, and Yves A. Lussier. Information theory applied to the sparse Gene Ontology annotation network to predict novel gene function. Bioinformatics, 23(13):i529–i538, July 2007.

[53] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, June 2001.

[54] Alexey Tsymbal, Martin Huber, and Shaohua Kevin Zhou. Discriminative distance functions and the patient neighborhood graph for clinical decision support. In Hamid R. Arabnia, editor, Advances in Computational Biology. Springer, 2010 (to appear).

[55] Alexey Tsymbal, Shaohua Kevin Zhou, and Martin Huber. Neighborhood graph and learning discriminative distance functions for clinical decision support. In Conference Proceedings of the IEEE Engineering in Medicine and Biology Society, 2009.

[56] H. Wang, F. Azuaje, O. Bodenreider, and J. Dopazo. Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships. In CIBCB ’04: Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pages 25–31, 2004.

[57] Haiying Wang and Francisco Azuaje. An ontology-driven clustering method for supporting gene expression analysis. In CBMS ’05: Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems, pages 389–394, Washington, DC, USA, 2005. IEEE Computer Society.

[58] James Z. Wang, Zhidian Du, Rapeeporn Payattakool, Philip S. Yu, and Chin-Fu Chen. A new method to measure the semantic similarity of GO terms. Bioinformatics, 23(10):1274–1281, 2007.

[59] G. Weiss and F. Provost. The effect of class distribution on classifier learning. Technical report, 2001.

[60] Randall D. Wilson and Tony R. Martinez. Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6:1–34, 1997.

[62] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, April 1997.

[63] Hongwei Wu, Zhengchang Su, Fenglou Mao, Victor Olman, and Ying Xu. Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucleic Acids Research, 33(9):2822–2837, May 2005.

[64] Tao Xu, LinFang Du, and Yan Zhou. Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data. BMC Bioinformatics, 9(1):472, 2008.

[65] C. H. Yeang, S. Ramaswamy, P. Tamayo, S. Mukherjee, R. M. Rifkin, M. Angelo, M. Reich, E. Lander, J. Mesirov, and T. Golub. Molecular classification of multiple tumor types. Bioinformatics, 17(Suppl 1), 2001.

[66] J. Yu, J. Amores, N. Sebe, and Q. Tian. A new study on distance metrics as similarity measurement. Publications of the Universiteit van Amsterdam (Netherlands), 2006.

[67] Jie Yu, J. Amores, N. Sebe, P. Radeva, and Qi Tian. Distance learning for similarity estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):451–462, 2008.
Appendix A

The histograms show the GainRatio of the gene-pair features for the five tested datasets. The y-axis presents the average GainRatio and the x-axis the semantic similarity of the pairs, where the pairs are grouped into 100 bins of equal size based on their semantic similarity values. Each bar in a graph represents the average GainRatio of the gene-pairs included in the corresponding bin. The average semantic similarity of the bins thus decreases from left to right.
[Histogram: average GainRatio (0 to 0.25) per semantic-similarity bin] Distribution of average GainRatio for gene-pair sets with different semantic similarity for the Breast Cancer dataset.

[Histogram: average GainRatio (0 to 0.35) per semantic-similarity bin] Distribution of average GainRatio for gene-pair sets with different semantic similarity for the Colon dataset.

[Histogram: average GainRatio (0 to 0.2) per semantic-similarity bin] Distribution of average GainRatio for gene-pair sets with different semantic similarity for the Embrional Tumours dataset.

[Histogram: average GainRatio (0 to 0.50) per semantic-similarity bin] Distribution of average GainRatio for gene-pair sets with different semantic similarity for the HeC Brain Tumours dataset.

[Histogram: average GainRatio (0 to 0.4) per semantic-similarity bin] Distribution of average GainRatio for gene-pair sets with different semantic similarity for the Lupus dataset.