Learning Distance Functions For Gene Expression Data

Diploma Thesis in Bioinformatics

submitted by Torsten Schön, born 23rd February 1986 in Wassertrüdingen

Written at the Department of Bioinformatics, Hochschule Weihenstephan-Triesdorf, Freising, in cooperation with Siemens AG, CT T DE TC 4, Erlangen

Advisors: Prof. Dr. Martin Stetter, Dr. Alexey Tsymbal (Siemens AG)

Started 1 September 2009, finished 26 February 2010

Declaration in lieu of oath: I hereby declare in lieu of oath that this thesis was written by myself without outside help and has not previously been submitted for examination purposes elsewhere. No sources or aids other than those stated were used. Literal and paraphrased quotations are marked as such.

Erlangen, ................................... (date) ................................... (signature)

Abstract

This thesis addresses problems of classifying genetic data with distance function learners. Two common learning algorithms, plain k-Nearest Neighbour (with the canonical Euclidean distance) and Random Forest, are compared with two distance function learning-based techniques, learning from equivalence constraints and the intrinsic Random Forest similarity, on different benchmark datasets. These datasets include gene expression data for patients with breast cancer, colon cancer, embryonal tumours, brain tumours, leukemia, lung cancer, lupus and lymphoma; each dataset also contains healthy subjects. First, seven established and two novel distance functions are evaluated for learning from equivalence constraints in the difference space. To account for gene interactions in the classification algorithms, the original datasets are transformed into a new representation comprising gene pairs rather than single genes: all pairwise combinations of the genes in the original datasets are constructed, the most discriminative gene pairs are selected, and the new representation is evaluated on the benchmark datasets.
The novel gene-pair representation is shown to increase accuracy on genetic datasets. Based on the gene-pair representation, the Gene Ontology semantic similarity of the gene pairs is calculated with different methods and first used for feature weighting. Eight approaches are compared, one of which is a newly introduced algorithm for calculating the semantic similarity between two genes. The semantic similarity is then used to pre-select pairs with a high similarity value. This Gene Ontology based feature selection approach is compared with common feature selection and is shown to increase accuracy on average over the datasets.

Zusammenfassung

This diploma thesis deals with the challenges of classifying genetic data, in particular with distance-function-based algorithms. Two well-established learning algorithms, k-Nearest Neighbour and Random Forest, are compared with two distance-based methods, learning from equivalence constraints and the intrinsic Random Forest similarity, on various benchmark datasets. These datasets contain gene expression data from patients with breast cancer, colon cancer, embryonal tumours, leukemia, lung cancer, lupus and malignant lymphoma, as well as gene expression data from healthy subjects. First, seven established and two newly developed distance functions are evaluated for classification with equivalence constraints. To take the interplay between genes into account during classification, the datasets are transformed into a novel representation of gene pairs, built from all pairwise combinations of the individual genes in the datasets. After selecting the pairs most relevant for class discrimination, the new representation is validated on the reference datasets and compared with the original representation, showing an improvement in classification accuracy for genetic datasets.
Subsequently, the semantic similarity of the two elements of each pair is calculated using various methods based on the Gene Ontology, and this similarity is incorporated into the classification as a weighting of the pairs. Eight different methods for calculating this similarity are compared, one of which is newly introduced. Finally, the semantic similarity is used to pre-select pairs with a very high similarity value. This kind of pair selection is compared with the previously used method and shows an improvement of the mean classification result over the tested datasets.

Contents

1 Introduction
2 Related Work
2.1 Machine Learning with Gene Expression Data
2.2 Learning Distance Functions
2.3 Semantic Similarity Calculations Based on Gene Ontology
3 Material
3.1 Biological and Medical Aspects
3.1.1 Gene Expression
3.1.2 Genetic Microarray Experiments
3.2 Machine Learning
3.2.1 Supervised Learning
3.2.2 Unsupervised Learning
3.2.3 Random Forest
3.2.4 k-Nearest Neighbour Classification
3.2.5 AdaBoost
3.2.6 Learning Distance Functions
3.2.7 Feature Selection
3.2.8 Cross Validation
3.3 Weka - A Machine Learning Framework in Java
3.3.1 The ARFF Format
3.3.2 The Structure of the Weka Framework
3.4 Distance Learning Framework
3.5 Gene Ontology
3.5.1 The Key Components of the Ontology
3.5.2 The GO File Format: OBO
3.5.3 The GO Annotation Database
3.6 Gene Ontology API for Java
3.7 NCBI EUtils
3.8 Benchmark Datasets
3.9 RSCTC'2010 Discovery Challenge
3.9.1 Basic Track
3.9.2 Advanced Track
4 Methods
4.1 Reorganization of the Distance Learning Framework
4.2 Distance Function Learning From Equivalence Constraints
4.2.1 The L1 Distance and Modifications of the L1 Distance
4.2.2 The Simplified Mahalanobis Distance
4.2.3 The Chi-Square Distance
4.2.4 The Weighted Frequency Distance
4.2.5 The Canberra Distance
4.2.6 The Variance Threshold Distance
4.2.7 Test Configurations
4.3 Transformation of Feature Representation for Gene Expression Data
4.3.1 Motivation
4.3.2 Framework Updates
4.3.3 Test Configurations
4.4 Integration of GO Semantic Similarity
4.4.1 Motivation
4.4.2 Framework Modifications
4.4.3 GO-Based Feature Weighting
4.4.4 GO-Based Feature Selection
4.4.5 Test Configurations
4.5 RSCTC'2010 Discovery Challenge
4.5.1 Basic Track
4.5.2 Advanced Track
5 Empirical Analysis
5.1 Distance Comparison for Learning From Equivalence Constraints in the Difference Space
5.2 Transformation of Representation for Gene Expression Data
5.2.1 Analysis of the Genetic Datasets
5.2.2 Analysis of the Non-Genetic Datasets
5.2.3 Robustness of the Gene-Pair Representation to Noise
5.2.4 Benefits and Limitations
5.3 Integration of Gene Ontology Semantic Similarity into Data Classification
5.3.1 GO-Based Feature Weighting
5.3.2 GO-Based Feature Selection
5.4 Preliminary Results for the RSCTC'2010 Discovery Challenge
6 Conclusion
6.1 Summary
6.2 Limitations and Future Work
7 Acknowledgments

CHAPTER 1
Introduction

Since the human genome was sequenced in 2003 [13], biological research has become increasingly interested in the genetic causes of diseases. Biologists try to identify the genes responsible for the disease under study by analyzing gene expression values. The human genome includes thousands of genes, which makes it difficult to find the genes that are associated with a certain disease. Genetic microarray experiments are usually performed to analyze the expression values of thousands of genes within a single experiment. To extract useful information from these experiments and to reach a better understanding of the disease under study, a computational analysis is usually performed [39]. To determine which genes are modified in a certain disease, the gene expression data of healthy subjects and diseased patients are analyzed and compared. Machine learning techniques [36] are often used to find specific patterns in the data of healthy subjects and diseased patients that can be used to discriminate patients by their disease state. These learning methods train models on data from patients with a known disease state. A trained model can then be used to predict the disease state of an unseen patient by searching for similarities between the gene expression profile of the patient with the unknown disease state and the profiles of the diseased and the healthy subjects. In machine learning, an example describing a single training or testing item (a patient, for example) is usually called an instance. An instance can be described by an arbitrary number of attributes and can have a label. For bioinformatics datasets as described above, the attributes are normally genes and the class label is defined by the disease state. One of the learning algorithms commonly used with genetic data is k-Nearest Neighbour classification.
Typically, the Euclidean distance between the data of an unseen patient and the patient records used for training the model is calculated and the k nearest neighbours are determined [53]. The new instance is then usually labeled with the most frequent class among its k nearest neighbours. The k-Nearest Neighbour algorithm is not the only approach that uses a distance function for classification. During the last three decades, the importance of the distance function in machine learning has been gradually acknowledged [4]. However, for many years only canonical or hand-crafted distance functions were used. Recently, a growing body of work has addressed the problem of supervised or semi-supervised learning of customized distance functions [24]. In particular, two different approaches, learning from equivalence constraints and the intrinsic Random Forest similarity, have been introduced and shown to perform well on image data [24, 4]. Many characteristics of gene expression data are similar to those of image data: both usually have a large number of features, many of which are redundant, irrelevant or noisy. For this reason, we assume that learning distance functions can improve the classification of genetic data, too. Another possibility for improving the analysis of genetic data is to use external biological knowledge [57, 29, 38, 12]. The Gene Ontology [2] is a good source of biological knowledge that can be incorporated into machine learning techniques. In addition, the Gene Ontology content is consistently growing, becoming a more complete and reliable source of knowledge; the number of entries, for example, increased from 27,867 on January 1, 2009 to 30,716 on January 1, 2010. In this thesis, we try to incorporate the biological knowledge of the Gene Ontology to improve the classification accuracy on genetic benchmark datasets. Further, the benchmark datasets are used to compare different machine learning approaches with different configurations.
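The k-Nearest Neighbour procedure described above can be sketched in a few lines. This is an illustrative Python sketch, not the thesis's Java/Weka implementation; the function names and the toy expression profiles are invented for this example only.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Canonical Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, labels, query, k=3):
    """Label a query profile with the most frequent class among its
    k nearest training profiles (plain k-Nearest Neighbour)."""
    ranked = sorted(range(len(train)), key=lambda i: euclidean(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy expression profiles: two "diseased" and two "healthy" subjects.
train = [[5.1, 0.2], [4.9, 0.1], [1.0, 3.9], [1.2, 4.2]]
labels = ["diseased", "diseased", "healthy", "healthy"]
print(knn_predict(train, labels, [5.0, 0.3], k=3))  # -> diseased
```

Swapping `euclidean` for a learned distance function is exactly the point at which the techniques studied in this thesis plug into the k-NN scheme.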
The main task of this thesis is to improve the classification accuracy for the benchmark datasets, in particular to increase the number of correctly classified test instances. This thesis addresses the following general problems of classification based on genetic data:

• Reducing the large number of genes to the most discriminative genes
• Removing noisy, irrelevant and redundant information
• Learning from a small number of training instances
• Classification of binary and multiclass problems
• Learning with an unequal distribution of classes in a dataset
• Comparison of different learning algorithms
• Comparison of different distance functions for k-Nearest Neighbour classification
• Exploitation of the biological information provided by the expression values for classification
• Incorporation of biological knowledge from external sources to improve classification

The following problems are important for classification with genetic data too, but they are out of scope for this thesis:

• Imputation of missing data
• Normalization of gene expression values

The present study is motivated by the research activities in the EU Framework Programme 6 project Health-e-Child (www.health-e-child.org), aimed at improving personalized healthcare in selected areas of paediatrics, especially focusing on integrating medical data across disciplines, modalities, and vertical levels such as molecular, organ, individual and population. In medicine, large complex heterogeneous data sets are commonplace. Today, a single patient record may include, for example, demographic data, familial history, laboratory test results, images (echocardiograms, MRI, CT, angiograms etc.), other signals (e.g. EKG), genomic and proteomic samples, and a history of appointments, prescriptions and interventions. Much, if not all, of this data may be relevant and may contain important information for decision support.
A successful integration of heterogeneous data within a patient record thus becomes of paramount importance; in particular, learning a distance function for such data, for patient case retrieval or classification, is non-trivial and forms an important task.

This thesis is organized as follows: In Section 2, related work is presented to give an overview of state-of-the-art techniques in the area of machine learning with genetic data. In Section 3, the algorithms, approaches, software and biological background used in this thesis are described. The Methods section (4) describes the implementations and techniques used for the empirical tests. The results and the analysis of the performed experiments are presented in Section 5. Section 6 summarizes the thesis and concludes with proposals for future work.

CHAPTER 2
Related Work

2.1 Machine Learning with Gene Expression Data

Much work has been done in the area of machine learning over the last few decades, and as bioinformatics gains more attention, different techniques have been applied to process genetic data. Larrañaga et al. [31] presented a summary of machine learning methods in bioinformatics, describing how to apply modeling methods, supervised/unsupervised learning and optimization to gene expression data. The most crucial problems in classifying genetic data are missing data imputation, discriminative feature selection, classification and clustering. A single microarray experiment may contain thousands of genes (rows) under different conditions (columns) and is scanned automatically by a robot. The scanning procedure can result in missing values caused by scanning problems, insufficient resolution, image corruption, scratches, dust or defective spots. However, most data mining algorithms require a complete data matrix for proper processing. In a related paper, Troyanskaya et al.
[53] implemented and evaluated different methods of missing data imputation for gene expression data. Three methods were compared: Singular Value Decomposition (SVD), row average and k-Nearest Neighbour regression, of which the last was shown to perform best. In addition to the usually small number of instances (conditions), the large number of gene expression values complicates classification. For that reason, a feature selection step is normally conducted before classification to determine discriminative genes and eliminate redundancy in the dataset. For an exhaustive review of feature selection techniques for gene expression data, see "A review of feature selection techniques in bioinformatics" [44]. In this thesis, different feature selection methods are used for pre-processing; they are explained in detail in Section 3.2.7. After the imputation of missing values and the selection of the most meaningful features, the data can be used for either unsupervised or supervised learning. The main task in unsupervised learning is to cluster unlabeled data into groups such that an instance is more similar to the members of its own group than to any instance of the other groups. Ben-Dor et al. [6] introduced a clustering algorithm called CAST (Cluster Affinity Search Technique), which was further improved in 2002 by Bellaachia et al. [5] and shown to perform well on clustering gene expression data. In contrast to unsupervised learning, supervised learning uses labeled data to train a classification system that is able to predict the labels of unseen (unlabeled) data. Several empirical comparisons of classification algorithms, covering k-Nearest Neighbour (k-NN), Linear Discriminant Analysis (LDA), classification trees (like C4.5 [41] or CART [40]), Support Vector Machines (SVM), Artificial Neural Networks (ANN) and their combinations using Bagging and Boosting, for several cancer gene expression studies have been published [16, 65, 50, 32].
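The k-Nearest Neighbour imputation idea evaluated by Troyanskaya et al. can be sketched as follows. This is a simplified illustration in Python under assumed conventions (`None` marks a missing value, and a plain average of the k neighbours is used, whereas the published KNNimpute method uses a distance-weighted average); the names are invented for this example.

```python
import math

def knn_impute(matrix, row, col, k=2):
    """Fill matrix[row][col] (None = missing) with the average of that
    column over the k rows most similar to `row`, where similarity is the
    Euclidean distance computed on the columns both rows have observed."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))
    target = matrix[row]
    # Only rows that actually have a value in the missing column can help.
    candidates = [r for i, r in enumerate(matrix)
                  if i != row and r[col] is not None]
    candidates.sort(key=lambda r: dist(target, r))
    return sum(r[col] for r in candidates[:k]) / k

data = [[1.0, None, 3.0],
        [1.1, 2.1, 3.1],
        [0.9, 1.9, 2.9],
        [9.0, 9.0, 9.0]]
print(knn_impute(data, 0, 1))  # -> 2.0 (average of the two similar rows)
```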
As the ambition of this thesis is to build a framework for testing different classification methods that predict class labels of unknown data from labeled training data, only supervised learning is used. For that reason, clustering is covered only by a short review in the Material chapter. So far, several useful machine learning workbenches have been presented. Langlois [30], for example, developed an open-source machine learning workbench for studying the structure, function, evolution and mutation of proteins. Arguably the most popular machine learning framework, Weka [17], was developed at the University of Waikato. As the software developed in the course of this thesis is based on Weka, this framework is described in detail in Section 3.3. For a more detailed review of data mining techniques for bioinformatics data, the interested reader is referred to "Pattern recognition techniques for the emerging field of bioinformatics: A review" [33] or "Machine learning in bioinformatics" [31].

2.2 Learning Distance Functions

Historically, research on distance functions in machine learning started from supervised learning of distance functions for k-Nearest Neighbour classification [49]. Since then, canonical distance functions like the Euclidean or Mahalanobis distances have typically been used. Even today, the Euclidean distance function is used in many applications, although it is well known that its use is justified only when the feature data distribution is Gaussian. Mahalanobis metric learning has received considerable research attention, but it is often inferior to many non-linear and non-metric distance techniques and usually fails for learning distances from image data [24], too. Discriminative distance functions have been used in various classification domains as well as in image processing and pattern recognition.
Hertz [24] performed extensive research in this area and presented three novel distance learning algorithms: Relevant Component Analysis (RCA), DistBoost and KernelBoost. In RCA, a Mahalanobis distance metric that is optimal under several conditions is learned using generative learning with positive equivalence constraints. Hertz described the application of these algorithms to various data domains, including clustering and classification, and showed that using the learned distance functions instead of off-the-shelf ones brings a significant improvement in all of these application domains. Bar-Hillel [4] discussed learning from equivalence constraints, where distance function learning is considered as learning a classifier defined over instance pairs. A new method, termed coding similarity, was introduced and shown to hold an empirical advantage over the common Mahalanobis distance; based on this algorithm, a two-step method for subordinate class recognition in images was developed. It was shown that the popular Euclidean and Manhattan distance metrics are not suitable for many data distributions. Yu et al. [66] proposed a novel algorithm that finds the best distance metric dynamically for a given dataset; this boosted distance metric was shown to give robust results on fifteen UCI repository [7] benchmark datasets and two image retrieval applications. Later, Yu et al. [67] further improved their metric and presented a general guideline for finding a more optimal distance function for similarity estimation on a specific dataset. Tsymbal et al. [55, 54] first performed a detailed comparison of learning discriminative distance functions for case retrieval and decision support. Two types of discriminative distance learning methods, learning from equivalence constraints and the intrinsic Random Forest similarity, have been compared.
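The core construction behind learning from equivalence constraints, treating a distance as a classifier over instance pairs, can be sketched as follows. This Python sketch is illustrative only: a simple nearest-centroid rule in the difference space stands in for the stronger pair classifiers (e.g. boosted ensembles) used in the literature, and all names and toy data are invented for this example.

```python
import itertools
import math

def difference_space(X, y):
    """Map labelled instances into the difference space: each pair (i, j)
    becomes the vector |x_i - x_j|, labelled 'similar' when both instances
    share a class (positive constraint) and 'dissimilar' otherwise."""
    pairs, labels = [], []
    for i, j in itertools.combinations(range(len(X)), 2):
        pairs.append([abs(a - b) for a, b in zip(X[i], X[j])])
        labels.append("similar" if y[i] == y[j] else "dissimilar")
    return pairs, labels

def learned_distance(u, v, pairs, labels):
    """Score a new pair by a nearest-centroid classifier in the difference
    space: the closer |u - v| lies to the centroid of the 'dissimilar'
    constraints, the larger the learned distance. Any binary classifier
    could replace the centroid rule here."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    def centroid(tag):
        rows = [p for p, l in zip(pairs, labels) if l == tag]
        return [sum(c) / len(rows) for c in zip(*rows)]
    d_sim = math.dist(diff, centroid("similar"))
    d_dis = math.dist(diff, centroid("dissimilar"))
    return d_sim / (d_sim + d_dis)  # in [0, 1]; high = probably dissimilar

X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
y = ["healthy", "healthy", "diseased", "diseased"]
pairs, labels = difference_space(X, y)
# A same-class pair should receive a smaller learned distance:
print(learned_distance([0.1, 0.0], [0.0, 0.2], pairs, labels) <
      learned_distance([0.1, 0.0], [5.1, 5.0], pairs, labels))  # -> True
```

The learned score can then be used wherever a distance is needed, for example as the neighbour-ranking function in k-NN.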
They showed that both techniques are competitive with plain learning, and that the Random Forest similarity exhibits more robust behavior and is more stable under missing data and noise. A more thorough introduction to learning distance functions is given in Section 3.2.6.

2.3 Semantic Similarity Calculations Based on Gene Ontology

"The Gene Ontology (GO) [2] provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data from GO Consortium members, as well as tools to access and process this data" (from the official Gene Ontology website, http://www.geneontology.org/index.shtml, 7 January 2010). Based on this controlled vocabulary, two genes can be semantically associated, and a similarity can further be calculated based on their annotations in the GO. Sevilla et al. [48] applied semantic similarity, well known in the fields of lexical taxonomies, artificial intelligence and psychology, to the Gene Ontology by calculating the information content of each term in the ontology with different methods, such as those of Resnik [43], Jiang [25] or Lin [15]. Sevilla et al. computed correlation coefficients to compare physical intergene similarity with GO semantic similarity. The results demonstrated a benefit for the similarity measure of Resnik [43], and a correlation between the physical intergene similarity and the GO semantic similarity was shown to exist for all three GO branches. Later, Wang and Azuaje [57] integrated similarity information from the GO into the clustering of gene expression data and provided a novel way to use the GO for knowledge-driven data mining systems. They showed that this method not only produces competitive clustering accuracy, but can also detect new biological dependencies for the given problem. In 2008, they discussed alternative techniques for measuring GO semantic similarity and relationships between these types of information, such as gene co-expression [3].
Kustra and Zagdański [29] provided a framework to integrate different biological data sources through the combination of corresponding dissimilarity measures and showed that combining gene expression data with protein-protein interaction knowledge may improve cluster analysis and yield more biologically meaningful clusters. In 2007, Wang et al. [58] presented a more effective method to determine the semantic similarity of two terms in the GO graph. The technique aggregates the semantic similarity of a term's ancestor terms and weights the different relations that terms can have with their ancestors. A term A in the Gene Ontology was formally defined as DAG_A = (A, T_A, E_A), where T_A is the set of GO terms in DAG_A, including term A and all its ancestors in the GO graph, and E_A is the set of semantic relations (edges) connecting the terms of T_A. With respect to this definition of a single term A, a semantic value can be derived by aggregating the relations to its ancestors, where a term closer to A contributes more to its semantics than a term further from it. For any term t in DAG_A, its contribution to the semantics of the GO term A is called the S-value S_A(t) and is defined as:

S_A(t) = 1,                                             if t = A
S_A(t) = max{ w_e * S_A(t') | t' ∈ children of t },     if t ≠ A        (2.1)

where w_e is a semantic weight factor for the relation between term t and its child term t', with 0 < w_e < 1. After obtaining the S-value for each term in DAG_A, the semantic value SV(A) of a GO term A is defined as:

SV(A) = Σ_{t ∈ T_A} S_A(t)                                              (2.2)

To derive the semantic similarity S_GO(A, B) between two GO terms A and B with DAG_A = (A, T_A, E_A) and DAG_B = (B, T_B, E_B), Equation 2.3 can be used.
S_GO(A, B) = ( Σ_{t ∈ T_A ∩ T_B} (S_A(t) + S_B(t)) ) / ( SV(A) + SV(B) )     (2.3)

This method determines the semantic similarity of two GO terms based on both their position in the GO graph and their relations to their ancestor terms. To obtain a numeric semantic similarity between two genes, where a gene can be annotated with several GO terms, Wang et al. [58] designed an algorithm to combine the similarities of the single terms. They showed that their results agree better with hand-crafted similarity assessments by human experts than the algorithms commonly used. Besides the method by Wang et al. described above, a few more methods have been used to determine the semantic similarity of two genes g1 and g2, such as Maximum [8, 63], Average [57, 56], Tao [52] or Schlicker [47, 46]. Xu et al. [64] evaluated these five methods and showed, by analyzing correlation coefficients and ROC curves, that the Maximum method, shown in Equation 2.4, outperforms the others.

sim(g1, g2) = max[ sim(c1, c2) ],                                       (2.4)

where c1 ∈ A(g1), c2 ∈ A(g2), and A(g1), A(g2) are the corresponding sets of GO terms annotating g1 and g2. This means that all pairwise semantic similarities between the terms annotating g1 and g2 are calculated and the maximum value is taken. They further observed that genes annotated with multiple GO terms may be more reliable for separating true positives from noise than genes annotated with only one GO term. Another approach to include the knowledge available in the GO in machine learning is to use it for feature selection. Qi and Tang [38] introduced a novel method to select genes not only by their individual discriminative power, but also by integrating the GO annotations. The algorithm corrects invalid information by ranking the genes based on their GO annotations and was shown to boost accuracy on all four tested public datasets.
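Equations 2.1 through 2.4 can be made concrete with a small sketch. This is an illustrative Python rendering on an invented toy ontology: a single weight `w` stands in for the per-relation weights w_e, the `parents` dictionary and all term names are hypothetical, and real GO processing would of course work on the full OBO graph.

```python
def dag_of(term, parents):
    """T_term: the term itself plus all of its ancestors in the graph."""
    seen, stack = {term}, [term]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def s_values(term, parents, w=0.8):
    """Equation 2.1: S_term(term) = 1 and, for every ancestor t,
    S_term(t) = max{ w * S_term(t') | t' a child of t inside DAG_term },
    computed by propagating values upwards from the term."""
    dag = dag_of(term, parents)
    s, frontier = {term: 1.0}, [term]
    while frontier:
        nxt = []
        for t in frontier:
            for p in parents.get(t, []):
                if p in dag and w * s[t] > s.get(p, 0.0):
                    s[p] = w * s[t]
                    nxt.append(p)
        frontier = nxt
    return s

def sv(s):
    """Equation 2.2: SV(A) is the sum of all S-values in DAG_A."""
    return sum(s.values())

def term_sim(a, b, parents, w=0.8):
    """Equation 2.3: S-values of shared terms over SV(A) + SV(B)."""
    sa, sb = s_values(a, parents, w), s_values(b, parents, w)
    shared = set(sa) & set(sb)
    return sum(sa[t] + sb[t] for t in shared) / (sv(sa) + sv(sb))

def gene_sim(terms1, terms2, parents, w=0.8):
    """Equation 2.4 (Maximum method): best pairwise term similarity."""
    return max(term_sim(c1, c2, parents, w)
               for c1 in terms1 for c2 in terms2)

# Hypothetical toy ontology: root <- mid <- {t1, t2}; t3 sits under root.
parents = {"t1": ["mid"], "t2": ["mid"], "mid": ["root"], "t3": ["root"]}
print(round(term_sim("t1", "t2", parents), 3))  # -> 0.59
```

Sibling terms t1 and t2 share the ancestors mid and root, so their similarity (2 * (0.8 + 0.64)) / (2 * 2.44) ≈ 0.59 is higher than that of the more distant pair t1 and t3, which share only the root.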
Chen and Tang [12] further investigated this idea, presenting a novel approach to aggregate semantic similarities and integrating it into traditional redundancy evaluation for feature selection. This resulted in higher or comparable classification accuracies on public benchmark sets while using fewer features than common feature selection methods.

CHAPTER 3
Material

The whole framework used for conducting the experiments for this thesis was implemented in Java 1.6 using the Eclipse IDE for Java EE Developers on a Microsoft Windows XP Professional PC with Service Pack 3 installed. The computer has an Intel(R) Xeon(R) E5440 quad-core processor at 2.83 GHz and 3.0 GB RAM. This chapter describes the material and software used for the experiments and the implementation of the framework developed for the thesis.

3.1 Biological and Medical Aspects

3.1.1 Gene Expression

Gene expression analysis has become a standard procedure in biological and medical research [9, 18], since gene expression is directly associated with the biological behavior of the organism carrying the gene. A variation in the expression values of a single gene can cause a serious disease or a dysfunction of the whole organism. Gene expression is the process of transforming a gene into a functional gene product; it takes place in all known living organisms, eukaryotes and prokaryotes alike, as well as in viruses. In this process, the double-stranded DNA sequence is transcribed and translated into a protein or a functional RNA in a biological cell, see Figure 3.1. The regulation of this process defines the function and structure of cells by ensuring controlled protein expression. As the cell structure and the proteins define a biological system, changes in gene expression directly influence the whole organism.

Figure 3.1: A gene is expressed into a gene product like RNA or proteins. For this, the double-stranded DNA of the gene is transformed into a single-stranded RNA by transcription. To build a protein, the RNA is then translated into a protein sequence by protein biosynthesis.

Cystic fibrosis (also called mucoviscidosis), for example, is caused by a mutation in the single gene cystic fibrosis transmembrane conductance regulator (CFTR), where only three missing nucleotides affect the entire body [23]. The product of this mutated gene is an ion channel protein responsible for the chloride exchange of cells. In the mutated version, the channel is not able to pass chloride, which results in symptoms like salty skin and mucus-clogged airways, and usually in a short life expectancy. The expression of a gene in a cell can be measured by quantifying the amount of its gene products, and this may often be informative: a mutation, deletion or multiplication of a gene can be detected, and a viral infection, a susceptibility to congenital disease or a resistance against bacteria can be deduced. For most diseases, not one single gene is responsible, but a co-operation of several genes, each of which can be altered in a different way. As these alterations can be detected by analyzing the modified expression of the genes, specific gene expression patterns can be found in diseased patients. If the gene expression pattern for a disease is known, it is possible to recognize whether a patient is affected by gene expression analysis. Further, given a set of gene expression data from diseased and healthy people, a computer can be used to find differences between the healthy and diseased data sets. Once these pattern differences are recognized, new patients can be classified as diseased or healthy by calculating similarities to the different patterns. Today, the most established technique for obtaining gene expression data from a biological system for computational analysis is the microarray [51].
3.1.2 Genetic Microarray Experiments

A cDNA microarray is an array of thousands of spots on a glass or silicon surface, where different DNA oligonucleotides, called features, are attached to each spot. Each spot contains picomoles (10⁻¹² moles) of a specific DNA oligonucleotide, cDNA or mRNA sequence, called probes. The core task of a microarray is to detect and quantify DNA strands complementary to those attached to the array. First, two experimental samples are obtained under different conditions. For example, for testing genes associated with cancer, one sample may be taken from a healthy subject and the other one from a diseased patient. After extracting the mRNA from the sample cells, cDNA is built; the healthy sample is labeled with fluorescent green and the diseased one with fluorescent red markers. After labeling, the two samples are merged into one sample for further analysis. The mixed sample is incubated with the DNA chip. The labeled cDNA binds to spots containing a complementary sequence by forming hydrogen bonds between the complementary nucleotide base pairs. Genes of the sample that have a complementary sequence on a spot on the array adhere to that spot, while all other genes are washed off. The cDNA-enriched microarray is placed in a dark box and scanned by a green laser, where the glowing of the fluorescent green markers is detected by a camera and the image is stored on a computer. The same procedure is repeated with a red laser for detecting the red markers. The results of the screening are two images, one with the green intensities of the spots and another with the red intensities, where the more sample cDNA is bound to a spot, the higher the intensity of the colour. These two images are merged computationally, and the result is an image representing the expression of the genes in the sample.
A green spot means that the gene binding to this spot is expressed only in the healthy subject, a red spot only in the diseased patient; a yellow spot is expressed in both and a black spot in neither of the two samples. An example of a microarray expression image is shown in Figure 3.2. For obtaining expression intensities for a single probe, the probe is applied to the

Figure 3.2: a) An example of a microarray visualization with approximately 37,500 probes. b) An enlarged view of the blue rectangle of the microarray shown in a).

microarray and the procedure described above is carried out with one laser only. The result of a single-probe experiment is therefore only one image, containing spots with different colour saturations, where the expression intensities can be derived from the brightness of the spots. In the experiments conducted for this thesis, only datasets of single-probe intensities have been used. The raw data of a microarray experiment has to be normalized to correct systematic errors caused by a wrong calibration, different chips and scanning parameters or variable fluorescent marker characteristics. For this purpose, each microarray contains several control spots. Another problem to be addressed before data analysis is the removal of background noise, which is out of the scope of this thesis.

3.2 Machine Learning

Machine learning is a scientific discipline of finding previously unknown information and potentially interesting relations and patterns in a dataset or database, and is highly related to the fields of statistics, data mining and artificial intelligence [36]. The human ability to learn from known data and to draw an intelligent decision from it is a great gift of nature, and since computers have been introduced, people have tried to transfer this ability to machines. In 1965, Herbert Simon prophesied: “Machines will be capable, within twenty years, of doing any work that a man can do.”
This has been proven to be too optimistic (the quote is from Herbert Simon, American social scientist, 1916-2001). Even now, 45 years later, computers are still far from being comparable to human intelligence. Today, machines are able to learn from data and make generalizations from it using different learning algorithms, which can be split into supervised learning, unsupervised learning and combinations of both, called semi-supervised learning. Especially supervised learning has been successfully applied in several application domains, including stock price analysis, speech recognition, text analysis, data mining, computer vision, bioinformatics, computational neuroscience and many more. But as machine learning algorithms use heuristics and probability calculations, an optimal solution cannot be guaranteed and the deduced results may be wrong in some cases. Therefore, the main challenge in machine learning is to classify or cluster new cases as accurately as possible in a preferably short time. The differences between classification (supervised learning) and clustering (unsupervised learning), as well as the main learning algorithms used for this thesis, are described in the following sections.

3.2.1 Supervised Learning

Predicting a class label (classification) or a continuous value (regression) for an unseen instance, based on a general model deduced from previously learned training instances, is called supervised learning. The training data contains several different instances {(x_i, y_i)}_{i=1}^N, represented by a list of objects x_i ∈ X and an associated label y_i ∈ Y. A group of instances sharing the same label is called a class. The algorithm learns from the training data by finding patterns that are specific for cases sharing the same label and builds a global generalized function f : X → Y from them.
The key challenge for supervised learning algorithms is to predict a class label or a continuous value for an unseen instance x ∈ X after learning a model from a usually small set of training instances. The classification accuracy of an algorithm highly depends on the composition of the training set, which should contain as many different instances as possible. For best performance, the instances should reflect all possible real-world characteristics. The number of features an instance is represented by should be large enough to define the classes precisely, but should not be too large, to avoid overfitting and save computational time. As there is no classification algorithm that works best for every dataset (no free lunch theorem [62]), different algorithms should be validated to reach the best accuracy. The most widely used classifiers include Artificial Neural Networks (ANN), Support Vector Machines (SVM), k-Nearest Neighbour and decision trees (like C4.5 [41]). The ones used for this thesis are described in detail in Sections 3.2.3 - 3.2.6.

3.2.2 Unsupervised Learning

The key difference of unsupervised learning from supervised learning is the absence of class labels. The algorithm is provided with a set of input objects {x_i}_{i=1}^N for which no class labels y_i are given. The key task in this area is not to classify unseen data, but to analyze the given dataset by clustering, feature extraction, density estimation, visualization, anomaly detection or information retrieval. As this thesis concentrates on supervised learning only, unsupervised learning is not described in any more detail in this report.

3.2.3 Random Forest

Breiman's Random Forest [10] is an ensemble classifier composed of a set of decision trees [40, 41]. A decision tree is represented as a set of nodes and branches. Each node of the tree corresponds to a feature (attribute) and each child node is labeled with a particular range of values for that feature.
Together, a node and its children define a decision path in the tree that an example follows when being classified by the tree. A decision tree is induced in a top-down fashion by recursively splitting the feature space until each of the leaf nodes is nearly of one class. At each step, all features of a node are evaluated for their ability to improve the “purity” of the class distribution in the partitions they produce. An example of a decision tree can be seen in Figure 3.3. A Random Forest grows a set of decision trees. A new instance is put down all the trees in the forest and each tree returns a classification result. The final Random Forest result is the mode of the votes of the single trees. This means that the instance is classified into the class which has been selected most often by the single trees. As this only gives an improvement over a single decision tree if the trees are different, feature subsets for each node of the trees are selected randomly. Let the number of cases in the training set be N and the number of features be M. N cases are sampled randomly with replacement from the original dataset, and a tree is grown where at each node a number m ≪ M of features is chosen.

Figure 3.3: Example of a decision tree where nodes are represented as brown circles and a blue rectangle represents a leaf. For classifying a new instance, two decisions are needed at most. Starting at A, the value of the feature can either be <= 1 or > 1. In the first case, the new instance reaches a leaf and is classified as diseased. In the second case, one more decision is needed, at node B, where the feature value can either be true or false.

Further, the best split out of these m features is determined and is used to split the node. The value for m is constant over all trees in the forest and the trees are grown to their full extent without pruning, as long as no minimum leaf size is given.
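The aggregation of the single tree votes into the forest's final prediction (the mode) can be sketched in a few lines of Java. This is a minimal illustration only; the class and method names are hypothetical and not taken from the thesis framework or Weka.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Random Forest vote aggregation: the forest's prediction is the
// class label chosen by the largest number of individual trees (the mode).
public class ForestVote {
    public static String majorityVote(String[] treeVotes) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String vote : treeVotes) {
            Integer c = counts.get(vote);
            counts.put(vote, c == null ? 1 : c + 1);
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

For three trees voting "diseased", "healthy", "diseased", the forest would return "diseased".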
The forest's error rate depends mostly on two factors: the correlation between any two trees in the forest and the strength of each individual tree. A high correlation increases the error rate of the forest, while strong individual trees decrease it [10]. Therefore, m has to be chosen carefully to produce strong enough individual trees while also keeping the correlation low. The most essential benefits of Random Forests are that they are fast to train, memory efficient and able to achieve competitive accuracy on big datasets. Yet another advantage is that Random Forests normally do not overfit, no matter how many trees are used [10].

3.2.4 k-Nearest Neighbour Classification

The k-Nearest Neighbour algorithm is an instance-based (or lazy) learning algorithm, storing a series of training examples in memory and using a distance function to compare new instances to the stored ones. The prediction of the class label of the new instance is based only on the k closest example cases with respect to the distance function. Often, the Euclidean distance (Equation 3.1) is used, where p and q are n-dimensional data vectors.

d_euclidean(p, q) = sqrt( Σ_{i=1}^{n} (p_i - q_i)^2 )    (3.1)

As shown in Figure 3.4, the test instance is classified into the most frequent class value of its k nearest neighbors. For k = 3 (solid circle), the 3 nearest neighbors are analyzed,

Figure 3.4: A 2-dimensional example of k-NN classification. The test case (red circle) should be classified either to the green squares or to the blue stars. The solid border includes the k = 3 nearest neighbors of the test case. In this case, the test instance is classified as a green square, as there are two squares and only one blue star. The dashed border includes the k = 6 nearest neighbors. In this case, there are more blue stars (4) than green squares (2), which means that the test case is classified as a blue star.

where 2 of them are labeled as green squares and one is labeled as a blue star.
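The Euclidean distance of Equation 3.1, together with the majority vote over the k nearest neighbours, can be sketched in Java as follows. This is a minimal illustration with binary 0/1 labels; the class and method names are hypothetical, not taken from the thesis framework.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of plain k-Nearest Neighbour with the Euclidean distance.
public class KnnSketch {
    // Equation 3.1: the Euclidean distance between two n-dimensional vectors.
    public static double euclidean(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = p[i] - q[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Classify a test point by the majority label among its k nearest
    // training instances (labels are 0 or 1 for brevity).
    public static int classify(final double[][] train, int[] labels,
                               final double[] test, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training indices by distance to the test point.
        Arrays.sort(idx, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Double.compare(euclidean(train[a], test),
                                      euclidean(train[b], test));
            }
        });
        int ones = 0;
        for (int i = 0; i < k; i++) ones += labels[idx[i]];
        return (2 * ones > k) ? 1 : 0; // majority vote
    }
}
```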
In this case, the test instance is classified as the most frequently occurring class, green squares. A different case is shown within the dashed circle, where the k = 6 nearest neighbors are considered. As there are four blue stars and only two green squares, the test instance is classified as a blue star. The same procedure can be used for regression, where the result is normally the average over the values of the k nearest neighbors. A certain disadvantage of the algorithm is that classes occurring more frequently in the training set tend, by chance, to come up more often in the neighbourhood as well, due to their larger number. One solution to this problem is to weight the k nearest neighbours by their distance to the test case. Then, a neighbor closer to the test case has a bigger influence on the class prediction than a more distant one. The accuracy of k-Nearest Neighbour classification often highly depends on the value of k, and the optimal value of k differs for each dataset. Generally, a high value of k reduces noise but also makes the class borders fuzzy. For a good estimation of parameters, several heuristics can be used, such as cross validation, which is described in Section 3.2.8.

3.2.5 AdaBoost

AdaBoost [19], short for adaptive boosting, is another successful ensemble algorithm for boosting the performance of a machine learning algorithm, called the base learner. In the experiments done for this thesis, C4.5 [41] was used as the base learner with AdaBoost. The algorithm is called adaptive boosting because it runs several iterations t = 1 ... T with the base classifier, and in each iteration, instances that were classified incorrectly in the previous steps are emphasized more. Therefore, a distribution of weights D_t is updated in each round by increasing the weights of incorrectly classified instances and decreasing the weights of correctly classified instances.
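A single round of this reweighting can be sketched as follows. The sketch uses the standard AdaBoost update with alpha = 0.5 ln((1 - error)/error); the class and method names are illustrative, not from the thesis framework.

```java
// Sketch of one AdaBoost weight update: misclassified instances are
// up-weighted by exp(alpha), correctly classified ones down-weighted by
// exp(-alpha), and the result is renormalized to a probability distribution.
public class BoostUpdate {
    public static double[] update(double[] weights, boolean[] correct, double error) {
        double alpha = 0.5 * Math.log((1.0 - error) / error);
        double[] next = new double[weights.length];
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            next[i] = weights[i] * Math.exp(correct[i] ? -alpha : alpha);
            z += next[i];
        }
        for (int i = 0; i < next.length; i++) {
            next[i] /= z; // normalize so the weights sum to 1
        }
        return next;
    }
}
```

With four equally weighted instances and one misclassification (error 0.25), the misclassified instance receives half of the total weight after the update.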
The final classifier is a combination of the single classifiers added at each step. The emphasis on badly classified instances makes AdaBoost overly sensitive to noisy data and to outliers.

3.2.6 Learning Distance Functions

Distance functions in machine learning have been addressed in many research studies over the last three decades. Special emphasis was given to learning distance functions for classification tasks [24], motivated by several reasons. First, learning a distance function for classification helps to combine the power of strong learners (Random Forest, C4.5) with the transparency of instance-based classifiers (k-NN). Also, it was shown that choosing an optimal distance function makes classifier learning redundant [34]. Besides, learning a proper distance function is especially helpful for high-dimensional data with many weakly relevant, irrelevant or correlated features, where most traditional methods would fail. As gene expression data contains exactly this kind of features, distance learning is particularly suitable for classifying genetic data. Next, learning distance functions breaks the learning process into two separate steps, distance learning followed by classification. Each step requires search in a less complex functional space compared to direct learning. The separation makes the model more flexible and modular and enables component reuse of the two parts. In this thesis, two different approaches to distance learning are used: learning from equivalence constraints and the intrinsic Random Forest similarity.

Learning From Equivalence Constraints

Today, the most commonly used representation for distance learning is the one based on equivalence constraints [4, 24]. Usually, equivalence constraints are represented as triplets (x_1, x_2, y), where x_1 and x_2 are data points in the original space and y is a label indicating whether x_1 and x_2 belong to the same class or not. This is also called learning in the product space.
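Equivalence constraints of this form can be generated from any labeled training set by pairing instances. The sketch below labels a pair as "same class" or "different class" and, as one common variant, maps it to a component-wise absolute difference vector; the names are hypothetical, and the thesis's own difference-space construction is detailed in Section 4.2.

```java
// Sketch of building equivalence-constraint instances from a labeled pair:
// the pair (x1, x2) receives label 1 if both points share a class, else 0,
// and (in one common difference-space variant) is represented by the
// component-wise absolute difference of the two vectors.
public class DifferenceSpace {
    public static double[] difference(double[] x1, double[] x2) {
        double[] d = new double[x1.length];
        for (int i = 0; i < x1.length; i++) {
            d[i] = Math.abs(x1[i] - x2[i]);
        }
        return d;
    }

    public static int constraintLabel(int y1, int y2) {
        return (y1 == y2) ? 1 : 0; // 1 = "similar" pair, 0 = "dissimilar" pair
    }
}
```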
The product space is out of the scope of this thesis and will not be explained in more detail. Another possible approach is to learn in the space of vector differences, called the difference space, which is often used with homogeneous high-dimensional data such as pixel intensities in imaging. A more detailed description of the difference space is presented in Section 4.2. Both methods were shown to demonstrate promising empirical results in different contexts. The availability of equivalence constraints in most learning contexts and the fact that they are a natural input for optimal distance function learning [4] are two crucial reasons that motivate their use. It was shown that the optimal distance function for classification is of the form p(y_i ≠ y_j | x_i, x_j). For each class, the optimal distance function under the i.i.d. assumption can be expressed in terms of generative models p(x | y) [34], as shown in Equation 3.2.

p(y_i ≠ y_j | x_i, x_j) = Σ_y p(y | x_i) (1 - p(y | x_j))    (3.2)

The function was shown to approach the Bayes-optimal accuracy [34] and was analytically proven to be at least as good as any distance metric. Yet another approach to representing equivalence constraints is relative comparisons, which are usually used in information retrieval contexts. This representation uses triplets of the form “x is more similar to y than to z” for learning.

The Intrinsic Random Forest Distance

In contrast to learning from equivalence constraints, the intrinsic Random Forest distance acts as a black box, and few applications of it have been reported so far. The basic concept of this algorithm is to learn a Random Forest for the given classification problem and to use the proportion of trees in which two instances appear in the same leaf as a measure of similarity between them [10]. Equation 3.3 shows the calculation of the similarity between two instances x_1 and x_2 for a given Random Forest f.
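Computationally, this similarity simply counts, over the trees of the forest, how often the two instances end up in the same leaf. A minimal sketch, assuming the leaf indices have already been extracted from a trained forest (class and method names are hypothetical):

```java
// Sketch of the intrinsic Random Forest similarity: given the leaf indices
// that two instances reach in each of the K trees, the similarity is the
// fraction of trees in which both instances land in the same leaf.
public class ForestSimilarity {
    public static double similarity(int[] z1, int[] z2) {
        int K = z1.length;
        int same = 0;
        for (int i = 0; i < K; i++) {
            if (z1[i] == z2[i]) same++; // indicator I(z1^i = z2^i)
        }
        return (double) same / K;
    }
}
```

Two instances that share leaves in 2 of 4 trees would thus receive a similarity of 0.5.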
The instances are propagated down all K trees of the forest f, and their leaf positions z in each of the trees (z_1 = (z_1^1, ..., z_1^K) for x_1, similarly z_2 for x_2) are recorded. The indicator function is denoted by I.

s(x_1, x_2) = (1/K) Σ_{i=1}^{K} I(z_1^i = z_2^i)    (3.3)

This similarity can be used for classification tasks and also for related problems like clustering or nearest neighbor data imputation. In our experiments, this similarity was used as a replacement for the canonical Euclidean distance in k-Nearest Neighbour classification.

3.2.7 Feature Selection

Before training a machine learning model, most data has to be pre-processed in order to reduce the usually large number of features. Especially when classifying gene expression data, a careful selection of discriminative genes is important to speed up the learning process, to remove noisy, irrelevant and redundant genes and to fight the curse of dimensionality [11]. A good feature reduction can improve the classification accuracy significantly. Generally, feature selection algorithms can be divided into two basic groups: feature ranking and subset selection. While feature ranking scores the features by a metric [27] and removes the features that do not reach a defined threshold value, subset selection methods evaluate different feature subsets to find the optimal one. Theoretically, an optimal feature selection for supervised learning requires an evaluation of all possible feature subsets. In practice, it is often impossible to find an optimal solution due to the very large number of available features. Hence, most methods search for a set of features satisfying certain criteria. Evaluation methods can follow two different approaches, so-called filters and wrappers, where both use a search algorithm to search through the possible feature subsets. Wrappers evaluate a candidate subset by running a classifier on it.
This makes wrapper methods computationally expensive and increases the risk of overfitting the training data. Contrary to the wrapper approaches, a filter algorithm evaluates a subset with a statistical measure instead of explicitly training a model. This usually yields a faster but often less accurate algorithm. Both of these methods use a search algorithm, which can be either optimal or heuristic. Typical search techniques for optimal solutions are depth-first or breadth-first search. For heuristics, sequential forward or backward selection or best-first searches are often used. In this thesis, several different feature selection methods are used, including Information Gain [44], Gain Ratio [44], Correlation-based Feature Subset Selection (CFS) [22] and ReliefF [26].

3.2.8 Cross Validation

Cross validation is a technique often used in machine learning to evaluate the generalization accuracy of a learning algorithm with respect to a specific dataset. It is mostly used in classification, in order to predict how accurately a classifier will perform in practice or to compare two different classifiers on a single dataset. When testing a classification algorithm, usually a set of labeled data is given. To train and evaluate an algorithm, the data is split randomly into a set of training instances and a set of instances the algorithm is tested on. This division can clearly bias the classification result. For better understanding, assume that all instances of a certain class are put into the test set. The classifier would have no chance to train the model for classifying this class, because there are no training instances for it. For this reason, a cross validation is performed, which partitions the data randomly several times and applies the classification algorithm to each partitioning. The classification accuracies are recorded over the single runs and an average value is reported at the end.
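The repeated random partitioning step can be sketched as follows: instances are shuffled and assigned round-robin to k folds, so that each instance lies in exactly one test fold and in the training set of the other k - 1 runs. This is an illustrative fold assignment, not the Weka implementation.

```java
import java.util.Random;

// Sketch of k-fold cross validation fold assignment.
public class FoldAssignment {
    public static int[] assignFolds(int numInstances, int k, long seed) {
        int[] perm = new int[numInstances];
        for (int i = 0; i < numInstances; i++) perm[i] = i;
        Random rnd = new Random(seed);
        // Fisher-Yates shuffle for a random order of the instances.
        for (int i = numInstances - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
        }
        int[] fold = new int[numInstances];
        for (int i = 0; i < numInstances; i++) {
            fold[perm[i]] = i % k; // round-robin over the shuffled order
        }
        return fold;
    }
}
```

For 10 instances and k = 3 this always yields folds of sizes 4, 3 and 3, independently of the seed.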
There are three basic kinds of cross validation: random sub-sampling validation, k-fold cross validation and leave-one-out cross validation. In random sub-sampling, the algorithm randomly splits the dataset into training and test sets, so that the validation subsets can overlap. In contrast, k-fold cross validation breaks the data into k subsets of the same size. The cross validation algorithm runs k times, where in each run another subset is used exactly once as the test set and the other k - 1 subsets provide the training instances. Leave-one-out cross validation is a special case of k-fold cross validation, where k equals the number of instances, and is usually used for datasets with a small number of instances. Because the process has to be repeated as many times as there are instances, leave-one-out cross validation can become computationally expensive. In this thesis, k-fold cross validation and leave-one-out cross validation are used for the evaluation of the learning algorithms.

3.3 Weka - A Machine Learning Framework in Java

As machine learning attracts much attention in computer science, several frameworks have been developed for Java. Probably the most established one is the open source library Weka [17] (version 3.6.1 was used in this thesis). Weka has been developed at the University of Waikato in New Zealand and provides a collection of algorithms for data classification, preprocessing, regression, clustering, association rules and data visualization. These techniques can be accessed from the provided graphical user interface, from the command line or directly from Java code via the transparent API provided. To handle datasets, Weka defines its own data format called ARFF (Attribute-Relation File Format), where each ARFF file describes a single dataset.

3.3.1 The ARFF Format

An ARFF file is an ASCII text file describing a list of instances sharing a set of attributes and is separated into two basic parts, the header and the data section.
The header contains the relation name of the dataset as well as the attributes and their types. Comments are indicated by a % sign. An example header for an artificial animal dataset is given in Listing 3.1.

% ---------------------------------------
% This is a dataset of different animals
% ---------------------------------------
@Relation "some animals"

@Attribute name string
@Attribute number_of_legs numeric
@Attribute has_fur {yes, no}
@Attribute height numeric
@Attribute class {dangerous, friendly}

Listing 3.1: Example of an ARFF header section

The header is followed by the data section. An example for the same dataset is shown in Listing 3.2.

@Data
Dog,4,yes,76,?
Snake,0,no,8,dangerous
Cat,4,yes,?,friendly
Bird,2,no,21,friendly

Listing 3.2: Example of an ARFF data section

The @Relation declaration

The first line in the ARFF file contains the declaration of the relation name for the dataset. If the name includes spaces, the string has to be quoted.

The @Attribute declaration

The attribute section contains a list of attributes, where each line defines one attribute and starts with @Attribute, followed by a name and a type separated by a space character. If the attribute name contains spaces, the name has to be quoted. An attribute can be of one of the following types:

• numeric (can also be specified as integer or real)
• nominal specification (a list of the possible categorical values, separated by commas, within curly braces)
• date
• string

The @Data declaration

The data section starts with the @Data command, followed by the instances, where each line defines exactly one instance. Each line contains values for all the attributes defined in the header section. The attribute values are separated by commas, and missing values are represented by a single question mark.
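The comma-separated layout of a data row, with "?" marking a missing value, can be parsed with simple string handling. This is a minimal sketch that ignores quoting and other ARFF subtleties, which Weka's own reader handles; the class name is hypothetical.

```java
// Sketch of parsing one ARFF @Data row: values are comma-separated and a
// lone question mark denotes a missing value (represented here as null).
public class ArffRow {
    public static String[] parseDataLine(String line) {
        String[] raw = line.split(",");
        String[] values = new String[raw.length];
        for (int i = 0; i < raw.length; i++) {
            String v = raw[i].trim();
            values[i] = v.equals("?") ? null : v;
        }
        return values;
    }
}
```

For the line "Dog,4,yes,76,?" from Listing 3.2, the last value is parsed as missing.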
3.3.2 The Structure of the Weka Framework

Weka provides a complete Java project designed in a modular and object-oriented way, making it easy to add new classifiers, data, filters, clustering algorithms and more. These classes can easily be accessed from any Java code by simply importing them. The workbench is separated into several top-level packages, each containing classes for a specific machine learning task together with an abstract Java class for this task. For example, the package “classifiers” contains sub-packages of different classifiers, each of which extends the common base class called “Classifier”, providing the interface for all classifier implementations. The sub-packages are further grouped by functionality or purpose. Filters, for example, are separated into unsupervised and supervised, and further by whether they operate on attributes or instances. The system kernel methods are collected in the “core” package, providing classes and data structures that represent instances and attributes, read and save datasets, and provide common utility methods as well as interfaces. The top-level package “gui” contains classes for the graphical user interface, which follows the Java Bean convention. Weka supplies a Web API describing all classes in the format of the Java API (it can be found at http://weka.sourceforge.net/doc).

3.4 Distance Learning Framework

Weka supplies a bundle of machine learning algorithms, but does not offer algorithms for learning distance functions with equivalence constraints or the intrinsic Random Forest similarity. Tsymbal et al. [55] implemented a Java framework for the empirical analysis of these two techniques. The source code of this project is based on Weka and is the basis for the framework implemented for the experiments presented in this thesis. The basic concept behind the framework is to use an ARFF file to configure and evaluate different classification techniques.
The attributes of the ARFF configuration file are used as system parameters and each instance defines a single experiment. A demonstrative example is shown in Listing 3.3. Line 3, for example, defines the name of the dataset to be used for the experiment. The general workflow is demonstrated in Figure 3.5.

1  @Relation "configuration"
2
3  @Attribute dataset string
4  @Attribute use_random_forest {yes, no}
5  @Attribute number_of_trees numeric
6  @Attribute ...
7
8  @Data
9  breast_cancer.arff, yes, 50, ...
10 leukemia.arff, yes, 100, ...

Listing 3.3: Part of a configuration file for the “distance learning” framework

First, the configuration file is read line by line and a Weka “Instances” object is created containing the information from the configuration ARFF file. For each instance, meaning for each line in the ARFF data section (lines 9, 10, ... in Listing 3.3), an independent experiment is started. The defined classification and system parameters are set and the dataset is loaded into memory. A single experiment includes 1 ... n repetitions of k-fold or leave-one-out cross validation. For each cross validation iteration, the dataset is split randomly into a training and a test set. Next, a data imputation algorithm is run to fill in missing data, if there is any. For training a model for learning from equivalence constraints, the training data has to be transformed into the product or the difference space. A transformation for learning from comparative constraints is also possible. Then, the specified learning algorithm is used to train a model on the training set. The trained models are used to classify the instances of the test set and the predicted class values for each instance are compared with their ground truth values. From these results, a classification accuracy is calculated for each cross validation fold and is averaged over the folds.
To get the final result for each model, the cross validation results are further averaged over all repetitions and saved into a result text file. Moreover, the overall classification accuracies are printed to the console. The workbench includes basic algorithms to compare distance learners with canonical machine learning algorithms and provides a basis for further experiments in distance function learning. Probably the greatest disadvantage of this framework is the procedural style of the code, which makes it hard to understand and extend.

Figure 3.5: The general workflow of the distance learning framework. An ARFF file, containing several system configuration parameters as attributes and instances representing single experiments, is read first. For each line, corresponding to a single run, a new experiment is started with the classification configuration specified in the ARFF file. For each experiment, several cross validation iterations can be performed. For each cross validation iteration, the data is split randomly into a training and a testing set. Next, an imputation method is run to predict missing values. Before training the different classification models specified in the configuration ARFF file, the sets are transformed into the product, difference and comparative spaces for learning distance functions. Then, each algorithm classifies the test set with unseen class labels. At the end of all iterations, the classification accuracies are averaged over all runs and reported to a result text file and to the console.
3.5 Gene Ontology

3.5.1 The Key Components of the Ontology

The Gene Ontology (GO) Project [2] offers a controlled vocabulary for gene products, containing their annotations and characteristics. The main effort was to build a controlled structure describing the interactions and relations between gene products. Therefore, three species-independent ontologies (structures) have been developed, describing each gene product in terms of its associated biological process, cellular component and molecular function, where each of the three parts is represented by a separate graph.

• Cellular Component A cellular component is a component of a cell, describing subcellular structures and macromolecular complexes. Generally, a gene product is a subcomponent of, or is located in, a particular cellular component.

• Molecular Function Molecular functions are the abilities a gene product can have, including transporting, binding, holding or changing different components.

• Biological Process A biological process is a recognized series of events or molecular functions with a defined beginning and end.

A gene product can be associated with or located in one or more cellular components, be active in one or more biological processes and can perform several molecular functions. The GO is a set of words and phrases used for indexing and retrieving information that also provides the relationships between terms, making it a structured vocabulary. The structure of the GO vocabulary is represented by a directed acyclic graph, where each node represents a GO term and each arc describes a directed relationship between two terms. Each term can have more than one parent. In each of the three graphs, a term closer to the root is more general than a term deeper in the graph, and a child term is more specialized than its parents. An example of a GO graph structure is given in Figure 3.6. The coloured arrows indicate relations between two GO terms, where the letter in the box shows the relation type.
The different relation types will be described later on.

Figure 3.6: An example of a set of terms under the biological process node. A screenshot from the ontology editing tool OBO-Edit (http://www.oboedit.org). The nodes represent GO terms and the labeled arcs indicate relations between two GO terms. Each term has at least one path to one of the three root nodes.

A term has the following essential elements:

• Unique identifier and term name
Every term has a unique seven-digit zero-padded number prefixed with GO:, e.g. GO:0005125, where the number does not code for any structural information. Further, every term has a name. For GO:0005125, for example, the name is cytokine activity.

• Namespace
Denotes which of the three sub-ontologies the term belongs to.

• Definition
A description of the concept represented by this term, extended with some additional information, like the knowledge source.

• Relationships to other terms
A list of related terms, each specified with the type of its relation, which can be is a, part of or regulates.

The GO is structured as a graph with terms as nodes and arcs connecting these nodes. Not only the nodes are defined, but also the arcs are categorized. They represent a directed relation a term has with respect to another term. There are three different relation types:

• The "is a" relation
The is a relation denotes the fact that a term A is a subtype of another term B. Example: "mitochondrion" is a "intracellular organelle".

• The "part of" relation
The part of relation indicates that a term A is a necessary part of another term B, where the presence of B implies the presence of A but not vice versa. Example: "cytoplasm" part of "cell".

• The "regulates" relation
The regulates relation has two subtypes, negatively regulates and positively regulates. A term A is said to regulate another term B if, whenever A occurs, it regulates B; B may be regulated not only by A but also by other terms.
Example: "regulation of mitotic spindle organization" regulates "mitotic spindle organization".

3.5.2 The GO File Format: OBO

The Gene Ontology can be accessed in two different ways, web-based or locally. The Gene Ontology Consortium and the open developer community provide a large variety of tools to access, edit and work with the ontology. While the web-based tools access the ontology via a web interface, the local tools need a local copy of the ontology. For this purpose a specific file format was introduced, OBO. The .obo file contains a header section providing information about the OBO format version and some additional facts, like the date of creation. Next, the terms are listed, where the beginning of a new term is indicated by [Term]. An example of a term is given in Listing 3.4.

1 [Term]
2 id: GO:0015239
3 name: multidrug transporter activity
4 namespace: molecular_function
5 def: "Enables the directed movement of drugs across a membrane into, out of, within or between cells."
6 subset: gosubset_prok
7 is_a: GO:0015238 ! drug transporter activity
8 is_a: GO:0022857 ! transmembrane transporter activity

Listing 3.4: Term definition of multidrug transporter activity in OBO format

First, the unique GO identifier is defined, followed by the term's name (lines 2 and 3). The namespace declaration in line 4 indicates to which of the three general categories the term belongs; here it is molecular function. Next, additional information is provided. At the end of each term, relations to other terms are listed (lines 7 and 8). In the example of Listing 3.4, the term is connected to two other terms with is a relations. The actual ontology .obo file can be downloaded from the official GO website (http://www.geneontology.org/GO.downloads.ontology.shtml) and currently includes 29,429 terms (18,118 biological processes, 2,642 cellular components and 8,669 molecular functions, as counted on January 28, 2010).
These terms are associated with 353,869 gene products (252,697 for biological processes, 250,657 for cellular components and 267,874 for molecular functions, as counted on January 28, 2010). Note that a gene product can be associated with more than one sub-ontology.

3.5.3 The GO Annotation Database

The annotation database contains the assignments of gene products to GO terms and is available for several species including Homo sapiens, mouse and yeast. The datasets used for experiments in this thesis include only Homo sapiens genes, and therefore only the corresponding annotation database has been used. The database can be downloaded from the official GO website (http://www.geneontology.org/GO.current.annotations.shtml) in tab-delimited plain text format and contains not only the assignments, but also an evidence code for each of them. The evidence code is a measure of how reliable an assignment is and where it has been extracted from.

3.6 Gene Ontology API for Java

In order to use the OBO file and the annotation database from Java code for gene ontology vocabulary manipulation, an API called GO4J (version 1.1) [21] was used. The API contains four modules, where each module handles a different task. First, GO4J provides a GO definition parser that supports common GO formats. Next, classes are provided to construct a directed graph containing all specified GO terms. Further, a graphical user interface is provided to visualize GO pathways and graph models. The most important functionality for this thesis, however, is the API for GO semantic similarity calculation based on different algorithms (see Section 2.3). While GO4J includes the methods introduced by Resnik [43] and Lin [15], the algorithm of Wang [57] had to be implemented to extend the library.

3.7 NCBI EUtils

The National Center for Biotechnology Information (NCBI) is one of the main venues for information retrieval in genetics.
It offers a wide variety of databases and tools for public and free use [45]. One of these tools is called Entrez Programming Utilities, or EUtils for short. It provides access to the NCBI databases outside of the regular web query interface. The tool is located on one of the NCBI servers and can be accessed at http://eutils.ncbi.nlm.nih.gov/. EUtils parameters can be specified by appending "GET" parameters to the URL. EUtils offers eight different tools for accessing the databases, each of which corresponds to a specific task. For this thesis, the EUtils subprogram ESearch has been used to search for the official gene names corresponding to known NCBI accession numbers. Different options can be set for ESearch; some of them are listed below.

• Database (db)
The database to be searched in.

• Term (term)
The term to be searched for. By default, the search term expects a search string. To search for other types, like NCBI accession numbers, the type can be specified by appending a type declaration in square brackets.

• Retrieval Mode (retmode)
Specifies the format of the search result. By default, the result shows up as an HTML page. The retrieval mode can also be set to XML.

This is an example of an EUtils URL call used for experiments done for this thesis:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term=M26697[accn]&retmode=xml

The result of this call is an XML file including information about the NCBI entry for the accession number M26697 in the "Genes" database. Note that [accn] indicates that the search string is an accession number.

3.8 Benchmark Datasets

For evaluating different machine learning approaches, thirteen independent datasets have been used, see Table 3.1. The datasets are associated with different clinical problems and have been obtained from various sources.
The Colon, Embrional Tumours, Leukemia and Lymphoma datasets have been obtained from the Bioinformatics Research Group, Spain (http://www.upo.es/eps/aguilar/datasets.html); Heart, Liver, Thyroid and Arcene from the UCI repository [7]; Lupus from the Human-Computer Interaction Lab at the University of Maryland, USA (http://hcil.cs.umd.edu/localphp/hcil/vast/archive/task.php?ts_id=145); Breast Cancer from the University of Minnesota (http://www-users.itlabs.umn.edu/classes/Spring-2009/csci5461/index.php?page=homework); and Lung Cancer from the Division of Thoracic Surgery at Brigham and Women's Hospital, USA (http://www.chestsurg.org/publications/expression-data-181.txt). The Mesh dataset was generated from cardiac aortic valve images at Siemens [42]. The last dataset, HeC Brain Tumours, was obtained from hospitals participating in the Health-e-Child project. The datasets vary in the number of attributes and the number of instances, see columns two and three of Table 3.1. All of these datasets represent binary classification problems, so that there are only two possible classes an instance can belong to.

Name               No. of Attributes  No. of Instances  Data Type          Gene Names
Breast Cancer      24482              97                Genetic            yes
Colon              2001               62                Genetic            yes
Embr. Tumour       7130               60                Genetic            yes
HeC Brain Tumours  5580               67                Genetic            yes
Leukemia           41                 72                Genetic            yes
Lung Cancer        12534              181               Genetic            no
Lupus              172                90                Genetic            yes
Lymphoma           4027               45                Genetic            no
Arcene             10001              200               Mass Spectrometry  no
Heart              14                 270               Medical            no
Liver              7                  345               Medical            no
Mesh               2941               63                Image Data         no
Thyroid            6                  215               Unknown            no

Table 3.1: A list of the benchmark datasets used for experiments conducted for this thesis

The data type indicates the source the data was extracted from; the first eight datasets are microarray gene expression datasets, while the last five are non-genetic. For experiments with the GO semantic similarity, it is necessary to know the exact gene names for the given features.
This is only possible if a certain gene identifier is given, like a gene name or an accession number. The last column of Table 3.1 indicates whether it is possible to match the included features to genes, meaning that these datasets can be used for experiments with the GO similarity. For the other datasets, the attributes may only have an undefined number as feature name, and therefore it is not possible to derive any semantic information about these features. In order to run the experiments in a defined period of time and to remove redundant, noisy and irrelevant features, the dataset attributes have been preselected by a filter feature selection method called ReliefF [26]. While all datasets with more than 200 features have been reduced to 200 features, Lupus was reduced to 75 features and Leukemia was used with its original number of features.

3.9 RSCTC'2010 Discovery Challenge

The RSCTC'2010 Discovery Challenge [1] is part of the International Rough Sets and Current Trends in Computing Conference to be held in Warsaw, Poland on June 28-30, 2010. The challenge task is to design a machine learning algorithm that classifies patients for the purpose of medical diagnosis and treatment by analyzing data from DNA microarrays. Patients are characterized by gene transcription data containing between 20,000 and 65,000 features. The challenge includes two independent tracks, a Basic Track and an Advanced Track, which vary in the form of the solutions to be submitted. In the Basic Track, text files containing class labels for the test samples of the datasets are submitted and compared to the ground truth on the TunedIT server (http://tunedit.org/). In the Advanced Track, Java source code of a classification algorithm is submitted and compiled on the server. The classifier is trained on a subset of data and evaluated on another subset.

3.9.1 Basic Track

In the Basic Track, six different datasets are given, each split into labeled training data and unlabeled test data.
The patients of the test set have to be classified, and a text file containing only the class labels for the test instances is to be submitted for each dataset. The six datasets vary in their number of classes, number of attributes and number of instances. Another difference is the class distribution of the training instances, see the last column of Table 3.2. Some of the classes occur in only a small number of instances, making the classification task more difficult.

ID  No. of Attributes  No. of Instances  No. of Classes  Class Distribution
1   54676              123               2               88-35
2   22284              105               3               40-58-7
3   22278              95                5               23-5-27-20-20
4   54676              113               5               11-31-51-10-10
5   54614              89                4               16-10-43-20
6   59005              92                5               11-7-14-53-7

Table 3.2: The six datasets provided for the Basic Track of the RSCTC'2010 Machine Learning Challenge

The predicted class labels are uploaded online in a zip-compressed file and evaluated on the server by comparing them with the true class labels. The accuracy within each single class is calculated and averaged for each dataset; this is also called the "within-class" accuracy or mean true positive rate. The final result is the average percentage over all six datasets and is presented on a so-called leaderboard (a web page with a table including the preliminary results for the participants) without decimal places. These preliminary results are evaluated on only a part of the test sets to guarantee that the final result is not biased. The final results are to be calculated on a separate subset of the test sets at the end of the challenge on February 28, 2010. The number of submissions for each participant is limited to 100. After the challenge is closed, the last solution submitted is used for the final evaluation.

3.9.2 Advanced Track

In the Advanced Track, participants submit Java source code of the classification algorithm instead of class labels.
The submitted classifier is automatically trained on a subset of the challenge data and evaluated on a test set on the TunedIT server. The algorithm is evaluated with cross validation 5 (preliminary) or 20 (final) times on each dataset. The five datasets used for this evaluation are not disclosed to the participants, but it is known that they have the same characteristics as the datasets from the Basic Track. The classifier is evaluated on the server using five-fold cross validation with a fixed random seed for data splitting, for reproducibility and in order to be able to compare different solutions. This track is more challenging, as the datasets are unknown and the classifier has to fulfill time and memory restrictions. The total server runtime for all five datasets is limited to 5 hours, so that each run of cross validation for each dataset can take no more than twelve minutes on average. As in the Basic Track, 100 submissions of solutions are allowed and the last one is used for the final evaluation on six different datasets, where the algorithm is evaluated 20 times with cross validation and has to terminate in less than 24 hours. The memory consumption is not allowed to exceed 1,500 MB, of which up to 500 MB is normally used by the evaluation procedure to load the dataset into memory, so that about 1,000 MB remain for the classification algorithm. The classification accuracy is calculated in the same way as in the Basic Track, but the leaderboard here shows the accuracy of the preliminary solution with two decimal places, in contrast to the Basic Track, where only an integer classification accuracy is presented.
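The within-class accuracy (mean true positive rate) used as the evaluation measure in both tracks can be sketched as follows. This is a minimal illustration in Java, the thesis's implementation language; the class and method names are ours, not those of the challenge's actual evaluation code, and class labels are assumed to be encoded as integers 0..k−1.

```java
class WithinClassAccuracy {
    // Mean of the per-class true positive rates: for each class, the fraction
    // of its instances that were predicted correctly, averaged over classes.
    static double withinClassAccuracy(int[] actual, int[] predicted, int numClasses) {
        int[] correct = new int[numClasses];
        int[] total = new int[numClasses];
        for (int i = 0; i < actual.length; i++) {
            total[actual[i]]++;
            if (actual[i] == predicted[i]) correct[actual[i]]++;
        }
        double sum = 0.0;
        for (int c = 0; c < numClasses; c++) {
            sum += total[c] == 0 ? 0.0 : (double) correct[c] / total[c];
        }
        return sum / numClasses;
    }

    public static void main(String[] args) {
        int[] actual    = {0, 0, 0, 0, 1, 1};
        int[] predicted = {0, 0, 0, 0, 1, 0};
        // class 0: 4/4 correct, class 1: 1/2 correct -> (1.0 + 0.5) / 2 = 0.75
        System.out.println(withinClassAccuracy(actual, predicted, 2)); // prints 0.75
    }
}
```

Unlike plain accuracy, this measure weights each class equally, which matters for the skewed class distributions of Table 3.2 (e.g. 11-7-14-53-7).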
CHAPTER 4 Methods

To improve the classification accuracy for distance learning with genetic data, three general solutions have been studied: replacement of the elementary distance function for learning from equivalence constraints, transformation of the data into another representation, and incorporation of biological knowledge into classification. The introduced distance learning framework (Section 3.4) was modified to fulfill the requirements needed to perform the experiments for this thesis. The different approaches have been tested in several experiments under different conditions and compared to each other. An explanation of the approaches and of the methods used to perform the experiments is given in the following sections.

4.1 Reorganization of the Distance Learning Framework

To make the distance learning framework more modular and easier to extend and maintain, the framework was restructured from procedural to object-oriented Java code. Moreover, some of the hard-coded system parameters have been moved out into the ARFF configuration file, so that each run can be configured in as much detail as possible without changing the code. The decision to use leave-one-out or 10-fold cross validation, the options to enable or disable classification with k-NN, Random Forest, learning from equivalence constraints and the intrinsic Random Forest similarity, the method to impute missing data and the distance function for learning from equivalence constraints have all been transferred to the ARFF configuration file.

Figure 4.1: The structure of the reorganized object-oriented distance learning framework, consisting of the classes Experiment, ConfigReader, ConfigChecker, HTMLCreator and TestExecuter. The TestExecuter class (green rectangle) is detailed in Figure 4.2.
Further, the configuration file was extended by the attribute skip, which disables a single experiment line. The configuration file includes many different experimental settings for different tasks. To study one task, for example only the experiments where the GO semantic similarity is incorporated, only the corresponding experiments are normally of interest, and the other experiments can be skipped. The main class of the new structure is called Experiment. Within this class, ConfigReader is called to read the configuration ARFF file (see Figure 4.1). Another class, ConfigChecker, verifies each configuration line and reports obvious misconfigurations. The ConfigChecker checks that the number of trees to be used for the Random Forest is a positive integer value, that at least one classifier is activated and that the distance function to be used for learning from equivalence constraints is defined. Further, the class checks whether the configured dataset files exist at the supplied path. If the configuration is correct, the HTMLCreator is initialized. This class saves the results of all single experiment runs into an HTML result file. Afterwards, a loop over the single runs (experiments) configured in the ARFF file is started, where in each iteration a new TestExecuter object is declared and the execution time for this experiment is measured. The TestExecuter class performs the tests specified in the ARFF file; its structure is described in Figure 4.2.

Figure 4.2: Detailed view of the TestExecuter class: the system parameters are set from the configuration file, the dataset ARFF file is loaded into memory, artificial missing data can be generated and imputed, and cross validation is performed. The parts not exploited in the experiments in this thesis are grayed out. The cross validation procedure (green rectangle) is described in more detail in Figure 4.3.
Such a structure makes it possible to automatically run an unlimited number of experiments one after another, while their results are reported to the same HTML file. For each experiment, a new TestExecuter is launched with the corresponding dataset declared in the configuration file. First, the configuration line is read to set the experiment variables. Then, the dataset ARFF file is loaded into memory. As this framework extends the WEKA framework [17], a new Instances object is declared containing the dataset. Next, there is an opportunity to impute missing data. In each cross validation run, one of four methods can be used to estimate the missing values: Mean/Mode imputation, k-NN imputation, and multiple k-NN imputation with Random Subspacing or with Bagging. In the experiments conducted for this thesis, the imputation of missing data was disabled, as this problem was out of the scope of our study; all datasets considered here have no missing values. After data preprocessing, cross validation for the datasets with different splits of training and testing sets is performed. To eliminate the bias caused by the randomness of some learning techniques, several iterations of cross validation are performed. The procedure of each cross validation iteration is shown in Figure 4.3.

Figure 4.3: Detailed structure of a cross validation iteration in which four types of learning algorithms are used in one experiment: the dataset is split into training and test set, noise can be added to the training set, the data can be transformed into another space, the k-NN, Meta, equivalence-constraint and intrinsic Random Forest similarity models are trained, the test set is classified with each of them, and the accuracy of each single classifier is returned.

In each run, the dataset is split randomly into a training and a testing set. The size of the training set is specified in the configuration file.
Before training the models, artificial class noise can be injected into the training set. For learning from equivalence constraints, the data has to be transformed into the product or the difference space. Four learning techniques can be performed: k-Nearest Neighbour, Meta, learning from equivalence constraints and the intrinsic Random Forest similarity. Meta is a general name for different ensemble classifiers such as AdaBoost or Random Forest, of which only one can be evaluated in each experiment. The configured Meta classifier is also used for the experiments with learning from equivalence constraints. The representation type for learning from equivalence constraints, product or difference space, is also set in the configuration file. For each experiment, the usage of each of these learning techniques can be switched on or off in the configuration file. After training the models, the test set is evaluated separately with each classification model, and the accuracy is averaged over the cross validation iterations. At the end of each experiment, the resulting accuracies are provided to the HTMLCreator. After all the experiments have been completed, the HTML result file is generated, containing a separate table for each experiment, including the experimental configuration and the accuracies achieved by the activated classifiers. An example of the HTML output table for one experiment is shown in Figure 4.4. The experiments specified in the configuration file are numbered sequentially. The experiment identification number is shown at the top left of the HTML output. In Figure 4.4, for example, the output of experiment 85 is shown, meaning that this experiment is defined in line 85 of the configuration file and 84 other experiments have been performed or skipped before it. First, the parameters set in the ARFF configuration file for the actual experiment are listed with their corresponding values.
In the shown experiment output, the HeC Brain Tumours dataset was used, as can be seen in the first line on the right. For a better view of the experimental results, the configuration values can be hidden by clicking the "hide config" link at the top right. At the bottom, the accuracies of the activated learning algorithms are shown in a table. In this example, Random Forest was used as the Meta classifier and reached an accuracy of 91.18 %. The instance-based classifier, k-Nearest Neighbour with k = 7 (IB7), got 83.82 %; learning distance functions from equivalence constraints in the difference space (LDF EC) performed best with 92.65 %; and the intrinsic Random Forest similarity (RF LDF) reached 89.71 %. The weaker result of the intrinsic Random Forest similarity classifier in this case

Figure 4.4: Output of a single experiment for the Health-e-Child Brain Tumours dataset in HTML format

was caused by the relatively small number of trees (25) used for this experiment (see m ntrees LDF RF dist in Figure 4.4) and could be improved by using more trees, usually up to 1000.

4.2 Distance Function Learning From Equivalence Constraints

The first experiments conducted for this thesis were intended to study the effect of changing the elementary distance function for learning from equivalence constraints in the difference space, in order to improve the classification accuracy. For learning from equivalence constraints in the difference space, the dataset needs to be transformed. For two instances from the original set, A with attributes a1, ..., an and B with attributes b1, ..., bn, the transformation into the difference space is performed as shown in Equation 4.1, where d denotes an elementary distance function.

difference_space(A, B) = (d(a1, b1), ..., d(an, bn))    (4.1)

A certain learning algorithm, Random Forest or AdaBoost, is then used to learn a distance function from the equivalence constraints.
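The transformation of Equation 4.1 can be sketched as follows, with the elementary distance left pluggable. The sign-preserving L1, Squared and Square-Root distances shown are those defined in Section 4.2.1; the class and method names, however, are illustrative and not taken from the actual framework.

```java
import java.util.function.DoubleBinaryOperator;

class DifferenceSpace {
    // Sign-preserving elementary distances (Section 4.2.1):
    static final DoubleBinaryOperator L1      = (x, y) -> x - y;
    static final DoubleBinaryOperator SQUARED = (x, y) -> (x - y) * (x - y) * Math.signum(x - y);
    static final DoubleBinaryOperator ROOT    = (x, y) -> Math.sqrt(Math.abs(x - y)) * Math.signum(x - y);

    // Maps a pair of instances (attribute vectors a and b) into the difference
    // space: one elementary distance per attribute, as in Equation 4.1.
    static double[] transform(double[] a, double[] b, DoubleBinaryOperator d) {
        double[] diff = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            diff[i] = d.applyAsDouble(a[i], b[i]);
        }
        return diff;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 4.0};
        double[] b = {3.0, 0.0};
        System.out.println(java.util.Arrays.toString(transform(a, b, L1))); // [-2.0, 4.0]
    }
}
```

A pair of same-class instances yields a positive equivalence constraint and a pair of different-class instances a negative one; the transformed vectors are what the Random Forest or AdaBoost learner is trained on.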
Afterwards, in the application phase, for each test instance the k nearest neighbours are calculated with respect to the learned distance function, and the test instance is classified as the most frequent class among the determined neighbours. To transform the dataset, different distance functions can be used to calculate the distance between the attribute values ai and bi, influencing the classification accuracy. To evaluate which distance function performs best with genetic data, several approaches have been tested under identical conditions. In some of the following distance calculations, the distance is squared. This turns negative distances into positive ones, which often causes a crucial loss of information and was shown to consistently decrease the accuracy in our preliminary experiments. To avoid this effect, the difference between x and y is calculated for each pair of attribute values first, and the sign of this result is then preserved in each final distance calculation. For each two values x and y of a certain feature fi from two instances A and B, where i = 1, ..., n, the distance d(x, y) can be calculated with one of the following distance approaches.

4.2.1 The L1 Distance and Modifications of the L1 Distance

• The L1 distance between two values x and y is calculated by subtracting y from x. The absolute value that is usually taken for the L1 distance is avoided here, so as not to lose the sign information.

dL1(x, y) = x − y    (4.2)

• The Squared distance is a modification of the L1 distance where the result of the L1 distance is squared.

dSquared(x, y) = (x − y)² · sgn(x − y)    (4.3)

• Another modification of the L1 distance is the Square-Root distance, which is calculated by taking the square root of the L1 distance.
dRoot(x, y) = √|x − y| · sgn(x − y)    (4.4)

• In contrast to the Squared distance (Equation 4.3), where the result of the L1 distance is squared, in the Element Squared distance the single values x and y are squared before subtracting them.

dElementSquare(x, y) = (x² − y²) · sgn(x − y)    (4.5)

4.2.2 The Simplified Mahalanobis Distance

The Mahalanobis distance [14] is based on correlations between two vectors and is scale-invariant, i.e. it does not depend on the scale of the attribute values. To calculate the Mahalanobis distance, the standard deviation δ is computed in advance over all instances for the attribute for which the actual distance is calculated.

dMahalanobis(x, y) = √((x − y)² / δ²) · sgn(x − y)    (4.6)

It must be noted that the general Mahalanobis distance, including a sum of elements, is simplified in our case to one dimension, as the distance has to be calculated between the two one-element vectors x and y.

4.2.3 The Chi-Square Distance

The Chi-Square distance [60, 35] is a normalized distance, where x is divided by the sum of the attributes a1, ..., an of the instance A, and y by the sum of the attributes b1, ..., bn of the instance B.

dChi-Square(x, y) = (1 / Σ_{j=1..m} Σ_{i=1..n} V_ji) · (x / Σ_{i=1..n} ai − y / Σ_{i=1..n} bi)² · sgn(x − y)    (4.7)

where m is the number of instances in the dataset and V_ji is the value of attribute i of instance j. Less formally, Σ_{j=1..m} Σ_{i=1..n} V_ji sums the values of all attributes over all instances of the dataset, Σ_{i=1..n} ai is the sum of the attribute values of instance A only, x being the current element the distance is calculated for, and Σ_{i=1..n} bi is calculated respectively for the instance B containing y.

4.2.4 The Weighted Frequency Distance

The weighted frequency distance was developed for this thesis and is based on the idea that the distance between x and y may depend on the distance of x and y to the corresponding attribute values of the other instances in the dataset.
With this distance, we tried to take into account the distribution of the values of the attribute (for which the distance is calculated) over all instances. For example, let x = a1 = 2 and y = b1 = 8, i.e. the first attribute of the instances A and B. First, the number of instances whose value of the first attribute lies in the range between x and y is counted. This number of instances m with an attribute value between x and y is divided by the number of all instances M to obtain a percentage value p.

p = m / M    (4.8)

The L1 distance between x and y is calculated and multiplied by p. As the Frequency distance performed poorly on its own in our preliminary tests, it has been extended by combining it with the L1 distance. The two parts, the Frequency distance and the L1 distance, are weighted by a factor w, where 0 < w < 1. Note that if w = 1, the distance is equivalent to the L1 distance.

dFrequency(x, y) = (w · (x − y) − (1 − w) · (x − y) · p) · sgn(x − y)    (4.9)

For the experiments conducted for this thesis, a weight of w = 0.5 was used.

4.2.5 The Canberra Distance

The Canberra distance function [35] divides the absolute value of x − y by the absolute value of the sum x + y.

dCanberra(x, y) = (|x − y| / |x + y|) · sgn(x − y)    (4.10)

4.2.6 The Variance Threshold Distance

This new distance was developed for the experiments in this thesis; the basic idea is to group the attributes based on their expression values. We tried to model a biological view of the expression values of the attributes. It was assumed that the exact expression value is not the most important information, but rather the distance to the mean value of the attribute (for which the distance is calculated) over all instances. We defined that a gene can be either strongly overexpressed, overexpressed, normally expressed, underexpressed or strongly underexpressed.
First, the mean m and the standard deviation δ of the attribute are calculated over the training set, and five ordered groups are prepared, where each of the attribute values x and y is classified independently into one of these groups by the following rules:

xnew =  2,  if x > m + δ          (strongly overexpressed)
        1,  if m < x < m + δ      (overexpressed)
        0,  if −m < x < m         (normal)                    (4.11)
       −1,  if −m − δ < x < −m    (underexpressed)
       −2,  if x < −m − δ         (strongly underexpressed)

The group value ynew is calculated respectively. Next, the distance is defined as the absolute L1 distance between the group values of x and y.

dVarianceThreshold(x, y) = |xnew − ynew|    (4.12)

For example, if x is strongly overexpressed (2) and y is underexpressed (−1), the Variance Threshold distance is d(x, y) = 3. Note that if x and y are in the same expression group, the distance is zero. We call a special variant of this distance the Zero Variance distance, where zero is used instead of the mean value. Both the Mean and the Zero Variance distance were used for experiments conducted in this thesis.

4.2.7 Test Configurations

The introduced distances have been tested on nine of the benchmark datasets described in Table 3.1: Mesh, Lymphoma, Embr. Tumours, Colon, Leukemia, Arcene, Liver, Thyroid and Heart. The other datasets have been skipped for this experiment, as they were not available at the time the tests were conducted. For datasets with less than 90 instances, leave-one-out (LOO) cross validation was used; for the other datasets, 10-fold cross validation with 30 iterations was used. All tests have been done in the difference space with Random Forest as the learning algorithm, and the average classification accuracy over the datasets was calculated; it is shown in Table 5.1.

4.3 Transformation of Feature Representation for Gene Expression Data

4.3.1 Motivation

The genetic datasets described in Table 3.1 contain gene expression values; each attribute is the expression of a single gene.
In biology, the influence of a gene on a certain disease often depends not only on its own expression, but also on the expression of other genes interacting with it. In a certain disease, a high expression of a gene A might only influence the etiopathology if a second gene B is overexpressed, too; in another case, it might only be important if B is underexpressed. When a classifier is trained on the microarray datasets in the usual gene expression representation, these dependencies are often not considered. To capture these co-operations, the data can be transformed into another representation; a normalization of the genes with respect to other genes is needed.

Another motivation for changing the feature representation was to incorporate the GO semantic similarity into classification. The Gene Ontology provides similarity measures between two genes A and B, while most learning algorithms, when the usual representation is used, consider differences between patients with given gene expression values. The GO provides information about gene-gene interactions, whereas the classifiers normally use patient-patient relations for classification. There is no obvious way to incorporate the semantic similarity between two genes to improve the classification of patients when the plain representation is used.

For these reasons, the original datasets can be transformed into a new representation of gene-pairs instead of single genes. First, all possible pairs of single genes available in the dataset are generated. As the information of the two pairs of a gene A and a gene B, Pair(A, B) and Pair(B, A), is redundant, only one of these pairs is used. If n is the number of genes (features) in the dataset, then (n² − n)/2 pairs are generated.

Gene A                          Pair(A, B)
Gene B   ──Transformation──→    Pair(A, C)    (4.13)
Gene C                          Pair(B, C)

We call this new representation the Gene-Pair Representation. In (4.13), three genes are transformed into three gene pairs.
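The pair construction of (4.13) can be sketched in a few lines; the class and method names below (`GenePairs`, `allPairs`, `numPairs`) are illustrative and not part of the framework:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: enumerate all unordered gene pairs of a dataset.
// Since Pair(A, B) and Pair(B, A) carry the same information, only pairs
// with i < j are kept, giving (n*n - n) / 2 pairs for n genes.
class GenePairs {

    public static List<String> allPairs(String[] genes) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < genes.length; i++) {
            for (int j = i + 1; j < genes.length; j++) {
                pairs.add("Pair(" + genes[i] + "," + genes[j] + ")");
            }
        }
        return pairs;
    }

    // Closed form for the number of unordered pairs.
    public static int numPairs(int n) {
        return (n * n - n) / 2;
    }

    public static void main(String[] args) {
        System.out.println(allPairs(new String[]{"A", "B", "C"})); // the three pairs of (4.13)
        System.out.println(numPairs(200)); // 19900
    }
}
```

For three genes the method yields exactly the three pairs of (4.13); for the 200-feature datasets used later it yields 19,900 pairs.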
However, the number of pairs grows quadratically with the number of genes. Most datasets in our experiments were reduced to 200 features, so that 19,900 gene pairs are constructed. With such a large number of features, training a classification model may not finish in a limited period of time; moreover, most of these pairs will not be useful for classification. Therefore, a feature selection has to be applied to select the most discriminative pairs in advance. With this new gene-pair representation, the genes are normalized with respect to the other genes, and it becomes possible to include the GO semantic similarity, as the transformed datasets are represented with gene-gene relations. An instance I in the new gene-pair representation is thus defined by a certain number of gene pairs pi and a class label c:

I = (p1, . . ., pn, c)    (4.14)

4.3.2 Framework Updates

To transform the original datasets into the gene-pair representation, the distance learning framework was extended by a preprocessor class for constructing and selecting the pairs. In each cross validation iteration, the training set and the testing set are transformed into the gene-pair representation before the models are learned. For each iteration, the selected pairs are provided to a class reporting statistics about the pairs (Figure 4.5). This class counts how often a pair is selected over all cross validation iterations. Further, the ranking of the pairs in each iteration by the feature selection algorithm is provided to the statistics reporter.

Figure 4.5: Modified framework for transforming the dataset into the gene-pair representation. The new components are shown in blue.
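The pair-statistics bookkeeping can be sketched as follows; the class and method names are hypothetical, not the framework's own:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: count how often each gene pair is chosen by the
// feature selection over all cross-validation iterations.
class PairStatistics {

    private final Map<String, Integer> selectionCounts = new HashMap<>();

    // Called once per cross-validation iteration with the selected pairs.
    public void reportIteration(List<String> selectedPairs) {
        for (String pair : selectedPairs) {
            selectionCounts.merge(pair, 1, Integer::sum);
        }
    }

    public int timesSelected(String pair) {
        return selectionCounts.getOrDefault(pair, 0);
    }

    public static void main(String[] args) {
        PairStatistics stats = new PairStatistics();
        stats.reportIteration(Arrays.asList("Pair(A,B)", "Pair(A,C)"));
        stats.reportIteration(Arrays.asList("Pair(A,B)"));
        System.out.println(stats.timesSelected("Pair(A,B)")); // 2
    }
}
```

Pairs that are selected in many iterations are stable across training folds, which is what the reported statistics make visible.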
The usage of the gene-pair representation can be enabled in the ARFF configuration file and is shown in the HTML output at the bottom left of the configuration parameters. To access the pair statistics for an experiment, one can click on the dataset name in the header of the experiment's result table. The detailed process of transforming the data representation is shown in Figure 4.6.

Figure 4.6: Workflow of the transformation from the single-gene to the gene-pair representation, where n is the number of features and N is the number of pairs to be selected: all (n² − n)/2 pairs are generated for train and test set, the 4N best pairs are selected with a filter, the N best pairs with a wrapper, and the test set is reduced to the selected pairs.

Before training the model of the current cross validation iteration, the preprocessor is run on the training set. Let n be the number of genes in the dataset and N the number of pairs to be used for classification. First, all (n² − n)/2 pairs are generated by subtracting the expression of one gene from that of another; Pair(A, B), for example, is constructed as A − B. Note that, in contrast to the common usage of a distance, where the absolute value is usually taken, the sign is kept in the pair calculations in order not to lose information. For example, let A = −15.65 and B = −3.43; then Pair(A, B) = −12.22. Next, the transformed training set is reduced to a configured number of pairs, N. For this, a filter feature selection, ReliefF, is first used to pre-select 4N pairs out of all; the filter pre-selection is necessary because the wrapper method is too computationally expensive.
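The signed pair construction described above amounts to a plain difference; the class and method names below are illustrative:

```java
// Sketch of the signed pair construction: Pair(A, B) is the plain
// difference A - B, keeping the sign instead of the absolute value,
// so the direction of the expression difference is not lost.
class SignedPair {

    public static double pairValue(double a, double b) {
        return a - b;
    }

    public static void main(String[] args) {
        // Example from the text: A = -15.65, B = -3.43 gives -12.22
        System.out.println(pairValue(-15.65, -3.43));
    }
}
```

Taking |A − B| instead would map over- and under-expression differences of the same magnitude onto the same feature value, which is exactly the information loss the signed variant avoids.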
After the pre-selection, a subset evaluation method using a greedy stepwise search is applied to select the N most discriminative pairs out of the 4N pre-selected ones. In the experiments done for this thesis, Correlation-based Feature Subset Selection (CFS) [22] was used as the wrapper, and the number of pairs to be selected for classification, N, was set to 100 for all experiments. For a dataset containing 200 genes, the procedure transforms and reduces the number of features as follows:

200 ──build pairs──→ 19,900 ──ReliefF──→ 400 ──CFS──→ 100    (4.15)

The selected pairs are saved within the preprocessor class and are used to reduce the test set, too: first, all pairs are generated, then the test set is reduced to the same pairs as the training set. After both the training and the testing set have been transformed, the classification models are learned using the same techniques as for the datasets in the plain representation.

The classification accuracy depends strongly on the number of pairs to be selected and on the multiplication factor for the pairs pre-selected by the filter. Our preliminary experiments demonstrated that the models perform best with 100 pair-features. We pre-selected 400 features because with fewer features the accuracy was worse in preliminary tests, while with more than 400 features the execution time of the wrapper increased considerably without a noticeable gain in accuracy.

4.3.3 Test Configurations

For all following experiments described in this thesis, the datasets Heart, Liver and Thyroid were left out, as they contain fewer features and are non-genetic. Arcene and Mesh are non-genetic too, but they contain enough features to obtain comparable results and are used as non-genetic reference datasets. The remaining datasets were used to compare the classification performance of the gene-pair representation with that of the original one.
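The two-stage filter-wrapper selection can be sketched as below. The per-feature filter score and the subset-merit function are mock stand-ins, not ReliefF or CFS, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of a two-stage filter-wrapper selection: a cheap
// per-feature filter score pre-selects 4*n candidates, then a greedy
// forward (stepwise) search over a subset-merit function picks n features.
class TwoStageSelection {

    interface FeatureScore { double score(int feature); }
    interface SubsetMerit  { double merit(List<Integer> subset); }

    public static List<Integer> select(int numFeatures, int n,
                                       FeatureScore filter, SubsetMerit wrapper) {
        // Stage 1 (filter): keep the 4*n best features by the filter score.
        List<Integer> candidates = new ArrayList<>();
        for (int f = 0; f < numFeatures; f++) candidates.add(f);
        candidates.sort((a, b) -> Double.compare(filter.score(b), filter.score(a)));
        candidates = new ArrayList<>(candidates.subList(0, Math.min(4 * n, numFeatures)));

        // Stage 2 (wrapper): greedily add the candidate that improves the
        // subset merit most, until n features are chosen (assumes enough
        // candidates remain).
        List<Integer> selected = new ArrayList<>();
        while (selected.size() < n) {
            int bestF = -1;
            double bestMerit = Double.NEGATIVE_INFINITY;
            for (int f : candidates) {
                if (selected.contains(f)) continue;
                selected.add(f);                       // tentatively add f
                double m = wrapper.merit(selected);
                selected.remove(selected.size() - 1);  // undo
                if (m > bestMerit) { bestMerit = m; bestF = f; }
            }
            selected.add(bestF);
        }
        return selected;
    }
}
```

The design point is the one the text makes: the wrapper only ever evaluates subsets of the 4n pre-selected candidates, so its cost is independent of the 19,900 raw pairs.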
As described before, the reduced datasets containing 200 single-gene features were used. The number of pairs to select with CFS was set to 100, with 400 pre-selected by ReliefF. For datasets containing more than 90 instances, 10-fold cross validation with 30 iterations was used, and leave-one-out cross validation for the others. Each dataset was evaluated with four different classifiers: k-Nearest Neighbours (kNN) classification, Random Forest (RF), learning from equivalence constraints (EC) and the intrinsic Random Forest Similarity (iRF). To measure the robustness of the gene-pair representation to noise, the same tests were also conducted with noisy data: first, 10% class noise was injected artificially into the training set, followed by 20% in a second experiment. The classification accuracies of the two representations were compared over the four classifiers.

4.4 Integration of GO Semantic Similarity

4.4.1 Motivation

With the novel representation of features as differences in expression values of pairs of genes, introduced in Section 4.3, it becomes possible to incorporate external biological knowledge into classification algorithms. Several reasons motivate the use of the semantic similarity to guide the classification of gene expression data. First, the selected pairs may themselves contain valuable biological knowledge that can be used to guide the classification: a pair of genes that are known to interact might be more useful for classification than two genes that are not associated with each other. The typically large number of features in gene expression data makes it difficult for common feature selection methods not to ignore some important features. The semantic similarity might guide the selection process and help to consider pairs that would otherwise be neglected by the feature selection algorithm but might be representative for the studied task.
Next, the importance of a pair can be weighted by the semantic similarity between its two genes; gene-pair features with a high semantic similarity might be more useful for classification than others. In both approaches described above, the background biological knowledge is used to support classification decisions and is assumed to increase the classification accuracy. Compared to other reports of using semantic similarity for classification support, the approach used in this thesis is different, as the gene-pair representation was developed to offer the opportunity to apply the semantic similarity directly to the expression values.

4.4.2 Framework Modifications

In order to incorporate the GO semantic similarity, the distance learning framework had to be extended. A serious challenge was to retrieve the similarity values quickly. The genetic datasets used for testing do not include GO identifiers but NCBI accession numbers (except the HeC Brain Tumours dataset, where the features are named by the official gene symbol). The GO does not include NCBI accession numbers; therefore, it is not possible to retrieve the corresponding GO identifiers for a gene directly from the GO OBO file or the annotation database by the given feature name. A GO identifier for a certain gene can be searched by the gene's official symbol, as included in the HeC Brain Tumours dataset. For the other datasets, an intermediate step is needed to obtain the official gene symbol for the given unique NCBI accession number. Therefore, a new class was implemented that calls the EUtils web service via a URL to search the NCBI gene database for an accession number and extracts the official gene symbol from the XML response (an example can be seen in Expression 4.16). The same class accesses the local GO annotation database to extract all GO identifiers associated with this gene symbol.
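The last step, turning the retrieved identifiers into a single delimited attribute name, can be sketched as follows; the class name and the second GO identifier below are placeholders, not verified annotations:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: join the GO identifiers retrieved for one gene into
// the '!'-delimited string that is stored as the attribute name.
class GoIdString {

    public static String build(List<String> goIds) {
        return String.join("!", goIds);
    }

    public static void main(String[] args) {
        // GO:0001234 is a hypothetical placeholder identifier.
        System.out.println(build(Arrays.asList("GO:0007242", "GO:0001234")));
    }
}
```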
A string is constructed containing a list of GO identifiers delimited by exclamation marks. This procedure translates the original dataset into a new dataset containing the associated GO identifiers of the genes as attribute names, as described in Figure 4.7.

Figure 4.7: Translation of the original dataset into a GO ID dataset: for each attribute of the original ARFF file, NCBI EUtils is called, the official gene symbol is extracted from the search result XML, and the GO identifiers for the gene symbol are retrieved from the GO annotation database.

AF055033 ──NCBI EUtils──→ IGFBP5 ──AnnotationDB──→ GO:0007242!GO...    (4.16)

The result is a copy of the dataset's ARFF file in which the attribute names are replaced by the GO identifier strings. This translation has to be done only once; afterwards, the new ARFF file can be used for classification tests. Retrieving the GO identifiers for a gene is time consuming because, for each gene, the NCBI web service has to be called, the server response has to be parsed and an SQL database query has to be executed. Saving the GO identifiers as strings in the attribute names of the ARFF file reduces the execution time, as the retrieval can be done once in advance. This gives a substantial time advantage compared to retrieving the GO identifiers separately for each experiment, as for some datasets the retrieval procedure can take several hours.

Another time-consuming part of retrieving the semantic similarity information is the calculation of the similarity of two genes based on their associated GO terms. The GO identifiers for each gene are saved in the dataset's ARFF file, but the similarity of two of these genes is not yet calculated.
As the different methods introduced for calculating the semantic similarity are relatively time consuming, the similarities for all possible gene pairs of a dataset are calculated in advance and saved into a .govalues file. To calculate the similarity, a new class, SimilarityCalulator, was implemented using the GO4J API. The class extends the API by new methods for calculating the semantic similarity between two GO identifiers as described by Wang [58]. The GO4J API and the SimilarityCalulator both use the GO OBO file for calculating semantic similarities between two GO terms. Moreover, different approaches have been implemented to derive the similarity between two genes: Max, Average, Azuaje and a self-developed method referred to as Schoen. The Schoen method is a combination of the Max and Average methods: first, all combinations of GO term similarities between the two genes are calculated; then the upper third of the values, those with the highest semantic similarity, is taken, and the average over this group is used as the final similarity value, see the Java code in Listing 4.1. Note that in line 8, the similarity between two GO identifiers (terms) is calculated by using the GO4J API with the Lin similarity.

 1  private double schoenSimilarity(ArrayList<GOTerm> terms1, ArrayList<GOTerm> terms2) {
 2      int numOfValues = terms1.size() * terms2.size();
 3      double[] maxsim = new double[numOfValues];
 4      int count = 0;
 5
 6      for (int i = 0; i < terms1.size(); i++) {
 7          for (int j = 0; j < terms2.size(); j++) {
 8              double s = model.evalSimilarity(terms1.get(i), terms2.get(j), Similarity.LIN);
 9              maxsim[count] = s;
10              count++;
11          }
12      }
13      Arrays.sort(maxsim);
14      int bestthird = 2 * (maxsim.length / 3);
15      double[] result = Arrays.copyOfRange(maxsim, bestthird, maxsim.length);
16      double average = 0;
17      for (int i = 0; i < result.length; i++) {
18          average += result[i];
19      }
20      double endsim = 0;
21      if (result.length != 0) {
22          endsim = average / result.length;
23      }
24      return endsim;
25  }

Listing 4.1: Java code of the Schoen similarity calculation between two genes. The method is called with the GO terms of each gene as input parameters and uses the class variable model that includes the GO graph.

One more method was implemented for the framework, the Graph Information Content (GIC) [37]. This algorithm constructs two sets, the union and the intersection of the GO term sets of the two genes, and divides the sum of the Information Content (IC) of the GO terms shared by both genes by the sum of the IC of the GO terms associated with either gene. For two genes A and B, the GIC similarity is defined as:

simGIC(A, B) = Σ_{t ∈ A∩B} IC(t) / Σ_{t ∈ A∪B} IC(t)    (4.17)

The similarity values of the described approaches are used to generate different .govalues files, one for each method. The file format contains the GO id string of each gene as a key and a tab-delimited similarity value, where each line defines one gene. For each method tested, a .govalues file was created and used. This is again a workaround to reduce the running time of the framework: on start-up of each experiment, the framework loads the .govalues file and saves the similarity values for each gene-pair of the current dataset into a Java HashMap. The similarity measure to be used can be configured in the framework, so tests can be run with similarity values calculated by different methods. The modifications of the framework to handle the GO semantic similarity are described in Figure 4.8.
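The GIC formula (4.17) can be sketched as below; the information-content values would normally be derived from the annotation corpus, so a mock IC map stands in, and all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the Graph Information Content similarity (4.17): the IC of the
// shared GO terms divided by the IC of all GO terms of either gene.
class GicSimilarity {

    public static double simGIC(Set<String> termsA, Set<String> termsB,
                                Map<String, Double> ic) {
        double intersectionIc = 0.0, unionIc = 0.0;
        Set<String> union = new TreeSet<>(termsA);
        union.addAll(termsB);
        for (String t : union) {
            double v = ic.getOrDefault(t, 0.0); // unknown terms contribute 0
            unionIc += v;
            if (termsA.contains(t) && termsB.contains(t)) intersectionIc += v;
        }
        return unionIc == 0.0 ? 0.0 : intersectionIc / unionIc;
    }

    public static void main(String[] args) {
        Map<String, Double> ic = new HashMap<>();
        ic.put("t1", 1.0); ic.put("t2", 2.0); ic.put("t3", 1.0);
        Set<String> a = new TreeSet<>(); a.add("t1"); a.add("t2");
        Set<String> b = new TreeSet<>(); b.add("t2"); b.add("t3");
        System.out.println(simGIC(a, b, ic)); // 0.5: shared IC 2.0 over total IC 4.0
    }
}
```

Genes annotated with the same high-IC (i.e. specific) terms score close to 1, while genes sharing only generic terms score close to 0.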
In the experiments conducted for this thesis, the GO semantic similarity can be applied to the learning algorithms in two general ways: by feature weighting or by feature selection.

Figure 4.8: Modification of the framework to incorporate the GO semantic similarity in three different ways: the precomputed .govalues similarities are loaded for each experiment and, in each cross validation iteration, can be used in the preprocessor to select pairs, to multiply pairs, or as feature weights in the kNN model.

4.4.3 GO-Based Feature Weighting

As feature weighting has been applied to the k-Nearest Neighbour classifier before [49], the experiments with GO feature weighting were evaluated only with k-Nearest Neighbour classification. The weighting can be done at two different positions in the framework: in the preprocessor when the pairs are constructed, or in the k-Nearest Neighbour algorithm itself. In the first solution, the weighted Euclidean distance [28] is used to incorporate the weight of each gene-pair immediately while constructing the pairs; each pair is weighted by its corresponding semantic similarity value:

Pair(a, b) = √((a − b)² ∗ sim(a, b))    (4.18)

The second possibility is to use the semantic similarity values as feature weights in the k-Nearest Neighbour classifier. We used k-Nearest Neighbour as it is one of the best studied and most robust distance-based classifiers, and we used only this one classifier here, as the purpose of these experiments is solely to test the hypothesis that distance learning methods can be improved by GO-based feature weighting.

4.4.4 GO-Based Feature Selection

The second approach to using the semantic similarity to support classification is to select pair-features based on their GO similarity values. A new feature selection algorithm has been implemented.
The algorithm can select pairs in different ways:

1. the n pairs with the highest semantic similarities;
2. the n pairs with the lowest semantic similarities;
3. all pairs with a similarity greater than a given threshold;
4. all pairs with a similarity less than a given threshold;
5. all pairs whose similarity y differs from a defined value x by no more than a given threshold t: x − t < y < x + t.

For the experiments done for this thesis, only the first method of the enumeration is used: selecting the n pairs with the highest similarity. This feature selection was combined with the CFS feature selection to exclude noisy and redundant pairs: first, the 400 pairs with the highest GO similarity are selected, followed by CFS to select the 100 most discriminative gene-pairs out of those 400.

To identify the semantic similarity calculation technique best matching genetic data, different combinations of similarity calculation methods were tested. Three parameters can influence the classification result: first, the method used to calculate the similarity between single GO terms; next, the algorithm used to combine these similarities for two genes; and last, the way the similarity is applied to the learning algorithms. Besides the accuracy evaluation, the average semantic similarity of all calculated pairs and of the selected pairs was reported. It was assumed that pairs selected by the common feature selection have a higher average similarity than all pairs; this would mean that pairs with a high similarity are more likely to be discriminative for classification and, therefore, that GO-based feature selection is more likely to improve the classification accuracy.

4.4.5 Test Configurations

For the GO semantic similarity, only six datasets could be used, as genetic datasets are needed in which the genes behind the attribute names are known.
Only Breast, Colon, Embrional Tumours, HeC Brain Tumours, Leukemia and Lupus include attribute names for which the corresponding gene can be identified. As Leukemia includes only 38 attributes that can be matched to genes, this dataset was excluded from the experiments. Apart from the HeC Brain Tumours dataset, which contains gene symbols, the other four datasets contain NCBI accession numbers as attribute names; the method described above was used to find GO identifiers for their genes. As the Gene Ontology is not complete and some attribute names cannot be matched to genes, the number of features of the datasets was reduced so that at least one GO term could be retrieved for each remaining attribute.

First, the five datasets were tested with different combinations of similarity calculation methods and the two possibilities to include feature weighting. The following similarity calculation methods have been compared:

• Max-Resnik: maximum value of Resnik similarities;
• Max-Lin: maximum value of Lin similarities;
• Max-Wang: maximum value of Wang similarities;
• Azuaje-Resnik: Azuaje combination of Resnik similarities;
• Azuaje-Lin: Azuaje combination of Lin similarities;
• Azuaje-Wang: Azuaje combination of Wang similarities;
• Schoen-Lin: Schoen combination of Lin similarities;
• GIC: Graph Information Content;
• Random: random values in the range [0, 1] used as similarities.

The Random method was used as a reference, to see whether a change in accuracy is caused by the methods or by chance; for it, a new .govalues file was created in which the similarity of each pair was set to a random value between zero and one. The Schoen method was tested only in combination with the Lin similarity; the intention was only to test whether the Schoen similarity calculation is comparable in classification accuracy to the Max and the Azuaje approaches.
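To make the combination rules concrete, the sketch below contrasts the maximum, a plain average (used here only as a simple reference point, not necessarily the exact Azuaje rule), and the Schoen top-third average over the flattened matrix of term-term similarities:

```java
import java.util.Arrays;

// Hedged sketch of three ways to collapse all term-term similarities of
// two genes into a single gene-gene similarity value.
class CombineSimilarities {

    public static double max(double[] sims) {
        double m = Double.NEGATIVE_INFINITY;
        for (double s : sims) m = Math.max(m, s);
        return m;
    }

    public static double average(double[] sims) {
        double sum = 0;
        for (double s : sims) sum += s;
        return sims.length == 0 ? 0 : sum / sims.length;
    }

    // Average over the upper third of the sorted values, as in Listing 4.1.
    public static double schoen(double[] sims) {
        double[] sorted = sims.clone();
        Arrays.sort(sorted);
        int from = 2 * (sorted.length / 3);
        double sum = 0;
        for (int i = from; i < sorted.length; i++) sum += sorted[i];
        return sorted.length == from ? 0 : sum / (sorted.length - from);
    }
}
```

For the values {0.1, 0.2, 0.9, 0.8, 0.3, 0.4}, Max yields 0.9, the plain average 0.45, and Schoen (0.8 + 0.9)/2 = 0.85, illustrating how Schoen sits between the two: less sensitive to a single outlier than Max, but not diluted by the many low values like the average.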
The described methods were compared by performing tests on the datasets Breast, Colon, Embrional Tumours and Lupus. The two described ways to incorporate GO-based feature weighting were compared under the same classification conditions, and the average semantic similarities of all pairs and of the selected pairs were reported. Next, the GO-based feature selection technique was tested without feature weighting: the 400 gene-pairs with the highest semantic similarity were pre-selected, followed by a CFS-based reduction to 100 pairs. This method was evaluated on the five datasets, including the HeC Brain Tumours dataset, and compared with respect to classification accuracy.

4.5 RSCTC'2010 Discovery Challenge

In order to compare distance learning functions to other state-of-the-art learning methods, we registered for the RSCTC'2010 Discovery Challenge [1]. In the previous sections, two new techniques, the gene-pair representation and the incorporation of external knowledge into classification, were introduced. As the challenge datasets do not provide information about the gene names, the incorporation of the GO semantic similarity was impossible. Further, the experiments done for this thesis with the gene-pair representation had only been validated on binary classification problems, whereas all but one of the challenge datasets include more than two classes; therefore, the gene-pair representation was not tested on the challenge datasets. Instead, we tried to optimize the learning algorithms for the challenge datasets by testing different feature selection methods, oversampling techniques (explained later) and configurations of the learning methods. We tried to find the best settings with respect to the classification accuracy for each dataset separately, as there is no learning technique that performs best on all datasets (no free lunch theorem [62]).

4.5.1 Basic Track

For the Basic Track, six labeled training sets are provided.
The task is to predict class labels for the corresponding test sets. The predicted labels for each dataset are saved into a separate text file; the six files are compressed into a zip file and uploaded to the challenge website. The uploaded class labels are compared to the ground truth immediately after submission, and the preliminary classification accuracy of the submitted solution is presented on the so-called leaderboard.

First, the training sets were used to cross validate the four learning methods, k-Nearest Neighbour classification, Random Forest, learning from equivalence constraints and the intrinsic Random Forest Similarity, on our local computer; 10-fold cross validation with 30 iterations was used. Of these four classifiers, the intrinsic Random Forest Similarity performed best on all datasets and was selected for all further experiments.

One problem with the challenge datasets is the unequal distribution of classes, because for many learning algorithms the classification accuracy can suffer from unbalanced data [59]. Some classes are represented by fewer than ten instances, while other classes include more than 50 instances. For equally distributed classes, the probability of an unlabeled instance being classified into the correct class by chance is the same for all classes. For unequally distributed classes, the probability of an instance of an over-represented class being classified correctly by chance is higher than for an under-represented class: if, for example, class A is represented by 1000 instances and class B by 10 instances, an instance of class A is more likely to be classified correctly by chance than an instance of class B. This unequal class distribution can bias the classification accuracy of many learning algorithms.
To eliminate this problem, the training datasets have been oversampled by duplicating the instances of under-represented classes until the class distributions have been roughly equal (see Listing 4.2).

public Instances oversample(Instances inst, int limit) {
    // class occurrence values
    int[] classCounts = new int[inst.classAttribute().numValues()];

    for (int i = 0; i < inst.numInstances(); i++) {
        int classValue = (int) inst.instance(i).value(inst.numAttributes() - 1);
        classCounts[classValue]++;
    }

    // get most frequent class
    int mostFrequentClass = Utils.maxIndex(classCounts);

    // oversample other classes
    int numInst = inst.numInstances();
    for (int i = 0; i < numInst; i++) {
        int classValue = (int) inst.instance(i).value(inst.numAttributes() - 1);
        int border = classCounts[mostFrequentClass] / classCounts[classValue] - 1;
        for (int j = 0; j < border && j < limit; j++) {
            inst.add(new Instance(inst.instance(i)));
        }
    }
    return inst;
}

Listing 4.2: Algorithm to oversample a dataset (inst) with unequally distributed classes.

At first, the most frequent class is identified by counting the number of instances of each class. For each other class, the number of instances of the most frequent class is divided by the number of instances of the current class, to get the number of times the instances of the under-represented class have to be duplicated.
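The arithmetic of Listing 4.2 can be illustrated without WEKA on plain class counts; the limit is interpreted here as the maximal multiplication factor of an under-represented class, and all names are illustrative:

```java
// Sketch of the oversampling arithmetic on class counts: every
// under-represented class is multiplied up to the size of the most
// frequent class, capped by `limit` (the maximal multiplication factor).
// Assumes every class has at least one instance.
class OversampleCounts {

    public static int[] oversample(int[] classCounts, int limit) {
        int max = 0;
        for (int c : classCounts) max = Math.max(max, c);
        int[] result = classCounts.clone();
        for (int i = 0; i < classCounts.length; i++) {
            int factor = Math.min(max / classCounts[i], limit);
            result[i] = classCounts[i] * factor;
        }
        return result;
    }

    public static void main(String[] args) {
        // Unlimited oversampling equalizes the example classes A=1000, B=10.
        int[] balanced = oversample(new int[]{1000, 10}, Integer.MAX_VALUE);
        System.out.println(java.util.Arrays.toString(balanced)); // [1000, 1000]
    }
}
```

With a limit of 20, the 10-instance class grows only to 200 instances instead of 1000, matching the worked example in the text.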
Moreover, a limit can be defined to stop the oversampling once a certain number of copies has been made, producing lighter oversampling and avoiding too many synthetic cases. The limit defines the maximal multiplication factor for an under-represented class. In the example above, where class A contains 1000 instances and class B contains 10, class B has to be copied 100 times to match the distribution of class A; with a limit of 20, class B can be multiplied only 20 times and will contain no more than 200 instances. Oversampling with different limits was tested by submitting the classification results to the challenge website; the best accuracy was reached with no limit. The oversampled datasets were used for all further experiments.

With the oversampled datasets, different two-step filter-wrapper feature selection approaches were tested. A GainRatio filter was used to pre-select a set of features, followed by CFS to select the most discriminative features out of the pre-selected ones. Different numbers of features were tested; a filter selection of 400 features followed by a wrapper selection of 200 features performed best. The CFS algorithm evaluates different subsets composed by a greedy stepwise search. One problem with this technique is that a small number of features, for example five, can already reach an accuracy close to the Bayes-optimal accuracy obtained by considering all features; it then becomes difficult to reliably detect a classification improvement caused by adding further features. We developed a new feature selection algorithm that tries to solve this problem by running the CFS algorithm for a certain number of iterations. In each iteration, only a small group of features is selected; the algorithm was therefore called Group Selection.
In each iteration, n (the group size) features are selected by CFS and removed from the dataset for the next iteration. To select a total number of N features, N/n iterations are needed. The benefit of this method is that in each iteration the strong features are removed, so the influence of the remaining features on the decision becomes easier to detect.

Array selected;
Dataset data = original;

for each iteration {
    Array bestn = data.selectNBestFeatures(using CFS);
    selected += bestn;
    data.remove(bestn);
}

original.reduceTo(selected);

Listing 4.3: Pseudo code of the Group Selection algorithm.

Different group sizes and numbers of iterations were tested; in our experiments, 40 iterations of selecting 5 features each performed best. The filter pre-selection to 400 features, followed by the 5-features-40-iterations Group Selection, was used for all following experiments.

Again, cross validation on the training sets was performed on our local PC to determine the best parameters of the intrinsic Random Forest Similarity for each dataset separately. We tested different values for the number of trees, the minimum leaf size, the number of attributes considered when splitting a node in a tree, and the number of nearest neighbours. Further, the intrinsic Random Forest Similarity was tested both with the original Random Forest implementation and with extremely randomized trees [20]. Based on these results, the test configuration was updated to use the best parameters for each dataset.

4.5.2 Advanced Track

Solutions to the Advanced Track are submitted by uploading a Java jar file with a WEKA classifier included. Therefore, a new classifier was implemented that can reuse the classes used for the Basic Track; this classifier and all dependent classes were extracted into the jar file for the challenge.
In the submitted procedure for generating class models, we first tried to oversample the full dataset before doing feature selection, but the oversampled dataset exceeded the memory limitations. Therefore, the provided dataset is now always reduced to 3000 features immediately after the training is started. Different feature selection algorithms and different numbers of features have been tested, where the weighted ReliefF algorithm with 3000 features performed best for this initial feature selection. After reducing the dataset, there was enough memory for the normal oversampling procedure. Again, different filter-wrapper feature selection methods have been tested. A pre-selection of 400 features with ReliefF, followed by a selection of 200 features with CFS, reached the best accuracy. Note that the Group Selection introduced above could not be used for this track, as this algorithm violated the time restrictions, resulting in a time-out error returned by the server. Next, a cross validation step was implemented to find, for the intrinsic Random Forest Similarity, the best values for the number of nearest neighbours, the number of attributes considered to split a node, and the Random Forest type, canonical Random Forest or extremely randomized trees. The time restrictions, however, forced a limitation of the cross validation technique to validate only one of these parameters at a time. Further, the number of trees of the intrinsic Random Forest Similarity used during cross validation was set to 25 to reduce the execution time. Finally, different numbers of trees for the final classification with the intrinsic Random Forest Similarity have been tested. With a high number of trees, the classifier exceeded the time limitations; with a low number of trees, the accuracy decreased. Our tests showed that 1000 trees were a good compromise.
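The one-parameter-at-a-time validation forced by the time limit can be sketched generically as below. The helper names are hypothetical, and the scoring callback stands in for a (small, e.g. 25-tree) cross validation run; this is an illustration of the procedure, not the challenge code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Sketch of cross validation limited to one parameter at a time, as forced
// by the challenge's time restrictions. All names are illustrative.
public class OneAtATimeTuning {

    /** For each parameter in turn, picks the candidate value with the best
     *  score while all other parameters are held fixed. */
    public static Map<String, Integer> tune(Map<String, List<Integer>> candidates,
                                            ToDoubleFunction<Map<String, Integer>> cvScore,
                                            Map<String, Integer> defaults) {
        Map<String, Integer> best = new LinkedHashMap<>(defaults);
        for (Map.Entry<String, List<Integer>> p : candidates.entrySet()) {
            double bestScore = Double.NEGATIVE_INFINITY;
            int bestValue = best.get(p.getKey());
            for (int v : p.getValue()) {                     // vary only this parameter
                Map<String, Integer> trial = new LinkedHashMap<>(best);
                trial.put(p.getKey(), v);
                double score = cvScore.applyAsDouble(trial); // e.g. CV accuracy
                if (score > bestScore) { bestScore = score; bestValue = v; }
            }
            best.put(p.getKey(), bestValue);                 // fix it before tuning the next one
        }
        return best;
    }
}
```

Validating the parameters jointly would require a number of runs proportional to the product of the candidate counts; tuning them one at a time needs only their sum, at the cost of possibly missing parameter interactions.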
CHAPTER 5

Empirical Analysis

We introduced different approaches to increase the classification accuracy for genetic data in the previous sections. Different elementary distance functions for learning from equivalence constraints have been compared. Next, the new data representation with gene-pairs was tested on different datasets and compared to the original representation. Two basic ways to incorporate external biological knowledge, by feature weighting or by feature selection, have been tested on selected datasets. Finally, distance learning algorithms have been cross validated on a given set of datasets and evaluated at an international discovery challenge. The results of the experiments performed for this thesis are presented and analyzed in this chapter. For all experiments conducted for this thesis, the presented classification accuracy is the percentage of test instances correctly classified by a model previously trained on the training data. An instance is correctly classified if the class label predicted by the classification model is identical to the ground truth.

5.1 Distance Comparison for Learning From Equivalence Constraints in the Difference Space

As can be seen from Table 5.1, the L1 distance clearly outperforms the other distances. L1 reached the best accuracy on seven of the nine datasets used and was only one percent worse on the other two. On these two, Lymphoma and Liver, the Zero Variance Threshold distance reached the best result.

Distance Function        Average Accuracy in %
L1                       81.81
Square-Root              73.17
Squared                  68.18
Mahalanobis              64.36
Canberra                 63.02
Variance Threshold       62.21
Frequency (w = 0.5)      61.07
Zero Variance Threshold  59.29
Chi-Squared              57.91
Single Squared           56.49

Table 5.1: Average accuracies over the nine datasets for distance comparison for learning from equivalence constraints.

The second best result was achieved by the Square-Root distance and the third best by the Squared distance.
This result is not surprising, as these distances do not differ much from the L1 distance, which was shown to perform best in previous comparisons [61].

5.2 Transformation of Representation for Gene Expression Data

The gene-pair representation was compared with the plain single-gene representation and evaluated with the four different classifiers on 10 benchmark datasets. The following configurations have been used for the experiments:
• 400 pairs preselected by ReliefF.
• 100 pairs selected by CFS and used for training the models.
• 10-fold cross validation for datasets with more than 90 instances.
• Leave-one-out cross validation otherwise.
The four classifiers have been used with the following settings:
• RF: Random Forest with 25 trees.
• 7 NN: k-Nearest Neighbours with k = 7.
• EC: Learning from equivalence constraints with the L1 distance in the difference space.
• iRF: The intrinsic Random Forest Similarity with 25 trees.
In our preliminary tests, 7 was shown to be the most robust parameter choice for k. The tests have been conducted three times: first with no class noise injected, second with 10% and third with 20% random class noise. The results for the datasets are presented in separate tables, where rows called Original correspond to the original representation and rows named Pair include the results for the gene-pair representation. Each column includes the results for one learning algorithm, while the average over all four classifiers is presented in the last column. The best accuracy reached for each noise level is given in bold. The average value is green if the gene-pair representation outperforms the original one and red otherwise.

Breast Cancer  RF     7 NN   EC     iRF    Average
Original       74.11  81.22  75.56  72.44  75.83
Pair           73.11  78.56  77.00  72.44  75.28
Original 10%   73.44  81.11  75.33  73.33  75.80
Pair 10%       74.22  78.33  75.00  74.11  75.42
Original 20%   74.67  77.56  73.33  73.89  74.86
Pair 20%       73.33  75.56  75.44  73.11  74.36

Table 5.2: Classification accuracy for Breast Cancer of pair vs.
original representation.

Colon         RF     7 NN   EC     iRF    Average
Original      80.65  87.10  87.10  83.87  84.68
Pair          87.10  83.87  90.32  88.71  87.40
Original 10%  85.48  87.10  87.10  83.87  85.89
Pair 10%      80.65  83.87  83.87  82.26  82.66
Original 20%  79.03  85.48  83.87  83.87  83.06
Pair 20%      82.26  80.65  80.65  82.26  81.46

Table 5.3: Classification accuracy for Colon of pair vs. original representation.

5.2.1 Analysis of the Genetic Datasets

Tables 5.2 - 5.11 demonstrate the classification accuracies for the genetic and the non-genetic datasets with the original and the gene-pair representation. The results are analyzed

Embrional Tumours  RF     7 NN   EC     iRF    Average
Original           76.67  75.00  73.33  75.00  75.00
Pair               80.00  78.33  78.33  76.67  78.33
Original 10%       78.33  71.67  68.33  80.00  74.58
Pair 10%           78.33  75.00  76.67  80.00  77.50
Original 20%       63.33  73.33  75.33  71.67  70.92
Pair 20%           70.00  73.33  73.33  71.67  72.08

Table 5.4: Classification accuracy for Embrional Tumours of pair vs. original representation.

HeC Brain Tumours  RF     7 NN   EC     iRF    Average
Original           92.65  86.76  92.65  92.65  91.18
Pair               94.12  94.12  92.65  95.59  94.12
Original 10%       91.18  83.82  92.65  89.71  89.34
Pair 10%           91.18  91.18  92.65  89.71  91.18
Original 20%       86.76  85.29  91.18  89.71  88.24
Pair 20%           83.82  83.82  86.76  82.35  84.19

Table 5.5: Classification accuracy for HeC Brain Tumours of pair vs. original representation.

Leukemia      RF     7 NN   EC     iRF    Average
Original      95.83  95.83  95.83  94.44  95.48
Pair          97.22  97.22  97.22  97.22  97.22
Original 10%  95.83  98.61  95.83  95.83  96.53
Pair 10%      97.22  94.44  97.22  97.22  96.53
Original 20%  87.50  95.83  97.22  93.06  93.40
Pair 20%      97.22  91.67  95.83  95.83  95.14

Table 5.6: Classification accuracy for Leukemia of pair vs. original representation.
Lung Cancer   RF     7 NN   EC     iRF    Average
Original      98.48  95.52  98.91  98.48  97.85
Pair          98.12  98.79  98.85  97.94  98.43
Original 10%  98.42  95.21  98.06  98.85  97.64
Pair 10%      97.15  96.85  97.09  97.70  97.20
Original 20%  96.79  91.76  96.42  97.94  95.73
Pair 20%      92.73  92.48  95.09  94.67  93.74

Table 5.7: Classification accuracy for Lung Cancer of pair vs. original representation.

Lupus         RF     7 NN   EC     iRF    Average
Original      78.57  77.26  78.45  77.38  77.92
Pair          78.69  76.67  77.98  77.74  77.77
Original 10%  76.31  75.71  77.98  77.98  77.00
Pair 10%      77.62  75.00  78.21  77.38  77.05
Original 20%  75.00  71.55  76.19  74.52  74.32
Pair 20%      74.76  72.86  75.36  74.52  74.38

Table 5.8: Classification accuracy for Lupus of pair vs. original representation.

Lymphoma      RF     7 NN    EC     iRF    Average
Original      95.56  100.00  88.89  93.33  94.45
Pair          95.56  100.00  97.78  95.56  97.23
Original 10%  95.56  100.00  97.78  95.56  97.23
Pair 10%      95.56  97.78   97.78  95.56  96.67
Original 20%  95.56  95.56   97.78  93.33  95.56
Pair 20%      95.56  97.78   95.56  95.56  96.12

Table 5.9: Classification accuracy for Lymphoma of pair vs. original representation.

Arcene (non-genetic)  RF     7 NN   EC     iRF    Average
Original              86.06  84.89  87.00  85.44  85.85
Pair                  84.17  83.94  85.72  83.89  84.43
Original 10%          82.28  82.61  84.11  83.39  83.10
Pair 10%              79.94  80.89  82.06  79.89  80.70
Original 20%          76.17  78.28  79.56  77.72  77.93
Pair 20%              74.94  77.76  76.72  76.06  76.37

Table 5.10: Classification accuracy for Arcene of pair vs. original representation.

Mesh (non-genetic)  RF     7 NN   EC     iRF    Average
Original            92.06  87.30  88.89  90.48  89.68
Pair                85.71  85.71  76.19  84.13  82.96
Original 10%        88.89  87.30  87.30  87.30  87.70
Pair 10%            82.54  84.13  80.95  84.13  82.94
Original 20%        93.65  84.13  85.71  87.30  87.70
Pair 20%            80.95  82.54  88.89  80.95  83.33

Table 5.11: Classification accuracy for Mesh of pair vs. original representation.

for the no-noise experiments of the genetic datasets first. For six out of eight genetic datasets, the novel gene-pair representation outperformed the original one.
Only for Breast and Lupus did the original representation reach better results, and with less than 0.56% difference each. A summary of the averaged results can be found in Table 5.12. Note that for the following Tables 5.12-5.16, the last column (Difference) is the difference between the average accuracy of the gene-pair representation and that of the original representation.

Dataset            Average Original  Average Pair  Difference
Breast             75.83             75.28         -0.55
Colon              84.68             87.40          2.72
Embr. Tumours      75.00             78.33          3.33
HeC Brain Tumours  91.18             94.12          2.94
Leukemia           95.48             97.22          1.74
Lung Cancer        97.85             98.43          0.58
Lupus              77.92             77.77         -0.15
Lymphoma           94.45             97.23          2.78
Average            86.55             88.22          1.67

Table 5.12: Average classification accuracy over the four different classifiers for the genetic datasets.

Classifier  Average Original  Average Pair  Difference
RF          86.57             87.99         1.42
7 NN        87.34             88.45         1.11
EC          86.34             88.77         2.43
iRF         85.95             87.73         1.79

Table 5.13: Average classification accuracy over the genetic datasets for the tested classifiers.

The gene-pair representation increased the average classification accuracy over all classifiers and datasets by 1.67%. It is better for six out of eight datasets, and all four classifiers reached a better average result with the gene-pair representation, with learning from equivalence constraints showing the biggest increase. The results of Tables 5.12 and 5.13 show a clear benefit in classification accuracy for the novel representation, which is plausibly explained by the fact that genes often depend on each other. To validate the assumption that this is the reason for the increase, the two non-genetic datasets have also been tested. If the assumption is right, the boost in accuracy should be absent for these two datasets, or at least much less pronounced than for the genetic data.
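As Section 5.2 described, the gene-pair representation starts from all combinations of two genes. A minimal sketch of this enumeration (illustrative code, not the thesis implementation; how a pair is turned into a numeric feature value is left open) makes the combinatorial cost explicit, and thereby the need for the aggressive two-step pair selection:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the pair construction behind the gene-pair
// representation: every unordered combination of two genes becomes a
// candidate feature.
public class GenePairs {

    /** Enumerates all n * (n - 1) / 2 unordered gene index pairs. */
    public static List<int[]> allPairs(int numGenes) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < numGenes; i++) {
            for (int j = i + 1; j < numGenes; j++) {
                pairs.add(new int[] { i, j });
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The number of candidates grows quadratically: the Colon dataset with
        // its 2001 attributes already yields 2001 * 2000 / 2 = 2,001,000 pairs.
        System.out.println(allPairs(5).size()); // prints 10
    }
}
```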
5.2.2 Analysis of the Non-Genetic Datasets

Dataset  Average Original  Average Pair  Difference
Arcene   85.85             84.43         -1.42
Mesh     89.68             82.96         -6.72
Average  87.77             83.70         -4.07

Table 5.14: Average classification accuracy over the four different classifiers for the non-genetic datasets.

Classifier  Average Original  Average Pair  Difference
RF          89.06             84.94         -4.12
7 NN        86.10             84.83         -1.27
EC          87.95             80.96         -6.99
iRF         87.96             84.01         -3.95

Table 5.15: Average classification accuracy over the non-genetic datasets for the tested classifiers.

Tables 5.14 and 5.15 show how the gene-pair representation may fail for non-genetic data. These results indicate that the benefit of the gene-pair representation presumably relies on the interactions of genes.

5.2.3 Robustness of the Gene-Pair Representation to Noise

To measure the robustness of the gene-pair representation to class noise on the genetic datasets, experiments with 10% and 20% artificial class noise have been conducted.

Class Noise  Average Original  Average Pair  Difference
0%           86.55             88.22          1.67
10%          86.75             86.78          0.03
20%          84.51             83.93         -0.58

Table 5.16: Average classification accuracies for the noisy datasets.

From the results in Table 5.16, it can be seen that the novel representation is not as robust to noise as the original single-feature representation. With 0% noise, the pairs outperform the single genes by 1.67%, while with 10% class noise they perform almost equally. With more noise, the gene-pair representation loses its benefit and performs worse. With the gene-pair representation, it is easier to overfit noise.

5.2.4 Benefits and Limitations

The tests comparing the transformed representation with the original one showed that gene-pair features can definitely increase the classification accuracy for genetic datasets. It was also shown that the pairs often do not improve the accuracy on non-genetic datasets. Therefore, the assumption that the benefit relies on the reflection of gene interactions could not be disproved.
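For reference, the artificial class noise used in the robustness experiments can be injected as sketched below. This assumes one common definition, replacing a fraction of the training labels by a uniformly drawn different label; the exact procedure used in the thesis may differ in detail.

```java
import java.util.Random;

// Sketch of artificial class-noise injection: roughly `fraction` of the
// labels are replaced by a random *other* label. Illustrative only.
public class ClassNoise {

    /** Flips roughly `fraction` of the labels in place; labels are assumed
     *  to be 0 .. numClasses - 1. */
    public static void inject(int[] labels, double fraction, int numClasses, Random rng) {
        for (int i = 0; i < labels.length; i++) {
            if (rng.nextDouble() < fraction) {
                int wrong = rng.nextInt(numClasses - 1);            // one of the other labels
                labels[i] = wrong >= labels[i] ? wrong + 1 : wrong; // skip the true label
            }
        }
    }
}
```

With fraction = 0.1 or 0.2 this reproduces the 10% and 20% noise levels of the experiments above.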
It was also shown that the original representation is more robust to class noise, but it is still clearly worse when no noise is present. The learning from equivalence constraints classifier has been shown to be particularly sensitive to the novel representation, as this algorithm showed the biggest increase in accuracy for genetic data while performing worse for non-genetic data. Further, the gene-pair datasets have been reduced to 100 pairs for classification, while the original representation uses 200 features for training the models. The gene-pair representation was thus shown to improve the classification accuracy using only half of the original number of attributes. Note that this does not mean that only the information of 100 genes is present, as each pair is created from two different genes. The statistics of the selected pairs show a tendency to select a large number of pairs that share a common feature. In some cases, this feature was highly ranked in the single representation as well, but some features occurring in many pairs had not been ranked high in the single representation. A deeper and more thorough analysis of these statistics is a direction for future work. The reduction to 100 attributes reduces the model training time and memory usage at least by a factor of two. However, the two-step feature selection needed for detecting the best 100 pairs is very time-consuming and increases the duration of the whole training procedure dramatically. For each dataset describing a certain disease, a feature selection is done to find the 100 most important gene pairs for differentiating between healthy subjects and diseased patients. These selected gene pairs might not only be used for classification in machine learning, but can also give useful information about gene interactions relevant for the studied disease.
The most frequently selected and most discriminative gene pairs can be studied with respect to their biological background, and new knowledge can be extracted about how and why a pair influences the disease. These data might be of high interest for biologists and health professionals. Another benefit is the fact that attributes are represented by gene pairs. This makes it possible to retrieve external biological knowledge for the pairs and to integrate this knowledge into the classification algorithm to improve the accuracy.

5.3 Integration of Gene Ontology Semantic Similarity into Data Classification

The datasets have been reduced to contain only the attributes that can be matched exactly to a gene. Therefore, the number of attributes for the five test sets used for the experiments in this section decreased, as can be seen in Table 5.17.

Dataset            Original No. of Attributes  No. of Attributes with GO ids
Breast             24482                       2678
Colon              2001                        942
Embr. Tumours      7130                        5896
HeC Brain Tumours  5580                        4447
Lupus              172                         140

Table 5.17: Number of attributes for the original and the reduced datasets for the experiments with the GO semantic similarity.

The datasets had to be reduced to contain only attributes that can be associated with GO identifiers. The classification power of an attribute was not considered in this reduction. Some discriminative attributes may have been deleted, and therefore the accuracies reported for these experiments are not comparable with the results of the pair representation evaluation.

5.3.1 GO-Based Feature Weighting

Four datasets have been tested with different similarity calculation methods and two different approaches of semantic similarity based feature weighting. All tests have been run under the same conditions as described for the tests with the gene-pair representation. The result Tables 5.18-5.21 show the classification accuracies for the tested similarity methods.
Each row corresponds to one method, where the last one, called "Random", uses no similarity values but random values in the range between 0 and 1. The first column shows the results for feature weighting within the k-Nearest Neighbour classifier, and the second one shows the corresponding results for the direct weighting of the pairs while constructing them. The third column shows the average similarity value of the 100 selected pairs, while the last column shows the average similarity value over all calculated pairs. The row No FW shows the results of the reduced pair representation without feature weighting.

Breast         FW in KNN  FW in Pair  Av. Sim Selected  Av. Sim All
No FW          70.22      70.22       -                 -
Max-Resnik     68.22      69.11       2.6258            2.5968
Max-Lin        70.33      69.33       0.7171            0.6666
Max-Wang       70.56      70.22       0.7915            0.7684
Azuaje-Resnik  69.33      69.67       0.8828            0.8596
Azuaje-Lin     70.11      68.67       0.2457            0.2266
Azuaje-Wang    69.56      69.56       0.3617            0.3445
Schoen-Lin     70.67      68.78       0.2379            0.2135
GIC            66.11      71.89       369.2300          156.7375
Random         70.00      70.56       0.4933            0.4997

Table 5.18: Classification accuracy of the GO-based feature weighting for Breast.

Colon          FW in KNN  FW in Pair  Av. Sim Selected  Av. Sim All
No FW          88.71      88.71       -                 -
Max-Resnik     87.10      88.71       3.8565            3.7388
Max-Lin        87.10      88.71       0.7414            0.8292
Max-Wang       87.10      88.71       0.7723            0.8774
Azuaje-Resnik  87.10      88.71       1.0621            1.1924
Azuaje-Lin     87.10      88.71       0.2329            0.2755
Azuaje-Wang    87.10      88.71       0.3422            0.3894
Schoen-Lin     87.10      88.71       0.1664            0.1890
GIC            85.48      85.48       207.7249          180.0117
Random         87.10      88.71       0.4753            0.5019

Table 5.19: Classification accuracy of the GO-based feature weighting for Colon.

Embrional Tumours  FW in KNN  FW in Pair  Av. Sim Selected  Av. Sim All
No FW              73.33      73.33       -                 -
Max-Resnik         68.33      70.00       3.4470            3.5950
Max-Lin            78.33      70.00       0.8385            0.8012
Max-Wang           75.00      70.00       0.8837            0.8576
Azuaje-Resnik      76.67      71.67       1.1160            1.0777
Azuaje-Lin         76.67      70.00       0.2695            0.2541
Azuaje-Wang        76.67      70.00       0.3763            0.3672
Schoen-Lin         73.33      71.67       0.1957            0.1734
GIC                76.67      70.00       455.7339          204.6917
Random             76.67      73.33       0.5219            0.4976

Table 5.20: Classification accuracy of the GO-based feature weighting for Embrional Tumours.

Lupus          FW in KNN  FW in Pair  Av. Sim Selected  Av. Sim All
No FW          79.17      79.17       -                 -
Max-Resnik     78.57      79.52       4.4469            4.5364
Max-Lin        78.57      79.64       0.8446            0.8565
Max-Wang       79.05      79.17       0.8803            0.8931
Azuaje-Resnik  78.45      79.52       1.2879            1.2942
Azuaje-Lin     78.21      79.52       0.2814            0.2809
Azuaje-Wang    78.57      79.05       0.3743            0.3832
Schoen-Lin     79.05      79.40       0.1872            0.1791
GIC            75.95      78.21       172.5945          154.6439
Random         79.05      79.17       0.5189            0.5052

Table 5.21: Classification accuracy of the GO-based feature weighting for Lupus.

The average accuracies over the four datasets have been calculated and are shown in Table 5.22.

Method         FW in KNN  FW in Pair
No FW          77.86      77.86
Max-Resnik     75.56      76.84
Max-Lin        78.58      76.92
Max-Wang       77.93      77.44
Azuaje-Resnik  77.89      76.98
Azuaje-Lin     78.02      76.73
Azuaje-Wang    77.98      77.25
Schoen-Lin     77.54      76.72
GIC            76.05      78.06
Random         78.21      77.94
Average        77.53      77.21

Table 5.22: Average accuracies for semantic similarity based feature weighting with different methods.

The results demonstrate that for each dataset, at least one combination of methods could beat or at least reach the non-weighted gene-pair representation. But this does not seem to be an effect of the semantic similarity. The random similarity matched the accuracy of the experiments without weighting for two datasets and increased it for the other two. This result is surprising and was not expected. The experiments with the gene-pair representation showed that the new representation tends to overfit noise. We assume that the increase in accuracy is caused not by the weighting with semantic similarity, but by the injection of noise into the pairs, making the system more robust against overfitting. More experiments to test this hypothesis are a direction for future work.
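The feature weighting inside the k-Nearest Neighbour classifier can be pictured as a weighted distance. The sketch below assumes the weights simply scale each feature's contribution to an L1 distance, an illustration of the idea rather than the exact weighting scheme of the thesis:

```java
// Sketch of feature weighting in a nearest-neighbour classifier: each (pair)
// feature's contribution to an L1 distance is scaled by a weight, e.g. a GO
// semantic similarity or a random value in [0, 1]. Illustrative only.
public class WeightedL1 {

    /** Weighted L1 distance: sum_i w_i * |a_i - b_i|. */
    public static double distance(double[] a, double[] b, double[] w) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) {
            d += w[i] * Math.abs(a[i] - b[i]);
        }
        return d;
    }
}
```

With this view, random weights perturb the neighbourhood structure without encoding any biology, which is consistent with the observation above that random values can work as well as similarity-derived ones.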
Table 5.22 shows the average values for each combination. For feature weighting in k-NN, the Max-Lin combination performed best, while for weighting the pairs, the GIC method reached the best accuracy. But again, the Random reference achieved a comparable result. It can be concluded that weighting pair features as introduced in this thesis can improve the classification accuracy, but a reference test with random similarity values could improve the accuracy, too.

5.3.2 GO-Based Feature Selection

Next, experiments with the GO semantic similarity based feature selection have been performed. The feature weighting tests could not show a clear benefit of any similarity method for increasing the classification accuracy, so it was not obvious which similarity method should be used for the GO-based feature selection. Because for all combinations involving the Max algorithm a large share of the similarities equals 1, these methods cannot be used. GO-based feature selection is defined to pre-select pairs with a high similarity; therefore, the pairs selected by feature weighting have been analyzed with the goal of finding a similarity method for which the average value of the selected pairs is higher than the average over all pairs. The ratio of the average similarity of the selected pairs (s_selected) to the similarity of all pairs (s_all), in percent, was calculated and is shown in Table 5.23. The difference between the average semantic similarity of the selected pairs and that of all pairs is negligible for most methods. For the Schoen method, the similarity of the selected pairs is 4% higher. The GIC semantic similarity, however, was shown to be 71% higher for the selected pairs than for all pairs. Therefore, the GIC similarity was chosen for testing the GO-guided feature selection.
It was assumed that if the pairs selected by common feature selection algorithms have a higher average GIC similarity, then a pre-selection of pairs with high GIC similarity can improve the feature selection. To test this assumption, the 400 pairs with the highest GIC semantic similarity have been pre-selected for each dataset, followed by a CFS feature selection reducing the 400 pairs to 100. The combination with

Similarities   Ratio in %
Max-Resnik     99.54
Max-Lin        100.06
Max-Wang       98.16
Azuaje-Resnik  98.71
Azuaje-Lin     99.80
Azuaje-Wang    98.26
Schoen-Lin     104.21
GIC            171.30
Random         100.25

Table 5.23: Comparing the ratio of the average similarity between the selected pairs and all pairs.

a common feature selection method (CFS) is needed to remove redundant features. The feature weighting was switched off, and the tests have been performed under the same classification conditions. The results can be seen in Tables 5.24-5.28. Random Forest, k-Nearest Neighbour and the intrinsic Random Forest Similarity classifier have been evaluated in this context.

Breast
GIC selection  RF     k-NN   iRF    Average
no             69.00  70.22  68.00  69.07
yes            74.22  73.56  72.33  73.37

Table 5.24: Classification accuracy of GO-based feature selection for Breast.

Colon
GIC selection  RF     k-NN   iRF    Average
no             88.71  88.71  87.10  88.17
yes            83.87  88.71  85.48  86.02

Table 5.25: Classification accuracy of GO-based feature selection for Colon.

Embrional Tumours
GIC selection  RF     k-NN   iRF    Average
no             75.00  73.33  73.33  73.89
yes            81.67  81.67  78.33  80.56

Table 5.26: Classification accuracy of GO-based feature selection for Embrional Tumours.

HeC Brain Tumours
GIC selection  RF     k-NN   iRF    Average
no             89.71  91.18  89.71  90.20
yes            92.65  94.12  92.65  93.14

Table 5.27: Classification accuracy of GO-based feature selection for HeC Brain Tumours.

Lupus
GIC selection  RF     k-NN   iRF    Average
no             80.60  79.17  79.76  79.84
yes            78.21  79.05  78.10  78.45

Table 5.28: Classification accuracy of GO-based feature selection for Lupus.
Based on these results, the average accuracy over all datasets and classification methods for the gene-pair representation without semantic feature selection support was calculated as 80.23%. The pair feature pre-selection with the GIC semantic similarity improved this result by 2.08% to an average value of 82.31%. Note that for the datasets where the average similarity of the selected pairs was much higher than that of all pairs, the GIC-based pre-selection of features could improve the accuracy. The biggest increase in accuracy was reached on the Embrional Tumours dataset, where the accuracy increased by 6.67% to 80.56%. This is the best accuracy reached for this dataset in all experiments conducted for this thesis and is even higher than for the experiments with the full set of attributes (Table 5.4). The fact that the GIC similarity of the selected pairs is higher is further evidence that the boost of the pair representation is associated with the biological semantics of the genes forming the pairs. It was shown that including external biological knowledge into classification can boost the accuracy for most datasets. The bad performance on the Lupus dataset is not surprising, as the gene-pair representation failed for this dataset before, too. One reason for this might be that gene interactions are either not important for Lupus or not reflected in the dataset. The other failing dataset, Colon, performed well in the gene-pair representation tests, but its results for the semantic similarity based feature weighting tests are strange, as nearly all tested similarity methods reached the same classification accuracy (Table 5.19). The number of attributes and instances, as well as the calculated similarities, do not differ significantly from the other datasets.
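The GO-guided two-step selection evaluated in this section, pre-selecting the pairs with the highest GIC similarity and then letting a data-driven method remove redundancy, can be sketched as follows. The reducer callback stands in for WEKA's CFS; all names are illustrative, not the thesis implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.UnaryOperator;

// Sketch of the GO-guided two-step selection: keep the pairs with the
// highest GIC semantic similarity (400 in the thesis), then hand them to a
// data-driven reducer (CFS in the thesis, an arbitrary callback here).
public class GoGuidedSelection {

    /** Returns the reduced pair index list: top `preSelect` pairs by
     *  similarity, passed through the data-driven reducer. */
    public static List<Integer> select(double[] gicSimilarity, int preSelect,
                                       UnaryOperator<List<Integer>> dataDrivenReducer) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < gicSimilarity.length; i++) idx.add(i);
        idx.sort(Comparator.comparingDouble((Integer i) -> -gicSimilarity[i])); // descending
        List<Integer> top = new ArrayList<>(idx.subList(0, Math.min(preSelect, idx.size())));
        return dataDrivenReducer.apply(top);
    }
}
```

The division of labour matches the analysis above: the similarity-based step alone keeps redundant and noisy pairs, while the data-driven step alone is expensive on millions of candidates; chaining them combines cheap biological pre-filtering with redundancy removal.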
5.4 Preliminary Results for the RSCTC'2010 Discovery Challenge

For all experiments, the intrinsic Random Forest Similarity was used as described in Section 4.5. By cross validating the training sets with different numbers of nearest neighbours (k), minimum leaf sizes (l), numbers of features to split a node (K), numbers of trees (n) and the two different learning algorithms, plain Random Forest (RF) and Random Forest with extremely randomized trees (ERT) [20], the best accuracy was reached for the datasets using the configurations of the intrinsic Random Forest Similarity shown in Table 5.29.

Dataset  k   l  K           n       learner
1        7   1  log(N) + 1  50,000  RF
2        7   1  log(N) + 1  50,000  ERT
3        5   1  log(N) + 1  50,000  RF
4        7   1  log(N) + 1  50,000  ERT
5        7   1  log(N) + 1  50,000  RF
6        11  1  log(N) + 1  50,000  ERT

Table 5.29: Best configurations of the intrinsic Random Forest Similarity for the Basic Track datasets, where N is the number of attributes of the dataset.

The datasets have been reduced to 400 genes by GainRatio, followed by a 5-features-40-iterations Group Selection. On February 23, 2010, four participants shared the first place with an average accuracy of 75% in the preliminary evaluations. We reached 74% and share the second place with another participant. In the preliminary evaluations, a position within the top six of 73 participants was thus reached for the Basic Track. In the first trials for the Advanced Track, the memory or the time restrictions were exceeded. The best accuracy we reached in the preliminary evaluations was 78.73%. 22 participants submitted solutions, and the current leader has reached a classification accuracy of 79.78%. With our best accuracy of 78.73%, position ten was reached, with a gap of about 1% to the current leader.

CHAPTER 6

Conclusion

6.1 Summary

A flexible object-oriented Java machine learning workbench has been developed for comparing different learning algorithms, in particular distance learning techniques.
The WEKA-based framework performs a set of experiments declared in a configuration file and reports the classification accuracies. A summary of the experimental results is saved to an HTML file. The workbench was designed for tests with genetic data but can also be used in any other area. The workbench was used to compare different distances for learning from equivalence constraints, where the L1 distance was shown to perform best. Two novel distance functions, the Variance Threshold and the Frequency distance, have been introduced as candidates to replace the L1 distance in learning from equivalence constraints. The Variance Threshold distance was designed to group the expression values into five categories: strongly underexpressed, underexpressed, normal, overexpressed and strongly overexpressed. The second new distance function, the Frequency distance, was designed to consider the distribution of the data and has been tested in combination with the L1 distance. In the presented experiments, the weight was set to 0.5, so that the influence of the Frequency and the L1 distance was equal. Most of the other distance functions used were originally created for distance calculations between two vectors with more than one element. Here, however, the distance functions have been applied to scalar values, and therefore the strength of these functions could not be utilized. The strength of the L1 distance is not surprising, as the information and the original dimensionality of the data remain unchanged and the data is not distorted. A new representation for genetic datasets has been developed that considers the interaction between genes. The new representation was shown to increase the accuracy on genetic data by 1.67% on average. The assumption that this increase is caused by the reflection of gene-gene interactions was supported by the worse results obtained when testing the gene-pair representation on non-genetic datasets.
Another fact supporting this assumption is that the GIC semantic similarity of the selected pairs was 71% higher compared to all pairs. The gene-pair representation increased the accuracy for six out of eight genetic datasets. Only for Breast and Lupus did the accuracy decrease, and by no more than 0.55%. Of the tested classifiers, learning from equivalence constraints showed the biggest change in classification accuracy when using the new representation. The tests with artificially injected noise showed that the gene-pair representation is prone to overfitting noise. Algorithms for automatically retrieving the GO terms associated with a provided NCBI accession number or an official gene symbol have been developed. Further, the workbench is able to calculate the GO semantic similarity of two GO terms and of two genes by several different methods, such as Lin, Resnik, Wang, Azuaje, Maximum and GIC. A novel algorithm (called the Schoen similarity) for calculating the GO similarity of two genes based on their GO terms was introduced and was shown to perform comparably to state-of-the-art methods. The Schoen similarity calculates the average value of the best third of the similarities out of a set of similarities between GO terms. In our experiments, the Schoen similarity performed worse than the Max and Azuaje similarities when combined with the Lin algorithm. For a better comparison between the Schoen, the Max and the Azuaje similarity, they could be tested on a set of gene-pair examples and compared to manually crafted similarity values provided by a microbiologist, as presented by Wang et al. [58]. Three different approaches for the incorporation of semantic similarity into data classification have been implemented and compared. The GO-based feature weighting was shown to reach no better results than a reference with random similarity values.
As described above, the good result for weighting with random values may also have been caused by the fact that the gene-pair representation without GO-based guidance overfits noise. We could not find a similarity calculation method that clearly outperforms the other methods on average over all the tested datasets. However, we observed that the average semantic similarity of the most discriminative gene-pairs is significantly higher than the average semantic similarity over all pairs for the GIC similarity in most datasets. We tried to exploit this finding, that gene pairs with higher semantic similarity also tend to be more discriminative, by selecting the pairs with the highest semantic similarity for classification. We have no explanation why the average similarity of selected pairs is higher than the similarity of all pairs only for the GIC similarity. One reason might be that the other similarity calculation methods split the similarity calculation into elementary similarities between two GO terms, while the GIC method retrieves the similarity directly by analyzing the GO graph for all the terms as a whole. Another reason could be that the GO still contains too much noise. To avoid using unreliable terms, the methods could be modified to use only GO terms with a high evidence level. The evidence level is given for each term in the annotation database and is a measure of how trustworthy the annotation is. The GO-based feature selection was shown to perform well in combination with a common data-driven feature selection method (CFS) and the GIC similarity, improving the classification accuracy by 2.06% on average over the tested datasets in comparison to using the plain feature selection only. The GO-based feature selection chooses features based only on their semantic similarity values, resulting in a set of features that will include redundancy and noise.
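The proposed evidence-level filtering could look like the following sketch. The GO evidence codes in the trusted set are real (experimental) codes from the GO annotation guide, but modelling an annotation as a "GO_ID|EVIDENCE" string and the exact choice of trusted codes are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EvidenceFilter {

    // Experimental GO evidence codes, generally considered the most reliable;
    // computationally inferred annotations (IEA) are excluded.
    static final Set<String> TRUSTED =
            new HashSet<>(Arrays.asList("EXP", "IDA", "IPI", "IMP", "IGI", "IEP"));

    /** Keep only annotations with a trusted evidence code. Each annotation is
     *  modelled here as a "GO_ID|EVIDENCE" string purely for illustration. */
    static List<String> filterByEvidence(List<String> annotations) {
        List<String> kept = new ArrayList<>();
        for (String a : annotations) {
            String code = a.substring(a.indexOf('|') + 1);
            if (TRUSTED.contains(code)) kept.add(a);
        }
        return kept;
    }
}
```

Running the similarity methods only on the filtered term set would directly implement the "high evidence level" restriction suggested above.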
It is clear that by itself, the semantic similarity is not a good feature selector, but it has proven to be a good guide for preliminary feature selection that reduces the complexity of the usual feature selection step. The common feature selection method can then eliminate the redundant and irrelevant features. Therefore, a combination of both normally leads to a better classification accuracy. To summarize, it was shown in this thesis that including background biological knowledge into classification tasks, in the form of guiding the feature selection for the gene-pair representation, can improve the accuracy for genetic datasets. The intrinsic Random Forest Similarity was compared to other state-of-the-art learning techniques by participating in an international classification challenge with genetic datasets. A new feature selection method, Group Selection, has been developed, and the parameters of the classifier were optimized for the challenge datasets.

6.2 Limitations and Future Work

The implemented framework needs access to a running MySQL server to use the GO annotation database. Further, the GO OBO file as well as the WEKA and GO4J packages need to be available on the classpath. The code is written in an object-oriented style, but not every described method or algorithm can be accessed in an object-oriented manner. In its current state, the framework cannot easily be transferred to another computer. The workbench could be reorganized to include all dependent files and be packed into a JAR file. Each method described in this thesis could be implemented in a separate class accessible from any other Java code. One should be able to use every algorithm and technique described in this thesis by including a single JAR file without any further installation or configuration. The gene-pair representation was shown to perform well on binary classification problems, but was not tested on multiclass datasets.
Minor tests on the challenge datasets showed worse results for the multiclass datasets. More tests have to be performed to better understand why the gene-pair representation works well only on binary classification problems and how this approach can be adapted to multiclass datasets. The classification accuracy of the gene-pair representation is highly dependent on several parameters, for example the number of pairs to be selected. Different filter and wrapper methods can be tested with different numbers of features to be selected in each iteration. The feature subset evaluation used, CFS, is very time-consuming. Further experiments can be conducted to find a faster evaluation method that performs comparably to the CFS method. The Group Selection introduced for the Basic Track in the RSCTC’2010 Machine Learning Challenge might outperform the common CFS algorithm. The gene-pair representation selects the best pairs for classification while the original representation selects the best genes. Each of the two representations, the single-gene and the gene-pair representation, completely ignores the information provided by the other technique. A combination of both methods, selecting the best pairs and adding them to the best single genes, might combine the benefits of both and result in a better accuracy. Further, a more complex representation, for example triplets of genes, could be tested. Another crucial point for the gene-pair representation is the calculation of the pair values. The value for a pair is calculated as the L1 distance between the two genes of the pair. Other distances might be tested to better reflect the differences between two single genes. One more way to improve the gene-pair representation could be to normalize the pairs. This can be especially helpful when including the semantic similarity. The GO semantic similarities tested used all GO terms found for a given gene. A better solution might be to filter the GO terms by their level of evidence.
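The pair construction discussed above, one feature per gene pair valued as the L1 distance between the two expression values, can be sketched as follows. Building all n(n-1)/2 pairs is shown for illustration; in the experiments only the most discriminative pairs are kept afterwards.

```java
public class GenePairs {

    /** Transform a single-gene expression vector into the gene-pair
     *  representation: one feature per unordered gene pair, valued as the
     *  L1 distance |g_i - g_j| between the two expression values. */
    static double[] toPairs(double[] genes) {
        int n = genes.length;
        double[] pairs = new double[n * (n - 1) / 2];
        int k = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                pairs[k++] = Math.abs(genes[i] - genes[j]);
        return pairs;
    }
}
```

Swapping `Math.abs(genes[i] - genes[j])` for another scalar distance is exactly the "other distances might be tested" direction mentioned above, and a normalization step could be applied to the resulting pair vector.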
Tests can be performed with GO terms derived from the three different branches of the Gene Ontology separately. The results depend on the version of the GO OBO file and the annotation database. As both are updated in short time periods, the usage of newer versions can change the results. For the feature weighting experiments, the Euclidean weighting has been used. Different weighting approaches can be compared. For feature selection, only one of the presented methods was tested; the other methods can be tested, too. Analyzing the dependencies between the feature selection ranking and the semantic similarity of the pairs can give a better understanding of how to select features based on their GO similarity. A histogram of the dependencies between the GainRatio of the gene-pairs and the semantic similarity of the pairs can be found in Appendix A for the tested datasets Breast Cancer, Colon, Embrional Tumours, HeC Brain Tumours and Lupus. The pairs are divided into 100 bins based on their corresponding semantic similarity values, where each bin contains the same number of pairs. These histograms show that the first groups, those with the highest semantic similarity, are always ranked low (for Breast Cancer, for example, the first three groups are ranked with less than 0.035, while the best-ranked group (5) reached 0.24). Notice that this low value is an average over the GainRatio of the pairs included in the bin; there can also be gene-pairs of high importance within these bins. Therefore, the GO-based feature selection can perhaps be further improved by excluding those pairs with the highest similarity values which are not discriminative. To better analyze these trends and to use the knowledge that can be derived from these correlations for better feature selection is a promising direction for future work.
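The binning behind the Appendix A histograms can be sketched as below: sort the pairs by semantic similarity in descending order, split them into equally sized bins and average the GainRatio within each bin. The assumption that the number of pairs is divisible by the number of bins is made for simplicity.

```java
import java.util.Arrays;
import java.util.Comparator;

public class SimilarityBins {

    /** Average GainRatio per bin, where pairs are sorted by semantic
     *  similarity (descending) and split into equally sized bins, as in
     *  the Appendix A histograms. */
    static double[] binnedGainRatio(double[] similarity, double[] gainRatio, int bins) {
        Integer[] idx = new Integer[similarity.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort indices so the most similar pairs come first.
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> -similarity[i]));
        int per = similarity.length / bins; // assumes length divisible by bins
        double[] avg = new double[bins];
        for (int b = 0; b < bins; b++) {
            double sum = 0;
            for (int i = 0; i < per; i++) sum += gainRatio[idx[b * per + i]];
            avg[b] = sum / per;
        }
        return avg;
    }
}
```

With 100 bins, `avg[0]` corresponds to the leftmost (highest-similarity) group in the histograms; the excluding-undiscriminative-pairs idea above amounts to dropping bins whose average falls below a threshold.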
The GO feature selection was shown to perform well only if the average similarity of the selected pairs in preliminary tests was much higher than the average over all pairs. A more careful study can be made to better understand why the semantic similarity of the selected pairs is higher only in some of the datasets and how these datasets differ from the others.

CHAPTER 7 Acknowledgments

I wish to thank my supervisor Dr. Alexey Tsymbal for helping me with any question, having time whenever I needed help and for giving professional expert advice. Further, I thank Prof. Dr. Martin Stetter for recommending me to Siemens, for guiding my thesis and for supporting my work with expert knowledge. The thesis was funded by the Health-e-Child project and by Siemens, with special thanks to Dr. Martin Huber. Next, I want to thank the University of Applied Science Weihenstephan for teaching me a broad range of different bioinformatics-related topics, especially Prof. Dr. Frank Lesske, who always had time for extended discussions. Finally, I wish to thank my wonderful girlfriend Maria for supporting me over the last years and giving me new strength in stressful times, and my parents for social and financial support.

List of Listings

3.1 Example of an ARFF header section
3.2 Example of an ARFF data section
3.3 Part of a configuration file for the “distance learning” framework
3.4 Term definition of multidrug transporter activity in OBO format
4.1 Java code of the Schoen similarity calculation between two genes. The method is called with the GO terms for each gene as input parameters and uses the class variable model that includes the GO graph.
4.2 Algorithm to oversample a dataset (inst) with unequally distributed classes.
4.3 Pseudo code of the Group Selection algorithm.
List of Figures

3.1 A gene is expressed into a gene product like RNA or proteins. For this, the double-stranded DNA of the gene is transformed into a single-stranded RNA by transcription. For building a protein, the RNA is translated into a protein sequence by protein biosynthesis.
3.2 a) An example of a microarray visualization with approximately 37,500 probes. b) An enlarged view of the blue rectangle of the microarray shown in a).
3.3 Example of a decision tree where nodes are represented as brown circles and a blue rectangle represents a leaf. For classifying a new instance, two decisions are needed at most. Starting at A, the value of the feature can either be <= 1 or > 1. In the first case, the new instance reaches a leaf and is classified as diseased. For the second case, one more decision is needed, node B, where the feature value can either be true or false.
3.4 A 2-dimensional example of k-NN classification. The test case (red circle) should be classified either to the green squares or to the blue stars. The solid border includes k = 3 nearest neighbors of the test case. In this case, the test instance is classified as a green square as there are two squares and only one blue star. The dashed border includes k = 6 nearest neighbors. In this case, there are more blue stars (4) than green squares (2), which means that the test case is classified as a blue star.
3.5 The general workflow of the distance learning framework is shown. An ARFF file containing several system configuration parameters as attributes and instances representing single experiments is read first. For each line, corresponding to each single run, a new experiment is started with the classification configuration specified in the ARFF file. For each experiment, several cross validation iterations can be performed.
For each cross validation iteration, the data is split randomly into a training and a testing set. Next, an imputation method is run to predict missing values. Before training the different classification models specified in the configuration ARFF file, the sets are transformed into the product, difference and comparative space for learning distance functions. Then, the algorithm classifies the test set with unseen class labels. At the end of all iterations, the classification accuracies are averaged over all runs and reported to the resulting text file and to the console.
3.6 An example of a set of terms under the biological process node. A screenshot from the ontology editing tool OBO-Edit (http://www.oboedit.org). The nodes represent GO terms where the labeled arcs indicate relations between two GO terms.
4.1 The structure of the reorganized object-oriented distance learning framework. The TestExecuter class (green rectangle) is detailed in Figure 4.2.
4.2 Detailed view of the TestExecuter class. The parts not exploited in the experiments in this thesis are grayed out. The cross validation procedure (green rectangle) is described in more detail in Figure 4.3.
4.3 Detailed structure of a cross validation iteration where four types of learning algorithms are used in one experiment.
4.4 Output of a single experiment for the Health-e-Child Brain Tumours dataset in HTML format
4.5 Modified framework for transforming the dataset into the gene-pair representation. The new components are shown in blue.
4.6 Workflow of the transformation from the single-gene to the gene-pair representation, where n is the number of features and N is the number of pairs to be selected.
4.7 Translation of the original dataset into a GO ID dataset
4.8 Modification of the framework to incorporate the GO semantic similarity in three different ways.

List of Tables

3.1 A list of the benchmark datasets used for experiments conducted for this thesis
3.2 The six datasets provided for the Basic Track of RSCTC’2010 Machine Learning Challenge
5.1 Average accuracies over the nine datasets for distance comparison for learning from equivalence constraints.
5.2 Classification accuracy for Breast Cancer of pair vs. original representation.
5.3 Classification accuracy for Colon of pair vs. original representation.
5.4 Classification accuracy for Embrional Tumours of pair vs. original representation.
5.5 Classification accuracy for HeC Brain Tumours of pair vs. original representation.
5.6 Classification accuracy for Leukemia of pair vs. original representation.
5.7 Classification accuracy for Lung Cancer of pair vs. original representation.
5.8 Classification accuracy for Lupus of pair vs. original representation.
5.9 Classification accuracy for Lymphoma of pair vs. original representation.
5.10 Classification accuracy for Arcene of pair vs. original representation.
5.11 Classification accuracy for Mesh of pair vs. original representation.
5.12 Average classification accuracy over the four different classifiers for the genetic datasets.
5.13 Average classification accuracy over the genetic datasets for the tested classifiers.
5.14 Average classification accuracy over the four different classifiers for the non-genetic datasets.
5.15 Average classification accuracy over the non-genetic datasets for the tested classifiers.
5.16 Average classification accuracies for noisy datasets.
5.17 Number of attributes for the original and the reduced datasets for experiments with the GO semantic similarity.
5.18 Classification accuracy of the GO-based feature weighting for Breast.
5.19 Classification accuracy of the GO-based feature weighting for Colon.
5.20 Classification accuracy of the GO-based feature weighting for Embrional Tumours.
5.21 Classification accuracy of the GO-based feature weighting for Lupus.
5.22 Average accuracies for semantic similarity based feature weighting with different methods.
5.23 Comparing the ratio of the average similarity between selected pairs and all pairs.
5.24 Classification accuracy of GO-based feature selection for Breast.
5.25 Classification accuracy of GO-based feature selection for Colon.
5.26 Classification accuracy of GO-based feature selection for Embrional Tumours.
5.27 Classification accuracy of GO-based feature selection for HeC Brain Tumours.
5.28 Classification accuracy of GO-based feature selection for Lupus.
5.29 Best configurations of the intrinsic Random Forest Similarity for Basic Track datasets, where N is the number of attributes of the dataset.

Bibliography

[1] RSCTC’2010 Discovery Challenge: Mining DNA microarray data for medical diagnosis and treatment. http://tunedit.org/challenge/RSCTC-2010-A, to be published in RSCTC’2010 proceedings, 2010.
[2] Michael Ashburner, Catherine A. Ball, and Judith A. Blake. Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, May 2000.
[3] Francisco Azuaje, Haiying Wang, and Olivier Bodenreider. Ontology-driven similarity approaches to supporting gene functional assessment. The Eighth Annual Bio-Ontologies Meeting, 2008.
[4] Aharon Bar-Hillel. Learning from weak representations using distance functions and generative models. PhD thesis, The Hebrew University of Jerusalem, 2006.
[5] Abdelghani Bellaachia, David Portnoy, et al. E-CAST: A data mining algorithm for gene expression data. BIOKDD02: Workshop on Data Mining in Bioinformatics, 2002.
[6] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. J Comput Biol, 6(3-4):281–297, 1999.
[7] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://archive.ics.uci.edu/ml/, 1998.
[8] Markus Brameier and Carsten Wiuf. Co-clustering and visualization of gene expression data and gene ontology terms for Saccharomyces cerevisiae using self-organizing maps. Journal of Biomedical Informatics, 40(2):160–173, 2007.
[9] Alvis Brazma and Jaak Vilo. Gene expression data analysis. FEBS Lett, 480:17–24, 2000.
[10] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001.
[11] Edgar Chávez and Gonzalo Navarro. A probabilistic spell for the curse of dimensionality. Algorithm Engineering and Experimentation, pages 147–160, 2001.
[12] Zheng Chen and Jian Tang. Using gene ontology to enhance effectiveness of similarity measures for microarray data.
In IEEE International Conference on Bioinformatics and Biomedicine, pages 66–71, 2008.
[13] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–945, October 2004.
[14] R. De Maesschalck. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50(1):1–18, January 2000.
[15] Dekang Lin. An information-theoretic definition of similarity. Morgan Kaufmann, pages 296–304, 1998.
[16] Sandrine Dudoit, Jane Fridlyand, and Terence P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77–87, 2002.
[17] Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1), 2009.
[18] Valur Emilsson, Gudmar Thorleifsson, and Bin Zhang. Genetics of gene expression and its effect on disease. Nature, 452(7186):423–428, March 2008.
[19] Yoav Freund and Robert E. Schapire. A short introduction to boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, volume 14, pages 1401–1406, 1999.
[20] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
[21] Zhan Guoqing, Lu Qingming, Lu Qiang, and Li Yixue. A set of API used to manipulate gene ontology vocabulary. http://www.bioinformatics.org/GO4J/, 2006.
[22] M. Hall. Correlation-based Feature Selection for Machine Learning. PhD thesis, Department of Computer Science, University of Waikato, 1998.
[23] H. Heijerman. Infection and inflammation in cystic fibrosis: A short review. Journal of Cystic Fibrosis, 4:3–5, August 2005.
[24] Tomer Hertz. Learning Distance Functions: Algorithms and Applications. PhD thesis, The Hebrew University of Jerusalem, 2006.
[25] J. J. Jiang and D. W. Conrath.
Semantic similarity based on corpus statistics and lexical taxonomy. In International Conference on Research in Computational Linguistics (ROCLING X), September 1997.
[26] Kenji Kira and Larry A. Rendell. A practical approach to feature selection. In ML’92: Proceedings of the Ninth International Workshop on Machine Learning, pages 249–256, San Francisco, CA, USA, 1992. Morgan Kaufmann.
[27] Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.
[28] Ron Kohavi, Pat Langley, and Yeogirl Yun. The utility of feature weighting in nearest-neighbor algorithms. In Proceedings of the Ninth European Conference on Machine Learning, pages 85–92. Springer-Verlag, 1997.
[29] Rafal Kustra and Adam Zagdanski. Incorporating Gene Ontology in clustering gene expression data. In CBMS ’06: Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems, pages 555–563, Washington, DC, USA, 2006. IEEE Computer Society.
[30] Robert Ezra Langlois. Machine Learning in Bioinformatics: Algorithms, Implementations and Applications. PhD thesis, University of Illinois at Chicago, 2002.
[31] Pedro Larranaga, Borja Calvo, Roberto Santana, Concha Bielza, Josu Galdiano, Inaki Inza, Jose A. Lozano, Ruben Armananzas, Guzman Santafe, Aritz Perez, and Victor Robles. Machine learning in bioinformatics. Brief Bioinform, 7(1):86–112, 2006.
[32] Jae W. Lee, Jung B. Lee, Mira Park, and Seuck H. Song. An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48(4):869–885, April 2005.
[33] Alan W. Liew, Hong Yan, and Mengsu Yang. Pattern recognition techniques for the emerging field of bioinformatics: A review. Pattern Recognition, 38(11):2055–2073, November 2005.
[34] S. Mahamud and M. Hebert. The optimal distance measure for object detection. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 248–255, 2003.
[35] Ryszard S. Michalski, Robert E. Stepp, and Edwin Diday. A recent advance in data analysis: Clustering objects into classes characterized by conjunctive concepts. Invited chapter in the book Progress in Pattern Recognition, L. Kanal and A. Rosenfeld (Eds.), 1:33–55, 1981.
[36] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[37] C. Pesquita, D. Faria, H. Bastos, A. O. Falcão, and F. M. Couto. Evaluating GO-based semantic similarity measures. Proceedings of the 10th Annual Bio-Ontologies Meeting, 2007.
[38] Jianlong Qi and Jian Tang. Integrating gene ontology into discriminative powers of genes for feature selection in microarray data. In SAC ’07: Proceedings of the ACM Symposium on Applied Computing, pages 430–434, New York, NY, USA, 2007. ACM.
[39] J. Quackenbush. Computational analysis of microarray data. Nature Reviews Genetics, 2(6):418–427, June 2001.
[40] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, March 1986.
[41] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, January 1993.
[42] R. Ionasec, A. Tsymbal, D. Vitanovski, B. Georgescu, S. Kevin Zhou, N. Navab, and D. Comaniciu. Shape-based diagnosis of the aortic valve. SPIE Medical Imaging, 2009.
[43] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. Cognitive Modelling, November 1995.
[44] Yvan Saeys, Inaki Inza, and Pedro Larranaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, October 2007.
[45] Eric W. Sayers, Tanya Barrett, Dennis A. Benson, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 37(1):D5–15, January 2009.
[46] Andreas Schlicker and Mario Albrecht. FunSimMat: a comprehensive functional similarity database. Nucl. Acids Res., pages 806+, October 2007.
[47] Andreas Schlicker, Francisco S. Domingues, Jorg Rahnenfuhrer, and Thomas Lengauer.
A new measure for functional similarity of gene products based on gene ontology. BMC Bioinformatics, 7:302+, June 2006.
[48] Jose L. Sevilla, Victor Segura, Adam Podhorski, Elizabeth Guruceaga, Jose M. Mato, Luis A. Martinez-Cruz, Fernando J. Corrales, and Angel Rubio. Correlation between gene expression and GO semantic similarity. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 2(4):330–338, 2005.
[49] R. D. Short and K. Fukunaga. The optimal distance measure for nearest neighbour classification. IEEE Transactions on Information Theory, 27(5):622–627, 1981.
[50] Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos, Douglas Hardin, and Shawn Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5):631–643, March 2005.
[51] Roland B. Stoughton. Applications of DNA microarrays in biology. Annual Review of Biochemistry, 74:53–82, 2005.
[52] Ying Tao, Lee Sam, Jianrong Li, Carol Friedman, and Yves A. Lussier. Information theory applied to the sparse Gene Ontology annotation network to predict novel gene function. Bioinformatics, 23(13):i529–538, July 2007.
[53] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, June 2001.
[54] Alexey Tsymbal, Martin Huber, and Shaohua Kevin Zhou. Discriminative distance functions and the patient neighborhood graph for clinical decision support (to appear). In Hamid R. Arabnia (ed.), Advances in Computational Biology, Springer, 2010.
[55] Alexey Tsymbal, Shaohua Kevin Zhou, and Martin Huber. Neighborhood graph and learning discriminative distance functions for clinical decision support. Conf Proc IEEE Eng Med Biol Soc, 2009.
[56] H. Wang, F. Azuaje, O. Bodenreider, and J. Dopazo.
Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships. In CIBCB ’04: Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pages 25–31, 2004.
[57] Haiying Wang and Francisco Azuaje. An ontology-driven clustering method for supporting gene expression analysis. In CBMS ’05: Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems, pages 389–394, Washington, DC, USA, 2005. IEEE Computer Society.
[58] James Z. Wang, Zhidian Du, Rapeeporn Payattakool, Philip S. Yu, and Chin-Fu Chen. A new method to measure the semantic similarity of GO terms. Bioinformatics, 23(10):1274–1281, 2007.
[59] G. Weiss and F. Provost. The effect of class distribution on classifier learning. 2001.
[60] Randall D. Wilson and Tony R. Martinez. Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6:1–34, 1997.
[62] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, April 1997.
[63] Hongwei Wu, Zhengchang Su, Fenglou Mao, Victor Olman, and Ying Xu. Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucl. Acids Res., 33(9):2822–2837, May 2005.
[64] Tao Xu, LinFang Du, and Yan Zhou. Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data. BMC Bioinformatics, 9(1):472, 2008.
[65] C. H. Yeang, S. Ramaswamy, P. Tamayo, S. Mukherjee, R. M. Rifkin, M. Angelo, M. Reich, E. Lander, J. Mesirov, and T. Golub. Molecular classification of multiple tumor types. Bioinformatics, 17 Suppl 1, 2001.
[66] J. Yu, J. Amores, N. Sebe, and Q. Tian. A new study on distance metrics as similarity measurement.
Publications of the Universiteit van Amsterdam (Netherlands), 2006.
[67] Jie Yu, J. Amores, N. Sebe, P. Radeva, and Qi Tian. Distance learning for similarity estimation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(3):451–462, 2008.

Appendix A

The histograms show the GainRatio of the gene-pair features for the five tested datasets. The y axis shows the GainRatio and the x axis the semantic similarity of the pairs, where the pairs are combined into 100 groups of equal size based on their semantic similarity value. Every bin in the graph represents the average GainRatio value of the gene-pairs included in it. The average semantic similarity of the bins thus decreases from left to right.

[Figure: Distribution of average GainRatio for gene-pair sets with different semantic similarity for the Breast Cancer dataset.]
[Figure: Distribution of average GainRatio for gene-pair sets with different semantic similarity for the Colon dataset.]
[Figure: Distribution of average GainRatio for gene-pair sets with different semantic similarity for the Embrional Tumours dataset.]
[Figure: Distribution of average GainRatio for gene-pair sets with different semantic similarity for the HeC Brain Tumours dataset.]
[Figure: Distribution of average GainRatio for gene-pair sets with different semantic similarity for the Lupus dataset.]