VYSOKÉ UČENÍ TECHNICKÉ V BRNĚ
BRNO UNIVERSITY OF TECHNOLOGY

FAKULTA INFORMAČNÍCH TECHNOLOGIÍ
ÚSTAV POČÍTAČOVÉ GRAFIKY A MULTIMÉDIÍ
FACULTY OF INFORMATION TECHNOLOGY
DEPARTMENT OF COMPUTER GRAPHICS AND MULTIMEDIA

Methods for class prediction with high-dimensional gene expression data
Metody pro predikci s vysokodimenzionálními daty genových expresí

DISERTAČNÍ PRÁCE / DOCTORAL THESIS

AUTOR PRÁCE / AUTHOR: Ing. Jana Šilhavá
VEDOUCÍ PRÁCE / SUPERVISOR: Doc. RNDr. Pavel Smrž, Ph.D.

BRNO 2012

Abstract

This thesis deals with class prediction with high-dimensional gene expression data. During the last decade, an increasing amount of genomic data has become available. Combining gene expression data with other data can be useful in clinical management, where it can improve the prediction of disease prognosis. The main part of this thesis is aimed at combining gene expression data with clinical data. We use logistic regression models that can be built through various regularized techniques. Generalized linear models enable us to combine models with different structures of data. It is shown that such a combination may yield more accurate predictions than those obtained from gene expression or clinical data alone. The suggested approaches are not computationally intensive. Evaluations are performed with simulated data sets in different settings and then with real benchmark data sets. The work also characterizes the additional predictive value of microarrays. The thesis includes a comparison of selected features of gene expression classifiers built on five different breast cancer data sets. Finally, a feature selection that combines gene expression data with gene ontology information is proposed.
Keywords

predictive classification, generalized linear models, boosting, logistic regression, elastic net, model evaluation, high-dimensional data, combining of heterogeneous data, feature selection, DNA microarray data, gene expression, clinical data, gene ontology

Bibliographic citation

Jana Šilhavá: Methods for Class Prediction with High-Dimensional Gene Expression Data, Ph.D. thesis, Department of Computer Graphics and Multimedia, FIT BUT, Brno, CZ, 2012.

Abstrakt

Tato dizertační práce se zabývá predikcí vysokodimenzionálních dat genových expresí. Množství dostupných genomických dat výrazně vzrostlo v průběhu posledního desetiletí. Kombinování dat genových expresí s dalšími daty nachází uplatnění v mnoha oblastech. Například v klinickém řízení rakoviny (clinical cancer management) může přispět k přesnějšímu určení prognózy nemocí. Hlavní část této dizertační práce je zaměřena na kombinování dat genových expresí a klinických dat. Používáme logistické regresní modely vytvořené prostřednictvím různých regularizačních technik. Generalizované lineární modely umožňují kombinování modelů s různou strukturou dat. V dizertační práci je ukázáno, že kombinování modelu dat genových expresí a klinických dat může vést ke zpřesnění výsledků predikce oproti vytvoření modelu pouze z dat genových expresí nebo klinických dat. Navrhované postupy přitom nejsou výpočetně náročné. Testování je provedeno nejprve se simulovanými datovými sadami v různých nastaveních a následně s reálnými srovnávacími daty. Také se zde zabýváme určením přídavné hodnoty microarray dat. Dizertační práce obsahuje porovnání příznaků vybraných pomocí klasifikátorů genových expresí na pěti různých sadách dat týkajících se rakoviny prsu. Navrhujeme také postup výběru příznaků, který kombinuje data genových expresí a znalosti z genových ontologií.
Klíčová slova

prediktivní klasifikace, generalizované lineární modely, boosting, logistická regrese, elastická síť, vyhodnocování modelu, vysokodimenzionální data, kombinování heterogenních dat, výběr příznaků, DNA microarray data, genové exprese, klinická data, genové ontologie

Bibliografická citace

Jana Šilhavá: Methods for Class Prediction with High-Dimensional Gene Expression Data, Disertační práce, Ústav počítačové grafiky a multimédií, FIT VUT, Brno, CZ, 2012.

Acknowledgments

I would like to thank my supervisor Pavel Smrž for his leadership, valuable suggestions and support. I am also grateful to Jan Černocký and other people from the Department of Computer Graphics and Multimedia at Brno University of Technology for enabling me to write and finish my dissertation. I would also like to thank Kenneth Froehling for his proofreading. Moreover, special thanks go to my family and all my closest friends for their patience and encouragement.

Prohlášení

Prohlašuji, že jsem tuto disertační práci vypracovala samostatně pod vedením Doc. RNDr. Pavla Smrže, Ph.D. Uvedla jsem všechny literární prameny a publikace, ze kterých jsem čerpala.

V Brně dne 20. srpna 2012

Contents

1 Introduction
  1.1 Motivation
  1.2 Original contributions of the thesis
  1.3 Structure of the thesis
2 Background
  2.1 Gene expression measuring
  2.2 Types of DNA microarrays and their production
  2.3 Microarray data pre-processing
  2.4 Microarray data mining
3 Class prediction
  3.1 Classifier performance
  3.2 Dimension reduction
    3.2.1 Feature selection
    3.2.2 Feature extraction
    3.2.3 Preference for simple models
    3.2.4 Regularization or shrinkage methods
    3.2.5 Internal validation
  3.3 Classifier construction
    3.3.1 Linear discriminant analysis
    3.3.2 Generalized linear models
    3.3.3 Support vector machines
    3.3.4 k-nearest neighbors
    3.3.5 Artificial neural networks
    3.3.6 Classification and regression trees and ensemble methods
    3.3.7 Bayesian networks
  3.4 Comparison between methods
4 Assessment of classifier performance
  4.1 Validation
  4.2 Performance measures
5 Data integration
  5.1 Categorization of methods
    5.1.1 Similar data types
    5.1.2 Heterogeneous data types
    5.1.3 Early integration
    5.1.4 Intermediate integration
    5.1.5 Late integration
    5.1.6 Serial integration
  5.2 Information resources
    5.2.1 Genomic variation data
    5.2.2 Gene expression data
    5.2.3 Proteomic data
    5.2.4 Interactomic data
    5.2.5 Textual data and ontologies
    5.2.6 Clinical data
    5.2.7 Other data
6 Data sets
  6.1 Real data sets
  6.2 Microarray data pre-processing
  6.3 Simulated data sets
7 Combining gene expression and clinical data
  7.1 Introduction
  7.2 Generalized linear models
    7.2.1 Estimation of parameters
  7.3 Logistic regression
  7.4 Boosting
    7.4.1 Functional gradient descent boosting algorithm
    7.4.2 Base procedure
    7.4.3 Loss function
  7.5 Combination of logistic regression and boosting
  7.6 Parameters setting
  7.7 Results
    7.7.1 Simulated data
    7.7.2 Breast cancer data
  7.8 Pre-validation
  7.9 Determination of weights for models
  7.10 Results
  7.11 Discussion
  7.12 Alternative regularized regression techniques
    7.12.1 Elastic net
    7.12.2 Combining gene expression, clinical and SNP data
  7.13 Execution times
  7.14 Conclusions
8 Additional predictive value of microarrays
  8.1 Introduction
  8.2 Determination of predictive value
  8.3 Results
  8.4 Conclusion
9 Breast cancer prognostic genes
  9.1 Introduction
  9.2 Gene selection methods
  9.3 Evaluation of selected genes
  9.4 Results
  9.5 Conclusions
10 Gene ontology feature selection
  10.1 Introduction
  10.2 Combining gene expression data with gene ontology
    10.2.1 Filtering with gene ontology
  10.3 Results
  10.4 Conclusion
11 Conclusions

List of Abbreviations

AIC  Akaike Information Criterion
ANOVA  ANalysis Of VAriance
AUC  Area Under the ROC Curve
ANN  Artificial Neural Network
BIC  Bayesian Information Criterion
BN  Bayesian Networks
BP  Biological Process
CC  Cellular Component
CDC  Centers for Disease Control
CFS  Chronic Fatigue Syndrome
CART  Classification And Regression Trees
CAT  Correlation-Adjusted t-scores
CGH  Comparative Genomic Hybridization
cDNA  complementary DeoxyriboNucleic Acid
cRNA  complementary RiboNucleic Acid
CWLLS  Component-Wise Linear Least Squares
CNV  Copy Number Variations
CV  Cross-Validation
CCD  Cyclical Coordinate Descent
DNA  DeoxyriboNucleic Acid
DLDA  Diagonal Linear Discriminant Analysis
EN  Elastic Net
ER  Estrogen Receptor
FLDA  Fisher's Linear Discriminant Analysis
FDA  Food and Drug Administration
FGD  Functional Gradient Descent
FNDR  False NonDiscovery Rate
GO  Gene Ontology
GAM  Generalized Additive Models
GLM  Generalized Linear Models
HER  Human Epidermal Growth factor Receptor
ICA  Independent Component Analysis
IWLS  Iteratively Weighted Least Squares
IRLS  Iteratively Reweighted Least Squares
kNN  k-Nearest Neighbors
LOOCV  Leave-One-Out Cross-Validation
LDA  Linear Discriminant Analysis
LR  Logistic Regression
MLE  Maximum Likelihood Estimate
mRNA  messenger RiboNucleic Acid
MDL  Minimum Description Length
MCCV  Monte-Carlo Cross-Validation
MF  Molecular Function
NCBI  National Center for Biotechnology Information
NLM  National Library of Medicine
PLS  Partial Least Squares
PCA  Principal Component Analysis
OMIM  Online Mendelian Inheritance in Man
OS  Overall Survival
PR  Progesterone Receptor
PFS  Progression-Free Survival
RF  Random Forests
RMA  Robust MultiArray
QC  Quality Control
SNP  Single Nucleotide Polymorphisms
SSM  Semantic Similarity Matrix
SVM  Support Vector Machines

Chapter 1
Introduction

1.1 Motivation

Increasing amounts of genomic data have become available in the last decade. Gene expression data, genomic variation data or proteomic data provide examples of the results produced by high-throughput technologies. This thesis deals with gene expression data coming from microarray analysis. Microarrays offer insights into the genetic basis of various diseases. Take cancer research as an example. Cancer is thought to be primarily caused by random genetic alterations. Consequently, genomic data has been successfully applied to various cancer-related problems (e.g. classifying tumors into subtypes and thus potentially improving the clinical management of cancer [197]).

Microarray technology employs gene chips to measure the expression levels of thousands of genes simultaneously. Each chip characterizes the gene expression levels at a different time point during the cell cycle. Medical studies compare the gene expression levels of individual patients, one chip per case history. The large expense of microarray experiments and the limited number of available patients make the sample size of individual microarray studies relatively small, usually less than 100. This contrasts with the high number of variables (genes) in gene expression data, typically on the order of 10,000. Class prediction with high-dimensional microarray data is, therefore, extremely difficult, and all approaches need to take into account possible classifier overfitting. Furthermore, probe design and experimental conditions influence signal intensities and sensitivities for many high-throughput technologies [158].
One has to pay special attention to the high noise level when analysing microarray data. In order to overcome the above-mentioned problems and to increase the reliability of the findings, the analysis can combine gene expression data with information from other sources – proteomic and interactomic knowledge, clinical findings, medical ontologies, disease-specific textual data, etc. Each of these distinct data types, although individually incomplete, can provide valuable, partly independent and complementary information. The goal of data integration is to obtain more precision, better accuracy and greater statistical power than any individual data set would provide. It can also suggest ways to cross-validate noisy data sets.

There are a number of challenges in the process of data integration. The challenges may be of a conceptual, methodological or practical nature and may relate to issues that arise due to experimental, computational or statistical complexities [86]. It is essential to carefully consider strategies that best capture the information contained in each data type before combining them. The data from different sources might have different quality depending on the experimental conditions. The data might also have different informativeness even if their quality is good and reliable. The methods for data integration in general, and for combining genomic data in particular, form one of the most active topics in current research.

Knowledge fusion that combines microarray data with information from additional data sources to improve the prediction of disease outcome also defines the goals of this thesis. They can be summarized as follows:

• To summarize the current research on knowledge integration approaches in microarray data analysis.

• To investigate methods aiming at overcoming "the curse of high-dimensionality" of the data resulting from genomic studies.
• To design, realize and validate classifiers combining gene expression data with information from additional data sources.

• To deal with the additional predictive value of high-dimensional microarray data in the case when standard clinical predictors are already available.

1.2 Original contributions of the thesis

The original contributions of the thesis can be summarized as follows:

• The novel approaches that construct classifiers combining clinical variables with microarray gene expression data.

• The two-step method combining logistic regression and boosting models, able to determine the additional predictive value of microarray data.

• The combination of logistic regression and boosting with pre-validation.

• The validation of the experiments based on combining gene expression, clinical and single nucleotide polymorphism (SNP) data using generalized linear models and regularization methods.

• The study that discusses biomarker discovery methods and compares selected features of gene expression classifiers applied to five benchmark breast cancer data sets.

• The feature selection approach that integrates gene ontology information with microarray gene expression data.

1.3 Structure of the thesis

The rest of this thesis is divided into the following chapters:

• Chapter 2 briefly describes gene expression measuring, the main types of DNA microarrays and their production. Then, microarray data analysis is introduced together with key topics of microarray data mining.

• Chapter 3 is concerned with class prediction of microarray data. It analyses the performance of a predictive classifier in a high-dimensional setting. It deals with approaches that can optimize classifier performance and cope with the high dimensionality of data. It also gives an overview of the most popular class prediction methods in the context of microarray gene expression data and a comparison of their features.
• Chapter 4 describes how classifier performance is estimated in this thesis. It describes the validation procedure and the performance measures.

• Chapter 5 categorizes data integration methods from different points of view. Each category includes references to the published literature. It also gives an overview of information resources that have been combined with microarray gene expression data.

• Chapter 6 introduces the data sets used for the evaluation of experiments in this thesis. This chapter also includes examples of microarray data pre-processing.

• Chapter 7 introduces a novel approach that constructs a classifier combining microarray gene expression data with clinical data. An extension of this approach is described which offers a pre-validation of models built with microarray and clinical data followed by weight calculations. Evaluations are performed on several redundant and non-redundant simulated data sets as well as on four real benchmark data sets. It includes a comparison with other methods and a discussion. The next part of the chapter is dedicated to alternative regularized regression techniques. An elastic net is employed with high-dimensional data instead of boosting and is used in a joint classifier with logistic regression. After that, classifiers combining gene expression, clinical and SNP data are validated. A chronic fatigue syndrome (CFS) data set is used for the experiments. In this data set, the gene expression data is not the only high-dimensional component; regularization is therefore not applied to the gene expression data alone in these experiments.

• Chapter 8 describes a two-step method combining logistic regression and boosting models in order to determine the additional predictive value of microarray data. This method is evaluated and compared with another published method designed for the same purpose.
• Chapter 9 compares selected features of gene expression classifiers built on five benchmark breast cancer data sets. The features are selected by three feature selection methods. This chapter includes an overview of breast cancer prognostic and predictive biomarkers in clinical use, with a discussion of published breast cancer signatures and their discovery.

• Chapter 10 presents preliminary results of approaches using gene ontology (GO) information with gene expression data. GO information is incorporated at the stage of early integration in order to improve class prediction. Feature selections based on GO are evaluated on a benchmark breast cancer data set.

• Chapter 11 concludes this thesis and outlines the most promising directions for future research.

Chapter 2
Background

Understanding the data is essential for its analysis and mining. That is why this chapter introduces the fundamentals of DNA microarrays. It briefly describes the principles of gene expression measuring and summarizes the main types of DNA microarrays and their production. Microarray data pre-processing as a part of data analysis is also introduced, together with key topics of microarray data mining.

Figure 2.1 illustrates the process of production and analysis of microarray gene expression data. Most of the activities are matters for biologists. This thesis falls into the highlighted area.

Figure 2.1: Microarray gene expression data analysis process (biological question → experimental design → microarray experiment → data analysis and data mining → biological verification and interpretation).

2.1 Gene expression measuring

The central dogma of molecular biology [40] outlines that in synthesizing proteins, Deoxyribonucleic acid (DNA) is transcribed into messenger Ribonucleic acid (mRNA), which is translated into protein. The principle behind microarrays is hybridization between two DNA strands.
Fluorescently labelled target sequences (present in the sample) bind to complementary probe sequences (attached to the solid surface of a DNA microarray chip) and generate a signal that depends on the strength of the hybridization, determined by the number of paired bases. A specialized scanner is used to measure the amount of hybridized target at each probe, which is reported as gene expression levels.

2.2 Types of DNA microarrays and their production

Methods for quantitatively measuring gene expression have been available to biologists since the 1980s [16]. They were limited to examining a small number of genes at a time. A later technique, serial analysis of gene expression (SAGE) [187], enabled biologists to quantify the expression level of both known and unknown genes, but it was time-consuming. The current DNA microarray technology has revolutionized basic biological research and laboratory investigations of patient material. It enables measuring the gene expression levels of thousands of genes simultaneously under the same experimental conditions. There are many different types of microarrays (called platforms) in use, but all can be characterized by a high density and number of biomolecules fixed onto a well-defined surface. DNA microarrays come in two main types of technical platforms. The first, the cDNA microarray (developed in the Brown and Botstein labs at Stanford, 1995) [166], is based on standard microscopic glass slides on which complementary DNAs (cDNAs) or long oligonucleotides (typically 70-80 mers) have been spotted. The second, the high-density oligonucleotide microarray [126], is based on photolithographic techniques to synthesize 25-mer oligonucleotides on a silicon wafer and constitutes the patented technology of Affymetrix Inc.
The production methods of the two types of DNA microarrays are different: cDNA microarrays can be produced relatively easily, but high-density oligonucleotide microarrays need highly specialized production facilities [175]. A comparison of cDNA microarrays and high-density oligonucleotide microarrays is shown in Figure 2.2. For cDNA microarray experiments, two cell populations, for instance diseased and normal, are isolated, RNA is extracted, and cDNA is made and used for transcription with Cy3 (green) or Cy5 (red) labeled nucleotides. The two labeled cRNA samples are mixed and hybridized on a glass slide array, which is scanned by a specialized laser, followed by data analysis. High-density oligonucleotide microarrays generate a gene expression profile of one sample and, therefore, one color. RNA is extracted and cDNA is prepared. The cDNA is used in transcription in order to generate complementary RNA (cRNA). After fragmentation, this cRNA is hybridized to microarrays, washed and stained, and subsequently scanned on a specialized laser scanner.

2.3 Microarray data pre-processing

Raw numerical outputs from different microarray platforms need to be pre-processed. According to [74], pre-processing steps can be divided into five parts: data import, background adjustment, normalization, summarization and quality assessment. Data import methods are needed because data come in different formats and are often scattered across a number of files or database tables from which they need to be extracted and organized. Background adjustment is essential to reduce non-specific hybridization and noise.
Figure 2.2: Comparison of cDNA microarray and high-density oligonucleotide microarray procedures (reproduced from [175]). A cDNA microarray starts from two cell populations (diseased and normal) and yields a gene expression ratio per gene; a high-density oligonucleotide (Affymetrix) array starts from one cell population and yields an 'absolute' gene expression value per gene.

Normalization enables one to compare measurements from different array hybridizations. For some platforms, summarization is needed because transcripts are represented by multiple probes. Quality assessment can detect divergent measurements beyond the acceptable level of random fluctuations. Additional steps that are typically performed include missing value imputation and filtering (removal of genes that are not expressed). An overview of pre-processing methods, including summary algorithms and quality control metrics for microarrays, can be found in [149]. A practical example of benchmark microarray data set pre-processing is given in Chapter 6.

Pre-processed microarray data is usually transformed into gene expression matrices where rows form the expression patterns of genes and columns represent the expression profiles of samples (experimental conditions), or vice-versa (see Figure 2.2). The cells in a cDNA microarray matrix characterize log-ratios of gene expression from two cell populations (labeled with Cy3 or Cy5), while the cells in a high-density oligonucleotide microarray matrix characterize log-transformed intensities of gene expression levels of one population.
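The construction of such an expression matrix for a two-channel platform can be sketched in a few lines of Python. This is an illustrative toy example on synthetic intensities, not the pre-processing pipeline used in this thesis; the constant background floor and per-array median normalization are simplifying assumptions standing in for the real background adjustment and normalization steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic raw intensities for a two-channel (cDNA) array:
# rows = genes (p), columns = samples (n); Cy5 = test, Cy3 = reference.
p, n = 1000, 8
cy5 = rng.lognormal(mean=7.0, sigma=1.0, size=(p, n))
cy3 = rng.lognormal(mean=7.0, sigma=1.0, size=(p, n))

# Background adjustment (illustrative): subtract a constant floor, keep positive.
background = 50.0
cy5 = np.maximum(cy5 - background, 1.0)
cy3 = np.maximum(cy3 - background, 1.0)

# Expression matrix for a two-channel platform: log2 ratios of the two populations.
X = np.log2(cy5 / cy3)

# Per-array median normalization so that arrays are comparable.
X = X - np.median(X, axis=0, keepdims=True)

print(X.shape)               # (1000, 8): p genes x n samples
print(np.median(X, axis=0))  # approximately 0 for every array
```

After this step each column has median log-ratio zero, which is the simplest way to make measurements from different hybridizations comparable before any data mining.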
2.4 Microarray data mining

Microarrays offer insights into the genetic basis of many diseases. However, microarray information extraction presents a challenging task for data mining. Most of the literature sorts data mining with microarray gene expression data into three related categories: class discovery, class comparison and class prediction. Class discovery refers to the process of dividing patients, tumors or genes into classes (clusters) in the hope that they also have a similar function, behavior or properties. For example, class discovery can identify new disease subtypes. Class comparison can be defined as the selection of genes whose expression is significantly different between conditions. These are called differentially expressed genes. For example, a class comparison problem can detect disease-associated (signalling) molecular markers (biomarkers) and their complex interactions. Scientists explore gene-gene interactions because they may play important roles in complex disease studies. The goal of class prediction is to develop a function (classification rule) for accurately predicting class membership. For example, it can be a function for the prediction of disease diagnosis or prognosis. Class prediction is the topic of this thesis. It is detailed in the next chapter.

Chapter 3
Class prediction

In class prediction (or predictive classification), data with class labels is available and a classification algorithm learns from samples of this data with known class membership (the training set) and establishes rules to classify new samples. Results are evaluated on unseen data (the test set). Figure 3.1 shows a typical scenario of class prediction. Here, the class labels can have two values: bad prognosis and good prognosis. In general, class prediction can deal with a two-class (binary) or multi-class classification problem. N-class problems can be modelled as N binary classification problems. This thesis is concerned with binary classification.
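The training/test workflow described above can be made concrete with a minimal sketch. The example below uses synthetic data and a diagonal linear discriminant (DLDA-style) rule, one of the classifiers surveyed later in this chapter; the data dimensions, effect size and split ratio are illustrative assumptions, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data in the thesis's p x n layout: p genes, n samples,
# two prognosis classes; only the first 20 genes are informative.
p, n, informative = 500, 80, 20
y = np.repeat(["A", "B"], n // 2)            # ground-truth class labels
X = rng.normal(size=(p, n))
X[:informative, y == "B"] += 1.0             # class-dependent mean shift

# Split samples into a training set and an unseen test set.
test = np.zeros(n, dtype=bool)
test[rng.choice(n, n // 4, replace=False)] = True
X_tr, y_tr, X_te, y_te = X[:, ~test], y[~test], X[:, test], y[test]

# Learn the rule on the training set only: per-gene class means and a
# per-gene variance (a simple stand-in for the pooled class variance).
mu_a = X_tr[:, y_tr == "A"].mean(axis=1)
mu_b = X_tr[:, y_tr == "B"].mean(axis=1)
var = X_tr.var(axis=1, ddof=1)

def predict(x):
    """Assign the class whose mean is closer in variance-scaled distance."""
    da = np.sum((x - mu_a) ** 2 / var)
    db = np.sum((x - mu_b) ** 2 / var)
    return "A" if da < db else "B"

# Evaluate the learned rule on the held-out test samples.
y_hat = np.array([predict(X_te[:, j]) for j in range(X_te.shape[1])])
accuracy = np.mean(y_hat == y_te)
print(f"test accuracy: {accuracy:.2f}")
```

The important point is the separation of roles: all parameters are estimated from the training columns, and the accuracy is reported only on columns the rule never saw, exactly as in the scenario of Figure 3.1.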
Figure 3.1: Class prediction. A classification rule learned from bad-prognosis and good-prognosis samples assigns new samples to one of the two classes.

Notation for class prediction with microarray data

Let X be the p × n matrix containing the expression values of p features/genes and n samples. The response variable with ground truth class labels is an n-dimensional vector y ∈ {A, B} (A and B can denote poor and good prognosis). Similarly, ŷ ∈ {A, B} is the response variable with predicted class labels.

Examples of class prediction include the prediction of disease diagnosis, prognosis or therapeutic responsiveness. An important interest in the area of prognostic prediction is the prediction of the survival time of a patient [62]. Some authors dichotomize the survival time and transform the problem into a classification problem [146]. Class prediction objectives are:

• Identification of classification rules that perform well with new data;

• Identification of characteristic genes related to disease progression for further investigation.

3.1 Classifier performance

As mentioned above, classification with high-dimensional microarray gene expression data is a challenging task. The performance of a predictive classifier/model depends on sample size, data dimensionality and model complexity [193]. The accuracy of learned classifiers tends to decrease with high dimensions, a phenomenon called the 'curse of dimensionality' [51].
Trunk [182] illustrates the problem with the following example. Consider two equally probable, normally distributed classes with common variance in each dimension. For the feature/variable/gene indexed by n = 1, 2, 3, ..., class 1 has mean (1/n)^(1/2) and class 2 has mean −(1/n)^(1/2). Thus, each additional feature has some class discrimination power, even though it decreases as n increases. Trunk evaluated error rates for the Bayes decision rule, applied as a function of n, when the variance is assumed to be known but the class means are estimated from a finite data set. He found that: (1) the best test error was achieved using a finite number of features; (2) using an infinite number of features, test error degrades to the accuracy of random guessing; and (3) the optimal dimensionality increases with an increasing sample size. These observations are consistent with the 'bias/variance dilemma' [108]. Simple models may be biased but have a low variance. More complex models have greater representation power (low bias) but will overfit the particular training set (high variance). Thus, the large variance associated with many features (including those with modest discrimination power) defeats any possible classification benefit derived from these features (see Figure 3.2). With the severe limits on available samples in microarray studies, complex models using high-dimensional feature spaces will severely overfit, greatly compromising classification performance. Computational learning theory provides bounds on generalization accuracy in terms of a classifier's capacity, which is related to model complexity [186]. The relevance of these bounds to microarray gene expression data is discussed in [14]. Unfortunately, training error is not a good estimate of test error, as it does not properly account for model complexity. Figure 3.2 shows the typical behaviour of the test error as classifier complexity varies.
Training error tends to decrease as classifier complexity increases, that is, as the data is fitted harder. However, with too much fitting, the model adapts itself too closely to the training data and will not generalize well (i.e., large test error).

Figure 3.2: A demonstration of the bias/variance dilemma in class prediction, and test/training error as a function of classifier complexity (adapted from [93], page 38). The expected error assessed on test samples first falls and then rises again, moving from underfitting (high bias, low variance) to overfitting (low bias, high variance), whereas the expected error assessed on training samples keeps decreasing.

There are dimension reduction approaches that can optimize classifier performance and cope with the high dimensionality of the data. They form the topic of the following section. Dimension reduction can take place before classifier construction (feature selection or extraction), or it can be included in classifier construction (regularization/shrinkage methods or a preference for simple models). Figure 3.3 illustrates the pipeline of building a microarray predictive classifier with possible dimension reduction steps. The process of building a class predictor from high-dimensional data usually consists of dimension reduction, classifier construction and validation.

Figure 3.3: The process of common class predictor building from high-dimensional data: data → feature selection/extraction → classifier construction → classifier validation → error. The dashed rectangle signals that dimension reduction can instead be included in the classifier construction part.

3.2 Dimension reduction

There are approaches that can optimize classifier performance and cope with the high dimensionality of data. Nevertheless, the use of these methods does not prevent overfitting.
The literature includes the following approaches:

• Feature selection;
• Feature extraction;
• Preference for simple models;
• Regularization or shrinkage methods;
• Internal validation.

3.2.1 Feature selection

Feature selection, also known as variable selection or gene selection, is the process of selecting a subset of relevant features for building classifiers. A standard approach to feature selection involves identifying the genes that are differentially expressed among the classes when considered individually. This type of feature selection is a filter approach and can be computed, for example, by the t-test, the Wilcoxon test or analysis of variance (ANOVA). The genes that are significantly differentially expressed at a specified significance level are selected for inclusion in the class predictor. One can choose only the k top-ranking genes (where k < p) or the genes whose score exceeds a given threshold. The p-values are used as a convenient index, based on null-hypothesis testing, for selecting genes. In microarray data analysis, there is a problem with false positives: genes that are found to be statistically different between conditions but are not in reality. Therefore, multiple testing corrections adjust the p-values to correct for the occurrence of false positives. An overview of hypothesis testing methods can be found in [53]. This type of feature selection has an important drawback: correlations and interactions with other genes are ignored, which may cause aspects relevant for prediction to be missed.

The wrapper approach to feature selection uses the performance of the classifier itself as the evaluation criterion. It searches for features better suited to the learning algorithm, aiming to improve predictive performance. Wrapper feature selection is generally computationally intensive and more difficult to set up than filter feature selection. An example of this approach is gene selection using SVM [84].
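The filter approach described above can be sketched in a few lines: rank genes by a pooled-variance two-sample t-statistic and keep the k top-ranking ones. This is a minimal NumPy illustration, not the exact procedure used in this thesis; the function name and the p × n (genes-by-samples) layout of X are our assumptions.

```python
import numpy as np

def top_k_by_tstat(X, y, k):
    """Filter feature selection sketch: rank genes by the absolute
    two-sample t-statistic (pooled variance) and keep the k top-ranking
    genes. X is p x n (genes x samples); y is a length-n vector of
    binary class labels (0/1). Returns the indices of the k genes."""
    a, b = X[:, y == 0], X[:, y == 1]
    na, nb = a.shape[1], b.shape[1]
    ma, mb = a.mean(axis=1), b.mean(axis=1)
    va, vb = a.var(axis=1, ddof=1), b.var(axis=1, ddof=1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (ma - mb) / np.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return np.argsort(-np.abs(t))[:k]          # indices of the top-k genes
```

In a significance-based variant, the t-statistics would be converted to p-values and adjusted by a multiple testing correction before thresholding, as discussed above.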
Search algorithms that explore the space of possible feature subsets define another feature selection approach. Model et al. [139] use a backward selection procedure, whereas Bo et al. [23] select a pair of variables with a forward selection scheme. Alternative approaches include, for example, genetic algorithms [148].

3.2.2 Feature extraction

Feature extraction is dimensionality reduction which transforms the data from the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in Principal component analysis (PCA) [195]. Principal components are the orthogonal linear combinations of the genes showing the greatest variability among the cases. The principal components are sometimes referred to as singular values [15]. Dimension reduction with PCA has two limitations. One is that the principal components are not necessarily good predictors. The other is that computing principal components requires measuring the expression of all genes, which may not be desirable for clinical applications. Other methods are Independent component analysis (ICA) [124], Partial least squares (PLS) [142], Linear discriminant analysis (LDA), Diagonal LDA (DLDA) and nonlinear methods, such as Non-linear principal manifolds [60]. A general drawback of feature extraction compared to feature selection is the worse interpretability of the extracted features.

3.2.3 Preference for simple models

Some methods, such as naive Bayes, Support vector machines (SVM) or DLDA, can use as many features as desired. These methods use simple models that restrict complexity. Examples are naive Bayes classifiers [138], which assume that features are conditionally independent of each other given the class label, or even simpler models that can share parameters across classes [145].
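The naive Bayes model just mentioned can be sketched as follows: with Gaussian class-conditional densities, each class is summarized only by per-feature means and variances, which is exactly what the conditional independence assumption buys. This is an illustrative sketch with names of our choosing, not a method from the thesis.

```python
import numpy as np

def naive_bayes_predict(X_train, y_train, X_new):
    """Gaussian naive Bayes sketch: features are assumed conditionally
    independent given the class, so each class is summarized by
    per-feature means and variances only. X is n x p; y holds 0/1 labels."""
    log_scores = []
    for c in (0, 1):
        Xc = X_train[y_train == c]
        mean = Xc.mean(axis=0)
        var = Xc.var(axis=0) + 1e-9            # guard against zero variance
        prior = len(Xc) / len(X_train)
        # log prior plus the sum of univariate Gaussian log-densities
        ll = -0.5 * (np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var)
        log_scores.append(ll.sum(axis=1) + np.log(prior))
    return (log_scores[1] > log_scores[0]).astype(int)
```

Because only 2p parameters per class are estimated, the model stays usable even when p is large, which is why such simple models tolerate many features.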
SVM avoid overfitting by finding a linear discriminant function (or generalized linear discriminant) that maximizes the margin (the minimum distance of a sample point to the decision boundary) [186]. Dudoit et al. [52] recommend preferring simple models before using complex classifiers.

3.2.4 Regularization or shrinkage methods

Regularization or shrinkage methods incorporate additional information in order to prevent overfitting. This information usually takes the form of a penalty for complexity, such as restrictions on smoothness or bounds on the vector space. For example, the least-squares method can be viewed as a very simple form of regularization. Ridge regression and the lasso [180] are regularized versions of least squares regression using L2 and L1 penalties, respectively. Regularization has also been applied to generalized linear models [180]. In recent years, there has been an enormous amount of research activity devoted to related regularization methods: the grouped lasso [203, 137], where variables are included or excluded in groups; the elastic net [205] for correlated variables, which uses mixtures of L1 and L2 penalties; L1 regularization paths for generalized linear models [152]; regularization paths for SVM [91], etc. From the regularization point of view, boosting [162] follows a 'regularized path' of classifier solutions as the boosting iterations proceed. More complex solutions can overfit. Well-known classifier selection techniques include the Akaike information criterion (AIC), minimum description length (MDL) and the Bayesian information criterion (BIC).

3.2.5 Internal validation

Alternative methods of controlling overfitting that involve neither dimension reduction nor regularization include internal validation. Internal validation means that the validation is applied to the training set only, since parameter estimation is part of the training process. The test set is there to judge the performance of the selected model.
For example, leave-one-out cross-validation (LOOCV) and bootstrap sampling have been applied as internal validation methods (see the overview of validation methods in [27]).

3.3 Classifier construction

Classifier construction is influenced by the choice of class prediction method and the setting of its parameters. Class prediction based on microarray gene expression data was introduced by Golub et al. [80]. They classified two types of acute leukemia by summing weighted votes for each gene on the test data and looking at the sign of the sum. Since then, many prediction methods have been presented in the literature in the context of microarray gene expression data. The most popular methods are listed below.

3.3.1 Linear discriminant analysis

Linear discriminant analysis (LDA) and the related Fisher's linear discriminant analysis (FLDA) [135] find a linear combination of features which characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier. Dudoit et al. [52] indicated that FLDA did not perform well unless the number of selected genes was small relative to the number of samples. The reason is that there are too many correlations to estimate, so the method tends to be unstable and to overfit the data. Diagonal linear discriminant analysis (DLDA) [135] is a special case of FLDA in which the correlation among genes is ignored. By ignoring correlations, one avoids having to estimate many parameters and obtains a method that performs better when the number of samples is small [167].

3.3.2 Generalized linear models

Generalized linear models (GLM) [134] are a large group of models for relating responses to linear combinations of predictor variables. GLM employ an iteratively reweighted least squares method for maximum likelihood estimation of the model parameters.
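The iteratively reweighted least squares (IRLS) step just mentioned can be sketched for the logistic case as follows. This is a minimal, unregularized version: it assumes p << n, so it is not directly applicable to raw microarray dimensions, and the function name is ours.

```python
import numpy as np

def irls_logistic(X, y, iters=25):
    """IRLS for logistic regression, the standard GLM fitting routine.
    X is n x p with an intercept column already included; y holds 0/1
    labels. Each pass solves a weighted least squares problem built
    from the current fitted probabilities (a Newton step)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))       # fitted probabilities
        W = np.maximum(mu * (1.0 - mu), 1e-10)    # working weights
        z = X @ w + (y - mu) / W                  # working response
        WX = X * W[:, None]
        w = np.linalg.solve(X.T @ WX, X.T @ (W * z))   # weighted LS step
    return w
```

The same loop, with only the link function and weight formula changed, fits any GLM, which is the algorithmic uniformity referred to in the next paragraph's citation of [107].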
According to [107], the GLM approach is attractive because: (1) it provides a general theoretical framework for many commonly encountered statistical models; and (2) it simplifies the implementation of these different models in software, since essentially the same algorithm can be used for estimation, inference and assessing model adequacy for all GLM. For example, logistic regression (LR) is an instance of GLM, a family which covers a large variety of exponential models. LR cannot be used with high-dimensional data without a variable reduction step. Extensions of LR to high-dimensional gene expressions are described in [58, 204, 205]. GLM can also be extended to generalized additive models (GAM) [94].

3.3.3 Support vector machines

Support vector machines (SVM) [186] construct the best separating hyperplane between classes, locating this hyperplane so that it has a maximal margin (i.e. so that there is maximal distance between the hyperplane and the nearest point of any of the classes). They work in combination with the technique of kernels, which automatically realizes a nonlinear mapping to a feature space. SVM have become a very popular choice of class prediction method for microarrays among practitioners [33, 140], but they are a black box that gives results that are difficult to interpret.

3.3.4 k-nearest neighbors

The k-nearest neighbors (kNN) method has also been used for class prediction with gene expression data. It is a simple method which classifies unlabeled samples based on their similarity to samples in the training set. Euclidean distance can be used to measure the closeness between an unlabeled sample and the samples in the training set. The class of an unlabeled sample is predicted as the majority vote among its k nearest neighbors. The number k of neighbors can be fixed or optimized by cross-validation [20].
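The kNN scheme just described fits in a few lines of code: Euclidean distance plus a majority vote. This is a toy sketch with names of our choosing, intended only to make the procedure concrete.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """k-nearest neighbours by Euclidean distance with majority vote.
    X_train is n x p; y_train holds 0/1 labels; returns predicted
    labels for the rows of X_new. Use an odd k to avoid vote ties."""
    preds = []
    for x in X_new:
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # distances to all
        nearest = np.argsort(d)[:k]                     # k closest samples
        preds.append(int(round(y_train[nearest].mean())))  # majority vote
    return np.array(preds)
```

With gene expression data, such a classifier would typically be applied after a feature selection step, since Euclidean distance over thousands of genes is dominated by irrelevant dimensions.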
According to [144], kNN has poor generalization performance compared to the k discriminant adaptive nearest neighbor (kDANN) method [92], which selects a distance measure adaptively, making the class conditional probabilities more homogeneous locally.

3.3.5 Artificial neural networks

According to [167], artificial neural networks (ANN) do not perform well because of hidden nodes, nonlinear transfer functions and individual features as inputs. By contrast, Khan et al. [115] reported an ANN that correctly classified all samples and identified the genes most relevant to the classification. Their neural network used a linear transfer function with no hidden layer and hence was a linear classifier. The inputs to the ANN were the first 10 principal components of the genes, that is, the 10 orthogonal linear combinations of genes that accounted for most of the variability in gene expression among samples.

3.3.6 Classification and regression trees and ensemble methods

Classification and regression trees (CART) and ensemble methods have been applied to microarray gene expression data as well [11]. CART [32] builds trees: it formulates simple if/then rules for binary recursive partitioning (splitting) of all the objects into smaller subgroups. Each such step may give rise to new 'branches'. According to Dudoit et al. [52], CART tend to be unstable. Ensemble methods use aggregated (multiple) classifiers to obtain better predictive performance. Two resampling-based aggregated classifiers have been applied with relative success to microarrays: bagging (bootstrap aggregating) [29] and boosting [162]. An example of bagging is Breiman's random forests (RF) [31], which combine bootstrapping, decision trees and majority voting. Statnikov et al. [176] compared RF to SVM on 22 diagnostic and prognostic data sets. In this comparison, SVM outperformed RF.
The authors found that the linear decision functions of SVM may be less sensitive to the choice of input parameters than RF. Boosting is based on emphasizing the training instances that previous models misclassified. The best-known implementation of boosting is AdaBoost [198]. While boosting works well on a variety of different types of data, it is not suited to microarray gene expression data according to [52]. Dettling and Bühlmann [48] demonstrated with LogitBoost a modification of boosting that becomes an accurate classifier in the context of gene expression data. They found that it gave lower error rates than the commonly used AdaBoost algorithm and concluded that this was because LogitBoost is less sensitive to outliers.

3.3.7 Bayesian networks

Bayesian networks (BN) [95] have received growing interest in recent years. A BN is a probabilistic graph-based model that consists of two parts [78]: (1) a directed acyclic graph, which is called the structure of the model and represents the relations among the variables; and (2) local probability models, which are the numerical parameters (conditional probabilities) for a given network topology. The learning task can be separated into two subtasks: (1) structure learning, which is topology identification; and (2) parametric learning, which is the estimation of the numerical parameters. An example of the application of BN to expression pattern recognition in microarray data can be found in [72]. BN have found much use as a representational tool for modeling gene-gene relations and for gene regulatory network analysis [147, 90, 72]. Helman et al. [96] and Gevaert et al. [78] applied BN to the prediction of prognosis in cancer. The major advantages of BN are the possibility of integrating diverse data by using a prior over the model space [78, 77] and their data interpretability.
The major disadvantages of BN are the following [119]: (1) BN are often not practical for large systems because they are either too costly to compute or infeasible, given the large number and combinations of variables; (2) there is no universally accepted method for constructing a network from data; and (3) BN depend on the quality of prior beliefs: a BN is only as useful as this prior knowledge is reliable.

3.4 Comparison between methods

Given the number and diversity of available methods, the choice of an adequate class prediction method depends on the needs of potential users. To help answer this question, there are several comparison studies [52, 123, 176, 103]. Some of the studies are neutral, whereas others aim at demonstrating the superiority of a particular method. Dudoit et al. [52] compared more than 7 classification methods. Their main conclusion was that simple classifiers such as DLDA and kNN performed remarkably well in comparison to more sophisticated ones, such as RF. Lee et al. [123] extended the previous analysis, including more methods (up to 21) and more data sets (7). They reached the conclusion that no classifier is uniformly better than the others, and they found better performance for more complex methods. In Statnikov et al. [176], SVM outperformed RF. Huang et al. [103] compared 5 methods, including some regularization methods. They also concluded that no classifier is uniformly better than the others, but that RF performed slightly better. In any case, and whatever the chosen method is, the performance of most methods depends on the quality of the data used to build the classifier (data acquisition and processing, and the sample size).
Generally, the comparison of prediction methods across the published literature is difficult because: (1) methods are evaluated on different data sets; (2) data sets are pre-processed in different ways; (3) different evaluation schemata and error measures are applied; and (4) methods are not always evaluated correctly. Dupuy and Simon [54] showed that in more than half of a representative sample of past cancer research studies, inadequate statistical validation was performed. They studied ninety research articles, three quarters of which were published in journals with an impact factor greater than 6. Typical mistakes were that test data had been used during classifier training, and that variable selection was often not treated as a step of classifier construction and was carried out on the whole data set. Such mistakes in evaluation give biased and overoptimistic estimates of classifier performance. Guidelines for good practice in microarray data analysis, including class prediction, can be found in [168, 169].

Chapter 4

Assessment of classifier performance

4.1 Validation

There are several ways to assess classifier performance. Ideally, the constructed classifier should be tested on an independent validation data set. This is usually not possible because gene expression data sets do not have enough samples. According to [102], it is not recommended to estimate performance based on a single learning data set and test data set in high-dimensional settings with a limited number of samples, because the results depend strongly on the chosen partition. Another possibility is to use a procedure which tests the classifier on data that were not used for its construction, such as cross-validation (including its special case, leave-one-out cross-validation), Monte-Carlo cross-validation or bootstrap sampling. A detailed description of validation strategies can be found, for example, in [27].
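Of the strategies listed, Monte-Carlo cross-validation amounts to repeated random splitting without replacement; it can be sketched as follows (the function name is ours; test_frac = 0.2 corresponds to a 4:1 training/test ratio).

```python
import numpy as np

def mccv_splits(n, k=100, test_frac=0.2, seed=0):
    """Monte-Carlo cross-validation (subsampling): k independent random
    splits of the sample indices {0,...,n-1} into non-overlapping
    training and test sets, drawn without replacement. Yields
    (train_idx, test_idx) pairs."""
    rng = np.random.default_rng(seed)
    n_test = int(round(n * test_frac))
    for _ in range(k):
        perm = rng.permutation(n)              # random order of all samples
        yield perm[n_test:], perm[:n_test]     # rest for training, head for test
```

A classifier would be fitted on each training index set and scored on the matching test set, and the k performance values averaged.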
Alternatively, classifier performance can be assessed on synthetic data with constructed ground-truth labels. However, this approach may not reproduce the particular statistical characteristics of molecular profiles from a given population.

In this thesis, Monte-Carlo cross-validation (MCCV), also called subsampling, is applied as the validation strategy. MCCV generates a learning set by drawing from the {1, · · · , n} samples randomly and without replacement; the test data set consists of the remaining samples. The random splitting into non-overlapping learning and test data sets is repeated k times and the classifier's performances are averaged. Following the recommendation in [27], the splitting ratio of the training and test data sets is set to 4 : 1. The number of iterations is set to k = 100; basing the estimates on a larger number of iterations makes them more stable. This approach usually leads to more stable results than standard cross-validation [27].

Figure 4.1 describes the validation procedure. First, k pairs of learning and test data sets are generated based on a validation method and microarray gene expression data with class labels. For each of the k iterations:

• a classifier is constructed based on the prediction method and the k-th training data set (a feature selection/extraction procedure can be included in this step);
• a response is predicted using the constructed classifier and the k-th test data set.

Then, classifier performance is estimated with some performance measure.

Figure 4.1: Schema of microarray gene expression data validation. The data are repeatedly randomized into a training and a test set; feature selection yields reduced training and test sets; a classifier with estimated parameters is built on the reduced training set, and its response on the reduced test set gives the error.

4.2 Performance measures

The response usually consists of predicted class probabilities. Figure 4.2a shows an example of the response distribution of a binary classifier.
Based on the chosen threshold t, the classifier produces different counts of true positives (TP), true negatives (TN), false positives (FP) (type I errors) and false negatives (FN) (type II errors). Error (P(Ŷ ≠ Y)) and accuracy (P(Ŷ = Y)) are performance measures based on a threshold. Most reported prediction performance measures are based on user-defined thresholds for a single operating point, such as the error rate, which uses the simple threshold t = 0.5, or the equal error rate, which is defined by the point where the false positive and false negative rates are equal. A more meaningful estimate that reveals the ratio of the true positive rate (P(Ŷ = A|Y = A)) to the false positive rate (P(Ŷ = A|Y = B)) at different thresholds is the receiver operating characteristic (ROC) curve. Unfortunately, the ROC is a plot, not a single value. The quantitative summary of a ROC curve, and a commonly used summary measure of diagnostics, is the area under the ROC curve (AUC) (Figure 4.2b), which is used as the classifier performance measure in this thesis.

Figure 4.2: Performance of a binary classifier. (a) Response: the distributions of predicted probabilities for good and poor prognosis overlap, and a threshold splits the samples into TP (hits), TN (rejections), FP (false alarms) and FN (misses). (b) Area under the ROC curve: true positive rate against false positive rate; a larger AUC is better.

AUC is useful in that it aggregates performance across the entire range of thresholds. According to Bradley [28], AUC is a statistically consistent and more discriminating measure than accuracy; his paper demonstrates several desirable properties of AUC compared to accuracy. AUC is estimated with the Wilcoxon-Mann-Whitney (WMW) statistic [132]. The ROC curve of a finite set of samples is a step function. Yan et al. [199] demonstrate that the AUC is exactly equal to the normalized WMW statistic in the form:

U = ( Σ_{i=0}^{m−1} Σ_{j=0}^{l−1} I(a_i, b_j) ) / (m l),   (4.1)

where

I(a_i, b_j) = 1 if a_i > b_j, and 0 otherwise.   (4.2)
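Equation (4.1) translates directly into code: the AUC is the fraction of class-A/class-B score pairs that are ranked correctly. A small NumPy sketch (the function name is ours; ties count as incorrect, exactly as in the indicator above):

```python
import numpy as np

def auc_wmw(scores_a, scores_b):
    """AUC as the normalized Wilcoxon-Mann-Whitney statistic of (4.1):
    the fraction of pairs (a_i, b_j) in which the class-A output a_i
    exceeds the class-B output b_j."""
    a = np.asarray(scores_a)[:, None]          # shape (m, 1)
    b = np.asarray(scores_b)[None, :]          # shape (1, l)
    return float((a > b).mean())               # mean over all m*l pairs
```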
The WMW statistic [132] is based on pairwise comparisons between the samples a_i, i = 0, · · · , m − 1, of a random variable A and the samples b_j, j = 0, · · · , l − 1, of a random variable B. If {a_0, a_1, · · · , a_{m−1}} are the classifier outputs for m examples of class A, and {b_0, b_1, · · · , b_{l−1}} are the classifier outputs for l examples of class B, the AUC for this classifier is obtained via (4.1). The overall AUC is estimated as the mean of the AUCs over the k MCCV iterations:

µ_U = (1/k) Σ_{i=1}^{k} U_i,   (4.3)

with the standard deviation given as:

σ_U = sqrt( (1/k) Σ_{i=1}^{k} (U_i − µ_U)² ).   (4.4)

Other performance measures, such as the error rate or accuracy evaluated over the k MCCV iterations, can be estimated analogously.

Chapter 5

Data integration

The concept of data integration is not well defined in the literature and may mean different things. In this thesis, data integration is defined as the process of combining data from different sources. Microarray gene expression data are combined with data from other sources (e.g. clinical variables, genomic variation data, interactomic data, etc.) to improve predictive classification. The first part of this chapter (Section 5.1) categorizes data integration methods from different points of view. The second part (Section 5.2) discusses different information resources that can be combined with microarray gene expression data. Both parts include examples from the published literature.

5.1 Categorization of methods

Data integration methods can be categorized by the type of data (integration of similar data types and of heterogeneous data types) or by the stage of integration (early, intermediate, late and serial integration, see Figure 5.1).

Figure 5.1: Stages of data integration: early integration is input-based (at the data level), intermediate integration is kernel-based (at the feature level), and late integration is decision-based (at the response level).
5.1.1 Similar data types

Similar data types arise from the same underlying source; that is, they are all gene expression, single nucleotide polymorphism (SNP), protein, copy number variation (CNV), clinical, etc. data. A simple merger of available data from this category is not usually applicable due to differences in technologies, platforms, etc. A more detailed description of microarray gene expression data integration belonging to the similar data types category, with some examples from the published literature, can be found in Section 5.2.2.

5.1.2 Heterogeneous data types

The integration of heterogeneous data types involves two or more different data sources. It is a challenging task because the data can be very diverse. This category falls within the topic of this thesis. Details on combining data sources – gene expression with SNP, CNV, clinical, protein, etc. data – are given in Section 5.2.

5.1.3 Early integration

Early integration combines data at the input level. For example, data from different studies, experiments or labs can be merged in order to increase the sample size. Jiang et al. [110] combined two lung cancer studies to distinguish diseased from normal samples more precisely. Aerts et al. [9] generated distinct prioritizations for multiple heterogeneous data sources, which were then integrated, or fused, into a global ranking based on their similarity to known genes, using order statistics to prioritize candidate genes underlying diseases.

5.1.4 Intermediate integration

In intermediate integration, each data source is first transformed into another format and then the data sources are combined. For example, the data can be converted into similarity matrices, such as covariance or correlation matrices, and the matrices combined before building a classifier for better prediction. The similarity matrices are also called 'kernel matrices' and the integration, or the methods, are known as 'kernel-based'.
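The kernel-based idea can be sketched as follows: each data source, measured on the same n samples, is converted into an n × n similarity (kernel) matrix, and the kernels are combined by a weighted sum before a classifier is trained. This is a minimal illustration of intermediate integration, not the specific method of any cited work; the names and the linear kernel choice are ours.

```python
import numpy as np

def combine_kernels(data_sources, weights=None):
    """Intermediate integration sketch: build a linear kernel
    (sample-by-sample similarity matrix) for each source and combine
    them as a weighted sum. Each source is an n x p_i array over the
    same n samples; p_i may differ between sources."""
    kernels = [X @ X.T for X in data_sources]          # each is n x n
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)  # equal weights
    return sum(w * K for w, K in zip(weights, kernels))
```

A kernel classifier such as an SVM can then be trained on the combined matrix; choosing the weights per source is where methods like those discussed below differ.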
Lanckriet et al. [121] used a kernel-based SVM method to recognize particular classes of proteins – membrane proteins and ribosomal proteins. They integrated amino acid sequences, gene expression data and protein-protein interactions with a kernel-based SVM method that finds a classification rule together with corresponding weights for each data set. Daemen et al. [43] used a kernel-based least squares SVM to combine microarray gene expression and proteomics data in order to predict outcomes in rectal and prostate cancers.

5.1.5 Late integration

Late integration (sometimes called decision integration) combines the final statistical results from different studies. Data from different sources are learned separately and separate models are built; the predictions of the separate models for the outcome are then fused. Ioannidis et al. [104] integrated two Affymetrix and one Illumina SNP data sets: the odds ratio was first calculated for each SNP in each study, and a random effects model was used to combine the odds ratios. Gevaert et al. [77] integrated clinical and microarray data with a strategy based on Bayesian networks in order to classify breast cancer patients; they combined the probability of the outcome under a clinical model with the probability of the outcome under a microarray model.

5.1.6 Serial integration

Serial integration can use and combine the foregoing data integration schemata. In serial integration, different sources are processed sequentially: the outcomes of one stage are the inputs to the next analysis phase. For example, a combination of clinical and protein expression data based on serial integration was implemented in [117] to predict the early mortality of patients undergoing kidney dialysis.

5.2 Information resources

Data integration methods can also be categorized by the data source that they combine with microarray gene expression data.
In the computational biology and bioinformatics community, the 'omics' suffix is used for various kinds of bioinformatic data. 'Omics' include genomics (the quantitative study of genes, regulatory and non-coding sequences), transcriptomics (RNA and gene expression), proteomics (protein expression), metabolomics (metabolites and metabolic networks), interactomics (interactions and relationships among genes, proteins and metabolites), etc. As pointed out in [76], it is not known at present which 'omics' carries the most information related to disease outcome. Owing to the lack of comprehensive studies, validation studies are required to verify which omics carries the most information and whether a combination of omics data improves predictive performance.

5.2.1 Genomic variation data

Variations in the DNA sequences of humans affect how humans develop diseases and respond to pathogens, chemicals, drugs, vaccines and other agents. The analysis of human single nucleotide polymorphisms (SNP) [118] has led to the identification of interesting SNP markers for certain disorders. Copy number variations (CNV) [114] – gains or losses of segments of genomic DNA relative to a reference – have also been shown to be associated with several complex and common disorders [154, 67]. Using array-based comparative genomic hybridization (CGH) techniques [154], CNV at multiple loci can be assessed simultaneously, allowing for their identification and characterization. CNV microarrays allow exploration of the genome for sources of variability beyond SNP that could explain the strong genetic component of several disorders. Table 5.1 shows examples of public genomic variation databases. Combinations of microarrays with specific types of genomic variation data are described in [156, 189, 17, 200]. Pounds et al. [156] used projection methods and an efficient permutation-testing algorithm. Walker et al.
[189] mapped SNP into expression data in order to reduce the complexity of standard microarrays and identified candidate genes in multiple myeloma. Andrews et al. [17] refined a gene expression signature of breast cancer metastasis with CNV. Yang et al. [200] introduced a web application for multi-dimensional analysis of CGH, SNP and microarray data which supports the interpretation of microarrays.

Source            Description                   URL
dbSNP             SNPs                          http://www.ncbi.nlm.nih.gov/snp
Progenetix        SNPs, CGH                     http://www.progenetix.net/
COSMIC            Somatic mutations in cancer   http://www.sanger.ac.uk/genetics/CGP/cosmic/
The CNV Project   CNV                           http://www.sanger.ac.uk/humgen/cnv/

Table 5.1: Examples of genomic variation databases.

5.2.2 Gene expression data

Although data sets in public databases usually have to meet the criteria of standards for biological research data quality, annotation and exchange (see MGED standards [19]), there are differences in the type of microarray used, gene nomenclatures, species, and analytical methods. A way to increase the quality of information and statistical power in microarray data is meta-analysis [36]. Microarray meta-analysis combines a diverse collection of microarray data sets to assess the intersection of multiple gene expression signatures. This approach provides a precise view of genes, while simultaneously allowing for differences between laboratories. For example, Rhodes et al. [159] collected and analyzed 40 published cancer microarray data sets in order to characterize a common transcriptional profile of cancer progression. Ma et al. [130] proposed a regularized gene selection method that is applied to multiple pancreatic and liver cancer data sets. An alternative type of gene expression data is microRNA [128]. In [76], microRNA expression profiles achieved better classification performance than mRNA profiles on the same samples. MicroRNA profiles have also shown promise in predicting the prognosis of cancer [105, 141].
Examples of public gene expression databases are given in Table 5.2.

Source         Description                                                       URL
ArrayExpress   Functional genomics experiments including gene expression data   http://www.ebi.ac.uk/microarray-as/ae/
GEO            Gene Expression Omnibus, functional genomics data                 http://www.ncbi.nlm.nih.gov/geo/
microRNA.org   microRNA targets and expression data                              http://www.microrna.org/

Table 5.2: Examples of gene expression databases.

5.2.3 Proteomic data

Proteomic technology based on mass spectrometry [47] can have a significant impact on the outcome prediction of cancer patients, especially when taking into account its ability to detect post-translational modifications [45, 112]. Another approach to studying the proteome uses protein microarrays (also called antibody microarrays) [196]. Whereas mass spectrometry-based proteomics is considered an unbiased approach, protein/antibody microarrays represent an approach focused on studying the proteome quantitatively [131]. A number of studies have considered whether changes in mRNA concentration are reflected by similar changes in protein abundance, e.g., [179, 37, 85]. Poor correspondence between transcript and protein levels has typically been reported, and in some cases little or no correlation has been found at all [85]. Bitton et al. [22] combined protein mass spectrometry with Affymetrix Exon array data at the level of individual exons and found significantly higher degrees of correlation than had previously been observed in a number of studies. Daemen et al. [42] demonstrated that combining microarray and proteomics data sources improves the predictive power. They used a kernel-based method with Least Squares Support Vector Machines to predict rectal cancer regression grades. Table 5.3 shows examples of public proteomic databases.

5.2.4 Interactomic data

Interactomic data represents interactions and relationships among biological system components – genes, proteins, small molecules, diseases.
The relations form networks or pathways which can be used to construct complex networks that are further analysed. Given the complex nature of biological systems, the networks can be large-scale. A network typically consists of a set of nodes and edges, which represent the biological system components and the interactions between them, respectively. Table 5.4 shows examples of public interactomic databases.

Source   Description                                                         URL
OPD      Open Proteomics Database, mass spectrometry data                    http://bioinformatics.icmb.utexas.edu/OPD/
GeMDBJ   Genome Medicine Database of Japan, proteomic data in cancer         https://gemdbj.nibio.go.jp/dgdb/DigeTop.do
HPA      The Human Protein Atlas, expression and localization of proteins    http://www.proteinatlas.org/

Table 5.3: Examples of proteomic databases.

Source    Description                                                        URL
BioGRID   Datasets of physical and genetic interactions                      http://thebiogrid.org/
HPRD      The Human Protein Reference Database, interaction networks and
          disease association for each protein in the human proteome         www.hprd.org
KEGG      Kyoto Encyclopedia of Genes and Genomes, molecular interaction
          and reaction networks                                              www.genome.jp/kegg/

Table 5.4: Examples of interactomic databases.

Lee et al. [122] proposed a classification method based on a mapping of gene expression data onto different biological pathways. The investigated pathways were obtained from metabolic and signaling pathways derived from manually annotated databases. For each pathway, they mapped the expression values of each gene onto its corresponding gene (protein) in the pathway and searched for a subset of genes that could be used to differentiate between the samples of the phenotypes investigated. Chuang et al. [39] applied a protein-network-based approach that identified markers not as individual genes, but as subnetworks extracted from protein interaction databases.
In order to integrate the expression and network data sets, they overlaid the expression values of each gene on its corresponding protein in the network and searched for subnetworks whose activities across the patients were highly discriminative of metastasis. This method achieved higher accuracy in the classification of metastatic versus non-metastatic tumors. Ergun et al. [63] implemented an algorithm that inferred a global network of gene regulatory interactions from a training gene expression data set related to diverse biological processes and pathologies. This network was then used to detect genes in a test data set that appeared to be altered in a specific phenotype, for example, a disease class.

5.2.5 Textual data and ontologies

Text mining helps biologists to collect disease-gene associations automatically from large volumes of biological literature. Currently, the most important resource for biomedical text mining applications is the MEDLINE literature repository [7] developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM). MEDLINE covers all aspects of biology, chemistry, and medicine, and there is almost no limit to the types of information that may be recovered through careful and exhaustive mining [165].

Source   Description                                                         URL
GO       Gene Ontology, descriptions of gene and gene-product attributes     www.geneontology.org/
HPO      Human Phenotype Ontology, descriptions of individual phenotypic
         anomalies                                                           www.human-phenotype-ontology.org/
eVOC     eVOC Ontology, descriptions of genome sequence and expression
         phenotype                                                           www.evocontology.org/

Table 5.5: Examples of bio-ontology resources.

Ontologies are conceptual models that provide the framework for semantic representation of information [83]. Ontologies usually support two types of relations, ‘is-a’ and ‘part-of’. Bio-ontologies also support relations specific to the biomedical domain (e.g.
‘has-location’, ‘clinically-associated-with’, ‘has-manifestation’ [173]). Table 5.5 shows examples of bio-ontology resources. The Gene Ontology (GO) [89] is a widely used bio-ontology resource which can be employed in the analysis of relationships among genes. Genes and their products in different organisms are annotated using a common terminology by GO collaborators. GO is divided into three ontologies: (a) biological process (BP), (b) molecular function (MF), and (c) cellular component (CC). The three ontologies are represented as directed acyclic graphs in which nodes correspond to terms and edges represent their relationships. GO is utilized in many tools that provide methods for functional annotation and interpretation of gene lists derived from microarray experiments (see [5]). For example, methods that combine microarray gene expression data with GO for the purpose of predictive classification are described in [157, 151, 38]. All the examples employ GO for feature selection or extraction. Cho et al. [38] and Qi et al. [157] searched for expression correlations, while Papa et al. [151] focused on the elimination of highly correlated genes. Gevaert et al. [78] integrated information from literature abstracts with gene expression data using Bayesian network models to predict the outcome of cancer patients. Yu et al. [202] retrieved biomedical knowledge using nine bio-ontologies and text information from MEDLINE titles and abstracts to obtain a precise identification of disease-relevant genes for gene prioritization and clustering. A summary of ontology and text-mining methods in biomedicine can be found in [173].

5.2.6 Clinical data

Clinical data forms the basis of a doctor’s diagnostic decision making. It includes, for example, patient history, laboratory analyses and medications. Clinical data is often available alongside a microarray data set (see Table 5.2 for possible resources).
In general, microarray predictors are much more difficult and expensive to collect than clinical ones. There are several studies comparing the predictive power of microarrays with clinical data [26, 56, 181]. Methods that combine microarrays with clinical data to construct a joint predictor are described, e.g., in [77, 155, 178, 177]. Boulesteix et al. [26] proposed an approach based on RF and PLS dimension reduction in order to provide a combined classifier and compare the predictive power of microarray and clinical data. Gevaert et al. [77] based the clinical and microarray data integration on BN, which included structure and parameter learning. Sun et al. [178] used the I-RELIEF wrapper feature selection method together with LDA for class prediction. Pittman et al. [155] proposed a method based on statistical classification tree models that evaluate the contributions of multiple forms of data, both clinical and genomic, to predict patient outcomes. They defined metagenes, which characterize dominant common expression patterns within clusters of genes, and combined them with traditional clinical risk factors.

5.2.7 Other data

Another kind of information that has been used in combination with microarray gene expression data in cancer research is DNA methylation. DNA methylation is a type of chemical modification of DNA influencing cancer outcome [76]. It involves the binding of a methyl group to CpG islands in the genome [120]. CpG islands are often found in the regulatory regions of genes and are often associated with transcriptional inactivation. Most studies focus on using DNA methylation for the early detection of cancer [120]; however, its use in predicting prognosis has also been shown [194, 13]. Other information sources can be disease-specific databases, such as the database of human genes and genetic disorders – Online Mendelian Inheritance in Man (OMIM) [8] – or the database of genetic association studies performed on Alzheimer’s disease – AlzGene [1].
For example, Tsai et al. [184] integrated microarray expression data and OMIM information to investigate a connection of enterovirus 71 with neurological diseases.

Chapter 6

Data sets

Six publicly available real data sets and several simulated data sets are used for the evaluation of experiments in this thesis. These sets are introduced in the following sections.

6.1 Real data sets

van’t Veer data set

van’t Veer et al. [185] classified breast cancer patients after curative resection with respect to the risk of tumor recurrence. The set includes gene expression data and clinical variables. cDNA Agilent microarray technology was used to give the expression levels of 22,483 genes for 78 breast cancer patients. Forty-four patients classified into the good prognosis group did not suffer from a recurrence during the first five years after resection; the remaining 34 patients belong to the poor prognosis group. The data set is prepared as described in [185] and is included in the R package ‘DENMARKLAB’ [68]. Only genes that show a two-fold differential expression and a p-value for a gene being expressed < 0.01 in more than five samples are retained, which yields a set of 4,348 genes. The clinical variables are age, tumor grade, estrogen receptor status, progesterone receptor status, tumor size and angioinvasion.

Pittman data set

This data set was introduced by Pittman et al. [155]. Gene expression data was prepared with Affymetrix Human U95Av2 GeneChips. The data set gives the expression levels of 12,625 genes for 158 breast cancer patients. Regarding recurrence of the disease, 63 of the patients are classified into the poor prognosis group, and the remaining 95 patients belong to the good prognosis group. The data was pre-processed using the packages ‘affy’ and ‘genefilter’ to normalize and filter the values. Genes that showed a low variability across all samples were cleared out. The resulting data set includes 8,961 genes.
The clinical variables are age, lymph node status, estrogen receptor status, family history, tumor grade and tumor size.

Wang data set

The data set presented by Wang et al. [192] comprises the expression levels of 22,283 genes for 286 lymph-node-negative breast cancer patients. Based on relapse, 107 of these patients had developed distant metastases within five years and were classified into the poor prognosis group; the remaining 179 patients belong to the good prognosis group. Gene expression data was analysed with Affymetrix Human U133a GeneChips. The data was pre-processed using the packages ‘affy’ and ‘genefilter’. Genes that showed a low variability across all samples were cleared out. The resulting data set includes 22,260 genes. The clinical variables are estrogen receptor status and lymph node status. However, the lymph node status is negative for all patients.

Mainz data set

The Mainz data set was published by Schmidt et al. [163]. Gene expression data gives the expression levels of 22,283 genes for 200 lymph-node-negative breast cancer patients. We considered distant metastases as the response. Based on distant metastases after five years, 46 of these patients were classified into the poor prognosis group, and the remaining 154 patients belong to the good prognosis group. Gene expression data was analysed with Affymetrix Human U133a GeneChips. The data set was prepared as described in [163]. It is included in the R package ‘breastCancerMAINZ’ [6]. The available clinical variables are the age at diagnosis, estrogen receptor status, tumor size and tumor grade.

Sotiriou data set

The data set was presented by Sotiriou et al. [171]. Gene expression data was prepared with cDNA microarray chips. The author’s web site provides both raw and pre-processed gene expression data. The pre-processed data gives the expression levels of 6,860 genes for 99 breast cancer patients.
According to relapse, 45 of the patients are classified into the poor prognosis group and 54 of the patients belong to the good prognosis group. There are missing values in this data set; their occurrence is equal to 2.17%. After filtering out all missing values, the data set includes 4,246 expression levels. Alternatively, fewer values could be filtered out and a missing-data imputation method used for the rest. We evaluated this alternative. The missing values were summed in each row and only the rows that had more than 5% of missing values were filtered out (see Figure 6.1a). In this figure, the line denotes this threshold.

[Figure 6.1: Missing values; (a) Sotiriou data set – gene expression data, (b) CFS data set – SNP data. The circles denote the per-row sums of missing values [%]; rows with more than 5% of missing values were filtered out, and the red line denotes this threshold.]

The whole microarray data set included 0.54% of missing values after filtering. Then we used a sequential kNN imputation method [113], which estimates missing values starting from the gene that has the lowest missing rate in the microarray data, using a weighted mean of the k nearest neighbors (k = 5). The resulting data set had 6,336 expression levels. We compared this data set with the data set where all missing values were filtered out and evaluated the binary outcome prediction of both data sets. We evaluated each data set 100 times. The first data set with 4,246 expression levels provided results 6% better on average than the second data set with 6,336 expression levels in various gene selection settings. That is why we preferred the data set with all missing values filtered out for further experiments.
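The kNN imputation idea can be sketched as follows. This is a simplified, non-sequential illustration (the sequential ordering of [113], which imputes rows starting from the lowest missing rate, is omitted for brevity): each missing entry of a row is replaced by an inverse-distance-weighted mean over the k complete rows that are closest on the commonly observed columns.

```python
import numpy as np

def knn_impute(X, k=5):
    """Impute NaNs in each incomplete row from the k nearest complete rows."""
    X = np.array(X, dtype=float)
    out = X.copy()
    complete = ~np.isnan(X).any(axis=1)          # rows with no missing values
    donors = X[complete]
    for i in np.where(~complete)[0]:
        row = X[i]
        obs = ~np.isnan(row)                     # observed columns of this row
        d = np.sqrt(((donors[:, obs] - row[obs]) ** 2).sum(axis=1))
        nn = np.argsort(d)[:k]                   # k nearest complete rows
        w = 1.0 / (d[nn] + 1e-12)                # inverse-distance weights
        for j in np.where(~obs)[0]:
            out[i, j] = np.average(donors[nn, j], weights=w)
    return out

# Toy matrix: the last row is missing its third value.
X = [[1.0, 2.0, 3.0],
     [1.1, 2.1, 3.1],
     [5.0, 6.0, 7.0],
     [1.0, 2.0, np.nan]]
print(knn_impute(X, k=2)[3, 2])   # close to 3.0, from the two nearest rows
```

In the real setting the rows are gene expression patterns and the distances are computed over samples; the toy matrix above is purely illustrative.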
The clinical variables are age, lymph node status, estrogen receptor status, tumor grade and tumor size. The Sotiriou data set includes clinical variables that the author’s supplementary web site describes in more detail.

CFS data set

The chronic fatigue syndrome (CFS) data set comes from a four-year longitudinal study conducted by the Centers for Disease Control (CDC) [3]. It contains gene expression, clinical, SNP and proteomic data. It was used as a conference contest data set in CAMDA 2006 and CAMDA 2007. The CFS data set with a detailed description is publicly available on the CAMDA website [2]. cDNA gene expression data was analysed with the MWG Biotech platform. SNP data consists of 36 autosomal SNPs that the CDC had previously selected from eight candidate CFS genes: TPH2 (SNPs selected from locus 12q21), POMC (2p24), NR3C1 (5q34), CRHR2 (7p15), TH (11p15), SLC6A4 (17q11.1), CRHR1 (17q21) and COMT (22q11.1), see Smith et al. [170]. The SNPs are coded as 0, 1, or 2 for genotypes AA, AB, and BB, respectively. The occurrence of missing values in the SNP data is equal to 2.17%. The sum of missing values in each row did not exceed 5%. The missing values were imputed by a sequential kNN imputation method [113]. Clinical variables with more than 5% of missing values were filtered out. Clinical variables include summary variables, such as intake illness classification and empirical illness classification, as well as variables reflecting the medical symptoms on which these summaries are based. For the experiments we chose the intake illness classification of CFS, which is based on the 1994 case definition criteria [73]. This variable has five levels: ever CFS, ever CFS with major depressive disorder with melancholic features (MDDm), ever insufficient symptoms or fatigue (ISF), ever ISF with MDDm, and nonfatigued. We reclassified the patients into two groups: a CFS disease group and a non-CFS disease group.
The CFS disease group includes all CFS-like patients, while the non-CFS disease group includes the remaining ISF and nonfatigued patients. Table 6.2 shows the number of patients belonging to each group. The sets of patients with gene expression, clinical, SNP and proteomic data did not completely overlap. Therefore, we chose the overlapping data set with gene expression, clinical and SNP data, which has 164 patients. Including the proteomic data on top of that, the data set consists of only 44 overlapping patients. Table 6.1 summarizes the dimensions of each data set.

A data set        Dimension
gene expression   164 x 19797
clinical          164 x 61
SNP               164 x 42

B data set        Dimension
gene expression   44 x 19797
clinical          44 x 61
SNP               44 x 42
proteomic         44 x 479

Table 6.1: CFS data set characteristics.

                    A data set   B data set
patients               164           44
CFS diseased            64           23
non-CFS diseased       100           21

Table 6.2: CFS data set – patient groups.

6.2 Microarray data pre-processing

Both raw and already pre-processed data sets were used for the experiments. The raw data sets, which were Affymetrix, required pre-processing. Pre-processing of Affymetrix and cDNA data sets differs, though some steps are similar. There is a wide variety of methods employed in the pre-processing of microarray data. Interested readers can find more information, for example, in [74]. We used the Robust Multi-Array (RMA) approach [106], which appears to be promising even though it requires large amounts of RAM. The RMA approach retains probe-level information.

[Figure 6.2: Pittman data set box-plots of Log2(Intensity) per array: (a) raw data, (b) pre-processed data.]

Affymetrix arrays
typically use between 11 and 20 probe pairs for each gene, each of which has a length of 25 base pairs. One component of these pairs is referred to as a perfect match probe (PM), while the other component, with the changed middle base, is called a mismatch probe (MM). A PM is designed to hybridize only with transcripts of the intended gene (specific hybridization), while an MM is constructed to measure only the non-specific hybridization of the corresponding PM probe (non-specific hybridization is unavoidable). The advantage of the RMA approach is that it does not involve an implicit subtraction of the MM probe values [106]. The subtraction can lead to a lot of noise at low signal values. Instead, RMA looks at the distribution of the PM probe values and fits a combination of two distributions: a ‘noise’ distribution that is normally distributed and a ‘signal’ distribution that is exponentially distributed. The normalized values are estimated through the expected value of the signal distribution. The RMA approach consists of three processing steps: convolution (background) correction, quantile normalization [24], and a summarization based on a multi-array model fitted robustly using the median polish algorithm [61]. Figures in this section illustrate pre-processing results of the Pittman data set. Box-plots in Figure 6.2 depict the data set before and after pre-processing. Due to high memory requirements, it was impossible to pre-process the data on a standard PC (Intel Core 2 Duo T7250 2.00 GHz, 2 GB RAM) with a 32-bit operating system; a machine with 16 GB RAM and a 64-bit operating system was used instead.

[Figure 6.3: Example plots of PM intensities of the first ten arrays from the Pittman data set: (a) density estimates, (b) raw data box-plots, (c) pre-processed data box-plots.]
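The quantile normalization step of the RMA pipeline can be sketched in a few lines. This is a minimal numpy illustration on a toy matrix; the real pipeline operates on background-corrected probe intensities and handles ties more carefully:

```python
import numpy as np

def quantile_normalize(X):
    """Force every array (column) onto the same distribution:
    each array's sorted values are replaced by the mean of all
    arrays' sorted values at the same rank."""
    X = np.asarray(X, dtype=float)          # probes x arrays
    order = np.argsort(X, axis=0)           # sort each array independently
    ranks = np.argsort(order, axis=0)       # rank of each value in its array
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)
    return mean_quantiles[ranks]

X = [[5.0, 4.0],
     [2.0, 1.0],
     [3.0, 6.0]]
Xn = quantile_normalize(X)
# After normalization both columns contain the same set of values.
print(np.sort(Xn[:, 0]))
print(np.sort(Xn[:, 1]))
```

The relative ordering of probes within each array is preserved; only the value distribution is replaced by the common one.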
For illustrative purposes, other figures in this section depict results just for the first ten arrays from the Pittman data set. Figure 6.3 examines PM intensity behaviour with histograms (or density estimates) of PM intensities for each array and box-plot distributions of raw and pre-processed log-scale PM probe intensities. This type of visualization provides a useful exploratory tool for quality assessment. For example, an array with a bimodal distribution among the density estimates in Figure 6.3a can indicate a spatial artifact. Figures 6.4 and 6.5 show MA-plots of raw and pre-processed arrays from the Pittman data set. The MA-plot is another quality assessment image. It can highlight the need for normalization, gives a better idea of the differences between arrays in the shape or center of the distribution, and can detect bad quality arrays. A loess curve, which is red in the MA-plot figures, is fitted to the scatter-plot to summarize any non-linear relationship. An MA-plot with an oscillating curve indicates evident array quality problems. According to Figure 6.4, mainly arrays 2 and 5 need to be normalized.

[Figure 6.4: MA-plot of the first ten arrays from the Pittman data set before pre-processing.]

[Figure 6.5: MA-plot of the first ten arrays from the Pittman data set after pre-processing.]

To avoid making an MA-plot for every pairwise comparison, the MA-plots of the ten arrays are plotted against a reference array created by taking probe-wise medians. The MA-plot uses M as the y-axis and A as the x-axis, where M and A are defined as:

M = log2(Xn) − log2(R),
A = (1/2) (log2(Xn) + log2(R)),

where Xn is array number n = 1, 2, . . . , 10 and R denotes the reference array. The interpretation of an MA-plot is as follows: M represents each gene’s log fold change, i.e., how much each gene differs from the reference array, while A represents the log intensity of each gene.
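The M and A values can be computed directly from the definitions above; a small numpy sketch against the probe-wise median reference (toy intensities, 0-based array index):

```python
import numpy as np

def ma_values(arrays, n):
    """M and A of array `n` versus the probe-wise median reference array."""
    X = np.log2(np.asarray(arrays, dtype=float))   # probes x arrays, log scale
    R = np.median(X, axis=1)                       # reference: probe-wise median
    M = X[:, n] - R                                # per-gene log fold change
    A = 0.5 * (X[:, n] + R)                        # per-gene mean log intensity
    return M, A

# Two toy probes measured on three arrays.
arrays = [[4.0, 8.0, 16.0],
          [2.0, 2.0, 2.0]]
M, A = ma_values(arrays, n=2)
print(M)   # probe 0: log2(16) - median(2, 3, 4) = 4 - 3 = 1
print(A)   # probe 0: (4 + 3) / 2 = 3.5
```

Plotting M against A for all probes of one array (plus a fitted loess curve) gives exactly the MA-plots shown in Figures 6.4 and 6.5.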
Figure 6.6 is a quality control (QC) plot of the arrays from the raw Pittman data set, generated with the help of the ‘simpleaffy’ package.

[Figure 6.6: ‘simpleaffy’ QC plot of the first ten arrays from the raw Pittman data set, showing percent present calls, average background, scale factors (on a scale from −3 to 3) and β-actin/GAPDH 3’:5’ ratios for each chip.]

Dotted horizontal lines separate the plot into rows, one for each chip. Each row shows the percent present calls, average background, scale factor and β-actin/GAPDH ratios for an individual chip. Percent present calls and average background are listed to the left of the figure. A percent present call represents the percentage of probesets called present on an array. It is generated by looking at the difference between the PM and MM values for each probe pair in a probeset. Dotted vertical lines provide a scale from −3 to 3. The blue region represents the spread where all scale factors fall within 3-fold of the mean scale factor for all chips. It is recommended that chips are compared only if their scale factors are within 3-fold of each other, which holds according to Figure 6.6, because all scale factor lines are coloured blue. β-actin and GAPDH values that are considered potential outliers are coloured as red triangles and circles (see the arrays numbered 6 and 8 in Figure 6.6). The QC plot measures the quality of the RNA hybridised to the chip, which is obtained by comparing the amount of signal from the 3’ probesets to the 5’ probesets. β-actin and GAPDH are relatively long genes. The majority of Affymetrix chips contain separate probesets targeting the 5’, mid and 3’ regions of their transcripts. The acceptable β-actin 3’:5’ ratio, which is plotted as a blue triangle, is less than 3. The acceptable GAPDH 3’:5’ ratio, which is plotted as a blue circle, is less than 1.25.
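The acceptance rules just described can be expressed as a toy check. This is a simplified stand-in for what ‘simpleaffy’ actually computes (the 3-fold test here is on a linear scale, and all example values are invented):

```python
# Toy QC check: scale factors within 3-fold of their mean, beta-actin
# 3':5' ratio below 3, GAPDH 3':5' ratio below 1.25.

def qc_flags(scale_factors, actin_ratio, gapdh_ratio):
    """Return a pass/fail flag for each of the three QC rules above."""
    mean_sf = sum(scale_factors) / len(scale_factors)
    return {
        "scale_factors_ok": all(mean_sf / 3 <= s <= mean_sf * 3
                                for s in scale_factors),
        "actin_ok": actin_ratio < 3.0,
        "gapdh_ok": gapdh_ratio < 1.25,
    }

print(qc_flags([0.8, 1.0, 1.5], actin_ratio=2.1, gapdh_ratio=1.1))
# all three checks pass for these toy values
```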
The GAPDH threshold is lower because GAPDH is the shorter of the two genes. The transcript BioB verifies the efficiency of the hybridisation step. Ideally, BioB should be ‘called present’ on every array. If BioB is routinely absent, then the array is performing with suboptimal sensitivity (see the ‘simpleaffy’ package documentation for a more detailed description).

6.3 Simulated data sets

A group of simulated data sets was generated to illustrate the strengths and weaknesses of the described approaches. The simulated microarray data sets consisted of 500 samples with 1000 microarray variables and 5 clinical variables. A more detailed description related to the experiments can be found in the next chapter.

Chapter 7

Combining gene expression and clinical data

7.1 Introduction

Combining gene expression with clinical data may add valuable information and can generate more accurate disease outcome predictions than those obtained based on the use of gene expression or clinical data alone. Clinical data is heterogeneous and measures various entities (e.g. tumor grade, tumor size, lymph node status), while gene expression data is homogeneous and measures gene expression. The combination of both data sources can provide complementary information. On the other hand, redundant and correlated data can have an adverse impact on prediction accuracy. In the literature, there are studies aimed at integrative prediction with gene expression and clinical data; a few papers related to this topic have been cited in Section 5.2.6. Gevaert et al. [77] based the microarray and clinical data integration on BN, which included structure and parameter learning. Li [125] proposed dimension reduction methods in the context of survival data in order to produce linear combinations of gene expressions, while taking into account clinical International Prognostic Index information.
Binder and Schumacher [21] proposed the CoxBoost boosting algorithm that employs high-dimensional data and a few clinical covariates. In this chapter, we propose a combination of logistic regression and boosting to predict disease outcome. We use logistic regression with clinical variables because it has been widely used with clinical data in clinical trials to determine the relationship between variables and outcome and to assess variable significance (see, for example, [190]). Clinical data is usually low-dimensional because microarray data sets include just a few clinical variables (typically from 5 to 10). Logistic regression cannot be used with high-dimensional data without a dimension reduction step or penalization: with high-dimensional data, logistic regression can produce numerically unstable estimates and the predicting model does not generalize well [100]. We use boosting with microarray data, which is high-dimensional. We use the version of boosting which closely corresponds to fitting a logistic regression model. It is denoted as BinomialBoosting in [35] and utilizes componentwise linear least squares (CWLLS) as a base procedure. The boosting algorithm AdaBoost [198] has attracted attention in the machine learning community because of its good classification performance with various data sets. Boosting methods were introduced as multiple prediction schemes, averaging estimated predictions from reweighted data. Breiman [30] demonstrated that the AdaBoost algorithm can be viewed as a gradient descent optimization technique in function space. Friedman [70] developed a more general statistical framework which yields a direct interpretation of boosting as a method for function estimation. He developed methods for regression which are implemented as an optimization using the squared error loss function (L2-Boosting). Later, Efron et al.
[57] established, for linear models, a connection between forward stagewise linear regression, which is related to L2-Boosting, and the L1-penalized Lasso [180]. Boosting with CWLLS was worked out by Bühlmann [34]. It can be applied to high-dimensional data because it performs coefficient shrinkage and variable selection. According to Bühlmann and Hothorn [35], the BinomialBoosting algorithm is similar to LogitBoost, which is a more accurate classifier in the context of microarray data than AdaBoost [48]. Logistic regression and BinomialBoosting can be combined within the framework of generalized linear models (GLMs), which are described at the beginning of this chapter. We describe the models separately and make clear how the data is combined. We also propose an extension designed for redundant sets of data. The extension includes pre-validation of the models built with microarray and clinical data followed by a calculation of weights. The weights determine the relevance of the microarray and clinical models for the data combination. Evaluations are performed first with several redundant and non-redundant simulated data sets, and then with four breast cancer data sets. We compare the designed approaches with other relevant methods from the literature in the discussion. The section on alternative regularized regression techniques includes an elastic net and the addition of SNP data to the combination of gene expression and clinical data. We demonstrate the results of these alternatives with simulated and real data sets. We compare the execution times of the applied approaches at the end of this chapter.

The notation for prediction with gene expression data was presented in Chapter 3. A similar notation is used for clinical variables. It can be summarized as follows:

X ... p × n gene expression data matrix
xij ... an element of X
Z ... q × n clinical data matrix
zij ... an element of Z
p ... the number of genes
q ... the number of clinical variables
n ...
the number of samples y ... n × 1 response vector yi ... an element of y ŷ ... n × 1 estimated response vector A, B ... the poor and good prognosis ground truth class labels In the following text, the upper indexes X, Z, S distinguish from other variables with gene expression data, clinical data and SNP data. 7.2 Generalized linear models Generalized linear models (GLMs) [134] are a group of statistical models that model the responses as nonlinear functions of linear combinations of predictors. These models are linear in the parameters. The nonlinear function (link) is the relation between the response and the nonlinearly transformed linear combination of the predictors. We employ GLMs in data combining due to nicely shared properties such as linearity. GLMs offer a general theoretical framework for many statistical models. The implementation of various statistical models is simplified because the framework allows for a common method for estimation of parameters and accessing model adequacy of all GLMs [107]. GLMs are generalizations of normal linear regression models, which are characterized by the following features: 1. A linear regression model is defined as: ηi = β0 + q X βj xij + ǫi , (7.1) j=1 where i = 1, . . . , n. β are regression coefficients and ǫ is a random mean-zero error term. 2. The link function is given as: g(yi ) = ηi , (7.2) where g is a link function, i = 1, . . . , n. ηi is a linear predictor. Respectively, it can be also written as yi = g−1 (ηi ), where g−1 is an inverse link function. Finally, g is an identity for the normal linear regression model. 44 7 Combining gene expression and clinical data 3. All yi are assumed to have normal distributions with E(yi ) = µi , with a constant variance σ 2 , yi ∼ N(µi , σ 2 ). GLMs are a generalization of this setup, which differs in the following features: 1. The link functions can be other than the identity. 2. 
The response variable y_i need not be continuous and normally distributed; it can have a distribution other than the normal one. We will assume that the observations come from a distribution in the exponential family [134]:

   f(y, θ, ψ) = exp( (yθ − b(θ)) / a(ψ) + c(y, ψ) ),   (7.3)

where f is the density of a continuous y or the probability function of a discrete y. θ is the parameter of interest – in GLMs, θ = θ(β). ψ is a nuisance parameter (like σ in regression).

7.2.1 Estimation of parameters

The coefficients β can be estimated by maximum likelihood. The log-likelihood of the sample is maximized:

   ℓ(μ, y, ψ) = Σ_{i=1}^{n} log f(y_i, θ_i, ψ).   (7.4)

If we consider the logistic model, which is described later, neither a(ψ) nor c(y, ψ) has an influence on the maximization. Thus, we maximize only:

   ℓ̃(μ, y, ψ) = Σ_{i=1}^{n} (y_i θ_i − b(θ_i)).   (7.5)

We differentiate (7.5) with respect to β, set the derivative equal to zero and solve for β:

   ∂/∂β ℓ̃(μ, y, ψ) = Σ_{i=1}^{n} (y_i − b′(θ_i)) ∂θ_i/∂β = 0.   (7.6)

Maximum likelihood estimates (MLEs) do not have a 'closed' form [88]. Therefore, they can be computed via iteratively weighted least squares¹ (IWLS) or via the Fisher scoring algorithm. McCullagh and Nelder [134] prove that the IWLS algorithm is equivalent to Fisher scoring and leads to MLEs. The weighted least squares solution, in the simpler matrix notation, is as follows:

   β̂^(m) = (X^T W^(m−1) X)^{−1} X^T W^(m−1) y^(m−1),   (7.7)

where X is the model matrix, W is a diagonal matrix of weights, y is the (working) response vector and m is the current iteration. The IWLS algorithm can be briefly described as follows:

1. Initialize β̂^(0), set m = 0.
2. Update W^(m), y^(m).
3. Increase m: m = m + 1.
4. Estimate β̂^(m).
5. Iterate steps 2 to 4 until ‖β̂^(m+1) − β̂^(m)‖ / ‖β̂^(m)‖ ≤ ξ.

The algorithm stops when the relative change in the estimates falls below a specified small amount ξ.

¹ IWLS is also called iteratively reweighted least squares (IRLS). Some of the differences between the algorithms can be found in [66] (page 189).
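The IWLS iteration above can be sketched as follows. This is an illustrative NumPy implementation for the logistic model, not the code used in our experiments (which were carried out in R); the working response z and the weights w correspond to the updates of y^(m) and W^(m) in step 2:

```python
import numpy as np

def iwls_logistic(X, y, xi=1e-8, max_iter=50):
    """Fit a logistic regression by IWLS (equivalent to Fisher scoring).
    X: n x d model matrix (first column of ones), y: 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                           # linear predictor
        p = 1.0 / (1.0 + np.exp(-eta))           # inverse logit link
        w = np.clip(p * (1.0 - p), 1e-10, None)  # diagonal of W
        z = eta + (y - p) / w                    # working response y^(m)
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # equation (7.7)
        # stop when the relative change falls below xi (step 5)
        if np.linalg.norm(beta_new - beta) <= xi * (np.linalg.norm(beta) + 1e-12):
            beta = beta_new
            break
        beta = beta_new
    return beta

# Toy data generated from a known model (intercept 0.5, slope 2).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), np.linspace(-3, 3, 200)])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * X[:, 1])))).astype(float)
beta_hat = iwls_logistic(X, y)
```

On such data the estimates come close to the generating coefficients; the clipping of w is only a numerical safeguard for nearly separated samples.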
A more detailed explanation of the computation of MLEs can be found, for example, in [88].

7.3 Logistic regression

The logistic regression model is often used when the response is binary. The linear logistic regression model is an example of a GLM, where:

A. The response variable y_i is considered a binomial (Bernoulli) random variable, y_i ∼ B(n_i, p_i). If y_i is Bernoulli, then [134]:

   f(y) = P(Y = y) = p^y (1 − p)^{1−y} = { p if y = 1;  1 − p if y = 0 }.   (7.8)

B. The parameters of the exponential family for the logistic regression model are:

   a(ψ) = 1,
   b(θ) = −log(1 − p) = log(1 + e^θ),   (7.9)
   c(y, ψ) = 0.

After substituting (7.9) into the equation (7.3), we arrive at:

   f(y, θ) = e^{yθ + log(1 − p)}.   (7.10)

C. The link function is logistic:

   θ = log( p / (1 − p) ),   (7.11)

which can be derived from (7.8). Figure 7.1 shows a graph of the logistic function.

Figure 7.1: The logistic function.

The logistic regression model with clinical data can be described with the following equation:

   g(y_i) = η_i = β_0^Z + Σ_{l=1}^{q} β_l^Z z_il,   (7.12)

where i = 1, ..., n, and g is the link function (7.11). y_i or p_i are outcome probabilities P(y_i = A | z_i1, ..., z_iq). The upper index Z denotes clinical data coefficients.

7.4 Boosting

Boosting with componentwise linear least squares (CWLLS) as a base procedure is applied to microarray data. A linear regression model (7.1)² is considered again. A boosting algorithm is an iterative algorithm that constructs a function F̂(x) by considering the empirical risk n^{−1} Σ_{i=1}^{n} L(y_i, F(x_i)). L(y_i, F(x_i)) is a loss function that measures how closely a fitted value F̂(x_i) comes to the observation y_i. In each iteration, the negative gradient of the loss function is fitted by the base learner. Gradient descent is an optimization algorithm that finds a local minimum of the loss function.
The base learner is a simple fitting method which yields an estimated function:

   f̂(·) = f̂(X, r)(·),   (7.13)

where f̂(·) is an estimate from a base procedure. The response r is fitted against x_1, ..., x_n.

² The gene expression coefficients can be denoted by the upper index X in this equation.

7.4.1 Functional gradient descent boosting algorithm

The functional gradient descent (FGD) boosting algorithm, which was defined by Friedman, is given by the following steps [70, 35]:

1. Initialize F̂^(0) ≡ arg min_a n^{−1} Σ_{i=1}^{n} L(y_i, a) ≡ ȳ. Set m = 0.

2. Increase m: m = m + 1. Compute the negative gradient (also called the pseudo response), which is the current residual vector:

   r_i = −∂/∂F L(y, F) |_{F = F̂^(m−1)(x_i)} = y_i − F̂^(m−1)(x_i),  i = 1, ..., n.

3. Fit the residual vector (r_1, ..., r_n) to (x_1, ..., x_n) by a base procedure (e.g. regression):

   (x_i, r_i)_{i=1}^{n}  −→ base procedure −→  f̂^(m)(·),

   where f̂^(m)(·) can be viewed as an approximation of the negative gradient vector.

4. Update F̂^(m)(·) = F̂^(m−1)(·) + ν · f̂^(m)(·), where 0 < ν < 1 is a step-length (shrinkage) factor.

5. Iterate steps 2 to 4 until m = m_stop for some stopping iteration m_stop.

7.4.2 Base procedure

The componentwise linear least squares (CWLLS) base procedure estimates are defined as (compare with equation (7.1)):

   f̂(X, r)(x) = β̂_ŝ x_ŝ,   (7.14)

where:

   ŝ = arg min_{1≤j≤p} Σ_{i=1}^{n} (r_i − β̂_j x_ij)²   (7.15)

and

   β̂_j = Σ_{i=1}^{n} r_i x_ij / Σ_{i=1}^{n} (x_ij)².   (7.16)

β̂ are coefficient estimates, j = 1, ..., p. ŝ denotes the index of the selected (best) predictor variable in iteration m. The CWLLS base procedure performs a linear least squares regression against the one selected predictor variable which reduces the residual sum of squares most. Thus, one predictor variable, not necessarily a different one, is selected in each iteration. For every iteration m, a linear model fit is obtained. The function is updated linearly.
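The componentwise selection (7.15)–(7.16), embedded in the FGD iteration of Section 7.4.1, can be sketched as follows. This is an illustrative NumPy sketch of L2-Boosting with the CWLLS base procedure under the squared-error loss; our experiments rely on the 'mboost' package instead:

```python
import numpy as np

def cwlls_boost(X, y, nu=0.1, m_stop=100):
    """L2-Boosting with componentwise linear least squares.
    X: n x p predictor matrix, y: centered response vector.
    Returns the accumulated coefficient vector after m_stop iterations."""
    n, p = X.shape
    beta = np.zeros(p)
    F = np.zeros(n)                      # F^(0) = ybar = 0 for centered y
    col_ss = (X ** 2).sum(axis=0)        # denominators of (7.16)
    for _ in range(m_stop):
        r = y - F                        # negative gradient = residuals
        b = X.T @ r / col_ss             # candidate coefficients (7.16)
        # Fitting component j reduces the RSS by b_j^2 * col_ss_j,
        # so the arg min of (7.15) is the arg max of this quantity:
        s = np.argmax(b ** 2 * col_ss)
        beta[s] += nu * b[s]             # coefficient update (7.18)
        F += nu * b[s] * X[:, s]         # function update (7.17)
    return beta

# Toy data: only the third predictor carries signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 3.0 * X[:, 2] + 0.1 * rng.standard_normal(100)
y = y - y.mean()
beta_hat = cwlls_boost(X, y)
```

The sketch makes the coefficient shrinkage and variable selection visible: only a few components of beta_hat leave zero, and the informative one dominates.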
Step 4 of the FGD algorithm (Section 7.4.1) with the CWLLS base procedure is as follows:

   F̂^(m)(x) = F̂^(m−1)(x) + ν β̂_ŝ^(m) x_ŝ^(m).   (7.17)

The update of the coefficient estimates is:

   β̂^(m) = β̂^(m−1) + ν β̂_ŝ^(m).   (7.18)

Then, the boosting estimator is a sum of base procedure estimates, or a linear combination of the base procedures:

   F̂^(m)(·) = ν Σ_{k=1}^{m} f̂^(k)(·).   (7.19)

7.4.3 Loss function

The aforementioned FGD boosting algorithm can be used in many settings. Several versions of boosting can be obtained by varying base procedures, loss functions and implementation details. Examples of loss functions yielding different versions of the boosting algorithm are given in Table 7.1. BinomialBoosting [35], which is the version of boosting that we utilize, uses the negative log-likelihood loss function:

   L(y, F) = log₂(1 + e^{−2yF}).   (7.20)

It can be shown that the population minimizer of (7.20) has the form [35]:

   F(x_i) = ½ log( p / (1 − p) ),   (7.21)

where p is P(y_i = A | x_i1, ..., x_ip) and relates to the logit function, which is analogous to logistic regression.

Algorithm                    | L(y, F)              | Population minimizer F(x)
BinomialBoosting/LogitBoost  | log₂(1 + e^{−2yF})   | ½ log(p / (1 − p)), p = P(Y = A | X = x)
L2-Boosting                  | ½ |y − F|²           | E[Y | X = x]
AdaBoost                     | e^{−(2y−1)F}         | ½ log(p / (1 − p)), p = P(Y = A | X = x)

Table 7.1: Boosting algorithms, their loss functions and population minimizers.

7.5 Combination of logistic regression and boosting

In GLMs, the linear models are related to the response variable via a link function (7.2). For binary data, we expect the responses y_i to come from a binomial distribution. Therefore, the logit link function is used in both models, with clinical and with microarray data. η_i is a linear model, which is the linear part of logistic regression and of the linear regression model in boosting with CWLLS described in Section 7.4. We combine data at the level of decisions and sum the linear predictions for clinical and microarray data:

   η_i = η_i^Z + η_i^X
(7.22)

The upper index Z (X) denotes clinical (microarray) data. According to the additivity rule that is valid for linear models, it is possible to sum the linear models:

   η_i = β_0^Z + Σ_{l=1}^{q} β_l^Z z_il + Σ_{j=1}^{p} β_j^X x_ij.   (7.23)

The inverse link function g^{−1}, which is the inverse logit function, is applied to the sum of the linear predictions η_i:

   g^{−1}(η_i) = logit^{−1}(η_i) = e^{η_i} / (1 + e^{η_i}).   (7.24)

The schematic drawing³ in Figure 7.2 illustrates the combination of clinical and microarray data. Clinical and microarray data are repeatedly split into training and test sets via the Monte Carlo cross-validation (MCCV) procedure, which is described in Chapter 4.1. Each clinical training set is fitted to the logistic regression model. We compute the linear predictions η_i^Z for each clinical test set. Each microarray training set is fitted to the model using boosting with CWLLS. Then, we compute the linear predictions η_i^X for each microarray test set. After that, we sum the predictions (7.22), and the equation (7.24) gives a response. Based on the responses, we measure the classifier performance with AUC, which is described in Chapter 4.2. For better readability of the results, we denote this approach LOG/Z+B/X.

³ The validation schema is not included.

Figure 7.2: The combination of microarray and clinical data. Z – clinical data, X – microarray data, LOG – logistic regression, B – boosting, η^Z, η^X – linear predictions, Y – response.

7.6 Parameter setting

We performed experiments in the R environment⁴ using the packages 'base' and 'mboost'. According to the 'mboost' package documentation [101], the coefficients resulting from boosting with family binomial are ½ of the coefficients of a logit model obtained via glm. This is due to the internal recoding of the response to −1 and +1. The binomial family object implements the negative binomial log-likelihood of a logistic regression model as a loss function.
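The decision-level combination (7.22)–(7.24) can be illustrated with the following sketch, assuming the linear predictors η^Z and η^X have already been computed by the two models (an illustrative Python fragment, not the R code used in our experiments):

```python
import numpy as np

def combine_predictions(eta_Z, eta_X):
    """Sum the clinical and microarray linear predictors (7.22) and
    apply the inverse logit link (7.24) to obtain P(y = A)."""
    eta = np.asarray(eta_Z) + np.asarray(eta_X)
    return np.exp(eta) / (1.0 + np.exp(eta))

# A confident clinical prediction can outweigh a weaker microarray
# prediction of the opposite sign, and vice versa.
p_hat = combine_predictions([2.0, -0.5], [-0.3, -0.8])
```

Note that if η^X is produced by 'mboost' with the binomial family, its coefficients live on half the logit scale (see the documentation quoted above), so the two predictors may have to be put on a common scale before summing.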
glm is a function from the 'base' package, which is used to fit generalized linear models. The choice of the shrinkage factor ν and of the number of iterations of the base procedure can be crucial for the predictive performance of boosting. Too many iterations lead to overfitting of the training data and a model that is too complex, while an insufficient number of iterations yields underfitting of the training data and a model that is too sparse. A small value of ν increases the number of required iterations and the computational time, but prevents overshooting. Based on the recommendation of Bühlmann et al. [35], we set ν = 0.1, the standard default value in the 'mboost' package. Note that, according to Lutz et al. [129], the natural value of the shrinkage factor for L2-Boosting with CWLLS is ν = 1. However, they use ν = 0.3 because smaller values have empirically proven to be a better choice.

The number of iterations can be estimated with a stopping criterion. For example, resampling methods such as cross-validation and the bootstrap [87] have been proposed to estimate the out-of-sample error for different numbers of iterations. Another, computationally less demanding alternative is to use Akaike's information criterion (AIC) [12] or the Bayesian information criterion (BIC) [164].

⁴ www.r-project.org

Figure 7.3: Estimation of an optimal number of boosting iterations (m_stop) depending upon AIC. Circles denote the value of m_stop. The example is depicted for the van't Veer data set and 3 MCCV iterations (m_max = 700, p = 500).

We use AIC in our computations as a stopping criterion:

   AIC = 2k(m) − 2 log(L(m)),   (7.25)

where k(m) is the number of variables used by the classifier F̂^(m) at step m, and L(m) is the binomial likelihood of the data given F̂^(m). The preferred model is the one with the minimum AIC value.
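The choice of m_stop by AIC (7.25) can be sketched as follows. The per-iteration log-likelihoods and model sizes below are hypothetical; in practice they are produced by the boosting fit (an illustrative Python fragment):

```python
import numpy as np

def aic_stop(loglik, n_vars):
    """Select the boosting iteration with minimal AIC (7.25),
    AIC(m) = 2*k(m) - 2*log L(m), where k(m) is the number of
    variables used at step m and log L(m) is the binomial
    log-likelihood of the training data."""
    aic = 2.0 * np.asarray(n_vars, float) - 2.0 * np.asarray(loglik, float)
    return int(np.argmin(aic)), aic

# Hypothetical trace: the fit improves quickly at first; later the
# complexity penalty 2*k(m) dominates and the AIC turns upward.
loglik = [-60.0, -40.0, -30.0, -26.0, -25.0, -24.8]
n_vars = [0, 2, 4, 7, 12, 20]
m_stop, aic = aic_stop(loglik, n_vars)
```

The selected iteration is the one where the improvement in log-likelihood no longer pays for the added model complexity, mirroring the minima circled in Figure 7.3.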
Figure 7.3 shows an example of the estimation of an optimal number of iterations (m_stop) depending upon AIC; m_stop is denoted by a circle. The maximal number of iterations was set to m_max = 700. The number of microarray variables was set to p = 500 and selected based on the training sets. Figure 7.4 depicts the span of m_stop selections for four breast cancer data sets. In the case of the van't Veer and Pittman data sets, the average m_stop was approximately 400 iterations, while the average m_stop for the Mainz data set was approximately 500 and for the Sotiriou data set it was approximately 550 iterations.

For the breast cancer data sets, we also evaluated the boosting approach with a fixed number of iterations within the range of 50–800 iterations. The AUC where the number of iterations was estimated via AIC mostly came close to the highest AUC obtained with a fixed number of iterations. In the case of the Sotiriou data set, the highest AUC was obtained with only 50 iterations. However, this value differed by only 0.01 from the AUC obtained with AIC-estimated boosting.

Figure 7.4: The span of m_stop selections for four breast cancer data sets: (a) van't Veer, (b) Pittman, (c) Mainz, (d) Sotiriou. The number of the MCCV iteration is on the horizontal axis and the m_stop selection is on the vertical axis (m_max = 700, p = 500).

7.7 Results

Evaluations focused on testing the LOG/Z+B/X approach with non-redundant and redundant data sets. Simulated data was generated in different settings and used for this purpose. We also tested LOG/Z+B/X with four publicly available breast cancer data sets.
The performance of the individual models was also evaluated and compared to the performance of the combined LOG/Z+B/X. The logistic regression model built with clinical variables is denoted by LOG. The boosting model built with microarray data is denoted by B.

7.7.1 Simulated data

We considered redundant and non-redundant settings of the data and different predictive powers of the clinical and microarray data. We generated simulated data sets through the use of an R script available in [26]. Variables V_j in each data set were generated as:

   V_j = μ_j^V Y + e_j,   (7.26)

where V denotes microarray or clinical data, j = 1, ..., p or j = 1, ..., q; μ_j^V are constant parameters controlling the amount of predictive power of the microarray or clinical variables; Y is the binary response that follows a binomial distribution; and e_j are independent random errors following a standard normal distribution. In the case of redundant sets, microarray and clinical variables were generated using exactly the same model. Such variables discriminate the classes in the same way and give redundant information. In the case of non-redundant sets, the observations were assumed to form two distinct subgroups [26].

We considered different predictive powers μ^Z for the clinical variables and μ^X for the microarray variables. In the presented simulations, μ^Z = 0 denotes no power, μ^Z = 0.5 a moderate power and μ^Z = 1 a strong power for Z. Similarly, μ^X = 0, 0.25, 0.5 for X. The difference between the μ^Z and μ^X ranges compensates for the different ranges of predictor values of the microarray and clinical variables. In all settings, the simulated microarray data sets were generated with 1,000 microarray variables and 500 samples, while the simulated clinical data sets consisted of 5 clinical variables and 500 samples.

(μ^Z, μ^X):  1. (0, 0)      2. (1, 0.25)  3. (0.5, 0.5)  4. (1, 0.5)
Method       (no power)     μ^Z > μ^X     μ^Z < μ^X      (strong p.)
LOG/Z+B/X    0.53 ± 0.05    0.69 ± 0.04   0.73 ± 0.04    0.79 ± 0.04
LOG/Z        0.55 ± 0.05    0.65 ± 0.06   0.56 ± 0.05    0.65 ± 0.06
B/X          0.51 ± 0.05    0.60 ± 0.05   0.72 ± 0.05    0.72 ± 0.05

Table 7.2: Non-redundant data sets. AUCs from test data sets (mean AUCs and standard deviations) evaluated over 100 MCCV iterations.

Tables 7.2 and 7.3 display selected results of LOG/Z+B/X for different predictive powers of Z and X. In the case of non-redundant data sets, LOG/Z+B/X increases AUCs. The results coincide with the fact that if the two data sources involve complementary information, their combination yields more accurate predictions. In the case of redundant data sets, LOG/Z+B/X shows a good performance as well.

(μ^Z, μ^X):  1. (0, 0)      2. (1, 0.25)  3. (0.5, 0.5)  4. (1, 0.5)
Method       (no power)     μ^Z > μ^X     μ^Z < μ^X      (strong p.)
LOG/Z+B/X    0.51 ± 0.05    0.94 ± 0.02   0.96 ± 0.02    0.98 ± 0.01
LOG/Z        0.49 ± 0.05    0.94 ± 0.02   0.78 ± 0.04    0.94 ± 0.02
B/X          0.51 ± 0.05    0.71 ± 0.04   0.98 ± 0.01    0.98 ± 0.01

Table 7.3: Redundant data sets. AUCs from test data sets (mean AUCs and standard deviations) evaluated over 100 MCCV iterations.

7.7.2 Breast cancer data

For the evaluation of the LOG/Z+B/X approach, we used four publicly available breast cancer data sets described in Chapter 6. Table 7.4 shows average AUCs and standard deviations over 100 MCCV iterations. We performed tests for different numbers of variables (p = 50, 200, 500) in order to inspect the efficiency of B/X and LOG/Z+B/X. The variables were selected on the basis of the absolute value of the t-statistics using the R package 'st'.

Data set    Method       p = 50        p = 200       p = 500
van't Veer  LOG/Z+B/X    0.79 ± 0.11   0.78 ± 0.11   0.79 ± 0.11
            LOG/Z        0.82 ± 0.10   −             −
            B/X          0.67 ± 0.13   0.65 ± 0.12   0.65 ± 0.11
Pittman     LOG/Z+B/X    0.79 ± 0.07   0.81 ± 0.08   0.82 ± 0.08
            LOG/Z        0.67 ± 0.09   −             −
            B/X          0.75 ± 0.08   0.77 ± 0.08   0.78 ± 0.08
Mainz       LOG/Z+B/X    0.65 ± 0.10   0.66 ± 0.10   0.68 ± 0.10
            LOG/Z        0.59 ± 0.10   −             −
            B/X          0.65 ± 0.10   0.66 ± 0.10   0.68 ± 0.09
Sotiriou    LOG/Z+B/X    0.66 ± 0.11   0.65 ± 0.12   0.69 ± 0.11
            LOG/Z        0.71 ± 0.11   −             −
            B/X          0.58 ± 0.12   0.57 ± 0.12   0.61 ± 0.12

Table 7.4: Breast cancer data sets. AUCs from test data sets (mean AUCs and standard deviations) evaluated over 100 MCCV iterations. p denotes the number of microarray variables.

The Pittman data set approaches the non-redundant setting, and the combination of microarray and clinical data leads to an improvement of the outcome prediction. The methods in Table 7.4 show a lower performance with the Mainz and Sotiriou data sets. The results based on the van't Veer and Sotiriou data sets seem to be close to the simulated redundant data sets setting. In the case of van't Veer, this finding coincides with the conclusions of Gruvberger et al. [82], which point out a correlation with ER-alpha status in the data set generated by van't Veer.

7.8 Pre-validation

The approach described in this section, in contrast to the combination of gene expression and clinical data in Figure 7.2, sets weights that determine the relevance of the linear predictions of the microarray and clinical models for the combination, as shown in the schematic drawing in Figure 7.5. This approach was designed for redundant data sets. The weights are set based on pre-validation. The concept of pre-validation for microarray data and clinical variables is described in [181]. That paper incorporates only points 1 through 5 of our approach described in Section 7.9. Also, we use different classifiers and leave-one-out cross-validation (LOOCV), while [181] uses k-fold CV.

Figure 7.5: The combination of microarray and clinical data with pre-validation. Z – clinical data, X – microarray data, LOG – logistic regression, B – boosting, η^Z, η^X – linear predictions, Y – response, ŵ_s^Z, ŵ_s^X – weights.

7.9 Determination of weights for models

We have K training samples in iteration t of the MCCV.
We use LOOCV for pre-validation and, consequently, determine the weights. The weights are computed as follows:

1. Set aside one sample of the K training samples.
2. Build a model with logistic regression (boosting) for Z (X)⁵ using only the data from the other K − 1 samples.
3. Predict a linear response with the built model on the left-out case.
4. Repeat steps 1–3 for each of the K samples to get pre-validated predictors from Z and X.
5. Fit a logistic regression model to the pre-validated predictors from Z and X.
6. Compute the weights w_i (7.28), where i denotes Z or X.
7. Repeat steps 1–6 for the randomized training data obtained from MCCV.
8. Compute the mode ŵ_i of the weights w_i for X and Z.

Logistic regression is used twice – in building a model of Z and in building a model of the pre-validated predictors from Z and X. Logistic regression describes the relationship between one or more variables and an outcome. Each of the coefficients describes the size of the contribution of a particular variable. A large regression coefficient means that the variable strongly influences the probability of that outcome. The following equation, with Q^Z and Q^X denoting the pre-validated predictors from Z and X, defines the regression model:

   η = β_0 + β_Z Q^Z + β_X Q^X.   (7.27)

The weights are determined as follows:

   w_Z = abs(β_Z / β_0),  w_X = abs(β_X / β_0).   (7.28)

The randomized training data obtained from MCCV is used twice – in the weight estimation, as described in Section 7.8, and in building the models of Z and X, as described in Section 7.5. A histogram of the weights obtained over the t iterations of MCCV is close to the probability density function of an exponential distribution. In the case of an exponential distribution, the mode is the value with the highest density. In the rest of this thesis, this approach is denoted pre-LOG/Z+B/X.

⁵ Z denotes clinical data and X denotes microarray data.
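Steps 1–6 can be sketched as follows. The fitting routines are supplied by the caller (logistic regression for Z, boosting for X in our setting); the demonstration below uses a plain least squares fit as a hypothetical stand-in, so this is an illustrative Python sketch rather than the R code used in the experiments:

```python
import numpy as np

def prevalidated_predictor(V, y, fit, predict_linear):
    """Steps 1-4: leave-one-out pre-validation. For each sample k a
    model is built on the other K-1 samples and its linear prediction
    for sample k is stored."""
    K = len(y)
    q = np.empty(K)
    for k in range(K):
        keep = np.arange(K) != k
        model = fit(V[keep], y[keep])
        q[k] = predict_linear(model, V[k:k + 1])[0]
    return q

def weights_from_coefficients(beta0, beta_Z, beta_X):
    """Step 6, equation (7.28): weight of each pre-validated predictor
    as its coefficient magnitude relative to the intercept."""
    return abs(beta_Z / beta0), abs(beta_X / beta0)

# Demonstration with a least squares stand-in for the model fit.
rng = np.random.default_rng(0)
V = rng.standard_normal((10, 2))
y = V @ np.array([1.0, -1.0]) + 0.01 * rng.standard_normal(10)
fit = lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0]
predict_linear = lambda m, a: a @ m
q = prevalidated_predictor(V, y, fit, predict_linear)
w_Z, w_X = weights_from_coefficients(0.5, 1.0, -0.25)
```

Because each q[k] comes from a model that never saw sample k, the subsequent regression in step 5 is not biased by overfitting of the high-dimensional model.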
Data set    Pre-validation   p = 50        p = 200       p = 500
van't Veer  Yes              0.81 ± 0.10   0.82 ± 0.11   0.82 ± 0.10
            No               0.79 ± 0.11   0.78 ± 0.11   0.79 ± 0.11
Pittman     Yes              0.74 ± 0.08   0.76 ± 0.08   0.78 ± 0.08
            No               0.79 ± 0.07   0.81 ± 0.08   0.82 ± 0.08
Mainz       Yes              0.65 ± 0.10   0.65 ± 0.09   0.66 ± 0.09
            No               0.65 ± 0.10   0.66 ± 0.10   0.66 ± 0.09
Sotiriou    Yes              0.70 ± 0.10   0.70 ± 0.11   0.71 ± 0.11
            No               0.66 ± 0.11   0.65 ± 0.12   0.69 ± 0.11

Table 7.5: Pre-validation of LOG/Z+B/X on breast cancer data sets. AUCs from test data sets (mean AUCs and standard deviations) evaluated over 100 MCCV iterations. p denotes the number of microarray variables.

7.10 Results

We evaluated the pre-LOG/Z+B/X approach on four publicly available breast cancer data sets and compared pre-LOG/Z+B/X with LOG/Z+B/X. Table 7.5 includes the results, with average AUCs and standard deviations over 100 MCCV iterations. LOG/Z+B/X sums the linear predictions from the microarray and clinical models with equal weights. Pre-validation of the built models improves the outcome of the prediction in the case of redundant breast cancer data sets.

7.11 Discussion

We compared our results with the results of other methods described in the literature. In principle, the results of the designed approaches are difficult to compare because new approaches are evaluated with different data sets and measures. Gevaert et al. [77] evaluated their results using AUC measures. They integrate microarray data and clinical variables with Bayesian networks in three ways: full integration, decision integration and partial integration. The Bayesian decision integration approach combines data at the same level as our method and achieves an average AUC of 0.79 with the van't Veer data set. In order to compare LOG/Z+B/X with the approach proposed in Boulesteix et al. [26], we have also extended our computation to report error rates. The comparison is presented in Chapter 8, as it is relevant also for determining the additional predictive value of microarray data.
Our approach provides results 2% better on average on the van't Veer data set. In the case of the Pittman data set, LOG/Z+B/X performs 5% better than the approach proposed in [26]. Eden et al. [56] reproduce the van't Veer classifier for microarray predictors and apply an artificial neural network (ANN) algorithm to clinical predictors. Their approach achieves an AUC of 0.79 with all samples of the van't Veer data set and with LOOCV, and an AUC of 0.85 with only the ER-positive samples of the van't Veer data set. Ma and Huang [130] propose the Cov-TGDR method for combining different types of covariates in disease prediction. They use the van't Veer data set and achieve a prediction error of 0.227. However, they perform feature selection based on the binary outcome with both training and test data, which is not correct (pre-processing steps 4 and 5 in their article). Other examples of methods that combine microarray and clinical data are [65, 155], but these authors evaluate survival times. Fernandez-Teijeiro et al. [65] build a predictive model with a combination of clinical variables and a small number of selected genes. Pittman et al. [155] combine metagenes with clinical risk factors to improve prediction.

According to our results, a larger p increases the boosting performance, which relates to the fact that boosting with CWLLS performs variable selection and coefficient shrinkage. This approach is evaluated together with other methods on breast cancer data sets in an experiment aimed at the selected genes in Chapter 9.

7.12 Alternative regularized regression techniques

There are other regularized regression techniques based on the logistic regression model that can be applied to high-dimensional data. We experimented with the R packages 'glmnet', 'grplasso' and 'glmpath', which regularize high-dimensional data with L1, L2 or elastic net penalties.
At the same time, these statistical models were developed for fitting in the GLM framework. The next subsection describes experiments with the elastic net from the package 'glmnet', which performed well and for which model fitting was not time-consuming. Next, we modify LOG/Z+B/X and apply it to the CFS data set, which includes three types of data.

7.12.1 Elastic net

The elastic net [206] is a regularization and variable selection method that can include both L1 and L2 penalties. The L1 penalty, the lasso (least absolute shrinkage and selection operator) [180], has the property that it can shrink coefficients to zero and thus performs variable selection. The predictive model is then sparse. A drawback of the lasso is its indifference among highly correlated predictors: it tends to select only one variable from a group of correlated variables and does not care which one is selected. The L2 penalty, ridge regression [97], can shrink the coefficients of correlated predictors toward each other and is useful in situations with many correlated predictor variables. A drawback of the ridge penalty is that it always keeps all the predictors in the model. The elastic net is a compromise between the L1 and L2 penalties. It takes correlated variables into account and performs variable selection at the same time.

Friedman et al. [69] developed algorithms for fitting GLMs with elastic net penalties. The algorithms use cyclical coordinate descent (CCD) and compute a regularization path. CCD algorithms optimize each parameter separately, holding all of the others fixed. The linear regression model (7.1) is considered again. The elastic net optimizes the following equation with respect to β [71]:

   β̂(λ) = arg min_β [ 1/(2n) Σ_{i=1}^{n} (y_i − Σ_{j=1}^{p} β_j x_ij)² + λ Σ_{j=1}^{p} P_α(β_j) ],   (7.29)

where:

   P_α(β) = (1 − α) ½ ‖β‖²_{L2} + α ‖β‖_{L1} = Σ_j [ (1 − α) ½ β_j² + α |β_j| ]   (7.30)

is the elastic net penalty.
Depending on α, the penalty specializes to:

   P_α(β) = { the L1 penalty if α = 1;  the L2 penalty if α = 0;  the elastic net penalty if 0 < α < 1 }.   (7.31)

The equation (7.29) is solved by coordinate descent. The coordinate update has the form:

   β̂_j ← S(β_j*, λα) / (1 + λ(1 − α)) = S( (1/n) Σ_{i=1}^{n} x_ij r_ij, λα ) / (1 + λ(1 − α)),   (7.32)

where r_ij is the partial residual y_i − ŷ_ij for fitting β̂_j and S(κ, γ) is the soft-thresholding operator with the value:

   S(κ, γ) = sign(κ)(|κ| − γ)₊ = { κ − γ if κ > 0 and γ < |κ|;  κ + γ if κ < 0 and γ < |κ|;  0 if γ ≥ |κ| }.   (7.33)

Details of the derivation are given in [69]. The soft-thresholding takes care of the lasso contribution to the penalty. A simple description of the CCD algorithm for the elastic net is as follows [71]. The authors assume that the x_ij are standardized: Σ_{i=1}^{n} x_ij = 0, (1/n) Σ_{i=1}^{n} x_ij² = 1.

• Initialize all β̂_j = 0.
• Cycle over j until convergence, i.e. until the coefficients stabilize:
  − Compute the partial residuals r_ij = y_i − ŷ_ij = y_i − Σ_{k≠j} x_ik β̂_k.
  − Compute the simple least squares coefficient of these residuals on the jth predictor: β_j* = (1/n) Σ_{i=1}^{n} x_ij r_ij.
  − Update β̂_j by soft-thresholding: β̂_j ← S(β_j*, λα) / (1 + λ(1 − α)), as in (7.32).

Friedman et al. [71] compute a group of solutions (a regularization path) for a decreasing sequence of values of λ. Their algorithm starts at the smallest value λ_max for which the entire vector β̂ = 0, selects a minimum value λ_min = ελ_max, and constructs a sequence of values of λ decreasing from λ_max to λ_min on the log scale. Each solution is used as a warm start for the next problem.

Combination of logistic regression and elastic net

We employ the elastic net with the high-dimensional data instead of boosting with CWLLS and combine the elastic net and simple logistic regression models similarly as described in Section 7.5. The elastic net models are used in the logistic regression setting. The regularized equation (7.29) is fitted by maximum (binomial) log-likelihood [71]. In GLMs, the linear models are related to the response via a link function.
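The update (7.32) with the soft-thresholding operator (7.33) can be sketched as follows. This is a naive, illustrative NumPy implementation for a single fixed pair (λ, α) with standardized predictors; the 'glmnet' implementation adds the regularization path, warm starts and further refinements:

```python
import numpy as np

def soft_threshold(kappa, gamma):
    """Soft-thresholding operator S(kappa, gamma) of (7.33)."""
    return np.sign(kappa) * max(abs(kappa) - gamma, 0.0)

def ccd_elastic_net(X, y, lam, alpha, n_cycles=100):
    """Cyclical coordinate descent for the elastic net criterion (7.29).
    Assumes the columns of X are standardized (mean 0, mean square 1)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_cycles):
        for j in range(p):
            # partial residuals: remove all fitted components except the jth
            r_j = y - X @ beta + X[:, j] * beta[j]
            b_star = X[:, j] @ r_j / n           # simple LS coefficient
            beta[j] = soft_threshold(b_star, lam * alpha) / (1.0 + lam * (1.0 - alpha))  # (7.32)
    return beta

# Toy data: only the first predictor carries signal; with the lasso
# (alpha = 1) the remaining coefficients are shrunk toward exactly zero.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(100)
beta_hat = ccd_elastic_net(X, y, lam=0.3, alpha=1.0)
```

The lasso shrinkage of the informative coefficient (roughly by λ) and the exact zeroing of the uninformative ones are both visible in beta_hat.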
We combine data at the level of decisions and sum the linear predictions for clinical and microarray data (7.22). Then the inverse link function g^{−1} is applied to the sum of the linear predictions (7.24). The schematic drawing of the combination of clinical and microarray data with the elastic net is similar to the schema in Figure 7.2. For better readability of the results, we denote this approach LOG/Z+EN/X.

Parameter setting

We performed experiments in the R environment using the packages 'base' and 'glmnet'. We evaluated the λ solution paths produced by the glmnet algorithm in different settings: the lasso penalty (α = 1), the ridge regression penalty (α = 0) and the elastic net penalty (α = 0.5), and tested the models with various λ solutions on the training and test data sets. Figure 7.6 depicts examples of the experiment. The subfigures were generated from a simulated microarray data set of moderate power (μ^X = 0.25) and depict one MCCV iteration (the same for all subfigures). The simulated microarray data set included 1,000 variables and 500 samples (400 training and 100 test). The corresponding AUC performances are on the vertical axes. The subfigures in the first, second and third row were produced based on 20, 400 and 1,000 selected variables, respectively, in order to inspect the performance in different dimensional settings.

Figure 7.6: Examples of λ solution paths produced by the glmnet algorithm. The dashed (solid) line denotes AUC estimated from a training (test) data set. Columns: 1st – lasso penalty, 2nd – ridge regression penalty, 3rd – elastic net penalty.
Lines: 1st, 2nd and 3rd – 20, 400 and 1,000 variables in the data set. The blue vertical line denotes λ_OPT. The λ solution paths generated by the glmnet algorithm are on the horizontal axes.

The blue vertical lines in the subfigures denote the estimated values of λ_OPT, which were estimated via a training data set cross-validation (CV). The estimation via the training data set CV does not seem to work very well, especially for the ridge regression (the subfigures in the 2nd column). In this case, λ_OPT is close to zero, although the span of the ridge regression solutions is large and the data is high-dimensional in the 3rd line. An estimation of λ_OPT via CV on an additional validation data set might work better. The estimation of λ_OPT for the lasso (the subfigures in the 1st column) seems to be more reliable. Therefore, we set α = 1 and use the lasso penalty in further experiments.

Table 7.6: The comparison with elastic net – non-redundant data sets.

                   1. (0, 0)      2. (1, 0.25)   3. (0.5, 0.5)   4. (1, 0.5)
Method (µZ, µX)    (no power)     µZ > µX        µZ < µX         (strong p.)
LOG/Z+EN/X         0.55 ± 0.04    0.68 ± 0.05    0.75 ± 0.04     0.81 ± 0.04
LOG/Z              0.55 ± 0.05    0.65 ± 0.06    0.56 ± 0.05     0.65 ± 0.06
EN/X               0.50 ± 0.03    0.60 ± 0.05    0.74 ± 0.05     0.74 ± 0.05

Table 7.7: The comparison with elastic net – redundant data sets.

                   1. (0, 0)      2. (1, 0.25)   3. (0.5, 0.5)   4. (1, 0.5)
Method (µZ, µX)    (no power)     µZ > µX        µZ < µX         (strong p.)
LOG/Z+EN/X         0.48 ± 0.05    0.95 ± 0.02    0.98 ± 0.01     0.99 ± 0.01
LOG/Z              0.49 ± 0.05    0.94 ± 0.02    0.78 ± 0.04     0.94 ± 0.02
EN/X               0.48 ± 0.04    0.72 ± 0.04    0.97 ± 0.01     0.97 ± 0.01

Results

The experiment data sets were evaluated in the same way as described in Section 7.7. We evaluated EN/X and LOG/Z+EN/X with simulated data first. Tables 7.6 and 7.7 display the results in the non-redundant and redundant settings for different predictive powers of microarray and clinical data. Combining microarray and clinical data yields a more accurate prediction performance.
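The MCCV evaluation used throughout these experiments (repeated random train/test splits with AUC computed on the held-out part) can be sketched in a few lines. This is an illustrative pure-Python sketch, not the R evaluation code used in the thesis; `fit` and `predict` stand for any classifier pair.

```python
import random

def auc(scores, labels):
    # Empirical AUC: the probability that a random positive sample scores
    # higher than a random negative one (ties count one half).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 * (p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def mccv(samples, labels, fit, predict, n_iter=100, test_frac=0.2, seed=0):
    # Monte Carlo cross-validation: repeated random train/test splits.
    # Splits whose test set contains a single class are skipped.
    rng = random.Random(seed)
    n = len(samples)
    n_test = max(2, int(round(test_frac * n)))
    aucs = []
    for _ in range(n_iter):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        n_pos = sum(labels[i] for i in test)
        if n_pos == 0 or n_pos == n_test:
            continue
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        scores = [predict(model, samples[i]) for i in test]
        aucs.append(auc(scores, [labels[i] for i in test]))
    return aucs
```

The mean and standard deviation of the returned AUC list correspond to the entries reported in the tables.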
When we compare Tables 7.6 and 7.7 with Tables 7.2 and 7.3 (the LOG/Z+B/X approach from Section 7.7), LOG/Z+EN/X performs slightly better than LOG/Z+B/X, owing to EN/X. Then we evaluated EN/X and LOG/Z+EN/X with the breast cancer van’t Veer and Pittman data sets. Average AUCs with standard deviations from test data sets evaluated over 100 MCCV iterations are given in Table 7.8. When we compare Table 7.8 with Table 7.4 from Section 7.7, both approaches indicate a similar performance. LOG/Z+B/X performs slightly better than LOG/Z+EN/X in the case of the Pittman data set.

Table 7.8: The comparison with elastic net – breast cancer data sets.

Data set     Method        p = 50         p = 200        p = 500
van’t Veer   LOG/Z+EN/X    0.77 ± 0.11    0.77 ± 0.10    0.79 ± 0.11
             LOG/Z         0.82 ± 0.10    −              −
             EN/X          0.67 ± 0.12    0.64 ± 0.12    0.64 ± 0.12
Pittman      LOG/Z+EN/X    0.77 ± 0.07    0.78 ± 0.08    0.80 ± 0.07
             LOG/Z         0.67 ± 0.09    −              −
             EN/X          0.74 ± 0.08    0.76 ± 0.08    0.77 ± 0.07

7.12.2 Combining gene expression, clinical and SNP data

The LOG/Z+B/X approach can also be applied to the CFS data set, which consists of several (more than two) parts of data. The CFS data set contains gene expression, clinical, SNP and proteomic data. We do not include the proteomic data in the experiments because there are not enough proteomic samples (see Chapter 6). Clinical data is usually low-dimensional because just a few clinical variables are included in microarray data sets. However, the clinical data is high-dimensional in the CFS data set, and simple logistic regression without a dimension reduction step could not be used. Therefore, we applied boosting with CWLLS to the CFS clinical data as well and combined the linear predictions in the same way as the equation (7.22) described in Section 7.5. We denote this approach B/Z+B/X. We evaluated B/Z+B/X on the van’t Veer and Pittman breast cancer data sets and compared it with LOG/Z+B/X to inspect the efficiency of this method. Table 7.9 shows the results.
Simple logistic regression (LOG) performs slightly better on clinical data. However, B/Z+B/X can be used instead of LOG/Z+B/X as well, and we applied it to the CFS data set. Then, we applied boosting to the SNP data and combined gene expression and SNP data; clinical and SNP data; and clinical, gene expression and SNP data. The sum of the linear predictions for clinical, gene expression and SNP data is as follows:

    η_i = η_i^Z + η_i^X + η_i^S,   (7.34)

where the upper index S denotes SNP data.

Table 7.9: Comparison of methods. AUCs from test data sets evaluated over 100 MCCV iterations. The number of microarray variables is p = 500.

Data set     B/Z            B/Z+B/X        LOG/Z          LOG/Z+B/X
van’t Veer   0.80 ± 0.10    0.76 ± 0.10    0.82 ± 0.10    0.79 ± 0.11
Pittman      0.65 ± 0.09    0.82 ± 0.08    0.67 ± 0.09    0.82 ± 0.08

Results

Table 7.10 shows the results of the experiments with the CFS data set. The models constructed with clinical data reached AUC = 0.74. The models based on gene expressions performed very poorly (AUC = 0.52), which is reflected in the combinations of data. The combination of clinical and SNP data improves the outcome prediction. The combination of clinical, gene expression and SNP data performs much like the standalone clinical data. CFS is a very complex disease. Initially, we intended to use the CFS data set for more experiments, but later we abandoned this idea because we are not convinced about the quality of the data set. The intake illness classification of CFS, which is based on the 1994 case definition criteria [73], does not seem to be related to the microarray data. We used this classification as the disease outcome dependent variable for model fitting. Furthermore, CFS can, for example, be caused by chronic infections [143]. The CFS data set does not include infection and immunity markers that could influence the informativeness of the data.
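The decision-level combination in (7.34) reduces to summing the linear predictors and applying the inverse logit link. A minimal sketch (the η values would come from the fitted clinical, gene expression and SNP models; the function names are ours):

```python
import math

def inverse_logit(eta):
    # Inverse link g^{-1} for logistic regression.
    return 1.0 / (1.0 + math.exp(-eta))

def combined_probability(eta_z, eta_x, eta_s):
    # Decision-level combination (7.34): sum the linear predictions from the
    # clinical (Z), gene expression (X) and SNP (S) models, then apply g^{-1}.
    return inverse_logit(eta_z + eta_x + eta_s)
```

A sum of zero linear predictors maps to a predicted probability of 0.5, i.e. no evidence in either direction.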
The clinical data consists of subjective information based on a questionnaire filled in by patients [2].

Table 7.10: CFS data set. AUCs from test data sets (including mean AUCs and standard deviations) evaluated over 100 MCCV iterations.

Data                                   AUC
clinical                               0.74 ± 0.07
gene expression                        0.52 ± 0.11
SNP                                    0.72 ± 0.08
clinical + gene expression             0.62 ± 0.09
clinical + SNP                         0.77 ± 0.07
SNP + gene expression                  0.58 ± 0.10
clinical + gene expression + SNP       0.74 ± 0.08

7.13 Execution times

The execution time of the combined models is dominated by the execution times of the models built with high-dimensional data; therefore, we compared the execution times of the FGD boosting algorithm from the package ‘mboost’ (B/X and LOG/Z+B/X) with the CCD algorithm from the package ‘glmnet’ (EN/X and LOG/Z+EN/X). Figure 7.7 depicts this comparison. Increasing numbers of variables are on the horizontal axes, while the total execution times for 100 MCCV iterations (in minutes) are on the vertical axes. The plots indicate that both methods need similar time to be computed. The execution times grow almost linearly. Besides, the time for FGD boosting grows with the number of boosting iterations (in our simulations m_max = 700). A grid of 100 λ values is computed in each iteration of EN. The simulations were run on a standard PC (Intel Core 2 Duo T7250, 2.00 GHz, 2 GB RAM) and a 32-bit operating system.

Figure 7.7: The comparison of execution times. The times are measured for 100 MCCV iterations. (a) B/X and LOG/Z+B/X; (b) EN/X and LOG/Z+EN/X.

7.14 Conclusions

This chapter is concerned with the outcome prediction of combined models. We used logistic regression models built in different ways. GLMs enabled combining these models. The LOG/Z+B/X approach employs logistic regression and boosting with CWLLS.
The extension of LOG/Z+B/X is pre-LOG/Z+B/X, which includes a pre-validation of the models built with microarray and clinical data followed by a weight calculation. The weights set the relevance of the microarray and clinical models for the data combination. We described LOG/Z+EN/X, which employs logistic regression and elastic net. A comparison with LOG/Z+B/X showed that the two methods provide about the same performance. This finding can be explained by the fact that we used elastic net in the setting with the lasso penalty. Boosting with CWLLS is a sparse regularization method like the lasso. In boosting with CWLLS, one predictor variable, not necessarily a different one, is selected in each iteration. Efron et al. [57] demonstrated that the coefficient estimates (regularization paths) from the lasso and from forward stagewise linear regression, which is related to slow gradient-based L2-boosting with a small fixed step size (ε-boosting), are nearly identical. We modified LOG/Z+B/X and applied it to the CFS data set, which consists of microarray, clinical and SNP data. A combination of data sources can lead to a more accurate outcome prediction. On the other hand, there is a danger that the integration of a data source of poor quality can decrease the outcome prediction. The presented results show that the approaches described in this chapter are not very prone to this problem and can, within limits, cope with a data source of lower quality. We also compared the execution times of the FGD boosting algorithm from the package ‘mboost’ with the CCD algorithm from the package ‘glmnet’. The execution times of the combined models grow linearly with the number of genes; the approaches are not time consuming.

Chapter 8 Additional predictive value of microarrays

8.1 Introduction

Many clinical variables have already been investigated and are now well established in cancer research [183], while most genes are not yet validated as predictors.
Gene expressions are expected to contribute significantly to progress in cancer treatment by enabling precise and early diagnosis. However, results show that the predictive power of genes is often overestimated. The overestimation is especially clear when the number of samples is low and the total number of genes is high; high-dimensional data is more sensitive to overfitting. Potential prognostic genes are often selected on a single data set and assumed to have the same predictive power with other data sets. In addition, methods that employ microarrays are not always evaluated correctly (see Section 3.4). Probably due to the overoptimistic results and the attractiveness of microarrays, researchers do not pay attention to the given clinical data in the same manner as in the pre-microarray era. Microarray data sets usually include just a few clinical variables. Clinical data is usually not difficult to collect, and its acquisition is almost always much cheaper than in the case of microarray gene expression data. This chapter investigates the additional predictive value of microarray data. Few authors have dealt with this topic. Eden et al. [56] compared the power of gene expression measurements to conventional clinical prognostic markers for predicting distant metastases in breast cancer patients. The performance of metastasis prediction was presented by ROC and Kaplan-Meier plots. The gene expression profile did not perform noticeably better than indices constructed from clinical variables. Tibshirani and Efron [181] proposed a pre-validation technique for making a fairer comparison between the two sets of predictors. They compared a predictor of a disease outcome derived from gene expression levels to standard clinical predictors. The technique is partially described in Section 7.8, where it is employed for the determination of the weights of the linear predictors.
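The pre-validation idea of Tibshirani and Efron can be sketched as follows: each sample's microarray-derived predicted value comes from a model trained without that sample's fold, so the resulting "pre-validated" predictor can be compared fairly against the clinical predictors. This is an illustrative pure-Python sketch with hypothetical `fit`/`predict` callables, not the authors' implementation.

```python
def prevalidate(samples, labels, fit, predict, k=5):
    # Pre-validation: split the samples into k folds; the pre-validated
    # predictor for each sample is produced by a model fitted on the other
    # k - 1 folds, so no sample's prediction uses its own label.
    n = len(samples)
    folds = [list(range(f, n, k)) for f in range(k)]
    prevalidated = [None] * n
    for fold in folds:
        train = [i for i in range(n) if i not in fold]
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        for i in fold:
            prevalidated[i] = predict(model, samples[i])
    return prevalidated
```

The pre-validated values can then enter a second model alongside the clinical variables on an equal footing.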
Höfling and Tibshirani [98] extended the pre-validation technique by a permutation testing-based procedure. They showed that the permutation test achieves roughly the same power as the analytical test in simulation studies. Boulesteix et al. published two papers about the additional predictive value of microarrays [27, 25]. The first one combines partial least squares (PLS) dimension reduction and the pre-validation (PV) introduced by Tibshirani and Efron. Random forests are then applied with the pre-validated PLS components and the clinical variables as predictors. We evaluate and compare our approach together with this method in Section 8.3. The second one deals with a permutation-based procedure and uses logistic regression and boosting. This approach could seem similar to ours; however, it is different because the authors combine the data in a different way and feed the offset values of the boosting algorithm with the linear predictions from the logistic model. The method supposes that all clinical variables are already given (which means that the clinical data are included during training and cannot be added during testing), which is a disadvantage if this method were used for class prediction with gene expression and clinical data. Truntzer et al. [183] adjusted for clinical and gene expression information in Cox proportional hazard models. The contributions of the clinical and transcriptomic variables to prognosis were compared through simulations and by using the Kent and O’Quigley ρ² measure of dependence. The results showed that the predictive power is overestimated in the case of genes. In this chapter, we propose a two-step approach that can determine the additional predictive value of microarray data. It is based on the method described in Chapter 7. We evaluate this approach together with another method designed for the same purpose on real breast cancer data sets.
According to the results, our approach can determine whether microarray data has an additional predictive value, and the joined classifier can even combine microarray and clinical data more efficiently than the other method used for comparison.

8.2 Determination of predictive value

The schematic drawing of our two-step approach for the determination of the additional predictive value, including a validation schema, is shown in Figure 8.1. It can be described as follows. Step 1 consists in an evaluation of the performance of the classifier with clinical variables alone. Logistic regression was used as the classifier and was denoted as LOG/Z. Step 2 consists in an evaluation of the performance of the classifier combining clinical variables and microarray data, which was described in Section 7.5 and was denoted as LOG/Z+B/X. If the microarray data has an additional predictive value, the performance of step 2 is higher than the performance of step 1. The findings are statistically assessed with a paired Wilcoxon rank sum test at the 95% confidence level.

Figure 8.1: Two-step method for the determination of the additional predictive value of microarray data. The dashed line denotes step 1 (LOG/Z). The dot-and-dashed line denotes step 2 (LOG/Z+B/X).

8.3 Results

We evaluated our approach on four breast cancer data sets. The evaluation schema was the same as in the previous chapter. Table 8.1 shows the results.
According to the table, the van’t Veer and Sotiriou microarray data do not have an additional predictive value compared to the clinical data, and it is better to use the clinical data alone for the prediction of prognosis. These findings coincide with the findings in the previous chapter. The finding that the van’t Veer microarray data does not have an additional predictive value coincides with the conclusions in [27, 25, 56]. The Pittman and Mainz microarray data improve the prediction performance. We also evaluated our approach together with the method published by Boulesteix et al. [26] that addresses the additional predictive value of microarray data. The authors of the article evaluate the proposed approach in various settings and combinations and compare it with other classifiers such as SVM or RF. We evaluated and compared our approach to the method denoted as PLS-PV-RF/XZ, which has the best performance in the article. The method is implemented and available in the R package ‘MAclinical’ and is evaluated by means of error rates. That is why we also evaluated our approach by means of error rates in this section (see Chapter 4 for the description of validation and evaluation). We used the script from the R package ‘MAclinical’; we just applied the same validation schema to PLS-PV-RF/XZ and used the same data sets as for our two-step approach. Table 8.2 shows the results of this experiment. LOG/Z+B/X performs better than PLS-PV+RF/XZ on the van’t Veer and Pittman data sets. The methods predict the outcome similarly on the Mainz and Sotiriou data sets.

Table 8.1: Additional predictive value of microarrays.

Data set     Step   AUC            APVM   p-value    Statistically significant
van’t Veer   1      0.82 ± 0.11    NO     6.64e−06   YES
             2      0.79 ± 0.11
Pittman      1      0.67 ± 0.09    YES    1.34e−15   YES
             2      0.82 ± 0.08
Mainz        1      0.59 ± 0.10    YES    1.36e−09   YES
             2      0.68 ± 0.10
Sotiriou     1      0.71 ± 0.11    NO     8.07e−04   YES
             2      0.61 ± 0.12
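The statistical assessment behind the p-values in Tables 8.1 and 8.2 can be sketched as follows. This is a simplified pure-Python paired Wilcoxon test using the normal approximation, without the tie and zero-difference corrections of R's `wilcox.test`; it only illustrates the principle of comparing the two paired performance vectors.

```python
import math

def wilcoxon_signed_rank(a, b):
    # Paired Wilcoxon test (normal approximation): rank the absolute paired
    # differences, sum the ranks of the positive differences, and compare the
    # sum against its null mean and variance. Two-sided p-value.
    d = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(d)
    ranked = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    for r, i in enumerate(ranked, start=1):
        ranks[i] = float(r)  # note: ties in |d| are not midranked here
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mu = n * (n + 1) / 4.0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mu) / sigma
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return w_plus, p
```

Applied to the step 1 and step 2 AUC vectors from the 100 MCCV iterations, a p-value below 0.05 indicates a statistically significant difference at the 95% confidence level.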
Table 8.2: Comparison of LOG/Z+B/X with the PLS-PV+RF/XZ method described in Boulesteix et al. [26]. Methods evaluated by mean values of error rates based on the available script in the R package ‘MAclinical’.

Data set     Method          Mean error     p-value    Statistically significant
van’t Veer   LOG/Z+B/X       0.29 ± 0.11    2.63e−02   YES
             PLS-PV+RF/XZ    0.31 ± 0.10
Pittman      LOG/Z+B/X       0.24 ± 0.07    2.21e−06   YES
             PLS-PV+RF/XZ    0.29 ± 0.08
Mainz        LOG/Z+B/X       0.24 ± 0.07    8.92e−01   NO
             PLS-PV+RF/XZ    0.24 ± 0.06
Sotiriou     LOG/Z+B/X       0.36 ± 0.11    3.68e−01   NO
             PLS-PV+RF/XZ    0.37 ± 0.11

8.4 Conclusion

In this chapter, we presented a two-step approach that can determine the additional predictive value of microarray data. It has been shown that the method presented in the previous chapter, which constructs a classifier combining the two different types of data, can also be used for determining the additional predictive value of microarray data. According to the findings in Section 8.3, the van’t Veer and Sotiriou microarray data sets do not have an additional predictive value, while the Pittman and Mainz microarray data sets do. These findings demonstrate that clinical data is still a valuable data source which should be used if available. The LOG/Z+B/X method can combine clinical and microarray data more efficiently than the PLS-PV+RF/XZ method on the van’t Veer and Pittman data sets.

Chapter 9 Breast cancer prognostic genes

9.1 Introduction

The major aim of class prediction studies is to identify a set of key genes, also known as a signature or mRNA profile, that can accurately predict the class membership of new samples. We employ five breast cancer data sets in this thesis. It is interesting to compare the selected features of the gene expression classifiers built up in the different breast cancer data sets. Table 9.1 shows an overview of the sizes of these data sets.
Class prediction studies in breast cancer can be categorized into two main subtypes: prognostic and predictive class prediction. Prognostic class prediction discriminates between a good and a poor outcome by comparing highly aggressive and less aggressive primary tumors, while predictive class prediction describes predictors of response to therapy [55]. We deal with prognostic class prediction in this thesis; see the outcomes in Table 9.1. Breast cancer is not a single disease but a complex of genetic diseases characterized by the accumulation of multiple molecular alterations.

Table 9.1: Summary of microarray data sets.

Data set     Microarray technology      Dimension     Outcome
van’t Veer   cDNA Agilent               78 × 4348     recurrence during 5 years after resection
Pittman      Affymetrix Human U95Av2    158 × 12625   recurrence within 5 years
Wang         Affymetrix Human U133a     286 × 22263   relapse/distant metastases within 5 years
Mainz        Affymetrix Human U133a     200 × 22283   distant metastases after 5 years
Sotiriou     cDNA                       99 × 4246     relapse within 10 years

It has become clear that patients with similar clinical and pathological features may show distinct outcomes and vary in their response to therapy [55]. It is known that there are several different types of breast cancer. However, it remains to be determined how many of them can be identified reliably with currently available data. The main classes of breast cancer were originally defined by Perou et al. [153]. Table 9.2 shows the current main classes of breast cancer based on the reviews of Rakha et al. [55], Ross [161] and Sotiriou et al. [172]. According to Rakha et al. [55], the difference between the classes is not based on single genes or a specific pathway, but on a constellation of several groups of genes that make up the signature of each class.
A large number of breast cancer signatures have been published in the literature; however, the predictive success of these studies suffers from the fact that the designed signatures have only a few genes in common (see, e.g., [59]). For example, the van’t Veer and the Wang prognostic models, which study the same breast cancer population and outcome, shared only three genes [76]. Moreover, the predictive performance of both models decreased drastically when applied to each other’s data [59]. According to [116], as few as 3% of published studies describing potential applications in genomic medicine have progressed to a formal assessment of clinical utility. Table 9.3 shows an overview of breast prognostic and predictive biomarkers in clinical use, including multigene signatures. Oncotype DX and MammaPrint are the two tests that have achieved the most advanced commercial success. Although the MammaPrint test, which is based on the van’t Veer signature, was originally criticized for including some patients in both the training and test groups, it has been clinically validated to a high standard and has the US Food and Drug Administration (FDA) approval [136]. Reviews of diagnostic and prognostic signatures can be found in [161, 172]. The fact that different gene signatures have very few genes in common is a common feature of complex gene expression data that contain large numbers of highly correlated variables [172]. It is possible to find several sets of genes with similarly accurate prediction performances despite the limited overlap of these genes [59]. A difficulty of high-dimensional model selection comes from the collinearity among the predictors [64]. This variable selection problem has been described in [75].
Large gene expression data sets have stimulated the development of novel regularized techniques that can cope with small samples, from which has arisen a need for measuring variable importance and for the standardization of regression coefficients [208]. Classification methods for high-dimensional data have been overviewed in Chapter 3. This chapter tests the gene expression classifiers introduced in Chapter 7 and investigates how many features they select and how many genes they share when they are evaluated with the five breast cancer data sets. We compare boosting, elastic net with the L1 penalty (the lasso) and a method that takes into account correlations among genes. This method performed favorably in [10].

Class (%) and description:

ER-positive/luminal tumors (34-66%)

Luminal A (19-39%): Characterized by the highest expression of the ER and ER-related genes. Most studies have reported this as the best prognosis class.

Luminal B (10-23%): Shows low to moderate expression of ER-related genes. Compared with luminal A, this class may have a higher proliferation rate, expresses genes that seem to be shared with the basal-like and HER2 subtypes and is associated with a less favourable outcome.

ER-negative tumors (30-45%)

Basal-like (16-37%): Corresponds to ER-negative, PR-negative and HER2-negative (so-called triple-negative) breast cancers. The triple-negative breast cancers are heterogeneous and can be divided into multiple additional subgroups. The dysfunction of the BRCA1 pathway is another feature that differentiates basal-like cancers from luminal ones.

HER2-positive (4-10%): Shows amplification of HER2 and high expression of the ERB2, GRB7 and GATA4 genes. Tumors of this class can be ER-positive or ER-negative. Both basal-like and HER2 tumors include high levels of p53 mutation and aggressive clinical behaviour, have a poor prognosis and do not respond to hormonal therapy.
Normal breast-like (up to 10%): Shows a high expression of genes characteristic of parenchymal basal epithelial cells and adipose stromal cells with a low expression of genes characteristic of luminal epithelial cells. These tumors have a better prognosis than basal-like cancers.

Table 9.2: The main classes of breast cancer based on the reviews in [55, 161, 172]. Explanatory notes: ER – estrogen receptor, PR – progesterone receptor, HER2 – human epidermal growth factor receptor.

Biomarker and clinical significance:

BRCA1: High expression of BRCA1 confers a worse prognosis in untreated patients (James et al. [109]).

CTC: Breast cancer patients with ≥ 5 CTC/7.5 ml of peripheral blood are associated with shorter PFS and OS (i.e. a poor prognosis) (Cristofanilli et al. [41]).

ER: Patients with ER-positive breast tumors have a better survival than patients with hormonal-negative tumors (Early Breast Cancer Trialists’ Collaborative Group [81]).

eXageneBC: Provides prognosis in node-positive or node-negative breast cancer patients (Davis et al. [46]).

Her2/neu: Her2/neu-positive breast tumors are more aggressive and have a worse prognosis compared to Her2/neu-negative tumors (Mass et al. [133]).

Ki-67: Expression of Ki-67 is associated with proliferation and progression in breast cancer (Dowsett et al. [50]).

MammaPrint: A 70-gene prognostic assay used to identify breast cancer cases at the extreme ends of the spectrum of disease outcome by identifying patients with a good or a very poor prognosis (van’t Veer et al. [185]).

Mammostrat: This standard purely prognostic test uses five antibodies with manual slide scoring to divide cases of ER-positive, lymph node negative breast cancer tumors treated with tamoxifen alone into low-, moderate- or high-risk groups (Ring et al. [160]).
Oncotype DX: A 21-gene multiplex test used for prognosis to determine 10-year disease recurrence for ER-positive, lymph node negative breast cancers, using a continuous variable algorithm and assigning a tripartite recurrence score (Goldstein et al. [79]; Paik et al. [150]).

PR: Patients with PR-positive breast tumors have a better survival than patients with hormonal-negative tumors (Dowsett et al. [49]).

NuvoSelect: A combination of several pharmacogenomic gene sets used primarily to guide the selection of therapy in breast cancer patients. This test also provides the ER and HER2 mRNA status (Ayers et al. [18]).

Roche AmpliChip: Low expression of CYP2D6 predicts resistance to tamoxifen-based chemotherapy in breast cancer patients (Hoskins et al. [99]).

Rotterdam Signature: A 76-gene assay used to predict recurrence in ER-positive breast cancer patients treated with tamoxifen (Wang et al. [192]).

Table 9.3: Breast prognostic and predictive biomarkers in clinical use, as reviewed by Mehta et al. [136] (2010).

9.2 Gene selection methods

We compare the gene selection of the following gene expression classifiers.

Boosting and elastic net

We use boosting with componentwise linear least squares (CWLLS), which has already been described in Section 7.4. To recapitulate, in each iteration the CWLLS base procedure performs a linear least squares regression against the one selected predictor variable which reduces the residual sum of squares most [35]. We use the elastic net algorithm in the setting with the L1 penalty (the lasso) (see Section 7.12). The lasso selects only one variable from a group of correlated variables and does not care which one is selected [206]. In our simulations, the lasso had a higher predictive performance than the elastic net, which takes into account both the L1 and L2 penalties and correlated variables. Therefore, we prefer to use the lasso.
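Boosting with CWLLS as recapitulated above can be sketched in a few lines: in each iteration, fit a simple least squares coefficient for every predictor against the current residuals, keep the predictor that reduces the residual sum of squares most, and take a small step along it. This is an illustrative pure-Python version with an assumed step-length ν = 0.1, not the ‘mboost’ implementation.

```python
def boost_cwlls(X, y, m_stop=100, nu=0.1):
    # Componentwise linear least squares boosting (illustrative sketch).
    # nu is the step-length; m_stop is the number of boosting iterations.
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    fitted = [0.0] * n
    for _ in range(m_stop):
        resid = [y[i] - fitted[i] for i in range(n)]
        # For each predictor, the least squares coefficient on the residuals
        # and the resulting residual sum of squares.
        best_j, best_rss, best_b = 0, float("inf"), 0.0
        for j in range(p):
            b = (sum(X[i][j] * resid[i] for i in range(n))
                 / sum(X[i][j] ** 2 for i in range(n)))
            rss = sum((resid[i] - b * X[i][j]) ** 2 for i in range(n))
            if rss < best_rss:
                best_j, best_rss, best_b = j, rss, b
        # Update only the selected component (sparse, lasso-like behaviour).
        beta[best_j] += nu * best_b
        for i in range(n):
            fitted[i] += nu * best_b * X[i][best_j]
    return beta
```

Because only one coefficient moves per iteration, predictors never selected keep an exactly zero coefficient, which is the sparsity property exploited in this chapter.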
Feature selection using CAT scores and FNDR thresholding

This method takes into account correlations among genes using correlation-adjusted t-scores (CAT) controlled by false nondiscovery rates (FNDR) [10]. The approach is based on shrinkage discriminant analysis and consists of the three following cornerstones:

• the use of James-Stein shrinkage for training the classifier, where all regularization parameters are estimated;
• feature ranking based on CAT scores, which emerge from a restructured version of the LDA equations and enable the ranking of genes in the presence of correlation;
• feature selection based on FNDR thresholding for inclusion in the prediction rule.

According to Zuber and Strimmer [207], for two-class classification the vector τ^adj of CAT scores is defined as follows:

    τ^adj ≡ (1/n₁ + 1/n₂)^{−1/2} ω
          = P^{−1/2} × (1/n₁ + 1/n₂)^{−1/2} V^{−1/2} (µ₁ − µ₂)   (9.1)
          = P^{−1/2} τ.

τ^adj is proportional to the feature weight vector ω, where ω = P^{−1/2} V^{−1/2} (µ₁ − µ₂). V = diag{σ₁², …, σ_p²} is a diagonal matrix containing the variances. P = (ρ_ij) is the correlation matrix. (1/n₁ + 1/n₂)^{−1/2} is the scale factor, where n_Q is the number of observations in each group. In binary classification, the number of groups is Q = 2. The CAT score extends the fold change and t-scores. While the t-score is the standardized mean difference µ₁ − µ₂, the CAT score is the standardized as well as decorrelated mean difference. P^{−1/2} is responsible for the decorrelation and is known as the Mahalanobis transform [10]. According to [10], a summary score to measure the total impact of feature j ∈ {1, …, p} can be defined as:

    S_j = Σ_{i=1}^{Q} (τ_ij^adj)².   (9.2)

In two-class classification, Q = 2 and equation (9.2) simplifies to S_j = 2(τ_j^adj)². The detailed description of this approach can be found in [10, 207].
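For illustration, the CAT score (9.1) and the summary score (9.2) can be computed in closed form for p = 2 features, because P^{−1/2} of a 2×2 correlation matrix [[1, ρ], [ρ, 1]] has the explicit form a·I + b·J (with J swapping the coordinates), derived from its eigenvalues 1 ± ρ. This is our own sketch, not the ‘sda’ package code.

```python
import math

def cat_scores_2d(mu1, mu2, var, rho, n1, n2):
    # CAT scores (9.1) and summary scores (9.2) for p = 2 features.
    # mu1, mu2: group means; var: variances; rho: correlation; n1, n2: group sizes.
    scale = (1.0 / n1 + 1.0 / n2) ** -0.5
    # Ordinary t-scores: scaled, standardized mean differences.
    t = [scale * (m1 - m2) / math.sqrt(v) for m1, m2, v in zip(mu1, mu2, var)]
    # P^(-1/2) = a*I + b*J for the 2x2 correlation matrix [[1, rho], [rho, 1]].
    a = 0.5 * ((1 + rho) ** -0.5 + (1 - rho) ** -0.5)
    b = 0.5 * ((1 + rho) ** -0.5 - (1 - rho) ** -0.5)
    # Decorrelated t-scores, i.e. the CAT scores.
    tau_adj = [a * t[0] + b * t[1], b * t[0] + a * t[1]]
    # Summary scores (9.2): S_j = 2 * (tau_adj_j)^2 in the two-class case.
    s = [2.0 * x * x for x in tau_adj]
    return tau_adj, s
```

With ρ = 0 the CAT scores coincide with the ordinary t-scores; with ρ ≠ 0 even a feature with zero mean difference receives a nonzero adjusted score through its correlated partner, which is exactly the effect that allows ranking genes in the presence of correlation.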
The intuitive and natural choice is to use an LDA classifier for class prediction in this approach. The classifier is fed with the selected features. We evaluated the predictive performance of this approach with the five breast cancer data sets. The computation of CAT scores with FNDR thresholding is implemented in the R package ‘sda’. Its authors recommend using the larger FNDR-based feature set for class prediction, not just the genes considered to be differentially expressed [10]. Based on these recommendations, the set of features that can be included in the classifier was thresholded by S_i < 0.9. We experimented with lower values, but the threshold was sometimes too strict and no features went through. We evaluate this approach in the same way as the methods described earlier in this thesis (see Chapter 4 for more information). Table 9.4 shows the prediction performance of this approach with the five breast cancer data sets. In comparison with the previous results, the LDA classifier with features selected using CAT scores with FNDR thresholding performs slightly worse than the approaches evaluated in Chapter 7.

Table 9.4: LDA prediction with feature selection using CAT scores and FNDR thresholding. AUCs from test data sets (including mean AUCs and standard deviations) are evaluated over 100 MCCV iterations.

Data set   van’t Veer    Pittman       Wang          Mainz         Sotiriou
AUC        0.67 ± 0.12   0.74 ± 0.10   0.59 ± 0.08   0.64 ± 0.10   0.54 ± 0.11

9.3 Evaluation of selected genes

The disadvantage of MCCV subsets, and of small sample sets in general, is that the selected genes are influenced by the subset. Therefore, we evaluated and summed the nonzero coefficients over all MCCV iterations. We used the same evaluation scheme as described earlier in this thesis – we selected and evaluated features over 100 MCCV iterations. The overall number of selected variables is increased because the genes selected in each MCCV iteration subset are slightly different.
Regularized class prediction is characterized by the fact that many coefficients are zero. In Chapter 7, we used classifiers based on logistic regression, where each coefficient describes the size of the contribution of the corresponding gene. A large coefficient means that the gene strongly influences the probability of the outcome. A positive coefficient means that the gene increases the probability of the outcome, while a negative coefficient means that the gene decreases it. We recorded the selected coefficients as follows:

1. For each coefficient estimate $\hat{\beta}_j$, where $j = 1, \ldots, p$ and $\hat{\beta}_j \neq 0$, compute the sum $\hat{\beta}_j^{sum} = \sum_{l=1}^{K} \hat{\beta}_j^{(l)}$, where $\hat{\beta}_j^{(l)}$ is the estimate in iteration $l$ and $K$ denotes the total number of MCCV iterations.
2. Compute the absolute value of each sum: $|\hat{\beta}_j^{sum}|$.
3. Sort $|\hat{\beta}_j^{sum}|$ in descending order.

The second way in which we recorded the selected coefficients was to count the frequencies of inclusion of a gene in the classifier. Thus, if $\hat{\beta}_j \neq 0$, the counter $D_j^{sum}$ of coefficient $j$ was incremented: $D_j^{sum} = D_j^{sum} + 1$, where initially $D_j^{sum} = 0$ for all $j = 1, \ldots, p$. The subsequent steps were the same as in the previous case. The two ways of counting coefficient estimates are presented in an example in Figure 9.1; they are shown in the first and the second rows of this figure respectively. The figure depicts the Pittman data set and boosting.

9.4 Results

We evaluated the aforementioned gene selection methods with the five breast cancer data sets. Figure 9.2 shows boxplots of the number of genes selected in every iteration, evaluated over 100 MCCV iterations. In the case of the boosting and elastic net classifiers, the number of selected genes is partially influenced by the different sizes of the gene expression data sets and thus differs between data sets. The third classifier, based on feature selection using CAT scores and FNDR thresholding, selects approximately the same number of genes all the time.
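The two ways of recording coefficients described above can be expressed compactly; `betas` below is a hypothetical (K × p) array holding the fitted coefficient vectors, one row per MCCV iteration, with zeros for genes not selected in that iteration:

```python
import numpy as np

def rank_genes(betas):
    """Rank genes over K MCCV iterations in the two ways described:
    by absolute summed coefficients and by inclusion frequency."""
    # first way: sum the estimates per gene (steps 1-3)
    beta_sum = betas.sum(axis=0)
    by_magnitude = np.argsort(-np.abs(beta_sum), kind="stable")
    # second way: count how often each gene had a nonzero coefficient (D_j)
    freq = (betas != 0).sum(axis=0)
    by_frequency = np.argsort(-freq, kind="stable")
    return by_magnitude, by_frequency
```

The stable sort keeps ties in gene order, so the ranking is deterministic.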
[Figure 9.1 appears here: four panels for the Pittman data set and boosting. (a) Absolute values of sums of coefficients, plotted per gene selected over 100 MCCV iterations. (b) Histogram of absolute values of sums. (c) Frequencies of inclusion of a gene in the classifier. (d) Histogram of summed gene frequencies.]

Figure 9.1: The cumulative sums of coefficient estimates computed over MCCV iterations. The example is depicted for the Pittman data set and boosting. Genes with non-zero coefficients were selected over 100 MCCV iterations. The resulting feature subset with non-zero coefficients included 841 genes.

[Figure 9.2 appears here: three boxplot panels over the data sets van't Veer, Pittman, Wang, Mainz and Sotiriou. (a) Boosting. (b) Elastic net. (c) Feature selection using CAT scores and FNDR thresholding.]

Figure 9.2: Boxplots of the number of genes selected in every iteration, evaluated over 100 MCCV iterations.

The second experiment focused on the content of the selected gene sets. We compared how many genes the selection methods share. Figure 9.3 shows the results of this experiment. We used the first way of counting coefficient estimates described in the previous section; thus we summed the coefficient values. The boosting and elastic net classifiers select the most similar gene sets. Boosting and FS using CAT scores and FNDR thresholding have less than 20% of selected genes in common.
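A sketch of the overlap computation behind this comparison; the exact denominator used for the percentages is not stated here, so the smaller of the two selected sets is assumed for illustration:

```python
def shared_percentage(genes_a, genes_b):
    """Number of genes two selection methods share, and that number as a
    percentage of the smaller selected set (an assumed normalization)."""
    a, b = set(genes_a), set(genes_b)
    shared = a & b
    return len(shared), 100.0 * len(shared) / min(len(a), len(b))
```

For example, selections {1, 2, 3, 4} and {3, 4, 5} share two genes, i.e. two thirds of the smaller set.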
Because the boosting and elastic net classifiers select the most similar gene sets, we compared the genes shared by these two classifiers across the five breast cancer data sets. We found 14 shared genes, which are given in Table 9.5. We used web gene converters [4] to convert Affymetrix IDs of Affymetrix gene expression data to EntrezGene IDs, and Rosetta files [185] to convert IDs of cDNA gene expression data to EntrezGene IDs.

Gene name   Gene description
FAM21B      Family with sequence similarity 21, member B
BCLAF1      BCL2-associated transcription factor 1
ORAI2       ORAI calcium release-activated calcium modulator 2
CD44        CD44 molecule (Indian blood group)
TMCC1       Transmembrane and coiled-coil domain family 1
CABIN1      Calcineurin binding protein 1
HNRPA3      Heterogeneous nuclear ribonucleoprotein A3
MTHFR       5,10-methylenetetrahydrofolate reductase (NADPH)
TCEB1       Transcription elongation factor B (SIII), polypeptide 1 (15kDa, elongin C)
LOC641311   Ribosomal protein L31 pseudogene
ISCA1       Iron-sulfur cluster assembly 1 homolog (S. cerevisiae)
MTF1        Metal-regulatory transcription factor 1
RND2        Rho family GTPase 2
MAP4        Microtubule-associated protein 4

Table 9.5: The genes shared between the boosting and elastic net classifiers across the five breast cancer data sets (van't Veer, Pittman, Wang, Mainz, Sotiriou).

9.5 Conclusions

Multiple studies have shown that predictive gene signatures suffer from instability of their membership and lack of reproducibility in independent studies (see, e.g., [59]). We compared the selected genes of three gene selection methods and evaluated them with five breast cancer data sets.
Boosting and elastic net classifiers in the setting with the L1 penalty (the lasso) select the most similar gene sets, which is probably because of a connection between these two classifiers [57].

[Figure 9.3 appears here: four bar-chart panels, one bar per data set. (a) Boosting and elastic net: 344 (van't Veer), 709 (Pittman), 1300 (Wang), 420 (Mainz), 855 (Sotiriou). (b) Boosting and FS (CAT, FNDR): 60, 44, 36, 52, 0. (c) Elastic net and FS (CAT, FNDR): 245, 115, 132, 394, 198. (d) Boosting, elastic net and FS (CAT, FNDR): 31, 35, 32, 21, 0.]

Figure 9.3: Shared genes among gene selection algorithms. The numbers of shared genes are written on the horizontal axes. The vertical axes depict the percentages of shared genes among the genes selected over 100 MCCV iterations.

Other comparisons of these classifiers with FS using CAT scores and FNDR thresholding did not yield very high gene overlaps. When we compared the selected genes from the five breast cancer data sets, we found only 14 shared genes. A poor representation of tumors in breast cancer studies can be a further problem, in addition to the problems resulting from high-dimensional data [136]. According to Rakha et al. [55], microarray technology should not be expected to replace current traditional diagnostic algorithms, but should be integrated within them; it may contribute additional complementary prognostic information, which should improve patient management. This statement is in accordance with the combined use of clinical and microarray data advocated in this thesis.
Chapter 10

Gene ontology feature selection

10.1 Introduction

Dimension reduction (selecting a small number of genes) is an effective way to improve the mining efficiency of high-dimensional data, as mentioned in Chapter 3. Methods of dimension reduction can be categorized into two groups: feature selection and feature extraction. Feature selection chooses a subset of features (genes) from the original ones. Feature extraction creates a new lower-dimensional feature space from the original high-dimensional feature space. Feature selection algorithms are used more often because the features they choose are more biologically meaningful. Feature selection methods fall into filter and wrapper methods. Filter feature selection methods select a feature subset independently of any mining algorithm, in contrast to wrapper feature selection methods, which use a given mining algorithm to determine the goodness of a selected subset. We concentrate on filter methods.

Some filter selection methods evaluate genes in isolation without considering gene-to-gene relations. They rank genes according to their individual relevance or discriminative power and select top-ranked genes based on statistical tests. However, genes are well known to interact with each other through various reactions. Biologists strive to identify the fundamental mechanisms of the biological process that the gene expression data represent. Thus, it is desirable to find a subset of features that is more interpretable. This chapter aims at incorporating gene-to-gene relations and interactions into the class prediction process. Gene ontology (GO) represents a controlled biological vocabulary and a repository of computable biological knowledge [89]. A similarity matrix, which is one way of presenting gene-to-gene relations, is often used in connection with ontologies. We present preliminary results with four variants of a method combining GO information with gene expression data.
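A minimal example of the filter-style ranking mentioned above: genes are scored by the absolute two-sample t-statistic and the top k are kept. The interface and names are illustrative, in the spirit of the t-statistic selection performed with the R package 'st' elsewhere in this thesis:

```python
import numpy as np

def t_filter(X, y, k):
    """Rank genes by the absolute two-sample t-statistic (pooled variance)
    and return the indices of the top k genes."""
    X1, X2 = X[y == 1], X[y == 0]
    n1, n2 = len(X1), len(X2)
    # pooled per-gene variance
    s2 = ((n1 - 1) * X1.var(0, ddof=1) + (n2 - 1) * X2.var(0, ddof=1)) / (n1 + n2 - 2)
    t = (X1.mean(0) - X2.mean(0)) / np.sqrt(s2 * (1.0 / n1 + 1.0 / n2))
    return np.argsort(-np.abs(t), kind="stable")[:k]
```

Because the score is computed gene by gene, such a filter ignores gene-to-gene relations entirely, which is exactly the limitation the GO-based variants below address.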
GO is incorporated into the microarray data at the beginning of the class prediction process (early integration) and influences the selected features. Owing to the GO integration, the features selected by the class prediction method can be more meaningful and better interpretable. Feature selection [157, 151] or extraction [38] with GO is not new; GO is widely used in many fields. Some authors search for expression correlations [38], while others directly eliminate highly correlated genes [151]. Finding correlated expression patterns is the goal of gene clustering. Srivastava et al. [174] integrate GO knowledge and expression data using Bayesian regression mixture models to perform unsupervised clustering of the samples and identify physiologically relevant discriminative features. Jiang et al. [111] construct a gene semantic similarity network and then use the network to infer disease genes. They show that genes with higher semantic similarity scores tend to be associated with diseases with higher phenotype similarity scores. Cho et al. [38] present a feature extraction algorithm based on the concept of virtual genes, integrating microarray data with GO annotations. Virtual genes are groups of genes that potentially interact with each other and are used to build a sample classifier. The next section describes approaches that select a subset of genes with incorporated GO information. The GO-based feature selections are evaluated with a benchmark breast cancer data set.

10.2 Combining gene expression data with gene ontology

The proposed approaches integrate GO information into gene selection. GO is divided into three orthogonal ontologies: (a) biological process (BP), (b) molecular function (MF), and (c) cellular component (CC). The three ontologies are represented as directed acyclic graphs in which nodes correspond to terms and their relationships are represented by edges. GO supports two types of relations: 'is-a' and 'part-of'.
Figure 10.3 describes the microarray data feature selection procedure. Each of the three ontologies is used separately. At the beginning, a semantic similarity matrix (SSM) is computed for each ontology, based on the microarray gene IDs converted to EntrezGene IDs. Computing an SSM for an expression matrix with a dimension of thousands of genes is computationally intensive; therefore, we take K random genes. An SSM can be computed with various measures; an overview can be found in [127]. We calculated the SSM with Wang's method [191]. Wang's measure is a graph-structure-based method which encodes the semantics of a GO term in a measurable format to enable quantitative comparison. It determines the semantic similarity of two GO terms based on both the locations of these terms in the GO graph and their relations with their ancestor terms. Wang defines the semantic value (S-value) of term A as the aggregate contribution of all terms in $DAG_A$ to the semantics of term A; a term closer to term A in $DAG_A$ contributes more to its semantics [191]. Given two GO terms A and B, Wang's semantic similarity between these two terms is defined as [191]:

$$S_{GO}(A, B) = \frac{\sum_{t \in T_A \cap T_B} \left( S_A(t) + S_B(t) \right)}{SV(A) + SV(B)}, \qquad (10.1)$$

where $S_A(t)$ ($S_B(t)$) is the S-value of GO term $t$ related to term A (B), and $SV(A) = \sum_{t \in T_A} S_A(t)$. A more detailed description can be found in [191].

We use a threshold to convert the SSM to an adjacency matrix of ones and zeros: two genes are connected if their semantic similarity is greater than the threshold. The adjacency matrix also has zeros on the diagonal. We performed the experiments in the R environment. An illustrative example of the SSM and the adjacency matrix for ten genes is given in Figure 10.1.
> ssm[1:10,1:10]
        1915  2527  7392  1359  9976  4617  4664  8463 80153  9818
1915   1.000 0.627 0.542 0.505 0.282 0.418 0.542 0.542 1.000 0.409
2527   0.627 1.000 0.753 0.708 0.188 0.539 0.753 0.753 0.627 0.542
7392   0.542 0.753 1.000 0.615 0.200 0.677 1.000 1.000 0.449 0.673
1359   0.505 0.708 0.615 1.000 0.147 0.436 0.615 0.615 0.492 0.443
9976   0.282 0.188 0.200 0.147 1.000 0.160 0.200 0.200 0.231 0.164
4617   0.418 0.539 0.677 0.436 0.160 1.000 0.677 0.677 0.351 0.682
4664   0.542 0.753 1.000 0.615 0.200 0.677 1.000 1.000 0.449 0.673
8463   0.542 0.753 1.000 0.615 0.200 0.677 1.000 1.000 0.449 0.673
80153  1.000 0.627 0.449 0.492 0.231 0.351 0.449 0.449 1.000 0.346
9818   0.409 0.542 0.673 0.443 0.164 0.682 0.673 0.673 0.346 1.000

> adjm
       1915 2527 7392 1359 9976 4617 4664 8463 80153 9818
1915      0    0    0    0    0    0    0    0     1    0
2527      0    0    1    1    0    0    1    1     0    0
7392      0    1    0    0    0    1    1    1     0    1
1359      0    1    0    0    0    0    0    0     0    0
9976      0    0    0    0    0    0    0    0     0    0
4617      0    0    1    0    0    0    1    1     0    1
4664      0    1    1    0    0    1    0    1     0    1
8463      0    1    1    0    0    1    1    0     0    1
80153     1    0    0    0    0    0    0    0     0    0
9818      0    0    1    0    0    1    1    1     0    0

Figure 10.1: Examples of the semantic similarity matrix and the adjacency matrix.

A gene connectivity graph is created from the adjacency matrix. The graph for the illustrative example is depicted in Figure 10.2. In the gene connectivity graph, we look for a maximum clique or cliques that identify a set of semantically similar genes. In that figure, the genes (vertices) belonging to an example of a maximum gene clique are denoted in red.

10.2.1 Filtering with gene ontology

Four variants of combining GO information with microarray data were explored:

- Variant A - Microarray data enrichment with gene ontology relations based on maximum gene cliques. Maximum gene cliques are found. Based on the statistical characteristics of the microarray data (e.g. mean, variance), we add a value to the gene expression in the microarray data matrix if the gene is included in the maximum cliques.
In the evaluation, t-statistics feature selection is performed during each MCCV iteration on the training data.

[Figure 10.2 appears here: a gene connectivity graph with ten vertices labelled 0-9.]

Figure 10.2: A gene connectivity graph for ten genes with an example of a clique (vertices denoted in red).

- Variant B - Microarray data enrichment with gene ontology relations based on maximum gene cliques. Maximum gene cliques are found. Based on the statistical characteristics of the microarray data (e.g. mean, variance), we add a value to the gene expression in the microarray data matrix if the gene is included in the cliques and its expression is higher than a threshold; we subtract a value from the gene expression if the gene is included in the cliques and its expression is lower than the threshold. In the evaluation, t-statistics feature selection is performed during each MCCV iteration on the training data.

- Variant C - Microarray data gene ontology feature selection based on the connectivity frequency of each gene. We use the adjacency matrix and sum its columns (or rows) to get a vector $v$ of gene connectivity frequencies. A threshold is computed from this vector as $\max(v) - \frac{\max(v) - \mathrm{mean}(v)}{2}$.

- Variant D - Microarray data gene ontology feature selection based on maximum gene cliques. Maximum gene cliques are found, and the genes included in the maximum gene cliques are selected from the microarray data.

The devised methods are evaluated with the same scheme as described in the previous chapters. The enriched or filtered microarray data are evaluated over 100 MCCV iterations. t-statistics feature selection, if it is used, is performed during each MCCV iteration on the training data. AUCs are evaluated on the test data sets.
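The SSM thresholding, the Variant C connectivity rule and the Variant D clique selection can be sketched as follows. The brute-force clique search is only viable for small illustrative graphs; the thesis's R experiments use the 'igraph' package for this step:

```python
import numpy as np
from itertools import combinations

def adjacency(ssm, threshold):
    """Convert a semantic similarity matrix to a 0/1 adjacency matrix:
    two genes are connected if their similarity exceeds the threshold."""
    A = (ssm > threshold).astype(int)
    np.fill_diagonal(A, 0)
    return A

def connectivity_selection(A):
    """Variant C: keep genes whose connectivity frequency exceeds
    max(v) - (max(v) - mean(v)) / 2."""
    v = A.sum(axis=0)
    thr = v.max() - (v.max() - v.mean()) / 2.0
    return np.flatnonzero(v > thr)

def maximum_cliques(A):
    """Variant D helper: brute-force search for the largest cliques,
    trying sizes from n downwards (exponential; toy graphs only)."""
    n = len(A)
    best = []
    for k in range(n, 1, -1):
        for nodes in combinations(range(n), k):
            if all(A[i, j] for i, j in combinations(nodes, 2)):
                best.append(set(nodes))
        if best:   # stop at the first (largest) size with any clique
            break
    return best
```

Variant D then simply restricts the expression matrix to the union of the genes returned by `maximum_cliques`.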
Support vector machines (SVM) with a linear kernel and the elastic net in the setting described in Section 7.12 with the L1 penalty are employed as class prediction methods during the evaluation.

[Figure 10.3 appears here: a flowchart from gene ontology (K genes) and microarray data with EntrezGene IDs, through the semantic similarity matrix (threshold), the adjacency matrix, the gene connectivity graph and the maximum gene cliques, to filtering (C versus A, B, D) and the filtered microarray data.]

Figure 10.3: Gene ontology feature selection. A, B, C, D denote the variants of feature selection described in this chapter.

10.3 Results

We evaluated the four variants of incorporating GO into microarray data described in the previous section, together with two other variants: no feature selection and t-statistic feature selection. We used the Pittman breast cancer data set for the evaluations. The experiments were performed in the R environment using the packages 'GOSemSim' and 'igraph' for computing SSMs and for working with graphs, respectively. The SVM algorithm is included in the R package 'e1071'; we used it in the default setting with a linear kernel. The computations of SSMs were time consuming. The Pittman data set includes more than 8000 genes, so we decreased the dimension of the SSM by using K = 2,000 random genes. The proposed approaches were evaluated with the cellular component (CC) ontology, which was shown to perform best [188]. Table 10.1 displays the results. Variant D reaches the best performance with both class prediction methods.
Feature selection        t-statistics (# of genes)    SVM           Elastic net
Ontology  Variant
−         −              NO (8385)                    0.73 ± 0.07   0.75 ± 0.08
−         −              YES 50                       0.67 ± 0.08   0.75 ± 0.08
−         −              YES 500                      0.73 ± 0.08   0.77 ± 0.08
CC        A              YES 50                       0.67 ± 0.08   0.75 ± 0.08
CC        A              YES 500                      0.73 ± 0.08   0.77 ± 0.08
CC        B              YES 50                       0.66 ± 0.08   0.74 ± 0.08
CC        B              YES 500                      0.72 ± 0.08   0.75 ± 0.08
CC        C              NO (405)                     0.73 ± 0.07   0.75 ± 0.08
CC        D              NO (397)                     0.78 ± 0.07   0.78 ± 0.07

Table 10.1: The four variants A, B, C, D used with the cellular component ontology (CC) and evaluated with the Pittman data set. AUCs from test data sets (including mean AUCs and standard deviations) evaluated over 100 MCCV iterations. The t-statistics column indicates whether genes were selected on the basis of the absolute value of the t-statistic using the R package 'st' during each MCCV iteration on the training data.

Elastic net usually performs better than SVM in our experiments. The number of selected genes also depends on the threshold applied to the adjacency matrix; we used a threshold of 0.8 for all experiments. Full-scale testing of thresholds and other parameters is still needed.

10.4 Conclusion

We proposed four variants of using GO information with gene expression data. GO information is incorporated at the stage of early integration in order to improve binary class prediction. The selected feature sets were enriched with genes with respect to gene-to-gene relations and interactions. Similarity matrices were computed with the Wang measure, which is based on the GO graph structure. Microarray data gene ontology feature selection based on maximum gene cliques (Variant D) with the CC ontology improved the prediction performance. Future research will deal with ways to accelerate SSM computing and to make this calculation feasible for all genes of a high-dimensional gene expression matrix.
Future research will also include full-scale testing of various parameters with all three ontologies, evaluation of the methods with more data sets, a comparison of the selected features, and a comparison with other GO-based feature selection methods. Using pathways and other resources in addition to GO should also bring improvements.

Chapter 11

Conclusions

This thesis deals with microarray data disease outcome prediction. Microarray data suffer from high dimensionality, inadequate sample sizes and low signal-to-noise ratios, which influence the resulting prediction models. The cornerstone of this thesis is the idea that combining microarray data with other sources can result in better predictions of disease outcome. The main part focuses on combinations with clinical data. We used logistic regression for clinical data and boosting for gene expression data, and combined the data at the level of late integration. We extended this approach with pre-validation of the models built with microarray and clinical data, followed by weight calculations. Evaluations were performed on real data sets as well as on several redundant and non-redundant simulated data sets to test our approach in various settings. The results show that combining microarray gene expression and clinical data improves prediction performance depending on the quality of the data. Applying this approach without pre-validation to non-redundant data sets leads to improved outcome prediction, while the method does not yield more accurate predictions with redundant data sets. Pre-validation of the constructed models increases the performance in the case of redundant data sets. Next, we experimented with alternative approaches. We employed an elastic net with the high-dimensional data instead of boosting and used it in a joint classifier with logistic regression. We also evaluated classifiers combining gene expression, clinical and SNP data, which demonstrates that our approach can be used with more than two data sources.
The presented approaches, which offer a way to construct a classifier combining the two types of data, can also be used for determining the additional predictive value of microarray data. The designed two-step method was evaluated and compared to an alternative approach. Both compared methods made the same decision about the additional predictive value for all the breast cancer data sets.

Class prediction studies aim at identifying sets of genes (signatures) that can accurately classify new data. It is interesting to compare the selected features of the described classifiers on different data sets. Unfortunately, gene signatures resulting from different studies usually have very few genes in common [172]. We compared the selected genes of three feature selection methods evaluated on five breast cancer data sets. Boosting with CWLLS and elastic net classifiers in the setting with the L1 penalty (the lasso) provided the most similar gene sets.

Incorporating gene-to-gene relations and interactions into the class prediction process can increase prediction accuracy and can help interpret the results. The similarity matrix, which is often used in connection with ontologies, provides a natural way to display gene-to-gene relations. Four variants of feature selection methods combining gene ontology with gene expression data were presented. Evaluations were performed on a real benchmark data set. The feature selection based on maximum gene cliques (Variant D) improved the prediction performance.

As mentioned above, the described approach enables combining more than two or three sources of data. The prediction accuracy of a combination of microarray and other data depends on the complementarity of the data sources. If the data models are complementary, i.e. they contain non-redundant information, their combination can bring increased prediction accuracy. Moreover, prediction accuracy also depends on the quality of the data.
Data production procedures are constantly evolving and being innovated. Integrating distinct and multiple information resources is, and will remain, an important task in the future. Our future work can focus on additional data types. According to several recently published studies [44, 201, 76], pathways, microRNA, arrayCGH and other gene-related information may help to predict cancer outcomes. Metabolic pathway information from KEGG and OPHID protein-protein interactions outperformed other sources in the study of Daemen et al. [44]. Promising classifiers are, for example, Bayesian networks, because they are able to integrate heterogeneous data. Networks make it possible to keep relations between data and guarantee better interpretability of the results.

Bibliography

[1] AlzGene – Database of genetic association studies performed on Alzheimer's disease [cited 1 September 2010]. Available from: http://www.alzgene.org/.
[2] CAMDA – Critical Assessment of Microarray Data Analysis [cited 5 June 2010]. Available from: http://camda.bioinfo.cipf.es/camda07/datasets/.
[3] CDC – Centers for Disease Control [cited 25 May 2010]. Available from: http://www.cdc.gov/.
[4] Gene id converter [cited 12 January 2012]. Available from: http://idconverter.bioinfo.cnio.es/.
[5] Gene Ontology tools – Gene Ontology tools for analysis of microarray gene expression data [cited 12 April 2010]. Available from: http://www.geneontology.org/GO.tools.microarray.shtml.
[6] Mainz breast cancer data set: breastCancerMAINZ R package [cited 2 June 2011]. Available from: http://www.bioconductor.org/packages/2.8/data/experiment/.
[7] MEDLINE – biomedical literature repository [cited 22 April 2009]. Available from: http://mbr.nlm.nih.gov/index.shtml.
[8] OMIM – Online Mendelian Inheritance in Man [cited 13 November 2010]. Available from: http://www.ncbi.nlm.nih.gov/omim.
[9] S. Aerts, D. Lambrechts, S. Maity, P. Van Loo, B. Coessens, F. De Smet, L. C. Tranchevent, B. De Moor, P. Marynen, B. Hassan, P.
Carmeliet, and Y. Moreau. Gene prioritization through genomic data fusion. Nature Biotechnology, 24:537–544, 2006.
[10] M. Ahdesmäki and K. Strimmer. Feature selection in omics prediction problems using cat scores and false nondiscovery rate control. Ann. Appl. Stat., 4(1):503–519, 2010.
[11] H. Ahn, H. Moon, M. J. Fazzari, N. Lim, J. J. Chen, and R. L. Kodell. Classification by ensembles from random partitions of high-dimensional data. Computational Statistics & Data Analysis, 51(12):6166–6179, 2007.
[12] H. Akaike. A new look at the statistical model identification. IEEE Trans. Automat. Contr., 19(6):716–723, 1974.
[13] M. Alaminos, V. Davalos, and N. K. Cheung. Clustering of gene hypermethylation associated with clinical risk groups in neuroblastoma. J Natl Cancer Inst, 96:1208–1219, 2004.
[14] C. F. Aliferis, A. Statnikov, and I. A. Tsamardinos. Challenges in the analysis of mass-throughput data: a technical commentary from the statistical machine learning perspective. Canc Inform, (2):133–162, 2007.
[15] O. Alter, P. O. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. PNAS, 97(18):10101–10106, 2000.
[16] J. C. Alwin, D. J. Kemp, and G. R. Stark. Methods for detection of specific RNAs in agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with DNA probes. Proc. Natl. Acad. Sci. USA, 74:5350–5354, 1977.
[17] J. Andrews, W. Kennette, J. Pilon, A. Hodgson, A. B. Tuck, A. F. Chambers, and D. I. Rodenhise. Multi-platform whole-genome microarray analyses refine the epigenetic signature of breast cancer metastasis with gene expression and copy number. PLoS ONE, 5(1):1–17, 2010.
[18] M. Ayers, W. F. Symmans, J. Stec, A. I. Damokosh, E. Clark, K. Hess, and M. Lecocke et al. Gene expression profiles predict complete pathologic response to neoadjuvant paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide chemotherapy in breast cancer. Jour. of Clin.
Oncol., 22:2284–2293, 2004.
[19] C. A. Ball and A. Brazma. MGED standards: work in progress. OMICS, 10(2):138–144, 2006.
[20] A. Barrier, P. Y. Boelle, A. Lemoine, A. Flahault, S. Dudoit, and M. Huguier. Gene expression profiling in colon cancer. Bull. Acad. Natl. Med., (6):1091–1103, 2007.
[21] H. Binder and M. Schumacher. Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics, 9:1–10, 2008.
[22] D. A. Bitton, M. J. Okoniewski, Y. Connolly, and C. J. Miller. Exon level integration of proteomics and microarray data. BMC Bioinformatics, 9(118):1–11, 2008.
[23] T. Bo and I. Johanssen. New feature subset selection procedures for classification of expression profiles. Genome Biol, (4):1–6, 2002.
[24] B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19:185–193, 2003.
[25] A. L. Boulesteix and T. Hothorn. Testing the additional predictive value of high-dimensional molecular data. BMC Bioinformatics, 78:1–11, 2010.
[26] A. L. Boulesteix, C. Porzelius, and M. Daumer. Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value. Bioinformatics, 24:1698–1706, 2008.
[27] A. L. Boulesteix, C. Strobl, T. Augustin, and M. Daumer. Evaluating microarray-based classifiers: An overview. Cancer Inform, 6:77–97, 2008.
[28] A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30:1145–1159, 1997.
[29] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[30] L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1517, 1999.
[31] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[32] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and regression trees.
Wadsworth & Brooks, 1984.
[33] P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares, and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci, 97(1):262–267, 2000.
[34] P. Bühlmann. Boosting for high-dimensional linear models. Ann. Statist., 34(2):559–583, 2006.
[35] P. Bühlmann and T. Hothorn. Boosting algorithms: Regularization, prediction and model fitting. Statist. Sci., 22:477–505, 2007.
[36] P. Cahan, F. Rovegno, D. Mooney, J. C. Newman, G. S. Laurent, and T. A. McCaffrey. Meta-analysis of microarray results: challenges, opportunities, and recommendations for standardization. Gene, 401:12–18, 2007.
[37] G. Chen, T. G. Gharib, C. C. Huang, J. M. Taylor, D. E. Misek, S. L. Kardia, T. J. Giordano, M. D. Iannettoni, M. B. Orringer, S. M. Hanash, and D. G. Beer. Discordant protein and mRNA expression in lung adenocarcinomas. Mol Cell Proteomics, 1:304–313, 2002.
[38] Y. R. Cho, X. Xu, W. Hwang, and A. Zhang. Feature extraction from microarray expression data by integration of semantic knowledge. Sixth International Conference on Machine Learning and Applications, pages 606–611, 2007.
[39] H. Y. Chuang, E. Lee, Y. T. Liu, D. Lee, and T. Ideker. Network-based classification of breast cancer metastasis. Mol Syst Biol, 3(140):1–10, 2007.
[40] F. H. C. Crick. Central dogma of molecular biology. Nature, 227:561–563, 1970.
[41] M. Cristofanilli, G. T. Budd, M. J. Ellis, A. Stopeck, J. Matera, M. C. Miller, J. M. Reuben, G. V. Doyle, W. J. Allard, L. W. Terstappen, and D. F. Hayes. Brca1, a potential predictive biomarker in the treatment of breast cancer. N. Engl. J. Med., 351(8):781–791, 2004.
[42] A. Daemen, O. Gevaert, T. De Bie, A. Debucquoy, J. P. Machiels, B. De Moor, and K. Haustermans. Integrating microarray and proteomics data to predict the response on cetuximab in patients with rectal cancer. Pac Symp Biocomput, pages 166–177, 2008.
[43] A. Daemen, O.
Gevaert, F. Ojeda, A. Debucquoy, J. A. Suykens, C. Sempoux, J. P. Machiels, K. Haustermans, and B. De Moor. A kernel-based integration of genome-wide data for clinical decision support. Genome Med, 1(4):1–17, 2009.
[44] A. Daemen, M. Signoretto, O. Gevaert, J. A. Suykens, and B. De Moor. Improved microarray-based decision support with graph encoded interactome data. PLoS One, 5(4):1–16, 2010.
[45] W. S. Dalton and S. H. Friend. Cancer biomarkers – an invitation to the table. Science, 312:1165–1168, 2006.
[46] L. M. Davis, C. Harris, L. Tang, P. Doherty, P. Hraber, Y. Sakai, T. Bocklage, K. Doeden, B. Hall, and J. Alsobrook. Amplification patterns of three genomic regions predict distant recurrence in breast carcinoma. The Journal of Molecular Diagnostics, 9(3):327–336, 2007.
[47] E. de Hoffmann and V. Stroobant. Mass Spectrometry: Principles and Applications. Biddles Ltd, 1999.
[48] M. Dettling and P. Bühlmann. Boosting for tumor classification with gene expression data. Bioinformatics, 19(9):1061–1069, 2003.
[49] M. Dowsett, J. Houghton, C. Iden, J. Salter, J. Farndon, R. A'Hern, R. Sainsbury, and M. Baum. Benefit from adjuvant tamoxifen therapy in primary breast cancer patients according to oestrogen receptor, progesterone receptor, EGF receptor and HER2 status. Annals of Oncology, 17(5):818–826, 2006.
[50] M. Dowsett, I. E. Smith, S. R. Ebbs, J. M. Dixon, and A. Skene et al. Prognostic value of Ki67 expression after short-term presurgical endocrine therapy for primary breast cancer. JNCI, 99:167–170, 2007.
[51] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. New York: Wiley, 2001.
[52] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Jour Amer Stat Asoc, 97(457):77–87, 2002.
[53] S. Dudoit, J. P. Shaffer, and J. C. Boldrick. Multiple hypothesis testing in microarray experiments. Statistical Science, (18):71–103, 2003.
[54] A. Dupuy and R. M. Simon.
Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J. Natl. Cancer. Inst., 99(2):147–157, 2007.
[55] E. A. Rakha, M. E. El-Sayed, J. S. Reis-Filho, and I. O. Ellis. Expression profiling technology: its contribution to our understanding of breast cancer. Histopathology, 52(1):67–81, 2008.
[56] P. Eden, C. Ritz, C. Rose, M. Ferno, and C. Peterson. “Good old” clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers. Eur. J. Cancer, 40:1837–1841, 2004.
[57] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Statist., 32(2):407–499, 2004.
[58] P. H. C. Eilers, J. M. Boer, G. J. van Ommen, and H. C. van Houwelingen. Classification of microarray data with penalized logistic regression. Proc. SPIE, 4266:187–198, 2001.
[59] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21(2):171–178, 2005.
[60] D. A. Elizondo, B. N. Passow, R. Birkenhead, and A. Huemer. Principal Manifolds for Data Visualization and Dimension Reduction. Springer Berlin Heidelberg, 2007.
[61] J. D. Emerson and D. C. Hoaglin. Analysis of two-way tables by medians. Understanding Robust and Exploratory Data Analysis, John Wiley & Sons, New York, pages 166–206, 1983.
[62] D. Engler and Y. Li. Survival analysis with high-dimensional covariates: an application in microarray studies. Stat Appl Genet Mol Biol, 8(14):1–20, 2009.
[63] A. Ergun, C. A. Lawrence, M. A. Kohanski, T. A. Brennan, and J. J. Collins. A network biology approach to prostate cancer. Mol Syst Biol, 3(82):1–6, 2007.
[64] J. Fan and J. Lv. A selective overview of variable selection in high dimensional feature space (invited review article). Statistica Sinica, (20):101–148, 2010.
[65] A. Fernandez-Teijeiro, R. A. Betensky, L. M. Sturla, J. Y. Kim, P. Tamayo, and S. L. Pomeroy.
Combining gene expression profiles and clinical parameters for risk stratification in medulloblastomas. J Clin Oncol., 22(6):994–998, 2004.
[66] J. Fox. An R and S-Plus Companion to Applied Regression. SAGE Publications, 2002.
[67] J. Fridlyand, J. Snijders, and B. Ylstra. Breast tumor copy number aberration phenotypes and genomic instability. BMC Cancer, 6:1–13, 2006.
[68] J. Fridlyand and J. Y. H. Yang. DENMARKLAB R package. Advanced microarray data analysis: Class discovery and class prediction [cited 21 September 2009]. Available from: http://genome.cbs.dtu.dk/courses/norfa2004/Extras/.
[69] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Ann. Appl. Stat., 1:302–332, 2007.
[70] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Ann. Statist., 29:1189–1232, 2001.
[71] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–24, 2010.
[72] N. Friedman and M. Goldszmidt. Learning bayesian networks with local structure. Learning in Graphical Models, pages 421–460, 1999.
[73] K. Fukuda, S. E. Straus, I. Hickie, M. C. Sharpe, J. G. Dobbins, and A. Komaroff. The chronic fatigue syndrome: a comprehensive approach to its definition and study. International Chronic Fatigue Syndrome Study Group. Ann Intern Med, 121:953–959, 1994.
[74] R. Gentleman, W. Huber, V. J. Carey, R. A. Irizarry, and S. Dudoit. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, 2005. ISBN 0-387-25146-4.
[75] E. I. George. The variable selection problem. American Statistical Association, 95(452):1304–1308, 2000.
[76] O. Gevaert and B. de Moor. Prediction of cancer outcome using dna microarray technology: past, present and future. Expert Opinion on Medical Diagnostics, 3(2):157–165, 2009.
[77] O. Gevaert, F. De Smet, D. Timmerman, Y. Moreau, and B. De Moor.
Predicting the prognosis of breast cancer by integrating clinical and microarray data with bayesian networks. Bioinformatics, 22:147–157, 2007.
[78] O. Gevaert, S. Van Vooren, and B. de Moor. Integration of microarray and textual data improves the prognosis prediction of breast, lung and ovarian cancer patients. Pac Symp Biocomput, pages 279–290, 2008.
[79] L. J. Goldstein, R. Gray, S. Badve, B. H. Childs, C. Yoshizawa, S. Rowley, S. Shak, F. L. Baehner, P. M. Ravdin, N. E. Davidson, G. W. Sledge, E. A. Perez, L. N. Shulman, S. M. Sparano, and J. A. Sparano. Prognostic utility of the 21-gene assay in hormone receptor-positive operable breast cancer compared with classical clinicopathologic features. Jour. of Clin. Oncol., 26:4063–4071, 2008.
[80] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, (286):531–537, 1999.
[81] Early Breast Cancer Trialists’ Collaborative Group. Tamoxifen for early breast cancer: an overview of the randomised trials. Lancet, 351(8):1451–1467, 1998.
[82] S. K. Gruvberger, M. Ringner, and P. Eden. Expression profiling to predict outcome in breast cancer: the influence of sample selection. Breast Cancer Res., 5(1):23–26, 2003.
[83] G. Guizzardi. Ontological Foundations for Structural Conceptual Models. PhD thesis, Telematica Instituut Fundamental Research Series, 2005.
[84] I. Guyon, J. Weston, S. M. D. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, (46):389–422, 2002.
[85] S. P. Gygi, B. Rist, S. A. Gerber, F. Turecek, M. H. Gelb, and R. Aebersold. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol, 17:994–999, 1999.
[86] J. S. Hamid, P. Hu, N. M. Roslin, V. Ling, C. M. T.
Greenwood, and J. Beyene. Data integration in genetics and genomics: Methods and challenges. Human Genomics and Proteomics, 2009:1–13, 2009.
[87] J. Han and M. Kamber. Data Mining, Concepts and Techniques. Ed. 2. Morgan Kaufmann, San Francisco, 2005.
[88] J. W. Hardin and J. M. Hilbe. Generalized Estimating Equations. Chapman & Hall/CRC, 2003.
[89] M. A. Harris, J. Clark, and A. Ireland. The gene ontology (go) database and informatics resource. Nucleic Acids Res, 32:258–261, 2004.
[90] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young. Using graphical models and genomic expression data to statistically validate models of gene regulatory networks. In Proceedings of PSB’01, pages 1–12, 2000.
[91] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–1415, 2004.
[92] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Trans. Pattern Anal. Mach. Intell., 18(6):607–615, 1996.
[93] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. Springer, 2001.
[94] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990.
[95] D. Heckerman, D. Geiger, and D. Chickering. Learning bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.
[96] P. Helman, R. Veroff, S. Atlas, and C. Willman. A bayesian network classification methodology for gene expression data. Journal of Comp. Biology, 11(4):581–615, 2004.
[97] A. Hoerl and R. Kennard. Ridge regression in ‘Encyclopedia of Statistical Sciences’. Wiley, New York, 1988.
[98] H. Hofling and R. Tibshirani. A study of pre-validation. Ann. Appl. Stat., 2(2):643–664, 2008.
[99] J. M. Hoskins, L. A. Carey, and H. L. McLeod. Cyp2d6 and tamoxifen: Dna matters in breast cancer. Nature Reviews Cancer, 9:576–586, 2009.
[100] D. W. Hosmer and S. Lemeshow. Applied Logistic Regression.
Wiley, 2000.
[101] T. Hothorn and P. Bühlmann. mboost: Model-Based Boosting. R package version 0.5-8 [cited 26 September 2008]. Available from: http://CRAN.R-project.org/.
[102] T. Hothorn, F. Leisch, K. Hornik, and A. Zeileis. The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics, pages 1–22, 2005.
[103] X. Huang, W. Pan, S. Grindle, X. Han, Y. Chen, S. J. Park, L. W. Miller, and J. Hall. A comparative study of discriminating human heart failure etiology using gene expression profiles. BMC Bioinformatics, 6(205):1–15, 2005.
[104] J. P. Ioannidis, N. A. Patsopoulos, and E. Evangelou. Heterogeneity in meta-analyses of genome-wide association investigations. PLoS One, 2(9):1–7, 2007.
[105] M. V. Iorio, M. Ferracin, and C. G. Liu. Microrna gene expression deregulation in human breast cancer. Cancer Res, 65:7065–7070, 2005.
[106] R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostat, 4:249–264, 2003.
[107] S. Jackman. Generalized linear models. Stanford University, pages 1–7, 2003.
[108] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intel, (22):4–37, 2000.
[109] C. R. James, J. E. Quinn, P. B. Mullan, P. G. Johnston, and D. P. Harkin. Brca1, a potential predictive biomarker in the treatment of breast cancer. Oncologist, 12(2):142–150, 2007.
[110] H. Jiang, Y. Deng, H. S. Chen, L. Tao, Q. Sha, J. Chen, C. J. Tsai, and S. Zhang. Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics, 5(81):1–12, 2004.
[111] R. Jiang, M. Gan, and P. He. Constructing a gene semantic similarity network for the inference of disease genes. BMC Systems Biology, 5(2):1–11, 2011.
[112] J. M. Koomen, E. B. Haura, and G. Bepler.
Proteomic contributions to personalized cancer care. Mol Cell Proteomics, 7(10):1780–1794, 2008.
[113] K. Y. Kim, B. J. Kim, and G. S. Yi. Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics, 5:1–9, 2004.
[114] H. Kehrer-Sawatzki and D. N. Cooper. Copy number variation and disease. Cytogenetic and Genome Research, 123:1–123, 2008.
[115] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7:673–679, 2001.
[116] M. J. Khoury, M. Gwinn, P. W. Yoon, N. Dowling, C. A. Moore, and L. Bradley. The continuum of translation research in genomic medicine: how can we accelerate the appropriate integration of human genome discoveries into health care and disease prevention? Genetics in Medicine, (9):665–674, 2007.
[117] T. Knickerbocker, J. R. Chen, R. Thadhani, and G. MacBeath. An integrated approach to prognosis using protein microarrays and nonparametric methods. Mol Syst Biol, 3(123):1–8, 2007.
[118] P. Y. Kwok. Single Nucleotide Polymorphisms: Methods and Protocols. Humana Press Inc., 2003.
[119] C. Kyprianidou. Analysing basic genetics using bayesian networks and the impact of genetic testing on the insurance industry. Diploma thesis, Cass Business School, City University, 2003.
[120] P. W. Laird. The power and the promise of dna methylation markers. Nat Rev Cancer, 3:253–266, 2003.
[121] G. R. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626–2635, 2004.
[122] E. Lee, H. Y. Chuang, J. W. Kim, T. Ideker, and D. Lee. Inferring pathway activity toward precise disease classification. PLoS Comput Biol, 4(11):1–9, 2008.
[123] J. W. Lee, J. B. Lee, M. Park, and S. Song.
An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48(4):869–885, 2005.
[124] S. I. Lee and S. Batzoglou. Application of independent component analysis to microarrays. Genome Biology, 4(11):1–21, 2003.
[125] L. Li. Survival prediction of diffuse large-b-cell lymphoma based on both clinical and gene expression information. Bioinformatics, 22(4):466–471, 2006.
[126] R. J. Lipshutz, S. P. Fodor, T. R. Gingeras, and D. J. Lockhart. High density synthetic oligonucleotide arrays. Nat Genet, 21:20–24, 1999.
[127] P. W. Lord, R. D. Stevens, and A. Brass. Semantic similarity measures as tools for exploring the gene ontology. Pac. Symp. of Biocomputing, (8):601–612, 2003.
[128] J. Lu, G. Getz, and E. Miska. Microrna expression profiles classify human cancers. Nature, 435:834–838, 2005.
[129] R. W. Lutz, M. Kalisch, and P. Bühlmann. Robustified l2 boosting. Computational Statistics & Data Analysis, 52(7):1–12, 2008.
[130] S. Ma and J. Huang. Regularized gene selection in cancer microarray meta-analysis. BMC Bioinformatics, 10(1):1–12, 2009.
[131] G. MacBeath. Protein microarrays and proteomics. Nat Genet, 32:526–532, 2002.
[132] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Statist., 18:50–60, 1947.
[133] R. D. Mass, M. F. Press, S. Anderson, M. A. Cobleigh, C. L. Vogel, N. Dybdal, G. Leiberman, and D. J. Slamon. Evaluation of clinical outcomes according to her2 detection by fluorescence in situ hybridization in women with metastatic breast cancer treated with trastuzumab. Journal Clinical Breast Cancer, 6(3):240–246, 2005.
[134] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, 1989.
[135] G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley, 1992.
[136] S. Mehta, A. Shelling, A. Muthukaruppan, A. Lasham, C. Blenkiron, G. Laking, and C. Print.
Predictive and prognostic molecular markers for cancer medicine. Ther. Adv. Med. Oncol., 2(2):125–148, 2010.
[137] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, 70:53–71, 2008.
[138] T. M. Mitchell. Machine learning. McGraw-Hill, 1997.
[139] F. Model, P. Adorjan, A. Olek, and C. Piepenbrock. Feature selection for dna methylation based cancer classification. Bioinformatics, (17):157–164, 2001.
[140] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. P. Mesirov, and T. Poggio. Support vector machine classification of microarray data. Massachusetts Institute of Technology, pages 59–60, 1999.
[141] Y. Murakami, T. Yasuda, and K. Saigo. Comprehensive analysis of microrna expression patterns in hepatocellular carcinoma and non-tumorous tissues. Oncogene, 25(17):2537–2545, 2006.
[142] D. V. Nguyen and D. M. Rocke. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 18:1216–1226, 2002.
[143] G. L. Nicolson, M. Y. Nasralla, K. De Meirleir, and J. Haier. Bacterial and viral co-infections in chronic fatigue syndrome (cfs/me) patients. Proc. Clinical Scientific Conference on Myalgic Encephalopathy/Chronic Fatigue Syndrome, pages 1–14, 2002.
[144] S. Niijima and S. Kuhara. Effective nearest neighbor methods for multiclass cancer classification using microarray data. pages 1–2.
[145] J. Novovicova, P. Pudil, and J. Kittler. Divergence based feature selection for multimodal class densities. IEEE Trans Pattern Anal Mach Intel, (18):218–223, 1996.
[146] M. C. O’Neil and L. Song. Neural network analysis of lymphoma microarray data: prognosis and diagnosis near-perfect. BMC Bioinformatics, 4(13):1–12, 2003.
[147] I. M. Ong, J. D. Glasner, and D. Page. Modeling regulatory pathways in e. coli from time series expression profiles. Bioinformatics, 18:241–248, 2002.
[148] C. H. Ooi and P. Tan.
Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics, (19):37–44, 2003.
[149] K. Owzar, W. T. Barry, S. H. Jung, I. Sohn, and S. L. George. Statistical challenges in preprocessing in microarray experiments in cancer. Clin Cancer Res, 14:5959–5966, 2008.
[150] S. Paik, S. Shak, G. Tang, C. Kim, J. Baker, M. Cronin, and F. L. Baehner et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N. Engl. J. Med., 351:2817–2826, 2004.
[151] G. Papachristoudis, S. Diplaris, and P. A. Mitkas. Sofocles: Feature filtering for microarray classification based on gene ontology. Journal of Biomedical Informatics, 43(1):1–14, 2010.
[152] M. Y. Park and T. Hastie. l1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society Series B, 69:659–677, 2007.
[153] C. M. Perou, T. Sorlie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, C. A. Rees, J. R. Pollack, D. T. Ross, H. Johnsen, L. A. Akslen, O. Fluge, A. Pergamenschikov, C. Williams, S. X. Zhu, P. E. Lonning, A. L. Borresen-Dale, P. O. Brown, and D. Botstein. Molecular portraits of human breast tumours. Nature, (406):747–752, 2000.
[154] D. Pinkel and D. G. Albertson. Array comparative genomic hybridization and its applications in cancer. Nat Genet, 37:11–17, 2005.
[155] J. Pittman, E. Huang, and H. Dressman. Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proc. Natl. Acad. Sci., 101(22):8431–8436, 2004.
[156] S. Pounds, X. Cao, C. Cheng, J. Yang, D. Campana, W. E. Evans, C. H. Pui, and M. V. Relling. Integrated analysis of pharmacokinetic, clinical, and snp microarray data using projection onto the most interesting statistical evidence with adaptive permutation testing. Bioinformatics and Biomedicine, IEEE International Conference on, pages 203–209, 2009.
[157] J. Qi and J. Tang.
Integrating gene ontology into discriminative powers of genes for feature selection in microarray data. Proceedings of the 2007 ACM symposium on Applied computing, pages 430–434, 2007.
[158] J. Quackenbush. Computational analysis of microarray data. Nature Reviews Genetics, 2(2):418–427, 2001.
[159] D. R. Rhodes, J. Yu, K. Shanker, N. Deshpande, R. Varambally, D. Ghosh, T. Barrette, A. Pandey, and A. M. Chinnaiyan. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. PNAS, 101:9309–9314, 2004.
[160] B. Z. Ring, R. S. Seitz, R. Beck, W. J. Shasteen, S. M. Tarr, M. C. U. Cheang, B. J. Yoder, G. T. Budd, T. O. Nielsen, D. G. Hicks, N. C. Estopinal, and D. Ross. Novel prognostic immunohistochemical biomarker panel for estrogen receptor-positive breast cancer. Jour. of Clin. Oncol., 24:3039–3047, 2006.
[161] J. S. Ross. Multigene classifiers, prognostic factors, and predictors of breast cancer clinical outcome. Adv. Anat. Pathol., 16(4):204–215, 2009.
[162] R. Schapire. Strength of weak learnability. Machine Learning, 5:197–227, 1990.
[163] M. Schmidt, D. Boehm, C. von Toerne, E. Steiner, A. Puhl, H. Pilch, H. A. Lehr, J. G. Hengstler, H. Koelbl, and M. Gehrmann. The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Research, 68(13):5405–5413, 2008.
[164] G. Schwarz. Estimating the dimension of a model. Ann. Stat., 6(2):461–464, 1978.
[165] H. Shatkay and R. Feldman. Mining the biomedical literature in the genomic era: an overview. J Comput Biol, 10(6):821–855, 2003.
[166] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown. Quantitative monitoring of gene expression patterns with a complementary dna microarray. Science, 270:467–470, 1995.
[167] R. Simon. Diagnostic and prognostic prediction using gene expression profiles in high-dimensional microarray data. British Journal of Cancer, 89:1599–1604, 2003.
[168] R. M. Simon.
Using dna microarrays for diagnostic and prognostic prediction. Expert Rev. Mol. Diagn., 3(5):587–595, 2003.
[169] R. M. Simon. Development and validation of biomarker classifiers for treatment selection. J. Stat. Plan. Inference, 138(2):308–320, 2008.
[170] A. K. Smith, P. D. White, E. Aslakson, U. Vollmer-Conna, and M. S. Rajeevan. Polymorphisms in genes regulating the hpa axis associated with empirically delineated classes of unexplained chronic fatigue. Pharmacogenomics, 7:387–394, 2006.
[171] C. Sotiriou, S. Y. Neo, L. M. McShane, E. L. Korn, P. M. Long, A. Jazaeri, P. Martiat, S. B. Fox, A. L. Harris, and E. T. Liu. Breast cancer classification and prognosis based on gene expression profiles from a population-based study. PNAS, 100:10393–10398, 2003.
[172] C. Sotiriou and L. Pusztai. Gene-expression signatures in breast cancer. N. Engl. J. Med., (360):790–800, 2009.
[173] I. Spasic, S. Ananiadou, J. McNaught, and A. Kumar. Text mining and ontologies in biomedicine: Making sense of raw text. Briefings in Bioinformatics, 6(3):239–251, 2005.
[174] S. Srivastava, L. Zhang, R. Jin, and C. Chan. A novel method incorporating gene ontology information for unsupervised clustering and feature selection. PLoS One, 3(12):1–8, 2008.
[175] F. J. T. Staal, M. van der Burg, L. F. A. Wessels, B. H. Barendregt, M. R. M. Baert, C. M. M. van den Burg, C. van Huffel, A. W. Langerak, V. H. J. van der Velden, M. J. T. Reinders, and J. J. M. van Dongen. Dna microarrays for comparison of gene expression profiles between diagnosis and relapse in precursor-b acute lymphoblastic leukemia: choice of technique and purification influence the identification of potential diagnostic markers. Leukemia, 17:1324–1332, 2003.
[176] A. Statnikov, L. Wang, and C. Aliferis. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics, 9:1–10, 2008.
[177] A. J. Stephenson, A. Smith, M. W. Kattan, J. Satagopan, V. E. Reuter, P. T.
Scardino, and W. L. Gerald. Integration of gene expression profiling and clinical variables to predict prostate carcinoma recurrence after radical prostatectomy. Cancer, 104(2):290–298, 2005.
[178] Y. Sun. Improved breast cancer prognosis through the combination of clinical and genetic markers. Bioinformatics, 23:30–37, 2007.
[179] Q. Tian, S. B. Stepaniants, M. Mao, L. Weng, M. C. Feetham, M. J. Doyle, E. C. Yi, H. Dai, V. Thorsson, and J. Eng. Integrated genomic and proteomic analyses of gene expression in mammalian cells. Mol Cell Proteomics, 3:960–969, 2004.
[180] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58:267–288, 1996.
[181] R. Tibshirani and B. Efron. Pre-validation and inference in microarrays. Statistical applications in genetics and molecular biology, 1:1–18, 2002.
[182] G. V. Trunk. A problem of dimensionality: A simple example. IEEE Trans Pattern Anal Mach Intel, pages 306–307, 1979.
[183] C. Truntzer, D. Maucort-Boulch, and P. Roy. Comparative optimism in models involving both classical clinical and gene expression information. BMC Bioinformatics, 9:1–10, 2008.
[184] K. N. Tsai, T. Y. Tsai, and C. M. Chen. A new approach to deciphering pathways of enterovirus 71-infected cells: an integration of microarray data, gene ontology, and pathway database. Biomed Eng Appl Basis Comm, 18(6):337–342, 2006.
[185] L. J. van’t Veer, H. Dai, and M. J. van de Vijver. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–536, 2002.
[186] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1998.
[187] V. E. Velculescu, L. Zhang, B. Vogelstein, and K. W. Kinzler. Serial analysis of gene expression. Science, 270:484–487, 1995.
[188] J. Šilhavá and P. Smrž. Gene ontology driven feature filtering from microarray data. In Znalosti, pages 263–266, Jindřichův Hradec, Czech Republic, 2010.
[189] B. A. Walker, P. E. Leone, M. W. Jenner, C. Li, D.
Gonzalez, D. C. Johnson, F. M. Ross, F. E. Davies, and G. J. Morgan. Integration of global snp-based mapping and expression arrays reveals key regions, mechanisms, and genes important in the pathogenesis of multiple myeloma. Blood, 108:1733–1743, 2006.
[190] D. Wang and A. Bakhai. Clinical trials: a practical guide to design, analysis, and reporting. Remedica, 2005.
[191] J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, and C. F. Chen. A new method to measure the semantic similarity of go terms. Bioinformatics, 23(10):1274–1281, 2007.
[192] Y. Wang, J. G. Klijn, Y. Zhang, A. M. Sieuwerts, M. P. Look, F. Yang, D. Talantov, M. Timmermans, M. E. Meijer van Gelder, J. Yu, T. Jatkoe, E. M. Berns, D. Atkins, and J. A. Foekens. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet., 365(9460):671–679, 2005.
[193] Y. Wang, D. J. Miller, and R. Clarke. Approaches to working in high-dimensional data spaces: gene expression microarrays. Brit Jour Canc, (98):1023–1028, 2008.
[194] S. H. Wei, C. Balch, and H. H. Paik. Prognostic dna methylation biomarkers in ovarian cancer. Clinical cancer research, 12:2788–2794, 2006.
[195] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, J. R. Marks, and J. R. Nevins. Predicting the clinical status of human breast cancer by using gene expression profiles. PNAS, 28:11462–11467, 2001.
[196] C. Wingren and C. A. Borrebaeck. Antibody microarrays: Current status and key technological advances. OMICS, 10:411–427, 2006.
[197] L. Xu. A novel statistical method for microarray integration: applications to cancer research. PhD thesis, Johns Hopkins University, Baltimore, Maryland, 2007.
[198] Y. Freund and R. E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, 1999.
[199] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz.
Optimizing classifier performance via an approximation to the wilcoxon-mann-whitney statistic. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), pages 1–8, 2003.
[200] T. P. Yang, T. Y. Chang, C. H. Lin, M. T. Hsu, and H. W. Wang. Arrayfusion: a web application for multi-dimensional analysis of cgh, snp and microarray data. Bioinformatics, 22(21):2697–2698, 2006.
[201] J. X. Yu, A. M. Sieuwerts, Y. Zhang, J. W. Martens, M. Smid, J. G. Klijn, Y. Wang, and J. A. Foekens. Pathway analysis of gene signatures predicting metastasis of node-negative primary breast cancer. BMC Cancer, 7(182):1–14, 2007.
[202] S. Yu, L. C. Tranchevent, B. De Moor, and Y. Moreau. Gene prioritization and clustering by multi-view text mining. BMC Bioinformatics, pages 1–22, 2010.
[203] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, 68:49–67, 2007.
[204] X. Zhou, K. Y. Liu, and S. T. Wong. Cancer classification and prediction using logistic regression with bayesian gene selection. J. Biomed. Inform., 37:249–259, 2004.
[205] J. Zhu and T. Hastie. Classification of gene microarrays by penalized logistic regression. Biostatistics, 5:427–443, 2004.
[206] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.
[207] V. Zuber and K. Strimmer. Gene ranking and biomarker discovery under correlation. Bioinformatics, 25(20):2700–2707, 2009.
[208] V. Zuber and K. Strimmer. High-dimensional regression and variable selection using car scores. Statistical Applications in Genetics and Molecular Biology, 10(1):1–27, 2011.