VYSOKÉ UČENÍ TECHNICKÉ V BRNĚ
BRNO UNIVERSITY OF TECHNOLOGY
FAKULTA INFORMAČNÍCH TECHNOLOGIÍ
ÚSTAV POČÍTAČOVÉ GRAFIKY A MULTIMÉDIÍ
FACULTY OF INFORMATION TECHNOLOGY
DEPARTMENT OF COMPUTER GRAPHICS AND MULTIMEDIA
Methods for class prediction with high-dimensional
gene expression data
Metody pro predikci s vysokodimenzionálními daty genových expresí
DISERTAČNÍ PRÁCE
DOCTORAL THESIS
AUTOR PRÁCE
Ing. Jana Šilhavá
AUTHOR
VEDOUCÍ PRÁCE
SUPERVISOR
Doc. RNDr. Pavel Smrž, Ph.D.
BRNO 2012
Abstract
This thesis deals with class prediction with high-dimensional gene expression data. During
the last decade, an increasing amount of genomic data has become available. Combining
gene expression data with other data can be useful in clinical management, where it can
improve the prediction of disease prognosis. The main part of this thesis is aimed at
combining gene expression data with clinical data. We use logistic regression models that
can be built through various regularized techniques. Generalized linear models enable us to
combine models with different data structures. It is shown that such a combination may
yield more accurate predictions than those based on gene expression or
clinical data alone. The suggested approaches are not computationally intensive. Evaluations
are performed with simulated data sets in different settings and then with real benchmark
data sets. The work also characterizes the additional predictive value of microarrays. The
thesis includes a comparison of selected features of gene expression classifiers built up
in five different breast cancer data sets. Finally, a feature selection that combines gene
expression data with gene ontology information is proposed.
Keywords
predictive classification, generalized linear models, boosting, logistic regression, elastic
net, model evaluation, high-dimensional data, combining of heterogeneous data, feature
selection, DNA microarray data, gene expression, clinical data, gene ontology
Bibliographic citation
Jana Šilhavá: Methods for Class Prediction with High-Dimensional Gene Expression Data,
Ph.D. thesis, Department of Computer Graphics and Multimedia, FIT BUT, Brno, CZ,
2012.
Abstract
This doctoral thesis deals with prediction with high-dimensional gene expression data. The amount of available genomic data has grown considerably over the last decade. Combining gene expression data with other data finds application in many areas. For example, in clinical cancer management it can contribute to a more accurate determination of disease prognosis. The main part of this thesis focuses on combining gene expression data and clinical data. We use logistic regression models built through various regularization techniques. Generalized linear models enable the combination of models with different data structures. The thesis shows that combining models of gene expression data and clinical data can lead to more accurate predictions than building a model from gene expression or clinical data alone. At the same time, the proposed procedures are not computationally intensive. Testing is performed first with simulated data sets in various settings and subsequently with real benchmark data. We also deal with determining the additional predictive value of microarray data. The thesis contains a comparison of the features selected by gene expression classifiers on five different breast cancer data sets. We also propose a feature selection procedure that combines gene expression data with knowledge from gene ontologies.
Keywords
predictive classification, generalized linear models, boosting, logistic regression, elastic net, model evaluation, high-dimensional data, combining of heterogeneous data, feature selection, DNA microarray data, gene expression, clinical data, gene ontologies
Bibliographic citation
Jana Šilhavá: Methods for Class Prediction with High-Dimensional Gene Expression Data, doctoral thesis, Department of Computer Graphics and Multimedia, FIT BUT, Brno, CZ, 2012.
Acknowledgments
I would like to thank my supervisor Pavel Smrž for his leadership, valuable suggestions
and support. I am also grateful to Jan Černocký and other people from the Department
of Computer Graphics and Multimedia at Brno University of Technology for enabling me
to write and finish my dissertation. I would also like to thank Kenneth Froehling for his
proof-reading. Moreover, special thanks go to my family and all my closest friends for
their patience and encouragement.
Declaration
I declare that I have written this doctoral thesis independently under the supervision of Doc. RNDr. Pavel Smrž, Ph.D. I have cited all the literary sources and publications from which I have drawn.
Brno, August 20, 2012
Contents
1 Introduction
  1.1 Motivation
  1.2 Original contributions of the thesis
  1.3 Structure of the thesis
2 Background
  2.1 Gene expression measuring
  2.2 Types of DNA microarrays and their production
  2.3 Microarray data pre-processing
  2.4 Microarray data mining
3 Class prediction
  3.1 Classifier performance
  3.2 Dimension reduction
    3.2.1 Feature selection
    3.2.2 Feature extraction
    3.2.3 Preference for simple models
    3.2.4 Regularization or shrinkage methods
    3.2.5 Internal validation
  3.3 Classifier construction
    3.3.1 Linear discriminant analysis
    3.3.2 Generalized linear models
    3.3.3 Support vector machines
    3.3.4 k-nearest neighbors
    3.3.5 Artificial neural networks
    3.3.6 Classification and regression trees and ensemble methods
    3.3.7 Bayesian networks
  3.4 Comparison between methods
4 Assessment of classifier performance
  4.1 Validation
  4.2 Performance measures
5 Data integration
  5.1 Categorization of methods
    5.1.1 Similar data types
    5.1.2 Heterogeneous data types
    5.1.3 Early integration
    5.1.4 Intermediate integration
    5.1.5 Late integration
    5.1.6 Serial integration
  5.2 Information resources
    5.2.1 Genomic variation data
    5.2.2 Gene expression data
    5.2.3 Proteomic data
    5.2.4 Interactomic data
    5.2.5 Textual data and ontologies
    5.2.6 Clinical data
    5.2.7 Other data
6 Data sets
  6.1 Real data sets
  6.2 Microarray data pre-processing
  6.3 Simulated data sets
7 Combining gene expression and clinical data
  7.1 Introduction
  7.2 Generalized linear models
    7.2.1 Estimation of parameters
  7.3 Logistic regression
  7.4 Boosting
    7.4.1 Functional gradient descent boosting algorithm
    7.4.2 Base procedure
    7.4.3 Loss function
  7.5 Combination of logistic regression and boosting
  7.6 Parameters setting
  7.7 Results
    7.7.1 Simulated data
    7.7.2 Breast cancer data
  7.8 Pre-validation
  7.9 Determination of weights for models
  7.10 Results
  7.11 Discussion
  7.12 Alternative regularized regression techniques
    7.12.1 Elastic net
    7.12.2 Combining gene expression, clinical and SNP data
  7.13 Execution times
  7.14 Conclusions
8 Additional predictive value of microarrays
  8.1 Introduction
  8.2 Determination of predictive value
  8.3 Results
  8.4 Conclusion
9 Breast cancer prognostic genes
  9.1 Introduction
  9.2 Gene selection methods
  9.3 Evaluation of selected genes
  9.4 Results
  9.5 Conclusions
10 Gene ontology feature selection
  10.1 Introduction
  10.2 Combining gene expression data with gene ontology
    10.2.1 Filtering with gene ontology
  10.3 Results
  10.4 Conclusion
11 Conclusions
List of Abbreviations
AIC      Akaike Information Criterion
ANOVA    ANalysis Of VAriance
AUC      Area Under the ROC Curve
ANN      Artificial Neural Network
BIC      Bayesian Information Criterion
BN       Bayesian Networks
BP       Biological Process
CC       Cellular Component
CDC      Centers for Disease Control
CFS      Chronic Fatigue Syndrome
CART     Classification And Regression Trees
CAT      Correlation-Adjusted t-scores
CGH      Comparative Genomic Hybridization
cDNA     complementary DeoxyriboNucleic Acid
cRNA     complementary RiboNucleic Acid
CWLLS    Component-Wise Linear Least Squares
CNV      Copy Number Variations
CV       Cross-Validation
CCD      Cyclical Coordinate Descent
DNA      DeoxyriboNucleic Acid
DLDA     Diagonal Linear Discriminant Analysis
EN       Elastic Net
ER       Estrogen Receptor
FLDA     Fisher's Linear Discriminant Analysis
FDA      Food and Drug Administration
FGD      Functional Gradient Descent
FNDR     False NonDiscovery Rate
GO       Gene Ontology
GAM      Generalized Additive Models
GLM      Generalized Linear Models
HER      Human Epidermal Growth factor Receptor
ICA      Independent Component Analysis
IWLS     Iteratively Weighted Least Squares
IRLS     Iteratively Reweighted Least Squares
kNN      k-Nearest Neighbors
LOOCV    Leave-One-Out Cross-Validation
LDA      Linear Discriminant Analysis
LR       Logistic Regression
MLE      Maximum Likelihood Estimate
mRNA     messenger RiboNucleic Acid
MDL      Minimum Description Length
MCCV     Monte-Carlo Cross-Validation
MF       Molecular Function
NCBI     National Center for Biotechnology Information
NLM      National Library of Medicine
PLS      Partial Least Squares
PCA      Principal Component Analysis
OMIM     Online Mendelian Inheritance in Man
OS       Overall Survival
PR       Progesterone Receptor
PFS      Progression-Free Survival
RF       Random Forests
RMA      Robust MultiArray
QC       Quality Control
SNP      Single Nucleotide Polymorphisms
SSM      Semantic Similarity Matrix
SVM      Support Vector Machines
Chapter 1
Introduction
1.1 Motivation
Increasing amounts of genomic data have become available in the last decade. Gene
expression data, genomic variation data or proteomic data provide examples of the results
produced by high throughput technologies. This thesis deals with gene expression data
coming from microarray analysis.
Microarrays offer insights into the genetic basis of various diseases. Take cancer research as an example. Cancer is thought to be primarily caused by random genetic alterations. Consequently, genomic data has been successfully applied to various cancer-related
problems (e.g. classifying tumors into subtypes and thus potentially improving the clinical
management of cancer [197]).
Microarray technology employs gene chips to measure the expression level of thousands
of genes simultaneously. Each chip characterizes the gene expression levels at a different
time point during the cell cycle. Medical studies compare the gene expression levels of
individual patients, one chip per case history. The large expense of microarray experiments
and the limited number of available patients in the experiments make the sample size
of individual microarray studies relatively small, usually less than 100. This contrasts
with the high number of variables (genes) in gene expression data, typically on the order
of 10,000. Class prediction with high-dimensional microarray data is, therefore, extremely
difficult and all approaches need to take into account possible classifier overfitting.
Furthermore, probe design and experimental conditions influence signal intensities and
sensitivities for many high-throughput technologies [158]. One has to pay special attention
to the high noise levels when analysing microarray data.
In order to overcome the above-mentioned problems and to increase the reliability of
the findings, the analysis can combine gene expression data with information from other
sources – proteomic and interactomic knowledge, clinical findings, medical ontologies,
disease-specific textual data, etc. Each of these distinct data types, although individually
incomplete, can provide valuable, partly independent and complementary information.
The goal of data integration is to obtain more precision, better accuracy, and greater
statistical power than any individual data set would provide. It can also offer a way
to cross-validate noisy data sets.
There are a number of challenges in the process of data integration. The challenges
may be of a conceptual, methodological, or practical nature and may relate to issues that
arise due to experimental, computational, or statistical complexities [86]. It is essential
to carefully consider strategies that best capture most information contained in each data
type before combining them. The data from different sources might have a different
quality depending on the experimental conditions. The data might also have different
informativeness even if their quality is good and reliable. The methods for data integration
in general and combining genomic data in particular form one of the most active topics
in current research.
Knowledge fusion, combining microarray data with information from additional
data sources to improve the prediction of disease outcome, also defines the goals of this
thesis. They can be summarized as follows:
• To summarize the current research on the knowledge integration approaches in the
microarray data analysis.
• To investigate the methods aiming at overcoming “the curse of high-dimensionality”
of the data resulting from genomic studies.
• To design, realize and validate classifiers combining gene expression data with information from additional data sources.
• To deal with the additional predictive value of high-dimensional microarray data in
the case when standard clinical predictors are already available.
1.2
Original contributions of the thesis
The original contributions of the thesis can be summarized as follows:
• The novel approaches that construct classifiers combining clinical variables with
microarray gene expression data.
• The two-step method combining logistic regression and boosting models that is able to
determine the additional predictive value of microarray data.
• The combination of logistic regression and boosting with pre-validation.
• The validation of the experiments based on combining gene expression, clinical and
single nucleotide polymorphism data (SNP) using generalized linear models and
regularization methods.
• The study that discusses biomarker discovery methods and compares selected features of gene expression classifiers applied to five benchmark breast cancer data
sets.
• The feature selection approach that integrates gene ontology information with microarray gene expression data.
1.3 Structure of the thesis
The rest of this thesis is divided into the following chapters:
• Chapter 2 briefly describes gene expression measuring, the main types of DNA
microarrays and their production. Then, microarray data analysis is introduced
together with key topics of microarray data mining.
• Chapter 3 is concerned with class prediction of microarray data. It analyses the
performance of a predictive classifier in a high-dimensional setting. It deals with approaches that can optimize classifier performance and cope with high-dimensionality
of data. It also gives an overview of the most popular class prediction methods in
the context of microarray gene expression data and a comparison of their features.
• Chapter 4 describes how classifier performance is estimated in this thesis. It describes the validation procedure and the performance measures.
• Chapter 5 categorizes data integration methods from different points of view.
Each category includes references to the published literature. It gives an overview
of information resources that have been combined with microarray gene expression
data as well.
• Chapter 6 introduces data sets used for the evaluation of experiments in this thesis.
This chapter also includes examples of microarray data pre-processing.
• Chapter 7 introduces a novel approach that constructs a classifier combining microarray gene expression data with clinical data. An extension of this approach is
described, which offers a pre-validation of models built with microarray and clinical
data followed by weight calculations. Evaluations are performed on several redundant and non-redundant simulated data sets as well as on four real benchmark data
sets. It includes a comparison with other methods and discussion. The next part of
the chapter is dedicated to alternative regularized regression techniques. An elastic
net is employed with high-dimensional data instead of boosting and is used in a joint
classifier with logistic regression. After that, classifiers combining gene expression,
clinical and SNP data are validated. A chronic fatigue syndrome (CFS) data set is
used for experiments. In this data set, the gene expression data is not the only
high-dimensional data; that is why regularization is not applied solely to the gene expression
data in these experiments.
• Chapter 8 describes a two-step method combining logistic regression and boosting
models in order to determine an additional predictive value of microarray data.
This method is evaluated and compared with another published method
designed for the same purpose.
• Chapter 9 compares selected features of gene expression classifiers built up in five
benchmark breast cancer data sets. The features are selected based on three feature
selection methods. This chapter includes an overview of breast cancer prognostic
and predictive biomarkers in clinical use with a discussion about published breast
cancer signatures and their discovery.
• Chapter 10 presents preliminary results of approaches using gene ontology (GO)
information with gene expression data. GO information is incorporated at the stage
of early integration in order to improve class prediction. Feature selections based on
GO are evaluated on a benchmark breast cancer data set.
• Chapter 11 concludes this thesis and outlines the most promising direction for
future research.
Chapter 2
Background
Understanding of data is essential for its analysis and mining. That is why this chapter
introduces the fundamentals of DNA microarrays. It briefly describes principles of gene
expression measuring and summarizes the main types of DNA microarrays and their production. Microarray data pre-processing as a part of data analysis is also introduced together
with key topics of microarray data mining.
Figure 2.1 illustrates the process of production and analysis of microarray gene expression data. Most of the activities are matters for biologists. This thesis falls into the
data analysis and data mining stage.
Figure 2.1: Microarray gene expression data analysis process: biological question, experimental design, microarray experiment, data analysis and data mining, and biological verification and interpretation.
2.1 Gene expression measuring
The central dogma of molecular biology [40] outlines that in synthesizing proteins, Deoxyribonucleic acid (DNA) is transcribed into messenger Ribonucleic acid (mRNA), which
is translated into protein. The principle behind microarrays is hybridization between two
DNA strands. Fluorescently labelled target sequences (present in the sample) bind to
complementary probe sequences (attached to the solid surface of a DNA microarray chip)
and generate a signal that depends on the strength of the hybridization determined by the
number of paired bases. A specialized scanner is used to measure the amount of hybridized
target at each probe, which is reported as gene expression levels.
2.2 Types of DNA microarrays and their production
Methods for the quantitative measurement of gene expression have been available to biologists since
the 1980s [16]. They were limited to examining a small number of genes at a time. A later
technique, serial analysis of gene expression (SAGE) [187], enabled biologists to quantify
the expression level of both known and unknown genes, but it was time-consuming. The
current DNA microarray technology has revolutionized basic biological research and laboratory investigations of patient material. It enables measuring the gene expression levels
of thousands of genes simultaneously under the same experimental conditions.
There are many different types of microarrays (called platforms) in use, but all can
be characterized by a high density and number of biomolecules fixed onto a well-defined
surface. DNA microarrays come in two main types of technical platforms. The first
cDNA microarray (developed in the Brown and Botstein labs at Stanford, 1995) [166] is
based on standard microscopic glass slides on which complementary DNAs (cDNAs) or
long oligonucleotides (typically 70-80 mers) have been spotted. The second high-density
oligonucleotide microarray [126] is based on photolithographic techniques to synthesize
25 mer oligonucleotides on a silicon wafer and constitutes the patented technology of
Affymetrix Inc. The production methods of the two types of DNA microarrays are different: cDNA microarrays can be produced relatively easily, but the high-density oligonucleotide microarrays need highly specialized production facilities [175]. A comparison of
cDNA microarrays and high-density oligonucleotide microarrays is shown in Figure 2.2.
For cDNA microarray experiments, two cell populations, for instance diseased and normal,
are isolated, RNA is extracted, and cDNA is made and used for transcription with Cy3
(green) or Cy5 (red) labeled nucleotides. The two labeled cRNA samples are mixed and
hybridized on a glass slide array, which is scanned by a specialized laser, followed by data
analysis. High-density oligonucleotide microarrays generate a gene expression profile of
one sample and, therefore, one color. RNA is extracted and cDNA is prepared. The cDNA
is used in transcription in order to generate complementary RNA (cRNA). After fragmentation, this cRNA is hybridized to microarrays, washed and stained, and subsequently
scanned on a specialized laser scanner.
2.3 Microarray data pre-processing
Raw numerical outputs from different microarray platforms need to be pre-processed.
Pre-processing steps can be divided into five parts: data import, background adjustment,
normalization, summarization and quality assessment, according to [74]. Data import methods are needed because data come in different formats and are often scattered across a
number of files or database tables from which they need to be extracted and organized.
Background adjustment is essential to reduce non-specific hybridization and noise.

Figure 2.2: Comparison of cDNA microarray and high-density oligonucleotide microarray procedures: target preparation (RNA extraction, conversion to cDNA, labeling with Cy3/Cy5 or fragmentation), array preparation, hybridization, washing, laser scanning, and data analysis yielding gene expression ratios or 'absolute' gene expression values (reproduced from [175]).

Normalization enables one to compare measurements from different array hybridizations. For
some platforms, summarization is needed because transcripts are represented by multiple
probes. Quality assessment can detect divergent measurements beyond the acceptable
level of random fluctuations. Additional steps that are typically performed include missing
value imputation and filtering (removal of genes that are not expressed). An overview
of pre-processing methods, including summary algorithms and quality control metrics for
microarrays, can be found in [149]. A practical example of benchmark microarray data
set pre-processing is given in Chapter 6. Pre-processed microarray data is usually transformed into gene expression matrices where rows form the expression patterns of genes
and columns represent the expression profiles of samples (experimental conditions) or
vice-versa (see Figure 2.2). The cells in a cDNA microarray matrix characterize log-ratios
of gene expression from two cell populations (labeled with Cy3 or Cy5), while the cells in
a high-density oligonucleotide microarray matrix characterize log-transformed intensities
of gene expression levels of one population.
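To make the normalization step concrete, here is a minimal sketch of quantile normalization, one widely used normalization technique (chosen purely for illustration; the thesis does not prescribe this particular method). It assumes NumPy, and the input intensities are synthetic:

```python
import numpy as np

def quantile_normalize(X):
    """Force every array (column) of a genes x arrays matrix to share one
    intensity distribution: replace each value by the rank-wise mean."""
    order = np.argsort(X, axis=0)                # per-array ordering of the genes
    reference = np.sort(X, axis=0).mean(axis=1)  # rank-wise mean distribution
    Xn = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        Xn[order[:, j], j] = reference           # map reference values back by rank
    return Xn

# synthetic 'raw' intensities for 1000 genes on 6 arrays, log2-transformed
raw = np.random.default_rng(1).gamma(2.0, 100.0, size=(1000, 6))
normalized = quantile_normalize(np.log2(raw))
```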
2.4 Microarray data mining
Microarrays offer insights into the genetic basis of many diseases. However, microarray
information extraction presents a challenging task for data mining. Most of the literature
sorts data mining with microarray gene expression data into three related categories:
class discovery, class comparison and class prediction. Class discovery refers to the process
of dividing patients, tumors or genes into classes (clusters) in the hope that they also have
a similar function, behavior or properties. For example, a class discovery can identify new
disease subtypes. The class comparison can be defined as a selection of genes whose expression is significantly diferent between conditions. These are called diferentially expressed
genes. For example, class comparison problem can detect disease-associated (signalling)
molecular markers (biomarkers) and their complex interactions. Scientists explore genegene interactions because they may play important roles in complex disease studies. The
goal of class prediction is to develop a function (classification rule) for accurately predicting class membership. For example, it can be a function for the prediction of disease
diagnosis or prognosis. Class prediction is the topic of this thesis. It is detailed in the
next chapter.
Chapter 3
Class prediction
In class prediction (or predictive classification), the data with class labels is available and
a classification algorithm learns from samples of this data with known class membership
(the training set) and establishes rules to classify new samples. Results are evaluated on
unseen data (the test set). Figure 3.1 shows a typical scenario of class prediction. Here,
the class labels can have two values: bad prognosis and good prognosis. In general, class
prediction can deal with a two-class (binary) or multi-class classification problem. N -class
problems can be modelled as N binary classification problems. This thesis is concerned
with binary classification.
Figure 3.1: Class prediction: bad-prognosis and good-prognosis training samples yield a classification rule that assigns a class (e.g. good prognosis) to new samples.
Notation for class prediction with microarray data
Let X be the p × n matrix containing the expression values of p features/genes and n samples. The response variable with ground-truth class labels is an n-dimensional vector
y ∈ {A, B} (A and B can denote poor and good prognosis). Similarly, ŷ ∈ {A, B} is the
response variable with predicted class labels.
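In code, this notation might look as follows; a minimal NumPy sketch with made-up shapes and a deliberately naive placeholder rule, intended only to fix the dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 10_000, 80                      # ~10,000 genes, fewer than 100 samples
X = rng.normal(size=(p, n))            # expression values of p genes for n samples
y = rng.choice(["A", "B"], size=n)     # ground-truth class labels

# a classifier produces predicted labels y_hat in {A, B}; here a placeholder
# rule that thresholds a single gene, just to illustrate the shapes involved
y_hat = np.where(X[0] > 0, "A", "B")
print((y_hat == y).mean())             # agreement with the true labels
```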
Examples of class prediction include a prediction of disease diagnosis and prognosis or
therapeutic responsiveness. An important interest in the area of prognostic prediction is
the prediction of the survival time of a patient [62]. Some authors dichotomize the survival
time and transform the problem into a classification problem [146]. Class prediction
objectives are:
• Identification of classification rules that perform well with the new data;
• Identification of characteristic genes related to disease progression for further investigation.
3.1 Classifier performance
As mentioned above, classification with high-dimensional microarray gene expression data
is a challenging task. Performance of a predictive classifier/model depends on sample
size, data dimensionality and model complexity [193]. The accuracy of learned classifiers
tends to decrease with high dimensions, a phenomenon called the ‘curse of dimensionality’
[51]. Trunk [182] illustrates the problem by the following example. Consider two equally
probable, normally distributed classes with common variance in each dimension. For the
feature/variable/gene indexed by n = 1, 2, 3, . . ., class 1 has a mean of √(1/n) and class 2 has a mean of −√(1/n). Thus, each additional feature has some class discrimination power, even if
it is decreasing as n increases. Trunk evaluated error rates for the Bayes decision rule,
applied as a function of n, when the variance is assumed to be known but the class means
are estimated based on a finite data set. He found that (1) the best test error was achieved
using a finite number of features; (2) using an infinite number of features, test error degrades to the accuracy of random guessing; and (3) the optimal dimensionality increases
with an increasing sample size. These observations are consistent with the ‘bias/variance
dilemma' [108]. Simple models may be biased but will have a low variance. More complex models have greater representation power (low bias) but will overfit to the particular
training set (high variance). Thus, the large variance associated with many features (including those with modest discrimination power) defeats any possible classification benefit
derived from these features (see Figure 3.2). With severe limits on available samples in
microarray studies, complex models using high-feature dimensions will severely overfit,
greatly compromising classification performance. Computational learning theory provides
bounds on generalization accuracy in terms of a classifier's capacity which is related to
model complexity [186]. Relevance of these bounds to the microarray gene expression data
is discussed in [14]. Unfortunately, training error is not a good estimate of test error, as it
does not properly account for model complexity. Figure 3.2 shows the typical behaviour
of the test error, as classifier complexity varies. Training error tends to decrease whenever
the classifier complexity is increased, that is, whenever the data is fitted harder. However, with
too much fitting, the model adapts itself too closely to the training data and will not generalize well (i.e., large test error).

Figure 3.2: A demonstration of the bias/variance dilemma in class prediction: expected error assessed on training and test samples as a function of classifier complexity, from the underfitting region (high bias, low variance) to the overfitting region (low bias, high variance) (adapted from [93], page 38).
There are dimension reduction approaches that can optimize classifier performance
and cope with the high-dimensionality of data. They form the topic of the following
section. Dimension reduction can take place before classifier construction (feature selection
or extraction), or it can be included in classifier construction (regularization/shrinkage
methods or preference for simple models). Figure 3.3 illustrates the pipeline of microarray
predictive classifier building with possible dimension reduction steps. The process of a class
predictor building from high-dimensional data usually consists of a dimension reduction,
classifier construction and validation. The dashed rectangle in the figure indicates that dimension
reduction can be included in the classifier construction part.

Figure 3.3: The process of a common class predictor building from high-dimensional data: data, feature selection/extraction, classifier construction, classifier validation, and error estimation.
3.2 Dimension reduction
There are approaches that can optimize classifier performance and cope with the high-dimensionality of data. Nevertheless, the use of these methods does not prevent overfitting.
The literature includes the following approaches:
• Feature selection;
• Feature extraction;
• Preference for simple models;
• Regularization or shrinkage methods;
• Internal validation.
3.2.1 Feature selection
Feature selection, also known as variable selection or gene selection, is the way of selecting a
subset of relevant features for building classifiers. A standard approach to feature selection
involves identification of the genes that are differentially expressed among the classes when
considered individually. This type of feature selection is a filter approach and can be
computed, for example, by t-test, Wilcoxon test or analysis of variance (ANOVA). The
genes that are significantly differentially expressed at a specified significance level are
selected for inclusion in the class predictor. One can choose only the k top-ranking genes
(where k < p) or the genes whose score exceeds a given threshold. The p-values are used
as a convenient index based on null-hypothesis testing for selecting genes. In microarray
data analysis, there is a problem with false positives which are genes that are found to
be statistically different between conditions, but are not in reality. Therefore, multiple
testing corrections adjust p-values to correct for the occurrence of false positives. An overview
of hypothesis testing methods can be found in [53].
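As an illustration of this filter approach, the following sketch (assuming NumPy/SciPy; the data is synthetic and the significance level and k are arbitrary) ranks genes by two-sample t-test p-values, applies a Benjamini-Hochberg adjustment, one common multiple testing correction, and keeps the k top-ranking genes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 60))                 # p genes x n samples
y = np.array([0] * 30 + [1] * 30)               # two classes

_, pvals = stats.ttest_ind(X[:, y == 0], X[:, y == 1], axis=1)

# Benjamini-Hochberg adjustment to limit false positives among selected genes
m = len(pvals)
order = np.argsort(pvals)
adj = np.minimum.accumulate((pvals[order] * m / np.arange(1, m + 1))[::-1])[::-1]
padj = np.empty(m)
padj[order] = np.minimum(adj, 1.0)

top_k = order[:50]                              # the k top-ranking genes (k < p)
selected = np.where(padj < 0.05)[0]             # or genes passing the adjusted level
```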
This type of feature selection has an important drawback: correlations and interactions with other genes are ignored. This may lead to missing aspects relevant for prediction.
The wrapper approach to feature selection uses classifier performance as the evaluation criterion.
It searches for features better suited to the mining algorithm, aiming to improve mining
performance. The wrapper feature selection is generally computationally intensive and
more difficult to set up than filter feature selection. An example of this approach is gene
selection using SVM [84]. Search algorithms that explore the space of the possible subset
of features define another feature selection approach. Model et al. [139] use the backward
selection procedure, whereas Bo et al. [23] select a pair of variables with a forward selection
scheme. Alternative approaches include, for example, genetic algorithms [148].
3.2.2 Feature extraction
Feature extraction is dimensionality reduction which transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear,
as in Principal component analysis (PCA) [195]. Principal components are the orthogonal linear combinations of the genes showing the greatest variability among the cases.
The principal components are sometimes referred to as singular values [15]. Dimension
reduction with PCA has two limitations. One is that the principal components are not
necessarily good predictors. The second problem is that measuring principal components
requires measuring expression of all genes which may not be desirable for clinical applications. Other methods are Independent component analysis (ICA) [124], Partial least
squares (PLS) [142], Linear discriminant analysis (LDA), Diagonal LDA (DLDA) or nonlinear methods, such as Non-linear principal manifolds [60]. A general drawback of feature
extraction compared to feature selection is worse interpretability of extracted features.
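A minimal PCA sketch for comparison (assuming scikit-learn, which expects a samples × genes matrix, i.e. the transpose of the p × n convention above; the data is synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))                 # n samples x p genes

pca = PCA(n_components=10)
Z = pca.fit_transform(X)                        # 60 x 10 matrix of component scores
print(pca.explained_variance_ratio_.round(3))   # variability captured per component
```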
3.2.3 Preference for simple models
Some methods, such as naive Bayes, Support vector machines (SVM) or DLDA can use
as many features as desired. These methods use simple models that restrict complexity.
Naive Bayes classifiers [138] assume features are conditionally independent of each
other given the class label; even simpler models can share parameters across classes
[145]. SVM avoid overfitting by finding a linear discriminant function (or generalized
linear discriminant) that maximizes the margin (the minimum distance of a sample point
to the decision boundary) [186]. Dudoit et al. [52] recommend preferring simple models
before using complex classifiers.
3.2.4 Regularization or shrinkage methods
Regularization or shrinkage methods involve additional information in order to prevent
overfitting. This information usually takes the form of a penalty for complexity, such as
restrictions for smoothness or bounds on the vector space. For example, the least-squares
method can be viewed as a very simple form of regularization. Ridge regression and the
lasso [180] are regularized versions of least squares regression using L2 and L1 penalties,
respectively. The regularization has been applied to generalized linear models [180]. In
recent years, there has been an enormous amount of research activity devoted to related
regularization methods: the grouped lasso [203, 137], where variables are included or excluded in groups; the elastic net [205] for correlated variables, which uses mixtures of L1
and L2 penalties; L1 regularization paths for generalized linear models [152]; regularization
14
3 Class prediction
paths for SVM [91], etc. From the regularization point of view, boosting [162] follows a
‘regularized path’ of classifier solutions as the boosting iterations proceed. More complex
solutions can overfit. Well-known classifier selection techniques include the Akaike information criterion (AIC), minimum description length (MDL) and the Bayesian information
criterion (BIC).
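For illustration, ridge, lasso and elastic net penalties for logistic regression are all available in scikit-learn; the sketch below uses synthetic data and arbitrary penalty strengths. Note how the L1 penalty drives most coefficients to exactly zero, which is what makes the lasso act as a feature selector:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2000))                 # n samples x p genes, p >> n
y = rng.integers(0, 2, size=80)

lasso = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
ridge = LogisticRegression(penalty="l2", C=0.1, solver="liblinear").fit(X, y)
enet = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, C=0.1,
                          solver="saga", max_iter=5000).fit(X, y)

# number of nonzero coefficients: sparse (L1), intermediate (EN), dense (L2)
print((lasso.coef_ != 0).sum(), (enet.coef_ != 0).sum(), (ridge.coef_ != 0).sum())
```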
3.2.5 Internal validation
Alternative methods of controlling overfitting not involving dimension reduction or regularization include internal validation. Internal validation means that the validation is
applied to the training set only, since parameter estimation is part of the training process.
The test set is there to judge the performance of the selected model. For example, leave-one-out cross-validation (LOOCV) or bootstrap sampling have been applied as internal
validation methods (see overview of validation methods in [27]).
3.3 Classifier construction
Classifier construction is influenced by the selection of class prediction method and the
setting of parameters. Class prediction based on microarray gene expression data was
introduced in Golub et al. [80]. They classified two types of acute leukemias by summing
weighted votes for each gene on the test data and looking at the sign of the sum. Since
that time, many prediction methods have been presented in the literature in the context of
microarray gene expression data. The most popular methods are listed below.
3.3.1 Linear discriminant analysis
Linear discriminant analysis (LDA) and the related Fisher’s linear discriminant analysis
(FLDA) [135] find a linear combination of features which characterize or separate two
or more classes of objects or events. The resulting combination may be used as a linear
classifier. Dudoit et al. [52] indicated that FLDA did not perform well unless the number
of selected genes was small relative to the number of samples. The reason is that there
are too many correlations to estimate and the method tends to be unstable and overfit
the data. Diagonal linear discriminant analysis (DLDA) [135] is a special case of FLDA in
which the correlation among genes is ignored. By ignoring correlations, one avoids having
to estimate many parameters and obtains a method that performs better when the number
of samples is small [167].
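A minimal DLDA sketch follows (an illustrative implementation of the idea with equal class priors, assuming NumPy; not code from the cited works): per-class means and pooled per-gene variances, with all gene-gene correlations ignored.

```python
import numpy as np

def dlda_fit(X, y):
    """X is n samples x p genes, y holds labels 0/1; only a diagonal
    covariance (per-gene variance) is estimated, correlations are ignored."""
    means = {c: X[y == c].mean(axis=0) for c in (0, 1)}
    resid = np.vstack([X[y == c] - means[c] for c in (0, 1)])
    return means, resid.var(axis=0) + 1e-8

def dlda_predict(X, means, var):
    # assign each sample to the class with the smaller variance-scaled distance
    d = np.stack([((X - means[c]) ** 2 / var).sum(axis=1) for c in (0, 1)], axis=1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
Xtr, ytr = rng.normal(size=(40, 500)), np.repeat([0, 1], 20)
Xtr[ytr == 1, :10] += 1.0                        # a few informative genes
means, var = dlda_fit(Xtr, ytr)
print((dlda_predict(Xtr, means, var) == ytr).mean())
```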
3.3.2 Generalized linear models
Generalized linear models (GLM) [134] are a large group of models for relating responses
to linear combinations of predictor variables. GLM employs an iteratively reweighted least
3.3 Classifier construction
15
squares method for maximum likelihood estimation of the model parameters. According
to [107], the GLM approach is attractive because: (1) it provides a general theoretical
framework for many commonly encountered statistical models; and (2) it simplifies the
implementation of these different models in software, since essentially the same algorithm
can be used for estimation, inference and assessing model adequacy for all GLM. For
example, logistic regression (LR) is an instance of GLM, which covers a large variety
of exponential-family models. LR cannot be used with high-dimensional data without a variable
reduction step. An extension of LR with high-dimensional gene expressions is described
in [58, 204, 205]. GLM can also be extended to generalized additive models (GAM) [94].
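The IRLS iteration can be sketched in a few lines for logistic regression. This is a minimal NumPy version assuming a low-dimensional design matrix; with p ≥ n the normal equations become singular, which is exactly why a variable reduction or regularization step is needed before LR can be used on gene expression data:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Maximum likelihood for logistic regression via iteratively reweighted
    least squares; X is n x p with an intercept column, y is in {0, 1}."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))           # logistic mean function
        W = np.maximum(mu * (1.0 - mu), 1e-10)    # IRLS weights
        z = eta + (y - mu) / W                    # working response
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(0.5 + X[:, 1])))).astype(float)
print(irls_logistic(X, y).round(2))               # roughly [0.5, 1.0, 0.0, 0.0]
```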
3.3.3 Support vector machines
Support vector machines (SVM) [186] construct the best separating hyperplane between
classes, locating this hyperplane so that it has a maximal margin (i.e. so that there is
maximal distance between the hyperplane and the nearest point of any of the classes).
They work in combination with the technique of kernels that automatically realizes a nonlinear mapping to a feature space. SVM have become a very popular option as a class
prediction method in microarrays among practitioners [33, 140], but they are a black box
that gives results that are difficult to interpret.
3.3.4 k-nearest neighbors
The k-nearest neighbors (kNN) method has also been used for gene expression data class
prediction. It is a simple method which classifies unlabeled samples based on their similarity with samples in the training set. Euclidean distance can be used to measure the
closeness between an unlabeled sample and samples in the training set. The class of an unlabeled sample is predicted by the majority vote among its k nearest neighbors. The number
k of neighbors used can be taken as fixed or optimized by cross-validation [20]. According
to [144], kNN has poor generalization performance compared to k discriminant adaptive
nearest neighbor (kDANN) [92] which selects a distance measure adaptively, making the
class conditional probabilities more homogeneous locally.
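A minimal kNN sketch with Euclidean distance and the number of neighbors k chosen by cross-validation (assuming scikit-learn; the data is synthetic):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(80, 200)), rng.integers(0, 2, size=80)

# majority vote among the k nearest training samples; k tuned by 5-fold CV
search = GridSearchCV(KNeighborsClassifier(metric="euclidean"),
                      {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5).fit(X, y)
print(search.best_params_)
```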
3.3.5 Artificial neural networks
According to [167], artificial neural networks (ANN) do not perform well because of hidden
nodes, nonlinear transfer functions and individual features as inputs. By contrast, Khan
et al. [115] reported an ANN that correctly classified all samples and identified the genes
most relevant to the classification. Their neural network used a linear transfer function
with no hidden layer and hence it was a linear classifier. The inputs to the ANN were the
first 10 principal components of the genes, that is, the 10 orthogonal linear combinations
of genes that accounted for most of the variability in gene expression among samples.
3.3.6 Classification and regression trees and ensemble methods
Classification and regression trees (CART) and ensemble methods have been applied to
microarray gene expression data as well [11]. CART [32] builds trees – formulates simple
if/then rules for binary recursive partitioning (splitting) of all the objects into smaller
subgroups. Each such step may give rise to new ‘branches’. According to Dudoit et
al. [52], CART tend to be unstable. Ensemble methods use aggregated (multiple) classifiers
to obtain better predictive performance.
There are two resampling-based aggregated classifiers that have been applied with
relative success to microarrays: bagging (bootstrap aggregating) [29] and boosting [162].
An example of bagging is Breiman’s random forests (RF) [31] that combines bootstrap,
decision trees and majority voting. Statnikov et al. [176] compared RF to SVM on
22 diagnostic and prognostic data sets. In this comparison, SVM outperformed RF. The
authors found that the linear decision functions of SVM may be less sensitive to the
choice of input parameters than RF.
Boosting is based on emphasizing the training instances that previous models misclassified. The best known implementation of boosting is AdaBoost [198]. While boosting
works well on a variety of different types of data, it is not suited to microarray gene expression data according to [52]. Dettling and Bühlmann [48] demonstrated with LogitBoost
a modification of boosting that becomes an accurate classifier in the context of gene expression data. They found that it gave lower error rates than the commonly used AdaBoost
algorithm and concluded that this was because LogitBoost was less sensitive to outliers.
3.3.7 Bayesian networks
Bayesian networks (BN) [95] have received a growing interest in recent years. BN is a
probabilistic graph-based model that consists of two parts [78]: (1) a directed acyclic graph
which is called the structure of the model and represents the relations of the variables; and
(2) local probability models which are the numerical parameters (conditional probabilities)
for a given network topology. The learning task can be separated into two subtasks: (1)
structure learning which is topology identification; and (2) parametric learning which is
the estimation of numerical parameters. An example of the application of BN to expression
pattern recognition in microarray data can be found in [72]. BN have found much use
as a representational tool for modeling gene-gene relationships and for gene
regulatory network analysis [147, 90, 72]. Helman et al. [96] and Gevaert et al. [78] applied
BN for the prediction of the prognosis in cancer. The major advantages of BN are in the
possibilities of the integration of diverse data by using the prior over model space [78, 77]
and in data interpretability. The major disadvantages of BN are the following [119]:
(1) BN are often not practical for large systems because they are either too costly to
perform or impossible, given the large number and combinations of variables; (2) there
is no universally accepted method for constructing a network from data; and (3) BN are
dependent on the quality of prior beliefs—BN is only as useful as this prior knowledge is
reliable.
3.4 Comparison between methods
Given the number and diversity of available methods, the choice of an adequate class prediction method depends on the needs of potential users. To help with this choice,
there are several comparison studies [52, 123, 176, 103]. Some of the studies are neutral,
whereas others aim at demonstrating the superiority of a particular method. Dudoit et al.
[52] made a comparison of more than 7 classification methods. Their main conclusion was
that simple classifiers such as DLDA and kNN performed remarkably well in comparison
to more sophisticated ones, such as RF. Lee et al. [123] extended the previous analysis including more methods (up to 21) and more data sets (7). They reached the conclusion that
no classifier is uniformly better than the others. They found better performance for more
complex methods. In Statnikov et al. [176], SVM outperformed RF. Huang et al. [103]
compared 5 methods including some regularization methods. They reached the conclusion
that no classifier is uniformly better than the others, but that RF performed slightly better. In any case, and whatever the chosen method is, the performance of most methods
depends on the quality of data to build the classifier (data acquisition and processing and
the sample size).
Generally, the comparison of prediction methods from published literature is difficult
because: (1) methods are evaluated on different data sets; (2) data sets are pre-processed
in different ways; (3) different evaluation schemata and error measures are applied; and (4)
methods are not always evaluated correctly. Dupuy and Simon [54] showed that in more
than half of a representative sample of past cancer research studies, inadequate statistical
validation was performed. They studied ninety research articles, three quarters of which
were published in journals with an impact factor greater than 6. In class prediction,
test data was used during classifier training. Variable selection was often not
considered as a step of classifier construction and was carried out using the whole
data set. Such mistakes in evaluation give biased and overoptimistic estimates of classifier
performance. Guidelines for good practice in microarray data analysis, including class
prediction, can be found in [168, 169].
Chapter 4
Assessment of classifier performance
4.1 Validation
There are several ways to assess classifier performance. Ideally, the constructed
classifier should be tested on an independent validation data set. This is usually not
possible because gene expression data sets do not have enough samples. According to [102],
it is not recommended to estimate performance based on a single learning data set and test
data set in high-dimensional settings with a limited number of samples, because results depend
highly on the chosen partition. Another possibility for assessing classifier performance
is to use a procedure which tests the classifier on data that was not used
for its construction, such as cross-validation (including the special case of leave-one-out cross-validation), Monte-Carlo cross-validation or bootstrap sampling. The detailed description
of validation strategies can be found for example in [27]. Alternatively, classifier performance
can be assessed via synthetic data with constructed ground-truth labels. However, this
approach may not capture the particular statistical characteristics of molecular profiles
from a given population.
In this thesis, Monte-Carlo cross-validation (MCCV), also called subsampling, is applied as a validation strategy. MCCV generates learning data sets by drawing
from the {1, · · · , n} samples randomly and without replacement. The
test data set consists of the remaining samples. The random splitting in a non-overlapping
learning and test data set is repeated k times and the classifiers' performances are averaged. Based on the recommendation in [27], the splitting ratio of the training and test data set
is set to 4 : 1. The number of iterations is set to k = 100, since estimates based on a larger
number of iterations are more reliable. This approach usually
leads to more stable results than standard cross-validation [27].
Figure 4.1 describes the validation procedure. First, k pairs of learning and test data
sets are generated based on a validation method and microarray gene expression data with
class labels. For each of the k iterations:
• A classifier is constructed based on the prediction method and the current training data set. A feature selection/extraction procedure can be included in this step.
• A response is predicted using the constructed classifier and the corresponding test data set.
Then, classifier performance is estimated with some performance measure.
Figure 4.1: Schema of microarray gene expression data validation: the data is repeatedly split into training and test sets; features are selected on the training set, a classifier with estimated parameters is built on the reduced training set, and its response on the reduced test set yields the error estimate.
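A minimal sketch of the MCCV splitting described above (assuming NumPy; the 4 : 1 ratio and k = 100 follow the text, everything else is illustrative). Feature selection or extraction must be redone inside each iteration, on the training part only, to avoid the evaluation mistakes discussed in Section 3.4:

```python
import numpy as np

def mccv_splits(n, k=100, test_frac=0.2, seed=0):
    """Monte-Carlo cross-validation: k random, non-overlapping 4:1 splits
    drawn without replacement from the n available samples."""
    rng = np.random.default_rng(seed)
    n_test = int(round(n * test_frac))
    for _ in range(k):
        perm = rng.permutation(n)
        yield perm[n_test:], perm[:n_test]        # (training, test) indices

# usage: fit the classifier (including feature selection) on train_idx only,
# predict on test_idx, and average the k performance estimates
for train_idx, test_idx in mccv_splits(n=100):
    pass  # construct classifier, predict response, record performance
```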
4.2 Performance measures
The response usually consists of predicted class probabilities. Figure 4.2a shows an example of the response distribution of a binary classifier. Depending on the threshold t,
the classifier yields different counts of true positives (TP), true negatives
(TN), false positives (FP) (type I errors) and false negatives (FN) (type II errors). Error
(P(Ŷ ≠ Y)) and accuracy (P(Ŷ = Y)) are performance measures based on a threshold.
Most reported prediction performance measures are based on user-defined thresholds for
a single operating point, such as: error rate, which uses a simple threshold t = 0.5; or
equal error rate, which is defined at the point where false positive and false negative
rates are equal. A more meaningful estimate that reveals the trade-off between the true positive rate
(P(Ŷ = A|Y = A)) and false positive rate (P(Ŷ = A|Y = B)) at different thresholds is the
receiver operating characteristic (ROC) curve. Unfortunately, ROC is a plot, and is not
a single value. A quantitative summary of a ROC curve and a commonly used measure
of diagnostic performance is the area under the ROC curve (AUC) (Figure 4.2b), which is
used as a classifier performance measure in this thesis.
Figure 4.2: Performance of a binary classifier: (a) response distributions of the good- and poor-prognosis classes with a threshold separating TP (hits), TN (rejections), FP (false alarms) and FN (misses); (b) the area under the ROC curve (true positive rate versus false positive rate).
AUC is useful in that it aggregates performance across the entire range of thresholds. According to Bradley [28], AUC is a statistically consistent and more discriminating
measure than accuracy. Bradley's paper demonstrates several desirable AUC properties
compared to accuracy. AUC is estimated with Wilcoxon-Mann-Whitney (WMW) statistics [132]. The ROC curve of a finite set of samples is based on a step function. Yan et
al. [199] demonstrate that the AUC is exactly equal to the normalized WMW statistic in
the form:
U=
Pm−1 Pl−1
j=0 I(ai , bj )
i=0
ml
,
(4.1)
where:
I(ai , bj ) =
(
1:
ai > bj
0 : otherwise.
(4.2)
WMW statistics [132] are based on pairwise comparisons between a sample ai , i =
0, · · · , m − 1, of random variable A and the sample bj , j = 0, · · · , l − 1, of random variable
B. If {a0 , a1 , · · · , am−1 } is identified as the classifier outputs for m examples of class A,
and {b0 , b1 , · · · , bl−1 } is identified as the classifier outputs for l examples of class B, the
AUC for this classifier is obtained via (4.1).
The overall AUC is estimated as the mean of the AUCs over the k MCCV iterations:

\[ \mu_U = \frac{1}{k} \sum_{i=1}^{k} U_i, \tag{4.3} \]

with the standard deviation given as:

\[ \sigma_U = \sqrt{\frac{1}{k} \sum_{i=1}^{k} (U_i - \mu_U)^2}. \tag{4.4} \]

Other performance measures, such as the error rate or accuracy evaluated over the k MCCV iterations, can be estimated analogously.
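Equations (4.1)-(4.4) translate directly into R. The following sketch, with hypothetical output vectors a and b for classes A and B, computes the AUC as the normalized WMW statistic and aggregates it over MCCV iterations:

  # AUC as the normalized WMW statistic, equations (4.1)-(4.2)
  auc.wmw <- function(a, b) mean(outer(a, b, ">"))  # fraction of pairs with a_i > b_j

  a <- c(0.9, 0.8, 0.6)  # classifier outputs for class A (hypothetical)
  b <- c(0.7, 0.4, 0.3)  # classifier outputs for class B (hypothetical)
  auc.wmw(a, b)

  # overall AUC over k MCCV iterations, equations (4.3)-(4.4)
  U       <- c(0.71, 0.69, 0.74)       # per-iteration AUCs (hypothetical)
  mu.U    <- mean(U)
  sigma.U <- sqrt(mean((U - mu.U)^2))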
Chapter 5
Data integration
The concept of data integration is not well defined in the literature and it may mean
different things. In this thesis, data integration is defined as the process of combining
data from different sources. Microarray gene expression data are combined with data
from other sources (e.g. clinical variables, genomic variation data, interactomic data, etc.)
to improve predictive classification.
The first part of this chapter (Section 5.1) categorizes data integration methods from different points of view. The second part (Section 5.2) discusses different information
resources that can be combined with microarray gene expression data. Both parts include
examples from the published literature.
5.1 Categorization of methods
Data integration methods can be categorized by the type of data (integration of similar data types or of heterogeneous data types) or by the stage of integration (early, intermediate, late and serial integration; see Figure 5.1).
Figure 5.1: Stages of data integration: early integration (input-based, at the data level), intermediate integration (kernel-based, at the feature level) and late integration (decision-based, at the response level).

5.1.1 Similar data types
Similar data types arise from the same underlying source; that is, they are all gene expression, single nucleotide polymorphism (SNP), protein, copy number variation (CNV), clinical data, etc. A simple merger of available data of the same type is usually not applicable due to differences in technologies, platforms, etc. A more detailed description of integrating microarray gene expression data of similar types, with some examples from the published literature, can be found in Section 5.2.2.
5.1.2 Heterogeneous data types
The integration of heterogeneous data types involves two or more different data sources. It is a challenging task because the data can be very diverse. This category is the topic of this thesis. Details on combining data sources – gene expression with SNP, CNV, clinical, protein, etc. – are given in Section 5.2.
5.1.3 Early integration
Early integration combines data at the input level. For example, data from different
studies, experiments or labs can be merged in order to increase the sample size. Jiang et
al. [110] combined two lung cancer studies to distinguish diseased from normal samples
more precisely. Aerts et al. [9] generated separate prioritizations of candidate disease genes for multiple heterogeneous data sources, based on their similarity to known genes, and then integrated, or fused, them into a global ranking using order statistics.
5.1.4 Intermediate integration
In intermediate integration, each data source is first transformed into another format and then the transformed sources are combined. For example, data can be converted into similarity matrices, such as covariance or correlation matrices, which are then combined before building a classifier for better prediction. The similarity matrices are also called 'kernel matrices' and the methods are known as 'kernel-based'. Lanckriet et al. [121] used a kernel-based SVM method in order to recognize particular classes of proteins – membrane proteins and ribosomal proteins – integrating amino acid sequences, gene expression data and protein-protein interactions; the method finds a classification rule and corresponding weights for each data set. Daemen et al. [43] used a kernel-based least squares SVM to combine microarray gene expression and proteomics data in order to predict outcomes in rectal and prostate cancers.
5.1.5 Late integration
Late integration (sometimes called decision integration) combines final statistical results
from different studies. Data from different sources are learned separately and separate
models are built. Then, the predictions from the separate models for the outcome are
fused. Ioannidis et al. [104] integrated two Affymetrix and one Illumina SNP data sets.
The odds ratio was first calculated for each SNP in each study and a random effects model
was used to combine the odds ratios. Gevaert et al. [77] integrated clinical and microarray data with a strategy based on Bayesian networks in order to classify breast cancer patients, combining the outcome probability of a clinical model with the outcome probability of a microarray model.
5.1.6 Serial integration
Serial integration can use and combine the foregoing data integration schemata. In serial integration, different sources are processed sequentially: the outcomes of one stage represent the inputs of the next analysis phase. For example, a serial combination of clinical and protein expression data was implemented in [117] to predict the early mortality of patients undergoing kidney dialysis.
5.2 Information resources
Data integration methods can fall into categories depending on the data source that the
methods combine with microarray gene expression data. In the computational biology
and bioinformatics community, the ‘omics’ suffix is used for various bioinformatic data.
‘Omics’ include genomics (the quantitative study of genes, regulatory and non-coding sequences), transcriptomics (RNA and gene expression), proteomics (protein expression),
metabolomics (metabolites and metabolic networks), interactomics (interactions and relationships among genes, proteins and metabolites), etc. As pointed out in [76], it is not
known at present which ‘omics’ has the most disease outcome related information. Owing
to the lack of comprehensive studies, validation studies are required to verify which omics
has the most information and whether a combination of omics data improves predictive
performance.
5.2.1 Genomic variation data
Variations in the DNA sequences of humans can affect how humans develop diseases and
respond to pathogens, chemicals, drugs, vaccines and other agents. Analysis of human
single nucleotide polymorphisms (SNP) [118] has led to the identification of interesting
SNP markers for certain disorders. Copy number variations (CNV) [114] – gain or loss of
segments of genomic DNA relative to a reference – have also been shown to be associated
26
5 Data integration
with several complex and common disorders [154, 67]. Using array-based comparative
genomic hybridization (CGH) techniques [154], CNV at multiple loci can be assessed
simultaneously allowing for their identification and characterization. CNV microarrays
allow exploration of the genome for sources of variability beyond SNP that could explain
the strong genetic component of several disorders. Table 5.1 shows examples of public
genomic variation databases. Combinations of microarrays with specific types of genomic
variation data are described in [156, 189, 17, 200]. Pounds et al. [156] used projection
methods and an efficient permutation-testing algorithm. Walker et al. [189] mapped SNP
into expression data in order to reduce the complexity of standard microarrays and identified candidate genes in multiple myeloma. Andrews et al. [17] refined a gene expression
signature of breast cancer metastasis with CNV. Yang et al. [200] introduced a web application for multi-dimensional analysis of CGH, SNP and microarray data which supports
the interpretation of microarrays.
Source            Description                   URL
dbSNP             SNPs                          http://www.ncbi.nlm.nih.gov/snp
Progenetix        SNPs, CGH                     http://www.progenetix.net/
COSMIC            Somatic mutations in cancer   http://www.sanger.ac.uk/genetics/CGP/cosmic/
The CNV Project   CNV                           http://www.sanger.ac.uk/humgen/cnv/

Table 5.1: Examples of genomic variation databases.
5.2.2 Gene expression data
Although data sets in public databases usually have to meet the criteria of standards for
biological research data quality, annotation and exchange (see MGED standards [19]),
there are differences in the type of microarray used, gene nomenclatures, species, and
analytical methods. One way to increase the quality of information and statistical power of microarray data is meta-analysis [36]. Microarray meta-analysis combines a
diverse collection of microarray data sets to assess the intersection of multiple gene expression signatures. This approach provides a precise view of genes, while simultaneously
allowing for differences between laboratories. For example, Rhodes et al. [159] collected
and analyzed 40 published cancer microarray data sets in order to characterize a common
transcriptional profile of cancer progression. Ma et al. [130] proposed a regularized gene
selection method that is applied to multiple pancreatic and liver cancer data sets.
An alternative type of gene expression data is microRNA [128]. MicroRNA expression profiles
had better classification performance compared with mRNA on the same samples in [76].
MicroRNA profiles have also shown promise in predicting the prognosis of cancer [105, 141].
Source         Description                                                      URL
ArrayExpress   Functional genomics experiments including gene expression data  http://www.ebi.ac.uk/microarray-as/ae/
GEO            Gene Expression Omnibus, functional genomics data               http://www.ncbi.nlm.nih.gov/geo/
microRNA.org   microRNA targets and expression data                            http://www.microrna.org/

Table 5.2: Examples of gene expression databases.
Examples of public gene expression databases are given in Table 5.2.
5.2.3 Proteomic data
Proteomic technology based on mass spectrometry [47] can have a significant impact on
the outcome prediction of cancer patients, especially when taking into account its ability
to detect post-translational modifications [45, 112]. Another approach in studying the
proteome uses protein microarrays (also called antibody microarrays) [196]. Whereas
mass spectrometry-based proteomics is considered to be an unbiased approach, protein/antibody microarrays represent an approach focused on studying the proteome quantitatively [131]. A number of studies have considered whether changes in mRNA concentration are reflected by similar changes in protein abundance, e.g., [179, 37, 85]. Poor
correspondence has been typically reported between transcript and protein levels, and in
some cases, little or no correlation has been found at all [85]. Bitton et al. [22] combined
protein mass spectrometry with Affymetrix Exon array data at the level of individual exons and found significantly higher degrees of correlation than had been previously observed
in a number of studies. Daemen et al. [42] demonstrated that combining microarrays and
proteomics data sources improves the predictive power. They used a kernel-based method
with Least Squares Support Vector Machines to predict rectal cancer regression grades.
Table 5.3 shows examples of public proteomic databases.
Source   Description                                                         URL
OPD      Open Proteomics Database, mass spectrometry data                    http://bioinformatics.icmb.utexas.edu/OPD/
GeMDBJ   Genome Medicine Database of Japan, proteomic data in cancer         https://gemdbj.nibio.go.jp/dgdb/DigeTop.do
HPR      The Human Protein Atlas, expression and localization of proteins    http://www.proteinatlas.org/

Table 5.3: Examples of proteomic databases.

5.2.4 Interactomic data
Interactomic data represent interactions and relationships among biological system components – genes, proteins, small molecules, diseases. These relations form networks or pathways that can be further analysed; given the complex nature of biological systems, such networks can be large-scale. A network typically consists of a set of nodes and edges, which represent the biological system components and the interactions between them, respectively. Table 5.4 shows examples of public interactomic databases.
Source    Description                                                      URL
BioGRID   Datasets of physical and genetic interactions                    http://thebiogrid.org/
HPRD      The Human Protein Reference Database, interaction networks
          and disease association for each protein in the human proteome  www.hprd.org
KEGG      Kyoto Encyclopedia of Genes and Genomes, molecular
          interaction and reaction networks                                www.genome.jp/kegg/

Table 5.4: Examples of interactomic databases.

Lee et al. [122] proposed a classification method based
on a mapping of gene expression data onto different biological pathways. The investigated
pathways were obtained from metabolic and signaling pathways derived from manually
annotated databases. For each pathway, they mapped the expression values of each gene
onto its corresponding gene (protein) in the pathway and searched for a subset of genes
that could be used to differentiate between the samples of the phenotypes investigated.
Chuang et al. [39] applied a protein-network-based approach that identified markers not
as individual genes, but as subnetworks extracted from protein interaction databases. In
order to integrate the expression and network data sets, they overlaid the expression values of each gene on its corresponding protein in the network and searched for subnetworks
whose activities across the patients were highly discriminative of metastasis. This method
achieved higher accuracy in the classification of metastatic versus non-metastatic tumors.
Ergun et al. [63] implemented an algorithm that inferred a global network of gene regulatory interactions from a training gene expression data set covering diverse biological processes and pathologies. This network was then used to detect genes in a test data set that appeared to be altered in a specific phenotype, for example a disease class.
5.2.5 Textual data and ontologies
Text mining helps biologists to collect disease-gene associations automatically from large volumes of biological literature. Currently, the most important resource for biomedical text mining applications is the MEDLINE literature repository [7], developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM). MEDLINE covers all aspects of biology, chemistry and medicine, and there is almost no limit to the types of information that may be recovered through careful and exhaustive mining [165].

Ontologies are conceptual models that provide the framework for semantic representation of information [83]. Ontologies usually support two types of relations, the 'is-a' and the 'part-of' relation. Bio-ontologies also support relations specific to the biomedical domain (e.g. 'has-location', 'clinically-associated-with', 'has-manifestation' [173]). Table 5.5 shows examples of bio-ontology resources.

Source   Description                                                                 URL
GO       Gene Ontology, descriptions of gene and gene-product attributes             www.geneontology.org/
HPO      Human Phenotype Ontology, descriptions of individual phenotypic anomalies   www.human-phenotype-ontology.org/
eVOC     eVOC Ontology, descriptions of genome sequence and expression phenotype     www.evocontology.org/

Table 5.5: Examples of bio-ontology resources.

Gene Ontology (GO) [89] is a widely used bio-ontology resource which can be employed in the analysis of relationships among genes. Genes and their products in different organisms are annotated with a common terminology by GO collaborators. GO is divided into three ontologies: (a) biological process (BP), (b) molecular function (MF), and (c) cellular component (CC). The three ontologies are represented as directed acyclic graphs in which nodes correspond to terms and edges represent their relationships.
GO is utilized in many tools that provide methods for functional annotation and interpretation of gene lists derived from microarray experiments (see [5]). For example,
methods that combine microarray gene expression data with GO for the purpose of predictive classification are described in [157, 151, 38]. All the examples employ GO for
feature selection or extraction. Cho et al. [38] and Qi et al. [157] searched for expression
correlations, while Papa et al. [151] focused on elimination of highly correlated genes.
Gevaert et al. [78] integrated information from literature abstracts with gene expression
data using Bayesian network models to predict the outcome of cancer patients. Yu et
al. [202] retrieved biomedical knowledge using nine bio-ontologies and text information
from MEDLINE titles and abstracts to obtain the precise identification of disease relevant genes for gene prioritization and clustering. A summary of ontology and text-mining
methods in biomedicine can be found in [173].
5.2.6 Clinical data
Clinical data forms the basis of a doctor's diagnostic decision making. It includes, for example, patient history, laboratory analyses and medications. Clinical data is often available
for the microarray data set (see Table 5.2 for possible resources). In general, microarray
predictors are much more difficult and expensive to collect than clinical ones. There are
several studies comparing predictive power of microarrays with clinical data [26, 56, 181].
Methods that combine microarrays with clinical data to construct a joint predictor are
described, e.g., in [77, 155, 178, 177]. Boulesteix et al. [26] proposed an approach based on
RF and PLS dimension reduction in order to provide a combined classifier and compare
the predictive power of microarray and clinical data. Gevaert et al. [77] based the clinical
and microarray data integration on BN which included structure and parameter learning.
Sun et al. [178] used the I-RELIEF wrapper feature selection method with LDA as the class prediction method. Pittman et al. [155] proposed a method based on statistical classification tree models that evaluate the contributions of multiple forms of data, both clinical
and genomic, to predict patient outcomes. They defined metagenes, which characterize
dominant common expression patterns within clusters of genes, and combined them with
traditional clinical risk factors.
5.2.7 Other data
Another kind of information that has been used in combination with microarray gene expression data in cancer research is DNA methylation, a type of chemical modification of DNA that influences cancer outcome [76]. It involves the binding of a methyl group to CpG islands in the genome [120]. CpG islands are often found in the regulatory regions of genes and are often associated with transcriptional inactivation. Most studies focus on using DNA methylation for the early detection of cancer [120]; however, its use in predicting prognosis has also been shown [194, 13].
Other information sources can be disease-specific databases, such as a database of
human genes and genetic disorders – Online Mendelian Inheritance in Man (OMIM) [8]
or the database of genetic association studies performed on Alzheimer’s disease – AlzGene [1]. For example, Tsai et al. [184] integrated microarray expression data and OMIM
information to investigate a connection of enterovirus 71 with neurological diseases.
Chapter 6
Data sets
Six publicly available real data sets and several simulated data sets are used for the
evaluation of experiments in this thesis. These sets are introduced in the following sections.
6.1 Real data sets
van’t Veer data set
van't Veer et al. [185] classified breast cancer patients after curative resection with respect to the risk of tumor recurrence. The set includes gene expression data and clinical variables. cDNA Agilent microarray technology was used to measure the expression levels of 22,483 genes for 78 breast cancer patients. Forty-four patients, classified into the good prognosis group, did not suffer from a recurrence during the first five years after resection; the remaining 34 patients belong to the poor prognosis group. The data set is prepared as described in [185] and is included in the R package 'DENMARKLAB' [68]. Only genes that show a two-fold differential expression and a p-value for being expressed < 0.01 in more than five samples are retained, which results in a set of 4,348 genes. The clinical variables are age, tumor grade, estrogen receptor status, progesterone receptor status, tumor size and angioinvasion.
Pittman data set
This data set was introduced by Pittman et al. [155]. Gene expression data was prepared with Affymetrix Human U95Av2 GeneChips. The data set gives the expression levels of 12,625 genes for 158 breast cancer patients. Regarding recurrence of the disease, 63 of the patients are classified into the poor prognosis group, and the remaining 95 patients belong to the good prognosis group. The data was pre-processed using the packages 'affy' and 'genefilter' to normalize and filter the values. Genes that showed a low variability across all samples were removed. The resulting data set includes 8,961 genes. The clinical variables are age, lymph node status, estrogen receptor status, family history, tumor grade and tumor size.
Wang data set
The data set presented by Wang et al. [192] comprises the expression levels of 22,283 genes for 286 lymph-node-negative breast cancer patients. Based on relapse, 107 of these patients developed distant metastases within five years and were classified into the poor prognosis group; the remaining 179 patients belong to the good prognosis group. Gene expression data was analysed with Affymetrix Human U133a GeneChips. The data was pre-processed using the packages 'affy' and 'genefilter'; genes that showed a low variability across all samples were removed. The resulting data set includes 22,260 genes. The clinical variables are estrogen receptor status and lymph node status. However, the lymph node status is negative for all patients.
Mainz data set
The Mainz data set was published by Schmidt et al. [163]. The gene expression data gives the expression levels of 22,283 genes for 200 lymph-node-negative breast cancer patients. We considered distant metastasis as the response. Based on distant metastases after five years, 46 of these patients were classified into the poor prognosis group, and the remaining 154 patients belong to the good prognosis group. Gene expression data was analysed with Affymetrix Human U133a GeneChips. The data set was prepared as described in [163] and is included in the R package 'breastCancerMAINZ' [6]. The available clinical variables are the age at diagnosis, estrogen receptor status, tumor size and tumor grade.
Sotiriou data set
The data set was presented by Sotiriou et al. [171]. Gene expression data was prepared with cDNA microarray chips. The authors' web site provides both raw and pre-processed gene expression data. The pre-processed data gives the expression levels of 6,860 genes for 99 breast cancer patients. According to relapse, 45 of the patients are classified into the poor prognosis group and 54 of the patients belong to the good prognosis group. This data set contains missing values; their occurrence is equal to 2.17%. After filtering out all rows with missing values, the data set includes 4,246 expression levels.
Alternatively, fewer rows could have been filtered out, with a missing-data imputation method used for the rest. We evaluated this alternative. The missing values were summed in each row and only the rows that had more than 5% of missing values were filtered out (see Figure 6.1a); the line in the figure denotes this threshold. After this filtering, the whole microarray data set included 0.54% missing values. We then used a sequential kNN imputation method [113], which estimates missing values from the gene
Figure 6.1: Missing values: (a) Sotiriou data set – gene expression data; (b) CFS data set – SNP data. The circles denote the sums of missing values in each row; the red line denotes the 5% threshold above which rows were filtered out.
that has the least missing rate in microarray data, using a weighted mean of k nearest
neighbors (k = 5). The resulting data set had 6,336 expression levels.
We compared this imputed data set with the data set where all rows containing missing values were filtered out and evaluated the binary outcome prediction for both, each over 100 evaluations. The first data set, with 4,246 expression levels, provided results 6% better on average than the second data set with 6,336 expression levels across various gene selection settings. That is why we preferred the data set with all missing values filtered out for further experiments.
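In R, the row-wise filtering and the imputation can be sketched as follows. Note that the thesis uses the sequential kNN method of [113]; this sketch substitutes the standard kNN imputation from the Bioconductor 'impute' package and assumes a hypothetical genes x samples matrix expr:

  # keep only rows (genes) with at most 5% missing values
  keep   <- rowMeans(is.na(expr)) <= 0.05
  expr.f <- expr[keep, ]

  # impute the remaining missing values with a weighted mean of k = 5 neighbours
  library(impute)                             # Bioconductor package
  expr.imp <- impute.knn(expr.f, k = 5)$data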
The clinical variables are age, lymph node status, estrogen receptor status, tumor grade and tumor size. The supplementary material on the authors' web site describes the clinical variables in more detail.
CFS data set
The chronic fatigue syndrome (CFS) data set comes from a four-year longitudinal study
conducted by the Centers for Disease Control (CDC) [3]. It contains gene expression, clinical, SNP and proteomic data. It was used as a conference contest data set in CAMDA 2006
and CAMDA 2007. The CFS data set with a detailed description is publicly available on
the CAMDA website [2].
cDNA gene expression data was analysed with the MWG Biotech platform. The SNP data consist of 36 autosomal SNPs that the CDC had previously selected from eight candidate CFS genes: TPH2 (SNPs selected from locus 12q21), POMC (2p24), NR3C1 (5q34), CRHR2 (7p15), TH (11p15), SLC6A4 (17q11.1), CRHR1 (17q21) and COMT (22q11.1); see Smith et al. [170]. The SNPs are coded as 0, 1 or 2 for genotypes AA, AB and BB, respectively. The occurrence of missing values in the SNP data is equal to 2.17%; the summed missing values in each row did not exceed 5%. The missing values were imputed by the sequential kNN imputation method [113]. Clinical variables with more than 5% missing values were filtered out. The clinical variables include summary variables, such as the intake illness classification and the empirical illness classification, as well as variables reflecting the medical symptoms on which these summaries are based.
For the experiments we chose the intake illness classification of CFS, which is based
on the 1994 case definition criteria [73]. This variable has five levels: ever CFS, ever
CFS with major depressive disorder with melancholic features (MDDm), ever insufficient
symptoms or fatigue (ISF), ever ISF with MDDm and nonfatigued. We reclassified the
patients into two groups: CFS disease group and non-CFS disease group. The CFS disease
group includes all CFS-like patients, while the non-CFS disease group includes the rest
with the ISF and nonfatigued patients. Table 6.2 shows the number of patients belonging to each group. The patients in the gene expression, clinical, SNP and proteomic data did not overlap completely. Therefore, we chose the data set with overlapping gene expression, clinical and SNP data, which has 164 patients. Including the proteomic data on top of that, the data set consists of only 44 overlapping patients. Table 6.1 summarizes the dimensions of each data set.
A data set                        B data set
Data              Dimension       Data              Dimension
gene expression   164 x 19797     gene expression   44 x 19797
clinical          164 x 61        clinical          44 x 61
SNP               164 x 42        SNP               44 x 42
                                  proteomic         44 x 479

Table 6.1: CFS data set characteristics.
             patients   CFS diseased   non-CFS diseased
A data set   164        64             100
B data set   44         23             21

Table 6.2: CFS data set – patient groups.
6.2 Microarray data pre-processing
Both raw and already pre-processed data sets were used for the experiments. The raw data sets, which were Affymetrix, required pre-processing. The pre-processing of Affymetrix and cDNA data sets differs, though some steps are similar. A wide variety of methods is employed in the pre-processing of microarray data; interested readers can find more information, for example, in [74]. We used the Robust Multi-Array (RMA) approach [106], which appears to be promising, even though RMA requires large amounts of RAM. The RMA approach retains probe-level information.

Figure 6.2: Pittman data set box-plots of log2 intensities per array: (a) raw data; (b) pre-processed data.

Figure 6.3: Example plots of PM intensities of the first ten arrays from the Pittman data set: (a) density estimates; (b) raw data box-plots; (c) pre-processed data box-plots.

Affymetrix arrays
typically use between 11 and 20 probe pairs for each gene, each of which has a length
of 25 base pairs. One component of these pairs is referred to as a perfect match (PM) probe, while the other component, with the changed middle base, is called a mismatch (MM) probe. A PM is designed to hybridize only with transcripts of the intended gene (specific hybridization), while an MM is constructed to measure only the non-specific hybridization of the corresponding PM probe (non-specific hybridization is unavoidable). An advantage of the RMA approach is that it does not involve an implicit subtraction of the MM probe values [106]; this subtraction can introduce a lot of noise at low signal values. Instead, RMA looks at the distribution of the PM probe values and fits a combination of two distributions: a 'noise' distribution that is normally distributed and a 'signal' distribution that is exponentially distributed. The normalized values are estimated through the expected value of the signal distribution. The RMA approach consists of three processing steps: convolution background correction, quantile normalization [24], and a summarization based on a multi-array model fitted robustly using the median polish algorithm [61].
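In R, this RMA pipeline is available through the 'affy' package. A minimal sketch, assuming the raw CEL files of an Affymetrix data set reside in the working directory:

  library(affy)
  raw  <- ReadAffy()    # read all *.CEL files in the working directory
  eset <- rma(raw)      # convolution correction, quantile normalization, median polish
  expr <- exprs(eset)   # probeset x sample matrix of log2 expression values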
The figures in this section illustrate pre-processing results for the Pittman data set. The box-plots in Figure 6.2 depict the data set before and after pre-processing. Due to the high memory requirements, it was necessary to use a machine with 16 GB of RAM and a 64-bit operating system to compute the data; pre-processing was impossible on a standard PC (Intel T7250 Core 2 Duo 2.00 GHz, 2 GB RAM) with a 32-bit operating system.
For illustrative purposes, the other figures in this section depict results just for the first ten arrays from the Pittman data set. Figure 6.3 examines PM intensity behaviour with histograms (or density estimates) of the PM intensities for each array and with box-plot distributions of the raw and pre-processed log-scale PM probe intensities. This type of visualization provides a useful exploratory tool for quality assessment. For example, an array with a bimodal distribution among the density estimates in Figure 6.3a can indicate a spatial artifact.

Figures 6.4 and 6.5 show MA-plots of the raw and pre-processed arrays from the Pittman data set. The MA-plot is another quality assessment image. It can highlight the need for normalization, gives a better idea of the differences between arrays in the shape or center of the distribution, and can detect bad-quality arrays. A loess curve, drawn in red in the MA-plot figures, is fitted to the scatter-plot to summarize any non-linear relationship; an MA-plot with an oscillating curve indicates evident array quality problems. According to Figure 6.4, mainly arrays 2 and 5 need to be normalized. The MA-plots of the ten arrays are plotted against a reference array, created by taking probe-wise medians, to avoid making an MA-plot for every pairwise comparison.

Figure 6.4: MA-plot of the first ten arrays from the Pittman data set before pre-processing.

Figure 6.5: MA-plot of the first ten arrays from the Pittman data set after pre-processing.

The MA-plot uses M as the y-axis and A as the x-axis, where M and A are defined as:

\[ M = \log_2 X_n - \log_2 R, \qquad A = \tfrac{1}{2}\left(\log_2 X_n + \log_2 R\right), \]

where X_n is array number n = 1, 2, ..., 10 and R denotes the reference array. The interpretation of an MA-plot can be as follows: M represents each gene's log fold change, i.e., how much each gene differs from the reference array, while A represents the log intensity of each gene.
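Computed directly from these definitions, the MA values of one array against the probe-wise median reference can be sketched in R (expr being a hypothetical probes x arrays intensity matrix):

  n <- 2                           # array to inspect (e.g. array 2 above)
  R <- apply(expr, 1, median)      # probe-wise median reference array
  M <- log2(expr[, n]) - log2(R)   # log fold change against the reference
  A <- 0.5 * (log2(expr[, n]) + log2(R))
  plot(A, M)                       # the MA-plot
  lines(lowess(A, M), col = "red") # smoothing curve (lowess in place of loess)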
Figure 6.6 is a quality control (QC) plot of the arrays from the raw Pittman data set, generated with the help of the 'simpleaffy' package. Dotted horizontal lines separate the plot into rows, one for each chip. Each row shows the percent present calls, average background, scale factors and β-actin/GAPDH ratios for an individual chip.

Figure 6.6: 'simpleaffy' QC plot of the first ten arrays from the raw Pittman data set.
Percent present and average background are listed to the left of the figure. A percent
present call represents the percentage of probesets called present on an array. It is generated by looking at the difference between PM and MM values for each probe pair in a
probeset.
Dotted vertical lines provide a scale from −3 to 3. The blue region represents the
spread where all scale factors fall within 3-fold of the mean scale factor for all chips. It
is recommended that chips are only comparable if their scale factors are within 3-fold of
each other, which is true according to Figure 6.6, because all scale factor lines are coloured
blue.
β-actin and GAPDH values that are considered potential outliers are coloured as red triangles and circles (see arrays 6 and 8 in Figure 6.6). The QC plot measures the quality of the RNA hybridised to the chip, which is assessed by comparing the amount of signal from the 3' probesets to the 5' probesets. β-actin and GAPDH are relatively long genes, and the majority of Affymetrix chips contain separate probesets targeting the 5', mid and 3' regions of their transcripts. The acceptable β-actin 3':5' ratio, which is plotted as a blue triangle, is less than 3. The acceptable GAPDH 3':5' ratio, which is plotted as a blue circle, is less than 1.25; the GAPDH threshold is lower because GAPDH is the smaller of the two genes.
The transcript BioB verifies the efficiency of the hybridisation step. Ideally, BioB
should be ‘called present’ on every array. If BioB is routinely absent, then the array is
performing with suboptimal sensitivity (see ‘simpleaffy’ package documentation for a more
detailed description).
6.3 Simulated data sets
A group of simulated data sets was generated to illustrate the strengths and weaknesses of
the described approaches. Simulated microarray data sets consisted of 500 samples with
1000 microarray variables and 5 clinical variables. A more detailed description related to
the experiments can be found in the next chapter.
Chapter 7
Combining gene expression and clinical data
7.1 Introduction
Combining gene expression with clinical data may add valuable information and can generate more accurate disease outcome predictions than those obtained from gene expression or clinical data alone. Clinical data is heterogeneous and measures various entities (e.g. tumor grade, tumor size, lymph node status), while gene expression data is homogeneous and measures gene expression levels. The combination of both data sources can provide complementary information; on the other hand, redundant and correlated data can have an adverse impact on prediction accuracy.
In the literature, there are studies aimed at integrative prediction with gene expression and clinical data. A few papers related to this topic have been cited in Section 5.2.6.
Gevaert et al. [77] based the microarray and clinical data integration on Bayesian networks (BN), which included
structure and parameter learning. Li [125] proposed dimension reduction methods in the
context of survival data in order to produce linear combinations of gene expressions, while
taking into account clinical International Prognostic Index information. Binder and Schumacher [21] proposed the CoxBoost boosting algorithm that employs high-dimensional
data and a few clinical covariates.
In this chapter, we propose a combination of logistic regression and boosting to predict
disease outcome. We use logistic regression with clinical variables because it has been
widely used with clinical data in clinical trials to determine the relationship between
variables and outcome and to assess variable significance (see for example [190]). Clinical
data is usually low-dimensional because microarray data sets include just a few clinical
variables (typically from 5 to 10). Logistic regression cannot be used with high-dimensional
data without a dimension reduction step or penalization: logistic regression with high-dimensional data can produce numerically unstable estimates, and the resulting predictive model does not generalize well [100]. We use boosting with the microarray data, which is high-dimensional. We use a version of boosting which closely corresponds to fitting a logistic
regression model. It is denoted as BinomialBoosting in [35] and utilizes componentwise
linear least squares (CWLLS) as a base procedure.
The boosting algorithm AdaBoost [198] has attracted attention in the machine learning
community because of its good classification performance with various data sets. Boosting
methods were introduced as multiple prediction schemes, averaging estimated prediction
from reweighted data. Breiman [30] demonstrated that the AdaBoost algorithm can be
viewed as a gradient descent optimization technique in function space. Friedman [70]
developed a more general statistical framework which yields a direct interpretation of
boosting as a method for function estimation. He developed methods for regression which
are implemented as an optimization using the squared error loss function (L2 -Boosting).
Later, Efron et al. [57] made a connection for linear models between forward stagewise linear regression, which is related to L2-Boosting, and the L1-penalized Lasso [180]. Boosting
with CWLLS was worked out in Bühlmann [34]. It can be applied to high-dimensional data
because it performs coefficient shrinkage and variable selection. According to Bühlmann
and Hothorn [35], the BinomialBoosting algorithm is similar to LogitBoost, which is a more
accurate classifier in the context of microarray data than AdaBoost [48].
Logistic regression and BinomialBoosting can be combined within the framework of generalized linear models (GLMs), which are described at the beginning of this chapter. We describe the models separately and make clear how the data is combined. We also propose an extension designed for redundant sets of data. The extension includes pre-validation of the models built with microarray and clinical data, followed by a calculation of weights; the weights determine the relevance of the microarray and clinical models in the data combination. Evaluations are performed with several redundant and non-redundant simulated data sets first, and then with four breast cancer data sets. In the discussion, we compare the designed approaches with other relevant methods from the literature.
The section on alternative regularized regression techniques covers the elastic net and the addition of SNP data to the combination of gene expression and clinical data. We demonstrate the results of these alternatives with simulated and real data sets. At the end of this chapter, we compare the execution times of the applied approaches.
The notation for prediction with gene expression data was presented in Chapter 3. A similar notation is used for clinical variables. It can be summarized as follows:

X      ...  p × n gene expression data matrix
x_ij   ...  an element of X
Z      ...  q × n clinical data matrix
z_ij   ...  an element of Z
p      ...  the number of genes
q      ...  the number of clinical variables
n      ...  the number of samples
y      ...  n × 1 response vector
y_i    ...  an element of y
ŷ      ...  n × 1 estimated response vector
A, B   ...  the poor and good prognosis ground-truth class labels

In the following text, the upper indices X, Z and S mark variables related to gene expression data, clinical data and SNP data, respectively.
7.2 Generalized linear models
Generalized linear models (GLMs) [134] are a group of statistical models that model the responses as nonlinear functions of linear combinations of predictors. These models are linear in the parameters. The nonlinear function (the link) relates the response to the nonlinearly transformed linear combination of the predictors. We employ GLMs in the data combination because of their conveniently shared properties, such as linearity. GLMs offer a general theoretical framework for many statistical models. The implementation of various statistical models is simplified because the framework allows a common method for the estimation of parameters and for assessing model adequacy of all GLMs [107]. GLMs are generalizations of normal linear regression models, which are characterized by the following features:

1. A linear regression model is defined as:

\[ \eta_i = \beta_0 + \sum_{j=1}^{q} \beta_j x_{ij} + \epsilon_i, \tag{7.1} \]

where i = 1, ..., n, the β are regression coefficients and ε is a random mean-zero error term.

2. The link function is given as:

\[ g(y_i) = \eta_i, \tag{7.2} \]

where g is a link function, i = 1, ..., n, and η_i is a linear predictor. Equivalently, this can be written as y_i = g^{-1}(η_i), where g^{-1} is the inverse link function. For the normal linear regression model, g is the identity.
3. All y_i are assumed to have normal distributions with E(y_i) = µ_i and a constant variance σ², i.e., y_i ∼ N(µ_i, σ²).
GLMs generalize this setup in the following ways:

1. The link function can be other than the identity.

2. The response variable y_i does not have to be continuous and normally distributed; it can have a distribution other than the normal one.

We will assume that the observations come from a distribution in the exponential family [134]:

\[ f(y, \theta, \psi) = \exp\!\left( \frac{y\theta - b(\theta)}{a(\psi)} + c(y, \psi) \right), \tag{7.3} \]

where f is the density of a continuous y or the probability function of a discrete y, θ is the parameter of interest (in a GLM, θ = θ(β)), and ψ is a nuisance parameter (as σ in regression).
7.2.1 Estimation of parameters
The coefficients β can be estimated by maximum likelihood. The log-likelihood of the sample is maximized:

\[ l(\mu, y, \psi) = \sum_{i=1}^{n} \log f(y_i, \theta_i, \psi). \tag{7.4} \]

If we consider the logistic model, which is described later, neither a(ψ) nor c(y, ψ) has an influence on the maximization. Thus, we maximize only:

\[ \tilde{l}(\mu, y, \psi) = \sum_{i=1}^{n} \left( y_i \theta_i - b(\theta_i) \right). \tag{7.5} \]

We differentiate (7.5) with respect to β, set the derivative equal to zero and solve for β:

\[ \frac{\partial}{\partial \beta} \tilde{l}(\mu, y, \psi) = \sum_{i=1}^{n} \left( y_i - b'(\theta_i) \right) \frac{\partial \theta_i}{\partial \beta} = 0. \tag{7.6} \]

Maximum likelihood estimates (MLEs) do not have a 'closed' form [88]. Therefore, they are computed via iteratively weighted least squares¹ (IWLS) or via the Fisher scoring algorithm. McCullagh and Nelder [134] prove that the IWLS algorithm is equivalent to Fisher scoring and leads to the MLEs. The weighted least squares solution, in the simpler matrix notation, is as follows:

\[ \hat{\beta}^{(m)} = (X^T W^{(m-1)} X)^{-1} X^T W^{(m-1)} y^{(m-1)}, \tag{7.7} \]

where X is the model matrix, W is a diagonal matrix of weights, y is the working response vector and m is the current iteration. The IWLS algorithm can be briefly described as follows:

1. Initialize β̂^(0), set m = 0.
2. Update W^(m), y^(m).
3. Increase m: m = m + 1.
4. Estimate β̂^(m).
5. Iterate steps 2 to 4 until ‖β̂^(m+1) − β̂^(m)‖ / ‖β̂^(m)‖ ≤ ξ.

The algorithm stops when the estimates change by no more than a specified small amount ξ. A more detailed explanation of the computation of MLEs can be found, for example, in [88].

¹ IWLS is also called iteratively reweighted least squares (IRLS). Some of the differences between the algorithms can be found in [66] (page 189).
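For the logistic model, the IWLS loop can be sketched in plain R as follows; this is a bare-bones illustration of the algorithm above, not the production glm fitter:

  # IWLS for logistic regression; X is the n x (q+1) model matrix (with intercept)
  iwls.logit <- function(X, y, xi = 1e-8, maxit = 25) {
    beta <- rep(0, ncol(X))
    for (m in seq_len(maxit)) {
      eta <- drop(X %*% beta)
      p   <- 1 / (1 + exp(-eta))        # inverse logit
      W   <- p * (1 - p)                # diagonal weights
      z   <- eta + (y - p) / W          # working response
      beta.new <- drop(solve(t(X) %*% (W * X), t(X) %*% (W * z)))
      done <- sqrt(sum((beta.new - beta)^2)) / (sqrt(sum(beta^2)) + xi) <= xi
      beta <- beta.new
      if (done) break
    }
    beta
  }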
7.3 Logistic regression
The logistic regression model is often used when the response is binary. The linear logistic regression model is an example of a GLM, where:

A. The response variable y_i is considered a binomial (Bernoulli) random variable with parameter p_i: y_i ∼ B(n_i, p_i). If y_i is Bernoulli, then [134]:

\[ f(y) = \mathrm{P}(Y = y) = p^{y}(1-p)^{1-y} = \begin{cases} p & \text{if } y = 1, \\ 1-p & \text{if } y = 0. \end{cases} \tag{7.8} \]

B. The parameters of the exponential family for the logistic regression model are:

\[ a(\psi) = 1, \qquad b(\theta) = -\log(1-p) = \log(1+e^{\theta}), \qquad c(y, \psi) = 0. \tag{7.9} \]

After substituting (7.9) into equation (7.3), we arrive at:

\[ f(y, \theta) = e^{y\theta + \log(1-p)}. \tag{7.10} \]

C. The link function is the logit:

\[ \theta = \log \frac{p}{1-p}, \tag{7.11} \]
which can be derived from (7.8). Figure 7.1 shows a graph of the logistic function.

Figure 7.1: The logistic function (probability plotted against the logit).
The logistic regression model with clinical data can be described by the following equation:

\[ g(y_i) = \eta_i = \beta_0^Z + \sum_{l=1}^{q} \beta_l^Z z_{il}, \tag{7.12} \]

where i = 1, ..., n, g is the link function (7.11), and y_i (or p_i) are the outcome probabilities P(y_i = A | z_{i1}, ..., z_{iq}). The upper index Z denotes the clinical data coefficients.
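In R, model (7.12) is a standard glm fit on the clinical variables. A sketch with hypothetical data frames clin.train and clin.test (columns y, age, grade, er, size):

  fit   <- glm(y ~ age + grade + er + size, data = clin.train,
               family = binomial(link = "logit"))
  eta.Z <- predict(fit, newdata = clin.test, type = "link")  # linear predictor eta^Z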
7.4 Boosting
Boosting with componentwise linear least squares (CWLLS) as the base procedure is applied to the microarray data. A linear regression model (7.1)² is considered again. A boosting algorithm is an iterative algorithm that constructs a function F̂(x) by considering the empirical risk n^{-1} Σ_{i=1}^{n} L(y_i, F(x_i)), where L(y_i, F(x_i)) is a loss function that measures how closely a fitted value F̂(x_i) comes to the observation y_i. In each iteration, the negative gradient of the loss function is fitted by the base learner. Gradient descent is an optimization algorithm that finds a local minimum of the loss function. The base learner is a simple fitting method which yields an estimated function:

\[ \hat{f}(\cdot) = \hat{f}(X, r)(\cdot), \tag{7.13} \]

where f̂(·) is an estimate from the base procedure and the response r is fitted against x_1, ..., x_n.

² The gene expression coefficients can be denoted by the upper index X in this equation.
7.4.1 Functional gradient descent boosting algorithm

The functional gradient descent (FGD) boosting algorithm, as defined by Friedman, is given by the following steps [70, 35]:

1. Initialize F̂^(0) ≡ arg min_a Σ_{i=1}^{n} L(y_i, a) ≡ ȳ. Set m = 0.

2. Increase m: m = m + 1. Compute the negative gradient (also called the pseudo-response), which is the current residual vector:
\[ r_i = -\frac{\partial}{\partial F} L(y, F) \Big|_{F = \hat{F}^{(m-1)}(x_i)} = y_i - \hat{F}^{(m-1)}(x_i), \qquad i = 1, \ldots, n. \]

3. Fit the residual vector (r_1, ..., r_n) to (x_1, ..., x_n) by a base procedure (e.g. regression):
\[ (x_i, r_i)_{i=1}^{n} \xrightarrow{\text{base procedure}} \hat{f}^{(m)}(\cdot), \]
where f̂^(m)(·) can be viewed as an approximation of the negative gradient vector.

4. Update F̂^(m)(·) = F̂^(m−1)(·) + ν · f̂^(m)(·), where 0 < ν < 1 is a step-length (shrinkage) factor.

5. Iterate steps 2 to 4 until m = m_stop for some stopping iteration m_stop.
7.4.2 Base procedure

The componentwise linear least squares (CWLLS) base procedure estimates are defined as (compare with equation (7.1)):

\[ \hat{f}(X, r)(x) = \hat{\beta}_{\hat{s}} x_{\hat{s}}, \tag{7.14} \]

where:

\[ \hat{s} = \arg\min_{1 \le j \le p} \sum_{i=1}^{n} (r_i - \hat{\beta}_j x_{ij})^2 \tag{7.15} \]

and

\[ \hat{\beta}_j = \frac{\sum_{i=1}^{n} r_i x_{ij}}{\sum_{i=1}^{n} x_{ij}^2}. \tag{7.16} \]

Here the β̂_j are coefficient estimates, j = 1, ..., p, and ŝ denotes the index of the selected (best) predictor variable in iteration m. The CWLLS base procedure performs a linear least squares regression against the one selected predictor variable which reduces the residual sum of squares the most. Thus, one predictor variable, not necessarily a different one, is selected in each iteration, and for every iteration m a linear model fit is obtained.

The function is updated linearly. Step 4 of the FGD algorithm (Section 7.4.1) with the CWLLS base procedure is as follows:

\[ \hat{F}^{(m)}(x) = \hat{F}^{(m-1)}(x) + \nu \hat{\beta}_{\hat{s}}^{(m)} x_{\hat{s}}^{(m)}. \tag{7.17} \]

The update of the coefficient estimates is:

\[ \hat{\beta}^{(m)} = \hat{\beta}^{(m-1)} + \nu \hat{\beta}_{\hat{s}}^{(m)}. \tag{7.18} \]

Then, the boosting estimator is a sum of the base procedure estimates, i.e., a linear combination of the base procedures:

\[ \hat{F}^{(m)}(\cdot) = \nu \sum_{k=1}^{m} \hat{f}^{(k)}(\cdot). \tag{7.19} \]
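One CWLLS selection step, equations (7.14)-(7.16), can be sketched in R as follows (X is the n x p microarray matrix, r the current residual vector; the boosting loop then applies the updates (7.17)-(7.18)):

  cwlls.step <- function(X, r) {
    beta.all <- colSums(r * X) / colSums(X^2)  # (7.16) for every predictor
    fitted   <- sweep(X, 2, beta.all, "*")     # column j scaled by its beta_j
    rss      <- colSums((r - fitted)^2)
    s        <- which.min(rss)                 # (7.15): best single predictor
    list(s = s, beta = beta.all[s])            # (7.14)
  }
  # inside the boosting loop, update (7.17):
  # F.hat <- F.hat + nu * step$beta * X[, step$s]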
7.4.3 Loss function

The aforementioned FGD boosting algorithm can be used in many settings. Several versions of boosting can be obtained by varying the base procedures, loss functions and implementation details. Examples of loss functions yielding different versions of the boosting algorithm are given in Table 7.1.

BinomialBoosting [35], which is the version of boosting that we utilize, uses the negative log-likelihood loss function:

\[ L(y, F) = \log_2(1 + e^{-2yF}). \tag{7.20} \]

It can be shown that the population minimizer of (7.20) has the form [35]:

\[ F(x_i) = \frac{1}{2} \log \frac{p}{1-p}, \tag{7.21} \]

where p is P(y_i = A | x_{i1}, ..., x_{ip}); this relates to the logit function and is thus analogous to logistic regression.
Algorithm                     L(y, F)               Population minimizer F(x)
BinomialBoosting/LogitBoost   log2(1 + e^{-2yF})    (1/2) log(p/(1-p)),  p = P(Y = A|X = x)
L2-Boosting                   (1/2)|y - F|^2        E[Y|X = x]
AdaBoost                      e^{-(2y-1)F}          (1/2) log(p/(1-p)),  p = P(Y = A|X = x)

Table 7.1: Boosting algorithms, their loss functions and population minimizers.
7.5 Combination of logistic regression and boosting
In GLMs, the linear models are related to the response variable via a link function (7.2). For binary data, we expect the responses y_i to come from a binomial distribution. Therefore, the logit link function is used in both models, with clinical and with microarray data. η_i is a linear model: the linear part of the logistic regression, and the linear regression model in boosting with CWLLS described in Section 7.4. We combine the data at the level of decisions and sum the linear predictions for the clinical and microarray data:

\[ \eta_i = \eta_i^Z + \eta_i^X. \tag{7.22} \]

The upper index Z (X) denotes clinical (microarray) data. According to the additivity rule that is valid for linear models, it is possible to sum the linear models:

\[ \eta_i = \beta_0^Z + \sum_{l=1}^{q} \beta_l^Z z_{il} + \sum_{j=1}^{p} \beta_j^X x_{ij}. \tag{7.23} \]

The inverse link function g^{-1}, which is the inverse logit function, is applied to the sum of the linear predictions η_i:

\[ g^{-1}(\eta_i) = \mathrm{logit}^{-1}(\eta_i) = \frac{e^{\eta_i}}{1 + e^{\eta_i}}. \tag{7.24} \]

The schematic drawing³ in Figure 7.2 illustrates the combination of clinical and microarray data. The clinical and microarray data are repeatedly split into training and test sets via the Monte Carlo cross-validation (MCCV) procedure described in Section 4.1. Each clinical training set is fitted with the logistic regression model, and we compute the linear predictions η_i^Z for each clinical test set. Each microarray training set is fitted using boosting with CWLLS, and we then compute the linear predictions η_i^X for each microarray test set. After that, we sum the predictions (7.22), and equation (7.24) gives a response. Based on the responses, we measure the classifier performance with AUC, as described in Section 4.2.

For better readability in the results, we denote this approach LOG/Z+B/X.

Figure 7.2: The combination of microarray and clinical data. Z – clinical data, X – microarray data, LOG – logistic regression, B – boosting, η^Z, η^X – linear predictions, Y – response.

³ The validation schema is not included.
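A sketch of this decision-level combination in R, assuming eta.Z from the clinical logistic regression above and a microarray model boost.fit fitted with 'mboost':

  eta.X <- predict(boost.fit, newdata = x.test, type = "link")  # microarray predictor
  # note: 'mboost' Binomial link values are 1/2 of the logit scale (see Section 7.6)
  eta   <- eta.Z + eta.X              # (7.22)/(7.23): sum of linear predictions
  y.hat <- exp(eta) / (1 + exp(eta))  # (7.24): inverse logit, class probabilities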
7.6 Parameter settings
We performed the experiments in the R environment⁴ using the packages 'base' and 'mboost'. According to the 'mboost' package documentation [101], the coefficients resulting from boosting with the binomial family are 1/2 of the coefficients of a logit model obtained via glm; this is due to the internal recoding of the response to −1 and +1. The binomial family object implements the negative binomial log-likelihood of a logistic regression model as the loss function. glm is a function from the 'base' package, which is used to fit generalized linear models.

The choice of the shrinkage factor ν and of the number of iterations of the base procedure can be crucial for the predictive performance of boosting. Too many iterations lead to overfitting of the training data and a model that is too complex, while an insufficient number of iterations yields underfitting of the training data and a model that is too sparse. A small value of ν increases the number of required iterations and the computational time, but prevents overshooting. Following the recommendation of Bühlmann et al. [35], we set ν = 0.1, the standard default value in the 'mboost' package. Note that, according to Lutz et al. [129], the natural value of the shrinkage factor for L2-Boosting with CWLLS is ν = 1; however, they use ν = 0.3 because smaller values have empirically proven to be a better choice. The number of iterations can be estimated with a stopping criterion. For example, resampling methods such as cross-validation and the bootstrap [87] have been proposed to estimate the out-of-sample error for different numbers of iterations. A computationally less demanding alternative is to use Akaike's information criterion (AIC) [12] or the Bayesian information criterion (BIC) [164].

⁴ www.r-project.org
Figure 7.3: Estimation of the optimal number of boosting iterations (m_stop) based on AIC (AIC plotted against the number of boosting iterations). Circles denote the value of m_stop. The example is depicted for the van't Veer data set and 3 MCCV iterations (m_max = 700, p = 500).
We use AIC in our computations as the stopping criterion:

\[ \mathrm{AIC} = 2k(m) - 2\log(L(m)), \tag{7.25} \]

where k(m) is the number of variables used by the classifier F̂^(m) at step m, and L(m) is the binomial likelihood of the data given F̂^(m). The preferred model is the one with the minimum AIC value.
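With the 'mboost' package, this AIC-based choice of m_stop can be carried out as in the following sketch (x.train: microarray matrix, y.train: factor response):

  library(mboost)
  fit <- glmboost(x = x.train, y = y.train, family = Binomial(),
                  control = boost_control(mstop = 700, nu = 0.1))
  aic <- AIC(fit, method = "classical")  # binomial AIC along the boosting path
  fit <- fit[mstop(aic)]                 # restrict the model to the AIC-optimal m_stop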
Figure 7.3 shows an example of the estimation of the optimal number of iterations (m_stop) based on AIC; m_stop is denoted by a circle. The maximal number of iterations was set to m_max = 700. The number of microarray variables was set to p = 500 and selected based on the training sets. Figure 7.4 depicts the span of the m_stop selections for the four breast cancer data sets. In the case of the van't Veer and Pittman data sets, the average m_stop was approximately 400 iterations, while the average m_stop for the Mainz data set was approximately 500 and for the Sotiriou data set approximately 550 iterations. For the breast cancer data sets, we also evaluated the boosting approach with a fixed number of iterations within the range of 50 to 800 iterations. The AUC where the number of iterations was estimated via AIC mostly got close to the highest AUC obtained with a fixed number of iterations. In the case of the Sotiriou data set, the highest AUC was obtained with only 50 iterations; however, this value differed only by 0.01 from the AUC obtained via
AIC-estimated boosting.

Figure 7.4: The span of the m_stop selections for the four breast cancer data sets: (a) van't Veer, (b) Pittman, (c) Mainz, (d) Sotiriou. The MCCV iteration number is on the horizontal axis and the m_stop selection on the vertical axis (m_max = 700, p = 500).
7.7 Results
Evaluations focused on testing the LOG/Z+B/X approach with non-redundant and redundant data sets. Simulated data was generated in different settings and was used for
this purpose. We also tested LOG/Z+B/X with four publicly available breast cancer data
sets. The performance of individual models was also evaluated and compared to the performance of the combined LOG/Z+B/X. The logistic regression model built with clinical
variables is denoted by LOG. The boosting model built with microarray data is denoted
by B.
7.7.1 Simulated data
We considered redundant and non-redundant settings of the data and different predictive powers of the clinical and microarray data. We generated simulated data sets through the use of an R script available in [26]. The variables Vj in each data set were generated as:

Vj = µVj Y + ej,    (7.26)

where V denotes microarray or clinical data, j = 1, ..., p or j = 1, ..., q; µVj are constant parameters controlling the amount of predictive power of the microarray or clinical variables; Y is the binary response that follows a binomial distribution; and ej are independent random errors following a standard normal distribution.
In the case of redundant sets, microarray and clinical variables were generated using
exactly the same model. Such variables discriminate classes in the same way and give
redundant information. In the case of non-redundant sets, the observations were assumed
to form two distinct subgroups [26].
We considered different predictive powers for the clinical variables, µZ, and different predictive powers for the microarray variables, µX. In the presented simulations, µZ = 0 denotes no power, µZ = 0.5 a moderate power and µZ = 1 a strong power for Z. Similarly, µX = 0, 0.25, 0.5 for X. The difference in the µZ and µX ranges compensates for the different ranges of the predictor values for the microarray and clinical variables.
In all settings, the simulated microarray data sets were generated for 1,000 microarray variables and 500 samples, while the simulated clinical data sets consisted of 5 clinical variables and 500 samples.
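A minimal sketch of this simulation setting in R is given below. It assumes balanced classes and standard normal errors; the function name simulate_data is hypothetical and is not the script from [26].

# Simulate data following V_j = mu_Vj * Y + e_j, equation (7.26).
simulate_data <- function(n = 500, p = 1000, q = 5, mu_X = 0.25, mu_Z = 1) {
  Y <- rbinom(n, size = 1, prob = 0.5)                      # binary response
  X <- matrix(mu_X * Y, n, p) + matrix(rnorm(n * p), n, p)  # microarray data
  Z <- matrix(mu_Z * Y, n, q) + matrix(rnorm(n * q), n, q)  # clinical data
  list(X = X, Z = Z, Y = Y)
}

sim <- simulate_data()   # e.g. setting 2: muZ = 1, muX = 0.25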
              1.            2.            3.            4.
µZ, µX        0, 0          1, 0.25       0.5, 0.5      1, 0.5
Method        (no power)    (µZ > µX)     (µZ < µX)     (strong p.)
LOG/Z+B/X     0.53 ± 0.05   0.69 ± 0.04   0.73 ± 0.04   0.79 ± 0.04
LOG/Z         0.55 ± 0.05   0.65 ± 0.06   0.56 ± 0.05   0.65 ± 0.06
B/X           0.51 ± 0.05   0.60 ± 0.05   0.72 ± 0.05   0.72 ± 0.05
Table 7.2: Non-redundant data sets. AUCs from test data sets (including mean AUCs
and standard deviations) evaluated over 100 MCCV iterations.
Tables 7.2 and 7.3 display selected results of LOG/Z+B/X for different predictive
powers of Z and X. In the case of non-redundant data sets, LOG/Z+B/X increases the AUCs. The results coincide with the fact that if the two data sources involve complementary information, their combination yields more accurate predictions. In the case of redundant data sets, LOG/Z+B/X has a good performance as well.
              1.            2.            3.            4.
µZ, µX        0, 0          1, 0.25       0.5, 0.5      1, 0.5
Method        (no power)    (µZ > µX)     (µZ < µX)     (strong p.)
LOG/Z+B/X     0.51 ± 0.05   0.94 ± 0.02   0.96 ± 0.02   0.98 ± 0.01
LOG/Z         0.49 ± 0.05   0.94 ± 0.02   0.78 ± 0.04   0.94 ± 0.02
B/X           0.51 ± 0.05   0.71 ± 0.04   0.98 ± 0.01   0.98 ± 0.01
Table 7.3: Redundant data sets. AUCs from test data sets (including mean AUCs and
standard deviations) evaluated over 100 MCCV iterations.
7.7.2 Breast cancer data
For the evaluation of the LOG/Z+B/X approach, we used four publicly available breast
cancer data sets described in Chapter 6. Table 7.4 shows average AUCs and standard deviations over 100 MCCV iterations.

Data set     Method        p = 50         p = 200        p = 500
van’t Veer   LOG/Z+B/X     0.79 ± 0.11    0.78 ± 0.11    0.79 ± 0.11
             LOG/Z         0.82 ± 0.10    −              −
             B/X           0.67 ± 0.13    0.65 ± 0.12    0.65 ± 0.11
Pittman      LOG/Z+B/X     0.79 ± 0.07    0.81 ± 0.08    0.82 ± 0.08
             LOG/Z         0.67 ± 0.09    −              −
             B/X           0.75 ± 0.08    0.77 ± 0.08    0.78 ± 0.08
Mainz        LOG/Z+B/X     0.65 ± 0.10    0.66 ± 0.10    0.68 ± 0.10
             LOG/Z         0.59 ± 0.10    −              −
             B/X           0.65 ± 0.10    0.66 ± 0.10    0.68 ± 0.09
Sotiriou     LOG/Z+B/X     0.66 ± 0.11    0.65 ± 0.12    0.69 ± 0.11
             LOG/Z         0.71 ± 0.11    −              −
             B/X           0.58 ± 0.12    0.57 ± 0.12    0.61 ± 0.12

Table 7.4: Breast cancer data sets. AUCs from test data sets (including mean AUCs and standard deviations) evaluated over 100 MCCV iterations. p denotes the number of microarray variables.

We performed tests for different numbers of variables
(p = 50, 200, 500) in order to inspect the efficiency of B/X and LOG/Z+B/X. Variables
were selected on the basis of the absolute value of the t-statistics using the R package ‘st’.
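For illustration, a hedged sketch of this selection step is shown below; it uses a plain two-sample t-statistic instead of the ‘st’ implementation, with hypothetical objects X_train (genes in columns) and Y_train (binary response).

# Rank genes by the absolute value of the two-sample t-statistic
# and keep the indices of the p top-ranked ones.
select_top_genes <- function(X, Y, p = 500) {
  tstat <- apply(X, 2, function(g) t.test(g[Y == 1], g[Y == 0])$statistic)
  order(abs(tstat), decreasing = TRUE)[seq_len(p)]
}

idx <- select_top_genes(X_train, Y_train, p = 500)  # selected on training data only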
The Pittman data set resembles the non-redundant simulated data sets, and the combination of microarray and clinical data leads to an improvement in the outcome prediction. The methods in Table 7.4 show a lower performance with the Mainz and Sotiriou data sets. The results based on the van’t Veer and Sotiriou data sets seem to be close to the simulated redundant data sets setting. In the case of van’t Veer, this finding coincides with the conclusions of Gruvberger et al. [82], which point out a correlation with ER-alpha status in the data set generated by van’t Veer.
7.8 Pre-validation
The approach described in this section, in contrast to directly combining gene expression and clinical data (see Figure 7.2), sets weights that determine the relevance of the linear predictions of the microarray and clinical models for their combination, as shown in the schematic drawing in Figure 7.5. This approach was designed for redundant data sets. The weights are set based on pre-validation. The concept of pre-validation for microarray data and clinical variables is described in [181]. That paper incorporates only points 1 through 5 of our approach described in Section 7.9. Also, we use different classifiers and leave-one-out cross-validation (LOOCV), while [181] uses k-fold CV.
Figure 7.5: The combination of microarray and clinical data with pre-validation. Z – clinical data, X – microarray data, LOG – logistic regression, B – boosting, η^Z, η^X – linear predictions, Y – response, ŵ_s^Z, ŵ_s^X – weights.
7.9 Determination of weights for models
We have K training samples in the t-th iteration of MCCV. We use LOOCV for pre-validation and, consequently, determine the weights. The weights are computed as follows:

1. Set aside one sample of the K training samples.
2. Build a model with logistic regression (boosting) for Z (X)^5 using only the data from the other K − 1 samples.
3. Predict a linear response with the built model on the left-out case.
4. Repeat steps 1–3 for each of the K samples to get pre-validated predictors from Z and X.
5. Fit a logistic regression model to the pre-validated predictors from Z and X.
6. Compute the weights wi (7.28), where i denotes Z or X.
7. Repeat steps 1–6 for the randomized training data obtained from MCCV.
8. Compute the mode of the weights, ŵi, from the wi for X and Z.
Logistic regression is used twice – in building both the model of Z and the model of the pre-validated predictors from Z and X. Logistic regression describes the relationship between one or more variables and an outcome. Each of the coefficients describes the size of the contribution of a particular variable. A large regression coefficient means that the variable strongly influences the probability of the outcome. The following equation for the Z and X variables defines the regression model:

η = β0 + βZ QZ + βX QX.    (7.27)

The weights are determined as follows:

wZ = |βZ / β0|,  wX = |βX / β0|.    (7.28)
The randomized training data obtained from MCCV is used twice – in the weights estimation, as described in Section 7.8, and in building the models of Z and X, as described in Section 7.5. A histogram of the weights obtained over the MCCV iterations is close to an exponential probability density. In the case of an exponential distribution, the mode is the value with the highest density.

In the rest of this thesis, this approach is denoted pre-LOG/Z+B/X.
5 Z denotes clinical data and X denotes microarray data.
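The following is a minimal sketch of steps 1–6 and 8 in R, assuming hypothetical objects Z, X and a binary response Y; the boosting model is abbreviated via ‘mboost’, and the mode is estimated as the point of highest kernel density.

library(mboost)

prevalidate_weights <- function(Z, X, Y) {
  K <- length(Y)
  etaZ <- etaX <- numeric(K)
  dZ <- data.frame(Z)
  for (k in seq_len(K)) {                                  # steps 1-4 (LOOCV)
    fitZ <- glm(Y[-k] ~ ., data = dZ[-k, , drop = FALSE],
                family = binomial())
    fitX <- glmboost(X[-k, , drop = FALSE], factor(Y[-k]), family = Binomial())
    etaZ[k] <- predict(fitZ, newdata = dZ[k, , drop = FALSE])  # linear predictor
    etaX[k] <- predict(fitX, newdata = X[k, , drop = FALSE])
  }
  fit <- glm(Y ~ etaZ + etaX, family = binomial())         # step 5
  b <- coef(fit)
  c(wZ = abs(b["etaZ"] / b["(Intercept)"]),                # step 6, eq. (7.28)
    wX = abs(b["etaX"] / b["(Intercept)"]))
}

# Step 8: the mode of the weights collected over the MCCV iterations,
# estimated as the point of highest kernel density.
mode_of <- function(w) { d <- density(w); d$x[which.max(d$y)] }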
Data set     Pre-validation   p = 50         p = 200        p = 500
van’t Veer   Yes              0.81 ± 0.10    0.82 ± 0.11    0.82 ± 0.10
             No               0.79 ± 0.11    0.78 ± 0.11    0.79 ± 0.11
Pittman      Yes              0.74 ± 0.08    0.76 ± 0.08    0.78 ± 0.08
             No               0.79 ± 0.07    0.81 ± 0.08    0.82 ± 0.08
Mainz        Yes              0.65 ± 0.10    0.65 ± 0.09    0.66 ± 0.09
             No               0.65 ± 0.10    0.66 ± 0.10    0.66 ± 0.09
Sotiriou     Yes              0.70 ± 0.10    0.70 ± 0.11    0.71 ± 0.11
             No               0.66 ± 0.11    0.65 ± 0.12    0.69 ± 0.11

Table 7.5: Pre-validation of LOG/Z+B/X on breast cancer data sets. AUCs from test data sets (including mean AUCs and standard deviations) evaluated over 100 MCCV iterations. p denotes the number of microarray variables.
7.10 Results
We evaluated the pre-LOG/Z+B/X approach on four publicly available breast cancer data sets and compared pre-LOG/Z+B/X with LOG/Z+B/X. Table 7.5 includes the results with average AUCs and standard deviations over 100 MCCV iterations. LOG/Z+B/X averages the linear predictions from the microarray and clinical models on redundant data sets. Pre-validation of the built models improves the outcome prediction in the case of the redundant breast cancer data sets.
7.11 Discussion
We compared our results with the results from other methods described in the literature. In principle, the results of the designed approaches are hard to compare because new approaches are evaluated with different data sets and measures.

Gevaert et al. [77] evaluated their results using AUC measures. They integrate microarray data and clinical variables with Bayesian networks in three ways: full integration, decision integration and partial integration. The Bayesian decision integration approach combines data at the same level as our method and achieves an average AUC of 0.79 with the van’t Veer data set.
In order to compare LOG/Z+B/X with the approach proposed in Boulesteix et al. [26], we have also extended our computation to report error rates. A comparison is presented in Chapter 8, as it is also relevant for determining the additional predictive value of microarray data. Our approach provides results 2% better on average on the van’t Veer data set. In the case of the Pittman data set, LOG/Z+B/X has results 5% better than the approach proposed in [26].
Eden et al. [56] reproduce the van’t Veer classifier for microarray predictors and apply an artificial neural network (ANN) algorithm to clinical predictors. Their approach achieves an AUC of 0.79 with all samples of the van’t Veer data set and with LOOCV. It achieves an AUC of 0.85 with only the ER-positive samples of the van’t Veer data set.

Ma and Huang [130] propose the Cov-TGDR method for combining different types of covariates in disease prediction. They use the van’t Veer data set and achieve a prediction error of 0.227. However, they perform feature selection based on the binary outcome using both the training and test data, which is not correct (pre-processing steps 4 and 5 in that article).
Other examples of methods that combine microarray and clinical data are [65, 155], but these authors evaluate survival times. Fernandez-Teijeiro et al. [65] build a predictive model with a combination of clinical variables and a small number of selected genes. Pittman et al. [155] combine metagenes with clinical risk factors to improve prediction.

According to our results, a larger p increases the boosting performance, which relates to the fact that boosting with CWLLS performs variable selection and coefficient shrinkage. This approach is evaluated together with other methods on breast cancer data sets in an experiment aimed at the selected genes in Chapter 9.
7.12 Alternative regularized regression techniques
There are other regularized regression techniques based on the logistic regression model that can be applied to high-dimensional data. We experimented with the R packages ‘glmnet’, ‘grplasso’ and ‘glmpath’, which regularize high-dimensional data with L1, L2 or elastic net penalties. At the same time, these statistical models were developed for fitting in the GLM framework.

The next subsection describes experiments with the elastic net from the package ‘glmnet’, which performed well and for which model fitting was not time-consuming. Next, we modify LOG/Z+B/X and apply it to the CFS data set, which includes three types of data.
7.12.1 Elastic net
Elastic net [206] is a regularization and variable selection method that can include both L1 and L2 penalties. The L1 penalty, the lasso (least absolute shrinkage and selection operator) [180], has the property that it can shrink coefficients to zero and thus performs variable selection. The predictive model is then sparse. A drawback of the lasso is its indifference among highly correlated predictors. It tends to select only one variable from a group of correlated variables and does not care which one is selected. The L2 penalty, ridge regression [97], can shrink the coefficients of correlated predictors toward each other and is useful in situations where there are many correlated predictor variables. A drawback of the ridge penalty is that it always keeps all the predictors in the model. The elastic net is a compromise between the L1 and L2 penalties. It takes into account correlated variables and performs variable selection at the same time. Friedman et al. [69] developed algorithms for fitting GLMs with elastic net penalties. The algorithms use cyclical coordinate descent (CCD) and compute a regularization path. CCD algorithms optimize each parameter separately, holding all of the others fixed.
The linear regression model (7.1) is considered again. The elastic net optimizes the following equation with respect to β [71]:

β̂(λ) = argmin_β [ (1/2n) Σ_{i=1}^{n} (y_i − Σ_{j=1}^{p} β_j x_ij)² + λ Σ_{j=1}^{p} Pα(β_j) ],    (7.29)

where:

Pα(β) = (1 − α) (1/2) ‖β‖²_{L2} + α ‖β‖_{L1} = Σ_{j=1}^{p} [ (1 − α) (1/2) β_j² + α |β_j| ]    (7.30)

is the elastic net penalty.
Pα(β) reduces to:

Pα(β) =  the L1 penalty           if α = 1,
         the L2 penalty           if α = 0,
         the elastic net penalty  if 0 < α < 1.    (7.31)
The equation (7.29) is solved by coordinate descent. The coordinate update has the form:

β̂_j ← S(β_j*, λα) / (1 + λ(1 − α)) = S( (1/n) Σ_{i=1}^{n} x_ij r_ij , λα ) / (1 + λ(1 − α)),    (7.32)

where r_ij is the partial residual y_i − ŷ_ij for fitting β̂_j and S(κ, γ) is the soft-thresholding operator with the value:

S(κ, γ) = sign(κ)(|κ| − γ) =  κ − γ   if κ > 0 and γ < |κ|,
                              κ + γ   if κ < 0 and γ < |κ|,
                              0       if γ ≥ |κ|.    (7.33)
Details of the derivation are given in [69]. The soft-thresholding takes care of the lasso contribution to the penalty. A simple description of the CCD algorithm for the elastic net is as follows [71]. The authors assume that the x_ij are standardized: Σ_{i=1}^{n} x_ij = 0, (1/n) Σ_{i=1}^{n} x_ij² = 1.

• Initialize all β̂_j = 0.
• Cycle around until convergence and the coefficients stabilize:
  − Compute the partial residuals r_ij = y_i − ŷ_ij = y_i − Σ_{k≠j} x_ik β_k.
  − Compute the simple least squares coefficient of these residuals on the j-th predictor: β_j* = (1/n) Σ_{i=1}^{n} x_ij r_ij.
  − Update β̂_j by soft-thresholding: β̂_j ← S(β_j*, λ), which equals (7.32).

Friedman et al. [71] compute a group of solutions (a regularization path) for a decreasing sequence of values of λ. Their algorithm starts at the smallest value λmax for which the entire vector β̂ = 0, selects a minimum value λmin = ελmax, and constructs a sequence of values of λ decreasing from λmax to λmin on the log scale. Each solution is used as a warm start for the next problem.
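A didactic sketch of this cycle for the lasso case (α = 1) is shown below, assuming a standardized predictor matrix x and a centered response y; it is a bare-bones illustration, not the optimized ‘glmnet’ implementation.

# Soft-thresholding operator S(kappa, gamma), cf. equation (7.33).
soft_threshold <- function(kappa, gamma) sign(kappa) * pmax(abs(kappa) - gamma, 0)

# Cyclical coordinate descent for the lasso on standardized predictors.
ccd_lasso <- function(x, y, lambda, n_cycles = 100) {
  n <- nrow(x); p <- ncol(x)
  beta <- rep(0, p)                                  # initialize all beta_j = 0
  for (cycle in seq_len(n_cycles)) {                 # cycle until (roughly) stable
    for (j in seq_len(p)) {
      r_j <- y - x[, -j, drop = FALSE] %*% beta[-j]  # partial residuals
      beta_star <- sum(x[, j] * r_j) / n             # simple LS coefficient
      beta[j] <- soft_threshold(beta_star, lambda)   # soft-thresholded update
    }
  }
  beta
}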
Combination of logistic regression and elastic net
We employ the elastic net with the high-dimensional data instead of boosting with CWLLS and combine the elastic net and simple logistic regression models similarly to the procedure described in Section 7.5. The elastic net models are used in the logistic regression setting. The regularized equation (7.29) is fitted by maximum (binomial) log-likelihood [71]. In GLMs, the linear models are related to the response via a link function. We combine data at the level of decisions and sum the linear predictions for clinical and microarray data (7.22). Then the inverse link function g⁻¹ is applied to the sum of the linear predictions (7.24). The schematic drawing for the combination of clinical and microarray data with the elastic net is similar to the schema in Figure 7.2. For a better readability of the results, we denote this approach LOG/Z+EN/X.
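In code, this decision-level combination amounts to only a few lines. A minimal sketch, assuming hypothetical vectors eta_Z and eta_X holding the linear predictions of the clinical and microarray models:

eta <- eta_Z + eta_X                  # sum of linear predictions, cf. (7.22)
p_hat <- 1 / (1 + exp(-eta))          # inverse logit link, cf. (7.24)
y_hat <- as.integer(p_hat > 0.5)      # hard class labels, if required

The same pattern extends directly to more than two data sources, e.g. eta_Z + eta_X + eta_S in the CFS experiments of Section 7.12.2.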
Parameter setting
We performed experiments in the R environment using the packages ‘base’ and ‘glmnet’. We evaluated the λ solution paths produced by the glmnet algorithm in different settings: the lasso penalty (α = 1), the ridge regression penalty (α = 0) and the elastic net penalty (α = 0.5), and tested the models with various λ solutions on the training and test data sets. Figure 7.6 depicts examples from the experiment. The subfigures were generated from a simulated microarray data set of moderate power (µX = 0.25) and depict one MCCV iteration (the same for all subfigures). The simulated microarray data set included 1,000 variables and 500 samples (400 training and 100 test). The corresponding AUC performances are on the vertical axes. The subfigures in the first, second and third rows were produced based on 20, 400 and 1,000 selected variables, respectively, in order to inspect the performance in different dimensional settings.
Figure 7.6: Examples of the λ solution paths produced by the glmnet algorithm. The dashed (solid) line denotes the AUC estimated from a training (test) data set. Columns: 1st – lasso penalty, 2nd – ridge regression penalty, 3rd – elastic net penalty. Rows: 1st, 2nd and 3rd – 20, 400 and 1,000 variables in the data set. The blue vertical line denotes λOPT.
The λ solution paths generated by the glmnet algorithm are on the horizontal axes. The blue vertical lines in the subfigures denote the estimated values of λOPT, which were estimated via cross-validation (CV) on the training data set. The estimation via the training data set CV does not seem to work very well, especially for the ridge regression (the subfigures in the 2nd column). In this case, λOPT is close to zero, although the span of the ridge regression solutions is large and the data is high-dimensional in the 3rd row. Perhaps a λOPT estimation via CV on an additional validation data set could work better. The estimation of λOPT for the lasso (the subfigures in the 1st column) seems to be more reliable. Therefore, we set α = 1 and use the lasso penalty in further experiments.
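A hedged sketch of the corresponding ‘glmnet’ calls is given below (hypothetical objects X_train, Y_train and X_test); cross-validation over the λ path replaces a manual inspection of the solution paths.

library(glmnet)

# Lasso-penalized logistic regression (alpha = 1); cv.glmnet computes the
# lambda path and estimates lambda_OPT via training data cross-validation.
cvfit <- cv.glmnet(X_train, Y_train, family = "binomial", alpha = 1)
lambda_opt <- cvfit$lambda.min

# Linear predictions on the test set at the selected lambda.
eta_X <- predict(cvfit, newx = X_test, s = "lambda.min", type = "link")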
              1.            2.            3.            4.
µZ, µX        0, 0          1, 0.25       0.5, 0.5      1, 0.5
Method        (no power)    (µZ > µX)     (µZ < µX)     (strong p.)
LOG/Z+EN/X    0.55 ± 0.04   0.68 ± 0.05   0.75 ± 0.04   0.81 ± 0.04
LOG/Z         0.55 ± 0.05   0.65 ± 0.06   0.56 ± 0.05   0.65 ± 0.06
EN/X          0.50 ± 0.03   0.60 ± 0.05   0.74 ± 0.05   0.74 ± 0.05
Table 7.6: The comparison with elastic net – non-redundant data sets.
              1.            2.            3.            4.
µZ, µX        0, 0          1, 0.25       0.5, 0.5      1, 0.5
Method        (no power)    (µZ > µX)     (µZ < µX)     (strong p.)
LOG/Z+EN/X    0.48 ± 0.05   0.95 ± 0.02   0.98 ± 0.01   0.99 ± 0.01
LOG/Z         0.49 ± 0.05   0.94 ± 0.02   0.78 ± 0.04   0.94 ± 0.02
EN/X          0.48 ± 0.04   0.72 ± 0.04   0.97 ± 0.01   0.97 ± 0.01
Table 7.7: The comparison with elastic net – redundant data sets.
Results
The experimental data sets were evaluated in the same way as described in Section 7.7. We evaluated EN/X and LOG/Z+EN/X with simulated data first. Tables 7.6 and 7.7 display the results in the non-redundant and redundant settings for different predictive powers of the microarray and clinical data. Combining microarray and clinical data yields a more accurate prediction performance. When we compare Tables 7.6 and 7.7 with Tables 7.2 and 7.3 (the LOG/Z+B/X approach from Section 7.7), LOG/Z+EN/X performs slightly better than LOG/Z+B/X, owing to EN/X.
Then we evaluated EN/X and LOG/Z+EN/X with the van’t Veer and Pittman breast cancer data sets. Average AUCs with standard deviations from test data sets evaluated over 100 MCCV iterations are given in Table 7.8. When we compare Table 7.8 with Table 7.4 from Section 7.7, both approaches indicate a similar performance. LOG/Z+B/X performs slightly better than LOG/Z+EN/X in the case of the Pittman data set.
7.12.2 Combining gene expression, clinical and SNP data
The LOG/Z+B/X approach can also be applied to the CFS data set, which consists of several (more than two) types of data. The CFS data set contains gene expression, clinical, SNP and proteomic data. We do not include proteomic data in the experiments because there are not enough proteomic samples (see Chapter 6).

Data set     Method         p = 50         p = 200        p = 500
van’t Veer   LOG/Z+EN/X     0.77 ± 0.11    0.77 ± 0.10    0.79 ± 0.11
             LOG/Z          0.82 ± 0.10    −              −
             EN/X           0.67 ± 0.12    0.64 ± 0.12    0.64 ± 0.12
Pittman      LOG/Z+EN/X     0.77 ± 0.07    0.78 ± 0.08    0.80 ± 0.07
             LOG/Z          0.67 ± 0.09    −              −
             EN/X           0.74 ± 0.08    0.76 ± 0.08    0.77 ± 0.07

Table 7.8: The comparison with elastic net – breast cancer data sets.

Clinical data is usually low-dimensional
because just a few clinical variables are included in microarray data sets. However, the clinical data is high-dimensional in the CFS data set, and simple logistic regression without a dimension reduction step could not be used. Therefore, we applied boosting with CWLLS to the CFS clinical data as well and combined the linear predictions in the same way as in equation (7.22) described in Section 7.5. We denote this approach B/Z+B/X. We evaluated B/Z+B/X on the van’t Veer and Pittman breast cancer data sets and compared it with LOG/Z+B/X to inspect the efficiency of this method. Table 7.9 shows the results. Simple logistic regression (LOG) performs slightly better on clinical data. However, B/Z+B/X can be used instead of LOG/Z+B/X as well, and we applied it to the CFS data set. Then, we applied boosting to the SNP data and combined gene expression and SNP data; clinical and SNP data; and clinical, gene expression and SNP data. The sum of the linear predictions for clinical, gene expression and SNP data is as follows:
ηi = ηi^Z + ηi^X + ηi^S,    (7.34)

where the upper index S denotes SNP data.
Data set     Method        AUC
van’t Veer   B/Z           0.80 ± 0.10
             B/Z+B/X       0.76 ± 0.10
             LOG/Z         0.82 ± 0.10
             LOG/Z+B/X     0.79 ± 0.11
Pittman      B/Z           0.65 ± 0.09
             B/Z+B/X       0.82 ± 0.08
             LOG/Z         0.67 ± 0.09
             LOG/Z+B/X     0.82 ± 0.08
Table 7.9: Comparison of methods. AUCs from test data sets evaluated over 100 MCCV
iterations. The number of microarray variables is p = 500.
Results
Table 7.10 shows the results of the experiments with the CFS data set. The models constructed with clinical data reached AUC = 0.74. The models based on gene expressions performed very poorly (AUC = 0.52), which is reflected in the combinations of data. The combination of clinical and SNP data improves the outcome prediction. The combination of clinical, gene expression and SNP data performs much like the standalone clinical data.

CFS is a very complex disease. Initially, we intended to use the CFS data set for more experiments, but later we abandoned this idea. The reason is that we are not convinced about the quality of the data set. The intake illness classification of CFS, which is based on the 1994 case definition criteria [73], does not seem to be related to the microarray data. We used this classification as the disease outcome dependent variable for model fitting. Furthermore, CFS can, for example, be caused by chronic infections [143]. The CFS data set does not include infection and immunity markers that could influence the data informativeness. The clinical data consists of subjective information based on a questionnaire filled in by the patients [2].
Data                                 AUC
clinical                             0.74 ± 0.07
gene expression                      0.52 ± 0.11
SNP                                  0.72 ± 0.08

Combination of data                  AUC
clinical + gene expression           0.62 ± 0.09
clinical + SNP                       0.77 ± 0.07
SNP + gene expression                0.58 ± 0.10
clinical + gene expression + SNP     0.74 ± 0.08
Table 7.10: CFS data set. AUCs from test data sets (including mean AUCs and standard
deviations) evaluated over 100 MCCV iterations.
7.13 Execution times
The execution time of the combined models is dominated by the execution times of the
models built with high-dimensional data; therefore, we compared the execution times of
the FGD boosting algorithm from the package ‘mboost’ (B/X and LOG/Z+B/X) with
the CCD algorithm from the package ‘glmnet’ (EN/X and LOG/Z+EN/X). Figure 7.7
depicts this comparison. Increasing numbers of variables are on the horizontal axes, while
total execution times for 100 MCCV iterations (in minutes) are on the vertical axes. The
plots indicate that both methods need a similar amount of time to compute. The execution times grow almost linearly. Besides, the FGD boosting time grows with the number of boosting iterations (in our simulations, mmax = 700). A grid of 100 λ values is computed in each iteration of
EN. The simulations were run on a standard PC (Intel T72500 Core 2 Duo 2.00 GHz, 2 GB RAM) and a 32-bit operating system.

Figure 7.7: The comparison of execution times for (a) B/X and LOG/Z+B/X and (b) EN/X and LOG/Z+EN/X. The number of genes is on the horizontal axes and the time in minutes is on the vertical axes. The times are measured for 100 MCCV iterations.
7.14 Conclusions
This chapter was concerned with the outcome prediction of combined models. We used logistic regression models built in different ways. GLMs enabled combining these models. The LOG/Z+B/X approach employs logistic regression and boosting with CWLLS. The extension of LOG/Z+B/X is pre-LOG/Z+B/X, which includes pre-validation of the models built with microarray and clinical data, followed by the calculation of weights. The weights set the relevance of the microarray and clinical models for the data combination.
We described LOG/Z+EN/X, which employs logistic regression and the elastic net. A comparison with LOG/Z+B/X showed that the two methods provide about the same performance. This finding can be explained by the fact that we used the elastic net in the setting with the lasso penalty. Boosting with CWLLS is a sparse regularization method, like the lasso. In boosting with CWLLS, one predictor variable (not necessarily a different one each time) is selected in each iteration. Efron et al. [57] demonstrated that the coefficient estimates (regularization paths) from the lasso and from forward stagewise linear regression, which is related to slow gradient-based L2-boosting with a small fixed step size (ε-boosting), are nearly identical.
We modified LOG/Z+B/X and applied it to the CFS data set, which consists of microarray, clinical and SNP data. A combination of data sources can lead to a more accurate outcome prediction. On the other hand, there is a danger that the integration of a data source of poor quality can decrease the outcome prediction. The presented results show that the approaches described in this chapter are not very prone to this problem and can, within some bounds, cope with a data source of worse quality.
We also compared the execution times of the FGD boosting algorithm from the package ‘mboost’ with the CCD algorithm from the package ‘glmnet’. The execution times of the combined models grow linearly with the number of genes; the approaches are not time-consuming.
Chapter 8
Additional predictive value of microarrays
8.1 Introduction
Many clinical variables have already been investigated and are now well established in cancer research [183], while most genes are not yet validated as predictors. Gene expressions are expected to contribute significantly to progress in cancer treatment by enabling
precise and early diagnosis. However, results show that the predictive power is often overestimated in the case of genes. The overestimation is clear especially when the number of
samples is low and the total number of genes high. High-dimensional data is more sensitive to overfitting. Potential prognostic genes are often selected on a single data set and
assumed to have the same predictive power with other data sets. In addition, methods
that employ microarrays are not always evaluated correctly (see Section 3.4). Probably
due to the overoptimistic results and attractiveness of microarrays, researchers do not pay
attention to the given clinical data in the same manner as in the pre-microarray era. Microarray data sets usually include just a few clinical variables. Clinical data is usually not difficult to collect, and its acquisition is almost always much cheaper than in the case of microarray gene expression data. This chapter investigates the additional predictive value of microarray data.
Few authors have dealt with this topic. Eden et al. [56] compared the power of gene expression measurements to conventional clinical prognostic markers for predicting distant metastases in breast cancer patients. The performance of the metastasis prediction was presented by ROC and Kaplan-Meier plots. The gene expression profiler did not perform noticeably better than indices constructed from clinical variables. Tibshirani and Efron [181] proposed a pre-validation technique for making a fairer comparison between the two sets of predictors. They compared a predictor of a disease outcome derived from gene expression levels to standard clinical predictors. The technique is partially described
in Section 7.8, where it is employed for the determination of the weights of the linear predictors. Höfling and Tibshirani [98] extended the pre-validation technique by a permutation testing-based procedure. They showed that the permutation test achieves roughly the same power as the analytical test in simulation studies.
Boulesteix et al. published two papers about the additional predictive value of microarrays [27, 25]. The first one combines partial least squares (PLS) dimension reduction and the pre-validation (PV) introduced by Tibshirani and Efron. Random forests are then applied with the pre-validated new PLS components and the clinical variables as predictors. We evaluate and compare our approach together with this method in Section 8.3. The second one deals with a permutation-based procedure and uses logistic regression and boosting. It could seem that this approach is similar to ours; however, it is different because the authors combine the data in a different way and feed the offset values of the boosting algorithm with the linear predictions from the logistic model. The method supposes that all clinical variables are already given (which means that the clinical data are included during the training and cannot be included during the testing), which is a disadvantage if this method were to be used for class prediction with gene expression and clinical data.
Truntzer et al. [183] adjusted for clinical and gene expression information in Cox proportional hazards models. The contributions of the clinical and transcriptomic variables to prognosis were compared through simulations and by using the Kent and O’Quigley ρ² measure of dependence. The results showed that the predictive power is overestimated in the case of genes.
In this chapter, we propose a two-step approach that can determine the additional predictive value of microarray data. It is based on the method described in Chapter 7. We evaluate this approach together with another method designed for the same purpose on real breast cancer data sets. According to the results, our approach can determine whether microarray data has an additional predictive value, and the joint classifier can even combine microarray and clinical data more efficiently than the other method used for comparison.
8.2 Determination of predictive value
The schematic drawing of our two-step approach for the determination of the additional predictive value, including a validation schema, is shown in Figure 8.1. It can be described as follows. Step 1 consists of an evaluation of the performance of the classifier with clinical variables alone. Logistic regression was used as the classifier and was denoted as LOG/Z. Step 2 consists of an evaluation of the performance of the classifier combining clinical variables and microarray data, which was described in Section 7.5 and was denoted
as LOG/Z+B/X. If the microarray data has an additional predictive value, the performance of step 2 is higher than the performance of step 1.

Figure 8.1: Two-step method for the determination of the additional predictive value of microarray data. The dashed line denotes step 1 (LOG/Z). The dot-and-dashed line denotes step 2 (LOG/Z+B/X).

The findings are statistically assessed with a paired Wilcoxon rank sum test at the 95% confidence level.
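A minimal sketch of this assessment in R, assuming hypothetical vectors auc_step1 and auc_step2 holding the AUCs of the 100 MCCV iterations:

# Paired Wilcoxon test comparing step 2 against step 1.
test <- wilcox.test(auc_step2, auc_step1, paired = TRUE)

# Microarray data is judged to have an additional predictive value when
# step 2 outperforms step 1 and the difference is significant at 95%.
additional_value <- mean(auc_step2) > mean(auc_step1) && test$p.value < 0.05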
8.3 Results
We evaluated our approach on four breast cancer data sets. The evaluation schema was the same as in the previous chapter. Table 8.1 shows the results. According to the table, the van’t Veer and Sotiriou microarray data do not have an additional predictive value compared to the clinical data, and it is better to use the clinical data alone for the prediction of prognosis. These findings coincide with the findings in the previous chapter. The finding that the van’t Veer microarray data does not have an additional predictive value coincides with the conclusions in [27, 25, 56]. The Pittman and Mainz microarray data improve the prediction performance.
We also evaluated our approach together with the method published by Boulesteix et al. [26] that addresses the additional predictive value of microarray data. The authors of the article evaluate the proposed approach in various settings and combinations and compare it with other classifiers such as SVM or RF. We evaluated and compared our approach to the method denoted as PLS-PV+RF/XZ, which has the best performance in the article. The method is implemented and available in the R package ‘MAclinical’.

The method is evaluated by means of error rates. That is why we evaluated our approach by means of error rates in this section as well (see Chapter 4 for the description of validation and evaluation). We used the script from the R package ‘MAclinical’. We just applied the same validation schema to PLS-PV+RF/XZ and used the same data sets as for our two-step approach. Table 8.2 shows the results of this experiment. LOG/Z+B/X performs better than PLS-PV+RF/XZ on the van’t Veer and Pittman data sets. The methods predict the outcome similarly on the Mainz and Sotiriou data sets.
Data set     Step   AUC            APVM   p-value     Statistically significant
van’t Veer   1      0.82 ± 0.11    NO     6.64e−06    YES
             2      0.79 ± 0.11
Pittman      1      0.67 ± 0.09    YES    1.34e−15    YES
             2      0.82 ± 0.08
Mainz        1      0.59 ± 0.10    YES    1.36e−09    YES
             2      0.68 ± 0.10
Sotiriou     1      0.71 ± 0.11    NO     8.07e−04    YES
             2      0.61 ± 0.12

Table 8.1: Additional predictive value of microarrays. APVM denotes whether the microarray data has an additional predictive value.
Data set     Method           Mean error     p-value     Statistically significant
van’t Veer   LOG/Z+B/X        0.29 ± 0.11    2.63e−02    YES
             PLS-PV+RF/XZ     0.31 ± 0.10
Pittman      LOG/Z+B/X        0.24 ± 0.07    2.21e−06    YES
             PLS-PV+RF/XZ     0.29 ± 0.08
Mainz        LOG/Z+B/X        0.24 ± 0.07    8.92e−01    NO
             PLS-PV+RF/XZ     0.24 ± 0.06
Sotiriou     LOG/Z+B/X        0.36 ± 0.11    3.68e−01    NO
             PLS-PV+RF/XZ     0.37 ± 0.11

Table 8.2: Comparison of LOG/Z+B/X with the PLS-PV+RF/XZ method described in Boulesteix et al. [26]. The methods are evaluated by mean error rates, based on the R script available in the R package ‘MAclinical’.
8.4 Conclusion
In this chapter, we presented a two-step approach that can determine the additional predictive value of microarray data. It has been shown that the method presented in the previous chapter, which offers a way of constructing a classifier combining the two different types of data, can also be used for determining the additional predictive value of microarray data.

According to the findings in Section 8.3, the van’t Veer and Sotiriou microarray data sets do not have an additional predictive value, while the Pittman and Mainz microarray data sets do. These findings demonstrate the fact that clinical data is still a valuable data source which should be used if available. The LOG/Z+B/X method can combine clinical and microarray data more efficiently than the PLS-PV+RF/XZ method on the van’t Veer and Pittman data sets.
Chapter 9
Breast cancer prognostic genes
9.1 Introduction
The major aim of class prediction studies is to identify a set of key genes, also known
as a signature or mRNA profile, that can accurately predict a class membership of new
samples. We employ five breast cancer data sets in this thesis. It is interesting to compare
selected features of gene expression classifiers built up in the different breast cancer data
sets. Table 9.1 shows an overview of the sizes of these data sets.
Class prediction studies in breast cancer can be categorized into two main subtypes: prognostic and predictive class prediction. Prognostic class prediction discriminates between a good and a poor outcome by comparing highly aggressive and less aggressive primary tumors, while predictive class prediction describes predictors of the response to therapy [55]. We deal with prognostic class prediction in this thesis; see the outcomes in Table 9.1.
Data set     Microarray technology      Dimension      Outcome
van’t Veer   cDNA Agilent               78 x 4348      recurrence during 5 years after resection
Pittman      Affymetrix Human U95Av2    158 x 12625    recurrence within 5 years
Wang         Affymetrix Human U133a     286 x 22263    relapse/distant metastases within 5 years
Mainz        Affymetrix Human U133a     200 x 22283    distant metastases after 5 years
Sotiriou     cDNA                       99 x 4246      relapse within 10 years

Table 9.1: Summary of microarray data sets.

Breast cancer is not a single disease but a complex of genetic diseases characterized by the accumulation of multiple molecular alterations. It has become clear that patients
with similar clinical and pathological features may show distinct outcomes and vary in
their response to therapy [55]. It is known that there are several different types of breast
cancer. However, it remains to be determined how many of them can be identified reliably
with currently available data. The main classes of breast cancer were originally defined
by Perou et al. [153]. Table 9.2 shows the current main classes of breast cancer based on
the reviews of Rakha et al. [55], Ross [161] and Sotiriou et al. [172]. According to Rakha
et al. [55], the difference between the classes is not based on single genes or a specific
pathway, but on a constellation of several groups of genes that make up the signature of each class.
There are a large number of published breast cancer signatures in the literature; however, the predictive success of these studies suffers from the fact that the designed signatures have only a few genes in common (see, e.g., [59]). For example, the van’t Veer and the Wang prognostic models, which study the same breast cancer population and outcome, shared only three genes [76]. Moreover, the predictive performance of both models decreased drastically when applied to each other’s data [59]. According to [116], as few as 3% of published studies describing potential applications in genomic medicine have progressed to a formal assessment of clinical utility.
Table 9.3 shows an overview of breast prognostic and predictive biomarkers in clinical use, including multigene signatures. Oncotype DX and MammaPrint are the two tests that have achieved the most advanced commercial success. Although the MammaPrint test, which is based on the van’t Veer signature, was originally criticized for including some patients in both the training and test groups, it has been clinically validated to a high standard and has US Food and Drug Administration (FDA) approval [136]. Reviews of diagnostic and prognostic signatures can be found in [161, 172].
The fact that different gene signatures have very few genes in common is a common feature of complex gene expression data that contain large numbers of highly correlated variables [172]. It is possible to find several sets of genes with similarly accurate prediction performances despite the limited overlap of these genes [59]. A difficulty of high-dimensional model selection comes from the collinearity among the predictors [64]. This variable selection problem has been described in [75]. Large gene expression data sets have stimulated the development of novel regularized techniques that can cope with small samples, from which has arisen a need for measuring variable importance and standardizing regression coefficients [208]. Classification methods for high-dimensional data have been overviewed in Chapter 3.
This chapter tests the gene expression classifiers introduced in Chapter 7 and investigates how many features they select and how many genes they share when they are
evaluated with the five breast cancer data sets. We compare boosting, elastic net with L1
penalty (the lasso) and a method that takes into account correlations among genes. This
method performed favorably in [10].
Class (%) – Description

ER-positive/luminal tumors (34-66%)

Luminal A (19-39%) – Characterized by the highest expression of the ER and ER-related genes. Most studies have reported this as the best prognosis class.

Luminal B (10-23%) – Shows low to moderate expression of ER-related genes. Compared with luminal A, this class may have a higher proliferation rate, expresses genes that seem to be shared with the basal-like and HER2 subtypes and is associated with a less favourable outcome.

ER-negative tumors (30-45%)

Basal-like (16-37%) – Corresponds to ER-negative, PR-negative and HER2-negative (so-called triple-negative) breast cancers. The triple-negative breast cancers are heterogeneous and can be divided into multiple additional subgroups. The dysfunction of the BRCA1 pathway is another feature that differentiates basal-like cancers from luminal ones.

HER2-positive (4-10%) – Shows amplification of HER2 and high expression of the ERB2, GRB7 and GATA4 genes. Tumors of this class can be ER-positive or ER-negative. Both basal-like and HER2 tumors include high levels of p53 mutation, aggressive clinical behaviour and poor prognosis, and do not respond to hormonal therapy.

Normal breast-like (up to 10%) – Shows a high expression of genes characteristic of parenchymal basal epithelial cells and adipose stromal cells with a low expression of genes characteristic of luminal epithelial cells. The tumors have a better prognosis than basal-like cancers.
Table 9.2: The main classes of breast cancer based on the reviews in [55, 161, 172]. Explanatory notes: ER – estrogen receptor, PR – progesterone receptor, HER2 – human epidermal growth factor receptor 2.
Biomarker – Clinical significance

BRCA1 – High expression of BRCA1 confers a worse prognosis in untreated patients (James et al. [109]).

CTC – Breast cancer patients with ≧ 5 CTC/7.5 ml of peripheral blood are associated with shorter PFS and OS (i.e. a poor prognosis) (Cristofanilli et al. [41]).

ER – Patients with ER-positive breast tumors have a better survival than patients with hormonal-negative tumors (Early Breast Cancer Trialists’ Collaborative Group [81]).

eXageneBC – Provides prognosis in node-positive or node-negative breast cancer patients (Davis et al. [46]).

Her2/neu – Patients with Her2/neu-positive breast tumors are more aggressive and have a worse prognosis compared to Her2/neu-negative tumors (Mass et al. [133]).

Ki-67 – Expression of Ki-67 is associated with proliferation and progression in breast cancer (Dowsett et al. [50]).

MammaPrint – A 70-gene prognostic assay used to identify breast cancer cases at the extreme end of the spectrum of disease outcome by identifying patients with good or very poor prognosis (van’t Veer et al. [185]).

Mammostrat – This standard purely prognostic test uses five antibodies with manual slide scoring to divide cases of ER-positive, lymph node negative breast cancer tumors treated with tamoxifen alone into low-, moderate- or high-risk groups (Ring et al. [160]).

Oncotype DX – A 21-gene multiplex test used for prognosis to determine 10-year disease recurrence for ER-positive, lymph node negative breast cancers using a continuous variable algorithm and assigning a tripartite recurrence score (Goldstein et al. [79]; Paik et al. [150]).

PR – Patients with PR-positive breast tumors have a better survival than patients with hormonal-negative tumors (Dowsett et al. [49]).

NuvoSelect – A combination of several pharmacogenomic gene sets used primarily to guide the selection of therapy in breast cancer patients. This test also provides the ER and HER2 mRNA status (Ayers et al. [18]).

Roche AmpliChip – Low expression of CYP2D6 predicts resistance to tamoxifen-based chemotherapy in breast cancer patients (Hoskins et al. [99]).

Rotterdam Signature – A 76-gene assay used to predict recurrence in ER-positive breast cancer patients treated with tamoxifen (Wang et al. [192]).
Table 9.3: Breast prognostic and predictive biomarkers in clinical use, as reviewed by Mehta et al. [136] (2010).
9.2 Gene selection methods
We compare the gene selection of the following gene expression classifiers.
Boosting and elastic net
We use boosting with componentwise linear least squares (CWLLS), which has already been described in Section 7.4. To recapitulate, in each iteration the CWLLS base procedure performs a linear least squares regression against the one selected predictor variable which reduces the residual sum of squares most [35]. We use the elastic net algorithm in the setting with the L1 penalty (the lasso) (see Section 7.12). The lasso also selects only one variable from a group of correlated variables and does not care which one is selected [206]. In our simulations, the lasso had a higher predictive performance than the elastic net, which takes into account both the L1 and L2 penalties and correlated variables. Therefore, we prefer to use the lasso.
Feature selection using CAT scores and FNDR thresholding
This method takes into account correlations among genes using correlation-adjusted t-scores (CAT scores) controlled by false nondiscovery rates (FNDR) [10]. The approach is based on shrinkage discriminant analysis and consists of the three following cornerstones:

• the use of James-Stein shrinkage for training the classifier, where all regularization parameters are estimated;
• feature ranking based on CAT scores, which emerge from a restructured version of the LDA equations and enable the ranking of genes in the presence of correlation;
• feature selection based on FNDR thresholding for inclusion in the prediction rule.
According to Zuber and Strimmer [207], for two-class classification, the feature selection based on CAT scores uses the vector τ^adj of CAT scores, which is defined as follows:

τ^adj ≡ (1/n1 + 1/n2)^(−1/2) ω
      = P^(−1/2) × (1/n1 + 1/n2)^(−1/2) V^(−1/2) (µ1 − µ2)
      = P^(−1/2) τ.    (9.1)

τ^adj is proportional to the feature weight vector ω, where ω = P^(−1/2) V^(−1/2) (µ1 − µ2). V = diag{σ1², ..., σp²} is a diagonal matrix containing the variances. P = (ρij) is the correlation matrix. (1/n1 + 1/n2)^(−1/2) is the scale factor, where nQ is the number of observations in each group. In binary classification, the number of groups is Q = 2. The CAT score extends the
fold change and t-scores. While the t-score is the standardized mean difference µ1 − µ2, the CAT score is the standardized as well as decorrelated mean difference. P^(−1/2) is responsible for the decorrelation and is known as the Mahalanobis transform [10].
According to [10], a summary score to measure the total impact of feature j ∈ {1, ..., p} can be defined as:

Sj = Σ_{i=1}^{Q} (τ_ij^adj)².    (9.2)
In two-class classification, Q = 2 and equation (9.2) simplifies to Sj = 2(τ_j^adj)². The detailed description of this approach can be found in [10, 207].

The intuitive and natural choice is to use an LDA classifier for class prediction in this approach. The classifier is fed with the selected features. We evaluated the predictive performance of this approach with the five breast cancer data sets. The computation of CAT scores with FNDR thresholding is implemented in the R package ‘sda’. Its authors recommend using the larger FNDR-based feature set for class prediction, not just the genes considered to be differentially expressed [10]. Based on these recommendations, the set of features that can be included in the classifier was thresholded by Si < 0.9. We experimented with lower values, but it sometimes happened that the threshold was too strict and no features passed through.
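A hedged sketch of this approach with the ‘sda’ package is shown below (hypothetical objects X_train, y_train and X_test; a simple top-k cut replaces the exact Si thresholding used in our experiments).

library(sda)

# Rank features by CAT scores (diagonal = FALSE keeps the correlation
# adjustment) and keep, e.g., the 100 top-ranked features.
ranking <- sda.ranking(X_train, y_train, diagonal = FALSE)
idx <- ranking[1:100, "idx"]

# Train the shrinkage discriminant analysis classifier and predict;
# the prediction contains class assignments and posterior probabilities.
fit <- sda(X_train[, idx], y_train, diagonal = FALSE)
pred <- predict(fit, X_test[, idx])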
We evaluate this approach in the same way as the methods described earlier in this thesis (see Chapter 4 for more information). Table 9.4 shows the prediction performance of this approach with the five breast cancer data sets. In comparison with the previous results, the LDA classifier with features selected using CAT scores and FNDR thresholding performs slightly worse than the approaches evaluated in Chapter 7.
Data set    van’t Veer    Pittman       Wang          Mainz         Sotiriou
AUC         0.67 ± 0.12   0.74 ± 0.10   0.59 ± 0.08   0.64 ± 0.10   0.54 ± 0.11

Table 9.4: LDA prediction with feature selection using CAT scores and FNDR thresholding. AUCs from test data sets (including mean AUCs and standard deviations) are evaluated over 100 MCCV iterations.
9.3 Evaluation of selected genes
The disadvantage of MCCV subsets and small sample sets in general is that the selected
genes are influenced by the subset. Therefore, we evaluated and summed nonzero coefficients over all MCCV iterations. We used the same evaluation scheme as described earlier
in this thesis – we selected and evaluated features over 100 MCCV iterations. The overall
number of selected variables is increased because selected genes in each MCCV iteration
subset are slightly different.
Regularized class prediction is characterized by the fact that many coefficients are
zero. In Chapter 7, we used the classifiers based on logistic regression, where each of the
coefficients describes the size of the contribution of each gene. A large coefficient means
that the gene strongly influences the probability of the outcome. A positive coefficient
means that the gene increases the probability of the outcome, while a negative coefficient
means that the gene decreases the probability of the outcome.
We recorded the selected coefficients as follows:

1. For each coefficient estimate β̂_j, where j = 1, ..., p and β̂_j ≠ 0, sum over the MCCV iterations: β̂_j^sum = Σ_{l=1}^{K} β̂_j, where K denotes the total number of MCCV iterations.
2. Compute the absolute value of each β̂_j^sum: |β̂_j^sum|.
3. Sort the |β̂_j^sum| in descending order.
The second way in which we recorded the selected coefficients was that we counted the frequencies of inclusion of a gene in the classifier. Thus, if β̂_j ≠ 0, the counter D of the j-th coefficient was incremented, D_j^sum = D_j^sum + 1, where initially all D_j^sum = 0 for j = 1, ..., p. The consequent steps were the same as in the previous case.
The two ways of counting coefficient estimates are presented in an example in Figure 9.1. They are shown in the first and the second rows of this figure respectively. The
figure depicts the Pittman data set and boosting.
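Both counting schemes are straightforward to express in R; the sketch below assumes a hypothetical list coefs holding the p-dimensional coefficient vector from each of the K MCCV iterations.

# First way: sum the coefficient estimates over the MCCV iterations and
# rank genes by the absolute value of the sums.
beta_sum <- Reduce(`+`, coefs)
rank_by_sum <- order(abs(beta_sum), decreasing = TRUE)

# Second way: count how often each gene enters the classifier (beta_j != 0)
# and rank genes by this inclusion frequency.
D_sum <- Reduce(`+`, lapply(coefs, function(b) as.integer(b != 0)))
rank_by_freq <- order(D_sum, decreasing = TRUE)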
9.4 Results
We evaluated the aforementioned gene selection methods with the five breast cancer data
sets. Figure 9.2 shows the boxplots with the number of genes selected in every iteration and
evaluated over 100 MCCV iterations. In the case of boosting and elastic net classifiers, the
number of selected genes is partially influenced by the different sizes of the gene expression
data sets and differs with used data sets. The third classifier based on feature selection
using CAT and FNDR thresholding selects approximately the same number of genes all
the time.
Figure 9.1: The cumulative sums of coefficient estimates computed over MCCV iterations: (a) absolute values of sums of coefficients, (b) histogram of absolute values of sums, (c) frequencies of inclusion of a gene in the classifier, (d) histogram of summed gene frequencies. The example is depicted for the Pittman data set and boosting. Genes with non-zero coefficients were selected over 100 MCCV iterations. The resulting feature subset with non-zero coefficients included 841 genes.
Figure 9.2: The boxplots with the number of genes selected in every iteration, evaluated over 100 MCCV iterations: (a) boosting, (b) elastic net, (c) feature selection using CAT scores and FNDR thresholding.
The second experiment focused on the content of the selected gene sets. We compared how many genes the selection methods share. Figure 9.3 shows the results of this experiment. We used the first way of counting the coefficient estimates described in the previous section; thus, we summed the coefficient values. The boosting and elastic net classifiers select the most similar gene sets. Boosting and FS using CAT and FNDR thresholding have less than 20% of the selected genes in common.

Due to the fact that the boosting and elastic net classifiers select the most similar gene sets, we compared the shared genes of these two classifiers among the five breast cancer data sets. We found 14 shared genes, which are given in Table 9.5. We used web gene converters [4] to convert Affymetrix IDs of the Affymetrix gene expression data to EntrezGene IDs, and Rosetta files [185] to convert the IDs of the cDNA gene expression data to EntrezGene IDs.
Gene name    Gene description
FAM21B       Family with sequence similarity 21, member B
BCLAF1       BCL2-associated transcription factor 1
ORAI2        ORAI calcium release-activated calcium modulator 2
CD44         CD44 molecule (Indian blood group)
TMCC1        Transmembrane and coiled-coil domain family 1
CABIN1       Calcineurin binding protein 1
HNRPA3       Heterogeneous nuclear ribonucleoprotein A3
MTHFR        5,10-methylenetetrahydrofolate reductase (NADPH)
TCEB1        Transcription elongation factor B (SIII), polypeptide 1 (15kDa, elongin C)
LOC641311    Ribosomal protein L31 pseudogene
ISCA1        Iron-sulfur cluster assembly 1 homolog (S. cerevisiae)
MTF1         Metal-regulatory transcription factor 1
RND2         Rho family GTPase 2
MAP4         Microtubule-associated protein 4
Table 9.5: The shared genes between the boosting and elastic net classifiers and among
five breast cancer data sets (van’t Veer, Pittman, Wang, Mainz, Sotiriou).
9.5 Conclusions
Multiple studies have shown that predictive gene signatures suffer from instability of their membership and a lack of reproducibility in independent studies (see, e.g., [59]). We compared the selected genes of three gene selection methods and evaluated them with five breast cancer data sets. The boosting and elastic net classifiers in the setting with the L1 penalty (the lasso) select the most similar gene sets, which is probably because of a connection between
these two classifiers [57].

Figure 9.3: Shared genes among the gene selection algorithms: (a) boosting and elastic net, (b) boosting and FS (CAT, FNDR), (c) elastic net and FS (CAT, FNDR), (d) boosting, elastic net and FS (CAT, FNDR). The numbers of shared genes are written on the horizontal axes. The vertical axes depict the percentages of shared genes among the selected genes, which were selected over 100 MCCV iterations.

Other comparisons of these classifiers with FS using CAT and
FNDR thresholding did not bring very high gene overlaps. When we compared the selected
genes from the five breast cancer data sets, we found only 14 shared genes. A poor representation of tumors in breast cancer studies can be another problem on top of the problems resulting from high-dimensional data [136]. According to Rakha et al. [55], microarray technology should not be expected to replace current traditional diagnostic algorithms, but should be integrated within these and may contribute additional complementary prognostic information, which should improve patient management. This statement is in accordance with the combined use of clinical and microarray data advocated in this thesis.
Chapter 10
Gene ontology feature selection
10.1 Introduction
Dimension reduction (selecting a small number of genes) is an effective way to improve
the mining efficiency of high-dimensional data as mentioned in Chapter 3. Methods of
dimension reduction can be categorized into two groups: feature selection and feature
extraction. Feature selection chooses a subset of features (genes) from the original ones.
Feature extraction creates a new lower-dimensional feature space from the original high-dimensional feature space. Feature selection algorithms are used more often because the
features they choose are more biologically meaningful. Feature selection methods fall into
filter and wrapper methods. Filter feature selection methods select a feature subset that
is independent of any mining algorithm in contrast to wrapper feature selection methods
that use a given mining algorithm to determine the goodness of a selected subset. We
concentrate on the filter method.
Some filter selection methods evaluate genes in isolation without considering gene-to-gene
relations. They rank genes according to their individual relevance or discriminative
power and select top-ranked genes based on statistical tests. However, genes are well known
to interact with each other through various reactions. Biologists strive to identify the
fundamental mechanisms of the biological processes that gene expression data represent.
Thus, it is desirable to find a subset of features that is more interpretable.
This chapter aims at incorporating gene-to-gene relations and interactions into the
class prediction process. Gene ontology (GO) represents a controlled biological vocabulary
and a repository of computable biological knowledge [89]. The similarity matrix, which is
often used in connection with ontologies, provides a way of presenting gene-to-gene relations.
We present preliminary results with four variants of a method combining GO information
with gene expression data. GO is incorporated into the microarray data at the beginning of the
class prediction process (early integration) and influences the selected features. Based on GO
integration, the features selected by the class prediction method can be more meaningful and
better interpretable.
Feature selection [157, 151] or extraction [38] with GO is not new. GO is widely
used in many fields. Some authors search for expression correlations [38] while others
directly eliminate highly correlated genes [151]. Finding correlated expression patterns is
the goal of gene clustering. Srivastava et al. [174] integrate GO knowledge and expression
data using Bayesian regression mixture models to perform unsupervised clustering of the
samples and identify physiologically relevant discriminative features. Jiang et al. [111]
construct a gene semantic similarity network and then use the network to infer disease
genes. They show that genes with higher semantic similarity scores tend to be associated
with diseases with higher phenotype similarity scores. Cho et al. [38] present a feature
extraction algorithm based on the concept of virtual genes by integrating microarray data
with GO annotations. Virtual genes are groups of genes that potentially interact with
each other and are used to build a sample classifier.
The next section describes approaches that select a subset of genes with incorporated
GO information. The GO-based feature selections are then evaluated with a benchmark breast
cancer data set.
10.2 Combining gene expression data with gene ontology
The proposed approaches integrate GO information into gene selection. GO is divided
into three orthogonal ontologies: (a) biological process (BP), (b) molecular function (MF),
and (c) cellular component (CC). The three ontologies are represented as directed acyclic
graphs in which nodes correspond to terms and their relationships are represented by edges.
GO supports two types of relations: 'is-a' and 'part-of'. Figure 10.3 describes
the microarray data feature selection procedure. Each of the three ontologies
is used separately. At the beginning, a semantic similarity matrix (SSM) is computed for
each ontology, based on the microarray gene IDs converted to EntrezGene IDs. Computing
the SSM for an expression matrix with a dimension in the thousands of genes is
computationally intensive; therefore, we take K random genes.
The SSM can be computed with various measures; an overview can be found in [127].
We calculated the SSM with the method of Wang [191], a graph-structure-based measure
that encodes the semantics of a GO term in a measurable format to enable quantitative
comparison. It determines the semantic similarity of two GO terms based on
both the locations of these terms in the GO graph and their relations with their ancestor
terms. Wang defines the semantic value (S-value) of a term A as the aggregate contribution
of all terms in DAG_A to the semantics of term A; a term closer to term A in DAG_A
contributes more to its semantics [191]. Given two GO terms A and B, Wang's semantic
similarity between these two terms is defined as [191]:
\[
S_{GO}(A, B) = \frac{\sum_{t \in T_A \cap T_B} \bigl( S_A(t) + S_B(t) \bigr)}{SV(A) + SV(B)},
\tag{10.1}
\]

where $S_A(t)$ [$S_B(t)$] is the S-value of GO term $t$ related to term $A$ [$B$], and
$SV(A) = \sum_{t \in T_A} S_A(t)$. A more detailed description can be found in [191].
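As a minimal sketch, the SSM can be computed in R with the 'GOSemSim' package used in our experiments (Section 10.3). The mgeneSim call below follows the package's early API (newer versions first build a GO data object with godata() and pass it as semData), and all.entrez.ids is a placeholder for the vector of converted microarray gene IDs:

# Wang-measure semantic similarity matrix (SSM) for K random genes.
library(GOSemSim)

set.seed(1)
K     <- 2000                          # as in Section 10.3
genes <- sample(all.entrez.ids, K)     # placeholder vector of EntrezGene IDs
# "Wang" selects the graph-structure-based measure of Eq. (10.1);
# ont = "CC" chooses the cellular component ontology.
ssm <- mgeneSim(genes, ont = "CC", organism = "human", measure = "Wang")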
We use a threshold to convert the SSM to an adjacency matrix of ones and zeros: two
genes are connected if their semantic similarity is greater than the threshold. The
adjacency matrix has zeros on the diagonal. We performed the experiments in
the R environment. An illustrative example of the SSM and the adjacency matrix for ten
genes is given in Figure 10.1.
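A minimal sketch of this step follows; the threshold of 0.8 is the value used in the experiments of Section 10.3, whereas the example in Figure 10.1 corresponds to a lower threshold:

# Convert the SSM into a 0/1 adjacency matrix with a zero diagonal.
threshold <- 0.8
adjm <- (ssm > threshold) * 1    # 1 where the similarity exceeds the threshold
diag(adjm) <- 0                  # no self-connections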
> ssm[1:10,1:10]
       1915  2527  7392  1359  9976  4617  4664  8463 80153  9818
1915  1.000 0.627 0.542 0.505 0.282 0.418 0.542 0.542 1.000 0.409
2527  0.627 1.000 0.753 0.708 0.188 0.539 0.753 0.753 0.627 0.542
7392  0.542 0.753 1.000 0.615 0.200 0.677 1.000 1.000 0.449 0.673
1359  0.505 0.708 0.615 1.000 0.147 0.436 0.615 0.615 0.492 0.443
9976  0.282 0.188 0.200 0.147 1.000 0.160 0.200 0.200 0.231 0.164
4617  0.418 0.539 0.677 0.436 0.160 1.000 0.677 0.677 0.351 0.682
4664  0.542 0.753 1.000 0.615 0.200 0.677 1.000 1.000 0.449 0.673
8463  0.542 0.753 1.000 0.615 0.200 0.677 1.000 1.000 0.449 0.673
80153 1.000 0.627 0.449 0.492 0.231 0.351 0.449 0.449 1.000 0.346
9818  0.409 0.542 0.673 0.443 0.164 0.682 0.673 0.673 0.346 1.000

> adjm
      1915 2527 7392 1359 9976 4617 4664 8463 80153 9818
1915     0    0    0    0    0    0    0    0     1    0
2527     0    0    1    1    0    0    1    1     0    0
7392     0    1    0    0    0    1    1    1     0    1
1359     0    1    0    0    0    0    0    0     0    0
9976     0    0    0    0    0    0    0    0     0    0
4617     0    0    1    0    0    0    1    1     0    1
4664     0    1    1    0    0    1    0    1     0    1
8463     0    1    1    0    0    1    1    0     0    1
80153    1    0    0    0    0    0    0    0     0    0
9818     0    0    1    0    0    1    1    1     0    0

Figure 10.1: Examples of the semantic similarity matrix and the adjacency matrix.
A gene connectivity graph is created from the adjacency matrix. The graph for the
illustrative example is depicted in Figure 10.2. In the gene connectivity graph, we look for
the maximum clique or cliques, which identify a set of semantically similar genes. In this
figure, the genes (vertices) belonging to an example of a maximum gene clique are denoted
in red.
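A minimal sketch using the 'igraph' package (as in Section 10.3); the dot-style function names below follow the older igraph API, while newer versions provide graph_from_adjacency_matrix() and largest_cliques():

# Build the gene connectivity graph and extract the maximum gene cliques.
library(igraph)

g <- graph.adjacency(adjm, mode = "undirected")   # vertices named by gene IDs
max.cliques  <- largest.cliques(g)                # all cliques of maximum size
clique.genes <- unique(V(g)$name[unlist(max.cliques)])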
10.2.1 Filtering with gene ontology
Four variants of combining GO information with microarray data were explored (a minimal R sketch of variants A, C and D follows the list):
• Variant A - Microarray data enrichment with gene ontology relations
based on maximum gene cliques
Maximum gene cliques are found. Based on the statistical characteristics of the
microarray data (e.g., mean, variance), we add a value to the gene expression in the
microarray data matrix if the gene is included in the maximum cliques. In the
evaluation, t-statistics feature selection is performed during each MCCV iteration
from the training data.
[Graph omitted; the ten vertices correspond to the ten genes of Figure 10.1.]
Figure 10.2: A gene connectivity graph for ten genes with an example of a clique (vertices
denoted in red).
• Variant B - Microarray data enrichment with gene ontology relations
based on maximum gene cliques and expression thresholding
Maximum gene cliques are found. Based on the statistical characteristics of the
microarray data (e.g., mean, variance), we add a value to the gene expression in the
microarray data matrix if the gene is included in the cliques and its expression is
higher than a threshold, and we subtract a value if the gene is included in the cliques
and its expression is lower than the threshold. In the evaluation, t-statistics
feature selection is performed during each MCCV iteration from the training data.
• Variant C - Microarray data gene ontology feature selection based on
frequencies of each gene connectivity
We use the adjacency matrix and sum its columns (or rows) to obtain a vector v
with the connectivity frequency of each gene. A threshold for this vector is computed
as max(v) − (max(v) − mean(v))/2; genes whose connectivity exceeds this threshold
are selected.
• Variant D - Microarray data gene ontology feature selection based on
maximum gene cliques
Maximum gene cliques are found. Genes included in the maximum gene cliques are
selected from the microarray data.
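The following minimal sketch illustrates variants A, C and D on an expression matrix X (samples in rows, genes in columns, named by EntrezGene IDs). The enrichment value delta is illustrative, and clique.genes comes from the clique sketch above:

# Variant A (sketch): enrich the expression of genes in maximum cliques.
delta  <- 0.5 * sd(as.vector(X))          # illustrative enrichment value
in.clq <- colnames(X) %in% clique.genes
X.A    <- X
X.A[, in.clq] <- X.A[, in.clq] + delta

# Variant C (sketch): select genes by connectivity frequency.
v   <- colSums(adjm)                      # connectivity of each gene
thr <- max(v) - (max(v) - mean(v)) / 2    # threshold given in Variant C
genes.C <- names(v)[v > thr]

# Variant D (sketch): keep exactly the genes in maximum cliques.
X.D <- X[, in.clq]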
The devised methods are evaluated with the same scheme as described in the previous
chapters. The enriched or filtered microarray data are evaluated over 100 MCCV iterations;
t-statistics feature selection, if used, is performed during each MCCV iteration
from the training data. AUCs are evaluated on the test data sets. Support vector machines
(SVM) with a linear kernel and the elastic net with the setting described in Section 7.12
with the L1 penalty are employed as class prediction methods during the evaluation.
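A minimal sketch of this evaluation loop for the SVM branch (the elastic net model would replace the svm() call), assuming a binary outcome y coded 0/1 and a filtered expression matrix X.D; the 2/3 training split and the Wilcoxon–Mann–Whitney AUC computation are assumptions of this sketch:

library(e1071)    # svm(), used with a linear kernel as in Section 10.3

# AUC via the Wilcoxon-Mann-Whitney statistic.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
aucs <- replicate(100, {                          # 100 MCCV iterations
  tr  <- sample(nrow(X.D), round(2/3 * nrow(X.D)))
  fit <- svm(X.D[tr, ], factor(y[tr]), kernel = "linear")
  dv  <- attributes(predict(fit, X.D[-tr, ],
                            decision.values = TRUE))$decision.values
  a <- auc(as.numeric(dv), y[-tr])
  max(a, 1 - a)    # sign of decision values depends on the factor level order
})
c(mean = mean(aucs), sd = sd(aucs))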
[Flowchart omitted; its structure: microarray data with gene IDs converted to EntrezGene
IDs, together with gene ontology information for K genes, feeds the semantic similarity
matrix (with a threshold), which yields the adjacency matrix, the gene connectivity graph,
and the maximum gene cliques; filtering uses the adjacency matrix directly in variant C
and the maximum gene cliques in variants A, B and D, producing the filtered microarray
data.]

Figure 10.3: Gene ontology feature selection. A, B, C, D denote the variants of feature
selection described in this chapter.
10.3 Results
We evaluated the four variants of incorporating GO information into microarray data
described in the previous section, together with two further variants: no feature selection
and t-statistics feature selection. We used the Pittman breast cancer data set for the
evaluations. The experiments were run in the R environment using the packages 'GOSemSim'
and 'igraph' for computing SSMs and working with graphs, respectively. The SVM algorithm
is included in the R package 'e1071'; we used it in the default setting with a linear kernel.
The computations of SSMs were time-consuming: the Pittman data set includes more than 8000
genes, so we decreased the dimension of the SSM by using K = 2,000 random genes.
The proposed approaches were evaluated with the cellular component (CC) ontology, which
proved to be the best in our preliminary experiments [188]. Table 10.1 displays the results.
Variant D reaches the best performance with both class prediction methods. Elastic net
usually performs better than SVM in our experiments.
Feature selection        t-statistics     Method
Ontology   Version       (# of genes)     SVM            Elastic net
−          −             NO (8385)        0.73 ± 0.07    0.75 ± 0.08
−          −             YES 50           0.67 ± 0.08    0.75 ± 0.08
−          −             YES 500          0.73 ± 0.08    0.77 ± 0.08
CC         A             YES 50           0.67 ± 0.08    0.75 ± 0.08
CC         A             YES 500          0.73 ± 0.08    0.77 ± 0.08
CC         B             YES 50           0.66 ± 0.08    0.74 ± 0.08
CC         B             YES 500          0.72 ± 0.08    0.75 ± 0.08
CC         C             NO (405)         0.73 ± 0.07    0.75 ± 0.08
CC         D             NO (397)         0.78 ± 0.07    0.78 ± 0.07

Table 10.1: The four variants A, B, C, D used with the cellular component (CC) ontology
and evaluated with the Pittman data set. AUCs on the test data sets (mean AUCs and
standard deviations) evaluated over 100 MCCV iterations. The t-statistics column indicates
whether genes were selected on the basis of the absolute value of the t-statistics, using
the R package 'st', during each MCCV iteration from the training data.
The number of selected genes also depends on the threshold applied to the adjacency matrix;
we used a threshold of 0.8 for all experiments. Full-scale testing of thresholds and other
parameters is needed.
10.4 Conclusion
We proposed four variants of using GO information with gene expression data. The GO
information is incorporated at the stage of early integration in order to improve binary class
prediction. The selected feature sets were enriched with genes with respect to gene-to-gene
relations and mutual interactions. Similarity matrices were computed with the Wang measure,
which is based on the GO graph structure. Microarray data gene ontology feature selection
based on maximum gene cliques (Variant D) with the CC gene ontology improved the
prediction performance.
In the future, research will deal with ways to accelerate SSM computing and to make
this calculation feasible for all genes of a high-dimensional gene expression matrix.
Future research will also include full-scale testing of various parameters with all
three ontologies, evaluation of the methods with more data sets, a comparison of selected
features, and an assessment of other GO-based feature selection methods. Using pathways
and other resources in addition to GO should also bring improvements.
Chapter 11
Conclusions
This thesis deals with disease outcome prediction from microarray data. Microarray data suffers
from high dimensionality, inadequate sample sizes and low signal-to-noise ratios, which influence
the resulting prediction models. The cornerstone of this thesis is that combining microarray
data with other sources can result in better predictions of disease outcome.
The main part focuses on combinations with clinical data. We used logistic regression
for clinical data and boosting for gene expression data. We combined data at the level
of late integration. We extended this approach with pre-validation of models built with
microarray and clinical data followed by weight calculations. Evaluations were performed
on real data sets as well as on several redundant and non-redundant simulated data sets
to test our approach in various settings. Results show that combining microarray gene
expression and clinical data improves prediction performance depending on the quality of
data. Applying this approach without pre-validation to non-redundant data sets yields outcome
prediction improvements, while the method does not produce more accurate
predictions with redundant data sets. Pre-validation of the constructed models increases the
performance in the case of redundant data sets.
Next, we experimented with alternative approaches. We employed an elastic net with
high-dimensional data instead of boosting and used it in a joint classifier with logistic
regression. We also evaluated classifiers combining gene expression, clinical and SNP
data. This demonstrates that our approach can be used with more than two data sources.
The presented approaches, which offer a solution for constructing a classifier combining the
two types of data, can also be used for determining the additional predictive value of
microarray data. The designed two-step method was evaluated and compared to an alternative
approach. Both compared methods made the same decision about the additional
predictive value for all the breast cancer data sets.
Class prediction studies aim at identifying sets of genes (signatures) that can accurately
classify new data. It is interesting to compare the selected features of the described classifiers on
different data sets. Unfortunately, gene signatures resulting from different studies usually
have very few genes in common [172]. We compared selected genes of three feature selection
methods evaluated on five breast cancer data sets. Boosting with CWLLS and elastic net
classifiers in the setting with the L1 penalty (the lasso) provided the most similar gene
sets.
Incorporating gene-to-gene relations and interactions into the class prediction process
can increase prediction accuracy and can help interpret results. The similarity matrix,
which is often used in connection with ontologies, provides a natural way to display gene-to-gene relations. Four variants of feature selection methods combining gene ontology with
gene expression data were presented. Evaluations were performed on a real benchmark
data set. The feature selection based on maximum gene cliques (Variant D) improved the
prediction performance.
As mentioned above, the described approach enables combining more than two or three
sources of data. The prediction accuracy of a combination of microarray and other data
depends on the complementarity of the data sources. If the data models are complementary, i.e.
they contain non-redundant information, their combination can bring increased prediction
accuracy. Moreover, prediction accuracy also depends on the quality of the data. Data
production procedures are constantly progressing and being innovated. Integrating distinct and multiple
information resources is and will be an important task in the future. Our future work
can focus on additional data types. According to several recently published studies [44,
201, 76], pathways, microRNA, arrayCGH and other gene-related information may help
to predict cancer outcomes. Metabolic pathway information from KEGG and OPHID
protein-protein interactions outperformed other sources in the study of Daemen et al. [44].
Promising classifiers can be, for example, Bayesian networks, because they are able to
integrate heterogeneous data. Networks make it possible to keep relations between data
and guarantee better interpretability of the results.
Bibliography
[1] AlzGene – Database of genetic association studies performed on Alzheimer’s disease
[cited 1 September 2010]. Available from: http://www.alzgene.org/.
[2] CAMDA – Critical Assessment of Microarray Data Analysis [cited 5 June 2010].
Available from: http://camda.bioinfo.cipf.es/camda07/datasets/.
[3] CDC – Centers for Disease Control [cited 25 May 2010]. Available from: http://www.cdc.gov/.
[4] Gene ID converter [cited 12 January 2012]. Available from: http://idconverter.bioinfo.cnio.es/.
[5] Gene Ontology tools – Gene Ontology tools for analysis of microarray gene expression
data [cited 12 April 2010]. Available from: http://www.geneontology.org/GO.tools.microarray.shtml.
[6] Mainz breast cancer data set: breastCancerMAINZ R package [cited 2 June 2011].
Available from: http://www.bioconductor.org/packages/2.8/data/experiment/.
[7] MEDLINE – biomedical literature repository [cited 22 April 2009]. Available from:
http://mbr.nlm.nih.gov/index.shtml.
[8] OMIM – Online Mendelian Inheritance in Man [cited 13 November 2010]. Available
from: http://www.ncbi.nlm.nih.gov/omim.
[9] S. Aerts, D. Lambrechts, S. Maity, P. Van Loo, B. Coessens, F. De Smet, L. C.
Tranchevent, B. De Moor, P. Marynen, B. Hassan, P. Carmeliet, and Y. Moreau.
Gene prioritization through genomic data fusion. Nature Biotechnology, 24:537–544,
2006.
[10] M. Ahdesmäki and K. Strimmer. Feature selection in omics prediction problems
using cat scores and false nondiscovery rate control. Ann. Appl. Stat., 4(1):503–519,
2010.
[11] H. Ahn, H. Moon, M. J. Fazzari, N. Lim, J. J. Chen, and R. L. Kodell. Classification by ensembles from random partitions of high-dimensional data. Computational
Statistics Data Analysis, 51(12):6166–6179, 2007.
[12] H. Akaike. A new look at the statistical model identification. IEEE Trans. Automat.
Contr., 19(6):716–723, 1974.
[13] M. Alaminos, V. Davalos, and N. K. Cheung. Clustering of gene hypermethylation
associated with clinical risk groups in neuroblastoma. J Natl Cancer Inst, 96:1208–
1219, 2004.
[14] C. F. Aliferis, A. Statnikov, and I. A. Tsamardinos. Challenges in the analysis of
mass-throughput data: a technical commentary from the statistical machine learning
perspective. Canc Inform, (2):133–162, 2007.
[15] O. Alter, P. O. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. PNAS, 97(18):10101–10106, 2000.
[16] J. C. Alwin, D. J. Kemp, and G. R. Stark. Methods for detection of specific rnas in
agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with dna
probes. Proc. Natl. Acad. Sci. USA, 74:5350–5354, 1977.
[17] J. Andrews, W. Kennette, J. Pilon, A. Hodgson, A. B. Tuck, A. F. Chambers, and
D. I. Rodenhise. Multi-platform whole-genome microarray analyses refine the epigenetic signature of breast cancer metastasis with gene expression and copy number.
PLoS ONE, 5(1):1–17, 2010.
[18] M. Ayers, W. F. Symmans, J. Stec, A. I. Damokosh, E. Clark, K. Hess, and
M. Lecocke et al. Gene expression profiles predict complete pathologic response
to neoadjuvant paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide
chemotherapy in breast cancer. Jour. of Clin. Oncol., 22:2284–2293, 2004.
[19] C. A. Ball and A. Brazma. Mged standards: work in progress. OMICS, 10(2):138–
144, 2006.
[20] A. Barrier, P. Y. Boelle, A. Lemoine, A. Flahault, S. Dudoit, and M. Huguier. Gene
expression profiling in colon cancer. Bull. Acad. Natl. Med., (6):1091–1103, 2007.
[21] H. Binder and M. Schumacher. Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics, 9:1–10,
2008.
[22] D. A. Bitton, M. J. Okoniewski, Y. Connolly, and C. J. Miller. Exon level integration
of proteomics and microarray data. BMC Bioinformatics, 9(118):1–11, 2008.
[23] T. Bo and I. Johanssen. New feature subset selection procedures for classification
of expression profiles. Genome Biol, (4):1–6, 2002.
[24] B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison of
normalization methods for high density oligonucleotide array data based on variance
and bias. Bioinformatics, 19:185–193, 2003.
[25] A. L. Boulesteix and T. Hothorn. Testing the additional predictive value of high-dimensional molecular data. BMC Bioinformatics, 78:1–11, 2010.
[26] A. L. Boulesteix, C. Porzelius, and M. Daumer. Microarray-based classification and
clinical predictors: on combined classifiers and additional predictive value. Bioinformatics, 24:1698–1706, 2008.
[27] A. L. Boulesteix, C. Strobl, T. Augustin, and M. Daumer. Evaluating microarray-based classifiers: An overview. Cancer Inform, 6:77–97, 2008.
[28] A. P. Bradley. The use of the area under the roc curve in the evaluation of machine
learning algorithms. Pattern Recognition, 30:1145–1159, 1997.
[29] L. Breiman. Bagging predictors. Journal Machine Learning, 24(2):123–140, 1996.
[30] L. Breiman. Prediction games and arcing algorithms. Neural Computation,
11(7):1493–1517, 1999.
[31] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[32] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and
regression trees. Wadsworth Brooks, 1984.
[33] P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares,
and D. Haussler. Knowledge-based analysis of microarray gene expression data by
using support vector machines. Proc Natl Acad Sci, 97(1):262–267, 2000.
[34] P. Bühlmann. Boosting for high-dimensional linear models. Ann. Statist., 34(2):559–
583, 2006.
[35] P. Bühlmann and T. Hothorn. Boosting algorithms: Regularization, prediction and
model fitting. Statist. Sci., 22:477–505, 2007.
[36] P. Cahan, F. Rovegno, D. Mooney, J. C. Newman, G. S. Laurent, and T. A. McCaffrey. Meta-analysis of microarray results: challenges, opportunities, and recommendations for standardization. Gene, 401:12–18, 2007.
[37] G. Chen, T. G. Gharib, C. C. Huang, J. M. Taylor, D. E. Misek, S. L. Kardia, T. J.
Giordano, M. D. Iannettoni, M. B. Orringer, S. M. Hanash, and D. G. Beer. Discordant protein and mrna expression in lung adenocarcinomas. Mol Cell Proteomics,
1:304–313, 2002.
[38] Y. R. Cho, X. Xu, W. Hwang, and A. Zhang. Feature extraction from microarray expression data by integration of semantic knowledge. Sixth International Conference
on Machine Learning and Applications, pages 606–611, 2007.
[39] H. Y. Chuang, E. Lee, Y. T. Liu, D. Lee, and T. Ideker. Network-based classification
of breast cancer metastasis. Mol Syst Biol, 3(140):1–10, 2007.
[40] F. H. C. Crick. Central dogma of molecular biology. Nature, 227:561–563, 1970.
[41] M. Cristofanilli, G. T. Budd, M. J. Ellis, A. Stopeck, J. Matera, M. C. Miller, J. M.
Reuben, G. V. Doyle, W. J. Allard, L. W. Terstappen, and D. F. Hayes. Circulating
tumor cells, disease progression, and survival in metastatic breast cancer. N. Engl. J. Med.,
351(8):781–791, 2004.
[42] A. Daemen, O. Gevaert, T. De Bie, A. Debucquoy, J. P. Machiels, B. De Moor,
and K. Haustermans. Integrating microarray and proteomics data to predict the
response on cetuximab in patients with rectal cancer. Pac Symp Biocomput, pages
166–177, 2008.
[43] A. Daemen, O. Gevaert, F. Ojeda, A. Debucquoy, J. A. Suykens, C. Sempoux, J. P.
Machiels, K. Haustermans, and B. De Moor. A kernel-based integration of genomewide data for clinical decision support. Genome Med, 1(4):1–17, 2009.
[44] A. Daemen, M. Signoretto, O. Gevaert, J. A. Suykens, and B. De Moor. Improved
microarray-based decision support with graph encoded interactome data. PLoS One,
5(4):1–16, 2010.
[45] W. S. Dalton and S. H. Friend. Cancer biomarkers – an invitation to the table.
Science, 312:1165–1168, 2006.
[46] L. M. Davis, C. Harris, L. Tang, P. Doherty, P. Hraber, Y. Sakai, T. Bocklage,
K. Doeden, B. Hall, and J. Alsobrook. Amplification patterns of three genomic
regions predict distant recurrence in breast carcinoma. The Journal of Molecular
Diagnostics, 9(3):327–336, 2007.
[47] E. de Hoffmann and V. Stroobant. Mass Spectrometry: Principles and Applications.
Biddles Ltd, 1999.
[48] M. Dettling and P. Bühlmann. Boosting for tumor classification with gene expression
data. Bioinformatics, 19(9):1061–1069, 2003.
[49] M. Dowsett, J. Houghton, C. Iden, J. Salter, J. Farndon, R. A’Hern, R. Sainsbury,
and M. Baum. Benefit from adjuvant tamoxifen therapy in primary breast cancer
patients according oestrogen receptor, progesterone receptor, egf receptor and her2
status. Annals of Oncology, 17(5):818–826, 2006.
[50] M. Dowsett, I. E. Smith, S. R. Ebbs, J. M. Dixon, and A. Skene et al. Prognostic
value of ki67 expression after short-term presurgical endocrine therapy for primary
breast cancer. JNCI, 99:167–170, 2007.
[51] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. New York Willey,
2001.
[52] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods
for the classification of tumors using gene expression data. Jour Amer Stat Asoc,
97(457):77–87, 2002.
[53] S. Dudoit, J. P. Shaffer, and J. C. Boldrick. Multiple hypothesis testing in microarray
experiments. Statistical science, (18):71–103, 2003.
[54] A. Dupuy and R. M. Simon. Critical review of published microarray studies for
cancer outcome and guidelines on statistical analysis and reporting. J. Natl. Cancer.
Inst., 99(2):147–157, 2007.
[55] E. A. Rakha, M. E. El-Sayed, J. S. Reis-Filho, and I. O. Ellis. Expression profiling technology: its contribution to our understanding of breast cancer.
Histopathology, 1(52):67–81, 2008.
[56] P. Eden, C. Ritz, C. Rose, M. Ferno, and C. Peterson. "Good old" clinical markers
have similar power in breast cancer prognosis as microarray gene expression profilers.
Eur. J. Cancer, 40:1837–1841, 2004.
[57] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann.
Statist., 32(2):407–499, 2004.
[58] P. H. C. Eilers, J. M. Boer, G. J. van Ommen, and H. C. van Houwelingen. Classification of microarray data with penalized logistic regression. Proc. SPIE, 4266:187–198,
2001.
[59] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany. Outcome signature genes
in breast cancer: is there a unique set? Bioinformatics, 2(21):171–178, 2005.
[60] D. A. Elizondo, B. N. Passow, R. Birkenhead, and A. Huemer. Principal Manifolds
for Data Visualization and Dimension Reduction. Springer Berlin Heidelberg, 2007.
[61] J. D. Emerson and D. C. Hoaglin. Analysis of two-way tables by medians. In Understanding Robust and Exploratory Data Analysis, John Wiley & Sons, New York, pages
166–206, 1983.
[62] D. Engler and Y. Li. Survival analysis with high-dimensional covariates: an application in microarray studies. Stat Appl Genet Mol Biol, 8(14):1–20, 2009.
[63] A. Ergun, C. A. Lawrence, M. A. Kohanski, T. A. Brennan, and J. J. Collins. A
network biology approach to prostate cancer. Mol Syst Biol, 3(82):1–6, 2007.
[64] J. Fan and J. Lv. A selective overview of variable selection in high dimensional
feature space (invited review article). Statistica Sinica, (20):101–148, 2010.
[65] A. Fernandez-Teijeiro, R. A. Betensky, L. M. Sturla, J. Y. Kim, P. Tamayo, and
S. L. Pomeroy. Combining gene expression profiles and clinical parameters for risk
stratification in medulloblastomas. J Clin Oncol., 22(6):994–998, 2004.
[66] J. Fox. An R and S Plus Companion to Applied Regression. SAGE Publications,
2002.
[67] J. Fridlyand, J. Snijders, and B. Ylstra. Breast tumor copy number aberration
phenotypes and genomic instability. BMC Cancer, 6:1–13, 2006.
[68] J. Fridlyand and J. Y. H. Yang. DENMARKLAB R package. Advanced microarray data analysis: Class discovery and class prediction [cited 21 September 2009].
Available from: http://genome.cbs.dtu.dk/courses/norfa2004/Extras/.
[69] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Ann. Appl. Stat., 1:302–332, 2007.
[70] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Ann.
Statist., 29:1189–1232, 2001.
[71] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33(1):1–24,
2010.
[72] N. Friedman and M. Goldszmidt. Learning bayesian networks with local structure.
Learning in Graphical Models, pages 421–460, 1999.
[73] K. Fukuda, S. E. Straus, I. Hickie, M. C. Sharpe, J. G. Dobbins, and A. Komaroff.
The chronic fatigue syndrome: a comprehensive approach to its definition and study.
international chronic fatigue syndrome study group. Ann Intern Med, 121:953–959,
1994.
[74] R. Gentleman, W. Huber, V. J. Carey, R. A. Irizarry, and S. Dudoit. Bioinformatics
and Computational Biology Solutions Using R and Bioconductor. Springer, 2005.
ISBN-10: 0-387-25146-4.
[75] E. I. George. The variable selection problem. American Statistical Association,
95(452):1304–1308, 2000.
[76] O. Gevaert and B. de Moor. Prediction of cancer outcome using dna microarray technology: past, present and future. Expert Opinion on Medical Diagnostics,
3(2):157–165, 2009.
[77] O. Gevaert, F. De Smet, D. Timmerman, Y. Moreau, and B. De Moor. Predicting the
prognosis of breast cancer by integrating clinical and microarray data with bayesian
networks. Bioinformatics, 22:147–157, 2007.
[78] O. Gevaert, S. Van Vooren, and B. de Moor. Integration of microarray and textual
data improves the prognosis prediction of breast, lung and ovarian cancer patients.
Pac Symp Biocomput, pages 279–290, 2008.
[79] L. J. Goldstein, R. Gray, S. Badve, B. H. Childs, C. Yoshizawa, S. Rowley, S. Shak,
F. L. Baehner, P. M. Ravdin, N. E. Davidson, G. W. Sledge, E. A. Perez, L. N.
Shulman, S. M. Sparano, and J. A. Sparano. Prognostic utility of the 21-gene
assay in hormone receptorâpositive operable breast cancer compared with classical
clinicopathologic features. Jour. of Clin. Oncol., 26:4063–4071, 2008.
[80] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller,
M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression
monitoring. Science, (286):531–537, 1999.
[81] Early Breast Cancer Trialists’ Collaborative Group. Tamoxifen for early breast
cancer: an overview of the randomised trials. Lancet, 351(8):1451–1467, 1998.
[82] S. K. Gruvberger, M. Ringner, and P. Eden. Expression profiling to predict outcome
in breast cancer: the influence of sample selection. Breast Cancer Res., 5(1):23–26,
2003.
[83] G. Guizzardi. Ontological Foundations for Structural Conceptual Models. PhD thesis,
Telematica Instituut Fundamental Research Series, 2005.
[84] I. Guyon, J. Weston, S. M. D. Barnhill, and V. Vapnik. Gene selection for cancer
classification using support vector machines. Machine Learning, (46):389–422, 2002.
[85] S. P. Gygi, B. Rist, S. A. Gerber, F. Turecek, M. H. Gelb, and R. Aebersold.
Quantitative analysis of complex protein mixtures using isotopecoded affinity tags.
Nat Biotechnol, 17:994–999, 1999.
[86] J. S. Hamid, P. Hu, N. M. Roslin, V. Ling, C. M. T. Greenwood, and J. Beyene. Data
integration in genetics and genomics: Methods and challenges. Human Genomics
and Proteomics, 2009:1–13, 2009.
[87] J. Han and M. Kamber. Data Mining, Concepts and Techniques. Ed. 2. Morgan
Kaufmann, San Francisco, 2005.
[88] J. W. Hardin and J. M. Hilbe. Generalized Estimating Equations. Chapman &
Hall/CRC, 2003.
[89] M. A. Harris, J. Clark, and A. Ireland. The gene ontology (go) database and informatics resource. Nucleic Acids Res, 32:258–261, 2004.
[90] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young. Using graphical models and genomic expression data to statistically validate models of gene
regulatory networks. In Proceedings of PSB'01, pages 1–12, 2000.
[91] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for
the support vector machine. Journal of Machine Learning Research, 5:1391–1415,
2004.
[92] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification.
IEEE Trans. Pattern Anal. Mach. Intell., 18(6):607–615, 1996.
[93] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning.
Springer, 2001.
[94] T.J. Hastie and R.J. Tibshirani. Generalized Additive Models. Chapman and Hall,
1990.
[95] D. Heckerman, D. Geiger, and D. Chickering. Learning bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.
[96] P. Helman, R. Veroff, S. Atlas, and C. Willman. A bayesian network classification
methodology for gene expression data. Journal of Comp. Biology, 11(4):581–615,
2004.
[97] A. Hoerl and R. Kennard. Ridge regression in ‘Encyklopedia of Statistical Sciences’.
Wiley, New York, 1988.
[98] H. Hofling and R. Tibshirani. A study of pre-validation. Ann. Appl. Stat., 2(2):643–
664, 2008.
[99] J. M. Hoskins, L. A. Carey, and H. L. McLeod. Cyp2d6 and tamoxifen: Dna matters
in breast cancer. Nature Reviews Cancer, 9:576–586, 2009.
[100] D. W. Hosmer and S. Lemeshow. Applied Logistic Regression. Wiley, 2000.
[101] T. Hothorn and P. Bühlmann. mboost: Model-Based Boosting. R package version
0.5-8 [cited 26 September 2008]. Available from: http://CRAN.R-project.org/.
[102] T. Hothorn, F. Leisch, K. Hornik, and A. Zeileis. The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics, pages 1–22,
2005.
[103] X. Huang, W. Pan, S. Grindle, X. Han, Y. Chen, S. J. Park, L. W. Miller, and
J. Hall. A comparative study of discriminating human heart failure etiology using
gene expression profiles. BMC Bioinformatics, 6(205):1–15, 2005.
[104] J. P. Ioannidis, N. A. Patsopoulos, and E. Evangelou. Heterogeneity in meta-analyses
of genome-wide association investigations. PLoS One, 2(9):1–7, 2007.
[105] M. V. Iorio, M. Ferracin, and C. G. Liu. Microrna gene expression deregulation in
human breast cancer. Cancer Res, 65:7065–7070, 2005.
[106] R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf,
and T. P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostat, 4:249–264, 2003.
[107] S. Jackman. Generalized linear models. Stanford University, pages 1–7, 2003.
[108] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition: a rewiew.
IEEE Trans Pattern Anal Mach Intel, (22):4–37, 2000.
[109] C. R. James, J. E. Quinn, P. B. Mullan, P. G. Johnston, and D. P. Harkin. Brca1,
a potential predictive biomarker in the treatment of breast cancer. Oncologist,
12(2):142–150, 2007.
[110] H. Jiang, Y. Deng, H. S. Chen, L. Tao, Q. Sha, J. Chen, C. J. Tsai, and S. Zhang.
Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics, 5(81):1–12, 2004.
[111] R. Jiang, M. Gan, and P. He. Constructing a gene semantic similarity network for
the inference of disease genes. BMC Systems Biology, 5(2):1–11, 2011.
[112] J. M. Koomen, E. B. Haura, and G. Bepler. Proteomic contributions to personalized cancer care. Mol Cell Proteomics, 7(10):1780–1794, 2008.
[113] K. Y. Kim, B. J. Kim, and G. S. Yi. Reuse of imputed data in microarray analysis
increases imputation efficiency. BMC Bioinformatics, 5:1–9, 2004.
[114] H. Kehrer-Sawatzki and D. N. Cooper. Copy number variation and disease. Cytogenetic and Genome Research, 123:1–123, 2008.
[115] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann,
F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer. Classification and diagnostic prediction of cancers using gene expression profiling and
artificial neural networks. Nature Medicine, 7:673–679, 2001.
[116] M. J. Khoury, M. Gwinn, P. W. Yoon, N. Dowling, C. A. Moore, and L. Bradley.
The continuum of translation research in genomic medicine: how can we accelerate
the appropriate integration of human genome discoveries into health care and disease
prevention? Genetics in Medicine, (9):665–674, 2007.
[117] T. Knickerbocker, J. R. Chen, R. Thadhani, and G. MacBeath. An integrated
approach to prognosis using protein microarrays and nonparametric methods. Mol
Syst Biol, 3(123):1–8, 2007.
[118] P. Y. Kwok. Single Nucleotide Polymorphisms: Methods and Protocols. Humana
Press Inc., 2003.
[119] C. Kyprianidou. Analysing basic genetics using bayesian networks and the impact
of genetic testing on the insurance industry, diploma thesis. Technical report, Cass
Business School, City University, 2003.
[120] P. W. Laird. The power and the promise of dna methylation markers. Nat Rev
Cancer, 3:253–266, 2003.
[121] G. R. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A
statistical framework for genomic data fusion. Bioinformatics, 20(16):2626–2635,
2004.
[122] E. Lee, H. Y. Chuang, J. W. Kim, T. Ideker, and D. Lee. Inferring pathway activity
toward precise disease classification. PLoS Comput Biol, 4(11):1–9, 2008.
[123] J. W. Lee, J. B. Lee, M. Park, and S. Song. An extensive comparison of
recent classification tools applied to microarray data. Computational Statistics Data
Analysis, 48(4):869–885, 2005.
[124] S. I. Lee and S. Batzoglou. Application of independent component analysis to microarrays. Genome Biology, 4(11):1–21, 2003.
[125] L. Li. Survival prediction of diffuse large-b-cell lymphoma based on both clinical
and gene expression information. Bioinformatics, 22(4):466–471, 2006.
[126] R. J. Lipshutz, S. P. Fodor, T. R. Gingeras, and D. J. Lockhart. High density
synthetic oligonucleotide arrays. Nature Genetics, 21:20–24, 1999.
[127] P. W. Lord, R. D. Stevens, and A. Brass. Semantic similarity measures as tools for
exploring the gene. Pac. Symp. of Biocomputing, (8):601–612, 2003.
[128] J. Lu, G. Getz, and E. Miska. Microrna expression profiles classify human cancers.
Nature, 435:834–838, 2005.
[129] R. W. Lutz, M. Kalisch, and P. Bühlmann. Robustified l2 boosting. Computational
Statistics Data Analysis, 52(7):1–12, 2008.
[130] S. Ma and J. Huang. Regularized gene selection in cancer microarray meta-analysis.
BMC Bioinformatics, 10(1):1–12, 2009.
[131] G. MacBeath. Protein microarrays and proteomics. Nat Genet, 32:526–532, 2002.
[132] H. B. Mann and D. R. Whitney. On a test whether one of two random variables is
stochastically larger than the other. Ann. Math. Statist., 18:50–60, 1947.
[133] R. D. Mass, M. F. Press, S. Anderson, M. A. Cobleigh, C. L. Vogel, N. Dybdal,
G. Leiberman, and D. J. Slamon. Evaluation of clinical outcomes according to her2
detection by fluorescence in situ hybridization in women with metastatic breast
cancer treated with trastuzumab. Journal Clinical Breast Cancer, 6(3):240–246,
2005.
[134] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall,
1989.
[135] G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley,
1992.
[136] S. Mehta, A. Shelling, A. Muthukaruppan, A. Lasham, C. Blenkiron, G. Laking, and
C. Print. Predictive and prognostic molecular markers for cancer medicine. Ther.
Adv. Med. Oncol., 2(2):125–148, 2010.
[137] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression.
Journal of the Royal Statistical Society, 70:53–71, 2008.
[138] T. M. Mitchell. Machine learning. McGraw-Hill, 1997.
[139] F. Model, P. Adorjan, A. Olek, and C. Piepenbrock. Feature selection for dna
methylation based cancer classification. Bioinformatics, (17):157–164, 2001.
[140] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. P. Mesirov, and T. Poggio. Support vector machine classification of microarray data. Massachusetts Institute of Technology, pages 59–60, 1999.
[141] Y. Murakami, T. Yasuda, and K. Saigo. Comprehensive analysis of microrna expression patterns in hepatocellular carcinoma and non-tumorous tissues. Oncogene,
25(17):2537–2545, 2006.
[142] D. V. Nguyen and D. M. Rocke. Multi-class cancer classification via partial least
squares with gene expression profiles. Bioinformatics, 18:1216–1226, 2002.
[143] G. L. Nicolson, M. Y. Nasralla, K. De Meirleir, and J. Haier. Bacterial and viral
co-infections in chronic fatigue syndrome (cfs/me) patients. Proc. Clinical Scientific Conference on Myalgic Encephalopathy/Chronic Fatigue Syndrome, pages 1–14,
2002.
[144] S. Niijima and S. Kuhara. Effective nearest neighbor methods for multiclass cancer
classification using microarray data. pages 1–2.
[145] J. Novovicova, P. Pudil, and J. Kittler. Divergence based feature selection for multimodal class densities. IEEE Trans Pattern Anal Mach Intel, (18):218–223, 1996.
[146] M. C. O’Neil and L. Song. Neural network analysis of lymphoma microarray data:
prognosis and diagnosis near-perfect. BMC Bioinformatics, 4(13):1–12, 2003.
[147] I. M. Ong, J. D. Glasner, and D. Page. Modeling regulatory pathways in e. coli from
time series expression profiles. Bioinformatics, 18:241–248, 2002.
[148] C. H. Ooi and P. Tan. Genetic algorithms applied to multi-class prediction for the
analysis of gene expression data. Bioinformatics, (19):37–44, 2003.
[149] K. Owzar, W. T. Barry, S. H. Jung, I. Sohn, and S. L. George. Statistical challenges
in preprocessing in microarray experiments in cancer. 14:5959–5966, 2008.
[150] S. Paik, S. Shak, G. Tang, C. Kim, J. Baker, M. Cronin, and F. L. Baehner et al.
A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast
cancer. N. Engl. J. Med., 351:2817–2826, 2004.
[151] G. Papachristoudis, S. Diplaris, and P. A. Mitkas. Sofocles: Feature filtering for
microarray classification based on gene ontology. Journal of Biomedical Informatics,
43(1):1–14, 2010.
[152] M. Y. Park and T. Hastie. l1-regularization path algorithm for generalized linear
models. Journal of the Royal Statistical Society Series B, 69:659–677, 2007.
[153] C. M. Perou, T. Sorlie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, C. A. Rees,
J. R. Pollack, D. T. Ross, H. Johnsen, L. A. Akslen, O. Fluge, A. Pergamenschikov,
C. Williams, S. X. Zhu, P. E. Lonning, A. L. Borresen-Dale, P. O. Brown, and
D. Botstein. Molecular portraits of human breast tumours. Nature, (406):747–752,
2000.
[154] D. Pinkel and D. G. Albertson. Array comparative genomic hybridization and its
applications in cancer. Nat Genet, 37:11–17, 2005.
[155] J. Pittman, E. Huang, and H. Dressman.
Integrated modeling of clinical
and gene expression information for personalized prediction of disease outcomes.
Proc.Natl.Acad.Sci., 101(22):8431–8436, 2004.
[156] S. Pounds, X. Cao, C. Cheng, J. Yang, D. Campana, W. E. Evans, C. H. Pui, and
M. V. Relling. Integrated analysis of pharmacokinetic, clinical, and snp microarray
data using projection onto the most interesting statistical evidence with adaptive
permutation testing. Bioinformatics and Biomedicine, IEEE International Conference on, pages 203–209, 2009.
[157] J. Qi and J. Tang. Integrating gene ontology into discriminative powers of genes for
feature selection in microarray data. Proceedings of the 2007 ACM symposium on
Applied computing, pages 430–434, 2007.
[158] J. Quackenbush. Computational analysis of microarray data. Nature Reviews Genetics, 2(2):418–427, 2001.
[159] D. R. Rhodes, J. Yu, K. Shanker, N. Deshpande, R. Varambally, D. Ghosh, T. Barrette, A. Pandey, and A. M. Chinnaiyan. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation
and progression. PNAS, 101:9309–9314, 2004.
[160] B. Z. Ring, R. S. Seitz, R. Beck, W. J. Shasteen, S. M. Tarr, M. C.U. Cheang,
B. J. Yoder, G. T. Budd, T. O. Nielsen, D. G. Hicks, N. C. Estopinal, and
D. Ross. Novel prognostic immunohistochemical biomarker panel for estrogen receptor–positive breast cancer. Jour. of Clin. Oncol., 24:3039–3047, 2006.
[161] J. S. Ross. Multigene classifiers, prognostic factors, and predictors of breast cancer
clinical outcome. Adv. Anat. Pathol., 16(4):204–215, 2009.
[162] R. Schapire. Strength of weak learnability. Machine Learning, 5:197–227, 1990.
[163] M. Schmidt, D. Boehm, C. von Toerne, E. Steiner, A. Puhl, H. Pilch, H. A. Lehr,
J. G. Hengstler, H. Koelbl, and M. Gehrmann. The humoral immune system has a
key prognostic impact in node-negative breast cancer. Cancer Research, 68(13):5405–
5413, 2008.
[164] G. Schwarz. Estimating the dimension of a model. Ann. Stat., 6(2):461–464, 1978.
[165] H. Shatkay and R. Feldman. Mining the biomedical literature in the genomic era:
an overview. J Comput Biol, 10(6):821–855, 2003.
[166] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown. Quantitative monitoring of
gene expression patterns with a complementary dna microarray. Science, 270:467–470, 1995.
[167] R. Simon. Diagnostic and prognostic prediction using gene expression profiles in
high-dimensional microarray data. British Journal of Cancer, 89:1599–1604, 2003.
[168] R. M. Simon. Using dna microarrays for diagnostic and prognostic prediction. Expert
Rev. Mol. Diagn., 3(5):587–595, 2003.
[169] R. M. Simon. Development and validation of biomarker classifiers for treatment
selection. J. Stat. Plan. Inference, 138(2):308–320, 2008.
[170] A. K. Smith, P. D. White, E. Aslakson, U. Vollmer-Conna, and M. S. Rajeevan.
Polymorphisms in genes regulating the hpa axis associated with empirically delineated classes of unexplained chronic fatigue. Pharmacogenomics, 7:387–394, 2006.
[171] C. Sotiriou, S. Y. Neo, L. M. McShane, E. L. Korn, P. M. Long, A. Jazaeri, P. Martiat, S. B. Fox, A. L. Harris, and E. T. Liu. Breast cancer classification and prognosis
based on gene expression profiles from a population-based study. PNAS, 100:10393–10398, 2003.
[172] C. Sotiriou and L. Pusztai. Gene-expression signatures in breast cancer. N. Engl.
J. Med., (360):790–800, 2009.
[173] I. Spasic, S. Ananiadou, J. McNaught, and A. Kumar. Text mining and ontologies in
biomedicine: Making sense of raw text. Briefings in Bioinformatics, 6(3):239–251,
2005.
[174] S. Srivastava, L. Zhang, R. Jin, and C. Chan. A novel method incorporating gene
ontology information for unsupervised clustering and feature selection. PLoS One,
3(12):1–8, 2008.
[175] F. J. T. Staal, M. van der Burg, L. F. A. Wessels, B. H. Barendregt, M. R. M. Baert,
C. M. M. van den Burg, C. van Huffel, A. W. Langerak, V. H. J. van der Velden,
M. J. T. Reinders, and J. J. M. van Dongen. DNA microarrays for comparison of gene
expression profiles between diagnosis and relapse in precursor-B acute lymphoblastic
leukemia: choice of technique and purification influence the identification of potential
diagnostic markers. Leukemia, 17:1324–1332, 2003.
[176] A. Statnikov, L. Wang, and C. Aliferis. A comprehensive comparison of random
forests and support vector machines for microarray-based cancer classification. BMC
Bioinformatics, 9:1–10, 2008.
[177] A. J. Stephenson, A. Smith, M. W. Kattan, J. Satagopan, V. E. Reuter, P. T.
Scardino, and W. L. Gerald. Integration of gene expression profiling and clinical
variables to predict prostate carcinoma recurrence after radical prostatectomy. Cancer, 104(2):290–298, 2005.
[178] Y. Sun. Improved breast cancer prognosis through the combination of clinical and
genetic markers. Bioinformatics, 23:30–37, 2007.
[179] Q. Tian, S. B. Stepaniants, M. Mao, L. Weng, M. C. Feetham, M. J. Doyle, E. C.
Yi, H. Dai, V. Thorsson, and J. Eng. Integrated genomic and proteomic analyses of
gene expression in mammalian cells. Mol Cell Proteomics, 3:960–969, 2004.
[180] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, 58:267–288, 1996.
[181] R. Tibshirani and B. Efron. Pre-validation and inference in microarrays. Statistical
applications in genetics and molecular biology, 1:1–18, 2002.
[182] G. V. Trunk. A problem of dimensionality: A simple example. IEEE Trans Pattern
Anal Mach Intel, pages 306–307, 1979.
[183] C. Truntzer, D. Maucort-Boulch, and P. Roy. Comparative optimism in models involving both classical clinical and gene expression information. BMC Bioinformatics,
9:1–10, 2008.
[184] K. N. Tsai, T. Y. Tsai, and C. M. Chen. A new approach to deciphering pathways
of enterovirus 71-infected cells: an integration of microarray data, gene ontology,
and pathway database. Biomed Eng Appl Basis Comm, 18(6):337–342, 2006.
[185] L. J. van't Veer, H. Dai, and M. J. van de Vijver. Gene expression profiling predicts
clinical outcome of breast cancer. Nature, 415:530–536, 2002.
[186] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1998.
[187] V. E. Velculescu, L. Zhang, B. Vogelstein, and K. W. Kinzler. Serial analysis of gene
expression. Science, 270:484–487, 1995.
[188] J. Šilhavá and P. Smrž. Gene ontology driven feature filtering from microarray data.
In Znalosti, pages 263–266, Jindřichův Hradec, Czech Republic, 2010.
[189] B. A. Walker, P. E. Leone, M. W. Jenner, C. Li, D. Gonzalez, D. C. Johnson, F. M.
Ross, F. E. Davies, and G. J. Morgan. Integration of global snp-based mapping
and expression arrays reveals key regions, mechanisms, and genes important in the
pathogenesis of multiple myeloma. Blood, 108:1733–1743, 2006.
[190] D. Wang and A. Bakhai. Clinical trials: a practical guide to design, analysis, and
reporting. Remedica, 2005.
[191] J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, and C. F. Chen. A new method
to measure the semantic similarity of go terms. Bioinformatics, 23(10):1274–1281,
2007.
[192] Y. Wang, J. G. Klijn, Y. Zhang, A. M. Sieuwerts, M. P. Look, F. Yang, D. Talantov, M. Timmermans, M. E. Meijer van Gelder, J. Yu, T. Jatkoe, E. M. Berns,
D. Atkins, and J. A. Foekens. Gene-expression profiles to predict distant metastasis
of lymph-node-negative primary breast cancer. Lancet., 365(9460):671–679, 2005.
[193] Y. Wang, D. J. Miller, and R. Clarke. Approaches to working in high-dimensional
data spaces: gene expression microarrays. Brit Jour Canc, (98):1023–1028, 2008.
[194] S. H. Wei, C. Balch, and H. H. Paik. Prognostic dna methylation biomarkers in
ovarian cancer. Clinical cancer research, 12:2788–2794, 2006.
[195] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan,
J. A. Olson, J. R. Marks, and J. R. Nevins. Predicting the clinical status of human
breast cancer by using gene expression profiles. PNAS, 28:11462–11467, 2001.
[196] C. Wingren and C. A. Borrebaeck. Antibody microarrays: Current status and key
technological advances. OMICS, 10:411–427, 2006.
[197] L. Xu. A novel statistical method for microarray integration: applications to cancer
research. PhD thesis, Johns Hopkins University, Baltimore, Maryland, 2007.
[198] Y. Freund and R. E. Schapire. A short introduction to boosting. Journal of Japanese Society
for Artificial Intelligence, 14(5):771–780, 1999.
[199] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz. Optimizing classifier performance via an approximation to the wilcoxon-mann-whitney statistic. Proceedings of
the Twentieth International Conference on Machine Learning (ICML-2003), pages
1–8, 2003.
[200] T. P. Yang, T. Y. Chang, C. H. Lin, M. T. Hsu, and H. W. Wang. Arrayfusion:
a web application for multi-dimensional analysis of cgh, snp and microarray data.
Bioinformatics, 22(21):2697–2698, 2006.
[201] J. X. Yu, A. M. Sieuwerts, Y. Zhang, J. W. Martens, M. Smid, J. G. Klijn, Y. Wang,
and J. A. Foekens. Pathway analysis of gene signatures predicting metastasis of
node-negative primary breast cancer. BMC Cancer, 7(182):1–14, 2007.
[202] S. Yu, L. C. Tranchevent, B. De Moor, and Y. Moreau. Gene prioritization and
clustering by multi-view text mining. BMC Bioinformatics, pages 1–22, 2010.
[203] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society, 68:49–67, 2007.
[204] X. Zhou, K. Y. Liu, and S. T. Wong. Cancer classification and prediction using
logistic regression with bayesian gene selection. J. Biomed. Inform., 37:249–259,
2004.
[205] J. Zhu and T. Hastie. Classification of gene microarrays by penalized logistic regression. Biostatistics, 5:427–443, 2004.
[206] H. Zou and T. Hastie. Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.
[207] V. Zuber and K. Strimmer. Gene ranking and biomarker discovery under correlation.
Bioinformatics, 25(20):2700–2707, 2009.
[208] V. Zuber and K. Strimmer. High-dimensional regression and variable selection using
car scores. Statistical Applications in Genetics and Molecular Biology, 10(1):1–27,
2011.