Data Mining the Genetics of Leukemia
by
Geoff Morton
A thesis submitted to the
School of Computing
in conformity with the requirements for
the degree of Master of Science
Queen’s University
Kingston, Ontario, Canada
January 2010
Copyright © Geoff Morton, 2010
Abstract
Acute Lymphoblastic Leukemia (ALL) is the most common cancer in children under
the age of 15. At present, diagnosis, prognosis and treatment decisions are made
based upon blood and bone marrow laboratory testing. With advances in microarray
technology it is becoming more feasible to perform genetic assessment of individual
patients as well. We used Singular Value Decomposition (SVD) on Illumina SNP,
Affymetrix and cDNA gene-expression data and performed aggressive attribute selection using random forests to reduce the number of attributes to a manageable
size. We then explored clustering and prediction of patient-specific properties such
as disease sub-classification, and especially clinical outcome. We determined that
integrating multiple types of data can provide more meaningful information than
individual datasets, if combined properly. This method is able to capture the correlation between the attributes. The most striking result is an apparent connection
between genetic background and patient mortality under existing treatment regimes.
We find that we can cluster well using the mortality label of the patients. Also, using
a Support Vector Machine (SVM) we can predict clinical outcome with high accuracy. This thesis will discuss the data-mining methods used and their application to
biomedical research, as well as our results and how this will affect the diagnosis and
treatment of ALL in the future.
Acknowledgments
I would like to thank my supervisor Prof. David Skillicorn for the opportunity to
work on this project, for all of the guidance he has given me along the way and for
the chance to continue my work in Australia. The School of Computing at Queen’s
University has provided me not only with a wonderful education but also the funding
that made this work possible and for that I am grateful. Thanks also to Dr. Daniel
Catchpoole for providing a different view into my work and making sure all of our
results were practical and applicable, as well as for his hospitality for my time in
Australia. To my friends and colleagues at the Children’s Cancer Research Unit at
The Children’s Hospital at Westmead, I thank you for making my transition so easy
and making it fun to come into work every day. And finally I would like to thank
my friends and family for their support throughout this whole process. Although my
stories may sound like gibberish to you, you were always there to listen.
Table of Contents

Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
1.1 Problem
1.2 My Contribution
Chapter 2: Background
2.1 Acute Lymphoblastic Leukemia (ALL)
2.2 The Datasets
2.3 Random Forests
2.4 Singular Value Decomposition (SVD)
2.5 Support Vector Machine (SVM)
2.6 Related Research
Chapter 3: Experiments
3.1 Pre-processing
3.2 Normalization
3.3 Experimental Model
3.4 Combination of Datasets
3.5 Attribute Selection
3.6 Analysis of Selected Attributes
3.7 Validation of Results
3.8 Further SNP Analysis
Chapter 4: Results
4.1 SVD Results
4.2 Combination of Datasets
4.3 Analysis of Selected Data
4.4 Attribute Selection
4.5 Validation of Results
4.6 Extended SNP Analysis
4.7 Discussion of the Nature of the Data
Chapter 5: Conclusion
5.1 Future Work
Bibliography

List of Tables

2.1 Symptoms of ALL
3.1 Random Forests Setup
4.1 Combination Method 1
4.2 Combination Method 2
4.3 SNP Subset SVM Results
4.4 cDNA Subset SVM Results
4.5 Affy Subset SVM Results
4.6 SNP and cDNA Subset SVM Results
4.7 SNP and Affy Subset SVM Results
4.8 cDNA and Affy Subset SVM Results
4.9 SNP, cDNA and Affy Subset SVM Results
4.10 Top 100 SNP Attributes
4.11 Top 100 cDNA Attributes
4.12 Top 100 Affy Attributes
4.13 Top 100 SNP-cDNA Attributes
4.14 Top 100 SNP-Affy Attributes
4.15 Top 100 Affy-cDNA Attributes
4.16 Top 100 SNP-cDNA-Affy Attributes
4.17 Label Shuffling SVM Results
4.18 SNP Relapse SVM Results
4.19 Comparison of Top Attributes

List of Figures

4.1 All SNP SVD
4.2 All cDNA SVD
4.3 All Affy SVD
4.4 All Clinical SVD
4.5 All SNP-cDNA SVD
4.6 All SNP-Affy SVD
4.7 All Affy-cDNA SVD
4.8 SNP-cDNA-Affy SVD
4.9 SNP Subset SVD
4.10 SNP Subset SVD
4.11 cDNA Subset SVD
4.12 Affy Subset SVD
4.13 SNP-cDNA Subset SVD
4.14 SNP-Affy Subset SVD
4.15 cDNA-Affy Subset SVD
4.16 All Combined Subset SVD
4.17 SNP Relapse SVD
4.18 SNP Graph Analysis SVD
4.19 Reformatted SNP Analysis SVD
4.20 250 SNP SVD for old and updated labels
4.21 SNP SVD for updated labels
4.22 Intersecting SNP SVD
Chapter 1
Introduction
1.1 Problem
Cancer, in all of its forms, is the second leading cause of death in the United States [15]
and accounts for 13% of all deaths worldwide [25]. It is estimated that in the United
States in 2009 a total of approximately 1.5 million people will have been diagnosed
with cancer and of these, approximately 560,000 will die from their disease [22]. It is
also estimated that approximately 30% of these cancer deaths are preventable [25].
The National Cancer Institute spends approximately $4.8 billion per
year on cancer research, with most of the funding going to breast, prostate,
lung, colorectal and leukemia research [21].
Leukemia is the most common malignancy affecting children under the age of
15, but it also affects many adults. There are four subtypes of leukemia: acute
lymphoblastic leukemia, chronic lymphocytic leukemia, acute myelogenous leukemia
and chronic myelogenous leukemia. It is estimated that approximately 45,000 new
cases of leukemia will have been diagnosed in the United States in 2009 [16]. The
survival rate for persons with leukemia has dramatically increased over the past four
decades. In the 1960s the five-year event-free survival rate was a mere 14%. In
more recent years these figures have been quoted as being as high as 80% [12, 17, 31].
Although there has been a significant improvement in the treatment of this disease,
20% of all leukemia cases result in death.
With the completion of the Human Genome Project, the understanding of genetics
has increased significantly. As such, many new technologies have been developed to
study the genome in many different forms. One of these technologies is the microarray,
which is a high-throughput device allowing for the analysis of thousands of gene
expression levels simultaneously. As a result, there is a wealth of data being generated
every day for many different purposes. The microarray has become a useful research
tool and has allowed researchers to begin looking at problems on a much larger scale.
As this technology evolves, so too do the applications for microarrays. In cancer
research, researchers can now look at the expression levels for many thousands of
genes as well as a description of an individual's genome given by a set of Single
Nucleotide Polymorphisms (SNPs). The amount of data that is being generated is
staggering, and there is a need to develop methods for analyzing this data efficiently.
These high-throughput technologies have led to many discoveries with clinical
applications. With 20% of leukemia cases leading to death, there
is an opportunity for this type of technology to have a positive effect in this area
of research. The present method for diagnosis, prognosis and treatment decisions
involves a series of clinical tests and assessments by physicians who ultimately place the
patient in a specific risk category that dictates the treatment they receive. Although
this process has clearly shown improvement over the past four decades, there is still
a need for improvement.
1.2 My Contribution
The goal of this research is to explore the relationship between the genetics of individuals who have leukemia and whether or not they survive the disease. Our hypothesis
is that there is a relationship between an individual's genetics and their survival of this disease. We use data from several microarray experiments that have
generated data about both the SNP profiles of patients as well as gene expression
data. In order to analyze these complex data we have developed a data-mining process that involves filtering the data to remove uninformative data attributes and then
using a matrix-decomposition technique to cluster these data. We perform this data-mining technique on the individual datasets as well as all possible combinations of
them, to see if combining datasets provides more useful information for the exploration. We use the clinical information that is available for each patient to develop
an understanding of the results that our method has produced and attempt to see if
any biological implications can be drawn.
These data are complex, high-dimensional and evolving. These properties make
them difficult to work with and standard data-mining methods must be modified in
order to have the appropriate functionality. This research is preliminary work in this
field and will provide a foundation for further experiments which will potentially lead
to work with clinical implications. With the amount of data that is being generated
through these high-throughput experiments, it is imperative that proper methods
for analyzing these data be developed. We believe that this is the first step in that
direction for this particular research area.
Chapter 2
Background
In this chapter, we provide the necessary background information for this thesis.
First, we explain the specific type of leukemia which we are investigating as well as
discuss the types of datasets that are being used for this experiment. We also
explain the techniques used and their role in the field of biomedical research.
2.1 Acute Lymphoblastic Leukemia (ALL)
Acute Lymphoblastic Leukemia (ALL) is the most common cancer in children under
the age of 15 [12, 17, 31]. The term ’acute’ refers to the short amount of time it
takes to develop this disease. It is estimated that in the US in 2008, 5,430 people
were diagnosed with ALL, 60% of whom were children [16]. Overall this is a rare
cancer, accounting for only 0.3% of all cancers diagnosed every year [32]. ALL is
a cancer of the blood, but more specifically it affects cells known as lymphocytes.
These cells are more commonly known as white blood cells, and are an integral part
of the immune system. The bone marrow is responsible for producing blood cells,
and in individuals with ALL, the bone marrow produces too many lymphocytes too
quickly. As a result, the cells never fully mature nor do they develop their proper
functionality [17]. This becomes a problem for several reasons. First, lymphocytes
are an important part of the immune response. If they do not function properly
then the individual will be unable to fight off infection or disease and can
become ill. Second, if there are too many lymphocytes being produced, then there
is less room in the bloodstream for the other vital blood cells: red blood cells and
platelets. As a result of this, the individual may develop anemia or bleeding problems.
Finally, the abnormal lymphocytes can build up in lymph nodes as well as the spleen,
liver, brain and testicles and can cause swelling which can also lead to complications
[12, 17]. There are several risk factors that make it more likely that certain individuals
will develop ALL than others. These include: radiation exposure, benzene exposure,
smoking, genetic predisposition, certain viruses and past chemotherapy [12, 16]. There
are many symptoms associated with ALL, and they are listed in Table 2.1.
Table 2.1: Symptoms of ALL
Symptoms
General weakness
Fatigue
High fever
Weight loss
Frequent infections
Bruising easily with no obvious cause
Bleeding from the gums or nose
A rash of dark red spots
Blood in the urine or stool
Pain in the bones or joints
Breathlessness
Swollen lymph glands
A feeling of fullness in the abdomen caused by a swollen liver or spleen
At the present time, the method for diagnosis of the disease is as follows. Blood
and bone-marrow laboratory tests are performed to look for any leukemia cells. A
complete blood count is used to look at levels of white blood cells, platelets, etc.
A bone-marrow aspirate is done to look for blast cells while a bone-marrow biopsy
is done to see how much disease is in the bone marrow. Also, immunophenotyping
is done to determine if the disease is B-cell or T-cell leukemia [12, 17]. There are
two stages to the treatment procedure. First, induction therapy is conducted with
a goal of killing as many leukemia cells as possible to get the blood counts back
to normal and induce remission. This is done using various types of chemotherapy.
In certain extreme cases when the disease has spread to the brain and spinal cord,
radiation therapy or a bone-marrow transplant is also necessary. Once the patient is in
remission the second stage of treatment begins. This is called post-induction therapy.
This involves giving doses of treatment over a period of 2-3 years. This is necessary because
not all of the leukemia cells will always be killed. This treatment is usually different
from the induction chemotherapy [12, 17]. Survival rates for children diagnosed with
ALL have increased dramatically over only a few decades. In the 1960s the survival
rate was a dismal 4%. Through an improvement in both diagnosis and treatment,
this number has risen to around 80% today and survival rates for high risk patients
remain around 40% [12, 17, 31]. This is a remarkable increase; however, room for
improvement still exists and so it is necessary to continue to look for ways to improve
on diagnosis, prognosis and treatment.
2.2 The Datasets

2.2.1 SNP data
DNA is the blueprint from which all living creatures are created. It is the variations
in this DNA that allow for the differences both between species as well as within a
species. There are four nucleotides that make up DNA: adenine (A), cytosine (C),
guanine (G), and thymine (T). Because DNA is double stranded, these nucleotides
work in pairs; A binds with T and C binds with G. In the lifespan of a living being,
this DNA will be replicated numerous times. The process of replication is subject
to error, and although there are many error-checking processes, it is still possible for
a mistake to happen. This is known as a mutation, and there are many different
types of mutations. Some are harmful while some are not. Over the course of time
these mutations are passed down from generation to generation, and are subject to
their own errors as well. Mutations are the reason why there is so much individual variation within and between species [2].
One variation in DNA that has become important to researchers is a Single Nucleotide Polymorphism (SNP). This is the term used to describe a single nucleotide
base-pair difference between individuals. In order to be considered a SNP, this difference must occur in at least 1% of the population [2, 5]. SNPs are responsible for
approximately 90% of all genetic variation in humans, and two out of three times the
change in nucleotide is from a cytosine to a thymine. SNPs can occur in all parts of
the DNA: in coding, non-coding and intergenic regions. Because they can appear in coding regions, it is believed that SNPs have an effect on an individual's predisposition
to disease as well as their response to drugs. This has become an important research
tool, as it may explain why certain individuals do not respond well to treatment, or
why they contracted a certain disease [2, 5].
For this research, the SNP data was collected on the Illumina HumanNS-12 Genotyping BeadChip platform. The SNPs which are analyzed in this research are all
non-synonymous SNPs. This means that a change in the DNA base pair results in
a change in the amino acid coding sequence of the protein. This platform contains
13,917 SNPs, and genotyping was performed for 137 patients. The samples used for this experiment
come from peripheral blood during remission. The data that is generated from this
analysis comes in two forms: theta values and allele values. A theta value generates
a B allele frequency which has a value between 0 and 1. Thus, a value close to 1
represents a homozygous B allele, a value close to 0 represents a homozygous A allele
and a value close to 0.5 represents a heterozygous allele [29]. The other form of these
data are the allele values. There are three possible values these data can take: 0, 1 or
2. A value of 0 represents the homozygous major allele, 1 represents the heterozygous
case and 2 represents the homozygous minor allele. Unlike the theta values, there is
no range of values and each individual has one of these three values.
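As an illustrative sketch, the mapping from a theta-derived B allele frequency to the 0/1/2 encoding could look like the following. The cut-offs are invented for illustration (real genotype calling, e.g. in Illumina's software, is cluster-based), and note that the thesis's 0/1/2 values are defined relative to the major/minor allele, which need not coincide with the A/B labelling.

```python
# Hypothetical sketch: map a B allele frequency in [0, 1] onto three
# genotype classes. The thresholds 0.25 and 0.75 are invented for
# illustration; real calling pipelines fit clusters per SNP instead.
def call_genotype(b_allele_freq):
    """0 = homozygous A allele, 1 = heterozygous, 2 = homozygous B allele."""
    if b_allele_freq < 0.25:
        return 0
    if b_allele_freq > 0.75:
        return 2
    return 1

# Values near 0, 0.5 and 1 map to the three genotype classes.
calls = [call_genotype(f) for f in (0.02, 0.48, 0.51, 0.97)]
```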
2.2.2 cDNA data
DNA microarrays are high-throughput devices that allow for the collection of a large
amount of information on a small glass slide. The basis of this technology relies
on how DNA is transcribed. When a particular gene becomes activated, it is transcribed many times in order to produce the necessary proteins. This process involves
creating a complementary strand of mRNA which is then used as a template for building these proteins. Thus, a gene that is highly expressed will have many identical
mRNA molecules within a cell [33]. This is the basis for a microarray experiment.
Researchers are interested to know which genes are active under certain conditions.
To do this, a microarray slide is spotted with thousands of single-stranded probes
representing particular genes. Then, fluorescently labeled mRNA molecules are put
onto the microarray slide. Due to the complementarity of the strands, if a labeled
mRNA molecule finds its match it will hybridize to the probe. The more mRNA
molecules that bind to a particular probe, the more fluorescence there is at that spot on the
microarray. After this process is complete, the microarray is then scanned using a
special scanner which detects the amount of fluorescence. The intensity of the spot is
translated into a numerical form, and this becomes the data from which researchers
work. A spot with a large amount of fluorescence represents a gene that is active
under those particular conditions. These intensity values are often compared to a
“normal” population in order to look at intensity fold changes which can give the
researchers information about which genes are up-regulated or down-regulated given
a certain condition [33]. For this study the platform contained 10,027 genes for 68
patients. The samples used for the microarray experiment were taken from the patient’s bone marrow during remission. As such, this sample should represent healthy
bone marrow.
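The fold-change comparison described above can be sketched numerically; the intensity values here are invented for illustration.

```python
import math

# Invented spot intensities for one gene: patient sample vs. a "normal"
# reference population. Expression analyses usually work with log2 ratios,
# so +1 means doubled expression and -1 means halved expression.
patient_intensity = 5200.0
reference_intensity = 1300.0

log2_fold_change = math.log2(patient_intensity / reference_intensity)

# A positive log2 fold change marks the gene as up-regulated relative to
# the reference; a negative value would mark it as down-regulated.
up_regulated = log2_fold_change > 0
```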
2.2.3 Affymetrix data
The Affymetrix microarray, known as a GeneChip, works similarly to the cDNA
microarray. The main difference between these two methods is the way in which
they are created. An Affymetrix GeneChip is created through a process known as
photolithography. This technique uses masks and ultraviolet light to build the DNA
probes directly on the slide. This is different from the cDNA method where the
DNA probes are spotted into wells on a slide. Before the creation of the GeneChip,
the researchers must decide the composition of the probes so that the masks can be
created. The slide (known as a wafer) will be sectioned off so that each probe can
be created. The process of photolithography begins by first coating the wafer with
silane which will bind to the glass. Next, a linker molecule with a photosensitive
molecule will bind to the silane. Now, a mask is applied to the wafer which protects
certain areas of the slide, while others will remain unprotected. Next, ultraviolet
light is shone on the slide and the unprotected parts of the slide will lose their
photosensitive molecule. Then, one of the nucleotides (A, C, G or T) will be added
into the mix and will combine with the unprotected molecules. This is the beginning
of the DNA probe. Next, another mask is applied and then another nucleotide will
be added. This process is repeated until each probe is the desired length and with
the desired sequence. Creating the DNA probes in this way allows researchers
to be specific about the DNA probes that they want on the slide without having to
isolate all of the DNA as they would for a cDNA microarray [1]. Compared to the
cDNA microarray, the Affymetrix probes are much shorter. To compensate for this,
the Affymetrix chips contain many redundancies [1]. For this study the platform
contained 22,277 probes for 144 patients. Similarly to the cDNA experiment, the
samples used for the Affymetrix experiment were also taken from bone marrow during
remission.
From the above three datasets there were 49 patients who had both SNP and
cDNA data, 118 who had both SNP and Affy, 55 who had both Affy and cDNA, and
49 who had all three types of data.
2.2.4 Clinical data
When it is suspected that an individual may have ALL, many clinical tests are run in
order to confirm the diagnosis. These tests include full blood counts as well as bone-marrow biopsies. If the test results show an increased white blood cell count and
a decreased platelet count then these are the first signs of leukemia. Other clinical
results such as an enlarged spleen or liver, chromosomal abnormalities such as a
translocation, cytogenetic counts, the subtype of the disease, and the patient's
age and sex are all used to diagnose this disease. The clinical data for each patient
in this study was processed at the same facility and therefore can be considered to
be comparable. Certain patients were missing various types of clinical data, but the
mortality was known for every patient. All of these data are used in this study to
represent the view of a patient from a clinical perspective. Since this is the data that
is available to clinicians making the diagnosis, we use these data with our techniques
to see how effective these decisions were.
2.3 Random Forests
The random forests algorithm [8] is an ensemble classifier that consists of many binary
decision trees. A binary decision tree is a method of using nodes in a tree structure
to test the attributes of a dataset. The results of these tests are used to split the
training data into subsets which are then passed onto the next layer of the tree. This
continues until each subset at a node contains only one class. There are many popular
decision-tree algorithms, including ID3, CART and C4.5 [23, 30]. Each of these is a
supervised learning method, as they require that the data have class labels. One of
the challenges with a decision-tree classifier is deciding, at each layer of the tree, which
attribute will provide the best split of the data. Two popular choices for this task
are information gain and the gini index. The gini index is the splitting method that
is used in random forests and is defined by:
gini(D) = 1 − Σ_{i=1}^{k} p_i^2    (2.1)

where D is the dataset, k is the number of classes and p_i is the relative frequency of
class i in dataset D. After splitting on attribute X, the gini index is defined as
gini(D)_X = (n_1/n) gini(D_1) + (n_2/n) gini(D_2)    (2.2)

where D_1 and D_2 are the subsets of the dataset at each branch, which contain n_1 and
n_2 objects respectively. The splitting decision is based on the difference in the gini
impurity from a node to its child, and is given by
∆gini(D) = gini(D) − gini(D)_X    (2.3)
The attribute that provides the largest reduction of impurity is the best attribute to
split on.
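Equations 2.1-2.3 can be sketched directly in code; the class labels below are invented for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node (Equation 2.1)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_after_split(left, right):
    """Weighted impurity of the two branches after a split (Equation 2.2)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# A balanced two-class node has impurity 1 - (0.5^2 + 0.5^2) = 0.5; a split
# that separates the classes perfectly reduces it to 0, giving the largest
# possible impurity reduction (Equation 2.3).
node = ["survived"] * 4 + ["died"] * 4
delta = gini(node) - gini_after_split(["survived"] * 4, ["died"] * 4)
```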
The random forests algorithm was developed by Breiman and Cutler, and is known
as one of the most robust classification algorithms developed to date [8]. For each
tree that is grown, a training set is created by randomly selecting, with replacement,
N objects from the original set of N objects. Because the selection is done with
replacement, about one third of the data will not be selected, and this becomes
the Out-Of-Bag (OOB) objects which are used as a test set. If there are M attributes
in total, at each internal node a number of attributes m is chosen to be much smaller
than M. Then, m attributes are selected randomly and the gini index is used to
determine which attribute provides the best split. The choice of m is difficult, but a
rule of thumb is to select m equal to √M. Each tree is grown to its full size;
there is no pruning in this algorithm. The error rate is estimated at the end of each
run: for each case n, let j be the class that received the most votes over the trees for
which n was OOB. The proportion of cases for which j is not equal to the true class
of n is the OOB error estimate [8].
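The bootstrap-sampling behaviour described above can be checked with a short simulation; the dataset size is arbitrary, and the attribute count reuses the SNP platform figure from Section 2.2.1.

```python
import math
import random

random.seed(1)

# Bootstrap sample: N draws with replacement from N objects. The objects
# never drawn form the Out-Of-Bag (OOB) set, whose expected fraction is
# (1 - 1/N)^N -> 1/e ~ 0.368, i.e. "about one third".
N = 10_000
drawn = {random.randrange(N) for _ in range(N)}
oob_fraction = 1 - len(drawn) / N

# Rule-of-thumb number of attributes tried at each internal node: m = sqrt(M).
M = 13917
m = round(math.sqrt(M))
```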
2.4 Singular Value Decomposition (SVD)
Singular Value Decomposition (SVD) is a matrix decomposition technique which is
given by the formula
A = U S V^T    (2.4)

where A is a matrix of n rows (objects/patients) and m columns (attributes). If A
has r linearly-independent columns, then U is an n×r matrix, S is an r×r positive
diagonal matrix with non-increasing singular values, and V^T is an r×m matrix. U
and V are orthogonal matrices. The U matrix captures the variation of the rows
of A, which correspond to the objects. The first column of U captures the most
variation, the second column contains the second most variation, and so on. The
singular values on the diagonal of the S matrix correspond to the importance of
the amount of variation captured in each column of U. The V matrix captures the
variation along the columns of A, which corresponds to the attributes. One of the
many useful properties of SVD is that the results can be visualized. Since the
most variation in the data is captured in the first few columns of the U and V
matrices, and the variation is captured in an orthogonal manner, it is possible to plot
the first 2 or 3 columns of these matrices. The resulting image can show clusters or
trends in the data that may have otherwise been difficult to see. It is a useful tool for
finding structure in complicated datasets. It is especially useful since it can be used
on very large datasets which are difficult to handle [28]. The easiest interpretation
of SVD is the geometric interpretation. By plotting the U matrix, the data points
correspond to the objects plotted in a new space. Data points that lie close to each
other in space are correlated with each other and therefore are more alike. Points that
lie opposite each other are negatively correlated with each other and are less alike.
Points that are orthogonal to each other have no association. Also, points that lie
at the origin are either correlated with everything or correlated with nothing. Either
way these points can usually be discarded as not interesting. The power of SVD as a
method of clustering can be seen from this explanation. It is able to find points that
are interesting, points that are not interesting, as well as associations between points
[28].
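A minimal numerical sketch of Equation 2.4 with NumPy; the matrix here is random, standing in for a patient-by-attribute matrix.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy stand-in for a patient-by-attribute matrix A: 6 objects x 4 attributes.
A = rng.normal(size=(6, 4))

# NumPy returns U, the singular values s (the diagonal of S), and V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values come out ordered from largest to smallest, so the first
# columns of U capture the most variation among the objects.
ordered = all(s[i] >= s[i + 1] for i in range(len(s) - 1))

# The factors reconstruct A, i.e. A = U S V^T.
A_rebuilt = U @ np.diag(s) @ Vt

# The first two columns of U give each object 2-D coordinates for plotting;
# nearby points correspond to similar objects.
coords = U[:, :2]
```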
2.5 Support Vector Machine (SVM)
A Support Vector Machine (SVM) is a supervised learning method used for classification. This method uses a decision boundary to separate the classes in space.
However, when picturing two classes of data points that are linearly separable, there
are an infinite number of boundary lines that can be drawn and it is impossible to
know which of these boundaries is the best. This method uses what is known as the
maximum-margin hyperplane. The idea is that a separating line is chosen so as to
maximize the distance from the nearest data point on each side. The margin of the
linear classifier is the width that the boundary can reach before hitting a point on
each side. The support vectors are the points that lie on the margin boundaries, and these are the only points needed to determine the best way to separate the classes [28].
One common problem with complex data is that there is no simple linear boundary between classes. The idea behind SVM is to project these data into a higher-dimensional space using mathematical functions, known as kernels, to a point where
a hyperplane can separate the data. If this is done properly, the necessary number of
calculations can be minimized so that only the original attributes are needed, making
it an efficient algorithm. It is possible to extend this algorithm to allow misclassifications while incurring a penalty for each. As such, there are several parameters
that can be changed and may require testing to determine the best setup for a particular experiment. Although SVM is primarily a two-class separator, it is possible to extend it to multiclass prediction. This is one of the most popular and effective
classification algorithms to date [9].
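A minimal sketch of SVM classification, using scikit-learn as an assumed stand-in for whatever SVM implementation was used in the thesis; the data, kernel choice and penalty value C are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy classes in two dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, size=(30, 2)),
               rng.normal(2.0, 0.5, size=(30, 2))])
y = np.array([0] * 30 + [1] * 30)

# kernel="linear" finds the maximum-margin hyperplane; kernel="rbf"
# would implicitly project into a higher-dimensional space.  C is the
# penalty for misclassifications in the soft-margin extension.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors determine the boundary.
print(len(clf.support_vectors_), clf.score(X, y))
```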
2.6 Related Research

2.6.1 Related research using SNP data
Yang and colleagues [35] state that during the remission induction phase of therapy
it can be seen that there is considerable interindividual variation. They report that
some patients drop from 100% to under 0.01% leukemia cells in the bone marrow with
only 2 or 3 weeks of induction therapy. However, some individuals still exhibit high
levels of leukemia cells in the bone marrow after 4 to 6 weeks of induction therapy.
They attribute most of this variation to host-related factors as opposed to tumor-related factors. In order to test this hypothesis, the team used two groups of patients
consisting of 318 and 169 people respectively. They studied a total of 476,796 germline
SNPs and after using several statistical tests were able to identify 102 SNPs that are
believed to be related to this variation between individuals. It was also found that
63 of these SNPs are linked to early response, relapse and drug disposition. This
demonstrates that some interindividual variation can be attributed to differences in
individual genetics.
2.6.2 Related research using microarray data
Microarray experiments are ideal candidates for data mining. They are
noisy, very large and complex, but are filled with a wealth of information. With the
ability to capture so much information it is imperative that data miners discover a
way to effectively extract this information. It is no surprise that there has been a lot
of research in this field on both the biological side as well as the computational side.
As an example, Chopra et al. [11] looked at the problem of clustering as it applies to
microarray data. They state that most of the available clustering algorithms only find
clusters that are independent of the biological context of the analysis. The authors
have developed a novel clustering algorithm, SigCalc, which generates many different
versions of clusters from one dataset where each one provides a different insight into
the dataset. They test their algorithm on three yeast microarray datasets and discover
that they can find many of the same clusters of genes across all datasets. Being able
to include biological data into the clustering process is critical in order to discover
the best possible clusters of genes.
In another experiment, Hoffmann et al. [19] used gene expression profiles to attempt to predict long-term outcome of individuals with pre-B ALL. They used microarray data for 101 children diagnosed with pre-B ALL and, using statistical techniques as well as the random forest algorithm, were able to identify an 18-gene classifier which they state can predict the long-term outcome of these patients better
than conventional methods. This demonstrates the power of using the random forest
algorithm with microarray data.
2.6.3 Related research using random forest
The random forests algorithm is used extensively in the biomedical field of research
and because of its design it is well suited for microarray data. These data are generally
very large and contain a lot of noise. This algorithm can handle these two
properties much better than many of the other classification algorithms that exist.
Also, random forests can act not only as a powerful classifier but also as an attribute-selection algorithm. All of these properties make this algorithm a
popular choice for biomedical research.
Diaz-Uriarte and Alvarez de Andres [13] use random forests on a number of different microarray datasets in order to compare its performance with many other well-known algorithms such as SVM, K-nearest neighbours, etc. The authors report that
the classification accuracy of random forests is similar to those of the best algorithms
that are already in use. On top of the classification accuracy, they explore random
forests' potential for attribute selection and propose a method for gene selection using
the OOB error rates. The authors state that because of the algorithm’s ability to
perform well as a classifier while also allowing for excellent attribute selection, this
algorithm should become an essential part of the tool-box for prediction and gene
selection with microarray data.
Archer and Kimes [4] perform a similar evaluation of the random forests algorithm
and come to many of the same conclusions about the effectiveness of the algorithm.
In this particular case they apply the algorithm to ALL microarray data as a means of
trying to discover the genes which are responsible for the difference between subtypes
of the disease. There has been much previous research on this topic and so they
were able to compare their results using random forests against it. They identified many genes that play a role in subtype differentiation and validated these findings against the previous research.
2.6.4 Related research using SVD
As previously discussed, microarrays are high-throughput devices often containing
many tens of thousands of gene expression values. It is difficult to analyze these data
since most diseases or conditions only affect a small number of genes. It is common
practice to reduce the number of genes in the analysis by discarding those that have
a low expression level. However, the effect of this is to keep genes that have a large
difference in expression value but does nothing to consider correlations with other
genes or other subtle connections. By performing a singular value decomposition on
these data and sorting the columns of U based on the distance from the origin, it is
possible to select the genes with the most interesting expression values, not simply
the largest difference. The expression levels of a gene that does not change across patients tend to lie near the origin, and so such genes will not appear near the top of the sorted list [28].
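A sketch of this distance-from-origin gene ranking, assuming the matrix is oriented genes × patients so that rows of U correspond to genes (NumPy assumed; the data are synthetic):

```python
import numpy as np

# Synthetic expression matrix: rows = genes, columns = patients.
rng = np.random.default_rng(2)
expr = rng.normal(size=(200, 30))
expr[:5] += 4.0 * rng.normal(size=(1, 30))   # 5 genes vary strongly together

U, s, Vt = np.linalg.svd(expr, full_matrices=False)

# Distance of each gene from the origin in the first 3 SVD dimensions;
# genes whose expression barely changes sit near the origin.
dist = np.linalg.norm(U[:, :3] * s[:3], axis=1)
ranked = np.argsort(dist)[::-1]              # most interesting genes first
print(ranked[:5])
```

Note that this ranking rewards correlated, structured variation, not merely large individual expression differences.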
Chapter 3
Experiments
In this chapter, we explain the experimental model used to explore these datasets.
First, the preprocessing procedures that were performed are explained. Next, the
different normalization techniques used are discussed. Finally, we present the setup
of the various experiments that were performed.
3.1 Pre-processing
Careful preprocessing was required to ensure that the data used for the study was of
the highest possible quality. Preprocessing includes such tasks as replacing missing
values, excluding patients or attributes that are not useful, converting the data into
appropriate numerical forms, and many other tasks. These steps were done on each
of the data sets: SNP data, cDNA data, Affymetrix data and Clinical data. The samples for the SNP experiment were taken from the patients' peripheral blood during remission. The samples for the cDNA and Affy microarray experiments were both taken from the patients' bone marrow during remission. By taking the samples during
remission the cells represent “normal” cells and therefore allow us to investigate the differences between patients' genetics.
3.1.1 SNP data
The SNP dataset contained such information as the patient ID tag, the SNP names,
the theta values of each SNP for each patient, as well as the genotype of each SNP for
each patient. There were many missing values in this dataset and this is a fundamental
challenge in data mining. The method for replacing missing values must be chosen
carefully so as to not add any information to the data. Many solutions exist for
replacing missing values; two of the more common approaches are to replace the
missing values with either the column mean or a value of zero. For the purposes
of this thesis, both methods were tested and there was little difference between the
two methods. Therefore, the missing values were all replaced with a value of zero.
Another problem encountered with the SNP dataset was that there were several
duplicate subject records. All of the duplicates were removed from the dataset. After
the preprocessing step, the dataset contained data for 137 patients and 13917 SNPs.
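The two missing-value strategies compared above can be sketched as follows (NumPy assumed; the toy matrix is illustrative):

```python
import numpy as np

# Small example with missing entries encoded as NaN.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [np.nan, 8.0, 9.0]])

# Option 1: replace missing values with zero (the choice made here).
X_zero = np.nan_to_num(X, nan=0.0)

# Option 2: replace missing values with the column mean.
col_mean = np.nanmean(X, axis=0)
X_mean = np.where(np.isnan(X), col_mean, X)

print(X_zero[0, 1], X_mean[0, 1])   # 0.0 6.5
```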
3.1.2 cDNA data
The cDNA dataset went through a similar preprocessing step. The data that was
received had already undergone some preprocessing and normalization to compensate
for technical errors to do with the creation of the microarray. This is a standard
procedure done for all microarray experiments. This dataset contained the patient
ID tags, the gene names, and the microarray values. These values represent the
expression ratio between the patient and a “normal” bone marrow sample. There were
many missing values in this dataset as well, and again, both methods of replacement
were tested. There was little difference between the two methods of replacement and
so all of the missing values were replaced with a value of zero. This was not surprising
as the mean values in a column were very close to zero. After the preprocessing step,
the dataset contained data for 68 patients and 10027 genes.
3.1.3 Affymetrix data
The Affymetrix (Affy) dataset was received in three separate files, because the Affy data comprises three separate experiments. Two of these experiments
were conducted at The Children’s Hospital at Westmead and the other experiment
was performed in Washington. This presented a challenge as these experiments were
each subject to their own sources of error, which would not be consistent across the experiments. One of the experiments had many more attributes than the other
two, so these extra records were removed in order to combine all of the datasets. These
datasets had already been preprocessed before we received them, and so there
were no missing values to replace. The Affy dataset also contained “normal” patients
in one of the experiments, that is, patients who do not have leukemia. These patients
were removed from the combined datasets and kept separate for further testing. In
the end, this dataset contained data for 144 patients and 22277 attributes.
3.1.4 Clinical data
The clinical dataset contained, for each patient, laboratory test results, such as initial white blood cell count and platelet counts, as well as patient information, such as
treatment received, sex, age, etc. For these experiments, we used a selection of laboratory test results. The preprocessing of these data involved converting the attributes
into appropriate numerical forms. The clinical tests that produced numerical values were left as they were. However, there were several attributes that had values
that did not translate directly into linearly significant numbers. For these, nominal
values were assigned to represent the different values of the attribute. For example,
one attribute contained information about the size of the liver at diagnosis. The possible values were nil, 0-1cm, 1-5cm, and 6-10cm, and so these values were translated
into 0, 1, 2 and 3. The final dataset contained data for 117 patients and 11 attributes.
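The liver-size encoding described above might look like this in code; the dictionary and function names are hypothetical, and only the category-to-integer mapping comes from the text:

```python
# Category-to-code mapping for liver size at diagnosis, as described
# in the text.  Names here are illustrative, not from the thesis.
LIVER_SIZE_CODES = {"nil": 0, "0-1cm": 1, "1-5cm": 2, "6-10cm": 3}

def encode_liver_size(value):
    """Translate a liver-size category into its integer code."""
    return LIVER_SIZE_CODES[value]

print([encode_liver_size(v) for v in ("nil", "0-1cm", "1-5cm", "6-10cm")])
# [0, 1, 2, 3]
```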
3.2 Normalization
Normalization of the data is important when using a technique such as SVD. It is
important to ensure that the data is properly scaled and centered on the origin in
order for SVD to function correctly. In this study, z-scores are calculated for each
attribute which effectively centers the data on the origin and scales all of the values so
that SVD is able to correctly capture the variation in the data. A z-score is calculated
by subtracting the column mean from each value in the column, and dividing by the
column standard deviation.
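The z-score calculation can be written directly (NumPy assumed; the toy matrix is illustrative):

```python
import numpy as np

def zscore_columns(X):
    """Subtract each column's mean and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Z = zscore_columns(X)
print(Z.mean(axis=0))   # columns are now centred on the origin
```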
This technique works well; however, there are some inherent problems associated
with it. Scaling data in this way assumes two things: that all of the attributes
are equally important, and that all of the values have a linear relationship between
significance and magnitude. With the SNP, cDNA and Affy datasets, all of the
attributes are treated as both equal and linearly significant so this is not a concern
for this particular study.
Another normalization concern arises when the microarray data is collected.
There are many visual artifacts which must be accounted for in the microarray scan,
such as dust, uneven surfaces, poor washing, etc. Fortunately, the data which is used
for this study was already normalized to account for these problems.
The clinical dataset contained attributes which were either linearly significant or
nominal. For the attributes which were linearly significant, z-scores were calculated.
For the attributes that were not, the logarithm of each value was calculated.
3.3 Experimental Model
In order to gain a better understanding of these datasets, the experimentation process began on a general level. This involved performing SVD on the entire dataset.
Once some knowledge was gained about these datasets, attribute selection was then
performed using the random forests algorithm, and then further exploration of the
data was done using SVD and SVM. To see whether or not data integration would
be beneficial, the three datasets were combined together, as well as in pairs.
3.3.1 SVD analysis of data
Using the geometric interpretation of SVD, the goal of these experiments was to
see if there were significant clusters in the data. By labeling these images with
different clinical features, e.g. mortality, we hoped to be able to understand what the
clusters represented. By doing this for each of the datasets listed below, we wanted
to see which datasets held the most structure and if combining them would give more
information.
• SNP Dataset
• cDNA Dataset
• Affy Dataset
• Clinical Dataset
• SNP and cDNA Dataset
• SNP and Affy Dataset
• cDNA and Affy Dataset
• SNP, cDNA and Affy Dataset
The geometric interpretation of SVD is based upon plotting the U*S matrix to
see if there are clusters in space. The farther away a point is from the origin, the
more interesting that point is. Likewise, the points that lie together in space are more
correlated with each other than those points which lie farther apart in space. For all
of the following experiments, the first three columns of the U*S matrix were used to
plot the points.
3.4 Combination of Datasets
There are two possible methods of combining datasets for performing attribute selection. First, the datasets can be combined and then the random forests algorithm can
be run to select the best attributes. Second, attribute selection can be performed on
the individual datasets and then the best attributes selected from each and combined
together. In order to determine the most appropriate method for these experiments, all possibilities were created. An SVM was then run on all of the combined datasets to determine which approach was best.
3.5 Attribute Selection
Attribute selection is the process of removing attributes which contain less useful
information for the task at hand. By choosing attributes that appear to provide the
most useful information, the dimensionality of the problem decreases which helps to
improve the quality of the experiments. This is especially true for datasets which are
as large as these. For example, in the SNP dataset, not all 13917 SNPs are likely to
be relevant to this problem. Having such a large number of attributes not only makes
it difficult to perform accurate classification, but the tests themselves become very
inefficient to run.
There are many ways of performing attribute selection, but because of the size of
these datasets and the quality of the process a good choice is to use random forests.
At the completion of the random forest algorithm, an output file contains the Gini index values for all of the attributes which were used for splitting. One problem
that exists when using this algorithm with such a large dataset is that, in order
for every attribute to be selected in the algorithm, a large number of trees must be
built. However, because of the size of these datasets and the limitations of current
hardware, it was not possible to run the algorithm long enough for every attribute
to be considered. The solution to this was to run the algorithm several times and
combine the results of each trial until all, or almost all, of the attributes have Gini index values.
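A sketch of this select-by-combined-importance procedure, using scikit-learn's Gini-based feature importances as a stand-in for the Gini index output file described above (the original tooling is not specified in this excerpt, and the data and run counts are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a patients x SNPs matrix with a binary outcome.
rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(120, 300)).astype(float)
y = (X[:, 0] + X[:, 1] > 2).astype(int)      # outcome driven by two SNPs

# Combine importances across several independent runs, as in the text.
importances = np.zeros(X.shape[1])
for seed in range(4):
    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    importances += rf.fit(X, y).feature_importances_

top = np.argsort(importances)[::-1][:25]     # keep the top 25 attributes
print(sorted(top[:2]))
```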
The setup of the algorithm for each dataset is shown in Table 3.1. For each dataset
the top 25, 50, 100, 250, 500, 1000, 2500 and 5000 genes were selected for further
analysis. These subsets were chosen in order to capture the features of the datasets
as they change from very few attributes to a large number of attributes.
Table 3.1: Attribute selection using Random Forests

Dataset             Patients  Attributes  Trees Built
SNP                 137       13917       4x30000
cDNA                68        10027       4x30000
Affy                144       22277       4x30000
SNP & cDNA          49        23944       8x20000
SNP & Affy          118       36194       10x14000
cDNA & Affy         55        32304       10x14000
SNP & cDNA & Affy   49        46221       12x11000

3.6 Analysis of Selected Attributes
By removing a large fraction of the attributes, we have discarded a large amount of data that is not informative for these experiments. We are not taking a minimalist approach to this problem by assuming that there is a small set of genes or SNPs which
can be used to distinguish between whether a patient will live or die. We believe that
it is important to incorporate as much information as possible in order to build the
best model. However, in datasets of this size, it is clear that there are likely to be
many attributes which are irrelevant to the problem at hand. These are the attributes
we are interested in removing. All decisions about how many attributes to select are
arbitrary. However, it is possible to make a principled decision by testing the effect of
adding more attributes and observing if they provide any new information. Including
too many attributes in an analysis makes it difficult to find any interesting information due to noise and variation unrelated to the properties of interest. On the other
hand, including too few attributes can make it difficult to find any useful information
or to make the model generalizable. This is why we have chosen to test several subsets
of varying size so we can find the subset of attributes that best describes the dataset.
For each of the subsets, we performed classification using an SVM. This was done
to gain an understanding about the effect of increasing the size of the subsets. Then,
we performed SVD to see whether there were any significant clusters in the data when
we labeled the images with mortality or subtype of disease.
We performed this analysis on each of the following datasets:
• SNP dataset
• cDNA dataset
• Affy dataset
• Clinical dataset
• SNP and cDNA dataset
• SNP and Affy dataset
• cDNA and Affy dataset
• SNP, cDNA and Affy dataset
3.7 Validation of Results
It is important to be able to validate results and prove that they are not due to
random chance or to overfitting the data. Because of the difficulty of obtaining new
and relevant data in this field, it was necessary to use sophisticated techniques in
order to validate the results we have obtained. We used random label shuffling to test our results.
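Random label shuffling can be sketched as follows (scikit-learn assumed; the data, classifier and number of shuffles are illustrative). If the score on the real labels does not clearly beat the scores obtained on shuffled labels, the result may be due to chance or overfitting:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data where the real labels carry signal.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))
y = (X[:, 0] > 0).astype(int)

real_score = cross_val_score(SVC(), X, y, cv=5).mean()

# Shuffle the labels repeatedly and re-score; a genuine result should
# beat the accuracy achieved on nearly every shuffled labelling.
shuffled_scores = [cross_val_score(SVC(), X, rng.permutation(y), cv=5).mean()
                   for _ in range(20)]

print(real_score > max(shuffled_scores))
```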
3.8 Further SNP Analysis
Because of the interesting results found for the SNP dataset, we decided to investigate
this further. There are many different experimental pathways that can be explored
during a data-mining experiment. We decided to explore five different pathways: predicting relapse, graph analysis, a reformatting of the data, updating the patient labels and a cross validation of the attributes.
3.8.1 Predicting relapse
The previous attribute selection was performed using the patient mortality as a class
label. In order to select attributes that are predictive of relapse, the random forest
algorithm was performed for the SNP data with the class label being whether or not
the patient relapsed.
3.8.2 Graph analysis
The previous analysis using SVD focused on an individual’s attribute values to determine their place in the object space. Another way to view this problem is to compare
patients to each other using the dot product to create a similarity matrix. This matrix is n by n, and entry (i, j) represents the similarity between patients i and
j. By applying a threshold to these data, any value less than the threshold becomes
a 0 and the matrix can now be viewed as an adjacency matrix for a graph. SVD is
then applied to this matrix and plotted. All non-zero entries in the matrix represent
an edge between the points in the graph.
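The similarity-matrix construction and thresholding can be sketched as follows (NumPy assumed; the matrix sizes and threshold value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
patients = rng.normal(size=(10, 40))     # rows = patients, columns = attributes

# n-by-n similarity matrix of pairwise dot products.
S = patients @ patients.T

# Thresholding: values below the threshold become 0; the result can be
# read as a (weighted) adjacency matrix whose non-zero entries are edges.
threshold = 5.0
A = np.where(S < threshold, 0.0, S)

# SVD of the adjacency matrix can then be plotted as before.
U, s, Vt = np.linalg.svd(A)
print(A.shape)
```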
3.8.3 Reformatting the data
As explained previously, the SNP data is coded to represent the three possible alleles.
The previous data had one attribute for each SNP which had three possible values.
Another approach we developed was to split each SNP into three attributes; one for
each allele. For each SNP, a patient would have a value of 1 for the allele they had and
0 for the other two alleles. We then performed attribute selection on this dataset with
the hope that if there were any specific alleles for certain SNPs that were interesting,
they would be selected using this method. We then applied SVD to the resulting
datasets and observed the results.
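The one-attribute-per-allele recoding can be sketched as follows (NumPy assumed; the genotype codes AA/AB/BB follow the text and the toy data are illustrative):

```python
import numpy as np

# Genotypes coded as one attribute per SNP, each with three possible values.
genotypes = np.array([["AA", "AB"],
                      ["BB", "AA"],
                      ["AB", "AB"]])
alleles = np.array(["AA", "AB", "BB"])

# Split each SNP into three 0/1 attributes, one per allele code:
# a patient gets a 1 for the allele they have and 0 for the other two.
onehot = np.concatenate(
    [(genotypes[:, j:j + 1] == alleles).astype(int)
     for j in range(genotypes.shape[1])],
    axis=1)

print(onehot.shape)   # (3, 6): 2 SNPs x 3 alleles each
```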
3.8.4 Updating patient labels
Late in the development of this research, we were able to obtain updated data for
the patients. This included five patients who had since died and seven who had
relapsed. We were interested to see where these patients lie in our previous space,
and so the SVD images were relabeled with the updated information. We also reran
the random forest algorithm to see the effect on the attribute-selection process with
updated labels.
3.8.5 Cross validation of the attributes
To further support the attribute-selection process, we divided the selected subsets
into smaller subsets that we then ran through the random forest algorithm. We then
compared the resulting attribute lists in order to see which attributes appeared in
multiple lists or only in one.
Chapter 4
Results
In this chapter we report and explain the results of the experiments. First, we look at
the clustering of the data based on the SVD images for each of the individual datasets
as well as the combined datasets. Next, we explore the subsets of the data created
through the attribute-selection process and evaluate these using SVD as well as SVM.
Then, we discuss the results of attribute selection and present the top attributes from
each dataset. We also explore the SNP dataset further in order to find the biological
significance of the results.
4.1 SVD Results
Each of the following results was obtained by plotting the first three dimensions of
the U matrix in order to visualize the data. All of the images are labeled with clinical
information, such as the patient mortality outcome or the subtype of the disease. By
doing SVD on the entire dataset as an early step, we learn some general structures
that may exist in these data. We can also use these images as a benchmark for how
our subsequent analysis performs.
4.1.1 SVD on SNP data
The resulting SVD image for the full SNP dataset can be seen in Figure 4.1. There
appear to be two fairly well defined clusters in the data but they are clearly not
related to mortality, seen in Figure 4.1(a), or subtype of disease, seen in Figure 4.1(b).
Since this dataset includes 13917 SNPs, not all of these SNPs are expected to be
associated with leukemia. Upon further investigation we were able to determine that
the spread of the data is caused by the way in which the data is coded. This causes
the data to tend to form three clusters based on whether patients have a majority of
SNPs of genotype AA, AB or BB.
[Figure 4.1: SVD images of SNP data. (a) SNP data labeled with mortality: blue = alive, red = deceased. (b) SNP data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.]
4.1.2 SVD on cDNA data
The SVD image for the cDNA dataset does not appear to contain any noticeable
clusters based on the mortality label, as seen in Figure 4.2(a). When labeling this
image with the patients’ subtype of the disease, more interesting results appear. As
seen in Figure 4.2(b), there is a fairly well defined cluster of T-cell patients. This
implies that the cDNA genes are able to capture a variation in the subtype of disease
that the patients have. This is not surprising since it is well known that the difference
between subtypes can be distinguished by the expression of only one gene [27].
[Figure 4.2: SVD images of cDNA data. (a) cDNA data labeled with mortality: blue = alive, red = deceased. (b) cDNA data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.]
4.1.3 SVD on Affy data
The SVD results of the Affy dataset did not provide much useful information. The
images, labeled with mortality and subtype, are seen in Figures 4.3(a) and (b). It is
clear that there are no separations in the data based on either label. However, when
labeling by type it can be seen that there are no T-cell patients in the main cluster
of points. This suggests that the Affy data contains some information regarding the
subtype of the disease. Since there are so many points whose disease type is labeled
as unknown it is difficult to be confident in this assessment. It is interesting to note
that the cDNA data separates the data better than the Affy data does. This was
moderately surprising as Affymetrix technology is generally accepted to be more reliable than cDNA microarrays. We believe this is related to the way in
which the data was collected and perhaps to issues combining datasets from different
operators.
[Figure 4.3: SVD images of Affy data. (a) Affy data labeled with mortality: blue = alive, red = deceased. (b) Affy data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.]
4.1.4 SVD on clinical data
The shape of this dataset is interesting, as seen in Figure 4.4. There are four parallel
clusters of data, but these cannot be explained by the mortality or subtype labels.
Upon further investigation it is found that the three leftmost clusters consist of the
data for patients who had a genetic translocation. More specifically, the cluster farthest to the left contains data for patients who had a BCR-ABL translocation, and to the right of that is a cluster of data for patients who had a TEL-AML1 translocation.
The cluster of data points on the far right can only be described as the data points
related to patients who did not have a translocation. The spread of data from the
bottom to the top of the cluster has been identified as being affected by the size of
the liver at diagnosis, the size of the spleen at diagnosis as well as the initial platelet
count. It is clear that medical decisions about diagnosis, prognosis or treatment based on these data alone would likely be poor. This is the type of information
that is presently being used for decisions regarding leukemia. Based on the results of
the genetic datasets, we believe that it is important that this information be used to
help support the diagnosis, prognosis and treatment decisions.
[Figure 4.4: SVD images of Clinical data. (a) Clinical data labeled with mortality: blue = alive, red = deceased. (b) Clinical data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.]
4.1.5 SVD on combined SNP and cDNA data
The combined datasets have many more attributes than the individual datasets. With
a large dataset there are going to be many attributes which are irrelevant for our
purposes and will not contain any useful information. If there are many more of
these attributes than useful attributes, the SVD may not be able to discover any information from the useful attributes and the experiment will not be successful. The
images in Figures 4.5 (a) and (b) demonstrate this effect. When labeling by subtype,
all but one of the T-cell patients are clustered together. However, there are also
several B-cell patients in this cluster as well. This does support the theory that the
cDNA dataset primarily contains information regarding the subtype of the disease.
[Figure 4.5: SVD images of combined SNP-cDNA data. (a) SNP-cDNA data labeled with mortality: blue = alive, red = deceased. (b) SNP-cDNA data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.]
4.1.6 SVD on combined SNP and Affy data
It is interesting to note that the shape of the data in the SVD image, shown in Figure
4.6, is similar to that of the Affy dataset alone. This shows that the SNP dataset does
not have a powerful global effect when it is combined with the larger Affy dataset.
As such, the evaluation of this image is similar to that of the previously described Affy
dataset. Once again, the mortality and subtype labels, shown in Figures 4.6(a) and
(b), do not provide any explanation for the shape of this dataset.
[Figure 4.6: SVD images of combined SNP-Affy data. (a) SNP-Affy data labeled with mortality: blue = alive, red = deceased. (b) SNP-Affy data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.]
4.1.7 SVD on combined Affy and cDNA data
This dataset was interesting because the shape of the data is similar to that of the
cDNA dataset alone. This is surprising because the Affy dataset is more than double
the size of the cDNA dataset, and so in order for this to happen the cDNA dataset
must contain many more globally powerful attributes. As seen in Figure 4.7(a), the
mortality label does not provide any meaningful separation in the data, but in Figure
4.7(b), the subtype label does appear to be fairly well separated. It is clear that
the cDNA data has the ability to capture variation in the patients based upon their
subtype of the disease. It is important to note that the combination of these two
datasets is different from combining either of them with the SNP data. These two
datasets are intended to capture the same information, that is, the genetic expression
levels of certain genes. This suggests that they should be similar to each other and
the combination may not provide any interesting information.
[Figure 4.7: SVD images of combined cDNA-Affy. (a) labeled with mortality: blue = alive, red = deceased. (b) labeled with subtype: red = T-cell, blue = B-cell, green = unknown.]
4.1.8 SVD on combined SNP, cDNA and Affy data
The shape of this dataset is also similar to that of the cDNA dataset, suggesting again
that the cDNA dataset contains the most obvious structure. The results of labeling
by mortality are shown in Figure 4.8(a). It can be seen that there are no tight clusters
of patients. When labeling by subtype, as shown in Figure 4.8(b), the T-cell patients
appear to cluster together on the bottom left of the image. Since the separation
based on subtype continues to appear with these large microarray datasets, it is clear
that the genetic expression patterns for these two subtypes are quite distinct, which
enables the SVD to discover the separation between them.
[Figure 4.8: SVD images of combined SNP-cDNA-Affy. (a) labeled with mortality: blue = alive, red = deceased. (b) labeled with subtype: red = T-cell, blue = B-cell, green = unknown.]
So far, we have seen that the entire datasets contain only weak clusters that are
mostly related to the subtype of the disease. Next, we look at subsets of the attributes
as determined by the random forests algorithm.
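The SVD images discussed throughout this chapter come from projecting each patient onto the first few singular vectors of the centered patient-by-attribute matrix. A minimal sketch of that projection, using hypothetical data rather than the thesis datasets:

```python
import numpy as np

def svd_coordinates(X, k=2):
    """Project each patient (row of X) onto the top-k singular vectors.

    X is a patients-by-attributes matrix; the rows of the returned array
    are the 2-D coordinates used for the scatter plots in this chapter.
    """
    Xc = X - X.mean(axis=0)                    # center each attribute
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]                    # patient coordinates in SVD space

# Toy demonstration: 6 hypothetical patients, 50 hypothetical attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))
coords = svd_coordinates(X)
print(coords.shape)  # (6, 2)
```

Coloring each point by a label such as mortality or subtype then produces plots like those in Figures 4.6 to 4.16.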
4.2 Combination of Datasets
To properly test which method of combination was better, we explored the combined
SNP and cDNA dataset. The SVM results for the two methods of combination are
shown in Table 4.1 and Table 4.2. It is quite clear that first combining the datasets
and then performing attribute selection gives much better accuracy than performing
attribute selection on the individual datasets and then combining
the top attributes. The reason this method works better is that the random forests
algorithm is able to find correlations between the two sets of data when selecting the
best attributes for splitting. It is interesting that this SNP-cDNA dataset showed
a significant improvement with this method of combination, because it means that
the correlation between the SNP dataset and the cDNA dataset provides meaningful
information about the mortality of patients. Because of this, we decided to use this
method of combination for all experiments.
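The better-performing order of operations — join the per-patient matrices first, then rank all attributes at once — can be sketched as follows. This is an illustration, not the thesis code: the thesis used random-forest importances, while this dependency-free stand-in scores each attribute by the absolute difference of class means.

```python
import numpy as np

def top_attributes(X, y, k):
    """Return indices of the k highest-scoring attributes.  A simple
    class-mean-difference score stands in for the random-forest
    importances used in the thesis."""
    score = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(score)[::-1][:k]

rng = np.random.default_rng(1)
n = 40
snp = rng.normal(size=(n, 30))     # hypothetical SNP matrix (30 attributes)
cdna = rng.normal(size=(n, 20))    # hypothetical cDNA matrix (20 attributes)
y = np.array([0] * 30 + [1] * 10)  # hypothetical mortality label

# Combine first, then select: the ranking is computed over both datasets
# at once, so the top-k list can mix attributes from either source (and,
# with random forests, cross-dataset correlations shape the ranking).
combined = np.hstack([snp, cdna])
top = top_attributes(combined, y, 10)
n_snp = int((top < snp.shape[1]).sum())
print(n_snp, 10 - n_snp)           # how the top 10 split between SNP and cDNA
```

Selecting first and combining afterwards would instead call `top_attributes` on `snp` and `cdna` separately, which is the ordering that performed worse in Table 4.2.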
Table 4.1: SVM prediction accuracy of combining datasets and then doing attribute selection (6-fold cross validation)

Attributes    % Class Alive    % Class Deceased
25            100              100
50            100              100
100           100              100
250           100              100
Table 4.2: SVM prediction accuracy of attribute selection and then combining datasets (6-fold cross validation)

Attributes    % Class Alive    % Class Deceased
25            97.62            57.14
50            97.62            100
100           100              85.71
250           97.62            71.43
4.3 Analysis of Selected Data
Here we discuss the results of the experiments which look further into the selected
datasets. For each of the following datasets, we first look at SVM results from each
subset of attributes predicting mortality, to develop an understanding of which
subsets were likely to be the most predictive. Then we used SVD to see whether there
were any new and interesting clusters of data that we could not see with the original
datasets. We look for separation of the data for both mortality and subtype in order
to attempt to explain the clusters. The goal is to find the subset of attributes for
each dataset that best separates the patients based on their mortality. It is important
to note that we are not looking for the smallest possible subset of genes to classify
the data; rather, we are looking for the number of genes that incorporates as much
data as possible without including so much data that any interesting separations
are lost. This is a somewhat arbitrary process. However, through careful use of visualizations,
classification techniques and many test sets, it is possible to make an educated decision
about where this cut-off should be.
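The per-class accuracies in the tables that follow ("% Class Alive", "% Class Deceased") come from 6-fold cross-validation of an SVM. A hedged, dependency-free sketch of that evaluation, with a nearest-centroid rule standing in for the SVM and synthetic data standing in for the patients:

```python
import numpy as np

def per_class_accuracy_cv(X, y, folds=6, seed=0):
    """k-fold cross-validation reporting accuracy separately per class,
    as in the '% Class Alive' / '% Class Deceased' columns.  A
    nearest-centroid classifier stands in for the SVM used in the
    thesis, to keep the sketch dependency-free."""
    order = np.random.default_rng(seed).permutation(len(y))
    hits, totals = {0: 0, 1: 0}, {0: 0, 1: 0}
    for f in range(folds):
        test = order[f::folds]                 # every folds-th shuffled sample
        train = np.setdiff1d(order, test)
        centroids = {c: X[train][y[train] == c].mean(axis=0) for c in (0, 1)}
        for i in test:
            pred = min(centroids, key=lambda c: np.linalg.norm(X[i] - centroids[c]))
            totals[y[i]] += 1
            hits[y[i]] += int(pred == y[i])
    return {c: 100.0 * hits[c] / totals[c] for c in (0, 1)}

# Synthetic, imbalanced data echoing the roughly 90%/10% alive/deceased split.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (36, 5)),   # "alive"
               rng.normal(4.0, 1.0, (12, 5))])  # "deceased"
y = np.array([0] * 36 + [1] * 12)
acc = per_class_accuracy_cv(X, y)
print(acc)
```

Reporting the two classes separately matters here precisely because of the imbalance: a classifier that ignores the deceased class entirely would still score near 90% overall.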
4.3.1 Analysis of selected SNP data
The SVM results for the subsets of the SNP dataset are shown in Table 4.3. This
dataset contained 137 patients, and of these 123 (89.8%) are alive and 14 (10.2%) are
deceased.
Table 4.3: SVM prediction accuracy of SNP subsets (6-fold cross validation)

Attributes    % Class Alive    % Class Deceased
25            100              71.43
50            99.19            92.86
100           100              100
250           100              100
500           100              92.86
1000          100              92.86
2500          100              64.29
5000          100              21.43
The results of the SVM classification are quite promising. It is clear that 25
attributes are not sufficient to capture the necessary information, while the 2500 and 5000
attribute subsets contain so much irrelevant information that any meaningful separation is
lost. A range between 50 and 1000 attributes appears to be the best size for a subset of
attributes for this dataset.
The resulting SVD of the top 25 selected SNPs is shown in Figure 4.9(a). It is
clear that there is clustering in this data based on the mortality label, as we expected
from the SVM classification. However, it is not a perfect separation. Next, the top
50 attributes (not shown) show a similar situation with only slightly more distance
between the two clusters. However, when looking at the top 100 attributes in Figure
4.9(b) we can now see clear separation between the two clusters. This is exactly what
we expected to find based upon the SVM results. This becomes even more apparent
when looking at the top 250 attributes in Figure 4.9(c). The distance between the
two clusters has been maximized in this subset, and so we believe that most of the
important attributes for mortality are contained in this subset of the data. In order
to fully understand the classification power of this subset of 250 SNPs in separating
data based on a mortality label, it is necessary to apply this to a new set of patients
and see if the desired result is still obtained. However, due to the lack of available
data for ALL patients this was not possible for this study.
[Figure 4.9: SVD images of selected SNP data: (a) 25 SNP, (b) 100 SNP, (c) 250 SNP, (d) 500 SNP, (e) 2500 SNP, (f) 5000 SNP. blue = alive, red = deceased.]
This dataset becomes interesting when looking at the top 500 attributes in Figure
4.9(d). The data is still clearly separated based on mortality. However, there appears
to be another separation forming as the data begins to divide up into four clusters.
This suggests that the data is becoming more like the clusters that we saw in the original
SNP dataset. This can be seen even more clearly in the top 2500 and 5000 attribute
subsets in Figure 4.9(e) and (f). By this time, the data has migrated back to the
original clusters, but it is still possible to see separation based on mortality. However,
when looking at the results of the SVM it is clear that for these two datasets it has
become difficult to classify the deceased patients properly. This suggests that there
are too many attributes included in this dataset for the relevant information to be
discovered.
[Figure 4.10: SVD images of 250 SNP data. (a) 250 SNP Treatment: blue = BFM98, red = Study8. (b) 250 SNP Risk: blue = alive, red = deceased, with points annotated by risk category (Low/Med/Std/High).]
The most important and unexpected result that came from this analysis was that
there appears to be a relationship between the genetics of an individual and their
mortality under current treatment regimes. Although it is known that there are
genetic factors associated with leukemia, it is not widely believed that there exist
genetic factors which distinguish between those who will live and those who will die once they
have this disease. It is also surprising how clear the separation seen in the data
is. There is no spectrum where the risk is spread from low to high; there only
appears to be extreme risk and lower risk. When labeling these images with
information such as the treatment the patients received, or the risk classification they were
given, as seen in Figure 4.10, there is no structure to the data.
suggests that what is currently being done to treat leukemia is not enough. Based
on these results we believe that this genetic information will provide physicians with
another level of understanding which they can use to make better decisions about
treatment. The current leukemia treatment regimes do not take into account patient
genetics, but rather their clinical manifestations of the disease. What these results
suggest is that certain individuals' genetics place them at extreme risk. Therefore,
treatment plans for these higher risk individuals need to be tailored to their unique
genetics. It is also possible that current treatment plans are insufficient for these
genetically different patients.
4.3.2 Analysis of selected cDNA data
The SVM results for the subsets of the cDNA dataset are shown in Table 4.4. This
dataset contained 68 patients, and of these 59 (86.8%) are alive and 9 (13.2%) are
deceased.
Table 4.4: SVM prediction accuracy of cDNA subsets (6-fold cross validation)

Attributes    % Class Alive    % Class Deceased
25            96.61            77.78
50            100              66.67
100           100              77.78
250           100              77.78
500           100              66.67
1000          100              66.67
2500          100              44.44
5000          100              22.22
The SVM classification results for this dataset were not as good as they were for
the SNP dataset. As such, we did not expect to see a clear separation in the data
based on the mortality label from the SVD images. Through the exploration of the
full cDNA dataset, we discovered that it forms two clusters based on the subtype of
the disease. Although this is not what we were looking for, it was interesting to see
that the expression levels of a patient's genes can be used to distinguish between T-cell
and B-cell ALL.
From the first 25, 50 and 100 attribute subsets, shown in Figure 4.11(a),(b) and
(c), we can see a general clustering based on the mortality label. When labeling these
same images with the subtype label (not shown) there is no clear grouping of these
patients. However, when the top 250 attributes are plotted, as seen in Figure 4.11(d),
it is quite clear that there are two well formed clusters. These clusters are based on
the subtype of the disease and this raises a critical point about this dataset. When
looking at the SVM classification results, there is no drop off in accuracy between
100 and 250 attributes even though it is clear visually that they cluster differently.
Upon further investigation it was seen that all but two of the deceased patients in
this dataset have T-cell ALL, which can create misleading results when only looking
at the classification. If the data separates well based on subtype, and most of the
deceased are of one type, then the classification results will suggest that the data
separates better than it does. This is why it is important to understand the data
being analyzed and to scrutinize every result that is found.
As more attributes are added, the separation based on mortality begins to disappear and eventually even the SVM does not perform well. From these results, we
believe that there is important information about mortality contained within the first
100 attributes of this dataset. This again suggests that there is a link between the
genetics of a patient and the outcome of their disease and since this observation has
been made in two separate datasets our confidence in this conclusion has increased.
[Figure 4.11: SVD images of selected cDNA data: (a) 25 cDNA, (b) 50 cDNA, (c) 100 cDNA, (d) 250 cDNA. blue = alive, red = deceased.]
4.3.3 Analysis of selected Affy data
The SVM results for the subsets of the Affy dataset are shown in Table 4.5. This
dataset contains 144 patients, of which 128 (88.9%) are alive and 16 (11.1%) are
deceased.
Table 4.5: SVM prediction accuracy of Affy subsets

Attributes    % Class Alive    % Class Deceased
25            100              0
50            99.92            0
100           93.75            31.25
250           94.53            25
500           96.09            31.25
1000          96.09            31.25
2500          92.19            25
5000          94.53            25
The results of the SVM classification were poor for this dataset. This suggested
that the Affy dataset would not provide much structure to the data based on the
mortality label. The SVD results from the entire Affy dataset did not show any
significant clusters based on the patient’s mortality or subtype of the disease. This
continues to hold true for the subsets as well. The top 50 and 250 attribute subsets
are shown in Figure 4.12 labeled with both mortality and subtype. It is quite clear
that there are well defined clusters of the data based on the subtype of the disease.
Subtype labels are missing for many patients, but it is easy to see how their
subtype should be labeled. Although there does appear to be clustering based on
the mortality label, we run into the same problem as with the cDNA data. It is the
nature of the disease that patients with T-cell ALL are at a higher risk, and it is
therefore not surprising that more of the deceased patients have T-cell ALL.
[Figure 4.12: SVD images of selected Affy data: (a) 50 Affy mortality, (b) 50 Affy subtype, (c) 250 Affy mortality, (d) 250 Affy subtype. (a/c) blue = alive, red = deceased; (b/d) blue = B-cell, red = T-cell, green = unknown.]
4.3.4 Analysis of combined SNP data and cDNA data
The SVM results for the subsets of the combined SNP and cDNA datasets are shown
in Table 4.6. In this dataset there were 49 patients, of those 42 (85.7%) are alive and
7 (14.3%) are deceased.
Table 4.6: SVM prediction accuracy of combined SNP and cDNA subsets

Attributes    % Class Alive    % Class Deceased    #SNP/#cDNA
25            100              100                 10/15
50            100              100                 21/29
100           100              100                 46/54
250           100              100                 114/136
500           100              100                 232/268
1000          100              85.71               489/511
2500          100              85.71               1339/1161
5000          100              71.43               2856/2144
It was expected that this dataset would perform well since both of the individual
datasets showed the ability to separate the data based on the mortality label. The
SVM results for this dataset were impressive with the first five subsets all showing
100% classification of both classes.
The top 25, 500, 1000 and 2500 attribute subsets are shown in Figure 4.13. The
separation is clear and it is interesting again to notice that there is no transition of
points from a low risk to a higher risk but rather a stark separation of dead and alive.
It is also interesting to note that the separation can be seen from 25 attributes all
the way to 5000 attributes. Although the SVM prediction is not 100% for the largest
3 datasets, the images all show the same separation. This further goes to show that
the minimalist approach to a biological problem such as this is not the most feasible
solution. Although there is an excellent separation with only 25 genes, there is an
equally good separation with 250 genes or even 500 genes. It is quite possible that
there may be a genetic difference between the patients who live and who die based on
the combination of many attributes. By keeping as many attributes as possible, more
information about the patients is gained and more can be learned from them. Our
approach is one of removing the poor attributes as opposed to finding the smallest
possible subset of attributes that separate the data.
This result confirms that the combination of datasets can be beneficial if performed properly. The distribution of SNPs and cDNA within each subset in Table
4.6 is roughly equal but with slightly more cDNA entries in each of the subsets that
perform well. The random forests algorithm is able to find the most informative attributes for classifying the data based on the label provided. By selecting the top
attributes from the combined data we can see that we achieve a new level of separation spanning multiple subsets. Since we are not looking for the smallest possible
subset of attributes, we can safely assume that the top 500 attributes are sufficient
to separate the data accurately. Validating these results is difficult since there is
little available data to test the subsets on. The problem of validation will be further
explored in a later section.
Our conclusion that the genetics of a patient affects their survivability is supported
again by this result. We have seen that the SNP and cDNA datasets individually
can separate the data based on the mortality label and now the combination of the
two datasets provides an even better separation. This is an important finding as it
suggests that there is more information that should be available to physicians when
they diagnose and treat this disease. By incorporating as much useful information
as possible it is believed that the higher risk patients can be identified earlier and
treated appropriately, thereby reducing the number of deaths.
[Figure 4.13: SVD images of selected SNP-cDNA data: (a) 25 SNP-cDNA, (b) 500 SNP-cDNA, (c) 1000 SNP-cDNA, (d) 2500 SNP-cDNA. blue = alive, red = deceased.]
4.3.5 Analysis of combined SNP data and Affy data
The SVM results for the subsets of the combined SNP and Affy datasets are shown
in Table 4.7. This dataset contained 118 patients, of which 105 (89.0%) are alive and
13 (11.0%) are deceased.
Table 4.7: SVM prediction accuracy of combined SNP and Affy subsets

Attributes    % Class Alive    % Class Deceased    #SNP/#Affy
25            100              0                   11/14
50            100              0                   19/31
100           100              0                   47/53
250           91.43            0                   122/128
500           94.29            30.77               243/257
1000          94.29            7.69                502/498
2500          96.19            15.38               1215/1285
5000          94.29            7.69                2360/2640
Initially it was expected that this combination should behave similarly to the
combined SNP and cDNA dataset since the cDNA and Affy measure the same type
of information. However, it can be seen from the SVM results that this dataset did
not perform well at all. Due to the poor classification, it was not expected that the
SVD would show anything interesting. The top 25 and 100 attribute subset images
are shown in Figure 4.14(a) and (b). It is quite clear that in these images there is
no clear separation of the data. We can therefore conclude that this dataset does
not provide any useful information for this study and the investigation of this dataset
did not go any further.
[Figure 4.14: SVD images of selected SNP-Affy data: (a) 25 SNP-Affy, (b) 100 SNP-Affy. blue = alive, red = deceased.]
4.3.6 Analysis of combined cDNA data and Affy data
The SVM results for the subsets of the combined cDNA and Affy datasets are shown
in Table 4.8. This dataset contains 55 patients, and of these 47 (85.5%) are alive and
8 (14.5%) are deceased.
Table 4.8: SVM prediction accuracy of combined cDNA and Affy subsets

Attributes    % Class Alive    % Class Deceased    #cDNA/#Affy
25            97.87            87.50               14/11
50            95.74            75                  29/21
100           100              100                 49/51
250           97.87            62.50               128/132
500           100              75                  233/267
1000          100              62.50               479/521
2500          100              62.50               1172/1328
5000          100              62.50               2231/2769
It was expected that this dataset would be good at separating the patients based
upon their subtype of the disease, since both of the individual datasets displayed an
ability to do this. This did appear to be true for subsets which contained a larger
number of attributes. However, from the results of the SVM we can see that the
smaller datasets perform quite well, especially the 100 attribute subset. This can be
seen in Figures 4.15(a) and (b).
This becomes interesting when looking at a subset of 250 attributes, seen in Figure
4.15(d). Clearly, there are two well defined clusters of the data when it is labeled with
the subtype of the disease with only a few points which appear to be either mislabeled
or misclassified. However, when looking at this same figure labeled with mortality in
Figure 4.15(c), we see that the data appears to be somewhat separated within the
clusters based upon the mortality label. One problem with this dataset is that
there is only one deceased subject with B-cell leukemia, so it becomes more difficult
to see whether the separation we see on the T-cell side of the image would hold true for
the B-cell side. The larger subsets of the dataset lose this separation based upon
mortality and only separate based upon the subtype of the disease, as seen in Figures
4.15(e) and (f). This is the first dataset in which we have any indication that the
Affy dataset contains any useful information regarding the mortality of the patients.
[Figure 4.15: SVD images of selected cDNA-Affy data: (a) 25 cDNA-Affy, (b) 100 cDNA-Affy, (c) 250 cDNA-Affy, (d) 250 cDNA-Affy subtype, (e) 1000 cDNA-Affy subtype, (f) 2500 cDNA-Affy subtype. (a/b/c) blue = alive, red = deceased; (d/e/f) blue = B-cell, red = T-cell, green = unknown.]
4.3.7 Analysis of combined SNP data, cDNA data and Affy data
The SVM results for the subsets of the combined SNP, cDNA and Affy datasets are
shown in Table 4.9. This dataset contains 49 patients, and of those 42 (85.7%) are
alive and 7 (14.3%) are deceased.
Table 4.9: SVM prediction accuracy of combined SNP, cDNA and Affy subsets

Attributes    % Class Alive    % Class Deceased    #SNP/#cDNA/#Affy
25            100              85.71               9/12/4
50            100              85.71               16/19/15
100           100              91.43               29/39/32
250           97.62            85.71               85/86/79
500           97.62            71.43               178/155/167
1000          100              62.50               368/283/349
2500          97.62            71.43               934/698/868
5000          97.62            71.43               1877/1329/1794
From the results of the SVM we see that there is fairly good classification for most
of the smaller subsets. The top 25 and 100 attributes are shown in Figures 4.16(a)
and (b). These images look similar to the results of the combined cDNA and Affy
data. This suggests that the SNP data does not have a large influence on the data,
which is surprising because the SNP data can separate the data well. After the 250
subset, the data begins to be separated based upon subtype label as we have seen
in most of the other datasets. It is clear that by incorporating the Affy dataset, the
useful information contained in both the SNP and the cDNA datasets is overpowered,
resulting in a poorer separation.
[Figure 4.16: SVD images of selected SNP, cDNA and Affy data: (a) 25 all combined, (b) 100 all combined. blue = alive, red = deceased.]
4.4 Attribute Selection
Here we present the results of attribute selection for each of the datasets previously
mentioned. We have analyzed these subsets in the previous section, and based on
these results we have determined the best subset for each dataset. For each of the
following lists, a bold-face entry represents an attribute that was found in multiple
lists, and the corresponding number in brackets is the number of times that attribute
was seen, with a maximum of seven.
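The bracketed multiplicities can be produced by counting each attribute's occurrences across the per-dataset top lists. A small sketch with hypothetical lists (the real lists are Tables 4.10 and 4.11):

```python
from collections import Counter

# Hypothetical top-attribute lists from four datasets.
lists = [
    ["WDR77", "TRIM37", "PVR"],
    ["WDR77", "FTO", "PVR"],
    ["WDR77", "CTNNA1", "PVR"],
    ["WDR77", "FKBP5", "PVR"],
]

counts = Counter(g for lst in lists for g in lst)
# Attributes seen in more than one list would be shown in bold face,
# with the count in brackets, e.g. "WDR77 (4)".
annotated = {g: (f"{g} ({n})" if n > 1 else g) for g, n in counts.items()}
print(annotated["WDR77"])  # WDR77 (4)
```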
4.4.1 Attribute selection on SNP data
Here we present the top 100 SNPs that were selected by the random forest algorithm.
These attributes are listed in Table 4.10.
Table 4.10: Top 100 SNP Attributes

SNP          Gene            Description
rs735482     CD3EAP          Epsilon associated protein
rs17511668   N4BP2           NEDD4 binding protein
rs1533594    RTP4            Receptor (chemosensory) transporter protein
rs2808096    ARHGAP12        Rho GTPase activating protein (2)
rs4820853    SEC14L3         SEC14-like 3 (S. cerevisiae)
rs6077510    PLCB4           Phospholipase C, beta 4
rs4726514    Unknown
rs1140380    TMEM208         Transmembrane protein 208
rs12093154   FAM132A         Family with sequence similarity 132, member A
rs1551118    C12orf64        Chromosome 12 open reading frame 64
rs1109278    THA1P           Threonine aldolase 1 pseudogene
rs831510     MRAP            Melanocortin 2 receptor accessory protein
rs2657879    GLS2            Glutaminase 2 (liver, mitochondrial)
rs2491132    SDC3            Syndecan 3
rs5992917    LOC100129113    Hypothetical protein
rs2303063    SPINK5          Serine peptidase inhibitor, Kazal type 5
rs2289642    KIAA0753        Uncharacterized protein
rs4985404    PDPR            Pyruvate dehydrogenase phosphatase
rs2305830    CEP164          Centrosomal protein 164kDa
rs8113704    NFKBID          Nuclear factor of kappa light polypeptide
rs2288681    DIP2C           Disco-interacting protein 2 (2)
rs15702      NSL1            NSL1, MIND kinetochore complex component
rs3743044    USP8            Ubiquitin specific peptidase 8
rs6720173    ABCG5           ATP-binding cassette, sub-family G
rs6687605    LDLRAP1         Low density lipoprotein receptor adaptor
rs4642516    Unknown
rs11191274   GBF1            Golgi-specific brefeldin A resistant exchange factor 1
rs289723     NLRC5           NLR family, CARD domain containing 5
rs2306242    GAK             Cyclin G associated kinase
rs11543349   OGFR            Opioid growth factor receptor
rs10137972   SYNE2           Spectrin repeat containing, nuclear envelope 2
rs1265100    PSORS1C2        Psoriasis susceptibility 1 candidate 2
rs584367     PLA2G2D         Phospholipase A2, group IID
rs260462     ZNF544          Zinc finger protein 544
rs848210     SPEN            Spen homolog, transcriptional regulator 2
rs17080284   UQCRC1          Ubiquinol-cytochrome c reductase core protein I
rs2072770    RIBC2           RIB43A domain with coiled-coils 2
rs970547     COL12A1         Collagen, type XII, alpha 1
rs3181320    CASP5           Caspase 5, apoptosis-related cysteine peptidase
rs2178004    MGA             MAX gene associated
rs10408676   NOTCH3          Notch homolog 3 (Drosophila)
rs2043449    CYP20A1         Cytochrome P450, family 20, subfamily A, polypeptide 1
rs291102     PIGR            Polymeric immunoglobulin receptor
rs4865615    SLC38A9         Solute carrier family 38, member 9
rs2108622    CYP4F2          Cytochrome P450, family 4, subfamily F, polypeptide 2
rs17165906   VWDE            von Willebrand factor D and EGF domains
rs1042023    APOB            Apolipoprotein B (including Ag(x) antigen)
rs2295778    HIF1AN          Hypoxia inducible factor 1, alpha subunit inhibitor
rs35018800   TYK2            Tyrosine kinase 2
rs344140     SHROOM3         Shroom family member 3
rs1126642    GFAP            Glial fibrillary acidic protein
rs9928053    ACSM5           Acyl-CoA synthetase medium-chain family member 5
rs13058467   TTLL12          Tubulin tyrosine ligase-like family, member 12
rs4253301    KLKB1           Kallikrein-related peptidase 3
rs1050239    SMPD1           Sphingomyelin phosphodiesterase 1, acid lysosomal
rs4870       TNFRSF14        Tumor necrosis factor receptor superfamily, member 14
rs16971436   ZFHX3           Zinc finger homeobox 3
rs12625565   LPIN3           Lipin 3
rs848209     SPEN            Spen homolog, transcriptional regulator
rs1105879    UGT1A9          UDP glucuronosyltransferase 1 family, polypeptide A9
rs4073918    SLC6A18         Solute carrier family 6, member 18
rs854777     MYO15A          Myosin XVA
rs2297595    DPYD            Dihydropyrimidine dehydrogenase
rs2248490    WDR4            WD repeat domain 4
rs4987310    SELL            Selectin L
rs966384     LRG1            Leucine-rich alpha-2-glycoprotein 1
rs5745325    MSH4            MutL homolog 1, colon cancer, nonpolyposis type 2 (E. coli)
rs2281929    ZBTB46          Zinc finger and BTB domain containing 46
rs609320     RHCE            Rh blood group, CcEe antigens
rs500049     OBSCN           Obscurin, calmodulin and titin-interacting RhoGEF (2)
rs854800     MYO15A          Myosin XVA
rs597371     VWA2            von Willebrand factor A domain containing 2
rs6052       FGA             Fibrinogen alpha chain
rs2736155    BAT2            HLA-B associated transcript 2
rs3796318    FBLN2           Fibulin 2
rs6586179    LIPA            Lipase A, lysosomal acid, cholesterol esterase
rs3750904    SCN9A           Sodium channel, voltage-gated, type IX, alpha subunit
rs3848519    FECH            Ferrochelatase (protoporphyria)
rs6094752    NCOA3           Nuclear receptor coactivator 3
rs3800939    FBXL13          F-box and leucine-rich repeat protein 13
rs2240040    ZNF749          Zinc finger protein 749
rs2244492    TTN             Titin
rs292575     WDR91           WD repeat domain 91 (2)
rs3735035    PODXL           Podocalyxin-like
rs2234962    BAG3            BCL2-associated athanogene 3
rs4299811    Unknown
rs9550987    TNFRSF19        Tumor necrosis factor receptor superfamily, member 19
rs292592     WDR91           WD repeat domain 91
rs7259845    ZNF844          Zinc finger protein 844
rs474534     STK19           Serine/threonine kinase 19
rs11254408   TRDMT1          tRNA aspartic acid methyltransferase 1
rs435549     Unknown
rs4761944    Unknown
rs1052500    C2orf76         Chromosome 2 open reading frame 76
rs9423502    PITRM1          Pitrilysin metallopeptidase 1
rs1035442    MUC16           Mucin 16, cell surface associated
rs7995033    MTMR6           Myotubularin related protein 6
rs2305612    COLQ            Collagen-like tail subunit of asymmetric acetylcholinesterase
rs4407724    SYNE1           Spectrin repeat containing, nuclear envelope 1
rs6659553    POMGNT1         Protein O-linked mannose beta1,2-N-acetylglucosaminyltransferase
4.4.2 Attribute selection on cDNA data
Here we present the top 100 cDNA genes that were selected by the random forest
algorithm. These top 100 attributes are listed below in Table 4.11.
Table 4.11: Top 100 cDNA Attributes
Gene
Description
WDR77
WD repeat domain 77 (4)
TRIM37
Triparite motife-containing 37 (3)
PSMC4
Proteasome 26S (2)
CTNNA1
Catenin (cadherin-associated protein) (4)
HIST1H2AM
Histone cluster 1,H2am (2)
HIST1H2AL
Histone cluster 1,H2al (4)
Unknown
SLCO2A1
Solute carrier organic anion transporter (2)
LCP1
Lymphocyte cytosolic protein 1
MYH10
Myosin, heavy chain 10, non-muscle
PWP1
PWP1 homolog (S.cervisiae) (3)
Continued on Next Page. . .
CHAPTER 4. RESULTS
66
Table 4.11 – Continued
Gene
Description
PVR
Poliovirus receptor (4)
PRG1
p53-responsive gene 1
ROD1
ROD1 regulator of differentiation 1 (3)
FTO
Fat mass and obesity associated (4)
FKBP5
FK506 binding protein 5 (4)
ZFHX1B
Zinc finger E-box binding homeobox 2
GNAS
Guanine nucleotide binding protein (3)
SIL
Endoplasmic reticulum chaperone (3)
FMO1
Flavin containing monooxygenase 1 (2)
Unknown
CLPP
ClpP caseinolytic peptidase (3)
KIF21A
Kinesin family member 21A (4)
PSME1
Proteasome activator subunit 1 (3)
Unknown
RPS4X
Ribosomal protein S4, X-linked
IFI44L
Interferon-induced protein 44-like
BNIP1
BCL2/adenovirus E1B 19kDa interacting protein 1 (3)
PSIP1
PC4 and SFRS1 interacting protein 1 (2)
HLADMA
Major histocompatibility complex, class II, DM alpha (4)
SMNDC1
Survival motor neuron domain containing 1 (4)
CEB1
Hect domain and RLD 5
MYL6
Myosin, light chain 6, alkali, smooth muscle and non-muscle
FKBP2
FK506 binding protein 2, 13kDa
FXR1
Fragile X mental retardation, autosomal homolog 1
MT1F
Metallothionein 1F
QDPR
Quinoid dihydropteridine reductase (2)
DNAJC4
DnaJ (Hsp40) homolog, subfamily C, member 4
PCM1
Pericentriolar material 1 (4)
METAP2
Methionyl aminopeptidase 2 (2)
KAT2B
K(lysine) acetyltransferase 2B (2)
HIST1H2AE
Histone cluster 1, H2ae (2)
DUSP12
Dual specificity phosphatase 12
G3BP
GTPase activating protein binding protein 1 (2)
GSPT1
G1 to S phase transition 1 (3)
IFIT2
Interferon-induced protein tetratricopeptide repeats 2 (2)
PML
Promyelocytic leukemia (2)
GNA11
Guanine nucleotide binding protein alpha 11
GNB2L1
Guanine nucleotide binding protein beta polypeptide 2-like 1
Unknown
NPY
Neuropeptide Y
AKAP13
A kinase (PRKA) anchor protein 13 (2)
MFAP4
Microfibrillar-associated protein 4
U1SNRNPBP
U11/U12 snRNP 35K (2)
RBPMS
RNA binding protein with multiple splicing
GABRG2
Gamma-aminobutyric acid A receptor, gamma 2 (3)
Unknown
KLK4
Kallikrein-related peptidase 4
RCP9
Calcitonin gene-related peptide-receptor
CD24
CD24 molecule
CASQ1
Calsequestrin 1
CKS2
CDC28 protein kinase regulatory subunit 2
GMFG
Glia maturation factor, gamma
MPP2
Membrane protein, palmitoylated 2 (MAGUK p55 subfamily member 2)
RAB35
RAB35, member RAS oncogene family (2)
KIAA1045
KIAA1045 (2)
DDX19A
DEAD (Asp-Glu-Ala-As) box polypeptide 19A
POU3F4
POU class 3 homeobox 4
EBI3
Epstein-Barr virus induced gene 3
ZNF294
Zinc finger protein 294 (2)
SNX22
Sorting nexin 22
MYOZ3
Myozenin 3
NGDN
Neuroguidin, EIF4E binding protein
THBS1
Thrombospondin 1 (4)
AQR
Aquarius homolog (mouse) (2)
ING1L
Inhibitor of growth family, member 2
BTK
Bruton agammaglobulinemia tyrosine kinase (3)
Unknown
Unknown
RARG
Retinoic acid receptor, gamma (2)
SIRT6
Sirtuin 6 (S. cerevisiae)
CNTNAP1
Contactin associated protein 1 (2)
NFS1
NFS1 nitrogen fixation 1 homolog (S. cerevisiae) (2)
NRL
Neural retina leucine zipper gene
DSCR1
Regulator of calcineurin 1
DTX4
Deltex 4 homolog (Drosophila) (4)
ZNF142
Zinc finger protein 142
BAT8
Euchromatic histone-lysine N-methyltransferase 2
ELAVL1
ELAV-like 1 (Hu antigen R) (2)
CA2
Carbonic anhydrase II
DLG1
Discs, large homolog 1 (Drosophila)
BRSK2
BR serine/threonine kinase 2
TP53I3
Tumor protein p53 inducible protein 3
PPP1R10
Protein phosphatase 1, regulatory (inhibitor) subunit 10
PLK1
Polo-like kinase 1 (Drosophila)
Unknown
SLC2A8
Solute carrier family 2 (facilitated glucose transporter), member 8
Unknown
IGF2BP2
Insulin-like growth factor 2 mRNA binding protein 2 (2)
IFI6
Interferon, alpha-inducible protein 6
Unknown
4.4.3 Attribute selection on Affy data
Although the analysis of the Affy subsets did not provide much useful information, we still believe this dataset contains some attributes that are useful for separating the data based on the mortality label. Table 4.12 lists the top 100 attributes from the dataset.
Table 4.12: Top 100 Affy Attributes
Affy ID
Gene
Description
210249 s at
NCOA1
Nuclear receptor coactivator 1 (3)
204689 at
HHEX
Hematopoietically expressed homeobox
207805 s at
PSMD9
Proteasome 26S subunit
209644 x at
CDKN2A
Cyclin-dependent kinase inhibitor 2A
221569 at
AHI1
Abelson helper integration site 1
220068 at
VPREB3
Pre-B lymphocyte gene 3
200026 at
RPL34
Ribosomal protein L34
217728 at
S100A6
S100 calcium binding protein A6 (2)
207426 s at
TNFSF4
Tumor necrosis factor superfamily, member 4 (2)
205548 s at
BTG3
BTG family, member 3
212812 at
Unknown
217373 x at
MDM2
Mdm2 p53 binding protein homolog (mouse)
200855 at
Unknown
213056 at
FRMD4B
FERM domain containing 4B (2)
206995 x at
SCARF1
Scavenger receptor class F, member 1 (3)
214003 x at
RPS20
Ribosomal protein S20
209995 s at
TCL1A
T-cell leukemia/lymphoma 1A (3)
212423 at
ZCCHC24
Zinc finger, CCHC domain containing 24
209808 x at
ING1
Inhibitor of growth family, member 1 (3)
200032 s at
RPL9
Ribosomal protein L9
202695 s at
STK17A
Serine/threonine kinase 17a
205726 at
DIAPH2
Diaphanous homolog 2 (Drosophila)
204075 s at
KIAA0562
Uncharacterized protein
217820 s at
ENAH
Enabled homolog (Drosophila)
200062 s at
RPL30
Ribosomal protein L30
218820 at
Unknown
203577 at
GTF2H4
General transcription factor IIH, polypeptide 4,
204218 at
C11orf51
Chromosome 11 open reading frame 51
203233 at
IL4R
Interleukin 4 receptor
203616 at
POLB
Polymerase (DNA directed), beta
200660 at
S100A11
S100 calcium binding protein (3)
208438 s at
FGR
Gardner-Rasheed feline sarcoma viral oncogene (2)
215017 s at
Unknown
222146 s at
TCF4
Transcription factor 4 (5)
212810 s at
SLC1A4
Solute carrier family 1
217939 s at
AFTPH
Aftiphilin (2)
209152 s at
TCF3
Transcription factor 3
218281 at
MRPL48
Mitochondrial ribosomal protein (2)
221543 s at
ERLIN2
ER lipid raft associated 2
212324 s at
VPS13D
Vacuolar protein sorting 13
39318 at
TCL1A
T-cell leukemia/lymphoma 1A (3)
201094 at
RPS29
Ribosomal protein S29
208690 s at
PDLIM1
PDZ and LIM domain 1
212587 s at
PTPRC
Protein tyrosine phosphatase, receptor type, C (4)
206752 s at
DFFB
DNA fragmentation factor
207416 s at
NFATC3
Nuclear factor of activated T-cells calcineurin-dependent 3 (4)
215000 s at
FEZ2
Fasciculation and elongation protein zeta 2
213746 s at
Unknown
203688 at
PKD2
Polycystic kidney disease 2
205786 s at
ITGAM
Integrin, alpha M
217168 s at
HERPUD1
Homocysteine and ER stress-inducible, ubiquitin-like domain member 1
218847 at
IGF2BP2
Insulin-like growth factor 2 (2)
203414 at
MMD
Monocyte to macrophage differentiation
209107 x at
NCOA1
Nuclear receptor coactivator 1 (3)
218380 at
NLRP1
NLR family, pyrin domain containing 1
208645 s at
Unknown
212386 at
TCF4
Transcription factor 4 (5)
211991 s at
HLADPA1
Major histocompatibility complex
217712 at
Unknown
202016 at
MEST
Mesoderm specific transcript homolog (mouse)
204061 at
PRKX
Protein kinase, X-linked
212436 at
TRIM33
Tripartite motif-containing 33
212588 at
PTPRC
Protein tyrosine phosphatase, receptor type, C (4)
215411 s at
Unknown
210555 s at
NFATC3
Nuclear factor of activated T-cells calcineurin-dependent 3 (4)
217542 at
MGC5370
Hypothetical protein MGC5370 (2)
201461 s at
MGC5370
Hypothetical protein MGC5370 (2)
209035 at
MDK
Midkine
200951 s at
CCND2
Cyclin D2
200025 s at
RPL27
Ribosomal protein L27
220960 x at
RPL22
Ribosomal protein L22
60471 at
RIN3
Ras and Rab interactor 3
213434 at
STX2
Syntaxin 2
201739 at
SGK
Serum/glucocorticoid regulated kinase 1
203753 at
TCF4
Transcription factor 4 (5)
203279 at
EDEM1
ER degradation enhancer
203434 s at
MME
Membrane metallo-endopeptidase (2)
201254 x at
RPS6
Ribosomal protein S6
218434 s at
AACS
Acetoacetyl-CoA synthetase
206542 s at
SMARCA2
SWI/SNF related, matrix associated, actin dependent regulator of chromatin
213891 s at
TCF4
Transcription factor 4 (5)
208720 s at
RBM39
RNA binding motif protein 39 (3)
200602 at
APP
Amyloid beta (A4) precursor protein
203435 s at
MME
Membrane metallo-endopeptidase (2)
206656 s at
Unknown
212332 at
RBL2
Retinoblastoma-like 2
217979 at
TSPAN13
Tetraspanin 13 (2)
204552 at
Unknown
210776 x at
EST63624
210676 x at
RGPD5
RANBP2-like and GRIP domain containing 5
208894 at
HLADRA
Major histocompatibility complex, class II, DR alpha (2)
204866 at
PHF16
PHD finger protein 16
54037 at
HPS4
Hermansky-Pudlak syndrome 4
217707 x at
Unknown
Jurkat T-cells V Homo sapiens cDNA 5- end, mRNA sequence
201373 at
PLEC1
Plectin 1
210982 s at
HLADRA
Major histocompatibility complex, class II, DR alpha (2)
209927 s at
C1orf77
Chromosome 1 open reading frame 77
212480 at
SPECC1L
SPECC1-like KIAA0376
209269 s at
Unknown
221865 at
C9orf91
Chromosome 9 open reading frame 91
4.4.4 Attribute selection on SNP data and cDNA data
In Table 4.13 we list the top 100 attributes for these data.
Table 4.13: Top 100 SNP-cDNA Attributes
SNP
Gene
Description
SMNDC1
Survival motor neuron (4)
CLPP
ClpP caseinolytic peptidase (3)
WDR77
WD repeat domain 77 (4)
SPTBN5
Spectrin, beta, non-erythrocytic 5 (2)
rs3842787
PTGS1
Prostaglandin-endoperoxide synthase 1 (2)
rs1132780
CAMKK2
Calcium/calmodulin-dependent protein kinase
rs16972193
Unknown
rs11153174
PVR
Poliovirus receptor (4)
GABRG2
GABA A receptor, gamma 2 (3)
Unknown
PROZ
Protein Z, vitamin K-dependent plasma glycoprotein
KAT2B
K(lysine) acetyltransferase 2B (2)
CTNNA1
Catenin (cadherin-associated protein), alpha 1 (4)
rs2277125
Unknown
rs35760989
Unknown
rs35835241
NOL7
Nucleolar protein 7, 27kDa
PWP1
PWP1 homolog (S. cerevisiae) (3)
SIL
Endoplasmic reticulum chaperone (3)
DTX4
Deltex 4 homolog (Drosophila) (4)
TBCKL
Unknown
rs869801
DOCK1
rs4969258
Unknown
TPMT
Dedicator of cytokinesis 1
Thiopurine S-methyltransferase (2)
rs6020
F5
Coagulation factor V
rs2276805
AADACL1
Arylacetamide deacetylase-like 1
rs2511241
rs2270384
BCAS2
Breast carcinoma amplified sequence 2
P2RY2
Purinergic receptor P2Y, G-protein coupled, 2
SLC7A4
Solute carrier family 7 member 4
GSPT1
G1 to S phase transition 1 (3)
HLADMA
Major histocompatibility complex (4)
HTR6
5-hydroxytryptamine (serotonin) receptor 6
rs4969259
Unknown
rs3751315
FBRSL1
Fibrosin-like 1
GNAS
GNAS complex locus (3)
RCAN1
Regulator of calcineurin 1
RSN
CAPGLY domain containing linker protein 1
rs4987262
PTGIR
Prostaglandin I2 (prostacyclin) receptor (IP)
Unknown
rs7578597
THADA
Thyroid adenoma associated (2)
Unknown
rs1966265
FGFR4
Fibroblast growth factor receptor 4
NFS1
NFS1 nitrogen fixation 1 homolog (S. cerevisiae) (2)
NPTX1
Neuronal pentraxin I
rs7338333
ING1
Inhibitor of growth family, member 1 (3)
rs248248
C5orf45
Chromosome 5 open reading frame 45
rs8027765
AEN
Apoptosis enhancing nuclease
CEP290
centrosomal protein 290kDa (2)
rs1999663
rs1137078
BTK
Bruton agammaglobulinemia tyrosine kinase (3)
PPP1CA
Protein phosphatase 1, catalytic subunit, alpha (2)
C20orf114
Chromosome 20 open reading frame 114 (2)
THBS1
Thrombospondin 1 (4)
HLAA29.1
Major histocompatibility complex class I (2)
rs3730947
LIG1
Ligase I, DNA, ATP-dependent
rs1065761
CHIT1
Chitinase 1 (chitotriosidase) (2)
SLC39A7
Solute carrier family 39 (zinc transporter), member 7
rs1122326
HSPB9
Heat shock protein, alpha-crystallin-related, B9 (2)
rs2427536
SLC2A4RG
SLC2A4 regulator (2)
rs753381
PLCG1
Phospholipase C, gamma 1
rs8179070
PLIN
Perilipin
KIF21A
Kinesin family member 21A (4)
RARG
Retinoic acid receptor, gamma (2)
BCKDHB
Branched chain keto acid dehydrogenase E1
CDR1
cerebellar degeneration-related protein 1 (2)
SHFM1
Split hand/foot malformation (ectrodactyly) type 1
HIST1H2AL
Histone cluster 1, H2al (4)
FTO
Fat mass and obesity associated (4)
rs3179969
SPATA7
Spermatogenesis associated 7
rs11240604
ZC3H11A
Zinc finger CCCH-type containing 11A
FKBP5
FK506 binding protein 5 (4)
rs2523720
TRIM26
Tripartite motif-containing 26 (2)
rs13110318
TBC1D1
TBC1 domain family, member 1
AQP9
Aquaporin 9 (3)
PCM1
Pericentriolar material 1 (4)
BNIP1
BCL2/adenovirus E1B 19kDa interacting protein 1 (3)
Unknown
rs16833032
NID1
Nidogen 1
rs3803414
MEGF11
Multiple EGF-like-domains 11
rs3747243
Unknown
Unknown
ZAK
Sterile alpha motif and leucine zipper
rs3741554
KIAA1602
Uncharacterized protein
MAD2L2
MAD2 mitotic arrest deficient-like 2 (yeast)
Unknown
TIMP2
TIMP metallopeptidase inhibitor 2
MAPK8IP3
Mitogen-activated protein kinase 8 (2)
QDPR
Quinoid dihydropteridine reductase (2)
MB
Myoglobin (3)
rs12090611
MEGF6
Multiple EGF-like-domains 6
rs11164066
Unknown
rs2569491
KLK14
Kallikrein-related peptidase 14
rs3803641
KCNG4
Potassium voltage-gated channel, subfamily G, member 4
rs1143684
NQO2
NAD(P)H dehydrogenase, quinone 2
rs3747532
CER1
Chromosome 3 common eliminated region 1
rs3777721
RNASET2
Ribonuclease T2
ALAD
Aminolevulinate, delta-, dehydratase (2)
rs11264581
UPK1B
Uroplakin 1B
PSME1
Proteasome activator subunit 1 (3)
HNRPU
Heterogeneous nuclear ribonucleoprotein U
PEAR1
Platelet endothelial aggregation receptor 1
4.4.5 Attribute selection on SNP data and Affy data
This dataset provided poor separation of the data. However, we still believe the top attributes from each dataset contain important information for separating the data by mortality. Table 4.14 lists the top 100 attributes.
Table 4.14: Top 100 SNP-Affy Attributes
SNP
Affy ID
201425 at
rs1021580
221573 at
204179 at
Gene
Description
ALDH2
Aldehyde dehydrogenase 2 family
CDC20B
Cell division cycle 20 homolog B
C7orf25
Chromosome 7 open reading frame 25
MB
Myoglobin (3)
rs1051484
PREP
Prolyl endopeptidase
rs12932514
Unknown
rs11016071
MKI67
Antigen identified by monoclonal antibody Ki-67
rs7201721
Unknown
ALDH9A1
Aldehyde dehydrogenase 9 family, member A1
rs8106130
201612 at
OSCAR
Osteoclast associated, immunoglobulin-like receptor
rs4782591
TAF1C
TATA box binding protein RNA polymerase I, C
rs3794153
ST5
Suppression of tumorigenicity 5
201424 s at
CUL4A
Cullin 4A
212591 at
RBM34
RNA binding motif protein 34
TPST2
Tyrosylprotein sulfotransferase 2
204079 at
rs1611149
Unknown
218285 s at
BDH2
216981 x at
SPN
Sialophorin
SGK2
Serum/glucocorticoid regulated kinase 2
rs35187177
215971 at
3-hydroxybutyrate dehydrogenase, type 2
Unknown
205552 s at
OAS1
220498 at
ACTL7B
Actin-like 7B
218851 s at
WDR33
WD repeat domain 33
rs2307289
2’,5’-oligoadenylate synthetase 1
MBD4
Methyl-CpG binding domain protein 4
212816 s at
CBS
Cystathionine-beta-synthase
200030 s at
SLC25A3
Solute carrier family 25
218976 at
DNAJC12
DnaJ (Hsp40) homolog, subfamily C, member 12
217732 s at
ITM2B
Integral membrane protein 2B
202146 at
IFRD1
Interferon-related developmental regulator 1
204693 at
CDC42EP1
CDC42 effector protein
203283 s at
HS2ST1
Heparan sulfate 2-O-sulfotransferase 1
rs9500989
Unknown
207432 at
BEST2
Bestrophin 2
55093 at
CSGlcAT
Chondroitin sulfate glucuronyltransferase
212526 at
SPG20
Spastic paraplegia 20
209931 s at
FKBP1B
FK506 binding protein 1B
rs7732300
Unknown
rs940871
DKFZp547K054
Hypothetical protein
217716 s at
SEC61A1
Sec61 alpha 1 subunit
209794 at
SRGAP3
SLIT-ROBO Rho GTPase activating protein 3
208442 s at
ATM
Ataxia telangiectasia mutated (2)
208697 s at
EIF3E
Eukaryotic translation initiation factor 3
rs4371530
NAALADL2
N-acetylated alpha-linked acidic dipeptidase-like 2
222150 s at
tcag7.1314
Hypothetical protein
215004 s at
SF4
Splicing factor 4
NSUN7
NOL1/NOP2/Sun domain family, member 7
rs4861066
rs2297270
DSCAM
Down syndrome cell adhesion molecule
rs2239808
KCTD20
Potassium channel tetramerisation domain containing 20
DEF8
Differentially expressed in FDCP 8 (3)
ZMYM2
Zinc finger, MYM-type 2
rs17784583
202778 s at
rs2274670
FAM113A
Family with sequence similarity 113, member A
rs2734971
Unknown
rs500049
OBSCN
Obscurin calmodulin and titin-interacting RhoGEF (2)
rs3918232
NOS3
Nitric oxide synthase 3 (endothelial cell)
200878 at
EPAS1
Endothelial PAS domain protein 1
220072 at
CSPP1
Centrosome and spindle pole associated protein 1
CLEC4D
C-type lectin domain family 4, member D
rs4304840
201465 s at
JUN
Jun oncogene
219326 s at
B3GNT2
UDP-GlcNAc:betaGal beta-1,3-N-acetylglucosaminyltransferase 2
C1orf127
Chromosome 1 open reading frame 127
207420 at
COLEC10
Collectin sub-family member 10
207632 at
MUSK
Muscle, skeletal, receptor tyrosine kinase
202020 s at
LANCL1
LanC lantibiotic synthetase component C-like 1
ICMT
Isoprenylcysteine carboxyl methyltransferase
rs1281013
201611 s at
rs2243620
Unknown
rs2844759
Unknown
rs567083
MAK
Male germ cell-associated kinase
rs4785751
DEF8
Differentially expressed in FDCP 8 (3)
213895 at
EMP1
Epithelial membrane protein 1
209131 s at
SNAP23
Synaptosomal-associated protein
204048 s at
PHACTR2
Phosphatase and actin regulator 2
rs2282284
216945 x at
rs1042303
FCRL3
Fc receptor-like 3
PASK
PAS domain containing serine/threonine kinase
GPLD1
Glycosylphosphatidylinositol specific phospholipase D1
209397 at
ME2
Malic enzyme 2
217976 s at
DYNC1LI1
Dynein, cytoplasmic 1, light intermediate chain 1
RP11265F14.2
Elastase 2B
rs3820071
207811 at
KRT12
Keratin 12
203466 at
MPV17
MpV17 mitochondrial inner membrane protein
207430 s at
MSMB
Microseminoprotein
rs2568023
C11orf16
Chromosome 11 open reading frame 16
rs11543211
PSMC5
Proteasome (prosome, macropain) 26S subunit, ATPase, 5
rs12199003
GFRAL
GDNF family receptor alpha like
rs4977196
KIAA1875
Protein similar to KIAA1875
rs3934462
ARL13A
ADP-ribosylation factor-like 13A
CD180
CD180 molecule
rs5744463
212525 s at
Unknown
rs2050189
C6orf10
rs10843438
OVCH1
Ovochymase 1
EIF5
Eukaryotic translation initiation factor 5
rs10163657
LOXHD1
Lipoxygenase homology domains 1
rs3820011
KIAA1751
Similar to KIAA1751 protein
rs1886544
NEK5
NIMA (never in mitosis gene a)-related kinase 5
TNFRSF11B
Tumor necrosis factor receptor superfamily, member 11b
208708 x at
204933 s at
Chromosome 6 open reading frame 10
FMO2
Flavin containing monooxygenase 2 (non-functional)
218137 s at
SMAP1
Stromal membrane-associated GTPase-activating protein 1
rs910397
PXMP4
Peroxisomal membrane protein 4
rs4785766
GAS8
Growth arrest-specific 8
CCNA2
Cyclin A2
rs2020860
203418 at
rs3873374
Unknown
4.4.6 Attribute selection on cDNA data and Affy data
Table 4.15 lists the top 100 attributes for this dataset.
Table 4.15: Top 100 Affy-cDNA Attributes
Affy ID
Gene
Description
WDR46
WD repeat domain 46 (2)
SMNDC1
Survival motor neuron (4)
Unknown
DTX4
Deltex 4 homolog (Drosophila) (4)
TRIM37
Tripartite motif-containing 37 (3)
FKBP5
FK506 binding protein 5 (4)
CDR1
Cerebellar degeneration-related protein (2)
SNAPC1
Small nuclear RNA activating complex
200660 at
S100A11
S100 calcium binding protein A11 (3)
201089 at
ATP6V1B2
ATPase, H+ transporting
210555 s at
NFATC3
Nuclear factor of activated T-cells (4)
201990 s at
CREBL2
cAMP responsive element binding protein (2)
222146 s at
TCF4
Transcription factor 4 (5)
PPP1CA
Protein phosphatase 1, catalytic subunit, alpha isoform (2)
201105 at
PSMC4
Proteasome (prosome, macropain) 26S subunit, ATPase, 4 (2)
CTNNA1
Catenin (cadherin-associated protein), alpha 1 (4)
LGALS1
Lectin, galactoside-binding, soluble, 1 (2)
Unknown
IFIT2
Interferon-induced protein with tetratricopeptide repeats 2 (2)
209239 at
NFKB1
Nuclear factor of kappa light polypeptide gene enhancer in B-cells 1
212587 s at
PTPRC
Protein tyrosine phosphatase, receptor type, C (4)
208620 at
THBS1
Thrombospondin 1 (4)
PCBP1
Poly(rC) binding protein 1 (2)
221475 s at
RPL15
Ribosomal protein L15
208720 s at
RBM39
RNA binding motif protein 39 (3)
RAB35
Member RAS oncogene family (2)
212423 at
C10orf56
Chromosome 10 open reading frame 56 (2)
HLA-DMA
Major histocompatibility complex, class II, DM alpha (4)
206050 s at
204011 at
221269 s at
215621 s at
211672 s at
PVR
Poliovirus receptor (4)
KIF21A
Kinesin family member 21A (4)
RNH1
Ribonuclease/angiogenin inhibitor 1
GNAS
GNAS complex locus (3)
SPRY2
Sprouty homolog 2
ELAVL1
ELAV (embryonic lethal, abnormal vision, Drosophila)-like 1 (2)
SH3BGRL3
SH3 domain binding glutamic acid-rich protein like 3
PWP1
PWP1 homolog (S. cerevisiae) (3)
IGHG1
Immunoglobulin heavy constant mu
MAPK8IP3
Mitogen-activated protein kinase 8 interacting protein 3 (2)
ARPC4
Actin related protein 2/3 complex, subunit 4
KIAA1045
Hypothetical protein (2)
PSIP1
PC4 and SFRS1 interacting protein 1 (2)
206656 s at
Unknown
213056 at
FRMD4B
FERM domain containing 4B (2)
G3BP
GTPase activating protein (SH3 domain) binding protein 1 (2)
DR1
Down-regulator of transcription 1
HIST1H2AM
Histone cluster 1, H2am (2)
216652 s at
212387 at
TCF4
Transcription factor 4 (5)
FTO
Fat mass and obesity associated (4)
CSF2RB
Colony stimulating factor 2 receptor, beta
Unknown
Unknown
GTL3
Gene trap locus 3
201012 at
ANXA1
Annexin A1
203165 s at
SLC33A1
Solute carrier family 33 (acetyl-CoA transporter), member 1
209653 at
KPNA4
Karyopherin alpha 4
ROD1
ROD1 regulator of differentiation 1 (3)
208438 s at
213891 s at
209864 at
FGR
Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene (2)
FMO1
Flavin containing monooxygenase 1 (2)
HIST1H2AL
Histone cluster 1, H2al (4)
TCF4
Transcription factor 4 (5)
PSG3
Pregnancy specific beta-1-glycoprotein 3
BNIP1
BCL2/adenovirus E1B 19kDa interacting protein 1 (3)
FRAT2
Frequently rearranged in advanced T-cell lymphomas 2
CNTNAP1
Contactin associated protein 1 (2)
202833 s at
SERPINA1
Serpin peptidase inhibitor, clade A
217728 at
S100A6
S100 calcium binding protein A6 (2)
AQP9
Aquaporin 9 (3)
PCM1
Pericentriolar material 1 (4)
200872 at
S100A10
S100 calcium binding protein A10
EID1
EP300 interacting inhibitor of differentiation 1 (2)
207654 x at
DR1
Down-regulator of transcription 1
221497 x at
EGLN1
Egl nine homolog 1 (C. elegans)
201932 at
LRRC41
Leucine rich repeat containing 41
GABRG2
Gamma-aminobutyric acid (GABA) A receptor, gamma 2 (3)
ZNF294
Zinc finger protein 294 (2)
214687 x at
ALDOA
Aldolase A
202741 at
PRKACB
Protein kinase, cAMP-dependent, catalytic, beta
218987 at
ATF7IP
Activating transcription factor 7 interacting protein (2)
203568 s at
TRIM38
Tripartite motif-containing 38
217939 s at
AFTPH
Aftiphilin (2)
202086 at
218281 at
TSC22
TSC22 domain family, member 3
GSPT1
G1 to S phase transition 1 (3)
MX1
Myxovirus (influenza virus) resistance 1
CEP290
Centrosomal protein 290kDa (2)
MRPL48
Mitochondrial ribosomal protein L48 (2)
220046 s at
CCNL1
Cyclin L1
210249 s at
NCOA1
Nuclear receptor coactivator (3)
219451 at
MSRB2
Methionine sulfoxide reductase
209648 x at
SOCS5
Suppressor of cytokine signaling 5
U1SNRNPBP
U11/U12 snRNP (2)
201421 s at
WDR77
WD repeat domain 77 (4)
212504 at
DIP2C
Disco-interacting protein 2 homolog C (Drosophila) (2)
210202 s at
BIN1
Bridging integrator 1 (2)
ALAD
Aminolevulinate, delta-, dehydratase (2)
202081 at
IER2
Immediate early response 2
207426 s at
TNFSF4
Tumor necrosis factor (ligand) superfamily, member 4 (2)
208686 s at
BRD2
Bromodomain containing 2
206995 x at
SCARF1
Scavenger receptor class F, member 1 (3)
SLCO2A1
Solute carrier organic anion transporter family, member 2A1 (2)
CP
Ceruloplasmin (ferroxidase)
4.4.7 Attribute selection on SNP data, cDNA data and Affy data
The top 100 attributes for this dataset are listed in Table 4.16.
Table 4.16: Top 100 SNP-cDNA-Affy Attributes
SNP
Affy ID
Gene
Description
rs7578597
THADA
Thyroid adenoma associated (2)
rs7729440
Unknown
rs2427536
SLC2A4RG
SLC2A4 regulator (2)
FTO
Fat mass and obesity associated (4)
CLPP
ClpP caseinolytic peptidase (3)
CTNNA1
Catenin (cadherin-associated protein) (4)
rs1133090
DPEP2
Dipeptidase 2
rs2523720
TRIM26
Tripartite motif-containing 26 (2)
WDR46
WD repeat domain 46 (2)
PVR
poliovirus receptor (4)
rs4713380
Unknown
201990 s at
CREBL2
cAMP responsive binding protein (2)
PWP1
PWP1 homolog (S. cerevisiae) (3)
Unknown
rs2304380
RYR3
Ryanodine receptor 3
S100A11
S100 calcium binding protein A11 (3)
rs16972193
SPTBN5
Spectrin, beta, non-erythrocytic 5 (2)
rs3842787
PTGS1
Prostaglandin-endoperoxide synthase 1 (2)
200660 at
Unknown
PSME1
Proteasome activator subunit 1 (3)
222146 s at
TCF4
Transcription factor 4 (5)
201421 s at
WDR77
WD repeat domain 77 (4)
210448 s at
P2RX5
Purinergic receptor P2X
212385 at
TCF4
Transcription factor 4 (5)
Unknown
202239 at
212423 at
218983 at
rs6151428
217979 at
201487 at
rs1065761
215543 s at
204449 at
PARP4
Poly (ADP-ribose) polymerase family, member 4
MB
Myoglobin (3)
C10orf56
Chromosome 10 open reading frame 56 (2)
C1RL
Complement component 1
ARSA
Arylsulfatase A
TSPAN13
Tetraspanin 13 (2)
CTSC
Cathepsin C
CHIT1
Chitinase 1 (chitotriosidase) (2)
LARGE
Like-glycosyltransferase
KIF21A
Kinesin family member 21A (4)
PDCL
Phosducin-like
METAP2
Methionyl aminopeptidase 2 (2)
ROD1
ROD1 regulator of differentiation 1 (S. pombe) (3)
201608 s at
PWP1
PWP1 homolog (S. cerevisiae) (3)
218133 s at
NIF3L1
IF3 NGG1 interacting factor 3-like 1
rs2808096
ARHGAP12
Rho GTPase activating protein 1 (2)
rs16844401
HGFAC
HGF activator
rs1122326
HSPB9
Heat shock protein, alpha-crystallin-related, B9 (2)
AP1B1
Adaptor-related protein complex 1, beta 1 subunit
212587 s at
PTPRC
Protein tyrosine phosphatase, receptor type, C (4)
SNUPN
Snurportin 1
208720 s at
RBM39
RNA binding motif protein 39 (3)
PCM1
Pericentriolar material 1 (4)
rs1999663
C20orf114
Chromosome 20 open reading frame 114 (2)
rs1137078
HLA-A29.1
Major histocompatibility complex class I (2)
HLA-DMA
Major histocompatibility complex, class II (4)
C20orf43
Chromosome 20 open reading frame 43
PML
Promyelocytic leukemia (2)
BIN1
Bridging integrator 1 (2)
THBS1
Thrombospondin 1 (4)
CELSR2
Cadherin, EGF LAG seven-pass G-type receptor 2
FKBP5
FK506 binding protein 5 (4)
217737 x at
210202 s at
rs12567377
215273 s at
TADA3L
Transcriptional adaptor 3
AKAP13
A kinase (PRKA) anchor protein 13 (2)
FASN
Fatty acid synthase
rs4723884
Unknown
RAMP
Receptor (G protein-coupled) activity modifying protein 1
rs16883930
SLC17A5
Solute carrier family 17 (anion/sugar transporter), member 5
BTK
Bruton agammaglobulinemia tyrosine kinase (3)
C9orf6
Chromosome 9 open reading frame 6
218998 at
rs2304237
ICAM3
Intercellular adhesion molecule 3
206995 x at
SCARF1
Scavenger receptor class F, member 1 (3)
210555 s at
NFATC3
Nuclear factor of activated T-cells (4)
208540 x at
Unknown
rs4978584
DFNB31
Deafness, autosomal recessive 31 (2)
rs3731644
SH3BP4
SH3-domain binding protein 4
HSPB1
Heat shock 27kDa protein 1
TCL1A
T-cell leukemia/lymphoma 1A (3)
39318 at
Unknown
HIST1H2AL
Histone cluster 1, H2al (4)
rs1048786
PDIA2
Protein disulfide isomerase family A, member 2
rs3747243
Unknown
AQP9
Aquaporin 9 (3)
rs2274158
DFNB31
Deafness, autosomal recessive 31 (2)
SAT2
Spermidine/spermine N1-acetyltransferase family member 2
rs13894
211792 s at
CDKN2C
Cyclin-dependent kinase inhibitor 2C
201105 at
LGALS1
Lectin, galactoside-binding, soluble, 1 (2)
207198 s at
LIMS1
LIM and senescent cell antigen-like domains 1
SIL
SIL1 homolog, (S. cerevisiae) (3)
204971 at
rs2287546
CSTA
Cystatin A
SART3
Squamous cell carcinoma antigen recognized by T cells 3
201461 s at
MAPKAPK2
Mitogen-activated protein kinase-activated protein kinase 2
208620 at
PCBP1
Poly(rC) binding protein 1 (2)
DTX4
Deltex 4 homolog (Drosophila) (4)
218987 at
212387 at
rs1801516
rs10676
217972 at
TRIM37
Tripartite motif-containing 37 (3)
ATF7IP
Activating TF 7 interacting protein (2)
EID1
EP300 interacting inhibitor of differentiation 1 (2)
AQR
Aquarius homolog (mouse) (2)
TCF4
Transcription factor 4 (5)
ATM
Ataxia telangiectasia mutated (2)
HIST1H2AE
Histone cluster 1, H2ae (2)
DHRS12
Dehydrogenase/reductase (SDR family)
TPMT
Thiopurine S-methyltransferase (2)
SMNDC1
Survival motor neuron domain containing 1 (4)
CHCHD3
Coiled-coil-helix-coiled-coil-helix domain containing 3
Observations
It is interesting to note that there are few repeated SNPs in the individual SNP dataset. However, in the combined datasets there are many more SNPs repeated between those two lists. The top cDNA genes consistently appear in multiple lists, and in many cases the top attributes remain near the top in the other lists. There are a large number of repeats in the Affy dataset as well, but it is interesting that the combined SNP-Affy dataset contained few repeats at all. This is consistent with that dataset's poor performance.
None of the SNP profiles identified by Yang and colleagues [35] appear in any of the previous lists. There was also no overlap with the genes identified by Hoffmann and colleagues [19] as being predictive of long-term clinical outcome.
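The overlap checks described above can be sketched as a simple set intersection over ranked lists. The gene names below are placeholders drawn from the tables, not the actual analysis output:

```python
# Sketch: quantify overlap between two ranked top-k attribute lists.
def top_k_overlap(list_a, list_b, k):
    """Return the attributes shared by the top k entries of both lists."""
    return set(list_a[:k]) & set(list_b[:k])

# Illustrative lists only; the real lists have 100 entries each.
snp_top = ["SMNDC1", "CLPP", "WDR77", "SPTBN5", "PTGS1"]
cdna_top = ["WDR77", "TRIM37", "PSMC4", "CTNNA1", "CLPP"]

shared = top_k_overlap(snp_top, cdna_top, k=5)
print(sorted(shared))  # ['CLPP', 'WDR77']
```

An empty intersection against an external list (such as the profiles of Yang et al.) is exactly the "no overlap" observation reported here.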
4.5 Validation of Results
One of the most important aspects of any scientific experiment is validating the results. For a typical data-mining experiment, this is done by obtaining new data and applying the model that has been built. If the model performs well, this is evidence that the model generalizes; if it does not, the model has likely been overfit to the training dataset. Unfortunately, it is difficult to obtain new data for this type of experiment, due to a combination of the rarity of the disease, the limited availability of the data, and the fact that not all experiments are done on the same platform. As such, this method of validation was not immediately available to us. We therefore performed one method of validation based on the information that was available to us: label shuffling.
4.5.1 Label shuffling validation
One method of validation when using a classification technique is known as label shuffling. In this technique, the class labels for the dataset are randomly permuted. The relabeled dataset is then used for classification, and the classification accuracy is compared to that of the original dataset. If the accuracy drops with the permuted labels, the original results were not due to random chance. However, if the accuracy remains the same or increases, the results obtained from the original dataset may well be due to random chance.
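The procedure can be sketched in a few lines. This is an illustrative scikit-learn workflow on synthetic data, not the pipeline actually used in the thesis:

```python
# Sketch of label-shuffling validation with an SVM classifier.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))      # stand-in attribute matrix
y = rng.integers(0, 2, size=120)    # stand-in mortality labels

clf = SVC(kernel="linear")
true_acc = cross_val_score(clf, X, y, cv=5).mean()

y_shuffled = rng.permutation(y)     # break any attribute-label association
shuffled_acc = cross_val_score(clf, X, y_shuffled, cv=5).mean()

# A real signal should put true_acc clearly above shuffled_acc;
# with random X here both hover near the majority-class baseline.
print(true_acc, shuffled_acc)
```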
As an example, this technique was performed on the SNP dataset. The original SVM results can be seen in Table 4.3; Table 4.17 displays the results of classification with permuted class labels. Clearly, this does not perform well at all; in most cases the classification accuracy drops below the baseline accuracy, that is, the accuracy obtained by classifying every instance as the majority class. This suggests that the results we obtained are not due to random chance.
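The majority-class baseline referred to above is straightforward to compute; the patient counts here are illustrative only, not the actual cohort sizes:

```python
# Sketch of the majority-class baseline accuracy.
def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return max(counts.values()) / len(labels)

# e.g. 123 surviving vs. 14 deceased patients (hypothetical counts)
labels = ["alive"] * 123 + ["deceased"] * 14
print(round(majority_baseline(labels), 4))  # 0.8978
```

Any classifier whose accuracy falls below this number is doing worse than ignoring the attributes entirely.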
Table 4.17: SVM prediction accuracy of shuffled-label SNP subsets

Attributes    % Class Alive    % Class Deceased
25            100              0
50            100              0
100           100              0
250           95.93            0
500           97.56            0
1000          96.75            0
2500          98.37            14.29
5000          98.37            0

4.6 Extended SNP Analysis
One important property of SNP data is that it is unlikely to change: it describes the genome of an individual and, barring genetic mutation, will remain the same throughout that individual's lifetime. This is a useful property for data mining, as it means that any results found are not artifacts of values measured at one particular time. Because of this property, and because of the positive results from the SNP data analysis, we decided to perform an extended analysis using different data-mining techniques.
4.6.1 Predicting relapse
Random-forest attribute selection is a supervised learning technique, meaning that a class label must be provided for each object in the dataset. For the previous analysis the class label was the mortality of the patients, so that the random-forest algorithm could find the attributes which best discriminated on that label. Another important factor in ALL treatment is whether or not the patient relapses: if a patient relapses, the course of treatment must be changed, so knowing in advance which patients have a higher chance of relapsing might make it possible to avoid a relapse.
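Re-running the selection with a different label amounts to swapping the target vector. A minimal sketch using scikit-learn's RandomForestClassifier importances as a stand-in for the thesis's attribute-selection step, on synthetic genotype data:

```python
# Illustrative re-ranking of SNP attributes against a relapse label.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Toy genotype matrix: 100 patients x 200 SNPs coded 0/1/2.
X = rng.integers(0, 3, size=(100, 200)).astype(float)
# Synthetic relapse label driven by two planted SNPs plus noise.
relapse = (X[:, 7] + X[:, 42] + rng.normal(0, 0.5, 100) > 2.5).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, relapse)

# Rank attributes by importance, descending, and keep a small subset
# for the downstream SVD/SVM analysis.
ranking = np.argsort(forest.feature_importances_)[::-1]
top25 = ranking[:25]
print(top25[:5])
```

The same code with the mortality vector substituted for `relapse` reproduces the earlier selection step.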
To explore this possibility, the random-forest algorithm was run on the SNP data with the class label being whether or not the patient relapsed. We expected some overlap between the top SNPs selected here and the previous set, as all of the patients who passed away had also relapsed. However, several patients relapsed and survived, which emphasizes how complex these data are.
As in the previous analysis, the top 25, 50, 100, 250, 500, 1000, 2500 and 5000
SNPs were selected into subsets which were analyzed by SVD and SVM. Although
the results were not as clear-cut as those for the mortality label, they proved to be quite
interesting. The SVD images for the top 250 and 1000 SNPs are shown in Figure
4.17. It can be seen in the 250 attribute subset that there is a fairly good separation
of the data. This suggests that there is also a genetic reason for why some patients
relapse. The separation in the 1000 attribute subset is not quite as clear, but it can be
seen that the overall structure of the data is beginning to form two clusters, much like
what we saw with the previous SNP analysis. This again confirms that the more SNP
attributes are included, the more of the general structure of the patient population
becomes visible.
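The SVD step that produces these images can be sketched in a few lines of numpy; the data here are a synthetic stand-in for a top-250 subset, and the centring choice is an assumption (the thesis does not spell out its preprocessing):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for a top-250 SNP subset: 120 patients x 250 attributes.
X = rng.choice([0.0, 0.5, 1.0], size=(120, 250))

# Centre each attribute, then project the patients onto the top three
# singular vectors -- the low-dimensional "SVD image" shown in the figures.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = U[:, :3] * S[:3]   # one 3-D point per patient

# coords can now be scattered and coloured by the mortality/relapse labels.
```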
[Figure 4.17: SVD images of relapse-selected SNP data. (a) 250 SNP relapse subset; (b) 1000 SNP relapse subset. blue = alive, red = deceased, Y = relapsed]
To further this analysis we also performed an SVM on the data to see how predictive each subset is for the relapse label. The results are shown in Table 4.18. Clearly
these results are not as good as the previous analysis using the mortality information.
What this may suggest is that since almost all of the deceased patients also relapsed,
the random forests algorithm is finding attributes which describe mortality more than
relapse. Due to the interconnectivity of these two properties, it is difficult to fully
understand what is going on with these data.
Table 4.18: SVM prediction accuracy of SNP subsets for relapse

Attributes    % Class Alive    % Class Deceased
25            97.3             69.23
50            98.2             61.54
100           100              76.92
250           100              81.54
500           100              50
1000          100              50
2500          100              26.92
5000          100              15.38
When comparing the top SNPs selected using each of the two class labels, mortality and relapse, we find few common SNPs for the smaller
subsets. For the top 25 and 50 subsets there is only 1 common SNP. As the datasets
get larger, as would be expected, there are more and more common SNPs. Of the
top 5000 SNPs for both subsets, there are 2354 shared SNPs. This is an encouraging
result since we expected to find some similarities due to the fact that relapse is a
fairly good predictor of mortality. This analysis has supported our hypothesis that
there is a relationship between a subject’s genetics and their response to this disease.
4.6.2 Graph analysis of SNP data
Another way of looking at these data is to look at how each patient is similar or
dissimilar to all other patients. Instead of describing a patient in space by the values
of their attributes, it is possible to create a space where the position of the patients
is based on their similarity to each other directly. This can be thought of as a
graph approach where there is a connection between two patients if they are above a
similarity threshold.
In order to accomplish this, the dot product is calculated for each pair of patients
and the result is stored in an n×n matrix with zero values for each (i, i) entry.
This can be regarded as an adjacency matrix and is the basis of the graph approach.
Once this matrix is calculated, the next step is to determine a threshold value which
defines the point at which two patients are considered similar. Any entry below
this threshold is set to zero; otherwise the value remains unchanged. From here
we chose to analyze these data using an SVD as before. The top 250 SNP attribute
subset is used for this analysis and the resulting SVD image is shown in Figure 4.18.
The connections are omitted from this figure so as to show the shape of
the data more clearly. The data form a structured U shape, which suggests an
inherent structure that this approach was able to find.
We hypothesize that this shape is due to the way in which these data were coded. Each
SNP value takes on a theta value of approximately 0, 0.5 or 1, as explained previously.
Therefore, we believe that this had an effect on the shape of the data since it can be
thought of as each data point migrating to one of three positions. It can be seen that
although the deceased patients are spread out around the U shape, they are still being
separated out vertically. This is an important result since we are finding the same
separation in the data but through a new technique where we are now comparing
individuals instead of taking each individual on its own.
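The construction described above can be sketched as follows; synthetic data again, and the 75th-percentile threshold is purely illustrative (the thesis does not state the cut-off it used):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic top-250 SNP subset: 120 patients x 250 theta values.
X = rng.choice([0.0, 0.5, 1.0], size=(120, 250))

# Pairwise dot products give an n x n similarity matrix; zero the diagonal
# so that no patient is connected to themselves.
A = X @ X.T
np.fill_diagonal(A, 0.0)

# Entries below the similarity threshold are set to zero; the rest are kept.
threshold = np.percentile(A, 75)   # illustrative choice of cut-off
A[A < threshold] = 0.0

# The thresholded adjacency matrix is then embedded with an SVD, as before.
U, S, Vt = np.linalg.svd(A)
coords = U[:, :3] * S[:3]
```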
[Figure 4.18: SVD image of SNP graph analysis (250 SNP subset, connections omitted). blue = alive, red = deceased]
4.6.3 Reformatting SNP data
The previous SNP analysis used the theta values for all of the experiments. There
is another representation based on the genotype for each patient, which has three
possible values: 0, 1 and 2. These values represent the homozygous major allele,
the heterozygous allele and the homozygous minor allele respectively. Another approach
is to separate each SNP attribute into three attributes representing each of the
alleles, where each subject has a value of 1 for the allele they contain and a 0
for the other two positions. These data were then run through the random forests
algorithm to find the most significant SNPs. The difference between this method
and the previous method is that this allows the random forests algorithm to pick out
specific alleles within a SNP attribute which may be more important than the others.
One problem that had to be dealt with was the fact that some heterozygous minor
alleles are so uncommon that none of these patients contained them at all. This would
result in entire columns of values being 0 which can skew the results of running these
data-mining algorithms. In order to prevent this, all of the 0 columns were removed
before any further analysis was done.
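That expansion, together with the removal of all-zero columns, can be sketched as follows (synthetic genotypes; the allele frequencies are invented to make rare genotypes plausible):

```python
import numpy as np

rng = np.random.default_rng(4)

# Genotype coding: 0 = homozygous major, 1 = heterozygous, 2 = homozygous minor.
n_patients, n_snps = 120, 500
G = rng.choice([0, 1, 2], size=(n_patients, n_snps), p=[0.7, 0.25, 0.05])

# Expand each SNP into three indicator attributes, one per genotype; each
# patient gets a 1 in the column for the genotype they carry, 0 elsewhere.
onehot = np.zeros((n_patients, n_snps * 3))
cols = np.arange(n_snps) * 3 + G                    # target column per entry
onehot[np.arange(n_patients)[:, None], cols] = 1.0

# Drop genotype columns carried by no patient (all-zero columns), which
# would otherwise distort the downstream analysis.
onehot = onehot[:, onehot.any(axis=0)]
```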
The analysis of these data was similar to what was done previously. First, the
data was run through the random forests algorithm to discover the most important
attributes. Then, these attribute rankings were used in order to select the top subsets
which were then analyzed by SVD. As before the top 25, 50, 100, 250, 500, 1000, 2500
and 5000 attribute subsets were used. We expected that the attribute selection would
select many of the same SNPs as with the previous dataset as this random forests
run was also labeled with the mortality of the patients. However, this method should
identify specific alleles of each SNP which may be more informative than simply
knowing which SNPs are important.
The SVD images for the top 250 and 2500 attributes are shown below in Figure
4.19. The separation between deceased and alive is clearly evident in both. However,
it is interesting to note that the separation becomes even more clear when more
attributes are used. This is not what we have observed with the previous dataset,
but since there are many more attributes in this dataset it is not surprising that this
is true. We also see with the 2500 subset that there appears to be a separation within
the deceased patients as well, as they form two or even three clusters. It is important
to note that the surviving patients are clustered around the origin while the deceased
patients appear as the outliers. This is significant because it suggests that these
patients are somehow different from the collective group of surviving patients. This
is an encouraging result as it again supports our hypothesis that there is a link
between the genetics of the individual and their outcome with the disease.
[Figure 4.19: SVD images of reformatted SNP analysis. (a) 250 SNP reformatted; (b) 2500 SNP reformatted. blue = alive, red = deceased]
When we compare the new set of top ranked attributes to the previous set we see
that there is approximately a 60% similarity across most subsets. This suggests that
in our previous subsets there were several attributes included which are not globally
predictive of mortality, but rather may have been correlated with those attributes
that were. When we isolate only the attributes which are shared we see a similar
separation.
4.6.4 Updated data labels
Late in the progress of this study, we were able to obtain updated patient information
for our datasets. Compared to the previous labels, the updated data contained five
more patients who are deceased as well as seven who have since relapsed. We were
interested in using these labels in two ways: labeling the previous results with the new
labels to see where these updated patients lie in space, and performing a new analysis
with these new labels as the basis for attribute selection. For the purposes of this
study, we have focussed on the SNP data results.
Relabeling previous results
We were interested to see where these newly updated patients would lie in the previously
defined space of objects. Given the clear separation of the data, we did not expect
to find that these new patients would all cluster together as if they had been
predicted to die, and this was in fact the case. Figure 4.20 shows a
comparison of the top 250 SNP SVD for both the old and new labels. It can be seen
that the newly labeled deceased scatter throughout the large cluster of alive patients.
This raises several questions about both the model that has been built as well as the
nature of these data. Both of these points will be addressed in later sections.
[Figure 4.20: SVD images of the 250 SNP subset with old and new labels. (a) 250 SNP old labels; (b) 250 SNP updated labels. blue = alive, red = deceased]
Attribute selection
In order to obtain a better understanding of the implications of these new labels we
decided to perform the same line of experiments as we did with the old data labels.
Since the random forests algorithm selects the best attributes based on the data
labels provided, we expected to find many new attributes being selected as compared
to the previous list obtained. As we saw with the cross validation approach described
previously, there are many attributes which are selected purely due to their correlation
with more informative attributes. Following this same principle, we believe that the
attributes which are found in both lists are the most predictive ones. As before, the
top 25, 50, 100, 250, 500, 1000, 2500 and 5000 subsets were analyzed using SVD.
SVD analysis
Figure 4.21 shows the result of the SVD for the top 250 and 2500 attributes. It is
clear that there is still a good separation of the data based on the mortality label.
The 250 subset image also shows more separation within each class than with
the previous labels. We believe that this is due to the nature of the coding of the
data. The 2500 subset image is reminiscent of the previous data where it is clear that
the data forms two clusters while still maintaining the separation based on mortality.
This is the same separation we believe to be due to the coding of the data. This
suggests that the top 2500 attribute subsets for both the old and new labels most
likely contain many of the same attributes.
[Figure 4.21: SVD images of SNP data with new labels. (a) 250 SNP new data; (b) 2500 SNP new data. blue = alive, red = deceased]
4.6.5 Cross validation of top attributes
One of the effects of changing data labels is that an attribute which is truly
predictive may have appeared to be less predictive under the earlier labeling. This
is a problem that cannot be avoided due to the nature of the data. However, if we
develop a more intelligent methodology for attribute selection then we can compensate
for this.
One way that this can be accomplished is by splitting the data into randomly
generated subsets and then performing attribute selection on each. The idea is that by
finding the attributes which appear in multiple lists, we are filtering out the attributes
which may only be predictive of that particular subset or attributes which appear only
by chance. Also, we believe that if an attribute appears in multiple lists then it is
a more general predictor than one which only appeared in one list. As an example
of this, we divided the SNP data into two subsets and ran each through the random
forest algorithm. Table 4.19 shows the number of common SNPs between each of the
subsets we created.
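The split-and-intersect procedure can be sketched as follows; again on synthetic data with ten genuinely informative attributes planted, so the intersection has a known signal to recover (all sizes are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)

# Synthetic SNP data in which only the first ten attributes drive the label.
X = rng.choice([0.0, 0.5, 1.0], size=(200, 1000))
y = (X[:, :10].sum(axis=1) > 5).astype(int)

def top_k(X, y, k=100):
    """Rank attributes by random-forest importance and return the top k."""
    forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
    return set(np.argsort(forest.feature_importances_)[::-1][:k])

# Split the patients into two random halves and select attributes in each.
order = rng.permutation(len(X))
half_a, half_b = order[:100], order[100:]
list_a = top_k(X[half_a], y[half_a])
list_b = top_k(X[half_b], y[half_b])

# Attributes selected in both halves are taken as the more general predictors;
# the rest are suspected of being informative only for one particular subset.
shared = list_a & list_b
```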
Table 4.19: Comparison of Top Attribute Lists

Attributes    No. Shared Attributes    % Shared Attributes
25            9                        34.61
50            18                       36
100           34                       34
250           96                       38.4
500           230                      46
1000          502                      50.2
2500          1442                     57.68
5000          3266                     65.32
These results are as we expected; for small subsets there is little overlap, but
as the size of the subsets increases there are more and more intersections. This is an
interesting result because the previous SNP analysis showed a perfect linear separation
for the 250 attribute subset, and yet when we create two subsets of these
data we find only a 38.4% overlap. This suggests that most of the attributes included
in that particular subset were either only informative for that specific dataset or not
informative at all. On the other hand, the 38.4% of the attributes that appear in
both lists can therefore be assumed to be more informative. In order to
verify this we performed an SVD analysis as before.
SVD analysis of attribute intersection
To see the effect of removing the attributes that only appear in one of the lists,
we performed an SVD on each of the subsets and used the updated labels. The
results are shown in Figure 4.22 for the intersection of the top 250, 1000, 2500 and
5000 subsets. For the intersection of the top 250 SNPs it can be seen that the clear
separation between the deceased and alive patients is no longer present. However,
there is still a good clustering of deceased patients. Also, since there is not a clear
separation it could suggest that the patients who lie closer to the deceased are at a
much higher risk. This would be a much more beneficial result as it would provide
more information about each individual patient rather than a model which is specific
to the data. This also suggests that some of the attributes which have been removed
may have been overfitting that particular dataset and thus made the resulting SVD
images appear to have a much bigger separation.
Looking next at the intersection of the top 1000 SNPs, it is interesting to note
that the separation between the patients is more clear. This again supports the idea
of wanting to keep as much information as possible. The separation is still not as clear
as we have previously seen, which is the preferred result. For the intersection of the
top 2500 SNPs we see a more familiar picture in that the data is beginning to form
the two clusters we have seen previously. However, the data is still separated quite
well based on the mortality label. Finally, for the intersection of the top 5000 SNPs
we see the same separation as with the previous image with more distance between
the clusters. It is important to note that these images have shown almost exactly the
same separation as the previously displayed results, but with approximately half the
number of attributes. This confirms that there are many attributes in the previous
subsets which are not powerful separators. By removing these attributes we get a
much clearer picture of what is really going on.
[Figure 4.22: SVD images of intersecting SNP attributes. (a) 250 SNP intersection; (b) 1000 SNP intersection; (c) 2500 SNP intersection; (d) 5000 SNP intersection. blue = alive, red = deceased]
4.7 Discussion of the Nature of the Data
This type of data is complex and constantly changing which makes it difficult to build
accurate models. The complexity of the data is represented by the sheer amount of
data that exists for each patient. When it is all combined, there are over 40000
attributes for each patient, and this number could increase significantly with newer
technology. It is difficult to accurately remove attributes that do not provide useful
information and select those that do. It is necessary to take many different approaches
and be clever with the available techniques to be able to discover anything useful from
these data.
Another challenge with these data is that they are constantly changing. As we saw
from the updated labels, the model that we had built was completely changed by
only four patients having been updated. This is difficult to deal with, since a model
built at one point in time may quickly become obsolete. This is why it is necessary to do such things
as cross validation in order to try to isolate the attributes that are responsible for
the separation and not the attributes that appear in a list due to their correlation in
that particular dataset.
When dealing with the mortality labels, a patient is listed as either alive or deceased. However, not every patient who has ALL ends up dying because of the
disease. Due to the intensity of the treatment, the patient’s immune system becomes
compromised and so it is possible that the patient may have died due to an infection
or some other health concern. This becomes a problem for this type of analysis since
all deceased patients are treated as equal. Since the number of deceased patients is
small in comparison to the number of surviving patients, this could quite easily skew
the results. At present the cause of death is unavailable, and so we must
assume that all patients who have died have done so because of their disease.
One final challenge is what the data represent. Biological systems are complex and there are many levels of regulation within each system. In this study we
are using both SNP and gene expression data. In a biological system, an individual’s
genome affects the genes, which in turn affect the proteins, which then affect the
phenotype. By looking at the SNP data we are looking at the genome level. Any
significant changes in the SNPs can affect the genes, which could affect the gene
expression values. As a result, these datasets are dependent and connected, and thus
cannot be treated as independent.
Chapter 5
Conclusion
The goal of this research was to investigate the relationship between an individual’s
genetics and whether or not they survived their battle with acute lymphoblastic
leukemia. The data that was used for this study was produced from microarray
analysis of the individual’s SNPs and gene expression values. These data are complex
and high-dimensional, which provided many challenges for the analysis. We used data-mining techniques to analyze these data and created a process of attribute selection
followed by clustering through the use of a Singular Value Decomposition (SVD). We
used various clinical labels to understand the results that this technique produced.
This study has produced many conclusions about both the data and the techniques
that were used. Our analysis has shown that a separation can be found between
patients who live and who die based on both the SNP values and the gene expression
values. This suggests that there may be a genetic explanation for why some patients
die within the context of current treatment regimes. This is significant and novel as
it is not widely accepted that there is a genetic factor which can distinguish patients
who live and die. Rather, the genetic factors that are known are related to individuals
developing this disease or not. We have not been able to pinpoint which attributes are
responsible for this, but we believe that our attribute selection method creates subsets
which contain these informative attributes. This finding was supported through many
different analyses. The SNP, cDNA and combined dataset analyses using our data-mining procedure showed a clear separation of the data based on the mortality label.
Also, our further analysis of the SNP data using various techniques all showed similar
results. The validation technique we used also showed that these results were not due
to random chance. It would be ideal to obtain new data which could be run through
our model, but at this current time this is not possible. We believe that this finding
has merit. However, it will take further research and fine tuning of the techniques to
discover any biological significance.
The process of attribute selection is one which must be done carefully. It is unrealistic to assume that the attribute-selection algorithm, in this case the random
forest algorithm, will be able to identify all of the biologically significant attributes
with such a large dataset. We have shown that by evaluating the attribute selection
process through a cross validation of attributes in smaller subsets, there are many
attributes which are included in these subsets which may only be informative for that
particular dataset and are not globally predictive. It is necessary to be more intelligent about the attribute selection process in order to distinguish between predictive
attributes and those that only appear to be predictive.
We have also shown that the current process of using clinical data to make decisions about diagnosis, prognosis and treatment is not adequate. Although the survival
rate is approximately 80%, it can be seen from our analysis that based on the genetics
of these individuals there does not appear to be any meaningful relationship to the
risk classification the physicians have assigned, as seen in the SVD analysis of the
clinical data.
These data are complex, high-dimensional, constantly changing and biologically
interconnected, which makes them difficult to work with.
We believe that data mining provides the necessary tools to attempt to understand
and learn from this type of data. The data-mining process is involved and requires
the researchers to constantly scrutinize the results and to learn from them in order
to develop a more intelligent process. With so much data being produced from these
high throughput devices every day, it is necessary to develop intelligent and efficient
methods of learning from these data and we believe that data mining is necessary to
take advantage of the wealth of knowledge hidden in these datasets.
5.1 Future Work
This study is a step in a new direction of using data-mining techniques with microarray data for clinical applications to cancer treatment. We believe that we have shown
the power of data mining and its uses in this field of research. This research will be
the foundation for many other studies which use similar techniques and will be able
to build off of these results.
We have identified a need to develop a more intelligent methodology for the process of attribute selection. Although simply using the random forest algorithm was
able to identify interesting attributes, we were able to demonstrate that a large proportion of these attributes were not generally predictive. We are working in this area,
attempting to use techniques such as curve fitting, correlation, SVD and others to
improve attribute selection. It is also important to add domain knowledge into this
process as well. Based on what we know about the interdependence of SNPs and gene
expression values, we are beginning to develop a method of filtering out attributes
which do not appear to contain any useful information.
The ultimate goal of this project is to be able to create a clinical tool which can
be used to assist physicians in assigning a patient into an appropriate risk category so
that treatment will be more targeted to that particular individual. This is a form of
personalized medicine which we believe will be the future of medical diagnosis. This
will be possible by creating a “space” where these patients will lie based upon their
genetic and clinical information. This space can then be labeled with such information
as treatment received, outcome, risk, whether or not the patient relapsed, etc. Based
on this information, when a new patient with ALL is received they can be put into
this space and their position will be based on their genetics and clinical information.
It is then possible to look in a neighbourhood around this patient and look at the
neighbours which will be biologically similar to this new patient. By observing the
neighbour’s treatment and outcome, more informed decisions about this new patient
can be made. If the neighbours all received the same type of treatment and all
survived, then it would make sense to prescribe this treatment for the new patient.
However, if all of the neighbours received the same treatment and died, then it would
be wise to explore the option of a different course of treatment. This is just an
example of how this system could work once it is developed.
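A minimal sketch of that neighbourhood lookup, assuming patients have already been projected into a common space (the coordinates, outcome labels and choice of k are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical patient space: existing patients placed by their genetic and
# clinical coordinates, each with a known outcome label.
coords = rng.normal(size=(100, 3))
outcome = rng.choice(["alive", "deceased"], size=100, p=[0.8, 0.2])

def neighbourhood(new_patient, coords, outcome, k=5):
    """Return the outcome labels of the k nearest existing patients."""
    dists = np.linalg.norm(coords - new_patient, axis=1)
    return outcome[np.argsort(dists)[:k]]

# A new ALL patient is projected into the same space; the outcomes (and, in
# a full system, the treatments) of their neighbours inform the decision.
new_patient = rng.normal(size=3)
neighbour_outcomes = neighbourhood(new_patient, coords, outcome)
```

In a deployed tool the neighbours' treatments and relapse histories would be retrieved alongside their outcomes, as described above.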
This methodology can be applied to many other types of diseases and we believe
that this will lead to more informed treatment decisions resulting in a higher percentage of individuals who survive their disease. This is truly what personalized medicine
is all about and we believe that this is the future of medicine.
Bibliography
[1] Affymetrix. Genechip microarrays: Student manual. www.affymetrix.com/
about_affymetrix/outreach/educator/microarray_curricula.affx, 2004.
Accessed on October 20, 2009.
[2] G. Alsbeih, N. Al-Harbi, M. Al-Buhairi, K. Al-Hadyan, and M. Al-Hamed. Association between tp53 codon 72-single nucleotide polymorphism and radiation
sensitivity of human fibroblasts. Radiation Research, pages 535–540, 2007.
[3] Orly Alter. Discovery of principles of nature from mathematical modeling of dna
microarray data. PNAS, 103:16063–16064, 2006.
[4] A. Archer and R. Kimes. Empirical characterization of random forest variable
importance measures. Computational Statistics and Data Analysis, 52:2249–
2260, 2007.
[5] E. Asgarian, M.H. Moeinzadeh, S. Sharifian, A. Najafi, A. Ramezani, J. Habibi,
and J. Mohammadzadeh. Solving mec model of haplotype reconstruction using
information fusion, single greedy and parallel clustering approaches. Computer
Systems and Applications, pages 15–19, 2008.
104
BIBLIOGRAPHY
105
[6] Deepa Bhojwani, Huining Kang, Renee Menezes, Wenjian Yang, Harland Sather,
Naomi Moskowitz, Dong-Joon Min, Jeffrey Potter, Richard Harvey, Stephen
Hunger, Nita Seibel, Elizabeth Raetz, Rob Pieters, Martin Horstmann, Mary
Relling, Monique den Boer, Cheryl Willman, and William Carroll. Gene expression signatures predictive of early response and outcome in high-risk childhood
acute lymphoblastic leukemia: a children’s oncology group study. Journal of
Clinical Oncology, 26(27):4376–4384, 2008.
[7] Sikic Branimir, Robert Tibshirani, and Norman Lacayo. Genoics of childhood
leukemia: the virtue of complexity. Journal of Clinical Oncology, 26(27):4367–
4368, 2008.
[8] L Breiman and A Cutler. Random forests. www.stat.berkeley.edu/~breiman/
RandomForests/cc_manual.htm, 2004. Accessed on October 20, 2009.
[9] Christopher J.C. Burges. A tutorial on support vector machines for pattern
recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.
[10] Daniel Catchpoole, Andy Lail, Dachuan Guo, Qing-Rong Chen, and Javed
Khan. Gene expression profiles that segregate patients with childhood acute
lymphoblastic leukaemia: an independent validation study identifies that endoglin associates with patient outcome. Leukemia Research, 31:1741–1747, 2007.
[11] P. Chopra. Microarray data mining using landmark gene-guided clustering. BMC
Bioinformatics, 9(92), 2008.
BIBLIOGRAPHY
106
[12] Nigel Crawford, John Heath, David Ashley, Peter Downie, and Jim Buttery.
Survivors of childhood cancer: An australian audit of vaccination status after
treatment. Pediatric Blood Cancer, pages 128–133, 2009.
[13] R. Diaz-Uriate and S. Alvarez de Andres. Gene selection and classification of
microarray data using random forest. BMC Bioinformatics, 7(3), 2006.
[14] Christian Flotho, Elain Coustan-Smith, Deqing Pei, Cheng Cheng, Guangchun
Song, Ching-Hon Pui, James Downing, and Dario Campana. A set of genes
that regulate cell proliferation predicts treatment outcome in childhood acute
lymphoblastic leukaemia. Blood, 110(4):1271–1277, 2007.
[15] Centers for Disease Control and Prevention. Leading causes of death. www.cdc.
gov/nchs/FASTATS/lcod.htm, May 2009. Accessed on November 11, 2009.
[16] Leukaemia Foundation. Acute lymphoblastic leukemia. www.leukemia.org/
web/aboutdiseases/leukaemias_all.php, 2004. Accessed on July 25, 2009.
[17] Clare Frobisher, Emma Lancashire, David Winter, Aliki Taylor, Raoul Reulen,
and Michael Hawkins. Long-term population based divorce rates among adult
survivors of childhood cancer in britain. Pediatric Blood Cancer, pages 116–122,
2009.
[18] Lan Guo, Yan Ma, Rebecca Ward, Vince Castranova, Xianglin Shi, and Yong
Qian. Constructing molecular classifiers for the accurage prognosis of lung adenocarcinoma. Clinical Cancer Research, 11:3344–3354, 2006.
[19] Katrin Hoffmann, Martin J. Firth, Alex H. Beesley, Joseph R. Freitas, Jette Ford,
Saranga Senanayake, Nicholas H. de Klerk, David L. Baker, and Ursula R. Kees.
BIBLIOGRAPHY
107
Prediction of relapse in paediatric pre-b acute lymphoblastic leukaemia using a
three gene risk index. British Journal of Haematology, 140:656–664, 2008.
[20] Amy Holleman, Meyling Cheok, Monique den Boer, Wenjian Yang, Anjo Veerman, Karin Kazemier, Deqing Pei, Cheng Cheng, Ching-Hon Pui, Mary Relling,
Gritta Janka-Schaub, Rob Pieters, and William Evans. Gene-expression patterns
in drug-resistant acute lymphoblastic leukemia cells and response to treatment.
The New England Journal of Medicine, 351(6):533–542, 2004.
[21] National Cancer Institute.
Cancer research funding.
www.cancer.gov/
cancertopic/factsheet/NCI/research-funding, 2009. Accessed on November 11, 2009.
[22] National Cancer Institute. Seer cancer statistics review. seer.cancer.gov/
statfacts/html/all.html, 2009. Accessed on November 11, 2009.
[23] John Luk, Brian Lam, Nikki Lee, David Ho, Pak Sham, Lei Chen, Jirun Peng,
Xisheng Leng, Phillip Day, and Sheung-Tat Fan. Artificial neural networks and
decision tree model analysis of liver cancer proteomes. Biochemical and Biophysical Research Communications, pages 68–73, 2007.
[24] Charles Mullighan, Salil Goorha, Ina Radtke, Christopher Miller, Elaine
Coustan-Smith, James Dalton, Kevin Girtman, Susan Mathew, Jing Ma, Stanley Pounds, Xiaoping Su, Ching-Hon Pui, Mary Relling, William Evans, Sheila
Shurtleff, and James Downing. Genome-wide analysis of genetic alterations in
acute lymphoblastic leukaemia. Nature, 1038, 2007.
BIBLIOGRAPHY
108
[25] World Health Organization. Cancer fact sheet. www.who.int/mediacentre/
factsheets/fs297/en/, February 2009. Accessed on November 11, 2009.
[26] Daniel Peiffer, Jennie Le, Frank Steemers, Weihua Chang, Tony Jenniges, Francisco Garcia, Kirt Haden, Jiangzhen Li, Chad Shaw, John Belmont, Sau Wai
Cheung, Richard Shen, David Barker, and Kevin Gunderson. High-resolution
genomic profiling of chromosomal aberrations using infinium whole-genome genotyping. Genome Research, 16:1136–1148, 2006.
[27] Mary Ross, Xiaodong Zhou, Guangchun Song, Sheila Shurtleff, Kevin Girtman, W. Kent Williams, Hsi-Che Liu, Rami Mahfouz, Susana Raimondi, Noel Lenny, Anami Patel, and James Downing. Classification of pediatric acute lymphoblastic leukemia by gene expression profiling. Blood, 102(5):2951–2959, 2003.
[28] David B. Skillicorn. Understanding Complex Datasets. CRC Press, 2007.
[29] Johan Staaf, Johan Vallon-Christersson, David Lindgren, Gunnar Juliusson, Richard Rosenquist, Mattias Hoglund, Ake Borg, and Markus Ringner. Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios. BMC Bioinformatics, 409, 2008.
[30] Hequan Sun, Qinke Peng, Quanwei Zhang, and Dan Mou. Splice site prediction based on characteristics of sequential motifs and C4.5 algorithm. Fifth International Conference on Fuzzy Systems and Knowledge Discovery, pages 417–422, 2008.
[31] Nobuhiro Suzuki, Keiko Yamura-Yagi, Makoto Yoshida, Junichi Hara, Shinichiro Nishimura, Tooru Kudoh, Akio Tawa, Ikuya Usami, Akihiko Tanizawa, Hiroki Hori, Yasuhiko Ito, Ryosuke Miyaji, Megumi Oda, Koji Kato, Kazuko Hamamoto, Yuko Osugi, Yoshiko Hashii, Tatsutoshi Nakahata, and Keizo Horibe. Outcome of childhood acute lymphoblastic leukemia with induction failure treated by Japan Association of Childhood Leukemia Study (JACLS) ALL F-protocol. Pediatric Blood Cancer, pages 71–78, 2009.
[32] Cancer Research UK. Acute lymphoblastic leukemia and the blood. www.cancerhelp.org.uk/help/default.asp?page=32125, March 2009. Accessed on July 25, 2009.
[33] Wei Wang, Ji Xiang Peng, Jie Quang Yang, and Lian Yue Yang. Identification of gene expression profiling in hepatocellular carcinoma using cDNA microarrays. Digestive Diseases and Sciences, pages 2729–2735, 2008.
[34] Jun Wei, Braden Greer, Frank Westermann, Seth Steinberg, Chang-Gue Son, Qing-Rong Chen, Craig Whiteford, Sven Bilke, Alexei Krasnoselsky, Nicola Cenacchi, Daniel Catchpoole, Frank Berthold, Manfred Schwab, and Javed Khan. Prediction of clinical outcome using gene expression profiling and artificial neural networks for patients with neuroblastoma. Cancer Research, 64:6883–6891, 2004.
[35] J. J. Yang, C. Cheng, and W. Yeng. Genome-wide interrogation of germline genetic variation associated with treatment response in childhood acute lymphoblastic leukemia. The Journal of the American Medical Association, 301(4):393–403, 2009.