Genome Informatics 16(1): 245–253 (2005)
Simple Discriminant Functions Identify Small Sets of
Genes that Distinguish Cancer Phenotype from Normal
Gul S. Dalgin (1), [email protected]
Charles DeLisi (2,3), [email protected]
(1) Molecular Biology, Cell Biology and Biochemistry Program, Boston University, Boston, MA 02215, USA
(2) Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
(3) Bioinformatics Graduate Program, Boston University, Boston, MA 02215, USA
Abstract
High-throughput gene expression profiling can identify sets of genes that are differentially expressed between different phenotypes. Discovering marker genes is particularly important in diagnosis of a cancer phenotype. However, gene sets produced to date are too large to be economically
viable diagnostics. We use a hybrid decision tree-discriminant analysis to identify small sets of
genes, i.e. single genes and gene pairs, which separate normal samples from different stages of
tumor samples. Half the samples are selected for training to form the probability distribution of
expression values of each gene. The distributions for the tumor and normal phenotypes are then
used to classify the test samples. The algorithm also identifies gene pairs by combining the probability distributions to construct a decision tree which is used to determine the class of test samples.
After a series of training and testing sessions, genes and gene pairs that classify all samples correctly are recorded. The method was applied to a breast cancer data set, and classifier genes that
distinguish normal breast from different stages of breast tumor were identified. The genes were
ranked according to the minimum Euclidean distance between their expression values in tumor
and normal samples. The algorithm picked known cancer-related genes and also found
genes that were not identified as differentially expressed by a t-test with a 2-fold cut-off. Overall,
the method generates possible diagnostic genes and gene pairs for a specific disease phenotype to
pursue further biological interpretations in cancer biology.
Keywords: discriminant analysis, gene expression, cancer, diagnostic genes
1 Introduction
High-throughput gene expression profiling using microarray technology has emerged as a promising
approach for correlating gene expression with environmental conditions. Methods are available for
allocating samples into pre-specified phenotypic groups based on differences in gene expression profiles,
or for segregating samples into groups without prior specification [2, 9].
When groups are pre-specified, the aim is typically to identify differentially expressed diagnostic gene sets. Sets of over- or underexpressed genes that stratify closely related diseases have been
successfully identified for ALL-AML classification [4], ovarian cancer versus normal tissue [3], BRCA1,
BRCA2 and sporadic breast tumor classification [5], and poor-prognosis versus good-prognosis breast
cancer samples [10]. The main problem is that the sets of differentially expressed genes produced so far
are too large to be used as feasible diagnostics.
In this paper, we present a hybrid decision tree-discriminant analysis to identify small sets of
genes whose joint expression distribution separates two pre-defined classes. The method generates
probability distributions from the fraction of samples in the two classes, and exploits them to select genes
that classify all samples accurately after a series of training and test sessions.
Herein, we apply the methodology to breast cancer data generated by Ma and colleagues [6].
Single genes and gene pairs whose joint expression distribution separates tissue samples in different
stages and grades of malignancy from normal tissue were identified as candidate diagnostic genes.
Overall, the results suggest that this new discriminant analysis efficiently identifies small gene sets
that distinguish phenotypes.
2 Method and Results
Data
The method was applied to the breast cancer gene expression data produced by Ma and colleagues [6]. The data is described in more detail elsewhere (Dalgin et al., manuscript in preparation).
The samples comprise normal breast tissue from breast cancer patients and three stages of breast tumor:
premalignant stage (ADH), in situ cancer (DCIS) and invasive cancer (IDC), with different grades
(Grade I, slow-growing tumor; Grade II, intermediate; Grade III, fast-growing tumor). Overall,
the current analysis used 32 normal samples, 8 ADH, 9 DCIS Grade I, 11 DCIS Grade II, 10 DCIS Grade III,
5 IDC Grade I, 9 IDC Grade II and 9 IDC Grade III samples, together with the 1940 genes that were found
to be differentially expressed between normal and the three stages (ADH, DCIS and IDC) by linear
discriminant analysis [6]; these data are publicly available. The gene expression level (E) of each
gene was reported as the log ratio of the expression level in the experimental sample to the expression
level in the reference sample (E = log2 (sample/reference sample)). The reference sample was a human
universal reference RNA from Stratagene [6].
Method
The method consists of three steps: (1) dividing the samples into training and test sets; (2)
generating probability distributions for identifying single genes and for decision analysis when pairs
are used; and (3) assigning test samples and selecting genes and gene pairs that perform well. An overview
of the method is given in Figure 1.
Figure 1: Overview of the method.
In the first step, the samples are divided into training and test samples in each partition. The
method employs a cross-validation technique by which the samples are separated into training and test
sets randomly (first partitioning) or semi-randomly (second and third partitionings) (see
Supplementary Figure). This technique ensures that all samples are used at least once in training,
but it still has usage bias. The first partitioning is always random irrespective of the other
partitionings, whereas the second and third partitionings are semi-random to guarantee good coverage of the
samples. Overall, 99 partitions are performed.
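The three partitionings can be sketched in Python as follows. This is a minimal illustration based on the description in the Supplementary Figure, not the original implementation. The function names, the use of integer division for odd class sizes, and the reading of the 99 partitions as 33 repetitions of the three partitionings are assumptions.

```python
import random


def three_partitionings(samples, rng):
    """Return three (train, test) splits of one class, following the Figure 5 scheme."""
    samples = list(samples)
    half = len(samples) // 2      # assumption: integer division for odd class sizes
    quarter = half // 2

    # 1st partitioning: a random half trains, the remainder tests.
    train1 = rng.sample(samples, half)
    test1 = [s for s in samples if s not in train1]

    # 2nd partitioning: half of the training set comes from the 1st training
    # set, the other half from the 1st test set.
    train2 = rng.sample(train1, quarter) + rng.sample(test1, half - quarter)
    test2 = [s for s in samples if s not in train2]

    # 3rd partitioning: every sample not yet used for training, topped up
    # with random samples from the 2nd training set.
    unused = [s for s in samples if s not in train1 and s not in train2]
    train3 = unused + rng.sample(train2, half - len(unused))
    test3 = [s for s in samples if s not in train3]

    return [(train1, test1), (train2, test2), (train3, test3)]


def partitions(normal, tumor, rounds=33, seed=0):
    """Yield (normal_train, tumor_train, normal_test, tumor_test) tuples.

    Assumption: 99 partitions = 33 rounds of the three partitionings.
    """
    rng = random.Random(seed)
    for _ in range(rounds):
        normal_splits = three_partitionings(normal, rng)
        tumor_splits = three_partitionings(tumor, rng)
        for (n_tr, n_te), (t_tr, t_te) in zip(normal_splits, tumor_splits):
            yield n_tr, t_tr, n_te, t_te
```

By construction, the union of the three training sets covers every sample, which is the coverage property the semi-random partitionings are meant to guarantee.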
In the second step, the probability distributions of the expression values (E) of a gene in each of the two
training classes are generated. These are used to classify the test samples. The distributions
of endothelin 3 expression levels for tumor (T) and normal (N) samples are shown in Figure 2 as an
example.
[Figure 2 plot: expression value E (y-axis, interval boundaries -0.55, 0.06, 0.67, 1.28, 1.89, 2.50, 3.10, 3.71, 4.32, 4.93, 5.54) versus sample number (x-axis, 1 to 31); series: tumor, normal.]
Figure 2: Distribution of expression values (E) of endothelin 3 gene in normal and tumor samples.
Expression values are divided into intervals. The interval boundaries are shown near the y-axis.
** The order of the samples is arbitrary and the sample numbers have no special importance.
The expression values are divided into intervals to generate the probability distributions. The
number of intervals is chosen such that the values are discretized into intervals that are neither
very small nor very large. As an example, 32 normal and 8 ADH expression values were divided into 10 intervals.
P(E|N), the probability of an expression value in normal samples, and P(E|T), the probability of an expression value in tumor samples, are calculated from the fraction of normal and tumor samples, respectively,
in the interval (E, E + dE). The probability distribution for the endothelin 3 gene is shown in Figure 3.
[Figure 3 plot: probability (y-axis, 0 to 0.7) for each expression interval (x-axis; lower boundaries -0.55 to 4.93, upper boundaries 0.06 to 5.54).]
Figure 3: Probability distribution for endothelin 3 gene. The lower (at the bottom) and upper
boundary values (at the top) for each interval are shown in the x-axis. P (E|N ) and P (E|T ) are
calculated from the fractions of normal and tumor samples, respectively, in an interval.
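As a rough illustration of how P(E|N) and P(E|T) could be estimated from interval fractions, the following Python sketch bins the expression values of one gene. It assumes equal-width intervals spanning the pooled range of both classes; the function names and the default of 10 intervals are illustrative choices, not details confirmed by the text.

```python
def interval_edges(values, n_intervals=10):
    """Equal-width interval boundaries over the pooled values (assumed binning)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    return [lo + k * width for k in range(n_intervals + 1)]


def interval_index(e, edges):
    """Index of the interval containing e; out-of-range values go to the end bins."""
    n = len(edges) - 1
    if e <= edges[0]:
        return 0
    if e >= edges[-1]:
        return n - 1
    width = (edges[-1] - edges[0]) / n
    return min(int((e - edges[0]) / width), n - 1)


def class_distribution(values, edges):
    """Fraction of one class's samples falling in each interval, i.e. P(E | class)."""
    counts = [0] * (len(edges) - 1)
    for e in values:
        counts[interval_index(e, edges)] += 1
    return [c / len(values) for c in counts]


# Toy usage with made-up expression values for one gene.
normal = [2.0, 3.0, 4.0, 5.0]
tumor = [0.0, 1.0]
edges = interval_edges(normal + tumor, n_intervals=5)
p_e_given_n = class_distribution(normal, edges)
p_e_given_t = class_distribution(tumor, edges)
```

In this toy case the two distributions do not overlap in any interval, which corresponds to a gene that separates the training classes perfectly.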
It is evident that for this particular case, expression levels of endothelin 3 above 1.28 occur only
in the normal group, and expression levels below 0.06 occur only in the tumor group. However,
since separation is incomplete, this gene by itself is not a good candidate to use as a signature. We
therefore ask whether a second gene can be found which, in combination with endothelin 3, gives
perfect separation of the training set. To limit the search, we consider as the first gene of a pair
(endothelin 3 in this example) only genes that misclassify fewer than 10% of the total training samples.
The pairs (or singlets, when the first gene separates perfectly) thus obtained are then evaluated on
the test set.
Samples in the test set are assigned to the normal category if $P(N|\vec{E}) > P(T|\vec{E})$, where
$\vec{E} = (E_1, E_2)$, and to tumor otherwise, where the posteriors are given by Bayes' rule (Figure 1, step 3). The
pairs that correctly classify all test samples are recorded as perfect pairs after each partition.
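The assignment rule can be illustrated with a short Python sketch. The text does not spell out how the two single-gene distributions are combined, so the sketch assumes that the joint likelihood of a pair is the product of the single-gene likelihoods (conditional independence) and that the class prior is supplied by the caller. Both are assumptions, as are the function names.

```python
import math


def posterior_normal(likelihoods_n, likelihoods_t, prior_n=0.5):
    """P(N | E) by Bayes' rule for one gene or a gene pair.

    likelihoods_n / likelihoods_t hold P(E_k | N) / P(E_k | T) for the
    interval in which each gene's expression value falls.
    Assumption: genes are combined by multiplying their likelihoods.
    """
    joint_n = prior_n * math.prod(likelihoods_n)
    joint_t = (1.0 - prior_n) * math.prod(likelihoods_t)
    evidence = joint_n + joint_t
    if evidence == 0.0:
        # Interval unseen in both training classes: no evidence either way.
        return 0.5
    return joint_n / evidence


def classify(likelihoods_n, likelihoods_t, prior_n=0.5):
    """Normal if P(N | E) > P(T | E), tumor otherwise (ties go to tumor)."""
    p_n = posterior_normal(likelihoods_n, likelihoods_t, prior_n)
    return "normal" if p_n > 1.0 - p_n else "tumor"
```

For example, `classify([0.6], [0.0])` returns `"normal"` because the observed interval occurs only among normal training samples, while `classify([0.1, 0.0], [0.2, 0.5])` returns `"tumor"` because the second gene rules out the normal class.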
Table 1: Number of single genes and gene pairs identified for each normal and tumor stage comparison.

Comparison        Number of single classifier genes   Number of pairs (genes involved)**
Normal-ADH        10                                   2136 (336 genes)
Normal-DCIS I*    11                                   8515 (502 genes)
Normal-DCIS II    18                                   8087 (455 genes)
Normal-DCIS III   24                                   12520 (670 genes)
Normal-IDC I      56                                   12836 (564 genes)
Normal-IDC II     23                                   15948 (649 genes)
Normal-IDC III    26                                   20823 (743 genes)

* DCIS Grade I is abbreviated as DCIS I.
** Number of pairs that appear in at least 10 partitions.
The classifier genes that distinguish separately between normal and 7 stages of breast tumor were
identified after performing 99 partitions for each case. Single genes and gene pairs that correctly
separate the samples in at least 1 partition were recorded for each comparison. The results are
summarized in Table 1.
In order to determine how well the genes distinguish the two groups, genes and gene pairs were
ranked based on a distance measure, which uses the overall expression value distribution. For a single
classifier gene, the Euclidean distance between the tumor and normal samples was calculated. The
rank of gene i is determined by this distance ($d_i$):
$$d_i = \sqrt{\sum \left(E_{T,i} - E_{N,i}\right)^2}$$
where $E_{T,i}$ and $E_{N,i}$ are the expression values of gene i in the tumor and normal samples, respectively.
The rank of the gene is inversely proportional to this distance; the larger the distance, the better the
gene as a classifier.
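A small Python sketch of this ranking is given below. The text does not state the index of the sum, and the two classes contain different numbers of samples, so the sketch assumes that the sum runs over all tumor-normal sample pairs. The function names and the input layout are likewise illustrative.

```python
import math


def gene_distance(tumor_values, normal_values):
    """Euclidean distance between the tumor and normal expression values of one gene.

    Assumption: the sum runs over every (tumor sample, normal sample) pair.
    """
    return math.sqrt(sum((t - n) ** 2
                         for t in tumor_values
                         for n in normal_values))


def rank_genes(expression):
    """Order genes from largest to smallest distance (best classifier first).

    expression maps a gene name to a (tumor_values, normal_values) tuple.
    """
    return sorted(expression,
                  key=lambda gene: gene_distance(*expression[gene]),
                  reverse=True)
```

With made-up values, `rank_genes({"a": ([1.0], [0.0]), "b": ([3.0], [0.0])})` places gene `"b"` first because its tumor and normal values lie further apart.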
In order to assess whether the Euclidean distance is a distinguishing feature of the classifier genes with
respect to other genes, the distributions of Euclidean distances for single classifier genes and for other genes
were compared. An example histogram is shown in Figure 4 for single classifier genes that distinguish
normal samples from DCIS Grade III samples. In this case, it is clear that the distances of single
classifier genes are larger than those of non-classifier genes; hence classifier genes show a better
separation between their expression values in normal and tumor samples. This observation is valid for other classifier genes as
well (data not shown). This also suggests that Euclidean distance can be used to distinguish and rank
classifier genes. That is to say, when the genes are to be tested on an independent data set, the genes
that have been top ranked in terms of their distance are expected to perform better than the others.
Figure 4: Histogram of the Euclidean distance calculated for normal-DCIS Grade III single classifier
genes and the rest of the genes. Euclidean distance is calculated between the expression values of
genes in normal and tumor samples.
Similarly, the gene pairs were ranked according to their minimum Euclidean distance between the
expression values in tumor and normal samples. First, the Euclidean distance between the expression
values of the pair (gene i and j) in a tumor sample and each normal sample was calculated:
$$d\big((E_{T,i}, E_{T,j}), E_N\big) = \min\left(\sqrt{\sum \left[(E_{T,i} - E_{N,i})^2 + (E_{T,j} - E_{N,j})^2 + (E_{T,i} - E_{N,j})^2 + (E_{T,j} - E_{N,i})^2\right]}\right)$$
The minimum of this set was selected for that tumor sample. After carrying out the procedure for all
tumor samples, the minimum of this set of minima was selected as the minimum distance for the gene
pair:
$$d_{i,j} = \min\Big(d\big((E_{T,i}, E_{T,j}), E_N\big)\Big), \quad T = 1, \ldots, N_T$$
where $N_T$ is the total number of tumor samples. The rank of the gene pair is inversely proportional
to this minimum distance.
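The two-level minimum can be sketched in Python as follows. The sketch reads the distance for one tumor sample and one normal sample as the square root of the four squared differences listed in the formula; this reading, the function name and the input layout are assumptions.

```python
import math


def pair_distance(tumor, normal):
    """Minimum distance d_ij between tumor and normal samples for a gene pair.

    tumor and normal are lists of (E_i, E_j) tuples, one tuple per sample.
    """
    def sample_distance(t, n):
        (t_i, t_j), (n_i, n_j) = t, n
        return math.sqrt((t_i - n_i) ** 2 + (t_j - n_j) ** 2
                         + (t_i - n_j) ** 2 + (t_j - n_i) ** 2)

    # Inner minimum: closest normal sample for each tumor sample.
    # Outer minimum: smallest of these minima over all tumor samples.
    return min(min(sample_distance(t, n) for n in normal) for t in tumor)
```

Because the measure is a minimum, a single tumor sample lying close to any normal sample yields a small distance for the pair, so a large value indicates a margin between the two classes.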
In order to compare our results with a conventional method, we performed t-tests on the same set
of genes, i.e. 1940 genes, and the same classes defined in breast cancer, i.e. normal and 7 stages of
breast tumor. The average fold change and the significance values were calculated for each gene for each
normal-breast cancer stage comparison. The aim was (1) to see whether the method selects the same or different
genes when a t-test is used, and (2) to evaluate the classifier genes in terms of quantitative measures like
average fold change. The percentages of single classifier genes that show a differential expression change
with a p-value < 0.05 and an average fold ($E_{Tumor}/E_{Normal}$) > 2 are shown in Table 2.
Table 2: Average fold (tumor/normal) and p-values of single classifier genes obtained by t-test.

Comparison        Avg fold (T/N) > 2   p-value < 0.05
Normal-ADH        19.1 % (4/21)        71.4 % (15/21)
Normal-DCIS I*    61.5 % (16/26)       84.6 % (22/26)
Normal-DCIS II    66.7 % (18/27)       100 % (27/27)
Normal-DCIS III   55.3 % (21/38)       94.7 % (36/38)
Normal-IDC I      75.0 % (42/56)       80.4 % (45/56)
Normal-IDC II     73.7 % (28/38)       89.5 % (34/38)
Normal-IDC III    55.6 % (25/45)       91.1 % (41/45)

* DCIS Grade I is abbreviated as DCIS I.
The results show that a substantial portion of the genes (from 26.3% for IDC Grade II to 80.9%
for ADH) changed less than 2-fold in tumor; hence they would not be identified as differentially
expressed by a t-test with a 2-fold cut-off. However, these genes have been identified as possible classifier genes by the current algorithm.
The majority of the classifier genes have statistically significant p-values, which indicates that their
expression change between the two classes is significant.
3 Discussion
3.1 Comparison of the Method with Related Methods
Several statistical methods have been successfully applied to find “discriminatory” genes between
groups of samples in analyzing gene expression data. The method introduced in this paper is methodologically compared with two of the most frequently applied methods, t-test and linear discriminant
analysis.
Linear discriminant analysis (LDA) finds a linear subspace that maximizes class separability among
the feature vector projections, where each gene is represented in the space by a vector of its expression
values across the samples. A popular separability criterion is the ratio of between-class scatter to
within-class scatter. LDA seeks directions that are efficient for discrimination and assumes that the class means
convey most of the class information. Therefore, it cannot handle data sets that are only nonlinearly separable
or classes with the same mean. Additionally, with a limited number of samples and a fairly large
number of genes, estimates of between-class and within-class scatter can be quite unstable.
The main difference between LDA and our hybrid discriminant analysis is that LDA separates the classes
spatially, by representing the classes as vectors, whereas our algorithm separates
the classes by a probabilistic approach. It takes into account the distribution of expression values in
both classes and generates probability distributions from the fraction of samples of each class in defined intervals. The probability distributions are then used to determine the class of an unknown sample in the
case of single genes. For a gene pair, the distributions of the two genes are combined to construct a decision tree that assigns
the class of an unknown sample. The algorithm selects the genes that correctly assign the class of all
training and test samples over repeated training and test partitions; hence consistency across all samples is
an emphasized criterion of the method.
The other method that has been applied to identify differentially expressed genes is the t-test, which tests
the hypothesis that the means of two distributions of values are different. The main disadvantage of
this approach in gene expression analysis is that it produces large gene sets, which are not viable
as diagnostics. Moreover, in some cases, e.g. closely related diseases, changes in the expression
of single genes are very modest or not significant at all [8]. Our method is advantageous in such cases
since it takes into account the fraction of samples in the two classes no matter how similar or dissimilar the
two class means or variances are. It selects not only single genes but also gene pairs that together
partition the two classes even if the individual genes are not perfect classifiers alone.
The method is designed to select single genes and pairs to classify two groups; however, we also
considered extending it to identify triplets or larger groups of genes. The downsides are that (1) the
execution time increases substantially, since the search space, i.e. the number of triplets, is much bigger
than in the case of pairs, and (2) a high number of triplets is identified for each classification,
which makes them hard to evaluate, rank and select for further testing on another data set.
The testing methodology used here differs from the standard jackknife technique, which constructs
the training set by leaving out a normal and a tumor sample, and then tests the genes on that pair.
The jackknife has the advantage of being unbiased, but it is computationally much more demanding
than the procedure we have used. We are currently investigating the difference between the two
methods.
3.2 Marker Genes for Breast Cancer
In particular, we identified single genes and gene pairs that partition normal breast samples from
different breast tumor stages (Table 1). The overlap between the single gene classifiers (0%-10.34%)
and between the gene pairs (0.11%-3.28%) is low, showing that the majority of these classifiers are specific
to a certain tumor stage.
Some of these genes are previously characterized cancer-related genes, such as Angiopoietin-like 4,
which is known to be important in sustained angiogenesis; Matrix metalloproteinase 7, which
was found to be up-regulated in colorectal carcinomas [7]; and Glutamine synthase, which is also
up-regulated in tumor and important in tumor progression [1].
Grade specific genes also agree with previous findings. As an example, BIRC5 (survivin) gene,
which is known to be overexpressed in common human cancers and was found to be correlated with
Grade III tumors [6], was also identified only in Grade III tumors in this study.
In summary, we were able to distinguish normal from different stages of breast tumor using no
more than two genes in each instance. Each of these single genes and gene pairs is a possible candidate
diagnostic for a specific type of breast tumor. The total sum of all such pairs includes a
large number of genes (Table 1), which provides an entrée into the search for correlated and/or co-regulated
genes. Future work will focus on the identification of biological processes that are enriched
with subsets of these genes and on further determining the regulatory mechanisms controlling these
genes.
References
[1] Dang, C. V. and Semenza, G. L., Oncogenic alterations of metabolism, TIBS Reviews, 24(2):68–72, 1999.
[2] Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D., Cluster analysis and display of
genome-wide expression patterns, Proc. Natl. Acad. Sci. USA, 95(25):14863–14868, 1998.
[3] Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M., and Haussler, D.,
Support vector machine classification and validation of cancer tissue samples using microarray
expression data, Bioinformatics, 16(10):906–914, 2000.
[4] Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller,
H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring,
Science, 286(5439):531–537, 1999.
[5] Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O. P., Wilfond, B., Borg, A., and Trent, J., Gene-expression
profiles in hereditary breast cancer, N. Engl. J. Med., 344(8):539–548, 2001.
[6] Ma, X. J., Salunga, R., Tuggle, J. T., Gaudet, J., Enright, E., McQuary, P., Payette, T., Pistone,
M., Stecker, K., Zhang, B. M., Zhou, Y. X., Varnholt, H., Smith, B., Gadd, M., Chatfield, E.,
Kessler, J., Baer, T. M., Erlander, M. G., and Sgroi, D. C., Gene expression profiles of human
breast cancer progression, Proc. Natl. Acad. Sci. USA, 100(10):5974–5979, 2003.
[7] Masaki, T., Matsuoka, H., Sugiyama, M., Abe, N., Goto, A., Sakamoto, A., and Atomi, T.,
Matrilysin (MMP-7) as a significant determinant of malignant potential of early invasive colorectal
carcinomas, Br. J. Cancer, 84(10):1317–1321, 2001.
[8] Mootha, V. K., Lindgren, C. M., Eriksson, K. F., Subramanian, A., Sihag, S., Lehar, J.,
Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., Houstis, N., Daly, M. J., Patterson, N., Mesirov, J. P., Golub, T. R., Tamayo, P., Spiegelman, B., Lander, E. S., Hirschhorn,
J. N., Altschuler, D., and Groop, L. C., PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes, Nat. Genet., 34(3):267–273, 2003.
[9] Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareevan, S., Dmitrovsky, E., Lander, E. S., and
Golub, T. R., Interpreting patterns of gene expression with self-organizing maps: methods and
application to hematopoietic differentiation, Proc. Natl. Acad. Sci. USA, 96(6):2907–2912, 1999.
[10] van’t Veer L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse,
H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M.,
Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H., Gene expression profiling predicts
clinical outcome of breast cancer, Nature, 415(6871):530–536, 2002.
Supplementary Figure
[Figure 5 diagram: 32 normal samples (N) and 8 ADH samples (T) divided into a training set and a test set for the 1st, 2nd and 3rd partitionings; the 1st partitioning uses 16 N and 4 T in each set, and the 2nd and 3rd partitionings are assembled from subsets of 8 N and 2 T.]
Figure 5: A schematic overview of dividing the samples into training and test sets. In the first
partitioning, half of one class and half of the other class samples are selected randomly to train, and
the others remain to test. In the second partitioning, half of the training samples are chosen randomly
from the training set of the first partitioning and the other half from those that have not been used
in the training set (the test set of the first partitioning). In the third partitioning, all samples not
previously used in training are selected for training, and the remainder is chosen randomly from the
training set of the second partitioning.