Robust Tail-sum: Two-group cancer outlier differential gene detection
June Luo †
Department of Applied Economics and Statistics, Clemson University, Clemson, SC 29634 USA
Abstract
There have been discussions about detecting differentially expressed (DE) genes that are
over-expressed or down-expressed in some but not all samples in a disease group. The discussion of detecting DE genes has become increasingly useful in cancer studies, where oncogenes
are activated in only a minority of samples. I propose a new statistic called robust tail-sum
(RTS) to detect DE genes. In simulated examples, I compare RTS to the t-statistic, the cancer outlier profile analysis by Tomlins et al. (2005), the outlier-sum by Tibshirani and Hastie
(2007), and the outlier robust t-statistic by Wu (2007). I find RTS performs consistently as the
best or second best method when the number of oncogene outliers in a disease group varies. In
a real example, the new method does well in detecting over-expressed genes.
Keywords: Cancer; COPA; Gene expression analysis; Microarray; Robust.
1 Introduction
Several proposals have been made for detecting differential genes in two-class microarray studies,
such as Lyons-Weiler et al. (2004). One widely used approach is to compute a t-statistic $T_i$
for each gene and call the gene DE if $|T_i|$ exceeds a certain threshold. This t-statistic is
defined similarly to the univariate ranking (UR) of Golub et al. (1999). Recently, Tomlins
et al. (2005) introduced a method called “cancer outlier profile analysis” (COPA) for detecting
“oncogene outliers”: genes that show systematically increased expression in the disease group,
but only in a small number of cancer samples. Their study shows that COPA can be more powerful
than the traditional t-statistic in these cases.
More progress has been made in detecting DE genes with more robust statistics. Tibshirani
and Hastie (2007) introduced outlier-sum (OS) and showed that OS had smaller false discovery
rates (FDR) than COPA on skin data taken from Rieger et al. (2004). Later, Wu (2007) proposed
the outlier robust t-statistic (ORT) which has better overall performance than COPA and OS in
terms of FDR.
The above ranking criteria are efficient in certain situations. Since the methods perform well
in some cases but fail in others, I am inspired to study this problem and look for alternative
ways of detecting oncogenes. I introduce “robust tail-sum” (RTS) as a new ranking statistic.
RTS reduces the standard error of the ranking statistic by averaging the median absolute
deviations of the two groups in its denominator. Through simulation studies, I found that RTS
outperformed t, OS, COPA and ORT under most circumstances and was never significantly worse in
any situation. Therefore, I believe it is a valuable contribution to the field of cancer outlier
expression detection.

† Corresponding author. Tel.: (864)656-5768. E-mail: [email protected]
2 The robust tail-sum
I consider a two-class microarray study for detecting DE genes. Let $x_{ij}$ be the expression
values for the control group, for gene $i = 1, 2, \ldots, p$ and sample $j = 1, 2, \ldots, n_1$,
and let $y_{ij}$ be the expression values for the disease group, for gene $i = 1, 2, \ldots, p$
and sample $j = 1, 2, \ldots, n_2$. I start from a one-sided test and assume that all DE genes are
over-expressed in the disease group.
The standard t-statistic for a 2-sample test is
\[
T_i = \frac{\bar{y}_i - \bar{x}_i}{s_i}. \tag{2.1}
\]
Here $\bar{x}_i$ and $\bar{y}_i$ are the means of samples for gene $i$ in the control group and
the disease group, respectively. The denominator $s_i$ is the pooled standard deviation of gene $i$.
Instead of using the pooled standard deviation to standardize the data, univariate ranking (UR)
by Golub et al. (1999) used the sum of the two within-group standard deviations:
\[
T_i^{*} = \frac{\bar{y}_i - \bar{x}_i}{\sigma_i^{c} + \sigma_i^{d}}, \tag{2.2}
\]
where $\sigma_i^{c}$ and $\sigma_i^{d}$ indicate the standard deviations for samples of gene $i$
for the control group and the disease group, respectively.
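The two mean-based statistics in (2.1) and (2.2) can be sketched in a few lines of NumPy. This is only an illustration under assumptions the text does not fix: expression matrices are arranged genes by samples, the pooled standard deviation uses the usual $n_1 + n_2 - 2$ degrees of freedom, and the within-group standard deviations use the sample (ddof = 1) estimator. The function names are hypothetical.

```python
import numpy as np

def t_statistic(x, y):
    """Equation (2.1): difference in means over the pooled SD, per gene.

    x: (p, n1) control expressions; y: (p, n2) disease expressions.
    Assumption: pooled variance with n1 + n2 - 2 degrees of freedom.
    """
    n1, n2 = x.shape[1], y.shape[1]
    pooled_var = ((n1 - 1) * x.var(axis=1, ddof=1)
                  + (n2 - 1) * y.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    return (y.mean(axis=1) - x.mean(axis=1)) / np.sqrt(pooled_var)

def univariate_ranking(x, y):
    """Equation (2.2): difference in means over the sum of the two SDs."""
    # Assumption: sample standard deviations (ddof=1) within each group.
    return (y.mean(axis=1) - x.mean(axis=1)) / (
        x.std(axis=1, ddof=1) + y.std(axis=1, ddof=1))
```

Both functions return one score per gene, so genes can be ranked by sorting the returned vector.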
Both the t-statistic and UR are powerful when the alternative distribution is such that
$\{y_{ij}, j = 1, 2, \ldots, n_2\}$ all come from a distribution with a higher mean than
$\{x_{ij}, j = 1, 2, \ldots, n_1\}$. Since it is already known that, for an oncogene, only a small
portion of the cancer samples is over-expressed, both the t-statistic and UR could be inefficient
for detecting such genes under our assumption. To improve the detection power, Tomlins et al.
(2005) defined the COPA statistic, which is the $r$th percentile of standardized samples in the
disease group. The formula is
\[
C_i = \frac{q_r(y_{ij}, 1 \le j \le n_2) - \mathrm{med}_i}{\mathrm{mad}_i}, \tag{2.3}
\]
where $q_r(\cdot)$ is the $r$th percentile of the data, $\mathrm{med}_i$ is the median of all
values for gene $i$, and $\mathrm{mad}_i$ is the median absolute deviation of all expressions for
gene $i$. The choice of $r$ is subjective.
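As a minimal sketch of (2.3), the COPA statistic could be computed as below. Two details are assumptions rather than part of the text: the median absolute deviation is left unscaled (a constant scale factor would not change the gene ranking), and percentiles use NumPy's default linear interpolation. The names `mad` and `copa` are illustrative.

```python
import numpy as np

def mad(values, axis=None):
    """Unscaled median absolute deviation from the median (assumption)."""
    med = np.median(values, axis=axis, keepdims=True)
    return np.median(np.abs(values - med), axis=axis)

def copa(x, y, r=90):
    """Equation (2.3) for every gene.

    x: (p, n1) control expressions; y: (p, n2) disease expressions.
    r is given in percent, so r=90 is the 90th percentile.
    """
    pooled = np.concatenate([x, y], axis=1)   # all values for each gene
    med = np.median(pooled, axis=1)           # med_i
    return (np.percentile(y, r, axis=1) - med) / mad(pooled, axis=1)
```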
The COPA statistic $C_i$ only considers a single value of $\{y_{ij}, j \le n_2\}$. In order to
take more expression values into consideration, Tibshirani and Hastie (2007) introduced the
outlier-sum:
\[
OS_i = \sum_{1 \le j \le n_2} \frac{y_{ij} - \mathrm{med}_i}{\mathrm{mad}_i}
       \cdot I[\,y_{ij} > q_{75}(i) + \mathrm{IQR}(i)\,], \tag{2.4}
\]
where $q_r(i)$ and $\mathrm{IQR}(i)$ are the $r$th percentile and interquartile range of all
expressions for gene $i$. The outlier-sum statistic defines outliers in the disease group based on
the pooled sample for gene $i$, but it makes more sense to define outliers based on the control
group. Accordingly, Wu (2007) defined the outlier robust t-statistic:
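\[
ORT_i = \sum_{j \in O_i}
        \frac{y_{ij} - \mathrm{med}_i^{c}}
             {\mathrm{median}\{\,|x_{ij} - \mathrm{med}_i^{c}|_{j \le n_1},\;
                                 |y_{ij} - \mathrm{med}_i^{d}|_{j \le n_2}\}}, \tag{2.5}
\]
where $\mathrm{med}_i^{c} = \mathrm{median}\{x_{ij}, 1 \le j \le n_1\}$,
$\mathrm{med}_i^{d} = \mathrm{median}\{y_{ij}, 1 \le j \le n_2\}$, and
\[
O_i = \{\, j \le n_2 : y_{ij} > q_{75}(x_{ij}, 1 \le j \le n_1)
        + \mathrm{IQR}(x_{ij}, 1 \le j \le n_1) \,\}.
\]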
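The difference between (2.4) and (2.5) is easiest to see side by side: OS takes its outlier threshold and scale from the pooled sample, whereas ORT takes the threshold from the control group only. The sketch below handles one gene at a time and assumes an unscaled median absolute deviation and NumPy's default percentile interpolation; neither choice is specified in the text, and the function names are illustrative.

```python
import numpy as np

def _mad(v):
    """Unscaled median absolute deviation (assumption)."""
    return np.median(np.abs(v - np.median(v)))

def outlier_sum(x, y):
    """Equation (2.4) for one gene; x and y are 1-D arrays."""
    pooled = np.concatenate([x, y])
    q25, q75 = np.percentile(pooled, [25, 75])
    is_outlier = y > q75 + (q75 - q25)        # threshold from pooled data
    return np.sum(y[is_outlier] - np.median(pooled)) / _mad(pooled)

def ort(x, y):
    """Equation (2.5) for one gene; x and y are 1-D arrays."""
    q25, q75 = np.percentile(x, [25, 75])
    is_outlier = y > q75 + (q75 - q25)        # threshold from controls only
    med_c, med_d = np.median(x), np.median(y)
    # Median of absolute deviations, each group centred at its own median.
    scale = np.median(np.concatenate([np.abs(x - med_c), np.abs(y - med_d)]))
    return np.sum(y[is_outlier] - med_c) / scale
```

When no disease sample exceeds the threshold, both functions return 0, which is why many null genes tie at zero under these statistics.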
To reduce the standard error of ORT, I replace its denominator, a single median of absolute
deviations, with the average of the two within-group median absolute deviations (for ranking
purposes this is equivalent to their sum, which differs only by a constant factor). Unlike COPA,
which uses a single value, or OS, which does not allow any flexibility in identifying outliers,
the new statistic lets the user choose the percentile that defines an outlier. The new statistic
is called robust tail-sum (RTS):
\[
RTS_i = \sum_{j \le n_2}
        \frac{y_{ij} - \mathrm{med}_i^{c}}
             {\mathrm{mad}(x_{ij}, j \le n_1) + \mathrm{mad}(y_{ij}, j \le n_2)}
        \cdot I[\,y_{ij} > q_r(x_{ij}, j \le n_1)\,]. \tag{2.6}
\]
The choice of the percentile parameter $r$ is left to the user. I will use $r = 0.95$ in the
simulations. In real applications, one might also desire to detect down-expressed genes. In this
case, I define
\[
RTS_i^{*} = \sum_{j \le n_2}
            \frac{y_{ij} - \mathrm{med}_i^{c}}
                 {\mathrm{mad}(x_{ij}, j \le n_1) + \mathrm{mad}(y_{ij}, j \le n_2)}
            \cdot I[\,y_{ij} < q_r(x_{ij}, j \le n_1)\,], \tag{2.7}
\]
where r is a small number such as 0.05. Since the robust tail-sum statistic is not symmetric in
the groups, the procedure can be applied with groups 1 and 2 interchanged if necessary.
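Equations (2.6) and (2.7) translate almost directly into vectorized code. The following is a sketch rather than a reference implementation: it assumes genes-by-samples matrices, an unscaled median absolute deviation, and NumPy's default quantile interpolation, none of which the text specifies. The name `robust_tail_sum` and the `lower` flag are hypothetical.

```python
import numpy as np

def _mad(v, axis):
    """Unscaled median absolute deviation along an axis (assumption)."""
    med = np.median(v, axis=axis, keepdims=True)
    return np.median(np.abs(v - med), axis=axis)

def robust_tail_sum(x, y, r=0.95, lower=False):
    """RTS of (2.6), or the down-expression variant (2.7) if lower=True.

    x: (p, n1) control expressions; y: (p, n2) disease expressions.
    r is a fraction in [0, 1], e.g. 0.95 for (2.6) or 0.05 for (2.7).
    """
    med_c = np.median(x, axis=1, keepdims=True)         # control median
    cutoff = np.quantile(x, r, axis=1, keepdims=True)   # control r-th quantile
    in_tail = (y < cutoff) if lower else (y > cutoff)
    scale = _mad(x, axis=1) + _mad(y, axis=1)           # sum of the two MADs
    return np.sum((y - med_c) * in_tail, axis=1) / scale
```

Calling `robust_tail_sum(x, y, r=0.05, lower=True)` gives the down-expression scores, where strongly negative values indicate down-expressed genes.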
3 Simulation study and comparison
I carried out a number of simulations to compare the performance of the t-statistic, COPA, OS,
ORT and RTS. Following Tibshirani and Hastie (2007), $r = 0.90$ is used for COPA. I generated the
expression data from the standard normal distribution with $p = 1000$ genes and
$n = n_1 = n_2 = 25$ samples. For various values of $m$, the number of differentially expressed
genes, I added a constant $\mu$, the over-expression magnitude, to those $m$ genes for $k$ cancer
samples in the disease group. With only the first gene DE ($m = 1$), $\mu = 2$ and various $k$
values, I repeated the process 50 times and computed the P-value in each run, defined as the
proportion of genes with a test statistic greater than that of gene 1. The mean, median and
standard deviation of the P-values are shown in Table 1. For almost all $k$ values, robust
tail-sum showed greater ability to detect over-expressed genes than the other methods, as
indicated by the smallest mean, median and standard deviation of the P-values among all five
methods.
Table 1. Results of simulation study: mean, median, and standard deviation
of P-values for gene 1, m = 1, over 50 simulations

Method  k = 25                  k = 15                  k = 8                   k = 4
        Mean    Median  SD      Mean    Median  SD      Mean    Median  SD      Mean    Median  SD
t       0       0       0       0.004   0       0.01    0.071   0.029   0.102   0.264   0.172   0.263
COPA    0.275   0.250   0.176   0.152   0.081   0.156   0.068   0.019   0.132   0.176   0.101   0.213
OS      0.368   0.447   0.145   0.243   0.243   0.188   0.104   0.031   0.141   0.149   0.082   0.162
ORT     0.008   0       0.03    0.012   0.003   0.028   0.038   0.015   0.055   0.142   0.069   0.157
RTS     0.003   0       0.021   0.001   0       0.003   0.030   0.009   0.079   0.129   0.067   0.148
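The single-gene experiment described above (standard normal data, $p = 1000$, $n_1 = n_2 = 25$, a shift of $\mu$ added to $k$ disease samples of gene 1, P-value equal to the proportion of genes scoring above gene 1) can be sketched as follows. The RTS helper, the seed, and the use of the sample standard deviation are illustrative assumptions; the sketch is not expected to reproduce the exact figures in Table 1.

```python
import numpy as np

def rts(x, y, r=0.95):
    """Compact RTS as in (2.6), with unscaled MADs (assumption)."""
    med_c = np.median(x, axis=1, keepdims=True)
    med_d = np.median(y, axis=1, keepdims=True)
    scale = (np.median(np.abs(x - med_c), axis=1)
             + np.median(np.abs(y - med_d), axis=1))
    in_tail = y > np.quantile(x, r, axis=1, keepdims=True)
    return np.sum((y - med_c) * in_tail, axis=1) / scale

def gene1_p_value(stat, rng, k, mu=2.0, p=1000, n=25):
    """One simulated data set; returns the P-value of gene 1 (row 0)."""
    x = rng.standard_normal((p, n))
    y = rng.standard_normal((p, n))
    y[0, :k] += mu                       # over-express gene 1 in k samples
    scores = stat(x, y)
    return np.mean(scores > scores[0])   # share of genes ranked above gene 1

def simulate(stat, k, reps=50, seed=0):
    """Mean, median and sample SD of the gene-1 P-value over reps runs."""
    rng = np.random.default_rng(seed)
    pv = np.array([gene1_p_value(stat, rng, k) for _ in range(reps)])
    return pv.mean(), np.median(pv), pv.std(ddof=1)
```

Any statistic with the same `(x, y) -> scores` signature can be passed as `stat`, so the same loop can compare several methods on identical settings.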
Receiver operating characteristic (ROC) curves are also estimated to evaluate the detection
power of the various statistics. With $\mu = 2$, $m = 100$ and $k = 25, 15, 10, 6, 3, 1$, I
estimated ROC curves by choosing different thresholds for gene calls and repeated the process 50
times. Each point on a ROC curve is the average of the 50 true/false-positive rates obtained at
the same threshold. Figure 1 shows the estimated true/false-positive rates based on the 50
simulations. When $k = 25$ or $15$, both RTS and the t-statistic perform the best and OS performs
the worst. RTS continues to perform the best, slightly better than the t-statistic, when
$k = 10$. For a smaller $k$, such as $k = 6$, the t-statistic starts to be inefficient while RTS
shows strong detection power. For an even smaller $k = 3$, RTS still performs the best. When $k$
decreases to 1, RTS and OS are better than the t-statistic. Figure 1 demonstrates that RTS
performs at least as well as the other methods in most of these situations.
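The ROC estimation described above, averaging true- and false-positive rates over repeated simulations at a fixed set of thresholds, can be sketched as follows. The threshold grid, the seed, and the function names are assumptions made for illustration; `stat` stands for any of the five statistics, supplied as a function that maps `(x, y)` to one score per gene.

```python
import numpy as np

def roc_points(scores, is_de, thresholds):
    """True/false-positive rates when genes with score > t are called DE."""
    tpr = np.array([(scores[is_de] > t).mean() for t in thresholds])
    fpr = np.array([(scores[~is_de] > t).mean() for t in thresholds])
    return fpr, tpr

def average_roc(stat, thresholds, k, m=100, mu=2.0, p=1000, n=25,
                reps=50, seed=0):
    """Average the ROC points over reps simulated data sets."""
    rng = np.random.default_rng(seed)
    is_de = np.arange(p) < m                 # first m genes are DE
    fpr = np.zeros(len(thresholds))
    tpr = np.zeros(len(thresholds))
    for _ in range(reps):
        x = rng.standard_normal((p, n))
        y = rng.standard_normal((p, n))
        y[:m, :k] += mu                      # shift k samples of each DE gene
        f, t = roc_points(stat(x, y), is_de, thresholds)
        fpr += f / reps
        tpr += t / reps
    return fpr, tpr
```

Because the statistics live on different scales, each method needs its own threshold grid; plotting `tpr` against `fpr` then gives curves comparable across methods.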
As $k$ decreases, OS gains power to detect the over-expressed genes, whereas the t-statistic
keeps losing power. In contrast to the dramatic change in the performance of OS and the
t-statistic, ORT performs quite consistently, but it is never the best for any $k$ value. Robust
tail-sum has shown stable and strong power to detect DE genes over the entire range of $k$
values: RTS performed the best for almost all $k$ values and never performed significantly worse
than any other method. It appears that the RTS formula is better at detecting DE genes than OS
and ORT because it efficiently reduces the standard error.
I also set a smaller over-expression magnitude, $\mu = 1$, with $m = 200$ and
$k = 25, 15, 10, 6, 3, 1$, and estimated the ROC curves to investigate whether a small systematic
increase in DE genes affects the performance of the different methods. The result is in the
supplementary material. It shows a pattern similar to Figure 1. RTS performed well under these
conditions. When $k = 1$, all methods seem to perform at random, with diagonal ROC curves.
Through the simulations, I notice that RTS is resistant to the systematic increase while OS is
greatly influenced.
These simulations suggest that robust tail-sum can provide a useful alternative to the
t-statistic, COPA, OS and ORT over a wide range of conditions. I illustrate the improvements of
RTS on a public leukemia microarray data set in Section 4.
[Figure 1: six ROC panels titled n=25, mu=2, with k=25, k=15, k=10, k=6, k=3 and k=1. Horizontal axis: False positive; vertical axis: True positive; both run from 0.0 to 1.0. Curves: OS, RTS, ORT, t, COPA.]
Figure 1: ROC curves estimated based on simulations. Various k values are chosen. The over-expression
magnitude µ = 2 is used.
4 Application to leukemia data via robust tail sums method
The leukemia data taken from Golub et al. (1999) consist of 38 samples of two types of acute
leukemia: 11 samples with acute myelogenous leukemia (AML) and 27 samples with acute lymphocytic
leukemia (ALL). Each sample has 7129 gene values. In this analysis, I used AML as the normal
group and ALL as the disease group. I applied all five methods and obtained a list of the top 50
genes from each method. Golub et al. identified 25 highly expressed genes in the ALL group.
Table 2 shows the number of genes shared between Golub's list and each of the five top-50 lists.
Both ORT and RTS outperform the other three methods.
Table 2. Number of overlapped genes

                t-statistic   COPA   OS   ORT   RTS
Golub et al.    1             0      0    8     6
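The overlap counts in Table 2 come from intersecting each method's top 50 genes with a reference list. A small helper for that step might look like the sketch below; the function name, the gene identifiers, and the reference list in the usage are hypothetical placeholders, not the actual Golub et al. genes.

```python
import numpy as np

def top_k_overlap(scores, gene_ids, reference_ids, k=50):
    """Count how many of the k highest-scoring genes are in reference_ids."""
    order = np.argsort(scores)[::-1][:k]     # indices of the k largest scores
    top = {gene_ids[i] for i in order}
    return len(top & set(reference_ids))

# Toy usage with made-up identifiers (not real data).
toy_scores = np.array([0.1, 5.0, 3.0, 4.0])
toy_ids = ["a", "b", "c", "d"]
toy_reference = ["b", "c", "z"]
overlap = top_k_overlap(toy_scores, toy_ids, toy_reference, k=2)
```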
In the literature, “terminal deoxynucleotidyl transferase” is known as an excellent marker
of ALL (Tibshirani et al., 2002; Zhu and Hastie, 2004). Both RTS and ORT rank this gene among
their top 35 genes. However, the t-statistic ranks it 189th, which means the t-statistic is very
unlikely to detect it as a DE gene.
Acknowledgments
I’d like to thank Dr. Richard Dubsky for his remarks. I’d also like to thank all the reviewers.
References
Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M.,
Downing, J. and Caligiuri, M. (1999). Molecular classification of cancer: class discovery and
class prediction by gene expression monitoring. Science 286, 531–536.
Lyons-Weiler, J., Patel, S., Becich, M. and Godfrey, T. (2004). Tests for finding complex
patterns of differential expression in cancers: towards individualized medicine. BMC
Bioinformatics 5, 110–119.
Rieger, K., Hong, W., Tusher, V., Tang, J., Tibshirani, R. and Chu, G. (2004). Toxicity from
radiation therapy associated with abnormal transcriptional responses to DNA damage. Proceedings
of the National Academy of Sciences of the United States of America 101, 6634–6640.
Tomlins, S., Rhodes, D., Perner, S., Dhanasekaran, S., Mehra, R., Sun, X., Varambally, S.,
Cao, X., Tchinda, J., Kuefer, R. et al. (2005). Recurrent fusion of TMPRSS2 and ETS
transcription factor genes in prostate cancer. Science 310, 644–648.
Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer
types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences
of the United States of America 99, 6567–6572.
Tibshirani, R. and Hastie, T. (2007). Outlier sums for differential gene expression analysis.
Biostatistics 8(1), 2–8.
Wu, B. (2007). Cancer outlier differential gene expression detection. Biostatistics 8(3),
566–575.
Zhu, J. and Hastie, T. (2004). Classification of gene microarray by penalized logistic
regression. Biostatistics 5(3), 427–443.