Robust Tail-sum: Two-group cancer outlier differential gene detection

June Luo †
Department of Applied Economics and Statistics, Clemson University, Clemson, SC 29634 USA

Abstract

There have been discussions about detecting differentially expressed (DE) genes that are over-expressed or down-expressed in some but not all samples in a disease group. Detecting such genes has become increasingly useful in cancer studies, where oncogenes are activated in only a minority of samples. I propose a new statistic called robust tail-sum (RTS) to detect DE genes. In simulated examples, I compare RTS to the t-statistic, the cancer outlier profile analysis by Tomlins et al. (2005), the outlier-sum by Tibshirani and Hastie (2007), and the outlier robust t-statistic by Wu (2007). RTS is consistently the best or second-best method as the number of oncogene outliers in the disease group varies. In a real example, the new method does well in detecting over-expressed genes.

Keywords: Cancer; COPA; Gene expression analysis; Microarray; Robust.

1 Introduction

Several proposals have been made for detecting differential genes in two-class microarray studies, such as Lyons-Weiler et al. (2004). One widely used approach is to compute a t-statistic $T_i$ for each gene and call the gene DE if $|T_i|$ exceeds a certain threshold. This t-statistic is defined similarly to the univariate ranking (UR) of Golub et al. (1999). Recently, Tomlins et al. (2005) introduced a method called “cancer outlier profile analysis” (COPA) for detecting “oncogene outliers.” Those genes show systematically increased expression in the disease group, but only in a small number of cancer samples. Their study shows that COPA can be more powerful than the traditional t-statistic in these cases. More progress has since been made in detecting DE genes with more robust statistics.
Tibshirani and Hastie (2007) introduced the outlier-sum (OS) and showed that OS had smaller false discovery rates (FDR) than COPA on skin data taken from Rieger et al. (2004). Later, Wu (2007) proposed the outlier robust t-statistic (ORT), which has better overall performance than COPA and OS in terms of FDR. Each of these ranking criteria is efficient in certain situations but fails in others, which motivates me to study this problem and look for alternative ways of detecting oncogenes. I introduce the “robust tail-sum” (RTS) as a new ranking statistic. RTS reduces the standard error of the ranking statistic by standardizing with the average of the two within-group median absolute deviations. Through simulation studies, I found that RTS outperformed t, OS, COPA and ORT under most circumstances and was never significantly worse in any situation. Therefore, I believe it is a valuable contribution to the field of cancer outlier expression detection.

† Corresponding author: Tel.: (864)656-5768. E-mail: [email protected]

2 The robust tail-sum

I consider a two-class microarray study for detecting DE genes. Let $x_{ij}$ be the expression values of the control samples for gene $i = 1, 2, \ldots, p$ and sample $j = 1, 2, \ldots, n_1$, and let $y_{ij}$ be the expression values of the disease samples for gene $i = 1, 2, \ldots, p$ and sample $j = 1, 2, \ldots, n_2$. I start from a one-sided test and assume that all DE genes are over-expressed in the disease group. The standard t-statistic for a two-sample test is

$$T_i = \frac{\bar{y}_i - \bar{x}_i}{s_i}. \tag{2.1}$$

Here $\bar{x}_i$ and $\bar{y}_i$ are the sample means for gene $i$ in the control group and the disease group, respectively. The denominator $s_i$ is the pooled standard deviation of gene $i$. Instead of using the pooled standard deviation to standardize the data, univariate ranking (UR) by Golub et al.
(1999) used the sum of the two within-group standard deviations:

$$T_i^{*} = \frac{\bar{y}_i - \bar{x}_i}{\sigma_i^{c} + \sigma_i^{d}}, \tag{2.2}$$

where $\sigma_i^{c}$ and $\sigma_i^{d}$ are the standard deviations of the samples of gene $i$ in the control group and the disease group, respectively. Both the t-statistic and UR are powerful when the alternative distribution is such that $\{y_{ij}, j = 1, 2, \ldots, n_2\}$ all come from a distribution with a higher mean than $\{x_{ij}, j = 1, 2, \ldots, n_1\}$. Since oncogenes are known to be over-expressed in only a small portion of cancer samples, both the t-statistic and UR can be inefficient for detecting such genes under this assumption.

To improve the detection power, Tomlins et al. (2005) defined the COPA statistic, which is the $r$th percentile of the standardized samples in the disease group:

$$C_i = \frac{q_r(y_{ij}, 1 \le j \le n_2) - \mathrm{med}_i}{\mathrm{mad}_i}, \tag{2.3}$$

where $q_r(\cdot)$ is the $r$th percentile of the data, $\mathrm{med}_i$ is the median of all values for gene $i$, and $\mathrm{mad}_i$ is the median absolute deviation of all expressions for gene $i$. The choice of $r$ is subjective.

The COPA statistic $C_i$ only considers a single value of $\{y_{ij}, j \le n_2\}$. To take more expression values into account, Tibshirani and Hastie (2007) introduced the outlier-sum:

$$OS_i = \sum_{1 \le j \le n_2} \frac{y_{ij} - \mathrm{med}_i}{\mathrm{mad}_i} \cdot I[y_{ij} > q_{75}(i) + \mathrm{IQR}(i)], \tag{2.4}$$

where $q_r(i)$ and $\mathrm{IQR}(i)$ are the $r$th percentile and the interquartile range of all expressions for gene $i$.

The outlier-sum statistic defines outliers in the disease group based on the pooled sample for gene $i$, but it makes more sense to define outliers based on the control group. Accordingly, Wu (2007) defined the outlier robust t-statistic:

$$ORT_i = \frac{\sum_{j \in O_i} (y_{ij} - \mathrm{med}_i^{c})}{\mathrm{median}\{|x_{ij} - \mathrm{med}_i^{c}|_{j \le n_1}, |y_{ij} - \mathrm{med}_i^{d}|_{j \le n_2}\}}, \tag{2.5}$$

where $\mathrm{med}_i^{c} = \mathrm{median}\{x_{ij}, 1 \le j \le n_1\}$, $\mathrm{med}_i^{d} = \mathrm{median}\{y_{ij}, 1 \le j \le n_2\}$, and $O_i = \{j \le n_2 : y_{ij} > q_{75}(x_{ij}, 1 \le j \le n_1) + \mathrm{IQR}(x_{ij}, 1 \le j \le n_1)\}$.
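The four existing statistics in (2.1) and (2.3) to (2.5) can be sketched for a single gene as follows. This is a minimal illustration, not the implementation used in the paper: the function names, the linear-interpolation percentile rule, the unscaled median absolute deviation, and the toy data are all assumptions made for the example.

```python
import math
import statistics as st


def percentile(values, r):
    """r-th quantile (0 <= r <= 1); linear interpolation is an assumed convention."""
    s = sorted(values)
    pos = r * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def mad(values):
    """Unscaled median absolute deviation around the median (assumed unscaled)."""
    c = st.median(values)
    return st.median([abs(v - c) for v in values])


def t_stat(x, y):
    """Equation (2.1): mean difference divided by the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * st.variance(x) + (n2 - 1) * st.variance(y)) / (n1 + n2 - 2)
    return (st.mean(y) - st.mean(x)) / math.sqrt(pooled_var)


def copa(x, y, r=0.90):
    """Equation (2.3): r-th disease percentile, centered and scaled on all samples."""
    pooled = list(x) + list(y)
    return (percentile(y, r) - st.median(pooled)) / mad(pooled)


def outlier_sum(x, y):
    """Equation (2.4): outliers are defined from the pooled sample."""
    pooled = list(x) + list(y)
    med, scale = st.median(pooled), mad(pooled)
    q25, q75 = percentile(pooled, 0.25), percentile(pooled, 0.75)
    cutoff = q75 + (q75 - q25)
    return sum((v - med) / scale for v in y if v > cutoff)


def ort(x, y):
    """Equation (2.5): outliers are defined from the control group only."""
    medc, medd = st.median(x), st.median(y)
    scale = st.median([abs(v - medc) for v in x] + [abs(v - medd) for v in y])
    q25, q75 = percentile(x, 0.25), percentile(x, 0.75)
    cutoff = q75 + (q75 - q25)
    return sum(v - medc for v in y if v > cutoff) / scale


# Toy data (hypothetical): only two of the eight disease samples are over-expressed.
control = [0.0, 0.1, -0.1, 0.2, -0.2, 0.3, -0.3, 0.05]
disease = [0.0, 0.1, -0.1, 0.2, -0.2, 0.3, 5.0, 6.0]
for name, stat in [("t", t_stat), ("COPA", copa), ("OS", outlier_sum), ("ORT", ort)]:
    print(name, round(stat(control, disease), 3))
```

The sketch does not guard against a zero scale (for example, constant expression values), which a practical implementation would need to handle.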
To reduce the standard error of the ORT statistic, I replace its denominator, a median of absolute deviations, with the average of the two within-group median absolute deviations. (The denominator of (2.6) is written as their sum, which is twice the average and leaves the gene ranking unchanged.) Unlike COPA, which uses a single value, or OS, which allows no flexibility in identifying outliers, the new statistic lets the user choose the outlier cutoff. The new statistic is called robust tail-sum (RTS):

$$RTS_i = \sum_{j \le n_2} \frac{y_{ij} - \mathrm{med}_i^{c}}{\mathrm{mad}(x_{ij}, j \le n_1) + \mathrm{mad}(y_{ij}, j \le n_2)} \cdot I[y_{ij} > q_r(x_{ij}, j \le n_1)]. \tag{2.6}$$

The percentile parameter $r$ is chosen by the user. I will use $r = 0.95$ in the simulations. In real applications, one might also want to detect down-expressed genes. In this case, I define

$$RTS_i^{*} = \sum_{j \le n_2} \frac{y_{ij} - \mathrm{med}_i^{c}}{\mathrm{mad}(x_{ij}, j \le n_1) + \mathrm{mad}(y_{ij}, j \le n_2)} \cdot I[y_{ij} < q_r(x_{ij}, j \le n_1)], \tag{2.7}$$

where $r$ is a small number such as 0.05. Since the robust tail-sum statistic is not symmetric in the groups, the procedure can be applied with groups 1 and 2 interchanged if necessary.

3 Simulation study and comparison

I carried out a number of simulations to compare the performance of the t-statistic, COPA, OS, ORT and RTS. Following Tibshirani and Hastie (2007), $r = 0.90$ is used for COPA. I generated the expression data from the standard normal distribution with $p = 1000$ genes and $n = n_1 = n_2 = 25$ samples. For various values of $m$, the number of differentially expressed genes, I added a constant $\mu$, the over-expression magnitude, to those $m$ genes for $k$ cancer samples in the disease group. For the case where only the first gene is DE ($m = 1$), with $\mu = 2$ and various $k$ values, I repeated the process 50 times and computed the P-value each time: the proportion of genes with a test statistic greater than that of gene 1. The mean, median and standard deviation of the P-values are shown in Table 1. For almost all $k$ values, robust tail-sum showed a greater ability to detect over-expressed genes than the other methods, as indicated by the smallest mean, median and standard deviation of the P-values among the five methods.
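A minimal sketch of the RTS statistics (2.6) and (2.7) and of the single-gene P-value simulation described above is given here. It rests on several assumptions that are not stated in the text: the percentile uses linear interpolation, the median absolute deviation is unscaled, the P-value divides by $p$, and the names `rts`, `rts_down` and `simulate_p_value` and the seed are hypothetical.

```python
import math
import random
import statistics as st


def percentile(values, r):
    """r-th quantile (0 <= r <= 1); linear interpolation is an assumed convention."""
    s = sorted(values)
    pos = r * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def mad(values):
    """Unscaled median absolute deviation around the median (assumed unscaled)."""
    c = st.median(values)
    return st.median([abs(v - c) for v in values])


def rts(x, y, r=0.95):
    """Equation (2.6): sum of disease values above the r-th control percentile."""
    medc, cutoff = st.median(x), percentile(x, r)
    return sum(v - medc for v in y if v > cutoff) / (mad(x) + mad(y))


def rts_down(x, y, r=0.05):
    """Equation (2.7): same sum over disease values below a low control percentile."""
    medc, cutoff = st.median(x), percentile(x, r)
    return sum(v - medc for v in y if v < cutoff) / (mad(x) + mad(y))


def simulate_p_value(k, mu=2.0, p=1000, n=25, rng=None):
    """One replicate with m = 1: share of genes whose RTS exceeds that of gene 1."""
    if rng is None:
        rng = random.Random()
    scores = []
    for gene in range(p):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if gene == 0:  # only the first gene is DE: shift k of its disease samples
            y = [v + mu if j < k else v for j, v in enumerate(y)]
        scores.append(rts(x, y))
    return sum(s > scores[0] for s in scores[1:]) / p


if __name__ == "__main__":
    rng = random.Random(2024)  # arbitrary seed, for reproducibility only
    pvals = [simulate_p_value(k=8, rng=rng) for _ in range(50)]
    print("mean", st.mean(pvals), "median", st.median(pvals), "sd", st.stdev(pvals))
```

Because the random numbers differ from those used for Table 1, the summary values printed by this sketch should not be expected to reproduce the table entries exactly.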
Table 1. Results of simulation study: mean, median, and standard deviation (SD) of P-values for gene 1, m = 1, over 50 simulations

          k = 25                  k = 15                  k = 8                   k = 4
Method    Mean    Median  SD      Mean    Median  SD      Mean    Median  SD      Mean    Median  SD
t         0       0       0       0.004   0       0.01    0.071   0.029   0.102   0.264   0.172   0.263
COPA      0.275   0.250   0.176   0.152   0.081   0.156   0.068   0.019   0.132   0.176   0.101   0.213
OS        0.447   0.368*  0.145*  0.243   0.243   0.188   0.104   0.031   0.141   0.149   0.082   0.162
ORT       0.008   0       0.03    0.012   0.003   0.028   0.038   0.015   0.055   0.142   0.069   0.157
RTS       0.003   0       0.021   0.001   0       0.003   0.030   0.009   0.079   0.129   0.067   0.148

* In the source, one OS entry for k = 25 (0.368) is displaced; the assignment of 0.368 and 0.145 to the Median and SD columns is uncertain.

Receiver operating characteristic (ROC) curves are also estimated to evaluate the detection power of the statistics. With µ = 2, m = 100 and k = 25, 15, 10, 6, 3, 1, I estimated ROC curves by choosing different thresholds for gene calls, and repeated the process 50 times. Each point on the ROC curves is the average of 50 true/false positive rates obtained when the same value is selected for the gene call. Figure 1 shows the estimated true/false-positive rates based on 50 simulations. When k = 25 or 15, RTS and the t-statistic perform the best and OS performs the worst. When k = 10, RTS continues to perform the best, slightly better than the t-statistic. For a smaller k, such as k = 6, the t-statistic starts to be inefficient while RTS shows strong detection power. For an even smaller k = 3, RTS still performs the best. When k decreases to 1, RTS and OS are better than the t-statistic. Figure 1 demonstrates that RTS outperforms the other methods in various situations. As k decreases, OS gains power to detect the over-expressed genes, whereas the t-statistic keeps losing power. In contrast to the dramatic change in the performance of OS and the t-statistic, ORT performs quite consistently, but it is never the best for any k value.
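One way to estimate such averaged ROC curves is sketched below. The text does not specify how the "same value for the gene call" is matched across replicates; this sketch assumes that it means the same number of top-ranked genes called DE, and the names `roc_curve` and `average_roc` and the toy scores are hypothetical.

```python
def roc_curve(scores, is_de):
    """(false positive rate, true positive rate) after calling the top c genes, c = 0..p."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(is_de)             # truly DE genes
    n_neg = len(is_de) - n_pos     # null genes
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if is_de[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def average_roc(curves):
    """Pointwise average of replicate curves at the same number of called genes."""
    reps = len(curves)
    return [
        (sum(c[i][0] for c in curves) / reps, sum(c[i][1] for c in curves) / reps)
        for i in range(len(curves[0]))
    ]


# Toy example (hypothetical scores): four genes, genes 1 and 3 truly DE, two replicates.
truth = [True, False, True, False]
replicates = [roc_curve([3, 2, 1, 0], truth), roc_curve([0, 1, 2, 3], truth)]
for fpr, tpr in average_roc(replicates):
    print(fpr, tpr)
```

In the simulation setting above, `scores` would hold one statistic (for example RTS) for all $p = 1000$ genes in a replicate, and `is_de` would mark the $m$ genes that received the shift $\mu$.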
The proposed method, robust tail-sum, shows stable and strong power to detect DE genes over the entire range of k values: RTS performed the best for almost all k values and never performed significantly worse than any other method. It appears that the RTS formula is better at detecting DE genes than OS and ORT because it efficiently reduces the standard error.

To investigate whether a smaller systematic increase in DE genes affects the performance of the different methods, I set µ = 1 as a smaller over-expression magnitude, with m = 200 and k = 25, 15, 10, 6, 3, 1, and estimated the ROC curves. The results are in the supplementary material and show a pattern similar to Figure 1. RTS performed well under these conditions. When k = 1, all methods appear to perform at random, with diagonal curves. Through the simulations, I notice that RTS is resistant to the systematic increase while OS is greatly influenced by it. These simulations suggest that robust tail-sum can provide a useful alternative to the t-statistic, COPA, OS and ORT over a wide range of conditions. I illustrate the improvements of RTS on a public leukemia microarray data set in Section 4.

Figure 1: ROC curves estimated based on simulations. Various k values are chosen. The over-expression magnitude µ = 2 is used. (Six panels, n = 25, µ = 2, k = 25, 15, 10, 6, 3, 1; horizontal axis: false positive; vertical axis: true positive; curves: OS, RTS, ORT, t, COPA.)

4 Application to leukemia data via robust tail sums method

Leukemia data taken from Golub et al.
(1999) consists of 38 samples of two types of acute leukemia: 11 samples with acute myelogenous leukemia (AML) and 27 samples with acute lymphocytic leukemia (ALL). Each sample has 7129 gene values. In this analysis, I used AML as the normal group and ALL as the disease group. I applied all five methods and obtained a list of the top 50 genes from each method. Golub et al. identified 25 highly expressed genes in the ALL group. Table 2 shows how many genes on Golub's list also appear in each of the five top-50 lists. Both ORT and RTS outperform the other three methods.

Table 2. Number of genes overlapping with the list of Golub et al.

Method        Overlap with Golub et al.
t-statistic   1
COPA          0
OS            0
ORT           8
RTS           6

In the literature, “terminal deoxynucleotidyl transferase” is known as an excellent marker for ALL (Tibshirani et al., 2002; Zhu and Hastie, 2004). Both RTS and ORT rank this gene among the top 35 genes. The t-statistic, however, ranks it 189th, so the t-statistic is very unlikely to detect it as a DE gene.

Acknowledgments

I’d like to thank Dr. Richard Dubsky for his remarks. I’d also like to thank all the reviewers.

References

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J. and Caligiuri, M. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–536.

Lyons-Weiler, J., Patel, S., Becich, M. and Godfrey, T. (2004). Tests for finding complex patterns of differential expression in cancers: towards individualized medicine. BMC Bioinformatics 5, 110–119.

Rieger, K., Hong, W., Tusher, V., Tang, J., Tibshirani, R. and Chu, G. (2004). Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage.
Proceedings of the National Academy of Sciences of the United States of America 101, 6634–6640.

Tomlins, S., Rhodes, D., Perner, S., Dhanasekaran, S., Mehra, R., Sun, X., Varambally, S., Cao, X., Tchinda, J., Kuefer, R. et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310, 644–648.

Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America 99, 6567–6572.

Tibshirani, R. and Hastie, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics 8(1), 2–8.

Wu, B. (2007). Cancer outlier differential gene expression detection. Biostatistics 8(3), 566–575.

Zhu, J. and Hastie, T. (2004). Classification of gene microarray by penalized logistic regression. Biostatistics 5(3), 427–443.