A new method for multi-group cancer outlier differential gene detection

June Luo †
Department of Applied Economics and Statistics, Clemson University, Clemson, SC 29634

Pengju G. Luo
Department of Biological Sciences, Clemson University, Clemson, SC 29634, and Department of Basic Sciences, Sherman College, Spartanburg, SC 29304

Abstract

Several methods have been discussed for detecting differentially expressed (DE) genes that are over-expressed or down-expressed in some but not all samples of a disease group in two-class microarray data. Such methods have become increasingly useful in cancer studies, where oncogenes are activated in only a minority of samples. Multi-group microarray data, however, have received much less attention. I propose a new statistic, the robust tail F-statistic (RTF), to detect DE genes in multi-group cancer expression datasets. In simulated examples, I compare RTF with the F-statistic and with the outlier robust F-statistic and the outlier F-statistic of Liu and Wu (2007). RTF performs consistently as the best method under a wide range of conditions. In a real example, the robust tail F-statistic provides useful findings beyond those of the F-statistic, the outlier robust F-statistic and the outlier F-statistic.

Keywords: gene expression analysis; microarray; robust; differentially expressed genes; standard error; ROC curve

1 Introduction

Several proposals have been made for detecting DE genes in two-class microarray studies. One widely used approach is to compute the t-statistic $T_i$ for each gene and call the gene DE if $|T_i|$ exceeds a certain threshold. Tomlins et al. (2005) found it reasonable to assume that differential genes show systematically increased or decreased expression in only a subset of disease samples. To address this problem, Tomlins et al. proposed “cancer outlier profile analysis” (COPA) for detecting “oncogene outliers”.
Their study shows that COPA can be more powerful than the traditional t-statistic in these cases. Further progress has been made with more robust statistics: Tibshirani and Hastie (2007) introduced the outlier-sum (OS) and Wu (2007) proposed the outlier robust t-statistic (ORT) as alternative methods. In simulations, both methods showed better overall performance than COPA and the t-statistic in terms of false discovery rate.

Almost all of the above ranking criteria focus on two-class microarray data sets. Liu and Wu (2007) extended the outlier robust t-statistic to multi-group cancer outlier differential gene detection. They proposed the outlier robust F-statistic (ORF) and the outlier F-statistic (OF) and demonstrated their advantage over the F-statistic. Liu and Wu considered the median of multiple sequences and used only extreme values in the disease groups in both ORF and OF; a more powerful statistic may be obtained by reducing the standard error of the statistic through averaging multiple median absolute deviations. I define a flexible group of tail values, use these tail values, and propose the robust tail F-statistic (RTF) for multi-group DE gene detection. In simulation studies, RTF outperformed F, ORF and OF under most circumstances and was never significantly worse in any situation. Therefore, I believe it is a valuable contribution to the field of cancer outlier expression detection for multi-group microarrays.

† Corresponding author: 229 Barre Hall, Clemson University, Clemson, SC 29630. Tel.: 864-656-5768. E-mail: [email protected]

2 Methods

I first discuss methods from the literature for detecting outlier differential genes in two-class microarrays. I then introduce my statistical method for multi-class differential gene detection. Consider a microarray with $n$ total samples, comprising $n_g$ samples from each group $g = 1, 2, \ldots, G$.
Let $x_{ijg}$ be the expression value for group $g$, gene $i = 1, 2, \ldots, p$ and sample $j = 1, 2, \ldots, n_g$. Without loss of generality, I assume that the first group is the control group and the other groups are disease groups.

2.1 Two-class cancer outlier differential expression detection

The following methods detect over-expressed genes in two-class microarray data ($G = 2$). Similar arguments can be applied to detect down-expressed genes. The standard t-statistic for a two-sample test is

$$T_i = \frac{\bar{x}_{i2} - \bar{x}_{i1}}{s_i}.$$

Here $\bar{x}_{i1}$ and $\bar{x}_{i2}$ are the means of gene $i$ for the control group and the disease group, respectively, and $s_i$ is the pooled standard deviation of gene $i$. The t-statistic is powerful when the alternative distribution is such that $\{x_{ij2},\ j = 1, 2, \ldots, n_2\}$ all come from a distribution with a higher mean than $\{x_{ij1},\ j = 1, 2, \ldots, n_1\}$. Since only a small portion of cancer samples are over-expressed for DE genes, the t-statistic can be inefficient for detecting DE genes in such situations. To improve on it, Tomlins et al. (2005) defined the COPA statistic

$$C_i = \frac{q_r(x_{ij2},\ 1 \le j \le n_2) - \mathrm{med}_i}{\mathrm{mad}_i},$$

where $q_r(\cdot)$ is the $r$th percentile of the data, $\mathrm{med}_i$ is the median of all values for gene $i$, and $\mathrm{mad}_i$ is the median absolute deviation of all expressions for gene $i$. The choice of $r$ is subjective, and $C_i$ considers only a single value of $\{x_{ij2},\ j \le n_2\}$. As more robust statistics, Tibshirani and Hastie (2007) introduced the outlier-sum (OS), Wu (2007) proposed the outlier robust t-statistic (ORT), and Luo (2009) introduced the robust tail-sum (RTS). The ranking formulas are given by

$$\mathrm{OS}_i = \sum_{j \le n_2} \frac{x_{ij2} - \mathrm{med}_i}{\mathrm{mad}_i} \cdot I[x_{ij2} > q_{75}(i) + \mathrm{IQR}(i)],$$

$$\mathrm{ORT}_i = \sum_{j \in O_i} \frac{x_{ij2} - \mathrm{med}_{i1}}{\mathrm{median}\{|x_{ij1} - \mathrm{med}_{i1}|_{j \le n_1},\ |x_{ij2} - \mathrm{med}_{i2}|_{j \le n_2}\}},$$

$$\mathrm{RTS}_i = \sum_{j \le n_2} \frac{x_{ij2} - \mathrm{med}_{i1}}{\mathrm{mad}(x_{ij1},\ j \le n_1) + \mathrm{mad}(x_{ij2},\ j \le n_2)} \cdot I[x_{ij2} > q_r(x_{ij1},\ j \le n_1)],$$

where $q_r(i)$ and $\mathrm{IQR}(i)$ are the $r$th percentile and interquartile range of all values for gene $i$.
Additionally, $\mathrm{med}_{i1} = \mathrm{median}\{x_{ij1},\ j \le n_1\}$, $\mathrm{med}_{i2} = \mathrm{median}\{x_{ij2},\ j \le n_2\}$ and $O_i = \{j \le n_2 : x_{ij2} > q_{75}(x_{ij1},\ 1 \le j \le n_1) + \mathrm{IQR}(x_{ij1},\ 1 \le j \le n_1)\}$. As in COPA, the percentile parameter $r$ in the RTS statistic is subjective. As an example, Luo used $r = 0.95$ to detect over-expressed genes.

2.2 The robust tail F-statistic

The F-statistic is probably the most often used criterion for ranking the significance of genes in multi-class microarrays. The formula is

$$F_i = \frac{\sum_{g=1}^{G} n_g (\bar{x}_{ig} - \bar{x}_i)^2 / (G - 1)}{\sum_{g=1}^{G} \sum_{j=1}^{n_g} (x_{ijg} - \bar{x}_{ig})^2 / (n - G)}, \qquad (1)$$

where $\bar{x}_{ig} = \sum_{j=1}^{n_g} x_{ijg} / n_g$ and $\bar{x}_i = \sum_{g=1}^{G} \sum_{j=1}^{n_g} x_{ijg} / n$. The F-statistic uses the within-group means and the pooled variance. As a more robust statistic, the outlier robust F-statistic (ORF) of Liu and Wu (2007) replaces the mean with the median and the pooled variance with the median absolute deviation. Liu and Wu also formulated the outlier F-statistic (OF) by keeping the outliers and removing all other expression values in the disease groups. More specifically, their formulas are given by

$$\mathrm{ORF}_i = \frac{\sum_{g=2}^{G} \sum_{j \in R_g} (x_{ijg} - m_{i1})}{\mathrm{median}\{|x_{ij1} - m_{i1}|_{j \le n_1},\ \ldots,\ |x_{ijG} - m_{iG}|_{j \le n_G}\}}, \qquad (2)$$

where $m_{ig} = \mathrm{median}\{x_{ijg},\ j \le n_g\}$ and $R_g = \{j \le n_g : x_{ijg} > q_{75}(x_{ij1},\ j \le n_1) + \mathrm{IQR}(x_{ij1},\ j \le n_1)\}$, $g = 1, 2, \ldots, G$, and

$$\mathrm{OF}_i = \frac{\left[ n_1 (\bar{x}_{i1} - \tilde{x}_i)^2 + \sum_{g=2}^{G} \tilde{n}_g (\tilde{x}_{ig} - \tilde{x}_i)^2 \right] / (\tilde{G} - 1)}{\left[ \sum_{j=1}^{n_1} (x_{ij1} - \bar{x}_{i1})^2 + \sum_{g=2}^{G} \sum_{j \in R_g} (x_{ijg} - \tilde{x}_{ig})^2 \right] / (\tilde{n} - \tilde{G})}, \qquad (3)$$

where

$$\tilde{n} = n_1 + \sum_{g=2}^{G} \tilde{n}_g, \qquad \tilde{x}_{ig} = \sum_{j \in R_g} x_{ijg} / \tilde{n}_g, \qquad \tilde{x}_i = \frac{n_1 \bar{x}_{i1} + \sum_{g=2}^{G} \tilde{n}_g \tilde{x}_{ig}}{n_1 + \sum_{g=2}^{G} \tilde{n}_g},$$

$\tilde{n}_g$ is the cardinality of the set $R_g$, and $\tilde{G}$ counts the number of nonempty classes. Both methods of Liu and Wu (2007) rely on the set $R_g$, which is defined as an index set for outliers. Liu and Wu removed all values below $q_{75} + \mathrm{IQR}$ in the disease groups. After this removal, the statistic may be based on too few expression values and thus fail to use enough of the dataset.
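As an illustration of (2), the following is a minimal sketch of ORF for a single gene. It is not the implementation of Liu and Wu: the function names `quantile` and `orf` are hypothetical, the percentile rule (linear interpolation between order statistics) is an assumption because the text does not specify one, and the median absolute deviation is left unscaled.

```python
from statistics import median


def quantile(values, r):
    """r-th quantile (0 <= r <= 1), linear interpolation between order statistics.
    Assumption: the paper does not state which percentile rule is used."""
    s = sorted(values)
    pos = r * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def orf(groups):
    """Outlier robust F-statistic of equation (2) for one gene.
    groups[0] is the control group, groups[1:] are the disease groups."""
    control = groups[0]
    m1 = median(control)
    q25, q75 = quantile(control, 0.25), quantile(control, 0.75)
    cutoff = q75 + (q75 - q25)  # q75 + IQR of the control group
    # Numerator: disease-group values in R_g, centred at the control median.
    numerator = sum(x - m1 for g in groups[1:] for x in g if x > cutoff)
    # Denominator: median of all absolute deviations from the group medians.
    # Note: this raises ZeroDivisionError if that median is exactly zero.
    abs_dev = [abs(x - median(g)) for g in groups for x in g]
    return numerator / median(abs_dev)
```

For example, `orf([[0, 1, 2, 3, 4], [0, 2, 6], [1, 9, 3]])` uses the cutoff 3 + 2 = 5, so only the values 6 and 9 enter the numerator.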
To detect whether a gene shows systematically increased expression values in the disease groups, I focus on the tail values in the disease groups relative to the normal group. To reduce the standard error of the ranking statistic, I replace the denominators in (2) and (3) with the average of the median absolute deviations. I therefore propose the robust tail F-statistic, defined by

$$\mathrm{RTF}_i = \frac{\sum_{g=2}^{G} \sum_{j \in T_g} (x_{ijg} - m_{i1})}{\mathrm{mad}(x_{ij1},\ j \le n_1) + \mathrm{mad}(x_{ij2},\ j \le n_2) + \ldots + \mathrm{mad}(x_{ijG},\ j \le n_G)}, \qquad (4)$$

where $T_g = \{j \le n_g : x_{ijg} > q_r(x_{ij1},\ j \le n_1)\}$ and $r$ is chosen by the user. I use $r = 0.95$ in the simulations. The sum of MADs in the denominator of (4) differs from their average only by the constant factor $G$, which does not affect the ranking of genes. $\mathrm{RTF}_i$ is large if there are many tail values in the disease groups or a few extremely large values. When $G = 2$, the proposed RTF reduces to the robust tail-sum for two-class microarrays.

RTF considers the difference between groups and reduces the standard error of the statistic by averaging the MADs. Instead of using only the extremely large values in the disease groups, RTF is flexible in choosing the tail values through the parameter $r$. The set $T_g$ can overlap with the outlier set $R_g$: the percentile that defines the tail values can be adjusted, whereas $R_g$ is fixed by its definition, so $T_g$ can accomplish more than $R_g$ in identifying outliers. By reducing the standard error of the statistic and allowing flexibility in identifying outliers, RTF improves the power to detect DE genes. I demonstrate the efficiency of all methods through simulation results.

In real applications, one might also wish to detect down-expressed genes. In this case, the left tail of the distribution of $\{x_{ijg},\ j \le n_g,\ g = 2, 3, \ldots, G\}$ is of interest. Hence, I define

$$\mathrm{RTF}_i^{*} = \frac{\sum_{g=2}^{G} \sum_{j \in T_g} (x_{ijg} - m_{i1})}{\mathrm{mad}(x_{ij1},\ j \le n_1) + \mathrm{mad}(x_{ij2},\ j \le n_2) + \ldots + \mathrm{mad}(x_{ijG},\ j \le n_G)}, \qquad (5)$$

where $T_g = \{j \le n_g : x_{ijg} < q_r(x_{ij1},\ j \le n_1)\}$ and $r$ is a small number such as 0.05.
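The definitions in (4) and (5) can be sketched for a single gene as follows. This is an illustrative sketch only: the names `quantile`, `mad` and `rtf` and the `lower_tail` switch are hypothetical, the percentile rule (linear interpolation) is an assumption, and the MAD is left unscaled because no scaling constant is given in the text.

```python
from statistics import median


def quantile(values, r):
    """r-th quantile (0 <= r <= 1), linear interpolation between order statistics.
    Assumption: the paper does not state which percentile rule is used."""
    s = sorted(values)
    pos = r * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def mad(values):
    """Unscaled median absolute deviation about the median."""
    m = median(values)
    return median([abs(x - m) for x in values])


def rtf(groups, r=0.95, lower_tail=False):
    """Robust tail F-statistic for one gene.
    groups[0] is the control group, groups[1:] are the disease groups.
    lower_tail=False follows equation (4); lower_tail=True follows (5)."""
    control = groups[0]
    m1 = median(control)
    cutoff = quantile(control, r)  # r-th percentile of the control group
    if lower_tail:
        tail = [x for g in groups[1:] for x in g if x < cutoff]
    else:
        tail = [x for g in groups[1:] for x in g if x > cutoff]
    # Denominator: sum of the per-group MADs, as written in (4) and (5).
    # Note: this raises ZeroDivisionError if every group has MAD zero.
    denominator = sum(mad(g) for g in groups)
    return sum(x - m1 for x in tail) / denominator
```

With two groups, `rtf([control, disease])` computes the same quantity as the robust tail-sum of Section 2.1, which matches the remark that RTF reduces to RTS when $G = 2$. For down-expressed genes one would call, for instance, `rtf(groups, r=0.05, lower_tail=True)`.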
Since the robust tail F-statistic is not symmetric in the groups, the procedure requires a normal class as a basis.

3 Simulation study and comparison

I carried out a number of simulations to assess the relative performance of the F-statistic, the outlier robust F-statistic, the outlier F-statistic and the robust tail F-statistic. Suppose we have $G = 4$ classes and each class has 15 samples. The first class is the normal class. I generate expression data with $p = 1000$ gene values for each sample. For the non-DE genes, the gene values are independently generated from $N(0, 1)$ for all samples. For the DE genes, the gene values for samples in the normal class are randomly drawn from $N(0, 1)$, and the gene values for samples in disease group $g = 2, 3, 4$ are simulated from the normal mixture distribution $(1 - \pi_g) N(0, 1) + \pi_g N(\mu_g, 1)$ with $\pi_g \in [0, 1]$ and $\mu_g \in (0, \infty)$. For simplicity, I used $\pi_2 = \pi_3 = \pi_4 = \pi$ with $\pi = 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 1.0$. I also used $\mu_2 = 1$, $\mu_3 = 1.5$ and $\mu_4 = 2$ as three over-expression magnitudes.

When only the first gene is over-expressed in all disease groups, that is, $m = 1$, I computed for each $\pi$ value, as is customary, the P-value: the proportion of genes with a test statistic greater than that of gene 1. The process was repeated 50 times. The mean, median and standard deviation of the P-values are shown in Table 1. For almost all $\pi$ values, the robust tail F-statistic showed a greater ability to detect over-expressed genes than all other methods, as indicated by lower P-values.

When I set $m = 100$ as the number of DE genes, I calculated the true/false-positive rates and constructed the receiver operating characteristic (ROC) curve for power comparison. Figure 1 shows the estimated true/false-positive rates based on 50 simulations for $\pi = 0.1, 0.3, 0.5, 0.7, 0.9, 1.0$. Each point on the curves is the average of 50 true/false-positive rates obtained when a given threshold in $[0, 1]$ is used for the gene call.
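The single-DE-gene experiment ($m = 1$) described above can be sketched as follows, using the classical F-statistic of (1) as the ranking statistic. This is a hedged sketch rather than the code used for Table 1: the function names, the `seed` argument and the choice to divide by $p$ (rather than $p - 1$) when forming the P-value are assumptions.

```python
import random
from statistics import mean


def simulate_gene(rng, pi, mus, n=15, de=False):
    """One gene: control group from N(0,1); for a DE gene, each disease-group
    sample is drawn from (1 - pi) N(0,1) + pi N(mu_g, 1)."""
    groups = [[rng.gauss(0, 1) for _ in range(n)]]
    for mu in mus:
        groups.append([
            rng.gauss(mu, 1) if de and rng.random() < pi else rng.gauss(0, 1)
            for _ in range(n)
        ])
    return groups


def f_statistic(groups):
    """Classical one-way F-statistic of equation (1) for one gene."""
    n = sum(len(g) for g in groups)
    n_groups = len(groups)
    means = [mean(g) for g in groups]
    grand = sum(sum(g) for g in groups) / n
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (between / (n_groups - 1)) / (within / (n - n_groups))


def p_value_gene1(stat, pi, mus=(1.0, 1.5, 2.0), p=1000, n=15, seed=0):
    """Proportion of genes whose statistic exceeds that of gene 1 (the DE gene).
    Assumption: the proportion is taken over all p genes."""
    rng = random.Random(seed)
    scores = [stat(simulate_gene(rng, pi, mus, n, de=(i == 0))) for i in range(p)]
    return sum(s > scores[0] for s in scores[1:]) / p
```

Repeating `p_value_gene1` over 50 different seeds and summarising the results by mean, median and standard deviation would mirror the layout of Table 1; any other per-gene statistic with the same signature can be passed in place of `f_statistic`.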
A similar pattern of ROC curves was observed for $m = 200, 300$. When $\pi = 1$, RTF and the F-statistic have almost perfect performance. The performance of the F-statistic changes dramatically as $\pi$ decreases, and ORF is consistently the second best, whereas RTF performs best under a wide range of conditions. Simulations were also done with different values of $\pi_2$, $\pi_3$, $\pi_4$, and patterns similar to those in Figure 1 were observed. RTF appears to combine the advantages of the F-statistic and the outlier robust F-statistic by reducing the standard error of the statistic. These simulations suggest that the robust tail F-statistic can provide more useful findings than the F-statistic, ORF and OF under various conditions. I illustrate the improvements of RTF on public breast cancer microarray data in Section 4.

Table 1. Results of simulation study: mean, median, and standard deviation of P-values for gene 1, over 50 simulations

       π = 0.1                π = 0.2                π = 0.3                π = 0.4
       Mean   Median SD       Mean   Median SD       Mean   Median SD       Mean   Median SD
F      0.492  0.446  0.254    0.371  0.334  0.302    0.275  0.182  0.257    0.217  0.137  0.235
ORF    0.269  0.203  0.230    0.199  0.105  0.224    0.132  0.081  0.152    0.120  0.053  0.150
OF     0.259  0.176  0.219    0.198  0.120  0.211    0.140  0.066  0.164    0.117  0.064  0.145
RTF    0.254  0.172  0.259    0.161  0.077  0.199    0.089  0.052  0.114    0.091  0.025  0.167

Continued results of Table 1

       π = 0.5                π = 0.7                π = 0.8                π = 1.0
       Mean   Median SD       Mean   Median SD       Mean   Median SD       Mean   Median SD
F      0.093  0.038  0.155    0.012  0.001  0.028    0.007  0      0.046    0      0      0
ORF    0.055  0.019  0.072    0.054  0.001  0.139    0.024  0      0.079    0.003  0      0.011
OF     0.044  0.023  0.069    0.028  0.006  0.057    0.010  0.002  0.025    0.007  0.001  0.015
RTF    0.016  0.005  0.026    0.005  0      0.014    0.001  0      0.005    0      0      0.002

4 Application to breast cancer data via robust tail F-statistic

The breast cancer microarray data from West et al. (2001) contain 7129 gene expression levels for 49 breast tumor samples. The raw data can be downloaded from http://data.cgt.duke.edu/west.php.
Each sample has two binary outcomes: the status of lymph node involvement in breast cancer (negative or positive, denoted LN-/LN+) and the estrogen receptor status (negative or positive, denoted ER-/ER+). Among all tumor samples, there are 13 ER+/LN+ tumors, 11 ER-/LN+ tumors, 12 ER+/LN- tumors, and 13 ER-/LN- tumors. I process the data using quantile normalization (Bolstad et al., 2003) for the follow-up statistical analysis. In the cancer gene outlier detection, I treat the ER-/LN- group as the normal class and apply the F-statistic, the outlier robust F-statistic, the outlier F-statistic and the proposed robust tail F-statistic. I picked the top 50 genes identified by each method. In total, 161 genes were picked by the four methods. Table 2 shows the number of genes shared by each pair of methods. RTF and ORF share the largest number of genes (29), whereas the F-statistic shares the fewest genes with the other three methods. The large number of genes shared between RTF and the other methods suggests that the robust tail F-statistic is likely to be superior to the other methods.

[Figure 1 image not recovered: six ROC panels (n = 15; π = 0.1, 0.3, 0.5, 0.7, 0.9, 1), each plotting true positive rate against false positive rate for F, ORF, OF and RTF.]

Figure 1: ROC curves estimated based on simulations. Various π values are chosen.

Table 2: Number of overlapping genes among the top 50 genes

       F     ORF   OF    RTF
F      50    3     0     1
ORF    3     50    2     29
OF     0     2     50    6
RTF    1     29    6     50

Acknowledgments

I would like to thank Dr. Richard Dubsky for his remarks. I would also like to thank Dr. Patrick Gerard for all his helpful advice.
References

Bolstad, B., Irizarry, R., Astrand, M., Speed, T., 2003. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193.

Luo, J., 2009. Robust Tail-sum: Two-group cancer outlier differential gene detection. InterStat, October, 2009.

Tomlins, S. A., Rhodes, D. R., Perner, S., Dhanasekaran, S. M., Mehra, R., Sun, X. W., Varambally, S., Cao, X., Tchinda, J., Kuefer, R. and others, 2005. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310, 644-648.

Tibshirani, R. and Hastie, T., 2007. Outlier sums for differential gene expression analysis. Biostatistics 8(1), 2-8.

West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson, J., Marks, J., Nevins, J., 2001. Predicting the clinical status of human breast cancer by using gene expression profiles. PNAS 98(20), 11462-11467.

Wu, B., 2007. Cancer outlier differential gene expression detection. Biostatistics 8(3), 566-575.

Liu, F. and Wu, B., 2007. Multi-group cancer outlier differential gene expression detection. Computational Biology and Chemistry 31, 65-71.