Modified F-statistic: multi-group cancer outlier differential gene detection

June Luo †
Department of Applied Economics and Statistics
Clemson University
Clemson, SC 29634

Several methods have been discussed for detecting differentially expressed (DE) genes that are over-expressed in some but not all samples in a disease group. Detecting such genes has become increasingly useful in cancer studies, where oncogenes are activated in only a minority of samples. Because few of these discussions address multi-group microarrays, more applicable and efficient methods are needed for that setting. With the intention of minimizing the variation and reducing the standard error of the statistic, I propose a new method called the modified F-statistic (MF) to detect DE genes in two-group or multi-group cancer microarrays. In simulated examples, I compare MF to the traditional F-statistic, the outlier robust F-statistic in [10] and the outlier F-statistic in [10]. The comparison covers the receiver operating characteristic (ROC) score, the accuracy rate and the reproducibility of gene lists. I find that MF performs consistently as the best or the second-best method when the number of oncogene outliers in the disease groups varies. In a real example, the new method provides useful findings and confirms literature results in detecting over-expressed genes.

Keywords: cancer microarray; robust; outlier; differential gene detection; standard error; ROC curve; accuracy rate; concordance

1 Introduction

“Oncogene outliers” are genes that show systematically increased expression in disease samples, but only in a small number of cancer samples. Since the discovery of oncogenes, several proposals have been made for detecting differentially expressed (DE) genes in two-class microarray studies, such as [4].
One widely used approach is to compute a t-statistic $T_i$ for each gene and call the gene DE if $|T_i|$ exceeds a certain threshold. Biologists are fond of fold-change methods ([6, 9]) for their simplicity, interpretability and reproducibility of gene lists, as in [7]. In [12], it was argued that reproducibility does not imply accuracy, and it was demonstrated that the choice of the best method depends on the data set. Recently, [8] introduced a method called “cancer outlier profile analysis” (COPA) for detecting DE genes. The outcome of that study shows that COPA can be more powerful than the traditional t-statistic in these cases.

More progress has been made in detecting DE genes with more robust statistics. The outlier-sum (OS) statistic was introduced in [11] and shown to have smaller false discovery rates (FDR) than COPA on the skin data of [5]. Later, [13] proposed the outlier robust t-statistic (ORT) and [14] proposed the robust tail-sum (RTS). In simulations, both ORT and RTS have shown better overall performance than COPA and OS in terms of ROC scores.

† Corresponding author. E-mail: [email protected]

Almost all of the above ranking criteria focus on two-class microarray data sets. Some of the existing methods have been extended to multi-group cancer outlier differential gene detection. In [10], Liu and Wu took robustness into consideration and proposed the outlier robust F-statistic (ORF) and the outlier F-statistic (OF), and clearly demonstrated the advantages of both statistics over the traditional F-statistic. To reduce the standard error and minimize the variation of the statistic, thereby improving the detection power of the ranking statistic, I propose the modified F-statistic (MF) for two-group or multi-group cancer microarray DE gene detection. In this paper, I discuss one special statistic in the MF family of statistics. Simulations are performed to illustrate the detection power of the various statistics.
Simulation results show that the MF statistic achieves the highest ROC score in various situations, while also showing both high accuracy and high reproducibility of gene lists. Given the simplicity of the formula and the demonstrated detection power, I believe MF is a valuable contribution to the field of differential gene detection in multi-class microarrays.

2 The modified F-statistic

Consider a microarray of $G$ groups with $n$ total samples, comprising $n_g$ samples from group $g = 1, 2, \ldots, G$. Let $x_{ijg}$ be the expression value for gene $i = 1, 2, \ldots, p$ and sample $j = 1, 2, \ldots, n_g$ in group $g$. Without loss of generality, I assume that the first group is the control group and the other groups are disease groups. The F-statistic is probably the most often used ranking criterion to identify DE genes in multi-class microarrays. The formula is

\[
F_i = \frac{\sum_{g=1}^{G} n_g (\bar{x}_{ig} - \bar{x}_i)^2 / (G - 1)}{\sum_{g=1}^{G} \sum_{j=1}^{n_g} (x_{ijg} - \bar{x}_{ig})^2 / (n - G)}, \tag{1}
\]

where $\bar{x}_{ig} = \sum_{j=1}^{n_g} x_{ijg} / n_g$ and $\bar{x}_i = \sum_{g=1}^{G} \sum_{j=1}^{n_g} x_{ijg} / n$. The F-statistic uses the within-group mean and the pooled variance. As a more robust statistic, the outlier robust F-statistic (ORF) in [10] replaces the mean with the median and the pooled variance with the median absolute deviation. Liu and Wu also formulated the outlier F-statistic (OF) by keeping the outliers and removing all other expression values in the disease groups. Both the ORF and OF statistics consider the difference between the control group and the disease groups. More specifically, their formulas are given by

\[
ORF_i = \frac{\sum_{g=2}^{G} \sum_{j \in R_g} (x_{ijg} - m_{i1})}{\mathrm{median}\{ |x_{ij1} - m_{i1}|_{j \le n_1}, \ldots, |x_{ijG} - m_{iG}|_{j \le n_G} \}}, \tag{2}
\]

where $m_{ig} = \mathrm{median}\{x_{ijg},\ j \le n_g\}$ and

\[
R_g = \{ j \le n_g : x_{ijg} > q_{75}(x_{ij1},\ j \le n_1) + \mathrm{IQR}(x_{ij1},\ j \le n_1) \},
\]

where $q_r(\cdot)$ and $\mathrm{IQR}(\cdot)$ are the $r$th percentile and the interquartile range of the sequence in the parentheses, respectively, for all $g = 1, 2, \ldots, G$, and

\[
OF_i = \frac{\left[ n_1 (\bar{x}_{i1} - \tilde{x}_i)^2 + \sum_{g=2}^{G} \tilde{n}_g (\tilde{x}_{ig} - \tilde{x}_i)^2 \right] / (\tilde{G} - 1)}{\left[ \sum_{j=1}^{n_1} (x_{ij1} - \bar{x}_{i1})^2 + \sum_{g=2}^{G} \sum_{j \in R_g} (x_{ijg} - \tilde{x}_{ig})^2 \right] / (\tilde{n} - \tilde{G})}, \tag{3}
\]

where

\[
\tilde{n} = n_1 + \sum_{g=2}^{G} \tilde{n}_g, \qquad
\tilde{x}_{ig} = \sum_{j \in R_g} x_{ijg} / \tilde{n}_g, \qquad
\tilde{x}_i = \frac{n_1 \bar{x}_{i1} + \sum_{g=2}^{G} \tilde{n}_g \tilde{x}_{ig}}{n_1 + \sum_{g=2}^{G} \tilde{n}_g},
\]

$\tilde{n}_g$ is the cardinality of the set $R_g$, and $\tilde{G}$ counts the number of nonempty classes.

The ORF statistic uses the median of $G$ sequences of deviations. By taking the median of each sequence of deviations and then taking the average of the $G$ median absolute deviation values, I expect to reduce the standard error of the new statistic. To further minimize the variation, I borrow the idea in [1] and use a constant $s_0$. So I propose the modified F-statistic defined by

\[
MF_i = \frac{\sum_{g=2}^{G} \sum_{j \in R_g} (x_{ijg} - m_{i1})}{\mathrm{mad}(x_{ij1},\ j \le n_1) + \mathrm{mad}(x_{ij2},\ j \le n_2) + \cdots + \mathrm{mad}(x_{ijG},\ j \le n_G) + s_0}, \tag{4}
\]

where $\mathrm{mad}(\cdot)$ denotes the median absolute deviation and $s_0$ is a constant chosen to minimize the variation. In the special situation where $s_0$ is extremely large, the denominator of (4) is nearly the same constant for every gene, so the gene ranking given by (4) is equivalent to that of the simplest ranking statistic,

\[
MF_i = \sum_{g=2}^{G} \sum_{j \in R_g} (x_{ijg} - m_{i1}). \tag{5}
\]

The MF statistic in (5) is interpreted as the adjusted sum of deviations of outliers in the disease groups. I will demonstrate the detection power of the new method defined by (5) through simulation results.

3 Simulation study and demonstration of detection power

I carried out a number of simulations to assess the performance of the F-statistic, the outlier robust F-statistic, the outlier F-statistic and the modified F-statistic. Suppose we have $G = 4$ classes and each class has 20 samples. The first class is the control group. I generated the expression data with $p = 1000$ gene values for each sample. For the non-DE gene(s), the gene values are independently generated from N(0,1) for all samples.
For DE gene(s), the gene values for samples in the normal class are randomly drawn from N(0,1), and the gene values for samples in disease group g = 2, 3, 4 are simulated from the mixture normal distribution $(1 - \pi_g) N(0, 1) + \pi_g N(\mu_g, 1)$ with $\pi_g \in [0, 1]$ and $\mu_g \in (0, \infty)$. For simplicity, I used π2 = π3 = π4 = π with π = 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7. I also used µ2 = 1, µ3 = 1.5 and µ4 = 2 as the over-expression magnitudes for the three disease groups g = 2, 3, 4, respectively.

When only the first gene is over-expressed in all disease groups (m = 1), for each π value I computed, as is customary, the P-value used in [11]: the proportion of genes whose statistic is greater than that of gene 1. A smaller P-value increases the chance that we will identify the first gene as a DE gene. The process was repeated 50 times. The mean, median and standard deviation of the P-values are shown in Table 1. For almost all π values, the modified F-statistic showed a greater ability to detect over-expressed genes than all other methods, as indicated by its lower P-values.

Table 1. Results of simulation study when m = 1: mean, median, and standard deviation (SD) of P-values for gene 1, over 50 simulations

      π = 0.05             π = 0.1              π = 0.2              π = 0.3
      Mean   Median SD     Mean   Median SD     Mean   Median SD     Mean   Median SD
F     0.523  0.601  0.306  0.433  0.395  0.305  0.353  0.337  0.260  0.316  0.262  0.277
ORF   0.354  0.317  0.241  0.259  0.191  0.238  0.179  0.092  0.201  0.193  0.092  0.217
OF    0.346  0.332  0.241  0.266  0.186  0.241  0.178  0.108  0.197  0.160  0.114  0.180
MF    0.312  0.291  0.231  0.200  0.157  0.186  0.133  0.079  0.165  0.135  0.063  0.163

Continued results of Table 1

      π = 0.4              π = 0.5              π = 0.6              π = 0.7
      Mean   Median SD     Mean   Median SD     Mean   Median SD     Mean   Median SD
F     0.158  0.061  0.215  0.086  0.055  0.118  0.057  0.010  0.143  0.013  0.002  0.026
ORF   0.129  0.053  0.181  0.074  0.016  0.152  0.062  0.009  0.116  0.025  0.004  0.063
OF    0.074  0.041  0.116  0.074  0.027  0.129  0.045  0.022  0.065  0.018  0.009  0.028
MF    0.072  0.015  0.124  0.044  0.003  0.101  0.039  0.002  0.090  0.015  0.001  0.054

When I used m = 100 as the number of DE genes and set the first 100 genes as the DE genes, I calculated the true/false-positive rates and constructed the receiver operating characteristic (ROC) curve for comparison. Figure 1 shows the estimated true/false-positive rates based on 50 replications for each π = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6. Each point on the curves is the average of 50 true/false-positive rates at a given threshold for calling a gene DE. A similar pattern of ROC curves was observed for m = 200, 300. Unlike the F-statistic, whose performance changes dramatically as π decreases, or ORF, whose performance is consistently the second best, MF performed the best under a wide range of conditions. Simulations were also run with different values of π2, π3 and π4, and patterns similar to those in Figure 1 were observed. The MF statistic appears to have combined the advantages of the F-statistic and the outlier robust F-statistic in a positive way by efficiently reducing the coefficient of variation of the statistic.
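The m = 1 experiment above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the code used for Table 1: the function names (`mf_statistic`, `simulate`, `p_value_gene1`), the use of NumPy's default linear-interpolation percentiles for q75 and the IQR, and the random seed are all my own choices, so the printed values will not reproduce the table exactly.

```python
import numpy as np

def mf_statistic(x, groups):
    """Equation (5) for every gene.

    x      : (p, n) array of expression values, genes in rows
    groups : length-n integer labels, 1 = control, 2..G = disease
    """
    groups = np.asarray(groups)
    control = x[:, groups == 1]
    disease = x[:, groups != 1]
    m1 = np.median(control, axis=1)                      # m_{i1}
    q25, q75 = np.percentile(control, [25, 75], axis=1)  # assumed percentile rule
    cutoff = q75 + (q75 - q25)                           # q75 + IQR of control
    in_r = disease > cutoff[:, None]                     # sample j belongs to R_g
    # Sum deviations from the control median over outlying disease samples.
    return np.where(in_r, disease - m1[:, None], 0.0).sum(axis=1)

def simulate(pi, rng, p=1000, n_per_group=20, mus=(1.0, 1.5, 2.0), m=1):
    """One simulated data set: the first m genes are DE, group 1 is control."""
    groups = np.repeat(np.arange(1, len(mus) + 2), n_per_group)
    x = rng.standard_normal((p, groups.size))
    for g, mu in enumerate(mus, start=2):
        cols = np.flatnonzero(groups == g)
        # Each disease sample of a DE gene is shifted by mu_g with probability pi.
        shifted = rng.random((m, cols.size)) < pi
        x[np.ix_(np.arange(m), cols)] += mu * shifted
    return x, groups

def p_value_gene1(stat):
    """Proportion of genes whose statistic exceeds that of gene 1."""
    return float(np.mean(stat > stat[0]))

if __name__ == "__main__":
    rng = np.random.default_rng(2024)  # arbitrary seed, an assumption
    for pi in (0.1, 0.3, 0.5):
        pvals = [p_value_gene1(mf_statistic(*simulate(pi, rng))) for _ in range(50)]
        print(f"pi={pi}: mean P-value {np.mean(pvals):.3f}")
```

Drawing N(0,1) noise and adding µ_g to a Bernoulli(π) subset of disease samples is one standard way to sample from the mixture described above; other percentile conventions for q75 and the IQR would shift the outlier cutoff slightly.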
When m = 100, using each method, I obtained a list of the 100 most differential genes and computed the accuracy rate. I repeated the process 50 times and computed the average of the 50 accuracy rates for each method. For a sequence of equally spaced π in [0, 0.6], the results are shown in Figure 2. For almost all π values, MF showed the highest accuracy rate.

[Figure 1: ROC curves estimated based on simulations. Various π values are chosen. Six panels (n = 20; π = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6), each plotting true positive rate against false positive rate for F, ORF, OF and MF.]

[Figure 2: accuracy comparison. Accuracy against π values for F, ORF, OF and MF.]

[Figure 3: reproducibility of gene lists. Concordance against π values for F, ORF, OF and MF.]

After examining the ROC score and the accuracy rate, I ran a simulation to investigate the reproducibility of the gene list produced by a given statistic. I computed the concordance between the list of the 100 most differentially expressed genes in one data set and the list of the 100 most differentially expressed genes in another data set. The reproducibility of the gene list for a given statistic is measured by the concordance between the gene lists obtained with that statistic for two halves of a single data set. For each value of π in [0, 0.6], I obtained the concordance of the two gene lists.
For a sequence of equally spaced π in [0, 0.6], Figure 3 illustrates the reproducibility in various situations. MF outperformed all other methods by showing the highest concordance rate in almost all situations.

These simulations suggest that the modified F-statistic can provide more accurate findings than the F-statistic, ORF and OF under various conditions. Given the simplicity of the formula and its demonstrated detection power, it is a valuable addition to the existing methods for detecting DE genes in multi-group microarrays. I illustrate the findings of the MF statistic on a public breast cancer microarray data set in Section 4.

4 Application to breast cancer data via modified F-statistic

The breast cancer microarray data in [2] contain 7129 gene expression levels for a total of 49 breast tumor samples. The raw data can be downloaded from http://data.cgt.duke.edu/west.php. Each sample has two binary outcomes: the status of lymph node involvement in breast cancer (negative and positive, denoted as LN-/LN+) and the estrogen receptor status (negative and positive, denoted as ER-/ER+). Among all tumor samples, there are 13 ER+/LN+ samples, 11 ER-/LN+ samples, 12 ER+/LN- samples, and 13 ER-/LN- samples. I processed the data using the quantile normalization in [3] for the follow-up statistical analysis. In the DE gene detection, I treat the ER-/LN- group as the normal class and apply the F-statistic, the outlier robust F-statistic, the outlier F-statistic and the proposed modified F-statistic. I picked the top 50 genes identified by each method. In total, 185 genes were picked by the four methods. Table 2 shows the number of common genes found by any two methods. MF and ORF have the largest number of common genes, while the outlier F-statistic has the fewest genes in common with the other three methods. I recommend further study of the 8 genes that both ORF and MF have identified.
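The two list comparisons used in this paper, the pairwise overlap of top-ranked genes between methods and the concordance between gene lists from two data sets, can be sketched as below. This is a hedged illustration: the paper does not give an explicit formula for concordance, so I assume it is the fraction of the top-k list shared by the two lists, and the helper names (`top_k`, `pairwise_overlap`, `concordance`) and the toy scores are hypothetical.

```python
import numpy as np
from itertools import combinations

def top_k(stat, k):
    """Set of indices of the k genes with the largest statistic."""
    order = np.argsort(np.asarray(stat))[::-1]
    return set(order[:k].tolist())

def pairwise_overlap(stats_by_method, k=50):
    """Number of shared top-k genes for every pair of methods."""
    tops = {name: top_k(s, k) for name, s in stats_by_method.items()}
    return {(a, b): len(tops[a] & tops[b]) for a, b in combinations(tops, 2)}

def concordance(stat_half1, stat_half2, k=100):
    """Fraction of the top-k list shared by two data sets (assumed definition)."""
    return len(top_k(stat_half1, k) & top_k(stat_half2, k)) / k

if __name__ == "__main__":
    # Toy scores for five genes under two hypothetical methods.
    scores = {"A": [5.0, 4.0, 3.0, 2.0, 1.0], "B": [5.0, 1.0, 4.0, 2.0, 3.0]}
    print(pairwise_overlap(scores, k=2))
    print(concordance(scores["A"], scores["B"], k=2))
```

With real data, `stats_by_method` would hold one vector of per-gene statistics per method (F, ORF, OF, MF), and the resulting counts fill the off-diagonal entries of an overlap table.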
Table 2: number of overlapped genes from the top 50 genes

      F     ORF   OF    MF
F     50    3     0     3
ORF   3     50    2     8
OF    0     2     50    0
MF    3     8     0     50

Acknowledgments

I would also like to thank Dr. Dubsky and the reviewers for their time.

References

[1] V. Tusher, R. Tibshirani and G. Chu, Significance analysis of microarrays applied to the ionizing radiation response, PNAS 98 (2001), 5116-5121.

[2] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. Olson, J. Marks and J. Nevins, Predicting the clinical status of human breast cancer by using gene expression profiles, PNAS 98(20) (2001), 11426-11467.

[3] B. Bolstad, R. Irizarry, M. Astrand and T. Speed, A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics 19(2) (2003), 185-193.

[4] J. Lyons-Weiler, S. Patel, M. Becich and T. Godfrey, Tests for finding complex patterns of differential expression in cancers: towards individualized medicine, BMC Bioinformatics 5 (2004), 110-119.

[5] K. Rieger, W. Hong, V. Tusher, J. Tang, R. Tibshirani and G. Chu, Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage, PNAS 101 (2004), 6634-6640.

[6] S. Choe, M. Boutros, A. Michelson, G. Church and M. Halfon, Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset, Genome Biology 6 (2005), R16.

[7] L. Shi, W. Tong, H. Fang, U. Scherf, J. Han, R. Puri, F. Frueh, F. Goodsaid, L. Guo, Z. Su, T. Han, J. Fuscoe, Z. Xu, T. Patterson, H. Hong, Q. Xie, R. Perkins, J. Chen and D. Casciano, Cross-platform comparability of microarray technology: Intraplatform consistency and appropriate data analysis procedures are essential, BMC Bioinformatics 6 (2005), S12.

[8] S. Tomlins, D. Rhodes, S. Perner, S. Dhanasekaran, R. Mehra, X. Sun, S. Varambally, X. Cao, J. Tchinda, R. Kuefer, and others, Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer, Science 310 (2005), 644-648.

[9] L. Guo, E. Lobenhofer, C. Wang, R. Shippy, S. Harris, L. Zhang, N. Mei, T. Chen, D. Herman, F. Goodsaid, P. Hurban, K. Phillips, J. Xu, X. Deng, Y. Sun, W. Tong, Y. Dragan and L. Shi, Rat toxicogenomic study reveals analytical consistency across microarray platforms, Nature Biotechnology 24 (2006), 1162-1169.

[10] F. Liu and B. Wu, Multi-group cancer outlier differential gene expression detection, Computational Biology and Chemistry 31 (2007), 65-71.

[11] R. Tibshirani and T. Hastie, Outlier sums for differential gene expression analysis, Biostatistics 8(1) (2007), 2-8.

[12] D. Witten and R. Tibshirani, A comparison of fold change and the t-statistic for microarray data analysis, Technical report, Stanford University, 2007.

[13] B. Wu, Cancer outlier differential gene expression detection, Biostatistics 8(3) (2007), 566-575.

[14] J. Luo, Robust tail-sum: differential gene detection in two-group cancer microarrays, InterStat, 2009. Available at http://interstat.statjournals.net/YEAR/2009/abstracts/0910001.php.
