Modified F-statistic: multi-group cancer outlier differential gene detection
June Luo †
Department of Applied Economics and Statistics
Clemson University
Clemson, SC 29634
Detecting differentially expressed (DE) genes that are over-expressed in some but not all samples of a
disease group has become increasingly useful in cancer studies, where oncogenes are activated in only a
minority of samples. Few methods address this problem for multi-group microarrays, so more applicable
and efficient methods are needed. With the intention of minimizing the variation and reducing the
standard error of the statistic, I propose a new method, the modified F-statistic (MF), to detect DE
genes in two-group or multi-group cancer microarrays. In simulated examples, I compare MF with the
traditional F-statistic and with the outlier robust F-statistic and the outlier F-statistic of [10].
The comparison covers the receiver operating characteristic (ROC) score, the accuracy rate and the
reproducibility of gene lists. MF consistently performs as the best or second-best method as the number
of oncogene outliers in the disease groups varies. In a real example, the new method provides useful
findings and confirms literature results on over-expressed genes.
Keywords: cancer microarray; robust; outlier; differential gene detection; standard error; ROC curve;
accuracy rate; concordance
1 Introduction
“Oncogene outliers” are genes that show systematically increased expression, but only in a small number of cancer samples. Since the discovery of oncogenes, several proposals have been made for detecting differentially expressed (DE) genes in two-class microarray studies, such as [4]. One widely used approach is to compute a t-statistic $T_i$ for each gene and call the gene DE if $|T_i|$ exceeds a certain threshold. Biologists are fond of fold-change methods ([6, 9]) for their simplicity, interpretability and reproducibility of gene lists, as in [7]. The authors of [12] argued that reproducibility does not imply accuracy and demonstrated that the best method depends on the data set.
Recently, [8] introduced a method called “cancer outlier profile analysis” (COPA) for detecting DE
genes. That study shows that COPA can be more powerful than the traditional t-statistic when only a
subset of the disease samples over-express a gene.
More progress has been made in detecting DE genes with more robust statistics. [11] introduced the
outlier-sum (OS) statistic and showed that OS had smaller false discovery rates (FDR) than COPA on the
skin data of [5]. Later, [13] proposed the outlier robust t-statistic (ORT) and [14] proposed the robust
tail-sum (RTS). In simulations, both ORT and RTS have shown better overall performance than
COPA and OS in terms of ROC scores.
Almost all of the above ranking criteria focus on two-class microarray data sets. Some of the existing methods have been extended to multi-group cancer outlier differential gene detection. With robustness in mind, Liu and Wu [10] proposed the outlier robust F-statistic (ORF) and the outlier F-statistic (OF), and clearly demonstrated the advantages of both statistics over the traditional F-statistic.

† Corresponding author. E-mail: [email protected]
To reduce the standard error and minimize the variation of the statistic, thereby improving the detection power of the ranking statistic, I propose the modified F-statistic (MF) for two-group or multi-group cancer microarray DE gene detection. In this paper, I discuss one special statistic in the MF family of statistics.
Simulations are performed to illustrate the detection power of the various statistics. The results show that the MF statistic attains the highest ROC score in various situations, while also showing high accuracy and high reproducibility of gene lists. Given the simplicity of the formula and the demonstrated detection power, I believe MF is a valuable contribution to the field of differential gene detection in multi-class microarrays.
2 The modified F-statistic
Consider a microarray of $G$ groups with $n$ total samples, comprising $n_g$ samples from group $g = 1, 2, \ldots, G$. Let $x_{ijg}$ be the expression value for gene $i = 1, 2, \ldots, p$ and sample $j = 1, 2, \ldots, n_g$ in group $g$. Without loss of generality, I assume that the first group is the control group and the other groups are disease groups. The F-statistic is probably the most often used ranking criterion to identify DE genes in multi-class microarrays. The formula is

$$
F_i = \frac{\sum_{g=1}^{G} n_g (\bar{x}_{ig} - \bar{x}_i)^2 / (G - 1)}{\sum_{g=1}^{G} \sum_{j=1}^{n_g} (x_{ijg} - \bar{x}_{ig})^2 / (n - G)}, \tag{1}
$$

where $\bar{x}_{ig} = \sum_{j=1}^{n_g} x_{ijg} / n_g$ and $\bar{x}_i = \sum_{g=1}^{G} \sum_{j=1}^{n_g} x_{ijg} / n$. The F-statistic uses the within-group mean and pooled variance. As a more robust statistic, the outlier robust F-statistic (ORF) in [10] replaces the mean with the median and the pooled variance with the median absolute deviation. Liu and Wu also formulated the outlier F-statistic (OF) by keeping outliers and removing all other expression values in disease groups. Both the ORF and OF statistics consider the difference between the control group and the disease groups. More specifically, their formulas are given by

$$
\mathrm{ORF}_i = \frac{\sum_{g=2}^{G} \sum_{j \in R_g} (x_{ijg} - m_{i1})}{\mathrm{median}\{|x_{ij1} - m_{i1}|_{j \le n_1}, \ldots, |x_{ijG} - m_{iG}|_{j \le n_G}\}}, \tag{2}
$$

where $m_{ig} = \mathrm{median}\{x_{ijg}, j \le n_g\}$ and $R_g = \{j \le n_g : x_{ijg} > q_{75}(x_{ij1}, j \le n_1) + \mathrm{IQR}(x_{ij1}, j \le n_1)\}$ for all $g = 1, 2, \ldots, G$, where $q_r(\cdot)$ and $\mathrm{IQR}(\cdot)$ are the $r$th percentile and the interquartile range of the sequence in parentheses, respectively, and
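
$$
\mathrm{OF}_i = \frac{\left[ n_1 (\bar{x}_{i1} - \tilde{x}_i)^2 + \sum_{g=2}^{G} \tilde{n}_g (\tilde{x}_{ig} - \tilde{x}_i)^2 \right] / (\tilde{G} - 1)}{\left[ \sum_{j=1}^{n_1} (x_{ij1} - \bar{x}_{i1})^2 + \sum_{g=2}^{G} \sum_{j \in R_g} (x_{ijg} - \tilde{x}_{ig})^2 \right] / (\tilde{n} - \tilde{G})}, \tag{3}
$$

where $\tilde{n} = n_1 + \sum_{g=2}^{G} \tilde{n}_g$, $\tilde{x}_{ig} = \sum_{j \in R_g} x_{ijg} / \tilde{n}_g$, $\tilde{x}_i = \left( n_1 \bar{x}_{i1} + \sum_{g=2}^{G} \tilde{n}_g \tilde{x}_{ig} \right) / \left( n_1 + \sum_{g=2}^{G} \tilde{n}_g \right)$, $\tilde{n}_g$ is the cardinality of the set $R_g$, and $\tilde{G}$ counts the number of nonempty classes.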
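The classical F-statistic in (1) and the ORF statistic in (2) can be sketched for a single gene as follows. This is a minimal NumPy illustration, not the implementation of [10]: the function names are hypothetical, and the percentiles use NumPy's default linear interpolation, which is an assumption because the text does not say how $q_{75}$ and the IQR are estimated.

```python
import numpy as np

def f_statistic(groups):
    """Classical one-way F-statistic, equation (1), for one gene.

    `groups` is a list of 1-D arrays of expression values;
    groups[0] is the control group, the rest are disease groups.
    """
    n = sum(len(x) for x in groups)
    G = len(groups)
    grand_mean = np.concatenate(groups).mean()
    between = sum(len(x) * (x.mean() - grand_mean) ** 2 for x in groups) / (G - 1)
    within = sum(((x - x.mean()) ** 2).sum() for x in groups) / (n - G)
    return between / within

def outlier_masks(groups):
    """Boolean masks for R_g, g >= 2: values above q75 + IQR of the control group."""
    q25, q75 = np.percentile(groups[0], [25, 75])  # default interpolation (assumption)
    cutoff = q75 + (q75 - q25)
    return [x > cutoff for x in groups[1:]]

def orf_statistic(groups):
    """Outlier robust F-statistic, equation (2), for one gene."""
    m1 = np.median(groups[0])
    masks = outlier_masks(groups)
    numerator = sum((x[mask] - m1).sum() for x, mask in zip(groups[1:], masks))
    # Pool the absolute deviations from each group's own median, then take one median.
    abs_dev = np.concatenate([np.abs(x - np.median(x)) for x in groups])
    return numerator / np.median(abs_dev)

control = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
disease = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
f_value = f_statistic([control, disease])
orf_value = orf_statistic([control, disease])
```

In this toy example only the value 10 exceeds the control-based cutoff, so the ORF numerator depends on that single outlier, while the F-statistic is diluted by the large within-group variance the same outlier creates.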
The ORF statistic uses the median of $G$ sequences of deviations. By taking the median of each
sequence of deviations and then taking the average of the $G$ median absolute deviation values, I expect
to reduce the standard error of the new statistic. To further minimize the variation, I borrow the idea in
[1] and use a constant $s_0$. So I propose the modified F-statistic defined by

$$
\mathrm{MF}_i = \frac{\sum_{g=2}^{G} \sum_{j \in R_g} (x_{ijg} - m_{i1})}{\mathrm{mad}(x_{ij1}, j \le n_1) + \mathrm{mad}(x_{ij2}, j \le n_2) + \cdots + \mathrm{mad}(x_{ijG}, j \le n_G) + s_0}, \tag{4}
$$

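where $s_0$ is a constant chosen to minimize the variation. In the special situation where $s_0$ is extremely large, the denominator of (4) is dominated by $s_0$ and is therefore nearly the same for every gene, so ranking genes by (4) is equivalent to ranking them by the numerator alone. This gives the simplest ranking statistic,

$$
\mathrm{MF}_i = \sum_{g=2}^{G} \sum_{j \in R_g} (x_{ijg} - m_{i1}). \tag{5}
$$
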
The above MF statistic in (5) is interpreted as the adjusted sum of deviations of outliers in disease groups.
I will demonstrate the detection power of the new method defined by (5) through simulation results.
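As an illustration of (4) and (5), the sketch below computes the MF statistic for one gene. It is a hedged example rather than the author's code: the function names and the `s0=None` convention for selecting (5) are hypothetical, the median absolute deviation is taken unscaled (no 1.4826 consistency factor), and the percentile interpolation is NumPy's default; none of these details are fixed by the text.

```python
import numpy as np

def mad(x):
    """Unscaled median absolute deviation (the scaling is an assumption)."""
    return np.median(np.abs(x - np.median(x)))

def outlier_sum(groups):
    """Numerator shared by (4) and (5).

    Sums x - m_i1 over disease-group values that exceed q75 + IQR
    of the control group, groups[0].
    """
    control = groups[0]
    q25, q75 = np.percentile(control, [25, 75])
    cutoff = q75 + (q75 - q25)
    m1 = np.median(control)
    return sum((x[x > cutoff] - m1).sum() for x in groups[1:])

def mf_statistic(groups, s0=None):
    """Equation (4) when s0 is a number, equation (5) when s0 is None."""
    numerator = outlier_sum(groups)
    if s0 is None:
        return numerator
    return numerator / (sum(mad(x) for x in groups) + s0)

control = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
disease = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
mf_eq5 = mf_statistic([control, disease])          # outlier sum only
mf_eq4 = mf_statistic([control, disease], s0=2.0)  # s0 = 2.0 is an arbitrary choice
```

With a very large `s0`, the value from (4) is approximately the value from (5) divided by `s0`, so both orderings of the genes agree.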
3 Simulation study and demonstration of detection power
I carried out a number of simulations to assess the performance of the F-statistic, the outlier robust F-statistic, the outlier F-statistic and the modified F-statistic. Suppose we have $G = 4$ classes and each class has 20 samples. The first class is the control group. I generated the expression data with $p = 1000$ gene values for each sample. For the non-DE genes, the gene values are independently generated from $N(0, 1)$ for all samples. For DE genes, the gene values for samples in the normal class are randomly drawn from $N(0, 1)$ and the gene values for samples in disease group $g = 2, 3, 4$ are simulated from the mixture normal distribution $(1 - \pi_g) N(0, 1) + \pi_g N(\mu_g, 1)$ with $\pi_g \in [0, 1]$ and $\mu_g \in (0, \infty)$. For simplicity, I used $\pi_2 = \pi_3 = \pi_4 = \pi$ with $\pi = 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7$. I also used $\mu_2 = 1$, $\mu_3 = 1.5$ and $\mu_4 = 2$ as the over-expression magnitudes for the three disease groups $g = 2, 3, 4$, respectively.
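The data-generating design described above can be sketched as follows. The function name, the array layout and the seed are hypothetical choices made for illustration; the sketch also assumes that outlier samples are drawn independently for each DE gene, which the text does not specify.

```python
import numpy as np

def simulate(rng, p=1000, n_per_group=20, m=1, pi=0.2, mu=(1.0, 1.5, 2.0)):
    """Simulate one data set of shape (groups, genes, samples).

    Group 0 is the control; groups 1..len(mu) are disease groups.
    The first m genes are DE: each disease sample of a DE gene is drawn from
    N(mu_g, 1) with probability pi and from N(0, 1) otherwise.
    """
    n_groups = len(mu) + 1
    data = rng.standard_normal((n_groups, p, n_per_group))
    for g, shift in enumerate(mu, start=1):
        # Adding mu_g to an N(0, 1) draw with probability pi gives the mixture
        # (1 - pi) N(0, 1) + pi N(mu_g, 1).
        activated = rng.random((m, n_per_group)) < pi
        data[g, :m, :] += shift * activated
    return data

rng = np.random.default_rng(2024)  # arbitrary seed
data = simulate(rng, m=1, pi=0.2)
```

Shifting a standard normal draw by a Bernoulli-selected constant is one simple way to sample from the two-component mixture without drawing the components separately.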
When only the first gene is over-expressed in all disease groups ($m = 1$), I computed, for each $\pi$ value, the P-value used in [11]: the proportion of genes with a statistic
greater than that for gene 1. A smaller P-value increases the chance that we will identify the first gene
as a DE gene. The process was repeated 50 times. The mean, median and standard deviation of the
P-values are shown in Table 1. For almost all $\pi$ values, the modified F-statistic showed a greater ability to
detect over-expressed genes than all other methods, as indicated by its lower P-values.
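The rank-based P-value and its summary over replications can be sketched as below. The function names are hypothetical, the proportion is taken over all $p$ genes (gene 1 included in the denominator), and the standard deviation uses the sample formula with `ddof=1`; both conventions are assumptions, since the text does not state them.

```python
import numpy as np

def rank_p_value(stats, gene=0):
    """Proportion of genes whose statistic exceeds that of `gene` (index 0 = gene 1)."""
    stats = np.asarray(stats, dtype=float)
    return float(np.mean(stats > stats[gene]))

def summarize(p_values):
    """Mean, median and sample standard deviation over the replications."""
    p_values = np.asarray(p_values, dtype=float)
    return p_values.mean(), np.median(p_values), p_values.std(ddof=1)

# Toy example: gene 1 has statistic 3.0 and two of the four genes score higher.
example_p = rank_p_value([3.0, 5.0, 1.0, 4.0])
example_summary = summarize([0.1, 0.2, 0.3])
```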
Table 1. Results of simulation study when m = 1: mean, median, and standard deviation
of P-values for gene 1, over 50 simulations

         π = 0.05                π = 0.1                 π = 0.2                 π = 0.3
Method   Mean   Median  SD       Mean   Median  SD       Mean   Median  SD       Mean   Median  SD
F        0.523  0.601   0.306    0.433  0.395   0.305    0.353  0.337   0.260    0.316  0.262   0.277
ORF      0.354  0.317   0.241    0.259  0.191   0.238    0.179  0.092   0.201    0.193  0.092   0.217
OF       0.346  0.332   0.241    0.266  0.186   0.241    0.178  0.108   0.197    0.160  0.114   0.180
MF       0.312  0.291   0.231    0.200  0.157   0.186    0.133  0.079   0.165    0.135  0.063   0.163

Continued results of Table 1

         π = 0.4                 π = 0.5                 π = 0.6                 π = 0.7
Method   Mean   Median  SD       Mean   Median  SD       Mean   Median  SD       Mean   Median  SD
F        0.158  0.061   0.215    0.086  0.055   0.118    0.057  0.010   0.143    0.013  0.002   0.026
ORF      0.129  0.053   0.181    0.074  0.016   0.152    0.062  0.009   0.116    0.025  0.004   0.063
OF       0.074  0.041   0.116    0.074  0.027   0.129    0.045  0.022   0.065    0.018  0.009   0.028
MF       0.072  0.015   0.124    0.044  0.003   0.101    0.039  0.002   0.090    0.015  0.001   0.054

When I used $m = 100$ as the number of DE genes and set the first 100 genes as the DE genes, I
calculated the true/false-positive rates and constructed the receiver operating characteristic (ROC) curve
for comparison. Figure 1 shows the estimated true/false-positive rates based on 50 replications for each
$\pi = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6$. Each point on the curves is the average of 50 true/false-positive rates at
a given threshold for calling a gene DE. A similar pattern of ROC curves was observed for $m = 200$ and $m = 300$.
Unlike the F-statistic, whose performance changes dramatically as $\pi$ decreases, or ORF, whose performance
is consistently second best, MF performed best under a wide range of conditions. Simulations
were also run with different values of $\pi_2$, $\pi_3$ and $\pi_4$, and patterns similar to those in Figure 1 were observed. The
MF statistic appears to combine the advantages of the F-statistic and the outlier robust F-statistic
by efficiently reducing the coefficient of variation of the statistic.
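The true/false-positive rates behind an ROC curve can be computed from one vector of gene statistics as sketched below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the threshold is swept by calling the top k genes DE for k = 0, ..., p, ties are broken by gene index, and the area under the curve uses the trapezoidal rule. Averaging the rates over 50 replications, as done for Figure 1, is not shown.

```python
import numpy as np

def roc_points(stats, is_de):
    """False- and true-positive rates when the top k genes are called DE, k = 0..p."""
    stats = np.asarray(stats, dtype=float)
    is_de = np.asarray(is_de, dtype=bool)
    order = np.argsort(-stats, kind="stable")  # largest statistic first
    hits = is_de[order]
    tpr = np.concatenate([[0.0], np.cumsum(hits) / is_de.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(~hits) / (~is_de).sum()])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Toy example: genes 1 and 3 are truly DE.
fpr, tpr = roc_points([4.0, 3.0, 2.0, 1.0], [True, False, True, False])
area = auc(fpr, tpr)
```

In the toy example three of the four (DE, non-DE) gene pairs are ordered correctly, which matches the resulting area of 0.75.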
When $m = 100$, I used each method to obtain a list of the 100 most differential genes and computed the
accuracy rate. I repeated the process 50 times and averaged the 50 accuracy rates for each
method. Figure 2 shows the results for a sequence of equally spaced $\pi$ values in $[0, 0.6]$. For almost all $\pi$ values, MF
showed the highest accuracy rate.
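One plausible reading of the accuracy rate is the fraction of the top-ranked genes that are truly DE; the text does not give the formula, so the sketch below is an assumption, and the function names are hypothetical.

```python
import numpy as np

def top_k(stats, k):
    """Indices of the k genes with the largest statistic (ties broken by index)."""
    order = np.argsort(-np.asarray(stats, dtype=float), kind="stable")
    return set(order[:k].tolist())

def accuracy_rate(stats, true_de, k=100):
    """Fraction of the top-k list that belongs to the set of truly DE genes."""
    return len(top_k(stats, k) & set(true_de)) / k

# Toy example with k = 2: the top two genes are 0 and 2, and only gene 0 is truly DE.
toy_accuracy = accuracy_rate([5.0, 1.0, 4.0, 2.0, 3.0], true_de={0, 1}, k=2)
```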
After examining the ROC score and accuracy rate, I ran a simulation to investigate the reproducibility of the gene list produced by a given statistic. Reproducibility is measured by the concordance between the list of the 100 most differentially expressed genes in one half of a data set and the corresponding list in the other half, both obtained with the same statistic. For each value of $\pi$ in a sequence of equally spaced values in $[0, 0.6]$, I computed the concordance of the two gene lists; Figure 3 illustrates the reproducibility in various situations. MF outperformed
all other methods by showing the highest concordance rate in almost all situations.

Figure 1: ROC curves estimated based on simulations. Various π values are chosen. (Six panels, n = 20 with π = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6; horizontal axis: false positive; vertical axis: true positive; curves: F, ORF, OF, MF.)

Figure 2: accuracy comparison. (Horizontal axis: π values; vertical axis: accuracy; curves: F, ORF, OF, MF.)

Figure 3: reproducibility of gene lists. (Horizontal axis: π values; vertical axis: concordance; curves: F, ORF, OF, MF.)

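The reproducibility measure described above can be sketched as follows. The names are hypothetical, and two details are assumptions not stated in the text: the halves are formed by randomly splitting the samples within every group, and the concordance is the number of shared genes divided by the list length.

```python
import numpy as np

def split_halves(data, rng):
    """Split an array of shape (groups, genes, samples) into two halves of the samples."""
    n = data.shape[2]
    perm = rng.permutation(n)
    return data[:, :, perm[: n // 2]], data[:, :, perm[n // 2:]]

def concordance(stats_a, stats_b, k=100):
    """Share of genes common to the two top-k lists."""
    top_a = set(np.argsort(-np.asarray(stats_a, dtype=float), kind="stable")[:k].tolist())
    top_b = set(np.argsort(-np.asarray(stats_b, dtype=float), kind="stable")[:k].tolist())
    return len(top_a & top_b) / k

# Toy example with k = 2: the lists {0, 1} and {0, 2} share one gene.
toy_concordance = concordance([5.0, 4.0, 3.0, 2.0, 1.0], [5.0, 1.0, 4.0, 2.0, 3.0], k=2)
half_a, half_b = split_halves(np.zeros((4, 10, 20)), np.random.default_rng(0))
```

In practice the same ranking statistic would be computed separately on `half_a` and `half_b`, and the two vectors of statistics passed to `concordance`.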
These simulations suggest that the modified F-statistic can provide more accurate findings than the F-statistic, ORF and OF under various conditions. Given the simplicity of the formula and its demonstrated detection power, it is a useful addition to the existing methods for detecting DE genes in
multi-group microarrays. I illustrate the MF statistic on public breast cancer microarray
data in Section 4.
4 Application to breast cancer data via modified F-statistic
The breast cancer microarray data in [2] contain 7129 gene expression levels for a total of 49 breast
tumor samples. The raw data can be downloaded from http://data.cgt.duke.edu/west.php. Each sample
has two binary outcomes: the status of lymph node involvement in breast cancer (negative or positive, denoted LN-/LN+) and the estrogen receptor status (negative or positive, denoted ER-/ER+).
Among the tumor samples, there are 13 ER+/LN+ samples, 11 ER-/LN+ samples, 12 ER+/LN- samples
and 13 ER-/LN- samples. I processed the data using the quantile normalization of [3] before the statistical analysis. For DE gene detection, I treat the ER-/LN- group as the normal class and apply the
F-statistic, the outlier robust F-statistic, the outlier F-statistic and the proposed modified F-statistic. I picked
the top 50 genes identified by each method; in total, 185 distinct genes were picked by the four methods.
Table 2 shows the number of genes shared by each pair of methods. MF and ORF
have the most genes in common, while the outlier F-statistic has the fewest genes in common
with the other three methods. I recommend further study of the 8 genes that both ORF and MF
identified.
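For readers unfamiliar with the preprocessing step mentioned above, the basic idea of quantile normalization can be sketched as follows. This is a simplified illustration of the general technique, not the exact procedure of [3] or the code used for this analysis: the function name is hypothetical and tied values are not averaged.

```python
import numpy as np

def quantile_normalize(x):
    """Give every sample (column) of a genes-by-samples matrix the same distribution.

    Each value is replaced by the mean, across samples, of the values that share
    its within-sample rank.
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0)            # rank of each entry within its column
    reference = np.sort(x, axis=0).mean(axis=1)  # mean of the k-th smallest values
    return reference[ranks]

raw = np.array([[5.0, 4.0],
                [2.0, 1.0],
                [3.0, 6.0]])
normalized = quantile_normalize(raw)
```

After normalization each column contains the same set of values (here 1.5, 3.5 and 5.5), arranged according to the original within-column ranks.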
Table 2: number of overlapped genes from the top 50 genes

       F     ORF   OF    MF
F      50    3     0     3
ORF    3     50    2     8
OF     0     2     50    0
MF     3     8     0     50

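A pairwise overlap table like Table 2 can be built from the top-gene lists of each method as sketched below. The gene identifiers are made-up placeholders, not genes from the breast cancer data, and the function names are hypothetical.

```python
def overlap_table(top_lists):
    """Number of genes shared by each pair of methods, as a nested dictionary."""
    names = list(top_lists)
    return {
        a: {b: len(set(top_lists[a]) & set(top_lists[b])) for b in names}
        for a in names
    }

def distinct_genes(top_lists):
    """Number of distinct genes picked by at least one method."""
    return len(set().union(*top_lists.values()))

# Placeholder gene identifiers for illustration only.
toy_lists = {
    "F": ["g1", "g2", "g3"],
    "MF": ["g2", "g3", "g4"],
    "OF": ["g7", "g8", "g9"],
}
toy_table = overlap_table(toy_lists)
toy_total = distinct_genes(toy_lists)
```

The diagonal of the table is the list length for each method, which corresponds to the value 50 in Table 2.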
Acknowledgments
I would also like to thank Dr. Dubsky and the reviewers for their time.
References
[1] V. Tusher, R. Tibshirani and G. Chu, Significance analysis of microarrays applied to the ionizing
radiation response, PNAS 98 (2001), 5116-5121.
[2] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. Olson, J. Marks
and J. Nevins, Predicting the clinical status of human breast cancer by using gene expression
profiles, PNAS 98(20) (2001), 11462-11467.
[3] B. Bolstad, R. Irizarry, M. Astrand and T. Speed, A comparison of normalization methods for high
density oligonucleotide array data based on variance and bias, Bioinformatics 19(2) (2003), 185-193.
[4] J. Lyons-Weiler, S. Patel, M. Becich and T. Godfrey, Tests for finding complex patterns of differential expression in cancers: towards individualized medicine, BMC Bioinformatics 5 (2004),
110-119.
[5] K. Rieger, W. Hong, V. Tusher, J. Tang, R. Tibshirani and G. Chu, Toxicity from radiation therapy
associated with abnormal transcriptional responses to DNA damage, PNAS 101 (2004), 6634-6640.
[6] S. Choe, M. Boutros, A. Michelson, G. Church and M. Halfon, Preferred analysis methods for
Affymetrix GeneChips revealed by a wholly defined control dataset, Genome Biology 6 (2005),
R16.
[7] L. Shi, W. Tong, H. Fang, U. Scherf, J. Han, R. Puri, F. Frueh, F. Goodsaid, L. Guo, Z. Su, T. Han, J.
Fuscoe, Z. Xu, T. Patterson, H. Hong, Q. Xie, R. Perkins, J. Chen and D. Casciano, Cross-platform
comparability of microarray technology: Intraplatform consistency and appropriate data analysis
procedures are essential, BMC Bioinformatics 6 (2005), S12.
[8] S. Tomlins, D. Rhodes, S. Perner, S. Dhanasekaran, R. Mehra, X. Sun, S. Varambally, X. Cao, J.
Tchinda, R. Kuefer and others, Recurrent fusion of TMPRSS2 and ETS transcription factor genes
in prostate cancer, Science 310 (2005), 644-648.
[9] L. Guo, E. Lobenhofer, C. Wang, R. Shippy, S. Harris, L. Zhang, N. Mei, T. Chen, D. Herman, F.
Goodsaid, P. Hurban, K. Phillips, J. Xu, X. Deng, Y. Sun, W. Tong, Y. Dragan and L. Shi, Rat toxicogenomic study reveals analytical consistency across microarray platforms, Nature Biotechnology
24 (2006), 1162-1169.
[10] F. Liu and B. Wu, Multi-group cancer outlier differential gene expression detection, Computational
Biology and Chemistry 31 (2007), 65-71.
[11] R. Tibshirani and T. Hastie, Outlier sums for differential gene expression analysis, Biostatistics 8(1)
(2007), 2-8.
[12] D. Witten and R. Tibshirani, A comparison of fold change and the t-statistic for microarray data
analysis, Technical report, Stanford University, 2007.
[13] B. Wu, Cancer outlier differential gene expression detection, Biostatistics 8(3) (2007), 566-575.
[14] J. Luo, Robust tail-sum: differential gene detection in two-group cancer microarrays, InterStat,
2009. Available at http://interstat.statjournals.net/YEAR/2009/abstracts/0910001.php.