A Comparative Study of Exploratory Analysis versus
Statistical Analysis for Oligo-DNA Microarrays
Kenichiro Kanda (1), [email protected]
Kaoru Azumi (2,3), [email protected]
Nori Satoh (3,4), [email protected]
Sorin V. Sabau (1), [email protected]

(1) Hokkaido Tokai University, 5-1-1-1 Minamisawa, Minami-ku, Sapporo 005-8601, Japan
(2) Division of Innovative Research, Creative Research Initiative “Sousei”, Hokkaido University, Sapporo
    001-0021, Japan
(3) CREST, Japan Science and Technology Agency, Saitama, 330-0012, Japan
(4) Department of Zoology, Graduate School of Science, Kyoto University, Kyoto, 606-8502, Japan
Keywords: gene expression data, Bayesian inference, fold difference
1 Introduction
Large-scale oligo-DNA microarrays are commonly used to investigate and identify genes with specific or
correlated expression profiles. A classical exploratory analysis of gene expression data chooses a threshold
for weak fluorescence signals, say 500, discards all probes whose signal falls below it, then chooses a second
threshold, say a two-fold differential expression, and retains the genes whose differential expression exceeds
it. From a statistical point of view, however, this approach is unsatisfactory for at least two reasons. First,
the fold ratio estimate cannot take into account the variability between the individuals involved in a given
experiment; second, it does not consider the sample size. Moreover, it is hard to find a valid criterion for
choosing the two thresholds.
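The classical two-threshold filter described above can be sketched in a few lines of Python. This is only an illustration: the function name, the tuple layout, and the gene names are hypothetical, and the rule that a probe is discarded only when both channels fall below the signal threshold is an assumption borrowed from Section 2.2.

```python
import math

def fold_filter(probes, min_signal=500.0, min_fold=2.0):
    """Return names of probes passing both exploratory thresholds.

    probes: iterable of (name, sample_signal, reference_signal) tuples
    holding background-subtracted intensities on the raw scale.
    """
    log_fold_cutoff = math.log2(min_fold)
    selected = []
    for name, sample, reference in probes:
        # Assumption: a probe is weak only if BOTH channels are below the threshold.
        if sample < min_signal and reference < min_signal:
            continue
        log_ratio = math.log2(sample / reference)
        # Keep probes that change by more than min_fold in either direction.
        if abs(log_ratio) > log_fold_cutoff:
            selected.append(name)
    return selected

# Made-up intensities, for illustration only.
probes = [
    ("gene_a", 4000.0, 1000.0),  # four-fold up, strong signal
    ("gene_b", 900.0, 800.0),    # strong signal, small change
    ("gene_c", 400.0, 100.0),    # four-fold up, but both channels weak
    ("gene_d", 600.0, 3000.0),   # five-fold down, strong signal
]
selected = fold_filter(probes)
print(selected)  # ['gene_a', 'gene_d']
```

Note that nothing in this filter depends on replicate variability or on the number of samples, which is exactly the weakness pointed out above.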
For these reasons statistical analysis methods are preferred [1], i.e. it is preferable to decide the differential
expression of a gene by a hypothesis test. Here the null hypothesis is that there is no biological effect, i.e. the
expression of the gene in the case sample does not differ from that in the reference sample. If the null
hypothesis is true, the variability present in the data does not represent the biological effect under study but
arises from individual differences and measurement errors. However, the statistical analysis does not take into
account the biological character of the data, i.e. the fact that microarray data actually show the fold
difference between two samples.
Finding it difficult to choose between these two approaches, we perform here a comparative study of
exploratory analysis versus statistical analysis for a set of 36 oligo-DNA microarray slides (Agilent
Technologies), with dye swap, covering the C. intestinalis (sea squirt) life cycle [2]. We suggest that a
combined approach of exploratory and statistical analysis may lead to a data set of genes that both show a
satisfactory fold difference and are significantly differentially expressed.
2 Methods and Results
2.1 Data Filtering
Microarray data were analyzed using the GenePix Pro 4.0 software (Axon Instruments). After removing the
flagged features and subtracting the background value, we took the base-2 logarithm of the intensity signals
so that the data are approximately normally distributed. Furthermore, instead of averaging the signals, we fit
a linear model to the dye swap pairs using the function lmFit in the limma package of Bioconductor [3].
Normalization was performed in order to remove sources of systematic variation other than differential
expression: both a within-array normalization and a between-array normalization using the quantile method
were performed.
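To make the between-array quantile step concrete, the following is a minimal Python sketch of textbook quantile normalization. It is not the limma implementation used in the study: the function name is hypothetical, ties are broken by probe position, and missing values are not handled.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equally long lists of log2 intensities.

    arrays: list of arrays, each a list of values for the same probes.
    Returns new lists in which every array has the same value distribution.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    # Reference distribution: mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # Rank the probes within this array (ties broken by position).
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            # Replace each value by the reference value of the same rank.
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Two toy arrays measuring the same three probes (made-up values).
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(normalized)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After this step each array keeps its own probe ranking but shares one common set of values, which removes array-wide intensity shifts without touching the ordering within an array.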
2.2 Differential Gene Expression Analysis
In order to identify the genes differentially expressed across life stages in C. intestinalis, we first select the
differentially expressed genes not on the basis of the fold ratio but on their p-values, i.e. the probability of
observing data at least as extreme as those measured if the null hypothesis were true. Since the differential
expression of genes with small p-values (say p-value < 0.01) is unlikely to have occurred by chance, we take
the behavior of those genes to reflect the biological effect being tested. Second, a cross-validation with a fold
ratio exploratory analysis is performed.
Namely, we identified the differentially expressed genes by an empirical Bayes approach using the eBayes
function in the limma package. The resulting test statistic is a moderated t-statistic in which the variance
estimated for each single gene is replaced by a weighted average of that variance and a variance estimated
from all genes [3]. A gene with a p-value less than 0.01 was considered differentially expressed.
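For reference, the moderated t-statistic is usually written as follows. The notation is the standard one from the limma literature and does not appear in this abstract, so every symbol here is introduced for illustration: s_g^2 is the residual variance of gene g with d_g degrees of freedom, s_0^2 and d_0 are the prior variance and prior degrees of freedom estimated from all genes, \hat{\beta}_g is the estimated log2 fold difference, and v_g is the unscaled variance factor determined by the design.

```latex
\tilde{s}_g^{2} = \frac{d_0\, s_0^{2} + d_g\, s_g^{2}}{d_0 + d_g},
\qquad
\tilde{t}_g = \frac{\hat{\beta}_g}{\tilde{s}_g \sqrt{v_g}}
```

Under the null hypothesis the moderated statistic follows a t-distribution with d_0 + d_g degrees of freedom, so genes measured on few slides borrow information from the whole array when their p-values are computed.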
A sensitivity analysis shows that a lower cutoff threshold of 300 for the fluorescence signal, together with a
two-fold differential expression threshold, performs best. These two thresholds, i.e. the cutoff of 300 and the
two-fold difference, were determined from the data set under study by requiring that the number of genes with
a high absolute fold difference falling in the stably expressed group (p-value >= 0.01) be minimal.
After discarding the clones whose Cy3 and Cy5 fluorescence intensities are both under 300, combining a
two-fold analysis of the log ratios with the statistical analysis described above divides the remaining clones
into three groups: the up-regulated genes, which are differentially expressed with p-values < 0.01 (SIG) and
at the same time two-fold up-regulated; the down-regulated genes, which are differentially expressed with
p-values < 0.01 (SIG) and at the same time two-fold down-regulated; and the stably expressed genes, which
are not significantly differentially expressed (p-values >= 0.01, UNSIG) and at the same time change by less
than two-fold (see Table 1).
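The grouping rule can be sketched in Python as below. The function name, the tuple layout, the group labels, and the clone names are hypothetical. The choice to count a change of exactly two-fold as up- or down-regulated is an assumption, as is the "OTHER" label for clones that fall outside the three groups. The log ratios and p-values are taken as inputs, e.g. from the linear model fit, rather than recomputed from the raw channels.

```python
import math

def classify(clones, cutoff=300.0, alpha=0.01, fold=2.0):
    """Map each clone name to a group label.

    clones: iterable of (name, cy3, cy5, log2_ratio, p_value) tuples.
    """
    log_fold = math.log2(fold)
    groups = {}
    for name, cy3, cy5, log_ratio, p_value in clones:
        # Discard clones whose Cy3 and Cy5 intensities are both under the cutoff.
        if cy3 < cutoff and cy5 < cutoff:
            groups[name] = "FILTERED"
            continue
        significant = p_value < alpha
        if significant and log_ratio >= log_fold:
            groups[name] = "UP"        # SIG and two-fold up-regulated
        elif significant and log_ratio <= -log_fold:
            groups[name] = "DOWN"      # SIG and two-fold down-regulated
        elif not significant and abs(log_ratio) < log_fold:
            groups[name] = "STABLE"    # UNSIG and under two-fold change
        else:
            groups[name] = "OTHER"     # SIG/stable or UNSIG/two-fold
    return groups

# Made-up clones, for illustration only.
clones = [
    ("c1", 1200.0, 250.0, 2.3, 0.001),
    ("c2", 280.0, 1500.0, -2.4, 0.004),
    ("c3", 900.0, 850.0, 0.1, 0.40),
    ("c4", 100.0, 150.0, -0.6, 0.20),
    ("c5", 700.0, 650.0, 0.1, 0.002),
]
groups = classify(clones)
print(groups)
```

Clones labeled "OTHER" here correspond to the off-diagonal cells of a fold-by-significance cross-tabulation, i.e. the cases where the exploratory and the statistical criteria disagree.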
The obtained data set was used for clustering and further analysis. Using this approach we were able to
obtain new insights into the complex structure of the C. intestinalis gene expression, to identify efficiently
co-expressed gene groups and marker genes, and make predictions for the biological roles of many
uncharacterized genes (to be reported in a forthcoming paper).

Table 1: Statistics of genes extracted using a 300 cut off and a two-fold analysis.

              SIG     UNSIG   TOTAL
2FD_UP        8086    270     8356
2FD_DOWN      6538    281     6819
2FD_STABLE    1158    772     1930
TOTAL         15782   1323    17105
3 Discussion
We performed a comparative study of exploratory analysis versus statistical analysis for a set of
oligo-DNA microarrays. Our findings show that combining these two analysis methods allows the thresholds
used in exploratory analysis to be determined from a sensitivity analysis. Software that computes these
thresholds from gene expression data by an optimization method is currently under development.
References
[1] Stekel, D., Microarray Bioinformatics, Cambridge University Press, 2003.
[2] Ogasawara, M., Nakazawa, N., Azumi, K., Yamabe, E., Satoh, N., and Satake, M., Identification of
thirty-four transcripts expressed specifically in hemocytes of Ciona intestinalis and their expression
profiles throughout the life cycle, DNA Res., 13(1):25-35, 2006.
[3] Gentleman, R., Carey, V.J., Huber, W., Irizarry, R.A., and Dudoit, S., Bioinformatics and Computational
Biology using R and Bioconductor, Springer, 2005.