A Comparative Study of Exploratory Analysis versus Statistical Analysis for Oligo-DNA Microarrays

Kenichiro Kanda1, Kaoru Azumi2,3, Nori Satoh3,4, Sorin V. Sabau1

1 Hokkaido Tokai University, 5-1-1-1 Minamisawa, Minami-ku, Sapporo 005-8601, Japan
2 Division of Innovative Research, Creative Research Initiative “Sousei”, Hokkaido University, Sapporo 001-0021, Japan
3 CREST, Japan Science and Technology Agency, Saitama, 330-0012, Japan
4 Department of Zoology, Graduate School of Science, Kyoto University, Kyoto, 606-8502, Japan

Keywords: gene expression data, Bayesian inference, fold difference

1 Introduction

The use of large-scale oligo-DNA microarrays is a common technique for investigating and identifying genes with specific or correlated expression profiles. A classical exploratory analysis of gene expression data chooses a threshold for weak fluorescence signals, say 500, discards all probes whose signal falls below it, then chooses a second threshold, say two-fold differential expression, and keeps the genes whose differential expression exceeds it. From a statistical point of view, however, this approach is unsatisfactory for at least two reasons. First, the fold ratio estimate cannot take into account the variability between the individuals involved in a given experiment; second, the fold ratio approach does not consider the sample size. Moreover, it is hard to find a valid criterion for choosing the two thresholds. For these reasons statistical analysis methods are preferred [1], i.e. it is preferable to decide the differential expression of a gene by a hypothesis test. Here the null hypothesis is that there is no biological effect, i.e. the gene expression is not specific to the case sample when compared with a reference sample.
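The classical two-threshold exploratory analysis described above can be illustrated with a minimal Python sketch. The function name, the probe records and the choice to discard a probe only when both channels are weak are assumptions made for illustration; the defaults of 500 and two-fold are the example values quoted in the text, not recommended settings.

```python
import math

def exploratory_filter(probes, min_signal=500.0, min_fold=2.0):
    """Two-threshold exploratory filter (illustrative sketch only).

    probes: iterable of (probe_id, case_signal, reference_signal).
    Returns a list of (probe_id, log2_ratio) for the probes that pass.
    """
    log_fold = math.log2(min_fold)
    selected = []
    for probe_id, case, ref in probes:
        # A ratio is undefined for non-positive (background-subtracted) signals.
        if case <= 0 or ref <= 0:
            continue
        # Threshold 1 (assumption): drop probes that are weak in both channels.
        if case < min_signal and ref < min_signal:
            continue
        # Threshold 2: keep probes whose fold difference exceeds min_fold
        # in either direction (up or down).
        log_ratio = math.log2(case / ref)
        if abs(log_ratio) > log_fold:
            selected.append((probe_id, log_ratio))
    return selected

# Hypothetical signals, not data from the study.
probes = [
    ("p1", 1600.0, 400.0),   # strong signal, 4-fold up
    ("p2", 300.0, 100.0),    # 3-fold up, but weak in both channels
    ("p3", 1000.0, 900.0),   # strong signal, almost unchanged
    ("p4", 250.0, 1000.0),   # 4-fold down
]
result = exploratory_filter(probes)
print(result)  # [('p1', 2.0), ('p4', -2.0)]
```

Note that neither replicate variability nor sample size enters this computation, which is the weakness pointed out above.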
If the null hypothesis is true, the variability present in the data does not represent the biological effect under study but arises from individual differences and measurement errors. However, the statistical analysis does not take into account the biological characteristics of the data, i.e. the fact that microarray data actually show the fold difference between two samples. Finding it difficult to choose between these two approaches, we perform here a comparative study of exploratory analysis versus statistical analysis for a set of 36 oligo-DNA microarray slides (Agilent Technologies) with dye swap, covering the life cycle of C. intestinalis (sea squirt) [2]. We suggest that a combined approach of exploratory and statistical analysis may lead to a data set of genes that both show a satisfactory fold difference and are significantly differentially expressed.

2 Methods and Results

2.1 Data Filtering

Microarray data were analyzed using the GenePix Pro 4.0 software (Axon Instruments). After removing the flagged features and subtracting the background value, we took the logarithm (base 2) of the intensity signals in order to obtain approximately normally distributed data. Furthermore, instead of averaging the signals, we fit a linear model using the dye-swap pairs with the function lmFit in the limma package of Bioconductor [3]. Normalization was performed in order to remove sources of systematic variation other than differential expression: both a within-array normalization and a between-array normalization using the quantile method were carried out.

2.2 Differential Gene Expression Analysis

In order to identify the genes differentially expressed across life stages in C. intestinalis, we first select the differentially expressed genes not on the basis of the fold ratio but on their p-values, i.e. the probability of observing data at least as extreme as those obtained if the null hypothesis were true.
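As an aside on the preprocessing of Section 2.1, the log2 transform and the between-array quantile normalization can be sketched in plain Python as follows. This is not the limma implementation used in the study: the function name and the toy intensities are hypothetical, ties are broken arbitrarily, and missing values are not handled.

```python
import math

def quantile_normalize(arrays):
    """Between-array quantile normalization (simplified sketch).

    arrays: list of equal-length lists, one list of log2 signals per array.
    Every array is mapped onto the same reference distribution, namely the
    mean of the k-th smallest values taken across all arrays.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        # Indices of the array ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            # The value with a given rank receives the reference value of that rank.
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical background-subtracted intensities for two arrays.
raw = [[8.0, 2.0, 4.0], [16.0, 64.0, 4.0]]
log_arrays = [[math.log2(x) for x in a] for a in raw]
normalized = quantile_normalize(log_arrays)
print(normalized)  # [[4.5, 1.5, 3.0], [3.0, 4.5, 1.5]]
```

After normalization every array has the same set of values, so differences in overall intensity between slides no longer masquerade as differential expression.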
Since the differential expression of genes with small p-values (say p-value < 0.01) is unlikely to occur by chance, we take the behavior of those genes to reflect the biological effect being tested. Secondly, a cross-validation with a fold ratio exploratory analysis is performed.

Namely, we identified the differentially expressed genes by a Bayesian approach using the eBayes function in the limma package. The resulting test statistic is a moderated t-statistic in which, instead of the variance estimated from each single gene, a weighted average of that gene-wise variance and a prior variance estimated from all genes is used [3]. A gene with a p-value less than 0.01 was considered differentially expressed.

A sensitivity analysis shows that a lower cut-off threshold of 300 for the fluorescence signal, together with a two-fold differential expression threshold, performs best. These two thresholds, i.e. the cut-off of 300 and the two-fold difference, were determined from the data set under study by requiring that the number of genes with a high absolute fold difference included in the stably expressed group (p-value >= 0.01) be minimal.

After discarding the clones whose Cy3 and Cy5 fluorescence intensities are both under 300, combining the two-fold analysis of log ratios with the statistical analysis described above divides the remaining clones into three groups: the up-regulated genes, which are differentially expressed with p-values < 0.01 (SIG) and at the same time two-fold up-regulated; the down-regulated genes, which are differentially expressed with p-values < 0.01 (SIG) and at the same time two-fold down-regulated; and the stably expressed genes, which are not significantly differentially expressed (p-values >= 0.01, UNSIG) and at the same time show less than a two-fold change (see Table 1). The obtained data set was used for clustering and further analysis.

Table 1: Statistics of genes extracted using a 300 cut off and a two-fold analysis.
              SIG    UNSIG   TOTAL
2FD_UP        8086     270    8356
2FD_DOWN      6538     281    6819
2FD_STABLE    1158     772    1930
TOTAL        15782    1323   17105

Using this approach we were able to obtain new insights into the complex structure of the C. intestinalis gene expression, to identify co-expressed gene groups and marker genes efficiently, and to make predictions for the biological roles of many uncharacterized genes (to be reported in a forthcoming paper).

3 Discussion

We have performed a comparative study of exploratory analysis versus statistical analysis for a set of oligo-DNA microarrays. Our findings show that a combined approach of these two different analysis methods allows us to determine the thresholds used in exploratory analysis from a sensitivity analysis. Software that computes these thresholds by an optimization method from gene expression data is currently under development.

References

[1] Stekel, D., Microarray Bioinformatics, Cambridge University Press, 2003.
[2] Ogasawara, M., Nakazawa, N., Azumi, K., Yamabe, E., Satoh, N., and Satake, M., Identification of thirty-four transcripts expressed specifically in hemocytes of Ciona intestinalis and their expression profiles throughout the life cycle, DNA Res., 13(1):25-35, 2006.
[3] Gentleman, R., Carey, V.J., Huber, W., Irizarry, R.A., and Dudoit, S., Bioinformatics and Computational Biology using R and Bioconductor, Springer, 2005.