Published January 20, 2015

Bayesian recursive mixed linear model for gene expression analyses with continuous covariates^1

J. Casellas*^2 and N. Ibáñez-Escriche†

*Grup de Recerca en Remugants, Departament de Ciència Animal i dels Aliments, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain; and †Genètica i Millora Animal, IRTA-Lleida, 25198 Lleida, Spain

ABSTRACT: The analysis of microarray gene expression data has experienced remarkable growth in scientific research over the last few years and is helping to decipher the genetic background of several productive traits. Nevertheless, most analytical approaches have relied on the comparison of 2 (or a few) well-defined groups of biological conditions where continuous covariates make no sense (e.g., healthy vs. cancerous cells). Continuous effects could be of special interest when analyzing gene expression in animal production-oriented studies (e.g., birth weight), although very few studies address this peculiarity in the animal science framework. Within this context, we have developed a recursive linear mixed model where linear covariates are not only accounted for during gene expression analyses but also hierarchized, and the effects of their genetic, environmental, and residual components on differential gene expression are inferred independently. This parameterization allows a step forward in the inference of differential gene expression linked to a given quantitative trait such as birth weight. The statistical performance of this recursive model was exemplified under simulation by accounting for different sample sizes (n), heritabilities for the quantitative trait ($h^2$), and magnitudes of differential gene expression (λ). It is important to highlight that statistical power increased with n, $h^2$, and λ, and the recursive model exceeded the standard linear mixed model with linear (nonrecursive) covariates in the majority of scenarios.
This new parameterization would provide new insights about gene expression in the animal science framework, opening a new research scenario where within-covariate sources of differential gene expression could be individualized and estimated. The source code of the program accommodating these analytical developments and additional information about practical aspects on running the program are freely available by request to the corresponding author of this article.

Key words: Bayesian inference, gene expression, microarray, mixed model, recursive

©2012 American Society of Animal Science. All rights reserved. J. Anim. Sci. 2012. 90:67–75 doi:10.2527/jas.2010-3750

^1 This research was funded by grant AGL2008-04818-C03 (Ministerio de Ciencia e Innovación, Madrid, Spain). The research contract of J. Casellas was partially financed by the Ministerio de Ciencia e Innovación of Spain (program Ramón y Cajal, reference RYC-200904049). The authors are also indebted to 2 anonymous referees for their helpful comments on the manuscript.
^2 Corresponding author: [email protected]
Received November 30, 2010. Accepted August 30, 2011.

INTRODUCTION

Mixed linear models were advocated in gene expression analyses due to their superiority in partitioning sources of variation and their flexibility for accommodating multiple experimental designs (Cui and Churchill, 2003). Indeed, mixed models have become a basic analytical tool for genomics, having been implemented in several microarray-oriented software programs (Reverter et al., 2003b; Wu et al., 2003; Casellas et al., 2008). This methodology easily accommodates both discrete (Wolfinger et al., 2001) and continuous (Casellas et al., 2008) gene-specific effects to characterize differential gene expression on the basis of 2 (or more) groups of biological conditions or a continuous covariate, respectively. Within the context of animal science, available studies assessed differential gene expression on the basis of discrete factors such as breeds (Lin and Hsu, 2005), nutrition levels (Reverter et al., 2003a), or pharmacological compounds (McDaneld et al., 2004). However, inferences on linear (or polynomial) covariates have also been suggested as an appealing alternative (Casellas et al., 2008). Moreover, additional information can be obtained if continuous covariates are split into their systematic, genetic, and residual components.

Structural equation models closely link with the standard linear model, although accounting for feedback or recursiveness between phenotypes or model parameters (Gianola and Sorensen, 2004; Xiong et al., 2004; Varona et al., 2007). This idea relies on the original dissertation by Wright (1921) and could be of special interest to hierarchize continuous covariates in microarray gene expression analysis. Within this context, the main target of this research was to develop an appropriate structural equation model accounting for recursiveness between a continuous covariate (e.g., birth weight) and gene expression data, as well as assessing differential gene expression on the basis of the environmental, genetic, and residual factors composing the continuous covariate.

MATERIALS AND METHODS

Animal Care and Use Committee approval was not obtained for this study because no animals were used.

Statistical Background (Model A)

Assume a sample of $n_g$ individuals involved in a microarray experiment with $m$ genes (i.e., probes) per microarray. Under a simple experimental design with noncompetitive hybridization microarrays and one microarray per individual, gene expression data can be analyzed by the following hierarchical mixed linear model (Casellas et al., 2008):

$$\mathbf{y}_g = \mathbf{X}_g\mathbf{m} + \mathbf{Z}_{g1}\mathbf{d} + \mathbf{Z}_{g2}\mathbf{c} + \mathbf{e}_g,$$

where $\mathbf{y}_g$ is the $(n_g m) \times 1$ vector of gene expression data sorted by microarray and gene (i.e., probe) within microarray, and $\mathbf{e}_g$ is the $(n_g m) \times 1$ vector of residual terms.
This model accounts for the overall effect of each array ($\mathbf{m}$; dimension $n_g \times 1$) and $p$ discrete ($\mathbf{d}$) and $q$ continuous ($\mathbf{c}$) within-probe effects linked to the data by incidence matrices $\mathbf{X}_g$, $\mathbf{Z}_{g1}$, and $\mathbf{Z}_{g2}$, respectively. Note that $\mathbf{X}_g$ is the Kronecker product of an $n_g \times n_g$ identity matrix and an $m$-dimensional vector of ones (η), whereas $\mathbf{Z}_{g1}$ and $\mathbf{Z}_{g2}$ are constructed as $\mathbf{Z}_1 \otimes \mathbf{I}$ and $\mathbf{Z}_2 \otimes \mathbf{I}$, respectively. More specifically, $\mathbf{Z}_1$ is an $n_g \times p$ matrix indicating the influence (1) or no influence (0) of each discrete effect (columns) on each individual (rows), $\mathbf{Z}_2$ is an $n_g \times q$ matrix storing the value of each continuous covariate (column) specific for each individual (row), and $\mathbf{I}$ is an $m \times m$ identity matrix.

Note that this model was taken as our starting point for further methodological development, assuming that it was free from biases and all assumptions were satisfied. Gene expression analyses rely on high-dimensionality parameterizations with a relatively small number of replicates (i.e., microarrays); this typically hampers the power and reliability of the model (van Iterson et al., 2009) and leads to weaker structures with greater probability of biases when one or more sources of variation are not properly accounted for. Nevertheless, the focus of this research was on validating a generalization of model A under conditions where all model assumptions were satisfied, and not on issues related to the robustness of the model.

Microarray Data Analysis with Recursiveness (Model B)

Assume that the $n_g$ individuals are sampled from a larger population of $n_p$ related individuals with phenotypic data for 1 or more traits of interest. Additionally, assume that one of those phenotypic traits (i.e., Y) is included as a continuous covariate in model A and stored in the ith column of matrix $\mathbf{Z}_{g2}$. This last assumption opens an interesting scenario where the influence of Y on gene expression data can be hierarchized by the systematic, genetic, and residual components of Y.
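This recursive relationship can be modeled by the following joint parameterization (model B):

$$\mathbf{y}_p = \mathbf{X}_p\mathbf{b} + \mathbf{Z}_p\mathbf{a} + \mathbf{e}_p,$$

$$\mathbf{y}_g = \mathbf{X}_g\mathbf{m} + \mathbf{Z}_{g1}\mathbf{d} + \mathbf{Z}_{g2,-i}\mathbf{c}_{-i} + \mathbf{W}\mathbf{c}_Y + \mathbf{e}_g,$$

where $\mathbf{y}_p$ is the vector of phenotypic data from trait Y, $\mathbf{b}$ is the vector of systematic effects, $\mathbf{a}$ is the vector of additive genetic effects, $\mathbf{e}_p$ is the vector of residuals, and $\mathbf{X}_p$ and $\mathbf{Z}_p$ are appropriate incidence matrices. Note that $\mathbf{Z}_{g2,-i}$ is $\mathbf{Z}_{g2}$ after excluding its ith column, $\mathbf{c}_{-i}$ is $\mathbf{c}$ after excluding its ith element, and $\mathbf{c}_Y$ is the vector of regression coefficients linked to the systematic, genetic, and residual components of Y (stored in $\mathbf{W}$). More specifically, recursiveness between phenotypic and gene expression data is characterized by the incidence matrix $\mathbf{W} = \mathbf{W}^* \otimes \mathbf{I}$, where each row in $\mathbf{W}^*$ stores the appropriate elements of $\mathbf{b}$, $\mathbf{a}$, and $\mathbf{e}_p$ as continuous covariates (see Appendix). Note that the sum of the elements in the jth row of $\mathbf{W}$ reconstructs the phenotypic value of the individual linked to the jth microarray.

Bayesian Development for Model B

Under a standard Bayesian development, the joint posterior distribution of all unknown parameters in model B was proportional to the Bayesian likelihood of both microarray ($\mathbf{y}_g$) and phenotypic ($\mathbf{y}_p$) data multiplied by the appropriate a priori distributions as follows:

$$
\begin{aligned}
p(\mathbf{b},\mathbf{a},\sigma_a^2,\sigma_e^2,\mathbf{m},\mathbf{d},\boldsymbol{\Sigma}_d,\mathbf{c}_{-i},\boldsymbol{\Sigma}_{c1},\mathbf{c}_Y,\boldsymbol{\Sigma}_{c2},\mathbf{R} \mid \mathbf{y}_p,\mathbf{y}_g) \propto{} & p(\mathbf{y}_p,\mathbf{y}_g \mid \mathbf{b},\mathbf{a},\sigma_e^2,\mathbf{m},\mathbf{d},\mathbf{c}_{-i},\mathbf{c}_Y,\mathbf{R}) \\
& \times p(\mathbf{b})\, p(\mathbf{a} \mid \mathbf{A}\sigma_a^2)\, p(\sigma_a^2)\, p(\sigma_e^2)\, p(\mathbf{m})\, p(\mathbf{d} \mid \boldsymbol{\Sigma}_d)\, p(\boldsymbol{\Sigma}_d) \\
& \times p(\mathbf{c}_{-i} \mid \boldsymbol{\Sigma}_{c1})\, p(\boldsymbol{\Sigma}_{c1})\, p(\mathbf{c}_Y \mid \boldsymbol{\Sigma}_{c2})\, p(\boldsymbol{\Sigma}_{c2})\, p(\mathbf{R}),
\end{aligned}
$$

where $\mathbf{R}$ is the $m \times m$ matrix of residual covariances (Casellas et al., 2008), $\mathbf{A}$ is the numerator relationship matrix between individuals (Wright, 1922), and $\boldsymbol{\Sigma}_d$, $\boldsymbol{\Sigma}_{c1}$, and $\boldsymbol{\Sigma}_{c2}$ are appropriate covariance matrices for $\mathbf{d}$, $\mathbf{c}_{-i}$, and $\mathbf{c}_Y$, respectively.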
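The Kronecker structure of the incidence matrices in models A and B can be illustrated with a minimal sketch. It assumes data sorted by microarray and by gene within microarray, as stated for model A. The toy dimensions (2 microarrays, 3 genes) and the four-column layout of W* (two systematic components, one genetic, one residual) are hypothetical choices for illustration only, and the helper names are not taken from the authors' program.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def identity(n):
    """n x n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def ones(n):
    """n x 1 column vector of ones."""
    return [[1] for _ in range(n)]


n_g, m = 2, 3  # hypothetical: 2 microarrays (individuals), 3 genes each

# X_g: identity (n_g x n_g) Kronecker vector of ones (m x 1) -> (n_g*m) x n_g
X_g = kron(identity(n_g), ones(m))

# Z_g1 = Z_1 (n_g x p) Kronecker I (m x m); here p = 2 discrete levels
Z_1 = [[1, 0], [0, 1]]
Z_g1 = kron(Z_1, identity(m))

# W* stores, per individual, the components of the phenotype
# (hypothetical values: two systematic effects, genetic value, residual)
W_star = [[10.0, 2.5, 1.5, -0.5],
          [0.0, -3.0, -2.0, 1.0]]
W = kron(W_star, identity(m))

# Each row of W sums to the phenotype of the individual behind that row
phenotypes = [sum(row) for row in W_star]
row_sums = [sum(row) for row in W]
print(phenotypes, row_sums)
```

In this layout every gene gets its own copy of each covariate column, which is why each regression coefficient in c_Y is gene specific.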
The conditional distribution of the data given the unknown parameters (i.e., the Bayesian likelihood) can be split into

$$p(\mathbf{y}_p \mid \mathbf{b},\mathbf{a},\sigma_e^2) \propto \mathrm{MVN}(\mathbf{X}_p\mathbf{b} + \mathbf{Z}_p\mathbf{a},\ \mathbf{I}_p\sigma_e^2),$$

and

$$p(\mathbf{y}_g \mid \mathbf{m},\mathbf{d},\mathbf{c}_{-i},\mathbf{c}_Y,\mathbf{R}) \propto \mathrm{MVN}(\mathbf{X}_g\mathbf{m} + \mathbf{Z}_{g1}\mathbf{d} + \mathbf{Z}_{g2,-i}\mathbf{c}_{-i} + \mathbf{W}\mathbf{c}_Y,\ \mathbf{I}_g \otimes \mathbf{R}),$$

where MVN refers to a multivariate normal density with mean and variance as indicated between the parentheses, $\mathbf{I}_p$ is an identity matrix with dimensions equal to the number of elements in $\mathbf{y}_p$, and $\mathbf{I}_g$ is an $n_g \times n_g$ identity matrix. Following Sorensen and Gianola (2002) and Casellas et al. (2008), both flat and multivariate normal priors were assumed for the unknown parameters of the model. More specifically, $p(\mathbf{a} \mid \mathbf{A}\sigma_a^2)$ was modeled as $\mathrm{MVN}(\mathbf{0}, \mathbf{A}\sigma_a^2)$, whereas $p(\mathbf{d} \mid \boldsymbol{\Sigma}_d)$, $p(\mathbf{c}_{-i} \mid \boldsymbol{\Sigma}_{c1})$, and $p(\mathbf{c}_Y \mid \boldsymbol{\Sigma}_{c2})$ were assumed as multivariate normal densities [i.e., $\mathrm{MVN}(\mathbf{0}, \boldsymbol{\Sigma}_d)$, $\mathrm{MVN}(\mathbf{0}, \boldsymbol{\Sigma}_{c1})$, and $\mathrm{MVN}(\mathbf{0}, \boldsymbol{\Sigma}_{c2})$, respectively]. Although other dispersion patterns could be addressed (Cui et al., 2005; Casellas and Varona, 2008), we assumed a standard design accounting for gene-specific variances with null covariance between genes for $\boldsymbol{\Sigma}_d$, $\boldsymbol{\Sigma}_{c1}$, $\boldsymbol{\Sigma}_{c2}$, and $\mathbf{R}$ (see Casellas et al., 2008). A priori distributions for $p(\mathbf{b})$ and $p(\mathbf{m})$ were assumed improper flat, whereas $p(\sigma_a^2)$, $p(\sigma_e^2)$, $p(\boldsymbol{\Sigma}_d)$, $p(\boldsymbol{\Sigma}_{c1})$, $p(\boldsymbol{\Sigma}_{c2})$, and $p(\mathbf{R})$ were bounded flat distributions between 0 and 1,000.

Simulation Studies

Model B was illustrated by analyzing simulated data sets. For each simulation, 5 nonoverlapping generations of 1,000 individuals (50 males and 950 females) under random mating were generated and pedigree data were stored for analytical purposes. Each individual had a single phenotypic record sampled from a multivariate normal distribution such as MVN($S_i + H_j + a_k$, 10). More specifically, $S_i$ was the effect of the sex with 2 levels ($S_1 = +10$; $S_2 = 0$), $H_j$ mimicked a herd effect with 20 levels (effects were sampled from a uniform distribution between −10 and 10), and $a_k$ was the infinitesimal genetic effect for the kth individual.
Additive genetic values for founder and nonfounder individuals were simulated from the following normal densities: $N(0, \sigma_a^2)$ and $N[(a_s + a_d)/2,\ 0.5(1 - f)\sigma_a^2]$, respectively. Note that $a_s$ was the additive genetic value of the sire, $a_d$ was the additive genetic value of the dam, $f$ was the average inbreeding coefficient of both parents, and $\sigma_a^2$ was the additive genetic variance.

Gene expression data from 20,000 independent genes were simulated for a subset of N individuals chosen at random from each simulated population. Gene expression was sampled from the following normal density:

$$N(m_k + g_l + S_i\beta_1 + H_j\beta_2 + a_k\beta_3 + e_{ijk}\beta_4,\ \sigma_l^2),$$

where $m_k$ is a microarray-specific effect; $g_l$ is a gene-specific effect; $S_i$, $H_j$, $a_k$, and $e_{ijk}$ are the effects of sex, herd, additive genetic value, and residual on the phenotypic value of the kth individual (see previous paragraph); β1, β2, β3, and β4 are the corresponding regression coefficients; and $\sigma_l^2$ is the gene-specific residual variance. More specifically, $m_k$ was sampled from a uniform distribution between 0 and 1, $g_l$ was sampled from a normal distribution N(7,1), and $\sigma_l^2$ was sampled from a uniform distribution between 0.01 and 0.2. To test for different incidences of phenotype-related differential gene expression, regression coefficients had different configurations for genes 1 to 100 (β1 = λ, β2 = λ, β3 = λ, and β4 = λ), 101 to 200 (β1 = λ, β2 = 0, β3 = 0, and β4 = 0), 201 to 300 (β1 = 0, β2 = λ, β3 = 0, and β4 = 0), 301 to 400 (β1 = 0, β2 = 0, β3 = λ, and β4 = 0), 401 to 500 (β1 = 0, β2 = 0, β3 = 0, and β4 = λ), and 501 to 20,000 (β1 = 0, β2 = 0, β3 = 0, and β4 = 0).

All these simulation procedures were replicated under 3 sample sizes (n = 50, n = 100, and n = 200), and 2 different values for λ ($0.1\sigma_l$ and $\sigma_l$) and $\sigma_a$ (0.11 and 10). Note that $\sigma_a^2$ yielded heritabilities ($h^2$) of 0.1 and 0.5, respectively. For each combination of $\sigma_a^2$, λ, and N, 10 simulated populations were generated.
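The gene expression sampling scheme described above can be sketched in a few lines. This is a simplified illustration, not the authors' simulation program: it assumes unrelated individuals (no pedigree or inbreeding), reads the value 10 in the phenotypic model as a residual variance, draws sex level 1 (+10) with probability 0.05, and uses 600 genes instead of 20,000 to keep the run short. The function names and default arguments are hypothetical.

```python
import math
import random


def beta_config(gene, lam):
    """Return (beta1, beta2, beta3, beta4) for a 1-based gene index."""
    if gene <= 100:                      # all four components act on expression
        return (lam, lam, lam, lam)
    if gene <= 500:                      # exactly one component acts
        active = (gene - 101) // 100     # 0: sex, 1: herd, 2: genetic, 3: residual
        return tuple(lam if k == active else 0.0 for k in range(4))
    return (0.0, 0.0, 0.0, 0.0)          # no differential expression


def simulate(n=50, n_genes=600, weight=0.1, var_a=10.0, var_e=10.0, seed=1):
    """Simulate an n x n_genes expression matrix (list of rows).

    Assumptions: unrelated individuals, var_e = 10 read as a variance,
    sex level 1 (+10) drawn with probability 0.05.
    """
    rng = random.Random(seed)
    herd_levels = [rng.uniform(-10.0, 10.0) for _ in range(20)]
    components = []                      # (sex, herd, genetic, residual) per individual
    for _ in range(n):
        sex = 10.0 if rng.random() < 0.05 else 0.0
        herd = rng.choice(herd_levels)
        a = rng.gauss(0.0, math.sqrt(var_a))
        e = rng.gauss(0.0, math.sqrt(var_e))
        components.append((sex, herd, a, e))
    array_effect = [rng.uniform(0.0, 1.0) for _ in range(n)]

    expression = [[0.0] * n_genes for _ in range(n)]
    for l in range(n_genes):
        gene_effect = rng.gauss(7.0, 1.0)
        sd_l = math.sqrt(rng.uniform(0.01, 0.2))     # gene-specific residual SD
        betas = beta_config(l + 1, weight * sd_l)    # lambda in units of sd_l
        for k in range(n):
            mean = array_effect[k] + gene_effect + sum(
                x * b for x, b in zip(components[k], betas)
            )
            expression[k][l] = rng.gauss(mean, sd_l)
    return expression


data = simulate()
print(len(data), len(data[0]))
```

Setting `weight=1.0` would mimic the larger differential expression scenario; a faithful replication would additionally need the 5-generation pedigree and the Mendelian sampling term for nonfounders.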
Each data set was analyzed under model A and model B following the Bayesian approach described above. Model A assumed the same a priori probabilities as the gene expression-specific equation in model B after removing the appropriate terms [see Casellas et al. (2008) for a detailed Bayesian development of model A]. The hierarchical model for gene expression data accounted for the overall effect of each microarray (m), a probe-specific discrete effect (d), and 1 (c; model A) or 4 (cY; model B) continuous covariates accounting for the magnitude of the phenotypic trait or its systematic, genetic, and residual components, respectively. For the phenotypic trait (model B), the linear mixed model accounted for the same systematic (i.e., sex and herd) and infinitesimal genetic effects assumed during the simulation process. A single Monte Carlo Markov chain with 120,000 iterations was launched for each analysis, discarding the first 20,000 iterations as burn-in (Raftery and Lewis, 1992).

RESULTS AND DISCUSSION

Model B was illustrated in 6 simulation scenarios in which the models used to simulate and analyze data were the same. Thus, the focus was on validating the detection of differentially expressed genes under conditions where all model assumptions were satisfied, and not on issues related to the robustness of model B. Simulations focused on 3 different parameters: the number of individuals with gene expression data (n = 50, Tables 1 and 2; n = 100, Tables 3 and 4; n = 200, Tables 5 and 6), heritability of the phenotypic trait (h2 = 0.1, Tables 1, 3, and 5; h2 = 0.5, Tables 2, 4, and 6), and magnitude of the differential gene expression (λ). The percentage of false negatives under model B decreased when increasing the available amount of information (i.e., larger values of N, h2, and λ; Tables 1 to 6).

Table 1. Average percentages of differentially expressed genes with posterior probability >0.95 (>0.9999975)^1,2

| Simulation (λ^3) | Effect^4 | Genes^5 (i.e., probes) 1 to 100 | 101 to 200 | 201 to 300 | 301 to 400 | 401 to 500 | 500 to 20,000 |
|---|---|---|---|---|---|---|---|
| λ = 0.1 | β1 (fix 1) | 48.7^b (0) | 64.1^b (0) | 6.1^c (0) | 4.7^a (0) | 6.1^c (0) | 4.4^a (0) |
| λ = 0.1 | β2 (fix 2) | 91.2^a (3.6^x) | 5.3^e (0) | 98.6^a (3.0^x) | 5.0^a (0) | 5.0^c (0) | 3.9^a (0) |
| λ = 0.1 | β3 (genetic) | 2.0^e (0) | 4.0^e (0) | 4.3^c (0) | 6.1^a (0) | 2.8^c (0) | 4.8^a (0) |
| λ = 0.1 | β4 (residual) | 4.9^e (0) | 3.0^e (0) | 5.0^c (0) | 3.2^a (0) | 5.0^c (0) | 5.3^a (0) |
| λ = 0.1 | βModel A | 29.2^c (2.7^x) | 26.2^c (0.2) | 30.9^b (0) | 5.5^a (0) | 4.2^c (0) | 5.3^a (0) |
| λ = 1 | β1 (fix 1) | 96.2^a (94.4^z) | 95.0^a (93.3^z) | 8.2^c (0) | 5.1^a (0) | 17.3^b (0) | 4.6^a (0) |
| λ = 1 | β2 (fix 2) | 97.0^a (94.6^z) | 11.0^d (0) | 96.1^a (95.0^z) | 3.1^a (0) | 3.9^c (0) | 5.0^a (0) |
| λ = 1 | β3 (genetic) | 12.2^d (0) | 5.7^e (0) | 6.2^c (0) | 7.2^a (0) | 4.0^c (0) | 3.1^a (0) |
| λ = 1 | β4 (residual) | 93.3^a (92.7^x) | 8.7^d (0) | 9.0^c (0) | 4.1^a (0) | 95.2^a (6.8^z) | 5.1^a (0) |
| λ = 1 | βModel A | 45.0^b (14.5^y) | 87.2^a (60.0^y) | 87.6^a (62.1^y) | 5.4^a (0) | 4.6^c (0) | 4.7^a (0) |

a–e Means (by column) with the same superscript letter did not differ significantly (P > 0.05); x–z means (by column) with the same superscript letter did not differ significantly (P > 0.0000025). Means without decimals indicated an invariable estimate (i.e., the same value was obtained in all 10 replicates).
1 The simulation scenario accounted for 50 individuals with gene expression data for 20,000 genes, these being sampled from a population of 5,000 individuals phenotyped by a quantitative trait with a heritability of 0.1.
2 Bonferroni-like correction (Bonferroni, 1930) for 20,000 tests with α = 0.05.
3 λ = weighting factor for the differential gene expression measured in terms of gene-specific residual SD.
4 Differential gene expression on a continuous scale was evaluated under model A (βModel A) and model B (effects β1, β2, β3, and β4).
5 Differential gene expression was simulated for effect β1 (genes 1 to 200), β2 (genes 1 to 100 and 201 to 300), β3 (1 to 100 and 301 to 400), and β4 (genes 1 to 100 and 401 to 500).
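The stricter posterior probability threshold reported in parentheses in the tables follows from the Bonferroni-like correction for 20,000 tests with α = 0.05. The short sketch below reproduces that arithmetic. The `posterior_probability` helper is an assumed, illustrative definition (the share of retained MCMC samples on the dominant side of zero); the source does not spell out how the posterior probability was computed.

```python
def bonferroni_like_threshold(alpha, n_tests):
    """Posterior-probability cutoff after dividing alpha across n_tests."""
    return 1.0 - alpha / n_tests


def posterior_probability(samples):
    """Share of MCMC samples on the dominant side of zero (assumed definition)."""
    positive = sum(1 for s in samples if s > 0)
    negative = sum(1 for s in samples if s < 0)
    return max(positive, negative) / len(samples)


nominal = 1.0 - 0.05                                  # uncorrected cutoff
corrected = bonferroni_like_threshold(0.05, 20000)    # 1 - 0.05 / 20,000
print(nominal, corrected)

# Hypothetical chain for one regression coefficient
toy_chain = [0.2, 0.1, -0.3, 0.4]
print(posterior_probability(toy_chain) > nominal)
```

The corrected cutoff corresponds to the P > 0.0000025 level used for the x–z superscripts, since 0.05 / 20,000 = 0.0000025.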
Although this link between statistical power and microarray sample size could be anticipated, the pattern shown by the different recursive effects on gene expression data (i.e., systematic, genetic, and residual covariates) suggested independent behaviors with relevant implications for future analyses. Taking the first 100 probes as a reference (i.e., they were simulated with all recursive covariates contributing non-null differential gene expression), covariates β1 (i.e., sex effect) and β2 (i.e., herd effect) detected ~40% and ~90% of differentially expressed genes when λ = 0.1σl (see Tables 1 and 2 for exact values under h2 = 0.1 and h2 = 0.5). These percentages increased to ~82% and ~99% when 100 individuals contributed to the gene expression data (Tables 3 and 4) and reached the maximum (100%) with 200 individuals (Tables 5 and 6). It is important to highlight the departures between β1 and β2 under n = 50 and 100, β1 showing smaller percentages in both cases. The difference in the number of discrete levels inherited from the analysis of the phenotypic trait could be the main reason for this larger percentage of false negatives reported by covariate β1. In a similar way, probes 101 to 200 (differential gene expression was only due to covariate β1) and 201 to 300 (differential gene expression was only due to covariate β2) provided a similar pattern, although with slightly greater percentages for both covariates (Tables 1 to 4). Both genetic (β3) and residual (β4) covariates increased the percentage of significant differential gene expressions with N, although they were also highly sensitive to h2 and λ (see below). Covariate β3 accounted for a very interesting recursiveness between phenotypic and gene expression data. 
Genes with significant estimates for β3 indicate that their differential gene expression is significantly linked to the genetic background of the phenotypic trait involved in the analyses and, by extension, to 1 or several additional genes spread along the whole genome. Indeed, significant estimates must be viewed as a promising starting point for further expression quantitative trait loci or gene network analyses, these providing initial pieces of evidence for more complex gene interactions at the transcriptome level. Our analyses revealed an impaired performance of covariate β3 when h2 and λ were set to 0.1 (Tables 1, 3, and 5), although the percentage of differentially expressed genes increased with N. This was not surprising, although it suggested that small studies performed on moderate-to-low heritable phenotypic traits could suffer from a moderate-to-high percentage of false negative estimates when testing the genetic covariate on gene expression data. Whereas (systematic) covariates β1 and β2 showed a reasonably robust performance even under small N and λ values (they were not remarkably influenced by h2), estimates from the genetic covariate β3 must be taken with caution.

Results for the last covariate, β4, showed an intermediate pattern between those obtained under systematic (β1 and β2) and genetic (β3) covariates (Tables 1 to 6). They accounted for spurious relationships between gene expression and uncontrolled sources of variation of the phenotypic trait. This uncontrolled variability could account for a wide range of environmental and genetic effects. Indeed, our analyses suggested that differential gene expression linked to covariate β4 could be partially accounted for by the genetic covariate β3 or even the systematic covariate β2, as illustrated by probes 401 to 500 (differential gene expression was only due to covariate β4) in Table 2 (λ = σs), Table 4 (λ = σs), and Table 6 (λ = σs). It is important to highlight that, with the exception of false-positive results reported for probes 401 to 500 under some very specific scenarios (phenotypic trait with h2 = 0.5 and simulated differential gene expression with large effects), the level of false positives was approximately 5% elsewhere.

Table 2. Average percentages of differentially expressed genes with posterior probability >0.95 (>0.9999975)^1,2

| Simulation (λ^3) | Effect^4 | Genes^5 (i.e., probes) 1 to 100 | 101 to 200 | 201 to 300 | 301 to 400 | 401 to 500 | 500 to 20,000 |
|---|---|---|---|---|---|---|---|
| λ = 0.1 | β1 (fix 1) | 34.2^d (0) | 60.5^b (0) | 4.4^c (0) | 6.0^c (0) | 4.1^d (0) | 4.5^a (0) |
| λ = 0.1 | β2 (fix 2) | 86.3^b (2.2^x) | 3.0^e (0) | 98.6^a (2.2^x) | 5.6^c (0) | 4.9^d (0) | 3.3^a (0) |
| λ = 0.1 | β3 (genetic) | 4.0^e (0) | 4.7^e (0) | 5.0^c (0) | 3.9^c (0) | 5.0^d (0) | 4.3^a (0) |
| λ = 0.1 | β4 (residual) | 3.0^e (0) | 3.5^e (0) | 6.7^c (0) | 5.0^c (0) | 3.6^d (0) | 5.1^a (0) |
| λ = 0.1 | βModel A | 27.8^d (1.3^x) | 22.2^c (0) | 31.5^b (0) | 4.2^c (0) | 5.5^d (0) | 4.5^a (0) |
| λ = 1 | β1 (fix 1) | 99.7^a (99.9^z) | 99.7^a (99.3^z) | 11.0^c (0) | 6.1^c (0) | 8.3^c (0) | 3.9^a (0) |
| λ = 1 | β2 (fix 2) | 99.6^a (98.8^z) | 11.3^d (0) | 99.5^a (99.2^z) | 14.0^b (0) | 5.9^d (0) | 4.9^a (0) |
| λ = 1 | β3 (genetic) | 98.8^a (97.7^z) | 3.2^e (0) | 13.1^c (0) | 96.2^a (95.5^z) | 78.8^b (0.4^z) | 5.3^a (0) |
| λ = 1 | β4 (residual) | 97.6^a (6.5^x) | 4.7^e (0) | 5.9^c (0) | 13.1^b (0) | 95.5^a (7.2^z) | 4.4^a (0) |
| λ = 1 | βModel A | 60.3^c (27.6^y) | 99.1^a (73.2^y) | 97.6^a (84.1^y) | 94.4^a (47.5^y) | 5.0^d (0) | 3.7^a (0) |

1 The simulation scenario accounted for 50 individuals with gene expression data for 20,000 genes, these being sampled from a population of 5,000 individuals phenotyped by a quantitative trait with a heritability of 0.5.
Superscript letters and footnotes 2 to 5 are as defined for Table 1.

Despite the reasonable statistical performances of model B discussed in previous paragraphs, this statistical development would not be more than an intellectual exercise without real contribution to the gene expression framework if the outputs did not 1) contribute additional information or 2) improve statistical outcomes. The first advantage (i.e., additional or more detailed information) could be linked to the proper partitioning of differential gene expression into several “partial” differential gene expressions related to systematic, genetic, and residual sources of variation (see results for probes 101 to 500 in Tables 1 to 6). Whereas model A identifies a differentially expressed gene whose transcription level depends on a given continuous covariate (i.e., phenotypic trait), model B goes an additional step by identifying which source of variation of the phenotypic trait (or all of them) links with the differential gene expression.

Table 3. Average percentages of differentially expressed genes with posterior probability >0.95 (>0.9999975)^1,2

| Simulation (λ^3) | Effect^4 | Genes^5 (i.e., probes) 1 to 100 | 101 to 200 | 201 to 300 | 301 to 400 | 401 to 500 | 500 to 20,000 |
|---|---|---|---|---|---|---|---|
| λ = 0.1 | β1 (fix 1) | 83.1^b (2.0) | 95.3^a (3.1^x) | 4.0^e (0) | 4.9^c (0) | 5.1^d (0) | 4.6^a (0) |
| λ = 0.1 | β2 (fix 2) | 99.3^a (20.5^y) | 6.0^d (0) | 99.0^a (22.3^x) | 3.9^c (0) | 4.9^d (0) | 3.1^a (0) |
| λ = 0.1 | β3 (genetic) | 5.0^e (0) | 3.7^d (0) | 3.5^e (0) | 6.9^c (0) | 7.0^d (0) | 5.1^a (0) |
| λ = 0.1 | β4 (residual) | 4.8^e (0) | 4.9^d (0) | 6.4^e (0) | 5.0^c (0) | 6.2^d (0) | 4.7^a (0) |
| λ = 0.1 | βModel A | 33.7^d (5.4^x) | 31.5^b (0) | 34.7^c (0) | 4.2^c (0) | 3.5^d (0) | 3.9^a (0) |
| λ = 1 | β1 (fix 1) | 99.4^a (99.0^z) | 99.3^a (99.2^z) | 13.2^d (0) | 7.1^c (0) | 26.3^c (0) | 5.4^a (0) |
| λ = 1 | β2 (fix 2) | 99.1^a (98.7^z) | 12.0^c (0) | 99.2^a (97.1^z) | 14.2^b (0) | 24.4^c (0) | 4.7^a (0) |
| λ = 1 | β3 (genetic) | 56.2^c (0) | 5.1^d (0) | 4.7^e (0) | 52.3^a (0) | 6.2^d (0) | 3.1^a (0) |
| λ = 1 | β4 (residual) | 96.2^a (95.6^z) | 8.3^cd (0) | 15.3^d (0) | 7.2^c (0) | 93.7^a (91.7^z) | 4.9^a (0) |
| λ = 1 | βModel A | 49.7^c (18.2^y) | 88.5^a (68.5^y) | 86.4^b (67.7^y) | 16.8^b (0) | 61.1^b (0) | 5.0^a (0) |

1 The simulation scenario accounted for 100 individuals with gene expression data for 20,000 genes, these being sampled from a population of 5,000 individuals phenotyped by a quantitative trait with a heritability of 0.1.
Superscript letters and footnotes 2 to 5 are as defined for Table 1.

Table 4. Average percentages of differentially expressed genes with posterior probability >0.95 (>0.9999975)^1,2

| Simulation (λ^3) | Effect^4 | Genes^5 (i.e., probes) 1 to 100 | 101 to 200 | 201 to 300 | 301 to 400 | 401 to 500 | 500 to 20,000 |
|---|---|---|---|---|---|---|---|
| λ = 0.1 | β1 (fix 1) | 81.1^b (1.3^w) | 96.1^a (3.7^y) | 6.0^d (0) | 5.3^d (0) | 4.8^e (0) | 5.0^a (0) |
| λ = 0.1 | β2 (fix 2) | 98.2^a (13.2^x) | 6.1^c (0) | 99.2^a (19.2^y) | 3.6^d (0) | 7.4^e (0) | 4.6^a (0) |
| λ = 0.1 | β3 (genetic) | 23.0^e (0) | 4.9^c (0) | 6.0^d (0) | 29.0^b (0) | 4.9^e (0) | 3.9^a (0) |
| λ = 0.1 | β4 (residual) | 28.0^e (0) | 5.0^c (0) | 5.3^d (0) | 5.0^d (0) | 5.2^e (0) | 5.2^a (0) |
| λ = 0.1 | βModel A | 44.8^d (10.5^x) | 53.5^b (0) | 63.9^b (1.5^x) | 13.7^cd (0) | 3.9^e (0) | 5.0^a (0) |
| λ = 1 | β1 (fix 1) | 99.8^a (99.4^z) | 99.9^a (99.3^z) | 12.1^cd (0) | 5.5^d (0) | 5.3^e (0) | 5.9^a (0) |
| λ = 1 | β2 (fix 2) | 99.9^a (99.6^z) | 9.0^c (0) | 99.7^a (99.1^z) | 6.4^d (0) | 21.9^d (0) | 4.6^a (0) |
| λ = 1 | β3 (genetic) | 99.2^a (98.4^z) | 4.9^c (0) | 6.3^d (0) | 97.8^a (96.5^z) | 49.9^c (0) | 4.7^a (0) |
| λ = 1 | β4 (residual) | 98.3^a (97.1^z) | 8.9^c (0) | 15.4^c (0) | 21.0^bc (0) | 97.0^a (95.7^z) | 3.6^a (0) |
| λ = 1 | βModel A | 60.3^c (34.3^y) | 99.2^a (99.0^z) | 98.9^a (97.9^z) | 96.2^a (69.4^y) | 70.7^b (0) | 4.9^a (0) |

1 The simulation scenario accounted for 100 individuals with gene expression data for 20,000 genes, these being sampled from a population of 5,000 individuals phenotyped by a quantitative trait with a heritability of 0.5.
Superscript letters and footnotes 2 to 5 are as defined for Table 1.

On the other hand, statistical performances improved under model B in terms of positively detected genes with differential gene expression. Taking the results in Table 3 as an example (n = 100, h2 = 0.1, and λ = σs), model A detected 49.7% of the differentially expressed genes within the first 100 simulated genes whereas model B detected 99.4% (covariate β1), 99.1% (β2), 56.2% (β3), and 96.2% (β4). Note that all these percentages detected under model B were significantly greater (P < 0.05) than the 49.7% obtained under model A. In a similar way, the same trend was detected for genes 101 to 500, although the presence of significant departures depended on the simulation scenarios.

Table 5. Average percentages of differentially expressed genes with posterior probability >0.95 (>0.9999975)^1,2

| Simulation (λ^3) | Effect^4 | Genes^5 (i.e., probes) 1 to 100 | 101 to 200 | 201 to 300 | 301 to 400 | 401 to 500 | 500 to 20,000 |
|---|---|---|---|---|---|---|---|
| λ = 0.1 | β1 (fix 1) | 100 (100) | 100 (100) | 5.0^c (0) | 4.6^c (0) | 4.7^c (0) | 7.1^a (0) |
| λ = 0.1 | β2 (fix 2) | 100 (100) | 5.3^c (0) | 100 (100) | 6.9^c (0) | 7.3^c (0) | 4.9^a (0) |
| λ = 0.1 | β3 (genetic) | 58.1^d (27.7^x) | 3.0^c (0) | 4.7^c (0) | 90.7^ab (82.9^y) | 4.7^c (0) | 4.7^a (0) |
| λ = 0.1 | β4 (residual) | 73.3^bc (38.2^x) | 5.0^c (0) | 6.1^c (0) | 5.4^c (0) | 95.3^ab (91.1^yz) | 3.7^a (0) |
| λ = 0.1 | βModel A | 69.9^c (37.2^x) | 84.0^b (78.0^y) | 87.2^b (77.0^y) | 83.0^b (68.3^x) | 89.2^b (84.2^y) | 4.8^a (0) |
| λ = 1 | β1 (fix 1) | 100 (100) | 100 (100) | 4.8^c (0) | 5.0^c (0) | 5.3^c (0) | 3.3^a (0) |
| λ = 1 | β2 (fix 2) | 100 (100) | 6.5^c (0) | 100 (100) | 7.0^c (0) | 3.7^c (0) | 4.9^a (0) |
| λ = 1 | β3 (genetic) | 92.1^a (90.7^z) | 4.2^c (0) | 4.2^c (0) | 99.6^a (99.0^z) | 5.1^c (0) | 5.4^a (0) |
| λ = 1 | β4 (residual) | 100 (99.0^z) | 5.1^c (0) | 6.0^c (0) | 4.2^c (0) | 99.7^a (98.6^z) | 4.9^a (0) |
| λ = 1 | βModel A | 83.5^b (66.6^y) | 99.6^a (99.3^z) | 99.0^a (98.5^z) | 98.0^a (96.2^z) | 98.8^a (95.9^z) | 6.0^a (0) |

1 The simulation scenario accounted for 200 individuals with gene expression data for 20,000 genes, these being sampled from a population of 5,000 individuals phenotyped by a quantitative trait with a heritability of 0.1.
Superscript letters (a–d and x–z) and footnotes 2 to 5 are as defined for Table 1.

It is important to highlight that all these results were obtained by analyzing simulated data. Although simulation processes tried to mimic the peculiarities of microarray gene expression data, statistical performances revealed in Tables 1 to 6 must not be completely extrapolated to real experiments. The substantial degree of uncontrollable noise inherent to real gene expression data could modulate the performance of model B in field or laboratory conditions, these departures attenuating when increasing the size of the experiment. In any case, model B was revealed as a promising statistical tool for the analysis of microarray data under simulation with potential contributions to further field data experiments.

Table 6. Average percentages of differentially expressed genes with posterior probability >0.95 (>0.9999975)^1,2

| Simulation (λ^3) | Effect^4 | Genes^5 (i.e., probes) 1 to 100 | 101 to 200 | 201 to 300 | 301 to 400 | 401 to 500 | 500 to 20,000 |
|---|---|---|---|---|---|---|---|
| λ = 0.1 | β1 (fix 1) | 100 (100) | 100 (100) | 4.0^c (0) | 5.2^c (0) | 6.9^c (0) | 5.0^a (0) |
| λ = 0.1 | β2 (fix 2) | 100 (100) | 4.4^d (0) | 100 (100) | 3.6^c (0) | 5.0^c (0) | 3.0^a (0) |
| λ = 0.1 | β3 (genetic) | 64.3^c (32.2^x) | 3.7^d (0) | 5.0^c (0) | 88.3^b (81.9^y) | 3.5^c (0) | 4.2^a (0) |
| λ = 0.1 | β4 (residual) | 69.2^bc (35.2^x) | 5.0^d (0) | 6.5^c (0) | 5.3^c (0) | 93.2^ab (90.0^yz) | 6.1^a (0) |
| λ = 0.1 | βModel A | 64.2^c (31.7^x) | 83.6^b (77.7^y) | 83.4^b (72.9^y) | 79.9^b (66.3^x) | 87.9^b (81.1^y) | 5.6^a (0) |
| λ = 1 | β1 (fix 1) | 100 (100) | 100 (100) | 2.9^c (0) | 5.0^c (0) | 5.2^c (0) | 7.0^a (0) |
| λ = 1 | β2 (fix 2) | 100 (100) | 3.7^c (0) | 100 (100) | 5.3^c (0) | 3.9^c (0) | 5.6^a (0) |
| λ = 1 | β3 (genetic) | 99.9^a (99.5^z) | 4.1^c (0) | 4.8^c (0) | 98.9^a (98.5^z) | 12.2^c (0) | 4.8^a (0) |
| λ = 1 | β4 (residual) | 99.7^a (98.9^z) | 5.5^c (0) | 5.5^c (0) | 7.3^c (0) | 98.7^a (94.7^z) | 3.3^a (0) |
| λ = 1 | βModel A | 78.2^b (56.1^y) | 99.6^a (99.5^z) | 98.9^a (98.3^z) | 96.9^a (92.2^z) | 95.7^a (90.4^yz) | 5.0^a (0) |

1 The simulation scenario accounted for 200 individuals with gene expression data for 20,000 genes, these being sampled from a population of 5,000 individuals phenotyped by a quantitative trait with a heritability of 0.5.
Superscript letters (a–d and x–z) and footnotes 2 to 5 are as defined for Table 1.

Mixed model analysis of microarray gene expression data must be viewed as a powerful analytical tool contributing relevant information to multiple research fields (Wolfinger et al., 2001). Indeed, mixed models provide a robust and flexible framework for interrogating a set of hundreds to thousands of genes (i.e., probes), these being preferred to more simplistic gene-by-gene screenings (Hoeschele and Li, 2005). This flexibility (and robustness) led to a plethora of mixed model parameterizations accounting for different peculiarities of gene expression data such as skew (Kuznetsov et al., 2002; Purdom and Holmes, 2005; Bhowmick et al., 2006) and heavy-tailed patterns (Gottardo et al., 2006; Khondoker et al., 2006), mixtures of distributions (Bing et al., 2005), within-gene residual heteroscedasticity (Casellas and Varona, 2008), covariance in time-series data (Marot et al., 2009), and differential gene expression on a continuous scale (Casellas et al., 2008) among others.
It was this last parameterization (i.e., continuous covariates) that provided the basis for the recursive mixed linear model developed in this study. Although the first microarray-based studies relied on the comparison of 2 (or a few) well-defined groups of biological conditions where continuous covariates were not meaningful (e.g., healthy vs. cancerous cells; Golub et al., 1999; Perou et al., 2000), continuous effects could be of special interest in animal production-oriented studies. Differential gene expression in livestock has previously been evaluated on the basis of quantitative traits such as meat quality (Bernard et al., 2007) or ovulation rate (Caetano et al., 2004), these being obvious examples of continuous traits influenced by systematic and genetic sources of variation. When using the mixed linear model parameterization by Casellas et al. (2008), differential gene expression for continuous covariates (e.g., BW) was evaluated in a broad sense, without taking advantage of the hierarchical structure inherent in these productive traits (Henderson, 1973). This limitation could be partially overcome by the hierarchical recursive model developed above, in which differential gene expression is evaluated separately for each source of variation of the quantitative trait included as a continuous covariate. Note that this methodological development falls within the context of current endeavors to implement structural equation models in genetic (Xiong et al., 2004; Liu et al., 2008) and animal breeding (de los Campos et al., 2006; Varona et al., 2007; López de Maturana et al., 2009; Ibáñez-Escriche et al., 2010) research.
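The contrast between evaluating the covariate in a broad sense and evaluating its sources of variation separately can be sketched numerically. The following Python fragment is only an illustration under simplifying assumptions: it computes the covariate part of the expected expression of a single gene for a single individual, ignores every other term of the mixed models, and uses made-up values. The function names and numbers are hypothetical and are not taken from the study or from any software. It assumes that the phenotypic record is the sum of two systematic effects, an additive genetic effect, and a residual, matching the effects β1 to β4 listed in the tables.

```python
def predictor_model_a(mu, beta, phenotype):
    """Model A (sketch): one regression slope on the whole phenotypic record."""
    return mu + beta * phenotype


def predictor_model_b(mu, betas, components):
    """Model B (sketch): one slope per source of variation of the phenotype."""
    return mu + sum(beta * c for beta, c in zip(betas, components))


# Hypothetical decomposition of one phenotypic record:
# (systematic effect 1, systematic effect 2, additive genetic effect, residual)
components = (1.5, -0.5, 0.8, 0.2)
phenotype = sum(components)

# With equal slopes, the model B predictor collapses to the model A predictor.
shared_slope = 0.4
pred_a = predictor_model_a(0.0, shared_slope, phenotype)
pred_b_equal = predictor_model_b(0.0, (shared_slope,) * 4, components)

# A gene linked only to the genetic component (beta3 != 0) is described by
# model B without forcing the same slope onto the other components.
pred_b_genetic = predictor_model_b(0.0, (0.0, 0.0, 0.4, 0.0), components)

print(pred_a, pred_b_equal, pred_b_genetic)
```

Under these assumptions, the single-slope parameterization is the special case of the component-wise one in which all four slopes are equal, which is one way to read why the recursive model can separate sources of differential expression that a single covariate pools together.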
In conclusion, this study was an endeavor to develop a new statistical tool to delve deeply into the genetic architecture of gene expression, reporting a more detailed characterization of the different sources of differential gene expression and equaling or even improving the statistical power obtained under more standard mixed linear models with linear covariates. Far from being a final step for gene expression analysis, this recursive model could open an attractive research scenario where this methodology and additional biostatistical developments would provide new insights into the genetic and environmental mechanisms underlying gene transcription, a topic of special interest for the scientific community from multiple research fields. Moreover, the models and parameterizations developed in this manuscript could be straightforwardly adapted to accommodate gene expression data generated under alternative methodologies such as whole transcriptome shotgun sequencing (i.e., RNA-Seq).

LITERATURE CITED

Bernard, C., I. Cassar-Malek, M. Le Cunff, H. Dubroeucq, G. Renand, and J.-F. Hocquette. 2007. New indicators of beef sensory quality revealed by expression of specific genes. J. Agric. Food Chem. 55:5229–5237.
Bhowmick, D., A. C. Davison, D. R. Goldstein, and Y. Ruffieux. 2006. A Laplace mixture model for identification of differential expressions in microarray experiments. Biostatistics 7:630–641.
Bing, N., I. Hoeschele, K. Ye, and K. J. Eilertsen. 2005. Finite mixture model analysis of microarray expression data on samples of uncertain biological type with application to reproductive efficiency. Vet. Immunol. Immunopathol. 105:187–196.
Bonferroni, C. E. 1930. Elementi di Statistica Generale. Libreria Seber, Florence, Italy.
Caetano, A. R., R. K. Johnson, J. J. Ford, and D. Pomp. 2004. Microarray profiling for differential gene expression in ovaries and ovarian follicles of pigs selected for increased ovulation rate. Genetics 168:1529–1537.
Casellas, J., N. Ibáñez-Escriche, M. Martínez-Giner, and L. Varona. 2008. GEAMM v.1.4: a versatile program for mixed model analysis of gene expression data. Anim. Genet. 39:89–90.
Casellas, J., and L. Varona. 2008. Between-groups within-gene heterogeneity of residual variances in microarray gene expression data. BMC Genomics 9:319.
Cui, X., and G. A. Churchill. 2003. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 4:210.
Cui, X., J. T. G. Hwang, J. Qiu, N. J. Blades, and G. A. Churchill. 2005. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics 6:59–75.
de los Campos, G., D. Gianola, P. Boettcher, and P. Moroni. 2006. A structural equation model for describing relationships between somatic cell score and milk yield in dairy goats. J. Anim. Sci. 84:2934–2941.
Gianola, D., and D. Sorensen. 2004. Quantitative genetic models describing simultaneous and recursive relationships between phenotypes. Genetics 167:1407–1424.
Golub, T. R., D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression profiling. Science 286:531–537.
Gottardo, R., A. E. Raftery, K. Y. Yeung, and R. E. Bumgarner. 2006. Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics 62:10–18.
Henderson, C. R. 1973. Sire evaluation and genetic trends. Pages 10–41 in Proc. Anim. Breeding Genet. Symp. in Honor of Dr. Jay L. Lush. Am. Soc. Anim. Sci., Champaign, IL.
Hoeschele, I., and H. Li. 2005. A note on joint versus gene-specific mixed model analysis of microarray gene expression data. Biostatistics 6:183–186.
Ibáñez-Escriche, N., E. López de Maturana, J. L. Noguera, and L. Varona. 2010. An application of change-point recursive models to the relationship between litter size and number of stillborns. J. Anim. Sci. 88:3493–3503.
Khondoker, M. R., C. A. Glasbey, and B. J. Worton. 2006. Statistical estimation of gene expression using multiple laser scans of microarrays. Bioinformatics 22:215–219.
Kuznetsov, V. A., G. D. Knott, and R. F. Bonner. 2002. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 161:1321–1332.
Lin, C. S., and C. W. Hsu. 2005. Differentially transcribed genes in skeletal muscle of Duroc and Tayuan pigs. J. Anim. Sci. 83:2075–2086.
Liu, B., A. de la Fuente, and I. Hoeschele. 2008. Gene network inference via structural equation modeling in genetical genomics experiments. Genetics 178:1763–1776.
López de Maturana, E., X.-L. Wu, D. Gianola, K. A. Weigel, and G. J. M. Rosa. 2009. Exploring biological relationships between calving traits in primiparous cattle with a Bayesian recursive model. Genetics 181:277–287.
Marot, G., J.-L. Foulley, and F. Jaffrézic. 2009. A structural mixed model to shrink covariance matrices for time-course differential gene expression studies. Comput. Stat. Data Anal. 53:1630–1638.
McDaneld, T. G., D. L. Hancock, and D. E. Moody. 2004. Altered mRNA abundance of ASB15 and four other genes in skeletal muscle following administration of beta-adrenergic receptor agonists. Physiol. Genomics 16:275–283.
Perou, C. M., T. Sorlie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, C. A. Rees, J. R. Pollack, D. T. Ross, H. Johnsen, L. A. Akslen, O. Fluge, A. Pergamenschikov, C. Williams, S. X. Zhu, P. E. Lønning, A. L. Børresen-Dale, P. O. Brown, and D. Botstein. 2000. Molecular portraits of human breast tumours. Nature 406:747–752.
Purdom, E., and S. P. Holmes. 2005. Error distribution for gene expression data. Stat. Appl. Genet. Mol. Biol. 4:e16.
Raftery, A. E., and S. M. Lewis. 1992. How many iterations in the Gibbs sampler? Pages 763–773 in Bayesian Statistics IV. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, ed. Oxford Univ. Press, Oxford, UK.
Reverter, A., K. A. Byrne, H. L. Brucet, Y. H. Wang, B. P. Dalrymple, and S. A. Lehnert. 2003a. A mixture model-based cluster analysis of DNA microarray gene expression data on Brahman and Brahman composite steers fed high-, medium-, and low-quality diets. J. Anim. Sci. 81:1900–1910.
Reverter, A., K. A. Byrne, and B. P. Dalrymple. 2003b. BAYESMIX: A software program for Bayesian analysis of mixture models with an application to the analysis of microarray gene expression data. XV Proc. Assoc. Adv. Anim. Breed. Genet., Melbourne, Australia, 15:90–93.
Sorensen, D., and D. Gianola. 2002. Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer-Verlag, New York, NY.
van Iterson, M., P. A. C. 't Hoen, P. Pedotti, G. J. E. J. Hooiveld, J. T. den Dunnen, G. J. B. van Ommen, J. M. Boer, and R. X. Menezes. 2009. Relative power and sample size analysis on gene expression profiling data. BMC Genomics 10:439.
Varona, L., D. Sorensen, and R. Thompson. 2007. Analysis of litter size and average litter weight in pigs using a recursive model. Genetics 177:1791–1799.
Wolfinger, R. D., G. Gibson, E. D. Wolfinger, L. Bennett, H. Hamadeh, P. Bushel, C. Afshari, and R. S. Paules. 2001. Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol. 8:625–637.
Wright, S. 1921. Correlation and causation. J. Agric. Res. 20:557–585.
Wright, S. 1922. Coefficients of inbreeding and relationship. Am. Nat. 56:330–338.
Wu, H., M. Kerr, X. Cui, and G. Churchill. 2003. MAANOVA: A software package for the analysis of spotted cDNA microarray experiments. Pages 313–341 in The Analysis of Gene Expression Data. G. Parmigiani, E. S. Garrett, R. A. Irizarry, and S. L. Zeger, ed. Springer, London, UK.
Xiong, M., J. Li, and X. Fang. 2004. Identification of genetic networks. Genetics 166:1037–1052.
APPENDIX

The recursiveness between phenotypic and gene expression data in model B relies on vectors y_p, a, b, and e_p (from the mixed linear model for the phenotypic trait) and matrix W* (from the mixed linear model for gene expression data), and it can be easily illustrated with a small example. Assume a dummy population of 5 related individuals, all of them recorded for a given phenotypic trait of interest influenced by a single systematic effect (sex; see Table A1). At any iteration of the Markov chain Monte Carlo sampler used in the Bayesian analysis, the relevant vectors for the phenotypic trait can be written as

\[
\mathbf{y}_p = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix}, \quad
\mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}, \quad
\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \end{bmatrix}, \quad \text{and} \quad
\mathbf{e}_p = \begin{bmatrix} y_1 - b_1 - a_1 \\ y_2 - b_2 - a_2 \\ y_3 - b_1 - a_3 \\ y_4 - b_2 - a_4 \\ y_5 - b_2 - a_5 \end{bmatrix},
\]

where y_h and a_h are the phenotypic record and the additive genetic effect of the hth individual, respectively, and b_1 and b_2 are the predicted effects of sex for males and females, respectively. Now, assume that only individuals 2, 3, and 5 were assayed on a small microarray platform investigating 3 different genes (Table A1). The recursiveness originates by feeding W* with the appropriate elements of the previous vectors as follows:

\[
\mathbf{W}^{*} = \begin{bmatrix}
b_2 & a_2 & y_2 - b_2 - a_2 \\
b_1 & a_3 & y_3 - b_1 - a_3 \\
b_2 & a_5 & y_5 - b_2 - a_5
\end{bmatrix}.
\]

Assuming that the gene expression data in y_g were sorted by array and by gene within array, W = W* ⊗ I (with I the 3 × 3 identity matrix, 1 row per gene) generalizes to

\[
\mathbf{W} = \begin{bmatrix}
b_2 & 0 & 0 & a_2 & 0 & 0 & y_2 - b_2 - a_2 & 0 & 0 \\
0 & b_2 & 0 & 0 & a_2 & 0 & 0 & y_2 - b_2 - a_2 & 0 \\
0 & 0 & b_2 & 0 & 0 & a_2 & 0 & 0 & y_2 - b_2 - a_2 \\
b_1 & 0 & 0 & a_3 & 0 & 0 & y_3 - b_1 - a_3 & 0 & 0 \\
0 & b_1 & 0 & 0 & a_3 & 0 & 0 & y_3 - b_1 - a_3 & 0 \\
0 & 0 & b_1 & 0 & 0 & a_3 & 0 & 0 & y_3 - b_1 - a_3 \\
b_2 & 0 & 0 & a_5 & 0 & 0 & y_5 - b_2 - a_5 & 0 & 0 \\
0 & b_2 & 0 & 0 & a_5 & 0 & 0 & y_5 - b_2 - a_5 & 0 \\
0 & 0 & b_2 & 0 & 0 & a_5 & 0 & 0 & y_5 - b_2 - a_5
\end{bmatrix},
\]

and this structure needs to be updated at each sampling iteration.

Table A1. Example population of 5 individuals with (yes) or without (no) phenotypic and gene expression data

| Individual | Sex | Phenotypic data | Gene expression data |
|---|---|---|---|
| 1 | Male | Yes | No |
| 2 | Female | Yes | Yes |
| 3 | Male | Yes | Yes |
| 4 | Female | Yes | No |
| 5 | Female | Yes | Yes |
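The construction of W* and W in this example can be sketched in a few lines of Python. This is a minimal illustration and not the authors' software. All numeric values are invented placeholders standing in for the current draws of one sampling iteration, and the helper names (`build_w_star`, `kron_identity`) are hypothetical. Individuals 2, 3, and 5 carry gene expression data, following the appendix text and the rows of W*.

```python
# Placeholder values for one sampling iteration (hypothetical, not from the study).
y = {1: 10.2, 2: 9.1, 3: 11.0, 4: 8.7, 5: 9.6}    # phenotypic records y_h
a = {1: 0.3, 2: -0.2, 3: 0.5, 4: -0.4, 5: 0.1}    # additive genetic effects a_h
b = {"Male": 10.0, "Female": 9.0}                 # sex effects b_1 (males), b_2 (females)
sex = {1: "Male", 2: "Female", 3: "Male", 4: "Female", 5: "Female"}
with_expression = [2, 3, 5]                       # individuals with microarray data
n_genes = 3


def build_w_star(y, a, b, sex, individuals):
    """One row per array: [sex effect, additive genetic effect, residual]."""
    rows = []
    for h in individuals:
        fixed = b[sex[h]]
        residual = y[h] - fixed - a[h]
        rows.append([fixed, a[h], residual])
    return rows


def kron_identity(matrix, k):
    """Kronecker product of `matrix` with the k x k identity matrix."""
    out = []
    for row in matrix:
        for i in range(k):
            new_row = []
            for value in row:
                # Each element expands into a k-wide block with the value
                # on the diagonal position i and zeros elsewhere.
                new_row.extend(value if j == i else 0.0 for j in range(k))
            out.append(new_row)
    return out


w_star = build_w_star(y, a, b, sex, with_expression)   # 3 x 3
w = kron_identity(w_star, n_genes)                     # 9 x 9

for row in w:
    print(["%.1f" % value for value in row])
```

Because b, a, and e_p change with every draw, `build_w_star` and `kron_identity` would have to be called again with the current values at each sampling iteration, which is the update mentioned at the end of the appendix.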