Published January 20, 2015
Bayesian recursive mixed linear model for gene expression
analyses with continuous covariates1
J. Casellas*2 and N. Ibáñez-Escriche†
*Grup de Recerca en Remugants, Departament de Ciència Animal i dels Aliments, Universitat Autònoma de
Barcelona, 08193 Bellaterra, Barcelona, Spain; and †Genètica i Millora Animal, IRTA-Lleida, 25198 Lleida, Spain
ABSTRACT: The analysis of microarray gene expression data has experienced a remarkable growth in scientific research over the last few years and is helping to
decipher the genetic background of several productive
traits. Nevertheless, most analytical approaches have
relied on the comparison of 2 (or a few) well-defined
groups of biological conditions where continuous
covariates make no sense (e.g., healthy vs. cancerous
cells). Continuous effects could be of special interest
when analyzing gene expression in animal production-oriented studies (e.g., birth weight), although very few
studies address this peculiarity in the animal science
framework. Within this context, we have developed a
recursive linear mixed model where linear covariates are
not only accounted for during gene expression analyses but also hierarchized, and the effects of their genetic,
environmental, and residual components on differential
gene expression are inferred independently. This parameterization allows a step forward in the inference of differential gene expression linked to a given quantitative
trait such as birth weight. The statistical performance
of this recursive model was exemplified under simulation by accounting for different sample sizes (n), heritabilities for the quantitative trait (h2), and magnitudes of
differential gene expression (λ). It is important to highlight that statistical power increased with n, h2, and λ,
and the recursive model exceeded the standard linear
mixed model with linear (nonrecursive) covariates in
the majority of scenarios. This new parameterization
would provide new insights about gene expression in
the animal science framework, opening a new research
scenario where within-covariate sources of differential
gene expression could be individualized and estimated.
The source code of the program accommodating these
analytical developments and additional information
about practical aspects on running the program are
freely available by request to the corresponding author
of this article.
Key words: Bayesian inference, gene expression, microarray, mixed model, recursive
©2012 American Society of Animal Science. All rights reserved.
J. Anim. Sci. 2012. 90:67–75
doi:10.2527/jas.2010-3750
INTRODUCTION
Mixed linear models were advocated in gene expression analyses due to their superiority in partitioning sources of variation and their flexibility for accommodating multiple experimental designs (Cui and Churchill, 2003). Indeed, mixed models have become a basic analytical tool for genomics, having been implemented in several microarray-oriented software programs (Reverter et al., 2003b; Wu et al., 2003; Casellas et al., 2008). This methodology easily accommodates both discrete (Wolfinger et al., 2001) and continuous (Casellas et al., 2008) gene-specific effects to characterize differential gene expression on the basis of 2 (or more) groups of biological conditions or a continuous covariate, respectively. Within the context of animal science, available studies assessed differential gene expression on the basis of discrete factors such as breeds (Lin and Hsu, 2005), nutrition levels (Reverter et al., 2003a), or pharmacological compounds (McDaneld et al., 2004). However, inferences on linear (or polynomial) covariates have also been suggested as an appealing alternative (Casellas et al., 2008). Moreover, additional information can be obtained if continuous covariates are split into their systematic, genetic, and residual components.
Structural equation models closely link with the standard linear model, although accounting for feedback or recursiveness between phenotypes or model parameters (Gianola and Sorensen, 2004; Xiong et al., 2004; Varona et al., 2007). This idea relies on the original dissertation by Wright (1921) and could be of special interest to hierarchize continuous covariates in microarray gene expression analysis. Within this context, the main target of this research was to develop an appropriate structural equation model accounting for recursiveness between a continuous covariate (e.g., birth weight) and gene expression data, as well as assessing differential gene expression on the basis of the environmental, genetic, and residual factors composing the continuous covariate.
1 This research was funded by grant AGL2008-04818-C03 (Ministerio de Ciencia e Innovación, Madrid, Spain). The research contract of J. Casellas was partially financed by the Ministerio de Ciencia e Innovación of Spain (program Ramón y Cajal, reference RYC-200904049). The authors are also indebted to 2 anonymous referees for their helpful comments on the manuscript.
2 Corresponding author: [email protected]
Received November 30, 2010.
Accepted August 30, 2011.
MATERIALS AND METHODS
Animal Care and Use Committee approval was not
obtained for this study because no animals were used.
Statistical Background (Model A)
Assume a sample of $n_g$ individuals involved in a microarray experiment with $m$ genes (i.e., probes) per microarray. Under a simple experimental design with noncompetitive hybridization microarrays and one microarray per individual, gene expression data can be analyzed by the following hierarchical mixed linear model (Casellas et al., 2008):
$$\mathbf{y}_g = \mathbf{X}_g\mathbf{m} + \mathbf{Z}_{g1}\mathbf{d} + \mathbf{Z}_{g2}\mathbf{c} + \mathbf{e}_g,$$
where $\mathbf{y}_g$ is the $(n_g m) \times 1$ vector of gene expression data sorted by microarray and gene (i.e., probe) within microarray, and $\mathbf{e}_g$ is the $(n_g m) \times 1$ vector of residual terms. This model accounts for the overall effect of each array ($\mathbf{m}$; dimension $n_g \times 1$) and $p$ discrete ($\mathbf{d}$) and $q$ continuous ($\mathbf{c}$) within-probe effects linked to the data by incidence matrices $\mathbf{X}_g$, $\mathbf{Z}_{g1}$, and $\mathbf{Z}_{g2}$, respectively. Note that $\mathbf{X}_g$ is the Kronecker product of an $n_g \times n_g$ identity matrix with an $m$-dimensional vector of ones ($\boldsymbol{\eta}$), whereas $\mathbf{Z}_{g1}$ and $\mathbf{Z}_{g2}$ are constructed as $\mathbf{Z}_1 \otimes \mathbf{I}$ and $\mathbf{Z}_2 \otimes \mathbf{I}$, respectively. More specifically, $\mathbf{Z}_1$ is an $n_g \times p$ matrix indicating the influence (1) or no influence (0) of each discrete effect (columns) on each individual (rows), $\mathbf{Z}_2$ is an $n_g \times q$ matrix storing the value of each continuous covariate (column) specific for each individual (row), and $\mathbf{I}$ is an $m \times m$ identity matrix. Note that this model was taken as our starting point for further methodological development, assuming that it was free from biases and all assumptions were satisfied.
Gene expression analyses rely on high-dimensionality
parameterizations with a relatively small number of
replicates (i.e., microarrays); this typically hampers
the power and reliability of the model (van Iterson et
al., 2009) and leads to weaker structures with greater probability of biases when one or more sources of
variation are not properly accounted for. Nevertheless,
the focus of this research was on validating a generalization of model A under conditions where all model
assumptions were satisfied, and not on issues related to
the robustness of the model.
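The Kronecker structure of the incidence matrices in model A can be made concrete with a small sketch. The example below is a minimal pure-Python illustration, assuming a toy design with 3 microarrays, 2 probes, 2 discrete levels, and 1 continuous covariate; the helper names (`kron`, `identity`, `build_design`) and all numeric values are hypothetical and are not taken from the authors' program.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def identity(n):
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def build_design(z1, z2, m):
    """Build X_g, Z_g1 and Z_g2 for n_g arrays with m probes each.

    z1: n_g x p indicator matrix of discrete effects.
    z2: n_g x q matrix of continuous covariates.
    """
    n_g = len(z1)
    ones = [[1.0] for _ in range(m)]       # m-dimensional vector of ones
    x_g = kron(identity(n_g), ones)        # (n_g * m) x n_g
    z_g1 = kron(z1, identity(m))           # (n_g * m) x (p * m)
    z_g2 = kron(z2, identity(m))           # (n_g * m) x (q * m)
    return x_g, z_g1, z_g2


# Hypothetical toy design: 3 arrays, 2 probes per array.
z1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # 2 discrete levels
z2 = [[31.5], [28.0], [35.2]]              # 1 continuous covariate
x_g, z_g1, z_g2 = build_design(z1, z2, m=2)
```

Because the data are sorted by probe within microarray, each array effect is repeated across the m rows of its array, while every probe receives its own copy of each discrete and continuous effect.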
Microarray Data Analysis with Recursiveness
(Model B)
Assume that the $n_g$ individuals are sampled from a larger population of $n_p$ related individuals with phenotypic data for 1 or more traits of interest. Additionally, assume that one of those phenotypic traits (i.e., $Y$) is included as a continuous covariate in model A and stored in the $i$th column of matrix $\mathbf{Z}_{g2}$. This last assumption draws a very interesting scenario where the influence of $Y$ on gene expression data can be hierarchized by the systematic, genetic, and residual components of $Y$. This recursive relationship can be modeled by the following joint parameterization (model B):
$$\mathbf{y}_p = \mathbf{X}_p\mathbf{b} + \mathbf{Z}_p\mathbf{a} + \mathbf{e}_p;$$
$$\mathbf{y}_g = \mathbf{X}_g\mathbf{m} + \mathbf{Z}_{g1}\mathbf{d} + \mathbf{Z}_{g2,-i}\mathbf{c}_{-i} + \mathbf{W}\mathbf{c}_Y + \mathbf{e}_g,$$
where $\mathbf{y}_p$ is the vector of phenotypic data from trait $Y$, $\mathbf{b}$ is the vector of systematic effects, $\mathbf{a}$ is the vector of additive genetic effects, $\mathbf{e}_p$ is the vector of residuals, and $\mathbf{X}_p$ and $\mathbf{Z}_p$ are appropriate incidence matrices. Note that $\mathbf{Z}_{g2,-i}$ is $\mathbf{Z}_{g2}$ after excluding its $i$th column, $\mathbf{c}_{-i}$ is $\mathbf{c}$ after excluding its $i$th element, and $\mathbf{c}_Y$ is the vector of regression coefficients linked to the systematic, genetic, and residual components of $Y$ ($\mathbf{W}$). More specifically, recursiveness between phenotypic and gene expression data is characterized by the incidence matrix $\mathbf{W} = \mathbf{W}^* \otimes \mathbf{I}$, where each row in $\mathbf{W}^*$ stores the appropriate elements of $\mathbf{b}$, $\mathbf{a}$, and $\mathbf{e}_p$ as continuous covariates (see Appendix). Note that the sum of the elements in the $j$th row of $\mathbf{W}$ reconstructs the phenotypic value of the individual linked to the $j$th microarray.
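The row-sum property of the recursive incidence matrix can be checked on a toy case. The sketch below assumes the 4-covariate layout used later in the simulations (sex, herd, additive genetic value, and residual); the exact layout of W* is defined in the Appendix and may differ, and the function names and numeric values are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def identity(n):
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def w_star_row(sex_effect, herd_effect, additive_value, residual):
    """One row of W*: the components of a phenotype kept as separate covariates."""
    return [sex_effect, herd_effect, additive_value, residual]


# Two hypothetical individuals with gene expression data.
w_star = [
    w_star_row(10.0, -2.5, 1.25, 0.75),
    w_star_row(0.0, 4.0, -1.5, 0.5),
]

# Each row of W* adds up to the phenotypic record of that individual.
phenotypes = [sum(row) for row in w_star]

# W = W* (x) I expands every covariate to one column per probe (m = 3 here).
w = kron(w_star, identity(3))
```

Each row of `w` keeps exactly one copy of the four components, so its sum still equals the phenotype of the individual linked to that microarray.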
Bayesian Development for Model B
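Under a standard Bayesian development, the joint posterior distribution of all unknown parameters in model B was proportional to the Bayesian likelihood of both microarray ($\mathbf{y}_g$) and phenotypic ($\mathbf{y}_p$) data multiplied by the appropriate a priori distributions as follows:
$$
\begin{aligned}
p(\mathbf{b},\mathbf{a},\sigma_a^2,\sigma_e^2,\mathbf{m},\mathbf{d},\boldsymbol{\Sigma}_d,\mathbf{c}_{-i},\boldsymbol{\Sigma}_{c1},\mathbf{c}_Y,\boldsymbol{\Sigma}_{c2},\mathbf{R} \mid \mathbf{y}_p,\mathbf{y}_g)
&\propto p(\mathbf{y}_p,\mathbf{y}_g \mid \mathbf{b},\mathbf{a},\sigma_e^2,\mathbf{m},\mathbf{d},\mathbf{c}_{-i},\mathbf{c}_Y,\mathbf{R}) \\
&\quad \times p(\mathbf{b})\, p(\mathbf{a} \mid \mathbf{A}\sigma_a^2)\, p(\sigma_a^2)\, p(\sigma_e^2)\, p(\mathbf{m})\, p(\mathbf{d} \mid \boldsymbol{\Sigma}_d)\, p(\boldsymbol{\Sigma}_d) \\
&\quad \times p(\mathbf{c}_{-i} \mid \boldsymbol{\Sigma}_{c1})\, p(\boldsymbol{\Sigma}_{c1})\, p(\mathbf{c}_Y \mid \boldsymbol{\Sigma}_{c2})\, p(\boldsymbol{\Sigma}_{c2})\, p(\mathbf{R}),
\end{aligned}
$$
where $\mathbf{R}$ is the $m \times m$ matrix of residual covariances (Casellas et al., 2008), $\mathbf{A}$ is the numerator relationship matrix between individuals (Wright, 1922), and $\boldsymbol{\Sigma}_d$, $\boldsymbol{\Sigma}_{c1}$, and $\boldsymbol{\Sigma}_{c2}$ are appropriate covariance matrices for $\mathbf{d}$, $\mathbf{c}_{-i}$, and $\mathbf{c}_Y$, respectively. The conditional distribution of the data given the unknown parameters (i.e., the Bayesian likelihood) can be split into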
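$$p(\mathbf{y}_p \mid \mathbf{b},\mathbf{a},\sigma_e^2) \propto \mathrm{MVN}(\mathbf{X}_p\mathbf{b} + \mathbf{Z}_p\mathbf{a},\ \mathbf{I}_p\sigma_e^2), \text{ and}$$
$$p(\mathbf{y}_g \mid \mathbf{m},\mathbf{d},\mathbf{c}_{-i},\mathbf{c}_Y,\mathbf{R}) \propto \mathrm{MVN}(\mathbf{X}_g\mathbf{m} + \mathbf{Z}_{g1}\mathbf{d} + \mathbf{Z}_{g2,-i}\mathbf{c}_{-i} + \mathbf{W}\mathbf{c}_Y,\ \mathbf{I}_g \otimes \mathbf{R}),$$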
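where MVN refers to a multivariate normal density with mean and variance as indicated between the parentheses, $\mathbf{I}_p$ is an identity matrix with dimensions equal to the number of elements in $\mathbf{y}_p$, and $\mathbf{I}_g$ is an $n_g \times n_g$ identity matrix. Following Sorensen and Gianola (2002) and Casellas et al. (2008), both flat and multivariate normal priors were assumed for the unknown parameters of the model. More specifically, $p(\mathbf{a} \mid \mathbf{A}\sigma_a^2)$ was modeled as
$$p(\mathbf{a} \mid \mathbf{A}\sigma_a^2) \sim \mathrm{MVN}(\mathbf{0}, \mathbf{A}\sigma_a^2),$$
whereas $p(\mathbf{d} \mid \boldsymbol{\Sigma}_d)$, $p(\mathbf{c}_{-i} \mid \boldsymbol{\Sigma}_{c1})$, and $p(\mathbf{c}_Y \mid \boldsymbol{\Sigma}_{c2})$ were assumed as multivariate normal densities [i.e., $\mathrm{MVN}(\mathbf{0},\boldsymbol{\Sigma}_d)$, $\mathrm{MVN}(\mathbf{0},\boldsymbol{\Sigma}_{c1})$, and $\mathrm{MVN}(\mathbf{0},\boldsymbol{\Sigma}_{c2})$, respectively]. Although other dispersion patterns could be addressed (Cui et al., 2005; Casellas and Varona, 2008), we assumed a standard design accounting for gene-specific variances with null covariance between genes for $\boldsymbol{\Sigma}_d$, $\boldsymbol{\Sigma}_{c1}$, $\boldsymbol{\Sigma}_{c2}$, and $\mathbf{R}$ (see Casellas et al., 2008). A priori distributions for $p(\mathbf{b})$ and $p(\mathbf{m})$ were assumed improper flat, whereas $p(\sigma_a^2)$, $p(\sigma_e^2)$, $p(\boldsymbol{\Sigma}_d)$, $p(\boldsymbol{\Sigma}_{c1})$, $p(\boldsymbol{\Sigma}_{c2})$, and $p(\mathbf{R})$ were bounded flat distributions between 0 and 1,000.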
Simulation Studies
Model B was illustrated by analyzing simulated data sets. For each simulation, 5 nonoverlapping generations of 1,000 individuals (50 males and 950 females) under random mating were generated and pedigree data were stored for analytical purposes. Each individual had a single phenotypic record sampled from a multivariate normal distribution such as $\mathrm{MVN}(S_i + H_j + a_k, 10)$. More specifically, $S_i$ was the effect of the sex with 2 levels ($S_1 = +10$; $S_2 = 0$), $H_j$ mimicked a herd effect with 20 levels (effects were sampled from a uniform distribution between −10 and 10), and $a_k$ was the infinitesimal genetic effect for the $k$th individual. Additive genetic values for founder and nonfounder individuals were simulated from the following normal densities: $\mathrm{N}(0, \sigma_\alpha^2)$ and $\mathrm{N}[(a_s + a_d)/2,\ 0.5(1 - f)\sigma_\alpha^2]$, respectively. Note that $a_s$ was the additive genetic value of the sire, $a_d$ was the additive genetic value of the dam, $f$ was the average inbreeding coefficient of both parents, and $\sigma_\alpha^2$ was the additive genetic variance.
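The two sampling densities for additive genetic values can be sketched as follows. This is a minimal illustration under the stated densities, not the authors' simulation program; the function names, the seed, and the example variance are hypothetical.

```python
import random


def sample_founder(var_a, rng):
    """Founder additive genetic value drawn from N(0, var_a)."""
    return rng.gauss(0.0, var_a ** 0.5)


def sample_offspring(a_sire, a_dam, f_parents, var_a, rng):
    """Nonfounder value drawn from N[(a_s + a_d) / 2, 0.5 * (1 - f) * var_a].

    f_parents is the average inbreeding coefficient of both parents, which
    shrinks the Mendelian sampling variance.
    """
    parent_average = 0.5 * (a_sire + a_dam)
    mendelian_var = 0.5 * (1.0 - f_parents) * var_a
    return rng.gauss(parent_average, mendelian_var ** 0.5)


rng = random.Random(2012)  # arbitrary seed, for reproducibility only
sire = sample_founder(10.0, rng)
dam = sample_founder(10.0, rng)
child = sample_offspring(sire, dam, f_parents=0.0, var_a=10.0, rng=rng)
```

With non-inbred parents, half of the additive variance is regenerated in each generation through Mendelian sampling; as the parents' inbreeding approaches 1, the offspring value collapses onto the parent average.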
Gene expression data from 20,000 independent genes were simulated for a subset of $n$ individuals chosen at random from each simulated population. Gene expression was sampled from the following normal density:
$$\mathrm{N}(m_k + g_l + S_i\beta_1 + H_j\beta_2 + a_k\beta_3 + e_{ijk}\beta_4,\ \sigma_l^2),$$
where $m_k$ is a microarray-specific effect; $g_l$ is a gene-specific effect; $S_i$, $H_j$, $a_k$, and $e_{ijk}$ are the effects of sex, herd, additive genetic value, and residual on the phenotypic value of the $k$th individual (see previous paragraph); $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ are the corresponding regression coefficients; and $\sigma_l^2$ is the gene-specific residual variance. More specifically, $m_k$ was sampled from a uniform distribution between 0 and 1, $g_l$ was sampled from a normal distribution $\mathrm{N}(7,1)$, and $\sigma_l^2$ was sampled from a uniform distribution between 0.01 and 0.2. To test for different incidences of phenotype-related differential gene expression, regression coefficients had different configurations for genes 1 to 100 ($\beta_1 = \lambda$, $\beta_2 = \lambda$, $\beta_3 = \lambda$, and $\beta_4 = \lambda$), 101 to 200 ($\beta_1 = \lambda$, $\beta_2 = 0$, $\beta_3 = 0$, and $\beta_4 = 0$), 201 to 300 ($\beta_1 = 0$, $\beta_2 = \lambda$, $\beta_3 = 0$, and $\beta_4 = 0$), 301 to 400 ($\beta_1 = 0$, $\beta_2 = 0$, $\beta_3 = \lambda$, and $\beta_4 = 0$), 401 to 500 ($\beta_1 = 0$, $\beta_2 = 0$, $\beta_3 = 0$, and $\beta_4 = \lambda$), and 501 to 20,000 ($\beta_1 = 0$, $\beta_2 = 0$, $\beta_3 = 0$, and $\beta_4 = 0$). All these simulation procedures were replicated under 3 sample sizes ($n = 50$, $n = 100$, and $n = 200$), and 2 different values for $\lambda$ ($0.1\sigma_l$ and $\sigma_l$) and $\sigma_\alpha$ (0.11 and 10). Note that $\sigma_\alpha^2$ originated heritabilities ($h^2$) of 0.1 and 0.5, respectively. For each combination of $\sigma_\alpha^2$, $\lambda$, and $n$, 10 simulated populations were generated.
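The assignment of regression coefficients to gene blocks can be written as a small lookup. The sketch below follows the block boundaries listed above; the function names are hypothetical, and the heritability helper assumes the usual definition h2 = additive variance / (additive variance + residual variance), with the residual variance of 10 used for the phenotypic records.

```python
def regression_coefficients(gene, lam):
    """Return (beta1, beta2, beta3, beta4) for a 1-based gene index.

    lam is the magnitude of differential expression, e.g. 0.1 * sigma_l
    or sigma_l for a gene with residual SD sigma_l.
    """
    if not 1 <= gene <= 20_000:
        raise ValueError("gene index must lie between 1 and 20,000")
    if gene <= 100:
        return (lam, lam, lam, lam)      # all four components active
    if gene <= 200:
        return (lam, 0.0, 0.0, 0.0)      # sex only
    if gene <= 300:
        return (0.0, lam, 0.0, 0.0)      # herd only
    if gene <= 400:
        return (0.0, 0.0, lam, 0.0)      # additive genetic value only
    if gene <= 500:
        return (0.0, 0.0, 0.0, lam)      # residual only
    return (0.0, 0.0, 0.0, 0.0)          # null genes


def heritability(var_a, var_e):
    """Narrow-sense heritability from additive and residual variances."""
    return var_a / (var_a + var_e)


print(regression_coefficients(150, 0.1))
print(heritability(10.0, 10.0))
```

Only 500 of the 20,000 genes carry a non-null coefficient, so genes 501 to 20,000 serve as the reference for the false-positive rate.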
Each data set was analyzed under model A and model B following the Bayesian approach described above.
Model A assumed the same a priori distributions as
the gene expression-specific equation in model B
after removing the appropriate terms [see Casellas et
al. (2008) for a detailed Bayesian development of model
A]. The hierarchical model for gene expression data accounted for the overall effect of each microarray (m),
a probe-specific discrete effect (d), and 1 (c; model
A) or 4 (cY; model B) continuous covariates accounting for the magnitude of the phenotypic trait or its
systematic, genetic, and residual components, respectively. For the phenotypic trait (model B), the linear
mixed model accounted for the same systematic (i.e.,
sex and herd) and infinitesimal genetic effects assumed
during the simulation process. A unique Monte Carlo
Markov chain with 120,000 iterations was launched for
each analysis, discarding the first 20,000 iterations as
burn-in (Raftery and Lewis, 1992).
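The burn-in step and the summary of a chain into a posterior probability can be sketched as below. How the posterior probability of differential expression was computed is not spelled out in this section, so the sketch assumes it is the larger of the posterior probabilities that a regression coefficient is positive or negative; the function name and the toy chain are hypothetical.

```python
def posterior_probability(chain, burn_in=20_000):
    """Posterior probability that a coefficient differs from zero in sign.

    Assumption: the probability is taken as max(P(beta > 0), P(beta < 0)),
    estimated from the samples kept after discarding the burn-in.
    """
    kept = chain[burn_in:]
    if not kept:
        raise ValueError("chain is not longer than the burn-in period")
    p_positive = sum(1 for value in kept if value > 0.0) / len(kept)
    return max(p_positive, 1.0 - p_positive)


# Deterministic toy chain with 120,000 iterations: the burn-in samples are
# negative and 95% of the retained samples are positive.
toy_chain = [-1.0] * 20_000 + [1.0] * 95_000 + [-1.0] * 5_000
probability = posterior_probability(toy_chain)
```

Discarding the first 20,000 iterations leaves 100,000 samples per analysis, and in this toy chain 95,000 of them are positive.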
RESULTS AND DISCUSSION
Model B was illustrated in 6 simulation scenarios in which the models used to simulate and analyze data were the same. Thus, the focus was on validating the detection of differentially expressed genes under conditions where all model assumptions were satisfied, and not on issues related to the robustness of model B. Simulations focused on 3 different parameters, the number of individuals with gene expression data (n = 50, Tables 1 and 2; n = 100, Tables 3 and 4; n = 200, Tables 5 and 6), heritability of the phenotypic trait (h2 = 0.1, Tables 1, 3, and 5; h2 = 0.5, Tables 2, 4, and 6), and magnitude of the differential gene expression (λ). The percentage of false negatives under model B decreased when increasing the available amount of information (i.e., larger values of N, h2, and λ; Tables 1 to 6). Although this link between statistical power and microarray sample size could be anticipated, the pattern shown by the different recursive effects on gene expression data (i.e., systematic, genetic, and residual covariates) suggested independent behaviors with relevant implications for future analyses. Taking the first 100 probes as a reference (i.e., they were simulated with all recursive covariates contributing non-null differential gene expression), covariates β1 (i.e., sex effect) and β2 (i.e., herd effect) detected ~40% and ~90% of differentially expressed genes when λ = 0.1σl (see Tables 1 and 2 for exact values under h2 = 0.1 and h2 = 0.5). These percentages increased to ~82% and ~99% when 100 individuals contributed to the gene expression data (Tables 3 and 4) and reached the maximum (100%) with 200 individuals (Tables 5 and 6). It is important to highlight the departures between β1 and β2 under n = 50 and 100, β1 showing smaller percentages in both cases. The difference in the number of discrete levels inherited from the analysis of the phenotypic trait could be the main reason for this larger percentage of false negatives reported by covariate β1. In a similar way, probes 101 to 200 (differential gene expression was only due to covariate β1) and 201 to 300 (differential gene expression was only due to covariate β2) provided a similar pattern, although with slightly greater percentages for both covariates (Tables 1 to 4). Both genetic (β3) and residual (β4) covariates increased the percentage of significant differential gene expressions with N, although they were also highly sensitive to h2 and λ (see below).
Table 1. Average percentages of differentially expressed genes with posterior probability >0.95 (>0.9999975)1,2
Simulation (λ3)   Genes5 (i.e., probes)
and effect4       1 to 100         101 to 200       201 to 300       301 to 400       401 to 500       501 to 20,000
λ = 0.1
  β1 (fix 1)      48.7b (0)        64.1b (0)        6.1c (0)         4.7a (0)         6.1c (0)         4.4a (0)
  β2 (fix 2)      91.2a (3.6x)     5.3e (0)         98.6a (3.0x)     5.0a (0)         5.0c (0)         3.9a (0)
  β3 (genetic)    2.0e (0)         4.0e (0)         4.3c (0)         6.1a (0)         2.8c (0)         4.8a (0)
  β4 (residual)   4.9e (0)         3.0e (0)         5.0c (0)         3.2a (0)         5.0c (0)         5.3a (0)
  βModel A        29.2c (2.7x)     26.2c (0.2)      30.9b (0)        5.5a (0)         4.2c (0)         5.3a (0)
λ = 1
  β1 (fix 1)      96.2a (94.4z)    95.0a (93.3z)    8.2c (0)         5.1a (0)         17.3b (0)        4.6a (0)
  β2 (fix 2)      97.0a (94.6z)    11.0d (0)        96.1a (95.0z)    3.1a (0)         3.9c (0)         5.0a (0)
  β3 (genetic)    12.2d (0)        5.7e (0)         6.2c (0)         7.2a (0)         4.0c (0)         3.1a (0)
  β4 (residual)   93.3a (92.7x)    8.7d (0)         9.0c (0)         4.1a (0)         95.2a (6.8z)     5.1a (0)
  βModel A        45.0b (14.5y)    87.2a (60.0y)    87.6a (62.1y)    5.4a (0)         4.6c (0)         4.7a (0)
a–e Means (by column) with the same superscript letter did not differ significantly (P > 0.05).
x–z Means (by column) with the same superscript letter did not differ significantly (P > 0.0000025). Means without decimals indicated an invariable estimate (i.e., the same value was obtained in all 10 replicates).
1 The simulation scenario accounted for 50 individuals with gene expression data for 20,000 genes, these being sampled from a population of 5,000 individuals phenotyped by a quantitative trait with a heritability of 0.1.
2 Bonferroni-like correction (Bonferroni, 1930) for 20,000 tests with α = 0.05.
3 λ = weighting factor for the differential gene expression measured in terms of gene-specific residual SD.
4 Differential gene expression on a continuous scale was evaluated under model A (βModel A) and model B (effects β1, β2, β3, and β4).
5 Differential gene expression was simulated for effect β1 (genes 1 to 200), β2 (genes 1 to 100 and 201 to 300), β3 (1 to 100 and 301 to 400), and β4 (genes 1 to 100 and 401 to 500).
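The stricter cutoff reported in the table captions follows from the Bonferroni-like correction for 20,000 tests with α = 0.05 given in the table footnotes. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def bonferroni_like_threshold(alpha, n_tests):
    """Posterior probability cutoff after dividing alpha by the number of tests."""
    return 1.0 - alpha / n_tests


per_test_alpha = 0.05 / 20_000                      # 0.0000025
strict_cutoff = bonferroni_like_threshold(0.05, 20_000)
print(f"{per_test_alpha:.7f} {strict_cutoff:.7f}")
```

The per-test level of 0.0000025 is the same value used for the x–z superscripts in the tables, and its complement is the 0.9999975 cutoff shown in parentheses in the captions.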
Covariate β3 accounted for a very interesting recursiveness between phenotypic and gene expression data. A significant estimate for β3 indicates that the differential expression of the gene is significantly linked to the genetic background of the phenotypic trait involved in the analyses and, by extension, to 1 or several additional genes spread along the whole genome. Indeed, significant estimates must be viewed as a promising starting point for further expression quantitative trait loci or gene network analyses, providing initial pieces of evidence for more complex gene interactions at the transcriptome level. Our analyses revealed an impaired performance of covariate β3 when h2 and λ were set to 0.1 (Tables 1, 3, and 5), although the percentage of differentially expressed genes increased with N. This was not surprising, although it suggested that small studies performed on moderate-to-low heritable phenotypic traits could suffer from a moderate-to-high percentage of false negative estimates when testing the genetic covariate on gene expression data. Whereas (systematic) covariates β1 and β2 showed a reasonably robust performance even under small N and λ values (they were not remarkably influenced by h2), estimates from the genetic covariate β3 must be taken with caution.
Results for the last covariate, β4, showed an intermediate pattern between those obtained under systematic (β1 and β2) and genetic (β3) covariates (Tables 1 to 6). They accounted for spurious relationships between gene expression and uncontrolled sources of variation of the phenotypic trait. This uncontrolled variability could account for a wide range of environmental and genetic effects. Indeed, our analyses suggested that differential gene expression linked to covariate β4 could be partially accounted for by the genetic covariate β3 or even the systematic covariate β2, as illustrated by probes 401 to 500 (differential gene expression was only due to covariate β4) in Table 2 (λ = σl), Table 4 (λ = σl), and Table 6 (λ = σl). It is important to highlight that, with the exception of false-positive results reported for probes 401 to 500 under some very specific scenarios (phenotypic trait with h2 = 0.5 and simulated differential gene expression with large effects), the level of false positives was approximately 5% elsewhere.
Table 2. Average percentages of differentially expressed genes with posterior probability >0.95 (>0.9999975)1,2
Simulation (λ3)   Genes5 (i.e., probes)
and effect4       1 to 100         101 to 200       201 to 300       301 to 400       401 to 500       501 to 20,000
λ = 0.1
  β1 (fix 1)      34.2d (0)        60.5b (0)        4.4c (0)         6.0c (0)         4.1d (0)         4.5a (0)
  β2 (fix 2)      86.3b (2.2x)     3.0e (0)         98.6a (2.2x)     5.6c (0)         4.9d (0)         3.3a (0)
  β3 (genetic)    4.0e (0)         4.7e (0)         5.0c (0)         3.9c (0)         5.0d (0)         4.3a (0)
  β4 (residual)   3.0e (0)         3.5e (0)         6.7c (0)         5.0c (0)         3.6d (0)         5.1a (0)
  βModel A        27.8d (1.3x)     22.2c (0)        31.5b (0)        4.2c (0)         5.5d (0)         4.5a (0)
λ = 1
  β1 (fix 1)      99.7a (99.9z)    99.7a (99.3z)    11.0c (0)        6.1c (0)         8.3c (0)         3.9a (0)
  β2 (fix 2)      99.6a (98.8z)    11.3d (0)        99.5a (99.2z)    14.0b (0)        5.9d (0)         4.9a (0)
  β3 (genetic)    98.8a (97.7z)    3.2e (0)         13.1c (0)        96.2a (95.5z)    78.8b (0.4z)     5.3a (0)
  β4 (residual)   97.6a (6.5x)     4.7e (0)         5.9c (0)         13.1b (0)        95.5a (7.2z)     4.4a (0)
  βModel A        60.3c (27.6y)    99.1a (73.2y)    97.6a (84.1y)    94.4a (47.5y)    5.0d (0)         3.7a (0)
a–e Means (by column) with the same superscript letter did not differ significantly (P > 0.05).
x–z Means (by column) with the same superscript letter did not differ significantly (P > 0.0000025). Means without decimals indicated an invariable estimate (i.e., the same value was obtained in all 10 replicates).
1 The simulation scenario accounted for 50 individuals with gene expression data for 20,000 genes, these being sampled from a population of 5,000 individuals phenotyped by a quantitative trait with a heritability of 0.5.
2 Bonferroni-like correction (Bonferroni, 1930) for 20,000 tests with α = 0.05.
3 λ = weighting factor for the differential gene expression measured in terms of gene-specific residual SD.
4 Differential gene expression on a continuous scale was evaluated under model A (βModel A) and model B (effects β1, β2, β3, and β4).
5 Differential gene expression was simulated for effect β1 (genes 1 to 200), β2 (genes 1 to 100 and 201 to 300), β3 (1 to 100 and 301 to 400), and β4 (genes 1 to 100 and 401 to 500).
Table 3. Average percentages of differentially expressed genes with posterior probability >0.95 (>0.9999975)1,2
Simulation (λ3)   Genes5 (i.e., probes)
and effect4       1 to 100         101 to 200       201 to 300       301 to 400       401 to 500       501 to 20,000
λ = 0.1
  β1 (fix 1)      83.1b (2.0)      95.3a (3.1x)     4.0e (0)         4.9c (0)         5.1d (0)         4.6a (0)
  β2 (fix 2)      99.3a (20.5y)    6.0d (0)         99.0a (22.3x)    3.9c (0)         4.9d (0)         3.1a (0)
  β3 (genetic)    5.0e (0)         3.7d (0)         3.5e (0)         6.9c (0)         7.0d (0)         5.1a (0)
  β4 (residual)   4.8e (0)         4.9d (0)         6.4e (0)         5.0c (0)         6.2d (0)         4.7a (0)
  βModel A        33.7d (5.4x)     31.5b (0)        34.7c (0)        4.2c (0)         3.5d (0)         3.9a (0)
λ = 1
  β1 (fix 1)      99.4a (99.0z)    99.3a (99.2z)    13.2d (0)        7.1c (0)         26.3c (0)        5.4a (0)
  β2 (fix 2)      99.1a (98.7z)    12.0c (0)        99.2a (97.1z)    14.2b (0)        24.4c (0)        4.7a (0)
  β3 (genetic)    56.2c (0)        5.1d (0)         4.7e (0)         52.3a (0)        6.2d (0)         3.1a (0)
  β4 (residual)   96.2a (95.6z)    8.3cd (0)        15.3d (0)        7.2c (0)         93.7a (91.7z)    4.9a (0)
  βModel A        49.7c (18.2y)    88.5a (68.5y)    86.4b (67.7y)    16.8b (0)        61.1b (0)        5.0a (0)
a–e Means (by column) with the same superscript letter did not differ significantly (P > 0.05).
x–z Means (by column) with the same superscript letter did not differ significantly (P > 0.0000025). Means without decimals indicated an invariable estimate (i.e., the same value was obtained in all 10 replicates).
1 The simulation scenario accounted for 100 individuals with gene expression data for 20,000 genes, these being sampled from a population of 5,000 individuals phenotyped by a quantitative trait with a heritability of 0.1.
2 Bonferroni-like correction (Bonferroni, 1930) for 20,000 tests with α = 0.05.
3 λ = weighting factor for the differential gene expression measured in terms of gene-specific residual SD.
4 Differential gene expression on a continuous scale was evaluated under model A (βModel A) and model B (effects β1, β2, β3, and β4).
5 Differential gene expression was simulated for effect β1 (genes 1 to 200), β2 (genes 1 to 100 and 201 to 300), β3 (1 to 100 and 301 to 400), and β4 (genes 1 to 100 and 401 to 500).
Table 4. Average percentages of differentially expressed genes with posterior probability >0.95 (>0.9999975)1,2
Simulation (λ3)   Genes5 (i.e., probes)
and effect4       1 to 100         101 to 200       201 to 300       301 to 400       401 to 500       501 to 20,000
λ = 0.1
  β1 (fix 1)      81.1b (1.3w)     96.1a (3.7y)     6.0d (0)         5.3d (0)         4.8e (0)         5.0a (0)
  β2 (fix 2)      98.2a (13.2x)    6.1c (0)         99.2a (19.2y)    3.6d (0)         7.4e (0)         4.6a (0)
  β3 (genetic)    23.0e (0)        4.9c (0)         6.0d (0)         29.0b (0)        4.9e (0)         3.9a (0)
  β4 (residual)   28.0e (0)        5.0c (0)         5.3d (0)         5.0d (0)         5.2e (0)         5.2a (0)
  βModel A        44.8d (10.5x)    53.5b (0)        63.9b (1.5x)     13.7cd (0)       3.9e (0)         5.0a (0)
λ = 1
  β1 (fix 1)      99.8a (99.4z)    99.9a (99.3z)    12.1cd (0)       5.5d (0)         5.3e (0)         5.9a (0)
  β2 (fix 2)      99.9a (99.6z)    9.0c (0)         99.7a (99.1z)    6.4d (0)         21.9d (0)        4.6a (0)
  β3 (genetic)    99.2a (98.4z)    4.9c (0)         6.3d (0)         97.8a (96.5z)    49.9c (0)        4.7a (0)
  β4 (residual)   98.3a (97.1z)    8.9c (0)         15.4c (0)        21.0bc (0)       97.0a (95.7z)    3.6a (0)
  βModel A        60.3c (34.3y)    99.2a (99.0z)    98.9a (97.9z)    96.2a (69.4y)    70.7b (0)        4.9a (0)
a–e Means (by column) with the same superscript letter did not differ significantly (P > 0.05).
x–z Means (by column) with the same superscript letter did not differ significantly (P > 0.0000025). Means without decimals indicated an invariable estimate (i.e., the same value was obtained in all 10 replicates).
1 The simulation scenario accounted for 100 individuals with gene expression data for 20,000 genes, these being sampled from a population of 5,000 individuals phenotyped by a quantitative trait with a heritability of 0.5.
2 Bonferroni-like correction (Bonferroni, 1930) for 20,000 tests with α = 0.05.
3 λ = weighting factor for the differential gene expression measured in terms of gene-specific residual SD.
4 Differential gene expression on a continuous scale was evaluated under model A (βModel A) and model B (effects β1, β2, β3, and β4).
5 Differential gene expression was simulated for effect β1 (genes 1 to 200), β2 (genes 1 to 100 and 201 to 300), β3 (1 to 100 and 301 to 400), and β4 (genes 1 to 100 and 401 to 500).
Despite the reasonable statistical performances of model B discussed in previous paragraphs, this statistical development would not be more than an intellectual exercise without real contribution to the gene expression framework if the outputs did not 1) contribute additional information or 2) improve statistical outcomes. The first advantage (i.e., additional or more detailed information) could be linked to the proper partitioning of differential gene expression into several “partial” differential gene expressions related to systematic, genetic, and residual sources of variation (see results for probes 101 to 500 in Tables 1 to 6). Whereas model A identifies a differentially expressed gene whose transcription level depends on a given continuous covariate (i.e., phenotypic trait), model B goes an additional step by identifying which source of variation of the phenotypic trait (or all of them) links with the differential gene expression. On the other hand, statistical performances improved under model B in terms of positively detected genes with differential gene expression. Taking the results in Table 3 as an example (n = 100, h2 = 0.1, and λ = σl), model A detected 49.7% of the differentially expressed genes within the first 100 simulated genes whereas model B detected 99.4% (covariate β1), 99.1% (β2), 56.2% (β3), and 96.2% (β4). Note that all these percentages detected under model B were significantly greater (P < 0.05) than the 49.7% obtained under model A. In a similar way, the same trend was detected for genes 101 to 500 although the presence of significant departures depended on the simulation scenarios. It is important to highlight that all these
results were obtained by analyzing simulated data. Although simulation processes tried to mimic the pecu-
Table 5. Average percentages of differentially expressed genes with posterior probability >0.95 (>0.9999975)¹,²

Simulation (λ³) and effect⁴ | Genes⁵ (i.e., probes): 1 to 100 | 101 to 200 | 201 to 300 | 301 to 400 | 401 to 500 | 500 to 20,000
λ = 0.1 | | | | | |
β1 (fix 1) | 100 (100) | 100 (100) | 5.0c (0) | 4.6c (0) | 4.7c (0) | 7.1a (0)
β2 (fix 2) | 100 (100) | 5.3c (0) | 100 (100) | 6.9c (0) | 7.3c (0) | 4.9a (0)
β3 (genetic) | 58.1d (27.7x) | 3.0c (0) | 4.7c (0) | 90.7ab (82.9y) | 4.7c (0) | 4.7a (0)
β4 (residual) | 73.3bc (38.2x) | 5.0c (0) | 6.1c (0) | 5.4c (0) | 95.3ab (91.1yz) | 3.7a (0)
βModel A | 69.9c (37.2x) | 84.0b (78.0y) | 87.2b (77.0y) | 83.0b (68.3x) | 89.2b (84.2y) | 4.8a (0)
λ = 1 | | | | | |
β1 (fix 1) | 100 (100) | 100 (100) | 4.8c (0) | 5.0c (0) | 5.3c (0) | 3.3a (0)
β2 (fix 2) | 100 (100) | 6.5c (0) | 100 (100) | 7.0c (0) | 3.7c (0) | 4.9a (0)
β3 (genetic) | 92.1a (90.7z) | 4.2c (0) | 4.2c (0) | 99.6a (99.0z) | 5.1c (0) | 5.4a (0)
β4 (residual) | 100 (99.0z) | 5.1c (0) | 6.0c (0) | 4.2c (0) | 99.7a (98.6z) | 4.9a (0)
βModel A | 83.5b (66.6y) | 99.6a (99.3z) | 99.0a (98.5z) | 98.0a (96.2z) | 98.8a (95.9z) | 6.0a (0)

a–d Means (by column) with the same superscript letter did not differ significantly (P > 0.05).
x–z Means (by column) with the same superscript letter did not differ significantly (P > 0.0000025). Means without decimals indicated an invariable estimate (i.e., the same value was obtained in all 10 replicates).
¹ The simulation scenario accounted for 200 individuals with gene expression data for 20,000 genes, these being sampled from a population of 5,000 individuals phenotyped by a quantitative trait with a heritability of 0.1.
² Bonferroni-like correction (Bonferroni, 1930) for 20,000 tests with α = 0.05.
³ λ = weighting factor for the differential gene expression measured in terms of gene-specific residual SD.
⁴ Differential gene expression on a continuous scale was evaluated under model A (βModel A) and model B (effects β1, β2, β3, and β4).
⁵ Differential gene expression was simulated for effect β1 (genes 1 to 200), β2 (genes 1 to 100 and 201 to 300), β3 (1 to 100 and 301 to 400), and β4 (genes 1 to 100 and 401 to 500).
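The stricter cutoff given in parentheses in the table caption follows from the Bonferroni-like correction stated in the footnotes: with α = 0.05 spread over 20,000 tests, the posterior probability must exceed 1 − α/20,000. The short Python sketch below only reproduces that arithmetic; the function name is illustrative and is not taken from the original analysis.

```python
from fractions import Fraction


def bonferroni_posterior_threshold(alpha, n_tests):
    """Return the posterior-probability cutoff 1 - alpha / n_tests (exact arithmetic)."""
    return 1 - Fraction(alpha) / n_tests


# alpha = 0.05 and 20,000 tests, as stated in the table footnotes
threshold = bonferroni_posterior_threshold(Fraction(5, 100), 20_000)
per_test_alpha = 1 - threshold

print(float(threshold))       # 0.9999975
print(float(per_test_alpha))  # 2.5e-06, i.e., the P > 0.0000025 level used for the x-z letters
```

Exact fractions are used here only to avoid floating-point rounding when subtracting a very small number from 1.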
Table 6. Average percentages of differentially expressed genes with posterior probability >0.95 (>0.9999975)¹,²

Simulation (λ³) and effect⁴ | Genes⁵ (i.e., probes): 1 to 100 | 101 to 200 | 201 to 300 | 301 to 400 | 401 to 500 | 500 to 20,000
λ = 0.1 | | | | | |
β1 (fix 1) | 100 (100) | 100 (100) | 4.0c (0) | 5.2c (0) | 6.9c (0) | 5.0a (0)
β2 (fix 2) | 100 (100) | 4.4d (0) | 100 (100) | 3.6c (0) | 5.0c (0) | 3.0a (0)
β3 (genetic) | 64.3c (32.2x) | 3.7d (0) | 5.0c (0) | 88.3b (81.9y) | 3.5c (0) | 4.2a (0)
β4 (residual) | 69.2bc (35.2x) | 5.0d (0) | 6.5c (0) | 5.3c (0) | 93.2ab (90.0yz) | 6.1a (0)
βModel A | 64.2c (31.7x) | 83.6b (77.7y) | 83.4b (72.9y) | 79.9b (66.3x) | 87.9b (81.1y) | 5.6a (0)
λ = 1 | | | | | |
β1 (fix 1) | 100 (100) | 100 (100) | 2.9c (0) | 5.0c (0) | 5.2c (0) | 7.0a (0)
β2 (fix 2) | 100 (100) | 3.7c (0) | 100 (100) | 5.3c (0) | 3.9c (0) | 5.6a (0)
β3 (genetic) | 99.9a (99.5z) | 4.1c (0) | 4.8c (0) | 98.9a (98.5z) | 12.2c (0) | 4.8a (0)
β4 (residual) | 99.7a (98.9z) | 5.5c (0) | 5.5c (0) | 7.3c (0) | 98.7a (94.7z) | 3.3a (0)
βModel A | 78.2b (56.1y) | 99.6a (99.5z) | 98.9a (98.3z) | 96.9a (92.2z) | 95.7a (90.4yz) | 5.0a (0)

a–d Means (by column) with the same superscript letter did not differ significantly (P > 0.05).
x–z Means (by column) with the same superscript letter did not differ significantly (P > 0.0000025). Means without decimals indicated an invariable estimate (i.e., the same value was obtained in all 10 replicates).
¹ The simulation scenario accounted for 200 individuals with gene expression data for 20,000 genes, these being sampled from a population of 5,000 individuals phenotyped by a quantitative trait with a heritability of 0.5.
² Bonferroni-like correction (Bonferroni, 1930) for 20,000 tests with α = 0.05.
³ λ = weighting factor for the differential gene expression measured in terms of gene-specific residual SD.
⁴ Differential gene expression on a continuous scale was evaluated under model A (βModel A) and model B (effects β1, β2, β3, and β4).
⁵ Differential gene expression was simulated for effect β1 (genes 1 to 200), β2 (genes 1 to 100 and 201 to 300), β3 (1 to 100 and 301 to 400), and β4 (genes 1 to 100 and 401 to 500).
liarities of microarray gene expression data, statistical
performances revealed in Tables 1 to 6 should not be
extrapolated uncritically to real experiments. The substantial degree of uncontrollable noise inherent to real
gene expression data could modulate the performance
of model B in field or laboratory conditions, with these departures attenuating as the size of the
experiment increases. In any case, model B proved to be a
promising statistical tool for the analysis of microarray
data under simulation, with potential contributions to
further field data experiments.
Mixed model analysis of microarray gene expression
data must be viewed as a powerful analytical tool contributing relevant information to multiple research fields
(Wolfinger et al., 2001). Indeed, mixed models provide
a robust and flexible framework for interrogating a set
of hundreds to thousands of genes (i.e., probes), these
being preferred to more simplistic gene-by-gene screenings (Hoeschele and Li, 2005). This flexibility (and robustness) led to a plethora of mixed model parameterizations accounting for different peculiarities of gene
expression data such as skew (Kuznetsov et al., 2002;
Purdom and Holmes, 2005; Bhowmick et al., 2006) and
heavy-tailed patterns (Gottardo et al., 2006; Khondoker et al., 2006), mixtures of distributions (Bing et al.,
2005), within-gene residual heteroscedasticity (Casellas
and Varona, 2008), covariance in time-series data (Marot et al., 2009), and differential gene expression on a
continuous scale (Casellas et al., 2008), among others. It
was this last parameterization (i.e., continuous covariates) that provided the basis for the recursive mixed
linear model developed in this study. Although the first
microarray-based studies relied on the comparison of 2
(or a few) well-defined groups of biological conditions
where continuous covariates made no sense (e.g.,
healthy vs. cancerous cells; Golub et al., 1999; Perou et
al., 2000), continuous effects could be of special interest
in animal production-oriented studies. Differential gene
expression in livestock has previously been evaluated
on the basis of quantitative traits such as meat quality
(Bernard et al., 2007) or ovulation rate (Caetano et
al., 2004), these being obvious examples of continuous
traits influenced by systematic and genetic sources of
variation. When using the mixed linear model parameterization by Casellas et al. (2008), differential gene
expression for continuous covariates (i.e., BW) was
evaluated in a broad sense, without taking advantage
of the hierarchical structure inherent in these productive traits (Henderson, 1973). This limitation could be
partially overcome by the hierarchical recursive model
developed above, where differential gene expression was
evaluated on the different sources of variation inherent to the quantitative trait implemented as a continuous covariate. Note that this methodological development
falls within the context of current endeavors to implement structural equation models into genetic (Xiong
et al., 2004; Liu et al., 2008) and animal breeding (de
los Campos et al., 2006; Varona et al., 2007; López de
Maturana et al., 2009; Ibáñez-Escriche et al., 2010) research fields.
In conclusion, this study was an endeavor to develop
a new statistical tool to delve deeply into the knowledge
of the genetic architecture of gene expression, reporting
a more detailed characterization of the different sources
of differential gene expression and equaling or even improving the statistical power obtained under more standard mixed linear models with linear covariates. Far
from becoming a final step for gene expression analysis,
this recursive model could open an attractive research
scenario where this methodology and additional biostatistical developments would provide new insights into
the genetic and environmental mechanisms underlying gene transcription, a topic of special interest for
the scientific community from multiple research fields.
Moreover, models and parameterizations developed in
this manuscript could be straightforwardly adapted to
accommodate gene expression data generated under alternative methodologies such as whole transcriptome
shotgun sequencing (i.e., RNA-Seq).
LITERATURE CITED
Bernard, C., I. Cassar-Malek, M. Le Cunff, H. Dubroeucq, G. Renand, and J.-F. Hocquette. 2007. New indicators of beef sensory
quality revealed by expression of specific genes. J. Agric. Food
Chem. 55:5229–5237.
Bhowmick, D., A. C. Davison, D. R. Goldstein, and Y. Ruffieux.
2006. A Laplace mixture model for identification of differential
expressions in microarray experiments. Biostatistics 7:630–641.
Bing, N., I. Hoeschele, K. Ye, and K. J. Eilertsen. 2005. Finite mixture model analysis of microarray expression data on samples
of uncertain biological type with application to reproductive
efficiency. Vet. Immunol. Immunopathol. 105:187–196.
Bonferroni, C. E. 1930. Elementi di Statistica Generale. Libreria
Seber, Florence, Italy.
Caetano, A. R., R. K. Johnson, J. J. Ford, and D. Pomp. 2004. Microarray profiling for differential gene expression in ovaries and
ovarian follicles of pigs selected for increased ovulation rate.
Genetics 168:1529–1537.
Casellas, J., N. Ibáñez-Escriche, M. Martínez-Giner, and L. Varona.
2008. GEAMM v.1.4: a versatile program for mixed model analysis of gene expression data. Anim. Genet. 39:89–90.
Casellas, J., and L. Varona. 2008. Between-groups within-gene heterogeneity of residual variances in microarray gene expression
data. BMC Genomics 9:319.
Cui, X., and G. A. Churchill. 2003. Statistical tests for differential
expression in cDNA microarray experiments. Genome Biol.
4:210.
Cui, X., J. T. G. Hwang, J. Qiu, N. J. Blades, and G. A. Churchill.
2005. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics
6:59–75.
de los Campos, G., D. Gianola, P. Boettcher, and P. Moroni. 2006. A
structural equation model for describing relationships between
somatic cell score and milk yield in dairy goats. J. Anim. Sci.
84:2934–2941.
Gianola, D., and D. Sorensen. 2004. Quantitative genetic models
describing simultaneous and recursive relationships between
phenotypes. Genetics 167:1407–1424.
Golub, T. R., D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek,
J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A.
Caligiuri, C. D. Bloomfield, and E. S. Lander. 1999. Molecular
classification of cancer: class discovery and class prediction by
gene expression profiling. Science 286:531–537.
Gottardo, R., A. E. Raftery, K. Y. Yeung, and R. E. Bumgarner.
2006. Bayesian robust inference for differential gene expression
in microarrays with multiple samples. Biometrics 62:10–18.
Henderson, C. R. 1973. Sire evaluation and genetic trends. Pages
10–41 in Proc. Anim. Breeding Genet. Symp. in Honor of Dr.
Jay L. Lush. Am. Soc. Anim. Sci., Champaign, IL.
Hoeschele, I., and H. Li. 2005. A note on joint versus gene-specific
mixed model analysis of microarray gene expression data. Biostatistics 6:183–186.
Ibáñez-Escriche, N., E. López de Maturana, J. L. Noguera, and L.
Varona. 2010. An application of change-point recursive models
to the relationship between litter size and number of stillborns.
J. Anim. Sci. 88:3493–3503.
Khondoker, M. R., C. A. Glasbey, and B. J. Worton. 2006. Statistical estimation of gene expression using multiple laser scans of
microarrays. Bioinformatics 22:215–219.
Kuznetsov, V. A., G. D. Knott, and R. F. Bonner. 2002. General
statistics of stochastic process of gene expression in eukaryotic
cells. Genetics 161:1321–1332.
Lin, C. S., and C. W. Hsu. 2005. Differentially transcribed genes
in skeletal muscle of Duroc and Tayuan pigs. J. Anim. Sci.
83:2075–2086.
Liu, B., A. de la Fuente, and I. Hoeschele. 2008. Gene network inference via structural equation modeling in genetical genomics
experiments. Genetics 178:1763–1776.
López de Maturana, E., X.-L. Wu, D. Gianola, K. A. Weigel, and
G. J. M. Rosa. 2009. Exploring biological relationships between
calving traits in primiparous cattle with a Bayesian recursive
model. Genetics 181:277–287.
Marot, G., J.-L. Foulley, and F. Jaffrézic. 2009. A structural mixed
model to shrink covariance matrices for time-course differential
gene expression studies. Comput. Stat. Data Anal. 53:1630–1638.
McDaneld, T. G., D. L. Hancock, and D. E. Moody. 2004. Altered
mRNA abundance of ASB15 and four other genes in skeletal
muscle following administration of beta-adrenergic receptor
agonists. Physiol. Genomics 16:275–283.
Perou, C. M., T. Sorlie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey,
C. A. Rees, J. R. Pollack, D. T. Ross, H. Johnsen, L. A. Akslen,
O. Fluge, A. Pergamenschikov, C. Williams, S. X. Zhu, P. E.
Lønning, A. L. Børresen-Dale, P. O. Brown, and D. Botstein.
2000. Molecular portraits of human breast tumours. Nature
406:747–752.
Purdom, E., and S. P. Holmes. 2005. Error distribution for gene
expression data. Stat. Appl. Genet. Mol. Biol. 4:e16.
Raftery, A. E., and S. M. Lewis. 1992. How many iterations in the
Gibbs sampler? Pages 763–773 in Bayesian Statistics IV. J. M.
Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, ed.
Oxford Univ. Press, Oxford, UK.
Reverter, A., K. A. Byrne, H. L. Bruce, Y. H. Wang, B. P. Dalrymple, and S. A. Lehnert. 2003a. A mixture model-based cluster
analysis of DNA microarray gene expression data on Brahman
and Brahman composite steers fed high-, medium-, and low-quality diets. J. Anim. Sci. 81:1900–1910.
Reverter, A., K. A. Byrne, and B. P. Dalrymple. 2003b. BAYESMIX: A software program for Bayesian analysis of mixture
models with an application to the analysis of microarray gene
expression data. XV Proc. Assoc. Adv. Anim. Breed. Genet.,
Melbourne, Australia, 15:90–93.
Sorensen, D., and D. Gianola. 2002. Likelihood, Bayesian, and
MCMC Methods in Quantitative Genetics. Springer-Verlag,
New York, NY.
van Iterson, M., P. A. C. 't Hoen, P. Pedotti, G. J. E. J. Hooiveld, J.
T. den Dunnen, G. J. B. van Ommen, J. M. Boer, and R. X.
Menezes. 2009. Relative power and sample size analysis on gene
expression profiling data. BMC Genomics 10:439.
Varona, L., D. Sorensen, and R. Thompson. 2007. Analysis of litter
size and average litter weight in pigs using a recursive model.
Genetics 177:1791–1799.
Wolfinger, R. D., G. Gibson, E. D. Wolfinger, L. Bennett, H. Hamadeh, P. Bushel, C. Afshari, and R. S. Paules. 2001. Assessing
gene significance from cDNA microarray expression data via
mixed models. J. Comput. Biol. 8:625–637.
Wright, S. 1921. Correlation and causation. J. Agric. Res. 20:557–585.
Wright, S. 1922. Coefficients of inbreeding and relationship. Am.
Nat. 56:330–338.
Wu, H., M. Kerr, X. Cui, and G. Churchill. 2003. MAANOVA: A
software package for the analysis of spotted cDNA microarray
experiments. Pages 313–341 in The Analysis of Gene Expression Data. G. Parmigiani, E. S. Garrett, R. A. Irizarry, and S.
L. Zeger, ed. Springer, London, UK.
Xiong, M., J. Li, and X. Fang. 2004. Identification of genetic networks. Genetics 166:1037–1052.
APPENDIX
The recursiveness between phenotypic and gene expression data in model B relied on vectors yp, a, b, and ep
(from the mixed linear model for the phenotypic trait) and matrix W* (from the mixed linear model for gene
expression data) and can be easily illustrated with a small example. Assume a dummy population of 5 related individuals, all of them being recorded for a given phenotypic trait of interest influenced by a single systematic effect
(sex; see Table A1). Whatever the iteration of the Markov chain Monte Carlo sampler involved in the Bayesian analysis,
the relevant vectors of the mixed linear model for the phenotypic trait can be written as
\[
\mathbf{y}_p = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix}, \quad
\mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}, \quad
\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \end{bmatrix}, \quad \text{and} \quad
\mathbf{e}_p = \begin{bmatrix} y_1 - b_1 - a_1 \\ y_2 - b_2 - a_2 \\ y_3 - b_1 - a_3 \\ y_4 - b_2 - a_4 \\ y_5 - b_2 - a_5 \end{bmatrix},
\]
where yh and ah are the phenotypic record and the additive genetic effect of the hth individual, respectively; and
b1 and b2 are the predicted effect of sex for males and females, respectively.
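As a minimal sketch of how the residual vector ep is assembled from the other three vectors, the Python snippet below computes y_h − b_sex(h) − a_h for each of the 5 individuals. All numeric values are made up for illustration only, and the function and variable names are assumptions rather than part of the original software.

```python
# Sex of each individual, as listed in Table A1.
sex = {1: "male", 2: "female", 3: "male", 4: "female", 5: "female"}

# Hypothetical values at one sampling iteration (made-up numbers).
y_p = {1: 10.0, 2: 12.0, 3: 9.0, 4: 11.0, 5: 13.0}   # phenotypic records
b = {"male": 1.0, "female": 2.0}                      # b1 (males), b2 (females)
a = {1: 0.5, 2: -0.5, 3: 0.25, 4: 0.0, 5: 1.0}        # additive genetic effects


def phenotypic_residuals(y_p, b, a, sex):
    """Return e_p with one element y_h - b_sex(h) - a_h per individual h."""
    return {h: y_p[h] - b[sex[h]] - a[h] for h in y_p}


e_p = phenotypic_residuals(y_p, b, a, sex)
for h in sorted(e_p):
    print(h, e_p[h])
```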
Now, assume that only individuals number 2, 3, and 5 were profiled with a small microarray platform investigating 3 different genes (Table A1), and the recursiveness originates by feeding W* with the appropriate elements
of the previous vectors as follows:
\[
\mathbf{W}^{*} = \begin{bmatrix}
b_2 & a_2 & y_2 - b_2 - a_2 \\
b_1 & a_3 & y_3 - b_1 - a_3 \\
b_2 & a_5 & y_5 - b_2 - a_5
\end{bmatrix}.
\]
Assuming that gene expression data in yg were sorted by array and gene within array, W = W* ⊗ I (where I is an identity matrix of order 3, the number of genes) generalizes to
\[
\mathbf{W} = \begin{bmatrix}
b_2 & 0 & 0 & a_2 & 0 & 0 & y_2 - b_2 - a_2 & 0 & 0 \\
0 & b_2 & 0 & 0 & a_2 & 0 & 0 & y_2 - b_2 - a_2 & 0 \\
0 & 0 & b_2 & 0 & 0 & a_2 & 0 & 0 & y_2 - b_2 - a_2 \\
b_1 & 0 & 0 & a_3 & 0 & 0 & y_3 - b_1 - a_3 & 0 & 0 \\
0 & b_1 & 0 & 0 & a_3 & 0 & 0 & y_3 - b_1 - a_3 & 0 \\
0 & 0 & b_1 & 0 & 0 & a_3 & 0 & 0 & y_3 - b_1 - a_3 \\
b_2 & 0 & 0 & a_5 & 0 & 0 & y_5 - b_2 - a_5 & 0 & 0 \\
0 & b_2 & 0 & 0 & a_5 & 0 & 0 & y_5 - b_2 - a_5 & 0 \\
0 & 0 & b_2 & 0 & 0 & a_5 & 0 & 0 & y_5 - b_2 - a_5
\end{bmatrix},
\]
and this structure needs to be updated at each sampling iteration.
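The construction above can be sketched in a few lines of Python. The snippet builds the per-array 3 × 3 matrix (called `w_star` here) for individuals 2, 3, and 5 and expands it with a Kronecker product by an identity matrix of order 3 (one row per gene within array). The numeric values are hypothetical, and all function and variable names are assumptions made for this illustration, not the authors' implementation.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in A
        for row_b in B
    ]


def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def build_w_star(arrayed, sex, b, a, e_p):
    """One row (b_sex(h), a_h, e_h) per individual h with expression data."""
    return [[b[sex[h]], a[h], e_p[h]] for h in arrayed]


# Hypothetical current draws of the sampler (made-up numbers).
sex = {2: "female", 3: "male", 5: "female"}
b = {"male": 1.0, "female": 2.0}      # b1 (males), b2 (females)
a = {2: 0.5, 3: 0.25, 5: 1.0}         # additive genetic effects
e_p = {2: 9.5, 3: 7.75, 5: 10.0}      # residuals y_h - b_sex(h) - a_h
arrayed = [2, 3, 5]                   # individuals with arrays, in array order
n_genes = 3

w_star = build_w_star(arrayed, sex, b, a, e_p)   # 3 x 3
w = kron(w_star, identity(n_genes))              # 9 x 9, rows sorted by array and gene

for row in w:
    print(row)
```

In a sampler, `build_w_star` and `kron` would simply be called again after each new draw of b, a, and ep, which is the per-iteration update described above.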
Table A1. Example population of 5 individuals with (yes) or without (no) phenotypic and gene expression data

Individual | Sex | Phenotypic data | Gene expression data
1 | Male | Yes | No
2 | Female | Yes | Yes
3 | Male | Yes | No
4 | Female | Yes | Yes
5 | Female | Yes | Yes