Bayesian Modelling for Differential
Gene Expression
Alex Lewin
(Imperial College)
Sylvia Richardson (IC Epidemiology)
Tim Aitman (IC Microarray Centre)
In collaboration with
Anne-Mette Hein, Natalia Bochkina (IC Epidemiology)
Helen Causton (IC Microarray Centre)
Peter Green (Bristol)
Insulin-resistance gene Cd36
cDNA microarray: hybridisation signal for SHR much
lower than for Brown Norway and SHR.4 control strains
Aitman et al 1999, Nature Genet 21:76-83
Larger microarray experiment:
look for other genes associated with Cd36
Microarray Data
3 SHR compared with 3 transgenic rats (with Cd36)
3 wildtype (normal) mice compared with 3 mice with
Cd36 knocked out
~12000 genes on each array
Biological Question
Find genes which are expressed differently between
animals with and without Cd36.
• Bayesian Hierarchical Model for Differential
Expression
• Decision Rules
• Predictive Model Checks
• Simultaneous estimation of normalization and
differential expression
• Gene Ontology analysis for differentially
expressed genes
Microarray analysis is a multi-step process
Low-level Model (how gene expression is estimated from signal)
Normalisation (to make arrays comparable)
Differential Expression
Clustering, Partition Model
We aim to integrate all the steps in a common statistical framework
Bayesian Modelling Framework
• Model different sources of variability simultaneously,
within array, between array …
• Uncertainty propagated from data to parameter
estimates (so not over-optimistic in conclusions).
• Share information in appropriate ways to get robust
estimates.
Gene Expression Data
3 wildtype mice, fat tissue hybridised to Affymetrix chips
[Figure: sd against mean]
Newton et al. 2001: showed data fit well by Gamma or Log Normal distributions
Kerr et al. 2000: linear model on log scale
Bayesian hierarchical model for
differential expression
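Data: $y_{gsr}$ = log expression for gene g, condition s, replicate r
$\alpha_g$ = gene effect
$\delta_g$ = differential effect for gene g between 2 conditions
$\beta_{r(g)s}$ = array effect (expression-level dependent)
$\sigma_{gs}^2$ = gene variance
• 1st level
$y_{g1r} \mid \alpha_g, \delta_g, \sigma_{g1} \sim N(\alpha_g - \tfrac{1}{2}\delta_g + \beta_{r(g)1},\ \sigma_{g1}^2)$
$y_{g2r} \mid \alpha_g, \delta_g, \sigma_{g2} \sim N(\alpha_g + \tfrac{1}{2}\delta_g + \beta_{r(g)2},\ \sigma_{g2}^2)$
$\sum_r \beta_{r(g)s} = 0$
$\beta_{r(g)s}$ = function of $\alpha_g$, parameters {a} and {b}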
Priors for gene effects
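Mean effect $\alpha_g$
$\alpha_g \sim \mathrm{Unif}$ (much wider than data range)
Differential effect $\delta_g$
$\delta_g \sim N(0, 10^4)$: "fixed" effects (no structure in prior)
OR mixture:
$\delta_g \sim p_0\,\delta_0 + p_1\,G_{-}(1.5, 1) + p_2\,G_{+}(1.5, 2)$
The point mass $\delta_0$ is the null H0; the two G components model the alternative explicitly.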
References
Fixed Effects
Kerr et al. 2000
Mixture Models
Newton et al. 2004
(non-parametric mixture)
Lönnstedt and Speed 2003, Smyth 2004
(conjugate mixture prior)
Broët et al. 2002
(several levels of DE)
Prior for gene variances
Two extreme cases:
(1) Constant variance: $\varepsilon_{gsr} \sim N(0, \sigma^2)$
Too stringent, poor fit
(2) Independent variances: $\varepsilon_{gsr} \sim N(0, \sigma_g^2)$
Variance estimates based on few replicates are highly variable
Need to share information between genes to better estimate their variance, while allowing some variability
Hierarchical model
Prior for gene variances
• 2nd level
$\sigma_{gs}^2 \mid \mu_s, \tau_s \sim \text{logNormal}(\mu_s, \tau_s)$
Hyper-parameters $\mu_s$ and $\tau_s$ can be influential.
Empirical Bayes
E.g. Lönnstedt and Speed 2003, Smyth 2004
Fixes $\mu_s$, $\tau_s$
Fully Bayesian
• 3rd level
$\mu_s \sim N(c, d)$
$\tau_s \sim \text{Gamma}(e, f)$
Gene specific variances are stabilised
Variances estimated
using information
from all G x R
measurements
(~12000 x 3) rather
than just 3
Variances stabilised
and shrunk towards
average variance
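The stabilising effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: the fixed `weight` and the simple weighted average towards the overall mean are hypothetical stand-ins for the hierarchical posterior, not the model fitted in these slides, and the log-Normal spread of the true variances is made up.

```python
import math
import random


def simulate_variance_shrinkage(n_genes=2000, n_reps=3, weight=0.5, seed=1):
    """Compare raw per-gene sample variances with crudely shrunk ones.

    Returns (mse_raw, mse_shrunk), the mean squared errors against the
    true simulated variances.
    """
    rng = random.Random(seed)
    true_var, raw = [], []
    for _ in range(n_genes):
        # Assumed truth: gene variances are exchangeable and log-Normal.
        sigma2 = math.exp(rng.gauss(0.0, 0.5))
        y = [rng.gauss(0.0, math.sqrt(sigma2)) for _ in range(n_reps)]
        mean = sum(y) / n_reps
        s2 = sum((v - mean) ** 2 for v in y) / (n_reps - 1)
        true_var.append(sigma2)
        raw.append(s2)

    # Crude shrinkage: pull each sample variance towards the average variance.
    average = sum(raw) / n_genes
    shrunk = [weight * s2 + (1.0 - weight) * average for s2 in raw]

    def mse(estimates):
        return sum((e - t) ** 2 for e, t in zip(estimates, true_var)) / n_genes

    return mse(raw), mse(shrunk)


mse_raw, mse_shrunk = simulate_variance_shrinkage()
print(f"raw MSE = {mse_raw:.3f}, shrunk MSE = {mse_shrunk:.3f}")
```

With only 3 replicates per gene the raw sample variances are very noisy, so borrowing strength from the other genes should reduce the error in this setting.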
Prior for array effects (Normalization)
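Spline Curve
$\beta_{r(g)s}$ = quadratic in $\alpha_g$ for $a_{rs(k-1)} \le \alpha_g \le a_{rs(k)}$
with coefficients $(b_{rsk}^{(1)}, b_{rsk}^{(2)})$, k = 1, … , #breakpoints
[Figure: spline curve with break points $a_0$, $a_1$, $a_2$, $a_3$]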
Locations of break points not fixed
Must do sensitivity checks on # break points
Array effect as a function of gene effect
[Figure: curves show the Bayesian posterior mean and the loess fit]
Effect of normalisation on density
[Figure: Wildtype and Knockout panels, before ($y_{gsr}$) and after ($y_{gsr} - \hat{\beta}_{r(g)s}$) normalisation]
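Bayesian hierarchical model for
differential expression
• 1st level
– $y_{gsr} \mid \alpha_g, \delta_g, \sigma_{gs} \sim N(\alpha_g \mp \tfrac{1}{2}\delta_g + \beta_{r(g)s},\ \sigma_{gs}^2)$ (minus for s = 1, plus for s = 2)
• 2nd level
– Fixed effect priors for $\alpha_g$, $\delta_g$
– Array effect coefficients, Normal and Uniform
– $\sigma_{gs}^2 \mid \mu_s, \tau_s \sim \text{logNormal}(\mu_s, \tau_s)$
• 3rd level
– $\mu_s \sim N(c, d)$
– $\tau_s \sim \text{Gamma}(e, f)$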
WinBUGS software for fitting
Bayesian models
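Declare the model
# Likelihood for condition 1; dnorm takes (mean, precision)
for( i in 1 : ngenes ) {
   for( j in 1 : nreps ) {
      y1[i, j] ~ dnorm(x1[i, j], tau1[i])
      # gene effect, minus half the differential effect, plus array effect
      x1[i, j] <- alpha[i] - 0.5*delta[i] + beta1[i, j]
   }
}
# Gene variances: exchangeable log-Normal prior (2nd level)
for( i in 1 : ngenes ) {
   tau1[i] <- 1.0/sig21[i]
   sig21[i] <- exp(lsig21[i])
   lsig21[i] ~ dnorm(mm1, tt1)
}
# Hyper-priors (3rd level)
mm1 ~ dnorm(0.0, 1.0E-3)
tt1 ~ dgamma(0.01, 0.01)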
WinBUGS does the calculations
Output: the whole posterior distribution, with posterior means, medians and quantiles
Decision Rules for Inference
So far, discussed fitting the model.
How do we decide which genes are differentially
expressed?
Parameters of interest: $\alpha_g$, $\delta_g$, $\sigma_g$
– What quantity do we consider, $\delta_g$, $\delta_g / \sigma_g$, … ?
– How do we summarize the posterior distribution?
Fixed Effects Model
Inference on $\delta$
(1) $d_g = E(\delta_g \mid \text{data})$, the posterior mean
Like a point estimate of log fold change.
Decision Rule: gene g is DE if $|d_g| > \delta_{cut}$ ($\delta_{cut}$ set by biological interest)
(2) $p_g = P(|\delta_g| > \delta_{cut} \mid \text{data})$, a posterior probability (incorporates uncertainty)
Decision Rule: gene g is DE if $p_g > p_{cut}$ ($\delta_{cut}$ set by biological interest, $p_{cut}$ by statistical confidence)
This allows the biologist to specify what size of effect is interesting (not just statistical significance)
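Both rules can be evaluated directly from MCMC output. The following is a minimal sketch, assuming the posterior draws of $\delta_g$ for one gene are available as a plain list; the function names and the four draws are made up for illustration.

```python
import math


def posterior_mean(samples):
    """d_g: posterior mean of delta_g, estimated from MCMC draws."""
    return sum(samples) / len(samples)


def prob_exceeds(samples, delta_cut):
    """p_g: fraction of draws with |delta_g| > delta_cut."""
    return sum(1 for d in samples if abs(d) > delta_cut) / len(samples)


def is_de(samples, delta_cut, p_cut):
    """Rule (2): call the gene DE if p_g > p_cut."""
    return prob_exceeds(samples, delta_cut) > p_cut


# Hypothetical posterior draws of delta_g for a single gene.
draws = [0.1, 0.9, -1.2, 0.8]
delta_cut = math.log(2)  # "interesting" means at least a two-fold change

print(posterior_mean(draws))
print(prob_exceeds(draws, delta_cut))
print(is_de(draws, delta_cut, p_cut=0.8))
```

Rule (1) would compare `abs(posterior_mean(draws))` with `delta_cut`; rule (2) also reflects how spread out the draws are.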
Fixed Effects Model
Inference on $\delta$, $\sigma$
(1) $t_g = E(\delta_g \mid \text{data}) / E(\sigma_g \mid \text{data})$
Like a t-statistic.
Decision Rule: gene g is DE if $|t_g| > t_{cut}$ ($t_{cut}$ set by statistical confidence)
(2) $p_g = P(|\delta_g / \sigma_g| > t_{cut} \mid \text{data})$
Decision Rule: gene g is DE if $p_g > p_{cut}$ ($p_{cut}$ set by statistical confidence)
Bochkina and Richardson (in preparation)
Mixture Model
$\delta_g \sim p_0\,\delta_0 + p_1\,G_{-}(1.5, 1) + p_2\,G_{+}(1.5, 2)$
The point mass $\delta_0$ is the null H0; the two G components model the alternative explicitly.
(1) $d_g = E(\delta_g \mid \text{data})$, the posterior mean
Shrunk estimate of log fold change.
Decision Rule: gene g is DE if $|d_g| > \delta_{cut}$
(2) Classify genes into the mixture components.
$p_g = P(\text{gene } g \text{ not in } H_0 \mid \text{data})$
Decision Rule: gene g is DE if $p_g > p_{cut}$
Illustration of decision rule
$p_g = P(|\delta_g| > \log(2) \text{ and } \alpha_g > 4 \mid \text{data})$
[Figure legend: × marks genes with $p_g > 0.8$; Δ marks genes with t-statistic > 2.78 (95% CI)]
Bayesian P-values
• Compare observed data to a “null” distribution
• P-value: probability of an observation from the null
distribution being more extreme than the actual
observation
• If all observations come from the null distribution, the
distribution of p-values is Uniform
Cross-validation p-values
Idea of cross validation is to split the data: one part for
fitting the model, the rest for validation
n units of observation
For each observation $y_i$, run the model on the rest of the data $y_{-i}$ and predict new data $y_i^{new}$ from the posterior distribution.
Bayesian p-value $p_i = \mathrm{Prob}(y_i^{new} > y_i \mid \text{data } y_{-i})$
The distribution of p-values $\{p_i, i = 1, …, n\}$ is approximately Uniform if the model adequately describes the data.
Posterior Predictive p-values
For large n, it is not feasible to run the model n times.
Run the model on all data. For each observation $y_i$, predict new data $y_i^{new}$ from the posterior distribution.
Bayesian p-value $p_i = \mathrm{Prob}(y_i^{new} > y_i \mid \text{all data})$
"All data" includes $y_i$, so the p-values are less extreme than they should be: they are conservative (not quite Uniform).
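The conservativeness can be seen in a toy case where both kinds of p-value have closed forms. This sketch assumes a model far simpler than the one in these slides: Normal data with known standard deviation `sigma` and a flat prior on the mean, so the predictive distribution is Normal. The three data values are made up.

```python
from statistics import NormalDist


def cross_validation_p_values(y, sigma):
    """p_i = Prob(y_i_new > y_i | y_-i): the model is fitted without y_i."""
    n = len(y)
    p_values = []
    for i, y_i in enumerate(y):
        rest = y[:i] + y[i + 1:]
        post_mean = sum(rest) / (n - 1)
        # Predictive sd = observation noise plus posterior uncertainty in the mean.
        pred_sd = sigma * (1.0 + 1.0 / (n - 1)) ** 0.5
        p_values.append(1.0 - NormalDist(post_mean, pred_sd).cdf(y_i))
    return p_values


def posterior_predictive_p_values(y, sigma):
    """p_i = Prob(y_i_new > y_i | all data): y_i is used twice."""
    n = len(y)
    post_mean = sum(y) / n
    pred_sd = sigma * (1.0 + 1.0 / n) ** 0.5
    return [1.0 - NormalDist(post_mean, pred_sd).cdf(y_i) for y_i in y]


y = [4.1, 5.0, 7.3]  # hypothetical observations
for p_cv, p_pp in zip(cross_validation_p_values(y, 1.0),
                      posterior_predictive_p_values(y, 1.0)):
    print(f"cross-validation {p_cv:.3f}   posterior predictive {p_pp:.3f}")
```

In this toy model every posterior predictive p-value lies closer to 0.5 than its cross-validation counterpart, because $y_i$ pulls the fitted mean towards itself.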
Example: Check priors on
gene variances
1) Compare equal and exchangeable variance models
2) Compare different exchangeable priors
Want to compare data for each gene, not gene and replicate, so use the sample variance $S_g^2$ (suppressing index s here)
Bayesian p-value $\mathrm{Prob}(S_g^{2,new} > S_g^{2,obs} \mid \text{data})$
WinBUGS code for posterior
predictive checks
for( i in 1 : ngenes ) {
   for( j in 1 : nreps ) {
      y1[i, j] ~ dnorm(x1[i, j], tau1[i])
      # replicate the relevant sampling distribution
      ynew1[i, j] ~ dnorm(x1[i, j], tau1[i])
      x1[i, j] <- alpha[i] - 0.5*delta[i] + beta1[i, j]
   }
   # calculate sample variances (observed and predicted)
   s21[i] <- pow(sd(y1[i, ]), 2)
   s2new1[i] <- pow(sd(ynew1[i, ]), 2)
   # indicator that the predicted sample variance is bigger than the observed one;
   # averaged over iterations, this counts how often that happens (the p-value)
   pval1[i] <- step(s2new1[i] - s21[i])
}
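Posterior predictive
[Graph shows structure of model, for g = 1:G and r = 1:R: the prior parameters feed $\sigma_g^2$; $\sigma_g^2$ and the mean parameters feed both $y_{gr}$ and $y_{gr}^{new}$; these give $S_g^2$ and $S_g^{2,new}$]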
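Mixed predictive
[Graph shows structure of model, for g = 1:G and r = 1:R: the prior parameters feed both $\sigma_g^2$ and a new $\sigma_g^{2,new}$; $y_{gr}$ depends on $\sigma_g^2$ and $y_{gr}^{new}$ on $\sigma_g^{2,new}$, both with the mean parameters; these give $S_g^2$ and $S_g^{2,new}$]
Less conservative than posterior predictive
(Marshall and Spiegelhalter, 2003)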
Four models for gene variances
Equal variance model:
Model 1: $\sigma^2 \sim \text{logNormal}(0, 10000)$
Exchangeable variance models:
Model 2: $\sigma_g^{-2} \sim \text{Gamma}(2, \beta)$
Model 3: $\sigma_g^{-2} \sim \text{Gamma}(\alpha, \beta)$
Model 4: $\sigma_g^2 \sim \text{logNormal}(\mu, \tau)$
($\alpha$, $\beta$, $\mu$, $\tau$ all parameters)
Bayesian predictive p-values
Expression level dependent
normalization
Many gene expression data sets need normalization
which depends on expression level.
Usually normalization is performed in a pre-processing
step before the model for differential expression is used.
These analyses ignore the fact that the expression level
is measured with variability.
Ignoring this variability leads to bias in the function used
for normalization.
Simulated Data
Gene variances
similar range and distribution to mouse data
Array effects
cubic functions of expression level
Differential effects
900 genes: $\delta_g = 0$
50 genes: $\delta_g \sim N(\log(3), 0.1^2)$
50 genes: $\delta_g \sim N(-\log(3), 0.1^2)$
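The differential effects of this design could be generated as follows. This is a sketch only: it reads the second Normal argument as a variance of 0.1 squared (standard deviation 0.1), and the function name and seed are hypothetical. Gene variances and array effects are not simulated here.

```python
import math
import random


def simulate_differential_effects(seed=0):
    """Return 1000 differential effects: 900 null, 50 up, 50 down."""
    rng = random.Random(seed)
    null = [0.0] * 900
    # Assumption: N(log 3, 0.1^2), i.e. standard deviation 0.1.
    up = [rng.gauss(math.log(3), 0.1) for _ in range(50)]
    down = [rng.gauss(-math.log(3), 0.1) for _ in range(50)]
    return null + up + down


deltas = simulate_differential_effects()
print(len(deltas), sum(1 for d in deltas if d != 0.0))
```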
Array Effects and Variability for
Simulated Data
[Figure: data points $y_{gsr} - \bar{y}_g$ and curves $\beta_{r(g)s}$, for r = 1…3]
Two-step method (using loess)
1) Use loess smoothing to obtain array effects $\beta^{loess}_{r(g)s}$
2) Subtract loess array effects from data:
$y^{loess}_{gsr} = y_{gsr} - \beta^{loess}_{r(g)s}$
3) Run our model on $y^{loess}_{gsr}$ with no array effects
Decision rules for selecting differentially
expressed genes
If P( |δg| > δcut | data) > pcut then gene g is called
differentially expressed.
δcut chosen according to biological hypothesis of
interest (here we use log(3) ).
pcut corresponds to the error rate (e.g. False Discovery
Rate or Mis-classification Penalty) considered
acceptable.
Full model v. two-step method
Plot observed False
Discovery Rate against
pcut (averaged over 5
simulations)
Solid line for full model
Dashed line for prenormalized method
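For simulated data the truth is known, so the observed False Discovery Rate at a given pcut can be counted directly. A minimal sketch, assuming `p` holds the posterior probabilities $P(|\delta_g| > \delta_{cut} \mid \text{data})$ and `truly_de` the simulated truth; both names and the five example genes are hypothetical.

```python
def observed_fdr(p, truly_de, p_cut):
    """Fraction of genes called DE (p_g > p_cut) that are not truly DE."""
    called = [g for g, p_g in enumerate(p) if p_g > p_cut]
    if not called:
        return 0.0  # convention: no calls, no false discoveries
    false_calls = sum(1 for g in called if not truly_de[g])
    return false_calls / len(called)


# Hypothetical posterior probabilities and truth for five genes.
p = [0.95, 0.85, 0.6, 0.3, 0.9]
truly_de = [True, True, False, False, False]

for p_cut in (0.5, 0.8, 0.99):
    print(p_cut, observed_fdr(p, truly_de, p_cut))
```

Repeating this over a grid of pcut values, and averaging over simulations, gives a curve of the kind described for the figure.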
Different two-step methods
1) $y^{loess}_{gsr} = y_{gsr} - \beta^{loess}_{r(g)s}$
2) $y^{model}_{gsr} = y_{gsr} - E(\beta_{r(g)s} \mid \text{data})$
Results from 2 different two-step methods are much closer to each
other than to full model results.
Gene Ontology (GO)
Database of biological
terms
Arranged in graph
connecting related terms
Directed Acyclic Graph:
links indicate more
specific terms
~16,000 terms
from QuickGO website (EBI)
Gene Ontology (GO)
from QuickGO website (EBI)
Gene Annotations
• Genes/proteins annotated to relevant GO terms
• Gene may be annotated to several GO terms
• GO term may have 1000s of genes annotated to it
(or none)
• Gene annotated to term A ⇒ annotated to all ancestors of A (terms that are related and more general)
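The ancestor rule amounts to closing each gene's annotation set under the parent relation of the graph. A minimal sketch: the `PARENTS` edges below are illustrative only and are not taken from the GO database, and `geneX` is a made-up gene name.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links in the DAG."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Extend each gene's direct annotations with every ancestor term."""
    full = {}
    for gene, terms in annotations.items():
        closed = set(terms)
        for term in terms:
            closed |= ancestors(term, parents)
        full[gene] = closed
    return full


# Illustrative toy DAG (child -> parents); NOT the real GO structure.
PARENTS = {
    "inflammatory response": ["response to wounding", "defense response"],
    "response to wounding": ["response to stress"],
    "defense response": ["response to stimulus"],
    "response to stress": ["response to stimulus"],
    "response to stimulus": ["biological process"],
}

print(sorted(propagate({"geneX": {"inflammatory response"}}, PARENTS)["geneX"]))
```

A term can have several parents, which is why the structure is a directed acyclic graph and not a tree; the `seen` set stops shared ancestors from being visited twice.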
GO annotations of genes associated with
the insulin-resistance gene Cd36
Compare GO annotations of
genes most and least
differentially expressed
Most differentially expressed
↔ pg > 0.5 (280 genes)
Least differentially expressed
↔ pg < 0.2 (11171 genes)
GO annotations of genes associated with
the insulin-resistance gene Cd36
For each GO term, Fisher’s exact test on
proportion of differentially expressed genes with annotations
v.
proportion of non-differentially expressed genes with annotations

                          genes annot.     genes not annot.
                          to GO term       to GO term
genes most diff. exp.     A                B
genes least diff. exp.    C                D

observed O = A
expected E = C*(A+B)/(C+D) if no association of GO annotation with DE
FatiGO website
http://fatigo.bioinfo.cnio.es/
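The observed and expected counts and the test itself can be computed from the four counts A, B, C, D, taking rows as most / least differentially expressed and columns as annotated / not annotated to the GO term. This is a sketch under assumptions: the slides do not say whether the test is one- or two-sided, so a one-sided test for over-representation is assumed, and the example counts are hypothetical.

```python
from math import comb


def expected_count(a, b, c, d):
    """E = C*(A+B)/(C+D): expected annotated genes among the most-DE list."""
    return c * (a + b) / (c + d)


def fisher_one_sided(a, b, c, d):
    """P(X >= A) under the hypergeometric null (assumed one-sided test)."""
    n_total = a + b + c + d
    n_annot = a + c   # genes annotated to the term
    n_most = a + b    # genes in the most-DE list
    upper = min(n_annot, n_most)
    favourable = sum(
        comb(n_annot, k) * comb(n_total - n_annot, n_most - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(n_total, n_most)


# Hypothetical counts for one GO term: 280 most-DE and 11171 least-DE genes.
a, b, c, d = 4, 276, 48, 11123
print("O =", a, " E =", round(expected_count(a, b, c, d), 1))
print("one-sided p-value =", fisher_one_sided(a, b, c, d))
```

Exact integer arithmetic with `math.comb` avoids overflow for gene lists of this size; the final division is the only floating-point step.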
GO annotations of genes associated with
the insulin-resistance gene Cd36
O = observed no. differentially expressed genes
E = expected no. differentially expressed genes
[Figure: all GO ancestors of Inflammatory response]
Biological process
Physiological process
Response to stimulus
Organismal movement
Response to external stimulus (O=12, E=4.7)
Response to biotic stimulus (O=14, E=6.9)
Response to stress (O=12, E=5.9)
Response to wounding (O=6, E=1.8)
Response to external biotic stimulus *
Defense response (O=11, E=5.8)
Response to pest, pathogen or parasite (O=8, E=2.6)
Immune response (O=9, E=4.5)
Inflammatory response (O=4, E=1.2)
* This term was not accessed by FatiGO
Relations between GO terms were found using QuickGO: http://www.ebi.ac.uk/ego/
Further Work to do on GO
• Account for dependencies between GO terms
• Multiple testing corrections
• Uncertainty in annotation
( work in preparation )
Summary
• Bayesian hierarchical model flexible, estimates
variances robustly
• Predictive model checks show exchangeable prior
good for gene variances
• Useful to find GO terms over-represented in the
most differentially-expressed genes
Paper available (Lewin et al. 2005, Biometrics, in press)
http://www.bgx.org.uk/
Decision Rules
• In full Bayesian framework, introduce latent allocation
variable zg = 0,1 for gene g in null, alternative
• For each gene, calculate posterior probability of belonging to
unmodified component: pg = Pr( zg = 0 | data )
• Classify using cut-off on pg (Bayes rule corresponds to 0.5)
• For any given pg , can estimate FDR, FNR.
For gene-list S, est. (FDR | data) = $\sum_{g \in S} p_g / |S|$
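The estimate is simply the average posterior null probability over the selected list. A minimal sketch, assuming `p_null[g]` holds $p_g = \Pr(z_g = 0 \mid \text{data})$ and that a gene enters the list S when $p_g$ falls below the cut-off; the function name and the five values are hypothetical.

```python
def estimated_fdr(p_null, p_cut=0.5):
    """Estimated FDR for the list S = {g : p_g < p_cut}.

    p_null[g] is the posterior probability that gene g belongs to the
    null component, so the mean of p_g over S estimates the share of
    false discoveries in S.
    """
    selected = [p for p in p_null if p < p_cut]
    if not selected:
        return 0.0  # empty list: nothing declared, nothing false
    return sum(selected) / len(selected)


# Hypothetical posterior null probabilities for five genes.
p_null = [0.05, 0.1, 0.4, 0.7, 0.95]
print(estimated_fdr(p_null))         # list S = first three genes
print(estimated_fdr(p_null, 0.2))    # stricter cut-off, smaller list
```

Unlike the observed FDR in a simulation, this estimate needs no knowledge of the truth, only the posterior probabilities.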
The Null Hypothesis
Composite Null
Point Null, alternative not modelled
Point Null, alternative modelled