* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download ppt - Bayesian Gene Expression
Long non-coding RNA wikipedia , lookup
Pathogenomics wikipedia , lookup
Quantitative trait locus wikipedia , lookup
Gene therapy wikipedia , lookup
History of genetic engineering wikipedia , lookup
Minimal genome wikipedia , lookup
Gene therapy of the human retina wikipedia , lookup
Public health genomics wikipedia , lookup
Epigenetics of diabetes Type 2 wikipedia , lookup
Gene nomenclature wikipedia , lookup
Gene desert wikipedia , lookup
Biology and consumer behaviour wikipedia , lookup
Ridge (biology) wikipedia , lookup
Genome evolution wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Genome (book) wikipedia , lookup
Genomic imprinting wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Helitron (biology) wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Microevolution wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Designer baby wikipedia , lookup
Bayesian Modelling for Differential Gene Expression Alex Lewin (Imperial College) Sylvia Richardson (IC Epidemiology) Tim Aitman (IC Microarray Centre) In collaboration with Anne-Mette Hein, Natalia Bochkina (IC Epidemiology) Helen Causton (IC Microarray Centre) Peter Green (Bristol) Insulin-resistance gene Cd36 cDNA microarray: hybridisation signal for SHR much lower than for Brown Norway and SHR.4 control strains Aitman et al 1999, Nature Genet 21:76-83 Larger microarray experiment: look for other genes associated with Cd36 Microarray Data 3 SHR compared with 3 transgenic rats (with Cd36) 3 wildtype (normal) mice compared with 3 mice with Cd36 knocked out 12000 genes on each array Biological Question Find genes which are expressed differently between animals with and without Cd36. • Bayesian Hierarchical Model for Differential Expression • Decision Rules • Predictive Model Checks • Simultaneous estimation of normalization and differential expression • Gene Ontology analysis for differentially expressed genes Microarray analysis is a multi-step process Low-level Model (how gene expression is estimated from signal) Normalisation (to make arrays comparable) Differential Expression We aim to integrate all the steps in a common statistical framework Clustering, Partition Model Bayesian Modelling Framework • Model different sources of variability simultaneously, within array, between array … • Uncertainty propagated from data to parameter estimates (so not over-optimistic in conclusions). • Share information in appropriate ways to get robust estimates. Gene Expression Data 3 wildtype mice, Fat tissue hybridised to Affymetrix chips Newton et al. 2001 Showed data fit well by Gamma or Log Normal distributions sd Kerr et al. 2000 Linear model on log scale mean Bayesian hierarchical model for differential expression Data: ygsr = log expression for gene g, condition s, replicate r g = gene effect δg = differential effect for gene g between 2 conditions r(g)s = array effect (expression-level dependent) gs2 = gene variance • 1st level yg1r | g, δg, g1 N(g – ½ δg + r(g)1 , g12), yg2r | g, δg, g2 N(g + ½ δg + r(g)2 , g22), Σr r(g)s = 0 r(g)s = function of g , parameters {a} and {b} Priors for gene effects Mean effect g g ~ Unif (much wider than data range) Differential effect δg δg ~ N(0,104) – “fixed” effects (no structure in prior) OR mixture: δg ~ p0δ0 + p1G_ (1.5, 1) + p2G+ (1.5, 2) H0 Explicit modelling of the alternative References Fixed Effects Kerr et al. 2000 Mixture Models Newton et al. 2004 (non-parametric mixture) Löenstedt and Speed 2003, Smyth 2004 (conjugate mixture prior) Broet et al. 2002 (several levels of DE) Prior for gene variances Two extreme cases: (1) Constant variance Too stringent gsr N(0, 2) Poor fit (2) Independent variances gsr N(0, g2) ! Variance estimates based on few replications are highly variable Need to share information between genes to better estimate their variance, while allowing some variability Hierarchical model Prior for gene variances • 2nd level gs2 | μs, τs logNormal (μs, τs) Hyper-parameters μs and τs can be influential. Empirical Bayes Eg. Löenstedt and Speed 2003, Smyth 2004 Fixes μs , τs Fully Bayesian • 3rd level μs N( c, d) τs Gamma (e, f) Gene specific variances are stabilised Variances estimated using information from all G x R measurements (~12000 x 3) rather than just 3 Variances stabilised and shrunk towards average variance Prior for array effects (Normalization) Spline Curve r(g)s = quadratic in g for ars(k-1) ≤ g ≤ ars(k) with coeff (brsk(1), brsk(2) ), k =1, … #breakpoints a0 a1 a2 a3 Locations of break points not fixed Must do sensitivity checks on # break points Array effect as a function of gene effect Bayesian posterior mean loess Effect of normalisation on density Wildtype Before (ygsr) ^ After (ygsr- r(g)s ) Knockout Bayesian hierarchical model for differential expression • 1st level – ygsr | g, δg, gs N(g – ½ δg + r(g)s , gs2), • 2nd level – Fixed effect priors for g, δg – Array effect coefficients, Normal and Uniform – gs2 | μs, τs logNormal (μs, τs) • 3rd level – μs N( c, d) – τs Gamma (e, f) WinBUGS software for fitting Bayesian models Declare the model for( i in 1 : ngenes ) { for( j in 1 : nreps) { y1[i, j] ~ dnorm(x1[i, j], tau1[i]) x1[i, j] <- alpha[i] - 0.5*delta[i] + beta1[i, j] } } for( i in 1 : ngenes ) { tau1[i] <- 1.0/sig21[i] sig21[i] <- exp(lsig21[i]) lsig21[i] ~ dnorm(mm1,tt1) } mm1 ~ dnorm( 0.0,1.0E-3) tt1 ~ dgamma(0.01,0.01) WinBUGS does the calculations WinBUGS software for fitting Bayesian models Whole posterior distribution Posterior means, medians, quantiles • Bayesian Hierarchical Model for Differential Expression • Decision Rules • Predictive Model Checks • Simultaneous estimation of normalization and differential expression • Gene Ontology analysis for differentially expressed genes Decision Rules for Inference So far, discussed fitting the model. How do we decide which genes are differentially expressed? Parameters of interest: g , δg , g – What quantity do we consider, δg , (δg /g) , … ? – How do we summarize the posterior distribution? Fixed Effects Model Inference on δ (1) dg = E(δg | data) posterior mean Like point estimate of log fold change. biological interest Decision Rule: gene g is DE if |dg| > δcut (2) pg = P( |δg| > δcut | data) biological interest posterior probability (incorporates uncertainty) Decision Rule: gene g is DE if pg > pcut statistical confidence This allows biologist to specify what size of effect is interesting (not just statistical significance) Fixed Effects Model Inference on δ, (1) tg = E(δg | data) / E(g | data) Like t-statistic. statistical confidence Decision Rule: gene g is DE if |tg| > tcut (2) pg = P( |δg /g| > tcut | data) Decision Rule: gene g is DE if pg > pcut Bochkina and Richardson (in preparation) statistical confidence Mixture Model δg ~ p0δ0 + p1G_ (1.5, 1) + p2G+ (1.5, 2) Explicit modelling of the alternative H0 (1) dg = E(δg | data) posterior mean Shrunk estimate of log fold change. Decision Rule: gene g is DE if |dg| > δcut (2) Classify genes into the mixture components. pg = P(gene g not in H0 | data) Decision Rule: gene g is DE if pg > pcut Illustration of decision rule pg = P( |δg| > log(2) and g > 4 | data) x pg > 0.8 Δ t-statistic > 2.78 (95% CI) • Bayesian Hierarchical Model for Differential Expression • Decision Rules • Predictive Model Checks • Simultaneous estimation of normalization and differential expression • Gene Ontology analysis for differentially expressed genes Bayesian P-values • Compare observed data to a “null” distribution • P-value: probability of an observation from the null distribution being more extreme than the actual observation • If all observations come from the null distribution, the distribution of p-values is Uniform Cross-validation p-values Idea of cross validation is to split the data: one part for fitting the model, the rest for validation n units of observation For each observation yi, run model on rest of data y-i, predict new data yinew from posterior distribution. Bayesian p-value pi = Prob(yinew > yi | data y-i) Distribution of p-values {pi, i=1,…,n} is approximately Uniform if model adequately describes the data. Posterior Predictive p-values For large n, not possible to run model n times. Run model on all data. For each observation yi, predict new data yinew from posterior distribution. Bayesian p-value pi = Prob(yinew > yi | all data) “all data” includes yi p-values are less extreme than they should be p-values are conservative (not quite Uniform). Example: Check priors on gene variances 1) Compare equal and exchangeable variance models 2) Compare different exchangeable priors Want to compare data for each gene, not gene and replicate, so use sample variance Sg2 (suppress index s here) Bayesian p-value Prob( Sg2 new > Sg2 obs | data) WinBUGS code for posterior predictive checks for( i in 1 : ngenes ) { for( j in 1 : nreps) { y1[i, j] ~ dnorm(x1[i, j], tau1[i]) ynew1[i, j] ~ dnorm(x1[i, j], tau1[i]) x1[i, j] <- alpha[i] - 0.5*delta[i] + beta1[i, j] } s21[i] <- pow(sd(y1[i, ]), 2) s2new1[i] <- pow(sd(ynew1[i, ]), 2) pval1[i] <- step(s2new1[i] - s21[i]) } replicate relevant sampling distribution calculate sample variances count no. times predicted sample variance is bigger than observed sample variance Posterior predictive Mean parameters Graph shows structure of model Prior parameters g2 new ygr ygr r = 1:R new Sg2 S g2 g = 1:G Mixed predictive Mean parameters Prior parameters Less conservative than posterior predictive (Marshall and Spiegelhalter, 2003) new g2 g2 new ygr ygr r = 1:R new Sg2 Sg2 g = 1:G Four models for gene variances Equal variance model: Model 1: 2 log Normal (0, 10000) Exchangeable variance models: Model 2: g-2 Gamma (2, β) Model 3: g-2 Gamma (α, β) Model 4: g2 log Normal (μ, τ) (α, β, μ, τ all parameters) Bayesian predictive p-values • Bayesian Hierarchical Model for Differential Expression • Decision Rules • Predictive Model Checks • Simultaneous estimation of normalization and differential expression • Gene Ontology analysis for differentially expressed genes Expression level dependent normalization Many gene expression data sets need normalization which depends on expression level. Usually normalization is performed in a pre-processing step before the model for differential expression is used. These analyses ignore the fact that the expression level is measured with variability. Ignoring this variability leads to bias in the function used for normalization. Simulated Data Gene variances similar range and distribution to mouse data Array effects cubic functions of expression level Differential effects 900 genes: δg = 0 50 genes: δg N( log(3), 0.12) 50 genes: δg N( -log(3), 0.12) Array Effects and Variability for Simulated Data _ Data points: ygsr – yg Curves: r(g)s (r = 1…3) (r = 1…3) Two-step method (using loess) 1) Use loess smoothing to obtain array effects loessr(g)s 2) Subtract loess array effects from data: yloessgsr = ygsr - loessr(g)s 3) Run our model on yloessgsr with no array effects Decision rules for selecting differentially expressed genes If P( |δg| > δcut | data) > pcut then gene g is called differentially expressed. δcut chosen according to biological hypothesis of interest (here we use log(3) ). pcut corresponds to the error rate (e.g. False Discovery Rate or Mis-classification Penalty) considered acceptable. Full model v. two-step method Plot observed False Discovery Rate against pcut (averaged over 5 simulations) Solid line for full model Dashed line for prenormalized method Different two-step methods 1) yloessgsr = ygsr - loessr(g)s 2) ymodelgsr = ygsr - E(r(g)s | data) Results from 2 different two-step methods are much closer to each other than to full model results. • Bayesian Hierarchical Model for Differential Expression • Decision Rules • Predictive Model Checks • Simultaneous estimation of normalization and differential expression • Gene Ontology analysis for differentially expressed genes Gene Ontology (GO) Database of biological terms Arranged in graph connecting related terms Directed Acyclic Graph: links indicate more specific terms ~16,000 terms from QuickGO website (EBI) Gene Ontology (GO) from QuickGO website (EBI) Gene Annotations • Genes/proteins annotated to relevant GO terms • Gene may be annotated to several GO terms • GO term may have 1000s of genes annotated to it (or none) • Gene annotated to term A annotated to all ancestors of A (terms that are related and more general) GO annotations of genes associated with the insulin-resistance gene Cd36 Compare GO annotations of genes most and least differentially expressed Most differentially expressed ↔ pg > 0.5 (280 genes) Least differentially expressed ↔ pg < 0.2 (11171 genes) GO annotations of genes associated with the insulin-resistance gene Cd36 For each GO term, Fisher’s exact test on proportion of differentially expressed genes with annotations v. proportion of non-differentially expressed genes with annotations observed O = A expected E = C*(A+B)/(C+D) if no association of GO annotation with DE FatiGO website http://fatigo.bioinfo.cnio.es/ genes annot. to GO term genes most diff. exp. genes least diff. exp. genes not annot. to GO term A B C D GO annotations of genes associated with the insulin-resistance gene Cd36 O = observed no. differentially expressed genes E = expected no. differentially expressed genes Biological process All GO ancestors of Inflammatory response Physiological process Response to stimulus Organismal movement Response to external stimulus (O=12, E=4.7) Response to biotic stimulus (O=14, E=6.9) Response to stress (O=12, E=5.9) Response to wounding (O=6, E=1.8) Response to external biotic stimulus * Defense response (O=11, E=5.8) * This term was not accessed by FatiGO Relations between GO terms were found using QuickGO: http://www.ebi.ac.uk/ego/ Response to pest, pathogen or parasite (O=8, E=2.6) Immune response (O=9, E=4.5) Inflammatory response (O=4, E=1.2) Further Work to do on GO • Account for dependencies between GO terms • Multiple testing corrections • Uncertainty in annotation ( work in preparation ) Summary • Bayesian hierarchical model flexible, estimates variances robustly • Predictive model checks show exchangeable prior good for gene variances • Useful to find GO terms over-represented in the most differentially-expressed genes Paper available (Lewin et al. 2005, Biometrics, in press) http ://www.bgx.org.uk/ Decision Rules • In full Bayesian framework, introduce latent allocation variable zg = 0,1 for gene g in null, alternative • For each gene, calculate posterior probability of belonging to unmodified component: pg = Pr( zg = 0 | data ) • Classify using cut-off on pg (Bayes rule corresponds to 0.5) • For any given pg , can estimate FDR, FNR. For gene-list S, est. (FDR | data) = Σg S pg / |S| The Null Hypothesis Composite Null Point Null, alternative not modelled Point Null, alternative modelled