Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Bayesian hierarchical modelling of genomic data Sylvia Richardson Centre for Biostatistics Imperial College, London In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s) Helen Causton and Tim Aitman (Hammersmith) Peter Green (Bristol) Philippe Broët (INSERM, Paris) BBSRC Exploiting Genomics grant 1 Outline • Introduction • A fully Bayesian gene expression index (BGX) • Differential expression and array effects • Mixture models • Discussion 2 Part 1 Introduction • Recent developments in genomics have led to techniques – Capable of interrogating the genome at different levels – Aiming to capture one or several stages of the biological process DNA mRNA protein phenotype 3 Fundamental process Protein-encoding genes are transcribed into mRNA (messenger), and the mRNA is translated to make proteins DNA -> mRNA -> protein 4 Pictures from http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html What are gene expression data ? DNA Microarrays are used to measure the relative abundance of mRNA, providing information on gene expression in a particular cell, under particular conditions The fundamental principle used to measure the expression is that of hybridisation between a sample and probes: – Known sequences of single-stranded DNA representing genes are immobilised on microarray –Tissue sample (with unknown concentration of RNA) fluorescently labelled – Sample hybridised to array – Array scanned to measure amount of RNA present for each sequence gene expression measure The expression level of ten of thousands of probes are measured on a single microarray ! gene expression profile 5 Variation and uncertainty Gene expression data (e.g. Affymetrix) is the result of multiple sources of variability • • • • • condition/treatment biological array manufacture imaging technical • gene specific variability of the probes for a gene • within/between array variation Structured statistical modelling allows considering all uncertainty at once 6 Example of within vs between strains gene variability • 7 cross-bred strains of mice that differ only by a small portion of chromosome 1 • Strains have different phenotypes related to immunological disorders • For each line, 9 animals used to obtain 3 pooled RNA extracts from spleen 7 x 3 samples Excellent experimental design to minimise “biological variability between replicate animals” Aim: to tease out differences between expression profiles of the 7 lines of mice and relate these to locations on chromosome 1 7 Biological variability is large ! Ratio within/total Total variance calculated over the 21 samples Average (over the 7 groups) of within strain variance 8 calculated from the 3 pooled samples 1000 genes most variable between strains: hierarchical clustering recovers the cross-bred lines structure Random set of 1000 genes 9 Common characteristics of genomics data sets • High dimensional data (ten of thousands of genes) and few samples • Many sources of variability (low signal/noise ratio) Common issues • • • • Pre-processing and data reduction Multiple testing Need to borrow information Importance to include prior biological knowledge 10 Part 2 • Introduction • A fully Bayesian gene expression index (BGX) – Single array model – Multiple array model • Differential expression and array effects • Mixture models • Discussion 11 A fully Bayesian Gene eXpression index for Affymetrix GeneChip arrays Anne Mette Hein SR, Helen Causton, Graeme Ambler, Peter Green PM MM PM MM PM MM PM MM Raw intensities Background correction Gene specific variability (probe) Gene index BGX 12 Affymetrix GeneChips: Zoom Image of Hybridised Array * * * * * Hybridised Spot Each gene g represented by probe set: (J:11-20) Expressed PM Perfect match: PMg1,…, PMgJ Mis-match: Non-expressed PM Image of Hybridised Array Slide courtesy of Affymetrix MMg1,…, MMgJ expression measure for gene g 13 Commonly used methods for estimation expression levels from GeneChips MAS5: • uses PM and MMs. Imputes IMs from MMs to obtain all PM-MMs positive • gene expression measure : estimate obtained by applying Tukey Biweight to the set of log(PM-MM) values in the probe set Characteristics: positive, robust, noisy at low levels RMA: • uses PMs only. • Fits an model with additive gene and probe effects to logscale background corrected PMs using median polish 14 Characteristics: positive, robust, attenuated signal detection Mean (left) and Empirical standard deviation (right) over 7 conditions (arrays) for 45000 genes estimated by 2 different methods for quantifying gene expression mean Variability across conditions is conditioned by the 15 choice of summary measure ! Beware of filtering Model assumptions and key biological parameters • The intensity for the PM measurement for probe (reporter) j and gene g is due to binding of labelled fragments that perfectly match the oligos in the spot The true Signal Sgj of labelled fragments that do not perfectly match these oligos The non-specific hybridisation Hgj • The intensity of the corresponding MM measurement is caused by a binding fraction Φ of the true signal Sgj by non-specific hybridisation Hgj 16 BGX single array model: g=1,…,G (thousands), j=1,…,J (11-20) PMgj N( Sgj + Hgj , τ2) Background noise, additive MMgj N(Φ Sgj + Hgj ,τ2) fraction Non-specific hybridisation signal log(Sgj+1) TN (μg , ξg2) log(Hgj+1) TN(λ, η2) j=1,…,J Gene specific error terms: exchangeable Gene expression index (BGX): Array-wide distribution qg=median(TN (μg , ξ g2)) “Pools” information over probes j=1,…,J log(ξ g2)N(a, b2) Priors: “vague” t2 ~ G(10-3, 10-3) F~ B(1,1), mg ~ U(0,15) h2 ~ G(10-3, 10-3), l ~ N(0,103) “Empirical Bayes” 17 Inference: • Implemented in WinBugs and C • allows: - Joint estimation of parameters in full Bayesian framework • obtain: - posterior distributions of parameters (and functions of these) in model: mean 2.5-97.5% credibility interval For each gene g: Log-scale true signals: Log-scale non-spec. hybr: BGX: gene expr: log(Sgj+1): j=1,…,J log(Hgj+1): j=1,…,J qg: 1 2 3 2 3 4 1.75 2 2.25 NB! A distribution 18 Computational issues • We found mixing slow for gene specific parameters (μg , ξg2) and large autocorrelation • For low signal (bottom 25%) more variability of Sgj and Hgj , and less separation So less information on (μg , ξg2) and longer runs are needed • For the full hierarchical model, the convergence of the hyperparameters for the distribution of ξg2 was problematic • We studied sensitivity to a range of plausible values for those and implemented an “empirical Bayes” version of the model which was reproducible with sensible run length 19 Posterior mean of qg using a run of 30 000 versus those obtained from runs of 5 000, 10 000 and 20 000 sweeps Reproducibility is obtained with short runs for large expression values Longer runs are necessary for low expression values 20 Single array model performance: Data set : varying concentrations (geneLogic): • 14 samples of cRNA from acute myeloid leukemia (AML) tumor cell line • In sample k: sample k: conc. ck(pM): each of 11 genes spiked in at concentration ck: 1 2 0.0 0.5 3 4 5 6 7 8 9 10 0.75 1.0 1.5 2.0 3.0 5.0 12.5 25 11 12 13 14 50 75 100 150 • Each sample hybridised to an array Consider subset consisting of 500 normal genes + 11 spike-ins 21 Single array model: examples of posterior distributions of BGX expression indices Each curve (truncated normal with median param.) represents a gene Examples with data: o: log(PMgj-MMgj) j=1,…,Jg (at 0 if not defined) Mean +- 1SD 22 Comparison with other expression measures Single array model performance: 11 genes spiked in at 13 (increasing) concentrations BGX index qg increases with concentration ….. … except for gene 7 (spiked-in??) Indication of smooth & sustained increase over a wider range of concentrations 23 2.5 – 97.5 % credibility intervals for the Bayesian expression index 11 spike-in genes at 13 different concentration (data set A) Note how the variability is substantially larger for low expression level Each colour corresponds to a different spike-in gene 24 Gene 7 : broken red line Integrated modelling of Affymetrix data Condition 2 Condition 1 PM MM PM MM PM MM PM MM Gene specific variability (probe) Gene index BGX Hierarchical model of replicate (biological) variability and array effect Distribution of expression index for gene g , condition 1 PM MM PM MM PM MM PM MM Gene specific variability (probe) Gene index BGX Hierarchical model of replicate (biological) variability and array effect Distribution of expression index for gene g , condition 2 Distribution of differential expression parameter 25 BGX Multiple array model: conditions: c=1,…,C, replicates: r = 1,…,Rc PMgjcr N( Sgjcr+ Hgjcr , τcr2) MMgjcr N(ΦSgjcr+ Hgjcr , τcr2) log(Sgjcr+1) TN (μgc , ξ gc2) Gene and condition specific BGX Background noise, additive Array specific log(Hgjcr+1) TN(λcr,ηcr2) Array-specific distribution of non-specific hybridisation qgc=median(TN(μgc, ξ gc2)) “Pools” information over replicate probe sets j = 1,…J, r = 1,…,Rc 26 Subset of AffyU133A spike-in data set (AffyComp) Consider: • Six arrays, 1154 genes (every 20th and 42 spike-ins) • Same cRNA hybridised to all arrays EXCEPT for spike-ins: `1` `2` `3` … `12` `13` `14` Spike-in genes: 1-3 4-6 7-9 … 34-36 37-39 40-42 Spike-in conc (pM): Condition 1 (array 1-3): 0.0 0.25 0.50 … 128 256 512 Condition 2 (array 4-6): 0.25 0.50 1.00 … 256 512 0.00 2 - Fold change: - 2 2 … 2 27 BGX: measure of uncertainty provided Posterior mean +- 1SD credibility intervals diffg=bgxg,1- bgxg,2 } Spike in 1113 -1154 above the blue line 28 Blue stars show RMA measure Part 3 • Introduction • A fully Bayesian gene expression index (BGX) • Differential expression and array effects – Non linear array effects – Model checking • Mixture models • Discussion 29 Differential expression and array effects Alex Lewin SR, Natalia Bochkina, Tim Aitman 30 Data Set and Biological question Biological Question Understand the mechanisms of insulin resistance Using animal models where key genes are knockout and comparison made between gene expression of wildtype (normal) and knockout mice Data set A (MAS 5) ( 12000 genes on each array) 3 wildtype mice compared with 3 mice with Cd36 knocked out Data set B (RMA) ( 22700 genes on each array) 8 wildtype mice compared with 8 knocked out mice 31 Condition 2 Condition 1 Start with given point estimates of expression Hierarchical model of replicate Variability and array effect Hierarchical model of replicate Variability and array effect Posterior distribution Differential expression parameter (flat prior) Mixture modelling for classification 32 Exploratory analysis of array effect Mouse data set A Condition 1 (3 replicates) Needs ‘normalisation’ Spline curves shown Condition 2 (3 replicates) 33 Model for Differential Expression • Expression-level-dependent normalisation • Only few replicates per gene, so share information between genes to estimate variability of gene expression between the replicates • To select interesting genes: – Use posterior distribution of quantities of interest, function of, ranks …. – Use mixture prior on the differential expression parameter 34 Bayesian hierarchical model for differential expression Data: ygsr = log gene expression for gene g, replicate r g = gene effect δg = differential effect for gene g between 2 conditions r(g)s = array effect (expression-level dependent) gs2 = gene variance • 1st level yg1r N(g – ½ δg + r(g)1 , g12), yg2r N(g + ½ δg + r(g)2 , g22), Σrr(g)s = 0, r(g)s = function of g , parameters {a} and {b} • 2nd level Priors for g , δg, coefficients {a} and {b} gs2 lognormal (μs, τs) 35 Details of array effects (Normalization) • Piecewise polynomial with unknown break points: r(g)s = quadratic in g for ars(k-1) ≤ g ≤ ars(k) with coeff (brsk(1), brsk(2) ), k =1, … # breakpoints – Locations of break points not fixed – Must do sensitivity checks on # break points • Joint estimation of array effects and differential expression: In comparison to 2 step method – More accurate estimates of array effects – Lower percentage of false positive (simulation study) 36 Mouse Data set A For this data set, cubic fits well 3 replicate arrays (wildtype mouse data) Model: posterior means E(r(g)s | data) v. E(g | data) Data: ygsr - E(g | data) 37 Bayesian Model Checking • Check assumptions on gene variances, e.g. exchangeable variances, what distribution ? • Predict sample variance Sg2 new (a chosen checking function) from the model specification (not using the data for this) • Compare predicted Sg2 new with observed Sg2 obs ‘Bayesian p-value’: Prob( Sg2 new > Sg2 obs ) • Distribution of p-values approx Uniform if model is ‘true’ (Marshall and Spiegelhalter, 2003) 38 • Easily implemented in MCMC algorithm Data set A 39 Possible Statistics for Differential Expression δg ≈ log fold change δg* = δg / (σ2 g1 / R1 + σ2 g2 / R2 )½ (standardised difference) • We obtain the posterior distribution of all {δg} and/or {δ g * } • Can compute directly posterior probability of genes satisfying criterion X of interest: pg,X = Prob( g of “interest” | Criterion X, data) • Can compute the distributions of ranks 40 Data set A 3 wildtype mice compared to 3 knockout mice (U74A chip) Mas5 Criterion X Gene is of interest if |log fold change| > log(2) and log (overall expression) > 4 Plot of log fold change versus overall expression level Genes with pg,X > 0.5 (green) # 280 pg,X > 0.8 (red) # 46 The majority of the genes have very small pg,X : 90% of genes have pg,X < 0.2 Genes with low overall expression have a greater range of fold change than those with higher expression pg,X = 0.49 41 Experiment: 8 wildtype mice compared to 8 knockout mice RMA Criterion X: Gene is of interest if |log fold change| > log (1.5) Plot of log fold change versus overall expression level Genes with pg,X > 0.5 (green) # 292 pg,X > 0.8 (red) # 139 The majority of the genes have very small pg,X : 97% of genes have pg,X < 0.2 42 Posterior probabilities and log fold change Data set A : 3 replicates MAS5 Data set B : 8 replicates RMA 43 Credibility intervals for ranks Data set B 100 genes with lowest rank (most under/ over expressed) Low rank, high uncertainty Low rank, low uncertainty 44 Using the posterior distribution of δg* (standardised difference) • Compute Probability ( | δg* | > 2 | data) Bayesian analogue of a t test ! • Order genes • Select genes such that Probability ( | δg* | > 2 | data) > cut-off ( in blue) By comparison, additional genes selected by a standard T test with p value < 5% are in red) 45 Part 4 • • • • Introduction A fully Bayesian gene expression index Differential expression and array effects Mixture models – Classification for differential expression – Bayesian estimate of False Discovery Rates – CGH arrays: models including information on clones spatial location on chromosome • Discussion 46 Mixture and Bayesian estimation of false discovery rates Natalia Bochkina, Philippe Broët Alex Lewin, SR 47 Multiple Testing Problem • Gene lists can be built by computing separately a criteria for each gene and ranking • Thousands of genes are considered simultaneously • How to assess the performance of such lists ? Statistical Challenge Select interesting genes without including too many false positives in a gene list A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set up Want an evaluation of the expected false discovery rate (FDR) 48 Bayesian Estimate of FDR • Step 1: Choose a gene specific parameter (e.g. δg ) or a gene statistic • Step 2: Model its prior (resp marginal) distribution using a mixture model -- with one component to model the unaffected genes (null hypothesis) e.g. point mass at 0 for δg -- other components to model (flexibly) the alternative • Step 3: Calculate the posterior probability for any gene of belonging to the unmodified component : pg0 | data • Step 4: Evaluate FDR (and FNR) for any list assuming that all the gene classification are independent (Broët et al 2004) : Bayes FDR (list) | data = 1/card(list) Σg list p49g0 Mixture framework for differential expression • To obtain a gene list, a commonly used method (cf Lönnstedt & Speed 2002, Newton 2003, Smyth 2003, …) is to define a mixture prior for δg : • H0 δg = 0 point mass at 0 with probability p0 δg ~ flexible 2-sided distribution to model • H1 pattern of differential expression Classify each gene following its posterior probabilities of not being in the null: 1- pg0 Use Bayes rule or fix the FDR to get a cutoff 50 Mixture prior for differential expression • In full Bayesian framework, introduce latent allocation variable zg to help computations • Joint estimation of all the mixture parameters (including p0) avoids plugging-in of values (e.g. p0) that are influential on the classification • Sensitivity to prior settings of the alternative distribution • Performance has been tested on simulated data sets Poster by Natalia Bochkina 51 Performance of the mixture prior yg1r = g - ½ δg + g1r , r = 1, … R1 yg2r = g + ½ δg + g2r , r = 1, … R2 (For simplification, we assume that the data has been pre normalised) Var(gsr ) = σ2gs ~ IG(as, bs) δg ~ p0δ0 + p1G (1.5, l1) + p2G (1.5, l2) H0 H1 Dirichlet distribution for (p0, p1, p2) Exponential hyper prior for l1 and l2 52 Estimation • Estimation of all parameters combines information from biological replicates and between condition contrasts • s2gs = 1/Rs Σr (ygsr - ygs. )2 , s = 1,2 Within condition biological variability • 1/Rs Σr ygsr = ygs. , Average expression over replicates • ½(yg1.+ yg2.) Average expression over conditions • ½(yg1.- yg2.) Between conditions contrast 53 DAG for the mixture model zg p a 1 , b1 l1 , l2 δg g g = 1:G ½(yg1.- yg2.) ½(yg1.+ yg2.) 2g1 s2g1 2g2 s2g2 a2, b2 54 Simulated data Plot of the true differences ygr ~ N(δg , σ2g) (8 replicates) σ2gs ~ IG(1.5, 0.05) δg ~ (-1)Bern(0.5) G(2,2), g=1:200 δg = 0, g=201:1000 Choice of simulation parameters inspired by estimates found in analyses of biological data sets 55 Bayes rule True classification Null Alternative Mixture Prior Null Alternative 790 (800) 10 31 169 (200) Margin Null 764 (800) 21 FDR (black) FNR (blue) as a function of 1- pg0 Observed and estimated FDR/FNR correspond well Important feature Post Prob (g H1) = 1- pg0 56 Comparison of mixture classification and posterior probabilities for δg* (standardised differences) Post Prob (g H1) 10 = 6% False positive In red, 200 genes with δg ≠ 0 31 = 4% False negative Probability ( | δg* | > 2 | data) 57 Wrongly classified by mixture: truly dif. expressed, truly not dif. expressed Classification errors are on the borderline: Confusion between size of fold change and biological variability 58 Another simulation 2628 data points Many points added on borderline: classification errors in red Can we improve estimation of within condition biological variability ? 59 DAG for the mixture model zg p a1, b1 l1 , l2 δg g ½(yg1.- yg2.) ½(yg1.+ yg2.) 2g1 2g2 a 2 , b2 g = 1:G s2g1 s2g2 The variance estimates are influenced by the mixture parameters Use only partial information from the replicates to estimate 2gs and feed forward in the mixture ? 60 Mixture, full vs partial Classification altered for 57 points: In 46 data points with improved classification when ‘feed back from mixture is cut’ In 11 data points with changed but new incorrect classification 61 Work in progress Mixture models in CGH arrays experiments • Philippe Broët, SR • Curie Institute oncology department CGH = Competitive Genomic Hybridization between fluorescein- labelled normal and pathologic samples to an array containing clones designed to cover certain areas of the genome 62 Aim: study genomic alterations Loss Tumor supressor gene Gain Oncogene In oncology, where carcinogenesis is associated with complex chromosomic alterations, CGH array can be used for detailed analysis of genomic changes in copy number (gains or loss of genetic information) in the tumor sample. Amplification of an oncogene or deletion of a tumor suppressor gene are considered as important mechanisms for tumorigenesis 63 Specificity of CGH array experiment A priori biological knowledge from conventional CGH : • Limited number of states for a sequence : - presence, - deletion, - gain(s) corresponding to different intensity ratios on the slide Mixture model to capture the underlying discrete states • Clones located contiguously on chromosomes are likely to carry alterations of the same type Use clone spatial location in the allocation model • Some CGH custom array experiments target restricted areas of the genome Large proportion of genomic alterations are expected 64 3 component mixture model with spatial allocation ygr N(θg , g2) , normal versus tumoral change, clone g replicate measure r θg wg0N(μ0 ,h02) + wg1N(μ1 ,h12) + wg2N(μ2 ,h22) presence deletion gain μ0 : known central estimate obtained from reference clones Introduce centred spatial autoregressive Markov random fields, {ug0}, {ug1}, {ug2} with nearest neighbours along the chromosomes x x x Spatial neighbours of g g -1 g g+1 Define mixture proportions to depend on the chromosomic location via a logistic model: wgk = exp(ugk) / Σm exp(ugm) favours allocation of nearby clones to same component Work in progress 65 Curie Institute CGH platform Focus on Investigating deletion areas on chromosome 1 (tumour suppressor locus) Data on 190 clones μ0 Presence ? Ref value μ0 = - 0.11 Deletion ? 66 Mixture model posterior probability p of clone being deleted Classification with cut-off at p ≥ 0.8 Short arm 67 Summary Bayesian gene expression measure (BGX) Good range of resolution, provides credibility intervals Differential Expression Expression-level-dependent normalisation Borrow information across genes for variance estimation Gene lists based on posterior probabilities or mixture classification False Discovery Rate Mixture gives good estimate of FDR and classifies Flexibility to incorporate a priori biological features, e.g. dependence on chromosomic location Future work Mixture prior on BGX index, with uncertainty propagated to mixture parameters, comparison of marginal and prior mixture approaches, clustering of profiles for more general experimental set-ups 68 Papers and technical reports: Hein AM., Richardson S., Causton H., Ambler G. and Green P. (2004) BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data (to appear in Biostatistics) Lewin A., Richardson S., Marshall C., Glazier A. and Aitman T. (2003) Bayesian Modelling of Differential Gene Expression (under revision for Biometrics) Broët P., Lewin A., Richardson S., Dalmasso C. and Magdelenat H. (2004) A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics 22, 2562-2571. Broët, P., Richardson, S. and Radvanyi, F. (2002) Bayesian Hierarchical Model for Identifying Changes in Gene Expression from Microarray Experiments. Journal of Computational Biology 9, 671-683. Available at http ://www.bgx.org.uk/ Thanks 69