Bayesian hierarchical models for large-scale data integration and variable selection
Alex Lewin
work with Habib Saadi, James Peters, Leo Bottolo and Sylvia Richardson
May 2015
Alex Lewin Bayesian Variable Selection May 2015

Very basic ideas of genetics/genomics
- DNA: same in every cell
- gene expression (RNA): used as a proxy for how much protein is produced in a particular cell; varies between cell types

eQTLs: expression Quantitative Trait Loci
QTLs are genetic variants (DNA loci) associated with some kind of trait (e.g. height, blood pressure).
eQTLs are SNPs associated with gene expression.
Our aims:
Detect eQTLs using variable selection models.
Especially "hotspots": genetic variants which are associated with expression of multiple genes.
Our further (novel) aim: combine data from multiple tissue samples (repeated measures of gene expression).

Data structure
Y (gene expression, RNA): tensor n × q × L
X (DNA variants): matrix n × p
n people ∼ 100s - 1000s
p DNA measurements ∼ 10,000s
q RNA measurements ∼ 1000s
L tissue samples ∼ 3-10
Regression model
Aim of analysis: find important correlations between Y and X.

Bayesian hierarchical model for multi-tissue eQTLs
We model all data simultaneously, and estimate effects for all model parameters together.
→ consistent inference of which associations and patterns in the data are important
I will discuss three ways we impose appropriate structure on the model parameters:
1. Selecting the important DNA variants (p)
2. Combining RNA measurements across tissue samples (L)
3. Sharing information between measurements of RNA levels for different genes (q)

(1): Variable selection
Selecting which variables (which DNA variants) are important for predicting responses (RNA levels).
Number of
possible models (possible sets of DNA variants) is huge ($2^p$ for each RNA)
$p \gg n$
Widely studied problem in statistics (traditional estimators don't work as there are not enough observations)
Shrinkage/penalty estimators enforce *sparsity*

(1): Variable selection: Single response model
Single response (RNA) in single tissue: vector of observations y across all people.
Searching for a regression model:
$$y \mid \gamma = X_\gamma \beta_\gamma + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$
where $X_\gamma$ includes only the most important predictors in X.
Variable selection is achieved using a latent binary vector
$$\gamma = (\gamma_1, \ldots, \gamma_p): \qquad \gamma_j = 1 \text{ if } \beta_j \neq 0, \qquad \gamma_j = 0 \text{ if } \beta_j = 0$$

(1): Variable selection: Single response model
Non-zero regression coefficients shrunk using g-prior structure (Zellner):
$$\beta_\gamma \mid g, \sigma^2, \gamma \sim N\left(0,\; g\sigma^2 \left(X_\gamma^T X_\gamma\right)^{-1}\right)$$
Sparsity prior on latent binary indicators:
$$p(\gamma_j = 1 \mid \omega) = \mathrm{Bern}(\omega)$$
Priors on shrinkage parameter g and sparsity parameter ω
→ these are also parameters of the model (not fixed).

(2): Model RNA levels across tissues
Single response (RNA) in multiple tissues: now have matrix of observations Y (n people × L tissues).
$$Y - A - X_\gamma B_\gamma \sim N(I_n, \Sigma)$$
Matrix of regression coefficients B: same variable selection (γ), shrinkage and sparsity priors as before.
Matrix Σ (dimensions L × L): Wishart prior (standard conjugate prior for covariance matrices).
Aim to find common pattern of associations ($B_\gamma$) across tissues.
Signal/noise ratio can vary across tissues (Σ).
Allow for residual correlations between tissues (Σ).

(3): Hierarchical model across different responses
Full model: multiple RNA responses in multiple tissues: for each RNA response k we have matrix $Y_k$ across people and tissues.
$$Y_k - A_k - X_{\gamma_k} B_{\gamma_k} \sim N(I_n, \Sigma_k)$$
Hierarchical model over responses:
Separate regression parameters and variable selection for each response.
Shared priors allow sharing information, shrinkage estimates.
Structured prior on variable selection → improve hotspot detection

(3): Hotspot detection via hierarchical model
Sparsity prior on latent binary indicators:
$$p(\gamma_{kj} = 1 \mid \Omega) = \mathrm{Bern}(\Omega_{kj})$$
Modelling the matrix of the prior probabilities:
$$\Omega = \begin{pmatrix}
\omega_{11} & \cdots & \omega_{1j} & \cdots & \omega_{1p} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\omega_{k1} & \cdots & \omega_{kj} & \cdots & \omega_{kp} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\omega_{q1} & \cdots & \omega_{qj} & \cdots & \omega_{qp}
\end{pmatrix}, \qquad \Omega_{kj} = \omega_k \times \rho_j$$
$\omega_k$ is the prior probability of variable selection for a given response (sparsity).
$\rho_j$ captures the 'propensity' for predictor j to influence several outcomes at the same time (hotspots).
($\omega_k$ and $\rho_j$ are parameters of the model, not fixed.)

Model fitting
Summary of model:
Multiple regression model for large data structures
Model covariance between tissues
Variable selection priors
Structured priors for hotspot detection
Model fitting using MCMC (Markov chain Monte Carlo): estimation of full posterior distributions of all parameters.

Simulation Study
Investigate power gained by combining responses and tissues.
[Figure: two panels of simulated association patterns; x-axis: SNP Index, y-axis: Responses]
Simulate sparse patterns of associations between gene expression (RNA) and SNPs (DNA).
Hotspots: SNPs with multiple responses associated.

Simulation Study
X data are real SNP data sets.
Simulate responses Y in $\ell = 1, 2, 3$ tissues:
$$Y_\ell = X B_\ell + E_\ell$$
$B_\ell$ have a common pattern for non-zero entries.
Average $B_\ell$ is $\mu$.
Residual variation $E_{ik\ell} \sim N(0, \sigma_\ell^2)$.
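The simulation set-up described so far can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the study: the function name `simulate_tissues`, the value of `mu`, the number of associations and the 10% spread of effect sizes around `mu` are all hypothetical, and random 0/1/2 genotypes stand in for the real SNP data.

```python
import numpy as np

def simulate_tissues(X, q, sigmas, mu=0.3, n_assoc=20, seed=0):
    """Simulate Y_l = X B_l + E_l for each tissue l (hypothetical helper).

    Every B_l shares the same pattern of non-zero entries; only the
    residual standard deviation sigma_l differs between tissues.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Common sparse pattern of associations (predictor j, response k).
    picks = rng.choice(p * q, size=n_assoc, replace=False)
    rows, cols = np.unravel_index(picks, (p, q))
    pattern = np.zeros((p, q), dtype=bool)
    pattern[rows, cols] = True

    Y = np.empty((n, q, len(sigmas)))   # n x q x L tensor, as in the data structure
    B = np.zeros((p, q, len(sigmas)))
    for l, sigma in enumerate(sigmas):
        # Assumption: effect sizes scatter around mu with sd 0.1 * mu.
        B[rows, cols, l] = rng.normal(mu, 0.1 * mu, size=n_assoc)
        E = rng.normal(0.0, sigma, size=(n, q))   # E_ikl ~ N(0, sigma_l^2)
        Y[:, :, l] = X @ B[:, :, l] + E
    return Y, B, pattern

# Toy genotypes (assumption: stand-in for the real SNP data).
X = np.random.default_rng(1).binomial(2, 0.3, size=(100, 50)).astype(float)
Y, B, pattern = simulate_tissues(X, q=150, sigmas=(0.05, 0.1, 0.2))
print(Y.shape)  # (100, 150, 3)
```

Passing a different tuple of `sigmas` changes the noise level per tissue while leaving the association pattern untouched.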
Signal/noise ratio $\approx \mu / \sigma_\ell$
⟹ control signal/noise across tissues by varying $\sigma_\ell$

Simulation Study
Compare three analyses:
- Bayesian model for multiple tissues
- Bayesian models for single tissues run separately
- MANOVA for multiple tissues (one for each response-predictor pair)
Main focus on pairwise associations (response k with predictor j).
Bayesian models use posterior probability of association $p(\gamma_{kj} = 1 \mid \text{data})$.
MANOVA uses p-value for each k, j.
Threshold posterior probabilities or p-values to call positive and negative associations.

Simulation Study: unbalanced tissues
Three cases:
Balanced: $\{\sigma_1, \sigma_2, \sigma_3\} = \{0.1, 0.1, 0.1\}$
Moderate Imbalance: $\{\sigma_1, \sigma_2, \sigma_3\} = \{0.08, 0.1, 0.125\}$
Large Imbalance: $\{\sigma_1, \sigma_2, \sigma_3\} = \{0.05, 0.1, 0.2\}$
Three tissues in Large Imbalance case:
[Figure: three panels, one per tissue, Large Imbalance case]

Simulation Study: unbalanced tissues
ROC (receiver operating characteristic) curves compare error rates (sensitivity and specificity) for all thresholds together.
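As a rough illustration of how one ROC curve is traced from thresholded calls, the sketch below computes sensitivity and 1 − specificity over a grid of thresholds. The function name `roc_points` and the toy inputs are hypothetical. Scores are assumed to be oriented so that larger means stronger evidence: posterior probabilities can be used directly, while p-values would first need flipping (for example 1 − p, which is an assumption here, not something stated on the slides).

```python
import numpy as np

def roc_points(scores, truth, thresholds):
    """Return (threshold, sensitivity, 1 - specificity, n_false_positives).

    A pair (k, j) is called positive when its score is >= the threshold.
    Assumes at least one true association and one true null.
    """
    scores = np.asarray(scores, dtype=float).ravel()
    truth = np.asarray(truth, dtype=bool).ravel()
    n_pos = truth.sum()
    n_neg = (~truth).sum()
    points = []
    for t in thresholds:
        called = scores >= t
        tp = int(np.sum(called & truth))    # true associations called
        fp = int(np.sum(called & ~truth))   # null pairs called
        points.append((float(t), tp / n_pos, fp / n_neg, fp))
    return points

# Toy example: four response-predictor pairs, two of them truly associated.
post_prob = [0.9, 0.8, 0.3, 0.1]
truth = [1, 1, 0, 0]
for t, sens, fpr, fp in roc_points(post_prob, truth, thresholds=[0.85, 0.5, 0.2]):
    print(t, sens, fpr, fp)
```

Sweeping the threshold from high to low moves along the curve from the bottom-left corner towards the top-right.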
[Figure: ROC curves in three panels (Balanced, Moderate imbalance, Large imbalance); x-axis: 1−specificity (0.000 to 0.005), top axis: Nb of false positives (0 to 950), y-axis: sensitivity; curves: MT-HESS All Tissues, MANOVA, ST-HESS Tissue 1, ST-HESS Tissue 2, ST-HESS Tissue 3]
Combining tissues increases power to detect associations, even with unbalanced tissues.

Simulation Study: hotspot detection
[Figure: simulated association pattern; x-axis: SNP Index, y-axis: Responses]
10 hotspots (2 each of size 10, 20, 30 responses)
5 cis-associations joint predictors with other SNPs
5 isolated cis-associations

Simulation Study: hotspot detection
Counting pairwise associations, using threshold 0.8 on posterior probability of association.
Classify by membership of hotspot (trans), cis or true negative.

                         True negative   True cis iso   True cis joint   True trans
MT-HESS  Negative call        195467.6            0.0              0.2         46.8
MT-HESS  Positive call             2.4            5.0              4.8         73.2
ST-HESS  Negative call        195469.0            0.1              0.6        112.4
ST-HESS  Positive call             1.0            4.9              4.4          7.6

Simulation Study: hotspot detection
Sizes of hotspots:

True size   MT-HESS   ST-HESS
10              4.2       0.1
20             13.8       2.2
30             18.8       1.5

[Figure: detected association pattern; x-axis: SNP Index, y-axis: Responses]
Combining tissues can improve the detection of hotspots.

Application to human data set
Search for eQTLs for human gene expression in 3 cell types from human blood samples (CD4 T cells, CD8 T cells and monocytes).
n = 59 patients, each with genotype data for p ≈ 21,000 SNPs.
Responses are expression measurements for q ≈ 3000 transcripts (genes), in the L = 3 cell types.

Application to human data set: hotspots
Hotspot detection: no. of associations for each SNP.
Putative master regulator SNP on chromosome 5, associated with expression of 78 genes.
[Figure: no. of genes associated with each SNP against chr 5 position (MB); series: CD4 T cells, CD8 T cells, Monocytes, Joint]

Application to human data set: benefit of multiple SNP models
So far we have looked at marginal pairwise associations between responses and predictors, $p(\gamma_{kj} = 1 \mid \text{data})$.
Lots of other ways to summarise the posterior of the model.
Now look at "Best Model" for each response k: the combination of variables ($\gamma_k$) with maximum posterior probability.

Application to human data set: multiple SNP models
[Figure: density plots of correlations, panels (a), (b) and (c); x-axis from 0 to 1, y-axis: Density]
(a) Correlation matrix for 78 genes.
(b) Correlation of residuals for the 78 genes after regression on the master regulator (hotspot).
(c) Correlation of residuals for the 78 genes after regression on the master regulator + other SNPs in the best models.

Summary
Regression for multi-variate responses v. multi-variate predictors.
Bayesian variable selection priors: automatic shrinkage and selection.
Bayesian hierarchical model increases power by combining information across
- classes (here called tissues)
- responses
General multi-variate data structure, can be applied in other areas.

Thanks to: Habib Saadi, Leonardo Bottolo, Sylvia Richardson, James Peters

Paper under revision for Bioinformatics:
Saadi, Lewin, Peters, Moreno-Moral, Lee, Smith, Petretto, Bottolo, Richardson, "MT-HESS: an efficient Bayesian approach for simultaneous association detection in OMICS datasets, with application to eQTL mapping in multiple tissues".