Bayesian hierarchical models for large-scale
data integration and variable selection
Alex Lewin
work with Habib Saadi, James Peters, Leo Bottolo and Sylvia Richardson
May 2015
Alex Lewin
Bayesian Variable Selection
May 2015
1 / 25
Very basic ideas of genetics/genomics
– DNA same in every cell
– gene expression (RNA): used as a proxy for how much protein is
produced in a particular cell; varies between cell types
eQTLs: expression Quantitative Trait Loci
QTLs are genetic variants (DNA loci) associated with some kind of trait
(e.g. height, blood pressure)
eQTLs are genetic variants (here SNPs, single nucleotide polymorphisms) associated with gene expression
Our aims:
Detect eQTLs using variable selection models
Especially “hotspots”: genetic variants which are associated with
expression of multiple genes.
Our further (novel) aim: combine data from multiple tissue samples
(repeated measures of gene expression).
Data structure
Y (gene expression, RNA) tensor n × q × L
X (DNA variants) matrix n × p
n people ∼ 100s - 1000s
p DNA measurements ∼ 10,000s
q RNA measurements ∼ 1000s
L tissue samples ∼ 3-10
Regression model
Aim of analysis: find important correlations between Y and X
Bayesian hierarchical model for multi-tissue eQTLs
We model all data simultaneously, and estimate effects for all model
parameters together.
→ consistent inference of which associations and patterns in the data
are important
I will discuss three ways we impose appropriate structure on the model
parameters:
1. Selecting the important DNA variants (p)
2. Combining RNA measurements across tissue samples (L)
3. Sharing information between measurements of RNA levels for different genes (q)
(1): Variable selection
Selecting which variables (which DNA variants) are important for
predicting responses (RNA levels).
No. possible models (possible sets of DNA variants) is huge
(2^p for each RNA)
p >> n
Widely studied problem in statistics (traditional estimators don’t work
as not enough observations)
shrinkage/penalty estimators
enforce *sparsity*
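To give a sense of scale, the short Python snippet below counts the candidate predictor subsets for a single response. The value p = 10,000 is an illustrative choice within the range quoted for the DNA measurements, not a figure taken from a particular data set.

```python
# Number of candidate predictor subsets for one response: 2**p.
p = 10_000                      # illustrative value in the "10,000s" range
n_models = 2 ** p               # Python integers have arbitrary precision
n_digits = len(str(n_models))   # how many decimal digits the count has
print(f"2^{p} has {n_digits} decimal digits")
```

Exhaustive enumeration is clearly out of reach at this size, which is why a stochastic search over models is needed.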
(1): Variable selection: Single response model
Single response (RNA) in single tissue: vector of observations y
across all people.
Searching for a regression model:
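$y \mid \gamma = X_\gamma \beta_\gamma + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$
where $X_\gamma$ includes only the most important predictors in $X$.
Variable selection is achieved using a latent binary vector
$\gamma = (\gamma_1, \ldots, \gamma_p)$, with
$\gamma_j = 1$ if $\beta_j \neq 0$ and $\gamma_j = 0$ if $\beta_j = 0$.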
(1): Variable selection: Single response model
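Non-zero regression coefficients shrunk using g-prior structure
(Zellner):
$\beta_\gamma \mid g, \sigma^2, \gamma \sim N\left(0, \; g \sigma^2 (X_\gamma^T X_\gamma)^{-1}\right)$
Sparsity prior on latent binary indicators: $p(\gamma_j = 1 \mid \omega) = \omega$, i.e. $\gamma_j \mid \omega \sim \mathrm{Bern}(\omega)$
Priors on shrinkage parameter $g$ and sparsity parameter $\omega$ → these are
also parameters of the model (not fixed).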
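As a rough illustration of how these two priors fit together, the Python sketch below draws one (γ, β) pair from the sparsity prior and the g-prior. It assumes fixed values of ω, g and σ² (in the model itself g and ω have their own priors), and the function name `draw_from_prior` and the toy design matrix are hypothetical.

```python
import numpy as np

def draw_from_prior(X, omega, g, sigma2, rng):
    """Draw (gamma, beta) from the Bernoulli sparsity prior and the g-prior."""
    p = X.shape[1]
    gamma = rng.random(p) < omega              # gamma_j ~ Bern(omega)
    beta = np.zeros(p)                         # excluded coefficients stay at 0
    idx = np.flatnonzero(gamma)
    if idx.size > 0:
        Xg = X[:, idx]
        cov = g * sigma2 * np.linalg.inv(Xg.T @ Xg)
        cov = (cov + cov.T) / 2                # guard against round-off asymmetry
        beta[idx] = rng.multivariate_normal(np.zeros(idx.size), cov)
    return gamma, beta

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 20))              # toy design: n = 50, p = 20
gamma, beta = draw_from_prior(X, omega=0.1, g=50.0, sigma2=1.0, rng=rng)
print("selected predictors:", np.flatnonzero(gamma))
```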
(2): Model RNA levels across tissues
Single response (RNA) in multiple tissues: now have matrix of
observations Y (n people × L tissues).
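$Y - A - X_\gamma B_\gamma \sim N(I_n, \Sigma)$
(matrix normal notation: rows, i.e. people, are independent; $\Sigma$ is the covariance between tissues)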
Matrix of regression coefficients B: same variable selection (γ),
shrinkage and sparsity priors as before.
Matrix Σ (dimensions L × L): Wishart prior (standard conjugate prior
for covariance matrices).
Aim to find common pattern of associations (Bγ ) across tissues.
Signal/noise ratio can vary across tissues (Σ)
Allow for residual correlations between tissues (Σ).
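To make the role of Σ concrete, here is a minimal Python sketch that simulates a residual matrix with independent rows (people) and covariance Σ between columns (tissues). The particular Σ, the sample size and the function name `simulate_residuals` are assumptions chosen for illustration.

```python
import numpy as np

def simulate_residuals(n, Sigma, rng):
    """Draw an n x L matrix whose rows are independent N(0, Sigma) vectors."""
    chol = np.linalg.cholesky(Sigma)           # Sigma = chol @ chol.T
    Z = rng.standard_normal((n, Sigma.shape[0]))
    return Z @ chol.T                          # each row has covariance Sigma

# Hypothetical between-tissue covariance: unequal variances (different
# signal/noise per tissue) and non-zero residual correlations.
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 0.5]])
rng = np.random.default_rng(2)
E = simulate_residuals(5000, Sigma, rng)
print(np.round(np.cov(E, rowvar=False), 2))    # should be close to Sigma
```

The diagonal of Σ lets the noise level differ by tissue, and the off-diagonal entries carry the residual correlation between tissues.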
(3): Hierarchical model across different responses
Full model: multiple RNA responses in multiple tissues: for each RNA
response k we have matrix Yk across people and tissues.
$Y_k - A_k - X_{\gamma_k} B_{\gamma_k} \sim N(I_n, \Sigma_k)$
Hierarchical model over responses:
Separate regression parameters and variable selection for each
response
Shared priors allow information to be shared across responses and give shrinkage estimates.
Structured prior on variable selection → improve hotspot detection
(3): Hotspot detection via hierarchical model
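Sparsity prior on latent binary indicators: $p(\gamma_{kj} = 1 \mid \Omega) = \omega_{kj}$, i.e. $\gamma_{kj} \mid \Omega \sim \mathrm{Bern}(\omega_{kj})$
Modelling the matrix of the prior probabilities
$$
\Omega = \begin{pmatrix}
\omega_{11} & \cdots & \omega_{1j} & \cdots & \omega_{1p} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\omega_{k1} & \cdots & \omega_{kj} & \cdots & \omega_{kp} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
\omega_{q1} & \cdots & \omega_{qj} & \cdots & \omega_{qp}
\end{pmatrix},
\qquad \omega_{kj} = \omega_k \times \rho_j
$$
$\omega_k$ is the prior probability of variable selection for a given response (sparsity)
$\rho_j$ captures the ‘propensity’ for predictor j to influence several
outcomes at the same time (hotspots)
($\omega_k$ and $\rho_j$ are parameters of the model, not fixed)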
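The multiplicative structure of the prior probabilities can be sketched in a few lines of Python. The numerical values, the function name `selection_probabilities` and the capping of the products at 1 are assumptions made for this illustration; the slides do not say how the constraint that each entry is a probability is enforced.

```python
import numpy as np

def selection_probabilities(omega, rho):
    """q x p matrix with entries omega_k * rho_j, capped at 1 (an assumption)."""
    return np.minimum(np.outer(omega, rho), 1.0)

rng = np.random.default_rng(3)
q, p = 5, 8                                    # toy numbers of responses and predictors
omega = np.full(q, 0.05)                       # per-response sparsity level
rho = np.ones(p)
rho[2] = 10.0                                  # predictor 2 acts as a hypothetical hotspot
Omega = selection_probabilities(omega, rho)
gamma = rng.random((q, p)) < Omega             # gamma_kj ~ Bern(omega_k * rho_j)
print(Omega[0])                                # prior inclusion probabilities for response 0
```

A large ρ_j raises the prior inclusion probability of predictor j for every response at once, which is what lets the model favour hotspots.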
Model fitting
Summary of model:
Multiple regression model for large data structures
Model covariance between tissues
Variable selection priors
Structured priors for hotspot detection
Model fitting using MCMC (Markov chain Monte Carlo) estimation of
full posterior distributions of all parameters.
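For intuition only, the Python sketch below runs a very simple Metropolis sampler over the inclusion vector γ for a single response in a single tissue. It is not the sampler used for the full model: it assumes fixed g and ω, integrates out β and σ² analytically under the g-prior with the additional assumption p(σ²) ∝ 1/σ², and proposes single-variable flips. All function names and the toy data are hypothetical.

```python
import numpy as np

def log_marginal(y, X, gamma, g):
    """log p(y | gamma, g) up to a constant, with beta and sigma^2 integrated out."""
    n = y.shape[0]
    idx = np.flatnonzero(gamma)
    yty = float(y @ y)
    if idx.size == 0:
        return -0.5 * n * np.log(yty)
    Xg = X[:, idx]
    coef, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    fit = float(y @ (Xg @ coef))               # y' P y, P = projection onto Xg
    s = yty - g / (1.0 + g) * fit
    return -0.5 * idx.size * np.log1p(g) - 0.5 * n * np.log(s)

def sample_gamma(y, X, g, omega, n_iter, rng):
    """Metropolis sampler with single-flip proposals; returns all draws of gamma."""
    p = X.shape[1]

    def log_post(gm):
        k = int(gm.sum())
        prior = k * np.log(omega) + (p - k) * np.log1p(-omega)
        return log_marginal(y, X, gm, g) + prior

    gamma = np.zeros(p, dtype=bool)            # start from the empty model
    current = log_post(gamma)
    draws = np.empty((n_iter, p), dtype=bool)
    for t in range(n_iter):
        j = rng.integers(p)                    # pick one predictor at random
        proposal = gamma.copy()
        proposal[j] = not proposal[j]          # flip its inclusion indicator
        candidate = log_post(proposal)
        if np.log(rng.uniform()) < candidate - current:
            gamma, current = proposal, candidate
        draws[t] = gamma
    return draws

# Toy data: predictors 0 and 3 truly affect the response.
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 3]] = 1.0
y = X @ beta + 0.5 * rng.standard_normal(n)

draws = sample_gamma(y, X, g=float(n), omega=0.2, n_iter=2000, rng=rng)
inclusion = draws[500:].mean(axis=0)           # discard burn-in, then average
print(np.round(inclusion, 2))                  # estimated p(gamma_j = 1 | data)
```

The column means of the retained draws estimate the marginal posterior probabilities of inclusion, the quantity used later to call associations.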
Simulation Study
Investigate power gained by combining responses and tissues.
[Figure: two panels; x-axis SNP Index (0 to 1000), y-axis Responses (0 to 150).]
Simulate sparse patterns of associations between gene expression
(RNA) and SNPs (DNA).
Hotspots: SNPs with multiple responses associated.
Simulation Study
X data are real SNP data sets.
Simulate responses Y in $\ell = 1, 2, 3$ tissues:
$Y_\ell = X B_\ell + E_\ell$
$B_\ell$ have a common pattern for non-zero entries
Average $B_\ell$ is $\mu$
Residual variation $E_{ik\ell} \sim N(0, \sigma_\ell^2)$.
Signal/noise ratio ≈ $\mu / \sigma_\ell$
⟹ control signal/noise across tissues by varying $\sigma_\ell$
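The simulation design can be sketched in Python as follows. Several simplifications are assumed here and are not taken from the slides: the genotypes are simulated rather than real SNP data, every non-zero effect is set exactly to µ in all tissues, and the values of n, p, q, µ and the sparsity pattern are invented for the example. Only the three σ values match one of the cases studied (Large Imbalance).

```python
import numpy as np

def simulate_tissues(X, pattern, mu, sigmas, rng):
    """Simulate Y_l = X B_l + E_l per tissue; returns an n x q x L array."""
    n = X.shape[0]
    q = pattern.shape[1]
    B = np.where(pattern, mu, 0.0)             # p x q; same non-zero pattern in every tissue
    Y = np.empty((n, q, len(sigmas)))
    for l, sigma in enumerate(sigmas):
        noise = sigma * rng.standard_normal((n, q))   # E_l with sd sigma_l
        Y[:, :, l] = X @ B + noise
    return Y

rng = np.random.default_rng(4)
n, p, q = 100, 30, 20                          # toy sizes
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # stand-in genotypes coded 0/1/2
pattern = np.zeros((p, q), dtype=bool)
pattern[5, :10] = True                         # SNP 5 is a hotspot for 10 responses
pattern[12, 15] = True                         # one isolated association
sigmas = [0.05, 0.1, 0.2]                      # tissue-specific noise levels
Y = simulate_tissues(X, pattern, mu=0.3, sigmas=sigmas, rng=rng)
print(Y.shape)
```

Because µ is held fixed, changing σ for a tissue changes only its signal/noise ratio µ/σ.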
Simulation Study
Compare three analyses:
- Bayesian model for multiple tissues
- Bayesian models for single tissues run separately
- MANOVA for multiple tissues (one for each response-predictor pair)
Main focus on pairwise associations (response k with predictor j)
Bayesian models use posterior probability of association
p(γkj = 1 | data)
MANOVA uses p-value for each k, j
Threshold posterior probabilities or p-values to call positive and
negative associations.
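Calling associations by thresholding can be sketched in Python like this. The score matrix, the truth matrix and the function name `confusion_counts` are made up for the example; for MANOVA p-values the comparison would be reversed (a call is positive when the p-value is at or below the threshold).

```python
import numpy as np

def confusion_counts(score, truth, threshold):
    """Count true/false positives and negatives when calling score >= threshold."""
    call = score >= threshold
    return {
        "TP": int(np.sum(call & truth)),
        "FP": int(np.sum(call & ~truth)),
        "FN": int(np.sum(~call & truth)),
        "TN": int(np.sum(~call & ~truth)),
    }

# Toy posterior probabilities p(gamma_kj = 1 | data) for 2 responses x 3 predictors.
score = np.array([[0.95, 0.10, 0.85],
                  [0.30, 0.90, 0.05]])
truth = np.array([[True, False, False],
                  [False, True, False]])
counts = confusion_counts(score, truth, threshold=0.8)
print(counts)
```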
Simulation Study: unbalanced tissues
Three cases:
Balanced: {σ1 , σ2 , σ3 } = {0.1, 0.1, 0.1},
Moderate Imbalance: {σ1 , σ2 , σ3 } = {0.08, 0.1, 0.125},
Large Imbalance: {σ1 , σ2 , σ3 } = {0.05, 0.1, 0.2}.
Three tissues in Large Imbalance case:
[Figure: three panels, one for each tissue in the Large Imbalance case (axis ticks 0 to 140 and 0 to 1200; axis labels not recovered).]
Simulation Study: unbalanced tissues
ROC (receiver operating characteristic) curves compare error rates
(sensitivity and specificity) across all thresholds together.
[Figure: ROC curves in three panels (Balanced, Moderate imbalance, Large imbalance); x-axis 1−specificity (0.000 to 0.005), with the number of false positives (0 to 950) on the top axis; y-axis sensitivity (0.0 to 1.0); curves for MT−HESS All Tissues, MANOVA, and ST−HESS Tissues 1, 2 and 3.]
Combining tissues increases power to detect associations, even with
unbalanced tissues.
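A ROC curve of this kind can be computed by sweeping the threshold over the sorted scores. The Python sketch below is a minimal version that assumes higher scores mean stronger evidence (for p-values one could pass their negatives) and does not treat tied scores specially; the function name `roc_points` and the toy inputs are hypothetical.

```python
import numpy as np

def roc_points(score, truth):
    """Return (1 - specificity, sensitivity) as the threshold is lowered."""
    score = np.asarray(score, dtype=float).ravel()
    truth = np.asarray(truth, dtype=bool).ravel()
    order = np.argsort(-score, kind="stable")  # strongest evidence first
    hits = truth[order]
    tp = np.cumsum(hits)                       # true positives called so far
    fp = np.cumsum(~hits)                      # false positives called so far
    sensitivity = tp / hits.sum()
    one_minus_spec = fp / (~hits).sum()
    return one_minus_spec, sensitivity

fpr, sens = roc_points([0.9, 0.8, 0.7, 0.1], [True, False, True, False])
print(fpr, sens)
```

Restricting the x-axis to very small values of 1−specificity, as in the slides, focuses the comparison on the region where few false positives are tolerated.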
Simulation Study: hotspot detection
[Figure: x-axis SNP Index (0 to 1000), y-axis Responses (0 to 150).]
10 hotspots (2 each of size 10, 20, 30 responses)
5 cis-associations that are joint predictors with other SNPs
5 isolated cis-associations
Simulation Study: hotspot detection
Counting pairwise associations, using threshold 0.8 on posterior
probability of association.
Classify by membership of hotspot (trans), cis or true negative.
                          True negative   True cis iso   True cis joint   True trans
MT-HESS  Negative call         195467.6            0.0              0.2         46.8
MT-HESS  Positive call              2.4            5.0              4.8         73.2
ST-HESS  Negative call         195469.0            0.1              0.6        112.4
ST-HESS  Positive call              1.0            4.9              4.4          7.6
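As a worked reading of the trans (hotspot) counts above, the Python snippet below turns the negative and positive calls into a sensitivity for each method. Only the four counts are taken from the slide; the variable names are illustrative.

```python
# (negative calls, positive calls) among true trans associations, from the slide.
trans = {"MT-HESS": (46.8, 73.2), "ST-HESS": (112.4, 7.6)}

# Sensitivity = positive calls / all true trans associations.
sensitivity = {method: pos / (neg + pos) for method, (neg, pos) in trans.items()}
for method, value in sensitivity.items():
    print(f"{method}: {value:.3f}")
```

At the 0.8 threshold the multi-tissue analysis therefore recovers a much larger share of the hotspot associations than the single-tissue analysis.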
Simulation Study: hotspot detection
Sizes of hotspots:
True size   MT-HESS   ST-HESS
10              4.2       0.1
20             13.8       2.2
30             18.8       1.5
[Figure: x-axis SNP Index (0 to 1000), y-axis Responses (0 to 150).]
Combining tissues can improve the detection of hotspots.
Application to human data set
Search for eQTLs for human gene expression in 3 cell types from
human blood samples (CD4 T cells, CD8 T cells and monocytes).
n = 59 patients, each has genotype data for p ≈ 21,000 SNPs.
Responses are expression measurements for q ≈ 3000 transcripts
(genes), in the L = 3 cell types.
Application to human data set: hotspots
Hotspot detection: no. of associations for each SNP.
Putative master regulator SNP on chromosome 5, associated with
expression of 78 genes.
[Figure: no. of genes (0 to 80) against chr 5 position (0 to 150 Mb), with points for CD4 T cells, CD8 T cells, Monocytes and Joint.]
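Counting associations per SNP is a column sum over the thresholded posterior probabilities. The Python sketch below assumes a q × p matrix of marginal posterior probabilities and reuses the 0.8 threshold from the simulation study; the threshold used for the human data is not stated here, and the function name `hotspot_sizes` and the toy matrix are hypothetical.

```python
import numpy as np

def hotspot_sizes(post_prob, threshold=0.8):
    """Number of responses called as associated with each SNP (column sums)."""
    return (post_prob >= threshold).sum(axis=0)

# Toy matrix: 6 responses x 4 SNPs, mostly low posterior probabilities.
post_prob = np.full((6, 4), 0.05)
post_prob[:5, 1] = 0.9                         # SNP 1 is linked to 5 responses
post_prob[0, 3] = 0.85                         # SNP 3 is linked to 1 response
sizes = hotspot_sizes(post_prob)
print(sizes, "largest hotspot at SNP", int(np.argmax(sizes)))
```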
Application to human data set: benefit of multiple
SNP models
So far looked at marginal pairwise associations between responses
and predictors p(γkj = 1 | data).
Lots of other ways to summarise the posterior of the model.
Now look at “Best Model” for each response k: the combination of
variables (γk ) with maximum posterior probability.
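One simple way to approximate the "Best Model" from sampler output is to take the most frequently visited configuration of γ_k among the MCMC draws. The Python sketch below does that, assuming the draws are stored as a boolean matrix with one row per iteration; the function name `best_model` and the toy draws are hypothetical, and visit frequency is only an estimate of posterior model probability.

```python
from collections import Counter

import numpy as np

def best_model(draws):
    """Most frequently visited model and its share of the draws."""
    models = (tuple(int(j) for j in np.flatnonzero(row)) for row in draws)
    counts = Counter(models)
    model, n_visits = counts.most_common(1)[0]
    return model, n_visits / len(draws)

# Toy draws of gamma_k for one response with 3 candidate predictors.
draws = np.array([[1, 0, 1],
                  [1, 0, 1],
                  [1, 0, 0],
                  [1, 0, 1]], dtype=bool)
model, share = best_model(draws)
print("best model includes predictors", model, "with visit share", share)
```

Unlike the marginal probabilities p(γ_kj = 1 | data), this summary keeps track of which predictors are selected together.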
Application to human data set: multiple SNP models
[Figure: panels (a), (b) and (c); axes show correlation (0 to 1) and Density.]
(a) Correlation matrix for 78 genes.
(b) Correlation of residuals for the 78 genes after regression on the master
regulator (hotspot)
(c) Correlation of residuals for the 78 genes after regression on the master
regulator + other SNPs in the best models
Summary
Regression for multi-variate responses v. multi-variate predictors.
Bayesian variable selection priors: automatic shrinkage and selection.
Bayesian hierarchical model increases power by combining information
across
- classes (here called tissues)
- responses
General multi-variate data structure, can be applied in other areas.
Thanks to:
Habib Saadi
Leonardo Bottolo
Sylvia Richardson
James Peters
Paper under revision for Bioinformatics:
Saadi, Lewin, Peters, Moreno-Moral, Lee, Smith, Petretto, Bottolo, Richardson, "MT-HESS: an
efficient Bayesian approach for simultaneous association detection in OMICS datasets, with
application to eQTL mapping in multiple tissues".