Download msb20103-sup-0001 - Molecular Systems Biology

Revealing a signaling role of PHS1P in yeast using integrative systems approaches
Supplementary Materials
L. Ashley Cowart, Matthew Shotwell, Mitchell L. Worley, Adam J Richards, David J. Montefusco,
Yusuf A. Hannun, and Xinghua Lu
Supplementary Figure 1. GO Steiner tree of lcb4/lcb5 mutation-sensitive genes. Genes that are differentially expressed between wild-type and lcb4/lcb5 strains are connected according to their Gene Ontology (GO) annotations. A gene is represented as a rectangular box and a GO term is denoted by a blue oval. The subset of genes significantly correlated with PHS1P is shown as red rectangles. The red insets show PHS1P-sensitive genes and the blue insets show secretion-related genes.
Supplementary Results
Correlation of Lipidomic and Transcriptomic data
The genes showing a significant positive correlation with PHS1P are as follows:
COR1, COX7, ALD6, INH1, GPI12, VAC8, PET9, ARG82, SDH3, QCR7, COX6,
COX4, RIP1, SNF11, QCR10, TFP1, YOP1, GLC7, LCB5, ATP14, ATP7, EDC1, ATP17
The genes showing a significant negative correlation with PHS1P are as follows:
DPL1, FUS1, RNH70, NSE1, CBK1, FRE7, ATG11, CPA1, DIE2, PPS1, NAR1, SCC4,
YPS3, CTR3, RLR1, MTO1, MRH4, MSR1, GPA1, BRE2, SEC39, MED4, PCM1,
PMD1, SSK22, ECM7, KOG1, MCM22, YEH1, PSK2, STD1, FUS3, PPR1, PHO81,
MSB4, MOD5, CTK2, DIA2, CCH1, SPO7, VPS62, POP1, PDR3, KIN4, MSI1, FAP1,
PTA1, CAF4, TFC4, NPP1, RAD53, THI7, MSG5, DFG5, BCD1, SMC2, TAD1, ALG11,
YIH1, CDC54
TFs significantly enriched in the promoters of the genes positively correlated with PHS1P. The results are shown as pairs of gene name and p-value.
TF      p-value
CIN5    0.0270882590216475
CBF1    0.0432286981561686
GCR1    3.88982024184248e-05
GCR2    0.000132818116975297
CAT8    0.0212699582900187
STB5    0.0191256540086794
GAT3    0.0102921086795477
MGA1    0.0444241181006045
MET4    0.00464591686854476
YAP6    0.0375914536865041
YAP1    0.0433633175543843
TYE7    0.020830917363739
ADR1    0.048338369784423
NRG1    8.30021054721852e-05
NRG2    1.43313749756402e-08
UME6    0.00875584105244476
RSF2    5.05304559006436e-06
HAP5    1.54876111935209e-13
HAP4    7.20170922896557e-10
HAP3    5.44009282066327e-15
HAP2    2.88657986402541e-15
HAP1    1.27233422557715e-05
Supplementary Tables
Table 1: Yeast strains used in this study
STRAIN       GENOTYPE
JK939        MAT  leu2-3 112 ura3-52 rme1 trp1 his4 HMLa
lcb4/lcb5    MAT  leu2-3 112 ura3-52 rme1 trp1 his4 HMLa lcb4::KanMX lcb5::KanMX
dpl1         MAT  leu2-3 112 ura3-52 rme1 trp1 his4 HMLa dpl1:KanMX
BY4742       MAT  his3 leu2 lys2 ura3
hap4         MAT  his3 leu2 lys2 ura3 hap4::KANMX
Table 2: Primers for Real Time PCR
Gene                Sequence
COX4                F: 5’-AACCCGTGGTGAAAACTGC-3’
                    R: 5’-AGGTCTGTTGGAACGGTAC-3’
COX5a               F: 5’-TTGCCAAAGTGTTGCTGC-3’
                    R: 5’-GCTGCCACTCCTTATTCAT-3’
COX9                F: 5’-CATCGTCCTCGGGTTCTCC-3’
                    R: 5’-CTAGCTCTGCGTAGAACTT-3’
INH1                F: 5’-GCCGCAAGGTTCTACTCTG-3’
                    R: 5’-AAGTCTTCCGTGGCCCTTT-3’
ATP17               F: 5’-GGTAAACCATTGTGGCATT-3’
                    R: 5’-GCTCTTCCGCACCTTTATG-3’
APS2                F: 5’-TGTGGTGCGGTTGGTGAG-3’
                    R: 5’-TTCGTCGAATCGGAAAACTCTAC-3’
SAR1                F: 5’-CAGTCAGAGATGTGTTGGCTTCC-3’
                    R: 5’-TGGATGCCATGTTGGTTGTAAGG-3’
SVP26               F: 5’-ACGCAGACGTGCTAGTTTCG-3’
                    R: 5’-TGTTGCTCACTAGTCGTTGGTAG-3’
TMP2                F: 5’-GAACAACAGCTAGAAGACAGTGAAG-3’
                    R: 5’-CTTCTTCCAAAGCAACGATCCTTC-3’
TVP23               F: 5’-CAATGTGTCTGACCGCCTGGAAC-3’
                    R: 5’-CGGAACAGAAGGAAGAGCGAACC-3’
ACT1                F: 5’-CATCACTATTGGTAACGAAAGAT-3’
                    R: 5’-ATTCCTTACGGACATCGAC-3’
18S rDNA subunit    F: 5’-CCATGGTTTCAACGGGTAACGG-3’
                    R: 5’-GCCTTCCTTGGATGTGGTAGCC-3’
Supplementary Methods
Bayesian transcription factor state model
We developed a Bayesian latent variable model to infer the activation states of TFs under each experimental condition. The model extends a previously published statistical model (Lu et al, 2004a) and is referred to here as the Bayesian transcription factor state (BTFS) model. It infers the states of TFs under specific conditions by combining two types of omics data: genomic information derived from epigenomic experiments and prior knowledge, and gene expression data from our experiments. In the model, the unobserved state of a transcription factor t under a specific experimental condition a is represented as a latent binary variable, s_at, such that s_at = 1 indicates that TF t is in an active state, so that its influence on gene expression can be observed in the microarray data.
Model Specification. The probabilistic graphical model, drawn in plate representation (Buntine, 1994), is shown in Figure 1. In this representation, a node represents a variable and a directed edge represents the probabilistic relationship between a pair of variables. A filled node denotes an observed instance of a variable, and an open node indicates an unobserved (latent) one. For example, the filled red node in the middle plate represents the variable e_ag, the expression value of gene g in microarray a. Multiple variables of a similar type, e.g., a total of G gene expression measurements, are represented as a plate, in which the letter at the bottom right corner indicates the number of instances. The graphical model can then be interpreted as follows:
- There are a total of G instances of gene promoters (left major plate); each represents the promoter of a gene, and each is associated with T binary variables, b_gt, indicating whether a binding site for TF t is present in the promoter of gene g.
- There are a total of A microarray experiments (middle major plate); each contains G expression measurements (e_ag), and each is also associated with T binary variables (s_at), indicating whether TF t is active during the experiment of microarray a.
- Each of the T transcription factors (right major plate) may influence the expression of each of the G genes, and the strength of its influence on a gene g is represented by a weight variable w_gt. Here, a weight of zero indicates that the TF has no influence on the expression of the gene; a positive value denotes a positive influence; and a negative value represents a repressive effect.
The probabilistic graphical model depicted in Figure 1 enables the integration of various -omics data by connecting them within a network, which provides a concise representation of the joint distribution of multi-omics data. As seen in the figure, the model integrates the genomic and transcriptomic data.
In the BTFS model, the expression value of a gene during a specific experiment (e_ag) is influenced by three sets of parent variables: 1) the TFs that bind to its promoter, indicated by b_gt, t ∈ {1, ..., T}; 2) the states of the TFs under this specific condition, represented by s_at, t ∈ {1, ..., T}; and 3) the strength of the active TFs on its expression, represented by w_gt, t ∈ {1, ..., T}. We define the probabilistic relationship between the above parent variables and the gene expression value as follows:

\[
e_{ag} = \sum_{t=1}^{T} b_{gt}\, s_{at}\, w_{gt} + \varepsilon
\]

or

\[
e_{ag} \mid b_{gt}, s_{at}, w_{gt} \sim N\!\left( \sum_{t=1}^{T} b_{gt}\, s_{at}\, w_{gt},\; \tau_g^{-1} \right) \tag{1}
\]

where ε and τ_g represent the noise of the system and N stands for the Gaussian distribution. It is of interest to note that, in Equation (1), the product of two binary variables, b_gt s_at, encodes a logical AND relationship between the two variables, so that the equation can be read as follows: TF t influences the expression value of gene g under condition a if and only if it has a binding site in the promoter of the gene (b_gt = 1) AND it is activated under the condition (s_at = 1). The equation also reflects the assumption that the effects of multiple active TFs on a gene’s expression are additive (usually on a logarithmic scale). This assumption has been widely used in modeling of expression
systems, and its suitability in the biochemical setting has been discussed (Battle et al, 2005; Gao et al, 2004; Kao et al, 2004; Lee and Batzoglou, 2003; Liao et al, 2003; Lu et al, 2004a; Ochs et al, 2004; Sun et al, 2006). What is unique about our model is that the state of a TF is represented as a binary variable rather than a continuous one, which provides two advantages. First, it intuitively reflects the active/inactive state of a TF. In contrast, a continuous value can be difficult to interpret. For example, what does a negative value (allowed in many other models (Battle et al, 2005; Gao et al, 2004; Kao et al, 2004; Lee and Batzoglou, 2003; Liao et al, 2003; Ochs et al, 2004; Sun et al, 2006)) mean in terms of the state of a TF, and does a value of 105.5 indicate that a TF is fully or only partially activated? Second, and more importantly, a binary representation of the TF state enables us to model the relationship between the concentration of a signaling molecule and the state of a TF in a fashion that mimics the biological system.
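As an illustration of Equation (1), the following minimal sketch simulates expression values from binding indicators, TF states, and weights. It is not the authors' implementation; the function name `simulate_expression`, the toy matrices, and the noise standard deviation are assumptions chosen only to make the AND relationship visible.

```python
import numpy as np

def simulate_expression(b, s, w, noise_sd=0.1, rng=None):
    """Draw e_ag = sum_t b_gt * s_at * w_gt + noise for every array a and gene g.

    b: (G, T) binary binding-site indicators
    s: (A, T) binary TF states
    w: (G, T) real-valued weights
    Returns an (A, G) matrix of simulated expression values.
    """
    rng = np.random.default_rng() if rng is None else rng
    # mean[a, g] = sum_t s[a, t] * b[g, t] * w[g, t]
    mean = s @ (b * w).T
    return mean + rng.normal(0.0, noise_sd, size=mean.shape)

# Toy example (hypothetical values): 4 genes, 3 arrays, 2 TFs.
rng = np.random.default_rng(0)
b = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])   # gene 3 has no binding sites
s = np.array([[1, 0], [0, 1], [1, 1]])           # both TFs active only in array 2
w = rng.normal(size=(4, 2))
e = simulate_expression(b, s, w, noise_sd=0.0, rng=rng)
```

With the noise switched off, a gene without any binding site has zero expression change in every array, and a gene bound by both TFs receives the sum of both weights only when both TFs are active.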
In our integromics setting, we would like to investigate whether changes in the concentrations of bioactive lipids influence a module of genes by activating or inactivating a TF. In essence, this task requires a probabilistic model that captures the dose-response curve relating lipid concentrations to TF state. Biologically, a signaling molecule can activate a TF either by binding to it directly or by interacting with enzymes that modify the TF state indirectly. All of these interactions follow the rules of mass action (Voit, 2000), so the dose-response relationship often takes the form of a non-linear sigmoid curve, which can be readily captured with a logistic regression model. In such a model, the concentrations of the signaling molecules are treated as continuous independent variables and the state of the TF is represented as a binary variable, so well-established statistical methods can be applied to estimate the relationship between the variables (Bishop, 2006). If, on the other hand, both the independent and the dependent variables were continuous, we would need either to rely on a linear model, which cannot capture the nonlinear biological relationship, or to use more complicated non-parametric models that are difficult to integrate into complex graphical models.
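The sigmoid dose-response relationship described above can be sketched with a one-variable logistic function. The function name `activation_probability` and the `midpoint` and `slope` values are hypothetical and are used only to show the shape of the curve.

```python
import math

def activation_probability(concentration, midpoint, slope):
    """Logistic dose-response: probability that a TF is active at a given
    signaling-molecule concentration (illustrative parameterization)."""
    return 1.0 / (1.0 + math.exp(-slope * (concentration - midpoint)))

# Hypothetical concentrations in arbitrary units.
doses = [0.0, 1.0, 2.0, 3.0, 4.0]
probs = [activation_probability(c, midpoint=2.0, slope=3.0) for c in doses]
```

The probability stays near 0 at low concentrations, crosses 0.5 at the midpoint, and saturates near 1, which is the switch-like behavior a linear model cannot represent.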
Inference Algorithm. Given the genomic TFBS data and gene expression data, training the model involves inferring the state of each TF under each condition, s_at, and estimating the weight parameter of each TF on each gene, w_gt. This can be addressed with various statistical inference techniques, including the expectation-maximization (EM) algorithm (Dempster et al, 1977). Inferring the joint TF states is difficult because the number of possible combinations of TF states is exponential in the number of TFs (2^T). Furthermore, estimating a large number of parameters (w_gt) often leads to overfitting in the conventional maximum likelihood estimation (MLE) setting. To address both difficulties, we adopt a variational Bayesian approach (Beal, 2003; Ghahramani and Beal, 2000; Lu et al, 2004a, b) to approximately infer the states of the TFs and to estimate the posterior distributions of the parameters rather than point estimates. The principle and derivation of the inference procedures are described in detail in Lu et al (2004a), with modifications here to include the genomic TFBS information. Our algorithm iterates through the following updates of the parameters of the approximate posterior distributions.
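\[
\log\frac{\gamma_{at}}{1-\gamma_{at}}
= \left\langle \log\frac{\theta_t}{1-\theta_t} \right\rangle
+ \sum_{g=1}^{G} \langle\tau_g\rangle\, b_{gt}\, e_{ag}\, \langle w_{gt}\rangle
- \frac{1}{2}\sum_{g=1}^{G} \langle\tau_g\rangle
\left[ 2\sum_{i \neq t}^{T} b_{gt}\, b_{gi}\, \langle w_{gt} w_{gi}\rangle \langle s_{ai}\rangle
+ b_{gt}\, \langle w_{gt}^{2}\rangle \right] \tag{2}
\]

\[
\tilde{\Sigma}_g = \left( \operatorname{diag}\langle\lambda\rangle
+ \langle\tau_g\rangle \sum_{a=1}^{A} \langle y_{ag}\, y_{ag}^{T}\rangle \right)^{-1},
\qquad
\mu_g(w) = \langle\tau_g\rangle\, \tilde{\Sigma}_g \sum_{a=1}^{A} e_{ag}\, \langle y_{ag}\rangle \tag{3}
\]

\[
\tilde{\alpha}_t = \alpha_t + \sum_{a=1}^{A} \langle s_{at}\rangle,
\qquad
\tilde{\beta}_t = \beta_t + A - \sum_{a=1}^{A} \langle s_{at}\rangle \tag{4}
\]

\[
\tilde{\kappa}_t = \kappa_t + \frac{G}{2},
\qquad
\tilde{\nu}_t = \nu_t + \frac{\langle \lVert w_t \rVert^{2} \rangle}{2} \tag{5}
\]

\[
\tilde{c}_g = c_g + \frac{A}{2},
\qquad
\tilde{d}_g = d_g + \frac{1}{2}\sum_{a=1}^{A}
\left[ e_{ag}^{2} - 2\, e_{ag}\, \langle y_{ag}\rangle^{T} \langle w_g\rangle
+ \left\langle \left( y_{ag}^{T} w_g \right)^{2} \right\rangle \right] \tag{6}
\]

where ⟨·⟩ stands for the expected value of a variable with respect to its posterior distribution. In the above equations, γ_at = ⟨s_at⟩ is the parameter governing the posterior distribution of s_at; μ_g(w) and Σ̃_g are the mean and covariance matrix of the posterior distribution governing the weight vector associated with each gene; α̃_t and β̃_t are the parameters of the posterior distribution of θ_t; κ̃_t and ν̃_t are the parameters of the posterior distribution of the weight precision λ_t; c̃_g and d̃_g are the parameters of the posterior distribution over τ_g; and y_ag = [b_g1 s_a1, b_g2 s_a2, ..., b_gT s_aT]^T is a vector that represents the result of the AND operation between the binding sites and the activation states of the TFs, which can be interpreted as a vector of binary variables representing the interaction events between cis- and trans-regulatory elements. The above equations show that the inference procedure combines genomic (b_gt) and transcriptomic (e_ag) data in a principled and unified framework.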
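To make the role of the interaction vector y_ag concrete, the sketch below computes a Gaussian posterior over the weight vector of a single gene in the standard Bayesian linear-regression form. It is a simplified illustration, not the authors' algorithm: TF states are treated as known 0/1 values rather than posterior expectations, and the names `weight_posterior`, `tau_g`, and `lam` (noise precision and prior weight precisions) are assumptions.

```python
import numpy as np

def weight_posterior(e_g, Y_g, tau_g, lam):
    """Posterior mean and covariance of the weights of one gene.

    e_g:   (A,) expression values of gene g across arrays
    Y_g:   (A, T) rows are y_ag = b_g * s_a (binding site AND TF state)
    tau_g: scalar noise precision (assumed known here)
    lam:   (T,) prior precisions of the weights (assumed known here)
    """
    precision = np.diag(lam) + tau_g * (Y_g.T @ Y_g)
    cov = np.linalg.inv(precision)
    mean = tau_g * cov @ (Y_g.T @ e_g)
    return mean, cov

# Toy example (hypothetical): 3 TFs, gene g has binding sites for TFs 0 and 1.
b_g = np.array([1, 1, 0])
s = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 1]])
Y_g = s * b_g                        # element-wise AND of binary values
true_w = np.array([2.0, -1.0, 0.0])
e_g = Y_g @ true_w                   # noise-free expression values
mean, cov = weight_posterior(e_g, Y_g, tau_g=100.0, lam=np.full(3, 0.01))
```

Because TF 2 has no binding site in this promoter, its column of `Y_g` is zero and its posterior mean stays at the prior mean of zero, while the weights of the two bound TFs are recovered from the data.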
Bayesian Logistic Regression
A Bayesian logistic regression framework (as described in Congdon, 2001, section 4.5) was used to assess the association between predicted transcription factor activation states and sphingolipid measurements during the heat-stress time series. Let s_at be the predicted state of TF t under experimental condition a, such that s_at = 1 indicates that the TF is activated and s_at = 0 otherwise. The probability p(s_at = 1) that TF t was activated under condition a was modeled as a function of the sphingolipid measurements in experiment a. The probabilistic relationship between the bioactive lipids and the TF state is given by
logit( p(s_at = 1) ) = β_t' x_a ,
where β_t is the vector of log-odds ratios for the TFBS t, and x_a is a vector whose first two elements are 1 and an indicator variable representing the state of heat stress (1 if heat stress, 0 otherwise). The remaining elements of x_a contain the measurements of each sphingolipid from experimental condition a. In this way, an element β_tl reflects the strength and direction of the influence of lipid l on the activation state of TF t.
A prior distribution over β_t was defined to complete the Bayesian specification of this model. A Gaussian prior distribution was selected with mean zero and a diagonal covariance matrix. The diagonal terms of the covariance matrix were identical and were given a small value in order to 1) ensure that collinearity in the sphingolipid measurements would not hinder identifiability in the posterior distribution and 2) limit the potential for false positive findings.
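A minimal sketch of the unnormalized log posterior implied by this specification is given below. The function name `log_posterior`, the argument `prior_var` (the common diagonal prior variance), and the toy design matrix are assumptions for illustration; the source does not state the prior variance that was used.

```python
import numpy as np

def log_posterior(beta, X, s, prior_var):
    """Unnormalized log posterior of beta_t for one TF.

    beta:      (P,) coefficients (intercept, heat-stress indicator, lipids)
    X:         (A, P) design matrix whose rows are the vectors x_a
    s:         (A,) binary predicted TF states s_at
    prior_var: variance of the zero-mean Gaussian prior on each coefficient
    """
    eta = X @ beta                                   # logit of p(s_at = 1)
    # Bernoulli log-likelihood, written in a numerically stable form.
    log_lik = np.sum(s * eta - np.logaddexp(0.0, eta))
    log_prior = -0.5 * np.sum(beta ** 2) / prior_var
    return log_lik + log_prior

# Toy design (hypothetical): intercept, heat-stress indicator, one lipid.
X = np.array([[1.0, 0.0, 0.2],
              [1.0, 0.0, 0.5],
              [1.0, 1.0, 1.4],
              [1.0, 1.0, 2.1]])
s = np.array([0.0, 0.0, 1.0, 1.0])
value_at_zero = log_posterior(np.zeros(3), X, s, prior_var=1.0)
```

At β_t = 0 every condition has activation probability 0.5, so the value reduces to A·log(0.5); a smaller `prior_var` pulls the coefficients more strongly toward zero.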
The joint posterior distribution of the vector β_t was sampled using the Metropolis-within-Gibbs sampling technique (Tierney, 1994). The sampling algorithm is summarized as follows. The full conditional posterior distribution of the element β_tl is

\[
P\!\left(\beta_{tl} \mid \beta_{t(-l)}, \mathrm{Data}\right)
= K\, \varphi(\beta_{tl}) \prod_{a=1}^{A}
p(s_{at}=1)^{\,s_{at}} \left(1 - p(s_{at}=1)\right)^{1-s_{at}},
\]

where β_t(-l) consists of the elements of β_t not including β_tl, K is a constant of proportionality, φ is the Gaussian prior density function for β_tl, and A is the number of experimental conditions. At iteration g, and for each element l of β_t, the steps of the sampling algorithm are
1. draw β_tl' from G(β_tl' | β_tl^(g-1))
2. draw u from U(0,1)
3. if u ≤ P(β_tl' | β_t(-l)^(g-1), Data) / P(β_tl^(g-1) | β_t(-l)^(g-1), Data), take β_tl^(g) = β_tl'; else β_tl^(g) = β_tl^(g-1),
where G(β_tl' | β_tl^(g-1)) is a proposal density depending on the sampled value β_tl^(g-1) and U(0,1) is the uniform density on the unit interval. These steps are repeated for each element of β_t, completing the gth iteration. The burn-in period consisted of 10,000 iterations. Additional samples were collected until stability was observed in the Gelman and Rubin (Gelman and Rubin, 1992) convergence diagnostic statistic. The marginal posterior distributions were summarized by computing equal-tailed 95% credible intervals for each element of β_t. Credible intervals not covering the value zero indicated statistical significance in the relationship between the probability that TFBS t is bound and the corresponding sphingolipid concentration.
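The sampling steps can be sketched as follows. This is an illustrative sketch rather than the authors' code: it assumes a symmetric Gaussian random-walk proposal (the source does not specify the proposal density), uses a much shorter burn-in than the 10,000 iterations reported, omits the Gelman and Rubin diagnostic, and runs on simulated data. The names `metropolis_within_gibbs`, `step`, and `prior_var` are hypothetical.

```python
import numpy as np

def log_posterior(beta, X, s, prior_var):
    """Unnormalized log posterior: Bernoulli likelihood times Gaussian prior."""
    eta = X @ beta
    return np.sum(s * eta - np.logaddexp(0.0, eta)) - 0.5 * np.sum(beta ** 2) / prior_var

def metropolis_within_gibbs(X, s, prior_var=1.0, step=0.5,
                            n_burn=1000, n_keep=2000, seed=0):
    """Update one coefficient at a time with a Metropolis accept/reject step."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    current = log_posterior(beta, X, s, prior_var)
    samples = np.empty((n_keep, beta.size))
    for it in range(n_burn + n_keep):
        for l in range(beta.size):
            proposal = beta.copy()
            proposal[l] += rng.normal(0.0, step)      # step 1: draw a candidate
            candidate = log_posterior(proposal, X, s, prior_var)
            # steps 2-3: accept if u < ratio of full conditionals (log scale);
            # the joint posterior equals the full conditional up to a constant.
            if np.log(rng.uniform()) < candidate - current:
                beta, current = proposal, candidate
        if it >= n_burn:
            samples[it - n_burn] = beta
    return samples

# Simulated data (hypothetical): intercept, heat-stress indicator, one lipid.
rng = np.random.default_rng(1)
A = 60
lipid = rng.normal(size=A)
heat = (np.arange(A) % 2).astype(float)
X = np.column_stack([np.ones(A), heat, lipid])
s = (rng.uniform(size=A) < 1.0 / (1.0 + np.exp(-2.0 * lipid))).astype(float)

samples = metropolis_within_gibbs(X, s)
# Equal-tailed 95% credible interval for each coefficient.
lower, upper = np.percentile(samples, [2.5, 97.5], axis=0)
```

A coefficient whose interval `[lower, upper]` excludes zero would be flagged as significant under the criterion described above; in practice several chains would be run and compared before the intervals are interpreted.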
Battle A, Segal E, Koller D (2005) Probabilistic discovery of overlapping cellular processes and their
regulation. J Comput Biol 12: 909-927.
Beal MJ (2003) Variational algorithms for approximate Bayesian inference. In the Gatsby
Computational Neuroscience Unit. University College London.
Bishop CM (2006) Pattern Recognition and Machine Learning: Springer.
Buntine W (1994) Operations for learning with graphical models. Journal of Artificial Intelligence
Research 2: 159.
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society, Series B 39: 1-38.
Gao F, Foat BC, Bussemaker HJ (2004) Defining transcriptional networks through integrative
modeling of mRNA expression and transcription factor binding data. BMC Bioinformatics 5: 31.
Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Statistical
Science 7.
Ghahramani Z, Beal MJ (eds) (2000) Graphical Models and Variational Methods.
Kao KC, Yang YL, Boscolo R, Sabatti C, Roychowdhury V, Liao JC (2004) Transcriptome-based
determination of multiple transcription regulator activities in Escherichia coli by using network
component analysis. Proc Natl Acad Sci U S A 101: 641-646.
Lee SI, Batzoglou S (2003) Application of independent component analysis to microarrays. Genome
Biol 4: R76.
Liao JC, Boscolo R, Yang YL, Tran LM, Sabatti C, Roychowdhury VP (2003) Network component
analysis: reconstruction of regulatory signals in biological systems. Proc Natl Acad Sci U S A 100:
15522-15527.
Lu X, Hauskrecht M, Day RS (2004a) Modeling cellular processes with variational Bayesian
cooperative vector quantizer. Pac Symp Biocomput: 533-544.
Lu X, Hauskrecht M, Day RS (2004b) Modeling cellular processes with variational Bayesian
cooperative vector quantizer model. In Proceedings of the Pacific Symposium on Biocomputing, Big
Island, Hawaii.
Ochs MF, Moloshok TD, Bidaut G, Toby G (2004) Bayesian decomposition: analyzing microarray data
within a biological context. Ann N Y Acad Sci 1020: 212-226.
Sun N, Carroll RJ, Zhao H (2006) Bayesian error analysis model for reconstructing transcriptional
regulatory networks. Proc Natl Acad Sci U S A 103: 7988-7993.
Tierney L (1994) Markov chains for exploring posterior distributions. The Annals of Statistics 22.
Voit EO (2000) Computational analysis of biochemical systems. A practical guide for biochemists and
molecular biologists: Cambridge University Press.