Download msb20103-sup-0001 - Molecular Systems Biology
Revealing a signaling role of PHS1P in yeast using integrative systems approaches

Supplementary Materials

L. Ashley Cowart, Matthew Shotwell, Mitchell L. Worley, Adam J Richards, David J. Montefusco, Yusuf A. Hannun, and Xinghua Lu

Supplementary Figure 1. GO Steiner tree of lcb4/lcb5 mutation-sensitive genes. Genes that are differentially expressed between wild-type and lcb4/lcb5 strains are connected according to their Gene Ontology (GO) annotations. A gene is represented as a rectangular box and a GO term is denoted by a blue oval. The subset of genes with significant correlation with PHS1P is shown as red rectangles. The red inserts show PHS1P-sensitive genes and the blue inserts show secretion-related genes.

Supplementary Results

Correlation of Lipidomic and Transcriptomic data

The genes that show significant positive correlation with PHS1P are as follows: COR1, COX7, ALD6, INH1, GPI12, VAC8, PET9, ARG82, SDH3, QCR7, COX6, COX4, RIP1, SNF11, QCR10, TFP1, YOP1, GLC7, LCB5, ATP14, ATP7, EDC1, ATP17

The genes that show significant negative correlation with PHS1P are as follows: DPL1, FUS1, RNH70, NSE1, CBK1, FRE7, ATG11, CPA1, DIE2, PPS1, NAR1, SCC4, YPS3, CTR3, RLR1, MTO1, MRH4, MSR1, GPA1, BRE2, SEC39, MED4, PCM1, PMD1, SSK22, ECM7, KOG1, MCM22, YEH1, PSK2, STD1, FUS3, PPR1, PHO81, MSB4, MOD5, CTK2, DIA2, CCH1, SPO7, VPS62, POP1, PDR3, KIN4, MSI1, FAP1, PTA1, CAF4, TFC4, NPP1, RAD53, THI7, MSG5, DFG5, BCD1, SMC2, TAD1, ALG11, YIH1, CDC54

The following TFs are significantly enriched in the promoters of the genes that are positively correlated with PHS1P.
The results are shown as TF name and p-value pairs.

TF      p-value
CIN5    0.0270882590216475
CBF1    0.0432286981561686
GCR1    3.88982024184248e-05
GCR2    0.000132818116975297
CAT8    0.0212699582900187
STB5    0.0191256540086794
GAT3    0.0102921086795477
MGA1    0.0444241181006045
MET4    0.00464591686854476
YAP6    0.0375914536865041
YAP1    0.0433633175543843
TYE7    0.020830917363739
ADR1    0.048338369784423
NRG1    8.30021054721852e-05
NRG2    1.43313749756402e-08
UME6    0.00875584105244476
RSF2    5.05304559006436e-06
HAP5    1.54876111935209e-13
HAP4    7.20170922896557e-10
HAP3    5.44009282066327e-15
HAP2    2.88657986402541e-15
HAP1    1.27233422557715e-05

Supplementary Tables

Table 1: Yeast strains used in this study

STRAIN       GENOTYPE
JK939        MAT leu2-3 112 ura3-52 rme1 trp1 his4 HMLa
lcb4/lcb5    MAT leu2-3 112 ura3-52 rme1 trp1 his4 HMLa lcb4::KanMX lcb5::KanMX
dpl1         MAT leu2-3 112 ura3-52 rme1 trp1 his4 HMLa dpl1::KanMX
BY4742       MAT his3 leu2 lys2 ura3
hap4         MAT his3 leu2 lys2 ura3 hap4::KANMX

Table 2: Primers for Real Time PCR

Gene               Forward primer (F)                  Reverse primer (R)
COX4               5’-AACCCGTGGTGAAAACTGC-3’           5’-AGGTCTGTTGGAACGGTAC-3’
COX5a              5’-TTGCCAAAGTGTTGCTGC-3’            5’-GCTGCCACTCCTTATTCAT-3’
COX9               5’-CATCGTCCTCGGGTTCTCC-3’           5’-CTAGCTCTGCGTAGAACTT-3’
INH1               5’-GCCGCAAGGTTCTACTCTG-3’           5’-AAGTCTTCCGTGGCCCTTT-3’
ATP17              5’-GGTAAACCATTGTGGCATT-3’           5’-GCTCTTCCGCACCTTTATG-3’
APS2               5’-TGTGGTGCGGTTGGTGAG-3’            5’-TTCGTCGAATCGGAAAACTCTAC-3’
SAR1               5’-CAGTCAGAGATGTGTTGGCTTCC-3’       5’-TGGATGCCATGTTGGTTGTAAGG-3’
SVP26              5’-ACGCAGACGTGCTAGTTTCG-3’          5’-TGTTGCTCACTAGTCGTTGGTAG-3’
TMP2               5’-GAACAACAGCTAGAAGACAGTGAAG-3’     5’-CTTCTTCCAAAGCAACGATCCTTC-3’
TVP23              5’-CAATGTGTCTGACCGCCTGGAAC-3’       5’-CGGAACAGAAGGAAGAGCGAACC-3’
ACT1               5’-CATCACTATTGGTAACGAAAGAT-3’       5’-ATTCCTTACGGACATCGAC-3’
18s rDNA subunit   5’-CCATGGTTTCAACGGGTAACGG-3’        5’-GCCTTCCTTGGATGTGGTAGCC-3’

Supplementary Methods

Bayesian transcription factor state model

We developed a Bayesian latent variable model to infer the activation states of TFs under each experimental condition.
The model is an extension of a previously published statistical model by Lu et al (Lu et al, 2004a), and the current model is referred to as the Bayesian transcription factor state (BTFS) model. The model infers the states of TFs under specific conditions by combining two types of omics data: genomic information derived from epigenomic experiments and prior knowledge, and gene expression data from our experiments. In the model, the unobserved state of a transcription factor t under a specific experimental condition a is represented as a latent binary variable, $s_{at}$, such that $s_{at}=1$ indicates that TF t is in an active state, and consequently its influence on the expression of genes can be observed in the microarray.

Model Specification. The probabilistic graphical model, in plate representation (Buntine, 1994), is shown in Figure 1. In this representation, a node represents a variable and a directed edge represents the probabilistic relationship between a pair of variables. A filled node denotes an observed instance of a variable, and an open node indicates an unobserved (latent) one. For example, the filled red node in the middle plate represents the variable $e_{ag}$, the expression value for gene g in microarray a. Multiple variables of a similar type, e.g., a total of G gene expression measurements, are represented as a plate, in which the letter at the bottom right corner of the plate indicates the number of instances. The graphical model can then be interpreted as follows:

There are a total of G instances of gene promoters (left major plate); each represents the promoter of a gene and is associated with T binary variables, $b_{gt}$, indicating whether a binding site for TF t is present in the promoter of gene g.
There are a total of A microarray experiments (middle major plate); each contains G expression measurements ($e_{ag}$) and is associated with T binary variables ($s_{at}$), indicating whether TF t is active during the experiment of microarray a.

Each of the T transcription factors (right major plate) may influence the expression of each of the G genes, and its strength on a gene g is represented by a weight variable $w_{gt}$. Here, a weight of zero indicates that the TF has no influence on the expression of the gene; a positive value denotes a positive influence; and a negative value represents a repressive effect.

The probabilistic graphical model depicted in Figure 1 enables integration of various -omics data by connecting them within a network, which provides a concise representation of the joint distribution of multi-omics data. As seen in the figure, the model integrates the genomic and transcriptomic data. In the BTFS model, the expression value of a gene during a specific experiment ($e_{ag}$) is influenced by three sets of parent variables: 1) the TFs that bind to its promoter, indicated by $b_{gt}$, $t \in \{1,\ldots,T\}$; 2) the states of the TFs under this specific condition, represented by $s_{at}$, $t \in \{1,\ldots,T\}$; and 3) the strength of the active TFs on its expression, represented by $w_{gt}$, $t \in \{1,\ldots,T\}$. We define the probabilistic relationship between the above parent variables and the gene expression value as follows:

$$e_{ag} = \sum_{t=1}^{T} b_{gt}\, s_{at}\, w_{gt} + \epsilon \quad\text{or}\quad e_{ag} \mid b_{gt}, s_{at}, w_{gt} \sim \mathcal{N}\Big(\sum_{t=1}^{T} b_{gt}\, s_{at}\, w_{gt},\ \tau_g^{-1}\Big) \tag{1}$$

where $\epsilon$ and $\tau_g$ represent the noise of the system ($\epsilon$ is the additive noise term and $\tau_g$ is its precision) and $\mathcal{N}$ stands for the Gaussian distribution. Note that, in Equation (1), the product of two binary variables, $b_{gt}s_{at}$, encodes a logical AND relationship between the two variables, so the equation can be read as follows: TF t influences the expression value of gene g under condition a if and only if it has a binding site in the promoter of the gene ($b_{gt}=1$) AND it is activated under the condition ($s_{at}=1$).
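As an illustration of Equation (1), the following minimal Python sketch computes the expected expression of one gene in one condition and draws a noisy observation. The function names, the toy values, and the use of a noise standard deviation `noise_sd` (the inverse square root of the noise precision) are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import random

def expected_expression(b_g, s_a, w_g):
    """Mean of Equation (1) for one gene g under one condition a.

    b_g[t] = 1 if TF t has a binding site in the promoter of gene g
    s_a[t] = 1 if TF t is active under condition a
    w_g[t] = signed strength of TF t on gene g
    """
    # b * s acts as a logical AND: a TF contributes only if bound AND active.
    return sum(b * s * w for b, s, w in zip(b_g, s_a, w_g))

def simulate_expression(b_g, s_a, w_g, noise_sd=0.1, rng=random):
    """Draw one expression value: additive TF effects plus Gaussian noise."""
    return expected_expression(b_g, s_a, w_g) + rng.gauss(0.0, noise_sd)

# Hypothetical toy example with T = 3 transcription factors.
b_g = [1, 1, 0]          # TFs 1 and 2 have binding sites in the promoter
s_a = [1, 0, 1]          # TFs 1 and 3 are active in this condition
w_g = [1.5, -2.0, 0.7]   # activator, repressor, weak activator
mean_eag = expected_expression(b_g, s_a, w_g)  # only TF 1 is bound AND active
```

In this toy case only the first TF is both bound and active, so the repressor (bound but inactive) and the third TF (active but not bound) contribute nothing to the mean.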
The equation also reflects the assumption that the effects of multiple active TFs on a gene’s expression are additive (usually on a logarithmic scale). This assumption has been widely used in modeling of expression systems, and its suitability in the biochemical setting has been discussed (Battle et al, 2005; Gao et al, 2004; Kao et al, 2004; Lee and Batzoglou, 2003; Liao et al, 2003; Lu et al, 2004a; Ochs et al, 2004; Sun et al, 2006).

What is unique about our model is that the state of a TF is represented as a binary variable rather than a continuous one, which provides two advantages. First, it intuitively reflects the active/inactive state of a TF. In contrast, a continuous value can sometimes be difficult to interpret. For example, what does a negative value (allowed in many other models (Battle et al, 2005; Gao et al, 2004; Kao et al, 2004; Lee and Batzoglou, 2003; Liao et al, 2003; Ochs et al, 2004; Sun et al, 2006)) mean in terms of the state of a TF, and does a value of 105.5 indicate that a TF is fully or only partially activated? Second, and more importantly, a binary representation of a TF state enables us to model the relationship between the concentration of a signaling molecule and the state of a TF in a fashion that mimics biological systems. In our integromics setting, we would like to investigate whether changes in the concentrations of bioactive lipids influence a module of genes by activating or inactivating a TF. In essence, this task is to develop a probabilistic model that captures the dose-response curve relating lipid concentrations to TF state. Biologically, a signaling molecule can activate a TF either by directly binding to the TF or by interacting with enzymes that modify the TF state indirectly. All of the above interactions follow the rules of mass action (Voit, 2000), such that dose-response relationships often take the form of nonlinear sigmoid curves, which can be readily captured with a logistic regression model.
In such a model, the concentrations of the signaling molecules can be treated as continuous independent variables and the state of the TF is represented as a binary variable, so that well-established statistical methods can be applied to estimate the relationship between the variables (Bishop, 2006). On the other hand, if both the independent and dependent variables were continuous, we would need either to rely on a linear model, which is incapable of capturing the nonlinear biological relationship, or to use convoluted non-parametric models that are difficult to integrate into complex graphical models.

Inference Algorithm. Given the genomic TFBS data and gene expression data, training the model involves inferring the state of each TF under each condition, $s_{at}$, and estimating the weight parameter for each TF on each gene, $w_{gt}$. This can be addressed with various statistical inference techniques, including the expectation-maximization (EM) algorithm (Dempster et al, 1977). The inference of joint TF states is difficult because the number of possible combinations of TF states is exponential ($2^T$). Furthermore, estimating a large number of parameters ($w_{gt}$) often leads to overfitting in the conventional maximum likelihood estimation (MLE) setting. To address both difficulties, we adopt a variational Bayesian approach (Beal, 2003; Ghahramani and Beal, 2000; Lu et al, 2004a, b) to approximately infer the states of TFs and to estimate the posterior distributions of the parameters rather than point estimates. The principle and derivation of the inference procedures are described in detail in Lu et al (Lu et al, 2004a), with modifications to include the genomic TFBS information. Our algorithm iterates through the following updates of the parameters of the approximate posterior distributions.
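$$\log\langle s_{at}\rangle - \log\big(1-\langle s_{at}\rangle\big) = \langle\log\pi_t\rangle - \langle\log(1-\pi_t)\rangle + \sum_{g=1}^{G}\langle\tau_g\rangle\, b_{gt}\, e_{ag}\,\langle w_{gt}\rangle - \frac{1}{2}\sum_{g=1}^{G}\langle\tau_g\rangle\Big[\,2\sum_{i\neq t}^{T} b_{gt}\, b_{gi}\,\langle w_{gt}w_{gi}\rangle\,\langle s_{ai}\rangle + b_{gt}\,\langle w_{gt}^{2}\rangle\Big] \tag{2}$$

$$\tilde{\Sigma}_g = \Big[\mathrm{diag}\big(\langle\rho_1\rangle,\ldots,\langle\rho_T\rangle\big) + \langle\tau_g\rangle\sum_{a=1}^{A}\langle y_{ag}\,y_{ag}^{T}\rangle\Big]^{-1}, \qquad \tilde{\mu}_g(w) = \langle\tau_g\rangle\,\tilde{\Sigma}_g\sum_{a=1}^{A} e_{ag}\,\langle y_{ag}\rangle \tag{3}$$

$$\tilde{\kappa}_t = \kappa_t + \sum_{a=1}^{A}\langle s_{at}\rangle, \qquad \tilde{\nu}_t = \nu_t + A - \sum_{a=1}^{A}\langle s_{at}\rangle \tag{4}$$

$$\tilde{\gamma}_t = \gamma_t + \frac{G}{2}, \qquad \tilde{\delta}_t = \delta_t + \frac{\langle \lVert w_t\rVert^{2}\rangle}{2} \tag{5}$$

$$\tilde{c}_g = c_g + \frac{A}{2}, \qquad \tilde{d}_g = d_g + \frac{1}{2}\sum_{a=1}^{A}\Big\langle e_{ag}^{2} - 2\,e_{ag}\,y_{ag}^{T}w_g + \big(y_{ag}^{T}w_g\big)^{2}\Big\rangle \tag{6}$$

where $\langle\cdot\rangle$ stands for the expected value of a variable with respect to its posterior distribution. In the above equations, $\lambda_{at} = \langle s_{at}\rangle$ is the parameter governing the posterior distribution of $s_{at}$; $\tilde{\mu}_g(w)$ and $\tilde{\Sigma}_g$ stand for the mean and covariance matrix of the posterior distribution governing the weight vector $w_g$ associated with each gene; $\tilde{\kappa}_t$ and $\tilde{\nu}_t$ are the parameters of the posterior distribution of $\pi_t$, the probability that TF t is active; $\tilde{\gamma}_t$ and $\tilde{\delta}_t$ are the parameters of the posterior distribution of $\rho_t$, the precision of the weights $w_t$ of TF t; $\tilde{c}_g$ and $\tilde{d}_g$ represent the parameters of the posterior distribution over the noise precision $\tau_g$; and $y_{ag} = (b_{g1}s_{a1}, b_{g2}s_{a2}, \ldots, b_{gT}s_{aT})^{T}$ is a vector that represents the results of the AND operation between the binding sites and the activation states of the TFs, which can be interpreted as a vector of binary variables representing the interaction events between cis- and trans-regulatory elements. The above equations show that the inference procedure combines both genomic ($b_{gt}$) and transcriptomic ($e_{ag}$) data in a principled and unified framework.

Bayesian Logistic Regression

A Bayesian logistic regression framework (as described in Congdon, 2001, section 4.5) was used to assess the association between predicted transcription factor activation states and sphingolipid measurements during the heat-stress time series. Let $s_{at}$ be the predicted state of TF t under experimental condition a, such that $s_{at}=1$ indicates that the TF is activated and 0 otherwise. The probability $p(s_{at}=1)$ that TF t was activated under condition a was modeled as a function of the sphingolipid measurements in experiment a.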
The probabilistic relationship between the bioactive lipids and TF state is given by

$$\mathrm{logit}\big(p(s_{at}=1)\big) = \beta_t'\, x_a,$$

where $\beta_t$ is the vector of log-odds ratios for the TFBS t, and $x_a$ is a vector whose first two elements are 1 and an indicator variable representing the state of heat stress (1 if heat stress, 0 otherwise). The remaining elements of $x_a$ contain the measurements for each sphingolipid from experimental condition a. In this way, an element $\beta_{tl}$ reflects the strength and direction of the influence of lipid l on the activation state of TF t.

A prior distribution over $\beta_t$ was defined to complete the Bayesian specification of this model. A Gaussian prior distribution with mean zero and a diagonal covariance matrix was selected. The diagonal terms of the covariance matrix were identical and given a small value in order to 1) ensure that collinearity in the sphingolipid measurements would not hinder identifiability in the posterior distribution and 2) limit the potential for false positive findings.

The joint posterior distribution for the vector $\beta_t$ was sampled using the Metropolis within Gibbs sampling technique (Tierney, 1994). The sampling algorithm is summarized as follows. The full conditional posterior distribution for the element $\beta_{tl}$ is

$$P\big(\beta_{tl} \mid \beta_{t(-l)}, \mathrm{Data}\big) = K\,\varphi(\beta_{tl})\prod_{a=1}^{A} p(s_{at}=1)^{s_{at}}\,\big(1-p(s_{at}=1)\big)^{1-s_{at}},$$

where $\beta_{t(-l)}$ consists of the elements of $\beta_t$ not including $\beta_{tl}$, K is a constant of proportionality, $\varphi$ is the Gaussian prior density function for $\beta_{tl}$, and A is the number of experimental conditions. At iteration g, and for each element l of $\beta_t$, the steps of the sampling algorithm are

1. draw $\beta_{tl}'$ from $G(\beta_{tl}' \mid \beta_{tl}^{(g-1)})$
2. draw u from U(0,1)
3. if $u < P(\beta_{tl}' \mid \beta_{t(-l)}^{(g-1)}, \mathrm{Data}) \,/\, P(\beta_{tl}^{(g-1)} \mid \beta_{t(-l)}^{(g-1)}, \mathrm{Data})$, take $\beta_{tl}^{(g)} = \beta_{tl}'$; else $\beta_{tl}^{(g)} = \beta_{tl}^{(g-1)}$,

where $G(\beta_{tl}' \mid \beta_{tl}^{(g-1)})$ is a proposal density depending on the sampled value $\beta_{tl}^{(g-1)}$ and U(0,1) is the uniform density on the unit interval. These steps are repeated for each element of $\beta_t$, completing the gth iteration.
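The three sampling steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the proposal G is assumed to be a symmetric Gaussian random walk with standard deviation `step_sd`, each element is updated conditional on the current values of the other elements, and the prior standard deviation `prior_sd`, the function names, and the toy data (an intercept plus a single covariate) are all hypothetical.

```python
import math
import random

def log_posterior(beta, X, s, prior_sd):
    """Unnormalised log posterior: Gaussian prior plus Bernoulli likelihood."""
    lp = -0.5 * sum((b / prior_sd) ** 2 for b in beta)
    for x_a, s_a in zip(X, s):
        eta = sum(b * x for b, x in zip(beta, x_a))  # linear predictor beta' x_a
        # log[p^s (1-p)^(1-s)] with p = logistic(eta), in a numerically stable form
        lp += s_a * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return lp

def metropolis_within_gibbs(X, s, n_iter=2000, prior_sd=1.0, step_sd=0.5, seed=0):
    """Return n_iter samples of beta, updating one element at a time."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    samples = []
    for _ in range(n_iter):
        for l in range(len(beta)):
            proposal = list(beta)
            proposal[l] = rng.gauss(beta[l], step_sd)   # step 1: propose beta'_tl
            u = rng.random()                            # step 2: u ~ U(0, 1)
            log_ratio = (log_posterior(proposal, X, s, prior_sd)
                         - log_posterior(beta, X, s, prior_sd))
            if u < math.exp(min(0.0, log_ratio)):       # step 3: accept or keep
                beta = proposal
        samples.append(list(beta))
    return samples

# Hypothetical toy data: x_a = (1, covariate); the TF is active when the covariate is high.
X = [[1.0, x] for x in (-2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2)]
s = [0, 0, 0, 0, 1, 1, 1, 1]
samples = metropolis_within_gibbs(X, s, n_iter=2000, seed=1)
```

Because the assumed proposal is symmetric, the acceptance ratio reduces to the ratio of full conditional densities, and the constant K cancels, so only the unnormalised log posterior is needed.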
The burn-in period consisted of 10,000 iterations. Additional samples were collected until stability was observed in the Gelman and Rubin (Gelman and Rubin, 1992) convergence diagnostic statistic. The marginal posterior distributions were summarized by computing equal-tailed 95% credible intervals for each element of $\beta_t$. Credible intervals not covering the value zero indicated statistical significance in the relationship between the probability that TFBS t is bound and the corresponding sphingolipid concentration.

References

Battle A, Segal E, Koller D (2005) Probabilistic discovery of overlapping cellular processes and their regulation. J Comput Biol 12: 909-927.

Beal MJ (2003) Variational algorithms for approximate Bayesian inference. In the Gatsby Computational Neuroscience Unit. University College London.

Bishop CM (2006) Pattern Recognition and Machine Learning: Springer.

Buntine W (1994) Operations for learning with graphical models. Journal of Artificial Intelligence Research 2: 159.

Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39: 1-38.

Gao F, Foat BC, Bussemaker HJ (2004) Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data. BMC Bioinformatics 5: 31.

Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Statistical Science 7.

Ghahramani Z, Beal MJ (eds) (2000) Graphical Models and Variational Methods.

Kao KC, Yang YL, Boscolo R, Sabatti C, Roychowdhury V, Liao JC (2004) Transcriptome-based determination of multiple transcription regulator activities in Escherichia coli by using network component analysis. Proc Natl Acad Sci U S A 101: 641-646.

Lee SI, Batzoglou S (2003) Application of independent component analysis to microarrays. Genome Biol 4: R76.
Liao JC, Boscolo R, Yang YL, Tran LM, Sabatti C, Roychowdhury VP (2003) Network component analysis: reconstruction of regulatory signals in biological systems. Proc Natl Acad Sci U S A 100: 15522-15527.

Lu X, Hauskrecht M, Day RS (2004a) Modeling cellular processes with variational Bayesian cooperative vector quantizer. Pac Symp Biocomput: 533-544.

Lu X, Hauskrecht M, Day RS (2004b) Modeling cellular processes with variational Bayesian cooperative vector quantizer model. In Proceedings of the Pacific Symposium on Biocomputing, Big Island, Hawaii.

Ochs MF, Moloshok TD, Bidaut G, Toby G (2004) Bayesian decomposition: analyzing microarray data within a biological context. Ann N Y Acad Sci 1020: 212-226.

Sun N, Carroll RJ, Zhao H (2006) Bayesian error analysis model for reconstructing transcriptional regulatory networks. Proc Natl Acad Sci U S A 103: 7988-7993.

Tierney L (1994) Markov chains for exploring posterior distributions. The Annals of Statistics 22.

Voit EO (2000) Computational analysis of biochemical systems. A practical guide for biochemists and molecular biologists: Cambridge University Press.