Seq-ing the Epigenetic Code with Exact Bayesian Network Structure Learning

Brandon M. Malone 1,2, Changhe Yuan 1, Eric Hansen 1 and Susan M. Bridges 1,2
1 Department of Computer Science & Engineering, Mississippi State University
2 Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University

Abstract

The epigenetic code hypothesis [Jaenisch] proposes that patterns of post-translational modifications to the histone core proteins, the presence of transcription factor binding sites and other genomic features influence the expression of the associated DNA. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) is frequently used to characterize these features at a genome-wide scale. Previous studies [Yu] have used approximation techniques to learn the relationships among them. In this work, we apply a novel exact Bayesian network learning algorithm to learn a network structure which identifies regulatory relationships among a set of epigenetic features in human CD4 cells [Barski]. Comparison to networks learned using greedy methods reveals that our network identifies more biologically relevant relationships. Because we apply an exact, optimal learning algorithm instead of an approximate, greedy algorithm, the relationships we learn are unaffected by sources of uncertainty stemming from the structure learning algorithm.

Optimal Learning with Dynamic Programming

In the case of a ChIP-Seq dataset, we do not know the relationships among the variables, so we must learn them. Singh and Moore [2005] proposed a dynamic programming algorithm which learns an optimal Bayesian network minimizing the MDL score; the figure below shows the intuition behind the algorithm, and Equation 2 expresses it recursively. Silander and Myllymäki [2006] refined the algorithm by reversing the process.

The Epigenetic Code

The central dogma of molecular biology (roughly) states that DNA is transcribed into RNA, which is translated into proteins.
Proteins perform many of the functions in the body. We have the same DNA in most of our cells, yet the cells perform quite different functions. One reason for this differentiation lies in the epigenetic code. When DNA forms chromosomes, it packs together very tightly into a structure called chromatin: the DNA coils around groups of eight proteins called histones. Figure 1 summarizes chromatin packaging.

Figure 1. Chromatin packaging and histones. (http://themedicalbiochemistrypage.org/)

Equations

MDL(G) = sum over variables X of MDL(X | PA(X)), where
MDL(X | U) = N * H(X | U) + (log N / 2) * K(X | U)    (1)

Here N is the number of records, H(X | U) is the conditional entropy of X given its parents U, and K(X | U) is the number of free parameters in the conditional probability table of X; this is the standard form of the MDL score.

The histone proteins include a tail domain which is susceptible to a large number of post-translational modifications; these modifications affect the attraction between histones. The attraction can increase, tightening the surrounding chromatin and suppressing expression, or the chromatin can loosen, increasing expression. The combination of modifications present determines the effect on the chromatin structure, and some histone modifications affect the likelihood of other modifications. The epigenetic code hypothesis [Jaenisch] proposes that the combination of histone modifications, as well as other features such as the presence of transcription factor binding sites, serves as a type of message to present and future generations of cells about regulation.

ChIP-Seq

We can measure the presence of a particular histone modification in cells using chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq). The figure below shows the ChIP-Seq process, which begins with raw DNA.

The dynamic programming intuition is to recursively find optimal leaves until an empty subnetwork remains. As suggested by Equation 2, learning an optimal Bayesian network consists of three phases, which we formulate as search problems: calculating scores, identifying optimal parent sets, and learning optimal subnetworks.

Calculating Scores
Goal. Calculate MDL(X | U), the score of X using U as parents.
Representation.
AD-tree [Moore]
Search Strategy. Depth-first
AD Node. Records with U = u
Vary Node. Records with U = u, X = x
Successor. Instantiate a new X
Storage. Written to disk

Identifying Optimal Parent Sets
Goal. Calculate BestScore(U, X), which selects the best parents of X from U.
Representation. Sorted score arrays and bit sets
Search Strategy. On demand
Successor. Use bit operators to find scores consistent with U \ Y
Score. scores[firstBit(usable(X))]
Storage. Arrays and bit sets

Learning Optimal Subnetworks
Goal. Calculate Score(U), the score of the best subnetwork over the variables U.
Representation. Order graph [Yuan]
Search Strategy. Breadth-first
Node. Score(U) for some U
Successor. Use X as a leaf of U
Score. Score(U) + BestScore(U, X)
Storage. Hash table or written to disk

The optimal Bayesian network structure is a DAG, so it has a leaf variable with no children. Remove that leaf and its edges from the network. The remaining subnetwork is also a DAG, so it also has a leaf. This gives the recurrence

Score(U) = min over X in U of [ Score(U \ {X}) + BestScore(U \ {X}, X) ]    (2)

with the score of the empty set equal to 0.

ChIP-Seq (continued). The DNA is sheared into pieces around 200 bp in length. The pieces are immunoprecipitated against an antibody to extract the desired pieces. The remaining pieces of DNA are sequenced [Illumina]. Finally, the sequenced DNA is mapped back to the genome.

Our Optimal Search Formulation

Calculate and sort all of the scores for a variable, for example:

parents  score
{1,2}    8
{2}      10
{1}      11
{1,3}    12
{3}      13
{}       15
{2,3}    20

Mark which scores use each variable as a parent (n - 1 such bit sets, e.g. uses[1]). Initially, a variable can use all scores, and the first is optimal. When X is used as a leaf, find the usable parent scores with (usable & ~uses[X]); the first set bit is then optimal.

Expand(U):
    for each X not in U:
        newScore = Score(U) + BestScore(U, X)
        succ = get(U + {X})
        if newScore < succ.score:
            put(U + {X}, newScore)

Frontier Breadth-first Branch and Bound Search

The order graph has a very regular structure: the successors of a node in layer l always appear in layer l + 1.
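The layer-by-layer order-graph search and the recurrence it implements can be sketched in a short script. This is a minimal illustration rather than the poster's implementation: the local scores are supplied as a plain dictionary standing in for precomputed MDL(X | U) values, best_score does a brute-force lookup instead of the sorted-array and bit-set scheme, and the names best_score and optimal_network_score are ours.

```python
from itertools import combinations

def best_score(scores, candidates, x):
    # BestScore(U, X): the best local score of x using any subset of the
    # candidate set as its parent set (brute force, for illustration only).
    best = float("inf")
    for r in range(len(candidates) + 1):
        for parents in combinations(sorted(candidates), r):
            s = scores[x].get(frozenset(parents))
            if s is not None and s < best:
                best = s
    return best

def optimal_network_score(variables, scores):
    # Breadth-first search over the order graph: layer l holds every subset U
    # of size l together with Score(U). Only the current layer and the next
    # one are ever kept in memory, mirroring the two-layer observation above.
    frontier = {frozenset(): 0.0}
    for _ in range(len(variables)):
        next_layer = {}
        for u, u_score in frontier.items():
            for x in variables - u:
                # Use x as a leaf of U + {x}: Score(U) + BestScore(U, x).
                new_score = u_score + best_score(scores, u, x)
                succ = u | {x}
                if new_score < next_layer.get(succ, float("inf")):
                    next_layer[succ] = new_score
        frontier = next_layer
    return frontier[frozenset(variables)]

# Toy problem: three variables with hand-picked local scores (lower is better).
variables = {"A", "B", "C"}
scores = {
    "A": {frozenset(): 10, frozenset("B"): 6, frozenset("C"): 9, frozenset("BC"): 5},
    "B": {frozenset(): 8, frozenset("A"): 7, frozenset("C"): 4, frozenset("AC"): 3},
    "C": {frozenset(): 9, frozenset("A"): 8, frozenset("B"): 8, frozenset("AB"): 7},
}
print(optimal_network_score(variables, scores))  # prints 18.0 (order C, B, A)
```

Because every node in layer l + 1 is reachable only from layer l, discarding completed layers is safe; this is the memory saving the branch and bound search exploits.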
This observation allows us to keep only two layers in memory rather than all n. Furthermore, we can calculate how good a particular node can possibly be; if that bound is worse than a known bound, we safely disregard the node. If optimality is not needed, we can disregard many more nodes to reduce running time.

Bayesian Networks

Representation. Joint probability distribution over a set of variables
Structure. Directed acyclic graph storing conditional dependencies
• Vertices correspond to variables.
• Edges indicate relationships among variables.
Parameters. Conditional probability tables quantifying relationships
Scoring. Minimum Description Length (MDL) [Schwarz], Equation 1

Data Set and Preprocessing

Raw Data. 30 human ChIP-Seq experiments [Barski]
Cellular Environment. CD4 cells (specialized white blood cells)
Normalization. Linear regression against an IgG control data set
Discretization. Clustered genes using MDL for each experiment
Processed Data Set. A numeric array of length 30 for each gene

Experiments

We used our optimal frontier breadth-first search algorithm to learn an optimal Bayesian network over the 23-variable data set and compared it to the greedy search used previously [Yu]. Figures 2 and 3 show the learned networks.

Results and Discussion

Figure 2. Learned structure with our optimal algorithm.
Figure 3. Learned structure with a standard greedy algorithm.

Selected References

Jaenisch, R. and Bird, A. (2003). "Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals." Nature Genetics 33: 245-254.
Schwarz, G. (1978). "Estimating the dimension of a model." The Annals of Statistics 6(2): 461-464.
Barski, A., Cuddapah, S., et al. (2007). "High-resolution profiling of histone methylations in the human genome." Cell 129(4): 823-837.
Singh, A. P. and Moore, A. W. (2005). "Finding optimal Bayesian networks by dynamic programming" (Technical Report 05-106). Carnegie Mellon University.
Silander, T. and Myllymäki, P. (2006).
"A simple approach for finding the globally optimal Bayesian network structure." Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), AUAI Press.
Yu, H., Zhu, S., et al. (2008). "Inferring causal relationships among different histone modifications and gene expression." Genome Research 18(8): 1314-1324.
Yuan, C., Malone, B. and Wu, X. (2011). "Learning optimal Bayesian networks using A* search." Proceedings of the 22nd International Joint Conference on Artificial Intelligence.

We focused on the transcription factor binding site for CTCF, which is known to play a role in the regulation of many elements; we therefore expect CTCF to be an ancestor of important regulatory elements. In our network, CTCF is a parent of the five most highly connected regulatory elements. The approximate algorithm instead identified four parents and three children of intermediate degree for CTCF.

Conclusions

We presented a frontier breadth-first branch and bound search algorithm for learning optimal Bayesian networks that improves the memory complexity from O(2^n) to O(C(n, n/2)). Provably optimal solutions allow us to focus on interpreting the results. We learned the optimal structure of a network of epigenetic features; it included more biologically meaningful relationships than structures learned with greedy search.

Acknowledgments

This material is based on work supported by the National Science Foundation under Grants No. NSF EPS-0903787 and NSF IIS-0953723.
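The bit-set lookup described under Identifying Optimal Parent Sets can be illustrated with a small sketch. This is our reconstruction, not the poster's code: the score list, the uses masks and the function name best_usable_score are illustrative, and bit i of each mask refers to position i in the sorted score list.

```python
def best_usable_score(sorted_scores, uses, candidates, all_vars):
    # sorted_scores: (score, parent_set) pairs in ascending score order.
    # uses[v]: bit mask marking which entries of sorted_scores use v as a parent.
    usable = (1 << len(sorted_scores)) - 1       # initially, every score is usable
    for v in all_vars - candidates:
        usable &= ~uses[v]                       # drop scores using an excluded variable
    first = (usable & -usable).bit_length() - 1  # index of the first set bit
    return sorted_scores[first]                  # sorted order makes it optimal

# Scores for one variable whose potential parents are B and C.
sorted_scores = [(5, {"B", "C"}), (6, {"B"}), (9, {"C"}), (10, set())]
uses = {"B": 0b0011, "C": 0b0101}  # entries 0 and 1 use B; entries 0 and 2 use C
print(best_usable_score(sorted_scores, uses, {"B"}, {"B", "C"}))  # (6, {'B'})
```

Because the scores are sorted once, each BestScore query reduces to a few mask operations and a first-set-bit lookup, which is why the formulation lists "on demand" as the search strategy for this phase.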