Download ppt - Chair of Computational Biology

V16 – genome rearrangement

Important information is contained in the order in which genes occur on the genomes of different species; it allows phylogenetic relationships to be inferred. Together with phylogenetic information, ancestral gene order reconstructions give some clues about the conservation of the functional organisation of genomes, a step towards a global knowledge of the evolution of life.

Often, phylogeny reconstruction techniques using gene order data rely on the definition of an evolutionary distance between two gene orders. These distances are usually computed as the minimal number of rearrangement operations needed to transform one genome into the other.

(Bergeron et al., WABI 2004, 14-25 (2004))
16. Lecture WS 2004/05, Bioinformatics III

V16 – genome rearrangement

Most choices of rearrangements quickly lead to hard algorithmic problems. Therefore, the set of operations is usually restricted to reversals, translocations, fusions or fissions, for which linear-time algorithms have been developed in recent years.

However, this choice of rearrangement operations is dictated more by algorithmic necessity than by biological reality. E.g., in some genomes, transpositions and inverted transpositions can be quite common.

A family of phylogenetic approaches labelled "distance-based" methods relies on pairwise evolutionary distances, which are then fed into an algorithm such as neighbor-joining to infer tree topology and branch lengths. These methods do not provide information about the putative ancestral gene order.

(Bergeron et al., WABI 2004, 14-25 (2004))

V16 – genome rearrangement

Parsimony-based approaches attempt to identify the rearrangement scenario (including tree topology and gene orders at the internal nodes) that minimizes the number of evolutionary events required. This problem is computationally much more difficult than just computing distances. Heuristic algorithms exist that use either breakpoint or reversal distances.
However, these methods provide only one (or a small number of) hypotheses about ancestral gene orders, with no information about alternative optimal or near-optimal solutions.

Today:
- a quick look at the reversal distance problem again
- a new method, "sets of conserved intervals" (Bergeron & Jens Stoye)

(Bergeron et al., WABI 2004, 14-25 (2004))

Breakpoint Graph

The breakpoint graph of a permutation π is an edge-colored graph G(π) with n + 2 vertices {π_0, π_1, ..., π_n, π_{n+1}} = {0, 1, ..., n, n+1}. We join vertices π_i and π_{i+1} by a black edge for 0 ≤ i ≤ n. We join vertices π_i and π_j by a gray edge if π_i ~ π_j, that is, if |π_i - π_j| = 1.

A breakpoint graph is obtained by superposing a black path that traverses the vertices 0, 1, ..., n, n+1 in the order given by the permutation π and a gray path that traverses the vertices in the order given by the identity permutation.

[Figure: black path and gray path over the vertices 0 2 3 1 4 6 5 7; the superposition of the two paths forms the breakpoint graph.]

Cycle decomposition

A cycle in an edge-colored graph G is called alternating if the colors of every two consecutive edges of this cycle are distinct. In the following, cycles will mean alternating cycles.

A vertex v in a graph G is called balanced if the number of black edges incident to v equals the number of gray edges incident to v. A balanced graph is a graph in which every vertex is balanced.

G(π) is a balanced graph. Therefore, there exists a cycle decomposition of G(π) into edge-disjoint alternating cycles (every edge in the graph belongs to exactly one cycle in the decomposition).

Cycles in a cycle decomposition may be self-intersecting. The previous breakpoint graph can be decomposed into 4 cycles, one of which is self-intersecting.

[Figure: cycle decomposition of the breakpoint graph of 0 2 3 1 4 6 5 7.]
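The construction of the breakpoint graph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the permutation is unsigned and already framed by 0 and n+1, gray edges join values that differ by 1 (the gray path along the identity permutation), and the function names `breakpoint_graph` and `is_balanced` are hypothetical, not taken from the lecture. The sketch builds both edge sets for the example 0 2 3 1 4 6 5 7 and checks that every vertex is balanced; it does not compute a maximum cycle decomposition.

```python
def breakpoint_graph(perm):
    """Return (black, gray) edge sets for a framed unsigned permutation.

    Assumption: perm is a list starting with 0 and ending with n+1.
    """
    # Black path: consecutive elements in the order given by the permutation.
    black = {frozenset(edge) for edge in zip(perm, perm[1:])}
    # Gray path: consecutive values in the order given by the identity.
    gray = {frozenset((v, v + 1)) for v in range(len(perm) - 1)}
    return black, gray


def is_balanced(perm, black, gray):
    """True if every vertex has as many black as gray incident edges."""
    for v in perm:
        n_black = sum(v in edge for edge in black)
        n_gray = sum(v in edge for edge in gray)
        if n_black != n_gray:
            return False
    return True


pi = [0, 2, 3, 1, 4, 6, 5, 7]
black, gray = breakpoint_graph(pi)
balanced = is_balanced(pi, black, gray)
# Black edges that are not also gray edges are the breakpoints of pi.
breakpoints = black - gray
print(balanced, len(breakpoints))
```

Edges that occur in both sets (here 2-3 and 5-6) are conserved adjacencies; each of them forms a trivial alternating cycle of one black and one gray edge.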
Effects of reversals on cycles

With b the number of breakpoints and c the number of cycles:
(A) For reversals acting on two cycles, Δ(b - c) = 1.
(B) For reversals acting on an unoriented cycle, Δ(b - c) = 0.
(C) For reversals acting on an oriented cycle, Δ(b - c) = -1.

(Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999))

Cycle decomposition

What is the decomposition of the breakpoint graph into a maximum number c(π) of edge-disjoint alternating cycles? Here, c(π) = 4.

Cycle decompositions play an important role in estimating reversal distances. When a reversal is applied to a permutation, the number of cycles in a maximum decomposition can change by at most one (while the number of breakpoints can change by two).

Bafna & Pevzner (1996) proved the following bound for the reversal distance d(π):

d(π) ≥ n + 1 - c(π)

which is much tighter than the bound in terms of breakpoints, d(π) ≥ b(π) / 2.

For many biological problems, d(π) = n + 1 - c(π). Therefore, the reversal distance problem reduces to the problem of finding the maximal cycle decomposition.

Hurdles, super-hurdles, fortresses ...

Alternative concept: conserved intervals

Distance matrices can be used as data for phylogenetic reconstruction, or to reconstruct ancestral genomes. However, all distances (except for the breakpoint distance) are closely tied to initial choices of allowable rearrangement operations. They are pure distances because similarities between genomes are ignored.

The breakpoint distance is based on the notion of conserved adjacencies. These are easy to compute, but the breakpoint distance often fails to capture more global relations between genomes.

A first generalization of adjacencies: common intervals, which identify subsets of genes that appear consecutively in two or more genomes.

(Bergeron & Stoye, Report 2003-01, Uni Bielefeld)
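Counting conserved adjacencies is simple enough to sketch directly. The Python fragment below is a minimal illustration, assuming signed linear (not circular) permutations and treating the adjacency (a, b) as identical to (-b, -a), its reading on the opposite strand. The names `adjacencies` and `breakpoint_distance` are hypothetical, and the two permutations are a small toy input chosen only for illustration.

```python
def adjacencies(perm):
    """Orientation-normalised adjacencies of a signed linear permutation."""
    result = set()
    for a, b in zip(perm, perm[1:]):
        # (a, b) and (-b, -a) describe the same adjacency read in
        # opposite directions; keep one canonical representative.
        result.add(min((a, b), (-b, -a)))
    return result


def breakpoint_distance(p, q):
    """Number of adjacencies of p that are lost in q."""
    return len(adjacencies(p) - adjacencies(q))


# Toy input for illustration only.
P = [1, 2, 3, 7, 5, 6, -4, 8]
Q = [1, 7, -3, -2, 5, -6, -4, 8]

conserved = adjacencies(P) & adjacencies(Q)
print(len(conserved), breakpoint_distance(P, Q))
```

For circular genomes one would additionally include the adjacency between the last and the first element; that case is left out of this sketch.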
Permutations, Gene Order, and Rearrangements

Assume that the genes of an organism are ordered and oriented along linear or circular DNA molecules.

E.g. mitochondrial genes in insects: the 38 genes are collapsed into a set of 17 blocks. Genes within one block do not change order between these species.

Distance approaches focus on the difference between 2 particular genomes. E.g. Fruit Fly differs from Mosquito by the reversal of gene 10 and the transposition of genes 7 and 8.
- Count the minimal number of reversals and/or transpositions.
- This yields a distance matrix for the set of species.

(Bergeron & Stoye, Report 2003-01, Uni Bielefeld)

Permutations, Gene Order, and Rearrangements

The breakpoint distance counts the lost adjacencies between genomes.

E.g. given the circularity of the genomes, Fruit Fly and Mosquito have 12 conserved adjacencies and a breakpoint distance of 5.

E.g. the first 4 species of table 1 share 6 adjacencies: [1,2], [2,3], [11,12], [15,16], [16,17], and [17,1]. When comparing all 6 species, [17,1] is the only adjacency left.

(Bergeron & Stoye, Report 2003-01, Uni Bielefeld)

Permutations, Gene Order, and Rearrangements

Observation: the 6 permutations are very "similar". E.g. the genes in the interval [1,12] are all the same, with small variations in their ordering. This is also true for the genes in the intervals [3,6], [6,9], [9,11], and [12,17].

Such intervals, together with conserved adjacencies, play a fundamental role in rearrangement and distance theories, ancestral genome reconstructions, and phylogeny.

[Figure: family portrait of the conserved intervals of the permutations of table 1. The elements that can be glued together to form larger objects are boxed in rectangles.]

(Bergeron & Stoye, Report 2003-01, Uni Bielefeld)

Which arrangements are preferable?
All permutations of table 1 fit the representation with the following conventions:
(1) free objects within a rectangle can be reordered, or can change sign;
(2) connections between rectangles are fixed.

Consider 2 rearrangement scenarios that transform Silkworm into Locust using a minimal number of reversals.

[Figure: two reversal scenarios, left and right.]

The two scenarios are fundamentally different, although both use 6 reversals. The right one uses much longer reversals than the left one, and it breaks conserved intervals between Silkworm and Locust in intermediate permutations, namely [3,6], [1,12], and [12,17]. The right scenario looks highly suspicious.

(Bergeron & Stoye, Report 2003-01, Uni Bielefeld)

Conserved intervals

Definition 1. Let G be a set of signed permutations of n elements. An interval [a,b] is a conserved interval of the set G if:
(1) either a precedes b, or -b precedes -a, in each permutation, and
(2) the set of unsigned elements that appear between a and b is the same for all permutations in G.

Consequently, if [a,b] is a conserved interval, so is [-b,-a].

Consider the 2 permutations

P = 1 2 3 7 5 6 -4 8
Q = 1 7 -3 -2 5 -6 -4 8

Here, [1,5] and [2,3] are conserved intervals, but [1,6] is not. The other conserved intervals of P and Q are [1,-4], [1,8], [5,-4], [5,8], and [-4,8].

[Figure: diagram representation of these intervals as boxes over 1 2 3 7 5 6 -4 8.]

Conserved intervals

When the identity permutation is not in G, it is always possible to rename the elements of G such that conserved intervals become intervals of consecutive elements. E.g. if one composes the permutations P and Q of the example with the inverse permutation P^-1:

P' = P^-1 o P = 1 2 3 4 5 6 7 8
Q' = P^-1 o Q = 1 4 -3 -2 5 -6 7 8

[Figure: diagram of the conserved intervals as boxes over 1 2 3 4 5 6 7 8.]

Proposition 1. Let R be a permutation and G a set of permutations, and denote by R o G the set of permutations obtained by composing each permutation in G with R.
The interval [a,b] is conserved in G if and only if the interval [R(a),R(b)] is conserved in R o G.

Conserved intervals

Proof of Proposition 1: if a permutation P is written as P = p_1 p_2 ... p_n, then R o P is

R o P = R(p_1) R(p_2) ... R(p_n)

If [a,b] is conserved in G, then each permutation in G has a consecutive block of elements beginning with a and ending with b, or beginning with -b and ending with -a. These properties also hold for the set R o G if one replaces a by R(a) and b by R(b).

Some intervals, such as [1,7] for the set {P',Q'} in the above example, are the union of smaller intervals: [1,7] = [1,5] ∪ [5,7]. Intervals that are not unions are especially useful.

Definition 2. Conserved intervals that are not the union of shorter conserved intervals are called irreducible.

Sets of conserved intervals can be characterized by the set of irreducible intervals.

Irreducible conserved intervals

Proposition 2. Two different irreducible conserved intervals [a,b] and [c,d] of a set G of permutations are either
1) disjoint,
2) nested with different endpoints, or
3) overlapping on one element.

Proof. Without loss of generality, we can assume that G contains the identity permutation and that conserved intervals are intervals of consecutive elements.

Suppose that [a,b] and [c,d] are nested with a = c and d < b. Since [c,d] is a conserved interval, it contains all integers between c and d. Hence the interval [d,b] contains all integers between d and b, and [a,b] is not irreducible.

If [a,b] and [c,d] overlap on more than one element, we can suppose a < c < b < d.
Since all elements between c and d are greater than c, the interval between a and c must contain all elements between a and c; thus [a,b] is not irreducible.

Conserved intervals

Overlapping irreducible intervals form chains linked by their successive common elements. A chain of k-1 intervals [a_1,a_2] [a_2,a_3] ... [a_{k-1},a_k] will be denoted simply by its k links [a_1, a_2, a_3, ..., a_k].

E.g. [1,5,7,8] is a chain of the set of conserved intervals of P' and Q'.

A maximal chain is a chain that cannot be extended.

Proposition 3. Every irreducible conserved interval belongs to a unique maximal chain.

Proof. By Proposition 2, if [a,b] is an irreducible conserved interval, then no other irreducible conserved interval can begin with a or end with b.

Maximal chains, as sets of links, together with isolated genes, form a partition of the set of genes.

Conserved intervals

A set of permutations on n elements can have as many as n(n-1)/2 conserved intervals, but at most n-1 irreducible intervals. These bounds are achieved with sets containing only one permutation.

Proposition 4. Each maximal chain of k links contributes k(k-1)/2 to the total number of conserved intervals.

Proof. Conserved intervals [a,b] are in bijection with chains of the form [a, x_1, ..., x_k, b] of irreducible intervals. Each maximal chain of k links has k(k-1)/2 such sub-chains.

Conserved intervals

Proposition 5. Let P be a permutation that is contained in both sets G1 and G2. The interval [a,b] is a conserved interval of G = G1 ∪ G2 if and only if there exist two chains of irreducible conserved intervals, with respect to P, with k ≥ 0, l ≥ 0:

[a, x_1, ..., x_k, b] in G1
[a, y_1, ..., y_l, b] in G2.

The interval [a,b] is irreducible if and only if {x_1, ..., x_k} and {y_1, ..., y_l} are disjoint.

Proof.
The interval [a,b] is a conserved interval of G if and only if it is a conserved interval in both G1 and G2; therefore there must exist chains beginning with a and ending with b for both sets G1 and G2.

If [a,b] is irreducible in G, and if [a,x] and [x,b] are conserved intervals of G1, say, then x cannot belong to the set {y_1, ..., y_l}.

If there is an element x common to both sets {x_1, ..., x_k} and {y_1, ..., y_l}, then [a,b] = [a,x] ∪ [x,b], and both [a,x] and [x,b] are conserved intervals of G.

Variable Geometry Genomes

The definition of conserved intervals can be adapted to types of genomes other than single linear chromosomes.

For circular genomes, one can always align all permutations of the set so that they begin with gene +1.

Multi-chromosomal genomes can also be represented by permutations, with special marks that identify the different chromosomes.

E.g. [figure: two multi-chromosomal genomes], where each chromosome is written on a separate line. Even if the adjacency [5,6] is conserved between the 2 permutations, the first genome does not even have those genes on the same chromosome.

In the case of multi-chromosomal genomes, conserved intervals [a,b] should have the added requirement that a and b belong to the same chromosome in each genome.

Algorithms

Bergeron & Stoye present 3 algorithms:
(1) compute the conserved intervals of two permutations;
(2) compute the conserved intervals of a set of permutations;
(3) compute the conserved intervals of two sets of permutations, directly from their two individual sets of conserved intervals.

Conserved intervals of 2 permutations are strongly related to the notion of connected components of the overlap graph of a signed permutation.

Here: a linear-time algorithm that identifies all irreducible intervals [a,b] of a permutation π with the identity permutation such that a > 0 and b > 0 in π. The case of negative endpoints is treated by reversing π.

E.g.
for the permutation

P = 0 -4 -3 -2 5 8 6 7 9 -1 10

algorithm 1 identifies the positive irreducible intervals [6,7], [5,9], and [0,10]. It will identify [2,3] and [3,4] on the reversed permutation.

Algorithms

The algorithm assumes that the input permutation is of the form

π = (0, π_1, ..., π_{n-1}, n)

M_i: the nearest unsigned element of the permutation that precedes π_i and is greater than |π_i|.

Lemma 1. If [s,e] is a positive conserved interval of π and the identity permutation, then M_s = M_e.

The algorithm uses two stacks: S contains the possible start positions of conserved intervals, M contains possible candidates for M_i. The top of S is always denoted by s. The top of M is always denoted by m.

Proposition 6. Algorithm 1 outputs the positive irreducible conserved intervals of a permutation π with the identity permutation in O(n) time.

Conserved intervals

[Figure: listing of algorithm 1.]

The algorithm runs in linear time.

Similarity and distance

The number of conserved intervals of a set of permutations is a measure of similarity, but it can easily be transformed into a distance between two permutations, or between two sets of permutations.

Definition 3. Let G1 and G2 be two sets of permutations on n elements, with N1 and N2 conserved intervals, respectively. Let N be the number of conserved intervals in G1 ∪ G2. The interval distance between G1 and G2 is then defined by:

d(G1,G2) = N1 + N2 - 2N

The interval distance satisfies the fundamental properties of a mathematical distance, e.g. it fulfils the triangle inequality:

d(P,Q) + d(Q,R) ≥ d(P,R)

Similarity and distance

When comparing two permutations, the interval distance counts the total number of intervals that are unique to one of them. E.g. the distance between

P = 0 1 2 3 4 5 6 7 8 9 10
Q = 0 5 -7 -6 8 9 1 2 3 4 10

is given by

d(P,Q) = (11·10)/2 + (11·10)/2 - 2·11 = 88

since each single permutation of 11 elements has 11·10/2 = 55 conserved intervals and P and Q share 11 of them.

The 2 measures sometimes disagree.
The behavior of the interval distance reflects that the length (number of genes) involved in a rearrangement operation matters: short reversals are less disturbing than long ones.

Comparison with other distance measures

The breakpoint distance also gives different results than the interval distance, while the same results are obtained by transposition + reversal distances.

Similarity and distance

Proposition 7. Suppose that P and Q have n elements. Then
(1) if P is obtained from Q by reversing k elements, then the interval distance between P and Q is k (n - k);
(2) if P is obtained from Q by transposing two consecutive blocks of a and b elements, then the interval distance between P and Q is (a+b)(n - (a+b)) + ab.

Because the interval distance is affected by length, one should question the practice of collapsing identical strips of genes. Why not use all available information?

Link with rearrangement theories

Goal: characterize the rearrangement operations that preserve conserved intervals.

Definition 4. Let P and Q be two permutations, and ρ a rearrangement operation applied to P yielding P'. We say that ρ preserves the conserved intervals of P and Q if the conserved intervals of {P,Q} are contained in those of {P',Q}.

Only rearrangements within blocks are preserving.

Note that all operations, except fusions, destroy some adjacencies that existed in the original permutation: the number and nature of these adjacencies is a key concept.

Definition 5. Let ρ be a rearrangement operation that transforms P into P'. A breakpoint of ρ is a pair of elements that are adjacent in P but not in P'.

Breakpoints are where one has to cut P in order to apply ρ. Reversals and translocations have 2 breakpoints, transpositions have 3, and fissions have 1.
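Definitions 1 to 3 can be cross-checked on the small examples of this lecture with a brute-force Python sketch. It is not the linear-time algorithm 1 of Bergeron & Stoye: it simply tests every pair of positions of the first permutation, which is quadratic in the number of pairs and only meant for toy inputs. The function names (`conserved_intervals`, `irreducible`, `interval_distance`) are hypothetical, and intervals are reported in the orientation in which they appear in the first permutation of the set.

```python
from itertools import combinations


def _holds(q, a, b, inner):
    """Check Definition 1 for the interval [a, b] in one permutation q."""
    pos = {x: k for k, x in enumerate(q)}
    # Either a precedes b, or -b precedes -a.
    for s, e in ((a, b), (-b, -a)):
        if s in pos and e in pos and pos[s] < pos[e]:
            between = {abs(x) for x in q[pos[s] + 1:pos[e]]}
            return between == inner
    return False


def conserved_intervals(perms):
    """Conserved intervals of a list of signed permutations (brute force)."""
    ref = perms[0]
    found = set()
    for i, j in combinations(range(len(ref)), 2):
        a, b = ref[i], ref[j]
        inner = {abs(x) for x in ref[i + 1:j]}
        if all(_holds(q, a, b, inner) for q in perms[1:]):
            found.add((a, b))
    return found


def irreducible(intervals):
    """Intervals that are not the union of two shorter conserved intervals."""
    ends = {x for (_, x) in intervals}
    return {(a, b) for (a, b) in intervals
            if not any((a, x) in intervals and (x, b) in intervals
                       for x in ends)}


def interval_distance(g1, g2):
    """d(G1, G2) = N1 + N2 - 2N as in Definition 3."""
    n1 = len(conserved_intervals(g1))
    n2 = len(conserved_intervals(g2))
    n = len(conserved_intervals(g1 + g2))
    return n1 + n2 - 2 * n


# Example permutations from the conserved-interval slides.
P = [1, 2, 3, 7, 5, 6, -4, 8]
Q = [1, 7, -3, -2, 5, -6, -4, 8]
common = conserved_intervals([P, Q])
print(sorted(common), sorted(irreducible(common)))

# Example permutations from the interval-distance slide.
P2 = list(range(11))
Q2 = [0, 5, -7, -6, 8, 9, 1, 2, 3, 4, 10]
print(interval_distance([P2], [Q2]))
```

On the first pair the sketch finds the 7 conserved intervals listed with Definition 1, of which 4 are irreducible; on the second pair it reproduces the distance 88.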
Link with rearrangement theories

Consider the irreducible intervals of P and P' with respect to P. Adjacencies in P either belong to a (smallest) irreducible interval, or are free.

E.g. in this diagram

[Figure: diagram of irreducible intervals.]

the adjacency (3,4) belongs to the interval [1,5], (2,3) belongs to [2,3], and (8,9) is free.

When 2 adjacencies belong to the same irreducible interval, then none of these adjacencies is conserved between P and P'.

Link with rearrangement theories

Theorem 3. Reversals, transpositions, and reverse transpositions are preserving if and only if all their breakpoints belong to the same irreducible interval, or are free. Translocations and fissions are preserving if and only if all their breakpoints are free.

Proof. If the breakpoints of any operation are free, then no conserved interval is cut. If the breakpoints of a reversal, transposition, or reverse transposition belong to the same irreducible interval, then the operation reorders, or reverses, some blocks within that interval, thus preserving conserved intervals.

If a reversal has its two breakpoints in different intervals, it will break those two intervals. If it has only one free breakpoint, it will break the interval containing the other breakpoint. The same kind of argument holds for transpositions and reverse transpositions.

If a breakpoint of a translocation or fission is not free, then it belongs to an irreducible interval whose extremities will end up in two different chromosomes.

It turns out that most rearrangement operations used in optimal scenarios are indeed preserving.

Link with rearrangement theories

E.g. (without proof)

Theorem 4. All the breakpoints of a cycle belong to the same irreducible interval.

In the sorting by reversals theory, a sorting reversal is defined as a reversal that decreases the reversal distance by 1.
The breakpoints of sorting reversals, except one type called hurdle merging, belong to a single cycle.

Corollary 4. All sorting reversals, except hurdle merging, are preserving.

Corollary 5. All transpositions that create two adjacencies are preserving.

Apply conserved intervals to reconstruct ancestor

[Figures: ancestor reconstruction using conserved intervals.]

(Bergeron et al., WABI 2004, 14-25 (2004))

Summary

Linear-time algorithms have been developed for finding rearrangement scenarios of minimal reversal distance.

Open question: which distance measures (breakpoint distance, reversal distance, interval distance ...) are most appropriate to compare genome architectures?

Experimental evidence provides new insights into which types of rearrangements have likely occurred in the past. The algorithms need to be adapted to this biological reality.

The concept of "conserved intervals" sounds very promising: it can account for arbitrary types of rearrangements.
