Chair of Computational Biology

V16 – genome rearrangement
Important information – contained in the order in which genes occur on the
genomes of different species – allows inferring phylogenetic relationships.
Together with phylogenetic information, ancestral gene order reconstructions
give some clues about the conservation of the functional organisation of genomes
→ towards a global knowledge of the evolution of life.
Often, phylogeny reconstruction techniques using gene order data rely on the
definition of an evolutionary distance between two gene orders.
These distances are usually computed as the minimal number of
rearrangement operations needed to transform one genome into another one.
Bergeron et al. WABI 2004, 14-25 (2004)
16. Lecture WS 2004/05
Bioinformatics III
Most choices of rearrangement operations quickly lead to hard algorithmic problems.
Therefore, the set of operations is usually restricted to reversals, translocations,
fusions, and fissions, for which linear-time algorithms have been developed in
recent years.
However, this choice of rearrangement operations is dictated more by algorithmic
necessity than by biological reality. E.g., in some genomes, transpositions and
inverted transpositions can be quite common.
A family of phylogenetic approaches labelled „distance-based“ methods relies on
pair-wise evolutionary distances which are then fed into an algorithm such as
neighbor-joining to infer tree topology and branch lengths.
These methods do not provide information about the putative ancestral gene
order.
Bergeron et al. WABI 2004, 14-25 (2004)
Parsimony-based approaches attempt to identify the rearrangement scenario
(including tree topology and gene orders at the internal nodes) that minimizes the
number of evolutionary events required.
→ This problem is computationally much more difficult than just computing distances.
Heuristic algorithms exist that use either breakpoint or reversal distances.
However, these methods only provide us with one (or a small number of) possible
hypotheses about ancestral gene orders, with no information about alternative
optimal or near-optimal solutions.
Today:
- quick look at the reversal distance problem again
- new method „sets of conserved intervals“ (Bergeron & Jens Stoye)
Bergeron et al. WABI 2004, 14-25 (2004)
Breakpoint Graph
The breakpoint graph of a permutation $\pi$ is an edge-colored graph $G(\pi)$ with
$n + 2$ vertices $\{\pi_0, \pi_1, \ldots, \pi_n, \pi_{n+1}\} = \{0, 1, \ldots, n, n+1\}$.
We join vertices $\pi_i$ and $\pi_{i+1}$ by a black edge for $0 \le i \le n$.
We join vertices $\pi_i$ and $\pi_j$ by a gray edge if $\pi_i \sim \pi_j$, i.e. if $|\pi_i - \pi_j| = 1$.
[Figure: the black path and the grey path, each drawn on the vertices 0 2 3 1 4 6 5 7.
Superposition of the black and grey paths forms the breakpoint graph.]
A breakpoint graph is obtained by superposing a black path traversing the
vertices 0, 1, ..., n, n+1 in the order given by the permutation $\pi$ and a gray
path traversing the vertices in the order given by the identity permutation.
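The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `breakpoint_graph` is hypothetical, and the permutation is the unsigned one that appears in the figure (0 2 3 1 4 6 5 7, i.e. 2 3 1 4 6 5 framed by 0 and n+1).

```python
def breakpoint_graph(perm):
    """Return (black_edges, gray_edges) for an unsigned permutation of 1..n."""
    n = len(perm)
    framed = [0] + list(perm) + [n + 1]
    # Black edges join neighbours in the order given by the permutation.
    black = [(framed[i], framed[i + 1]) for i in range(n + 1)]
    # Gray edges join neighbours in the order given by the identity permutation.
    gray = [(v, v + 1) for v in range(n + 1)]
    return black, gray


black, gray = breakpoint_graph([2, 3, 1, 4, 6, 5])
print(black)  # [(0, 2), (2, 3), (3, 1), (1, 4), (4, 6), (6, 5), (5, 7)]
print(gray)   # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
```

Both paths have n + 1 = 7 edges, one for each pair of consecutive vertices.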
Cycle decomposition
A cycle in an edge-colored graph G is called alternating if the colors of every two
consecutive edges of this cycle are distinct. In the following, cycles will mean
alternating cycles.
[Figure: cycle decomposition of the breakpoint graph on the vertices 0 2 3 1 4 6 5 7.]
A vertex v in a graph G is called balanced if the number of black edges incident
to v equals the number of grey edges incident to v.
A balanced graph is a graph in which every vertex is balanced. $G(\pi)$ is a
balanced graph.
Therefore, there exists a cycle decomposition of $G(\pi)$ into edge-disjoint
alternating cycles (every edge in the graph belongs to exactly one cycle in the
decomposition). Cycles in a cycle decomposition may be self-intersecting. The
previous breakpoint graph can be decomposed into 4 cycles, one of which is
self-intersecting.
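The balance property can be checked directly by counting edges per vertex. The sketch below assumes the same example permutation as the figure; `edge_degrees` is a hypothetical helper name, not something from the lecture.

```python
from collections import Counter


def edge_degrees(perm):
    """Map each vertex to (black degree, gray degree) in the breakpoint graph."""
    n = len(perm)
    framed = [0] + list(perm) + [n + 1]
    black, gray = Counter(), Counter()
    for i in range(n + 1):
        # Black edge between neighbours in the permutation.
        black[framed[i]] += 1
        black[framed[i + 1]] += 1
        # Gray edge between the consecutive values i and i + 1.
        gray[i] += 1
        gray[i + 1] += 1
    return {v: (black[v], gray[v]) for v in range(n + 2)}


degrees = edge_degrees([2, 3, 1, 4, 6, 5])
print(degrees)
print(all(b == g for b, g in degrees.values()))  # True: every vertex is balanced
```

The two end vertices 0 and n+1 have one edge of each color, all other vertices have two of each.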
Effects of reversals on cycles
(A) For reversals acting on two cycles, $\Delta(b - c) = 1$.
(B) For reversals acting on an unoriented cycle, $\Delta(b - c) = 0$.
(C) For reversals acting on an oriented cycle, $\Delta(b - c) = -1$.
Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)
Cycle decomposition
What is the decomposition of the breakpoint graph into a maximum number $c(\pi)$
of edge-disjoint alternating cycles? Here, $c(\pi) = 4$.
Cycle decompositions play an important role in estimating reversal distances.
When a reversal is applied to a permutation, the number of cycles in a maximum
decomposition can change by at most one (while the number of breakpoints
can change by two).
Bafna & Pevzner (1996) proved the bound for the reversal distance $d(\pi)$:
$d(\pi) \ge n + 1 - c(\pi)$
which is much tighter than the bound in terms of breakpoints, $d(\pi) \ge b(\pi) / 2$.
For many biological problems, $d(\pi) = n + 1 - c(\pi)$.
Therefore, the reversal distance problem reduces to the problem of finding
the maximal cycle decomposition.
Hurdles, Super-hurdles, fortresses ...
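As a small numeric illustration of the two lower bounds, the sketch below counts the breakpoints of the example permutation 0 2 3 1 4 6 5 7 and evaluates both bounds. The value c = 4 is taken from the slide and is assumed to refer to this permutation; the function names are hypothetical, and the code does not compute the maximum cycle decomposition itself.

```python
import math


def count_breakpoints(perm):
    """Number of neighbouring pairs in the framed permutation that are not consecutive values."""
    n = len(perm)
    framed = [0] + list(perm) + [n + 1]
    return sum(1 for x, y in zip(framed, framed[1:]) if abs(x - y) != 1)


def reversal_lower_bounds(perm, num_cycles):
    """Return (breakpoint bound, cycle bound) on the reversal distance."""
    n = len(perm)
    breakpoint_bound = math.ceil(count_breakpoints(perm) / 2)
    cycle_bound = n + 1 - num_cycles
    return breakpoint_bound, cycle_bound


perm = [2, 3, 1, 4, 6, 5]
print(count_breakpoints(perm))                    # 5
print(reversal_lower_bounds(perm, num_cycles=4))  # (3, 3)
```

For this small example the two bounds happen to coincide; the cycle bound is never weaker, since a reversal changes the number of cycles by at most one but can remove two breakpoints.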
Alternative concept: conserved intervals
Distance matrices can be used as data for phylogenetic reconstruction, or to
reconstruct ancestral genomes.
However, all distances (except for the breakpoint distance) are closely tied to
initial choices of allowable rearrangement operations.
They are pure distances because similarities between genomes are ignored.
Breakpoint distance is based on the notion of conserved adjacencies. These are
easy to compute, but breakpoint distance often fails to capture more global
relations between genomes.
A first generalization of adjacencies: common intervals that identify subsets of
genes that appear consecutively in two or more genomes.
Bergeron & Stoye, Report 2003-01 Uni Bielefeld
Jens Stoye
Permutations, Gene Order, and Rearrangements
Assume that the genes of an organism are ordered and oriented along linear or
circular DNA molecules. E.g. mitochondrial genes in insects
Collapse 38 genes into set of 17 blocks. Genes in one block do not change order
between these species.
Distance approaches: focus on the difference between 2 particular genomes.
E.g. Fruit Fly differs from Mosquito by the reversal of gene 10, and the
transposition of genes 7 and 8.
→ count the minimal number of reversals and/or transpositions
→ distance matrix for the set of species
Bergeron & Stoye, Report 2003-01 Uni Bielefeld
Permutations, Gene Order, and Rearrangements
Breakpoint distance: counts the lost adjacencies between genomes.
E.g. given the circularity of the genomes, Fruit Fly and Mosquito have 12
conserved adjacencies among the 17 blocks, and thus a breakpoint distance of 5.
E.g. the first 4 species of table 1 share 6 adjacencies:
[1,2], [2,3], [11,12], [15,16], [16,17], and [17,1].
When comparing all 6 species, [17,1] is the only remaining adjacency.
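Counting conserved adjacencies between two signed circular genomes can be sketched as follows. Table 1 is not available in this transcript, so the two genomes below are made-up toy data, and the function names are hypothetical. An adjacency (a, b) is treated as the same as (-b, -a), i.e. the same pair read on the opposite strand.

```python
def adjacencies(genome):
    """Signed adjacencies of a circular genome, in both reading directions."""
    adj = set()
    for i, a in enumerate(genome):
        b = genome[(i + 1) % len(genome)]  # wrap around: the genome is circular
        adj.add((a, b))
        adj.add((-b, -a))
    return adj


def conserved_adjacencies(g1, g2):
    """Adjacencies of g1 that also occur in g2."""
    adj2 = adjacencies(g2)
    n = len(g1)
    pairs = [(g1[i], g1[(i + 1) % n]) for i in range(n)]
    return [pair for pair in pairs if pair in adj2]


# Toy genomes (hypothetical): g2 is g1 with the block 3 4 reversed.
g1 = [1, 2, 3, 4, 5]
g2 = [1, 2, -4, -3, 5]
kept = conserved_adjacencies(g1, g2)
print(kept)                  # [(1, 2), (3, 4), (5, 1)]
print(len(g1) - len(kept))   # breakpoint distance: 2
```

With n genes on a circle there are n adjacencies, so the breakpoint distance is n minus the number of conserved ones, which matches the 17 - 12 = 5 arithmetic of the Fruit Fly and Mosquito example.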
Bergeron & Stoye, Report 2003-01 Uni Bielefeld
Permutations, Gene Order, and Rearrangements
Observation: the 6 permutations are very „similar“.
E.g. the genes in the interval [1,12] are all the same, with small variations in their
ordering.
This is also true for the genes in the intervals [3,6], [6,9], [9,11], and [12,17].
Such intervals, together with conserved adjacencies play a fundamental role in
rearrangement and distance theories, ancestral genome reconstructions, and
phylogeny.
Family portrait of the conserved intervals of the permutations of table 1
Here, the elements that can be glued together to form larger objects are boxed
in rectangles.
Bergeron & Stoye, Report 2003-01 Uni Bielefeld
Which arrangements are preferable?
All permutations of table 1 fit the representation with the following conventions:
(1) free objects within a rectangle can be reordered, or can change sign
(2) connections between rectangles are fixed.
Consider 2 rearrangement scenarios that transform Silkworm into Locust using a
minimal number of reversals.
The two scenarios are fundamentally different, although both use 6 reversals.
The right one uses much longer reversals than the left one, and the right one
breaks conserved intervals between Silkworm and Locust in intermediate
permutations, namely [3,6], [1,12], and [12,17].
The right scenario looks highly suspicious.
Bergeron & Stoye, Report 2003-01 Uni Bielefeld
Conserved intervals
Definition 1 Let G be a set of signed permutations of n elements. An interval
[a,b] is a conserved interval of the set G if:
(1) either a precedes b, or –b precedes –a, in each permutation, and
(2) the set of unsigned elements that appear between a and b is the same for
all permutations in G.
→ If [a,b] is a conserved interval, so is [-b,-a].
Consider 2 permutations
P = 1 2 3 7 5 6 -4 8
Q = 1 7 -3 -2 5 -6 -4 8
Here, [1,5] and [2,3] are conserved intervals, but not [1,6].
The other conserved intervals of P and Q are [1,-4], [1,8], [5,-4], [5,8], and [-4,8].
The diagram representation of these intervals is:
[Figure: box diagram of the conserved intervals over the elements 1 2 3 7 5 6 -4 8.]
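Definition 1 can be turned into a direct brute-force check, sketched below on the permutations P and Q of the example. This is a quadratic enumeration written for clarity, not one of the algorithms of Bergeron & Stoye; the function names are hypothetical. Each interval is reported in the orientation in which it occurs in the first permutation, so [a,b] and [-b,-a] are listed only once.

```python
def has_interval(perm, a, b, inner):
    """True if [a,b] (or its mirror [-b,-a]) occurs in perm around the set `inner`."""
    pos = {v: k for k, v in enumerate(perm)}
    for x, y in ((a, b), (-b, -a)):
        if x in pos and y in pos and pos[x] < pos[y]:
            return {abs(v) for v in perm[pos[x] + 1:pos[y]]} == inner
    return False


def conserved_intervals(perms):
    """Brute-force conserved intervals of a list of signed permutations."""
    first = perms[0]
    result = []
    for i in range(len(first)):
        for j in range(i + 1, len(first)):
            a, b = first[i], first[j]
            inner = {abs(v) for v in first[i + 1:j]}  # unsigned elements between a and b
            if all(has_interval(p, a, b, inner) for p in perms[1:]):
                result.append((a, b))
    return result


P = [1, 2, 3, 7, 5, 6, -4, 8]
Q = [1, 7, -3, -2, 5, -6, -4, 8]
print(conserved_intervals([P, Q]))
# [(1, 5), (1, -4), (1, 8), (2, 3), (5, -4), (5, 8), (-4, 8)]
```

The output reproduces the seven intervals listed above, and [1,6] is rejected because 6 carries different signs relative to 1 in P and Q.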
Conserved intervals
When the identity permutation is not in G, it is always possible to rename the
elements of G such that conserved intervals will be intervals of consecutive
elements.
E.g. if one composes the permutations P and Q of the example with the inverse
permutation $P^{-1}$:
$P' = P^{-1} \circ P$ = 1 2 3 4 5 6 7 8
$Q' = P^{-1} \circ Q$ = 1 4 -3 -2 5 -6 7 8
or, as a diagram:
[Figure: diagram representation of the conserved intervals of P' and Q' over the elements 1 2 3 4 5 6 7 8.]
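The renaming step can be sketched as a lookup table built from P: the gene at position i of P is renamed to i, and its negative to -i. The helper name `rename_by` is hypothetical.

```python
def rename_by(reference, perm):
    """Compose perm with the inverse of reference, so that reference becomes the identity."""
    new_name = {}
    for index, gene in enumerate(reference, start=1):
        new_name[gene] = index
        new_name[-gene] = -index  # a sign flip is carried over to the new name
    return [new_name[g] for g in perm]


P = [1, 2, 3, 7, 5, 6, -4, 8]
Q = [1, 7, -3, -2, 5, -6, -4, 8]
print(rename_by(P, P))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(rename_by(P, Q))  # [1, 4, -3, -2, 5, -6, 7, 8]
```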
Proposition 1 Let R be a permutation and G a set of permutations, denote by
R o G the set of permutations obtained by composing each permutation in G with
R. The interval [a,b] is conserved in G if and only if the interval [R(a),R(b)] is
conserved in R o G.
Proof: if a permutation P is written as
P = p_1 p_2 ... p_n
then R o P is:
R o P = R(p_1) R(p_2) ... R(p_n)
If [a,b] is conserved in G, then each permutation in G has a consecutive block of elements
beginning with a and ending with b, or beginning with –b and ending with –a. These
properties also hold for the set R o G, if one replaces a by R(a) and b by R(b).
Some intervals, such as [1,7] for the set {P', Q'} in the above example, are the
union of smaller intervals: $[1,7] = [1,5] \cup [5,7]$. Intervals that are not unions are
especially useful.
Definition 2 Conserved intervals that are not the union of shorter conserved
intervals are called irreducible.
Sets of conserved intervals can be characterized by the set of irreducible intervals.
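A minimal sketch of Definition 2, assuming that "union" means a split of the form [a,x] ∪ [x,b] as in the [1,7] example. The input list is the set of conserved intervals of {P', Q'}, obtained by renaming the intervals of P and Q listed earlier (the element -4 becomes 7); the function name is hypothetical.

```python
def irreducible_intervals(intervals):
    """Keep the intervals that are not of the form [a,x] + [x,b] for conserved [a,x], [x,b]."""
    known = set(intervals)
    endpoints = {x for interval in known for x in interval}

    def is_union(a, b):
        return any((a, x) in known and (x, b) in known for x in endpoints)

    return [(a, b) for a, b in intervals if not is_union(a, b)]


conserved = [(1, 5), (2, 3), (1, 7), (1, 8), (5, 7), (5, 8), (7, 8)]
print(irreducible_intervals(conserved))  # [(1, 5), (2, 3), (5, 7), (7, 8)]
```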
Irreducible conserved intervals
Proposition 2 Two different irreducible conserved intervals [a,b] and [c,d] of a
set G of permutations are either
1) disjoint
2) nested with different endpoints, or
3) overlapping on one element.
Proof. W.l.o.g. we can assume that G contains the identity permutation and that conserved
intervals are intervals of consecutive elements.
Suppose that [a,b] and [c,d] are nested with a = c and d < b. Since [c,d] is a conserved
interval, it contains all integers between c and d. Hence the interval [d,b] contains all integers
between d and b, and [a,b] is not irreducible.
If [a,b] and [c,d] overlap on more than one element, we can suppose
a < c < b < d. Since all elements between c and d are greater than c, the interval
between a and c must contain all elements between a and c, thus [a,b] is not irreducible.
Conserved intervals
Overlapping irreducible intervals form chains linked by their successive common
elements. A chain of k-1 intervals $[a_1,a_2], [a_2,a_3], \ldots, [a_{k-1},a_k]$ will be denoted simply
by its k links $[a_1, a_2, a_3, \ldots, a_k]$.
E.g. [1,5,7,8] is a chain of the set of conserved intervals of P' and Q'.
A maximal chain is a chain that cannot be extended.
Proposition 3. Every irreducible conserved interval belongs to a unique
maximal chain.
Proof: By Prop. 2, if [a,b] is an irreducible conserved interval, then no other
irreducible interval can begin with a or end with b.
Maximal chains, as sets of links, together with isolated genes, form a partition of
the set of genes.
Conserved intervals
A set of permutations on n elements can have as many as n(n-1)/2 conserved
intervals, but at most n-1 irreducible intervals.
These bounds are achieved with sets containing only one permutation.
Proposition 4. Each maximal chain of k links contributes k(k-1)/2 to the total
number of conserved intervals.
Proof. Conserved intervals [a,b] are in bijection with chains of the form
$[a, x_1, \ldots, x_k, b]$
of irreducible intervals. Each maximal chain of k links has k(k-1)/2 such sub-chains.
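Proposition 4 can be checked on the running example. The sketch below links irreducible intervals into maximal chains and sums k(k-1)/2 over the chains. It relies on the fact that no two irreducible intervals start with the same element (Proposition 2), so a plain dictionary from start to end is enough. The irreducible intervals of {P', Q'} are entered by hand, and the function name is hypothetical.

```python
def maximal_chains(irreducible):
    """Group irreducible intervals (a, b) into maximal chains [a_1, ..., a_k]."""
    next_link = dict(irreducible)    # start element -> end element
    ends = set(next_link.values())
    chains = []
    for start in next_link:
        if start in ends:
            continue                 # not the first link of a maximal chain
        chain = [start]
        while chain[-1] in next_link:
            chain.append(next_link[chain[-1]])
        chains.append(chain)
    return chains


chains = maximal_chains([(1, 5), (2, 3), (5, 7), (7, 8)])
print(chains)                                            # [[1, 5, 7, 8], [2, 3]]
print(sum(len(c) * (len(c) - 1) // 2 for c in chains))   # 6 + 1 = 7 conserved intervals

links = {x for chain in chains for x in chain}
print(sorted(set(range(1, 9)) - links))                  # isolated genes: [4, 6]
```

The total of 7 agrees with the number of conserved intervals of P and Q listed with Definition 1, and the chains together with the isolated genes 4 and 6 partition the genes 1 to 8.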
Conserved intervals
Proposition 5 Let P be a permutation that is contained in both sets $G_1$ and $G_2$.
The interval [a,b] is a conserved interval of $G = G_1 \cup G_2$ if and only if there exist
two chains of irreducible conserved intervals, with respect to P, with $k \ge 0$, $l \ge 0$:
$[a, x_1, \ldots, x_k, b]$ in $G_1$
$[a, y_1, \ldots, y_l, b]$ in $G_2$.
The interval [a,b] is irreducible if and only if $\{x_1, \ldots, x_k\}$ and $\{y_1, \ldots, y_l\}$ are disjoint.
Proof. The interval [a,b] is a conserved interval of G if and only if it is a conserved interval
in both $G_1$ and $G_2$, therefore there must exist chains beginning with a and ending with b for
both sets $G_1$ and $G_2$. If [a,b] is irreducible in G, and if [a,x] and [x,b] are conserved intervals
of $G_1$, say, then x cannot belong to the set $\{y_1, \ldots, y_l\}$. If there is an element x common to
both sets $\{x_1, \ldots, x_k\}$ and $\{y_1, \ldots, y_l\}$, then $[a,b] = [a,x] \cup [x,b]$ and both [a,x] and [x,b] are
conserved intervals of G.
Variable Geometry Genomes
The definition of conserved intervals can be adapted to other types of genomes
than single linear chromosomes. For circular genomes, one can always align all
permutations of the set beginning with gene +1.
Multi-chromosomal genomes can also be represented by permutations, with
special marks that identify different chromosomes.
E.g. [Figure: two multi-chromosomal genomes written as permutations],
where each chromosome is on a separate line.
Even if the adjacency [5,6] is conserved between the 2 permutations, the first
genome does not even have those genes on the same chromosome.
In the case of multi-chromosomal genomes, conserved intervals [a,b] should
have the added requirement that a and b belong to the same chromosome, in
each genome.
Algorithms
Bergeron & Stoye present 3 algorithms:
(1) compute the conserved intervals of two permutations
(2) compute the conserved intervals of a set of permutations
(3) compute conserved intervals of two sets of permutations, directly from their
two individual sets of conserved intervals.
Conserved Intervals of 2 permutations are strongly related to the notion of
connected components of the overlap graph of a signed permutation.
Here: linear algorithm that identifies all irreducible intervals [a,b] of a permutation
$\pi$ with the identity permutation such that a > 0 and b > 0 in $\pi$.
The case of negative endpoints is treated by reversing $\pi$.
E.g. for the permutation
P = 0 -4 -3 -2 5 8 6 7 9 -1 10
algorithm 1 identifies the positive irreducible intervals [6,7], [5,9], and [0,10].
It will identify [2,3] and [3,4] on the reversed permutation.
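The expected output of this example can be reproduced with a brute-force check. The sketch below is a quadratic stand-in written for illustration; it is not the linear-time Algorithm 1, whose details are not given in this transcript. The function names are hypothetical, and the frame element 0 is treated as positive.

```python
def positive_conserved(perm):
    """Conserved intervals [a,b] of perm and the identity with a and b positive in perm."""
    found = []
    for i, a in enumerate(perm):
        if a < 0:
            continue
        for j in range(i + 1, len(perm)):
            b = perm[j]
            if b <= a:
                continue  # b must be positive and larger than a
            between = {abs(v) for v in perm[i + 1:j]}
            if between == set(range(a + 1, b)):  # same elements as in the identity
                found.append((a, b))
    return found


def positive_irreducible(perm):
    """Positive conserved intervals that are not a union [a,x] + [x,b]."""
    found = set(positive_conserved(perm))
    return sorted(
        (a, b) for a, b in found
        if not any((a, x) in found and (x, b) in found for x in range(a + 1, b))
    )


P = [0, -4, -3, -2, 5, 8, 6, 7, 9, -1, 10]
print(positive_irreducible(P))           # [(0, 10), (5, 9), (6, 7)]

reversed_P = [-v for v in reversed(P)]   # reverse the order and flip all signs
print(positive_irreducible(reversed_P))  # [(2, 3), (3, 4)]
```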
Algorithms
The algorithm assumes that the input permutation is in the form
$\pi = (0, \pi_1, \ldots, \pi_{n-1}, n)$
$M_i$: nearest unsigned element of the permutation that precedes $\pi_i$ and is greater
than $|\pi_i|$.
Lemma 1 If [s,e] is a positive conserved interval of $\pi$ and the identity
permutation, then $M_s = M_e$.
Algorithm uses two stacks: S contains the possible start positions of conserved
intervals, M contains possible candidates for $M_i$.
The top of S is always denoted by s. The top of M is always denoted by m.
Proposition 6 Algorithm 1 outputs the positive irreducible conserved intervals of
a permutation $\pi$ with the identity permutation in O(n) time.
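The values $M_i$ can be computed in one pass with a stack of candidates, which is presumably the role of the stack M. The sketch below covers only this part, not the full Algorithm 1. It returns None where no larger element precedes; that default is an assumption of this sketch, since the slides do not say which value is used in that case. The function name is hypothetical.

```python
def nearest_greater_before(perm):
    """For each position, the nearest preceding unsigned element that is larger."""
    result = []
    stack = []  # candidate values, strictly decreasing from bottom to top
    for value in perm:
        size = abs(value)
        # Smaller candidates can never be the answer for a later position.
        while stack and stack[-1] < size:
            stack.pop()
        result.append(stack[-1] if stack else None)
        stack.append(size)
    return result


P = [0, -4, -3, -2, 5, 8, 6, 7, 9, -1, 10]
M = nearest_greater_before(P)
print(M)  # [None, None, 4, 3, None, None, 8, 8, None, 9, None]

# Lemma 1 on the positive intervals [6,7], [5,9] and [0,10] of this permutation:
for s, e in [(6, 7), (5, 9), (0, 10)]:
    print((s, e), M[P.index(s)] == M[P.index(e)])  # True in all three cases
```

Each element is pushed and popped at most once, so the pass takes linear time.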
Conserved intervals
Algorithm runs in linear time.
Similarity and distance
The number of conserved intervals of a set of permutations is a measure of
similarity, but can easily be transformed into a distance between two
permutations, or two sets of permutations.
Definition 3 Let $G_1$ and $G_2$ be two sets of permutations on n elements, with $N_1$ and $N_2$
conserved intervals respectively. Let N be the number of conserved intervals in $G_1 \cup G_2$.
The interval distance between $G_1$ and $G_2$ is then defined by:
$d(G_1, G_2) = N_1 + N_2 - 2N$
The interval distance satisfies the fundamental properties of a mathematical
distance, e.g. it fulfils the triangle inequality:
$d(P,Q) + d(Q,R) \ge d(P,R)$
Similarity and distance
When comparing two permutations, the interval distance counts the total number
of intervals that are unique to one of them. E.g. the distance between
P = 0 1 2 3 4 5 6 7 8 9 10
Q = 0 5 -7 -6 8 9 1 2 3 4 10
is given by $d(P,Q) = \frac{11 \cdot 10}{2} + \frac{11 \cdot 10}{2} - 2 \cdot 11 = 88$
The 2 measures sometimes disagree. The behavior of the interval distance
reflects that the length (number of genes) involved in a rearrangement operation
matters: short reversals are less disturbing than long ones.
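The value 88 can be reproduced with a brute-force count of conserved intervals. The sketch below is a quadratic enumeration for illustration only, with hypothetical function names; the sets $G_1$ and $G_2$ of Definition 3 are represented as Python lists of permutations.

```python
def has_interval(perm, a, b, inner):
    """True if [a,b] (or its mirror [-b,-a]) occurs in perm around the set `inner`."""
    pos = {v: k for k, v in enumerate(perm)}
    for x, y in ((a, b), (-b, -a)):
        if x in pos and y in pos and pos[x] < pos[y]:
            return {abs(v) for v in perm[pos[x] + 1:pos[y]]} == inner
    return False


def count_conserved(perms):
    """Number of conserved intervals of a list of signed permutations."""
    first, count = perms[0], 0
    for i in range(len(first)):
        for j in range(i + 1, len(first)):
            inner = {abs(v) for v in first[i + 1:j]}
            if all(has_interval(p, first[i], first[j], inner) for p in perms[1:]):
                count += 1
    return count


def interval_distance(g1, g2):
    """d(G1, G2) = N1 + N2 - 2N, with N counted on the union of both sets."""
    return count_conserved(g1) + count_conserved(g2) - 2 * count_conserved(g1 + g2)


P = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Q = [0, 5, -7, -6, 8, 9, 1, 2, 3, 4, 10]
print(count_conserved([P]), count_conserved([P, Q]))  # 55 11
print(interval_distance([P], [Q]))                    # 88
```

A single permutation of 11 elements has 11·10/2 = 55 conserved intervals, and only 11 of them survive in Q.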
Comparison with other distance measures
Breakpoint distance also gives different results than interval distance,
while the same results are obtained with transposition + reversal distances.
Similarity and distance
Proposition 7 Suppose that P and Q have n elements, then
(1) if P is obtained from Q by reversing k elements, then k (n – k) conserved
intervals are lost, and the interval distance between P and Q is 2 k (n – k);
(2) if P is obtained from Q by transposing two consecutive blocks of a and b
elements, then (a+b)(n – (a+b)) + ab conserved intervals are lost, and the
interval distance between P and Q is 2 [(a+b)(n – (a+b)) + ab].
Because the interval distance is affected by length, one should question the
practice of collapsing identical strips of genes.
Why not use all available information?
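The length dependence in Proposition 7 can be checked numerically. The sketch below counts, by brute force, how many conserved intervals are lost by one reversal and by one transposition on 11 elements. Under Definition 3 each lost interval is counted once for P and once for Q, so the distance N1 + N2 - 2N comes out as twice the number of lost intervals. The permutations and function names are illustrative assumptions.

```python
def has_interval(perm, a, b, inner):
    """True if [a,b] (or its mirror [-b,-a]) occurs in perm around the set `inner`."""
    pos = {v: k for k, v in enumerate(perm)}
    for x, y in ((a, b), (-b, -a)):
        if x in pos and y in pos and pos[x] < pos[y]:
            return {abs(v) for v in perm[pos[x] + 1:pos[y]]} == inner
    return False


def count_conserved(perms):
    """Number of conserved intervals of a list of signed permutations."""
    first, count = perms[0], 0
    for i in range(len(first)):
        for j in range(i + 1, len(first)):
            inner = {abs(v) for v in first[i + 1:j]}
            if all(has_interval(p, first[i], first[j], inner) for p in perms[1:]):
                count += 1
    return count


n = 11
P = list(range(n))
total = count_conserved([P])                         # 55 intervals in one permutation

reversal = [0, 1, 2, -5, -4, -3, 6, 7, 8, 9, 10]     # k = 3 elements reversed
lost_rev = total - count_conserved([P, reversal])
print(lost_rev, 3 * (n - 3))                         # 24 24
print(2 * lost_rev)                                  # distance by Definition 3: 48

transposition = [0, 3, 4, 5, 1, 2, 6, 7, 8, 9, 10]   # blocks of a = 2 and b = 3 swapped
lost_tr = total - count_conserved([P, transposition])
print(lost_tr, (2 + 3) * (n - (2 + 3)) + 2 * 3)      # 36 36
print(2 * lost_tr)                                   # distance by Definition 3: 72
```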
Link with rearrangement theories
Characterize the rearrangement operations that preserve conserved intervals.
Definition 4. Let P and Q be two permutations, and $\rho$ a rearrangement operation
applied to P yielding P'. We say that $\rho$ preserves the conserved intervals of P and
Q if the conserved intervals of {P,Q} are contained in those of {P',Q}.
Only rearrangements within blocks are preserving. Note that all operations, except
fusions, destroy some adjacencies that existed in the original permutation: the
number and nature of these adjacencies is a key concept.
Definition 5. Let $\rho$ be a rearrangement operation that transforms P into P'.
A breakpoint of $\rho$ is a pair of elements that are adjacent in P but not in P'.
Breakpoints are where one has to cut P in order to apply $\rho$.
Reversals and translocations have 2 breakpoints, transpositions have 3, and
fissions have 1.
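Definition 5 can be illustrated on a single linear chromosome by comparing adjacencies before and after an operation. The two example operations below are made up for illustration and the function names are hypothetical; an adjacency (x, y) is treated as equal to (-y, -x).

```python
def adjacency_set(perm):
    """Signed adjacencies of a linear permutation, in both reading directions."""
    adj = set()
    for x, y in zip(perm, perm[1:]):
        adj.add((x, y))
        adj.add((-y, -x))
    return adj


def operation_breakpoints(before, after):
    """Pairs that are adjacent in `before` but no longer adjacent in `after`."""
    kept = adjacency_set(after)
    return [(x, y) for x, y in zip(before, before[1:]) if (x, y) not in kept]


P = [0, 1, 2, 3, 4, 5]
after_reversal = [0, 1, -3, -2, 4, 5]       # reversal of the block 2 3
after_transposition = [0, 3, 4, 1, 2, 5]    # blocks 1 2 and 3 4 exchanged

print(operation_breakpoints(P, after_reversal))       # [(1, 2), (3, 4)]
print(operation_breakpoints(P, after_transposition))  # [(0, 1), (2, 3), (4, 5)]
```

The reversal cuts P in 2 places and the transposition in 3, in line with the counts stated above.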
Link with rearrangement theories
Consider the irreducible intervals of P and P‘ with respect to P.
Adjacencies in P either belong to a (smallest) irreducible interval, or are free.
E.g. in this diagram
the adjacency (3,4) belongs to the interval [1,5], (2,3) belongs to [2,3], and (8,9)
is free.
When 2 adjacencies belong to the same irreducible interval, then none of these
adjacencies is conserved between P and P‘.
Link with rearrangement theories
Theorem 3. Reversals, transpositions, and reverse transpositions are preserving
if and only if all their breakpoints belong to the same irreducible interval, or are
free. Translocations and fissions are preserving if and only if all their breakpoints
are free.
Proof. If the breakpoints of any operation are free, then no conserved interval is cut.
If the breakpoints of a reversal, transposition, or reverse transposition belong to the same
irreducible interval, then the operation reorders, or reverses, some blocks within that
interval, thus preserving conserved intervals.
If a reversal has its two breakpoints in different intervals, it will break those two intervals. If
it has only one free breakpoint, it will break the interval containing the other breakpoint.
The same kind of arguments hold for transpositions and reverse transpositions.
If a breakpoint of a translocation or fission is not free, then it belongs to an irreducible
interval whose extremities will end up in two different chromosomes.
It turns out that most rearrangement operations used in optimal scenarios are
indeed preserving.
Link with rearrangement theories
E.g. (without proof)
Theorem 4. All the breakpoints of a cycle belong to the same irreducible interval.
In the sorting by reversals theory, a sorting reversal is defined as a reversal that
decreases the reversal distance by 1. The breakpoints of sorting reversals,
except one type called hurdle merging, belong to a single cycle.
Corollary 4. All sorting reversals, except hurdle merging, are preserving
Corollary 5. All transpositions that create two adjacencies are preserving.
Apply conserved intervals to reconstruct ancestor
Bergeron et al. WABI 2004, 14-25 (2004)
Summary
Linear-time algorithms have been developed for rearrangement scenarios of
minimal reversal distance.
Open question: which distance measures (breakpoint distance, reversal distance,
interval distance ...) are most appropriate for comparing genome architectures?
Experimental evidence provides new insights into which types of rearrangements
have likely occurred in the past → need to adapt algorithms to the biological
reality.
The concept of „conserved intervals“ sounds very promising: it can account for
arbitrary types of rearrangements.