Genome Rearrangement
By Ghada Badr
Part II
Genome Models

Genomes can be modeled by permutations: each gene is assigned a unique number and occurs exactly once in the genome.
Signed permutation: each gene is assigned a + or - sign to indicate the strand it resides on.
Unsigned permutation: used when the corresponding strand is unknown.
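As a minimal sketch of this model, a genome can be stored as a list of signed integers. The helper names `is_signed_permutation` and `to_unsigned` are assumptions made for this illustration, not part of the slides.

```python
def is_signed_permutation(genome):
    """True if every gene 1..n occurs exactly once, whatever its sign."""
    genes = sorted(abs(g) for g in genome)
    return genes == list(range(1, len(genome) + 1))


def to_unsigned(genome):
    """Drop the strand information, giving the unsigned permutation."""
    return [abs(g) for g in genome]


print(is_signed_permutation([3, -1, 2]))  # True
print(is_signed_permutation([1, -1, 2]))  # False: gene 1 occurs twice
print(to_unsigned([3, -1, 2]))            # [3, 1, 2]
```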
2
Genome Rearrangement
Our problem:
Given a set of genomes and a set of possible evolutionary events (operations), find a shortest sequence of events transforming those genomes into one another.
What are the rearrangement events (operations)?
3
Rearrangement Operations
Rearrangement operations affect gene order and gene content. There are various types.
For single-chromosome genomes:
• Inversions
• Transpositions
• Reverse transpositions
• Gene duplications
• Gene loss
For multi-chromosome genomes we add:
• Translocations
• Fusions
• Fissions
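To make two of these operations concrete, the sketch below applies an inversion and a transposition to a signed permutation stored as a Python list. The function names and the 0-based, inclusive index convention are assumptions chosen for this example.

```python
def inversion(perm, i, j):
    """Reverse positions i..j (0-based, inclusive) and flip the sign of each gene."""
    return perm[:i] + [-g for g in reversed(perm[i:j + 1])] + perm[j + 1:]


def transposition(perm, i, j, k):
    """Move the block at positions i..j to just before original position k (k > j)."""
    return perm[:i] + perm[j + 1:k] + perm[i:j + 1] + perm[k:]


genome = [1, 2, 3, 4, 5]
print(inversion(genome, 1, 2))         # [1, -3, -2, 4, 5]
print(transposition(genome, 1, 2, 4))  # [1, 4, 2, 3, 5]
```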
4
Rearrangement Problems
Any set of operations yields a distance between genomes: the minimum number of operations needed to transform one genome into the other.
5
Rearrangement Problems
Two classical problems:
• Computing the distance d(.,.)
• Computing one optimal sorting sequence of events
6
Rearrangement Operations
Can we have a unifying framework in which circular and linear chromosomes can coexist throughout evolving genomes?
Can we have a unifying view of genome rearrangements? (Bergeron et al. 2006)
The Double Cut and Join (DCJ) operation was introduced for this purpose.
7
Rearrangement Operations - DCJ
• Double Cut and Join (DCJ) was first proposed by Yancopoulos et al. (2005).
• It models all the classical operations (inversions, translocations, fissions, fusions, transpositions, and block interchanges) with a single operation.
• This general model accounts for the genomic evidence that linear and circular chromosomes coexist in many genomes.
• Both the DCJ sorting and distance problems can be solved in O(n) time (Bergeron et al. 2006).
8
Rearrangement Operations - DCJ
• A gene a is an oriented sequence of DNA that starts with a tail, written at, and ends with a head, written ah.
• Two consecutive genes do not necessarily have the same orientation, so an adjacency of two consecutive genes a and b can be of four different types: {ah,bt}, {ah,bh}, {at,bt}, {at,bh}.
• An extremity that is not adjacent to any other gene is called a telomere and is represented by a singleton set {ah} or {at}.
• Adjacencies and telomeres can represent genomes with a single chromosome as well as genomes with multiple chromosomes.
9
Rearrangement Operations - DCJ
• A genome is a set of adjacencies and telomeres such that the tail or head of any gene appears in exactly one adjacency or telomere.
Example
Genome A: chr1: a c -d
          chr2: b e
          chr3: f g
Replace each gene by its two extremities:
chr1: at ah ct ch dh dt
chr2: bt bh et eh
chr3: ft fh gt gh
Adjacencies: {ah,ct} {ch,dh} {bh,et} {fh,gt}
Telomeres: {at} {dt} {bt} {eh} {ft} {gh}
A = {{ah,ct} {ch,dh} {bh,et} {fh,gt} {at} {dt} {bt} {eh} {ft} {gh}}
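The conversion from gene order to adjacencies and telomeres can be sketched as follows. This is an illustration under stated assumptions: chromosomes are linear, genes are strings with an optional leading "-" for the reverse strand, and the function names are invented for this example.

```python
def extremities(gene):
    """Extremities of a signed gene in reading order: 'a' -> (at, ah), '-d' -> (dh, dt)."""
    name = gene.lstrip("-")
    tail, head = name + "t", name + "h"
    return (head, tail) if gene.startswith("-") else (tail, head)


def genome_to_sets(chromosomes):
    """Return (adjacencies, telomeres) for a list of linear chromosomes."""
    adjacencies, telomeres = set(), set()
    for chrom in chromosomes:
        ends = [extremities(g) for g in chrom]
        telomeres.add(frozenset([ends[0][0]]))    # left end of the chromosome
        telomeres.add(frozenset([ends[-1][1]]))   # right end of the chromosome
        for (_, right), (left, _) in zip(ends, ends[1:]):
            adjacencies.add(frozenset([right, left]))
    return adjacencies, telomeres


genome_a = [["a", "c", "-d"], ["b", "e"], ["f", "g"]]
adjacencies, telomeres = genome_to_sets(genome_a)
print(sorted(sorted(s) for s in adjacencies))
print(sorted(sorted(s) for s in telomeres))
```

For Genome A this yields the four adjacencies and six telomeres listed in the example.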
10
Rearrangement Operations - DCJ
• DCJ operations:
a) {p,q} {r,s} → {p,r} {s,q} or {p,s} {q,r}
11
Rearrangement Operations - DCJ
• DCJ operations:
b) {p,q} {r} → {p,r} {q} or {p} {q,r}
12
Rearrangement Operations - DCJ
• DCJ operations:
c) {q} {r} → {q,r}
13
Rearrangement Operations - DCJ
• DCJ operations, example:
Genome A: chr1: a c -d
          chr2: b e
          chr3: f g
Adjacencies and telomeres are:
{ah,ct} {ch,dh} {bh,et} {fh,gt} {at} {dt} {bt} {eh} {ft} {gh}
First option: {ah,ct} {fh,gt} → {ah,fh} {ct,gt}, which gives
chr1: a -f
chr2: b e
chr3: d -c g
Second option: {ah,ct} {fh,gt} → {ah,gt} {ct,fh}, which gives
chr1: a g
chr2: b e
chr3: f c -d
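The example can be replayed with a small sketch that applies one DCJ to a genome stored as a set of frozensets and then reads the chromosomes back. The names `fs`, `dcj`, and `chromosomes` are assumptions for this illustration. The reader only handles linear chromosomes (circular ones would be skipped), and `dcj` covers operations that replace two elements by two elements, as in cases a) and b).

```python
def fs(*extremities):
    return frozenset(extremities)


def dcj(genome, u, v, new_u, new_v):
    """Cut elements u and v of the genome and rejoin their extremities as new_u and new_v."""
    assert u in genome and v in genome, "u and v must belong to the genome"
    assert u | v == new_u | new_v, "a DCJ only regroups existing extremities"
    return (genome - {u, v}) | {new_u, new_v}


def chromosomes(genome):
    """Read linear chromosomes back from a set of adjacencies and telomeres."""
    partner = {}
    for element in genome:
        if len(element) == 2:
            x, y = element
            partner[x], partner[y] = y, x
    seen, result = set(), []
    for start in sorted(x for element in genome if len(element) == 1 for x in element):
        if start in seen:
            continue
        ext, genes = start, []
        while True:
            gene, side = ext[:-1], ext[-1]
            other = gene + ("h" if side == "t" else "t")
            genes.append(gene if side == "t" else "-" + gene)
            seen.update((ext, other))
            if other not in partner:      # reached the closing telomere
                break
            ext = partner[other]
        result.append(" ".join(genes))
    return result


A = {fs("ah", "ct"), fs("ch", "dh"), fs("bh", "et"), fs("fh", "gt"),
     fs("at"), fs("dt"), fs("bt"), fs("eh"), fs("ft"), fs("gh")}

first = dcj(A, fs("ah", "ct"), fs("fh", "gt"), fs("ah", "fh"), fs("ct", "gt"))
second = dcj(A, fs("ah", "ct"), fs("fh", "gt"), fs("ah", "gt"), fs("ct", "fh"))
print(chromosomes(first))   # ['a -f', 'b e', 'd -c g']
print(chromosomes(second))  # ['a g', 'b e', 'd -c -f']
```

The last chromosome of the second result, `d -c -f`, is the chromosome `f c -d` of the example read from its other end.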
14
DCJ sorting and Distance problems
Problem: Given two genomes A and B defined on the same set of genes, find a shortest sequence of DCJ operations that transforms A into B. The length of such a sequence is called the DCJ distance between A and B, written dcj(A,B).
15
DCJ sorting and Distance problems
Example:
Genome A: chr1: a c -d
          chr2: b e
          chr3: f g
Replace each gene by its two extremities:
chr1: at ah ct ch dh dt
chr2: bt bh et eh
chr3: ft fh gt gh
Genome B: chr1: a b c d
          chr2: e f g
chr1: at ah bt bh ct ch dt dh
chr2: et eh ft fh gt gh
Get the adjacencies and telomeres of each genome:
A = {{ah,ct} {ch,dh} {bh,et} {fh,gt} {at} {dt} {bt} {eh} {ft} {gh}}
B = {{at} {ah,bt} {bh,ct} {ch,dt} {dh} {et} {eh,ft} {fh,gt} {gh}}
16
DCJ sorting and Distance problems
Greedy algorithm to sort by DCJ:
Genome A: chr1: a c -d; chr2: b e; chr3: f g
{ah,ct} {ch,dh} {bh,et} {fh,gt} {at} {dt} {bt} {eh} {ft} {gh}
After step 1: chr1: a b e; chr2: c -d; chr3: f g
{ah,bt} {ch,dh} {bh,et} {fh,gt} {at} {dt} {ct} {eh} {ft} {gh}
After step 2: chr1: a b c -d; chr2: e; chr3: f g
{ah,bt} {ch,dh} {bh,ct} {fh,gt} {at} {dt} {et} {eh} {ft} {gh}
After step 3: chr1: a b c d; chr2: e; chr3: f g
{ah,bt} {ch,dt} {bh,ct} {fh,gt} {at} {dh} {et} {eh} {ft} {gh}
After step 4 (Genome B): chr1: a b c d; chr2: e f g
{at} {ah,bt} {bh,ct} {ch,dt} {dh} {et} {eh,ft} {fh,gt} {gh}
17
DCJ sorting and Distance problems
The greedy algorithm is optimal and runs in O(n) time.
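A greedy procedure of this kind can be sketched as follows, in the spirit of Bergeron et al. (2006): first create every adjacency of B that is missing from the current genome, then split any adjacency that holds a telomere of B. The function name and the dictionary-based bookkeeping are assumptions for this illustration. Recording a snapshot after every step is done only for display and costs extra time beyond the O(n) of the sorting itself.

```python
def fs(*extremities):
    return frozenset(extremities)


def greedy_dcj_sort(A, B):
    """Transform genome A into genome B; return the genome after each DCJ step."""
    where = {x: s for s in A for x in s}   # extremity -> element of the current genome
    steps = []

    def replace_by(*new_elements):
        for s in new_elements:
            for x in s:
                where[x] = s
        steps.append(set(where.values()))  # snapshot, for display only

    for adj in B:                          # first create every adjacency of B
        if len(adj) == 2:
            p, q = adj
            u, v = where[p], where[q]
            if u != v:
                rest = (u - {p}) | (v - {q})
                replace_by(adj, *([rest] if rest else []))
    for tel in B:                          # then free every telomere of B
        if len(tel) == 1:
            (p,) = tel
            u = where[p]
            if len(u) == 2:
                replace_by(tel, u - {p})
    return steps


A = [fs("ah", "ct"), fs("ch", "dh"), fs("bh", "et"), fs("fh", "gt"),
     fs("at"), fs("dt"), fs("bt"), fs("eh"), fs("ft"), fs("gh")]
B = [fs("at"), fs("ah", "bt"), fs("bh", "ct"), fs("ch", "dt"), fs("dh"),
     fs("et"), fs("eh", "ft"), fs("fh", "gt"), fs("gh")]

steps = greedy_dcj_sort(A, B)
print(len(steps))            # 4
print(steps[-1] == set(B))   # True
```

On the example genomes the sketch performs four operations, matching the four steps shown for the greedy algorithm.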
18
DCJ sorting and Distance problems
Adjacency graph (bipartite graph):
A: {ah,ct} {ch,dh} {bh,et} {fh,gt} {at} {dt} {bt} {eh} {ft} {gh}
B: {at} {ah,bt} {bh,ct} {ch,dt} {dh} {et} {eh,ft} {fh,gt} {gh}
Vertices: the adjacencies and telomeres of A and B.
Edges: for each extremity, one edge joins the element of A and the element of B that contain it.
The graph can be constructed in O(n) time and space.
19
DCJ sorting and Distance problems
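Adjacency graph (bipartite graph) when sorted, that is, when A = B:
A: {at} {ah,bt} {bh,ct} {ch,dt} {dh} {et} {eh,ft} {fh,gt} {gh}
B: {at} {ah,bt} {bh,ct} {ch,dt} {dh} {et} {eh,ft} {fh,gt} {gh}
Let C be the number of cycles and I the number of odd paths in the graph.
In each iteration, the algorithm increments C by one or I by two.
When sorted: n = C + I/2, where n is the number of genes.
Hence dcj(A,B) ≤ n.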
20
DCJ sorting and Distance problems
Adjacency graph (bipartite graph):
A: {ah,ct} {ch,dh} {bh,et} {fh,gt} {at} {dt} {bt} {eh} {ft} {gh}
B: {at} {ah,bt} {bh,ct} {ch,dt} {dh} {et} {eh,ft} {fh,gt} {gh}
The graph has 1 cycle, 4 odd paths, and 1 even path.
dcj(A,B) = n - (cycles + odd paths / 2) = 7 - (1 + 4/2) = 4
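The counts above can be checked with a short sketch that builds the adjacency graph and classifies its connected components. The function names are assumptions for this illustration, and both genomes are assumed to share the same gene set.

```python
from collections import defaultdict


def fs(*extremities):
    return frozenset(extremities)


def component_counts(A, B):
    """Count (cycles, odd paths, even paths) in the adjacency graph of A and B."""
    in_b = {x: s for s in B for x in s}
    graph = defaultdict(list)
    for s in A:
        for x in s:                        # one edge per shared extremity
            graph[("A", s)].append(("B", in_b[x]))
            graph[("B", in_b[x])].append(("A", s))
    seen = set()
    cycles = odd = even = 0
    for start in list(graph):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:                       # depth-first walk of one component
            v = stack.pop()
            component.append(v)
            for w in graph[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        edges = sum(len(graph[v]) for v in component) // 2
        if all(len(graph[v]) == 2 for v in component):
            cycles += 1                    # every vertex has degree 2
        elif edges % 2:
            odd += 1
        else:
            even += 1
    return cycles, odd, even


def dcj_distance(A, B, n_genes):
    cycles, odd, _ = component_counts(A, B)
    return n_genes - (cycles + odd // 2)


A = [fs("ah", "ct"), fs("ch", "dh"), fs("bh", "et"), fs("fh", "gt"),
     fs("at"), fs("dt"), fs("bt"), fs("eh"), fs("ft"), fs("gh")]
B = [fs("at"), fs("ah", "bt"), fs("bh", "ct"), fs("ch", "dt"), fs("dh"),
     fs("et"), fs("eh", "ft"), fs("fh", "gt"), fs("gh")]

print(component_counts(A, B))   # (1, 4, 1)
print(dcj_distance(A, B, 7))    # 4
```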
21
Genome Rearrangement and phylogeny

• Genome rearrangement events are rare, so changes in gene order enable biologists to reconstruct evolutionary histories far back in time.
• The notion of genome rearrangement distance extends to the optimal positioning of Steiner points in the space defined by a given distance metric.
• There are two phylogenetic versions of the Steiner problem (the first inside the other):
  • Inner problem: optimizing the internal nodes of a given tree whose n leaves are labeled.
  • Outer problem: optimizing over all trees with n leaves.
22
Genome Rearrangement and phylogeny

• We will discuss the inner problem, defined as follows:
Given a fixed phylogeny (tree) T together with a set of K permutations (genomes), each of size n, corresponding to the terminal (leaf) nodes, find a set of permutations corresponding to the internal nodes such that the total weight w(T) is minimized, where
w(T) = ∑ d(x,y) over all edges (x,y) in T.
Here d(.,.) is the genome rearrangement distance metric defined on pairs of permutations.
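The weight w(T) is simply a sum over tree edges, as the sketch below shows. It is an illustration only: the breakpoint distance is used as a stand-in for d(.,.) because it fits in a few lines, and the node names, the edge-list representation, and the function names are assumptions.

```python
def framed(perm):
    """Add the conventional end markers 0 and n+1 around a signed permutation."""
    return [0] + list(perm) + [len(perm) + 1]


def breakpoint_distance(p, q):
    """Number of adjacencies of signed permutation p that are not conserved in q."""
    conserved = set()
    for x, y in zip(framed(q), framed(q)[1:]):
        conserved.add((x, y))
        conserved.add((-y, -x))        # same adjacency read on the other strand
    return sum((x, y) not in conserved for x, y in zip(framed(p), framed(p)[1:]))


def tree_weight(edges, labels, dist):
    """w(T): sum of dist over all edges (x, y) of the tree."""
    return sum(dist(labels[x], labels[y]) for x, y in edges)


# A star with three leaves x1, x2, x3 and one internal node m.
labels = {"x1": [1, 2, 3], "x2": [1, -2, 3], "x3": [2, 1, 3], "m": [1, 2, 3]}
edges = [("m", "x1"), ("m", "x2"), ("m", "x3")]
print(tree_weight(edges, labels, breakpoint_distance))  # 5
```

Solving the inner problem means choosing the labels of the internal nodes (here only `m`) so that this sum is as small as possible.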
23
Genome Rearrangement and phylogeny

• Consider a heuristic for the problem of computing the internal nodes, based on the case where T is a star with three leaves.
• We will study this more basic problem, the median problem.
• Divide the problem on an arbitrary binary tree into a number of overlapping median problems, and apply the median algorithm iteratively to search for a heuristic solution to the original problem.
• Internal nodes retain biological meaning, and edges represent transitions between genome states.
24
Median Problem

• The median-based method for phylogeny reconstruction was first proposed by Sankoff and Blanchette (1998).
• The idea is to build the global solution by aggregating local solutions of the simplest problem: find a Steiner point M of three genomes.
• After an initialization step, the algorithm iterates over the tree, repeatedly resetting the permutation of each internal node to the median of its three neighbors, and continues until convergence.
25
Median Problem

• The median-of-three problem, or simply the median problem: find a permutation M such that the sum of the distances between M and each of the starting permutations Π = {π1, π2, π3} is minimized.
• Equivalently, find a permutation M that minimizes the median score S(M), where
S(M) = d_{1,M} + d_{2,M} + d_{3,M}
and d_{i,M} is the distance between πi and M.
26
Median Problem
Constructing phylogeny from medians
27
Median Problem

• The median problem: find a permutation M such that the sum of the distances between M and each of the starting permutations Π = {π1, π2, π3} is minimized.
• Which distance measures can we use?
• Distances: breakpoint, reversal, …
• Breakpoint medians have no straightforward biological interpretation, and they are not unique.
• Breakpoint medians score poorly compared to reversal medians.
28
Reversal Median

• Reversal median problem: find a solution to the median problem using the reversal distance.
• That is, find a permutation M such that the sum of the reversal distances between M and each of the starting genomes is minimized.
• The reversal median problem is NP-hard.
• Why?
29
Reversal Median
Reversal graph for n = 3
Vertices: all signed permutations of size n = 3.
Edges: connect π1 and π2 if the reversal distance d(π1, π2) = 1.
30
Reversal Median
Reversal graph for n = 3


• The distance d(πi, πk) is the length of the shortest path between the corresponding vertices vi and vk.
• Finding the median is equivalent to finding a minimum Steiner tree for the three starting vertices in this graph.
31
Reversal Median
Reversal graph for n = 3



• The graph is huge: |V| = n! · 2^n.
• An exhaustive graph search is not feasible.
• What technique can we use to develop an algorithm for this kind of problem?
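For very small n the reversal graph can still be explored exhaustively, which makes the definitions concrete. The sketch below is an illustration, not a practical method: it runs a breadth-first search from each of the three permutations and picks a vertex with the smallest score. Permutations are tuples of signed integers, and all function names are assumptions.

```python
from collections import deque


def neighbors(perm):
    """All signed permutations one reversal away from `perm` (a tuple)."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            yield perm[:i] + tuple(-g for g in reversed(perm[i:j + 1])) + perm[j + 1:]


def distances_from(source):
    """Breadth-first search: reversal distance from `source` to every vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in neighbors(v):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def brute_force_median(p1, p2, p3):
    """Return (median, score) by scanning every vertex of the reversal graph."""
    tables = [distances_from(p) for p in (p1, p2, p3)]

    def score(v):
        return sum(table[v] for table in tables)

    best = min(tables[0], key=score)
    return best, score(best)


print(len(distances_from((1, 2, 3))))                               # 48 = 3! * 2**3
print(brute_force_median((1, 2, 3), (1, 2, 3), (1, -2, 3))[1])      # 1
```

Already for n = 10 the graph has 10! · 2^10, about 3.7 billion, vertices, which is why the scan above does not scale.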
32
Reversal Median

• We will study a branch-and-bound algorithm by Adam Siepel (2001).
• This algorithm depends only on the availability of a rapidly computable distance metric.
33
Reversal Median

• The median score S(M) of a set of equally sized permutations Π = {π1, π2, π3}, separated by distances d_{1,2}, d_{1,3}, and d_{2,3}, obeys these bounds:
\frac{d_{1,2} + d_{1,3} + d_{2,3}}{2} \le S(M) \le \min\{ d_{1,2} + d_{2,3},\ d_{1,2} + d_{1,3},\ d_{2,3} + d_{1,3} \}
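Both bounds follow from the triangle inequality, assuming d is a metric as the slides state for the rearrangement distance. A short derivation, writing d_{i,M} for the distance between π_i and an optimal median M:

```latex
\begin{align*}
d_{1,2} &\le d_{1,M} + d_{2,M}, \qquad
d_{1,3} \le d_{1,M} + d_{3,M}, \qquad
d_{2,3} \le d_{2,M} + d_{3,M} \\
\Rightarrow\quad d_{1,2} + d_{1,3} + d_{2,3}
  &\le 2\,(d_{1,M} + d_{2,M} + d_{3,M}) = 2\,S(M) \\
S(M) &\le d_{1,1} + d_{1,2} + d_{1,3} = d_{1,2} + d_{1,3}
  \qquad \text{(score of } \pi_1 \text{ used as a candidate median)}
\end{align*}
```

The same argument with π_2 or π_3 as the candidate gives the other two sums, hence the minimum over the three.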
34
Reversal Median
• Assume that φ is on a shortest path between π1 and the median M, and is separated from π1, π2, and π3 by distances d_{1,φ}, d_{2,φ}, and d_{3,φ}. Then the median score S(M) obeys:
d_{1,φ} + \frac{d_{2,φ} + d_{3,φ} + d_{2,3}}{2} \le S(M) \le d_{1,φ} + \min\{ d_{2,φ} + d_{3,φ},\ d_{3,φ} + d_{2,3},\ d_{2,3} + d_{2,φ} \}
35
Reversal Median
Algorithm (sketch):
• Establish lower and upper bounds on the median score, Mmin and Mmax, using a rapid reversal distance algorithm.
• Start with one of the three permutations, say π1.
• Assume the median is M = π1.
• Push the corresponding vertex v onto a priority stack s that keeps the best-scoring vertices first.
• While s is not empty:
  • Pop the most promising vertex v from s.
  • If the best score of v ≥ Mmax, then stop.
  • Generate all vertices φ that can be obtained from v by a single reversal.
  • For each unmarked φ:
    • Mark φ and calculate its bounds φmin and φmax from the previous inequality.
    • If φmax = Mmin, then M = φ and stop (the median is found).
    • Add φ to s only if φmin < Mmax (pruning).
    • Update Mmax = φmax if φmax < Mmax.
  • End for loop.
• End while loop.
Worst-case running time: O(n^{3d}) with d = min{d_{1,2} + d_{1,3} + d_{2,3}}, with a faster average running time.
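The search can be sketched in Python as follows. This is a simplified illustration, not Siepel's implementation: the distance oracle is replaced by precomputed breadth-first tables (usable only for tiny n), the upper bound of a vertex is taken to be its own median score, and only neighbors one step farther from π1 are expanded. All function names are assumptions.

```python
import heapq
from collections import deque


def neighbors(perm):
    """All signed permutations one reversal away from `perm` (a tuple)."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            yield perm[:i] + tuple(-g for g in reversed(perm[i:j + 1])) + perm[j + 1:]


def bfs_distances(source):
    """Stand-in distance oracle: reversal distance from `source` to every vertex."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        v = queue.popleft()
        for w in neighbors(v):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def reversal_median(p1, p2, p3):
    """Best-first branch and bound from p1; returns (median, score)."""
    d1, d2, d3 = (bfs_distances(p) for p in (p1, p2, p3))
    d23 = d2[p3]

    def lower(v):   # lower bound for medians reached through v, rounded up
        return d1[v] + (d2[v] + d3[v] + d23 + 1) // 2

    def score(v):   # median score if v itself is taken as the median
        return d1[v] + d2[v] + d3[v]

    m_min = (d1[p2] + d1[p3] + d23 + 1) // 2
    best = min((p1, p2, p3), key=score)
    m_max = score(best)
    heap, marked = [(lower(p1), p1)], {p1}
    while heap and m_max > m_min:
        bound, v = heapq.heappop(heap)
        if bound >= m_max:              # nothing left can improve the best score
            break
        for w in neighbors(v):
            if w in marked or d1[w] != d1[v] + 1:
                continue                # expand only away from p1
            marked.add(w)
            if score(w) < m_max:
                best, m_max = w, score(w)
            if lower(w) < m_max:        # pruning
                heapq.heappush(heap, (lower(w), w))
    return best, m_max


print(reversal_median((1, 2, 3), (-2, 3, 1), (3, -1, -2)))
```

With a fast distance routine in place of `bfs_distances`, only the vertices that survive pruning need to be evaluated, which is the point of the branch-and-bound strategy.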
41
Conclusions

• We described the Double Cut and Join (DCJ) operation, a unifying view of genome rearrangements.
• We presented a branch-and-bound, median-based approach for building phylogenies using the reversal distance.
• Many other problems arise in genome rearrangement, such as the genome halving problem.
42
Genome Halving
[Figure sequence: genome halving illustrated on chromosomes with genes a to g; one panel is labeled "Duplication".]
52
References
1. Bergeron A. A very elementary presentation of the Hannenhalli-Pevzner theory. Discrete Applied Mathematics, vol. 146, 134-145, 2005.
2. Marília D. V. Braga. Exploring the solution space of sorting by reversals when analyzing genome rearrangements. PhD thesis, University of Claude Bernard, 2009.
3. Guillaume Fertin, Anthony Labarre, Irena Rusu, Eric Tannier, Stephane Vialette. Combinatorics of Genome Rearrangements. The MIT Press, Cambridge, MA, 2009.
4. Siepel A. Exact algorithms for the reversal median problem. Master's thesis, University of New Mexico, 2001.
5. Yancopoulos S., Attie O., Friedberg R. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics 21, 3340-3346, 2005.
6. Anne Bergeron, Julia Mixtacki, Jens Stoye. A unifying view of genome rearrangements. WABI 2006, LNBI 4175, 163-173, 2006.
7. Julia Mixtacki. Genome halving under DCJ revisited. Lecture Notes in Computer Science 5092, 2008.
8. Richard C. Deonier, Simon Tavaré, Michael S. Waterman. Computational Genome Analysis: An Introduction. Springer, 2005.
9. Neil C. Jones, Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004.
53