Genome Rearrangements
Basic Biology: DNA
• Genetic information is stored in deoxyribonucleic acid (DNA) molecules.
• A single DNA molecule is a sequence of nucleotides
– adenine (A)
– cytosine (C)
– guanine (G)
– thymine (T)
[Figure: a nucleotide (phosphate, pentose sugar, nitrogenous base) and a DNA molecule]
Basic Biology: DNA
• Paired DNA strands are in reverse complementary orientation.
– One in forward, 5’ to 3’ direction
– The other in reverse, 3’ to 5’ direction
• Both strands are complementary.
– A pairs with a T
– G pairs with a C
[Figure: forward strand (5’ to 3’) paired with the reverse strand (3’ to 5’)]
Image modified with the permission of the
National Human Genome Research Institute
(NHGRI), artist Darryl Leja.
Basic Biology: Genome
• The genome is the
entire hereditary
information of an
organism.
• Genomes are
partitioned into
chromosomes.
• A chromosome can be
linear (eukaryotes), or
circular (prokaryotes).
Image modified with the permission of the
National Human Genome Research Institute
(NHGRI), artist Darryl Leja.
The Human Karyogram
Karyotype of a human male.
Courtesy: National Human Genome Research Institute
Changes in Genomic Sequences
• Genomes of different species (even of closely
related individuals) differ from one another.
• These differences are caused by
– point mutations, in which only one nucleotide is
changed, and
– genome rearrangements, where multiple
nucleotides are modified.
Point Mutations
• Insertion: …ATGGCG… → …ATGTGCG…
• Deletion: …ATGTGCG… → …ATGGCG…
• Substitution: …ATGTGCG… → …ATGCGCG…
…ATG-GCATGTGCGATGTGCG…
…ATGTGCATG-GCGATGCGCG…
DNA sequence alignment showing matches, mismatches, and insertions/deletions
Genome Rearrangements
• Reversal
1 2 3 4 5 6 7 8 9 → 1 2 3 6 5 4 7 8 9
• Translocation
1 2 3 4 5 6 7 8 9 and 10 11 12 13 14 15 → 1 2 3 4 13 14 15 and 10 11 12 5 6 7 8 9
• Fission
1 2 3 4 5 6 7 8 9 → 1 2 3 4 and 5 6 7 8 9
• Fusion
1 2 3 4 and 5 6 7 8 9 → 1 2 3 4 5 6 7 8 9
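The four rearrangement types can be mimicked on plain Python lists. This is a minimal sketch, assuming each chromosome is a list of block labels; the function names and the cut-position parameters are illustrative and do not come from the slides.

```python
def reversal(chrom, i, j):
    """Reverse the blocks at positions i..j (1-based, inclusive)."""
    return chrom[:i - 1] + chrom[i - 1:j][::-1] + chrom[j:]

def translocation(c1, c2, cut1, cut2):
    """Exchange the tails of two chromosomes.

    cut1 / cut2 are the numbers of blocks kept at the front of c1 / c2.
    """
    return c1[:cut1] + c2[cut2:], c2[:cut2] + c1[cut1:]

def fission(chrom, cut):
    """Split one chromosome into two after `cut` blocks."""
    return chrom[:cut], chrom[cut:]

def fusion(c1, c2):
    """Join two chromosomes end to end."""
    return c1 + c2

x = list(range(1, 10))    # 1 .. 9
y = list(range(10, 16))   # 10 .. 15
print(reversal(x, 4, 6))
print(translocation(x, y, 4, 3))
print(fission(x, 4))
print(fusion([1, 2, 3, 4], [5, 6, 7, 8, 9]))
```

Each call reproduces one of the four examples listed on the slide.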
Levenshtein’s Edit Distance
• Let A and B be two sequences (genomes). The
minimum number of edit operations that
transforms A into B defines the edit distance,
d_edit, between A and B.
• Possible edit operations:
– point mutations
– genome rearrangements
A Word Puzzle
• To transform a start word into a target word,
change, add, or delete characters until the target
is reached.
• Example: start “spices” target “lice”:
• spices → slices → slice → lice
• spices → spice → slice → lice
• How many steps do you need to transform
– a republican into a democrat?
– Google into Yahoo?
Edit Distance Using Point Mutations
S1=AGCTT, S2=AGCCTG, S3=ACAG
• S1 → S2: AGCTT → AGCTG (T→G) → AGCCTG (insert C)
⇒ d_edit(S1,S2) = 2
• S1 → S3: AGCTT → AGCTG (T→G) → AGCAG (T→A) → ACAG (delete G)
⇒ d_edit(S1,S3) = 3
• S2 → S3: AGCCTG → AGCTG (delete C) → AGCAG (T→A) → ACAG (delete G)
⇒ d_edit(S2,S3) = 3
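The edit distance under point mutations (insertion, deletion, substitution) can be computed with the standard dynamic-programming recurrence. The sketch below is one common way to write it; the function name `edit_distance` is our own choice, and every operation is assumed to cost 1.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]

print(edit_distance("AGCTT", "AGCCTG"))  # 2
print(edit_distance("spices", "lice"))   # 3
```

Only two rows of the table are kept in memory, which is enough when the distance itself is needed and not the sequence of edits.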
Edit Distance and Evolution
• The edit distance is often used to infer evolutionary relationships.
• Parsimony assumption: the minimum number of changes reflects the true
evolutionary distance
Parsimonious phylogeny inferred from edit distances
Rearrangements and Anagrams
• An anagram is a rearrangement of a word or
phrase into another word or phrase.
• eleven plus two → twelve plus one
• forty five → over fifty
Please visit the Internet Anagram web server at
http://wordsmith.org/anagram/.
Rearrangements and Anagrams
Dot plot: “spendit” vs. “stipend”
Dot plot: Mouse genome vs. Human genome
Genome Comparison: Human - Mouse
• Humans and mice have
similar genomes, but
their genes are in a
different order.
• How many edits
(rearrangements) are
needed to transform
human into mouse?
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
Transforming Mice into Humans
a) Mouse and
human share a
common ancestor
b) They share the
same genes, but in a
different order
c) A series of
rearrangements transforms
one genome into the other
History of Chromosome X
Rat Consortium, Nature, 2004
Dobzhansky’s Experiment
Giant polytene
chromosomes
Modified from T.S.
Painter, J. Hered.
25:465–476, 1934.
Drosophila melanogaster life cycle
taken from FlyMove
Harvesting polytene chromosomes
taken from BioPix4U
Dobzhansky’s Experiment
Chromosome 3 of Drosophila pseudoobscura
Standard and Arrowhead arrangements differ by an inversion from segments 70 to 76
Figures taken from Dobzhansky T, Sturtevant AH. Genetics (1938), 23(1):28-64.
Dobzhansky’s Experiment
Configurations observed in various inversion heterozygotes
Figures taken from Dobzhansky T, Sturtevant AH. Genetics (1938), 23(1):28-64.
Dobzhansky’s Experiment
Single and Double Inversions
Phylogeny for 3rd chromosome of D. pseudoobscura
Figures taken from Dobzhansky T, Sturtevant AH. Genetics (1938), 23(1):28-64.
Unsigned Reversals
[Figure: a chromosome with blocks numbered 1 to 10]
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
Unsigned Reversals
[Figure: the same chromosome after reversing the segment that contains blocks 4 to 8]
1, 2, 3, 8, 7, 6, 5, 4, 9, 10
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
Unsigned Reversals and Gene Orders
π1 = 5 1 4 3 2 6 7 8 9 10
r(1,2)
π2 = 1 5 4 3 2 6 7 8 9 10
r(2,5)
π3 = 1 2 3 4 5 6 7 8 9 10
Reversal Edit Distance
• Goal: Given two permutations, find the shortest series
of reversals that transforms one into the other
• Input: Permutations π and σ
• Output: A series of reversals r1,…,rt transforming π into
σ, such that t is minimum
• t: number of reversals in the series
• d_rev(π, σ): reversal distance between π and σ, the smallest possible value of t
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
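For very short permutations the reversal distance can be found by exhaustive search. The following is a brute-force sketch (breadth-first search over all permutations reachable by reversals), not the method used by dedicated tools; the name `reversal_distance` and its signature are assumptions made for illustration, and the running time grows factorially with the permutation length.

```python
from collections import deque
from itertools import combinations

def reversal_distance(p, s=None):
    """Smallest number of unsigned reversals turning p into s (default: sorted p)."""
    start = tuple(p)
    target = tuple(sorted(p)) if s is None else tuple(s)
    if start == target:
        return 0
    n = len(start)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        perm, dist = queue.popleft()
        # try every reversal of a segment perm[i..j] with i < j
        for i, j in combinations(range(n), 2):
            nxt = perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("s is not a rearrangement of p")

print(reversal_distance([5, 1, 4, 3, 2]))  # 2
```

Because breadth-first search visits permutations in order of increasing number of reversals, the first time the target is reached gives the minimum.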
Sorting by Reversals Problem
• Goal: Given a permutation, find a shortest series of
reversals that transforms it into the identity
permutation (1 2 … n )
• Input: Permutation π
• Output: A series of reversals r1, …, rt transforming π
into the identity permutation such that t is minimum
• Reversal Distance Problem and Sorting by Reversals
Problem are equivalent. Why?
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
Algorithm 1: GreedyReversalSort(π)
for i ← 1 to n – 1
  j ← position of element i in π (i.e. π[j]=i)
  if j≠i
    π ← π • r(i, j)
    output π
  if π is the identity permutation
    return
Taken from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
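Algorithm 1 translates almost line by line into Python. The sketch below assumes the permutation is a list containing 1..n and uses 1-based positions for r(i, j) as on the slides; returning the list of steps instead of printing them is our own choice.

```python
def greedy_reversal_sort(pi):
    """Sort pi by moving element i to position i, for i = 1 .. n-1.

    Returns a list of (i, j, permutation after r(i, j)) tuples.
    """
    pi = list(pi)
    n = len(pi)
    identity = list(range(1, n + 1))
    steps = []
    for i in range(1, n):
        j = pi.index(i) + 1                   # 1-based position of element i
        if j != i:
            pi[i - 1:j] = pi[i - 1:j][::-1]   # apply r(i, j)
            steps.append((i, j, list(pi)))
        if pi == identity:
            break
    return steps

for i, j, perm in greedy_reversal_sort([6, 1, 2, 3, 4, 5]):
    print(f"r({i},{j}):", *perm)
```

For the input 6 1 2 3 4 5 the function performs five reversals, r(1,2) through r(5,6).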
GreedyReversalSort is Not Optimal
• For π = 6 1 2 3 4 5 the algorithm needs 5 steps:
• Step 0: 6 1 2 3 4 5
• Step 1: 1 6 2 3 4 5 (i=1; j=2; r(1,2))
• Step 2: 1 2 6 3 4 5 (i=2; j=3; r(2,3))
• Step 3: 1 2 3 6 4 5 (i=3; j=4; r(3,4))
• Step 4: 1 2 3 4 6 5 (i=4; j=5; r(4,5))
• Step 5: 1 2 3 4 5 6 (i=5; j=6; r(5,6))
• However, two reversals are enough:
• Step 0: 6 1 2 3 4 5
• Step 1: 6 5 4 3 2 1
• Step 2: 1 2 3 4 5 6
Adjacencies & Breakpoints
• An adjacency is a pair of neighboring elements that are consecutive numbers (in either order)
• A breakpoint is a pair of neighboring elements that are not consecutive
• b(π) is the number of breakpoints in π
π = 5 6 2 1 3 4
Extend π with π0 = 0 and π7 = 7: 0 5 6 2 1 3 4 7
adjacencies: (5,6), (2,1), (3,4)
breakpoints: (0,5), (6,2), (1,3), (4,7), so b(π) = 4
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
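Counting breakpoints is a single pass over the extended permutation. A minimal sketch, assuming the permutation is given without the framing elements 0 and n+1 (the helper names are illustrative):

```python
def extend(pi):
    """Frame the permutation with 0 in front and n+1 at the end."""
    return [0] + list(pi) + [len(pi) + 1]

def adjacencies(pi):
    """Neighboring pairs whose values differ by exactly 1."""
    ext = extend(pi)
    return [(a, b) for a, b in zip(ext, ext[1:]) if abs(a - b) == 1]

def breakpoints(pi):
    """Neighboring pairs whose values do not differ by exactly 1."""
    ext = extend(pi)
    return [(a, b) for a, b in zip(ext, ext[1:]) if abs(a - b) != 1]

pi = [5, 6, 2, 1, 3, 4]
print(adjacencies(pi))       # [(5, 6), (2, 1), (3, 4)]
print(len(breakpoints(pi)))  # 4
```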
Reversal Distance and Breakpoints
• One reversal eliminates at most 2 breakpoints.
π = 0 2 3 1 4 6 5 7, b(π) = 5
π1 = 0 1 3 2 4 6 5 7, b(π1) = 4
π2 = 0 1 2 3 4 6 5 7, b(π2) = 2
π3 = 0 1 2 3 4 5 6 7, b(π3) = 0
• This implies: reversal distance ≥ b(π) / 2
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
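The bound can be checked numerically. The sketch below computes the lower bound for the example permutation and verifies by exhaustive enumeration, for all permutations of 1..5 only, that no single reversal removes more than two breakpoints. The helper names `b`, `lower_bound` and `max_drop` are our own.

```python
from itertools import combinations, permutations
from math import ceil

def b(pi):
    """Number of breakpoints of pi after framing it with 0 and n+1."""
    ext = [0] + list(pi) + [len(pi) + 1]
    return sum(abs(x - y) != 1 for x, y in zip(ext, ext[1:]))

def lower_bound(pi):
    """Each reversal removes at most 2 breakpoints, so at least b/2 are needed."""
    return ceil(b(pi) / 2)

def max_drop(pi):
    """Largest decrease of b achievable by one reversal of pi."""
    pi = list(pi)
    best = 0
    for i, j in combinations(range(len(pi)), 2):
        flipped = pi[:i] + pi[i:j + 1][::-1] + pi[j + 1:]
        best = max(best, b(pi) - b(flipped))
    return best

print(lower_bound([2, 3, 1, 4, 6, 5]))  # 3
print(all(max_drop(p) <= 2 for p in permutations(range(1, 6))))  # True
```

For the example the bound is tight: at least ⌈5/2⌉ = 3 reversals are needed, and the slide sorts the permutation in exactly three.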
Strips
• An interval between two consecutive breakpoints in
a permutation is called a strip.
– A strip is increasing if its elements increase.
– Otherwise, the strip is decreasing.
– A single-element strip is considered decreasing, with the
exception of the strips [0] and [n+1].
Example: 0 1 5 6 7 4 3 2 8 9 10
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
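Splitting a permutation into strips follows directly from the definition: cut the extended permutation at every breakpoint and label each piece. This sketch assumes the input is given without the framing 0 and n+1; the function name and the string labels are illustrative.

```python
def strips(pi):
    """Return the strips of the extended permutation as (elements, kind) pairs."""
    ext = [0] + list(pi) + [len(pi) + 1]
    last = ext[-1]                       # the framing element n+1
    result, start = [], 0
    for k in range(1, len(ext) + 1):
        # close the current strip at a breakpoint or at the end of the list
        if k == len(ext) or abs(ext[k] - ext[k - 1]) != 1:
            strip = ext[start:k]
            if len(strip) == 1:
                # lone elements are decreasing, except the strips [0] and [n+1]
                kind = "increasing" if strip[0] in (0, last) else "decreasing"
            else:
                kind = "increasing" if strip[1] > strip[0] else "decreasing"
            result.append((strip, kind))
            start = k
    return result

for strip, kind in strips([1, 5, 6, 7, 4, 3, 2, 8, 9]):
    print(strip, kind)
```

For the slide example this yields the strips 0 1, 5 6 7 and 8 9 10 as increasing and 4 3 2 as decreasing.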
Strips and Breakpoints
Observation 1: If a permutation contains a decreasing
strip, then there exists a reversal that will decrease
the number of breakpoints.
r(3,8)
0 1 5 6 7 4 3 2 8 9 10
0 1 2 3 4 7 6 5 8 9 10
Observation 2: Otherwise, create a decreasing strip by
reversing an increasing strip. The number of
breakpoints can be reduced in the next step.
r(6,8)
0 1 5 6 7 2 3 4 8 9 10
0 1 5 6 7 4 3 2 8 9 10
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
Algorithm 2: BreakpointReversalSort(π)
while b(π) > 0
  if π has a decreasing strip
    choose a reversal r that minimizes b(π • r)
  else
    choose a reversal r that flips an increasing strip in π
  π ← π • r
  output π
return
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
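A direct, unoptimized Python rendering of Algorithm 2 is sketched below. Two choices are ours and not prescribed by the pseudocode: the minimizing reversal is found by trying all reversals (cubic time per step), and when no decreasing strip exists the first interior increasing strip is flipped.

```python
def breakpoint_count(ext):
    return sum(abs(a - b) != 1 for a, b in zip(ext, ext[1:]))

def strip_bounds(ext):
    """(start, end) index pairs, inclusive, of the strips of ext."""
    bounds, start = [], 0
    for k in range(1, len(ext)):
        if abs(ext[k] - ext[k - 1]) != 1:
            bounds.append((start, k - 1))
            start = k
    bounds.append((start, len(ext) - 1))
    return bounds

def is_decreasing(ext, start, end):
    if start == end:                      # single element
        return 0 < start < len(ext) - 1   # [0] and [n+1] count as increasing
    return ext[start + 1] < ext[start]

def breakpoint_reversal_sort(pi):
    """Return the list of permutations produced after each reversal."""
    ext = [0] + list(pi) + [len(pi) + 1]
    last = len(ext) - 1
    history = []
    while breakpoint_count(ext) > 0:
        bounds = strip_bounds(ext)
        if any(is_decreasing(ext, s, e) for s, e in bounds):
            best = None
            for i in range(1, last):
                for j in range(i + 1, last):
                    cand = ext[:i] + ext[i:j + 1][::-1] + ext[j + 1:]
                    if best is None or breakpoint_count(cand) < breakpoint_count(best):
                        best = cand
            ext = best
        else:
            # flip the first increasing strip that avoids the framing elements
            s, e = next((s, e) for s, e in bounds if s > 0 and e < last)
            ext = ext[:s] + ext[s:e + 1][::-1] + ext[e + 1:]
        history.append(ext[1:-1])
    return history

for perm in breakpoint_reversal_sort([3, 4, 1, 2]):
    print(*perm)
```

The input 3 4 1 2 has no decreasing strip, so the first step only flips the strip 3 4; the following steps each remove at least one breakpoint.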
Performance Guarantee
• BreakpointReversalSort (BRS) is an approximation algorithm
that will not use more than four times the minimum number
of reversals.
– BRS eliminates at least one breakpoint every two steps:
d_BRS ≤ 2b(π) steps
– An optimal algorithm eliminates at most two breakpoints
every step: d_OPT ≥ b(π) / 2 steps
• Performance guarantee:
d_BRS / d_OPT ≤ [ 2b(π) / (b(π)/2) ] = 4
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
Gene Orientation & Genome Representation
modified from http://acim.uqam.ca/~anne/INF4500/Rearrangements.ppt
Genome Rearrangements
Signed Reversals
5’ ATGCCTGTACTA 3’
3’ TACGGACATGAT 5’
Break and Invert
5’ ATGTACAGGCTA 3’
3’ TACATGTCCGAT 5’
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
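On the sequence level, inverting a segment of double-stranded DNA replaces the segment on the forward strand by its reverse complement. A small sketch, assuming the forward strand is a plain string over A, C, G, T and the segment is given as a 0-based half-open interval (both conventions are ours):

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(seq):
    """Base-by-base complement (the partner strand read in the same left-to-right order)."""
    return "".join(COMPLEMENT[base] for base in seq)

def reverse_complement(seq):
    """The partner strand read in its own 5' to 3' direction."""
    return complement(seq)[::-1]

def invert_segment(forward, start, end):
    """Invert forward[start:end] in place on the double strand."""
    return forward[:start] + reverse_complement(forward[start:end]) + forward[end:]

before = "ATGCCTGTACTA"
after = invert_segment(before, 3, 9)
print("5'", after, "3'")              # 5' ATGTACAGGCTA 3'
print("3'", complement(after), "5'")  # 3' TACATGTCCGAT 5'
```

This is why a reversal flips the orientation of every gene inside the segment, which motivates the signed permutations on the following slides.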
Signed Reversals
[Figure: a chromosome with blocks numbered 1 to 10]
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
Signed Reversals
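[Figure: the same chromosome after reversing the segment that contains blocks 4 to 8; the reversed blocks change orientation]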
1, 2, 3, -8, -7, -6, -5, -4, 9, 10
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
Signed Reversals and Breakpoints
[Figure: the chromosome after the signed reversal of blocks 4 to 8]
1, 2, 3, -8, -7, -6, -5, -4, 9, 10
The reversal introduced two breakpoints
Taken and modified from An Introduction to Bioinformatics Algorithms by Neil Jones and Pavel Pevzner
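A signed reversal reverses the order of a segment and flips the sign of each element in it. The sketch below also counts breakpoints of a signed permutation, assuming the usual convention that a neighboring pair (a, b) is an adjacency exactly when b - a = 1; the slides do not spell this definition out, so treat it as an assumption.

```python
def signed_reversal(pi, i, j):
    """Reverse positions i..j (1-based, inclusive) and flip their signs."""
    return pi[:i - 1] + [-x for x in reversed(pi[i - 1:j])] + pi[j:]

def signed_breakpoints(pi):
    """Neighboring pairs (a, b) of the framed permutation with b - a != 1."""
    ext = [0] + list(pi) + [len(pi) + 1]
    return [(a, b) for a, b in zip(ext, ext[1:]) if b - a != 1]

identity = list(range(1, 11))
pi = signed_reversal(identity, 4, 8)
print(pi)                      # [1, 2, 3, -8, -7, -6, -5, -4, 9, 10]
print(signed_breakpoints(pi))  # [(3, -8), (-4, 9)]
```

Applying the same reversal again restores the identity permutation, so every signed reversal is its own inverse.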
Summary: Complexity Results
• Sorting by unsigned reversals:
– NP-hard
– can be approximated within a constant factor
• Sorting by signed reversals:
– can be solved in polynomial time
Web Tools
• GRIMM Web Server
– computes signed and unsigned reversal distances
between permutations.
http://www-cse.ucsd.edu/groups/bioinformatics/GRIMM
• Cinteny
– a web server for synteny identification and the
analysis of genome rearrangement
http://cinteny.cchmc.org/
DCJ Genome Rearrangements
• The DCJ model uses Double-Cut-and-Join
genome rearrangement operations.
• DCJ operations break and rejoin one or two
intergenic regions (possibly on different
chromosomes).
Genome Representation
• In the DCJ model, a genome is
grouped into chromosomes
(linear/circular).
• A gene g on the forward strand is
represented by [-g,+g]
• A gene g on the reverse strand is
represented by [+g,-g]
• Telomeres are represented by the
special symbol ‘o’.
• An adjacency (intergenic region) is
encoded by the unordered pair of
neighboring gene/telomere ends.
Example.
• linear c1=(o 1 -2 3 4 o)
• circular c2=(5 6 7)
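The adjacency encoding can be derived mechanically from the gene order. In this sketch a genome is a list of (genes, is_circular) pairs, gene ends are signed integers (-g and +g), the telomere is the string "o", and adjacencies are frozensets; all of these representation choices are assumptions made for the example.

```python
TELOMERE = "o"

def gene_ends(g):
    """Left and right end of a signed gene.

    A forward gene g gives (-g, +g); a reverse gene -g gives (+g, -g).
    """
    return (-g, g)

def adjacencies(genome):
    """Set of adjacencies (unordered pairs of neighboring ends) of a genome."""
    result = set()
    for genes, circular in genome:
        ends = [gene_ends(g) for g in genes]
        # right end of each gene meets the left end of its successor
        for (_, right), (left, _) in zip(ends, ends[1:]):
            result.add(frozenset((right, left)))
        if circular:
            result.add(frozenset((ends[-1][1], ends[0][0])))
        else:
            result.add(frozenset((TELOMERE, ends[0][0])))
            result.add(frozenset((ends[-1][1], TELOMERE)))
    return result

genome = [([1, -2, 3, 4], False), ([5, 6, 7], True)]  # c1 and c2 from the example
for adj in sorted(adjacencies(genome), key=lambda a: sorted(map(str, a))):
    print(set(adj))
```

For c1 and c2 this produces eight adjacencies: {o,-1}, {1,2}, {-2,-3}, {3,-4}, {4,o}, {5,-6}, {6,-7} and {7,-5}.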
DCJ Operations
• The double-cut-and-join operation “breaks” two
adjacencies and rejoins the fragments:
{a, b} {c, d} → {a,d} {c,b}, or {a,c} {b,d}.
• a, b, c, and d represent different (signed) gene ends
or telomeres (with ‘+o’ = ‘-o’).
• A special case occurs for c=d=o:
{a,b} {o,o} ↔ {a,o} {b,o}.
Signed reversal of genes 2 and 3
Chromosome Linearization
Weird genme transformation
Using Graphs to Sort Genomes
• Adjacency graph AG(A,B)=(V,E)
is a bipartite graph.
• V contains one vertex for each
adjacency of genome A and B.
• Each gene, g, defines two edges:
– e1 connecting the adjacencies with +g of A and B
– e2 connecting the adjacencies with –g of A and B.
Example:
genome A: (o 1 -2 3 4 o) (5 6 7)
genome B: (o 1 2 3 o) (o 4 5 6 7 o)
Using Graphs to Sort Genomes
Algorithm 3: DCJSORT(A,B)
Generate adjacency graph AG(A, B) of A and B
for each adjacency {p, q} with p,q≠o in genome B do
  let u={p,l} be the vertex of A that contains p
  let v={q,m} be the vertex of A that contains q
  if u ≠ v then
    replace vertices u and v in A by {p,q} and {l,m}
    update edge set
  end if
end for
for each telomere {p,o} in B do
  let u={p,l} be the vertex of A that contains p
  if l≠o then
    replace vertex u in A by {p,o} and {o,l}
    update edge set
  end if
end for
Example:
genome A: (o 1 -2 3 4 o) (5 6 7)
genome B: (o 1 2 3 o) (o 4 5 6 7 o)
DCJ1: {1,2} {-2,-3} → {1,-2} {2,-3}
DCJ2: {4,o} {7,-5} → {4,-5} {7,o}
DCJ3: {3,-4} {o,o} → {3,o} {o,-4}
A → DCJ1 → DCJ2 → DCJ3 → B
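The two loops of DCJSORT can be sketched on adjacency sets alone. Assumptions of this sketch: genomes are given as lists of adjacency pairs, "o" marks a telomere, only the vertex set of A is tracked (the edge set of the adjacency graph is implied and never stored), and the target genome is taken to be (o 1 2 3 o) (o 4 5 6 7 o), which is the genome that the operations DCJ1 to DCJ3 lead to.

```python
TELOMERE = "o"

def vertex_with(adjs, end):
    """The adjacency of the current genome that contains the gene end `end`."""
    return next(v for v in adjs if end in v)

def partner(vertex, end):
    """The other member of a two-element adjacency."""
    (other,) = vertex - {end}
    return other

def dcj_sort(adj_a, adj_b):
    """Transform genome A into genome B; returns (final adjacencies, operations)."""
    current = {frozenset(x) for x in adj_a}
    ops = []

    def replace(old, new):
        current.difference_update(old)
        # {o,o} is a pair of free telomeres and is not stored
        current.update(v for v in new if v != frozenset((TELOMERE,)))
        ops.append((old, new))

    # first loop: adjacencies of B that involve no telomere
    for p, q in adj_b:
        if TELOMERE in (p, q):
            continue
        u, v = vertex_with(current, p), vertex_with(current, q)
        if u != v:
            l, m = partner(u, p), partner(v, q)
            replace([u, v], [frozenset((p, q)), frozenset((l, m))])

    # second loop: telomeres of B
    for p, q in adj_b:
        if TELOMERE not in (p, q):
            continue
        p = q if p == TELOMERE else p
        u = vertex_with(current, p)
        l = partner(u, p)
        if l != TELOMERE:
            replace([u], [frozenset((p, TELOMERE)), frozenset((l, TELOMERE))])
    return current, ops

A = [("o", -1), (1, 2), (-2, -3), (3, -4), (4, "o"), (5, -6), (6, -7), (7, -5)]
B = [("o", -1), (1, -2), (2, -3), (3, "o"), ("o", -4), (4, -5), (5, -6), (6, -7), (7, "o")]
final, ops = dcj_sort(A, B)
for old, new in ops:
    print([set(v) for v in old], "->", [set(v) for v in new])
```

With these inputs the sketch performs three operations, in the same order as DCJ1, DCJ2 and DCJ3 above, and ends with exactly the adjacencies of B.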
Summary: Complexity Results
• Sorting by unsigned reversals:
– NP-hard
– can be approximated within a constant factor
• Sorting by signed reversals:
– can be solved in polynomial time
• Sorting by DCJ rearrangements:
– can be solved in polynomial time
The End
Disclaimer
• Our presentation is in many parts inspired by
the textbook An Introduction to Bioinformatics
Algorithms by Neil Jones and Pavel Pevzner, by
lectures from Anne Bergeron and Julia
Mixtacki, as well as many review articles from
multiple colleagues.