Finding the origin of replication

Where in the Genome does
Replication Begin?
Chapter 1
Bioinformatics Algorithms
Phillip Compeau and Pavel Pevzner
eBook link: http://beta.stepic.org/2
It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.
-- Watson & Crick, 1953
DNA Replication
● Replication Origin (oriC)
● DNA Polymerases
● Viral Vectors
– Frost-resistant tomatoes, pesticide-resistant corn
– Gene Therapy
Gene Therapy
● Vector
– Origin of replication, multicloning site, selectable marker
The Problem
Finding Origin of Replication Problem
Input: The DNA string Genome.
Output: The location of oriC in Genome.
Is the Finding Origin of Replication Problem a clearly stated computational problem?
Hidden Messages in oriC
● Bacterial genome
– Single circular chromosome
● Vibrio cholerae
– 1,108,250 nucleotides
atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc
Hidden Messages in oriC
● DnaA
– Replication is initiated by this protein
● DnaA Box
– DnaA binds here
– Multiple DnaA boxes help DnaA bind better
atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc
Hidden Message Problem
Find a “Hidden Message” in the Replication Origin
Input: A string text (representing the replication origin of a genome).
Output: A hidden message in the text.
Is the Hidden Message Problem a clearly stated computational problem?
The Eureka Moment
It may well be doubted whether human ingenuity can construct an
enigma of the kind which human ingenuity may not, by proper
application, resolve.
-- Edgar Allan Poe (through Legrand)
;48 = THE
Hidden Messages
● Are there frequent words in the oriC?
ACAACTATGCATACTATCGGGAACTATCCT
● k-mer: String of length k
● Count(Text, Pattern): Number of times the k-mer Pattern appears as a substring of Text.
Count(ACAACTATGCATACTATCGGGAACTATCCT, ACTAT) = 3
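The Count definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming that overlapping occurrences are counted; the function name `pattern_count` is a hypothetical choice, not something from the slides.

```python
def pattern_count(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    k = len(pattern)
    # Slide a window of length k across text and compare each window to pattern.
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


print(pattern_count("ACAACTATGCATACTATCGGGAACTATCCT", "ACTAT"))  # 3
```

A plain `str.count` would miss overlapping matches, which is why the sketch slides a window instead.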
Frequent Words Problem
Find the most frequent k-mers in a string
Input: A string Text and an integer k.
Output: All most frequent k-mers in Text.
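One possible way to solve the Frequent Words Problem is to tally every k-mer in a hash map and keep those with the highest tally. The sketch below assumes that approach (it is not the naive pairwise comparison); the name `frequent_words` is illustrative only.

```python
from collections import Counter


def frequent_words(text: str, k: int) -> list[str]:
    """Return all k-mers that occur most often in text, sorted alphabetically."""
    # Tally every window of length k.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        return []  # text is shorter than k
    best = max(counts.values())
    return sorted(kmer for kmer, count in counts.items() if count == best)


print(frequent_words("ACAACTATGCATACTATCGGGAACTATCCT", 5))  # ['ACTAT']
```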
Frequent Words – Naive Solution
● Total k-mers = |Text| - k + 1
● Each k-mer is compared with at most |Text| - k other k-mers.
● Each comparison compares at most k characters
● $O(|Text|^2 \cdot k)$
Other Implementations
$4^k + |Text| \cdot k$, $|Text| \cdot k \cdot \log(|Text|)$, $|Text|$
Frequent Words – The Encoding Mystery
k    count    k-mers
3    25       tga
4    11       atga, tgat
5    8        gatca, tgatc
6    8        tgatca
7    5        atgatca
8    4        atgatcaa
9    3        atgatcaag, cttgatcat, tcttgatca, ctcttgatc
atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc
Frequent Words – The Encoding Mystery
Which of these results are statistically significant?
Frequent Words – The Encoding Mystery
● DnaA Boxes are usually 9 nucleotides long.
● Frequent 9-mers:
– ATGATCAAG, CTTGATCAT, TCTTGATCA, CTCTTGATC
● Probability that a 9-mer appears 3 or more times in a randomly generated DNA string of length 500 ≈ 1/1300
● One of the four 9-mers is the DnaA Box?
● If so, which one of the four?
Frequent Words – The Encoding Mystery
● Which one of the four is “more surprising” compared to the others?
● Consider ATGATCAAG and CTTGATCAT.
– Reverse complements!
atcaatgatcaacgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagATGATCAAGagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctCTTGATCATcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctCTTGATCATgtt
tccttaaccctctattttttacggaagaATGATCAAGctgctgctCTTGATCATcgtttc
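The observation that ATGATCAAG and CTTGATCAT are reverse complements can be checked with a short sketch. The helper name `reverse_complement` and the translation table are assumptions made for illustration.

```python
# Map each nucleotide to its complement, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(pattern: str) -> str:
    """Complement every nucleotide, then reverse the string."""
    return pattern.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGATCAAG"))  # CTTGATCAT
```

Because the two 9-mers lie on opposite strands of the same double helix, their occurrences can be pooled when judging how surprising the pattern is.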
Frequent Words – The Encoding Mystery
● 6 occurrences of a 9-mer in a string of 500 nucleotides are statistically more significant than 3 occurrences.
● ATGATCAAG is the DnaA box?
atcaatgatcaacgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagATGATCAAGagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctCTTGATCATcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctCTTGATCATgtt
tccttaaccctctattttttacggaagaATGATCAAGctgctgctCTTGATCATcgtttc
Vibrio cholerae – DnaA Box
● How confident are we that the DnaA Box has been found?
– What if ATGATCAAG occurs along the entire genome?
● Check for all occurrences of ATGATCAAG in the genome.
Pattern Matching Problem
Find all occurrences of a pattern in a string
Input: Strings Pattern and Genome.
Output: All starting positions in Genome where Pattern appears as a substring.
116556, 149355, 151913, 152013, 152394, 186189, 194276, 200076, 224527, 307692, 479770, 610980, 653338, 679985, 768828, 878903, 985368
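A minimal sketch of the Pattern Matching Problem follows. It assumes 0-based starting positions and allows overlapping matches; `pattern_positions` is a hypothetical name, and the short strings in the example are made up for illustration rather than taken from the Vibrio cholerae genome.

```python
def pattern_positions(pattern: str, genome: str) -> list[int]:
    """Return every 0-based start index where pattern occurs in genome."""
    k = len(pattern)
    return [i for i in range(len(genome) - k + 1) if genome[i:i + k] == pattern]


print(pattern_positions("ATAT", "GATATATGCATATACTT"))  # [1, 3, 9]
```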
Clumps
● Positions of ATGATCAAG form a clump in positions 151913, 152013, and 152394.
– There are no other clumps
We now have strong computational and statistical evidence that ATGATCAAG/CTTGATCAT is the DnaA Box in Vibrio cholerae
Job Done and Dusted?
Maybe not yet ...
Is ATGATCAAG/CTTGATCAT the DnaA Box for all Bacteria?
Is the clumping effect of ATGATCAAG/CTTGATCAT just a statistical fluke in Vibrio cholerae?
Do other Bacteria have other DnaA Boxes?
oriC of Thermotoga petrophila
aactctatacctcctttttgtcgaatttgtgtgatttatagagaaaatcttattaactga
aactaaaatggtaggtttggtggtaggttttgtgtacattttgtagtatctgatttttaa
ttacataccgtatattgtattaaattgacgaacaattgcatggaattgaatatatgcaaa
acaaacctaccaccaaactctgtattgaccattttaggacaacttcagggtggtaggttt
ctgaagctctcatcaatagactattttagtctttacaaacaatattaccgttcagattca
agattctacaacgctgttttaatgggcgttgcagaaaacttaccacctaaaatccagtat
ccaagccgatttcagagaaacctaccacttacctaccacttacctaccacccgggtggta
agttgcagacattattaaaaacctcatcagaagcttgttcaaaaatttcaatactcgaaa
cctaccacctgcgtcccctattatttactactactaataatagcagtataattgatctga
Thermotoga petrophila
ATGATCAAG or CTTGATCAT does not occur at all!
● 6 different 9-mers appear 3 or more times
– AACCTACCA, AAACCTACC, ACCTACCAC, CCTACCACC, GGTAGGTTT, TGGTAGGTT.
● Six different 9-mers each occurring 3 or more times in a sequence of 500 nucleotides is extremely unlikely
● From the Ori-Finder tool, the DnaA Box is CCTACCACC/GGTGGTAGG
Thermotoga petrophila
GGTGGTAGG/CCTACCACC
aactctatacctcctttttgtcgaatttgtgtgatttatagagaaaatcttattaactga
aactaaaatggtaggtttGGTGGTAGGttttgtgtacattttgtagtatctgatttttaa
ttacataccgtatattgtattaaattgacgaacaattgcatggaattgaatatatgcaaa
acaaaCCTACCACCaaactctgtattgaccattttaggacaacttcagGGTGGTAGGttt
ctgaagctctcatcaatagactattttagtctttacaaacaatattaccgttcagattca
agattctacaacgctgttttaatgggcgttgcagaaaacttaccacctaaaatccagtat
ccaagccgatttcagagaaacctaccacttacctaccacttaCCTACCACCcgggtggta
agttgcagacattattaaaaacctcatcagaagcttgttcaaaaatttcaatactcgaaa
CCTACCACCtgcgtcccctattatttactactactaataatagcagtataattgatctga
Now What?
It is unlikely that ATGATCAAG/CTTGATCAT or GGTGGTAGG/CCTACCACC are DnaA boxes for a newly sequenced bacterium.
The most frequent 9-mers in T. petrophila did not give any special indication to identify the DnaA Box.
Does that mean that our heuristic for finding the DnaA boxes has just failed?
Step back a bit ...
What are we trying to solve?
● Where is the oriC?
– Big Clue: Identify the DnaA box.
● From our experience with Vibrio cholerae
The DnaA box is (most likely) the k-mer that occurs in clumps in a short sequence of the genome
This sequence is (most likely) in the neighbourhood of the oriC
Clump Finding
● Find every k-mer that forms a clump in the genome in a window of size L
Given integers L and t, a k-mer Pattern forms an (L, t)-clump inside a (larger) string Genome if there is an interval of Genome of length L in which this k-mer appears at least t times.
ATGATCAAG forms a (500,3)-clump in the Vibrio cholerae genome
Clump Finding Problem
Find patterns forming clumps in a string
Input: A string Genome, and integers k, L and t.
Output: All distinct k-mers forming (L, t)-clumps in Genome
Continuing from the naive frequent words algorithm: $O(L^2 \cdot k \cdot |Genome|)$
Can you come up with an algorithm that takes $O(4^k + k \cdot |Genome|)$?
$L < 1000$, $k < 15$, $|Genome| < 10^7$
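One way to avoid recounting every window from scratch is to slide the window one position at a time, removing the k-mer that leaves and adding the k-mer that enters. The sketch below assumes this hash-map variant rather than the $4^k$ frequency array hinted at above; `find_clumps` is an illustrative name, and the function returns an empty set when Genome is shorter than L.

```python
from collections import Counter


def find_clumps(genome: str, k: int, L: int, t: int) -> set[str]:
    """Return all distinct k-mers forming an (L, t)-clump in genome."""
    clumps: set[str] = set()
    if k > L or L > len(genome):
        return clumps  # no full window of length L exists
    # Count the k-mers of the first window genome[0:L].
    counts = Counter(genome[i:i + k] for i in range(L - k + 1))
    clumps.update(kmer for kmer, c in counts.items() if c >= t)
    # Slide the window: drop the first k-mer of the old window, add the last of the new.
    for start in range(1, len(genome) - L + 1):
        counts[genome[start - 1:start - 1 + k]] -= 1
        entering = genome[start + L - k:start + L]
        counts[entering] += 1
        if counts[entering] >= t:
            clumps.add(entering)
    return clumps


print(find_clumps("AAGAAGAAGCTCTGACAAG", 3, 9, 3))  # {'AAG'}
```

Only the entering k-mer can newly reach the threshold in each step, so checking it alone is sufficient after the first window.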
Clump Finding in E. coli
More than 1904 different 9-mers form (500, 3)-clumps in E. coli!
Each is as likely a candidate as the others for the DnaA box.
What now?
Biological insights into the replication process might help ...
DNA Replication Revisited
Replication Terminus
DNA Polymerase can read the strand from 3' → 5' only
DNA Replication
How does the unidirectional polymerase replicate the entire circular genome?
How many DNA Polymerases are required? Why?
Asymmetry in DNA Replication
● Leading or Reverse half-strand (3' → 5')
– DNA Polymerase works non-stop
– Completes replication sooner than the Forward half-strand.
– Lives double-stranded most of its life.
● Lagging or Forward half-strand (5' → 3')
– DNA Polymerases work in stop-go fashion on Okazaki fragments
– DNA Ligase joins Okazaki fragments
– Waits longer for the DNA Polymerase to attach and replicate
– Lives single-stranded most of its life.
Asymmetry in DNA Replication
Does this asymmetry provide clues for identification of the oriC?
The forward half-strand undergoes more mutations during its time as a single strand than the reverse half-strand.
Which among A, C, G, T has the highest mutation rate?
                                 C         G         A         T
Entire Strand                    427419    413241    491488    491363
Reverse Half-Strand              219518    201634    243963    246641
Forward Half-Strand              207901    211607    247525    244722
Difference (Reverse - Forward)   +11617    -9973     -3562     +1919
Peculiar Statistics of Half-Strands
● Deamination rises 100-fold in forward strands
– Cytosine → Thymine
● T–G bonds are corrected to T–A in the 2nd round of replication
Forward HS (single-stranded life): Shortage of C, normal G
Reverse HS (double-stranded life): Shortage of G, normal C
Skew Diagram
$Skew_i(Genome)$: the number of G's minus the number of C's in the first i nucleotides of Genome
Plot $Skew_i(Genome)$ for $i: 0 \to |Genome|$
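The values plotted in a skew diagram can be computed in one pass over the genome. The sketch below assumes the usual definition of skew as the running count of G minus the running count of C; the function name `skew` and the short example string are illustrative.

```python
def skew(genome: str) -> list[int]:
    """Return Skew_i(genome) for i = 0 .. len(genome)."""
    values = [0]  # Skew_0 is 0 by definition
    for nucleotide in genome.upper():
        step = 1 if nucleotide == "G" else -1 if nucleotide == "C" else 0
        values.append(values[-1] + step)
    return values


print(skew("CATGGGCATCGGCCATACGCC"))
# [0, -1, -1, -1, 0, 1, 2, 1, 1, 1, 0, 1, 2, 1, 0, 0, 0, 0, -1, 0, -1, -2]
```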
E. coli – Skew Diagram
Where is the oriC?
Minimum Skew Problem
Find a position in a genome minimizing the skew.
Input: A DNA string Genome.
Output: All integer(s) i minimizing $Skew_i(Genome)$ among all values of i (from 0 to |Genome|).
Does the min skew position change based on varying initial positions?
Approximate position of the oriC in E. coli: 3923620
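A minimal sketch of the Minimum Skew Problem, assuming skew is the running count of G minus C and that positions i range from 0 to |Genome|. The name `minimum_skew` and the short test string are illustrative, not the E. coli genome.

```python
def minimum_skew(genome: str) -> list[int]:
    """Return all positions i (0 .. len(genome)) where the skew is minimal."""
    current = 0
    best = 0
    positions = [0]  # Skew_0 = 0 is the first candidate
    for i, nucleotide in enumerate(genome.upper(), start=1):
        if nucleotide == "G":
            current += 1
        elif nucleotide == "C":
            current -= 1
        if current < best:
            best, positions = current, [i]  # new strict minimum
        elif current == best:
            positions.append(i)  # tie with the current minimum
    return positions


print(minimum_skew("TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"))
# [11, 24]
```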
DnaA box of E. coli
Approximate position of the oriC in E. coli: 3923620
aatgatgatgacgtcaaaaggatccggataaaacatggtgattgcctcgc
ataacgcggtatgaaaatggattgaagcccgggccgtggattctactcaa
ctttgtcggcttgagaaagacctgggatcctgggtattaaaaagaagatc
tatttatttagagatctgttctattgtgatctcttattaggatcgcactg
ccctgtggataacaaggatccggcttttaagatcaacaacctggaaagga
tcattaactgtgaatgatcggtgatcctggaccgtataagctgggatcag
aatgaggggttatacacaactcaaaaactgaacaacagttgttctttgga
taactaccggttgatccaagcttcctgacagagttatccacagtagatcg
cacgatctgtatacttatttgagtaaattaacccacgatcccagccattc
ttctgccggatcttccggaatgtcgtgatcaagaatgttgatcttcagtg
No 9-mer occurs 3 or more times in this oriC!
Revisit Vibrio cholerae
atcaATGATCAACgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagATGATCAAGagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagCATGATCATggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctCTTGATCATcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctCTTGATCATgtt
tccttaaccctctattttttacggaagaATGATCAAGctgctgctCTTGATCATcgtttc
Observe the 9-mers ATGATCAAC and CATGATCAT
Previously Invisible DnaA Boxes
atcaATGATCAACgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagATGATCAAGagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagCATGATCATggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctCTTGATCATcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctCTTGATCATgtt
tccttaaccctctattttttacggaagaATGATCAAGctgctgctCTTGATCATcgtttc
Finding 8 approximate occurrences of 9-mers is statistically more surprising.
ATGATCAAG, CTTGATCAT, ATGATCAAC, CATGATCAT
DnaA can bind to slight modifications of the DnaA boxes!
Approximate Pattern Matching Problem
Find all approximate occurrences of a pattern in a string.
Input: Strings Pattern and Text along with an integer d.
Output: All starting positions where Pattern appears as a substring of Text with at most d mismatches.
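The Approximate Pattern Matching Problem can be sketched by comparing Pattern against every window of Text and counting mismatches. The sketch assumes 0-based positions; `hamming_distance` and `approximate_positions` are hypothetical helper names.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def approximate_positions(pattern: str, text: str, d: int) -> list[int]:
    """0-based starts where pattern matches text with at most d mismatches."""
    k = len(pattern)
    return [i for i in range(len(text) - k + 1)
            if hamming_distance(pattern, text[i:i + k]) <= d]


print(hamming_distance("ATGATCAAG", "ATGATCAAC"))   # 1
print(approximate_positions("ATG", "ATGATCATA", 1))  # [0, 3, 6]
```

With d = 1, ATGATCAAC counts as an occurrence of ATGATCAAG and CATGATCAT as an occurrence of CTTGATCAT, which is how the previously invisible boxes become visible.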
DnaA Boxes in E. coli
aatgatgatgacgtcaaaaggatccggataaaacatggtgattgcctcgc
ataacgcggtatgaaaatggattgaagcccgggccgtggattctactcaa
ctttgtcggcttgagaaagacctgggatcctgggtattaaaaagaagatc
tatttatttagagatctgttctattgtgatctcttattaggatcgcactg
cccTGTGGATAAcaaggatccggcttttaagatcaacaacctggaaagga
tcattaactgtgaatgatcggtgatcctggaccgtataagctgggatcag
aatgaggggTTATACACAactcaaaaactgaacaacagttgttcTTTGGA
TAACtaccggttgatccaagcttcctgacagagTTATCCACAgtagatcg
cacgatctgtatacttatttgagtaaattaacccacgatcccagccattc
ttctgccggatcttccggaatgtcgtgatcaagaatgttgatcttcagtg
DnaA Box of E. coli: TTATCCACA
Epilogue
● Hidden messages cluster in a genome
– Clumps
● DnaA boxes may not be perfect
Complications
● Some bacteria have fewer DnaA boxes
– Frequent Words Problem does not work!
● Terminus of replication is often not located directly opposite to oriC
● The skew diagram is often more complex than that of E. coli
Skew diagram of T. petrophila
Open Problems
●
Multiple origins of replication in a bacterial
genome
●
Finding oriC in Archaea and Yeast
●
Computing probabilities of patterns in a string
Multiple Origins of Replication
● Biologists long believed that each bacterial chromosome has a single oriC
● Xia (2012) argued that some bacteria may have multiple replication origins.
– Bacteria would be able to replicate faster
Skew diagram of Wigglesworthia glossinidia
Does a bacterial genome have multiple origins of replication?
Xia, DNA Replication and Strand Asymmetry in Prokaryotic and Mitochondrial Genomes, Current Genomics, 13(1), 2012
Multiple Origins of Replication
●
Genome rearrangements can cause multiple local minima in
the skew diagram
–
Reversal: a segment of chromosome is flipped and switched into
the opposite strand
–
Horizontal gene transfer: Gene from forward half-strand of one
is transferred to the reverse half-strand of another
Finding oriCs in Archaea and Yeast
● Archaea have multiple oriCs
– Skew diagram of Sulfolobus solfataricus
● Yeast have hundreds of oriCs
– Coordinated replication
Develop an algorithm to reliably locate oriCs in Archaea and Yeast.
Computing Probabilities of Patterns in a String
Is it statistically surprising to find a 9-mer appearing 3 or more times within ≈ 500 nucleotides?
● Probability that “01” (“11”) appears in a random binary string of length 4 is 11/16 (8/16)
● The overlapping words paradox
● Pattern “11” overlaps but not “01”
$Pr_d(N, A, Pattern, t)$, $Pr(N, A, k, t)$, ...
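The 11/16 and 8/16 figures above can be checked by brute-force enumeration of all strings of a given length. This is a sketch for tiny inputs only, since the number of strings grows exponentially; the function name `occurrence_probability` and its parameters are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product


def occurrence_probability(pattern: str, n: int, alphabet: str = "01", t: int = 1) -> Fraction:
    """Exact probability that pattern occurs at least t times in a uniformly
    random string of length n over alphabet (overlapping occurrences counted)."""
    k = len(pattern)
    hits = 0
    for letters in product(alphabet, repeat=n):
        s = "".join(letters)
        occurrences = sum(1 for i in range(n - k + 1) if s[i:i + k] == pattern)
        if occurrences >= t:
            hits += 1
    return Fraction(hits, len(alphabet) ** n)


print(occurrence_probability("01", 4))  # 11/16
print(occurrence_probability("11", 4))  # 1/2, i.e. 8/16
```

The gap between the two values illustrates the overlapping words paradox: “11” can overlap itself, so its occurrences bunch together in fewer strings than those of “01”.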