DNA CODES BASED
ON HAMMING STEM
SIMILARITIES
A.G. Dyachkov, A.N. Voronina
Dept. of Probability Theory, MechMath., Moscow State University, Russia
OUTLINE
1. DNA background
2. Modeling the hybridization energy
3. DNA codes
4. Example of code construction
5. Bounds on the rate of DNA codes
6. On sphere sizes
7. Further generalizations
8. Bibliography
DNA STRANDS
■ DNA strands consist of nucleotides, each composed of a sugar-phosphate backbone and 1 base
■ There are 4 types of bases: A (adenine), C (cytosine), G (guanine), T (thymine)
■ Base A is said to be complementary to T, and C to G
■ DNA strands are oriented. Thus, for example, strand AATG is different from strand GTAA
■ 2 oppositely directed strands containing complementary bases at corresponding positions are called reverse-complement strands. For example, these 2 strands are reverse-complement (the strands have different directions):
A A C G
T T G C
[Figure: single DNA strand with labeled 5' end, 3' end, nucleotide, bases, and sugar-phosphate backbone]
HYBRIDIZATION
■ 2 oppositely directed DNA strands are capable of coalescing into a duplex, or double helix
■ The process of forming a duplex is referred to as hybridization
■ The basis of this process is the formation of hydrogen bonds between complementary bases
■ A duplex formed of reverse-complement strands is called a Watson-Crick duplex. Here is an example:
A A C G T
T T G C A
CROSS-HYBRIDIZATION AND ENERGY OF
HYBRIDIZATION
■ However, hybridization is not a perfect process, and non-complementary strands can also hybridize
■ This is one example of cross-hybridization:
A A C C T G G C A A C
T T G C G C C A A T G
The bases at positions 4-5 and 8-9 are not complementary
■ The indicator of “strength”, or stability, of the formed duplex is its energy of hybridization. Its value depends on the total number of bonds formed
■ Thus, the greatest hybridization energy is obtained when a Watson-Crick duplex is formed rather than in the case of cross-hybridization
LONE BONDS AND “PAIRWISE” METRIC
■ If a pair of bases is bonded but neither of its “neighbor” pairs forms a bond as well, then it is called a lone bond. Here is an example:
A A C G C A C T
T T C C A T G A
Position 4 (G-C): a lone bond does not contribute to hybr. energy
Positions 1-2: a pair of bonds adds 1 to total hybr. energy
Positions 6-8: a triplet is counted as 2 adjacent pairs
Hybr. energy = 3
■ The lone bond is too “weak” to form a strong connection, so it does not contribute much to the total energy of hybridization
■ Moreover, the energy of hybridization depends not on the number of bonds formed, but on the number of pairs of adjacent bonds
■ Thus, if we suppose that hybridization energy is equal to the number of pairs, then in the example above it is equal to 3, not 5 or 6
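The counting rule on this slide can be sketched in a few lines of Python. This is only an illustration: the two strands are an assumed reading of the figure (AACGCACT over TTCCATGA, position by position), and the function names `bonds` and `hybridization_energy` are hypothetical.

```python
# Watson-Crick complements of the 4 bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def bonds(top, bottom):
    """Mark the positions where the two facing bases are complementary."""
    return [COMPLEMENT[a] == b for a, b in zip(top, bottom)]

def hybridization_energy(top, bottom):
    """Count pairs of adjacent bonds; a lone bond contributes nothing."""
    bonded = bonds(top, bottom)
    return sum(1 for left, right in zip(bonded, bonded[1:]) if left and right)

# Assumed reading of the slide's example duplex.
top, bottom = "AACGCACT", "TTCCATGA"
print(sum(bonds(top, bottom)))            # 6 bonds in total, one of them lone
print(hybridization_energy(top, bottom))  # 3 pairs of adjacent bonds
```

Under this reading the duplex has 6 bonds, 5 of which are not lone, and 3 adjacent pairs, which matches the "3, not 5 or 6" remark.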
NOTATIONS
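General notations
■ Let $q = 2, 4, \dots$ be an arbitrary even integer
■ Denote by $A = \{0, 1, \dots, q-1\}$ the standard alphabet of size $q$
■ Denote by $\lfloor u \rfloor$ ($\lceil u \rceil$) the largest (smallest) integer $\le u$ ($\ge u$)
Reverse-complementation
■ For any letter $x \in A$, define $\bar{x} = (q-1) - x$, the complement of the letter $x$
■ For any q-ary sequence $\mathbf{x} = (x_1, x_2, \dots, x_n) \in A^n$, define its reverse complement $\tilde{\mathbf{x}} = (\bar{x}_n, \bar{x}_{n-1}, \dots, \bar{x}_1)$
Note that if $\mathbf{y} = \tilde{\mathbf{x}}$, then $\mathbf{x} = \tilde{\mathbf{y}}$, for any $\mathbf{x} \in A^n$.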
STEM HAMMING SIMILARITY
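For 2 q-ary sequences of length n, $\mathbf{x} = (x_1, \dots, x_n)$ and $\mathbf{y} = (y_1, \dots, y_n)$, the stem Hamming similarity is equal to
$S(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n-1} s_i(\mathbf{x}, \mathbf{y})$,
where $s_i(\mathbf{x}, \mathbf{y}) = 1$ if $x_i = y_i$ and $x_{i+1} = y_{i+1}$, and $s_i(\mathbf{x}, \mathbf{y}) = 0$ otherwise
■ $S(\mathbf{x}, \mathbf{y})$ is equal to the total number of common 2-blocks containing adjacent symbols in the longest common Hamming subsequence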
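A minimal Python sketch of this definition follows. It assumes the similarity counts the positions i at which both the i-th and the (i+1)-th symbols of the two sequences coincide; the function name and the sample sequences are hypothetical.

```python
def stem_hamming_similarity(x, y):
    """Number of positions i where x and y agree at both i and i + 1."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(1 for i in range(len(x) - 1)
               if x[i] == y[i] and x[i + 1] == y[i + 1])

# Hypothetical 4-ary sequences of length 6.
x = (0, 1, 2, 3, 0, 1)
y = (0, 1, 2, 0, 0, 1)
print(stem_hamming_similarity(x, y))  # 3 common adjacent pairs
print(stem_hamming_similarity(x, x))  # 5, i.e. n - 1
```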
HAMMING VS. STEM HAMMING
■ Hamming similarity is element-wise while stem Hamming similarity is
pair-wise (though still additive)
■ Re-ordering the elements in the sequence does not influence
Hamming similarity, but may change stem Hamming similarity
Example
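As a hypothetical illustration of the re-ordering remark (these sequences are not taken from the slide), the sketch below applies the same permutation of positions to both sequences. The element-wise Hamming similarity stays the same, while the pair-wise stem Hamming similarity changes. Function names are assumptions.

```python
def hamming_similarity(x, y):
    """Number of positions where the two sequences agree."""
    return sum(a == b for a, b in zip(x, y))

def stem_hamming_similarity(x, y):
    """Number of adjacent position pairs where the two sequences agree."""
    return sum(x[i] == y[i] and x[i + 1] == y[i + 1] for i in range(len(x) - 1))

x = (0, 1, 0, 1, 0, 1)
y = (0, 0, 0, 0, 0, 0)
perm = (0, 2, 4, 1, 3, 5)  # the same re-ordering is applied to both sequences
xp = tuple(x[i] for i in perm)
yp = tuple(y[i] for i in perm)

print(hamming_similarity(x, y), stem_hamming_similarity(x, y))      # 3 0
print(hamming_similarity(xp, yp), stem_hamming_similarity(xp, yp))  # 3 2
```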
STEM HAMMING DISTANCE
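■ Note that $S(\mathbf{x}, \mathbf{y}) \le S(\mathbf{x}, \mathbf{x}) = n - 1$, with equality if and only if $\mathbf{x} = \mathbf{y}$
■ Stem Hamming distance between $\mathbf{x}$ and $\mathbf{y}$ is $D(\mathbf{x}, \mathbf{y}) = n - 1 - S(\mathbf{x}, \mathbf{y})$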
Example
Let
and
■ The longest common Hamming subsequence is
■ Stem Hamming similarity is equal to
■ Stem Hamming distance is equal to
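A worked instance with hypothetical sequences (not the ones from the slide) may help. It assumes that the distance is the gap between the maximal similarity n - 1 and the actual similarity, and it writes the similarity as S and the distance as D.

```latex
\begin{align*}
\mathbf{x} &= (1, 2, 2, 0, 3, 1, 0), \qquad \mathbf{y} = (1, 2, 0, 0, 3, 1, 2), \qquad n = 7, \\
\text{common Hamming subsequence} &= (1, 2, *, 0, 3, 1, *), \\
S(\mathbf{x}, \mathbf{y}) &= \underbrace{1}_{\text{positions } 1,2} + \underbrace{1}_{\text{positions } 4,5} + \underbrace{1}_{\text{positions } 5,6} = 3, \\
D(\mathbf{x}, \mathbf{y}) &= (n - 1) - S(\mathbf{x}, \mathbf{y}) = 6 - 3 = 3.
\end{align*}
```

The two sequences agree in 5 positions, but only 3 pairs of adjacent positions agree, so the stem Hamming similarity is 3 rather than 5.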
MOTIVATION
■ Study of DNA codes was motivated by the needs of DNA
computing and biomolecular nanotechnology
■ In these applications, one must form a collection of DNA strands, which will serve as markers, while the collection of reverse-complement (to those first strands) DNA strands will be utilized for reading, or recognition

Coding Strands for Ligation    Probing Complement Strands for Reading
TACGCGACTTTC                   GAAAGTCGCGTA
ATCAAACGATGC                   GCATCGTTTGAT
TGTGTGCTCGTC                   GACGAGCACACA
ATTTTTGCGTTA                   TAACGCAAAAAT
CACTAAATACAA                   TTGTATTTAGTG
GAAAAAGAAGAA                   TTCTTCTTTTTC

1. Collection of mutually reverse-complement pairs
2. No self-reverse-complement words
3. No cross-hybridization
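The first two listed properties can be checked directly on the strands from this slide. The sketch below is only an illustration; it assumes the i-th coding strand is paired with the i-th probing strand, and the helper name `reverse_complement` is hypothetical.

```python
# Translation table for Watson-Crick complements.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand):
    """Complement every base, then reverse the strand."""
    return strand.translate(COMPLEMENT)[::-1]

coding = ["TACGCGACTTTC", "ATCAAACGATGC", "TGTGTGCTCGTC",
          "ATTTTTGCGTTA", "CACTAAATACAA", "GAAAAAGAAGAA"]
probing = ["GAAAGTCGCGTA", "GCATCGTTTGAT", "GACGAGCACACA",
           "TAACGCAAAAAT", "TTGTATTTAGTG", "TTCTTCTTTTTC"]

# Property 1: each probing strand is the reverse complement of its coding strand.
pairs_ok = all(reverse_complement(c) == p for c, p in zip(coding, probing))
# Property 2: no strand coincides with its own reverse complement.
no_self_rc = all(reverse_complement(s) != s for s in coding + probing)
print(pairs_ok, no_self_rc)  # True True
```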
DNA CODE
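■ $X = \{\mathbf{x}(1), \mathbf{x}(2), \dots, \mathbf{x}(N)\}$ is a code of length $n$ and size $N$, where $\mathbf{x}(j)$, $j = 1, \dots, N$, are the codewords of code $X$
■ $X$ is called a DNA $(n, D)$-code based on stem Hamming similarity if the following 2 conditions are fulfilled:
1. For any $j$, there exists $j' \ne j$ such that $\mathbf{x}(j')$ is the reverse complement $\tilde{\mathbf{x}}(j)$ of $\mathbf{x}(j)$
2. For any $j \ne j'$, the stem Hamming similarity satisfies $S(\mathbf{x}(j), \mathbf{x}(j')) \le n - 1 - D$
■ Let $N(n, D)$ be the maximal size of DNA $(n, D)$-codes. The rate of DNA codes is defined in terms of $N(n, D)$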
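The two conditions can be turned into a small checker. This is a sketch under stated assumptions, not the authors' procedure: the complement of a letter a is taken to be q - 1 - a, condition 2 is read as a cap of n - 1 - D on the stem Hamming similarity of distinct codewords, the codewords are assumed distinct, and all names (`is_dna_code` and the helpers) are hypothetical.

```python
def reverse_complement(x, q):
    """Assumed complement rule a -> q - 1 - a, applied in reverse order."""
    return tuple(q - 1 - a for a in reversed(x))

def stem_similarity(x, y):
    """Number of adjacent position pairs where x and y agree."""
    return sum(1 for i in range(len(x) - 1)
               if x[i] == y[i] and x[i + 1] == y[i + 1])

def is_dna_code(code, q, D):
    """Check both conditions for a list of distinct codewords of equal length."""
    words = set(code)
    n = len(code[0])
    # Condition 1: every codeword has a different codeword as its reverse complement.
    for x in code:
        rc = reverse_complement(x, q)
        if rc == x or rc not in words:
            return False
    # Condition 2 (assumed form): distinct codewords share at most n - 1 - D stems.
    for i, x in enumerate(code):
        for y in code[i + 1:]:
            if stem_similarity(x, y) > n - 1 - D:
                return False
    return True

pair = [(0, 0, 3, 0), (3, 0, 3, 3)]  # mutually reverse-complement, one common stem
print(is_dna_code(pair, q=4, D=2))   # True
print(is_dna_code(pair, q=4, D=3))   # False
```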
Q-ARY REED-MULLER CODES
■ q-ary Reed-Muller code:
Let [...] Define mapping [...]
Reed-Muller code [...] of order [...] with [...] is the image [...]
■ Reed-Muller code [...] of order 1 satisfies the condition of reverse-complementarity
■ It may contain self-reverse-complement words, which should be excluded from the final construction
EXAMPLE OF CODE
Let q=4 and m=1
0 1 2 3
0 1 2 3
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
0 1 2 3
1 2 3 0
2 3 0 1
3 0 1 2
0 2 0 2
1 3 1 3
2 0 2 0
3 1 3 1
0 3 2 1
1 0 3 2
2 1 0 3
3 2 1 0
Column headings: Mutually reverse-complement, Self-reverse-complement
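The 16 words of this example can be regenerated and classified with a short script. Two assumptions are made here and are not confirmed by the slide: each codeword is the value vector of a*u + b computed modulo 4 for u = 0, 1, 2, 3 (this reproduces the rows listed), and the complement of a letter c is q - 1 - c. All names are hypothetical.

```python
from itertools import product

q = 4

def reverse_complement(word):
    """Assumed complement rule c -> q - 1 - c, applied in reverse order."""
    return tuple(q - 1 - c for c in reversed(word))

# Value vectors of a*u + b modulo q at the points u = 0, 1, ..., q - 1.
codewords = [tuple((a * u + b) % q for u in range(q))
             for a, b in product(range(q), repeat=2)]

self_rc = [w for w in codewords if reverse_complement(w) == w]
mutual = [w for w in codewords
          if reverse_complement(w) != w and reverse_complement(w) in codewords]

print(self_rc)      # [(0, 1, 2, 3), (2, 3, 0, 1), (1, 0, 3, 2), (3, 2, 1, 0)]
print(len(mutual))  # 12 words, forming 6 mutually reverse-complement pairs
```

Under these assumptions, dropping the 4 self-reverse-complement words leaves 12 words closed under reverse complementation.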
OUTLINE
1. DNA background
2. Modeling the hybridization energy
3. DNA codes
4. Example of DNA codes
5. Bounds on the rate of DNA codes
a. Lower Gilbert-Varshamov bound
b. Upper bounds
c. Graphs
6. On sphere sizes
7. Possible generalizations
8. Bibliography
RANDOM CODING
■ $\mathbf{x}$ and $\mathbf{y}$ are independent identically distributed random sequences with uniform distribution on the set of q-ary sequences of length $n$
■ Define
■ Probability distribution of
■ Sum of
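For small parameters the distribution of the similarity of two independent uniform sequences can be computed exactly by enumeration. The sketch below is an illustration only; it assumes the similarity counts adjacent position pairs on which the two sequences agree, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def stem_similarity(x, y):
    """Number of adjacent position pairs where x and y agree."""
    return sum(1 for i in range(len(x) - 1)
               if x[i] == y[i] and x[i + 1] == y[i + 1])

def similarity_distribution(q, n):
    """Exact distribution of the similarity of two independent uniform sequences."""
    counts = {}
    for x in product(range(q), repeat=n):
        for y in product(range(q), repeat=n):
            s = stem_similarity(x, y)
            counts[s] = counts.get(s, 0) + 1
    total = q ** (2 * n)
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}

dist = similarity_distribution(q=2, n=4)
mean = sum(s * p for s, p in dist.items())
print(dist[0])  # 1/2
print(mean)     # 3/4, which equals (n - 1) / q**2
```

The mean agrees with linearity of expectation: each of the n - 1 adjacent pairs matches with probability 1/q^2.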
GILBERT-VARSHAMOV BOUND
■ Let
. Introduce
■ We construct a random code as a collection of independent random variables and their reverse complements. This leads to the necessity of a special random coding technique for DNA codes
■ One can check, that
■ Random coding bound (Gilbert-Varshamov bound):
if
then
CALCULATION OF THE BOUND
■
are dependent variables:
■
do not constitute a Markov chain:
and
both depend on
and
vs.
■
are deterministic functions of Markov chain
:
and
■ We cannot apply the standard technique used in the case of Hamming similarity
■ We have to use the Large Deviations Principle for Markov chains
for
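The dependence can be made explicit with a short calculation. The notation is an assumption: $s_i$ is taken to be the indicator that two independent uniform q-ary sequences $\mathbf{x}$ and $\mathbf{y}$ agree at positions $i$ and $i+1$.

```latex
\begin{align*}
\Pr\{s_i = 1\} &= \Pr\{x_i = y_i\}\,\Pr\{x_{i+1} = y_{i+1}\} = \frac{1}{q^2}, \\
\Pr\{s_i = 1,\ s_{i+1} = 1\} &= \Pr\{x_i = y_i,\ x_{i+1} = y_{i+1},\ x_{i+2} = y_{i+2}\} = \frac{1}{q^3}
\;\ne\; \frac{1}{q^4} = \Pr\{s_i = 1\}\,\Pr\{s_{i+1} = 1\}.
\end{align*}
```

Neighboring indicators share the event $x_{i+1} = y_{i+1}$, so they are positively correlated, which is why the sum cannot be handled as a sum of independent terms.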
GILBERT-VARSHAMOV BOUND
■ Introduce
■ Gilbert-Varshamov lower bound on the rate
If
and
then
is a decreasing
:
, where
-convex function with
UPPER BOUNDS
■ Plotkin upper bound:
If
, then
and
if
■ Elias upper bound:
If
, then
, where
is presented by parametric equation
■ Elias bound improves Plotkin bound for small values
of
. We calculated
and
.
BOUNDS ON THE RATE (Q=2)
[Figure: Bound on the rate of DNA code, q=2. Curves: Gilbert-Varshamov bound, Plotkin bound, Hamming bound, Elias bound. Horizontal axis from 0 to 1, with the value 0.75 marked; vertical axis from 0 to 1.2.]
BOUNDS ON THE RATE (Q=4)
[Figure: Bound on the rate of DNA code, q=4. Curves: Gilbert-Varshamov bound, Plotkin bound, Hamming bound, Elias bound. Horizontal axis from 0 to 1, with the value 0.9375 marked; vertical axis from 0 to 1.2.]
FIBONACCI NUMBERS
■ q-ary Fibonacci numbers are defined by recurrent equation
with initial conditions
■ q-ary Fibonacci numbers may also be calculated as sum
■ A q-ary Fibonacci number may be interpreted as the number of q-ary sequences of a given length which do not contain 2-stems of the form (0,0)
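The combinatorial interpretation can be checked numerically. The indexing below is an assumption and may differ from the slide's: c(n) denotes the number of q-ary sequences of length n without a (0,0) pair, which satisfies c(n) = (q - 1)(c(n-1) + c(n-2)) with c(0) = 1 and c(1) = q, since a valid sequence ends either in a nonzero letter or in a nonzero letter followed by 0. Function names are hypothetical.

```python
from itertools import product

def count_by_recurrence(q, n):
    """c(n) = (q - 1) * (c(n - 1) + c(n - 2)), with c(0) = 1 and c(1) = q."""
    prev, cur = 1, q
    if n == 0:
        return prev
    for _ in range(n - 1):
        prev, cur = cur, (q - 1) * (cur + prev)
    return cur

def count_by_enumeration(q, n):
    """Brute-force count of sequences with no two adjacent zeros."""
    return sum(1 for w in product(range(q), repeat=n)
               if all((a, b) != (0, 0) for a, b in zip(w, w[1:])))

print([count_by_recurrence(2, n) for n in range(7)])  # [1, 2, 3, 5, 8, 13, 21]
print(count_by_recurrence(4, 3), count_by_enumeration(4, 3))  # 57 57
```

For q = 2 the counts are the classical Fibonacci numbers.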
COMBINATORIAL CALCULATION
■ The space with the stem Hamming metric is homogeneous, i.e., the volume of a sphere does not depend on its center
■ Define
for any
■ Consider a sphere with center $(0, 0, \dots, 0)$. Any sequence at the maximal distance from the center must have no common 2-stems (pairs) with it. In other words, it must have no 2-stems of type (0,0). Thus, the number of such sequences is a q-ary Fibonacci number
■ Sphere sizes for other radii may be obtained using the same technique with some corresponding modifications
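Homogeneity can be verified by brute force for small parameters. The sketch assumes the similarity counts adjacent position pairs on which two sequences agree. It tabulates, for every center, how many sequences have each similarity value, and checks that the table is the same for all centers. Names are hypothetical.

```python
from collections import Counter
from itertools import product

def stem_similarity(x, y):
    """Number of adjacent position pairs where x and y agree."""
    return sum(1 for i in range(len(x) - 1)
               if x[i] == y[i] and x[i + 1] == y[i + 1])

def similarity_profile(center, q):
    """How many sequences have each similarity value with the given center."""
    n = len(center)
    return Counter(stem_similarity(center, y) for y in product(range(q), repeat=n))

q, n = 3, 4
profiles = {tuple(sorted(similarity_profile(c, q).items()))
            for c in product(range(q), repeat=n)}
print(len(profiles))  # 1: the profile does not depend on the center

zero_profile = similarity_profile((0,) * n, q)
print(zero_profile[0])  # 60 ternary sequences of length 4 with no (0,0) pair
```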
GRAPH OF PROBABILITIES
[Figure: Probability distribution, plotted for n = 5, 10, 20, 30, 40. Horizontal axis from 0 to 15; vertical axis from 0 to 1.]
B-STEM HAMMING SIMILARITY
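■ $b$-stem Hamming similarity: instead of counting the number of 2-stems (pairs), count the number of $b$-stems:
$S_b(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n-b+1} s_i^{(b)}(\mathbf{x}, \mathbf{y})$,
where $s_i^{(b)}(\mathbf{x}, \mathbf{y}) = 1$ if $x_{i+j} = y_{i+j}$ for all $j = 0, 1, \dots, b-1$, and $0$ otherwise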
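A sketch of this generalization in Python, assuming a b-stem is a block of b consecutive positions on which the two sequences agree (so b = 2 gives the pair-wise similarity and b = 1 the ordinary Hamming similarity). The function name and sample sequences are hypothetical.

```python
def b_stem_similarity(x, y, b):
    """Number of windows of b consecutive positions where x and y agree."""
    if len(x) != len(y) or b < 1:
        raise ValueError("need equal lengths and b >= 1")
    return sum(1 for i in range(len(x) - b + 1) if x[i:i + b] == y[i:i + b])

x = (0, 1, 2, 3, 0, 1)
y = (0, 1, 2, 0, 0, 1)
print(b_stem_similarity(x, y, 1))  # 5 matching positions
print(b_stem_similarity(x, y, 2))  # 3 common 2-stems
print(b_stem_similarity(x, y, 3))  # 1 common 3-stem
```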
WEIGHTED STEM HAMMING SIMILARITY
■ Weighted stem Hamming similarity: assign a weight to each type of q-ary pair and take it into account while calculating the sum
■ Let
be a weight function such that
■ Similarity is defined as follows:
, where
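One possible reading of the weighted version is sketched below: every common pair of adjacent symbols contributes the weight of that pair instead of 1. Both this reading and the sample weight function (2 for a pair of equal letters, 1 otherwise) are assumptions made for illustration, not values from the slide.

```python
def weighted_stem_similarity(x, y, weight):
    """Sum of weight(a, b) over common adjacent pairs (a, b) of x and y."""
    total = 0
    for i in range(len(x) - 1):
        if x[i] == y[i] and x[i + 1] == y[i + 1]:
            total += weight(x[i], x[i + 1])
    return total

def sample_weight(a, b):
    """Hypothetical weight function: pairs of equal letters count double."""
    return 2 if a == b else 1

x = (0, 0, 1, 2, 2, 3)
y = (0, 0, 1, 3, 2, 3)
print(weighted_stem_similarity(x, y, sample_weight))     # 4
print(weighted_stem_similarity(x, y, lambda a, b: 1))    # 3, the unweighted case
```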
INSERTION-DELETION STEM SIMILARITY
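■ Insertion-deletion stem similarity: allow loops and shifts at the DNA duplex
[Figure: DNA duplex with a shift and a loop]
■ $\mathbf{z}$ is a common block subsequence between $\mathbf{x}$ and $\mathbf{y}$ if $\mathbf{z}$ is an ordered collection of non-overlapping common $(\mathbf{x}, \mathbf{y})$-blocks of length [...]
1. A common $(\mathbf{x}, \mathbf{y})$-block of length [...] is a subsequence of $\mathbf{x}$ and $\mathbf{y}$, consisting of consecutive elements of $\mathbf{x}$ and $\mathbf{y}$
■ [...] is the set of all common block subsequences between $\mathbf{x}$ and $\mathbf{y}$
■ [...] is the minimal number of blocks of consecutive elements of $\mathbf{x}$ and $\mathbf{y}$ in the given subsequence
■ Similarity is defined as follows: [...]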
BIBLIOGRAPHY
Probability theory and Large Deviation Principle
■ V.N. Tutubalin, The Theory of Probability and Random Processes. Moscow:
Publishing House of Moscow State University, 1992 (in Russian).
■ A. Dembo, O. Zeitouni, Large Deviations Techniques and Applications.
Boston, MA: Jones and Bartlett, 1993.
DNA codes
■ D'yachkov A.G., Macula A.J., Torney D.C., Vilenkin P.A., White P.S., Ismagilov
I.K., Sarbayev R.S., On DNA Codes. Problemy Peredachi Informatsii, 2005, V.
41, N. 4, P. 57-77, (in Russian). English translation: Problems of Information
Transmission, V. 41, N. 4, 2005, P. 349-367.
■ Bishop M.A.,D'yachkov A.G., Macula A.J., Renz T.E., Rykov V.V., Free Energy
Gap and Statistical Thermodynamic Fidelity of DNA Codes. Journal of
Computational Biology, 2007, V. 14, N. 8, P. 1088-1104.
■ A. D’yachkov, A. Macula, T. Renz and V. Rykov, Random Coding Bounds for
DNA Codes Based on Fibonacci Ensembles of DNA Sequences. Proc. of
2008 IEEE International Symposium on Information Theory, Toronto,
Canada, 2008, in print.