DNA CODES BASED ON HAMMING STEM SIMILARITIES

A.G. Dyachkov, A.N. Voronina
Dept. of Probability Theory, MechMath., Moscow State University, Russia

OUTLINE
1. DNA background
2. Modeling the hybridization energy
3. DNA codes
4. Example of code construction
5. Bounds on the rate of DNA codes
6. On sphere sizes
7. Further generalizations
8. Bibliography

DNA STRANDS
■ DNA strands consist of nucleotides, each composed of a sugar-phosphate backbone and one base.
■ There are 4 types of bases: A (adenine), C (cytosine), G (guanine), T (thymine).
■ Base A is said to be complementary to T, and C to G.
■ DNA strands are oriented. Thus, for example, strand AATG is different from strand GTAA.
■ Two oppositely directed strands containing complementary bases at corresponding positions are called reverse-complement strands.
[Figure: a single DNA strand running from the 5' end to the 3' end, with labels for a nucleotide, the bases, and the sugar-phosphate backbone; an example pair of reverse-complement strands, labeled "The strands have different directions".]

HYBRIDIZATION
■ Two oppositely directed DNA strands are capable of coalescing into a duplex, or double helix.
■ The process of forming a duplex is referred to as hybridization.
■ The basis of this process is the formation of hydrogen bonds between complementary bases.
■ A duplex formed of reverse-complement strands is called a Watson-Crick duplex. For example:
    A A C G T
    T T G C A

CROSS-HYBRIDIZATION AND ENERGY OF HYBRIDIZATION
■ Hybridization is not a perfect process, however, and non-complementary strands can also hybridize.
■ An example of cross-hybridization (the bases at positions 4-5 and 8-9 are not complementary):
    A A C C T G G C A A C
    T T G C G C C A A T G
■ The indicator of the "strength", or stability, of a formed duplex is its energy of hybridization.
Its value depends on the total number of bonds formed.
■ Thus, the greatest hybridization energy is obtained when a Watson-Crick duplex is formed rather than in the case of cross-hybridization.

LONE BONDS AND "PAIRWISE" METRIC
■ If a pair of bases is bonded but neither of its neighboring pairs is bonded, the bond is called a lone bond. For example:
    A A C G C A C T
    T T C C A T G A
  Here the bond G-C at position 4 is a lone bond and does not contribute to the hybridization energy. A pair of adjacent bonds adds 1 to the total hybridization energy, and a triplet of adjacent bonds (positions 6-8) is counted as 2 adjacent pairs, so the hybridization energy is 3.
■ A lone bond is too "weak" to form a strong connection, so it does not contribute much to the total energy of hybridization.
■ Moreover, the energy of hybridization in fact depends not on the number of bonds formed but on the number of pairs of adjacent bonds.
■ Thus, if we suppose that the hybridization energy is equal to the number of such pairs, then in the example above it is equal to 3, not 5 or 6.

NOTATIONS
General notations
■ Let $q$ be an arbitrary even integer.
■ Denote by $\mathcal{A} = \{0, 1, \dots, q-1\}$ the standard alphabet of size $q$.
■ Denote by $\lfloor u \rfloor$ ($\lceil u \rceil$) the largest (smallest) integer $\le u$ ($\ge u$).
Reverse-complementation
■ For any letter $x \in \mathcal{A}$, define $\bar{x} = (q-1) - x$, the complement of the letter $x$.
■ For any q-ary sequence $\mathbf{x} = (x_1, x_2, \dots, x_n) \in \mathcal{A}^n$, define its reverse complement $\tilde{\mathbf{x}} = (\bar{x}_n, \bar{x}_{n-1}, \dots, \bar{x}_1)$. Note that if $\mathbf{y} = \tilde{\mathbf{x}}$, then $\mathbf{x} = \tilde{\mathbf{y}}$, for any $\mathbf{x} \in \mathcal{A}^n$.

STEM HAMMING SIMILARITY
For two q-ary sequences $\mathbf{x} = (x_1, \dots, x_n)$ and $\mathbf{y} = (y_1, \dots, y_n)$ of length $n$, the stem Hamming similarity is equal to
$$S(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n-1} s_i(\mathbf{x}, \mathbf{y}),$$
where $s_i(\mathbf{x}, \mathbf{y}) = 1$ if $x_i = y_i$ and $x_{i+1} = y_{i+1}$, and $s_i(\mathbf{x}, \mathbf{y}) = 0$ otherwise.
■ $S(\mathbf{x}, \mathbf{y})$ is equal to the total number of common 2-blocks containing adjacent symbols in the longest common Hamming subsequence of $\mathbf{x}$ and $\mathbf{y}$.

HAMMING VS.
STEM HAMMING
■ Hamming similarity is element-wise, while stem Hamming similarity is pair-wise (though still additive).
■ Re-ordering the elements of both sequences in the same way does not influence Hamming similarity, but may change stem Hamming similarity.
Example: […]

STEM HAMMING DISTANCE
■ Note that $S(\mathbf{x}, \mathbf{y}) \le S(\mathbf{x}, \mathbf{x}) = n - 1$, with equality if and only if $\mathbf{x} = \mathbf{y}$.
■ The stem Hamming distance between $\mathbf{x}$ and $\mathbf{y}$ is $D(\mathbf{x}, \mathbf{y}) = S(\mathbf{x}, \mathbf{x}) - S(\mathbf{x}, \mathbf{y}) = n - 1 - S(\mathbf{x}, \mathbf{y})$.
Example: Let $\mathbf{x}$ = […] and $\mathbf{y}$ = […].
■ The longest common Hamming subsequence is […]
■ Stem Hamming similarity is equal to […]
■ Stem Hamming distance is equal to […]

MOTIVATION
■ The study of DNA codes was motivated by the needs of DNA computing and biomolecular nanotechnology.
■ In these applications, one must form a collection of DNA strands that will serve as markers, while the collection of strands reverse-complementary to them will be used for reading, or recognition.

    Coding Strands for Ligation    Probing Complement Strands for Reading
    TACGCGACTTTC                   GAAAGTCGCGTA
    ATCAAACGATGC                   GCATCGTTTGAT
    TGTGTGCTCGTC                   GACGAGCACACA
    ATTTTTGCGTTA                   TAACGCAAAAAT
    CACTAAATACAA                   TTGTATTTAGTG
    GAAAAAGAAGAA                   TTCTTCTTTTTC

1. Collection of mutually reverse-complement pairs
2. No self-reverse-complement words
3. No cross-hybridization

DNA CODE
■ $X = \{\mathbf{x}(1), \mathbf{x}(2), \dots, \mathbf{x}(N)\}$ is a code of length $n$ and size $N$, where $\mathbf{x}(j) \in \mathcal{A}^n$, $j = 1, \dots, N$, are the codewords of code $X$.
■ Code $X$ is called a DNA $(n, D)$-code based on stem Hamming similarity if the following 2 conditions are fulfilled:
1. For any $j$, there exists $j' \ne j$ such that $\mathbf{x}(j') = \tilde{\mathbf{x}}(j)$.
2. For any $j \ne j'$, $S(\mathbf{x}(j), \mathbf{x}(j')) \le n - D - 1$.
■ Let $N_q(n, D)$ be the maximal size of DNA $(n, D)$-codes. […] is called the rate of DNA codes.
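The definitions of reverse complement, stem Hamming similarity, stem Hamming distance, and DNA code can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the complement of a letter x is taken to be q - 1 - x (so with A, C, G, T encoded as 0, 1, 2, 3 the pairs A-T and C-G are complementary), and the second code condition is read as "pairwise stem Hamming distance at least D". The function names and the base encoding are illustrative and do not come from the slides.

```python
from itertools import combinations
from typing import Iterable, List, Sequence

# Assumed encoding: complement x -> 3 - x maps A<->T and C<->G.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(strand: str) -> List[int]:
    """Turn a string over A, C, G, T into a 4-ary sequence."""
    return [ENCODE[base] for base in strand]


def reverse_complement(x: Sequence[int], q: int = 4) -> List[int]:
    """Reverse the sequence and complement every letter (x -> q - 1 - x)."""
    return [q - 1 - a for a in reversed(x)]


def stem_similarity(x: Sequence[int], y: Sequence[int]) -> int:
    """Count positions i where x and y agree at both i and i + 1."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(
        1
        for i in range(len(x) - 1)
        if x[i] == y[i] and x[i + 1] == y[i + 1]
    )


def stem_distance(x: Sequence[int], y: Sequence[int]) -> int:
    """Stem Hamming distance: n - 1 minus the stem Hamming similarity."""
    return len(x) - 1 - stem_similarity(x, y)


def is_dna_code(code: Iterable[Sequence[int]], d: int, q: int = 4) -> bool:
    """Check both DNA-code conditions for distance parameter d."""
    words = {tuple(w) for w in code}
    for w in words:
        rc = tuple(reverse_complement(w, q))
        # The reverse complement must be a different word of the code.
        if rc == w or rc not in words:
            return False
    return all(stem_distance(u, v) >= d for u, v in combinations(words, 2))


# Lone-bond example: top strand AACGCACT, bottom strand read in its own
# direction (the aligned row TTCCATGA reversed).
top = encode("AACGCACT")
bottom = encode("AGTACCTT")
print(stem_similarity(top, reverse_complement(bottom)))  # 3
```

In this reading, the hybridization energy of two strands is modeled as the stem Hamming similarity between one strand and the reverse complement of the other, which is why the lone-bond example evaluates to 3.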
Q-ARY REED-MULLER CODES
■ q-ary Reed-Muller code: Let […]. Define the mapping […]. The Reed-Muller code of order […] is the image of […].
■ The Reed-Muller code of order 1 with […] satisfies the condition of reverse-complementarity.
■ It may contain self-reverse-complement words, which should be excluded from the final construction.

EXAMPLE OF CODE
Let q = 4 and m = 1.

         0         1         2         3
    0    0 0 0 0   1 1 1 1   2 2 2 2   3 3 3 3
    1    0 1 2 3   1 2 3 0   2 3 0 1   3 0 1 2
    2    0 2 0 2   1 3 1 3   2 0 2 0   3 1 3 1
    3    0 3 2 1   1 0 3 2   2 1 0 3   3 2 1 0

With the complement $\bar{x} = 3 - x$, the words 0123, 2301, 1032 and 3210 are self-reverse-complement; the remaining 12 words form mutually reverse-complement pairs.

OUTLINE
1. DNA background
2. Modeling the hybridization energy
3. DNA codes
4. Example of DNA codes
5. Bounds on the rate of DNA codes
   a. Lower Gilbert-Varshamov bound
   b. Upper bounds
   c. Graphs
6. On sphere sizes
7. Possible generalizations
8. Bibliography

RANDOM CODING
■ $\mathbf{x}$ and $\mathbf{y}$ are independent identically distributed random sequences with uniform distribution on $\mathcal{A}^n$.
■ Define […]
■ Probability distribution of […]
■ Sum of […]

GILBERT-VARSHAMOV BOUND
■ Let […]. Introduce […]
■ We construct a random code as a collection of independent random variables and their reverse complements. This leads to the necessity of a special random coding technique for DNA codes.
■ One can check that […]
■ Random coding bound (Gilbert-Varshamov bound): if […], then […]

CALCULATION OF THE BOUND
■ […] are dependent variables: […]
■ […] do not constitute a Markov chain: […] and […] both depend on […]
■ […] are deterministic functions of the Markov chain […]: […] and […]
■ We cannot apply the standard technique as in the case of Hamming similarity.
■ We have to use the Large Deviations Principle for Markov chains for […]

GILBERT-VARSHAMOV BOUND
■ Introduce […]
■ Gilbert-Varshamov lower bound on the rate: […], where […]. If […] and […], then […] is a decreasing ∪-convex function with […]
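The dependence between the summands of the stem Hamming similarity can be made concrete for small n. For independent uniform sequences x and y, the indicators z_i = 1 if x_i = y_i are independent with P(z_i = 1) = 1/q, and each summand of the similarity is the product z_i z_{i+1}, so neighboring summands share an indicator. The sketch below computes the exact distribution of the sum by dynamic programming over the pair (last indicator, running sum). The symbol z_i and the function name are illustrative assumptions, and this small-n computation is not the large-deviations argument used for the bound.

```python
from fractions import Fraction
from typing import Dict, List, Tuple


def stem_similarity_distribution(n: int, q: int) -> List[Fraction]:
    """Return [P(S = 0), ..., P(S = n - 1)] for uniform random x, y of length n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    p = Fraction(1, q)  # probability that x_i equals y_i
    # state[(z, s)]: probability that the last match indicator is z
    # and that s stems have been counted so far.
    state: Dict[Tuple[int, int], Fraction] = {(1, 0): p, (0, 0): 1 - p}
    for _ in range(n - 1):
        nxt: Dict[Tuple[int, int], Fraction] = {}
        for (z, s), prob in state.items():
            for z_next, p_next in ((1, p), (0, 1 - p)):
                # A stem is counted only when two consecutive positions match.
                key = (z_next, s + (z & z_next))
                nxt[key] = nxt.get(key, Fraction(0)) + prob * p_next
        state = nxt
    dist = [Fraction(0)] * n
    for (_, s), prob in state.items():
        dist[s] += prob
    return dist


dist = stem_similarity_distribution(5, 4)
mean = sum(s * prob for s, prob in enumerate(dist))
print(mean)  # 1/4, i.e. (n - 1) / q**2
```

Exact rational arithmetic is used so that the probabilities sum to exactly 1 and the mean can be compared with (n - 1)/q^2 without rounding.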
UPPER BOUNDS
■ Plotkin upper bound: if […], then […], and […] if […]
■ Elias upper bound: if […], then […], where […] is presented by the parametric equation […]
■ The Elias bound improves the Plotkin bound for small values of […]. We calculated […] and […].

BOUNDS ON THE RATE (Q=2)
[Figure: bounds on the rate of DNA codes, q = 2. Curves: Gilbert-Varshamov bound, Plotkin bound, Hamming bound, Elias bound. Both axes run from 0 to 1; the value 0.75 is marked on the horizontal axis.]

BOUNDS ON THE RATE (Q=4)
[Figure: bounds on the rate of DNA codes, q = 4. Curves: Gilbert-Varshamov bound, Plotkin bound, Hamming bound, Elias bound. Both axes run from 0 to 1; the value 0.9375 is marked on the horizontal axis.]

FIBONACCI NUMBERS
■ q-ary Fibonacci numbers are defined by the recurrent equation […] with initial conditions […]
■ q-ary Fibonacci numbers may also be calculated as the sum […]
■ The q-ary Fibonacci number […] may be interpreted as the number of q-ary sequences of length […] which do not contain 2-stems of the form (0,0).

COMBINATORIAL CALCULATION
■ The space $\mathcal{A}^n$ with the stem Hamming metric is homogeneous, i.e., the volume of a sphere does not depend on its center.
■ Define […] for any […]
■ Consider a sphere with center $\mathbf{0} = (0, 0, \dots, 0)$. Any sequence […] must have no common 2-stems with […]. In other words, it must have no 2-stems (pairs) of type (0,0). Thus, […]
■ Sphere sizes for other […] may be obtained using the same technique with some corresponding modifications.

GRAPH OF PROBABILITIES
[Figure: probability distribution, with curves for n = 5, 10, 20, 30, 40. Horizontal axis from 0 to 15, vertical axis from 0 to 1.]
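The combinatorial interpretation of the q-ary Fibonacci numbers suggests a simple recurrence, sketched here under stated assumptions. Let F(n) denote the number of q-ary sequences of length n that contain no 2-stem (0,0). Conditioning on the first symbol gives F(n) = (q - 1)(F(n-1) + F(n-2)) with F(0) = 1 and F(1) = q: a nonzero first symbol can be followed by any valid sequence of length n - 1, while a leading 0 must be followed by a nonzero symbol and then a valid sequence of length n - 2. The indexing and initial conditions used on the slide may differ from this choice, and the function names are illustrative. The sketch checks the recurrence against brute-force enumeration.

```python
from itertools import product


def count_without_zero_pairs(n: int, q: int) -> int:
    """Number of q-ary sequences of length n with no two adjacent zeros."""
    if n == 0:
        return 1
    prev, cur = 1, q  # F(0), F(1) in the indexing assumed here
    for _ in range(n - 1):
        prev, cur = cur, (q - 1) * (cur + prev)
    return cur


def brute_force_count(n: int, q: int) -> int:
    """Same count, obtained by enumerating all q**n sequences."""
    return sum(
        1
        for y in product(range(q), repeat=n)
        if all(not (y[i] == 0 and y[i + 1] == 0) for i in range(n - 1))
    )


for q in (2, 4):
    for n in range(7):
        assert count_without_zero_pairs(n, q) == brute_force_count(n, q)

print([count_without_zero_pairs(n, 2) for n in range(6)])  # [1, 2, 3, 5, 8, 13]
```

For q = 2 the recurrence reduces to the ordinary Fibonacci numbers. In terms of the sphere discussion, this count is the number of sequences that share no 2-stem with the all-zero center, i.e. the sequences at stem Hamming distance n - 1 from it.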
B-STEM HAMMING SIMILARITY
■ b-stem Hamming similarity: instead of counting the number of 2-stems (pairs), count the number of b-stems: […], where […]

WEIGHTED STEM HAMMING SIMILARITY
■ Weighted stem Hamming similarity: assign a weight to each type of q-ary pair and take it into account when calculating the sum.
■ Let […] be a weight function such that […]
■ Similarity is defined as follows: […], where […]

INSERTION-DELETION STEM SIMILARITY
■ Insertion-deletion stem similarity: allow loops and shifts in the DNA duplex.
■ […] is a common block subsequence between $\mathbf{x}$ and $\mathbf{y}$ if it is an ordered collection of non-overlapping common $(\mathbf{x}, \mathbf{y})$-blocks of length […]
■ A common $(\mathbf{x}, \mathbf{y})$-block of length […] is a subsequence of $\mathbf{x}$ and $\mathbf{y}$ consisting of consecutive elements of $\mathbf{x}$ and $\mathbf{y}$.
■ […] is the set of all common block subsequences between $\mathbf{x}$ and $\mathbf{y}$.
■ […] is the minimal number of blocks of consecutive elements in the given subsequence.
■ Similarity is defined as follows: […]

BIBLIOGRAPHY
Probability theory and Large Deviation Principle
■ V.N. Tutubalin, The Theory of Probability and Random Processes. Moscow: Publishing House of Moscow State University, 1992 (in Russian).
■ A. Dembo, O. Zeitouni, Large Deviations Techniques and Applications. Boston, MA: Jones and Bartlett, 1993.
DNA codes
■ D'yachkov A.G., Macula A.J., Torney D.C., Vilenkin P.A., White P.S., Ismagilov I.K., Sarbayev R.S., On DNA Codes. Problemy Peredachi Informatsii, 2005, V. 41, N. 4, P. 57-77 (in Russian). English translation: Problems of Information Transmission, 2005, V. 41, N. 4, P. 349-367.
■ Bishop M.A., D'yachkov A.G., Macula A.J., Renz T.E., Rykov V.V., Free Energy Gap and Statistical Thermodynamic Fidelity of DNA Codes.
Journal of Computational Biology, 2007, V. 14, N. 8, P. 1088-1104.
■ A. D'yachkov, A. Macula, T. Renz and V. Rykov, Random Coding Bounds for DNA Codes Based on Fibonacci Ensembles of DNA Sequences. Proc. of 2008 IEEE International Symposium on Information Theory, Toronto, Canada, 2008, in print.