Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Zinc finger nuclease wikipedia , lookup
DNA repair protein XRCC4 wikipedia , lookup
Homologous recombination wikipedia , lookup
DNA sequencing wikipedia , lookup
DNA replication wikipedia , lookup
DNA profiling wikipedia , lookup
DNA polymerase wikipedia , lookup
DNA nanotechnology wikipedia , lookup
Microsatellite wikipedia , lookup
1 The Art of DNA Strings: Sixteen Years of DNA Coding Theory Dixita Limbachiya, Bansari Rao and Manish K. Gupta Dhirubhai Ambani Institute of Information and Communication Technology Gandhinagar, Gujarat, 382007 India Email:[email protected], bansari s [email protected] and [email protected] arXiv:1607.00266v1 [cs.IT] 1 Jul 2016 Abstract The idea of computing with DNA was given by Tom Head in 1987, however in 1994 in a seminal paper, the actual successful experiment for DNA computing was performed by Adleman. The heart of the DNA computing is the DNA hybridization, however, it is also the source of errors. Thus the success of the DNA computing depends on the error control techniques. The classical coding theory techniques have provided foundation for the current information and communication technology (ICT). Thus it is natural to expect that coding theory will be the foundational subject for the DNA computing paradigm. For the successful experiments with DNA computing usually we design DNA strings which are sufficiently dissimilar. This leads to the construction of a large set of DNA strings which satisfy certain combinatorial and thermodynamic constraints. Over the last 16 years, many approaches such as combinatorial, algebraic, computational have been used to construct such DNA strings. In this work, we survey this interesting area of DNA coding theory by providing key ideas of the area and current known results. I. I NTRODUCTION Information and Communication Technology (ICT) has come a long way in the last 60 years. We have now connected world of devices talking to each other or performing the computation. This is possible due to the advancement of coding and information theory. In 1994, two new directions in ICT emerged viz. quantum computing and DNA computing. The heart of any computing or communications is coding theory. Thus two new directions in coding theory also emerged such as quantum coding theory and DNA coding theory. This paper focuses on the later area giving a comprehensive overview of techniques and challenges of DNA coding theory from last 16 years. In 1994, L.Adleman performed the computation using DNA strands to solve an instance of the Hamiltonian path problem giving birth to Figure 1: DNA structure with DNA computing [1]–[4]. DNA is a double strand built by the pairing the four basic building its four nucleotides A-Adenine, units A-(Adenine), C-(Cytosine), G-(Guanine), T-(Thymine) which are called nucleotides. G-Guanine, C-Cytosine and T The DNA strand is held by the important feature called complementary base pairing which Thymine. These are the basic buildconnects the Watson Crick complementary bases with each other denoted by AC = T and ing blocks of DNA which are held GC = C. The backbone of DNA is an alternating chains of sugar and phosphate. DNA is by Hydrogen bonds in a double helical manner. Each DNA base an ideal source of computing due to its stable, dense and its self replicating property [5]. is paired with the complementary After the successful experiment by Adleman, area of DNA computing [6] flourished into bases such that the yellow band is different directions like DNA tile assembly [7] [8], building of DNA nano-structures [9] A which is connected to its comple[10], studying error correcting properties in DNA sequences [11] [12] and DNA based data mentary base T (green band) while storage system [13]. But it was only with the identification of mathematical properties in C (blue band) is paired with red G (red band) DNA sequences [14] [15], it inspired many researchers to explore this new amalgamation of biology and coding theory [16]. DNA computing [17] promises to solve many NP-complete problems like Hamilton path problem, satisfiability problem (SAT) [18] and many others with massive parallelization [19]. To perform computation using DNA strands, a specific set of DNA sequences are required with particular properties. Such set of DNA strands satisfying various constraints [20], [21] play an important role in biomolecular computation [22] [23]. The aim of this manuscript is to elucidate the current state of art of DNA codes as shown in Figure 2, DNA codes constraints and especially the new methods contributing to construct DNA codes from algebraic coding. It summarizes the research on DNA codes which may help the researcher to get a broad picture of the area. Also, it gives brief overview on the applications of DNA codes. The Paper is organized as follows. Section II and III discusses the DNA code design problem and constraints on DNA codes. Section IV describes various approaches for the construction of DNA codes. Section V gives a brief description on software tools for DNA sequence design. Section VI includes description on bounds of DNA codes. Section VII gives a brief note on applications of DNA codes. Section VIII shows results on DNA codes for 0 ≤ n ≤ 36 and 0 ≤ d ≤ 20. Section IX concludes the paper with general remarks. II. DNA C ODE D ESIGN P ROBLEM To perform the DNA computation, DNA strands react with each other by Watson Crick base pairing and form perfect match. But in some situation, DNA strands may not form perfect base pairing and react in undesirable manner. One situation is formation of secondary structure in which first half strand of the DNA strand forms complementary with its own other half forming the hair pin like structure. This kind of structure interrupt the desired computation. Such secondary structures have to be avoided by designing the DNA strands carefully. Also DNA strands may bind to another DNA strands forming the complementary base pairing with few base pairs creating an error. One way to avoid this is to ensure that every two DNA strands differ in more than d locations where d depends on the application of DNA computation. This property can be obtained by defining different distance like the Hamming distance, Lee distance, Edit distance, Deletion distance etc. between two DNA strands. DNA codes design problem [24]–[26] is to develop a set of the DNA codewords of length n over DNA alphabets ΣDN A = {A, T, G, C} with predefined distance d [27] [28] and satisfying maximum set of constraints [29]. The main objective is to find the largest possible set of codewords M of length n over alphabet ΣDN A = {A, C, G, T } feasible with respect to set of constraints [30] with distance d that enable the construction of better error correcting codes [31]–[33]. Figure 2: DNA codes Time line: The state of art of DNA codes is showcased from 1994 to 2016. Different work mentioned with authors in chronological order. Definition (DNA code). A DNA code CDN A (n, M, d) ⊂ ΣDN A = {A, T, G, C} with each DNA codeword of length n and size M and minimum distance d. Here A denotes Adenine, T denotes Thymine, G denotes Guanine and C denotes Cytosine as the nucleotides in DNA. III. C ONSTRAINTS ON DNA C ODES There are different types of constraints that DNA codes must satisfy. Three categories in which these constraints can be classified are as follows: 1) Combinatorial constraints 2) Thermodynamic constraints 3) Application Oriented constraints The constraints which should be followed by DNA codes are listed below : 1) Hamming distance constraint (n, d, w) The Hamming distance constraint can be defined as dH (xDNA , yDNA ) ≥ d ∀ xDNA , yDNA ∈ CDN A for some Hamming distance d [14]. A set of codewords with length n, size M and minimum Hamming distance d satisfying Hamming constraint is denoted by CDN A (n, M, d). Hamming distance is calculated by the total number of places at which two DNA codewords differ. For CDN A (n, M, d), the minimum of the distances is considered as the Hamming distance. Aq (n, d) denotes the maximum size of a code with codewords of length n and distance d over alphabet size q, in case of DNA codewords q = 4. Example 1. Let xDNA = AT GACT and yDNA = ACT AGC, then dH (xDNA , yDNA ) = 4 where xDNA , yDNA ∈ CDN A following the Hamming distance constraint with dH (xDNA , yDNA ) ≥ 3. 2) Reverse constraint(n, d) - The reverse constraint is HDN A (xR DNA , yDNA ) ≥ d ∀ xDNA , yDNA ∈ CDN A . A code satisfying reverse constraint is called reverse code. AR (n, d) denotes the maximum size of a reverse code with length n and q minimum Hamming distance d. Page 2 of 19 Figure 3: DNA codes constraints classified as three categories combinatorial, thermodynamics and application oriented constraints. 3) 4) 5) 6) Example 2. Let xDNA = AT GACT , xRDNA = T CAGT A and yDNA = AT ACAT . For n = 6 and d = 3, HDN A (xRDNA , yDNA ) ≥ 3. xDNA , yDNA are reverse code. C Reverse Complement constraint (n, d) - This RC-constraint is HDN A (xR DNA , yDNA ) ≥ d ∀ xDNA , yDNA ∈ CDN A . DNA RC code satisfying RC-constraint is called a reverse complement code. Aq (n, d) denotes the maximum size of a reverse complement code with length n and minimum Hamming distance d [14]. Example 3. Let xDNA = AT GACT , xRDNA = T CAGT A and yDNA = GT ACAC, yCDNA = CAT GT G. For n = 6 and d = 3, HDN A (xRDNA , yCDNA ) = 4 ≥ 3. xDNA , yDNA are reverse complement code. GC-content constraint (n, d, w) - The set of codewords with length n, distance d and GC weight w ,where w is total number of Gs and Cs present in the DNA strand viz. wxDN A = |{xi : xDNA = (xi ) , xi ∈ {C, G}}| [34]–[36]. Generally w = bn/2c. Example 4. Let xDNA = AT T GCT then xDNA ∈ / CDN A for n = 6 and w = 3. Melting temperature constraint (n, d, w)- Melting temperature Tm is a temperature at which half of the DNA strands are hybridized and half are not. Melting is opposite of hybridization in which two strands get separated. It is advantageous to have the codewords with the similar melting temperatures as it will enable the hybridization of multiple DNA strands simultaneously. Hence we select the codewords with similar melting temperature calculated by Nussinov’s algorithm [25]. For each xDNA ∈ CDN A have identical melting temperature [29] . Thermodynamic constraint (n, d, w) - For the DNA stability, it is necessary for all the DNA codes should have comparable free energy ∆G◦ [37]–[39] above some threshold [40], [41]. As DNA with minimum free energy is more stable and hence we consider only those DNA codewords in a DNA code that have approximately comparable free energy in the set of DNA codewords. Free Energy ∆G◦ for the given DNA codewords xDNA , yDNA ∈ CDN A can be obtained by the equation, |∆G◦ (xDNA ) − ∆G◦ (yDNA )| ≤ δ where δ > 0 is a constant. 7) Uncorrelated-correlated constraint (n, d, w) - A codeword ∈ CDN A if shifted by x units where x ≤ n should not match with any of the other codeword ∈ CDN A [42]. Example 5. Let XDNA = CAT CAT C and YDNA = AT CAT CGG. X ◦ Y = 0100100, as depicted below. X= Y= Y= Y= Y= Y= Y= Y= C A A T A T C T A C A C T A A T A C T A T C T A C T A C G C T A C T A G G C T A C T G G C T A C G G C T A G G C T G G C G G G 0 1 0 0 1 0 0 Page 3 of 19 The constraints (1) to (3) are used to avoid undesirable hybridization between different DNA strands and (4) to (6) are the DNA constraints which ensures that all the codewords have similar thermodynamic characteristics to perform uniform computation. IV. VARIOUS A PPROACHES FOR THE DNA C ODES C ONSTRUCTIONS There are different approaches [43] for the construction of the DNA codes with finite length n, defined distance d and set of constraints with respect to application. The set of constraints essential for the DNA codes is subject to the application. There exist algorithmic, theoretic and software simulation method [44] [45] [46] approaches to design the DNA codes. Optimality of the DNA code can be obtained by construction of the DNA codes in a way that every codeword in the set follows maximum number of constraints for a large value of n and large minimum distance d with minimum errors in DNA computation [47]. Figure 4: DNA codes construction Methods are classified as Search algorithms, Algebraic Coding methods and Software simulations. Search algorithms include Variable neighborhood search, Simulated Annealing, Stochastic Local Search. Template Based method uses a template to design a DNA codeword. Algebraic coding method is construction of DNA codes by using Algebraic structures like Fields and Rings. Altruistic Approach is construction of DNA codewords in altruistic way in this work. 1) Variable Neighborhood Search Approach: This approach employs different local search algorithms to search the DNA codes [48]–[51]. Some of the search algorithms are described here. a) Seed Building(SB) : Seed Building (SB) algorithm examines all the possible codewords randomly with respect to seed codeword where seed codewords are the initial set of codewords with the required constraint. For more details on the algorithm, [52] can be referred. The drawback of the approach is that it is inefficient for larger values of n because of the computational and time complexity invested for the development of feasible set of codewords. Example 6. Consider the n = 3, d = 1 and GC-content w = 2 and initial seed be codewords with d = 1 and w = 2 are AGG and AGC then the resulting codewords at first iteration with HD and GC-content constraints are GGC, GCA, GT C, CCG, CGC, CT C, T GG. Note that ACC, GCT, CAG are also codewords with GC-content 2 but at distance d = 3 so will be added in the next iteration for seed d = 3. b) Clique Search : In this method a random subset of the codewords of a given code is removed, leaving a partial code. All the codewords are generated and those compatible with the ones left in the code are identified and a graph is built where the identified codewords are nodes, and an edge exists between compatible codewords. Example 7. Let n = 4, d = 3 and w = 2 then the partial code be CT T C, CGAA, T GGT, GT GA with HD, RC and GC constraints. The clique search will result in codewords CACT, GCT T, AGT G and AAGC. c) Hybrid Search: This method combines two approaches of seed building and clique search. It uses seed building to generate the partial code and clique search to search for the best codewords [53]. d) Greedy Approach: These types of algorithms removes the worst DNA code at each iteration from a set of codes every time the algorithm is iterated. But the problem with this approach is that it doesn’t always produce the best results. In the process of removing the worst code at each stage it may remove a potentially good code at earlier stages and may not remove bad code at the later stages. So it doesn’t always produce the best optimal result [54]. Example 8. Let n = 3, d = 2 and w = 2 then codewords from greedy search AGG, ACC, GGC, GCA, CCG, CT C. e) Lexicographic Approach: These types of algorithms take into consideration a particular arrangement of the DNA codewords. It may be an alphabetical order or ascending order or any such kind of ordered arrangement of codes such that the arrangement of the DNA codewords satisfying the DNA constraints are ordered in a specific manner [55]. 2) Simulated Annealing Approach: Simulated Annealing is a meta heuristic algorithm derived from thermodynamic principles [52]. These types of algorithms work on the set of codewords in which not all the codewords in the given set Page 4 of 19 3) 4) 5) 6) satisfies the constraints specified. These algorithms attempts to change the codewords with the objective of reducing the constraint violations to the constraints and try to make them feasible. The set with a feasible solution is derived when there is no violation of constraints. Stochastic Local Search Approach: We search for the codewords with parameters (n, d, w = n/2) such that the length of the codewords is n and the GC-content is bn/2c and the minimum Hamming distance is atleast d. These types of algorithms work on random set of codewords while solving the problem. It generally considers an initial set of random k codewords and then removes the one which doesn’t satisfy the constraints. For more details on the the algorithm, reader can refer to [56]. Example 9. Let random codewords k = 16 then set is AA, AT, AG, AC, GA, GT, GC, GG, CT, CA, CC, CG, T A, T C, T G, T T . Codewords with n = 2, d = 2 and w = 1 are GC, AG, CA, T C. Genetic Algorithm: In this work, genetic algorithm is used to search efficient and reliable codes by minimizing the mis-hybridization error. This codewords generated were unique in terms of Hamming distance that satisfy the Hamming bound [57]. Template Based Method: This method was introduced in [58]–[60] that involves two step process. Initially a template is designed which is mapped to binary error correcting codes and combination of template and codeword results into DNA codewords. This method is not optimal because the code size is limited depending of the size of error correcting code used and selection of template. Mapping of the template to the codewords is subjected to specific application. Example 10. Let template t = 1001101 and a codeword c = 0010111. Suppose the map define 11 → A, 10 → T , 01 → G and 00 → C for position of 1 and 0 in t is 1 = [AT ] and 0 = [GC] followed by position in codewords then the DNA codes will be T CGT AGA. Algebraic Coding Approach: By using the algebraic coding, DNA codes are constructed from fields and rings by mapping the elements of the field and rings to the DNA nucleotides [61]. a) Codes over Fields: i) Linear codes over GF (4) : In this approach, DNA codewords are constructed from GF (4) by using different one-to-one mapping the elements of GF (4) to DNA nucleotides [62]. The mapping is preferred from {0,1, ω, ω 2 } to {A,C,G,T} with respect to the codes used. There linear code [35] and additive codes.The method used have improved the lower bounds on GC constraints and extended the result on the length of DNA code to n ≤ 30. Researchers extended this construction for non linear codes and cyclic codes [63]. Also DNA codes over GF (4) were constructed using BCH codes in [64] in which the protein and targeting sequences are identified as codewords of error-correcting BCH codes. ii) Linear and Additive codes over GF (4): In this paper, DNA codes are constructed considering linear codes and additive codes over GF (4) pf odd length following Hamming distance constraint and reverse complement constraint. In [65], DNA codes of length 7, 9, 11 and 13 have been considered. DNA nucleotides {A, C, G, T } have been mapped to 0, ω, ω and 1 respectively with ω = ω 2 and ω 2 + ω + 1 = 0. Each codeword has been mapped to a polynomial and the Trace map T r : GF (4) → GF (2) is stated as : T r(x) = x + x2 iii) Extended, Additive, Additive Extended Cyclic codes over GF (4): In referred work, DNA codes satisfying GCcontent constraint and a minimum Hamming distance constraint were constructed using computer algebra systems Magma [66] and Maple [67]. Longer codes of higher length 4 ≤ n ≤ 30 were derived from GF (4), additive codes over GF (4) and Z4 (see Figure 5). Moreover it was claimed that by using different mapping from fields or rings to DNA codewords can result into different lower bounds. Further the bounds on the DNA codes satisfying set of constraints were ameliorated by shortening and puncturing of obtained codes [68]. b) Codes over Rings : Algebraic construction of DNA codes was further extended to codes over rings. Different rings are used to construct DNA codes by mapping rings elements to DNA nucleotides. i) DNA sequences generated by Z4 linear codes: In [69], a biological coding system which modeled the existence of error correcting codes in the DNA structure. Model consists of an encoder and modulator. The encoder consist of a mapper that converts the DNA nucleotides to elements of and BCH codes over Z4 ) and a modulator consist of a genetic code, tRNA (transfer RNA that serves as the connecting link between the mRNA (messenger RNA) to the amino acid sequence of proteins) and ribosome (protein synthesizer of the cell) which is associated with signals that convert the genetic codons to protein. The DNA and protein coding sequences from different species have been identified as the codewords over linear codes over Z4 . A class of error correcting-code BCH codes with parameters (n, k, 3) have been used in the encoder to construct the DNA codes over Z4 . In [70] [71], the self dual codes over Z4 are used for construction for DNA codes. Additionally, GC weight enumerator of the DNA codes that determines the number of Gs and Cs in the codeword. GC wright enumerator helps in the construction of DNA codes satisfying GC- content constraint. Self dual DNA codes over Z4 are Page 5 of 19 Cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 30. Extended cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 30. Additive cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 19. Additive extended cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 20. Cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 24. Extended cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 24. Cosets of cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 20. Cosets of extended cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 20. Cosets of additive cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 14. Cosets of additive extended cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 15. Cosets of cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 20. Cosets of extended cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 30. Figure 5: DNA codes construction methods using Computer Algebra Systems such as Maple and Magma [68] . developed by using mapping as A → 0 , C → 1, T → 2 and G → 3. The following generator matrix G over Z4 is considered and let K4 denote the codeword over Z4 formed from G. There are 16 codewords and wa (c) where a ∈ Z4 and k ∈ K4 as shown in Table I. 1 1 1 1 G = 0 2 0 2 0 0 2 2 K4 (0000) (2222) (0202) (2020) (0022) (2200) (0220) (2002) w0 4 0 2 2 2 2 2 2 w1 0 0 0 0 0 0 0 0 w2 0 4 2 2 2 2 2 2 w3 0 0 0 0 0 0 0 0 K4 (1111) (3333) (1313) (3131) (1133) (3311) (1331) (3113) w0 0 0 0 0 0 0 0 0 w1 4 0 2 2 2 2 2 2 w2 0 0 0 0 0 0 0 0 w3 0 4 2 2 2 2 2 2 Table I: (4, 16, 3) code is generated by G where wa (c) where a ∈ Z4 and k ∈ K4 [70]. ii) Lifted Polynomials over F16 : In [72], reversible codes by using special family of polynomials denoted as lifted polynomials over F4 which generates the reversible codes of odd length over F1 6 are constructed. 4-lifted polynomial is used to generate the DNA code of even length by using the correspondence between pair of DNA nucleotides to elements of the ring F1 6. Table II preserves the property that if DNA pair is mapped to an element of F16 then reverse of that DNA pair is mapped to fourth power of the element of F16 . For example, α2 → GC then (α2 )4 → CG. iii) DNA codes over F2 [u]/u4 − 1: In [73], cyclic DNA codes of odd length are obtained from F2 [u]/(u4 − 1) = {a + bu + cu2 + du3 | a, b, c, d ∈ F2 } where u4 = 1 commutative ring is considered. All its ideals are listed here h0i = h(1 + u)4 i ⊂ h(1 + u)3 i ⊂ h(1 + u)2 i ⊂ h(1 + u)i ⊂ R. The 16 elements of ring R are mapped to 2-length nucleotides. The following mapping shown in Table III was considered in the paper [73]. This mapping preserved the complementary and reverse property by adding 1 + u + u2 + u3 and multiplying u2 respectively. To find complement of AA, we add 1 + u + u2 + u3 to 0, it will give 1 + u + u2 + u3 = T T . To find reverse AA, multiply 0 to u2 will result in 0 =AA. In [74] DNA cyclic codes of arbitrary length satisfying the reverse complement constraint are constructed by using additive stem distance. The correspondence between the ring elements and DNA is established by following mapping mentioned in Table IV. this preserves the reverse complement property of the DNA codewords by the Page 6 of 19 Sr.No 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. DNA Pair a AA TT AT GC AG TA CC AC GT CG CA GG CT GA TG TC Multiplicative(F16 ) 0 α0 α1 α2 α3 α4 α5 α6 α7 α8 α9 α10 α11 α12 α13 α14 Additive 1 α α2 α3 1+α α + α2 α2 + α3 1 + α + α3 1 + α2 α + α3 1 + α + α2 α + α2 + α3 1 + α + α2 + α3 1 + α2 + α3 1 + α3 Table II: Mapping from DNA nucleotide Pair to element of F16 . [72] Element AA TT GG CC Map 0 1 + u + u2 + u3 1 + u2 u + u3 Element AT TA GC CG Map 1+u u2 + u3 u + u2 1 + u3 Element GT TG AC CA Map 1 u2 1 + u + u3 u + u2 + u3 Element CT TC AG GA Map 1+ u + u2 1 + u2 + u3 u u3 Table III: Mapping from DNA Nucleotide Pair to elements of the Ring F2 + uF2 + u2 F2 + u3 F2 where u4 = 1 [73]. x + xc = u3 + u2 + u + 1. The reverse of the DNA code is obtained by multiplying u2 to any element x of the ring R. Element GG CC GC CG Map 0 1 + u + u2 + u3 1 + u2 u + u3 Element AT TA AA TT Map 1+u u2 + u3 u + u2 1 + u3 Element GT TG AC CA Map 1 u2 1 + u + u3 u + u2 + u3 Element CT TC AG GA Map 1+ u + u2 1 + u2 + u3 u u3 Table IV: Mapping from DNA Nucleotide Pair to elements of the Ring F2 + uF2 + u2 F2 + u3 F2 where u4 = 1 [73]. iv) DNA cyclic codes over F2 + uF2 where u2 = 0: DNA codes of even length following reverse and reverse complement constraints have been studied in [75]. The field F2 is a subring of R. A linear code C of length n over R is defined to be an additive submodule of the R-module Rn . A cyclic code of length n over R is a linear code with the property that if (c0 , c1 , . . . , cn−1 ) ∈ C then (cn−1 , c0 , . . . , cn−2 ) ∈ C. An n-tuple c = (c0 , c1 , . . . , cn−1 ) ∈ Rn is identified with the polynomial c0 + c1 x + . . . + cn−1 xn−1 in the ring Rn = R[x]/ (xn − 1), which is called the polynomial representation of c = (c0 , c1 , ..., cn−1 ). Here R = F2 + uF2 with {0, u, 1, 1 + u} elements are in one to one correspondence with nucleotides A, T, G and C such that 0 → A, u → T , u + 1 → C and 1 → G. The DNA codes of length 8 and 10 are obtained from C = g 6 + u x5 + x and C = g1 g22 + ug2 , ug1 g2 . Necessary and sufficient conditions for cyclic codes to follow the reverse and reverse-complement properties have also have been studied. To preserve the reverse complement constraint o find complement of A, we add u to 0, it will give u = T. v) DNA codes over Z4 +uZ4 : DNA cyclic codes of odd lengths following reverse and reverse complement constraint are constructed in [76]. Here ring Z4 +uZ4 = a + ub | a, b ∈ Z4 with u2 = 0 is considered. Reversible and cyclic reversible complement codes are discovered in this paper. In this work defined a Gray map that allows them to translate the properties of DNA codes to binary codewords i.e. φ : Z4 +uZ4 → Z24 such that φ(a+ub) = (b, a+b) where a, b ∈ Z4 . In this paper, 16 pairs of nucleotides which are mapped as shown in Table V. Element AA AT GT CT Map 0 2 2u 2+3u Element TT TA CA GA Map 1+u 3+u 1+3u 3+2u Element GG GC AC AG Map 1 3 3u 2+2u Element CC CG TG TC Map u 2+u 1+2u 3+3u Table V: Mapping from DNA nucleotide pair to element of the Ring Z4 + uZ4 where u2 = 0 [76]. vi) F4 [u]/ < u2 + 1 > where u2 = 1: In [77] self-reciprocal complement cyclic codes from R with F4 = {0, 1, α, α + 1} where α is the root of primitive polynomial x2 + x + 1 over F2 are studied. The DNA code of specific length 6 over the ring is considered. One to one correspondence is establish between pairs of nucleotides Page 7 of 19 and 16 elements of the ring. DNA cyclic codes constructed followed reverse complement, GC content and Hamming distance constraints. The basis is {1, u + 1} and then every element of R is expressed in the form of a + b(u + 1) where a,b ∈ F4 . The mapping is done as in Table VI Element AA AG TG GC Map 0 α u 1 + αu Element TT TC AC CG Map α + 1(α + 1)u 1 + (α + 1)u α + 1 + αu α+u Element GT AT GA CC Map 1 α+1 αu 1+u Element CA TA CT GG Map α + (α + 1)u (α + 1)u 1+α+u α + αu Table VI: Mapping between the pair of DNA nucleotides and Ring F4 [77]. vii) DNA codes from F2 + uF2 + vF2 + uvF2 with u2 = 0 and v 2 = v: The structure of cyclic DNA codes of an arbitrary length over R2 = F2 + uF2 + vF2 + uvF2 was studied and the relation to codes over R1 = F2 + uF2 by defining Gray map between R2 and R21 was established [78]. DNA codes following reverse, reverse complement constraints are studied. The Gray map from R2 to R1 is defined as φ(a + bv) = (a, a + b). GC weight over the ring was also introduced by using image of Gray map. One type of nontrivial automorphisms can be defined over R2 as follows : σ : F2 + uF2 + vF2 + uvF2 → F2 + uF2 + vF2 + uvF2 , 2n a + bv → a + (1 + v)b such that a, b ∈ F2 + uF2 . Table is defined using : Φ (c) : C → SD4 , (a0 + b0 v, a1 + b1 v, . . . , an−1 + bn−1 v) 7→ (a0 , a1 , . . . , an−1 , a0 + b0 , a1 + b1 , . . . , an−1 + bn−1 ). Below is the Table VII for mapping elements of the ring to DNA described in the paper. For instance, (c0, c1, c2, c3) = (u + v, u, v, 1) is mapped to T CT T AGT CGG. Elements a Gray Images 0 uv 1 1+uv u u+uv 1+u 1+u+uv (0,0) (0,u) (1,1) (1,1+u) (u,u) (u,0) (1+u,1+u) (1+u,1) Double DNA Pairs ζ(a) AA AT GG GC TT TA CC CG Elements a Gray Images Double DNA Pairs ζ(a) v v+uv 1+v 1+v+uv u+v u+v+uv 1+u+v 1+u+v+uv (0,1) (0,1+u) (1,0) (1,u) (u,1+u) (u,1) (1+u,u) (1+u,0) AG AC GA GT TC TG CT CA Table VII: Mapping between DNA pair and elements of the Ring F2 + uF2 + vF2 + uvF2 with u2 = 0 and v 2 = v [78]. viii) codes over F4 + vF4 : In linear, constacyclic and cyclic codes over the ring R = F4 [v]/(v 2 − v) are constructed in [79]. The ring R = {a + vb|a, b ∈ F4 } is non-chain finite semi-local Frobenius ring with 16 elements. The 4 elements are F4 = {0, 1, ω, ω + 1} where ω 2 = ω + 1. The Gray map from R to F4 × F4 is given by : φ(c) = (a + b, a). If a = (a1 , a2 , . . . , an ) P ∈ Rn , then the Hamming weight of a is the sum of the Hamming weights of its n components, i.e.w(a) = i=1 w(ai ). The Hamming distance between a and b in R is d(a, b) = w(a − b). The Lee weight of any element of R is the Gray image of its Hamming weight, i.e. wL (c) = wH (φ(c)). Below is the Table VIII for mapping used in [80]. Elements Gray Images 0 ω v v+ω vω ω + vω v + vω ω + v + vω (0, 0) (ω, ω) (1, 0) (1 + ω, ω) (ω, 0) (0, ω) (1 + ω, 0) (1, ω) Double DNA Pairs ξ(a) AA CC TA GC CA AC GA TC Elements Gray Images 1 1+ω 1+v 1+v+ω 1 + vω 1 + ω + vω 1 + v + vω 1 + v + ω + vω (1, 1) (1 + ω, 1 + ω) (0, 1) (ω, 1 + ω) (1 + ω, 1) (1, 1 + ω) (ω, 1) (0, 1 + ω) Double DNA Pairs ξ(a) TT GG AT CG GT TG CT AG Table VIII: Mapping between DNA Pair and elements of the Ring F4 + vF4 [79]. ix) R = F2 [u]/(u6 ): In [81], DNA cyclic codes over a family R1 = F2 [u]/(u6 ) and ring R2 = F2 + uF2 where v 2 = v satisfying reverse complement constraint have been constructed. In this a new family of DNA skew cyclic codes is introduced over ring R = F2 + vF2 = 0, 1, v, v + 1 where v 2 = v. The ring R1 = F2 [u]/(u6 ) = {a0 + a1 u + a2 u2 + a3 u3 + a4 u4 + a5 u5 ; ai ∈ F2 , u6 = 0}. There is direct map between 64 elements of the ring to 64 codons (three nucleotides) used in nature as a substrate for aminoacid synthesis shown in Table IX. x) DNA codes over ring F2 [u]/u2 − 1 : Ring R = F2 [u]/u2 − 1 = {0, 1, u, 1 + u} where u2 = 1 was described [82] [83] . Elements of the ring were directly mapped to DNA nucleotides such that reverse complement Page 8 of 19 DNA Ring Element Codons CCC GGA GGC GGT AGG CGG GAG AGA AGC ATG AGT CGA CGC CGT TGG GAA DNA Ring Element Codons u5 +u4 +u3 +u2 +u+1 u5 + u4 + u3 + u2 + u u5 + u4 + u3 + u2 + 1 u5 + u4 + u3 + u2 u5 + u4 + u3 + u + 1 u5 + u4 + u2 + u + 1 u5 + u3 + u2 + u + 1 u5 + u4 + u3 + u2 + u u5 + u4 + u3 + 1 u4 + u3 + u2 + u + 1 u5 + u4 + u3 u5 + u4 + u2 + u u5 + u4 + u2 + 1 u5 + u4 + u2 u5 + u4 + u + 1 u5 + u4 + u3 + u2 + u GGG CCT CCG CCA TCC GCC CTC TCT TCG TAC TAT GCT GCG TAA CTG CTT 0 1 u u+1 u2 u3 u4 u2 + 1 u2 + u u5 u5 + 1 u3 + 1 u3 + u u5 + u u4 + u u4 + 1 DNA Ring Element Codons ACT ACG TTT TTG CTA GTT GTG TCA CAA CAC GCA TTA ACC CAT TGT CAG u3 + u2 + u + 1 u3 + u2 + u u4 + u2 + 1 u4 + u2 + u u4 + u + 1 u4 + u3 + 1 u4 + u3 + u u2 + u + 1 u5 + u2 + u u5 + u2 + 1 u3 + u + 1 u4 + u3 u3 + u2 u5 + u2 u5 + u4 u5 + u3 DNA Ring Element Codons GTC ACA GAC AGG GAT GTA ATT ATA ATC TGA AAT AAA TGC AAC TCC TAG u4 + u2 + u + 1 u3 + u2 + u + 1 u5 + u3 + u2 + 1 u5 + u3 + u + 1 u5 + u3 + u2 u4 + u3 + u + 1 u4 + u3 + u2 + 1 u4 + u3 + u2 + u u4 + u3 + u2 u5 + u4 + u u5 + u2 + u + 1 u5 + u3 + u u5 + u4 + 1 u5 + u3 + 1 u4 + u2 u5 + u + 1 Table IX: Mapping between 64 elements of the Ring F2 + uF2 + u2 F2 + u3 F2 + u4 F2 + u5 F2 and 64 codons [81]. constraint is conserved. By adding 1 + u to elements of rings, complement can be obtained. For example 0 → A, u → C, 1 → G, 1 + u → T preserves the complement property: 0 = 1 + u and u = 1. The elements of the ring R can be mapped to the elements of F2 = {0, 1} via the map θ where θ(0) = θ(u + 1) = 0 and θ(1) = θ(u) = 1. Let C be a cyclic code in Rn . It can be extended to the map θ to a map ϕ : C → Z2 [x]/(xn −1) defined by ϕ(a0 +a1 x+a2 x2 +. . .+an−1 xn−1 ) = θ(a0 )+θ(a1 )x+. . .+θ(an−1 xn−1 ). xi) DNA cyclic codes over F2 + uF2 : In [84] odd length codes over rings satisfying reverse complement, GCContent and thermodynamic constraints are studied. They are obtained from the cyclic complement reversible code. Infinite family of BCH DNA codes are constructed. The mapping Φ called Gray Map has been used to map linear codes over R to binary linear codes. The Gray map Φ is the distance-preserving map (Rn , Lee distance) → (F22n ,Hamming distance). xii) Cyclic DNA codes over Z4 + wZ4 : Recently, cyclic DNA codes over R = Z4 + wZ4 where w2 = 2 and S = Z4 + wZ4 + vZ4 + wvZ4 where v 2 = vw have been discovered [85]. In this work, odd length DNA cyclic codes over R satisfying reverse and reverse complement constraint is studied . Also, a family of DNA skew cyclic codes with reverse complement property over R is constructed. Binary images of the cyclic DNA codes over R and S is determined. The correspondence between elements of ring R and double DNA pairs are establish as described in the Table X. Elements Gray Images 0 2 ω 3ω 1 + 2ω 2+ω 2 + 3ω 3 + 2ω (0, 0) (2, 0) (0, 1) (0, 3) (1, 2) (2, 1) (2, 3) (3, 2) Double DNA Pairs ξ(a) AA GA AC AT CG GC GT TG Elements Gray Images 1 3 2ω 1+ω 1 + 3ω 2 + 2ω 3+ω 3 + 3ω (1, 0) (3, 0) (0, 2) (1, 1) (1, 3) (2, 2) (3, 1) (3, 3) Double DNA Pairs ξ(a) CA TA AG CC CT GG TC TT Table X: Mapping between DNA Pair and elements of the Ring. 7) Algebraic Number Theory codes: Aforementioned all the methods include construction of DNA codes from classical coding theory and heuristic approaches works well for small length n. In this author constructed the DNA codes using algebraic number theory [86], making the first attempt, by using irreducible cyclic codes to built DNA codes with large n (< 1000) and number of codewords (M < 7000) satisfying the GC content constraint. V. S OFTWARE TOOLS FOR DNA C ODES G ENERATION There are different tools developed for designing the DNA codewords namely DNA sequence Generator [44] and evolutionary algorithm based program-PUNCH (Princeton University Nucleotide Computing Heuristic) [87] were used to find set of dissimilar sequences. DNA sequence generator and compiler use graph based approach based on the overlapping sub-sequences. The GUI of the software allows the user to import the the DNA sequence to the sequence wizard with different parameters. It Page 9 of 19 also calculate the melting temperature of the DNA sequences. It check for the reverse complement constraints and forbidden DNA strands. However this software do not take care of secondary structure formation of the DNA strands. PUNCH is used for performing various DNA computing by bit set selection. It works on randomization by selecting three basic parameters N, B and V where N is number of bits in the problem, B is the number of nucleotides in each bit, and V is number of variation on each bit set. Here the author has created a web application in which the user has to first select the mode which is either specific based or range. For both modes, the user has to select the specific constraints he wants DNA codes to follow but for specific the input has parameters n and d where n is the length of the codewords and d is the Hamming distance. For range, the input parameters are n1 and n2 where n1 is the starting length and n2 is the ending length and also are d1 and d2 where d1 is the starting Hamming distance and d2 is the ending Hamming distance. The web portal will display the number of codewords and also those codewords that satisfy the constraint. VI. B OUNDS ON DNA C ODES There are several bounds studied for DNA codes. This section comprehend the types of bounds obtained on set of constraints. In the Table XI, columns are ticked for which respective bounds on constraints are obtained. Note that n is the length, d is minimum Hamming distance and w is GC content of the DNA code. One can observe that almost all the methods have obtained lower bounds on reverse constraint. But most important part is to investigate that there is no bounds on DNA codes satisfying the set of HD, GC, R and RC constraints altogether. Also there are very few attempts made to explore the bounds on thermodynamic constraints. One can explore the methods which allows the formulation of bounds on the thermodynamic constraints. More details on the bounds are described in Appendix A. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 AHD (n, d, w) 4 AR 4 (n, d, w) X X X X ARC 4 (n, d, w) AGC 4 (n, d, w) 17 X X X X X X X X X X X X X X X X AR,RC (n, d, w) 4 ARC,GC (n, d, w) 4 AR,GC (n, d, w) 4 X X X X X X AR,RC,GC (n, d, w) 4 Table XI: Bounds on DNA codes: 1-Johnson type Bound [88] [55], 2-Halving Bounds [88] [55], 3-Gilbert-Type Bounds [55], 4- Relation between AGC,RC and AGC,R [55], 5- AGC bound, 6- AGC,RC bound, 7- AGC 4 4 (n, d, w) bound, 8- Product Bounds 4 4 4 RC R RC [88] [55], 9- Relation between A4 and A4 [88], 10 to 16 - Relation between AGC and AR 4 , A4 4 [14], 17-V. Phan bound [89] satisfying Hamming distance constraint. VII. A PPLICATIONS OF DNA C ODES DNA codes are used in various technologies like DNA computing [1], surface based DNA computation [90]–[93], DNA Microarray technology [94], Molecular barcodes for chemical libraries [95], DNA nanostructures [96], [97], DNA origami Page 10 of 19 [10], data encrytption [98], [99] [100]–[102], data storage [13], [103]–[106], signal processing [107] and DNA nanodevices and circuits [108]. Recently, it has reported potential of DNA codes in phylogenetic studies [109]. Also it has contribution in understanding gene regulatory networks [110], protein coding genes [111], [112] and studying the structure of genes via circular codes [113]. In DNA computing, DNA codes with specific properties are required to perform various parallel and logical operations [114]. Molecular barcodes [115] generated from DNA codes are used as biomarkers for authentication of the products. DNA codes are used in creating DNA nanostructures that are used in potential applications like targeted drug delivery systems. DNA codes with specific properties with high stability and robustness are required for nano structures which can be achieved by using efficient encoding procedures for DNA codes. Recently, DNA is used in data hiding techniques for encryption of data more effectively. DNA codes used in this are designed as encryption or decryption keys. In last few years DNA based data storage systems have [42] received attention by many researchers. DNA codes used for data storage must have feasible property that achieve dense data storage capacity and better error correction capacity. To use DNA codes for any application, fundamental constraints mentioned for DNA codes are unavoidable. DNA code design must follow the constraint for stability and robustness but to design the DNA codes for specific application, required constraints must be added to DNA codes to make it more functional and practical. For instance, correlated and uncorrelated constraint was added to DNA codes for development random and re-writable DNA based data storage system. Looking at the potential of DNA codes and advancement in the biotechnology methods, DNA codes promises application in emerging technologies. VIII. DNA C ODES TABLE Many tables on lower bounds of DNA codes satisfying set of constraints are obtained. In Table VIII and XIII lower bounds on DNA codes satisfying GC and reverse complement (RC) constraints are mentioned. In Table XIV lower bounds on DNA codes satisfying Hamming distance and reverse complement constraint. These bounds are compiled from [68] [36] [52] [62] [71] [86]. IX. F UTURE W ORK AND C HALLENGES DNA codes designing have received a great deal of attention by researchers in the last decade. In spite of different approaches proposed in the literature for the construction of DNA codes and constraints, it is still a challenge to design the optimal DNA codes. Classifying the DNA codes for specific application has opportunities to use it for real applications. Though researchers have worked on improvement of the bounds of DNA codes satisfying the set of constraints, designing DNA codes satisfying maximum number of constraints achieving bound is still a huge challenge. Better codes with higher length and distance can be designed by using other algebraic methods or computational methods improving the bounds and obtaining the bounds for the missing n and d can be achieved. Defining DNA code as mathematical structure with the possible operation is interesting area to explore in which different operation can be defined on DNA nucleotides that satisfies the desired properties for computation. There are attempts to develop DNA codes satisfying the thermodynamic constraints though it is a challenge to develop the DNA codes that fits perfectly for the practical application of DNA strands. One of the important research problem is to work on the optimality condition of the DNA codes. To simulate the process of the DNA code designing and practical protocols involved in the DNA computation, it can be automated by developing a platform where DNA codes can be designed and simulated to check with the performance and accuracy. With emerging area of algebraic coding and biological coding theory, these challenges can be investigated and resolved. R EFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] L. M. Adleman, “Molecular computation of solutions to combinatorial problems,” Science, vol. 266, no. 5187, pp. 1021–1024, 1994. C. C. Maley, “DNA computation: theory, practice, and prospects,” Evolutionary computation, vol. 6, no. 3, pp. 201–229, 1998. C. Calude and G. Paun, Computing with cells and atoms: an introduction to quantum, DNA and membrane computing. CRC Press, 2000. M. Garzon, P. Neathery, R. Deaton, R. C. Murphy, D. R. Franceschetti, and S. Stevens Jr, “A new metric for DNA computing,” in Proceedings of the 2nd Genetic Programming Conference, 1997, pp. 472–278. R. Deaton, M. Garzon, R. Murphy, J. Rose, D. Franceschetti, and S. E. Stevens Jr, “Reliability and efficiency of a DNA-based computation,” Physical Review Letters, vol. 80, no. 2, p. 417, 1998. G. Rozenberg and A. Salomaa, “DNA computing: new ideas and paradigms,” in Automata, Languages and Programming. Springer, 1999, pp. 106–118. E. Winfree, F. Liu, L. A. Wenzler, and N. C. Seeman, “Design and self-assembly of two-dimensional DNA crystals,” Nature, vol. 394, no. 6693, pp. 539–544, 1998. E. Winfree, “Algorithmic self-assembly of DNA,” Ph.D. dissertation, California Institute of Technology, 1998. N. C. Seeman, “DNA nanotechnology: novel DNA constructions,” Annual review of biophysics and biomolecular structure, vol. 27, no. 1, pp. 225–248, 1998. P. W. Rothemund, “Folding DNA to create nanoscale shapes and patterns,” Nature, vol. 440, no. 7082, pp. 297–302, 2006. M. Arita, “Writing information into DNA,” in Aspects of Molecular Computing. Springer, 2004, pp. 23–35. Page 11 of 19 Page 12 of 19 n/d 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 609973884610560 42061705248768 3 6 15 43 256 528 1354 4542 14405 59136 167263 768768 1646240 13174400 26355520 44933184 47102080 756760576 90291264 10602158336 1384513088 177279886336 21300369664 2532157069312 4 2 3 16 35 128 273 860 2457 14784 27376 192192 411821 3293600 6587200 11232288 23647760 189432064 188416000 2650495232 670222080 44319794176 11204500480 633038608384 158680788992 10515426312192 2625500086272 152493461268480 80766566400 5154680832 1 4 11 28 65 210 477 1848 3974 11878 25670 55376 97520 699624 738772 11822368 22573824 22607872 43264648 346436544 10399676 326893568 5 2 2 22 19 54 117 924 924 3712 6648 55424 97450 738772 738772 11806240 1412068 5643456 10816624 43355616 1299844 618544192 20057442 10262347776 6691200 149011451520 6 9313176480 641396736 1 2 8 17 37 87 206 796 1600 13856 12864 43632 92252 738520 176772 176772 2703694 21631400 41600552 38656528 7 2 2 8 14 29 62 208 410 6476 6060 43632 11542 368504 45112 353496 676312 5406464 10399676 83204800 40114884 159987712 3853632 2332609440 8 9080016 1 2 5 12 23 49 109 243 579 2691 3678 11452 11148 88424 169182 1351616 2599688 20801200 40114884 627776 9 Table XII: Lower bounds on DNA codes satisfying GC and Reverse Complement Constraints for 4 ≤ n ≤ 30 and 2 ≤ d ≤ 9 946176 64512 320 135 2 24 Page 13 of 19 n/d 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 9110544 150266880 300533760 583395120 1166803110 2268771670 4537543340 2 2 4 10 21 37 83 175 407 960 2868 2926 22088 42968 338016 649922 5199376 10029150 180226336 10 80 96 848 848 1536 19388832 77558760 2 2 4 20 26 30 49 99 179 364 12 2 3 5 12 21 39 77 88 174 336 690 1402 2974 6308 13688 29292 61270 13 2 2 2 4 10 18 33 43 74 126 244 480 977 1927 3987 8245 17677 14 1 2 2 4 8 15 22 36 57 102 190 351 655 1310 2599 5426 15 2 2 2 4 7 11 20 31 51 83 148 262 459 898 1767 16 1 2 2 4 6 10 16 27 65 67 114 194 353 546 17 2 2 2 3 6 8 14 23 38 56 93 155 266 18 1 2 2 2 4 7 12 20 32 50 77 127 19 2 2 2 2 4 6 11 18 28 42 65 20 1 2 2 2 4 6 9 15 24 36 21 2 2 2 2 4 5 8 12 20 22 1 2 2 2 4 4 7 11 23 2 2 1 1 4 4 7 24 1 2 1 1 3 4 25 2 2 2 2 3 26 1 2 2 2 27 2 2 2 28 Table XIII: Lower bounds on DNA codes satisfying GC and Reverse Complement Constraints for 4 ≤ n ≤ 36 and 10 ≤ d ≤ 30 1 2 4 8 18 68 62 133 285 766 847 5522 10701 84964 162986 1299844 2506644 20056584 38777664 708168 11 1 2 29 2 30 Page 14 of 19 n/d 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 >= 107 >= 107 >= 107 3 6 32 62 196 620 1952 8064 23565 2609! 65536 32640 1047552 >= 107 4 2 4 28 42 128 346 2016 4832 32640 65280 523776 1047552 8386560 >= 107 523776 261888 >= 107 523776 65280 2095104 2 4 12 30 80 496 607 4032 5469 32640 65280 5 2 2 16 22 120 136 2016 2016 8192 16384 130560 16384 2091504 1048604 2095104 6 131072 1046528 2 2 8 17 40 120 240 2016 4032 32640 65280 7 2 2 8 15 31 70 512 1024 8192 16384 16384 32512 523776 8 2 2 6 12 24 120 118 480 679 8064 8128 32256 9 2 2 4 10 32 64 120 197 2016 1095 1598 10 2 2 4 8 17 120 68 143 321 2016 11 2 2 4 6 32 64 120 109 480 12 2 2 3 5 12 22 42 83 13 2 2 2 4 10 19 35 14 2 2 2 4 8 16 15 2 2 2 4 7 16 2 2 2 4 17 2 2 2 18 2 2 19 Table XIV: Lower bounds on DNA codes satisfying Hamming distance and Reverse Complement Constraints for 4 ≤ n ≤ 20 and 2 ≤ d ≤ 20 >= 107 >= 107 >= 107 >= 107 2 32 116 512 1968 8192 2887! 2786! 2677! 2734! 2 20 [12] A. G. D’yachkov, P. L. Erdös, A. J. Macula, V. V. Rykov, D. C. Torney, C.-S. Tung, P. A. Vilenkin, and P. S. White, “Exordium for DNA codes,” Journal of Combinatorial Optimization, vol. 7, no. 4, pp. 369–379, 2003. [13] D. Limbachiya and M. K. Gupta, “Natural data storage: A review on sending information from now to then via nature,” arXiv preprint arXiv:1505.04890, 2015. [14] A. Marathe, A. E. Condon, and R. M. Corn, “On combinatorial DNA word design,” Journal of Computational Biology, vol. 8, no. 3, pp. 201–219, 2001. [15] S. Hussini, L. Kari, and S. Konstantinidis, “Coding properties of DNA languages,” in DNA Computing. Springer, 2002, pp. 57–69. [16] M. K. Gupta, “The quest for error correction in biology,” Engineering in Medicine and Biology Magazine, IEEE, vol. 25, no. 1, pp. 46–53, 2006. [17] J. Watada et al., “DNA computing and its applications,” in Intelligent Systems Design and Applications, 2008. ISDA’08. Eighth International Conference on, vol. 2. IEEE, 2008, pp. 288–294. [18] X. Wang, Z. Bao, J. Hu, S. Wang, and A. Zhan, “Solving the sat problem using a DNA computing algorithm based on ligase chain reaction,” Biosystems, vol. 91, no. 1, pp. 117–125, 2008. [19] M. Darehmiraki, “A semi-general method to solve the combinatorial optimization problems based on nanocomputing,” International Journal of Nanoscience, vol. 9, no. 05, pp. 391–398, 2010. [20] A. J. Hartemink, D. K. Gifford, and J. Khodor, “Automated constraint-based nucleotide sequence selection for DNA computation,” Biosystems, vol. 52, no. 1, pp. 227–235, 1999. [21] J. Li, Q. Zhang, R. Li, and S. Zhou, “Optimization of DNA encoding based on combinatorial constraints,” ICIC Express Letters, vol. 2, no. 1, pp. 81–88, 2008. [22] R. Penchovsky and J. Ackermann, “DNA library design for molecular computation,” Journal of Computational Biology, vol. 10, no. 2, pp. 215–229, 2003. [23] E. B. Baum, “DNA sequences useful for computation,” in Proceedings of DNA-based Computers II, Princeton. In AMS DIMACS Series, vol. 44, 1999, pp. 235–241. [24] A. G. D’yachkov, P. A. Vilenkin, I. K. Ismagilov, R. S. Sarbaev, A. Macula, D. Torney, and S. White, “On DNA codes,” Problems of Information Transmission, vol. 41, no. 4, pp. 349–367, 2005. [25] O. Milenkovic and N. Kashyap, “On the design of codes for DNA computing,” in Coding and Cryptography. Springer, 2006, pp. 100–119. [26] M. H. Garzon, V. Phan, S. Roy, and A. J. Neel, “In search of optimal codes for DNA computing,” in DNA Computing. Springer, 2006, pp. 143–156. [27] L. P. Dinu and A. Sgarro, “A low-complexity distance for DNA strings,” Fundamenta Informaticae, vol. 73, no. 3, pp. 361–372, 2006. [28] A. D’yachkov, A. Macula, V. Rykov, and V. Ufimtsev, “DNA codes based on stem similarities between DNA sequences,” in DNA Computing. Springer, 2007, pp. 146–151. [29] J. Sager and D. Stefanovic, “Designing nucleotide sequences for computation: a survey of constraints,” in DNA Computing. Springer, 2006, pp. 275–289. [30] J. Sun, “Bounds on edit metric codes with combinatorial DNA constraints,” Master’s thesis, Brock University, 2010. [31] P. P. Debata, D. Mishra, K. Shaw, and S. Mishra, “A coding theoretic model for error-detecting in DNA sequences,” Procedia Engineering, vol. 38, pp. 1773–1777, 2012. [32] D. Ashlock, S. K. Houghten, J. A. Brown, and J. Orth, “On the synthesis of DNA error correcting codes,” Biosystems, vol. 110, no. 1, pp. 1–8, 2012. [33] L. C. Faria, A. S. Rocha, J. H. Kleinschmidt, M. C. Silva-Filho, E. Bim, R. H. Herai, M. E. Yamagishi, and R. Palazzo Jr, “Is a genome a codeword of an error-correcting code?” PloS one, vol. 7, no. 5, p. e36644, 2012. [34] Y. M. Chee and S. Ling, “Improved lower bounds for constant GC-content DNA codes,” Information Theory, IEEE Transactions on, vol. 54, no. 1, pp. 391–394, 2008. [35] D. H. Smith, N. Aboluion, R. Montemanni, and S. Perkins, “Linear and nonlinear constructions of DNA codes with Hamming distance d and constant GC-content,” Discrete Mathematics, vol. 311, no. 13, pp. 1207–1219, 2011. [36] D. Tulpan, D. H. Smith, and R. Montemanni, “Thermodynamic post-processing versus GC-content pre-processing for DNA codes satisfying the Hamming distance and reverse-complement constraints,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 11, no. 2, pp. 441–452, 2014. [37] M. A. Bishop, A. G. D’yachkov, A. J. Macula, T. E. Renz, and V. V. Rykov, “Free energy gap and statistical thermodynamic fidelity of DNA codes,” Journal of Computational Biology, vol. 14, no. 8, pp. 1088–1104, 2007. [38] A. G. D’yachkov, A. J. Macula, W. K. Pogozelski, T. E. Renz, V. V. Rykov, and D. C. Torney, “A weighted insertion-deletion stacked pair thermodynamic metric for DNA codes,” in DNA Computing. Springer, 2005, pp. 90–103. [39] Q. Zhang, B. Wang, and X. Wei, “Evaluating the different combinatorial constraints in DNA computing based on minimum free energy,” MATCH Commun. Math. Comput. Chem, vol. 65, pp. 291–308, 2011. [40] D. Tulpan, M. Andronescu, S. B. Chang, M. R. Shortreed, A. Condon, H. H. Hoos, and L. M. Smith, “Thermodynamically based DNA strand design,” Nucleic acids research, vol. 33, no. 15, pp. 4951–4964, 2005. [41] Q. Zhang, B. Wang, X. Wei, X. Fang, and C. Zhou, “DNA word set design based on minimum free energy,” NanoBioscience, IEEE Transactions on, vol. 9, no. 4, pp. 273–277, 2010. [42] S. Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. Milenkovic, “A rewritable, random-access DNA-based storage system,” arXiv preprint arXiv:1505.02199, 2015. [43] D. C. Tulpan, “Effective heuristic methods for DNA strand design,” Ph.D. dissertation, The University Of British Columbia, 2006. [44] U. Feldkamp, H. Rauhe, and W. Banzhaf, “Software tools for DNA sequence design,” Genetic Programming and Evolvable Machines, vol. 4, no. 2, pp. 153–171, 2003. [45] U. Feldkamp, W. Banzhaf, H. Rauhe et al., “A DNA sequence compiler,” in Poster presented at Sixth International Meeting on DNA Based Computers, Leiden, 2000. [46] U. Feldkamp, S. Saghafi, W. Banzhaf, and H. Rauhe, “DNAsequencegenerator: A program for the construction of DNA sequences,” in DNA Computing. Springer, 2001, pp. 23–32. [47] J. M. Gray, T. G. Frutos, A. M. Berman, A. E. Condon, M. G. Lagally, L. M. Smith, and R. M. Corn, “Reducing errors in DNA computing by appropriate word design,” University of Wisconsin, Department of Chemistry, 1996. [48] P. Hansen, N. Mladenović, and J. A. M. Pérez, “Variable neighbourhood search: methods and applications,” Annals of Operations Research, vol. 175, no. 1, pp. 367–407, 2010. [49] S. Kawashimo, H. Ono, K. Sadakane, and M. Yamashita, “DNA sequence design by dynamic neighborhood searches,” in DNA Computing. Springer, 2006, pp. 157–171. [50] R. Montemanni and D. H. Smith, “Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm,” Journal of Mathematical Modelling and Algorithms, vol. 7, no. 3, pp. 311–326, 2008. [51] S. Kawashimo, H. Ono, K. Sadakane, and M. Yamashita, “Dynamic neighborhood searches for thermodynamically designing DNA sequence,” in DNA Computing. Springer, 2008, pp. 130–139. [52] R. Montemanni, D. Smith, and N. Koul, “Three metaheuristics for the construction of constant GC-content DNA codes,” Lecture Notes in Management Science, vol. 6, pp. 167–175, 2014. [53] Q. Qiu, D. Burns, Q. Wu, and P. Mukre, “Hybrid architecture for accelerating DNA codeword library searching,” in Computational Intelligence and Bioinformatics and Computational Biology, 2007. CIBCB’07. IEEE Symposium on. IEEE, 2007, pp. 323–330. Page 15 of 19 [54] N. Bennenni, K. Guenda, and A. Gulliver, “Construction of codes for DNA computing by the greedy algorithm,” ACM Commun. Comput. Algebra, vol. 49, no. 1, pp. 14–14, Jun. 2015. [Online]. Available: http://doi.acm.org/10.1145/2768577.2768583 [55] O. D. King, “Bounds for DNA codes with constant GC-content,” Electron. J. Combin, vol. 10, no. 1, p. 33, 2003. [56] D. C. Tulpan, H. H. Hoos, and A. E. Condon, “Stochastic local search algorithms for DNA word design,” in DNA Computing. Springer, 2003, pp. 229–241. [57] R. Deaton, M. Garzon, R. Murphy, and J. R. Koza, “Genetic search of reliable encodings for DNA-based computation,” Late Breaking Papers at the Genetic Programming 1996, pp. 9–15, 1996. [58] M. Arita and S. Kobayashi, “DNA sequence design using templates,” New Generation Computing, vol. 20, no. 3, pp. 263–277, 2002. [59] W. Liu, S. Wang, L. Gao, F. Zhang, and J. Xu, “DNA sequence design based on template strategy,” Journal of chemical information and computer sciences, vol. 43, no. 6, pp. 2014–2018, 2003. [60] S. Kobayashi, T. Kondo, and M. Arita, “On template method for DNA sequence design,” in DNA Computing. Springer, 2003, pp. 205–214. [61] R. Selvakumar, “Unconventional construction of DNA codes: group homomorphism,” Journal of Discrete Mathematical Sciences and Cryptography, vol. 17, no. 3, pp. 227–237, 2014. [62] P. Gaborit and O. D. King, “Linear constructions for DNA codes,” Theoretical Computer Science, vol. 334, no. 1, pp. 99–113, 2005. [63] N. Aboluion, D. H. Smith, and S. Perkins, “Linear and nonlinear constructions of DNA codes with Hamming distance d, constant GC-content and a reverse-complement constraint,” Discrete Mathematics, vol. 312, no. 5, pp. 1062–1075, 2012. [64] L. Faria, A. Rocha, J. Kleinschmidt, R. Palazzo, and M. Silva-Filho, “DNA sequences generated by BCH codes over GF (4),” Electronics letters, vol. 46, no. 3, pp. 202–203, 2010. [65] T. Abualrub, A. Ghrayeb, and X. N. Zeng, “Construction of cyclic codes over GF(4) for DNA computing,” Journal of the Franklin Institute, vol. 343, no. 4, pp. 448–457, 2006. [66] J. Cannon, A. Steel et al., “The MAGMA computational algebra system,” Software available online (magma. maths. usyd. edu. au), 2005. [67] A. Heck and W. Koepf, Introduction to MAPLE. Springer-Verlag New York, 1993, vol. 1993. [68] A. Niema, “The construction of DNA codes using a computer algebra system,” Ph.D. dissertation, PhD Thesis, University of Glamorgan, 2011. [69] A. S. L. Rocha, L. C. B. Faria, J. H. Kleinschmidt, R. Palazzo Jr, and M. C. Silva-Filho, “DNA sequences generated by Z4 linear codes,” in Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1320–1324. [70] B. Feng, S. Bai, B. Chen, and X. Zhou, “The constructions of DNA codes from linear self-dual codes over Z4 ,” International Conference on Computer Information Systems and Industrial Applications, 2015. [71] Z. Varbanov, T. Todorov, and M. Hristova, “A method for constructing DNA codes from additive self-dual codes over GF (4).” in Proc. CAIM conference, Romania, vol. 40, 2014. [72] S. E. Oztas and I. Siap, “Lifted polynomials over F16 and their applications to DNA codes,” Filomat, vol. 27, no. 3, pp. 459–466, 2013. [73] B. Yildiz and I. Siap, “Cyclic codes over F2 [u]/(u4 − 1) and applications to DNA codes,” Computers & Mathematics with Applications, vol. 63, no. 7, pp. 1169–1176, 2012. [74] K. Guenda, T. A. Gulliver, and P. Solé, “On cyclic DNA codes,” in Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on. IEEE, 2013, pp. 121–125. [75] J. Liang and L. Wang, “On cyclic DNA codes over F2 + uF2 ,” Journal of Applied Mathematics and Computing, pp. 1–11, 2015. [76] S. Pattanayak and A. K. Singh, “On cyclic DNA codes over the ring Z4 + uZ4 ,” arXiv preprint arXiv:1508.02015, 2015. [77] F. Ma, Y. Cao, and J. Gao, “On cyclic DNA codes over F4 [u]/(u2 + 1),” International Journal of Research and Reviews in Applied Sciences, vol. 24, no. 3, p. 101, 2015. [78] S. Zhu and X. Chen, “Cyclic DNA codes over F2 + uF2 + vF2 + uvF2 ,” arXiv preprint arXiv:1508.07113, 2015. [79] A. Bayram, E. S. Oztas, and I. Siap, “Codes over F4 + vF4 and some DNA applications,” Designs, Codes and Cryptography, pp. 1–15, 2015. [80] B. Srinivasulu and M. Bhaintwal, “Reversible cyclic codes over F4 + uF4 and their applications to DNA codes,” in 2015 7th International Conference on Information Technology and Electrical Engineering (ICITEE). IEEE, 2015, pp. 101–105. [81] N. Bennenni, K. Guenda, and S. Mesnager, “New DNA cyclic codes over rings,” arXiv preprint arXiv:1505.06263, 2015. [82] I. Siap, T. Abualrub, and A. Ghrayeb, “Cyclic DNA codes over the ring F2 [u]/(u2 − 1) based on the deletion distance,” Journal of the Franklin Institute, vol. 346, no. 8, pp. 731–740, 2009. [83] H. Mostafanasab and A. Y. Darani, “On cyclic DNA codes over F2 + uF2 + u2 F2 ,” arXiv preprint arXiv:1603.05894, 2016. [84] K. Guenda and T. A. Gulliver, “Construction of cyclic codes over F2 + uF2 for DNA computing,” arXiv preprint arXiv:1207.3385, 2012. [85] A. Dertli and Y. Cengellenmis, “On cyclic DNA codes over the rings z {4}+ wz {4} and z {4}+ wz {4}+ vz {4}+ wvz {4},” arXiv preprint arXiv:1605.02968, 2016. [86] H. Hong, L. Wang, H. Ahmad, J. Li, Y. Yang, and C. Wu, “Construction of DNA codes by using algebraic number theory,” Finite Fields and Their Applications, vol. 37, pp. 328–343, 2016. [87] A. J. Ruben, S. J. Freeland, and L. F. Landweber, “Punch: An evolutionary algorithm for optimizing bit set selection,” in DNA Computing. Springer, 2001, pp. 150–160. [88] Z. Ignatova, I. Martinez-Perez, and K.-H. Zimmermann, DNA computing models. Springer Science & Business Media, 2008. [89] Q. Zhang and B. Wang, “On the bounds of DNA coding with h-distance,” MATCH Commun. Math. Comput. Chem, vol. 66, pp. 371–380, 2011. [90] L. M. Smith, R. M. Corn, A. E. Condon, M. G. Lagally, A. G. Frutos, Q. Liu, and A. J. Thiel, “A surface-based approach to DNA computation,” Journal of computational biology, vol. 5, no. 2, pp. 255–267, 1998. [91] A. G. Frutos, Q. Liu, A. J. Thiel, A. M. W. Sanner, A. E. Condon, L. M. Smith, and R. M. Corn, “Demonstration of a word design strategy for DNA computing on surfaces,” Nucleic Acids Research, vol. 25, no. 23, pp. 4748–4757, 1997. [92] Q. Liu, L. Wang, A. G. Frutos, A. E. Condon, R. M. Corn, and L. M. Smith, “DNA computing on surfaces,” Nature, vol. 403, no. 6766, pp. 175–179, 2000. [93] H. Wu, “An improved surface-based method for DNA computation,” Biosystems, vol. 59, no. 1, pp. 1–5, 2001. [94] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,” Science, vol. 270, no. 5235, pp. 467–470, 1995. [95] S. Brenner and R. A. Lerner, “Encoded combinatorial chemistry.” Proceedings of the National Academy of Sciences, vol. 89, no. 12, pp. 5381–5383, 1992. [96] J. Reif, H. Chandran, N. Gopalkrishnan, and T. LaBean, “Self-assembled DNA nanostructures and DNA devices,” Nanofabrication Handbook, pp. 299–328, 2012. [97] Y. Yang, Y. Liu, and H. Yan, “DNA nanostructures as programmable biomolecular scaffolds,” Bioconjugate chemistry, 2015. [98] G. Jacob and A. Murugan, “DNA based cryptography: An overview and analysis,” Int J Emerg Sci, vol. 3, no. 1, pp. 27–36, 2013. [99] D. Tulpan, C. Regoui, G. Durand, L. Belliveau, and S. Léger, “Hyden: a hybrid steganocryptographic approach for data encryption using randomized error-correcting DNA codes,” BioMed research international, vol. 2013, 2013. [100] A. Aich, A. Sen, S. R. Dash, and S. Dehuri, “A symmetric key cryptosystem using DNA sequence with OTP key,” in Information Systems Design and Intelligent Applications. Springer, 2015, pp. 207–215. [101] X. Wei, L. Guo, Q. Zhang, J. Zhang, and S. Lian, “A novel color image encryption algorithm based on DNA sequence operation and hyper-chaotic system,” Journal of Systems and Software, vol. 85, no. 2, pp. 290–299, 2012. Page 16 of 19 [102] R. Enayatifar, A. H. Abdullah, and I. F. Isnin, “Chaos-based image encryption using a hybrid genetic algorithm and a DNA sequence,” Optics and Lasers in Engineering, vol. 56, pp. 83–93, 2014. [103] M. Arita, “Method for designing DNA codes used as information carrier,” May 27 2004, uS Patent App. 10/558,502. [104] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6102, pp. 1628–1628, 2012. [105] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,” Nature, vol. 494, no. 7435, pp. 77–80, 2013. [106] S. Yazdi, H. M. Kiah, E. R. Garcia, J. Ma, H. Zhao, and O. Milenkovic, “DNA-based storage: Trends and methods,” arXiv preprint arXiv:1507.01611, 2015. [107] S. Tsaftaris, A. K. Katsaggelos, T. N. Pappas, E. T. Papoutsakis et al., “DNA computing from a signal processing viewpoint,” Signal Processing Magazine, IEEE, vol. 21, no. 5, pp. 100–106, 2004. [108] L. Qian and E. Winfree, “Scaling up digital circuit computation with DNA strand displacement cascades,” Science, vol. 332, no. 6034, pp. 1196–1201, 2011. [109] M. H. Garzon and T.-Y. Wong, “DNA chips for species identification and biological phylogenies,” Natural Computing, vol. 10, no. 1, pp. 375–389, 2011. [110] J. Dingel and O. Milenkovic, “Coding-theoretic methods for reverse engineering of gene regulatory networks,” in Information Theory Workshop, 2008. ITW’08. IEEE. IEEE, 2008, pp. 114–118. [111] D. G. Arquès and C. J. Michel, “A complementary circular code in the protein coding genes,” Journal of theoretical biology, vol. 182, no. 1, pp. 45–58, 1996. [112] D. G. Arques and C. J. Michel, “A code in the protein coding genes,” BioSystems, vol. 44, no. 2, pp. 107–134, 1997. [113] C. J. Michel, “A 2006 review of circular codes in genes,” Computers & Mathematics with Applications, vol. 55, no. 5, pp. 984–988, 2008. [114] M. H. Garzon, “Theory and applications of DNA codeword design,” in Theory and Practice of Natural Computing. Springer, 2012, pp. 11–26. [115] L. V. Bystrykh, “Generalized DNA barcode design based on Hamming codes,” PloS one, vol. 7, no. 5, p. e36852, 2012. A PPENDIX A B OUNDS ON DNA C ODES 1) Johnson type Bound- This bound is derived by shortening the code to length n − 1. The code is modified by choosing the codewords with a fix character b ∈ Zq at the ith position where i ∈ {1 . . . n} and deleting ith position from the codewords. For 0 ≤ d ≤ n and 0 < w < n, 2n GC a) AGC 4 (n, d, w) ≤ b w A4 (n − 1, d, w − 1)c [55]. 2n GC b) AGC 4 (n, d, w) ≤ b n−w A4 (n − 1, d, w)c [55]. 1 R c) AR 4 (n, d) ≤ b 4 A4 (n − 1, d)c [88]. 2) Halving BoundsR a) This is motivated by the fact DNA code CDN A and reverse DNA code CDN A are disjoint. Let n ≥ 1 be an integer. 1 R For each integer d with 0 < d ≤ n, A4 (n, d) ≤ A4 (n, d) [88]. 2 b) For 0 < d ≤ n and 0 ≤ w ≤ n, AGC,RC (n, d, w) ≤ 21 AGC 4 (n, d, w) [88]. 4 GC,R c) For 0 < d ≤ n and 0 ≤ w ≤ n,A4 (n, d, w) ≤ 12 AGC 4 (n, d, w) [88]. 3) Gilbert-Type Bounds-AGC 4 (n, d, w) is derived by dividing the total number of DNA strings of length n with GC-content w by the number of DNA string with distance d − 1 from the fixed codeword. a) For 0 ≤ d ≤ n and 0 ≤ w ≤ n, n w n−w w 2 2 GC A4 (n, d, w) ≥ Pd−1 Pminbr/2c,w,n−w w n−w n−2i [88]. 2i r=0 i=0 i i r−2i 2 b) For 0 ≤ d ≤ n and 0 ≤ w ≤ n, n n w 2 GC A4 (n, d, w) ≥ Pd−1 Pminbr/2c,w,n−w w n−w n−2i [55]. 2i r=0 i=0 i i r−2i 2 4) This bound is based on the aspect of GC-content of given DNA codeword is equal to reverse of DNA codeword [55]. For 0 ≤ d ≤ n and 0 ≤ w ≤ n, a) AGC,RC (n, d, w) = AGC,R (n, d, w) if n is even. 4 4 b) AGC,RC (n, d, w) ≤ AGC,R (n + 1, d + 1, w) if n is odd. 4 4 c) AGC,R (n, d + 1, w) ≤ AGC,R (n, d, w) ≤ AGC,R (n, d − 1, w) if n is odd. 4 4 4 5) By using AGC 4 (n, d, w) ≥ A2 (n, d, w)· A2 (n, d) inequality, following bound is derived [55]. n n−1 GC For 0 ≤ w ≤ n, A4 (n, 2, w) = 2 . w Page 17 of 19 (n, d, w) ≤ 12 AGC 6) By using Halving bound AGC,RC 4 (n, d, w) for d = 2 one can calculate the bound [55]. 4 For 0 ≤ w ≤ n and n is even, n n−2 AGC,RC (n, 2, w) = 2 . 4 w 7) Bound AGC 4 (n, d, w) is computed by dividing the total number of words with GC-content w that are at distance at least d from their reverse-complements by the number of these codewords that are at distance at most d − 1 from any fixed codeword [55]. For 0 ≤ d ≤ n and 0 ≤ w ≤ n, AGC 4 (n, d, w) Pn ≥ r=d V (n, d, r) Pd−1 Pminbr/2c,w,n−w w n−w n−2i 2i 2 r=0 i=0 i i r−2i 2 8) Product Bounds - This is based on the construction of the DNA code AGC 4 (n, d, w) with length n, minimum Hamming distance d and GC-content w from binary constant-weight codes A2 (n, d, w) and ternary constant-weight codes A3 (n, d, w) with length n, Hamming weight w and minimum Hamming distance d [55] [88]. For 0 ≤ d ≤ n and 0 ≤ w ≤ n, a) AGC 4 (n, d, w) ≥ A2 (n, d, w) · A2 (n, d) b) AGC,R (n, d, w) ≥ AR 2 (n, d, w) · A2 (n, d) 4 c) AGC,R (n, d, w) ≥ A2 (n, d, w) · AR 2 (n, d) 4 d) AGC 4 (n, d, w) ≥ A3 (n, d, w) · A2 (n − w, d) e) AGC,R (n, d, w) ≥ AR 3 (n, d, w) · A2 (n − w, d) 4 f) AGC,R (n, d, w) ≥ A3 (n, d, w) · AR 2 (n − w, d) 4 R g) AR 4 (n, d) ≥ A2 (n, d) · A2 (n, d). 9) Reverse and Reverse complement codes bounds - This can be simple observed by the reverse and reverse complement property of the DNA codeword [88] [14]. Let n ≥ 1 be an integer, R a) If n is even, then ARC 4 (n, d) = A4 (n, d) and R b) If n is odd, then ARC 4 (n, d) ≤ A4 (n + 1, d + 1) 10) Special cases - Bounds are observed by considering different combination of length n and GC-content w [14]. For n > 0, with 0 ≤ d ≤ n and 0 ≤ w ≤ n, a) AGC 4 (n, d, 0) = A2 (n, d) GC b) AGC 4 (n, d, w) = A4 (n, d, n − w) c) AGC 4 (n, n, w) = d) AGC,RC (n, n, w) = 4 e) AGC 4 (n, 1, w) 4 if w = n/2 3 if n ≤ w < n/2 or n/2 < w ≤ 2n/3 2 if w < n/3 or w > 2n/3 2 if w = n/2 1 if w 6= n/2 n n = 2 w Page 18 of 19 11) Bounds on reverse P code for d = 3 - This bound iis computed from the concept of sphere-packing bound for codes [14]. bn/2c dn/2e q i = 2bn/2c (q − 1) i AR . q (n, 3) ≤ 2(1 + 4(q − 2) + (n − 4)(q − 1)) 12) Bounds on reverse code of size S - By using Greedy approach to calculate size of the code, bound is derived [14]. Let V (s, d) be the number of words of S such that they have distance d from s where sεS. |S| a) AR where V − (d) = min{V (s, d)|sεS}. q (n, d) ≤ 2V − (b(d − 1)/2c) |S| where, V + (d) = max{V (s, d)|sεS}. b) AR q (n, d) ≥ 2V + d − 1 13) Bounds on reverse code for d = 2 - Bounds on reverse code for d = 2 is obtained by claims • Any two words from the same subset differ in at least two positions ie. dH (xDNA , yDNA ) ≥ 2 ∀ xDNA , yDNA ∈ CDN A . R • If a word belongs to a subset, its reversal is also in the same subset ie. if xDNA ∈ CDN A then xDNA ∈ CDN A n/2 • All the q palindromes are in the same subset. q n−1 , for even n and qε{2, 4}, and a) AR q (n, 2) = 2 q n−1 − q bn/2c b) AR (n, 2) = , for odd n and qε{2, 4} q 2 14) Doubling Construction - This is motivated by the minimum Hamming distance between the DNA codeword, its reverse codeword and revere complement. It is observed from the property of DNA code with dH (xDNA , yDNA ) ≥ d, C RC HDN A (xR DNA , yDNA ) ≥ d, HDN A (xDNA , yDNA ) ≥ d and HDN A (xDNA , yDNA ) ≥ d n n−1 For n ≥ 2, AR ) = 2n [14]. 2 (2 , 2 15) a) Bounds on even and odd length n reverse code - In this bounds, maximum size of reverse code of even and odd length n and relationship between reverse code of even and odd length n − 1 is demonstrated [14]. R R AR q (n − 1, d) ≤ Aq (n, d) ≤ Aq (n, d − 1) and R b) AR q (n − 1, d) ≥ Aq (n, d)/q, for odd n. 16) Bounds on Hamming distance constraint- V.Phan provided lower and upper bounds of DNA codeword sets which satisfy the h-distance constraint [89]. 4n−d+1 ≤ |S| ≤ 4n n d d−1 Page 19 of 19