Download The Art of DNA Strings: Sixteen Years of DNA Coding Theory

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Helicase wikipedia , lookup

Zinc finger nuclease wikipedia , lookup

DNA repair protein XRCC4 wikipedia , lookup

Homologous recombination wikipedia , lookup

DNA sequencing wikipedia , lookup

DNA repair wikipedia , lookup

DNA replication wikipedia , lookup

DNA profiling wikipedia , lookup

DNA polymerase wikipedia , lookup

Replisome wikipedia , lookup

DNA nanotechnology wikipedia , lookup

Microsatellite wikipedia , lookup

United Kingdom National DNA Database wikipedia , lookup

Helitron (biology) wikipedia , lookup

Transcript
1
The Art of DNA Strings: Sixteen Years of DNA
Coding Theory
Dixita Limbachiya, Bansari Rao and Manish K. Gupta
Dhirubhai Ambani Institute of Information and Communication Technology
Gandhinagar, Gujarat, 382007 India
Email:[email protected], bansari s [email protected] and [email protected]
arXiv:1607.00266v1 [cs.IT] 1 Jul 2016
Abstract
The idea of computing with DNA was given by Tom Head in 1987, however in 1994 in a seminal paper, the actual successful
experiment for DNA computing was performed by Adleman. The heart of the DNA computing is the DNA hybridization, however,
it is also the source of errors. Thus the success of the DNA computing depends on the error control techniques. The classical
coding theory techniques have provided foundation for the current information and communication technology (ICT). Thus it
is natural to expect that coding theory will be the foundational subject for the DNA computing paradigm. For the successful
experiments with DNA computing usually we design DNA strings which are sufficiently dissimilar. This leads to the construction
of a large set of DNA strings which satisfy certain combinatorial and thermodynamic constraints. Over the last 16 years, many
approaches such as combinatorial, algebraic, computational have been used to construct such DNA strings. In this work, we survey
this interesting area of DNA coding theory by providing key ideas of the area and current known results.
I. I NTRODUCTION
Information and Communication Technology (ICT) has come a long way in the last 60
years. We have now connected world of devices talking to each other or performing the
computation. This is possible due to the advancement of coding and information theory. In
1994, two new directions in ICT emerged viz. quantum computing and DNA computing.
The heart of any computing or communications is coding theory. Thus two new directions
in coding theory also emerged such as quantum coding theory and DNA coding theory. This
paper focuses on the later area giving a comprehensive overview of techniques and challenges
of DNA coding theory from last 16 years. In 1994, L.Adleman performed the computation
using DNA strands to solve an instance of the Hamiltonian path problem giving birth to
Figure 1: DNA structure with
DNA computing [1]–[4]. DNA is a double strand built by the pairing the four basic building its four nucleotides A-Adenine,
units A-(Adenine), C-(Cytosine), G-(Guanine), T-(Thymine) which are called nucleotides. G-Guanine, C-Cytosine and T The DNA strand is held by the important feature called complementary base pairing which Thymine. These are the basic buildconnects the Watson Crick complementary bases with each other denoted by AC = T and ing blocks of DNA which are held
GC = C. The backbone of DNA is an alternating chains of sugar and phosphate. DNA is by Hydrogen bonds in a double
helical manner. Each DNA base
an ideal source of computing due to its stable, dense and its self replicating property [5]. is paired with the complementary
After the successful experiment by Adleman, area of DNA computing [6] flourished into bases such that the yellow band is
different directions like DNA tile assembly [7] [8], building of DNA nano-structures [9] A which is connected to its comple[10], studying error correcting properties in DNA sequences [11] [12] and DNA based data mentary base T (green band) while
storage system [13]. But it was only with the identification of mathematical properties in C (blue band) is paired with red G
(red band)
DNA sequences [14] [15], it inspired many researchers to explore this new amalgamation of
biology and coding theory [16]. DNA computing [17] promises to solve many NP-complete
problems like Hamilton path problem, satisfiability problem (SAT) [18] and many others with massive parallelization [19]. To
perform computation using DNA strands, a specific set of DNA sequences are required with particular properties. Such set of
DNA strands satisfying various constraints [20], [21] play an important role in biomolecular computation [22] [23].
The aim of this manuscript is to elucidate the current state of art of DNA codes as shown in Figure 2, DNA codes constraints
and especially the new methods contributing to construct DNA codes from algebraic coding. It summarizes the research on
DNA codes which may help the researcher to get a broad picture of the area. Also, it gives brief overview on the applications
of DNA codes.
The Paper is organized as follows. Section II and III discusses the DNA code design problem and constraints on DNA codes.
Section IV describes various approaches for the construction of DNA codes. Section V gives a brief description on software
tools for DNA sequence design. Section VI includes description on bounds of DNA codes. Section VII gives a brief note on
applications of DNA codes. Section VIII shows results on DNA codes for 0 ≤ n ≤ 36 and 0 ≤ d ≤ 20. Section IX concludes
the paper with general remarks.
II. DNA C ODE D ESIGN P ROBLEM
To perform the DNA computation, DNA strands react with each other by Watson Crick base pairing and form perfect
match. But in some situation, DNA strands may not form perfect base pairing and react in undesirable manner. One situation
is formation of secondary structure in which first half strand of the DNA strand forms complementary with its own other
half forming the hair pin like structure. This kind of structure interrupt the desired computation. Such secondary structures
have to be avoided by designing the DNA strands carefully. Also DNA strands may bind to another DNA strands forming the
complementary base pairing with few base pairs creating an error. One way to avoid this is to ensure that every two DNA
strands differ in more than d locations where d depends on the application of DNA computation. This property can be obtained
by defining different distance like the Hamming distance, Lee distance, Edit distance, Deletion distance etc. between two DNA
strands.
DNA codes design problem [24]–[26] is to develop a set of the DNA codewords of length n over DNA alphabets ΣDN A =
{A, T, G, C} with predefined distance d [27] [28] and satisfying maximum set of constraints [29]. The main objective is to
find the largest possible set of codewords M of length n over alphabet ΣDN A = {A, C, G, T } feasible with respect to set of
constraints [30] with distance d that enable the construction of better error correcting codes [31]–[33].
Figure 2: DNA codes Time line: The state of art of DNA codes is showcased from 1994 to 2016. Different work mentioned
with authors in chronological order.
Definition (DNA code). A DNA code CDN A (n, M, d) ⊂ ΣDN A = {A, T, G, C} with each DNA codeword of length n and
size M and minimum distance d. Here A denotes Adenine, T denotes Thymine, G denotes Guanine and C denotes Cytosine
as the nucleotides in DNA.
III. C ONSTRAINTS ON DNA C ODES
There are different types of constraints that DNA codes must satisfy. Three categories in which these constraints can be
classified are as follows:
1) Combinatorial constraints
2) Thermodynamic constraints
3) Application Oriented constraints
The constraints which should be followed by DNA codes are listed below :
1) Hamming distance constraint (n, d, w) The Hamming distance constraint can be defined as dH (xDNA , yDNA ) ≥ d ∀ xDNA , yDNA ∈ CDN A for some Hamming
distance d [14]. A set of codewords with length n, size M and minimum Hamming distance d satisfying Hamming
constraint is denoted by CDN A (n, M, d). Hamming distance is calculated by the total number of places at which two
DNA codewords differ. For CDN A (n, M, d), the minimum of the distances is considered as the Hamming distance.
Aq (n, d) denotes the maximum size of a code with codewords of length n and distance d over alphabet size q, in case
of DNA codewords q = 4.
Example 1. Let xDNA = AT GACT and yDNA = ACT AGC, then dH (xDNA , yDNA ) = 4 where xDNA , yDNA ∈ CDN A
following the Hamming distance constraint with dH (xDNA , yDNA ) ≥ 3.
2) Reverse constraint(n, d) - The reverse constraint is HDN A (xR
DNA , yDNA ) ≥ d ∀ xDNA , yDNA ∈ CDN A . A code satisfying
reverse constraint is called reverse code. AR
(n,
d)
denotes
the maximum size of a reverse code with length n and
q
minimum Hamming distance d.
Page 2 of 19
Figure 3: DNA codes constraints classified as three categories combinatorial, thermodynamics and application oriented
constraints.
3)
4)
5)
6)
Example 2. Let xDNA = AT GACT , xRDNA = T CAGT A and yDNA = AT ACAT . For n = 6 and d = 3, HDN A (xRDNA , yDNA ) ≥
3. xDNA , yDNA are reverse code.
C
Reverse Complement constraint (n, d) - This RC-constraint is HDN A (xR
DNA , yDNA ) ≥ d ∀ xDNA , yDNA ∈ CDN A . DNA
RC
code satisfying RC-constraint is called a reverse complement code. Aq (n, d) denotes the maximum size of a reverse
complement code with length n and minimum Hamming distance d [14].
Example 3. Let xDNA = AT GACT , xRDNA = T CAGT A and yDNA = GT ACAC, yCDNA = CAT GT G. For n = 6 and
d = 3, HDN A (xRDNA , yCDNA ) = 4 ≥ 3. xDNA , yDNA are reverse complement code.
GC-content constraint (n, d, w) - The set of codewords with length n, distance d and GC weight w ,where w is total
number of Gs and Cs present in the DNA strand viz. wxDN A = |{xi : xDNA = (xi ) , xi ∈ {C, G}}| [34]–[36]. Generally
w = bn/2c.
Example 4. Let xDNA = AT T GCT then xDNA ∈
/ CDN A for n = 6 and w = 3.
Melting temperature constraint (n, d, w)- Melting temperature Tm is a temperature at which half of the DNA strands are
hybridized and half are not. Melting is opposite of hybridization in which two strands get separated. It is advantageous
to have the codewords with the similar melting temperatures as it will enable the hybridization of multiple DNA strands
simultaneously. Hence we select the codewords with similar melting temperature calculated by Nussinov’s algorithm
[25]. For each xDNA ∈ CDN A have identical melting temperature [29] .
Thermodynamic constraint (n, d, w) - For the DNA stability, it is necessary for all the DNA codes should have comparable
free energy ∆G◦ [37]–[39] above some threshold [40], [41]. As DNA with minimum free energy is more stable and hence
we consider only those DNA codewords in a DNA code that have approximately comparable free energy in the set of
DNA codewords. Free Energy ∆G◦ for the given DNA codewords xDNA , yDNA ∈ CDN A can be obtained by the equation,
|∆G◦ (xDNA ) − ∆G◦ (yDNA )| ≤ δ
where δ > 0 is a constant.
7) Uncorrelated-correlated constraint (n, d, w) - A codeword ∈ CDN A if shifted by x units where x ≤ n should not match
with any of the other codeword ∈ CDN A [42].
Example 5. Let XDNA = CAT CAT C and YDNA = AT CAT CGG. X ◦ Y = 0100100, as depicted below.
X=
Y=
Y=
Y=
Y=
Y=
Y=
Y=
C
A
A
T
A
T
C
T
A
C
A
C
T
A
A
T
A
C
T
A
T
C
T
A
C
T
A
C
G
C
T
A
C
T
A
G
G
C
T
A
C
T
G
G
C
T
A
C
G
G
C
T
A
G
G
C
T
G
G
C
G
G
G
0
1
0
0
1
0
0
Page 3 of 19
The constraints (1) to (3) are used to avoid undesirable hybridization between different DNA strands and (4) to (6) are
the DNA constraints which ensures that all the codewords have similar thermodynamic characteristics to perform uniform
computation.
IV. VARIOUS A PPROACHES FOR THE DNA C ODES C ONSTRUCTIONS
There are different approaches [43] for the construction of the DNA codes with finite length n, defined distance d and set of
constraints with respect to application. The set of constraints essential for the DNA codes is subject to the application. There
exist algorithmic, theoretic and software simulation method [44] [45] [46] approaches to design the DNA codes. Optimality of
the DNA code can be obtained by construction of the DNA codes in a way that every codeword in the set follows maximum
number of constraints for a large value of n and large minimum distance d with minimum errors in DNA computation [47].
Figure 4: DNA codes construction Methods are classified as Search algorithms, Algebraic Coding methods and Software
simulations. Search algorithms include Variable neighborhood search, Simulated Annealing, Stochastic Local Search. Template
Based method uses a template to design a DNA codeword. Algebraic coding method is construction of DNA codes by using
Algebraic structures like Fields and Rings. Altruistic Approach is construction of DNA codewords in altruistic way in this
work.
1) Variable Neighborhood Search Approach: This approach employs different local search algorithms to search the DNA
codes [48]–[51]. Some of the search algorithms are described here.
a) Seed Building(SB) : Seed Building (SB) algorithm examines all the possible codewords randomly with respect to
seed codeword where seed codewords are the initial set of codewords with the required constraint. For more details
on the algorithm, [52] can be referred. The drawback of the approach is that it is inefficient for larger values of n
because of the computational and time complexity invested for the development of feasible set of codewords.
Example 6. Consider the n = 3, d = 1 and GC-content w = 2 and initial seed be codewords with d = 1 and
w = 2 are AGG and AGC then the resulting codewords at first iteration with HD and GC-content constraints are
GGC, GCA, GT C, CCG, CGC, CT C, T GG. Note that ACC, GCT, CAG are also codewords with GC-content
2 but at distance d = 3 so will be added in the next iteration for seed d = 3.
b) Clique Search : In this method a random subset of the codewords of a given code is removed, leaving a partial
code. All the codewords are generated and those compatible with the ones left in the code are identified and a
graph is built where the identified codewords are nodes, and an edge exists between compatible codewords.
Example 7. Let n = 4, d = 3 and w = 2 then the partial code be CT T C, CGAA, T GGT, GT GA with HD, RC
and GC constraints. The clique search will result in codewords CACT, GCT T, AGT G and AAGC.
c) Hybrid Search: This method combines two approaches of seed building and clique search. It uses seed building to
generate the partial code and clique search to search for the best codewords [53].
d) Greedy Approach: These types of algorithms removes the worst DNA code at each iteration from a set of codes
every time the algorithm is iterated. But the problem with this approach is that it doesn’t always produce the best
results. In the process of removing the worst code at each stage it may remove a potentially good code at earlier
stages and may not remove bad code at the later stages. So it doesn’t always produce the best optimal result [54].
Example 8. Let n = 3, d = 2 and w = 2 then codewords from greedy search AGG, ACC, GGC, GCA,
CCG, CT C.
e) Lexicographic Approach: These types of algorithms take into consideration a particular arrangement of the DNA
codewords. It may be an alphabetical order or ascending order or any such kind of ordered arrangement of codes
such that the arrangement of the DNA codewords satisfying the DNA constraints are ordered in a specific manner
[55].
2) Simulated Annealing Approach: Simulated Annealing is a meta heuristic algorithm derived from thermodynamic
principles [52]. These types of algorithms work on the set of codewords in which not all the codewords in the given set
Page 4 of 19
3)
4)
5)
6)
satisfies the constraints specified. These algorithms attempts to change the codewords with the objective of reducing the
constraint violations to the constraints and try to make them feasible. The set with a feasible solution is derived when
there is no violation of constraints.
Stochastic Local Search Approach: We search for the codewords with parameters (n, d, w = n/2) such that the length
of the codewords is n and the GC-content is bn/2c and the minimum Hamming distance is atleast d. These types of
algorithms work on random set of codewords while solving the problem. It generally considers an initial set of random k
codewords and then removes the one which doesn’t satisfy the constraints. For more details on the the algorithm, reader
can refer to [56].
Example 9. Let random codewords k = 16 then set is AA, AT, AG, AC, GA, GT, GC, GG, CT, CA, CC,
CG, T A, T C, T G, T T . Codewords with n = 2, d = 2 and w = 1 are GC, AG, CA, T C.
Genetic Algorithm: In this work, genetic algorithm is used to search efficient and reliable codes by minimizing the
mis-hybridization error. This codewords generated were unique in terms of Hamming distance that satisfy the Hamming
bound [57].
Template Based Method: This method was introduced in [58]–[60] that involves two step process. Initially a template
is designed which is mapped to binary error correcting codes and combination of template and codeword results into
DNA codewords. This method is not optimal because the code size is limited depending of the size of error correcting
code used and selection of template. Mapping of the template to the codewords is subjected to specific application.
Example 10. Let template t = 1001101 and a codeword c = 0010111. Suppose the map define 11 → A, 10 → T ,
01 → G and 00 → C for position of 1 and 0 in t is 1 = [AT ] and 0 = [GC] followed by position in codewords then
the DNA codes will be T CGT AGA.
Algebraic Coding Approach: By using the algebraic coding, DNA codes are constructed from fields and rings by
mapping the elements of the field and rings to the DNA nucleotides [61].
a) Codes over Fields:
i) Linear codes over GF (4) : In this approach, DNA codewords are constructed from GF (4) by using different
one-to-one mapping the elements of GF (4) to DNA nucleotides [62]. The mapping is preferred from {0,1,
ω, ω 2 } to {A,C,G,T} with respect to the codes used. There linear code [35] and additive codes.The method
used have improved the lower bounds on GC constraints and extended the result on the length of DNA code to
n ≤ 30. Researchers extended this construction for non linear codes and cyclic codes [63]. Also DNA codes over
GF (4) were constructed using BCH codes in [64] in which the protein and targeting sequences are identified
as codewords of error-correcting BCH codes.
ii) Linear and Additive codes over GF (4): In this paper, DNA codes are constructed considering linear codes
and additive codes over GF (4) pf odd length following Hamming distance constraint and reverse complement
constraint. In [65], DNA codes of length 7, 9, 11 and 13 have been considered. DNA nucleotides {A, C, G, T }
have been mapped to 0, ω, ω and 1 respectively with ω = ω 2 and ω 2 + ω + 1 = 0. Each codeword has been
mapped to a polynomial and the Trace map T r : GF (4) → GF (2) is stated as :
T r(x) = x + x2
iii) Extended, Additive, Additive Extended Cyclic codes over GF (4): In referred work, DNA codes satisfying GCcontent constraint and a minimum Hamming distance constraint were constructed using computer algebra systems
Magma [66] and Maple [67]. Longer codes of higher length 4 ≤ n ≤ 30 were derived from GF (4), additive
codes over GF (4) and Z4 (see Figure 5). Moreover it was claimed that by using different mapping from fields or
rings to DNA codewords can result into different lower bounds. Further the bounds on the DNA codes satisfying
set of constraints were ameliorated by shortening and puncturing of obtained codes [68].
b) Codes over Rings : Algebraic construction of DNA codes was further extended to codes over rings. Different rings
are used to construct DNA codes by mapping rings elements to DNA nucleotides.
i) DNA sequences generated by Z4 linear codes: In [69], a biological coding system which modeled the existence
of error correcting codes in the DNA structure. Model consists of an encoder and modulator. The encoder consist
of a mapper that converts the DNA nucleotides to elements of and BCH codes over Z4 ) and a modulator consist
of a genetic code, tRNA (transfer RNA that serves as the connecting link between the mRNA (messenger RNA)
to the amino acid sequence of proteins) and ribosome (protein synthesizer of the cell) which is associated with
signals that convert the genetic codons to protein. The DNA and protein coding sequences from different species
have been identified as the codewords over linear codes over Z4 . A class of error correcting-code BCH codes
with parameters (n, k, 3) have been used in the encoder to construct the DNA codes over Z4 .
In [70] [71], the self dual codes over Z4 are used for construction for DNA codes. Additionally, GC weight
enumerator of the DNA codes that determines the number of Gs and Cs in the codeword. GC wright enumerator
helps in the construction of DNA codes satisfying GC- content constraint. Self dual DNA codes over Z4 are
Page 5 of 19
Cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 30.
Extended cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 30.
Additive cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 19.
Additive extended cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 20.
Cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 24.
Extended cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 24.
Cosets of cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 20.
Cosets of extended cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 20.
Cosets of additive cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 14.
Cosets of additive extended cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 15.
Cosets of cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 20.
Cosets of extended cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 30.
Figure 5: DNA codes construction methods using Computer Algebra Systems such as Maple and Magma [68] .
developed by using mapping as A → 0 , C → 1, T → 2 and G → 3. The following generator matrix G over
Z4 is considered and let K4 denote the codeword over Z4 formed from G. There are 16 codewords and wa (c)
where a ∈ Z4 and k ∈ K4 as shown in Table I. 

1 1 1 1
G = 0 2 0 2
0 0 2 2
K4
(0000)
(2222)
(0202)
(2020)
(0022)
(2200)
(0220)
(2002)
w0
4
0
2
2
2
2
2
2
w1
0
0
0
0
0
0
0
0
w2
0
4
2
2
2
2
2
2
w3
0
0
0
0
0
0
0
0
K4
(1111)
(3333)
(1313)
(3131)
(1133)
(3311)
(1331)
(3113)
w0
0
0
0
0
0
0
0
0
w1
4
0
2
2
2
2
2
2
w2
0
0
0
0
0
0
0
0
w3
0
4
2
2
2
2
2
2
Table I: (4, 16, 3) code is generated by G where wa (c) where a ∈ Z4 and k ∈ K4 [70].
ii) Lifted Polynomials over F16 : In [72], reversible codes by using special family of polynomials denoted as lifted
polynomials over F4 which generates the reversible codes of odd length over F1 6 are constructed. 4-lifted
polynomial is used to generate the DNA code of even length by using the correspondence between pair of
DNA nucleotides to elements of the ring F1 6. Table II preserves the property that if DNA pair is mapped to an
element of F16 then reverse of that DNA pair is mapped to fourth power of the element of F16 . For example,
α2 → GC then (α2 )4 → CG.
iii) DNA codes over F2 [u]/u4 − 1: In [73], cyclic DNA codes of odd length are obtained from F2 [u]/(u4 − 1) =
{a + bu + cu2 + du3 | a, b, c, d ∈ F2 } where u4 = 1 commutative ring is considered. All its ideals are listed
here h0i = h(1 + u)4 i ⊂ h(1 + u)3 i ⊂ h(1 + u)2 i ⊂ h(1 + u)i ⊂ R. The 16 elements of ring R are mapped to
2-length nucleotides. The following mapping shown in Table III was considered in the paper [73]. This mapping
preserved the complementary and reverse property by adding 1 + u + u2 + u3 and multiplying u2 respectively.
To find complement of AA, we add 1 + u + u2 + u3 to 0, it will give 1 + u + u2 + u3 = T T . To find reverse
AA, multiply 0 to u2 will result in 0 =AA.
In [74] DNA cyclic codes of arbitrary length satisfying the reverse complement constraint are constructed by
using additive stem distance. The correspondence between the ring elements and DNA is established by following
mapping mentioned in Table IV. this preserves the reverse complement property of the DNA codewords by the
Page 6 of 19
Sr.No
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
DNA Pair a
AA
TT
AT
GC
AG
TA
CC
AC
GT
CG
CA
GG
CT
GA
TG
TC
Multiplicative(F16 )
0
α0
α1
α2
α3
α4
α5
α6
α7
α8
α9
α10
α11
α12
α13
α14
Additive
1
α
α2
α3
1+α
α + α2
α2 + α3
1 + α + α3
1 + α2
α + α3
1 + α + α2
α + α2 + α3
1 + α + α2 + α3
1 + α2 + α3
1 + α3
Table II: Mapping from DNA nucleotide Pair to element of F16 . [72]
Element
AA
TT
GG
CC
Map
0
1 + u + u2 + u3
1 + u2
u + u3
Element
AT
TA
GC
CG
Map
1+u
u2 + u3
u + u2
1 + u3
Element
GT
TG
AC
CA
Map
1
u2
1 + u + u3
u + u2 + u3
Element
CT
TC
AG
GA
Map
1+ u + u2
1 + u2 + u3
u
u3
Table III: Mapping from DNA Nucleotide Pair to elements of the Ring F2 + uF2 + u2 F2 + u3 F2 where u4 = 1 [73].
x + xc = u3 + u2 + u + 1. The reverse of the DNA code is obtained by multiplying u2 to any element x of the
ring R.
Element
GG
CC
GC
CG
Map
0
1 + u + u2 + u3
1 + u2
u + u3
Element
AT
TA
AA
TT
Map
1+u
u2 + u3
u + u2
1 + u3
Element
GT
TG
AC
CA
Map
1
u2
1 + u + u3
u + u2 + u3
Element
CT
TC
AG
GA
Map
1+ u + u2
1 + u2 + u3
u
u3
Table IV: Mapping from DNA Nucleotide Pair to elements of the Ring F2 + uF2 + u2 F2 + u3 F2 where u4 = 1 [73].
iv) DNA cyclic codes over F2 + uF2 where u2 = 0: DNA codes of even length following reverse and reverse
complement constraints have been studied in [75]. The field F2 is a subring of R. A linear code C of length
n over R is defined to be an additive submodule of the R-module Rn . A cyclic code of length n over R
is a linear code with the property that if (c0 , c1 , . . . , cn−1 ) ∈ C then (cn−1 , c0 , . . . , cn−2 ) ∈ C. An n-tuple
c = (c0 , c1 , . . . , cn−1 ) ∈ Rn is identified with the polynomial c0 + c1 x + . . . + cn−1 xn−1 in the ring Rn =
R[x]/ (xn − 1), which is called the polynomial representation of c = (c0 , c1 , ..., cn−1 ). Here R = F2 + uF2 with
{0, u, 1, 1 + u} elements are in one to one correspondence with nucleotides A, T, G and C such that 0 → A,
u → T , u + 1 → C and 1 → G. The DNA codes of length 8 and 10 are obtained from C = g 6 + u x5 + x
and C = g1 g22 + ug2 , ug1 g2 . Necessary and sufficient conditions for cyclic codes to follow the reverse and
reverse-complement properties have also have been studied. To preserve the reverse complement constraint o
find complement of A, we add u to 0, it will give u = T.
v) DNA codes over Z4 +uZ4 : DNA cyclic codes of odd lengths following reverse and reverse complement constraint
are constructed in [76]. Here ring Z4 +uZ4 = a + ub | a, b ∈ Z4 with u2 = 0 is considered. Reversible and cyclic
reversible complement codes are discovered in this paper. In this work defined a Gray map that allows them to
translate the properties of DNA codes to binary codewords i.e. φ : Z4 +uZ4 → Z24 such that φ(a+ub) = (b, a+b)
where a, b ∈ Z4 . In this paper, 16 pairs of nucleotides which are mapped as shown in Table V.
Element
AA
AT
GT
CT
Map
0
2
2u
2+3u
Element
TT
TA
CA
GA
Map
1+u
3+u
1+3u
3+2u
Element
GG
GC
AC
AG
Map
1
3
3u
2+2u
Element
CC
CG
TG
TC
Map
u
2+u
1+2u
3+3u
Table V: Mapping from DNA nucleotide pair to element of the Ring Z4 + uZ4 where u2 = 0 [76].
vi) F4 [u]/ < u2 + 1 > where u2 = 1: In [77] self-reciprocal complement cyclic codes from R with F4 =
{0, 1, α, α + 1} where α is the root of primitive polynomial x2 + x + 1 over F2 are studied. The DNA code of
specific length 6 over the ring is considered. One to one correspondence is establish between pairs of nucleotides
Page 7 of 19
and 16 elements of the ring. DNA cyclic codes constructed followed reverse complement, GC content and
Hamming distance constraints. The basis is {1, u + 1} and then every element of R is expressed in the form of
a + b(u + 1) where a,b ∈ F4 . The mapping is done as in Table VI
Element
AA
AG
TG
GC
Map
0
α
u
1 + αu
Element
TT
TC
AC
CG
Map
α + 1(α + 1)u
1 + (α + 1)u
α + 1 + αu
α+u
Element
GT
AT
GA
CC
Map
1
α+1
αu
1+u
Element
CA
TA
CT
GG
Map
α + (α + 1)u
(α + 1)u
1+α+u
α + αu
Table VI: Mapping between the pair of DNA nucleotides and Ring F4 [77].
vii) DNA codes from F2 + uF2 + vF2 + uvF2 with u2 = 0 and v 2 = v: The structure of cyclic DNA codes of an
arbitrary length over R2 = F2 + uF2 + vF2 + uvF2 was studied and the relation to codes over R1 = F2 + uF2 by
defining Gray map between R2 and R21 was established [78]. DNA codes following reverse, reverse complement
constraints are studied. The Gray map from R2 to R1 is defined as φ(a + bv) = (a, a + b). GC weight over the
ring was also introduced by using image of Gray map. One type of nontrivial automorphisms can be defined
over R2 as follows : σ : F2 + uF2 + vF2 + uvF2 → F2 + uF2 + vF2 + uvF2 ,
2n
a + bv → a + (1 + v)b such that a, b ∈ F2 + uF2 . Table is defined using : Φ (c) : C → SD4
,
(a0 + b0 v, a1 + b1 v, . . . , an−1 + bn−1 v) 7→ (a0 , a1 , . . . , an−1 , a0 + b0 , a1 + b1 , . . . , an−1 + bn−1 ). Below is the
Table VII for mapping elements of the ring to DNA described in the paper. For instance, (c0, c1, c2, c3) =
(u + v, u, v, 1) is mapped to T CT T AGT CGG.
Elements a
Gray Images
0
uv
1
1+uv
u
u+uv
1+u
1+u+uv
(0,0)
(0,u)
(1,1)
(1,1+u)
(u,u)
(u,0)
(1+u,1+u)
(1+u,1)
Double
DNA Pairs
ζ(a)
AA
AT
GG
GC
TT
TA
CC
CG
Elements a
Gray Images
Double DNA
Pairs ζ(a)
v
v+uv
1+v
1+v+uv
u+v
u+v+uv
1+u+v
1+u+v+uv
(0,1)
(0,1+u)
(1,0)
(1,u)
(u,1+u)
(u,1)
(1+u,u)
(1+u,0)
AG
AC
GA
GT
TC
TG
CT
CA
Table VII: Mapping between DNA pair and elements of the Ring F2 + uF2 + vF2 + uvF2 with u2 = 0 and v 2 = v [78].
viii) codes over F4 + vF4 : In linear, constacyclic and cyclic codes over the ring R = F4 [v]/(v 2 − v) are constructed
in [79]. The ring R = {a + vb|a, b ∈ F4 } is non-chain finite semi-local Frobenius ring with 16 elements. The
4 elements are F4 = {0, 1, ω, ω + 1} where ω 2 = ω + 1. The Gray map from R to F4 × F4 is given by :
φ(c) = (a + b, a).
If a = (a1 , a2 , . . . , an ) P
∈ Rn , then the Hamming weight of a is the sum of the Hamming weights of its
n
components, i.e.w(a) = i=1 w(ai ). The Hamming distance between a and b in R is d(a, b) = w(a − b). The
Lee weight of any element of R is the Gray image of its Hamming weight, i.e. wL (c) = wH (φ(c)). Below is
the Table VIII for mapping used in [80].
Elements
Gray Images
0
ω
v
v+ω
vω
ω + vω
v + vω
ω + v + vω
(0, 0)
(ω, ω)
(1, 0)
(1 + ω, ω)
(ω, 0)
(0, ω)
(1 + ω, 0)
(1, ω)
Double DNA
Pairs ξ(a)
AA
CC
TA
GC
CA
AC
GA
TC
Elements
Gray Images
1
1+ω
1+v
1+v+ω
1 + vω
1 + ω + vω
1 + v + vω
1 + v + ω + vω
(1, 1)
(1 + ω, 1 + ω)
(0, 1)
(ω, 1 + ω)
(1 + ω, 1)
(1, 1 + ω)
(ω, 1)
(0, 1 + ω)
Double DNA
Pairs ξ(a)
TT
GG
AT
CG
GT
TG
CT
AG
Table VIII: Mapping between DNA Pair and elements of the Ring F4 + vF4 [79].
ix) R = F2 [u]/(u6 ): In [81], DNA cyclic codes over a family R1 = F2 [u]/(u6 ) and ring R2 = F2 + uF2 where
v 2 = v satisfying reverse complement constraint have been constructed. In this a new family of DNA skew
cyclic codes is introduced over ring R = F2 + vF2 = 0, 1, v, v + 1 where v 2 = v. The ring R1 = F2 [u]/(u6 ) =
{a0 + a1 u + a2 u2 + a3 u3 + a4 u4 + a5 u5 ; ai ∈ F2 , u6 = 0}. There is direct map between 64 elements of the
ring to 64 codons (three nucleotides) used in nature as a substrate for aminoacid synthesis shown in Table IX.
x) DNA codes over ring F2 [u]/u2 − 1 : Ring R = F2 [u]/u2 − 1 = {0, 1, u, 1 + u} where u2 = 1 was described
[82] [83] . Elements of the ring were directly mapped to DNA nucleotides such that reverse complement
Page 8 of 19
DNA
Ring Element
Codons
CCC
GGA
GGC
GGT
AGG
CGG
GAG
AGA
AGC
ATG
AGT
CGA
CGC
CGT
TGG
GAA
DNA
Ring Element
Codons
u5 +u4 +u3 +u2 +u+1
u5 + u4 + u3 + u2 + u
u5 + u4 + u3 + u2 + 1
u5 + u4 + u3 + u2
u5 + u4 + u3 + u + 1
u5 + u4 + u2 + u + 1
u5 + u3 + u2 + u + 1
u5 + u4 + u3 + u2 + u
u5 + u4 + u3 + 1
u4 + u3 + u2 + u + 1
u5 + u4 + u3
u5 + u4 + u2 + u
u5 + u4 + u2 + 1
u5 + u4 + u2
u5 + u4 + u + 1
u5 + u4 + u3 + u2 + u
GGG
CCT
CCG
CCA
TCC
GCC
CTC
TCT
TCG
TAC
TAT
GCT
GCG
TAA
CTG
CTT
0
1
u
u+1
u2
u3
u4
u2 + 1
u2 + u
u5
u5 + 1
u3 + 1
u3 + u
u5 + u
u4 + u
u4 + 1
DNA
Ring Element
Codons
ACT
ACG
TTT
TTG
CTA
GTT
GTG
TCA
CAA
CAC
GCA
TTA
ACC
CAT
TGT
CAG
u3 + u2 + u + 1
u3 + u2 + u
u4 + u2 + 1
u4 + u2 + u
u4 + u + 1
u4 + u3 + 1
u4 + u3 + u
u2 + u + 1
u5 + u2 + u
u5 + u2 + 1
u3 + u + 1
u4 + u3
u3 + u2
u5 + u2
u5 + u4
u5 + u3
DNA
Ring Element
Codons
GTC
ACA
GAC
AGG
GAT
GTA
ATT
ATA
ATC
TGA
AAT
AAA
TGC
AAC
TCC
TAG
u4 + u2 + u + 1
u3 + u2 + u + 1
u5 + u3 + u2 + 1
u5 + u3 + u + 1
u5 + u3 + u2
u4 + u3 + u + 1
u4 + u3 + u2 + 1
u4 + u3 + u2 + u
u4 + u3 + u2
u5 + u4 + u
u5 + u2 + u + 1
u5 + u3 + u
u5 + u4 + 1
u5 + u3 + 1
u4 + u2
u5 + u + 1
Table IX: Mapping between 64 elements of the Ring F2 + uF2 + u2 F2 + u3 F2 + u4 F2 + u5 F2 and 64 codons [81].
constraint is conserved. By adding 1 + u to elements of rings, complement can be obtained. For example
0 → A, u → C, 1 → G, 1 + u → T preserves the complement property: 0 = 1 + u and u = 1.
The elements of the ring R can be mapped to the elements of F2 = {0, 1} via the map θ where θ(0) =
θ(u + 1) = 0 and θ(1) = θ(u) = 1. Let C be a cyclic code in Rn . It can be extended to the map θ to a map
ϕ : C → Z2 [x]/(xn −1) defined by ϕ(a0 +a1 x+a2 x2 +. . .+an−1 xn−1 ) = θ(a0 )+θ(a1 )x+. . .+θ(an−1 xn−1 ).
xi) DNA cyclic codes over F2 + uF2 : In [84] odd length codes over rings satisfying reverse complement, GCContent and thermodynamic constraints are studied. They are obtained from the cyclic complement reversible
code. Infinite family of BCH DNA codes are constructed. The mapping Φ called Gray Map has been used to
map linear codes over R to binary linear codes. The Gray map Φ is the distance-preserving map (Rn , Lee
distance) → (F22n ,Hamming distance).
xii) Cyclic DNA codes over Z4 + wZ4 : Recently, cyclic DNA codes over R = Z4 + wZ4 where w2 = 2 and
S = Z4 + wZ4 + vZ4 + wvZ4 where v 2 = vw have been discovered [85]. In this work, odd length DNA
cyclic codes over R satisfying reverse and reverse complement constraint is studied . Also, a family of DNA
skew cyclic codes with reverse complement property over R is constructed. Binary images of the cyclic DNA
codes over R and S is determined. The correspondence between elements of ring R and double DNA pairs are
establish as described in the Table X.
Elements
Gray Images
0
2
ω
3ω
1 + 2ω
2+ω
2 + 3ω
3 + 2ω
(0, 0)
(2, 0)
(0, 1)
(0, 3)
(1, 2)
(2, 1)
(2, 3)
(3, 2)
Double DNA
Pairs ξ(a)
AA
GA
AC
AT
CG
GC
GT
TG
Elements
Gray Images
1
3
2ω
1+ω
1 + 3ω
2 + 2ω
3+ω
3 + 3ω
(1, 0)
(3, 0)
(0, 2)
(1, 1)
(1, 3)
(2, 2)
(3, 1)
(3, 3)
Double DNA
Pairs ξ(a)
CA
TA
AG
CC
CT
GG
TC
TT
Table X: Mapping between DNA Pair and elements of the Ring.
7) Algebraic Number Theory codes: Aforementioned all the methods include construction of DNA codes from classical
coding theory and heuristic approaches works well for small length n. In this author constructed the DNA codes using
algebraic number theory [86], making the first attempt, by using irreducible cyclic codes to built DNA codes with large
n (< 1000) and number of codewords (M < 7000) satisfying the GC content constraint.
V. S OFTWARE TOOLS FOR DNA C ODES G ENERATION
There are different tools developed for designing the DNA codewords namely DNA sequence Generator [44] and evolutionary
algorithm based program-PUNCH (Princeton University Nucleotide Computing Heuristic) [87] were used to find set of
dissimilar sequences. DNA sequence generator and compiler use graph based approach based on the overlapping sub-sequences.
The GUI of the software allows the user to import the the DNA sequence to the sequence wizard with different parameters. It
Page 9 of 19
also calculate the melting temperature of the DNA sequences. It check for the reverse complement constraints and forbidden
DNA strands. However this software do not take care of secondary structure formation of the DNA strands. PUNCH is used
for performing various DNA computing by bit set selection. It works on randomization by selecting three basic parameters N,
B and V where N is number of bits in the problem, B is the number of nucleotides in each bit, and V is number of variation
on each bit set.
Here the author has created a web application in which the user has to first select the mode which is either specific based
or range. For both modes, the user has to select the specific constraints he wants DNA codes to follow but for specific the
input has parameters n and d where n is the length of the codewords and d is the Hamming distance. For range, the input
parameters are n1 and n2 where n1 is the starting length and n2 is the ending length and also are d1 and d2 where d1 is the
starting Hamming distance and d2 is the ending Hamming distance. The web portal will display the number of codewords and
also those codewords that satisfy the constraint.
VI. B OUNDS ON DNA C ODES
There are several bounds studied for DNA codes. This section comprehend the types of bounds obtained on set of constraints.
In the Table XI, columns are ticked for which respective bounds on constraints are obtained. Note that n is the length, d is
minimum Hamming distance and w is GC content of the DNA code. One can observe that almost all the methods have
obtained lower bounds on reverse constraint. But most important part is to investigate that there is no bounds on DNA codes
satisfying the set of HD, GC, R and RC constraints altogether. Also there are very few attempts made to explore the bounds
on thermodynamic constraints. One can explore the methods which allows the formulation of bounds on the thermodynamic
constraints. More details on the bounds are described in Appendix A.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
AHD
(n, d, w)
4
AR
4 (n, d, w)
X
X
X
X
ARC
4 (n, d, w)
AGC
4 (n, d, w)
17
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
AR,RC
(n, d, w)
4
ARC,GC
(n, d, w)
4
AR,GC
(n, d, w)
4
X
X
X
X
X
X
AR,RC,GC
(n, d, w)
4
Table XI: Bounds on DNA codes: 1-Johnson type Bound [88] [55], 2-Halving Bounds [88] [55], 3-Gilbert-Type Bounds [55],
4- Relation between AGC,RC
and AGC,R
[55], 5- AGC
bound, 6- AGC,RC
bound, 7- AGC
4
4 (n, d, w) bound, 8- Product Bounds
4
4
4
RC
R
RC
[88] [55], 9- Relation between A4 and A4 [88], 10 to 16 - Relation between AGC
and AR
4 , A4
4 [14], 17-V. Phan bound
[89] satisfying Hamming distance constraint.
VII. A PPLICATIONS OF DNA C ODES
DNA codes are used in various technologies like DNA computing [1], surface based DNA computation [90]–[93], DNA
Microarray technology [94], Molecular barcodes for chemical libraries [95], DNA nanostructures [96], [97], DNA origami
Page 10 of 19
[10], data encrytption [98], [99] [100]–[102], data storage [13], [103]–[106], signal processing [107] and DNA nanodevices
and circuits [108]. Recently, it has reported potential of DNA codes in phylogenetic studies [109]. Also it has contribution
in understanding gene regulatory networks [110], protein coding genes [111], [112] and studying the structure of genes via
circular codes [113].
In DNA computing, DNA codes with specific properties are required to perform various parallel and logical operations [114].
Molecular barcodes [115] generated from DNA codes are used as biomarkers for authentication of the products. DNA codes
are used in creating DNA nanostructures that are used in potential applications like targeted drug delivery systems. DNA codes
with specific properties with high stability and robustness are required for nano structures which can be achieved by using
efficient encoding procedures for DNA codes. Recently, DNA is used in data hiding techniques for encryption of data more
effectively. DNA codes used in this are designed as encryption or decryption keys. In last few years DNA based data storage
systems have [42] received attention by many researchers. DNA codes used for data storage must have feasible property that
achieve dense data storage capacity and better error correction capacity.
To use DNA codes for any application, fundamental constraints mentioned for DNA codes are unavoidable. DNA code
design must follow the constraint for stability and robustness but to design the DNA codes for specific application, required
constraints must be added to DNA codes to make it more functional and practical. For instance, correlated and uncorrelated
constraint was added to DNA codes for development random and re-writable DNA based data storage system. Looking at
the potential of DNA codes and advancement in the biotechnology methods, DNA codes promises application in emerging
technologies.
VIII. DNA C ODES TABLE
Many tables on lower bounds of DNA codes satisfying set of constraints are obtained. In Table VIII and XIII lower bounds
on DNA codes satisfying GC and reverse complement (RC) constraints are mentioned. In Table XIV lower bounds on DNA
codes satisfying Hamming distance and reverse complement constraint. These bounds are compiled from [68] [36] [52] [62]
[71] [86].
IX. F UTURE W ORK AND C HALLENGES
DNA codes designing have received a great deal of attention by researchers in the last decade. In spite of different approaches
proposed in the literature for the construction of DNA codes and constraints, it is still a challenge to design the optimal DNA
codes. Classifying the DNA codes for specific application has opportunities to use it for real applications. Though researchers
have worked on improvement of the bounds of DNA codes satisfying the set of constraints, designing DNA codes satisfying
maximum number of constraints achieving bound is still a huge challenge. Better codes with higher length and distance can be
designed by using other algebraic methods or computational methods improving the bounds and obtaining the bounds for the
missing n and d can be achieved. Defining DNA code as mathematical structure with the possible operation is interesting area
to explore in which different operation can be defined on DNA nucleotides that satisfies the desired properties for computation.
There are attempts to develop DNA codes satisfying the thermodynamic constraints though it is a challenge to develop the
DNA codes that fits perfectly for the practical application of DNA strands. One of the important research problem is to work
on the optimality condition of the DNA codes. To simulate the process of the DNA code designing and practical protocols
involved in the DNA computation, it can be automated by developing a platform where DNA codes can be designed and
simulated to check with the performance and accuracy. With emerging area of algebraic coding and biological coding theory,
these challenges can be investigated and resolved.
R EFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
L. M. Adleman, “Molecular computation of solutions to combinatorial problems,” Science, vol. 266, no. 5187, pp. 1021–1024, 1994.
C. C. Maley, “DNA computation: theory, practice, and prospects,” Evolutionary computation, vol. 6, no. 3, pp. 201–229, 1998.
C. Calude and G. Paun, Computing with cells and atoms: an introduction to quantum, DNA and membrane computing. CRC Press, 2000.
M. Garzon, P. Neathery, R. Deaton, R. C. Murphy, D. R. Franceschetti, and S. Stevens Jr, “A new metric for DNA computing,” in Proceedings of the
2nd Genetic Programming Conference, 1997, pp. 472–278.
R. Deaton, M. Garzon, R. Murphy, J. Rose, D. Franceschetti, and S. E. Stevens Jr, “Reliability and efficiency of a DNA-based computation,” Physical
Review Letters, vol. 80, no. 2, p. 417, 1998.
G. Rozenberg and A. Salomaa, “DNA computing: new ideas and paradigms,” in Automata, Languages and Programming. Springer, 1999, pp. 106–118.
E. Winfree, F. Liu, L. A. Wenzler, and N. C. Seeman, “Design and self-assembly of two-dimensional DNA crystals,” Nature, vol. 394, no. 6693, pp.
539–544, 1998.
E. Winfree, “Algorithmic self-assembly of DNA,” Ph.D. dissertation, California Institute of Technology, 1998.
N. C. Seeman, “DNA nanotechnology: novel DNA constructions,” Annual review of biophysics and biomolecular structure, vol. 27, no. 1, pp. 225–248,
1998.
P. W. Rothemund, “Folding DNA to create nanoscale shapes and patterns,” Nature, vol. 440, no. 7082, pp. 297–302, 2006.
M. Arita, “Writing information into DNA,” in Aspects of Molecular Computing. Springer, 2004, pp. 23–35.
Page 11 of 19
Page 12 of 19
n/d
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
609973884610560
42061705248768
3
6
15
43
256
528
1354
4542
14405
59136
167263
768768
1646240
13174400
26355520
44933184
47102080
756760576
90291264
10602158336
1384513088
177279886336
21300369664
2532157069312
4
2
3
16
35
128
273
860
2457
14784
27376
192192
411821
3293600
6587200
11232288
23647760
189432064
188416000
2650495232
670222080
44319794176
11204500480
633038608384
158680788992
10515426312192
2625500086272
152493461268480
80766566400
5154680832
1
4
11
28
65
210
477
1848
3974
11878
25670
55376
97520
699624
738772
11822368
22573824
22607872
43264648
346436544
10399676
326893568
5
2
2
22
19
54
117
924
924
3712
6648
55424
97450
738772
738772
11806240
1412068
5643456
10816624
43355616
1299844
618544192
20057442
10262347776
6691200
149011451520
6
9313176480
641396736
1
2
8
17
37
87
206
796
1600
13856
12864
43632
92252
738520
176772
176772
2703694
21631400
41600552
38656528
7
2
2
8
14
29
62
208
410
6476
6060
43632
11542
368504
45112
353496
676312
5406464
10399676
83204800
40114884
159987712
3853632
2332609440
8
9080016
1
2
5
12
23
49
109
243
579
2691
3678
11452
11148
88424
169182
1351616
2599688
20801200
40114884
627776
9
Table XII: Lower bounds on DNA codes satisfying GC and Reverse Complement Constraints for 4 ≤ n ≤ 30 and 2 ≤ d ≤ 9
946176
64512
320
135
2
24
Page 13 of 19
n/d
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
9110544
150266880
300533760
583395120
1166803110
2268771670
4537543340
2
2
4
10
21
37
83
175
407
960
2868
2926
22088
42968
338016
649922
5199376
10029150
180226336
10
80
96
848
848
1536
19388832
77558760
2
2
4
20
26
30
49
99
179
364
12
2
3
5
12
21
39
77
88
174
336
690
1402
2974
6308
13688
29292
61270
13
2
2
2
4
10
18
33
43
74
126
244
480
977
1927
3987
8245
17677
14
1
2
2
4
8
15
22
36
57
102
190
351
655
1310
2599
5426
15
2
2
2
4
7
11
20
31
51
83
148
262
459
898
1767
16
1
2
2
4
6
10
16
27
65
67
114
194
353
546
17
2
2
2
3
6
8
14
23
38
56
93
155
266
18
1
2
2
2
4
7
12
20
32
50
77
127
19
2
2
2
2
4
6
11
18
28
42
65
20
1
2
2
2
4
6
9
15
24
36
21
2
2
2
2
4
5
8
12
20
22
1
2
2
2
4
4
7
11
23
2
2
1
1
4
4
7
24
1
2
1
1
3
4
25
2
2
2
2
3
26
1
2
2
2
27
2
2
2
28
Table XIII: Lower bounds on DNA codes satisfying GC and Reverse Complement Constraints for 4 ≤ n ≤ 36 and 10 ≤ d ≤ 30
1
2
4
8
18
68
62
133
285
766
847
5522
10701
84964
162986
1299844
2506644
20056584
38777664
708168
11
1
2
29
2
30
Page 14 of 19
n/d
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
>= 107
>= 107
>= 107
3
6
32
62
196
620
1952
8064
23565
2609!
65536
32640
1047552
>= 107
4
2
4
28
42
128
346
2016
4832
32640
65280
523776
1047552
8386560
>= 107
523776
261888
>= 107
523776
65280
2095104
2
4
12
30
80
496
607
4032
5469
32640
65280
5
2
2
16
22
120
136
2016
2016
8192
16384
130560
16384
2091504
1048604
2095104
6
131072
1046528
2
2
8
17
40
120
240
2016
4032
32640
65280
7
2
2
8
15
31
70
512
1024
8192
16384
16384
32512
523776
8
2
2
6
12
24
120
118
480
679
8064
8128
32256
9
2
2
4
10
32
64
120
197
2016
1095
1598
10
2
2
4
8
17
120
68
143
321
2016
11
2
2
4
6
32
64
120
109
480
12
2
2
3
5
12
22
42
83
13
2
2
2
4
10
19
35
14
2
2
2
4
8
16
15
2
2
2
4
7
16
2
2
2
4
17
2
2
2
18
2
2
19
Table XIV: Lower bounds on DNA codes satisfying Hamming distance and Reverse Complement Constraints for 4 ≤ n ≤ 20 and 2 ≤ d ≤ 20
>= 107
>= 107
>= 107
>= 107
2
32
116
512
1968
8192
2887!
2786!
2677!
2734!
2
20
[12] A. G. D’yachkov, P. L. Erdös, A. J. Macula, V. V. Rykov, D. C. Torney, C.-S. Tung, P. A. Vilenkin, and P. S. White, “Exordium for DNA codes,”
Journal of Combinatorial Optimization, vol. 7, no. 4, pp. 369–379, 2003.
[13] D. Limbachiya and M. K. Gupta, “Natural data storage: A review on sending information from now to then via nature,” arXiv preprint arXiv:1505.04890,
2015.
[14] A. Marathe, A. E. Condon, and R. M. Corn, “On combinatorial DNA word design,” Journal of Computational Biology, vol. 8, no. 3, pp. 201–219,
2001.
[15] S. Hussini, L. Kari, and S. Konstantinidis, “Coding properties of DNA languages,” in DNA Computing. Springer, 2002, pp. 57–69.
[16] M. K. Gupta, “The quest for error correction in biology,” Engineering in Medicine and Biology Magazine, IEEE, vol. 25, no. 1, pp. 46–53, 2006.
[17] J. Watada et al., “DNA computing and its applications,” in Intelligent Systems Design and Applications, 2008. ISDA’08. Eighth International Conference
on, vol. 2. IEEE, 2008, pp. 288–294.
[18] X. Wang, Z. Bao, J. Hu, S. Wang, and A. Zhan, “Solving the sat problem using a DNA computing algorithm based on ligase chain reaction,” Biosystems,
vol. 91, no. 1, pp. 117–125, 2008.
[19] M. Darehmiraki, “A semi-general method to solve the combinatorial optimization problems based on nanocomputing,” International Journal of
Nanoscience, vol. 9, no. 05, pp. 391–398, 2010.
[20] A. J. Hartemink, D. K. Gifford, and J. Khodor, “Automated constraint-based nucleotide sequence selection for DNA computation,” Biosystems, vol. 52,
no. 1, pp. 227–235, 1999.
[21] J. Li, Q. Zhang, R. Li, and S. Zhou, “Optimization of DNA encoding based on combinatorial constraints,” ICIC Express Letters, vol. 2, no. 1, pp.
81–88, 2008.
[22] R. Penchovsky and J. Ackermann, “DNA library design for molecular computation,” Journal of Computational Biology, vol. 10, no. 2, pp. 215–229,
2003.
[23] E. B. Baum, “DNA sequences useful for computation,” in Proceedings of DNA-based Computers II, Princeton. In AMS DIMACS Series, vol. 44, 1999,
pp. 235–241.
[24] A. G. D’yachkov, P. A. Vilenkin, I. K. Ismagilov, R. S. Sarbaev, A. Macula, D. Torney, and S. White, “On DNA codes,” Problems of Information
Transmission, vol. 41, no. 4, pp. 349–367, 2005.
[25] O. Milenkovic and N. Kashyap, “On the design of codes for DNA computing,” in Coding and Cryptography. Springer, 2006, pp. 100–119.
[26] M. H. Garzon, V. Phan, S. Roy, and A. J. Neel, “In search of optimal codes for DNA computing,” in DNA Computing. Springer, 2006, pp. 143–156.
[27] L. P. Dinu and A. Sgarro, “A low-complexity distance for DNA strings,” Fundamenta Informaticae, vol. 73, no. 3, pp. 361–372, 2006.
[28] A. D’yachkov, A. Macula, V. Rykov, and V. Ufimtsev, “DNA codes based on stem similarities between DNA sequences,” in DNA Computing. Springer,
2007, pp. 146–151.
[29] J. Sager and D. Stefanovic, “Designing nucleotide sequences for computation: a survey of constraints,” in DNA Computing. Springer, 2006, pp.
275–289.
[30] J. Sun, “Bounds on edit metric codes with combinatorial DNA constraints,” Master’s thesis, Brock University, 2010.
[31] P. P. Debata, D. Mishra, K. Shaw, and S. Mishra, “A coding theoretic model for error-detecting in DNA sequences,” Procedia Engineering, vol. 38,
pp. 1773–1777, 2012.
[32] D. Ashlock, S. K. Houghten, J. A. Brown, and J. Orth, “On the synthesis of DNA error correcting codes,” Biosystems, vol. 110, no. 1, pp. 1–8, 2012.
[33] L. C. Faria, A. S. Rocha, J. H. Kleinschmidt, M. C. Silva-Filho, E. Bim, R. H. Herai, M. E. Yamagishi, and R. Palazzo Jr, “Is a genome a codeword
of an error-correcting code?” PloS one, vol. 7, no. 5, p. e36644, 2012.
[34] Y. M. Chee and S. Ling, “Improved lower bounds for constant GC-content DNA codes,” Information Theory, IEEE Transactions on, vol. 54, no. 1,
pp. 391–394, 2008.
[35] D. H. Smith, N. Aboluion, R. Montemanni, and S. Perkins, “Linear and nonlinear constructions of DNA codes with Hamming distance d and constant
GC-content,” Discrete Mathematics, vol. 311, no. 13, pp. 1207–1219, 2011.
[36] D. Tulpan, D. H. Smith, and R. Montemanni, “Thermodynamic post-processing versus GC-content pre-processing for DNA codes satisfying the
Hamming distance and reverse-complement constraints,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 11,
no. 2, pp. 441–452, 2014.
[37] M. A. Bishop, A. G. D’yachkov, A. J. Macula, T. E. Renz, and V. V. Rykov, “Free energy gap and statistical thermodynamic fidelity of DNA codes,”
Journal of Computational Biology, vol. 14, no. 8, pp. 1088–1104, 2007.
[38] A. G. D’yachkov, A. J. Macula, W. K. Pogozelski, T. E. Renz, V. V. Rykov, and D. C. Torney, “A weighted insertion-deletion stacked pair thermodynamic
metric for DNA codes,” in DNA Computing. Springer, 2005, pp. 90–103.
[39] Q. Zhang, B. Wang, and X. Wei, “Evaluating the different combinatorial constraints in DNA computing based on minimum free energy,” MATCH
Commun. Math. Comput. Chem, vol. 65, pp. 291–308, 2011.
[40] D. Tulpan, M. Andronescu, S. B. Chang, M. R. Shortreed, A. Condon, H. H. Hoos, and L. M. Smith, “Thermodynamically based DNA strand design,”
Nucleic acids research, vol. 33, no. 15, pp. 4951–4964, 2005.
[41] Q. Zhang, B. Wang, X. Wei, X. Fang, and C. Zhou, “DNA word set design based on minimum free energy,” NanoBioscience, IEEE Transactions on,
vol. 9, no. 4, pp. 273–277, 2010.
[42] S. Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. Milenkovic, “A rewritable, random-access DNA-based storage system,” arXiv preprint arXiv:1505.02199,
2015.
[43] D. C. Tulpan, “Effective heuristic methods for DNA strand design,” Ph.D. dissertation, The University Of British Columbia, 2006.
[44] U. Feldkamp, H. Rauhe, and W. Banzhaf, “Software tools for DNA sequence design,” Genetic Programming and Evolvable Machines, vol. 4, no. 2,
pp. 153–171, 2003.
[45] U. Feldkamp, W. Banzhaf, H. Rauhe et al., “A DNA sequence compiler,” in Poster presented at Sixth International Meeting on DNA Based Computers,
Leiden, 2000.
[46] U. Feldkamp, S. Saghafi, W. Banzhaf, and H. Rauhe, “DNAsequencegenerator: A program for the construction of DNA sequences,” in DNA Computing.
Springer, 2001, pp. 23–32.
[47] J. M. Gray, T. G. Frutos, A. M. Berman, A. E. Condon, M. G. Lagally, L. M. Smith, and R. M. Corn, “Reducing errors in DNA computing by
appropriate word design,” University of Wisconsin, Department of Chemistry, 1996.
[48] P. Hansen, N. Mladenović, and J. A. M. Pérez, “Variable neighbourhood search: methods and applications,” Annals of Operations Research, vol. 175,
no. 1, pp. 367–407, 2010.
[49] S. Kawashimo, H. Ono, K. Sadakane, and M. Yamashita, “DNA sequence design by dynamic neighborhood searches,” in DNA Computing. Springer,
2006, pp. 157–171.
[50] R. Montemanni and D. H. Smith, “Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm,” Journal of
Mathematical Modelling and Algorithms, vol. 7, no. 3, pp. 311–326, 2008.
[51] S. Kawashimo, H. Ono, K. Sadakane, and M. Yamashita, “Dynamic neighborhood searches for thermodynamically designing DNA sequence,” in DNA
Computing. Springer, 2008, pp. 130–139.
[52] R. Montemanni, D. Smith, and N. Koul, “Three metaheuristics for the construction of constant GC-content DNA codes,” Lecture Notes in Management
Science, vol. 6, pp. 167–175, 2014.
[53] Q. Qiu, D. Burns, Q. Wu, and P. Mukre, “Hybrid architecture for accelerating DNA codeword library searching,” in Computational Intelligence and
Bioinformatics and Computational Biology, 2007. CIBCB’07. IEEE Symposium on. IEEE, 2007, pp. 323–330.
Page 15 of 19
[54] N. Bennenni, K. Guenda, and A. Gulliver, “Construction of codes for DNA computing by the greedy algorithm,” ACM Commun. Comput. Algebra,
vol. 49, no. 1, pp. 14–14, Jun. 2015. [Online]. Available: http://doi.acm.org/10.1145/2768577.2768583
[55] O. D. King, “Bounds for DNA codes with constant GC-content,” Electron. J. Combin, vol. 10, no. 1, p. 33, 2003.
[56] D. C. Tulpan, H. H. Hoos, and A. E. Condon, “Stochastic local search algorithms for DNA word design,” in DNA Computing. Springer, 2003, pp.
229–241.
[57] R. Deaton, M. Garzon, R. Murphy, and J. R. Koza, “Genetic search of reliable encodings for DNA-based computation,” Late Breaking Papers at the
Genetic Programming 1996, pp. 9–15, 1996.
[58] M. Arita and S. Kobayashi, “DNA sequence design using templates,” New Generation Computing, vol. 20, no. 3, pp. 263–277, 2002.
[59] W. Liu, S. Wang, L. Gao, F. Zhang, and J. Xu, “DNA sequence design based on template strategy,” Journal of chemical information and computer
sciences, vol. 43, no. 6, pp. 2014–2018, 2003.
[60] S. Kobayashi, T. Kondo, and M. Arita, “On template method for DNA sequence design,” in DNA Computing. Springer, 2003, pp. 205–214.
[61] R. Selvakumar, “Unconventional construction of DNA codes: group homomorphism,” Journal of Discrete Mathematical Sciences and Cryptography,
vol. 17, no. 3, pp. 227–237, 2014.
[62] P. Gaborit and O. D. King, “Linear constructions for DNA codes,” Theoretical Computer Science, vol. 334, no. 1, pp. 99–113, 2005.
[63] N. Aboluion, D. H. Smith, and S. Perkins, “Linear and nonlinear constructions of DNA codes with Hamming distance d, constant GC-content and a
reverse-complement constraint,” Discrete Mathematics, vol. 312, no. 5, pp. 1062–1075, 2012.
[64] L. Faria, A. Rocha, J. Kleinschmidt, R. Palazzo, and M. Silva-Filho, “DNA sequences generated by BCH codes over GF (4),” Electronics letters,
vol. 46, no. 3, pp. 202–203, 2010.
[65] T. Abualrub, A. Ghrayeb, and X. N. Zeng, “Construction of cyclic codes over GF(4) for DNA computing,” Journal of the Franklin Institute, vol. 343,
no. 4, pp. 448–457, 2006.
[66] J. Cannon, A. Steel et al., “The MAGMA computational algebra system,” Software available online (magma. maths. usyd. edu. au), 2005.
[67] A. Heck and W. Koepf, Introduction to MAPLE. Springer-Verlag New York, 1993, vol. 1993.
[68] A. Niema, “The construction of DNA codes using a computer algebra system,” Ph.D. dissertation, PhD Thesis, University of Glamorgan, 2011.
[69] A. S. L. Rocha, L. C. B. Faria, J. H. Kleinschmidt, R. Palazzo Jr, and M. C. Silva-Filho, “DNA sequences generated by Z4 linear codes,” in Information
Theory Proceedings (ISIT), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1320–1324.
[70] B. Feng, S. Bai, B. Chen, and X. Zhou, “The constructions of DNA codes from linear self-dual codes over Z4 ,” International Conference on Computer
Information Systems and Industrial Applications, 2015.
[71] Z. Varbanov, T. Todorov, and M. Hristova, “A method for constructing DNA codes from additive self-dual codes over GF (4).” in Proc. CAIM conference,
Romania, vol. 40, 2014.
[72] S. E. Oztas and I. Siap, “Lifted polynomials over F16 and their applications to DNA codes,” Filomat, vol. 27, no. 3, pp. 459–466, 2013.
[73] B. Yildiz and I. Siap, “Cyclic codes over F2 [u]/(u4 − 1) and applications to DNA codes,” Computers & Mathematics with Applications, vol. 63,
no. 7, pp. 1169–1176, 2012.
[74] K. Guenda, T. A. Gulliver, and P. Solé, “On cyclic DNA codes,” in Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on.
IEEE, 2013, pp. 121–125.
[75] J. Liang and L. Wang, “On cyclic DNA codes over F2 + uF2 ,” Journal of Applied Mathematics and Computing, pp. 1–11, 2015.
[76] S. Pattanayak and A. K. Singh, “On cyclic DNA codes over the ring Z4 + uZ4 ,” arXiv preprint arXiv:1508.02015, 2015.
[77] F. Ma, Y. Cao, and J. Gao, “On cyclic DNA codes over F4 [u]/(u2 + 1),” International Journal of Research and Reviews in Applied Sciences, vol. 24,
no. 3, p. 101, 2015.
[78] S. Zhu and X. Chen, “Cyclic DNA codes over F2 + uF2 + vF2 + uvF2 ,” arXiv preprint arXiv:1508.07113, 2015.
[79] A. Bayram, E. S. Oztas, and I. Siap, “Codes over F4 + vF4 and some DNA applications,” Designs, Codes and Cryptography, pp. 1–15, 2015.
[80] B. Srinivasulu and M. Bhaintwal, “Reversible cyclic codes over F4 + uF4 and their applications to DNA codes,” in 2015 7th International Conference
on Information Technology and Electrical Engineering (ICITEE). IEEE, 2015, pp. 101–105.
[81] N. Bennenni, K. Guenda, and S. Mesnager, “New DNA cyclic codes over rings,” arXiv preprint arXiv:1505.06263, 2015.
[82] I. Siap, T. Abualrub, and A. Ghrayeb, “Cyclic DNA codes over the ring F2 [u]/(u2 − 1) based on the deletion distance,” Journal of the Franklin
Institute, vol. 346, no. 8, pp. 731–740, 2009.
[83] H. Mostafanasab and A. Y. Darani, “On cyclic DNA codes over F2 + uF2 + u2 F2 ,” arXiv preprint arXiv:1603.05894, 2016.
[84] K. Guenda and T. A. Gulliver, “Construction of cyclic codes over F2 + uF2 for DNA computing,” arXiv preprint arXiv:1207.3385, 2012.
[85] A. Dertli and Y. Cengellenmis, “On cyclic DNA codes over the rings z {4}+ wz {4} and z {4}+ wz {4}+ vz {4}+ wvz {4},” arXiv preprint
arXiv:1605.02968, 2016.
[86] H. Hong, L. Wang, H. Ahmad, J. Li, Y. Yang, and C. Wu, “Construction of DNA codes by using algebraic number theory,” Finite Fields and Their
Applications, vol. 37, pp. 328–343, 2016.
[87] A. J. Ruben, S. J. Freeland, and L. F. Landweber, “Punch: An evolutionary algorithm for optimizing bit set selection,” in DNA Computing. Springer,
2001, pp. 150–160.
[88] Z. Ignatova, I. Martinez-Perez, and K.-H. Zimmermann, DNA computing models. Springer Science & Business Media, 2008.
[89] Q. Zhang and B. Wang, “On the bounds of DNA coding with h-distance,” MATCH Commun. Math. Comput. Chem, vol. 66, pp. 371–380, 2011.
[90] L. M. Smith, R. M. Corn, A. E. Condon, M. G. Lagally, A. G. Frutos, Q. Liu, and A. J. Thiel, “A surface-based approach to DNA computation,”
Journal of computational biology, vol. 5, no. 2, pp. 255–267, 1998.
[91] A. G. Frutos, Q. Liu, A. J. Thiel, A. M. W. Sanner, A. E. Condon, L. M. Smith, and R. M. Corn, “Demonstration of a word design strategy for DNA
computing on surfaces,” Nucleic Acids Research, vol. 25, no. 23, pp. 4748–4757, 1997.
[92] Q. Liu, L. Wang, A. G. Frutos, A. E. Condon, R. M. Corn, and L. M. Smith, “DNA computing on surfaces,” Nature, vol. 403, no. 6766, pp. 175–179,
2000.
[93] H. Wu, “An improved surface-based method for DNA computation,” Biosystems, vol. 59, no. 1, pp. 1–5, 2001.
[94] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,”
Science, vol. 270, no. 5235, pp. 467–470, 1995.
[95] S. Brenner and R. A. Lerner, “Encoded combinatorial chemistry.” Proceedings of the National Academy of Sciences, vol. 89, no. 12, pp. 5381–5383,
1992.
[96] J. Reif, H. Chandran, N. Gopalkrishnan, and T. LaBean, “Self-assembled DNA nanostructures and DNA devices,” Nanofabrication Handbook, pp.
299–328, 2012.
[97] Y. Yang, Y. Liu, and H. Yan, “DNA nanostructures as programmable biomolecular scaffolds,” Bioconjugate chemistry, 2015.
[98] G. Jacob and A. Murugan, “DNA based cryptography: An overview and analysis,” Int J Emerg Sci, vol. 3, no. 1, pp. 27–36, 2013.
[99] D. Tulpan, C. Regoui, G. Durand, L. Belliveau, and S. Léger, “Hyden: a hybrid steganocryptographic approach for data encryption using randomized
error-correcting DNA codes,” BioMed research international, vol. 2013, 2013.
[100] A. Aich, A. Sen, S. R. Dash, and S. Dehuri, “A symmetric key cryptosystem using DNA sequence with OTP key,” in Information Systems Design and
Intelligent Applications. Springer, 2015, pp. 207–215.
[101] X. Wei, L. Guo, Q. Zhang, J. Zhang, and S. Lian, “A novel color image encryption algorithm based on DNA sequence operation and hyper-chaotic
system,” Journal of Systems and Software, vol. 85, no. 2, pp. 290–299, 2012.
Page 16 of 19
[102] R. Enayatifar, A. H. Abdullah, and I. F. Isnin, “Chaos-based image encryption using a hybrid genetic algorithm and a DNA sequence,” Optics and
Lasers in Engineering, vol. 56, pp. 83–93, 2014.
[103] M. Arita, “Method for designing DNA codes used as information carrier,” May 27 2004, uS Patent App. 10/558,502.
[104] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6102, pp. 1628–1628, 2012.
[105] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance information
storage in synthesized DNA,” Nature, vol. 494, no. 7435, pp. 77–80, 2013.
[106] S. Yazdi, H. M. Kiah, E. R. Garcia, J. Ma, H. Zhao, and O. Milenkovic, “DNA-based storage: Trends and methods,” arXiv preprint arXiv:1507.01611,
2015.
[107] S. Tsaftaris, A. K. Katsaggelos, T. N. Pappas, E. T. Papoutsakis et al., “DNA computing from a signal processing viewpoint,” Signal Processing
Magazine, IEEE, vol. 21, no. 5, pp. 100–106, 2004.
[108] L. Qian and E. Winfree, “Scaling up digital circuit computation with DNA strand displacement cascades,” Science, vol. 332, no. 6034, pp. 1196–1201,
2011.
[109] M. H. Garzon and T.-Y. Wong, “DNA chips for species identification and biological phylogenies,” Natural Computing, vol. 10, no. 1, pp. 375–389,
2011.
[110] J. Dingel and O. Milenkovic, “Coding-theoretic methods for reverse engineering of gene regulatory networks,” in Information Theory Workshop, 2008.
ITW’08. IEEE. IEEE, 2008, pp. 114–118.
[111] D. G. Arquès and C. J. Michel, “A complementary circular code in the protein coding genes,” Journal of theoretical biology, vol. 182, no. 1, pp. 45–58,
1996.
[112] D. G. Arques and C. J. Michel, “A code in the protein coding genes,” BioSystems, vol. 44, no. 2, pp. 107–134, 1997.
[113] C. J. Michel, “A 2006 review of circular codes in genes,” Computers & Mathematics with Applications, vol. 55, no. 5, pp. 984–988, 2008.
[114] M. H. Garzon, “Theory and applications of DNA codeword design,” in Theory and Practice of Natural Computing. Springer, 2012, pp. 11–26.
[115] L. V. Bystrykh, “Generalized DNA barcode design based on Hamming codes,” PloS one, vol. 7, no. 5, p. e36852, 2012.
A PPENDIX A
B OUNDS ON DNA C ODES
1) Johnson type Bound- This bound is derived by shortening the code to length n − 1. The code is modified by choosing
the codewords with a fix character b ∈ Zq at the ith position where i ∈ {1 . . . n} and deleting ith position from the
codewords.
For 0 ≤ d ≤ n and 0 < w < n,
2n GC
a) AGC
4 (n, d, w) ≤ b w A4 (n − 1, d, w − 1)c [55].
2n
GC
b) AGC
4 (n, d, w) ≤ b n−w A4 (n − 1, d, w)c [55].
1 R
c) AR
4 (n, d) ≤ b 4 A4 (n − 1, d)c [88].
2) Halving BoundsR
a) This is motivated by the fact DNA code CDN A and reverse DNA code CDN
A are disjoint. Let n ≥ 1 be an integer.
1
R
For each integer d with 0 < d ≤ n, A4 (n, d) ≤ A4 (n, d) [88].
2
b) For 0 < d ≤ n and 0 ≤ w ≤ n, AGC,RC
(n, d, w) ≤ 21 AGC
4 (n, d, w) [88].
4
GC,R
c) For 0 < d ≤ n and 0 ≤ w ≤ n,A4
(n, d, w) ≤ 12 AGC
4 (n, d, w) [88].
3) Gilbert-Type Bounds-AGC
4 (n, d, w) is derived by dividing the total number of DNA strings of length n with GC-content
w by the number of DNA string with distance d − 1 from the fixed codeword.
a) For 0 ≤ d ≤ n and 0 ≤ w ≤ n,
n w n−w
w 2 2
GC
A4 (n, d, w) ≥ Pd−1 Pminbr/2c,w,n−w w n−w n−2i
[88].
2i
r=0
i=0
i
i
r−2i 2
b) For 0 ≤ d ≤ n and 0 ≤ w ≤ n,
n n
w 2
GC
A4 (n, d, w) ≥ Pd−1 Pminbr/2c,w,n−w w n−w n−2i
[55].
2i
r=0
i=0
i
i
r−2i 2
4) This bound is based on the aspect of GC-content of given DNA codeword is equal to reverse of DNA codeword [55].
For 0 ≤ d ≤ n and 0 ≤ w ≤ n,
a) AGC,RC
(n, d, w) = AGC,R
(n, d, w) if n is even.
4
4
b) AGC,RC
(n, d, w) ≤ AGC,R
(n + 1, d + 1, w) if n is odd.
4
4
c) AGC,R
(n, d + 1, w) ≤ AGC,R
(n, d, w) ≤ AGC,R
(n, d − 1, w) if n is odd.
4
4
4
5) By using AGC
4 (n, d, w) ≥ A2 (n, d,
w)· A2 (n, d) inequality, following bound is derived [55].
n n−1
GC
For 0 ≤ w ≤ n, A4 (n, 2, w) =
2
.
w
Page 17 of 19
(n, d, w) ≤ 12 AGC
6) By using Halving bound AGC,RC
4 (n, d, w) for d = 2 one can calculate the bound [55].
4
For 0 ≤ w ≤ n and n is even,
n n−2
AGC,RC
(n,
2,
w)
=
2
.
4
w
7) Bound AGC
4 (n, d, w) is computed by dividing the total number of words with GC-content w that are at distance at least
d from their reverse-complements by the number of these codewords that are at distance at most d − 1 from any fixed
codeword [55].
For 0 ≤ d ≤ n and 0 ≤ w ≤ n,
AGC
4 (n, d, w)
Pn
≥
r=d V (n, d, r)
Pd−1 Pminbr/2c,w,n−w w n−w
n−2i 2i
2 r=0 i=0
i
i
r−2i 2
8) Product Bounds - This is based on the construction of the DNA code AGC
4 (n, d, w) with length n, minimum Hamming distance d and GC-content w from binary constant-weight codes A2 (n, d, w) and ternary constant-weight codes
A3 (n, d, w) with length n, Hamming weight w and minimum Hamming distance d [55] [88].
For 0 ≤ d ≤ n and 0 ≤ w ≤ n,
a) AGC
4 (n, d, w) ≥ A2 (n, d, w) · A2 (n, d)
b) AGC,R
(n, d, w) ≥ AR
2 (n, d, w) · A2 (n, d)
4
c) AGC,R
(n, d, w) ≥ A2 (n, d, w) · AR
2 (n, d)
4
d) AGC
4 (n, d, w) ≥ A3 (n, d, w) · A2 (n − w, d)
e) AGC,R
(n, d, w) ≥ AR
3 (n, d, w) · A2 (n − w, d)
4
f) AGC,R
(n, d, w) ≥ A3 (n, d, w) · AR
2 (n − w, d)
4
R
g) AR
4 (n, d) ≥ A2 (n, d) · A2 (n, d).
9) Reverse and Reverse complement codes bounds - This can be simple observed by the reverse and reverse complement
property of the DNA codeword [88] [14].
Let n ≥ 1 be an integer,
R
a) If n is even, then ARC
4 (n, d) = A4 (n, d) and
R
b) If n is odd, then ARC
4 (n, d) ≤ A4 (n + 1, d + 1)
10) Special cases - Bounds are observed by considering different combination of length n and GC-content w [14].
For n > 0, with 0 ≤ d ≤ n and 0 ≤ w ≤ n,
a) AGC
4 (n, d, 0) = A2 (n, d)
GC
b) AGC
4 (n, d, w) = A4 (n, d, n − w)
c) AGC
4 (n, n, w) =
d) AGC,RC
(n, n, w) =
4
e)
AGC
4 (n, 1, w)



4 if w = n/2

3 if n ≤ w < n/2 or n/2 < w ≤ 2n/3


2 if w < n/3 or w > 2n/3
2 if w = n/2
1 if w 6= n/2
n n
=
2
w
Page 18 of 19
11) Bounds on reverse P
code for d = 3 - This
bound iis computed from the concept of sphere-packing bound for codes [14].
bn/2c
dn/2e
q
i
=
2bn/2c
(q − 1)
i
AR
.
q (n, 3) ≤
2(1 + 4(q − 2) + (n − 4)(q − 1))
12) Bounds on reverse code of size S - By using Greedy approach to calculate size of the code, bound is derived [14].
Let V (s, d) be the number of words of S such that they have distance d from s where sεS.
|S|
a) AR
where V − (d) = min{V (s, d)|sεS}.
q (n, d) ≤
2V − (b(d − 1)/2c)
|S|
where, V + (d) = max{V (s, d)|sεS}.
b) AR
q (n, d) ≥
2V + d − 1
13) Bounds on reverse code for d = 2 - Bounds on reverse code for d = 2 is obtained by claims
• Any two words from the same subset differ in at least two positions ie. dH (xDNA , yDNA ) ≥ 2 ∀ xDNA , yDNA ∈ CDN A .
R
• If a word belongs to a subset, its reversal is also in the same subset ie. if xDNA ∈ CDN A then xDNA ∈ CDN A
n/2
• All the q
palindromes are in the same subset.
q n−1
, for even n and qε{2, 4}, and
a) AR
q (n, 2) =
2
q n−1 − q bn/2c
b) AR
(n,
2)
=
, for odd n and qε{2, 4}
q
2
14) Doubling Construction - This is motivated by the minimum Hamming distance between the DNA codeword, its
reverse codeword and revere complement. It is observed from the property of DNA code with dH (xDNA , yDNA ) ≥ d,
C
RC
HDN A (xR
DNA , yDNA ) ≥ d, HDN A (xDNA , yDNA ) ≥ d and HDN A (xDNA , yDNA ) ≥ d
n n−1
For n ≥ 2, AR
) = 2n [14].
2 (2 , 2
15) a) Bounds on even and odd length n reverse code - In this bounds, maximum size of reverse code of even and odd
length n and relationship between reverse code of even and odd length n − 1 is demonstrated [14].
R
R
AR
q (n − 1, d) ≤ Aq (n, d) ≤ Aq (n, d − 1) and
R
b) AR
q (n − 1, d) ≥ Aq (n, d)/q, for odd n.
16) Bounds on Hamming distance constraint- V.Phan provided lower and upper bounds of DNA codeword sets which satisfy
the h-distance constraint [89].
4n−d+1
≤ |S| ≤ 4n
n
d d−1
Page 19 of 19