DNA Compression Using Codon Representation
Muhammad A. M. Islam(1), Nour S. I. Bakr(2)
(1)Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, Egypt
(2)Biomedical Engineering Department, Higher Technological Institute, 10th of Ramadan City, Egypt.
Abstract: DNA sequences are composed of four bases; each base can be represented by two bits. DNA sequences are large, and DNA databases are huge. Standard text compression algorithms fail to compress DNA sequences, so special compression techniques are required. This paper suggests converting the DNA sequence of bases into a sequence of codons before compression; a codon is composed of three DNA bases. This conversion improves the compression ratio of the sequence. To demonstrate the effectiveness of the suggested method, static and adaptive Huffman compression, in addition to the expert model, were used. The results show that the codon representation significantly improved the compression ratio. In addition, analysis of the test sequences suggests that the codon frequency distribution is almost invariant under a shift of one or two bases. Moreover, subsequences of the same test data have almost identical frequency distributions.
Key words: DNA, codons, compression
I. INTRODUCTION
DNA sequences of many organisms have been
identified. These sequences are stored in huge
molecular biology databases which are routinely
handled by molecular biologists. The DNA sequences
are stored, communicated, and analyzed in order to understand their properties. As shown in Table 1, most standard compression algorithms, such as gzip and compress, cannot compress DNA sequences; instead, they expand them [1], [2].
Specific compression algorithms have been proposed for the compression of DNA sequences based on particular characteristics of DNA sequences, i.e. exact repeats, approximate repeats, and reverse complements (reversed repeats where A and C are replaced by T and G, respectively, and vice versa). Similarly, techniques for the compression of sets of related sequences have also been proposed, exploiting inter-sequence similarity [13], [14], [15], and [16].
Table 1: The average compression ratio of some standard compression algorithms [3].

Compression Algorithm         Compression Ratio (bits/base)
Compact (Adaptive Huffman)    2.380
bzip2 (Burrows-Wheeler)       2.064
Compress (LZW)                2.185
gzip (LZ77)                   2.271
Arithmetic Coding             1.952
Context Tree Weighting        1.883
The earliest special-purpose DNA compression algorithm in the literature is BioCompress [5]. It detects exact repeats and reverse complements in the DNA sequence and encodes them by the repeat length and the position of a previous occurrence; other symbols are encoded at 2 bits per symbol. The improved version, BioCompress-2 [2], uses order-2 arithmetic coding (Arith-2) to encode non-repeat regions. The Cfact DNA compressor [6] also searches for the longest exact repeats, but is a two-pass algorithm: it builds a suffix tree in the first pass, and in the second, encoding, pass the repetitions with guaranteed gains are coded using the suffix tree, while the remaining bases are encoded at two bits per base. A substitution approach is used in
GenCompress [4, 7] based on approximate repeats.
The algorithm has two variants: GenCompress-1 and
GenCompress-2. GenCompress-1 uses the Hamming distance (substitutions only), while GenCompress-2 uses the edit distance (deletion, insertion, and substitution) for encoding the repeats. The Context Tree Weighted LZ algorithm [1, 4] combines an LZ77-type method, like GenCompress, with the CTW algorithm. Long exact/approximate repeats are encoded by the LZ77-type (substitution) method, while short repeats and non-repeat regions are encoded by CTW. The DNACompress
algorithm [8] finds approximate repeats, including reverse complements, in one pass; approximate-repeat regions and non-repeat regions are encoded in another pass. DNAPack [4] uses the Hamming distance for the repeats and complementary palindromes. Non-repeat regions are encoded by the best choice among Arith-2, context tree weighting, and naive 2 bits/symbol coding. DNAPack finds the repeats using a dynamic programming approach. The Expert Model [9]
uses both statistical properties and repetitions within
sequences. The algorithm encodes each symbol by
estimating the probability based on information
obtained from previous symbols. If the symbol is part
of a repeat, the information from the previous
occurrence is used. Once the symbol’s probability
distribution is determined, it is encoded by arithmetic
coding.
In this paper, a new technique to improve DNA compression is proposed. The new technique transforms DNA sequences into codon sequences before compression. The details of the proposed technique are given next.
II. METHODOLOGY
Data compression involves two steps: modeling, then coding. Standard compression techniques are designed to handle data structures that are not similar to DNA data structures; therefore, they fail to compress DNA sequences. In order to improve the DNA compression ratio, more attention was given to the DNA sequence structure. It was discovered that DNA sequences have many long repetitions and reverse complements. This deeper insight into the DNA structure helped to better model the DNA sequence. Based on the improved DNA models, the special compression techniques modified the standard compression techniques and consequently achieved higher compression ratios.
Similarly, the new technique tries to gain more
insight into the structure of the DNA sequence, and
improve its representation, and hence compression
ratio, through codon representation.
Codons are 3-base subsequences. For example, the
DNA sequence AAGGCT contains two codons, AAG
and GCT. Codons in gene coding regions are
translated into amino acids. There are 64 codons and 20 amino acids, as some amino acids correspond to more than one codon. Amino acids are the basic building blocks of proteins.
Proteins control the shape and activities of all organisms. Proteins are composed of sequences of amino acids, and these amino acids are fabricated from the corresponding codon sequences in the coding regions of genes. Every codon in the sequence is translated into an amino acid; therefore, protein information is coded in the form of codons. This suggests that modeling a DNA sequence as a sequence of codons should better represent the sequence information and, consequently, improve its compression ratio.
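As a concrete illustration of the above transformation, the following sketch (in Python, with illustrative helper names that are not taken from the paper) splits a DNA base string into non-overlapping codons and maps each codon onto one of the 64 codon symbols, which could then be handed to any entropy coder.

```python
# Minimal sketch of the base-to-codon transformation described above.
# Names and details are illustrative assumptions, not the authors' code.

BASES = "ACGT"

def to_codons(dna):
    """Split a DNA string into non-overlapping 3-base codons (frame 0).
    Trailing bases that do not fill a complete codon are dropped here;
    a full compressor would have to encode them separately."""
    dna = dna.upper()
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]

def codon_index(codon):
    """Map a codon such as 'AAG' to a single symbol in the range 0..63."""
    idx = 0
    for base in codon:
        idx = idx * 4 + BASES.index(base)
    return idx

if __name__ == "__main__":
    seq = "AAGGCT"                       # the example sequence used in the text
    codons = to_codons(seq)              # ['AAG', 'GCT']
    print(codons, [codon_index(c) for c in codons])
```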
III. RESULTS
In order to demonstrate the effectiveness of the proposed method, 11 different DNA sequences were used. These sequences include the complete genomes of two mitochondria, two chloroplasts, and two viruses, as well as five different complete human genes [5, 10], as shown in Table 2. Both the DNA and codon forms of the above sequences were compressed using static and adaptive Huffman coding, in addition to the expert model [9]. Compression results are given in Table 3.
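To give a sense of how bits/base figures like those in Table 3 can be obtained with a static Huffman code, the sketch below (an illustrative re-implementation, not the encoder actually used for Table 3) builds Huffman code lengths from symbol frequencies and compares a 1-base-per-symbol model with a 3-base (codon) model; the random stand-in sequence merely exercises the code, and real GenBank sequences would be used in practice.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code_lengths(freqs):
    """Return a {symbol: code length in bits} map for a static Huffman
    code built from the given frequency table."""
    tie = count()  # tie-breaker so heapq never has to compare the dicts
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                 # degenerate case: one symbol still needs 1 bit
        return {sym: 1 for sym in heap[0][2]}
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for d in (d1, d2) for sym, depth in d.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def bits_per_base(dna, symbol_size):
    """Estimate the static Huffman compression ratio in bits/base when the
    sequence is modeled with symbols of symbol_size bases (1 = bases, 3 = codons)."""
    symbols = [dna[i:i + symbol_size]
               for i in range(0, len(dna) - symbol_size + 1, symbol_size)]
    freqs = Counter(symbols)
    lengths = huffman_code_lengths(freqs)
    total_bits = sum(freqs[s] * lengths[s] for s in freqs)
    return total_bits / (len(symbols) * symbol_size)

if __name__ == "__main__":
    import random
    random.seed(0)
    seq = "".join(random.choice("ACGT") for _ in range(30000))  # stand-in sequence
    print("base model :", round(bits_per_base(seq, 1), 3), "bits/base")
    print("codon model:", round(bits_per_base(seq, 3), 3), "bits/base")
```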
The codon probability distribution was calculated three times for each test sequence: the analysis was carried out on the intact sequence, then after removing its first base, and finally after removing its first two bases.
Table 2: Size, version, and number in GenBank for the test sequences.

Sequence      Size (bp)   Version      Number
CHMPXX        121,024     X04465.1     11640
CHNTXX        155,943     Z00044.2     76559634
HEHCMVCG      229,354     X17403.1     59591
HUMDYSTROP    38,770      M86524.1     181901
HUMGHCSA      66,495      J03071.1     183148
HUMHBB        73,308      U01317.1     455025
HUMHDABCD     58,864      M63544.1     183921
HUMHPRTB      56,737      M26434.1     184369
MPOMTCG       186,609     M68929.1     786182
MIPACGA       100,314     ----------   ----------
VACCG         191,737     M35027.1     335317
Table 3: Comparison of DNA and codon compression ratios, in bits/base, using Adaptive Huffman, Static Huffman, and the Expert Model.

              Adaptive Huffman    Static Huffman      Expert Model
Sequence      DNA     Codon       DNA     Codon       DNA     Codon
CHMPXX        2.217   1.915       1.932   1.881       1.658   1.536
CHNTXX        2.385   1.990       2.001   1.967       1.607   1.528
HEHCMVCG      2.213   2.017       2.001   1.995       1.843   1.672
HUMDYSTROP    2.511   1.990       2.005   1.999       1.903   1.811
HUMGHCSA      2.494   2.018       2.003   2.005       0.983   0.806
HUMHBB        2.496   1.987       2.003   1.981       1.751   1.684
HUMHDABCD     2.479   2.031       2.003   2.017       1.667   1.549
HUMHPRTB      2.410   2.001       2.004   1.996       1.736   1.633
MPOMTCG       2.428   2.003       2.001   1.997       1.877   1.726
MIPACGA       2.213   1.919       1.946   1.906       1.845   1.704
VACCG         2.334   1.960       2.001   1.938       1.765   1.681
Average       2.380   1.985       1.991   1.971       1.698   1.576
Fig. 1 shows a typical codon frequency distribution for the intact sequence and the two stripped sequences; the distributions of the three shifts of each test sequence were compared in the same way. The analyses were also carried out on subsequences of different sizes and locations in each test sequence. Note that removing one or two bases changes all codons, and hence produces a completely different codon sequence; however, the codon frequency distribution remains almost the same. The intact sequence was labeled shift 0, the sequence with one removed base was labeled shift 1, and the sequence with two removed bases was labeled shift 2. The following example demonstrates the above:
Shift 0: ( | T T T | C T A | A T T | G T T …… )
Shift 1: ( | T T C | T A A | T T G | T T …… )
Shift 2: ( | T C T | A A T | T G T | T …… )
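The frame-shifted distributions and the correlations reported in Table 4 can be computed along the following lines; this is a minimal sketch with assumed helper names, and the Pearson correlation is computed directly rather than with an external statistics library.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # the 64 codons

def codon_frequencies(dna, shift):
    """Relative codon frequencies of dna after dropping `shift` leading bases."""
    dna = dna[shift:]
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return [counts[c] / total for c in ALL_CODONS]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def frame_correlations(dna):
    """Correlations between the codon distributions of shifts 0, 1, and 2."""
    f = [codon_frequencies(dna, s) for s in (0, 1, 2)]
    return {"0-1": pearson(f[0], f[1]),
            "0-2": pearson(f[0], f[2]),
            "1-2": pearson(f[1], f[2])}
```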
Table 4: The correlation between the frequency distributions of codons in the different frames (shift pairs 0-1, 0-2, and 1-2).

Sequence      0-1       0-2       1-2
CHMPXX        0.99396   0.99399   0.99401
CHNTXX        0.98278   0.98414   0.98368
HEHCMVCG      0.97975   0.97789   0.97523
HUMDYSTROP    0.98665   0.98238   0.98706
HUMGHCSA      0.99230   0.98943   0.99014
HUMHBB        0.98325   0.98226   0.97979
HUMHDABCD     0.98658   0.98757   0.98737
HUMHPRTB      0.99303   0.99296   0.99382
MPOMTCG       0.97643   0.97540   0.97207
MIPACGA       0.98830   0.98728   0.98881
VACCG         0.98899   0.98944   0.98652
IV. DISCUSSION AND CONCLUSION
The proposed codon compression technique significantly improved the average compression ratio; the expert model with codon compression achieved the best compression ratio reported so far. In addition, for the test sequences, and using a sufficiently large subsequence, the codon frequency distribution was found to be almost invariant along the sequence. Based on the test data, a minimum subsequence length of about 15 k codons is sufficient. Moreover, the codon frequency distribution remains almost the same when one or two bases are removed from the beginning of any sequence, despite the fact that the base removal alters the whole codon sequence. Note the high degree of redundancy in the sequences, which makes the code error resilient; this helps prevent problems due to damage to a few random codons. It is also worth mentioning that many sequences have similar frequency distributions.
Another interesting observation is that, although removing one or two bases changes every codon, the new sequence keeps the same frequency distribution. This suggests that the information coded in the genome is related not to the codon sequential order but rather to its frequency distribution. This may be another safety mechanism to prevent malfunction due to limited base damage.
Finally, we can conclude that transforming a DNA sequence into a codon sequence before compression improves the compression ratio for both static and adaptive models. Static Huffman compression is more efficient than adaptive compression because the frequency distribution is stationary along the sequence. In addition, the codon sequence has a frequency distribution (signature) which is invariant under codon sequence shift. Subsequences of reasonable size have almost identical codon frequency distributions along the sequence, and the similarity increases with increasing window size.
Acknowledgments
Great appreciation and gratitude are due to Prof. Abdalla S. A. Mohamed and Dr. Mohamad Abou El-Hoda, Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, Egypt, and Prof. Nabila A. El-Sheikh, for their help and support throughout the work. Special thanks go to Eng. Heba Afifi for her valuable contribution regarding the implementation of the expert model.
Fig. 1: Typical codon probability distribution (probability on the vertical axis, codons on the horizontal axis) for shift 0, shift 1, and shift 2.
REFERENCES
[1] Matsumoto Toshiko, et al., "Biological Sequence Compression Algorithms", Genome Informatics, Vol. 11, pp. 43–52, 2000.
[2] Grumbach Stephane and Tahi Fariza, "A New Challenge for Compression Algorithms: Genetic Sequences", Journal of Information Processing and Management, Vol. 30, pp. 875–886, 1994.
[3] Matsumoto Toshiko, et al., "Can General-Purpose Compression Schemes Really Compress DNA Sequences?", Currents in Computational Molecular Biology, Universal Academy Press, pp. 76–77, 2000.
[4] Behzadi Behshad and Le Fessant Fabrice, "DNA Compression Challenge Revisited", CPM, pp. 190–200, 2005.
[5] Grumbach Stephane and Tahi Fariza, "Compression of DNA Sequences", Data Compression Conference, IEEE Computer Society Press, pp. 340–350, 1993.
[6] Eric Rivals, et al., "Discerning Repeats in DNA Compression with a Compression Algorithm", CABIOS, Vol. 13, pp. 131–136, 1997.
[7] Chen Xin, et al., "A Compression Algorithm for DNA Sequences and Its Applications in Genome Comparison", The 10th Workshop on Genome Informatics (GIW'99), pp. 51–61, 1999.
[8] Chen Xin, et al., "DNACompress: fast and effective DNA sequence compression", Bioinformatics, Vol. 18, pp. 1696–1698, 2002.
[9] Cao Minh Duc, et al., "A Simple Statistical Algorithm for Biological Sequence Compression", IEEE Data Compression Conference (DCC), pp. 43–52, 2007.
[10] National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov/sites/entrez, accessed 2008.
[11] Static Huffman Encoder/Decoder Source, http://www.codeproject.com/cpp/Huffman_coding.asp, accessed 2006.
[12] Adaptive Huffman Encoder/Decoder Source, http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=c8bc181b-5ddf-4969-aeca-a508374f1282, accessed 2006.
[13] Marty C. Brandon, et al., "Data Structures and Compression Algorithms for Genomic Sequence Data", Bioinformatics, Vol. 25, pp. 1731–1738, doi:10.1093/bioinformatics/btp319, May 2009.
[14] Christos Kozanitis, et al., "Compressing Genomic Sequence Fragments Using SLIMGENE", RECOMB 2010, LNBI 6044, pp. 310–324, doi:10.1007/978-3-642-12683-3_20, 2010.
[15] Hyoung Do Kim and Ju-Han Kim, "DNA Data Compression Based on the Whole Genome Sequence", Journal of Convergence Information Technology, Vol. 4, No. 3, September 2009.
[16] Pavol Hanus, et al., "Compression of Whole Genome Alignments", IEEE Transactions on Information Theory, Vol. 56, No. 2, February 2010.