DNA Compression Using Codon Representation

Muhammad A. M. Islam (1), Nour S. I. Bakr (2)
(1) Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, Egypt
(2) Biomedical Engineering Department, Higher Technological Institute, 10th of Ramadan, Egypt

Abstract: DNA sequences are composed of four bases, and each base can be represented by two bits. DNA sequences are large, and DNA databases are huge. Standard text compression algorithms generally fail to compress DNA sequences below two bits per base. Therefore, special compression techniques are required. This paper suggests converting the DNA sequence of bases into a sequence of codons before compression. A codon is composed of three DNA bases. This conversion improves the compression ratio of the sequence. To demonstrate the effectiveness of the suggested method, static and adaptive Huffman compression, in addition to the expert model, were used. The results show that the codon representation significantly improved the compression ratio. In addition, analysis of the test sequences suggests that the codon frequency distribution is almost invariant under a shift of one or two bases. Moreover, subsequences of the same test data have almost identical frequency distributions.

Key words: DNA, codons, compression

I. INTRODUCTION

DNA sequences of many organisms have been identified. These sequences are stored in huge molecular biology databases which are routinely handled by molecular biologists. The DNA sequences are stored, communicated, and analyzed in order to understand their properties. As shown in Table 1, most standard compression algorithms, such as gzip and compress, cannot compress DNA sequences; instead, they expand them [1], [2]. Specific compression algorithms have been proposed for the compression of DNA sequences based on particular characteristics of DNA sequences, i.e.
exact repeats, approximate repeats and reverse complements (reversed repeats, where A and C are respectively replaced by T and G, and reciprocally). Similarly, other techniques for the compression of sets of related sequences have also been proposed, utilizing the inter-sequence similarity [13], [14], [15], and [16].

Table 1: The average compression ratio of some standard compression algorithms [3].

Compression Algorithm          Compression Ratio (bits/base)
Compact (Adaptive Huffman)     2.380
bzip2 (Burrows-Wheeler)        2.064
Compress (LZW)                 2.185
gzip (LZ77)                    2.271
Arithmetic Coding              1.952
Context Tree Weighting         1.883

The earliest special-purpose DNA compression algorithm found in the literature is BioCompress [5]. It detects exact repeats and reverse complements in the DNA sequence, and then encodes them by the repeat length and the position of a previous repeat occurrence; otherwise, the sequence is encoded with 2 bits per symbol. The improved version, BioCompress-2 [2], uses order-2 arithmetic coding (Arith-2) to encode non-repeat regions.

The Cfact DNA compressor [6] also searches for the longest exact repeats but is a two-pass algorithm. It builds the suffix tree in the first pass. In the second pass, the encoding phase, the repetitions with guaranteed gains are coded using the suffix tree; otherwise, the sequence is encoded with two bits per base.

A substitution approach is used in GenCompress [4, 7], based on approximate repeats. The algorithm has two variants: GenCompress-1 and GenCompress-2. GenCompress-1 uses the Hamming distance (only substitutions), while GenCompress-2 uses the edit distance (deletion, insertion and substitution) for the encoding of the repeats.

The Context Tree Weighted LZ algorithm [1, 4] combines an LZ77-type method like GenCompress and the CTW algorithm. Long exact or approximate repeats are encoded by an LZ77-type algorithm (substitution method), while short repeats and non-repeat areas are encoded by CTW.
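Two building blocks recur in the algorithms above: the reverse complement of a repeat and the naive 2 bits per base fallback for non-repeat regions. The following is only a minimal sketch of both; the function names and the particular bit assignment of the bases are assumptions made for illustration and are not taken from any of the cited tools.

```python
# Illustrative helpers only; names and bit assignment are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")      # A<->T, C<->G
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}    # arbitrary 2-bit assignment


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def pack_2bit(seq: str) -> bytes:
    """Pack bases four per byte; the last byte is zero-padded on the right."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_CODE[base]
        byte <<= 2 * (4 - len(chunk))           # pad an incomplete final chunk
        out.append(byte)
    return bytes(out)
```

A real coder would also have to store the sequence length so that the padding can be removed on decoding; that detail is omitted here.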
The DNACompress algorithm [8] finds approximate repeats, including complemented reverses, in one pass. Approximate repeat regions and non-repeat regions are encoded in another pass.

DNAPack [4] uses the Hamming distance for the repeats and complementary palindromes. Non-repeat regions are encoded by the best choice among order-2 arithmetic coding (Arith-2), context tree weighting, and naive 2 bits per symbol. DNAPack finds the repeats using a dynamic programming approach.

The Expert Model [9] uses both statistical properties and repetitions within sequences. The algorithm encodes each symbol by estimating its probability based on information obtained from previous symbols. If the symbol is part of a repeat, the information from the previous occurrence is used. Once the symbol's probability distribution is determined, it is encoded by arithmetic coding.

In this paper, a new technique to improve the compression operation is proposed. The new technique suggests transforming DNA sequences into codon sequences before compression. The details of the proposed technique are given next.

II. METHODOLOGY

Data compression involves two steps: modeling, then coding. Standard compression techniques are designed to handle data whose structure is not similar to that of DNA; therefore, they fail to compress DNA sequences. In order to improve the DNA compression ratio, more attention was given to the structure of DNA sequences. It was discovered that DNA sequences have many long repetitions and reverse complements. This deeper insight into the DNA structure helped to better model the DNA sequence. Based on the improved DNA models, the special compression techniques modified the standard compression techniques, and consequently achieved higher compression ratios.

Similarly, the new technique tries to gain more insight into the structure of the DNA sequence and to improve its representation, and hence its compression ratio, through codon representation. Codons are 3-base subsequences.
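The codon transformation and its effect on a static Huffman code can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function names are hypothetical, the Huffman code is built directly from symbol counts, and the cost of transmitting the code table is ignored.

```python
import heapq
from collections import Counter


def to_codons(seq, shift=0):
    """Split seq into non-overlapping 3-base codons after dropping `shift`
    leading bases; an incomplete trailing codon is discarded."""
    body = seq[shift:]
    return [body[i:i + 3] for i in range(0, len(body) - len(body) % 3, 3)]


def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a static Huffman code."""
    if len(counts) == 1:                       # degenerate single-symbol case
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tiebreak, symbols in this subtree).
    heap = [(n, i, [s]) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                      # merged symbols gain one bit
            lengths[s] += 1
        heapq.heappush(heap, (n1 + n2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


def bits_per_base(symbols, bases_per_symbol):
    """Average coded size in bits/base (code table cost not included)."""
    counts = Counter(symbols)
    lengths = huffman_code_lengths(counts)
    total_bits = sum(counts[s] * lengths[s] for s in counts)
    return total_bits / (len(symbols) * bases_per_symbol)
```

For a sequence string `seq`, `bits_per_base(list(seq), 1)` gives the base-level figure and `bits_per_base(to_codons(seq), 3)` the codon-level figure, which are the two quantities being compared.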
For example, the DNA sequence AAGGCT contains two codons, AAG and GCT. Codons in gene coding regions are translated into amino acids. There are 64 codons but only 20 amino acids, so some amino acids correspond to more than one codon. Amino acids are the basic building blocks of proteins. Proteins control the shape and activities of all organisms. Proteins are composed of sequences of amino acids. These amino acids are fabricated from the corresponding codon sequences in the coding regions of genes. Every codon in the sequence is translated into an amino acid. Therefore, protein information is coded in the form of codons. This suggests that modeling a DNA sequence as a sequence of codons should better represent the sequence information and, consequently, improve its compression ratio.

III. RESULTS

In order to demonstrate the effectiveness of the proposed method, 11 different DNA sequences were used. These sequences include the complete genomes of 2 mitochondria, 2 chloroplasts, and 2 viruses, and 5 different complete human genes [5, 10], as shown in Table 2. Both the DNA and codon forms of these sequences were compressed using both static and adaptive Huffman coding, in addition to the expert model [9]. Compression results are given in Table 3.

The codon probability distribution was calculated 3 times for each test sequence. The analysis was carried out on each intact sequence, then after removing its first base, and finally after removing its first two bases.

Table 2: Size, version, and number in GenBank for the test sequences.
Sequence      Size (bp)   Version       Number
CHMPXX        121,024     X04465.1      11640
CHNTXX        155,943     Z00044.2      76559634
HEHCMVCG      229,354     X17403.1      59591
HUMDYSTROP    38,770      M86524.1      181901
HUMGHCSA      66,495      J03071.1      183148
HUMHBB        73,308      U01317.1      455025
HUMHDABCD     58,864      M63544.1      183921
HUMHPRTB      56,737      M26434.1      184369
MPOMTCG       186,609     M68929.1      786182
MIPACGA       100,314     ----------    ----------
VACCG         191,737     M35027.1      335317

Table 3: Comparison of DNA and codon (Cod.) compression ratios, in bits/base, using adaptive Huffman, static (fixed) Huffman, and the Expert Model.

              Adaptive Huffman    Static Huffman     Expert Model
Sequence      DNA      Cod.       DNA      Cod.      DNA      Cod.
CHMPXX        2.217    1.915      1.932    1.881     1.658    1.536
CHNTXX        2.385    1.990      2.001    1.967     1.607    1.528
HEHCMVCG      2.213    2.017      2.001    1.995     1.843    1.672
HUMDYSTROP    2.511    1.990      2.005    1.999     1.903    1.811
HUMGHCSA      2.494    2.018      2.003    2.005     0.983    0.806
HUMHBB        2.496    1.987      2.003    1.981     1.751    1.684
HUMHDABCD     2.479    2.031      2.003    2.017     1.667    1.549
HUMHPRTB      2.410    2.001      2.004    1.996     1.736    1.633
MPOMTCG       2.428    2.003      2.001    1.997     1.877    1.726
MIPACGA       2.213    1.919      1.946    1.906     1.845    1.704
VACCG         2.334    1.960      2.001    1.938     1.765    1.681
Average       2.380    1.985      1.991    1.971     1.698    1.576

Fig 1 shows a typical frequency distribution for the intact sequence and the two stripped sequences. In addition, the distributions of the three shifts of each test sequence were compared (Table 4). The analyses were also carried out on subsequences of different sizes and locations in each test sequence. Note that removing one or two bases leads to a change in all codons, and hence to a completely different codon sequence. However, the codon frequency distribution remains almost the same. The intact sequence was labeled shift 0, the sequence with one base removed was labeled shift 1, and the sequence with two bases removed was labeled shift 2.
The following example demonstrates the above:

Shift 0: ( | T T T | C T A | A T T | G T T ... )
Shift 1: ( | T T C | T A A | T T G | T T ... )
Shift 2: ( | T C T | A A T | T G T | T ... )

Table 4: The correlation between the frequency distributions of codons in different frames.

              Correlation
Sequence      0-1        0-2        1-2
CHMPXX        0.99396    0.99399    0.99401
CHNTXX        0.98278    0.98414    0.98368
HEHCMVCG      0.97975    0.97789    0.97523
HUMDYSTROP    0.98665    0.98238    0.98706
HUMGHCSA      0.99230    0.98943    0.99014
HUMHBB        0.98325    0.98226    0.97979
HUMHDABCD     0.98658    0.98757    0.98737
HUMHPRTB      0.99303    0.99296    0.99382
MPOMTCG       0.97643    0.97540    0.97207
MIPACGA       0.98830    0.98728    0.98881
VACCG         0.98899    0.98944    0.98652

IV. DISCUSSION AND CONCLUSION

The proposed codon compression technique significantly improved the average compression ratio. The expert model codon compression achieved the best compression ratio reported so far. In addition, for the test sequences, and using a sufficiently large subsequence, the codon frequency distribution was found to be almost invariant along the sequence. Based on the test data, a minimum subsequence length of about 15 k codons is sufficient. Moreover, the codon frequency distribution remains almost the same when one or two bases are removed from the beginning of any sequence, despite the fact that the base removal alters the whole codon sequence.

Note the high degree of redundancy in the sequences, which makes the code error resilient. This helps prevent problems due to damage to a few random codons. It is worth mentioning that many sequences have many similarities in their frequency distributions. Another very interesting observation is that a shifted sequence, although entirely different codon by codon, keeps the same frequency distribution. This suggests that the information coded in the genome is related not to the sequential order of the codons, but more to their frequency distribution. This may be another safety mechanism to prevent malfunction due to limited base damage.
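The frame comparison behind Table 4 can be sketched as follows. The paper does not state which correlation measure was used, so the Pearson coefficient over the 64 codon frequencies is an assumption here, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]   # all 64 codons


def codon_distribution(seq, shift=0):
    """Relative frequency of each of the 64 codons after dropping `shift`
    leading bases. Codons containing symbols other than A, C, G, T are not
    counted, so the frequencies then sum to less than 1."""
    body = seq[shift:]
    n = len(body) // 3                          # number of complete codons
    counts = Counter(body[3 * i:3 * i + 3] for i in range(n))
    return [counts[c] / n for c in CODONS]


def pearson(x, y):
    """Pearson correlation coefficient (assumed measure, see lead-in).
    Undefined when either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)
```

For a sequence string `seq`, the three entries of one table row would correspond to `pearson(codon_distribution(seq, i), codon_distribution(seq, j))` for the frame pairs (0, 1), (0, 2), and (1, 2).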
Finally, we can conclude that transforming a DNA sequence into a codon sequence before compression improves the compression ratio in both the static and adaptive models. Static Huffman compression is more efficient than adaptive compression due to the stationarity of the frequency distribution along the sequence. In addition, the codon sequence has a frequency distribution (signature) which is invariant under a shift of the codon sequence. Subsequences of reasonable size have almost identical codon frequency distributions along the sequence, and the similarity increases with increasing window size.

Acknowledgments

Great appreciation and gratitude are due to Prof. Abdalla S. A. Mohamed and Dr. Mohamad Abou El-Hoda, Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, Egypt, and to Prof. Nabila A. El-Sheikh, for their help and support throughout the work. Special thanks go to Eng. Heba Afifi for her valuable contribution regarding the implementation of the expert model.

Fig 1: Typical codon probability distribution (probability of each codon for Shift 0, Shift 1, and Shift 2).

REFERENCES

[1] Matsumoto Toshiko, et al., “Biological Sequence Compression Algorithms”, Genome Informatics, Volume 11: Pages 43–52, (2000).
[2] Grumbach Stephane and Tahi Fariza, “A new Challenge for Compression Algorithms: Genetic Sequences”, Journal of Information Processing and Management, Volume 30: Pages 875–886, (1994).
[3] Matsumoto Toshiko, et al., “Can General-Purpose Compression Schemes Really Compress DNA Sequences?”, Currents in Computational Molecular Biology, Universal Academy Press, Pages 76–77, (2000).
[4] Behzadi Behshad and Le Fessant Fabrice, “DNA Compression Challenge Revisited”, CPM, Pages 190–200, (2005).
[5] Grumbach Stephane and Tahi Fariza, “Compression of DNA Sequences”, IEEE Computer Society Press, in Data Compression Conference, Pages 340–350, (1993).
[6] Eric Rivals, et al., “Discerning Repeats in DNA Compression with a Compression Algorithm”, CABIOS, Volume 13: Pages 131–136, (1997).
[7] Chen Xin, et al., “A Compression Algorithm for DNA Sequences and Its Applications in Genome Comparison”, The 10th Workshop on Genome Informatics (GIW’99), Pages 51–61, (1999).
[8] Chen Xin, et al., “DNACompress: fast and effective DNA sequence compression”, Bioinformatics, Volume 18: Pages 1696–1698, (2002).
[9] Cao Minh Duc, et al., “A Simple Statistical Algorithm for Biological Sequence Compression”, IEEE Data Compression Conference (DCC), Pages 43–52, (2007).
[10] National Center for Biotechnology Information (NCBI). http://www.ncbi.nlm.nih.gov/sites/entrez , seen at (2008).
[11] Static Huffman Encoder/Decoder Source. http://www.codeproject.com/cpp/Huffman_coding.asp , seen at (2006).
[12] Adaptive Huffman Encoder/Decoder Source. http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=c8bc181b-5ddf-4969-aeca-a508374f1282 , seen at (2006).
[13] Marty C. Brandon, et al., “Data Structures and Compression Algorithms for Genomic Sequence Data”, Bioinformatics, Volume 25: Pages 1731–1738, doi:10.1093/bioinformatics/btp319, (2009).
[14] Christos Kozanitis, et al., “Compressing Genomic Sequence Fragments Using SLIMGENE”, RECOMB 2010, LNBI 6044, Pages 310–324, doi:10.1007/978-3-642-12683-3_20, (2010).
[15] Hyoung Do Kim and Ju-Han Kim, “DNA Data Compression Based on the Whole Genome Sequence”, Journal of Convergence Information Technology, Volume 4, Number 3, (2009).
[16] Pavol Hanus, et al., “Compression of Whole Genome Alignments”, IEEE Transactions on Information Theory, Volume 56, Number 2, (2010).