Int. J. Computer Aided Engineering and Technology, Vol. 1, No. 4, 2009

Analysis of cross sequence similarities for multiple DNA sequences compression

Paula Wu*
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
E-mail: [email protected]
*Corresponding author

Ngai-Fong Law
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Fax: +852 2362 8439
E-mail: [email protected]

Wan-Chi Siu
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Fax: +852 2362 6412
E-mail: [email protected]

Abstract: Current DNA compression algorithms rely on finding repetitions within the DNA sequence so that similar subsequences can be encoded by referencing to each other. We explore similarities between different chromosomes of the sequence ‘Saccharomyces cerevisiae’. These similarities are characterised by the existence of similar subsequences among different chromosomes. The longer the similar subsequences are, the higher the cross-similarities are. Our study indicates that these cross-sequence similarities are often significant as compared to self-sequence similarity. This implies that it would be advantageous to compress two or more chromosome sequences together so that similar subsequences found between multiple chromosome sequences can be encoded together.

Keywords: computer aided engineering; technology; deoxyribonucleic acid sequence; DNA sequence; chromosome; prediction; Saccharomyces cerevisiae; multiple DNA sequences; multiple chromosome; cross chromosomal similarities; compression.

Reference to this paper should be made as follows: Wu, P., Law, N-F. and Siu, W-C. (2009) ‘Analysis of cross sequence similarities for multiple DNA sequences compression’, Int. J. Computer Aided Engineering and Technology, Vol. 1, No. 4, pp.437–454.

Copyright © 2009 Inderscience Enterprises Ltd.
Biographical notes: Paula Wu received her BSc (Hons) in Internet and Multimedia Technologies with 1st Class Honours from the Hong Kong Polytechnic University in 2006. At present, she is an MPhil student at the same University under the supervision of Dr. N.F. Bonnie Law and Prof. W.C. Siu. Her research interests include signal and image processing, compression and coding.

Ngai-Fong Law received her BEng with 1st Class Honours from the University of Auckland, New Zealand, in 1993 and her PhD from the University of Tasmania, Australia, in 1997, both in Electrical and Electronic Engineering. She is currently an Assistant Professor in the Electronic and Information Engineering Department, Hong Kong Polytechnic University, Hong Kong. Her research interests include wavelet transform, pattern recognition and bioinformatics.

Wan-Chi Siu received his MPhil and PhD from the Chinese University of Hong Kong and Imperial College, London, in 1977 and 1984 respectively. He joined the Hong Kong Polytechnic University as a Lecturer in 1980 and has been Chair Professor since 1992. He was Head of Department of Electronic and Information Engineering and Dean of Engineering Faculty, and is now Director of Centre for Signal Processing of the same university. He has published over 360 research papers in DSP, transforms, fast algorithms, video coding and pattern recognition, and has been an invited and keynote speaker at many international conferences.

1 Introduction

Deoxyribonucleic acid (DNA) technologies have been widely used in genetic engineering, forensics and anthropology. The size of the databases storing DNA, RNA and amino-acid sequences is increasing exponentially (Matsumoto et al., 2000). As an example, the 24 human chromosomes range in length from about 50 to 250 million base pairs (Human Genome Project Science, http://www.ornl.gov/sci/techresources/Human_Genome/project/info.shtml).
Compression is thus desirable not only to reduce the storage requirement, but also to uncover similarities and differences among sequences so that properties of DNA sequences can be understood (Matsumoto et al., 2000; Li et al., 2001). Current compression algorithms work by finding redundant information within the DNA sequence. For example, most compression algorithms try to exploit exact/approximate repetitions and complementary palindromes within the DNA sequences (Matsumoto et al., 2000; Li et al., 2001; Korodi and Tabus, 2007; Grumbach and Tahi, 1993, 1994; Rivals et al., 1995, 1996; Chen et al., 1999, 2001, 2002; Chang, 2004; Behzadi and Fessant, 2005). These similarly repeated regions are then encoded together with reference to each other, in the hope of using fewer than two bits on average for each base pair. Besides repetitions, the three-base periodicity inside the protein coding regions has also been exploited (Pinho et al., 2006).

In the field of video compression, each video frame can be compressed as either an I-frame or a P-frame (Richardson, 2003; Wang et al., 2002). An I-frame is intra-coded: redundancy is exploited within the image itself. In contrast, a P-frame is inter-coded: redundancy is exploited between two consecutive frames. The P-frame generally has a better compression ratio than the I-frame, since the redundancy found between two consecutive frames is usually significant compared with that found within the frame itself. In the field of DNA compression, current algorithms are analogous to intra-frame compression as redundant information is exploited only within one chromosome. Here, we propose that inter-sequence (cross-sequence) redundancy among a number of chromosomes should be exploited in DNA sequence compression too.
Although cross-sequence similarity is well-known and is the basis of sequence analysis algorithms such as multiple sequence alignment or phylogenetic analysis, the idea of exploiting this information specifically for DNA sequence compression is novel. While only a modest compression ratio might be achieved for one chromosome sequence, we hypothesise that a higher compression ratio can be achieved when multiple chromosome sequences are compressed together, since the compression can benefit from both self-sequence similarity and cross-sequence similarities. This paper attempts to give a quantitative analysis of cross-sequence similarities among chromosomes to support our hypothesis. We specifically look at the cross-sequence similarities among different chromosomes of ‘Saccharomyces cerevisiae’ (S.cerevisiae). The lengths and locations of similar subsequences among chromosomes are investigated and their implications for DNA sequence compression are discussed.

2 Fundamentals of DNA sequence compression

DNA is a molecule composed of deoxyribonucleotides connected by phosphodiester linkages. A genome is the complete DNA sequence of a living organism, while a gene is a section of the DNA that codes for a protein. The largest publicly accessible nucleotide datasets are maintained in: National Center for Biotechnology Information Genetic Databank (GenBank) (http://www.ncbi.nlm.nih.gov/Genbank/index.html), European Molecular Biology Laboratory (EMBL) (http://www.ebi.ac.uk/embl/) and DNA Data Bank of Japan (DDBJ) (http://www.ddbj.nig.ac.jp/Welcome-e.html). These databases share their information with one another. In February 2008, GenBank reported that there were approximately 85.7 billion bases in 82 million sequence records in the traditional GenBank database and 108.6 billion bases in 27 million sequence records in the NCBI Whole Genome Shotgun sequence division (http://www.ncbi.nlm.nih.gov/Genbank/index.html; http://michael.dipperstein.com/dna).
Without any compression, two bits per base are required for encoding four nucleotides. However, most general-purpose compression tools use more than two bits to encode a base. This is because these tools do not consider the special structures in a DNA sequence. Therefore, before discussing the similarity study, we first introduce the characteristics of DNA sequences and describe existing DNA compression methods in the following subsections.

2.1 Characteristics of DNA sequences

DNA is a long sequence consisting of four kinds of nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T). It is a double helix held together by hydrogen bonds. The nucleotides (A, T) and (C, G) are complement pairs as shown in Table 1. A nucleotide in one strand is bonded to its complement in the other strand as depicted in Figure 1. Thus, only one strand needs to be encoded since the other strand can be obtained from the complement of the nucleotide in this strand. Two important characteristic structures of DNA sequences are exact/approximate repeats and complementary palindromes. They are often exploited in DNA sequence-oriented compression algorithms.

Table 1 Four types of nucleotides, adenine (A), guanine (G), thymine (T) and cytosine (C), and their complements

Bases      Base symbols   Complement
Adenine    A              T
Cytosine   C              G
Guanine    G              C
Thymine    T              A

Figure 1 Example of DNA sequence (see online version for colours)
Source: Phillips et al. (2000)

2.1.1 Approximate repeats

The cases of approximate repeat include exact match and match with some operators such as substitution, deletion and insertion. An exact match means two subsequences consist of identical nucleotides. Matching with substitution, insertion and deletion are illustrated in Figures 2(a), 2(b) and 2(c), respectively.

Figure 2 Examples of (a) substitution (b) insertion and (c) deletion in approximated matches
Note: The sequences are parts of a DNA sequence.
In Figure 2(a), a set of 12 nucleotides ‘ACGCTTACGCAT’ is a sample sequence. The subsequence ‘ACGCTT’ shown between 1 and 6 is the first six bases of the sample sequence, while the subsequence ‘ACGCAT’ listed between 7 and 12 is the 7th to 12th bases of the sample sequence. A vertical line between two bases indicates that the upper base is identical to the lower base. Comparing the first subsequence (1st to 6th bases) with the second subsequence (7th to 12th bases), the 5th base ‘T’ and the 11th base ‘A’ differ, as no vertical line is present. However, if the 5th base ‘T’ is replaced by ‘A’, the second subsequence can be reproduced from the first subsequence. This is called substitution.

In Figure 2(b), there are only 11 nucleotides ‘ACGCTACGCAT’ in the sample sequence. The horizontal line at the 5th position of the first subsequence means there is no base at that position. In other words, to reconstruct the second subsequence, ‘A’ should be inserted between the 4th and the 5th bases of the first subsequence. This is insertion.

In Figure 2(c), the sample sequence is ‘ACGCTTACGCT’ and the horizontal line is located at the 5th position of the second subsequence. To reproduce the second subsequence, we can delete the 5th base ‘T’ in the first subsequence. This is called deletion.

2.1.2 Complementary palindromes

A complementary palindrome is also called a reversed repeat, complemented inverted repeat or reverse complement repeat in the literature. It means that the nucleotides in one sequence are the nucleotides of another sequence in reverse order, with each nucleotide replaced by its complement. For instance, since (A, T) and (C, G) are complement pairs, the subsequences ‘AAGCGT’ and ‘ACGCTT’ form a complementary palindrome. In Figure 3, the 12-base sequence ‘ACGCTTAAGCGT’ is a part of a DNA sequence.
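The edit operations of Figure 2 and the reverse-complement relation defined above can be checked with a short Python sketch. The function names are hypothetical helpers introduced only for illustration; they are not part of any tool discussed in this paper, and positions are 1-based to match the text.

```python
# Illustrative helpers (hypothetical names); positions are 1-based as in the text.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}  # complement pairs of Table 1


def substitute(seq, pos, base):
    """Replace the base at position pos with base."""
    return seq[:pos - 1] + base + seq[pos:]


def insert_after(seq, pos, base):
    """Insert base between positions pos and pos + 1."""
    return seq[:pos] + base + seq[pos:]


def delete(seq, pos):
    """Remove the base at position pos."""
    return seq[:pos - 1] + seq[pos:]


def reverse_complement(seq):
    """Reverse the sequence and replace each base by its complement."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


# Figure 2(a): substituting the 5th base 'T' by 'A' turns 'ACGCTT' into 'ACGCAT'.
print(substitute("ACGCTT", 5, "A"))
# Figure 2(b): inserting 'A' between the 4th and 5th bases of 'ACGCT'.
print(insert_after("ACGCT", 4, "A"))
# Figure 2(c): deleting the 5th base of 'ACGCTT'.
print(delete("ACGCTT", 5))
# Section 2.1.2: 'AAGCGT' and 'ACGCTT' form a complementary palindrome.
print(reverse_complement("AAGCGT"))
```

Each operation recovers the second subsequence from the first with a single edit, which is why an approximate repeat can be encoded as a reference plus a short list of edits.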
We first focus on the 7th to 12th bases, i.e., ‘AAGCGT’. Its complement is ‘TTCGCA’ as listed in the second row, and the reverse order of ‘TTCGCA’ is ‘ACGCTT’ as shown in the third row. It is easy to see that the subsequence from the 1st to the 6th bases exactly matches the reverse complement of the subsequence from the 7th to the 12th bases.

Figure 3 Example of complementary palindromes
Note: The sequences are parts of a DNA sequence.

2.2 DNA compression

There are two kinds of compression methods: lossless compression and lossy compression. Compression is lossless if the original data can be recovered exactly from the compressed data, and lossy if some data are lost on recovery. Since none of the data in a DNA sequence can be sacrificed, only lossless compression is applied in DNA compression. The structural information of a DNA sequence, such as approximate repeats and complementary palindromes, is essential for DNA compression. Therefore, DNA compression is a kind of lossless compression and is based on these characteristic structures. As a DNA sequence includes just four bases, two bits are enough to store each nucleotide. In Figure 4, ‘00’, ‘01’, ‘10’ and ‘11’ are assigned to represent the nucleotides A, C, G and T respectively as an example. Thus, a DNA compression algorithm must use fewer than two bits per base on average to achieve any gain.

Figure 4 Two bits per base (see online version for colours)
Source: Phillips et al. (2000)

2.2.1 Current DNA compression schemes

Most DNA-based compression algorithms rely on encoding together similar repeated regions found within one chromosome sequence. Biocompress, proposed by Grumbach and Tahi (1993), is the first algorithm designed specifically for compressing DNA sequences. Both Biocompress and its second version Biocompress-2 (Grumbach and Tahi, 1994) are based on a sliding window algorithm known as LZ77 (Ziv and Lempel, 1977).
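The two-bits-per-base baseline of Figure 4 can be made concrete with a minimal Python sketch. The bit assignment follows the example in the text (A = 00, C = 01, G = 10, T = 11); the `pack` and `unpack` names and the byte layout (four bases per byte, first base in the high bits, zero padding in the last byte) are assumptions made only for illustration.

```python
# Two-bit code from the text: A=00, C=01, G=10, T=11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index in this string equals the two-bit code


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte.

    Returns the packed bytes and the number of bases, which is needed
    to discard the zero padding of the last byte when unpacking.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad an incomplete final chunk
        out.append(byte)
    return bytes(out), len(seq)


def unpack(data, n_bases):
    """Inverse of pack: recover the first n_bases bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])


packed, n = pack("ACGCTTAAGCGT")
print(len(packed) * 8 / n)  # bits per base of the plain two-bit code
print(unpack(packed, n))
```

Any DNA-specific compressor has to beat this fixed rate to be worthwhile, which is the sense in which two bits per base acts as the reference point.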
In Biocompress-2, exact matches and complementary palindromes are found so that the matched subsequences can be encoded with respect to identical subsequences that occurred earlier. In particular, each matched subsequence is replaced by two parameters: the start position of the earlier occurrence and the repeat length. For insignificant repeated regions or non-repeated regions, order-2 arithmetic coding (Arith-2) can be used.

Cfact, proposed by Rivals et al. (1995, 1996), uses a two-pass algorithm. In the first pass, exact matches are found by a suffix tree. In the second pass, if there is a compression gain, the matched subsequences are encoded using previous references; otherwise, they are kept uncompressed.

GenCompress (Li et al., 2001; Chen et al., 1999, 2001), unlike Biocompress and Cfact, considers approximate matches in addition to exact matches. GenCompress-1 uses substitutions only, while GenCompress-2 uses deletions, insertions and substitutions for repeat encoding. Similar to Biocompress, GenCompress considers whether the matched subsequence is worth encoding. If not, Arith-2 encoding is used.

CTW+LZ, proposed by Matsumoto et al. (2000), is based on the context tree weighting method and LZ-based compression. Long exact/approximate repeats and complementary palindrome repeats are encoded by the LZ-based algorithm, whereas short subsequences are compressed using CTW. Although it obtains a good compression ratio, its execution time is too high for long sequences.

DNACompress (Chen et al., 2002) consists of two parts. In the first part, all approximate repeats including complementary palindromes are detected by a separate software tool called PatternHunter (Ma et al., 2002). Those approximate repeats and non-repeating regions are then encoded in the second part.
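The idea of replacing a repeated subsequence by the start position of an earlier occurrence and a repeat length can be illustrated with a toy Python sketch. This is not Biocompress-2 or any of the algorithms above: it handles exact, non-overlapping repeats only, uses a naive quadratic search instead of a sliding window or suffix tree, emits tokens rather than bits, and the `min_len` threshold is a hypothetical parameter standing in for the "is the repeat worth encoding" decision.

```python
def encode(seq, min_len=4):
    """Greedy parse of seq into literal bases and (start, length) references.

    A reference (start, length) means: copy `length` bases beginning at
    0-based position `start` of the already-encoded prefix.
    """
    tokens = []
    i = 0
    while i < len(seq):
        best_start, best_len = -1, 0
        for start in range(i):
            length = 0
            # The source of the copy must lie entirely before position i.
            while (i + length < len(seq)
                   and start + length < i
                   and seq[start + length] == seq[i + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len >= min_len:
            tokens.append((best_start, best_len))  # long repeat: emit a reference
            i += best_len
        else:
            tokens.append(seq[i])  # short or no repeat: emit a literal base
            i += 1
    return tokens


def decode(tokens):
    """Rebuild the sequence from literals and (start, length) references."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            start, length = tok
            out.extend(out[start:start + length])
        else:
            out.append(tok)
    return "".join(out)


tokens = encode("ACGCTTACGCTT")
print(tokens)
print(decode(tokens) == "ACGCTTACGCTT")
```

In a real coder the literals would go through an entropy coder such as Arith-2 and the references would be written with a compact integer code; the sketch only shows the parsing step.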
DNACompress not only provides a good compression ratio, but is also significantly faster than GenCompress.

DNAC (Chang, 2004) is divided into four phases. A suffix tree is built in the first phase to locate exact matches. In the second phase, all the exact repeats are extended to approximate repeats by dynamic programming. In the third phase, the optimal non-overlapping repeats are extracted from the overlapping regions. All the repeats are then encoded in the last phase. Similar to DNAC, DNAPack (Behzadi and Fessant, 2005) uses a dynamic programming approach for the identification and encoding of repeats.

2.2.2 Homology searching engine

PatternHunter (Ma et al., 2002) is a homology search tool for identifying approximate repeats and approximate reverse complement repeats. All approximate repeats obtained from PatternHunter are ranked by a similarity measure called the score. A high score indicates a high similarity between two subsequences. In addition, details of the repeats, such as the location and the length of the repetitive regions, are output to an ‘aln’ file.

Figure 5 Example of an ‘aln’ file from the PatternHunter

In an ‘aln’ file, ‘Score’ shows the bit score of the local alignment generated through the search. ‘Expect’ is also known as the E-value. A lower expect value indicates a more homologous sequence. The numerator of ‘Identities’ shows the number of identical nucleotides, while the denominator displays the number of approximate match nucleotides in the obtained alignment. ‘Gaps’ shows whether the alignment has gaps. ‘Strand’ shows the direction of the aligned strands. The word before ‘/’ refers to the ‘Query’ subsequence and the word after ‘/’ refers to the ‘Sbjct’ subsequence. Minus indicates a complementary strand. ‘Query’ refers to a subsequence of the query sequence. ‘Sbjct’ refers to another subsequence of the subject sequence. Figure 5 shows one of the low-scoring repetition records listed in an aln file.
This is a less homologous sequence since the E-value is high and the score is low. A total of 210 nucleotides are involved in the alignment, including three deletions/insertions and 132 identical nucleotides between the two subsequences. The complementary palindrome of the query subsequence from the 14812th to the 15019th bases is compared with the subsequence from the 198493rd to the 198695th bases. Each base of the two subsequences is listed.

3 Similarity study

It is often conjectured that similarities exist among different chromosomes of one species (Li et al., 1998). It is also well established that evolutionarily similar species share similar mitochondrial DNA sequences (Hizume et al., 2002). In this section, the similarities in DNA sequences between different chromosomes of S.cerevisiae are investigated. We studied 16 chromosomes, Chr I to Chr XVI, which can be downloaded from http://www.ncbi.nlm.nih.gov/Genbank/index.html.

3.1 Existence of similar subsequences among chromosomes

To search for all approximate repeats (see approximate repeats section) and approximate reverse complement repeats (see complementary palindromes section) in one chromosome sequence or between a pair of chromosome sequences, PatternHunter (Ma et al., 2002) is employed.

3.1.1 Self-referencing

Self-referencing is defined as finding repetitions in one chromosome sequence. All currently proposed DNA compression algorithms consider self-referencing only. Figure 6 shows the lengths of the four highest-scoring repetitive regions found within each of Chr I, Chr III, Chr IV, Chr V, Chr VII, Chr VIII, Chr XI, Chr XII, Chr XIII, Chr XIV, Chr XV and Chr XVI. The Y-axis denotes the length of the repetitive regions found. The black, grey, light grey and white bars represent the first, second, third and fourth highest scores respectively.
The lengths of most repetitive regions, such as those in Chr IV, Chr VII, Chr XII, Chr XIII and Chr XVI, are around 6000. Chr I is a special case: the length of the highest-scoring region is around 13000 but the second drops to around 2000. In contrast, the four highest-scoring regions of Chr III and Chr XI are very short, only around 1000 each.

Figure 6 The lengths of the top four score repetitive regions with reference to itself in S.cerevisiae
Notes: The first, second, third and fourth highest scores are illustrated by black, grey, light grey and white colour bars respectively. Y-axis denotes the length of the repetitive regions found.

3.1.2 Cross-referencing

Cross-referencing is defined as finding repetitions between different chromosome sequences. It attempts to find similarities among different chromosome sequences. To show the similarities between different chromosome sequences in S.cerevisiae, self-referencing in Chr I and cross-referencing between Chr I and Chr VIII are explored. The following shows the identities of the top five scores found inside Chr I of S.cerevisiae, i.e., the self-reference subsequences found within Chr I itself.

Identities = 13159/14613 (90%)
Identities = 2434/2588 (94%)
Identities = 2071/2298 (90%)
Identities = 1610/1759 (91%)
Identities = 1573/1759 (89%)

The length of the repetitive regions is of special interest, because the repetitive regions can be encoded with respect to similar regions that have been encoded already. Thus, the longer the matching sequences are, the higher the compression ratio attained. In the first record, ‘13159’ is the number of exact match nucleotides while ‘14613’ is the number of approximate match nucleotides including exact match nucleotides.
The number in the bracket is the percentage of exact matches within the whole repeated subsequence. Thus, the longest repetitive region found within Chr I is about 13000. The following shows the identities of the top five scores of cross-reference sequences found between Chr I and Chr VIII.

Identities = 17034/17466 (97%)
Identities = 12502/13765 (90%)
Identities = 6407/6790 (94%)
Identities = 5677/6041 (93%)
Identities = 1518/1904 (79%)

The results show that the lengths of the two longest similar regions found between Chr I and Chr VIII are about 17000 and 12000. In fact, comparing the top four results rank by rank, each similar region between Chr I and Chr VIII is longer than the corresponding similar subsequence found within Chr I.

To give a clearer picture, Figure 7 depicts the lengths of the three highest-scoring repetitive regions for self-reference and for cross-reference between a particular chromosome and the other 15 chromosome sequences of S.cerevisiae. The highlighted area indicates self-referencing similarity while the others are cross-referencing similarities.

Figure 7(a) summarises the lengths of the three highest-scoring repetitive regions within Chr I itself and between Chr I and the other 15 chromosome sequences of S.cerevisiae. The three bars of Chr VIII indicate the three most similar sequences found between Chr I and Chr VIII. We can see that the lengths of the repetitive regions found between Chr I and Chr VIII are always larger than those found within Chr I alone. In addition, the lengths of the repetitive regions found between Chr I and other chromosomes such as Chr II, Chr IV, Chr VII, Chr X, Chr XII, Chr XIII and Chr XVI are significant too.

Figure 7(b) shows the lengths of the three highest-scoring repetitive regions within Chr VIII itself and between Chr VIII and the other 15 chromosome sequences of S.cerevisiae. The three bars of Chr I indicate the three most similar sequences found between Chr I and Chr VIII.
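The comparison between the Chr I self-reference records and the Chr I/Chr VIII cross-reference records quoted above can be reproduced with a few lines of Python. The regular expression and the `parse_identities` helper are assumptions made for illustration; they target only the `Identities = x/y (p%)` lines as printed in this section, not the full ‘aln’ file format.

```python
import re

# Records copied from the text above.
SELF_CHR_I = """
Identities = 13159/14613 (90%)
Identities = 2434/2588 (94%)
Identities = 2071/2298 (90%)
Identities = 1610/1759 (91%)
Identities = 1573/1759 (89%)
"""

CROSS_CHR_I_VIII = """
Identities = 17034/17466 (97%)
Identities = 12502/13765 (90%)
Identities = 6407/6790 (94%)
Identities = 5677/6041 (93%)
Identities = 1518/1904 (79%)
"""

PATTERN = re.compile(r"Identities\s*=\s*(\d+)/(\d+)\s*\((\d+)%\)")


def parse_identities(text):
    """Return a list of (identical, aligned_length, quoted_percent) tuples."""
    return [tuple(int(g) for g in m.groups()) for m in PATTERN.finditer(text)]


self_records = parse_identities(SELF_CHR_I)
cross_records = parse_identities(CROSS_CHR_I_VIII)

for (s_id, s_len, _), (c_id, c_len, _) in zip(self_records, cross_records):
    # Percent identity recomputed with integer (truncating) division.
    print(s_len, 100 * s_id // s_len, "|", c_len, 100 * c_id // c_len)
```

Recomputing the percentages with truncating division reproduces every quoted value, which suggests that the tool truncates rather than rounds (for example, 1610/1759 is 91.5% and is reported as 91%).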
Again, the lengths of the repetitive regions found between Chr I and Chr VIII are always larger than those found within Chr VIII alone. At the same time, the lengths of the repetitive regions found between Chr VIII and other chromosomes, except Chr III, Chr IX and Chr XI, are noteworthy too.

Comparing Figure 7(a) with Figure 7(b), the interesting point is that the lengths of the repetitive regions found between Chr I and Chr VIII are always larger than those found within Chr I alone [the highlighted part in Figure 7(a)] or those found within Chr VIII alone [the highlighted part in Figure 7(b)]. In addition, the lengths of the repetitive regions found between Chr I and other chromosomes (except Chr VIII) shown in Figure 7(a) and those between Chr VIII and other chromosomes (except Chr I) shown in Figure 7(b) have identical patterns.

Figure 7(c) illustrates the lengths of the three highest-scoring repetitive regions within Chr III itself and between Chr III and the other 15 chromosome sequences of S.cerevisiae. In this case, we can see that the self-similarity inside Chr III is small compared with the cross-similarities between Chr III and other chromosome sequences. The same is true for Chr XI, as shown in Figure 7(d). In fact, similar observations are obtained for other chromosome sequences of S.cerevisiae. This shows that, besides self-similarity within the chromosome sequence itself, cross-similarities with other chromosome sequences cannot be ignored. Exploiting these cross-similarities should be beneficial for compression applications.
Figure 7 The lengths of the top three score repetitive regions between, (a) Chr I (b) Chr VIII (c) Chr III and (d) Chr XI with the other 15 chromosome sequences of S.cerevisiae
Panel titles: (a) Similarity with Chromosome I, (b) Similarity with Chromosome VIII, (c) Similarity with Chromosome III, (d) Similarity with Chromosome XI. X-axis: chromosomes I to XVI.
Notes: The first, second and third highest scores are illustrated by black, grey and light grey colour bars respectively. Y-axis denotes the length of the repetitive regions found.

3.2 Location and length of similar sequences between chromosomes

Most existing DNA compression algorithms work by first finding self-similar subsequences inside the current chromosome sequence. Then the subsequence is encoded with reference to an identical/similar subsequence that occurred in the past (Chen et al., 2002). To quantify the potential gain in cross-sequence compression, we need to find out whether any subsequence in the current chromosome sequence can be predicted from regions in another chromosome sequence.
If so, there will be a gain if these two sequences are compressed together by referencing regions to each other. We term this cross-sequence compression. The length of these cross-reference subsequences determines the potential compression ratio that would result from considering multiple DNA sequences in compression. The longer the length is, the higher the potential compression ratio will be.

3.3 Analysis with self-referencing and cross-referencing

Table 2 and Table 3 show the total lengths of subsequences that can be predicted either from the current chromosome sequence or from other chromosome sequences.

Table 2 Total lengths of subsequences in Chr a that can be predicted from certain regions in Chr b

Chr b \ Chr a            I       III        IV         V       VII      VIII
Length of Chr a     230208    316617   1531918    576869   1090946    562643
Class of Chr a           3         1         3         1         2         1
I                    24807     12411     31766     13354     23469     36809
II                   15253     17228     58017     35443     56365     22205
III                   9964     11361     29904     12292     26207     13925
IV                   16241     22604     82152     47444     55529     35110
V                    10988     11933     56508     25003     37456     20144
VI                    9634     12218     33910     16000     34056     23460
VII                  16149     14952     67605     39911     43212     26373
VIII                 50536     14030     48346     27718     29262     20263
IX                    7623      9438     19237     21098     27053     16160
X                    14274     20753     61192     37469     37470     28774
XI                    7467     17228     13789      8715     15015     19735
XII                   7623     17316     77913     37045     62116     29828
XIII                 13193     14127     46460     29155     44821     31372
XIV                  13049     28820     53655     39883     39941     27743
XV                   25981     16711     73035     35748     46149     24951
XVI                  10455     14598     55231     34973     66621     33132

Note: The italicized value represents self-similarity (i.e., self-prediction, the entry where Chr b is the same as Chr a) while the highlighted boxes represent those entries that have greater values than the self-predicted one.

Each column entry in the table represents the total length of subsequences in Chr a that can be predicted from certain regions in Chr b. For example in Table 2, the first entry for Chr I, ‘24807’, represents the total length of similar subsequences that can be found within Chr I.
In other words, a total of 24807 nucleotides can be predicted by referencing to itself. Similarly, the second entry ‘15253’ represents the total length of similar subsequences in Chr I that can be predicted from Chr II. In other words, a total of 15253 nucleotides in Chr I can be encoded with reference to similar subsequences in Chr II. Furthermore, the first entry for Chr III, ‘12411’, is highlighted since it is greater than ‘11361’ (the third entry of Chr III), which is the total length of similar subsequences that can be found within Chr III itself.

Table 3 Total lengths of subsequences in Chr a that can be predicted from certain regions in Chr b

Chr b \ Chr a           XI       XII      XIII       XIV        XV       XVI
Length of Chr a     666454   1078175    924429    784333   1091289    948062
Class of Chr a           1         3         2         1         2         2
I                     8459     22818     18084     19422     33736     15894
II                    9926     36714     29897     40236     43754     40400
III                  15414     26790     11836     32780     22574     13006
IV                   12097     87680     41181     46787     70059     43633
V                     7095     42686     32899     37707     29723     26308
VI                    6975     30481     19089     30273     26885     22531
VII                  11571     79301     45231     35342     41149     67663
VIII                 19659     32142     24432     35704     25680     30953
IX                   12521     17685     16718     34194     32314     14307
X                    35014     41511     37283     41269     38576     34794
XI                    7169     12743     19559     11450     17025      8671
XII                   9765     84170     51846     41057     48221     40916
XIII                 21718     46768     37573     35588     46740     40699
XIV                  13460     55506     31969     22881     49117     24580
XV                   13033     64470     51032     51085     37964     55019
XVI                   9145     58181     43181     26549     58936     34648

Note: The italicized value represents self-similarity (i.e., self-prediction, the entry where Chr b is the same as Chr a) while the highlighted boxes represent those entries that have greater values than the self-predicted one.

The self-referencing values are italicized in Table 2 and Table 3. All entries that have a greater number of nucleotides predicted from other chromosomes than the self-referencing value are highlighted. The results can be grouped into three classes. The first class, consisting of Chr III, Chr XI, Chr XIV, Chr VIII and Chr V, has high similarities with chromosomes other than itself.
For each chromosome in the first class, more than half of the other chromosomes give a cross-referencing value bigger than the self-referencing value. This implies that a potentially high compression gain can be obtained if these sequences employ a cross-referencing strategy, with subsequences predicted from other chromosomes, in addition to self-referencing.

The second class consists of Chr XV, Chr XVI, Chr VII and Chr XIII. The numbers of highlighted entries for Chr XV, Chr XVI, Chr VII and Chr XIII are 8, 7, 6 and 5 respectively. Although these numbers are not as high as those in the first class, a potential compression gain is still expected since the cross-referencing values are large. As self-referencing is still considered in compression, an effective cross-referencing strategy should improve the overall compression ratio.

The last class consists of Chr I, Chr XII and Chr IV. The numbers of highlighted entries for Chr I and Chr XII are 2 and 1 respectively, and there are no highlighted entries for Chr IV. In Chr I, a total of 50536 nucleotides can be predicted from Chr VIII. In contrast, only 24807 nucleotides can be self-referenced within Chr I. The number roughly doubles if a reference is made to Chr VIII. This is consistent with the findings in Figure 7(a). In Chr XII, a total of 87680 nucleotides can be predicted from Chr IV. This is comparable to the self-referencing value of 84170. As the length of Chr XII is 1078175, these self-referencing and cross-referencing values (about 7.8% and 8.1% of the chromosome length respectively) are indeed significant. In Chr IV, the self-similarity consists of 82152 nucleotides. In contrast, the largest cross-similarity, with Chr XII, is 77913. While this is smaller than the self-referencing value, the combination of self-referencing and cross-referencing should contribute to better compression.

Besides the total length of similar subsequences, their exact locations are important too.
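One way to see why locations matter is to compare the summed length of the matched regions with the number of distinct positions they actually cover. The sketch below is a minimal illustration, not the procedure used in this study: the function name `covered_length`, the half-open `(start, end)` convention and all coordinates are assumptions made for the example.

```python
def covered_length(regions):
    """Count the distinct positions covered by half-open (start, end) regions."""
    total = 0
    current_start, current_end = None, None
    for start, end in sorted(regions):
        if current_end is None or start > current_end:
            # Disjoint from the running region: close it and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching: extend the running region.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Hypothetical coordinates, chosen only for illustration (not from the paper).
spread_out = [(0, 100), (300, 400), (700, 800)]
overlapping = [(0, 100), (50, 150), (80, 180)]

for name, regions in (("spread out", spread_out), ("overlapping", overlapping)):
    summed = sum(end - start for start, end in regions)
    print(name, "summed:", summed, "covered:", covered_length(regions))
```

Both sets of regions have the same summed length, but the overlapping set covers fewer distinct positions, so fewer nucleotides of the target sequence could be predicted from it.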
If similar subsequences within a single sequence are well spread out instead of heavily overlapped, a high proportion of the total nucleotides within the sequence can be predicted by cross-referencing among chromosomes. This in turn results in a high compression gain. Figure 8 provides a detailed analysis of the locations of similar subsequences among chromosomes. The similar subsequences are well spread out, which shows the potential benefit of encoding multiple chromosome sequences together. In order to present the locations of similar subsequences clearly, we only consider those repeats with scores above 100. Also, the illustration shows only those with repeat lengths over 20.

Figures 8(a), 8(b) and 8(c) show the locations of similar subsequences for the first, the second and the third class respectively. In Figure 8(a), we can see that the portions of self-referencing regions (shown in black) in all five chromosomes are very small compared with the portions of cross-referencing regions (shown in grey) with other chromosomes. The proportions of self-referencing subsequences in Chr XI, Chr XIV, Chr VIII and Chr V are so small that the self-referencing subsequences cannot even be seen in the figure. Moreover, similar subsequences predicted from different chromosomes contribute to different locations. For example, in Chr XI, the four similar subsequences found from Chr X, Chr XIII, Chr VIII and Chr III contribute to four different areas. Similar observations can be made from Figure 8(b) for the second class.

Figure 8(c) shows the locations of similar subsequences for the third class. In Chr I, the portions of cross-referencing regions with either Chr VIII or Chr XV are much larger than those of self-referencing regions. In Chr XII, the portions of cross-referencing regions with Chr XIII or Chr IV are comparable to those of self-referencing regions.
In Chr IV, the portions of cross-referencing regions with Chr XII are comparable to those of self-referencing regions too.

Figure 8 shows that the cross-referencing regions with other chromosomes are often significant when compared with self-referencing regions within the chromosome. Also, similar sequences from different chromosomes contribute to different locations in the chromosome. As a result, it would be advantageous to compress different chromosomes together to benefit from both self-sequence and cross-sequence similarities.

Figure 8 Locations of similar subsequences for (a) the first class, (b) the second class and (c) the third class of chromosome sequences
Notes: Self-similarity is shown in black while cross-similarities with other chromosomes are shown in grey. The sequence number of the chromosome is marked inside the coloured region. Only significant regions are presented, and they are drawn to scale. The * next to a chromosome marks a chromosome without significant self-sequence repetitions.

4 Conclusions and future development

We have investigated similarities among the 16 chromosomes of S. cerevisiae. Although cross-sequence similarities have been known and exploited in many applications, we quantified them here for the first time with a view to efficient DNA sequence compression. A detailed similarity analysis, including the length and location of similar subsequences between chromosomes, was performed. We found that cross-sequence similarities are highly significant between chromosomes: the length of similar subsequences found between chromosomes is at least comparable to that found within a chromosome.
While current DNA compression only considers repetitions found within the chromosome sequence itself, our study implies that it would be highly advantageous to compress different chromosomes together to achieve a higher compression ratio. Compression can therefore benefit from both self-sequence and cross-sequence similarities. Our future work will be to quantify this observation between species and to develop an efficient DNA compression scheme that exploits both self-sequence and cross-sequence similarities.

Acknowledgements

This work is supported by the Centre for Signal Processing, Department of Electronic and Information Engineering and the Hong Kong Polytechnic University (1-BB9F). Paula Wu acknowledges the research studentship provided by the University.