Int. J. Computer Aided Engineering and Technology, Vol. 1, No. 4, 2009
Analysis of cross sequence similarities for multiple
DNA sequences compression
Paula Wu*
Department of Electronic and Information Engineering,
The Hong Kong Polytechnic University,
Hung Hom, Kowloon, Hong Kong
E-mail: [email protected]
*Corresponding author
Ngai-Fong Law
Department of Electronic and Information Engineering,
The Hong Kong Polytechnic University,
Hung Hom, Kowloon, Hong Kong
Fax: +852 2362 8439
E-mail: [email protected]
Wan-Chi Siu
Department of Electronic and Information Engineering,
The Hong Kong Polytechnic University,
Hung Hom, Kowloon, Hong Kong
Fax: +852 2362 6412
E-mail: [email protected]
Abstract: Current DNA compression algorithms rely on finding repetitions
within the DNA sequence so that similar subsequences can be encoded by
referencing to each other. We explore similarities between different
chromosomes of the sequence ‘Saccharomyces cerevisiae’. These similarities
are characterised by the existence of similar subsequences among different
chromosomes. The longer the similar subsequences are, the higher the
cross-similarities are. Our study indicates that these cross-sequence similarities
are often significant as compared to self-sequence similarity. This implies that
it would be advantageous to compress two or more chromosome sequences
together so that similar subsequences found between multiple chromosome
sequences can be encoded together.
Keywords: computer aided engineering; technology; deoxyribonucleic acid
sequence; DNA sequence; chromosome; prediction; Saccharomyces cerevisiae;
multiple DNA sequences; multiple chromosome; cross chromosomal
similarities; compression.
Reference to this paper should be made as follows: Wu, P., Law, N-F. and
Siu, W-C. (2009) ‘Analysis of cross sequence similarities for multiple DNA
sequences compression’, Int. J. Computer Aided Engineering and Technology,
Vol. 1, No. 4, pp.437–454.
Copyright © 2009 Inderscience Enterprises Ltd.
Biographical notes: Paula Wu received her BSc (Hons) in Internet and
Multimedia Technologies with 1st Class Honours from the Hong Kong
Polytechnic University in 2006. At present, she is an MPhil student at the same
University under the supervision of Dr. N.F. Bonnie Law and Prof. W.C. Siu.
Her research interests include signal and image processing, compression and
coding.
Ngai-Fong Law received her BEng with 1st Class Honours from the University
of Auckland, New Zealand, in 1993 and PhD from the University of Tasmania,
Australia, in 1997, both in Electrical and Electronic Engineering. She is
currently an Assistant Professor in the Electronic and Information Engineering
Department, Hong Kong Polytechnic University, Hong Kong. Her research
interests include wavelet transform, pattern recognition and bioinformatics.
Wan-Chi Siu received his MPhil and PhD from the Chinese University of Hong
Kong and Imperial College, London, in 1977 and 1984 respectively. He joined
the Hong Kong Polytechnic University as a Lecturer in 1980 and has become
Chair Professor since 1992. He was Head of Department of Electronic and
Information Engineering and Dean of Engineering Faculty, and is now Director
of Centre for Signal Processing of the same university. He has published over
360 research papers in DSP, transforms, fast algorithms, video coding and
pattern recognition, and has been an invited and keynote speaker of many
international conferences.
1 Introduction
Deoxyribonucleic acid (DNA) technologies have been widely used in genetic
engineering, forensics and anthropology. The databases storing DNA, RNA and
amino-acid sequences are growing exponentially (Matsumoto et al., 2000). As an
example, the 24 human chromosomes range in length from 50 to 250 million base
pairs (Human Genome Project Science,
http://www.ornl.gov/sci/techresources/Human_Genome/project/info.shtml). Compression
is thus desirable not only to reduce storage requirements, but also to uncover similarities
and differences among sequences so that properties of DNA sequences can be better
understood (Matsumoto et al., 2000; Li et al., 2001).
Current compression algorithms work by finding redundant information within the
DNA sequence. For example, most compression algorithms exploit
exact/approximate repetitions and complementary palindromes within the DNA
sequences (Matsumoto et al., 2000; Li et al., 2001; Korodi and Tabus, 2007; Grumbach
and Tahi, 1993, 1994; Rivals et al., 1995, 1996; Chen et al., 1999, 2001, 2002; Chang,
2004; Behzadi and Fessant, 2005). These similar repeated regions are then encoded
together by referencing each other, in the hope of using fewer than two bits on average
for each base pair. Besides repetitions, the three-base periodicity inside
protein coding regions has also been exploited (Pinho et al., 2006).
In the field of video compression, each video frame can be compressed as either an
I-frame or a P-frame (Richardson, 2003; Wang et al., 2002). An I-frame is intra-coded:
redundancy is exploited within the image itself. In contrast, a P-frame is inter-coded:
redundancy is exploited between two consecutive frames. A P-frame typically achieves a
better compression ratio than an I-frame, since the redundancy between two consecutive
frames is usually much greater than that within a single frame. In the field of DNA
compression, current algorithms are analogous to intra-frame compression, as redundant
information is exploited only within one chromosome. Here, we propose that
inter-sequence (cross-sequence) redundancy among a number of chromosomes should be
exploited in DNA sequence compression too.
Although cross-sequence similarity is well known and is the basis of sequence
analysis algorithms such as multiple sequence alignment and phylogenetic analysis, the
idea of exploiting this information specifically for DNA sequence compression is novel.
While only a modest compression ratio might be achieved for one chromosome sequence,
we hypothesise that a higher compression ratio can be achieved when multiple chromosome
sequences are compressed together, since the compression can benefit from both
self-sequence similarity and cross-sequence similarities.
This paper attempts to give a quantitative analysis of cross-sequence similarities
among chromosomes to support our hypothesis. We specifically look at the
cross-sequence similarities among different chromosomes of ‘Saccharomyces cerevisiae’
(S.cerevisiae). The lengths and locations of similar subsequences among chromosomes
are investigated and their implications for DNA sequence compression are discussed.
2 Fundamentals of DNA sequence compression
DNA is a molecule composed of deoxyribonucleotides connected by phosphodiester
linkages. A genome is the complete DNA sequence of a living organism, while a gene is a
special section of the DNA coding for a protein. The largest publicly accessible
nucleotide datasets are maintained in: National Center for Biotechnology Information
Genetic Databank (GenBank) (http://www.ncbi.nlm.nih.gov/Genbank/index.html),
European Molecular Biology Laboratory (EMBL) (http://www.ebi.ac.uk/embl/) and
DNA Database of Japan (DDBJ) (http://www.ddbj.nig.ac.jp/Welcome-e.html).
These databases share their information with one another. In February 2008, GenBank
reported that there were approximately 857 billion bases in 82 million sequence
records in the traditional GenBank database and 1086 billion bases in 27 million
sequence records in the NCBI Whole Genome Shotgun sequence division
(http://www.ncbi.nlm.nih.gov/Genbank/index.html; http://michael.dipperstein.com/dna).
Without any compression, two bits per base are required for encoding four
nucleotides. However, most general-purpose compression tools use more than two bits to
encode a base. This is because these tools do not consider the special structures in a DNA
sequence. Therefore, before discussing the similarity study, we first introduce the
characteristics of DNA sequences and describe existing DNA compression methods in
the following subsections.
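As a rough illustration of the two-bit baseline, the sketch below measures how many bits per base a general-purpose compressor spends on a DNA-like string stored as one ASCII character per base. The helper name `bits_per_base` and the synthetic inputs are assumptions made for this example; the uniformly random string is not real DNA, and zlib stands in for general-purpose tools in general.

```python
import random
import zlib


def bits_per_base(seq):
    """Bits per base after compressing an ASCII DNA string with zlib (level 9)."""
    compressed = zlib.compress(seq.encode("ascii"), 9)
    return 8 * len(compressed) / len(seq)


# Synthetic, uniformly random sequence (an assumption for illustration only).
rng = random.Random(0)
synthetic = "".join(rng.choice("ACGT") for _ in range(100_000))

# Artificial sequence made of one short exact repeat.
repetitive = "ACGT" * 25_000

print(f"random:     {bits_per_base(synthetic):.2f} bits per base")
print(f"repetitive: {bits_per_base(repetitive):.2f} bits per base")
```

On the random input the result cannot drop below two bits per base, whereas the repetitive input compresses far below it, which is the kind of redundancy that DNA-specific methods aim to find.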
2.1 Characteristics of DNA sequences
DNA is a long sequence consisting of four kinds of nucleotides: adenine (A), cytosine
(C), guanine (G) and thymine (T). It is a double helix held together by hydrogen bonds.
The nucleotides (A, T) and (C, G) are complement pairs as shown in Table 1.
A nucleotide in one strand is bonded to its complement in the other strand as depicted in
Figure 1. Thus, only one strand needs to be encoded since the other strand can be
obtained from the complement of the nucleotide in this strand.
Two important characteristic structures of DNA sequences are exact/approximate
repeats and complementary palindromes. They are often exploited in DNA
sequence-oriented compression algorithms.
Table 1 Four types of nucleotides, adenine (A), guanine (G), thymine (T) and cytosine (C),
and their complements

Bases      Base symbols   Complement
Adenine    A              T
Cytosine   C              G
Guanine    G              C
Thymine    T              A
Figure 1 Example of DNA sequence (see online version for colours)
Source: Phillips et al. (2000)
2.1.1 Approximate repeats
Approximate repeats include exact matches and matches that require edit operations
such as substitution, deletion and insertion. An exact match means that two subsequences
consist of identical nucleotides. Matching with substitution, insertion and deletion is
illustrated in Figures 2(a), 2(b) and 2(c), respectively.
Figure 2 Examples of (a) substitution (b) insertion and (c) deletion in approximated matches
Note: The sequences are parts of a DNA sequence.
In Figure 2(a), the sample sequence is a set of 12 nucleotides, ‘ACGCTTACGCAT’.
The subsequence ‘ACGCTT’ shown between positions 1 and 6 is the first six bases of the
sample sequence, while the subsequence ‘ACGCAT’ listed between positions 7 and 12 is
its 7th to 12th bases. A vertical line between two bases indicates
that the upper base is identical to the lower base. Comparing the first subsequence
(1st to 6th bases) with the second subsequence (7th to 12th bases), the 5th base ‘T’ and the
11th base ‘A’ differ, as no vertical line is present. However, if the 5th base ‘T’
is replaced by ‘A’, the second subsequence can be reproduced from the first subsequence.
This is called substitution.
In Figure 2(b), there are only 11 nucleotides, ‘ACGCTACGCAT’, in the sample
sequence. The horizontal line appearing at the 5th position of the first subsequence means
that there is no base at that position. In other words, to reconstruct the second
subsequence, ‘A’ should be inserted between the 4th and the 5th bases of the first
subsequence. This is called insertion.
In Figure 2(c), the sample sequence is ‘ACGCTTACGCT’ and the horizontal line is
located at the 5th position of the second subsequence. To reproduce the second
subsequence, we can delete the 5th base ‘T’ of the first subsequence. This is called
deletion.
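The three operations can be sketched in a few lines of Python. The function `apply_edits` and its tuple format are hypothetical, introduced only to replay the three examples from Figure 2; they are not the encoding used by any of the algorithms discussed in this paper.

```python
def apply_edits(source, edits):
    """Rebuild a target subsequence from `source` using edit operations.

    Each edit is a tuple (op, position, base) with 1-based positions in `source`:
      ("sub", p, b)    replace the base at position p with b
      ("ins", p, b)    insert b immediately after position p
      ("del", p, None) drop the base at position p
    """
    by_position = {}
    for op, position, base in edits:
        by_position.setdefault(position, []).append((op, base))

    out = []
    for index, original in enumerate(source, start=1):
        kept = original
        inserted = []
        for op, base in by_position.get(index, []):
            if op == "sub":
                kept = base
            elif op == "del":
                kept = None
            elif op == "ins":
                inserted.append(base)
        if kept is not None:
            out.append(kept)
        out.extend(inserted)
    return "".join(out)


# Figure 2(a): substitute the 5th base 'T' with 'A'.
print(apply_edits("ACGCTT", [("sub", 5, "A")]))     # ACGCAT
# Figure 2(b): insert 'A' between the 4th and 5th bases.
print(apply_edits("ACGCT", [("ins", 4, "A")]))      # ACGCAT
# Figure 2(c): delete the 5th base 'T'.
print(apply_edits("ACGCTT", [("del", 5, None)]))    # ACGCT
```

In each case the second subsequence is recovered from the first plus a single edit, so an encoder only needs to store a reference to the first subsequence and the edit itself.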
2.1.2 Complementary palindromes
A complementary palindrome is also called a reversed repeat, complemented inverted
repeat or reverse complement repeat in the literature. It means that the nucleotides in one
sequence are the nucleotides of another sequence in reverse order, with each nucleotide
replaced by its complement. For instance, since (A, T) and (C, G) are complement pairs, the
subsequences ‘AAGCGT’ and ‘ACGCTT’ are complementary palindromes.
In Figure 3, the 12-base sequence ‘ACGCTTAAGCGT’ is part of a DNA sequence.
We first focus on the 7th to 12th bases, i.e., ‘AAGCGT’. Its complement is
‘TTCGCA’ as listed in the second row, and the reverse order of ‘TTCGCA’ is ‘ACGCTT’
as shown in the third row. It is easy to see that the subsequence from the 1st to the 6th
bases exactly matches the reverse complement of the subsequence from the 7th to the 12th
bases.
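The check described above can be written directly in Python. The function names below are illustrative choices for this sketch and do not come from any of the cited tools.

```python
# Complement pairs from Table 1: A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


def is_complementary_palindrome(first, second):
    """True if `first` equals the reverse complement of `second`."""
    return first == reverse_complement(second)


sample = "ACGCTTAAGCGT"
first, second = sample[:6], sample[6:]

print(second.translate(COMPLEMENT))                 # TTCGCA
print(reverse_complement(second))                   # ACGCTT
print(is_complementary_palindrome(first, second))   # True
```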
Figure 3 Example of complementary palindromes
Note: The sequences are parts of a DNA sequence.
2.2 DNA compression
There are two kinds of compression methods: lossless and lossy. Compression is
lossless if the original data can be recovered exactly from the compressed data, and
lossy if some data are lost in recovery. Since none of the data in a
DNA sequence can be sacrificed, only lossless compression is applied to DNA
sequences. Structural information such as approximate repeats
and complementary palindromes is essential for DNA compression. DNA
compression is therefore lossless and is built on these characteristic structures.
As a DNA sequence contains only four bases, two bits are enough to store each
nucleotide. In Figure 4, for example, ‘00’, ‘01’, ‘10’ and ‘11’ are assigned to the
nucleotides A, C, G and T respectively. Thus, a DNA compression algorithm must use
fewer than two bits per base on average to achieve any gain.
Figure 4 Two bits per base (see online version for colours)
Source: Phillips et al. (2000)
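A minimal sketch of this fixed two-bit code is given below, using the assignment A = 00, C = 01, G = 10 and T = 11 from the text. The `pack` and `unpack` helpers are hypothetical names; the sequence length is returned alongside the bytes because the final byte may be only partly used.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index in this string equals the two-bit code


def pack(seq):
    """Pack a DNA string into bytes at two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    n_bytes = (2 * len(seq) + 7) // 8
    return len(seq), value.to_bytes(n_bytes, "big")


def unpack(n_bases, data):
    """Inverse of pack: recover the DNA string from the packed bytes."""
    value = int.from_bytes(data, "big")
    return "".join(
        BASES[(value >> (2 * (n_bases - 1 - i))) & 0b11] for i in range(n_bases)
    )


n, packed = pack("ACGCTTAAGCGT")
print(n, len(packed))     # 12 bases stored in 3 bytes
print(unpack(n, packed))  # ACGCTTAAGCGT
```

This is the baseline that a DNA-specific compressor has to beat: any scheme that averages two bits per base or more offers no gain over it.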
2.2.1 Current DNA compression schemes
Most DNA-based compression algorithms rely on encoding together similar repeated
regions found within one chromosome sequence. Biocompress, proposed by Grumbach
and Tahi (1993), is the first algorithm designed specifically for compressing DNA
sequences. Both Biocompress and its second version Biocompress-2 (Grumbach and
Tahi, 1994) are based on a sliding window algorithm known as LZ77 (Ziv and Lempel,
1977). In Biocompress-2, exact matches and complementary palindromes are found so
that the matched subsequences can be encoded with respect to identical subsequences
that occurred earlier. In particular, each matched sequence is replaced by two
parameters: the start position of the earlier occurrence and the repeat length.
For insignificant repeated regions and non-repeated regions, order-2 arithmetic
coding (Arith-2) can be used.
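The idea of replacing a repeat by a start position and a length can be sketched as follows. This is a simplified illustration and not the Biocompress-2 algorithm itself: it uses a naive search, handles exact repeats only (no complementary palindromes), and leaves unmatched bases as literals instead of passing them to an arithmetic coder. The function names and the `min_len` threshold are assumptions made for the example.

```python
def encode_repeats(seq, min_len=4):
    """Replace exact repeats of earlier text by (start, length) pairs.

    Bases not covered by a sufficiently long repeat are kept as literals.
    """
    tokens = []
    i = 0
    while i < len(seq):
        best_start, best_len = -1, 0
        for start in range(i):
            length = 0
            # The referenced region must lie entirely before position i.
            while (i + length < len(seq)
                   and start + length < i
                   and seq[start + length] == seq[i + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len >= min_len:
            tokens.append((best_start, best_len))
            i += best_len
        else:
            tokens.append(seq[i])
            i += 1
    return tokens


def decode_repeats(tokens):
    """Rebuild the sequence by copying referenced regions from the output so far."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            start, length = token
            out.extend(out[start:start + length])
        else:
            out.append(token)
    return "".join(out)


tokens = encode_repeats("ACGCTTACGCTT")
print(tokens)                  # ['A', 'C', 'G', 'C', 'T', 'T', (0, 6)]
print(decode_repeats(tokens))  # ACGCTTACGCTT
```

Whether a pair is actually cheaper than the literals it replaces depends on how positions and lengths are coded, which is why the algorithms described here encode a repeat only when it yields a gain.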
Cfact, proposed by Rivals et al. (1995, 1996), uses a two-pass algorithm. In the
first pass, exact matches are found with a suffix tree. In the second pass, if there is a
compression gain, the matched subsequences are encoded using previous references;
otherwise, they are kept uncompressed. GenCompress (Li et al., 2001; Chen et al., 1999,
2001), unlike Biocompress and Cfact, considers approximate matches in addition to exact
matches. GenCompress-1 uses substitutions only, while GenCompress-2 uses deletions,
insertions and substitutions for encoding repeats. Similar to Biocompress, GenCompress
considers whether the matched subsequence is worth encoding. If not, Arith-2
encoding is used. CTW+LZ, proposed by Matsumoto et al. (2000), is based on the context
tree weighting (CTW) method and LZ-based compression. Long exact/approximate repeats
and complementary palindromes are encoded by the LZ-based algorithm, whereas
short subsequences are compressed using CTW. Although it obtains a good compression
ratio, its execution time is too long for long sequences.
DNACompress (Chen et al., 2002) consists of two parts. In the first part, all
approximate repeats, including complementary palindromes, are detected by a separate
software tool called PatternHunter (Ma et al., 2002). In the second part, those
approximate repeats and the non-repeating regions are encoded. DNACompress not only
provides a good compression ratio, but is also significantly faster than GenCompress.
DNAC (Chang, 2004) is divided into four phases. In the first phase, a suffix tree is built
to locate exact matches. In the second phase, all the exact repeats are extended
to approximate repeats by dynamic programming. In the third phase, the optimal
non-overlapping repeats are extracted from the overlapping regions. All the repeats are
then encoded in the last phase. Similar to DNAC, DNAPack (Behzadi and Fessant, 2005)
uses a dynamic programming approach for the identification and encoding of repeats.
2.2.2 Homology searching engine
PatternHunter (Ma et al., 2002) is a homology search tool for identifying approximate
repeats and approximate reverse complement repeats. All approximate repeats obtained
from PatternHunter are ranked by a similarity measure called the score. A high score
indicates that a high similarity exists between two subsequences. In addition, details of
the repeats, such as the location and the length of the repetitive regions, are output to an
‘aln’ file.
Figure 5 Example of an ‘aln’ file from the PatternHunter
In an ‘aln’ file, ‘Score’ shows the bit score of the local alignment generated through the
search. ‘Expect’ is also known as the E-value. A lower expect value indicates a more
homologous sequence. The numerator of ‘identities’ shows the number of identical
nucleotides, while the denominator displays the number of approximately matched
nucleotides (including exact matches) in the obtained alignment. ‘Gaps’ shows whether
the alignment has gaps. ‘Strand’ shows the direction of the aligned strands. The word
before ‘/’ refers to the ‘Query’ subsequence and the word after ‘/’ refers to the ‘Sbjct’
subsequence. Minus indicates a complementary strand. ‘Query’ refers to a subsequence of
the query sequence. ‘Sbjct’ refers to a subsequence of the subject sequence.
Figure 5 shows one of the low score repetition records listed in an ‘aln’ file. This is a
less homologous sequence, since the E-value is high and the score is low. A total
of 210 nucleotides are involved in the alignment, including three deletions/insertions
and 132 nucleotides that are identical between the two subsequences. The
complementary palindrome of the query subsequence from the 14812th to the
15019th bases is compared with the subject subsequence from the 198493rd to the
198695th bases. Each base of the two subsequences is listed.
3 Similarity study
It is often conjectured that similarities exist among different chromosomes of one
species (Li et al., 1998). It is also well established that
evolutionarily similar species share similar mitochondrial DNA sequences (Hizume
et al., 2002). In this section, the similarities in DNA sequences between different
chromosomes of S.cerevisiae are investigated. We studied the 16 chromosomes
from Chr I to Chr XVI, which can be downloaded from
http://www.ncbi.nlm.nih.gov/Genbank/index.html.
3.1 Existence of similar subsequences among chromosomes
PatternHunter (Ma et al., 2002) is employed to search for all approximate repeats (see
Section 2.1.1) and approximate reverse complement repeats (see Section 2.1.2) in one
chromosome sequence or between a pair of chromosome sequences.
3.1.1 Self-referencing
Self-referencing is defined as finding repetitions within one chromosome sequence. All
currently proposed DNA compression algorithms consider self-referencing only.
Figure 6 shows the lengths of the top four score repetitive regions found inside Chr I,
Chr III, Chr IV, Chr V, Chr VII, Chr VIII, Chr XI, Chr XII, Chr XIII, Chr XIV, Chr XV
and Chr XVI. The Y-axis denotes the length of the repetitive regions found. The black,
grey, light grey and white bars represent the first, second, third and fourth highest
scores respectively. The lengths of most repetitive regions, such as those in Chr IV,
Chr VII, Chr XII, Chr XIII and Chr XVI, are around 6000. Chr I is a special case:
the length of the highest-scoring region is around 13000, but that of the second drops to
around 2000. The lengths of the top four scoring regions of Chr III and Chr XI are very
short, at only around 1000.
Figure 6 The lengths of the top four score repetitive regions with reference to itself in
S.cerevisiae
[Bar chart ‘Self-Referencing’: x-axis, chromosomes I, III, IV, V, VII, VIII, XI, XII, XIII,
XIV, XV and XVI; y-axis, length from 0 to 14000]
Notes: The first, second, third and fourth highest scores are illustrated by black, grey,
light grey and white colour bars respectively. Y-axis denotes the length of the
repetitive regions found.
3.1.2 Cross-referencing
Cross-referencing is defined as finding repetitions between different chromosome
sequences, i.e., finding similarities among different chromosome sequences. To
show the similarities between different chromosome sequences in S.cerevisiae,
self-referencing in Chr I is compared with cross-referencing between Chr I and
Chr VIII.
The following shows the identities of the top five scores found inside Chr I of
S.cerevisiae, i.e., the self-reference subsequences found within Chr I itself.
Identities = 13159/14613 (90%)
Identities = 2434/2588 (94%)
Identities = 2071/2298 (90%)
Identities = 1610/1759 (91%)
Identities = 1573/1759 (89%)
The length of the repetitive regions is of special interest, because repetitive
regions can be encoded with respect to similar regions that have already been encoded.
Thus, the longer the matching sequences are, the higher the compression ratio attained.
In the first record, ‘13159’ is the number of exactly matched nucleotides while
‘14613’ is the number of approximately matched nucleotides, including the exact
matches. The number in brackets is the percentage of exact matches within the whole
repeated subsequence. Thus, the longest repetitive region found within Chr I is about
13000 nucleotides long. The following shows the identities of the top five scores of
cross-reference sequences found between Chr I and Chr VIII.
Identities = 17034/17466 (97%)
Identities = 12502/13765 (90%)
Identities = 6407/6790 (94%)
Identities = 5677/6041 (93%)
Identities = 1518/1904 (79%)
The results show that the lengths of the two longest similar regions found between Chr I
and Chr VIII are about 17000 and 12000. In fact, for each of the top four results, the
similar region between Chr I and Chr VIII is longer than the corresponding similar
subsequence found within Chr I. For a clearer picture, Figure 7 depicts the lengths of
the top three score repetitive regions for self-referencing and for cross-referencing
between a particular chromosome and the other 15 chromosome sequences of S.cerevisiae.
The highlighted area indicates self-referencing similarity while the others are
cross-referencing similarities.
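The comparison between the two lists of records can be reproduced with a short script. Only the ‘Identities = m/n (p%)’ line format shown above is taken from the source; the regular expression, function name and variable names are assumptions for this sketch, and the layout of the rest of an ‘aln’ file is not modelled.

```python
import re

IDENTITIES = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")


def parse_identities(text):
    """Return (exact_matches, alignment_length) for every Identities line."""
    return [(int(m.group(1)), int(m.group(2))) for m in IDENTITIES.finditer(text)]


SELF_CHR_I = """
Identities = 13159/14613 (90%)
Identities = 2434/2588 (94%)
Identities = 2071/2298 (90%)
Identities = 1610/1759 (91%)
Identities = 1573/1759 (89%)
"""

CROSS_CHR_I_VIII = """
Identities = 17034/17466 (97%)
Identities = 12502/13765 (90%)
Identities = 6407/6790 (94%)
Identities = 5677/6041 (93%)
Identities = 1518/1904 (79%)
"""

self_hits = parse_identities(SELF_CHR_I)
cross_hits = parse_identities(CROSS_CHR_I_VIII)

for rank, ((s_exact, s_len), (c_exact, c_len)) in enumerate(
        zip(self_hits, cross_hits), start=1):
    # Truncating the percentage reproduces the values printed in the records.
    print(f"rank {rank}: self {s_len} ({100 * s_exact // s_len}%), "
          f"cross {c_len} ({100 * c_exact // c_len}%)")

# Rank by rank, the top four cross-reference regions are longer than the self ones.
print(all(c[1] > s[1] for s, c in zip(self_hits[:4], cross_hits[:4])))  # True
```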
Figure 7(a) summarises the lengths of the top three score repetitive regions within
Chr I itself and between Chr I and the other 15 chromosome sequences of S.cerevisiae.
The three bars of Chr VIII indicate the top three most similar sequences found between
Chr I and Chr VIII. We can see that the lengths of the repetitive regions found between
Chr I and Chr VIII are always larger than those found within Chr I alone. In addition, the
lengths of the repetitive regions found between Chr I and other chromosomes such as
Chr II, Chr IV, Chr VII, Chr X, Chr XII, Chr XIII and Chr XVI are significant too.
Figure 7(b) shows the lengths of the top three score repetitive regions within Chr VIII
itself and between Chr VIII and the other 15 chromosome sequences of S.cerevisiae.
The three bars of Chr I indicate the top three most similar sequences found between Chr I
and Chr VIII. Obviously, the lengths of the repetitive regions found between Chr I and
Chr VIII are always larger than those found within Chr VIII alone. At the same time, the
lengths of the repetitive regions found between Chr VIII and other chromosomes except
Chr III, Chr IX and Chr XI are noteworthy too.
Comparing Figure 7(a) with Figure 7(b), the interesting point is that the lengths of the
repetitive regions found between Chr I and Chr VIII are always larger than those found
within Chr I alone [the highlighted part in Figure 7(a)] or those found within Chr VIII
alone [the highlighted part in Figure 7(b)]. Besides, the lengths of the repetitive regions
found between Chr I and other chromosomes (except Chr VIII) shown in Figure 7(a) and
those between Chr VIII and other chromosomes (except Chr I) shown in Figure 7(b) have
identical patterns.
Figure 7(c) illustrates the lengths of the top three score repetitive regions within
Chr III itself and between Chr III and the other 15 chromosome sequences of
S.cerevisiae. In this case, we can see that the self-similarity inside Chr III is small
compared to the cross-similarities between Chr III and other chromosome sequences.
The same is true for Chr XI, as shown in Figure 7(d). In fact, similar observations are
obtained for other chromosome sequences of S.cerevisiae. This shows that, besides
self-similarity within the chromosome sequence itself, cross-similarities with other
chromosome sequences cannot be ignored. Exploiting these cross-similarities
should be beneficial for compression applications.
Figure 7 The lengths of the top three score repetitive regions between, (a) Chr I (b) Chr VIII
(c) Chr III and (d) Chr XI with the other 15 chromosome sequences of S.cerevisiae
[Four bar charts: (a) ‘Similarity with Chromosome I’ and (b) ‘Similarity with Chromosome
VIII’, y-axis from 0 to 18000; (c) ‘Similarity with Chromosome III’ and (d) ‘Similarity with
Chromosome XI’, y-axis from 0 to 8000; x-axis in each panel, chromosomes I to XVI]
Notes: The first, second and third highest scores are illustrated by black, grey and light
grey colour bars respectively. Y-axis denotes the length of the repetitive regions
found.
3.2 Location and length of similar sequences between chromosomes
Most existing DNA compression algorithms work by first finding self-similar
subsequences inside the current chromosome sequence. Each such subsequence is then
encoded with reference to an identical/similar subsequence that occurred earlier (Chen
et al., 2002). To quantify the potential gain in cross-sequence compression, we need to
find out whether any subsequence in the current chromosome sequence can be predicted
from regions in another chromosome sequence. If so, there will be a gain if these two
sequences are compressed together by referencing regions to each other. We term this
cross-sequence compression. The length of these cross-reference subsequences
determines the potential compression ratio that would result from considering multiple
DNA sequences in compression. The longer the length is, the higher the potential
compression ratio will be.
3.3 Analysis with self-referencing and cross-referencing
Table 2 and Table 3 show the total lengths of subsequences that can be predicted either
from the current chromosome sequence or from other chromosome sequences.
Table 2 Total lengths of subsequences in Chr a that can be predicted from certain regions in
Chr b

b \ a                   I       III        IV         V       VII      VIII
Length of Chr a    230208    316617   1531918    576869   1090946    562643
Class of Chr a          3         1         3         1         2         1
I                   24807*    12411     31766     13354     23469     36809
II                  15253     17228     58017     35443     56365     22205
III                  9964     11361*    29904     12292     26207     13925
IV                  16241     22604     82152*    47444     55529     35110
V                   10988     11933     56508     25003*    37456     20144
VI                   9634     12218     33910     16000     34056     23460
VII                 16149     14952     67605     39911     43212*    26373
VIII                50536     14030     48346     27718     29262     20263*
IX                   7623      9438     19237     21098     27053     16160
X                   14274     20753     61192     37469     37470     28774
XI                   7467     17228     13789      8715     15015     19735
XII                  7623     17316     77913     37045     62116     29828
XIII                13193     14127     46460     29155     44821     31372
XIV                 13049     28820     53655     39883     39941     27743
XV                  25981     16711     73035     35748     46149     24951
XVI                 10455     14598     55231     34973     66621     33132

Note: The italicized value (marked * here, where a = b) represents self-similarity (i.e.,
self-prediction) while the highlighted boxes represent those entries that have greater
values than the self-predicted one.
Each column entry in the table represents the total length of subsequences in Chr a that
can be predicted from certain regions in Chr b. For example, in Table 2, the first entry for
Chr I, ‘24807’, represents the total length of similar subsequences that can be found within
Chr I. In other words, a total of 24807 nucleotides can be predicted by referencing
Chr I itself. Similarly, the second entry, ‘15253’, represents the total length of similar
subsequences in Chr I that can be predicted from Chr II. In other words, a total of 15253
nucleotides in Chr I can be encoded with reference to similar subsequences in Chr II.
Furthermore, the first entry for Chr III, ‘12411’, is highlighted since it is greater than
‘11361’ (the third entry of Chr III), which is the total length of similar subsequences that
can be found within Chr III itself.
Table 3  Total lengths of subsequences in Chr a that can be predicted from certain regions in Chr b

Chr b \ Chr a          XI      XII     XIII      XIV       XV      XVI
Length of Chr a    666454  1078175   924429   784333  1091289   948062
Class of Chr a          1        3        2        1        2        2
I                    8459    22818    18084    19422    33736    15894
II                   9926    36714    29897    40236    43754    40400
III                 15414    26790    11836    32780    22574    13006
IV                  12097    87680    41181    46787    70059    43633
V                    7095    42686    32899    37707    29723    26308
VI                   6975    30481    19089    30273    26885    22531
VII                 11571    79301    45231    35342    41149    67663
VIII                19659    32142    24432    35704    25680    30953
IX                  12521    17685    16718    34194    32314    14307
X                   35014    41511    37283    41269    38576    34794
XI                   7169    12743    19559    11450    17025     8671
XII                  9765    84170    51846    41057    48221    40916
XIII                21718    46768    37573    35588    46740    40699
XIV                 13460    55506    31969    22881    49117    24580
XV                  13033    64470    51032    51085    37964    55019
XVI                  9145    58181    43181    26549    58936    34648
Note: The italicized value represents self-similarity (i.e., self-prediction) while the
highlighted boxes represent those entries that have greater values than the
self-predicted one.
The self-referencing values are italicized in Table 2 and Table 3. All entries that have a
greater number of nucleotides predicted from another chromosome than the
self-referencing value are highlighted. The results can be grouped into three classes. The
first class, consisting of Chr III, Chr XI, Chr XIV, Chr VIII and Chr V, has high
similarities with chromosomes other than themselves. For each of these chromosomes,
more than half of the other chromosomes give cross-referencing values larger than the
self-referencing value. This implies that a potentially high compression gain can be
obtained if these sequences employ a cross-referencing strategy with subsequences
predicted from other chromosomes in addition to self-referencing.
The second class consists of Chr XV, Chr XVI, Chr VII and Chr XIII. The numbers
of highlighted entries for Chr XV, Chr XVI, Chr VII and Chr XIII are 8, 7, 6 and 5
respectively. Although these numbers are not as high as those in the first class, a
potential compression gain is still expected since the cross-referencing values are large.
As self-referencing is still used in compression, an effective cross-referencing strategy
should improve the overall compression ratio.
The last class consists of Chr I, Chr XII and Chr IV. The numbers of highlighted
entries for Chr I and Chr XII are 2 and 1 respectively, and there are no highlighted
entries for Chr IV. In Chr I, a total of 50536 nucleotides can be predicted from Chr VIII.
In contrast, only 24807 nucleotides can be self-referenced within Chr I. The number
roughly doubles if a reference is made to Chr VIII. This is consistent with the findings in
Figure 7(a). In Chr XII, a total of 87680 nucleotides can be predicted from Chr IV. This
is comparable to the self-referencing value, which is 84170. As the length of Chr XII is
1078175, these self-referencing and cross-referencing values are indeed significant. In
Chr IV, the self-similarity consists of 82152 nucleotides. In contrast, the largest
cross-similarity, with Chr XII, is 77913. While this is smaller than the self-referencing
value, the combination of self-referencing and cross-referencing values should contribute
to better compression.
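The comparison in this paragraph can be made concrete with a short calculation. The sketch below simply divides the quoted totals by the chromosome lengths from Tables 2 and 3. The helper name `fraction_predictable` is an assumption introduced here, and the calculation assumes that each quoted total counts a nucleotide at most once, which the text does not state explicitly.

```python
def fraction_predictable(predicted_length, chromosome_length):
    """Share of a chromosome's nucleotides covered by a predicted total
    (hypothetical helper; assumes the total counts each nucleotide once)."""
    return predicted_length / chromosome_length


# Chromosome lengths taken from Tables 2 and 3.
CHR_I_LENGTH = 230208
CHR_XII_LENGTH = 1078175

i_self = fraction_predictable(24807, CHR_I_LENGTH)          # Chr I from itself
i_from_viii = fraction_predictable(50536, CHR_I_LENGTH)     # Chr I from Chr VIII
xii_self = fraction_predictable(84170, CHR_XII_LENGTH)      # Chr XII from itself
xii_from_iv = fraction_predictable(87680, CHR_XII_LENGTH)   # Chr XII from Chr IV

print(f"Chr I:   self {i_self:.1%}, from Chr VIII {i_from_viii:.1%}")
print(f"Chr XII: self {xii_self:.1%}, from Chr IV {xii_from_iv:.1%}")
print(f"Chr I cross/self ratio: {50536 / 24807:.2f}")
```

Under this assumption, the two Chr XII values each correspond to roughly 8% of the chromosome, and the Chr VIII reference covers about twice as much of Chr I as self-referencing does.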
Besides the total length of similar subsequences, their exact locations are important
too. If the similar subsequences within a single sequence are well spread out instead of
heavily overlapped, a high proportion of the total nucleotides within the sequence can be
predicted by cross-referencing among chromosomes. This in turn results in a high
compression gain. Figure 8 provides a detailed analysis of the locations of similar
subsequences among chromosomes. The similar subsequences are well spread out, which
shows the potential benefit of encoding multiple chromosome sequences together. To
present the locations of similar subsequences clearly, we only consider repeats with
scores above 100, and the illustration shows only those with repeat lengths over 20.
Figures 8(a), 8(b) and 8(c) show the locations of similar subsequences for the first, the
second and the third class respectively.
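The difference between well spread and heavily overlapped subsequences can be illustrated by counting the positions covered by the union of their intervals. The following is a minimal sketch, not the procedure used to produce Figure 8: the `Repeat` record, the half-open `(start, end)` convention and the sample coordinates are all assumptions made for illustration. Only the two thresholds (score above 100, length over 20) come from the text.

```python
from typing import NamedTuple


class Repeat(NamedTuple):
    """Hypothetical record for one similar subsequence in the target."""
    start: int    # inclusive start position
    end: int      # exclusive end position
    score: float  # similarity score of the match


def significant(repeats, min_score=100, min_length=20):
    """Keep repeats with score above min_score and length over min_length."""
    return [r for r in repeats
            if r.score > min_score and (r.end - r.start) > min_length]


def covered_length(repeats):
    """Count positions covered by at least one repeat (union of intervals)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted((r.start, r.end) for r in repeats):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start  # close the previous merged run
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)       # extend the current merged run
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Made-up coordinates: three repeats of length 100 in each case.
spread_out = [Repeat(0, 100, 150), Repeat(500, 600, 150), Repeat(900, 1000, 150)]
overlapped = [Repeat(0, 100, 150), Repeat(20, 120, 150), Repeat(40, 140, 150)]

print(covered_length(significant(spread_out)))  # 300
print(covered_length(significant(overlapped)))  # 140
```

Both sets have the same total repeat length of 300, but the spread-out set covers 300 distinct positions while the overlapped set covers only 140, which is why the locations matter in addition to the totals.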
In Figure 8(a), we can see that the portions of self-referencing regions (shown in
black) in all five chromosomes are very small compared to the portions of
cross-referencing regions (shown in grey) with other chromosomes. The proportions of
self-referencing subsequences in Chr XI, Chr XIV, Chr VIII and Chr V are so small that
the self-referencing subsequences are not even visible in the figure. Moreover, similar
subsequences predicted from other chromosomes contribute to different locations. For
example, in Chr XI, the four similar subsequences found from Chr X, Chr XIII,
Chr VIII and Chr III contribute to four different areas. Similar observations can be
made from Figure 8(b) for the second class.
Figure 8(c) shows the locations of similar subsequences for the third class. In Chr I, we
can see that the portions of cross-referencing regions with either Chr VIII or Chr XV are
much larger than those of self-referencing regions. In Chr XII, the portions of
cross-referencing regions with Chr XIII or Chr IV are comparable to those of
self-referencing regions. In Chr IV, the portions of cross-referencing regions with
Chr XII are also comparable to those of self-referencing regions.
Figure 8 shows that the cross-referencing regions with other chromosomes are often
significant when compared with the self-referencing regions within the chromosome.
Also, similar sequences from different chromosomes contribute to different locations in
the chromosome. As a result, it would be advantageous to compress different
chromosomes together in order to benefit from both self-sequence and cross-sequence
similarities.
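One way to see which chromosomes would be the most useful references for a given target is to rank the entries of its column by total predicted length. The sketch below does this for Chr XII using the values from Table 3. The ranking criterion and the helper `rank_references` are assumptions made for illustration, not a compression scheme proposed in this paper.

```python
# Chr XII column of Table 3: total length in Chr XII predicted from each Chr b.
CHR_XII_COLUMN = {
    "I": 22818, "II": 36714, "III": 26790, "IV": 87680,
    "V": 42686, "VI": 30481, "VII": 79301, "VIII": 32142,
    "IX": 17685, "X": 41511, "XI": 12743, "XII": 84170,
    "XIII": 46768, "XIV": 55506, "XV": 64470, "XVI": 58181,
}


def rank_references(column, top=3):
    """Return the `top` reference chromosomes with the largest predicted
    totals, largest first (hypothetical ranking criterion)."""
    ranked = sorted(column.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


print(rank_references(CHR_XII_COLUMN))
# [('IV', 87680), ('XII', 84170), ('VII', 79301)]
```

For Chr XII, the chromosome itself appears only second in this ranking, between Chr IV and Chr VII, which illustrates how self-sequence and cross-sequence references could complement each other.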
Figure 8  Locations of similar subsequences for (a) the first class, (b) the second class and
(c) the third class of chromosome sequences
Notes: Self-similarity is shown in black while cross-similarities with other
chromosomes are shown in grey. The sequence number of the chromosome is
marked inside the coloured region. Only significant regions are presented, and
they are drawn to scale. The * next to a chromosome marks a chromosome
without significant self-sequence repetitions.
4 Conclusions and future development
We have investigated similarities among the 16 chromosomes of S. cerevisiae. Although
cross-sequence similarities have been known and exploited in many applications, we
quantified them here for the first time with a view to efficient DNA sequence
compression. A detailed similarity analysis, including the length and location of similar
subsequences between chromosomes, was performed. We found that cross-sequence
similarities between chromosomes are highly significant: the length of similar
subsequences found between chromosomes is at least comparable to that found within a
chromosome. While current DNA compression only considers repetitions found within
the chromosome sequence itself, our study implies that it would be highly advantageous
to compress different chromosomes together to achieve a higher compression ratio.
Compression can therefore benefit from both self-sequence and cross-sequence
similarities. Our future work will be to quantify this observation between species and to
develop an efficient DNA compression scheme that exploits both self-sequence and
cross-sequence similarities.
Acknowledgements
This work is supported by the Centre for Signal Processing, Department of Electronic and
Information Engineering and the Hong Kong Polytechnic University (1-BB9F). Paula
Wu acknowledges the research studentship provided by the University.
References
Behzadi, B. and Fessant, F.L. (2005) ‘DNA compression challenge revisited: a dynamic
programming approach’, Symposium on Combinatorial Pattern Matching, pp.190–200.
Chang, C.H. (2004) ‘DNAC: a compression algorithm for DNA sequences by non-overlapping
approximate repeats’, Master thesis.
Chen, X., Kwong, S. and Li, M. (1999) ‘A compression algorithm for DNA sequences and its
applications in genome comparison’, Genome Informatics, Vol. 10, pp.51–61.
Chen, X., Kwong, S. and Li, M. (2001) ‘A compression algorithm for DNA sequences’, IEEE
Engineering in Medicine and Biology Magazine, Vol. 20, No. 4, pp.61–66.
Chen, X., Li, M., Ma, B. and Tromp, J. (2002) ‘DNACompress: fast and effective DNA sequence
compression’, Bioinformatics, Vol. 18, No. 12, pp.1696–1698.
DDBJ Homepage, Available at http://www.ddbj.nig.ac.jp/Welcome-e.html.
DNA Sequence Alignment, Available at http://michael.dipperstein.com/dna.
GenBank overview, Available at http://www.ncbi.nlm.nih.gov/Genbank/index.html.
Grumbach, S. and Tahi, F. (1993) ‘Compression of DNA sequences’, in Data Compression
Conference, IEEE Computer Society Press, pp.340–350.
Grumbach, S. and Tahi, F. (1994) ‘A new challenge for compression algorithms: genetic
sequences’, Information Processing & Management, Vol. 30, No. 6, pp.875–886.
Hizume, F., Shibata, F., Matsusaki, Y. and Garajova, Z. (2002) ‘Chromosome identification and
comparative karyotypic analyses of four Pinus species’, Theor. Appl. Genet., Vol. 105,
pp.491–497.
Human Genome Project Science, Available at
http://www.ornl.gov/sci/techresources/Human_Genome/project/info.shtml.
Korodi, G. and Tabus, I. (2007) ‘Compression of annotated nucleotide sequences’, IEEE/ACM
Transactions on Computational Biology and Bioinformatics, Vol. 4, No. 3, pp.447–457.
Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P. and Zhang, H. (2001) ‘An
information-based sequence distance and its application to whole mitochondrial genome
phylogeny’, Bioinformatics, Vol. 17, No. 2, pp.149–154.
Li, W., Stolovitzky, G., Bernaola-Galván, P. and Oliver, J.L. (1998) ‘Compositional heterogeneity
within, and uniformity between, DNA sequences of yeast chromosomes’, Genome Research,
Vol. 8, pp.916–928.
Ma, B., Tromp, J. and Li, M. (2002) ‘PatternHunter: faster and more sensitive homology search’,
Bioinformatics, Vol. 18, No. 3, pp.440–445.
Matsumoto, T., Sadakane, K. and Imai, H. (2000) ‘Biological sequence compression algorithms’,
Genome Informatics, Vol. 11, pp.43–52.
Phillips, Bridgeman, J. and Ferguson-Smith, M. (2000) ‘The Inquiry into BSE and variant CJD in
the United Kingdom’, Available at http://www.bseinquiry.gov.uk/report/volume2/fig1_1.htm.
Pinho, A.J., Neves, A.J.R., Afreixo, V., Bastos, C.A.C. and Ferreira, P.J.S.G. (2006) ‘A three-state
model for DNA protein-coding regions’, IEEE Transactions on Biomedical Engineering,
Vol. 53, No. 11, pp.2148–2155.
Richardson, I.E.G. (2003) H.264 and MPEG-4 Video Compression – Video Coding for
Next-Generation Multimedia, John Wiley & Sons.
Rivals, E., Delahaye, J.P., Dauchet, M. and Delgrange, O. (1995) ‘A guaranteed compression
scheme for repetitive DNA sequences’, Technical Report IT-285, LIFL Lille I University.
Rivals, E., Delahaye, J.P., Dauchet, M. and Delgrange, O. (1996) ‘A guaranteed compression
scheme for repetitive DNA sequences’, Data Compression Conference.
The EMBL Nucleotide Sequence Database, Available at http://www.ebi.ac.uk/embl/.
Wang, Y., Ostermann, J. and Zhang, Y. (2002) Video Processing and Communications, Prentice
Hall.
Ziv, J. and Lempel, A. (1977) ‘A universal algorithm for sequential data compression’,
Transactions on Information Theory, Vol. IT-23, pp.337–343.