Chemical Physics Letters 367 (2003) 170–176
www.elsevier.com/locate/cplett
DB-Curve: a novel 2D method of DNA sequence
visualization and representation
Yonghui Wu a, Alan Wee-Chung Liew a,*, Hong Yan a,c, Mengsu Yang b
a Department of Computer Engineering and Information Technology, 83 Tat Chee Avenue, City University of Hong Kong, Kowloon, Hong Kong
b Department of Biology and Chemistry, 83 Tat Chee Avenue, City University of Hong Kong, Kowloon, Hong Kong
c School of Electrical and Information Engineering, University of Sydney, NSW 2006, Australia
Received 23 July 2002
Abstract
The large number of bases in a DNA sequence and the cryptic nature of the 4-alphabet representation make
graphical visualization of DNA sequences useful for biologists. However, existing 3D graphical representations are
complicated, whereas existing 2D graphical representations suffer from high degeneracy, and many features in a DNA
sequence cannot be visualized clearly. This Letter introduces a novel 2D method of DNA representation: the DB-Curve
(Dual-Base Curve), which overcomes some of the limitations in existing 2D graphical representations. Many properties
of DNA sequences can be observed and visualized easily using a combination of DB-Curves. In contrast to existing 2D
graphical representations of DNA sequences, the new representation avoids degeneracy completely. Unlike 3D
graphical representations, no 2D projection is required for the DB-Curve, and this allows for easier analysis of DNA
sequences. The DB-Curve provides a useful graphical tool for the visualization and study of DNA sequences.
© 2002 Elsevier Science B.V. All rights reserved.
1. Introduction
Biologists need to observe the useful features of long DNA sequences that comprise several thousand or several tens of thousands of bases. With the alphabet representation of DNA sequences, it is difficult to observe meaningful features in the sequence. DNA sequence visualization would provide a simple, friendly, immediate and interactive graphical display in which users can easily observe the global and local visual features of a long DNA sequence.
* Corresponding author. Fax: +852-2788-8292.
E-mail addresses: [email protected] (Y. Wu), [email protected] (A. Wee-Chung Liew), [email protected] (H. Yan).
DNA sequences, even relatively short segments, do not yield an immediately useful or informative characterization. Comparing DNA sequences, even those with fewer than a hundred bases, can be quite difficult [1]. DNA sequence visualization should also facilitate the analysis, comparison and identification of many DNA sequences, especially sequences with fewer than one hundred bases [1,10].
0009-2614/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0009-2614(02)01684-6
Y. Wu et al. / Chemical Physics Letters 367 (2003) 170–176
Existing methods of DNA sequence visualization can be classified as either 2D or 3D graphical representations. Examples of 3D graphical representations of DNA sequences include the H-Curve [2–5,8], the Chaos game display [6,7] and the W-Curve [9]. The 3D graphical representations can uniquely characterize a DNA sequence, but they are complicated and inconvenient, and they require the display of 2D projections or 3D stereo projections for visualization and analysis.
Examples of 2D graphical representations of DNA sequences include the methods proposed by Gates, Nandy, Leong and Morgenthaler, and Guo [1,10]. The first three methods essentially plot a point corresponding to a base by moving one unit in the positive or negative x- or y-direction, depending on the defined association of a base with a cardinal direction. The cumulative plot of such points produces a graph that corresponds to the sequence of bases in the gene fragment under consideration. These 2D graphical representations all have high degeneracy, because the graphical representation of a short DNA sequence may also correspond to a longer DNA sequence. For example, the sequences AGTCA, AGTCAG, AGTCAGT, AGTCAGTC, ... will have the same graphical representation [1,10].
Guo's representation of DNA sequences has lower degeneracy and less overlapping. However, since Guo's graphical representation is not monotonically increasing, degeneracy cannot be avoided totally. This also makes further analysis of features in Guo's curve difficult. For example, it is difficult to fit a function to Guo's representation and carry out analysis in the spectral domain. Most importantly, it does not allow a clear display of the interesting features in a DNA sequence (see Fig. 1a).
2. The DB-Curve
In order to offer a simple and direct graphical tool that can (i) avoid degeneracy completely and (ii) display the features in a DNA sequence clearly, this Letter introduces a novel 2D method of DNA representation based on two nucleic acid bases, called the DB-Curve (Dual-Base Curve). The DB-Curve displays two of the four DNA bases at a time on a plane. The idea is that if a sequence exhibits interesting visual features, these should also be visible in the sub-sequence consisting of two of the bases. Likewise, two sequences that are similar should have their similarity reflected in their sub-sequences consisting of two of the bases.
Fig. 1. Guo's method and the AC DB-Curve of the DNA sequence for human β-globin. (a) Guo's method. (b) The AC DB-Curve.
2.1. Construction of DB-Curve
If we take two at a time out of the four possible DNA bases A, C, G and T, we have 12 ordered pairs. By ignoring the base order, i.e., treating AT the same as TA, six unique combinations are obtained: AC, TC, CG, AT, TG and AG. A DB-Curve can be obtained from each of these six combinations, and the combination of several DB-Curves can uniquely characterize a given DNA sequence. On statistical grounds, two DNA sequences longer than 1000 bp are very unlikely to coincide. A DB-Curve can therefore, with high probability, uniquely characterize a given DNA sequence if the sequence is longer than 1000 bp.
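The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are ours, and the standard library lists the unordered pairs in alphabetical order, so TC appears as CT and TG as GT.

```python
from itertools import combinations, permutations

BASES = "ACGT"

# Ordered pairs of distinct bases: AC and CA count separately.
ordered_pairs = ["".join(p) for p in permutations(BASES, 2)]

# Unordered pairs: AC and CA are the same combination.
unordered_pairs = ["".join(c) for c in combinations(BASES, 2)]

print(len(ordered_pairs))   # 12
print(unordered_pairs)      # ['AC', 'AG', 'AT', 'CG', 'CT', 'GT']
```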
For example, for the AC DB-Curve, we define a vector with start point (0, 0) and end point (+1, +1) corresponding to base A, a vector with start point (0, 0) and end point (-1, +1) corresponding to base C, and a vector with start point (0, 0) and end point (0, +1) corresponding to bases T and G. If we define the starting point as (0, 0), a DNA sequence can be mapped to a 2D coordinate system by a cumulative plot of the bases in the sequence using the above notation (see Fig. 1b). The AC DB-Curve emphasizes the relationship between the A and C bases and makes their visualization simple and clear. The TG, TC, CG, AT and AG DB-Curves can be obtained similarly.
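The construction just described can be sketched in Python as a cumulative sum of step vectors. The function name `db_curve` and its parameters `plus` and `minus` are our own illustrative choices, not notation from this Letter; the sketch assumes the sequence contains only the letters A, C, G and T.

```python
def db_curve(seq, plus="A", minus="C"):
    """Return the (x, y) points of a DB-Curve, starting at (0, 0).

    A base found in `plus` contributes the step (+1, +1), a base found
    in `minus` contributes (-1, +1), and any other base contributes (0, +1).
    """
    x = y = 0
    points = [(x, y)]
    for base in seq.upper():
        if base in plus:
            x += 1
        elif base in minus:
            x -= 1
        y += 1  # every base advances the curve by one unit in Y
        points.append((x, y))
    return points


# AC DB-Curve of a short made-up sequence.
print(db_curve("ACGTA"))
# [(0, 0), (1, 1), (0, 2), (0, 3), (0, 4), (1, 5)]
```

The other five curves follow by changing the two arguments, for example `db_curve(seq, "T", "G")` for the TG DB-Curve. Because the step rule only tests membership, passing two-letter strings such as `plus="AT"` and `minus="GC"` gives a curve of the kind discussed in Section 3.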
In addition, colours can be used to carry different meanings as needed, which is convenient for observation and analysis. For example, colours can be used to denote the four different bases in the DB-Curve of a short DNA sequence with fewer than a hundred bases.
2.2. The degeneracy of the DB-Curve
From the construction of the DB-Curve, we can see that it increases monotonically in Y, i.e., the ordering of the bases in a sequence is visually preserved as one travels along the curve. It does not produce closed loops. Thus, the problem of degeneracy is totally avoided. Because the representation increases monotonically with the base number, further analysis, such as spectral analysis, can be carried out on the curve.
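The contrast with cardinal-direction walks can be illustrated with a small sketch. The direction assignment in `STEPS_4WAY` is a hypothetical one, chosen only so that the fragment AGTC closes a loop; it is not claimed to be the assignment used by any of the cited methods. The helper names are likewise ours.

```python
# Hypothetical 4-direction assignment (an assumption for illustration):
# with it, the fragment "AGTC" traces a closed unit square.
STEPS_4WAY = {"A": (1, 0), "G": (0, 1), "T": (-1, 0), "C": (0, -1)}


def cardinal_walk(seq):
    """Cumulative walk in which each base moves one unit along an axis."""
    x = y = 0
    points = [(x, y)]
    for base in seq:
        dx, dy = STEPS_4WAY[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def ac_db_curve(seq):
    """AC DB-Curve: A -> (+1, +1), C -> (-1, +1), T or G -> (0, +1)."""
    x = y = 0
    points = [(x, y)]
    for base in seq:
        x += {"A": 1, "C": -1}.get(base, 0)
        y += 1
        points.append((x, y))
    return points


short, longer = "AGTCA", "AGTCAGTC"

# The 4-way walk retraces the same square, so both sequences draw the same figure.
print(set(cardinal_walk(short)) == set(cardinal_walk(longer)))  # True

# The DB-Curve never revisits a point, because Y grows by one with every base.
curve = ac_db_curve(longer)
print(len(set(curve)) == len(curve))                            # True
print([y for _, y in curve] == list(range(len(longer) + 1)))    # True
```

Since the Y-coordinate of the end point equals the sequence length, sequences of different lengths can never share a DB-Curve, which is the situation that causes degeneracy in the 4-way walk above.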
Fig. 2. The full AC DB-Curve of the DNA sequence for the Homo sapiens partial OBSCN gene for obscurin, which comprises 18760 bp. Exons are shown in black, introns in grey.
2.3. Properties of the DB-Curve
The DB-Curve has the following properties:
1. The DB-Curve can display the sequence information on a 2D plane, where the global features of a long DNA sequence can be observed easily. The DB-Curve can also serve as a good annotation tool. For example, Fig. 2 is the full AC DB-Curve of the DNA sequence for the Homo sapiens partial OBSCN gene for obscurin, which comprises 18760 bp (GI: 21104339). Exons are shown in black, introns in grey.
2. The DB-Curve can show overlapping genes in a conspicuous manner. For example, Fig. 3 shows part of the genome of coliphage phiX174 (GI: 9626372). Overlapping sequences are shown in black, other protein-coding parts are shown in grey, and other sequences (bottom section of the curve, just below the NP_040706.1 sequence) are shown as a dotted line.
3. The relative abundance of two bases can be observed directly. For example, the X-coordinate value $X_{AC}$ of the end point (Point E in Fig. 4) of the AC DB-Curve indicates the relative abundance of base A and base C. Let $N_A$ and $N_C$ denote the numbers of A and C bases in the sequence. Then
\[
X_{AC} = N_A - N_C, \qquad
\begin{cases}
X_{AC} > 0 \;\Rightarrow\; N_A > N_C,\\
X_{AC} < 0 \;\Rightarrow\; N_A < N_C,\\
X_{AC} = 0 \;\Rightarrow\; N_A = N_C.
\end{cases}
\tag{1}
\]
Similarly, the X-coordinate value of the end point, $X_{TG}$, on the TG DB-Curve indicates the relative abundance of base T and base G.
4. The Y-coordinate value of the end point indicates the number of nucleotides in the sequence (Point E in Fig. 4). The local maxima and minima on a DB-Curve signify local changes in the relative abundance of certain bases. In Fig. 4, the AC DB-Curve has several local maxima and minima (peaks pointing to the right and left, respectively). The local maxima and minima indicate sudden changes in the A and C content, from an A-rich region to a C-rich region and vice versa. In Fig. 4, the local growth rate, $\alpha$, of base A is defined by
\[
\alpha = \tan^{-1}\!\left(\frac{\Delta N}{\Delta A}\right).
\tag{2}
\]
$\alpha$ indicates the number of A bases per given length of DNA sequence. If $\alpha$ is small, the number of A bases per given length of DNA sequence is large, whereas if $\alpha$ is large, the number of A bases per given length is small.
Fig. 3. A part of the coliphage phiX174 complete genome that includes some overlapping sequences. Overlapping sequences are shown in black, other protein-coding parts in grey and other sequences as a dotted line.
Fig. 4. An AC DB-Curve displaying a DNA sequence that has some repetitive structures. In this illustration, the asterisks mark repeating fragments. The two arrows indicate points of transition on the DNA sequence from an A-rich region to a C-rich region and vice versa. α is the local growth rate of base A along the DNA sequence.
5. The regularities and symmetries of a DNA sequence are preserved in its DB-Curves. A repetitive fragment of a DNA sequence will have corresponding repetitive sections in its DB-Curve. For example, the AC DB-Curve of Fig. 4 has four repetitive sub-sequences (aaccaatgcc) inserted among the first 120 bases of L00459. The repetitive property is easy to observe in the AC DB-Curve (marked by asterisks in Fig. 4).
6. In some special cases in which the DNA sequence or sub-sequence consists of only two kinds of bases, the DB-Curve takes special forms. For example, if the sequence consists of the bases A and T only, its CG DB-Curve is a vertical line and its AT DB-Curve is a curve without any vertical segment. If the sequence consists of only the base A, its AT DB-Curve is the straight line bisecting the first quadrant and its CG DB-Curve is a single vertical line.
7. For two DNA sequences that are complementary to each other, their complementary DB-Curves are the same, i.e., the AC DB-Curve of one DNA sequence is the same as the TG DB-Curve of the other.
8. The flexuous local structure of the curve around
a certain value of Y will reflect the detailed nucleotide composition of the DNA sequence in
the vicinity of that Y value. This property is useful for the analysis and comparison of DNA
sequences.
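Several of the properties listed above can be checked numerically. The following is a minimal sketch under stated assumptions: the helper names (`db_curve`, `local_growth_rate_deg`, `COMPLEMENT`) are ours; the test sequence is the first 24 bases of the human β-globin exon in Table 1; the complement is taken base by base without reversing the strand; and Eq. (2) is read with ΔA as the number of A bases found in a window of ΔN bases, which is one possible interpretation.

```python
import math


def db_curve(seq, plus, minus):
    """DB-Curve points: `plus` -> (+1, +1), `minus` -> (-1, +1), else (0, +1)."""
    x = y = 0
    points = [(x, y)]
    for base in seq:
        if base in plus:
            x += 1
        elif base in minus:
            x -= 1
        y += 1
        points.append((x, y))
    return points


def local_growth_rate_deg(window):
    """Angle alpha = arctan(dN / dA) in degrees for one window of bases.

    Assumption: dN is the window length and dA the number of A bases in it.
    atan2 is used so that a window without any A gives 90 degrees.
    """
    return math.degrees(math.atan2(len(window), window.count("A")))


# Base-by-base complement (A<->T, C<->G), no strand reversal.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

seq = "ATGGTGCACCTGACTCCTGAGGAG"
ac_curve = db_curve(seq, "A", "C")
x_end, y_end = ac_curve[-1]

# Property 3, Eq. (1): the end-point X equals N_A - N_C.
print(x_end == seq.count("A") - seq.count("C"))   # True

# Property 4: the end-point Y equals the number of nucleotides.
print(y_end == len(seq))                          # True

# Property 7: the TG DB-Curve of the complement equals the AC DB-Curve.
print(db_curve(seq.translate(COMPLEMENT), "T", "G") == ac_curve)  # True

# Growth rate: an A-rich window gives a smaller angle than an A-poor one.
print(local_growth_rate_deg("AAAT") < local_growth_rate_deg("ACGT"))  # True
```

Under this reading, a window made only of A bases gives an angle of 45 degrees, the smallest possible value, and the angle approaches 90 degrees as A becomes rarer.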
The first exons of the β-globin genes of eight different species listed in [1] are shown in Table 1. We use the DB-Curve to display them. At the left of Fig. 5, we show the DB-Curve representation of the first exon of the human β-globin gene. The rest of Fig. 5 shows the first exon of the β-globin gene of several other species for comparison. Qualitative similarities and differences between exons of different species are immediately apparent.
The capability of recognizing certain protein
features (e.g., hydrophobic regions, sections particularly rich in certain amino acids, etc.) by a direct inspection of corresponding gene sequences in
a DB-Curve is a possibility not yet fully explored.
Table 1
DNA sequences of the first exons of β-globin genes for eight different species
A. Human β-globin (92 bases)
ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATTAAGTTGGTGGTGAGGCCCTGGGCAG
B. Goat alanine β-globin (86 bases)
ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG
C. Opossum β-hemoglobin β-M-gene (92 bases)
ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG
D. Gallus gallus β-globin (92 bases)
ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG
E. Lemur β-globin (92 bases)
ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG
F. Mouse β-a-globin (94 bases)
ATGGTTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGTGGGCAAAGGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
G. Rabbit β-globin (90 bases)
ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGC
H. Rat β-globin (92 bases)
ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG
Fig. 5. The AC DB-Curve of the DNA sequences of the first exons of β-globin genes for eight different species.
3. An extension
A simple extension to the DB-Curve can be made as follows. If we take all four possible bases A, C, G and T at a time, grouped into two pairs, and ignore the base order, i.e., AT is treated the same as TA, and AT–GC the same as GC–AT, three unique combinations can be obtained: GC–AT, TC–AG and TG–AC. For example, the GC–AT-Curve can be constructed by defining a vector with start point (0, 0) and end point (+1, +1) corresponding to base A or T, and a vector with start point (0, 0) and end point (-1, +1) corresponding to base C or G. The GC–AT-Curve then displays the variation of base G or C against base A or T, and would be helpful in visualizing the variation in the GC content along genes, chromosomes and genomes.
4. Conclusions
This Letter introduces a novel 2D method of DNA representation: the DB-Curve. Many properties of visual importance in a DNA sequence are preserved in the DB-Curves. The representation is useful for visualizing the global features of long DNA sequences and can facilitate the visual discovery of interesting features in a DNA sequence. The DB-Curve can also serve as a good annotation tool. In contrast to existing 2D graphical representations, this novel representation avoids degeneracy completely. Unlike 3D graphical representations, no 2D projection is required for the DB-Curve, which makes DNA sequences easier to visualize.
Some useful properties of DB-Curves are as follows. The relative abundance of two bases can be observed directly. The number of nucleotides in a sequence is given by the Y-coordinate value of the end point. The local extrema on a DB-Curve signify local changes in the relative abundance of two bases. The complementary DB-Curves of two complementary DNA sequences are the same. The regularities and symmetries of a DNA sequence are preserved in the DB-Curves. The flexuous local structure of the curve around a point in Y reflects the detailed nucleotide composition of the DNA sequence in the vicinity of that point. The DB-Curve should prove to be a useful graphical tool for the visualization and study of DNA sequences.
Acknowledgements
This work is supported by a CityU SRG Grant
(7001183) and an interdisciplinary research Grant
(9010003).
References
[1] X.F. Guo, M. Randic, S.C. Basak, Chem. Phys. Lett. 350 (2001) 106.
[2] E. Hamori, J. Ruskin, J. Biol. Chem. 258 (1983) 1318.
[3] E. Hamori, G. Varga, J.J. Laguardia, Comput. Appl. Biosci. 5 (1989) 263.
[4] E. Hamori, Biotechniques 7 (1989) 710.
[5] E. Hamori, in: C.A. Pickover, S.K. Tewksbury (Eds.), Frontiers of Scientific Visualization, volume 3 of Scientific Visualization, Wiley–Interscience, New York, 1994, p. 90.
[6] H.J. Jeffrey, Nucleic Acids Res. 18 (1990) 2163.
[7] H.J. Jeffrey, Comput. Graph. 16 (1992) 25.
[8] M.L. Lantin, M.S.T. Carpendale, in: Proceedings of IEEE Conference on Visualization, IEEE Computer Society Press, Silver Spring, MD, 1998, p. 423.
[9] D. Wu, J. Roberge, D.J. Cork, B.G. Nguyen, T. Grace, in: Proceedings of IEEE Conference on Visualization, IEEE Computer Society Press, Silver Spring, MD, 1993, p. 308.
[10] M. Randic, J. Chem. Inf. Comput. Sci. 40 (2000) 1235.