Chemical Physics Letters 367 (2003) 170–176
www.elsevier.com/locate/cplett

DB-Curve: a novel 2D method of DNA sequence visualization and representation

Yonghui Wu a, Alan Wee-Chung Liew a,*, Hong Yan a,c, Mengsu Yang b

a Department of Computer Engineering and Information Technology, 83 Tat Chee Avenue, City University of Hong Kong, Kowloon, Hong Kong
b Department of Biology and Chemistry, 83 Tat Chee Avenue, City University of Hong Kong, Kowloon, Hong Kong
c School of Electrical and Information Engineering, University of Sydney, NSW 2006, Australia

Received 23 July 2002

* Corresponding author. Fax: +852-2788-8292.
E-mail addresses: [email protected] (Y. Wu), [email protected] (A. Wee-Chung Liew), [email protected] (H. Yan).

0009-2614/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0009-2614(02)01684-6

Abstract

The large number of bases in a DNA sequence and the cryptic nature of the 4-letter alphabet representation make graphical visualization of DNA sequences useful for biologists. However, existing 3D graphical representations are complicated, whereas existing 2D graphical representations suffer from high degeneracy, and many features in a DNA sequence cannot be visualized clearly. This Letter introduces a novel 2D method of DNA representation: the DB-Curve (Dual-Base Curve), which overcomes some of the limitations of existing 2D graphical representations. Many properties of DNA sequences can be observed and visualized easily using a combination of DB-Curves. Unlike existing 2D graphical representations of DNA sequences, the new representation avoids degeneracy completely. Unlike 3D graphical representations, the DB-Curve requires no 2D projection, which allows for easier analysis of DNA sequences. The DB-Curve provides a useful graphical tool for the visualization and study of DNA sequences. © 2002 Elsevier Science B.V. All rights reserved.

1. Introduction

Biologists need to observe the useful features of long DNA sequences that include several thousands or several tens of thousands of bases. With the alphabet representation of DNA sequences, it is difficult to observe meaningful features in the sequence. DNA sequence visualization would provide a simple, friendly, immediate and interactive graphical display such that users can easily observe the global and local visual features in a long DNA sequence. DNA sequences, even relatively short segments, do not yield an immediately useful or informative characterization. Comparison of DNA sequences, even those with fewer than a hundred bases, can be quite difficult [1]. DNA sequence visualization should also facilitate the analysis, comparison and identification of many DNA sequences, especially DNA sequences with fewer than one hundred bases [1,10].

Existing methods of DNA sequence visualization can be classified as either 2D or 3D graphical representations. Examples of 3D graphical representations of DNA sequences include the H-Curve [2–5,8], the Chaos game display [6,7] and the W-Curve [9]. The 3D graphical representations can uniquely characterize a DNA sequence, but they are complicated and inconvenient, and they require the display of 2D projections or 3D stereo projections for visualization and analysis.

Examples of 2D graphical representations of DNA sequences include the methods proposed by Gates, Nandy, Leong and Morgenthaler, and Guo [1,10]. The first three methods essentially plot a point corresponding to a base by moving one unit in the positive or negative x- or y-direction, depending on the defined association of a base with a cardinal direction. The cumulative plot of such points produces a graph that corresponds to the sequence of bases in the gene fragment under consideration.
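The cardinal-direction walk used by the first three methods can be sketched as follows. This is a minimal illustration only: the base-to-direction table `STEP` is a hypothetical assignment chosen for the example (each cited method defines its own), and the function names `walk_2d` and `drawn_segments` are introduced here, not taken from the cited work.

```python
# Hypothetical base-to-direction assignment, chosen only for illustration;
# each of the cited methods fixes its own association.
STEP = {"A": (-1, 0), "T": (1, 0), "G": (0, 1), "C": (0, -1)}


def walk_2d(seq):
    """Return the points visited by a unit-step walk starting at the origin."""
    x, y = 0, 0
    points = [(x, y)]
    for base in seq.upper():
        dx, dy = STEP[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def drawn_segments(seq):
    """Return the set of undirected unit segments that the walk draws."""
    pts = walk_2d(seq)
    return {frozenset(pair) for pair in zip(pts, pts[1:])}
```

With this particular assignment, `walk_2d("AGTC")` traces a unit square and ends back at the origin, so any further bases that retrace the square add no new segments to the drawing.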
These 2D graphical representations all have high degeneracy, because the graphical representation of a short DNA sequence may also correspond to a longer DNA sequence. For example, the sequences AGTCA, AGTCAG, AGTCAGT, AGTCAGTC, ... all have the same graphical representation [1,10]. Guo's representation of DNA sequences has lower degeneracy and less overlapping. However, since Guo's graphical representation is not monotonically increasing, degeneracy cannot be avoided entirely. This also makes further analysis of features in Guo's curve difficult. For example, it is difficult to fit a function to Guo's representation and carry out analysis in the spectral domain. Most importantly, it does not allow a clear display of the interesting features in a DNA sequence (see Fig. 1a).

Fig. 1. Guo's method and the AC DB-Curve of the DNA sequence for human β-globin. (a) Guo's method. (b) The AC DB-Curve.

2. The DB-Curve

In order to offer a simple and direct graphical tool that can (i) avoid degeneracy completely and (ii) display the features in a DNA sequence clearly, this Letter introduces a novel 2D method of DNA representation based on two nucleic acid bases, called the DB-Curve (Dual-Base Curve). The DB-Curve displays two of the four DNA bases at a time on a plane. The idea is that if a sequence exhibits interesting visual features, these should also be visible in the sub-sequence consisting of two of the bases. Two sequences that are similar should also have their similarity reflected in their sub-sequences consisting of two of the bases.

2.1. Construction of DB-Curve

If we take two at a time out of the four possible DNA bases A, C, G and T, we have 12 ordered combinations. By ignoring the base order, i.e., treating AT the same as TA, six unique combinations can be obtained: AC, TC, CG, AT, TG and AG.
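The count of 12 ordered and 6 unordered base pairs can be checked with a short sketch using Python's standard `itertools` module. The variable names are introduced here for illustration only.

```python
from itertools import combinations, permutations

BASES = "ACGT"

# Ordered pairs: AC and CA are counted separately.
ordered_pairs = ["".join(p) for p in permutations(BASES, 2)]

# Unordered pairs: AC and CA are counted once.
unordered_pairs = ["".join(c) for c in combinations(BASES, 2)]

print(len(ordered_pairs), len(unordered_pairs))
print(unordered_pairs)
```

The unordered list comes out in alphabetical order, so the pairs the Letter writes as TC and TG appear here as CT and GT.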
A DB-Curve can be obtained from each of these six combinations, and the combination of several DB-Curves can uniquely characterize a given DNA sequence. Statistically, the probability that two DNA sequences are repeats of each other is very small when their length exceeds 1000 bp. A single DB-Curve can therefore, with high probability, uniquely characterize a given DNA sequence if the sequence is longer than 1000 bp.

For example, for the AC DB-Curve, we define a vector with start point (0, 0) and end point (+1, +1) corresponding to base A, a vector with start point (0, 0) and end point (−1, +1) corresponding to base C, and a vector with start point (0, 0) and end point (0, +1) corresponding to bases T and G. If we define the starting point as (0, 0), a DNA sequence can be mapped to a 2D coordinate system by a cumulative plot of the bases in the sequence using the above notation (see Fig. 1b). The AC DB-Curve emphasizes the relationship between the A and C bases and makes their visualization simple and clear. The TG, TC, CG, AT and AG DB-Curves can be obtained similarly.

In addition, colours can be used to denote different meanings as needed, which is convenient for observation and analysis. For example, colours can be used to denote the four different bases in the DB-Curve for short DNA sequences with fewer than a hundred bases.

2.2. The degeneracy of the DB-Curve

From the construction of the DB-Curve, we can see that it increases monotonically, i.e., the ordering of the bases in a sequence is visually preserved as one travels along the curve. It does not produce closed loops. Thus, the problem of degeneracy is avoided entirely. Because the curve increases monotonically with the base number, further analysis, such as spectral analysis, can be carried out on it.

Fig. 2. The full AC DB-Curve of the DNA sequence for the Homo sapiens partial OBSCN gene for obscurin, which includes 18760 bp. Exons are shown in black, introns are shown in grey.

2.3. Properties of the DB-Curve

The DB-Curve has the following properties:

1. The DB-Curve can display the sequence information on a 2D plane, where the global features of a long DNA sequence can be observed easily. The DB-Curve can also serve as a good annotation tool. For example, Fig. 2 is the full AC DB-Curve of the DNA sequence for the Homo sapiens partial OBSCN gene for obscurin, which includes 18760 bp (GI: 21104339). Exons are shown in black, introns are shown in grey.

2. The DB-Curve can show overlapping genes in a conspicuous manner. For example, Fig. 3 shows part of the genome of coliphage phiX174 (GI: 9626372). Overlapping sequences are shown in black, other protein-coding parts are shown in grey, and other sequences (bottom section of the curve, just below the NP 040706.1 sequence) are shown as a dotted line.

3. The relative abundance of two bases can be observed directly. For example, in Fig. 4, the X-coordinate value of the end point (Point E in Fig. 4), X_AC, on the AC DB-Curve indicates the relative abundance of base A and base C. With N_A and N_C denoting the numbers of A and C bases in the sequence:

\[
X_{AC} = N_A - N_C, \qquad
\begin{cases}
X_{AC} > 0 & \Rightarrow\ N_A > N_C \\
X_{AC} < 0 & \Rightarrow\ N_A < N_C \\
X_{AC} = 0 & \Rightarrow\ N_A = N_C
\end{cases}
\tag{1}
\]

Similarly, the X-coordinate value of the end point, X_TG, on the TG DB-Curve indicates the relative abundance of base T and base G.

4. The Y-coordinate value of the end point indicates the number of nucleotides in the sequence (Point E in Fig. 4). The local maxima and minima on a DB-Curve signify local changes in the relative abundance of certain bases. In Fig. 4, the AC DB-Curve has several local maxima and minima (peaks pointing to the right and left, respectively). The local maxima and minima indicate sudden changes between A and C bases, from an A-rich region to a C-rich region and vice versa. In Fig. 4, the local growth rate, α, of base A is defined by

\[
\alpha = \tan^{-1}\!\left(\frac{\Delta N}{\Delta A}\right)
\tag{2}
\]

Here ΔN is a length of DNA sequence and ΔA is the number of A bases within it, so α indicates the number of A bases per given length of DNA sequence. If α is small, the number of A bases per given length is large, whereas if α is large, the number of A bases per given length is small.

5. The regularities and symmetries of a DNA sequence are preserved in its DB-Curves. A repetitive fragment of a DNA sequence will have corresponding repetitive sections in its DB-Curve. For example, the sequence shown as an AC DB-Curve in Fig. 4 has four copies of the repetitive sub-sequence aaccaatgcc inserted between the first 120 bases of L00459. The repetitive property is easy to observe in the AC DB-Curve (marked by asterisks in Fig. 4).

6. In special cases in which the DNA sequence or sub-sequence consists of only one or two kinds of bases, the DB-Curve takes special forms. For example, if the sequence consists of the bases A and T only, its GC DB-Curve is a vertical line and its AT DB-Curve is a curve without any vertical segment. If the sequence consists of only the base A, its AT DB-Curve is the bisector of the first quadrant and its GC DB-Curve is a single vertical line.

7. For two DNA sequences that are complementary to each other, their complementary DB-Curves are the same, i.e., the AC DB-Curve of one DNA sequence is the same as the TG DB-Curve of the other DNA sequence.

8. The flexuous local structure of the curve around a certain value of Y reflects the detailed nucleotide composition of the DNA sequence in the vicinity of that Y value. This property is useful for the analysis and comparison of DNA sequences.

Fig. 3. A part of the coliphage phiX174 complete genome that includes some overlapping sequences. Overlapping sequences are shown in black, other protein-coding parts are shown in grey and other sequences are shown as a dotted line.

Fig. 4. An AC DB-Curve displaying a DNA sequence that has some repetitive structures. In this illustration, the asterisks mark repeating fragments. The two arrows indicate points of transition on the DNA sequence from an A-rich region to a C-rich region and vice versa. α is the local growth rate of base A along the DNA sequence.

The first exons of β-globin genes for eight different species listed in [1] are shown in Table 1. We use the DB-Curve to display them. At the left of Fig. 5, we show the DB-Curve representation of the first exon of the human β-globin gene. The rest of Fig. 5 shows the first exon of the β-globin gene of several other species for comparison. Qualitative similarities and differences between exons of different species are immediately apparent. The capability of recognizing certain protein features (e.g., hydrophobic regions, sections particularly rich in certain amino acids, etc.) by direct inspection of the corresponding gene sequences in a DB-Curve is a possibility not yet fully explored.

Table 1
DNA sequences of the first exons of β-globin genes for eight different species

A  Human β-globin (92 bases)
   ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATTAAGTTGGTGGTGAGGCCCTGGGCAG
B  Goat alanine β-globin (86 bases)
   ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG
C  Opossum β-hemoglobin β-M-gene (92 bases)
   ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG
D  Gallus gallus β-globin (92 bases)
   ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG
E  Lemur β-globin (92 bases)
   ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG
F  Mouse β-a-globin (94 bases)
   ATGGTTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGTGGGCAAAGGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
G  Rabbit β-globin (90 bases)
   ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGC
H  Rat β-globin (92 bases)
   ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG

Fig. 5. The AC DB-Curve of the DNA sequences of the first exons of β-globin genes for eight different species.

3. An extension

A simple extension to the DB-Curve can be made as follows. If we group all four bases A, C, G and T into two pairs and ignore the order, i.e., AT is treated the same as TA, and AT–GC the same as GC–AT, three unique combinations can be obtained: GC–AT, TC–AG and TG–AC. For example, the GC–AT-Curve can be constructed by defining a vector with start point (0, 0) and end point (+1, +1) corresponding to base A or T, and a vector with start point (0, 0) and end point (−1, +1) corresponding to base C or G. The GC–AT-Curve then displays the variation of base G or C against base A or T, and would be helpful in visualizing the variation in GC content along genes, chromosomes and genomes.

4. Conclusions

This Letter introduces a novel 2D method of DNA representation: the DB-Curve.
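The construction can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: one group of bases steps by (+1, +1), a second group by (−1, +1), and every remaining base by (0, +1). The function name `db_curve` and its parameters `plus` and `minus` are introduced here for the example.

```python
def db_curve(seq, plus, minus):
    """Return the cumulative (x, y) points of a dual-base curve.

    Bases listed in `plus` step by (+1, +1), bases listed in `minus`
    step by (-1, +1), and every other base steps by (0, +1).
    """
    x, y = 0, 0
    points = [(x, y)]
    for base in seq.upper():
        if base in plus:
            x += 1
        elif base in minus:
            x -= 1
        y += 1  # every base advances one unit along Y, so the curve never loops
        points.append((x, y))
    return points


seq = "ATGGTGCACC"                     # first ten bases of entry A in Table 1
ac = db_curve(seq, "A", "C")           # AC DB-Curve
gc_at = db_curve(seq, "AT", "GC")      # GC-AT extension: A/T step right, G/C step left

# Complementary strand (A<->T, C<->G), used to compare AC and TG curves.
complement = seq.translate(str.maketrans("ACGT", "TGCA"))
tg_of_complement = db_curve(complement, "T", "G")
```

In this sketch the end point of `ac` has X equal to the number of A bases minus the number of C bases and Y equal to the sequence length, and `ac` coincides with `tg_of_complement`.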
Many properties of visual importance in a DNA sequence are preserved in the DB-Curves. The DB-Curve is useful for visualizing the global features of long DNA sequences and can facilitate the visual discovery of interesting features in a DNA sequence. The DB-Curve can also serve as a good annotation tool. Unlike existing 2D graphical representations, this novel representation avoids degeneracy completely. Unlike 3D graphical representations, the DB-Curve requires no 2D projection, which makes DNA sequences easier to visualize.

Some useful properties of DB-Curves are as follows:

1. The relative abundance of two bases can be observed directly.
2. The number of nucleotides in a sequence is given by the Y-coordinate value of the end point.
3. The local extrema on a DB-Curve signify local changes in the relative abundance of two bases.
4. The complementary DB-Curves of two complementary DNA sequences are the same.
5. The regularities and symmetries of a DNA sequence are preserved in the DB-Curves.
6. The flexuous local structure of the curve around a given value of Y reflects the detailed nucleotide composition of the DNA sequence in the vicinity of that point.

The DB-Curve should prove to be a useful graphical tool for the visualization and study of DNA sequences.

Acknowledgements

This work is supported by a CityU SRG Grant (7001183) and an interdisciplinary research Grant (9010003).

References

[1] X.F. Guo, M. Randic, S.C. Basak, Chem. Phys. Lett. 350 (2001) 106.
[2] E. Hamori, J. Ruskin, J. Biol. Chem. 258 (1983) 1318.
[3] E. Hamori, G. Varga, J.J. Laguardia, Comput. Appl. Biosci. 5 (1989) 263.
[4] E. Hamori, Biotechniques 7 (1989) 710.
[5] E. Hamori, in: C.A. Pickover, S.K. Tewksbury (Eds.), Frontiers of Scientific Visualization, volume 3 of Scientific Visualization, Wiley–Interscience, New York, 1994, p. 90.
[6] H.J. Jeffrey, Nucleic Acids Res. 18 (1990) 2163.
[7] H.J. Jeffrey, Comput. Graph. 16 (1992) 25.
[8] M.L. Lantin, M.S.T. Carpendale, in: Proceedings of IEEE Conference on Visualization, IEEE Computer Society Press, Silver Spring, MD, 1998, p. 423.
[9] D. Wu, J. Roberge, D.J. Cork, B.G. Nguyen, T. Grace, in: Proceedings of IEEE Conference on Visualization, IEEE Computer Society Press, Silver Spring, MD, 1993, p. 308.
[10] M. Randic, J. Chem. Inf. Comput. Sci. 40 (2000) 1235.