Maik Friedel, Thomas Wilhelm, Jürgen Sühnel
FLI-Jena, Germany
http://www.fli-leibniz.de/tsb

Introduction:
During the last 10 years, a large number of complete genomes have been sequenced. With these data at hand, the basic aim is now to convert this information into biological knowledge. This requires the identification of biologically meaningful motifs in genomic data. Computational motif discovery has been used with some success in simple organisms such as yeast. For higher organisms with more complex genomes, more sensitive methods are required. There is also a growing awareness that not single motifs but motif combinations, usually called modules, may be relevant to biological function.

We describe here a new type of GenomeBrowser that offers user-friendly genome analysis tools for the statistical analysis of single and multiple sequences as well as for the visual exploration of single sequences. A distinctive feature is that three kinds of sequence representation are available: the standard representation in terms of the bases A, T, G and C; a reduced representation by purine/pyrimidine and AT/GC characteristics; and a representation in terms of a large number of dinucleotide parameters, which can encode, for example, geometrical information on DNA structure. All of these coding schemes can be converted into a signal representation that allows very effective visual motif discovery. Analyses can be performed for the + and – strands as well as for the double strand. Combining these sequence- and signal-based representations offers a new approach for the detection of new regulatory elements. The functionalities described make the GenomeBrowser a unique tool for the identification and analysis of functional motifs in genomes.

Implemented tools
1. Repeat finder: Tool to search for any type of simple repeat in the sequence or signal representation.
2. Motif finder: Tool for searching DNA motifs in the sequence or signal representation.
3. Average statistic: Tool for calculating the average of any type of DNA feature over selected DNA fragments.
4. Showing underlying DNA sequence: Feature that shows the underlying DNA sequence of a selected part of the signal representation.
5. Property editor: Tool for searching, filtering and selecting all types of features indicated in the GenBank file.

Main window
The main window of the GenomeBrowser consists of three panels. The first (1) is the control panel, which allows uploading and manipulation of sequence and coding parameter information. The main field (2) shows the signal curve, and the third panel (3) shows the position information of the sequence range currently displayed. The DNA sequence in FASTA or GenBank format is converted into a signal representation by applying dinucleotide parameters and smoothed using a sliding-window technique. All sequence features included in the GenBank file can be selected and shown in different colors.

Parameters
To visualize biochemical and biophysical properties of a DNA strand, we have included about 40 different dinucleotide properties. All parameters are available for the complete set of 16 dinucleotide combinations. The table shows, as an example, the free energy change (B-DNA) in kcal/mol for the set of all 16 dinucleotides (M. Aida, J. Theor. Biol. 130, 327-335 (1988)).

Free energy change (B-DNA) [kcal/mol]; rows: first base, columns: second base

        A       C       G       T
A     -1.20   -1.50   -1.50   -0.90
C     -1.70   -2.10   -2.80   -1.50
G     -1.50   -2.30   -2.10   -1.50
T     -0.90   -1.50   -1.70   -1.20

First applications
1. Visualization of evolutionary events
The GenomeBrowser can be used to distinguish between three types of rRNA gene clusters in chloroplast genomes. The patterns are seen best when the free energy change measure for the DNA double strand is applied.
   1.) Inverted repeats (25 kb): 79 of 88 genomes
   2.) Inverted repeat lacking clade: 7 of 88 genomes
   3.) Three direct repeats: 2 of 88 genomes (subclass: Euglenozoa)

2. Visualization of gene and exon/intron organization
With the help of the GenomeBrowser it can be shown that genes tend to be purine-rich. In both pictures, the positive strand is encoded by its pyrimidine content. In the left picture, all genes of the + strand are shown in red; in the right picture, all genes of the – strand are shown in red.
The exon (red) and intron (green) structure of a given gene can be seen in a GC content representation. Exons tend to have a higher GC content than introns.

3. Repeats which cannot be found by standard repeat search methods
We have shown this by hiding DNA sequence repeats in an artificial sequence with only 50% alignment identity. The new sequence contains the same repeats, but they are visible only in the signal representation.
   1.) Original sequence repeats
   2.) The same repeats hidden in an artificial sequence with only 50% sequence identity

Conclusion:
The GenomeBrowser is a powerful new tool for motif discovery in genomes. In addition to the standard sequence representation, the DNA is also analysed in terms of biochemical and biophysical dinucleotide properties. This makes it possible to identify and visualize a broad range of both known and unknown genome patterns. The new way of seeing the genome can lead to a better understanding of its organisation and function.
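
As a rough illustration of the signal conversion described under "Main window", the following Python sketch maps a DNA sequence to per-position values and smooths them with a window. It is not the GenomeBrowser's implementation. The function names (`dinucleotide_signal`, `reduced_signal`, `smooth`), the use of a plain moving average as the window technique, and the window widths are assumptions made for illustration. The parameter values are the free energy changes from the table above.

```python
from typing import Dict, List

# Free energy change (B-DNA, kcal/mol) per dinucleotide, copied from the poster's table.
FREE_ENERGY: Dict[str, float] = {
    "AA": -1.20, "AC": -1.50, "AG": -1.50, "AT": -0.90,
    "CA": -1.70, "CC": -2.10, "CG": -2.80, "CT": -1.50,
    "GA": -1.50, "GC": -2.30, "GG": -2.10, "GT": -1.50,
    "TA": -0.90, "TC": -1.50, "TG": -1.70, "TT": -1.20,
}


def dinucleotide_signal(seq: str, params: Dict[str, float] = FREE_ENERGY) -> List[float]:
    """Map each overlapping dinucleotide to its parameter value.

    Assumption: the sequence contains only A, C, G and T; any other
    symbol (for example N) raises KeyError in this sketch.
    """
    seq = seq.upper()
    return [params[seq[i:i + 2]] for i in range(len(seq) - 1)]


def reduced_signal(seq: str, members: str) -> List[float]:
    """Reduced coding: 1.0 where the base belongs to `members`, else 0.0.

    members="CT" gives pyrimidine content, "AG" purine content,
    "GC" GC content and "AT" AT content.
    """
    chosen = set(members.upper())
    return [1.0 if base in chosen else 0.0 for base in seq.upper()]


def smooth(values: List[float], window: int) -> List[float]:
    """Moving average over `window` consecutive values (assumed smoothing)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    sequence = "ATGCGCGCTATATAGGCC"  # made-up example sequence
    energy_curve = smooth(dinucleotide_signal(sequence), window=5)
    pyrimidine_curve = smooth(reduced_signal(sequence, "CT"), window=5)
    print(energy_curve)
    print(pyrimidine_curve)
```

The same `smooth` step applies to every coding scheme, so dinucleotide curves and reduced-alphabet curves can be compared using the same window width. In this sketch the complementary strand would be handled by passing its reverse complement as `seq`.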