74 Tools for Comparison of Bacterial Genomes

T. M. Wassenaar1,2, T. T. Binnewies1,3, P. F. Hallin1, D. W. Ussery1,*
1 Center for Biological Sequence Analysis, Technical University of Denmark, Kgs. Lyngby, Denmark
*[email protected]
2 Molecular Microbiology and Genomics Consultants, Zotzenheim, Germany
3 Roche Diagnostics Ltd., Advanced Systems Group, Global Platforms & Support, Rotkreuz, Switzerland

1 Introduction
2 Genomic DNA Sequence Comparisons
3 Visualization of Genomic Data: The Genome Atlas
4 Whole Genome Alignment Methods
5 Comparing the Coding Fraction of Genomes
6 Codon Usage Comparisons
7 Protein Sequence Comparisons
8 Gene Synteny and Genome Islands
9 Minimal Information About a Genome Sequence
10 Research Needs

K. N. Timmis (ed.), Handbook of Hydrocarbon and Lipid Microbiology, DOI 10.1007/978-3-540-77587-4_337, © Springer-Verlag Berlin Heidelberg, 2010

Abstract: Of the plethora of bioinformatic tools available, some useful tools that allow complete genome sequences to be compared are described here. Comparisons of genome length, base composition, gene density, numbers of tRNA and rRNA genes, and codon usage can provide useful biological insights. Examples are provided of a Genome Atlas plot, which summarizes many features of a single genome, and a BLAST Atlas, in which multiple genomes can be combined. A table of web services for useful tools is provided.

1 Introduction

At present, about 900 bacterial and archaeal genomes have been fully sequenced and made publicly available (footnote 1), and their number more than doubled last year. Approximately 40% of the sequenced genomes were obtained from environmental (terrestrial and marine) organisms. In addition, metagenomic projects are now producing a vast amount of sequence data. Here we provide a brief overview of methods to compare sequenced bacterial genomes.

Of the many methods available to compare bacterial genomes (Binnewies et al., 2006), Table 1 lists several that we find useful. It is beyond the scope of this review to provide a detailed analysis of these methods, and the list is far from complete. The tools discussed here provide some interesting information on fundamental biological features and can be used to compare a few or large numbers of genomes. The tools are easy to use and produce results that are easy to interpret and can be graphically represented. Graphical output is an important quality criterion for any sequence analysis tool that deals with genomes, because the input data are so large and complex.

2 Genomic DNA Sequence Comparisons

A genome can be more than one DNA molecule. Approximately 10% of the bacterial genomes sequenced so far have more than one chromosome.
By definition, a genome includes all chromosomes (and plasmids) that constitute an organism’s total DNA. Chromosomes are essential, single-copy, independently replicating DNA molecules present in each member of the species. Some species contain plasmids; these are frequently strain-specific and sometimes (incorrectly, in our opinion) omitted from a genome sequence. At the time of writing, the largest bacterial genome sequenced is that of Solibacter usitatus (strain Ellin 6076), a soil bacterium belonging to the Acidobacteria. It consists of a single chromosome of 9.97 mega base pairs (Mbp). The smallest bacterial genome known is that of Carsonella ruddii (PV), an endosymbiont of a plant sap-feeding insect, with a mere 159,662 bp. Genome size is a rough indicator of biological adaptive potential, so it is no surprise that soil bacteria have bigger genomes, as they have to adapt to environmental variation, whereas the protective niche of an endosymbiont allows for a small genome.

The genome size of an organism is easy to calculate and tabulate. Figure 1a gives a graphical representation of genome size variation within bacterial phyla. A "box and whiskers" plot as shown in Fig. 1 visualizes the distribution of a property that can be expressed as a numerical value, such as length, %GC, number of genes, etc. Such plots show the spread of the data and are made as follows: the values are sorted and divided into two equal parts, separated by the median, which is marked as a bar in the middle of the distribution. A box is drawn to cover the range containing the middle 50% of the data (excluding the lowest 25% and the highest 25%). The "whiskers" are the hatched lines connecting the lowest (left) and highest (right) values, with the exception of outlier points, which are shown as individual dots. Outliers are defined as data points that lie more than 1.5 times the length of the box beyond the edge of the box.

Footnote 1: Completed genome statistics obtained from the NCBI Genome Project web pages: http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi

Table 1 Methods for comparison of bacterial genomes

Method                          URL                                                                       References
Length, %GC                     http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi                            Wheeler et al. (2007)
Chromosome alignment (ACT)      http://www.sanger.ac.uk/Software/ACT/                                     Carver et al. (2005)
Chromosome alignment (MUMmer)   http://mummer.sourceforge.net, http://www.webact.org/WebACT/home          Kurtz et al. (2004)
Repeats – various               http://www.cbs.dtu.dk/services/GenomeAtlas                                Ussery et al. (2004)
Repeats – tetranucleotides      http://www.megx.net/tetra                                                 Teeling et al. (2004)
Repeats – short, tandem         http://minisatellites.u-psud.fr/GPMS/default.php                          Denoeud and Vergnaud (2004)
Repeats – VNTRs                 http://vntr.csie.ntu.edu.tw                                               Chang et al. (2007)
Replication origins             http://www.cbs.dtu.dk/services/GenomeAtlas                                Worning et al. (2006)
Noncoding RNAs                  http://rfam.sanger.ac.uk                                                  Griffiths-Jones et al. (2005)
rRNAs                           http://www.cbs.dtu.dk/services/RNAmmer                                    Lagesen et al. (2007)
Genome Atlas                    http://www.cbs.dtu.dk/services/GenomeAtlas                                Hallin and Ussery (2004)
BLAST Atlas (zoomable)          http://www.cbs.dtu.dk/services/gwBrowser                                  UPDATE! Hallin and Ussery (2004)
"Genome Properties"             http://cmr.tigr.org/tigr-scripts/CMR/shared/GenomePropertiesHomePage.cgi  Selengut et al. (2007)

The base composition of genomes, i.e., their %GC content (or %AT; the two together make 100%), can also be compared, as shown in Fig. 1b. The GC content of a genome can range from 17% in C. ruddii to 75% in Anaeromyxobacter dehalogenans. The smallest genome is also the most AT rich, and many of the larger genomes are quite GC rich. It is not clear whether a biological force lies behind this correlation, although it has been observed that the ecological niche an organism occupies roughly correlates with both genome size and GC content (Foerstner et al., 2005, Musto et al., 2006).
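The box-and-whiskers construction described above can be sketched in a few lines of Python. This is only an illustration, not the code used to draw Fig. 1: the function name box_whisker_summary is hypothetical, the quartiles are taken as the medians of the lower and upper halves of the sorted data (one of several common conventions), and the genome sizes in the example are invented.

```python
from statistics import median


def box_whisker_summary(values):
    """Return median, box edges, whisker ends and outliers (needs >= 2 values)."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    lower_half = data[:mid]              # values below the median position
    upper_half = data[mid + (n % 2):]    # values above the median position
    q1, q2, q3 = median(lower_half), median(data), median(upper_half)
    box_length = q3 - q1                 # the middle 50% of the data
    low_fence = q1 - 1.5 * box_length
    high_fence = q3 + 1.5 * box_length
    # Whiskers reach the most extreme values that are not outliers.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": q2,
        "box": (q1, q3),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


# Invented genome sizes in Mbp, for illustration only.
sizes_mbp = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
print(box_whisker_summary(sizes_mbp))
```

With a different quartile convention the box edges, and therefore the set of points flagged as outliers, can shift slightly for small samples.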
Figure 1 (a) Box and whisker plot of genome length distribution for 779 bacterial chromosomes, grouped by phyla. The phylum and the number of chromosomes included are indicated at the left. Each phylum is colored according to our GenomeAtlas website. (b) The distribution of average chromosomal AT content for the same set of bacterial genomes.
[Figure not reproduced. Panel titles: "Size distribution of prokaryotic genomes (N = 779)", x-axis genome size (Mbp), and "AT content distribution of prokaryotic genomes (N = 779)", x-axis AT content (percent). Groups shown: Crenarchaeota (n = 16), Euryarchaeota (n = 35), Nanoarchaeota (n = 1), Acidobacteria (n = 2), Actinobacteria (n = 55), Aquificae (n = 3), Bacteroidetes/Chlorobi (n = 26), Chlamydiae/Verrucomicrobia (n = 13), Chloroflexi (n = 7), Cyanobacteria (n = 33), Deinococcus/Thermus (n = 4), Firmicutes (n = 155), Fusobacteria (n = 1), Planctomycetes (n = 1), Alphaproteobacteria (n = 94), Betaproteobacteria (n = 61), Gammaproteobacteria (n = 191), Deltaproteobacteria (n = 21), Epsilonproteobacteria (n = 22), Spirochaetes (n = 16), Thermotogae (n = 8), other archaea (n = 1), other bacteria (n = 13).]

In addition to the average GC content for a whole genome, local variation within a given genome can be examined, and this reveals two general trends for almost all bacterial genomes. First, on a more global, chromosomal level, a large region flanking the origin of DNA replication tends to be more GC rich, and the region around the replication terminus usually is more AT rich. AT-rich sequences melt more easily than GC-rich sequences, due in part to the extra hydrogen bond present in a GC base pair. Counterintuitively, this would make the origin region the least likely place for replication to start. However, within this "large region" around the origin, which covers approximately 5% of the chromosome, there is a short stretch of more AT-rich base pairs where the replication bubble opens up.
Second, zooming in to the level of genes, the average GC content of intergenic regions is generally lower than that of coding sequences. These regions melt more readily and are more curved and more rigid than the chromosomal average, which enables gene expression (Pedersen et al., 2000, Ussery and Hallin, 2004). This is true for nearly all of the bacterial genomes sequenced, regardless of GC content. In order to calculate relative or local %GC, a window has to be defined (say, 100 base pairs) for which the %GC is calculated. This window is then moved along the genome in single-nucleotide steps, and each %GC score is assigned to the middle position of its window. These scores can then be graphically represented. A web-based tool for this is available at the Genome Atlas website (footnote 2), in which local %GC can be visualized by color codes as discussed below.

3 Visualization of Genomic Data: The Genome Atlas

Genome atlases are circular plots of chromosomes or plasmids (a linear version is available when applicable) on which general properties of the DNA molecule are plotted as colors. Genome atlases are available from our web server (footnote 2) for many of the currently sequenced bacterial genomes. Figure 2 shows a Genome Atlas for the chromosome of Geobacillus kaustophilus strain HTA426 (a thermophilic Firmicute that also contains a plasmid of 4.8 kb). This isolate was obtained from a deep-sea sediment of the Mariana Trench in the Pacific Ocean (Takami et al., 2004a, b). Its genome is 3.5 Mbp long and contains 52.1% GC. G. kaustophilus has been suggested as a possible solution to paraffin deposition problems in oil production (Sood and Lal, 2008).
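Returning briefly to the local %GC calculation described at the end of Section 2, the sliding-window procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the Genome Atlas implementation: the function name local_gc_percent is hypothetical, positions are 0-based, and the sequence is treated as linear, so windows do not wrap around the origin of a circular chromosome.

```python
def local_gc_percent(sequence, window=100):
    """Return (centre_position, %GC) for every full window, moved in 1-nt steps."""
    seq = sequence.upper()
    if window <= 0 or window > len(seq):
        return []
    half = window // 2
    # Count G and C once for the first window, then update incrementally.
    gc = sum(base in "GC" for base in seq[:window])
    profile = [(half, 100.0 * gc / window)]
    for start in range(1, len(seq) - window + 1):
        gc -= seq[start - 1] in "GC"            # base leaving the window
        gc += seq[start + window - 1] in "GC"   # base entering the window
        profile.append((start + half, 100.0 * gc / window))
    return profile


# Toy sequence with a short window, for illustration only.
for position, gc in local_gc_percent("GGGGAAAA", window=4):
    print(position, gc)
```

Updating the count incrementally keeps the cost proportional to the genome length rather than to genome length times window size, which matters for windows of thousands of base pairs.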
A Genome Atlas maps four different aspects of the chromosomal DNA sequence in various lanes in a standard manner: DNA structural features are represented in the three outer lanes, all coding sequences are indicated in the next lane, two kinds of repeats are mapped in the next two lanes, and base composition properties are plotted in the two innermost lanes (Jensen et al., 1999). The scale in the center corresponds to the sequence numbering in GenBank. The DNA structural features of the three outermost circles are based on the physical-chemical properties of the DNA helix. The annotated genes are shown in blue for protein-coding genes oriented clockwise, and in red for genes on the other strand (counterclockwise). The tRNA and rRNA genes have their own colors. The clockwise strand corresponds to the sequence stored in GenBank (genes on the other strand are annotated there as "complement"). To identify global repeats (sequences that are repeated somewhere else on the chromosome) we search for the best match of a 100 bp window against the entire chromosome. Searching on the positive strand yields direct repeats (both sequences run in the same direction), whilst searching on the negative strand yields inverted repeats (the two repeat units run in opposite directions). For most of the general properties summarized in a Genome Atlas (structural properties, repeats, base composition), dedicated atlases are also available in which more features are given (such as local and simple repeats in a Repeat Atlas, or base composition in a Base Atlas). Such specialized atlases are explained in detail in a book that we recently produced (Ussery et al., 2008).

Footnote 2: http://www.cbs.dtu.dk/services/GenomeAtlas/

Figure 2 Genome atlas of the main chromosome of Geobacillus kaustophilus. See text for further explanation.
[Figure not reproduced. Title: G. kaustophilus HTA426 main chromosome, 3,544,776 bp. Lanes, from the outside in: intrinsic curvature, stacking energy, position preference, annotations (CDS +, CDS –, rRNA, tRNA), global direct repeats, global inverted repeats, GC skew, percent AT. Resolution: 1418. Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]

As can be seen in Fig. 2, the genes on this chromosome strongly favor one strand: the positive strand for the first (right) half and the negative strand for the second (left) half of the chromosome. In each half, the favored strand is the leading strand during replication. Replication starts at the origin (the 12 o’clock position here) and proceeds in both directions along the circle, each direction with a leading and a lagging strand, until the bubble reaches the terminus at 6 o’clock and the ends are joined. The positive strand represented by a genome sequence is the leading strand, but only for the first half, up to the terminus. Reading across the terminus along the sequence on the same strand, one enters the lagging strand. Gene preference for the leading strand is a general feature of Firmicutes and of some other bacteria.

In Fig. 2 the two outermost lanes identify some regions with strong structural properties (for instance the region around 2 o’clock, indicated by a black line). The observed strong curvature (blue in the outermost lane) in a region where the DNA would easily melt (red in the second lane) suggests that this region contains genes that are highly expressed. There are a number of global repeats, notably in the first quarter of the chromosome. Note that the ribosomal RNA genes (light blue in the annotation lane) are located here, as indicated by the arrows, and these are picked up as global repeats, as indeed they are repeated genes.
The GC skew lane shows the bias of G’s towards one strand or the other, averaged over a 10,000 bp window. In contrast to many Firmicutes with a strong GC skew, this genome has only a weak GC skew (the right half is light blue and the left half is light pink). The innermost circle colors the local AT content where it deviates by more than three standard deviations from the global average. Note a light red color around the 2 o’clock region: this local deviation in AT content is related to the structural features located here.

The Genome Atlas of the archaeon Methanosarcina acetivorans, shown in Fig. 3, tells a different story. This strictly anaerobic organism produces methane so efficiently that it is held responsible for virtually all biogenic methane. It can also oxidize CO to CO2 (Lessner et al., 2006). Strain C2A (the type strain of the species) was isolated from a marine sediment (Galagan et al., 2002). Its genome is 5.7 Mbp long and contains 42.7% GC. The Genome Atlas shows that its genes are evenly distributed over the two strands, and a GC skew is absent. Instead, the lower quarter of the genome contains many strong structural features. The genome contains only three rRNA gene copies (indicated by arrows), one of which is located on the negative strand (but, as discussed above, this is actually the leading strand, as is preferred for nearly all bacterial rRNA genes). Many other global repeats are visible, notably in the region around 1.2 Mbp, which is strongly curved and easily melted, and is slightly more AT rich than the rest of the genome. Here, the important carbon-monoxide dehydrogenase gene locus is present, as are multiple transposases, which could be an indication of horizontally acquired DNA. The genome is relatively poorly annotated, with many genes given as "predicted protein" only, which is not uncommon for archaeal genomes.
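The GC skew lane discussed above can be approximated with a short calculation. The sketch below assumes the commonly used definition of GC skew, (G − C)/(G + C), computed per window; whether the Genome Atlas uses exactly this formula is not stated in the text. The function name gc_skew is hypothetical, and the sequence is treated as linear.

```python
def gc_skew(sequence, window=10_000, step=10_000):
    """Return (window_start, skew) with skew = (G - C) / (G + C) per window."""
    seq = sequence.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows without any G or C get a skew of zero by convention here.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        skews.append((start, skew))
    return skews


# Toy sequence with a tiny window, for illustration only.
for start, skew in gc_skew("GGGCAAAACCCG", window=4, step=4):
    print(start, skew)
```

In a genome with a strong skew, the sign of these values typically switches near the replication origin and terminus, which is what produces the two differently colored halves in the atlas lane.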
In conclusion, a Genome Atlas combines a number of features in a single figure that summarizes a very long and detailed story about a chromosome or plasmid.

Figure 3 Genome atlas of the main chromosome of the archaeon Methanosarcina acetivorans.
[Figure not reproduced. Title: M. acetivorans C2A, 5,751,492 bp. Lanes, from the outside in: intrinsic curvature, stacking energy, position preference, annotations (CDS +, CDS –, rRNA, tRNA), global direct repeats, global inverted repeats, GC skew, percent AT. Resolution: 2301. Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]

4 Whole Genome Alignment Methods

Another way to compare genomes is based on alignment of nucleotide or amino acid sequences. Sequence alignment is a common tool to identify similarities, with BLAST (Basic Local Alignment Search Tool) the most common (Altschul et al., 1990). However, BLAST is not automatically suitable for large DNA input segments such as complete genomes. A more suitable program to align sequences in the range of megabases is MUMmer, developed at TIGR, of which version 3 is now publicly available (Kurtz et al., 2004). Further, this method has recently been extended to include the average nucleotide identity in the conserved core genes of a set of genomes (Deloger et al., 2009). Moreover, graphical representation of the resulting alignment becomes an issue. Specific tools have been designed to align genome sequences and visualize the resulting alignments.
The Artemis Comparison Tool (ACT) is worth mentioning. Two versions are available: a downloadable version to be used on a local computer (Carver et al., 2005) and a web-based version with pre-computed comparisons between several hundred bacterial genomes (footnote 3). BLAST results of entire bacterial chromosomes against each other have also been used to construct phylogenetic trees (Henz et al., 2005). BLAST comparisons are treated in Section 7 of this chapter.

5 Comparing the Coding Fraction of Genomes

The typical coding density for a bacterial genome is about 90%, ranging from 95% for Pelagibacter ubique (an alphaproteobacterial marine bacterium that ranks among the most numerous bacteria in the world) (Giovannoni et al., 2005) to around 75% for M. acetivorans. Intracellular bacteria can have a coding density as low as 50%. This means that the majority of bacterial DNA codes for genes, which mostly are not spliced, so that introns are absent (with very few exceptions). However, not every open reading frame is a gene, and it appears that many bacterial genomes are over-annotated, predicting 10–15% more genes than are real (Skovgaard et al., 2001). These over-annotated genes are frequently short open reading frames. In addition, genes can be missed in the annotation. A frequent mistake is that genes are annotated on the wrong strand, which can happen if the reading frame is open in either direction. The intergenic regions separating genes regulate transcription, and in intracellular bacteria they frequently contain pseudogenes or repeats. Genes not coding for proteins include tRNA and rRNA genes, and some parts of intergenic regions can be transcribed into stable RNAs that do not code for proteins. E. coli contains several hundred small non-coding RNA (ncRNA) genes (Chen et al., 2002) that can act as regulators (Gottesman, 2005). Their role in environmental bacteria is virtually unexplored.
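As a rough illustration of how a coding density such as those quoted above can be derived from an annotation, the sketch below merges overlapping gene intervals and divides the covered length by the genome length. It is only a sketch under stated assumptions: the function name coding_density is hypothetical, coordinates are taken to be 1-based and inclusive, genes on both strands are pooled, and genes that span the origin of a circular chromosome are not handled.

```python
def coding_density(genome_length, genes):
    """Percentage of the genome covered by at least one annotated gene.

    genes: iterable of (start, end) pairs, 1-based inclusive, start <= end.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(genes):
        if current_end is None or start > current_end:
            # A new, non-overlapping block begins; close the previous one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping genes are counted only once.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return 100.0 * covered / genome_length


# Invented coordinates, for illustration only.
print(coding_density(1000, [(1, 300), (250, 500), (601, 900)]))
```

Merging intervals first avoids counting overlapping genes twice, which would otherwise inflate the density in compact genomes where neighbouring genes often overlap by a few bases.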
Although tRNA and rRNA genes are essential to life, they are sometimes missed in the annotation of a genome, a rather embarrassing omission, or occasionally annotated on the wrong strand (Lagesen et al., 2007). The number and location of rRNA operons in a genome can say something about an organism. It appears that organisms with short doubling times have larger numbers of rRNA and tRNA genes. Comparing Figs. 2 and 3, it is likely that G. kaustophilus, with 9 rRNA copies, nearly all located close to the origin of replication (which boosts expression during replication as their copy number increases), can divide more quickly than M. acetivorans, which has only three copies. Some really fast-growing bacteria can have 14 or more rRNA copies, as can be seen in our list of genomes (footnote 4).

Footnote 3: http://www.webact.org/WebACT/home
Footnote 4: www.cbs.dtu.dk/services/GenomeAtlas/

6 Codon Usage Comparisons

Once the genes of a given genome have been defined, their codon usage can be analyzed. Since the genetic code is redundant, with up to 6 codons per amino acid, synonymous codons are used at different frequencies. Much of the redundancy in the genetic code is due to third-base variation. Figure 4 displays the codon usage for three prokaryotic genomes: Methanosphaera stadtmanae (27.6% GC), an archaeal methanogen that uses methanol and hydrogen to produce methane; Desulfitobacterium hafniense (47.4% GC), a Firmicute that efficiently dehalogenates tetrachloroethene and polychloroethanes; and Anaeromyxobacter dehalogenans (75% GC). This last species, the first myxobacterium to be grown as a pure culture, can use ortho-substituted mono- and dichlorinated phenols. The frequency of each possible codon is plotted in a wheel plot in the upper part of the figure, arranged such that the third base is conserved in each quarter. The bias in codon usage towards the third position can also be seen in the sequence logo plots in the lower part of Fig. 4. From both graphics it is evident that genomic GC content strongly affects codon use (or the other way round). Based on a genome’s bias in codon usage, it is possible to predict its likely environmental niche (Willenbrock et al., 2006). Moreover, it is known that amino acid usage (not shown here) depends on environment, based on analysis of metagenomic samples (Musto et al., 2006, Foerstner et al., 2005).

7 Protein Sequence Comparisons

One can compare each individual gene in a given genome by BLAST against a set of genomes. This produces a huge amount of data that can be graphically represented in a BLAST Matrix (Binnewies et al., 2005, Ussery et al., 2009). A BLAST Matrix is not symmetrical, as the outcome is determined by which genome is used as the query sequence. The diagonal of a BLAST Matrix represents a BLAST of a genome against itself. The self-match (the gene finding itself) is discarded, so the reported scores reflect internal homologues present in a given genome. Most of these have been derived from gene duplication and are thus paralogs.

When more information needs to be visualized, a BLAST Atlas is helpful. Such an atlas uses one genome as a reference against which the gene conservation of other genomes is plotted (Hallin and Ussery, 2004, Skovgaard et al., 2002). In this case gene location refers only to the location in the reference genome, which of course can be varied in multiple BLAST Atlases. A BLAST Atlas is also a suitable platform to visualize metagenomic data. So far, we have not dealt with metagenomics extensively, mainly because this approach very rarely results in completely assembled microbial genomes. But for a BLAST Atlas that is not a problem, as one can combine all the metagenomic DNA in one lane, thereby ignoring from which organism the detected genes originated. All obtained BLAST hits are plotted around a reference genome. An example of a BLAST Atlas is given in Fig.
5, centered around Pelotomaculum thermopropionicum, a thermophilic, syntrophic Firmicute that can utilize 1-butanol, 1-propanol, 1-pentanol or 1,3-propanediol as a carbon source. Note that despite the high number of lanes, conserved and variable genes can still be easily inspected visually. From compacting a single genome into a Genome Atlas, we have now moved several levels up and compacted multiple genomes into a single atlas. In Fig. 5, the P. thermopropionicum genome is compared to many species of Clostridia, as well as other bacteria. Unfortunately, very few BLAST hits were found with the metagenomic samples, so there is very little color in those three lanes. Compared to well-characterized genomes (like E. coli), relatively few hits are found in other genomes, indicated by the lack of strong colour in most of the lanes in Figure 5. This is probably a reflection of the huge diversity in DNA content in such samples, reducing the chance of a BLAST hit. It is a sobering thought that there is still so little we know, and so much that remains to be discovered in the microbial world.

Figure 4 Frequency wheel plots of codon usage (top) and sequence logo plots (bottom) of Anaeromyxobacter dehalogenans (left), Desulfitobacterium hafniense (middle) and Methanosphaera stadtmanae (right).
[Figure not reproduced. Panel titles: Anaeromyxobacter dehalogenans 2CP-C (25% AT), Desulfitobacterium hafniense Y51 (53% AT), Methanosphaera stadtmanae DSM 3091 (72% AT). The logo plots show the 1st, 2nd and 3rd codon positions.]

Figure 5 BLAST Atlas with Pelotomaculum thermopropionicum as the reference genome. Around this the BLAST hits of 31 genomes of other bacteria are added as listed to the right, from the outermost circle (top in the legend) to the innermost circle of the bacterial genomes (bottom of legend). The outermost lane shows the hits of P. thermopropionicum in the UniProt database (which does not contain all annotated genes, as it requires biological evidence of a gene product). The next three lanes are metagenomic DNA samples from. . . [Dave specify] and next follow 30 genomes of other bacteria as listed to the right.
[Figure not reproduced. Title: P. thermopropionicum SI, 3,025,375 bp. Legend: 2 Alkaliphilus species, Bacillus fragilis, 17 Clostridium species, 4 Desulfitobacterium species, E. coli K-12, 6 other species belonging to Clostridia.]
Many methods are being developed that use sets of conserved genes and gene families in related organisms to cluster organisms into groups; these groups can represent known taxonomic relationships. For example, certain genes might be common to a set of organisms growing in a particular ecological niche. Some examples of such regions along the chromosome can be seen in the BLAST Atlas plots in which genomes of related organisms of different species are compared.

8 Gene Synteny and Genome Islands

A comparison of genes present, absent or diverged between genomes usually ignores gene synteny: the position at which such genes are found. The term was coined for eukaryotes to describe genes that were located on the same chromosome; in bacterial genomes the local neighboring genes, their order and their direction are usually compared. The closer two organisms are, the more likely gene synteny is to be conserved (between genomes of the same genus, species, subspecies or phylogenetic clade, in increasing order). Gene synteny is destroyed by inversions (changing the direction of one or several genes), translocations (changing the position of genes), and insertion and deletion events. All of these can result from mistakes during replication, or from the activity of self-replicating mobile elements such as bacteriophages, integrons and transposons. The events that affect gene synteny, combined with point mutations accumulating during replication, are the two major forces that increase genetic diversity; selection of those organisms that are fittest to survive particular conditions decreases diversity. Evolution further depends on changes in such selective conditions.
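Conservation of local gene order and direction, as described above, can be quantified in a very simple way by counting how many neighbouring gene pairs of one genome are also neighbours in another. The sketch below is a toy measure introduced here for illustration, not a method from the cited literature. It assumes that orthologous genes already carry the same name in both genomes, that each genome is given as an ordered list of (gene, strand) tuples, and that the chromosome is treated as linear; the function names are hypothetical.

```python
FLIP = {"+": "-", "-": "+"}


def adjacent_pairs(genes):
    """Set of neighbouring gene pairs, independent of reading direction.

    A pair read on the opposite strand (order reversed, both strands flipped)
    is mapped to the same key, so an inverted segment keeps its internal pairs.
    """
    pairs = set()
    for (a, strand_a), (b, strand_b) in zip(genes, genes[1:]):
        forward = ((a, strand_a), (b, strand_b))
        reverse = ((b, FLIP[strand_b]), (a, FLIP[strand_a]))
        pairs.add(min(forward, reverse))
    return pairs


def synteny_fraction(genome_a, genome_b):
    """Fraction of neighbouring pairs in genome_a that are conserved in genome_b."""
    pairs_a = adjacent_pairs(genome_a)
    pairs_b = adjacent_pairs(genome_b)
    return len(pairs_a & pairs_b) / len(pairs_a) if pairs_a else 0.0


# Invented gene orders: in genome_b the segment g3-g4 is inverted
# and an extra gene g5 has been inserted at the end.
genome_a = [("g1", "+"), ("g2", "+"), ("g3", "-"), ("g4", "+")]
genome_b = [("g1", "+"), ("g2", "+"), ("g4", "-"), ("g3", "+"), ("g5", "+")]
print(synteny_fraction(genome_a, genome_b))
```

In this example the inversion breaks only the junction between g2 and the inverted segment, so two of the three neighbouring pairs of genome_a remain conserved.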
With a slow but steady re-shuffling of genes by evolutionary processes, a pattern emerges of a genetic "backbone" of genes whose location is relatively conserved between genomes of reasonable genetic distance, and groups of "cluttered" genes that are far more variable, in what have been termed "genome islands." Genome islands usually contain genes that are all involved in a particular phenotypic process. Examples are pathogenicity islands, symbiosis islands, metabolic islands and magnetosome islands. Specific examples include the sulfur metabolism islands discovered in metagenomic sequences from marine sediments (Mussmann et al., 2005) and the magnetosome island, which contains all genes needed to produce the intracellular organelle that enables magnetotactic bacteria to orient themselves along magnetic field lines (Richter et al., 2007). The evolutionary advantage of genome islands is obvious. They can be regarded as genetic "building blocks"; when transferred from one organism to the next, they can confer a complete phenotypic trait on the acceptor, enabling, for instance, adaptation to a novel ecological niche.

9 Minimal Information About a Genome Sequence

Genome sequences are stored in public databases such as GenBank under their biological names (preceded by "Candidatus" for undecided taxonomic position), or under a code of numbers and letters for unculturable organisms that have not been classified. Unfortunately, other relevant information is often lacking. It has become apparent that biological and environmental data are important, and a standard for "Minimal Information about a Genome Sequence" has recently been proposed (Field et al., 2008). The Genomic Standards Consortium (GSC, http://gensc.org) promotes the standardization of genome sequencing descriptions and their exchange and integration in the scientific community.
Overall, it is important that genome sequence information is released into the public domain in a timely manner so that global scientific progress can be maintained.

10 Research Needs

Multiple genome sequences are available for very few environmental species. From intra-species genomic comparisons of pathogenic bacteria we know that these provide an extra layer of information, as genetic diversity within a bacterial species can be enormous. When multiple genomes are available for a species we can define its core genome (all genes that are present in all genomes of that species), its pan-genome (all genes that have been found in that species) and its dispensable genes, which are responsible for the variation between isolates. Multiple genomes per species, more metagenomic data and more archaeal genome sequences are our most urgent data needs. The research tools for analysis of the genomes are available. Generate the sequences and the feast can begin.

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410.
Binnewies TT, Hallin PF, Staerfeldt HH, Ussery DW (2005) Genome update: proteome comparisons. Microbiology 151: 1–4.
Binnewies TT, et al. (2006) Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct Integr Genomics 6: 165–185.
Carver TJ, Rutherford KM, Berriman M, Rajandream MA, Barrell BG, Parkhill J (2005) ACT: the Artemis Comparison Tool. Bioinformatics 21: 3422–3423.
Chang CH, Chang YC, Underwood A, Chiou CS, Kao CY (2007) VNTRDB: a bacterial variable number tandem repeat locus database. Nucleic Acids Res 35: D416–D421.
Chen S, Lesnik EA, Hall TA, Sampath R, Griffey RH, Ecker DJ, Blyn LB (2002) A bioinformatics based approach to discover small RNA genes in the Escherichia coli genome. Biosystems 65: 157–177.
Deloger M, El Karoui M, Petit MA (2009) A genomic distance based on MUM indicates discontinuity between most bacterial species and genera. J Bacteriol 191: 91–99.
Denoeud F, Vergnaud G (2004) Identification of polymorphic tandem repeats by direct comparison of genome sequence from different bacterial strains: a web-based resource. BMC Bioinformatics 5: 4.
Field D, et al. (2008) The minimum information about a genome sequence (MIGS) specification. Nature Biotechnol 26: 541–547.
Foerstner KU, von Mering C, Hooper SD, Bork P (2005) Environments shape the nucleotide composition of genomes. EMBO Rep 6: 1208–1213.
Galagan JE, et al. (2002) The genome of M. acetivorans reveals extensive metabolic and physiological diversity. Genome Res 12: 532–542.
Giovannoni SJ, et al. (2005) Genome streamlining in a cosmopolitan oceanic bacterium. Science 309: 1242–1245.
Gottesman S (2005) Micros for microbes: non-coding regulatory RNAs in bacteria. Trends Genet 21: 399–404.
Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A (2005) Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33: D121–D124.
Hallin PF, Binnewies TT, Ussery DW (2008) The genome BLAST atlas - a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst 4: 363–371.
Hallin PF, Ussery DW (2004) CBS Genome Atlas Database: a dynamic storage for bioinformatic results and sequence data. Bioinformatics 20: 3682–3686.
Henz SR, Huson DH, Auch AF, Nieselt-Struwe K, Schuster SC (2005) Whole-genome prokaryotic phylogeny. Bioinformatics 21: 2329–2335.
Jensen LJ, Friis C, Ussery DW (1999) Three views of microbial genomes. Res Microbiol 150: 773–777.
Kurtz S, Philippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL (2004) Versatile and open software for comparing large genomes. Genome Biol 5: R12.
Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35: 3100–3108.
Lessner DJ, et al. (2006) An unconventional pathway for reduction of CO2 to methane in CO-grown Methanosarcina acetivorans revealed by proteomics. Proc Natl Acad Sci USA 103: 17921–17926.
Mussmann M, Richter M, Lombardot T, Meyerdierks A, Kuever J, Kube M, Glöckner FO, Amann R (2005) Clustered genes related to sulfate respiration in uncultured prokaryotes support the theory of their concomitant horizontal transfer. J Bacteriol 187: 7126–7137.
Musto H, Naya H, Zavala A, Romero H, Alvarez-Valin F, Bernardi G (2006) Genomic GC level, optimal growth temperature, and genome size in prokaryotes. Biochem Biophys Res Commun 347: 1–3.
Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW (2000) A DNA structural atlas for Escherichia coli. J Mol Biol 299: 907–930.
Richter M, Kube M, Bazylinski DA, Lombardot T, Glöckner FO, Reinhardt R, Schüler D (2007) Comparative genome analysis of four magnetotactic bacteria reveals a complex set of group-specific genes implicated in magnetosome biomineralization and function. J Bacteriol 189: 4899–4910.
Selengut JD, et al. (2007) TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res 35: D260–D264.
Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A (2001) On the total number of genes and their length distribution in complete microbial genomes. Trends Genet 17: 425–428.
Skovgaard M, Jensen LJ, Friis C, Stærfeldt HH, Worning P, Brunak S, Ussery D (2002) The atlas visualisation of genome-wide information. Meth Microbiol 33: 49–63.
Sood N, Lal B (2008) Isolation and characterization of a potential paraffin-wax degrading thermophilic bacterial strain Geobacillus kaustophilus TERI NSM for application in oil wells with paraffin deposition problems. Chemosphere 70: 1445–1451.
Takami H, et al. (2004a) Genomic characterization of thermophilic Geobacillus species isolated from the deepest sea mud of the Mariana Trench. Extremophiles 8: 351–356.
Takami H, et al. (2004b) Thermoadaptation trait revealed by the genome sequence of thermophilic Geobacillus kaustophilus. Nucleic Acids Res 32: 6292–6303.
Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO (2004) TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 5: 163.
Ussery DW, Hallin PF (2004) Genome update: AT content in sequenced prokaryotic genomes. Microbiology 150: 749–752.
Ussery DW, Borini S, Wassenaar TM (2009) Computing for Comparative Microbial Genomics: Bioinformatics for Microbiologists (Computational series). Springer, London.
Wheeler DL, et al. (2007) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 35: D5–D12.
Willenbrock H, Friis C, Friis AS, Ussery DW (2006) An environmental signature for 323 microbial genomes based on codon adaptation indices. Genome Biol 7: R114.
Worning P, Jensen LJ, Hallin PF, Staerfeldt HH, Ussery DW (2006) Origin of replication in circular prokaryotic chromosomes. Environ Microbiol 8: 353–361.