74 Tools for Comparison of Bacterial Genomes

T. M. Wassenaar (1,2), T. T. Binnewies (1,3), P. F. Hallin (1), D. W. Ussery (1,*)

1 Center for Biological Sequence Analysis, Technical University of Denmark, Kgs. Lyngby, Denmark
2 Molecular Microbiology and Genomics Consultants, Zotzenheim, Germany
3 Roche Diagnostics Ltd., Advanced Systems Group, Global Platforms & Support, Rotkreuz, Switzerland
* [email protected]

1 Introduction
2 Genomic DNA Sequence Comparisons
3 Visualization of Genomic Data: The Genome Atlas
4 Whole Genome Alignment Methods
5 Comparing the Coding Fraction of Genomes
6 Codon Usage Comparisons
7 Protein Sequence Comparisons
8 Gene Synteny and Genome Islands
9 Minimal Information About a Genome Sequence
10 Research Needs

K. N. Timmis (ed.), Handbook of Hydrocarbon and Lipid Microbiology, DOI 10.1007/978-3-540-77587-4_337,
© Springer-Verlag Berlin Heidelberg, 2010
Abstract: Of the plethora of bioinformatical tools available, some useful tools that allow
complete genome sequences to be compared are described here. Comparisons of genome
length, base composition, gene density, numbers of tRNA and rRNA genes, and codon usage
can provide useful biological insights. Examples are provided of a Genome Atlas plot, to
summarize many features of a single genome, and a BLAST Atlas, in which multiple genomes
can be combined. A table of web-services for useful tools is provided.
1 Introduction
At present, about 900 bacterial and archaeal genomes have been fully sequenced and made
publicly available (footnote 1), and their number more than doubled last year. Approximately
40% of the sequenced genomes come from environmental (terrestrial and marine)
organisms. In addition, metagenomic projects are now producing vast amounts of sequence.
Here we provide a brief overview of methods to compare sequenced bacterial genomes. Of the
many methods available to compare bacterial genomes (Binnewies et al., 2006), Table 1
lists several that we find useful. It is beyond the scope of this review to provide a detailed
analysis of these methods, and the list is far from complete. The tools discussed here provide
some interesting information on fundamental biological features and can be used to compare
a few or large numbers of genomes. The tools are easy to use and produce results that are easy
to interpret and can be graphically represented. The latter is an important quality determinant
of any sequence analysis tool when dealing with genomes, as the complexity of input data is
so large.
2 Genomic DNA Sequence Comparisons
A genome can be more than one DNA molecule. Approximately 10% of the bacterial genomes
sequenced so far have more than one chromosome. By definition a genome includes all
chromosomes (and plasmids) that constitute an organism’s total DNA. Chromosomes are
essential, single-copy, independently replicating DNA molecules present in each member of
the species. Some species contain plasmids; these are frequently strain-specific and sometimes
(incorrectly, in our opinion) omitted from a genome sequence.
At the time of writing, the largest bacterial genome sequenced is that of Solibacter usitatus
(strain Ellin 6076), a soil bacterium belonging to the Acidobacteria. It consists of a single
chromosome of 9.97 megabase pairs (Mbp). The smallest bacterial genome known is
that of Carsonella ruddii (PV), an endosymbiont of a plant sap-feeding insect, with a mere
159,662 bp. Genome size is a rough indicator of biological adaptive potential, so it is no
surprise that soil bacteria have bigger genomes, as they have to adapt to environmental
variation, whereas the protective niche of an endosymbiont allows for a small genome.

Footnote 1: Completed genome statistics obtained from the NCBI Genome Project web pages: http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi

Table 1. Methods for comparison of bacterial genomes
Method | URL | References
Length, %GC | http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi | Wheeler et al. (2007)
Chromosome alignment (ACT) | http://www.sanger.ac.uk/Software/ACT/ | Carver et al. (2005)
Chromosome alignment (MUMMER) | http://mummer.sourceforge.net, http://www.webact.org/WebACT/home | Kurtz et al. (2004)
Repeats – various | http://www.cbs.dtu.dk/services/GenomeAtlas | Ussery et al. (2004)
Repeats – tetranucleotides | http://www.megx.net/tetra | Teeling et al. (2004)
Repeats – short, tandem | http://minisatellites.u-psud.fr/GPMS/default.php | Denoeud and Vergnaud (2004)
Repeats – VNTRs | http://vntr.csie.ntu.edu.tw | Chang et al. (2007)
Replication Origins | http://www.cbs.dtu.dk/services/GenomeAtlas | Worning et al. (2006)
Noncoding RNAs | http://rfam.sanger.ac.uk | Griffiths-Jones et al. (2005)
rRNAs | http://www.cbs.dtu.dk/services/RNAmmer | Lagesen et al. (2007)
Genome Atlas | http://www.cbs.dtu.dk/services/GenomeAtlas | Hallin and Ussery (2004)
BLAST Atlas (zoomable) | http://www.cbs.dtu.dk/services/gwBrowser | Hallin and Ussery (2004)
“Genome Properties” | http://cmr.tigr.org/tigr-scripts/CMR/shared/GenomePropertiesHomePage.cgi | Selengut et al. (2007)

The genome size of an organism is easy to calculate and tabulate. Figure 1a gives
a graphical representation of genome size variation within bacterial phyla. A “box and
whiskers” plot as shown in Fig. 1 visualizes the distribution of a property that can be
expressed as a numerical value, such as length, %GC, number of genes, etc. Such plots show
the spread of the data and are made as follows: the values are sorted and divided into two equal
parts, separated by the median, which is marked as a bar in the middle of the distribution. A
box is drawn to cover the range that holds the middle 50% of the data (excluding the first 25%
and the last 25% of the data). The “whiskers” are the hatched lines connecting the box to the
lowest (left) and highest (right) values, with the exception of outlier points, which are shown
as individual dots. Outliers are defined as data that lie more than 1.5 times the length of the
box beyond either end of the box.
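
The construction of a box and whiskers plot described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce Fig. 1: the function name `box_plot_summary` and the genome sizes are hypothetical, and the quartiles are taken with the "inclusive" method of the standard library, which is only one of several common conventions.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return median, box edges, whisker ends and outliers for a list of numbers."""
    data = sorted(values)
    # Quartiles: 25%, 50% (median) and 75% cut points of the sorted data.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    box_length = q3 - q1
    low_fence = q1 - 1.5 * box_length
    high_fence = q3 + 1.5 * box_length
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median,
        "box": (q1, q3),
        "whiskers": (inside[0], inside[-1]),  # lowest and highest non-outlier values
        "outliers": outliers,
    }

# Hypothetical genome sizes in Mbp (illustrative values, not data from Fig. 1).
sizes_mbp = [1.8, 2.1, 2.4, 2.9, 3.3, 3.5, 4.1, 4.6, 9.9]
summary = box_plot_summary(sizes_mbp)
```

With these illustrative values the single very large genome falls outside the upper fence and would be drawn as an individual dot, while the whiskers stop at the most extreme values that are not outliers.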
The base composition of genomes, i.e., their %GC content (or %AT, which together make
100%), can also be compared, as shown in Fig. 1b. The GC content of a genome can range
from 17% in C. ruddii to 75% in Anaeromyxobacter dehalogenans. The smallest genome is
also the most AT rich, and many of the larger genomes are quite GC rich. It is not clear if there
is a biological force in play behind this correlation, although it has been observed that the
ecological niche an organism occupies roughly correlates with both genome size and GC content
(Foerstner et al., 2005; Musto et al., 2006).

Figure 1. (a) Box and whisker plot of genome length distribution for 779 bacterial chromosomes, grouped by phyla. The phylum and the number of chromosomes included are indicated at the left. Each phylum is colored according to our GenomeAtlas website. (b) The distribution of average chromosomal AT content for the same set of bacterial genomes.
[Panel titles: “Size distribution of prokaryotic genomes (N = 779)”, x-axis: Genome size (Mbp); “AT content distribution of prokaryotic genomes (N = 779)”, x-axis: AT content (percent). Phyla shown: Crenarchaeota (n = 16), Euryarchaeota (n = 35), Nanoarchaeota (n = 1), Acidobacteria (n = 2), Actinobacteria (n = 55), Aquificae (n = 3), Bacteroidetes/Chlorobi (n = 26), Chlamydiae/Verrucomicrobia (n = 13), Chloroflexi (n = 7), Cyanobacteria (n = 33), Deinococcus/Thermus (n = 4), Firmicutes (n = 155), Fusobacteria (n = 1), Planctomycetes (n = 1), Alphaproteobacteria (n = 94), Betaproteobacteria (n = 61), Gammaproteobacteria (n = 191), Deltaproteobacteria (n = 21), Epsilonproteobacteria (n = 22), Spirochaetes (n = 16), Thermotogae (n = 8), Other archaea (n = 1), Other bacteria (n = 13).]

In addition to the average GC content for a whole genome, local variation within a given
genome can be examined, and this reveals two general trends for almost all bacterial genomes.
First, on a more global, chromosomal level a large region flanking the origin of DNA
replication tends to be more GC rich, and the region around the replication terminus usually
is more AT rich. AT-rich sequences melt more easily than GC-rich sequences, due in part to the
extra hydrogen bond present in a GC base pair. Counterintuitively, this would make the GC-rich
region around the origin the least likely place for replication to start. However, within this
“large region” around the origin, which spans approximately 5% of the chromosome, there is a
short stretch of more AT-rich base pairs, where the replication bubble opens up. Second,
zooming in on genes, the average GC content of intergenic regions is generally lower than that
of coding sequences. These regions melt more readily and are more curved and more rigid than
the chromosomal average, which facilitates gene expression (Pedersen et al., 2000; Ussery and
Hallin, 2004). This is true for nearly all of the bacterial genomes sequenced, regardless of GC
content. To calculate relative or local %GC, a window has to be defined (say, 100 base pairs)
for which the %GC is calculated. This window is then moved along the genome in
single-nucleotide steps, and the %GC is assigned to the middle of each window. These scores can
then be graphically represented. A web-based tool for this is available at the Genome Atlas
website (footnote 2), in which local %GC can be visualized by color codes as discussed below.
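
The sliding-window calculation just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the Genome Atlas implementation: the function name `local_gc_percent` is hypothetical, the sequence is treated as linear (a circular chromosome would also need windows that wrap around the end), and positions are 0-based.

```python
def local_gc_percent(sequence, window=100):
    """Return (centre_position, %GC) for every window, moving one nucleotide at a time."""
    seq = sequence.upper()
    if window > len(seq):
        return []
    gc_count = sum(1 for base in seq[:window] if base in "GC")
    half = window // 2
    results = [(half, 100.0 * gc_count / window)]
    for start in range(1, len(seq) - window + 1):
        # Update the count incrementally: drop the base that leaves the window
        # and add the base that enters it.
        if seq[start - 1] in "GC":
            gc_count -= 1
        if seq[start + window - 1] in "GC":
            gc_count += 1
        results.append((start + half, 100.0 * gc_count / window))
    return results

# Toy example with a 4 bp window on an 8 bp sequence.
profile = local_gc_percent("GGGGAAAA", window=4)
```

Updating the count as the window slides, rather than recounting every window, keeps the cost proportional to genome length, which matters for sequences of several megabases.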
3 Visualization of Genomic Data: The Genome Atlas
Genome atlases are circular plots of chromosomes or plasmids (a linear version is available
when applicable) on which general properties of the DNA molecule are plotted as colors.
Genome atlases are available from our web server (footnote 2) for many of the currently sequenced
bacterial genomes. Figure 2 shows a Genome Atlas for the chromosome of Geobacillus
kaustophilus strain HTA426 (a thermophilic Firmicute that also contains a plasmid of 4.8 kb).
This isolate was obtained from a deep sea sediment of the Mariana Trench in the Pacific Ocean
(Takami et al., 2004a, b). Its genome is 3.5 Mbp long and contains 52.1% GC. G. kaustophilus
has been suggested to provide a possible solution for paraffin deposition problems with oil
production (Sood and Lal, 2008). A Genome Atlas maps four different aspects of the
chromosomal DNA sequence in various lanes in a standard manner: DNA structural features
are represented in the three outer lanes, all coding sequences are indicated in the next lane, two
kinds of repeats are mapped in the next two lanes, and base composition properties are plotted
in the two innermost lanes (Jensen et al., 1999). The scale in the center corresponds with the
sequence numbering in GenBank. The DNA structural features of the three outermost circles
are based on the physicochemical properties of the DNA helix. The annotated genes are given
in blue for protein-coding genes oriented clockwise, and red for genes on the other strand
(counterclockwise). The tRNA and rRNA genes have their own colors. The clockwise strand
corresponds with the sequence stored in GenBank (genes on the other strand are annotated
there as “complement”). To identify global repeats (sequences that are repeated somewhere
else on the chromosome) we search for the best match of a 100 bp window against the entire
chromosome. Searching on the positive strand results in direct repeats (both sequences run in
the same direction) whilst searching on the negative strand gives inverted repeats (the two
repeat units run in opposite directions). For most of these general properties summarized in a
Genome Atlas (structural properties, repeats, base composition) dedicated atlases are also
available, where more features are given (such as local and simple repeats in a Repeat Atlas, or
base composition in a Base Atlas). Such specialized atlases are explained in detail in a book that
we recently produced (Ussery et al., 2008).

Footnote 2: http://www.cbs.dtu.dk/services/GenomeAtlas/

Figure 2. Genome atlas of the main chromosome of Geobacillus kaustophilus. See text for further explanation.
[Figure labels: G. kaustophilus HTA426, main chromosome, 3,544,776 bp. Lanes in the legend: Intrinsic curvature; Stacking energy; Position preference; Annotations (CDS +, CDS –, rRNA, tRNA); Global direct repeats; Global inverted repeats; GC skew; Percent AT. Resolution: 1418. Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]

As can be seen in Fig. 2, the genes in this chromosome strongly favor one strand:
the positive strand for the first (right) half and the negative strand for the second (left) half of
the chromosome. In both halves this is the leading strand during replication. Replication starts
at the origin (the 12 o’clock position here) and proceeds in both directions along the circle,
each side with a leading and a lagging strand, until the bubble reaches the terminus at 6 o’clock
and the ends are joined. The positive strand represented by a genome sequence is the leading
strand, but only for the first half up to the terminus. Reading across the terminus along the
sequence on the same strand one enters the lagging strand. Gene preference for the leading
strand is a general feature of Firmicutes and of some other bacteria.
In Fig. 2 the two outer lanes identify some regions with strong structural properties
(for instance the region around 2 o’clock, indicated by a black line). The observed strong
curvature (blue in the outermost lane) where the DNA would easily melt (red in the second lane)
suggests this region contains genes that are highly expressed.
There are a number of global repeats, notably in the first quarter of the chromosome. Note
that the ribosomal RNA genes (light blue in the annotation lane) are located here, as indicated
by the arrows, and these are picked up as global repeats, as indeed they are repeated genes.
The GC skew lane shows the bias of G’s towards one strand or the other, averaged over a
10,000 bp window. In contrast to many Firmicutes with a strong GC skew, this genome only
has a weak GC skew (the right half is light blue and the left half is light pink). The innermost
circle colors the local AT content when it is more than three standard deviations distant from
the global average. Note a light red color around the 2 o’clock region: this local deviation in AT
content is related to the structural features located here.
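
A windowed GC skew of the kind plotted in that lane can be sketched as below. The sketch assumes the commonly used definition (G − C)/(G + C) per window; whether the Genome Atlas uses exactly this formula or smoothing is an assumption, and the function name `gc_skew` is hypothetical. For simplicity the windows here do not overlap.

```python
def gc_skew(sequence, window=10_000):
    """Return (window_start, skew) for consecutive non-overlapping windows."""
    seq = sequence.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Skew is undefined when a window has no G or C; report 0.0 in that case.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        skews.append((start, skew))
    return skews

# Toy example with 4 bp windows: G-rich, neutral, then C-rich.
toy_skew = gc_skew("GGGCAAAACCCG", window=4)
```

On a real chromosome a sign change in the skew along the sequence is what shows up as the switch between the two colors in the two halves of the atlas.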
The Genome Atlas of the archaeon Methanosarcina acetivorans, shown in Fig. 3, tells a
different story. This strictly anaerobic organism produces methane so efficiently that it is held
responsible for virtually all biogenic methane. It can also oxidize CO to CO2 (Lessner et al.,
2006). Strain C2A (the type strain of the species) was isolated from a marine sediment
(Galagan et al., 2002). Its genome is 5.7 Mbp long and contains 42.7% GC. The Genome
Atlas shows that its genes are evenly distributed over the two strands, and a GC skew is absent.
Instead, the lower quarter of the genome contains many strong structural features. The genome
only contains three rRNA gene copies (indicated by arrows) one of which is located on the
negative strand (but as discussed above, this is actually the leading strand, as is preferred for
nearly all bacterial rRNA genes). Many other global repeats are visible, notably in the region
around 1.2 Mbp, which is strongly curved and easily melted, and is slightly more AT rich than
the rest of the genome. Here, the important carbon-monoxide dehydrogenase gene locus is
present, as are multiple transposases, which could be an indication of horizontally acquired
DNA. The genome is relatively poorly annotated, with many genes given as ‘‘predicted
protein’’ only, which is not uncommon for archaeal genomes.
In conclusion, a Genome atlas combines a number of features in one single figure that
summarizes a very long and detailed story about a chromosome or plasmid.

Figure 3. Genome atlas of the main chromosome of the archaeon Methanosarcina acetivorans.
[Figure labels: M. acetivorans C2A, 5,751,492 bp. Lanes in the legend: Intrinsic curvature; Stacking energy; Position preference; Annotations (CDS +, CDS –, rRNA, tRNA); Global direct repeats; Global inverted repeats; GC skew; Percent AT. Resolution: 2301. Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]

4 Whole Genome Alignment Methods
Another way to compare genomes is based on alignment of nucleotide or amino acid
sequences. Sequence alignment is a common tool to identify similarities, with BLAST (Basic
Local Alignment Search Tool) being the most common (Altschul et al., 1990). However, BLAST is not automatically suitable for large DNA input segments such as complete
genomes. A more suitable program to align sequences in the range of megabases is MUMmer,
developed at TIGR, of which version 3 is now publicly available (Kurtz et al., 2004). This
method has recently been extended to include the average nucleotide identity in the
conserved core genes of a set of genomes (Deloger et al., 2009). Graphical representation of
the resulting alignment also becomes an issue. Specific tools have been designed to align
genome sequences and visualize the results. The Artemis Comparison Tool (ACT) is worth
mentioning; two versions are available: a downloadable version to be used on a local
computer (Carver et al., 2005) and a web-based version with pre-computed comparisons
between several hundred bacterial genomes (footnote 3). BLAST results of entire bacterial
chromosomes against each other have also been used to construct phylogenetic trees (Henz et
al., 2005). BLAST comparisons are treated in Section 7 of this chapter.
5 Comparing the Coding Fraction of Genomes
The typical coding density for a bacterial genome is about 90%, ranging from 95%
for Pelagibacter ubique (an alphaproteobacterial marine bacterium that is among the most numerous bacteria in the world) (Giovannoni et al., 2005) to around 75% for M. acetivorans.
Intracellular bacteria can have a coding density as low as 50%. This means the majority
of bacterial DNA codes for genes, which mostly are not spliced so that introns are absent
(with very few exceptions). However, not every open reading frame is a gene, and it
appears that many bacterial genomes are over-annotated, predicting 10–15% more genes
than are real (Skovgaard et al., 2001). These over-annotated genes are frequently short
open reading frames. In addition, genes can be missed in the annotation. A frequent mistake
is that genes are annotated on the wrong strand, which can happen if the reading frame is
open in either direction. The intergenic regions separating genes regulate transcription,
and in intracellular bacteria frequently contain pseudogenes or repeats. Genes not coding
for proteins include tRNA and rRNA genes, and some parts of intergenic regions can
be transcribed into stable RNAs that do not code for proteins. E. coli
contains several hundred small non-coding RNA (ncRNA) genes (Chen et al., 2002) that
can act as regulators (Gottesman, 2005). Their role in environmental bacteria is virtually
unexplored.
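
Coding density as used above is the fraction of the chromosome covered by annotated genes. A minimal sketch of that calculation follows, assuming gene coordinates are given as 1-based inclusive (start, end) pairs; the function name `coding_density` and the example coordinates are hypothetical. Overlapping genes are counted only once, and genes that span the origin of a circular chromosome are not handled.

```python
def coding_density(genes, genome_length):
    """Fraction of the genome covered by at least one annotated gene.

    genes: iterable of (start, end) pairs, 1-based and inclusive, start <= end.
    """
    covered = 0
    last_end = 0  # highest position already counted
    for start, end in sorted(genes):
        if end <= last_end:
            continue  # gene lies entirely inside a region already counted
        covered += end - max(start, last_end + 1) + 1
        last_end = end
    return covered / genome_length

# Hypothetical annotation: two overlapping genes and one separate gene on a 1,000 bp sequence.
density = coding_density([(1, 100), (51, 150), (301, 400)], genome_length=1000)
```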
Although tRNA and rRNA genes are essential to life, they are sometimes missed in the
annotation of a genome, a rather embarrassing omission, or occasionally annotated on
the wrong strand (Lagesen et al., 2007). The number and location of rRNA operons in a
genome can say something about an organism. It appears that organisms with short doubling
times have larger numbers of rRNA and tRNA genes. Comparing Figs. 2 and 3, it is
likely that G. kaustophilus, with 9 rRNA copies nearly all located close to the origin of
replication (which boosts expression during replication as their copy number increases), can
divide more quickly than M. acetivorans, which only has three copies. Some really fast-growing
bacteria can have 14 or more rRNA copies, as can be viewed from our list of genomes (footnote 4).

Footnote 3: http://www.webact.org/WebACT/home
Footnote 4: www.cbs.dtu.dk/services/GenomeAtlas/

6 Codon Usage Comparisons
Once the genes of a given genome have been defined, their codon usage can be analyzed. Since
the genetic code is redundant, with up to 6 codons per amino acid, synonymous codons are used at
different frequencies. Much of the redundancy in the genetic code is due to third base
variation. Figure 4 displays the codon usage for three prokaryotic genomes: Methanosphaera
stadtmanae (27.6% GC), an archaeal methanogen that uses methanol and hydrogen to
produce methane; Desulfitobacterium hafniense (47.4% GC), a Firmicute that efficiently
dehalogenates tetrachloroethene and polychloroethanes; and Anaeromyxobacter dehalogenans
(75% GC). This species, the first myxobacterium to be grown as a pure culture, can use
ortho-substituted mono- and dichlorinated phenols. The frequency of each possible codon is plotted
in a wheel plot in the upper part of the figure, arranged such that the third base is conserved
in each quarter. The bias in codon usage at the third position can also be seen in the
sequence logo plots in the lower part of Fig. 4. From both graphics it is evident that genomic
GC content strongly affects codon use (or the other way round). Based on a genome’s bias in
codon usage, it is possible to predict its likely environmental niche (Willenbrock et al., 2006).
Moreover, it is known that amino acid usage (not shown here) depends on environment, based
on analysis of metagenomic samples (Musto et al., 2006; Foerstner et al., 2005).
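
The codon frequencies behind such a plot can be tabulated with a short script. The following is a minimal sketch, not the code used for Fig. 4: the function names `codon_frequencies` and `third_position_gc` are hypothetical, the input is assumed to be a list of in-frame coding sequences, and codons are reported in RNA letters to match the figure.

```python
from collections import Counter

def codon_frequencies(coding_sequences):
    """Relative frequency of each codon over a set of in-frame coding sequences."""
    counts = Counter()
    for cds in coding_sequences:
        seq = cds.upper().replace("T", "U")  # report codons in RNA letters
        # Step through complete codons only; a trailing partial codon is ignored.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}

def third_position_gc(coding_sequences):
    """Fraction of codons whose third base is G or C."""
    freqs = codon_frequencies(coding_sequences)
    return sum(f for codon, f in freqs.items() if codon[2] in "GC")

# Toy gene: AUG GCC GCC UAA
toy_freqs = codon_frequencies(["ATGGCCGCCTAA"])
toy_gc3 = third_position_gc(["ATGGCCGCCTAA"])
```

Comparing the third-position GC fraction with the overall genomic GC content is one simple way to see the third-base bias that the wheel and logo plots display graphically.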
7 Protein Sequence Comparisons
One can compare each individual gene in a given genome by BLAST against a set of genomes.
This produces a huge amount of data that can be graphically represented in a BLAST Matrix
(Binnewies et al., 2005; Ussery et al., 2009). A BLAST Matrix is not symmetrical, as the
outcome is determined by which genome is used as the query. The diagonal of a BLAST
Matrix represents a BLAST of a genome against itself. The self-match (the gene finding itself) is
discarded, so the reported scores reflect internal homologues present in a given genome.
Most of these have been derived from gene duplication and are thus paralogs.
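
Why such a matrix is asymmetric, and how the diagonal is handled, can be illustrated with a small sketch. It assumes the BLAST searches have already been run and filtered, and that the surviving hits are available as tuples; the input format, the function name `blast_matrix` and the choice to report the fraction of query genes with a hit are all hypothetical and may differ from the published BLAST Matrix.

```python
def blast_matrix(proteomes, hits):
    """Fraction of genes in each query genome that have a hit in each subject genome.

    proteomes: {genome_name: set of gene ids}
    hits: set of (query_genome, query_gene, subject_genome, subject_gene) tuples
          that passed whatever similarity cutoff was chosen.
    """
    matrix = {}
    for query in proteomes:
        for subject in proteomes:
            matched = {
                q_gene
                for q_genome, q_gene, s_genome, s_gene in hits
                if q_genome == query and s_genome == subject
                # On the diagonal, ignore a gene finding itself.
                and not (query == subject and q_gene == s_gene)
            }
            matrix[(query, subject)] = len(matched) / len(proteomes[query])
    return matrix

# Hypothetical genomes A (4 genes) and B (2 genes) with a handful of hits.
proteomes = {"A": {"a1", "a2", "a3", "a4"}, "B": {"b1", "b2"}}
hits = {
    ("A", "a1", "B", "b1"), ("B", "b1", "A", "a1"),
    ("A", "a1", "A", "a1"), ("A", "a1", "A", "a2"), ("A", "a2", "A", "a1"),
}
matrix = blast_matrix(proteomes, hits)
```

In this toy case the same pair of homologous genes gives 0.25 when A is the query and 0.5 when B is the query, simply because the two genomes contain different numbers of genes.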
When more information should be visualized a BLAST Atlas is helpful. Such an atlas uses
one genome as a reference against which the gene conservation of other genomes is plotted
(Hallin and Ussery, 2004, Skovgaard et al., 2002). In this case gene location only refers to the
location in the reference genome, which of course can be varied in multiple BLAST Atlases.
A BLAST Atlas is also a suitable platform to visualize metagenomic data. So far, we have
not dealt with metagenomics extensively, mainly because this approach very rarely results in
completely assembled microbiological genomes. But for a BLAST Atlas, that is not a problem,
as one can combine all the metagenomic DNA in one lane, thereby ignoring from which
organism the detected genes originated. All obtained BLAST hits are plotted around a
reference genome. An example of a BLAST Atlas is given in Fig. 5, centered around
Pelotomaculum thermopropionicum, a thermophilic, syntrophic Firmicute that can utilize
1-butanol, 1-propanol, 1-pentanol or 1,3-propanediol as a carbon source. Note that despite
the high number of lanes, conserved and variable genes can still be easily inspected visually.
From compacting a single genome into a Genome Atlas, we have now moved several levels up
and compact multiple genomes into a single atlas. In Fig. 5, the P. thermopropionicum
genome is compared to many species of Clostridia, as well as other bacteria. Unfortunately,
very few BLAST hits were found with the metagenomics samples so there is very little color in
those three lanes. Compared to well characterized genomes (like E. coli), relatively few hits are
found in other genomes, indicated by lack of strong color in most of the lanes in Figure 5.
This is probably a reflection of the huge diversity in DNA content in such samples, reducing
the chance of a BLAST hit. It is a sobering thought that there is still so little we know, and so
much that remains to be discovered in the microbial world.

Figure 4. Frequency wheel plots of codon usage (top) and sequence logo plots (bottom) of Anaeromyxobacter dehalogenans (left), Desulfitobacterium hafniense (middle) and Methanosphaera stadtmanae (right).
[Panel labels: Anaeromyxobacter dehalogenans 2CP-C, 25% AT; Desulfitobacterium hafniense Y51, 53% AT; Methanosphaera stadtmanae DSM 3091, 72% AT. Logo plots show codon positions 1st, 2nd and 3rd; wheel plot frequency scale 0.1 to 0.6.]

Figure 5. BLAST Atlas with Pelotomaculum thermopropionicum as the reference genome. Around this the
BLAST hits of 31 genomes of other bacteria are added as listed to the right, from the outermost
circle (top in the legend) to the innermost circle of the bacterial genomes (bottom of legend).
The outermost lane shows the hits of P. thermopropionicum in the UniProt database (which
does not contain all annotated genes as it requires biological evidence of a gene product).
The next three lanes are metagenomic DNA samples from. . . [Dave specify] and next follow
30 genomes of other bacteria as listed to the right.
[Figure labels: P. thermopropionicum SI, 3,025,375 bp. Legend: 2 Alkaliphilus species; Bacillus fragilis; 17 Clostridium species; 4 Desulfitobacterium species; E. coli K-12; 6 other species belonging to Clostridia.]

Many methods are being developed that use sets of conserved genes and gene
families in related organisms to cluster organisms into groups; these groups can represent
known taxonomic relationships. For example, certain genes might be common to a set of
organisms growing in a particular ecological niche. Some examples of such regions along the
chromosome can be seen in the BLAST atlas plots where genomes of related organisms of
different species are compared.
8 Gene Synteny and Genome Islands
A comparison of genes present, absent or diverged between genomes usually ignores gene synteny:
the position at which such genes are found. The term was coined for eukaryotes to describe genes
that are located on the same chromosome; in bacterial genomes the local neighboring genes,
their order and direction are usually compared. The closer two organisms are, the more likely
gene synteny is to be conserved (between genomes of the same genus, species, subspecies or
phylogenetic clade, in increasing order). Gene synteny is disrupted by inversions (changing the
direction of one or several genes), translocations (changing the position of genes) and insertion
and deletion events. All of these can result from mistakes during replication, or from the activity
of self-replicating mobile elements, such as bacteriophages, integrons, transposons, etc.
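
One simple way to quantify conserved local gene order is to count neighboring gene pairs that are adjacent in both genomes. The sketch below is an illustration of that idea only, not a method described in this chapter: the function name `conserved_neighbour_pairs` is hypothetical, genes are assumed to be already assigned to shared families, and chromosomes are treated as linear lists.

```python
def conserved_neighbour_pairs(order_a, order_b):
    """Count adjacent gene pairs of genome A that are also adjacent in genome B.

    Each order is a list of (gene_family, strand) tuples along the chromosome,
    with strand +1 or -1. A pair counts as conserved if it occurs in B in the
    same order and orientation, or as its reverse complement.
    """
    def pairs(order):
        return {(order[i], order[i + 1]) for i in range(len(order) - 1)}

    def flip(pair):
        # The same two neighbors read from the opposite strand.
        (g1, s1), (g2, s2) = pair
        return ((g2, -s2), (g1, -s1))

    pairs_b = pairs(order_b)
    return sum(1 for p in pairs(order_a) if p in pairs_b or flip(p) in pairs_b)

# Hypothetical gene orders: in genome B the block z-w has been inverted.
genome_a = [("x", 1), ("y", 1), ("z", 1), ("w", 1)]
genome_b = [("x", 1), ("y", 1), ("w", -1), ("z", -1)]
shared = conserved_neighbour_pairs(genome_a, genome_b)
```

In this toy case the inversion leaves the x-y and z-w neighborhoods intact but breaks the y-z adjacency at the inversion boundary, so two of the three pairs are conserved.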
The events that affect gene synteny and the point mutations that accumulate during
replication are the two major forces that increase genetic diversity; selection of those organisms that are fittest to survive particular conditions decreases diversity. Evolution further
depends on the change of such selective conditions. With a slow but steady re-shuffling of
genes by evolutionary processes, a pattern emerges of a genetic ‘‘backbone’’ of genes whose
location is relatively conserved between genomes of reasonable genetic distance, and groups of
‘‘cluttered’’ genes that are far more variable, in what have been termed ‘‘genome islands.’’
Genome islands usually contain genes that are all involved in a particular phenotypic process.
Examples are pathogenicity islands, symbiosis islands, metabolic islands and magnetosome
islands. Specific cases are the sulfur metabolism islands discovered in metagenomic sequences
from marine sediments (Mussmann et al., 2005) and the magnetosome island containing all genes
that produce the intracellular organelle enabling magnetotactic bacteria to orient themselves
along magnetic field lines (Richter et al., 2007). The evolutionary advantage of genome islands
is obvious. They can be regarded as genetic ‘‘building blocks’’; when transferred from one
organism to the next, they can confer a complete phenotypic trait to the acceptor, enabling,
for instance, adaptation to a novel ecological niche.
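The backbone-and-island pattern can be illustrated with a small sketch that reports runs of consecutive genes not belonging to a given backbone set. The gene names, the backbone set and the minimum run length are all hypothetical, the chromosome is treated as linear for simplicity, and real island detection typically draws on further evidence (for example base composition or flanking mobile elements) that is not modeled here.

```python
def find_islands(gene_order, backbone, min_genes=3):
    """Return runs of consecutive non-backbone genes.

    `gene_order` is the list of gene names along the chromosome,
    `backbone` is the set of genes whose location is conserved, and only
    runs of at least `min_genes` genes are reported as candidate islands.
    """
    islands = []
    current = []
    for gene in gene_order:
        if gene in backbone:
            # A backbone gene ends the current run of variable genes.
            if len(current) >= min_genes:
                islands.append(current)
            current = []
        else:
            current.append(gene)
    if len(current) >= min_genes:
        islands.append(current)
    return islands


# Hypothetical gene order: "bb" genes form the backbone, others are variable.
gene_order = ["bb1", "bb2", "isl1", "isl2", "isl3", "bb3",
              "x1", "bb4", "isl4", "isl5", "isl6", "isl7"]
backbone = {"bb1", "bb2", "bb3", "bb4"}

islands = find_islands(gene_order, backbone)
print(islands)
```

The single variable gene `x1` is not reported, because an isolated insertion falls below the chosen minimum run length.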
9
Minimum Information About a Genome Sequence
Genome sequences are stored in public databases such as GenBank under their biological
names (preceded by ‘‘candidatus’’ for undecided taxonomic position), or by a code of
numbers and letters for unculturable organisms that have not been classified. Unfortunately,
other relevant information is often lacking. It has become apparent that biological and
environmental data are important, and a standard for ‘‘Minimum Information about a
Genome Sequence’’ (MIGS) has been proposed (Field et al., 2008). The Genomic Standards
Consortium (GSC, http://gensc.org) promotes the standardization of genome sequencing
descriptions and their exchange and integration in the scientific community. Overall, it is important
that genome sequence information is released into the public domain in a timely manner so
that global scientific progress can be maintained.
10
Research Needs
Multiple genome sequences are available for very few environmental species. From genomic
intra-species comparisons of pathogenic bacteria we know that these provide an extra layer of
information, as genetic diversity within a bacterial species can be enormous. When multiple
genomes are available for a species we can define its core genome (all genes that are present in
all genomes of that species), its pan-genome (all genes that have been found in that species)
and its dispensable genes that are responsible for the variation between isolates. Multiple
genomes per species, together with more metagenomic data and more archaeal genome
sequences, are the data we most urgently lack. The research tools for analysis of the
genomes are available. Generate the sequences and the feast can begin.
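The definitions of core genome, pan-genome and dispensable genes given above map directly onto set operations. The following is a minimal sketch, assuming each genome has already been reduced to a set of gene-family identifiers; the isolate and gene names are hypothetical.

```python
def core_pan_dispensable(genomes):
    """Return (core, pan, dispensable) gene sets for one species.

    `genomes` maps an isolate name to the set of genes found in it.
    core        = genes present in all genomes
    pan         = genes found in at least one genome
    dispensable = pan-genome genes that are missing from some genome
    """
    gene_sets = list(genomes.values())
    if not gene_sets:
        return set(), set(), set()
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    return core, pan, pan - core


# Hypothetical isolates of a single species.
isolates = {
    "isolate_1": {"g1", "g2", "g3", "g4"},
    "isolate_2": {"g1", "g2", "g3", "g5"},
    "isolate_3": {"g1", "g2", "g6"},
}

core, pan, dispensable = core_pan_dispensable(isolates)
print(sorted(core), sorted(pan), sorted(dispensable))
```

As more genomes are added, the core can only stay the same or shrink and the pan-genome can only stay the same or grow, which is why both estimates depend on how many isolates have been sequenced.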
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ
(1990) Basic local alignment search tool. J Mol Biol
215: 403–410.
Binnewies TT, Hallin PF, Staerfeldt HH, Ussery DW
(2005) Genome update: proteome comparisons.
Microbiology 151: 1–4.
Binnewies TT, et al. (2006) Ten years of bacterial genome
sequencing: comparative-genomics-based discoveries. Funct Integr Genomics 6: 165–185.
Carver TJ, Rutherford KM, Berriman M, Rajandream
MA, Barrell BG, Parkhill J (2005) ACT: the Artemis
Comparison Tool. Bioinformatics 21: 3422–3423.
Chang CH, Chang YC, Underwood A, Chiou CS, Kao CY
(2007) VNTRDB: a bacterial variable number tandem repeat locus database. Nucleic Acids Res 35:
D416–D421.
Chen S, Lesnik EA, Hall TA, Sampath R, Griffey RH,
Ecker DJ, Blyn LB (2002) A bioinformatics based
approach to discover small RNA genes in the Escherichia coli genome. Biosystems 65: 157–177.
Deloger M, El Karoui M, Petit MA (2009) A genomic
distance based on MUM indicates discontinuity between most bacterial species and genera. J Bacteriol
191: 91–99.
Denoeud F, Vergnaud G (2004) Identification of polymorphic tandem repeats by direct comparison of
genome sequence from different bacterial strains: a
web-based resource. BMC Bioinformatics 5: 4.
Field D, et al. (2008) The minimum information about a
genome sequence (MIGS) specification. Nature Biotechnol 26: 541–547.
Foerstner KU, von Mering C, Hooper SD, Bork P (2005)
Environments shape the nucleotide composition of
genomes. EMBO Rep 6: 1208–1213.
Galagan JE, et al. (2002) The genome of M. acetivorans
reveals extensive metabolic and physiological diversity. Genome Res 12: 532–542.
Giovannoni SJ, et al. (2005) Genome streamlining in a
cosmopolitan oceanic bacterium. Science 309:
1242–1245.
Gottesman S (2005) Micros for microbes: non-coding
regulatory RNAs in bacteria. Trends Genet 21:
399–404.
Griffiths-Jones S, Moxon S, Marshall M, Khanna A,
Eddy SR, Bateman A (2005) Rfam: annotating
non-coding RNAs in complete genomes. Nucleic
Acids Res 33: D121–D124.
Hallin PF, Binnewies TT, Ussery DW (2008) The genome
BLAST atlas - a GeneWiz extension for visualization
of whole-genome homology. Mol Biosyst 4: 363–371.
Hallin PF, Ussery DW (2004) CBS Genome Atlas
Database: a dynamic storage for bioinformatic results
and sequence data. Bioinformatics 20: 3682–3686.
Henz SR, Huson DH, Auch AF, Nieselt-Struwe K,
Schuster SC (2005) Whole-genome prokaryotic
phylogeny. Bioinformatics 21: 2329–2335.
Jensen LJ, Friis C, Ussery DW (1999) Three views of
microbial genomes. Res Microbiol 150: 773–777.
Kurtz S, Philippy A, Delcher AL, Smoot M, Shumway M,
Antonescu C, Salzberg SL (2004) Versatile and open
software for comparing large genomes. Genome Biol
5: R12.
Lagesen K, Hallin P, Rodland EA, Staerfeldt HH,
Rognes T, Ussery DW (2007) RNAmmer: consistent
and rapid annotation of ribosomal RNA genes.
Nucleic Acids Res 35: 3100–3108.
Lessner DJ, et al. (2006) An unconventional pathway for
reduction of CO2 to methane in CO-grown Methanosarcina acetivorans revealed by proteomics. Proc
Natl Acad Sci USA 103: 17921–17926.
Mussmann M, Richter M, Lombardot T, Meyerdierks A,
Kuever J, Kube M, Glöckner FO, Amann R (2005)
Clustered genes related to sulfate respiration in uncultured prokaryotes support the theory of their
concomitant horizontal transfer. J Bacteriol 187:
7126–7137.
Musto H, Naya H, Zavala A, Romero H, Alvarez-Valin F,
Bernardi G (2006) Genomic GC level, optimal
growth temperature, and genome size in prokaryotes. Biochem Biophys Res Commun 347: 1–3.
Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH,
Ussery DW (2000) A DNA structural atlas for
Escherichia coli. J Mol Biol 299: 907–930.
Richter M, Kube M, Bazylinski DA, Lombardot T,
Glöckner FO, Reinhardt R, Schüler D (2007) Comparative genome analysis of four magnetotactic
bacteria reveals a complex set of group-specific
genes implicated in magnetosome biomineralization and function. J Bacteriol 189: 4899–4910.
Selengut JD, et al. (2007) TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes.
Nucleic Acids Res 35: D260–D264.
Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A
(2001) On the total number of genes and their
length distribution in complete microbial genomes.
Trends Genet 17: 425–428.
Skovgaard M, Jensen LJ, Friis C, Stærfeldt HH, Worning P,
Brunak S, Ussery D (2002) The atlas visualisation of
genome-wide information. Meth Microbiol 33:
49–63.
Sood N, Lal B (2008) Isolation and characterization of a
potential paraffin-wax degrading thermophilic bacterial strain Geobacillus kaustophilus TERI NSM for
application in oil wells with paraffin deposition
problems. Chemosphere 70: 1445–1451.
Takami H, et al. (2004a) Genomic characterization of
thermophilic Geobacillus species isolated from the
deepest sea mud of the Mariana Trench. Extremophiles 8: 351–356.
Takami H, et al. (2004b) Thermoadaptation trait
revealed by the genome sequence of thermophilic
Geobacillus kaustophilus. Nucleic Acids Res 32:
6292–6303.
Teeling H, Waldmann J, Lombardot T, Bauer M,
Glöckner FO (2004) TETRA: a web-service and a
stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA
sequences. BMC Bioinformatics 5: 163.
Ussery DW, Hallin PF (2004) Genome update: AT content in sequenced prokaryotic genomes. Microbiology 150: 749–752.
Ussery DW, Borini S, Wassenaar TM (2009) Computing
for Comparative Microbial Genomics: Bioinformatics for Microbiologists (Computational series)
Springer, London.
Wheeler DL, et al. (2007) Database resources of the
National Center for Biotechnology Information.
Nucleic Acids Res 35: D5–D12.
Willenbrock H, Friis C, Friis AS, Ussery DW (2006) An
environmental signature for 323 microbial genomes
based on codon adaptation indices. Genome Biol 7:
R114.
Worning P, Jensen LJ, Hallin PF, Staerfeldt HH,
Ussery DW (2006) Origin of replication in circular
prokaryotic chromosomes. Environ Microbiol 8:
353–361.