Download Investigating the Relationship Between Genome Structure

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Artificial gene synthesis wikipedia , lookup

Genome editing wikipedia , lookup

Transcript
Investigating the Relationship Between Genome Structure, Composition,
and Ecology in Prokaryotes
Pietro Liò
Department of Zoology, University of Cambridge, United Kingdom
Our thesis is that the DNA composition and structure of genomes are selected in part by mutation bias (GC pressure)
and in part by ecology. To illustrate this point, we compare and contrast the oligonucleotide composition and the
mosaic structure in 36 complete genomes and in 27 long genomic sequences from archaea and eubacteria. We report
the following findings (1) High–GC-content genomes show a large underrepresentation of short distances between Gn
and Cn homopolymers with respect to distances between An and Tn homopolymers; we discuss selection versus
mutation bias hypotheses. (2) The oligonucleotide compositions of the genomes of Neisseria (meningitidis and gonorrhoea), Helicobacter pylori and Rhodobacter capsulatus are more biased than the other sequenced genomes. (3)
The genomes of free-living species or nonchronic pathogens show more mosaic-like structure than genomes of chronic
pathogens or intracellular symbionts. (4) Genome mosaicity of intracellular parasites has a maximum corresponding
to the average gene length; in the genomes of free-living and nonchronic pathogens the maximum occurs at larger
length scales. This suggests that free-living species can incorporate large pieces of DNA from the environment, whereas
for intracellular parasites there are recombination events between homologous genes. We discuss the consequences in
terms of evolution of genome size. (5) Intracellular symbionts and obligate pathogens show small, but not zero, amount
of chromosome mosaicity, suggesting that recombination events occur in these species.
Introduction
The recently derived genome sequences of several
bacteria and archaea provide an excellent opportunity to
investigate differences in DNA sequence composition
between- and within-genomes and to relate these differences to the species’ ecology. Genome sequences have
been investigated on the basis of several characteristics,
for example, dinucleotide contents (Karlin and Burge
1995; Karlin, Campbell, and Mrazek 1998), percentage
of shared homologous genes (Huynen, Dandekar, and
Bork 1998; Tekaja, Lazcano, and Dujon 1999; Grishin,
Wolf, and Koonin 2000), and patterns associated with
physicochemical properties of sequences, such as DNA
curvature of regulatory sequences (Bolshoy and Nevo
2000; Pedersen et al. 2000).
The density of guanine and cytosine, i.e., the GC
content, is an important parameter to investigate and
compare the structure and evolution of genomes (Muto
and Osawa 1987; Bernardi 1993; Bellgard and Gojobori
1999). Although the average GC content of bacterial
genomes varies across species, the GC mutational pressure, i.e., the specificity in replication and repair machinery, and the context-dependent mutation bias tend
to homogenize each genome (Sueoka 1962, 1992; Liò
et al. 1996; Karlin, Campbell, and Mrazek 1998). The
different selection constraints acting on regulatory and
coding regions and the lateral transfer of genes with different GC content (Martin 1999; Garcia-Vallvé, Romeu,
and Palau 2000; Ochman, Lawrence, and Groisman
2000) are opposed to the genome composition
homogenization.
We have compared a large number of genome sequences to investigate the relationships between oligoKey words: genomics, GC content, genome structure, prokaryotes, ecology.
Address for correspondence and reprints: Pietro Liò, Department
of Zoology, University of Cambridge, Downing Street, Cambridge
CB2 3EJ, U.K. E-mail: [email protected].
Mol. Biol. Evol. 19(6):789–800. 2002
q 2002 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
nucleotide composition and genome structure heterogeneity in prokaryotes and their ecology, i.e., distinguishing free-living species and nonchronic pathogens from
chronic pathogens and symbionts. First, we use a measure of sequence entropy, based on the frequency content of short oligonucleotides (1–7 bp), to compare 36
complete genomes and 27 long genomic sequences from
archaea and eubacteria with different GC content. Second, because oligonucleotide frequencies depend on the
mosaic structure of genomes, we used a method derived
from wavelets (see Chui 1992 and Daubechies 1992,
among others) to estimate the length-size of the genome
structure heterogeneity in bacteria and archaea with different ecology. We show that the entropy measure and
the wavelet scalogram can be used as complementary
methods to detect global patterns in genome sequences.
Materials and Methods
Data
We have analyzed 36 complete genomes and 27
long sequences from eubacteria and archaea (permission
to use data of uncompleted genome sequencing projects
has been obtained; see Acknowledgments). We report in
table 1 the list of genomic sequences ordered by the
average GC content, which are also shown in figure 2,
and the average GC content if the genome sequence is
partial (pg) and if the prokaryote is an archaea (A). It
is known that there is often a large difference between
the average genomic GC content and the GC content
calculated considering only the third base of codons; in
this work we analyze entire genome sequences, and
therefore we have used the average genomic GC
content.
Statistical Analysis
Measure of Between-Genome Composition Diversity
We compared genome sequences through the analysis of short oligonucleotide frequencies (1–7 bp). We
789
790
Liò
Table 1
List of the Genomic Sequences Analyzed
Species (genome)
GC%
Ureaplasma urealyticum . . . . . . . . . . . . . . . . . .
Buchnera sp.. . . . . . . . . . . . . . . . . . . . . . . . . . . .
Borrelia burgdorferi . . . . . . . . . . . . . . . . . . . . .
Rickettsia prowazekii . . . . . . . . . . . . . . . . . . . . .
Clostridium difficile . . . . . . . . . . . . . . . . . . . . . .
Campylobacter jejuni. . . . . . . . . . . . . . . . . . . . .
Prochlorococcus marinus . . . . . . . . . . . . . . . . .
Methanococcus jannaschii . . . . . . . . . . . . . . . .
Mycoplasma genitalium. . . . . . . . . . . . . . . . . . .
Staphilococcus aureus . . . . . . . . . . . . . . . . . . . .
Methanococcus maripaludis . . . . . . . . . . . . . . .
Streptococcus mutans . . . . . . . . . . . . . . . . . . . .
Haemophilus influentiae . . . . . . . . . . . . . . . . . .
Haemophilus ducreyi . . . . . . . . . . . . . . . . . . . . .
Streptococcus pyogenes . . . . . . . . . . . . . . . . . . .
Helicobacter pylori 26695 . . . . . . . . . . . . . . . .
Helicobacter pylori J99 . . . . . . . . . . . . . . . . . . .
Legionella pneumophila . . . . . . . . . . . . . . . . . .
Mycoplasma pneumoniae . . . . . . . . . . . . . . . . .
Chlamydia muridarum. . . . . . . . . . . . . . . . . . . .
Chlamydia pneumoniae CWL029. . . . . . . . . . .
Chlamydia pneumoniae J138 . . . . . . . . . . . . . .
Chlamydia pneumoniae AR39 . . . . . . . . . . . . .
Pyrococcus furiosus. . . . . . . . . . . . . . . . . . . . . .
Chlamydia trachomatis . . . . . . . . . . . . . . . . . . .
Nostoc puntiforme . . . . . . . . . . . . . . . . . . . . . . .
Pyrococcus horikoshii . . . . . . . . . . . . . . . . . . . .
Aquifex aeolicus . . . . . . . . . . . . . . . . . . . . . . . . .
Bacillus subtilis . . . . . . . . . . . . . . . . . . . . . . . . .
Bacillus halodurans C-125 . . . . . . . . . . . . . . . .
Bacillus actynomycetemcomitans . . . . . . . . . . .
Pyrococcus abyssi . . . . . . . . . . . . . . . . . . . . . . .
Thermoplasma acidophilum . . . . . . . . . . . . . . .
Thermotoga maritima . . . . . . . . . . . . . . . . . . . .
Vibrio cholerae chromosome II . . . . . . . . . . . .
Yersinia pestis . . . . . . . . . . . . . . . . . . . . . . . . . .
Vibrio cholerae chromosome I . . . . . . . . . . . . .
Synechocystis PCC6803. . . . . . . . . . . . . . . . . . .
Archaeglobus fulgidus . . . . . . . . . . . . . . . . . . . .
Bacterium thermolitotrophicus . . . . . . . . . . . . .
Escherichia coli . . . . . . . . . . . . . . . . . . . . . . . . .
Nitrosomonas europaea. . . . . . . . . . . . . . . . . . .
Neisseria meningitidis MC58 . . . . . . . . . . . . . .
Neisseria meningitidis Z2491 . . . . . . . . . . . . . .
Xylella fastidiosa . . . . . . . . . . . . . . . . . . . . . . . .
Salmonella typhi . . . . . . . . . . . . . . . . . . . . . . . .
Neisseria gonorrhoeae. . . . . . . . . . . . . . . . . . . .
Treponema pallidum . . . . . . . . . . . . . . . . . . . . .
Bacillus stearothermophilus . . . . . . . . . . . . . . .
Corynebacterium diphtheriae . . . . . . . . . . . . . .
Aeropyrum pernix . . . . . . . . . . . . . . . . . . . . . . .
Klebsiella pneumoniae. . . . . . . . . . . . . . . . . . . .
Mycobacterium leprae . . . . . . . . . . . . . . . . . . . .
Burkholderia pseudomallei . . . . . . . . . . . . . . . .
Mycobacterium bovis. . . . . . . . . . . . . . . . . . . . .
Mycobacterium tubercolosis . . . . . . . . . . . . . . .
Bordetella bronchiseptica . . . . . . . . . . . . . . . . .
Bordetella parapertussis . . . . . . . . . . . . . . . . . .
Pseudomonas aeruginosa . . . . . . . . . . . . . . . . .
Deinococcus radiodurans chr II . . . . . . . . . . . .
Rhodobacter capsulatus. . . . . . . . . . . . . . . . . . .
Deinococcus radiodurans chr I . . . . . . . . . . . .
Bordetella pertussis . . . . . . . . . . . . . . . . . . . . . .
Rhodobacter sphaeroides . . . . . . . . . . . . . . . . .
Halobacterium NRC-1. . . . . . . . . . . . . . . . . . . .
Streptomyces coelicolor. . . . . . . . . . . . . . . . . . .
0.25
0.26
0.29
0.29
0.30
0.31
0.31
0.31
0.32
0.33
0.35
0.37
0.38
0.38
0.39
0.39
0.39
0.39
0.40
0.40
0.41
0.41
0.41
0.41
0.41
0.42
0.42
0.43
0.44
0.44
0.45
0.45
0.46
0.46
0.47
0.48
0.48
0.48
0.49
0.5
0.51
0.51
0.52
0.52
0.53
0.52
0.53
0.53
0.53
0.53
0.56
0.57
0.58
0.64
0.65
0.66
0.66
0.66
0.66
0.67
0.67
0.67
0.67
0.67
0.68
0.72
Reference
Glass et al. 2000
Shigenobu et al. 2000
Fraser et al. 1997
Andersson et al. 1998
pg
Parkhill et al. 2000
pg
A: Bult et al. 1996
Fraser et al. 1995
pg
pg
pg
Fleischmann et al. 1995
pg
pg
Tomb et al. 1997
Alm et al. 1999
pg
Himmelreich et al. 1996
Read et al. 2000
Kalman et al. 2000
Shirai et al. 2000
Read et al. 2000
A
Stephens et al. 1998
pg
A; Kawarabayasi et al. 1998
Deckert et al. 1998
Kunst et al. 1997
Takami et al. 2000
pg
A: pg
A: Ruepp et al. 2000
Nelson et al. 1999
Heidelberg et al. 2000
pg
Heidelberg et al. 2000
Kaneko et al. 1996
A: Klenk et al. 1997
A: Smith et al. 1997
Blattner et al. 1997
pg
Parkhill et al. 2000
Tettelin. et al. 2000
Simpson et al. 2000
pg
pg
Fraser et al. 1998
pg
pg
A: Kawarbayasi et al. 1999
pg
pg
pg
pg
Cole. et al. 1998
pg
pg
Stover et al. 2000
White et al. 1999
pg
White et al. 1999
pg
pg
A: Ng et al. 2000
pg
NOTE.—The entries are: sequences, the average GC content, if the genome sequence is partial (pg) and if the prokaryote
is an archaea (A).
Relationship Between Genome Structure, Composition, and Ecology in Prokaryotes
Table 2
Test of the Correction Factor for the Sequence Finite Size
Length (bp)
D(1) .
D(2)
D(3)
D(4)
D(5)
D(6)
D(7)
.............
2 D(1) . . . . . . . .
2 D(2) . . . . . . . .
2 D(3) . . . . . . . .
2 D(4) . . . . . . . .
2 D(5) . . . . . . . .
2 D(6) . . . . . . . .
RND
RND (Cor.F.)
1.386
1.386
1.386
1.386
1.384
1.378
1.355
1.386
1.386
1.386
1.386
1.386
1.386
1.386
NOTE.—The entries are: first-difference of entropy values of oligonucleotide
frequencies (1–7 bp) in a 200-kb random DNA sequence with GC content 5
0.5 (column RND) and using the correction factor (RND Cor.F.). Correct values
remain closer to Ln(4) . 1.386.
used the following entropy measure: where fi is the frequency of the oligonucleotide i of length n (n 5 2, . . . ,
7 bases). In short sequences, there are oligonucleotides
that do not occur even if their probabilities in the entire
genome are not zero. Therefore, in order to compensate
the finite length of the sequence, N, we used a correction
factor M/2N, where M is the number of oligonucleotides
with nonzero occurrences (see Herzel 1988; Schmitt and
Herzel 1997; Liò and Ruffo 1998). In a long, random
DNA sequence, all the oligonucleotides of length n are
equally represented and D (n) is equal to n·ln4
(51.386. . . for n 5 1). If some oligonucleotides are
more abundant than expected and others extremely rare,
D(n) is smaller than n·ln4; if the sequence is periodic,
with a period p, D(n) is a constant for n . p. A further
estimator is the first-difference of entropy, it allows to
estimate the bias in the frequencies of oligonucleotides
of length n, taking into account the bias in the frequencies of oligonucleotides of length n 2 1. The term estimates the bias in dinucleotide frequencies, taking into
account the single-base frequencies; the term estimates
the bias in trinucleotide frequencies with respect to the
dinucleotide frequencies, and so on. Therefore, the plot
of for different values of n allows to detect changes in
oligonucleotide frequencies that may reflect mutationselection pressures. In table 2, we show the effect of the
correction factor in computing the first-difference of entropy of oligonucleotide frequencies in a 200-kbp random DNA sequence with GC content 5 0.5. Clearly,
using the correction factor, we can neglect the finite size
of the sequences; moreover, all the sequences in the genome data set are longer than 200 kbp. In a similar way,
for comparison, we computed the second-differences of
entropy, for different values of n.
Distribution of Distances Between Homopolymers
We used density plots to analyze the distribution of
distances between An, Tn, Cn, and Gn homopolymers.
Density plots, that have unit total area, are essentially
smooth versions of histograms which provide smooth
estimates of population frequency or probability density
curves. Histograms depend on the starting point on the
grid of bins, and the differences can be surprisingly
large; better results are obtained on computing the histograms by averaging over a large number of different
starting points and having very small bins (Silverman
791
1986, pp. 45–47). We compute the density through the
formula for a sample of distances s1, . . . , si, . . . , sn, a
fixed kernel k, and a bandwidth b. We used a normal
kernel, and we selected data-dependent bandwidths (b)
using the formula b 5 0.9 min (s,
R/1.34)n21/5, where n is the sample size, s is the standard deviation, and R is the interquantile range (Silverman 1986).
Measure of Within-Genome Composition
Heterogeneity
Karlin and Brendel (1993), Karlin and Burge
(1995), and Karlin and Mrazek (1997), among others
(see also Liò et al. 1996), have provided evidences that
even the DNA sequence of the most simple microorganism shows large degree of patchiness at different sequence length scales. In their analyses, the window size
is chosen on the basis of the sequence length and features to be detected; consequently, the localization accuracy of the methods is of the order of the chosen
window length. In order to investigate GC patchiness,
we coded DNA sequences as G,C 5 1, A,T 5 0. It is
known that GC content variations along genomes occur
at different sequence lengths; for example, codons,
genes, long repeats, pathogenicity islands, and isochores
(Bernardi 1993; Nekrutenko and Li 2000). GC variation
is much larger than purine-pyrimidine variation (Liò et
al. 1996; Liò and Ruffo 1998); therefore, GC patchiness
represents most of the genome sequence patchiness. Arneodo et al. (1995, 1996, 1998) used continuous wavelet
transform to analyze GC patchiness and correlations in
DNA sequences of different species. We used wavelet
scalogram based on discrete wavelet transform (DWT,
Mallat 1989).
Using Wavelets to Assess Within-Genome
Composition Heterogeneity
The name wavelet means small waves (the sinusoids used in Fourier analysis are big waves). In short,
a wavelet is an oscillation that decays quickly. Wavelets
are functions that can be used to efficiently describe a
signal by breaking it down into its components at different scales (or frequency bands) and following their
evolution in the space domain. Unlike the Fourier basis,
wavelets are local both in frequency and space. Wavelets
are discontinuous and sum to zero and show different,
complex shapes, each suitable for a different class of
problems. Wavelets are also related to fractals in that
the same shapes repeat at different orders of magnitude.
Therefore, they are particularly performing better than
Fourier analysis when the signals contain discontinuities
and sharp spikes (Chui 1992; Daubechies 1992).
Wavelet Series
In wavelet theory, a function is represented by an
infinite series expansion in terms of dilated and translated versions of a basic function c, called mother wavelet, each multiplied by an appropriate coefficient (see
Daubechies 1992 and Chui 1992, among others). The
792
Liò
wavelet family cj,k is obtained from the mother wavelet
by shrinking by a factor 2 j and translating by 22jk, to
obtain cj,k(x) 5 2 j/2c(2 jx 2 k) where the j subscript
represents the dilation number and the k subscript represents the translation number. The scale factor 2 j/2 is a
normalization factor for cj,k. The wavelet series representation of a function f is therefore
Choice of Wavelet Basis
Coefficients fj,k describe features of f at the spatial
location and frequency proportional to 2j (or scale j).
Despite Fourier transform, wavelets provide time-frequency localization in that the coefficient fj,k gives information about the function near time point and near
frequency proportional to 2j.
We found that, in general, Daubechies wavelets
perform better than Haar for the type of signals considered in this work (GC plots). Wavelets from Daubechies
families have two important properties; they are compactly supported and have maximum number of vanishing moments (a function f has N vanishing moments if
where q 5 0, 1, . . . , N 2 1). Compact supports are
useful to describe local characteristics that change rapidly with time. A large number of vanishing moments
leads to high compressibility because the fine scale
wavelet coefficients will be essentially zero where the
function is smooth. Although we have previously used
Dauchechies N 5 10 (Liò and Vannucci 2000), the analysis of the large set of genomic sequences (table 1)
showed that, if interested in scales at or above gene
length, the Daubechies’ basis of type N 5 2 performs
as well as N 5 10 for G1C pattern analysis. Therefore,
in this work and in a recent publication (Vannucci and
Liò 2001) we used Daubechies N 5 2.
Discrete Wavelet Transform
Wavelet Scalogram
The DWT decomposes a function into its wavelet
coefficients (Mallat 1989). From a computational point
of view, the DWT proceeds by recursively applying two
convolution functions known as quadrature mirror filters, each producing an output stream that is half of the
length of the original input, until the resolution level
zero is reached. If the filters are applied n times (with
2n # N), at each intermediate step (a level in wavelet
terminology) j 5 1, . . . , n, the transform produces two
vectors of coefficients, Sj of scaling coefficients and Dj
of wavelet coefficients. The vector Dj is kept, whereas
Sj is processed through the two filters. At the last level
n, both Sn and Dn are kept. Different coefficient vectors
contain information about the characteristics of the sequence at different scales or sequence lengths. Coefficients at coarse scales capture gross and global features.
Coefficients at fine scales contain the local details of the
profile. At level j, the wavelet coefficients Dj are associated with changes in the averages of the data on a scale
2j21 at a set of location times. Scaling coefficients Sn at
the last level n are instead associated with averages of
the data on scales 2n and higher. The wavelet transform
is, therefore, a cumulative measure of the variations in
the data over regions proportional to the wavelet scales,
with coefficients at coarser and coarser levels, i.e., for
increasing values of j, describing features at lower frequency ranges and larger time periods. For example,
given a genome sequence of 2 3 106 bp, GC variations
at gene length (1,000–2,000 bp) correspond to scales 10
to 11.
For practical purposes, the DWT is often represented in matrix form as Wy, with W an orthogonal matrix and y a vector of observations of the signal. An
inverse wavelet transform can be also defined. The standard DWT, as the fast Fourier transform, operates on
data sets with length 2N, N integer. When required, data
can be padded with zeros. These zeroes do not affect
the results.
Genome mosaicity is analyzed using the scalogram,
that is the equivalent of the periodogram used in Fourier
analysis. The scalogram is a plot of the sum of the
squares of the coefficients at each scale (Flandrin 1988;
Chiann and Morettin 1998; Ariño and Vidakovic 1995).
The plot will indicate at which scale of resolution the
energy of the function is concentrated. A relatively
smooth function will have most of its energy concentrated at large scales. A function showing high frequency oscillations will have a large portion of its energy
concentrated in high-resolution wavelet coefficients.
We found that the largest amount of GC content
variation in eubacteria and archaea genomes occurs at
short sequence–lengths, mainly at codon or few codon
lengths. Therefore, in order to improve the detection of
GC content variations over gene-length, we applied a
wavelet denoising technique to eliminate rapid variations of GC content (Donoho and Johnstone 1994; Donoho et al. 1995). Finally, each scalogram is generated
by subtracting the values of a scalogram obtained from
a random DNA sequence of the same length and average
GC content.
f (x) 5
Ofc
jk
j,k (x)
jk
with wavelet coefficients
f jk 5
E
1`
f (x)cj,k (x) dx.
2`
Results
Patterns of Genome Composition in High– and Low–
GC-Content Genomes
Recent papers have shown that the relative abundances of the dinucleotides constitute a signature of prokaryotic and eukaryotic genome sequences (Karlin and
Burge 1995; Karlin and Mrazek 1997; Karlin, Campbell,
and Mrazek 1998). In this work we have extended the
analysis to longer oligonucleotides (1–7 bp) and used a
very large genome sequence data set. Figure 1A shows
the entropy of the single-base frequency with respect to
the GC content for the 36 complete genomes and 27
long genomic sequences (see description in Material
and Methods and table 1). We found that single-base
Relationship Between Genome Structure, Composition, and Ecology in Prokaryotes
793
FIG. 1.—Entropy measure of genomic sequences. A, Entropies of single-base frequency for genome data set with respect to the GC content
(x-axis). B–E, first-difference of entropy based on the frequencies of 2, 3, 6, and 7 bp oligonucleotides, respectively, with respect to the average
GC content in bacteria and archaea genomes (see Materials and Methods and table 1). The curved lines in B and E represent the entropy of
single-base frequency from random DNA sequences with different GC content. The genomes are represented by symbols according to the
prokaryotes’ ecology: squares for obligate pathogens and intracellular symbionts, circles for free-living and nonobligate pathogens. Complete
genomes are represented by open symbols and uncompleted genomes by closed symbols. F, we have subtracted the values of the curved line
in figure 2E (and fig. 2B) from the first-difference in entropy of 7 bp oligonucleotide frequencies (y-axis); the x-axis represents the GC content.
The species are coded according to their taxonomy: closed squares, mycoplasmas; closed diamonds, mycobacteria; open squares, other gram1;
circles, archaea; upper triangles, gram2; open diamonds, hyperthermophiles; star, spirochaetes; lower triangles, chlamydias; filled circles, cyanobacteria. The arrows are described in the text.
entropies do not show significant deviations from a
curve generated using random DNA sequences with different GC content (the curved line in fig. 1B and E).
Figures 1B to E shows the first-difference of entropy of
2, 3, 6, and 7 bp oligonucleotide frequencies (results for
4 and 5 bp do not add any interesting features). The
genomes are represented by symbols according to the
species’ ecology. In figure 1F, we show the difference
between the first-difference of entropy of 7 bp oligonucleotide frequencies and the values obtained using
random DNA sequences (curved line in fig. 1E). Therefore, whereas for figures 1A to E the comparisons of
entropy values should be done with respect to the curved
line, in figure 1F the comparison is with the top line
(value 0). Note that the species are classified according
to their taxonomy. Results from figures 1E and F suggest that genomes are clustered neither by ecology nor
by taxonomy. For many genome sequences, the firstdifference of entropy is very large for both dinucleotides
(distances from the curved line of fig. 1A) and trinucleotides (distances between dinucleotide positions in fig.
1B and trinucleotide positions in fig. 1C). There are few
genomes that have almost equal or similar first-differences of entropy values at all oligonucleotide lengths n
(n 5 1, . . . , 7). We found that all the Chlamydias’ sequences (lower triangles) have similar values of firstdifference of entropy; there are negligible differences
between Haemophilus influenzae and H. ducrey and between Neisseria meningitidis strains MC58 and Z2491;
there are small differences between N. meningitidis
strains and the N. gonorrhoea (second arrow from left
in fig. 1B) and between both the Helicobacter pylori
strains J99 and 26695 (first arrow from left in fig. 1B).
The genome sequences of H. pylori and Neisseria show
larger first-difference of entropy for dinucleotides than
the sequences with the same GC content (fig. 1B). The
genome sequences of B. stearothermophilus, H. pylori,
B. pseudomallei, and Thermotoga maritima show the
largest first-difference of entropy for trinucleotides.
Our results show that, particularly for n . 4, the
largest first-difference of entropy occurs for low–GCcontent genomes, for example Ureaplasma urealyticum
794
Liò
FIG. 2.—Entropy measure of genomic sequences. A, Second-difference of entropy based on the frequencies of 7 bp oligonucleotides with
respect to the average GC content in bacteria and archaea genomes (see Materials and Methods and table 1). Entropies of single-base frequency
for genome data set are coded as filled circles. The species are coded as in figure 1F. B, we have subtracted the values of the curved line in
figure 2A from the second-difference in entropy of 7 bp oligonucleotide (y-axis); the x-axis represents the GC content. The species are coded
as in figure 1F, and the arrows are described in the text.
(first arrow from left in fig. 1E), H. pylori strains (first
arrow from left in fig. 1B; second arrow from left in fig.
1E; first arrow from the left in fig. 1F), the N. meningitidis strains (second arrow from left in fig. 1B; third
arrow in fig. 1E; second arrow from the right in fig. 1F),
N. gonorrhoea (on the right hand side of N. meningitidis), and high–GC-content genomes, such as Rickettsia
capsulatus (fourth arrow from the left in fig. 1E; third
arrow from the left in fig. 1F) and S. streptomyces (first
arrow from the right in fig. 1E).
The genome sequence of Mycoplasma genitalium
shows larger oligonucleotide first-difference of entropy
than the other sequenced mycoplasma genomes, U.
urealyticum and Mycoplasma pneumoniae (closed
squares in fig. 1F). Among the high–GC-content genomes, Mycoplasma bovis and Mycoplasma tubercolosis
have larger first-difference of entropy than Mycoplasma
leprae (closed diamonds in fig. 1F ). Although genes
from these genomes share 80% amino acid identity on
an average, M. leprae has a lower GC content than the
other two mycoplasmas (Bellgard and Gojobori 1999).
The genomes of M. bovis and M. tubercolosis have lower first-difference of entropy than the genomes of proteobacteria with similar GC content, such as Pseudomonas aeruginosa and the Bordetellae sequences.
The comparison of genome sequences that have
complementary GC-content percentage, for example,
35% and 65%, shows that low–GC-content genomes
have smaller first-difference of entropy than high–GCcontent genomes (fig. 1F). For example, the average
first-difference of entropy for 7-bp oligonucleotides in
genome sequences with GC% , 0.40, GC% between
0.4 and 0.6, and GC% . 0.6 are (the standard deviation
in parentheses): 20.042 (0.020), 20.046 (0.019), and
20.071 (0.015), respectively. This finding seems to be
particularly relevant for proteobacteria (circles in fig.
1F). A further confirmation is given by the second-difference entropies of 7-bp oligonucleotides, D(7)2D(5),
for the same set of genomic sequences of figure 1 (fig.
2A). In figure 2B, we have subtracted entropies from
random DNA sequences from the values in figure 2A.
The largest values correspond to those of figure 1F: the
two H. pylori (first arrow from left in figs. 2A and B),
N. meningitidis and N. gonorrhoea (second arrow from
left in figs. 2A and B), and R. capsulatus (third arrow
from the left in figs. 2A and B).
We analyzed in details the oligonucleotide frequencies in high– and low–GC-content genomes. The contrast in oligonucleotide frequencies for high– and low–
GC-content genomes shows that, particularly for n . 4
bp, homopolymers of the type Gn and Cn are generally
underrepresented in all genomes in the data set. We
made use of density plots to analyze the distribution of
distances between An, Tn, Cn, and Gn homopolymers. In
figure 3, we show the density plots of the distances between A6 and T6 homopolymers in AT-rich genomes and
the distances between C6 and G6 homopolymers in GCrich genomes in a set of bacterial genomes. Each plot
of figure 3 shows the distances between homopolymers
in genomes with almost complementary GC%, i.e., for
example, Campylobacter jejuni (GC content ;31%)
versus P. aeruginosa (;67%, fig. 3A); Borrelia burgdorferi (;29%) versus Deinococcus radiodurans
(;67%, fig. 3B); Methanococcus jannaschii (;31%)
versus Halobacterium sp. (;68%, fig. 3C) and Escherichia coli (;51%, fig. 3D). In the absence of any selection or mutation mechanism, we expect the GC-rich
genomes to contain a number of Gn and Cn homopolymers, quite equal to the number of An and Tn homopolymers in AT-rich genomes and vice versa. Instead,
we found that the number of G6 and C6 homopolymers
in GC-rich genomes is much lower than the number of
A6 and T6 homopolymers in AT-rich genomes. All the
plots show similar large differences between G6/C6 and
A6/T6 distances in GC-rich and AT-rich genomes,
respectively.
Within-Genome Composition Heterogeneity
Our study, based on a very large set of genomes
from bacteria and archaea, focuses on searching for correlation between genome mosaicity and ecology in bacteria and archaea. In order to investigate the mosaic-like
structure of genomes, we calculated the wavelet scalo-
Relationship Between Genome Structure, Composition, and Ecology in Prokaryotes
795
FIG. 3.—Density plots of the distances between A6 and T6 homopolymers in low–GC-content genomes and between C6 and G6 homopolymers in high–GC-content genomes. A, Campylobacter jejuni (GC content ;31%) and P. aeruginosa (;67%); B, B. burgdorferi (;29%)
and D. radiodurans (;67%); C, M. jannaschii (;31%) and Halobacterium sp. (;68%); D, E. coli (;51%). The x-axis shows the distances
between homopolymers, and the y-axis shows the density.
grams for the genome data set (fig. 4). We found quantitative and qualitative differences between the genomes
of free-living species/not-chronic pathogens (figs. 4A
and B) and those of intracellular parasites/symbionts
(figs. 4C and D). Although the size of GC-variations,
i.e., the scales, depends on the sequence length because
of the wavelet decomposition process (see Material and
Methods), the genomes of free-living species and nonobligate pathogens clearly show more mosaic-like structure than the genomes of obligate pathogens and intracellular symbionts. In particular, the genomes of the N.
meningitidis strains show more mosaic-like structure
than all the other genomes analyzed, even larger than
Synechocystis and E. coli (not shown) and other freeliving bacteria. Mosaicity occurs at all the scales in the
genomes of free-living species or not-chronic parasites
(figs. 4A and B), whereas intracellular parasites’ genomes have low or negligible values at coarser scales
(figs. 4C and D). The maximum of the scalograms for
intracellular parasites occurs at scales that correspond to
the average gene length (table 3); it occurs at coarser
scales for free-living and not-chronic genomes. This
suggests that free-living species can incorporate large
pieces of DNA from the environment, whereas for intracellular parasites the most common event is the recombination between homologous genes. This has important consequences on the genome reduction evolution
in that it explains the genome size reduction observed
in intracytoplasmic genomes and mitochondria. Finally,
we found that there is small but not negligible mosaicity
in genomes of intracytoplasmic symbionts. It is noteworthy that these results do not depend on the GC directional mutational pressure: for example low–GC-content genomes (for example Buchnera sp.) and genomes
that have 50% GC content (for example Chlamidias)
have similar genome mosaicity and ecology.
Discussion
Genome Composition and GC Mutational Pressure
The evolution of DNA composition and structure
of bacterial genomes depends on mutational and selection processes, such as GC mutation pressure, strandbias mutation generated by DNA repair during transcription (Sueoka 1995), and the relative frequency of synonymous codons caused by the abundance of tRNAs
(Sorensen, Kurland, and Pedersen 1989; Berg and Kurland 1997). Although much of the difference in oligonucleotide frequencies in a short sequence can be nonfunctional, genome-wide, it may have relevance in terms
of genome organization, codon usage, gene expression,
recombination, and mutation rate. The first-differences
of entropy of dinucleotide frequencies may reflect structural constraints, for example, the stacking energies and
796
Liò
FIG. 4.—Wavelet scalograms of archaea and eubacteria genome sequences. A, Synechocystis sp. (1), N. meningitidis MC58 (circles), V.
cholerae (triangles); B, A. fulgidus (1), C. jejuni (circles), M. jannaschii (triangles); C, H. pylori (1), R. prowazekii (circles), Chlamydia
pneumoniae J138 (triangles); D, B. burgdorferi (1), U. urealyticum (circles), Buchnera sp. (triangles). The x-axis shows the scale, and the yaxis shows the energy that is an estimate of the amount of mosaicity at each scale.
the species-specific machinery for DNA repair and modification (Karlin, Campbell, and Mrazek 1998). Because
coding regions generally constitute more than 90% of
the bacterial genome, trinucleotide frequency diversity
may be related to synonymous codon usage in coding
regions.
The genomes of N. meningitidis and H. pylori show
larger values of first-difference of entropy than other
bacteria and archaea with similar GC content. These geTable 3
Numerical Results of Scalogram of Figure 4
Species (genome)
Synechocystis PCC6803 . . . . .
N. meningitides . . . . . . . . . . . .
V. cholerae . . . . . . . . . . . . . . . .
A. fulgidus . . . . . . . . . . . . . . . .
C. jejuni . . . . . . . . . . . . . . . . . .
M. jannaschii. . . . . . . . . . . . . .
H. pylori . . . . . . . . . . . . . . . . . .
R. prowazekii . . . . . . . . . . . . . .
C. pneumoniae J138 . . . . . . . .
B. burgdorferi. . . . . . . . . . . . . .
U. urealyticum . . . . . . . . . . . . .
Buchnera sp.. . . . . . . . . . . . . . .
Length Maximum Scale
3.57
2.27
1.03
2.18
1.64
1.66
1.67
1.11
1.23
0.91
0.75
0.64
4,426
6,635
2,173
2,083
2,178
2,691
2,287
2,046
1,368
1,400
938
1,034
9
9
10
8
11
11
11
9
11
10
10
10
Bases
13,958
8,876
7,878
17,018
1,603
1,625
1,628
4,341
1,199
1,779
1,468
1,251
NOTE.—The entries are: species (free-living and nonobligate pathogens in
bold), genome length (3 106 bp), maximum, scale and length of the corresponding GC variation (in bp), the area under the scalogram.
nomes contain several large pathogenicity islands that
have different GC content from the nearby genomic regions (Tomb et al. 1997; Liò and Vannucci 2000; Parkhill et al. 2000; Tettelin et al. 2000). The results from
the entropy analysis and the wavelet scalograms allow
us to infer that the H. pylori genome does not contain
more GC-heterogeneity than the other chronic pathogens, but it contains few genomic regions with remarkably different short oligonucleotide frequencies. This is
in agreement with our previous findings (see fig. 2 in
Liò and Vannucci 2000) and the findings of Liu and coworkers (Liu et al. 1999) that showed that there are large
differences in sequence composition in the CAG pathogenicity island with respect to the average composition
of the H. pylori genome. The genome of N. gonorrhoea
has similar entropy values as that of N. meningitidis, it
may contain pathogenicity islands with similar base
composition and size; R. capsulatus may have pathogenicity regions or alien genes too. Low–GC-content genomes show less diversity in oligonucleotide frequencies than high–GC-content genomes. Some authors have
shown that in E. coli and in other bacteria, the homopolymers of the type An and Tn are more abundant than
Cn and Gn (Dechering et al. 1998; Shomer and Yagil
1999). The results in figure 3 suggest that short distances
(,2 kbp) between Cn and Gn are less abundant than
Relationship Between Genome Structure, Composition, and Ecology in Prokaryotes
797
short distances between An and Tn in bacterial genomes.
Although several DNA mismatch and repair mechanisms are known to change oligonucleotide frequencies
(see for example Deschavanne and Radman 1991), no
mutation mechanisms is known to act selectively on
long Gn and Cn homopolymers. Because noncoding regions represent a small percentage of eubacteria and archaea genomes, statistical analyses give little information about differences between homopolymers in coding
regions with respect to the overall genome. The fact that
bacterial mRNAs are generally polycistronic with length
of several kilobase pairs suggests the importance of a
constraint on RNA secondary structure against Gn and
Cn sticky patches. This is in agreement with the work
of Huynen and co-workers who found that, in histone
genes, the compensation of the G-C ratio indicates a
selection pressure at the mRNA level rather than a selection pressure or mutation bias at the DNA level or a
selection pressure on codon usage (Huynen, Konings,
and Hogeweg 1992). It is also known that pairing of
palindromic Gn and Cn patches serve as stop transcription mechanism in several bacterial operons (see for example Lewin 1997, pp. 318–319).
among the bacterial genomes sequenced to date (Andersson et al. 1998). It is known that the presence of
repeats increases the recombination rate, and this may
explain the relatively large values of the GC variations.
Our analyses, based on a very large number of genome sequences, show how oligonucleotide composition
is affected by GC mutational pressure and that genome
mosaic-like structure depends on the history of gene
transfer events and thus on the ecology of the species.
Entropy measure and wavelet scalogram are complementary methods to analyze genome sequences and to
detect the presence of alien genes, i.e., they can be used
as preliminary analyses in the investigation of host-pathogen relationship. The alien genes can be located
through the selective reconstruction of GC plot from the
scalogram using just few scales, as shown in Liò and
Vannucci (2000). Further improvements of this work
will consider using the nondecimated or stationary version of the DWT, a modified transform where coefficients at each level are not subsampled.
Ecology and Genome Composition Heterogeneity:
Molecular Evolution Implications
P.L. is supported by an EPSRC/BBSRC Bioinformatics Initiative grant. We thank Marina Vannucci and
Nick Goldman for helpful suggestions. For the following species we have considered all nonredundant sequences available in GenBank and the sequences freely
available through the web sites of several research institutions (we obtained the permission on using data
from ongoing genomic projects). Sequence data of B.
pertussis, B. bronchiseptica, B. parapertussis, B. pseudomallei, Corynebacterium diphtheriae, C. difficile, M.
bovis, M. leprae, S. typhi, S. coelicolor, and Y. pestis
are from sequencing groups at Sanger Centre and can
be found at www.sanger.ac.uk. Sequence data of N. gonorrhoea, S. pyogenes, A. actinomycetemcomitans, B.
stearothermophilus, S. aureus, and S. mutans are from
the University of Oklahoma (www.genome.ou.edu,
dna1.chem.ou.edu/gono.html). We acknowledge the
Gonococcal Genome Sequencing Project supported by
USPHS/NIH grant AI38399, and L. A. Lewis, A. Gillaspy, R. McLaughlin, M. Gipson, T. Ducey, T. Ownbey,
K. Hartman, C. Nydick, M. Carson, J. Vaughn, C.
Thomson, L. Song, S. Lin, X. Yuan, F. Najar, M. Zhan,
Q. Ren, H. Zhu, S. Qi, S. Kenton, H. Lai, J. White, S.
Clifton, B. A. Roe, and D. W. Dyer. We also acknowledge the S. mutans Genome Sequencing Project funded
by USPHS/NIH grant from the Dental Institute and B.
A. Roe, R. Y. Tian, H. G. Jia, Y. D. Qian, S. P. Linn, L.
Song, R. E. McLaughlin, M. McShan, and J. Ferretti. L.
pneumophila, N. puntiforme sequences are from Columbia Genome Center (genome3.cpmc.columbia.edu); K.
pneumoniae is sequenced by Washington University
Consortium (genome.wustl.edu/gsc/Projects/bacteria.
shtml); H. ducreyi, M. maripaludis are sequenced by the
University of Washington (www.htsc.washington.edu,
kandisnky.genome.washington.edu/maripaludis), N. europaea, P. marinus are from DOE- JGI (www.jgi.doe.
gov/JGIpmicrobial/html); P. abyssi from Genoscope
A possible explanation as to why the genomes of
free-living and nonchronic pathogens show more patchiness than genomes of chronic pathogens or symbionts
is that free-living bacteria and nonchronic pathogens experience a fluctuating and challenging environment with
diversified, in time and locus, selection. The genomes
of these species can easily exchange DNA segments and
incorporate cassettes of resistance genes that allow them
to face environmental changes. Probably, nonchronic
pathogens have high degree of mosaicity because of the
pressure to keep the pathogenicity islands shared among
bacterial species with different genome-wide base
composition.
Differences in the wavelet scalograms of free-living bacteria may also reflect different mechanisms for
DNA uptake and recombination. Therefore, genomes
such as D. radiodurans (GC% ø 67%) and S. coelicolor
(GC% ø 72%), that are subjected to strong GC mutational bias, do not undergo a reduction in genome size
and tRNA population. Instead, the genomes of chronic
pathogens and symbionts that are subjected to GC pressure, such as Mycoplasma capricolum (GC content ;
25%) and Micrococcus luteus (GC content ; 75%), undergo a genome size and tRNA population reduction
(Muto et al. 1990; Kano et al. 1991; Andersson and
Kurland 1995). The fact that the scalograms of the genomes of intracellular pathogens or symbionts such as
Buchnera sp. (fig. 4D and table 3) show a small but not
negligible amount of genome composition heterogeneity
suggests that genetic recombination events occur. This
finding is in agreement with the predictions of Wolf and
co-workers for Rickettsiae and Chlamydiae (Wolf, Aravind, and Koonin 1999). The genome of Rickettsia
prowazekii contains the largest amount of repeats (24%)
Acknowledgments and Permissions for Using
Genome Data
798
Liò
(www.genoscope.cns.fr/Pab); R. capsulatus is sequenced
by University of Chicago (capsulapedia.uchicago.edu),
and R. sphaeroides is sequenced by University of Texas
(www-mmg.med.uth.tmc.edu/sphaeroides).
LITERATURE CITED
ALM, R. A., L. S. LING, D. T. MOIR et al. (23 co-authors).
1999. Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori.
Nature 397:176–180.
ANDERSSON, S. G., and C. G. KURLAND. 1995. Genomic evolution drives the evolution of the translation system. Biochem. Cell Biol. 73:775–787.
ANDERSSON, S. G., A. ZOMORODIPOUR, J. O. ANDERSSON et al.
(10 co-authors). 1998. The genome sequence of Rickettsia
prowazekii and the origin of mitochondria. Nature 396:133–
140.
ARIÑO, M., and B. VIDAKOVIC. 1995. On wavelet scalograms
and their applications in economic time series. Discussion
paper 95-21, ISDS, Duke University.
ARNEODO, A., E. BACRY, P. V. GRAVES, and J. F. MUZY. 1995.
Characterizing long-range correlations in DNA sequences
from wavelet analysis. Phys. Rev. Lett. 74:3293–3296.
ARNEODO, A., Y. D’AUBENTON CARAFA, B. AUDIT, E. BACRY,
J. F. MUZY, and C. THERMES. 1998. What can we learn with
wavelets about DNA sequences. Physica A 249:439–448.
ARNEODO, A., Y. D’AUBENTON CARAFA, E. BACRY, P. V.
GRAVES, J. F. MUZY, and C. THERMES. 1996. Wavelet based
fractal analysis of DNA sequences. Physica D 1328:1–30.
BELLGARD, M. I., and T. GOJOBORI. 1999. Inferring the direction of evolutionary changes of genomic base composition.
TiG 15:254–256.
BERG, O. G., and C. G. KURLAND. 1997. Growth rate-optimised tRNA abundance and codon usage. J. Mol. Biol. 270:
544–550.
BERNARDI, G. 1993. The vertebrate genome: isochores and
evolution. Mol. Biol. Evol. 10:186–204.
BLATTNER, F. R., G. PLUNKETT, C. A. BLOCH et al. (17 coauthors). 1997. The complete genome sequence of Escherichia coli K-12. Science 277:1453–1474.
BOLSHOY, A., and E. NEVO. 2000. Ecologic genomics of DNA:
upstream bending in prokaryotic promoters. Genome Res.
10:1185–1193.
BULT, C. J., O. WHITE, G. J. OLSEN et al. (23 co-authors).
1996. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273:1058–
1073.
CHIANN, C., and P. A. MORETTIN. 1998. A wavelet analysis for
time series. J. Nonparametric Stat. 10:1–46.
CHUI, C. K. 1992. An introduction to wavelets. Academic
Press, New York.
COLE, S. T., R. BROSCH, R. PARKHILL et al. (25 co-authors).
1998. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393:537–
544.
DAUBECHIES, I. 1992. Ten lectures on wavelets. SIAM,
Philadelphia.
DECHERING, K. J., K. CUELENAERE, R. N. KONINGS, and J. A.
LEUNISSEN. 1998. Distinct frequency-distributions of homopolymeric DNA tracts in different genomes. Nucleic Acids Res. 26:4056–4062.
DECKERT, G., P. V. WARREN, T. GAASTERLAND et al. (15 coauthors). 1998. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392:353–358.
DESCHAVANNE, P., and M. RADMAN. 1991. Counterselection of
GATC sequences in enterobacteriophages by the compo-
nents of the methyl-directed mismatch repair system. J.
Mol. Evol. 33:125–132.
DONOHO, D., and I. JOHNSTONE. 1994. Ideal spatial adaptation
via wavelet shrinkage. Biometrika 81:425–455.
DONOHO, D., I. JOHNSTONE, G. KERKYACHARIAN, and D. PICARD. 1995. Wavelet shrinkage: asymptopia? (with discussion). J. R. Stat. Soc. Ser. B 57:301–369.
FLANDRIN, P. 1988. Time-frequency and time-scale. IEEE
Fourth Annual ASSP Workshop on Spectrum Estimation
and Modeling. Pp. 77–80. Minnesota, Minn.
FLEISCHMANN, R. D., M. D. ADAMS, O. WHITE et al. (10 coauthors). 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–
512.
FRASER, C. M., S. CASJEANS, W. M. HUANG et al. (25 coauthors). 1997. Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390:580–586.
FRASER, C. M., J. D. GOCAYNE, O. WHITE et al. (10 co-authors). 1995. The minimal gene complement of Mycoplasma genitalium. Science 270:397–403.
FRASER, C. M., S. J. NORRIS, G. M. WEINSTOCK et al. (25 coauthors). 1998. Complete genome sequence of Treponema
pallidum, the syphilis spirochete. Science 281:375–388.
GARCIA-VALLVÉ, S., A. ROMEU, and J. PALAU. 2000. Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Res. 10:1719–1725.
GLASS, J. I., E. J. LEFKOWITZ, J. S. GLASS, C. R. HEINER, E.
Y. CHEN, and G. H. CASSELL. 2000. The complete sequence
of the mucosal pathogen Ureaplasma urealyticum. Nature
407:757–762.
GRISHIN, N. V., Y. I. WOLF, and E. V. KOONIN. 2000. From
complete genomes to measures of substitution rate variability within and between proteins. Genome Res. 10:991–
1000.
HEIDELBERG, J. F., J. A. EISEN, W. C. NELSON et al. (33 coauthors). 2000. DNA sequence of both chromosomes of the
cholera pathogen Vibrio cholerae. Nature 406:477–483.
HERZEL, H. 1988. Complexity of symbol sequences. Syst.
Anal. Model. Simul. 5:435–441.
HIMMELREICH, R., H. HILBERT, H. PLAGENS, E. PIRKL, B. C.
LI, and R. HERRMANN. 1996. Complete sequence analysis
of the genome of the bacterium Mycoplasma pneumoniae.
Nucleic Acids Res. 24:4420–4449.
HUYNEN, M., T. DANDEKAR, and P. BORK. 1998. Measuring
genome evolution. Proc. Natl. Acad. Sci. USA 95:5849–
5856.
HUYNEN, M. A., D. A. KONINGS, and P. HOGEWEG. 1992. Equal
G and C contents in histone genes indicates selection pressures on mRNA secondary structure. J. Mol. Evol. 34:280–
291.
KALMAN, S., W. MITCHELL, R. MARATHE et al. (10 co-authors).
2000. Comparative genomes of Chlamydia pneumoniae and
C. trachomatis. Nat. Genet. 21:385–389.
KANEKO, T., S. SATO, H. KOTANI et al. (24 co-authors). 1996.
Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence
determination of the entire genome and assignment of potential protein-coding regions. DNA Res. 3:109–136.
KANO, A., Y. ANDACHI, T. OHAMA, and S. OSAWA. 1991. Novel anticodon composition of transfer RNAs in Micrococcus
luteus, a bacterium with a high genomic G 1 C content.
Correlation with codon usage. J. Mol. Biol. 221:387–401.
KARLIN, S., and V. BRENDEL. 1993. Patchiness and correlations
in DNA sequences. Science 259:677–680.
KARLIN, S., and C. BURGE. 1995. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 11:
283–290.
Relationship Between Genome Structure, Composition, and Ecology in Prokaryotes
KARLIN, S., A. M. CAMPBELL, and J. MRAZEK. 1998. Comparative DNA analysis across diverse genomes. Annu. Rev.
Genet. 32:185–225.
KARLIN, S., and J. MRAZEK. 1997. Compositional differences
within and between eukaryotic genomes. Proc. Natl. Acad.
Sci. USA 94:10227–10232.
KAWARABAYASI, Y., Y. HINO, H. HORIKAWA et al. (25 co-authors). 1999. Complete genome sequence of an aerobic hyper-thermophilic crenarchaeon, Aeropyrum pernix K1. DNA
Res. 6:83–101.
KAWARABAYASI, Y., M. SAWADA, H. HORIKAWA et al. (25 coauthors). 1998. Complete sequence and gene organization
of the genome of a hyper-thermophilic archaebacterium,
Pyrococcus horikoshii OT3. DNA Res. 5:55–76.
KLENK, H. P., R. A. CLAYTON, J. F. TOMB et al. (25 co-authors).
1997. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390:364–370.
KUNST, F., N. OGASAWARA, I. MOSZER et al. (25 co-authors).
1997. The complete genome sequence of the gram-positive
bacterium Bacillus subtilis. Nature 390:249–256.
LEWIN, B. 1997. Gene VI, Chap. 11. Oxford University Press
Inc., New York.
LIÒ, P., S. RUFFO, A. POLITI, and M. BUIATTI. 1996. Analysis
of genomic patchiness of Haemophilus influenzae and S.
cerevisiae chromosomes. J. Theor. Biol. 183:455–469.
LIÒ, P., and S. RUFFO. 1998. Searching for genomic constraints.
Il Nuovo Cimento D 20:113–127.
LIÒ, P., and M. VANNUCCI. 2000. Finding pathogenicity islands
and gene transfer events in genome data. Bioinformatics 16:
932–940.
LIU, G., T. K. MCDANIEL, S. FALKOW, and S. KARLIN. 1999.
Sequence anomalies in the Cag7 gene of the helicobacter
pylori pathogenicity island. Proc. Natl. Acad. Sci. USA 96:
7011–7016.
MALLAT, S. G. 1989. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattern Machine Intelligence 11:674–693.
MARTIN, W. 1999. Mosaic bacterial chromosomes: a challenge
en route to a tree of genomes. Bioessays 21:99–104.
MUTO, A., Y. ANDACHI, H. YUZAWA, F. YAMAO, and S. OSAWA. 1990. The organization and evolution of transfer RNA
genes in Mycoplasma capricolum. Nucleic Acids Res. 18:
5037–5043.
MUTO, A., and S. OSAWA. 1987. The guanine and cytosine
content of genomic DNA and bacterial evolution. Proc.
Natl. Acad. Sci. USA 84:166–169.
NEKRUTENKO, A., and W. H. LI. 2000. Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Res. 10:1986–1995.
NELSON, K. E., R. A. CLAYTON, S. R. GILL et al. (25 coauthors). 1999. Evidence for lateral gene transfer between
Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature 399:323–329.
NG, W. V., S. P. KENNEDY, G. G. MAHAIRAS et al. (43 coauthors). 2000. Genome sequence of Halobacterium species
NRC-1. Proc. Natl. Acad. Sci. USA 97:12176–12181.
OCHMAN, H., J. G. LAWRENCE, and E. A. GROISMAN. 2000.
Lateral gene transfer and the nature of bacterial innovation.
Nature 405:299–303.
PARKHILL, J., M. ACHTMAN, K. D. JAMES et al. (21 co-authors).
2000. Complete DNA sequence of a serogroup A strain of
Neisseria menigitidis Z2491. Nature 404:502–506.
PARKHILL, J., B. W. WREN, and K. MUNGALL. 2000. The genome sequence of the food-borne pathogen Campylobacter
jejuni reveals hypervariable sequences. Nature 403:665–
668.
799
PEDERSEN, A. G., L. J. JENSEN, S. BRUNAK, H. H. STAERFELDT,
and D. W. USSERY. 2000. A DNA structural atlas for Escherichia coli. J. Mol. Biol. 299:907–930.
READ, T. D., R. C. BRUNHAM, C. SHEN et al. (25 co-authors).
2000. Genome sequences of Chlamydia trachomatis MoPn
and Chlamydia pneumoniae AR39. Nucleic Acids Res. 28:
1397–1406.
RUEPP, A., W. GRAML, M. L. SANTOS-MARTINEZ et al. (13 coauthors). 2000. The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum. Nature 407:
508–513.
SCHMITT, A. O., and H. HERZEL. 1997. Estimating the entropy
of DNA sequences. J. Theor. Biol. 188:369–377.
SCOTT, D. W. 1992. Multivariate density estimation. Theory,
practice and visualization. John Wiley and Sons, New York.
SHIGENOBU, S., H. WATANABE, M. HATTORI, Y. SAKAKI, and
H. ISHIKAWA. 2000. Genome sequence of the endocellular
bacterial symbiont of aphids Buchnera sp. APS. Nature
407:81–86.
SHIRAI, M., H. HIRAKAWA, M. KIMOTO et al. (10 co-authors).
2000. Comparison of whole genome sequences of Chlamydia pneumoniae J138 from Japan and CWL029 from
USA. Nucleic Acids Res. 28:2311–2314.
SHOMER, B., and G. YAGIL. 1999. Long W tracts are overrepresented in the Escherichia coli and Haemophilus influenzae genomes. Nucleic Acids Res. 27:4491–4500.
SILVERMAN, B. W. 1986. Density estimation for statistics and
data analysis. Chapman & Hall, London.
SIMPSON, A. J. G., F. C. REINACH, P. ARRUDA et al. (115 coauthors). 2000. The genome sequence of the plant pathogen
Xylella fastidiosa. Nature 406:151–157.
SMITH, D. R., L. A. DOUCETTE-STAMM, C. DELOUGHERY et al.
(25 co-authors). 1997. Complete genome sequence of Methanobacterium thermoautotrophicum deltaH: functional
analysis and comparative genomics. J. Bacteriol. 179:7135–
7155.
SORENSEN, M. A., C. G. KURLAND, and S. PEDERSEN. 1989.
Codon usage determines translation rate in Escherichia coli.
J. Mol. Biol. 207:365–377.
STEPHENS, R. S., S. KALMAN, C. LAMMEL et al. (12 co-authors). 1998. Genome sequence of an obligate intracellular
pathogen of humans: Chlamydia trachomatis. Science 282:
754–759.
STOVER, C. K., X. Q. PHAM, A. L. ERWIN et al. (31 co-authors).
2000. Complete genome sequence of Pseudomonas aeruginosa PA01, an opportunistic pathogen. Nature 406:959–
964.
SUEOKA, N. 1962. On the genetic basis of variation and heterogeneity of DNA base composition. Proc. Natl. Acad. Sci.
USA 48:582–588.
———. 1992. Directional mutation pressure, selective constraints and genetic equilibria. J. Mol. Evol. 34:95–114.
———. 1995. Intrastrand parity rules of DNA base composition and usage biases of synonymous codons. J. Mol. Evol.
40:318–325.
TAKAMI, H., K. NAKASONE, Y. TAKAKI et al. (12 co-authors).
2000. Complete genome sequence of the alkaliphilic bacterium Bacillus halodurans and genomic comparison with
bacillus subtilis. Nucleic Acids Res. 28:4317–4331.
TEKAJA, F., A. LAZCANO, and B. DUJON. 1999. The genomic
tree as revealed from whole genome comparisons. Genome
Res. 9:550–557.
TETTELIN, H., N. J. SAUNDERS, J. HEIDELBERG et al. (42 coauthors). 2000. Complete genome sequence of Neisseria
meningitidis serogroup B strain MC58. Science 287:1809–
1815.
800
Liò
TOMB, J. F., O. WHITE, A. R. KERLAVAGE et al. (25 co-authors).
1997. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388:539–547.
VANNUCCI, M., and P. LIÒ. 2001. Wavelet analysis of biological
sequences: applications to protein structure and genomics.
Sankhya Ser. B 63:204–219.
WHITE, O., J. A. EISEN, J. F. HEIDELBERG et al. (25 co-authors).
1999. Genome sequence of the Radioresistant Bacterium
Deinococcus radiodurans R1. Science 286:1571–1577.
WOLF, Y. I., L. ARAVIND, and E. V. KOONIN. 1999. Rickettsiae
and Chlamydiae: evidence of horizontal gene transfer and
gene exchange. TiG 15:173–175.
WILLIAM TAYLOR, reviewing editor
Accepted October 3, 2001