Investigating the Relationship Between Genome Structure, Composition, and Ecology in Prokaryotes

Pietro Liò
Department of Zoology, University of Cambridge, United Kingdom

Our thesis is that the DNA composition and structure of genomes are selected in part by mutation bias (GC pressure) and in part by ecology. To illustrate this point, we compare and contrast the oligonucleotide composition and the mosaic structure in 36 complete genomes and in 27 long genomic sequences from archaea and eubacteria. We report the following findings: (1) High–GC-content genomes show a large underrepresentation of short distances between Gn and Cn homopolymers with respect to distances between An and Tn homopolymers; we discuss selection versus mutation bias hypotheses. (2) The oligonucleotide compositions of the genomes of Neisseria (meningitidis and gonorrhoeae), Helicobacter pylori, and Rhodobacter capsulatus are more biased than those of the other sequenced genomes. (3) The genomes of free-living species or nonchronic pathogens show more mosaic-like structure than genomes of chronic pathogens or intracellular symbionts. (4) Genome mosaicity of intracellular parasites has a maximum corresponding to the average gene length; in the genomes of free-living species and nonchronic pathogens the maximum occurs at larger length scales. This suggests that free-living species can incorporate large pieces of DNA from the environment, whereas for intracellular parasites there are recombination events between homologous genes. We discuss the consequences in terms of evolution of genome size. (5) Intracellular symbionts and obligate pathogens show a small, but nonzero, amount of chromosome mosaicity, suggesting that recombination events occur in these species.

Introduction

The recently derived genome sequences of several bacteria and archaea provide an excellent opportunity to investigate differences in DNA sequence composition between and within genomes and to relate these differences to the species' ecology.
Genome sequences have been investigated on the basis of several characteristics, for example, dinucleotide contents (Karlin and Burge 1995; Karlin, Campbell, and Mrazek 1998), percentage of shared homologous genes (Huynen, Dandekar, and Bork 1998; Tekaia, Lazcano, and Dujon 1999; Grishin, Wolf, and Koonin 2000), and patterns associated with physicochemical properties of sequences, such as DNA curvature of regulatory sequences (Bolshoy and Nevo 2000; Pedersen et al. 2000). The density of guanine and cytosine, i.e., the GC content, is an important parameter for investigating and comparing the structure and evolution of genomes (Muto and Osawa 1987; Bernardi 1993; Bellgard and Gojobori 1999). Although the average GC content of bacterial genomes varies across species, the GC mutational pressure, i.e., the specificity in replication and repair machinery, and the context-dependent mutation bias tend to homogenize each genome (Sueoka 1962, 1992; Liò et al. 1996; Karlin, Campbell, and Mrazek 1998). The different selection constraints acting on regulatory and coding regions and the lateral transfer of genes with different GC content (Martin 1999; Garcia-Vallvé, Romeu, and Palau 2000; Ochman, Lawrence, and Groisman 2000) oppose this homogenization of genome composition. We have compared a large number of genome sequences to investigate the relationships between oligonucleotide composition and genome structure heterogeneity in prokaryotes and their ecology, i.e., distinguishing free-living species and nonchronic pathogens from chronic pathogens and symbionts.

Key words: genomics, GC content, genome structure, prokaryotes, ecology.
Address for correspondence and reprints: Pietro Liò, Department of Zoology, University of Cambridge, Downing Street, Cambridge CB2 3EJ, U.K. E-mail: [email protected].
Mol. Biol. Evol. 19(6):789–800. 2002
© 2002 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

First, we use a measure of sequence entropy, based on the frequency content of short oligonucleotides (1–7 bp), to compare 36 complete genomes and 27 long genomic sequences from archaea and eubacteria with different GC content. Second, because oligonucleotide frequencies depend on the mosaic structure of genomes, we used a method derived from wavelets (see Chui 1992 and Daubechies 1992, among others) to estimate the length scale of the genome structure heterogeneity in bacteria and archaea with different ecology. We show that the entropy measure and the wavelet scalogram can be used as complementary methods to detect global patterns in genome sequences.

Materials and Methods

Data

We have analyzed 36 complete genomes and 27 long sequences from eubacteria and archaea (permission to use data of uncompleted genome sequencing projects has been obtained; see Acknowledgments). Table 1 lists the genomic sequences ordered by average GC content (the same sequences are shown in figure 2), together with flags indicating whether the genome sequence is partial (pg) and whether the prokaryote is an archaeon (A). It is known that there is often a large difference between the average genomic GC content and the GC content calculated considering only the third base of codons; in this work we analyze entire genome sequences, and therefore we have used the average genomic GC content.

Table 1
List of the Genomic Sequences Analyzed

Species (genome)                        GC content  Notes  Reference
Ureaplasma urealyticum                  0.25               Glass et al. 2000
Buchnera sp.                            0.26               Shigenobu et al. 2000
Borrelia burgdorferi                    0.29               Fraser et al. 1997
Rickettsia prowazekii                   0.29               Andersson et al. 1998
Clostridium difficile                   0.30        pg
Campylobacter jejuni                    0.31               Parkhill et al. 2000
Prochlorococcus marinus                 0.31        pg
Methanococcus jannaschii                0.31        A      Bult et al. 1996
Mycoplasma genitalium                   0.32               Fraser et al. 1995
Staphylococcus aureus                   0.33        pg
Methanococcus maripaludis               0.35        pg
Streptococcus mutans                    0.37        pg
Haemophilus influenzae                  0.38               Fleischmann et al. 1995
Haemophilus ducreyi                     0.38        pg
Streptococcus pyogenes                  0.39        pg
Helicobacter pylori 26695               0.39               Tomb et al. 1997
Helicobacter pylori J99                 0.39               Alm et al. 1999
Legionella pneumophila                  0.39        pg
Mycoplasma pneumoniae                   0.40               Himmelreich et al. 1996
Chlamydia muridarum                     0.40               Read et al. 2000
Chlamydia pneumoniae CWL029             0.41               Kalman et al. 2000
Chlamydia pneumoniae J138               0.41               Shirai et al. 2000
Chlamydia pneumoniae AR39               0.41               Read et al. 2000
Pyrococcus furiosus                     0.41        A
Chlamydia trachomatis                   0.41               Stephens et al. 1998
Nostoc punctiforme                      0.42        pg
Pyrococcus horikoshii                   0.42        A      Kawarabayasi et al. 1998
Aquifex aeolicus                        0.43               Deckert et al. 1998
Bacillus subtilis                       0.44               Kunst et al. 1997
Bacillus halodurans C-125               0.44               Takami et al. 2000
Actinobacillus actinomycetemcomitans    0.45        pg
Pyrococcus abyssi                       0.45        A, pg
Thermoplasma acidophilum                0.46        A      Ruepp et al. 2000
Thermotoga maritima                     0.46               Nelson et al. 1999
Vibrio cholerae chromosome II           0.47               Heidelberg et al. 2000
Yersinia pestis                         0.48        pg
Vibrio cholerae chromosome I            0.48               Heidelberg et al. 2000
Synechocystis PCC6803                   0.48               Kaneko et al. 1996
Archaeoglobus fulgidus                  0.49        A      Klenk et al. 1997
Bacterium thermolitotrophicus           0.50        A      Smith et al. 1997
Escherichia coli                        0.51               Blattner et al. 1997
Nitrosomonas europaea                   0.51        pg
Neisseria meningitidis MC58             0.52               Parkhill et al. 2000
Neisseria meningitidis Z2491            0.52               Tettelin et al. 2000
Xylella fastidiosa                      0.53               Simpson et al. 2000
Salmonella typhi                        0.52        pg
Neisseria gonorrhoeae                   0.53        pg
Treponema pallidum                      0.53               Fraser et al. 1998
Bacillus stearothermophilus             0.53        pg
Corynebacterium diphtheriae             0.53        pg
Aeropyrum pernix                        0.56        A      Kawarabayasi et al. 1999
Klebsiella pneumoniae                   0.57        pg
Mycobacterium leprae                    0.58        pg
Burkholderia pseudomallei               0.64        pg
Mycobacterium bovis                     0.65        pg
Mycobacterium tuberculosis              0.66               Cole et al. 1998
Bordetella bronchiseptica               0.66        pg
Bordetella parapertussis                0.66        pg
Pseudomonas aeruginosa                  0.66               Stover et al. 2000
Deinococcus radiodurans chr II          0.67               White et al. 1999
Rhodobacter capsulatus                  0.67        pg
Deinococcus radiodurans chr I           0.67               White et al. 1999
Bordetella pertussis                    0.67        pg
Rhodobacter sphaeroides                 0.67        pg
Halobacterium NRC-1                     0.68        A      Ng et al. 2000
Streptomyces coelicolor                 0.72        pg

NOTE: The entries are: sequences, the average GC content, whether the genome sequence is partial (pg), and whether the prokaryote is an archaeon (A).

Table 2
Test of the Correction Factor for the Sequence Finite Size

Length (bp)      RND     RND (Cor.F.)
D(1)             1.386   1.386
D(2) − D(1)      1.386   1.386
D(3) − D(2)      1.386   1.386
D(4) − D(3)      1.386   1.386
D(5) − D(4)      1.384   1.386
D(6) − D(5)      1.378   1.386
D(7) − D(6)      1.355   1.386

NOTE: The entries are: first-difference of entropy values of oligonucleotide frequencies (1–7 bp) in a 200-kb random DNA sequence with GC content = 0.5 (column RND) and using the correction factor (RND Cor.F.). Corrected values remain closer to ln(4) ≈ 1.386.

Statistical Analysis

Measure of Between-Genome Composition Diversity

We compared genome sequences through the analysis of short oligonucleotide frequencies (1–7 bp). We used the following entropy measure:

$$D(n) = -\sum_{i} f_i \ln f_i$$

where $f_i$ is the frequency of the oligonucleotide i of length n (n = 2, . . . , 7 bases). In short sequences, there are oligonucleotides that do not occur even if their probabilities in the entire genome are not zero.
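
As a rough illustration of the entropy measure defined above, the Python sketch below counts overlapping oligonucleotides of length n and evaluates D(n) with natural logarithms. It also returns the number of distinct oligonucleotides actually observed, which makes the short-sequence problem visible. The function name `oligo_entropy`, the use of overlapping windows, and the choice to skip windows containing symbols other than A, C, G, T are assumptions made for this example; the paper does not state these details.

```python
import math
from collections import Counter

def oligo_entropy(seq: str, n: int):
    """Return (D, M) for oligonucleotides of length n.

    D is -sum(f_i * ln f_i) over the observed n-mers (overlapping windows,
    an assumption of this sketch); M is the number of distinct n-mers seen.
    """
    seq = seq.upper()
    windows = (seq[i:i + n] for i in range(len(seq) - n + 1))
    # Assumption: windows with ambiguous symbols (e.g. N) are ignored.
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    d = -sum((c / total) * math.log(c / total) for c in counts.values())
    return d, len(counts)

if __name__ == "__main__":
    toy = "ACGT" * 50                     # hypothetical 200-bp toy sequence
    d1, m1 = oligo_entropy(toy, 1)
    d3, m3 = oligo_entropy(toy, 3)
    print("D(1) =", d1, "distinct bases:", m1)
    print("distinct trinucleotides:", m3, "of", 4 ** 3)
```

In the toy sequence all four bases are equally frequent, so D(1) equals ln 4, yet most of the 64 possible trinucleotides never appear. That is the kind of finite-length effect the text describes.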
To compensate for the finite length of the sequence, N, we used a correction factor M/2N, where M is the number of oligonucleotides with nonzero occurrences (see Herzel 1988; Schmitt and Herzel 1997; Liò and Ruffo 1998). In a long, random DNA sequence, all the oligonucleotides of length n are equally represented and D(n) is equal to $n \ln 4$ (= 1.386. . . for n = 1). If some oligonucleotides are more abundant than expected and others extremely rare, D(n) is smaller than $n \ln 4$; if the sequence is periodic, with a period p, D(n) is a constant for n > p. A further estimator is the first-difference of entropy, $D(n) - D(n-1)$; it estimates the bias in the frequencies of oligonucleotides of length n, taking into account the bias in the frequencies of oligonucleotides of length n − 1. The term $D(2) - D(1)$ estimates the bias in dinucleotide frequencies, taking into account the single-base frequencies; the term $D(3) - D(2)$ estimates the bias in trinucleotide frequencies with respect to the dinucleotide frequencies, and so on. Therefore, plotting $D(n) - D(n-1)$ for different values of n makes it possible to detect changes in oligonucleotide frequencies that may reflect mutation-selection pressures. In table 2, we show the effect of the correction factor in computing the first-difference of entropy of oligonucleotide frequencies in a 200-kbp random DNA sequence with GC content = 0.5. Clearly, with the correction factor, the effect of the finite size of the sequences can be neglected; moreover, all the sequences in the genome data set are longer than 200 kbp. In a similar way, for comparison, we computed the second-differences of entropy, $D(n) - D(n-2)$, for different values of n.

Distribution of Distances Between Homopolymers

We used density plots to analyze the distribution of distances between An, Tn, Cn, and Gn homopolymers. Density plots, which have unit total area, are essentially smooth versions of histograms that provide smooth estimates of population frequency or probability density curves.
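
The distances that feed these density plots can be collected as in the sketch below. The paper does not say exactly how a distance between homopolymers is measured, so two assumptions are made here and should be read as such: a homopolymer is a maximal run of at least n identical bases, and a distance is the number of base pairs between the start positions of consecutive runs, with the two base types (for example A and T) pooled. The function names are hypothetical.

```python
import re

def homopolymer_starts(seq: str, base: str, n: int):
    """Start positions of maximal runs of `base` that are at least n long."""
    pattern = re.compile(f"{base}{{{n},}}")      # e.g. "A{6,}"
    return [m.start() for m in pattern.finditer(seq.upper())]

def homopolymer_distances(seq: str, bases: str, n: int):
    """Start-to-start distances (bp) between consecutive runs of any base in `bases`.

    Assumption of this sketch: runs of the listed bases are pooled and sorted
    by position before the differences are taken.
    """
    starts = sorted(p for b in bases for p in homopolymer_starts(seq, b, n))
    return [later - earlier for earlier, later in zip(starts, starts[1:])]

if __name__ == "__main__":
    toy = "AAAAAA" + "CG" + "TTTTTTT" + "GG" + "AAAAAAA"   # hypothetical toy sequence
    print(homopolymer_distances(toy, "AT", 6))
```

For an AT-rich genome one would call `homopolymer_distances(genome, "AT", 6)`, and for a GC-rich genome `homopolymer_distances(genome, "GC", 6)`, then pass the resulting samples to the density estimate.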
Histograms depend on the starting point on the grid of bins, and the differences can be surprisingly large; better results are obtained by averaging the histograms over a large number of different starting points and using very small bins (Silverman 1986, pp. 45–47). We compute the density through the formula

$$\hat{f}(s) = \frac{1}{nb} \sum_{i=1}^{n} k\left(\frac{s - s_i}{b}\right)$$

for a sample of distances $s_1, \ldots, s_i, \ldots, s_n$, a fixed kernel k, and a bandwidth b. We used a normal kernel, and we selected data-dependent bandwidths (b) using the formula $b = 0.9 \min(\sigma, R/1.34)\, n^{-1/5}$, where n is the sample size, $\sigma$ is the standard deviation, and R is the interquartile range (Silverman 1986).

Measure of Within-Genome Composition Heterogeneity

Karlin and Brendel (1993), Karlin and Burge (1995), and Karlin and Mrazek (1997), among others (see also Liò et al. 1996), have provided evidence that even the DNA sequences of the simplest microorganisms show a large degree of patchiness at different sequence length scales. In their analyses, the window size is chosen on the basis of the sequence length and the features to be detected; consequently, the localization accuracy of the methods is of the order of the chosen window length. In order to investigate GC patchiness, we coded DNA sequences as G, C = 1 and A, T = 0. It is known that GC content variations along genomes occur at different sequence lengths, for example, codons, genes, long repeats, pathogenicity islands, and isochores (Bernardi 1993; Nekrutenko and Li 2000). GC variation is much larger than purine-pyrimidine variation (Liò et al. 1996; Liò and Ruffo 1998); therefore, GC patchiness represents most of the genome sequence patchiness. Arneodo et al. (1995, 1996, 1998) used the continuous wavelet transform to analyze GC patchiness and correlations in DNA sequences of different species. We used the wavelet scalogram based on the discrete wavelet transform (DWT; Mallat 1989).
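
The binary coding just described, and the window-size dependence of fixed-window analyses, can be illustrated with a minimal Python sketch. The function names are hypothetical, and dropping symbols other than A, C, G, T is an assumption of this example (it shifts coordinates slightly when ambiguous bases are present).

```python
def gc_indicator(seq: str):
    """Code G and C as 1, A and T as 0; other symbols are dropped (assumption)."""
    code = {"G": 1, "C": 1, "A": 0, "T": 0}
    return [code[b] for b in seq.upper() if b in code]

def window_gc(signal, window: int):
    """Mean GC content in consecutive, non-overlapping windows of fixed size."""
    return [sum(signal[i:i + window]) / window
            for i in range(0, len(signal) - window + 1, window)]

if __name__ == "__main__":
    toy = "GGCCGGCC" + "ATATATAT" + "GCATGCAT"   # hypothetical toy sequence
    signal = gc_indicator(toy)
    print(window_gc(signal, 8))   # windows aligned with the patches
    print(window_gc(signal, 12))  # a window straddling two patches blurs them
```

A window can only localize a change in GC content to within its own length, which is the limitation that motivates a multiscale description.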
Using Wavelets to Assess Within-Genome Composition Heterogeneity

The name wavelet means small wave (the sinusoids used in Fourier analysis are big waves). In short, a wavelet is an oscillation that decays quickly. Wavelets are functions that can be used to efficiently describe a signal by breaking it down into its components at different scales (or frequency bands) and following their evolution in the space domain. Unlike the Fourier basis, wavelets are local both in frequency and space. Wavelets can be discontinuous, sum to zero, and show different, complex shapes, each suitable for a different class of problems. Wavelets are also related to fractals in that the same shapes repeat at different orders of magnitude. Therefore, they perform better than Fourier analysis, particularly when the signals contain discontinuities and sharp spikes (Chui 1992; Daubechies 1992).

Wavelet Series

In wavelet theory, a function is represented by an infinite series expansion in terms of dilated and translated versions of a basic function $\psi$, called the mother wavelet, each multiplied by an appropriate coefficient (see Daubechies 1992 and Chui 1992, among others). The wavelet family $\psi_{j,k}$ is obtained from the mother wavelet by shrinking by a factor $2^j$ and translating by $2^{-j}k$, to obtain

$$\psi_{j,k}(x) = 2^{j/2}\, \psi(2^j x - k)$$

where the j subscript represents the dilation number and the k subscript represents the translation number. The scale factor $2^{j/2}$ is a normalization factor for $\psi_{j,k}$. The wavelet series representation of a function f is therefore

$$f(x) = \sum_{j,k} f_{j,k}\, \psi_{j,k}(x)$$

with wavelet coefficients

$$f_{j,k} = \int_{-\infty}^{+\infty} f(x)\, \psi_{j,k}(x)\, dx.$$

Coefficients $f_{j,k}$ describe features of f at the spatial location $2^{-j}k$ and frequency proportional to $2^j$ (or scale j). Unlike the Fourier transform, wavelets provide time-frequency localization in that the coefficient $f_{j,k}$ gives information about the function near time point $2^{-j}k$ and near frequency proportional to $2^j$.

Choice of Wavelet Basis

We found that, in general, Daubechies wavelets perform better than Haar wavelets for the type of signals considered in this work (GC plots). Wavelets from the Daubechies families have two important properties: they are compactly supported and have the maximum number of vanishing moments (a function f has N vanishing moments if $\int_{-\infty}^{+\infty} x^q f(x)\, dx = 0$, where q = 0, 1, . . . , N − 1). Compact supports are useful to describe local characteristics that change rapidly with time. A large number of vanishing moments leads to high compressibility because the fine-scale wavelet coefficients will be essentially zero where the function is smooth. Although we have previously used Daubechies N = 10 (Liò and Vannucci 2000), the analysis of the large set of genomic sequences (table 1) showed that, for scales at or above gene length, the Daubechies basis of type N = 2 performs as well as N = 10 for G+C pattern analysis. Therefore, in this work and in a recent publication (Vannucci and Liò 2001) we used Daubechies N = 2.

Discrete Wavelet Transform

The DWT decomposes a function into its wavelet coefficients (Mallat 1989). From a computational point of view, the DWT proceeds by recursively applying two convolution functions known as quadrature mirror filters, each producing an output stream that is half the length of the original input, until the resolution level zero is reached. If the filters are applied n times (with $2^n \le N$), at each intermediate step (a level in wavelet terminology) j = 1, . . . , n, the transform produces two vectors of coefficients, $S_j$ of scaling coefficients and $D_j$ of wavelet coefficients. The vector $D_j$ is kept, whereas $S_j$ is processed through the two filters. At the last level n, both $S_n$ and $D_n$ are kept. Different coefficient vectors contain information about the characteristics of the sequence at different scales or sequence lengths. Coefficients at coarse scales capture gross and global features. Coefficients at fine scales contain the local details of the profile. At level j, the wavelet coefficients $D_j$ are associated with changes in the averages of the data on a scale $2^{j-1}$ at a set of location times. Scaling coefficients $S_n$ at the last level n are instead associated with averages of the data on scales $2^n$ and higher. The wavelet transform is, therefore, a cumulative measure of the variations in the data over regions proportional to the wavelet scales, with coefficients at coarser and coarser levels, i.e., for increasing values of j, describing features at lower frequency ranges and larger time periods. For example, given a genome sequence of $2 \times 10^6$ bp, GC variations at gene length (1,000–2,000 bp) correspond to scales 10 to 11. For practical purposes, the DWT is often represented in matrix form as Wy, with W an orthogonal matrix and y a vector of observations of the signal. An inverse wavelet transform can also be defined. The standard DWT, like the fast Fourier transform, operates on data sets with length $2^N$, N integer. When required, data can be padded with zeros. These zeros do not affect the results.

Wavelet Scalogram

Genome mosaicity is analyzed using the scalogram, which is the equivalent of the periodogram used in Fourier analysis. The scalogram is a plot of the sum of the squares of the coefficients at each scale (Flandrin 1988; Ariño and Vidakovic 1995; Chiann and Morettin 1998). The plot indicates at which scale of resolution the energy of the function is concentrated. A relatively smooth function will have most of its energy concentrated at large scales. A function showing high-frequency oscillations will have a large portion of its energy concentrated in high-resolution wavelet coefficients. We found that the largest amount of GC content variation in eubacterial and archaeal genomes occurs at short sequence lengths, mainly at the length of one or a few codons.
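
The recursion and the scalogram described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the Haar filters for brevity rather than the Daubechies N = 2 basis chosen in this work, it applies no denoising and subtracts no random-sequence baseline, and the function name `haar_scalogram` is hypothetical. Level 1 is the finest level, and each level's energy is the sum of squared wavelet coefficients $D_j$.

```python
import math

def haar_scalogram(signal):
    """Return [E_1, ..., E_n], the sum of squared Haar wavelet coefficients per level.

    The signal (e.g. a 0/1 GC indicator) is zero-padded to a power of two,
    then the averaging and differencing filters are applied recursively.
    """
    size = 1
    while size < len(signal):
        size *= 2
    s = [float(x) for x in signal] + [0.0] * (size - len(signal))
    r = math.sqrt(2.0)
    energies = []
    while len(s) > 1:
        # Wavelet (detail) coefficients D_j: scaled differences of neighbours.
        detail = [(s[i] - s[i + 1]) / r for i in range(0, len(s), 2)]
        # Scaling coefficients S_j: scaled sums, passed on to the next level.
        s = [(s[i] + s[i + 1]) / r for i in range(0, len(s), 2)]
        energies.append(sum(d * d for d in detail))
    return energies

if __name__ == "__main__":
    rapid = [1, 0, 1, 0, 1, 0, 1, 0]    # variation at the shortest scale
    patchy = [1, 1, 1, 1, 0, 0, 0, 0]   # one GC-rich and one GC-poor patch
    print(haar_scalogram(rapid))
    print(haar_scalogram(patchy))
```

In the first toy signal all the energy sits at the finest level, and in the second it sits at the coarsest, which is the qualitative contrast the scalogram is meant to expose. In this sketch the padding zeros do enter the coefficients wherever padded and real values meet, so padded signals should be interpreted with some care.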
Because most GC content variation occurs at these short lengths, we applied a wavelet denoising technique to eliminate rapid variations of GC content and so improve the detection of variations above gene length (Donoho and Johnstone 1994; Donoho et al. 1995). Finally, each scalogram is generated by subtracting the values of a scalogram obtained from a random DNA sequence of the same length and average GC content. In the notation of the wavelet series, $f(x) = \sum_{j,k} f_{j,k}\, \psi_{j,k}(x)$, with wavelet coefficients $f_{j,k} = \int_{-\infty}^{+\infty} f(x)\, \psi_{j,k}(x)\, dx$.

Results

Patterns of Genome Composition in High– and Low–GC-Content Genomes

Recent papers have shown that the relative abundances of the dinucleotides constitute a signature of prokaryotic and eukaryotic genome sequences (Karlin and Burge 1995; Karlin and Mrazek 1997; Karlin, Campbell, and Mrazek 1998). In this work we have extended the analysis to longer oligonucleotides (1–7 bp) and used a very large genome sequence data set.

FIG. 1. Entropy measure of genomic sequences. A, Entropies of single-base frequency for the genome data set with respect to the GC content (x-axis). B–E, First-difference of entropy based on the frequencies of 2, 3, 6, and 7 bp oligonucleotides, respectively, with respect to the average GC content in bacteria and archaea genomes (see Materials and Methods and table 1). The curved lines in B and E represent the entropy of single-base frequency from random DNA sequences with different GC content. The genomes are represented by symbols according to the prokaryotes' ecology: squares for obligate pathogens and intracellular symbionts, circles for free-living and nonobligate pathogens. Complete genomes are represented by open symbols and uncompleted genomes by closed symbols. F, We have subtracted the values of the curved line in figure 1E (and fig. 1B) from the first-difference in entropy of 7 bp oligonucleotide frequencies (y-axis); the x-axis represents the GC content. The species are coded according to their taxonomy: closed squares, mycoplasmas; closed diamonds, mycobacteria; open squares, other gram+; circles, archaea; upper triangles, gram−; open diamonds, hyperthermophiles; star, spirochaetes; lower triangles, chlamydias; filled circles, cyanobacteria. The arrows are described in the text.

Figure 1A shows the entropy of the single-base frequency with respect to the GC content for the 36 complete genomes and 27 long genomic sequences (see description in Materials and Methods and table 1). We found that single-base entropies do not show significant deviations from a curve generated using random DNA sequences with different GC content (the curved line in fig. 1B and E). Figures 1B to E show the first-difference of entropy of 2, 3, 6, and 7 bp oligonucleotide frequencies (results for 4 and 5 bp do not add any interesting features). The genomes are represented by symbols according to the species' ecology. In figure 1F, we show the difference between the first-difference of entropy of 7 bp oligonucleotide frequencies and the values obtained using random DNA sequences (curved line in fig. 1E). Therefore, whereas for figures 1A to E the comparisons of entropy values should be made with respect to the curved line, in figure 1F the comparison is with the top line (value 0). Note that in figure 1F the species are classified according to their taxonomy. Results from figures 1E and F suggest that genomes are clustered neither by ecology nor by taxonomy. For many genome sequences, the first-difference of entropy is very large for both dinucleotides (distances from the curved line of fig. 1A) and trinucleotides (distances between dinucleotide positions in fig. 1B and trinucleotide positions in fig. 1C). There are few genomes that have almost equal or similar first-difference of entropy values at all oligonucleotide lengths n (n = 1, . . . , 7).
We found that all the Chlamydia sequences (lower triangles) have similar values of first-difference of entropy; there are negligible differences between Haemophilus influenzae and H. ducreyi and between Neisseria meningitidis strains MC58 and Z2491; there are small differences between the N. meningitidis strains and N. gonorrhoeae (second arrow from left in fig. 1B) and between the Helicobacter pylori strains J99 and 26695 (first arrow from left in fig. 1B). The genome sequences of H. pylori and Neisseria show larger first-difference of entropy for dinucleotides than the sequences with the same GC content (fig. 1B). The genome sequences of B. stearothermophilus, H. pylori, B. pseudomallei, and Thermotoga maritima show the largest first-difference of entropy for trinucleotides.

Our results show that, particularly for n > 4, the largest first-difference of entropy occurs for low–GC-content genomes, for example Ureaplasma urealyticum (first arrow from left in fig. 1E), the H. pylori strains (first arrow from left in fig. 1B; second arrow from left in fig. 1E; first arrow from the left in fig. 1F), the N. meningitidis strains (second arrow from left in fig. 1B; third arrow in fig. 1E; second arrow from the right in fig. 1F), N. gonorrhoeae (on the right-hand side of N. meningitidis), and high–GC-content genomes, such as Rhodobacter capsulatus (fourth arrow from the left in fig. 1E; third arrow from the left in fig. 1F) and Streptomyces coelicolor (first arrow from the right in fig. 1E). The genome sequence of Mycoplasma genitalium shows larger oligonucleotide first-difference of entropy than the other sequenced mycoplasma genomes, U. urealyticum and Mycoplasma pneumoniae (closed squares in fig. 1F). Among the high–GC-content genomes, Mycobacterium bovis and Mycobacterium tuberculosis have larger first-difference of entropy than Mycobacterium leprae (closed diamonds in fig. 1F). Although genes from these genomes share 80% amino acid identity on average, M. leprae has a lower GC content than the other two mycobacteria (Bellgard and Gojobori 1999). The genomes of M. bovis and M. tuberculosis have lower first-difference of entropy than the genomes of proteobacteria with similar GC content, such as Pseudomonas aeruginosa and the Bordetella sequences.

The comparison of genome sequences that have complementary GC-content percentages, for example, 35% and 65%, shows that low–GC-content genomes have smaller first-difference of entropy than high–GC-content genomes (fig. 1F). For example, the average first-differences of entropy for 7-bp oligonucleotides in genome sequences with GC content < 0.40, GC content between 0.4 and 0.6, and GC content > 0.6 are (standard deviation in parentheses) −0.042 (0.020), −0.046 (0.019), and −0.071 (0.015), respectively. This finding seems to be particularly relevant for proteobacteria (circles in fig. 1F). A further confirmation is given by the second-difference entropies of 7-bp oligonucleotides, $D(7) - D(5)$, for the same set of genomic sequences of figure 1 (fig. 2A). In figure 2B, we have subtracted entropies from random DNA sequences from the values in figure 2A. The largest values correspond to those of figure 1F: the two H. pylori strains (first arrow from left in figs. 2A and B), N. meningitidis and N. gonorrhoeae (second arrow from left in figs. 2A and B), and R. capsulatus (third arrow from the left in figs. 2A and B).

FIG. 2. Entropy measure of genomic sequences. A, Second-difference of entropy based on the frequencies of 7 bp oligonucleotides with respect to the average GC content in bacteria and archaea genomes (see Materials and Methods and table 1). Entropies of single-base frequency for the genome data set are coded as filled circles. The species are coded as in figure 1F. B, We have subtracted the values of the curved line in figure 2A from the second-difference in entropy of 7 bp oligonucleotides (y-axis); the x-axis represents the GC content. The species are coded as in figure 1F, and the arrows are described in the text.

We analyzed in detail the oligonucleotide frequencies in high– and low–GC-content genomes. The contrast in oligonucleotide frequencies for high– and low–GC-content genomes shows that, particularly for n > 4 bp, homopolymers of the type Gn and Cn are generally underrepresented in all genomes in the data set. We made use of density plots to analyze the distribution of distances between An, Tn, Cn, and Gn homopolymers. In figure 3, we show the density plots of the distances between A6 and T6 homopolymers in AT-rich genomes and the distances between C6 and G6 homopolymers in GC-rich genomes in a set of bacterial genomes. Each plot of figure 3 compares genomes with almost complementary GC content: Campylobacter jejuni (GC content ~31%) versus P. aeruginosa (~67%, fig. 3A); Borrelia burgdorferi (~29%) versus Deinococcus radiodurans (~67%, fig. 3B); Methanococcus jannaschii (~31%) versus Halobacterium sp. (~68%, fig. 3C); and Escherichia coli (~51%, fig. 3D). In the absence of any selection or mutation mechanism, we expect the GC-rich genomes to contain a number of Gn and Cn homopolymers roughly equal to the number of An and Tn homopolymers in AT-rich genomes, and vice versa. Instead, we found that the number of G6 and C6 homopolymers in GC-rich genomes is much lower than the number of A6 and T6 homopolymers in AT-rich genomes. All the plots show similar large differences between G6/C6 and A6/T6 distances in GC-rich and AT-rich genomes, respectively.

Within-Genome Composition Heterogeneity

Our study, based on a very large set of genomes from bacteria and archaea, searches for a correlation between genome mosaicity and ecology.

FIG. 3. Density plots of the distances between A6 and T6 homopolymers in low–GC-content genomes and between C6 and G6 homopolymers in high–GC-content genomes. A, Campylobacter jejuni (GC content ~31%) and P. aeruginosa (~67%); B, B. burgdorferi (~29%) and D. radiodurans (~67%); C, M. jannaschii (~31%) and Halobacterium sp. (~68%); D, E. coli (~51%). The x-axis shows the distances between homopolymers, and the y-axis shows the density.

In order to investigate the mosaic-like structure of genomes, we calculated the wavelet scalograms for the genome data set (fig. 4). We found quantitative and qualitative differences between the genomes of free-living species and nonchronic pathogens (figs. 4A and B) and those of intracellular parasites and symbionts (figs. 4C and D). Although the size of GC variations, i.e., the scales, depends on the sequence length because of the wavelet decomposition process (see Materials and Methods), the genomes of free-living species and nonobligate pathogens clearly show more mosaic-like structure than the genomes of obligate pathogens and intracellular symbionts. In particular, the genomes of the N. meningitidis strains show more mosaic-like structure than all the other genomes analyzed, even more than Synechocystis and E. coli (not shown) and other free-living bacteria. Mosaicity occurs at all the scales in the genomes of free-living species or nonchronic parasites (figs. 4A and B), whereas intracellular parasites' genomes have low or negligible values at coarser scales (figs. 4C and D). The maximum of the scalograms for intracellular parasites occurs at scales that correspond to the average gene length (table 3); it occurs at coarser scales for the genomes of free-living species and nonchronic pathogens.
This suggests that free-living species can incorporate large pieces of DNA from the environment, whereas for intracellular parasites the most common event is recombination between homologous genes. This has important consequences for the evolution of genome reduction, in that it explains the genome size reduction observed in intracytoplasmic genomes and mitochondria. Finally, we found a small but not negligible mosaicity in the genomes of intracytoplasmic symbionts. It is noteworthy that these results do not depend on the directional GC mutational pressure: for example, low–GC-content genomes (such as Buchnera sp.) and genomes with about 50% GC content (such as the chlamydiae) have similar genome mosaicity and ecology.

FIG. 4. Wavelet scalograms of archaea and eubacteria genome sequences. A, Synechocystis sp. (+), N. meningitidis MC58 (circles), V. cholerae (triangles); B, A. fulgidus (+), C. jejuni (circles), M. jannaschii (triangles); C, H. pylori (+), R. prowazekii (circles), Chlamydia pneumoniae J138 (triangles); D, B. burgdorferi (+), U. urealyticum (circles), Buchnera sp. (triangles). The x-axis shows the scale, and the y-axis shows the energy, which is an estimate of the amount of mosaicity at each scale.

Table 3
Numerical Results of Scalogram of Figure 4

Species (genome)         Length   Maximum   Scale   Bases
Synechocystis PCC6803    3.57     4,426     9       13,958
N. meningitidis          2.27     6,635     9       8,876
V. cholerae              1.03     2,173     10      7,878
A. fulgidus              2.18     2,083     8       17,018
C. jejuni                1.64     2,178     11      1,603
M. jannaschii            1.66     2,691     11      1,625
H. pylori                1.67     2,287     11      1,628
R. prowazekii            1.11     2,046     9       4,341
C. pneumoniae J138       1.23     1,368     11      1,199
B. burgdorferi           0.91     1,400     10      1,779
U. urealyticum           0.75     938       10      1,468
Buchnera sp.             0.64     1,034     10      1,251

NOTE. The entries are: species (free-living and nonobligate pathogens in bold), genome length (× 10^6 bp), maximum, scale and length of the corresponding GC variation (in bp), the area under the scalogram.

Discussion

Genome Composition and GC Mutational Pressure

The evolution of the DNA composition and structure of bacterial genomes depends on mutational and selection processes, such as GC mutation pressure, strand-bias mutation generated by DNA repair during transcription (Sueoka 1995), and the relative frequency of synonymous codons caused by the abundance of tRNAs (Sorensen, Kurland, and Pedersen 1989; Berg and Kurland 1997). Although much of the difference in oligonucleotide frequencies in a short sequence can be nonfunctional, at the genome-wide level it may have relevance in terms of genome organization, codon usage, gene expression, recombination, and mutation rate. The first differences of the entropy of dinucleotide frequencies may reflect structural constraints, for example, the stacking energies and the species-specific machinery for DNA repair and modification (Karlin, Campbell, and Mrazek 1998). Because coding regions generally constitute more than 90% of the bacterial genome, trinucleotide frequency diversity may be related to synonymous codon usage in coding regions. The genomes of N. meningitidis and H. pylori show larger values of the first difference of entropy than other bacteria and archaea with similar GC content. These genomes contain several large pathogenicity islands that have a different GC content from the nearby genomic regions (Tomb et al. 1997; Liò and Vannucci 2000; Parkhill et al. 2000; Tettelin et al. 2000). The results from the entropy analysis and the wavelet scalograms allow us to infer that the H.
pylori genome does not contain more GC heterogeneity than the other chronic pathogens, but it contains a few genomic regions with remarkably different short-oligonucleotide frequencies. This is in agreement with our previous findings (see fig. 2 in Liò and Vannucci 2000) and with the findings of Liu and co-workers (Liu et al. 1999), who showed that there are large differences in sequence composition in the CAG pathogenicity island with respect to the average composition of the H. pylori genome. The genome of N. gonorrhoea has entropy values similar to those of N. meningitidis; it may contain pathogenicity islands with similar base composition and size. R. capsulatus may have pathogenicity regions or alien genes too.

Low–GC-content genomes show less diversity in oligonucleotide frequencies than high–GC-content genomes. Some authors have shown that in E. coli and in other bacteria, homopolymers of the type An and Tn are more abundant than Cn and Gn (Dechering et al. 1998; Shomer and Yagil 1999). The results in figure 3 suggest that short distances (<2 kbp) between Cn and Gn are less abundant than short distances between An and Tn in bacterial genomes. Although several DNA mismatch and repair mechanisms are known to change oligonucleotide frequencies (see for example Deschavanne and Radman 1991), no mutation mechanism is known to act selectively on long Gn and Cn homopolymers. Because noncoding regions represent a small percentage of eubacterial and archaeal genomes, statistical analyses give little information about differences between homopolymers in coding regions with respect to the overall genome. The fact that bacterial mRNAs are generally polycistronic, with lengths of several kilobases, suggests the importance of a constraint on RNA secondary structure against Gn and Cn sticky patches.
This is in agreement with the work of Huynen and co-workers, who found that, in histone genes, the compensation of the G-C ratio indicates a selection pressure at the mRNA level rather than a selection pressure or mutation bias at the DNA level or a selection pressure on codon usage (Huynen, Konings, and Hogeweg 1992). It is also known that pairing of palindromic Gn and Cn patches serves as a transcription stop mechanism in several bacterial operons (see for example Lewin 1997, pp. 318–319).

Ecology and Genome Composition Heterogeneity: Molecular Evolution Implications

A possible explanation as to why the genomes of free-living species and nonchronic pathogens show more patchiness than the genomes of chronic pathogens or symbionts is that free-living bacteria and nonchronic pathogens experience a fluctuating and challenging environment, with selection that varies in time and place. The genomes of these species can easily exchange DNA segments and incorporate cassettes of resistance genes that allow them to face environmental changes. Probably, nonchronic pathogens have a high degree of mosaicity because of the pressure to keep the pathogenicity islands that are shared among bacterial species with different genome-wide base composition. Differences in the wavelet scalograms of free-living bacteria may also reflect different mechanisms for DNA uptake and recombination. Therefore, genomes such as those of D. radiodurans (GC% ≈ 67%) and S. coelicolor (GC% ≈ 72%), which are subjected to strong GC mutational bias, do not undergo a reduction in genome size and tRNA population. Instead, the genomes of chronic pathogens and symbionts that are subjected to GC pressure, such as Mycoplasma capricolum (GC content ~25%) and Micrococcus luteus (GC content ~75%), undergo a reduction in genome size and tRNA population (Muto et al. 1990; Kano et al. 1991; Andersson and Kurland 1995). The fact that the scalograms of the genomes of intracellular pathogens or symbionts such as Buchnera sp. (fig. 4D and table 3) show a small but not negligible amount of genome composition heterogeneity suggests that genetic recombination events occur. This finding is in agreement with the predictions of Wolf and co-workers for Rickettsiae and Chlamydiae (Wolf, Aravind, and Koonin 1999). The genome of Rickettsia prowazekii contains the largest amount of repeats (24%) among the bacterial genomes sequenced to date (Andersson et al. 1998). It is known that the presence of repeats increases the recombination rate, and this may explain the relatively large values of the GC variations.

Our analyses, based on a very large number of genome sequences, show how oligonucleotide composition is affected by GC mutational pressure and that genome mosaic-like structure depends on the history of gene transfer events and thus on the ecology of the species. The entropy measure and the wavelet scalogram are complementary methods for analyzing genome sequences and detecting the presence of alien genes, i.e., they can be used as preliminary analyses in the investigation of host-pathogen relationships. The alien genes can be located through the selective reconstruction of the GC plot from the scalogram using just a few scales, as shown in Liò and Vannucci (2000). Further improvements of this work will consider using the nondecimated or stationary version of the DWT, a modified transform in which the coefficients at each level are not subsampled.

Acknowledgments and Permissions for Using Genome Data

P.L. is supported by an EPSRC/BBSRC Bioinformatics Initiative grant. We thank Marina Vannucci and Nick Goldman for helpful suggestions. For the following species we have considered all nonredundant sequences available in GenBank and the sequences freely available through the web sites of several research institutions (we obtained permission to use data from ongoing genome projects). Sequence data of B. pertussis, B. bronchiseptica, B. parapertussis, B. pseudomallei, Corynebacterium diphtheriae, C. difficile, M. bovis, M. leprae, S. typhi, S. coelicolor, and Y. pestis are from sequencing groups at the Sanger Centre and can be found at www.sanger.ac.uk. Sequence data of N. gonorrhoea, S. pyogenes, A. actinomycetemcomitans, B. stearothermophilus, S. aureus, and S. mutans are from the University of Oklahoma (www.genome.ou.edu, dna1.chem.ou.edu/gono.html). We acknowledge the Gonococcal Genome Sequencing Project supported by USPHS/NIH grant AI38399, and L. A. Lewis, A. Gillaspy, R. McLaughlin, M. Gipson, T. Ducey, T. Ownbey, K. Hartman, C. Nydick, M. Carson, J. Vaughn, C. Thomson, L. Song, S. Lin, X. Yuan, F. Najar, M. Zhan, Q. Ren, H. Zhu, S. Qi, S. Kenton, H. Lai, J. White, S. Clifton, B. A. Roe, and D. W. Dyer. We also acknowledge the S. mutans Genome Sequencing Project funded by a USPHS/NIH grant from the Dental Institute and B. A. Roe, R. Y. Tian, H. G. Jia, Y. D. Qian, S. P. Linn, L. Song, R. E. McLaughlin, M. McShan, and J. Ferretti. L. pneumophila and N. punctiforme sequences are from the Columbia Genome Center (genome3.cpmc.columbia.edu); K. pneumoniae is sequenced by the Washington University Consortium (genome.wustl.edu/gsc/Projects/bacteria.shtml); H. ducreyi and M. maripaludis are sequenced by the University of Washington (www.htsc.washington.edu, kandisnky.genome.washington.edu/maripaludis); N. europaea and P. marinus are from DOE-JGI (www.jgi.doe.gov/JGIpmicrobial/html); P. abyssi from Genoscope (www.genoscope.cns.fr/Pab); R. capsulatus is sequenced by University of Chicago (capsulapedia.uchicago.edu), and R.
sphaeroides is sequenced by University of Texas (www-mmg.med.uth.tmc.edu/sphaeroides).

LITERATURE CITED

ALM, R. A., L. S. LING, D. T. MOIR et al. (23 co-authors). 1999. Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori. Nature 397:176–180.
ANDERSSON, S. G., and C. G. KURLAND. 1995. Genomic evolution drives the evolution of the translation system. Biochem. Cell Biol. 73:775–787.
ANDERSSON, S. G., A. ZOMORODIPOUR, J. O. ANDERSSON et al. (10 co-authors). 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396:133–140.
ARIÑO, M., and B. VIDAKOVIC. 1995. On wavelet scalograms and their applications in economic time series. Discussion paper 95-21, ISDS, Duke University.
ARNEODO, A., E. BACRY, P. V. GRAVES, and J. F. MUZY. 1995. Characterizing long-range correlations in DNA sequences from wavelet analysis. Phys. Rev. Lett. 74:3293–3296.
ARNEODO, A., Y. D’AUBENTON CARAFA, B. AUDIT, E. BACRY, J. F. MUZY, and C. THERMES. 1998. What can we learn with wavelets about DNA sequences. Physica A 249:439–448.
ARNEODO, A., Y. D’AUBENTON CARAFA, E. BACRY, P. V. GRAVES, J. F. MUZY, and C. THERMES. 1996. Wavelet based fractal analysis of DNA sequences. Physica D 1328:1–30.
BELLGARD, M. I., and T. GOJOBORI. 1999. Inferring the direction of evolutionary changes of genomic base composition. TiG 15:254–256.
BERG, O. G., and C. G. KURLAND. 1997. Growth rate-optimised tRNA abundance and codon usage. J. Mol. Biol. 270:544–550.
BERNARDI, G. 1993. The vertebrate genome: isochores and evolution. Mol. Biol. Evol. 10:186–204.
BLATTNER, F. R., G. PLUNKETT, C. A. BLOCH et al. (17 co-authors). 1997. The complete genome sequence of Escherichia coli K-12. Science 277:1453–1474.
BOLSHOY, A., and E. NEVO. 2000. Ecologic genomics of DNA: upstream bending in prokaryotic promoters. Genome Res. 10:1185–1193.
BULT, C. J., O. WHITE, G. J. OLSEN et al. (23 co-authors). 1996. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273:1058–1073.
CHIANN, C., and P. A. MORETTIN. 1998. A wavelet analysis for time series. J. Nonparametric Stat. 10:1–46.
CHUI, C. K. 1992. An introduction to wavelets. Academic Press, New York.
COLE, S. T., R. BROSCH, R. PARKHILL et al. (25 co-authors). 1998. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393:537–544.
DAUBECHIES, I. 1992. Ten lectures on wavelets. SIAM, Philadelphia.
DECHERING, K. J., K. CUELENAERE, R. N. KONINGS, and J. A. LEUNISSEN. 1998. Distinct frequency-distributions of homopolymeric DNA tracts in different genomes. Nucleic Acids Res. 26:4056–4062.
DECKERT, G., P. V. WARREN, T. GAASTERLAND et al. (15 co-authors). 1998. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392:353–358.
DESCHAVANNE, P., and M. RADMAN. 1991. Counterselection of GATC sequences in enterobacteriophages by the components of the methyl-directed mismatch repair system. J. Mol. Evol. 33:125–132.
DONOHO, D., and I. JOHNSTONE. 1994. Ideal spatial adaptation via wavelet shrinkage. Biometrika 81:425–455.
DONOHO, D., I. JOHNSTONE, G. KERKYACHARIAN, and D. PICARD. 1995. Wavelet shrinkage: asymptopia? (with discussion). J. R. Stat. Soc. Ser. B 57:301–369.
FLANDRIN, P. 1988. Time-frequency and time-scale. IEEE Fourth Annual ASSP Workshop on Spectrum Estimation and Modeling. Pp. 77–80. Minnesota, Minn.
FLEISCHMANN, R. D., M. D. ADAMS, O. WHITE et al. (10 co-authors). 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512.
FRASER, C. M., S. CASJENS, W. M. HUANG et al. (25 co-authors). 1997. Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390:580–586.
FRASER, C. M., J. D. GOCAYNE, O. WHITE et al. (10 co-authors). 1995. The minimal gene complement of Mycoplasma genitalium. Science 270:397–403.
FRASER, C. M., S. J. NORRIS, G. M. WEINSTOCK et al. (25 co-authors). 1998. Complete genome sequence of Treponema pallidum, the syphilis spirochete. Science 281:375–388.
GARCIA-VALLVÉ, S., A. ROMEU, and J. PALAU. 2000. Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Res. 10:1719–1725.
GLASS, J. I., E. J. LEFKOWITZ, J. S. GLASS, C. R. HEINER, E. Y. CHEN, and G. H. CASSELL. 2000. The complete sequence of the mucosal pathogen Ureaplasma urealyticum. Nature 407:757–762.
GRISHIN, N. V., Y. I. WOLF, and E. V. KOONIN. 2000. From complete genomes to measures of substitution rate variability within and between proteins. Genome Res. 10:991–1000.
HEIDELBERG, J. F., J. A. EISEN, W. C. NELSON et al. (33 co-authors). 2000. DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406:477–483.
HERZEL, H. 1988. Complexity of symbol sequences. Syst. Anal. Model. Simul. 5:435–441.
HIMMELREICH, R., H. HILBERT, H. PLAGENS, E. PIRKL, B. C. LI, and R. HERRMANN. 1996. Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res. 24:4420–4449.
HUYNEN, M., T. DANDEKAR, and P. BORK. 1998. Measuring genome evolution. Proc. Natl. Acad. Sci. USA 95:5849–5856.
HUYNEN, M. A., D. A. KONINGS, and P. HOGEWEG. 1992. Equal G and C contents in histone genes indicates selection pressures on mRNA secondary structure. J. Mol. Evol. 34:280–291.
KALMAN, S., W. MITCHELL, R. MARATHE et al. (10 co-authors). 2000. Comparative genomes of Chlamydia pneumoniae and C. trachomatis. Nat. Genet. 21:385–389.
KANEKO, T., S. SATO, H. KOTANI et al. (24 co-authors). 1996. Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res. 3:109–136.
KANO, A., Y. ANDACHI, T. OHAMA, and S. OSAWA. 1991. Novel anticodon composition of transfer RNAs in Micrococcus luteus, a bacterium with a high genomic G + C content. Correlation with codon usage. J. Mol. Biol. 221:387–401.
KARLIN, S., and V. BRENDEL. 1993. Patchiness and correlations in DNA sequences. Science 259:677–680.
KARLIN, S., and C. BURGE. 1995. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 11:283–290.
KARLIN, S., A. M. CAMPBELL, and J. MRAZEK. 1998. Comparative DNA analysis across diverse genomes. Annu. Rev. Genet. 32:185–225.
KARLIN, S., and J. MRAZEK. 1997. Compositional differences within and between eukaryotic genomes. Proc. Natl. Acad. Sci. USA 94:10227–10232.
KAWARABAYASI, Y., Y. HINO, H. HORIKAWA et al. (25 co-authors). 1999. Complete genome sequence of an aerobic hyper-thermophilic crenarchaeon, Aeropyrum pernix K1. DNA Res. 6:83–101.
KAWARABAYASI, Y., M. SAWADA, H. HORIKAWA et al. (25 co-authors). 1998. Complete sequence and gene organization of the genome of a hyper-thermophilic archaebacterium, Pyrococcus horikoshii OT3. DNA Res. 5:55–76.
KLENK, H. P., R. A. CLAYTON, J. F. TOMB et al. (25 co-authors). 1997. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390:364–370.
KUNST, F., N. OGASAWARA, I. MOSZER et al. (25 co-authors). 1997. The complete genome sequence of the gram-positive bacterium Bacillus subtilis. Nature 390:249–256.
LEWIN, B. 1997. Genes VI, Chap. 11. Oxford University Press Inc., New York.
LIÒ, P., S. RUFFO, A. POLITI, and M. BUIATTI. 1996. Analysis of genomic patchiness of Haemophilus influenzae and S. cerevisiae chromosomes. J. Theor. Biol. 183:455–469.
LIÒ, P., and S. RUFFO. 1998. Searching for genomic constraints. Il Nuovo Cimento D 20:113–127.
LIÒ, P., and M. VANNUCCI. 2000. Finding pathogenicity islands and gene transfer events in genome data. Bioinformatics 16:932–940.
LIU, G., T. K. MCDANIEL, S. FALKOW, and S. KARLIN. 1999. Sequence anomalies in the Cag7 gene of the Helicobacter pylori pathogenicity island. Proc. Natl. Acad. Sci. USA 96:7011–7016.
MALLAT, S. G. 1989. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattern Anal. Machine Intell. 11:674–693.
MARTIN, W. 1999. Mosaic bacterial chromosomes: a challenge en route to a tree of genomes. Bioessays 21:99–104.
MUTO, A., Y. ANDACHI, H. YUZAWA, F. YAMAO, and S. OSAWA. 1990. The organization and evolution of transfer RNA genes in Mycoplasma capricolum. Nucleic Acids Res. 18:5037–5043.
MUTO, A., and S. OSAWA. 1987. The guanine and cytosine content of genomic DNA and bacterial evolution. Proc. Natl. Acad. Sci. USA 84:166–169.
NEKRUTENKO, A., and W. H. LI. 2000. Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Res. 10:1986–1995.
NELSON, K. E., R. A. CLAYTON, S. R. GILL et al. (25 co-authors). 1999. Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature 399:323–329.
NG, W. V., S. P. KENNEDY, G. G. MAHAIRAS et al. (43 co-authors). 2000. Genome sequence of Halobacterium species NRC-1. Proc. Natl. Acad. Sci. USA 97:12176–12181.
OCHMAN, H., J. G. LAWRENCE, and E. A. GROISMAN. 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405:299–303.
PARKHILL, J., M. ACHTMAN, K. D. JAMES et al. (21 co-authors). 2000. Complete DNA sequence of a serogroup A strain of Neisseria meningitidis Z2491. Nature 404:502–506.
PARKHILL, J., B. W. WREN, and K. MUNGALL. 2000. The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature 403:665–668.
PEDERSEN, A. G., L. J. JENSEN, S. BRUNAK, H. H. STAERFELDT, and D. W. USSERY. 2000. A DNA structural atlas for Escherichia coli. J. Mol. Biol. 299:907–930.
READ, T. D., R. C. BRUNHAM, C. SHEN et al. (25 co-authors). 2000. Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39. Nucleic Acids Res. 28:1397–1406.
RUEPP, A., W. GRAML, M. L. SANTOS-MARTINEZ et al. (13 co-authors). 2000. The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum. Nature 407:508–513.
SCHMITT, A. O., and H. HERZEL. 1997. Estimating the entropy of DNA sequences. J. Theor. Biol. 188:369–377.
SCOTT, D. W. 1992. Multivariate density estimation. Theory, practice and visualization. John Wiley and Sons, New York.
SHIGENOBU, S., H. WATANABE, M. HATTORI, Y. SAKAKI, and H. ISHIKAWA. 2000. Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 407:81–86.
SHIRAI, M., H. HIRAKAWA, M. KIMOTO et al. (10 co-authors). 2000. Comparison of whole genome sequences of Chlamydia pneumoniae J138 from Japan and CWL029 from USA. Nucleic Acids Res. 28:2311–2314.
SHOMER, B., and G. YAGIL. 1999. Long W tracts are overrepresented in the Escherichia coli and Haemophilus influenzae genomes. Nucleic Acids Res. 27:4491–4500.
SILVERMAN, B. W. 1986. Density estimation for statistics and data analysis. Chapman & Hall, London.
SIMPSON, A. J. G., F. C. REINACH, P. ARRUDA et al. (115 co-authors). 2000. The genome sequence of the plant pathogen Xylella fastidiosa. Nature 406:151–157.
SMITH, D. R., L. A. DOUCETTE-STAMM, C. DELOUGHERY et al. (25 co-authors). 1997. Complete genome sequence of Methanobacterium thermoautotrophicum deltaH: functional analysis and comparative genomics. J. Bacteriol. 179:7135–7155.
SORENSEN, M. A., C. G. KURLAND, and S. PEDERSEN. 1989. Codon usage determines translation rate in Escherichia coli. J. Mol. Biol. 207:365–377.
STEPHENS, R. S., S. KALMAN, C. LAMMEL et al. (12 co-authors). 1998. Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis. Science 282:754–759.
STOVER, C. K., X. Q. PHAM, A. L. ERWIN et al. (31 co-authors). 2000. Complete genome sequence of Pseudomonas aeruginosa PA01, an opportunistic pathogen. Nature 406:959–964.
SUEOKA, N. 1962. On the genetic basis of variation and heterogeneity of DNA base composition. Proc. Natl. Acad. Sci. USA 48:582–588.
SUEOKA, N. 1992. Directional mutation pressure, selective constraints and genetic equilibria. J. Mol. Evol. 34:95–114.
SUEOKA, N. 1995. Intrastrand parity rules of DNA base composition and usage biases of synonymous codons. J. Mol. Evol. 40:318–325.
TAKAMI, H., K. NAKASONE, Y. TAKAKI et al. (12 co-authors). 2000. Complete genome sequence of the alkaliphilic bacterium Bacillus halodurans and genomic comparison with Bacillus subtilis. Nucleic Acids Res. 28:4317–4331.
TEKAIA, F., A. LAZCANO, and B. DUJON. 1999. The genomic tree as revealed from whole genome comparisons. Genome Res. 9:550–557.
TETTELIN, H., N. J. SAUNDERS, J. HEIDELBERG et al. (42 co-authors). 2000. Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science 287:1809–1815.
TOMB, J. F., O. WHITE, A. R. KERLAVAGE et al. (25 co-authors). 1997. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388:539–547.
VANNUCCI, M., and P. LIÒ. 2001. Wavelet analysis of biological sequences: applications to protein structure and genomics. Sankhya Ser. B 63:204–219.
WHITE, O., J. A. EISEN, J. F. HEIDELBERG et al. (25 co-authors). 1999. Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1. Science 286:1571–1577.
WOLF, Y. I., L. ARAVIND, and E. V. KOONIN. 1999. Rickettsiae and Chlamydiae: evidence of horizontal gene transfer and gene exchange. TiG 15:173–175.

WILLIAM TAYLOR, reviewing editor

Accepted October 3, 2001