RESEARCH ARTICLES
Environmental Sequence Data from the Sargasso Sea Reveal That the
Characteristics of Genome Reduction in Prochlorococcus Are Not a Harbinger
for an Escalation in Genetic Drift
Jinghua Hu* and Jeffrey L. Blanchard *Department of Electrical and Computer Engineering, University of Massachusetts, Amherst; and Department of Microbiology,
University of Massachusetts, Amherst
The marine cyanobacterium Prochlorococcus MED4 has the smallest sequenced genome of any photosynthetic
organism. Prochlorococcus MED4 shares many genomic characteristics with chloroplasts and bacterial endosymbionts,
including a reduced coding capacity, missing DNA repair genes, a minimal transcriptional regulatory network, a marked
AT% bias, and an accelerated rate of amino acid changes. In chloroplasts and endosymbionts, these molecular
phenotypes appear to be symptomatic of a relative increase in genetic drift due to restrictions on effective population size
in the host environment. As a free-living bacterium, Prochlorococcus MED4 is not known to be subject to similar
ecological constraints. To test whether the high-light-adapted Prochlorococcus MED4 is experiencing a reduction in
selection efficiency resulting from genetic drift, we examine two data sets, namely, the environmental genome shotgun
sequencing data from the Sargasso Sea and a set of cyanobacterial genome sequences. After integrating these data sets,
we compare the evolutionary profile of a high-light Prochlorococcus group to that of a group of Synechococcus (a
closely related group of marine cyanobacteria) that does not exhibit a similar small-genome syndrome. The average
pairwise dN/dS ratios in the high-light-adapted Prochlorococcus group are significantly lower than those in the
Synechococcus group, leading us to reject the hypothesis that the Prochlorococcus group is currently experiencing higher
levels of genetic drift.
Introduction
The marine cyanobacterial group Prochlorococcus
was not discovered until 20 years ago because its members' small
size and unusual photosynthetic pigment composition
evaded many of the detection methods commonly used
to identify and enumerate bacteria (Chisholm et al.
1988). They are now considered to be the most numerically
abundant photosynthetic organisms in the ocean (Partensky
et al. 1999) and thus to constitute a fundamental player in
the ocean carbon cycle. To gain insight into the ecological
role of the Prochlorococcus group, complete genome sequences have been obtained for several representative
members. The Prochlorococcus genome size mirrors the
organism’s diminutive physical size. For example, the genome of Prochlorococcus MED4 is the smallest found thus
far in any photosynthetic organism and contains only 1,716
protein-coding regions (Rocap et al. 2003).
As a group of free-living bacteria that exhibits both
genome reduction and a genome-wide acceleration of protein
evolutionary rate, Prochlorococcus represents an interesting
system in which to examine the evolutionary forces at play in
genome reduction. The small, streamlined genome of Prochlorococcus MED4 may reflect specialization for growth
in relatively stable oligotrophic water (Rocap et al. 2003;
Garcia-Fernandez et al. 2004; Martiny et al. 2006); selection
for metabolic economy (Dufresne et al. 2003; Garcia-Fernandez et al. 2004; Dufresne et al. 2005); an increased mutation
rate that has resulted in a loss of low fitness genes (Marais
et al. 2007); or selection for small cell size and/or increased
buoyancy, both of which are important for life near the top
of the water column (Dufresne et al. 2005).
Key words: genetic drift, metagenomics, Prochlorococcus, cyanobacteria, endosymbionts, genomics.
E-mail: [email protected].
Mol. Biol. Evol. 26(1):5–13. 2009
doi:10.1093/molbev/msn217
Advance Access publication October 8, 2008
© The Author 2008. Published by Oxford University Press on behalf of
the Society for Molecular Biology and Evolution. All rights reserved.
For permissions, please e-mail: [email protected]
Another hypothesis, assigning a greater role to genetic
drift, is supported by the striking genomic similarities
among chloroplasts, endosymbionts, and Prochlorococci.
Analysis of the genome content of Prochlorococcus
MED4 has revealed similarities to chloroplast genomes
(Hess et al. 2001), although chloroplasts are not thought
to be derived from within the Prochlorococci. Chloroplasts
are derived from free-living cyanobacteria that have become integrated into the cellular architecture of all plants
and algae, with genomes substantially smaller than those
of free-living cyanobacteria, as the result of the transfer
of genes to the nucleus and the loss of other genes that
may not have been needed inside of the host cell. The chloroplast genome is haploid and is inherited uniparentally and
transmitted asexually, with limited opportunity for recombination with genetically distinct lineages. Moreover, chloroplasts undergo frequent population bottlenecks during
germ-line transmission, and thus, the effective population
size for chloroplast genes is smaller than that for genes
in the nucleus of the same organism. As a result, there is
an increase in genetic drift (thus a decrease in the efficiency
of selection) among chloroplast genes, resulting in the accumulation of slightly deleterious mutations (Lynch and
Blanchard 1998; Blanchard and Lynch 2000). A similar
phenomenon exists in mitochondria, which are also derived
from free-living bacteria (Andersson and Kurland 1998;
Lynch and Blanchard 1998; Blanchard and Lynch 2000).
An increase in genetic drift has also been demonstrated
in other endosymbiotic genomes. Buchnera is a group of
obligate bacterial endosymbionts of insects. The Buchnera
genome is not as tightly integrated with the host cell as a
mitochondrial or chloroplast genome, but, like chloroplasts
and mitochondria, Buchnera has a very reduced genome
size. Extensive testing of molecular data from the insect
bacterial endosymbionts Buchnera and Wigglesworthia indicates that genetic drift, not selection, is responsible for an
increase in AT% content, a decrease in codon bias, and an
acceleration in the rate of protein and DNA evolution
(Moran 1996; Wernegreen and Moran 1999; Funk et al.
2001; Moran and Mira 2001; Silva et al. 2001; Wernegreen
et al. 2001; Abbot and Moran 2002; Berg and Kurland
2002; Gil et al. 2002; Palacios and Wernegreen 2002;
Wernegreen et al. 2002; Herbeck et al. 2003; Herbeck,
Wall, and Wernegreen 2003; van Ham et al. 2003;
Wernegreen and Funk 2004; Schaber et al. 2005).
We can directly test the relative role of genetic drift
versus natural selection in Prochlorococcus gene evolution
using nucleotide sequence data. A priori, we do not expect
Prochlorococcus ecotypes to have small population sizes
because they are numerically abundant in the open ocean
(Ahlgren et al. 2006; Johnson et al. 2006). There is nothing
to suggest that Prochlorococcus has formed an obligate
symbiosis and it can be grown axenically in the laboratory
(Moore et al. 1995). However, we have a very limited understanding of what constitutes a bacterial population in the
open ocean. Furthermore, it is possible that unusual, as-yet
unidentified aspects of Prochlorococcus population biology and ecology could have resulted in an unusually high
level of genetic drift in Prochlorococcus MED4.
To test the relative role of genetic drift in Prochlorococcus evolution, we first explored the availability of genomic
and metagenomic sequence data. We developed a new
method for filtering the environmental genome shotgun sequencing data from the Sargasso Sea, a habitat rich in Prochlorococcus high-light-adapted ecotypes. The resulting
metagenomic data set was then integrated with a set of all
the completely sequenced cyanobacterial genomes, and
the combined data set was used to compare the evolutionary
profile of the high-light-adapted Prochlorococcus group
with that of a closely related group of marine Synechococcus.
Materials and Methods
Data Sets
The National Center for Biotechnology Information
(NCBI) set of complete published microbial genomic DNA
and protein sequences, including those of Prochlorococcus
MED4, Prochlorococcus SS120, Prochlorococcus
MIT9313, and Synechococcus WH8102, was downloaded in
January 2005. At the same time, the complete genome sequences of Prochlorococcus MIT9312, Prochlorococcus NATL2A, Synechococcus CC9902, Synechococcus CC9605, and
the freshwater Synechococcus PCC7942 were downloaded
from the draft genome section of NCBI. Each of these five genomes has been assembled into a single circular contiguous sequence. Publications describing the genome sequences of
Prochlorococcus MIT9312 (Coleman and Chisholm 2007)
and Prochlorococcus NATL2A have subsequently appeared,
but the published nucleotide and protein sequences are identical
to the draft sequences that we downloaded in 2005. The environmental shotgun sequence data set from the Sargasso Sea
(Venter et al. 2004), consisting of 1,986,782 unassembled reads
totaling nearly 2.0 Gbp, was obtained in January 2005 from the
NCBI environmental sequence database (ftp://ftp.ncbi.nlm.nih.gov/pub/TraceDB/environmental_sequence/).
Generation of Orthologous Gene Sets
A set of orthologous genes that are shared among the
genomes of Prochlorococcus MED4, Prochlorococcus
MIT9312, Prochlorococcus NATL2A, Prochlorococcus
SS120, Prochlorococcus MIT9313, Synechococcus
WH8102, Synechococcus CC9902, Synechococcus
CC9605, and the freshwater Synechococcus PCC7942
was derived using the Genome Flux Analysis program
(Tolopko and Blanchard, unpublished data). This program
uses a phylogenetic tree as input and deploys rule-based inferences to determine orthologous gene sets based on raw
Blast scores. Briefly, the Genome Flux Analysis program
proceeds by using BlastP to find highly similar protein sequences among the phylogenetic focus group relative to
other published microbial genomes. In this case, the phylogenetic focus group consists of the nine cyanobacterial
genomes listed above, which we also refer to as our “reference genomes.” The inclusion of the larger set of published microbial genomes during BlastP allows a low bit
score cutoff threshold to be used to generate a “top hit list”
for each query based on the raw BlastP results.
To be considered an orthologous gene group, the
genes in the top-hit list for the query must meet two criteria.
First, each of the genes coming from reference genomes
must meet a BlastP bit score threshold of 40. If one or more
of the genes from reference genomes do not meet this
threshold for similarity significance, it is assumed that
the protein sequence is not conserved across all reference
genomes and the genes cannot form an orthologous gene
group. Second, the scores of reference genes must be higher
than those of the larger set of genes that originate from other
microbial genomes. We also eliminate any gene whose evolutionary history is complicated by putative horizontal gene
transfer or gene duplication events. This results in a conservative, high-quality gene set that likely underestimates the
true number of orthologous genes because we removed
rather than resolved genes flagged as putative horizontal
gene transfer or gene duplication events and because we
required all taxa to contain the orthologous gene.
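The two criteria above can be illustrated with a minimal sketch. This is not the Genome Flux Analysis program itself, which is unpublished. The function name, the (genome, bit score) layout of the top-hit list, and the use of a repeated hit in one reference genome as a crude stand-in for a duplication flag are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

BIT_SCORE_THRESHOLD = 40.0  # BlastP bit score threshold stated in the text


def is_orthologous_group(
    top_hits: List[Tuple[str, float]],
    reference_genomes: List[str],
) -> bool:
    """Apply the two stated criteria to one query's top-hit list.

    top_hits holds (genome_name, bit_score) pairs (hypothetical layout).
    """
    ref_set = set(reference_genomes)
    if not ref_set:
        return False
    ref_scores: Dict[str, float] = {}
    other_scores: List[float] = []
    for genome, score in top_hits:
        if genome in ref_set:
            # Assumption: a second hit in the same reference genome is
            # treated as a putative duplication, so the group is rejected.
            if genome in ref_scores:
                return False
            ref_scores[genome] = score
        else:
            other_scores.append(score)
    # All reference taxa must contain the gene.
    if set(ref_scores) != ref_set:
        return False
    weakest_reference = min(ref_scores.values())
    # Criterion 1: every reference gene meets the bit score threshold.
    if weakest_reference < BIT_SCORE_THRESHOLD:
        return False
    # Criterion 2: reference scores exceed those of all other genomes.
    if other_scores and max(other_scores) >= weakest_reference:
        return False
    return True
```

Rejecting a group outright, rather than trying to resolve it, mirrors the conservative choice described in the text.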
Calculation of Synonymous and Nonsynonymous
Substitution Rates
Orthologous protein sequences were aligned using
POA (Lee et al. 2002), and the corresponding nucleotide
sequences were mapped onto the protein alignments using
custom Perl scripts to generate sets of aligned nucleotide
sequences. Nucleotide and protein distances were calculated
using “dnadist” and “protdist,” respectively, in the PHYLIP (Felsenstein
1989) package. Synonymous and nonsynonymous substitution rates were calculated using the “yn00” program
(Yang and Nielsen 2000) in the PAML package (Yang
1997), which implements a maximum likelihood method
based on the HKY85 (Hasegawa et al. 1985) model.
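The mapping of nucleotide sequences onto protein alignments can be sketched as follows. The original work used custom Perl scripts that are not shown; the Python function below is a hypothetical reconstruction of the general technique, and it assumes that each coding sequence is in frame and encodes exactly the residues of its aligned protein, optionally followed by a stop codon.

```python
def map_codons_to_protein_alignment(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Thread the codons of `cds` onto a gapped protein sequence.

    Each residue becomes its codon and each protein gap becomes three
    nucleotide gaps, so the codon alignment preserves the reading frame.
    """
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    residues = sum(1 for aa in aligned_protein if aa != gap)
    # Assumption: a single trailing codon beyond the protein is a stop codon.
    if len(codons) == residues + 1:
        codons = codons[:-1]
    if len(codons) != residues:
        raise ValueError("coding sequence length does not match the protein")
    codon_iter = iter(codons)
    return "".join(gap * 3 if aa == gap else next(codon_iter) for aa in aligned_protein)
```

For example, threading `ATGAAAGTT` onto the aligned protein `M-KV` gives `ATG---AAAGTT`.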
A Phylogenetic Focus Group-Based Environmental
Sequence-Filtering Framework
A phylogenetic focus group-based environmental sequence-filtering framework was developed to attempt to
correct for the potential biases caused by generic Blast
searches in extracting sequences belonging to the Prochlorococcus/Synechococcus groups from the collection of unassembled reads in the Sargasso Sea data (Hu J and
Blanchard JL, unpublished data). The application of the
phylogenetic focus group-based sequence-filtering algorithm to this study consists of the following major steps:
1) Identifying the orthologous reference gene table from
the complete genomes using the Genome Flux Analysis
program described above; 2) Generating nucleotide and
protein distance metrics for each orthologous reference
gene based on mutual Blast searches within orthologous
genes; 3) Running BlastN and BlastP using the orthologous
reference genes as queries against the environmental genome shotgun sequencing (EGSS) database for
initial sequence selection and collecting the results as
a query set for reverse Blast against orthologous reference
genes; and 4) Filtering the reduced EGSS data based on
the relative distances between the EGSS sequences and
the reference genes. The two-dimensional relative distance metrics were defined based on the range of genomic
variations within the orthologous reference gene trees and
across the reference genomes. With the strict sequence selection criteria adopted during this step, we obtained a conservative collection of EGSS sequences that fall within
very close distances to the orthologous reference genes.
These EGSS sequences were first aligned with their orthologous reference genes by aligning the best match from
the six-frame translations of the EGSS sequence to the full-length proteins of the reference genes. Then, the protein
alignments were mapped back to the corresponding nucleotide EGSS sequence alignments.
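As an illustration of the six-frame translation mentioned above, the following minimal sketch translates a read in all three forward and three reverse-complement frames. The function names and the frame labels are assumptions. The codon table uses the standard amino acid assignments, which the bacterial genetic code shares, and any codon containing an ambiguous base is rendered as X.

```python
from itertools import product
from typing import Dict

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard codon table built in TCAG order; '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    # Incomplete trailing codons are ignored; unknown codons become 'X'.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translations(read: str) -> Dict[str, str]:
    """Return translations keyed by frame label: +1, +2, +3, -1, -2, -3."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Selecting the best-matching frame against the reference protein would then be a separate alignment step, which is not sketched here.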
There were a number of problems associated with the
quality of these initial EGSS alignments due to the EGSS
sequences not spanning the full length of most genes and
due to lower quality of the EGSS sequences at their terminal
regions. To address these problems, we further developed an
algorithm for sequence trimming and final sequence selection to improve the quality of the alignments and to increase
the number of usable sequences spanning the same range of
the alignment. This algorithm consists of the following steps:
1) Using nucleotide EGSS sequence alignments described
above as the input, trimming the aligned sequences based
on the alignment range as determined by reference genes.
(We refer to these initially trimmed alignments as “raw gene
blocks.”) 2) Using raw gene blocks as input, removing all
gaps in the reference gene sequences and in corresponding
positions in the aligned EGSS sequences. (We refer to the
output of this step as “full-length gene blocks.”) 3) Segmenting the full-length gene blocks into smaller blocks 300 bp in
length, which we refer to as “partial gene blocks.” 4) For
each full-length or partial gene block, discarding EGSS
sequences that have more than 20% gaps and, as a final
selection criterion, discarding any gene block that contains
fewer than 30 sequences. 5) Translating the selected gene
blocks to protein sequence alignment blocks.
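Steps 2 through 4 of the trimming algorithm can be sketched as below. This is a simplified illustration, not the authors' implementation. It assumes a single gapped reference sequence per block, equal-length aligned sequences stored in a dictionary, and that a final segment shorter than 300 bp is kept as is; the function names are hypothetical. The text does not say whether reference genes count toward the 30-sequence minimum, so the sketch simply counts whatever sequences it is given.

```python
from typing import Dict, List


def strip_reference_gaps(reference: str, reads: Dict[str, str], gap: str = "-") -> Dict[str, str]:
    """Step 2: drop every alignment column in which the reference has a gap."""
    keep = [i for i, base in enumerate(reference) if base != gap]
    return {name: "".join(seq[i] for i in keep) for name, seq in reads.items()}


def partial_gene_blocks(
    reads: Dict[str, str],
    block_len: int = 300,           # segment length stated in the text
    max_gap_fraction: float = 0.20,  # sequences with more gaps are discarded
    min_sequences: int = 30,         # blocks with fewer sequences are discarded
    gap: str = "-",
) -> List[Dict[str, str]]:
    """Steps 3 and 4: segment a full-length block and apply both filters."""
    if not reads:
        return []
    length = len(next(iter(reads.values())))
    blocks = []
    for start in range(0, length, block_len):
        segment = {}
        for name, seq in reads.items():
            piece = seq[start:start + block_len]
            if piece and piece.count(gap) / len(piece) <= max_gap_fraction:
                segment[name] = piece
        if len(segment) >= min_sequences:
            blocks.append(segment)
    return blocks
```

Filtering per segment rather than per full-length gene is what lets reads that cover only part of a gene still contribute to the analysis.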
Results
Acceleration in the Rate of Protein Evolution within the
Prochlorococci
The relative contribution of natural selection to genome evolution can be measured using molecular data.
A decrease in the efficiency of natural selection due to genetic drift will result in an increase in the genome-wide rate
of nonsynonymous site evolution and correspondingly the
rate of protein sequence evolution because of the accumulation of deleterious mutations. To test whether a reduction
in genome size in the high-light-adapted Prochlorococcus
clade might be related to reduction of selection efficiency
owing to genetic drift, we compared relative changes in
rates of nucleotide and amino acid evolution in Prochlorococcus with a closely related sister group, the marine Synechococcus. For measurements of the rates of protein
evolution relative to a common reference point, the marine
Synechococcus and Prochlorococcus proteins were compared with an outgroup, the freshwater Synechococcus
PCC7942. The phylogenetic relationship among these taxa
constructed from the 16S rRNA gene is shown in figure 1
along with their genome sizes.
A data set of orthologous genes from the eight Synechococcus and Prochlorococcus complete genome sequences was constructed using the Genome Flux Analysis
program described above. This workflow resulted in
1,156 orthologous genes distributed across the genome.
We refer to this data set as the complete genome data
set or reference orthologous genes. This set of orthologous
genes covers 67% of the 1,716 genes found in the smallest
genome, that of Prochlorococcus MED4. The relative rates
of protein evolution were calculated using protdist based on
aligned sequences (table 1). The rate of protein sequence
evolution is greatest in Prochlorococcus MED4 and
MIT9312, whereas Prochlorococcus SS120 and NATL2A
have intermediate rates. In contrast, the rates of protein evolution are very similar in Prochlorococcus MIT9313 and
the marine Synechococcus. Thus, the speedup in the rate
of protein evolution appears to be localized within the Prochlorococcus clade.
Genomic Sequence Synonymous Distances between
Genomes Are at or near Saturation
To determine whether the change in rate of protein
evolution is due to a change in mutation rate or to a change
in selection efficiency arising from genetic drift, the orthologous protein sequences derived from completed genomes
were used as a template to create an alignment of the corresponding nucleotide sequences. There are dramatic differences in the nucleotide GC% content (table 1) and
amino acid usage (data not shown) in this data set.
The synonymous substitutions in most comparisons
among the eight genomes were too numerous to yield reliable synonymous distance estimates. The four pairwise
comparisons with the lowest average dS estimates are
shown in table 2, indicating significantly lower dN/dS ratios
in Prochlorococcus than in Synechococcus (Student’s
t-test, P = 0.000). As the sequence divergence between
two taxa grows larger, it becomes more difficult to accurately account for multiple substitutions. In our complete
genome data set, the values of dS are very high and thus
can be difficult to estimate even with correction for multiple
substitutions. In addition, although PAML does account for
certain nucleotide biases, forms of bias similar to those
found in our genome data set can still lead to inaccurate
estimates (Aris-Brosou and Bielawski 2006). Rather than
resolving the differences between the models, we sought
to create a data set of more closely related sequences, which
would be less subject to model-dependent estimates of multiple substitutions.
FIG. 1.—Phylogeny and genome sizes of cyanobacteria. A phylogeny of the Prochlorococcus/Synechococcus group based on Neighbor-Joining
analysis of the 16S rRNA gene using HKY85 to calculate the distance matrix. In all, 1,000 bootstrap replicates were performed.
Filtering the Environmental Genome Shotgun
Sequencing Data into Blocks of Orthologous Protein-Coding Sequences
An environmental genome shotgun sequencing data
set from the Sargasso Sea contains copious reads of Prochlorococcus and Synechococcus sequences (Venter
et al. 2004) that are expected to show less sequence divergence than the available complete genome sequences. We
refer to this data set as the EGSS data set. To utilize this data
set, we divided the computational task into several steps,
including filtering environmental sequences into taxonomic
clades, aligning environmental sequences to orthologous
reference sequences in the taxonomic clades, trimming the
aligned sequences into short sequence blocks to fully utilize
the EGSS sequence alignments, and summarizing the
patterns of within- and between-group measures of genetic
variation (e.g., dN/dS ratios) for EGSS sequence blocks.
Table 1
Protein Distances Relative to Synechococcus PCC7942 and
Nonsynonymous GC% for Orthologous Genes Derived from
the Complete Cyanobacteria Genomes
Rates of Protein                 Protein Distance    GC%
Evolution        Taxa            Mean     SD         Mean     SD
Fast             Pro MED4        0.92     0.46       0.38     0.05
Fast             Pro MIT9312     0.92     0.47       0.39     0.05
Intermediate     Pro NATL2A      0.84     0.42       0.43     0.04
Intermediate     Pro SS120       0.83     0.41       0.44     0.04
Slow             Pro MIT9313     0.74     0.37       0.54     0.03
Slow             Syn CC9902      0.73     0.36       0.55     0.04
Slow             Syn CC9605      0.72     0.36       0.56     0.04
Slow             Syn WH8102      0.72     0.36       0.57     0.04
                 Syn PCC7942     0        0          0.54     0.03
To derive a set of homologous sequences to estimate
nonsynonymous and synonymous distances, we started
with the collection of 1,986,782 unassembled EGSS reads
totaling about 2.0 Gbp. We developed a phylogenetic focus
group-based sequence-filtering framework to reduce the
EGSS data into Prochlorococcus- and Synechococcus-like
reads that are homologous to genes in the complete genome
orthologous gene table described above. During step 3 as
described in the Materials and Methods, with forward Blast
(E value = 1.0 × 10^-10) and reverse Blast (E value =
1.0), the EGSS data set was reduced into 513,608 reads.
During step 4, the EGSS data set was further reduced into
25,655 reads. Because the EGSS reads are mostly 1–2 kbp
in length, and most of them only partially overlap with the
full-length reference protein sequences of orthologous
genes, we only obtained a small set of full-length gene
blocks from the EGSS data. In order to better utilize the
EGSS data, we applied our sequence-trimming algorithm
to the aligned EGSS sequences to create partial gene blocks.
After final selection on both full-length and partial gene
blocks, we generated a high-quality collection of aligned
population-level sequence segments of reference genomes
for further exploration. The nucleotide sequences were then
substituted for the protein sequences and used for calculating synonymous and nonsynonymous substitution rates.
Out of the initial 1,156 entries in the orthologous gene table,
we obtained 46 full-length gene blocks and 996 partial gene
blocks, each including at least 30 sequences. (We chose 30
as the block size cutoff in order to balance the need for high-quality blocks with the requirements of our statistical analysis.) These partial EGSS gene blocks cover a total of
483 distinct genes and thus are considered to represent a
genome-wide distribution.
Table 2
Pairwise Comparison of dN, dS, and dN/dS Ratios Calculated by PAML for Orthologous Genes Derived from the Complete
Cyanobacteria Genomes
                                      dN                dS                dN/dS
Group            Pair                 Mean     SD       Mean     SD       Mean     SD
Prochlorococcus  (MED4, MIT9312)      0.102    0.063    2.998    1.021    0.038    0.028
Synechococcus    (CC9605, CC9902)     0.142    0.101    2.283    0.940    0.070    0.056
Synechococcus    (CC9605, WH8102)     0.133    0.107    1.848    0.725    0.076    0.064
Synechococcus    (CC9902, WH8102)     0.152    0.110    2.304    0.915    0.075    0.061
An example of the gene phylogeny in a full-length
EGSS gene block is shown in figure 2, created by using
the UPGMA method based on nucleotide distances among
the sequences. As shown in this figure, the majority of the
EGSS gene copies are associated with the high-light-adapted
Prochlorococcus clade and are mostly similar to Prochlorococcus MIT9312. In contrast, fewer EGSS gene copies are
associated with the Synechococcus clade. That kind of species
composition is typical of all EGSS gene blocks examined in
this study. The pairwise dN and dS values among all sequences from this example were calculated by PAML and are
shown in figure 3.
Summarization of Independent EGSS Gene Blocks
Reveals That the dN/dS Ratios in the Pro-group Are
Significantly Lower than Those in the Syn-group
Because the EGSS sequence reads are not linked to
any particular individual organism as in the complete genome sequences, we applied general regression models
to summarize the patterns of within-group dN/dS ratios
for EGSS sequence blocks. For each EGSS sequence block,
we examine two subgroups, that is, the Pro-group for sequences matching Prochlorococcus MED4 and MIT9312
reference genes and the Syn-group for sequences matching
the three marine Synechococcus reference genes. Based on
the linear patterns in (dS, dN) plots, for each EGSS gene
block, we fitted straight lines to the scattered dots that represent pairwise within-group (dS, dN) values for the Pro-group and the Syn-group, respectively. The slope values
are calculated from linear regression, representing the averaged within-group dN/dS ratios. An example of a dN/
dS calculation is shown in figure 3, in which the slope value
(i.e., dN/dS) is 0.023 for the Pro-group and 0.073 for the
Syn-group. In this example, the R-squared statistic for linear regression is 0.83 for the Pro-group and 0.86 for the
Syn-group.
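The slope-based summary described above can be illustrated with a minimal ordinary least squares fit of dN on dS. The text does not state whether the fitted line includes an intercept, so the sketch below assumes that it does; the function name and the input layout are hypothetical.

```python
from typing import List, Tuple


def fit_dn_ds_line(points: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Fit dN = slope * dS + intercept by ordinary least squares.

    points holds pairwise (dS, dN) values for one group in one gene block.
    Returns (slope, intercept, r_squared); the slope plays the role of the
    averaged within-group dN/dS ratio.
    """
    n = len(points)
    if n < 2:
        raise ValueError("at least two (dS, dN) points are required")
    mean_ds = sum(ds for ds, _ in points) / n
    mean_dn = sum(dn for _, dn in points) / n
    sxx = sum((ds - mean_ds) ** 2 for ds, _ in points)
    syy = sum((dn - mean_dn) ** 2 for _, dn in points)
    sxy = sum((ds - mean_ds) * (dn - mean_dn) for ds, dn in points)
    if sxx == 0:
        raise ValueError("all dS values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_dn - slope * mean_ds
    # For simple linear regression, R-squared is the squared correlation.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

Calling the function once for the Pro-group points and once for the Syn-group points of a block would give the two slopes being compared.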
We collected the estimated dN/dS ratios for Pro-group
and Syn-group for all sequence blocks and determined the
distribution of dN/dS ratios, using only those ratios in
which the dS value was 1.5. Although an even smaller
dS window would enhance the reliability of our dN/dS estimates, we chose this particular dS cutoff because the Synechococcus clade is only sparsely represented (in contrast to
the abundance of Prochlorococcus MIT9312 sequences) in
the EGSS data. Had we chosen a smaller dS window, we
could not have assembled enough Synechococcus data to
allow us to calculate dN/dS values.
In table 3, we summarize the calculation of dN/dS ratios based on EGSS gene blocks using PAML, including
the mean values and standard deviations (SDs) for dN/dS in the Pro-group and the Syn-group, the t-test P values for comparing the averages of the two distributions, and the R-squared statistics for measuring the goodness of fit in linear
regression. The summary indicates that the average pairwise dN/dS ratios in the high-light-adapted Pro-group
are significantly lower than those in the Syn-group, based
on the full-length gene blocks as well as the partial gene
blocks. In addition, the SD of the dN/dS ratios for the
Pro-group is smaller than that for the Syn-group. The comparison results fail to support a role for genetic drift as a major contributor to genome reduction in Prochlorococcus
and are consistent in that regard with the results for complete genomes shown in table 2.
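The comparison of the two dN/dS distributions can be sketched with a two-sample Student's t statistic. The sketch assumes the classical pooled-variance (equal variance) form and unpaired samples, neither of which is confirmed by the text, and it returns only the statistic and the degrees of freedom; converting these to a P value would require a t-distribution function from a statistics library.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pooled_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Two-sample Student's t statistic with pooled variance.

    a and b hold per-block dN/dS ratios for the two groups (hypothetical
    layout). Returns (t, degrees_of_freedom).
    """
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two values")
    df = na + nb - 2
    # Pooled estimate of the common variance, weighted by degrees of freedom.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return t, df
```

A negative statistic with the Pro-group as the first argument would correspond to lower average dN/dS ratios in that group.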
The Effect of Changing the Range of Synonymous
Substitution Distances on dN/dS Calculations
Gene blocks with different levels of synonymous substitution distances may have different potential applications. Assuming the same mutation rates, smaller dS
values may indicate within-population level sequence
polymorphism, whereas larger dS values may indicate
between-population level divergence. To examine whether
the results set forth in the preceding discussion are affected
by the range of dS values chosen for analysis, we also used
a sliding-window approach to select the (dS, dN) points for
linear regression. As we slid the window for dS in the direction of higher dS values for the Pro-group, both the mean
value and the SD of the resulting dN/dS ratios gradually
decreased without yielding any abrupt change in the distribution of dN/dS ratios (table 4). A similar result was seen
when we enlarged the dS window during the summarization
of dN/dS ratios for the Pro-group (table 4). An increase in
dN/dS ratio among closely related sequences (table 4) has
also been observed in other genomic comparisons and is
likely to be the result of deleterious mutations that have
not yet been filtered by natural selection (Rocha et al.
2006). However, the change in dN/dS distribution for
the Pro-group in our study is not large enough to influence
the comparison results of dN/dS ratios between the Pro-group and the Syn-group. Thus, the results of dN/dS
comparison between the Pro-group and the Syn-group are
not directly affected by the range of dS values employed.
FIG. 2.—An example of genomic sequence block extracted from EGSS data set. This figure shows an example of a gene phylogeny within a full-length EGSS gene block after sequence trimming. The reference Prochlorococcus MED4 gene in this example is grpE (gi33860576), the heat shock
protein. This sequence block consists of 38 sequences, with 8 reference genes and 30 EGSS sequences. In this figure, the indices of reference genes are
composed of “gi” and the 8-digit gene index number. The indices of EGSS sequences are composed of the 9-digit gnl/ti indices defined in the Sargasso
Sea Trace database and a 10th digit for the frame number.
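The sliding-window analysis of dS ranges described in this section can be sketched as follows. The window width, the step size, the minimum number of points per window, and the half-open window boundaries are all assumptions, since the text does not give them; the slope is again an ordinary least squares estimate with an intercept.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (dS, dN)


def ols_slope(points: List[Point]) -> Optional[float]:
    """Least squares slope of dN on dS, or None if dS does not vary."""
    n = len(points)
    mean_ds = sum(ds for ds, _ in points) / n
    mean_dn = sum(dn for _, dn in points) / n
    sxx = sum((ds - mean_ds) ** 2 for ds, _ in points)
    if sxx == 0:
        return None
    return sum((ds - mean_ds) * (dn - mean_dn) for ds, dn in points) / sxx


def sliding_window_slopes(
    points: List[Point], width: float, step: float, min_points: int = 3
) -> List[Tuple[float, float, float]]:
    """Return (lower, upper, slope) for each dS window [lower, upper)."""
    if step <= 0 or width <= 0:
        raise ValueError("width and step must be positive")
    if not points:
        return []
    max_ds = max(ds for ds, _ in points)
    results = []
    k = 0
    while k * step <= max_ds:
        lower = k * step
        upper = lower + width
        window = [(ds, dn) for ds, dn in points if lower <= ds < upper]
        if len(window) >= min_points:
            slope = ols_slope(window)
            if slope is not None:
                results.append((lower, upper, slope))
        k += 1
    return results
```

Comparing the slopes across successive windows shows whether the estimated ratio drifts gradually or changes abruptly as the dS range moves.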
The Relatively Lower dN/dS in Prochlorococcus Is
Genome-Wide and Consistent between Genomic and
Metagenomic Data Sets
We have shown that the average pairwise dN/dS ratios
in the high-light-adapted Pro-group are significantly lower
than those in the Syn-group, based on the EGSS gene
blocks (table 3), which agrees with the results derived from
the complete genome data (table 2). These comparisons are
based on genome-wide averages, suggesting that the relatively lower dN/dS in Prochlorococcus is observed genome
wide. Not surprisingly, however, there are exceptions to the
general genome-wide trends. Figure 4 shows that when we
examine the 996 EGSS partial gene blocks, 12.5% of
those blocks in Prochlorococcus actually exhibit higher
dN/dS ratios than the corresponding blocks in the Synechococcus group—including Photosystem I protein
PsaL (gi33862075), tmRNA-binding protein SmpB
(gi33862172), Clp protease proteolytic subunit
(gi33862212), and putative GTP cyclohydrolase I
(gi33861093). However, groups of genes associated with
specific processes and pathways do not appear to have higher dN/dS ratios in the Prochlorococcus group than in the
Synechococcus group.
The distribution of dN/dS ratios is similar in the genome sequence data set (data not shown). Figure 5 shows
that the dN/dS ratio estimates for the EGSS full-length gene
blocks are approximately 1.6–1.7 times higher than the estimates for the corresponding orthologous genes in the
complete genome sequence data.
FIG. 3.—An example of pairwise dN versus dS calculation for EGSS
gene blocks. This figure shows an example of pairwise dN versus dS for
an EGSS full-length gene block calculated by PAML, using the same
sequence block as in figure 2. The Prochlorococcus group consists of all
EGSS sequences matching the reference genes in the high-light-adapted
Prochlorococcus, that is, MED4 and MIT9312. The Syn-group consists
of sequences matching the three marine Synechococcus reference genes.
The fitted straight lines were derived from linear regression on data points
within the two subgroups, respectively.
Discussion
The conspicuous genome reduction shared by chloroplasts, endosymbionts, and Prochlorococci led us to hypothesize that a common evolutionary force, genetic
drift, was responsible. Driven by our working hypothesis,
we developed a metagenomic sequence-filtering and trimming framework to explore the abundance of microbial sequences from the environmental genome shotgun
sequencing data of the Sargasso Sea and extracted population-level Prochlorococcus sequences to complement the
individual complete genome sequences. The EGSS sequence reads are not necessarily from the same genome
and thus we treated each sequence independently. Our
framework involved first identifying a set of EGSS sequences that fall within the close neighborhood of orthologous
reference genes derived from the complete Prochlorococcus and Synechococcus genomes based on relative distance
metrics. Because the ends of EGSS reads do not correspond
to gene boundaries, we then devised a sequence-trimming
algorithm to address the problem of irregular EGSS alignments in order to better utilize the data. The trimmed sequence blocks were further partitioned and selected to
generate a collection of full-length gene blocks and partial
gene blocks. This framework has allowed us to explore the
abundance of Prochlorococcus in the Sargasso Sea data set.
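The trimming and partitioning step can be pictured with a deliberately simplified sketch. It assumes that each EGSS read has already been aligned to a reference gene, with gaps written as "-", and it keeps only the columns that the read shares with the reference. The function names, the gap symbol, the "full"/"partial" labels, and the toy sequences are all hypothetical; this is not the trimming algorithm used in the study, only an illustration of why read ends that overhang or fall short of the gene boundaries need to be handled.

```python
GAP = "-"


def covered_span(aligned_seq):
    """Return (start, end) of the non-gap region, end exclusive, or None."""
    columns = [i for i, ch in enumerate(aligned_seq) if ch != GAP]
    if not columns:
        return None
    return columns[0], columns[-1] + 1


def trim_to_reference(aligned_ref, aligned_read):
    """Trim an aligned read to the columns it shares with the reference gene.

    Returns (trimmed_read, kind), where kind is "full" if the read spans the
    whole reference gene and "partial" otherwise, or None if the read and
    the reference do not overlap. Hypothetical helper for illustration.
    """
    ref_span = covered_span(aligned_ref)
    read_span = covered_span(aligned_read)
    if ref_span is None or read_span is None:
        return None
    start = max(ref_span[0], read_span[0])
    end = min(ref_span[1], read_span[1])
    if start >= end:
        return None
    kind = "full" if (start, end) == ref_span else "partial"
    return aligned_read[start:end], kind


# Toy alignment: the reference gene occupies columns 2-10.
reference = "--ATGAAATAA--"
reads = ["CCATGAAATAAGG",   # overhangs both gene boundaries
         "----GAAATAAGG"]   # starts inside the gene
for read in reads:
    print(trim_to_reference(reference, read))
```

A real implementation would also need to respect codon boundaries and internal gaps before the blocks are passed to a dN/dS estimator; those details are omitted here.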
The dN/dS ratios are significantly lower in the high-light-adapted Prochlorococcus group than in its sister
group, the marine Synechococcus group (table 3). This is
in agreement with the PAML-derived analysis of the complete genome sequences (table 2). Comparing the results
across different data sets, the dN/dS ratios are higher in
the EGSS partial gene blocks than in the EGSS full-length
gene blocks and are lowest in the complete genome data.
This may be due to the nonlinear multihit corrections in
dN and dS estimations as the level of within-group divergence changes across different data sets. The higher dN/dS
ratios in the EGSS data set may also be due to a higher rate
of sequencing errors in the EGSS data set and/or difficulties
in aligning EGSS sequences with heterogeneous ends.
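The effect of a nonlinear multihit correction can be illustrated with the Jukes-Cantor (JC69) formula, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. This is a deliberately simpler model than the codon-based estimators used here (Yang and Nielsen 2000), and the proportions below are hypothetical; the sketch only shows that the same ratio of observed differences yields a different ratio of corrected distances as divergence increases.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p of
    differing sites. Defined only for 0 <= p < 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical observed proportions at two divergence levels. In both cases
# the raw ratio of nonsynonymous-like to synonymous-like differences is 0.1.
for p_nonsyn, p_syn in [(0.01, 0.10), (0.06, 0.60)]:
    raw_ratio = p_nonsyn / p_syn
    corrected_ratio = jc69_distance(p_nonsyn) / jc69_distance(p_syn)
    print(f"raw={raw_ratio:.3f} corrected={corrected_ratio:.3f}")
```

Because the correction grows faster for the more saturated class of sites, the corrected ratio falls below the raw ratio as the level of divergence rises, which is one way in which estimates from data sets with different within-group divergence may not be directly comparable.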
Because the genome-wide efficiency of natural selection is actually significantly higher in the Prochlorococcus
MED4/MIT9312 group than in the marine Synechococcus,
the higher rate of protein evolution in Prochlorococcus (table 1) appears to be the result of an elevated mutation rate
rather than a decrease in the efficiency of selection. Our
study is the first to show that the increased genetic drift seen
in organelles and endosymbionts is not apparent in Prochlorococcus. It is interesting that despite the remarkable
similarities in genomic characteristics between Prochlorococcus and chloroplasts/endosymbionts, the underlying
mechanisms driving their evolutionary dynamics could
be dramatically different. The lower dN/dS ratios suggest
that the effective population sizes are relatively larger in
the Prochlorococcus MED4/MIT9312 group, consistent
with their observed numerical abundance in the oligotrophic waters of the ocean (Ahlgren et al. 2006;
Johnson et al. 2006). However, we do not have direct measurements of Prochlorococcus population sizes because of
the difficulties in determining population boundaries in
the open ocean.
Table 3
Summary of dN/dS Calculations from EGSS Gene Blocks with dS ≤ 1.5, Calculated by PAML
Comparison of Pairwise dN/dS Ratios
                    Prochlorococcus     Synechococcus
EGSS Gene Blocks    Mean     SD         Mean     SD        P value
Full length         0.058    0.048      0.107    0.104     0.003
Partial             0.077    0.054      0.136    0.114     0.000
R-Squared Statistics for Linear Regression
                    Prochlorococcus     Synechococcus
EGSS Gene Blocks    Mean     SD         Mean     SD
Full length         0.770    0.343      0.751    0.210
Partial             0.751    0.321      0.765    0.209
Table 4
Effect of Changing dS Window on the Calculations of dN/dS Ratios in the Prochlorococcus Group of EGSS Data Set
dN/dS Ratios
                         Sliding dS Window                           Opening up dS Window
EGSS Gene Blocks         [0,0.5]   (0.5,1.0]   (1.0,1.5]   (1.5,2.0]   [0,0.5]   [0,1.0]   [0,1.5]   [0,2.0]
Full length    Mean      0.082     0.054       0.052       0.053       0.082     0.067     0.058     0.057
               SD        0.072     0.041       0.052       0.049       0.072     0.055     0.048     0.046
Partial        Mean      0.100     0.076       0.064       0.060       0.100     0.085     0.077     0.072
               SD        0.066     0.059       0.051       0.047       0.066     0.060     0.054     0.050
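The dS-window comparison summarized in table 4 can be sketched as follows. The fragment computes the mean and standard deviation of pairwise dN/dS ratios for pairs whose dS falls in a given window, either a sliding window such as (0.5, 1.0] or an opening window such as [0, 1.5]. The function name window_summary, the treatment of pairs with dS = 0 (skipped, because the ratio is undefined), and the example values are assumptions for illustration, not a description of how the tabulated values were produced.

```python
from statistics import mean, stdev


def window_summary(pairs, lo, hi):
    """Summarize dN/dS for (dN, dS) pairs with lo < dS <= hi.

    Returns (mean_ratio, sd_ratio, n), or None if no pair falls in the
    window. Pairs with dS == 0 are always skipped (assumption: the ratio
    is undefined there), so a window written [0, hi] is evaluated as (0, hi].
    """
    ratios = [dn / ds for dn, ds in pairs if ds > 0 and lo < ds <= hi]
    if not ratios:
        return None
    sd = stdev(ratios) if len(ratios) > 1 else 0.0
    return mean(ratios), sd, len(ratios)


# Hypothetical pairwise (dN, dS) estimates for one gene block.
pairs = [(0.04, 0.4), (0.03, 0.6), (0.05, 1.0), (0.048, 1.2)]

sliding = [(0.0, 0.5), (0.5, 1.0), (1.0, 1.5), (1.5, 2.0)]
opening = [(0.0, 0.5), (0.0, 1.0), (0.0, 1.5), (0.0, 2.0)]
for label, windows in (("sliding", sliding), ("opening", opening)):
    for lo, hi in windows:
        print(label, (lo, hi), window_summary(pairs, lo, hi))
```

Sliding windows isolate pairs at one level of divergence, whereas opening windows accumulate all pairs up to the upper bound, so the two schemes coincide only for the first window.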
The rejection of our hypothesis regarding the role of
genetic drift leaves open the question of which environmental parameters might be affecting the genome-wide patterns
of sequence evolution in Prochlorococcus. Several alternative hypotheses have been proposed in regard to those patterns, explaining them in terms of the organism’s adaptation
to the relatively stable oligotrophic water (niche specialization), in terms of selection for metabolic economy or selection for small cell size and/or increased buoyancy, and in
terms of a loss of low-fitness genes due to an increased mutation rate (Dufresne et al. 2003; Rocap et al. 2003; Garcia-Fernandez et al. 2004; Dufresne et al. 2005; Marais et al.
2007). We emphasize that these hypotheses are not mutually exclusive, but it is still unclear how these different evolutionary forces might interact with each other to generate
the kind of genome reduction that is characteristic of the
Prochlorococci, endosymbionts, and organelles. Additional complete genome sequences and environmental sequencing data may allow us to develop a more detailed
phylogenetic framework that would yield further explanations concerning patterns of gene loss and the evolutionary
forces leading to genome reduction.
The size and complexity of the environmental genome
shotgun sequencing data sets have presented computational
challenges along with new research opportunities (DeLong
2005). In addition to the kind of application demonstrated
here, these data sets may also have potential applications to
population genetics, for example, in the context of subpopulation structure modeling. An initial approximation of
what might constitute a population can be obtained from
phylogenetic trees under the assumption that each distinct
clade represents a population. If a large proportion of genes
in a genome exhibit a single, distinctive multimodal pattern
in the distributions of their relative abundance, we might
infer the existence of subpopulations within the species,
with each mode representing one subpopulation.
There are at least two major challenges involved in
applying the EGSS data to population genetics. First, none
of the sequences can be assumed to be from the same organism. Thus with environmental sequence data, unlike
with genome sequence data, it is difficult to measure recombination rates between isolates and it is therefore difficult to
define population boundaries. Second, each data set contains different numbers of individual sequences, and the
tree structure must be generated independently for each
block or gene data set. This makes it difficult to directly
compare gene sets using standard molecular evolution
frameworks. We expect, however, that there is much insight
to be gained by developing new methods capable of making
better use of the environmental sequence data.
FIG. 4.—Comparison of dN/dS ratios for EGSS partial gene blocks
within the Prochlorococcus group and the Synechococcus group. This
figure shows the comparison of average pairwise dN/dS ratios for all
EGSS partial gene blocks in the Prochlorococcus group (x axis) and the
Synechococcus group (y axis).
FIG. 5.—Comparison of dN/dS ratios for selected genes across
different data sets. This figure shows, for selected genes, the comparison
of dN/dS ratios calculated on the basis of orthologous genes derived from
complete genomes (x axis) and dN/dS ratios calculated on the basis of the
corresponding EGSS full-length gene blocks (y axis).
Acknowledgments
The authors would like to thank Zhiyi Sun for invaluable discussions and comments and Stuart Cane for his expert editorial assistance.
Literature Cited
Ahlgren NA, Rocap G, Chisholm SW. 2006. Measurement of
Prochlorococcus ecotypes using real-time polymerase chain
reaction reveals different abundances of genotypes with
similar light physiologies. Environ Microbiol. 8:441–454.
Andersson SG, Kurland CG. 1998. Reductive evolution of
resident genomes. Trends Microbiol. 6:263–268.
Aris-Brosou S, Bielawski JP. 2006. Large-scale analyses of
synonymous substitution rates can be sensitive to assumptions
about the process of mutation. Gene. 378:58–64.
Blanchard JL, Lynch M. 2000. Organellar genes: why do they
end up in the nucleus? Trends Genet. 16:315–320.
Chisholm SW, Olson RJ, Zettler ER, Waterbury J, Goericke R,
Welschmeyer NA. 1988. A novel free living Prochlorococcus
occurs at high concentration in the oceanic euphotic zone.
Nature. 334:340–343.
Coleman ML, Chisholm SW. 2007. Code and context:
Prochlorococcus as a model for cross-scale biology. Trends
Microbiol. 15:398–407.
DeLong EF. 2005. Microbial community genomics in the ocean.
Nat Rev Microbiol. 3:459–469.
Dufresne A, Garczarek L, Partensky F. 2005. Accelerated
evolution associated with genome reduction in a free-living
prokaryote. Genome Biol. 6:R14.
Dufresne A, Salanoubat M, Partensky F, et al. (21 co-authors).
2003. Genome sequence of the cyanobacterium Prochlorococcus marinus SS120, a nearly minimal oxyphototrophic
genome. Proc Natl Acad Sci USA. 100:10020–10025.
Felsenstein J. 1989. PHYLIP—phylogeny inference package
(version 3.2). Cladistics. 5:164–166.
Garcia-Fernandez JM, de Marsac NT, Diez J. 2004. Streamlined
regulation and gene loss as adaptive mechanisms in
Prochlorococcus for optimized nitrogen utilization in oligotrophic environments. Microbiol Mol Biol Rev. 68:630–638.
Hasegawa M, Kishino H, Yano T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA.
J Mol Evol. 22:160–174.
Hess WR, Rocap G, Ting CS, Larimer FW, Stilwagen S,
Lamerdin J, Chisholm SW. 2001. The photosynthetic
apparatus of Prochlorococcus: insights through comparative
genomics. Photosynth Res. 70:53–71.
Johnson ZI, Zinser ER, Coe A, McNulty NP, Woodward EM,
Chisholm SW. 2006. Niche partitioning among Prochlorococcus ecotypes along ocean-scale environmental gradients.
Science. 311:1737–1740.
Lee C, Grasso C, Sharlow MF. 2002. Multiple sequence
alignment using partial order graphs. Bioinformatics. 18:
452–464.
Lynch M, Blanchard JL. 1998. Deleterious mutation accumulation in organelle genomes. Genetica. 102–103:29–39.
Marais GA, Calteau A, Tenaillon O. 2007. Mutation rate and
genome reduction in endosymbiotic and free-living bacteria.
Genetica.
Martiny AC, Coleman ML, Chisholm SW. 2006. Phosphate
acquisition genes in Prochlorococcus ecotypes: evidence for
genome-wide adaptation. Proc Natl Acad Sci USA. 103:
12552–12557.
Moore LR, Goericke R, Chisholm SW. 1995. Comparative
physiology of Synechococcus and Prochlorococcus: influence of light and temperature on growth, pigments,
fluorescence and absorptive properties. Mar Ecol Prog Ser.
116:259–275.
Partensky F, Hess WR, Vaulot D. 1999. Prochlorococcus,
a marine photosynthetic prokaryote of global significance.
Microbiol Mol Biol Rev. 63:106–127.
Rocap G, Larimer FW, Lamerdin J, et al. (24 co-authors). 2003.
Genome divergence in two Prochlorococcus ecotypes reflects
oceanic niche differentiation. Nature. 424:1042–1047.
Rocha EP, Smith JM, Hurst LD, Holden MT, Cooper JE,
Smith NH, Feil EJ. 2006. Comparisons of dN/dS are time
dependent for closely related bacterial genomes. J Theor Biol.
239:226–235.
Venter JC, Remington K, Heidelberg JF, et al. (23 co-authors).
2004. Environmental genome shotgun sequencing of the
Sargasso Sea. Science. 304:66–74.
Yang Z. 1997. PAML: a program package for phylogenetic
analysis by maximum likelihood. Comput Appl Biosci. 13:
555–556.
Yang Z, Nielsen R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary
models. Mol Biol Evol. 17:32–43.
Jennifer Wernegreen, Associate Editor
Accepted September 8, 2008