RESEARCH ARTICLES

Environmental Sequence Data from the Sargasso Sea Reveal That the Characteristics of Genome Reduction in Prochlorococcus Are Not a Harbinger for an Escalation in Genetic Drift

Jinghua Hu* and Jeffrey L. Blanchard

*Department of Electrical and Computer Engineering, University of Massachusetts, Amherst; and Department of Microbiology, University of Massachusetts, Amherst

The marine cyanobacterium Prochlorococcus MED4 has the smallest sequenced genome of any photosynthetic organism. Prochlorococcus MED4 shares many genomic characteristics with chloroplasts and bacterial endosymbionts, including a reduced coding capacity, missing DNA repair genes, a minimal transcriptional regulatory network, a marked AT% bias, and an accelerated rate of amino acid changes. In chloroplasts and endosymbionts, these molecular phenotypes appear to be symptomatic of a relative increase in genetic drift due to restrictions on effective population size in the host environment. As a free-living bacterium, Prochlorococcus MED4 is not known to be subject to similar ecological constraints. To test whether the high-light-adapted Prochlorococcus MED4 is experiencing a reduction in selection efficiency resulting from genetic drift, we examine two data sets, namely, the environmental genome shotgun sequencing data from the Sargasso Sea and a set of cyanobacterial genome sequences. After integrating these data sets, we compare the evolutionary profile of a high-light Prochlorococcus group to that of a group of Synechococcus (a closely related group of marine cyanobacteria) that does not exhibit a similar small-genome syndrome. The average pairwise dN/dS ratios in the high-light-adapted Prochlorococcus group are significantly lower than those in the Synechococcus group, leading us to reject the hypothesis that the Prochlorococcus group is currently experiencing higher levels of genetic drift.
Key words: genetic drift, metagenomics, Prochlorococcus, cyanobacteria, endosymbionts, genomics.

E-mail: [email protected].

Mol. Biol. Evol. 26(1):5–13. 2009 doi:10.1093/molbev/msn217 Advance Access publication October 8, 2008

© The Author 2008. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

Introduction

The marine cyanobacterium group, Prochlorococcus, was not discovered until 20 years ago because their small size and unusual photosynthetic pigment composition evaded many of the detection methods commonly used to identify and enumerate bacteria (Chisholm et al. 1988). They are now considered to be the most numerically abundant photosynthetic organisms in the ocean (Partensky et al. 1999) and thus to constitute a fundamental player in the ocean carbon cycle. To gain insight into the ecological role of the Prochlorococcus group, complete genome sequences have been obtained for several representative members. The Prochlorococcus genome size mirrors the organism’s diminutive physical size. For example, the genome of Prochlorococcus MED4 is the smallest found thus far in any photosynthetic organism and contains only 1,716 protein-coding regions (Rocap et al. 2003).

As a group of free-living bacteria that exhibits both genome reduction and a genome-wide acceleration of protein evolutionary rate, Prochlorococcus represents an interesting system in which to examine the evolutionary forces at play in genome reduction. The small, streamlined genome of Prochlorococcus MED4 may reflect specialization for growth in relatively stable oligotrophic water (Rocap et al. 2003; Garcia-Fernandez et al. 2004; Martiny et al. 2006); selection for metabolic economy (Dufresne et al. 2003; Garcia-Fernandez et al. 2004; Dufresne et al. 2005); an increased mutation rate that has resulted in a loss of low fitness genes (Marais et al. 2007); or selection for small cell size and/or increased buoyancy, both of which are important for life near the top of the water column (Dufresne et al. 2005).

Another hypothesis, assigning a greater role to genetic drift, is supported by the striking genomic similarities among chloroplasts, endosymbionts, and Prochlorococci. Analysis of the genome content of Prochlorococcus MED4 has revealed similarities to chloroplast genomes (Hess et al. 2001), although chloroplasts are not thought to be derived from within the Prochlorococci. Chloroplasts are derived from free-living cyanobacteria that have become integrated into the cellular architecture of all plants and algae, with genomes substantially smaller than those of free-living cyanobacteria, as the result of the transfer of genes to the nucleus and the loss of other genes that may not have been needed inside of the host cell. The chloroplast genome is haploid and is inherited uniparentally and transmitted asexually, with limited opportunity for recombination with genetically distinct lineages. Moreover, chloroplasts undergo frequent population bottlenecks during germ-line transmission, and thus, the effective population size for chloroplast genes is smaller than that for genes in the nucleus of the same organism. As a result, there is an increase in genetic drift (thus a decrease in the efficiency of selection) among chloroplast genes, resulting in the accumulation of slightly deleterious mutations (Lynch and Blanchard 1998; Blanchard and Lynch 2000). A similar phenomenon exists in mitochondria, which are also derived from free-living bacteria (Andersson and Kurland 1998; Lynch and Blanchard 1998; Blanchard and Lynch 2000).

An increase in genetic drift has also been demonstrated in other endosymbiotic genomes. Buchnera is a group of obligate bacterial endosymbionts of insects.
The Buchnera genome is not as tightly integrated with the host cell as a mitochondrial or chloroplast genome, but, like chloroplasts and mitochondria, Buchnera has a very reduced genome size. Extensive testing of molecular data from the insect bacterial endosymbionts Buchnera and Wigglesworthia indicates that genetic drift, not selection, is responsible for an increase in AT% content, a decrease in codon bias, and an acceleration in the rate of protein and DNA evolution (Moran 1996; Wernegreen and Moran 1999; Funk et al. 2001; Moran and Mira 2001; Silva et al. 2001; Wernegreen et al. 2001; Abbot and Moran 2002; Berg and Kurland 2002; Gil et al. 2002; Palacios and Wernegreen 2002; Wernegreen et al. 2002; Herbeck et al. 2003; Herbeck, Wall, and Wernegreen 2003; van Ham et al. 2003; Wernegreen and Funk 2004; Schaber et al. 2005).

We can directly test the relative role of genetic drift versus natural selection in Prochlorococcus gene evolution using nucleotide sequence data. A priori, we do not expect Prochlorococcus ecotypes to have small population sizes because they are numerically abundant in the open ocean (Ahlgren et al. 2006; Johnson et al. 2006). There is nothing to suggest that Prochlorococcus has formed an obligate symbiosis, and it can be grown axenically in the laboratory (Moore et al. 1995). However, we have a very limited understanding of what constitutes a bacterial population in the open ocean. Furthermore, it is possible that unusual, as-yet unidentified aspects of Prochlorococcus population biology and ecology could have resulted in an unusually high level of genetic drift in Prochlorococcus MED4.

To test the relative role of genetic drift in Prochlorococcus evolution, we first explored the availability of genomic and metagenomic sequence data. We developed a new method for filtering the environmental genome shotgun sequencing data from the Sargasso Sea, a habitat rich in Prochlorococcus high-light-adapted ecotypes.
The resulting metagenomic data set was then integrated with a set of all the completely sequenced cyanobacterial genomes, and the combined data set was used to compare the evolutionary profile of the high-light-adapted Prochlorococcus group with that of a closely related group of marine Synechococcus.

Materials and Methods

Data Sets

The National Center for Biotechnology Information (NCBI) set of complete published microbial genomic DNA and protein sequences, including those of Prochlorococcus MED4, Prochlorococcus SS120, Prochlorococcus MIT9313, and Synechococcus WH8102, was downloaded in January 2005. At the same time, the complete genome sequences of Prochlorococcus MIT9312, Prochlorococcus NATL2A, Synechococcus CC9902, Synechococcus CC9605, and the freshwater Synechococcus PCC7942 were downloaded from the draft genome section of NCBI. Each of these five genomes has been assembled into a single circular contiguous sequence. Publications describing the genome sequences of Prochlorococcus MIT9312 (Coleman and Chisholm 2007) and Prochlorococcus NATL2A have subsequently appeared, but the published nucleotide and protein sequences are identical to the draft sequences that we downloaded in 2005. The environmental shotgun sequence data set from the Sargasso Sea (Venter et al. 2004), consisting of 1,986,782 unassembled reads totaling nearly 2.0 Gbp, was obtained in January 2005 from the NCBI environmental sequence database (ftp://ftp.ncbi.nlm.nih.gov/pub/TraceDB/environmental_sequence/).

Generation of Orthologous Gene Sets

A set of orthologous genes that are shared among the genomes of Prochlorococcus MED4, Prochlorococcus MIT9312, Prochlorococcus NATL2A, Prochlorococcus SS120, Prochlorococcus MIT9313, Synechococcus WH8102, Synechococcus CC9902, Synechococcus CC9605, and the freshwater Synechococcus PCC7942 was derived using the Genome Flux Analysis program (Tolopko and Blanchard, unpublished data).
This program uses a phylogenetic tree as input and deploys rule-based inferences to determine orthologous gene sets based on raw Blast scores. Briefly, the Genome Flux Analysis program proceeds by using BlastP to find highly similar protein sequences among the phylogenetic focus group relative to other published microbial genomes. In this case, the phylogenetic focus group consists of the nine cyanobacterial genomes listed above, which we also refer to as our "reference genomes." The inclusion of the larger set of published microbial genomes during BlastP allows a low bit score cutoff threshold to be used to generate a "top-hit list" for each query based on the raw BlastP results. To be considered an orthologous gene group, the genes in the top-hit list for the query must meet two criteria. First, each of the genes coming from reference genomes must meet a BlastP bit score threshold of 40. If one or more of the genes from reference genomes do not meet this threshold for similarity significance, it is assumed that the protein sequence is not conserved across all reference genomes and the genes cannot form an orthologous gene group. Second, the scores of the reference genes must be higher than those of the genes in the larger set that originate from other microbial genomes. We also eliminate any gene whose evolutionary history is complicated by putative horizontal gene transfer or gene duplication events. This results in a conservative, high-quality gene set that likely underestimates the true number of orthologous genes because we removed rather than resolved genes flagged as putative horizontal gene transfer or gene duplication events and because we required all taxa to contain the orthologous gene.

Calculation of Synonymous and Nonsynonymous Substitution Rates

Orthologous protein sequences were aligned using POA (Lee et al.
2002), and the corresponding nucleotide sequences were mapped onto the protein alignments using custom Perl scripts to generate sets of aligned nucleotide sequences. Nucleotide and protein distances were calculated using "dnadist" and "protdist," respectively, in the PHYLIP (Felsenstein 1989) package. Synonymous and nonsynonymous substitution rates were calculated using the "yn00" program (Yang and Nielsen 2000) in the PAML package (Yang 1997), which implements a maximum likelihood method based on the HKY85 (Hasegawa et al. 1985) model.

A Phylogenetic Focus Group-Based Environmental Sequence-Filtering Framework

A phylogenetic focus group-based environmental sequence-filtering framework was developed to attempt to correct for the potential biases caused by generic Blast searches in extracting sequences belonging to the Prochlorococcus/Synechococcus groups from the collection of unassembled reads in the Sargasso Sea data (Hu J and Blanchard JL, unpublished data). The application of the phylogenetic focus group-based sequence-filtering algorithm to this study consists of the following major steps:

1) Identifying the orthologous reference gene table from the complete genomes using the Genome Flux Analysis program described above;
2) Generating nucleotide and protein distance metrics for each orthologous reference gene based on mutual Blast searches within orthologous genes;
3) Running BlastN and BlastP using the orthologous reference genes as queries against the environmental genome shotgun sequencing (EGSS) database for initial sequence selection and collecting the results as a query set for reverse Blast against orthologous reference genes; and
4) Filtering the reduced EGSS data based on the relative distances between the EGSS sequences and the reference genes.

The two-dimensional relative distance metrics were defined based on the range of genomic variations within the orthologous reference gene trees and across the reference genomes.
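Step 1 of the framework depends on the orthologous reference gene table. As a minimal sketch only, the two top-hit-list criteria given under Generation of Orthologous Gene Sets might be expressed as follows. The function name, the dictionary layout, and the toy genome subset are assumptions made for illustration; they are not taken from the Genome Flux Analysis program, and the removal of putative horizontal gene transfer or gene duplication events is not modeled.

```python
def is_orthologous_group(ref_scores, other_scores, reference_genomes,
                         min_bit_score=40.0):
    """Apply the two top-hit-list criteria to one query gene.

    ref_scores: dict mapping reference genome name -> BlastP bit score
                (hypothetical layout, one hit per genome).
    other_scores: bit scores of hits from non-reference microbial genomes.
    """
    # All reference genomes must contribute a hit (all taxa are required).
    if set(ref_scores) != set(reference_genomes):
        return False
    # Criterion 1: every reference hit meets the bit score threshold of 40.
    if any(score < min_bit_score for score in ref_scores.values()):
        return False
    # Criterion 2: reference hits outscore every non-reference hit.
    if other_scores and min(ref_scores.values()) <= max(other_scores):
        return False
    return True


# Toy subset of the nine reference genomes, with made-up bit scores.
refs = {"MED4", "MIT9312", "WH8102"}
hits = {"MED4": 210.0, "MIT9312": 198.0, "WH8102": 150.0}
print(is_orthologous_group(hits, [88.0, 41.0], refs))
```

Requiring the lowest reference score to exceed the highest non-reference score is one strict reading of the second criterion; a per-query or per-genome comparison would be an equally plausible reading of the text.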
With the strict sequence selection criteria adopted during step 4, we obtained a conservative collection of EGSS sequences that fall within very close distances to the orthologous reference genes. These EGSS sequences were first aligned with their orthologous reference genes by aligning the best match from the six frame translations of the EGSS sequence to the full-length proteins of the reference genes. Then, the protein alignments were mapped back to the corresponding nucleotide EGSS sequence alignments. There were a number of problems associated with the quality of these initial EGSS alignments because the EGSS sequences do not span the full length of most genes and because the quality of the EGSS sequences is lower at their terminal regions. To address these problems, we further developed an algorithm for sequence trimming and final sequence selection to improve the quality of the alignments and to increase the number of usable sequences spanning the same range of the alignment. This algorithm consists of the following steps:

1) Using nucleotide EGSS sequence alignments described above as the input, trimming the aligned sequences based on the alignment range as determined by reference genes. (We refer to these initially trimmed alignments as "raw gene blocks.")
2) Using raw gene blocks as input, removing all gaps in the reference gene sequences and in corresponding positions in the aligned EGSS sequences. (We refer to the output of this step as "full-length gene blocks.")
3) Segmenting the full-length gene blocks into smaller blocks 300 bp in length, which we refer to as "partial gene blocks."
4) For each full-length or partial gene block, discarding EGSS sequences that consist of more than 20% gaps and, as a final selection criterion, discarding any gene block that contains fewer than 30 sequences.
5) Translating the selected gene blocks to protein sequence alignment blocks.
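Steps 2 through 4 of the trimming algorithm can be sketched as below. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, a single reference row stands in for the set of reference genes, a trailing segment shorter than 300 bp is assumed to be dropped, and the 30-sequence cutoff is assumed to apply to the sequences that survive the gap filter.

```python
def drop_reference_gap_columns(reference, reads):
    """Step 2: remove alignment columns that are gaps in the reference."""
    keep = [i for i, base in enumerate(reference) if base != "-"]
    ungapped_ref = "".join(reference[i] for i in keep)
    trimmed = ["".join(read[i] for i in keep) for read in reads]
    return ungapped_ref, trimmed


def partial_blocks(reads, block_len=300):
    """Step 3: cut aligned reads into consecutive fixed-length blocks.

    Assumption: a trailing segment shorter than block_len is dropped.
    """
    if not reads:
        return []
    length = len(reads[0])
    return [[read[start:start + block_len] for read in reads]
            for start in range(0, length - block_len + 1, block_len)]


def select_block(reads, max_gap_fraction=0.20, min_sequences=30):
    """Step 4: drop reads with too many gaps, then drop small blocks."""
    kept = [r for r in reads
            if r and r.count("-") / len(r) <= max_gap_fraction]
    return kept if len(kept) >= min_sequences else None


# Toy alignment: one reference row and two aligned reads.
ref, reads = drop_reference_gap_columns("AC-GTA", ["ACTGTA", "A---TA"])
print(ref, reads)
```

Returning `None` for a rejected block keeps the "discard the whole block" decision explicit for the caller.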
Results

Acceleration in the Rate of Protein Evolution within the Prochlorococci

The relative contribution of natural selection to genome evolution can be measured using molecular data. A decrease in the efficiency of natural selection due to genetic drift will result in an increase in the genome-wide rate of nonsynonymous site evolution and correspondingly the rate of protein sequence evolution because of the accumulation of deleterious mutations. To test whether a reduction in genome size in the high-light-adapted Prochlorococcus clade might be related to reduction of selection efficiency owing to genetic drift, we compared relative changes in rates of nucleotide and amino acid evolution in Prochlorococcus with a closely related sister group, the marine Synechococcus. For measurements of the rates of protein evolution relative to a common reference point, the marine Synechococcus and Prochlorococcus proteins were compared with an outgroup, the freshwater Synechococcus PCC7942. The phylogenetic relationship among these taxa constructed from the 16S rRNA gene is shown in figure 1 along with their genome sizes.

A data set of orthologous genes from the eight Synechococcus and Prochlorococcus complete genome sequences was constructed using the Genome Flux Analysis program described above. This workflow resulted in 1,156 orthologous genes distributed across the genome. We refer to this data set as the complete genome data set or reference orthologous genes. This set of orthologous genes covers 67% of the 1,716 genes found in the smallest genome, that of Prochlorococcus MED4. The relative rates of protein evolution were calculated using protdist based on aligned sequences (table 1). The rate of protein sequence evolution is greatest in Prochlorococcus MED4 and MIT9312, whereas Prochlorococcus SS120 and NATL2A have intermediate rates. In contrast, the rates of protein evolution are very similar in Prochlorococcus MIT9313 and the marine Synechococcus.
Thus, the speedup in the rate of protein evolution appears to be localized within the Prochlorococcus clade.

FIG. 1. Phylogeny and genome sizes of cyanobacteria. A phylogeny of the Prochlorococcus/Synechococcus group based on Neighbor-Joining analysis of the 16S rRNA gene using HKY85 to calculate the distance matrix. In all, 1,000 bootstrap replicates were performed.

Genomic Sequence Synonymous Distances between Genomes Are at or near Saturation

To determine whether the change in rate of protein evolution is due to a change in mutation rate or to a change in selection efficiency arising from genetic drift, the orthologous protein sequences derived from completed genomes were used as a template to create an alignment of the corresponding nucleotide sequences. There are dramatic differences in the nucleotide GC% content (table 1) and amino acid usage (data not shown) in this data set. The synonymous substitutions in most comparisons among the eight genomes were too numerous to yield reliable synonymous distance estimates. The four pairwise comparisons with the lowest average dS estimates are shown in table 2, indicating significantly lower dN/dS ratios in Prochlorococcus than in Synechococcus (Student’s t-test, P = 0.000).

As the sequence divergence between two taxa grows larger, it becomes more difficult to accurately account for multiple substitutions. In our complete genome data set, the values of dS are very high and thus can be difficult to estimate even with correction for multiple substitutions. In addition, although PAML does account for certain nucleotide biases, forms of bias similar to those found in our genome data set can still lead to inaccurate estimates (Aris-Brosou and Bielawski 2006). Rather than resolving the differences between the models, we sought to create a data set of more closely related sequences, which would be less subject to model-dependent estimates of multiple substitutions.
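Why saturation makes distance estimates unreliable can be illustrated with the simplest multiple-substitution correction, the Jukes-Cantor (1969) formula d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. This model is a stand-in chosen for brevity and is not the model used by yn00 in this study; the function name is hypothetical. The sketch shows that near saturation a small change in p produces a large change in the corrected distance.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p of
    differing sites; treated as infinite once p reaches 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The same 0.05 step in p costs far more corrected distance near saturation.
for p in (0.05, 0.10, 0.65, 0.70):
    print(f"p = {p:.2f}  corrected d = {jc69_distance(p):.3f}")
```

Because the correction grows without bound as p approaches 0.75, sampling noise in p is amplified at high divergence, which is the motivation for turning to more closely related sequences.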
Table 1
Protein Distances Relative to Synechococcus PCC7942 and Nonsynonymous GC% for Orthologous Genes Derived from the Complete Cyanobacteria Genomes

Rate of Protein                    Protein Distance     GC%
Evolution        Taxa              Mean     SD          Mean     SD
Fast             Pro MED4          0.92     0.46        0.38     0.05
Fast             Pro MIT9312       0.92     0.47        0.39     0.05
Intermediate     Pro NATL2A        0.84     0.42        0.43     0.04
Intermediate     Pro SS120         0.83     0.41        0.44     0.04
Slow             Pro MIT9313       0.74     0.37        0.54     0.03
Slow             Syn CC9902        0.73     0.36        0.55     0.04
Slow             Syn CC9605        0.72     0.36        0.56     0.04
Slow             Syn WH8102        0.72     0.36        0.57     0.04
Reference        Syn PCC7942       0        0           0.54     0.03

Filtering the Environmental Genome Shotgun Sequencing Data into Blocks of Orthologous Protein-Coding Sequences

An environmental genome shotgun sequencing data set from the Sargasso Sea contains copious reads of Prochlorococcus and Synechococcus sequences (Venter et al. 2004) that are expected to show less sequence divergence than the available complete genome sequences. We refer to this data set as the EGSS data set. To utilize this data set, we divided the computational task into several steps, including filtering environmental sequences into taxonomic clades, aligning environmental sequences to orthologous reference sequences in the taxonomic clades, trimming the aligned sequences into short sequence blocks to fully utilize the EGSS sequence alignments, and summarizing the patterns of within- and between-group measures of genetic variation (e.g., dN/dS ratios) for EGSS sequence blocks.

To derive a set of homologous sequences to estimate nonsynonymous and synonymous distances, we started with the collection of 1,986,782 unassembled EGSS reads totaling about 2.0 Gbp. We developed a phylogenetic focus group-based sequence-filtering framework to reduce the EGSS data into Prochlorococcus- and Synechococcus-like reads that are homologous to genes in the complete genome orthologous gene table described above.
During step 3 as described in the Materials and Methods, with forward Blast (E value = 1.0 × 10^-10) and reverse Blast (E value = 1.0), the EGSS data set was reduced to 513,608 reads. During step 4, the EGSS data set was further reduced to 25,655 reads. Because the EGSS reads are mostly 1–2 kbp in length, and most of them only partially overlap with the full-length reference protein sequences of orthologous genes, we only obtained a small set of full-length gene blocks from the EGSS data. In order to better utilize the EGSS data, we applied our sequence-trimming algorithm to the aligned EGSS sequences to create partial gene blocks. After final selection on both full-length and partial gene blocks, we generated a high-quality collection of aligned population-level sequence segments of reference genomes for further exploration. The nucleotide sequences were then substituted for the protein sequences and used for calculating synonymous and nonsynonymous substitution rates. Out of the initial 1,156 entries in the orthologous gene table, we obtained 46 full-length gene blocks and 996 partial gene blocks, each including at least 30 sequences. (We chose 30 as the block size cutoff in order to balance the need for high-quality blocks with the requirements of our statistical analysis.) These partial EGSS gene blocks cover a total of 483 distinct genes and thus are considered to represent a genome-wide distribution.

Table 2
Pairwise Comparison of dN, dS, and dN/dS Ratios Calculated by PAML for Orthologous Genes Derived from the Complete Cyanobacteria Genomes

                                      dN               dS               dN/dS
Group             Genome Pair         Mean     SD      Mean     SD      Mean     SD
Prochlorococcus   (MED4, MIT9312)     0.102    0.063   2.998    1.021   0.038    0.028
Synechococcus     (CC9605, CC9902)    0.142    0.101   2.283    0.940   0.070    0.056
Synechococcus     (CC9605, WH8102)    0.133    0.107   1.848    0.725   0.076    0.064
Synechococcus     (CC9902, WH8102)    0.152    0.110   2.304    0.915   0.075    0.061
An example of the gene phylogeny in a full-length EGSS gene block is shown in figure 2, created by using the UPGMA method based on nucleotide distances among the sequences. As shown in this figure, the majority of the EGSS gene copies are associated with the high-light-adapted Prochlorococcus clade and are mostly similar to Prochlorococcus MIT9312. In contrast, fewer EGSS gene copies are associated with the Synechococcus clade. This kind of species composition is typical of all EGSS gene blocks examined in this study. The pairwise dN and dS values among all sequences from this example were calculated by PAML and are shown in figure 3.

Summarization of Independent EGSS Gene Blocks Reveals That the dN/dS Ratios in the Pro-group Are Significantly Lower than Those in the Syn-group

Because the EGSS sequence reads are not linked to any particular individual organism as in the complete genome sequences, we applied general regression models to summarize the patterns of within-group dN/dS ratios for EGSS sequence blocks. For each EGSS sequence block, we examine two subgroups, that is, the Pro-group for sequences matching Prochlorococcus MED4 and MIT9312 reference genes and the Syn-group for sequences matching the three marine Synechococcus reference genes. Based on the linear patterns in (dS, dN) plots, for each EGSS gene block, we fitted straight lines to the scattered points that represent pairwise within-group (dS, dN) values for the Pro-group and the Syn-group, respectively. The slope values are calculated from linear regression, representing the averaged within-group dN/dS ratios. An example of a dN/dS calculation is shown in figure 3, in which the slope value (i.e., dN/dS) is 0.023 for the Pro-group and 0.073 for the Syn-group. In this example, the R-squared statistic for linear regression is 0.83 for the Pro-group and 0.86 for the Syn-group.
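The slope-based summary can be sketched with an ordinary least-squares fit of dN on dS. The text does not say whether the fitted lines were forced through the origin, so the version below assumes a free intercept; the function name and the four toy (dS, dN) points are made up for illustration and are not values from this study.

```python
def fit_line(ds_values, dn_values):
    """Least-squares fit dN = intercept + slope * dS.

    Returns (slope, intercept, r_squared); the slope plays the role of
    the averaged within-group dN/dS ratio.
    """
    n = len(ds_values)
    mean_ds = sum(ds_values) / n
    mean_dn = sum(dn_values) / n
    sxx = sum((x - mean_ds) ** 2 for x in ds_values)
    if sxx == 0:
        raise ValueError("all dS values are identical; slope is undefined")
    sxy = sum((x - mean_ds) * (y - mean_dn)
              for x, y in zip(ds_values, dn_values))
    slope = sxy / sxx
    intercept = mean_dn - slope * mean_ds
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(ds_values, dn_values))
    ss_tot = sum((y - mean_dn) ** 2 for y in dn_values)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r_squared


# Toy pairwise within-group values (not data from the paper).
toy_ds = [0.2, 0.4, 0.6, 0.8]
toy_dn = [0.012, 0.018, 0.032, 0.038]
slope, intercept, r_squared = fit_line(toy_ds, toy_dn)
print(f"slope = {slope:.3f}, R-squared = {r_squared:.2f}")
```

Summarizing each block by one slope avoids dividing individual dN values by very small dS values, which would make per-pair ratios unstable.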
We collected the estimated dN/dS ratios for Pro-group and Syn-group for all sequence blocks and determined the distribution of dN/dS ratios, using only those ratios in which the dS value was ≤ 1.5. Although an even smaller dS window would enhance the reliability of our dN/dS estimates, we chose this particular dS cutoff because the Synechococcus clade is only sparsely represented (in contrast to the abundance of Prochlorococcus MIT9312 sequences) in the EGSS data. Had we chosen a smaller dS window, we could not have assembled enough Synechococcus data to allow us to calculate dN/dS values.

In table 3, we summarize the calculation of dN/dS ratios based on EGSS gene blocks using PAML, including the mean values and standard deviations (SDs) for dN/dS in Pro-group and Syn-group, the t-test P values for comparing the averages of the two distributions, and the R-squared statistics for measuring the goodness of fit in linear regression. The summary indicates that the average pairwise dN/dS ratios in the high-light-adapted Pro-group are significantly lower than those in the Syn-group, based on the full-length gene blocks as well as the partial gene blocks. In addition, the SD of the dN/dS ratios for the Pro-group is smaller than that for the Syn-group. The comparison results fail to support a role for genetic drift as a major contributor to genome reduction in Prochlorococcus and are consistent in that regard with the results for complete genomes shown in table 2.

The Effect of Changing the Range of Synonymous Substitution Distances on dN/dS Calculations

Gene blocks with different levels of synonymous substitution distances may have different potential applications. Assuming the same mutation rates, smaller dS values may indicate within-population level sequence polymorphism, whereas larger dS values may indicate between-population level divergence.
FIG. 2. An example of genomic sequence block extracted from EGSS data set. This figure shows an example of a gene phylogeny within a full-length EGSS gene block after sequence trimming. The reference Prochlorococcus MED4 gene in this example is grpE (gi33860576), the heat shock protein. This sequence block consists of 38 sequences, with 8 reference genes and 30 EGSS sequences. In this figure, the indices of reference genes are composed of "gi" and the 8-digit gene index number. The indices of EGSS sequences are composed of the 9-digit gnl/ti indices defined in the Sargasso Sea Trace database and a 10th digit for the frame number.

To examine whether the results set forth in the preceding discussion are affected by the range of dS values chosen for analysis, we also used a sliding-window approach to select the (dS, dN) points for linear regression. As we slid the window for dS in the direction of higher dS values for the Pro-group, both the mean value and the SD of the resulting dN/dS ratios gradually decreased without yielding any abrupt change in the distribution of dN/dS ratios (table 4). A similar result was seen when we enlarged the dS window during the summarization of dN/dS ratios for the Pro-group (table 4). An increase in dN/dS ratio among closely related sequences (table 4) has also been observed in other genomic comparisons and is likely to be the result of deleterious mutations that have not yet been filtered by natural selection (Rocha et al. 2006). However, the change in dN/dS distribution for the Pro-group in our study is not large enough to influence the comparison results of dN/dS ratios between the Pro-group and the Syn-group. Thus, the results of dN/dS comparison between the Pro-group and the Syn-group are not directly affected by the range of dS values employed.
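The two ways of choosing (dS, dN) points, a sliding window and a progressively enlarged window, can be sketched as follows. The window bounds follow the column headers of table 4; the helper name, the tuple layout, and the four toy points are assumptions for illustration only.

```python
# Each window is (low, high, closed_low): closed_low=True means [low, high],
# closed_low=False means (low, high].
SLIDING = [(0.0, 0.5, True), (0.5, 1.0, False),
           (1.0, 1.5, False), (1.5, 2.0, False)]
EXPANDING = [(0.0, 0.5, True), (0.0, 1.0, True),
             (0.0, 1.5, True), (0.0, 2.0, True)]


def select_pairs(pairs, low, high, closed_low):
    """Keep the (dS, dN) pairs whose dS falls inside the window."""
    return [(ds, dn) for ds, dn in pairs
            if (ds >= low if closed_low else ds > low) and ds <= high]


# Toy pairwise values (not data from the paper).
pairs = [(0.2, 0.02), (0.5, 0.04), (0.7, 0.05), (1.6, 0.09)]
sliding_counts = [len(select_pairs(pairs, *w)) for w in SLIDING]
expanding_counts = [len(select_pairs(pairs, *w)) for w in EXPANDING]
print(sliding_counts, expanding_counts)
```

The points selected for each window would then be passed to the linear regression, giving one dN/dS estimate per block and window.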
FIG. 3. An example of pairwise dN versus dS calculation for EGSS gene blocks. This figure shows an example of pairwise dN versus dS for an EGSS full-length gene block calculated by PAML, using the same sequence block as in figure 2. The Prochlorococcus group consists of all EGSS sequences matching the reference genes in the high-light-adapted Prochlorococcus, that is, MED4 and MIT9312. The Syn-group consists of sequences matching the three marine Synechococcus reference genes. The fitted straight lines were derived from linear regression on data points within the two subgroups, respectively.

The Relatively Lower dN/dS in Prochlorococcus Is Genome-Wide and Consistent between Genomic and Metagenomic Data Sets

We have shown that the average pairwise dN/dS ratios in the high-light-adapted Pro-group are significantly lower than those in the Syn-group, based on the EGSS gene blocks (table 3), which agrees with the results derived from the complete genome data (table 2). These comparisons are based on genome-wide averages, suggesting that the relatively lower dN/dS in Prochlorococcus is observed genome wide. Not surprisingly, however, there are exceptions to the general genome-wide trends. Figure 4 shows that when we examine the 996 EGSS partial gene blocks, 12.5% of those blocks in Prochlorococcus actually exhibit higher dN/dS ratios than the corresponding blocks in the Synechococcus group, including Photosystem I protein PsaL (gi33862075), tmRNA-binding protein SmpB (gi33862172), Clp protease proteolytic subunit (gi33862212), and putative GTP cyclohydrolase I (gi33861093). However, groups of genes associated with specific processes and pathways do not appear to have higher dN/dS ratios in the Prochlorococcus group than in the Synechococcus group. The distribution of dN/dS ratios is similar in the genome sequence data set (data not shown). Figure 5 shows
that the dN/dS ratio estimates for the EGSS full-length gene blocks are approximately 1.6–1.7 times higher than the estimates for the corresponding orthologous genes in the complete genome sequence data.

Discussion

The conspicuous genome reduction shared by chloroplasts, endosymbionts, and Prochlorococci led us to hypothesize that a common evolutionary force, genetic drift, was responsible. Driven by our working hypothesis, we developed a metagenomic sequence-filtering and trimming framework to explore the abundance of microbial sequences from the environmental genome shotgun sequencing data of the Sargasso Sea and extracted population level Prochlorococcus sequences to complement the individual complete genome sequences. The EGSS sequence reads are not necessarily from the same genome and thus we treated each sequence independently. Our framework involved first identifying a set of EGSS sequences that fall within the close neighborhood of orthologous reference genes derived from the complete Prochlorococcus and Synechococcus genomes based on relative distance metrics. Because the ends of EGSS reads do not correspond to gene boundaries, we then devised a sequence-trimming algorithm to address the problem of irregular EGSS alignments in order to better utilize the data. The trimmed sequence blocks were further partitioned and selected to generate a collection of full-length gene blocks and partial gene blocks. This framework has allowed us to explore the abundance of Prochlorococcus in the Sargasso Sea data set.

The dN/dS ratios are significantly lower in the high-light-adapted Prochlorococcus group than in its sister group, the marine Synechococcus group (table 3). This is in agreement with the PAML-derived analysis of the complete genome sequences (table 2). Comparing the results across different data sets, the dN/dS ratios are higher in the EGSS partial gene blocks than in the EGSS full-length gene blocks and are lowest in the complete genome data.
This may be due to the nonlinear multihit corrections in dN and dS estimations as the level of within-group divergence changes across different data sets. The higher dN/dS ratios in the EGSS data set may also be due to a higher rate of sequencing errors in the EGSS data set and/or difficulties in aligning EGSS sequences with heterogeneous ends. Because the genome-wide efficiency of natural selection is actually significantly higher in the Prochlorococcus MED4/MIT9312 group than in the marine Synechococcus, the higher rate of protein evolution in Prochlorococcus (table 1) appears to be the result of an elevated mutation rate rather than a decrease in the efficiency of selection. Our study is the first to show that the increased genetic drift seen in organelles and endosymbionts is not apparent in Prochlorococcus. It is interesting that despite the remarkable similarities in genomic characteristics between Prochlorococcus and chloroplasts/endosymbionts, the underlying mechanisms driving their evolutionary dynamics could be dramatically different. 
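The remark above about nonlinear multihit corrections can be illustrated with a minimal sketch. It assumes the simple Jukes-Cantor correction as a stand-in for the codon-model corrections applied by PAML, and the observed proportions of differing sites are hypothetical values chosen only so that the uncorrected ratio is the same (0.1) at two levels of divergence.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differing sites.

    Assumption: JC69 is used here only for illustration; it is not the
    correction implemented in PAML.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def corrected_ratio(p_n, p_s):
    """Ratio of corrected nonsynonymous to corrected synonymous distance."""
    return jc69_distance(p_n) / jc69_distance(p_s)


# Hypothetical observed proportions; the raw ratio p_n / p_s is 0.1 in both cases.
low_divergence = corrected_ratio(0.01, 0.10)
high_divergence = corrected_ratio(0.05, 0.50)

print(f"corrected ratio at low divergence:  {low_divergence:.4f}")
print(f"corrected ratio at high divergence: {high_divergence:.4f}")
```

Because the correction grows faster for the more saturated synonymous sites, the corrected ratio falls as divergence rises even though the raw ratio is unchanged. Under this simplified model, ratios estimated at different levels of within-group divergence are therefore not directly comparable.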
The lower dN/dS ratios suggest that the effective population sizes are relatively larger in the Prochlorococcus MED4/MIT9312 group, consistent with their observed numerical abundance in the oligotrophic waters of the ocean (Ahlgren et al. 2006; Johnson et al. 2006). However, we do not have direct measurements of Prochlorococcus population sizes because of the difficulties in determining population boundaries in the open ocean. The rejection of our hypothesis regarding the role of genetic drift leaves open the question of which environmental parameters might be affecting the genome-wide patterns of sequence evolution in Prochlorococcus.

Table 3
Summary of dN/dS Calculations from EGSS Gene Blocks with dS ≤ 1.5, Calculated by PAML

Comparison of Pairwise dN/dS Ratios
                      Prochlorococcus      Synechococcus
EGSS Gene Blocks      Mean      SD         Mean      SD         P value
Full length           0.058     0.048      0.107     0.104      0.003
Partial               0.077     0.054      0.136     0.114      0.000

R-Squared Statistics for Linear Regression
                      Prochlorococcus      Synechococcus
EGSS Gene Blocks      Mean      SD         Mean      SD
Full length           0.770     0.343      0.751     0.210
Partial               0.751     0.321      0.765     0.209

Table 4
Effect of Changing dS Window on the Calculations of dN/dS Ratios in the Prochlorococcus Group of EGSS Data Set

Sliding dS Window
EGSS Gene Blocks      dN/dS Ratios      [0,0.5]     (0.5,1.0]     (1.0,1.5]     (1.5,2.0]
Full length           Mean              0.082       0.054         0.052         0.053
                      SD                0.072       0.041         0.052         0.049
Partial               Mean              0.100       0.076         0.064         0.060
                      SD                0.066       0.059         0.051         0.047

Opening up dS Window
EGSS Gene Blocks      dN/dS Ratios      [0,0.5]     [0,1.0]       [0,1.5]       [0,2.0]
Full length           Mean              0.082       0.067         0.058         0.057
                      SD                0.072       0.055         0.048         0.046
Partial               Mean              0.100       0.085         0.077         0.072
                      SD                0.066       0.060         0.054         0.050
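Table 4 summarizes dN/dS within sliding and progressively widened dS windows. A rough sketch of that kind of summary is given below. It assumes that pairwise (dN, dS) estimates are simply pooled and averaged within each window; the averaging procedure actually used for the table is not stated in this section, and the function name `window_summary` and the input values are hypothetical.

```python
from statistics import mean, stdev


def window_summary(pairs, lo, hi):
    """Mean and SD of dN/dS over pairs whose dS lies in (lo, hi].

    Pairs with dS = 0 are excluded because the ratio is undefined there,
    so a window written as [0, hi] is treated as (0, hi] in this sketch.
    Returns None when fewer than two pairs fall in the window.
    """
    ratios = [dn / ds for dn, ds in pairs if lo < ds <= hi]
    if len(ratios) < 2:
        return None
    return mean(ratios), stdev(ratios)


# Hypothetical pairwise (dN, dS) estimates, for illustration only.
pairs = [
    (0.020, 0.2), (0.030, 0.4),
    (0.040, 0.6), (0.045, 0.9),
    (0.055, 1.1), (0.070, 1.4),
    (0.080, 1.6), (0.095, 1.9),
]

edges = [0.0, 0.5, 1.0, 1.5, 2.0]

# Sliding windows: (0, 0.5], (0.5, 1.0], (1.0, 1.5], (1.5, 2.0]
sliding = {(lo, hi): window_summary(pairs, lo, hi)
           for lo, hi in zip(edges, edges[1:])}

# Widening ("opening up") windows: (0, 0.5], (0, 1.0], (0, 1.5], (0, 2.0]
cumulative = {(0.0, hi): window_summary(pairs, 0.0, hi) for hi in edges[1:]}

for window, summary in sliding.items():
    print("sliding", window, summary)
for window, summary in cumulative.items():
    print("cumulative", window, summary)
```

The first sliding window and the first widened window cover the same pairs, which is why the two halves of Table 4 share their first column.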
Several alternative hypotheses have been proposed in regard to those patterns, explaining them in terms of the organism’s adaptation to the relatively stable oligotrophic water (niche specialization), in terms of selection for metabolic economy or selection for small cell size and/or increased buoyancy, and in terms of a loss of low fitness genes due to increased mutation rate (Dufresne et al. 2003; Rocap et al. 2003; Garcia-Fernandez et al. 2004; Dufresne et al. 2005; Marais et al. 2007). We emphasize that these hypotheses are not mutually exclusive, but it is still unclear how these different evolutionary forces might interact with each other to generate the kind of genome reduction that is characteristic of the Prochlorococci, endosymbionts, and organelles. Additional complete genome sequences and environmental sequencing data may allow us to develop a more detailed phylogenetic framework that would yield further explanations concerning patterns of gene loss and the evolutionary forces leading to genome reduction.

The size and complexity of the environmental genome shotgun sequencing data sets have presented computational challenges along with new research opportunities (DeLong 2005). In addition to the kind of application demonstrated here, these data sets may also have potential applications to population genetics, for example, in the context of subpopulation structure modeling. An initial approximation of what might constitute a population can be obtained from phylogenetic trees under the assumption that each distinct clade represents a population. If a large proportion of genes in a genome exhibit a single, distinctive multimodal pattern in the distributions of their relative abundance, we might infer the existence of subpopulations within the species, with each mode representing one subpopulation. There are at least two major challenges involved in applying the EGSS data to population genetics.
First, none of the sequences can be assumed to be from the same organism. Thus with environmental sequence data, unlike with genome sequence data, it is difficult to measure recombination rates between isolates and it is therefore difficult to define population boundaries. Second, each data set contains different numbers of individual sequences, and the tree structure must be generated independently for each block or gene data set. This makes it difficult to directly compare gene sets using standard molecular evolution frameworks. We expect, however, that there is much insight to be gained by developing new methods capable of making better use of the environmental sequence data.

FIG. 4.—Comparison of dN/dS ratios for EGSS partial gene blocks within the Prochlorococcus group and the Synechococcus group. This figure shows the comparison of average pairwise dN/dS ratios for all EGSS partial gene blocks in the Prochlorococcus group (x axis) and the Synechococcus group (y axis).

FIG. 5.—Comparison of dN/dS ratios for selected genes across different data sets. This figure shows, for selected genes, the comparison of dN/dS ratios calculated on the basis of orthologous genes derived from complete genomes (x axis) and dN/dS ratios calculated on the basis of the corresponding EGSS full-length gene blocks (y axis).

Acknowledgments

The authors would like to thank Zhiyi Sun for invaluable discussions and comments and Stuart Cane for his expert editorial assistance.

Literature Cited

Ahlgren NA, Rocap G, Chisholm SW. 2006. Measurement of Prochlorococcus ecotypes using real-time polymerase chain reaction reveals different abundances of genotypes with similar light physiologies. Environ Microbiol. 8:441–454.
Andersson SG, Kurland CG. 1998. Reductive evolution of resident genomes. Trends Microbiol. 6:263–268.
Aris-Brosou S, Bielawski JP. 2006. Large-scale analyses of synonymous substitution rates can be sensitive to assumptions about the process of mutation. Gene. 378:58–64.
Blanchard JL, Lynch M. 2000. Organellar genes: why do they end up in the nucleus? Trends Genet. 16:315–320.
Chisholm SW, Olson RJ, Zettler ER, Waterbury J, Goericke R, Welschmeyer NA. 1988. A novel free living Prochlorococcus occurs at high concentration in the oceanic euphotic zone. Nature. 334:340–343.
Coleman ML, Chisholm SW. 2007. Code and context: Prochlorococcus as a model for cross-scale biology. Trends Microbiol. 15:398–407.
DeLong EF. 2005. Microbial community genomics in the ocean. Nat Rev Microbiol. 3:459–469.
Dufresne A, Garczarek L, Partensky F. 2005. Accelerated evolution associated with genome reduction in a free-living prokaryote. Genome Biol. 6:R14.
Dufresne A, Salanoubat M, Partensky F, et al. (21 co-authors). 2003. Genome sequence of the cyanobacterium Prochlorococcus marinus SS120, a nearly minimal oxyphototrophic genome. Proc Natl Acad Sci USA. 100:10020–10025.
Felsenstein J. 1989. PHYLIP—phylogeny inference package (version 3.2). Cladistics. 5:164–166.
Garcia-Fernandez JM, de Marsac NT, Diez J. 2004. Streamlined regulation and gene loss as adaptive mechanisms in Prochlorococcus for optimized nitrogen utilization in oligotrophic environments. Microbiol Mol Biol Rev. 68:630–638.
Hasegawa M, Kishino H, Yano T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 22:160–174.
Hess WR, Rocap G, Ting CS, Larimer FW, Stilwagen S, Lamerdin J, Chisholm SW. 2001. The photosynthetic apparatus of Prochlorococcus: insights through comparative genomics. Photosynth Res. 70:53–71.
Johnson ZI, Zinser ER, Coe A, McNulty NP, Woodward EM, Chisholm SW. 2006. Niche partitioning among Prochlorococcus ecotypes along ocean-scale environmental gradients. Science. 311:1737–1740.
Lee C, Grasso C, Sharlow MF. 2002. Multiple sequence alignment using partial order graphs. Bioinformatics. 18:452–464.
Lynch M, Blanchard JL. 1998. Deleterious mutation accumulation in organelle genomes. Genetica. 102–103:29–39.
Marais GA, Calteau A, Tenaillon O. 2007. Mutation rate and genome reduction in endosymbiotic and free-living bacteria. Genetica.
Martiny AC, Coleman ML, Chisholm SW. 2006. Phosphate acquisition genes in Prochlorococcus ecotypes: evidence for genome-wide adaptation. Proc Natl Acad Sci USA. 103:12552–12557.
Moore LR, Goericke R, Chisholm SW. 1995. Comparative physiology of Synechococcus and Prochlorococcus: influence of light and temperature on growth, pigments, fluorescence and absorptive properties. Mar Ecol Prog Ser. 116:259–275.
Partensky F, Hess WR, Vaulot D. 1999. Prochlorococcus, a marine photosynthetic prokaryote of global significance. Microbiol Mol Biol Rev. 63:106–127.
Rocap G, Larimer FW, Lamerdin J, et al. (24 co-authors). 2003. Genome divergence in two Prochlorococcus ecotypes reflects oceanic niche differentiation. Nature. 424:1042–1047.
Rocha EP, Smith JM, Hurst LD, Holden MT, Cooper JE, Smith NH, Feil EJ. 2006. Comparisons of dN/dS are time dependent for closely related bacterial genomes. J Theor Biol. 239:226–235.
Venter JC, Remington K, Heidelberg JF, et al. (23 co-authors). 2004. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 304:66–74.
Yang Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 13:555–556.
Yang Z, Nielsen R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 17:32–43.

Jennifer Wernegreen, Associate Editor
Accepted September 8, 2008