* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Rooting the Ribosomal Tree of Life Research article
Citric acid cycle wikipedia , lookup
Molecular ecology wikipedia , lookup
Magnesium transporter wikipedia , lookup
Fatty acid synthesis wikipedia , lookup
Community fingerprinting wikipedia , lookup
Ribosomally synthesized and post-translationally modified peptides wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Metalloprotein wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
Peptide synthesis wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Point mutation wikipedia , lookup
Proteolysis wikipedia , lookup
Molecular evolution wikipedia , lookup
Amino acid synthesis wikipedia , lookup
Biosynthesis wikipedia , lookup
Rooting the Ribosomal Tree of Life Gregory P. Fournier1 and J. Peter Gogarten*,1 1 Department of Molecular and Cell Biology, University of Connecticut *Corresponding author: E-mail: [email protected]. Associate editor: Martin Embley Abstract Key words: tree of life, ribosome, genetic code, ancestral reconstruction. Introduction In the last few decades, extensive sequencing of genetic material from a broad range of organisms has permitted the construction of large detailed phylogenetic trees representing their evolutionary relationships. Of particular interest are the deeper relationships between the three domains of life: Bacteria, Archaea, and Eukarya. As no organismal outgroup exists to polarize a universal phylogeny representing all three domains, identifying alternative methods to root universal trees is critical to understand the deep evolutionary history of life on Earth. This problem is also compounded by an increasing realization that universal trees do not necessarily reflect the organismal tree of life (ToL) due to extensive horizontal gene transfer. Many genes have distinct incongruent histories, and therefore, a single tree is insufficient to explain the full evolutionary history of life (Bapteste et al. 2009). Many protein families contain ancestral gene duplications, where divergent paralogous copies of a specific gene existed at the time of the most recent common ancestor (MRCA). In effect, this allows each paralog to act as outgroup for the other within a phylogenetic reconstruction, producing a reciprocal rooting of the tree (Schwartz and Dayhoff 1978). These ancestral gene duplications have been frequently used to root the ToL, an approach pioneered using protein sequences of the catalytic and noncatalytic subunits of the ATP synthase complex (Gogarten et al. 1989) and elongation factors (Iwabe et al. 1989). This approach has also been used with other gene families containing ancestral duplications, including aminoacyl transfer RNA (tRNA) synthetases (aaRS) (Brown and Doolittle 1995), an additional analysis of elongation factors (Baldauf et al. 1996), and protein-targeting machinery components (Gribaldo and Cammarano 1998). Additionally, this approach has been applied in at least one case to ancient internal gene duplications using the large subunit of carbamoyl phosphate synthetase (Schofield 1993; Lawson et al. 1996; Olendzenski and Gogarten 1998; Islas et al. 2007). These analyses generally support placing the root either on the bacterial branch or within the bacterial domain itself. In most of these analyses, the eukaryotic nucleocytoplasm groups together with archaeal homologs, and this grouping is further supported by higher order shared derived molecular characteristics (see discussion in Zhaxybayeva and Gogarten 2007). Molecular data have provided overwhelming evidence for the endosymbiotic origin of mitochondria and plastids (Gray 1993; Margulis 1995). Following common practice, we use the term ‘‘eukaryotes’’ to denote the nucleocytoplasmic component of the eukaryotic cell. Although support for the eukaryotic nucleocytoplasm grouping with the archaea is strong, the question of monophyletic or paraphyletic archaea remains contested. Sometimes, the eukaryotic nucleocytoplasm is found to have originated from one particular group of archaea (e.g., Cox et al. 2008); however, the root of the group of Eukaryotes plus Archaea is difficult to determine, and many argue that the eukaryotic nucleocytoplasm constitutes a deeper branch and a true sister group to extant archaea (see, e.g., the discussion in Dagan and Martin 2007; Poole and Penny 2007a, 2007b), although recent analyses have also suggested an eocyte rooting and topology (Cox et al. 2008). Phylogenetic analysis including all three domains of life is further complicated by the possibility of horizontal gene transfer events (Zhaxybayeva and Gogarten 2004), as well as artifacts of tree reconstruction, especially those caused by long branches frequently associated with interdomain relationships (for review, see Zhaxybayeva et al. 2005). Other approaches for rooting the ToL focus on specific physiological characters that are proposed to be primitive, © The Author 2010. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected] 1792 Mol. Biol. Evol. 27(8):1792–1801. 2010 doi:10.1093/molbev/msq057 Advance Access publication March 1, 2010 Downloaded from mbe.oxfordjournals.org at Libraries-Montana State University, Bozeman on September 30, 2010 Research article The origin of the genetic code and the rooting of the tree of life (ToL) are two of the most challenging problems in the study of life’s early evolution. Although both have been the focus of numerous investigations utilizing a variety of methods, until now, each problem has been addressed independently. Typically, attempts to root the ToL have relied on phylogenies of genes with ancient duplications, which are subject to artifacts of tree reconstruction and horizontal gene transfer, or specific physiological characters believed to be primitive, which are often based on subjective criteria. Here, we demonstrate a unique method for rooting based on the identification of amino acid usage biases comprising the residual signature of a more primitive genetic code. Using a phylogenetic tree of concatenated ribosomal proteins, our analysis of amino acid compositional bias detects a strong and unique signal associated with the early expansion of the genetic code, placing the root of the translation machinery along the bacterial branch. Rooting the Ribosomal Tree of Life · doi:10.1093/molbev/msq057 translation machinery must have evolved in a preprotein (and likely RNA-based) world, such a complex system could only have evolved incrementally, likely coevolving with the emergence of encoded proteins of increasing complexity and functionality. As such, it is also likely that specific genetically encoded amino acids were incorporated into protein synthesis gradually, up until the code reached its current retinue of 20 amino acids. Amino acids could not be fixed at specific positions within proteins until their establishment in the code; therefore, at the time of the MRCA, the most ‘‘recent’’ amino acids would have had less time to become selected for and should have been underrepresented if there was insufficient time for equilibrium to be attained. Conversely, more ancient amino acids should have been overrepresented at this time (Fournier and Gogarten 2007). Based on this model, at least two distinct approaches have attempted to use amino acid usage at ancient positions to infer the evolutionary history of the genetic code, albeit with somewhat differing results (Brooks and Fresco 2002; Brooks et al. 2002, 2004; Fournier and Gogarten 2007). The work of Brooks et al. 2002, 2004; identifies biases in overall amino acid usage in ancestral reconstructions of a wide set of universally conserved genes using an expectation maximization methodology. In contrast, Fournier and Gogarten (2007) only count completely conserved positions within universal ribosomal proteins, a simpler yet more stringent model that seeks to avoid overinterpretation of reconstructions at variable positions at the expense of a large data set. These analyses both agree on the ancient underrepresentation of Cys, Trp, Phe, and Tyr residues. However, whereas Brooks et al. (2004) also identify Ser, Thr, Leu, Gln, and Asp as recent and Val, Ile, His, and Glu as more ancient, Fournier and Gogarten (2007) identify Ile, Val, Glu, and Lys as recent and Asn and Gly as ancient. Interestingly, a subsequent revised analysis (Fournier 2009) incorporating a 90% conservation probability cutoff for the inclusion of reconstructed positions produces results more similar to that of Brooks et al., providing statistical support for Ser and Gln as more recent additions while removing support for Ile and Val. As these hydrophobic amino acids are extremely similar to one another, it is likely that the initial analysis underestimated their ancestral frequency due to occasional substitutions in some genomes. In both these compositional analyses, the location of the root of the ToL is assumed to be known (as either a midpoint of the individual molecular phylogenies [Brooks et al. 2004] or the bacterial branch rooting [Fournier and Gogarten 2007]), with the ancestral genetic code impacting usages on this particular deep branch. Both these assumptions are increasingly difficult to justify, given the emerging understanding of the rate heterogeneity of genomic evolution, the frequency of gene transfer, and the occasionally conflicting rootings identified using different sets of genes (Zhaxybayeva et al. 2005; Bapteste et al. 2009). However, assuming that a strong and unique compositional bias indicates a proximity to a more primitive genetic code unique to the deepest branch on the ToL, it becomes apparent that detecting such 1793 Downloaded from mbe.oxfordjournals.org at Libraries-Montana State University, Bozeman on September 30, 2010 such as aspects of membrane and cell wall structure (Cavalier-Smith 2002), split tRNA structures (Di Giulio 2007), or extensive ‘‘RNA-world’’ molecular relics (Penny and Poole 1999). Branches associated with groups showing these primitive characters are postulated to contain the root. These approaches root the tree within the bacterial or archaeal domains or on the eukaryal branch. Although logical support can be given to define any of these characters as primitive, ultimately these approaches depend on subjective and often controversial hypotheses about the physiological nature of early life. Whereas primitive characters are not useful in phylogenetic reconstruction (Hennig 1966), derived characters (duplicated genes, large insertions, or gaps) can be used to exclude the root from that part of the ToL where organisms possess the derived character state, although the root may be located on the branch where the character transitioned from primitive to derived (Gogarten et al. 1989; Zhaxybayeva et al. 2005; Skophammer et al. 2006, 2007; Lake et al. 2007; Zhaxybayeva and Gogarten 2007). Many of the aforementioned analyses depend either directly or indirectly on specific genes that show weak conservation and/or a history of frequent interdomain horizontal gene transfer, making them poor proxies for an organismal ToL. It seems likely that the best genetic proxy for an organismal ToL are phylogenetic trees generated using the ribosomal machinery, specifically the three ribosomal RNAs (rRNAs) and about 29 universal ribosomal proteins that comprise the ‘‘core ribosome.’’ Although ribosomal protein- and RNA-encoding genes have been transferred in the past (see examples and discussion in Gogarten et al. 2002), these genes are resistant to transfer (Sorek et al. 2007), with most transfers occurring between close relatives followed by homologous recombination (e.g., Morandi et al. 2005). These transfers across short phylogenetic distances have little effect on large-scale tree topology, especially among deep branches where the root of the ToL may reside. Additionally, the high level of structural, sequence, and functional conservation in ribosomal proteins allows for a higher confidence in tree topologies being free of artifacts, especially those producing long branches via radical functional divergence following duplication, as proposed for some deep paralogs (Philippe and Forterre 1999; Cavalier-Smith 2006). For these reasons, a ribosomal ToL has been proposed to be an ideal ‘‘backbone’’ upon which to map horizontal gene transfers, clearly depicting their distinct contribution to genomic evolution (Gogarten 1995; Dagan et al. 2008; Swithers et al. 2009). Unfortunately, because there are no confirmed paralogs for ribosomal genes, rooting the ribosomal ToL using established methods so far has not been possible. In comparison with many organismal characters used in other rooting methods, the genetic code is shared by all cellular life and likely present in its complete form at the time of the MRCA, apart from slight modifications within specific lineages (Knight et al. 2001; Miranda et al. 2006; Fournier and Gogarten 2007). Although this strongly suggests that the genetic code and its requisite MBE MBE Fournier and Gogarten · doi:10.1093/molbev/msq057 domain. Domain alignments were then combined in ClustalW v. 1.83 (Thompson et al. 1994) using a profile alignment. Alignments for each individual protein were then concatenated. Phylogeny and Ancestral sequence Reconstruction a bias is, in fact, a method for empirically locating the root of the ribosomal ToL. In this generalized approach, positions conserved within every branch of a phylogenetic tree can be identified using ancestral sequence reconstructions (fig. 1). Therefore, if enough sequence conservation exists within a universal protein phylogeny, the branch containing the root can be directly inferred, if positions conserved within it have retained a strong and unique signature of bias in amino acid composition independent of other physiological effects. Provided that such a unique compositional signal is detected, its compatibility with existing hypotheses regarding genetic code evolution provides additional empirical evidence for both the location of the root and the supported model(s) of code evolution. Like previous composition-based investigations into genetic code evolution, this hypothesis assumes that newer amino acids will approach an equilibrium of usage due to the effects of positive and purifying selection on protein sequences, thereby sequestering this ‘‘signature’’ to only the deepest branches of the tree. It is also important to note that, while relying on the ‘‘primitive’’ character of an ancestral genetic code signature, this approach is distinct from other attempts to root the tree using primitive characters, in that the root is being directly inferred along an internal branch, as opposed to indirectly inferred via the character states within various taxa at terminal positions on the tree. Model for Identification of Conserved Positions Materials and Methods Usage bias along each branch was calculated for each amino acid as the difference between the observed and average simulated usage rate at conserved positions, measured as a relative percentage. For tests of statistical significance, this difference was measured in SD units (standard deviation of simulated data sets), followed by a two-sided Z-test. Significance was defined as a corresponding P value of less than 5% (P , 0.05). To measure combined compositional bias for each branch, usage biases (SD) for all 20 analyzed amino acids were combined by calculating a vector in Sequence Collection and Alignment Sequences of 29 universally conserved ribosomal proteins were collected from GenBank (Benson et al. 2008) from 121 genomes, including 45 completed bacterial genomes with a wide phylogenetic distribution, 43 completed archaeal genomes, and 33 eukaryal genomes. The MUSCLE program (Edgar 2004) was used to perform a multiple sequence alignment for individual proteins within each 1794 Positions within each node of the ancestral reconstruction were defined as ‘‘conserved’’ if their identity was reported with at least 90% probability. This filter permits the inclusion of sites that may have infrequent substitutions between similar amino acids at terminal branches, avoiding any underrepresentation of similar frequently interchanged residues due to excessive stringency. Assuming in each case that the branch in question contains the root, positions were defined as conserved within each branch if identical positions were conserved as the same amino acid within both adjacent nodes. Data Set Simulations Simulated sequences (ten replicates) were generated using the EVOLVER program in PAML (Yang 2007), using the above-described observed tree topology, 4,500 positions (roughly equal to the average ribosomal protein concatenate length), a 5 1.14 (as estimated for the actual data), under a WAG model with four rate categories. Initial amino acid compositions were set to the average observed overall ribosomal protein amino acid usages from the concatenated data sets. For each replicate, ancestral sequences were reconstructed using ANCESCON (O-option and no optimization of P-vector). Measuring/Combining Compositional Bias Downloaded from mbe.oxfordjournals.org at Libraries-Montana State University, Bozeman on September 30, 2010 FIG. 1. Model for identifying amino acid positions conserved on branches. For ancestral node reconstructions, shaded positions indicate .90% probability of amino acid identity, reflecting a high probability of fixation. The set of conserved positions shared between adjacent nodes corresponds to those conserved at a postulated root located along this particular branch. The analysis is repeated for every branch within the tree. Sequence fragments are from reconstructions of ribosomal protein L11. Maximum likelihood (ML) trees were constructed using PHYML (Guindon and Gascuel 2003) under a WAG þ C (four rate categories, estimated a) þ PINVAR model, with 100 bootstrap replicates. Ancestral sequences were reconstructed using ANCESCON (O-option and no optimization of P-vector, WAG model) (Cai et al. 2004) using an ML tree constructed from positions with at least 50% conservation (i.e., at least 50% of the extant sequences at the tip of the branches shared the same amino acid). This marginal reconstruction method does not rely on tree polarity and determines the likelihood of each amino acid at each position within ancestral nodes, given the alignment and unrooted tree topology. MBE Rooting the Ribosomal Tree of Life · doi:10.1093/molbev/msq057 20D space. The length of this vector was then used to calculate the geometric mean distance per amino acid (SD). Long-Branch Compositional Bias Simulation Results and Discussion Phylogeny of Concatenated Ribosomal Proteins The ribosome is one of the most ancient and wellconserved structures in the biological world, with 29 ‘‘core’’ proteins universally present across all cellular life (Harris et al. 2003). The high level of sequence conservation across all three domains allows for reliable alignment of individual ribosomal protein sequences, and as there are high selective barriers to horizontal transfers of these genes (Sorek et al. 2007), they are more likely to retain a vertical signal of organismal evolution. These features make ribosomal protein sequences excellent for constructing reliable phylogenetic trees, especially for distantly related groups; however, because no paralogs of universal ribosomal proteins are currently known, phylogenies of ribosomal proteins (and ribosomal RNAs) have remained unrooted. Individual core ribosomal proteins are short in length, each containing few phylogenetically informative positions. As such, although universal phylogenies generated from alignments of individual ribosomal proteins generally do not show significant conflict, they are largely unresolved (data not shown). Concatenation of the 29 individually aligned protein sequences produces an alignment with 9,258 positions, resulting in consistent highly resolved phylogenies using a variety of tree reconstruction methods. Consistency and resolution is further improved by only including slowly evolving positions, generating trees using an alignment consisting of 4,571 positions showing at least 50% identity within each site. The resulting ML phylogenetic tree was largely congruent with phylogenies generated using 16S rRNA (Woese and Fox 1977) in that it showed strong support for the monophyly of Archaea, Bacteria, and Eukarya, as well as the monophyly of other major groups, such as Crenarchaeota, Euryarchaeota, Proteobacteria, and Firmicutes (supplementary text file S2, Supplementary Material online). Some artifacts due to composition and long branches are also present, such as the deep placement of some protist and fungal lineages within the Eukarya and possibly the locations of the roots of the bacterial and archaeal domains. Tree reconstructions using the more sophisticated nonhomogenous site model in PhyloBayes (Lartillot and Philippe 2004) did not remove the apparent long-branch attraction artifacts present in Compositional Bias and Substitution Models To attribute compositional bias at conserved positions to a branch-specific biological effect, the model of amino acid substitutions must be first taken into account. In any biologically relevant substitution model, similar amino acids more frequently substitute for one another, whereas some more unique amino acids undergo substitutions much less often. Based on these characteristics, conserved positions on any branch should contain an overabundance of amino acids, which are relatively invariant (e.g., Cys), and an underabundance of amino acids, which frequently substitute for others (e.g., Ser), relative to their overall frequencies, which remain constant across the reconstruction. Furthermore, the more evolution separates sequences (i.e., the longer the branch), the more pronounced these effects should become. Testing this assumption by simulating sequence evolution across increasingly long branches shows that this is indeed the case (fig. 2). Even before a branch length of 1 substitution/site is reached, conserved positions show a clear enrichment in Gly, Cys, Trp, and Pro. At longer branch lengths, an underabundance of Asn, Gln, Lys, and Ser also becomes apparent. Therefore, observed usages at conserved positions along branches must be compared with those detected within simulated sequence reconstructions in order to avoid attributing branch-specific biological relevance to any observed bias in these particular amino acids. Interestingly, these biases seem to reach a maximum at branch lengths of 10–20 and then decrease until converging with overall amino acid usage rates at branch lengths around 50. As this is approximately the branch length at which, for this model, the number of conserved positions remaining approaches the number expected by chance between two nonrelated sequences (N/20), this likely represents the eventual removal of any detectable sequence homology due to extreme saturation with substitutions. At this point, remaining matching positions are therefore merely a product of overall sequence composition, as would be the case for two random sequences. Amino Acid Usage Biases In comparing observed conserved positions with those within reconstructions using simulated sequences, ten 1795 Downloaded from mbe.oxfordjournals.org at Libraries-Montana State University, Bozeman on September 30, 2010 Simulated sequences were generated using the EVOLVER program in PAML, using a simple three-branch tree topology, 4,500 positions, flat amino acid usage rates (all equal to 0.05), a 5 1.14, under a WAG model with four rate categories. Ten replicates were generated for each branch length at log-linear intervals (0.135–90.017 substitutions/ site). For each simulation, the usage rates of amino acids at conserved positions along one branch were calculated and then averaged across replicates. our phylogeny (data not shown). However, the observed phylogenetic artifacts and uncertainties should have little if any impact on our analysis, given the very short deep branches within the bacterial and archaeal domains, and the absence of any hypothesis, placing the root within the eukaryal domain. In addition, the impact of model misspecification on ancestral state reconstruction is greatly reduced by the limitation of our analyses to sites for which the ancestral state is determined with more than 90% posterior probability. Future refinement of this methodology may explore branch heterogeneity with respect to composition and covariation (e.g., Blanquart and Lartillot 2008), although it is unclear if such an approach would significantly impact our results. Fournier and Gogarten · doi:10.1093/molbev/msq057 MBE amino acids showed strong statistically significant biases within deep branches, with nine showing the strongest bias within the bacterial branch, that is, the branch leading to the bacterial root (table 1, figs. 3 and 4, supplementary fig. S1, Supplementary Material online). Gly, Ala, and Asn all showed a strong significant overrepresentation in the bacterial branch (35.1%, 45.6%, and 54.7%, respectively). The branch leading to the archaeal root also showed weaker yet significant overrepresentation of Ala and Gly. The only other amino acid showing significant overrepresentation Table 1. Amino Acid Usage at Conserved Positions on Deep Phylogenetic Branches. Amino Acid Ala Asp Cys Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr Usage within Deep Branches Bacterial 0.113* 0.043 0.005* 0.045 0.020* 0.185* 0.022 0.068 0.068 0.089 0.021 0.037* 0.067 0.011* 0.086 0.024* 0.042 0.080 0.002* 0.015* Archaeal 0.093* 0.040 0.007 0.058 0.025* 0.135* 0.023 0.071 0.082 0.085 0.022 0.026 0.070 0.016* 0.096* 0.031* 0.045 0.077 0.008* 0.030 Eukaryal 0.091 0.041 0.009 0.055 0.026* 0.140 0.024 0.063 0.081 0.085 0.019 0.028 0.076 0.017 0.102* 0.028* 0.048 0.073 0.009* 0.025* Global Ribosomal Usage 0.082 0.042 0.008 0.064 0.027 0.079 0.021 0.066 0.101 0.075 0.018 0.037 0.043 0.032 0.088 0.051 0.049 0.083 0.007 0.027 * amino acid usage rates significantly different from expected usage in sequence simulations (2-sided Z-test, p,0.05, see Materials and Methods). 1796 on deep branches was Arg along the branches leading to the archaeal and eukaryal roots, albeit with a substantially weaker bias (13.9% and 22.2%, respectively). Conversely, strong significant underrepresentation along the bacterial branch was observed for Gln (47.6%), Phe (44.0%), Ser (38.2%), Trp (85.1%), Tyr (64.1%), and Cys (65.4%). Weaker yet significant biases were observed along the branch leading to the archaeal root for Gln, Phe, Ser, and Trp and along the branch to the eukaryal root for Phe, Ser, Trp, and Tyr. The strong statistically significant biases observed for Asn, Cys, and Trp along the bacterial branch are especially compelling as their direction is the inverse of that expected for conserved positions within long branches (fig. 2). Therefore, these usage biases cannot be explained by an underestimation of branch lengths at deep positions within the tree. Similarly, usages of Ala, Tyr, and Phe should be immune to any such conservation bias as these are not observed to substantially fluctuate over increasing branch lengths. Because the usage biases of Gly, Gln, and Ser are of the same direction as the expected conservation biases, their significance lies in the magnitude of the effect, which potentially could be an artifact of underestimated branch lengths. However, the magnitude of the observed biases is beyond what the model can accommodate. For example, at the maximal bias observed within simulations (branch length ;20), usage of Gly at conserved positions is less than double the initial condition (8.7% vs. 5.0%) (fig. 2), a substantially smaller increase than what is observed on the much shorter branch leading to the bacterial root compared with current average ribosomal usages (18.5% vs. 7.9%) (table 1). The case is similar for the underrepresentation of Gln and Ser. Therefore, it is unlikely that the strong compositional bias observed at conserved positions on the bacterial branch is Downloaded from mbe.oxfordjournals.org at Libraries-Montana State University, Bozeman on September 30, 2010 FIG. 2. Branch length–induced compositional bias at conserved positions. Bias in the usage of particular AAs increases with evolutionary distance (branch length) due to the differential substitution rates of AAs as described in biologically relevant substitution models. As labeled, AAs that tend to occupy conserved positions tend to increase in usage (Gly, Cys, Trp, and Pro), whereas AAs that frequently substitute tend to decrease (Asn, Gln, Ser, and Lys). At extreme branch lengths, this bias subsides as no informative positions remain due to saturation with substitutions. Branch lengths are represented on a log scale. AA, amino acid. MBE Rooting the Ribosomal Tree of Life · doi:10.1093/molbev/msq057 a product of especially large conservation bias along long branches. Rare amino acids such as Cys, Trp, and Met show particularly strong biases across most of the tree; however, because their usage rates are low with respect to their variance across simulated analyses, most of these biases are nonsignificant. For example, even though Met shows an overrepresentation of 40.1% along the bacterial branch, this value is not statistically significant at the cutoff level used for this analysis (P , 0.05). Composite Analysis Combining individual amino acid biases into an overall composite bias across the tree (fig. 3), the bacterial branch clearly shows the strongest composite bias outside of the haloarchaea, corresponding to 3.44 SD per amino acid. In comparison, the composite bias in the archaeal branch is equivalent to 1.95 SD per amino acid and in the eukaryal branch is 1.71 SD per amino acid. This verifies that the strong individual biases in amino acid usage along this branch are indicative of a unique signal and that there is no strong alternative signal consisting of a combined effect among amino acids with weaker biases, which, on their own, would not be identified as significant. In general, composite bias is greater within the Archaea than in the Bacteria and Eukarya due to the combined effects of widespread physiological factors that influence amino acid composition in different archaeal groups. Alternative Compositional Signatures Several physiological characters can impose amino acid biases within proteins, including thermophily, halophily, and bias in genomic composition, specifically G þ C content. The impact of each of these factors is well known and is reflected in the observed amino acid composition across several branches of the concatenated ribosomal tree (table 2). The overrepresentation of Asp and Glu in halophiles is an adaptation to the ionic hypersaline environment of their intracellular space (Dennis and Shimmin 1997). Thermophiles prefer charged residues over polar due to the thermodynamics of protein folding and also tend to favor proline to promote rigidity of protein structures (Watanabe et al. 1991; Fukuchi and Nishikawa 2001; Zhou et al. 2008). Finally, genomes with strong G þ C biases tend to favor and avoid specific sets of amino acids based on the nucleotide content of their codons (Paila et al. 2008). Each of these observed physiologically imposed compositional biases is incongruent with the set of biases observed on the bacterial branch; this reinforces our interpretation that this signature is not simply reflection of an ancestral physiological state but is an echo of a more primitive genetic code at the deepest branch of the phylogeny of the ribosome. An alternative hypothesis to explain the previously reported (Fournier and Gogarten 2007) compositional bias within the bacterial branch is that it represents a ‘‘mesophilic 1797 Downloaded from mbe.oxfordjournals.org at Libraries-Montana State University, Bozeman on September 30, 2010 FIG. 3. Composite amino acid usage bias across universal ribosomal tree. Bias values reflect the geometric mean distances (normalized as SD) between observed and expected amino acid usages. Increased bias across the archaeal domain is largely due to widespread thermophily, halophily, and nucleotide composition bias. Aside from branches associated with haloarchaea, the branch leading to the bacteria (the bacterial root) contains the greatest bias. Fournier and Gogarten · doi:10.1093/molbev/msq057 MBE signature’’ for early life, as indicated by an underrepresentation of amino acids typically favored in thermophiles (Boussau et al. 2008). Although the signature we reconstruct for the bacterial branch is inconsistent with ancient thermophily, it does not necessarily follow that this signature is solely caused by the converse physiological state, that is, ‘‘nonthermophily.’’ Because the majority of sequences used to provide the initial amino acid frequencies in the expectation simulation are from mesophiles, a branch solely defined by its mesophilic character should produce no significant signature of compositional bias at all; underrepresentation of ‘‘thermophilic’’ amino acids would therefore not be expected. Our data therefore are compatible with an underrepresentation of amino acids added late during the expansion of the genetic code and a mesophilic MRCA located on the bacterial branch. Assuming a mesophilic MRCA alone cannot explain the observed compositional bias. Comparison with Other Models for the Early Expansion of the Genetic Code The strong unique biases in amino acid composition at conserved positions present along the bacterial branch cannot be explained by the effects of substitution models within long branches, nucleotide composition, or the compositional impacts of any known physiological conditions. Additionally, the specific amino acid biases detected are similar to those predicted in several models of genetic code Table 2. Amino Acid Bias Signatures within the Universal Ribosomal Tree. Signature Thermophiles Halophiles Low G 1 C High G 1 C Deep branches 1798 Overrepresentation Glu, Pro, Trp Asp, Glu Ile, Lys, Ser, Tyr, Asn Arg Ala, Gly, Asn Underrepresentation Ser, Gln Ile, Lys, Tyr, Cys Pro, Glu Ile, Asn Phe, Gln, Ser, Trp, Tyr, Cys evolution, suggesting that the bacterial branch is closest to a more primitive state of the genetic code, and therefore contains the root of the translational machinery. The set of amino acids showing significant bias along the bacterial branch is similar to those previously detected by a related but less sophisticated analysis (Fournier and Gogarten 2007). However, the incorporation of phylogeny, ancestral sequence reconstruction, substitution biases, and a probabilistic definition of conserved positions greatly increase the sample size and utility of the method presented, correcting artifacts in the previous analysis, such as the previously discussed apparent underrepresentation of Ile, Val, and Lys. The absence of a significant bias on this deep branch for some amino acids (Val, Ile, Leu, Asp, Pro, Thr, Lys, or Arg) can be explained in two ways. First, an amino acid may actually be within the ‘‘intermediate set’’ of those added to the code, with sufficient time for usage equilibrium before the MRCA, but not old enough for an initial overwhelming excess to exist. Second, some amino acids within this set may be equally ancient to those showing a significant overrepresentation; however, their initial excess may have rapidly disappeared with the addition of other similar amino acids that provided a slight selective advantage in many positions, effectively ‘‘partitioning away’’ their excess. In some cases, this would also effectively diminish the bias signal of the underrepresented ‘‘new’’ amino acid as sites would have been ‘‘preselected’’ for their invasion. This preselection effect could explain the flat usage levels of Val, Leu, and Ile, making it impossible to determine from this analysis in what order they were added to the code. This may also be the case for Arg and Lys. However, because Thr and Asp both have similar alternatives, which are underrepresented on the bacterial branch (Ser and Glu), there is stronger evidence that these are both indeed more ancient amino acids. Because Pro is a functionally and structurally unique amino acid that should not be susceptible to Downloaded from mbe.oxfordjournals.org at Libraries-Montana State University, Bozeman on September 30, 2010 FIG. 4. Amino acid usages at conserved positions along the branch leading to the bacterial root. Error bars represent SD calculated from ten replicate simulations in the case of expected values (red bars, SD corrected for low sample size) and from 29 universal ribosomal proteins in the case of observed usages in extant ribosomal protein sequence (green bars). MBE Rooting the Ribosomal Tree of Life · doi:10.1093/molbev/msq057 (Mosyak et al. 1995) and undergoes a class I-like aminoacylation reaction (Sprinzl and Cramer 1975); SerRS does not recognize the anticodon of its cognate tRNAs (as its uniquely disparate set of six codons makes this impossible), instead relying on a long variable arm for interaction specificity (Asahara et al. 1994). This may suggest that in each case, these aaRS are more derived and among the more recent class II variants to evolve. The identification of Ser as a late amino acid is especially interesting for additional reasons as this amino acid is generally considered to be among the most ancient based on biochemical, metabolic, and codon table criteria (Trifonov 2000). Although it may be the case that the biochemical and metabolic support for Ser being a primordial amino acid simply indicates its likely participation in primordial intermediate metabolism (predating its incorporation into the genetic code), this type of speculation requires evidence beyond that provided in the scope of this investigation. There is at least one model of genetic code evolution predicting Thr preceding Ser (Higgs 2009), which reconstructs the organization and evolution of the genetic code using a selection-based model, evaluating the fitness effect of code expansion via the addition of new amino acids via stepwise codon space partitioning. The amino acids predicted to be the earliest in this model (Gly, Ala, and Asp) are also in agreement with the presented analysis, along with Val, on which our analysis remains agnostic as previously described. In further agreement, Pro is predicted to be added at an intermediate stage and Cys, Trp, Tyr, and Phe at a later stage. The only amino acids showing conflict between these two analyses are Glu and Asn. Higgs (2009) places Glu among the earlier amino acids to be added (fifth, in the second stage of code evolution), as opposed to among the most recent. Conversely, Higgs (2009) places Asn as an intermediate-late addition, as opposed to an early one. In both cases, these differences are likely related to the selection-based model favoring the physiochemical ionic similarity of the Asp–Glu pair over the metabolic/ chemical similarity of the Glu–Gln and Asp–Asn pairs. Conclusions Rooting the tree of the translation machinery using the polarizing ancestral state of its own functional semantics (i.e., the genetic code) is a novel approach that avoids many of the shortcomings of other rooting methods while simultaneously making use of a large set of proteins providing reliable phylogenetic reconstruction. The results of this method locate the ribosomal MRCA upon the branch leading to the bacterial domain, in agreement with previous analyses utilizing sets of genes that underwent ancestral duplications, including those also related to protein synthesis. As ribosomal proteins are more likely to show a consistent phylogenetic signal indicative of vertical inheritance, the rooting of the ribosomal ToL provides a polarized phylogenetic scaffold upon which the complex genetic reticulations of evolutionary history can be mapped and better understood. Additionally, this method identifies a pattern 1799 Downloaded from mbe.oxfordjournals.org at Libraries-Montana State University, Bozeman on September 30, 2010 competition, it seems most likely that it was actually added at an intermediate stage. Additional inferences about code evolution can be made by comparing the signatures of sets of metabolically related amino acids with similar physiochemical properties and assuming that these ‘‘come as a set.’’ For example, Gln is physiochemically analogous to Asn and shows strong underrepresentation on the bacterial branch, indicating that it was not a specifically encoded amino acid in an earlier version of the genetic code. Although its underrepresentation on the bacterial branch is not statistically significant at the cutoff used in this analysis, Glu does show a trend congruent with the metabolically related Glu–Gln amino acid pair being a more recent addition to the code. Likewise, the similar Asp–Asn pair is more likely to be ancient due to the detected overrepresentation of Asn along the same deep branch. Based on this inference, at a more primitive state, the code would still contain a similar diversity of physiochemical properties, being able to synthesize proteins containing hydroxyl, amino, hydrophobic, and both positively and negatively charged side chains. This suggests that after its earliest stages, genetic code evolution may have promoted continued optimization of existing protein folds rather than being responsible for a radical expansion of protein diversity and functionality. The only exceptions seem to be unique specialized amino acids such as those with aromatic structure (Tyr, Trp, and Phe) and Cys. Aromatic residues may have been advantageous for the hydrophobic core packing needed as proteins became larger, whereas Cys would have been advantageous as more opportunities developed for complex linkages between proteins, lipids, and other substrates. Alternatively, aromatic amino acids could have been selected for as RNA ‘‘replacements’’ in a dying RNA world, mimicking nucleotides in important structural roles (e.g., stacking interactions). The unique metal-binding properties of Cys may also have played a role in its selection as this faculty may have been deficient in an RNA-based physiology. The specific amino acid biases identified on the bacterial branch are consistent with several generally accepted principles regarding the evolution of the genetic code, specifically a preference for earlier amino acids to be simple, with latter ones more complex (Trifonov 2000). This is especially apparent in the identification of Trp and Tyr as more recent amino acids as both are products of complex metabolic pathways (Klipcan and Safro 2004). These results are also in agreement with the hypothesis that class II aaRS are generally associated with earlier amino acids, with class I aaRS evolving later (Hartman 1995). In fact, every amino acid identified as ‘‘early’’ in this analysis has a cognate class II aaRS, whereas most of the amino acids identified as ‘‘late’’ have a cognate class I aaRS. Second-order inferences regarding amino acids without significant bias also favor this model as Asp and Thr (inferred to be earlier) are both associated with class II aaRS, whereas Glu (inferred to be more recent) is associated with class I. The exceptions are Phe and Ser; interestingly, both are associated with unique class II aaRS: PheRS has a unique heterotetrameric structure Fournier and Gogarten · doi:10.1093/molbev/msq057 of amino acid usage biases in general agreement with current models of genetic code evolution. Supplementary Material Supplementary figure S1 and supplementary text file S2 are available at Molecular Biology and Evolution online (http:// www.mbe.oxfordjournals.org/). Acknowledgments References Asahara H, Himeno H, Tamura K, Nameki N, Hasegawa T, Shimizu M. 1994. Escherichia coli seryl-tRNA synthetase recognizes tRNA(Ser) by its characteristic tertiary structure. J Mol Biol. 236:738–748. Baldauf SL, Palmer JD, Doolittle WF. 1996. The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny. Proc Natl Acad Sci U S A. 93:7749–7754. Bapteste E, O’Malley MA, Beiko RG, et al. (11 co-authors). 2009. Prokaryotic evolution and the tree of life are two different things. Biol Direct. 4:34. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2008. GenBank. Nucleic Acids Res. 37:26–31. Blanquart S, Lartillot N. 2008. A site- and time-heterogeneous model of amino-acid replacement. Mol Biol Evol. 25:842–858. Boussau B, Blanquart S, Necsulea A, Lartillot N, Gouy M. 2008. Parallel adaptations to high temperatures in the Archaean eon. Nature 456:942–945. Brooks DJ, Fresco JR. 2002. Increased frequency of cysteine, tyrosine, and phenylalanine residues since the last universal ancestor. Mol Cell Proteomics. 1:125–131. Brooks DJ, Fresco JR, Lesk AM, Singh M. 2002. Evolution of amino acid frequencies in proteins over deep time: inferred order of introduction of amino acids into the genetic code. Mol Biol Evol. 19:1645–1655. Brooks DJ, Fresco JR, Singh M. 2004. A novel method for estimating ancestral amino acid composition and its application to proteins of the Last Universal Ancestor. Bioinformatics 20:2251–2257. Brown J, Doolittle W. 1995. Root of the universal tree of life based on ancient aminoacyl-tRNA synthetase gene duplications. Proc Natl Acad Sci U S A. 92:2441–2445. Cai W, Pei J, Grishin NV. 2004. Reconstruction of ancestral protein sequences and its applications. BMC Evol Biol. 4:33. Cavalier-Smith T. 2002. The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification. Int J Syst Evol Microbiol. 52:7–76. Cavalier-Smith T. 2006. Rooting the tree of life by transition analyses. Biol Direct. 1:19. Cox CJ, Foster PG, Hirt RP, Harris SR, Embley TM. 2008. The archaebacterial origin of eukaryotes. Proc Natl Acad Sci U S A. 105:20356–20361. Dagan T, Artzy-Randrup Y, Martin W. 2008. Modular networks and cumulative impact of lateral transfer in prokaryote genome evolution. Proc Natl Acad Sci U S A. 105:10039–10044. 1800 Dagan T, Martin W. 2007. Testing hypotheses without considering predictions. Bioessays 29:500–503. Dennis PP, Shimmin LC. 1997. Evolutionary divergence and salinitymediated selection in halophilic archaea. Microbiol Mol Biol Rev. 61:90–104. Di Giulio M. 2007. The tree of life might be rooted in the branch leading to Nanoarchaeota. Gene 401:108–113. Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32: 1792–1797. Fournier GP. 2009. Genetic code evolution and amino acid composition analysis [PhD thesis]. [Storrs (CT)]: Department of Molecular and Cell Biology, University of Connecticut. Fournier GP, Gogarten JP. 2007. Signature of a primitive genetic code in ancient protein lineages. J Mol Evol. 65:425–436. Fukuchi S, Nishikawa K. 2001. Protein surface amino acid compositions distinctively differ between thermophilic and mesophilic bacteria. J Mol Biol. 309:835–843. Gogarten JP. 1995. The early evolution of cellular life. Trends Ecol Evol. 10:147–151. Gogarten JP, Doolittle WF, Lawrence JG. 2002. Prokaryotic evolution in light of gene transfer. Mol Biol Evol. 19:2226–2238. Gogarten JP, Kibak H, Dittrich P, et al. (13 co-authors). 1989. Evolution of the vacuolar Hþ-ATPase: implications for the origin of eukaryotes. Proc Natl Acad Sci U S A. 86:6661–6665. Gray MW. 1993. Origin and evolution of organelle genomes. Curr Opin Genet Dev. 3:884–890. Gribaldo S, Cammarano P. 1998. The root of the universal tree of life inferred from anciently duplicated genes encoding components of the protein-targeting machinery. J Mol Evol. 47:508–516. Guindon S, Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52:696–704. Harris JK, Kelley ST, Spiegelman GB, Pace NR. 2003. The genetic core of the universal ancestor. Genome Res. 13:407–412. Hartman H. 1995. Speculations on the evolution of the genetic code IV: the evolution of the aminoacyl-tRNA synthetases. Orig Life Evol Biosph. 25:265–269. Hennig W. 1966. Phylogenetic systematics. Urbana (IL): University of Illinois Press. Higgs PG. 2009. A four-column theory for the origin of the genetic code: tracing the evolutionary pathways that gave rise to an optimized code. Biol Direct. 4:16. Islas S, Hernández-Morales R, Lazcano A. 2007. Question 7: comparative genomics and early cell evolution: a cautionary methodological note. Orig Life Evol Biosph. 37:415–418. Iwabe N, Kuma K, Hasegawa M, Osawa S, Miyata T. 1989. Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes. Proc Natl Acad Sci U S A. 86:9355–9359. Klipcan L, Safro M. 2004. Amino acid biogenesis, evolution of the genetic code and aminoacyl-tRNA synthetases. J Theor Biol. 228:389–396. Knight RD, Freeland SJ, Landweber LF. 2001. Rewiring the keyboard: evolvability of the genetic code. Nat Rev Genet. 2:49–58. Lake JA, Herbold CW, Rivera MC, Servin JA, Skophammer RG. 2007. Rooting the tree of life using nonubiquitous genes. Mol Biol Evol. 24:130–136. Lartillot N, Philippe H. 2004. Bayesian phylogenetic software based on mixture models. Mol Biol Evol. 21:1095–1109. Lawson F, Charlebois R, Dillon J. 1996. Phylogenetic analysis of carbamoylphosphate synthetase genes: complex evolutionary history includes an internal duplication within a gene which can root the tree of life. Mol Biol Evol. 13:970–977. Margulis L. 1995. Symbiosis in cell evolution: microbial communities in the archean and proterozoic eons. New York: W H Freeman & Co. Downloaded from mbe.oxfordjournals.org at Libraries-Montana State University, Bozeman on September 30, 2010 We thank Tim Harlow for writing the program for parsing ANCESCON results; Pascal Lapierre for writing the program to identify conserved positions within multiple sequence alignments; and Kristen Swithers, Nicolas Galtier, and two anonymous reviewers for comments and discussions. This work was supported through grants from the NASA Exobiology (NNX07AK15G) and National Science Foundation Assembling the Tree of Life (DEB 0830024) programs. MBE Rooting the Ribosomal Tree of Life · doi:10.1093/molbev/msq057 Sorek R, Zhu Y, Creevey CJ, Francino MP, Bork P, Rubin EM. 2007. Genome-wide experimental determination of barriers to horizontal gene transfer. Science 318:1449–1452. Sprinzl M, Cramer F. 1975. Site of aminoacylation of tRNAs from Escherichia coli with respect to the 2#- or 3#-hydroxyl group of the terminal adenosine. Proc Natl Acad Sci U S A. 72:3049–3053. Swithers KS, Gogarten JP, Fournier GP. 2009. Trees in the web of life. J Biol. 8:54. Thompson J, Higgins D, Gibson T. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighing, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680. Trifonov EN. 2000. Consensus temporal order of amino acids and evolution of the triplet code. Gene 261:139–151. Watanabe K, Chishiro K, Kitamura K, Suzuki Y. 1991. Proline residues responsible for thermostability occur with high frequency in the loop regions of an extremely thermostable oligo-1,6-glucosidase from Bacillus thermoglucosidasius KP1006. J Biol Chem. 266:24287–24294. Woese CR, Fox GE. 1977. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A. 74:5088–5090. Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 24:1586–1591. Zhaxybayeva O, Gogarten JP. 2004. Cladogenesis, coalescence and the evolution of the three domains of life. Trends Genet. 20:182–187. Zhaxybayeva O, Gogarten JP. 2007. Horizontal gene transfer, gene histories and the root of the tree of life. In: Pudritz RE, Higgs PG, Stone J, editors. Astrobiology and the origins of life. Cambridge: Cambridge University Press p.178–192. Zhaxybayeva O, Lapierre P, Gogarten JP. 2005. Ancient gene duplications and the root(s) of the tree of life. Protoplasma 227:53–64. Zhou XX, Wang YB, Pan YJ, Li WF. 2008. Differences in amino acids composition and coupling patterns between mesophilic and thermophilic proteins. Amino Acids. 34:25–33. 1801 Downloaded from mbe.oxfordjournals.org at Libraries-Montana State University, Bozeman on September 30, 2010 Miranda I, Silva R, Santos MA. 2006. Evolution of the genetic code in yeasts. Yeast 23:203–213. Morandi A, Zhaxybayeva O, Gogarten JP, Graf J. 2005. Evolutionary and diagnostic implications of intragenomic heterogeneity in the 16S rRNA gene in Aeromonas strains. J Bacteriol. 187:6561–6564. Mosyak L, Reshetnikova L, Goldgur Y, Delarue M, Safro MG. 1995. Structure of phenylalanyl-tRNA synthetase from Thermus thermophilus. Nat Struct Biol. 2:537–547. Olendzenski L, Gogarten JP. 1998. Deciphering the molecular record for the early evolution of life: gene duplication and horizontal gene transfer. In: Wiegel J, Adams MWW, editors. Thermophiles: the keys to molecular evolution and the origin of life? Philadelphia (PA): Taylor & Francis Inc. p. 165–176. Paila U, Kondam R, Ranjan A. 2008. Genome bias influences amino acid choices: analysis of amino acid substitution and recompilation of substitution matrices exclusive to an AT-biased genome. Nucleic Acids Res. 36:6664–6675. Penny D, Poole A. 1999. The nature of the last universal common ancestor. Curr Opin Genet Dev. 9:672–677. Philippe H, Forterre P. 1999. The rooting of the universal tree of life is not reliable. J Mol Evol. 49:509–523. Poole AM, Penny D. 2007a. Evaluating hypotheses for the origin of eukaryotes. Bioessays 29:74–84. Poole AM, Penny D. 2007b. Response to Dagan and Martin. Bioessays 29:611–614. Schofield JP. 1993. Molecular studies on an ancient gene encoding for carbamoyl-phosphate synthetase. Clin Sci (Lond). 84:119–128. Schwartz RM, Dayhoff MO. 1978. Origins of prokaryotes, eukaryotes, mitochondria, and chloroplasts. Science. 199:395–403. Skophammer RG, Herbold CW, Rivera MC, Servin JA, Lake JA. 2006. Evidence that the root of the tree of life is not within the Archaea. Mol Biol Evol. 23:1648–1651. Skophammer RG, Servin JA, Herbold CW, Lake JA. 2007. Evidence for a gram-positive, eubacterial root of the tree of life. Mol Biol Evol. 24:1761–1768. MBE