Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Common Structural Core of Three-Dozen Residues Reveals Intersuperfamily Relationships Heli A.M. Mönttinen,1 Janne J. Ravantti,*,1,2 and Minna M. Poranen*,1 1 Department of Biosciences, Viikki Biocenter, University of Helsinki, Helsinki, Finland Institute of Biotechnology, Viikki Biocenter, University of Helsinki, Helsinki, Finland 2 *Corresponding author: E-mail: [email protected]; [email protected]. Associate editor: Lars Jermiin Abstract Identification of relationships among protein families or superfamilies is a challenge. However, functionally essential protein regions typically retain structural integrity, even when the corresponding protein sequences evolve. Consequently, comparison of protein structures enables deeper phylogenetic analyses than achievable through the use of sequence information only. Here, we focus on a group of distantly related viral and cellular enzymes involved in nucleic acid or nucleotide processing and synthesis. All these enzymes share an apparently similar protein fold at their active site, which resembles the palm subdomain of the right-hand-shaped polymerases. Using a structure-based hierarchical clustering method, we identified a common structural core of 36 equivalent residues for this functionally diverse group of enzymes, representing five protein superfamilies. Based on the properties of these 36 residues, we deduced a structural distance-based tree in which the proteins were accurately clustered according to the established family classification. Within this tree, the enzymes catalyzing genomic nucleic acid replication or transcription were separated from those performing supplementary nucleic acid or nucleotide processing functions. In addition, we found that the family Y DNA polymerases are structurally more closely related to the nucleotide cyclase superfamily members than to the other members of the DNA/RNA polymerase superfamily, and these enzymes share 88 equivalent residues comprising a b1-a1-a2-b2-b3-a3-b4-a4-b5 fold. The results highlight the power of structure-based hierarchical clustering in identifying remote evolutionary relationships. Furthermore, our study implies that a protein substructure of only three-dozen residues can contain a substantial amount of information on the evolutionary history of proteins. Key words: protein evolution, structural alignment, structural distances, nucleic acid and nucleotide processing enzymes, polymerase evolution. Introduction number of the solved protein structures. However, such approaches may prove critical for discerning the relatedness of distant proteins, which is difficult to achieve solely through amino acid sequence comparisons. Homologous Structure Finder (HSF; Ravantti et al. 2013) is a recently developed tool for structural alignment and structure-based classification of proteins. It uses several parameters (e.g., physicochemical properties of the amino acids, type of secondary structure and position in which it is located, local geometry and local alignment, Ca-distances, and backbone direction; Ravantti et al. 2013), and through pairwise comparisons, identifies the most similar structures within a set of protein structures. The program then determines equivalent residues within the two structures; that is, residues sharing similar properties in terms of the parameters used. These equivalent residues represent the common structural core and can be considered homologous positions that are comparable to the columns of multiple sequence alignments (Morrison 2015). Thus, the identified structural equivalence may reveal possible homology. The special feature of this program is that iterative pairwise comparisons of structures and previously identified structural cores are continued until ß The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected] Mol. Biol. Evol. 33(7):1697–1710 doi:10.1093/molbev/msw047 Advance Access publication March 1, 2016 1697 Article Phylogenetic relationships between protein-coding genes have commonly been studied using amino acid sequence alignments. These allow identification of homologous protein regions in evolutionally closely related proteins and the assignment of protein families. Superfamilies represent a higher level of classification and can be defined when a common structural region is observed in proteins belonging to different families. Such similarity among distantly related protein structures can often be observed, even when amino acid sequence similarity has become undetectable (Chothia and Lesk 1986; Ponting and Russell 2002). In addition to structural homology, enzymes within a superfamily typically share a similar catalytic mechanism; however, the substrate specificities may vary. Assignment of protein superfamilies and relationships for distantly related proteins involves the identification of structural similarity through multiple structural alignments, but often also requires manual curation. Although methods for automated structural alignment have significantly improved during the last two decades, structure-based studies aimed at identifying relationships among distantly related protein coding genes have been scarce. In part, this arises from the limited MBE Mönttinen et al. . doi:10.1093/molbev/msw047 all structures within the data set are merged to form a single hierarchical clustering dendrogram, and a common structural core is identified for all the proteins in question. Subsequently, all-against-all pairwise distances are calculated for the residues comprising the structural core, and the resulting distances can be converted to a normalized distance matrix from which a structure-based distance tree can be constructed (Ravantti et al. 2013). Although structures are not necessarily drifting in a continuous manner, and the calculated structural distances may not exactly reflect evolutionary distances, the clustering of protein structures in the structure-based distance trees agrees well with the sequence-based classification of protein-coding genes into protein families (Mönttinen et al. 2014). Consequently, structure-based analyses can be used to roughly estimate the chronology of evolutionary events occurring over a large time scale. This is especially valuable when deep evolutionary relationships, not detectable through sequence-based analyses alone, are evaluated. We have previously utilized HSF to describe a common structural core for the right-hand-shaped polymerases of the DNA/RNA polymerases superfamily (Mönttinen et al. 2014). This core comprises 57 residues at the palm subdomain, forming the active site of these enzymes, and was used to predict relationships for the protein families within the DNA/ RNA polymerases superfamily (Mönttinen et al. 2014). According to the Structural Classification of Proteins (SCOP) database (Murzin et al. 1995; Andreeva et al. 2014), the palm subdomain of the right-hand-shaped polymerases is structurally related to the active site fold of the class III adenylate cyclases, which belong to a distinct protein superfamily, the nucleotide cyclases. Structurally similar active site folds have also been observed in proteins belonging to the primpol domain, origin of replication-binding domain, and transposase IS200-like superfamilies, as well as in some unclassified proteins (Murzin et al. 1995; Iyer et al. 2005; Hyde et al. 2010). All these proteins are involved in nucleic acid or nucleotide metabolism, and many have catalytic aspartic acid or histidine residues in similar positions (Iyer et al. 2005; Hyde et al. 2010). Despite the structural and functional similarities, these superfamilies are currently placed under four different SCOP-folds: DNA/RNA polymerases, ferredoxin-like, origin of replication-binding domain (RBD-like), and prim-pol domain. Relationships among protein superfamilies have previously been inferred based on the conservation of protein topology and conserved sequence signatures (Iyer et al. 2005). In this study, we asked whether structural similarity could be automatically detected among proteins belonging to different superfamilies. We hypothesized that such information could facilitate the construction of structure-based distance trees, which have the potential to reveal inter- and intrasuperfamily relationships between proteins that do not share detectable sequence identity. In particular, we were interested in the family Y DNA polymerases, which are traditionally classified together with the other right-hand-shaped polymerases in the superfamily of DNA/RNA polymerases, but share sequence similarity with the members of the nucleotide cyclases superfamily (Makarova et al. 2002). 1698 We selected a set of 89 protein structures, representing five protein superfamilies and a few unclassified proteins involved in nucleic acid or nucleotide metabolism and possessing an active site fold similar to the palm subdomain of the righthand-shaped polymerases (table 1). Using HSF, we identified a common structural core of 36 amino acid residues shared by this set of protein structures. All 36 residues assigned as equivalent by the HSF program are located within the active site of the proteins, but share no sequence identity across the whole data set. These 36 equivalent residues are also found in the same order within the polypeptide chain for all the proteins in our data set, except in the birnavirus polymerases, where there is a sequence inversion that results in distinct connectivity between the secondary structure elements at the active site, as compared with the other known right-hand-shaped polymerases (Gorbalenya et al. 2002; Pan et al. 2007). Using the properties of these 36 residues and automated computational methods, we deduced a structure-based distance tree, in which the proteins accurately cluster according to their known function and established family classification. This indicates that a substructure of only three-dozen residues can contain a significant amount of information on the evolutionary history of a protein. Results and Discussion Protein Structures Sharing an Active Site Fold Resembling the Palm Subdomain of the RightHand-Shaped Polymerases The classification of protein structures has changed over time, and there are variations in nomenclature depending on the source. Here, we adhere to the family nomenclature established in the literature but also provide the family and superfamily names used in the SCOP database (table 1 and supplementary table S1, Supplementary Material online). Polymerase structures displaying the right-hand fold were initially selected from the Protein Data Bank (PDB; www.pdb. org, last accessed March 6, 2014) (table 1 and supplementary table S1, Supplementary Material online). The data set was subsequently extended with PDB and DALI (ekhidna.biocenter.helsinki.fi/dali_server, last accessed March 6, 2014) searches, as well as with a review of literature references (Tesmer et al. 1999; Iyer et al. 2005; Morrison et al. 2008; Hyde et al. 2010), to obtain a comprehensive set of protein structures proposed to have an active site fold homologous to the palm subdomain of the right-hand-shaped polymerases. Structures that could not be automatically aligned with others in the data set due to deviations in the relative orientation of the secondary structure elements in the palm-like fold were discarded from further analyses. As a result, we obtained a set of 89 protein structures from five different protein superfamilies (table 1 and supplementary table S1, Supplementary Material online). Our data set covers, depending on the classification system (table 1), 13 or 15 protein families previously proposed to share a common origin (Iyer et al. 2005), as well as four unclassified protein structures (table 1 and supplementary table S1, Supplementary Material online). These proteins do not share any evident Replication initiation protein E1 (64332) Replication protein Rep (75492) Relaxase domain (103120) Transposase IS200-like (143423) Diguanylate cyclases (6) GGDEF domain (117984) Transposase IS200-likec (143422) Origin of replicationbinding domain RBDlikee (55464) Guanylate cyclases and class III adenylate cyclases (8)d Adenylyl and guanylyl cyclase catalytic domain (55074) Nucleotide cyclasec (55073) Viral origin-binding proteins (3) HUH endonucleases (7) DNA-dependent DNA polymerization in DNA repair Family Y DNA polymerases (10) Family B DNA polymerases (13) Lesion bypass DNA polymerase (Y-family), catalytic domain (100888) DNA-dependent DNA polymerization in replication or DNA repair Family A DNA polymerases (6) DNA polymerase I (56673) DNA-dependent RNA polymerization in transcription DNA-dependent DNA polymerization in the processing of the Okazaki fragments, DNA repair, or replication RNA-dependent RNA polymerases (20) RNA-dependent RNA polymerase (56694) dsRNA phage RNA-dependent RNA-polymerase (64480) Cleavage of a phosphodiester bond in ssDNA; involved in the rollingcircle replication of plasmids and viruses, conjugative plasmid transfer and DNA transposition Recognition and binding to the replication initiation site; some Catalyzation the formation of cyclic di-3’,5’-guanylate from GTP (needed in the regulation of biofilm formation and persistence) Catalyzation the formation of secondary messengers 3’5’ cyclic AMP/3’5’ cyclic GMP from ATP/ GTP Archaea, bacteria, eukaryotes Mitochondria, DNA phages RNA-dependent RNA polymerization in viral genome replication and transcription Reverse transcriptases (4) RNA (and DNA)-dependent DNA polymerization in viral genome replication and telomere synthesis Activity/Function Reverse transcriptase (56686) Family Single-subunit RNA polymerases (3) b Alternative Family Name (the number of structures) in This Studya T7 RNA polymerase (56691) DNA/RNA polymerase (56672) Superfamily SCOP Classification (SCOP id) for Proteins in This Study Table 1. Protein Families and Superfamilies Sharing an Active Site Fold Similar to the Right-Hand-Shaped Polymerases. DNA viruses (continued) Archaea, bacteria, eukaryotes (transposons) DNA viruses Bacteria Archaea, bacteria, eukaryotes Archaea, bacteria, eukaryotes, DNA phages Bacteria, DNA phages, mitochondria ssRNA viruses,dsRNA viruses Retroviruses, eukaryotes Organisms Superfamily Relations through Protein Structure . doi:10.1093/molbev/msw047 MBE 1699 1700 Addition of guanylate residue to the 5’-end of tRNAHis Participates in CRIPSPR-Cas RNA silencing effector complex Reverse polymerases (tRNAHis guanylyltransferase) (2) Cmr2 (1) No SCOP classification No SCOP classification Catalyzation the formation of 2-amino-5-formylamino-6-ribofurano-sylamino-4(3H)-pyrimidinone ribonucleotide monophosphate Catalyzation of DNA-dependent DNA polymerization via de novo initiation (replication, able to bypass lesions) Catalyzation of RNA polymerization in priming process Ligation of double-strand DNA breaks in bacterial nonhomologous end-joining pathway GTP cyclohydrolase III (1) Archaeal primase-polymerases (1) Ligase D (1) Archaeal-eukaryotic primases (3) of the proteins have also helicase activity Activity/Function No SCOP classification Bifunctional DNA primase/polymerase N-terminal domain (103305) PriA-like (56748) Family Alternative Family Name (the number of structures) in This Studya Archaea, bacteria Archaea, bacteria, eukaryotes Archaea Archaea Bacteria Archaea, eukaryotes Organisms c b The structures selected are not all included in the SCOP database but were assigned to SCOP families based on similarity to assigned SCOP family members and the information obtained from other sources. Also known as right-hand-shaped polymerases. Belongs to the SCOP fold DNA/RNA polymerases (id 56671). According to SCOP the palm domain has a ferredoxin-like fold. Belongs to the SCOP fold Ferredoxin-like (id 54861). d Guanylate/adenylate cyclases is used as an alternative name in this study. e Belongs to the SCOP fold Origin of replication-binding domain, RBD-like (id 55463). f Belongs to the SCOP fold Prim-pol domain (id 56746). a No SCOP classification Prim-Pol domainf (56747) The origin DNA-binding domain of SV40 T-antigen (55465) Superfamily SCOP Classification (SCOP id) for Proteins in This Study Table 1. Continued Mönttinen et al. . doi:10.1093/molbev/msw047 MBE Superfamily Relations through Protein Structure . doi:10.1093/molbev/msw047 sequence identity, as determined using the MAFFT sequence alignment method (Katoh and Frith 2012; data not shown), and therefore, structural information is needed to recognize possible similarities and homology between the superfamilies. For the initial validation of the data set, we made all pairwise structural alignments using TM-align (Zhang and Skolnick 2005). Structural alignment of nonhomologous protein generally yields TM-scores <0.17 whereas that of proteins with well-established homologous relationships generally yields TM-scores >0.5 (Xu and Zhang 2010). The obtained pairwise TM-scores varied between 0.20 and 0.98 within the whole data set. The TM-scores between members of the same family were generally between 0.49 and 0.98; however, within the family of RNA-dependent RNA polymerases the TM-scores varied between 0.43 and 0.98, whereas those among the reverse transcriptases varied between 0.40 and 0.81. Thus, using TM-align the structural similarity of some of the proteins was found to be quite low, which, in turn, call into question the homology of some of these proteins. These results also called into question the homology between all the RNA-dependent RNA polymerases and reverse transcriptases, which are encoded by genes known to evolve fast. However, because TM-scores were frequently >0.5 and none of them were <0.17, it was decided to assume that the data set comprises sequences of homologous proteins. Notably, TM-scores >0.5 were also obtained between distinct protein families so that a network connecting all the families together could be built (data not shown). Automated Identification of Equivalent Residues for Proteins Representing Different Superfamilies The selected 89 protein structures, representing five protein superfamilies, were structurally aligned and clustered using the HSF program. This uses multiple parameters, such as geometry, secondary structure, physiochemical properties of the amino acids, and amino acid sequence (see Ravantti et al. 2013) to identify equivalent residues in protein structures, describing their common structural core. Due to the low sequence similarity in our data set, the parameters describing local alignment and local geometry were upweighted, whereas the proportional weight of the sequence was only 3.5% of all parameters (for the optimization of the parameters, see Mönttinen et al. 2014). Through the hierarchal clustering of the protein structures, the HSF program automatically identified 36 equivalent residues with a rootmean-square deviation (rmsd) of 3.6 Å, for all proteins in our data set (for the clustering, see supplementary fig. S1, Supplementary Material online). All equivalent residues are located within the palm-like fold forming the active site. The common core covers four b-strands and two a-helices, sharing approximately similar relative orientations in each structure, and is depicted for the RNA-dependent RNA polymerase of hepatitis C virus and the family Y DNA polymerase Dpo4 of Sulfolobus solfataricus in figure 1 (panels IB and IIB, respectively). We could not identify any chemically identical amino acids in the equivalent positions among the 36 residues by HSF (supplementary fig. S1, Supplementary Material online), even when the positional constraints were MBE relaxed, and the amino acids were compared within a sequence window of 61 amino acids. A similar analysis was also carried out for a set of 250 protein structures classified under the ferredoxin-like fold in SCOP database, including members of the nucleotide cyclases and transposase IS200-like superfamilies present in our main data set. Using HSF, it was not possible to align the ferredoxinlike fold, and the identified common core was found to have only three equivalent residues. This suggests that the structural heterogeneity among the proteins assigned to have the ferredoxin-like fold is substantially higher than that present in our main data set, which contains proteins from three different SCOP fold classes (table 1). Structural Properties of the 36-Residue Common Core To obtain further insights into the structural organization of the 36-residue common core, the individual protein sequences were aligned in 11 groups (corresponding to the protein families and the prim-pol domain superfamily) using the structure-directed sequence alignment method, PROMALS3D (Pei et al. 2008), and the obtained alignments were manually superimposed (supplementary fig. S2, Supplementary Material online). The birnavirus RNA polymerases, which have a sequence inversion at the palm subdomain (b1-b2-b3-a1-a2; Gorbalenya et al. 2002; Pan et al. 2007), could not be automatically aligned with the other RNA-dependent RNA polymerases by PROMALS3D and were omitted from this analysis. We found that for all other structures in our data set, the order of secondary structure elements within the common core was conserved (b1-a1-b2b3-a2 fold; fig. 1 and supplementary fig. S2, Supplementary Material online), suggesting that the active site of these proteins might share a common origin. Furthermore, in most proteins, with the exception of those belonging to the HUH endonucleases, viral origin-binding proteins, and guanylate/ adenylate cyclases, there is an insertion between the b1- and a2-elements. This insertion forms the fingers subdomain in the right-hand-shaped polymerases and has likely occurred early in the evolution of these nucleic acid synthesizing enzymes. Our structure-based sequence alignment also allowed identification of conserved amino acid residues for each protein family. Comparison of the conserved residues revealed few that are identical across several protein families, allowing connections to be made among the superfamilies (supple mentary fig. S2, Supplementary Material online). Most notably, an aspartic acid (D) in b1 is conserved among the members of the DNA/RNA polymerases, nucleotide cyclases, and prim-pol superfamilies, as well as in the three unclassified proteins in our data set (table 1). Furthermore, a conserved histidine residue (H) in b3-strand bridges the members of the prim-pol domain superfamily and the HUH endonucleases family, which are currently classified under three different SCOP superfamilies and three SCOP folds (table 1). Members of the viral origin-binding proteins family, from the origin of replication-binding protein superfamily, do not share any detectable sequence identity with the members of the other protein families. Nevertheless, based on the recently 1701 Mönttinen et al. . doi:10.1093/molbev/msw047 MBE FIG. 1. Structurally equivalent residues for the selected nucleic acid/nucleotide-binding proteins. The structure of hepatitis C virus RNA-dependent RNA polymerase (PDBid: 3TYQ) and the family Y DNA polymerase, Dpo4, of Sulfolobus solfataricus (PDBid: 1JX4), shown in rows I and II, respectively, are used for illustration of the conserved structural features among the selected proteins structures representing five superfamilies (table 1 and supplementary table S1, Supplementary Material online). Column A depicts the complete palm subdomain (according to Bressanelli et al. 1999; Ling et al. 2001), and column B shows the 36-amino acid common structural core for the entire data set. Panel IC shows the equivalent residues shared by the RNA-dependent RNA polymerases, reverse transcriptases, single-subunit RNA polymerases, and family A and B DNA polymerases (i.e., members of group I of fig. 2). Panel IIC illustrates the common core for the members of the family Y DNA polymerases, the representatives of nucleotide cyclase, transposase IS200-like, origin of replication-binding domain, and prim-pol domain superfamilies, and the three nonclassified proteins in the data set (i.e., the members of group II of fig. 2). The palm subdomain is in green, and the conserved residues in the thumb and finger subdomains are indicated with pink and yellow, respectively. The main secondary structure elements in the substructures are indicated. released evolutionary classification of protein domains (ECOD) database (Cheng et al. 2014), the viral origin-binding proteins are related with the HUH endonucleases. A shared common origin between the viral origin proteins and members of the DNA/RNA polymerases superfamily, the nucleotide cyclases, and the primase polymerases has also been proposed previously (Iyer et al. 2005). The results of the TM-align analysis also support these observations and suggest that the members of the five protein superfamilies and the nonclassified proteins in our data set (table 1 and supplemen tary table S1, Supplementary Material online) are structurally related and thus likely homologous (see above). 1702 Relationships among Proteins Families and Superfamilies Deduced from the 36 Equivalent Residues A structure-based distance tree was constructed for the members of the five protein superfamilies and the four unclassified protein structures described in table 1 by comparing the properties of the 36 equivalent residues. In this analysis, using only the common structural core prevents the comparison of individual residues or motifs that may coincidentally share similar structural properties but are part of distinct structural features that do not share a common origin with other proteins in the comparison. The relationships among Superfamily Relations through Protein Structure . doi:10.1093/molbev/msw047 the different protein structures were produced automatically by converting all the pairwise similarity scores between the equivalent residues into distances and depicting them as a tree. Despite the small size of the common structural core and the lack of identical amino acids, the grouping of protein structures in the resulting distance tree accurately follows the established protein families that are based on sequence analyses (fig. 2; table 1; Poch et al. 1989; Ito and Braithwaite 1991; Braithwaite and Ito 1993; Ohmori et al. 2001; Chandler et al. 2013). This implies that distance trees constructed using HSF could be used to propose protein phylogenies, and thus, can be described as “structure-based phylogenetic trees.” Within the constructed distance tree, the protein families form two main groups. Group I includes the RNA-dependent RNA polymerases, reverse transcriptases, single-subunit RNA polymerases, and family A and B DNA polymerases from the superfamily of DNA/RNA polymerases (i.e., mainly processive nucleic acid polymerases involved in DNA/RNA replication or transcription). Group II contains primarily nonreplicative enzymes, as well as enzymes that take part in the replication of extrachromosomal genetic elements. These proteins represent five protein superfamilies and are involved in primer synthesis and DNA nicking, ligation, and repair, as well as in the recognition of specific DNA sequences, RNA modification, and the synthesis of secondary messengers from nucleotide precursors (e.g., cyclic adenosine monophosphate; table 1 and fig. 2). Similar grouping was also observed in the initial hierarchical clustering of the complete protein structures (sup plementary fig. S1, Supplementary Material online). The common cores identified for groups I and II through the hierarchical clustering were 76 and 43 residues, respectively (supplementary fig. S1, Supplementary Material online; see below). The sizes of the common cores of the different subclusters (e.g., protein families) are also provided in the supplementary fig. S1, Supplementary Material online. Interestingly, the family Y DNA polymerases of the DNA/ RNA polymerases superfamily are grouped separately from the other right-hand-shaped polymerases and cluster in group II with the 33 other nucleic acid/nucleotide processing enzymes (fig. 2 and supplementary fig. S1, Supplementary Material online). This suggests that family Y DNA polymerases involved in DNA repair are only distantly related to the other right-hand-shaped polymerases, which mainly catalyze genomic nucleic acid replication and transcription. However, like the right-hand-shaped RNA polymerases and replicative DNA polymerases, the family Y polymerases and most of the nucleotide cyclases, as well as the unclassified proteins, GTP cyclohydrolase III and Cmr2, have a similar catalytic site containing two aspartate residues (supplementary fig. S2, Supplementary Material online), supporting the placement of these enzymes adjacent to the members of the DNA/RNA polymerases superfamily (fig. 2). We propose that the processive nucleic acid polymerases in group I are of early origin, with the first ones arising during the phase when RNA was the predominant genetic material (RNA world; Gilbert 1986). These enzymes perform key MBE functions in viruses and cellular organisms, catalyzing DNA and RNA synthesis using either DNA or RNA templates. Conversely, the representatives of group II carry out more sophisticated functions, which are related to error correction and communication between cells; that is, functions that likely have evolved later in more complex biological systems. As a potential reflection of this evolutionary history, the archaeal primase–polymerases of group II can possess both DNA-dependent DNA and RNA polymerization activities (Prato et al. 2008), as well as RNA-dependent DNA synthesis activity (Gill et al. 2014). To evaluate the robustness of our structure-based distance tree (fig. 2), we performed a simplified jackknife test, in which random members of each protein family or unclassified protein structures were discarded one-by-one from the analysis. The structure-based distance trees were then recalculated using the remaining 88 structures (supplementary fig. S3, Supplementary Material online). The results of this test confirmed the general pattern we observed with our full data set, as well as the distribution of protein families into two main groups. However, in the simplified jackknife test, the position of the unclassified protein, Cmr2, varied slightly, such that, in 20% of the replicates it clustered with group I rather than group II. Critically, despite the small size of the common core, the grouping of family members together was also stable, particularly among the group II families. However, the family B DNA polymerase of phage u29 and the telomerase-reverse transcriptase of Tribolium castaneum were only loosely associated with their respective families (in 20% and 27% of the replicates, respectively). Additionally, members of singlesubunit RNA polymerases were disconnected in 27% of the replicates (supplementary fig. S3, Supplementary Material online). Fine-Tuning Structure-Based Protein Phylogenies by Increasing Common Core Size A common region composed of only 36 structurally equivalent residues allowed the automated reconstruction of structure-based interfamily and intersuperfamily relationships for a set of distant protein homologs (fig. 2). To further evaluate the quality of the tree constructed from this 36-residue structural core, distance trees were constructed separately for the two main clusters, that is, the group I enzymes catalyzing nucleic acid replication or transcription (supplementary fig. S4, Supplementary Material online) and the group II enzymes involved in supplementary nucleic acid/nucleotide processing functions (fig. 3). The common structural cores identified by HSF for the members of groups I and II are 76 and 43 acarbon residues, respectively (fig. 1 and supplementary fig. S1, Supplementary Material online). We observed that the distance trees deduced from these larger structural cores (sup plementary fig. S4, Supplementary Material online and fig. 3) follow the overall topology observed in the original tree (fig. 2), indicating that the tree deduced from the 36-residue substructure is relatively robust at the interfamily level. The positions of the individual protein structures were slightly changed when the larger structural cores were applied, finetuning the intrafamily relations. Overall, we consider the 1703 Mönttinen et al. . doi:10.1093/molbev/msw047 MBE FIG. 2. Intersuperfamily and interfamily relationships for proteins harboring a palm-like fold. A structure-based distance tree for the selected nucleic acid/nucleotide processing enzymes (table 1 and supplementary table S1, Supplementary Material online) was deduced based on the properties of the 36 structurally equivalent amino acid residues (depicted in fig. 1, panels IB and IIB). The coloring of the individual branches reflects the family classification of the proteins (see table 1). The family names are indicated on the inner circle, and the protein superfamilies (according to SCOP; Murzin et al. 1995; Andreeva et al. 2014) are shown on the outer circle; unclassified protein structures are marked with an asterisk (*). The separation of the two groups is highlighted with a dashed line. vOBPs, viral origin-binding proteins; APP, archaeal primase polymerase; AEP, archaeal-eukaryotic primases; LigD, polymerase domain of DNA ligase D. structure-based analyses to be suitable for studying the relationships between protein families, superfamilies, and intrafamily relations, thus effectively supplementing sequencebased methods. Common Core and Relationships for Members of Group I The members of group I (i.e., the RNA-dependent RNA polymerases, reverse transcriptases, single-subunit RNA polymerases, and family A and B DNA polymerases) share a 76residue structural core with an rmsd of 4.2 Å. The common 1704 structural region for these members of the DNA/RNA polymerase superfamily contains, in addition to the palm subdomain, regions from the fingers and thumb subdomains, as depicted for the hepatitis C virus RNA-dependent RNA polymerase in figure 1 (panel IC; thumb- and fingers-specific residues indicated in pink and yellow, respectively). The conserved palm region covers 57 residues, sharing a similar connectivity in all group I members (supplementary fig. S2, Supplementary Material online), excepting the birnavirus RNA-dependent RNA polymerases (data not shown; Gorbalenya et al. 2002; Pan et al. 2007). Although the fingers Superfamily Relations through Protein Structure . doi:10.1093/molbev/msw047 MBE FIG. 3. Fine-tuning the structural relationships for the members of group II. A structure-based distance tree for selected nucleic acid/nucleotide processing enzymes representing five different protein superfamilies and four unclassified proteins (members of group II, fig. 2) was deduced based on the properties of the 43 structurally equivalent residues (depicted in fig. 1, panel IIC). The coloring of the individual branches reflects the established family classification of the enzymes (see table 1), and the corresponding family names are indicated with the same colors. The superfamilies are indicated by light violet backgrounds and black font. The organism from which each protein originates is shown at the tip of each branch, and the name of the enzyme is also provided for clarity, when needed. E. coli, Escherichia coli; H. sapiens, Homo sapiens; M. smegmatis, Mycobacterium smegmatis; M. tuberculosis, Mycobacterium tuberculosis; P. aeruginosa, Pseudomonas aeruginosa; P. horikoshii, Pyrococcus horikoshii; P. furiosus; Pyrococcus furiosus; S. acidocaldarius, Sulfolobus acidocaldarius; S. cerevisiae, Saccharomyces cerevisiae; S. solfataricus, Sulfolobus solfataricus; S. tokodaii, Sulfolobus tokodaii. and thumb subdomains are not considered homologous among the different polymerase families (Steitz 1999), their overall shape is similar, and therefore, HSF can also identify equivalent positions in these subdomains (fig. 1, panel IC). HSF did not automatically identify any chemically identical equivalent residues within this region. However, when identical amino acids were searched in a sequence window of 61, the two conserved catalytic aspartates could be recognized in the b1- and b3-strands. These were also identified independently in the structure-directed sequence alignment using PROMALS3D (supplementary fig. S2, Supplementary Material online). 1705 Mönttinen et al. . doi:10.1093/molbev/msw047 Within the structure-based distance trees (fig. 2 and supplementary fig. S4, Supplementary Material online), the RNAdependent and DNA-dependent polymerases form separate branches. This likely reflects the early appearance of the RNAdependent enzymes in the RNA world and the subsequent development of reverse transcriptases to handle the originally toxic deoxyribonucleotides (Leon 1998). DNA-based processes, rather, are believed to have evolved later (Forterre 2005), leading to the emergence of the DNA-dependent DNA and single-subunit RNA polymerases. Although the structurally characterized reverse transcriptases included in our analysis represent diverse biological system (e.g., retroviruses, retroelements, and eukaryotic telomerase machinery), all seem to share a common origin (fig. 2 and supplementary fig. S4, Supplementary Material online; Nakamura et al. 1997; Gladyshev and Arkhipova 2011). Similarly, all the DNA-dependent polymerases of group I likely evolved from the same ancient palm subdomain, and subsequently differentiated into the family A and B DNA polymerases, as well as the single-subunit RNA polymerases (fig. 2 and supplementary fig. S4, Supplementary Material online). Furthermore, clustering of the viral RNA-dependent RNA polymerases mostly follows the taxonomic classification of viral families (supple mentary fig. S4, Supplementary Material online), implying that exchange of viral RNA polymerase genes has been relatively limited since the division and differentiation of the different viral families. These observations further support our proposal that structure-based analyses can be used to deduce phylogenetic relationships. Common Core and Relationships for Members of Group II Group II represents five distinct protein superfamilies comprising a variety of nonreplicative enzymes involved in the processing of nucleic acids or nucleotides (table 1). These enzymes share only 43 structurally common residues with an rmsd of 4.1 Å, and none of these is chemically identical, even when searched in 61 sequence window. However, using a structure-directed sequence alignment (supplementary fig. S2, Supplementary Material online), it was possible to identify a few residues that are conserved in several families, allowing connections to be made between the superfamilies. The common structural core is depicted for the family Y DNA polymerase, Dpo4, of S. solfataricus in figure 1 (panel IIC). It contains four antiparallel b-strands and two a-helices; that is, the same secondary structural elements as the 36-residue common core identified for the whole data set (fig. 1, panel IIB). However, as a reflection of decreased structural variation, the coverage of the secondary structures has become more complete. In our structure-based distance tree, the members of group II form two large branches. The first one contains the family Y DNA polymerases of the DNA/RNA polymerase superfamily and the two families of the nucleotide cyclase superfamily (i.e., the guanylate cyclase and class III adenylate cyclase family and the diguanylate cyclases family; table 1). The structure of the archaeal GTP cyclohydrolase III, which has no assigned family classification in the SCOP database 1706 MBE (table 1), is also part of this subtree (fig. 3). Interestingly, GTP cyclohydrolase III has a GGDNF sequence-motif in the catalytic site that is highly similar to the GGDEF sequencemotif, a hallmark feature of the diguanylate cyclase family (Galperin et al. 2001; Galperin 2004). These observations suggest that GTP cyclohydrolase III could be placed into the same superfamily with the nucleotide cyclases. In addition, the structure of the CRISPR-Cas RNA silencing effector complex protein Cmr2 (CRISPR for Clustered, Regularly Interspaced, Short Palindromic Repeat) is located in the same branch with the guanylate/adenylate cyclases and GTP cyclohydrolase III (fig. 3). The structural similarity between Cmr2 and the adenylate cyclases has been described previously (Zhu and Ye 2012). However, Cmr2 does not share functional properties with either cyclases or polymerases (Cocozaki et al. 2012; Huang 2012), and it has been proposed to form a distinct protein family with cas10 proteins (Makarova et al. 2011). Based on the structural similarity we observe, however, Cmr2 likely belongs to the superfamily of nucleotide cyclases. The second branch in group II (fig. 2) includes the members of the transposase IS200-like, origin of replication-binding domain, and prim-pol domain superfamilies. In the SCOP database, members of the HUH endonuclease family are classified into several different families and two superfamilies: Transposase IS200-like and origin of replication-binding domain (table 1 and fig. 3). Our analyses imply that the HUH endonucleases form two branches, which likely are evolutionarily more closely related to each other than to the members of the origin of replication-binding domain superfamily (fig. 3). These observations suggest that HUH endonucleases should be placed into a single superfamily, as was also suggested earlier (Ilyina and Koonin 1992; Koonin and Ilyina 1993). The nonclassified reverse polymerases (i.e., tRNAHis guanylyltransferases) are closely associated with the family Y DNA polymerases and diguanylate cyclases in all our structurebased distance trees (figs. 2 and 3; supplementary fig. S3, Supplementary Material online). Like all the right-handshaped polymerases and nucleotide cyclases, reverse polymerases harbor catalytic aspartic acids in the b1- and b3strands of the palm-like fold (supplementary fig. S2, Supplementary Material online; Hyde et al. 2010), suggesting high conservation in the catalytic mechanism of nucleotide addition, despite the reverse directionality. The close relation of the reverse polymerases and nucleotide cyclases is supported by previous sequence-based studies in which reverse polymerases were shown to be related to GGDEF and CRISPR polymerases (Anantharaman et al. 2010). Members of the Family Y DNA Polymerases and the Nucleotide Cyclase Superfamily Share an Extended Palm-Like Fold In the structure-based distance trees presented here (figs. 2 and 3), the family Y DNA polymerases of the DNA/RNA polymerase superfamily are grouped together with the members of the nucleotide cyclase superfamily (including GTP Superfamily Relations through Protein Structure . doi:10.1093/molbev/msw047 MBE cyclohydrolase III and Cmr2). Similar grouping was also observed in the hierarchical clustering dendrogram produced by HSF (supplementary fig. S1, Supplementary Material online), and in this case, a common structural core of 88 residues with an rmsd of 3.0 Å was identified. The structural core is presented for Dpo4 of S. solfataricus and the adenylyl cyclase Rv1264 of Mycobacterium tuberculosis (fig. 4A and B). It contains all the secondary structure elements present in the common structural core of the right-hand-shaped polymerases (Mönttinen et al. 2014), as well as two additional a-helices and b-strands forming a b1-a1-a2-b2-b3-a3-b4-a4-b5 fold (fig. 4C). This fold was previously found to be conserved among the members of the nucleotide cyclases superfamily (Pei and Grishin 2001), and our results demonstrate that also the family Y DNA polymerases share this fold. The conserved b1-a1-a2-b2-b3-a3-b4-a4-b5 fold comprises 25% of the entire polypeptide chain of the family Y DNA polymerase Dpo4 of S. solfataricus and over 52% of the catalytic domain (Tews et al. 2005) of the class III adenylate cyclase Rv1264 of M. tuberculosis. It is considerably larger than the common core of the right-hand-shaped polymerases or the common active site fold previously described for the eukaryotic adenylate cyclase and family A DNA polymerases of Escherichia coli and Thermus aquaticus (Artymiuk et al. 1997). The HSF program automatically detected one positionally and chemically conserved amino acid for all the family Y DNA polymerases and nucleotide cyclases—a glycine (G) in the b4strand. Mutational studies on UmuC from E. coli, a member of the family Y DNA polymerases, have shown that this glycine has a role in maintaining protein stability and local structure (Woodgate et al. 1994). This conserved glycine and a conserved aspartic acid (D) in b1 could also be identified using PROMALS3D (supplementary fig. S2, Supplementary Material online). Likewise, an additional conserved aspartic acid (D) could be identified in b3 from family Y DNA polymerases and guanylate/adenylate cyclases. The substantial similarity in the active site fold (fig. 4) indicates that family Y DNA polymerases are structurally more closely related to nucleotide cyclases than to the right-handshaped polymerases (DNA/RNA polymerases superfamily). Although SCOP classifies nucleotide cyclases and family Y DNA polymerases into separate folds, the new classification database ECOD (Cheng et al. 2014) places them under the same homology and topology level. Additionally, previous sequence comparisons (Makarova et al. 2002) support our results. In the light of these findings, either the classification of the family Y DNA polymerases should be reevaluated or the SCOP superfamily classification should not be considered to reflect evolutionary relationships for the family Y DNA polymerases. The nucleotide cyclases and the family Y DNA polymerases are present in all three domains of life, and typically, FIG. 4. The common structural fold for the family Y DNA polymerases, nucleotide cyclases, and GTP cyclohydrolase III. The common structural core for the members of the family Y DNA polymerases (of superfamily DNA/RNA polymerases) and the nucleotide cyclases superfamily, as well as for GTP cyclohydrolase III and Cmr2 (unclassified proteins), was identified using the HSF program (Ravantti et al. 2013). Equivalent residues identified are shown in green in the structures of the family Y DNA polymerase, Dpo4 of Sulfolobus solfataricus (PDBid: 1JX4) (A), and the adenylate cyclase Rv1625 of Mycobacterium tuberculosis (PDBid: 1Y11) (B). (C) A schematic presentation of the conserved fold. 1707 MBE Mönttinen et al. . doi:10.1093/molbev/msw047 an organism encodes several paralogs with slightly different functionalities. This suggests that genes encoding precursors of these enzymes were most likely present in the last universal common ancestor or that there has been extensive horizontal gene transfer between different organisms at a later stage of cellular evolution. Due to this diversity, it is difficult to predict the order of evolutionary steps that has resulted into the development of the different forms of these enzymes. Conclusions Here, we have used protein structures to reconstruct phylogenetic relationships for distantly related proteins representing five protein superfamilies and four unclassified proteins (table 1). Using the HSF program, we were able to identify a common core of 36 residues (fig. 1) at the active site of these proteins. Despite the lack of sequence-level identity, a phylogenetic tree confirming the established protein family classification could be deduced based on the properties of the conserved structural core (fig. 2). This implies that a substructure of only three-dozen residues can contain a substantial amount of information on the evolutionary history of proteins. Such an approach allowed us to detect not only interfamily but also intersuperfamily relations, extending the evolutionary timeframe of protein phylogenies. Materials and Methods Selection of Protein Structures The starting point for protein structure selection was the data set described in Mönttinen et al. (2014). This data set was updated with seven recently published novel right-handshaped polymerase structures and subsequently extended to include all protein structures containing the conserved b1-a1-b2-b3-a2 fold of the right-hand-shaped polymerases (described in Mönttinen et al. 2014). This was accomplished using a PDB search (www.pdb.org), DALI software (Holm and Rosenström 2010) (ekhidna.biocenter.helsinki.fi/dali_server), and a review of literary references (Tesmer et al. 1999; Iyer et al. 2005; Morrison et al. 2008; Hyde et al. 2010). Only one structure from each protein type per organism was selected (www.pdb.org; structures before 6th of March 2014; see sup plementary table S1, Supplementary Material online), using the following criteria: 1) maximal amino acid sequence coverage of the protein domain of interest, 2) a minimal size of 132 residues, 3) resolution below 4 Å, and 4) no gaps in the palm-like fold (b1-a1-b2-b3-a2-b4). In addition, structures that did not automatically align with any other protein structure in the data set (e.g., due to differences in the relative orientation of the secondary structure elements in the palmlike fold) were discarded. TM-align (Zhang and Skolnick 2005) was used to validate homology among the selected protein structures. All the protein structures were pairwisely aligned against all the protein structures in the data set. From the pairwise alignments, TMscores normalized by the structures having fewer residues were used. A second data set was collected for the comparison of proteins assigned to contain a ferredoxin-like fold (SCOP id 1708 54861; Murzin et al. 1995) from the SCOP2 database (Andreeva et al. 2014). Only structures with over 100 residues were included in the analysis to ensure that they contained the entire ferredoxin-like fold (approximately 90 amino acids). The resulting data set was filtered using cdhit v4.5.4 until the amino acid sequences of the protein structures had at most 40% similarity (Li and Godzik 2006; Fu et al. 2012). From the nuclear magnetic resonance spectroscopy-based structures, only one chain of the model was taken into the alignment. In addition, X-ray and nuclear magnetic resonance structures that did not follow the proper PDB-file format were removed. The SCOP classification is according to SCOP 1.75 release (http://scop.mrc-lmb.cam.ac.uk/scop/) and SCOP2 (http:// scop2.mrc-lmb.cam.ac.uk/). Structural Alignment and Identification of Common Cores Equivalent residues in the selected protein structures were identified from structural alignments using the HSF-program (Ravantti et al. 2013), applying parameters optimized for the right-hand-shaped polymerases (Mönttinen et al. 2014). Visual Molecular Dynamics (VMD) 1.9.1 (Humphrey et al. 1996) was used for the visualization of protein structures and common structural cores. The identification of chemical conservation among the structurally equivalent residues was based on the HSF program (Ravantti et al. 2013). A 61 sequence window around the structurally equivalent residue was applied to relax positional constraints (Mönttinen et al. 2014). If an identical amino acid was found within the window from every protein structure in an analysis, the amino acid type was considered to be conserved for that position. Structure-Based Distance Trees Structure-based distance trees were deduced by identifying the structurally equivalent residues for selected protein structures and subsequently comparing this set of residues. The branch distances were calculated as described in Ravantti et al. (2013). The resulting normalized distance matrix was converted to a tree using distance-method applicable for structure-based analyses (Stuart et al. 1979), the FitchMargoliash algorithm of the Fitch program (Felsenstein 1989), and visualized with Dendroscope 3 (Huson and Scornavacca 2012). To evaluate the sensitivity of the obtained structure-based distance tree to variation in the data set, a simplified jackknife test was performed. A single structure from each protein family, as well as individual unclassified proteins, was removed one by one. The common core was then identified using the remaining 88 structures, and the distance tree was reconstructed based on the newly calculated common core. This simplified jackknife method was used due to the relatively high computational resource requirements of the structural alignment method. Sequence Alignments Amino acid sequence alignments for the entire data set were performed using the multiple sequence alignment method Superfamily Relations through Protein Structure . doi:10.1093/molbev/msw047 MAFFT (Katoh and Frith 2012). The structure-directed sequence alignment method, PROMALS3D (Pei et al. 2008), was used to identify amino acid-level conservation within individual protein families or within the prim-pol domain superfamily. The conserved amino acids identified by PROMALS3D were mapped onto secondary structure elements covering the region of the conserved core (identified by HSF) using example structures that were retrieved from the DSSP database (Kabsch and Sander 1983; Touw et al. 2015). An amino acid was considered to be conserved if the same amino acid was found in the same position in 50% of the aligned protein structures/sequences. The visualization of secondary structures and conserved amino acids was done using ALINE (Bond and Schüttelkopf 2009). Supplementary Material Supplementary figures S1-–S4 and Supplementary table S1 are available at Molecular Biology and Evolution online (http:// www.mbe.oxfordjournals.org/). Acknowledgments This work was supported by the Academy of Finland (grant numbers 250113, 256069, and 283192) and the Sigrid Juselius Foundation. References Anantharaman V, Iyer LM, Aravind L. 2010. Presence of a classical RRMfold palm domain in Thg1-type 3’- 5’nucleic acid polymerases and the origin of the GGDEF and CRISPR polymerase domains. Biol Direct. 5:43. Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG. 2014. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res. 42:D310–D314. Artymiuk PJ, Poirrette AR, Rice DW, Willett P. 1997. A polymerase I palm in adenylyl cyclase? Nature 388:33–34. Bond CS, Schüttelkopf AW. 2009. ALINE: a WYSIWYG protein-sequence alignment editor for publication-quality alignments. Acta Crystallogr D Biol Crystallogr. 65:510–512. Braithwaite DK, Ito J. 1993. Compilation, alignment, and phylogenetic relationships of DNA polymerases. Nucleic Acids Res. 21:787–802. Bressanelli S, Tomei L, Roussel A, Incitti I, Vitale RL, Mathieu M, De Francesco R, Rey FA. 1999. Crystal structure of the RNA-dependent RNA polymerase of hepatitis C virus. Proc Natl Acad Sci U S A. 96:13034–13039. Chandler M, de la Cruz F, Dyda F, Hickman AB, Moncalian G, Ton-Hoang B. 2013. Breaking and joining single-stranded DNA: the HUH endonuclease superfamily. Nat Rev Microbiol. 11:525–538. Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, Shi S, Kim BH, Grishin NV. 2014. ECOD: an evolutionary classification of protein domains. PLoS Comput Biol. 10:e1003926. Chothia C, Lesk Am. 1986. The Relation between the divergence of sequence and structure in proteins. EMBO J. 5:823–826. Cocozaki AI, Ramie NF, Shao YM, Hale CR, Terns RM, Terns MP, Li H. 2012. Structure of the Cmr2 subunit of the CRISPR-Cas RNA silencing complex. Structure 20:545–553. Felsenstein J. 1989. PHYLIP – Phylogeny Inference Package (Version 3.2). Cladistics 5:164–166. Forterre P. 2005. The two ages of the RNA world, and the transition to the DNA world: a story of viruses and cells. Biochimie 87:793–803. Fu L, Niu B, Zhu Z, Wu S, Li W. 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150–3152. Galperin MY. 2004. Bacterial signal transduction network in a genomic perspective. Environ Microbiol. 6:552–567. MBE Galperin MY, Nikolskaya AN, Koonin EV. 2001. Novel domains of the prokaryotic two-component signal transduction systems. FEMS Microbiol Lett. 203:11–21. Gilbert W. 1986. Origin of life - the RNA world. Nature 319:618–618. Gill S, Krupovic M, Desnoues N, Béguin P, Sezonov G, Forterre P. 2014. A highly divergent archaeo-eukaryotic primase from the Thermococcus nautilus plasmid, pTN2. Nucleic Acids Res. 42:3707–3719. Gladyshev EA, Arkhipova IR. 2011. A widespread class of reverse transcriptase-related cellular genes. Proc Natl Acad Sci U S A. 108:20311–20316. Gorbalenya AE, Pringle FM, Zeddam JL, Luke BT, Cameron CE, Kalmakoff J, Hanzlik TN, Gordon KH, Ward VK. 2002. The palm subdomainbased active site is internally permuted in viral RNA-dependent RNA polymerases of an ancient lineage. J Mol Biol. 324:47–62. Holm L, Rosenström P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38:W545–W549. Huang RH. 2012. Cas protein Cmr2 full of surprises. Structure 20:389–390. Humphrey W, Dalke A, Schulten K. 1996. VMD: visual molecular dynamics. J Mol Graph. 14:33–38. Huson DH, Scornavacca C. 2012. Dendroscope 3: an interactive tool for rooted phylogenetic trees and networks. Syst Biol. 61: 1061–1067. Hyde SJ, Eckenroth BE, Smith BA, Eberley WA, Heintz NH, Jackman JE, Doublié S. 2010. tRNAHis guanylyltransferase (THG1), a unique 3’-5’ nucleotidyl transferase, shares unexpected structural homology with canonical 5’-3’ DNA polymerases. Proc Natl Acad Sci U S A. 107:20305–20310. Ilyina TV, Koonin EV. 1992. Conserved sequence motifs in the initiator proteins for rolling circle DNA replication encoded by diverse replicons from eubacteria, eucaryotes and archaebacteria. Nucleic Acids Res. 20:3279–3285. Ito J, Braithwaite DK. 1991. Compilation and alignment of DNA polymerase sequences. Nucleic Acids Res. 19:4045–4057. Iyer LM, Koonin EV, Leipe DD, Aravind L. 2005. Origin and evolution of the archaeo-eukaryotic primase superfamily and related palm-domain proteins: structural insights and new members. Nucleic Acids Res. 33:3875–3896. Kabsch W, Sander C. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637. Katoh K, Frith MC. 2012. Adding unaligned sequences into an existing alignment using MAFFT and LAST. Bioinformatics 28:3144–3146. Koonin EV, Ilyina TV. 1993. Computer-assisted dissection of rolling circle DNA replication. Biosystems 30:241–268. Leon PE. 1998. Inhibition of ribozymes by deoxyribonucleotides and the origin of DNA. J Mol Evol. 47:122–126. Li W, Godzik A. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659. Ling H, Boudsocq F, Woodgate R, Yang W. 2001. Crystal structure of a Yfamily DNA polymerase in action: a mechanism for error-prone and lesion-bypass replication. Cell 107:91–102. Makarova KS, Aravind L, Grishin NV, Rogozin IB, Koonin EV. 2002. A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis. Nucleic Acids Res. 30:482–496. Makarova KS, Aravind L, Wolf YI, Koonin EV. 2011. Unification of Cas protein families and a simple scenario for the origin and evolution of CRISPR-Cas systems. Biol Direct. 6:38. Morrison DA. 2015. Is sequence alignment an art or a science? Syst Bot. 40:14–26. Morrison SD, Roberts SA, Zegeer AM, Montfort WR, Bandarian V. 2008. A new use for a familiar fold: the X-ray crystal structure of GTPbound GTP cyclohydrolase III from Methanocaldococcus jannaschii reveals a two metal ion catalytic mechanism. Biochemistry 47:230–242. Mönttinen HAM, Ravantti JJ, Stuart DI, Poranen MM. 2014. Automated structural comparisons clarify the phylogeny of the right-handshaped polymerases. Mol Biol Evol. 31:2741–2752. 1709 Mönttinen et al. . doi:10.1093/molbev/msw047 Murzin AG, Brenner SE, Hubbard T, Chothia C. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 247:536–540. Nakamura TM, Morin GB, Chapman KB, Weinrich SL, Andrews WH, Lingner J, Harley CB, Cech TR. 1997. Telomerase catalytic subunit homologs from fission yeast and human. Science 277:955–959. Ohmori H, Friedberg EC, Fuchs RP, Goodman MF, Hanaoka F, Hinkle D, Kunkel TA, Lawrence CW, Livneh Z, Nohmi T, et al. 2001. The Y-family of DNA polymerases. Mol Cell. 8:7–8. Pan J, Vakharia VN, Tao YJ. 2007. The structure of a birnavirus polymerase reveals a distinct active site topology. Proc Natl Acad Sci U S A. 104:7385–7390. Pei J, Kim BH, Grishin NV. 2008. PROMALS3D: a tool for multiple protein sequence and structure alignments. Nucleic Acids Res. 36:2295–2300. Pei JM, Grishin NV. 2001. GGDEF domain is homologous to adenylyl cyclase. Proteins 42:210–216. Poch O, Sauvaget I, Delarue M, Tordo N. 1989. Identification of four conserved motifs among the RNA-dependent polymerase encoding elements. EMBO J. 8:3867–3874. Ponting CP, Russell RR. 2002. The natural history of protein domains. Annu Rev Biophys Biomol Struct. 31:45–71. Prato S, Vitale Rm, Contursi P, Lipps G, Saviano M, Rossi M, Bartolucci S. 2008. Molecular modeling and functional characterization of the monomeric primase-polymerase domain from the Sulfolobus Solfataricus plasmid Pit3. FEBS J. 275:4389–4402. 1710 MBE Ravantti J, Bamford D, Stuart DI. 2013. Automatic comparison and classification of protein structures. J Struct Biol. 183:47–56. Steitz TA. 1999. DNA polymerases: structural diversity and common mechanisms. J Biol Chem. 274:17395–17398. Stuart DI, Levine M, Muirhead H, Stammers DK. 1979. Crystal-structure of cat muscle pyruvate-kinase at a resolution of 2.6 Å. J Mol Biol. 134:109–142. Tesmer JJ, Sunahara RK, Johnson RA, Gosselin G, Gilman AG, Sprang SR. 1999. Two-metal-ion catalysis in adenylyl cyclase. Science 285:756–760. Tews I, Findeisen F, Sinning I, Schultz A, Schultz JE, Linder JU. 2005. The structure of a pH-sensing mycobacterial adenylyl cyclase holoenzyme. Science 308:1020–1023. Touw WG, Baakman C, Black J, te Beek TA, Krieger E, Joosten RP, Vriend G. 2015. A series of PDB-related databanks for everyday needs. Nucleic Acids Res. 43:D364–D368. Woodgate R, Singh M, Kulaeva OI, Frank EG, Levine AS, Koch WH. 1994. Isolation and characterization of novel plasmid-encoded umuC mutants. J Bacteriol. 176:5011–5021. Xu J, Zhang Y. 2010. How significant is a protein structure similarity with TM-score ¼ 0.5? Bioinformatics 26:889–895. Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on TM-score. Nucleic Acid Res. 33:2302–2309. Zhu X, Ye K. 2012. Crystal structure of Cmr2 suggests a nucleotide cyclase-related enzyme in type III CRISPR-Cas systems. FEBS Lett. 586:939–945.