Download Common Structural Core of Three

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Common Structural Core of Three-Dozen Residues Reveals
Intersuperfamily Relationships
Heli A.M. Mönttinen,1 Janne J. Ravantti,*,1,2 and Minna M. Poranen*,1
1
Department of Biosciences, Viikki Biocenter, University of Helsinki, Helsinki, Finland
Institute of Biotechnology, Viikki Biocenter, University of Helsinki, Helsinki, Finland
2
*Corresponding author: E-mail: [email protected]; [email protected].
Associate editor: Lars Jermiin
Abstract
Identification of relationships among protein families or superfamilies is a challenge. However, functionally essential
protein regions typically retain structural integrity, even when the corresponding protein sequences evolve.
Consequently, comparison of protein structures enables deeper phylogenetic analyses than achievable through the
use of sequence information only. Here, we focus on a group of distantly related viral and cellular enzymes involved
in nucleic acid or nucleotide processing and synthesis. All these enzymes share an apparently similar protein fold at their
active site, which resembles the palm subdomain of the right-hand-shaped polymerases. Using a structure-based hierarchical clustering method, we identified a common structural core of 36 equivalent residues for this functionally diverse
group of enzymes, representing five protein superfamilies. Based on the properties of these 36 residues, we deduced a
structural distance-based tree in which the proteins were accurately clustered according to the established family
classification. Within this tree, the enzymes catalyzing genomic nucleic acid replication or transcription were separated
from those performing supplementary nucleic acid or nucleotide processing functions. In addition, we found that the
family Y DNA polymerases are structurally more closely related to the nucleotide cyclase superfamily members than to
the other members of the DNA/RNA polymerase superfamily, and these enzymes share 88 equivalent residues comprising
a b1-a1-a2-b2-b3-a3-b4-a4-b5 fold. The results highlight the power of structure-based hierarchical clustering in identifying remote evolutionary relationships. Furthermore, our study implies that a protein substructure of only three-dozen
residues can contain a substantial amount of information on the evolutionary history of proteins.
Key words: protein evolution, structural alignment, structural distances, nucleic acid and nucleotide processing
enzymes, polymerase evolution.
Introduction
number of the solved protein structures. However, such
approaches may prove critical for discerning the relatedness
of distant proteins, which is difficult to achieve solely through
amino acid sequence comparisons.
Homologous Structure Finder (HSF; Ravantti et al. 2013) is
a recently developed tool for structural alignment and structure-based classification of proteins. It uses several parameters
(e.g., physicochemical properties of the amino acids, type of
secondary structure and position in which it is located, local
geometry and local alignment, Ca-distances, and backbone
direction; Ravantti et al. 2013), and through pairwise comparisons, identifies the most similar structures within a set of
protein structures. The program then determines equivalent
residues within the two structures; that is, residues sharing
similar properties in terms of the parameters used. These
equivalent residues represent the common structural core
and can be considered homologous positions that are comparable to the columns of multiple sequence alignments
(Morrison 2015). Thus, the identified structural equivalence
may reveal possible homology. The special feature of this
program is that iterative pairwise comparisons of structures
and previously identified structural cores are continued until
ß The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please
e-mail: [email protected]
Mol. Biol. Evol. 33(7):1697–1710 doi:10.1093/molbev/msw047 Advance Access publication March 1, 2016
1697
Article
Phylogenetic relationships between protein-coding genes
have commonly been studied using amino acid sequence
alignments. These allow identification of homologous protein
regions in evolutionally closely related proteins and the assignment of protein families. Superfamilies represent a higher
level of classification and can be defined when a common
structural region is observed in proteins belonging to different
families. Such similarity among distantly related protein structures can often be observed, even when amino acid sequence
similarity has become undetectable (Chothia and Lesk 1986;
Ponting and Russell 2002). In addition to structural homology,
enzymes within a superfamily typically share a similar catalytic
mechanism; however, the substrate specificities may vary.
Assignment of protein superfamilies and relationships for
distantly related proteins involves the identification of structural similarity through multiple structural alignments, but
often also requires manual curation. Although methods for
automated structural alignment have significantly improved
during the last two decades, structure-based studies aimed at
identifying relationships among distantly related protein coding genes have been scarce. In part, this arises from the limited
MBE
Mönttinen et al. . doi:10.1093/molbev/msw047
all structures within the data set are merged to form a single
hierarchical clustering dendrogram, and a common structural
core is identified for all the proteins in question. Subsequently,
all-against-all pairwise distances are calculated for the residues
comprising the structural core, and the resulting distances
can be converted to a normalized distance matrix from which
a structure-based distance tree can be constructed (Ravantti
et al. 2013). Although structures are not necessarily drifting in
a continuous manner, and the calculated structural distances
may not exactly reflect evolutionary distances, the clustering
of protein structures in the structure-based distance trees
agrees well with the sequence-based classification of protein-coding genes into protein families (Mönttinen et al.
2014). Consequently, structure-based analyses can be used
to roughly estimate the chronology of evolutionary events
occurring over a large time scale. This is especially valuable
when deep evolutionary relationships, not detectable
through sequence-based analyses alone, are evaluated.
We have previously utilized HSF to describe a common
structural core for the right-hand-shaped polymerases of the
DNA/RNA polymerases superfamily (Mönttinen et al. 2014).
This core comprises 57 residues at the palm subdomain,
forming the active site of these enzymes, and was used to
predict relationships for the protein families within the DNA/
RNA polymerases superfamily (Mönttinen et al. 2014).
According to the Structural Classification of Proteins
(SCOP) database (Murzin et al. 1995; Andreeva et al. 2014),
the palm subdomain of the right-hand-shaped polymerases is
structurally related to the active site fold of the class III adenylate cyclases, which belong to a distinct protein superfamily, the nucleotide cyclases. Structurally similar active site folds
have also been observed in proteins belonging to the primpol domain, origin of replication-binding domain, and transposase IS200-like superfamilies, as well as in some unclassified
proteins (Murzin et al. 1995; Iyer et al. 2005; Hyde et al. 2010).
All these proteins are involved in nucleic acid or nucleotide
metabolism, and many have catalytic aspartic acid or
histidine residues in similar positions (Iyer et al. 2005; Hyde
et al. 2010). Despite the structural and functional similarities,
these superfamilies are currently placed under four different
SCOP-folds: DNA/RNA polymerases, ferredoxin-like, origin
of replication-binding domain (RBD-like), and prim-pol
domain.
Relationships among protein superfamilies have previously
been inferred based on the conservation of protein topology
and conserved sequence signatures (Iyer et al. 2005). In this
study, we asked whether structural similarity could be automatically detected among proteins belonging to different
superfamilies. We hypothesized that such information could
facilitate the construction of structure-based distance trees,
which have the potential to reveal inter- and intrasuperfamily
relationships between proteins that do not share detectable
sequence identity. In particular, we were interested in the
family Y DNA polymerases, which are traditionally classified
together with the other right-hand-shaped polymerases in
the superfamily of DNA/RNA polymerases, but share
sequence similarity with the members of the nucleotide
cyclases superfamily (Makarova et al. 2002).
1698
We selected a set of 89 protein structures, representing five
protein superfamilies and a few unclassified proteins involved
in nucleic acid or nucleotide metabolism and possessing an
active site fold similar to the palm subdomain of the righthand-shaped polymerases (table 1). Using HSF, we identified a
common structural core of 36 amino acid residues shared by
this set of protein structures. All 36 residues assigned as equivalent by the HSF program are located within the active site of
the proteins, but share no sequence identity across the whole
data set. These 36 equivalent residues are also found in the
same order within the polypeptide chain for all the proteins
in our data set, except in the birnavirus polymerases, where
there is a sequence inversion that results in distinct connectivity between the secondary structure elements at the active
site, as compared with the other known right-hand-shaped
polymerases (Gorbalenya et al. 2002; Pan et al. 2007). Using
the properties of these 36 residues and automated computational methods, we deduced a structure-based distance
tree, in which the proteins accurately cluster according to
their known function and established family classification.
This indicates that a substructure of only three-dozen residues can contain a significant amount of information on the
evolutionary history of a protein.
Results and Discussion
Protein Structures Sharing an Active Site Fold
Resembling the Palm Subdomain of the RightHand-Shaped Polymerases
The classification of protein structures has changed over time,
and there are variations in nomenclature depending on the
source. Here, we adhere to the family nomenclature established in the literature but also provide the family and superfamily names used in the SCOP database (table 1 and
supplementary table S1, Supplementary Material online).
Polymerase structures displaying the right-hand fold were
initially selected from the Protein Data Bank (PDB; www.pdb.
org, last accessed March 6, 2014) (table 1 and supplementary
table S1, Supplementary Material online). The data set was
subsequently extended with PDB and DALI (ekhidna.biocenter.helsinki.fi/dali_server, last accessed March 6, 2014)
searches, as well as with a review of literature references
(Tesmer et al. 1999; Iyer et al. 2005; Morrison et al. 2008;
Hyde et al. 2010), to obtain a comprehensive set of protein
structures proposed to have an active site fold homologous to
the palm subdomain of the right-hand-shaped polymerases.
Structures that could not be automatically aligned with
others in the data set due to deviations in the relative orientation of the secondary structure elements in the palm-like
fold were discarded from further analyses. As a result, we
obtained a set of 89 protein structures from five different
protein superfamilies (table 1 and supplementary table S1,
Supplementary Material online). Our data set covers, depending on the classification system (table 1), 13 or 15 protein
families previously proposed to share a common origin (Iyer
et al. 2005), as well as four unclassified protein structures
(table 1 and supplementary table S1, Supplementary
Material online). These proteins do not share any evident
Replication initiation protein E1
(64332)
Replication protein Rep (75492)
Relaxase domain (103120)
Transposase IS200-like (143423)
Diguanylate cyclases (6)
GGDEF domain (117984)
Transposase IS200-likec
(143422)
Origin of replicationbinding domain RBDlikee (55464)
Guanylate cyclases and class III adenylate cyclases (8)d
Adenylyl and guanylyl cyclase
catalytic domain (55074)
Nucleotide cyclasec
(55073)
Viral origin-binding proteins (3)
HUH endonucleases (7)
DNA-dependent DNA polymerization in DNA repair
Family Y DNA polymerases (10)
Family B DNA polymerases (13)
Lesion bypass DNA polymerase (Y-family),
catalytic domain
(100888)
DNA-dependent DNA polymerization in replication or DNA repair
Family A DNA polymerases (6)
DNA polymerase I (56673)
DNA-dependent RNA polymerization in transcription
DNA-dependent DNA polymerization in the processing of the
Okazaki fragments, DNA repair,
or replication
RNA-dependent RNA polymerases
(20)
RNA-dependent RNA polymerase
(56694)
dsRNA phage RNA-dependent
RNA-polymerase (64480)
Cleavage of a phosphodiester bond
in ssDNA; involved in the rollingcircle replication of plasmids and
viruses, conjugative plasmid
transfer and DNA transposition
Recognition and binding to the
replication initiation site; some
Catalyzation the formation of cyclic
di-3’,5’-guanylate from GTP
(needed in the regulation of biofilm formation and persistence)
Catalyzation the formation of secondary messengers 3’5’ cyclic
AMP/3’5’ cyclic GMP from ATP/
GTP
Archaea, bacteria, eukaryotes
Mitochondria, DNA phages
RNA-dependent RNA polymerization in viral genome replication
and transcription
Reverse transcriptases (4)
RNA (and DNA)-dependent DNA
polymerization in viral genome
replication and telomere
synthesis
Activity/Function
Reverse transcriptase (56686)
Family
Single-subunit RNA polymerases
(3)
b
Alternative Family Name (the
number of structures) in This
Studya
T7 RNA polymerase
(56691)
DNA/RNA polymerase
(56672)
Superfamily
SCOP Classification (SCOP id)
for Proteins in This Study
Table 1. Protein Families and Superfamilies Sharing an Active Site Fold Similar to the Right-Hand-Shaped Polymerases.
DNA viruses
(continued)
Archaea, bacteria, eukaryotes
(transposons) DNA viruses
Bacteria
Archaea, bacteria, eukaryotes
Archaea, bacteria, eukaryotes, DNA
phages
Bacteria, DNA phages,
mitochondria
ssRNA viruses,dsRNA viruses
Retroviruses, eukaryotes
Organisms
Superfamily Relations through Protein Structure . doi:10.1093/molbev/msw047
MBE
1699
1700
Addition of guanylate residue to
the 5’-end of tRNAHis
Participates in CRIPSPR-Cas RNA
silencing effector complex
Reverse polymerases (tRNAHis
guanylyltransferase) (2)
Cmr2 (1)
No SCOP classification
No SCOP classification
Catalyzation the formation of
2-amino-5-formylamino-6-ribofurano-sylamino-4(3H)-pyrimidinone ribonucleotide
monophosphate
Catalyzation of DNA-dependent
DNA polymerization via de novo
initiation (replication, able to
bypass lesions)
Catalyzation of RNA polymerization in priming process
Ligation of double-strand DNA
breaks in bacterial nonhomologous end-joining pathway
GTP cyclohydrolase III (1)
Archaeal primase-polymerases (1)
Ligase D (1)
Archaeal-eukaryotic primases (3)
of the proteins have also helicase
activity
Activity/Function
No SCOP classification
Bifunctional DNA primase/polymerase N-terminal domain
(103305)
PriA-like (56748)
Family
Alternative Family Name (the
number of structures) in This
Studya
Archaea, bacteria
Archaea, bacteria, eukaryotes
Archaea
Archaea
Bacteria
Archaea, eukaryotes
Organisms
c
b
The structures selected are not all included in the SCOP database but were assigned to SCOP families based on similarity to assigned SCOP family members and the information obtained from other sources.
Also known as right-hand-shaped polymerases. Belongs to the SCOP fold DNA/RNA polymerases (id 56671). According to SCOP the palm domain has a ferredoxin-like fold.
Belongs to the SCOP fold Ferredoxin-like (id 54861).
d
Guanylate/adenylate cyclases is used as an alternative name in this study.
e
Belongs to the SCOP fold Origin of replication-binding domain, RBD-like (id 55463).
f
Belongs to the SCOP fold Prim-pol domain (id 56746).
a
No SCOP classification
Prim-Pol domainf
(56747)
The origin DNA-binding
domain of SV40 T-antigen (55465)
Superfamily
SCOP Classification (SCOP id)
for Proteins in This Study
Table 1. Continued
Mönttinen et al. . doi:10.1093/molbev/msw047
MBE
Superfamily Relations through Protein Structure . doi:10.1093/molbev/msw047
sequence identity, as determined using the MAFFT sequence
alignment method (Katoh and Frith 2012; data not shown),
and therefore, structural information is needed to recognize
possible similarities and homology between the superfamilies.
For the initial validation of the data set, we made all pairwise structural alignments using TM-align (Zhang and
Skolnick 2005). Structural alignment of nonhomologous protein generally yields TM-scores <0.17 whereas that of proteins with well-established homologous relationships
generally yields TM-scores >0.5 (Xu and Zhang 2010). The
obtained pairwise TM-scores varied between 0.20 and 0.98
within the whole data set. The TM-scores between members
of the same family were generally between 0.49 and 0.98;
however, within the family of RNA-dependent RNA polymerases the TM-scores varied between 0.43 and 0.98, whereas
those among the reverse transcriptases varied between 0.40
and 0.81. Thus, using TM-align the structural similarity of
some of the proteins was found to be quite low, which, in
turn, call into question the homology of some of these proteins. These results also called into question the homology
between all the RNA-dependent RNA polymerases and reverse transcriptases, which are encoded by genes known to
evolve fast. However, because TM-scores were frequently >0.5 and none of them were <0.17, it was decided
to assume that the data set comprises sequences of homologous proteins. Notably, TM-scores >0.5 were also obtained
between distinct protein families so that a network connecting all the families together could be built (data not shown).
Automated Identification of Equivalent Residues for
Proteins Representing Different Superfamilies
The selected 89 protein structures, representing five protein
superfamilies, were structurally aligned and clustered using
the HSF program. This uses multiple parameters, such as geometry, secondary structure, physiochemical properties of
the amino acids, and amino acid sequence (see Ravantti
et al. 2013) to identify equivalent residues in protein structures, describing their common structural core. Due to the
low sequence similarity in our data set, the parameters describing local alignment and local geometry were upweighted,
whereas the proportional weight of the sequence was only
3.5% of all parameters (for the optimization of the parameters, see Mönttinen et al. 2014). Through the hierarchal clustering of the protein structures, the HSF program
automatically identified 36 equivalent residues with a rootmean-square deviation (rmsd) of 3.6 Å, for all proteins in our
data set (for the clustering, see supplementary fig. S1,
Supplementary Material online). All equivalent residues are
located within the palm-like fold forming the active site. The
common core covers four b-strands and two a-helices, sharing approximately similar relative orientations in each structure, and is depicted for the RNA-dependent RNA
polymerase of hepatitis C virus and the family Y DNA polymerase Dpo4 of Sulfolobus solfataricus in figure 1 (panels IB
and IIB, respectively). We could not identify any chemically
identical amino acids in the equivalent positions among the
36 residues by HSF (supplementary fig. S1, Supplementary
Material online), even when the positional constraints were
MBE
relaxed, and the amino acids were compared within a sequence window of 61 amino acids.
A similar analysis was also carried out for a set of 250
protein structures classified under the ferredoxin-like fold in
SCOP database, including members of the nucleotide cyclases
and transposase IS200-like superfamilies present in our main
data set. Using HSF, it was not possible to align the ferredoxinlike fold, and the identified common core was found to have
only three equivalent residues. This suggests that the structural heterogeneity among the proteins assigned to have the
ferredoxin-like fold is substantially higher than that present in
our main data set, which contains proteins from three different SCOP fold classes (table 1).
Structural Properties of the 36-Residue Common Core
To obtain further insights into the structural organization of
the 36-residue common core, the individual protein sequences were aligned in 11 groups (corresponding to the
protein families and the prim-pol domain superfamily) using
the structure-directed sequence alignment method,
PROMALS3D (Pei et al. 2008), and the obtained alignments
were manually superimposed (supplementary fig. S2,
Supplementary Material online). The birnavirus RNA polymerases, which have a sequence inversion at the palm subdomain (b1-b2-b3-a1-a2; Gorbalenya et al. 2002; Pan et al.
2007), could not be automatically aligned with the other
RNA-dependent RNA polymerases by PROMALS3D and
were omitted from this analysis. We found that for all other
structures in our data set, the order of secondary structure
elements within the common core was conserved (b1-a1-b2b3-a2 fold; fig. 1 and supplementary fig. S2, Supplementary
Material online), suggesting that the active site of these proteins might share a common origin. Furthermore, in most
proteins, with the exception of those belonging to the HUH
endonucleases, viral origin-binding proteins, and guanylate/
adenylate cyclases, there is an insertion between the b1- and
a2-elements. This insertion forms the fingers subdomain in
the right-hand-shaped polymerases and has likely occurred
early in the evolution of these nucleic acid synthesizing
enzymes.
Our structure-based sequence alignment also allowed
identification of conserved amino acid residues for each protein family. Comparison of the conserved residues revealed
few that are identical across several protein families, allowing
connections to be made among the superfamilies (supple
mentary fig. S2, Supplementary Material online). Most notably, an aspartic acid (D) in b1 is conserved among the members of the DNA/RNA polymerases, nucleotide cyclases, and
prim-pol superfamilies, as well as in the three unclassified
proteins in our data set (table 1). Furthermore, a conserved
histidine residue (H) in b3-strand bridges the members of the
prim-pol domain superfamily and the HUH endonucleases
family, which are currently classified under three different
SCOP superfamilies and three SCOP folds (table 1).
Members of the viral origin-binding proteins family, from
the origin of replication-binding protein superfamily, do not
share any detectable sequence identity with the members of
the other protein families. Nevertheless, based on the recently
1701
Mönttinen et al. . doi:10.1093/molbev/msw047
MBE
FIG. 1. Structurally equivalent residues for the selected nucleic acid/nucleotide-binding proteins. The structure of hepatitis C virus RNA-dependent RNA polymerase (PDBid: 3TYQ) and the family Y DNA polymerase, Dpo4, of Sulfolobus solfataricus (PDBid: 1JX4), shown in rows I and II,
respectively, are used for illustration of the conserved structural features among the selected proteins structures representing five superfamilies
(table 1 and supplementary table S1, Supplementary Material online). Column A depicts the complete palm subdomain (according to Bressanelli
et al. 1999; Ling et al. 2001), and column B shows the 36-amino acid common structural core for the entire data set. Panel IC shows the equivalent
residues shared by the RNA-dependent RNA polymerases, reverse transcriptases, single-subunit RNA polymerases, and family A and B DNA
polymerases (i.e., members of group I of fig. 2). Panel IIC illustrates the common core for the members of the family Y DNA polymerases, the
representatives of nucleotide cyclase, transposase IS200-like, origin of replication-binding domain, and prim-pol domain superfamilies, and the
three nonclassified proteins in the data set (i.e., the members of group II of fig. 2). The palm subdomain is in green, and the conserved residues in
the thumb and finger subdomains are indicated with pink and yellow, respectively. The main secondary structure elements in the substructures
are indicated.
released evolutionary classification of protein domains
(ECOD) database (Cheng et al. 2014), the viral origin-binding
proteins are related with the HUH endonucleases. A shared
common origin between the viral origin proteins and members of the DNA/RNA polymerases superfamily, the nucleotide cyclases, and the primase polymerases has also been
proposed previously (Iyer et al. 2005). The results of the
TM-align analysis also support these observations and suggest
that the members of the five protein superfamilies and the
nonclassified proteins in our data set (table 1 and supplemen
tary table S1, Supplementary Material online) are structurally
related and thus likely homologous (see above).
1702
Relationships among Proteins Families and
Superfamilies Deduced from the 36 Equivalent
Residues
A structure-based distance tree was constructed for the
members of the five protein superfamilies and the four unclassified protein structures described in table 1 by comparing
the properties of the 36 equivalent residues. In this analysis,
using only the common structural core prevents the comparison of individual residues or motifs that may coincidentally share similar structural properties but are part of distinct
structural features that do not share a common origin with
other proteins in the comparison. The relationships among
Superfamily Relations through Protein Structure . doi:10.1093/molbev/msw047
the different protein structures were produced automatically
by converting all the pairwise similarity scores between the
equivalent residues into distances and depicting them as a
tree.
Despite the small size of the common structural core and
the lack of identical amino acids, the grouping of protein
structures in the resulting distance tree accurately follows
the established protein families that are based on sequence
analyses (fig. 2; table 1; Poch et al. 1989; Ito and Braithwaite
1991; Braithwaite and Ito 1993; Ohmori et al. 2001; Chandler
et al. 2013). This implies that distance trees constructed
using HSF could be used to propose protein phylogenies,
and thus, can be described as “structure-based phylogenetic
trees.”
Within the constructed distance tree, the protein families
form two main groups. Group I includes the RNA-dependent
RNA polymerases, reverse transcriptases, single-subunit RNA
polymerases, and family A and B DNA polymerases from the
superfamily of DNA/RNA polymerases (i.e., mainly processive
nucleic acid polymerases involved in DNA/RNA replication or
transcription). Group II contains primarily nonreplicative enzymes, as well as enzymes that take part in the replication of
extrachromosomal genetic elements. These proteins represent five protein superfamilies and are involved in primer
synthesis and DNA nicking, ligation, and repair, as well as in
the recognition of specific DNA sequences, RNA modification, and the synthesis of secondary messengers from nucleotide precursors (e.g., cyclic adenosine monophosphate; table
1 and fig. 2). Similar grouping was also observed in the initial
hierarchical clustering of the complete protein structures (sup
plementary fig. S1, Supplementary Material online). The common cores identified for groups I and II through the hierarchical clustering were 76 and 43 residues, respectively
(supplementary fig. S1, Supplementary Material online; see
below). The sizes of the common cores of the different subclusters (e.g., protein families) are also provided in the supplementary fig. S1, Supplementary Material online.
Interestingly, the family Y DNA polymerases of the DNA/
RNA polymerases superfamily are grouped separately from
the other right-hand-shaped polymerases and cluster in
group II with the 33 other nucleic acid/nucleotide processing
enzymes (fig. 2 and supplementary fig. S1, Supplementary
Material online). This suggests that family Y DNA polymerases involved in DNA repair are only distantly related to the
other right-hand-shaped polymerases, which mainly catalyze
genomic nucleic acid replication and transcription. However,
like the right-hand-shaped RNA polymerases and replicative
DNA polymerases, the family Y polymerases and most of the
nucleotide cyclases, as well as the unclassified proteins, GTP
cyclohydrolase III and Cmr2, have a similar catalytic site containing two aspartate residues (supplementary fig. S2,
Supplementary Material online), supporting the placement
of these enzymes adjacent to the members of the DNA/RNA
polymerases superfamily (fig. 2).
We propose that the processive nucleic acid polymerases
in group I are of early origin, with the first ones arising during
the phase when RNA was the predominant genetic material
(RNA world; Gilbert 1986). These enzymes perform key
MBE
functions in viruses and cellular organisms, catalyzing DNA
and RNA synthesis using either DNA or RNA templates.
Conversely, the representatives of group II carry out more
sophisticated functions, which are related to error correction
and communication between cells; that is, functions that
likely have evolved later in more complex biological systems.
As a potential reflection of this evolutionary history, the archaeal primase–polymerases of group II can possess both
DNA-dependent DNA and RNA polymerization activities
(Prato et al. 2008), as well as RNA-dependent DNA synthesis
activity (Gill et al. 2014).
To evaluate the robustness of our structure-based distance
tree (fig. 2), we performed a simplified jackknife test, in which
random members of each protein family or unclassified protein structures were discarded one-by-one from the analysis.
The structure-based distance trees were then recalculated
using the remaining 88 structures (supplementary fig. S3,
Supplementary Material online). The results of this test confirmed the general pattern we observed with our full data set,
as well as the distribution of protein families into two main
groups. However, in the simplified jackknife test, the position
of the unclassified protein, Cmr2, varied slightly, such that, in
20% of the replicates it clustered with group I rather than
group II. Critically, despite the small size of the common core,
the grouping of family members together was also stable,
particularly among the group II families. However, the family
B DNA polymerase of phage u29 and the telomerase-reverse
transcriptase of Tribolium castaneum were only loosely
associated with their respective families (in 20% and 27% of
the replicates, respectively). Additionally, members of singlesubunit RNA polymerases were disconnected in 27% of the
replicates (supplementary fig. S3, Supplementary Material
online).
Fine-Tuning Structure-Based Protein Phylogenies by
Increasing Common Core Size
A common region composed of only 36 structurally equivalent residues allowed the automated reconstruction of structure-based interfamily and intersuperfamily relationships for a
set of distant protein homologs (fig. 2). To further evaluate
the quality of the tree constructed from this 36-residue structural core, distance trees were constructed separately for the
two main clusters, that is, the group I enzymes catalyzing
nucleic acid replication or transcription (supplementary fig.
S4, Supplementary Material online) and the group II enzymes
involved in supplementary nucleic acid/nucleotide processing
functions (fig. 3). The common structural cores identified by
HSF for the members of groups I and II are 76 and 43 acarbon residues, respectively (fig. 1 and supplementary fig. S1,
Supplementary Material online). We observed that the distance trees deduced from these larger structural cores (sup
plementary fig. S4, Supplementary Material online and fig. 3)
follow the overall topology observed in the original tree
(fig. 2), indicating that the tree deduced from the 36-residue
substructure is relatively robust at the interfamily level. The
positions of the individual protein structures were slightly
changed when the larger structural cores were applied, finetuning the intrafamily relations. Overall, we consider the
1703
Mönttinen et al. . doi:10.1093/molbev/msw047
MBE
FIG. 2. Intersuperfamily and interfamily relationships for proteins harboring a palm-like fold. A structure-based distance tree for the selected
nucleic acid/nucleotide processing enzymes (table 1 and supplementary table S1, Supplementary Material online) was deduced based on the
properties of the 36 structurally equivalent amino acid residues (depicted in fig. 1, panels IB and IIB). The coloring of the individual branches reflects
the family classification of the proteins (see table 1). The family names are indicated on the inner circle, and the protein superfamilies (according to
SCOP; Murzin et al. 1995; Andreeva et al. 2014) are shown on the outer circle; unclassified protein structures are marked with an asterisk (*).
The separation of the two groups is highlighted with a dashed line. vOBPs, viral origin-binding proteins; APP, archaeal primase polymerase; AEP,
archaeal-eukaryotic primases; LigD, polymerase domain of DNA ligase D.
structure-based analyses to be suitable for studying the
relationships between protein families, superfamilies, and intrafamily relations, thus effectively supplementing sequencebased methods.
Common Core and Relationships for Members of
Group I
The members of group I (i.e., the RNA-dependent RNA
polymerases, reverse transcriptases, single-subunit RNA polymerases, and family A and B DNA polymerases) share a 76residue structural core with an rmsd of 4.2 Å. The common
1704
structural region for these members of the DNA/RNA polymerase superfamily contains, in addition to the palm subdomain, regions from the fingers and thumb subdomains, as
depicted for the hepatitis C virus RNA-dependent RNA polymerase in figure 1 (panel IC; thumb- and fingers-specific
residues indicated in pink and yellow, respectively). The conserved palm region covers 57 residues, sharing a similar connectivity in all group I members (supplementary fig. S2,
Supplementary Material online), excepting the birnavirus
RNA-dependent RNA polymerases (data not shown;
Gorbalenya et al. 2002; Pan et al. 2007). Although the fingers
Superfamily Relations through Protein Structure . doi:10.1093/molbev/msw047
MBE
FIG. 3. Fine-tuning the structural relationships for the members of group II. A structure-based distance tree for selected nucleic acid/nucleotide
processing enzymes representing five different protein superfamilies and four unclassified proteins (members of group II, fig. 2) was deduced based
on the properties of the 43 structurally equivalent residues (depicted in fig. 1, panel IIC). The coloring of the individual branches reflects the
established family classification of the enzymes (see table 1), and the corresponding family names are indicated with the same colors. The
superfamilies are indicated by light violet backgrounds and black font. The organism from which each protein originates is shown at the tip of each
branch, and the name of the enzyme is also provided for clarity, when needed. E. coli, Escherichia coli; H. sapiens, Homo sapiens; M. smegmatis,
Mycobacterium smegmatis; M. tuberculosis, Mycobacterium tuberculosis; P. aeruginosa, Pseudomonas aeruginosa; P. horikoshii, Pyrococcus horikoshii; P. furiosus; Pyrococcus furiosus; S. acidocaldarius, Sulfolobus acidocaldarius; S. cerevisiae, Saccharomyces cerevisiae; S. solfataricus, Sulfolobus
solfataricus; S. tokodaii, Sulfolobus tokodaii.
and thumb subdomains are not considered homologous
among the different polymerase families (Steitz 1999), their
overall shape is similar, and therefore, HSF can also identify
equivalent positions in these subdomains (fig. 1, panel IC).
HSF did not automatically identify any chemically identical
equivalent residues within this region. However, when
identical amino acids were searched in a sequence window
of 61, the two conserved catalytic aspartates could be
recognized in the b1- and b3-strands. These were also identified independently in the structure-directed sequence
alignment using PROMALS3D (supplementary fig. S2,
Supplementary Material online).
1705
Mönttinen et al. . doi:10.1093/molbev/msw047
Within the structure-based distance trees (fig. 2 and supplementary fig. S4, Supplementary Material online), the RNAdependent and DNA-dependent polymerases form separate
branches. This likely reflects the early appearance of the RNAdependent enzymes in the RNA world and the subsequent
development of reverse transcriptases to handle the originally
toxic deoxyribonucleotides (Leon 1998). DNA-based processes, rather, are believed to have evolved later (Forterre
2005), leading to the emergence of the DNA-dependent
DNA and single-subunit RNA polymerases. Although the
structurally characterized reverse transcriptases included in
our analysis represent diverse biological system (e.g., retroviruses, retroelements, and eukaryotic telomerase machinery),
all seem to share a common origin (fig. 2 and supplementary
fig. S4, Supplementary Material online; Nakamura et al. 1997;
Gladyshev and Arkhipova 2011). Similarly, all the DNA-dependent polymerases of group I likely evolved from the same
ancient palm subdomain, and subsequently differentiated
into the family A and B DNA polymerases, as well as the
single-subunit RNA polymerases (fig. 2 and supplementary
fig. S4, Supplementary Material online). Furthermore, clustering of the viral RNA-dependent RNA polymerases mostly
follows the taxonomic classification of viral families (supple
mentary fig. S4, Supplementary Material online), implying
that exchange of viral RNA polymerase genes has been
relatively limited since the division and differentiation of
the different viral families. These observations further support
our proposal that structure-based analyses can be used to
deduce phylogenetic relationships.
Common Core and Relationships for Members of
Group II
Group II represents five distinct protein superfamilies comprising a variety of nonreplicative enzymes involved in the
processing of nucleic acids or nucleotides (table 1). These
enzymes share only 43 structurally common residues with
an rmsd of 4.1 Å, and none of these is chemically identical,
even when searched in 61 sequence window. However, using
a structure-directed sequence alignment (supplementary fig.
S2, Supplementary Material online), it was possible to identify
a few residues that are conserved in several families, allowing
connections to be made between the superfamilies. The common structural core is depicted for the family Y DNA polymerase, Dpo4, of S. solfataricus in figure 1 (panel IIC). It
contains four antiparallel b-strands and two a-helices; that
is, the same secondary structural elements as the 36-residue
common core identified for the whole data set (fig. 1, panel
IIB). However, as a reflection of decreased structural variation,
the coverage of the secondary structures has become more
complete.
In our structure-based distance tree, the members of
group II form two large branches. The first one contains the
family Y DNA polymerases of the DNA/RNA polymerase superfamily and the two families of the nucleotide cyclase superfamily (i.e., the guanylate cyclase and class III adenylate
cyclase family and the diguanylate cyclases family; table 1).
The structure of the archaeal GTP cyclohydrolase III, which
has no assigned family classification in the SCOP database
1706
MBE
(table 1), is also part of this subtree (fig. 3). Interestingly,
GTP cyclohydrolase III has a GGDNF sequence-motif in the
catalytic site that is highly similar to the GGDEF sequencemotif, a hallmark feature of the diguanylate cyclase family
(Galperin et al. 2001; Galperin 2004). These observations suggest that GTP cyclohydrolase III could be placed into the same
superfamily with the nucleotide cyclases.
In addition, the structure of the CRISPR-Cas RNA silencing
effector complex protein Cmr2 (CRISPR for Clustered,
Regularly Interspaced, Short Palindromic Repeat) is located
in the same branch with the guanylate/adenylate cyclases and
GTP cyclohydrolase III (fig. 3). The structural similarity between Cmr2 and the adenylate cyclases has been described
previously (Zhu and Ye 2012). However, Cmr2 does not share
functional properties with either cyclases or polymerases
(Cocozaki et al. 2012; Huang 2012), and it has been proposed
to form a distinct protein family with cas10 proteins
(Makarova et al. 2011). Based on the structural similarity
we observe, however, Cmr2 likely belongs to the superfamily
of nucleotide cyclases.
The second branch in group II (fig. 2) includes the members of the transposase IS200-like, origin of replication-binding
domain, and prim-pol domain superfamilies. In the SCOP
database, members of the HUH endonuclease family are classified into several different families and two superfamilies:
Transposase IS200-like and origin of replication-binding domain (table 1 and fig. 3). Our analyses imply that the HUH
endonucleases form two branches, which likely are evolutionarily more closely related to each other than to the members
of the origin of replication-binding domain superfamily
(fig. 3). These observations suggest that HUH endonucleases
should be placed into a single superfamily, as was also suggested earlier (Ilyina and Koonin 1992; Koonin and Ilyina
1993).
The nonclassified reverse polymerases (i.e., tRNAHis guanylyltransferases) are closely associated with the family Y DNA
polymerases and diguanylate cyclases in all our structurebased distance trees (figs. 2 and 3; supplementary fig. S3,
Supplementary Material online). Like all the right-handshaped polymerases and nucleotide cyclases, reverse polymerases harbor catalytic aspartic acids in the b1- and b3strands of the palm-like fold (supplementary fig. S2,
Supplementary Material online; Hyde et al. 2010), suggesting
high conservation in the catalytic mechanism of nucleotide
addition, despite the reverse directionality. The close relation
of the reverse polymerases and nucleotide cyclases is supported by previous sequence-based studies in which reverse
polymerases were shown to be related to GGDEF and CRISPR
polymerases (Anantharaman et al. 2010).
Members of the Family Y DNA Polymerases and the
Nucleotide Cyclase Superfamily Share an Extended
Palm-Like Fold
In the structure-based distance trees presented here (figs. 2
and 3), the family Y DNA polymerases of the DNA/RNA
polymerase superfamily are grouped together with the members of the nucleotide cyclase superfamily (including GTP
Superfamily Relations through Protein Structure . doi:10.1093/molbev/msw047
MBE
cyclohydrolase III and Cmr2). Similar grouping was also observed in the hierarchical clustering dendrogram produced by
HSF (supplementary fig. S1, Supplementary Material online),
and in this case, a common structural core of 88 residues with
an rmsd of 3.0 Å was identified. The structural core is presented for Dpo4 of S. solfataricus and the adenylyl cyclase
Rv1264 of Mycobacterium tuberculosis (fig. 4A and B). It contains all the secondary structure elements present in the
common structural core of the right-hand-shaped polymerases (Mönttinen et al. 2014), as well as two additional a-helices and b-strands forming a b1-a1-a2-b2-b3-a3-b4-a4-b5
fold (fig. 4C). This fold was previously found to be conserved
among the members of the nucleotide cyclases superfamily
(Pei and Grishin 2001), and our results demonstrate that also
the family Y DNA polymerases share this fold. The conserved
b1-a1-a2-b2-b3-a3-b4-a4-b5 fold comprises 25% of the entire polypeptide chain of the family Y DNA polymerase Dpo4
of S. solfataricus and over 52% of the catalytic domain (Tews
et al. 2005) of the class III adenylate cyclase Rv1264 of M.
tuberculosis. It is considerably larger than the common core
of the right-hand-shaped polymerases or the common active
site fold previously described for the eukaryotic adenylate
cyclase and family A DNA polymerases of Escherichia coli
and Thermus aquaticus (Artymiuk et al. 1997).
The HSF program automatically detected one positionally
and chemically conserved amino acid for all the family Y DNA
polymerases and nucleotide cyclases—a glycine (G) in the b4strand. Mutational studies on UmuC from E. coli, a member of
the family Y DNA polymerases, have shown that this glycine
has a role in maintaining protein stability and local structure
(Woodgate et al. 1994). This conserved glycine and a conserved aspartic acid (D) in b1 could also be identified using
PROMALS3D (supplementary fig. S2, Supplementary Material
online). Likewise, an additional conserved aspartic acid (D)
could be identified in b3 from family Y DNA polymerases and
guanylate/adenylate cyclases.
The substantial similarity in the active site fold (fig. 4) indicates that family Y DNA polymerases are structurally more
closely related to nucleotide cyclases than to the right-handshaped polymerases (DNA/RNA polymerases superfamily).
Although SCOP classifies nucleotide cyclases and family Y
DNA polymerases into separate folds, the new classification
database ECOD (Cheng et al. 2014) places them under the
same homology and topology level. Additionally, previous
sequence comparisons (Makarova et al. 2002) support our
results. In the light of these findings, either the classification of
the family Y DNA polymerases should be reevaluated or the
SCOP superfamily classification should not be considered to
reflect evolutionary relationships for the family Y DNA polymerases. The nucleotide cyclases and the family Y DNA polymerases are present in all three domains of life, and typically,
FIG. 4. The common structural fold for the family Y DNA polymerases, nucleotide cyclases, and GTP cyclohydrolase III. The common
structural core for the members of the family Y DNA polymerases (of
superfamily DNA/RNA polymerases) and the nucleotide cyclases superfamily, as well as for GTP cyclohydrolase III and Cmr2 (unclassified
proteins), was identified using the HSF program (Ravantti et al. 2013).
Equivalent residues identified are shown in green in the structures of
the family Y DNA polymerase, Dpo4 of Sulfolobus solfataricus (PDBid:
1JX4) (A), and the adenylate cyclase Rv1625 of Mycobacterium tuberculosis (PDBid: 1Y11) (B). (C) A schematic presentation of the conserved fold.
1707
MBE
Mönttinen et al. . doi:10.1093/molbev/msw047
an organism encodes several paralogs with slightly different
functionalities. This suggests that genes encoding precursors
of these enzymes were most likely present in the last universal
common ancestor or that there has been extensive horizontal
gene transfer between different organisms at a later stage of
cellular evolution. Due to this diversity, it is difficult to predict
the order of evolutionary steps that has resulted into the
development of the different forms of these enzymes.
Conclusions
Here, we have used protein structures to reconstruct phylogenetic relationships for distantly related proteins representing five protein superfamilies and four unclassified proteins
(table 1). Using the HSF program, we were able to identify a
common core of 36 residues (fig. 1) at the active site of these
proteins. Despite the lack of sequence-level identity, a phylogenetic tree confirming the established protein family classification could be deduced based on the properties of the
conserved structural core (fig. 2). This implies that a substructure of only three-dozen residues can contain a substantial
amount of information on the evolutionary history of proteins. Such an approach allowed us to detect not only interfamily but also intersuperfamily relations, extending the
evolutionary timeframe of protein phylogenies.
Materials and Methods
Selection of Protein Structures
The starting point for protein structure selection was the data
set described in Mönttinen et al. (2014). This data set was
updated with seven recently published novel right-handshaped polymerase structures and subsequently extended
to include all protein structures containing the conserved
b1-a1-b2-b3-a2 fold of the right-hand-shaped polymerases
(described in Mönttinen et al. 2014). This was accomplished
using a PDB search (www.pdb.org), DALI software (Holm and
Rosenström 2010) (ekhidna.biocenter.helsinki.fi/dali_server),
and a review of literary references (Tesmer et al. 1999; Iyer
et al. 2005; Morrison et al. 2008; Hyde et al. 2010). Only one
structure from each protein type per organism was selected
(www.pdb.org; structures before 6th of March 2014; see sup
plementary table S1, Supplementary Material online), using
the following criteria: 1) maximal amino acid sequence coverage of the protein domain of interest, 2) a minimal size of
132 residues, 3) resolution below 4 Å, and 4) no gaps in the
palm-like fold (b1-a1-b2-b3-a2-b4). In addition, structures
that did not automatically align with any other protein structure in the data set (e.g., due to differences in the relative
orientation of the secondary structure elements in the palmlike fold) were discarded.
TM-align (Zhang and Skolnick 2005) was used to validate
homology among the selected protein structures. All the protein structures were pairwisely aligned against all the protein
structures in the data set. From the pairwise alignments, TMscores normalized by the structures having fewer residues
were used.
A second data set was collected for the comparison of
proteins assigned to contain a ferredoxin-like fold (SCOP id
1708
54861; Murzin et al. 1995) from the SCOP2 database
(Andreeva et al. 2014). Only structures with over 100 residues
were included in the analysis to ensure that they contained
the entire ferredoxin-like fold (approximately 90 amino acids).
The resulting data set was filtered using cdhit v4.5.4 until the
amino acid sequences of the protein structures had at most
40% similarity (Li and Godzik 2006; Fu et al. 2012). From the
nuclear magnetic resonance spectroscopy-based structures,
only one chain of the model was taken into the alignment.
In addition, X-ray and nuclear magnetic resonance structures
that did not follow the proper PDB-file format were removed.
The SCOP classification is according to SCOP 1.75 release
(http://scop.mrc-lmb.cam.ac.uk/scop/) and SCOP2 (http://
scop2.mrc-lmb.cam.ac.uk/).
Structural Alignment and Identification of Common
Cores
Equivalent residues in the selected protein structures were
identified from structural alignments using the HSF-program
(Ravantti et al. 2013), applying parameters optimized for the
right-hand-shaped polymerases (Mönttinen et al. 2014).
Visual Molecular Dynamics (VMD) 1.9.1 (Humphrey et al.
1996) was used for the visualization of protein structures
and common structural cores.
The identification of chemical conservation among the
structurally equivalent residues was based on the HSF program (Ravantti et al. 2013). A 61 sequence window around
the structurally equivalent residue was applied to relax positional constraints (Mönttinen et al. 2014). If an identical
amino acid was found within the window from every protein
structure in an analysis, the amino acid type was considered
to be conserved for that position.
Structure-Based Distance Trees
Structure-based distance trees were deduced by identifying
the structurally equivalent residues for selected protein structures and subsequently comparing this set of residues. The
branch distances were calculated as described in Ravantti
et al. (2013). The resulting normalized distance matrix was
converted to a tree using distance-method applicable for
structure-based analyses (Stuart et al. 1979), the FitchMargoliash algorithm of the Fitch program (Felsenstein
1989), and visualized with Dendroscope 3 (Huson and
Scornavacca 2012).
To evaluate the sensitivity of the obtained structure-based
distance tree to variation in the data set, a simplified jackknife
test was performed. A single structure from each protein
family, as well as individual unclassified proteins, was removed
one by one. The common core was then identified using the
remaining 88 structures, and the distance tree was reconstructed based on the newly calculated common core. This
simplified jackknife method was used due to the relatively
high computational resource requirements of the structural
alignment method.
Sequence Alignments
Amino acid sequence alignments for the entire data set were
performed using the multiple sequence alignment method
Superfamily Relations through Protein Structure . doi:10.1093/molbev/msw047
MAFFT (Katoh and Frith 2012). The structure-directed sequence alignment method, PROMALS3D (Pei et al. 2008),
was used to identify amino acid-level conservation within
individual protein families or within the prim-pol domain
superfamily. The conserved amino acids identified by
PROMALS3D were mapped onto secondary structure elements covering the region of the conserved core (identified
by HSF) using example structures that were retrieved from
the DSSP database (Kabsch and Sander 1983; Touw et al.
2015). An amino acid was considered to be conserved if
the same amino acid was found in the same position in 50%
of the aligned protein structures/sequences. The visualization
of secondary structures and conserved amino acids was done
using ALINE (Bond and Schüttelkopf 2009).
Supplementary Material
Supplementary figures S1-–S4 and Supplementary table S1
are available at Molecular Biology and Evolution online (http://
www.mbe.oxfordjournals.org/).
Acknowledgments
This work was supported by the Academy of Finland (grant
numbers 250113, 256069, and 283192) and the Sigrid Juselius
Foundation.
References
Anantharaman V, Iyer LM, Aravind L. 2010. Presence of a classical RRMfold palm domain in Thg1-type 3’- 5’nucleic acid polymerases and
the origin of the GGDEF and CRISPR polymerase domains. Biol
Direct. 5:43.
Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG. 2014. SCOP2
prototype: a new approach to protein structure mining. Nucleic
Acids Res. 42:D310–D314.
Artymiuk PJ, Poirrette AR, Rice DW, Willett P. 1997. A polymerase I palm
in adenylyl cyclase? Nature 388:33–34.
Bond CS, Schüttelkopf AW. 2009. ALINE: a WYSIWYG protein-sequence
alignment editor for publication-quality alignments. Acta Crystallogr
D Biol Crystallogr. 65:510–512.
Braithwaite DK, Ito J. 1993. Compilation, alignment, and phylogenetic
relationships of DNA polymerases. Nucleic Acids Res. 21:787–802.
Bressanelli S, Tomei L, Roussel A, Incitti I, Vitale RL, Mathieu M, De
Francesco R, Rey FA. 1999. Crystal structure of the RNA-dependent
RNA polymerase of hepatitis C virus. Proc Natl Acad Sci U S A.
96:13034–13039.
Chandler M, de la Cruz F, Dyda F, Hickman AB, Moncalian G, Ton-Hoang
B. 2013. Breaking and joining single-stranded DNA: the HUH endonuclease superfamily. Nat Rev Microbiol. 11:525–538.
Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, Shi S, Kim BH, Grishin NV.
2014. ECOD: an evolutionary classification of protein domains. PLoS
Comput Biol. 10:e1003926.
Chothia C, Lesk Am. 1986. The Relation between the divergence of
sequence and structure in proteins. EMBO J. 5:823–826.
Cocozaki AI, Ramie NF, Shao YM, Hale CR, Terns RM, Terns MP, Li H.
2012. Structure of the Cmr2 subunit of the CRISPR-Cas RNA silencing complex. Structure 20:545–553.
Felsenstein J. 1989. PHYLIP – Phylogeny Inference Package (Version 3.2).
Cladistics 5:164–166.
Forterre P. 2005. The two ages of the RNA world, and the transition to
the DNA world: a story of viruses and cells. Biochimie 87:793–803.
Fu L, Niu B, Zhu Z, Wu S, Li W. 2012. CD-HIT: accelerated for clustering
the next-generation sequencing data. Bioinformatics 28:3150–3152.
Galperin MY. 2004. Bacterial signal transduction network in a genomic
perspective. Environ Microbiol. 6:552–567.
MBE
Galperin MY, Nikolskaya AN, Koonin EV. 2001. Novel domains of the
prokaryotic two-component signal transduction systems. FEMS
Microbiol Lett. 203:11–21.
Gilbert W. 1986. Origin of life - the RNA world. Nature 319:618–618.
Gill S, Krupovic M, Desnoues N, Béguin P, Sezonov G, Forterre P. 2014. A
highly divergent archaeo-eukaryotic primase from the Thermococcus
nautilus plasmid, pTN2. Nucleic Acids Res. 42:3707–3719.
Gladyshev EA, Arkhipova IR. 2011. A widespread class of reverse transcriptase-related cellular genes. Proc Natl Acad Sci U S A.
108:20311–20316.
Gorbalenya AE, Pringle FM, Zeddam JL, Luke BT, Cameron CE, Kalmakoff
J, Hanzlik TN, Gordon KH, Ward VK. 2002. The palm subdomainbased active site is internally permuted in viral RNA-dependent RNA
polymerases of an ancient lineage. J Mol Biol. 324:47–62.
Holm L, Rosenström P. 2010. Dali server: conservation mapping in 3D.
Nucleic Acids Res. 38:W545–W549.
Huang RH. 2012. Cas protein Cmr2 full of surprises. Structure
20:389–390.
Humphrey W, Dalke A, Schulten K. 1996. VMD: visual molecular
dynamics. J Mol Graph. 14:33–38.
Huson DH, Scornavacca C. 2012. Dendroscope 3: an interactive
tool for rooted phylogenetic trees and networks. Syst Biol. 61:
1061–1067.
Hyde SJ, Eckenroth BE, Smith BA, Eberley WA, Heintz NH, Jackman JE,
Doublié S. 2010. tRNAHis guanylyltransferase (THG1), a unique 3’-5’
nucleotidyl transferase, shares unexpected structural homology with
canonical 5’-3’ DNA polymerases. Proc Natl Acad Sci U S A.
107:20305–20310.
Ilyina TV, Koonin EV. 1992. Conserved sequence motifs in the initiator
proteins for rolling circle DNA replication encoded by diverse replicons from eubacteria, eucaryotes and archaebacteria. Nucleic Acids
Res. 20:3279–3285.
Ito J, Braithwaite DK. 1991. Compilation and alignment of DNA polymerase sequences. Nucleic Acids Res. 19:4045–4057.
Iyer LM, Koonin EV, Leipe DD, Aravind L. 2005. Origin and evolution of
the archaeo-eukaryotic primase superfamily and related palm-domain proteins: structural insights and new members. Nucleic Acids
Res. 33:3875–3896.
Kabsch W, Sander C. 1983. Dictionary of protein secondary structure:
pattern recognition of hydrogen-bonded and geometrical features.
Biopolymers 22:2577–2637.
Katoh K, Frith MC. 2012. Adding unaligned sequences into an existing
alignment using MAFFT and LAST. Bioinformatics 28:3144–3146.
Koonin EV, Ilyina TV. 1993. Computer-assisted dissection of rolling circle
DNA replication. Biosystems 30:241–268.
Leon PE. 1998. Inhibition of ribozymes by deoxyribonucleotides and the
origin of DNA. J Mol Evol. 47:122–126.
Li W, Godzik A. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics
22:1658–1659.
Ling H, Boudsocq F, Woodgate R, Yang W. 2001. Crystal structure of a Yfamily DNA polymerase in action: a mechanism for error-prone and
lesion-bypass replication. Cell 107:91–102.
Makarova KS, Aravind L, Grishin NV, Rogozin IB, Koonin EV. 2002. A
DNA repair system specific for thermophilic Archaea and bacteria
predicted by genomic context analysis. Nucleic Acids Res.
30:482–496.
Makarova KS, Aravind L, Wolf YI, Koonin EV. 2011. Unification of Cas
protein families and a simple scenario for the origin and evolution of
CRISPR-Cas systems. Biol Direct. 6:38.
Morrison DA. 2015. Is sequence alignment an art or a science? Syst Bot.
40:14–26.
Morrison SD, Roberts SA, Zegeer AM, Montfort WR, Bandarian V. 2008.
A new use for a familiar fold: the X-ray crystal structure of GTPbound GTP cyclohydrolase III from Methanocaldococcus jannaschii
reveals a two metal ion catalytic mechanism. Biochemistry
47:230–242.
Mönttinen HAM, Ravantti JJ, Stuart DI, Poranen MM. 2014. Automated
structural comparisons clarify the phylogeny of the right-handshaped polymerases. Mol Biol Evol. 31:2741–2752.
1709
Mönttinen et al. . doi:10.1093/molbev/msw047
Murzin AG, Brenner SE, Hubbard T, Chothia C. 1995. SCOP: a structural
classification of proteins database for the investigation of sequences
and structures. J Mol Biol. 247:536–540.
Nakamura TM, Morin GB, Chapman KB, Weinrich SL, Andrews
WH, Lingner J, Harley CB, Cech TR. 1997. Telomerase catalytic
subunit homologs from fission yeast and human. Science
277:955–959.
Ohmori H, Friedberg EC, Fuchs RP, Goodman MF, Hanaoka F, Hinkle D,
Kunkel TA, Lawrence CW, Livneh Z, Nohmi T, et al. 2001. The
Y-family of DNA polymerases. Mol Cell. 8:7–8.
Pan J, Vakharia VN, Tao YJ. 2007. The structure of a birnavirus polymerase reveals a distinct active site topology. Proc Natl Acad Sci U S A.
104:7385–7390.
Pei J, Kim BH, Grishin NV. 2008. PROMALS3D: a tool for multiple protein
sequence and structure alignments. Nucleic Acids Res. 36:2295–2300.
Pei JM, Grishin NV. 2001. GGDEF domain is homologous to adenylyl
cyclase. Proteins 42:210–216.
Poch O, Sauvaget I, Delarue M, Tordo N. 1989. Identification of four
conserved motifs among the RNA-dependent polymerase encoding
elements. EMBO J. 8:3867–3874.
Ponting CP, Russell RR. 2002. The natural history of protein domains.
Annu Rev Biophys Biomol Struct. 31:45–71.
Prato S, Vitale Rm, Contursi P, Lipps G, Saviano M, Rossi M, Bartolucci S.
2008. Molecular modeling and functional characterization of the
monomeric primase-polymerase domain from the Sulfolobus
Solfataricus plasmid Pit3. FEBS J. 275:4389–4402.
1710
MBE
Ravantti J, Bamford D, Stuart DI. 2013. Automatic comparison and classification of protein structures. J Struct Biol. 183:47–56.
Steitz TA. 1999. DNA polymerases: structural diversity and common
mechanisms. J Biol Chem. 274:17395–17398.
Stuart DI, Levine M, Muirhead H, Stammers DK. 1979. Crystal-structure
of cat muscle pyruvate-kinase at a resolution of 2.6 Å. J Mol Biol.
134:109–142.
Tesmer JJ, Sunahara RK, Johnson RA, Gosselin G, Gilman AG, Sprang SR.
1999. Two-metal-ion catalysis in adenylyl cyclase. Science
285:756–760.
Tews I, Findeisen F, Sinning I, Schultz A, Schultz JE, Linder JU. 2005. The
structure of a pH-sensing mycobacterial adenylyl cyclase holoenzyme. Science 308:1020–1023.
Touw WG, Baakman C, Black J, te Beek TA, Krieger E, Joosten RP, Vriend
G. 2015. A series of PDB-related databanks for everyday needs.
Nucleic Acids Res. 43:D364–D368.
Woodgate R, Singh M, Kulaeva OI, Frank EG, Levine AS, Koch WH. 1994.
Isolation and characterization of novel plasmid-encoded umuC mutants. J Bacteriol. 176:5011–5021.
Xu J, Zhang Y. 2010. How significant is a protein structure similarity with
TM-score ¼ 0.5? Bioinformatics 26:889–895.
Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on TM-score. Nucleic Acid Res. 33:2302–2309.
Zhu X, Ye K. 2012. Crystal structure of Cmr2 suggests a nucleotide
cyclase-related enzyme in type III CRISPR-Cas systems. FEBS Lett.
586:939–945.
Related documents