PROTEINS: Structure, Function, and Genetics 42:148–163 (2001)

Conserved Key Amino Acid Positions (CKAAPs) Derived From the Analysis of Common Substructures in Proteins

Boojala V.B. Reddy,1 Wilfred W. Li,1 Ilya N. Shindyalov,1 and Philip E. Bourne1,2,3*
1 San Diego Supercomputer Center, University of California, San Diego, La Jolla, California
2 Department of Pharmacology, University of California, San Diego, La Jolla, California
3 The Burnham Institute, La Jolla, California

ABSTRACT
An all-against-all protein structure comparison using the Combinatorial Extension (CE) algorithm applied to a representative set of PDB structures revealed a gallery of common substructures in proteins (http://cl.sdsc.edu/ce.html). These substructures represent commonly identified folds, domains, or components thereof. Most of the subsequences forming these similar substructures have no significant sequence similarity. We present a method to identify conserved amino acid positions and residue-dependent property clusters within these subsequences starting with structure alignments. Each of the subsequences is aligned to its homologues in SWALL, a nonredundant protein sequence database. The most similar sequences are purged into a common frequency matrix, and weighted homologues of each one of the subsequences are used in scoring for conserved key amino acid positions (CKAAPs). We have set the top 20% of the high-scoring positions in each substructure to be CKAAPs. It is hypothesized that CKAAPs may be responsible for the common folding patterns in either a local or global view of the protein-folding pathway. Where a significant number of structures exist, CKAAPs have also been identified in structure alignments of complete polypeptide chains from the same protein family or superfamily. Evidence to support the presence of CKAAPs comes from other computational approaches and experimental studies of mutation and protein-folding experiments, notably the Paracelsus challenge.
Finally, the structural environment of CKAAPs versus non-CKAAPs is examined for solvent accessibility, hydrogen bonding, and secondary structure. The identification of CKAAPs has important implications for protein engineering, fold recognition, modeling, and structure prediction studies and is dependent on the availability of structures and an accurate structure alignment methodology. Proteins 2001;42:148–163. © 2000 Wiley-Liss, Inc.

Key words: protein structure comparison; sequence homology; conserved key; amino acid positions; protein folding; protein structure prediction; protein engineering

INTRODUCTION
It was observed long ago that the three-dimensional structural constraints and functional selection of proteins in nature lead to the retention of significant sequence homology between proteins of similar fold and function. This observation has been the basis for the successful use of comparative (homology) modeling procedures in which structures of homologues are used to model a query sequence [1,2]. Such modeled structures are more reliable as the homology between the sequence of the template structure and the target sequence increases over 40% [3,4]. Conversely, as sequence and structural data have increased rapidly, we observe proteins with significant similarity in their structure and possibly function but no measurable similarity from their sequences alone. Many families of protein structures, classified by CATH [5], SCOP [6], or HOMSTRAD [7], contain one or more member structures that have no significant sequence similarity (<25% sequence identity) but have a similar overall structure or at least a fold belonging to the corresponding family or superfamily.
These observations have driven the attention of scientists interested in sequence analysis to strive for new methods that could help identify remotely related proteins from the sequence information alone, for the ratio of available sequence to structure information will remain large [8,9]. It is also a challenging task for scientists interested in protein structure analysis to explain the rationale for similar folds from apparently dissimilar amino acid sequences. Such explanations will provide new insights into protein folding and protein engineering.

Grant sponsor: National Biomedical Computation Resource; Grant number: NIH P41 RR08605-06; Grant sponsor: National Science Foundation; Grant number: DBI 9808706.
*Correspondence to: Philip E. Bourne, San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0505. E-mail: [email protected]
Received 13 March 2000; Accepted 7 September 2000
Published online 00 Month 2000

There are two general models that attempt to explain the overall three-dimensional conformation of a protein from its amino acid sequence [10,11]: (i) a centralized (local) model, in which fold specificity is coded in just a few critical residues (10–20%) of the sequence and (ii) a distributed (global) model, in which the fold is formed by interactions involving the entire sequence. The global model is supported by a number of mutation studies in which most single-residue mutations were found to have no measurable effect on protein function and presumably structure [12,13]. This view is also supported by structural studies. Russell and Barton [14] report that for many proteins with similar three-dimensional structures, the proportion of complementary changes is near to that expected by chance, suggesting that many similar proteins with similar three-dimensional structures have fundamentally different stabilizing interactions.
Furthermore, an analysis of the most commonly occurring immunoglobulin core sequences indicated that the high degree of structural flexibility outside the common core and the variability of side-chain packing inside the core did not support the notion of a common protein-folding pathway [15]. Likewise, Wood and Pearson [10] argue that statistically defined Z-values for sequence similarity and structural similarity are related linearly at all levels of sequence identity, supporting the idea of a global folding pathway.

Conversely, a folding pathway governed by local interactions is also supported by many convincing examples in the literature, for example, the mutations leading to protein misfolding that are associated with certain diseases [16]. Studies of sequence versus structural similarity [17–20] show most protein pairs with >30% identical residues have similar structures. However, many proteins have a similar fold with sequence identity <25%. It could therefore be argued that only a small subset of residues are important to define the fold. Hence, a number of investigators have recently concentrated on the idea of a minimal set of conserved key residues in proteins serving as nucleation centers in the folding pathway and in stabilization of the protein structure [21–23]. The Gaussian network model [24] identified residues with the highest frequency fluctuations near the native state as kinetically hot residues. It was shown that these residues are correlated with the most conserved amino acids in proteins. The formation of the transition state in folding kinetics is observed to be due to a sufficient minimal number of specific contacts as observed in the native structures [21,25]. Such a minimal number of contact residues are said to be residues of the fold nucleus that are position-specific and conserved throughout the family and superfamily of protein structures.
Mirny et al. [26] showed that a minimal set of key residues is required for faster folding of proteins on a physiological timescale. Rost [27] concurs by proposing that only 3–4% of residues are "anchor residues," which are implicated to be significant during evolution in relating proteins of dissimilar functions with similar folds. Dosztányi et al. [28] reported a minimal set of conserved residues as stabilization centers in protein structures. Lichtarge et al. [29] presented an evolutionary trace method to identify functionally important residues through analysis of closely related sequences at various levels of sequence identity. Recently Michnick and Shakhnovich [30] presented a method to predict conserved residues for a given protein structure through simulated sequences and implicated such residues in guiding and determining the folding path. Another report from the same group [31] further discussed the universally conserved positions in the five most commonly occurring protein folds, including the immunoglobulin (Ig) fold. In the case of the most commonly occurring immunoglobulin fold, Clarke et al. [32] showed that the structurally more conserved amino acids among the proteins of similar folds from superfamilies are indeed used for guiding the folding process.

The work presented here does not attempt to favor either a local or global view of folding. Rather it attempts to recognize residues whose property conservation may be significant. The percentage of those residues that are indeed significant is unknown. We have arbitrarily chosen a 20% cutoff in the residues reported. It may be that a far greater percentage is needed in some cases, supporting a global view of folding, and less in others, supporting a local view.
The viewpoint of this article is simply one of using the increasing amount of structural data possessing the same fold, aligning those structures and interpreting the sequence alignments associated with those structure alignments in an effort to recognize a common fingerprint. Consider this viewpoint with reference to the Paracelsus Challenge [33]. It has been shown by three independent groups [34–36] that by mutating <50% of the residues, an all-α protein structure can be converted to a very different α-β fold and vice versa. Key questions then become: what is the minimum number of residues that need to be mutated to effect the change to a stable protein of a different conformation? If more residues are mutated, can a more stable protein be achieved? Answers to these questions are going to depend very much on the size and characteristics of the starting protein. The presence of CKAAPs, for which the significance is unknown, and the unknown outcome on stability of making a specific mutation, do not answer these questions. However, CKAAPs may provide some insight into which residues to mutate and which to leave unchanged in addressing the Challenge.

With this background, consider how CKAAPs are derived. An all-against-all protein structure comparison study using the Combinatorial Extension (CE) algorithm developed in our laboratory [37] revealed 150 clusters of common substructures in proteins [38]. Approximately 40% of all structures found in the PDB contain one or more of these substructures. Substructures are formed by near-continuous subsequences of >60 amino acids in length. Each of the substructures is formed by five or more dissimilar subsequences and hence presents an excellent data source for inferring sequence-structure relationships. Here we present a strategy to identify residue positions conserved, at least for property, among these naturally occurring structural homologues with dissimilar sequences.
CKAAPs are determined from three-dimensional structure alignments by a combined normalized occurrence score based on absolute amino acid conservation combined with property-based conservation. Structure alignments for each substructure data set are expanded by including sequences for which no structures are available and which have >40% sequence identity to a substructure. We propose that CKAAPs provide insights into function when a clear evolutionary relationship exists between the sequences being compared, and insights into which residues are most important in defining a particular protein fold. Evidence for these insights comes from parallel computational studies and experimental evidence from the literature. Here we highlight evidence from the extracellular matrix protein tenascin-C (TN), the calcium-binding regulatory protein Troponin C (TnC), and others. Finally, we present an analysis of the structural environment of CKAAPs versus all other residue positions to ascertain whether they represent a unique fingerprint and discuss their usefulness in sequence-based protein structure prediction, fold recognition, de novo design and modeling, and engineering of protein structures to achieve a desired function.

Fig. 1. Sequence space scanning procedure to identify CKAAPs using structural homologues identified by CE.

MATERIALS AND METHODS
Previously, an all-against-all three-dimensional protein structure comparison was performed by using the CE algorithm [37] to identify representative protein structures [38] in the Protein Data Bank (PDB). Each of the representative structures is grouped with its represented structures based on the following criteria: a Root Mean Square Deviation (RMSD) cutoff of 4.0 Å among superposed Cα positions of aligned amino acid residues; a length difference between two sequences of <10%; and a Z-score [37] > 4.0.
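The three grouping criteria can be written as a single predicate. The following is a minimal sketch rather than the CE implementation: the `Alignment` record and the `is_represented` function are hypothetical names, the RMSD cutoff is assumed to be inclusive, and the length difference is assumed to be measured relative to the longer of the two chains.

```python
from dataclasses import dataclass

# Thresholds quoted in the text. How they are applied (inclusive versus
# strict comparison, reference length) is an assumption of this sketch.
RMSD_CUTOFF = 4.0        # Angstrom, over superposed C-alpha positions
MAX_LENGTH_DIFF = 0.10   # fraction of the longer chain (assumed)
MIN_Z_SCORE = 4.0        # CE Z-score


@dataclass
class Alignment:
    """Hypothetical summary of one pairwise structure alignment."""
    rmsd: float
    len_a: int
    len_b: int
    z_score: float


def is_represented(aln: Alignment) -> bool:
    """Return True if the pair satisfies all three grouping criteria."""
    longer = max(aln.len_a, aln.len_b)
    length_diff = abs(aln.len_a - aln.len_b) / longer
    return (
        aln.rmsd <= RMSD_CUTOFF
        and length_diff < MAX_LENGTH_DIFF
        and aln.z_score > MIN_Z_SCORE
    )
```

Because none of the three quantities depends on residue identity, a filter of this form cannot be biased by sequence similarity.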
Based on these collective criteria, which in no way are biased by sequence similarity, there are 2,016 representative polypeptide chains in the PDB (release 1998). Subsequent comparison of this structurally nonredundant set of complete polypeptide chains revealed a set of 75 recurring substructures [38]. These substructures make no account of protein domains, yet in virtually all instances fall within domains defined by CATH [5]. From the original CE alignment for each substructure, we then excluded substructures whose subsequences had >25% sequence identity between one another. Using these alignments and the naturally occurring sequences in SWALL (a nonredundant sequence database derived from SwissProt + Trembl + TremblNew) at the European Bioinformatics Institute (EBI), we developed a procedure to identify CKAAPs. This procedure is described below.

Sequence-Based Analysis of Common Substructures
The analysis for a given common substructure consists of the following steps, with the initial steps depicted in Figure 1.

1. Each of the subsequences $S_i$ of a given substructure $i$ is submitted to a FASTA3 search against the SWALL nonredundant sequence database using a Blosum50 weight matrix, a gap = −12, and a gap-extension = −2 to obtain homologous subsequences $S_{ij}$.

2. The set of subsequences, $S_{ij}$, with >40% sequence identity is sorted, and 100% identical subsequences are removed by keeping only one such subsequence. Subsequences with >90% sequence identity, $S''_{ik}$, are grouped and separate position-specific frequency tables, ${}^{ik}f''_{lm}$, are calculated for each such group, $k$. Here $l$ is the alignment position of the subsequence and $m$ is the type of amino acid. Each group $k$ is given the weight of one sequence equivalent, so that close homologues are not overweighted. The remaining subsequences having a sequence identity between 40% and 90% are grouped together into $S'_{ij}$, and the residue occurrence value, ${}^{i}N'_{lm}$, is computed.

3. The frequency of occurrence (${}^{i}f_{lm}$) of amino acids is computed for each position $l$ in each substructure $i$ by converting the residue occurrence into a frequency and adding the group-specific frequency values as depicted in Eq. 1.

   $${}^{i}f_{lm} = \frac{{}^{i}N'_{lm} + \sum_{k} {}^{ik}f''_{lm}}{n' + n''} \qquad (1)$$

   where $n'$ is the subsequence count in $S'_{ij}$ and $n''$ is the number of $k$ groups.

4. A weighted occurrence score, given as a position-specific amino acid matrix, $N_{lm}$, is computed for all subsequences, $S_i$, of a given substructure (Eq. 2).

   $$N_{lm} = \frac{\sum_{i} \left({}^{i}f_{lm} \times {}^{i}w\right)}{n} \times 1000 \qquad (2)$$

   where ${}^{i}w = 0.2$ if the $S'_{ij}$ count is $\le 5$ and ${}^{i}w = 1.0$ otherwise, and $n$ is the total number of subsequences in $S_i$.

5. An average occurrence value for each amino acid ($N_m$) is obtained and the sum of RMSDs of amino acid occurrence ($R_a$) at every sequence position is computed as given below.

   $$R_a = \sqrt{\sum_{m} \left(N_{lm} - N_m\right)^2 / 20} \qquad (3)$$

6. The amino acids are divided into 12 property groups as classified by Taylor [39] and Zvelebil et al. [40]. Most amino acids are included in more than one group because they satisfy the property pertaining to multiple groups. An average value of occurrence of residues belonging to each property group (Table I) is calculated and the RMSD for the 12 property group occurrences, $R_g$, is computed. The first 20% of residue positions which have the highest ($R_a + R_g$) values are marked as conserved key amino acid positions (CKAAPs).

7. The weighted log odds values ($H_{lm}$) for each amino acid at every position are calculated (Eq. 4).

   $$H_{lm} = \left(\left|N_{lm} - N_l\right| / 10\right) \log\left(N_{lm} / N_m\right) \qquad (4)$$

TABLE I. Property Groups of Amino Acids Used to Identify CKAAPs [39,40]

S. No   Property group          Amino acids
1       Hydrophobic group 1     AVLIFYWTCMHKG
2       Hydrophobic group 2     AVLIFYW
3       Polar group             YWSTNQDEHKR
4       Small residue group     AVSTCNDGP
5       Tiny                    AGS
6       Aliphatic               ILV
7       Aromatic                FYWH
8       Positive (basic)        HKR
9       Negative (acidic)       ED
10      Charged                 HKRED
11      Conformational          PG
12      Neutral polar           STCMNQ

Identification of CKAAPs for Complete Proteins
The procedure has been extended to identify CKAAPs for complete polypeptide chains, rather than substructures. This is possible in cases in which several structures exist for proteins in the same family and their structures can be aligned. Two such families, the chymotrypsin inhibitor and the ubiquitin family of proteins, are discussed in this study. On occasion the Z-score, RMSD, and the sequence identity thresholds have been adjusted to provide a statistically significant number of structural homologues (5 or more members). If required, the Z-score was lowered from 4.0 to 3.7, and the sequence identity threshold was increased from 25% to 35% or the RMSD between Cα coordinates was raised from 4.0 to 5.0 Å. We then follow steps 2–7 outlined above to identify CKAAPs in a given protein family. Undoubtedly, these adjustments affect the significance of CKAAPs, but as shown subsequently, empirically the determinations still appear useful.

Stereo Diagram Generation and Molecular Visualization
Stereo diagrams were generated by using WebLab Pro from Molecular Simulation Inc. (MSI, San Diego, CA). When the software rendering the secondary structures of molecules did not agree with those in the CATH classification (http://www.biochem.ucl.ac.uk/bsm/cath/), the latter was applied where indicated. Further molecular visualization was conducted by using Insight II 98.0 also from MSI.

Analysis of the Structural Environment of CKAAPs
All the CKAAPs in the gallery of common substructures [38] were analyzed to compare their structural environment to amino acids in a random data set. Specifically, a data set of 306 nonhomologous (≤25% sequence identity), best-resolved (≤2.0 Å resolution) X-ray structures [41] was used to calculate the propensities of residues for a given structural environment, and these were compared to the environment of amino acids in CKAAPs. The parameters used to define the structural environment of amino acid residues were secondary structure type, packing density, hydrogen bonding, and solvent accessibility. The values were computed by using the methods described below.

Secondary structure
The secondary structure definition of Kabsch and Sander [42], as summarized by Smith [43] in his SSTRUC program, was used to define the secondary structure type taken by the residue in the wild type proteins, and classified as either helices, strands, or random coils.

Packing density (Ooi number)
A contact number with other residues within an 8 Å and a 14 Å radius was computed by using the method of Nishikawa and Ooi [44]. Because the longest distance from Cα(i) to Cα(i+1) is approximately 4 Å, the nearest neighbor residues on either side of the dipeptide were omitted when counting. Ooi numbers calculated at both an 8 Å and a 14 Å radius were combined and used as a single structure environment parameter.

Hydrogen bond formation
Hydrogen bond formation was defined by using a donor-acceptor distance of ≤3.5 Å [45]. Angular criteria were not considered because side-chain atoms are not equally well positioned by crystallography and not all hydrogen atom positions are fixed by the positions of the heavier atoms. Hydrogen bonding was examined from a side chain at position i to the residues other than those at positions i−1, i, and i+1. The average number of hydrogen bonds (dipole interactions) that could be formed by the residue in a given protein structure was computed.

Solvent accessibility
The solvent accessible contact area of amino acids was calculated by using the method of Lee and Richards [46], as coded by Sali and Blundell [47] in their PSA program, with a probe radius of 1.4 Å.
The percentage of accessible contact area of the residue side chain, main chain, polar side chain, non-polar side chain, and total atoms were used.

RESULTS AND DISCUSSION
The classification of protein structures has received considerable attention over the years [5,6,48], particularly as more structures have been experimentally determined. Consider a bottom-up view. Amino acids show different propensities to adopt a particular conformation based on their side chain properties and the influence from near neighbor residue interactions. The most predominant secondary structures are α-helices and β-sheets connected through different types of turns, bends, and loops. Different classes of supersecondary structures are formed on the basis of the properties of residues on the surface of these predominantly occurring rigid secondary structural elements. Compact substructures in proteins evolve as a result of the hierarchical packing of secondary structural elements (SSEs) [49,50], which fold into energetically stable structural domains. Many related proteins have similar combinations of SSEs, domains, or substructures as a result of geometrically and energetically favorable packing architectures [51–53]. Here we refer to a substructure as a part of a domain classified in the CATH database [5]. Acceptability of mutation(s) at each position in the natural sequence depends on the local environment of amino acid(s) [54] and the kind of interactions each amino acid undergoes in the folding process. Most of the single residue mutations (substitution, insertion, and deletion) bring about a small conformational change in the near vicinity of the residue, effecting a drift in atomic positions within an 8–10 Å radius [55–57]. However, some residues may effect a greater change than others, to the point where a single mutation would bring about a drastic change in the folding process.
The resulting protein structure might be unstable and prone to premature degradation [58]. Such substitution mutations may also lead to the evolution of new folds in proteins [59,60]. The sequences of naturally occurring structural homologs for each such fold have undergone many evolutionary changes. However, the need to maintain structural integrity to permit biological function necessitates the absolute conservation of residues or at least the conservation of property. The key question then becomes whether we can identify the relative importance of these residues.

Conserved Key Amino Acid Positions (CKAAPs) in Common Substructures
The CE algorithm developed in our laboratory [37] allowed us to provide a different view of protein fold space [38]. Among the 2,016 nonredundant representative polypeptide chains identified, the commonly occurring substructures are formed by mostly dissimilar sequences and represented in a gallery of substructures (http://cl.sdsc.edu/ce.html). Each substructure is formed by a significant number of apparently functionally unrelated proteins. The availability of these structural alignments (manual validation of the structure alignments is critical) makes it possible to examine the conservation of amino acids ($R_a$) and amino acid properties ($R_g$), which may be playing a key role in substructure formation. The sum value ($R_a + R_g$), calculated from amino acids and property group occurrence scores, provides an indication as to whether particular positions are preferred by either a few specific amino acids, or property group, or both. Arbitrarily, the first 20% of residues with the highest sum values are referred to as CKAAPs and considered to have the most significant contribution to the three-dimensional structure. CKAAPs have automatically been determined for all the common substructures in the gallery as defined in the methods section.
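The selection rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the input is taken to be a weighted occurrence matrix with one row per alignment position and one column per amino acid, the occurrence of a property group is assumed to be the mean occurrence of its member amino acids, and 20% of the positions is rounded down.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The 12 property groups used in the methods (Taylor; Zvelebil et al.).
PROPERTY_GROUPS = {
    "hydrophobic 1": "AVLIFYWTCMHKG",
    "hydrophobic 2": "AVLIFYW",
    "polar": "YWSTNQDEHKR",
    "small": "AVSTCNDGP",
    "tiny": "AGS",
    "aliphatic": "ILV",
    "aromatic": "FYWH",
    "positive": "HKR",
    "negative": "ED",
    "charged": "HKRED",
    "conformational": "PG",
    "neutral polar": "STCMNQ",
}


def rmsd_per_row(rows):
    """RMSD of each row from the column means (divisor = number of columns)."""
    n_cols = len(rows[0])
    col_mean = [sum(r[c] for r in rows) / len(rows) for c in range(n_cols)]
    return [
        math.sqrt(sum((r[c] - col_mean[c]) ** 2 for c in range(n_cols)) / n_cols)
        for r in rows
    ]


def group_occurrence(n_matrix):
    """Assumed reading: a group's occurrence is the mean over its members."""
    idx = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    return [
        [sum(row[idx[aa]] for aa in members) / len(members)
         for members in PROPERTY_GROUPS.values()]
        for row in n_matrix
    ]


def select_ckaaps(n_matrix, fraction=0.20):
    """Rank positions by R_a + R_g and keep the top fraction (rounded down)."""
    r_a = rmsd_per_row(n_matrix)                    # 20 amino acid columns
    r_g = rmsd_per_row(group_occurrence(n_matrix))  # 12 property group columns
    score = [a + g for a, g in zip(r_a, r_g)]
    n_keep = max(1, int(fraction * len(score)))
    ranked = sorted(range(len(score)), key=lambda l: score[l], reverse=True)
    return sorted(ranked[:n_keep]), score
```

A position occupied by a single amino acid type scores highly on both terms, whereas a position that varies only within one property group is rewarded mainly through the group term. This is how the sum can flag positions conserved for property rather than for residue identity.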
CKAAPs for all the common substructures are available via the World Wide Web at http://ckaaps.sdsc.edu. Figure 2 shows stereo diagrams of four representative substructures for all α, all β, and α-β-α type proteins, with their associated CKAAPs highlighted. DC02 (DCxx denotes the substructure, where a lower value of xx indicates a more frequently occurring substructure in the PDB) is found in the nitrate/nitrite response regulator; DC07, in the mannose permease; and DC30, in the vascular cell adhesion molecule. The molecular chaperone DnaK is a heat-shock protein family member, important in protein folding, interaction, and translocation. DC57 represents an α-helical segment in DnaK that is involved in DnaK-substrate complex stabilization. It comes as no surprise, based on many studies of the hydrophobic cores of proteins, that most CKAAPs are found in the hydrophobic core of the molecule. However, a number of CKAAPs are found in loop regions, exposed to solvent and available for interaction with solvent and/or other cellular components. The remainder of this article is devoted to a detailed analysis of the CKAAPs in a few substructures, for example, the Ig fold and the EF-hand motif containing Tn-C, together with an analysis of the overall environment for the most predominant CKAAPs versus other residues.

The immunoglobulin fold (DC01)
The most common substructure (DC01) is the immunoglobulin (Ig) fold, with 105 aligned members with no discernible sequence similarity. Sixteen such substructures are superimposed in Figure 3 to represent the level of structural diversity and conservation within the Ig superfamily. The Ig superfamily includes immunoglobulins, cell adhesion molecules, extracellular matrix proteins, bacterial and viral proteins, and the NF-κB p65 subunit and PD-1, which are involved in gene regulation and cell death [61]. The identities of a subset of DC01 comprised of sequences with <25% sequence identity are shown in Table II.
The many functions represented by these proteins attest to the diversity present in this superfamily of proteins, yet the Ig fold is highly conserved (Fig. 3). The Ig fold is the most commonly occurring β-sandwich and is defined by six core strands in two sheets (A, B, E, sheet I; G, F, C, sheet II). The Ig fold family (IgFF) is further classified into a number of different subtypes based on sequence or structural conservation [62]. The extracellular matrix protein tenascin-C (TNfn3) consists of 15 repeats of the fibronectin type 3 (Fn3) subtype. The crystal structure of the third Fn3 repeat (1TEN) is used as a reference structure for the substructures defined as DC01 [38]. In this study, we identified 16 CKAAPs (see Fig. 4a for the 3-D structure, Table II for aligned subsequences and key positions, and Table III for the log odds matrix). Most of the key residues contribute to the stabilization of the β-sandwich via hydrophobic side chains packed between the sheets.

Fig. 2. Stereo diagrams of common substructures with C-α of CKAAPs highlighted as solid spheres and associated side chains labeled and rendered as sticks. Details of CKAAPs for all substructures with their associated DCxx identifiers are available via the Web at http://ckaaps.sdsc.edu. a: 1RNL, a nitrate/nitrite response regulator. This three-layer α-β-α sandwich fold is observed in 43 dissimilar subsequences (DC02). b: 1PDO, a mannose permease. This α-β-α sandwich is observed in 30 dissimilar subsequences (DC07). c: 1VSC:A, a vascular cell adhesion molecule. This β sandwich is found in 16 dissimilar subsequences (DC30) and includes a disulphide bridge between residues C71 and C23. d: 1DKX:A, a substrate binding domain of DnaK. This mainly α substructure is found in 11 dissimilar subsequences (DC57).

The tendency for conserved residues to occur in secondary structures, α-helices or β-strands, has been observed in many studies; however, the conservation of residues or amino acid property group in the turn and loop region is less documented. The loop regions have in general been observed to be less conserved and serve as ligands or ligand binding sites. In TNfn3, the RGD and IDG tripeptide binding sites for integrin are located on the F-G and B-C loop, respectively. In our study, we did not identify any positions as CKAAPs on those two loops, suggesting that these positions are not key across the complete superfamily. Rather, we identified conserved positions on the E-F loop (Leu863) and A-B loop (Thr817). Closer examination of these positions in the loop region shows that Leu863 and Thr817 have a hydrogen bond interaction that may be very important for the turn (Fig. 4a). Together with the hydrogen bond interactions between Ala819 and Ile860 (both residues identified as CKAAPs by our work and by Halaby and Mornon [61]) providing stability to the β-sheet, the Leu863-Thr817 hydrogen bond interaction could further contribute to the conformational stability of this region. In Table III, the position marked as j (Thr817) has several other possible amino acids, Gly with a score of 31, Asp 5, Asn 4, and Thr 0. These residues are found in the small residue group (see Table III). Thus, across substructures, Gly has the highest chance of being observed, suggesting the need for a small residue suitable for a tight turn.

There are two recent reports related to the work presented here [31,61]. These articles describe conserved residues in commonly occurring folds in proteins using different methods from that reported here. The report by Halaby et al. [62] discussed only the conserved residues in the Ig fold formed by functionally varied protein sequences. Mirny and Shakhnovich [31] discussed the five most commonly occurring folds, including the Ig fold. They identify key residues in the fold nucleus by correlating the low-residue entropy in the homologous sequence space with solvent exposure.
That is, residue positions with low entropy and more buried in the structure of all families are said to be universally conserved residues important to the fold nucleus. Results presented here are compared with the results from these approaches. The same core hydrophobic residues identified by Mirny and Shakhnovich [31] (Ala819, Ile821, Trp823, Leu835, Val871, and Leu873) are all identified as CKAAPs by this study. The topohydrophobic residues identified by Halaby et al. [62] are also identified (Ile821, Trp823, Leu835, Tyr837, Ile860, Tyr869, Val871, and Leu873). Residues underlined were identified by all three studies.

Fig. 3. a: Stereo view of dissimilar subsequences that form the highly conserved immunoglobulin (Ig) fold (DC01). Each subsequence has a unique color. b: Rotated 90° about the vertical axis. The superimposed substructure is approximately 86 amino acids in length. Shown are 16 representatives with an RMSD between C-α atoms of ≤3.0 Å and a Z-score > 4.0. Sequence identity between any two sequences is <25%. The PDB ID:chain ID for each substructure is 1TEN:_, 1AHW:C, 3HHR:C:1, 1ILL:R(193), 1JRH:I, 1ITE:B, 2HFT:_, 1ITE:C, 1EBP:A, 1BP3:B:1, 1BJ8:_, 1CFB:_, 1TTF:_, 1DAN:T, 1B4R:_ and 1CTO:_.

The Troponin C EF-hand calcium-binding substructure
In the case of DC01 (TNfn3), we observed an apparent separation of structural conservation and functional conservation, where residues known to be important in one family did not show conservation across the superfamily. DC05, on the other hand, reveals a case of conservation of structural as well as functional residues (Fig. 4b). In this substructure alignment, the chicken troponin C (TnC) (Glu41→Ala) mutant (1SMG) [63] is used as the reference structure.
Troponin C and troponin I (TnI) are very important in skeletal muscle contraction and relaxation regulated by calcium.63 TnC has two low-affinity regulatory Ca2+ binding sites located in the N-terminal domain, each consisting of an EF-hand motif.63–65 Calcium binding to the EF-hand induces a conformational change in TnC, exposing the core hydrophobic residues, which form the proposed TnI binding site.66 In a study by Strynadka et al.67 of turkey TnC (5TNC), the EF-hand residues important for calcium binding and conformational change are examined via hydrogen bond interactions. We have identified 18 CKAAPs in TnC, including residues found both in our study and by Strynadka et al.67 Asp30, Asp32, and Glu41 (Ala in the mutant) belong to the first EF-hand, of which Gly34, Asp36, and Ser38 are not identified, but Phe26, Gly35, Ile37, and Leu42 are identified as part of the EF-hand. Asp66, Asp68, Ser70, Asp74, and Glu77 belong to the second EF-hand, of which Thr72 is not identified, but Ile73, Phe75, and Phe78 are. Clearly, Phe26, Ile37, Leu42, Ile73, Phe75, and Phe78 are involved in the formation of the hydrophobic core and may not directly participate in calcium binding. Therefore, in this case, our method identified not only the residues contributing to hydrophobic core formation but also the functionally conserved residues needed for calcium binding.

TABLE II. Subsequences With <25% Sequence Identity That Represent the DC01 Substructure†

CKAAPs -----r-------m-kjc-l--o--------b-h--------------------n-a--e---p-f-i-d-----------q---g-
1TEN:_ DAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLTEDENQYSIGNLKPDTEYEVSLISRRGDMSSNPAKETFT
1WIT:_ KILTASRKIKIKAGFTHNLEVDFIGADPTATWTVG----DSGAALADAKSSTTSIFFPSAKRADSGNYKLKVKNELGED-EAIFEVI
2VAA:B KTPQIQVYSRHPKPNILNCYVTQFHPPHIEIQMLKNG----KKIPEMSDMSYILAHTEFTPTETDTYACRVKH---DSMEPKTVYWD
1WHP:_ ---VTFTVEGSNEKHLAVLVKYEGDT--MAEVELREHGSD-EWVAMTKG-EGGVWTFDSEEPLQGPFNFRFLTEK-GMKNVFDDV--
1IGJ:A VMTQTPLSLPVSLGDQASISCRSSQSLVYLNWYLQKAGQS-PKLLIYKVGTDFTLKISRVEAEDLGIYFCSQTTHVPPTFGGGTKLE
1SOX:A PVQS-AVTQPRVPELTVKGYAWSGGGREVVRVDVSLDGGR-TWKVARLMGDALWELTVPVEATELEIVCKAVDS--SYNVQAWHRVR
1IGY:B ---LQESGAELARPSVKMSCKASGYTTYTIHWIKQRPGQGL-EWIGYINPSTANIHLSSLTSDDSAVYYCVRE--GEVPYWGQGTTV
1AIF:H ---KLQESGGGLVSMKLSCVASGFTFNNYWMSWVRQSPKGLEWVAEIRLDDSRLYLQMNSLRATGIYYCVLRPLFYYVDYWGQGTSV
1BHG:A YIDDITVTTSVEQSGLVNYQISVKGNLFKLEVRLLDAE--NKVVANGTG--TQGQLKVPGVSLYLYSLEVQLTAQPVSDFYTLPVGI
1VSC:A FKIETTPRYLAQIGSVSLTC-STTGCESPFFSWRT--QIDSPLNKVTNEGTTSTLTMNPVSFGNEHSYLCTATC-ESRKLEKGIQVE
1SEB:E VPPEVTVLTNSPPNVLICFIDKF--TPPVVNVTWLRN--GKPVTVSETVFLFRKFHYLPFLPSDVYDCRVEHW---GLDEPLLKHWE
1EUT:_ ICAP-FTIPDVALVTVPVAVTNQSGIVPKPSLQLD-ASPDWQVQGSVEPLMQAKGQVTITVPGRYRVGATLRT---SAGNASTTFTV
1WIQ:B VFGLTANSDHLLQGQSLTLTLESPPGSSPSVQCRSPR----GKNIQGGK----TLSVSQLELQDSGTWTCTVLQNQKKV-EFKIDIV
1HWG:B DPPIALNWTLLNHADIQVRWEAPRNMVLEYELQYKEVNETKWKMMDPIL--TTSVPVYSLKVDKEYEVRVRSKQSGNYGEFSVLYVT
1ILM:G WAPENLTLHKLSESQLELNWNNRFLLEHLVQYRTDWDHS-WTEQSVDYR---HKFSLPSVDGQKRYTFRVRSRFAQHWSEWSPIHWG
1AKJ:D -SQFRVSPRTWNLGTVELKCQVLLSNPSGCSWLFQPRGAAASPTFLLYLSDTFVLTLSDFRRENEGYYFCSALSNSIMYFSHFVPVF
2HFT:_ VAAYNLTWKSTN-FKTILEWEPKPVNQ-VYTVQISTKSGD-WKSKCFYTT-DTECDLTDEDVKQTYLARVFSYPXXXEPLYENSEFT
4KBP:B APQQVHITQGDLVGRAMIISWVTMDEPGSSAVRYWSEKNG-RKRIAKGKMSIHHTTIRKLKYNTKYYYEVGLR----NTTRRFSFIT
1IGE:A ---VSAYLSRPSPPTITCLVVDLAPSKGTVNLTWSRASG-KPVNTRKEEKQTVTSTLPVGTRGETYQCRVTH-----PHRALMRSTT
1BGM:L QISDFHVATRFFSRAVLEAEVQMCGEYLRVTVSLWQG--ETQVASGTAPFGGVTLRLNVENPKLLYRAVVELHTDGTLIEAEACDVG
1CID:_ ------ITAYKSEGSAEFSFPLNLGEESLQGELRWKAEPSSQSWITFSLKLPLTLQIPQVSLQFAGSGNLTLTLD-RGILYQEVNLV
2ISD:A ------------LRVRIISGQQLPNSIVDPKVIVEIHGVGTGSRQTAVITNPRWDMEFEFEVTALVRFMVEDYDSSSNDFIGQSTIP
1AO2:N PMVERQDTDSCLVYGGQQMILTGQNFTESKVVFTEKTTDQQIWEMEATVDKLFVEIPEKHIRTPVKVNFYVIN-GKRKRSQPQHFTY
1BEC:_ AVTQSPRNKVAVTGGKVTLSCQQTNNHNNMYWYRQ-DTGHGLRLIHYSYQEQFSLILELATPSQTSVYFCASGGGAEQFFGPGTRLT
1AH1:_ ---VAQPAVVLASGIASFVCEYASPGKEVRVTVLRQADSQVTEVCAATYMMQVNLTIQGLRAMTGLYICKVELMYPYYLGIGNGTQI
1DAN:U GQPTIQSFE-QVGTKVNVTVEDERTFDLIYTLYYWXXXXSG-KKTAKTN--TNEFLID-VDKGENYCFSVQAVIPNRKSTDSVECM-
2RAM:A LKICRVNRNSGSCLGDEIFLLCDKVQKEDIEVYFTG---PGWEARGSFSQADVFRTPPPSLQAPVRVSMQLRRPSDRELSEPMEFQY
2PCY:_ SLAFVPSEFSISPGEKIVFKNNA---GFPHNIVFDSIPSGVDASSMLLNAKGETFEVA--LSNKGEYSFYCSP----HAGMVGKVTV
1GOG:_ -PKITRTSTQSVKVGGRITISTDS---SISKASLIRYGDQRRIPLTLTNNGSYSFQVPSDSLPGYWMLFVMNS---AGVPSVASTIR
1EBP:A DAPVGLVARLAG--HVVLRWLPPPETHIRYEVDVSAGNGAGSVQRVEILEGRTECVLSNLRGRTRYTFAVRARMGGFWSAWSPVSLL
1CTO:_ ---PMLQALDIGPGCLWLSWKPWKYMEQECELRYQPQLKGANWTLVFHLPSSKQFELCGLHQAPVYTLQMRCIRPGFWSPWPGLQLR
1NCG:_ -----IPPINLPENSELVRIRSGRDLSLRYSVTGPGA-DQPPTGIINPI--SGQLSVTKPLDRARFHLRAHAVDINGNQNPIDIVIN
1FNF:_ PPPTDLRFTNIGPDTMRVTWAPPPIDLTNFLVRYSPVKNEEDVAELSISPSDNAVVLTNLLPGTEYVVSVSSVYEQHESTPLRGRQK
1IAM:_ WTPERVELAPLPSLTLRCQVEGGAPRAQLTVVLLRGE----KELKREPAVGEAEVTTTVLHHGAQFSCRTELDLRPQELFENTSPYQ
1GGT:A DMDFEVEN--AVLGKDFKLSITFRNTITAYLSANITFYGVPKAEKKETFDVEAVLIQAGQLLEQASLHFFVTARIRDVLAKQKSTVL
1TTF:_ DVPRDLEVVAATPTSLLISWDAPAVTVRYYRITYGETGGNSPVQEFTVPGSKSTATISGLKPGVDYTITVYAVTGASSKPISINYRT
1TLK:_ KPYFTKTILDMDVAARFDCKVEG---YPDPEVMWFKDD---NPVKIDYDEEGNCSLTISEVCGDAKYTCKAVN---SLGEATCTAEL
1ILN:B KPFENLRLMAPETHRCNISWEISQAYFERHLFEARTLSPGHTWEEAPLTLKQEWICLETLTPDTQYEFQVRVKPLQTWSPWSQPLAF
1TIT:_ IEVEKPLYGVEVFVTAHFEIELS--EPDVHGQWKLKGQ---PLTAIIEDGKKHILILHNCQLGMTGEVSFQA-----ANAKSAANLK
1ZXQ:_ PPRQVILTLQPTLSFTIECRVPTVEPLDLTLFLFRG----NETLHYETFGKATATFNSTADRGHRNFSCLAVLDLMNIFHKHSAPKM
1HNG:A MVSKPMIYWECSNATLTCEVLEG---TDVELKLYQG----KEHLRSLR---QKTMSYQWTN-LRAPFKCKAVN---RVSQESEMEVV

† The PDB code is followed by the sequence represented by the single letter amino acid code. The conserved key amino acids are shown in the first row, where “a” is most conserved and “r” is the least conserved. 1TEN_: Tenascin (Third Fibronectin Type III Repeat); 1WIT_: Twitchin 18Th Igsf Module; 2VAAB: Mhc Class I H-2Kb Heavy Chain; 1WHP_: Allergen Phl P 2; 1IGJA: Fab (Igg2A) Fragment (26–10) Complex With Digoxin; 1SOXA: Sulfite Oxidase; Oxidoreductase; 1IGYB: Igg1 Intact Antibody Mab61.1.3; 1AIFH: Anti-Idiotypic Fab 409.5.3 (Igg2A); 1BHGA: Beta-Glucuronidase (Glycosidase); 1VSCA: Vascular Cell Adhesion Molecule-1; 1SEBE: Hla Class II Histocompatibility Antigen; 1EUT_: Sialidase; Neuraminidase; Hydrolase; 1WIQB: T-Cell Surface Glycoprotein Cd4; 1HWGB: Growth Hormone; 1ILMG: Interleukin-2 (Model); 1AKJD: Mhc Class I Histocompatibility Antigen; 2HFT_: Human Tissue Coagulation Factor; 4KBPB: Purple Acid Phosphatase; 1IGEA: Fc Fragment (Ige’Cl); 1BGML: Beta-Galactosidase (O-Glycosyl); 1CID_: T-Cell Surface Glycoprotein CD4; 2ISDA: Phosphoinositide-Specific Phospholipase C, Isozyme 1; 1AO2N: Nfat-DNA Binding Domain; 1BEC_: 14.3D T Cell Antigen Receptor; 1AH1_: Ctla-4 N-Terminal Immunoglobulin V-Like; 1DANU: Blood Coagulation Factor Viia; 2RAMA: Transcription Factor Nf-Kb P65; 2PCY_: Plastocyanin; 1GOG_: Galactose Oxidase (Oxidoreductase); 1EBPA: EPO-receptor (Cytokine Receptor/Peptide); 1CTO_: Granulocyte Colony-Stimulating Factor-Receptor; 1NCG_: Neural Cadherin Domain 1; 1FNF_: Fibronectin; 1IAM_: Intercellular Adhesion Molecule-1; 1GGTA: Coagulation Factor Xiii (A-Subunit Zymogen); 1TTF_: Fibronectin (Tenth Type III Module); 1TLK_: Telokin; 1ILNB: Interleukin-2 Complex (Cytokine-Model B); 1TIT_: Titin Ig Repeat-27 (Connectin-I27); 1ZXQ_: Intercellular Adhesion Molecule-2; 1HNGA: Cell Adhesion Molecule Cd2; 1ASOA: Ascorbate Oxidase (Oxidoreductase); 1ITEC: Interleukin-4 Receptor; 1SVB_: Tick-Borne Encephalitis Virus Glycoprotein; 1ALS_: Fceri (Ige) (Subunit, Extracellular Region); 1CFB_: Drosophila Neuroglian (Fibronectin Type III Repeats); 1KOA_: Twitchin (Kinase Fragment); 1AAC_: Amicyanin (Electron transport); 1ILNG: Interleukin-2 Complex (Cytokine); 1NEU_: Myelin PO Protein.

Fig. 4. Stereo views of tenascin (TNfn3) and troponin C (TnC). CKAAPs are rendered as labeled sticks, and their van der Waals surfaces are rendered according to secondary structures: red, helices; blue, strands; and gray, loops. The EF-hand of troponin C is colored yellow. a: Extracellular matrix protein tenascin (1TEN). From the reader’s view the right sheet comprises strands ordered A, B, and E away from the reader and the left sheet strands ordered G, F, C, and C′ (an additional strand) away from the reader. b: Calcium-regulated muscle protein troponin C (1SMG) with the E41→A mutation; NMR structure model 1 is used.

CKAAPs Versus Nucleation-Stabilization Centers Predicted by Using Other Methods

The CE structure alignment procedure allows us to obtain structurally homologous protein sequences of complete polypeptide chains by setting higher thresholds on RMSD and lower thresholds on Z-scores. This differs from the substructures, which have a higher level of structural homology but only over a fragment of the complete polypeptide chain. The result is longer aligned sequences but possibly less accurate alignments. The question then becomes whether useful CKAAPs can still be derived from these alignments.
TABLE III. Weighted Log Odd Values (Hlm) for CKAAPs†

       a    b    c    d    e    f    d    e    f    g    h    i    j    k    l    m    n    o
A     −1   −5  −10    0    1    0   −1    0    1    0    2   −1    0    0    2    1    0    0
V      1   36    2   29    4   −3    3    0    0    7    5    5   −6    0   −3    0   −5    0
L     13    0   15    0   22   −1    8    3    0    5   11   −1   −5   14    0  −25    0   11
I     26    4    9   −2    0   −4    4    1    2    3    0    1   −4    2   −3   −3   −1    0
F      0    0    1    0    3    0    0    7    9    0   −2    0    0    5    0   −4    0    0
Y      0    0   −5    0  −10   56   −5   20   31  −10   −6    0   −5    5  −10    0   −1   −2
W     −5   14   −5   −1   −5   −1   11   10    0   −1    0   43   −5    0    0    0   −5    0
S     −9   −1   −3   −3   −2   −3   −4   −3   −1    0   −2    0    0   −1    3   −4   10   −3
T     −1    0    0   −3    0   −1    0   −9   −3    0    0    0    0   −1    0    7    0    0
C     −3   −3   11   40   −1   −1   −3   −1    4   −3    0   26   −3    8   −1   −3   −1   −3
M      6    0    0    0    0   −1    0    0   −2    0    7   −1   −2    1    0    0    1    0
N      0   −2    0  −11   −1    0   −4   −3   −1    0   −3   −6    4   −5    0    0    0   −5
Q      0    0   −4    0   −1   −1    0   −1   −4    0   −2   −1   −1    0    0    0    0    4
D     −4   −3    0  −12    0   −5   −5  −10   −7   −4    0  −12    5   −1    0   19   −5   −3
E     −3    0   −2   −8    0   −1   −2  −18  −11    1   −2   −6    0   −2    0    0   −2   −2
H     −1   −1   −4   −2    0    0   −1    0    0    0   −4    0    0   −4   −1   −1    0    0
K     −1    0    0    0   −6    0   −1    0    0    0    0    0   −1   −2    0    0    0    0
R      0  −12   −3    0   −2   −3    0    2   −4    0    0   −2    0   −1    0   −1    0   −1
G    −19  −16  −19   −5   −3    6    0   −2   −1  −19    0   −2   31   −8    3    0    3   −8
P      0  −16  −16  −13    0    0   −3  −16  −16   −7  −10  −16    0  −16    5    0    1    3
Ra    55   68   51   66   54   62   43   44   47   43   46   52   54   44   33   43   38   42
Rg    39   22   31   12   20    8   28   26   22   27   23   16   12   20   28   17   21   16

† Values are given in decreasing order of their (Ra + Rg) values.

To address this question, we have obtained the maximum available number of structural neighbors (4 or more) with low sequence identities to chymotrypsin inhibitor, the CheY signal transduction protein, cytochrome c, and the ubiquitin family of proteins (Table IV). We then followed a procedure similar to that used for common substructures to identify CKAAPs for each family of proteins. We compared the kinetically hot residues24 and the folding nucleus residues in the ubiquitin family of proteins reported by Michnick and Shakhnovich30 with the CKAAPs and observed that at least 75% of the residues are common.
This indicates a high level of consistency, obtained with a method that is computationally tractable and can be applied across a wide range of superfamilies for which multiple structures can be aligned. We have compared CKAAPs with conservatism-of-conservatism (CoC) residues31 for the most commonly occurring folds identified by both methods (Fig. 5). We predicted all the CoC residues as CKAAPs, plus some additional residues within our arbitrary 20% cutoff. Most of our additionally predicted amino acid positions are in the terminal regions of rigid secondary structural elements. Mirny and Shakhnovich31 identify the positions that are conserved and unexposed to the solvent, whereas we take the first 20% of the conserved positions to be important for the fold.

Similar CKAAPs in Related Substructures: the Continuity of Fold Space

The substructures obtained through structural alignment using CE show some degree of overlap.38 The question then becomes whether the CKAAPs identified in two overlapping substructures are the same. We have compared the CKAAPs for TNfn3 (1TEN) and the vascular cell adhesion molecule (VCAM, 1VSC) from substructures DC01 and DC30, respectively (see Shindyalov and Bourne38 for details). Both are derived from the Ig fold, illustrating the continuity of fold space. CKAAPs of VCAM are identified by aligning against TNfn3 and are shown in bold letters in Table V(a). When VCAM is used as the reference structure in DC30, the key positions identified are slightly different (Table V(b)). However, 10 of the residues identified are identical. In addition, when the results for VCAM are compared with those identified by Halaby et al.,62 all the topohydrophobic positions except two are identified, as shown by the underscored residues (Table V(b)). In particular, both cysteines that are important in disulfide bridge formation in VCAM are identified as CKAAPs in the reference structure and in analogous structures.
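The comparisons between methods described above come down to set operations on residue labels. The following is a minimal Python sketch that uses the TNfn3 residues quoted in the text for Mirny and Shakhnovich31 and Halaby et al.62 The helper name `overlap_fraction` and the variable names are illustrative assumptions and are not part of the published method.

```python
def overlap_fraction(reference, predicted):
    """Fraction of `reference` residues that also appear in `predicted`."""
    reference, predicted = set(reference), set(predicted)
    if not reference:
        raise ValueError("reference set is empty")
    return len(reference & predicted) / len(reference)

# TNfn3 (1TEN) residues as listed in the text.
mirny_shakhnovich = {"Ala819", "Ile821", "Trp823", "Leu835", "Val871", "Leu873"}
halaby = {"Ile821", "Trp823", "Leu835", "Tyr837", "Ile860", "Tyr869", "Val871", "Leu873"}

# The text states that every residue in both lists is also a CKAAP, so their
# union is used here as a (partial) CKAAP set for illustration only.
ckaaps_subset = mirny_shakhnovich | halaby

# Residues reported by all three studies, ordered by residue number.
all_three = sorted(mirny_shakhnovich & halaby & ckaaps_subset, key=lambda r: int(r[3:]))
print(all_three)
print(overlap_fraction(mirny_shakhnovich, ckaaps_subset))
```

The same helper could be applied to any pair of residue lists, for example CKAAPs against kinetically hot residues, provided both use the same residue numbering.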
Mutations of CKAAPs and the Effects on Structure, Function, and Stability of Proteins

We have searched the protein mutation database (PMD)68 and the relevant literature to find reports on proteins in which the residues we predict as CKAAPs are mutated, together with any observed effects on the structure, function, and stability of the protein. We discuss here two examples of such mutation studies.

Arc repressor protein

Arc repressor of bacteriophage P22 has 53 residues and exists as a homodimer with a simple architecture consisting of a β-sheet and four α-helices. Alanine scanning substitution mutations were made, and the effects on folding, stability, and function of this protein have been reported.33 Using CE we identified 15 structural homologues in the PDB aligned to residues 7–46 of the Arc repressor protein. The CKAAPs for this protein are found to be G30, V25, S32, E36, R31, R40, S35, V22, and S44, in the order of their conservation. The alanine-scanning mutations for this protein were done one residue at a time. Of the nine CKAAPs mutated, seven alanine substitution mutants (G30A, V22A, S32A, E36A, R31A, R40A, and S44A) were found to decrease the equilibrium stability and cause significant increases in the rate of unfolding.69 In other words, compared with other mutants, CKAAP mutants exhibit more severe perturbations in protein stability. The implication is that CKAAPs indeed play a significant role in structural stability and fold architecture.

TABLE IV. Determination of CKAAPs in Complete Polypeptide Chains†

A. Ubiquitin family

CKAAPs    --fca---------j-d--------o--------ml--n---i------b--------------kgh-e-----
1TBE:A  1 MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLR
1RAX:A 29 CIIRVSLDVNMYKSILVTSQDKAPAVIRKAMDKHNLEPEDYELLQIKLKIPENANVFYAMNSANYDFVLKKRTF
1RRB:_ 20 -TIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLARLDWNTDAA--SLIG-EELQVDFLK-
1A5R:_ 24 IKLKVIGQDSSEIHFKVKMTTHLKKLKESYCQRQGVPMNSLRFLFEGQRIADNHTPKELGMEEEDVIEVYQE-
          ---*-*---------*-*--------*---*------------*-----------------------*-*-----

B. CKAAPs compared with kinetically hot residues

† (A) Ubiquitin and its structural homologues (<25% sequence identity). The RMSD, sequence identity, length of alignment, and Z-score values for each structural alignment are given. For the other two sequences of the ubiquitin-like superfamily only CKAAPs and nucleation residues identified by other methods are given. (B) CKAAPs are compared with kinetically hot residues in proteins of the chymotrypsin inhibitor family of protein structures. Letters in the CKAAPs row represent the descending order of (Ra + Rg) values (a highest). The conserved amino acids with potential nucleation sites identified by other methods are marked with a * in the last row of each sequence alignment. ^ represents an important site peculiar to the specific structure.

B1 Domain of streptococcal IgG-binding protein G versus Rop protein

Dalal et al.34 interconverted an α/β-sheet protein, the B1 domain of Streptococcal IgG-binding protein G (1PGA), to a homodimeric helix-turn-helix Rop-like protein (1ROP) through substitution mutation of <50% of the amino acids (PGA-m). The B1 domain is 56 amino acids in length, and the helix-turn-helix motif of Rop is confined to the first 56 N-terminal amino acids. We have taken the 26 nonhomologous subsequences from the CE alignments of 1ROP, identified CKAAPs, and calculated the frequency-based log-odds table. Similarly, we obtained CKAAPs and the log-odds matrix for the 11 nonhomologous CE alignments of 1PGA. Of the 28 substitution mutations in PGA-m, 12 are observed to be CKAAPs of either 1PGA or 1ROP (Table VI(a)).
Eight mutations correspond to the CKAAPs of 1ROP and 7 mutations correspond to the CKAAPs of 1PGA. There are four common CKAAPs for 1PGA and 1ROP, corresponding to three mutation sites. We computed the sum score of log-odd values at the CKAAP positions of PGA-m by using both the 1PGA-based log-odds matrix and the 1ROP-based log-odds matrix, compared these with the sum scores of the corresponding native proteins, and observed an interesting relationship. The sum score of log-odd values of PGA-m converges significantly toward a Rop-like structure (Table VI(b)). The implication is that the mutations made by Dalal et al.34 disturbed a significant number of CKAAPs in such a way that the mutant protein is now less inclined to form an α/β structure and more inclined to form a helix-turn-helix structure. Such a result points toward the potential usefulness of CKAAPs in protein design. In fact, using CKAAPs we suggest a minimum set of 12 substitution mutations to 1PGA to engineer a structural change from the 1PGA α/β structure to the 1ROP-like helix-turn-helix motif (Table VI(b)). Amino acid residues for substitution are selected such that the sum score of log-odd values of amino acids for the CKAAPs of 1ROP is maximized and that for the CKAAPs of 1PGA is minimized. This was observed to be consistent with similar experiments performed by other groups.70

TABLE V. CE Alignments Showing CKAAPs for the Substructure Represented by 1VSC Using (a) 1TEN as the Reference Structure and (b) 1VSC as the Reference Structure (Underlined)†

(a)
1TEN  DAPSQI IEVKDVTDT TTALI ALITW WFKP PLAEIDGIEL LTY YGIKDVPGDRTTIDLTEDENQY YSI IGNL LKPDT TEY YEV VSL LISRRGDMSSNPA AKETF FT
1VSC  FKIETT TPRYLAQIG GSVSL VSLTC C-ST TTGCESPFFS SWR RT--QIDSPLNKVTNEGTTSTL LTM MNPV VSFGN NEH HSY YLC CTATC-ESRKLEK KGIQV VE
(b)
1VSC  ETTPESRY YLA AQIG GDSV VSL LTC CSTTG GCESPFFSW WRTQ QIDSPLNGKVTNEGTTSTL LTM MNPV VSFGN NEH HSY YLC CTATCESRKLEKGIQV VEI IYS
1VSC  FKIETT TP--RYLAQIG G_SVS VSLTC CSTTGCESPFFS SWR RTQIDSPLN-KVTNEGTTSTL LTM MNPV VSFGN NEH HHSY YLC CTATCESRKLEK KGIQV VE

† CKAAPs are represented by bold characters. For comparison in (b), 1VSC is copied from (a) except corresponding deletions are represented by hyphens. The underlined residues are identified by Halaby et al.62

TABLE VI. Comparison of CKAAPs With the Substitution Mutations Made to 1PGA to Meet the Paracelsus Challenge†

(a)
                          Sequence
1PGA CKAAPs               MTYKLIL LNGKTLKGET TTT TEAVDA VDAATAE AEKVF FKQYA ANDNGVDGEW WTYD DDATKTFTVTE
PGA-m of Dalal et al.34   MTKKAILALNTAKFLRTQAAVLAAKLEKLGAQEANDNAVDLEDTADDLYKTLLVLA
1ROP CKAAPs               GTKQE EKTAL LNMARFIRSQTLTLLE LEKL LNELG GADEQADICESLH LHDHA HADELY YRSC CLARF
PGA-M                     MTYKLIANIKTLKGENTTEAVDIATIDKVGKQYTNDNGVDIASTYKDATKTFTVTE

(b)
Log-odds matrix \ Sequence   1PGA   1ROP   PGA-m   PGA-M
Σ Hlm of ROP-CKAAPs           −18     77      67      97
Σ Hlm of PGA-CKAAPs           197     −2      87    −123

† (a) Comparison of the substitution mutations made to the 1PGA sequence (PGA-m) by Dalal et al.34 and the CKAAPs (bold) identified for the 1PGA and 1ROP sequences. The suggested 12 substitution mutations of the 1PGA sequence to convert it to a 1ROP-like structure are shown in the last row (PGA-M). (b) Sum score of log-odd values for the amino acids at CKAAPs based on 1PGA and 1ROP structural homologues with <25% sequence identity. The mutations made by Dalal et al.34 (PGA-m) increased the sum score of ROP-CKAAPs residues and decreased the sum score of PGA-CKAAPs. We suggested a sequence, PGA-M, with only 12 substitution mutations at the CKAAPs by optimizing the sum score more toward a ROP-like structure.

Fig. 5. CKAAPs compared with key residues identified by other methods. a: Conservatism-of-conservatism (CoC) amino acids20 and those identified by Clarke et al.32 1TEN, tenascin, representative of the Ig fold (DC01). The color coding of C-α atoms is as follows: purple, CKAAPs; green, CoCs; red, Clarke et al.; pink, CKAAPs + Clarke et al.; blue, CoC + Clarke et al.; black, identified by all three methods. b: Conservatism-of-conservatism (CoC) amino acids.20 2ACY, acylphosphatase, representative of α/β plaited structures (not in subdomain gallery). The color coding of C-α atoms is as follows: yellow, CKAAPs + CoCs; purple, CKAAPs only.

Structural Environment of CKAAPs

To further evaluate the structural environment of CKAAPs, we compared various environment-dependent parameters for the amino acids present in CKAAPs in all substructures against all amino acids. The hydrophobic amino acids L, V, and I have a higher relative occurrence compared with their composition in a nonredundant data set (Fig. 6). The charged and polar amino acids K, R, D, E, N, P, Q, S, and T show a considerably lower frequency of occurrence in CKAAPs. This finding supports the well-known observation that amino acids in the hydrophobic core play a key role in the structural integrity of a protein. The composition of amino acids present in CKAAPs is higher in the terminal regions of the rigid secondary structure elements, α-helices and β-strands (not shown), and in the turns and loop regions of the protein structures (Fig. 7a). The proportion of these residues that form two or three hydrogen bonding interactions per residue is higher (Fig. 7b). Thus, the charged groups of CKAAPs are better neutralized by dipole interactions. The Ooi values (Fig. 7c) indicate that CKAAP residues are significantly more buried than residues overall, in keeping with their presence in the hydrophobic core. Finally, the solvent-accessible contact area of CKAAPs does not show much difference compared with amino acids in a random data set (Fig. 7d). In summary, the structural environments of CKAAPs show no change in the normal pattern of solvent accessibility; however, CKAAPs are predominant in the terminal regions of rigid secondary structural elements.
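The Ooi number used in this comparison (Fig. 7c) is a count of the residues found within a fixed radius, here 8 Å, of a given residue. The following is a minimal Python sketch under stated assumptions: C-α coordinates stand in for residue positions, the central residue itself is not counted, and the coordinates are a made-up toy chain. The function name `ooi_numbers` is hypothetical.

```python
import math

def ooi_numbers(ca_coords, radius=8.0):
    """For each residue, count the other C-alpha atoms within `radius` angstroms."""
    counts = []
    for i, a in enumerate(ca_coords):
        n = sum(
            1
            for j, b in enumerate(ca_coords)
            if j != i and math.dist(a, b) <= radius
        )
        counts.append(n)
    return counts

# Hypothetical coordinates (angstroms): five residues 3.8 A apart on a straight line.
toy_chain = [(3.8 * k, 0.0, 0.0) for k in range(5)]
print(ooi_numbers(toy_chain))  # -> [2, 3, 4, 3, 2]
```

In this toy chain the central residue has the most neighbours, which mirrors the idea that a higher count indicates a more buried position.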
The Ooi number shows that CKAAPs are mostly surrounded by other amino acids and that charged groups on the amino acids are better neutralized by hydrogen bonding interactions.

Usefulness of CKAAPs for Protein Engineering and Fold Recognition

The recognition of CKAAPs has implications for protein structure prediction and for the design of new protein sequences to achieve a desired folded architecture. The amino acids present as CKAAPs are mostly important for the integrity and stabilization of the common substructure, thus allowing specific mutations at other locations with minimum distortion to the overall protein structure. Such mutations could be used either to engineer altered functions or to design a new function. CKAAPs also offer a way to engineer stability into less stable proteins by appropriately substituting nonoptimal amino acids at CKAAPs. These conserved amino acids may also be useful in fold recognition and structure prediction studies.

CONCLUSIONS

The CE structure comparison algorithm identifies similar substructures formed by dissimilar subsequences. We have presented a sequence space scanning procedure to identify conserved key amino acid positions (CKAAPs) in these commonly occurring protein substructures. We propose that CKAAPs are important for structural integrity and for nucleation and stabilization of proteins. Tertiary structure formation from primary amino acid sequence can be explained by two different models. (i) A global model in which a fold is formed by interactions that involve the entire sequence. The global model is supported by mutation studies showing that, in some proteins, mutations at any position in the sequence have no measurable impact on the fold. (ii) A local model in which fold specificity is coded only within a few critical residues (10–20%) of the sequence.
The local model is suggested by studies of sequence versus structure similarity, which show that naturally occurring protein sequences with 25–90% sequence identity have no significant change in fold, yet below 25% radical changes in fold can occur. This observation has been the basis for the successful use of homology-based protein modeling when the structure being modeled has significant sequence identity to an existing structure. Our analysis does not necessarily contradict either of these models but bridges them, as we propose a hierarchy of position-specific residues important for a given fold. In other words, many of the residues can have an impact on folding, but some clearly have a greater impact than others.

Fig. 6. Histogram showing the composition of amino acids in CKAAPs relative to their composition in a random (natural) set of nonhomologous protein structures.

This study clearly lacks a good statistical treatment and leads to many questions. For example, what is the minimum number of aligned structures, and over what alignment length, needed to provide useful CKAAPs? Answers to these questions are part of an ongoing study. It can be stated at this point that the iterative random removal of 20% of the structures forming a substructure will lead to 10% of the 20% of CKAAPs reappearing at the 95% confidence level. Clearly, the success of this approach to assigning the residues most likely to impact the folding of a protein depends on the accuracy of the structure alignments from which the sequence alignments are derived. Structure alignment is an ongoing area of study in our laboratory, including multiple structure alignments.71 The prediction of CKAAPs is more reliable the greater the number of dissimilar subsequences that form similar substructures. Because the predictions are based on the available sequence and structure space, more sequence and structure data should provide more reliable prediction of CKAAPs in the future.
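The resampling check mentioned above (iterative random removal of 20% of the structures) can be sketched as follows. This is an illustration only: the conservation score used here, the frequency of the most common residue in each alignment column, is a simple stand-in assumed for the example and is not the weighted log-odds score of this study. The function names, the toy alignment, and the number of trials are likewise hypothetical.

```python
import math
import random
from collections import Counter

def column_scores(alignment):
    """Stand-in score: count of the commonest residue in a column divided by
    the number of sequences, so gaps count against the column."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores

def top_positions(alignment, fraction=0.2):
    """Indices of the top `fraction` of columns by score (ties broken by index)."""
    scores = column_scores(alignment)
    n_keep = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(ranked[:n_keep])

def reappearance_rates(alignment, drop_fraction=0.2, trials=200, seed=0):
    """For each key position of the full set, the fraction of resampled runs
    (with `drop_fraction` of the sequences removed) in which it is selected again."""
    rng = random.Random(seed)
    full = top_positions(alignment)
    n_keep = len(alignment) - max(1, round(drop_fraction * len(alignment)))
    counts = Counter()
    for _ in range(trials):
        subset = rng.sample(alignment, n_keep)
        counts.update(full & top_positions(subset))
    return {pos: counts[pos] / trials for pos in sorted(full)}

# Hypothetical toy alignment: 5 sequences, 10 columns.
toy_alignment = [
    "ACDWFGHIKL",
    "AQDWYGHLKM",
    "SCEWFAHIRL",
    "ACDWF-NVKL",
    "TCDWLGHIEL",
]
print(top_positions(toy_alignment))
print(reappearance_rates(toy_alignment))
```

In this toy case the fully conserved column is selected in every resampled run, whereas the second key column is displaced whenever the removed sequence changes the ranking, which is the kind of instability a resampling test is meant to expose.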
CKAAPs for substructures are already available on the Web at http://ckaaps.sdsc.edu. A database of CKAAPs72 that can be queried by PDB id is available from the same URL.

Fig. 7. Composition of amino acids in CKAAPs compared with a random representative set of nonhomologous protein structures. On the Y-axis is the percentage of amino acids and on the X-axis: a: secondary structural regions (H helix, E strand, and C coil); b: hydrogen bonding interactions; c: Ooi number in an 8 Å radius around the amino acid; and d: solvent accessible contact area as a percentage of residue accessibility.

REFERENCES

1. Hilbert M, Bohm G, Jaenicke R. Structural relationships of homologous proteins as a fundamental principle in homology modeling. Proteins 1993;17:138–151.
2. Moult J, Hubbard T, Fidelis K, Pedersen JT. Critical assessment of methods of protein structure prediction (CASP): round III. Proteins 1999;37(S3):2–6.
3. Srinivasan N, Blundell TL. An evaluation of the performance of an automated procedure for comparative modeling of protein tertiary structure. Protein Eng 1993;6:501–512.
4. Sanchez R, Sali A. Evaluation of comparative protein structure modeling by MODELLER-3. Proteins 1997;Suppl.1:50–58.
5. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH: a hierarchic classification of protein domain structures. Structure 1997;5:1093–1108.
6. Hubbard TJ, Ailey B, Brenner SE, Murzin AG, Chothia C. SCOP: a structural classification of proteins database. Nucleic Acids Res 1999;27:254–256.
7. Mizuguchi K, Deane CM, Blundell TL, Overington JP. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci 1998;7:2469–2471.
8. Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–3402.
9. Geetha V, Francesco VD, Garnier J, Munson PJ. Comparing protein sequence-based and predicted secondary structure-based methods for identification of remote homologs. Protein Eng 1999;12:527–534.
10. Wood TC, Pearson WR. Evolution of protein sequences and structures. J Mol Biol 1999;291:977–995.
11. Lattman EE, Rose GD. Protein folding: what’s the question? Proc Natl Acad Sci USA 1993;90:439–441.
12. Bowie JU, Reidhaar-Olson JF, Lim WA, Sauer RT. Deciphering the message in protein sequences: tolerance to amino acid substitutions. Science 1990;247:1306–1310.
13. Matthews BW. Genetic and structural analysis of the protein stability problem. Biochemistry 1987;26:6885–6888.
14. Russell RB, Barton GJ. Structural features can be unconserved in proteins with similar folds. An analysis of side-chain to side-chain contacts secondary structure and accessibility. J Mol Biol 1994;244:332–350.
15. Bork P, Holm L, Sander C. The immunoglobulin fold: structural classification, sequence pattern and common core. J Mol Biol 1994;242:309–320.
16. Thomas PJ, Qu BH, Pedersen PL. Defective protein folding as a basis of human disease. Trends Biochem Sci 1995;20:456–459.
17. Doolittle RF. Similar amino acid sequences: chance or common ancestry? Science 1981;214:149–159.
18. Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J 1986;5:823–826.
19. Chothia C, Lesk AM. The evolution of protein structures. Cold Spring Harb Symp Quant Biol 1987;52:399–405.
20. Rost B. Twilight zone of protein sequence alignments. Protein Eng 1999;12:85–94.
21. Shakhnovich EI, Abkevich VI, Ptitsyn O. Conserved residues and the mechanism of protein folding. Nature 1996;379:96–98.
22. Ptitsyn OB. Protein folding and protein evolution: common folding nucleus in different subfamilies of c-type cytochromes? J Mol Biol 1998;278:655–666.
23. Ptitsyn OB, Ting KH. Non-functional conserved residues in globins and their possible role as a folding nucleus. J Mol Biol 1999;291:671–682.
24. Demirel MC, Atilgan AR, Jernigan RL, Erman B, Bahar I. Identification of kinetically hot residues in proteins. Protein Sci 1998;7:2522–2532.
25. Shakhnovich EI. Folding by association. Nat Struct Biol 1999;6:99–102.
26. Mirny LA, Abkevich VI, Shakhnovich EI. How evolution makes proteins fold quickly. Proc Natl Acad Sci USA 1998;95:4976–4981.
27. Rost B. Protein structures sustain evolutionary drift. Fold Design 1997;2:519–524.
28. Dosztányi Z, Fiser A, Simon I. Stabilization centers in proteins: identification, characterization and predictions. J Mol Biol 1997;272:597–612.
29. Lichtarge O, Bourne HR, Cohen FE. An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996;257:342–358.
30. Michnick SW, Shakhnovich E. A strategy for detecting the conservation of folding-nucleus residues in protein superfamilies. Fold Design 1998;3:239–251.
31. Mirny LA, Shakhnovich EI. Universally conserved positions in protein folds: reading evolutionary signals about stability, folding, kinetics and function. J Mol Biol 1999;291:177–196.
32. Clarke J, Cota E, Fowler SB, Hamill SJ. Folding studies of immunoglobulin-like beta-sandwich proteins suggest that they share a common folding pathway. Struct Fold Design 1999;7:1145–1153.
33. Rose GD. Protein folding and the Paracelsus challenge. Nat Struct Biol 1997;4:512–514.
34. Dalal S, Balasubramanian S, Regan L. Protein alchemy: changing beta-sheet into alpha-helix. Nat Struct Biol 1997;4:538–452.
35. Jones DT, Moody CM, Uppenbrink J, et al. Towards meeting the Paracelsus challenge: the design, synthesis, and characterization of paracelsin-43, an alpha-helical protein with over 50% sequence identity to an all-beta protein. Proteins 1996;24:502–513.
36. Yuan SM, Clarke ND. A hybrid sequence approach to the Paracelsus challenge. Proteins 1998;30:136–143.
37. Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998;11:739–747.
38. Shindyalov IN, Bourne PE. An alternative view of protein fold space. Proteins 2000;38:247–260.
39. Taylor WR. Classification of amino acid conservation. J Theor Biol 1986;119:205–218.
40. Zvelebil MJ, Barton GJ, Taylor WR, Sternberg MJ. Prediction of protein secondary structure and active sites using the alignment of homologous sequence. J Mol Biol 1987;195:957–961.
41. Hobohm U, Sander C. Enlarged representative set of protein structures. Protein Sci 1994;3:522–524.
42. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983;22:2577–2637.
43. Smith D. SSTRUC: a program to calculate a secondary structural summary. Department of Crystallography, Birkbeck College, University of London, 1989.
44. Nishikawa K, Ooi T. Prediction of the surface-interior diagram of globular proteins by an empirical method. Int J Pept Protein Res 1980;16:19–32.
45. Baker EN, Hubbard RE. Hydrogen bonding in globular proteins. Prog Biophys Mol Biol 1984;44:97–179.
46. Lee B, Richards FM. The interpretation of protein structures: estimation of static accessibility. J Mol Biol 1971;55:379–400.
47. Sali A, Blundell TL. Definition of general topological equivalence in protein structures: a procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J Mol Biol 1990;212:403–428.
48. Richardson JS. The anatomy and taxonomy of protein structure. Adv Prot Chem 1981;34:167–339.
49. Efimov AV. Structural trees for protein superfamilies. Proteins 1997;28:241–260.
50. Efimov AV. A structural tree for proteins containing S-like beta-sheets. FEBS Lett 1998;437:246–250.
51. Reddy BVB, Blundell TL. Packing of secondary structural elements in proteins: analysis and prediction of inter-helix distances. J Mol Biol 1993;233:464–479.
52. Reddy BVB, Nagarajaram HA, Blundell TL. Analysis of interactive packing of secondary structural elements in alpha/beta units in proteins. Protein Sci 1999;8:573–586.
53. Nagarajaram HA, Reddy BVB, Blundell TL. Analysis and prediction of inter-strand packing distances between beta-sheets of globular proteins. Protein Eng 1999;12:1055–1062.
54. Reddy BVB, Datta S, Thiwari S. Use of propensity of amino acids to the local structural environments to understand effect of substitution mutations on protein stability. Protein Eng 1998;11:1137–1145.
55. Lesk AM, Chothia C. The response of protein structures to amino acid sequence changes. Phil Trans R Soc Lond [Biol] 1986;317:345–356.
56. Shortle D. Mutational studies of protein structures and their stabilities. Q Rev Biophys 1992;25:205–250.
57. Shortle D, Sondek J. The emerging role of insertions and deletions in protein engineering. Curr Opin Biotechnol 1995;6:387–393.
58. Dice JF. Molecular determinants of protein half-lives in eukaryotic cells. FASEB J 1987;1:349–357.
59. Murzin AG. How far divergent evolution goes in proteins. Curr Opin Struct Biol 1998;8:380–387.
60. Cordes MHJ, Walsh NP, Knight JM, Sauer RT. Evolution of a protein fold in vitro. Science 1999;284:325–327.
61. Halaby DM, Mornon JP. The immunoglobulin superfamily: an insight on its tissular, species, and functional diversity. J Mol Evol 1998;46:389–400.
62. Halaby DM, Poupon A, Mornon JP. The immunoglobulin fold family: sequence analysis and 3D structure comparisons. Protein Eng 1999;12:563–571.
63. Gagne SM, Li MX, Sykes BD. Mechanism of direct coupling between binding and induced structural change in regulatory calcium binding proteins. Biochemistry 1997;36:4386–4392.
64. Ingraham RH, Swenson CA. Binary interactions of troponin subunits. J Biol Chem 1984;259:9544–9548.
65. Kretsinger RH, Nockolds CE. Carp muscle calcium-binding protein II. Structure determination and general description. J Biol Chem 1973;248:3313–3326.
66. Farah CS, Reinach FC. The troponin complex and regulation of muscle contraction. FASEB J 1995;9:755–767.
67. Strynadka NC, Cherney M, Sielecki AR, Li MX, Smillie LB, James MN. Structural details of a calcium-induced molecular switch: x-ray crystallographic analysis of the calcium-saturated N-terminal domain of troponin C at 1.75 Å resolution. J Mol Biol 1997;273:238–255.
68. Kawabata T, Ota M, Nishikawa K. The protein mutant database. Nucleic Acids Res 1999;27:355–357.
69. Sauer RT, Milla ME, Waldburger CD, Brown BM, Schildbach JF. Sequence determinants of folding and stability for the P22 Arc repressor. FASEB J 1996;10:42–48.
70. Dalal S, Balasubramanian S, Regan L. Transmuting alpha helices and beta sheets. Fold Design 1997;2:R71–79.
71. Guda C, Scheeff ED, Bourne PE, Shindyalov IN. A new algorithm for alignment of multiple protein structures using Monte Carlo optimization. Pacific Symposium on Biocomputing 2001. In press.
72. Li W, Reddy BVB, Shindyalov IN, Bourne PE. CKAAPs DB: A conserved key amino acid positions database. Nucleic Acids Research 2001. In press.