BIOINFORMATICS Vol. 19 no. 1 2003, Pages 117–124

Search for structural similarity in proteins

Jacek Leluk 1,2, Leszek Konieczny 3 and Irena Roterman 4,∗

1 Institute of Biochemistry and Molecular Biology, University of Wrocław, Tamka 2, 50-137 Wrocław, Poland
2 Interdisciplinary Centre for Mathematical and Computational Modeling (ICM), Pawińskiego 5A, 02-106 Warsaw, Poland
3 Institute of Medical Biochemistry, Collegium Medicum, Jagiellonian University, Kopernika 7, 31-034 Krakow, Poland
4 Department of Biostatistics and Medical Informatics, Collegium Medicum, Jagiellonian University, Kopernika 17, 31-501 Krakow, Poland

Received on February 3, 2002; revised on May 27, 2002; July 11, 2002; accepted on July 15, 2002

ABSTRACT

Motivation: The expanding protein sequence and structure databases await methods allowing rapid similarity search. Two geometric parameters, the dihedral angle between two sequential peptide bond planes (V) and the radius of curvature (R), as they appear in pentapeptide fragments of polypeptide chains, are proposed for evaluating structural similarity in proteins (VeaR). The parabolic (empirical) function that expresses the dependence of the radius of curvature on the V-angle in model polypeptides is altered in real proteins in a way characteristic of each protein. This can be used as a criterion for judging similarity.

Results: A structural comparison of proteins representing a wide spectrum of structures was assessed against sequence similarity analysis based on the genetic semihomology algorithm. The term ‘consensus structure’, analogous to ‘consensus sequence’, was introduced for the serpin family.

Availability: Semihom (sequence comparison) is freely available on request from J. Leluk. VeaR (structural comparison) is freely available on request from I. Roterman.

Contact: [email protected] for the SEMIHOM program; [email protected] for the VeaR program.
∗ To whom correspondence should be addressed.
© Oxford University Press 2003

INTRODUCTION

Various properties of polypeptide chains in proteins can be used as criteria for similarity estimation: superposition of two compared protein structures with RMS-D calculation to measure the discrepancies in Cα location (Hubbard, 1999; Guda et al., 2001); contact maps expressing inter-residue distances (Ortiz et al., 1999); environmental properties such as solvent-accessible surface (Jung and Lee, 2000); and conformational properties such as dihedral angles or the mutual orientation of centres of mass (Shindyalov and Bourne, 1998). The increasing size of sequence and structure databases demands methods that uniformly evaluate similarity for homology-based structure prediction. In the postgenomic era, when complete knowledge of the full proteome of an organism is available, a theoretical method allowing prediction of protein structure, especially one based on homology, is urgently required (Fisher, 1999). This is why the problem of similarity search is closely related to all methods that predict protein structure and function from the known amino acid sequence (Kolinski et al., 2001; Irving et al., 2001; Fetrow et al., 2001).

Tools for multiple sequence alignment help to solve the problem of homology search, whereas 3D structure comparison is difficult to extend beyond a pair of proteins (Sauder et al., 2000; Leibowitz et al., 2001). A genetic algorithm has been proposed for use in protein alignment (Szostakowski and Weng, 2000). The Markov model has been adapted for structure alignment (Kawabata and Nishikawa, 2000; Bienkowska et al., 2000). Superposition of Gaussian functions describing atoms in their so-called ‘fuzzy’ form produces a good description of the protein body. A multiple alignment procedure has also been proposed for this model (Maggiora et al., 2001).
The parameters proposed here as criteria for similarity search (VeaR program) express protein structure as linear profiles of parameters characteristic of the polypeptide treated as a ribbon. Subjective quantitative criteria such as window size and score level are defined by the user. The selected parameters allow units larger than secondary structure units to be compared, including loops and random coiled fragments. The proposed method may also be used to compare structures as they appear after dynamics simulation. A comprehensive comparison of the members of the serpin family, performed on the basis of the presented parameters, supports a proposal to adopt the notion of ‘consensus structure’, analogous to ‘consensus sequence’.

SYSTEMS AND METHODS

Structural similarity

A model based on a ribbon-like approximation has been described in detail (Roterman, 1995a). The main assumption of the model is that all structures of polypeptide fragments in proteins can be treated as helix-like forms with different radii of curvature. For extended forms the radii are extremely large (theoretically infinite), so the radius of curvature is measured on a logarithmic scale. The pentapeptide appeared to be the proper unit for calculating the radius of curvature. Analysis of model pentapeptides revealed that the radius of curvature depends on, and is determined by, the mutual orientation of two sequential peptide bond planes. The dihedral angle between them is expressed by the second parameter, the V-angle.

The procedure for calculating R and V is as follows. Before the R and V parameters can be calculated, all sequential pentapeptides in a polypeptide chain should be oriented in a unified form so that their values can be compared. The Z-axis for each sequential pentapeptide is determined by the averaged CO bond positions. For this orientation the radius of curvature for the XY-plane projection of the Cα atoms can be calculated.
Two approaches are used for this procedure. The first defines the centre of the helix by averaging the positions of all five Cα atoms. In the second, the coordinates of three Cα atoms (the 1st, 3rd and 5th in the pentapeptide) are inserted into equations expressing the equal distance between the atoms and the centre of the circle; an analytical solution can be found in this case. Both methods are applied to each analyzed pentapeptide structure. The radius of curvature that shows the lower dispersion versus the theoretical curvature is accepted as the proper one for a particular pentapeptide structure. The procedure is described in detail in Roterman (1995b).

$V_i$, the tilt angle of the central peptide bond plane in the pentapeptide, is calculated versus the Z-axis. The angle between the CO bond and the XY plane is taken as the measure of the tilt angle. The $V_{i+1}$ and/or $V_{i-1}$ tilt is estimated in the same way. The difference between $V_i$ and $V_{i+1}$ and/or $V_{i-1}$ (or the average of $V_{i+1}$ and $V_{i-1}$; this form is used here) expresses the dihedral angle between two sequential peptide bond planes characteristic of each pentapeptide fragment in the polypeptide chain.

The distribution of ln(R) and V along the polypeptide chain (in steps of one amino acid) can be represented as a profile characteristic of a particular protein. The window size for which the similarity is estimated can be defined arbitrarily by the user and depends on the problem being studied. In this paper a window size of 25 amino acids was chosen for comparing the structure of two polypeptide chains. A window of 25 aa allows comparison between two fragments longer than the usual secondary structure units. A window size of 10, for example, allows secondary structure fragments to be identified as similar.
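The geometric quantities described above can be illustrated with a small sketch. It is not the VeaR implementation: it assumes the pentapeptide has already been oriented so that Z is the averaged CO direction, and the function names (`circumradius_xy`, `tilt_angle_deg`, `v_angle`) are hypothetical. Only the three-point variant of the radius calculation is shown.

```python
import math

def circumradius_xy(p1, p3, p5):
    """Radius of the circle through three XY-projected C-alpha positions
    (1st, 3rd and 5th residue). Collinear points give an infinite radius."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p3, p5
    a = math.dist(p1, p3)
    b = math.dist(p3, p5)
    c = math.dist(p1, p5)
    # Twice the signed area of the triangle; zero means an extended form.
    cross = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)
    if abs(cross) < 1e-12:
        return math.inf
    # R = abc / (4 * area), with area = |cross| / 2
    return a * b * c / (2.0 * abs(cross))

def tilt_angle_deg(co_vector):
    """Angle (degrees) between a CO bond vector and the XY plane."""
    x, y, z = co_vector
    norm = math.sqrt(x * x + y * y + z * z)
    return math.degrees(math.asin(z / norm))

def v_angle(tilt_i, tilt_prev, tilt_next):
    """V for residue i: tilt of the central peptide plane minus the
    average tilt of its two neighbours (the form used in the paper)."""
    return tilt_i - 0.5 * (tilt_prev + tilt_next)

# Example: three points on a circle of radius 2 centred at the origin.
radius = circumradius_xy((2.0, 0.0), (0.0, 2.0), (-2.0, 0.0))
ln_r = math.log(radius)
```

Working with ln(R) rather than R keeps nearly extended fragments, whose radius grows without bound, on a usable scale; a truly collinear triple would still need special handling before taking the logarithm.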
A 5 aa window size usually finds the loops linking the β-structural fragments of polypeptides to be similar to helical fragments in the compared protein.

The similarity score (S) for fragments was calculated as follows:

$$S_{i,j} = \frac{P_i - P_j}{\min(P_i, P_j)}$$

where P is the parameter value (ln(R) or V) at the ith amino acid of one chain and the jth amino acid of the second chain. For dot-matrix calculations, pentapeptides with S below 50% were treated as similar; thresholds of S = 50% and 60% were used for the other similarity searches in this paper. The S threshold is user-defined, depending on the level of precision required in a similarity search. A 25 aa fragment was considered similar when more than half of its amino acids satisfied the criteria presented above. The profile comparison was performed automatically, following the procedure that produces the dot-matrix.

The ln(R) value appeared to depend on the V-angle for model pentapeptide structures representing the low-energy area of the Ramachandran map. The mathematical relation can be expressed as a second-degree polynomial (an empirical function; Roterman, 1995a,b). The regular dependency that occurs for model pentapeptides is not evident in real proteins. The deviation from the theoretical relation can be measured by the DIS parameter, which simply expresses the difference between the expected and observed ln(R) values for a particular V-angle. A positive value of DIS indicates a larger than expected radius of curvature for a particular V-angle. A negative DIS value describes a ‘spring’ more squeezed than the relaxed form of the pentapeptide for a particular V-angle. The profile of the DIS parameter in a particular protein also appeared to be characteristic, and taken in conjunction with the V-angle it may be used as a criterion for comparison.
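The scoring and window rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the VeaR code: absolute values are used so that the score is non-negative for parameters of either sign (the printed formula does not state this), a zero denominator is treated as "not similar", and all function names are hypothetical.

```python
def similarity_score(p_i, p_j):
    """S in percent: |P_i - P_j| / min(|P_i|, |P_j|) * 100.
    The absolute values are an assumption, not taken from the paper."""
    diff = abs(p_i - p_j)
    if diff == 0.0:
        return 0.0
    denom = min(abs(p_i), abs(p_j))
    if denom == 0.0:
        return float("inf")  # cannot be normalised; treated as dissimilar
    return 100.0 * diff / denom

def window_is_similar(a, i, b, j, window=25, threshold=50.0):
    """True when more than half of the aligned positions in the two
    windows (starting at a[i] and b[j]) have S below the threshold."""
    hits = sum(
        1 for k in range(window)
        if similarity_score(a[i + k], b[j + k]) < threshold
    )
    return hits > window / 2

def dot_matrix(a, b, window=25, threshold=50.0):
    """Boolean matrix over all window start positions of profiles a and b."""
    rows = len(a) - window + 1
    cols = len(b) - window + 1
    return [
        [window_is_similar(a, i, b, j, window, threshold) for j in range(cols)]
        for i in range(rows)
    ]

# Toy profile compared with itself using a short window.
profile = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
matrix = dot_matrix(profile, profile, window=3)
```

Comparing a profile with itself, as in the toy example, marks the main diagonal; comparing two chains in one matrix would show inter-molecular similarity as off-diagonal lines.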
Data collection

Proteins representing the serpin family were selected for analysis (PDB IDs): 1OVA, uncleaved ovalbumin (chain A), called OVA(A) here; 1ATT, cleaved bovine antithrombin (chain A), called ATT(A); 2ACH, cleaved human alpha-1-antichymotrypsin (chain A), called ACH(A); 1AZX, human antithrombin (chain I), called AZX(I); 2ANT, human antithrombin (chain L), called ANT(L); and 7API, alpha-1-antitrypsin (chain A), called API(A). The proposed method was compared with another one commonly used for the same purpose, the DALI program (Guda et al., 2001).

Fig. 1. Semihomologous relationships among proteinaceous amino acids. Codons of residues along each axis differ by only one nucleotide (Leluk, 1998). The diagram shows codon changes at the first (axis 1), second (axis 2) or third (axis 3) position in the order A→G→C→U. A transition-type replacement is represented by a solid line along the entire section; a dashed line on any fragment of the connection between two residues denotes a transversion. This diagram forms the basis of the genetic semihomology algorithm, in place of the matrix of statistical parameters or replacement indices that supports statistical algorithms.

Sequence similarity

We chose the algorithm of genetic semihomology (Leluk, 1998, 2000a,b) as the most appropriate and informative tool for comparative analysis of the sequences. The reasons for using it were to improve the accuracy of protein sequence comparison, to avoid wrong assumptions and misinterpretation of the results, and to increase the amount of information available from such a study. The genetic semihomology algorithm uses no stochastic scoring matrix (Leluk, 1998, 2000a,b). The initial assumptions are reduced to two: the codon to amino acid translation table, and the assumption that a single transition/transversion is the most frequent mechanism of differentiation among homologous proteins.
Another significant feature distinguishing the genetic semihomology approach from those based on statistical matrices is that it considers both levels (genetic and amino acid) simultaneously. Consequently the analysis is more complete with respect to all significant processes controlling evolutionary variability (the possible mutation scale at the nucleotide sequence level and the selection criteria at the amino acid level). Instead of PAM-like or BLOSUM-like matrices, the principal component of this algorithm is a three-dimensional diagram representing all theoretically possible genetic relationships between genetically encoded amino acids (Figure 1). It shows all possible amino acid replacements as the effect of a single transition/transversion, and the genetic distance between amino acids depending on their codon sequence. Thus, the semihomology approach enables thorough analysis, reconstruction and/or prediction of the greatest number of mechanisms of variability that refer to the existing natural sequences. This nonstatistical approach has been successfully used for theoretical analysis of several protein families of different nature, origin and location (Leluk, 2000a,b,c; Leluk et al., 2001, 2002; Hanus-Lorenz et al., 2001; Leluk and Grabiec, 2001) with respect to: (1) the locations of homologous and semihomologous sites in compared proteins; (2) precise estimation of insertion/deletion gaps in nonhomologous fragments; (3) analysis of internal homology and semihomology; (4) the precise locations of domains in multidomain proteins; (5) estimation of the genetic code of nonhomologous fragments; (6) construction of genetic probes; (7) studies on differentiation processes among related proteins; (8) estimation of the degree of relationship among related proteins; (9) studies on the evolutional mechanism within homologous protein families; and (10) confirmation of actual relationships of sequences showing a low degree of homology. 
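The basic relationship underlying the diagram, two amino acids connected by a single nucleotide change, can be illustrated with a short sketch. This is not the Semihom program: it only enumerates codon pairs of the standard genetic code that differ at exactly one position and labels the change as a transition or a transversion. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
PURINES = {"A", "G"}

def substitution_kind(x, y):
    """Purine<->purine or pyrimidine<->pyrimidine is a transition;
    any purine<->pyrimidine change is a transversion."""
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"

def codons_for(aa):
    """All codons that encode the one-letter amino acid aa."""
    return [codon for codon, a in CODON_TABLE.items() if a == aa]

def single_step_paths(aa1, aa2):
    """All (codon1, codon2, position, kind) where the two codons differ
    at exactly one nucleotide; position is 1-based."""
    paths = []
    for c1 in codons_for(aa1):
        for c2 in codons_for(aa2):
            diffs = [k for k in range(3) if c1[k] != c2[k]]
            if len(diffs) == 1:
                k = diffs[0]
                paths.append((c1, c2, k + 1, substitution_kind(c1[k], c2[k])))
    return paths

asp_asn = single_step_paths("D", "N")  # first-position G<->A transitions
asp_glu = single_step_paths("D", "E")  # third-position transversions
trp_phe = single_step_paths("W", "F")  # no single-step connection
```

An empty result, as for tryptophan and phenylalanine, means that at least two nucleotide changes are needed, so the pair would not be joined by a single edge in a diagram of this kind.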
The nonstatistical semihomology approach for comparative analysis of protein primary structure has been used successfully to discover, verify and/or describe in detail some important mechanisms controlling protein differentiation pathways, as in the following studies: the significant role of cryptic mutations in extending the variability spectrum (Leluk, 2000a; Leluk et al., 2001), the non-Markovian nature of protein evolution (Leluk, 2000b), asymmetry in replacement frequency between amino acids, depending on their codons (Leluk, 2000b), analysis of the mutational variability mechanism in protein differentiation (Leluk, 2000b,c; Leluk et al., 2001), and the high occurrence of correlated mutations, including spatially very distant amino acid residues (Leluk et al., 2002). The correct multiple alignment and consensus sequence construction achieved by the genetic semihomology approach were also the starting point for predicting the tertiary structure of the proteinase inhibitor family from squash seeds (Leluk, 2000c).

Fig. 2. V-angle and ln(R) profiles and dot-matrix representing inter- and intra-molecular similarity for V and ln(R) as structural similarity criteria. (A) V-angle distribution in the ATT(A) molecule; (B) V-angle distribution in the OVA(A) molecule; (C) ln(R) distribution in the ATT(A) molecule; (D) ln(R) distribution in the OVA(A) molecule; (E) the upper right part of the matrix represents the similarity distribution when the V-angle is taken as the criterion, and the lower left part represents the similarity distribution when ln(R) is taken as the criterion; (F) similarity dispersion in and between OVA(A) and ATT(A) when the ln(R) and V-angle criteria are used simultaneously.
The algorithm is assisted by some other useful approaches, such as the significant similarity estimation method (Leluk and Grabiec, 2001) and the FEEDBACK program, which identifies the allowed/forbidden sets at all sequence positions, as determined by the presence/absence of a particular residue at a particular position (Leluk et al., 2002).

IMPLEMENTATION

Structural similarity

The V and R parameter profiles, calculated according to the procedure presented in Methods, were used to characterize the structure of OVA(A) and ATT(A) (Figure 2A–2D). High values of V and ln(R) are present in β-structure fragments of the OVA(A) (150–250 aa) and ATT(A) (190–260 aa) proteins.

Fig. 3. Relation between V-angle and ln(R). The thick line represents the theoretical relation found for model pentapeptides. Dots represent V-angle and ln(R) values for pentapeptides in real proteins. The thin line represents the approximation curve obtained from the observed parameters. Proteins taken for analysis: (a) OVA(A); (b) ATT(A). Plots representing the magnitude of deviation of experimentally observed values of V and ln(R) versus the theoretical parabolic curve found for model pentapeptides: (c) OVA(A); (d) ATT(A).

The dot-matrix for two similar proteins, OVA(A) and ATT(A), is shown in Figure 2E and 2F. Besides the internal similarity revealed in the dot-matrix, the inter-molecular similarity is very easily distinguishable. Two sub-diagonals can be seen in the upper right and lower left quarters of the map, showing inter-molecular similarity. Simultaneous similarity in R and V is observed rather seldom; such a case is shown in Figure 2F. The origin of this effect is explained next.

MODEL AND REAL PROTEINS

The mathematical relation between ln(R) and V for model pentapeptides has a parabolic form (Roterman, 1995a,b). Real proteins demonstrate a dispersion versus the theoretical curve (Figure 3a,b).
The thick line represents the theoretical relation according to the second-degree polynomial found for the model pentapeptides (Roterman, 1995a,b). The dots represent ln(R) versus V as observed in ATT(A) and OVA(A). The thin lines represent the relation as an approximation curve for the experimental points in the analyzed proteins. The same relation is almost linear for OVA(A) and ATT(A). The profile of DIS along the polypeptide chain in OVA(A) and ATT(A) is shown in Figure 3c,d.

The V-angle and DIS profiles were used to compare the analyzed serpin family members. The results are shown in Figure 4 together with the results of sequence comparison (Figure 4a) and structure comparison (Figure 4c) based on Cα–Cα atom overlapping (according to the DALI procedure; Guda et al., 2001). Each point on the thick line in Figure 4b represents a 25 aa fragment in which more than 50% of the residues have a V-angle varying by no more than 20% (calculated according to the expression for S) and a DIS parameter with an S value lower than 50%. Each point on the thin line represents a 25 aa fragment satisfying the V-angle similarity condition and having DIS similarity, measured with the S parameter, below 60%. The ATT(A) protein was taken as the template structure to which all others were compared pair-wise. The diagrams shown in Figure 4 reveal the same V-angle distribution and the same deformation, in the sense of radius of curvature measured versus the expected ln(R) values.

Fig. 4. Similarity between serpin family members (ACH(A), ANT(L), API(A), AZX(I), OVA(A)) as it appears using: (a) sequence similarity obtained according to the Semihom algorithm; (b) V and DIS parameter similarity, where thick lines represent similarity for S = 50% and thin lines for S = 60%; (c) the DALI procedure (Cα–Cα distance between overlapped protein molecules). The ATT(A) structure was taken as the template structure for comparison in all methods. The horizontal axis represents the sequence of the template molecule (ATT(A)). The vertical axis represents the sequence of the compared chain (in relative numbers versus the sequence of the target molecule). A line parallel to the horizontal axis (0 on the vertical axis scale) represents the situation when two chains are similar without any shift in their sequences; in a dot-matrix this situation is represented by a diagonal. Negative values on the vertical axis represent a deletion in the second compared chain, while positive values represent an insertion.

DISCUSSION

The analysis presented in this paper shows that the proposed geometric parameters can be used to search for structural similarity in proteins, although the criteria selected for this search differ from the standard ones. The profiles of ln(R) and V as they appear in proteins express the visual characteristics of the polypeptide ribbon. They are correlated in model peptides (in the model peptide all five amino acids have the same Phi, Psi angles). In real proteins this correlation is somewhat disturbed, as shown in Figure 3. The explanation is that the V-angle is an attribute of the backbone, while the R-value describes the shape adopted by the polypeptide in a particular fragment. The empirical function expressing the dependence of ln(R) on the V-angle describes structures in which the mutual orientation of the peptide bond planes (backbone structure) agrees with the resulting radius of curvature for the particular fragment. The radius of curvature is influenced by the V-angle on the one hand and depends on inter-side-chain interactions on the other. The latter force allows the backbone to adopt a nonrelaxed radius of curvature for the polypeptide ribbon fragment.
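The DIS parameter used in this comparison can be sketched as the observed ln(R) minus the value expected from the model parabola at the same V-angle. The coefficients of the empirical polynomial are not given in this text, so the sketch takes them as an argument; the numbers in the example are placeholders chosen only to make the arithmetic easy to follow, and the function names are hypothetical.

```python
def expected_ln_r(v, coeffs):
    """Second-degree polynomial ln(R) = a*V**2 + b*V + c.
    coeffs = (a, b, c) must come from the model-peptide fit."""
    a, b, c = coeffs
    return a * v * v + b * v + c

def dis_profile(v_profile, ln_r_profile, coeffs):
    """DIS per residue: observed ln(R) minus expected ln(R).
    Positive values mean a larger radius of curvature than expected."""
    return [
        ln_r - expected_ln_r(v, coeffs)
        for v, ln_r in zip(v_profile, ln_r_profile)
    ]

# Placeholder coefficients (NOT the published empirical function).
placeholder_coeffs = (1.0, 0.0, 0.0)
dis = dis_profile([1.0, 2.0], [2.0, 3.0], placeholder_coeffs)
```

With the sign convention above, the first residue of the toy example has a larger radius than the parabola predicts and the second a smaller one, which corresponds to the ‘squeezed spring’ case.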
The deviation from the theoretical curve in real proteins can be explained as the consensus between these two factors: backbone structure and side-chain interactions. The characteristic distribution of polypeptide chain fragments deviating from the relaxed form reveals the specificity of inter-side-chain interactions.

The Semihom and DALI programs revealed a high similarity between the patterns representing sequence and 3D structural similarity in the analyzed proteins (Figure 4). The similarity distribution found using the newly introduced method agrees only partially with the results obtained using the standard method, DALI (pair-wise alignment, distance-based score). The disagreement is due to differences in the criteria of comparison. The fragments that appear similar in both the DALI and VeaR methods are those that adopted a similar V and R system and consequently have a similar spatial arrangement. The fragments found similar by the DALI criteria but dissimilar by the VeaR method are fragments whose similarity does not share the same structural origin, although for each fragment similar according to DALI a sort of incipient similarity based on V and R characteristics can be found.

The term ‘consensus structure’, analogous to ‘consensus sequence’, can be proposed. Since the structural parameters can be expressed in quantitative form, an ‘averaged’ structure with averaged V-angles and R-radii for structurally similar fragments can be described. Comparison of V and R profiles between the target molecule and the predicted form of the protein structure could be useful in the CASP project (Hubbard, 1999), especially since the proposed parameters appear to be related to polypeptide chain folding (Xu et al., 1999). Random coiled fragments, which are usually difficult to identify, can also be analyzed uniformly with secondary structure fragments when the proposed parameters are used.
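One simple reading of an ‘averaged’ structure is a position-by-position mean of the parameter profiles of fragments already judged structurally similar. The sketch below assumes the fragments have been aligned and trimmed to equal length beforehand; that alignment step, and the function name `consensus_profile`, are assumptions of this illustration and are not specified in the paper.

```python
from statistics import fmean

def consensus_profile(profiles):
    """Mean value at each position across equal-length aligned profiles
    (for example V-angles or ln(R) values of similar fragments)."""
    if not profiles:
        return []
    length = len(profiles[0])
    if any(len(p) != length for p in profiles):
        raise ValueError("all profiles must have the same length")
    return [fmean(column) for column in zip(*profiles)]

# Two toy V-angle fragments of equal length.
consensus_v = consensus_profile([[10.0, 20.0], [20.0, 40.0]])
```

Averaging ln(R) rather than R itself would keep nearly extended fragments from dominating the mean, but which of the two the authors intend is not stated here.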
ACKNOWLEDGEMENT

Many thanks to Anna Smietanska for technical support. This study was partially supported by ICM grant BST 783/2002.

REFERENCES

Bienkowska,J., Yu,L., Zarakhowich,S., Rogers,R.G.Jr and Smith,T.F. (2000) Protein fold recognition by total alignment probability. Proteins, 40, 451–464.

Fetrow,J.S., Siew,N., DiGennaro,J.A., Martinez-Yamout,M., Dyson,H.J. and Skolnick,J. (2001) Genomic-scale comparison of sequence- and structure-based methods of function prediction: does structure provide additional insight? Protein Sci., 10, 1005–1014.

Fisher,D. (1999) Rational structural genomics: affirmative action for ORFans and the growth in our structural knowledge. Protein Eng., 12, 1029–1030.

Guda,C., Scheeff,E.D., Bourne,P.E. and Shindyalov,I.N. (2001) A new algorithm for the alignment of multiple protein structure using Monte Carlo optimization. Proceedings of Pacific Symposia on Biocomputing, 6, 25–286.

Hanus-Lorenz,B., Hryniewicz-Jankowska,A., Leluk,J., Lorenz,M., Skała,J. and Sikorski,A.F. (2001) Spectrin motifs are detected in plant and yeast genomes, 8th International W. Mejbaum-Katzenellenbogen’s Seminar on Membrane Skeleton and Its Regulatory Functions, Szklarska Poręba 2001. Cell. Mol. Biol. Lett., 6, 207.

Hubbard,T.J.P. (1999) RMS/Coverage Graphs: a quantitative method for comparing three-dimensional protein structure prediction. Proteins, Suppl. 3, 15–21.

Irving,J.A., Whisstock,J.C. and Lesk,A.M. (2001) Protein structural alignments and functional genomics. Proteins, 42, 378–382.

Jung,J. and Lee,B. (2000) Protein structure alignment using environmental profiles. Protein Eng., 13, 535–543.

Kawabata,T. and Nishikawa,K. (2000) Protein structure comparison using the Markov transition model of evolution. Proteins, 41, 108–122.

Kolinski,A., Betancourt,M.R., Kihara,D., Rotkiewicz,P. and Skolnick,J. (2001) Generalized comparative modeling (GENECOMP): a combination of sequence comparison, threading and lattice modeling for protein structure prediction and refinement. Proteins, 44, 133–149.

Leibowitz,N., Fligelman,Z.Y., Nussinov,R. and Wolfson,H.J. (2001) Automated multiple structure alignment and detection of a common substructural motif. Proteins, 43, 235–245.

Leluk,J. (1998) A new algorithm for analysis of the homology in protein primary structure. Computers and Chemistry, 22, 123–131.

Leluk,J. (2000a) A non-statistical approach to protein mutational variability. BioSystems, 56, 83–93.

Leluk,J. (2000b) Regularities in mutational variability in selected protein families and the Markovian model of amino acid replacement. Computers and Chemistry, 24, 659–672.

Leluk,J. (2000c) Serine proteinase inhibitor family in squash seeds: mutational variability mechanism and correlation. Cell. Mol. Biol. Lett., 5, 91–106.

Leluk,J. and Grabiec,M. (2001) Sequence similarity estimation and correlated mutations in selected protein families. I. An approach to protein sequence similarity estimation. Ist Summer School on ‘Parallel Computing in Biomolecular Simulations’, September 1–3 2001, Gdansk, Poland; Abstracts L-5.

Leluk,J., Hanus-Lorenz,B. and Sikorski,A.F. (2001) Application of genetic semihomology algorithm to theoretical studies on various protein families. Acta Biochim. Polon., 48, 21–33.

Leluk,J., Sobczyk,M. and Becella,Ł (2002) Correlated mutations in selected protein families. TASK Quarterly, in press.

Maggiora,G.M., Rohrer,D.C. and Mestres,J. (2001) Comparing protein structures: a Gaussian-based approach to the three-dimensional structural similarity of proteins. J. Mol. Graph. Model., 19, 168–178.

Ortiz,A.R., Kolinski,A., Rotkiewicz,P., Ilkowski,B. and Skolnick,J. (1999) Ab initio folding of proteins using restraints derived from evolutionary information. Proteins, Suppl. 3, 177–185.

Roterman,I. (1995a) Modelling the optimal simulation path in the peptide chain folding: studies based on geometry of alanine heptapeptide. J. Theor. Biol., 177, 283–288.

Roterman,I. (1995b) The geometrical analysis of peptide backbone structure and its local deformations. Biochimie, 77, 204–216.

Sauder,J.M., Arthur,J.W. and Dunbrack,Jr,R.L. (2000) Large-scale comparison of protein sequence alignment algorithms with structure alignment. Proteins, 40, 6–22.

Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.

Szostakowski,J.D. and Weng,Z. (2000) Protein structure alignment using a genetic algorithm. Proteins, 38, 428–440.

Xu,Y., Xu,D., Crawford,O.H., Einstein,J.R., Larimer,F., Uberbacher,E., Unseren,M.A. and Zhang,G. (1999) Protein threading by PROSPECT: a prediction experiment in CASP3. Protein Eng., 12, 899–907.