Chem-Bio Informatics Journal, Vol. 3, No. 4, pp. 194-200 (2003)
Copyright 2003 Chem-Bio Informatics Society http://www.cbi.or.jp

Common Pattern of Coarse-Grained Charge Distribution of Structurally Analogous Proteins

Kenichiro Imai* and Shigeki Mitaku

Nagoya University, Graduate School of Engineering, Department of Applied Physics, Nagoya, Chikusa-ku, Furocho, 464-8606, Japan

*E-mail: [email protected]

(Received November 5, 2003; accepted December 22, 2003; published online December 31, 2003)

Abstract

Structurally analogous protein pairs with low sequence identity, such as analogues and remote homologues, comprise a large part of structurally similar pairs, thus complicating the relationship between sequence and structure. To obtain clues for clarifying such intricate relationships, we developed a method to analyze the coarse-grained charge distribution in an amino acid sequence and analyzed the pattern of charge distribution for pairs of structurally similar proteins with sequence identities lower than 20%. We found two types of pairs: those with similar patterns of charge distribution and those with inverted charge distribution. This finding suggests that the charge distribution in a sequence might be a good parameter for clustering structures as analogues and remote homologues. The possibility of automatic fold recognition is discussed on the basis of a quantitative comparison of charge distribution patterns.

Key Words: analogue, remote homologue, charge distribution, structural biology, bioinformatics

Area of Interest: Bioinformatics and Bio computing

1. Introduction

Proteins with high amino acid sequence similarity generally adopt similar structures. The majority of structurally similar pairs, however, have low sequence identity (less than 20%) [1][2]. Protein pairs with weak identity are classified as remote homologues or analogues according to their functional similarity [3][4]. The occurrence of protein pairs with low sequence homology and high structural similarity complicates the analysis of the relationship between sequence and structure. If well-defined physical parameters can be used to model this intricate relationship, they will provide a new method for the annotation of orphan genes in a genome.

One efficient way of clustering the shapes of proteins is the so-called coarse-graining of physicochemical parameters of amino acid sequences. For example, a hydropathy plot is a kind of coarse-graining of hydropathy values [5]. Transmembrane regions in an amino acid sequence correspond well to the peaks in the hydropathy profile. Up to now, however, similar coarse-graining approaches have rarely been applied to soluble proteins. Sippl and his group reported a prediction system for protein fold recognition [6]. Their system uses an effective force potential between amino acids, even though the physical meaning of this potential is not clear. In this study, we focused on the net charges of amino acid sequences and found very similar patterns of coarse-grained charge distribution for protein pairs of analogues and remote homologues.

In general, the electrical interaction in water is much weaker than that in a nonpolar environment because of the high dielectric constant (permittivity) of water. However, it is well known that the electrostatic interaction between large colloidal particles carrying large electrical charges is very important for their stability. In the same way as for colloidal particles, the electrostatic interaction may be an important factor in protein folding when an amino acid sequence carries clumps of electrical charges.
Thus, we devised a charge density plot (CD plot) for estimating the coarse-grained charge distribution of an amino acid sequence. By comparing the CD plots of analogous protein pairs, we found that the charge distributions of several pairs were very similar or showed an inverted pattern.

2. Methods

2.1 Charge density plot (CD plot)

The charge density plot (CD plot) is a method in which the net electrical charge densities of polypeptide segments of various lengths are plotted in pseudo-color according to the following procedure, which is illustrated in Figure 1.

(1) An amino acid sequence is transformed into a sequence representing the number of elementary charges, in which Lys, Arg and His are +1, Asp and Glu are -1, and other residues are 0. The charge of His depends on pH, but the pattern of the CD plot did not depend strongly on the charge value assigned to His, because His residues are neither abundant nor clustered in most proteins. Also, the charges at the N- and C-terminal ends had no effect on the pattern of the CD plot.

(2) The density of the net charge for every segment in an amino acid sequence, from the i-th to the j-th residue, was calculated by the following equation:

\[ CD(i,j) = CD(j,i) = \frac{\sum_{k=i}^{j} C_T(k)}{|j-i|+1} \tag{1} \]

where the parameter $C_T(k)$ is 1, 0, or -1, corresponding to a positive, neutral, or negative charge at residue k, respectively.

(3) CD(i,j) is represented by a pseudo-color and plotted at the positions (i,j) and (j,i). As shown in Figure 1, blue and red represent positive and negative charges, respectively.

Figure 1. The procedure for the calculation of the CD plot. Each point of the CD plot is colored according to the charge density calculated by equation (1). Blue and red are the pseudo-colors of positive and negative charges, respectively. The position of a point (i, j) is determined by the sequence numbers, i and j, of the N- and C-terminal sides of a segment. A CD plot of interferon α-2A (1itf) is shown as an example. The rainbow bar to the right of the CD plot is colored in the order of the secondary structures.

2.2 Comparison of two charge density plots

Two charge density plots were compared by equation (2). We used the sum of the squares of $CD_A(i,j) - CD_B(i,j)$ or $CD_A(i,j) + CD_B(i,j)$ to compare the charge density plots of two proteins, where the suffixes A and B indicate the two amino acid sequences. The inverse similarity of the charge distribution can be estimated from $CD_A(i,j) + CD_B(i,j)$.

\[ \langle S_{\pm} \rangle = \frac{\sum_{|i-j| \ge 20} \left( CD_A(i,j) \pm CD_B(i,j) \right)^2}{\sum \left( CD_A(i,j) \right)^2 \, \sum \left( CD_B(i,j) \right)^2} \tag{2} \]

The similarity of patterns and of inverted patterns is estimated by the parameters <S-> and <S+>, respectively. Because the numerator vanishes when the two plots are identical (for <S->) or exactly inverted (for <S+>), smaller values indicate a closer match. Local properties were neglected: only segments longer than 20 residues (|i - j| >= 20) were used for the comparison.

Figure 2. CD plots of (a) thiamin phosphate synthase (PDB id 2tpsA) and (b) KDPG aldolase (1fq0A), both of which adopt a TIM barrel fold.

2.3 Dataset of analogous protein pairs

We constructed a dataset of protein pairs with pairwise sequence identity lower than 20% and structural similarity with RMSD less than 4.0 Å from the DBAli database [7]. In addition, we removed the pairs whose sequence lengths differed by more than 30 residues and the pairs that had only partial structural similarity, finally obtaining 256 protein pairs. The 3D structures and the sequences were taken from the PDB, and DS ViewerPro 5.0 (Accelrys) was used for the graphical representation.

Figure 3. CD plots of (a) pyrrolidone carboxyl peptidase (1a2zA) and (b) purine nucleoside phosphorylase (1ecpB).

3. Results and Discussion

The CD plots of protein pairs with low sequence homology and high structural similarity showed that several protein pairs have similar charge distributions. For example, thiamin phosphate synthase (PDB id 2tpsA) and KDPG aldolase (1fq0A), which both adopt a TIM barrel structure, showed very similar charge distributions in spite of their weak pairwise identity (Figure 2). The structural similarity of these protein pairs cannot be identified from the amino acid sequence itself. However, when that sequence is transformed into a sequence of net electric charges, the similarity becomes visible, suggesting that the coarse-grained charge distribution can be a good physical parameter of protein folding.

We also found several pairs with an inverted pattern of charge distribution. Pyrrolidone carboxyl peptidase (1a2zA) and purine nucleoside phosphorylase (1ecpB) showed very similar structures, but their CD plots gave inverted profiles of positive and negative charge distribution (Figure 3). Inverted charge distributions for structurally similar proteins are physically reasonable, because the electrostatic forces are the same when the signs of all charge clusters are inverted.

Figure 4. Histograms of the value of <S-> (a) and <S+> (b), which evaluate the similarity and inversion of the CD plot patterns, respectively, for 256 structurally analogous pairs. The graphs at the bottom are the histograms of <S-> and <S+> for all 256 structurally analogous pairs, and the graphs at the top are the histograms of <S-> and <S+> for the pairs that show distinct similarity and inversion of the pattern, respectively.

Figure 4 shows histograms of <S-> and <S+>.
The upper graphs show the histograms for protein pairs that have very similar or inverted charge density maps, and the lower graphs are the histograms for all 256 pairs. It is clear that the parameters <S-> and <S+> reflect the similarity of the charge distributions. This fact suggests the possibility of achieving protein fold recognition from coarse-grained charge distributions.

There are many genes in a genome that cannot be annotated by sequence alignment programs. These genes would likely code for many remote homologous and analogous proteins. However, we cannot obtain any information about those proteins by the comparison of amino acid sequences alone. Therefore, the comparison of similar or inverted charge profiles for protein pairs with low sequence homology and high structural similarity may give new insight into the mechanism of protein folding. Furthermore, this common charge distribution may provide a novel method for the annotation of genes on a genome-wide scale.

References

[1] B. Rost, Folding & Design, 2, S19-24 (1997).
[2] J. M. Sauder, J. W. Arthur, R. L. Dunbrack, Jr., Proteins, 40, 6-22 (2001).
[3] R. B. Russell, J. G. Barton, J. Mol. Biol., 244, 332-350 (1994).
[4] R. B. Russell, M. A. S. Saqi, R. A. Sayle, P. A. Bates, M. J. E. Sternberg, J. Mol. Biol., 269, 423-439 (1997).
[5] J. Kyte and R. F. Doolittle, J. Mol. Biol., 157, 105-132 (1982).
[6] H. Floeckner, M. Braxenthaler, P. Lackner, M. Jaritz, M. Ortner and M. J. Sippl, Proteins, 23, 376-386 (1995).
[7] M. A. Marti-Renom, V. A. Ilyin, A. Sali, Bioinformatics, 17, 746-747 (2001).
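
Supplementary sketch. The procedures of Sections 2.1 and 2.2 can be illustrated with a short Python example. This is a minimal sketch under stated assumptions, not the original implementation: the function names residue_charges, cd_matrix and s_score are hypothetical, sequences are given in one-letter code, indices are zero-based, two sequences of different length are compared over their common leading length only, all three sums run over the same |i - j| >= 20 region, and the normalization follows one possible reading of equation (2).

```python
# Illustrative sketch only; names and conventions below are assumptions,
# not the original implementation of the CD plot.

POSITIVE = frozenset("KRH")  # Lys, Arg, His counted as +1
NEGATIVE = frozenset("DE")   # Asp, Glu counted as -1


def residue_charges(sequence):
    """Step (1): map a one-letter sequence to charges +1, 0 or -1."""
    return [1 if aa in POSITIVE else -1 if aa in NEGATIVE else 0
            for aa in sequence.upper()]


def cd_matrix(sequence):
    """Step (2): CD(i, j) = net charge of residues i..j / segment length."""
    charges = residue_charges(sequence)
    n = len(charges)
    prefix = [0]  # prefix[k] = total charge of the first k residues
    for charge in charges:
        prefix.append(prefix[-1] + charge)
    cd = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            density = (prefix[j + 1] - prefix[i]) / (j - i + 1)
            cd[i][j] = cd[j][i] = density  # the plot is symmetric
    return cd


def s_score(cd_a, cd_b, sign, min_separation=20):
    """Equation (2): sign=-1 scores similarity, sign=+1 scores inversion.

    Assumptions: only the common leading length of the two matrices is
    compared, and all three sums run over |i - j| >= min_separation.
    """
    n = min(len(cd_a), len(cd_b))
    numerator = sum_a = sum_b = 0.0
    for i in range(n):
        for j in range(n):
            if abs(i - j) >= min_separation:
                a, b = cd_a[i][j], cd_b[i][j]
                numerator += (a + sign * b) ** 2
                sum_a += a * a
                sum_b += b * b
    if sum_a == 0.0 or sum_b == 0.0:
        raise ValueError("no net charge in the compared region")
    return numerator / (sum_a * sum_b)
```

For the toy sequence "KKDD", cd_matrix gives a density of 1.0 for the Lys-Lys segment and 0.0 for the whole sequence. With min_separation lowered to 2 so that such a short sequence can be scored, s_score with sign=-1 is 0 for two identical plots, and s_score with sign=+1 is 0 for "KKDD" against its charge-inverted counterpart "DDKK". The prefix sums keep the calculation at O(n^2) for a sequence of n residues.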