* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Prediction of Nickel Binding Sites in Proteins from Amino acid
Survey
Document related concepts
Implicit solvation wikipedia , lookup
Bimolecular fluorescence complementation wikipedia , lookup
Protein domain wikipedia , lookup
List of types of proteins wikipedia , lookup
Circular dichroism wikipedia , lookup
Structural alignment wikipedia , lookup
Protein purification wikipedia , lookup
Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup
Protein moonlighting wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Western blot wikipedia , lookup
Cooperative binding wikipedia , lookup
Homology modeling wikipedia , lookup
Protein mass spectrometry wikipedia , lookup
Intrinsically disordered proteins wikipedia , lookup
Transcript
International Journal of Energy, Sustainability and Environmental Engineering Vol. 1 (1), September 2014, pp. 19-23 Prediction of Nickel Binding Sites in Proteins from Amino acid Sequences by Radial Basis Function Neural Network Classifier G C Dash1, M R Panigrahi2, J K Meher3, M K Raval4 1 Department of Chemistry, APS College, Roth, Balangir, Odisha, India. Department of Chemical Engineering, Orissa Engineering College, Bhubaneswar, India 3 Department of Computer Science and Engg, Vikash College of Engg for Women, Bargarh, Odisha, India 4 Department of Chemistry, Gangadhar Meher College, Sambalpur, Odisha, India 2 Received 30 August 2014; accepted 12 September 2014 Abstract Prediction of Ni-binding proteins has not received intensive and specific attention. There is also a need for a sequence based predictive method for metal binding site with high accuracy so that it can be applied at proteomic scale. A radial basis function neural network classifier is developed to predict nickel binding sites in proteins from sequence alone data. Two physicochemical parameters namely, polarizability and partial charge of the ligand atom of amino acid side chains are used as features for the classifiers. The accuracy of the classifier is 94%. Keywords Ni-binding, neural network, physicochemical parameters, ligand, polarizability The crucial test of knowledge of coordination chemistry lies in our ability to predict metal binding to ligands or proteins along with their thermodynamic and kinetic stabilities. Because of the functional and sequence diversity of metalloproteins, it is necessary to design algorithms, which include different aspects of metal-ligand complex formation for predicting metal-binding proteins. The problem has been approached from different direction to develop algorithms with greater accuracy. An approach to begin with is to classify as metal binding and non-metal binding proteins. Obvious choice in this situation is Support Vector Machine (SVM) method1. The method predicts 67% of the 87 metal-binding proteins non-homologous to any protein in the Swissprot database and 85.3% of the Corresponding Author: G C Dash e-mai: [email protected] M R Panigrahi e-mail: [email protected] J K Meher e-mail: [email protected] M K Raval e-mail: [email protected] 333 proteins of known metal-binding domains as metal-binding. These suggest the usefulness of SVM for facilitating the prediction of metal-binding proteins1. Structural information also has been used for predicting metal-binding sites based on the detection of principal liganding residues and metalligand complex architectures2, the use of common local structural parameters2, combination of sequence and structural profiles3, analysis of bond strength contributions4, and the computation of force fields5. However, metal specificity in proteins with loosely or temporarily bound metals, such as enzymes that use metal ions as cofactors, are often poorly characterized6. Therefore, sequence-based computational methods appear to be useful for these types of proteins and whose 3D structures are not determined experimentally7. Besides, sequence similarity methods, the sequence-based methods include metal-binding sites sequence motifs8, multiple sequence alignments against known metalbinding proteins, and neural networks of sequence segments of amino acids of higher metal-binding propensity9,10. Combinatorial application of multiple structural, sequence alignments and annotation methods has been found to be highly useful for improving prediction accuracy of metal-binding proteins1. Empirical force field method – The empirical force field Fold-X is developed to predict the 20 Int J Ener Sustain Env Engg, September 2014 position of single atom ligands (structural water molecules and metal ions). Fold-X predicts 76% of water molecules interacting with two or more polar atoms of proteins in high-resolution crystal structures and within 0.8 Å standard deviation in position on average. The prediction of metal ionbinding sites have accuracy between 90% and 97% depending on the nature of metal ion, with an average standard deviation of 0.3 – 0.6 Å on the position of binding. The force field includes Mg2+, Ca2+, Zn2+, Mn2+, and Cu2+. The accuracy of the energy prediction using the force field is sufficient to efficiently discriminate between Mg2+, Ca2+, and Zn2+ binding5. Sequence based prediction method – Algorithms to predict of metal binding sites in proteins from sequence are helpful in identifying the metalloproteins and annotation of uncharacterized proteins on a genomic scale. Development of such algorithms are highly challenging due to the enormous amount of alternative candidate configurations7. Passerini et al. (2012) develop new algorithm based on structured output learning for determining transition-metal-binding sites coordinated by cysteines and histidines. The inference step (retrieving the best scoring output) is intractable for general output types (general graphs). Metal binding has been proved to be the algebraic structure of a matroid assuming that no residue can coordinate more than one metal ion. This allows to employ greedy algorithm7. Test of predictor in a highly stringent setting where the training set consists of protein chains belonging to SCOP folds achieves 56% precision and 60% recall in the identification of ligand-ion bonds7. Another sequence based predictor- MetalloPred, is developed, which consists of three levels of hierarchical classification using cascade of neural networks from sequence derived features9. The 1st layer of the prediction engine is for identifying whether a protein is a metalloprotein or not. The 2nd layer is for determining the main functional class. The 3rd layer is for identifying the sub-functional class. The overall success rates for all the three layers are ~ 60% that were obtained through crossvalidation tests on the datasets of non-homologous proteins with cut off of 30% sequence identity in the same class or subclass9. We have made an attempt to use hard and soft acid-base (HSAB) concept in metal complex formation considering the concerned two physicochemical parameters: partial charge and polarizability of binding ligand atoms of amino acid residues, to predict nature of metal binding to proteins. Metal ion Ni is selected for the study as prediction of Ni binding to the protein has been not been intensively and specifically studied so far. A neural network classifier has been applied yielding an accuracy of ~95%. Materials and Datasheet Ni-Binding Protein Data Set Co-ordinate files of metalloproteins are obtained from the protein data bank (PDB)11 (www.pdb.org/pdb). For the study in this work we have chosen a non-redundant set of PDB files of Nibinding proteins determined by the X-Ray diffraction experimental method and reported up to September 2010. The available PDB files are manually scrutinized by removing PDB entries with proteins with single ligand to Ni ion and proteins surface binding Ni. These proteins may be contaminated with Ni ions. However, some single ligand binding Ni, those which are interfacially bound to multiple chains in multimeric proteins has been considered. Ni-Binding Sequence Data Set A set of Ni-binding segments of 15 amino acid residues are prepared by selecting sequences Si-7 to Si+7, where ith residue binds to Ni. The set contains 200 Ni-binding sequences, which are used as Nibinding data set. Similar procedure is followed for Ca and Cu-binding data sets. Randomly 15 residue long segments are selected from no-metal binding protein sequences to prepare no-metal binding data set. No-metal binding, Ca-binding and Cu-binding data sets together constitute no-Ni-binding data set. Ca ion and Cu ion represent the hard and soft metal ions respectively. Ni lies in the borderline of hard and soft classification. Hence the present classifier which is based on HSAB properties may classify Ni-binding sites in the background of no-metal binding and other metal binding sites in proteins. Calculation of Feature Parameters Polarizability and partial charge of the ligand binding atoms namely, OD1(Asp), OD2 (Asp/Asn), OE1(Glu), OE2 (Glu/Gln), OZ (Tyr), ND1(His), NE2 (His), SG (Cys/Met), are calculated in energy optimized amino acids (X) protected by acetyl (Ace) and N-methyl amide (NMe) group on Nterminus and C-terminus respectively ( AceXNMe), using QSAR module of molecular modeling software HyperChem Pro 8.0. Energy optimization is done by semi-empirical method (PM3). The parameters are assigned to those amino acids only whose side chains are associated with metal binding. Rest all are assigned zero values. Polarizability and partial charge are the two parameters of the ligand, which are considered to be 21 Das et al.: Prediction of Nickel Binding Sites crucial in explaining the metal-ligand bond formation in terms of the hard-soft acid-base (HSAB) principle12. Of course, the HSAB principle has been criticized regarding its tenability recently in case of ambident reactivity13. Table 1 Physicochemical parameters of amino acid residues used in algorithm for prediction of Ni-binding sites in proteins Amino acid Polarizability Partial charge A R N D C Q E G H L I K M F P S T W Y V 0 0 0.57 0.57 3.00 0.57 0.57 0 1.03 0 0 0 3.00 0 0 0.64 0.64 0 0.64 0 0 0 -0.5679 -0.8014 -0.3119 -0.5679 0.8014 0 -0.5727 0 0 0 -0.3119 0 0 -0.6546 -0.6546 0 -0.55791 0 Radial basis function neural network classifier (RBFNNC) In this paper we have introduced a low complexity radial basis function neural network (RBFNN) classifier to efficiently predict the sample class14-16. The potential of the proposed approach is evaluated through an exhaustive study by many benchmark datasets. The experimental results showed that the proposed method can be a useful approach for classification. A radial basis function network is an artificial neural network that uses radial basis functions as activation functions. It is a linear combination of radial basis functions. The radial basis function network (RBFNN) is suitable for function approximation and pattern classification problems because of their simple topological structure and their ability to learn in an explicit manner. In the classical RBF network, there is an input layer, a hidden layer consisting of nonlinear node function, an output layer and a set of weights to connect the hidden layer and output layer. Due to its simple structure it reduces the computational task as compared to conventional multi layer perception (MLP) network. The structure of a RBF network is shown in Fig. 1. W X 1 X 2 φ0 Y 1 φ Y 2 φ Y Input s φ X N Output 3 Y W N kj Input Hidden Output Layer Layer Layernetwork Fig. 1 Architecture of a radial basis function In the RBFNN based classifier, an input vector x is used as input to all radial basis functions, each with different parameters. The output of the network is a linear combination of the outputs from radial basis functions. For an input feature vector x, the output y of the jth output node is given as. N N k 1 k 1 y j w kjk w kje x (n ) C k 2 2k (1) The error occurs in the learning process is reduced by updating the three parameters, the positions of centers (Ck), the width of the Gaussian function (σk) and the connecting weights (w) of RBFNN by a stochastic gradient approach as defined below: (2) w(n 1) w(n) w J(n) w (3) Ck (n 1) C k (n) c J(n) Ck k (n 1) k (n) J(n) k (4) Where, J(n) 1 e(n) 2 e (n)=d(n) - y(n) 2 , is the error, d(n) is the target output and y(n) is the predicted output. w C and are the learning parameters of the RBF network. Simulation Studies and Discussions In order to compare the efficiency of the proposed method in predicting the class of the Nickel binding data we have used standard datasets. All the datasets categorized into two groups: binary class to assess the performance of the proposed method. The dataset consists of amino acid sequences of 15 characters. 200 sequences from Ni-binding protein dataset and 200 sequences from non-Ni-binding 22 Int J Ener Sustain Env Engg, September 2014 dataset are taken as training set. The feature selection process proposed in this paper includes polarizability and partial charge as shown in the Table 1. To implement the RBFNN classifier, we first read in the file of protein sequence which are represented with numerical values. The performance of the proposed feature extraction method is analyzed with the neural network classifiers: RBFNN. The leave one out cross validation (LOOCV) test is conducted by combining all the training and test samples for the classifiers with datasets17. LOOCV is a technique where the classifier is successively learned on n-1 samples and tested on the remaining one. i.e., it removes one sample at a time for testing and takes other as training set. It involves leaving out all possible subsets so the entire process is run as many times as there are samples. This is repeated n times so that every sample was left out once. Repeating these procedure n times gives us n classifiers in the end. Our error score is the number of mispredictions18. Out of 200 sequences from Nibinding protein dataset, 192 samples are detected as true positive whereas out of 250 sequences from non-Ni-binding protein dataset 17 samples are detected as false positive. The prediction accuracy has been analyzed in terms of two measuring parameters such as accuracy (A), precision (P) and recall (R). These are are defined in terms of four parameters true positive (tp), false positive (fp), true negative (tn) and false negative (fn). tp denotes the number of Nickel bindings and are also predicted as Nickel binding, fp denotes the number of actually Non nickel bindings but are predicted to be Nickel bindings, tn is the number of actually Non nickel bindings and also predicted to be Non nickel bindings, and fn is the number of actually Nickel bindings and predicted to be Non nickel bindings. Accuracy The accuracy of prediction of Ni-binding in amino acid sequence is defined as the percentage of Nibinding correctly predicted of the total binding sequences present. It is computed as follows: A t p tn t p f p tn f n (5) Precision Precision is defined as the percentage of Ni-binding correctly predicted to be one class of the total Nibinding predicted to be of that class. It is computed as: P tp (6) tp f p Recall Recall is defined as the percentage of the Nibinding that belong to a class that are predicted to be that class. Recall is computed as: R tp (7) t p fn Table 2 Measuring parameters for prediction accuracy Actual Predicted NB NNB Nickel Binding (NB) 192 (tp) Non Nickel Binding (NNB) 17 (fp) 8 (fn) 233 (tn) The accuracy, precision and recall are 0.94, 0.91, and 0.96 respectively. The accuracy of sequence based classifiers reported so far is about 65%. Hence the present classifier appears to have high accuracy compared to existing sequence based classifiers It needs to be extended for all the metal ions found in biosystems before it can be used at proteomic level. However, it fulfills the need of a classifier for Ni-binding proteins keeping in view the growing database of Ni-binding proteins along with the escalating interest of scientific community in the field during last decade. Conclusion The classifier in the present work is performs with high accuracy, to the tune of 94%. The method would be extended to data sets with all other metalbinding sequences and success with similarly high accuracy is expected. A sequence of 15 residues long is chosen arbitrarily in the present work. Further investigation is necessary to find out whether the accuracy is dependent on length of sequence. Acknowledgement The authors wish to thank management members and the principal of the college for all kinds of supports to complete this work. References 1. Lin H H, Han L Y, Zhang H L, Zheng C J, Xie B, Cao Z W & Chen Y Z, Bioinformatics, 7(Suppl 5):S13, 2006 doi:10.1186/1471-2105-7-S5-S13 2. Gregory D S, Martin A C, Cheetham J C & Rees A R, Protein Eng, 6 (1993) 29. Das et al.: Prediction of Nickel Binding Sites 23 3. Sodhi J S, Bryson K, McGuffin L J, Ward J J, 11. Berman H M, Westbrook J, Feng Z, Gilliland G, Bhat Wernisch L & Jones D T, J Mol Biol, 342 (2004) 307. 2+ 4. Nayal M & Di Cera E, Predicting Ca -binding sites in proteins, Proceeding of Natural Academy of Science, USA, 91 (1994) 817. 5. Schymkowitz J W, Rousseau F, Martins I C, Ferkinghoff-Borg J, Stricher F & Serrano L, Prediction of water and metal binding sites and their affinities by using the Fold-X force field, Proceeding of Natural Academy of Science, USA, 102 (2005) 10147. 6. Jensen M R, Petersen G, Lauritzen C, Pedersen J & Led J J, Biochem, 44 (2005) 11014. 7. Passerini A, Lippi M & Frasconi P, IEEE/ACM Tran Comput Biol Bioinform, 9 (2012) 203. 8. Rigden D J & Galperin M Y, J Mol Biol, 343 (2004) 971. 9. Naik P K, Ranjan P, Kesari P & Jain S, J Biophys Chem, 2 (2011) 111. 10. Lin C T, Lin K L, Yang C H, Chung I F, Huang C D & Yang Y S, Int J Neural Syst, 15 (2005) 71. T N, Weissig H, Shindyalov I N & Bourne P E, Nucleic Acids Research, 28 (2000) 235. 12. Glusker J P, Katz A K & Bock C W, Metal ions in biological systems, 16 (1999) 8. 13. Mayr H, Breugst M & Ofial A R, Angew Chem Int Ed, 50 (2011) 6470. 14. Powell M J D, Radial basis functions formultivariable interpolation: A review, paper presented at IMA Conference on Algorithms for the Approximationof Functions and Data, RMCS, Shrivenham, England, 1985. 15. Broomhead D S & Lowe D, Complex Systems, 2 (1988) 321. 16. Chen S, Cowan C F N & Grant P M, IEEE Trans Neural Networks, 2 (1991) 302. 17. Lachenbruch P A & Mickey M R, Technometrics, 10 (1968) 1. 18. Varma S & Simon R, BMC Bioinformatics, 7 (2006) 91.