Download Prediction of Nickel Binding Sites in Proteins from Amino acid

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Implicit solvation wikipedia , lookup

Bimolecular fluorescence complementation wikipedia , lookup

Protein domain wikipedia , lookup

List of types of proteins wikipedia , lookup

Circular dichroism wikipedia , lookup

Proteomics wikipedia , lookup

Structural alignment wikipedia , lookup

Protein purification wikipedia , lookup

Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup

Protein moonlighting wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Protein wikipedia , lookup

Western blot wikipedia , lookup

Cyclol wikipedia , lookup

Cooperative binding wikipedia , lookup

Homology modeling wikipedia , lookup

Protein mass spectrometry wikipedia , lookup

Intrinsically disordered proteins wikipedia , lookup

Protein structure prediction wikipedia , lookup

Metalloprotein wikipedia , lookup

Transcript
International Journal of Energy, Sustainability and Environmental Engineering
Vol. 1 (1), September 2014, pp. 19-23
Prediction of Nickel Binding Sites in Proteins from Amino acid Sequences by Radial
Basis Function Neural Network Classifier
G C Dash1, M R Panigrahi2, J K Meher3, M K Raval4
1
Department of Chemistry, APS College, Roth, Balangir, Odisha, India.
Department of Chemical Engineering, Orissa Engineering College, Bhubaneswar, India
3
Department of Computer Science and Engg, Vikash College of Engg for Women, Bargarh, Odisha, India
4
Department of Chemistry, Gangadhar Meher College, Sambalpur, Odisha, India
2
Received 30 August 2014; accepted 12 September 2014
Abstract
Prediction of Ni-binding proteins has not received intensive and specific attention. There is also a need for
a sequence based predictive method for metal binding site with high accuracy so that it can be applied at proteomic
scale. A radial basis function neural network classifier is developed to predict nickel binding sites in proteins from
sequence alone data. Two physicochemical parameters namely, polarizability and partial charge of the ligand atom of
amino acid side chains are used as features for the classifiers. The accuracy of the classifier is 94%.
Keywords Ni-binding, neural network, physicochemical parameters, ligand, polarizability
The crucial test of knowledge of coordination
chemistry lies in our ability to predict metal binding
to ligands or proteins along with their
thermodynamic and kinetic stabilities. Because of
the functional and sequence diversity of
metalloproteins, it is necessary to design algorithms,
which include different aspects of metal-ligand
complex formation for predicting metal-binding
proteins. The problem has been approached from
different direction to develop algorithms with
greater accuracy.
An approach to begin with is to classify as metal
binding and non-metal binding proteins. Obvious
choice in this situation is Support Vector Machine
(SVM) method1. The method predicts 67% of the
87 metal-binding proteins non-homologous to any
protein in the Swissprot database and 85.3% of the
Corresponding Author:
G C Dash
e-mai: [email protected]
M R Panigrahi
e-mail: [email protected]
J K Meher
e-mail: [email protected]
M K Raval
e-mail: [email protected]
333 proteins of known metal-binding domains as
metal-binding. These suggest the usefulness of
SVM for facilitating the prediction of metal-binding
proteins1.
Structural information also has been used for
predicting metal-binding sites based on the
detection of principal liganding residues and metalligand complex architectures2, the use of common
local structural parameters2, combination of
sequence and structural profiles3, analysis of bond
strength contributions4, and the computation of
force fields5. However, metal specificity in proteins
with loosely or temporarily bound metals, such as
enzymes that use metal ions as cofactors, are often
poorly characterized6. Therefore, sequence-based
computational methods appear to be useful for these
types of proteins and whose 3D structures are not
determined experimentally7. Besides, sequence
similarity methods, the sequence-based methods
include metal-binding sites sequence motifs8,
multiple sequence alignments against known metalbinding proteins, and neural networks of sequence
segments of amino acids of higher metal-binding
propensity9,10. Combinatorial application of multiple
structural, sequence alignments and annotation
methods has been found to be highly useful for
improving prediction accuracy of metal-binding
proteins1.
Empirical force field method – The empirical
force field Fold-X is developed to predict the
20
Int J Ener Sustain Env Engg, September 2014
position of single atom ligands (structural water
molecules and metal ions). Fold-X predicts 76% of
water molecules interacting with two or more polar
atoms of proteins in high-resolution crystal
structures and within 0.8 Å standard deviation in
position on average. The prediction of metal ionbinding sites have accuracy between 90% and 97%
depending on the nature of metal ion, with an
average standard deviation of 0.3 – 0.6 Å on the
position of binding. The force field includes Mg2+,
Ca2+, Zn2+, Mn2+, and Cu2+. The accuracy of the
energy prediction using the force field is sufficient
to efficiently discriminate between Mg2+, Ca2+, and
Zn2+ binding5.
Sequence based prediction method – Algorithms
to predict of metal binding sites in proteins from
sequence are helpful in identifying the
metalloproteins and annotation of uncharacterized
proteins on a genomic scale. Development of such
algorithms are highly challenging due to the
enormous amount of alternative candidate
configurations7. Passerini et al. (2012) develop new
algorithm based on structured output learning for
determining
transition-metal-binding
sites
coordinated by cysteines and histidines. The
inference step (retrieving the best scoring output) is
intractable for general output types (general
graphs). Metal binding has been proved to be the
algebraic structure of a matroid assuming that no
residue can coordinate more than one metal ion.
This allows to employ greedy algorithm7. Test of
predictor in a highly stringent setting where the
training set consists of protein chains belonging to
SCOP folds achieves 56% precision and 60% recall
in the identification of ligand-ion bonds7.
Another sequence based predictor- MetalloPred,
is developed, which consists of three levels of
hierarchical classification using cascade of neural
networks from sequence derived features9. The 1st
layer of the prediction engine is for identifying
whether a protein is a metalloprotein or not. The 2nd
layer is for determining the main functional class.
The 3rd layer is for identifying the sub-functional
class. The overall success rates for all the three
layers are ~ 60% that were obtained through crossvalidation tests on the datasets of non-homologous
proteins with cut off of 30% sequence identity in
the same class or subclass9.
We have made an attempt to use hard and soft
acid-base (HSAB) concept in metal complex
formation considering the concerned two
physicochemical parameters: partial charge and
polarizability of binding ligand atoms of amino acid
residues, to predict nature of metal binding to
proteins. Metal ion Ni is selected for the study as
prediction of Ni binding to the protein has been not
been intensively and specifically studied so far. A
neural network classifier has been applied yielding
an accuracy of ~95%.
Materials and Datasheet
Ni-Binding Protein Data Set
Co-ordinate files of metalloproteins are obtained
from
the
protein
data
bank
(PDB)11
(www.pdb.org/pdb). For the study in this work we
have chosen a non-redundant set of PDB files of Nibinding proteins determined by the X-Ray
diffraction experimental method and reported up to
September 2010. The available PDB files are
manually scrutinized by removing PDB entries with
proteins with single ligand to Ni ion and proteins
surface binding Ni. These proteins may be
contaminated with Ni ions. However, some single
ligand binding Ni, those which are interfacially
bound to multiple chains in multimeric proteins has
been considered.
Ni-Binding Sequence Data Set
A set of Ni-binding segments of 15 amino acid
residues are prepared by selecting sequences Si-7 to
Si+7, where ith residue binds to Ni. The set contains
200 Ni-binding sequences, which are used as Nibinding data set. Similar procedure is followed for
Ca and Cu-binding data sets. Randomly 15 residue
long segments are selected from no-metal binding
protein sequences to prepare no-metal binding data
set. No-metal binding, Ca-binding and Cu-binding
data sets together constitute no-Ni-binding data set.
Ca ion and Cu ion represent the hard and soft metal
ions respectively. Ni lies in the borderline of hard
and soft classification. Hence the present classifier
which is based on HSAB properties may classify
Ni-binding sites in the background of no-metal
binding and other metal binding sites in proteins.
Calculation of Feature Parameters
Polarizability and partial charge of the ligand
binding atoms namely, OD1(Asp), OD2 (Asp/Asn),
OE1(Glu), OE2 (Glu/Gln), OZ (Tyr), ND1(His),
NE2 (His), SG (Cys/Met), are calculated in energy
optimized amino acids (X) protected by acetyl
(Ace) and N-methyl amide (NMe) group on Nterminus and C-terminus respectively ( AceXNMe),
using QSAR module of molecular modeling
software HyperChem Pro 8.0. Energy optimization
is done by semi-empirical method (PM3).
The parameters are assigned to those amino
acids only whose side chains are associated with
metal binding. Rest all are assigned zero values.
Polarizability and partial charge are the two
parameters of the ligand, which are considered to be
21
Das et al.: Prediction of Nickel Binding Sites
crucial in explaining the metal-ligand bond
formation in terms of the hard-soft acid-base
(HSAB) principle12. Of course, the HSAB principle
has been criticized regarding its tenability recently
in case of ambident reactivity13.
Table 1 Physicochemical parameters of amino acid
residues used in algorithm for prediction of Ni-binding
sites in proteins
Amino acid
Polarizability
Partial charge
A
R
N
D
C
Q
E
G
H
L
I
K
M
F
P
S
T
W
Y
V
0
0
0.57
0.57
3.00
0.57
0.57
0
1.03
0
0
0
3.00
0
0
0.64
0.64
0
0.64
0
0
0
-0.5679
-0.8014
-0.3119
-0.5679
0.8014
0
-0.5727
0
0
0
-0.3119
0
0
-0.6546
-0.6546
0
-0.55791
0
Radial basis function neural network
classifier (RBFNNC)
In this paper we have introduced a low complexity
radial basis function neural network (RBFNN)
classifier to efficiently predict the sample class14-16.
The potential of the proposed approach is evaluated
through an exhaustive study by many benchmark
datasets. The experimental results showed that the
proposed method can be a useful approach for
classification. A radial basis function network is an
artificial neural network that uses radial basis
functions as activation functions. It is a linear
combination of radial basis functions. The radial
basis function network (RBFNN) is suitable for
function approximation and pattern classification
problems because of their simple topological
structure and their ability to learn in an explicit
manner. In the classical RBF network, there is an
input layer, a hidden layer consisting of nonlinear
node function, an output layer and a set of weights
to connect the hidden layer and output layer. Due to
its simple structure it reduces the computational
task as compared to conventional multi layer
perception (MLP) network. The structure of a RBF
network is shown in Fig. 1.
W
X
1
X
2
φ0
Y
1
φ
Y
2
φ
Y
Input
s
φ
X
N
Output
3
Y
W
N
kj
Input
Hidden
Output
Layer
Layer
Layernetwork
Fig. 1 Architecture of a radial basis function
In the RBFNN based classifier, an input vector x is
used as input to all radial basis functions, each with
different parameters. The output of the network is a
linear combination of the outputs from radial basis
functions.
For an input feature vector x, the output y of the
jth output node is given as.
N
N
k 1
k 1
y j   w kjk   w kje

x (n )  C k
2  2k
(1)
The error occurs in the learning process is
reduced by updating the three parameters, the
positions of centers (Ck), the width of the Gaussian
function (σk) and the connecting weights (w) of
RBFNN by a stochastic gradient approach as
defined below:

(2)
w(n  1)  w(n)   w
J(n)
w

(3)
Ck (n  1)  C k (n)   c
J(n)
Ck

 k (n  1)   k (n)   
J(n)
 k
(4)
Where, J(n)  1 e(n) 2 e (n)=d(n) - y(n)
2
,
is the error, d(n) is the target output and y(n) is the
predicted output.  w C and   are the learning
parameters of the RBF network.
Simulation Studies and Discussions
In order to compare the efficiency of the proposed
method in predicting the class of the Nickel binding
data we have used standard datasets. All the
datasets categorized into two groups: binary class to
assess the performance of the proposed method. The
dataset consists of amino acid sequences of 15
characters. 200 sequences from Ni-binding protein
dataset and 200 sequences from non-Ni-binding
22
Int J Ener Sustain Env Engg, September 2014
dataset are taken as training set. The feature
selection process proposed in this paper includes
polarizability and partial charge as shown in the
Table 1. To implement the RBFNN classifier, we
first read in the file of protein sequence which are
represented
with
numerical
values.
The
performance of the proposed feature extraction
method is analyzed with the neural network
classifiers: RBFNN. The leave one out cross
validation (LOOCV) test is conducted by
combining all the training and test samples for the
classifiers with datasets17. LOOCV is a technique
where the classifier is successively learned on n-1
samples and tested on the remaining one. i.e., it
removes one sample at a time for testing and takes
other as training set. It involves leaving out all
possible subsets so the entire process is run as many
times as there are samples. This is repeated n times
so that every sample was left out once. Repeating
these procedure n times gives us n classifiers in the
end. Our error score is the number of
mispredictions18. Out of 200 sequences from Nibinding protein dataset, 192 samples are detected as
true positive whereas out of 250 sequences from
non-Ni-binding protein dataset 17 samples are
detected as false positive.
The prediction accuracy has been analyzed in
terms of two measuring parameters such as
accuracy (A), precision (P) and recall (R). These are
are defined in terms of four parameters true positive
(tp), false positive (fp), true negative (tn) and false
negative (fn). tp denotes the number of Nickel
bindings and are also predicted as Nickel binding, fp
denotes the number of actually Non nickel bindings
but are predicted to be Nickel bindings, tn is the
number of actually Non nickel bindings and also
predicted to be Non nickel bindings, and fn is the
number of actually Nickel bindings and predicted to
be Non nickel bindings.
Accuracy
The accuracy of prediction of Ni-binding in amino
acid sequence is defined as the percentage of Nibinding correctly predicted of the total binding
sequences present. It is computed as follows:
A
t p  tn
t p  f p  tn  f n
(5)
Precision
Precision is defined as the percentage of Ni-binding
correctly predicted to be one class of the total Nibinding predicted to be of that class. It is computed
as:
P
tp
(6)
tp  f p
Recall
Recall is defined as the percentage of the Nibinding that belong to a class that are predicted to
be that class. Recall is computed as:
R
tp
(7)
t p  fn
Table 2 Measuring parameters for prediction accuracy
Actual
Predicted
NB
NNB
Nickel
Binding
(NB)
192 (tp)
Non Nickel
Binding
(NNB)
17 (fp)
8 (fn)
233 (tn)
The accuracy, precision and recall are 0.94, 0.91,
and 0.96 respectively. The accuracy of sequence
based classifiers reported so far is about 65%.
Hence the present classifier appears to have high
accuracy compared to existing sequence based
classifiers It needs to be extended for all the metal
ions found in biosystems before it can be used at
proteomic level. However, it fulfills the need of a
classifier for Ni-binding proteins keeping in view
the growing database of Ni-binding proteins along
with the escalating interest of scientific community
in the field during last decade.
Conclusion
The classifier in the present work is performs with
high accuracy, to the tune of 94%. The method
would be extended to data sets with all other metalbinding sequences and success with similarly high
accuracy is expected. A sequence of 15 residues
long is chosen arbitrarily in the present work.
Further investigation is necessary to find out
whether the accuracy is dependent on length of
sequence.
Acknowledgement
The authors wish to thank management members
and the principal of the college for all kinds of
supports to complete this work.
References
1. Lin H H, Han L Y, Zhang H L, Zheng C J, Xie B,
Cao Z W & Chen Y Z, Bioinformatics, 7(Suppl
5):S13, 2006 doi:10.1186/1471-2105-7-S5-S13
2. Gregory D S, Martin A C, Cheetham J C & Rees A
R, Protein Eng, 6 (1993) 29.
Das et al.: Prediction of Nickel Binding Sites
23
3. Sodhi J S, Bryson K, McGuffin L J, Ward J J,
11. Berman H M, Westbrook J, Feng Z, Gilliland G, Bhat
Wernisch L & Jones D T, J Mol Biol, 342 (2004) 307.
2+
4. Nayal M & Di Cera E, Predicting Ca -binding sites
in proteins, Proceeding of Natural Academy of
Science, USA, 91 (1994) 817.
5. Schymkowitz J W, Rousseau F, Martins I C,
Ferkinghoff-Borg J, Stricher F & Serrano L,
Prediction of water and metal binding sites and their
affinities by using the Fold-X force field, Proceeding
of Natural Academy of Science, USA, 102 (2005)
10147.
6. Jensen M R, Petersen G, Lauritzen C, Pedersen J &
Led J J, Biochem, 44 (2005) 11014.
7. Passerini A, Lippi M & Frasconi P, IEEE/ACM Tran
Comput Biol Bioinform, 9 (2012) 203.
8. Rigden D J & Galperin M Y, J Mol Biol, 343 (2004)
971.
9. Naik P K, Ranjan P, Kesari P & Jain S, J Biophys
Chem, 2 (2011) 111.
10. Lin C T, Lin K L, Yang C H, Chung I F, Huang C D
& Yang Y S, Int J Neural Syst, 15 (2005) 71.
T N, Weissig H, Shindyalov I N & Bourne P E,
Nucleic Acids Research, 28 (2000) 235.
12. Glusker J P, Katz A K & Bock C W, Metal ions in
biological systems, 16 (1999) 8.
13. Mayr H, Breugst M & Ofial A R, Angew Chem Int
Ed, 50 (2011) 6470.
14. Powell
M J D, Radial basis functions
formultivariable interpolation: A review, paper
presented at IMA Conference on Algorithms for the
Approximationof Functions and Data, RMCS,
Shrivenham, England, 1985.
15. Broomhead D S & Lowe D, Complex Systems, 2
(1988) 321.
16. Chen S, Cowan C F N & Grant P M, IEEE Trans
Neural Networks, 2 (1991) 302.
17. Lachenbruch P A & Mickey M R, Technometrics, 10
(1968) 1.
18. Varma S & Simon R, BMC Bioinformatics, 7 (2006)
91.