Download C. Flow Chart

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Artificial gene synthesis wikipedia , lookup

Expression vector wikipedia , lookup

Gene expression wikipedia , lookup

G protein–coupled receptor wikipedia , lookup

Magnesium transporter wikipedia , lookup

Biosynthesis wikipedia , lookup

Amino acid synthesis wikipedia , lookup

Point mutation wikipedia , lookup

Interactome wikipedia , lookup

Genetic code wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Metalloprotein wikipedia , lookup

Protein purification wikipedia , lookup

Protein wikipedia , lookup

Western blot wikipedia , lookup

Biochemistry wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Two-hybrid screening wikipedia , lookup

Proteolysis wikipedia , lookup

Transcript
International Journal on Advanced Computer Theory and Engineering (IJACTE)
_______________________________________________________________________________________________
Protein Loop Prediction
Shivanandappa, Praveenkumar N Hadapad
Department of Biotechnology, RV College of Engineering,
Mysore Road, R.V. Vidyanikethan Post, Bangalore -560069
Email: [email protected]
Abstract—All the 3D structures of protein are having some
challenging regions to be model are called as loops. These
regions are highly variable in nature, and in few proteins
they are insignificant but not all proteins. The regions
which are highly variable are not functionally significant,
hence its highly challenging to predict their position and
structure in 3D structure of protein. But still there are
many attempts predict loops, few of them are successful to
some extent. These existing methods lack specificity and
accuracy, even though the loops are functionally not
significant, but their presence brings significant change in
overall 3D structure of protein. So accuracy and specificity
in loop prediction is highly significant, this is because it
increases the accuracy in loop prediction. In this work, we
were able to design an algorithm called RVloop MOD that
predict loops in an unknown protein structure. Since the
prediction is based on the natural tendency of amino acids
that are taking part in loop formation, we call this
algorithm a kind of probabilistic model. After the
prediction of loops, these loops are further refined and
final loops are predicted. As far as accuracy of loop
prediction using Rvloop MOD in comparison with existing
methods, there is a quite improvement in specificity and
accuracy. The loops predicted using RvloopMOD are over
80% accurate.
I. INTRODUCTION
A. Overview
Loops are irregular structures which connect two
secondary structure elements in proteins. They often
play important roles in function, including enzyme
reactions and ligand binding. Despite their importance,
their structure remains difficult to predict. Most protein
loop structure prediction methods sample local loop
segments and score them. In particular protein loop
classifications and database search methods depend
heavily on local properties of loops.
Protein loops are pattern less regions which connect two
regular secondary structures. They are generally located
on the protein’s surface in solvent exposed areas and
often play important roles, such as interacting with other
biological objects. Despite the lack of patterns, loops are
not completely random structures. Early studies of short
turns and hairpins showed that these peptide fragments
could be clustered into structural classes. Such
classifications have also been made across all loops or
within specific protein families such as antibody
complementarily
determining
regions.
Loop
classifications are generally based on local properties
such as sequence, the secondary structures from which
the loop starts and finishes (anchor region), the distance
between the anchors, and the geometrical shape along
the loop structure. Loops can also be classified in terms
of function also. Accurate protein loop structure
prediction remains an open question. Protein loop
predictors have dealt with the problem as a case of local
protein structure prediction. Protein structures are
hypothesized to be in thermodynamic equilibrium with
their environment. Thus the primary determinant of a
protein structure is considered to be its atomic
interactions, i.e. its amino acid sequence. An analogous
conjecture has arisen at the local scale. The modeling of
protein loops is often considered a mini protein folding
problem. In fact, most loop structure prediction methods
are based on this conjecture. Database search methods
have been successful in the realm of loop structure
prediction. They depend upon the assumption that
similarity between local properties may suggest similar
local structures. All database search methods work in an
analogous fashion using either a complete set or a
classified set of loops and selecting predictions using
local features including sequence similarity and anchor
geometry. Ab initio loop modeling methods aim to
predict peptide fragments that do not exist in homology
modeling templates without structure databases.
Generally, Ab initio methods generate large structure
conformation sets and select predictions. The generated
loop candidates are optimised against scoring functions.
In all loop modeling procedures anchor regions are often
problematic and the accuracy of loop modeling depends
upon the distance between the anchors. It is widely
believed that the accuracy of loop structure prediction
depends on the number of residues, i.e. the larger the
number of residues, the more difficult a loop is to
predict.
Protein structure prediction is the prediction of the 3D
structure of protein from its amino acid sequence, that is,
_______________________________________________________________________________________________
ISSN (Print): 2319-2526, Volume -3, Issue -3, 2014
30
International Journal on Advanced Computer Theory and Engineering (IJACTE)
_______________________________________________________________________________________________
the prediction of its secondary, tertiary and quaternary
structure from its primary structure. Protein structure
prediction has now been bifurcated into two main
approaches viz. experimental and computational
approach. Experimental methods are time consuming
and expensive and it is not always feasible to identify
the protein structure experimentally. In order to predict
the protein structure using computational methods, the
problem is formulated as an optimization problem and
the goal is to find the lowest free energy conformation.
There are several methods that have been developed to
predict secondary and tertiary structure of proteins, the
catch being that these methods can achieve up to 7176% overall accuracy. Considering this, there is a huge
demand for the tools to predict protein structure (loop)
with the increased accuracy, efficiency and specificity.
By this work, we intend to develop an improved and
optimized algorithm for Protein loop Prediction with
increased accuracy.
Jpred
Meta-PP
PREDATOR
PredictProtein
PSIPRED
SymPred
YASSPP
RaptorX-SS8
PSSpred
Neural network assignment
Consensus Prediction of servers
Knowledge based database
comparison
Profile based neural network
Two feed-forward neural networks
which perform analysis on the
output obtained from PSI-BLAST
Dictionary based approach that
captures local sequence similarities
in a group of proteins
M based predictor based on PSIBLAST profiles
Predict both 3-state and 8-state
secondary structure using
conditional neural fields
Multiple backpropagation neural
network predictors from PSIBLAST profiles
B. Protein structure prediction- the fundamentals
1) Computational methods of structure prediction:
Protein structure prediction took a definite turn when
computers with higher processing efficiency emerged
onto the scene. Representation of the prediction problem
in terms of mathematical models and algorithms
provides an easier alternative to the aforementioned
biophysical methods, primarily in terms of time and
money constraints. The prediction problem basically can
be defined as the prediction of the secondary, tertiary
and quaternary structure of a protein given its amino
acid sequence.
2) Secondary Structure prediction: Secondary structure
prediction by computational methods involves usage of
mathematical techniques to predict the local secondary
structures of proteins based only on knowledge of their
primary structure i.e. amino acid sequence. The
Prediction entails the assignment of regions of the amino
acid sequence as likely alpha helices, beta strands (often
noted as "extended" conformations), or turns. The
success of a prediction is determined by comparing it to
the results of the DSSP algorithm applied to the crystal
structure of the protein. The best modern methods of
secondary structure prediction in proteins reach about
80% accuracy. The accuracy of current protein
secondary structure prediction methods is assessed in
weekly assessment benchmarks like LiveBench and
EVA.
The following table summarizes the main secondary
structure prediction methods:
TABLE I. LIST OF POPULAR SECONDARY STRUCTURE
PREDICTION TOOLS AND THEIR METHOD DESCRIPTION
Method
NetSurfP
GOR
Description
Profile based neural network
Information theory/ Bayesian
inference
The best proponent methodology was the Chou-Fasman
Method which in many ways was a pioneer method in
structure prediction. However, when compared to
current methods, Chou-Fasman produces poor results.
Another method that obtained some prominence was the
GOR method which is anin formation theory-based
method and uses more powerful probabilistic techniques
of Bayesian inference. This method also included local
propensities of amino acids and was a major
improvement on Chou-Fasman method. However, like
almost all earlier methods, it over-predicted alphahelices and wrongly predicted beta sheets as loops and
turns.
II. MATERIALS AND METHODS
A. Databases
1)
PDB: The Protein data bank (PDB) is a
repository for the 3-D structural data of large biological
molecules, such as proteins and nucleic acids. The data,
typically obtained by X-ray crystallography or NMR
spectroscopy and submitted by biologists and
biochemists from around the world, are freely accessible
on the Internet via the websites of its member
organizations (PDBe, PDBj, and RCSB). The PDB is
overseen by an organisation called the Worldwide
Protein Data Bank, ww PDB. The role of the PDB in our
work is to provide characterised secondary structural
data and it provides a cross platform and independent
data repository for local alignment.
As we have mentioned in earlier sections, Backbone
modeling attributed to the accurate estimation of
secondary structures.
2)
SCOP: The Structural Classification of Proteins
(SCOP) database is a largely manual classification of
protein structural domains based on similarities of their
amino acid sequences and three-dimensional structures.
_______________________________________________________________________________________________
ISSN (Print): 2319-2526, Volume -3, Issue -3, 2014
31
International Journal on Advanced Computer Theory and Engineering (IJACTE)
_______________________________________________________________________________________________
SCOP includes the following structural classes: α-helical
domains, β-sheet domains, α/β domains which consist of
"beta-alpha-beta" structural units or "motifs" that form
mainly parallel β-sheets, α+β domains formed by
independent α-helices and mainly antiparallel β-sheets,
multi-domain proteins, membrane and cell surface
proteins and peptides (not including those involved in
the immune system), "small" proteins, coiled-coil
proteins, low-resolution protein structures, peptides and
fragments, designed proteins of non-natural sequence.
C. Flow Chart
Flow chart shows the design of the complete tool. This
starts by reaching the query sequence and database
sequence one by one for optimal alignment.
3)
CATH: The CATH database is a hierarchical
domain classification of protein structures in the Protein
Data Bank (PDB, Berman et al. 2003). Only crystal
structures solved to resolution better than 4.0 angstroms
are considered, together with NMR structures. All nonproteins, models, and structures with greater than 30%
"C-alpha only" are excluded from CATH.
The data obtained from structural databases like PDB,
CATH and SCOP has been used to estimate propensity
values, so that we can identify conservation patterns
with improved specificity and accuracy. Extensive
survey of records in these databases has been carried out
with intent of covering huge no of records. Finally we
have achieved this goal by automating the task with our
own Perl script.
B. Techniques
Chau-Fasman Algorithm is one of the oldest and
simplest method. The method was originally presented
in 1974 and later improved in 1977, 1978, 1979, 1985
and 1989. It depends on observed frequency of types of
amino acid residues in alpha-helix, beta strand, beta
turn, and other structures in protein three-dimensional
structures. The Chou-Fasman algorithm for the
prediction of protein secondary structure is one of the
most widely used predictive schemes. The Chou-Fasman
method of secondary structure prediction depends on
assigning a set of prediction values to a residue and then
applying a simple algorithm to the conformational
parameters and positional frequencies. The ChouFasman algorithm is simple in principle. The
conformational parameters for each amino acid were
calculated by considering the relative frequency of a
given amino acid within a protein, its occurrence in a
given type of secondary structure, and the fraction of
residues occurring in that type of structure. These
parameters are measures of a given amino acids
preference to be found in helix, sheet or coil. Using
these conformational parameters, one finds nucleation
sites within the sequence and extends them until a
stretch of amino acids is encountered that is not disposed
to occur in that type of structure or until a stretch is
encountered that has a greater disposition for another
type of structure. At that point, the structure is
terminated. This process is repeated throughout the
sequence until the entire sequence is predicted.
Fig.1 Secondary structure prediction flow chart and the
method.
D. Algorithm
Step 1: Perform the optimal alignment of query
sequence against the database sequence one by one
using dynamic programming and find to which sequence
in the database, the query is related.
Mi-j, j-1+ Si, j
Mi,j = max
Mi-1, j + W
Mi,j -1 + W
Step 2: Select the corresponding propensity table of
amino acid based on optimum score obtained by
optimum alignment.
Step 3: Substitute the propensity values for each amino
acid of query sequence from the
_______________________________________________________________________________________________
ISSN (Print): 2319-2526, Volume -3, Issue -3, 2014
32
International Journal on Advanced Computer Theory and Engineering (IJACTE)
_______________________________________________________________________________________________
Step 4: Input the propensity values of query sequence to
the modified chou-fasman algorithm.
Step 5: Loop areas are cross validated with the other
structure occurrences.
Step 6: If none fit the criteria then loop is confirmed.
Step 7: The output is provided with where all the loop
structure is predicted in the unknown sequence.
III. IMPLEMENTATION OF THE
ALGORITHM
The RVloopMOD algorithm start by taking query
sequence as an input, then homologous sequences are
searched by PSI-BLAST. Next the propensity values of
the top homologous sequence are considered and then it
predicts all possible loop structures for unknown and
template sequence. Loop structural information so
obtained will be used for parsing of C-alpha trace from
template structure. Then the result obtained is cross
validated with the other secondary structures. If none of
the other overlap with predicted loop sequence, then the
loop structure predicted is confirmed. At the end finally
algorithm generates optimised loop structure and there
positions as an output, which can be further utilised to
build 3D structure of protein.
Loop structure prediction is based on the accurate
estimation of propensity value of each amino acid. From
the literature survey it was observed that distribution of
propensity value among the different secondary
structure is different in different protein sample. So
accurate estimation of propensity was necessary by
careful sampling of data. After selecting data from the
databases Natural tendency of all amino acids was
calculated using the following formula.
∑ AA in sec strc S
∑ all AA in sec strc S
P(AA|S) =
∑ AA in all sec strc
∑ all AA in sec strc
P(AA|S)=propensity of any amino acid given the
secondary structure, ∑AA= sum of amino acid the in
given secondary structure, ∑all AA= sum of all amino
acids in the given secondary structure, ∑AA in all = sum
of amino acid in all secondary structures, ∑all AA= sum
of all amino acids in all secondary structures.
IV. RESULTS AND DISCUSSION
Propensities were calculated at two hierarchal levels;
first at Kingdom level and then at the Phylum level. As a
minimum requirement of 500 unique records were taken
for the calculation of each propensity table, the
propensity values are statistically significant.
also observe that certain amino acids in particular
secondary structures are extremely high and it is also
variable among other secondary structures.
On the whole, we observe that the amino acids arginine,
glutamine, lysine, and tryptophan have higher values
among all secondary structures.
The propensity values obtained are tabulated in the
following table.
TABLE II. PROPENSITY VALUES OF KINGDOM ANIMALIA,
BACTERIA AND PLANTAE.
Amino
Acids
A
R
N
D
C
E
Q
G
H
I
L
K
M
F
P
S
T
W
Y
V
Kingdom
Bacteria,
Phylum
aquificae
0.768
0.915
1.297
1.298
1.36
0.79
1.012
1.142
1.863
0.724
0.655
0.932
1.64
0.831
1.941
1.455
1.268
0.727
0.749
0.83
Kingdom
Plantae,
Phylum
rhodophyta
0.739
0.881
0.961
1.361
0.895
0.925
1.06
0.987
1.138
0.652
0.684
0.851
0.964
1.035
1.846
1.38
1.275
0.636
0.893
0.937
Kingdom
Animalia,
Phylum
arthropoda
0.857
0.911
1.162
1.188
1.057
0.863
0.898
1.155
1.351
0.724
0.777
0.96
0.985
0.752
1.832
1.299
1.145
0.674
0.719
0.783
Kingdom Bacteria, Phylum aquificae
2
1
Loops
0
A R N D C E Q G H I L KM F P S TWY V
Fig 2.Natural tendency of aminoacid vs propensity
values of loops in Kingdom Bacteria, Phylum aquificae.
Graphical representation of the amino acid propensity
values generated was used to verify if there is a variation
across different secondary structures. The comparisons
lead us to conclude that natural tendency for the
formation of a particular secondary structure varies. We
_______________________________________________________________________________________________
ISSN (Print): 2319-2526, Volume -3, Issue -3, 2014
33
International Journal on Advanced Computer Theory and Engineering (IJACTE)
_______________________________________________________________________________________________
Future prospective, this approach can be used for
specific prediction such as antigen-antibody binding
domains, protein-protein interaction site etc.
Kingdom Plantae, Phylum rhodophyta
2
1.5
REFERENCES
1
Loops
[1]
Alper Kucukural & Yang Zhang Ambrish Roy,
“I-TASSER: a unified platform for automated
protein structure and function prediction,” Nature
Protocols, vol. 5, pp. 725-738, April2010
[2]
Parbati
Biswas
Nicholus
Bhattacharjee,
“Position specific propensities of amino acid in
beta strands,” BMC Structural Biology, vol. 10,
no. 29, 2010
[3]
Rosni Abdullah, and Rosalina Abdul Salam
Hesham Awadh A. Bahamish, “Protein tertiary
structure prediction using artificial bee colony
algorithm,” in Proceedings of the 2009 Third
Asia International Conference on Modeling and
Simulation (AMS ’09), Bali, 2009, pp. 258-263
[4]
Robert C. Edgar, "MUSCLE: multiple sequence
alignment with high accuracy and high
throughput," Nucl. Acids Res. , vol. 32, no. 5, pp.
1792-1797, 2004.
[5]
R.Y., Wang, G., Gao, G., Liao, L., Dunbrack,
R.L. Jr. Kahsay, "Quasi-consensus based
comparison of profile hidden Markov models for
protein sequences.," Bioinformatics, vol. 21, no.
10, pp. 2287-2293, 2005.
[6]
A Tan and D and Deville, Y Gilbert, "Multi-class
protein fold classification using a new ensemble
machine learning approach.," in Proceedings of
the 14th International Conference on Genome
Informatics, Yokohama, Japan, 2003, pp. 206217.
[7]
A Tan and D and Deville, Y Gilbert, "Multi-class
protein fold classification using a new ensemble
machine learning approach.," in Proceedings of
the 14th International Conference on Genome
Informatics, Yokohama, Japan, 2003, pp. 206217.
[8]
Geoffrey J. Barton James A. Cuff, "Application
of multiple sequence alignment profiles to
improve protein secondary structure prediction,"
Proteins: Structure, Function, and Bioinformatics,
vol. 40, no. 3, pp. 502–511, August 2000.
[9]
Burke DF, Deane CM, Blundell TL, “Browsing
the SLoop database of structurally classified
loops connecting elements of protein secondary
structure”. Bioinformatics, 16: 513-519, 2000.
[10]
Choi Y, Deane CM. “FREAD revisited: Accurate
loop structure prediction using a database search
algorithm”. Proteins: Structure, Function, and
Bioinformatics,78:1431-1440, 2010.
[11]
Ambrish Roy, A. K. (2010). I-TASSER: a unified
0.5
0
A R N D C E Q G H I L KM F P S TWY V
Fig 3.Natural tendency of aminoacid vs propensity
values of loops in Kingdom Plantae, Phylum
rhodophyta.
Kingdom Animalia, Phylum arthropoda
2
1.5
1
Loops
0.5
0
A R N D C E QG H I L KMF P S TWY V
Fig 4. Natural tendency of aminoacid vs propensity
values of loops in Kingdom Animalia, Phylum
arthropoda.
2.5
2
Kingdom
Bacteria,
Phylum
aquificae
1.5
1
0.5
0
ARNDCEQGH I L KMF P S TWYV
Fig 5.Natural tendency of aminoacid vs propensity
values of loops in Kingdom Bacteria, Plantae and
Animalia.
CONCLUSION:
The observation made from the result section, we came
to know that accuracy and specificity have been
improved. The drastic improvement is attributed to the
highly specific propensity values, which are estimated
after clustering huge sample of data available in PDB
into Phylum Plantae, Animalia, Bacteria and Virus.
The specificity of the prediction is improved because, in
the present work prediction of the loop is accomplished
after homology search for unknown sequence which
helps us to select suitable propensity table based on the
taxanomic information. With this, we conclude that
there is a moderate improvement in both specificity and
accuracy of the prediction. To support the conclusion we
have provided the data in the result section.
_______________________________________________________________________________________________
ISSN (Print): 2319-2526, Volume -3, Issue -3, 2014
34
International Journal on Advanced Computer Theory and Engineering (IJACTE)
_______________________________________________________________________________________________
platform for automated protein structure and
function prediction. Nature Protocols , 5, 725–
738.
[12]
[13]
CHRISTINA LESLIE, E. E. (2002). The
spectrum kernel: a string kernel for SVM protein
classification.
Pacific
Symposium
on
Biocomputing, (pp. 566-575).
[14]
Hesham Awadh A. Bahamish, R. A. (2009).
Protein Tertiary Structure Prediction Using
Artificial Bee Colony Algorithm. Proceedings of
the 2009 Third Asia International Conference on
Modelling & Simulation (AMS '09), (pp. 258263). Bali.
[15]
Kahsay, R. W. (2005). Quasi-consensus based
comparison of profile hidden Markov models for
protein sequences. Bioinformatics , 21 (10),
2287-2293.
Edgar, R. C. (2004). MUSCLE: multiple
sequence alignment with high accuracy and high
throughput. Nucl. Acids Res. , 32 (5), 1792-1797.

_______________________________________________________________________________________________
ISSN (Print): 2319-2526, Volume -3, Issue -3, 2014
35