International Journal on Advanced Computer Theory and Engineering (IJACTE)

Protein Loop Prediction

Shivanandappa, Praveenkumar N Hadapad
Department of Biotechnology, RV College of Engineering, Mysore Road, R.V. Vidyanikethan Post, Bangalore -560069
Email: [email protected]

Abstract: Every protein 3D structure contains regions that are challenging to model, called loops. These regions are highly variable; in a few proteins they are insignificant, but not in all. Because highly variable regions are often not functionally significant, predicting their position and structure within the 3D structure of a protein is highly challenging. Many attempts have been made to predict loops, and a few have been successful to some extent, but the existing methods lack specificity and accuracy. Even where loops are not functionally significant, their presence brings a significant change to the overall 3D structure of the protein, so accuracy and specificity in loop prediction are highly significant. In this work, we designed an algorithm called RVloopMOD that predicts loops in an unknown protein structure. Since the prediction is based on the natural tendency of amino acids to take part in loop formation, we regard the algorithm as a kind of probabilistic model. After the initial prediction, the loops are further refined and the final loops are predicted. Compared with existing methods, RVloopMOD shows an improvement in specificity and accuracy. The loops predicted using RVloopMOD are over 80% accurate.

I. INTRODUCTION

A. Overview

Loops are irregular structures which connect two secondary structure elements in proteins. They often play important roles in function, including enzyme reactions and ligand binding.
Despite their importance, their structure remains difficult to predict. Most protein loop structure prediction methods sample local loop segments and score them. In particular, protein loop classifications and database search methods depend heavily on local properties of loops.

Protein loops are patternless regions which connect two regular secondary structures. They are generally located on the protein's surface in solvent-exposed areas and often play important roles, such as interacting with other biological objects. Despite the lack of patterns, loops are not completely random structures. Early studies of short turns and hairpins showed that these peptide fragments could be clustered into structural classes. Such classifications have also been made across all loops or within specific protein families, such as antibody complementarity-determining regions. Loop classifications are generally based on local properties such as sequence, the secondary structures from which the loop starts and finishes (anchor regions), the distance between the anchors, and the geometrical shape along the loop structure. Loops can also be classified in terms of function.

Accurate protein loop structure prediction remains an open question. Protein loop predictors have dealt with the problem as a case of local protein structure prediction. Protein structures are hypothesized to be in thermodynamic equilibrium with their environment. Thus the primary determinant of a protein structure is considered to be its atomic interactions, i.e. its amino acid sequence. An analogous conjecture has arisen at the local scale: the modeling of protein loops is often considered a mini protein folding problem. In fact, most loop structure prediction methods are based on this conjecture.

Database search methods have been successful in the realm of loop structure prediction. They depend upon the assumption that similarity between local properties may suggest similar local structures.
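The assumption that similar local properties suggest similar local structures can be illustrated with a toy database lookup. The Python sketch below is not taken from any published tool: the `LoopRecord` type, the `rank_candidates` function, the 1.0 Å anchor tolerance and the ranking rule (sequence identity first, then anchor-distance mismatch) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class LoopRecord:
    """One stored loop (hypothetical record layout)."""
    name: str
    sequence: str           # one-letter amino acid codes of the loop
    anchor_distance: float  # distance in angstroms between the two anchor residues


def sequence_identity(a: str, b: str) -> float:
    """Fraction of identical positions; 0.0 if lengths differ or are zero."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def rank_candidates(query_seq: str, start_anchor: Point, end_anchor: Point,
                    database: List[LoopRecord],
                    max_anchor_diff: float = 1.0) -> List[str]:
    """Return names of same-length loops whose anchor distance is within
    max_anchor_diff of the query's, best match first."""
    target = math.dist(start_anchor, end_anchor)
    hits = []
    for rec in database:
        diff = abs(rec.anchor_distance - target)
        if len(rec.sequence) == len(query_seq) and diff <= max_anchor_diff:
            # Higher identity ranks first; a smaller distance mismatch breaks ties.
            hits.append((sequence_identity(query_seq, rec.sequence), -diff, rec.name))
    hits.sort(reverse=True)
    return [name for _, _, name in hits]
```

For example, with a query loop `"GSPG"` whose anchors are 6.0 Å apart, a stored loop of the same length and a 9.0 Å anchor distance is filtered out, while same-length loops at 6.0 Å and 6.2 Å are kept and ordered by sequence identity.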
All database search methods work in an analogous fashion, using either a complete set or a classified set of loops and selecting predictions using local features including sequence similarity and anchor geometry.

Ab initio loop modeling methods aim to predict peptide fragments that do not exist in homology modeling templates, without using structure databases. Generally, ab initio methods generate large sets of structure conformations and select predictions from them. The generated loop candidates are optimised against scoring functions.

In all loop modeling procedures, anchor regions are often problematic, and the accuracy of loop modeling depends upon the distance between the anchors. It is widely believed that the accuracy of loop structure prediction depends on the number of residues, i.e. the larger the number of residues, the more difficult a loop is to predict.

ISSN (Print): 2319-2526, Volume -3, Issue -3, 2014

Protein structure prediction is the prediction of the 3D structure of a protein from its amino acid sequence, that is, the prediction of its secondary, tertiary and quaternary structure from its primary structure. Protein structure prediction has now bifurcated into two main approaches, viz. experimental and computational. Experimental methods are time consuming and expensive, and it is not always feasible to identify the protein structure experimentally. To predict the protein structure using computational methods, the problem is formulated as an optimization problem in which the goal is to find the lowest free energy conformation.
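The optimization view stated above can be written compactly. This is only a notational sketch: the symbols $\mathcal{C}$ (the set of candidate conformations), $E$ (a free energy or scoring function) and $x^{*}$ (the predicted conformation) are introduced here for illustration and do not appear in the original text.

```latex
x^{*} = \operatorname*{arg\,min}_{x \in \mathcal{C}} E(x)
```

Ab initio loop modeling fits the same template, with $\mathcal{C}$ taken to be the generated set of loop candidates and $E$ the scoring function against which they are optimised.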
There are several methods that have been developed to predict the secondary and tertiary structure of proteins, the catch being that these methods achieve only up to 71-76% overall accuracy. Considering this, there is a huge demand for tools that predict protein structure (loops) with increased accuracy, efficiency and specificity. In this work, we intend to develop an improved and optimized algorithm for protein loop prediction with increased accuracy.

B. Protein structure prediction - the fundamentals

1) Computational methods of structure prediction: Protein structure prediction took a definite turn when computers with higher processing efficiency emerged onto the scene. Representing the prediction problem in terms of mathematical models and algorithms provides an easier alternative to the aforementioned biophysical methods, primarily in terms of time and money. The prediction problem can be defined as the prediction of the secondary, tertiary and quaternary structure of a protein given its amino acid sequence.

2) Secondary structure prediction: Secondary structure prediction by computational methods uses mathematical techniques to predict the local secondary structures of proteins based only on knowledge of their primary structure, i.e. the amino acid sequence. The prediction entails assigning regions of the amino acid sequence as likely alpha helices, beta strands (often noted as "extended" conformations), or turns. The success of a prediction is determined by comparing it to the results of the DSSP algorithm applied to the crystal structure of the protein. The best modern methods of secondary structure prediction in proteins reach about 80% accuracy. The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA. The following table summarizes the main secondary structure prediction methods:

TABLE I. LIST OF POPULAR SECONDARY STRUCTURE PREDICTION TOOLS AND THEIR METHOD DESCRIPTION

Method          Description
NetSurfP        Profile based neural network
GOR             Information theory / Bayesian inference
Jpred           Neural network assignment
Meta-PP         Consensus prediction of servers
PREDATOR        Knowledge based database comparison
PredictProtein  Profile based neural network
PSIPRED         Two feed-forward neural networks which perform analysis on the output obtained from PSI-BLAST
SymPred         Dictionary based approach that captures local sequence similarities in a group of proteins
YASSPP          SVM based predictor based on PSI-BLAST profiles
RaptorX-SS8     Predicts both 3-state and 8-state secondary structure using conditional neural fields
PSSpred         Multiple backpropagation neural network predictors from PSI-BLAST profiles

The most prominent early methodology was the Chou-Fasman method, which in many ways was a pioneering method in structure prediction. However, compared to current methods, Chou-Fasman produces poor results. Another method that obtained some prominence was the GOR method, which is an information theory-based method and uses the more powerful probabilistic techniques of Bayesian inference. This method also included local propensities of amino acids and was a major improvement on the Chou-Fasman method. However, like almost all earlier methods, it over-predicted alpha helices and wrongly predicted beta sheets as loops and turns.

II. MATERIALS AND METHODS

A. Databases

1) PDB: The Protein Data Bank (PDB) is a repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organizations (PDBe, PDBj, and RCSB). The PDB is overseen by an organisation called the Worldwide Protein Data Bank, wwPDB.
The role of the PDB in our work is to provide characterised secondary structural data; it also provides a cross-platform, independent data repository for local alignment. As mentioned in earlier sections, accurate backbone modeling is attributed to the accurate estimation of secondary structures.

2) SCOP: The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their amino acid sequences and three-dimensional structures. SCOP includes the following structural classes: α-helical domains; β-sheet domains; α/β domains, which consist of "beta-alpha-beta" structural units or "motifs" that form mainly parallel β-sheets; α+β domains, formed by independent α-helices and mainly antiparallel β-sheets; multi-domain proteins; membrane and cell surface proteins and peptides (not including those involved in the immune system); "small" proteins; coiled-coil proteins; low-resolution protein structures; peptides and fragments; and designed proteins of non-natural sequence.

3) CATH: The CATH database is a hierarchical domain classification of protein structures in the Protein Data Bank (PDB, Berman et al. 2003). Only crystal structures solved to a resolution better than 4.0 angstroms are considered, together with NMR structures. All non-proteins, models, and structures with greater than 30% "C-alpha only" are excluded from CATH.

The data obtained from structural databases such as PDB, CATH and SCOP were used to estimate propensity values, so that conservation patterns can be identified with improved specificity and accuracy. An extensive survey of the records in these databases was carried out with the intent of covering a large number of records. We achieved this by automating the task with our own Perl script.

B. Techniques

The Chou-Fasman algorithm is one of the oldest and simplest methods. It was originally presented in 1974 and later improved in 1977, 1978, 1979, 1985 and 1989. It depends on the observed frequency of types of amino acid residues in alpha helix, beta strand, beta turn, and other structures in protein three-dimensional structures. The Chou-Fasman algorithm is one of the most widely used predictive schemes for protein secondary structure. It assigns a set of prediction values to each residue and then applies a simple algorithm to the conformational parameters and positional frequencies.

The Chou-Fasman algorithm is simple in principle. The conformational parameters for each amino acid were calculated by considering the relative frequency of a given amino acid within a protein, its occurrence in a given type of secondary structure, and the fraction of residues occurring in that type of structure. These parameters are measures of a given amino acid's preference to be found in helix, sheet or coil. Using these conformational parameters, one finds nucleation sites within the sequence and extends them until a stretch of amino acids is encountered that is not disposed to occur in that type of structure, or until a stretch is encountered that has a greater disposition for another type of structure. At that point, the structure is terminated. This process is repeated until the entire sequence is predicted.

C. Flow Chart

The flow chart shows the design of the complete tool. It starts by reading the query sequence and the database sequences one by one for optimal alignment.

Fig. 1. Secondary structure prediction flow chart and the method.

D. Algorithm

Step 1: Perform the optimal alignment of the query sequence against each database sequence, one by one, using dynamic programming, and find the database sequence to which the query is related:

\[
M_{i,j} = \max
\begin{cases}
M_{i-1,j-1} + S_{i,j} \\
M_{i-1,j} + W \\
M_{i,j-1} + W
\end{cases}
\]

Here M is the alignment score matrix, S_{i,j} is the score for aligning residue i of the query with residue j of the database sequence, and W is the gap penalty.

Step 2: Select the corresponding propensity table of amino acids based on the optimum score obtained from the optimal alignment.

Step 3: Substitute the propensity values for each amino acid of the query sequence from the selected propensity table.

Step 4: Input the propensity values of the query sequence to the modified Chou-Fasman algorithm.

Step 5: Cross-validate the loop areas against the occurrences of other structures.

Step 6: If none fit the criteria, the loop is confirmed.

Step 7: The output reports where loop structures are predicted in the unknown sequence.

III. IMPLEMENTATION OF THE ALGORITHM

The RVloopMOD algorithm starts by taking a query sequence as input; homologous sequences are then searched for using PSI-BLAST. Next, the propensity values of the top homologous sequence are considered, and the algorithm predicts all possible loop structures for the unknown and template sequences. The loop structural information so obtained is used for parsing the C-alpha trace from the template structure. The result is then cross-validated against the other secondary structures. If none of the other structures overlaps with the predicted loop sequence, the predicted loop structure is confirmed. Finally, the algorithm outputs the optimised loop structures and their positions, which can be further utilised to build the 3D structure of the protein.

Loop structure prediction is based on the accurate estimation of the propensity value of each amino acid.
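Steps 3 to 6 can be sketched in Python as follows. This is a minimal illustration, not the RVloopMOD implementation: the sliding-window average, the window size of 3, the threshold of 1.0 and the minimum segment length of 2 are assumptions, and the helix/strand segments are supplied by hand rather than predicted. The propensity values are the loop propensities reported for Kingdom Bacteria, Phylum Aquificae in Table II.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # inclusive (start, end) residue indices, 0-based

# Loop propensities for Kingdom Bacteria, Phylum Aquificae (Table II).
LOOP_PROPENSITY: Dict[str, float] = {
    "A": 0.768, "R": 0.915, "N": 1.297, "D": 1.298, "C": 1.36,
    "E": 0.79, "Q": 1.012, "G": 1.142, "H": 1.863, "I": 0.724,
    "L": 0.655, "K": 0.932, "M": 1.64, "F": 0.831, "P": 1.941,
    "S": 1.455, "T": 1.268, "W": 0.727, "Y": 0.749, "V": 0.83,
}


def window_means(values: List[float], window: int) -> List[float]:
    """Centered sliding mean; windows are truncated at the sequence ends."""
    half = window // 2
    means = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        means.append(sum(chunk) / len(chunk))
    return means


def candidate_loops(sequence: str, table: Dict[str, float], window: int = 3,
                    threshold: float = 1.0, min_len: int = 2) -> List[Segment]:
    """Step 3: substitute propensities; then mark runs whose windowed mean
    reaches the (assumed) threshold as candidate loops."""
    values = [table[aa] for aa in sequence]
    means = window_means(values, window)
    segments: List[Segment] = []
    start = None
    for i, mean in enumerate(means):
        if mean >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    if start is not None and len(means) - start >= min_len:
        segments.append((start, len(means) - 1))
    return segments


def confirm_loops(candidates: List[Segment], others: List[Segment]) -> List[Segment]:
    """Steps 5 and 6: keep only candidates that overlap no other structure."""
    def overlaps(a: Segment, b: Segment) -> bool:
        return a[0] <= b[1] and b[0] <= a[1]
    return [c for c in candidates if not any(overlaps(c, o) for o in others)]
```

In this sketch a candidate is rejected outright when it overlaps any helix or strand segment; trimming the overlap instead would be an equally plausible reading of Step 5, since the text does not specify which is used.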
From the literature survey, it was observed that the distribution of propensity values among the different secondary structures differs between protein samples. Accurate estimation of propensity therefore required careful sampling of the data. After selecting data from the databases, the natural tendency of each amino acid was calculated using the following formula:

\[
P(AA \mid S) = \frac{\dfrac{\sum AA \text{ in sec strc } S}{\sum \text{all } AA \text{ in sec strc } S}}{\dfrac{\sum AA \text{ in all sec strc}}{\sum \text{all } AA \text{ in all sec strc}}}
\]

where P(AA|S) is the propensity of an amino acid given the secondary structure S; ∑ AA in sec strc S is the number of occurrences of that amino acid in the given secondary structure; ∑ all AA in sec strc S is the total number of amino acids in the given secondary structure; ∑ AA in all sec strc is the number of occurrences of that amino acid in all secondary structures; and ∑ all AA in all sec strc is the total number of amino acids in all secondary structures.

IV. RESULTS AND DISCUSSION

Propensities were calculated at two hierarchical levels: first at the Kingdom level and then at the Phylum level. Because a minimum of 500 unique records was used for the calculation of each propensity table, the propensity values are considered statistically significant. We also observe that the propensity values of certain amino acids in particular secondary structures are extremely high, and that they vary among the other secondary structures. On the whole, we observe that the amino acids arginine, glutamine, lysine, and tryptophan have higher values among all secondary structures. The propensity values obtained are tabulated in the following table.

TABLE II. PROPENSITY VALUES OF KINGDOM ANIMALIA, BACTERIA AND PLANTAE.
Amino acid   Kingdom Bacteria,    Kingdom Plantae,     Kingdom Animalia,
             Phylum Aquificae     Phylum Rhodophyta    Phylum Arthropoda
A            0.768                0.739                0.857
R            0.915                0.881                0.911
N            1.297                0.961                1.162
D            1.298                1.361                1.188
C            1.36                 0.895                1.057
E            0.79                 0.925                0.863
Q            1.012                1.06                 0.898
G            1.142                0.987                1.155
H            1.863                1.138                1.351
I            0.724                0.652                0.724
L            0.655                0.684                0.777
K            0.932                0.851                0.96
M            1.64                 0.964                0.985
F            0.831                1.035                0.752
P            1.941                1.846                1.832
S            1.455                1.38                 1.299
T            1.268                1.275                1.145
W            0.727                0.636                0.674
Y            0.749                0.893                0.719
V            0.83                 0.937                0.783

Fig. 2. Natural tendency of amino acids vs. propensity values of loops in Kingdom Bacteria, Phylum Aquificae.

Fig. 3. Natural tendency of amino acids vs. propensity values of loops in Kingdom Plantae, Phylum Rhodophyta.

Fig. 4. Natural tendency of amino acids vs. propensity values of loops in Kingdom Animalia, Phylum Arthropoda.

Fig. 5. Natural tendency of amino acids vs. propensity values of loops in Kingdom Bacteria, Plantae and Animalia.

A graphical representation of the amino acid propensity values was used to verify whether there is variation across different secondary structures. The comparisons lead us to conclude that the natural tendency for the formation of a particular secondary structure varies.

CONCLUSION

From the observations in the results section, we find that accuracy and specificity have been improved. The improvement is attributed to the highly specific propensity values, which are estimated after clustering the large sample of data available in the PDB by taxonomy (Plantae, Animalia, Bacteria and Virus). The specificity of the prediction is improved because, in the present work, loop prediction is performed after a homology search for the unknown sequence, which helps us select a suitable propensity table based on taxonomic information. With this, we conclude that there is a moderate improvement in both the specificity and the accuracy of the prediction. The data supporting this conclusion are provided in the results section. As a future prospect, this approach can be used for specific predictions such as antigen-antibody binding domains, protein-protein interaction sites, etc.