Transcript
Prediction of Local Structure in Proteins Using a Library of Sequence-Structure Motifs
Christopher Bystroff & David Baker
Paper presented by: Tal Blum [email protected]

The Approach
• Learn a set of clusters or structure segments that can be identified from a short local sequence
• Combine a set of local structural predictions into one whole structure

Methods - Database
• Database of 471 protein sequence families
• By Sander & Schneider 1994
• Each family contains one sequence of known structure
• No more than 25% sequence identity between any 2 alignments
• Well-determined structures
• Non-membrane proteins

Clustering of Sequence Segments
• Each position in the database is described by a weighted amino acid frequency (Vingron & Argos 1989)
• Similarity between a sequence and a cluster is defined by “Cross-Entropy”
• Segments of a given length (3-15) were clustered via the K-means algorithm
• Unsupervised

Assessing Structure within a Cluster and Choice of Paradigm
• Structural similarity between 2 peptide structure segments is measured by dme and mda
• S1_i->j is the distance between α-carbon atoms i and j in segment S1
• The paradigm for a cluster was chosen from the top 20 segments as the one with the smallest sum of mda/dme values with the others

True/False Boundaries in Structure Space
• Used for the refinement procedure
• Find natural boundaries
• Compute histograms of dme & mda vs the paradigm over all segments in the cluster
• The boundary was set to the point where the histogram first dropped to ½ of its maximum
• If the boundary reached 130° or 1.3 Å, the cluster was rejected
• Average boundaries are 81° and 0.89 Å
• 82 clusters were constructed (I-site library)

DME-MDA for 9-residue serine β-hairpin

Iterative Refinement of Clusters
• For each cluster with good boundaries
• Clustering increases P(cluster|sequence)
• The refinement aims to increase P(structure|cluster)
• 2 residues on each side of each sequence segment are also observed
• All segments that are not within the natural boundaries of the paradigm are removed
• The frequency profile of the cluster is calculated
• The database is searched using the new profile, and the 400 highest-scoring sequences form the new cluster

Cross-Validation and Confidence
• A 10-fold cross-validation was performed
• If the 10 paradigms were not structurally the same, or if the 10 runs did not converge to the same profile, the cluster was rejected
• If the cluster was not rejected, a confidence curve was computed as a function of the Dpq sequence-to-cluster similarity
• This makes it possible to compare different profile lengths, and it incorporates P(clust|seq) and P(struct|clust)

Confidence for Similarity

Clustering – What do we want?
• Direction: Sequence -> Structure
• We want the clusters of sequences to be as well separated as possible, so that a given test sequence can be assigned to 1 cluster
• Each cluster should have 1 or a few possible structures. Those structures will be used to predict the test protein structure
• $P(\mathrm{struct} \mid \mathrm{seq}) = \sum_{\mathrm{clust}} P(\mathrm{struct} \mid \mathrm{clust}, \mathrm{seq})\, P(\mathrm{clust} \mid \mathrm{seq}) = \sum_{\mathrm{clust}} P(\mathrm{struct} \mid \mathrm{clust})\, P(\mathrm{clust} \mid \mathrm{seq})$

Iterative Peak Removal
• Similar sequences can map to different structures in some cases
• When this happens, the predominant pattern occludes the second one
• To find those clusters, the refinement was performed using a subset of the data that excludes the members of the other class
• This helped identify two distinct α-C-cap extensions which were very similar in sequence

Cluster Weights
• The prediction accuracy is improved by weighting the confidence curves
• An iterative update was used
• Where $F^{+}_{C}$ are the false positives of cluster C and $F^{-}_{C}$ are the false negatives

Prediction Protocol
Given a sequence to predict:
1. Submit the sequence to PHD (Rost 94) to obtain a set of multiple aligned sequences and hence a profile
2. Each segment of the profile is scored against each of the 82 clusters to produce weighted confidences
3. Confidences are sorted
4. The first segment assigns φ & ψ from its paradigm
5. For all the subsequent segments in the sorted list, the prediction is used if it doesn’t conflict with previously assigned φ & ψ

Results
• Reported on the training set and on an independent set of 55 protein families
• Local evaluation is measured by agreement over an 8-residue window
• An 8-residue segment prediction is considered correct if none of the φ & ψ differences is larger than 120°, or if the rmsd between the correct and predicted structures is less than 1.4 Å
• An error is counted per position iff all 8 overlapping segments are incorrect
• mda is stricter than the commonly used Q3 score

Results
• Training set
– 471 sequences -> 122,510 residues
– 95% of the 471 had 1 match with confidence ≥ 0.8
– 40% of the residues had confidence ≥ 0.6 and were 71% (mda) correct

Combinations of I-sites and Conventional Secondary Structure Predictions
• With the PHD program
• Requires translation into secondary structure, or from SS into torsion angles
• Every program performed better in its own domain
• 64% Q3 because of under-predicting loops and over-predicting strands
• I-site was much better in loops and specific angles of turns
• Can complement PHD

Comparison of I-Site & PHD

I-site library
• 82 clusters represent 13 structural motifs

Summary of the I-site library

Conclusions
• Method is fast – requires only profile comparisons
• There is a measure of “confidence” in the prediction
• They do not provide accuracy over the whole protein
• The authors believe that the strong local sequence-structure relationships (those that occur more than 30 times) are present in I-site

Discussion
• NMR studies of isolated peptides of fewer than 30 residues show that the peptides do not have a well-defined structure. The I-site motifs are the exceptions
• It might be that the motifs are the areas that adopt structure independently of the rest of the protein
• An extension might be context-specific motifs

2 Approaches for Global Scoring Functions
• Derived from the protein database
– Large # of parameters
– Complicated
• Potentials
– Based on chemical intuitions
– Simpler
– Clearer insights into sequence/structure relations
• They chose the database approach
– Because of the dangers of crafting a measure for a specific protein family rather than for the whole DB

Scoring Functions
$$P(\mathrm{Structure} \mid \mathrm{Sequence}) = \frac{P(\mathrm{Sequence} \mid \mathrm{Structure})\, P(\mathrm{Structure})}{P(\mathrm{Sequence})} \propto P(\mathrm{Structure})\, P(\mathrm{Sequence} \mid \mathrm{Structure})$$
since P(Sequence) is independent of Structure
• P(Seq|Str) is used when computing sequence profiles for motifs
• P(Structure) is hardest to estimate and contains most of the non-local interactions
• For ab initio prediction, P(Structure) captures the features that distinguish folded structures from random chain (local) configurations

Radius of Gyration² Scoring Function
• Measures the largest radius from the center of the fold
• Advantages
– Not dependent on alpha-beta decomposition, since the generated structures are made from segments of real proteins and their alpha-beta decomposition is much like that of real proteins
• Disadvantages
– Structures with paired beta strands are no more probable than those with unpaired beta strands
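
The paradigm choice described on the "Choice of Paradigm" slide (pick, from the top 20 segments, the one with the smallest summed mda/dme to the others) amounts to picking a medoid. A minimal sketch, assuming the pairwise structural distances are already available as a square matrix whose rows are ordered by sequence score; the function name `choose_paradigm` and the matrix layout are illustrative and not taken from the paper:

```python
import numpy as np

def choose_paradigm(dist, top_n=20):
    """Return the index of the medoid among the first `top_n` segments.

    `dist` is assumed to be a symmetric matrix of pairwise structural
    distances (e.g. dme or mda), with segments sorted by sequence score.
    """
    d = np.asarray(dist, dtype=float)[:top_n, :top_n]
    # Sum each segment's distance to all the others; the smallest sum wins.
    return int(np.argmin(d.sum(axis=1)))

# Toy example with three segments (distances are made up).
toy = [[0.0, 1.0, 4.0],
       [1.0, 0.0, 2.0],
       [4.0, 2.0, 0.0]]
paradigm_index = choose_paradigm(toy)
```

In the toy matrix the row sums are 5, 3 and 6, so the middle segment is selected.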
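
The natural-boundary rule from the "True/False Boundaries" slide (the point where the histogram first drops to ½ of its maximum, with rejection at 130° or 1.3 Å) could be sketched as follows. The bin width, the choice of scanning to the right of the peak, the use of the bin's left edge as the boundary, and the name `half_max_boundary` are all assumptions made for illustration:

```python
import numpy as np

def half_max_boundary(values, bin_width, reject_at):
    """Return the half-maximum boundary, or None if the cluster is rejected.

    `values` are distances (dme or mda) of cluster members to the paradigm.
    """
    values = np.asarray(values, dtype=float)
    edges = np.arange(0.0, values.max() + 2 * bin_width, bin_width)
    counts, edges = np.histogram(values, bins=edges)
    peak = int(np.argmax(counts))
    half = counts[peak] / 2.0
    # Walk away from the peak until the count first falls to half the maximum.
    for i in range(peak + 1, len(counts)):
        if counts[i] <= half:
            boundary = float(edges[i])
            return None if boundary >= reject_at else boundary
    # Assumption: a histogram that never drops to half-max is also rejected.
    return None

# Made-up mda values in degrees, binned in 10-degree steps.
mda = [5] * 10 + [15] * 8 + [25] * 3 + [35]
boundary = half_max_boundary(mda, bin_width=10, reject_at=130)
```

Here the counts per bin are 10, 8, 3, 1, so the first bin at or below half of the peak starts at 20°.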
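
Steps 3 to 5 of the Prediction Protocol (sort by confidence, then accept each prediction only if it does not conflict with angles already assigned) describe a greedy procedure. A rough sketch under stated assumptions: the slides do not define "conflict", so here two predictions conflict when φ or ψ at a shared residue differs by more than a hypothetical tolerance `tol`; the tuple layout and function names are likewise illustrative:

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def assign_angles(predictions, tol=60.0):
    """Greedy phi/psi assignment.

    `predictions` is a list of (confidence, start, angles), where `angles`
    holds one (phi, psi) pair per residue, taken from a cluster paradigm.
    `tol` is a hypothetical conflict threshold, not a value from the paper.
    """
    assigned = {}
    for _conf, start, angles in sorted(predictions, key=lambda p: p[0], reverse=True):
        placed = list(zip(range(start, start + len(angles)), angles))
        conflict = any(
            pos in assigned
            and (angle_diff(assigned[pos][0], phi) > tol
                 or angle_diff(assigned[pos][1], psi) > tol)
            for pos, (phi, psi) in placed
        )
        if conflict:
            continue  # skip predictions that disagree with earlier ones
        for pos, ang in placed:
            assigned.setdefault(pos, ang)  # keep the higher-confidence angles
    return assigned

# Made-up predictions: a helix, a conflicting strand, and a compatible helix.
preds = [
    (0.9, 0, [(-60.0, -45.0), (-60.0, -45.0)]),
    (0.8, 1, [(-120.0, 130.0), (-120.0, 130.0)]),
    (0.7, 2, [(-65.0, -40.0), (-65.0, -40.0)]),
]
result = assign_angles(preds)
```

The strand prediction overlaps residue 1 with a very different ψ and is dropped, while the third prediction fills residues 2 and 3.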