Secondary Structure Prediction
Protein Folding Problem / Protein Structure / Secondary Structure Prediction

A protein (a linear sequence of amino acids) folds into a unique 3D structure; the problem is to determine this structure.

Lysozyme sequence: KVFGRCELAA RGYSLGNWVC QATNRNTDGS RWWCNDGRTP SALLSSDITA DGNGMNAWVA AMKRHGLDNY AAKFESNFNT TDYGILQINS GSRNLCNIPC SVNCAKKIVS WRNRCKGTDV QAWIRGCRL

Secondary structure prediction
• Given a protein sequence (primary structure):
  GHWIATRVGQLIREAYEDYRHFSSECPFIP
• Predict its secondary structure content (C = coils, H = alpha helix, E = beta strands):
  CEEEEECCHHHHHHHHHHHCCCHHHHHCCC

Why Secondary Structure Prediction?
• An easier problem than 3D structure prediction (more than 40 years of history).
• Accurate secondary structure prediction can be important information for tertiary structure prediction.
• Improving sequence alignment accuracy
• Protein function prediction: alignment to proteins with known function?
• Protein classification
• Fold prediction

Benchmarks
• EVA: EValuation of Automatic protein structure prediction
  http://www.rostlab.org/eva/
• Targets: no pair in the subset has more than 33% identical residues over more than 100 aligned residues.
• Updated every six months; latest release: June 2006, 3477 chains
• CASP: Critical Assessment of Techniques for Protein Structure Prediction
• Biennial contest on prediction of protein structures
• The CASP7 meeting will be held in California in November 2006
• CASP6 and CASP7 did not have a secondary structure prediction competition

Standard of truth
• PDB -> secondary structure
• Observed secondary structure is taken primarily from the program DSSP; however, EVA will additionally use STRIDE.

Accuracy measures
• Q3
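Q3, the basic accuracy measure mentioned above, is simply the fraction of residues whose three-state label (H, E, or C) is predicted correctly. A minimal sketch (the function name and the toy strings are illustrative, not from the benchmarks):

```python
def q3(predicted: str, observed: str) -> float:
    """Fraction of residues whose three-state label (H/E/C) is correct."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)

# Toy example (not a real benchmark case): 8 of 10 residues match
print(q3("CEEEECCHHH", "CEECCCCHHH"))  # -> 0.8
```

The per-class variants below (Q3helix, Q3strand, Q3loop) restrict the denominator to residues whose true label is the class in question.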
• Conversion of DSSP secondary structure
• The following chart shows how the 8 states from DSSP are converted to three secondary structure states:

  DSSP: H  G  I  E  B  T  S  ''
  used: H  H  H  E  E  L  L  L

• Sensitivity:
  Q3helix: number of correctly assigned helix residues / total number of true helical residues
• Specificity:
  SPhelix: number of correctly assigned helix residues / total number of residues predicted to be helices

Accuracy measures
• Q3: number of correctly assigned residues / total number of residues
• Q3helix, Q3strand, Q3loop
• Matthews Correlation Coefficient (MCC):

  MCC = (TP·TN - FP·FN) / sqrt((TN + FN)(TN + FP)(TP + FN)(TP + FP))

Performance of current methods
• Results on the EVA benchmark (Q3 scores).
• Only proteins with no significant sequence identity to previously known PDB proteins are reported.

  Method       Common set 1   Common set 2   Common set 5
  PSIpred          76.8           77.4           77.3
  PROFsec          75.5           76.2           76.4
  PHDpsi           73.4           74.3           74.3
  PROF king        71.6           71.7            -
  Prospect         71.1            -              -
  SAM-T99sec       77.2           77.2           77.1

Theoretical limit
• ~88%-90%
• Because of differences even in the experimental methods (X-ray vs. NMR)
• Still a 10%-15% gap to fill :)
• The first paper addresses the question: why can't the current techniques reach that theoretical limit?
• Short answer: because current techniques do not consider long-range interactions between amino acids

Reading
• "The effect of long-range interactions on the secondary structure formation of proteins" by D. Kihara (Purdue University), Protein Science, 2005
• Correlates prediction accuracy with Residue Contact Order (RCO) information in 2777 proteins

Residue Contact Order / Prediction accuracy
• On the analyzed 2777 proteins, the contact order of residue i is

  RCO_i = (1/n) Σ_j δ_ij |i - j|

  where n is the total number of contacts for residue i and δ_ij is 1 when residue i is in contact with residue j (otherwise it is 0).

Prediction accuracy vs. protein length

Accuracy vs. Relative Contact Order (RCO)

A more quantitative evaluation of correlation

Reading
• "A new representation of protein secondary structure prediction based on frequent patterns" by Birzele and Kramer (Germany), Bioinformatics, 29 August 2006
• Uses variable-length patterns to define features for amino acids that can be used to predict secondary structures.
• It is a window-less approach.
• Does that mean it takes long-range interactions into account?
• If so, why is it not better than PSIpred?

Algorithms to find frequent patterns
• A level-wise approach: starting from patterns of length 1, look for frequently occurring patterns of increasing length.
• To relax the exact-matching requirement they extend the regular alphabet by defining amino acid groups (10 groups in the paper).
• Level-wise extension idea: if a pattern is frequent, then all of its subsequences should be frequent. So if A is a candidate pattern of length n, A1:n-1 and A2:n should already be frequent patterns.

Frequent pattern finding algorithm
• Check if the candidates really occur frequently.
• Generate candidates for the next level by extending the current frequent patterns.
• Example: frequent patterns ACDEF, ACDEG, ACDEH, CACDE; next-level candidates: CACDEF, CACDEG, CACDEH

Frequent patterns to features
• Frequent patterns will be used to define a set of features for a single amino acid. How?
• Feature vector for the amino acid D

Using homolog sequences

Feature Quantification

But still...
• Still, in this form these feature vectors cannot be used as input to an SVM.
• Borrow some ideas from text mining.
• What is the size of the feature vector? 60M in the window-less approach.
• How many days will it take to train the SVM?
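The level-wise extension idea described above can be sketched as an Apriori-style join: a length-(n+1) candidate is formed from two frequent length-n patterns that overlap in n-1 positions, which guarantees that both of its length-n subsequences are already frequent. A small sketch using the slides' own example (the function name is illustrative):

```python
def next_level_candidates(frequent):
    """Join frequent length-n patterns p, q with p[1:] == q[:-1]
    into length-(n+1) candidates, so that both length-n
    subsequences of every candidate are known to be frequent."""
    return {p + q[-1]
            for p in frequent
            for q in frequent
            if p[1:] == q[:-1]}

frequent = {"ACDEF", "ACDEG", "ACDEH", "CACDE"}
print(sorted(next_level_candidates(frequent)))
# -> ['CACDEF', 'CACDEG', 'CACDEH']
```

Each candidate would then be counted against the sequence database and kept only if it meets the minimum-occurrence threshold (40 in the paper).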
• Before feeding the feature vectors to the SVM we need to do some manual feature selection.
• Normalize the values among same-length patterns.

The SVM model
• Use χ² and precision-recall binning to prune some of the features.
• That leaves ~15K features (pattern, pattern position) for an amino acid.
• Extracted 218,678 features of maximum length 8 with a minimum number of occurrences of 40.
• Trained on 940 proteins.
• Two-layer SVM:
  • First layer: predict H, E, C based on a central amino acid.
  • Second layer: smooth out prediction errors; e.g., HHHHCHHHH should actually be HHHHHHHHH.

Results
• PSI-PRED is still the best! This method is better only at predicting coils :)

Reading
• "Protein secondary structure prediction for a single-sequence using hidden semi-Markov models" by Z. Aydin, Y. Altunbasak, and M. Borodovsky (Georgia Tech), BMC Bioinformatics, March 2006
• Builds an improved HMM: different models for residues internal to a secondary structure segment and for residues on the segment boundaries.
• General comments on the paper: bad structure (tables and text are too far apart); typos; no Figure 2.

Secondary structure representation
• Two vectors: one to indicate the ends of secondary structure segments and the other to indicate the secondary structure types:
  T = (L, E, L, E, L, H, L)
  S = (4, 9, 12, 16, 21, 28, 33)

Bayesian Formulation
• Given a sequence R, the goal is to find the S and T vectors that maximize the a posteriori probability:

  P(S, T | R) = P(R | S, T) · P(S, T) / P(R)

  where P(R | S, T) is the likelihood and P(S, T) is the a priori probability. P(R) is constant for every (S, T), so it might as well be dropped for our purposes.

Computing the likelihood
• That's the tricky part.
• Non-local interactions are ignored (unlike what Kihara suggested).
• To model proximal residues and internal residues differently, they suggest an expression for P(R | S, T) with separate terms for residues at the N-terminus, internal residues, and residues at the C-terminus, plus a segment-length term, i.e., the probability of having a type-t secondary structure segment of length l.

The HSMM architecture
• Too many parameters → you need a lot of training data.
• Solution: get rid of some parameters (the reduced dependency model).

Results

Discussion
• Slightly better than PSIPRED? But not the real PSIPRED: the version that does not use any homology information.
• How can the methods be improved?
• Can we reach the 88%-90% limit? How?
• Other papers?
• Project ideas?
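The segment representation used in the Aydin et al. paper, a type vector T and an end-position vector S such as T=(L,E,L,E,L,H,L), S=(4,9,12,16,21,28,33), can be read off a three-state string by recording each run of identical labels. A minimal sketch (the encoding function is my own illustration, not code from the paper):

```python
def to_segments(ss):
    """Encode a three-state string as (T, S): T holds the segment
    types and S the 1-based end position of each segment."""
    T, S = [], []
    for i, state in enumerate(ss, start=1):
        if not T or state != T[-1]:
            T.append(state)   # a new segment starts here
            S.append(i)
        else:
            S[-1] = i         # extend the current segment's end
    return T, S

# The slides' example: 4 L, 5 E, 3 L, 4 E, 5 L, 7 H, 5 L (33 residues)
ss = "LLLL" + "EEEEE" + "LLL" + "EEEE" + "LLLLL" + "HHHHHHH" + "LLLLL"
print(to_segments(ss))
# -> (['L', 'E', 'L', 'E', 'L', 'H', 'L'], [4, 9, 12, 16, 21, 28, 33])
```

With this encoding, the HSMM's segment-length term can be evaluated per (type, length) pair instead of per residue.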