Protein Fold recognition
Morten Nielsen, CBS, BioCentrum, DTU

Objectives
• Understand the basic concepts of fold recognition
• Learn why even sequences with very low sequence similarity can be modeled
– Understand why %id is such a terrible measure of reliability
• See the beauty of sequence profiles

Background. Why protein modeling?
• Because it works!
– Close to 50% of all new sequences can be homology modeled
• Experimental effort to determine protein structure is very large and costly
• The gap between the size of the protein sequence data and protein structure data is large and increasing

Homology modeling and the human genome

Swiss-Prot database
~200,000 in Swiss-Prot
~2,000,000 if TrEMBL is included

New PDB structures

PDB New Fold Growth
[Figure: old folds vs. new folds]
• The number of unique folds in nature is fairly small (possibly a few thousand)
• 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
• Number of new folds is NOT growing

How well can we do it?
Sali, A. & Kuriyan, J. Trends Biochem. Sci. 22, M20–M24 (1999)

Homology modeling. Why can we do it?
The structure of a protein is uniquely determined by its amino acid sequence (but sequence is sometimes not enough):
– prions
– pH, ions, cofactors, chaperones
Structure is conserved much longer than sequence in evolution

Identification of fold
If sequence similarity is high, proteins share structure (Safe zone)
If sequence similarity is low, proteins may share structure (Twilight zone)
Most proteins do not have a partner with high sequence homology
Rajesh Nair & Burkhard Rost Protein Science, 2002, 11, 2836-47

Example.
A person once did her PhD solving the structure of a protein having the amino acid sequence below
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV
VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL
FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL
GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
• What is the function?
• Where is the active site?

Could she have saved three years of work?
• Function
– Run Blast against PDB: no significant hits
– Run Blast against NR (Sequence database): function is Acetylesterase?
• Where is the active site?

Example. Where is the active site?
1G66 Acetylxylan esterase
1USW Hydrolase
1WAB Acetylhydrolase

Example. Where is the active site?
• Align sequence against structures of known acetylesterases, like
– 1WAB, 1FXW, …
• Cannot be aligned. Too low sequence similarity
1K7C.A 1WAB._ RMSD 11.2397
QAL 1K7C.A  71 GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTF
DAL 1WAB._ 160 GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY

Is it really impossible?
• Worked for 2-3 years in SBI-AT developing methods for homology modeling in the twilight zone
• Showed that homology modeling is also possible at very low sequence homology
• So, try to show that 3 years of work could have been saved if the most advanced homology modeling techniques had been used

How can we do it?
• Identify template(s) – initial alignment
– Can give you protein fold (indicates function)
• Improve alignment
– Can give you putative active site
• Backbone generation
• Loop modeling
– Most difficult part
• Side chains
• Refinement
• Validation

How to do it
Identify fold (template) for modeling
– Find the structure in the PDB database that resembles your new protein the most
– Can be used to predict function
– And maybe active sites

Protein homology modeling is only possible if %id is greater than 30-50%

Why %id is so bad!!
[Figure: 1200 models sharing 25-95% sequence identity with the submitted sequences (www.expasy.ch/swissmod)]

Identification of correct fold
• % ID is a poor measure
– Many evolutionarily related proteins share low sequence homology
– A short alignment of 5 amino acids can share 100% id, what does this mean?
• P-value or E-value more reliable

What are P and E values?
• E-value
– Number of expected hits in database with score higher than match
– Depends on database size
• P-value
– Probability that a random hit will have score higher than match
– Database size independent
• Example: Score 150, 10 hits with higher score (E=10), 10000 hits in database => P = 10/10000 = 0.001
[Figure: distribution of hit scores; axis: Score]

Template identification
• Simple sequence based methods
– Align (BLAST) sequence against sequence of proteins with known structure (PDB database)
• Sequence profile based methods
– Align sequence profile (Psi-BLAST) against sequence of proteins with known structure (PDB, FUGUE)
– Align sequence profile against profile of proteins with known structure (FFAS)
• Sequence and structure based methods
– Align profile and predicted secondary structure against proteins with known structure (3D-PSSM, Phyre)
• Sequence profiles and structure based methods
– Our work

What goes wrong when Blast fails?
• Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences
• This scoring matrix is identical at all positions in the protein sequence!
[Figure: example sequence EVVFIGDSLVQLMHQC matched against short fragments]
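As a small worked example for the P- and E-value slide above, the sketch below reproduces the slide's arithmetic (10 higher-scoring hits in a database of 10000 gives P = 0.001) and shows why the E-value grows with database size while P stays fixed. It assumes the slide's simplified relation P = E / N; the function names are illustrative and are not taken from BLAST or any other tool.

```python
def p_from_e(e_value, database_size):
    """Slide's simplified relation: P = E / N (assumption, not BLAST's exact formula)."""
    if database_size <= 0:
        raise ValueError("database_size must be positive")
    return e_value / database_size


def e_from_p(p_value, database_size):
    """Inverse relation: expected number of higher-scoring hits, E = P * N."""
    return p_value * database_size


# Slide example: 10 hits score higher than the match, 10000 hits in the database.
p = p_from_e(10, 10_000)

# Same P-value, but a database ten times larger: the E-value scales with it.
e_larger_db = e_from_p(p, 100_000)

print("P-value:", p)
print("E-value in the larger database:", e_larger_db)
```

The point of the comparison is that a hit with the same P-value becomes less convincing by E-value as the database grows, which is why the two numbers should not be mixed when ranking templates.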
Blosum scoring matrix
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

Alignment accuracy. Scoring functions
• Blosum62 score matrix. Fg=1. Ng=0?
   L  A  G  D  S  D
F  0 -2 -3 -3 -2 -3
I  2 -1 -4 -3 -2 -3
G -4  0  6 -1  0 -1
D -4 -2 -1  6  0  6
S -2  1  0  0  4  0
L  4 -1 -4 -4 -2 -4
• Score = 2 - 1 + 6 + 6 + 4 = 17
• Alignment:
  LAGDS
  I-GDS

When Blast works!
[Figure: 1PLC._ and 1PLB._]

When Blast fails!
[Figure: 1PLC._ and 1PMY._]

When Blast fails, use sequence profiles!
[Figure: 1PLC._ and 1PMY._]
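The Blosum62 scoring example above (LAGDS aligned to I-GDS, score 17) can be reproduced with a few lines of Python. This is a minimal sketch: it assumes that Fg is the penalty for the first gap position in a run and Ng the penalty for each following gap position (the slide does not spell the abbreviations out), and it hard-codes only the handful of matrix entries the example needs. The names `pair_score` and `alignment_score` are illustrative.

```python
# Only the Blosum62 entries needed for this example, copied from the matrix above.
BLOSUM62_SUBSET = {
    ("L", "I"): 2,
    ("G", "G"): 6,
    ("D", "D"): 6,
    ("S", "S"): 4,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def alignment_score(top, bottom, first_gap=1, next_gap=0):
    """Score a gapped pairwise alignment column by column.

    first_gap / next_gap are assumed readings of the slide's Fg / Ng.
    The same substitution scores are used at every position, which is
    exactly the limitation that sequence profiles address.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score -= next_gap if in_gap else first_gap
            in_gap = True
        else:
            score += pair_score(a, b)
            in_gap = False
    return score


score = alignment_score("LAGDS", "I-GDS")  # 2 - 1 + 6 + 6 + 4
print(score)
```

With Ng=0, lengthening an existing gap costs nothing under this reading, so only opening a gap is penalized.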
Sequence profiles
• In reality not all positions in a protein are equally likely to mutate
• Some amino acids (active sites) are highly conserved, and the score for mismatch must be very high
• Other amino acids can mutate almost for free, and the score for mismatch should be lower than the BLOSUM score
• Sequence profiles can capture these differences

Protein structure classification
[Figure: hierarchy of protein world, protein fold, protein superfamily, protein family; label: New Fold]

Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
• Conserved column: matching anything but G => large negative score
• Variable column: anything can match

How to make sequence profiles
1. Align (BLAST) sequence against large sequence database (Swiss-Prot)
2. Select significant alignments and make profile (weight matrix) using techniques for sequence weighting and pseudo counts
3. Use weight matrix to align against sequence database to find new significant hits
4. Repeat 2 and 3 (normally 3 times!)

Sequence profiles (1J2J.B)
[Figure panels: 0 iterations (Blosum62), 1 iteration, 2 iterations, 3 iterations]

Example. The post doc sequence (SGNH active site)

Example. Where is the active site?
• Sequence profiles might show you where to look!
• The active site could be around
– S9, G42, N74, and H195
[Figure: profile-profile scoring matrix, 1K7C.A against 1WAB._]

Example. Where is the active site?
Align using sequence profiles
ALN 1K7C.A 1WAB._ RMSD = 5.29522.
14% ID
(Residues marked S, G, N, and H above the alignment in the original slide.)
1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------

1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ ---------------------HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP

1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L

Structural superposition
[Figure: Blue: 1K7C.A, Red: 1WAB._]

Where was the active site?
Rhamnogalacturonan acetylesterase (1k7c)

Including structure
• Sequences within a protein superfamily share only remote sequence homology, but they share high structural homology
• Structure is known for template
• Predict structural properties for query
– Secondary structure
– Surface exposure
• Position specific gap penalties derived from secondary structure and surface exposure

Using structure
Sequence & structure profile-profile based alignments
– Template
• Sequence based profiles
• Annotated secondary structure
• Predicted secondary structure
– Query
• Sequence based profile
• Predicted secondary structure
– Position specific gap penalties derived from secondary structure

How good are we? Alignment accuracy
[Figure: Alignment performance; y-axis: Fractional n4, 0.000 to 0.450]
        Blosum  Profile  Profile+SS  Profile+ASS
Train   0.259   0.393    0.417       0.420
Test    0.212   0.348    0.386       0.393

Fold recognition
• Benchmark
– Query set of 100 (train set) and 200 (test set)
– Database of 355 PDB structures
– Align Query against Db
• If structurally similar hit = 1, else hit = 0
– Use CE to define structural similarity
• Calculate AUC (area under the ROC curve)
– Perfect method can separate hits from non-hits
• How to rank hits?
– Alignment score?
– %Id
– Z score (p-value)

CE structural alignment (combinatorial extension)

AUC performance measure
Query   Templ   Score      Hit/nonhit
1CJ0.A  1B78.A   0.170963  0
1CJ0.A  1B8A.A  -0.040029  0
1CJ0.A  1B8B.A  -0.012789  0
1CJ0.A  1B8G.A  12.342823  1
1CJ0.A  1B9H.A  13.394361  1
1CJ0.A  1BAR.A  -1.281068  0
1CJ0.A  1BAV.C  -1.091305  0

Query   Templ   Score      Hit/nonhit
1CJ0.A  1B8G.A  12.342823  1
1CJ0.A  1DTY.A  11.867786  1
1CJ0.A  1DGD._  11.271914  1
1CJ0.A  1GTX.A  11.010288  1
1CJ0.A  2GSA.A  10.958170  1
1CJ0.A  1BW9.A   2.651775  0
1CJ0.A  1AUP._   2.507336  1
1CJ0.A  1GTM.A   2.444512  0

Fold recognition performance
[Figure: Test set performance; y-axis: AUC, 0.500 to 1.000; group label: Profile-profile]
             sco    z score  bl     pdbblast  blast
All          0.921  0.958    0.749  0.809     0.698
Per Protein  0.956  0.971    0.855  0.888     0.809

Outlook
• Include position dependent gap penalties
– The method now uses equal gap penalties throughout the scoring matrix
– In real proteins placement of insertions and deletions is highly structure dependent
• No gaps in secondary structure elements
• Gaps most frequent in loops

CASP. Which are the best methods
• Critical Assessment of Structure Predictions
• Every second year
• Sequences from about-to-be-solved structures are given to groups who submit their predictions before the structure is published
• Modelers make predictions
• Meeting in December where correct answers are revealed

CASP6 results
The top 4 homology modeling groups in CASP6
• All winners use consensus predictions
– The wisdom of the crowd
• Same approach as in CASP5!
• Nothing has happened in 2 years!

The Wisdom of the Crowds
The Wisdom of Crowds. Why the Many are Smarter than the Few. James Surowiecki
One day in the fall of 1906, the British scientist Francis Galton left his home and headed for a country fair… He believed that only a very few people had the characteristics necessary to keep societies healthy.
He had devoted much of his career to measuring those characteristics, in fact, in order to prove that the vast majority of people did not have them. … Galton came across a weight-judging competition… Eight hundred people tried their luck. They were a diverse lot, butchers, farmers, clerks and many other non-experts… The crowd had guessed … 1,197 pounds, the ox weighed 1,198

The wisdom of the crowd!
– The highest scoring hit will often be wrong
• Not one single prediction method is consistently best
– Many prediction methods will have the correct fold among the top 10-20 hits
– If many different prediction methods all have a common fold among the top hits, this fold is probably correct

3D-Jury (Best group)
Inspired by ab initio modeling methods
– Average of frequently obtained low energy structures is often closer to the native structure than the lowest energy structure
Find most abundant high scoring model in a list of predictions from several predictors
1. Use output from a set of servers
2. Superimpose all pairs of structures
3. Similarity score $S_{ij}$ = # of Cα pairs within 3.5Å (if # > 40; else $S_{ij} = 0$)
4. 3D-Jury score = $\sum_{j} S_{ij} / (N+1)$
Similar methods developed by A Elofsson (Pcons) and D Fischer (3D shotgun)

How to do it? Where is the crowd
• Meta prediction server
– Web interface to a list of public protein structure prediction servers
– Submit query sequence to all selected servers in one go
http://bioinfo.pl/meta/

Meta Server

Evaluating the crowd. Meta Server

Evaluating the crowd. 3D Jury

Take home message
• Identifying the correct fold is only a small step towards successful homology modeling
• Do not trust % ID or alignment score to identify the fold. Use p-values
• Use sequence profiles and local protein structure to align sequences
• Do not trust one single prediction method, use consensus methods (3D Jury)
• Only if everything fails, use ab initio methods

The end
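Supplementary sketch for the 3D-Jury slide: steps 3 and 4 of the scoring scheme can be written out in a few lines of Python. This is an illustration under stated assumptions, not the 3D-Jury implementation. It assumes that every pair of models has already been superimposed (step 2) and that each model lists Cα coordinates for the same residues in the same order; it also reads N as the number of models in the list and leaves the self-comparison out of the sum. All function names and the toy coordinates are hypothetical.

```python
import math


def similarity(model_a, model_b, cutoff=3.5, min_pairs=40):
    """Step 3: count C-alpha pairs within `cutoff` angstroms.

    Assumes the two models are already superimposed and residue-matched
    (a simplification). Returns 0 unless more than `min_pairs` pairs are close.
    """
    close = sum(1 for a, b in zip(model_a, model_b) if math.dist(a, b) <= cutoff)
    return close if close > min_pairs else 0


def jury_scores(models):
    """Step 4: score each model by its summed similarity to all other models.

    N is taken to be the number of models (an assumption; the slide does
    not define N), and j == i is skipped.
    """
    n = len(models)
    scores = []
    for i in range(n):
        total = sum(similarity(models[i], models[j]) for j in range(n) if j != i)
        scores.append(total / (n + 1))
    return scores


# Toy data: a straight 50-residue chain, a copy shifted by 1 angstrom,
# and a copy shifted by 100 angstroms (an outlier prediction).
chain = [(3.8 * k, 0.0, 0.0) for k in range(50)]
shifted = [(x + 1.0, y, z) for x, y, z in chain]
outlier = [(x + 100.0, y, z) for x, y, z in chain]

scores = jury_scores([chain, shifted, outlier])
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, "best model index:", best)
```

The two mutually consistent models support each other and the outlier receives no votes, which is the consensus idea behind picking the most abundant high scoring model rather than the single top hit.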