Homology Modeling Workshop
GHIKLSYTVNEQNLKPERFFYTSAVAIL

Outline:
• Introduction to protein structure & databases
• Structure prediction approaches
  – Ab-initio
  – Threading
  – Homology modeling
• Hands-on
• Model evaluation

From Sequence to Structure
Protein structure is hierarchical:
• Primary – the sequence of covalently linked amino acids
• Secondary – local 3D patterns (helices, sheets, loops)
• Tertiary – the overall 3D fold
• Quaternary – the assembly of two or more protein chains

From Sequence to Structure
• All information about the native structure of a protein is encoded in its amino acid sequence together with its native solution environment.
• Although many conformations are possible, each protein exhibits only one or a few native folds (Levinthal's paradox).
• Protein folding is driven by various forces:
  – Ionic forces
  – Hydrogen bonds
  – The hydrophobic effect
  – ...

Protein 3D Structures
A protein's structure has a critical effect on its function:
1. Binding pockets (example: PDB ID 1nw7)
2. Areas with specific chemical/electrical properties
3. Importance of the global fold for function

Motivation to Acquire a Structure
• Identifying active and binding sites
• Characterizing the protein's mechanism (catalysis & interactions)
• Searching for ligands of a given binding site
• Understanding the molecular basis of diseases
• Designing mutants
• Drug design
• And more...
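Levinthal's paradox, mentioned under "From Sequence to Structure", can be made concrete with a back-of-the-envelope estimate. The sketch below is only an illustration: the three states per residue and the sampling rate of 10^13 conformations per second are assumed textbook-style numbers, not values from these slides.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365


def levinthal_estimate(n_residues, states_per_residue=3, samples_per_second=1e13):
    """Return (number of conformations, years needed to enumerate them all).

    states_per_residue and samples_per_second are illustrative assumptions.
    """
    conformations = states_per_residue ** n_residues
    years = conformations / samples_per_second / SECONDS_PER_YEAR
    return conformations, years


conformations, years = levinthal_estimate(100)
print(f"{conformations:.2e} conformations, roughly {years:.1e} years of exhaustive search")
```

Even with such generous assumptions an exhaustive search is hopeless for a 100-residue chain, which is why folding cannot be a random search and why prediction methods need either templates or guided energy searches.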
Determining Structure
• NMR
• X-ray diffraction
• Electron microscopy

Protein Sequence & Structure Databases
Some of the available databases:
• RCSB – the Protein Data Bank; all deposited structures
• UniProt – the main sequence database
  – SwissProt
  – TrEMBL
• NCBI – many databases, including sequences and structures
• PDBsum – combines structural & sequence data

Protein Data Bank (PDB)
• The PDB archive contains information about experimentally determined structures of proteins, nucleic acids, and complex assemblies.
• The structures in the archive range from tiny proteins and bits of DNA to complex molecular machines like the ribosome.
• There are currently 57013 structures deposited in the PDB. However, removing redundant sequences (e.g. at a 90% sequence identity cutoff) reduces the number of structures to 19988.
• Each structure receives a unique 4-character ID.

[Figure: Protein Data Bank (PDB); axes: Number vs. Year]

SCOP – fold classification
• All alpha
• All beta
• Alpha and beta

[Figure: Protein Data Bank (PDB), growth of unique folds as defined by SCOP; axes: Number vs. Year]

UniProt – Protein Sequence Database
• UniProt is a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR).
• In 2002, the three institutes decided to pool their resources and expertise and formed the UniProt Consortium.
• The world's most comprehensive catalog of information on proteins
• Sequence, function & more
• Comprised mainly of two databases:
  – SwissProt – 366226 entries last year, 412525 protein entries now; high-quality annotation, non-redundant & cross-referenced to many other databases.
  – TrEMBL – 5708298 entries last year, 7527796 protein entries now; computer translation of the genetic information from the EMBL Nucleotide Sequence Database. Many proteins are poorly annotated, since only automatic annotation is generated.

More Sequences Than Structures
• Discrepancy between the number of known sequences and solved structures: 5,047,807 UniRef90 entries vs. 19988 structures that are non-redundant at 90%.
• Computational methods are needed to obtain more structures.

Protein Structure Prediction
Why predict protein structure if we can use experimental tools to determine it?
• Experimental methods are slow and expensive.
• Some structures have resisted experimental determination.
• One representative structure of a family can suffice to deduce the structures of the family's other sequences.

Structure Prediction Approaches
1. Homology (Comparative) Modeling – based on sequence similarity to a protein whose structure has been solved.
2. Threading (Fold Recognition) – requires that the query adopt a fold similar to a known structure.
3. Ab-initio fold prediction – not based on similarity to a known sequence/structure.

Ab-initio
Structure prediction from "first principles": given only the sequence, try to predict the structure based on physico-chemical properties (energy, hydrophobicity, etc.).
• When all else fails; works for novel folds
• Shows that we understand the process

The Force Field (energy function)
A group of mathematical expressions describing the potential energy of a molecular system.
• Each expression describes a different type of physicochemical interaction between atoms in the system:
  – Van der Waals forces
  – Covalent bonds
  – Hydrogen bonds
  – Charges
  – Hydrophobic effects
• All of these except the covalent bonds are non-bonded terms.

Approaches to Ab-initio Prediction
1. Molecular Dynamics
• Simulates the forces that govern the protein in water.
• Since proteins usually fold spontaneously, the simulation should eventually lead to the native protein structure.
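Two of the non-bonded force-field terms listed above (van der Waals forces and charges) can be sketched as follows. This is a minimal illustration, not the energy function of any particular program: the Lennard-Jones form for the van der Waals term, the single shared `epsilon`/`sigma` pair, and the example coordinates are all assumptions made for the sketch.

```python
import math

# Approximate conversion factor giving Coulomb energies in kcal/mol when
# distances are in Angstrom and charges in units of the electron charge.
COULOMB_CONSTANT = 332.06


def lennard_jones(r, epsilon, sigma):
    """Van der Waals term: 4*epsilon*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2, dielectric=1.0):
    """Electrostatic term between two point charges at distance r."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def nonbonded_energy(atoms, epsilon=0.1, sigma=3.4):
    """Sum both terms over all atom pairs.

    atoms: list of (x, y, z, charge). One epsilon/sigma pair is used for
    every atom type, which is a simplifying assumption of this sketch.
    """
    total = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            r = math.dist(atoms[i][:3], atoms[j][:3])
            total += lennard_jones(r, epsilon, sigma)
            total += coulomb(r, atoms[i][3], atoms[j][3])
    return total


# Hypothetical three-atom system, coordinates in Angstrom.
example = [(0.0, 0.0, 0.0, 0.4), (3.8, 0.0, 0.0, -0.4), (0.0, 4.0, 0.0, 0.0)]
print(f"non-bonded energy: {nonbonded_energy(example):.3f} kcal/mol")
```

The double loop over atom pairs is the reason simulation cost grows quickly with system size, which matters for the molecular dynamics approach.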
Problems with molecular dynamics:
• Thousands of atoms
• A huge number of time steps is needed to reach the folded protein
• Feasible only for very small proteins

Approaches to Ab-initio Prediction
2. Minimal Energy
Assumption: the folded form is the minimal-energy conformation of a protein.
Main principles:
• Define an energy function.
• Search for the 3D conformation that minimizes the energy.
• Use a simplified energy function.
• Search methods for the minimal-energy conformation:
  – Greedy search
  – Simulated annealing
  – ...

Ab-initio
• Current methods (e.g. Rosetta) primarily exploit the fact that, although we are far from having observed all protein folds, we have probably seen nearly all substructures.
Local sequence-structure relationships:
• A library of known substructures (fragments shorter than 10 residues) is created.
• A range of possible conformations is selected for each fragment of the query protein.
Non-local sequence-structure relationships:
• The primary non-local interactions considered are hydrophobic burial, electrostatics, main-chain hydrogen bonding, etc.
• Structures that are consistent with both the local and non-local interactions are generated by minimizing the non-local interaction energy in the space defined by the local structure distributions.
Moult J. Philos. Trans. R. Soc. B 361:453–458 (2006)

Ab-initio – Example
[Figure: example prediction, from Moult J. Philos. Trans. R. Soc. B 361:453–458 (2006)]

Fold Recognition (Threading)
Given a sequence and a library of folds, thread the sequence through each fold and take the best-scoring fold.
• The method fails if the new protein does not belong to any fold in the library.
• The threading score is computed from known physicochemical properties and amino acid statistics.

Threading: example
• Structural template: 10 positions
• Neighbor definition: which pairs of positions i, j are in contact
• Energy function: E = sum of E_ab over neighboring positions i, j, where a and b are the residues placed at i and j

Sequence threaded onto the template: ACCECADAAC (A1, C2, C3, E4, C5, A6, D7, A8, A9, C10)
E = -3 - 1 - 4 - 4 - 1 - 4 - 3 - 3 = -23

E_ab    A    C    D    E   ...
A      -3   -1    0    0
C      -1   -4    1    2
D       0    1    5    6
E       0    2    6    7
...

Find the best fold for a protein sequence: fold recognition (threading)
[Figure: the query sequence MAHFPGFGQSLLFGYPVYVFGD... is threaded through each potential fold 1), ..., 56), ..., n) of the library, and each threading receives a score (e.g. -10, -123, 20.5).]

GenTHREADER
• Align the query sequence with each template (requires some sequence homology!)
• Assess the alignment by:
  – Sequence alignment score
  – Pairwise potentials
  – Solvation function
• Record the lengths of the alignment, the query and the template.
• A neural network combines these into an overall score.
Jones DT et al. J. Mol. Biol. 287:797–815 (1999)

I-TASSER – Hybrid Approach
• In CASP7, a recent community-wide blind experiment, I-TASSER generated the best 3D structure predictions among all automated servers.
• Based on secondary-structure threading and the iterative implementation of the Threading ASSEmbly Refinement (TASSER) program.
• To predict the biological function of the protein, the I-TASSER server matches the predicted 3D models to the proteins in 3 independent libraries, which consist of proteins with known enzyme classification (EC) number, gene ontology (GO) vocabulary, and ligand-binding sites.

I-TASSER Test Case: Rosetta vs. TASSER
[Figure: Grey – crystal structure of the beta2 adrenergic receptor; Purple – Rosetta prediction, starting from homology modeling; Green – TASSER prediction]

Homology Modeling

Homology Modeling – Basic Idea
1. A protein's structure is defined by its amino acid sequence.
2. Closely related sequences adopt highly similar structures; distantly related sequences may still fold into similar structures.
3. The three-dimensional structure of proteins from the same family is more conserved than their primary sequences.
Example: triosephosphate isomerases with 44.7% sequence identity superimpose with an RMSD of 0.95.

General Scheme
1. Searching for structures related to the query sequence
2. Selecting templates
3. Aligning the query sequence with the template structures
4. Building a model for the query using information from the template structures (here: NEST)
5. Evaluating the model
Fiser A et al. Methods in Enzymology 374:461–491 (2004)

General Scheme
Homology modeling requires handling structures & sequences:
• Query – only the protein sequence is available; usually found in the UniProt database
• Template – after identification, both structural and sequence-related data should be retrieved: UniProt (or NCBI databases), RCSB and PDBsum

1. Searching For Structures
• Sequence search against the PDB sequences
• Sequence-profile search
• Threading: sequence-structure fitness function
If a BLAST search against the PDB fails to identify adequate templates, turn to fold recognition (threading) servers:
• FFAS03 – http://ffas.ljcrf.edu/ffas-cgi/cgi/ffas.pl
• HHPRED – http://toolkit.tuebingen.mpg.de/hhpred
• HMAP (available through the PUDGE pipeline) – http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:PUDGE
• I-TASSER – http://zhang.bioinformatics.ku.edu/I-TASSER/
These servers not only find candidate templates, but also suggest a pairwise alignment and in some cases even construct the 3D model.

2. Selecting Templates
How to select the right template?
• Higher sequence similarity – %ID
• Close subfamily – phylogenetic tree
• "Environment" similarity – solvent, pH, ligand, quaternary interactions
• The quality of the experimentally determined structure
• Purpose of modeling – e.g. protein-ligand model vs. geometry of the active site
[Figure: phylogenetic tree of Seq. 1 to Seq. 6]

2. Selecting Templates: more than one template
• Two ways to combine multiple templates:
  – Global model – the templates align with different domains of the target, with little overlap between them
  – Local model – the templates align with the same part of the target
• The more the merrier – multiple structures with the same fold

2. Selecting Templates: trial and error
• Generate a model for each candidate template and/or combination of templates.
• Evaluate the models by an energy or any other scoring function.
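The "%ID" criterion for template selection can be sketched as below. The code assumes one common convention (identical residues divided by the number of aligned, non-gap column pairs); other denominators are also in use, so values may differ from what BLAST or other tools report. The template names and toy alignments are hypothetical.

```python
def percent_identity(aligned_query, aligned_template, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(q, t) for q, t in zip(aligned_query, aligned_template)
             if q != gap and t != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for q, t in pairs if q == t)
    return 100.0 * matches / len(pairs)


def rank_templates(alignments):
    """Sort candidate templates by %ID to the query, best first.

    alignments: dict mapping a template name to a tuple of
    (aligned query, aligned template).
    """
    scored = [(name, percent_identity(q, t)) for name, (q, t) in alignments.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates with toy alignments.
candidates = {
    "template_1": ("MKV-AIAG", "MKVQALAG"),
    "template_2": ("MKVAIAG-", "MRIGVSGK"),
}
for name, identity in rank_templates(candidates):
    print(f"{name}: {identity:.1f}% identity")
```

Sequence identity is only the first filter; the other criteria on the slide (subfamily, environment, structure quality, purpose of the model) still have to be weighed by hand.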
(Scoring functions are discussed later, under model evaluation.)

3. Aligning query and template sequences
• All comparative modeling programs depend on a target-template alignment.
• When the sequence similarity between the template and target proteins is high, simple pairwise alignments are usually fine (e.g. Needleman-Wunsch global alignment).
• Gaps or low/medium sequence similarity indicate that the alignment should be improved.

3. Aligning query and template sequences
Guidelines:
1. Create a multiple sequence alignment (MSA) and extract the template-query pairwise alignment. A pairwise alignment alone is not enough! Visual inspection of alignments is difficult to teach; it is a matter of experience.
2. Use secondary structure information to improve the pairwise alignment: avoid gaps in secondary-structure regions!
3. Use previous biochemical and structural data.

3. Aligning query and template sequences
Tips for MSA building
• Where to find homologues?
  – Structural templates: search against the PDB
  – Sequence homologues: search against SwissProt or UniProt (recommended!), usually using BLAST
• How many?
  – As many as possible, as long as the MSA looks good (next week…)
• How long? (length of homologues)
  – Fragments: short homologues (less than 50-60% of the query's length) lead to a bad alignment
  – Ensure your sequences contain the wanted domain(s)
  – The N/C termini tend to vary in length between homologues
• How close? (distance from the query sequence)
  – All too close: no information
  – Too many too far: bad alignment
  – Ensure that you have a balanced collection!
• From whom? (which species the sequence belongs to)
  – It does not matter; all homologues are welcome
  – Orthologues/paralogues may be helpful
  – Sequences from distant/close species provide different types of information
• Which alignment method?
  – The best today are MUSCLE, T-Coffee and MAFFT (see Useful Links).
• Most importantly, make sure that both the query and the selected template are included in the MSA.
• Sequences that are more distant than the template need not be included in the alignment.

3. Aligning query and template sequences
Query-template alignment via a profile-to-profile approach:
1. Construct an MSA for the query; it serves as a profile depicting the properties of the protein family.
2. Align the profile to the profiles of all proteins in the PDB using, e.g., FFAS03 or HHpred.
3. Compare the pairwise alignments constructed by the different methods, hoping to get a consensus prediction.

3. Aligning query and template sequences
Different levels of similarity between the template & query call for different computational approaches:
[Figure]

4. Building a model
Once you have an improved pairwise alignment between your query & template, use NEST to build your model!
Petrey, D., Xiang, Z., Tang, C. L., Xie, L., Gimpelev, M., Mitros, T., Soto, C. S., Goldsmith-Fischman, S., Kernytsky, A., Schlessinger, A., Koh, I. Y. Y., Alexov, E. and Honig, B. (2003) Using Multiple Structure Alignments, Fast Model Building, and Energetic Analysis in Fold Recognition and Homology Modeling. Proteins: Struct., Funct. and Genet. 53:430–435.

NEST
Incorporates a variety of programs to facilitate model building.
• Input:
  1. Sequence alignment of a query to one (or more) template PDBs
  2. The template PDB file(s) in the same directory
• Output: a 3D model in PDB format
• Capabilities:
  1. Model building with artificial evolution
  2. Sequence alignment tuning
  3. Composite structure building / multiple templates
  4. Structure refinement

NEST
Based on "artificial evolution":
• Changes to the template structure, such as residue mutations, insertions or deletions, are made one at a time.
• After each change, a slight energy minimization is performed to avoid atom clashes.
• This process is repeated until the target sequence is completely modeled.
• The resulting structure is subjected to minimization; the energy is calculated with a simplified potential function that includes van der Waals, hydrophobic, electrostatic, torsion-angle and hydrogen-bond terms.

5. Model Evaluation
• The accuracy of the model depends on its sequence identity with the template:
[Figure]

5. Model Evaluation
The model can be assessed at two levels:
• Global – reliability of the model as a whole.
  – Useful when several models are generated and one must be chosen as the best.
  – When the models were built from different templates, it may help choose the best one.
• Local – reliability of the different regions, even specific residues, of the model.
  – Useful for detecting local mistakes, which often originate from alignment errors.

5. Model Evaluation
Examples of assessment approaches:
1. Assessment of the model's stereochemistry
2. Prediction of unreliable regions of the model with a "pseudo energy" profile: peaks indicate errors
3. Consistency with experimental observations
4. Consistency with evolutionary conservation rates

Summary: 5 Basic Steps
[Figure]

Hands-on

The Query Protein
• Name: Dihydrodipicolinate reductase
• Enzyme reaction: [Figure]
• Molecular process: Lysine biosynthesis (early stages)
• Organism: E. coli
• Sequence length: 273 aa
1. Searching For Structures
Get your sequence from http://www.uniprot.org/

>DAPB_ECOLI
MHDANIRVAIAGAGGRMGRQLIQAALALEGVQLGAALEREGSSLLGSDAGELAGAG
KTGVTVQSSLDAVKDDFDVFIDFTRPEGTLNHLAFCRQHGKGMVIGTTGFDEAGKQ
AIRDAAADIAIVFAANFSVGVNVMLKLLEKAAKVMGDYTDIEIIEAHHRHKVDAPSGTA
LAMGEAIAHALDKDLKDCAVYSREGHTGERVPGTIGFATVRAGDIVGEHTAMFADIGE
RLEITHKASSRMTFANGAVRSALWLSGKESGLFDMRDVLDLNNL

1. Searching For Structures
Find templates with significant homology:
• BLAST against the sequences in the PDB
Also find more distant templates, using a profile-to-profile approach:
• FFAS03 server
• HHPRED server

1. Searching For Structures: BLAST against the PDB
http://www.ncbi.nlm.nih.gov/BLAST/
• Paste the sequence
• Select the PDB database

1. Searching For Structures: use fold recognition – FFAS03
• Paste the sequence
• Select the PDB database
• Run

1. Searching For Structures: use fold recognition – HHPRED
http://toolkit.tuebingen.mpg.de/hhpred
• Paste the sequence
• Select the PDB database
• Run

2. Selecting templates: BLAST against the PDB
http://www.ncbi.nlm.nih.gov/BLAST/
[Screenshot callouts: "The real structure of our protein"; "Closest homologous structure"]
• The selected template: 1VM6, chain A

2. Selecting templates: use fold recognition – FFAS03
http://ffas.ljcrf.edu/ffas-cgi/cgi/get_mu.pl?ses=&qdb=public&tdb=PDB0408&type=re&key=221830166.3750.0000000
• Scores below -9.5 are significant

2. Selecting templates: use fold recognition – HHPRED
http://toolkit.tuebingen.mpg.de/hhpred/histograms/8455009

2. Selecting templates: who is our template?
PDB ID 1VM6 is UniProt entry 'DAPB_THEMA' (www.ebi.ac.uk/thornton-srv/databases/pdbsum)

3. Alignment: find the query's homologous sequences
http://conseq.bioinfo.tau.ac.il/
• Paste the query sequence
• Download the query's alignment

3. Alignment: extract the query-template pairwise alignment
1. Open: Start > Phylogeny > BioEdit
2. Open the alignment: File > Open > 'query.aln'
3. Select the template: Edit > Search > Find in Titles > "DAPB_THEMA"
4. Add the query to the template selection: Ctrl + 'query'
5. Invert the selection: Edit > Invert title selection
6. Delete the other sequences: Edit > Cut Sequence(s)
7. Minimize gaps: Alignment > Minimize Alignment
8. Save the pairwise alignment (query and DAPB_THEMA) in "fasta" format: File > Save As > "DAPB_ECOLI_1VM6.fas"

3. Alignment: use fold recognition
• FFAS03 (scores below -9.5 are significant): http://ffas.ljcrf.edu/ffas-cgi/cgi/get_mu.pl?ses=&qdb=public&tdb=PDB0408&type=re&key=221830166.3750.0000000
• HHPRED: http://toolkit.tuebingen.mpg.de/hhpred/histograms/8455009

3. Alignment: edit the query-template pairwise alignment
• NEST requires a specific file format, so unfortunately the pairwise alignment has to be edited by hand.
• "sequence:DAPB_ECOLI" gives the name of the query protein (this will be the name of the modeled PDB file).
• ">P1;1VM6" names the PDB file of the template (rename DAPB_THEMA).

>P1;SEQ
sequence:DAPB_ECOLI
MHDANIRVAIAGAGGRMGRQLIQAALALEGVQLGAALEREGSSLLGSDAGELAGAGKTGV
TVQSSLDAVKDDFDVFIDFTRPEGTLNHLAFCRQHGKGMVIGTTGFDEAGKQAIRDAAAD
IAIVFAANFSVGVNVMLKLLEKAAKVMGDYTDIEIIEAHHRHKVDAPSGTALAMGEAIAH
ALDKDLKDCAVYSREGHTGERVPGTIGFATVRAGDIVGEHTAMFADIGERLEITHKASSR
MTFANGAVRSALWLSGKESGLFDMRDVLDLNNL
>P1;1VM6
structure:1VM6:A
-----MKYGIVGYSGRMGQEIQKVFSE-KGHELVLKVDV-----------------------NGVEEL-DSPDVVIDFSSPEALPKTVDLCKKYRAGLVLGTTALKEEHLQMLRELSKE
VPVVQAYNFSIGINVLKRFLSELVKVLE-DWDVEIVETHHRFKKDAPSGTAILLESAL-------------------GK----SVPIHSLRVGGVPGDHVVVFGNIGETIEIKHRAISR
TVFAIGALKAAEFLVGKDPGMYSFEEVI-----

Save as "DAPB_ECOLI_1VM6.pir"

4. Model Building: get the template structure
http://www.rcsb.org/pdb/home/home.do
• Paste the template's PDB ID "1VM6"
• Get the template structure: 1vm6 chain A
• Save as "1VM6.pdb" (notice: file names are case sensitive!)

4. Model Building: file transfer (pir file + template PDB)
1. Open Explorer
2. Enter the address (with your user name): ftp://[email protected]
3. Enter: user: nest, password: uniprot1
4. Create a directory: "[your name]"
5. Copy the pairwise pir file & the pdb file into your directory

4. Model Building: connect to the server
• Open: Start > Programs > Tera Term
• Enter the server

4. Model Building: run NEST
• Enter: cd [your name]
• Type: nest DAPB_ECOLI_1VM6.pir
• Take a coffee break…
• The model ("DAPB_ECOLI_final.pdb") is probably ready by now: copy it back to your computer using the FTP window.

5. Evaluation: model visualization
1. Open: Start > Bioinformatics > RasTop
2. Get the model: File > Open > DAPB_ECOLI_final.pdb

5. Evaluation: active site residues
[Figure]

5. Evaluation: stereochemistry – PROCHECK
[Figure]

5. Evaluation: model conservation
http://consurf.tau.ac.il

Real vs. Model Superimposition
[Figure]

Useful Links
1. Searching for structures
  – PDB-BLAST at NCBI – http://blast.ncbi.nlm.nih.gov/Blast.cgi
  – Meta server, 3D-Jury – http://bioinfo.pl/meta/
  – FFAS03 – http://ffas.ljcrf.edu/ffas-cgi/cgi/ffas.pl
  – HHPRED – http://toolkit.tuebingen.mpg.de/hhpred
  – PUDGE pipeline – http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:PUDGE
2. Selecting templates
3. Aligning the query sequence with template structures
  – MSA – MUSCLE, T-Coffee and MAFFT at http://toolkit.tuebingen.mpg.de/sections/alignment
  – Alignment editor – BioEdit – http://www.mbio.ncsu.edu/BioEdit/bioedit.html
4. Building a model
  – NEST – http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:nest
  – Modeller – http://salilab.org/modeller/modeller.html
5. Evaluating the model
  – ConSurf – http://consurf.tau.ac.il
  – PROCHECK – http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
  – WHATCHECK – www.cmbi.kun.nl/swift/whatcheck/
  – ProSA – https://prosa.services.came.sbg.ac.at/prosa.php
  – ProQ – http://www.sbc.su.se/~bjornw/ProQ/ProQ.cgi
  – At the Honig lab – http://luna.bioc.columbia.edu/Model_Quality_Assessment/cgibin/Model_Quality_Assessment.cgi

Any future questions: Maya Schushan [email protected]
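The manual editing step in "3. Alignment" (turning the two-sequence FASTA file saved from BioEdit into the PIR-style file given to NEST) could also be scripted. The sketch below only mirrors the layout shown on that slide (">P1;SEQ", "sequence:NAME", ">P1;PDBID", "structure:PDBID:CHAIN"); whether NEST accepts exactly this output has not been verified here, and the function names and the 60-column wrapping are assumptions of the sketch.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (title, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def wrap(sequence, width=60):
    """Break a sequence into lines of at most `width` characters."""
    return "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))


def to_pir(query_name, query_seq, template_pdb_id, chain, template_seq):
    """Build a PIR-style alignment laid out like the slide example."""
    return (
        f">P1;SEQ\nsequence:{query_name}\n{wrap(query_seq)}\n"
        f">P1;{template_pdb_id}\nstructure:{template_pdb_id}:{chain}\n"
        f"{wrap(template_seq)}\n"
    )


# Toy two-sequence alignment standing in for "DAPB_ECOLI_1VM6.fas".
fasta_text = ">query\nMHDANIRVAIAG\n>DAPB_THEMA\n-----MKYGIVG\n"
(_, query_seq), (_, template_seq) = read_fasta(fasta_text)
print(to_pir("DAPB_ECOLI", query_seq, "1VM6", "A", template_seq))
```

The template entry is named after the PDB ID rather than the UniProt name, matching the "rename DAPB_THEMA" note on the slide, so that the entry name corresponds to the "1VM6.pdb" file placed in the same directory.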