Download Structural Bioinformatics

Structural Bioinformatics • • • • Motivation Concepts Structure Prediction Summary Lecture 12 CS566 1 Motivation • Holy Grail: Mapping between sequence and structure. Structure = F(Sequence). What is F? • Why – Structure dictates chemistry, thermodynamics and therefore function – Not all structures can be (need be?) determined experimentally • Cost • Experimental limitations Lecture 12 CS566 2 Concepts – Prediction spectrum Decreasing reliance on known structures Homology Modeling Threading Lecture 12 CS566 ab initio Quantum Mechanics 3 Concepts - Common Principles • Constraints to reduce search space • Consideration of many alternate conformations – Protein backbone dihedral angles (‘Twists along axis of protein’) – Amino-acid geometry (‘Amino-acids can have more than one shape’) • Method for local optimization • Scoring function to compare conformations Lecture 12 CS566 4 Evaluation of quality of prediction • RMSD comparison with experimentally known structure • Comparison with crystal structure quality criteria – Ramachandran Plot • Residue specific dihedral angle distribution • CASP (Critical assessment of structure prediction) and CAFASP (..Fully Automated..) competitions Lecture 12 CS566 5 Methods • Knowledge-based constraints of search space – Homology Modeling – Threading – ab initio (Based on knowledge primitives: not true ab initio) • Approaches to refinement – Quantum mechanics (ab initio) • Based on quantum mechanical model of elementary particles • Unscalable – Molecular mechanics • Uses parametric Force Fields (Newton’s laws, Hooke’s law, …) • Typically used for local or constrained global optimization • Molecular Dynamics or Monte Carlo-based Lecture 12 CS566 6 Homology modeling • Homology – Based on sequence-sequence similarity ( > ~25%, the higher, the better) – Steps • Pair-wise local sequence similarity to identify related structures (possible templates) • Refine alignment by global pair-wise sequence similarity and msa • Overlay sequence backbone (N-C-C) on template • Model loops based on – Statistical knowledge from databases of known structures – Molecular mechanics • Model side-chains (approach similar to that of loops) • Molecular mechanical unconstrained local optimization • Pray for a good solution! Lecture 12 CS566 7 Threading • Based on sequence-structure similarity • Concept – Residues in core adopt fewer conformations than surface • Approach – Thread sequence through all known structures – Score match with core of each structure based on • Environmental scoring matrices and/or • Amino acid neighborhood matrices (a la Dot matrix) – Refine structure using molecular mechanics based on best template(s) Lecture 12 CS566 8 Rosetta (“ab initio”) Approach • Pioneered by David Baker’s group in the late 1990s • Remarkable success in CASP and CAFASP experiments • Recently made publicly available on an automated server by Christopher Bystroff’s group • Pot pourri of many different approaches • Key components – ‘Divide and conquer’ strategy with respect to length of sequence to be modeled – Use of knowledge based energy function Lecture 12 CS566 9 ‘Divide and conquer’ • Mimics natural process of protein folding • Compromise between extremes of – Looking for homologous sequences with known structure – Modeling a priori (one amino acid at a time) • Use library of 3D structures of fragments of length 3 and 9 derived from the crystal structure database (a priori estimates = 8K and ~ 1012). • Break up query sequence into a set of 3mers and 9mers, to find matches with above library – using a sequence profile approach Lecture 12 CS566 10 ‘Divide and conquer’ • Once matches found, reduces to combinatorial problem of selecting best set of fragments with most energetically favorable structure • In practice, Monte Carlo based search of possible combinations is carried out. Lecture 12 CS566 11 Knowledge based energy function • Fundamentally, ∆G = ∆H - T ∆S • Free energy is the enthalpy less an entropic term that is proportional to temperature • Entropy is proportional to the natural log of the number of conformations/possible states S = K ln W Lecture 12 CS566 12 Knowledge based energy function • Hence makes sense to use existing distribution of structures to derive energy function • Energy function is based on taking statistical distribution of 3D shapes in database of known structures as the underlying probability distribution • For a given structure, deviations from probability distribution are subject to proportional energetic penalties Lecture 12 CS566 13 Rosetta – Steps used in CASP4 1. If possible, use PSI-BLAST to find similar sequences A. If found, use the multiple sequence alignment to break down sequence into domains to be modeled independently B. For domains with similarity to known structures, use Homology based approach C. For remaining domains, carry out Rosetta Lecture 12 CS566 14 Rosetta - Steps 2. 3. For domains with similarity to other sequences, apply following steps to the homologs as well (consensus modeling) Generate fragment library for each query A. 4. Collect 3mer and 9mer sub-structures from the PDB with similarity to 3mer and 9mer subsequences Use Monte Carlo approach for backbone fragment substitution into query A. B. C. Pick a fragment at random from library (~40,000 fragment substitutions for each structure) Repeat A several times Between 10K and 100K conformations (‘decoys’) generated for each target Lecture 12 CS566 15 Rosetta - Steps 5. Filter set of conformations to remove unlikely structures A. B. 6. 7. 8. Remove structures with minimal long range interactions (low contact order) Remove structures with unrealistic strands Add side chains as statistically predicted by the backbone conformation Cluster set of conformations (including, when available, the generated structures of homologues) Representative structures from the top 5 most-populous clusters are candidate structures Lecture 12 CS566 16 Summary • Methods like Rosetta represents a breakthrough in the ab initio prediction of protein 3D structure and are very useful in cases where homology cannot be observed • For CASP4, at least one subsequence longer than 50 residues could be predicted ‘correctly’ (< 6.5 rmsd) in 17 of 21 cases • Combination of various approaches works best Lecture 12 CS566 17 Summary • However, both completeness and accuracy of prediction leave ample room for improvement – RMS error frequently too high to be useful – Even in homology modeling, template per se is often better match! – Often, only subsequences are accurately modeled, and not the whole structure – The Nobel Prize is still up for grabs! Lecture 12 CS566 18

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Structural Bioinformatics