BL5203: Molecular Recognition & Interaction
Lecture 6: Modeling Protein Structure and Protein-Protein Interaction
Y.Z. Chen
Department of Pharmacy, National University of Singapore
Tel: 65-6616-6877; Email: [email protected]; Web: http://bidd.nus.edu.sg

Content
• Protein fold and structure
• Homology modeling
• Protein-protein docking

Sizes of protein databases
[Figure: bar chart on a logarithmic scale (1 to 10,000,000,000), Swiss-Prot database. Protein residues: 500M; protein sequences: 1.6M; protein structures: 26K; protein complexes: 1K.]

Protein structure classification
[Figure: hierarchy with the levels protein world, protein fold, protein superfamily, protein family.]

New Fold Growth
[Figure: new PDB structures per year, split into old folds and new folds.]
• The number of unique folds in nature is fairly small (possibly a few thousand)
• 90% of new structures submitted to the PDB in the past three years have folds similar to structures already in the PDB

Protein classification
• Number of protein sequences grows exponentially
• Number of solved structures grows exponentially
• Number of new folds identified is very small (and close to constant)
• Protein classification can
  – Generate overview of structure types
  – Detect similarities (evolutionary relationships) between protein sequences

Problems in Protein Bioinformatics
• 20,000 entries of proteins in the PDB
• 1000 - 2000 distinct protein folds in nature
• Thought to be only several thousand unique folds in all
• Prediction of structure from sequence
  – Fold recognition
  – Fragment construction
• Proteome annotation
• Protein-protein docking

Protein folding code
[Figure: protein sequence and protein structure, linked by the protein folding code.]

Prediction of correct fold
[Figure: query sequence and matched fold.]

Fold recognition
Match sequence against library of known folds
Eisenberg et al.
Jones, Taylor, Thornton

Computational Requirements
• 1 sequence search takes 12 mins (3 GHz)
• Benchmarking on 100 proteins with 100 runs for a simplex search of parameter space = 80 days
• 30 approaches explored = 7 years (on 1 CPU)

Types of Structure Prediction
• De novo protein methods
  – Seek to build three-dimensional protein models "from scratch"
  – Example: Rosetta
• Comparative protein modeling
  – Uses previously solved structures as starting points, or templates
  – Example: protein threading

Factors that Make Protein Structure Prediction a Difficult Task
• The number of possible structures that proteins may possess is extremely large, as highlighted by the Levinthal paradox
• The physical basis of protein structural stability is not fully understood
• The primary sequence may not fully specify the tertiary structure
  – Chaperones
• Direct simulation of protein folding is not generally tractable, for both practical and theoretical reasons

Homology Modeling
• Homolog: a protein related to the target by divergent evolution from a common ancestor
• A target with 40% amino-acid identity to its homolog
  – NO large insertions or deletions
  – Produces a predicted structure equivalent to that of a medium-resolution experimentally solved structure
• 25% of known protein sequences fall in a safe area, implying they can be modeled reliably

Homology Modeling Defined
• Homology modeling
  – Based on the reasonable assumption that two homologous proteins will share very similar structures
  – Given the amino acid sequence of an unknown structure and the solved structure of a homologous protein, each amino acid in the solved structure is mutated computationally into the corresponding amino acid from the unknown structure
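
The 40% identity guideline and the "no large insertions or deletions" condition can both be checked from a pairwise alignment of target and template. The following is a minimal sketch, assuming the two sequences are already aligned and use "-" for gaps. The function names, the choice of denominator (columns where neither sequence has a gap), and the example sequences are illustrative assumptions, not part of the lecture.

```python
GAP = "-"


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == GAP or b == GAP:
            continue  # gap columns are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def longest_gap(aligned: str) -> int:
    """Length of the longest run of gap symbols in one aligned sequence."""
    longest = run = 0
    for ch in aligned:
        run = run + 1 if ch == GAP else 0
        longest = max(longest, run)
    return longest


# Hypothetical aligned pair (target vs. template), for illustration only.
target = "ACDEFGHIK-LMN"
template = "ACDQFGHLKWLMN"

identity = percent_identity(target, template)
largest_indel = max(longest_gap(target), longest_gap(template))
print(f"identity = {identity:.1f}%, largest indel = {largest_indel}")
```

A pair would then be treated as a candidate for homology modeling when `identity` is at least 40 and `largest_indel` is small; what counts as "large" is left open by the slides.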
Homology Modeling Limitations
• Cannot study conformational changes
• Cannot find new catalytic/binding sites
• Brainstorm: lack of activity vs activity
  – Chymotrypsinogen, trypsinogen and plasminogen
  – 40% homologous
  – 2 active, 1 no activity, cannot explain why
• Large bias towards structure of template
• Models cannot be docked together

Why Homology Modeling?
• Value in structure-based drug design
• Find common catalytic sites/molecular recognition sites
• Use as a guide to planning and interpreting experiments
• 70-80% chance that the target protein has a fold similar to a protein whose structure has been solved by X-ray crystallography or NMR spectroscopy
• Sometimes it's the only option or best guess

Protein Threading
• A target sequence is threaded through the backbone structures of a collection of template proteins (fold library)
• Quantitative measure of how well the sequence fits the fold
• Based on assumptions
  – 3-D structures of proteins have characteristics that are semi-quantitatively predictable
  – These characteristics reflect the physical-chemical properties of amino acids
  – Limited types of interactions allowed within folding

Fold Recognition Methods
• Bowie, Lüthy and Eisenberg (1991)
• 2 approaches to recognition methods
• Derive a 1-D profile for each structure in the fold library and align the target sequence to these profiles
  – Identify amino acids based on core or external positions
  – Part of secondary structure
• Consider the full 3-D structure of the protein template
  – Modeled as a set of inter-atomic distances
  – NP-hard (if interactions of multiple residues are included)

Protein Threading
• The word threading implies that one drags the sequence (ACDEFG...) step by step through each location on each template

Generalized Threading Score
• Want to correctly recognize arrangements of residues
• Building a score function
  – potentials of mean force
  – from an optimization calculation
• $G(r_{AB}) = -kT \ln(\rho_{AB} / \rho_{AB}^{\circ})$
  – $G$: free energy
  – $k$ and $T$: Boltzmann's constant and temperature, respectively
  – $\rho_{AB}$: the observed frequency of AB pairs at distance $r$
  – $\rho_{AB}^{\circ}$: the frequency of AB pairs at distance $r$ you would expect to see by chance
• $Z\text{-score} = (E_{Nat} - \langle E_{alt} \rangle) / \sigma_{E_{alt}}$
  – Energy of the native structure minus the mean energy of all the wrong structures, divided by the standard deviation of the wrong-structure energies

Scoring Different Folds
• Goodness of fit score
  – Based on empirical energy function
  – Modified to take into account pairwise interactions and solvation terms
  – High score means good fit
  – Low score means nothing learned

Some Threading Programs
• 3D-PSSM (ICNET). Based on sequence profiles, solvation potentials and secondary structure.
• TOPITS (PredictProtein server) (EMBL). Based on coincidence of secondary structure and accessibility.
• UCLA-DOE Structure Prediction Server (UCLA). Executes various threading programs and reports a consensus.
• 123D+. Combines substitution matrix, secondary structure prediction, and contact capacity potentials.
• SAM/HMM (UCSC). Based on Markov models of alignments of crystallized proteins.
• FAS (Burnham Institute). Based on profile-profile matching algorithms of the query sequence with sequences from clustered PDB database.
• PSIPRED-GenThreader (Brunel)
• THREADER2 (Warwick). Based on solvation potentials and contacts obtained from crystallized proteins.
• ProFIT CAME (Salzburg)

Process of 3D Structure Prediction by Threading
• Does this protein have sequence similarity to others with a known structure?
• Structure-related information in the databases
• Results from threading programs
• Predicted folding comparison
• Threading on the structure and mapping of the known data
• A comparison between the threading-predicted structure and the actual one

Protein Threading Based on Multiple Protein Structure Alignment
Tatsuya Akutsu and Kim Lan Sim
Human Genome Center, Institute of Medical Science, University of Tokyo
• NP-hard if interactions between 2 or more amino acids are included
• Determine multiple structural alignments based on pairwise structure alignments
  – Center Star Method

Center Star Method
• Let $I_0$ be the maximum number of gap symbols placed before the first residue of $S_0$ in any of the alignments $A(S_0, S_1), \ldots, A(S_0, S_N)$. Let $I_{|S_0|}$ be the maximum number of gaps placed after the last character of $S_0$ in any of the alignments, and let $I_i$ be the maximum number of gaps placed between characters $S_{0,i}$ and $S_{0,i+1}$, where $S_{j,i}$ denotes the $i$-th letter of string $S_j$.
• Create a string $S'_0$ by inserting $I_0$ gaps before $S_0$, $I_{|S_0|}$ gaps after $S_0$, and $I_i$ gaps between $S_{0,i}$ and $S_{0,i+1}$.
• For each $S_j$ ($j > 0$), create a pairwise alignment $A'(S'_0, S_j)$ between $S'_0$ and $S_j$ by inserting gaps into $S_j$ so that deletion of the columns consisting of gaps from $A'(S'_0, S_j)$ results in the same alignment as $A(S_0, S_j)$.
• Simply arrange the $A'(S'_0, S_j)$'s into a single matrix $A$ (note that all $A'(S'_0, S_j)$'s have the same length).
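
The merging step of the Center Star Method can be sketched in a few lines. This is an illustrative sketch only: it assumes the pairwise alignments of the center $S_0$ with each $S_j$ are already given as pairs of gapped strings, it works on plain character strings rather than structures, and the names `gap_slots` and `center_star_merge` are hypothetical. Where an inserted segment is shorter than the widest one in its slot, the padding gaps are placed after it, which is one arbitrary choice among several.

```python
GAP = "-"


def gap_slots(center_aligned):
    """Gap counts before each center residue, plus the trailing gap count."""
    slots, run = [], 0
    for ch in center_aligned:
        if ch == GAP:
            run += 1
        else:
            slots.append(run)
            run = 0
    slots.append(run)  # gaps after the last residue of the center
    return slots


def center_star_merge(pairwise):
    """Merge alignments (center_aligned, other_aligned) into one matrix.

    Returns a list of equal-length rows; the first row is the padded center.
    """
    center = pairwise[0][0].replace(GAP, "")
    for c_aln, o_aln in pairwise:
        if len(c_aln) != len(o_aln) or c_aln.replace(GAP, "") != center:
            raise ValueError("every alignment must contain the same center")

    all_slots = [gap_slots(c_aln) for c_aln, _ in pairwise]
    # Widest gap run per slot over all alignments (the I values of the method).
    max_slots = [max(col) for col in zip(*all_slots)]

    # Padded center: widest gap run in front of each residue, then trailing gaps.
    parts = [GAP * max_slots[k] + res for k, res in enumerate(center)]
    parts.append(GAP * max_slots[-1])
    rows = ["".join(parts)]

    # Re-pad every other sequence so that its columns match the padded center.
    for (c_aln, o_aln), slots in zip(pairwise, all_slots):
        parts, pos = [], 0
        for k in range(len(center) + 1):
            inserted = o_aln[pos:pos + slots[k]]  # aligned to center gaps
            pos += slots[k]
            parts.append(inserted + GAP * (max_slots[k] - slots[k]))
            if k < len(center):
                parts.append(o_aln[pos])  # aligned to center residue k
                pos += 1
        rows.append("".join(parts))
    return rows


# Hypothetical pairwise alignments against the center "ACGT".
pairwise = [("AC-GT", "ACTGT"), ("ACGT", "A-GT"), ("-ACGT", "TACGT")]
for row in center_star_merge(pairwise):
    print(row)
```

Deleting the all-gap columns from the first row paired with any other row recovers the corresponding input alignment, which is the property the third bullet above requires.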
Simple Threading Algorithm
• Apply simple score function based on structure alignment algorithm
  – Let $X = x_1 \ldots x_N$ (input amino acid sequence)
  – $C_i$ ($i$-th column in $A$)
• Test and analyze results and/or apply constraints

Protein Threading with Constraints
• Assume part of the input sequence $x_i \ldots x_{i+k}$ must correspond to part of the structure alignment $c_j \ldots c_{j+k}$
• Apply constraints

Prediction Power
• Entered in CASP3 competition
• 17 predictions made
• 3 targets evaluated as similar to correct folds
• Only team to create a nearly correct model for structure T0043
• Best in competition – 8 evaluated as similar to correct

Next time….
• In-depth detail of
  – Multiple structural alignment program
• Multiprospector
  – Global Optimum Protein Threading with Gapped Alignment
• Quality measures for protein threading models
• Improvements on threading-based models

Gapped Alignment

Fragment based method 1 - Predict structure of segment
Trial structures for a local sequence taken from database of segments of known 3D structure.
Fragment based method 2 - Construct trial model from segments

Fragment based method 3 - Identify good trial structures
1 - Low resolution energy function used in initial search through conformational space
2 - Side chains represented by single "centroid" pseudoatom
3 - Major contributions from
  – Hydrophobic burial
  – Beta strand pairing
  – Steric overlap
  – Specific residue pair interactions
4 - Models then refined using explicit rotamer based side chain representation and potential from design method

Fragment-based protein folding
[Figure: Cro repressor (1orc), observed structure.]

Computational Requirements
• Methodology performs numerous simulations and looks for clusters
• One simulation takes 3 mins (3 GHz)
• Require 1,000 simulations per protein = 2 days
• Benchmark on 50 proteins = 100 days

3D-GENOMICS - proteome annotation
[Figure: workflow with the labels proteome sequences, annotation procedure, database sequences, database structures, MySQL database, functional data, new research, WWW.]

Types of annotation
[Figure: example annotations. No similar sequence - orphan; E. coli Protein325 - homology but no function; Enzyme ABC EC 1.2.3.4 - function suggested. Other labels: structure, membrane protein, size.]
3D-Genomics database - structural and functional annotation

Computational requirements
• Today 800,000 protein sequences.
• Each sequence takes 15 mins to annotate on a 2.5 GHz CPU.
• Time today = 8,000 CPU days = 2.5 months with a 100-processor farm.
• Need to update every 6 months.
• Number of sequences will double in 2-3 years and so will keep pace with increase in compute power.
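
The CPU-time figures quoted on the two "Computational requirements" slides above follow from simple arithmetic on the per-job times. A minimal sketch, assuming perfect parallel scaling across processors; the helper names `cpu_days` and `wall_clock_days` are hypothetical.

```python
MINUTES_PER_DAY = 60 * 24


def cpu_days(n_jobs, minutes_per_job):
    """Total single-CPU time, in days, for n_jobs independent jobs."""
    return n_jobs * minutes_per_job / MINUTES_PER_DAY


def wall_clock_days(total_cpu_days, n_processors):
    """Elapsed days on a farm, assuming ideal scaling (an assumption)."""
    return total_cpu_days / n_processors


# Proteome annotation: 800,000 sequences at 15 minutes each.
annotation = cpu_days(800_000, 15)
farm = wall_clock_days(annotation, 100)

# Fragment-based folding: 1,000 simulations of 3 minutes per protein,
# benchmarked on 50 proteins.
per_protein = cpu_days(1_000, 3)
benchmark = per_protein * 50

print(f"annotation: {annotation:.0f} CPU days, {farm:.0f} days on 100 CPUs")
print(f"folding: {per_protein:.2f} days per protein, {benchmark:.0f} days for 50")
```

The exact values are about 8,333 CPU days (about 83 days on 100 processors) for annotation, and about 2.1 days per protein and 104 days for the folding benchmark, in line with the rounded numbers on the slides.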
Modelling protein-protein docking

Protein-protein docking
[Figure: workflow with the labels coordinates of mol 1, coordinates of mol 2, rigid body search, list of possible complexes, evaluate association energy, flexibility to refine, list of complexes, experimental information.]

Step 1 - Generating Complexes
Shape complementarity
• Grid values: +1 and -15
• Overlap: +1 × -15
• Match: +1 × +1
• $C = \sum \sum \sum A(i,j,k) \times B(l,m,n)$
Electrostatic complementarity
• Charge in molecule 1 = $Q(i,j,k)$
• Potential outside molecule 2 = $V(l,m,n)$
• $E = \sum \sum \sum Q(i,j,k) \times V(l,m,n)$

Step 2 - Modelling residue-residue interactions
Empirical residue pair potentials
• Analyse residues packing across 90 hetero-protein interfaces
• A pair of residues (a, b) pack if there is one atom-atom contact < distance cut-off (4.5 Å)
• $\mathrm{Score}(a,b) = \log_{10} \left( \text{observed no. of a/b pairs} / \text{expected no. of a/b pairs} \right)$

Step 3 - Including information about functional residues
• From literature

Step 4 - Refinement by multicopy
• Search for optimal combination of side-chain rotamers by energy calculation
• + Limited rigid-body shifts

CAPRI - blind test of docking
[Figure: bound Ab (X-ray), bound Ab (predicted), unbound amylase.]
Prediction / Actual: Difference = 0.6 Å

Computational Requirements
• 1 run of procedure takes 2 days on one 3 GHz processor
• Development tested on 30 protein complexes takes 60 days for one parameter set
• Applications – extension to predict which protein interacts with another requires 1000s of docking simulations

Application area
• Protein structure prediction
  – fold recognition
  – simulation
• Proteome annotation
• Protein-protein docking

Computing cost
• Modelling algorithm on one protein: 10 mins - 2 days on one 3 GHz CPU
• But algorithm development requires consideration of several structures (50 - 100) with different parameter sets.
• Hence years of CPU time required

Structure prediction & sequence space
[Figure: blocks of placeholder sequences illustrating sequence space.]

Multiple sequence alignments aid comparative protein modeling
• 1 in 3 sequences are recognizably related to at least one protein structure.
• A significant fraction of the remaining 2/3 have solved structural homologues, but they are not recognized through sequence similarity searching techniques.
  – Marti-Renom et al. (2000)
• Multiple sequence alignments greatly improve the efficacy and accuracy of almost all phases of comparative modeling.
  – Venclovas (2001)

Computational protein design
[Figure: cycle with the labels native structure, new sequence, iterative refinement.]

Large scale sequence generation
"Reverse BLAST" study:
• Total structures: 264
• Total backbone variants: 26,400
• Total time of data collection: 80 days
• Processors available: 4,000
• Total sequences generated: 200,000

"Reverse BLAST": finding templates for comparative modeling
Larson SM, Garg A, Desjarlais JR, Pande VS. (2003) Proteins: Structure, Function, and Genetics

Experiment: Sequence quality
[Figure: designed sequences searched with BLAST (E<0.01).]

Results: Sequence quality
[Figure: x-axis, designed sequence profile (ranked by E-value); y-axes, E-value of best PDB hit (log scale) and average identity to native sequence (%).]

Method: "Reverse BLAST"
[Figure: designed sequences and hypothetical proteins compared by BLAST (E<0.01) to find structural templates.]

Do the designed sequences help?
[Figure: correctly identified structural templates; fold-increase in # of templates; fold-increase in # of genes; total hits.]

Remote homology detection

Optimizing structural diversity
[Figure: x-axis, RMSD of structural ensemble (Angstroms); plotted series: sequence entropy, prediction accuracy (%), prediction coverage (%), mean pairwise %ID, mean native %ID.]
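
Two of the quantities named in the last plot, sequence entropy and mean pairwise %ID, can be computed directly from a set of equal-length designed sequences. The slides do not state the definitions used in the study, so the sketch below makes assumptions: entropy is taken as Shannon entropy per alignment column (in bits), averaged over columns, and pairwise identity is the fraction of identical positions. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    total = len(column)
    counts = Counter(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mean_sequence_entropy(seqs):
    """Column entropy averaged over all columns of equal-length sequences."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must all have the same length")
    columns = list(zip(*seqs))
    return sum(column_entropy(col) for col in columns) / len(columns)


def mean_pairwise_identity(seqs):
    """Mean percent identity over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = 0.0
    for a, b in pairs:
        total += 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)
    return total / len(pairs)


# Toy set of designed sequences, for illustration only.
designed = ["ACDE", "ACDF", "ACEF", "AGEF"]

print(f"mean entropy = {mean_sequence_entropy(designed):.3f} bits")
print(f"mean pairwise identity = {mean_pairwise_identity(designed):.1f}%")
```

Under these definitions a more diverse set of designed sequences gives higher entropy and lower mean pairwise identity.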