Comparative Modeling for Beta Protein Structure Prediction
Lenore J. Cowen
Tufts University

Amino Acids
A protein is composed of a central backbone and a collection of (typically) 50-2000 amino acids (a.k.a. residues).
There are 20 different kinds of amino acids, each consisting of up to 18 atoms, e.g.,

Name            3-letter code   1-letter code
Leucine         Leu             L
Alanine         Ala             A
Serine          Ser             S
Glycine         Gly             G
Valine          Val             V
Glutamic acid   Glu             E
Threonine       Thr             T

Protein Structure
[Figure: chemical structure of a peptide, showing the repeating backbone structure and the side chains of Asp (D), Arg (R), Val (V), Tyr (Y), Ile (I), His (H), Pro (P), Phe (F)]
Protein sequence: DRVYIHPF

Protein Folding Problem
Given an amino acid sequence, e.g., MDPNCSCAAAGDSCTCANSCTCLACKCTSCK, how will it fold in 3D?
The fold is important because it determines the function of the protein.
Note: The pictures I've been giving are "cartoons" of the backbone.

The Inverse Protein Folding Problem
Instead of being given a sequence and asking what its fold is, take a fold and ask for all the sequences that form that fold.
…VLWIXS….
…SSCILWG…

What do we mean by "that fold"?
SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)

Can we recognize and model all folds that form a beta-trefoil, etc.?
• If they are evolutionarily close enough, the answer is YES.
• Use BLAST to recognize homology (similar sequences have similar folds) and align conserved parts of the backbone.
…GVFIIIMGSHGK…
…GVD-LMG-HGR…

Comparative modeling
• Once the backbone of the conserved core is fixed, pack in the sidechains.
• Add loops and unstructured regions.

Can we recognize and model all folds that form a beta-trefoil, etc.?
• But STRUCTURE can be more CONSERVED than sequence: the structures may still align even though we can no longer use BLAST because the sequence similarity is too weak.
…GVFIIIMGSHGK…
…GR--CV-GCAGR…

Comparative modeling
• If you CAN find the correct alignment, you can proceed as before.
• Once the backbone of the conserved core is fixed, pack in the sidechains.
• Add loops and unstructured regions.

Approaches to Structural Motif Recognition
• Statistical template/profile methods (Altschul et al. 1990)
• Hidden Markov Models (Eddy, 1998)
• Threading Methods (Jones et al. 1992)
• Combinations of two or more of the above

Our Results
Recognizing the Beta Helix and Beta Trefoil Folds

The Right-handed Parallel Beta-Helix
A processive fold composed of repeated supersecondary units.
Each rung consists of three beta-strands separated by turn regions.
No sequence repeat.
Pectate Lyase C (Yoder et al. 1993)

Biological Importance of Beta Helices
Surface proteins in human infectious disease:
• virulence factors
• adhesins
• toxins
• allergens
Proposed as a model for amyloid fibrils (e.g. Alzheimer's and Creutzfeldt-Jakob)
Virulence factors in plant pathogens

What was Known
Solved beta-helix structures: 12 structures in PDB in 7 different SCOP families
Pectate Lyase: Pectate Lyase C, Pectate Lyase E, Pectate Lyase
Pectin Lyase: Pectin Lyase A, Pectin Lyase B
Galacturonase: Polygalacturonase, Polygalacturonase II, Rhamnogalacturonase A
Chondroitinase B
Pectin Methylesterase
P.69 Pertactin
P22 Tailspike

BetaWrap Program
[Bradley, Cowen, Menke, King, Berger, PNAS, 2001, 98:26, 14,819-14,824; Cowen, Bradley, Menke, King, Berger (2002), J Comp Biol, 9, 261-276]
Performance:
• On PDB: no false positives and no false negatives. Recognizes beta helices in PDB across SCOP families in cross-validation.
• Recognizes many new potential beta helices when run on larger sequence databases.
• Runs in linear time (~5 min. on SWISS-PROT).

BetaWrap Program
Histogram of protein scores for:
• beta helices not in database (12 proteins)
• non-beta helices in PDB (1346 proteins)

Single Rung of a Beta Helix

3D Pairwise Correlations
[Figure: a single rung with strands B1, B2, B3 and turn T2 labeled]
Stacking residues in adjacent beta-strands exhibit strong correlations.
Residues in the T2 turn have special correlations (asparagine ladder, aliphatic stacking).
Question: how can we find these correlations, which are a variable distance apart in sequence?

Finding Candidate Wraps
[Figure: a fixed rung (B2, T2, B3) and a candidate rung]
• Assume we have the correct locations of a single T2 turn (fixed B2 & B3).
• Generate the 5 best-scoring candidates for the next rung.

Scoring Candidate Wraps (rung-to-rung)
Rung-to-rung alignment score incorporates:
• Beta sheet pairwise alignment preferences taken from amphipathic beta structures in PDB (w/o beta helices).
• Additional stacking bonuses on internal pairs.
• Distribution on turn lengths.

Scoring Candidate Wraps (5 rungs)
• Iterate out to 5 rungs, generating candidate wraps.
• Score each wrap:
  - sum the rung-to-rung scores
  - B1 correlations filter
  - screen for alpha-helical content

Predicted Beta Helices
Features of the 200 top-scoring proteins in the NCBI's protein sequence database:
• Many proteins of similar function to the known beta-helices; some with similar sequences.
• A significant fraction are characterized as microbial outer membrane or cell-surface proteins.
• Mouse, human, worm and fly sequences are significantly underrepresented: only two proteins!
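
The wrap search from the "Finding Candidate Wraps" and "Scoring Candidate Wraps" slides can be sketched roughly as below. This is a minimal illustration, not the BetaWrap implementation: a rung is reduced to a single start position, and `toy_candidates` and `toy_score` are hypothetical stand-ins for the real candidate generation and rung-to-rung alignment score. The B1 correlations filter and the alpha-helix screen are omitted.

```python
from typing import Callable, Iterable, List, Tuple

# (summed rung-to-rung score, rung start positions); a hypothetical representation
Wrap = Tuple[float, List[int]]


def enumerate_wraps(
    seed: int,
    next_candidates: Callable[[int], Iterable[int]],
    rung_score: Callable[[int, int], float],
    n_rungs: int = 5,
    per_rung: int = 5,
) -> List[Wrap]:
    """Extend a seed rung out to n_rungs, keeping the per_rung best next rungs each time."""
    wraps: List[Wrap] = [(0.0, [seed])]
    for _ in range(n_rungs - 1):
        grown: List[Wrap] = []
        for total, rungs in wraps:
            prev = rungs[-1]
            # Score every possible next rung against the previous one.
            scored = sorted(
                ((rung_score(prev, c), c) for c in next_candidates(prev)),
                key=lambda sc: sc[0],
                reverse=True,
            )
            # Keep only the best-scoring candidates for the next rung.
            for s, c in scored[:per_rung]:
                grown.append((total + s, rungs + [c]))
        wraps = grown
    # A wrap's score is the sum of its rung-to-rung scores; best first.
    return sorted(wraps, key=lambda w: w[0], reverse=True)


def toy_candidates(prev: int) -> Iterable[int]:
    # Hypothetical: the next rung starts 18 to 26 residues after the previous one.
    return range(prev + 18, prev + 27)


def toy_score(prev: int, nxt: int) -> float:
    # Hypothetical: prefer a spacing of exactly 22 residues between rungs.
    return -abs((nxt - prev) - 22)


best_score, best_rungs = enumerate_wraps(0, toy_candidates, toy_score)[0]
print(best_score, best_rungs)
```

With 5 candidates per rung and 4 extensions, this enumerates 5^4 = 625 wraps from a single seed, which is small enough to score exhaustively.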

Some Predicted Beta Helices in Human Pathogens

Organism                    Disease
Vibrio cholerae             Cholera
Helicobacter pylori         Ulcers
Plasmodium falciparum       Malaria
Chlamydia trachomatis       Venereal infection
Chlamydophila pneumoniae    Respiratory infection
Listeria monocytogenes      Listeriosis
Trypanosoma brucei          Sleeping sickness
Borrelia burgdorferi        Lyme disease
Leishmania donovani         Leishmaniasis
Bordetella bronchiseptica   Respiratory infection
Trypanosoma cruzi           Chagas disease
Bordetella parapertussis    Whooping cough
Bacillus anthracis          Anthrax
Rickettsia rickettsii       Rocky Mtn. spotted fever
Rickettsia japonica         Oriental spotted fever
Neisseria meningitidis      Meningitis
Legionella pneumophila      Legionnaires' disease

The Beta-Trefoil
The beta-trefoil consists of three leaves around an axis of three-fold symmetry.
[Figure: a single leaf (strands B1-B4, turns T1-T3, cap and barrel) and the entire trefoil (3 leaves); 1BFF (Kitagawa et al. 1991)]

Templates
[Figure: cap template, with strands B1-B4 and turns T1-T3 labeled]
A leaf template consists of:
• a B1-strand, followed by a T1 turn of length 2 to 17, followed by
• a B2-strand, followed by a T2 turn of length 0 to 11, followed by a B3-strand, followed by
• a T3 turn of length 4 to 20, followed by a B4 strand.
In addition, it is between 26 and 64 residues long.
A trefoil template consists of three leaf templates separated by two T4 turns of length 0 to 16.

What Pairs Do We Consider?
[Figure: leaf with strands B1-B4 and turns T1-T3 labeled]
In both the barrel and the cap, we consider both directly aligned pairs of residues and pairs of residues one-off from each other.
Different tables are used for pairwise preferences for buried, exposed, and one-off pairs of residues.

Packing moves earlier in the modeling process
• In order to produce more accurate sequence-structure alignments, we return several possible "wraps" and try to pack sidechains.
• So sidechain packing is used earlier in the comparative modeling process, where it also helps find the correct sequence-structure alignment.

The Packing Function
Top wraps are fed to the packing function.
• SCWRL (Canutescu, 2003) is better at packing caps than barrels.
• Input to SCWRL:
  • Atomic coordinates of the backbone of cap strand pairs from a member of each trefoil superfamily in the training set.
  • Top 4 wraps of the target sequence onto the trefoil template.
• Return the best-scoring wrap with a good packing, if one exists; else reject.

Example of the Packing Phase

Partial PDB file from actual trefoil
ATOM   4340  N    LEU B 196    41.442 …
ATOM   4341  CA   LEU B 196    40.705 …
ATOM   4342  C    LEU B 196    40.704 …
ATOM   4343  O    LEU B 196    41.787 …
ATOM   4344  CB   LEU B 196    41.441 …
ATOM   4345  CG   LEU B 196    41.503 …
ATOM   4346  CD1  LEU B 196    41.902 …
ATOM   4347  CD2  LEU B 196    40.155 …
ATOM   4348  H    LEU B 196    42.299 …
ATOM   4349  N    THR B 197    39.524 …
ATOM   4350  CA   THR B 197    39.397 …
ATOM   4351  C    THR B 197    38.506 …
ATOM   4352  O    THR B 197    37.700 …
ATOM   4353  CB   THR B 197    38.704 …
ATOM   4354  OG1  THR B 197    39.307 …
ATOM   4355  CG2  THR B 197    38.808 …
ATOM   4356  H    THR B 197    38.752 …
…
[Figure: known cap, strand residues LTSKD and STILL at positions 1-10; 1ABR (Tahirov et al. 1995)]

SCWRL predicted cap atomic positions
ATOM      1  N    LEU   1      41.442 …
ATOM      2  CA   LEU   1      40.705 …
ATOM      3  C    LEU   1      40.704 …
ATOM      4  O    LEU   1      41.787 …
ATOM      5  CB   LEU   1      41.412 …
ATOM      6  CG   LEU   1      40.686 …
ATOM      7  CD1  LEU   1      39.364 …
ATOM      8  CD2  LEU   1      41.533 …
ATOM      9  N    ARG   2      39.524 …
ATOM     10  CA   ARG   2      39.397 …
ATOM     11  C    ARG   2      38.506 …
ATOM     12  O    ARG   2      37.700 …
ATOM     13  CB   ARG   2      38.788 …
ATOM     14  CG   ARG   2      39.658 …
ATOM     15  CD   ARG   2      38.984 …
ATOM     16  NE   ARG   2      39.799 …
ATOM     17  CZ   ARG   2      39.404 …
…
[Figure: cap from the top wrap, strand residues LRVYY and RILHN at positions 1-10 on strands B2 and B3, with a steric clash marked]

Toward Automation
• For each SCOP beta-structural template:
  * align all known examples of the fold
  * find pairs in conserved core
  * thread onto template (additionally use profiles); find candidate alignments
  Pack sidechains for each, determine best structure
  Place loops and unstructured regions

Multiple Structure Alignment for Remote Protein Homologs
• We spend the remainder of the talk discussing our new program for multiple structure alignment: MATT

The Multiple Structure Alignment Problem
Input: atomic coordinates for the backbones of m protein structures
Output: a sequence alignment of the protein structures, together with a superimposition of the structures in 3D space.

The Multiple Structure Alignment Problem
Def: the common core of a protein structure is the set of positions where every structure contributes a residue in the alignment.

The Multiple Structure Alignment Problem
Geometric criteria: good multiple structure alignments MAXIMIZE common core size while MINIMIZING pairwise RMSDs between structures.
Note: even simplified versions are NP-hard (Goldman, Istrail and Papadimitriou, 1999).

The Multiple Structure Alignment Problem
Discrimination criteria: good multiple structure alignments align what is "supposed to be aligned" because it is part of the evolutionarily conserved core.
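
The geometric criteria above can be made concrete with a small sketch that computes the common core size and the average pairwise RMSD of a multiple alignment. The data layout is an assumption made for illustration: each alignment row holds residue indices or `None` for a gap, and `coords` holds one representative point per residue (for example the C-alpha atom) for structures that are already superimposed. The function names are hypothetical and are not taken from Matt.

```python
import math
from itertools import combinations
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Row = Sequence[Optional[int]]  # residue index per alignment column, None = gap


def common_core(alignment: Sequence[Row]) -> List[int]:
    """Columns where every structure contributes a residue."""
    n_cols = len(alignment[0])
    return [c for c in range(n_cols) if all(row[c] is not None for row in alignment)]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root-mean-square deviation between two equal-length point lists."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def score_alignment(
    alignment: Sequence[Row], coords: Sequence[Sequence[Point]]
) -> Tuple[int, float]:
    """Return (common core size, average pairwise RMSD over core positions)."""
    core = common_core(alignment)
    # Pull out each structure's coordinates at the core columns only.
    core_points = [
        [coords[s][row[c]] for c in core] for s, row in enumerate(alignment)
    ]
    pair_rmsds = [rmsd(x, y) for x, y in combinations(core_points, 2)]
    return len(core), sum(pair_rmsds) / len(pair_rmsds)


# Toy data: three structures, four alignment columns, one gap in the second.
alignment = [
    [0, 1, 2, 3],
    [0, 1, None, 2],
    [0, 1, 2, 3],
]
coords = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0), (3.0, 1.0, 0.0)],
]
core_size, avg_rmsd = score_alignment(alignment, coords)
print(core_size, round(avg_rmsd, 3))
```

The two numbers pull in opposite directions: dropping poorly fitting columns lowers the RMSD but shrinks the core, which is why alignment programs are compared on both at once.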

Approaches to Structure Alignment
• AFP chaining methods align all short pieces and chain together using dynamic programming
• Contact map methods look for similarities within distance matrices
• Geometric hashing, secondary structure elements, etc.

Some Popular Structure Aligners
• Dali (Holm 93)
• VAST (Bryant 96)
• LOCK (Singh 97)
• FlexProt (Shatsky et al. 02)
• FATCAT (Ye&Godzik 04)
• LOVOALIGN (Andreani et al. 06)
• CE/CE-MC (Shindyalov 2000)
• SSAP (Orengo&Taylor 96)
• MultiProt (Shatsky&Wolfson 04)
• POSA (Ye&Godzik 05)
• Mustang (Konagurthu et al. 06)
• CBA (Ebert 07)

The Benchmark Datasets
• Globins
• Homstrad
  – 1028 alignments
  – Each alignment contains 2-41 structures
  – 399 sets with > 2 structures

The Benchmark Datasets
Sabmark
Superfamily set:
  – 3645 domains in 426 subsets
Twilight zone set:
  – 1740 domains in 209 subsets
Both sets contain:
  – Between 3 and 25 structures
  – Decoy structures (sequence matches that reside in different SCOP domains)

Matt: Multiple Alignment with Translation and Twists
• Matt is an AFP chaining method that additionally adds flexibility in the form of geometrically impossible bends and breaks.

Other work modeling flexibility
• In structure alignment:
  – Flexprot [Shatsky et al., 2002]
  – Fatcat/POSA [Ye&Godzik, 2004, 2005]
• For other reasons:
  – Molecular docking [Echols et al, 03; Bonvin, 06]
  – Ligand binding [Lemmen et al, 2006]
  – Decoy construction [Singh&Berger, 2006]

Outline of the Matt Algorithm

Results on Sabmark (Superfamily)
Program Name   Avg. Core Size   Avg. RMSD
Multiprot       68.701          1.498
Mustang        104.162          4.146
Matt           104.692          2.639

Results on Sabmark (Twilight Zone)
Program Name   Avg. Core Size   Avg. RMSD
Multiprot       36.54           1.536
Mustang         66.833          5.035
Matt            66.967          2.916

Sabmark Decoy Set
• For each SCOP superfamily, positive examples of the fold, and negative examples that are
  – Random examples from a different superfamily
  – Examples from a different superfamily that are nonetheless good BLAST hits

Toward Automation
• For each SCOP beta-structural template:
  * align all known examples of fold
  * find pairs in conserved core
  * thread onto template (additionally use profiles); find candidate alignments
  Pack sidechains for each, determine best structure
  Place loops and unstructured regions

On the Web
• BetawrapPro for predicting beta-helices and beta-trefoils at: http://betawrappro.csail.mit.edu
• Matt at: http://matt.csail.mit.edu OR http://matt.cs.tufts.edu

Acknowledgements
• Matt Menke
• Andrew McDonnell
• Phil Bradley
• Bonnie Berger
• Jonathan King
• National Science Foundation