Predicting protein 3D structure from evolutionary sequence variation
Thomas A. Hopf
Protein Prediction I, 06/18/2015

Outline
  Prologue: Correlated mutations
  Local vs. global models for 2D prediction
  Application to 3D structure prediction

The structure prediction problem
  Diagram labels: genotype (ACTGTGCACG TAATGGCATC), phenotype

Structure from sequence alone?
  Christian Anfinsen, Nobel Prize for Chemistry 1972

Sequence-structure gap is not closing!
  Marks, Hopf & Sander, Nature Biotechnology (2012)

A protein

Evolutionary selection leaves residue covariation signature
  Diagram labels: easy; inverse problem

Folding proteins from evolutionary couplings
  ???

Outline
  Prologue: Correlated mutations
  Local vs. global models for 2D prediction
  Application to 3D structure prediction

Try something simple: correlation between two columns
  [Figure: example alignment of 8 sequences, 23 positions, N- to C-terminus, with two columns marked i and j]
    ADRLTLTAKKDGPCDEAYGNARC
    ADRATLTAKKIGPCDEAYGNARC
    ADRLTLTAKKDYPCDEAGGNARC
    ADRATLTAKKIYPCDEAGGNARC
    AERLTLTAKKDGPCDEAYGNAKC
    AERATLTAKKIGPCDEAYGNVKC
    AERLTLTAKKDYPCDEAGGNAKC
    AERATLTAKKIYPCDEAGGNAKC
  Single column frequencies: f_i(A_i), f_j(A_j)
  Column pair frequencies: f_{ij}(A_i, A_j)
  Compare f_{ij}(A_i, A_j) with f_i(A_i) f_j(A_j)
  To what extent do we see a pair of amino acids more/less often than expected by chance?

Mutual information measures correlation between two columns
  Paper excerpt (clipped on the slide): "[...] measure for calculating the correlation between [...] is mutual information (MI), which is defined as

  \[ MI_{ij} = \sum_{A_i, A_j = 1}^{q} f_{ij}(A_i, A_j) \ln \frac{f_{ij}(A_i, A_j)}{f_i(A_i) f_j(A_j)} \]

  [...] columns i and j. MI_{ij} is the difference entropy between [...] A_j) and the expected frequency f_i(A_i) f_j(A_j) if [...] independent. MI is an inherently local measure which [...]"
  Annotations on the formula:
    sum: all possible amino acid combinations
    f_{ij}(A_i, A_j): weight
    logarithm of the ratio: deviation from statistical independence

Bad news: doesn't work.
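For reference, the mutual information score defined above can be sketched in a few lines of Python. This is a minimal illustration, not the EVfold implementation: the function name `mutual_information` and the toy columns are assumptions made for this example, frequencies are plain counts over the sequences, and the sequence reweighting and pseudocounts that practical pipelines typically add are left out.

```python
from collections import Counter
from math import log


def mutual_information(col_i, col_j):
    """Mutual information between two alignment columns.

    Each column is a string with one amino acid letter per sequence.
    Frequencies are raw counts divided by the number of sequences
    (no reweighting, no pseudocounts: a simplifying assumption).
    """
    if len(col_i) != len(col_j):
        raise ValueError("columns must cover the same sequences")
    n = len(col_i)
    f_i = Counter(col_i)              # single column counts for i
    f_j = Counter(col_j)              # single column counts for j
    f_ij = Counter(zip(col_i, col_j))  # pair counts for (i, j)

    mi = 0.0
    # Only observed pairs contribute; unobserved pairs have f_ij = 0
    # and are treated as contributing zero to the sum.
    for (a, b), count in f_ij.items():
        p_ab = count / n
        expected = (f_i[a] / n) * (f_j[b] / n)
        mi += p_ab * log(p_ab / expected)
    return mi


# Toy columns (hypothetical example data, eight sequences each).
coupled = mutual_information("GGYYGGYY", "YYGGYYGG")
independent = mutual_information("LALALALA", "GGYYGGYY")
conserved = mutual_information("AAAAAAAA", "GGYYGGYY")
print(coupled, independent, conserved)
```

In this toy case the two perfectly coupled columns give ln 2 (about 0.693), while the independent pair and the pair with a fully conserved column give 0. The score looks at one column pair at a time, which is why it cannot separate direct from indirect correlations.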
  [Figure: contact map "TRY2_RAT, MIs and residues less than 5.00 Angstroms apart" (local model); residue index on both axes, ticks 50 to 200; nz = 300. Legend: PDB structure residue contacts; residue pairs with 100 highest MI values.]
  Main problem: transitivity

Solution: use a global model!
  A global probability model explains observed correlations by causative pair interactions.
  Inverse Potts inference (undirected graphical model)
  Diagram labels: observed correlations; causal, direct coevolution

Probability model connects correlations to direct interactions
  Paper excerpt (clipped on the slide): "[...] and, in keeping with the maximum-entropy principle, the desired most-general model is had by maximizing the entropy

  \[ S = -\sum_{\sigma} P(\sigma) \ln P(\sigma) \]

  while adhering to these. Maximization of S under (3) can be carried out through the introduction of Lagrange multipliers, giving, after some straightforward calculation, the Potts distribution

  \[ P(\sigma) = \frac{1}{Z} \exp\left( \sum_{i=1}^{N} h_i(\sigma_i) + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} J_{ij}(\sigma_i, \sigma_j) \right), \]

  in which multipliers remain as parameters to be fitted to data. This P(\sigma), in an information-theory sense, makes minimal assumption about the world while still capable of staying true to our observed averages. Z is a normalizing constant making sure the total probability is one by summing over all possible states,

  \[ Z = \sum_{\sigma} \exp\left( \sum_{i=1}^{N} h_i(\sigma_i) + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} J_{ij}(\sigma_i, \sigma_j) \right). \]

  Footnote 4: Unless stated otherwise, node indexes are assumed to take on values 1 ≤ i ≤ N, [...] indexes run through 1 ≤ i < j ≤ N, and states take on 1 ≤ k, l ≤ q."
  Diagram labels: observed data (sequences); approximate maximum likelihood inference; direct pair interactions

From sequences to pair scores
  Infer parameters (approximate!):

  \[ P(\text{data} \mid \text{parameters}) = \prod_{\sigma \in \text{alignment}} P(\sigma \mid h, J) \]

  Calculate evolutionary couplings:

  \[ FN_{ij} = \lVert J_{ij} \rVert_2 = \sqrt{ \sum_{k=1}^{21} \sum_{l=1}^{21} J_{ij}(k, l)^2 } \]

  + some other technical details
  References: Marks et al. (2011), Ekeberg et al. (2013)

Most global model pairs are close in 3D
  [Figure: two TRY2_RAT contact maps; residue index on both axes, ticks 50 to 200.
   Local model (mutual information): "TRY2_RAT, MIs and residues less than 5.00 Angstroms apart", nz = 300.
   Global model (MaxEnt, Marks et al., 2011): "TRY2_RAT, residues less than 5.00 Angstroms apart", number of constraints = 150.]

He solved it before everyone else did...
  Alan Lapedes

Outline
  Prologue: Correlated mutations
  Local vs. global models for 2D prediction
  Application to 3D structure prediction

What could we predict using evolutionary couplings?
  3D structure
  function
  protein complexes
  conformation changes

What could we predict?
  3D structure
  protein complexes

A brief reminder

The breakthrough: 15 proteins folded from sequences alone
  3D Structure
  Marks et al., PLoS ONE (2011)
  Figure caption (clipped on the slide): "Figure 2. Predicted 3D structures for three representative proteins. Visual comparison of 3 of the 15 te[...] reveals the remarkable agreement of the predicted top ranked 3D structure (left) and the experimentally observed [...] error and, in parentheses, number of residues used for Cα-RMSD error calculation, e.g., 2.9 Å Cα-RMSD (67). The ribb[...]"

Also works for membrane proteins!
  Panel labels: predicted; experimental
  Hopf et al., Cell (2012)

Medically important membrane proteins predicted from sequences
  respiratory complex I
  predicted (Hopf et al., Cell, 2012)
  experimental (Baradaran et al., Nature, 2013)

Medically important membrane proteins predicted from sequences II
  human adiponectin receptor 1 (3.7 Å, 186 residues)

What could we predict?
  3D structure
  protein complexes

Interacting proteins co-evolve to maintain interaction

Complex interactions from the evolutionary sequence record

Accurate prediction of the ABC transporter MetNI

Benchmark set gallery

De novo prediction of unsolved complex interactions

Can we predict which proteins interact?
  E. coli ATP synthase

ATP synthase interaction matrix

Try folding yourself!
www.evfold.org

References for further reading

Protein 3D Structure Computed from Evolutionary Sequence Variation

Debora S. Marks1*., Lucy J. Colwell2., Robert Sheridan3, Thomas A. Hopf1, Andrea Pagnani4, Riccardo Zecchina4,5, Chris Sander3

1 Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America
2 MRC Laboratory of Molecular Biology, Hills Road, Cambridge, United Kingdom
3 Computational Biology Center, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
4 Human Genetics Foundation, Torino, Italy
5 Politecnico di Torino, Torino, Italy

Abstract

The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues, including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7–4.8 Å Cα-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org). This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein structures, new strategies in protein and drug design, and the identification of functional genetic variants in normal and disease genomes.

Citation: Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. (2011) Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE 6(12): e28766. doi:10.1371/journal.pone.0028766

Editor: Andrej Sali, University of California San Francisco, United States of America

Received November 10, 2011; Accepted November 14, 2011; Published December 7, 2011

Copyright: © 2011 Marks et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: CS and RS have support from the Dana Farber Cancer Institute-Memorial Sloan-Kettering Cancer Center Physical Sciences Oncology Center (NIH U54CA143798). LC is supported by an Engineering and Physical Sciences Research Council fellowship (EP/H028064/1). TH has support from the German National Academic Foundation. RZ has support from European Community grant 267915. No other financial support was received for the research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]
. These authors contributed equally to this work.

Introduction

Exploiting the evolutionary record in protein families

The evolutionary process constantly samples the space of possible sequences and, by implication, structures consistent with a functional protein in the context of a replicating organism. Homologous proteins from diverse organisms can be recognized by sequence comparison because strong selective constraints prevent amino acid substitutions in particular positions from being accepted. The beauty of this evolutionary record, reported in protein family databases such as PFAM [1], is the balance [...]

[...] [2,3]. This suggests that residue correlations could provide information about amino acid residues that are close in structure [4,5,6,7,8,9,10,11]. However, correlated residue pairs within a protein are not necessarily close in 3D space. Confounding residue correlations may reflect constraints that are not due to residue proximity but are nevertheless true biological evolutionary constraints, or they could simply reflect correlations arising from the limitations of our insight and technical noise. Evolutionary constraints on residues involved in oligomerization, protein-protein, or protein-substrate interactions or other spatially indirect or spatially distributed interactions can result in co-variation [...]

Acknowledgments
  Debora Marks (Harvard Medical School)
  Chris Sander (MSKCC)
  Burkhard Rost (TU München)
  Charlotta Schärfe (HMS & Uni Tübingen)
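Supplementary sketch for the "From sequences to pair scores" slide: once coupling matrices J_ij have been inferred, each residue pair is scored by the Frobenius norm of its matrix and the pairs are ranked. The Python below is a minimal illustration under stated assumptions: the names `frobenius_norm` and `rank_pairs` are hypothetical, the coupling values are made-up toy numbers over a two-letter alphabet (the slide uses 21 states), and the inference step itself plus the "other technical details" mentioned on the slide are not covered.

```python
from math import sqrt


def frobenius_norm(j_block):
    """Frobenius norm of one q x q coupling matrix J_ij (list of rows)."""
    return sqrt(sum(value * value for row in j_block for value in row))


def rank_pairs(couplings):
    """Score every residue pair and sort from strongest to weakest.

    `couplings` maps a pair (i, j) with i < j to its q x q matrix.
    Returns a list of ((i, j), score) tuples in descending score order.
    """
    scores = {pair: frobenius_norm(block) for pair, block in couplings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy couplings (hypothetical values, q = 2 states for brevity).
toy_couplings = {
    (1, 2): [[3.0, 0.0], [0.0, 4.0]],
    (1, 3): [[0.1, 0.0], [0.0, 0.1]],
    (2, 3): [[1.0, 1.0], [1.0, 1.0]],
}

for pair, score in rank_pairs(toy_couplings):
    print(pair, round(score, 3))
```

With these toy values the pair (1, 2) ranks first with a score of 5.0, followed by (2, 3) with 2.0 and then (1, 3). In the setting of the talk, the top-ranked pairs are the candidates for residue contacts in 3D.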