Evolution of angles
10.12.15
Background.
Challis and Schmidler (2012) introduced a statistical model for the evolution of proteins in which the spatial positions of amino acids were modeled as independent Ornstein-Uhlenbeck processes. That paper developed software that could analyze two protein structures. Herman et al. (2014) extended this to a set of structures related by a phylogeny. This is very exciting research and takes important steps toward a statistical description of structure evolution. Statistical modeling has been of immense importance for the analysis of sequence evolution. However, the Challis-Schmidler model is unrealistic and very limited in its application area. It is unrealistic because an amino acid position does not evolve according to an Ornstein-Uhlenbeck process and two neighboring positions are not independent: they are typically and constantly about 0.25 nm apart. And the model has a limited application area because it can only be applied to pairs or sets of structures that can convincingly be superimposed by 1) translation, 2) rotation and 3) deletion of selected positions. A proper investigation of this on existing protein structure databases would be interesting, but it most likely restricts the use of this model to very similar structures.
Random structures have earlier been modeled by a random walk in space (McLachlan, 1979; W Taylor, 2006). Hamelryck et al. (2006) used an HMM with hidden information about the secondary structure of the present amino acid to make a sequence of direction decisions. This defines a distribution on protein structures and gives a structure predictor given a sequence, without considering evolution.
Proteins can be represented at many levels of detail, but two will be predominant in these applications: only the α-C, or the α-C with the adjacent C and N atoms. In the former case a direction will have to be chosen with the present α-C at the middle of a sphere [S] and pointing to the next α-C.
In the latter representation two angles in [-π;π] will have to be chosen, which could be shown as a little square [-π;π]*[-π;π], but with the caveat that when a point leaves the square at one edge, it will enter at the same position at the opposite edge. The torus has the appropriate properties and can be obtained by gluing one pair of opposite edges together, making a tube. Then the ends of the tube are glued together to make a torus [T]. This is not entirely perfect, because now the two coordinates are not treated identically, but it is acceptable for most purposes.
It seems natural to combine the ideas of Hamelryck et al. with evolutionary models to develop methods that could analyze sets of homologous proteins. Hamelryck used the Fisher-Bingham distribution on spheres and the bivariate von Mises distribution on the torus. If we have observed a homologous pair of sequences-structures, then we will need a pair of spheres [S1,S2] or a pair of tori [T1,T2] to represent the appropriate angles.
To illustrate on a simpler case what we expect will happen on the combined pairs ([S1,S2] or [T1,T2]), let us focus on the simple state space of four nucleotides [A, C, G, T] and the simplest model (Jukes-Cantor; JC69). For a pair of homologous nucleotides separated by evolutionary time t, we now have a model with values in [A, C, G, T]*[A, C, G, T] that is a function of t and can be written as a 4 by 4 matrix giving the probability of the second nucleotide given the first. If t is zero, we have 1.0 on the diagonal and 0.0 off the diagonal. As t grows, more probability mass will be found off the diagonal and eventually each row will look the same, namely four entries of 0.25, indicating that the second nucleotide is chosen independently of the first nucleotide. In this simple model, we first choose one of the 4 nucleotides according to the equilibrium distribution of JC69 [0.25, 0.25, 0.25, 0.25] and then use the 4 by 4 matrix to choose the second nucleotide.
Something similar should be expected to happen in the angle model. If t is zero, then we choose two points at the same position on S or T according to the distribution used by Hamelryck.
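The Jukes-Cantor illustration above can be sketched in a few lines of Python. This is only a minimal sketch: the function names and the rate parameter mu (expected substitutions per site per unit time, set to 1.0 here) are assumptions made for illustration and are not part of the project description.

```python
import math

NUCLEOTIDES = "ACGT"

def jc69_matrix(t, mu=1.0):
    """Return the 4 by 4 JC69 transition probability matrix for time t.

    mu is an assumed substitution rate (expected substitutions per site
    per unit time); entry [i][j] is P(second = j | first = i).
    """
    decay = math.exp(-4.0 * mu * t / 3.0)
    same = 0.25 + 0.75 * decay   # probability of observing the same nucleotide
    diff = 0.25 - 0.25 * decay   # probability of each particular different nucleotide
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def joint_pair_distribution(t, mu=1.0):
    """Joint distribution on [A, C, G, T]*[A, C, G, T].

    The first nucleotide is drawn from the equilibrium distribution
    [0.25, 0.25, 0.25, 0.25]; the second is drawn from the matching row
    of the transition matrix.
    """
    p = jc69_matrix(t, mu)
    return [[0.25 * p[i][j] for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    for t in (0.0, 0.5, 50.0):
        print("t =", t)
        for nucleotide, row in zip(NUCLEOTIDES, jc69_matrix(t)):
            print(" ", nucleotide, [round(x, 3) for x in row])
```

At t = 0 the matrix is the identity, and for large t every row approaches four entries of 0.25, which is the behaviour the angle model is expected to mimic on pairs of spheres or tori.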
As t grows the first point will stay fixed and the second point will start to move away from it according to some evolutionary process. Eventually, for t approaching infinity, the two points can be chosen independently.
In collaboration with a group in Copenhagen a torus diffusion has been developed. We are at the stage where we can use this torus diffusion to model pairwise evolution, that is, the evolution of one amino acid sequence and structure (set of dihedral angles) into another. Together with this we also simultaneously model the insertion and deletion evolutionary process (also known as statistical alignment), instead of assuming a fixed pairwise alignment.
A natural extension of pairwise evolution (Figure 4.2) is to consider the evolution of structure on a phylogeny (Herman et al., 2014; Figure 4.3). In the pairwise case, the time-reversibility of the torus diffusion allows us to ignore the unobserved common ancestor of the two protein species being considered. In the case of a phylogeny, however, it becomes necessary to integrate out the dihedral angles at the unobserved ancestral nodes. This has the disadvantage that things become less computationally tractable, but the advantage that more of the available data (more species) can be used. This will allow us to get better estimates of evolutionary parameters (for example, rates of dihedral angle evolution) and to more accurately predict quantities of interest, for example a missing structure (set of dihedral angles) given a set of related sequences and/or structures (homology modeling). If time permits, one might also like to consider implementing statistical alignment in the case of a phylogeny. Once again, this has the problem of having to integrate out the states at the ancestral nodes and is known to be computationally intractable.
The major advantage of doing so is that one does not need to rely on a fixed multiple alignment (thus ignoring alignment uncertainty), but rather integrates over all possible insertion/deletion histories.
Work Plan:
A. Read the following and discuss them in detail with the supervisor: Chapters 6, 7 and 10 in Bayesian Methods in Structural Bioinformatics, which should give a good foundation in the distributions on spheres and tori. Challis and Schmidler (2012) and Herman et al. (2014) should give a sense of how to extend this to an evolutionary setting. The angle-choice model should also have the advantage over the Challis-Schmidler model that it does not need the translation + rotation + residue mapping, since angles are encoded as a sequence.
B. Take a set of homologous protein structures (~10) and align them.
C. Consider how to integrate out the unobserved dihedral angles at the ancestral states. A naive approach would be to use Markov Chain Monte Carlo (MCMC) to integrate out the unobserved ancestral dihedral angles; however, a more efficient approach may exist.
D. Use C) to construct and train a model of amino acid sequence and dihedral angle evolution and estimate parameters of interest. For a start, a simple Hidden Markov Model (HMM) can be implemented, in which each hidden state represents an "evolutionary regime" that parameterizes the evolutionary process describing both sequence and structure evolution at each position. Such a model assumes the evolutionary regime has remained constant at a given position amongst all species along the phylogeny, whilst allowing changes in evolutionary regime from one amino acid/dihedral angle position to the next. The transition probability matrix of the HMM encodes the dependencies amongst the evolutionary regimes.
Future applications and cautionary remarks:
Although this project will only address a single building block in a larger complex, this would eventually have a large set of applications:
1. It leads to a more realistic model of sequence and structure evolution.
2. It leads to better alignments.
3. It leads to better phylogenetic use of the data.
4. Homology modeling.
It will be an issue for any data analysis that protein structures are determined with some error. In protein databases some indication of the size of this error is given in the so-called B-factors. In the angle representation there is no direct indicator of the level of error. These errors are the analogue of sequencing errors for DNA (which can now be brought below 10^-5), but they are very large and will have to be accounted for in a model.
The illustration:
These errors are illustrated in the left tree below as an elongation of the edges leading to the observed sequence-structure.
[Figure: four panels, numbered 1 to 4]
Panel 1 (far left): the case Hamelryck has addressed (the red wiggle or pig's tail is a protein!).
Panel 2: what we have addressed with the pairwise evolutionary model.
Panel 3: what you are expected to address in this project: the same, but generalized to a phylogeny and with some structures missing.
Panel 4: in principle, all subfigures should have an extra component added to the leaf edge, since the structure is observed with a lot of error.
References:
Challis, C. J., & Schmidler, S. C. (2012). A stochastic evolutionary model for protein structure alignment and phylogeny. Molecular Biology and Evolution, 29(11), 3575–87.
Hamelryck, Mardia and Ferkinghoff-Borg (eds., 2012, Springer) "Bayesian Methods in Structural Bioinformatics".
Hamelryck (2006)
Herman, Challis, Novak, J Hein and Schmidler (2014) Simultaneous Bayesian Estimation of Alignment and Phylogeny under a Joint Model of Protein Sequence and Structure.
AD McLachlan (1979) Gene duplications in the structural evolution of chymotrypsin. J Mol Biol. 128(1):49-79.
Rohlfs, Harrigan and Nielsen (2013) Modeling gene expression evolution with an extended Ornstein-Uhlenbeck process accounting for within-species variation. Mol Biol Evol.
Willie Taylor (2006) Decoy Models for Protein Structure Comparison Score Normalisation. J. Mol. Biol. 357, 676–699.
Acknowledgements:
JH taught a class in the autumn of 2014 at SAMSI, and there we had a study group on the evolution of protein structures where we read key papers, which led to many ideas. Illustrations are from Hamelryck, figures 10.2 and 10.3.
Oxford Work
This means work supervised or initiated by Jotun Hein. A lot of interesting work is going on in Oxford that I will not mention here. It might be included in future editions of an enlarged project description.
Sequences are the major data type in the present biomedical revolution and models of sequence evolution have been central in this. Key concepts in these models are homology and evolution. However, there are other data types where exactly the same considerations apply. The most obvious ones are networks and structures, so it is tempting to do similar analyses on these as on sequences. However, they are very different: they are much more noisy, there are stronger correlations between different parts (making them harder to model, but also more interesting) and they are closer to function than sequences. There is the same dichotomy between optimisation and stochastic based methods, where the latter was almost unexplored until after 2010, while the former goes back to at least 1972. So it was natural to try to make a proper stochastic model.
This project got started in the most unlikely way: a person committed suicide by throwing himself in front of the train going between Stansted and London. I then sat next to a protein chemist I knew of, Richard Goldstein, at the train station since the trains had been cancelled. We discussed this and he said he had some funds he could put into it. On June 30th the groups of Richard Goldstein and Willie Taylor came to Oxford and we all gave lectures from our different areas of expertise.
Our initial model took two proteins, defined a large set of stepping stones in protein space "between" the given protein structures and then defined rates between these stepping stones. It would quickly need a very large number of stepping stones, but was perfectly doable for quite similar structures, which are the ones that one wants to analyse anyway.
In the autumn of 2014 I did a sabbatical at SAMSI [Statistical and Applied Mathematical Sciences Institute, North Carolina] and gave a course in which there were 4 study groups. One [Jeremy Ash, Gary Larson, Michael Golden] was on models of protein structure evolution. In the beginning Ziheng Yang, Richard Goldstein, Scott Schmidler and David Pollock were also present. We went through 1-2 key papers each week. The most promising stochastic model of structure evolution was Challis and Schmidler from 2012, but it had the unfortunate assumption that each amino acid was evolving in space independently of all others. Joe Herman had extended this model from being a model on pairs of homologous structures to a set of homologous structures related by a phylogeny. Any protein would immediately dissolve if it evolved according to this process, since there was no concept of molecular bonds in the model. Andrew McLachlan already modelled a random protein as a random walk in space around 1980, so it was a natural idea to model structure evolution by letting the angles [phi/psi] between amino acids change over time. We immediately had a schism between the North Carolina people and the Oxford people, where the former wanted to have 3D Gaussian distributions describing the relationship between neighboring amino acids, while the latter would argue that bond lengths were known and one just had to describe the angles.
When Jotun Hein and Michael Golden returned to Oxford they started a study group on the recently published "Bayesian Methods in Structural Bioinformatics", which was extremely rewarding and included chapters on distributions on angle pairs by Kanti Mardia, who was an external professor in Oxford. Thomas Hamelryck and colleagues already had advanced models for protein secondary structure and angles for large protein families, but not including evolution. This seemed like the ideal collaboration and Michael Golden switched entirely to work on this. This is actually an extremely advanced project that forced us to read up on things like diffusions on a torus, the reference ratio method and copulas, and it would involve a lot of programming. We started as simply as possible with an MSc student, Ian Lim, doing a 3 month project on the circle before we took the step to diffusions on the torus.
However, it was also extremely interesting. It has already been shown that phylogeny inference becomes better if you include structure information, but investigating this on a large scale could be very valuable. It could well be that the 3% of the genome that is protein coding sequence has as much phylogenetic information content as the remaining 97%. A combined structure-sequence evolution model would also allow testing of co-linearity of structure and sequence evolution, and would let us ask how structural conservedness correlates with selection strength on an amino acid. Another exciting possibility is to use this for homology modelling: you know the sequence and structure for one protein, but only the sequence for another protein, and want to predict its structure. The present model would define all the necessary distributions in a proper statistical framework. Christian Ravn has plotted statistics like observed angle-pair evolution for pairs of aligned proteins in databases and it is obvious that there is a nice positive correlation between divergence time and how much angles have changed.
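One simple way to quantify "how much angles have changed" between two aligned proteins is to measure the distance between corresponding (phi, psi) pairs on the torus, so that an angle leaving the square at one edge re-enters at the opposite edge. The sketch below is an assumption made for illustration: the text does not say which statistic was actually plotted, and the function names are hypothetical.

```python
import math

def wrap_angle(x):
    """Map an angle in radians onto the interval [-pi, pi]."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi

def torus_distance(a, b):
    """Distance between two (phi, psi) pairs, respecting wrap-around.

    Each angle difference is wrapped first, so two angles close to +pi
    and -pi are treated as neighbours rather than as far apart.
    """
    dphi = wrap_angle(a[0] - b[0])
    dpsi = wrap_angle(a[1] - b[1])
    return math.hypot(dphi, dpsi)

def mean_angle_change(angles_1, angles_2):
    """Mean torus distance over aligned positions of two proteins.

    angles_1 and angles_2 are equally long lists of (phi, psi) pairs in
    radians, one pair per aligned position (gapped positions removed).
    """
    if len(angles_1) != len(angles_2) or not angles_1:
        raise ValueError("need two non-empty angle lists of equal length")
    total = sum(torus_distance(a, b) for a, b in zip(angles_1, angles_2))
    return total / len(angles_1)
```

Plotting such a per-pair statistic against an estimate of divergence time would be one way to look for the correlation described above.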
The present model is a local model (it describes the angle change), but it is most likely bad at the global level. It is known from previous work by, for instance, David Baker that such models will predict bad protein structures. However, there are ways to deal with this. One method is the reference-ratio method, also described in the book on Structural Bioinformatics. You need distributions on global statistics, like radius of gyration (a measure of extendedness), that typically will be too large if you only use a local model. It is then possible to sample from the local model in a way that gives the right global statistics. This sounds as if it will just solve our problem, but it has also created problems: how do we do this not for a single global model but for a complete evolutionary trajectory?
We also had 2 projects at the Oxford Summer School in Computational Biology (OSSCB):
• [Jan Domanski, Erik Sjöland, Anna FitzMaurice] End point conditioned path sampling in the Goldstein-Hein-Taylor model. This was implemented in an animator which I loved, but Joe Herman hated. It produced a stochastic morphing of one protein structure into another, assumed homologous, protein structure.
• [Alex Cumberworth, Alexandra Grigore, Nynke Niezink] Worked on an explicit model of correlated positions in protein structures. That the positions in a protein co-evolve is an old idea, but only after 2011 was there enough data for this idea to become useful. It is very useful, in fact, and has improved computational structure prediction a lot. One of the first explicit models of this was by Pollock, Goldman and Taylor from 1999, but there was too little data, so it was only applied convincingly to simulated data. The success stories after 2011 all used Mutual Information and not an explicit model of co-evolution, which this group did.
Marton Munz did a complete DPhil (co-supervised by Phil Biggin) on comparing the motion of protein structures.
In principle you need an evolutionary model of movements, which has a large state space: instead of 4 nucleotides, you might have 100 3D positions observed at 10^9 time points. Marton summarised a very long trajectory by the correlation of the movement of pairs of positions. Each protein of length L would thus be associated with an L times L matrix, and the movements of a pair of proteins would be compared by aligning their matrices. This is much harder than aligning strings. The method worked for pairs up to about 100 amino acids long. Clearly we were eager to extend it to several sequences and longer sequences. Longer sequences would have needed a local alignment algorithm for matrices.
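The step of summarising a trajectory as an L times L matrix can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not say which correlation measure was used, so the code uses the normalised covariance of displacement vectors (often called a dynamic cross-correlation matrix), and the function name and the input layout are hypothetical.

```python
import math

def movement_correlation(trajectory):
    """Summarise a trajectory as an L by L matrix of movement correlations.

    trajectory is a list of frames (time points); each frame is a list of
    L (x, y, z) positions. Entry [i][j] of the result is the time-averaged
    dot product of the displacements of positions i and j from their mean
    positions, normalised so that the diagonal equals 1.
    Assumes every position moves at least once (non-zero variance).
    """
    n_frames = len(trajectory)
    n_pos = len(trajectory[0])

    # Mean position of each residue over the whole trajectory.
    mean = [[sum(frame[i][d] for frame in trajectory) / n_frames
             for d in range(3)] for i in range(n_pos)]

    # Displacement of each residue from its mean position, per frame.
    disp = [[[frame[i][d] - mean[i][d] for d in range(3)]
             for i in range(n_pos)] for frame in trajectory]

    # Time-averaged dot products of displacement vectors.
    cov = [[sum(sum(f[i][d] * f[j][d] for d in range(3)) for f in disp) / n_frames
            for j in range(n_pos)] for i in range(n_pos)]

    # Normalise to correlations in [-1, 1].
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])
             for j in range(n_pos)] for i in range(n_pos)]

if __name__ == "__main__":
    # Toy trajectory: 3 frames, 3 positions. Position 1 moves with
    # position 0; position 2 moves in the opposite direction.
    toy = [
        [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
        [(1.0, 0.0, 0.0), (6.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
        [(2.0, 1.0, 0.0), (7.0, 1.0, 0.0), (-2.0, -1.0, 0.0)],
    ]
    for row in movement_correlation(toy):
        print([round(x, 3) for x in row])
```

Two proteins of lengths L1 and L2 would then give an L1 by L1 and an L2 by L2 matrix; aligning those matrices is the hard part described above and is not attempted in this sketch.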