Protein Structure Evolution Models

Evolution of angles
10.12.15
Background. Challis and Schmidler (2012) introduced a statistical model for the evolution of proteins,
in which the spatial positions of amino acids are modeled as independent Ornstein-Uhlenbeck
processes. Their paper came with software that could analyze two protein structures. Herman et al. (2014)
extended this to a set of structures related by a phylogeny. This is very exciting research and takes
important steps toward a statistical description of structure evolution. Statistical modeling has been of
immense importance for the analysis of sequence evolution.
However, the Challis-Schmidler model is unrealistic and very limited in its application area. It is
unrealistic because an amino acid position does not evolve according to an Ornstein-Uhlenbeck
process and two neighboring positions are not independent: they are typically, and constantly, about
0.25 nm apart. The model has a limited application area because it can only be applied to pairs or
sets of structures that can convincingly be superimposed by 1) translation, 2) rotation and 3) deletion of
selected positions. A proper investigation of this on existing protein structure databases would be
interesting, but it most likely restricts the use of this model to very similar structures.
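To make the independence objection concrete, here is a minimal sketch (not the Challis-Schmidler implementation) of what happens to the distance between two neighboring positions when each coordinate follows its own Ornstein-Uhlenbeck process. The parameter names `theta` and `sigma`, their values, and the choice that each atom reverts to its own starting position are assumptions made only for this illustration.

```python
import math

def ou_moments(x0, mu, theta, sigma, t):
    """Mean and variance at time t of a 1-D Ornstein-Uhlenbeck process started at x0."""
    decay = math.exp(-theta * t)
    mean = mu + (x0 - mu) * decay
    var = sigma ** 2 / (2.0 * theta) * (1.0 - decay ** 2)
    return mean, var

def expected_sq_distance(d0, theta, sigma, t):
    """E|X1(t) - X2(t)|^2 for two atoms evolving as independent 3-D OU processes.

    Assumption for illustration: each atom reverts to its own starting position,
    so the mean separation stays d0 and only the variance grows.
    """
    _, var = ou_moments(0.0, 0.0, theta, sigma, t)
    # 3 coordinates, and the difference of two independent atoms doubles the variance.
    return d0 ** 2 + 3 * 2 * var

if __name__ == "__main__":
    for t in (0.0, 0.5, 2.0, 10.0):
        rms = math.sqrt(expected_sq_distance(0.25, theta=1.0, sigma=0.3, t=t))
        print(f"t={t:5.1f}  RMS neighbor distance = {rms:.3f} nm")
```

Under these assumed parameters the root-mean-square neighbor distance grows with t instead of staying constant, which is the sense in which independent processes ignore the bonded geometry.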
Random structures have earlier been modeled by a random walk in space (McLachlan, 1979; Taylor, 2006).
Hamelryck et al. (2006) used an HMM whose hidden states carry information about the secondary
structure of the present amino acid. The HMM makes a sequence of direction decisions, which defines a
distribution on protein structures and gives a structure predictor for a given sequence, without
considering evolution.
Proteins can be represented at many levels of detail, but two will be predominant in these
applications: only the α-C, or the α-C with the adjacent C and N atoms.
In the former case a direction has to be chosen: the present α-C sits at the middle of a sphere
[S] and the direction points to the next α-C.
In the latter representation two angles in [-π;π] have to be chosen. These could be shown as a point
in a little square [-π;π]*[-π;π], with the caveat that when a point leaves the square at one edge, it
enters at the same position on the opposite edge. The torus has the appropriate properties and can be
obtained by gluing one pair of opposite edges together to make a tube and then gluing the ends of the
tube together to make a torus [T]. This is not entirely perfect, because the two coordinates are now
not treated identically, but it is acceptable for most purposes.
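The wrap-around behaviour of the square can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the half-open interval [-π, π) is used so that the two glued edges map to a single value.

```python
import math

def wrap(angle):
    """Map any angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def torus_step(phi, psi, dphi, dpsi):
    """Move a point on the torus; a point leaving one edge re-enters at the opposite edge."""
    return wrap(phi + dphi), wrap(psi + dpsi)

def torus_distance(a, b):
    """Shortest distance between two (phi, psi) pairs, measured around the torus."""
    d_phi = abs(wrap(a[0] - b[0]))
    d_psi = abs(wrap(a[1] - b[1]))
    return math.hypot(d_phi, d_psi)
```

For example, the pairs (3.1, 0) and (-3.1, 0) look far apart in the square but are close on the torus, which `torus_distance` reflects.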
It seems natural to combine the ideas of Hamelryck et al. with evolutionary models to develop
methods that could analyze sets of homologous proteins. Hamelryck used the Fisher-Bingham
distribution on the sphere and the bivariate von Mises distribution on the torus. If we have observed a
homologous pair of sequences-structures, then we will need a pair of spheres [S1,S2] or a pair of tori
[T1,T2] to represent the appropriate angles.
To illustrate on a simpler case what we expect will happen on the combined pairs ([S1,S2] or
[T1,T2]), let us focus on the simple state space of four nucleotides [A, C, G, T] and the simplest
model (Jukes-Cantor; JC69). We now have a model with values in [A, C, G, T]*[A, C, G, T] that is a
function of t. If t is zero, we have 1.0 on the diagonal and 0.0 outside the diagonal. As t grows,
more probability mass will be found outside the diagonal, and eventually each row will look the same,
namely four entries of 0.25, indicating that the second nucleotide is chosen independently of the
first nucleotide. In this simple model, we first choose one of the 4 nucleotides according to the
equilibrium distribution of JC69 [0.25, 0.25, 0.25, 0.25] and then use the 4 by 4 matrix to choose the
second nucleotide.
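The JC69 behaviour described here can be written out as a small sketch. One assumption is made about parameterization: t is measured in expected substitutions per site, which gives the standard closed form with exp(-4t/3). The function names are hypothetical.

```python
import math
import random

NUCLEOTIDES = "ACGT"

def jc69_matrix(t):
    """4x4 JC69 transition matrix; t is in expected substitutions per site (assumption)."""
    e = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * e   # probability of observing the same nucleotide
    diff = 0.25 - 0.25 * e   # probability of each particular different nucleotide
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def sample_pair(t, rng=random):
    """Draw the first nucleotide from the equilibrium distribution, the second from its row."""
    first = rng.randrange(4)  # uniform: [0.25, 0.25, 0.25, 0.25]
    row = jc69_matrix(t)[first]
    second = rng.choices(range(4), weights=row)[0]
    return NUCLEOTIDES[first], NUCLEOTIDES[second]
```

At t = 0 the matrix is the identity, so both members of a sampled pair agree; for large t every entry approaches 0.25 and the two nucleotides become independent.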
Something similar should be expected to happen in the angle model. If t is zero, then we choose two
points at the same position on S or T according to the distribution used by Hamelryck. As t grows, the
first point stays fixed while the second starts to move away from it according to some evolutionary
process. Eventually, for t approaching infinity, the two points can be chosen independently.
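As a one-dimensional illustration of this behaviour, the sketch below simulates a single angle on the circle with a simple von Mises-type diffusion, d(theta) = -alpha*sin(theta - mu) dt + sigma dW, using Euler-Maruyama steps. This is an assumed toy process chosen for illustration, not the torus diffusion of the project; the parameter names and defaults are hypothetical. Its stationary distribution is von Mises with concentration 2*alpha/sigma^2, which plays the role of the equilibrium distribution.

```python
import math
import random

def wrap(angle):
    """Map any angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def evolve_angle(theta0, t, mu=0.0, alpha=1.0, sigma=1.0, dt=0.01, rng=random):
    """Euler-Maruyama simulation of d(theta) = -alpha*sin(theta - mu) dt + sigma dW."""
    theta = theta0
    for _ in range(int(round(t / dt))):
        drift = -alpha * math.sin(theta - mu)
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        theta = wrap(theta + drift * dt + noise)
    return theta

def sample_pair(t, mu=0.0, alpha=1.0, sigma=1.0, rng=random):
    """Draw the first angle from the stationary von Mises law, then evolve a copy for time t."""
    kappa = 2.0 * alpha / sigma ** 2  # stationary concentration of this diffusion
    first = wrap(rng.vonmisesvariate(mu, kappa))
    return first, evolve_angle(first, t, mu, alpha, sigma, rng=rng)
```

With t = 0 the two angles coincide, and for large t the second angle forgets the first, mirroring the two limits described for the nucleotide case.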
In collaboration with a group in Copenhagen a torus diffusion has been developed. We are at the stage
where we can use this torus diffusion to model pairwise evolution - that is the evolution of one amino
acid sequence and structure (set of dihedral angles) into another. Together with this we also
simultaneously model the insertion and deletion evolutionary process (also known as statistical
alignment), instead of assuming a fixed pairwise alignment.
A natural extension of pairwise evolution (Figure 4.2) is to consider the evolution of structure on a
phylogeny (Herman et al., 2014; Figure 4.3). In the pairwise case, the time-reversibility of the torus
diffusion allows us to ignore the unobserved common ancestor of the two protein species being
considered. In the case of a phylogeny, however, it becomes necessary to integrate out the dihedral
angles at the unobserved ancestral nodes. This has the disadvantage that things become less
computationally tractable, but the advantage that more of the available data (more species) can be used.
This will allow us to get better estimates of evolutionary parameters (for example, rates of dihedral
angle evolution) and to predict quantities of interest more accurately, for example predicting a missing
structure (set of dihedral angles) given a set of related sequences and/or structures (homology
modeling).
If time permits, one might also consider implementing statistical alignment in the case of a
phylogeny. Once again, this has the problem of having to integrate out the states at the
ancestral nodes, and it is known to be computationally intractable. The major advantage of doing so is
that one does not need to rely on a fixed multiple alignment (thus ignoring alignment uncertainty), but
rather integrates over all possible insertion/deletion histories.
Work Plan:
A. Read, and discuss in detail with the supervisor, Chapters 6, 7 and 10 of Bayesian Methods in
Structural Bioinformatics, which should give a good foundation in the distributions on spheres and tori.
Challis and Schmidler (2012) and Herman et al. (2014) should give a sense of how to extend this to an
evolutionary setting. The angle-choice model should also have the advantage over the Challis-Schmidler
model that it does not need the translation + rotation + residue mapping, since angles are
encoded as a sequence.
B. Take a set of homologous protein structures (~10) and align them.
C. Consider how to integrate out the unobserved dihedral angles at the ancestral states. A naive
approach would be to use Markov Chain Monte Carlo (MCMC) to integrate out the unobserved
ancestral dihedral angles, but a more efficient approach may exist.
D. Use C) to construct and train a model of amino acid sequence and dihedral angle evolution and
estimate parameters of interest. For a start, a simple Hidden Markov Model (HMM) can be
implemented, in which each hidden state represents an "evolutionary regime" that parameterizes the
evolutionary process describing both sequence and structure evolution at each position. Such a model
assumes the evolutionary regime has remained constant at a given position amongst all species along
the phylogeny, whilst allowing changes in evolutionary regime from one amino acid/dihedral angle
position to the next. The transition probability matrix of the HMM encodes the dependencies
amongst the evolutionary regimes.
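The HMM over positions in step D can be sketched with the standard scaled forward algorithm. In this sketch the per-position likelihoods under each regime are taken as given inputs (in the project they would come from the phylogenetic sequence-structure model of step C); the function name and argument layout are hypothetical.

```python
import math

def forward_log_likelihood(initial, transition, site_likelihoods):
    """Log-likelihood of an alignment under an HMM over evolutionary regimes.

    initial[k]             : probability that the first position is in regime k
    transition[j][k]       : probability of moving from regime j to regime k
                             between neighboring positions
    site_likelihoods[i][k] : likelihood of the data at position i under regime k
                             (assumed to be computed elsewhere)
    """
    n_states = len(initial)
    alpha = [initial[k] * site_likelihoods[0][k] for k in range(n_states)]
    log_lik = 0.0
    for i in range(len(site_likelihoods)):
        if i > 0:
            alpha = [
                sum(alpha[j] * transition[j][k] for j in range(n_states))
                * site_likelihoods[i][k]
                for k in range(n_states)
            ]
        # Rescale at every position to avoid numerical underflow on long proteins.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik
```

Maximizing this quantity over the regime parameters and the transition matrix is one way the model could be trained; other schemes (for example sampling the regimes within MCMC) would use the same recursion.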
Future applications and cautionary remarks: Although this project will only address a single building
block in a larger complex, this would eventually have a large set of applications:
1. It leads to a more realistic model of sequence and structure evolution
2. It leads to better alignments
3. It leads to better phylogenetic use of the data.
4. Homology modeling
It will be an issue for any data analysis that protein structures are determined with some error. In
protein databases some indication of the size of this error is given by the so-called B-factors, but in
the angle representation there is no direct indicator of the level of error. These errors are the
analogue of sequencing errors for DNA (which can now be brought below 10^-5), but they are very large
and will have to be accounted for in a model.
The illustration: The observation error is illustrated in the left tree as an elongation of the edges
leading to the observed sequence-structure.
[Figure: four panels, numbered 1 to 4 from left to right]
Number one (far left): the case Hamelryck has addressed (the red wiggle or pig's tail is a protein).
Number two: what we have addressed with the pairwise evolutionary model.
Number three: what you are expected to address in this project. The same, but generalized to a
phylogeny and with some structures missing.
Number four: in principle all subfigures should have an extra component added to the leaf edge, since
the structure is observed with a lot of error.
References:
Challis, C. J., & Schmidler, S. C. (2012). A stochastic evolutionary model for protein structure alignment and phylogeny. Molecular Biology and Evolution, 29(11), 3575–87.
Hamelryck, Mardia and Ferkinghoff-Borg (eds.) (2012). Bayesian Methods in Structural Bioinformatics. Springer.
Hamelryck et al. (2006).
Herman, Challis, Novák, Hein and Schmidler (2014). Simultaneous Bayesian Estimation of Alignment and Phylogeny under a Joint Model of Protein Sequence and Structure.
McLachlan, A. D. (1979). Gene duplications in the structural evolution of chymotrypsin. J. Mol. Biol. 128(1), 49-79.
Rohlfs, Harrison and Nielsen (2013). Modeling gene expression evolution with an extended Ornstein-Uhlenbeck process accounting for within-species variation. Mol. Biol. Evol.
Taylor, W. (2006). Decoy Models for Protein Structure Comparison Score Normalisation. J. Mol. Biol. 357, 676–699.
Acknowledgements: JH taught a class in the autumn of 2014 at SAMSI, where we had a study group on the evolution of protein structures. We read key papers, which led to many ideas. Illustrations
are from Hamelryck, figures 10.2 and 10.3.
Oxford Work
This means work supervised or initiated by Jotun Hein. A lot of
interesting work is going on in Oxford that I will not mention here. It might be included in
future editions of an enlarged project description.
Sequences are the major data type in the present biomedical revolution, and
models of sequence evolution have been central to it. Key concepts in these
models are homology and evolution. However, there are other data types where
exactly the same considerations apply. The most obvious ones are networks and
structures, so it is tempting to analyse these in the same way as sequences.
However, they are very different: they are much noisier, there are stronger correlations between
different parts (making them harder to model, but also more interesting) and they are closer to
function than sequences. There is the same dichotomy between optimisation-based and stochastic
methods, where the latter were almost unexplored until after 2010, while the
former go back to at least 1972. So it was natural to try to make a proper
stochastic model. This project got started in the most unlikely way: a person
committed suicide by throwing himself in front of the train going between
Stansted and London. Since the trains had been cancelled, I ended up sitting at the
train station next to a protein chemist I knew of, Richard Goldstein. We
discussed this, and he said he had some funds he could put into it. On June
30th the groups of Richard Goldstein and Willie Taylor came to Oxford and we
all gave lectures from our different areas of expertise. Our initial model took two proteins, defined a
large set of stepping stones in protein space “between” the given protein
structures and then defined rates between these stepping stones. It would
quickly need a very large number of stepping stones, but was perfectly doable
for quite similar structures, which are the ones one wants to analyse
anyway.
In the autumn of 2014 I did a sabbatical at SAMSI [Statistical and Applied
Mathematical Sciences Institute, North Carolina] and gave a course in which there were 4
study groups. One [Jeremy Ash, Gary Larson, Michael Golden] was on models
of protein structure evolution, and we went through 1-2 key papers each week. In the beginning Ziheng Yang,
Richard Goldstein, Scott Schmidler and David Pollock were also present.
The most promising stochastic model of structure evolution was that of Challis
and Schmidler from 2012, but it had the unfortunate assumption that each amino acid evolves in space
independently of all others. Joe Herman had extended this model from being a
model on pairs of homologous structures to a set of homologous structures
related by a phylogeny. Any protein would immediately dissolve if it
evolved according to this process, since there was no concept of molecular
bonds in the model. Andrew McLachlan had already, around 1980, modelled a
random protein as a random walk in space, so it was a natural idea to
model structure evolution by letting the angles [phi/psi] between amino
acids change over time. We immediately had a schism between the North
Carolina people and the Oxford people: the former wanted to have 3D
Gaussian distributions describing the relationship between neighboring
amino acids, while the latter argued that bond lengths were known
and one just had to describe the angles. When Jotun Hein and Michael
Golden returned to Oxford they started a study group on the recently
published “Bayesian Methods in Structural Bioinformatics”, which was
extremely rewarding and included chapters on distributions on angle pairs
by Kanti Mardia, who was an external professor in Oxford. Thomas
Hamelryck and colleagues already had advanced models for protein
secondary structure and angles for large protein families, but not including evolution. This seemed like the ideal
collaboration, and Michael Golden switched entirely to work on this.
This is actually an extremely advanced project that forced us to read up on things like diffusions on a torus, the
reference ratio method and copulas, and that would involve a lot of programming. We started as simply as possible
with an MSc student, Ian Lim, doing a 3-month project on the circle
before we took the step to diffusions on the torus.
However, it was also extremely interesting. It has already been
shown that phylogeny inference becomes better if you include
structure information, but investigating this on a large scale could be very
valuable. It could well be that the 3% of the genome that is protein-coding
sequence has as much phylogenetic information content as the remaining
97%. A combined structure-sequence evolution model would also allow testing of co-linearity of structure and
sequence evolution, and of how structural conservedness correlates with selection strength on an amino acid.
Another exciting possibility is to use this for homology modelling: you know the sequence and structure
of one protein, but only the sequence of another protein, and want to predict its structure. The
present model would define all the necessary distributions in a proper statistical framework.
Christian Ravn has plotted statistics such as observed angle-pair evolution for pairs
of aligned proteins in databases, and it is obvious that there is a nice positive
correlation between divergence time and how much the angles have changed.
The present model is a local model (it describes the angle change), but it is
most likely bad at the global level. It is known from previous work by, for instance, David Baker
that such models will predict bad protein structures. However, there are ways to deal with this.
One is the reference-ratio method, also described in the book Bayesian Methods in Structural
Bioinformatics. You need distributions on global statistics, like the radius of
gyration (a measure of extendedness), which typically will be too large if you only
use a local model. It is then possible to sample from the local model in a way
that gives the right global statistics. This sounds as if it will just solve our problem, but it has
also created problems: how do we do this not for a single global model but for a complete
evolutionary trajectory?
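For reference, the radius of gyration mentioned here can be computed directly from coordinates. The sketch below assumes one point per residue with equal weights (no atomic masses), which is a simplification chosen for illustration; the function name is hypothetical.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid (equal weights assumed)."""
    n = len(coords)
    centroid = [sum(p[d] for p in coords) / n for d in range(3)]
    mean_sq = sum(
        sum((p[d] - centroid[d]) ** 2 for d in range(3)) for p in coords
    ) / n
    return math.sqrt(mean_sq)
```

A fully extended chain gives a larger value than a compact arrangement of the same number of points, which is why this statistic is a natural check on whether structures sampled from a purely local model are too extended.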
We also had two projects at the Oxford Summer School in Computational Biology (OSSCB):
• [Jan Domanski, Erik Sjöland, Anna FitzMaurice] End-point conditioned path sampling in the Goldstein-Hein-Taylor model. This was implemented in an animator, which I loved but Joe Herman hated. It produced a stochastic morphing of one protein structure into another, assumed homologous, protein structure.
• [Alex Cumberworth, Alexandra Grigore, Nynke Niezink] Worked on an explicit model of correlated positions in protein structures. That the positions in a protein co-evolve is an old idea, but only after 2011 were there enough data for this idea to become useful. It has in fact been very useful and has improved computational structure prediction a lot. One of the first explicit models of this was by Pollock, Goldman and Taylor from 1999, but there was too little data, so it was only applied convincingly to simulated data. The success stories after 2011 all used Mutual Information and not an explicit model of co-evolution, which is what this group worked on.
Marton Munz did a complete DPhil (co-supervised by Phil Biggin) on comparing the motion of protein structures. In principle you need an evolutionary model of movements, which has a large state space: instead of 4 nucleotides, you might have 100 3D positions observed at 10^9 time points. Marton summarised a very long trajectory by the correlation of the movement of pairs of positions. Each protein of length L would thus be associated with an L times L matrix, and the movements of two proteins would be compared by aligning their matrices. This is much harder than aligning strings. The method worked for pairs of proteins up to about 100 amino acids long. Clearly we were eager to extend it to several sequences and to longer sequences. Longer sequences would have needed a local alignment algorithm for matrices.
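To illustrate how a long trajectory can be summarised as an L times L matrix, here is a minimal sketch of a normalised cross-correlation of displacements between pairs of positions. This particular formula is an assumption chosen for illustration and is not necessarily the summary used in the DPhil work; the function name and the `trajectory[frame][position]` layout are hypothetical.

```python
import math

def cross_correlation_matrix(trajectory):
    """Return the L x L matrix of correlations between the movements of pairs of positions.

    trajectory[t][i] is the (x, y, z) coordinate of position i in frame t.
    Every position is assumed to move at least a little (non-zero variance).
    """
    n_frames = len(trajectory)
    length = len(trajectory[0])
    mean = [
        [sum(frame[i][d] for frame in trajectory) / n_frames for d in range(3)]
        for i in range(length)
    ]
    # Displacement of every position from its time-averaged location, per frame.
    disp = [
        [[frame[i][d] - mean[i][d] for d in range(3)] for i in range(length)]
        for frame in trajectory
    ]

    def cov(i, j):
        return sum(
            sum(f[i][d] * f[j][d] for d in range(3)) for f in disp
        ) / n_frames

    var = [cov(i, i) for i in range(length)]
    return [
        [cov(i, j) / math.sqrt(var[i] * var[j]) for j in range(length)]
        for i in range(length)
    ]
```

Entries near 1 mean two positions move together, entries near -1 mean they move in opposite directions; comparing two proteins then becomes a matter of aligning two such matrices.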