Protein Structure
What We Are Going to Cover
• Secondary Structure Prediction
– Chou-Fasman rules for individual amino acids
– Nearest neighbor approaches
– Machine learning techniques
• Neural networks
– Physical properties vs. statistics
– Measuring prediction quality
• Tertiary structure prediction
– Force fields
– Threading
– Homology modeling
Background
• In the 1950s Christian Anfinsen showed that pancreatic RNase could refold itself into its active configuration after denaturation, without any external guidance. This, and many confirming experiments on other proteins, has led to the general belief that the amino acid sequence of a protein contains all the information needed to fold itself properly, without any additional energy input.
– For most proteins this process is very rapid, on the order of milliseconds.
– The primary driving forces are the need for hydrophobic side chains to minimize their interactions with water, and the formation of hydrogen bonds.
• Chaperone proteins assist in the folding of some proteins, and in the re-folding of mis-folded or aggregated proteins. Mis-folding is often a consequence of elevated temperatures, and chaperone proteins were first discovered because they were highly expressed during heat shock in Drosophila.
• Prions are normal cellular proteins (PrPC) that can fold into an abnormal configuration (PrPSc) that causes other normal PrPC proteins to refold into the bad configuration, causing large aggregates of tightly packed beta sheets to form.
– If refolding is spontaneous, is the bad form a lower-energy configuration than the good form?
PrPC (left) and PrPSc (right)
Secondary Structure
• Linus Pauling defined the two main protein secondary structures in the 1950s: alpha helix and beta sheet.
– There are other related structures: for instance, the alpha helix has hydrogen bonds between the backbone –NH group of one alpha-carbon and the backbone C=O group of the alpha-carbon 4 residues earlier in the chain (i+4 -> i). There are also the 3₁₀ helix (i+3 -> i) and the π helix (i+5 -> i).
– The Dictionary of Protein Secondary Structure (DSSP) defines 8 states: the 4 mentioned above plus several forms of turn, plus "everything else", called random coil.
– Real proteins have these structures bent and distorted, not laid out in theoretical perfection.
– Most prediction programs classify every amino acid in a chain into just 3 states: alpha helix, beta sheet, or random coil.
– These predictions can then be compared to X-ray crystallography results.
Assessing Prediction Accuracy
• The first problem is classification: given a solved X-ray structure of a protein, how do you classify its amino acids into alpha-helix, beta-sheet, etc.?
– Three separate programs, DSSP, STRIDE, and DEFINE, all use different definitions, and thus come up with somewhat different results, mostly at the ends of the structures.
– The classifications use hydrogen-bonding patterns, dihedral angles (the φ and ψ angles of the alpha carbons in the backbone), and interatomic distances compared with ideal structures.
• A second issue: the source of protein structures to use for training. You want them to be unrelated to each other and to the sequences you will use for testing, a structural "nr" data set.
– The Protein Data Bank (PDB) has a set called PDB_SELECT that does this. About 4000 chains (late 2008), all with less than 25% sequence identity and high-quality X-ray crystallography resolution (< 3.5 Angstroms).
• CASP (Critical Assessment of Structure Prediction) is a series of contests where the sequences of a group of proteins whose structures have recently been determined but not published are released to anyone who wants to try predicting their structures. After a few months, they have a meeting and score everyone's results. Currently, the CASP8 contest has just ended.
More on Prediction Accuracy
• Q3 is the proportion of amino acids in a test dataset whose predicted state (alpha-helix, beta-strand, or random coil) matches its actual state.
– Note that you can't use sequences from the training dataset for this.
– Given equal distributions of the 3 states, you would expect random guessing to give a Q3 score of 33%. A test using random guesses with an actual dataset gave 38%.
– Also, given the variations in automatically assigning types to known structures, scores above 85% are unlikely.
• Sov is a measure of how individual segments match that tries to avoid variation in end-of-segment predictions. It measures the percentage of times that there is an overlap between observed segments and predicted segments. It works on the individual protein or fold level, as opposed to Q3, which measures performance over a whole database.
– Similar Q3 scores are not always meaningful:

                          Sov     Q3
OBS:    CHHHHHHHHHHC
PRED1:  CHCHCHCHCHCC     12.5    58.3
PRED2:  CCCHHHHHCCCC     63.2    58.3
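The Q3 calculation is simple enough to sketch directly. A minimal Python version, assuming the three states are encoded as single characters (H, E, C), reproduces the 58.3 score for both example predictions:

```python
def q3(observed: str, predicted: str) -> float:
    """Percent of residues whose predicted 3-state label matches the
    observed label (H = helix, E = strand, C = coil)."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must be the same length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)

# Both example predictions get the same Q3 despite very different segments.
obs = "CHHHHHHHHHHC"
print(round(q3(obs, "CHCHCHCHCHCC"), 1))  # 58.3
print(round(q3(obs, "CCCHHHHHCCCC"), 1))  # 58.3
```

This is exactly why Sov was introduced: a per-residue score cannot distinguish a single well-placed helix from helix/coil chatter.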
Chou-Fasman Rules
• Gives the propensity of each amino acid to form or break alpha helices and beta strands.
– Originally developed in the 1970s from a very small set of proteins (15!).
– Originally just a qualitative measure: "helix forming", "indifferent", "helix-breaking", etc.
– It has been made quantitative and extended to 14 structures, which involved some fairly large changes in parameter values.
– This method is just for individual amino acids in isolation, ignoring the neighborhood.
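As a rough illustration of the quantitative form, here is a minimal sketch using a handful of commonly cited helix propensities; the subset of values, and the cutoffs separating "former" from "breaker", are illustrative rather than the full published parameter set:

```python
# A few helix propensities P(alpha) from the commonly cited Chou-Fasman
# table (illustrative subset; the full table covers all 20 amino acids).
HELIX_PROPENSITY = {
    "E": 1.51, "M": 1.45, "A": 1.42,  # strong helix formers
    "L": 1.21,
    "P": 0.57, "G": 0.57,             # helix breakers
}

def helix_class(aa: str) -> str:
    """Qualitative Chou-Fasman-style label for one residue in isolation.
    The 1.1 / 0.9 thresholds are illustrative choices."""
    p = HELIX_PROPENSITY[aa]
    if p > 1.1:
        return "former"
    if p < 0.9:
        return "breaker"
    return "indifferent"

print(helix_class("E"))  # former
print(helix_class("P"))  # breaker
```

Note that the classification looks at one residue at a time, which is exactly the limitation the windowed methods below address.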
Extensions to Chou-Fasman
• The GOR method uses a sliding window of 17 amino acids (8 before and 8 after the residue being predicted) with rules based on known sequences to predict each amino acid's state.
– Based on self-information (each individual residue's propensity for each type of secondary structure: approximately what the Chou-Fasman rules are), directional information (how each other amino acid in the window affects the current residue regardless of what the current residue is), and pair information (how each other amino acid in the window affects the current residue considering what type of amino acid it is).
– Several improved versions consider pairs and triplets of amino acids, and increase the number of sequences analyzed to produce the statistics.
Directional information: effects of the alpha-helix breaker proline (top) and the non-breaker methionine (bottom) 5 residues downstream from residue j, whose type is not specified.
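A minimal sketch of the window machinery, assuming hypothetical self-information and directional-information tables (real GOR derives these values statistically from databases of known structures; pair information is omitted here):

```python
WINDOW = 8  # residues on each side -> 17-residue window

def window_at(seq: str, i: int, pad: str = "-") -> str:
    """17-residue window centered on position i, padded at the termini."""
    padded = pad * WINDOW + seq + pad * WINDOW
    return padded[i : i + 2 * WINDOW + 1]

def gor_score(seq, i, self_info, directional_info):
    """Self-information of the central residue plus a directional
    contribution from each neighbor at offset d (-8..8, d != 0).
    Both tables here are caller-supplied toy stand-ins."""
    total = self_info.get(seq[i], 0.0)
    for d in range(-WINDOW, WINDOW + 1):
        if d == 0 or not (0 <= i + d < len(seq)):
            continue
        total += directional_info.get((seq[i + d], d), 0.0)
    return total

print(window_at("ACDEFGHIKLMNPQRSTVWY", 0))  # --------ACDEFGHIK
```

In a full implementation one such score is computed per secondary-structure state, and the highest-scoring state is predicted.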
Use of Multiple Alignments
• It seems obvious that homologous proteins will have slightly different sequences with approximately the same secondary structure.
– This allows a better understanding of which residues are important for the different structures. The predictions for each homologous region can be averaged.
– Alpha helices and beta strands are less likely to tolerate insertions and deletions than random coil.
• The Zpred program is a modification of the GOR program that uses multiple alignment information at each position in the aligned sequences to improve prediction.
– Start by predicting each sequence separately as in GOR, then average the predictions.
– Zpred also encodes the amino acid properties Venn diagram we have seen before.
– The Zvelibil conservation number is 1.0 if all homologous residues at a given position are identical, 0.9 if they are not identical but all within the same set in the Venn diagram, and lesser values for amino acids in different groups. A value of 0 is given if any sequence has a gap.
• A modification of this, the nearest neighbor approach, finds the best set of matching sequences (homologous or not) for just the region of the sliding window. The idea is that similar sequences will share the same fold even if the rest of the protein is different.
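The conservation number can be sketched as follows; the property sets below are small stand-ins for the full amino acid properties Venn diagram, and the 0.5 returned for mixed groups is one illustrative "lesser value":

```python
# Toy stand-ins for the amino acid properties Venn diagram sets.
PROPERTY_SETS = [
    set("ILVAM"),   # hydrophobic (illustrative grouping)
    set("DEKR"),    # charged
    set("ST"),      # small polar
]

def conservation(column: str) -> float:
    """Zvelibil-style conservation number for one alignment column."""
    if "-" in column:          # any gap -> 0
        return 0.0
    if len(set(column)) == 1:  # all residues identical -> 1.0
        return 1.0
    for s in PROPERTY_SETS:    # all within one property set -> 0.9
        if set(column) <= s:
            return 0.9
    return 0.5                 # mixed groups: a lesser value (illustrative)

print(conservation("LLLL"))  # 1.0
print(conservation("ILVM"))  # 0.9
print(conservation("IL-V"))  # 0.0
```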
Neural Networks
• Neural networks are a common bioinformatics tool for machine learning and optimization.
– Based on neurobiology observations, primarily the retina of the eye.
– Often used in structure prediction, among other applications.
– Machine learning: automatic procedures to fit model parameters to a training dataset.
• Layers of nodes, with each node connected to all the nodes in the next layer.
– The top layer is the input, a model of the sequence being analyzed.
– The bottom layer is the output: generally 3 nodes, one for alpha-helix, one for beta-strand, and one for random coil.
– A model with just an input and an output layer is called a perceptron.
– Usually there are one or more (and sometimes lots more, like 80) hidden layers that allow interactions among the inputs.
– Some networks allow feedback between layers.
More Neural Network
• Inputs. The most common way to code neural network input is to have one node for each type of amino acid (and often an additional one for a gap), multiplied by a node for each position in the sliding window.
– Thus, for a 13-residue window, the net would have 21 x 13 = 273 input nodes.
– Also, a few extra inputs encode things like sequence length and distance from the N- and C-termini.
– Each input node "fires" either a 0 or a 1 output.
• Signal processing. Each node sends out the same signal to all the nodes in the next layer.
– The receiving node weights all of its inputs differently.
– The receiving node also can add in a bias factor to affect how it will respond.
– The node then adds the weighted input signals plus the bias and runs them through a response function (or transfer function) to determine its output.
– Usually all nodes (at least at a given layer) have the same transfer function.
– The output can be 0/1 or some numerical value.
Weights applied to the possible outputs from the input layer at a hidden node. Black is positive, red is negative for alpha helix.
Some possible transfer functions.
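The input encoding and a single receiving node can be sketched in a few lines (the window content and weights below are placeholders; a real net carries 273 trained weights into every hidden node):

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus a gap symbol

def encode_window(window: str) -> list:
    """One-hot input encoding: one node per symbol per window position.
    A 13-residue window gives 21 x 13 = 273 inputs, each firing 0 or 1."""
    vec = []
    for aa in window:
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(aa)] = 1
        vec.extend(one_hot)
    return vec

def node_output(inputs, weights, bias):
    """One receiving node: weighted sum of inputs plus a bias, passed
    through a sigmoid transfer function."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-s))

x = encode_window("ACDEFGHIKLMNP")  # a 13-residue window
print(len(x), sum(x))               # 273 input nodes, 13 of them firing
```

The sigmoid is one of the "possible transfer functions" mentioned above; step and linear functions are also used.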
Still More Neural Network
• Parameterization of a neural network is a big issue. Essentially it comes down to assigning weights for the inputs to each node and assigning bias factors. This assumes a constant transfer function.
– Most of the weighting comes from the input node array.
– It's a matter of using a training set where you know the input sequence and the secondary structure at each residue. You start with randomly assigned parameters and adjust them using an optimization procedure until the outputs match the known results or until no further improvement happens.
• It is quite common to feed the output of one neural network into another network.
– This allows predictions for one residue to influence the prediction for neighboring residues.
– Best done with numerical outputs as opposed to 0/1. For example, the predictions for alpha helix, beta strand and random coil are better reported as (0.76, 0.55, 0.37) instead of (1, 0, 0).
– It is also useful for allowing homologous sequences to influence each other: first run the predictions for each sequence separately, then use the second neural net to combine them.
– The first net is a sequence-to-structure net, and the second one is a structure-to-structure net.
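A sketch of how the structure-to-structure net's input might be assembled from the first net's numerical outputs; the 7-residue window size here is an illustrative choice:

```python
def second_net_input(first_outputs, i, half_window=3):
    """Input vector for a structure-to-structure net: the (helix, strand,
    coil) numerical outputs of the first net for a window of residues
    around position i. Positions past the termini are zero-padded."""
    vec = []
    for d in range(-half_window, half_window + 1):
        j = i + d
        if 0 <= j < len(first_outputs):
            vec.extend(first_outputs[j])
        else:
            vec.extend((0.0, 0.0, 0.0))
    return vec

# First-net outputs for a 5-residue toy sequence, as (H, E, C) triples.
outs = [(0.76, 0.55, 0.37)] * 5
print(len(second_net_input(outs, 2)))  # 7 positions x 3 states = 21
```

Because the inputs are graded values rather than 0/1 calls, the second net can smooth out isolated mispredictions in the middle of a helix or strand.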
Tertiary Structure
• Are we going to have time to say anything meaningful about this subject?
– It is the cutting edge, the hardest unsolved bioinformatics problem.
– It is the basis of rational drug design.
• Ab initio (from first principles) structure prediction: just using the sequence and known physical and chemical properties, predict the protein's final 3-D structure. This is very difficult and not wildly successful.
• More commonly used are homology modeling and threading.
– In homology modeling, an unknown protein's structure is modeled in comparison to the known structure of a homologous protein.
– For threading, no homolog is used, but instead the protein's sequence is matched to a library of known folds.
Plots of potential energy (y-axis) and root-mean-square deviation of atomic coordinates from their actual position (x-axis) for a large number of ab initio structure prediction runs. The red arrow points to the best prediction. The left one, E. coli RecA protein, worked well, but the right one, human Fyn tyrosine kinase, did not.
Force Fields
• Based on the Anfinsen RNase experiments and many others like it, it is thought that proteins fold into the lowest free-energy state.
– The problem is, the energy landscape is like a mountain range: there are vast numbers of possible folding states with many local minima separated by high peaks.
– "Free energy" has both an entropy term and an enthalpy term: ΔG = ΔH − TΔS. Enthalpy is the energy stored in covalent bonds and non-covalent atomic interactions. Entropy is hard to calculate, so all this work is done using enthalpy, or potential energy.
• Various forces affect protein folding: covalent bonds, hydrogen bonds, polar and ionic interactions, van der Waals forces, and solvent interaction.
– These forces can be described by a set of equations, collectively known as the "force field".
– Solvent interactions are troublesome because they involve large numbers of independently-moving water molecules. Sometimes these are dealt with by using a statistical distribution rather than trying to model individual solvent molecules.
• Covalent bonds are most easily described using internal coordinates: bond lengths and angles. The potential energy of the bonds is easily described in these terms.
• On the other hand, non-covalent forces are more easily described using external coordinates: (x, y, z) positions. This is because the forces involved vary with the distance separating the atoms.
– Thus the final force field needs to combine the two coordinate systems.
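As a concrete example of a distance-dependent, non-covalent term, the Lennard-Jones 12-6 potential is a standard way force fields model van der Waals interactions (the epsilon and sigma parameters below are placeholders; real force fields tabulate them per atom-type pair):

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones 12-6 potential, a standard van der Waals term:
    a steep repulsive r^-12 wall plus an attractive r^-6 tail.
    epsilon (well depth) and sigma (contact distance) are illustrative."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

# The minimum sits at r = 2^(1/6) * sigma, with energy -epsilon.
r_min = 2 ** (1 / 6)
print(round(lennard_jones(r_min), 6))  # -1.0
```

Because the term depends only on the distance between two atoms, it is naturally expressed in external (x, y, z) coordinates, unlike bond-length and bond-angle terms.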
Threading
• When there is no homolog that has a known structure, the protein sequence can be compared to a library of protein folds.
– This is based on the theory that there are a limited number of possible folds, perhaps around 2000. The 35,000 protein structures in PDB (when the book was written) can be described in terms of 750-1500 folds.
• One type of threading method uses position-specific scoring matrices to describe the folds.
– From PSI-BLAST.
– The matrices are asymmetric, since you are comparing a known to an unknown, rather than comparing two unknowns. For example, the penalty for substituting Lys for Arg is different from that for an Arg-to-Lys substitution.
– This can also be done by using a set of environment-specific scoring matrices for the various positions, similar to BLOSUM matrices.
• The other common threading method is to explicitly calculate the potential energy of the sequence when forced into the fold.
• Dynamic programming methods make it possible to ignore wildly improbable configurations.
These structures are the SH3 domain of dihydrofolate reductase and a kinase. They have only 14.5% identical amino acids.
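Scoring a sequence against a fold described by a position-specific scoring matrix reduces to summing, at each position, the matrix entry for the residue actually observed there, as in this sketch (the profile values are made up, not real PSI-BLAST output):

```python
def pssm_score(sequence, pssm):
    """Score a sequence against a fold profile: sum of each position's
    score for the residue observed there; unlisted residues score 0."""
    return sum(column.get(aa, 0.0) for aa, column in zip(sequence, pssm))

# Toy 3-position profile (illustrative values). Note the asymmetry the
# slide describes: K at a position that favors R scores differently than
# R at a position that favors K.
pssm = [
    {"R": 5.0, "K": 3.0},
    {"L": 4.0, "I": 3.5},
    {"D": 4.5, "E": 4.0},
]
print(pssm_score("RLD", pssm))  # 13.5
print(pssm_score("KLE", pssm))  # 11.0
```

With gaps allowed, the alignment of sequence to profile is found by dynamic programming, which is what lets improbable configurations be skipped.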
Homology Modeling
• When your sequence has a homologue with a known structure, it is possible to build a model using the known structure as a template.
– Originally done with wires and plastic balls!
• The more similar your protein is to the known structure, the better it works.
Dependence of modeling accuracy on sequence identity. Red ones are ab initio predictions.
Sperm whale myoglobin model.
More Homology Modeling
• Start by finding homologues with structures in PDB, using a BLAST search.
• Then do a careful hand-refined alignment.
• Then fit the highly conserved regions.
• Model insertions as loops.
• Non-identical amino acids are predicted with rotamer libraries.
– Side chains can generally only take on a few configurations, based on "exhaustive" searches of known structures.
• After the preliminary model is built, refinements can be made to minimize energy and fit things together as best as possible.
Rotamer libraries for tyrosine (left) and phenylalanine (right). Each has 2 main positions with variants.
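Rotamer selection can be sketched as picking the library conformation that minimizes some caller-supplied clash or energy score; the chi1 angles below are just the three canonical staggered positions, not a real library such as the ones built from exhaustive structure searches:

```python
# Toy rotamer library: a few side-chain chi1 angles (degrees) per residue
# type. Real libraries store several chi angles plus observed frequencies.
ROTAMERS = {
    "TYR": [-60.0, 60.0, 180.0],
    "PHE": [-60.0, 60.0, 180.0],
}

def best_rotamer(residue, clash_score):
    """Pick the library rotamer minimizing a caller-supplied clash/energy
    score: the usual selection step when placing a non-identical side
    chain onto the template backbone."""
    return min(ROTAMERS[residue], key=clash_score)

# E.g., with a score that favors angles near -60 degrees:
print(best_rotamer("TYR", lambda chi: abs(chi + 60.0)))  # -60.0
```

In a real modeling run the score would come from the force-field terms discussed earlier, evaluated against the rest of the preliminary model.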