Protein Structure: Secondary Structure Prediction

Protein Folding Problem
A protein (which is a linear sequence of amino acids) folds into a
unique 3D structure: determine this structure
Lysozyme sequence:
KVFGRCELAA
RGYSLGNWVC
QATNRNTDGS
RWWCNDGRTP
SALLSSDITA
DGNGMNAWVA
AMKRHGLDNY
AAKFESNFNT
TDYGILQINS
GSRNLCNIPC
SVNCAKKIVS
WRNRCKGTDV
QAWIRGCRL
Secondary structure prediction
• Given a protein sequence (primary structure), for example:
  GHWIAT
  HWIATRVGQLIREAYEDY
  GQLIREAYEDYRHFSSECP
  SSECPFIP
• Predict its secondary structure content (C = coil, H = alpha helix, E = beta strand):
  CEEEEE
  EEEEECCHHHHHHHHHHH
  HHHHHHHHHHHCCCHHHHH
  HHHHHCCC

Why Secondary Structure Prediction?
• Easier problem than 3D structure prediction (more than 40 years of history).
• Accurate secondary structure prediction can provide important information for tertiary structure prediction.
• Improving sequence alignment accuracy.
• Protein function prediction
  - Secondary structure alignment to proteins with known function?
• Protein classification
• Fold prediction
Benchmarks
• EVA: EValuation of Automatic protein
structure prediction
• http://www.rostlab.org/eva/
• Targets: no pair in subset has more than 33%
identical residues over more than 100
residues aligned.
• Updated every six months
• Latest release: June 2006, 3477 chains
• CASP: Critical Assessment of Techniques for
Protein Structure Prediction
• Biennial contest on prediction of protein structures
• CASP7 meeting will be held in California in November 2006
• CASP6 and CASP7 did not have a secondary structure prediction competition
Standard of truth
• PDB -> secondary structure
• Observed secondary structure is taken primarily from the program DSSP. However, EVA will additionally use STRIDE.
• Conversion of DSSP secondary structure: the following chart illustrates how the 8 states from DSSP are converted to three secondary structure states:

  DSSP:  H  G  I  E  B  T  S  ' '
  used:  H  H  H  E  E  L  L  L
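As a small illustration, the chart above can be implemented as a lookup table; this is a sketch of the 8-to-3 reduction, not EVA's exact code (the blank DSSP state is written as a space here):

```python
# Map the 8 DSSP states to the 3 states used for evaluation (chart above).
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # helix-like states -> H
    "E": "E", "B": "E",             # strand / bridge -> E
    "T": "L", "S": "L", " ": "L",   # turn, bend, blank -> loop (L)
}

def dssp_to_three_state(dssp_string: str) -> str:
    """Convert a per-residue DSSP string to the H/E/L alphabet."""
    return "".join(DSSP_TO_3.get(state, "L") for state in dssp_string)

print(dssp_to_three_state("HHHGGT EEB"))  # -> HHHHHLLEEE
```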
Accuracy measures
• Q3
  - Q3helix: number of correctly assigned helices / total number of true helical residues
• Specificity
  - SPhelix: number of correctly assigned helices / total number of residues predicted to be helices
Accuracy measures
• Q3: number of correctly assigned residues / total number of residues
  - Q3helix, Q3strand, Q3loop
• Matthews Correlation Coefficient (MCC):

  MCC = (TP·TN - FP·FN) / sqrt((TN+FN)(TN+FP)(TP+FN)(TP+FP))
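A minimal sketch of these two measures over aligned predicted/observed 3-state strings; Q3 here is the overall per-residue accuracy, and the MCC is computed per state from the TP/TN/FP/FN counts of the formula above (the example strings are made up):

```python
from math import sqrt

def q3(pred: str, obs: str) -> float:
    """Fraction of residues whose 3-state label is predicted correctly."""
    assert len(pred) == len(obs)
    return sum(p == o for p, o in zip(pred, obs)) / len(obs)

def mcc(pred: str, obs: str, state: str) -> float:
    """Matthews Correlation Coefficient for one state (e.g. 'H')."""
    tp = sum(p == state and o == state for p, o in zip(pred, obs))
    tn = sum(p != state and o != state for p, o in zip(pred, obs))
    fp = sum(p == state and o != state for p, o in zip(pred, obs))
    fn = sum(p != state and o == state for p, o in zip(pred, obs))
    denom = sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    return (tp * tn - fp * fn) / denom if denom else 0.0

obs  = "HHHHLLEEEELL"
pred = "HHHLLLEEEHLL"
print(q3(pred, obs), mcc(pred, obs, "H"))
```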
Performance of current methods
• Results on the EVA benchmark (Q3 scores).
• Only proteins with no significant sequence identity to previously known PDB proteins are reported.
  Method        Common set 1   Common set 2   Common set 5
  PSIpred           76.8           77.4           77.3
  PROFsec           75.5           76.2           76.4
  PHDpsi            73.4           74.3           74.3
  PROF king         71.6           71.7            -
  Prospect          71.1            -              -
  SAM-T99sec        77.2           77.2           77.1
Theoretical limit
• ~88%-90%
• Because of differences even in the experimental methods (X-ray vs. NMR)
• Still a 10%-15% gap to fill :)
• The first paper addresses the question:
  - Why can't the current techniques reach that theoretical limit?
  - Short answer: because current techniques do not consider long-range interactions between amino acids
Reading
• "The effect of long-range interactions on the secondary structure formation of proteins" by D. Kihara (Purdue University), Protein Science, 2005.
• Correlation of prediction accuracy with the Residue Contact Order (RCO) information in 2777 proteins.
Residue Contact Order
• The RCO of residue i is defined in terms of n, the total number of contacts for residue i, and δij, which is 1 when residue i is in contact with residue j (otherwise it is 0).
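The slide's equation is not reproduced here; based on the definitions of n and δij above, a plausible form is the following sketch (not necessarily Kihara's exact normalization; the relative version additionally divides by the sequence length):

  \mathrm{RCO}_i \;=\; \frac{1}{n} \sum_{j} \delta_{ij}\, |i - j|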
Prediction accuracy
• On the analyzed 2777 proteins:
  - Prediction accuracy vs. protein length
  - Accuracy vs. Relative Contact Order (RCO)
A more quantitative evaluation of correlation

Reading
• "A new representation of protein secondary structure prediction based on frequent patterns" by Birzele and Kramer (Germany), Bioinformatics, August 29, 2006.
• Use variable-length patterns to define features for amino acids that can be used to predict secondary structures.
• It is a window-less approach.
  - Does it mean that it takes long-range interactions into account?
  - If so, why is it not better than PSIpred?
Algorithms to find frequent patterns
• A level-wise approach:
  - Starting from patterns of length 1, look for frequently occurring patterns of increasing length.
• To relax the exact matching requirement, they extend the regular alphabet by defining amino acid groups (they use 10 groups in the paper).
• Level-wise extension idea:
  - If a pattern is frequent, then all of its subsequences should be frequent. So, if A is a candidate pattern of length n, A[1:n-1] and A[2:n] should already be frequent patterns.
Frequent pattern finding algorithm
• Check if the candidates really occur frequently.
• Generate candidates for the next level by extending the current frequent patterns.
• Example:
  - Frequent patterns: ACDEF, ACDEG, ACDEH, CACDE
  - Next-level candidates: CACDEF, CACDEG, CACDEH
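A minimal sketch of this level-wise search over a toy set of sequences, assuming plain exact matching (the amino-acid-group alphabet from the paper is omitted) and an illustrative minimum-support threshold:

```python
from collections import Counter

def frequent_patterns(seqs, min_support, max_len=8):
    """Level-wise search: a length-k pattern is only a candidate if both of
    its length-(k-1) subpatterns were frequent at the previous level."""
    # Level 1: single residues that occur often enough.
    counts = Counter(ch for s in seqs for ch in s)
    frequent = {p for p, c in counts.items() if c >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent and k <= max_len:
        # Generate next-level candidates by overlapping two frequent patterns.
        candidates = {a + b[-1] for a in frequent for b in frequent
                      if a[1:] == b[:-1]}
        # Check if the candidates really occur frequently.
        frequent = {c for c in candidates
                    if sum(s.count(c) for s in seqs) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

# Toy run: recovers e.g. ACDEF and CACDE with support >= 2.
print(sorted(frequent_patterns(["CACDEFG", "ACDEFH", "CACDEG"], min_support=2)))
```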
Frequent patterns to features
• Frequent patterns will be used to define a set of features for a single amino acid. How?
• Feature vector for the amino acid D

Using homolog sequences
Feature Quantification
• Normalize the values among the same-length patterns.

But still...
• Still, in this form, these feature vectors cannot be used as input to an SVM:
  - What is the size of the feature vector? About 60M in the window-less approach.
  - How many days will it take to train the SVM?
• Borrow some ideas from text mining: before feeding the feature vectors to the SVM, we need to do some manual feature selection.
• Use χ2 and precision-recall binning to prune some of the features.
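A rough sketch of χ2-based pruning of binary pattern features against one secondary-structure class; the precision-recall binning mentioned above is not shown, and the 2x2 χ2 statistic below is a generic textbook version rather than the authors' exact criterion:

```python
def chi2_score(feature_on, labels, target):
    """Chi-square statistic of a binary feature against one class label,
    from the 2x2 contingency table (feature on/off vs. class yes/no)."""
    n = len(labels)
    a = sum(f and (l == target) for f, l in zip(feature_on, labels))      # on, class
    b = sum(f and (l != target) for f, l in zip(feature_on, labels))      # on, other
    c = sum((not f) and (l == target) for f, l in zip(feature_on, labels))
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def prune_features(feature_matrix, labels, target, keep):
    """Keep only the `keep` features with the highest chi-square score."""
    scored = sorted(range(len(feature_matrix)),
                    key=lambda i: chi2_score(feature_matrix[i], labels, target),
                    reverse=True)
    return scored[:keep]

# Toy example: 3 binary features over 6 residues labelled H/E/L.
labels = ["H", "H", "H", "E", "L", "L"]
features = [[1, 1, 1, 0, 0, 0],   # fires exactly on helix residues
            [1, 0, 1, 1, 0, 1],   # uninformative
            [0, 0, 0, 1, 1, 1]]   # fires only off-helix
print(prune_features(features, labels, target="H", keep=2))
```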
The SVM model
• Now you have ~15K features (pattern, pattern position) for an amino acid.
• Extracted 218,678 features of maximum length 8, with a minimum number of occurrences of 40.
• Trained on 940 proteins.
• Two-layer SVM:
  - First layer: predict H, E, C based on a central amino acid.
  - Second layer: smooth out prediction errors (e.g., HHHHCHHHH should actually be HHHHHHHHH).
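A toy illustration of what the second layer is meant to fix (the authors' smoother is itself an SVM over the first-layer outputs, not this rule): flip an isolated residue when both of its neighbours agree on a different state:

```python
def smooth(pred: str) -> str:
    """Remove isolated single-residue errors such as HHHHCHHHH -> HHHHHHHHH."""
    out = list(pred)
    for i in range(1, len(pred) - 1):
        if pred[i - 1] == pred[i + 1] != pred[i]:
            out[i] = pred[i - 1]
    return "".join(out)

print(smooth("HHHHCHHHH"))  # -> HHHHHHHHH
```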
Results
• PSI-PRED is still the best!
• This method is better only at predicting coils :)
Reading
• "Protein secondary structure prediction for a single-sequence using hidden semi-Markov models" by Z. Aydin, Y. Altunbasak, and M. Borodovsky (Georgia Tech), BMC Bioinformatics, March 2006.
• Built an improved HMM model: use different models for residues internal to a SS segment and for residues on the boundaries.
• General comments on the paper:
  - Bad structure: tables and text are too far apart.
  - Typos; no Figure 2.

Secondary Structure representation
• Two vectors: one to indicate the end of secondary structure segments and the other to indicate the secondary structure types:
  T = (L, E, L, E, L, H, L)
  S = (4, 9, 12, 16, 21, 28, 33)
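A small sketch of converting between a per-residue string and the (T, S) representation above, assuming S stores the 1-based end position of each segment (which matches the example vectors):

```python
def to_segments(ss: str):
    """Per-residue string -> (T, S): segment types and 1-based end positions."""
    T, S = [], []
    for i, state in enumerate(ss, start=1):
        if not T or state != T[-1]:
            T.append(state)
            S.append(i)
        else:
            S[-1] = i          # extend the current segment's end position
    return T, S

def to_string(T, S):
    """(T, S) -> per-residue string (inverse of to_segments)."""
    out, prev_end = [], 0
    for state, end in zip(T, S):
        out.append(state * (end - prev_end))
        prev_end = end
    return "".join(out)

ss = "LLLLEEEEELLLEEEELLLLLHHHHHHHLLLLL"
T, S = to_segments(ss)
print(T, S)                    # ['L','E','L','E','L','H','L'], [4, 9, 12, 16, 21, 28, 33]
print(to_string(T, S) == ss)   # True
```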
Bayesian Formulation
• Given a sequence R, the goal is to find the S and T vectors which maximize the a posteriori probability.
  - P(R|S,T): the likelihood
  - P(S,T): the a priori probability
  - P(R): constant for every (S,T), so it might as well be dropped for our purposes
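The equation itself is not reproduced in the transcript; from the labels above, the maximization presumably takes the standard MAP form, with R denoting the residue sequence:

  (S^*, T^*) \;=\; \arg\max_{S,T} P(S,T \mid R) \;=\; \arg\max_{S,T} \frac{P(R \mid S,T)\, P(S,T)}{P(R)}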
Computing the likelihood
• Computing P(R|S,T) is the tricky part.
• Non-local interactions are ignored (unlike what Kihara suggested).
• In order to model proximal residues and internal residues differently, they suggest a factored expression for P(R|S,T).
• One of the terms is the probability of having a type-t secondary structure segment of length l.
Computing the likelihood
• Separate models are used for residues at the N-terminus of a segment, for internal residues, and for residues at the C-terminus.

The HSMM architecture
• Too many parameters → need a lot of training data.
• Solution: get rid of some parameters (reduced dependency model).
Results
• Slightly better than PSIPRED?
  - But not the real PSIPRED: the PSIPRED that does not use any homology information.

Discussion
• How can the methods be improved?
• Can we reach the 88%-90% limit? How?
• Other papers?
• Project ideas?