Download Foldrec_2010 - Center for Biological Sequence Analysis

Protein Fold recognition
Morten Nielsen,
CBS,
Department of Systems Biology,
DTU
Objectives
• Understand the basic concepts of fold
recognition
• Learn why even sequences with very low
sequence similarity can be modeled
– Understand why %id is such a terrible
measure of reliability
• See the beauty of sequence profiles
– Position specific scoring matrices (PSSMs)
Background. Why protein modeling?
• Because it works!
– Close to 50% of all new sequences can be homology
modeled
• Experimental effort to determine protein
structure is very large and costly
• The gap between the size of the protein
sequence data and protein structure data is
large and increasing
Growth of databases
Homology modeling and the human genome
How can we do it?
• Identify template(s) – initial alignment
– Can give you protein function
• Improve alignment
– Can give you active site
• Backbone generation
• Loop modeling
– Most difficult part
• Side chains
• Refinement
• Validation
Identification of fold
If sequence similarity is
high, proteins share
structure (Safe zone)
If sequence similarity is low,
proteins may share
structure (Twilight zone)
Most proteins do not have a
partner with high sequence
homology
Rajesh Nair & Burkhard Rost Protein Science, 2002, 11, 2836-47
Structural Genomics in North America
• 10 year $600 million project initiated in 2000,
funded largely by NIH
• AIM: structural information on 10,000 unique
proteins (now 4,000-6,000); so far 1,000 have been
determined
• Improve current techniques to reduce time
(from months to days) and cost (from $100,000
to $20,000 per structure)
• 9 research centers currently funded (2005),
targets are from model and disease-causing
organisms (a separate project on TB proteins)
Homology modeling for structural genomics
What a new fold can give
Roberto Sánchez et al. Nature Structural Biology 7, 986 - 990 (2000)
Example.
A post doc in our group did her PhD obtaining the structure of the
sequence below
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV
VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL
FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL
GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
• What is the function?
• Where is the active site?
What would you do?
• Function
• Run Blast against PDB
• No significant hits
• Run Blast against NR (Sequence database)
• Function is Acetylesterase?
• Where is the active site?
Example. Where is the active site?
1G66 Acetylxylan esterase
1USW Hydrolase
1WAB Acetylhydrolase
Example. Where is the active site?
• Align sequence against structures of
known acetylesterases, like 1WAB, 1FXW, …
• Cannot be aligned. Too low sequence
similarity
1K7C.A 1WAB._ RMSD 11.2397
QAL 1K7C.A  71 GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTF
DAL 1WAB._ 160 GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY
Is it really impossible?
Protein homology modeling is only possible
if %id is greater than 30-50%
Why %id is so bad!!
1200 models sharing 25-95% sequence identity with the
submitted sequences (www.expasy.ch/swissmod)
Identification of correct fold
• % ID is a poor measure
– Many evolutionarily related proteins share
low sequence homology
– A short alignment of 5 amino acids can
share 100% id, what does this mean?
• Alignment score even worse
– Many sequences will score high against
everything (hydrophobic stretches)
• P-value or E-value more reliable
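The point about short alignments can be made concrete with a minimal sketch. The function name `percent_identity` and both example alignments are made up for illustration; identity is computed here over aligned (non-gap) column pairs, which is only one of several conventions in use.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over the aligned (non-gap) column pairs of two rows."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither row has a gap character.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# A 5-residue match is 100% identical but carries very little evidence.
short_id = percent_identity("GDSLV", "GDSLV")
# A longer (hypothetical) alignment with half of its columns identical.
long_id = percent_identity("LAGDSTMAKN", "IAGDKTLGQN")
print(short_id, long_id)
```

The number says nothing about alignment length or about how surprising the match is, which is why a P-value or E-value is the safer guide.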
What are P and E values?
• E-value
– Number of expected hits in database with score
higher than match
– Depends on database size
• P-value
– Probability that a random hit will have score higher
than match
– Database size independent
Example: match with score 150; 10 hits with higher score (E=10);
10000 hits in database => P=10/10000 = 0.001
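The arithmetic of the example can be sketched as follows. This follows the simple counting definition used on the slide (P = E / database size), not the extreme-value statistics that BLAST itself reports; the function names are made up for illustration.

```python
def p_value_from_e(e_value: float, database_size: int) -> float:
    """P-value as the fraction of database entries expected to score higher."""
    return e_value / database_size


def expected_hits(p_value: float, database_size: int) -> float:
    """E-value for a given P-value: it scales with the database size."""
    return p_value * database_size


# Numbers from the slide: 10 higher-scoring hits in a database of 10000.
p = p_value_from_e(10, 10000)
# The same P-value in a database ten times larger gives ten times the E-value.
e_large_db = expected_hits(p, 100000)
print(p, e_large_db)
```

The second call shows why the E-value depends on database size while the P-value does not.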
What goes wrong when Blast fails?
• Conventional sequence alignment uses a (Blosum)
scoring matrix to identify amino acids matches in
the two protein sequences
• This scoring matrix is identical at all positions in
the protein sequence!
EVVFIGDSLVQLMHQC
AGDS.GGGDSXXXXXX
Blosum scoring matrix
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
Alignment accuracy. Scoring functions
• Blosum62 score matrix. Fg=1. Ng=0?
   L  A  G  D  S  D
F  0 -2 -3 -3 -2 -3
I  2 -1 -4 -3 -2 -3
G -4  0  6 -1  0 -1
D -4 -2 -1  6  0  6
S -2  1  0  0  4  0
L  4 -1 -4 -4 -2 -4
• Score =2+6+6+4-1=17
• Alignment
LAGDS
I-GDS
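The score of the alignment above can be reproduced with a short sketch. It assumes that Fg is the penalty for the first gap position and Ng the penalty for each following gap position; that reading of the abbreviations is an assumption, as are the names `SCORES` and `alignment_score`. Only the BLOSUM62 entries needed for this example are included.

```python
# Subset of BLOSUM62 covering the residue pairs in this example only.
SCORES = {("L", "I"): 2, ("G", "G"): 6, ("D", "D"): 6, ("S", "S"): 4}


def pair_score(x: str, y: str) -> int:
    """Look up a substitution score; the matrix is symmetric."""
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    return SCORES[(y, x)]


def alignment_score(a: str, b: str, first_gap: int = 1, next_gap: int = 0) -> int:
    """Sum substitution scores and subtract gap penalties column by column."""
    total = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # Assumed meaning: Fg opens a gap, Ng extends it.
            total -= next_gap if in_gap else first_gap
            in_gap = True
        else:
            total += pair_score(x, y)
            in_gap = False
    return total


score = alignment_score("LAGDS", "I-GDS")
print(score)  # 2 - 1 + 6 + 6 + 4 = 17
```

Every column is scored with the same matrix regardless of where it sits in the protein, which is the limitation that sequence profiles address.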
1PLC._
When Blast works!
1PLB._
1PLC._
When Blast fails!
1PMY._
Sequence profiles
• In reality not all positions in a protein are
equally likely to mutate
• Some amino acids (active sites) are highly
conserved, and the penalty for a mismatch must
be very high
• Other amino acids can mutate almost for
free, and the penalty for a mismatch should be
lower than the BLOSUM score
• Sequence profiles can capture these
differences
Protein structure hierarchy
Protein world
Protein fold
Protein superfamily
Protein family
New Fold
Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDGMERNTAGVP
Matching anything but G => large negative score
Anything can match
How to make sequence profiles
1. Align (BLAST) sequence against large sequence
database (Swiss-Prot)
2. Select significant alignments and make profile
(weight matrix) using techniques for sequence
weighting and pseudo counts
3. Use weight matrix to align against sequence
database to find new significant hits
4. Repeat 2 and 3 (normally 3 times!)
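The profile-building step can be sketched in a few lines. This is a deliberately simplified illustration, not the PSI-BLAST procedure: it uses a flat pseudocount and a uniform background frequency, and it skips sequence weighting entirely. The toy alignment and the name `build_pssm` are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping amino acid -> log2-odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Toy alignment: column 2 is always G, column 1 varies among V, I and L.
alignment = ["VGDSL", "IGDSM", "LGDSA", "VGDTK"]
pssm = build_pssm(alignment)
print(round(pssm[1]["G"], 2), round(pssm[1]["A"], 2), round(pssm[0]["V"], 2))
```

The fully conserved G column gets a higher match score than the variable first column, and every other residue in that column scores negative. A real implementation would feed this matrix back into the database search and repeat.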
Blast iterations
Protein world
Protein
Blast2logo
Last position-specific scoring matrix computed
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
 2 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
 3 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
 4 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
 5 E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 6 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
 7 Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 8 I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
 9 P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
10 E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
...
Blast2logo
Last position-specific scoring matrix computed
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 V  -2 -4 -4 -5 -2 -4 -4 -5 -4  5  2 -4  0 -1 -4 -3 -2 -4 -2  4
 2 A   5  0 -3 -3 -3 -2  1 -2 -3  0 -3 -2 -2 -4  0  0 -2 -4 -3  0
 3 L  -4 -5 -6 -6 -4 -5 -5 -6 -5  5  4 -5  1 -2 -5 -5 -3 -4  0  1
 4 A   1 -4 -1 -1  3 -1  2 -4 -3  0 -1 -2 -3  1 -4  0  0 -4  2  2
 5 E  -2  0 -2  6 -6  0  4 -4  2 -5 -5 -2 -5 -6 -4 -2  0 -6 -4 -5
 6 L  -1 -2 -4 -4 -4 -2 -1  2  3  3  2 -1  0 -2 -5 -1 -1 -5 -3  1
 7 Y  -4 -5 -5 -6 -4 -5 -5 -4  0  1  4 -5 -1  3 -5 -5 -4 -3  5  3
 8 I  -1 -2 -5 -5 -4 -5 -2 -6 -5  4  3 -5 -1  3 -5 -4 -2 -4 -1  3
 9 P   3 -4 -4 -3 -4  1  1 -4 -2 -2 -3 -2 -4 -5  6 -1  0 -5 -5 -2
10 E   2 -2 -3 -2 -3  0  1 -1 -3 -4 -3 -1 -1 -4  6 -2 -2 -4 -4 -3
...
Example.
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV
VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL
FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL
GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
• What is the function?
• Where is the active site?
1K7A.A
When Blast fails!
1WAB._
1K7C.A
Profile-profile scoring matrix
1WAB._
Example. (SGNH active site)
Example. Where is the active site?
• Sequence profiles might show you where to look!
• The active site could be around
• S9, G42, N74, and H195
Example. Where is the active site?
Align using sequence profiles
ALN 1K7C.A 1WAB._ RMSD = 5.29522. 14% ID
1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
              S                                G                               N
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------

1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ ---------------------HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP

1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
                                       H
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L
Structural superposition
Blue: 1K7C.A
Red: 1WAB._
Where was the active site?
Rhamnogalacturonan
acetylesterase (1k7c)
Using Iterative Blast
Using Iterative Blast
Using Iterative Blast (1st iteration)
Using Iterative Blast (3rd iteration)
Including structure
• Sequences within a protein superfamily
share remote sequence homology,
but they share high structural homology
• Structure is known for template
• Predict structural properties for query
– Secondary structure
– Surface exposure
• Position specific gap penalties derived from
secondary structure and surface exposure
Using structure
Sequence & structure profile-profile based
alignments
– Template
• Sequence based profiles
• Annotated secondary structure
• Predicted secondary structure
– Query
• Sequence based profile
• Predicted secondary structure
– Position specific gap penalties derived from
secondary structure
Handout exercise
Using Psi-Blast Profiles
How good are we?
CpHModels-3.0
www.cbs.dtu.dk/services/CPHmodels-3.0/
CASP8 - Ranked as the 15th to 20th best server
[Figure: F4 (y-axis, 0 to 0.9) against Z-score (x-axis, 0 to 35); series label: Blast]
Why did we not win?
• Multiple template modeling
• First hit is not always the best
• Loop modeling
• ...
What are the different methods?
• Simple sequence based methods
– Align (BLAST) sequence against sequence of proteins with known
structure (PDB database)
• Sequence profile based methods
– Align sequence profile (Psi-BLAST) against sequence of proteins with
known structure (PDB, FUGUE)
– Align sequence profile against profile of proteins with known structure
(FFAS)
• Sequence and structure based methods
– Align profile and predicted secondary structure against proteins with
known structure (3D-PSSM, Phyre)
• Sequence profiles and structure based methods
– HHpred, CpHModels
• Multiple template methods
– MODELLER (via HHpred, 3D-Jury)
Take home message
• Identifying the correct fold is only a small step
towards successful homology modeling
• Do not trust % ID or alignment score to identify
the fold. Use P-values
• You can do reliable fold recognition AND
homology modeling even for low sequence
homology
• Use sequence profiles and local protein
structure to align sequences
CASP. Which are the best methods
• Critical Assessment of Structure Prediction
• Every second year
• Sequences from about-to-be-solved structures
are given to groups who submit
their predictions before the structure is
published
• Modelers make predictions
• Meeting in December where correct answers
are revealed
CASP6 results
The top 4 homology modeling groups in CASP6
• All winners use consensus predictions
– The wisdom of the crowd
• Same approach as in earlier CASPs
The Wisdom of the Crowds
The Wisdom of Crowds. Why the Many are
Smarter than the Few. James Surowiecki
One day in the fall of 1906, the British scientist Francis
Galton left his home and headed for a country fair… He
believed that only a very few people had the
characteristics necessary to keep societies healthy. He
had devoted much of his career to measuring those
characteristics, in fact, in order to prove that the vast
majority of people did not have them. … Galton came
across a weight-judging competition… Eight hundred people
tried their luck. They were a diverse lot, butchers,
farmers, clerks and many other non-experts… The crowd
had guessed … 1,197 pounds; the ox weighed 1,198
The wisdom of the crowd!
– The highest scoring hit will often be wrong
• Not one single prediction method is
consistently best
– Many prediction methods will have the
correct fold among the top 10-20 hits
– If many different prediction methods all have
a common fold among the top hits, this fold is
probably correct
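The consensus idea can be sketched as a simple vote over the top hits of several methods. The method names, fold labels, and the function `consensus_fold` are all hypothetical; real consensus servers compare 3D models rather than fold labels.

```python
from collections import Counter


def consensus_fold(top_hits_per_method, top_n=10):
    """Rank folds by how many methods list them among their top_n hits."""
    votes = Counter()
    for hits in top_hits_per_method.values():
        # Each method votes at most once for a given fold.
        votes.update(set(hits[:top_n]))
    return votes.most_common()


# Hypothetical ranked hit lists: every method has a different first hit.
predictions = {
    "method_a": ["fold_X", "fold_Y", "fold_Z"],
    "method_b": ["fold_Q", "fold_Y", "fold_X"],
    "method_c": ["fold_Y", "fold_R", "fold_S"],
}
ranking = consensus_fold(predictions)
print(ranking[0])
```

Here no two methods agree on their highest scoring hit, yet fold_Y appears in all three lists and wins the vote.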
3D-Jury
Inspired by Ab initio modeling methods
– Average of frequently obtained low energy
structures is often closer to the native structure
than the lowest energy structure
Find most abundant high scoring model in a list of
predictions from several predictors
1. Use output from a set of servers
2. Superimpose all pairs of structures
3. Similarity score Sij = # of Cα pairs within 3.5Å
(if # > 40; else Sij = 0)
4. 3D-Jury score = Σij Sij/(N+1)
Similar methods developed by A Elofsson (Pcons)
and D Fischer (3D shotgun)
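Steps 3 and 4 can be sketched under some stated assumptions. The pairwise counts of Cα pairs within 3.5Å are taken as already computed after superposition (step 2 is not shown), the score of a model sums its similarities to all other models, and N is read as the number of other models. The matrix values and the name `jury_scores` are made up for illustration.

```python
def jury_scores(pair_counts, threshold=40):
    """Score each model by its thresholded similarity to all other models."""
    n_models = len(pair_counts)
    n_others = n_models - 1  # assumed meaning of N
    scores = []
    for i in range(n_models):
        total = 0
        for j in range(n_models):
            if i == j:
                continue
            count = pair_counts[i][j]
            # Sij is the pair count if it exceeds the threshold, else 0.
            if count > threshold:
                total += count
        scores.append(total / (n_others + 1))
    return scores


# Hypothetical symmetric matrix of C-alpha pair counts for four models.
pair_counts = [
    [0, 120, 110, 30],
    [120, 0, 100, 20],
    [110, 100, 0, 45],
    [30, 20, 45, 0],
]
scores = jury_scores(pair_counts)
best_model = scores.index(max(scores))
print(scores, best_model)
```

The model that is most similar to the rest of the set gets the highest score, regardless of how its own server ranked it.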
How to do it? Where is the crowd
• Meta prediction server
– Web interface to a list of public protein
structure prediction servers
– Submit query sequence to all selected servers
in one go
http://bioinfo.pl/meta/
Meta Server
Evaluating the crowd.
Meta Server
Evaluating the crowd. 3D Jury
Take home message
• Identifying the correct fold is only a small step
towards successful homology modeling
• Do not trust % ID or alignment score to identify
the fold. Use p-values
• Use sequence profiles and local protein
structure to align sequences
• Do not trust one single prediction method, use
consensus methods (3D Jury)