Download predizione - Sede di Busto Arsizio

Document related concepts
no text concepts found
Transcript
Dalla sequenza
alla struttura
Mauro Fasano
Dipartimento di Biologia Strutturale e Funzionale
Centro di Neuroscienze
Università dell’Insubria – Busto Arsizio
[email protected]
http://fisio.dipbsf.uninsubria.it/cns/fasano
Dalla sequenza alla struttura
V
L
S
E
G
E
W
Q
L
V
L
V
.
.
.
Sequenza
O2
Struttura
Funzione
Che informazioni offre la struttura?
•
•
•
•
•
•
Conformazione dei siti attivi e di legame
Orientazione dei residui conservati
Interpretazione di meccanismi
Visualizzazione di cavità
Calcolo di potenziale elettrostatico
…
Esempio
• FtsZ – divisione cellulare in procarioti,
mitocondri e cloroplasti.
• Tubulina – componente strutturale dei
microtubuli – comunicazione intracellulare
e divisione cellulare.
• FtsZ e Tubulina hanno bassa similarità di
sequenza e non sembrerebbero
omologhe.
Burns, R., Nature 391:121-123
Picture from E. Nogales
FtsZ e tubulina sono omologhe?
• Proteine che hanno conservato la struttura
tridimensionale possono derivare da un
progenitore comune anche se la
divergenza della sequenza non permette
più di riconoscere l’omologia.
Un altro esempio
• α-lattalbumina e lisozima possiedono:
– Stesso fold
– Moderata similarità
– Diversa funzione
Metodi sperimentali:
• Diffrazione dei raggi x
• Risonanza magnetica nucleare
Cristallografia a raggi X
• Ottenere cristalli della proteina
– 0.3-1.0 mm
– Le singole molecole sono ordinate in modo
periodico, ripetitivo.
• La struttura è determinata dai dati di
diffrazione.
Image from http://www-structure.llnl.gov/Xray/101index.html
Schmid, M. Trends in Microbiology, 10:s27-s31.
Cristallografia a raggi X
• Le proteine devono cristallizzare
– Grande quantità
– Solubili
• Accesso a radiazione adatta
• Tempo di calcolo per risolvere la struttura
Risonanza Magnetica Nucleare
(NMR)
•
•
•
•
•
•
Proteine in soluzione
Limite di dimensione ~ 40 kDa
Proteine stabili a lungo
Marcatura con 15N, 13C, 2H.
Strumentazione molto costosa
Tempo per assegnare le risonanze
Il Protein Data Bank
Crescita del PDB
Motivi strutturali depositati ogni anno
Percentuale di nuovi motivi strutturali
HEADER
COMPND
COMPND
COMPND
COMPND
COMPND
COMPND
SOURCE
SOURCE
SOURCE
SOURCE
KEYWDS
EXPDTA
AUTHOR
REVDAT
JRNL
JRNL
JRNL
JRNL
JRNL
JRNL
REMARK
BINDING PROTEIN
01-JUN-95
1HXN
MOL_ID: 1;
2 MOLECULE: HEMOPEXIN;
3 CHAIN: NULL;
4 DOMAIN: C-TERMINAL DOMAIN;
5 SYNONYM: HPX;
6 HETEROGEN: PO4
MOL_ID: 1;
2 ORGANISM_SCIENTIFIC: ORYCTOLAGUS CUNICULUS;
3 ORGANISM_COMMON: RABBIT;
4 TISSUE: SERUM
HEME
X-RAY DIFFRACTION
H.R.FABER,E.N.BAKER
1
15-OCT-95 1HXN
0
AUTH
H.R.FABER,C.R.GROOM,H.BAKER,W.MORGAN,A.SMITH,
AUTH 2 E.N.BAKER
TITL
1.8 ANGSTROMS CRYSTAL STRUCTURE OF THE C-TERMINAL
TITL 2 DOMAIN OF RABBIT SERUM HEMOPEXIN
REF
TO BE PUBLISHED
REFN
0353
1
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
CA
C
O
N
CA
C
O
CB
OG
N
CA
C
O
CB
OG1
CG2
GLU
GLU
GLU
SER
SER
SER
SER
SER
SER
THR
THR
THR
THR
THR
THR
THR
225
225
225
226
226
226
226
226
226
227
227
227
227
227
227
227
-0.900
-0.185
-0.514
0.788
1.534
2.231
1.883
2.572
3.237
3.242
3.989
4.274
4.179
5.354
5.114
6.256
-1.002
0.146
1.329
-0.203
0.805
1.806
1.952
0.130
-0.941
2.478
3.417
2.705
3.296
3.797
4.682
4.492
39.233
39.970
39.758
40.823
41.594
40.681
39.514
42.515
41.848
41.223
40.410
39.080
38.022
41.074
42.172
40.065
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
70.00
70.00
70.00
70.00
70.00
68.89
70.00
70.00
70.00
65.51
70.00
56.25
44.63
70.00
70.00
70.00
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
1HXN
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
http://www.expasy.ch/spdbv
Esegui
Classificazione delle proteine:
• SCOP (Structural Classification of Proteins,
scop.mrc-lmb.cam.ac.uk/scop/, Murzin et. al.):
548 folds (major structural similarity in terms of secondary
structures e.g. globin-like, Rossman fold); 1296 families (clear
evolutionary relationship or homology e.g. globins, Ras)
• CATH (Class, Architecture, Topology, Homologous
Superfamily, www.biochem.ucl.ac.uk/bsm/cath/, Orengo et. al):
35 architectures (gross arrangment of secondary structures
e.g. non-bundle, sandwich); 580 topologies (connectivity of
secondary structures e.g. globin-like, Rossman fold); 1846
families (clear homology, same function)
Structural Classification Of Proteins
Predizione della struttura secondaria e terziaria
Metodi predittivi
»Comparative modeling
> 30% similitudine
»Threading/Fold recognition
0 – 30% similitudine
»Ab initio
nessun omologo
Qualità del modello comparativo
Identità di sequenza:
60-100%
Confrontabile con NMR media risoluzione
Specificità di substrato
30-60%
Molecular replacement in cristallografia
Partenza per site-directed mutagenesis
<30%
Gravi errori
Building by homology (Homology modelling)
Allineamento con proteine a struttura nota
M A A G Y A Y G V L S
- A T G F D
-
- V
I D
- A S G F E
-
- V V E
- A K A Y L
-
- V L S
Modello strutturale
Fold recognition (Threading)
Sequenza:
MA A G Y AV L S
+
Motivi strutturali noti
Modello strutturale
Ab initio
Sequenza
MA A G Y AV L S
Modello strutturale
General Flowchart
Building by homology
Un numero grandissimo di polipeptidi si struttura in un
numero finito (e relativamente piccolo) di folds
Almeno una proteina su due di quelle presenti nel
database ha un omologo (identità > 30%) che quasi
sempre ha lo stesso fold.
Costruire il modello comparativo
1)
Cercare il massimo numero di omologhi che
possiedano una entry nel PDB. Strumenti che utilizzano
PSSM sono più sensibili. In questo caso vengono utilizzate
sequenze senza struttura per costruire la PSSM.
2)
Costruire un accurato allineamento multiplo tra la
sequenza da modellare e tutte le entries che verranno
utilizzate come templato.
Trovare strutture di proteine la
cui sequenza è simile
allineamento
Modello strutturale
Verifica
OK!
Costruire il modello stesso
Determinare la struttura secondaria in base all’allineamento
Costruire le regioni conservate. Per ciascuna regione
possiamo prendere le coordinate del frammento con la
maggior similarità di sequenza.
Costruire le regioni variabili, solitamente loops.
Costruzione dei loops:
Usando raccolte di loops osservati in strutture note, in
base alla loro lunghezza ed alla loro sequenza
Costruendo la conformazione del loop ab initio. Vengono
generate numerose conformazioni casuali e si calcola
l’energia in un opportuno campo di forze.
Alcuni siti web di homology modeling
COMPOSER – felix.bioccam.ac.uk/soft-base.html
MODELLER – guitar.rockefeller.edu/modeller/modeller.html
WHAT IF – www.sander.embl-heidelberg.de/whatif/
SWISS-MODEL – www.expasy.ch/SWISS-MODEL.html
Swiss-Model
http://www.expasy.ch/swissmod/SWISS-MODEL.html
Modeller
http://guitar.rockefeller.edu/modeller/about_modeller.shtml
Advanced program for homology modeling
Based on distance constraints
Implemented in several popular modelling packages
such as InsightII
The source is available for unix platforms at the above URL
Threading (fold recognition)
La sequenza di input viene confrontata con una libreria di
folds noti
Si calcola un punteggio che esprima la compatibilità tra la
sequenza e ciascun fold considerato
Punteggi statisticamente significativi indicano che la
sequenza ha una certa probabilità di assumere la stessa
struttura 3D del fold considerato
Input:
Sequenza
Donatore H
Accettore H
Gly
Idrofobico
Collezione di folds di proteine note
Input:
Sequenza
Donatore H
Accettore H
Gly
Idrofobico
Collezione di folds di proteine note
Donatore H
Accettore H
Gly
Idrofobico
S=-2
Z= -1
S=5
Z=1.5
S=20
Z=5
Chain/Domain Library
Scoring functions for fold recognition
Ci sono due metodi per valutare la compatibilità sequenzastruttura (1D-3D)
In methods based on structural profile, for every fold a
profile is built based on structural features of the fold and
compatability of every amino acid to the features.
The structural features of each position are determined
based on the combination of secondary structure, solvent
accessibility and the property of the local environment
(hydrophobic/hydrophilic)
The profile is a defined mathematical structure, adjusted for
pair-wise comparisons and dynamic programming
Amino acid type
Position on sequence
A
1
N
D
…
10 -50 101
2 -24
:
C
:
87 -99
:
:
:
Y Gop Gext
-80 100
10
167 100
10
:
:
:
100
10
Contact potentials
This method is based on predefined tables which include
pseudo-energetic scores to each pairwise interaction of two
amino acids.
This method makes use of distance matrix for representation
of different folds
For each pair of amino acids which are close in space the
interaction energy is summed. The total sum is the indication
for the fitness of the sequence into that structure
Scoring Function
…YKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEW…
Tendenza a stare
vicini: E_p
Qualità dell’allineamento
in una certa posizione:
E_m
(pairwise term)
(mutation term)
Alignment gap
penalty: E_g
Tendenza a stare in un
certo ambiente: E_s
(singleton term)
Energia totale: E_m + E_p + E_s + E_g
Descrive quanto la sequenza assomiglia al templato
Amino acid index
1
•
••
•
N
1
•
••
•
•
•
••
•
•
•
•
••
• •
•
••
•
••
N
Amino acid index
Expected Performance
PROSPECT prediction in CASP4:
12 out 19 folds (no homology) recognized
Predicted
model
X-ray
structure
Web sites for fold recognition
Profiles:
3D-PSSM - http://www.bmm.icnet.uk/~3dpssm
Libra I - http://www.ddbj.nig.ac.jp/htmls/E-mail/libra/LIBRA_I.html
UCLA DOE - http://www.doe-mbi.ucla.edu/people/frsvr/frsvr.html
Contact potentials
123D - http://www-Immb.ncifcrf.gov/~nicka/123D.html
Profit - http://lore.came.sbg.ac.at/home.html
Risultati
Ab initio methods for modelling
This field is of great theoretical interest but, so far, of very
little practical applications. Here there is no use of sequence
alignments and no direct use of known structures
The basic idea is to build empirical function that simulates
real physical forces and potentials of chemical contacts
If we will have perfect function and we will be able to scan all
the possible conformations, then we will be able to detect the
correct fold
Algorithms for Ab initio prediction include:
A. Searching procedure that scans many possible structures
(conformations)
B. Scoring function to evaluate and rank the structures
Due to the large search space, heuristic methods are usually
applied
The parameters in the searching procedure are the dihedral
angles which specify the exact fold of the polypeptide chain
A
B
C
D
E
A
B
C
D
E
New Fold Methods
• Since almost all predictors use sequence and
structural databases in some form, there is no
longer an “ab initio” category
• Assessment is sometimes difficult to
communicate due to the complexity of the
protein structure and completeness of prediction
• Methods are still somewhat limited to smaller
proteins
Rosetta-David Baker
• Based on the assumption that the distribution of
conformations sampled by a local segment of
the polypeptide chain is reasonably
approximated by the distribution of structures
adopted by that sequence and closely related
sequences in known protein structures.
• Fragment libraries for all possible three and nine
residue segments of the chain are extracted
from PDB by profile methods
Rosetta-Simulation Procedure
• Information on fragments from secondary structure
prediction methods compiled and scored based on
equation for local secondary structure propensity
• Conformational space defined by these fragments is
then searched by a Monte Carlo procedure with an
energy function that favors compact structures with
paired beta strands and buried hydrophobic residues,
refinement of procedure late in simulation
• Thousands of structures generated
• Filters to remove bad structures
• Remaning structures clustered and cluster center
taken as the prediction.
Methods to evaluate structures are based on
Force fields- collection of terms that simulate the forces act
between atoms
Terms based on probabilities to find pairs of amino acids or atoms
within specific distances
Terms based on surface area and overlapping volume of
spheres representing atoms
Side chain construction
In homology modelling, construction of the side chains is
done using the template structures when there is high
similarity between the built protein and the templates
Without such similarity the construction can be done using
rotamer libraries
A compromise between the probability of the rotamer and its
fitness in specific position determines the score. Comparing
the scores of all the rotamer for a given amino acid determines
the preferred rotamer.
In spite of the huge size of the problem (because each side
chain influences its neighbours) there are quite succesful
algorithms to this problem.
In this work we examined differences in structures of
amino- acid side chains around point mutations.
Conformation - a given set
of dihedral angle which
defines a structure.
Asn
Rotamer - energetically
favourable conformation.
Phe
Example to library of rotamers
SER 59.6 41.0
SER -62.5 26.4
SER 179.6 32.6
TYR 63.6 90.5
TYR 68.5 -89.6
TYR 170.7 97.8
TYR -175.0 -100.7
TYR -60.1 96.6
TYR -63.0 -101.6
21.0
16.4
13.3
20.0
10.0
19.3
Model evaluation
After the model is built we can check it by various
methods.
If the model turns out to be bad, it is necessary to repeat
several stages of the model building
The main approaches for model evaluation are:
A. Use of internal information (such as the one that used for the
model construction)
B. Use of external information derived from the databases
Usually algorithms are checked by building models
for proteins which have already solved structure and
comparison between the model and the native
structure
It is always possible that information from the native
structure will be used in direct or indirect ways for
model building
A more objective test is prediction of structures
before they are publicly distributed (this is the idea of
the CASP competitions)