Prediction of T cell epitopes using
artificial neural networks
Morten Nielsen,
CBS, BioCentrum,
DTU
Objectives
• How to train a neural network to predict peptide
MHC class I binding
• Understand why NNs perform the best
– Higher order sequence information
• The wisdom of the crowd!
– Why enlightened despotism does not work even for
neural networks
Outline
• MHC class I epitopes
– Why MHC binding?
• How to predict MHC binding?
– Information content
– Weight matrices
– Neural networks
• Neural network theory
– Sequence encoding
• Examples
Encounter with death
Processing of intracellular proteins
http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm
From proteins to immunogens
20% processed
0.5% bind MHC
50% CTL response
=> 1/2000 peptides are immunogenic
Lauemøller et al., 2000
MHC class I with peptide
Anchor positions
What makes a peptide a potential and
effective epitope?
Part of a pathogen protein
Successful processing
– Proteasome cleavage
– TAP binding
Binds to MHC molecule
Protein function
– Early in replication
Sequence conservation in evolution
Prediction of HLA binding specificity
Simple Motifs
– Allowed/non-allowed amino acids
Extended motifs
– Amino acid preferences (SYFPEITHI)
– Anchor/Preferred/other amino acids
Hidden Markov models
– Peptide statistics from sequence alignment (previous
talk)
Neural networks
– Can take sequence correlations into account
SYFPEITHI predictions
Extended motifs based on peptides from the literature
and peptides eluted from cells expressing specific HLAs
( i.e., binding peptides)
Scoring scheme is not readily accessible.
Positions defined as anchor or auxiliary anchor positions
are weighted differently (higher)
The final score is the sum of the scores at each position
Predictions can be made for several HLA-A, -B and -DRB1
alleles, as well as some mouse K, D and L alleles.
BIMAS
Matrix made from peptides with a measured T1/2 for the
MHC-peptide complex
The matrices are available on the website
The final score is the product of the scores of each
position in the matrix multiplied with a constant,
different for each MHC, to give a prediction of the T1/2
Predictions can be obtained for several HLA-A, -B and -C
alleles, mouse K, D and L alleles, and a single cattle MHC.
How to predict
The effect on the binding affinity of
having a given amino acid at one
position can be influenced by the
amino acids at other positions in the
peptide (sequence correlations).
– Two adjacent amino acids may for
example compete for the space in a
pocket in the MHC molecule.
Artificial neural networks (ANN) are
ideally suited to take such
correlations into account
Higher order sequence correlations
Neural networks can learn higher order correlations!
– What does this mean?
Say that the peptide needs one and only
one large amino acid in the positions P3
and P4 to fill the binding cleft
How would you formulate this to test if
a peptide can bind?
S S => 0
L S => 1
S L => 1
L L => 0
No linear function can learn this (XOR) pattern
Learning higher order correlation
0 0 => 0; 1 0 => 1
1 1 => 0; 0 1 => 1
f_XOR(x1, x2) = (x1 XOR x2) = x1 + x2 - 2*x1*x2 = y1 - y2
y1 = f(w11*x1 + w21*x2) = Step(x1 + x2, t = 1)
y2 = f(w12*x1 + w22*x2) = Step(x1 + x2, t = 2)
where f is a step function that outputs 1 when its input reaches the threshold t
Check it out. It does actually work!

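This two-layer construction can be sketched in a few lines of Python (a minimal illustration of the idea, not the actual prediction network):

```python
def step(x, t):
    """Step (threshold) activation: output 1 if the input reaches threshold t."""
    return 1 if x >= t else 0

def xor_net(x1, x2):
    # Hidden neuron y1 computes OR: fires when x1 + x2 >= 1
    y1 = step(x1 + x2, t=1)
    # Hidden neuron y2 computes AND: fires when x1 + x2 >= 2
    y2 = step(x1 + x2, t=2)
    # Output: XOR = OR minus AND
    return y1 - y2

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, "=>", xor_net(x1, x2))  # reproduces the truth table above
```

The hidden layer is what makes this possible: no single linear threshold unit can separate the XOR pattern.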
Mutual information
• How is mutual information calculated?
• Information content was calculated as
  I = Σ_a p_a log(p_a / q_a)
• Gives the information in a single position
• A similar relation gives the mutual information
  I = Σ_{a,b} p_ab log(p_ab / (p_a · p_b))
• Gives the mutual information between two positions
Mutual information. Example
Knowing that you have G at P1 allows you to
make an educated guess on what you will find
at P6.
P(V6) = 4/9. P(V6|G1) = 1.0!
P(G1) = 2/9 = 0.22
P(V6) = 4/9 = 0.44
P(G1,V6) = 2/9 = 0.22
P(G1)·P(V6) = 8/81 = 0.10
log(0.22/0.10) > 0
P1
P6
ALWGFFPVA
ILKEPVHGV
ILGFVFTLT
LLFGYPVYV
GLSPTVWLS
YMNGTMSQV
GILGFVFTL
WLSLLVPFV
FLPSDFFPS
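The mutual information sum can be computed directly from two alignment columns. A small sketch (a generic helper, not the CBS implementation; the P1/P6 numbers above follow the same formula):

```python
from collections import Counter
from math import log

def mutual_information(col_a, col_b):
    """I = sum over (a, b) of p_ab * log(p_ab / (p_a * p_b))
    for two alignment columns given as lists of residues."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum((c / n) * log((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in pab.items())

# Perfectly coupled columns give I = log(2); independent columns give I = 0
print(mutual_information(list("AABB"), list("XXYY")))  # ~0.693
print(mutual_information(list("AABB"), list("XYXY")))  # 0.0
```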
Mutual information
313 binding peptides
313 random peptides
Neural network training
• Sequence encoding
– Sparse
– Blosum
– Hidden Markov model
• Network ensembles
– Cross validated training
– Benefit from ensembles
Sequence encoding
• How to represent a peptide amino acid
sequence to the neural network?
• Sparse encoding (all amino acids are equally
dissimilar)
• Blosum encoding (encodes similarities
between the different amino acids)
• Hidden Markov model (encodes the position
specific amino acid preference of the HLA
binding motif)
Sequence encoding (continued)
• Sparse encoding
V: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
L: 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
V·L = 0 (unrelated)
• Blosum encoding
V:  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
L: -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
V·L = 0.88 (highly related)
V·R = -0.08 (close to unrelated)
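Sparse encoding is easy to write out. A short sketch (illustrative code, assuming the standard ARNDCQEGHILKMFPSTWYV residue ordering used in the vectors above; the Blosum variant would substitute each one-hot row with the residue's BLOSUM62 row):

```python
AA = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard amino acids

def sparse_encode(peptide):
    """Sparse (one-hot) encoding: each residue becomes a 20-dim 0/1 vector."""
    vec = []
    for aa in peptide:
        v = [0] * 20
        v[AA.index(aa)] = 1
        vec.extend(v)
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Under sparse encoding any two distinct residues are equally dissimilar:
print(dot(sparse_encode("V"), sparse_encode("L")))  # 0 (unrelated)
print(dot(sparse_encode("V"), sparse_encode("V")))  # 1 (identical)
```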
Weight matrices
        1     2     3     4     5     6     7     8     9
A     0.6  -1.6   0.2  -0.1  -1.6  -0.7   1.1  -2.2  -0.2
R     0.4  -6.6  -1.3  -0.1  -0.1  -1.4  -3.8   1.0  -3.5
N    -3.5  -6.5   0.1  -2.0   0.1  -1.0  -0.2  -0.8  -6.1
D    -2.4  -5.4   1.5   2.0  -2.2  -2.3  -1.3  -2.9  -4.5
C    -0.4  -2.5   0.0  -1.6  -1.2   1.1   1.3  -1.4   0.7
Q    -1.9  -4.0  -1.8   0.5   0.4  -1.3  -0.3   0.4  -0.8
E    -2.7  -4.7  -3.3   0.8  -0.5  -1.4  -1.3   0.1  -2.5
G     0.3  -3.7   0.4   2.0   1.9  -0.2  -1.4  -0.4  -4.0
H    -1.1  -6.3   0.5  -3.3   1.2  -1.0   2.1   0.2  -2.6
I     1.0   1.0  -1.0   0.1  -2.2   1.8   0.6  -0.0   0.9
L     0.3   5.1   0.3  -1.7  -0.5   0.8   0.7   1.1   2.8
K     0.0  -3.7  -2.5  -1.0  -1.3  -1.9  -5.0  -0.5  -3.0
M     1.4   3.1   1.2  -2.2  -2.2   0.2   1.1  -0.5  -1.8
F     1.2  -4.2   1.0  -1.6   1.7   1.0   0.9   0.7  -1.4
P    -2.7  -4.3  -0.1   1.7   1.2  -0.4   1.3  -0.3  -6.2
S     1.4  -4.2  -0.3  -0.6  -2.5  -0.6  -0.5   0.8  -1.9
T    -1.2  -0.2  -0.5  -0.2  -0.1   0.4  -0.9   0.8  -1.6
W    -2.0  -5.9   3.4   1.3   1.7  -0.5   2.9  -0.7  -4.9
Y     1.1  -3.8   1.6  -6.8   1.5  -0.0  -0.4   1.3  -1.6
V     0.7   0.4   0.0  -0.7   1.0   2.1   0.5  -1.1   4.5
Example: NLTISDVSV scores -3.5, 5.1, -0.5, ….. 4.5
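Scoring a peptide with such a matrix is just a sum of position-specific scores. A sketch using only the matrix rows that the example peptide NLTISDVSV touches:

```python
# Rows of the 9-mer weight matrix above needed to score NLTISDVSV
matrix = {
    "N": [-3.5, -6.5,  0.1, -2.0,  0.1, -1.0, -0.2, -0.8, -6.1],
    "L": [ 0.3,  5.1,  0.3, -1.7, -0.5,  0.8,  0.7,  1.1,  2.8],
    "T": [-1.2, -0.2, -0.5, -0.2, -0.1,  0.4, -0.9,  0.8, -1.6],
    "I": [ 1.0,  1.0, -1.0,  0.1, -2.2,  1.8,  0.6, -0.0,  0.9],
    "S": [ 1.4, -4.2, -0.3, -0.6, -2.5, -0.6, -0.5,  0.8, -1.9],
    "D": [-2.4, -5.4,  1.5,  2.0, -2.2, -2.3, -1.3, -2.9, -4.5],
    "V": [ 0.7,  0.4,  0.0, -0.7,  1.0,  2.1,  0.5, -1.1,  4.5],
}

def score(peptide, matrix):
    """Weight-matrix score: the sum of the position-specific scores."""
    return sum(matrix[aa][i] for i, aa in enumerate(peptide))

# Per-position scores start -3.5, 5.1, -0.5, ... and end 4.5, as on the slide
print(round(score("NLTISDVSV", matrix), 1))  # 2.2
```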
Evaluation of prediction accuracy
Method   Pearson  Aroc
Motif    0.76     0.92
Hmm      0.80     0.95
Sparse   0.88     0.97
BLOSUM   0.91     0.97
Neural network training
• A network contains a very large set of parameters
– A network with 5 hidden neurons predicting binding
for 9-meric peptides has 9x20x5 = 900 weights
• Overfitting is a problem
• Stop training when test performance is optimal
Neural network training. Cross validation
Cross validation
Train on 4/5 of data
Test on 1/5
=>
Produce 5 different
neural networks each
with a different
prediction focus
[Figure: the data set divided into five 20% partitions]
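The five-fold split can be sketched as follows (a minimal scaffold; each (train, test) pair would train one network of the final five-member ensemble):

```python
import random

def five_fold_splits(data, k=5, seed=0):
    """Shuffle the data and yield k (train, test) splits:
    train on (k-1)/k of the data, test on the remaining 1/k."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

peptides = list(range(100))  # stand-ins for 100 peptide/affinity data points
for train, test in five_fold_splits(peptides):
    assert len(train) == 80 and len(test) == 20
    assert set(train) | set(test) == set(peptides)  # every data point gets used
```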
Neural network training curve
Maximum test set performance
Most capable of generalizing
Network ensembles
The Wisdom of the Crowds
The Wisdom of Crowds. Why the Many are
Smarter than the Few. James Surowiecki
One day in the fall of 1906, the British scientist Francis
Galton left his home and headed for a country fair… He
believed that only a very few people had the
characteristics necessary to keep societies healthy. He
had devoted much of his career to measuring those
characteristics, in fact, in order to prove that the vast
majority of people did not have them. … Galton came
across a weight-judging competition… Eight hundred people
tried their luck. They were a diverse lot: butchers,
farmers, clerks and many other non-experts… The crowd
had guessed … 1,197 pounds; the ox weighed 1,198.
Network ensembles
• No single network, with a particular
architecture and sequence encoding scheme,
will consistently perform the best
• Enlightened despotism fails for neural network
predictions too
– For some peptides, BLOSUM encoding with a four-neuron
hidden layer best predicts the
peptide/MHC binding; for other peptides, a sparse-encoded
network with zero hidden neurons performs
the best
– Wisdom of the Crowd
• Never use just one neural network
• Use Network ensembles
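Combining an ensemble is as simple as averaging its members' outputs (a sketch; the stand-in "networks" below are hypothetical constant predictors, not trained models):

```python
def ensemble_predict(networks, peptide):
    """Average the predictions of all networks: the wisdom of the crowd."""
    scores = [net(peptide) for net in networks]
    return sum(scores) / len(scores)

# Hypothetical stand-in networks, each mapping a peptide to a binding score
nets = [lambda p: 0.5, lambda p: 0.75, lambda p: 1.0]
print(ensemble_predict(nets, "SLYNTVATL"))  # 0.75
```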
Evaluation of prediction accuracy
Method   Pearson  Aroc
Motif    0.76     0.92
Hmm      0.80     0.95
Sparse   0.88     0.97
BLOSUM   0.91     0.97
ENS      0.92     0.98
ENS: Ensemble of neural networks trained using sparse,
Blosum, and hidden Markov model sequence encoding
T cell epitope identification
Lauemøller et al., Reviews in Immunogenetics, 2001
NetMHC Output
[Figure: example output from the NetMHC prediction server]
Examples. Hepatitis C virus. Epitope predictions
Hotspots
SARS T cell epitope identification
[Figure panels: measured peptide binding affinity (KD, µM) bar charts for predicted peptides offered to recombinant HLA molecules]
Peptide binding affinity: A01 predicted peptides offered to rA*0101
Peptides tested: 15/15 (100%)
Binders (KD < 500 nM): 14/15 (93%)
Peptide binding affinity: A2 supertype predicted peptides offered to rA*0201 (molecule used: rA0201/human b2m)
Peptide binding affinity: A03 predicted peptides offered to rA*1101
Peptide binding affinity: A03 predicted peptides offered to rA*0301
Peptide binding affinity: B58 predicted peptides offered to rB*5801
Binder counts shown in the panels: 14/15, 13/15, 12/15, 11/15
More SARS CTL epitopes
[Figure panels: measured peptide binding affinity (KD, µM) bar charts]
Peptide binding affinity: B7 predicted peptides offered to rB*0702 (10/15 binders)
Peptide binding affinity: B62 predicted peptides offered to rB*1501 (12/14 binders)
Vaccine design. Polytope optimization
• Successful immunization can be obtained only if the
epitopes encoded by the polytope are correctly
processed and presented.
• Cleavage by the proteasome in the cytosol,
translocation into the ER by the TAP complex, as well as
binding to MHC class I should be taken into account in an
integrative manner.
• The design of a polytope can be done in an effective
way by modifying the sequential order of the different
epitopes, and by inserting specific amino acids that will
favor optimal cleavage and transport by the TAP
complex, as linkers between the epitopes.
Vaccine design. Polytope construction
Linker
NH2 M
Epitope
COOH
C-terminal cleavage
Cleavage within epitopes
cleavage
New epitopes
Polytope starting configuration
Immunological Bioinformatics, The MIT press.
Polytope optimization Algorithm
• Optimization of four measures:
1. The number of poor C-terminal cleavage sites of epitopes
(predicted cleavage < 0.9)
2. The number of internal cleavage sites (within-epitope
cleavages with a prediction larger than the predicted C-terminal cleavage)
3. The number of new epitopes (number of processed and
presented epitopes in the fusing regions spanning the
epitopes)
4. The length of the linker region inserted between epitopes.
• The optimization seeks to minimize the above four terms by use
of Monte Carlo Metropolis simulations [Metropolis et al., 1953]
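The Metropolis search can be sketched as follows (illustrative only: the toy cost below just penalizes junctions that repeat a residue, whereas the real objective combines the four cleavage/TAP/epitope terms above, and the epitope strings are arbitrary examples):

```python
import math
import random

def metropolis_order(epitopes, cost, steps=2000, T=1.0, seed=0):
    """Monte Carlo Metropolis search over epitope orderings:
    propose a swap, always accept downhill moves, accept uphill
    moves with probability exp(-dE/T) [Metropolis et al., 1953]."""
    rng = random.Random(seed)
    order = list(epitopes)
    e = cost(order)
    best, best_cost = list(order), e
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]        # propose a swap
        e_new = cost(order)
        if e_new <= e or rng.random() < math.exp((e - e_new) / T):
            e = e_new                                  # accept the move
            if e < best_cost:
                best, best_cost = list(order), e
        else:
            order[i], order[j] = order[j], order[i]    # reject: undo the swap
    return best, best_cost

def toy_cost(order):
    # Penalize a junction where one epitope ends with the residue the next starts with
    return sum(a[-1] == b[0] for a, b in zip(order, order[1:]))

epitopes = ["SLYNTVATL", "LLFGYPVYV", "VTEHDTLLY", "YLQPRTFLL"]
order, c = metropolis_order(epitopes, toy_cost)
print(c)  # an ordering with no penalized junctions exists, so the search reaches 0
```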
Polytope optimal configuration
Immunological Bioinformatics, The MIT press.
Take home message
• Binding of peptides to MHC is guided by
higher order sequence correlations
• Neural networks predict the binding
• Always use ensembles of prediction
methods
• No method is consistently best
• It does work
• For the SARS virus close to 80% of the
predicted peptides did bind MHC.