Immunological Bioinformatics
Introduction to
the immune system
Vaccination
• Administration of a substance to a person with the
purpose of preventing a disease
• Traditionally composed of a killed or weakened
microorganism
• Vaccination works by creating a type of immune
response that enables the memory cells to later
respond to a similar organism before it can cause
disease
Figure 1-20
Effectiveness of vaccines
1958 start of smallpox
eradication program
The Immune System
• The innate immune system
• The adaptive immune system
The innate immune system
• Unspecific
• Antigen independent
• Immediate response
• No training/selection hence no memory
• Pathogen independent (but response might
be pathogen type dependent)
The adaptive immune system
• Pathogen specific
– Humoral
– Cellular
Parasite: http://tpeeaupotable.ifrance.com/ma%20photo/bilharzoze.jpg
Virus: http://en.wikipedia.org/wiki/Image:Aids_virus.jpg
Bacteria: http://www.uni-heidelberg.de/zentral/ztl/grafiken_bilder/bilder/e-coli.jpg
Adaptive immune response
• Signal induced
– Pathogens
• Antigens
– Epitopes
B Cell
T Cell
Diversity is a hallmark of the
(adaptive) immune system
• Diversity of lymphocytes
– Huge diversity within a host
– At least 10^8 different T & B cell clones
• Receptors made by recombination & N-additions, and
• Somatic mutation during immune response
• Repertoires are (partly) random
– Randomness requires self tolerance
Figure 1-14
The role of lymphocytes
Humoral immunity
Cartoon by Eric Reits
Antibody - Antigen interaction
Antigen
The antibody recognizes
structural properties of the
surface of the antigen
Fab
Epitope
Paratope
Antibody
Antibody Effect
Virus or Toxin
Neutralizing Antibodies
Cellular immune response
Cartoon by Eric Reits
MHC-I molecules present peptides
on the surface of most cells
CTL response
Figure labels: healthy cell, virus-infected cell, MHC-I
The death of an infected cell
Polymorphism of MHC
• Within a host limited number of loci (genes)
– only 6 different class I molecules (two A, B and C)
– only 12 different class II molecules
• Within a population > 100 alleles per locus
More MHC molecules: more diversity
in the presented peptides
• 1% probability that MHC molecule presents a peptide
• Different hosts sample different peptides from same pathogen.
Immunological benefits of MHC
polymorphism
• Heterozygote advantage
– Heterozygotes have a selective advantage
because they can present more peptides
(Hughes.n88).
• Coevolution
– Pathogens avoid presentation on common MHC alleles
(HIV)
– Frequency dependent selection
Figure 5-13
Heterozygote disadvantage!
(for vaccine design)
• Few human beings will share the same set
of HLA alleles
– Different persons will react to a pathogen
infection in a non-similar manner
• A CTL based vaccine must include
epitopes specific for each HLA allele in a
population
– A CTL based vaccine must consist of ~800
HLA class I epitopes and ~400 class II
epitopes
HLA specificity clustering
A0201
A0101
A6802
B0702
HLA polymorphism - supertypes
• Each HLA molecule within a supertype binds essentially
the same peptides
• Nine major HLA class I supertypes have been defined
• HLA-A1, A2, A3, A24,B7, B27, B44, B58, and B62
• And maybe add three more
• HLA-A26, HLA-B8, and HLA-B39
=> A CTL based vaccine must consist of 9-12 HLA class I
epitopes
Sette et al, Immunogenetics (1999) 50:201-212
Summary
• The adaptive immune system is extremely
diverse
– An immune response can be raised against anything
foreign!
• Antibodies define the humoral response
– Antibodies recognize structural properties on the
surface of extracellular antigens
• T cells define the cellular response
– CTLs kill cells that present MHC molecules bound to
foreign peptides derived from inside the cell
MHC class I with peptide
Anchor positions
What makes a peptide a potential and
effective epitope?
• Part of a pathogen protein
• Successful processing
– Proteasome cleavage
– TAP binding
• Binds to MHC molecule
• Protein function and expression
– Early in replication
– Highly expressed proteins are more likely to
generate immunogens
• Sequence conservation in evolution
Prediction of HLA binding specificity
Historical overview
• Simple Motifs
– Allowed/non allowed amino acids
• Extended motifs
– Amino acid preferences (SYFPEITHI)
– Anchor/Preferred/other amino acids
• Hidden Markov models
– Peptide statistics from sequence alignment
• Neural networks
– Can take sequence correlations into account
SYFPEITHI predictions
• Extended motifs based on peptides from the literature
and peptides eluted from cells expressing specific HLAs
( i.e., binding peptides)
• Scoring scheme is not readily accessible.
• Positions defined as anchor or auxiliary anchor positions
are weighted differently (higher)
• The final score is the sum of the scores at each position
• Predictions can be made for several HLA-A, -B and DRB1 alleles, as well as some mouse K, D and L alleles.
BIMAS
• Matrix made from peptides with a measured T1/2 for the
MHC-peptide complex
• The matrices are available on the website
• The final score is the product of the scores of each
position in the matrix multiplied with a constant,
different for each MHC, to give a prediction of the T1/2
• Predictions can be obtained for several HLA-A, -B and -C alleles, mouse K, D and L alleles, and a single cattle MHC.
Sequence information
SLLPAIVEL
LLDVPTAAV
HLIDYLVTS
ILFGHENRV
LERPGGNEI
PLDGEYFTL
ILGFVFTLT
KLVALGINA
KTWGQYWQV
SLLAPGAKQ
ILTVILGVL
TGAPVTYST
GAGIGVAVL
KARDPHSGH
AVFDRKSDA
GLCTLVAML
VLHDDLLEA
ISNDVCAQV
YTAFTIPSI
NMFTPYIGV
VVLGVVFGI
GLYDGMEHL
EAAGIGILT
YLSTAFARV
FLDEFMEGV
AAGIGILTV
AAGIGILTV
YLLPAIVHI
VLFRGGPRG
ILAPPVVKL
ILMEHIHKL
ALSNLEVKL
GVLVGVALI
LLFGYPVYV
DLMGYIPLV
TITDQVPFS
KIFGSLAFL
KVLEYVIKV
VIYQYMDDL
IAGIGILAI
KACDPHSGH
LLDFVRFMG
FIDSYICQV
LMWITQCFL
VKTDGNPPE
RLMKQDFSV
LMIIPLINV
ILHNGAYSL
KMVELVHFL
TLDSQVMSL
YLLEMLWRL
ALQPGTALL
FLPSDFFPS
FLPSDFFPS
TLWVDPYEV
MVDGTLLLL
ALFPQLVIL
ILDQKINEV
ALNELLQHV
RTLDKVLEV
GLSPTVWLS
RLVTLKDIV
AFHHVAREL
ELVSEFSRM
FLWGPRALV
VLPDVFIRC
LIVIGILIL
ACDPHSGHF
VLVKSPNHV
IISAVVGIL
SLLMWITQC
SVYDFFVWL
RLPRIFCSC
TLFIGSHVV
MIMVKCWMI
YLQLVFGIE
STPPPGTRV
SLDDYNHLV
VLDGLDVLL
SVRDRLARL
AAGIGILTV
GLVPFLVSV
YMNGTMSQV
GILGFVFTL
SLAGGIIGV
DLERKVESL
HLSTAFARV
WLSLLVPFV
MLLAVLYCL
YLNKIQNSL
KLTPLCVTL
GLSRYVARL
VLPDVFIRC
LAGIGLIAA
SLYNTVATL
GLAPPQHLI
VMAGVGSPY
QLSLLMWIT
FLYGALLLA
FLWGPRAYA
SLVIVTTFV
MLGTHTMEV
MLMAQEALA
KVAELVHFL
RTLDKVLEV
SLYSFPEPE
SLREWLLRI
FLPSDFFPS
KLLEPVLLL
MLLSVPLLL
STNRQSGRQ
LLIENVASL
FLGENISNF
RLDSYVRSL
FLPSDFFPS
AAGIGILTV
MMRKLAILS
VLYRYGSFS
FLLTRILTI
AVGIGIAVV
VDGIGILTI
RGPGRAFVT
LLGRNSFEV
LLWTLVVLL
LLGATCMFV
VLFSSDFRI
RLLQETELV
VLQWASLAV
MLGTHTMEV
LMAQEALAF
IMIGVLVGV
GLPVEYLQV
ALYVDSLFF
LLSAWILTA
AAGIGILTV
LLDVPTAAV
SLLGLLVEV
GLDVLTAKV
FLLWATAEA
ALSDHHIYL
YMNGTMSQV
CLGGLLTMV
YLEPGPVTA
AIMDKNIIL
YIGEVLVSV
HLGNVKYLV
LVVLGLLAV
GAGIGVLTA
NLVPMVATV
PLTFGWCYK
SVRDRLARL
RLTRFLSRV
LMWAKIGPV
SLFEGIDFY
ILAKFLHWL
SLADTNSLA
VYDGREHTV
ALCRWGLLL
KLIANNTRV
SLLQHLIGL
AAGIGILTV
FLWGPRALV
LLDVPTAAV
ALLPPINIL
RILGAVAKV
SLPDFGISY
GLSEFTEYL
GILGFVFTL
FIAGNSAYE
LLDGTATLR
IMDKNIILK
CINGVCWTV
GIAGGLALL
ALGLGLLPV
AAGIGIIQI
GLHCYEQLV
VLEWRFDSR
LLMDCSGSI
YMDGTMSQV
SLLLELEEV
SLDQSVVEL
STAPPHVNV
LLWAARPRL
YLSGANLNL
LLFAGVQCQ
FIYAGSLSA
ELTLGEFLK
AVPDEIPPL
ETVSEQSNV
LLDVPTAAV
TLIKIQHTL
QVCERIPTI
KKREEAPSL
STAPPAHGV
ILKEPVHGV
KLGEFYNQM
ITDQVPFSV
SMVGNWAKV
VMNILLQYV
GLQDCTMLV
GIGIGVLAA
QAGIGILLA
PLKQHFQIV
TLNAWVKVV
CLTSTVQLV
FLTPKKLQC
SLSRFSWGA
RLNMFTPYI
LLLLTVLTV
GVALQTMKQ
RMFPNAPYL
VLLCESTAV
KLVANNTRL
MINAYLDKL
FAYDGKDYI
ITLWQRPLV
Sequence Information
• Say that a peptide must have L at P2 in order to bind, and that A, F, W, and Y are found at P1. Which position has most information?
• How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2?
• P1: 4 questions (at most)
• P2: 1 question (L or not)
• P2 has the most information

• Calculate $p_a$ at each position
• Entropy
$S = -\sum_a p_a \log(p_a)$
• Information content
$I = \log(20) + \sum_a p_a \log(p_a)$
• Conserved positions
– $P_V = 1$, $P_{!V} = 0$ => $S = 0$, $I = \log(20)$
• Mutable positions
– $P_{aa} = 1/20$ => $S = \log(20)$, $I = 0$

Information content
$S = -\sum_a p_a \log(p_a)$
$I = \log(20) + \sum_a p_a \log(p_a)$

Amino acid frequencies per position. The last two columns give the entropy S and the information content I (base-2 logarithm).

Pos      A     R     N     D     C     Q     E     G     H     I     L     K     M     F     P     S     T     W     Y     V  |     S     I
1     0.10  0.06  0.01  0.02  0.01  0.02  0.02  0.09  0.01  0.07  0.11  0.06  0.04  0.08  0.01  0.11  0.03  0.01  0.05  0.08  |  3.96  0.37
2     0.07  0.00  0.00  0.01  0.01  0.00  0.01  0.01  0.00  0.08  0.59  0.01  0.07  0.01  0.00  0.01  0.06  0.00  0.01  0.08  |  2.16  2.16
3     0.08  0.03  0.05  0.10  0.02  0.02  0.01  0.12  0.02  0.03  0.12  0.01  0.03  0.05  0.06  0.06  0.04  0.04  0.04  0.07  |  4.06  0.26
4     0.07  0.04  0.02  0.11  0.01  0.04  0.08  0.15  0.01  0.10  0.04  0.03  0.01  0.02  0.09  0.07  0.04  0.02  0.00  0.05  |  3.87  0.45
5     0.04  0.04  0.04  0.04  0.01  0.04  0.05  0.16  0.04  0.02  0.08  0.04  0.01  0.06  0.10  0.02  0.06  0.02  0.05  0.09  |  4.04  0.28
6     0.04  0.03  0.03  0.01  0.02  0.03  0.03  0.04  0.02  0.14  0.13  0.02  0.03  0.07  0.03  0.05  0.08  0.01  0.03  0.15  |  3.92  0.40
7     0.14  0.01  0.03  0.03  0.02  0.03  0.04  0.03  0.05  0.07  0.15  0.01  0.03  0.07  0.06  0.07  0.04  0.03  0.02  0.08  |  3.98  0.34
8     0.05  0.09  0.04  0.01  0.01  0.05  0.07  0.05  0.02  0.04  0.14  0.04  0.02  0.05  0.05  0.08  0.10  0.01  0.04  0.03  |  4.04  0.28
9     0.07  0.01  0.00  0.00  0.02  0.02  0.02  0.01  0.01  0.08  0.26  0.01  0.01  0.02  0.00  0.04  0.02  0.00  0.01  0.38  |  2.78  1.55

Example (L at P9):
$p_L = 0.26$
$\log_2(0.26) = -1.94$
$-p_L \log_2(p_L) = 0.26 \cdot 1.94 \approx 0.51$
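The entropy and information content formulas can be checked with a few lines of Python. This is a minimal sketch, assuming base-2 logarithms (which reproduces the I = log(20) - S relation in the table); the function names are illustrative and not from the slides. The P9 frequencies are the rounded table values, so the computed entropy only approximates the tabulated 2.78.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def entropy(freqs):
    """Shannon entropy in bits; zero frequencies contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)


def information(freqs):
    """Information content relative to a uniform 20-letter alphabet."""
    return math.log2(len(AMINO_ACIDS)) - entropy(freqs)


# Limiting cases from the slide: fully conserved and fully mutable positions.
conserved = [1.0] + [0.0] * 19
uniform = [1 / 20] * 20

# Rounded frequencies at P9, in the order of AMINO_ACIDS.
p9 = [0.07, 0.01, 0.00, 0.00, 0.02, 0.02, 0.02, 0.01, 0.01, 0.08,
      0.26, 0.01, 0.01, 0.02, 0.00, 0.04, 0.02, 0.00, 0.01, 0.38]

for name, freqs in [("conserved", conserved), ("uniform", uniform), ("P9", p9)]:
    print(f"{name}: S = {entropy(freqs):.2f}, I = {information(freqs):.2f}")
```

A conserved position gives S = 0 and I = log2(20), a uniform position gives S = log2(20) and I = 0, matching the two limiting cases on the slide.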
Sequence logos
•Height of a column equal to I
•Relative height of a letter is p
•Highly useful tool to visualize
sequence motifs
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
HLA-A0201
High information
positions
Characterizing a binding motif from
small data sets
10 MHC restricted peptides
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
What can we learn?
1. A at P1 favors binding?
2. I is not allowed at P9?
3. K at P4 favors binding?
4. Which positions are important for binding?
Simple motifs
Yes/No rules
10 MHC restricted peptides
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

[AGTK]1 [LMIV]2 [ANLV]3 ... [MNRTVL]9
• Only 11 of 212 peptides identified!
• Need more flexible rules
– If a peptide does not fit at P1 but fits at P2, then it is OK
• Not all positions are equally important
– We know that P2 and P9 determine binding more than
other positions
• Cannot discriminate between good and
very good binders
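The yes/no rule can be sketched in Python as follows. The motif is simply the set of amino acids observed at each position of the ten example peptides; `build_motif` and `fits_motif` are illustrative names, not part of any published tool. The three candidate peptides are the measured binders used as examples on the following slides.

```python
PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def build_motif(peptides):
    """Return, for each position, the set of amino acids observed there."""
    return [set(column) for column in zip(*peptides)]


def fits_motif(peptide, motif):
    """Yes/no rule: every residue must have been observed at its position."""
    return len(peptide) == len(motif) and all(
        aa in allowed for aa, allowed in zip(peptide, motif)
    )


motif = build_motif(PEPTIDES)
print("P1:", sorted(motif[0]), "P2:", sorted(motif[1]), "P9:", sorted(motif[8]))

for candidate in ("RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL"):
    print(candidate, fits_motif(candidate, motif))
```

Only ALAKAAAAL passes: RLLDDTPEV fails already at P1 (R was never observed there) and GLLGNVSTV fails at P5, which illustrates how rigid the rule is.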
Simple motifs
Yes/No rules
10 MHC restricted peptides
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV

GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
[AGTK]1 [LMIV]2 [ANLV]3 ... [AIFKLV]7 ... [MNRTVL]9
• Example
RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM
• The first two peptides do not fit the motif, yet all three
are good binders (affinity < 500 nM)
Extended motifs
• Fitness of aa at each position
given by P(aa)
• Example P1
PA = 6/10
PG = 2/10
PT = PK = 1/10
PC = PD = …PV = 0
• Problems
– Few data
– Data redundancy/duplication
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM
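The raw frequencies in the P1 example can be reproduced by counting. A minimal sketch, with `position_frequencies` as an illustrative helper name; exact fractions are used so the values can be compared directly with the slide.

```python
from collections import Counter
from fractions import Fraction

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def position_frequencies(peptides, position):
    """Raw frequency of each amino acid at a 1-based peptide position."""
    counts = Counter(peptide[position - 1] for peptide in peptides)
    return {aa: Fraction(n, len(peptides)) for aa, n in counts.items()}


p1 = position_frequencies(PEPTIDES, 1)
for aa, freq in sorted(p1.items()):
    print(aa, freq)

# Amino acids never observed at a position get frequency 0 under raw counting.
print("V at P1:", p1.get("V", Fraction(0)))
```

Five of the six A counts at P1 come from the near-identical ALAKAAAA* peptides, which is the redundancy problem named on the slide.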
Sequence information
Raw sequence counting
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Sequence weighting
•Poor or biased sampling
of sequence space
•Example P1
PA = 2/6
PG = 2/6
PT = PK = 1/6
PC = PD = …PV = 0
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Similar sequences (the five ALAKAAAA* peptides): weight 1/5 each
RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM
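The weighted P1 frequencies (2/6, 2/6, 1/6, 1/6) follow from giving each of the five similar peptides a weight of 1/5. The sketch below makes one loud assumption: peptides are treated as "similar" when their first eight residues are identical. The slides do not say how similarity is decided, so this grouping rule is hypothetical and only chosen because it reproduces the slide's numbers.

```python
from collections import Counter, defaultdict
from fractions import Fraction

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def cluster_weights(peptides, key=lambda peptide: peptide[:8]):
    """Weight each peptide by 1 / (size of its cluster).

    Assumption: a cluster is the set of peptides sharing the same key,
    here the first eight residues.
    """
    sizes = Counter(key(peptide) for peptide in peptides)
    return [Fraction(1, sizes[key(peptide)]) for peptide in peptides]


def weighted_frequencies(peptides, weights, position):
    """Weighted frequency of each amino acid at a 1-based position."""
    totals = defaultdict(Fraction)
    for peptide, weight in zip(peptides, weights):
        totals[peptide[position - 1]] += weight
    total_weight = sum(weights)
    return {aa: w / total_weight for aa, w in totals.items()}


weights = cluster_weights(PEPTIDES)
p1 = weighted_frequencies(PEPTIDES, weights, 1)
for aa, freq in sorted(p1.items()):
    print(aa, freq)
```

The total weight is 6 (one for the cluster plus five singletons), so A drops from 6/10 to 2/6 while G rises from 2/10 to 2/6.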
Sequence weighting
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Pseudo counts
• I is not found at position P9.
Does this mean that I is
forbidden (P(I) = 0)?
• No! Use the BLOSUM substitution
matrix to estimate the pseudo
frequency of I at P9
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
The Blosum matrix
Entry in row a, column b is the frequency q(b|a); each row sums to about 1.

       A     R     N     D     C     Q     E     G     H     I     L     K     M     F     P     S     T     W     Y     V
A   0.29  0.03  0.03  0.03  0.02  0.03  0.04  0.08  0.01  0.04  0.06  0.04  0.02  0.02  0.03  0.09  0.05  0.01  0.02  0.07
R   0.04  0.34  0.04  0.03  0.01  0.05  0.05  0.03  0.02  0.02  0.05  0.12  0.02  0.02  0.02  0.04  0.03  0.01  0.02  0.03
N   0.04  0.04  0.32  0.08  0.01  0.03  0.05  0.07  0.03  0.02  0.03  0.05  0.01  0.02  0.02  0.07  0.05  0.00  0.02  0.03
D   0.04  0.03  0.07  0.40  0.01  0.03  0.09  0.05  0.02  0.02  0.03  0.04  0.01  0.01  0.02  0.05  0.04  0.00  0.01  0.02
C   0.07  0.02  0.02  0.02  0.48  0.01  0.02  0.03  0.01  0.04  0.07  0.02  0.02  0.02  0.02  0.04  0.04  0.00  0.01  0.06
Q   0.06  0.07  0.04  0.05  0.01  0.21  0.10  0.04  0.03  0.03  0.05  0.09  0.02  0.01  0.02  0.06  0.04  0.01  0.02  0.04
E   0.06  0.05  0.04  0.09  0.01  0.06  0.30  0.04  0.03  0.02  0.04  0.08  0.01  0.02  0.03  0.06  0.04  0.01  0.02  0.03
G   0.08  0.02  0.04  0.03  0.01  0.02  0.03  0.51  0.01  0.02  0.03  0.03  0.01  0.02  0.02  0.05  0.03  0.01  0.01  0.02
H   0.04  0.05  0.05  0.04  0.01  0.04  0.05  0.04  0.35  0.02  0.04  0.05  0.02  0.03  0.02  0.04  0.03  0.01  0.06  0.02
I   0.05  0.02  0.01  0.02  0.02  0.01  0.02  0.02  0.01  0.27  0.17  0.02  0.04  0.04  0.01  0.03  0.04  0.01  0.02  0.18
L   0.04  0.02  0.01  0.02  0.02  0.02  0.02  0.02  0.01  0.12  0.38  0.03  0.05  0.05  0.01  0.02  0.03  0.01  0.02  0.10
K   0.06  0.11  0.04  0.04  0.01  0.05  0.07  0.04  0.02  0.03  0.04  0.28  0.02  0.02  0.03  0.05  0.04  0.01  0.02  0.03
M   0.05  0.03  0.02  0.02  0.02  0.03  0.03  0.03  0.02  0.10  0.20  0.04  0.16  0.05  0.02  0.04  0.04  0.01  0.02  0.09
F   0.03  0.02  0.02  0.02  0.01  0.01  0.02  0.03  0.02  0.06  0.11  0.02  0.03  0.39  0.01  0.03  0.03  0.02  0.09  0.06
P   0.06  0.03  0.02  0.03  0.01  0.02  0.04  0.04  0.01  0.03  0.04  0.04  0.01  0.01  0.49  0.04  0.04  0.00  0.01  0.03
S   0.11  0.04  0.05  0.05  0.02  0.03  0.05  0.07  0.02  0.03  0.04  0.05  0.02  0.02  0.03  0.22  0.08  0.01  0.02  0.04
T   0.07  0.04  0.04  0.04  0.02  0.03  0.04  0.04  0.01  0.05  0.07  0.05  0.02  0.02  0.03  0.09  0.25  0.01  0.02  0.07
W   0.03  0.02  0.02  0.02  0.01  0.02  0.02  0.03  0.02  0.03  0.05  0.02  0.02  0.06  0.01  0.02  0.02  0.49  0.07  0.03
Y   0.04  0.03  0.02  0.02  0.01  0.02  0.03  0.02  0.05  0.04  0.07  0.03  0.02  0.13  0.02  0.03  0.03  0.03  0.32  0.05
V   0.07  0.02  0.02  0.02  0.02  0.02  0.02  0.02  0.01  0.16  0.13  0.03  0.03  0.04  0.02  0.03  0.05  0.01  0.02  0.27

Some amino acids are highly conserved (e.g. C),
some have a high chance of mutation (e.g. I)
What is a pseudo count?
       A     R     N     D     C     Q     E     G     H     I     L     K     M     F     P     S     T     W     Y     V
A   0.29  0.03  0.03  0.03  0.02  0.03  0.04  0.08  0.01  0.04  0.06  0.04  0.02  0.02  0.03  0.09  0.05  0.01  0.02  0.07
R   0.04  0.34  0.04  0.03  0.01  0.05  0.05  0.03  0.02  0.02  0.05  0.12  0.02  0.02  0.02  0.04  0.03  0.01  0.02  0.03
N   0.04  0.04  0.32  0.08  0.01  0.03  0.05  0.07  0.03  0.02  0.03  0.05  0.01  0.02  0.02  0.07  0.05  0.00  0.02  0.03
D   0.04  0.03  0.07  0.40  0.01  0.03  0.09  0.05  0.02  0.02  0.03  0.04  0.01  0.01  0.02  0.05  0.04  0.00  0.01  0.02
C   0.07  0.02  0.02  0.02  0.48  0.01  0.02  0.03  0.01  0.04  0.07  0.02  0.02  0.02  0.02  0.04  0.04  0.00  0.01  0.06
….
Y   0.04  0.03  0.02  0.02  0.01  0.02  0.03  0.02  0.05  0.04  0.07  0.03  0.02  0.13  0.02  0.03  0.03  0.03  0.32  0.05
V   0.07  0.02  0.02  0.02  0.02  0.02  0.02  0.02  0.01  0.16  0.13  0.03  0.03  0.04  0.02  0.03  0.05  0.01  0.02  0.27
• Say I observe V at P1
• Knowing that V at P1 binds, what is the probability that
a peptide could have I at P1?
• P(I|V) = 0.16
Pseudo count estimation

• Calculate observed amino acid
frequencies $f_a$
• Pseudo frequency for amino acid b
$g_b = \sum_a f_a \cdot q_{b|a}$
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
• Example
$g_I = 0.2 \cdot q_{I|M} + 0.1 \cdot q_{I|N} + \ldots + 0.3 \cdot q_{I|V} + 0.1 \cdot q_{I|L}$
$g_I = 0.2 \cdot 0.04 + 0.1 \cdot 0.01 + \ldots + 0.3 \cdot 0.18 + 0.1 \cdot 0.17 = 0.09$
Weight on pseudo count
• Pseudo counts are important when only
limited data is available
• With large data sets only “true”
observation should count
$p_a = \frac{\alpha \cdot f_a + \beta \cdot g_a}{\alpha + \beta}$
• $\alpha$ is the effective number of sequences
(N-1), $\beta$ is the weight on prior
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Weight on pseudo count
• Example
$p_a = \frac{\alpha \cdot f_a + \beta \cdot g_a}{\alpha + \beta}$
• If $\alpha$ is large, p ≈ f and only the observed
data define the motif
• If $\alpha$ is small, p ≈ g and the pseudo counts
(or prior) define the motif
• $\beta$ is normally in the range 50-200
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
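The balance between observed and pseudo frequencies can be illustrated numerically. A minimal sketch, assuming alpha = N - 1 = 9 for the ten peptides without any sequence weighting, beta = 50 from the lower end of the quoted range, and the pseudo frequency g = 0.09 for I at P9 from the earlier example; `combine` is an illustrative name.

```python
def combine(f, g, alpha, beta):
    """p = (alpha * f + beta * g) / (alpha + beta)."""
    return (alpha * f + beta * g) / (alpha + beta)


# I at P9: never observed (f = 0), pseudo frequency g = 0.09.
f_I, g_I = 0.0, 0.09

small_data = combine(f_I, g_I, alpha=9, beta=50)       # ten peptides
large_data = combine(f_I, g_I, alpha=10_000, beta=50)  # hypothetical large set

print(f"small data set: p = {small_data:.4f}")
print(f"large data set: p = {large_data:.4f}")
```

With few sequences the estimate stays close to the pseudo frequency; with many sequences it approaches the observed frequency of zero.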
Sequence weighting and pseudo
counts
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
RLLDDTPEV 84nM
GLLGNVSTV 23nM
ALAKAAAAL 309nM
P7(P) and P7(S) > 0
Position specific weighting
• We know that positions 2 and
9 are anchor positions for
most MHC binding motifs
– Increase weight on high
information positions
• Motif found on large data set
Weight matrices
• Estimate amino acid frequencies from alignment including
sequence weighting and pseudo count
Pos      A     R     N     D     C     Q     E     G     H     I     L     K     M     F     P     S     T     W     Y     V
1     0.08  0.06  0.02  0.03  0.02  0.02  0.03  0.08  0.02  0.08  0.11  0.06  0.04  0.06  0.02  0.09  0.04  0.01  0.04  0.08
2     0.04  0.01  0.01  0.01  0.01  0.01  0.02  0.02  0.01  0.11  0.44  0.02  0.06  0.03  0.01  0.02  0.05  0.00  0.01  0.10
3     0.08  0.04  0.05  0.07  0.02  0.03  0.03  0.08  0.02  0.05  0.11  0.03  0.03  0.06  0.04  0.06  0.05  0.03  0.05  0.07
4     0.08  0.05  0.03  0.10  0.01  0.05  0.08  0.13  0.01  0.05  0.06  0.05  0.01  0.03  0.08  0.06  0.04  0.02  0.01  0.05
5     0.06  0.04  0.05  0.03  0.01  0.04  0.05  0.11  0.03  0.04  0.09  0.04  0.02  0.06  0.06  0.04  0.05  0.02  0.05  0.08
6     0.06  0.03  0.03  0.03  0.03  0.03  0.04  0.06  0.02  0.10  0.14  0.04  0.03  0.05  0.04  0.06  0.06  0.01  0.03  0.13
7     0.10  0.02  0.04  0.04  0.02  0.03  0.04  0.05  0.04  0.08  0.12  0.02  0.03  0.06  0.07  0.06  0.05  0.03  0.03  0.08
8     0.05  0.07  0.04  0.03  0.01  0.04  0.06  0.06  0.03  0.06  0.13  0.06  0.02  0.05  0.04  0.08  0.07  0.01  0.04  0.05
9     0.08  0.02  0.01  0.01  0.02  0.02  0.03  0.02  0.01  0.10  0.23  0.03  0.02  0.04  0.01  0.04  0.04  0.00  0.02  0.25
• What do the numbers mean?
– P2(V) > P2(M). Does this mean that V enables binding more than M?
– In nature not all amino acids are found equally often
• qM = 0.025, qV = 0.073
• Finding 7% V is hence not significant, but 2% M highly significant
• In nature V is found more often than M, so we must somehow
rescale with the background
Weight matrices
A weight matrix is given as
Wij = log(pij/qj)
– where i is a position in the motif, and j an amino acid. qj is the background
frequency for amino acid j.
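The effect of rescaling with the background can be seen for V and M at P2. This is a minimal sketch using the P2 frequencies from the table (0.10 for V, 0.06 for M) and the quoted background frequencies; the base-2 logarithm is an assumption, since the slide does not state the base, and `log_odds` is an illustrative name.

```python
import math


def log_odds(p, q):
    """Weight-matrix element log(p / q); base 2 is assumed here."""
    return math.log2(p / q)


# Frequencies at P2 from the table, background frequencies from the slide.
w_V = log_odds(p=0.10, q=0.073)
w_M = log_odds(p=0.06, q=0.025)

print(f"W(P2, V) = {w_V:.2f}")
print(f"W(P2, M) = {w_M:.2f}")
```

Although V is more frequent than M at P2, M receives the higher weight because it is rarer in the background.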
Pos      A     R     N     D     C     Q     E     G     H     I     L     K     M     F     P     S     T     W     Y     V
1      0.6   0.4  -3.5  -2.4  -0.4  -1.9  -2.7   0.3  -1.1   1.0   0.3   0.0   1.4   1.2  -2.7   1.4  -1.2  -2.0   1.1   0.7
2     -1.6  -6.6  -6.5  -5.4  -2.5  -4.0  -4.7  -3.7  -6.3   1.0   5.1  -3.7   3.1  -4.2  -4.3  -4.2  -0.2  -5.9  -3.8   0.4
3      0.2  -1.3   0.1   1.5   0.0  -1.8  -3.3   0.4   0.5  -1.0   0.3  -2.5   1.2   1.0  -0.1  -0.3  -0.5   3.4   1.6   0.0
4     -0.1  -0.1  -2.0   2.0  -1.6   0.5   0.8   2.0  -3.3   0.1  -1.7  -1.0  -2.2  -1.6   1.7  -0.6  -0.2   1.3  -6.8  -0.7
5     -1.6  -0.1   0.1  -2.2  -1.2   0.4  -0.5   1.9   1.2  -2.2  -0.5  -1.3  -2.2   1.7   1.2  -2.5  -0.1   1.7   1.5   1.0
6     -0.7  -1.4  -1.0  -2.3   1.1  -1.3  -1.4  -0.2  -1.0   1.8   0.8  -1.9   0.2   1.0  -0.4  -0.6   0.4  -0.5  -0.0   2.1
7      1.1  -3.8  -0.2  -1.3   1.3  -0.3  -1.3  -1.4   2.1   0.6   0.7  -5.0   1.1   0.9   1.3  -0.5  -0.9   2.9  -0.4   0.5
8     -2.2   1.0  -0.8  -2.9  -1.4   0.4   0.1  -0.4   0.2  -0.0   1.1  -0.5  -0.5   0.7  -0.3   0.8   0.8  -0.7   1.3  -1.1
9     -0.2  -3.5  -6.1  -4.5   0.7  -0.8  -2.5  -4.0  -2.6   0.9   2.8  -3.0  -1.8  -1.4  -6.2  -1.9  -1.6  -4.9  -1.6   4.5
W is an L x 20 matrix, where L is the motif length
Scoring a sequence to a weight matrix
• Score sequences to weight matrix by looking up
and adding L values from the matrix
(Weight matrix as shown on the previous slide.)
Peptide     Score   Affinity
RLLDDTPEV   11.9    84 nM
GLLGNVSTV   14.7    23 nM
ALAKAAAAL    4.3    309 nM
Which peptide is most
likely to bind?
Which peptide second?
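The scoring step can be sketched in Python: look up one value per position and add them. The matrix below is the weight matrix from the slides, with one row per peptide position and columns in the order A R N D C Q E G H I L K M F P S T W Y V. The names `WEIGHTS`, `COLUMN` and `score` are illustrative.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
COLUMN = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Rows are positions P1..P9; columns follow AMINO_ACIDS.
WEIGHTS = [
    [0.6, 0.4, -3.5, -2.4, -0.4, -1.9, -2.7, 0.3, -1.1, 1.0,
     0.3, 0.0, 1.4, 1.2, -2.7, 1.4, -1.2, -2.0, 1.1, 0.7],
    [-1.6, -6.6, -6.5, -5.4, -2.5, -4.0, -4.7, -3.7, -6.3, 1.0,
     5.1, -3.7, 3.1, -4.2, -4.3, -4.2, -0.2, -5.9, -3.8, 0.4],
    [0.2, -1.3, 0.1, 1.5, 0.0, -1.8, -3.3, 0.4, 0.5, -1.0,
     0.3, -2.5, 1.2, 1.0, -0.1, -0.3, -0.5, 3.4, 1.6, 0.0],
    [-0.1, -0.1, -2.0, 2.0, -1.6, 0.5, 0.8, 2.0, -3.3, 0.1,
     -1.7, -1.0, -2.2, -1.6, 1.7, -0.6, -0.2, 1.3, -6.8, -0.7],
    [-1.6, -0.1, 0.1, -2.2, -1.2, 0.4, -0.5, 1.9, 1.2, -2.2,
     -0.5, -1.3, -2.2, 1.7, 1.2, -2.5, -0.1, 1.7, 1.5, 1.0],
    [-0.7, -1.4, -1.0, -2.3, 1.1, -1.3, -1.4, -0.2, -1.0, 1.8,
     0.8, -1.9, 0.2, 1.0, -0.4, -0.6, 0.4, -0.5, -0.0, 2.1],
    [1.1, -3.8, -0.2, -1.3, 1.3, -0.3, -1.3, -1.4, 2.1, 0.6,
     0.7, -5.0, 1.1, 0.9, 1.3, -0.5, -0.9, 2.9, -0.4, 0.5],
    [-2.2, 1.0, -0.8, -2.9, -1.4, 0.4, 0.1, -0.4, 0.2, -0.0,
     1.1, -0.5, -0.5, 0.7, -0.3, 0.8, 0.8, -0.7, 1.3, -1.1],
    [-0.2, -3.5, -6.1, -4.5, 0.7, -0.8, -2.5, -4.0, -2.6, 0.9,
     2.8, -3.0, -1.8, -1.4, -6.2, -1.9, -1.6, -4.9, -1.6, 4.5],
]


def score(peptide):
    """Sum the matrix values for each residue at its position."""
    if len(peptide) != len(WEIGHTS):
        raise ValueError("peptide length must match the motif length")
    return sum(row[COLUMN[aa]] for row, aa in zip(WEIGHTS, peptide))


peptides = ["RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL"]
ranked = sorted(peptides, key=score, reverse=True)
for peptide in ranked:
    print(peptide, round(score(peptide), 1))
```

The scores reproduce the values on the slide (11.9, 14.7 and 4.3), and the ranking GLLGNVSTV, RLLDDTPEV, ALAKAAAAL follows the measured affinities of 23 nM, 84 nM and 309 nM.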