Immunological Bioinformatics
Introduction to the immune system

Vaccination
• Administration of a substance to a person with the purpose of preventing a disease
• Traditionally composed of a killed or weakened microorganism
• Vaccination works by creating a type of immune response that enables the memory cells to later respond to a similar organism before it can cause disease

[Figure 1-20]

Effectiveness of vaccines
• 1958: start of the smallpox eradication program

The Immune System
• The innate immune system
• The adaptive immune system

The innate immune system
• Unspecific
• Antigen independent
• Immediate response
• No training/selection, hence no memory
• Pathogen independent (but the response might be pathogen-type dependent)

The adaptive immune system
• Pathogen specific
  – Humoral
  – Cellular
[Images: Parasite, http://tpeeaupotable.ifrance.com/ma%20photo/bilharzoze.jpg; Virus, http://en.wikipedia.org/wiki/Image:Aids_virus.jpg; Bacteria, http://www.uni-heidelberg.de/zentral/ztl/grafiken_bilder/bilder/e-coli.jpg]

Adaptive immune response
• Signal induced
  – Pathogens
    • Antigens
      – Epitopes
[Figure labels: B cell, T cell]

Diversity is a hallmark of the (adaptive) immune system
• Diversity of lymphocytes
  – Huge diversity within a host
  – At least 10^8 different T and B cell clones
• Receptors made by recombination and N-additions, and
• Somatic mutation during the immune response
• Repertoires are (partly) random
  – Randomness requires self tolerance

[Figure 1-14]

The role of lymphocytes

Humoral immunity
[Cartoon by Eric Reits]

Antibody-antigen interaction
• The antibody recognizes structural properties of the surface of the antigen
[Figure labels: Antigen, Fab, Epitope, Paratope, Antibody]

Antibody effect
[Figure labels: Virus or toxin, Neutralizing antibodies]

Cellular immune response
[Cartoon by Eric Reits]

MHC-I molecules present peptides on the surface of most cells

CTL response
[Figure labels: Healthy cell, MHC-I, Virus-infected cell]

The death of an infected cell
[Video]

Polymorphism of MHC
• Within a host: a limited number of loci (genes)
  – only 6 different class I molecules (two each of A, B and C)
  – only 12 different class II molecules
• Within a population: > 100 alleles per locus

More MHC molecules: more diversity in the presented peptides
• 1% probability that an MHC molecule presents a peptide
• Different hosts sample different peptides from the same pathogen

Immunological benefits of MHC polymorphism
• Heterozygote advantage
  – Heterozygotes have a selective advantage because they can present more peptides (Hughes.n88)
• Coevolution
  – Pathogens avoid presentation on common MHC alleles (HIV)
  – Frequency-dependent selection

[Figure 5-13]

Heterozygote disadvantage! (for vaccine design)
• Few human beings share the same set of HLA alleles
  – Different persons react differently to infection by the same pathogen
• A CTL-based vaccine must include epitopes specific for each HLA allele in a population
  – A CTL-based vaccine must consist of ~800 HLA class I epitopes and ~400 class II epitopes

HLA specificity clustering
[Figure labels: A0201, A0101, A6802, B0702]

HLA polymorphism: supertypes
• Each HLA molecule within a supertype binds essentially the same peptides
• Nine major HLA class I supertypes have been defined
  – HLA-A1, A2, A3, A24, B7, B27, B44, B58, and B62
• And maybe add three more
  – HLA-A26, HLA-B8, and HLA-B39
=> A CTL-based vaccine must consist of 9-12 HLA class I epitopes
Sette et al., Immunogenetics (1999) 50:201-212

Summary
• The adaptive immune system is extremely diverse
  – An immune response can be raised against anything foreign!
• Antibodies define the humoral response
  – Antibodies recognize structural properties on the surface of extracellular antigens
• T cells define the cellular response
  – CTLs kill cells that present MHC molecules bound to foreign peptides of intracellular origin

MHC class I with peptide
[Figure label: Anchor positions]

What makes a peptide a potential and effective epitope?
• Part of a pathogen protein
• Successful processing
  – Proteasome cleavage
  – TAP binding
• Binds to MHC molecule
• Protein function and expression
  – Early in replication
  – Highly expressed proteins are more likely to generate immunogens
• Sequence conservation in evolution

Prediction of HLA binding specificity
Historical overview
• Simple motifs
  – Allowed/non-allowed amino acids
• Extended motifs
  – Amino acid preferences (SYFPEITHI)
  – Anchor/preferred/other amino acids
• Hidden Markov models
  – Peptide statistics from sequence alignment
• Neural networks
  – Can take sequence correlations into account

SYFPEITHI predictions
• Extended motifs based on peptides from the literature and peptides eluted from cells expressing specific HLAs (i.e., binding peptides)
• The scoring scheme is not readily accessible
• Positions defined as anchor or auxiliary anchor positions are weighted differently (higher)
• The final score is the sum of the scores at each position
• Predictions can be made for several HLA-A, -B and DRB1 alleles, as well as some mouse K, D and L alleles

BIMAS
• Matrix made from peptides with a measured T1/2 for the MHC-peptide complex
• The matrices are available on the website
• The final score is the product of the scores at each position in the matrix, multiplied by a constant that differs for each MHC, to give a prediction of the T1/2
• Predictions can be obtained for several HLA-A, -B and -C alleles, mouse K, D and L alleles, and a single cattle MHC
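
The two scoring schemes described above differ only in how per-position values are combined: SYFPEITHI adds them, BIMAS multiplies them and scales by an allele-specific constant. The sketch below illustrates that difference only. The function names, the toy matrices, and the constant are all made up for illustration and are not the published SYFPEITHI or BIMAS values.

```python
from math import prod

def additive_score(peptide, matrix, default=0.0):
    """Sum of per-position scores (the SYFPEITHI-style combination rule)."""
    return sum(pos.get(aa, default) for pos, aa in zip(matrix, peptide))

def multiplicative_score(peptide, matrix, constant, default=1.0):
    """Product of per-position coefficients times a per-allele constant
    (the BIMAS-style combination rule)."""
    return constant * prod(pos.get(aa, default) for pos, aa in zip(matrix, peptide))

# Hypothetical 9-mer matrices: only the anchor positions P2 and P9 carry
# non-default values. All numbers here are invented for the example.
TOY_ADDITIVE = [{} for _ in range(9)]
TOY_ADDITIVE[1] = {"L": 10, "M": 10, "I": 8}   # P2
TOY_ADDITIVE[8] = {"V": 10, "L": 10, "I": 8}   # P9

TOY_COEFFS = [{} for _ in range(9)]
TOY_COEFFS[1] = {"L": 8.0, "M": 4.0}           # P2
TOY_COEFFS[8] = {"V": 4.0, "L": 2.0}           # P9
TOY_CONSTANT = 0.5                             # hypothetical allele constant

if __name__ == "__main__":
    print(additive_score("GILGFVFTL", TOY_ADDITIVE))                      # 18.0
    print(multiplicative_score("TLNAWVKVV", TOY_COEFFS, TOY_CONSTANT))    # 16.0
```

With a product rule, one unfavourable coefficient (a value well below 1) scales down the whole score, whereas with a sum it only subtracts a fixed amount.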
Sequence information SLLPAIVEL LLDVPTAAV HLIDYLVTS ILFGHENRV LERPGGNEI PLDGEYFTL ILGFVFTLT KLVALGINA KTWGQYWQV SLLAPGAKQ ILTVILGVL TGAPVTYST GAGIGVAVL KARDPHSGH AVFDRKSDA GLCTLVAML VLHDDLLEA ISNDVCAQV YTAFTIPSI NMFTPYIGV VVLGVVFGI GLYDGMEHL EAAGIGILT YLSTAFARV FLDEFMEGV AAGIGILTV AAGIGILTV YLLPAIVHI VLFRGGPRG ILAPPVVKL ILMEHIHKL ALSNLEVKL GVLVGVALI LLFGYPVYV DLMGYIPLV TITDQVPFS KIFGSLAFL KVLEYVIKV VIYQYMDDL IAGIGILAI KACDPHSGH LLDFVRFMG FIDSYICQV LMWITQCFL VKTDGNPPE RLMKQDFSV LMIIPLINV ILHNGAYSL KMVELVHFL TLDSQVMSL YLLEMLWRL ALQPGTALL FLPSDFFPS FLPSDFFPS TLWVDPYEV MVDGTLLLL ALFPQLVIL ILDQKINEV ALNELLQHV RTLDKVLEV GLSPTVWLS RLVTLKDIV AFHHVAREL ELVSEFSRM FLWGPRALV VLPDVFIRC LIVIGILIL ACDPHSGHF VLVKSPNHV IISAVVGIL SLLMWITQC SVYDFFVWL RLPRIFCSC TLFIGSHVV MIMVKCWMI YLQLVFGIE STPPPGTRV SLDDYNHLV VLDGLDVLL SVRDRLARL AAGIGILTV GLVPFLVSV YMNGTMSQV GILGFVFTL SLAGGIIGV DLERKVESL HLSTAFARV WLSLLVPFV MLLAVLYCL YLNKIQNSL KLTPLCVTL GLSRYVARL VLPDVFIRC LAGIGLIAA SLYNTVATL GLAPPQHLI VMAGVGSPY QLSLLMWIT FLYGALLLA FLWGPRAYA SLVIVTTFV MLGTHTMEV MLMAQEALA KVAELVHFL RTLDKVLEV SLYSFPEPE SLREWLLRI FLPSDFFPS KLLEPVLLL MLLSVPLLL STNRQSGRQ LLIENVASL FLGENISNF RLDSYVRSL FLPSDFFPS AAGIGILTV MMRKLAILS VLYRYGSFS FLLTRILTI AVGIGIAVV VDGIGILTI RGPGRAFVT LLGRNSFEV LLWTLVVLL LLGATCMFV VLFSSDFRI RLLQETELV VLQWASLAV MLGTHTMEV LMAQEALAF IMIGVLVGV GLPVEYLQV ALYVDSLFF LLSAWILTA AAGIGILTV LLDVPTAAV SLLGLLVEV GLDVLTAKV FLLWATAEA ALSDHHIYL YMNGTMSQV CLGGLLTMV YLEPGPVTA AIMDKNIIL YIGEVLVSV HLGNVKYLV LVVLGLLAV GAGIGVLTA NLVPMVATV PLTFGWCYK SVRDRLARL RLTRFLSRV LMWAKIGPV SLFEGIDFY ILAKFLHWL SLADTNSLA VYDGREHTV ALCRWGLLL KLIANNTRV SLLQHLIGL AAGIGILTV FLWGPRALV LLDVPTAAV ALLPPINIL RILGAVAKV SLPDFGISY GLSEFTEYL GILGFVFTL FIAGNSAYE LLDGTATLR IMDKNIILK CINGVCWTV GIAGGLALL ALGLGLLPV AAGIGIIQI GLHCYEQLV VLEWRFDSR LLMDCSGSI YMDGTMSQV SLLLELEEV SLDQSVVEL STAPPHVNV LLWAARPRL YLSGANLNL LLFAGVQCQ FIYAGSLSA ELTLGEFLK AVPDEIPPL ETVSEQSNV LLDVPTAAV TLIKIQHTL QVCERIPTI KKREEAPSL STAPPAHGV ILKEPVHGV KLGEFYNQM ITDQVPFSV 
SMVGNWAKV VMNILLQYV GLQDCTMLV GIGIGVLAA QAGIGILLA PLKQHFQIV TLNAWVKVV CLTSTVQLV FLTPKKLQC SLSRFSWGA RLNMFTPYI LLLLTVLTV GVALQTMKQ RMFPNAPYL VLLCESTAV KLVANNTRL MINAYLDKL FAYDGKDYI ITLWQRPLV

Sequence Information
• Say that a peptide must have L at P2 in order to bind, and that A, F, W, and Y are found at P1. Which position has the most information?
• How many questions do I need to ask to tell if a peptide binds, looking at only P1 or P2?
  – P1: 4 questions (at most)
  – P2: 1 question (L or not)
  – P2 has the most information
• Calculate $p_a$ at each position
• Entropy: $S = -\sum_a p_a \log_2(p_a)$
• Information content: $I = \log_2(20) + \sum_a p_a \log_2(p_a)$
• Conserved positions
  – $p_V = 1$, $p_{\neg V} = 0$ => $S = 0$, $I = \log_2(20)$
• Mutable positions
  – $p_a = 1/20$ for all $a$ => $S = \log_2(20)$, $I = 0$

Information content
$S = -\sum_a p_a \log_2(p_a)$
$I = \log_2(20) + \sum_a p_a \log_2(p_a)$

Amino acid frequencies $p_a$ per position (rows: amino acids; columns: positions 1-9)
     1    2    3    4    5    6    7    8    9
A 0.10 0.07 0.08 0.07 0.04 0.04 0.14 0.05 0.07
R 0.06 0.00 0.03 0.04 0.04 0.03 0.01 0.09 0.01
N 0.01 0.00 0.05 0.02 0.04 0.03 0.03 0.04 0.00
D 0.02 0.01 0.10 0.11 0.04 0.01 0.03 0.01 0.00
C 0.01 0.01 0.02 0.01 0.01 0.02 0.02 0.01 0.02
Q 0.02 0.00 0.02 0.04 0.04 0.03 0.03 0.05 0.02
E 0.02 0.01 0.01 0.08 0.05 0.03 0.04 0.07 0.02
G 0.09 0.01 0.12 0.15 0.16 0.04 0.03 0.05 0.01
H 0.01 0.00 0.02 0.01 0.04 0.02 0.05 0.02 0.01
I 0.07 0.08 0.03 0.10 0.02 0.14 0.07 0.04 0.08
L 0.11 0.59 0.12 0.04 0.08 0.13 0.15 0.14 0.26
K 0.06 0.01 0.01 0.03 0.04 0.02 0.01 0.04 0.01
M 0.04 0.07 0.03 0.01 0.01 0.03 0.03 0.02 0.01
F 0.08 0.01 0.05 0.02 0.06 0.07 0.07 0.05 0.02
P 0.01 0.00 0.06 0.09 0.10 0.03 0.06 0.05 0.00
S 0.11 0.01 0.06 0.07 0.02 0.05 0.07 0.08 0.04
T 0.03 0.06 0.04 0.04 0.06 0.08 0.04 0.10 0.02
W 0.01 0.00 0.04 0.02 0.02 0.01 0.03 0.01 0.00
Y 0.05 0.01 0.04 0.00 0.05 0.03 0.02 0.04 0.01
V 0.08 0.08 0.07 0.05 0.09 0.15 0.08 0.03 0.38
Entropy S per position:     3.96 2.16 4.06 3.87 4.04 3.92 3.98 4.04 2.78
Information I per position: 0.37 2.16 0.26 0.45 0.28 0.40 0.34 0.28 1.55

Example (L at P9): $p_L = 0.26$, $\log_2(0.26) = -1.94$, $-p_L \log_2(p_L) = 0.26 \cdot 1.94 = 0.51$

Sequence logos
• The height of a column equals I
• The relative height of a letter is p
• A highly useful tool for visualizing sequence motifs
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
[Figure: sequence logo for HLA-A0201, with the high-information positions marked]

Characterizing a binding motif from small data sets
10 MHC restricted peptides
  ALAKAAAAM
  ALAKAAAAN
  ALAKAAAAR
  ALAKAAAAT
  ALAKAAAAV
  GMNERPILT
  GILGFVFTM
  TLNAWVKVV
  KLNEPVLLL
  AVVPFIVSV
What can we learn?
1. A at P1 favors binding?
2. I is not allowed at P9?
3. K at P4 favors binding?
4. Which positions are important for binding?

Simple motifs
Yes/No rules (same 10 MHC restricted peptides)
[AGTK]1 [LMIV]2 [ANLV]3 ... [MNRTVL]9 (the number after each bracket is the position)
• Only 11 of 212 peptides identified!
• Need more flexible rules
  – If a peptide does not fit at P1 but fits at P2, it is still OK
• Not all positions are equally important
  – We know that P2 and P9 determine binding more than other positions
• Cannot discriminate between good and very good binders

Simple motifs
Yes/No rules (same 10 MHC restricted peptides)
[AGTK]1 [LMIV]2 [ANLV]3 ... [AIFKLV]7 ... [MNRTVL]9
• Example
  RLLDDTPEV  84 nM
  GLLGNVSTV  23 nM
  ALAKAAAAL  309 nM
• The first two peptides do not fit the motif.
They are all good binders (affinity < 500 nM).

Extended motifs
• Fitness of an amino acid (aa) at each position is given by P(aa)
• Example, P1:
  – $P_A = 6/10$
  – $P_G = 2/10$
  – $P_T = P_K = 1/10$
  – $P_C = P_D = \dots = P_V = 0$
• Problems
  – Few data
  – Data redundancy/duplication
10 MHC restricted peptides
  ALAKAAAAM
  ALAKAAAAN
  ALAKAAAAR
  ALAKAAAAT
  ALAKAAAAV
  GMNERPILT
  GILGFVFTM
  TLNAWVKVV
  KLNEPVLLL
  AVVPFIVSV
Test peptides
  RLLDDTPEV  84 nM
  GLLGNVSTV  23 nM
  ALAKAAAAL  309 nM

Sequence information
Raw sequence counting (same 10 peptides)

Sequence weighting
• Poor or biased sampling of sequence space
• The five similar sequences ALAKAAAAM, ALAKAAAAN, ALAKAAAAR, ALAKAAAAT and ALAKAAAAV each get weight 1/5
• Example, P1:
  – $P_A = 2/6$
  – $P_G = 2/6$
  – $P_T = P_K = 1/6$
  – $P_C = P_D = \dots = P_V = 0$

Pseudo counts
• I is not found at position P9. Does this mean that I is forbidden (P(I) = 0)?
• No!
Use the Blosum substitution matrix to estimate the pseudo frequency of I at P9
10 MHC restricted peptides
  ALAKAAAAM
  ALAKAAAAN
  ALAKAAAAR
  ALAKAAAAT
  ALAKAAAAV
  GMNERPILT
  GILGFVFTM
  TLNAWVKVV
  KLNEPVLLL
  AVVPFIVSV

The Blosum matrix
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A 0.29 0.04 0.04 0.04 0.07 0.06 0.06 0.08 0.04 0.05 0.04 0.06 0.05 0.03 0.06 0.11 0.07 0.03 0.04 0.07
R 0.03 0.34 0.04 0.03 0.02 0.07 0.05 0.02 0.05 0.02 0.02 0.11 0.03 0.02 0.03 0.04 0.04 0.02 0.03 0.02
N 0.03 0.04 0.32 0.07 0.02 0.04 0.04 0.04 0.05 0.01 0.01 0.04 0.02 0.02 0.02 0.05 0.04 0.02 0.02 0.02
D 0.03 0.03 0.08 0.40 0.02 0.05 0.09 0.03 0.04 0.02 0.02 0.04 0.02 0.02 0.03 0.05 0.04 0.02 0.02 0.02
C 0.02 0.01 0.01 0.01 0.48 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.02 0.01 0.01 0.02 0.02 0.01 0.01 0.02
Q 0.03 0.05 0.03 0.03 0.01 0.21 0.06 0.02 0.04 0.01 0.02 0.05 0.03 0.01 0.02 0.03 0.03 0.02 0.02 0.02
E 0.04 0.05 0.05 0.09 0.02 0.10 0.30 0.03 0.05 0.02 0.02 0.07 0.03 0.02 0.04 0.05 0.04 0.02 0.03 0.02
G 0.08 0.03 0.07 0.05 0.03 0.04 0.04 0.51 0.04 0.02 0.02 0.04 0.03 0.03 0.04 0.07 0.04 0.03 0.02 0.02
H 0.01 0.02 0.03 0.02 0.01 0.03 0.03 0.01 0.35 0.01 0.01 0.02 0.02 0.02 0.01 0.02 0.01 0.02 0.05 0.01
I 0.04 0.02 0.02 0.02 0.04 0.03 0.02 0.02 0.02 0.27 0.12 0.03 0.10 0.06 0.03 0.03 0.05 0.03 0.04 0.16
L 0.06 0.05 0.03 0.03 0.07 0.05 0.04 0.03 0.04 0.17 0.38 0.04 0.20 0.11 0.04 0.04 0.07 0.05 0.07 0.13
K 0.04 0.12 0.05 0.04 0.02 0.09 0.08 0.03 0.05 0.02 0.03 0.28 0.04 0.02 0.04 0.05 0.05 0.02 0.03 0.03
M 0.02 0.02 0.01 0.01 0.02 0.02 0.01 0.01 0.02 0.04 0.05 0.02 0.16 0.03 0.01 0.02 0.02 0.02 0.02 0.03
F 0.02 0.02 0.02 0.01 0.02 0.01 0.02 0.02 0.03 0.04 0.05 0.02 0.05 0.39 0.01 0.02 0.02 0.06 0.13 0.04
P 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 0.02 0.01 0.01 0.03 0.02 0.01 0.49 0.03 0.03 0.01 0.02 0.02
S 0.09 0.04 0.07 0.05 0.04 0.06 0.06 0.05 0.04 0.03 0.02 0.05 0.04 0.03 0.04 0.22 0.09 0.02 0.03 0.03
T 0.05 0.03 0.05 0.04 0.04 0.04 0.04 0.03 0.03 0.04 0.03 0.04 0.04 0.03 0.04 0.08 0.25 0.02 0.03 0.05
W 0.01 0.01 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.00 0.01 0.01 0.49 0.03 0.01
Y 0.02 0.02 0.02 0.01 0.01 0.02 0.02 0.01 0.06 0.02 0.02 0.02 0.02 0.09 0.01 0.02 0.02 0.07 0.32 0.02
V 0.07 0.03 0.03 0.02 0.06 0.04 0.03 0.02 0.02 0.18 0.10 0.03 0.09 0.06 0.03 0.04 0.07 0.03 0.05 0.27
Some amino acids are highly conserved (e.g. C), some have a high chance of mutation (e.g. I)

What is a pseudo count?
[The Blosum matrix shown again, with the I and V entries highlighted]
• Say I observe V at P1
• Knowing that V at P1 binds, what is the probability that a peptide could have I at P1?
• P(I|V) = 0.16

Pseudo count estimation (same 10 peptides)
• Calculate the observed amino acid frequencies $f_a$
• Pseudo frequency for amino acid b: $g_b = \sum_a f_a \, q_{b|a}$
• Example (I at P9):
  $g_I = 0.2 \cdot q_{I|M} + 0.1 \cdot q_{I|N} + \dots + 0.3 \cdot q_{I|V} + 0.1 \cdot q_{I|L}$
  $g_I = 0.2 \cdot 0.04 + 0.1 \cdot 0.01 + \dots + 0.3 \cdot 0.18 + 0.1 \cdot 0.17 = 0.09$

Weight on pseudo count
• Pseudo counts are important when only limited data is available
• With large data sets only "true" observations should count
  $p_a = \frac{\alpha f_a + \beta g_a}{\alpha + \beta}$
• $\alpha$ is the effective number of sequences (N-1), $\beta$ is the weight on the prior

Weight on pseudo count
• Example: $p_a = \frac{\alpha f_a + \beta g_a}{\alpha + \beta}$
• If $\alpha$ is large, p ≈ f and only the observed data defines the motif
• If $\alpha$ is small, p ≈ g and the pseudo counts (or prior) define the motif
• $\beta$ is normally in the range [50-200]

Sequence weighting and pseudo counts (same 10 peptides)
  RLLDDTPEV  84 nM
  GLLGNVSTV  23 nM
  ALAKAAAAL  309 nM
• P7(P) and P7(S) > 0

Position specific weighting
• We know that positions 2 and 9 are anchor positions for most MHC binding motifs
  – Increase weight on high information positions
• Motif found on large data set

Weight matrices
• Estimate amino acid frequencies from the alignment, including sequence weighting and pseudo counts (rows: amino acids; columns: positions 1-9)
     1    2    3    4    5    6    7    8    9
A 0.08 0.04 0.08 0.08 0.06 0.06 0.10 0.05 0.08
R 0.06 0.01 0.04 0.05 0.04 0.03 0.02 0.07 0.02
N 0.02 0.01 0.05 0.03 0.05 0.03 0.04 0.04 0.01
D 0.03 0.01 0.07 0.10 0.03 0.03 0.04 0.03 0.01
C 0.02 0.01 0.02 0.01 0.01 0.03 0.02 0.01 0.02
Q 0.02 0.01 0.03 0.05 0.04 0.03 0.03 0.04 0.02
E 0.03 0.02 0.03 0.08 0.05 0.04 0.04 0.06 0.03
G 0.08 0.02 0.08 0.13 0.11 0.06 0.05 0.06 0.02
H 0.02 0.01 0.02 0.01 0.03 0.02 0.04 0.03 0.01
I 0.08 0.11 0.05 0.05 0.04 0.10 0.08 0.06 0.10
L 0.11 0.44 0.11 0.06 0.09 0.14 0.12 0.13 0.23
K 0.06 0.02 0.03 0.05 0.04 0.04 0.02 0.06 0.03
M 0.04 0.06 0.03 0.01 0.02 0.03 0.03 0.02 0.02
F 0.06 0.03 0.06 0.03 0.06 0.05 0.06 0.05 0.04
P 0.02 0.01 0.04 0.08 0.06 0.04 0.07 0.04 0.01
S 0.09 0.02 0.06 0.06 0.04 0.06 0.06 0.08 0.04
T 0.04 0.05 0.05 0.04 0.05 0.06 0.05 0.07 0.04
W 0.01 0.00 0.03 0.02 0.02 0.01 0.03 0.01 0.00
Y 0.04 0.01 0.05 0.01 0.05 0.03 0.03 0.04 0.02
V 0.08 0.10 0.07 0.05 0.08 0.13 0.08 0.05 0.25
• What do the numbers mean?
  – P2(V) > P2(M). Does this mean that V enables binding more than M?
  – In nature not all amino acids are found equally often
    • $q_M = 0.025$, $q_V = 0.073$
    • Finding 7% V is hence not significant, but 2% M is highly significant
    • In nature V is found more often than M, so we must somehow rescale with the background

Weight matrices
• A weight matrix is given as $W_{ij} = \log(p_{ij}/q_j)$
  – where i is a position in the motif and j an amino acid; $q_j$ is the background frequency for amino acid j
• W is an L x 20 matrix, where L is the motif length (rows: positions 1-9; columns: amino acids)
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1  0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
3  0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
4 -0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
5 -1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
6 -0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
7  1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
8 -2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
9 -0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5

Scoring a sequence to a weight matrix
• Score a sequence against the weight matrix by looking up and adding L values from the matrix, one per position (same matrix as on the previous slide)
  Peptide     Score   Affinity
  RLLDDTPEV   11.9    84 nM
  GLLGNVSTV   14.7    23 nM
  ALAKAAAAL    4.3    309 nM
• Which peptide is most likely to bind? Which peptide second?
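
The lookup-and-add scoring on the last slide can be reproduced with a few lines of Python. This is a minimal sketch: the matrix values are transcribed from the weight matrix slide, while the names `AMINO_ACIDS`, `WEIGHT_MATRIX` and `score_peptide` are illustrative choices, not part of the original material.

```python
# Column order of the weight matrix (amino acids).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# One row per peptide position P1..P9, values transcribed from the slide.
WEIGHT_MATRIX = [
    [0.6, 0.4, -3.5, -2.4, -0.4, -1.9, -2.7, 0.3, -1.1, 1.0, 0.3, 0.0, 1.4, 1.2, -2.7, 1.4, -1.2, -2.0, 1.1, 0.7],
    [-1.6, -6.6, -6.5, -5.4, -2.5, -4.0, -4.7, -3.7, -6.3, 1.0, 5.1, -3.7, 3.1, -4.2, -4.3, -4.2, -0.2, -5.9, -3.8, 0.4],
    [0.2, -1.3, 0.1, 1.5, 0.0, -1.8, -3.3, 0.4, 0.5, -1.0, 0.3, -2.5, 1.2, 1.0, -0.1, -0.3, -0.5, 3.4, 1.6, 0.0],
    [-0.1, -0.1, -2.0, 2.0, -1.6, 0.5, 0.8, 2.0, -3.3, 0.1, -1.7, -1.0, -2.2, -1.6, 1.7, -0.6, -0.2, 1.3, -6.8, -0.7],
    [-1.6, -0.1, 0.1, -2.2, -1.2, 0.4, -0.5, 1.9, 1.2, -2.2, -0.5, -1.3, -2.2, 1.7, 1.2, -2.5, -0.1, 1.7, 1.5, 1.0],
    [-0.7, -1.4, -1.0, -2.3, 1.1, -1.3, -1.4, -0.2, -1.0, 1.8, 0.8, -1.9, 0.2, 1.0, -0.4, -0.6, 0.4, -0.5, 0.0, 2.1],
    [1.1, -3.8, -0.2, -1.3, 1.3, -0.3, -1.3, -1.4, 2.1, 0.6, 0.7, -5.0, 1.1, 0.9, 1.3, -0.5, -0.9, 2.9, -0.4, 0.5],
    [-2.2, 1.0, -0.8, -2.9, -1.4, 0.4, 0.1, -0.4, 0.2, 0.0, 1.1, -0.5, -0.5, 0.7, -0.3, 0.8, 0.8, -0.7, 1.3, -1.1],
    [-0.2, -3.5, -6.1, -4.5, 0.7, -0.8, -2.5, -4.0, -2.6, 0.9, 2.8, -3.0, -1.8, -1.4, -6.2, -1.9, -1.6, -4.9, -1.6, 4.5],
]

def score_peptide(peptide, matrix=WEIGHT_MATRIX, alphabet=AMINO_ACIDS):
    """Add up one matrix value per position: W[i][aa at position i]."""
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must equal the motif length")
    return sum(row[alphabet.index(aa)] for row, aa in zip(matrix, peptide))

if __name__ == "__main__":
    for pep in ("RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL"):
        print(pep, round(score_peptide(pep), 1))
    # RLLDDTPEV 11.9
    # GLLGNVSTV 14.7
    # ALAKAAAAL 4.3
```

The three scores match the values on the slide, and their ranking (GLLGNVSTV, then RLLDDTPEV, then ALAKAAAAL) follows the measured affinities of 23 nM, 84 nM and 309 nM.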