Pairwise Sequence Alignment

Sequence Alignment
• Sequence analysis is the process of making biological inferences from the known sequence of monomers in protein, DNA and RNA polymers.

Complete DNA Sequences
More than 400 complete genomes have been sequenced.

Evolution

Sequence alignment
• Comparing DNA/protein sequences for
  – Similarity
  – Homology
• Prediction of function
• Construction of phylogeny
• Shotgun assembly
  – End-space-free alignment / overlap alignment
• Finding motifs

Sequence Alignment
Procedure of comparing two (pairwise) or more (multiple) sequences by searching for a series of individual characters that are in the same order in the sequences.

    GCTAGTCAGATCTGACGCTA
      | |||| ||||| |||
      TGGTCACATCTGCCGC

Sequence Alignment
Unaligned:
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Aligned:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition
Given two strings x = x1x2...xM, y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

Sources of variation
• Nucleotide substitution
  – Replication error
  – Chemical reaction
• Insertions or deletions (indels)
  – Unequal crossing over
  – Replication slippage
• Duplication
  – a single gene (complete gene duplication)
  – part of a gene (internal or partial gene duplication)
    • Domain duplication
    • Exon shuffling
  – part of a chromosome (partial polysomy)
  – an entire chromosome (aneuploidy or polysomy)
  – the whole genome (polyploidy)

Common mutations in DNA
    Substitution:  ACGTTGAC --> ACGATGAC
    Deletion:      ACGTTGAC --> ACGAC
    Insertion:     ACGTTGAC --> ACGCAATTGAC

[Diagram: sequence alignment of Protein 1 and Protein 2, and protein function. More than 25% sequence identity? Similar 3D structure? Similar function?]
Similar sequences produce similar proteins

Differing rates of DNA evolution
• Functional/selective constraints (particular features of coding regions, particular features in 5' untranslated regions)
• Variation among different gene regions with different functions (different parts of a protein may evolve at different rates).
• Within proteins, variations are observed between
  – surface and interior amino acids in proteins (order of magnitude difference in rates in haemoglobins)
  – charged and non-charged amino acids
  – protein domains with different functions
  – regions which are strongly constrained to preserve particular functions and regions which are not
  – different types of proteins: those with constrained interaction surfaces and those without

Common assumptions
• All nucleotide sites change independently
• The substitution rate is constant over time and in different lineages
• The base composition is at equilibrium
• The conditional probabilities of nucleotide substitutions are the same for all sites, and do not change over time
• Most of these are not true in many cases…

Pairwise alignments in the 1950s
    β-corticotropin (sheep)   ala gly glu asp asp glu
    Corticotropin A (pig)     asp gly ala glu asp glu

    Oxytocin      CYIQNCPLG
    Vasopressin   CYFQNCPRG

Early example of sequence alignment: globins (1961)
globins: α-, β-, myoglobin
H.C. Watson and J.C. Kendrew, “Comparison Between the Amino-Acid Sequences of Sperm Whale Myoglobin and of Human Hæmoglobin.” Nature 190:670-672, 1961.
Pairwise sequence alignment is the most fundamental operation of bioinformatics
• It is used to decide if two proteins (or genes) are related structurally or functionally
• It is used to identify domains or motifs that are shared between proteins
• It is the basis of BLAST searching (next week)
• It is used in the analysis of genomes

Pairwise alignment: protein sequences can be more informative than DNA
• protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties
• codons are degenerate: changes in the third position often do not alter the amino acid that is specified
• protein sequences offer a longer “look-back” time
• DNA sequences can be translated into protein, and then used in pairwise alignments
• DNA can be translated into six potential proteins

    5' CAT CAA
    5' ATC AAC
    5' TCA ACT
    5' CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3'
    3' GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5'
                                                  TGG GTG 5'
                                                  ATG GGT 5'
                                                  GAT GGG 5'

  (The bottom-strand frames read 5' to 3' as GTG GGT, TGG GTA and GGG TAG.)

• Many times, DNA alignments are appropriate
  – to confirm the identity of a cDNA
  – to study noncoding regions of DNA
  – to study DNA polymorphisms
  – example: Neanderthal vs modern human DNA

    Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
               |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
    Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247

[Figure: retinol-binding protein (NP_006735) and β-lactoglobulin (P02754)]

Definitions
Pairwise alignment: The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
Homology: Similarity attributed to descent from a common ancestor.
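The six-frame translation mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: it assumes the standard genetic code, and the function names (`reverse_complement`, `translate`, `six_frames`) are hypothetical helpers chosen for this example. Stop codons are written as `*`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations of frames +1, +2, +3, -1, -2, -3."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    return [
        translate(strand[offset:])
        for strand in (forward, reverse)
        for offset in range(3)
    ]


# The DNA sequence shown on the slide.
seq = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for label, protein in zip(["+1", "+2", "+3", "-1", "-2", "-3"], six_frames(seq)):
    print(label, protein)
```

The first codons of each frame (CAT CAA, ATC AAC, TCA ACT on the top strand; GTG GGT, TGG GTA, GGG TAG on the bottom strand) match the frames drawn on the slide.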
Definitions
Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.

    RBP:         26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
                    + K ++ + + + GTW++ MA + L + A V T + +L+ W+
    glycodelin:  23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Definitions: two types of homology
Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.
Paralogs: Homologous sequences within a single species that arose by gene duplication.

Orthologs: members of a gene (protein) family in various organisms. This tree shows RBP orthologs.
[Figure: tree of RBP orthologs; scale bar 10 changes. Labels: common carp, zebrafish, rainbow trout (teleost), African clawed frog, chicken, human, mouse, rat, horse, pig, cow, rabbit.]

Paralogs: members of a gene (protein) family within a species
[Figure: tree of lipocalin paralogs; scale bar 10 changes. Labels: apolipoprotein D, retinol-binding protein 4, complement component 8, alpha-1 microglobulin/bikunin, prostaglandin D2 synthase, progestagen-associated endometrial protein, odorant-binding protein 2A, neutrophil gelatinase-associated lipocalin, lipocalin 1.]

Pairwise alignment of retinol-binding protein and β-lactoglobulin

      1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
        . ||| | . |. . . | : .||||.:| :
      1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

     51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
        : | | | | :: | .| . || |: || |.
     45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

     98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
        || ||. | :.|||| | . .|
     94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

    137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
        . | | | : || . | || |
    136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin

Definitions
Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Conservation: Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

The slides repeat the RBP/β-lactoglobulin alignment with these annotations: identity (bar), very similar (two dots), somewhat similar (one dot), internal gap, terminal gap.

Gaps
• Positions at which a letter is paired with a null are called gaps.
• Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap. Thus there are separate penalties for gap creation and gap extension.
• In BLAST, it is rarely necessary to change gap values from the default.

Pairwise alignment of retinol-binding protein from human (top) and rainbow trout (O. mykiss)

       1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
          ::   ||  ||  ||   .||.||. .| :|||:.|:.| |||.|||||
       1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP 47

      49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
          |||| ||:||:|||||.|.|.||| ||| :||||:.||.| ||| || |
      48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD 97

      99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
         ||||||:||| ||:|| ||||||::||||| ||: |||| ..||||| |
      98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147

     149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199
         |||:||| | || || ||||  :..|:|   .|| : | |:|:
     148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS...... 192

Pairwise sequence alignment allows us to look back billions of years ago (BYA)
[Figure: timeline from 4 BYA to 0. Labels: origin of life, earliest fossils, eukaryote/archaea, origin of eukaryotes, plant/animal, fungi/animal, insects.]

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases

    fly        GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
    human      GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
    plant      GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
    bacterium  GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
    yeast      GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
    archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

    fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
    human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
    plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
    bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
    yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
    archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

    fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
    human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
    plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
    bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
    yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
    archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

Multiple sequence alignment of human lipocalin paralogs

    lipocalin 1                 ~~~~~EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM
    odorant-binding protein 2a  LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF
    progestagen-assoc. endo.    TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR
    apolipoprotein D            VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV
    retinol-binding protein     VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF
    neutrophil gelatinase-ass.  LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF
    prostaglandin D2 synthase   VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL
    alpha-1-microglobulin       VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW
    complement component 8      PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD...

General approach to pairwise alignment
• Choose two sequences
• Select an algorithm that generates a score
• Allow gaps (insertions, deletions)
• Score reflects degree of similarity
• Alignments can be global or local
• Estimate probability that the alignment occurred by chance

Calculation of an alignment score
Where we’re heading in the next 10 minutes: creating a set of “scoring matrices” that let us assign scores for each aligned amino acid in a pairwise alignment. What should the score be when a serine matches a serine, or a threonine, or a valine? Can we devise “lenient” scoring systems to help us align distantly related proteins, and more conservative scoring systems to align closely related proteins?

Emile Zuckerkandl and Linus Pauling (1965) considered substitution frequencies in 18 globins (myoglobins and hemoglobins from human to lamprey).
[Figure: substitution frequency grid; example annotation: lys found at 58% of arg sites. Black: identity. Gray: very conservative substitutions (>40% occurrence). White: fairly conservative substitutions (>21% occurrence). Red: no substitutions observed.]

Dayhoff’s 34 protein superfamilies: accepted point mutations (from 1978)

    Protein           PAMs per 100 million years
    Ig kappa chain    37
    Kappa casein      33
    Lactalbumin       27
    Hemoglobin α      12
    Myoglobin         8.9
    Insulin           4.4
    Histone H4        0.10
    Ubiquitin         0.00

    (The slide marks a 400-fold range of rates.)

Pairwise alignment of human (NP_005203) versus mouse (NP_031812) ubiquitin

Dayhoff’s numbers of “accepted point mutations”: what amino acid substitutions occur in proteins?
From closely related protein sequences (at least 85% identity)

               A     R     N     D     C     Q     E     G
    R Arg     30
    N Asn    109    17
    D Asp    154     0   532
    C Cys     33    10     0     0
    Q Gln     93   120    50    76     0
    E Glu    266     0    94   831     0   422
    G Gly    579    10   156   162    10    30   112
    H His     21   103   226    43    10   243    23    10

Numbers of APM, multiplied by 10, in 1572 cases of amino acid substitutions from closely related sequences

The relative mutability of amino acids

    Asn 134    His 66
    Ser 120    Arg 65
    Asp 106    Lys 56
    Glu 102    Pro 56
    Ala 100    Gly 49
    Thr  97    Tyr 41
    Ile  96    Phe 41
    Met  94    Leu 40
    Gln  93    Cys 20
    Val  74    Trp 18

Describes how often each amino acid is likely to change over a short evolutionary period

Normalized frequencies of amino acids

    Gly 8.9%    Arg 4.1%
    Ala 8.7%    Asn 4.0%
    Leu 8.5%    Phe 4.0%
    Lys 8.1%    Gln 3.8%
    Ser 7.0%    Ile 3.7%
    Val 6.5%    His 3.4%
    Thr 5.8%    Cys 3.3%
    Pro 5.1%    Tyr 3.0%
    Glu 5.0%    Met 1.5%
    Asp 4.7%    Trp 1.0%

    (Slide color key: blue = 6 codons; red = 1 codon.)

Dayhoff’s PAM1 mutation probability matrix
Columns: original amino acid. Rows: replacement amino acid. Values are probabilities multiplied by 10,000.

                A     R     N     D     C     Q     E     G     H     I
    A Ala    9867     2     9    10     3     8    17    21     2     6
    R Arg       1  9913     1     0     1    10     0     0    10     3
    N Asn       4     1  9822    36     0     4     6     6    21     3
    D Asp       6     0    42  9859     0     6    53     6     4     1
    C Cys       1     1     0     0  9973     0     0     0     1     1
    Q Gln       3     9     4     5     0  9876    27     1    23     1
    E Glu      10     0     7    56     0    35  9865     4     2     3
    G Gly      21     1    12    11     1     3     7  9935     1     0
    H His       1     8    18     3     1    20     1     0  9912     0
    I Ile       2     2     3     1     2     1     2     0     0  9872

There is a 98.67% chance that A will be replaced by A over an evolutionary distance of 1 PAM.
Each element shows the probability that an original amino acid j (columns) will be replaced by another amino acid i (rows) for 1% sequence divergence.

Substitution Matrix
A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids.
Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids.
Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.
The two major types of substitution matrices are PAM and BLOSUM.

PAM matrices: Point-accepted mutations
PAM matrices are based on global alignments of closely related proteins.
The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
Other PAM matrices are extrapolated from PAM1.
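The extrapolation from PAM1 to higher PAM matrices can be sketched as repeated matrix multiplication. The example below is only an illustration under stated assumptions: `TOY_PAM1` is a made-up 3-letter matrix (not Dayhoff's data) with the same convention as the slides (columns are the original residue, rows the replacement, each column sums to 1), and `mat_mul` and `mat_power` are hypothetical helper names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(matrix, exponent):
    """Raise a square matrix to a non-negative integer power by squaring."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in matrix]
    while exponent > 0:
        if exponent & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        exponent >>= 1
    return result


# Hypothetical 3-letter "PAM1": about 1% of residues change per step.
# Columns = original residue, rows = replacement residue.
TOY_PAM1 = [
    [0.990, 0.004, 0.008],
    [0.006, 0.992, 0.002],
    [0.004, 0.004, 0.990],
]

pam0 = mat_power(TOY_PAM1, 0)       # identity: nothing has changed yet
pam250 = mat_power(TOY_PAM1, 250)   # diagonal shrinks, off-diagonals grow
pam_far = mat_power(TOY_PAM1, 5000) # each row approaches a constant value

for name, m in [("PAM0", pam0), ("PAM250", pam250), ("PAM5000", pam_far)]:
    print(name)
    for row in m:
        print("  ", " ".join(f"{value:.3f}" for value in row))
```

In this toy model, a power of 0 gives the identity matrix, and very high powers give rows of equal values, because every column converges to the same background composition regardless of the original residue.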
All the PAM data come from closely related proteins (>85% amino acid identity).

Dayhoff’s PAM0 mutation probability matrix: the rules for extremely slowly evolving proteins
Top: original amino acid. Side: replacement amino acid.

    PAM0       A     R     N     D     C     Q     E     G
    A Ala    100%    0%    0%    0%    0%    0%    0%    0%
    R Arg      0%  100%    0%    0%    0%    0%    0%    0%
    N Asn      0%    0%  100%    0%    0%    0%    0%    0%
    D Asp      0%    0%    0%  100%    0%    0%    0%    0%
    C Cys      0%    0%    0%    0%  100%    0%    0%    0%
    Q Gln      0%    0%    0%    0%    0%  100%    0%    0%
    E Glu      0%    0%    0%    0%    0%    0%  100%    0%
    G Gly      0%    0%    0%    0%    0%    0%    0%  100%

Dayhoff’s PAM2000 mutation probability matrix: the rules for very distantly related proteins
The PAM1 matrix is multiplied by itself 2000 times.
Top: original amino acid. Side: replacement amino acid.

    PAM2000    A     R     N     D     C     Q     E     G
    A Ala    8.7%  8.7%  8.7%  8.7%  8.7%  8.7%  8.7%  8.7%
    R Arg    4.1%  4.1%  4.1%  4.1%  4.1%  4.1%  4.1%  4.1%
    N Asn    4.0%  4.0%  4.0%  4.0%  4.0%  4.0%  4.0%  4.0%
    D Asp    4.7%  4.7%  4.7%  4.7%  4.7%  4.7%  4.7%  4.7%
    C Cys    3.3%  3.3%  3.3%  3.3%  3.3%  3.3%  3.3%  3.3%
    Q Gln    3.8%  3.8%  3.8%  3.8%  3.8%  3.8%  3.8%  3.8%
    E Glu    5.0%  5.0%  5.0%  5.0%  5.0%  5.0%  5.0%  5.0%
    G Gly    8.9%  8.9%  8.9%  8.9%  8.9%  8.9%  8.9%  8.9%

PAM250 mutation probability matrix
Top: original amino acid. Side: replacement amino acid. Values are percentages.

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    A   13   6   9   9   5   8   9  12   6   8   6   7   7   4  11  11  11   2   4   9
    R    3  17   4   3   2   5   3   2   6   3   2   9   4   1   4   4   3   7   2   2
    N    4   4   6   7   2   5   6   4   6   3   2   5   3   2   4   5   4   2   3   3
    D    5   4   8  11   1   7  10   5   6   3   2   5   3   1   4   5   5   1   2   3
    C    2   1   1   1  52   1   1   2   2   2   1   1   1   1   2   3   2   1   4   2
    Q    3   5   5   6   1  10   7   3   7   2   3   5   3   1   4   3   3   1   2   3
    E    5   4   7  11   1   9  12   5   6   3   2   5   3   1   4   5   5   1   2   3
    G   12   5  10  10   4   7   9  27   5   5   4   6   5   3   8  11   9   2   3   7
    H    2   5   5   4   2   7   4   2  15   2   2   3   2   2   3   3   2   2   3   2
    I    3   2   2   2   2   2   2   2   2  10   6   2   6   5   2   3   4   1   3   9
    L    6   4   4   3   2   6   4   3   5  15  34   4  20  13   5   4   6   6   7  13
    K    6  18  10   8   2  10   8   5   8   5   4  24   9   2   6   8   8   4   3   5
    M    1   1   1   1   0   1   1   1   1   2   3   2   6   2   1   1   1   1   1   2
    F    2   1   2   1   1   1   1   1   3   5   6   1   4  32   1   2   2   4  20   3
    P    7   5   5   4   3   5   4   5   5   3   3   4   3   2  20   6   5   1   2   4
    S    9   6   8   7   7   6   7   9   6   5   4   7   5   3   9  10   9   4   4   6
    T    8   5   6   6   4   5   5   6   4   6   4   6   5   3   6   8  11   2   3   6
    W    0   2   0   0   0   0   0   0   1   0   1   0   0   1   0   1   0  55   1   0
    Y    1   1   2   1   3   1   1   1   3   2   2   1   2  15   1   2   2   3  31   2
    V    7   4   4   4   4   4   4   5   4  15  10   4  10   5   5   5   7   2   4  17

PAM250 log odds scoring matrix
S(a,b) = 10 log10 (Mab/pb)

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    A    2
    R   -2   6
    N    0   0   2
    D    0  -1   2   4
    C   -2  -4  -4  -5  12
    Q    0   1   1   2  -5   4
    E    0  -1   1   3  -5   2   4
    G    1  -3   0   1  -3  -1   0   5
    H   -1   2   2   1  -3   3   1  -2   6
    I   -1  -2  -2  -2  -2  -2  -2  -3  -2   5
    L   -2  -3  -3  -4  -6  -2  -3  -4  -2  -2   6
    K   -1   3   1   0  -5   1   0  -2   0  -2  -3   5
    M   -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
    F   -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
    P    1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6
    S    1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
    T    1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
    W   -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
    Y   -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
    V    0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

Why do we go from a mutation probability matrix to a log odds matrix?
• We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues.
• Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them).

How do we go from a mutation probability matrix to a log odds matrix?
• The cells in a log odds matrix consist of an “odds ratio”:

    (the probability that an alignment is authentic) / (the probability that the alignment was random)

The score S for an alignment of residues a,b is given by:

    S(a,b) = 10 log10 (Mab/pb)

As an example, for tryptophan aligned with tryptophan, Mab = 0.55 (the PAM250 probability that W stays W) and the normalized frequency of W is pb = 0.010:

    S(W,W) = 10 log10 (0.55/0.010) = 17.4

What do the numbers mean in a log odds matrix?

    S(W,W) = 10 log10 (0.55/0.010) = 17.4

A score of +17 for tryptophan means that this alignment is 50 times more likely than a chance alignment of two Trp residues.

    S(a,b) = 17
    Probability of replacement (Mab/pb) = x
    Then 17 = 10 log10 x
    1.7 = log10 x
    10^1.7 = x = 50

What do the numbers mean in a log odds matrix?
A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance.
A score of 0 is neutral.
A score of –10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids.
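The arithmetic above can be checked with a short Python sketch. It applies the slide's formula S(a,b) = 10 log10 (Mab/pb) and its inverse; the function names `log_odds_score` and `odds_ratio` are hypothetical, chosen for this illustration.

```python
import math


def log_odds_score(m_ab, p_b):
    """Score S = 10 * log10(M_ab / p_b), as defined on the slide."""
    return 10 * math.log10(m_ab / p_b)


def odds_ratio(score):
    """Invert the score: how many times more likely than chance."""
    return 10 ** (score / 10)


# Tryptophan example: M = 0.55 (W stays W in PAM250), p = 0.010.
print(round(log_odds_score(0.55, 0.010), 1))  # 17.4
print(round(odds_ratio(17)))                  # 50
print(round(odds_ratio(2), 1))                # 1.6
print(odds_ratio(0))                          # 1.0 (neutral)
print(round(odds_ratio(-10), 1))              # 0.1
```

Because the scores are logarithms, the score of a whole alignment is the sum of the per-position scores, which corresponds to multiplying the underlying odds ratios.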
PAM10 log odds scoring matrix

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    A    7
    R  -10   9
    N   -7  -9   9
    D   -6 -17  -1   8
    C  -10 -11 -17 -21  10
    Q   -7  -4  -7  -6 -20   9
    E   -5 -15  -5   0 -20  -1   8
    G   -4 -13  -6  -6 -13 -10  -7   7
    H  -11  -4  -2  -7 -10  -2  -9 -13  10
    I   -8  -8  -8 -11  -9 -11  -8 -17 -13   9
    L   -9 -12 -10 -19 -21  -8 -13 -14  -9  -4   7
    K  -10  -2  -4  -8 -20  -6  -7 -10 -10  -9 -11   7
    M   -8  -7 -15 -17 -20  -7 -10 -12 -17  -3  -2  -4  12
    F  -12 -12 -12 -21 -19 -19 -20 -12  -9  -5  -5 -20  -7   9
    P   -4  -7  -9 -12 -11  -6  -9 -10  -7 -12 -10 -10 -11 -13   8
    S   -3  -6  -2  -7  -6  -8  -7  -4  -9 -10 -12  -7  -8  -9  -4   7
    T   -3 -10  -5  -8 -11  -9  -9 -10 -11  -5 -10  -6  -7 -12  -7  -2   8
    W  -20  -5 -11 -21 -22 -19 -23 -21 -10 -20  -9 -18 -19  -7 -20  -8 -19  13
    Y  -11 -14  -7 -17  -7 -18 -11 -20  -6  -9 -10 -12 -17  -1 -20 -10  -9  -8  10
    V   -5 -11 -12 -11  -9 -10 -10  -9  -9  -1  -5 -13  -4 -12  -9 -10  -6 -22 -10   8

[Figure: rat versus mouse RBP; rat versus bacterial lipocalin]

Comparing two proteins with a PAM1 matrix gives completely different results than PAM250!
Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches, and penalizes them severely. Using this matrix you can find almost no match.

    hsrbp,  136 CRLLNLDGTC
    btlact,   3 CLLLALALTC
                * ** *  **

A PAM250 matrix is very tolerant of mismatches.
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%

    hsrbp,  26  RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
    btlact, 21  QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
                      *     ****  *        * *   *   *               **  *

    hsrbp,  86  --CADMVGTFTDTEDPAKFKM
    btlact, 80  GECAQKKIIAEKTKIPAVFKI
                  **        *  ** **

BLOSUM Matrices
BLOSUM matrices are based on local alignments.
BLOSUM stands for blocks substitution matrix.
BLOSUM62 is a matrix calculated from comparisons of sequences that share no more than 62% identity (more similar sequences are clustered and counted as one).

[Figure: percent amino acid identity scale (100, 62, 30) for BLOSUM80, BLOSUM62 and BLOSUM30]

BLOSUM Matrices
All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
The BLOCKS database contains thousands of groups of multiple sequence alignments.
BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.
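The identity and gap-frequency figures quoted for the PAM250 alignment of hsrbp and btlact can be recomputed from the aligned rows. The sketch below is illustrative only: it assumes percent identity is counted over all alignment columns (including gapped ones), which reproduces the reported values, and the function name `alignment_stats` is hypothetical.

```python
def alignment_stats(row_a, row_b, gap="-"):
    """Return (percent identity, percent gapped columns) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = len(row_a)
    identities = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    gapped = sum(1 for x, y in zip(row_a, row_b) if x == gap or y == gap)
    return 100 * identities / columns, 100 * gapped / columns


# Aligned rows copied from the PAM250 alignment shown above.
HSRBP = (
    "RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV"
    "--CADMVGTFTDTEDPAKFKM"
)
BTLACT = (
    "QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN"
    "GECAQKKIIAEKTKIPAVFKI"
)

identity, gap_frequency = alignment_stats(HSRBP, BTLACT)
print(f"{identity:.1f}% identity in {len(HSRBP)} columns")  # 24.7% in 81
print(f"gap frequency: {gap_frequency:.1f}%")               # 3.7%
```

Other tools may divide by the shorter sequence length or by ungapped columns only, so percent identity values are comparable only when the same denominator is used.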
BLOSUM Scoring Matrices
• In the Dayhoff model, the scoring values are derived from protein sequences with at least 85% identity
• Alignments are, however, most often performed on sequences of less similarity, and the scoring matrices for use in these cases are calculated from the 1 PAM matrix
• Henikoff and Henikoff (1992) have therefore developed scoring matrices based on known alignments of more diverse sequences

BLOSUM Scoring Matrices
• They take a group of related proteins and produce a set of blocks representing this group, where a block is defined as an ungapped region of aligned amino acids
• An example of two blocks is

    KIFIMK    GDEVK
    NLFKTR    GDSKK
    KIFKTK    GDPKA
    KLFESR    GDAER
    KIFKGR    GDAAK

• The Henikoffs used over 2000 blocks in order to derive their scoring matrices
• For each column in each block they counted the number of occurrences of each pair of amino acids, when all pairs of segments were used
• Then the frequency distribution of all 210 different pairs of amino acids was found
• A block of length w from an alignment of m sequences makes (wm(m-1))/2 pairs of amino acids

We define
• hab as the number of occurrences of the amino acid pair (ab) (note that hab = hba)
• T as the total number of pairs in the alignment, T = Σ hab summed over all pairs with a ≥ b, where ≥ is interpreted as a total ordering over the amino acids
• fab = hab/T (the frequency of observed pairs)

Developing Scoring Matrices for Different Evolutionary Distances
• The procedure for developing a BLOSUM X matrix
  1. Collect a set of multiple alignments
  2. Find the blocks
  3. Group the segments that share at least X% identity
  4. Count the occurrences of all pairs of amino acids
  5. Develop the matrix, as explained before
• BLOSUM-62 is often used as the standard for ungapped alignments
• For gapped alignments, BLOSUM-50 is more often used

Blosum62 scoring matrix

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    A    4
    R   -1   5
    N   -2   0   6
    D   -2  -2   1   6
    C    0  -3  -3  -3   9
    Q   -1   1   0   0  -3   5
    E   -1   0   0   2  -4   2   5
    G    0  -2   0  -1  -3  -2  -2   6
    H   -2   0   1  -1  -3   0   0  -2   8
    I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4
    L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
    K   -1   2   0  -1  -1   1   1  -2  -1  -3  -2   5
    M   -1  -2  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
    F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
    P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
    S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
    T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
    W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
    Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
    V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4

By use of relative entropy, it can be found that PAM250 corresponds to BLOSUM-45, PAM160 corresponds to BLOSUM-62, and PAM120 corresponds to BLOSUM-80.

[Figure: rat versus mouse RBP; rat versus bacterial lipocalin]

Major Differences between PAM and BLOSUM

    PAM                                          BLOSUM
    Built from global alignments                 Built from local alignments
    Built from a small amount of data            Built from a vast amount of data
    Counting is based on minimum replacement     Counting is based on groups of related
      or maximum parsimony                         sequences counted as one
    Performs better for finding global           Better for finding local alignments
      alignments and remote homologs
    Higher PAM series means more divergence      Lower BLOSUM series means more divergence
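The pair-counting step of the BLOSUM procedure (hab, T and fab) can be sketched as follows, using the first example block from the slides. This is a simplified illustration, not the Henikoffs' full method: it omits the clustering of segments at X% identity and the conversion to log odds scores, and the function names `count_pairs` and `pair_frequencies` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered amino acid pairs h_ab in every column of an ungapped block."""
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("all segments in a block must have the same length")
    counts = Counter()
    for column in zip(*block):
        # Every pair of segments contributes one residue pair per column.
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def pair_frequencies(counts):
    """Convert counts h_ab into observed frequencies f_ab = h_ab / T."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# First example block from the slide: w = 6 columns, m = 5 segments.
BLOCK_1 = ["KIFIMK", "NLFKTR", "KIFKTK", "KLFESR", "KIFKGR"]

counts = count_pairs(BLOCK_1)
frequencies = pair_frequencies(counts)

print(sum(counts.values()))              # 60 = w * m * (m - 1) / 2
print(counts[("F", "F")])                # 10 pairs from the all-F column
print(round(frequencies[("I", "L")], 3)) # 0.1
```

The total of 60 pairs agrees with the formula (wm(m-1))/2 for w = 6 and m = 5.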
[Figure: percent identity (y-axis) versus evolutionary distance in PAMs (x-axis), with the “twilight zone” marked. Two randomly diverging protein sequences change in a negatively exponential fashion.]

PAM matrices reflect different degrees of divergence
[Figure: percent identity versus differences per 100 residues, with PAM250 and the “twilight zone” marked.]
• At PAM1, two proteins are 99% identical
• At PAM10.7, there are 10 differences per 100 residues
• At PAM80, there are 50 differences per 100 residues
• At PAM250, there are 80 differences per 100 residues

PAM: “Accepted point mutation”
• Two proteins with 50% identity may have 80 changes per 100 residues. (Why? Because any residue can be subject to back mutations.)
• Proteins with 20% to 25% identity are in the “twilight zone” and may be statistically significantly related.
• PAM or “accepted point mutation” refers to the “hits” or matches between two sequences (Dayhoff & Eck, 1968)

Ancestral sequence: ACCCTAC (Li (1997) p.70)

    Ancestral   Sequence 1      Sequence 2      Type of change
    A           A               A               no change
    C           C               C --> A         single substitution
    C           C               C --> A --> T   multiple substitutions
    C           C --> G         C --> A         coincidental substitutions
    T           T --> A         T --> A         parallel substitutions
    A           A --> C --> T   A --> T         convergent substitutions
    C           C               C --> T --> C   back substitution

Percent identity between two proteins: What percent is significant?
100%  80%  65%  30%  23%  19%

An alignment scoring system is required to evaluate how good an alignment is
• positive and negative values assigned
• gap creation and extension penalties
• positive score for identities
• some partial positive score for conservative substitutions
• global versus local alignment
• use of a substitution matrix
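As a rough illustration of such a scoring system, the sketch below scores an existing gapped alignment with separate gap creation and gap extension penalties. Everything specific here is an assumption made for the example: the function name `score_alignment`, the simple match/mismatch values used in place of a substitution matrix, and the convention that the first position of a gap costs `gap_open` and each further position costs `gap_extend`.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1, gap="-"):
    """Sum per-column scores for two aligned rows with affine gap penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap_a = in_gap_b = False  # whether the previous column was a gap in that row
    for x, y in zip(row_a, row_b):
        if x == gap and y == gap:
            raise ValueError("a column cannot contain two gaps")
        if x == gap:
            score += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif y == gap:
            score += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += match if x == y else mismatch
            in_gap_a = in_gap_b = False
    return score


# The deletion and substitution examples from the "Common mutations" slide.
print(score_alignment("ACGTTGAC", "ACG---AC"))  # 5 matches, one gap of length 3: -2
print(score_alignment("ACGTTGAC", "ACGATGAC"))  # 7 matches, 1 mismatch: 6
```

In practice the match/mismatch values would be replaced by a lookup in a substitution matrix such as PAM250 or BLOSUM62, so that conservative substitutions receive a partial positive score.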