The Pfam protein family database: an introduction
Marco Punta
TUM, December 2013

Evolution

Homology
[Figure: proteins P1A, P1B, P1C in species A, B, C]
[Figure: gene duplication, http://upload.wikimedia.org/wikipedia/commons/thumb/7/72/Gene-duplication.png/220px-Gene-duplication.png]
[Figure: P1A, P1B, P1C and the duplicate P1C'; labels: homologs, orthologs, paralogs]
Xenologs -> horizontal gene transfer
http://textbookofbacteriology.net/resantimicrobial_3.html

Point I: Evolutionary relationships between proteins
Definition: we call families groups of evolutionarily related proteins

Why do we care?
An example
Point II: Proteins in the same family can retain common functional attributes

Myoglobins
Myoglobin: serves as a reserve supply of oxygen and facilitates the movement of oxygen within muscles.

H:   1 MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE 60
           MGLSDGEWQLVLNVWGKVEAD  GHGQEVLI LFK HPETL KFDKFK LKSE  MK SE
Mouse:   1 MGLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLKSEEDMKGSE 60

H:  61 DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH 120
           DLKKHG TVLTALG ILKKKG H AEI PLAQSHATKHKIPVKYLEFISE II VL   H
Mouse:  61 DLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIPVKYLEFISEIIIEVLKKRH 120

H: 121 PGDFGADAQGAMNKALELFRKDMASNYKELGFQG 154
            GDFGADAQGAM KALELFR D A  YKELGFQG
Mouse: 121 SGDFGADAQGAMSKALELFRNDIAAKYKELGFQG 154

Why do we care? (II)
Gap between sequenced and annotated proteins
Number of sequences in UniProt 2013_11: 48,180,424
Number of sequences in Swiss-Prot 2013_11: 541,762
Number of sequences in PDB 2013_12_03: 53,888 (100% RR)
Point III: The vast majority of proteins have not yet been experimentally functionally characterised.

Detecting Homology: sequence similarity
(the human/mouse myoglobin alignment shown above)
20^154 possibilities!! (20 amino acids at each of 154 positions)

Detecting Homology: structural similarity

2G2X:  1 MAYWLMKSEPDELSIEALARLGEARWDGVRNYQARNFLRAMSVGDEFFFYH-----SSCP 55
         MAYWL     D              W     Y   N      VGD    Y
2P5D:  4 MAYWLCITNEDNWKVIKEKKI----WGVAERY--KNTINKVKVGDKLIIYEIQRSGKDYK 57

2G2X: 56 QPGIAGIARITRAAYPD------PTALDPESHY 82
          P I G        Y D      PT   P
2P5D: 58 PPYIRGVYEVVSEVYKDSSKIFKPTPRNPNEKF 90
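To put numbers on the two alignments above, the following Python sketch computes percent identity over aligned columns and the size of the sequence space for a 154-residue protein. The sequences are copied from the alignments; the function name `percent_identity` and the choice to count gap columns as mismatches are assumptions made for illustration, not something the slides specify.

```python
# Human and mouse myoglobin, copied from the alignment (154 residues, no gaps).
HUMAN = (
    "MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE"
    "DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH"
    "PGDFGADAQGAMNKALELFRKDMASNYKELGFQG"
)
MOUSE = (
    "MGLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLKSEEDMKGSE"
    "DLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIPVKYLEFISEIIIEVLKKRH"
    "SGDFGADAQGAMSKALELFRNDIAAKYKELGFQG"
)

# Aligned rows for 2G2X and 2P5D, copied from the structural alignment ('-' = gap).
ALN_2G2X = (
    "MAYWLMKSEPDELSIEALARLGEARWDGVRNYQARNFLRAMSVGDEFFFYH-----SSCP"
    "QPGIAGIARITRAAYPD------PTALDPESHY"
)
ALN_2P5D = (
    "MAYWLCITNEDNWKVIKEKKI----WGVAERY--KNTINKVKVGDKLIIYEIQRSGKDYK"
    "PPYIRGVYEVVSEVYKDSSKIFKPTPRNPNEKF"
)


def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: gap columns count as mismatches.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    same = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * same / len(row_a)


print(f"myoglobin human/mouse: {percent_identity(HUMAN, MOUSE):.1f}% identity")
print(f"2G2X/2P5D:             {percent_identity(ALN_2G2X, ALN_2P5D):.1f}% identity")

# Number of distinct sequences of length 154 over a 20-letter alphabet.
possible = 20 ** len(HUMAN)
print(f"20^154 has {len(str(possible))} decimal digits")
```

The myoglobin pair is easy to recognise as homologous from sequence alone, while the second pair sits at a level of identity where sequence similarity by itself is much less convincing.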
Detecting Homology: structural similarity
[Figures: structure of 2g2x and structural comparison]

Detecting Homology: genome context
http://www.microbesonline.org

Detecting Homology: protein context
http://www.microbesonline.org

Detecting Homology
Point IV: we can detect homology in different ways: sequence similarity, structural similarity, etc.

Pfam in 4 moves
What
Point I: Evolutionary relationships (homology) between some proteins (homologous groups = families)
Why
Point II: Proteins in the same family can retain common functional attributes
Point III: The vast majority of proteins have not yet been experimentally functionally characterised
How
Point IV: We can detect homology via sequence similarity

The Pfam Database of protein families

Pfam families are:
• Groups of sequence-conserved protein regions
Punta et al. NAR 2012

Why ‘regions’?

Query= P29973.3 CNGA1_HUMAN cGMP-gated cation channel alpha-1
(690 letters)

>Q13237.1 KGP2_HUMAN cGMP-dependent protein kinase 2 (EC 2.7.11.12)
Length = 762

Score = 41.6 bits (96), Expect = 1e-07
Identities = 30/110 (27%), Positives = 54/110 (49%), Gaps = 11/110 (10%)

Query: 474 LKKVRIFADCEAGLLVELVLKLQPQVYSPGDYICKKGDIGREMYIIKEGKLAVV----AD 529
           L+ V +  +     L +++  L+ + Y  GDYI ++G+ G   +I+ +GK+ V
Sbjct: 281 LRSVSLLKNLPEDKLTKIIDCLEVEYYDKGDYIIREGEEGSTFFILAKGKVKVTQSTEGH 340

Query: 530 DGVTQFVVLSDGSYFGEISILNIKGSKAGNRRTANIKSIGYSDLFCLSKD 579
           D       L  G YFGE +++      + + R+ANI +   +D+ CL  D
Sbjct: 341 DQPQLIKTLQKGEYFGEKALI------SDDVRSANIIA-EENDVACLVID 383

[Domain architecture: CNGA1_HUMAN is annotated with one cNMP_binding region, KGP2_HUMAN with two cNMP_binding regions]

Pfam ‘types’
Domain
Repeat
Motif
Family

Domains and repeats
• A - Domain
• B - Metal stabilised domain
• C - 7 repeats form domain
• D - 9 repeats form domain; could be unlimited number

Not always that easy
Enoyl-CoA hydratase/isomerase family
[Figures: 3bpt, 1ef8]

Motifs
Example: Lipoprotein attachment site, LPAM_1 (alignment coloured by residue type)
Example: GoLoco G-protein regulatory motif

Family
All that is left!

Less than half of Pfam families have known structure
with PDB: 43%
no PDB: 57%
(100% = all Pfam families)

Disordered Families
PDBid: 2JGC

Beginnings
Sonnhammer et al. Proteins 1997

Building a Pfam family
SEED alignment (representative members) -> Profile-HMM (HMMER 3.0) -> Search UniProt -> Scored sequences -> QC and re-iteration
Legend: Manually curated / Automatically made
Pfam (http://pfam.sanger.ac.uk/)
SEED sequences aligned with MAFFT: mafft.cbrc.jp/alignment/software/
aligned with MUSCLE: http://www.ebi.ac.uk/Tools/msa/muscle/
HMMER3 (http://hmmer.janelia.org/)

Building a Pfam family
[Figures: Family 1 and Family 2; Clan containing Family 1 and Family 2]

Building a Pfam family
SEED alignment -> Profile-HMM (HMMER 3.0) -> Search UniProt -> FULL alignment
Literature annotation
Legend: Manually curated / Automatically made

Pfam Classification
level 1: Protein
level 2: Family, Protein
level 3: Clan, Family, Protein

How far are we in protein classification?
Pfam 27.0: ~15,000 families
UniProtKB (snapshot 2012_6): ~23,000,000 sequences

Pfam coverage of UniProtKB
[Bar chart: coverage (%) at amino acid and sequence level; series: UniProtKB, Swiss-Prot (reviewed)]

Pfam coverage of Swiss-Prot
[Bar chart: coverage (%) at amino acid and sequence level; series: UniProtKB, Swiss-Prot (reviewed)]

Pfam coverage considering disorder
[Bar chart: as above, with disordered regions marked; IUPred: Dosztányi et al. JMB 2005]

What is next?
15,000 families = 80% UniProtKB
With 3750 more families we fulfill Pfam dream to cover 100% of the protein universe!!!

Family size in Pfam 27.0
[Histogram: % of all families by family size]

Targeting human, model organisms and pathogens
Pfam coverage of the human proteome (Swiss-Prot):
Residues in Pfam-A or Pfam-B families                   55%
Regions <50 residues (unlikely to contain a domain)      6%
Residues predicted to be in signal peptide regions       1%
Uncovered by Pfam                                       38%
http://xfam.wordpress.com/2013/05/07/pfam-targets-conserved-human-regions/ and Mistry et al. Database 2013

Many disordered regions: are they conserved? Can we align them?
Mistry et al. Database 2013
Tomas di Domenico

Proteins do team work (e.g. pathways, organelles)
HDL-mediated lipid transport
Ruth Eberhardt

Acknowledgements
Sanger Institute Pfam Team, Janelia Farm Pfam Team, Stockholm Pfam Team
Jaina Mistry, Rob Finn, Sean Eddy, Tomas di Domenico, Alex Bateman, Erik Sonnhammer

[Word cloud: FAMILY INTERACTIONS PROVIDE DEVELOPMENT WITHIN COMPLEX CHALLENGES]
Created from titles of 483 papers citing Punta et al. NAR 2012 (source: Scopus)

HRF, Oct 10th 2013

Using Pfam for large scale sequence analysis
Mistry et al. Acta Cryst. D 2013

Is Pfam useful?
We do a lot of preliminary work for you.
You may take a sequence and run it against a database, then look at the annotations of significant matches and try to get an idea of what the function of your protein might be.
If your protein is a close homolog of the best annotated matches (e.g. the two myoglobins we saw before) this may well be the best approach (look out for alternatively spliced isoforms!).
If homology is more remote you are likely to find a number of stumbling blocks.
1 - the closest annotated sequence matches my sequence only partially (e.g. my sequence is 250 amino acids long, the other is 100 amino acids long)
1a - my sequence is matching sequences with radically different annotations

Digression: how to align protein sequences
Imagine now that you want to find all homologs to human myoglobin and that you already know a number of homologs (e.g. from structural knowledge): it would be cool to align not pairwise but to the whole set of seqs.
There is extra info -> matrix is position specific rather than fixed.

Digression: how to align protein sequences
Recipe for aligning sequences:
1) A scoring method -> substitution matrix
2) A way to find the best score(s) between two seqs -> dynamic programming
3) A way to estimate significance -> extreme value distribution statistics

Substitution matrices
BLOSUM (BLOcks SUbstitution Matrices): Henikoff and Henikoff PNAS (1992)
PAM (Point Accepted Mutation matrices): Dayhoff et al. Nat. Biomed. Res. Found. (1978)

Substitution matrices
(the human/mouse myoglobin alignment and the 2G2X/2P5D alignment shown earlier)
if L1=300, L2=300: 10^179 possible alignments!

Dynamic programming
Given a suitable scoring matrix, we can use dynamic programming to find the optimal alignment, e.g. between two sequences.
Why is it called “dynamic programming”?
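Before the answer to that question, a concrete look at the technique may help. The Python sketch below first counts alignments with the binomial coefficient C(2N, N), which is one common way to arrive at a figure of about 10^179 for two sequences of length 300 (the slides do not say how the number was derived, so this is an assumption). It then fills a dynamic-programming table for a global alignment. The function names and the scoring values (match +1, mismatch -1, gap -2) are illustrative choices, not taken from the slides; a real search would use a substitution matrix such as BLOSUM or PAM in place of the fixed match and mismatch scores.

```python
import math


def count_alignments(n: int) -> int:
    """Rough count of global alignments of two length-n sequences: C(2n, n)."""
    return math.comb(2 * n, n)


def global_alignment(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b). Linear gap penalty, illustrative scores.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


digits = len(str(count_alignments(300)))
print(f"C(600, 300) has {digits} digits")
print(global_alignment("ACGT", "AGT"))
```

The table has (L1+1) x (L2+1) cells, so the work grows with the product of the two sequence lengths, not with the number of possible alignments.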
Because it is “something not even a Congressman could object to” and “it’s impossible to use the word dynamic in a pejorative sense”.
Bellman R (quoted in Eddy SR, Nature Biotech 2004)

Dynamic programming
[Figure: Eddy SR Nature Biotech 2004]

Global and local alignments
Needleman-Wunsch -> global
Smith-Waterman -> local (BLAST, FASTA)

Evaluating significance
Extreme value distribution:
Ungapped alignments -> proved
Gapped alignments -> overwhelming evidence

Profile-HMMs: a brief introduction
Example 2: Football
FC Augsburg 2013/14 Bundesliga Fixtures
(Venue and Result are the H/A and W/D/L labels added on the annotated versions of the slide, from Augsburg's point of view)

Date    Status  Home              Score  Away            Attendance  Competition  Venue  Result
Aug 10  FT      FC Augsburg       0-4    Borussia Dort.  30,660      Bundesliga   H      L
Aug 17  FT      Werder Bremen     1-0    FC Augsburg     40,112      Bundesliga   A      L
Aug 25  FT      FC Augsburg       2-1    VfB Stuttgart   30,030      Bundesliga   H      W
Aug 31  FT      Nurnberg          0-1    FC Augsburg     37,239      Bundesliga   A      W
Sep 14  FT      FC Augsburg       2-1    SC Freiburg     28,453      Bundesliga   H      W
Sep 21  FT      Hannover 96       2-1    FC Augsburg     39,200      Bundesliga   A      L
Sep 27  FT      FC Augsburg       2-2    Borussia Mon.   30,352      Bundesliga   H      D
Oct 5   FT      Schalke 04        4-1    FC Augsburg     60,731      Bundesliga   A      L
Oct 20  FT      FC Augsburg       1-2    VfL Wolfsburg   27,554      Bundesliga   H      L
Oct 26  FT      Bayer Leverkusen  2-1    FC Augsburg     27,811      Bundesliga   A      L
Nov 3   FT      FC Augsburg       2-1    Mainz           28,007      Bundesliga   H      W
Nov 9   FT      Bayern Munich     3-0    FC Augsburg     71,000      Bundesliga   A      L

Profile-HMMs: a brief introduction
Example 2: Football
[Figure: states W, D, L]
Probabilistic

Our model for Augsburg’s Bundesliga results:
3 states: W, D, L
S(t) = F(S(t-1))
States S connected by probabilities p_ij ≥ 0; Σ_j p_ij = 1
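The three-state model above can be made concrete by estimating the p_ij from the fixture list. The Python sketch below counts transitions between consecutive results and normalises each row, which is a simple maximum-likelihood estimate. The slides do not say how the probabilities would be obtained, so the method and the function name `transition_probabilities` are illustrative assumptions; the result string is read off the W/D/L labels in the fixture list.

```python
from collections import Counter, defaultdict

# Augsburg's results in fixture order (Aug 10 to Nov 9), from the W/D/L labels.
RESULTS = "LLWWWLDLLLWL"


def transition_probabilities(sequence: str) -> dict:
    """Estimate first-order transition probabilities p_ij from one state sequence."""
    counts = defaultdict(Counter)
    for previous, current in zip(sequence, sequence[1:]):
        counts[previous][current] += 1
    # Normalise each row so that the probabilities leaving a state sum to 1.
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }


P = transition_probabilities(RESULTS)
for state in "WDL":
    print(state, P.get(state, {}))
```

With only 11 transitions these estimates are far too noisy to mean much, but each row satisfies the two constraints on the slide: p_ij ≥ 0 and Σ_j p_ij = 1.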
Profile-HMMs: a brief introduction
Example 2: Football
[Figure: results W, D, L together with venue H (home), A (away)]

Profile-HMMs: a brief introduction
Example 2: The weather
[Figure labels: Symbols, States, H/L pressure; example sequence HHHHLLLLHH]

Profile-HMMs: a brief introduction
HMMs are probabilistic models defined by:
• A finite set S of states
• A discrete alphabet A of symbols (observed objects)
• A probability transition matrix T = (t_ij), i, j states
• A probability emission matrix E = (e_ix), i state, x symbol
[Figure: chain of states linked by transitions t_ij, each state emitting a symbol with probability e_ix]

Profile-HMMs: a brief introduction
States = 1,2; Symbols = A,B,C,D
[Figures: step-by-step example]

Profile-HMMs: a brief introduction
HMMs are probabilistic models, i.e. models that produce different outcomes with different probabilities
Example: probability model for a protein sequence
M V R G K T Q M K R I E N A T S R Q V T F
p_a, a = A, M, S, W, …
Σ_i p_i = 1, i = 1,…,20; p_i > 0
P = p_M p_V p_R p_G p_K p_T p_Q p_M p_K p_R p_I p_E p_N p_A p_T p_S p_R p_Q p_V p_T p_F

Profile-HMMs: a brief introduction
Example 2: The weather
Probabilistic, 1st order Markov model
p_ij ≥ 0; Σ_j p_ij = 1

Aligning to a family
[Figures: profile-HMM with states M, I, D; then with S and E added]

Profile-HMMs: a brief introduction
Profile-HMMs are probabilistic models…
Model -> simulates a system
Probabilistic -> produces outcomes based on probabilities

Profile-HMMs: a brief introduction
Example 1: The traffic light
R Y G
Deterministic

Profile-HMMs: a brief introduction
Example 2: The weather
S C R
Probabilistic*
p_ij: transition probability from symbol i to symbol j
*In fact, chaotic, deterministic

Profile-HMMs: a brief introduction
Example 2: The weather
Probabilistic, 1st order Markov model
P(X_{n+1} = x | X_1 = x_1, …, X_n = x_n) = P(X_{n+1} = x | X_n = x_n)
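The definition of an HMM given earlier (states, symbols, transition matrix T, emission matrix E) can be tried out on a toy model with states 1, 2 and symbols A, B, C, D. The Python sketch below computes the probability of an observed symbol string by summing over all hidden state paths (the forward algorithm). All probability values, the start distribution `START` and the function name `forward_probability` are invented for illustration; none of them come from the slides.

```python
STATES = ("1", "2")
SYMBOLS = ("A", "B", "C", "D")

# Hypothetical parameters; each row sums to 1.
START = {"1": 0.5, "2": 0.5}                      # assumed initial distribution
T = {"1": {"1": 0.9, "2": 0.1},                   # transition matrix t_ij
     "2": {"1": 0.2, "2": 0.8}}
E = {"1": {"A": 0.4, "B": 0.4, "C": 0.1, "D": 0.1},   # emission matrix e_ix
     "2": {"A": 0.1, "B": 0.1, "C": 0.4, "D": 0.4}}


def forward_probability(observed: str) -> float:
    """Probability that the HMM emits `observed`, summed over all state paths."""
    if not observed:
        return 1.0
    # alpha[s] = P(symbols seen so far, current state = s)
    alpha = {s: START[s] * E[s][observed[0]] for s in STATES}
    for symbol in observed[1:]:
        alpha = {
            j: sum(alpha[i] * T[i][j] for i in STATES) * E[j][symbol]
            for j in STATES
        }
    return sum(alpha.values())


for text in ("AABB", "CCDD", "ACAC"):
    print(text, forward_probability(text))
```

Because of the first-order Markov property, each step only needs the previous column of `alpha`, so the cost is linear in the length of the observed string. This is the same dynamic-programming idea used for pairwise alignment.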