Transcript
Welcome to
Introduction to Computational
Genomics for Infectious Disease
Course Instructors
• Instructor
James Galagan
• Teaching Assistants
Brian Weiner
Desmond Lun
• Lab Instructors
Antonis Rokas
Reinhard Engels
Mark Borowsky
Aaron Brandes
Jeremy Zucker
Caroline Colijn
Other members of Broad Microbial Analysis Group
Schedule and Logistics
• Lectures
Tues/Thurs 11-12:30
Harvard School of Public Health: FXB-301
The François-Xavier Bagnoud Center, Room 301
• Labs
Wed/Fri 1-3
Broad Institute: Olympus Room
First floor of Broad Main Lobby
See front desk attendant near entrance
Individual computers and software provided
No programming experience required
Website
www.broad.mit.edu/annotation/winter_course_2006/
• Contact information
• Directions to Broad
• Lecture slides
• Lab handouts
• Resources
Goals of Course
• Introduction to concepts behind
commonly used computational tools
• Recognize connection between different
concepts and applications
• Hands on experience with computational
analysis
Concepts and Applications
• Lectures will cover concepts
– Computationally oriented
• Labs will provide opportunity for hands
on application of tools
– Nuts and bolts of running tools
– Application of tools not covered in lectures
Computational Genomics Overview
Slide Credit: Manolis Kellis
Topics
1. Probabilistic Sequence Modeling
2. Clustering and Classification
3. Motifs
4. Steady State Metabolic Modeling
Topics Not Covered
• Sequence Alignment
• Phylogeny (maybe in labs)
• Molecular Evolution
• Population Genetics
• Advanced Machine Learning
– Bayesian Networks
– Conditional Random Fields
Applications to Infectious Disease
• Examples and labs will focus on the analysis
of microbial genomics data
– Pathogenicity islands
– TB expression analysis
– Antigen prediction
– Mycolic acid metabolism
• But approaches are applicable to any
organism and to many different questions
Probabilistic Modeling of Biological
Sequences
Concepts
Statistical Modeling of Sequences
Hidden Markov Models
Applications
Predicting pathogenicity islands
Modeling protein families
Lab Practical
Basic sequence annotation
Probabilistic Sequence Modeling
• Treat objects of interest as random
variables
– nucleotides, amino acids, genes, etc.
• Model probability distributions for these
variables
• Use probability calculus to make
inferences
Why Probabilistic Sequence Modeling?
• Biological data is noisy
• Probability provides a calculus for
manipulating models
• Not limited to yes/no answers – can provide
“degrees of belief”
• Many common computational tools based on
probabilistic models
Sequence Annotation
GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT
TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT
GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA
AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC
GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC
CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC
TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC
GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC
GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG
ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG
TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG
TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG
GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC
ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT
GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG
GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC
GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT
ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG
GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC
GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT
CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG
Sequence Annotation
[Figure: the same sequence as above, with a Gene region highlighted]
Sequence Annotation
[Figure: the same sequence, now annotated with a Promoter, a Motif, a Gene, and a Kinase Domain]
Probabilistic Sequence Modeling
• Hidden Markov Models (HMM)
– A general framework for sequences of symbols (e.g. nucleotides, amino acids)
– Widely used in computational genomics
1. Hmmer – HMMs for protein families
2. Pathogenicity Islands
Pathogenicity Islands
• Clusters of genes acquired by horizontal transfer
– Present in pathogenic species but not others
• Frequently encode virulence factors
– Toxins, secondary metabolites, adhesins
• Different GC content than rest of genome
• (Flanked by repeats, gene content, phylogeny, regulation, codon usage)
[Figure: GC content along the Neisseria meningitidis genome, 52% G+C (from Tettelin et al. 2000. Science)]
Application: Bacillus subtilis
Modeling Sequence Composition
• Calculate sequence distribution from
known islands
– Count occurrences of A,T,G,C
• Model islands as nucleotides drawn
independently from this distribution
... C C T A A G T T A G A G G A T T G A G A ...
Each position Si is drawn independently from the same distribution P(Si|MP):
A: 0.15, T: 0.13, G: 0.30, C: 0.42
The Probability of a Sequence
• Can calculate the probability of a particular sequence
(S) according to the pathogenicity island model (MP)
P(S|MP) = P(S1, S2, ..., SN | MP) = ∏(i=1..N) P(Si|MP)

Example
S = AAATGCGCATTTCGAA
With the island model (A: 0.15, T: 0.13, G: 0.30, C: 0.42), the sequence contains 6 As, 4 Ts, 3 Gs, and 3 Cs:

P(S|MP) = P(A)^6 × P(T)^4 × P(G)^3 × P(C)^3
        = (0.15)^6 × (0.13)^4 × (0.30)^3 × (0.42)^3
        ≈ 6.5 × 10^-12
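In code, this per-position product is only a few lines. A minimal sketch in Python (the emission probabilities are the slide's island model; working in log space avoids numerical underflow on long sequences):

```python
import math

# Emission probabilities for the pathogenicity-island model M_P (slide values)
ISLAND = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}

def log_seq_probability(seq, model):
    """log P(S|M) for an i.i.d. model: sum of per-nucleotide log probabilities."""
    return sum(math.log(model[nt]) for nt in seq)

s = "AAATGCGCATTTCGAA"
print(math.exp(log_seq_probability(s, ISLAND)))
```

Summing logs rather than multiplying raw probabilities matters in practice: for a genome-scale sequence the raw product underflows to zero.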
Sequence Classification
PROBLEM: Given a sequence, is it an island?
– We can calculate P(S|MP), but what is a sufficient P value?
SOLUTION: compare to a null model and calculate log-likelihood ratio
– e.g. background DNA distribution model, B
Score = log [ P(S|MP) / P(S|B) ] = log ∏(i=1..N) [ P(Si|MP) / P(Si|B) ] = Σ(i=1..N) log [ P(Si|MP) / P(Si|B) ]
     Pathogenicity Islands   Background DNA   Score Matrix (log2)
A    0.15                    0.25             -0.73
T    0.13                    0.25             -0.94
G    0.30                    0.25              0.26
C    0.42                    0.25              0.74
Finding Islands in Sequences
• Could use the log-likelihood ratio on
windows of fixed size
– What if islands have variable length?
• We prefer a model for entire sequence
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA
GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC
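A fixed-window scorer is easy to sketch. The model and background values below come from the slides; the window width of 10 is an arbitrary choice for illustration:

```python
import math

ISLAND = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}
BACKGROUND = {nt: 0.25 for nt in "ATGC"}

# Per-nucleotide log-odds scores (base 2, matching the slide's score matrix)
SCORE = {nt: math.log2(ISLAND[nt] / BACKGROUND[nt]) for nt in "ATGC"}

def window_scores(seq, width):
    """Log-likelihood-ratio score of every window of fixed width.
    Positive scores favor the island model over background."""
    return [sum(SCORE[nt] for nt in seq[i:i + width])
            for i in range(len(seq) - width + 1)]

seq = ("TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA"
       "GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC")
scores = window_scores(seq, 10)
print(max(scores), min(scores))
```

The weakness the slide points out is visible here: the result depends on the chosen width, which is exactly why a model over the entire sequence is preferable.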
A More Complex Model
[Two-state model: Background and Island]
Transitions: Background→Background 0.85, Background→Island 0.15, Island→Island 0.75, Island→Background 0.25
Background emissions: A: 0.25, T: 0.25, G: 0.25, C: 0.25
Island emissions: A: 0.15, T: 0.13, G: 0.30, C: 0.42
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA
GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC
A Generative Model
L:  each position takes a hidden label, P (island) or B (background)
S:  G C A A A T G C

Transition probabilities P(Li+1|Li):
        B(i+1)  P(i+1)
  Bi     0.85    0.15
  Pi     0.25    0.75

Emission probabilities:
  P(S|B): A: 0.25, T: 0.25, G: 0.25, C: 0.25
  P(S|P): A: 0.42, T: 0.30, G: 0.13, C: 0.15
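The "generative" view can be made concrete by running the model forward. A sketch using the transition and emission values from this slide (the choice to start in the background state is an assumption, since the slide does not give a start distribution):

```python
import random

# Two-state model from the slide
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15}}

def generate(n, start="B", rng=random):
    """Walk the model for n steps: emit a nucleotide from the current
    state, then choose the next state from the transition distribution."""
    labels, seq, state = [], [], start
    for _ in range(n):
        labels.append(state)
        nts, eprobs = zip(*EMIT[state].items())
        seq.append(rng.choices(nts, eprobs)[0])
        states, tprobs = zip(*TRANS[state].items())
        state = rng.choices(states, tprobs)[0]
    return "".join(labels), "".join(seq)

labels, seq = generate(60)
print(labels)
print(seq)
```

Because the self-transition probabilities (0.85 and 0.75) are large, the sampled labels come in runs, which is what makes the model a reasonable stand-in for islands embedded in background DNA.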
A Hidden Markov Model
Hidden states: L = { 1, ..., K }
Transition probabilities: aij = probability of a transition from state i to state j
Emission probabilities: ei(b) = P(emitting b | state = i)
Initial state probabilities: p(b) = P(first state = b)
What can we do with this model?
The model defines a joint probability over
labels and sequences, P(L,S)
Implicit in model is what labels “tend to go”
with what sequences (and vice versa)
Rules of probability allow us to use this model
to analyze existing sequences
Fundamental HMM Operations
Computation / Biology

Decoding
• Given: an HMM and a sequence S
• Find: a corresponding sequence of labels, L
→ Annotate pathogenicity islands on a new sequence

Evaluation
• Given: an HMM and a sequence S
• Find: P(S|HMM)
→ Score a particular sequence (not as useful for this model – will come back to this later)

Training
• Given: an HMM without parameters and a set of sequences S
• Find: transition and emission probabilities that maximize P(S | params, HMM)
→ Learn a model for sequence composed of background DNA and pathogenicity islands
The Hidden in HMM
• DNA does not come
conveniently labeled (i.e.
Island, Gene, Promoter)
• We observe nucleotide
sequences
• The hidden in HMM refers to
the fact that state labels, L,
are not observed
– Only observe emissions (e.g.
nucleotide sequence in our
example)
[Figure: hidden states i and j emitting the observed sequence ...A A G T T A G A G...]
“Decoding” With HMM
Given observables, we would like to predict a
sequence of hidden states that is most likely to
have generated that sequence
Pathogenicity Island Example
Given a nucleotide sequence, we want a labeling of
each nucleotide as either “pathogenicity island” or
“background DNA”
The Most Likely Path
• Given a sequence, one reasonable choice for
a labeling is:
L* = argmax(labels) P(Labels, Sequence | Model)
The sequence of labels, L*, (or path) that makes
the labels and sequence most likely given the
model
Probability of a Path, Seq
L:  B B B B B B B B
S:  G C A A A T G C
(each B→B transition contributes 0.85; each emission from B contributes 0.25)

P = P(G|B) P(B1|B0) P(C|B) P(B2|B1) P(A|B) P(B3|B2) ... P(C|B)
  = (0.85)^7 × (0.25)^8
  ≈ 4.9 × 10^-6
Probability of a Path, Seq
L:  B B B P P P B B
S:  G C A A A T G C

P = P(G|B) P(B1|B0) P(C|B) P(B2|B1) P(A|B) P(P3|B2) ... P(C|B)
  = (0.85)^3 × (0.25)^6 × (0.75)^2 × (0.42)^2 × (0.30) × (0.15)
  ≈ 6.7 × 10^-7

We could try to calculate the probability of every path, but....
Decoding
• Viterbi Algorithm
– Finds most likely sequence of labels, L*, given
sequence and model
L* = argmax(labels) P(Labels, Sequence | Model)
– Uses dynamic programming (same technique used in
sequence alignment)
– Much more efficient than searching every path
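A sketch of the Viterbi recursion for the two-state model (parameters from the slides; the uniform start distribution is an assumption, since the slides leave it unspecified):

```python
import math

TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15}}
INIT = {"B": 0.5, "P": 0.5}  # assumed uniform start

def viterbi(seq):
    """Most likely label sequence L* = argmax_L P(L, S), by dynamic programming."""
    states = list(TRANS)
    # v[k] = log probability of the best path so far ending in state k
    v = {k: math.log(INIT[k]) + math.log(EMIT[k][seq[0]]) for k in states}
    ptrs = []  # backpointers for traceback
    for nt in seq[1:]:
        ptr, nv = {}, {}
        for k in states:
            best = max(states, key=lambda j: v[j] + math.log(TRANS[j][k]))
            ptr[k] = best
            nv[k] = v[best] + math.log(TRANS[best][k]) + math.log(EMIT[k][nt])
        v = nv
        ptrs.append(ptr)
    # Trace back from the best final state
    state = max(states, key=v.get)
    path = [state]
    for ptr in reversed(ptrs):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("GCAAATGC"))
```

The two tricks to notice: log probabilities (so long sequences do not underflow), and the backpointer table, which is what makes recovering the best path linear-time instead of exponential.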
Probability of a Single Label
Sum over all paths:
L:  each position may be labeled P or B
S:  G C A A A T G C

The forward algorithm (dynamic programming) computes, e.g., P(Label5 = B | S) by summing over all paths.
• Calculate the most probable label, L*i, at each position i
• Doing this for all N positions gives {L*1, L*2, L*3, ..., L*N}
Two Decoding Options
• Viterbi Algorithm
– Finds most likely sequence of labels, L*, given
sequence and model
L* = argmax(labels) P(Labels | Sequence, Model)
• Posterior Decoding
– Finds most likely label at each position for all
positions, given sequence and model
{L*1, L*2, L*3…. L*N}
– Forward and Backward equations
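Posterior decoding can be sketched directly from the forward and backward recursions (same parameters and assumed uniform start as before; unscaled probabilities are fine for short sequences but underflow on long ones, where scaling or log-space sums are needed):

```python
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15}}
INIT = {"B": 0.5, "P": 0.5}  # assumed uniform start

def posterior_decode(seq):
    """Pick argmax_k P(L_i = k | S) at each position, via forward/backward."""
    states = list(TRANS)
    n = len(seq)
    # Forward: f[i][k] = P(x_1..x_i, L_i = k)
    f = [{k: INIT[k] * EMIT[k][seq[0]] for k in states}]
    for nt in seq[1:]:
        f.append({k: EMIT[k][nt] * sum(f[-1][j] * TRANS[j][k] for j in states)
                  for k in states})
    # Backward: b[i][k] = P(x_{i+1}..x_n | L_i = k)
    b = [None] * n
    b[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][j] * EMIT[j][seq[i + 1]] * b[i + 1][j]
                       for j in states) for k in states}
    total = sum(f[n - 1][k] for k in states)  # P(S): the Evaluation quantity
    labels = "".join(max(states, key=lambda k: f[i][k] * b[i][k])
                     for i in range(n))
    return labels, total

labels, p_seq = posterior_decode("GCAAATGC")
print(labels, p_seq)
```

Note that the same forward table also answers the Evaluation problem: summing its last column gives P(S|HMM).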
Application: Bacillus subtilis
Method
• Three-state model: Gene+ (gene on positive strand), Gene– (gene on negative strand), AT Rich
• Second-order emissions: P(Si | State, Si-1, Si-2) (capturing trinucleotide frequencies)
• Train using EM
• Predict with posterior decoding
Nicolas et al (2002) NAR
Results
[Figure: each line is P(label | S, model), color-coded by label]
• Gene on positive strand
• A/T rich: intergenic regions and islands
• Gene on negative strand
Nicolas et al (2002) NAR
Fundamental HMM Operations
Computation / Biology

Decoding
• Given: an HMM and a sequence S
• Find: a corresponding sequence of labels, L
→ Annotate pathogenicity islands on a new sequence

Evaluation
• Given: an HMM and a sequence S
• Find: P(S|HMM)
→ Score a particular sequence (not as useful for this model – will come back to this later)

Training
• Given: an HMM without parameters and a set of sequences S
• Find: transition and emission probabilities that maximize P(S | params, HMM)
→ Learn a model for sequence composed of background DNA and pathogenicity islands
Training an HMM
Transition probabilities: e.g. P(Pi+1|Bi) – the probability of entering a pathogenicity island from background DNA
Emission probabilities: i.e. the nucleotide frequencies for background DNA and pathogenicity islands
Parameters are fit by maximum likelihood estimation.
Learning From Labelled Data
If we have a sequence that has islands marked, we can simply count:

L:  start B B B P P P B B end
S:        G C A A A T G C

P(Li+1|Li):
          B(i+1)  P(i+1)  End
  Start     1       0      0
  Bi       3/5     1/5    1/5
  Pi       1/3     2/3     0

P(S|B): A: 1/5, T: 0, G: 2/5, C: 2/5
P(S|P): counted the same way from the island positions (A: 2/3, T: 1/3, G: 0, C: 0)
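The counting itself is mechanical. A sketch that reproduces the slide's numbers from the labelled example (the "start"/"end" state names are bookkeeping labels added here for the purpose):

```python
from collections import Counter, defaultdict

def count_parameters(labels, seq):
    """Maximum-likelihood estimates from a labelled sequence:
    normalized transition and emission counts."""
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    path = ["start"] + list(labels) + ["end"]
    for prev, nxt in zip(path, path[1:]):
        trans[prev][nxt] += 1
    for state, nt in zip(labels, seq):
        emit[state][nt] += 1
    t = {s: {nxt: n / sum(c.values()) for nxt, n in c.items()}
         for s, c in trans.items()}
    e = {s: {nt: n / sum(c.values()) for nt, n in c.items()}
         for s, c in emit.items()}
    return t, e

t, e = count_parameters("BBBPPPBB", "GCAAATGC")
print(t["B"])  # {'B': 0.6, 'P': 0.2, 'end': 0.2} -> the slide's 3/5, 1/5, 1/5
print(e["B"])  # G and C at 2/5 each, A at 1/5
```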
End
Unlabelled Data
How do we know how to count?

L:  ? ? ? ? ? ? ? ?
S:  G C A A A T G C

P(Li+1|Li): ?    P(S|B): ?    P(S|P): ?
Unlabeled Data
L:  (hidden)
S:  G C A A A T G C

An idea:
1. Imagine we start with some parameters
2. We could calculate the most likely path, P*, given those parameters and S
3. We could then use P* to update our parameters by maximum likelihood
4. And iterate (to convergence)

{P(Li+1|Li), P(S|B), P(S|P)}0 → {P(Li+1|Li), P(S|B), P(S|P)}1 → {P(Li+1|Li), P(S|B), P(S|P)}2 → ... → {P(Li+1|Li), P(S|B), P(S|P)}K
Expectation Maximization (EM)
1. Initialize parameters
2. E step: estimate the probability of the hidden labels, Q, given the current parameters and the sequence
   Q = P(Labels | S, params(t-1))
3. M step: choose new parameters to maximize the expected log likelihood under Q
   params(t) = argmax(params) E_Q[ log P(S, Labels | params) ]
4. Iterate
P(S|Model) is guaranteed to increase with each iteration.
Expectation Maximization (EM)
Remember the basic idea!
1. Use the model to estimate (a distribution over) the missing data
2. Use the estimate to update the model
3. Repeat until convergence
EM is a general approach for learning models
(ML estimation) when there is “missing data”
Widely used in computational biology
EM is frequently used in motif discovery (Lecture 3)
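Full Baum-Welch training of the island HMM is more machinery than fits here, but the E/M alternation itself can be seen in a stripped-down setting. The sketch below is NOT the slide's HMM: it is a two-component mixture that treats positions as independent (an assumption made for brevity), so the E step is a simple per-position posterior rather than forward-backward.

```python
def em_mixture(seq, iters=50):
    """EM for a 2-component nucleotide mixture (a simplified stand-in for
    Baum-Welch: positions are assumed independent, no transitions)."""
    nts = "ATGC"
    # 1. Initialize parameters (asymmetric, so the components can separate)
    weight = 0.5
    emit = [{"A": 0.4, "T": 0.3, "G": 0.2, "C": 0.1},
            {"A": 0.1, "T": 0.2, "G": 0.3, "C": 0.4}]
    for _ in range(iters):
        # 2. E step: posterior responsibility of component 0 for each position
        resp = []
        for nt in seq:
            p0 = weight * emit[0][nt]
            p1 = (1 - weight) * emit[1][nt]
            resp.append(p0 / (p0 + p1))
        # 3. M step: re-estimate parameters from the expected counts
        weight = sum(resp) / len(seq)
        for k in (0, 1):
            w = [r if k == 0 else 1 - r for r in resp]
            tot = sum(w)
            emit[k] = {nt: sum(wi for wi, s in zip(w, seq) if s == nt) / tot
                       for nt in nts}
    return weight, emit

w, emit = em_mixture("AAAAATTTTTGGGGGCCCCC" * 5)
print(round(w, 3))
```

The structure mirrors the slide exactly: estimate the distribution over the missing labels, update the parameters from expected counts, repeat; the likelihood is non-decreasing at every iteration.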
A More Sophisticated Application
Modeling Protein Families
• Given amino acid sequences from a protein family,
how can we find other members?
– Can search databases with each known member – not
sensitive
– More information is contained in full set
• The HMM Profile Approach
– Learn the statistical features of protein family
– Model these features with an HMM
– Search for new members by scoring with HMM
We will learn features from multiple alignments
Human Ubiquitin Conjugating Enzymes
UBE2D2    FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
UBE2D3    FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
BAA91697  FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2D1    FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2E1    FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBCH9     FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBE2N     LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
AAF67016  IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
UBCH10    FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
CDC34     FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
BAA91156  FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
UBE2G1    FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
UBE2B     FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
UBE2I     FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
E2EPF5    LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
UBE2L1    FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
UBE2L6    FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
UBE2H     LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
UBC12     VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS
Profile HMM
[Figure: profile HMM architecture. Each match state M1 ... Mj ... MN carries a column of amino acid emission probabilities (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y); insert states I1 ... Ij ... IN and delete states D1 ... Dj ... DN connect Start to End.]

E2EPF5  LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKRA------------DWTAELGIRH
UBE2L1  FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISAA-----------ENWKPATKTDQ
UBE2L6  FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISSA-----------ENWKPCTKTCQ
UBE2H   LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-P-----------QTWTALYDLTN
Using Profile HMMs
Computation / Biology

Decoding
• Find: the sequence of labels, L, that maximizes P(L|S, HMM)
→ Align a new sequence to a protein family

Evaluation
• Find: P(S|HMM)
→ Score a sequence for membership in the family

Training
• Find: transition and emission probabilities that maximize P(S | params, HMM)
→ Discover and model family structure
Example: Modeling Globins
• Profile HMM from 300 randomly selected globin
genes
• Score database of 60,000 proteins
PFAM Collection of Profile HMMs
http://www.sanger.ac.uk/Software/Pfam/
PFAM Resources
• 8957 curated protein
families and domains
• Each with HMM profile(s)
• Coverage
– 73% of proteins in Swiss-Prot and SP-TrEMBL
– 53% of “typical” genome
sequence
Example PFAM Entry
• Literature Links
• Protein Structure
• Domain Architectures
• GO Functional Categories
Lab 1
HMMER
• Implementation of Profile HMM methods
• Given a multiple alignment, HMMER
can build a Profile HMM
• Given a Profile HMM (i.e. from PFAM),
HMMER can score sequences for
membership in the family or domain
HMMs in Context
• HMMs
– Sequence alignment
– Gene Prediction
• Generalized HMMs
– Variable length states
– Complex emissions models
– e.g. Genscan
• Bayesian Networks
– General graphical model
– Arbitrary graph structure
– e.g. Regulatory network analysis
References
• Sean R. Eddy, "Hidden Markov models," Current Opinion in Structural Biology, 6:361-365, 1996.
• Sean R. Eddy, "Profile hidden Markov models," Bioinformatics, 14(9):755-763, 1998.
• Anders Krogh, "An introduction to hidden Markov models for biological sequences," in Computational Methods in Molecular Biology, edited by S. L. Salzberg, D. B. Searls and S. Kasif, pp. 45-63, Elsevier, 1998.
• HMMER: profile HMMs for protein sequence analysis. http://hmmer.wustl.edu/
• Erik L. L. Sonnhammer et al., "Pfam: multiple sequence alignments and HMM profiles of protein domains," Nucleic Acids Research, 26(1):320-322, 1998.
• R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, 1998.
Tomorrow’s Lab
• Basic Sequence Analysis Tools
– Argo Genome Browser
– Blast
– Gene prediction using Glimmer
– Protein families with Hmmer and PFAM
– Comparative synteny analysis
• Identify virulence factors by annotating and
comparing virulent and avirulent bacterial
sequences
Relation between Viterbi and Forward
VITERBI
Vj(i) = probability of the most probable path ending in state j after observations x1...xi
Initialization: V0(0) = 1; Vk(0) = 0 for all k > 0
Iteration: Vj(i) = ej(xi) · max(k) Vk(i-1) akj
Termination: P(x, π*) = max(k) Vk(N)

FORWARD
fj(i) = P(x1...xi, state at position i = j)
Initialization: f0(0) = 1; fk(0) = 0 for all k > 0
Iteration: fj(i) = ej(xi) · Σ(k) fk(i-1) akj
Termination: P(x) = Σ(k) fk(N) ak0
Slide Credit: Serafim Batzoglou