Download Biology 162: Computational Genetics

Profile HMMs Biology 162 Computational Genetics Todd Vision 16 Sep 2004 Outline • Profile HMMs generate MSAs • States and transitions for – Matches, Insertions, Deletions, Silent and Flanking states • Statistics – Null model, E values • Training – Model construction, Weighting training sequences and including pseudocounts (which have a Bayesian interpretation) • Existing tools – Interpro, including Pfam and HMMER Globins Helix HBA_HUMAN HBB_HUMAN MYG_PHYCA LGB2_LUPLU Consensus 1111111222222222222222222222 3333333333333 -DLS-----HGSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKLGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL---D--NLKGTFATLSELHCDKLKHLKTEAEMKASEDLKKHGVTVLTALGAILKK----K-GHHEAELKPLAQSHATKHLK-GTSEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG.l.t ... .kHg.kV. a. ..... . ..l. L. .H. K. Hidden Markov models • Observed sequence of symbols • Hidden sequence of underlying states • Transition probabilities govern transitions among states • Emission probabilities govern the likelihood of observing a symbol in a particular state Profile HMMs • Use scores rather than emission probabilities directly L ei (x i ) S   log qx i i1 where ei (x i ) is the emission prob of symbol qx i is the prob of x i under null model L is length of profile x i a at pos i A PSSM as a simple HMM begin M1 … Mi Mi+1 … ML With emission probabilities unique to each match state end But what about gaps? • Ignore them (BLOCKS database) OR • Model them – Insert states have background emission probabilities Ii begin M1 … Mi … ML end Gap scores • For an insert of length k with background emission probabilities, we have affine gap scores If eI i (a)  qa then log eI i (a) 0 qa Sgap  log aM i I i  (k 1)log aI i I i  log aM i I j Length distribution of inserts • Geometric distribution P(k)  a a  a (1 aII ) k1 II IM  aII k1 II I aMI aIM Which columns are match states? Helix HBA_HUMAN HBB_HUMAN MYG_PHYCA LGB2_LUPLU Consensus 1111111222222222222222222222 3333333333333 -DLS-----HGSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKLGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL---D--NLKGTFATLSELHCDKLKHLKTEAEMKASEDLKKHGVTVLTALGAILKK----K-GHHEAELKPLAQSHATKHLK-GTSEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG.l.t ... .kHg.kV. a. ..... . ..l. L. .H. K. • Options – Assign columns to be match states by eye – Heuristic i.e. no more than 50% gaps per column – Maximum a posteriori (MAP) model construction • O(L2) dynamic programming algorithm exists to find model that optimizes score on training data Two ways to handle deletions • Transitions between match states begin M1 … Mi … ML end • Silent deletion states (no emission) begin D1 D2 D3 M1 M2 M3 end Profile HMM Flanking states • Many sites in a sequence may be assigned to 'flanking' states (N, C, or J) • Transitions should force one or more match states to be traversed at least once Local or global alignment? • Are transitions allowed – From start to internal match? – From internal match to end? • Are there states that can emit sequences before and after the profile? • Do transitions allow the profile to be repeated? • In HMMs – Global/local behavior governed by model not algorithm – Behavior may differ w.r.t. the profile and the sequence Null model S N P(seq | HMM) Sbit  log 2 P(seq | null )  T Extreme value problems • How to convert Sbit to an expect value? • Since alignment is not truly local, theory used for BLAST does not hold here • Solutions (both available in HMMER) – Conservative approximation valid for any profile – Empirically fit extreme value distribution using simulated sequences – Must be done once for every profile HMM Why not always use full model? • The sum of probabilities is constrained to be one • Spreading probability among many paths decreases power to discriminate among them • You should always choose the most restrictive model (fewest transitions) consistent with your purpose Which algorithm to use? • Three choices – Viterbi: maximum likelihood path – Forward: sum of probabilities of all possible paths – Forward-backward: prob of each state at each pos • For database search – Query sequence against a database of profile HMMs – Profile HMM against a database of sequences? • For alignment – Adding new sequence(s) to an existing alignment Training a profile HMM • Weighting training sequences – We saw the same problem when scoring multiple alignments – Same approaches are used for profile HMMs • Estimating transition probabilities – Taken care of by MAP model construction • Estimating emission probabilities – We will assume the alignment is correct – Only issue is how to add pseudocounts Better pseudocounts • Laplace's Rule – Ignores background frequencies of residues • Background frequency pseudocounts – eM i (a)  c ja  Aqa c ja   A a  where c ja are actual counts qa are background frequencies A is a weight (eg 20) Pseudocounts as Bayesian priors • Bayes' rule Likelihood Prior Posterior P(D | M)P(M) P(M | D)   P(D | M)P(M) for all M where M  model D  data Dirichlet mixture pseudocounts • Background probabilities are not uniform throughout the protein – eg exposed loops (hydrophilic residues abundant) vs. buried core (small side chains abundant) • Different sets of pseudocount priors (Dirichlet distributions) for each environment • Pseudocounts for Ii are determined by a mixture of Dirichlet distributions fit to position i Evolutionary pseudocounts • Related to phylogenetic methods we will see later – Calculate probability of each residue having been the common ancestor of the residues in a column – Calculate probability of each residue as a descendent – Use these probabilities as priors with appropriate weighting • Requires use of a position-independent scoring matrix (eg PAM) Queries vs. subjects • Two directions of search are possible – Sequence query against database of profile HMMs – Profile HMM against a database of sequences • Bit scores will be the same regardless • But E-values will differ – Search space (ie number of subjects in database) can differ considerably – It is usually more sensitive to search a database of profile HMMs Interpro • Regular Expressions – PROSITE • PSSMs, other motifs – PROSITE, PRINTS, PRODOM • Profile HMMs – – – – – Pfam SMART TIGRFAMs PIR SuperFamily SUPERFAMILY Interpro v8 QuickTime™ and a TIFF (LZW) decompressor are needed to see this picture. QuickTime™ and a TIFF (LZW) decompressor are needed to see this picture. QuickTime™ and a TIFF (LZW) decompressor are needed to see this picture. Pfam • A profile HMM database – Based on Swissprot and TREMBL • Current version (v15) has 7503 families. – ~75% of all new protein sequences match an existing Pfam profile • Profiles constructed semi-automatically – – – – New families identified Seed alignment manually optimized Profile HMM constructed All matching sequences aligned to HMM HMMER • Used in construction of Pfam – Can build a profile (with MAP algorithm) – Can search a sequence against a profile and vice versa (i.e. with forward algorithm) – Can add new sequences to an alignment (via Viterbi) – Uses Plan 7 profiles • User sets the local/global behavior HMMER2.0 [2.3.1] NAME fn3 ACC PF00041.8 DESC Fibronectin type III domain LENG 84 ALPH Amino RF no CS no MAP yes COM hmmbuild -F HMM_ls.ann SEED.ann COM hmmcalibrate --seed 0 HMM_ls.ann NSEQ 108 DATE Mon Jul 26 14:10:07 2004 CKSUM 1153 GA 8.0 -1.0 TC 8.0 0.0 NC 7.9 7.9 XT -8455 -4 -1000 -1000 -8455 -4 -8455 NULT -4 -8455 NULE 595 -1558 85 338 -294 453 -1158 45 531 201 384 -1998 -644 EVD -45.006527 0.260185 HMM A C D E F G H R S T V W Y m->m m->i m->d i->m i->i d->m d->d -13 * -6758 1 -1698 -4236 -5400 -853 -4220 -2885 -1258 -4774 -1187 -1320 -120 -4666 -1510 1 -150 -501 232 46 -382 399 104 95 358 118 -368 -296 -251 -144 -3402 -12951 -19 -6284 -701 -1378 2 -613 -5389 1601 -868 -5707 553 -3558 -2014 1940 -582 -1300 -5575 -1474 3 -149 -500 233 43 -381 399 106 96 359 117 -369 -294 -249 -4 197 249 902 -1085 -142 -21 -313 I K L M N P Q b->m m->e -930 -2438 408 -3428 -4769 3631 -1835 -628 211 -461 -722 274 395 44 -13 -5456 * -3139 -894 -4479 -526 1872 -1543 -626 210 -466 -720 275 394 45 Summary • Profile HMMs generate MSAs • States and transitions for – Matches, Insertions (which can model affine gaps), Deletions, which allow local alignment, Silent states and flanking states • Statistics – Scored relative to a null model and E values must be determined empirically • Training – MAP model construction, Training sequence weighting and pseudocounts (which have a Bayesian interpretation) • Existing tools – Interpro, including Pfam and HMMER Assignment • Look over study guide – posted on Blackboard • Turn in lab/problem set on Tuesday • Midterm on Thursday

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Biology 162: Computational Genetics