Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Computer simulation wikipedia , lookup
Numerical weather prediction wikipedia , lookup
Generalized linear model wikipedia , lookup
General circulation model wikipedia , lookup
History of numerical weather prediction wikipedia , lookup
Gene expression programming wikipedia , lookup
Probabilistic context-free grammar wikipedia , lookup
Applications of Hidden Markov Models (Lecture for CS397-CXZ Algorithms in Bioinformatics) March 6, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign 1 Today’s Lecture • HMM Applications – Profile HMMs (Classification) – HMMs for Multiple Sequence Alignment (Pattern discovery) – HMMs for Gene Finding (Segmentation) • Special issues in HMMs – Local Maximas – Model construction – Weighting training sequences 2 HMM Applications • Classification (e.g., Profile HMMs) – Build an HMM for each class (profile HMMs) – Classify a sequence using Bayes rule • Multiple sequence alignment – Build an HMM based on a set of sequences – Decode each sequence to find a multiple alignment • Segmentation (e.g., gene finding) – Use different states to model different regions – Decode a sequence to reveal the region boundaries 3 HMMs for Classification E.g., Protein families C {C1 ,..., Ck } p( X | C ) p(C ) p(C | X ) p( X ) C* arg max C p( X | C ) p(C ) Assign a family to X p(X|C) is modeled by a profile HMM built specifically for C Assuming example sequences are available for C 4 HMMs for Multiple Alignment • Given a set of sequences S={X1, …,Xk} • Train an HMM, e.g., using Baum-Welch (finding the HMM that maximizes the probability of S) • Decode each sequence Xi • Assemble the Viterbi paths to form a multiple alignment (insertions are uncertain) 5 HMM-based Gene Finding • • • • Design two types of states – “Within Gene” States – “Outside Gene” States Use known genes to estimate the HMM Decode a new sequence to reveal which part is a gene Example software: – – – – – GENSCAN (Burge 1997) FGENESH (Solovyev 1997) HMMgene (Krogh 1997) GENIE (Kulp 1996) GENMARK (Borodovsky & McIninch 1993) – VEIL (Henderson, Salzberg, & Fasman 1997) 6 VEIL: Viterbi Exon-Intron Locator Exon HMM Model Upstream Start Codon 3’ Splice Site Exon Intron Stop Codon 5’ Splice Site Downstream 5’ Poly-A Site VEIL Architecture • Enter: start codon or intron (3’ Splice Site) • Exit: 5’ Splice site or three stop codons (taa, tag, tga) (Slide from N. F. Samatova’s lecture) 7 GenScan Architecture • • It is based on Generalized HMM (GHMM) Model both strands at once – Other models: Predict on one strand first, then on the other strand • • • – Avoids prediction of overlapping genes on the two strands (rare) Each state may output a string of symbols (according to some probability distribution). Explicit intron/exon length modeling Special sensors for Cap-site and TATA-box 8 Fig. 3, Burge and Karlin 1997 Special Issues • Local maxima • Optimal model construction • Weighting training sequences 9 Solutions to the Local Maxima Problem • Repeat with different initializations • Start with the most reasonable initial model • Simulated annealing (slow down the convergence speed) 10 Local Maxima: Illustration Local maxima Bad starting point Global maxima Good starting point 11 Optimal Model Construction p( X | HMM ) p( HMM ) p( HMM | X ) p( X ) HMM * arg max HMM p( HMM | X ) arg max HMM p ( X | HMM ) p ( HMM ) Bayesian model selection: P(HMM) should prefer simpler models 12 Sequence Weighting • Avoid over-counting similar sequences from the same organisms • Typically compute a weight for a sequence based on an evolutionary tree • Many ways to incorporate the weights, e.g., – Unequal likelihood – Unequal weight contribution in parameter estimation 13 HMMs in Real Applications • SAM-T98 Tutorial: – http://www.cse.ucsc.edu/research/compbio/ismb 99.tutorial.html • Pfam – http://www.sanger.ac.uk/Software/Pfam/ 14