Download Generalized-HMMs - Center for Bioinformatics and

CMSC 828N lecture notes: Eukaryotic Gene Finding with Generalized HMMs Mihaela Pertea and Steven Salzberg Center for Bioinformatics and Computational Biology, University of Maryland Eukaryotic Gene Finding Goals • Given an uncharacterized DNA sequence, find out: – Which regions code for proteins? – Which DNA strand is used to encode each gene? – Where does the gene starts and ends? – Where are the exon-intron boundaries in eukaryotes? • Overall accuracy usually below 50% Gene Finding: Different Approaches • Similarity-based methods. These use similarity to annotated sequences like proteins, cDNAs, or ESTs (e.g. Procrustes, GeneWise). • Ab initio gene-finding. These don’t use external evidence to predict sequence structure (e.g. GlimmerHMM, GeneZilla, Genscan, SNAP). • Comparative (homology) based gene finders. These align genomic sequences from different species and use the alignments to guide the gene predictions (e.g. TWAIN, SLAM, TWINSCAN, SGP-2). • Integrated approaches. These combine multiple forms of evidence, such as the predictions of other gene finders (e.g. Jigsaw, EuGène, Gaze) Why ab-initio gene prediction? Ab initio gene finders can predict novel genes not clearly homologous to any previously known gene. Identifying Signals In DNA with a Signal Sensor We slide a fixed-length model or “window” along the DNA and evaluate score(signal) at each point: Signal sensor …ACTGATGCGCGATTAGAGTCATGGCGATGCATCTAGCTAGCTATATCGCGTAGCTAGCTAGCTGATCTACTATCGTAGC… When the score is greater than some threshold (determined empirically to result in a desired sensitivity), we remember this position as being the potential site of a signal. The most common signal sensor is the Weight Matrix: A = 31% A = 18% T = 28% T = 32% C = 21% C = 24% G = 20% G = 26% A T G 100% 100% 100% A = 19% A = 24% T = 20% T = 18% C = 29% C = 26% G = 32% G = 32% Start and stop codon scoring Score all potential start/stop codons within a window of length 19. CATCCACCATGGAGAA CCACCATGG Kozak consensus The probability of generating the sequence X  x1 x 2  x is given by:  p ( X )  p ( x1 ) p ( xi | xi 1 ) (1) (i ) i 2 (WAM model or inhomogeneous Markov model) Splice Site Scoring Donor/Acceptor sites at location k: DS(k) = Scomb(k,16) + (Scod(k-80)-Snc(k-80)) + (Snc(k+2)-Scod(k+2)) AS(k) = Scomb(k,24) + (Snc(k-80)-Scod(k-80)) + (Scod(k+2)-Snc(k+2)) Scomb(k,i) = score computed by the Markov model/MDD method using window of i bases Scod/nc(j) = score of coding/noncoding Markov model for 80bp window starting at j Coding Statistics • Unequal usage of codons in the coding regions is a universal feature of the genomes • We can use this feature to differentiate between coding and noncoding regions of the genome • Coding statistics - a function that for a given DNA sequence computes a likelihood that the sequence is coding for a protein • Many different ones ( codon usage, hexamer usage,GC content, Markov chains, IMM, ICM.) 3-periodic ICMs A three-periodic ICM uses three ICMs in succession to evaluate the different codon positions, which have different statistics: P[C|M0] ICM0 P[G|M1] ICM1 P[A|M2] ICM2 ATC GAT CGA TCA GCT TAT CGC ATC The three ICMs correspond to the three phases. Every base is evaluated in every phase, and the score for a given stretch of (putative) coding DNA is obtained by multiplying the phase-specific probabilities in a L 1 mod 3 fashion:  P( f i )(mod 3) ( xi ) i 0 GlimmerHMM uses 3-periodic ICMs for coding and homogeneous (non-periodic) ICMs for noncoding DNA. The Advantages of Periodicity and Interpolation HMMs and Gene Structure • Nucleotides {A,C,G,T} are the observables • Different states generate nucleotides at different frequencies A simple HMM for unspliced genes: A T G T A A AAAGC ATG CAT TTA ACG AGA GCA CAA GGG CTC TAA TGCCG • The sequence of states is an annotation of the generated string – each nucleotide is generated in intergenic, start/stop, coding state Recall: “Pure” HMMs An HMM is a stochastic machine M=(Q, , Pt, Pe) consisting of the following: • a finite set of states, Q={q0, q1, ... , qm} • a finite alphabet  ={s0, s1, ... , sn} • a transition distribution Pt : Q×Q [0,1] • an emission distribution Pe: Q× [0,1] i.e., Pt (qj | qi) i.e., Pe (sj | qi) An Example 5% M1=({q0,q1,q2},{Y,R},Pt,Pe) Pt={(q0,q1,1), (q1,q1,0.8), (q1,q2,0.15), (q1,q0,0.05), (q2,q2,0.7), (q2,q1,0.3)} q0 80% q1 100% Pe={(q1,Y,1), (q1,R,0), (q2,Y,0), (q2,R,1)} 15% Y=0% R=0% R = 100% Y = 100% 30% q2 70% HMMs & Geometric Feature Lengths  d 1  d 1 P( x0 ...xd 1 |  )    Pe ( xi |  )  p (1  p)  i 0  geometric distribution exon length Generalized Hidden Markov Models Advantages: * Submodel abstraction * Architectural simplicity * State duration modeling Disadvantages: * Decoding complexity Generalized HMMs A GHMM is a stochastic machine M=(Q, , Pt, Pe, Pd) consisting of the following: • a finite set of states, Q={q0, q1, ... , qm} • a finite alphabet  ={s0, s1, ... , sn} • a transition distribution Pt : Q×Q  [0,1] i.e., Pt (qj | qi) • an emission distribution Pe : Q×*× N[0,1] i.e., Pe (sj | qi,dj) • a duration distribution Pe : Q× N [0,1] i.e., Pd (dj | qi) Key Differences • each state now emits an entire subsequence rather than just one symbol • feature lengths are now explicitly modeled, rather than implicitly geometric • emission probabilities can now be modeled by any arbitrary probabilistic model • there tend to be far fewer states => simplicity & ease of modification Ref: Kulp D, Haussler D, Reese M, Eeckman F (1996) A generalized hidden Markov model for the recognition of human genes in DNA. ISMB '96. Recall: Decoding with an HMM  max  argmax argmax P (  S ) P ( | S )   P( S )  argmax P (  S )  argmax P( S | ) P()    L L1 P( )   Pt ( yi1 | yi ) P(S |  )   Pe (xi | yi1 ) i0 max  i0 emission prob. argmax  L1 transition prob. Pt (q0 | yL ) Pe (xi | yi1 )Pt ( yi1 | yi )  i0 Decoding with a GHMM  max  argmax argmax P (  S ) P ( | S )   P( S )  argmax P (  S )  argmax P( S | ) P()    | |2 | |2 P( )   Pt ( yi1 | yi )Pd (d i | yi ) P(S |  )   Pe (Si | yi ,d i ) i1  max  i0 emission prob. transition prob. | |2 P (S | y ,d )P ( y    argmax e i0 i i i t i1 duration prob. | yi )Pd (d i | yi ) Gene Prediction with a GHMM Given a sequence S, we would like to determine the parse  of that sequence which segments the DNA into the most likely exon/intron structure: prediction exon 1 exon 2 exon 3 parse  AGCTAGCAGTCGATCATGGCATTATCGGCCGTAGTACGTAGCAGTAGCTAGTAGCAGTCGATAGTAGCATTATCGGCCGTAGCTACGTAGCGTAGCTC sequence S The parse  consists of the coordinates of the predicted exons, and corresponds to the precise sequence of states during the operation of the GHMM (and their duration, which equals the number of symbols each state emits). This is the same as in an HMM except that in the HMM each state emits bases with fixed probability, whereas in the GHMM each state emits an entire feature such as an exon or intron. GHMMs Summary • GHMMs generalize HMMs by allowing each state to emit a subsequence rather than just a single symbol • Whereas HMMs model all feature lengths using a geometric distribution, coding features can be modeled using an arbitrary length distribution in a GHMM • Emission models within a GHMM can be any arbitrary probabilistic model (“submodel abstraction”), such as a neural network or decision tree • GHMMs tend to have many fewer states => simplicity & modularity GlimmerHMM architecture Four exon types Exon0 Exon1 Exon2 I0 I1 I2 Init Exon Phase-specific introns Term Exon Exon Sngl + forward strand Intergenic - backward strand Exon Sngl Term Exon Init Exon I0 I1 I2 Exon0 Exon1 Exon2 • Uses GHMM to model gene structure (explicit length modeling) • WAM and MDD for splice sites • ICMs for exons, introns and intergenic regions • Different model parameters for regions with different GC content • Can emit a graph of highscoring ORFS Key steps in the GHMM Dynamic Programming Algorithm • Scan left to right • At each signal, look bacward (left) – Find all compatible signals – Take MAX score – Repeat for all reading frames Key steps in the GHMM Dynamic Programming Algorithm AG AG AG GT AG ATG ATG ATG Look back at all previous compatible signals Key steps in the GHMM Dynamic Programming Algorithm AG  Retrieve score of best parse up to previous site  Compute score of the exon linking AG to GT  Use Markov chain or other methods  Look up probability of exon length  Multiply probabilities (or add logs) GT Key steps in the GHMM Dynamic Programming Algorithm AG MAX over all previous sites AG AG GT AG ATG ATG ATG Store for each frame: MAX score Reading frame Pointer backward GHMM Dynamic Programming Algorithm: Introns GT GT GT AG GT GT GT Huge number of potential signals: how far back to look? GHMM Dynamic Programming Algorithm: Introns GT  Limit look-back with maximum intron length  Or, use other techniques  Compute score of intron linking GT to AG  Score donor site with donor site model  Score intron with Markov chain  Score acceptor with acceptor site model  Look up probability of intron length  Multiply probabilities (or add logs) AG Training the Gene Finder θ=(Pt ,Pe ,Pd) Training for GHMMs  arg max    MLE  P(S,  )      ( S , )T    arg max   Pe (Si | yi , d i ) Pt ( yi | yi1 ) Pd (d i | yi )       ( S , )T yi   |S i |1   arg max   Pt ( yi | yi1 ) Pd (d i | yi )  Pe (x j | yi )       ( S , )T yi  j0  estimate via labeled training data ai , j  Ai , j  |Q|1 h 0 construct a histogram of observed feature lengths estimate via labeled training data ei,k  Ai ,h Ei,k  | |1 h0  Ei,h Gene Finding in the Dark: Dealing with Small Sample Sizes – – – parameter mismatching: train on a close relative use a comparative GF trained on a close relative use BLAST to find conserved genes & curate them, use as training set augment training set with genes from related organisms, use weighting manufacture artificial training data – – • – be sensitive to sample sizes during training by reducing the number of parameters (to reduce overtraining) • • – – long ORFs fewer states (1 vs. 4 exon states, intron=intergenic) lower-order models pseudocounts smoothing (esp. for length distributions) Evaluation of Gene Finding Programs Nucleotide level accuracy TN FN FP TN TP FN REALITY PREDICTION Sensitivity: Sn  Precision: Pr   TP TP  FN TP TP  FP TP FN TN More Measures of Prediction Accuracy Exon level accuracy WRONG EXON CORRECT EXON MISSING EXON REALITY PREDICTION  ExonSn  TE number of correct exons  AE number of actual exons ExonPr  TE number of correct exons  PE number of predicted exons GlimmerHMM on human genes (circa 2002) Nuc Sens Nuc Prec Nuc Acc Exon Sens Exon Prec Exon Acc Exact Genes GlimmerHMM 86% 72% 79% 72% 62% 67% 17% Genscan 86% 68% 77% 69% 60% 65% 13% GlimmerHMM’s performace compared to Genscan on 963 human RefSeq genes selected randomly from all 24 chromosomes, non-overlapping with the training set. The test set contains 1000 bp of untranslated sequence on either side (5' or 3') of the coding portion of each gene. GlimmerHMM on other species Nucleotide Level Exon Level Size of test set Pr Corretly Predicted Genes Sn Pr Sn Arabidopsis thaliana 97% 99% 84% 89% 60% 809 genes Cryptococcus neoformans 96% 99% 86% 88% 53% 350 genes Coccidoides posadasii 99% 99% 84% 86% 60% 503 genes Oryza sativa 95% 98% 77% 80% 37% 1323 genes GlimmerHMM has also been trained on: Aspergillus fumigatus, Entamoeba histolytica, Toxoplasma gondii, Brugia malayi, Trichomonas vaginalis, and many others. Ab initio gene finding in the model plant Arabidopsis thaliana (circa 2004) Arabidopsis thaliana test results Nucleotide Exon Sn Pr Acc Sn Pr GlimmerHMM 97 99 SNAP Genscan+ 98 Acc Sn Pr Acc 84 89 86.5 60 61 60.5 96 99 97.5 83 85 93 99 96 Gene 84 60 74 81 77.5 35 57 58.5 35 35 •All three programs were tested on a test data set of 809 genes, which did not overlap with the training data set of GlimmerHMM. •All genes were confirmed by full-length Arabidopsis cDNAs and carefully inspected to remove homologues.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Generalized-HMMs - Center for Bioinformatics and