AUGUSTUS+

Gene predictions for eukaryotes

Three different approaches to computational gene finding:
• Intrinsic: use statistical information about known genes (hidden Markov models)
• Extrinsic: compare the genomic sequence with known proteins or genes
• Cross-species sequence comparison: search for similarities among genomes

Hidden Markov models (HMMs) for gene prediction

Example with an observed sequence s and a state path (parse) φ:
s: 3 5 6 6 6 4 6 5 1 6 5 1 2
φ: B F F U U U U U F F F F F F E

For sequence s and parse φ:
• P(φ): probability of φ
• P(φ,s): joint probability of φ and s, P(φ,s) = P(φ) * P(s|φ)
• P(φ|s): a-posteriori probability of φ

Goal: find the path φ with maximum a-posteriori probability P(φ|s).
Equivalent: find the path that maximizes the joint probability P(φ,s), because P(φ|s) = P(φ,s) / P(s) and P(s) does not depend on φ.
The optimal path is calculated by dynamic programming (Viterbi algorithm).
Program parameters are learned from training data.

Application to gene prediction:
s (DNA):   A T A A T G C C T A G T C
φ (parse): Z Z Z E E E E E E I I I I

Introns, exons etc. are modeled as states in a GHMM ("generalized HMM"). Given sequence s, find the parse that maximizes P(φ|s) (S. Karlin and C. Burge, 1997).

AUGUSTUS

Basic model for GHMM-based intrinsic gene finding, comparable to GenScan (M. Stanke).

Features of AUGUSTUS:
• Intron length model
• Initial pattern for exons
• Similarity-based weighting for splice sites
• Interpolated HMM
• Internal 3' content model

Intron length model:
An explicit intron length model is computationally expensive.
[Figure: state diagram with the states Exon, Intron (expl.), Intron (fixed), Intron (geo.), Exon]
• Explicit length distribution for short introns
• Geometric tail for long introns

AUGUSTUS+

Extension of AUGUSTUS that includes extrinsic information:
• Protein sequences
• EST sequences
• Syntenic genomic sequences
• User-defined constraints

Gene prediction by phylogenetic footprinting

Comparison of genomic sequences (human and mouse)

AUGUSTUS+

• Extended GHMM using extrinsic information
• Additional input data: a collection h of "hints" about the possible gene structure φ for sequence s
• Consider s, φ and h as the result of a random process. Define the probability P(s,h,φ).
• Find the parse φ that maximizes P(φ|s,h) for given s and h.

Hints are created using:
• Alignments to EST sequences
• Alignments to protein sequences
• Combined EST and protein alignments (EST alignments supported by protein alignments)
• Alignments of genomic sequences
• User-defined hints

Examples (G1 and G2 denote the aligned genomic sequences in the figures):
• EST aligned to G1: hint to a (partial) exon
• EST alignment to G1 supported by a protein: hint to an exon part and a start codon
• ESTs and proteins aligned to G1: hints to introns and exons
• G2 aligned to G1 (alignment of genomic sequences): hint to a (partial) exon

Different types of hints:
• Types of hints: start, stop, dss, ass, exonpart, exon, introns
• Each hint is associated with a position i in s (exons etc. are associated with their right end position)
• At most one hint of each type is allowed per position in s
• Each hint is associated with a grade g that indicates its source.
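The Viterbi step named in the HMM slides can be tried out on the small example sequence shown there. The following is a minimal sketch and not AUGUSTUS code: it assumes that F and U stand for a fair and an unfair (loaded) die, and every start, transition and emission probability is a made-up illustrative value rather than a parameter from the slides. The helper names `viterbi` and `log_joint` are likewise hypothetical.

```python
import math

# Hypothetical two-state HMM: "F" (assumed fair die), "U" (assumed unfair die).
# All probabilities are illustrative assumptions, not values from the slides.
STATES = ("F", "U")
START = {"F": 0.5, "U": 0.5}
TRANS = {
    "F": {"F": 0.95, "U": 0.05},
    "U": {"F": 0.10, "U": 0.90},
}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "U": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def log_joint(path, observations):
    """Log of the joint probability P(path, observations)."""
    total = math.log(START[path[0]]) + math.log(EMIT[path[0]][observations[0]])
    for prev, cur, obs in zip(path, path[1:], observations[1:]):
        total += math.log(TRANS[prev][cur])
        total += math.log(EMIT[cur][obs])
    return total


def viterbi(observations):
    """Return (best_path, log_probability) maximizing P(path, observations)."""
    # v[k]: best log joint probability of any path that ends in state k
    v = {k: math.log(START[k]) + math.log(EMIT[k][observations[0]]) for k in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_v, pointers = {}, {}
        for k in STATES:
            best_prev = max(STATES, key=lambda j: v[j] + math.log(TRANS[j][k]))
            new_v[k] = v[best_prev] + math.log(TRANS[best_prev][k]) + math.log(EMIT[k][obs])
            pointers[k] = best_prev
        backpointers.append(pointers)
        v = new_v
    # Trace back from the best final state
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return "".join(path), v[last]


rolls = [3, 5, 6, 6, 6, 4, 6, 5, 1, 6, 5, 1, 2]  # sequence s from the slide
best_path, best_log_p = viterbi(rolls)
print(best_path, best_log_p)
```

Working in log space avoids numerical underflow on long sequences. Because P(s) is the same for every path, the path that maximizes the joint probability here also maximizes P(φ|s).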
AUGUSTUS+

h_{i,t} = information about the hint of type t at position i
• h_{i,t} = [grade, strand, (length, reading frame)] if a hint is available (hints created by protein alignments contain information about the reading frame)
• h_{i,t} = $ if no hint of type t is available at i

Standard program version, without hints:
s (sequence): A T A A T G C C T A G T C
φ (parse):    Z Z Z E E E E E E I I I I
Find the parse that maximizes P(φ|s).

AUGUSTUS+ using hints (X marks a position with a hint, $ a position without one):
s (sequence): A T A A T G C C T A G T C
h (type 1):   $ $ $ $ $ $ $ X $ $ $ $ $
h (type 2):   $ $ $ $ $ $ $ $ $ $ $ $ $
h (type 3):   $ $ $ $ X $ $ $ $ $ $ $ $
φ (parse):    Z Z Z E E E E E E I I I I
Find the parse that maximizes P(φ|s,h).

As in standard HMM theory: maximize the joint probability P(φ,s,h). How can P(φ,s,h) be calculated?

Simplifying assumption: hints of different types t and at different positions i are independent of each other (for redundant hints: ignore "weaker" types).

\[ P(\varphi, s, h) = P(\varphi, s) \cdot P(h \mid \varphi, s) = P(\varphi, s) \cdot \prod_{i,t} P(h_{i,t} \mid \varphi, s) \]

Results:
• Gene (sub-)structures supported by hints receive a bonus compared to non-supported structures
• Gene (sub-)structures not supported by hints receive a malus
(M. Stanke et al. 2006, BMC Bioinformatics)

Using hints from DIALIGN alignments:
1. Obtain large human/mouse sequence pairs (up to 50 kb) from UCSC
2. Run CHAOS to find anchor points
3. Run DIALIGN using CHAOS anchor points
4. Create hints h from DIALIGN fragments
5. Run AUGUSTUS with hints

Hints from DIALIGN fragments:
• Consider fragments with score ≥ 20
• Distinguish high scores (≥ 45) from low scores
• Consider the reading frame given by DIALIGN
• Consider the strand given by DIALIGN
=> 2*2*2 = 8 grades

EGASP

EGASP competition to evaluate and compare gene-prediction methods (Sanger Center, 2005)
AUGUSTUS was the best ab-initio method at EGASP.

EGASP test results

[Figures: sensitivity and specificity (%) at nucleotide, exon, transcript and gene level for AUGUSTUS, GENSCAN, geneid, GeneMark.hmm and Genezilla]
[Figure: accuracy (Sn and Sp at base, exon, transcript and gene level) for AUGUSTUS, AUGUSTUS+DIALIGN, DOGFISH-C, SGP2, TWINSCAN, TWINSCAN-MARS and N-SCAN]

Application of AUGUSTUS in genome projects
• Brugia malayi (TIGR)
• Aedes aegypti (TIGR)
• Schistosoma mansoni (TIGR)
• Tetrahymena thermophila (TIGR)
• Galdieria sulphuraria (Michigan State Univ.)
• Coprinus cinereus (Univ. Göttingen)
• Tribolium castaneum (Univ. Göttingen)
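The EGASP charts report sensitivity (Sn) and specificity (Sp). As a rough sketch of how nucleotide-level values are commonly computed in gene-prediction evaluations, with Sn = TP / (TP + FN) and Sp = TP / (TP + FP) (so Sp is the fraction of predicted coding bases that are correct), the following uses hypothetical exon coordinates. The function name and the interval representation are assumptions made for illustration; this is not the EGASP evaluation code.

```python
def nucleotide_sn_sp(annotated_exons, predicted_exons):
    """Return (Sn, Sp) at nucleotide level.

    Both arguments are lists of (start, end) intervals with inclusive ends.
    """
    annotated = {pos for start, end in annotated_exons for pos in range(start, end + 1)}
    predicted = {pos for start, end in predicted_exons for pos in range(start, end + 1)}
    tp = len(annotated & predicted)  # coding bases predicted as coding
    fn = len(annotated - predicted)  # coding bases that were missed
    fp = len(predicted - annotated)  # non-coding bases predicted as coding
    sn = tp / (tp + fn) if annotated else 0.0
    sp = tp / (tp + fp) if predicted else 0.0
    return sn, sp


# Hypothetical coordinates, for illustration only.
annotated = [(100, 199), (300, 399)]
predicted = [(150, 199), (300, 499)]
sn, sp = nucleotide_sn_sp(annotated, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

Representing exons as sets of positions keeps the sketch short; for whole chromosomes an interval-based overlap computation would be more memory-efficient.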
