Gene predictions for eukaryotes

Three different approaches to computational gene finding:
• Intrinsic: use statistical information about known genes (Hidden Markov Models)
• Extrinsic: compare genomic sequence with known proteins / genes
• Cross-species sequence comparison: search for similarities among genomes

Hidden Markov Models (HMM) for gene prediction
s:     3 5 6 6 6 4 6 5 1 6 5 1 2
φ:   B F F U U U U U F F F F F F E
For sequence s and parse φ:
• P(φ): probability of φ
• P(φ,s): joint probability of φ and s, P(φ,s) = P(φ) * P(s|φ)
• P(φ|s): a-posteriori probability of φ

Goal: find the path φ with maximum a-posteriori probability P(φ|s).
Equivalent: find the path that maximizes the joint probability P(φ,s).
The optimal path is calculated by dynamic programming (Viterbi algorithm).
Program parameters are learned from training data.

Application to gene prediction:
s (DNA):   A T A A T G C C T A G T C
φ (parse): Z Z Z E E E E E E I I I I
• Introns, exons etc. are modeled as states in a GHMM ("generalized HMM")
• Given sequence s, find the parse φ that maximizes P(φ|s)
(S. Karlin and C. Burge, 1997)

AUGUSTUS
Basic model for GHMM-based intrinsic gene finding, comparable to GenScan (M. Stanke)

Features of AUGUSTUS:
• Intron length model
• Initial pattern for exons
• Similarity-based weighting for splice sites
• Interpolated HMM
• Internal 3' content model

An explicit intron length model is computationally expensive.

Intron length model:
[Figure: model states Exon, Intron (expl.), Intron (fixed), Intron (geo.), Exon]
• Explicit length distribution for short introns
• Geometric tail for long introns

AUGUSTUS+
Extension of AUGUSTUS to include extrinsic information:
• Protein sequences
• EST sequences
• Syntenic genomic sequences
• User-defined constraints

Gene prediction by phylogenetic footprinting
Comparison of genomic sequences (human and mouse)

AUGUSTUS+
Extended GHMM using extrinsic information
• Additional input data: a collection h of 'hints' about the possible gene structure φ for sequence s
• Consider s, φ and h the result of a random process. Define the probability P(s,h,φ).
• Find the parse φ that maximizes P(φ|s,h) for given s and h.

Hints created using
• Alignments to EST sequences
• Alignments to protein sequences
• Combined EST and protein alignment (EST alignments supported by protein alignments)
• Alignments of genomic sequences
• User-defined hints

[Figures: genomic sequence G1 aligned to an EST, to a protein, or to a second genomic sequence G2]
• Alignment to EST: hint to (partial) exon
• EST alignment supported by protein: hint to exon (part), start codon
• Alignment to ESTs, proteins: hints to introns, exons
• Alignment of genomic sequences: hint to (partial) exon

Consider different types of hints:
• Types of hints: start, stop, dss, ass, exonpart, exon, introns
• Each hint is associated with a position i in s (exons etc. are associated with their right end position)
• At most one hint of each type is allowed per position in s
• Each hint is associated with a grade g that indicates its source

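
The hint bookkeeping described above (typed hints, one position per hint, at most one hint of each type per position, a grade naming the source) can be illustrated with a small data structure. The following is only a minimal Python sketch, not the AUGUSTUS implementation: the names `Hint` and `HintTable`, the grade labels "EST" and "protein", and the example positions are hypothetical, and the optional strand, length and reading-frame fields are assumed from the hint definition on the following slide.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hint types as listed on the slide.
HINT_TYPES = ("start", "stop", "dss", "ass", "exonpart", "exon", "introns")


@dataclass(frozen=True)
class Hint:
    """One hint; field names are illustrative assumptions."""
    grade: str                           # label for the source of the hint
    strand: str = "+"                    # "+" or "-"
    length: Optional[int] = None         # only meaningful for interval hints
    reading_frame: Optional[int] = None  # e.g. known for protein-derived hints


class HintTable:
    """Stores at most one hint per (position, type) pair."""

    def __init__(self) -> None:
        self._hints: Dict[Tuple[int, str], Hint] = {}

    def add(self, position: int, hint_type: str, hint: Hint) -> None:
        if hint_type not in HINT_TYPES:
            raise ValueError(f"unknown hint type: {hint_type}")
        key = (position, hint_type)
        if key in self._hints:
            raise ValueError(f"position {position} already has a {hint_type} hint")
        self._hints[key] = hint

    def get(self, position: int, hint_type: str) -> Optional[Hint]:
        # None plays the role of "no hint of this type at this position".
        return self._hints.get((position, hint_type))


table = HintTable()
# Interval hints are stored at their right end position, as on the slide.
table.add(7, "exonpart", Hint(grade="EST", length=4))
table.add(4, "start", Hint(grade="protein", reading_frame=0))
```

Keying the table by (position, type) makes the "at most one hint of each type per position" rule a property of the container rather than something each caller has to check.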
AUGUSTUS+
h_{i,t} = information about a hint of type t at position i
• h_{i,t} = [grade, strand, (length, reading frame)] if a hint is available (hints created by protein alignments contain information about the reading frame)
• h_{i,t} = $ if no hint of type t is available at i

Standard program version, without hints:
s (sequence): A T A A T G C C T A G T C
φ (parse):    Z Z Z E E E E E E I I I I
Find the parse that maximizes P(φ|s)

AUGUSTUS+ using hints:
s (sequence): A T A A T G C C T A G T C
h (type 1):   $ $ $ $ $ $ $ X $ $ $ $ $
h (type 2):   $ $ $ $ $ $ $ $ $ $ $ $ $
h (type 3):   $ $ $ $ X $ $ $ $ $ $ $ $
φ (parse):    Z Z Z E E E E E E I I I I
Find the parse that maximizes P(φ|s,h)

As in standard HMM theory: maximize the joint probability P(φ,s,h).
How to calculate P(φ,s,h)?

Simplifying assumption: hints of different types t and at different positions i are independent of each other (for redundant hints: ignore "weaker" types).
\[
P(\varphi, s, h) = P(\varphi, s)\, P(h \mid \varphi, s) = P(\varphi, s) \prod_{i,t} P(h_{i,t} \mid \varphi, s)
\]

Results:
• Gene (sub-)structures supported by hints receive a bonus compared to non-supported structures
• Gene (sub-)structures not supported by hints receive a penalty (malus)
(M. Stanke et al. 2006, BMC Bioinformatics)

Using hints from DIALIGN alignments:
1. Obtain large human/mouse sequence pairs (up to 50 kb) from UCSC
2. Run CHAOS to find anchor points
3. Run DIALIGN using CHAOS anchor points
4. Create hints h from DIALIGN fragments
5. Run AUGUSTUS with hints

Hints from DIALIGN fragments:
• Consider fragments with score ≥ 20
• Distinguish high scores (≥ 45) from low scores
• Consider reading frame given by DIALIGN
• Consider strand given by DIALIGN
=> 2*2*2 = 8 grades

EGASP competition to evaluate and compare gene-prediction methods (Sanger Center, 2005)
AUGUSTUS best ab-initio method at EGASP

EGASP test results
[Figure: nucleotide level, sensitivity and specificity (0 to 100) for AUGUSTUS, GENSCAN, geneid, GeneMark.hmm, Genezilla]
[Figure: exon level, sensitivity and specificity (0 to 100) for AUGUSTUS, GENSCAN, geneid, GeneMark.hmm, Genezilla]
[Figure: transcript level, sensitivity and specificity (0 to 30) for AUGUSTUS, GENSCAN, geneid, GeneMark.hmm, Genezilla]
[Figure: gene level, sensitivity and specificity (0 to 30) for AUGUSTUS, GENSCAN, geneid, GeneMark.hmm, Genezilla]
[Figure: accuracy (Sn and Sp, 0% to 100%) at base, exon, transcript and gene level for AUGUSTUS, AUGUSTUS+DIALIGN, DOGFISH-C, SGP2, TWINSCAN, TWINSCAN-MARS, N-SCAN]

Application of AUGUSTUS in genome projects
• Brugia malayi (TIGR)
• Aedes aegypti (TIGR)
• Schistosoma mansoni (TIGR)
• Tetrahymena thermophila (TIGR)
• Galdieria sulphuraria (Michigan State Univ.)
• Coprinus cinereus (Univ. Göttingen)
• Tribolium castaneum (Univ. Göttingen)
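
Returning to the hint model from the AUGUSTUS+ slides: under the independence assumption, the joint probability P(φ,s,h) is P(φ,s) multiplied by one factor P(h_{i,t} | φ,s) per position i and hint type t. The following minimal Python sketch shows how such a product produces the bonus and penalty effect. It is an illustration under stated assumptions, not the AUGUSTUS code: the function name, the reduction of a parse to a set of (position, type) features, and the probabilities `q_plus` and `q_minus` are hypothetical, and hint grades are ignored.

```python
import math
from typing import Iterable, Set, Tuple

Slot = Tuple[int, str]  # (position i, hint type t)


def log_joint_with_hints(
    log_p_phi_s: float,
    parse_features: Set[Slot],
    hints: Set[Slot],
    n_positions: int,
    hint_types: Iterable[str],
    q_plus: float = 0.6,    # assumed P(hint present | parse has the feature)
    q_minus: float = 0.05,  # assumed P(hint present | parse lacks the feature)
) -> float:
    """Return log P(phi, s, h) = log P(phi, s) + sum over (i, t) of log P(h_it | phi, s)."""
    types = tuple(hint_types)
    total = log_p_phi_s
    for i in range(n_positions):
        for t in types:
            p_hint = q_plus if (i, t) in parse_features else q_minus
            # Slots without a hint contribute the probability of seeing no hint.
            p = p_hint if (i, t) in hints else 1.0 - p_hint
            total += math.log(p)
    return total


# One hint: a start codon at position 4. Both parses have the same log P(phi, s).
hints = {(4, "start")}
score_supported = log_joint_with_hints(-20.0, {(4, "start")}, hints, 13, ("start", "stop"))
score_unsupported = log_joint_with_hints(-20.0, {(7, "start")}, hints, 13, ("start", "stop"))
print(score_supported > score_unsupported)
```

With equal P(φ,s), the parse whose start codon agrees with the hint scores higher than the parse that places the start codon elsewhere, which is the bonus and penalty behaviour described in the results slide.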