Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Eukaryotic Gene Finding Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/ Prokaryotic vs. Eukaryotic Genes Prokaryotes small genomes high gene density no introns (or splicing) no RNA processing similar promoters overlapping genes Eukaryotes large genomes low gene density introns (splicing) RNA processing heterogeneous promoters polyadenylation Pre-mRNA Splicing exon definition intron definition SR proteins ... 5 ’ splice signal exonic repressor branch signal intronic enhancers 3 ’ splice signal 5 ’ splice signal polyY exonic enhancers intronic repressor (assembly of spliceosome, catalysis) ... Some Statistics • On average, a vertebrate gene is about 30KB long • Coding region takes about 1KB • Exon sizes can vary from double digit numbers to kilobases • An average 5’ UTR is about 750 bp • An average 3’UTR is about 450 bp but both can be much longer. Human Splice Signal Motifs 5' splice signal 3' splice signal Semi-Markov HMM Model GHMM • • • • A finite Set Q of states Initial state distribution Π Transition probabilities Ti,j for Length distribution f of the states (fq is the length distribution of state q) • Probability model for each state GHMM – contin. • A parse Ф of a sequence S of length L is an ordered sequence of states (q1, . . . , qt) with an associated duration di to each state The most probable pass Фopt can be computed as in Veterbi algorithm Genscan HSMM GenScan States • N - intergenic region • P - promoter • F - 5’ untranslated region • Esngl – single exon (intronless) (translation start -> stop codon) • Einit – initial exon (translation start -> donor splice site) • Ek – phase k internal exon (acceptor splice site -> donor splice site) • Eterm – terminal exon (acceptor splice site -> stop codon) • Ik – phase k intron: 0 – between codons; 1 – after the first base of a codon; 2 – after the second base of a codon GenScan features • Model both strands at once • Each state may output a string of symbols (according to some probability distribution). • Explicit intron/exon length modeling • Advanced splice site modeling • Complete intron/exon annotation for sequence • Able to predict multiple genes and partial/whole genes • Parameters learned from annotated genes • Separate parameter training for different CpG content groups (< 43%, 43-51%, 51-57%,>57% CG content) Various parameters in GENSCAN GenScan Signal Modeling • PSSM: P(S) = P1(S1)•P2(S2) •…•Pn(Sn) – PolyA signal – Translation initiation/termination signal – Promoters • WAM: P(S) = P1(S1) •P2(S2|S1)•…•Pn(Sn|Sn-1) – 5’ and 3’ splice sites GENSCAN Performance • > 80% correct exon predictions, and > 90% correct coding/non coding predictions by bp. • BUT - the ability to predict the whole gene correctly is much lower HMM-based Gene Finding GENSCAN (Burge 1997) FGENESH (Solovyev 1997) HMMgene (Krogh 1997) GENIE (Kulp 1996) GENMARK (Borodovsky & McIninch 1993) VEIL (Henderson, Salzberg, & Fasman 1997) Using Sequence Similarity for Gene Finding 1. Compare genomic sequence with expressed sequence tags (ESTs) (e.g. by BLASTN), to identify regions corresponding to processed mRNA 2. Compare genomic sequence to Protein DB (e.g. by BLASTX), to identify probably coding regions 3. “Spliced Alignment” of genomic sequence of a complete gene with a homologous protein sequence (e.g. by PROCRUSTES) may enable exon/intron reconstruction 4. Compare predicted peptides (e.g. by GENSCAN) with protein DB to assign confidence to predictions and functional annotations 5. Compare Genomic sequence with homologous from close organisms/species (e.g. by BLAST, CLASTW), to identify conserved regions which might correspond to coding regions and DNA signals GenomeScan • Idea: We can enhance our gene prediction by using external information: DNA regions with homology to known proteins are more likely to be coding exons. • Combine probabilistic ‘extrinsic’ information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan) • Focus on ‘typical case’ when homologous but not identical proteins are available. GeneWise [Birney, Amitai] • Motivation: Use good DB of protein world (PFAM) to help us annotate genomic DNA • GeneWise algorithm aligns a profile HMM directly to the DNA Sample GeneWise Output Developing GeneWise Model • Start with a PFAM domain HMM • Replace AA emissions with codon emissions P(codon | Mi ) P(codon | aa)P(aa | Mi ) •Allow for sequencing errors (deletions/insertions) •Add a 3-state intron model GeneWise Model GeneWise Intron Model PY tract central 5’ site spacer 3’ site GeneWise Model • Viterbi algorithm -> “best” alignment of DNA to protein domain • Alignment gives exact exon-intron boundaries • Parameters learned from speciesspecific statistics GeneWise problems • Only provides partial prediction, and only where the homology lies – Does not find “more” genes • Pseudogenes, Retrotransposons picked up • CPU intensive – Solution: Pre-filter with BLAST Other Sequence Usage • Search translated genomic sequences for the occurrences of the shot peptide motifs that are characteristic of common protein families (e.g. zinc finger, ATP/GTP binding modifs etc.) • Identify sequences which are probably NON coding: identify known classes of interspersed repeates (e.g LINE SINE) in none coding regions. Can be essential to remove these before simple BLAST is done against EST’s. Summary • Genes are complex structures which are difficult to predict with the required level of accuracy/confidence • Different approaches to gene finding: – Ab Initio : GenScan – Ab Initio modified by BLAST homologies: GenomeScan – Homology guided: GeneWise Future Directions • Find genes not for proteins (tRNA, rRNA, smRNA) – hard ! • Deal better with overlapping genes, multiple genes in a single sequence • Alternative splicing/transcription/translation – a whole separate issue – The mechanisms governing it, the signals – predicting the various genes Very important !