Gene Prediction - Compgenomics2010
Gene Prediction
Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

Gene Prediction
• Introduction
• Protein-coding gene prediction
• RNA gene prediction
• Modification and finishing
• Project schema

Why gene prediction?
• Experimental way?
• Exponential growth of sequences
• New sequencing technology
• Metagenomics: only ~1% grow in the lab

How to do it?
• It is a complicated task, let's break it into parts
• Genome
• Protein-coding gene prediction
  Homology search: Phillip Lee & Divya Anjan Kumar
  ab initio approach: Nadeem Bulsara & Neha Gupta
• RNA gene prediction (tRNA, rRNA, sRNA): Amanda McCook & Chengwei Luo

Protein-coding gene prediction

Homology Search
Strategy
Open reading frame (ORF)
How/Why find ORF?
Protein Database Searches
Domain searches
Limits of Extrinsic Prediction

ab initio Prediction
Homology search is not enough!
• Biased and incomplete database
• Sequenced genomes are not evenly distributed on the tree of life, and do not reflect its diversity either.
[Figure: tree of life, annotated "Number of sequenced genomes clustered here"]

ab initio Gene Prediction
Features
• ORFs (6 frames)
• Codon statistics
Features (contd.)
Probabilistic View
Supervised Techniques
Unsupervised Techniques

Usually Used Tools
• GeneMark
• GLIMMER
• EasyGene
• PRODIGAL

GeneMark
• Developed in 1993 at Georgia Institute of Technology as the first gene finding tool.
• Used a Markov chain to represent the statistics of coding and noncoding reading frames using dicodon statistics.
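
The Markov-chain idea on the GeneMark slide can be illustrated with a small log-odds scorer. This is only a sketch under stated assumptions, not GeneMark's actual model: it uses a single homogeneous chain of order 2 rather than frame-specific dicodon (order 5) statistics, and the function names `train_markov` and `log_odds`, the pseudocount, and the toy training sequences are all invented for illustration.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=2, alphabet="ACGT", pseudocount=1.0):
    """Return prob(context, base) estimated from k-mer counts in seqs."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, base):
        # Pseudocounts keep unseen contexts and bases from getting probability 0.
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(alphabet)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, noncoding, order=2):
    """Sum of log(P_coding / P_noncoding) over every base with a full context."""
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        score += math.log(coding(ctx, base) / noncoding(ctx, base))
    return score


# Toy training data, invented for illustration only.
coding = train_markov(["ATGGCCGCCGCCGCC"])
noncoding = train_markov(["ATATATATATATAT"])

print(log_odds("GCCGCC", coding, noncoding) > 0)  # window resembles the coding set
print(log_odds("ATATAT", coding, noncoding) < 0)  # window resembles the noncoding set
```

A positive score means the window looks more like the coding training set than the noncoding one. A real gene finder would train on far more sequence and keep separate models for each codon position.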
Shortcomings
• Inability to find exact gene boundaries

GeneMark.hmm
• The probability of any functional sequence X underlying the DNA sequence S is calculated as P(X|S) = P(x1, x2, …, xL | b1, b2, …, bL), where b1 … bL are the bases of S and x1 … xL their functional labels.
• The Viterbi algorithm then calculates the functional sequence X* such that P(X*|S) is the largest among all possible values of X.
• A ribosome binding site (RBS) model was also added to improve accuracy in the prediction of translational start sites.

GeneMark
Even in prokaryotic genomes, gene overlaps are quite common.
• The RBS feature overcomes this problem by defining a 5-position nucleotide matrix based on an alignment of 325 E. coli genes whose RBS signals have already been annotated.
• Uses the consensus sequence AGGAG to search upstream of any alternative start codons for genes predicted by the HMM.

GeneMarkS
• Considered the best gene prediction tool.
• Based on unsupervised learning.

GLIMMER
Maintained by Steven Salzberg and Art Delcher at the University of Maryland, College Park
• Used IMMs (Interpolated Markov Models) for the first time.
• Predictions are based on variable context (oligomers of variable lengths).
• More flexible than fixed-order Markov models.

Principle
An IMM combines probabilities based on 0, 1, …, k previous bases; in this case k = 8 is used. The full order applies only to oligomers that occur frequently. For rarely occurring oligomers, 5th order or lower may be used instead.

Glimmer development
Glimmer 2 (1999)
• Increased the sensitivity of prediction by adding the concept of the ICM (Interpolated Context Model).
Glimmer 3 (2007)
• Overcomes the shortcomings of previous models by taking into account the sum of the RBS score, the IMM coding potentials, and a score for start codons that depends on the relative frequency of each possible start codon in the same training set used for RBS determination.
• The algorithm uses reverse scoring of the IMM, scoring all ORFs (open reading frames) in reverse, from the stop codon to the start codon.
• The score is the sum of the log-likelihoods of the bases contained in the ORF.

Glimmer3.02

PRODIGAL
Prokaryotic Dynamic Programming Gene Finding Algorithm
Developed at Oak Ridge National Laboratory and the University of Tennessee

PRODIGAL-Features

EasyGene
Developed at the University of Copenhagen
Statistical significance is the measure for gene prediction.
• A high-quality data set based on similarity to Swiss-Prot is extracted from the genome.
• The data set is used to estimate the HMM; statistical significance is then calculated from ORF score and length.
Problem:
• No standalone version available

Comparison of Different Tools

RNA gene prediction

RNA Gene Prediction
Why Predict RNA?
Regulatory sRNA
sRNA Challenges
Fundamental Methodology
RFAM

What Is Covariance?
Fig: Christian Weile et al. BMC Genomics (2007) 8:244

Noncomparative Prediction
Fig: James A. Goodrich & Jennifer F. Kugel, Nature Rev. Mol. Cell Biol. (2006) 7:612

Noncomparative Prediction
*Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1

Comparative+Noncomparative
Effective sRNA prediction in V. cholerae
• Non-enterobacteria
• sRNAPredict2
• 32 novel sRNAs predicted
• 9 tested
• 6 confirmed
Jonathan Livny et al. Nucleic Acids Res. (2005) 33:4096

Software
*Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1
Eva K. Freyhult et al. Genome Res. (2007) 17:117

Modification and finishing

Modification & Finishing
• Consensus strategy to integrate ab initio results
• Broken gene recruiting
• TIS correcting
• IS calling
• Operon annotating
• Gene presence/absence analysis

Modification & Finishing
Consensus strategy
Broken gene recruiting
[Flowchart labels: ab initio results; pass; pass; fail; candidate fragments; homology search]

Modification & Finishing
TIS correcting
• Start codon redundancy: ATG, GTG, TTG, CTG
• Leaderless genes
• Markov iteration, experimentally verified data

Modification & Finishing
IS calling
• IS Finder DB
Operon annotating

Modification & Finishing
Gene presence/absence analysis

Project schema

Schema (proposed)
[Diagram label: assembly group]
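
One possible reading of the consensus strategy and broken gene recruiting listed under Modification & Finishing can be sketched as a simple vote across ab initio predictors. Everything here is an assumption made for illustration, not the group's actual pipeline: the function name `consensus_genes`, the `min_votes` threshold, the choice to group predictions by stop coordinate and strand (so that tools disagreeing only on the start codon still count as agreeing on the gene), and the made-up coordinates.

```python
from collections import Counter, defaultdict


def consensus_genes(predictions, min_votes=2):
    """Split predictions into consensus genes and low-support candidates.

    predictions maps a tool label to a list of (start, stop, strand) tuples.
    Genes are grouped by (stop, strand); the most frequently predicted start
    is reported for each group.
    """
    votes = defaultdict(list)
    for genes in predictions.values():
        for start, stop, strand in genes:
            votes[(stop, strand)].append(start)

    passed, candidates = [], []
    for (stop, strand), starts in sorted(votes.items()):
        start = Counter(starts).most_common(1)[0][0]
        gene = (start, stop, strand)
        if len(starts) >= min_votes:
            passed.append(gene)      # enough tools agree: accept directly
        else:
            candidates.append(gene)  # low support: hand to homology search
    return passed, candidates


# Hypothetical coordinates; the tool names are used only as labels.
predictions = {
    "genemark": [(100, 400, "+"), (900, 1500, "-")],
    "glimmer": [(130, 400, "+")],
    "prodigal": [(100, 400, "+"), (2000, 2300, "+")],
}
passed, candidates = consensus_genes(predictions)
print(passed)
print(candidates)
```

Genes in `candidates` correspond to the "fail" branch of the flowchart and would be checked by homology search before being recruited. Disagreement on the start coordinate within a passed gene is the kind of case the TIS correcting step would then address.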