Gene Prediction - Compgenomics2010

Gene Prediction
Chengwei Luo, Amanda McCook, Nadeem Bulsara,
Phillip Lee, Neha Gupta, and Divya Anjan Kumar
Gene Prediction
Protein-coding gene prediction
RNA gene prediction
Modification and finishing
Project schema
Why gene prediction?
Why not the experimental way?
Exponential growth of sequences
New sequencing technology
Metagenomics: only ~1% of microorganisms grow in the lab
How to do it?
It is a complicated task, so let's break it into parts
Protein-coding gene prediction
Homology Search
Phillip Lee & Divya Anjan Kumar
ab initio approach
Nadeem Bulsara & Neha Gupta
RNA gene prediction
Amanda McCook & Chengwei Luo
Protein-coding gene prediction
Homology Search
Open reading frame (ORF)
How/Why find ORFs?
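As an illustration of the "how", here is a minimal sketch of a six-frame ORF scan. It is not the method used in the project. The start codon set (ATG, GTG, TTG), the stop codon set, the `min_len` cutoff, and the function names are all assumptions chosen for the example.

```python
from typing import List, Tuple

# Assumed codon sets for a bacterial genome (illustrative, not from the slides).
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 30) -> List[Tuple[str, int, int, int]]:
    """Scan all six reading frames for start-to-stop ORFs.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and measured on the strand that was scanned.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None  # position of the first start codon since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs
```

Each ORF found this way is only a candidate; the translated candidates are what would be passed on to the protein database searches.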
Protein Database Searches
Swiss-Prot: statistics
11,912 families, with 1,808 new families and 236 families deleted
Updated to include metagenomic samples
Involves MSA and HMM
Only 63% of the Pfam families match the proteins in
Domain searches
Integrating the results
3 possible outcomes:
Complete consensus
Partial consensus
No consensus
How do we choose?
Scores like E-values
Percentage similarity
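The three outcomes and the choice between hits can be sketched as follows. This is a minimal illustration only: the `Hit` fields, the rule that "partial" means at least two searches agree, and the tie-break (lowest E-value first, then highest percentage similarity) are assumptions, not the project's actual procedure.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    source: str            # hypothetical label for the search that produced the hit
    annotation: str        # predicted function or family
    evalue: float          # lower is better
    pct_similarity: float  # higher is better


def consensus_level(hits: List[Hit]) -> str:
    """Classify agreement between searches as complete, partial, or none."""
    if not hits:
        return "none"
    counts = Counter(h.annotation for h in hits)
    top = counts.most_common(1)[0][1]  # size of the largest agreeing group
    if top == len(hits):
        return "complete"
    if top >= 2:
        return "partial"
    return "none"


def best_hit(hits: List[Hit]) -> Hit:
    """Pick the hit with the lowest E-value, breaking ties by similarity.

    Assumes a non-empty list.
    """
    return min(hits, key=lambda h: (h.evalue, -h.pct_similarity))
```

With partial or no consensus, `best_hit` gives one defensible default, but a real pipeline would likely also weigh alignment coverage and the reliability of each database.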
Limitations of Extrinsic Prediction
ab initio Prediction
Homology search is not enough!
Biased and incomplete databases
Sequenced genomes are not evenly distributed across the tree of life, and so do not reflect its diversity.
[Figure annotation: the sequenced genomes are clustered here]
ab initio Gene Prediction
ORFs (6 frames)
Codon Statistics
Features (Contd.)
Probabilistic View
Supervised Techniques
Unsupervised Techniques
Usually Used Tools
GeneMark
• Developed in 1993 at the Georgia Institute of Technology as the first gene-finding tool.
• Used Markov chains to represent the statistics of coding and noncoding reading frames using dicodon
• Inability to find exact gene
9 hidden states were defined:
• Typical gene in the direct strand
• Typical gene in the reverse strand
• Atypical gene in the direct strand
• Atypical gene in the reverse strand
• Non-coding (intergenic) region
• Start codons in the direct strand
• Stop codons in the direct strand
• Start codons in the reverse strand
• Stop codons in the reverse strand
The probability that a functional sequence X underlies a nucleotide sequence S is calculated as
P(X|S) = P(x_1, x_2, ..., x_L | b_1, b_2, ..., b_L)
The Viterbi algorithm then finds the functional sequence X* for which P(X*|S) is the largest among all possible values of X.
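To make the decoding step concrete, here is a minimal sketch of the Viterbi algorithm in log space. It is a generic implementation, not the nine-state model described above. It assumes a non-empty observation sequence and strictly positive probabilities; the function and parameter names are illustrative.

```python
import math
from typing import Dict, List, Sequence


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most probable hidden-state path X* for the observations."""
    # Log-probability of the best path ending in each state at position 0.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    backptrs = []
    for base in obs[1:]:
        current, ptr = {}, {}
        for s in states:
            # Best predecessor state for landing in s at this position.
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][base]))
            ptr[s] = prev
        scores.append(current)
        backptrs.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]
```

A toy usage example with two made-up states, "C" (GC-rich, standing in for coding) and "N" (AT-rich, standing in for non-coding); all probabilities here are invented for illustration:

```python
states = ["C", "N"]
start_p = {"C": 0.5, "N": 0.5}
trans_p = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
emit_p = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
print("".join(viterbi("AAAAGGGG", states, start_p, trans_p, emit_p)))  # NNNNCCCC
```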
A ribosome binding site (RBS) model was also added to improve accuracy in the prediction of translational start sites.
Even in prokaryotic genomes, gene overlaps are quite common.
The RBS feature overcomes this problem by defining a five-position nucleotide matrix based on an alignment of 325 E. coli genes whose RBS signals have already been annotated.
It uses the consensus sequence AGGAG to search upstream of any alternative start codons for genes predicted by the HMM.
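The upstream consensus search can be sketched as below. This is a simplified mismatch count against AGGAG rather than the position matrix described above; the 20-base window, the one-mismatch tolerance, and the function name are assumptions made for the example.

```python
from typing import Optional, Tuple


def best_rbs_site(
    seq: str,
    start: int,
    consensus: str = "AGGAG",
    window: int = 20,       # assumed search window upstream of the start codon
    max_mismatch: int = 1,  # assumed tolerance
) -> Optional[Tuple[int, int]]:
    """Find the best consensus match upstream of the start codon at `start`.

    Returns (position, mismatches) for the match with the fewest mismatches,
    preferring the most upstream one on ties, or None if nothing qualifies.
    """
    lo = max(0, start - window)
    best = None
    # Only consider sites that end at or before the start codon.
    for i in range(lo, start - len(consensus) + 1):
        site = seq[i:i + len(consensus)]
        mismatches = sum(a != b for a, b in zip(site, consensus))
        if mismatches <= max_mismatch and (best is None or mismatches < best[1]):
            best = (i, mismatches)
    return best
```

Running this for each alternative start codon of a predicted gene gives one simple signal for ranking the candidates; a matrix-based score would weight each position separately instead of counting mismatches.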
Glimmer
Considered the best gene prediction tool.
Based on unsupervised learning.
Maintained by Steven Salzberg and Art Delcher at the University of Maryland, College Park.
Used IMMs (Interpolated Markov Models) for the first time.
Predictions are based on variable context (oligomers of variable lengths).
More flexible than fixed-order Markov models.
The IMM combines probabilities based on the 0, 1, ..., k previous bases; in this case k = 8 is used. The highest orders apply to oligomers that occur frequently; for rarely occurring oligomers, 5th order or lower may be used instead.
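The interpolation idea can be sketched as follows. This is a deliberately simplified model, not Glimmer's actual weighting scheme: the weight given to each context is assumed to be `min(1, count / threshold)`, and the threshold of 400, the pseudocount, and the class name are illustrative assumptions.

```python
import math
from collections import defaultdict
from typing import Iterable


class SimpleIMM:
    """Toy interpolated Markov model over the alphabet ACGT."""

    def __init__(self, max_order: int = 8, threshold: int = 400,
                 pseudocount: float = 1.0) -> None:
        self.max_order = max_order
        self.threshold = threshold      # assumed count at which a context is fully trusted
        self.pseudocount = pseudocount
        # counts[context][base]: how often `base` followed `context` in training.
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, seqs: Iterable[str]) -> None:
        for seq in seqs:
            for i, base in enumerate(seq):
                for k in range(min(i, self.max_order) + 1):
                    self.counts[seq[i - k:i]][base] += 1

    def prob(self, context: str, base: str) -> float:
        """P(base | context), interpolated from order 0 up to the longest seen context."""
        c0 = self.counts[""]
        p = (c0[base] + self.pseudocount) / (sum(c0.values()) + 4 * self.pseudocount)
        for k in range(1, min(len(context), self.max_order) + 1):
            c = self.counts.get(context[-k:])
            if not c:
                break  # context never seen: keep the lower-order estimate
            total = sum(c.values())
            lam = min(1.0, total / self.threshold)  # rare contexts get little weight
            p = lam * c[base] / total + (1.0 - lam) * p
        return p

    def log_likelihood(self, seq: str) -> float:
        """Sum of log-probabilities of each base given its preceding bases."""
        return sum(math.log(self.prob(seq[max(0, i - self.max_order):i], b))
                   for i, b in enumerate(seq))
```

Frequent oligomers push the estimate toward the high-order probability, while rare ones fall back toward shorter contexts, which is the behaviour the text above describes.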
Glimmer development
Glimmer 2 (1999)
Increased the sensitivity of prediction by adding the concept of the ICM (Interpolated Context Model).
Glimmer 3 (2007)
Overcomes the shortcomings of previous models by taking into account the sum of the RBS score, the IMM coding potentials, and a score for start codons, which depends on the relative frequency of each possible start codon in the same training set used for RBS
The algorithm uses reverse scoring of the IMM, scoring all ORFs (open reading frames) in reverse, from the stop codon to the start codon.
The score is the sum of the log-likelihoods of the bases contained in the ORF.
Prodigal (Prokaryotic Dynamic Programming Gene-finding Algorithm)
Developed at Oak Ridge National Laboratory and the University of Tennessee
EasyGene
Developed at the University of Copenhagen
Statistical significance is the measure for gene prediction.
• A high-quality data set, based on similarity in Swiss-Prot, is extracted from the genome.
• The data set is used to estimate the HMM, where, based on ORF score and length, statistical significance is
• No standalone version available
Comparison of Different Tools
RNA Gene Prediction
Why Predict RNA?
Regulatory sRNA
sRNA Challenges
Fundamental Methodology
What Is Covariance?
Fig: Christian Weile et al. BMC Genomics (2007) 8:244
Noncomparative Prediction
Fig: James A. Goodrich & Jennifer F. Kugel, Nature Rev. Mol. Cell Biol. (2006) 7:612
Noncomparative Prediction
*Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1
Effective sRNA prediction in V. cholerae
32 novel sRNAs predicted
9 tested
6 confirmed
Jonathan Livny et al. Nucleic Acids Res. (2005) 33:4096
*Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1
Eva K. Freyhult et al. Genome Res. (2007) 17:117
Modification & Finishing
Consensus strategy to integrate ab initio
Broken gene recruiting
TIS correcting
IS calling
Operon annotating
Gene presence/absence analysis
Modification & Finishing
Consensus strategy
Broken gene recruiting
candidate fragments
homology search
ab initio results
Modification & Finishing
TIS correcting
Start codon redundancy: ATG, GTG, TTG, CTG
Leaderless genes
Markov iteration, experimentally verified data
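Start codon redundancy means a single ORF usually offers several possible translation initiation sites. The sketch below simply enumerates them; it assumes the ORF string is given in frame, starting at its most upstream candidate, and the function name is hypothetical.

```python
from typing import List, Tuple

# The four start codons listed above.
START_CODONS = ("ATG", "GTG", "TTG", "CTG")


def candidate_starts(orf: str) -> List[Tuple[int, str]]:
    """List (offset, codon) for every in-frame start codon in the ORF."""
    return [
        (i, orf[i:i + 3])
        for i in range(0, len(orf) - 2, 3)
        if orf[i:i + 3] in START_CODONS
    ]
```

For example, `candidate_starts("GTGAAAATGCCCTTGTAA")` returns `[(0, 'GTG'), (6, 'ATG'), (12, 'TTG')]`. TIS correction then has to choose among such candidates, for instance using upstream signals or experimentally verified start sites.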
Modification & Finishing
IS calling
IS Finder DB
Operon annotating
Modification & Finishing
Gene Presence/absence analysis
Schema (proposed)
assembly group