Gene Prediction
Computational Genomics
February 6, 2012
2
OUTLINE
1. Background
- Gene prediction
- Protein Coding Sequences
- Gene structure and ORF
- Prokaryotic Gene Model
- Biology of Haemophilus haemolyticus
2. Gene Prediction Approaches
-Ab Initio Gene Prediction
-Homology based Gene Prediction
-RNA gene prediction
3. Gene Prediction Improvement
4. Strategy
3
What is Gene Prediction?
Finding DNA sequences that
encode proteins
Protein-coding genes
RNA genes
Functional elements ->
Regulatory regions
Gene finding is one of the first and most
important steps in understanding the
genome of a species once it has
been sequenced
4
Why develop gene finders?
Technological improvements in high-throughput DNA
sequencing are tremendously increasing the public availability
of prokaryotic and eukaryotic genomes
As of May 2010, GOLD reported 1,072 complete
published bacterial genomes, and 4,289 bacterial
genome projects were known to be ongoing
(www.genomesonline.org).
5
6
Almost 2,000 genomes
completely sequenced
by 2011
Sequencing projects are
growing exponentially
7
Bacteria are usually chosen for
genome sequencing either because
they are highly virulent to humans,
animals or plants,
or because they can be applied to
bioremediation or bioenergy
production
8
Extracting knowledge from data
The growing amount of nucleotide
sequence data also requires the
concurrent development of
adequate bioinformatics tools
for a comprehensive
understanding of the genetic
information the sequences encode,
as well as of their underlying
biology
9
What is a Gene?
A gene is an elementary unit of heredity which is indivisible in the
functional sense
A gene codes for a discrete functional macromolecule (protein) or a functional
RNA
Such a definition does not work for
alternatively processed transcription
units
A gene is a linear
collection of exons
that are
incorporated into a
specific mRNA
10
Prokaryotic Gene Model: ORF genes
• Small genomes, high gene density
- H. influenzae genome is 85% genic
• Operons
- One transcript, many genes
• No introns
- One gene, one protein
• Open Reading Frames
- One ORF per gene
- ORF with start and stop codons
11
Prokaryotic Gene Structure
Eukaryotic Gene Structure
13
Haemophilus haemolyticus
What do we know about our target system?
Gram-negative bacterium
Facultative anaerobe
Shape: coccobacillus
Emerging pathogen
Closely related to H. influenzae
14
H. haemolyticus is most closely
related to H. influenzae
16S rRNA gene
infB gene
Multilocus Sequence Analysis
(MLSA)
15
Why study Haemophilus haemolyticus?
1. Genetic Diversity
2. Emerging Pathogen
3. Intrinsic Biological Value
16
How Does Gene Prediction Work?
Identifying common phenomena in known
genes
Building a computational model that can
accurately describe the common
phenomena
Using the model to scan uncharacterized
sequence to identify regions that match
the model, which become putative genes
Test and validate the predictions
17
Gene Prediction Methods
Gene prediction
• Protein coding
- Ab initio
- Homology based
• Non-protein coding
- tRNA
- rRNA
- sRNA
18
Open Reading Frames
• ORF (Open Reading Frame):
a sequence delimited by an in-frame AUG and a stop codon, which in
turn defines a putative amino acid sequence.
• Simple first step in gene finding
• Translate the genomic sequence in six frames. Identify the stop
codons in each frame.
• Regions without stop codons are ORFs
• The longest ORF starting from a Met (AUG) codon is a good prediction of
the protein-encoding sequence.
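The six-frame scan described above can be sketched in a few lines of Python. This is a minimal illustration, not a gene finder: it works on DNA (ATG rather than AUG), assumes only the three standard stop codons, and the function names and the `min_codons` cutoff are made up for the example. Coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a production gene finder).
STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (frame, start, end) for ATG...stop ORFs on one strand (end exclusive)."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                # The first ATG after the previous stop gives the longest ORF.
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None


def find_orfs(dna, min_codons=3):
    """Scan both strands (six frames) and return a list of ORF records."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame, start, end in orfs_in_strand(seq, min_codons):
            results.append((strand, frame, start, end, seq[start:end]))
    return results


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC"):
        print(orf)
```

In a real genome most such ORFs are not genes, which is why the following slides add coding statistics on top of the raw scan.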
19
ORF Scanning
• Use only sequence information.
• Identify coding exons.
• Integrate coding statistics to differentiate between coding and noncoding regions. (Real exons expected to show codon bias).
• Calculate likelihood a triplet is in a coding region.
*Works relatively well for prokaryotic genomes, where the
non-coding component is small and there are no introns
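One simple way to turn codon bias into a coding score is a log-odds ratio between codon frequencies learned from known genes and a uniform background. The sketch below assumes a uniform 1/64 background and add-one pseudocounts; both are illustrative choices, and the function names are hypothetical.

```python
import math
from collections import Counter


def codons(seq):
    """Split a sequence into consecutive in-frame triplets."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def train_codon_model(coding_seqs, pseudocount=1.0):
    """Estimate codon probabilities from known coding sequences."""
    counts = Counter(c for s in coding_seqs for c in codons(s))
    total = sum(counts.values()) + 64 * pseudocount

    def prob(codon):
        # Counter returns 0 for unseen codons, so the pseudocount keeps p > 0.
        return (counts[codon] + pseudocount) / total

    return prob


def coding_log_odds(seq, codon_prob):
    """Sum of log(P_coding / P_background); positive values look 'coding-like'."""
    background = 1.0 / 64  # assumed uniform background, for illustration only
    return sum(math.log(codon_prob(c) / background) for c in codons(seq))


if __name__ == "__main__":
    model = train_codon_model(["ATGAAAAAAAAATAA"])
    print(coding_log_odds("AAAAAA", model))  # codons seen in training
    print(coding_log_odds("CCCCCC", model))  # codons never seen in training
```

A window whose codons resemble the training genes scores above zero, and a window of unseen codons scores below zero.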
20
Predicting Prokaryotic Protein-Coding Genes
Gene prediction is easier and more accurate in prokaryotes
than in eukaryotes, since prokaryotic gene structure is much
simpler.
The principal difficulties are:
• detection of the initiation site (AUG)
• alternative start codons
• gene overlap
• undetected small proteins
In spite of these difficulties, prokaryotic gene prediction can
reach 99% accuracy.
21
Protein Coding
Methods
22
Finding Genes in Prokaryotic DNA
23
Ab initio methods
• Intrinsic Gene Prediction Method.
• Inspect the input sequence and search for traces
of gene presence.
• Extract information on gene locations using
statistical patterns inside and outside gene
regions as well as patterns typical of the gene
boundaries.
• Ab initio algorithms implement intelligent
methods to represent these patterns as a model
of the gene structure in the organism.
Ab initio methods:
- Markov model based
- Dynamic programming
24
Markov model based tools
• Several highly accurate prokaryotic gene-finding methods are
based on Markov model algorithms.
Markov model based tools:
- GeneMarkS
- Glimmer
- RAST
- AMIGene
- EasyGene
25
What are Hidden Markov Models?
• Hidden Markov models (HMMs) are discrete
Markov processes where every state generates an
observation at each time step.
• A hidden Markov model (HMM) is
a statistical Markov model in which the system
being modeled is assumed to be a Markov
process with unobserved (hidden) states. [wiki]
26
Markov Model (Discrete Markov Process)
• A discrete Markov process is a sequence of random
variables q1,…,qt that take values in a discrete set
S={s1,…,sN} where the Markov property holds.
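• Markov property: the next state depends only on the current state,
$P(q_t \mid q_{t-1}, \dots, q_1) = P(q_t \mid q_{t-1})$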
• Parameters
▫ Initial state probabilities: πi
▫ State transition probabilities: aij
27
From Markov Model to HMM
• HMMs are discrete Markov processes where each state
also emits an observation according to some probability
distribution, so we need to augment our model with emission probabilities.
• Parameters
▫ Initial state probabilities: πi
▫ State transition probabilities: aij
▫ Emission probabilities: ei(k)
Markov Model: each state emits an observation with 100% probability
Hidden Markov Model: each state emits an observation according to a certain probability distribution
28
HMM Example – Agnostic Drink Stand (1/2)
29
HMM Example – Agnostic Drink Stand
(2/2)
Suppose we observed the following sequences:
Vodka, Vodka, Coke, Vodka, Vodka, Vodka, Water, Water, Water,
Water, Coke, Water, Coke, Coke, Water, Coke, Coke, Water, Coke,
Coke, Coke, Vodka, Coke, Water, Vodka, Coke
How might we infer the hidden states?
A possible labeling:
Vodka, Vodka, Coke, Vodka, Vodka, Vodka, Water, Water, Water,
Water, Coke, Water, Coke, Coke, Water, Coke, Coke, Water, Coke,
Coke, Coke, Vodka, Coke, Water, Vodka, Coke
30
HMM Example in Sequencing Analysis
31
HMM and Observation Sequence are Known → ??
• Given the HMM parameters θ and an observation
sequence X1:T, which state sequence Q1:T best explains
the observations?
$Q^* = \arg\max_{Q} P(Q \mid X, \theta)$
• Viterbi algorithm
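A compact log-space version of the Viterbi algorithm is sketched below. The two-state model in the usage example (a GC-rich state "H" and an AT-rich state "L") and all of its probabilities are invented for illustration; they are not parameters of any real gene finder. The sketch assumes every probability it looks up is strictly positive.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_log_prob, best_path) for an observation sequence."""
    # V[t][s] is the best log-probability of any state path ending in s at time t.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Choose the predecessor state that maximizes the path score into s.
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            V[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return V[-1][last], path


# Hypothetical toy model: H = GC-rich state, L = AT-rich state.
STATES = ("H", "L")
START_P = {"H": 0.5, "L": 0.5}
TRANS_P = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
EMIT_P = {
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

if __name__ == "__main__":
    log_p, path = viterbi("GGCCAATT", STATES, START_P, TRANS_P, EMIT_P)
    print("".join(path), log_p)
```

Working in log space avoids numerical underflow on long sequences, which matters at genome scale.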
32
How Do We Get HMM Parameters?
• Training an HMM from labeled sequence
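When the state labels are known, maximum-likelihood training reduces to counting. The sketch below assumes the training data are pairs of (observation string, state-label string) of equal length; the function names and the toy data are hypothetical, and no pseudocounts are applied, so unseen events get no entry at all.

```python
from collections import Counter, defaultdict


def normalize(counter):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counter.values())
    return {key: value / total for key, value in counter.items()}


def train_hmm(labeled_seqs):
    """Estimate (start, transition, emission) probabilities by counting."""
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for obs, states in labeled_seqs:
        start[states[0]] += 1
        for symbol, state in zip(obs, states):
            emit[state][symbol] += 1
        for a, b in zip(states, states[1:]):
            trans[a][b] += 1
    return (
        normalize(start),
        {s: normalize(c) for s, c in trans.items()},
        {s: normalize(c) for s, c in emit.items()},
    )


if __name__ == "__main__":
    start_p, trans_p, emit_p = train_hmm([("GGCAAT", "HHHLLL")])
    print(start_p)
    print(trans_p)
    print(emit_p)
```

In practice a pseudocount is usually added to every count so that events missing from the training set do not receive probability zero.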
33
Designing an HMM for Gene Prediction
• The number of states in the model
▫ Start codon
▫ Stop codon
▫ Intragenic codon
▫ Intergenic region
• The number of distinct observation
symbols per state
• State transition probability distribution
• Observation symbol probability distribution
• Initial state distribution
• N-order Markov Model
34
Ab Initio Gene Prediction Software
• GeneMark.hmm
35
Ab Initio Gene Prediction Software
• GeneMarkS
36
Ab Initio Gene Prediction Software
• EasyGene
37
Limitations of Current Methods
• HMM has local averaging effect
• Training process is slow and is case-sensitive
• Algorithms are trained with sequences from known
genes (overfitting problem)
• MLE + Viterbi is not optimal (several tools have used the
scaling factor to tweak the performance)
• Overlapping genes
38
Comparison of the Gene Finders
Tool      | Developed for      | Output file formats
Prodigal  | Bacteria & archaea | GBK, GFF, SCO
GeneMarkS | Prokaryotes        | Algorithm-specific
RAST      | Bacteria & archaea | GTF, GFF3, GenBank, EMBL
Glimmer3  | Prokaryotes        | Algorithm-specific
EasyGene  | Prokaryotes        | GFF3
AMIGene   | Prokaryotes        | EMBL, GenBank, GFF
39
Homology based methods
Tools:
• BLAST
• SGP2
• BLAT
Advantages:
• Simplest.
• Characterized with high accuracy.
• Helps find the gene loci plus annotates the region.
Disadvantages:
Requires huge amounts of extrinsic data and finds only
half of the genes. Many of the genes still have no
significant homology to known genes.
Steps
1. Similarity search against the database
2. Multiple sequence alignment
40
Searching against the Database
• Steps
o Use a heuristic (approximate) algorithm to discard most
irrelevant sequences.
o Perform the exact algorithm (based on the Smith-Waterman algorithm)
on the small group of remaining sequences.
• Representative algorithms
o FASTA (Lipman & Pearson 1985) – First fast sequence
searching algorithm for comparing a query sequence against
a database
o BLAST - Basic Local Alignment Search Tool (Altschul
et al 1990)
o Gapped BLAST (Altschul et al 1997)
41
FASTA and BLAST
• First, identify very short (almost) exact matches.
• Next, the best short hits from the 1st step are
extended to longer regions of similarity.
• Finally, the best hits are optimized using the Smith-Waterman algorithm.
42
FASTA
Find runs of identities
Score and discard low-scoring runs
Eliminate segments unlikely to be part of alignment; apply banded Smith-Waterman to calculate opt score.
43
BLAST
• As sensitive as FASTA but much faster
• Confine attention to segment pairs that contain a
word pair of length w with a score of at least T
• Phase 1: Compile a list of word pairs above
threshold
• Phase 2: Scan the database for matching word hits
• Phase 3: Extend the hits
44
BLAST Phase 1: List of Word Pairs
• Compile a list of word pairs (w=3) above threshold T = 15
• Example:
A query sequence: …FSGTWYA…
A list of words (w=3) is: FSG SGT GTW TWY WYA
Neighborhood words:
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
NTW
Neighborhood word hits for GTW (word, per-residue scores, total):
> threshold (T=15)
GTW   6,5,11    22
GSW   6,1,11    18
ATW   0,5,11    16
NTW   0,5,11    16
< threshold
GTY   6,5,2     13
GTM   6,5,-1    10
DAW   -1,0,11   10
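The neighborhood-word scores for the query word GTW can be reproduced with a small script. The substitution scores below are copied from the per-residue values on the slide (they agree with the corresponding BLOSUM62 entries); only the pairs needed for this example are included, so the table is deliberately incomplete, and the function names are hypothetical.

```python
# Partial substitution table: (query residue, candidate residue) -> score.
# Only the pairs that appear in the slide's GTW example are listed.
SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("G", "A"): 0, ("G", "N"): 0, ("G", "D"): -1,
    ("T", "S"): 1, ("T", "A"): 0,
    ("W", "Y"): 2, ("W", "M"): -1,
}


def word_score(query_word, candidate):
    """Sum the per-position substitution scores of two equal-length words."""
    return sum(SCORES[(q, c)] for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep candidates scoring at least `threshold`, best first."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])


if __name__ == "__main__":
    candidates = ["GTW", "GSW", "ATW", "NTW", "GTY", "GTM", "DAW"]
    for word, score in neighborhood("GTW", candidates, threshold=15):
        print(word, score)
```

A real implementation enumerates all 20^3 possible words against a full scoring matrix instead of a hand-picked candidate list.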
45
BLAST Phase 3: Extend the Hit
• When you manage to find a hit (i.e. a match between a
“word” and a database entry), extend the hit in either
direction.
• Keep track of the score (use a scoring matrix). Stop when
the score drops more than some cutoff value X below the best score seen so far.
Query sequence:      KENFDKARFSGTWYAMAKKDPEG
Hit in the database: MKGLDIQKVAGTWYSLAMAASD.
(Hit! Extend to the left and to the right of the matching word.)
• The resulting alignments are High-scoring Segment Pairs (HSPs)
46
Gapped BLAST
• Try to connect HSPs by aligning the sequences in
between them
THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEW___ASNINETEEN
• The Gapped BLAST algorithm allows several segments
that are separated by short gaps to be connected together
to one alignment
47
How to Interpret BLAST Results
• E-value
▫ Expected number of alignments with score at least S
▫ Number of database hits you expect to find by chance
$E = K m n e^{-\lambda S}$
▫ Increases linearly with the length of the query sequence and the size of the database
▫ Decreases exponentially with the score of the alignment
m = length of query; n = length of database; S = score
K, λ: statistical parameters dependent upon scoring system
and background residue frequencies
[Figure: number of alignments plotted against score, marking your score and the expected number of random hits]
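The dependence of the E-value on query length, database size and score follows the Karlin-Altschul form E = K·m·n·e^(−λS). The sketch below evaluates it directly; the values used for K and λ in the usage example are placeholders chosen for illustration, since the real values depend on the scoring system and residue frequencies.

```python
import math


def e_value(score, query_len, db_len, K, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return K * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    # K and lam below are hypothetical placeholder values.
    base = e_value(50, query_len=100, db_len=1_000_000, K=0.1, lam=0.3)
    longer_query = e_value(50, query_len=200, db_len=1_000_000, K=0.1, lam=0.3)
    higher_score = e_value(60, query_len=100, db_len=1_000_000, K=0.1, lam=0.3)
    print(base, longer_query / base, higher_score / base)
```

Doubling the query length doubles E, while raising the score by 10 multiplies E by e^(−10λ).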
48
From E-value to P-value
• P-value: probability of obtaining a score greater than a
given score S at random
$P(S' > S) = 1 - e^{-E}$
which is approximately equal to the E-value when E is small
• Very small E-values are very similar to P-values.
However, E-values of about 1 to 10 are far easier to
interpret than corresponding P-values.
E-value | P-value
10      | 0.99995460
5       | 0.99326205
2       | 0.86466472
1       | 0.63212056
0.1     | 0.09516258 (about 0.1)
0.05    | 0.04877058 (about 0.05)
0.001   | 0.00099950 (about 0.001)
0.0001  | 0.0001000
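The table can be regenerated from the relation P = 1 − e^(−E). The short sketch below uses `math.expm1` so that very small E-values do not lose precision; the function name is hypothetical.

```python
import math


def p_value(e):
    """Convert an E-value to a P-value: P = 1 - exp(-E)."""
    # -expm1(-e) equals 1 - exp(-e) but stays accurate when e is tiny.
    return -math.expm1(-e)


if __name__ == "__main__":
    for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
        print(f"{e:<8} {p_value(e):.8f}")
```

For E below about 0.1 the two quantities agree to within a few percent, which is why small E-values can be read as P-values.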
49
BLAST and BLAST-like programs
• Traditional BLAST (formerly blastall): nucleotide, protein, translations
▫ blastn: nucleotide query vs. nucleotide database
▫ blastp: protein query vs. protein database
▫ blastx: nucleotide query vs. protein database
▫ tblastn: protein query vs. translated nucleotide database
▫ tblastx: translated query vs. translated database
• Megablast: nucleotide only
▫ Contiguous megablast
- Nearly identical sequences
▫ Discontiguous megablast
- Cross-species comparison
• Position Specific BLAST Programs: protein only
▫ Position Specific Iterative BLAST (PSI-BLAST)
- Automatically generates a position specific score matrix (PSSM)
▫ Reverse PSI-BLAST (RPS-BLAST)
- Searches a database of PSI-BLAST PSSMs
50
Multiple Sequence Alignment
• Smith-Waterman algorithm
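For reference, a minimal Smith-Waterman local alignment of two sequences is sketched below. It uses a linear gap penalty and flat match/mismatch scores (+3 / -3 / -2), which are illustrative assumptions rather than the scoring used by any particular tool; note that Smith-Waterman aligns a pair of sequences, and multiple alignment methods build on such pairwise comparisons.

```python
def smith_waterman(a, b, match=3, mismatch=-3, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the scoring matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGTGG"))
```

The quadratic time and memory cost of filling the matrix is the reason database search tools apply it only to the few candidates that survive the heuristic filter.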
51
Carrillo-Lipman Algorithm
52
Progressive Alignment Methods
• Feng-Doolittle progressive multiple alignment [1987]
▫ Pairwise alignment of all pairs of N sequences
▫ Construct a guide tree from the distance matrix
▫ Align the sequences based on the tree
53
Non protein coding gene prediction
A non-coding RNA (ncRNA) is a functional molecule that is
not translated into a protein. The term small RNA (sRNA) is
often used for bacterial ncRNA.
These are transcripts whose function lies in the RNA sequence itself
rather than in carrying information for protein synthesis.
For example:
small interfering RNAs (siRNAs) help protect the genome.
They recognize invading foreign RNAs/DNAs based on
sequence specificity
and help to degrade the foreign RNAs.
54
Non-protein Coding Gene Tools
• tRNA
– tRNAscan-SE
• rRNA
– RNAmmer
• sRNA
– sRNATarget
– sRNAPredict
55
Gene prediction improvement
pipeline (GenePRIMP)
56
GenePRIMP
• It is a computational evidence based postprocessing pipeline that
identifies erroneously predicted genes.
• The list of gene anomalies reported include :
▫ Short genes
▫ Long genes
▫ Broken genes
▫ Interrupted genes
▫ Unique genes
▫ Dubious genes
(a) GenePRIMP data flow. (b) BLAST alignments of short,
long, broken and interrupted genes. Unique genes have
no hits to known proteins (nr database). Dubious genes
are unique genes that are shorter than 30 amino acids.
57
58
GenePRIMP analysis of
gene calls
Comparison of five gene-calling
applications
59
Strategy
Assembled Genome / Contigs
-> Gene Prediction
• Protein-coding genes
- Ab initio: GeneMark, Prodigal, EasyGene, AMIGene, Glimmer 3, RAST, Critica etc.
- Homology based: BLAST, SGP2, BLAT
• RNA genes: tRNAscan-SE, RNAmmer, sRNATarget, sRNAPredict
- Final RNA predictions
-> Gene Prediction Improvement: GenePRIMP, Manual Curation
-> Final Result
60
References
Binnewies, T. et al. 2006. Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct Integr Genomics 6: 165–185.
Duan J., Heikkila J.J. and Glick B.R. 2010. Sequencing a bacterial genome: an overview. Topics in Applied Microbiology and Microbial Biotechnology 8: 1443-1451.
Casto A.M. and Amid C. 2010. Beyond the Genome: genomics research ten years after the human genome sequence. Genome Biology, 11:309.
King Jordan et al. 2011. Genome Sequences for Five Strains of the Emerging Pathogen Haemophilus haemolyticus. Journal of Bacteriology, 193: 5879–5880.
Hedegaard J. et al. 2001. Phylogeny of the genus Haemophilus as determined by comparison of partial infB sequences. Microbiology 147: 2599–2609.
Murphy T. F. et al. 2007. Haemophilus haemolyticus: A Human Respiratory Tract Commensal to Be Distinguished from Haemophilus influenzae. The Journal of Infectious Diseases, 195: 81–9.
Theodore M. J. et al. 2012. Evaluation of new biomarker genes for differentiating Haemophilus influenzae from Haemophilus haemolyticus. J. Clin. Microbiology. Published online ahead of print on 1 February 2012.
Mathe C. et al. 2002. Current Methods of Gene Prediction, their strengths and Weaknesses. Nucleic Acids Research, 30: 4103-4117.
Angelova M., Kalajdziski S. and Kocarev L. 2010. Computational methods for gene finding in prokaryotes. ICT Innovations 2010 Web Proceedings, 11-20.
Pati A. et al. 2010. GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes. Nature Methods, 7(6): 455-457.