Gene predictions for eukaryotes
attgccagtacgtagctagctacacgtatgctattacggatctgtagcttagcgtatct
gtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgt
agtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagctt
cgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacg
tacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagc
atctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggc
tagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttct
aggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcttagtcgtgtagt
cttgatctacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtag
tcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagcttcgtagtct
atggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatt
tttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtat
gctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgta
gtcgtagtcgttagcatctgtatgtacgtacgtatttttctagagcttcgtagtctatg
gctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtattttt
ctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgtt
agctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgta
gtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgta
gtctatggctagtcgtagtcgtagtcgttagcatctgtatggtcgtagtcgttagcatc
tgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctag
Gene predictions for eukaryotes
Three different approaches to computational gene finding:
• Intrinsic: use statistical information about known
genes (Hidden Markov Models)
• Extrinsic: compare genomic sequence with known
proteins / genes
• Cross-species sequence comparison: search for
similarities among genomes
Hidden-Markov-Models (HMM) for gene prediction
s: 3 5 6 6 6 4 6 5 1 6 5 1 2
φ: B F F U U U U U F F F F F F E
For sequence s and parse φ:
P(φ)    probability of φ
P(φ,s)  joint probability of φ and s, P(φ,s) = P(φ) · P(s|φ)
P(φ|s)  a-posteriori probability of φ given s
Goal: find path φ with maximum a-posteriori
probability P(φ|s)
Equivalent: find path that maximizes joint
probability P(φ,s)
Optimal path calculated by dynamic
programming (Viterbi algorithm)
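Since P(φ|s) = P(φ,s) / P(s) and P(s) does not depend on φ, maximizing the joint probability gives the same path. The following is a minimal sketch of the Viterbi recursion for a two-state dice model like the one in the example (F = fair, U = unfair). All probabilities are assumed toy values, not taken from the slides, and the begin/end states B and E are replaced by a simple start distribution.

```python
import math

# Hypothetical two-state dice model: F = fair die, U = unfair (loaded) die.
# All numbers below are illustrative assumptions.
STATES = ("F", "U")
START = {"F": 0.5, "U": 0.5}
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "U": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def log_joint(path, obs):
    """log P(path, obs) for a given state path."""
    lp = math.log(START[path[0]]) + math.log(EMIT[path[0]][obs[0]])
    for prev, cur, x in zip(path, path[1:], obs[1:]):
        lp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][x])
    return lp


def viterbi(obs):
    """Return (best_path, log_joint) for the path maximizing P(path, obs)."""
    # v[k] = best log joint probability of any path ending in state k
    v = {k: math.log(START[k]) + math.log(EMIT[k][obs[0]]) for k in STATES}
    back = []  # back[i][k] = best predecessor of state k at step i + 1
    for x in obs[1:]:
        new_v, ptr = {}, {}
        for k in STATES:
            prev = max(STATES, key=lambda j: v[j] + math.log(TRANS[j][k]))
            ptr[k] = prev
            new_v[k] = v[prev] + math.log(TRANS[prev][k]) + math.log(EMIT[k][x])
        back.append(ptr)
        v = new_v
    # Trace back from the best final state.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path), v[last]


rolls = [3, 5, 6, 6, 6, 4, 6, 5, 1, 6, 5, 1, 2]
best_path, best_log_p = viterbi(rolls)
print(best_path, best_log_p)
```

The recursion keeps only one best score per state and position, so the running time grows linearly with the sequence length instead of exponentially with the number of possible paths.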
Program parameters learned from training data
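One simple way to learn such parameters, when the training sequences come with known state paths, is to count transitions and emissions and normalize. The sketch below does this for a plain HMM with pseudocounts; the function name `estimate` and the pseudocount value are assumptions for illustration, and the actual AUGUSTUS training procedure is not described in these slides.

```python
from collections import Counter


def estimate(pairs, states, symbols, pseudo=1.0):
    """Count-based estimates of transition and emission probabilities.

    pairs: iterable of (observations, state_path), both of equal length.
    pseudo: pseudocount added to every count to avoid zero probabilities.
    """
    t_cnt, e_cnt = Counter(), Counter()
    for obs, path in pairs:
        for x, k in zip(obs, path):
            e_cnt[(k, x)] += 1
        for a, b in zip(path, path[1:]):
            t_cnt[(a, b)] += 1
    trans, emit = {}, {}
    for a in states:
        t_total = sum(t_cnt[(a, b)] + pseudo for b in states)
        trans[a] = {b: (t_cnt[(a, b)] + pseudo) / t_total for b in states}
        e_total = sum(e_cnt[(a, x)] + pseudo for x in symbols)
        emit[a] = {x: (e_cnt[(a, x)] + pseudo) / e_total for x in symbols}
    return trans, emit


# Training example: the dice rolls and labels from the slide (without B and E).
rolls = [3, 5, 6, 6, 6, 4, 6, 5, 1, 6, 5, 1, 2]
labels = "FFUUUUUFFFFFF"
trans, emit = estimate([(rolls, labels)], states="FU", symbols=range(1, 7))
print(trans["F"], emit["U"][6])
```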
Hidden-Markov-Models (HMM) for gene prediction
Application to gene prediction:
s (DNA):   A T A A T G C C T A G T C
φ (parse): Z Z Z E E E E E E I I I I
Introns, exons etc. modeled as states in a GHMM
("generalized HMM")
Given sequence s, find parse that maximizes P(φ|s)
(S. Karlin and C. Burge, 1997)
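In a generalized HMM a state emits a whole segment rather than a single base, so a parse is naturally viewed as a list of segments. The sketch below collapses the per-base labels of the example into such segments. Reading Z as intergenic, E as exon and I as intron is an assumption, since the slides do not define the letters, and `segments` is a hypothetical helper name.

```python
from itertools import groupby


def segments(parse):
    """Collapse a per-base label string into (state, start, end) tuples.

    Coordinates are 0-based and half-open, so dna[start:end] is the segment.
    """
    out, pos = [], 0
    for state, run in groupby(parse):
        n = len(list(run))
        out.append((state, pos, pos + n))
        pos += n
    return out


dna = "ATAATGCCTAGTC"
parse = "ZZZEEEEEEIIII"
for state, start, end in segments(parse):
    print(state, start, end, dna[start:end])
```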
AUGUSTUS
Basic model for GHMM-based intrinsic gene
finding comparable to GenScan (M. Stanke)
Features of AUGUSTUS:
• Intron length model
• Initial pattern for exons
• Similarity-based weighting for splice sites
• Interpolated HMM
• Internal 3’ content model
Hidden-Markov-Models (HMM) for gene prediction
s (DNA):   A T A A T G C C T A G T C
φ (parse): Z Z Z E E E E I I I I
An explicit intron length model is computationally
expensive.
AUGUSTUS
Intron length model:
[Figure: state diagram with the states Exon, Intron (expl.), Intron (fixed), Intron (geo.) and Exon]
• Explicit length distribution for short introns
• Geometric tail for long introns
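A geometric length distribution comes for free from a self-loop in an HMM state, while an explicit distribution requires looking back over all possible segment lengths, which is why it is kept for short introns only. The sketch below shows one way such a mixed distribution could look. The table values, the tail mass and the ratio `q` are assumed toy numbers, and the sketch models only the resulting length distribution, not the three intron states of the diagram.

```python
def intron_length_pmf(explicit, tail_mass, q):
    """Build P(length) from an explicit table plus a geometric tail.

    explicit:  dict length -> probability for short introns
               (should sum to 1 - tail_mass)
    tail_mass: total probability of introns longer than the table covers
    q:         ratio of the geometric tail, 0 < q < 1
    """
    d = max(explicit)  # longest length covered by the explicit table

    def pmf(length):
        if length <= d:
            return explicit.get(length, 0.0)
        # Geometric tail over lengths d+1, d+2, ... with ratio q.
        return tail_mass * (1 - q) * q ** (length - d - 1)

    return pmf


# Toy numbers for illustration only.
pmf = intron_length_pmf({1: 0.1, 2: 0.3, 3: 0.2}, tail_mass=0.4, q=0.9)
print(pmf(2), pmf(4), pmf(10))
```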
AUGUSTUS+
Extension of AUGUSTUS to include extrinsic
information:
• Protein sequences
• EST sequences
• Syntenic genomic sequences
• User-defined constraints
Gene prediction by phylogenetic footprinting
Comparison of genomic sequences
(human and mouse)
AUGUSTUS+
• Extended GHMM using extrinsic information
• Additional input data: collection h of "hints"
about possible gene structure φ for sequence s
• Consider s, φ and h as the result of a random process.
Define probability P(s,h,φ)
• Find parse φ that maximizes P(φ|s,h) for given
s and h.
AUGUSTUS+
Hints created using
• Alignments to EST sequences
• Alignments to protein sequences
• Combined EST and protein alignment (EST
alignments supported by protein alignments)
• Alignments of genomic sequences
• User-defined hints
AUGUSTUS+
[Figure: EST aligned to genomic sequence G1]
Alignment to EST: hint to (partial) exon
[Figure: protein and EST aligned to genomic sequence G1]
EST alignment supported by protein: hint to exon (part),
start codon
[Figure: ESTs and protein aligned to genomic sequence G1]
Alignment to ESTs, proteins: hints to introns, exons
[Figure: genomic sequence G2 aligned to genomic sequence G1]
Alignment of genomic sequences: hint to (partial) exon
AUGUSTUS+
Consider different types of hints:
• Types of hints: start, stop, dss (donor splice site),
ass (acceptor splice site), exonpart, exon, intron
• Hint associated with position i in s (exons etc.
associated with right end position)
• At most one hint of each type allowed per
position in s
• Each hint associated with a grade g that
indicates its source.
AUGUSTUS+
h_{i,t} = information about hint of type t at position i
h_{i,t} = [grade, strand, (length, reading frame)] if a
hint is available
(hints created by protein alignments contain
information about reading frame)
h_{i,t} = $ if no hint of type t is available at i
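The hint notation above maps naturally onto a small lookup table keyed by position and hint type. The following sketch is one possible representation; the class names `Hint` and `HintTable` are hypothetical, the field types are assumptions, and `None` plays the role of the "no hint" symbol $.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hint types as listed on the slides.
HINT_TYPES = ("start", "stop", "dss", "ass", "exonpart", "exon", "intron")


@dataclass(frozen=True)
class Hint:
    grade: str                    # indicates the source of the hint
    strand: str                   # "+" or "-"
    length: Optional[int] = None  # only for hints that span a range
    frame: Optional[int] = None   # reading frame, e.g. from protein alignments


class HintTable:
    """Stores at most one hint per (position, type)."""

    def __init__(self) -> None:
        self._hints: Dict[Tuple[int, str], Hint] = {}

    def add(self, pos: int, hint_type: str, hint: Hint) -> None:
        if hint_type not in HINT_TYPES:
            raise ValueError(f"unknown hint type: {hint_type}")
        if (pos, hint_type) in self._hints:
            raise ValueError("at most one hint of each type per position")
        self._hints[(pos, hint_type)] = hint

    def get(self, pos: int, hint_type: str) -> Optional[Hint]:
        # None corresponds to h_{i,t} = $ (no hint available).
        return self._hints.get((pos, hint_type))


table = HintTable()
table.add(8, "exonpart", Hint(grade="EST", strand="+", length=4))
print(table.get(8, "exonpart"), table.get(8, "start"))
```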
AUGUSTUS+
Standard program version, without hints
s (sequence): A T A A T G C C T A G T C
φ (parse):    Z Z Z E E E E E E I I I I
Find parse that maximizes P(φ|s)
AUGUSTUS+
AUGUSTUS+ using hints
s (sequence): A T A A T G C C T A G T C
h (type 1):   $ $ $ $ $ $ $ X $ $ $ $ $
h (type 2):   $ $ $ $ $ $ $ $ $ $ $ $ $
h (type 3):   $ $ $ $ X $ $ $ $ $ $ $ $
φ (parse):    Z Z Z E E E E E E I I I I
($ = no hint, X = hint available)
Find parse that maximizes P(φ|s,h)
AUGUSTUS+
As in standard HMM theory: maximize joint
probability P(φ,s,h)
How to calculate P(φ,s,h)?
AUGUSTUS+
Simplifying assumption: Hints of different types t
and at different positions i are independent of each
other (for redundant hints: ignore "weaker"
types).
\[
\begin{aligned}
P(\varphi, s, h) &= P(\varphi, s) \cdot P(h \mid \varphi, s) \\
                 &= P(\varphi, s) \cdot \prod_{i,t} P(h_{i,t} \mid \varphi, s)
\end{aligned}
\]
AUGUSTUS+
• Results:
• Gene (sub-)structures supported by hints
receive a bonus compared to non-supported
structures
• Gene (sub-)structures not supported by hints
receive a penalty (malus)
(M. Stanke et al. 2006, BMC Bioinformatics)
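The bonus and malus can be illustrated in log space, where the hint terms are simply added to the intrinsic score log P(φ,s). The sketch below assumes a deliberately simple hint model with two made-up probabilities, `Q_PLUS` and `Q_MINUS`. It is an illustration of the idea only, not the parameterization used by AUGUSTUS+.

```python
import math

# Hypothetical hint model (assumed values):
Q_PLUS = 0.30   # P(hint observed | parse has a matching feature here)
Q_MINUS = 0.01  # P(hint observed | parse has no matching feature here)


def hint_log_term(hint_present, parse_has_feature):
    """log P(h_{i,t} | parse, s) for one (position, type) slot."""
    q = Q_PLUS if parse_has_feature else Q_MINUS
    return math.log(q if hint_present else 1.0 - q)


def log_joint_with_hints(log_p_phi_s, slots):
    """log P(parse, s, h) = log P(parse, s) + sum of per-slot hint terms.

    slots: list of (hint_present, parse_has_feature) pairs.
    """
    return log_p_phi_s + sum(hint_log_term(h, f) for h, f in slots)


intrinsic = -50.0  # same assumed intrinsic score for both candidate parses
supported = log_joint_with_hints(intrinsic, [(True, True)])
unsupported = log_joint_with_hints(intrinsic, [(True, False)])
print("bonus factor:", math.exp(supported - unsupported))

# A predicted feature without any hint scores slightly worse than no feature.
unhinted_feature = log_joint_with_hints(intrinsic, [(False, True)])
no_feature = log_joint_with_hints(intrinsic, [(False, False)])
print("malus factor:", math.exp(unhinted_feature - no_feature))
```

Under these assumptions the bonus factor is Q_PLUS / Q_MINUS and the malus factor is (1 - Q_PLUS) / (1 - Q_MINUS), so a hint that rarely occurs by chance gives strong support to the structures it matches.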
AUGUSTUS+
Using hints from DIALIGN alignments:
1. Obtain large human/mouse sequence pairs (up
   to 50 kb) from UCSC
2. Run CHAOS to find anchor points
3. Run DIALIGN using CHAOS anchor points
4. Create hints h from DIALIGN fragments
5. Run AUGUSTUS with hints
AUGUSTUS+
Hints from DIALIGN fragments:
• Consider fragments with score ≥ 20
• Distinguish high scores (≥ 45) from low scores
• Consider reading frame given by DIALIGN
• Consider strand given by DIALIGN
=> 2*2*2 = 8 grades
AUGUSTUS+
EGASP competition to evaluate and compare
gene-prediction methods (Sanger Center,
2005)
AUGUSTUS best ab initio method at EGASP
EGASP test results
[Bar charts: sensitivity and specificity (in %) of AUGUSTUS, GENSCAN, geneid, GeneMark.hmm and Genezilla at four levels: nucleotide level (axis 0 to 100), exon level (axis 0 to 100), transcript level (axis 0 to 30) and gene level (axis 0 to 30)]
EGASP test results
[Bar chart: accuracy in %, shown as sensitivity (Sn) and specificity (Sp) at base, exon, transcript and gene level, for AUGUSTUS, AUGUSTUS+DIALIGN, DOGFISH-C, SGP2, TWINSCAN, TWINSCAN-MARS and N-SCAN]
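The slides do not define sensitivity and specificity. The sketch below assumes the convention common in gene-prediction evaluation, where Sn = TP / (TP + FN) and Sp = TP / (TP + FP) (the latter is called precision in other fields), and applies it at the nucleotide level. The function name and the coordinates are made up for illustration.

```python
def nucleotide_accuracy(annotated, predicted):
    """Return (Sn, Sp) from two sets of coding nucleotide positions.

    Sn = TP / (TP + FN): fraction of annotated coding bases that were predicted.
    Sp = TP / (TP + FP): fraction of predicted coding bases that are annotated.
    """
    tp = len(annotated & predicted)
    fn = len(annotated - predicted)
    fp = len(predicted - annotated)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


# Toy example: annotated exon at 100..199, predicted exon at 150..209.
annotated = set(range(100, 200))
predicted = set(range(150, 210))
sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"Sn = {sn:.3f}, Sp = {sp:.3f}")
```

At the exon, transcript and gene levels the same two ratios are typically computed over exactly matching features instead of single bases, which is one reason the values at those levels are lower.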
Application of AUGUSTUS in genome projects
• Brugia malayi (TIGR)
• Aedes aegypti (TIGR)
• Schistosoma mansoni (TIGR)
• Tetrahymena thermophila (TIGR)
• Galdieria sulphuraria (Michigan State Univ.)
• Coprinus cinereus (Univ. Göttingen)
• Tribolium castaneum (Univ. Göttingen)