Comparative Genomics & Annotation
The Foundation of Comparative Genomics
The main methodological tasks of CG Annotation:
Protein Gene Finding
RNA Structure Prediction
Signal Finding
Overlapping Annotations:
Protein Genes
Protein-RNA
Combining Grammars
Ab Initio Gene prediction
Ab initio gene prediction: predicting the locations of genes (and the amino acid
sequences they encode) from a raw DNA sequence alone.
....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggtgcgggcctccttatctctagagccggccctggctctctggcgcg
gggccccttagtccgggctttttgccatggggtctctgttccctctgtcgctgctgttttttttggcggccgcctacccgggagttgggagcgcgctgggacgccggactaagcgggcgc
aaagccccaagggtagccctctcgcgccctccgggacctcagtgcccttctgggtgcgcatgagcccggagttcgtggctgtgcagccggggaagtcagtgcagctcaattgcagcaaca
gctgtccccagccgcagaattccagcctccgcaccccgctgcggcaaggcaagacgctcagagggccgggttgggtgtcttaccagctgctcgacgtgagggcctggagctccctcgcgc
actgcctcgtgacctgcgcaggaaaaacacgctgggccacctccaggatcaccgcctacagtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggg
aagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagaaccgccccacagcgtgattttggagcctccggtcttaaagggcaggaaat
acactttgcgctgccacgtgacgcaggtgttcccggtgggctacttggtggtgaccctgaggcatggaagccgggtcatctattccgaaagcctggagcgcttcaccggcctggatctgg
ccaacgtgaccttgacctacgagtttgctgctggaccccgcgacttctggcagcccgtgatctgccacgcgcgcctcaatctcgacggcctggtggtccgcaacagctcggcacccatta
cactgatgctcggtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacct
ccctgtgttccgttcctaattctcgccttctgctcccagcttggagccccgcgcccacagctttggcctccggttccatcgctgcccttgtagggatcctcctcactgtgggcgctgcgt
acctatgcaagtgcctagctatgaagtcccaggcgtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg....
Input data: the raw DNA sequence above.
Output (5' to 3'): exons in upper case; introns, UTRs and intergenic sequence in lower case.
5'....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccATGGGGTCTCTGTTC
CCTCTGTCGCTGCTGTTTTTTTTGGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGCGGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCG
CCCTCCGGGACCTCAGTGCCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAAGTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAG
CCGCAGAATTCCAGCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGGGTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTC
GCGCACTGCCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCGCCTACAgtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggaa
gggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGTGATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCT
GCCACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCATGGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGATC
TGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACTTCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTGGTCCGCAA
CAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgtt
ccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCCCACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCACTGTGGGCG
CTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGCGTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctgg
ggaaatggccatacatggtgg.... 3'
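As a deliberately naive illustration of the ab initio task (not the HMM-based approach of these slides), the sketch below scans the forward strand for open reading frames: an ATG followed in-frame by a stop codon. The function name `find_orfs` and the `min_codons` threshold are illustrative assumptions; a real gene finder also models introns, splice sites and the reverse strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for forward-strand ORFs.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    """
    seq = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)

print(find_orfs("ccATGGGGTCTTAAgg"))  # [(2, 14, 2)]
```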
Levels of Annotation
“Annotation”: Tagging regions and nucleotides with information about function,
structure, knowledge, additional data, …
Homologous Genomes (figure not reproduced)
Annotation levels
Protein coding genes including alternative splicing
RNA structure
Regulatory signals – fast/slow, prediction of TF, binding constants,…
Selection Strength,…
Epigenomics – methylation, histone modification
Further complications
Integration of levels – RNA structure of mRNA, signals in coding regions,..
Knowledge and annotation transfer – experimental knowledge might be present in other species
Evolution of Feature – regulatory signals > RNA > protein
Combining with non-homologous analysis – tests for common regulation.
Combining species and population perspectives
Observables, Hidden Variables, Evolution & Knowledge
Observables
P(X)
Hidden Variable
\[ P(X) = \sum_{H} P(X \mid H)\, P(H) \]
Evolution
P(X) (X H)P(X dyna H)P(H)
x
H
Knowledge (Constraints)
If knowledge is deterministic:
\[ P(X) = [P_W]^{-1} \sum_{H} P(X \mid H)\, P(H)\, w(H) = [P_W]^{-1} \sum_{H:\, w(H)=1} P(X \mid H)\, P(H) \]
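The sum over a hidden variable H, and its restriction to hidden states consistent with deterministic knowledge (w(H) = 1), can be illustrated with a two-state toy example. This is a minimal sketch: the state names and probabilities are hypothetical, and P_W is assumed here to be the prior mass of the hidden states that the knowledge allows.

```python
def marginal(prior, likelihood, allowed=None):
    """P(X) = sum_H P(X|H) P(H), optionally restricted to states with w(H) = 1.

    prior: dict H -> P(H); likelihood: dict H -> P(X|H) for the observed X.
    allowed: set of hidden states consistent with the knowledge, or None.
    """
    states = [h for h in prior if allowed is None or h in allowed]
    p_w = sum(prior[h] for h in states)  # assumed normalizer P_W
    return sum(likelihood[h] * prior[h] for h in states) / p_w

# Hypothetical numbers for one observed sequence X.
prior = {"coding": 0.3, "noncoding": 0.7}
likelihood = {"coding": 0.02, "noncoding": 0.005}

print(marginal(prior, likelihood))                      # no knowledge
print(marginal(prior, likelihood, allowed={"coding"}))  # known to be coding
```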
Genscan
State with length distribution
Initial exon
Exons of phase 0, 1 or 2
Introns of phase 0, 1 or 2
Terminal exon
Exon of single exon gene
5' UTR
Promoter
Omitted: reverse strand part of the HMM
3' UTR
Poly-A signal
Intergenic sequence
Comparative Gene Annotation
AGGTATATAATGCG..... Pcoding{ATG-->GTG} or
AGCCATTTAGTGCG..... Pnon-coding{ATG-->GTG}
Gene Finding & Protein Homology
(Gelfand, Mironov & Pevzner, 1996)
Protein Database
Exon Ordering Graph
Spliced Alignment:
1. Define set of potential exons in new genome.
2. Make exon ordering graph - EOG.
3. Align EOG to protein database.
TYGHLP
TY--LPM
(Exon ordering graph figure not reproduced.)
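Steps 1 and 2 of spliced alignment can be sketched as follows: candidate exons become nodes, and an edge joins exon i to exon j whenever i ends before j starts. Step 3 (aligning paths to a protein database) is not attempted here; the per-exon `scores` are a hypothetical stand-in for alignment scores, and both function names are assumptions made for illustration.

```python
def exon_ordering_graph(exons):
    """Edges i -> j whenever exon i ends before exon j starts.

    exons: list of (start, end) half-open intervals with end > start.
    """
    order = sorted(range(len(exons)), key=lambda i: exons[i])
    return {i: [j for j in order if exons[i][1] <= exons[j][0]] for i in order}

def best_chain(exons, scores):
    """Highest-scoring path through the graph by dynamic programming."""
    graph = exon_ordering_graph(exons)
    best = {}  # exon index -> (score of best chain starting here, chain)
    # Successors always start later, so process exons from right to left.
    for i in sorted(graph, key=lambda i: exons[i], reverse=True):
        tail = max((best[j] for j in graph[i]), default=(0.0, []))
        best[i] = (scores[i] + tail[0], [i] + tail[1])
    return max(best.values())

exons = [(0, 10), (5, 20), (12, 30), (35, 50)]
scores = [2.0, 3.0, 1.5, 2.5]
print(best_chain(exons, scores))  # (6.0, [0, 2, 3])
```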
Simultaneous Alignment & Gene Finding
Bafna & Huson, 2000; T. Scharling, 2001; Blayo, 2002.
Align by minimizing distance / maximizing similarity:
Align genes with structure known/unknown:
Secondary Structure Generators
S --> LS | L      (0.869 | 0.131)
F --> dFd | LS    (0.788 | 0.212)
L --> s | dFd     (0.895 | 0.105)
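To see what kind of structures this grammar generates, one can sample from it. The sketch below reads the listed probabilities as S --> LS (0.869) or L (0.131), F --> dFd (0.788) or LS (0.212), and L --> s (0.895) or dFd (0.105); that pairing of numbers with rules is an assumption, as are the function names. Paired positions d...d are written as "(" and ")", unpaired positions s as ".".

```python
import random

# (probability, right-hand side) for the first and second alternative of each rule.
RULES = {
    "S": [(0.869, ["L", "S"]), (0.131, ["L"])],
    "F": [(0.788, ["(", "F", ")"]), (0.212, ["L", "S"])],
    "L": [(0.895, ["."]), (0.105, ["(", "F", ")"])],
}

def sample_structure(rng):
    """Expand S with an explicit stack and return a dot-bracket string."""
    stack, out = ["S"], []
    while stack:
        symbol = stack.pop()
        if symbol in RULES:
            (p_first, first), (_, second) = RULES[symbol]
            rhs = first if rng.random() < p_first else second
            stack.extend(reversed(rhs))  # leftmost symbol is expanded next
        else:
            out.append(symbol)  # terminal: "(", ")" or "."
    return "".join(out)

def is_balanced(structure):
    """True if every ")" closes an earlier "(" and none is left open."""
    depth = 0
    for ch in structure:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

print(sample_structure(random.Random(1)))
```

An explicit stack is used instead of recursion so that long stems do not hit Python's recursion limit.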
RNA Structure Application
From Knudsen et al. (1999)
Knudsen & Hein, 2003
Observing Evolution has 2 parts
P(x)
P(Further history of x)
(Figure with example nucleotides not reproduced.)
http://www.stats.ox.ac.uk/research/genome/projects/currentprojects
Hidden Markov Model for Overlapping Genes
Virus genome with three reading frames (1st, 2nd, 3rd).
Hidden states, named by the reading frames that are coding at a position:
NC: non-coding
S [1], S [2], S [3]: single coding
D [1,2], D [2,3], D [3,1]: double coding
TC [1,2,3]: triple coding
Each position is in one of the eight states NC, 1, 2, 3, 1,2, 1,3, 2,3, 1,2,3.
Scanning
States: NC, S [1], S [2], S [3], D [1,2], D [2,3], D [3,1], TC [1,2,3]
• Only starts in AUG (0.06)
• Will stop in “STOP” (1.0)
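The hidden state at a position is simply the set of reading frames that are coding there. A minimal sketch of that labeling, assuming genes are given as 0-based half-open intervals and that a gene's frame is its start coordinate modulo 3 (both conventions, and the function name, are assumptions for illustration):

```python
def frame_states(length, genes):
    """Label each position by the set of reading frames (1, 2, 3) coding there.

    genes: list of (start, end) half-open, 0-based intervals.
    Returns labels such as "NC", "1", "1,2" or "1,2,3".
    """
    covering = [set() for _ in range(length)]
    for start, end in genes:
        frame = start % 3 + 1  # assumed frame convention
        for pos in range(start, end):
            covering[pos].add(frame)
    return [
        "NC" if not frames else ",".join(str(f) for f in sorted(frames))
        for frames in covering
    ]

# Two overlapping genes in frames 1 and 2 on a 12-nucleotide toy genome.
states = frame_states(12, [(0, 9), (4, 10)])
print(states)
```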
Molecular Evolution: Known Reading Frames
A G T C T
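Known fixed context throughout phylogeny
Assume multiplicativity of selection factors
Selection factors on rates, $q_{i,j} f_{i,j}$; for two selective constraints A and B:
\[ f^{A,B}_{i,j} = f^{A}_{i,j}\, f^{B}_{i,j} \]
\[
Q = \begin{pmatrix}
- & q_{AC} f_{AC} & q_{AG} f_{AG} & q_{AT} f_{AT} \\
q_{CA} f_{CA} & - & q_{CG} f_{CG} & q_{CT} f_{CT} \\
q_{GA} f_{GA} & q_{GC} f_{GC} & - & q_{GT} f_{GT} \\
q_{TA} f_{TA} & q_{TC} f_{TC} & q_{TG} f_{TG} & -
\end{pmatrix}
\]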
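A rate matrix of this form, with each off-diagonal entry equal to a base rate q times a selection factor f, can be assembled as in the sketch below. The two-rate (transition/transversion) base model, the default factor of 1.0 and the function name are assumptions for illustration; the numbers reuse the transition, transversion and base selection factor estimates reported for HIV2 purely as example inputs.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def rate_matrix(ts_rate, tv_rate, selection):
    """Return Q as nested dicts with off-diagonal q_ij * f_ij.

    selection: dict (i, j) -> f_ij; missing pairs default to 1.0 (no selection).
    The diagonal is set so that every row sums to zero.
    """
    q = {}
    for i in BASES:
        row = {}
        for j in BASES:
            if i != j:
                base = ts_rate if (i, j) in TRANSITIONS else tv_rate
                row[j] = base * selection.get((i, j), 1.0)
        row[i] = -sum(row.values())
        q[i] = row
    return q

Q = rate_matrix(ts_rate=5.79, tv_rate=1.03, selection={("A", "G"): 0.73})
print(Q["A"])
```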
Simplify Genetic Code: 4-fold, 2-fold and (1-1-1-1) sites
Rows: site type in the 1st reading frame. Columns: site type in the 2nd reading frame.

1st \ 2nd    1-1-1-1           2-2               4
1-1-1-1      (f1f2a, f1f2b)    (f1a, f1f2b)      (f1a, f1b)
2-2          (f2a, f1f2b)      (a, f1f2b)        (a, f1b)
4            (f2a, f2b)        (a, f2b)          (a, b)
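The pairs in this table follow from multiplying one selection factor per reading frame onto the two base rates a and b. The sketch below assumes that a is the transition rate and b the transversion rate, that a "2-2" site selects only against transversions, that a "1-1-1-1" site selects against every change, and that a "4" site selects against none. These readings and the function name are assumptions, not statements from the slides.

```python
def selected_rates(class1, class2, a, b, f1, f2):
    """Return (transition, transversion) rates for a site in two reading frames.

    class1 / class2: site type in the 1st / 2nd frame, one of
    "4", "2-2" or "1-1-1-1"; f1 / f2: selection factor of each frame.
    """
    ts, tv = a, b
    for site_class, f in ((class1, f1), (class2, f2)):
        if site_class == "1-1-1-1":
            ts *= f  # every change is selected against
            tv *= f
        elif site_class == "2-2":
            tv *= f  # only transversions are selected against
        elif site_class != "4":
            raise ValueError(f"unknown site class: {site_class}")
    return ts, tv

print(selected_rates("1-1-1-1", "2-2", a=2.0, b=1.0, f1=0.5, f2=0.25))  # (1.0, 0.125)
```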
Un-known Reading Frames and varying selection.
(Figure not reproduced: nucleotides A G T C T at positions 1 to k; a (.95).)
HIV2 of 14 genomes: Evolution/Selection
A. Phylogeny and Evolutionary Parameters.

Parameter       Estimate    (second value, unlabeled)
Transition      5.79        0.19
Transversion    1.03        0.05
Base SF         0.73        0.06
SF STOP         0.44        0.18
a               0.95        0.02

B. Selection Strengths for Genes and Positions
Genes: POL, REV, VPX, NEF, TAT, GAG, VIF, VPR, ENV (+/- 1.96 Error)

Rate Class    Single Coding    Double Coding    Triple Coding
0.0066        19.06%           5.71%            2.89%
0.066         21.06%           7.98%            4.13%
0.132         14.98%           8.40%            6.33%
0.264         10.53%           9.33%            10.77%
0.396         8.53%            10.98%           14.39%
0.528         8.20%            17.77%           18.00%
0.99          6.79%            22.01%           21.62%
1.32          10.86%           17.90%           22.91%

Rate Class    GAG       POL       VIF       VPX
0.0066        21.42%    21.38%    14.73%    13.23%
0.066         22.52%    25.21%    9.65%     9.42%
0.132         15.27%    17.85%    10.50%    7.37%
0.264         10.13%    10.18%    14.51%    10.25%
0.396         5.96%     7.42%     15.41%    12.84%
0.528         6.47%     6.86%     17.46%    18.88%
0.99          11.99%    7.10%     9.79%     15.31%
1.32          6.26%     4.00%     7.95%     12.71%
HIV2 of 14 genomes: Annotation
GenBank
Genes (figure labels): Rev, Pol, Vpx, Gag, Vif, Tat, Vpr, Nef, Env
Single Sequence
Sensitivity: 0.9308
Specificity: 0.9939
LogLikelihood: -34939.32
ViterbiCont.: -34949.41
Phylo-HMM
Sensitivity: 0.9542
Specificity: 0.9965
LogLikelihood: -75939.18
ViterbiCont.: -75945.77
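Sensitivity and specificity figures of this kind can be computed per nucleotide by comparing a predicted coding/non-coding labeling with a reference annotation. The sketch below assumes the usual definitions, sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP); the slides do not say which definition was used, and gene-finding papers sometimes report TP/(TP+FP) under the name specificity instead. The function name and the toy labels are hypothetical.

```python
def sensitivity_specificity(truth, predicted):
    """Per-nucleotide sensitivity TP/(TP+FN) and specificity TN/(TN+FP).

    truth, predicted: equal-length sequences of booleans (True = coding).
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: 4 coding and 6 non-coding positions in the reference.
truth = [True] * 4 + [False] * 6
predicted = [True] * 3 + [False] + [True] + [False] * 5
sens, spec = sensitivity_specificity(truth, predicted)
print(sens, spec)
```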
HMM extension: Stop/Start Skidding
• Same evolutionary model as before,
but different HMM topology
• 64 states
• 3 different types of transitions
= ATG
Annotation Results: HIV1 vs. HIV2
de novo annotation:
81.5% sensitivity (without nonhomologous genes)
98.5% specificity
a = 0.23 b = 0.06 g = 0.71
Knowing HIV1 (fixing the Viterbi path for one cube):
97.6% sensitivity (without nonhomologous genes)
99.9% specificity
HMM Extension II: Introns
Single Sequence HMM
• Introns will almost always be 3k long
• 27 states
Pair HMM
• 729 states
Conserved RNA Structure in Protein Coding Genes
Problem:
Gene Structure Known, RNA Structure Unknown.
RNA Structure:
Exons:
Genome:
Protein-RNA Evolution:
Singlet
Doublets
Contagious Dependence
RNA + Protein Evolution
Codon Nucleotide Independence Heuristic
Singlet: $R_{i,j} = f \cdot q_{i,j}$
Doublet: $R_{(i_1,i_2),(j_1,j_2)} = f_1 \cdot f_2 \cdot q_{(i_1,i_2),(j_1,j_2)}$
Prediction of stem-pairing regions for different numbers of sequences (figure not reproduced; legend: 8, 5, 3)
Structure/non-Structure Grammars
Non-structural
Structural
Combining Grammars: Multiple Hidden Layers
Present Approach:
Two “independent” annotations
SCFG: RNA Structure
HMM: Protein Structure
Combine SCFG & HMM:
RNA, Gene Structure
Ideal Approach:
Combined Annotation
Joanna Davies
Combining Grammars:
Solution Attempts
HMM
Independence is non-trivial to define, as the two are
in principle competing alternative models.
SCFG
Let X be the stochastic variable giving the HMM annotation.
Let Y be the stochastic variable giving the SCFG annotation.
Is P(X,Y | Data) = P(X | Data) P(Y | Data)? No.
• Combined grammars (HMM, SCFG) --> SCFG have been devised,
but they do not work well, have arbitrary designs and are very large.
Combinations of Viterbi and posterior decoding arise.
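Why the two annotations are not independent given the data can be seen from a toy joint posterior over one position. The numbers below are hypothetical and chosen only so that coding positions are rarely base-paired; the check compares each joint probability with the product of its marginals.

```python
def is_independent(joint, tol=1e-12):
    """Check whether P(X, Y | Data) equals P(X | Data) * P(Y | Data) everywhere.

    joint: dict (x, y) -> probability, summing to 1.
    """
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p  # marginal of the HMM annotation X
        py[y] = py.get(y, 0.0) + p  # marginal of the SCFG annotation Y
    return all(
        abs(joint.get((x, y), 0.0) - px[x] * py[y]) <= tol
        for x in px
        for y in py
    )

# Hypothetical posterior for a single position.
joint = {
    ("coding", "paired"): 0.05,
    ("coding", "unpaired"): 0.45,
    ("noncoding", "paired"): 0.25,
    ("noncoding", "unpaired"): 0.25,
}
print(is_independent(joint))  # False
```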
Joanna Davies
http://www.stats.ox.ac.uk/__data/assets/file/0016/3328/combinedHMMartifact.pdf