Detection of noncoding RNAs
by comparative sequence analysis
An mRNA model for RNAz
Stefan Washietl
Institute for Theoretical Chemistry
University of Vienna
Bled, February 2006
The challenge of comparative genomics
Mouse       ACTGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Cow         ACGGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGCG-CC
Dog         ACTGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Rat         ACTGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Rhesus      ACTGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Chimp       ACTGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Human       ACTGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Elephant    ACTGCTGGGCCTGTACTAGAGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Tenrec      ACTGCTGGGCTTGTACTAGAGGGTGTGCTGATGGGTACTGGGGGGTG-CT
Armadillo   ACTGCTGGG-CTGCATCAGGGGGTGTGCTGTCGGGTACTGGGGAGTG-CC
Opossum     ACTGCTGAGCTTGCACCAAATGATGCGCTGTCGGGTACTGAGGGGTG-CT
Chicken     ATTGCTGCGCCTGTACCAAGTGGTGCGCTGTGGGGTACTGGGGGCTG-CC
Frog        AGTGTTGGGCTTGCACCAAGTGATGTGCTGTAGGGTACTGGGCGTTA-CT
Fugu        ACTGTTGCGTCTGCACCAAGTGATGCGCTGTCGGGAACTGTGGCGTG-GC
Tetraodon   ACTGCTGCGTCTGCACCAGGTGATGCGCTGTCGGGAACTGCGGCGTG-GC
Zebrafish   ATGGCTGCATGTGGCCCAGATGAT----TGACAGATGATGTCAGATGTGT
• Protein coding?
• ncRNA?
• Regulatory or other functional element?
Outline
• Motivation
• Review of available methods
• A simple new scoring scheme
    • Shuffling
    • Exact
• Benchmark of some available and the new method
• Significance measure
• Currently only pairwise, ungapped global case without stop codons: Hofstadter’s law
    “It always takes longer than you expect, even when you take into account Hofstadter’s Law”
Motivation
• Why a coding model in RNAz?
    • Get rid of the “false positives” in mRNAs
    • Increase the information content of the output
• Why yet another protein gene finder: Sturgeon’s law
    “90% of everything is crap”
• Limitations of current coding potential detection approaches
    • Limited to pairwise alignments
    • Simplified models which do not include all available information
    • Ad hoc scores, poor statistics
Requirements
• Lightweight
• General
• Accurate
• Robust statistics
• Fast
Plenty of protein gene finders
• Full featured gene prediction
    • Genscan, Twinscan, N-Scan
    • SLAM
    • SGP2
    • Exoniphy
• Detection of coding potential
    • ETOPE (Ka/Ks ratio test)
    • CSTfinder
    • CRITICA
    • QRNA
Ka/Ks ratio test
1. Count synonymous and non-synonymous sites in both sequences.
2. Count synonymous and non-synonymous differences.
3. Correct the observed differences and estimate the ratio of non-synonymous (Ka) to synonymous (Ks) substitutions per site.
4. Ka/Ks < 1 ⇒ purifying selection
+ Properly normalized score
− Only distinguishes synonymous from non-synonymous changes (conservative amino acid changes are not rewarded)
Nei & Gojobori, Mol. Biol. Evol. 3:418 (1986); Nekrutenko et al., Nucl. Acids Res. 31:3564 (2003)
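The four steps of the Ka/Ks test can be sketched in Python roughly as follows. This is a minimal illustration in the spirit of the Nei & Gojobori counting scheme, not the ETOPE implementation. It assumes an ungapped, in-frame pairwise alignment, treats a stop codon as just another "amino acid" (so mutation paths through stop codons are not excluded), and uses the Jukes-Cantor formula for the correction in step 3. All function names are made up for this example.

```python
import math
from itertools import permutations, product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon):
    """Step 1: number of synonymous sites in one codon (between 0 and 3)."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODE[mutant] == CODE[codon]:
                total += 1.0 / 3.0
    return total


def codon_diffs(c1, c2):
    """Step 2: (synonymous, non-synonymous) differences between two codons,
    averaged over all orders in which the differing positions can mutate."""
    positions = [i for i in range(3) if c1[i] != c2[i]]
    if not positions:
        return 0.0, 0.0
    paths = list(permutations(positions))
    syn = non = 0.0
    for path in paths:
        current = c1
        for pos in path:
            nxt = current[:pos] + c2[pos] + current[pos + 1:]
            if CODE[nxt] == CODE[current]:
                syn += 1
            else:
                non += 1
            current = nxt
    return syn / len(paths), non / len(paths)


def jukes_cantor(p):
    """Step 3: correct an observed proportion of differences for multiple hits."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ka_ks(seq_a, seq_b):
    """Ka/Ks ratio for two aligned, ungapped, in-frame sequences."""
    codons_a = [seq_a[i:i + 3] for i in range(0, len(seq_a), 3)]
    codons_b = [seq_b[i:i + 3] for i in range(0, len(seq_b), 3)]
    s_sites = sum(syn_sites(c) for c in codons_a + codons_b) / 2.0
    n_sites = 3 * len(codons_a) - s_sites
    s_diff = n_diff = 0.0
    for ca, cb in zip(codons_a, codons_b):
        sd, nd = codon_diffs(ca, cb)
        s_diff += sd
        n_diff += nd
    ks = jukes_cantor(s_diff / s_sites)
    ka = jukes_cantor(n_diff / n_sites)
    return ka / ks
```

A call such as `ka_ks("GGTGGTGGTAAA", "GGCGGAGGTAGA")` gives a ratio below 1, because the toy pair contains two synonymous changes and one non-synonymous change. The sketch fails when there are no synonymous differences (Ks = 0) or when the sequences are too divergent for the Jukes-Cantor correction.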
CRITICA
• Scoring scheme based on theoretical considerations
    • Positive score for synonymous substitutions
    • Negative score for non-synonymous substitutions
    • Also includes a non-comparative score (di-nucleotide model)
+ Reasonable statistics
− Focused on bacteria, hard to use, no amino acid similarity
Badger & Olsen, Mol. Biol. Evol. 16:512 (1999)
CSTfinder
• Scans blast hits of ESTs for coding potential
• Defines coding potential score:
$$\mathrm{CPS} = \left(\frac{100}{N}\right)\left(\frac{N_S + 1}{N_A + 1}\right)\sum_{i=1}^{N} s(c_i^A, c_i^B)$$
    $N$ ... number of codon pairs
    $N_S, N_A$ ... number of synonymous, non-synonymous pairs
    $c_i^A$ ... codon number $i$ in sequence A
    $s(c_i^A, c_i^B)$ ... similarity of encoded amino acids
+ Considers amino acid similarity
− As ad hoc as it can be, no normalization, “Vaporware”
Mignone et al., Nucl. Acids Res. 31:4639 (2003)
QRNA
• 3 pair hidden Markov models/SCFGs: Coding, RNA, other
$$P_{COD}(a_1 a_2 a_3, b_1 b_2 b_3) \approx P(a_1 a_2 a_3 \mid A)\, P(b_1 b_2 b_3 \mid B)\, P(A, B)$$
$$a, b \in \mathcal{A} = \{A, G, C, T\}, \quad A, B \in \{\text{amino acids}\}$$
$$P(COD \mid \text{alignment}) = \frac{P(\text{alignment} \mid COD)\, P(COD)}{\sum_{\text{Models}} P(\text{alignment} \mid \text{Model})\, P(\text{Model})}$$
$$\text{Score} = \frac{P(COD \mid \text{alignment})}{P(OTH \mid \text{alignment})}$$
+ Considers amino acid similarity, elegant solution, can deal with frameshifts and local search
− No P value, independence assumption of codons and amino acids
Rivas & Eddy, BMC Bioinformatics 2:8 (2001)
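The posterior and the score on this slide are plain Bayes' rule over the three models. The sketch below shows only that last step. It does not implement the pair HMMs or SCFGs of QRNA, and the likelihood values in the usage note are invented for illustration. The function names are hypothetical.

```python
def posteriors(likelihoods, priors):
    """Posterior P(model | alignment) for every model, by Bayes' rule."""
    evidence = sum(likelihoods[m] * priors[m] for m in likelihoods)
    return {m: likelihoods[m] * priors[m] / evidence for m in likelihoods}


def coding_score(likelihoods, priors):
    """Ratio of the posterior of the coding model to that of the 'other' model."""
    post = posteriors(likelihoods, priors)
    return post["COD"] / post["OTH"]
```

For example, with uniform priors and made-up likelihoods `{"COD": 1e-30, "RNA": 1e-33, "OTH": 1e-32}`, the coding model gets most of the posterior mass and `coding_score` is about 100. In practice one would likely work with log-likelihoods to avoid numerical underflow.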
A simple pairwise similarity score
Definitions
Alignment AB of sequence A and B:
    A : $c_1^A\, c_2^A \ldots c_n^A$
    B : $c_1^B\, c_2^B \ldots c_n^B$
    $L$ ... length in codons
    $f_{\{A,G,C,T\}}$ ... background frequency of nucleotides
    $ID$ ... pairwise identity
    $d(c^A, c^B)$ ... Hamming distance of two codons (e.g. $d(AGC, AGT) = 1$)
    $s(c^A, c^B)$ ... similarity of encoded amino acids (e.g. BLOSUM matrix)
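The quantities defined above are simple to compute for an ungapped pairwise alignment. The helpers below are a small sketch with hypothetical names. The similarity s is left out here because it needs a substitution matrix such as BLOSUM.

```python
from collections import Counter


def codons(seq):
    """Split an in-frame sequence into codons, dropping incomplete trailing bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def hamming(c1, c2):
    """Hamming distance d of two codons: number of differing positions."""
    return sum(x != y for x, y in zip(c1, c2))


def pairwise_identity(seq_a, seq_b):
    """Pairwise identity ID: fraction of identical alignment columns."""
    return sum(x == y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def base_frequencies(*seqs):
    """Background nucleotide frequencies f estimated from the given sequences."""
    counts = Counter("".join(seqs))
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}
```

With these helpers, `hamming("AGC", "AGT")` reproduces the example d(AGC, AGT) = 1 from the definitions.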
A simple pairwise similarity score
Normalizing with shuffling
• Unnormalized score
$$\tilde{S}_{AB} = \sum_{\substack{i=1 \\ d(c_i^A, c_i^B) > 0}}^{L} s(c_i^A, c_i^B)$$
• Shuffle columns: $AB_{random}$
$$S_{AB} = \tilde{S}_{AB} - \tilde{S}_{AB_{random}}$$
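A possible reading of the shuffling normalization is sketched below. Several points are assumptions of this example rather than statements from the slide: the columns being shuffled are taken to be single-nucleotide alignment columns, the random term is averaged over many shuffles, and a toy similarity (+1 for the same amino acid, -1 otherwise) stands in for a BLOSUM lookup. Stop codons created by shuffling are simply treated as an extra symbol.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def toy_similarity(c1, c2):
    """Stand-in for a BLOSUM lookup: +1 for the same amino acid, -1 otherwise."""
    return 1.0 if CODE[c1] == CODE[c2] else -1.0


def raw_score(seq_a, seq_b, sim=toy_similarity):
    """Unnormalized score: sum of similarities over codon pairs with d > 0."""
    total = 0.0
    for i in range(0, len(seq_a) - 2, 3):
        ca, cb = seq_a[i:i + 3], seq_b[i:i + 3]
        if ca != cb:
            total += sim(ca, cb)
    return total


def shuffled_score(seq_a, seq_b, n_shuffles=100, seed=0, sim=toy_similarity):
    """Raw score minus the mean raw score of column-shuffled alignments."""
    rng = random.Random(seed)
    columns = list(zip(seq_a, seq_b))  # one entry per nucleotide column
    random_total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(columns)
        rand_a = "".join(col[0] for col in columns)
        rand_b = "".join(col[1] for col in columns)
        random_total += raw_score(rand_a, rand_b, sim)
    return raw_score(seq_a, seq_b, sim) - random_total / n_shuffles
```

Shuffling whole columns keeps base composition and pairwise identity unchanged while destroying the codon structure, which is why the shuffled alignments can serve as a null model for the same pair of sequences.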
A simple pairwise similarity score
Exact normalization
• Calculate the expected score for pairs with 1, 2 and 3 differences, e.g.:
$$\langle s_{d=1} \rangle = \frac{N^{comb}}{N^{comb}_{d=1}} \sum_{\substack{a,b,c,d,e,f \in \mathcal{A} \\ d(abc,def)=1}} s(c_{abc}, c_{def}) \prod_{i=a,b,c,d,e,f} f_i$$
• Correct each observed score by the expected score
$$S_{AB} = \sum_{\substack{i=1 \\ d(c_i^A, c_i^B) > 0}}^{L} \left[ s(c_i^A, c_i^B) - \langle s_{d=d(c_i^A, c_i^B)} \rangle \right]$$
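The exact normalization can be sketched as a direct enumeration of all codon pairs. The code below follows the expected-score formula literally, taking N^comb as the number of all ordered codon pairs and N^comb_d as the number of pairs with d differences. It is only an illustration: the toy similarity (+1 for the same amino acid, -1 otherwise) replaces a BLOSUM lookup, stop codons are included as an extra symbol, and all names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def toy_similarity(c1, c2):
    """Stand-in for a BLOSUM lookup: +1 for the same amino acid, -1 otherwise."""
    return 1.0 if CODE[c1] == CODE[c2] else -1.0


def hamming(c1, c2):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(c1, c2))


def expected_scores(freqs, sim=toy_similarity):
    """Expected similarity <s_d> for codon pairs with d = 0..3 differences."""
    codon_list = list(CODE)
    n_comb = len(codon_list) ** 2
    weighted = {d: 0.0 for d in range(4)}
    counts = {d: 0 for d in range(4)}
    for c1 in codon_list:
        for c2 in codon_list:
            d = hamming(c1, c2)
            prob = 1.0
            for base in c1 + c2:  # product of the six base frequencies
                prob *= freqs[base]
            weighted[d] += sim(c1, c2) * prob
            counts[d] += 1
    return {d: n_comb / counts[d] * weighted[d] for d in range(4)}


def corrected_score(seq_a, seq_b, freqs, sim=toy_similarity):
    """Sum over codon pairs with d > 0 of observed minus expected similarity."""
    expected = expected_scores(freqs, sim)
    total = 0.0
    for i in range(0, len(seq_a) - 2, 3):
        ca, cb = seq_a[i:i + 3], seq_b[i:i + 3]
        d = hamming(ca, cb)
        if d > 0:
            total += sim(ca, cb) - expected[d]
    return total
```

With uniform base frequencies, each expected value reduces to the plain average similarity over all codon pairs with that number of differences. No shuffling is needed, so the corrected score is deterministic.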
Test Set
• UCSC Multiz alignments (13-way)
• Extract mouse RefSeq genes from chromosomes 1 and 10
• Take only “correct” genes which start with M and have exactly one stop codon, at the last position
• Select slices of different length (50–150 nts) and pairwise identity (60%–100%)
• Random control: shuffle sequences, remove stop codons
⇒ ≈ 7000 positive and negative examples
Score distribution of native and random alignments
• 60% < ID < 85%
• L = 150 nts
[Figure: histograms of frequency (y-axis) vs. score (x-axis) in four panels: QRNA, Ka/Ks, Simple (shuffled), Simple (exact).]
Comparison of methods (ROCs)
[Figure: ROC curves, sensitivity (y-axis) vs. false positive rate (x-axis, 0 to 0.5), for QRNA, Ka/Ks, Simple (shuffling) and Simple (exact). Sensitivity values marked in the plot: 0.98, 0.94, 0.90, 0.68.]
Dependence on length and sequence divergence
[Figure: two ROC panels, sensitivity (y-axis) vs. false positive rate (x-axis, 0 to 0.15). One panel compares slice lengths of 51, 102 and 150 nts; the other compares pairwise identities of 60-70%, 70-80% and 90-100%.]
Estimating statistical significance
• Calculate the mean and variance of all sequences for a given (expected) base composition and pairwise identity. Assume normal distribution and calculate the P value.
$$\langle S \rangle_{ID,L} = L \underbrace{\sum_{a,b,c,d,e,f \in \mathcal{A}} s(c_{abc}, c_{def}) \underbrace{\prod_{i=a..f} (f_i)\; m_{d(abc,def)}\, \frac{N^{comb}}{N^{comb}_{d=d(abc,def)}}}_{K}}_{M}$$
$$m_{d=0} = ID^3, \quad m_{d=1} = ID^2 (1 - ID) \cdot 3, \quad m_{d=2} = ID (1 - ID)^2 \cdot 3, \quad m_{d=3} = (1 - ID)^3$$
$$\mathrm{var}(s)_{ID,L} = \sum_{a,b,c,d,e,f \in \mathcal{A}} \left( s(c_{abc}, c_{def})^2\, K \right) - M^2$$
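The mean, variance and P value can be sketched as follows. The per-codon weight K and mean M are computed as in the formulas above, with the sum running over all codon pairs. Three things are assumptions of this example: the variance of the total score is taken as L times the per-codon variance (codons treated as independent), the P value is the upper tail of the fitted normal distribution, and a toy similarity (+1 for the same amino acid, -1 otherwise) replaces BLOSUM. The function names are hypothetical.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def toy_similarity(c1, c2):
    """Stand-in for a BLOSUM lookup: +1 for the same amino acid, -1 otherwise."""
    return 1.0 if CODE[c1] == CODE[c2] else -1.0


def hamming(c1, c2):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(c1, c2))


def score_moments(identity, length, freqs, sim=toy_similarity):
    """Return (mean, variance) of the summed score over `length` codons."""
    # m[d]: probability that a codon pair differs at exactly d positions.
    m = [identity ** 3,
         3 * identity ** 2 * (1 - identity),
         3 * identity * (1 - identity) ** 2,
         (1 - identity) ** 3]
    pairs = [(c1, c2) for c1 in CODE for c2 in CODE]
    n_comb = len(pairs)
    n_d = [0, 0, 0, 0]
    for c1, c2 in pairs:
        n_d[hamming(c1, c2)] += 1
    mean_s = 0.0   # M: expected per-codon score
    mean_s2 = 0.0  # expected squared per-codon score
    for c1, c2 in pairs:
        d = hamming(c1, c2)
        k = m[d] * n_comb / n_d[d]  # weight K of this codon pair
        for base in c1 + c2:
            k *= freqs[base]
        mean_s += sim(c1, c2) * k
        mean_s2 += sim(c1, c2) ** 2 * k
    var_s = mean_s2 - mean_s ** 2
    return length * mean_s, length * var_s


def p_value(score, mean, variance):
    """Upper-tail probability of `score` under a normal approximation."""
    z = (score - mean) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

For instance, `score_moments(0.8, 50, {b: 0.25 for b in "ACGT"})` gives the moments for 50 codons at 80% identity with uniform base composition, and `p_value(score, mean, variance)` turns an observed score into a tail probability. The variance must be positive for `p_value` to be defined.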
Sampled vs. calculated scores
[Figure: frequency (y-axis, 0 to 0.02) vs. score (x-axis, 0 to 200).]
• 10,000 alignments sampled with Markov method (black bars)
• Calculated distribution (red line)
Conclusions and outlook
• Comparative detection of coding potential is a useful feature
• Available methods are not perfect
• Considering amino acid similarity significantly improves accuracy compared to simply counting synonymous substitutions
• A simple and properly normalized score outperforms all other tested methods
• The score allows direct calculation of a P value
• Include
    • stop codons
    • gaps (frameshifts)
    • local search?
• Extension to multiple alignments