
Detection of noncoding RNAs by comparative sequence analysis
A mRNA model for RNAz

Stefan Washietl
Institute for Theoretical Chemistry, University of Vienna
Bled, February 2006


The challenge of comparative genomics

Mouse      ACTGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Cow        ACGGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGCG-CC
Dog        ACTGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Rat        ACTGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Rhesus     ACTGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Chimp      ACTGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Human      ACTGCTGGGCCTGGACCAGGGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Elephant   ACTGCTGGGCCTGTACTAGAGGGTGTGCTGTCGGGTACTGGGGGGTG-CT
Tenrec     ACTGCTGGGCTTGTACTAGAGGGTGTGCTGATGGGTACTGGGGGGTG-CT
Armadillo  ACTGCTGGG-CTGCATCAGGGGGTGTGCTGTCGGGTACTGGGGAGTG-CC
Opossum    ACTGCTGAGCTTGCACCAAATGATGCGCTGTCGGGTACTGAGGGGTG-CT
Chicken    ATTGCTGCGCCTGTACCAAGTGGTGCGCTGTGGGGTACTGGGGGCTG-CC
Frog       AGTGTTGGGCTTGCACCAAGTGATGTGCTGTAGGGTACTGGGCGTTA-CT
Fugu       ACTGTTGCGTCTGCACCAAGTGATGCGCTGTCGGGAACTGTGGCGTG-GC
Tetraodon  ACTGCTGCGTCTGCACCAGGTGATGCGCTGTCGGGAACTGCGGCGTG-GC
Zebrafish  ATGGCTGCATGTGGCCCAGATGAT----TGACAGATGATGTCAGATGTGT
- Protein coding?
- ncRNA?
- Regulatory or other functional element?


Outline

- Motivation
- Review of available methods
- A simple new scoring scheme
  - Shuffling
  - Exact
- Benchmark of some available and the new method
- Significance measure
- Currently only pairwise, ungapped global case without stop codons: Hofstadter’s law

  "It always takes longer than you expect, even when you take into account Hofstadter’s Law"


Motivation

- Why a coding model in RNAz?
  - Get rid of the “false positives” in mRNAs
  - Increase the information content of the output
- Why yet another protein gene finder: Sturgeon’s law

  "90% of everything is crap"
- Limitations of current coding potential detection approaches
  - Limited to pairwise alignments
  - Simplified models which do not include all available information
  - Ad hoc scores, poor statistics


Requirements

- Lightweight
- General
- Accurate
- Robust statistics
- Fast


Plenty of protein gene finders

- Full featured gene prediction
  - Genscan, Twinscan, N-Scan
  - SLAM
  - SGP2
  - Exoniphy
- Detection of coding potential
  - ETOPE (Ka/Ks ratio test)
  - CSTfinder
  - CRITICA
  - QRNA


Ka/Ks ratio test

1. Count synonymous and non-synonymous sites in both sequences.
2. Count synonymous and non-synonymous differences.
3. Correct the observed differences and estimate the ratio of synonymous (Ks) and non-synonymous (Ka) substitutions per site.
4. Ka/Ks < 1 ⇒ purifying evolution

+ Properly normalized score
− Only considers synonymous changes (no conservative changes)

Nei & Gojobori, Mol. Biol. Evol. 3:418 (1986); Nekrutenko et al., Nucl. Acids Res. 31:3564 (2003)
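The four steps of the Ka/Ks ratio test can be sketched in a few lines of Python. This is a simplified illustration in the spirit of the Nei & Gojobori counting scheme, not the ETOPE implementation. The function names (`syn_sites`, `count_differences`, `jukes_cantor`, `ka_ks`) are hypothetical, and the sketch assumes ungapped, in-frame sequences of equal length without stop codons. Changes to stop codons are simply counted as non-synonymous when counting sites, which is a simplification.

```python
import math
from itertools import permutations

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ("*" = stop).
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def syn_sites(codon):
    """Step 1: number of synonymous sites of one codon (between 0 and 3)."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            mutant = codon[:pos] + base + codon[pos + 1:]
            if base != codon[pos] and CODE[mutant] == CODE[codon]:
                total += 1.0 / 3.0
    return total


def count_differences(c1, c2):
    """Step 2: (synonymous, non-synonymous) differences, averaged over mutation orders."""
    positions = [i for i in range(3) if c1[i] != c2[i]]
    syn = non = 0.0
    paths = 0
    for order in permutations(positions):
        current, s, n = c1, 0, 0
        for pos in order:
            mutant = current[:pos] + c2[pos] + current[pos + 1:]
            if CODE[mutant] == "*":
                break  # assumption: ignore pathways that pass through a stop codon
            if CODE[mutant] == CODE[current]:
                s += 1
            else:
                n += 1
            current = mutant
        else:
            syn, non, paths = syn + s, non + n, paths + 1
    if paths == 0:
        raise ValueError("no stop-free pathway between %s and %s" % (c1, c2))
    return syn / paths, non / paths


def jukes_cantor(p):
    """Step 3: correct an observed proportion of differences for multiple hits."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ka_ks(seq_a, seq_b):
    """Return (Sd, Nd, Ka, Ks) for two aligned, in-frame coding sequences."""
    codons_a = [seq_a[i:i + 3] for i in range(0, len(seq_a), 3)]
    codons_b = [seq_b[i:i + 3] for i in range(0, len(seq_b), 3)]
    s_sites = sum(syn_sites(c) for c in codons_a + codons_b) / 2.0
    n_sites = 3.0 * len(codons_a) - s_sites
    sd = nd = 0.0
    for ca, cb in zip(codons_a, codons_b):
        s, n = count_differences(ca, cb)
        sd, nd = sd + s, nd + n
    return sd, nd, jukes_cantor(nd / n_sites), jukes_cantor(sd / s_sites)
```

For example, `ka_ks("GCTGCTGCTAAA", "GCCGCAGCTAAA")` sees two synonymous and no non-synonymous differences, so Ka/Ks is below 1 (step 4). The ratio is left to the caller because Ks can be zero for identical sequences.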
CRITICA

- Scoring scheme based on theoretical considerations
  - Positive score for synonymous substitutions
  - Negative score for non-synonymous substitutions
  - Also includes non-comparative score (di-nucleotide model)

+ reasonable statistics
− Focused on bacteria, hard to use, no amino acid similarity

Badger & Olsen, Mol. Biol. Evol. 16:512 (1999)


CSTfinder

- Scans blast hits of ESTs for coding potential
- Defines coding potential score:

\[
\mathrm{CPS} = \left(\frac{100}{N}\right)\left(\frac{N_S + 1}{N_A + 1}\right)\sum_{i=1}^{N} s(c_i^A, c_i^B)
\]

  N                 ... number of codon pairs
  N_S, N_A          ... number of synonymous, non-synonymous pairs
  c_i^A             ... codon number i in sequence A
  s(c_i^A, c_i^B)   ... similarity of encoded amino acids

+ considers amino acid similarity
− as ad hoc as it can be, no normalization, “Vaporware”

Mignone et al., Nucl. Acids Res. 31:4639 (2003)
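As a small worked illustration of the CPS formula, the following Python sketch evaluates it for two aligned, in-frame sequences. It is not the CSTfinder code. The names `coding_potential_score` and `toy_similarity` are hypothetical, the +1/−1 similarity is a stand-in for a real amino acid matrix such as BLOSUM, and counting only non-identical codons that encode the same amino acid as "synonymous pairs" is an assumption.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ("*" = stop).
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def toy_similarity(aa1, aa2):
    """Stand-in for s(.,.): +1 for identical amino acids, -1 otherwise."""
    return 1 if aa1 == aa2 else -1


def coding_potential_score(seq_a, seq_b, similarity=toy_similarity):
    """CPS = (100/N) * ((N_S + 1)/(N_A + 1)) * sum of amino acid similarities."""
    pairs = [(seq_a[i:i + 3], seq_b[i:i + 3]) for i in range(0, len(seq_a), 3)]
    n = len(pairs)
    n_syn = n_non = 0
    total = 0
    for ca, cb in pairs:
        aa_a, aa_b = CODE[ca], CODE[cb]
        total += similarity(aa_a, aa_b)
        if aa_a != aa_b:
            n_non += 1      # non-synonymous pair
        elif ca != cb:
            n_syn += 1      # assumption: identical codons are not counted as synonymous
    return (100.0 / n) * ((n_syn + 1) / (n_non + 1)) * total
```

With `seq_a = "GCTAAAGGT"` and `seq_b = "GCCAAAGAT"` there is one synonymous pair, one identical pair and one non-synonymous pair, so the two ratio terms cancel and the score is 100/3 times the summed similarity of 1.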
QRNA

- 3 pair hidden Markov models/SCFGs: Coding, RNA, other

\[
P_{\mathrm{COD}}(a_1 a_2 a_3, b_1 b_2 b_3) \approx P(a_1 a_2 a_3 \mid A)\, P(b_1 b_2 b_3 \mid B)\, P(A, B)
\]

  a, b ∈ 𝒜 = {A, G, C, T};  A, B ∈ {amino acids}

\[
P(\mathrm{COD} \mid \text{alignment}) = \frac{P(\text{alignment} \mid \mathrm{COD})\, P(\mathrm{COD})}{\sum_{\text{Models}} P(\text{alignment} \mid \text{Model})\, P(\text{Model})}
\qquad
\text{Score} = \frac{P(\mathrm{COD} \mid \text{alignment})}{P(\mathrm{OTH} \mid \text{alignment})}
\]

+ considers amino acid similarity, elegant solution, can deal with frameshifts and local search
− no P value, independence assumption of codons and amino acids

Rivas & Eddy, BMC Bioinformatics 2:8 (2001)


A simple pairwise similarity score: Definitions

Alignment AB of sequences A and B:

  A:  c_1^A c_2^A ... c_n^A
  B:  c_1^B c_2^B ... c_n^B

  L              ... length in codons
  f_{A,G,C,T}    ... background frequency of nucleotides
  ID             ... pairwise identity
  d(c^A, c^B)    ... Hamming distance of two codons (e.g. d(AGC, AGT) = 1)
  s(c^A, c^B)    ... similarity of encoded amino acids (e.g. BLOSUM matrix)


A simple pairwise similarity score: Normalizing with shuffling

- Unnormalized score:

\[
\tilde{S}_{AB} = \sum_{\substack{i=1 \\ d(c_i^A, c_i^B) > 0}}^{L} s(c_i^A, c_i^B)
\]

- Shuffle columns:

\[
S_{AB} = \tilde{S}_{AB} - \tilde{S}_{AB_{\mathrm{random}}}
\]


A simple pairwise similarity score: Exact normalization

- Calculate the expected score for pairs with 1, 2 and 3 differences.
  e.g.:

\[
\langle s_{d=1} \rangle = \frac{N^{\mathrm{comb}}}{N^{\mathrm{comb}}_{d=1}} \sum_{\substack{a,b,c,d,e,f \in \mathcal{A} \\ d(abc,\,def) = 1}} s(c_{abc}, c_{def}) \prod_{i=a,b,c,d,e,f} f_i
\]

- Correct each observed score by the expected score:

\[
S_{AB} = \sum_{\substack{i=1 \\ d(c_i^A, c_i^B) > 0}}^{L} \left[ s(c_i^A, c_i^B) - \langle s_{d=d(c_i^A, c_i^B)} \rangle \right]
\]


Test Set

- UCSC Multiz alignments (13-way)
- Extract mouse RefSeq genes from chromosome 1 and 10
- Take only “correct” genes which start with M and have exactly one stop codon on the last position.
- Select slices of different length (50–150 nts) and pairwise identity (60%–100%)
- Random control: Shuffle sequences, remove stop codons

⇒ ≈ 7000 positive and negative examples


Score distribution of native and random alignments

[Figure: histograms of frequency vs. score in four panels: QRNA, Ka/Ks, Simple (shuffled), Simple (exact). Data: 60% < ID < 85%, L = 150 nts.]


Comparison of methods (ROCs)

[Figure: ROC curves, sensitivity vs. false positive rate (0 to 0.5), for QRNA, Ka/Ks, Simple (shuffling) and Simple (exact). Annotated sensitivity values: 0.98, 0.94, 0.90, 0.68; their assignment to the methods is not recoverable from the extracted text.]


Dependence of length and sequence divergence

[Figure: two ROC panels, sensitivity vs. false positive rate (0 to 0.15). Left: slice lengths of 51, 102 and 150 nts. Right: pairwise identity of 60-70%, 70-80% and 90-100%.]
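The shuffling-normalized variant of the simple score benchmarked above can be sketched as follows. This is only an illustration under stated assumptions: the names `raw_score`, `shuffle_normalized_score` and `toy_similarity` are hypothetical, the +1/−1 similarity stands in for a BLOSUM-style matrix, nucleotide columns (not codon columns) are shuffled, the random score is averaged over several shuffles, and codon pairs that contain a stop codon after shuffling are skipped.

```python
import random

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ("*" = stop).
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def toy_similarity(aa1, aa2):
    """Stand-in for s(.,.): +1 for identical amino acids, -1 otherwise."""
    return 1 if aa1 == aa2 else -1


def raw_score(seq_a, seq_b, similarity=toy_similarity):
    """Unnormalized score: sum of s over codon pairs with Hamming distance > 0."""
    total = 0.0
    for i in range(0, len(seq_a) - 2, 3):
        ca, cb = seq_a[i:i + 3], seq_b[i:i + 3]
        aa_a, aa_b = CODE[ca], CODE[cb]
        # Assumption: pairs with a stop codon (possible after shuffling) are skipped.
        if ca != cb and "*" not in (aa_a, aa_b):
            total += similarity(aa_a, aa_b)
    return total


def shuffle_normalized_score(seq_a, seq_b, n_shuffles=100, seed=0,
                             similarity=toy_similarity):
    """Observed score minus the mean score of column-shuffled alignments."""
    rng = random.Random(seed)
    columns = list(zip(seq_a, seq_b))  # nucleotide columns of the alignment
    random_total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(columns)  # keeps base composition and pairwise identity
        rand_a = "".join(col[0] for col in columns)
        rand_b = "".join(col[1] for col in columns)
        random_total += raw_score(rand_a, rand_b, similarity)
    return raw_score(seq_a, seq_b, similarity) - random_total / n_shuffles
```

Shuffling whole alignment columns preserves base composition and pairwise identity while destroying the codon structure, which is why the shuffled score can serve as a background for the observed one.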
Estimating statistical significance

- Calculate the mean and variance of all sequences for a given (expected) base composition and pairwise identity. Assume normal distribution and calculate the P value.

\[
\langle S \rangle_{ID,L} = L \cdot \underbrace{\sum_{a,b,c,d,e,f \in \mathcal{A}} s(c_{abc}, c_{def}) \underbrace{\Big(\prod_{i=a..f} f_i\Big)\, m_{d(abc,\,def)}\, \frac{N^{\mathrm{comb}}}{N^{\mathrm{comb}}_{d=d(abc,\,def)}}}_{K}}_{M}
\]

\[
m_{d=0} = ID^3, \qquad
m_{d=1} = ID^2 (1 - ID) \cdot 3, \qquad
m_{d=2} = ID (1 - ID)^2 \cdot 3, \qquad
m_{d=3} = (1 - ID)^3
\]

\[
\mathrm{var}(s)_{ID,L} = \sum_{a,b,c,d,e,f \in \mathcal{A}} \left( s(c_{abc}, c_{def})^2\, K \right) - M^2
\]


Sampled vs. calculated scores

[Figure: frequency vs. score (0 to 200).]

- 10,000 alignments sampled with Markov method (black bars)
- Calculated distribution (red line)


Conclusions and outlook

- Comparative detection of coding potential is a useful feature
- Available methods are not perfect
- Considering amino acid similarity significantly improves accuracy compared to simply counting synonymous substitutions
- A simple and properly normalized score outperforms the other tested methods.
- The score allows direct calculation of a P-Value.
- Include
  - stop codons
  - gaps (frameshifts)
  - local search?
- Extension to multiple alignments
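To make the significance estimate from the talk more concrete, here is a minimal Python sketch of two of its ingredients: the weights m_d for codon pairs with d differences at a given pairwise identity, and a P value under the normal approximation. The function names are hypothetical. Treating the P value as a one-sided upper tail, and scaling the per-codon variance by L (which assumes independent codon pairs), are assumptions that the slides do not spell out.

```python
import math


def distance_weights(identity):
    """m_d for d = 0..3: probability that a codon pair differs at d of its 3 positions."""
    p, q = identity, 1.0 - identity
    return [p ** 3, 3 * p * p * q, 3 * p * q * q, q ** 3]


def p_value(score, n_codons, mean_per_codon, var_per_codon):
    """Upper-tail P value of an observed score under a normal approximation.

    mean_per_codon corresponds to M and var_per_codon to var(s) in the slides.
    Assumption: codon pairs are independent, so the variance of the summed
    score is n_codons * var_per_codon.
    """
    mean = n_codons * mean_per_codon
    std = math.sqrt(n_codons * var_per_codon)
    z = (score - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

The four weights always sum to 1, since they are the terms of the binomial expansion of (ID + (1 − ID))^3. A score equal to the expected mean gives a P value of 0.5, and larger scores give smaller P values.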
