ORF distribution and statistics

ORF = Open Reading Frame

• An ORF starts with a methionine (start) codon (AUG) and ends with a stop codon (UAA, UAG, UGA).
• The most common method of finding protein-coding regions in DNA is therefore to look for long ORFs in the genomic sequence.

Example

[Figure: sample sequence showing 3 different possible reading frames. Start codons are highlighted in purple, and stop codons are highlighted in red. Example from Wikipedia.]

Exercise

Assume that a DNA sequence is randomly generated (with equal probability for each base):

...GTGAGGCTACTGCGCTATTCGGCGTATGCCGCTATTCCGTATTCGTTCATG...

• The probability that a randomly selected codon is a stop codon is ...
• A stop codon is thus expected to occur every ... bp.
• The expected ORF length is thus ...
• The probability that an ORF has a length of 100 is ...
• The probability that an ORF has a length of at least 100 is ...
• In general, the probability P(L) that an ORF has length L is given by ...
• The probability that an ORF has a length greater than or equal to L is given by ...

ORF length distribution

The ORF length follows a geometric distribution:

\[ P(\text{length} = L) = (1-p)^{L-1}\,p \quad \text{with } p = 3/64 \]

\[ P(\text{length} \ge L) = \sum_{k=L}^{L_{\max}} (1-p)^{k-1}\,p = 1 - \sum_{k=1}^{L-1} (1-p)^{k-1}\,p \]

This is the p-value of an ORF length.

ORF length statistics

The probability (p-value) that an ORF has a length of at least 100 is thus given by:

\[ P(\text{length} \ge 100) = 1 - \sum_{k=1}^{99} \left(1-\frac{3}{64}\right)^{k-1} \left(\frac{3}{64}\right) = \left(\frac{61}{64}\right)^{99} \approx 0.0086 \]

Exercise

Assume that we analyse a genomic DNA sequence of length Lmax = 1,000,000 nucleotides (with equal probability for each nucleotide).

• How many start codons can I expect to find in a single reading frame?
• How many start codons can I expect to find considering all reading frames?
• How many ORFs will I then analyse?
• How many ORFs can I expect with a length larger than 100?
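The geometric model can be evaluated numerically. The following is a minimal Python sketch, not part of the original slides. It assumes that ORF length is counted in codons with the terminating stop codon included, that all 64 codons are equally likely, and that ATG is the only start codon (probability 1/64). The function names are illustrative.

```python
from fractions import Fraction

P_STOP = Fraction(3, 64)   # 3 stop codons among 64 equiprobable codons
P_START = Fraction(1, 64)  # assumption: ATG is the only start codon


def prob_length_eq(length: int, p: Fraction = P_STOP) -> Fraction:
    """P(length = L): L-1 non-stop codons followed by one stop codon."""
    return (1 - p) ** (length - 1) * p


def prob_length_ge(length: int, p: Fraction = P_STOP) -> Fraction:
    """P(length >= L): closed form of the geometric tail, (1 - p)**(L - 1)."""
    return (1 - p) ** (length - 1)


def expected_start_codons(n_nucleotides: int, n_frames: int) -> float:
    """Expected number of ATG codons: about n/3 codons per reading frame,
    each of which is ATG with probability 1/64."""
    return float(n_frames * (n_nucleotides // 3) * P_START)


# The closed form agrees exactly with the sum written on the slide.
assert prob_length_ge(100) == 1 - sum(prob_length_eq(k) for k in range(1, 100))

print(float(1 / P_STOP))                    # mean ORF length in codons, about 21.3
print(float(prob_length_ge(100)))           # about 0.0086
print(expected_start_codons(1_000_000, 1))  # about 5208 in one reading frame
```

Exact fractions are used so that the closed form and the explicit sum can be compared without rounding error. Passing `n_frames=6` treats "all reading frames" as three frames on each strand, which is one possible reading of the exercise.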
ORF length statistics

Since in reality the nucleotides and codons do not have equal probabilities, some deviation from the geometric distribution is expected.

[Figure: codon usage in E. coli in coding regions. Maloy S, Stewart V, Taylor R (1996) Genetic analysis of pathogenic bacteria. Cold Spring Harbor Laboratory Press, NY.]

[Figure: ORF length distributions in E. coli in coding and non-coding regions. Panels: coding region, non-coding region. Source: Zvelebil & Baum (2008).]

Example of ORF detection in E. coli

[Figure: ORF map of a portion of the E. coli lac operon produced with the DNA STRIDER program (Marck, Nucl Acids Res 1988), showing the AUG (START) and STOP codons in all 6 reading frames.]

The lacZ, lacY and lacA genes are visible as long ORFs in frame 3: the lacZ gene runs from position 1284 to 4355 and is followed by the 2 shorter genes, lacY and lacA. Source: Mount (2000)

ORF Finder (NCBI): http://www.ncbi.nlm.nih.gov/gorf/gorf.html

Example of ORF detection in Drosophila

Which of those "long" ORFs are good gene candidates?

Limitations of "long ORF"-based gene detection

• Where to set the threshold (multiple testing!)?
• Overlapping ORFs
• Nucleotide frequency and codon usage
• Sequencing errors: error rates of NGS sequencing range from 0.1% to 0.6%, depending on the platform and the depth of coverage. Wall JD et al (2014) Estimating genotype error rates from high-coverage next-generation sequence data. Genome Res 24:1734-9.
• ATG is not the only possible start site (e.g. CTG, TTG): alternate start codons are still translated as Met when they are at the start of a protein (even if the codon otherwise encodes a different amino acid). This is because a separate transfer RNA (tRNA) is used for initiation. Lobanov AV et al (2010) Dual functions of codons in the genetic code. Crit Rev Biochem Molec Biol 45:257-65.

Solution => look for additional information

Adapted from D. Subramanian, lecture notes 2009
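To make the "long ORF" approach and two of its limitations concrete, here is a minimal Python sketch of a six-frame ORF scan. It is an illustration only, not the algorithm used by ORF Finder or DNA STRIDER. The function names, the `min_codons` threshold and the `start_codons` parameter are hypothetical, and lengths are counted in codons with the stop codon included.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100, start_codons=("ATG",)):
    """Scan all 6 reading frames and return a list of
    (strand, frame, start, end, n_codons) tuples.

    Coordinates are 0-based offsets on the scanned strand; the ORF covers
    [start, end) and includes its stop codon. Only ORFs closed by a stop
    codon and at least min_codons long are reported.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first start codon since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in start_codons:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i + 3 - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None
    return orfs


# Toy sequences (made up for illustration), with a very low threshold.
print(find_orfs("CCATGAAATAGCC", min_codons=3))
print(find_orfs("CCCTGAAATAGCC", min_codons=3,
                start_codons=("ATG", "CTG", "TTG")))
```

Both the threshold and the set of accepted start codons change which candidates are reported, which is why they are exposed as parameters here. A single sequencing error that creates or destroys a stop codon would likewise split or merge the reported ORFs.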