ORF distribution and statistics

ORF = Open Reading Frame

§ An ORF starts with a methionine (start) codon (AUG) and ends with a stop codon (UAA, UAG, UGA).
§ Therefore, the most common method of finding protein-coding regions in DNA is to look for long ORFs in the genomic sequence.

Example

[Figure: sample sequence showing 3 different possible reading frames. Start codons are highlighted in purple, and stop codons are highlighted in red. Example from Wikipedia.]

Exercise

Assume that a DNA sequence is randomly generated (with equal probabilities for each base):

...GTGAGGCTACTGCGCTATTCGGCGTATGCCGCTATTCCGTATTCGTTCATG...

• The probability that a randomly selected codon is a stop codon is ...
• A stop codon is thus expected to occur every ... bp.
• The expected ORF length is thus ...
• The probability that an ORF has a length of 100 is ...
• The probability that an ORF has a length of at least 100 is ...
• In general, the probability P(L) that an ORF has length L is given by ...
• The probability that an ORF has a length greater than or equal to L is given by ...

ORF length distribution

The ORF length L (in codons) follows a geometric distribution:

$$P(\text{length} = L) = (1-p)^{L-1}\,p \quad \text{with} \quad p = \frac{3}{64}$$

$$P(\text{length} \ge L) = \sum_{k=L}^{L_{\max}} (1-p)^{k-1}\,p \approx 1 - \sum_{k=1}^{L-1} (1-p)^{k-1}\,p = (1-p)^{L-1}$$

The approximation neglects the tail beyond L_max, which is negligible when L_max is large. This is the p-value of an ORF length.

ORF length statistics

The probability (p-value) that an ORF has a length of at least 100 is thus given by:

$$P(\text{length} \ge 100) = 1 - \sum_{k=1}^{99} \left(1-\frac{3}{64}\right)^{k-1} \frac{3}{64} = \left(\frac{61}{64}\right)^{99} \approx 0.0086$$

Exercise

Assume that we analyse a genomic DNA sequence of length Lmax = 1,000,000 nucleotides (with equal probability for each nucleotide).

• How many start codons can I expect to find in a single reading frame?
• How many start codons can I expect to find considering all reading frames?
• How many ORFs will I then analyse?
• How many ORFs can I expect with a length larger than 100?
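
The quantities in the slides above can be checked numerically. The following is a minimal sketch, assuming equiprobable bases (so each codon is a stop codon with probability 3/64 and ATG with probability 1/64) and ORF length measured in codons. The function names are illustrative and do not come from the lecture.

```python
from fractions import Fraction

# Assumption: equiprobable bases, so 3 of the 64 codons are stop codons.
P_STOP = Fraction(3, 64)


def p_length_eq(length: int, p: Fraction = P_STOP) -> Fraction:
    """P(length = L): L-1 non-stop codons followed by one stop codon."""
    return (1 - p) ** (length - 1) * p


def p_length_ge(length: int, p: Fraction = P_STOP) -> Fraction:
    """P(length >= L): closed form of the geometric tail, (1-p)^(L-1)."""
    return (1 - p) ** (length - 1)


def expected_start_codons(n_nucleotides: int, n_frames: int = 1) -> float:
    """Expected number of ATG codons, assuming P(codon = ATG) = 1/64."""
    codons_per_frame = n_nucleotides // 3
    return n_frames * codons_per_frame / 64


if __name__ == "__main__":
    # Tail probability computed by the explicit sum and by the closed form.
    tail_by_sum = 1 - sum(p_length_eq(k) for k in range(1, 100))
    print(float(tail_by_sum), float(p_length_ge(100)))
    # Expected start codons in one reading frame and in all six.
    print(expected_start_codons(1_000_000, 1))
    print(expected_start_codons(1_000_000, 6))
```

Exact fractions are used so that the explicit sum and the closed form can be compared without rounding error. Multiplying the expected number of start codons by `p_length_ge(100)` gives one rough estimate of how many long ORFs to expect by chance, under the additional assumption that every start codon opens an ORF.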
ORF length statistics

Since in reality the nucleotides / codons do not have equal probabilities, some deviations from the geometric distribution are expected.

[Figure: codon usage in E. coli in coding regions. Maloy S, Stewart V, Taylor R (1996) Genetic analysis of pathogenic bacteria. Cold Spring Harbor Laboratory Press, NY.]

[Figure: ORF length distributions in E. coli in coding and non-coding regions. Panels: coding region, non-coding region. Source: Zvelebil & Baum (2008).]

Example of ORF detection in E. coli

[Figure: ORF map of a portion of the E. coli lac operon using the DNA STRIDER program (Marck, Nucl Acids Res 1988). Shown are the AUG (START) codons and STOP codons in all 6 reading frames.]

The lacZ/lacY/lacA genes are visible as long ORFs in frame 3: the lacZ gene runs from position 1284 to 4355 and is followed by the 2 shorter genes, lacY and lacA. Source: Mount (2000)

ORF Finder (NCBI): http://www.ncbi.nlm.nih.gov/gorf/gorf.html

Example of ORF detection in Drosophila

Which of those "long" ORFs are good gene candidates?

Limitations of "long ORF"-based gene detection

• Where to set the threshold (multiple testing!)?
• Overlapping ORFs
• Nucleotide frequency and codon usage
• Sequencing errors
  Error rates of NGS sequencing range from 0.1% to 0.6%, depending on the platform and the depth of coverage. Wall JD et al (2014) Estimating genotype error rates from high-coverage next-generation sequence data. Genome Res 24:1734-9.
• ATG is not the only possible start site (e.g. CTG, TTG)
  Alternate start codons are still translated as Met when they are at the start of a protein (even if the codon otherwise encodes a different amino acid). This is because a separate transfer RNA (tRNA) is used for initiation. Lobanov AV et al (2010) Dual functions of codons in the genetic code. Crit Rev Biochem Molec Biol 45:257-65.

Solution => look for additional information

Adapted from D. Subramanian, lecture notes 2009
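
To make the "long ORF" approach and two of its limitations concrete, here is a minimal sketch of an ORF scan over all 6 reading frames. It is not the algorithm used by DNA STRIDER or NCBI ORF Finder. The function `find_orfs` and its parameters `min_codons` and `start_codons` are hypothetical names chosen for illustration. The sketch assumes an uppercase or lowercase DNA string containing only A, C, G and T, reports the first start codon after the previous stop in each frame, and gives coordinates on the scanned strand (for the "-" strand, positions refer to the reverse complement).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100, start_codons=("ATG",)):
    """Return (strand, frame, start, end, n_codons) for each ORF found.

    start/end are 0-based, end-exclusive positions on the scanned strand.
    n_codons counts the start and the stop codon.
    ORFs still open at the end of the sequence are not reported.
    """
    seq = seq.upper()
    starts = set(start_codons)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None  # position of the current open start codon
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None:
                    if codon in starts:
                        orf_start = i
                elif codon in STOP_CODONS:
                    n_codons = (i + 3 - orf_start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, orf_start, i + 3, n_codons))
                    orf_start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    # Toy sequences, not real genomic data.
    print(find_orfs("CCATGAAATTTTAGCC", min_codons=3))
    print(find_orfs("CTGAAATAA", min_codons=1,
                    start_codons=("ATG", "CTG", "TTG")))
```

Exposing `min_codons` and `start_codons` as parameters mirrors the open questions in the list above: the result depends directly on where the length threshold is set and on which codons are accepted as start sites.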