ORF distribution and statistics
ORF = Open Reading Frame
• An ORF starts with a methionine (start) codon (AUG) and ends with a stop codon (UAA, UAG, UGA).
• Stop codons occur frequently by chance in non-coding sequence, so long ORFs are unlikely there. Therefore, the most common method of finding protein-coding regions in DNA is to look for long ORFs in the genomic sequence.
Example (from Wikipedia): sample sequence showing 3 different possible reading frames. Start codons are highlighted in purple, and stop codons are highlighted in red.
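The idea of scanning reading frames for start and stop codons can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular tool: the function name `find_orfs`, the `min_codons` parameter and the example sequence are assumptions made here, and the scan works on the DNA alphabet (ATG, TAA, TAG, TGA) for the three forward frames only.

```python
# Minimal ORF scan over the three forward reading frames of a DNA string.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA equivalents of UAA, UAG, UGA
START_CODON = "ATG"                  # DNA equivalent of AUG


def find_orfs(dna, min_codons=1):
    """Return (frame, start, end, codons) tuples; end is exclusive and includes the stop."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons from the start codon up to, not including, the stop
                if n_codons >= min_codons:
                    orfs.append((frame, start, i + 3, n_codons))
                start = None
    return orfs


example = "CCATGAAATGGTAACATGCCCTGA"  # hypothetical toy sequence
for orf in find_orfs(example):
    print(orf)
```

An ATG that is never followed by an in-frame stop codon is not reported, since the sketch only counts ORFs that are closed by a stop.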
Exercise
Assume that a DNA sequence is randomly generated (with equal probabilities for each base):
...GTGAGGCTACTGCGCTATTCGGCGTATGCCGCTATTCCGTATTCGTTCATG...
• The probability that a randomly selected codon is a stop codon is ...
• A stop codon is thus expected to occur every ... bp.
• The expected ORF length is thus ...
• The probability that an ORF has a length of exactly 100 is ...
• The probability that an ORF has a length of at least 100 is ...
• In general, the probability P(length = L) that an ORF has length L is given by ...
• The probability P(length ≥ L) that an ORF has a length greater than or equal to L is given by ...
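One way to check the first answers of this exercise is to enumerate all 64 codons and count the stop codons. The sketch below assumes the uniform-base model stated in the exercise; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Under the uniform-base assumption all 4^3 = 64 codons are equally likely.
all_codons = ["".join(c) for c in product("ACGT", repeat=3)]
p_stop = Fraction(sum(c in STOP_CODONS for c in all_codons), len(all_codons))

# Mean of a geometric distribution: on average one stop every 1/p codons.
codons_between_stops = 1 / p_stop
bp_between_stops = 3 * codons_between_stops

print(p_stop)                       # 3/64
print(float(codons_between_stops))  # about 21.3 codons
print(bp_between_stops)             # 64
```

Exact fractions are used so that the result is not blurred by floating-point rounding.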
ORF length distribution
ORF lengths follow a geometric distribution:
$$P(\text{length} = L) = (1-p)^{L-1}\,p \qquad \text{with } p = 3/64$$
$$P(\text{length} \ge L) = \sum_{k=L}^{L_{\max}} (1-p)^{k-1}\,p = 1 - \sum_{k=1}^{L-1} (1-p)^{k-1}\,p$$
This is the p-value of an ORF of length L.
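The partial sum in the p-value can be simplified with the finite geometric series. The short derivation below is a sketch that assumes Lmax is large enough for the tail beyond Lmax to be negligible (formally, Lmax → ∞):

```latex
\begin{align*}
\sum_{k=1}^{L-1} (1-p)^{k-1}\,p
  &= p \cdot \frac{1-(1-p)^{L-1}}{1-(1-p)}
   = 1-(1-p)^{L-1} \\
P(\text{length} \ge L)
  &= 1 - \sum_{k=1}^{L-1} (1-p)^{k-1}\,p
   = (1-p)^{L-1}
\end{align*}
```

In words: an ORF reaches length L exactly when the first L−1 codons are all non-stop codons.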
ORF length statistics
The probability (p-value) that an ORF has a length of at least 100 is thus given by:
$$P(\text{length} \ge 100) = 1 - \sum_{k=1}^{99} \left(1 - \frac{3}{64}\right)^{k-1} \frac{3}{64} = \left(\frac{61}{64}\right)^{99} \approx 0.0086$$
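The sum above is easy to evaluate numerically. The following sketch computes it term by term and compares it with the closed form (1 − p)^99 of the geometric tail; the function name `p_value` is an assumption made for this illustration, and lengths are counted in codons.

```python
import math

p = 3 / 64  # per-codon stop probability under the uniform-base assumption


def p_value(length, p_stop=p):
    """P(length >= L), evaluated as one minus the partial sum from the text."""
    return 1.0 - sum((1 - p_stop) ** (k - 1) * p_stop for k in range(1, length))


partial_sum_form = p_value(100)
closed_form = (1 - p) ** 99  # geometric tail: the first 99 codons are all non-stop

print(partial_sum_form, closed_form)
print(math.isclose(partial_sum_form, closed_form, rel_tol=1e-9))
```

Both expressions should agree up to floating-point rounding, which makes the closed form a convenient shortcut for other length thresholds.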
Exercise
Assume that we analyse a genomic DNA sequence of length Lmax = 1,000,000 nucleotides (with equal probability for each nucleotide).
• How many start codons can I expect to find in a single reading frame?
• How many start codons can I expect to find considering all reading frames?
• How many ORFs will I then analyse?
• How many ORFs can I expect with a length larger than 100?
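A rough back-of-the-envelope calculation for this exercise might look as follows. It rests on several simplifying assumptions that are not spelled out in the exercise: six reading frames (three per strand), every ATG counted as the start of one candidate ORF (nested start codons are not merged), and the threshold applied as "at least 100 codons" using the geometric tail.

```python
L_MAX = 1_000_000   # sequence length in nucleotides (from the exercise)
P_START = 1 / 64    # probability that a random codon is ATG
P_STOP = 3 / 64     # probability that a random codon is a stop codon
N_FRAMES = 6        # assumption: 3 forward + 3 reverse-complement frames

codons_per_frame = L_MAX // 3
starts_one_frame = codons_per_frame * P_START
starts_all_frames = N_FRAMES * starts_one_frame

# Assumption: each start codon opens one candidate ORF; its chance of running
# for at least 100 codons without a stop is the geometric tail (1 - p)^99.
tail_100 = (1 - P_STOP) ** 99
long_orfs_by_chance = starts_all_frames * tail_100

print(round(starts_one_frame))   # 5208
print(round(starts_all_frames))  # 31250
print(long_orfs_by_chance)
```

Even with a small per-ORF p-value, the number of candidates is large enough that some long ORFs are expected purely by chance.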
ORF length statistics
Since in reality the nucleotides / codons do not have equal probabilities, some deviations from the geometric distribution are expected.
Codon usage in E. coli in coding regions.
Maloy S, Stewart V, Taylor R (1996) Genetic analysis of pathogenic bacteria. Cold Spring Harbor Laboratory Press, NY.
Here are the ORF length distributions in E. coli in coding and non-coding regions (figure panels: coding region, non-coding region).
Source: Zvelebil & Baum (2008)
Example of ORF detection in E. coli
ORF map of a portion of the E. coli lac operon, produced with the DNA STRIDER program (Marck, Nucl Acids Res 1988). The map shows the AUG (START) codons and the STOP codons in all 6 reading frames. The lacZ, lacY and lacA genes are visible as long ORFs in frame 3: the lacZ gene runs from position 1284 to 4355 and is followed by the 2 shorter genes, lacY and lacA.
Source: Mount (2000)
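A text version of such a six-frame map can be sketched as follows. The frame labels (+1 to +3, -1 to -3), the helper names and the toy sequence are assumptions for this illustration and do not reproduce the frame numbering used by DNA STRIDER.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Yield (label, codon list) for the 3 forward and 3 reverse-complement frames."""
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
            yield f"{strand}{offset + 1}", codons


# Print one line per frame, marking start and stop codons.
for label, codons in six_frames("ATGAAATAGCAT"):  # hypothetical toy sequence
    marks = ["START" if c == "ATG" else "STOP" if c in STOP_CODONS else "." for c in codons]
    print(label, " ".join(marks))
```

Long runs without a STOP mark in one frame correspond to the long ORFs that stand out in the map.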
ORF Finder (NCBI): http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Example of ORF detection in Drosophila
Which of these "long" ORFs are good gene candidates?
Limitations of "long ORF"-based gene detection
• Where to set the threshold (multiple testing!)?
• Overlapping ORFs
• Nucleotide frequency and codon usage
• Sequencing errors
  Error rates of NGS sequencing range from 0.1% to 0.6%, depending on the platform and the depth of coverage.
  Wall JD et al (2014) Estimating genotype error rates from high-coverage next-generation sequence data. Genome Res 24:1734-9.
• ATG is not the only possible start site (e.g. CTG, TTG)
  Alternate start codons are still translated as Met when they are at the start of a protein (even if the codon encodes a different amino acid otherwise). This is because a separate transfer RNA (tRNA) is used for initiation.
  Lobanov AV et al (2010) Dual functions of codons in the genetic code. Crit Rev Biochem Molec Biol 45:257-65.
Solution => look for additional information
Adapted from D. Subramanian, lecture notes 2009
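To illustrate the multiple-testing point, one possible (and deliberately simple) way to set the threshold is a Bonferroni correction on the geometric p-value. This is a sketch under the uniform-base model, not a recommendation from the lecture: the function name, the significance level of 0.05 and the number of tested ORFs (31,250) are all illustrative assumptions.

```python
import math


def min_significant_length(n_orfs, alpha=0.05, p_stop=3 / 64):
    """Smallest L (in codons) with P(length >= L) = (1 - p_stop)**(L - 1) <= alpha / n_orfs."""
    per_test_alpha = alpha / n_orfs  # Bonferroni-corrected significance level
    return 1 + math.ceil(math.log(per_test_alpha) / math.log(1 - p_stop))


single = min_significant_length(1)     # threshold if only one ORF is tested
many = min_significant_length(31_250)  # threshold for a hypothetical number of candidate ORFs
print(single, many)
```

The threshold grows only logarithmically with the number of ORFs tested, but it still moves well beyond the single-test value, and it ignores the other limitations listed above (codon usage, alternative start codons, sequencing errors).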