Sequences statistics: exercises
(1) How many oligomers contain exactly one occurrence of each monomer, (a) for oligonucleotides and (b) for oligopeptides?
(2) Let's consider the following sequence of 10 nucleotides:
TGCGTTACGG
(a) How many permutations of 2 monomers can be done?
(b) How many permutations of 2 monomers can be done without changing the sequence?
(3) For an arbitrary DNA sequence of length L, how many permutations of 2 monomers can be done without changing the sequence?
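The counting behind this slide can be sketched in Python. This is only one possible reading: it assumes that a "permutation of 2 monomers" means swapping the letters at two positions, and that a swap leaves the sequence unchanged exactly when both positions hold the same letter. The helper name swap_counts is hypothetical.

```python
from collections import Counter
from math import comb, factorial

def swap_counts(seq):
    """Return (number of position pairs, number of pairs holding the same letter)."""
    total = comb(len(seq), 2)
    # Assumption: a swap is "silent" when both positions carry the same letter.
    silent = sum(comb(n, 2) for n in Counter(seq).values())
    return total, silent

# Orderings that use each monomer exactly once (4 nucleotides, 20 amino acids).
orderings_nt = factorial(4)
orderings_aa = factorial(20)

total_swaps, silent_swaps = swap_counts("TGCGTTACGG")
print(orderings_nt, orderings_aa, total_swaps, silent_swaps)
```

For a sequence of length L the same function gives L(L-1)/2 pairs in total, and the silent count depends only on the composition.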
Sequences statistics: exercises
(1) How many distinct DNA sequences of length 4 can I build?
(2) How many distinct DNA sequences of length L can I build?
(3) How many distinct DNA sequences with n_A occurrences of "A", n_C occurrences of "C", n_G occurrences of "G", and n_T occurrences of "T" can I build?
(4) Here is a particular DNA sequence of length 10:
GATGCTGGCG
(a) What is the probability that a randomly generated sequence (with equal probability for each nucleotide) is identical to the given sequence?
(b) What is the probability that a randomly ordered sequence (with same composition as the given one) is identical to the given sequence?
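A minimal sketch of the two counting models, assuming independent, equiprobable nucleotides for the "randomly generated" case and a uniformly random shuffle of the fixed composition for the "randomly ordered" case. The function name count_arrangements is hypothetical.

```python
from collections import Counter
from math import factorial

def count_arrangements(seq):
    """Multinomial coefficient L! / (n_A! n_C! n_G! n_T!) for the composition of seq."""
    n = factorial(len(seq))
    for c in Counter(seq).values():
        n //= factorial(c)
    return n

seq = "GATGCTGGCG"
n_sequences = 4 ** len(seq)               # all DNA sequences of this length
n_same_composition = count_arrangements(seq)

p_random = 1 / n_sequences                # equiprobable nucleotides
p_shuffled = 1 / n_same_composition       # random order, same composition
print(n_sequences, n_same_composition, p_random, p_shuffled)
```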
Sequences statistics: exercises
Consider 2 strings of 20 nucleotides, randomly generated (with equal probability for each nucleotide).
Example:
TGATTGACTATGCTTTACCG
GCGTATGCGGTTAATGTCGA
(1) What is the probability that the first nucleotide is identical?
(2) What is the probability that the first 8 nucleotides are identical?
(3) What is the probability that only the first 8 nucleotides are identical?
(4) What is the probability that (exactly) 8 nucleotides are identical?
(5) What is the probability that at least 8 nucleotides are identical?
(6) What is the expected number of identical nucleotides?
(7) What is the standard deviation of the probability distribution of the number of identical nucleotides?
(8) What is the probability of having at least x identical residues in two strings of L nucleotides?
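These questions can be explored with a binomial model. The sketch below assumes that positions are independent and that two random nucleotides match with probability 1/4. The helper names binom_pmf and binom_tail are hypothetical.

```python
from math import comb, sqrt

def binom_pmf(k, n, p):
    """P(exactly k matches among n independent positions)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(x, n, p):
    """P(at least x matches among n independent positions)."""
    return sum(binom_pmf(k, n, p) for k in range(x, n + 1))

n, p = 20, 0.25
p_first_8 = p**8                        # first 8 match, rest unconstrained
p_only_first_8 = p**8 * (1 - p)**12     # first 8 match, remaining 12 all differ
p_exactly_8 = binom_pmf(8, n, p)
p_at_least_8 = binom_tail(8, n, p)
mean = n * p
sd = sqrt(n * p * (1 - p))
print(p_first_8, p_only_first_8, p_exactly_8, p_at_least_8, mean, sd)
```

Calling binom_tail(x, L, 0.25) covers the general case of at least x matches in strings of length L.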
Sequences statistics: exercises
A sequence of 7 independent letters is drawn from the uniform distribution on the DNA alphabet (A, C, G, T).
(1) What is the probability p1 of getting exactly the sequence GATTACA?
(2) Suppose now that the letters are still drawn independently, but that their common distribution is C/G richer: P(C)=P(G)=0.275, P(A)=P(T)=0.225.
What is the probability p2 of getting GATTACA? What is p2/p1?
(3) Do the answers change if the given sequence is not GATTACA but another specific sequence with three A's, one C, one G and two T's?
(4) Suppose now that we are given a specific sequence of length 700, with 300 A's, 100 C's, 100 G's and 200 T's. If p′1 is the probability of getting this sequence with the uniform distribution of part (1), and p′2 is the probability of getting this sequence with the C/G richer distribution of part (2), what is p′2/p′1? Compare the value with the result of part (2).
Source: S. Muff, lecture notes
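Because the letters are independent, the probability of a specific sequence depends only on its composition. The sketch below works with logarithms so that the length-700 case does not underflow. The names log_prob, short and long_seq are hypothetical.

```python
from math import exp, log

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
gc_rich = {"A": 0.225, "C": 0.275, "G": 0.275, "T": 0.225}

def log_prob(counts, dist):
    """Log-probability of one specific sequence with the given letter counts."""
    return sum(n * log(dist[base]) for base, n in counts.items())

short = {"A": 3, "C": 1, "G": 1, "T": 2}             # composition of GATTACA
long_seq = {"A": 300, "C": 100, "G": 100, "T": 200}  # the length-700 sequence

ratio_short = exp(log_prob(short, gc_rich) - log_prob(short, uniform))
log_ratio_long = log_prob(long_seq, gc_rich) - log_prob(long_seq, uniform)
print(ratio_short, log_ratio_long, exp(log_ratio_long))
```

Since the long composition is exactly 100 times the short one, the long ratio is the short ratio raised to the power 100.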
Sequences statistics: exercises
Let's consider a "query" DNA sequence:
CTGCGGCGTAGGTCATTGCTAGCTTCGTCTAGC
and a "database" DNA sequence:
CTGCCGCGAAGGCATCCGCTAGATTCGTCTGGC
(1) What is the probability P100 that the 2 sequences are identical (100% id.)?
(2) What is the probability P80 that the 2 sequences share at least 80% identity?
If the database contains N = 10^6 sequences (each the same length as my query sequence):
(3) How many sequences (E100) identical to my query sequence can I expect to find?
(4) How many sequences (E80) can I expect to find with at least 80% identity?
(5) What is the probability of finding 10 sequences with 100% id.? With 80% id.?
(6) What is the probability of finding x sequences with 100% id.? With 80% id.?
(7) What is the probability of finding at least x sequences with 100% id.? With 80% id.?
(8) What is the probability of finding at least 1 sequence with 100% id.? With 80% id.?
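One way to set up this exercise numerically is sketched below. It assumes equiprobable, independent nucleotides, reads the database size as N = 10^6, and approximates the number of hits by a Poisson distribution with mean E. These modelling choices are assumptions, not part of the slide, and the helper names are hypothetical.

```python
from math import ceil, comb, expm1

def binom_tail(x, n, p):
    """P(at least x matching positions out of n)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

query = "CTGCGGCGTAGGTCATTGCTAGCTTCGTCTAGC"
L = len(query)
N = 10**6                      # assumed database size

p100 = 0.25 ** L               # all positions match
k80 = ceil(0.8 * L)            # smallest match count reaching 80% identity
p80 = binom_tail(k80, L, 0.25)

E100 = N * p100                # expected number of identical database sequences
E80 = N * p80                  # expected number with at least 80% identity

def prob_at_least_one(expected):
    """Poisson approximation: P(at least one hit) = 1 - exp(-E)."""
    return -expm1(-expected)

print(L, k80, p100, p80, E100, E80, prob_at_least_one(E80))
```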
Sequences statistics: exercises
Consider 2 strings of nucleotides, randomly generated (with equal probability for each nucleotide).
Example:
TGACTGACTATGCTTTACCG...
GCGTATGCGGTTAATGTCGA...
(1) How many mismatches can I expect before encountering the first match?
(2) What is the probability distribution P(k) of the number of mismatches (k=0, 1, 2, 3) before the match occurs?
(3) What is the probability distribution P(k) of the number of matches before the first mismatch?
Source: S. Muff, lecture notes
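The waiting-time questions on this slide fit a geometric distribution. The sketch below assumes independent positions with match probability 1/4. The function names are hypothetical.

```python
def geometric_pmf(k, p_stop):
    """P(exactly k non-stopping positions before the first stopping position)."""
    return (1 - p_stop) ** k * p_stop

def geometric_mean(p_stop):
    """Expected number of non-stopping positions before the first stop."""
    return (1 - p_stop) / p_stop

p_match = 0.25
# Mismatches before the first match: the "stop" event is a match.
mismatches_before_match = [geometric_pmf(k, p_match) for k in range(4)]
# Matches before the first mismatch: the "stop" event is a mismatch.
matches_before_mismatch = [geometric_pmf(k, 1 - p_match) for k in range(4)]

print(geometric_mean(p_match), mismatches_before_match)
print(geometric_mean(1 - p_match), matches_before_mismatch)
```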
Sequences statistics: exercises
Let's consider a DNA sequence:
...TTGTACATCTCTATCTACTTATCGTCTAGCAGCAGC
TACTGATCACGTGCTCGTGATCCTAGTCATTCATGCTAC
TATCGATGCAGTCGATCGTAATCGGCGTAGTAGCGC...
(1) I am looking for the motif CACGTG in a given sequence. I have found
this motif at position 524 in my sequence. What was the probability to find
this motif by chance?
(2) I am looking for the motif CACGTG in a whole genomic region (length
100000). I have found this motif 34 times. Is it significant?
(a) What is the expected number of occurrences E of this motif by chance?
(b) What is the probability to observe 34 occurrences of the motif (while
expecting E)?
(c) What is the probability (p-value) to observe at least 34 occurrences of
the motif?
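A hedged sketch of the significance calculation, assuming a background of independent, equiprobable nucleotides, counting on a single strand, ignoring dependence between overlapping positions, and approximating the occurrence count by a Poisson distribution. The helper names poisson_pmf and poisson_tail are hypothetical.

```python
from math import exp, lgamma, log

def poisson_pmf(k, lam):
    """P(exactly k events) for a Poisson distribution with mean lam."""
    return exp(k * log(lam) - lam - lgamma(k + 1))

def poisson_tail(x, lam):
    """P(at least x events)."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(x))

motif = "CACGTG"
region = 100_000

p_site = 0.25 ** len(motif)               # probability at one given position
E = (region - len(motif) + 1) * p_site    # expected occurrences in the region
p_34 = poisson_pmf(34, E)
p_value = poisson_tail(34, E)
print(p_site, E, p_34, p_value)
```

Under these assumptions E comes out at roughly 24.4, so the question is how far 34 lies in the upper tail.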
Sequences statistics: exercises
Let's consider a DNA sequence:
...TTGTACATCTCTATCTACTTATCGTCTAGCAGCAGC
TACTGATCACGTGCTCGTGATCCTAGTCATTCATGCTAC
TATCGATGCAGTCGATCGTAATCGGCGTAGTAGCGC...
(1) To account for the variability (uncertainty) of a nucleotide at a given
position, we sometimes use the IUPAC alphabet. In this notation, the letter
"Y" stands for "C or T" and the letter "D" stands for "A or G or T". Answer
the same questions as before for the following motifs:
CAYGTG (= CACGTG or CATGTG)
and
CAYGTD (= CACGTG or CATGTG or ...)
(2) We can also look for motifs that present some mismatches with respect
to the reference. Answer the same questions as before for motifs that
present at most one mismatch (e.g. CACGTA or CATGTG), assuming that we
observe 684 occurrences of the motif.
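Degenerate motifs change only the per-position probability. The sketch below assumes equiprobable, independent nucleotides and the region length of 100000 from the previous slide. The IUPAC table lists only the symbols used here, and the function names are hypothetical.

```python
# Subset of the IUPAC nucleotide code needed for these motifs.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "D": "AGT"}

def motif_prob(motif):
    """Probability that a random word matches the (possibly degenerate) motif."""
    p = 1.0
    for symbol in motif:
        p *= len(IUPAC[symbol]) / 4
    return p

def words_within_one_mismatch(k):
    """Number of k-mers at Hamming distance 0 or 1 from a fixed k-mer."""
    return 1 + 3 * k

def expected_count(prob, motif_len=6, region=100_000):
    """Expected occurrences over all start positions in the region."""
    return (region - motif_len + 1) * prob

p_y = motif_prob("CAYGTG")
p_yd = motif_prob("CAYGTD")
p_mm = words_within_one_mismatch(6) / 4**6
print(expected_count(p_y), expected_count(p_yd), expected_count(p_mm))
```

The resulting expected counts can then be compared with the observed counts using the same tail calculation as for the exact motif.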
Sequences statistics: exercises
Let's consider a DNA sequence which evolves (and gets mutated) over time:
Example (successive sequences are separated by one time step τ; time runs from top to bottom):
TGACTGACTATGCTTTACCA
TGACTCACTAAGCTTCATCG
TGATTCAATAAGCTTAATCG
...
(1) Assume that at each generation (time τ), the sequence undergoes on average 5 mutations.
(a) What is the probability to observe exactly 25 mutations after a time lapse Δt = 10 τ?
(b) What is the probability to observe at least 25 mutations after a time lapse Δt = 10 τ?
(2) More generally what is the probability to observe x mutations in a certain time window (i.e. what is the probability distribution of x)?
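If mutations are assumed to occur independently at a constant rate, the number observed in a time window follows a Poisson distribution whose mean is rate × time. The sketch below applies that assumption. The names are hypothetical.

```python
from math import exp, lgamma, log

def poisson_pmf(x, lam):
    """P(exactly x mutations) when the expected number is lam."""
    return exp(x * log(lam) - lam - lgamma(x + 1))

def poisson_tail(x, lam):
    """P(at least x mutations)."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(x))

rate_per_tau = 5       # mean mutations per generation
window = 10            # time lapse in units of tau
lam = rate_per_tau * window

p_exactly_25 = poisson_pmf(25, lam)
p_at_least_25 = poisson_tail(25, lam)
print(lam, p_exactly_25, p_at_least_25)
```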
Sequences statistics: exercises
Assume that a genome of 3 billion base pairs undergoes mutations at a rate of 75 mutations per generation.
(1) The gene of hemoglobin is roughly 3000 bp long. How many mutations would we expect in this gene after one generation?
(2) What is the probability of observing (a) no mutation, (b) exactly 1 mutation, and (c) more than 1 mutation in this gene after one generation?
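A minimal sketch of the calculation, assuming mutations are spread uniformly over the genome and that the count falling in the gene is Poisson distributed. The variable names are hypothetical.

```python
from math import exp, expm1

genome_bp = 3_000_000_000
mutations_per_generation = 75
gene_bp = 3000

# Expected mutations in the gene: genome-wide rate scaled by the gene's share.
lam = mutations_per_generation * gene_bp / genome_bp

p_none = exp(-lam)
p_one = lam * exp(-lam)
# 1 - P(0) - P(1), written with expm1 to limit rounding error for tiny lam.
p_more_than_one = -expm1(-lam) - p_one
print(lam, p_none, p_one, p_more_than_one)
```

For such a small mean, P(more than 1) is close to lam²/2, which is a quick sanity check on the numbers.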