Bioinformatics II: Probability and Statistics
Exercise Sheet 1
Question 1.1
(a) A sequence of 7 independent letters is drawn from the uniform distribution on the
DNA-alphabet a, c, g, t. What is the probability p1 of getting exactly the sequence
gattaca?
(b) Suppose now that the letters are still drawn independently, but that their common
distribution is c–g richer:
P(c) = P(g) = 0.275,
P(a) = P(t) = 0.225.
What is the probability p2 of getting gattaca? What is p2/p1?
(c) Do the answers to (a) and (b) change if the sequence is not gattaca, but
instead another specific sequence with three a’s, one c, one g and two t’s?
(d) Suppose now that we have been given a specific sequence of length 350, with 150
a’s, 50 c’s, 50 g’s and 100 t’s. If p′1 is the probability of getting this sequence with the
uniform distribution of part (a), and p′2 is the probability of getting this sequence with
the c–g richer distribution of part (b), what is p′2/p′1? Compare with the result of part (b).
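The following is a minimal sketch for checking answers to Question 1.1 numerically; it is not part of the original sheet. It assumes independent letters, so the probability of a specific sequence depends only on its letter counts. The names `UNIFORM`, `CG_RICH`, `log_prob_from_counts` and `ratio` are illustrative choices. Logarithms are used so that the same code also handles the length-350 sequence of part (d) without numerical trouble.

```python
import math
from collections import Counter

# Letter distributions from parts (a) and (b).
UNIFORM = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
CG_RICH = {"a": 0.225, "c": 0.275, "g": 0.275, "t": 0.225}


def log_prob_from_counts(counts, dist):
    """Log-probability of one specific sequence with the given letter counts,
    assuming the letters are drawn independently from `dist`."""
    return sum(n * math.log(dist[letter]) for letter, n in counts.items())


def ratio(counts):
    """Probability under CG_RICH divided by probability under UNIFORM."""
    return math.exp(
        log_prob_from_counts(counts, CG_RICH)
        - log_prob_from_counts(counts, UNIFORM)
    )


# Parts (a) and (b): the sequence gattaca (3 a, 1 c, 1 g, 2 t).
gattaca = Counter("gattaca")
p1 = math.exp(log_prob_from_counts(gattaca, UNIFORM))
p2 = math.exp(log_prob_from_counts(gattaca, CG_RICH))
print("p1 =", p1, " p2 =", p2, " p2/p1 =", ratio(gattaca))

# Part (d): letter counts of the length-350 sequence.
long_counts = {"a": 150, "c": 50, "g": 50, "t": 100}
print("ratio for the length-350 sequence =", ratio(long_counts))
```

Because only the counts enter the calculation, any other sequence with the same composition as gattaca gives the same values, which bears on part (c).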
Question 1.2
The amino acid coding table is given below; the symbol Z is used here for ‘stop’. The
entry C, found in the first (T) block, in the fourth (G) column and second (C) row,
indicates that the DNA triplet TGC codes for the amino acid C (Cysteine).
                Second letter
First letter    T    C    A    G    Third letter
T               F    S    Y    C    T
T               F    S    Y    C    C
T               L    S    Z    Z    A
T               L    S    Z    W    G
C               L    P    H    R    T
C               L    P    H    R    C
C               L    P    Q    R    A
C               L    P    Q    R    G
A               I    T    N    S    T
A               I    T    N    S    C
A               I    T    K    R    A
A               M    T    K    R    G
G               V    A    D    G    T
G               V    A    D    G    C
G               V    A    E    G    A
G               V    A    E    G    G
(a) Suppose that a sequence of three independent letters is chosen from the uniform
distribution on the DNA–alphabet a, c, g, t. What are the probabilities of this sequence
coding for each of the possible amino acids? What is the probability that the sequence
does not code for an amino acid?
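Part (a) can be checked by counting codons. The sketch below is not part of the original sheet. It encodes the coding table as a 64-character string in the order first letter, second letter, third letter (each running over T, C, A, G), with Z for ‘stop’ as in the table. The names `BASES`, `AMINO`, `CODON_TABLE` and `amino_acid_probabilities` are illustrative, and uppercase letters are used for the bases.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

BASES = "TCAG"
# One symbol per codon, in the order TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO = (
    "FFLLSSSSYYZZCCZW"  # first letter T
    "LLLLPPPPHHQQRRRR"  # first letter C
    "IIIMTTTTNNKKSSRR"  # first letter A
    "VVVVAAAADDEEGGGG"  # first letter G
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def amino_acid_probabilities():
    """Probability of each table symbol when all 64 triplets are equally likely."""
    counts = Counter(CODON_TABLE.values())
    return {aa: Fraction(n, 64) for aa, n in counts.items()}


probs = amino_acid_probabilities()
for aa in sorted(probs):
    label = "stop" if aa == "Z" else aa
    print(label, probs[aa])
```

Exact fractions are used instead of floats so that the results can be compared directly with a hand calculation.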
(b) Now suppose that an amino acid is chosen at random from the uniform distribution
on the set of all 20 amino acids, and that a DNA triplet is chosen uniformly at random
from those triplets which code for this amino acid. What is the probability of getting t
at position 1? At position 2? At position 3? At a randomly chosen position?
(c) Sample as in part (b). What is the probability of getting tt at positions 1 and 2?
At positions 2 and 3? Repeat this for the string tg. What is the conditional probability
of getting t at position 2, given that there is a t at position 1? Are the two events ‘t at
position 1’ and ‘t at position 2’ independent?
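The sampling scheme of parts (b) and (c) can be enumerated exactly. The sketch below is not part of the original sheet. It assumes each of the 20 amino acids has probability 1/20 and that each of its codons is then equally likely, so a codon for an amino acid with n codons has probability 1/(20n). The helper names `codon_distribution` and `prob` are illustrative, and uppercase letters are used for the bases.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

BASES = "TCAG"
# Coding table in the order TTT, TTC, TTA, TTG, TCT, ..., GGG; Z means stop.
AMINO = (
    "FFLLSSSSYYZZCCZW"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def codon_distribution():
    """Codon probabilities: uniform amino acid, then uniform codon for it."""
    by_aa = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "Z":  # stop codons do not code for an amino acid
            by_aa[aa].append(codon)
    dist = {}
    for codons in by_aa.values():
        for codon in codons:
            dist[codon] = Fraction(1, len(by_aa) * len(codons))
    return dist


def prob(dist, pattern, start):
    """P(the codon contains `pattern` beginning at 1-based position `start`)."""
    i = start - 1
    return sum(p for codon, p in dist.items() if codon[i:i + len(pattern)] == pattern)


dist = codon_distribution()
p_t = [prob(dist, "T", k) for k in (1, 2, 3)]
print("P(t at position 1, 2, 3):", p_t)
print("P(t at a random position):", sum(p_t) / 3)
for pair in ("TT", "TG"):
    print(pair, "at 1-2:", prob(dist, pair, 1), " at 2-3:", prob(dist, pair, 2))

# Conditional probability and the product rule for positions 1 and 2.
p_tt = prob(dist, "TT", 1)
print("P(t at 2 | t at 1):", p_tt / p_t[0])
print("independent?", p_tt == p_t[0] * p_t[1])
```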
NB: The empirical frequencies of amino acids are NOT uniform, as was assumed for
simplicity in part (b). Nevertheless, the above considerations indicate that there may well
be a difference between the distributions of the successive letters in DNA sequences that
are coding, and those that are not (which can be expected to be more ‘random’, i.e. behaving
more like the model of part (a)). Differences of this kind are in fact used by algorithms
which aim to distinguish coding from non–coding regions.
Question 1.3: optional
Simulate 1000 letters independently from the uniform distribution on the letters a, c, g, t.
Set zi = 1 if the ith letter and the (i + 1)st letter are both a, and zi = 0 otherwise.
Compute the proportions of 1s and 0s among the 999 values z1, ..., z999, and also of
the four possible consecutive pairs among the 998 values (zi, zi+1). Is the product
rule of independence satisfied? Use conditional probability to calculate the theoretical
frequencies and check with a 100000-letter sample.
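One possible way to run the simulation is sketched below using only Python's standard library; it is not part of the original sheet. The function name `simulate` and the fixed seed are illustrative choices. For comparison, under the uniform model the probability that zi = 1 is 1/16, while both zi and zi+1 equal 1 only if three consecutive letters are all a, which has probability 1/64 rather than (1/16)².

```python
import random
from collections import Counter


def simulate(n_letters, seed=0):
    """Simulate n_letters uniform DNA letters and summarise the indicators z_i.

    Returns the proportion of 1s among the z_i and the relative frequencies
    of the four possible consecutive pairs (z_i, z_{i+1}).
    """
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    letters = rng.choices("acgt", k=n_letters)
    # z_i = 1 exactly when letter i and letter i+1 are both 'a'.
    z = [
        1 if letters[i] == "a" and letters[i + 1] == "a" else 0
        for i in range(n_letters - 1)
    ]
    prop_ones = sum(z) / len(z)
    pair_counts = Counter(zip(z, z[1:]))
    n_pairs = len(z) - 1
    pair_freq = {
        pair: pair_counts[pair] / n_pairs
        for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]
    }
    return prop_ones, pair_freq


for n in (1000, 100000):
    prop_ones, pair_freq = simulate(n)
    print(n, "letters: proportion of 1s =", prop_ones)
    print("  pair frequencies:", pair_freq)
    print("  product rule would give P(1,1) =", prop_ones ** 2)
```

With only 1000 letters the pair (1, 1) is observed just a handful of times, so its frequency is noisy; the 100000-letter run gives a much clearer comparison with the theoretical values.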