Bioinformatics II: Probability and Statistics
Exercise Sheet 1

Question 1.1

(a) A sequence of 7 independent letters is drawn from the uniform distribution on the DNA alphabet a, c, g, t. What is the probability p_1 of getting exactly the sequence gattaca?

(b) Suppose now that the letters are still drawn independently, but that their common distribution is c-g richer: P(c) = P(g) = 0.275, P(a) = P(t) = 0.225. What is the probability p_2 of getting gattaca? What is p_2/p_1?

(c) Do the answers to (a) and (b) change if the sequence is not gattaca, but instead another specific sequence with a three times, c once, g once and t twice?

(d) Suppose now that we have been given a specific sequence of length 350, with 150 a's, 50 c's, 50 g's and 100 t's. If p'_1 is the probability of getting this sequence with the uniform distribution of part (a), and p'_2 is the probability of getting this sequence with the c-g richer distribution of part (b), what is p'_2/p'_1? Compare with the result of part (b).

Question 1.2

The amino acid coding table is given below; the symbol Z is used here for 'stop'. The entry C, which is found in the first (T) block, in the fourth (G) column and second (C) row, indicates that the DNA triple TGC codes for the amino acid C (cysteine).

    First     Second letter     Third
    letter    T   C   A   G     letter

    T         F   S   Y   C     T
              F   S   Y   C     C
              L   S   Z   Z     A
              L   S   Z   W     G

    C         L   P   H   R     T
              L   P   H   R     C
              L   P   Q   R     A
              L   P   Q   R     G

    A         I   T   N   S     T
              I   T   N   S     C
              I   T   K   R     A
              M   T   K   R     G

    G         V   A   D   G     T
              V   A   D   G     C
              V   A   E   G     A
              V   A   E   G     G

(a) Suppose that a sequence of three independent letters is chosen from the uniform distribution on the DNA alphabet a, c, g, t. What are the probabilities of this sequence coding for each of the possible amino acids? What is the probability that the sequence does not code for an amino acid?

(b) Now suppose that an amino acid is chosen at random from the uniform distribution on the set of all 20 amino acids, and that a DNA triplet is chosen uniformly at random from those triplets which code for this amino acid. What is the probability of getting t at position 1? At position 2? At position 3? At a randomly chosen position?

(c) Sample as in part (b). What is the probability of getting tt at positions 1 and 2? At positions 2 and 3? Repeat this for the string tg. What is the conditional probability of getting t at position 2, given that there is a t at position 1? Are the two events 't at position 1' and 't at position 2' independent?

NB: The empirical frequencies of amino acids are NOT uniform, as assumed for simplicity in part (b). Nevertheless, the above considerations indicate that there may well be a difference between the distributions of the successive letters in DNA sequences that are coding and in those that are not (which can be expected to be more 'random', i.e. to behave more like the model of part (a)). Differences of this kind are in fact used by algorithms which aim to distinguish coding from non-coding regions.

Question 1.3 (optional)

Simulate 1000 letters independently from the uniform distribution on the letters a, c, g, t. Set z_i = 1 if the i-th letter and the (i+1)-st letter are both a, and z_i = 0 otherwise. Compute the proportions of 1s and 0s among the 999 values z_1, ..., z_999, and also of the four possible consecutive pairs among the 998 values (z_i, z_{i+1}). Is the product rule of independence satisfied? Use conditional probability to calculate the theoretical frequencies and check with a 100000 letter sample.
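
The simulation in Question 1.3 can be set up in a few lines. The following is a minimal Python sketch, not a model solution: the function names simulate_z and proportions and the seed parameter are illustrative choices that do not come from the sheet. It draws the letters, builds the indicators z_i, and reports the empirical proportions needed to test the product rule.

```python
import random
from collections import Counter


def simulate_z(n, seed=None):
    """Draw n independent uniform letters from a, c, g, t and return
    the n - 1 indicators z_i (1 if letters i and i + 1 are both 'a')."""
    rng = random.Random(seed)  # seed is optional, for reproducibility
    letters = rng.choices("acgt", k=n)
    return [
        1 if letters[i] == "a" and letters[i + 1] == "a" else 0
        for i in range(n - 1)
    ]


def proportions(z):
    """Return the empirical proportions of single values z_i and of
    consecutive pairs (z_i, z_{i+1})."""
    single_counts = Counter(z)
    single = {v: single_counts.get(v, 0) / len(z) for v in (0, 1)}

    pair_counts = Counter(zip(z, z[1:]))
    n_pairs = len(z) - 1
    pair = {
        p: pair_counts.get(p, 0) / n_pairs
        for p in [(0, 0), (0, 1), (1, 0), (1, 1)]
    }
    return single, pair


if __name__ == "__main__":
    for n in (1000, 100000):
        single, pair = proportions(simulate_z(n))
        print(f"n = {n}")
        print("  proportion of 1s:", single[1])
        print("  proportion of 0s:", single[0])
        for p, freq in pair.items():
            # Under independence the pair frequency would match the product
            product = single[p[0]] * single[p[1]]
            print(f"  pair {p}: observed {freq:.5f}, product {product:.5f}")
```

Comparing the 'observed' and 'product' columns for the pair (1, 1) is the most direct check of the product rule. With only 1000 letters the estimates are noisy, which is why the sheet suggests repeating the check with 100000 letters.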