Stat 141/Bioeng 141 Homework 2 - SOLUTIONS

PROBLEM 1

We want to find the value of N that maximizes the function N e^{-NL/G}. Strange, but true: if you take the log of the function, the logged function achieves its maximum at the same point the original function does. This is a neat trick, often used in statistics, that makes the calculus a little easier:

\[
\log\left(N e^{-NL/G}\right) = \log N - \frac{NL}{G}
\]
\[
\frac{d}{dN} \log\left(N e^{-NL/G}\right) = \frac{1}{N} - \frac{L}{G}
\]
\[
0 = \frac{1}{N} - \frac{L}{G} \quad \Longrightarrow \quad \hat{N} = \frac{G}{L}
\]

Taking the first derivative and setting it equal to 0 gives the critical values of the function. To confirm that this is indeed a maximum, take the second derivative and evaluate it at the critical value:

\[
\frac{d^2}{dN^2} \log\left(N e^{-NL/G}\right) = -\frac{1}{N^2} = -\frac{L^2}{G^2} \quad \text{at } N = \hat{N}
\]

This is clearly a negative number, so the second derivative test tells us that this critical value is a maximum.

PROBLEM 2

To find the fraction covered by at least one read, we used 1 - P(X = 0). To find the fraction covered by at least three reads, we extend this to 1 - P(X = 0) - P(X = 1) - P(X = 2). The solution can be worked on a hand calculator, or run quickly in R using ppois(2, 4.6, lower.tail = F):

\[
1 - P(X = 0) - P(X = 1) - P(X = 2)
= 1 - \frac{(4.6)^0 e^{-4.6}}{0!} - \frac{(4.6)^1 e^{-4.6}}{1!} - \frac{(4.6)^2 e^{-4.6}}{2!}
= 0.8373
\]

PROBLEM 3

\[
\begin{aligned}
P(\text{at least one overlap}) &= 1 - P(\text{no overlaps}) \\
&= 1 - \left(P(\text{one configuration is not an overlap})\right)^2 \\
&= 1 - \left(P(\text{at least one letter doesn't match})\right)^2 \\
&= 1 - \left(1 - P(\text{all three letters match})\right)^2 \\
&= 1 - \left(1 - \left(\tfrac{1}{4}\right)^3\right)^2 \\
&= 1 - \left(1 - 2\left(\tfrac{1}{4}\right)^3 + \left(\tfrac{1}{4}\right)^6\right) \\
&= 2\left(\tfrac{1}{4}\right)^3 - \left(\tfrac{1}{4}\right)^6
\end{aligned}
\]

PROBLEM 4

First, let's deal with the two hints.

What is the fraction of the original long sequence covered by at least one read?

\[
1 - \frac{(6.9)^0 e^{-6.9}}{0!} \approx 0.999
\]

What is the mean number of contigs?

\[
N e^{-6.9} = 0.001N
\]

From this, we can sort out that the number of letters covered is about 0.999G, and the mean number of contigs is 0.001N.
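The Poisson arithmetic above (the 0.8373 tail probability in Problem 2 and the two hint quantities for Problem 4) can be double-checked with a short standard-library Python script. This is just a sketch that re-evaluates the formulas already shown; nothing new is assumed:

```python
from math import exp, factorial

def pois_pmf(k, lam):
    """Poisson probability mass function P(X = k) with mean lam."""
    return lam**k * exp(-lam) / factorial(k)

# Problem 2: fraction covered by at least three reads, coverage c = 4.6
at_least_three = 1 - sum(pois_pmf(k, 4.6) for k in range(3))
print(round(at_least_three, 4))  # prints 0.8374 (the 0.8373 above, to rounding)

# Problem 4 hints, coverage c = 6.9
frac_covered = 1 - pois_pmf(0, 6.9)  # fraction covered by at least one read
contig_factor = exp(-6.9)            # mean number of contigs = N * exp(-c)
print(round(frac_covered, 3), round(contig_factor, 3))  # prints 0.999 0.001
```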
Doing a bit of arithmetic (and using the fact that coverage NL/G = 6.9, so G/N = L/6.9), we find:

\[
\text{mean size of contig} \approx \frac{\text{total length covered}}{\text{mean number of contigs}}
= \frac{0.999G}{0.001N}
= 999 \times \frac{G}{N}
= 999 \times \frac{L}{6.9}
= 144.8L
\]

PROBLEM 5

A finite, aperiodic, irreducible Markov chain has transition probability matrix P and stationary distribution ψ. Let k be some constant, 0 < k < 1. We wish to show that a chain with transition probability matrix P^F = kP + (1 - k)I also has stationary distribution ψ.

\[
P^F = k \begin{pmatrix} p_{11} & \cdots & p_{1n} \\ \vdots & \ddots & \vdots \\ p_{n1} & \cdots & p_{nn} \end{pmatrix}
+ (1-k) \begin{pmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{pmatrix}
= \begin{pmatrix} kp_{11} + 1 - k & \cdots & kp_{1n} \\ \vdots & \ddots & \vdots \\ kp_{n1} & \cdots & kp_{nn} + 1 - k \end{pmatrix}
\]

Let's see what happens when we pre-multiply this matrix by ψ. Without loss of generality, we can look only at the first entry of the resulting vector:

\[
\begin{aligned}
(\psi P^F)_1 &= \psi_1 (kp_{11} + 1 - k) + \cdots + \psi_n k p_{n1} \\
&= \psi_1 k p_{11} + \psi_1 - \psi_1 k + \cdots + \psi_n k p_{n1} \\
&= \psi_1 - \psi_1 k + k\left(\psi_1 p_{11} + \cdots + \psi_n p_{n1}\right) \\
&= \psi_1 - \psi_1 k + k\psi_1 \\
&= \psi_1
\end{aligned}
\]

Note that, because ψ is the stationary distribution of the Markov chain with transition matrix P, ψ_1 p_{11} + · · · + ψ_n p_{n1} = ψ_1 by definition. The algebra for all other entries is similar. We conclude that P^F has the same stationary distribution as P. Over the long run, the proportion of time spent in each state will be about the same for both P and P^F.

Note that, in P^F, all of the off-diagonal elements are smaller than in P, while the main diagonal is larger. You'll be more likely to see a single state repeat itself several times in P^F than in P.

PROBLEM 6

Although there are 27 possible arrangements of the three states, the initial probability vector and transition probability matrix both contain several zeroes. In the end, only two hidden state sequences are actually possible, greatly reducing the scope of this problem.

(a)

Hidden states    Probability
S1 S2 S1         1 × .5 × .5 × 0 × 1 × 0 = 0
S1 S3 S2         1 × .5 × .5 × .5 × 1 × .5 = 0.0625

Here, P(O|λ) = 0.0625.
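Each probability in the table is a product of per-step factors (initial probability, emissions, and transitions interleaved along the sequence), and P(O|λ) is the sum of these products over the feasible hidden sequences. A minimal Python sketch of part (a)'s computation, using only the factors copied from the table (the underlying model parameters are not reproduced in these solutions):

```python
from math import prod

# Per-step factors for the two feasible hidden-state sequences in part (a),
# copied directly from the table above.
factors = {
    "S1 S2 S1": [1, .5, .5, 0, 1, 0],
    "S1 S3 S2": [1, .5, .5, .5, 1, .5],
}

# Probability of each hidden sequence together with the observations,
# then P(O | lambda) as the sum over hidden sequences.
probs = {seq: prod(f) for seq, f in factors.items()}
p_obs = sum(probs.values())
print(probs, p_obs)  # the S1 S3 S2 path contributes 0.0625; the other path is 0
```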
(b)

Hidden states    Probability
S1 S2 S1         1 × .5 × .5 × .5 × 1 × .5 = 0.0625
S1 S3 S2         1 × .5 × .5 × .5 × 1 × .5 = 0.0625

Here, P(O|λ) = 0.125.

PROBLEM 7

Consider an HMM with the following parameters:

1. Three states: A, B, C
2. The alphabet is {1, 2, 3}
3. Transition probability matrix

\[
P = \begin{pmatrix} 0.5 & 0 & 0.5 \\ 0 & 0.5 & 0.5 \\ 0.5 & 0.5 & 0 \end{pmatrix}
\]

4. Initial probability vector π = [1/3 1/3 1/3]
5. Emission probabilities

e_A(1) = .98   e_A(2) = .01   e_A(3) = .01
e_B(1) = .01   e_B(2) = .98   e_B(3) = .01
e_C(1) = .01   e_C(2) = .01   e_C(3) = .98

Now consider the observed sequence O = 1, 2. There are nine hidden state sequences, G, that could give O:

G    P(O|G)       G    P(O|G)       G    P(O|G)
AA   .98 × .01    BA   .01 × .01    CA   .01 × .01
AB   .98 × .98    BB   .01 × .98    CB   .01 × .98
AC   .98 × .01    BC   .01 × .01    CC   .01 × .01

It is clear that the hidden sequence AB maximizes P(O|G). However, looking at the transition matrix P, P(G = AB) = 0.
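Problem 7's enumeration is small enough to reproduce exhaustively. The Python sketch below (parameters exactly as listed above) recomputes P(O|G) for all nine hidden sequences and confirms that the emission-maximizing sequence AB has joint probability zero once the transition matrix is taken into account:

```python
from itertools import product

states = ["A", "B", "C"]
# Transition matrix, initial vector, and emissions from the problem statement.
P = {"A": {"A": .5, "B": 0, "C": .5},
     "B": {"A": 0, "B": .5, "C": .5},
     "C": {"A": .5, "B": .5, "C": 0}}
pi = {"A": 1/3, "B": 1/3, "C": 1/3}
e = {"A": {1: .98, 2: .01, 3: .01},
     "B": {1: .01, 2: .98, 3: .01},
     "C": {1: .01, 2: .01, 3: .98}}
O = [1, 2]

# Emission-only likelihood P(O|G), as in the table, and the joint
# probability P(G, O) = pi * transition * emissions for each sequence.
lik, joint = {}, {}
for g1, g2 in product(states, repeat=2):
    lik[g1 + g2] = e[g1][O[0]] * e[g2][O[1]]
    joint[g1 + g2] = pi[g1] * e[g1][O[0]] * P[g1][g2] * e[g2][O[1]]

best = max(lik, key=lik.get)
print(best, lik[best])  # AB maximizes P(O|G) with .98 x .98
print(joint["AB"])      # 0.0 -- the transition A -> B has probability 0
```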