Introduction to Computational Biology
Homework 2 Solution

Problem 1: Concave gap penalty function

Let γ be a gap penalty function defined over non-negative integers. The function γ is called sub-additive iff it satisfies the following:

γ(k1 + k2 + ... + kn) ≤ γ(k1) + γ(k2) + ... + γ(kn)

(a) Show that a concave γ, i.e. one that satisfies γ(x+1) − γ(x) ≤ γ(x) − γ(x−1), is sub-additive if γ(0) ≥ 0. Hint: it is sufficient to show that γ(k1 + k2) ≤ γ(k1) + γ(k2), and express γ(k1 + k2) as γ(k1) plus the increments up to k2.

Solution: We use the fact that γ is a gap function, and hence it is defined on integers ≥ 0 only. In fact, the statement above is not true otherwise. Before we prove the statement, we give some examples where the statement fails, to justify the need for this condition.

First, consider the function defined as follows: γ(−1) = −1, γ(0) = 0, γ(1) = 1, γ(2) = 1. This satisfies the concavity condition. Now γ(2 + (−1)) = γ(1) = 1 > γ(2) + γ(−1) = 1 − 1 = 0. Therefore, we cannot allow γ to be defined on negative integers.

Second, consider the periodic function that starts at zero and remains zero until x = 1/3, then goes up linearly from zero to 1 at x = 2/3, then goes down linearly to zero at x = 1. This is repeated on every interval of length 1. This function satisfies the concavity condition because γ(x) = γ(x + 1). Now γ(1/3 + ε) is strictly positive, but γ(1/3) = γ(ε) = 0. Therefore, γ(1/3 + ε) > γ(1/3) + γ(ε). Hence, γ cannot be defined on non-integers.

Now we prove the result for γ defined on integers ≥ 0. We prove the statement for γ(k1 + k2); the general case can be easily obtained by induction.

γ(k1 + k2) = γ(k1) + Σ_{i=1..k2} [γ(k1 + i) − γ(k1 + i − 1)]
           ≤ γ(k1) + Σ_{i=1..k2} [γ(i) − γ(i − 1)]     (by the concavity condition)
           = γ(k1) + γ(1) − γ(0) + γ(2) − γ(1) + ... + γ(k2 − 1) − γ(k2 − 2) + γ(k2) − γ(k2 − 1)
           = γ(k1) + γ(k2) − γ(0)
           ≤ γ(k1) + γ(k2)     (if γ(0) ≥ 0)

Note that the first equality and the second inequality are valid only if both k1 and k2 are ≥ 0.

The next set of questions is intended to help you understand why the DP algorithm we saw in class requires γ to be concave. Here's the algorithm again:

              | A(i − 1, j − 1) + s(i, j)           (1)
A(i, j) = max | A(i, j − k) − γ(k), k = 1...j       (2)
              | A(i − k, j) − γ(k), k = 1...i       (3)

Without loss of generality, we restrict our attention to alignments that end with a gap in x. Call such an alignment of x1...xi and y1...yj "good" if it ends with a gap of length k in x for some k > 0 and optimally aligns x1...xi to y1...yj−k.

(b) Show that a "good" alignment of x1...xi and y1...yj is not necessarily optimal. It is enough to give a counter example.

Solution: Consider the following alignment:

-A-
AAA

This alignment is "good" because -A aligned with AA is optimal (A- with AA is the only other way and has the same score). However, the alignment (-A-, AAA) is not necessarily optimal, depending on γ.

(c) Show that if γ is concave, then an optimal alignment (that ends with a gap in x) of x1...xi and y1...yj is a "good" alignment.

Solution: Consider an optimal alignment that ends with a gap of length l1 > 0 in x. The score of this alignment is equal to S = S1 − γ(l1), where S1 is the score of the portion of the alignment obtained by excluding the gap of length l1. Assume that the alignment is not "good", i.e. S1 is not optimal. Then there must be another alignment A for that portion with score S2 > S1. This alignment ends in a gap of length l2 ≥ 0 in x. Therefore, S2 = S3 − γ(l2) (adopting the convention γ(0) = 0), where S3 is the score of the portion of the alignment A obtained by excluding the gap of length l2. We construct a new alignment by concatenating A with the gap of length l1 in x. Let's find the score of this alignment. The score is S′ = S3 − γ(l2 + l1).
Using part (a),

S′ = S3 − γ(l2 + l1) ≥ S3 − γ(l2) − γ(l1) = S2 − γ(l1) > S1 − γ(l1) = S.

Hence S′ > S, contradicting the optimality of our alignment. Therefore, it must have been a "good" alignment too.

(d) Show that if γ is concave, then for any given alignment of x1...xi and y1...yj with score S, if we split the alignment in two parts with scores S1 and S2, then S1 + S2 ≤ S.

Solution: Assume the first part ends with a gap of length l1 ≥ 0 and the second part starts with a gap of length l2 ≥ 0. Then S1 = A − γ(l1) and S2 = −γ(l2) + B. By the sub-additive property,

S = A − γ(l1 + l2) + B ≥ A − γ(l1) − γ(l2) + B = S1 + S2.

(e) Put parts (b), (c) and (d) together to argue that step (2) must check all k = 1...j, and that it computes the optimal alignment of x1...xi and y1...yj that ends with a gap in x.

Solution:
Necessity: Step (2) should find the optimal alignment that ends in a gap in x. This gap may have any length k. Moreover, the alignment must be "good" (part c). Therefore, the score must be the sum of the score of some optimal alignment and a gap penalty for length k. But if such a "good" alignment is found, it may not be optimal (part b); therefore, all "good" alignments must be checked.

Correctness: The optimal alignment that ends in a gap in x must exist, and it is "good" (part c). Therefore, step (2) will find a k such that A(i, j − k) is optimal and A(i, j − k) − γ(k) is the score of the optimal alignment that ends with a gap in x (of length k). Since S1 + S2 ≤ S for any split of an alignment (part d), step (2) is guaranteed not to compute any score greater than A(i, j − k) − γ(k).

(f) Construct an instance of an optimal alignment (that ends with a gap in x) that is not "good" (you cannot use a concave gap function, according to part (c)). Argue that the algorithm above will not work properly if such an instance can be constructed.

Solution: Assume γ(0) = 0, γ(1) = 1, γ(2) = 5.
Also assume that the score of a match is +1 and the score of a mismatch is −1. Consider the following optimal alignment, with score −1 − 1 − 1 = −3:

-A-
ABA

Any other alignment will have a score of +1 − 5 = −4. In the alignment above, the prefix alignment (-A, AB) is not optimal: (-A, AB) has a score of −1 − 1 = −2, while the alignment (A-, AB) has a score of +1 − 1 = 0. Therefore, we have an optimal alignment that is not "good".

Assume that the optimal alignment of x1...xi and y1...yj ends with a gap of length k in x and is not "good". This optimal alignment corresponds to aligning y1...yj−k with x1...xi (non-optimally) and aligning yj−k+1...yj with a gap of length k. If the algorithm does not compute the correct value for A(i, j − k), then we are done, because it will not perform correctly on the instance x1...xi and y1...yj−k. If the algorithm computes the correct value for A(i, j − k), then step (2) will at some point compute A(i, j) = A(i, j − k) − γ(k) (or a larger value), which is greater than the score of the optimal alignment. Since A(i, j) never decreases, it will hold the wrong score forever, and thus the algorithm fails on the instance x1...xi and y1...yj.

Problem 2: Random star alignment

Let M = Σ_{Si} D(Sc, Si) as defined in the star alignment algorithm. Suppose that instead of Sc, we choose a string at random to be the center of the star. Let MR be defined in an analogous way when Sc is replaced by a random string.

(a) Show that E[MR] ≤ 2M and argue that the expected score of the alignment is at most a factor of 4 from optimal.

Solution:

E[MR] = (1/n) Σ_{Si} Σ_{Sj} D(Si, Sj)
      ≤ (1/n) Σ_{Si} Σ_{Sj} [D(Si, Sc) + D(Sc, Sj)]     (triangle inequality)
      = (1/n) Σ_{Sj} Σ_{Si} D(Si, Sc) + (1/n) Σ_{Si} Σ_{Sj} D(Sc, Sj)
      = (1/n) Σ_{Sj} M + (1/n) Σ_{Si} M = M + M = 2M

(b) Next, show that the median of MR is at most 3M.

Solution: If the median were strictly greater than 3M, then, since MR ≥ M always (M is the minimum over all choices of center), the expected value would satisfy

E[MR] > (1/2) · 3M + (1/2) · M = 2M,

a contradiction.
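The bounds in parts (a) and (b) can be sanity-checked numerically. The sketch below is ours, not part of the problem: it uses plain edit distance as the metric D, random DNA strings as S, and compares the cost of every candidate center against the optimal one.

```python
import random
from statistics import mean, median

def edit_dist(a, b):
    # Standard Levenshtein DP: unit cost for insertion, deletion, substitution.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

random.seed(0)
strings = ["".join(random.choice("ACGT") for _ in range(12)) for _ in range(8)]

# Star cost of each candidate center: sum of its distances to all strings.
costs = [sum(edit_dist(c, s) for s in strings) for c in strings]
M = min(costs)                   # cost of the optimal center Sc
# MR is the cost of a uniformly random center, so its mean and median
# are just the mean and median of `costs`.
assert mean(costs) <= 2 * M      # part (a): E[MR] <= 2M
assert median(costs) <= 3 * M    # part (b): median(MR) <= 3M
```

Since edit distance satisfies the triangle inequality, the assertions hold for any choice of input strings, not just this random draw.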
(c) Finally, argue that the value of the multiple alignment is at most a factor of 6 from optimal with probability at least 1/2.

Solution: By part (b), MR ≤ 3M with probability at least 1/2. Repeating the argument from the proof of the star alignment bound, with 3M in place of M, gives a factor of 6 instead of 2.

Problem 3: Consensus string

Given a set of strings S, let Sc be the center string as defined for the star alignment of S, and assume that the scoring scheme (distance) satisfies the triangle inequality. In addition, for any string T, define the consensus error E(T) = Σ_{Si∈S} D(T, Si).

(a) Let S∗, not necessarily in S, be the string that minimizes E(S∗). Show that for any string T in S:

E(T) ≤ (|S| − 2) D(T, S∗) + E(S∗)

Solution:

E(T) = Σ_{Si∈S−{T}} D(T, Si)
     ≤ Σ_{Si∈S−{T}} [D(T, S∗) + D(S∗, Si)]     (triangle inequality)
     = (|S| − 1) D(T, S∗) + E(S∗) − D(T, S∗)
     = (|S| − 2) D(T, S∗) + E(S∗)

(The third line uses Σ_{Si∈S−{T}} D(S∗, Si) = E(S∗) − D(S∗, T).)

(b) Show that if T ∈ S is the closest string to S∗, then:

E(T) / E(S∗) < 2

Solution: Since D(T, S∗) ≤ D(Si, S∗) for every Si ∈ S, we have |S| D(T, S∗) ≤ Σ_{Si∈S} D(Si, S∗) = E(S∗). Therefore, by part (a),

E(T) ≤ (|S| − 2) D(T, S∗) + E(S∗) ≤ ((|S| − 2)/|S|) E(S∗) + E(S∗) = (2 − 2/|S|) E(S∗) < 2 E(S∗)

(c) Show that E(Sc)/E(S∗) < 2.

Solution: By definition, Sc minimizes E over S, so E(Sc) ≤ E(T) < 2 E(S∗), where T ∈ S is the closest string to S∗.

Problem 4: Example substitution matrix

Let's say we would like to build a DNA substitution matrix (4x4 matrix) optimized for finding 88% identity alignments.

• assume the background frequencies are identical, i.e. pi = 0.25 for each nucleotide i
• assume that all matches are equally probable
• assume that all mismatches are equally probable

(a) Compute qij for all i and j.

(b) Construct the matrix using the log-likelihood ratio.

(c) Choose a scaling factor λ to make the substitution matrix close to an integer matrix.

Solution: Since the identity is 88%, and all matches are equally probable, each match occurs with probability 0.88/4 = 0.22. Therefore, qAA = qGG = qCC = qTT = 0.22. Since all mismatches (12 possibilities) are equally probable, we have qij = (1 − 0.88)/12 = 0.12/12 = 0.01 for all i ≠ j ∈ {A, G, C, T}.
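The numbers in parts (a) through (c) can be reproduced with a few lines. This is only a numerical check of the solution (natural log throughout, matching the values used below):

```python
import math

p = 0.25                 # background frequency of each nucleotide
q_match = 0.88 / 4       # = 0.22, probability of each of the 4 matches
q_mismatch = 0.12 / 12   # = 0.01, probability of each of the 12 mismatches

# (a) the q's must form a probability distribution over the 16 pairs
assert abs(4 * q_match + 12 * q_mismatch - 1.0) < 1e-12

# (b) log-likelihood ratios
s_match = math.log(q_match / (p * p))        # ~ 1.26
s_mismatch = math.log(q_mismatch / (p * p))  # ~ -1.83

# (c) scaling by 1/lambda with lambda = 0.18 gives near-integer scores
lam = 0.18
print(round(s_match / lam), round(s_mismatch / lam))  # 7 -10
```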
Now pA = pG = pC = pT = 0.25, therefore:

log( qii / (pi pi) ) = log( 0.22 / 0.25² ) = 1.26
log( qij / (pi pj) ) = log( 0.01 / 0.25² ) = −1.83

Scaling by 1/λ, where λ = 0.18, and rounding, we get a scoring scheme of +7 for a match and −10 for a mismatch. The matrix is:

      A    G    C    T
A     7  −10  −10  −10
G   −10    7  −10  −10
C   −10  −10    7  −10
T   −10  −10  −10    7

Problem 5: Unraveling BLOSUM62

(a) Find the 20x20 BLOSUM62 substitution matrix online. BLOSUM62 has the property that the background probabilities and the observed probabilities are consistent, i.e. pi = Σ_j qij.

Solution:

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

(b) Given a symmetric and consistent substitution matrix S, with a scaling factor λ, let M be the matrix e^{λS}. Note that Mij = qij / (pi pj). Let Y be the inverse of M (assuming M is invertible). Show that the sum of the ith column (or row) of Y must be equal to the background probability pi.
Hint: Consider the vector p = [p1, ..., pn]. Show that pM = [1, ..., 1]. Use this result to compute pMY in two ways.

Solution: The ith entry of pM is

p1 M1i + p2 M2i + ... + pn Mni = p1 · q1i/(p1 pi) + p2 · q2i/(p2 pi) + ... + pn · qni/(pn pi)
                               = (1/pi) (q1i + q2i + ... + qni)
                               = (1/pi) · pi = 1

The last equality follows because the matrix is symmetric and consistent. Therefore, pM = [1, ..., 1]. So p = pMY = [1, ..., 1]Y. This shows that the sum of the ith column of Y is equal to pi.

(c) Using the above strategy, and knowing that λ = 0.3176 for BLOSUM62, find the background probabilities pi for BLOSUM62.

(d) Using part (c), find the observed set of probabilities qij.
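The recipe from part (b) is easy to exercise numerically. The sketch below uses a small made-up symmetric, consistent q over 3 letters (illustrative values only, not BLOSUM data), forms M, and recovers p as the column sums of Y = M⁻¹; applying the same steps to the entrywise exponential exp(λS) of the BLOSUM62 matrix above, with λ = 0.3176, is exactly what parts (c) and (d) ask for.

```python
import numpy as np

# A made-up symmetric joint distribution q whose row sums give p (consistency).
q = np.array([[0.30, 0.15, 0.05],
              [0.15, 0.10, 0.05],
              [0.05, 0.05, 0.10]])
p = q.sum(axis=1)                      # consistency: p = [0.5, 0.3, 0.2]

M = q / np.outer(p, p)                 # M_ij = q_ij / (p_i p_j), i.e. exp(lambda * S)
assert np.allclose(p @ M, np.ones(3))  # the hint: pM = [1, ..., 1]

Y = np.linalg.inv(M)
p_recovered = Y.sum(axis=0)            # column sums of Y
assert np.allclose(p_recovered, p)     # part (b): column sums of Y equal p

# Part (d): once p is known, q is recovered via q_ij = M_ij * p_i * p_j.
assert np.allclose(M * np.outer(p_recovered, p_recovered), q)
```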