Introduction to Computational Biology
Homework 2
Solution
Problem 1: Concave gap penalty function
Let γ be a gap penalty function defined over non-negative integers. The function
γ is called sub-additive iff it satisfies the following: γ(k1 + k2 + ... + kn ) ≤
γ(k1 ) + γ(k2 ) + ... + γ(kn ).
(a) Show that a concave γ, i.e. one that satisfies γ(x+1)−γ(x) ≤ γ(x)−γ(x−1),
is sub-additive if γ(0) ≥ 0. Hint: it is sufficient to show that γ(k1 + k2 ) ≤
γ(k1 ) + γ(k2 ) and express γ(k1 + k2 ) as γ(k1 ) plus the increments up to k2 .
Solution: We use the fact that γ is a gap function, and hence it is defined only on integers ≥ 0. In fact, the statement is not true otherwise. Before proving it, we give two examples showing why this restriction is needed.
First, consider the function defined by γ(−1) = −1, γ(0) = 0, γ(1) = 1, γ(2) = 1. This satisfies the concavity condition, yet γ(2 + (−1)) = γ(1) = 1 > γ(2) + γ(−1) = 1 − 1 = 0. Therefore, γ cannot be defined on negative integers.
Second, consider the periodic function that starts at zero, remains zero until x = 1/3, rises linearly to 1 at x = 2/3, then drops linearly back to zero at x = 1, and repeats with period 1. This function satisfies the concavity condition because γ(x) = γ(x + 1). Now γ(1/3 + ε) is strictly positive, while γ(1/3) = γ(ε) = 0 for small ε > 0, so γ(1/3 + ε) > γ(1/3) + γ(ε). Hence, γ cannot be defined on non-integers.
Now we prove the result for γ defined on integers ≥ 0. We prove the statement
for γ(k1 + k2 ). The general case can be easily obtained by induction.
γ(k1 + k2) = γ(k1) + Σ_{i=1}^{k2} [γ(k1 + i) − γ(k1 + i − 1)]
           ≤ γ(k1) + Σ_{i=1}^{k2} [γ(i) − γ(i − 1)]      (by the concavity condition)
           = γ(k1) + γ(1) − γ(0) + γ(2) − γ(1) + ... + γ(k2 − 1) − γ(k2 − 2) + γ(k2) − γ(k2 − 1)
           = γ(k1) + γ(k2) − γ(0)
           ≤ γ(k1) + γ(k2)      (since γ(0) ≥ 0)
Note that the first equality and the concavity step are valid only because both k1 and k2 are ≥ 0.
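As a quick numeric sanity check (an illustration added to this write-up, not part of the original solution), the Python sketch below tests sub-additivity for one example of a concave gap function with γ(0) ≥ 0, namely γ(k) = √k:

    # Check gamma(k1 + k2) <= gamma(k1) + gamma(k2) for a sample concave gap
    # function with gamma(0) >= 0.  A small tolerance absorbs floating-point error.
    import math

    def gamma(k):
        return math.sqrt(k)   # concave on the non-negative integers, gamma(0) = 0

    assert all(
        gamma(k1 + k2) <= gamma(k1) + gamma(k2) + 1e-12
        for k1 in range(50)
        for k2 in range(50)
    )
    print("sub-additivity holds for all tested k1, k2")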
The next set of questions is intended to help you understand why the DP algorithm we saw in class requires γ to be concave. Here is the algorithm again:

A(i, j) = max of:
  (1)  A(i − 1, j − 1) + s(i, j)
  (2)  A(i, j − k) − γ(k),  k = 1 ... j
  (3)  A(i − k, j) − γ(k),  k = 1 ... i
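For concreteness, here is a minimal Python sketch of this recurrence (an illustration added to this write-up, not code from the course); s(a, b) is the substitution score and γ an arbitrary gap penalty defined on positive integers. It runs in O(nm(n + m)) time.

    # Minimal sketch of the DP above.
    def align(x, y, s, gamma):
        n, m = len(x), len(y)
        A = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
        A[0][0] = 0
        for j in range(1, m + 1):              # alignment starts with a gap in x
            A[0][j] = -gamma(j)
        for i in range(1, n + 1):              # alignment starts with a gap in y
            A[i][0] = -gamma(i)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                best = A[i - 1][j - 1] + s(x[i - 1], y[j - 1])      # rule (1)
                for k in range(1, j + 1):                           # rule (2)
                    best = max(best, A[i][j - k] - gamma(k))
                for k in range(1, i + 1):                           # rule (3)
                    best = max(best, A[i - k][j] - gamma(k))
                A[i][j] = best
        return A[n][m]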
Without loss of generality, we restrict our attention to alignments that end with
a gap in x. Call such an alignment of x1 ...xi and y1 ...yj “good” if it ends with
a gap of length k in x for some k > 0 and optimally aligns x1 ...xi to y1 ...yj−k .
(b) Show that a “good” alignment of x1 ...xi and y1 ...yj is not necessarily optimal. It is enough to give a counterexample.
Solution: Consider the following alignment of x = A against y = AAA:
-A-
AAA
This alignment is “good” because -A aligned with AA is optimal (A- with AA is the only other possibility and has the same score). However, the full alignment (-A-, AAA) is not necessarily optimal, depending on γ.
(c) Show that if γ is concave, then an optimal alignment (that ends with a gap
in x) of x1 ...xi and y1 ...yj is a “good” alignment.
Solution: Consider an optimal alignment that ends with a gap of length l1 > 0 in x. The score of this alignment is S = S1 − γ(l1), where S1 is the score of the portion of the alignment obtained by excluding the gap of length l1. Assume that the alignment is not “good”, i.e. S1 is not the optimal score for aligning x1 ...xi to y1 ...yj−l1. Then there is another alignment A of that portion with score S2 > S1. This alignment ends in a gap of length l2 ≥ 0 in x, so S2 = S3 − γ(l2) (with the convention γ(0) = 0), where S3 is the score of the portion of A obtained by excluding the gap of length l2. Now construct a new alignment by concatenating A with the gap of length l1 in x; the two gaps merge into a single gap of length l1 + l2, so its score is S′ = S3 − γ(l2 + l1). Using part (a),
S′ = S3 − γ(l2 + l1) ≥ S3 − γ(l2) − γ(l1) = S2 − γ(l1) > S1 − γ(l1) = S.
Hence S′ > S, contradicting the optimality of the original alignment. Therefore, the optimal alignment must have been “good”.
(d) Show that if γ is concave, then for any given alignment of x1 ...xi and y1 ...yj with score S, if we split the alignment into two parts with scores S1 and S2, then S1 + S2 ≤ S.
Solution: Assume the first part ends with a gap of length l1 ≥ 0 and the second part starts with a gap of length l2 ≥ 0 (with the convention γ(0) = 0). Then S1 = A − γ(l1) and S2 = −γ(l2) + B. By the sub-additive property, S = A − γ(l1 + l2) + B ≥ A − γ(l1) − γ(l2) + B = S1 + S2.
(e) Put parts (b), (c), and (d) together to argue that step (2) must check all k = 1 ... j, and that it computes the score of the optimal alignment of x1 ...xi and y1 ...yj that ends with a gap in x.
Solution:
Necessity: Step (2) should find the score of the optimal alignment of x1 ...xi and y1 ...yj that ends in a gap in x. This gap may have any length k, and by part (c) the alignment must be “good”, so its score has the form A(i, j − k) − γ(k) for some k. But a “good” alignment for one particular k is not necessarily optimal (part b); therefore all values k = 1 ... j must be checked.
Correctness: The optimal alignment that ends in a gap in x is “good” (part c), where k is the full length of its final gap. Step (2) therefore considers this k, and A(i, j − k) − γ(k) equals the score of that optimal alignment. Moreover, for any other k, concatenating an optimal alignment of x1 ...xi and y1 ...yj−k with a gap of length k yields a genuine alignment whose score, by part (d), is at least A(i, j − k) − γ(k); hence step (2) never computes a value greater than the optimal score.
(f) Construct an instance of an optimal alignment (that ends with a gap in x) that is not “good”. (By part (c), you cannot use a concave gap function.) Argue that the algorithm above will not work properly if such an instance can be constructed.
Solution: Assume γ(0) = 0, γ(1) = 1, γ(2) = 5, and let the score of a match be +1 and the score of a mismatch be −1. Consider the following alignment of x = A against y = ABA, with score −1 − 1 − 1 = −3:
-A-
ABA
Any other alignment has a score of at most +1 − 5 = −4, so this alignment is optimal.
In the alignment above, the prefix (-A, AB) is not optimal: it has a score of −1 − 1 = −2, whereas the alignment (A-, AB) has a score of 1 − 1 = 0. Therefore, we have an optimal alignment, ending with a gap in x, that is not “good”.
Now suppose, more generally, that the optimal alignment of x1 ...xi and y1 ...yj ends with a gap of length k in x but is not “good”. This optimal alignment corresponds to aligning y1 ...yj−k with x1 ...xi (non-optimally) and aligning yj−k+1 ...yj with a gap of length k. If the algorithm does not compute the correct value for A(i, j − k), then we are done, because it fails on the instance x1 ...xi and y1 ...yj−k. If the algorithm does compute the correct value for A(i, j − k), then step (2) will at some point set A(i, j) to A(i, j − k) − γ(k) (or a larger value), which is strictly greater than the score of the optimal alignment. Since A(i, j) is a maximum and never decreases, the algorithm reports a score that no actual alignment achieves, and thus fails on the instance x1 ...xi and y1 ...yj.
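To make the failure concrete, one can run the align sketch given after the recurrence in Problem 1 on this very instance. The value γ(3) = 6 below is an arbitrary filler (it is not specified in the solution above) so that the penalty is defined for every gap length the DP may try; it does not affect the outcome.

    # Assumes the align() sketch shown earlier is in scope.
    # gamma(2) = 5 violates concavity; gamma(3) = 6 is an arbitrary filler value.
    gap = {1: 1, 2: 5, 3: 6}
    score = lambda a, b: 1 if a == b else -1
    print(align("A", "ABA", score, lambda k: gap[k]))   # prints -1
    # The best real alignment (-A- over ABA) scores -3, so the DP overestimates.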
Problem 2: Random star alignment
Let M = Σ_{Si} D(Sc, Si) as defined in the star alignment algorithm. Suppose that instead of Sc, we choose a string at random to be the center of the star. Let MR be defined in an analogous way when Sc is replaced by a random string.
(a) Show that E[MR] ≤ 2M and argue that the expected score of the alignment is at most a factor of 4 of optimal.
Solution: Let n = |S|. Then
E[MR] = (1/n) Σ_{Si} Σ_{Sj} D(Si, Sj)
      ≤ (1/n) Σ_{Si} Σ_{Sj} [D(Si, Sc) + D(Sc, Sj)]      (triangle inequality)
      = (1/n) Σ_{Si} Σ_{Sj} D(Si, Sc) + (1/n) Σ_{Si} Σ_{Sj} D(Sc, Sj)
      = (1/n) Σ_{Sj} M + (1/n) Σ_{Si} M = M + M = 2M.
As in the star alignment analysis from class, the value of the multiple alignment built around a center Sr is at most (n − 1)MR, while the value of the optimal multiple alignment is at least nM/2. Hence the expected value with a random center is at most (n − 1)E[MR] ≤ 2(n − 1)M < 4 · (nM/2), i.e. at most a factor of 4 of optimal.
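A small empirical illustration (an addition to this write-up, not part of the original solution): with unit-cost edit distance, which satisfies the triangle inequality, the average of MR over all choices of the center never exceeds 2M.

    import random

    def edit_distance(a, b):
        # unit-cost Levenshtein distance (satisfies the triangle inequality)
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    random.seed(0)
    seqs = ["".join(random.choices("ACGT", k=random.randint(5, 12))) for _ in range(8)]
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    M = min(totals)                      # sum of distances from the best center Sc
    E_MR = sum(totals) / len(seqs)       # expectation over a uniformly random center
    print(M, E_MR, E_MR <= 2 * M)        # the last value is always True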
(b) Next, show that the median of MR is at most 3M .
Solution: Note that MR ≥ M always, since Sc minimizes the sum of distances. If the median of MR were strictly greater than 3M, then
E[MR] > (1/2) · 3M + (1/2) · M = 2M,
contradicting part (a).
(c) Finally, argue that the value of the multiple alignment is at most a factor of
6 of optimal with probability at least 1/2.
Solution: With probability at least 1/2 we have MR ≤ 3M (part b). As in the star alignment proof, the value of the resulting multiple alignment is at most (n − 1)MR ≤ 3(n − 1)M, while the optimal value is at least nM/2. Therefore, with probability at least 1/2, the value of the multiple alignment is at most a factor of 6 of optimal.
Problem 3: Consensus string
Given a set of strings S, let Sc be the center string as defined for the star
alignment of S, and assume that the scoring scheme (distance) satisfies the triangle inequality. In addition, for any string T, define the consensus error E(T) = Σ_{Si ∈ S} D(T, Si).
(a) Let S ∗ , not necessarily in S, be the string that minimizes E(S ∗ ). Show that
for any string T in S:
E(T ) ≤ (|S| − 2)D(T, S ∗ ) + E(S ∗ )
Solution:
E(T) = Σ_{Si ∈ S−{T}} D(T, Si)
     ≤ Σ_{Si ∈ S−{T}} [D(T, S∗) + D(S∗, Si)]      (triangle inequality)
     = (|S| − 1)D(T, S∗) + E(S∗) − D(T, S∗)
     = (|S| − 2)D(T, S∗) + E(S∗)
(b) Show that if T ∈ S is the closest string to S∗, then:
E(T)/E(S∗) < 2
Solution: Since D(T, S∗) ≤ D(Si, S∗) for every Si ∈ S, we have |S| · D(T, S∗) ≤ Σ_{Si ∈ S} D(Si, S∗) = E(S∗). Combining this with part (a),
E(T) ≤ (|S| − 2)D(T, S∗) + E(S∗) < |S| · D(T, S∗) + E(S∗) ≤ 2E(S∗).
(c) Show that E(Sc)/E(S∗) < 2.
Solution: By definition, Sc minimizes the consensus error over the strings in S, so E(Sc) ≤ E(T) for the T of part (b). Therefore E(Sc)/E(S∗) ≤ E(T)/E(S∗) < 2.
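A numeric illustration of part (a) (again an addition to this write-up; it reuses the edit_distance helper from the Problem 2 sketch above): the derivation only uses the triangle inequality, so the bound holds with any reference string R in place of S∗.

    # Assumes edit_distance() from the Problem 2 sketch is in scope.
    import random

    random.seed(1)
    strings = ["".join(random.choices("ACGT", k=10)) for _ in range(6)]
    R = "".join(random.choices("ACGT", k=10))    # any reference string, standing in for S*

    def consensus_error(T, S):
        return sum(edit_distance(T, Si) for Si in S)

    E_R = consensus_error(R, strings)
    for T in strings:
        assert consensus_error(T, strings) <= (len(strings) - 2) * edit_distance(T, R) + E_R
    print("the part (a) bound holds for every T in S")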
Problem 4: Example substitution matrix
Let’s say we would like to build a DNA substitution matrix (4x4 matrix) optimized for finding 88% identity alignments.
• assume the background frequencies are identical, i.e. pi = 0.25 for each
nucleotide i
• assume that all matches are equally probable
• assume that all mismatches are equally probable
(a) Compute qij for all i and j.
(b) Construct the matrix using the log-likelihood ratio.
(c) Choose a scaling factor λ to make the substitution matrix close to an integer
matrix.
Solution: Since the identity is 88%, and all matches are equally probable,
each match occurs with probability 0.22. Therefore, qAA = qGG = qCC =
qT T = 0.22. Since all mismatches (12 possibilities) are equally probable, we
have qij = (1 − 0.88)/12 = 0.12/12 = 0.01 for all i ≠ j, i, j ∈ {A, G, C, T}.
Now pA = pG = pC = pT = 0.25, therefore:
log(qii / (pi pi)) = log(0.22 / 0.25²) = 1.26
log(qij / (pi pj)) = log(0.01 / 0.25²) = −1.83
Scaling by 1/λ, where λ = 0.18, and rounding, we get a scoring scheme of +7 for a match and −10 for a mismatch. The matrix is:
  7  -10  -10  -10
-10    7  -10  -10
-10  -10    7  -10
-10  -10  -10    7
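The arithmetic above can be double-checked with a few lines of Python (illustrative; the natural logarithm is assumed, which is what yields 1.26 and −1.83):

    import math

    p = 0.25
    q_match = 0.88 / 4             # 0.22
    q_mismatch = (1 - 0.88) / 12   # 0.01
    lam = 0.18

    s_match = math.log(q_match / (p * p))        # ~  1.26
    s_mismatch = math.log(q_mismatch / (p * p))  # ~ -1.83
    print(round(s_match, 2), round(s_mismatch, 2))
    print(round(s_match / lam), round(s_mismatch / lam))   # +7 and -10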
Problem 5: Unraveling BLOSUM62
(a) Find the 20×20 BLOSUM62 substitution matrix online. BLOSUM62 has the property that the background probabilities and the observed probabilities are consistent, i.e. pi = Σ_j qij.
Solution:
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
R   -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
N   -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
D   -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
C    0  -3  -3  -3   9  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
Q   -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
E   -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
G    0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
H   -2   0   1  -1  -3   0   0  -2   8  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
K   -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
M   -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7  -1  -1  -4  -3  -2
S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3
Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7  -1
V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
(b) Given a symmetric and consistent substitution matrix S, with a scaling factor λ, let M be the matrix e^{λS} (entrywise). Note Mij = qij / (pi pj). Let Y be the inverse of M (assuming M is invertible). Show that the sum of the ith column (or row) of Y must be equal to the background probability pi. Hint: Consider the vector p = [p1, ..., pn]. Show that pM = [1, ..., 1]. Use this result to compute pMY in two ways.
Solution: The ith entry of pM is
p1 M1i + p2 M2i + ... + pn Mni = p1 · q1i/(p1 pi) + p2 · q2i/(p2 pi) + ... + pn · qni/(pn pi)
= (1/pi)(q1i + q2i + ... + qni) = (1/pi) · pi = 1.
The last equality follows because the matrix is symmetric and consistent. Therefore, pM = [1, ..., 1]. So p = pMY = [1, ..., 1] Y. This shows that the sum of the ith column of Y is equal to pi.
(c) Using the above strategy, and knowing that λ = 0.3176 for BLOSUM62,
find the background probabilities pi for BLOSUM62.
(d) Using part (c), find the observed set of probabilities qij .
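Parts (c) and (d) can be carried out numerically with the strategy from part (b). The NumPy sketch below is an illustration added to this write-up; it uses the 20×20 integer matrix from part (a) (rows and columns in the order A R N D C Q E G H I L K M F P S T W Y V) and λ = 0.3176. Because the published matrix is rounded to integers, the recovered pi and qij are only approximate, and the pi need not sum to exactly 1.

    import numpy as np

    # BLOSUM62 integer matrix from part (a); order A R N D C Q E G H I L K M F P S T W Y V
    B = np.array([
        [ 4,-1,-2,-2, 0,-1,-1, 0,-2,-1,-1,-1,-1,-2,-1, 1, 0,-3,-2, 0],
        [-1, 5, 0,-2,-3, 1, 0,-2, 0,-3,-2, 2,-1,-3,-2,-1,-1,-3,-2,-3],
        [-2, 0, 6, 1,-3, 0, 0, 0, 1,-3,-3, 0,-2,-3,-2, 1, 0,-4,-2,-3],
        [-2,-2, 1, 6,-3, 0, 2,-1,-1,-3,-4,-1,-3,-3,-1, 0,-1,-4,-3,-3],
        [ 0,-3,-3,-3, 9,-3,-4,-3,-3,-1,-1,-3,-1,-2,-3,-1,-1,-2,-2,-1],
        [-1, 1, 0, 0,-3, 5, 2,-2, 0,-3,-2, 1, 0,-3,-1, 0,-1,-2,-1,-2],
        [-1, 0, 0, 2,-4, 2, 5,-2, 0,-3,-3, 1,-2,-3,-1, 0,-1,-3,-2,-2],
        [ 0,-2, 0,-1,-3,-2,-2, 6,-2,-4,-4,-2,-3,-3,-2, 0,-2,-2,-3,-3],
        [-2, 0, 1,-1,-3, 0, 0,-2, 8,-3,-3,-1,-2,-1,-2,-1,-2,-2, 2,-3],
        [-1,-3,-3,-3,-1,-3,-3,-4,-3, 4, 2,-3, 1, 0,-3,-2,-1,-3,-1, 3],
        [-1,-2,-3,-4,-1,-2,-3,-4,-3, 2, 4,-2, 2, 0,-3,-2,-1,-2,-1, 1],
        [-1, 2, 0,-1,-3, 1, 1,-2,-1,-3,-2, 5,-1,-3,-1, 0,-1,-3,-2,-2],
        [-1,-1,-2,-3,-1, 0,-2,-3,-2, 1, 2,-1, 5, 0,-2,-1,-1,-1,-1, 1],
        [-2,-3,-3,-3,-2,-3,-3,-3,-1, 0, 0,-3, 0, 6,-4,-2,-2, 1, 3,-1],
        [-1,-2,-2,-1,-3,-1,-1,-2,-2,-3,-3,-1,-2,-4, 7,-1,-1,-4,-3,-2],
        [ 1,-1, 1, 0,-1, 0, 0, 0,-1,-2,-2, 0,-1,-2,-1, 4, 1,-3,-2,-2],
        [ 0,-1, 0,-1,-1,-1,-1,-2,-2,-1,-1,-1,-1,-2,-1, 1, 5,-2,-2, 0],
        [-3,-3,-4,-4,-2,-2,-3,-2,-2,-3,-2,-3,-1, 1,-4,-3,-2,11, 2,-3],
        [-2,-2,-2,-3,-2,-1,-2,-3, 2,-1,-1,-2,-1, 3,-3,-2,-2, 2, 7,-1],
        [ 0,-3,-3,-3,-1,-2,-2,-3,-3, 3, 1,-2, 1,-1,-2,-2, 0,-3,-1, 4],
    ])

    lam = 0.3176
    M = np.exp(lam * B)           # entrywise; M_ij ~ q_ij / (p_i p_j)
    Y = np.linalg.inv(M)
    p = Y.sum(axis=0)             # part (c): p_i is the sum of the i-th column of Y
    q = np.outer(p, p) * M        # part (d): q_ij = p_i p_j M_ij

    print(p.round(4))
    print("sum of p:", round(float(p.sum()), 4))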