Realistic Gap Models
No-gap Alignment
So far, we have treated the gap symbol “-” as just another character, denoting an individual
insertion or deletion. However, this view is not always adequate. Sometimes we want no-gap
alignments. For example, a family of proteins may contain a strongly conserved subunit which is
the site of some protein-protein interaction. Any insertion or deletion in the chain of amino acids would
be likely to destroy its biochemical function. We want to align such regions using
matches/replacements only. This can, of course, be achieved by a very simple algorithm, but our
dynamic programming algorithm can also be geared to do this by setting the costs for insertion and
deletion to infinity (or something close to it). An optimal alignment will then not use any gaps.
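A minimal sketch of this idea in Python, assuming a distance-minimizing scheme with a caller-supplied replacement cost function w (a hypothetical parameter, not fixed by the text); setting the indel cost to infinity makes every alignment that uses a gap infinitely expensive:

```python
import math

def no_gap_alignment_cost(s, t, w):
    """Distance DP with insertion/deletion cost set to infinity.

    w(a, b) is the replacement cost for aligning characters a and b.
    With infinite gap costs a finite result exists only if len(s) == len(t),
    in which case the optimal alignment is the column-wise one.
    """
    m, n = len(s), len(t)
    INF = math.inf  # "infinity" forbids any use of the gap symbol
    # d[i][j] = cost of aligning the prefix s[:i] with the prefix t[:j]
    d = [[INF] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # only match/replacement is affordable; both gap moves cost INF
            d[i][j] = min(d[i - 1][j - 1] + w(s[i - 1], t[j - 1]),
                          d[i - 1][j] + INF,   # deletion of s[i-1]
                          d[i][j - 1] + INF)   # insertion of t[j-1]
    return d[m][n]
```

For instance, no_gap_alignment_cost("ACGT", "AGGT", lambda a, b: 0 if a == b else 1) yields 1, while any pair of sequences of different length yields infinity.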
Block-indel
Sometimes, from an evolutionary point of view, it is more realistic to assume that nature inserts
or deletes entire substrings as a unit. This is called the block-indel model. It means that we charge a
certain set-up cost for introducing a new gap, whereas extending an existing gap is less expensive.
For example, a linear gap cost function has the form $gapCost(n) = g_i + (n-1)\,g_e$, where $n$ is the
length of the gap (the number of consecutive “-”) and $g_i \ge g_e > 0$. Our dynamic programming scheme
can be adapted to this model without much effect on its efficiency.
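As a small illustration, the gap cost function itself can be written as follows; the numeric penalty values are made-up placeholders, and the full adaptation of the dynamic programming scheme (which must keep track of whether a gap is being opened or extended) is not shown here:

```python
def gap_cost(n, g_open=3.0, g_extend=1.0):
    """Linear gap cost gapCost(n) = g_open + (n - 1) * g_extend.

    n is the gap length (number of consecutive "-"); the model requires
    g_open >= g_extend > 0. The default values are illustrative only.
    """
    if n <= 0:
        return 0.0
    return g_open + (n - 1) * g_extend
```

With these placeholder values, gap_cost(1) is 3.0 and gap_cost(4) is 6.0, so one long gap is cheaper than several short gaps of the same total length, as the block-indel model intends.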
Variations of Pairwise Alignment
Local Alignment
The notion of edit distance and its implementation via dynamic programming are easily adapted
to variations of the original problem. Two such variations are discussed here.
We first discuss the problem of local alignment, where s is relatively short with respect to t and
we seek the subunit of t with which s aligns best. More precisely: given $0:s:m$ and $0:t:n$, find
$i:t:j$ such that $d_w(s,\ i:t:j)$ is minimal among all choices of $0 \le i \le j \le n$. Very sophisticated
methods have been developed to do this, but a simple change to the dynamic programming scheme
will also do the job, albeit a little more slowly.
(Local Alignment Recursion) Local alignment means that we do not charge costs for
1. deletion of a prefix $0:t:i$, which means that $d_w(0:s:0,\ 0:t:i) = 0$; hence we initialize
the first row of the distance matrix accordingly: $d_{0,j} = 0$ for $0 \le j \le n$.
2. deletion of a suffix $j:t:n$, which means that
$$d_w(0:s:m,\ 0:t:j) = \min\{\, d_w(0:s:(m-1),\ 0:t:(j-1)) + w(s_m, t_j),\ \ d_w(0:s:(m-1),\ 0:t:j) + w(s_m, -),\ \ d_w(0:s:m,\ 0:t:(j-1)) \,\}.$$
So the last row of the cost matrix is calculated according to
$$d_{m,j} = \min\{\, d_{m-1,j-1} + w(s_m, t_j),\ d_{m-1,j} + w(s_m, -),\ d_{m,j-1} \,\}.$$
As before, $d_{m,n}$ gives the cost of the optimal local alignment. The matching subsequence $i:t:j$ is
then found as follows:
$j = \min\{ k \mid d_{m,k} = d_{m,n} \}$; $i$ is the point where the optimal path leading to $d_{m,j}$ starts from the first row.
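A minimal Python sketch of this modified scheme, under the assumption that the cost functions are supplied by the caller (the names w and gap are placeholders, not fixed by the text); tracing the optimal path back to find the left boundary i is omitted:

```python
def local_alignment_cost(s, t, w, gap):
    """Align the (short) sequence s against the best-fitting substring of t.

    w(a, b) : replacement cost for characters a and b (assumed signature)
    gap(a)  : cost of aligning character a with the gap symbol "-"

    Deleting a prefix or suffix of t is free; everything else follows the
    usual distance-minimizing recursion.
    """
    m, n = len(s), len(t)
    # d[i][j] = d_w(0:s:i, 0:t:j)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    # first row stays zero: deleting a prefix 0:t:j costs nothing
    # first column: deleting a prefix of s is charged as usual
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + gap(s[i - 1])
    for i in range(1, m + 1):
        last_row = (i == m)
        for j in range(1, n + 1):
            # in the last row, skipping t[j-1] (suffix deletion) is free
            skip_tj = d[i][j - 1] + (0.0 if last_row else gap(t[j - 1]))
            d[i][j] = min(d[i - 1][j - 1] + w(s[i - 1], t[j - 1]),
                          d[i - 1][j] + gap(s[i - 1]),
                          skip_tj)
    best = d[m][n]
    # right boundary j of the matching substring i:t:j
    j_end = min(k for k in range(n + 1) if d[m][k] == best)
    return best, j_end  # finding i would require tracing the optimal path back
```

With unit costs, for example, local_alignment_cost("GCG", "ATGCGT", lambda a, b: 0 if a == b else 1, lambda a: 1) returns cost 0 with j = 5, i.e. the exact occurrence of "GCG" inside t.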
(Local Similarity) For the second variation of our similarity theme, consider the following case. Let
s and t be two proteins which we suspect to carry some functionally related subunits. However, most
parts of s and t do not contribute to this function and may well be very different. A pairwise
alignment of s and t would be unlikely to clearly exhibit small regions of high similarity, as it is
geared to minimize distance over the full length of both sequences. Thus we are faced with another
problem, this one being symmetric between s and t: we ask for those subunits of s and t that exhibit
the most similarity. This is called the local similarity problem. Here we use a scoring function w with
$w(a,b) > 0$ if a and b are similar, $w(a,b) < 0$ if a and b are not similar, and in particular $w(a,-) < 0$ and $w(-,b) < 0$.
This still leaves a lot of freedom to differentiate degrees of (dis-)similarity. The point is that we use
the score 0 as a cut-off value between subsequences with and without similarity. We are now maximizing
similarity rather than minimizing distance. The border cases now become $d_{0,j} = 0$ for $0 \le j \le n$ and
$d_{i,0} = 0$ for $0 \le i \le m$, and the general recursion formula is
$$d_{i,j} = \max\{\, 0,\ d_{i-1,j-1} + w(s_i, t_j),\ d_{i-1,j} + w(s_i, -),\ d_{i,j-1} + w(-, t_j) \,\}.$$
The cut-off value of zero means that long stretches of dissimilarity show as regions of zeroes in the
matrix, from which stretches of local similarity rise as islands of positive values.
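A minimal Python sketch of this maximizing recursion, essentially a Smith-Waterman-style computation; the scoring function and the gap penalty are caller-supplied placeholders, and the traceback needed to recover the similar subunits themselves is omitted:

```python
def local_similarity(s, t, score, gap_penalty=-2.0):
    """Local similarity dynamic programming.

    score(a, b) : > 0 for similar characters, < 0 for dissimilar ones
    gap_penalty : score contribution of aligning a character with "-", < 0

    Returns the best local similarity score and the matrix cell where it
    occurs; the similar subunits would be recovered by tracing back from
    that cell until a zero entry is reached.
    """
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]  # border cases are all zero
    best, best_cell = 0.0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = max(0.0,                                    # cut-off at zero
                          d[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                          d[i - 1][j] + gap_penalty,              # s[i-1] vs "-"
                          d[i][j - 1] + gap_penalty)              # "-" vs t[j-1]
            if d[i][j] > best:
                best, best_cell = d[i][j], (i, j)
    return best, best_cell
```

Calling local_similarity with, say, score(a, b) = +1 for identical characters and -1 otherwise produces exactly the behaviour described above: a mostly zero matrix with islands of positive values marking the locally similar regions.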
Heuristic Methods
(Edit Distance Calculation Complexity) Let us now consider the computational resources needed for
calculating edit distances. The dynamic programming scheme calculates (for input sequences
$0:s:m$ and $0:t:n$) $m \cdot n$ matrix entries. Each matrix entry can be calculated from the three
adjacent entries with a fixed number of operations, so the execution time of this program is
proportional to $m \cdot n$. In the jargon of computer science, we say that its time complexity is $O(mn)$.