Realistic Gap Models
No-gap Alignment
So far, we have treated the gap symbol “-” as just another character, denoting an individual
insertion or deletion. However, this view is not always adequate. Sometimes we want no-gap
alignments. For example, a family of proteins may contain a strongly conserved subunit which is
the site of some protein-protein interaction. Any insertion or deletion in the chain of amino acids would
be likely to destroy its biochemical function. We want to align such regions using
matches/replacements only. This can, of course, be achieved by a very simple algorithm, but our
dynamic programming algorithm can also be geared to do this by setting the costs for insertion and
deletion to infinity (or something close to it). An optimal alignment will then not use any gaps.
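A minimal sketch of this idea in Python, assuming a distance-minimizing scheme with a caller-supplied replacement cost function w (a hypothetical parameter, not fixed by the text); setting the indel cost to infinity makes every alignment that uses a gap infinitely expensive:

```python
import math

def no_gap_alignment_cost(s, t, w):
    """Distance DP with insertion/deletion cost set to infinity.

    w(a, b) is the replacement cost for aligning characters a and b.
    With infinite gap costs a finite result exists only if len(s) == len(t),
    in which case the optimal alignment is the column-wise one.
    """
    m, n = len(s), len(t)
    INF = math.inf  # "infinity" forbids any use of the gap symbol
    # d[i][j] = cost of aligning the prefix s[:i] with the prefix t[:j]
    d = [[INF] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # only match/replacement is affordable; both gap moves cost INF
            d[i][j] = min(d[i - 1][j - 1] + w(s[i - 1], t[j - 1]),
                          d[i - 1][j] + INF,   # deletion of s[i-1]
                          d[i][j - 1] + INF)   # insertion of t[j-1]
    return d[m][n]
```

For instance, no_gap_alignment_cost("ACGT", "AGGT", lambda a, b: 0 if a == b else 1) yields 1, while any pair of sequences of different length yields infinity.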
Block-indel
Sometimes, from an evolutionary point of view, it is more realistic to assume that nature inserts
or deletes entire substrings as a unit. This is called the block-indel model. It means that we charge a
certain set-up cost for introducing a new gap, whereas extending an existing gap is less expensive.
For example, a linear gap cost function has the form $gapCost(n) = g_i + (n-1)\,g_e$, where $n$ is the
length of the gap (the number of consecutive “-”) and $g_i \ge g_e > 0$. Our dynamic programming scheme
can be adapted to this model without much effect on its efficiency.
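As a small illustration, the gap cost function itself can be written as follows; the numeric penalty values are made-up placeholders, and the full adaptation of the dynamic programming scheme (which must keep track of whether a gap is being opened or extended) is not shown here:

```python
def gap_cost(n, g_open=3.0, g_extend=1.0):
    """Linear gap cost gapCost(n) = g_open + (n - 1) * g_extend.

    n is the gap length (number of consecutive "-"); the model requires
    g_open >= g_extend > 0. The default values are illustrative only.
    """
    if n <= 0:
        return 0.0
    return g_open + (n - 1) * g_extend
```

With these placeholder values, gap_cost(1) is 3.0 and gap_cost(4) is 6.0, so one long gap is cheaper than several short gaps of the same total length, as the block-indel model intends.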
Variations of Pairwise Alignment
Local Alignment
The notion of edit distance and its implementation via dynamic programming are easily adapted
to variations of the original problem. Two such variations are discussed here.
We first discuss the problem of local alignment, where s is relatively short with respect to t and
we seek the subunit of t with which s aligns best. More precisely: given $0:s:m$ and $0:t:n$, find
$i:t:j$ such that $d_w(s,\ i:t:j)$ is minimal among all choices of $0 \le i \le j \le n$. Very sophisticated
methods have been developed to do this, but a simple change to the dynamic programming scheme
will also do the job, albeit a little more slowly.
(Local Alignment Recursion) Local alignment means that we do not charge costs for
1. deletion of a prefix $0:t:i$, which means that $d_w(0:s:0,\ 0:t:i) = 0$; hence we initialize
the first row of the distance matrix accordingly: $d_{0,j} = 0$ for $0 \le j \le n$.
2. deletion of a suffix $j:t:n$, which means that
$$d_w(0:s:m,\ 0:t:j) = \min\{\, d_w(0:s:(m-1),\ 0:t:(j-1)) + w(s_m, t_j),\ \ d_w(0:s:(m-1),\ 0:t:j) + w(s_m, -),\ \ d_w(0:s:m,\ 0:t:(j-1)) \,\}.$$
So the last row of the cost matrix is calculated according to
$$d_{m,j} = \min\{\, d_{m-1,j-1} + w(s_m, t_j),\ d_{m-1,j} + w(s_m, -),\ d_{m,j-1} \,\}.$$
As before, $d_{m,n}$ gives the cost of the optimal local alignment. The matching subsequence $i:t:j$ is
then found as follows:
$j = \min\{ k \mid d_{m,k} = d_{m,n} \}$; $i$ is the point where the optimal path leading to $d_{m,j}$ starts from the first row.
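A minimal Python sketch of this modified scheme, under the assumption that the cost functions are supplied by the caller (the names w and gap are placeholders, not fixed by the text); tracing the optimal path back to find the left boundary i is omitted:

```python
def local_alignment_cost(s, t, w, gap):
    """Align the (short) sequence s against the best-fitting substring of t.

    w(a, b) : replacement cost for characters a and b (assumed signature)
    gap(a)  : cost of aligning character a with the gap symbol "-"

    Deleting a prefix or suffix of t is free; everything else follows the
    usual distance-minimizing recursion.
    """
    m, n = len(s), len(t)
    # d[i][j] = d_w(0:s:i, 0:t:j)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    # first row stays zero: deleting a prefix 0:t:j costs nothing
    # first column: deleting a prefix of s is charged as usual
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + gap(s[i - 1])
    for i in range(1, m + 1):
        last_row = (i == m)
        for j in range(1, n + 1):
            # in the last row, skipping t[j-1] (suffix deletion) is free
            skip_tj = d[i][j - 1] + (0.0 if last_row else gap(t[j - 1]))
            d[i][j] = min(d[i - 1][j - 1] + w(s[i - 1], t[j - 1]),
                          d[i - 1][j] + gap(s[i - 1]),
                          skip_tj)
    best = d[m][n]
    # right boundary j of the matching substring i:t:j
    j_end = min(k for k in range(n + 1) if d[m][k] == best)
    return best, j_end  # finding i would require tracing the optimal path back
```

With unit costs, for example, local_alignment_cost("GCG", "ATGCGT", lambda a, b: 0 if a == b else 1, lambda a: 1) returns cost 0 with j = 5, i.e. the exact occurrence of "GCG" inside t.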
(Local Similarity) For the second variation of our similarity theme, consider the following case. Let
s and t be two proteins which we suspect to carry some functionally related subunits. However, most
parts of s and t do not contribute to this function and may well be very different. A pairwise
alignment of s and t would be unlikely to clearly exhibit small regions of high similarity, as it is
geared to minimize distance over the full length of both sequences. Thus we are faced with another
problem, this one being symmetric between s and t: we ask for those subunits of s and t that exhibit
the most similarity. This is called the local similarity problem. Here we use a scoring function w with
$w(a,b) > 0$ if a and b are similar, $w(a,b) < 0$ if a and b are not similar, and in particular $w(a,-) < 0$ and $w(-,b) < 0$.
This still leaves a lot of freedom to differentiate degrees of (dis-)similarity. The point is that we use
the score 0 as a cut-off value between subsequences with and without similarity. We are now maximizing
similarity rather than minimizing distance. The border cases now become $d_{0,j} = 0$ for $0 \le j \le n$ and
$d_{i,0} = 0$ for $0 \le i \le m$, and the general recursion formula is
$$d_{i,j} = \max\{\, 0,\ d_{i-1,j-1} + w(s_i, t_j),\ d_{i-1,j} + w(s_i, -),\ d_{i,j-1} + w(-, t_j) \,\}.$$
The cut-off value of zero means that long stretches of dissimilarity show as regions of zeroes in the
matrix, from which stretches of local similarity rise as islands of positive values.
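A minimal Python sketch of this maximizing recursion, essentially a Smith-Waterman-style computation; the scoring function and the gap penalty are caller-supplied placeholders, and the traceback needed to recover the similar subunits themselves is omitted:

```python
def local_similarity(s, t, score, gap_penalty=-2.0):
    """Local similarity dynamic programming.

    score(a, b) : > 0 for similar characters, < 0 for dissimilar ones
    gap_penalty : score contribution of aligning a character with "-", < 0

    Returns the best local similarity score and the matrix cell where it
    occurs; the similar subunits would be recovered by tracing back from
    that cell until a zero entry is reached.
    """
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]  # border cases are all zero
    best, best_cell = 0.0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = max(0.0,                                    # cut-off at zero
                          d[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                          d[i - 1][j] + gap_penalty,              # s[i-1] vs "-"
                          d[i][j - 1] + gap_penalty)              # "-" vs t[j-1]
            if d[i][j] > best:
                best, best_cell = d[i][j], (i, j)
    return best, best_cell
```

Calling local_similarity with, say, score(a, b) = +1 for identical characters and -1 otherwise produces exactly the behaviour described above: a mostly zero matrix with islands of positive values marking the locally similar regions.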
Heuristic Methods
(Edit Distance Calculation Complexity) Let us now consider the computational resources needed for
calculating edit distances. The dynamic programming scheme calculates (for input sequences
$0:s:m$ and $0:t:n$) $m \cdot n$ matrix entries. Each matrix entry can be calculated from the three
adjacent entries with a fixed number of operations, so the execution time of this program is
proportional to $m \cdot n$. In the jargon of computer science, we say that its time complexity is $O(mn)$.