Algorithms in Computational Biology 236522, Spring 2007
Home Assignment No. 2 – Sequence Alignment
Publication date: 12.4.07
Due date: 26.4.07 (to Ilan Gronau’s mailbox (#114) on the 5th floor).

1. Many biological sequences (DNA/protein) contain homologous subsequences within themselves. Consider the following problem of finding inexact repeats within a given sequence: given a sequence $S$, we wish to find two distinct subsequences within $S$ which have an optimal alignment between them.
   a. Formally define the search space of the problem; in particular, describe in detail the properties of a valid output. Argue why we cannot simply use the Smith-Waterman algorithm for local alignment to solve it.
   b. Suggest an $O(n^2)$ algorithm for finding such a pair of subsequences when they are allowed to overlap.
   c. Suggest an $O(n^3)$ algorithm for finding such a pair of subsequences when they are not allowed to overlap.
   In both cases, briefly describe your algorithm, explain why it is correct, and analyze its time and space complexity.

2. $S = s_1 s_2 \dots s_n$ is a non-contiguous subsequence of $T$ iff $T = w_0 s_1 w_1 s_2 w_2 \dots w_{n-1} s_n w_n$ for some sequences $w_0, w_1, \dots, w_n$ (some of which may be empty). In such a case $T$ is considered a non-contiguous supersequence of $S$. Suggest an efficient algorithm for finding the shortest common non-contiguous supersequence $T$ of a pair of sequences $S_1$, $S_2$. Explain its time/space complexity and prove its correctness.

3. Recall the 2-approximation algorithm for best SP-score multiple alignment shown in the tutorial (also known as the star algorithm). Show that the choice of center sequence the algorithm makes is not always optimal: give an example of a set of sequences where choosing a different sequence as the center (i.e. $S_1$) yields a better multiple sequence alignment under edit distance (i.e. match: 0; indel/mismatch: 1).

4. Recall the generalized DP algorithm for optimal multiple alignment.
Let $i = (i_1, \dots, i_k)$ be a given cell in the $k$-dimensional matrix used to align $k$ sequences $S_1, \dots, S_k$. Assume that we know that some alignment of the $k$ sequences has score $L$. For each pair $1 \le u < v \le k$, let $a(u,v)$ be the score of an optimal pairwise alignment of $S_u$ and $S_v$ which passes through cell $(i_u, i_v)$ (in the 2-dimensional matrix of their pairwise alignment). Prove that no optimal multiple alignment (of the $k$ sequences) passes through cell $i$ when $\sum_{1 \le u < v \le k} a(u,v) > L$ (taking scores as costs to be minimized, as under the edit distance of Problem 3).

Good luck!
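
Note: the definitions used in Problems 2 and 3 can be turned into small helpers for sanity-checking candidate answers by hand. The following is a minimal Python sketch and not part of the assignment. All function names are hypothetical, and scoring a gap aligned against a gap as 0 in the SP-score is an assumption (the assignment only fixes match: 0 and indel/mismatch: 1).

```python
from itertools import combinations


def is_noncontiguous_subsequence(s, t):
    """True iff s can be obtained from t by deleting zero or more characters,
    i.e. t = w0 s1 w1 ... sn wn for some (possibly empty) strings w0..wn."""
    it = iter(t)
    # 'ch in it' consumes the iterator up to the first match, so order is kept.
    return all(ch in it for ch in s)


def edit_distance(a, b):
    """Unit-cost edit distance: match 0, mismatch 1, indel 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def sp_score(rows):
    """Sum-of-pairs cost of a multiple alignment given as equal-length rows.
    Two symbols in a column cost 0 if equal (assumption: gap vs gap is 0),
    and 1 otherwise."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")
    return sum(x != y
               for r1, r2 in combinations(rows, 2)
               for x, y in zip(r1, r2))
```

For example, a proposed common supersequence `T` of `S1` and `S2` can be checked with `is_noncontiguous_subsequence(S1, T) and is_noncontiguous_subsequence(S2, T)`, and two multiple alignments built from different center sequences can be compared by their `sp_score` values (lower is better under this cost convention).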