Algorithms in Computational Biology 236522, Spring 2007
Home Assignment No. 2 – Sequence Alignment
Publication date: 12.4.07
Due date: 26.4.07 (to Ilan Gronau’s mailbox (#114) on the 5th floor).
1. Many biological sequences (DNA/protein) contain within themselves homologous
subsequences. Consider the following problem of finding inexact repeats within a given
sequence. Given a sequence S, we wish to find two distinct subsequences within S which
have optimal alignment between them.
a. Formally define the search space of the problem; in particular describe in detail
the properties of a valid output. Argue why we can’t simply use the Smith-Waterman algorithm for local alignment to solve it.
b. Suggest an $O(n^2)$ algorithm for finding such a pair of subsequences, when they
are allowed to overlap.
c. Suggest an $O(n^3)$ algorithm for finding such a pair of subsequences, when they
are not allowed to overlap.

In both cases, briefly describe your algorithms. Explain why they are correct, and
analyze their (time/space) complexity.
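
For part (a), it can help to see what a plain local alignment of S against itself returns. The following is a minimal sketch, assuming a simple scoring scheme (match +1, mismatch -1, gap -1) that the assignment does not prescribe; the function name `smith_waterman_score` and the example sequence are illustrative only.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best local-alignment score, cell where it ends).

    The scoring values are assumptions made for this sketch; they are
    not taken from the assignment.
    """
    prev = [0] * (len(b) + 1)
    best, best_cell = 0, (0, 0)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell may restart at 0 instead of extending.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best, best_cell = curr[j], (i, j)
        prev = curr
    return best, best_cell


S = "ACGTTACGA"  # hypothetical example sequence
score, end = smith_waterman_score(S, S)
print(score, end)
```

Under these assumed scores, the best local alignment of S with itself is the trivial one that matches every position to itself (it ends at the last cell with score |S|), so by itself it says nothing about repeats.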
2. $S = s_1 s_2 \dots s_n$ is a non-contiguous subsequence of $T$ iff $T = w_0 s_1 w_1 s_2 w_2 \dots w_{n-1} s_n w_n$ for some
sequences $w_0, w_1, \dots, w_n$ (some of which may be empty). In such a case $T$ is considered a non-contiguous super-sequence of $S$.

Suggest an efficient algorithm for finding the shortest common non-contiguous super-sequence $T$ of a pair of sequences $S_1$, $S_2$. Explain its time/space complexity and prove
its correctness.
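
The definition above can be checked mechanically. The sketch below only tests whether one sequence is a non-contiguous subsequence of another; it does not search for a shortest super-sequence. The function name `is_subsequence` and the example strings are hypothetical.

```python
def is_subsequence(s, t):
    """True iff s can be obtained from t by deleting zero or more characters.

    Equivalently, t = w0 s1 w1 ... sn wn for some (possibly empty) w's.
    """
    remaining = iter(t)
    # `ch in remaining` consumes the iterator up to the first match, so
    # characters of s are matched in t from left to right, in order.
    return all(ch in remaining for ch in s)


# Hypothetical example: "ACGTA" is a common super-sequence of both strings.
s1, s2, t = "ACGT", "CGTA", "ACGTA"
print(is_subsequence(s1, t), is_subsequence(s2, t))
```

A candidate answer T to the question can be validated by running this check for both input sequences; showing that no shorter T exists needs a separate argument.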
3. Recall the 2-approximation algorithm for best SP-score multiple alignment shown in the
tutorial (also known as the star algorithm). Show that the algorithm's choice of center sequence
is not always optimal. Give an example of a set of sequences where the choice of a different
sequence as a center (i.e. $S_1$) yields a better multiple sequence alignment under edit distance
(i.e. match: 0; indel/mismatch: 1).
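
To experiment with candidate examples, the quantities involved can be computed directly. This is a minimal sketch, assuming the tutorial's star algorithm uses the usual rule of picking the sequence with the smallest sum of pairwise distances to all others, and assuming that a gap aligned with a gap costs 0 in the SP-score. The names `edit_distance`, `star_center`, and `sp_score` are illustrative.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance: match 0, mismatch 1, indel 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # gap in b
                            curr[j - 1] + 1,            # gap in a
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def star_center(seqs):
    """Index of the sequence minimizing the sum of distances to the others
    (assumed center-selection rule; ties go to the lowest index)."""
    return min(range(len(seqs)),
               key=lambda c: sum(edit_distance(seqs[c], s) for s in seqs))


def sp_score(alignment):
    """Sum-of-pairs cost of a multiple alignment given as equal-length rows,
    with '-' as the gap symbol. A gap facing a gap costs 0 (assumption)."""
    assert len({len(row) for row in alignment}) == 1
    return sum(x != y
               for r1, r2 in combinations(alignment, 2)
               for x, y in zip(r1, r2))


print(star_center(["AAAA", "AAAT", "TTTT"]))
print(sp_score(["AC-T", "A-GT", "ACGT"]))
```

With these helpers, one can compare the SP-score of the alignment built around the selected center against an alignment built around another sequence.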
4. Recall the generalized DP algorithm for optimal multiple alignment. Let $i = (i_1, \dots, i_k)$ be a
given cell in the $k$-dimensional matrix used to align $k$ sequences $S_1, \dots, S_k$. Assume that we
know that some alignment of the $k$ sequences has score $L$. For each pair $1 \le u < v \le k$, let
$a(u,v)$ be the score of an optimal pairwise alignment of $S_u$ and $S_v$ which passes through cell
$(i_u, i_v)$ (in the 2-dimensional matrix of their pairwise alignment). Prove that no optimal
multiple alignment (of the $k$ sequences) passes through cell $i$ when $\sum_{1 \le u < v \le k} a(u,v) > L$.
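
The quantity a(u,v) can be computed by combining a prefix alignment and a suffix alignment that meet at the given cell. The sketch below assumes scores are unit-cost edit distances that are minimized, so the condition is read as "the sum of the pairwise costs exceeds L"; this reading, and the names `through_cell_cost` and `exceeds_bound`, are assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance: match 0, mismatch 1, indel 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def through_cell_cost(a, b, p, q):
    """Cost of a cheapest global alignment of a and b forced through cell
    (p, q): best alignment of the prefixes plus best alignment of the suffixes."""
    return edit_distance(a[:p], b[:q]) + edit_distance(a[p:], b[q:])


def exceeds_bound(seqs, cell, L):
    """True iff the sum over all pairs u < v of a(u, v) is greater than L
    (assumed cost-minimization reading of the condition)."""
    total = sum(through_cell_cost(seqs[u], seqs[v], cell[u], cell[v])
                for u, v in combinations(range(len(seqs)), 2))
    return total > L


seqs = ["ACGT", "ACGT", "ACGT"]  # hypothetical input
print(exceeds_bound(seqs, (4, 0, 0), 3))
print(exceeds_bound(seqs, (2, 2, 2), 0))
```

For three identical sequences, a cell far from the main diagonal forces expensive pairwise alignments, while a cell on the diagonal does not.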
Good luck!