Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
An Optimal Algorithm for MAX-SUM SEGMENT & Its Applications in Bioinformatics Tsai-Hung Fan, Shufen Lee, HsuehI Lu Tsung-Shan Tsou, Tsai-Cheng Wang, and Adam Yao Applications to biomolecular sequence analysis Locating GC-rich regions Computing ungapped local alignments with length constraints Locating GC-rich regions with length constraints w(G) = w(C) = 1-p w(A) = w(T) = -p Find the maximum sum of regions. The maximum sum without length constrains can be computed by the following recurrence: Computing ungapped local alignments with length constraints A G T …… A C T A …… G O( m n ) The idea of left-negative Lemma 1. Every sequence of real numbers can be uniquely partitioned into minimal leftnegative segments. Lemma 2. We can find all left-negative pointers for a length n sequence in O(n) time. The left-negative pointers L U i j An on-line algorithm following slides come from Hsueh-I Lu’s Website. The Geometric representation height of j-th point = prefix-sum(j) S = 2 -1 1 1 -2 3 -2 1 2 j Sum of S[i, j] = prefix-sum(j) – prefix-sum(i – 1) For each ending index j, Maximizing the sum of S[i, j] is equivalent to minimizing prefix-sum(i – 1). Lowest point in [j-U, j-L] valley(j) j –U Feasible index set F(j) for index j. j –L j valley(j) Clearly, for each ending index j, 1+valley(j) has to be the best starting index. In other words, 1+valley(j) maximizes the sum of S[i, j] over all feasible starting indices i. As a result The MAX-SUM SEGMENT problem is linear-time reducible to computing valley(j) for all indices j. More specifically, the solution to MAX-SUM SEGMENT is simply the segment S[1+valley(j), j] with maximum sum over all indices j. Computing valley(j) for all indices j in O(n) time Case 1: U is ineffective (e.g.,U = n) Case 2: U is effective Case 1: U is ineffective F(j) j j+1 F(j + 1) j+2 F(j + 2) F(j + 1) = F(j) ∪ {j-L+1} if prefix-sum( j–L+1) < prefix-sum(valley(j)) let valley(j + 1) = j – L + 1; otherwise, let valley(j + 1) = valley(j); O(1) time to compute valley(j + 1) from valley(j) O(n) time to compute valley(j) for all j O(n) time to compute a maxsum segment. Case 2: U is effective F(j) j F(j + 1) F(j + 2) j+1 j+1 We need a data structure D(j) for the indices in F(j) The specification: valley(j) can be obtained from D(j) in O(1) time. D(j + 1) can be updated from D(j) in amortized O(1) time. A solution D(j) be a list of indices X(1), X(2)… such that Let X(1) is the lowest point in F(j) = [j –U, j–L]; That is, X(1) = valley(j). X(2) is the lowest point in [X(1)+1, j–L]; X(3) is the lowest point in [X(2)+1, j–L]; … j-U j-L D[j] = X 1 2 3 4 Spec. #1 valley(j) can be obtained from D(j) in O(1) time? Yes! Just read the first number in D(j). Spec. #2 D(j+1) can be updated from D(j) in amortized O(1) time? Yes! Two steps: (1) If valley(j) is not in F(j+1), we just delete the first number from the list. (2) Scan the list of indices in D(j) from right to left to see where j-U+1 fits in. time complexity The overall time complexity is O(n), although some iterations may take more than constant time. On-line processing In the j-th iteration, when S[j] is received, prefix-sum(j) can be computed on the fly valley(j) can be computed “interatively”, So, our algorithm can process the input sequence in an online manner. the required working space is O(U – L + 1). The End