Download Document

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
An Optimal Algorithm for
MAX-SUM SEGMENT
& Its Applications in Bioinformatics
Tsai-Hung Fan, Shufen Lee, HsuehI Lu Tsung-Shan Tsou, Tsai-Cheng
Wang, and Adam Yao
Applications to biomolecular
sequence analysis
 Locating
GC-rich regions
 Computing ungapped local alignments with
length constraints
Locating GC-rich regions with length
constraints
w(G) = w(C) = 1-p
 w(A) = w(T) = -p
 Find the maximum sum of regions.
 The maximum sum without length constrains can
be computed by the following recurrence:

Computing ungapped local
alignments with length constraints
A
G
T …… A
C
T
A
……
G
O( m n )
The idea of left-negative
 Lemma
1. Every sequence of real numbers
can be uniquely partitioned into minimal leftnegative segments.
 Lemma 2. We can find all left-negative
pointers for a length n sequence in O(n)
time.
The left-negative pointers
L
U
i
j
An on-line algorithm
following slides come from Hsueh-I Lu’s
Website.
 The
Geometric representation
height of j-th point
=
prefix-sum(j)
S = 2 -1 1 1 -2 3 -2
1
2
j
Sum of S[i, j] = prefix-sum(j) –
prefix-sum(i – 1)
 For
each ending index j,
Maximizing the sum of S[i, j] is equivalent to
minimizing prefix-sum(i – 1).
Lowest point in [j-U, j-L]
valley(j)
j –U
Feasible index
set F(j) for index
j.
j –L
j
valley(j)
 Clearly,
for each ending index j, 1+valley(j)
has to be the best starting index.
 In other words, 1+valley(j) maximizes the
sum of S[i, j] over all feasible starting indices
i.
As a result
 The
MAX-SUM SEGMENT problem is
linear-time reducible to computing valley(j)
for all indices j.
 More specifically, the solution to MAX-SUM
SEGMENT is simply the segment
S[1+valley(j), j] with maximum sum over all
indices j.
Computing valley(j) for all
indices j in O(n) time
Case 1: U is ineffective (e.g.,U = n)
Case 2: U is effective
Case 1: U is ineffective
F(j)
j
j+1
F(j + 1)
j+2
F(j + 2)
F(j + 1) = F(j) ∪ {j-L+1}
if prefix-sum( j–L+1) < prefix-sum(valley(j))
let valley(j + 1) = j – L + 1;
otherwise,
let valley(j + 1) = valley(j);
O(1) time to compute
valley(j + 1) from
valley(j)
O(n) time to
compute
valley(j) for all j
O(n) time to
compute a maxsum segment.
Case 2: U is effective
F(j)
j
F(j + 1)
F(j + 2)
j+1
j+1
We need a data structure D(j) for
the indices in F(j)
 The
specification:
valley(j) can be obtained from D(j) in O(1) time.
D(j + 1) can be updated from D(j) in amortized
O(1) time.
A solution
D(j) be a list of indices X(1), X(2)… such
that
 Let
X(1) is the lowest point in F(j) = [j –U, j–L];
That is, X(1) = valley(j).
X(2) is the lowest point in [X(1)+1, j–L];
X(3) is the lowest point in [X(2)+1, j–L];
…
j-U
j-L
D[j] = X
1
2
3
4
Spec. #1
 valley(j)
can be obtained from D(j) in O(1)
time?
 Yes! Just read the first number in D(j).
Spec. #2
 D(j+1)
can be updated from D(j) in
amortized O(1) time?
 Yes! Two steps:
(1) If valley(j) is not in F(j+1), we just delete the
first number from the list.
(2) Scan the list of indices in D(j) from right to
left to see where j-U+1 fits in.
time complexity
The overall time complexity is
O(n), although some iterations
may take more than constant time.
On-line processing
 In
the j-th iteration, when S[j] is received,
prefix-sum(j) can be computed on the fly
valley(j) can be computed “interatively”,
 So,
our algorithm can process the input
sequence in an online manner.
the required working space is O(U – L + 1).
The End