* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Efficient Algorithms for Locating Maximum Average Substrings
Survey
Document related concepts
Transcript
Efficient Algorithms
for Locating Maximum Average
Consecutive Substrings
Jie Zheng
Department of Computer Science
UC, Riverside
Outline
•
•
•
•
Problem Definition
Applications to Molecular Biology
Two Existing Algorithms
Open Problems
Definition of Problem
• Given a sequence of real numbers,
A = <a1, a2, …, an>, and a positive integer
L ≤ n, the goal is to find a consecutive substring
of A of length at least L such that the average
of the numbers in the subsequence is
maximized.
Applications in Biology
•
•
•
•
Locating GC-rich Regions
Post-Processing Sequence Alignments
Annotating Multiple Sequence Alignments
Computing Ungapped Local Alignments with
Length Constraints
Two Existing Algorithms
• An O(nlogL)-time Algorithm
(Yaw Ling Lin, Tao Jiang, Kun-Mao Chao, 2001)
• A Linear Time Algorithm for Binary
Strings
(Hsueh-I Lu, 2002)
The O(nlogL)-time Algorithm
(Yaw-Ling Lin, Tao Jiang, Kun-Mao Chao,2001)
Basic Scheme:
• Finding good partner of each element,
i.e. for element ai, locate aj, such that the
segment <ai,…, aj> has maximum average
among all substrings starting from ai.
• Choose the <ai,…, aj> with the maximum
average among the n candidates.
Important Concepts
• Right-Skew Sequence
A sequence A = <a1, a2, …, an> is right-skew
if and only if the average of any prefix
< a1, a2, …, aj> is always less than or equal
to the average of the remaining suffix
subsequence <aj+1, aj+2,…, an>.
Important Concept
• Decreasingly Right-Skew Partition
A partition A=A1A2…Ak is decreasingly rightskew if each segment Ai of the partition is
right-skew and μ(Ai) > μ(Aj) for any i < j.
Big Picture of Right-Skew Partition
A
B
C
D
Intuition:
1. If A is chosen, B must also be
2. If C is not chosen, D can not be, either.
Lemma 7(Huang):
The Maximum Average Substring
Can not be longer than 2L-1
• Proof If C is the maximum average substring
with length ≥2L, let C= AB, where |A|≥L and
|B|≥L, then the average of A or B is no less than
that of C.
Say μ(A) > μ(B), then μ(A) > μ(AB)
Main Idea of the O(nlogL)Algorithm
1. Compute the decreasingly right-skew
partition in O(n) time.
2. Finding the good partner for each index
costs O(logL) time.
Compute the decreasingly rightskew partition
1. Lemma 5: Every real sequence
A=<a1,a2,…,an> has a unique decreasingly
right-skew partition.
2. Lemma 6: All right-skew pointers for a
length n sequence can be computed in O(n)
amortized time.
Compute the right-skew pointers
4
9
4
30
9
4
4
30
9
9
30
30
5
5
5
5
Find good partner in O(logL)
• Lemma 9(Bitonic): Let P be a real sequence,
and A1A2…Am the decreasingly right-skew
partition of a sequence A. Suppose that
μ(PA1…Ak) = max{μ(PA1…Ai)|0≤i≤m}
Then μ(PA1…Ai) > μ(Ai+1) if and only if i≥k.
What does Lemma 9 tell us?
•
Locating good partner can be done with
binary search!
To find k so that
μ(PA1…Ak) = max{μ(PA1…Ai)|0 ≤ i ≤ m}
We guess i and make it closer to k:
1. μ(PA1…Ai) >μ(Ai+1) implies i ≥ k
2. μ(PA1…Ai) ≤μ(Ai+1) implies i < k
Big Picture of Locating Good Partners
L
L
L
1
2
2
1
3
1
Date Structure for Binary Search
logL Pointer-Jumping Tables
•
j (k) denotes the right end-point of the kth rightskew segment .
• p (0)[i] = p[i], where p[i] is right-skew pointer for i,
p (k+1) [i] = min{p (k) [p (k) [i]+1], n}. 1 k
logL
•
The precomputation of the jumping tables takes at
most O(nlogL) time.
The Time Complexity
•
•
•
Totally n phases
Each phase costs O(logL)
Overall: O(nlogL)-time
Crying Out
for A Linear Time
Algorithm!!
A Linear-Time Algorithm for Binary Strings
(Hsueh-I Lu, 2002)
•
•
Build upon the previous algorithm
Improvements:
Considering an upper bound on the number of
right-skew segments
- Working simultaneously on the right-skew
partitions of forward and reverse strings
- Utilizing Properties of Binary Strings
-
Basic Scheme
Let B = log3n and b = (loglogn)3
1. Choose O(n/ logn) indices i of S such that if g(i)-i
B holds for any of such i, then g(i) can be found in
O(logn) time.
2. Choose O(n/ loglogn) indices i of S such that if B
g(i) – i b holds for any of such i, then g(i) can be
found in O(loglogn) time.
3. Find g(i) for all indices i such that g(i) – i b.
Denotations
• A right-skew decompostion of any substring S [p, q]
is a nonempty set of i indices i1,i2,…, il so that
S[i1,i2], S[i2, i3],…, S[il-1, il] are decreasingly rightskew partition of S.
• Let DS (i, j) denote the right-skew decomposition of
S[i, j]
• If P = {p1, p2 ,…, pk, pk+1}, where p1< p2<…<pk+1,
then
k
DS ( P) DS ( pi , pi 1 )
i 1
An Intuitive Observation
• Right-skew pointers cannot cross
A
B
C
1. By definition of right-skew segment: μ(A) μ(B) μ(C)
Thus μ(A+B) μ(C).
2. By definition of decreasingly right-skew partition:
μ(A+B) > μ(C). Contradiction.
The Big Picture of Right-Skew
Decomposition
Lemma 3: from the big picture
• If j DS (P), then DS (j, n) DS (P).
Lemma 3 tells us that if j belongs to the rightskew decomposition of some set of indices,
then its good partner will also be. Thus, we
only need to search for its good partner among
a limited number of indices.
Lemma 4: |DS(i, j)| = O((j - i)2/3)
(It holds for binary strings only)
• Define: A right-skew substring determined by DS(i, j)
is the undividable right-skew segment.
• A right-skew substring S[p, q] is long (short) if q - p
l1/3 (q - p < l1/3)
• Prove lemma 4 by showing that the number of long
and short right-skew substrings for a binary string is
O((j - i)2/3).
Phase 1: g(i) - i B; gR(j) - j
B
Define:
Pshort = {p | p mod B 0 and 0 p < n} {n}
• We have |DS (Pshort)| = O (n/logn)
• In this phase, we take care of index i such that
i and g(i), i.e. good partner of i, are both in
DS (Pshort) DR (Pshort)
Phase 2: L + b < g(i) – i L+B
L + b < gR(j) – j L+B
Define:
Ptiny = {p | p mod b 0 and 0 p < n} {n}
• We have |DS (Ptiny)| = O (n/loglogn)
• In this phase, we take care of index i such that
i and g(i), i.e. good partner of i, are both in
DS (Ptiny) DR (Ptiny)
Phase 3: g(i)-i L+b, gR(j)-j L+b
• We set up a table M whose (x,y) entry contains the
index z, such that:
If C is a binary string of L+b bits, x is the number of ‘1’ in the
first L bits of C; y is the binary string consisting of the last b
bits of C; z is the good partner of index 0 in C.
Because b is relatively small, the number of possible value for
x and y is linear
• Looking up the table M, we can cope with the leftover case in O(n)-time.
Open Problems:
How to extend the linear time
algorithm for binary strings to
arbitrary strings.
INTERESTED?
Contact:
Jie Zheng
Department of Computer Science
Surge Building # 350
UC, Riverside
E-mail: [email protected]